- Linux-kselftest-mirror - lists.linaro.org

[PATCH net] selftests: net: increase inter-packet timeout in udpgro.sh

by Paolo Abeni

The mentioned test is not very stable when running on top of debug kernel build. Increase the inter-packet timeout to allow more slack in such environments. Fixes: 3327a9c46352 ("selftests: add functionals test for UDP GRO") Signed-off-by: Paolo Abeni <pabeni(a)redhat.com> --- tools/testing/selftests/net/udpgro.sh | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/net/udpgro.sh b/tools/testing/selftests/net/udpgro.sh index 1dc337c709f8..b17e032a6d75 100755 --- a/tools/testing/selftests/net/udpgro.sh +++ b/tools/testing/selftests/net/udpgro.sh @@ -48,7 +48,7 @@ run_one() { cfg_veth - ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} & + ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 100 ${rx_args} & local PID1=$! wait_local_port_listen ${PEER_NS} 8000 udp @@ -95,7 +95,7 @@ run_one_nat() { # will land on the 'plain' one ip netns exec "${PEER_NS}" ./udpgso_bench_rx -G ${family} -b ${addr1} -n 0 & local PID1=$! - ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${family} -b ${addr2%/*} ${rx_args} & + ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 100 ${family} -b ${addr2%/*} ${rx_args} & local PID2=$! wait_local_port_listen "${PEER_NS}" 8000 udp @@ -117,9 +117,9 @@ run_one_2sock() { cfg_veth - ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} -p 12345 & + ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 100 ${rx_args} -p 12345 & local PID1=$! - ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 2000 -R 10 ${rx_args} & + ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 2000 -R 100 ${rx_args} & local PID2=$! wait_local_port_listen "${PEER_NS}" 12345 udp -- 2.50.0

18 minutes

2
1
0 0

[PATCH 00/10] rust: use `core::ffi::CStr` method names

by Tamir Duberstein

This is series 2b/5 of the migration to `core::ffi::CStr`[0]. 20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com. This series depends on the prior series[0] and is intended to go through the rust tree to reduce the number of release cycles required to complete the work. Subsystem maintainers: I would appreciate your `Acked-by`s so that this can be taken through Miguel's tree (where the other series must go). [0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm… Signed-off-by: Tamir Duberstein <tamird(a)gmail.com> --- Tamir Duberstein (10): gpu: nova-core: use `core::ffi::CStr` method names rust: auxiliary: use `core::ffi::CStr` method names rust: configfs: use `core::ffi::CStr` method names rust: cpufreq: use `core::ffi::CStr` method names rust: drm: use `core::ffi::CStr` method names rust: firmware: use `core::ffi::CStr` method names rust: kunit: use `core::ffi::CStr` method names rust: miscdevice: use `core::ffi::CStr` method names rust: net: use `core::ffi::CStr` method names rust: of: use `core::ffi::CStr` method names drivers/gpu/drm/drm_panic_qr.rs | 2 +- rust/kernel/auxiliary.rs | 4 ++-- rust/kernel/configfs.rs | 4 ++-- rust/kernel/cpufreq.rs | 2 +- rust/kernel/drm/device.rs | 4 ++-- rust/kernel/firmware.rs | 2 +- rust/kernel/kunit.rs | 6 +++--- rust/kernel/miscdevice.rs | 2 +- rust/kernel/net/phy.rs | 2 +- rust/kernel/of.rs | 2 +- samples/rust/rust_configfs.rs | 2 +- 11 files changed, 16 insertions(+), 16 deletions(-) --- base-commit: 769e324b66b0d92d04f315d0c45a0f72737c7494 change-id: 20250709-core-cstr-fanout-1-f20611832272 prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1 prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573 prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27 prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8 Best regards, -- Tamir Duberstein <tamird(a)gmail.com>

2 hours, 33 minutes

4
19
0 0

[PATCH net 0/2] selftests: mptcp: connect: cover alt modes

by Matthieu Baerts (NGI0)

mptcp_connect.sh can be executed manually with "-m <MODE>" and "-C" to make sure everything works as expected when using "mmap" and "sendfile" modes instead of "poll", and with the MPTCP checksum support. These modes should be validated, but they are not when the selftests are executed via the kselftest helpers. It means that most CIs validating these selftests, like NIPA for the net development trees and LKFT for the stable ones, are not covering these modes. To fix that, new test programs have been added, simply calling mptcp_connect.sh with the right parameters. The first patch can be backported up to v5.6, and the second one up to v5.14. Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org> --- Matthieu Baerts (NGI0) (2): selftests: mptcp: connect: also cover alt modes selftests: mptcp: connect: also cover checksum tools/testing/selftests/net/mptcp/Makefile | 3 ++- tools/testing/selftests/net/mptcp/mptcp_connect_checksum.sh | 4 ++++ tools/testing/selftests/net/mptcp/mptcp_connect_mmap.sh | 4 ++++ tools/testing/selftests/net/mptcp/mptcp_connect_sendfile.sh | 4 ++++ 4 files changed, 14 insertions(+), 1 deletion(-) --- base-commit: b640daa2822a39ff76e70200cb2b7b892b896dce change-id: 20250714-net-mptcp-sft-connect-alt-c1aaf073ef4e Best regards, -- Matthieu Baerts (NGI0) <matttbe(a)kernel.org>

2 hours, 58 minutes

1
2
0 0

[PATCH v12 net-next 00/15] AccECN protocol patch series

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com> Hello, Please find the v12 AccECN protocol patch series, which covers the core functionality of Accurate ECN, AccECN negotiation, AccECN TCP options, and AccECN failure handling. The Accurate ECN draft can be found in https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28 This patch series is part of the full AccECN patch series, which is available at https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/ Best Regards, Chia-Yu --- v12 (04-Jul-2025) - Fix compilation issues with some intermediate patches in v11 - Add more comments for AccECN helpers of tcp_ecn.h v11 (03-Jul-2025) - Fix compilation issues with some intermediate patches in v10 v10 (02-Jul-2025) - Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>) - Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>) - Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>) - Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>) - Add READ_ONCE() over every reads of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>) - Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch - Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>) v9 (21-Jun-2025) - Use tcp_data_ecn_check() to set TCP_ECN_SEE flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>) - Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accruate ECN (Paolo Abeni <pabeni(a)redhat.com>) - Restruct the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>) - Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>) - Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>) - Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>) v8 (10-Jun-2025) - Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>) - Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>) - Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>) - Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>) - Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>) - Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>) v7 (14-May-2025) - Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>) - Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>) - Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>) - Update commit message for #9 to explain the increase in tcp_sock_write_rx group size - Modify group size of tcp_sock_write_tx in #10 based on pahole results v6 (09-May-2025) - Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>) - Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>) - Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>) - Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>) - Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>) - Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>) - Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>) - Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>) - Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>) - Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>) - Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>) - Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>) - Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15 v5 (22-Apr-2025) - Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>) v4 (18-Apr-2025) - Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>) v3 (14-Apr-2025) - Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>) v2 (18-Mar-2025) - Add one missing patch from the previous AccECN protocol preparation patch series to this patch series. --- Chia-Yu Chang (6): tcp: reorganize tcp_sock_write_txrx group for variables later tcp: ecn functions in separated include file tcp: Add wait_third_ack for ECN negotiation in simultaneous connect tcp: accecn: AccECN option send control tcp: accecn: AccECN option failure handling tcp: accecn: try to fit AccECN option with SACK Ilpo Järvinen (9): tcp: reorganize SYN ECN code tcp: fast path functions later tcp: AccECN core tcp: accecn: AccECN negotiation tcp: accecn: add AccECN rx byte counters tcp: accecn: AccECN needs to know delivered bytes tcp: sack option handling improvements tcp: accecn: AccECN option tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics Documentation/networking/ip-sysctl.rst | 55 +- .../networking/net_cachelines/tcp_sock.rst | 13 + include/linux/tcp.h | 33 +- include/net/netns/ipv4.h | 2 + include/net/tcp.h | 87 ++- include/net/tcp_ecn.h | 663 ++++++++++++++++++ include/uapi/linux/tcp.h | 7 + net/ipv4/syncookies.c | 4 + net/ipv4/sysctl_net_ipv4.c | 19 + net/ipv4/tcp.c | 29 +- net/ipv4/tcp_input.c | 371 ++++++++-- net/ipv4/tcp_ipv4.c | 8 +- net/ipv4/tcp_minisocks.c | 40 +- net/ipv4/tcp_output.c | 297 ++++++-- net/ipv6/syncookies.c | 2 + net/ipv6/tcp_ipv6.c | 1 + 16 files changed, 1444 insertions(+), 187 deletions(-) create mode 100644 include/net/tcp_ecn.h -- 2.34.1

3 hours, 38 minutes

2
25
0 0

[RFC Patch 0/2] selftests/mm: assert rmap behave as expected

by Wei Yang

As David suggested, currently we don't have a high level test case to verify the behavior of rmap. This patch set introduce the verification on rmap by migration. Patch 1 is a preparation to move ksm related operation into vm_util. Patch 2 is the new test case. Currently it covers following four scenarios: * anonymous page * shmem page * pagecache page * ksm page Wei Yang (2): selftests/mm: put general ksm operation into vm_util selftests/mm: assert rmap behave as expected MAINTAINERS | 1 + tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 3 + .../selftests/mm/ksm_functional_tests.c | 76 +-- tools/testing/selftests/mm/rmap.c | 466 ++++++++++++++++++ tools/testing/selftests/mm/run_vmtests.sh | 4 + tools/testing/selftests/mm/vm_util.c | 71 +++ tools/testing/selftests/mm/vm_util.h | 7 + 8 files changed, 563 insertions(+), 66 deletions(-) create mode 100644 tools/testing/selftests/mm/rmap.c -- 2.34.1

4 hours, 4 minutes

2
8
0 0

[PATCH] selftests/mm: refactor common code and improve test skipping in guard_region

by wang lian

Move the generic `FORCE_READ` macro from `guard-regions.c` to the shared `vm_util.h` header to promote code reuse. In `guard-regions.c`, replace `ksft_exit_skip()` with the `SKIP()` macro to ensure only the current test is skipped on permission failure, instead of terminating the entire test binary. Signed-off-by: wang lian <lianux.mm(a)gmail.com> --- tools/testing/selftests/mm/guard-regions.c | 9 +-------- tools/testing/selftests/mm/vm_util.h | 7 +++++++ 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index 93af3d3760f9..b0d42eb04e3a 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -35,13 +35,6 @@ static volatile sig_atomic_t signal_jump_set; static sigjmp_buf signal_jmp_buf; -/* - * Ignore the checkpatch warning, we must read from x but don't want to do - * anything with it in order to trigger a read page fault. We therefore must use - * volatile to stop the compiler from optimising this away. - */ -#define FORCE_READ(x) (*(volatile typeof(x) *)x) - /* * How is the test backing the mapping being tested? */ @@ -582,7 +575,7 @@ TEST_F(guard_regions, process_madvise) /* OK we don't have permission to do this, skip. */ if (count == -1 && errno == EPERM) - ksft_exit_skip("No process_madvise() permissions, try running as root.\n"); + SKIP(return, "No process_madvise() permissions, try running as root.\n"); /* Returns the number of bytes advised. */ ASSERT_EQ(count, 6 * page_size); diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 2b154c287591..c20298ae98ea 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -18,6 +18,13 @@ #define PM_SWAP BIT_ULL(62) #define PM_PRESENT BIT_ULL(63) +/* + * Ignore the checkpatch warning, we must read from x but don't want to do + * anything with it in order to trigger a read page fault. We therefore must use + * volatile to stop the compiler from optimising this away. + */ +#define FORCE_READ(x) (*(volatile typeof(x) *)x) + extern unsigned int __page_size; extern unsigned int __page_shift; -- 2.43.0

4 hours, 29 minutes

4
4
0 0

[PATCH 0/2] selftests/fchmodat2: Error handling and general cleanups

by Mark Brown

I looked at the fchmodat2() tests since I've been experiencing some random intermittent segfaults with them in my test systems, while doing so I noticed these two issues. Unfortunately I didn't figure out the original yet, unless I managed to fix it unwittingly. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Mark Brown (2): selftests/fchmodat2: Clean up temporary files and directories selftests/fchmodat2: Use ksft_finished() tools/testing/selftests/fchmodat2/fchmodat2_test.c | 166 ++++++++++++++------- 1 file changed, 112 insertions(+), 54 deletions(-) --- base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a change-id: 20250711-selftests-fchmodat2-c30374c376f8 Best regards, -- Mark Brown <broonie(a)kernel.org>

4 hours, 47 minutes

1
2
0 0

[PATCH v2 0/4] signal handling support for nolibc

by Benjamin Berg

From: Benjamin Berg <benjamin.berg(a)intel.com> Hi, This patchset adds signal handling to nolibc. Initially, I would like to use this for tests. But in the long run, the goal is to use nolibc for the UML kernel itself. In both cases, signal handling will be needed. v2 contains some bugfixes and has a better test coverage. Also addressed are various review comments. Benjamin Benjamin Berg (4): selftests/nolibc: fix EXPECT_NZ macro selftests/nolibc: validate order of constructor calls tools/nolibc: add more generic bitmask macros for FD_* tools/nolibc: add signal support tools/include/nolibc/arch-arm.h | 7 + tools/include/nolibc/arch-arm64.h | 3 + tools/include/nolibc/arch-loongarch.h | 3 + tools/include/nolibc/arch-m68k.h | 10 ++ tools/include/nolibc/arch-mips.h | 3 + tools/include/nolibc/arch-powerpc.h | 8 + tools/include/nolibc/arch-riscv.h | 3 + tools/include/nolibc/arch-s390.h | 8 +- tools/include/nolibc/arch-sh.h | 5 + tools/include/nolibc/arch-sparc.h | 47 ++++++ tools/include/nolibc/arch-x86.h | 13 ++ tools/include/nolibc/signal.h | 103 ++++++++++++ tools/include/nolibc/sys.h | 2 +- tools/include/nolibc/time.h | 3 +- tools/include/nolibc/types.h | 81 ++++++---- .../selftests/nolibc/nolibc-test-linkage.c | 17 +- tools/testing/selftests/nolibc/nolibc-test.c | 150 +++++++++++++++++- 17 files changed, 422 insertions(+), 44 deletions(-) -- 2.50.0

5 hours, 16 minutes

3
12
0 0

[PATCH v5] selftests/mm: add process_madvise() tests

by wang lian

Add tests for process_madvise(), focusing on verifying behavior under various conditions including valid usage and error cases. Signed-off-by: wang lian <lianux.mm(a)gmail.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> Suggested-by: David Hildenbrand <david(a)redhat.com> Suggested-by: Zi Yan <ziy(a)nvidia.com> Suggested-by: Mark Brown <broonie(a)kernel.org> Acked-by: SeongJae Park <sj(a)kernel.org> --- Changelog v5: - Refactor the remote_collapse test to concentrate on its primary goal confirming the successful remote invocation of process_madvise() on a child process. - Split the validation logic for invalid pidfds out of the remote test and into two new (`exited_process_pidfd` and `bad_pidfd`). - Based mm-new branch, can ensure clean application Changelog v4: https://lore.kernel.org/lkml/20250710112249.58722-1-lianux.mm@gmail.com/ - Refine resource cleanup logic in test teardown to be more robust. - Improve remote_collapse test to correctly handle different THP (Transparent Huge Page) policies ('always', 'madvise', 'never'), including handling race conditions with khugepaged. - Resolve build errors Changelog v3: https://lore.kernel.org/lkml/20250703044326.65061-1-lianux.mm@gmail.com/ - Rebased onto the latest mm-stable branch to ensure clean application. - Refactor common signal handling logic into vm_util to reduce code duplication. - Improve test robustness and diagnostics based on community feedback. - Address minor code style and script corrections. Changelog v2: https://lore.kernel.org/lkml/20250630140957.4000-1-lianux.mm@gmail.com/ - Drop MADV_DONTNEED tests based on feedback. - Focus solely on process_madvise() syscall. - Improve error handling and structure. - Add future-proof flag test. - Style and comment cleanups. -V1: https://lore.kernel.org/lkml/20250621133003.4733-1-lianux.mm@gmail.com/ tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/process_madv.c | 304 ++++++++++++++++++++++ tools/testing/selftests/mm/run_vmtests.sh | 5 + 4 files changed, 311 insertions(+) create mode 100644 tools/testing/selftests/mm/process_madv.c diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index f2dafa0b700b..e7b23a8a05fe 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -21,6 +21,7 @@ on-fault-limit transhuge-stress pagemap_ioctl pfnmap +process_madv *.tmp* protection_keys protection_keys_32 diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index ae6f994d3add..d13b3cef2a2b 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -85,6 +85,7 @@ TEST_GEN_FILES += mseal_test TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += pagemap_ioctl TEST_GEN_FILES += pfnmap +TEST_GEN_FILES += process_madv TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += uffd-stress diff --git a/tools/testing/selftests/mm/process_madv.c b/tools/testing/selftests/mm/process_madv.c new file mode 100644 index 000000000000..249e2ed8dfe9 --- /dev/null +++ b/tools/testing/selftests/mm/process_madv.c @@ -0,0 +1,304 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <errno.h> +#include <setjmp.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <linux/mman.h> +#include <sys/syscall.h> +#include <unistd.h> +#include <sched.h> +#include "vm_util.h" + +#include "../pidfd/pidfd.h" + +FIXTURE(process_madvise) +{ + unsigned long page_size; + pid_t child_pid; + int pidfd; +}; + +FIXTURE_SETUP(process_madvise) +{ + self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); + self->pidfd = PIDFD_SELF; + self->child_pid = -1; +}; + +FIXTURE_TEARDOWN_PARENT(process_madvise) +{ + if (self->child_pid > 0) { + kill(self->child_pid, SIGKILL); + waitpid(self->child_pid, NULL, 0); + } +} + +static ssize_t sys_process_madvise(int pidfd, const struct iovec *iovec, + size_t vlen, int advice, unsigned int flags) +{ + return syscall(__NR_process_madvise, pidfd, iovec, vlen, advice, flags); +} + +/* + * This test uses PIDFD_SELF to target the current process. The main + * goal is to verify the basic behavior of process_madvise() with + * a vector of non-contiguous memory ranges, not its cross-process + * capabilities. + */ +TEST_F(process_madvise, basic) +{ + const unsigned long pagesize = self->page_size; + const int madvise_pages = 4; + struct iovec vec[madvise_pages]; + int pidfd = self->pidfd; + ssize_t ret; + char *map; + + /* + * Create a single large mapping. We will pick pages from this + * mapping to advise on. This ensures we test non-contiguous iovecs. + */ + map = mmap(NULL, pagesize * 10, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (map == MAP_FAILED) + SKIP(return, "mmap failed, not enough memory.\n"); + + /* Fill the entire region with a known pattern. */ + memset(map, 'A', pagesize * 10); + + /* + * Setup the iovec to point to 4 non-contiguous pages + * within the mapping. + */ + vec[0].iov_base = &map[0 * pagesize]; + vec[0].iov_len = pagesize; + vec[1].iov_base = &map[3 * pagesize]; + vec[1].iov_len = pagesize; + vec[2].iov_base = &map[5 * pagesize]; + vec[2].iov_len = pagesize; + vec[3].iov_base = &map[8 * pagesize]; + vec[3].iov_len = pagesize; + + ret = sys_process_madvise(pidfd, vec, madvise_pages, MADV_DONTNEED, 0); + if (ret == -1 && errno == EPERM) + SKIP(return, + "process_madvise() unsupported or permission denied, try running as root.\n"); + else if (errno == EINVAL) + SKIP(return, + "process_madvise() unsupported or parameter invalid, please check arguments.\n"); + + /* The call should succeed and report the total bytes processed. */ + ASSERT_EQ(ret, madvise_pages * pagesize); + + /* Check that advised pages are now zero. */ + for (int i = 0; i < madvise_pages; i++) { + char *advised_page = (char *)vec[i].iov_base; + + /* Content must be 0, not 'A'. */ + ASSERT_EQ(*advised_page, '\0'); + } + + /* Check that an un-advised page in between is still 'A'. */ + char *unadvised_page = &map[1 * pagesize]; + + for (int i = 0; i < pagesize; i++) + ASSERT_EQ(unadvised_page[i], 'A'); + + /* Cleanup. */ + ASSERT_EQ(munmap(map, pagesize * 10), 0); +} + +/* + * This test deterministically validates process_madvise() with MADV_COLLAPSE + * on a remote process, other advices are difficult to verify reliably. + * + * The test verifies that a memory region in a child process, + * focus on process_madv remote result, only check addresses and lengths. + * The correctness of the MADV_COLLAPSE can be found in the relevant test examples in khugepaged. + */ +TEST_F(process_madvise, remote_collapse) +{ + const unsigned long pagesize = self->page_size; + int pidfd; + long huge_page_size; + int pipe_info[2]; + ssize_t ret; + struct iovec vec; + + struct child_info { + pid_t pid; + void *map_addr; + } info; + + huge_page_size = default_huge_page_size(); + if (huge_page_size <= 0) + SKIP(return, "Could not determine a valid huge page size.\n"); + + ASSERT_EQ(pipe(pipe_info), 0); + + self->child_pid = fork(); + ASSERT_NE(self->child_pid, -1); + + if (self->child_pid == 0) { + char *map; + size_t map_size = 2 * huge_page_size; + + close(pipe_info[0]); + + map = mmap(NULL, map_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + ASSERT_NE(map, MAP_FAILED); + + /* Fault in as small pages */ + for (size_t i = 0; i < map_size; i += pagesize) + map[i] = 'A'; + + /* Send info and pause */ + info.pid = getpid(); + info.map_addr = map; + ret = write(pipe_info[1], &info, sizeof(info)); + ASSERT_EQ(ret, sizeof(info)); + close(pipe_info[1]); + + pause(); + exit(0); + } + + close(pipe_info[1]); + + /* Receive child info */ + ret = read(pipe_info[0], &info, sizeof(info)); + if (ret <= 0) { + waitpid(self->child_pid, NULL, 0); + SKIP(return, "Failed to read child info from pipe.\n"); + } + ASSERT_EQ(ret, sizeof(info)); + close(pipe_info[0]); + self->child_pid = info.pid; + + pidfd = syscall(__NR_pidfd_open, self->child_pid, 0); + ASSERT_GE(pidfd, 0); + + vec.iov_base = info.map_addr; + vec.iov_len = huge_page_size; + + ret = sys_process_madvise(pidfd, &vec, 1, MADV_COLLAPSE, 0); + if (ret == -1) { + if (errno == EINVAL) + SKIP(return, "PROCESS_MADV_ADVISE is not supported.\n"); + else if (errno == EPERM) + SKIP(return, + "No process_madvise() permissions, try running as root.\n"); + goto cleanup; + } + + ASSERT_EQ(ret, huge_page_size); + +cleanup: + /* Cleanup */ + kill(self->child_pid, SIGKILL); + waitpid(self->child_pid, NULL, 0); + if (pidfd >= 0) + close(pidfd); +} + +/* + * Test process_madvise() with a pidfd for a process that has already + * exited to ensure correct error handling. + */ +TEST_F(process_madvise, exited_process_pidfd) +{ + struct iovec vec; + ssize_t ret; + int pidfd; + + vec.iov_base = (void *)0x1234; + vec.iov_len = 4096; + + /* + * Using a pidfd for a process that has already exited should fail + * with ESRCH. + */ + self->child_pid = fork(); + ASSERT_NE(self->child_pid, -1); + + if (self->child_pid == 0) + exit(0); + + pidfd = syscall(__NR_pidfd_open, self->child_pid, 0); + ASSERT_GE(pidfd, 0); + + /* Wait for the child to ensure it has terminated. */ + waitpid(self->child_pid, NULL, 0); + + ret = sys_process_madvise(pidfd, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, ESRCH); + close(pidfd); +} + +/* + * Test process_madvise() with bad pidfds to ensure correct error + * handling. + */ +TEST_F(process_madvise, bad_pidfd) +{ + struct iovec vec; + ssize_t ret; + + vec.iov_base = (void *)0x1234; + vec.iov_len = 4096; + + /* Using an invalid fd number (-1) should fail with EBADF. */ + ret = sys_process_madvise(-1, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EBADF); + + /* + * Using a valid fd that is not a pidfd (e.g. stdin) should fail + * with EBADF. + */ + ret = sys_process_madvise(STDIN_FILENO, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EBADF); +} + +/* + * Test process_madvise() with an invalid flag value. Currently, only a flag + * value of 0 is supported. This test is reserved for the future, e.g., if + * synchronous flags are added. + */ +TEST_F(process_madvise, flag) +{ + const unsigned long pagesize = self->page_size; + unsigned int invalid_flag; + int pidfd = self->pidfd; + struct iovec vec; + char *map; + ssize_t ret; + + map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, + 0); + if (map == MAP_FAILED) + SKIP(return, "mmap failed, not enough memory.\n"); + + vec.iov_base = map; + vec.iov_len = pagesize; + + invalid_flag = 0x80000000; + + ret = sys_process_madvise(pidfd, &vec, 1, MADV_DONTNEED, invalid_flag); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EINVAL); + + /* Cleanup. */ + ASSERT_EQ(munmap(map, pagesize), 0); +} + +TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index a38c984103ce..471e539d82b8 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -65,6 +65,8 @@ separated by spaces: test pagemap_scan IOCTL - pfnmap tests for VM_PFNMAP handling +- process_madv + test for process_madv - cow test copy-on-write semantics - thp @@ -425,6 +427,9 @@ CATEGORY="madv_guard" run_test ./guard-regions # MADV_POPULATE_READ and MADV_POPULATE_WRITE tests CATEGORY="madv_populate" run_test ./madv_populate +# PROCESS_MADV test +CATEGORY="process_madv" run_test ./process_madv + CATEGORY="vma_merge" run_test ./merge if [ -x ./memfd_secret ] -- 2.43.0

5 hours, 21 minutes

2
1
0 0

[PATCH bpf-next v2 0/3] bpf: Show precise rejected function when attaching to __noreturn and deny list functions

by KaFai Wan

Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions. Add log for attaching tracing programs to functions in deny list. Add selftest for attaching tracing programs to functions in deny list. changes: v2: - change verifier log message (Alexei) - add missing Suggested-by v1: https://lore.kernel.org/all/20250710162717.3808020-1-mannkafai@gmail.com/ --- KaFai Wan (3): bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions bpf: Add log for attaching tracing programs to functions in deny list selftests/bpf: Add selftest for attaching tracing programs to functions in deny list kernel/bpf/verifier.c | 5 ++++- .../selftests/bpf/prog_tests/tracing_deny.c | 11 +++++++++++ .../testing/selftests/bpf/progs/fexit_noreturns.c | 2 +- tools/testing/selftests/bpf/progs/tracing_deny.c | 15 +++++++++++++++ 4 files changed, 31 insertions(+), 2 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/tracing_deny.c create mode 100644 tools/testing/selftests/bpf/progs/tracing_deny.c -- 2.43.0

6 hours, 46 minutes

1
3
0 0

[PATCH v9 00/29] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE)

by Nicolin Chen

The vIOMMU object is designed to represent a slice of an IOMMU HW for its virtualization features shared with or passed to user space (a VM mostly) in a way of HW acceleration. This extended the HWPT-based design for more advanced virtualization feature. HW QUEUE introduced by this series as a part of the vIOMMU infrastructure represents a HW accelerated queue/buffer for VM to use exclusively, e.g. - NVIDIA's Virtual Command Queue - AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer each of which allows its IOMMU HW to directly access a queue memory owned by a guest VM and allows a guest OS to control the HW queue direclty, to avoid VM Exit overheads to improve the performance. Introduce IOMMUFD_OBJ_HW_QUEUE and its pairing IOMMUFD_CMD_HW_QUEUE_ALLOC allowing VMM to forward the IOMMU-specific queue info, such as queue base address, size, and etc. Meanwhile, a guest-owned queue needs the guest kernel to control the queue by reading/writing its consumer and producer indexes, via MMIO acceses to the hardware MMIO registers. Introduce an mmap infrastructure for iommufd to support passing through a piece of MMIO region from the host physical address space to the guest physical address space. The mmap info (offset/ length) used by an mmap syscall must be pre-allocated and returned to the user space via an output driver-data during an IOMMUFD_CMD_HW_QUEUE_ALLOC call. Thus, it requires a driver-specific user data support in the vIOMMU allocation flow. As a real-world use case, this series implements a HW QUEUE support in the tegra241-cmdqv driver for VCMDQs on NVIDIA Grace CPU. In another word, it is also the Tegra CMDQV series Part-2 (user-space support), reworked from Previous RFCv1: https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/ This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to the standard SMMUv3 operating in the nested translation mode trapping CMDQ for TLBI and ATC_INV commands, this gives a huge performance improvement: 70% to 90% reductions of invalidation time were measured by various DMA unmap tests running in a guest OS. // Unmap latencies from "dma_map_benchmark -g @granule -t @threads", // by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq" @granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0 4KB 1 35.7 us 5.3 us 16KB 1 41.8 us 6.8 us 64KB 1 68.9 us 9.9 us 128KB 1 109.0 us 12.6 us 256KB 1 187.1 us 18.0 us 4KB 2 96.9 us 6.8 us 16KB 2 97.8 us 7.5 us 64KB 2 151.5 us 10.7 us 128KB 2 257.8 us 12.7 us 256KB 2 443.0 us 17.9 us This is on Github: https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v9 Paring QEMU branch for testing (reusing v8): https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v8 Changelog v9 (attached git-diff v8..v9 at the end of this letter) * Add Reviewed-by from Vasant and Jason * [iommufd] Fix offset calculation * [iommufd] Add unaligned iova/length selftest coverage for hw_queue * [iommufd] Pass in aligned iova/length to iommufd_access_pin_pages() * [smmu] Change "u32 *type" at arm_smmu_hw_info() in the header v8 https://lore.kernel.org/all/cover.1751677708.git.nicolinc@nvidia.com/ * Add Reviewed-by from Pranj, Kevin and Jason * Improve kdoc and comments * [iommufd] Skip selftest for no_viommu variants * [iommufd] Add unmap coverage for non internal area * [iommufd] Skip the first page when mtree_alloc_range() * [iommufd] Correct the passed in index to mtree_erase() * [iommufd] Correct variable types in iommufd_hw_queue_alloc_phys() * [iommufd] Reject iopt_unmap_iova_range() if area->num_locks is set * [tegra] Rename "SID replacement" with "SID mapping" * [tegra] Unwrap useless _tegra241_vcmdq_hw_init helper v7 https://lore.kernel.org/all/cover.1750966133.git.nicolinc@nvidia.com/ * Rebased on Jason's for-next tree (iommufd_hw_queue-prep series) * Add Reviewed-by from Baolu, Jason, Pranjal * Update kdocs and notes * [iommu] Replace "u32" with "enum iommu_hw_info_type" * [iommufd] Rename vdev->id to vdev->virt_id * [iommufd] Replace macros with inline helpers * [iommufd] Report unmapped_bytes in error path * [iommufd] Add iommufd_access_is_internal helper * [iommufd] Do not drop ops->unmap check for mdevs * [iommufd] Store physical addresses in immap structure * [iommufd] Reorder access and hw_queue object allocations * [iommufd] Scan for an internal access before any unmap call * [iommufd] Drop unused ictx pointer in struct iommufd_hw_queue * [iommufd] Use kcalloc to avoid failure due to memory fragmentation * [tegra] Use "else" * [tegra] Lock destroy() using lvcmdq_mutex v6 https://lore.kernel.org/all/cover.1749884998.git.nicolinc@nvidia.com/ * Rebase on iommufd_hw_queue-prep-v2 * Add Reviewed-by from Kevin and Jason * [iommufd] Update kdocs and notes * [iommufd] Drop redundant pages[i] check * [iommufd] Allow nesting_parent_iova to be 0 * [iommufd] Add iommufd_hw_queue_alloc_phys() * [iommufd] Revise iommufd_viommu_alloc/destroy_mmap APIs * [iommufd] Move destroy ops to vdevice/hw_queue structures * [iommufd] Add union in hw_info struct to share out_data_type field * [iommufd] Replace iopt_pin/unpin_pages() with internal access APIs * [iommufd] Replace vdevice_alloc with vdevice_size and vdevice_init * [iommufd] Replace hw_queue_alloc with get_hw_queue_size/hw_queue_init * [iommufd] Replace IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA with init_phys * [smmu] Drop arm_smmu_domain_ipa_to_pa * [smmu] Update arm_smmu_impl_ops changes for vsmmu_init * [tegra] Add a vdev_to_vsid macro * [tegra] Add lvcmdq_mutex to protect multi queues * [tegra] Drop duplicated kcalloc for vintf->lvcmdqs (memory leak) v5 https://lore.kernel.org/all/cover.1747537752.git.nicolinc@nvidia.com/ * Rebase on v6.15-rc6 * Add Reviewed-by from Jason and Kevin * Correct typos in kdoc and update commit logs * [iommufd] Add a cosmetic fix * [iommufd] Drop unused num_pfns * [iommufd] Drop unnecessary check * [iommufd] Reorder patch sequence * [iommufd] Use io_remap_pfn_range() * [iommufd] Use success oriented flow * [iommufd] Fix max_npages calculation * [iommufd] Add more selftest coverage * [iommufd] Drop redundant static_assert * [iommufd] Fix mmap pfn range validation * [iommufd] Reject unmap on pinned iovas * [iommufd] Drop redundant vm_flags_set() * [iommufd] Drop iommufd_struct_destroy() * [iommufd] Drop redundant queue iova test * [iommufd] Use "mmio_addr" and "mmio_pfn" * [iommufd] Rename to "nesting_parent_iova" * [iommufd] Make iopt_pin_pages call option * [iommufd] Add ictx comparison in depend() * [iommufd] Add iommufd_object_alloc_ucmd() * [iommufd] Move kcalloc() after validations * [iommufd] Replace ictx setting with WARN_ON * [iommufd] Make hw_info's type bidirectional * [smmu] Add supported_vsmmu_type in impl_ops * [smmu] Drop impl report in smmu vendor struct * [tegra] Add IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV * [tegra] Replace "number of VINTFs" with a note * [tegra] Drop the redundant lvcmdq pointer setting * [tegra] Flag IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA * [tegra] Use "vintf_alloc_vsid" for vdevice_alloc op v4 https://lore.kernel.org/all/cover.1746757630.git.nicolinc@nvidia.com/ * Rebase on v6.15-rc5 * Add Reviewed-by from Vasant * Rename "vQUEUE" to "HW QUEUE" * Use "offset" and "length" for all mmap-related variables * [iommufd] Use u64 for guest PA * [iommufd] Fix typo in uAPI doc * [iommufd] Rename immap_id to offset * [iommufd] Drop the partial-size mmap support * [iommufd] Do not replace WARN_ON with WARN_ON_ONCE * [iommufd] Use "u64 base_addr" for queue base address * [iommufd] Use u64 base_pfn/num_pfns for immap structure * [iommufd] Correct the size passed in to mtree_alloc_range() * [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops v3 https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/ * Add Reviewed-by from Baolu, Pranjal, and Alok * Revise kdocs, uAPI docs, and commit logs * Rename "vCMDQ" back to "vQUEUE" for AMD cases * [tegra] Add tegra241_vcmdq_hw_flush_timeout() * [tegra] Rename vsmmu_alloc to alloc_vintf_user * [tegra] Use writel for SID replacement registers * [tegra] Move mmap removal call to vsmmu_destroy op * [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user() * [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED() * [iommufd] Add an object-type "owner" to immap structure * [iommufd] Drop the ictx input in the new for-driver APIs * [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle * [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers * [iommufd] Rename iommufd_ctx_alloc/free_mmap to _iommufd_alloc/destroy_mmap v2 https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/ * Add Reviewed-by from Jason * [smmu] Fix vsmmu initial value * [smmu] Support impl for hw_info * [tegra] Rename "slot" to "vsid" * [tegra] Update kdocs and commit logs * [tegra] Map/unmap LVCMDQ dynamically * [tegra] Refcount the previous LVCMDQ * [tegra] Return -EEXIST if LVCMDQ exists * [tegra] Simplify VINTF cleanup routine * [tegra] Use vmid and s2_domain in vsmmu * [tegra] Rename "mmap_pgoff" to "immap_id" * [tegra] Add more addr and length validation * [iommufd] Add more narrative to mmap's kdoc * [iommufd] Add iommufd_struct_depend/undepend() * [iommufd] Rename vcmdq_free op to vcmdq_destroy * [iommufd] Fix bug in iommu_copy_struct_to_user() * [iommufd] Drop is_io from iommufd_ctx_alloc_mmap() * [iommufd] Test the queue memory for its contiguity * [iommufd] Return -ENXIO if address or length fails * [iommufd] Do not change @min_last in mock_viommu_alloc() * [iommufd] Generalize TEGRA241_VCMDQ data in core structure * [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC * [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping v1 https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/ Thanks Nicolin Nicolin Chen (29): iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_range iommufd: Correct virt_id kdoc at struct iommu_vdevice_alloc iommufd/viommu: Explicitly define vdev->virt_id iommu: Use enum iommu_hw_info_type for type in hw_info op iommu: Add iommu_copy_struct_to_user helper iommu: Pass in a driver-level user data structure to viommu_init op iommufd/viommu: Allow driver-specific user data for a vIOMMU object iommufd/selftest: Support user_data in mock_viommu_alloc iommufd/selftest: Add coverage for viommu data iommufd/access: Add internal APIs for HW queue to use iommufd/access: Bypass access->ops->unmap for internal use iommufd/viommu: Add driver-defined vDEVICE support iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC iommufd: Add mmap interface iommufd/selftest: Add coverage for the new mmap interface Documentation: userspace-api: iommufd: Update HW QUEUE iommu: Allow an input type in hw_info op iommufd: Allow an input data_type via iommu_hw_info iommufd/selftest: Update hw_info coverage for an input data_type iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops iommu/tegra241-cmdqv: Use request_threaded_irq iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf() iommu/tegra241-cmdqv: Do not statically map LVCMDQs iommu/tegra241-cmdqv: Add user-space use support iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 25 +- drivers/iommu/iommufd/io_pagetable.h | 5 +- drivers/iommu/iommufd/iommufd_private.h | 46 +- drivers/iommu/iommufd/iommufd_test.h | 20 + include/linux/iommu.h | 50 +- include/linux/iommufd.h | 160 ++++++ include/uapi/linux/iommufd.h | 147 +++++- tools/testing/selftests/iommu/iommufd_utils.h | 89 +++- .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 28 +- .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 477 +++++++++++++++++- drivers/iommu/intel/iommu.c | 7 +- drivers/iommu/iommufd/device.c | 87 +++- drivers/iommu/iommufd/driver.c | 82 ++- drivers/iommu/iommufd/io_pagetable.c | 13 +- drivers/iommu/iommufd/main.c | 69 +++ drivers/iommu/iommufd/pages.c | 12 +- drivers/iommu/iommufd/selftest.c | 153 +++++- drivers/iommu/iommufd/viommu.c | 218 +++++++- tools/testing/selftests/iommu/iommufd.c | 141 +++++- .../selftests/iommu/iommufd_fail_nth.c | 15 +- Documentation/userspace-api/iommufd.rst | 12 + 21 files changed, 1745 insertions(+), 111 deletions(-) -- 2.43.0 diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index aa25156e04a3..3fa02c51df9f 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -1045,7 +1045,8 @@ struct arm_vsmmu { }; #if IS_ENABLED(CONFIG_ARM_SMMU_V3_IOMMUFD) -void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type); +void *arm_smmu_hw_info(struct device *dev, u32 *length, + enum iommu_hw_info_type *type); size_t arm_smmu_get_viommu_size(struct device *dev, enum iommu_viommu_type viommu_type); int arm_vsmmu_init(struct iommufd_viommu *viommu, diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index 00641204efb2..91339f799916 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -206,7 +206,11 @@ static void iommufd_hw_queue_destroy_access(struct iommufd_ctx *ictx, struct iommufd_access *access, u64 base_iova, size_t length) { - iommufd_access_unpin_pages(access, base_iova, length); + u64 aligned_iova = PAGE_ALIGN_DOWN(base_iova); + u64 offset = base_iova - aligned_iova; + + iommufd_access_unpin_pages(access, aligned_iova, + PAGE_ALIGN(length + offset)); iommufd_access_detach_internal(access); iommufd_access_destroy_internal(ictx, access); } @@ -239,22 +243,23 @@ static struct iommufd_access * iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd, struct iommufd_viommu *viommu, phys_addr_t *base_pa) { + u64 aligned_iova = PAGE_ALIGN_DOWN(cmd->nesting_parent_iova); + u64 offset = cmd->nesting_parent_iova - aligned_iova; struct iommufd_access *access; struct page **pages; size_t max_npages; size_t length; - u64 offset; size_t i; int rc; - offset = - cmd->nesting_parent_iova - PAGE_ALIGN(cmd->nesting_parent_iova); - /* DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE) */ + /* max_npages = DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE) */ if (check_add_overflow(offset, cmd->length, &length)) return ERR_PTR(-ERANGE); if (check_add_overflow(length, PAGE_SIZE - 1, &length)) return ERR_PTR(-ERANGE); max_npages = length / PAGE_SIZE; + /* length needs to be page aligned too */ + length = max_npages * PAGE_SIZE; /* * Use kvcalloc() to avoid memory fragmentation for a large page array. @@ -274,8 +279,7 @@ iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd, if (rc) goto out_destroy; - rc = iommufd_access_pin_pages(access, cmd->nesting_parent_iova, - cmd->length, pages, 0); + rc = iommufd_access_pin_pages(access, aligned_iova, length, pages, 0); if (rc) goto out_detach; @@ -287,13 +291,12 @@ iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd, goto out_unpin; } - *base_pa = page_to_pfn(pages[0]) << PAGE_SHIFT; + *base_pa = (page_to_pfn(pages[0]) << PAGE_SHIFT) + offset; kfree(pages); return access; out_unpin: - iommufd_access_unpin_pages(access, cmd->nesting_parent_iova, - cmd->length); + iommufd_access_unpin_pages(access, aligned_iova, length); out_detach: iommufd_access_detach_internal(access); out_destroy: diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 9d5b852d5e19..d59d48022a24 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -3104,17 +3104,18 @@ TEST_F(iommufd_viommu, hw_queue) /* Allocate index=0, declare ownership of the iova */ test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova, PAGE_SIZE, &hw_queue_id[0]); - /* Fail duplicate */ + /* Fail duplicated index */ test_err_hw_queue_alloc(EEXIST, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova, PAGE_SIZE, &hw_queue_id[0]); /* Fail unmap, due to iova ownership */ test_err_ioctl_ioas_unmap(EBUSY, iova, PAGE_SIZE); /* The 2nd page is not pinned, so it can be unmmap */ - test_ioctl_ioas_unmap(iova + PAGE_SIZE, PAGE_SIZE); + test_ioctl_ioas_unmap(iova2, PAGE_SIZE); - /* Allocate index=1 */ + /* Allocate index=1, with an unaligned case */ test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 1, - iova, PAGE_SIZE, &hw_queue_id[1]); + iova + PAGE_SIZE / 2, PAGE_SIZE / 2, + &hw_queue_id[1]); /* Fail to destroy, due to dependency */ EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, hw_queue_id[0]));

6 hours, 47 minutes

5
39
0 0

[PATCH 00/17] rust: replace `kernel::c_str!` with C-Strings

by Tamir Duberstein

This series depends on step 3[0] which depends on steps 2a[1] and 2b[2] which both depend on step 1[3]. This series also has a minor merge conflict with a small change[4] that was taken through driver-core-testing. This series is marked as depending on that change; as such it contains the post-conflict patch. Subsystem maintainers: I would appreciate your `Acked-by`s so that this can be taken through Miguel's tree (where the previous series must go). Link https://lore.kernel.org/all/20250710-cstr-core-v14-0-ca7e0ca82c82@gmail.com/ [0] Link: https://lore.kernel.org/all/20250709-core-cstr-fanout-1-v1-0-64308e7203fc@g… [1] Link: https://lore.kernel.org/all/20250709-core-cstr-fanout-1-v1-0-fd793b3e58a2@g… [2] Link: https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm… [3] Link: https://lore.kernel.org/all/20250704-cstr-include-aux-v1-1-e1a404ae92ac@gma… [4] Signed-off-by: Tamir Duberstein <tamird(a)gmail.com> --- Tamir Duberstein (17): drivers: net: replace `kernel::c_str!` with C-Strings gpu: nova-core: replace `kernel::c_str!` with C-Strings rust: auxiliary: replace `kernel::c_str!` with C-Strings rust: clk: replace `kernel::c_str!` with C-Strings rust: configfs: replace `kernel::c_str!` with C-Strings rust: cpufreq: replace `kernel::c_str!` with C-Strings rust: device: replace `kernel::c_str!` with C-Strings rust: firmware: replace `kernel::c_str!` with C-Strings rust: kunit: replace `kernel::c_str!` with C-Strings rust: macros: replace `kernel::c_str!` with C-Strings rust: miscdevice: replace `kernel::c_str!` with C-Strings rust: net: replace `kernel::c_str!` with C-Strings rust: pci: replace `kernel::c_str!` with C-Strings rust: platform: replace `kernel::c_str!` with C-Strings rust: seq_file: replace `kernel::c_str!` with C-Strings rust: str: replace `kernel::c_str!` with C-Strings rust: sync: replace `kernel::c_str!` with C-Strings drivers/block/rnull.rs | 2 +- drivers/cpufreq/rcpufreq_dt.rs | 5 ++--- drivers/gpu/drm/nova/driver.rs | 10 +++++----- drivers/gpu/nova-core/driver.rs | 6 +++--- drivers/net/phy/ax88796b_rust.rs | 7 +++---- drivers/net/phy/qt2025.rs | 5 ++--- rust/kernel/clk.rs | 6 ++---- rust/kernel/configfs.rs | 5 ++--- rust/kernel/cpufreq.rs | 3 +-- rust/kernel/device.rs | 4 +--- rust/kernel/firmware.rs | 6 +++--- rust/kernel/kunit.rs | 11 ++++------- rust/kernel/net/phy.rs | 6 ++---- rust/kernel/platform.rs | 4 ++-- rust/kernel/seq_file.rs | 4 ++-- rust/kernel/str.rs | 5 ++--- rust/kernel/sync.rs | 5 ++--- rust/kernel/sync/completion.rs | 2 +- rust/kernel/workqueue.rs | 8 ++++---- rust/macros/kunit.rs | 10 +++++----- rust/macros/module.rs | 2 +- samples/rust/rust_configfs.rs | 5 ++--- samples/rust/rust_driver_auxiliary.rs | 4 ++-- samples/rust/rust_driver_faux.rs | 4 ++-- samples/rust/rust_driver_pci.rs | 4 ++-- samples/rust/rust_driver_platform.rs | 4 ++-- samples/rust/rust_misc_device.rs | 3 +-- scripts/rustdoc_test_gen.rs | 4 ++-- 28 files changed, 63 insertions(+), 81 deletions(-) --- base-commit: 769e324b66b0d92d04f315d0c45a0f72737c7494 change-id: 20250710-core-cstr-cstrings-1faaa632f0fd prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1 prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573 prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27 prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8 prerequisite-message-id: 20250709-core-cstr-fanout-1-v1-0-64308e7203fc(a)gmail.com prerequisite-patch-id: fa79c5d8fd2762b5e488ba017e13a5774d933f81 prerequisite-patch-id: c338aa49e1319e9e802de2ad8bb0fa688bce9d9c prerequisite-patch-id: 589a352ba7f7c9aefefd84dfd3b6b20e290b0d14 prerequisite-patch-id: 29fc25261295349f6747d1bb409cf18130e9aa69 prerequisite-patch-id: 3d89601bba1fb01d190b0ba415b28ad9cbf1e209 prerequisite-patch-id: 10923aebf24011b727f60496c0f9e0ad57e0a967 prerequisite-patch-id: 56583fd829951fb4fac843c6b1874c643b726de0 prerequisite-patch-id: 9a7e8ba460358985147efd347658be31fbc78ba2 prerequisite-patch-id: 5821a23334e317cd0351b8e4404b9e3b36b72d67 prerequisite-message-id: 20250709-core-cstr-fanout-1-v1-0-fd793b3e58a2(a)gmail.com prerequisite-patch-id: 0ccc3545ff9bf22a67b79a944705cef2fb9c2bbf prerequisite-patch-id: b1866166714606d5c11a4d7506abe4c2f86dac8d prerequisite-patch-id: 163b8ff1edaf8e48976fd5de3f64e68fc38c7277 prerequisite-patch-id: 8fee5e2daf0749362331dad4fc63d907a01b14e9 prerequisite-patch-id: 366ef1f93fb40b1d039768f2041ff79995e7e228 prerequisite-patch-id: 1d350291f9292f910081856d8f7d5e4d9545cfd1 prerequisite-patch-id: 9a6a60bd2b209126de64c16a77a3a1d229dd898c prerequisite-patch-id: 08ae5855768ec3b4c68272b86d2a0e0667c9aa47 prerequisite-patch-id: f15b54927660a03b52ffb34fb7943ac3228b7803 prerequisite-patch-id: f0dbf0a55a27fe8e199e242d1f79ea800d1ddb66 prerequisite-change-id: 20250201-cstr-core-d4b9b69120cf:v14 prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573 prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27 prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8 prerequisite-patch-id: 0ccc3545ff9bf22a67b79a944705cef2fb9c2bbf prerequisite-patch-id: b1866166714606d5c11a4d7506abe4c2f86dac8d prerequisite-patch-id: 163b8ff1edaf8e48976fd5de3f64e68fc38c7277 prerequisite-patch-id: 8fee5e2daf0749362331dad4fc63d907a01b14e9 prerequisite-patch-id: 366ef1f93fb40b1d039768f2041ff79995e7e228 prerequisite-patch-id: 1d350291f9292f910081856d8f7d5e4d9545cfd1 prerequisite-patch-id: 9a6a60bd2b209126de64c16a77a3a1d229dd898c prerequisite-patch-id: 08ae5855768ec3b4c68272b86d2a0e0667c9aa47 prerequisite-patch-id: f15b54927660a03b52ffb34fb7943ac3228b7803 prerequisite-patch-id: f0dbf0a55a27fe8e199e242d1f79ea800d1ddb66 prerequisite-patch-id: fa79c5d8fd2762b5e488ba017e13a5774d933f81 prerequisite-patch-id: c338aa49e1319e9e802de2ad8bb0fa688bce9d9c prerequisite-patch-id: 589a352ba7f7c9aefefd84dfd3b6b20e290b0d14 prerequisite-patch-id: 29fc25261295349f6747d1bb409cf18130e9aa69 prerequisite-patch-id: 3d89601bba1fb01d190b0ba415b28ad9cbf1e209 prerequisite-patch-id: 10923aebf24011b727f60496c0f9e0ad57e0a967 prerequisite-patch-id: 56583fd829951fb4fac843c6b1874c643b726de0 prerequisite-patch-id: 9a7e8ba460358985147efd347658be31fbc78ba2 prerequisite-patch-id: 5821a23334e317cd0351b8e4404b9e3b36b72d67 prerequisite-patch-id: 9c0a6624ed7b7e1d0373985c5c084a844e7c49ce prerequisite-patch-id: 6d8dbdf864f79fc0c2820e702a7cb87753649ca0 prerequisite-patch-id: 2bc4afce0104c13c0dd4d50923b0db2f5cd11129 prerequisite-change-id: 20250704-cstr-include-aux-7847969762a8:v1 prerequisite-patch-id: 1f79f64dd9b8a092ff039e6c7fad1430afb8ea25 Best regards, -- Tamir Duberstein <tamird(a)gmail.com>

7 hours, 30 minutes

3
19
0 0

[PATCH 0/9] rust: use `kernel::{fmt,prelude::fmt!}`

by Tamir Duberstein

This is series 2a/5 of the migration to `core::ffi::CStr`[0]. 20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com. This series depends on the prior series[0] and is intended to go through the rust tree to reduce the number of release cycles required to complete the work. Subsystem maintainers: I would appreciate your `Acked-by`s so that this can be taken through Miguel's tree (where the other series must go). [0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm… Signed-off-by: Tamir Duberstein <tamird(a)gmail.com> --- Tamir Duberstein (9): gpu: nova-core: use `kernel::{fmt,prelude::fmt!}` rust: alloc: use `kernel::{fmt,prelude::fmt!}` rust: block: use `kernel::{fmt,prelude::fmt!}` rust: device: use `kernel::{fmt,prelude::fmt!}` rust: file: use `kernel::{fmt,prelude::fmt!}` rust: kunit: use `kernel::{fmt,prelude::fmt!}` rust: pin-init: use `kernel::{fmt,prelude::fmt!}` rust: seq_file: use `kernel::{fmt,prelude::fmt!}` rust: sync: use `kernel::{fmt,prelude::fmt!}` drivers/block/rnull.rs | 2 +- drivers/gpu/nova-core/gpu.rs | 3 +-- drivers/gpu/nova-core/regs/macros.rs | 6 +++--- rust/kernel/alloc/kbox.rs | 2 +- rust/kernel/alloc/kvec.rs | 2 +- rust/kernel/alloc/kvec/errors.rs | 2 +- rust/kernel/block/mq.rs | 2 +- rust/kernel/block/mq/gen_disk.rs | 2 +- rust/kernel/block/mq/raw_writer.rs | 3 +-- rust/kernel/device.rs | 6 +++--- rust/kernel/fs/file.rs | 5 +++-- rust/kernel/init.rs | 4 ++-- rust/kernel/kunit.rs | 8 ++++---- rust/kernel/seq_file.rs | 6 +++--- rust/kernel/sync/arc.rs | 3 +-- scripts/rustdoc_test_gen.rs | 2 +- 16 files changed, 28 insertions(+), 30 deletions(-) --- base-commit: 769e324b66b0d92d04f315d0c45a0f72737c7494 change-id: 20250709-core-cstr-fanout-1-f20611832272 prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1 prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573 prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27 prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8 Best regards, -- Tamir Duberstein <tamird(a)gmail.com>

7 hours, 34 minutes

3
16
0 0

[PATCH net-next v7 0/3] selftest: net: Add selftest for netpoll

by Breno Leitao

I am submitting a new selftest for the netpoll subsystem specifically targeting the case where the RX is polling in the TX path, which is a case that we don't have any test in the tree today. This is done when netpoll_poll_dev() called, and this test creates a scenario when that is probably. The test does the following: 1) Configuring a single RX/TX queue to increase contention on the interface. 2) Generating background traffic to saturate the network, mimicking real-world congestion. 3) Sending netconsole messages to trigger netpoll polling and monitor its behavior. 4) Using dynamic netconsole targets via configfs, with the ability to delete and recreate targets during the test. 5) Running bpftrace in parallel to verify that netpoll_poll_dev() is called when expected. If it is called, then the test passes, otherwise the test is marked as skipped. In order to achieve it, I stole Jakub's bpftrace helper from [1], and did some small changes that I found useful to use the helper. So, this patchset basically contains: 1) The code stolen from Jakub 2) Improvements on bpftrace() helper 3) The selftest itself Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1] --- Changes in v7: - Rebased on top of net-next - Using `ethtool -l` json option instead of parsing it manually. - Link to v6: https://lore.kernel.org/r/20250711-netpoll_test-v6-0-130465f286a8@debian.org Changes in v6: - Remove the network toggled (Jakub) - Set ringsize and queue size (Jakub) - Some other general improvements (Jakub) - Link to v5: https://lore.kernel.org/r/20250709-netpoll_test-v5-0-b3737895affe@debian.org Changes in v5: - Rebased on top of net-next. - Calling bpftrace_stop using the defer helper. (Willem) - Link to v4: https://lore.kernel.org/r/20250702-netpoll_test-v4-0-cec227e85639@debian.org Changes in v4: - Make the test XFail if it doesn't hit the function we are looking for - Toggle the interface while the traffic is flowing. - Bumped the number of messages from 10 to 40 per iterations. * This is hitting ~15 times per run on my vng test. - Decreased the time from 15 seconds to 10 seconds, given that if it didn't hit the function in 10 seconds, 5 seconds extra will not help. - Link to v3: https://lore.kernel.org/r/20250627-netpoll_test-v3-0-575bd200c8a9@debian.org Changes in v3: - Make pylint happy (Simon) - Remove the unnecessary patch in bpftrace to raise an exception when it fails. (Jakub) - Improved the bpftrace code (Willem) - Stop sending messages if bpftrace is not alive anymore. - Link to v2: https://lore.kernel.org/r/20250625-netpoll_test-v2-0-47d27775222c@debian.org Changes in v2: - Stole Jakub's helper to run bpftrace - Removed the DEBUG option and moved logs to logging - Change the code to have a higher chance of calling netpoll_poll_dev(). In my current configuration, it is hitting multiple times during the test. - Save and restore TX/RX queue size (Jakub) - Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org --- Breno Leitao (2): selftests: drv-net: Strip '@' prefix from bpftrace map keys selftests: net: add netpoll basic functionality test Jakub Kicinski (1): selftests: drv-net: add helper/wrapper for bpftrace tools/testing/selftests/drivers/net/Makefile | 1 + .../selftests/drivers/net/lib/py/__init__.py | 4 +- .../testing/selftests/drivers/net/netpoll_basic.py | 396 +++++++++++++++++++++ tools/testing/selftests/net/lib/py/utils.py | 35 ++ 4 files changed, 434 insertions(+), 2 deletions(-) --- base-commit: b06c4311711c57c5e558bd29824b08f0a6e2a155 change-id: 20250612-netpoll_test-a1324d2057c8 Best regards, -- Breno Leitao <leitao(a)debian.org>

8 hours, 53 minutes

1
3
0 0

[PATCH net-next v6 0/3] selftest: net: Add selftest for netpoll

by Breno Leitao

I am submitting a new selftest for the netpoll subsystem specifically targeting the case where the RX is polling in the TX path, which is a case that we don't have any test in the tree today. This is done when netpoll_poll_dev() called, and this test creates a scenario when that is probably. The test does the following: 1) Configuring a single RX/TX queue to increase contention on the interface. 2) Generating background traffic to saturate the network, mimicking real-world congestion. 3) Sending netconsole messages to trigger netpoll polling and monitor its behavior. 4) Using dynamic netconsole targets via configfs, with the ability to delete and recreate targets during the test. 5) Running bpftrace in parallel to verify that netpoll_poll_dev() is called when expected. If it is called, then the test passes, otherwise the test is marked as skipped. In order to achieve it, I stole Jakub's bpftrace helper from [1], and did some small changes that I found useful to use the helper. So, this patchset basically contains: 1) The code stolen from Jakub 2) Improvements on bpftrace() helper 3) The selftest itself Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1] --- Changes in v6: - Remove the network toggled (Jakub) - Set ringsize and queue size (Jakub) - Some other general improvements (Jakub) - Link to v5: https://lore.kernel.org/r/20250709-netpoll_test-v5-0-b3737895affe@debian.org Changes in v5: - Rebased on top of net-next. - Calling bpftrace_stop using the defer helper. (Willem) - Link to v4: https://lore.kernel.org/r/20250702-netpoll_test-v4-0-cec227e85639@debian.org Changes in v4: - Make the test XFail if it doesn't hit the function we are looking for - Toggle the interface while the traffic is flowing. - Bumped the number of messages from 10 to 40 per iterations. * This is hitting ~15 times per run on my vng test. - Decreased the time from 15 seconds to 10 seconds, given that if it didn't hit the function in 10 seconds, 5 seconds extra will not help. - Link to v3: https://lore.kernel.org/r/20250627-netpoll_test-v3-0-575bd200c8a9@debian.org Changes in v3: - Make pylint happy (Simon) - Remove the unnecessary patch in bpftrace to raise an exception when it fails. (Jakub) - Improved the bpftrace code (Willem) - Stop sending messages if bpftrace is not alive anymore. - Link to v2: https://lore.kernel.org/r/20250625-netpoll_test-v2-0-47d27775222c@debian.org Changes in v2: - Stole Jakub's helper to run bpftrace - Removed the DEBUG option and moved logs to logging - Change the code to have a higher chance of calling netpoll_poll_dev(). In my current configuration, it is hitting multiple times during the test. - Save and restore TX/RX queue size (Jakub) - Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org --- Breno Leitao (2): selftests: drv-net: Strip '@' prefix from bpftrace map keys selftests: net: add netpoll basic functionality test Jakub Kicinski (1): selftests: drv-net: add helper/wrapper for bpftrace tools/testing/selftests/drivers/net/Makefile | 1 + .../selftests/drivers/net/lib/py/__init__.py | 3 +- .../testing/selftests/drivers/net/netpoll_basic.py | 411 +++++++++++++++++++++ tools/testing/selftests/net/lib/py/utils.py | 35 ++ 4 files changed, 449 insertions(+), 1 deletion(-) --- base-commit: 0f26870a989bf69957ed69d10c7ffc57ca5a7f52 change-id: 20250612-netpoll_test-a1324d2057c8 Best regards, -- Breno Leitao <leitao(a)debian.org>

8 hours, 58 minutes

2
5
0 0

[PATCH net] selftests: rtnetlink: try double sleep to give WQ a chance

by Jakub Kicinski

The rtnetlink test for preferred lifetime of an address is quite flaky. Problems started around the 6.16 merge window in May. The test fails with: FAIL: preferred_lft addresses remaining and unlike most of our flakes this one fails on the "normal" kernel builds, not the builds with kernel/configs/debug.config. I suspect the flakes may be related to power saving, since the expirations run from a "power efficient" workqueue. Adding a short sleep seems to decrease the flakes by 8x but they still happen. With this patch in place we get a flake every couple of weeks, not every couple of days. Better ideas welcome.. Signed-off-by: Jakub Kicinski <kuba(a)kernel.org> --- CC: liuhangbin(a)gmail.com CC: shuah(a)kernel.org CC: linux-kselftest(a)vger.kernel.org --- tools/testing/selftests/net/rtnetlink.sh | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh index 2e8243a65b50..b9e1497ea27a 100755 --- a/tools/testing/selftests/net/rtnetlink.sh +++ b/tools/testing/selftests/net/rtnetlink.sh @@ -299,6 +299,11 @@ kci_test_addrlft() done sleep 5 + # Schedule out for a bit, address GC runs from the power efficient WQ + # if the long sleep above has put the whole system into sleep state + # the WQ may have not had a chance to run. + sleep 0.1 + run_cmd_grep_fail "10.23.11." ip addr show dev "$devdummy" if [ $? -eq 0 ]; then check_err 1 -- 2.50.0

11 hours, 31 minutes

2
3
0 0

[PATCH v22 net-next 0/6] DUALPI2 patch

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com> Hello, Please find the DualPI2 patch v22. This patch serise adds DualPI Improved with a Square (DualPI2) with following features: * Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague) * Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332 * Configurable overload strategies * Use of sojourn time to reliably estimate queue delay * Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332). Best regards, Chia-Yu --- v22 (11-Jul-2025) - Fix the issue when user provides an empty TCA_OPTIONS with no nested attributes (Paolo Abeni <pabeni(a)redhat.com>, Jakub Kicinski <kuba(a)kernel.org>) v21 (02-Jul-2025) - Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>) - Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>) - Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>) - Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>) - Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>) - Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub Kicinski <kuba(a)kernel.org>) - Remove redundant name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>) - Fix some typos and format issues of dualpi2 attributes v20 (21-Jun-2025) - Add one more commit to fix warning and style check on tdc.sh reported by shellcheck - Remove double-prefixed of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>) v19 (14-Jun-2025) - Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>) v18 (13-Jun-2025) - Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute - Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>) v17 (25-May-2025, Resent at 11-Jun-2025) - Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>) - Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>) - Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>) - Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>) - Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>) - Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>) - Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>) v16 (16-MAy-2025) - Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>) - Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>) - Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>) - Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>) v15 (09-May-2025) - Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>) - Fix one typo in comment of #2 - Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h v14 (05-May-2025) - Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun - Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>) - Update dualpi2.json to align with the updated variable order in pkt_sched.h - Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>) v13 (26-Apr-2025) - Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>) - Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>) v12 (22-Apr-2025) - Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>) - Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>) - Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>) - Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>) v11 (15-Apr-2025) - Replace hstimer_init with hstimer_setup in sch_dualpi2.c v10 (25-Mar-2025) - Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>) - Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>) v9 (16-Mar-2025) - Fix mem_usage error in previous version - Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking. In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets. This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0. Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 11.55 11.70 ms 350 TCP upload avg : 18.96 N/A Mbits/s 350 TCP upload sum : 18.96 N/A Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 10.81 10.70 ms 350 TCP upload avg : 18.91 N/A Mbits/s 350 TCP upload sum : 18.91 N/A Mbits/s 350 Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 12.61 12.80 ms 350 TCP upload avg : 9.48 N/A Mbits/s 350 TCP upload sum : 9.48 N/A Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 11.06 10.80 ms 350 TCP upload avg : 9.43 N/A Mbits/s 350 TCP upload sum : 9.43 N/A Mbits/s 350 Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 40.86 37.45 ms 350 TCP upload avg : 0.88 N/A Mbits/s 350 TCP upload sum : 0.88 N/A Mbits/s 350 TCP upload::1 : 0.88 0.97 Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 11.07 10.40 ms 350 TCP upload avg : 0.55 N/A Mbits/s 350 TCP upload sum : 0.55 N/A Mbits/s 350 TCP upload::1 : 0.55 0.59 Mbits/s 350 v8 (11-Mar-2025) - Fix warning messages in v7 v7 (07-Mar-2025) - Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>) v6 (04-Mar-2025) - Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>) - Update test cases in dualpi2.json - Update commit message v5 (22-Feb-2025) - A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL: Unshaped 1gigE with 4 download streams test: - Summary of tcp_4down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 1.19 1.34 ms 349 TCP download avg : 235.42 N/A Mbits/s 349 TCP download sum : 941.68 N/A Mbits/s 349 TCP download::1 : 235.19 235.39 Mbits/s 349 TCP download::2 : 235.03 235.35 Mbits/s 349 TCP download::3 : 236.89 235.44 Mbits/s 349 TCP download::4 : 234.57 235.19 Mbits/s 349 - Summary of tcp_4down run 'MQ + FQ_PIE' avg median # data pts Ping (ms) ICMP : 1.21 1.37 ms 350 TCP download avg : 235.42 N/A Mbits/s 350 TCP download sum : 941.61 N/A Mbits/s 350 TCP download::1 : 232.54 233.13 Mbits/s 350 TCP download::2 : 232.52 232.80 Mbits/s 350 TCP download::3 : 233.14 233.78 Mbits/s 350 TCP download::4 : 243.41 241.48 Mbits/s 350 - Summary of tcp_4down run 'MQ + DUALPI2' avg median # data pts Ping (ms) ICMP : 1.19 1.34 ms 349 TCP download avg : 235.42 N/A Mbits/s 349 TCP download sum : 941.68 N/A Mbits/s 349 TCP download::1 : 235.19 235.39 Mbits/s 349 TCP download::2 : 235.03 235.35 Mbits/s 349 TCP download::3 : 236.89 235.44 Mbits/s 349 TCP download::4 : 234.57 235.19 Mbits/s 349 Unshaped 1gigE with 128 download streams test: - Summary of tcp_128down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 Unshaped 10gigE with 4 download streams test: - Summary of tcp_4down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 0.22 0.23 ms 350 TCP download avg : 2354.08 N/A Mbits/s 350 TCP download sum : 9416.31 N/A Mbits/s 350 TCP download::1 : 2353.65 2352.81 Mbits/s 350 TCP download::2 : 2354.54 2354.21 Mbits/s 350 TCP download::3 : 2353.56 2353.78 Mbits/s 350 TCP download::4 : 2354.56 2354.45 Mbits/s 350 - Summary of tcp_4down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 0.20 0.19 ms 350 TCP download avg : 2354.76 N/A Mbits/s 350 TCP download sum : 9419.04 N/A Mbits/s 350 TCP download::1 : 2354.77 2353.89 Mbits/s 350 TCP download::2 : 2353.41 2354.29 Mbits/s 350 TCP download::3 : 2356.18 2354.19 Mbits/s 350 TCP download::4 : 2354.68 2353.15 Mbits/s 350 - Summary of tcp_4down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 0.24 0.24 ms 350 TCP download avg : 2354.11 N/A Mbits/s 350 TCP download sum : 9416.43 N/A Mbits/s 350 TCP download::1 : 2354.75 2353.93 Mbits/s 350 TCP download::2 : 2353.15 2353.75 Mbits/s 350 TCP download::3 : 2353.49 2353.72 Mbits/s 350 TCP download::4 : 2355.04 2353.73 Mbits/s 350 Unshaped 10gigE with 128 download streams test: - Summary of tcp_128down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 7.57 8.69 ms 350 TCP download avg : 73.97 N/A Mbits/s 350 TCP download sum : 9467.82 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 7.82 8.91 ms 350 TCP download avg : 73.97 N/A Mbits/s 350 TCP download sum : 9468.42 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 6.87 7.93 ms 350 TCP download avg : 73.95 N/A Mbits/s 350 TCP download sum : 9465.87 N/A Mbits/s 350 From the results shown above, we see small differences between combinations. - Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>) - Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>) - Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>) - Update license identifier (Dave Taht <dave.taht(a)gmail.com>) - Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>) - Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>) - Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c - Fix step_thresh in packets - Update code comments in sch_dualpi2.c v4 (22-Oct-2024) - Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>) - Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>) - Fix line length warning. v3 (19-Oct-2024) - Fix compilaiton error - Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>) v2 (18-Oct-2024) - Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>) - Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Fix line length warning --- Chia-Yu Chang (5): sched: Struct definition and parsing of dualpi2 qdisc sched: Dump configuration and statistics of dualpi2 qdisc selftests/tc-testing: Fix warning and style check on tdc.sh selftests/tc-testing: Add selftests for qdisc DualPI2 Documentation: netlink: specs: tc: Add DualPI2 specification Koen De Schepper (1): sched: Add enqueue/dequeue of dualpi2 qdisc Documentation/netlink/specs/tc.yaml | 151 ++- include/net/dropreason-core.h | 6 + include/uapi/linux/pkt_sched.h | 68 + net/sched/Kconfig | 12 + net/sched/Makefile | 1 + net/sched/sch_dualpi2.c | 1171 +++++++++++++++++ tools/testing/selftests/tc-testing/config | 1 + .../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++ tools/testing/selftests/tc-testing/tdc.sh | 6 +- 9 files changed, 1665 insertions(+), 5 deletions(-) create mode 100644 net/sched/sch_dualpi2.c create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json -- 2.34.1

12 hours, 36 minutes

3
8
0 0

[PATCH] tools/nolibc: add support for Alpha

by Thomas Weißschuh

A straightforward new architecture. Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net> --- Only tested on QEMU so far. Testing on real hardware would be very welcome. Test instructions: $ cd tools/testings/selftests/nolibc/ $ make -f Makefile.nolibc ARCH=alpha CROSS_COMPILE=alpha-linux- nolibc-test $ file nolibc-test nolibc-test: ELF 64-bit LSB executable, Alpha (unofficial), version 1 (SYSV), statically linked, not stripped $ ./nolibc-test Running test 'startup' 0 argc = 1 [OK] ... Total number of errors: 0 Exiting with status 0 --- tools/include/nolibc/arch-alpha.h | 164 +++++++++++++++++++++++++ tools/include/nolibc/arch.h | 2 + tools/testing/selftests/nolibc/Makefile.nolibc | 5 + tools/testing/selftests/nolibc/nolibc-test.c | 4 + tools/testing/selftests/nolibc/run-tests.sh | 3 +- 5 files changed, 177 insertions(+), 1 deletion(-) diff --git a/tools/include/nolibc/arch-alpha.h b/tools/include/nolibc/arch-alpha.h new file mode 100644 index 0000000000000000000000000000000000000000..6b9bb6c749b931f30ce7bd6cd125622828405604 --- /dev/null +++ b/tools/include/nolibc/arch-alpha.h @@ -0,0 +1,164 @@ +/* SPDX-License-Identifier: LGPL-2.1 OR MIT */ +/* + * Alpha specific definitions for NOLIBC + * Copyright (C) 2025 Thomas Weißschuh <linux(a)weissschuh.net> + */ + +#ifndef _NOLIBC_ARCH_ALPHA_H +#define _NOLIBC_ARCH_ALPHA_H + +#include "compiler.h" +#include "crt.h" + +/* + * Syscalls for Alpha: + * - registers are 64-bit + * - syscall number is passed in $0/v0 + * - the system call is performed by calling callsys + * - syscall return comes in $0/v0, error flag in $19/a4 + * - arguments are passed in $16/a0 to $21/a5 + * - GCC does not support symbol register names + */ + +#define my_syscall0(num) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num) \ + : "memory", "cc" \ + ); \ + _err ? -_ret : _ret; \ +}) + +#define my_syscall1(num, arg1) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + register long _arg1 __asm__ ("$16") = (long)(arg1); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num), "r"(_arg1) \ + : "memory", "cc" \ + ); \ + _err ? -_ret : _ret; \ +}) + +#define my_syscall2(num, arg1, arg2) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + register long _arg1 __asm__ ("$16") = (long)(arg1); \ + register long _arg2 __asm__ ("$17") = (long)(arg2); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num), "r"(_arg1), "r"(_arg2) \ + : "memory", "cc" \ + ); \ + _err ? -_ret : _ret; \ +}) + +#define my_syscall3(num, arg1, arg2, arg3) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + register long _arg1 __asm__ ("$16") = (long)(arg1); \ + register long _arg2 __asm__ ("$17") = (long)(arg2); \ + register long _arg3 __asm__ ("$18") = (long)(arg3); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num), "r"(_arg1), "r"(_arg2), "r"(_arg3) \ + : "memory", "cc" \ + ); \ + _err ? -_ret : _ret; \ +}) + +#define my_syscall4(num, arg1, arg2, arg3, arg4) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + register long _arg1 __asm__ ("$16") = (long)(arg1); \ + register long _arg2 __asm__ ("$17") = (long)(arg2); \ + register long _arg3 __asm__ ("$18") = (long)(arg3); \ + register long _arg4 __asm__ ("$19") = (long)(arg4); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num), "r"(_arg1), "r"(_arg2), "r"(_arg3), "r"(_arg4) \ + : "memory", "cc" \ + ); \ + _err ? -_ret : _ret; \ +}) + +#define my_syscall5(num, arg1, arg2, arg3, arg4, arg5) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + register long _arg1 __asm__ ("$16") = (long)(arg1); \ + register long _arg2 __asm__ ("$17") = (long)(arg2); \ + register long _arg3 __asm__ ("$18") = (long)(arg3); \ + register long _arg4 __asm__ ("$19") = (long)(arg4); \ + register long _arg5 __asm__ ("$20") = (long)(arg5); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num), "r"(_arg1), "r"(_arg2), "r"(_arg3), "r"(_arg4), \ + "r"(_arg5) \ + : "memory", "cc" \ + ); \ + _err ? -_ret : _ret; \ +}) + +#define my_syscall6(num, arg1, arg2, arg3, arg4, arg5, arg6) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + register long _arg1 __asm__ ("$16") = (long)(arg1); \ + register long _arg2 __asm__ ("$17") = (long)(arg2); \ + register long _arg3 __asm__ ("$18") = (long)(arg3); \ + register long _arg4 __asm__ ("$19") = (long)(arg4); \ + register long _arg5 __asm__ ("$20") = (long)(arg5); \ + register long _arg6 __asm__ ("$21") = (long)(arg6); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num), "r"(_arg1), "r"(_arg2), "r"(_arg3), "r"(_arg4), \ + "r"(_arg5), "r"(_arg6) \ + : "memory", "cc" \ + ); \ + _err ? -_ret : _ret; \ +}) + +/* startup code */ +void __attribute__((weak, noreturn)) __nolibc_entrypoint __no_stack_protector _start(void) +{ + __asm__ volatile ( + "br $gp, 0f\n" /* setup $gp, so that 'lda' works */ + "0: ldgp $gp, 0($gp)\n" + "lda $27, _start_c\n" /* setup current function address for _start_c */ + "mov $sp, $16\n" /* save argc pointer to $16, as arg1 of _start_c */ + "br _start_c\n" /* transfer to c runtime */ + ); + __nolibc_entrypoint_epilogue(); +} + +#endif /* _NOLIBC_ARCH_ALPHA_H */ diff --git a/tools/include/nolibc/arch.h b/tools/include/nolibc/arch.h index 426c89198135564acca44c485e5c2d8ba36a6fe9..72585d4c04e2896a275faadf881e98286f914fb3 100644 --- a/tools/include/nolibc/arch.h +++ b/tools/include/nolibc/arch.h @@ -37,6 +37,8 @@ #include "arch-m68k.h" #elif defined(__sh__) #include "arch-sh.h" +#elif defined(__alpha__) +#include "arch-alpha.h" #else #error Unsupported Architecture #endif diff --git a/tools/testing/selftests/nolibc/Makefile.nolibc b/tools/testing/selftests/nolibc/Makefile.nolibc index 0fb759ba992ee6b1693b88f1b2e77463afa9f38b..0da33fe99bd630ec5100b5beed939d524af2b3d4 100644 --- a/tools/testing/selftests/nolibc/Makefile.nolibc +++ b/tools/testing/selftests/nolibc/Makefile.nolibc @@ -93,6 +93,7 @@ IMAGE_sparc32 = arch/sparc/boot/image IMAGE_sparc64 = arch/sparc/boot/image IMAGE_m68k = vmlinux IMAGE_sh4 = arch/sh/boot/zImage +IMAGE_alpha = vmlinux IMAGE = $(objtree)/$(IMAGE_$(XARCH)) IMAGE_NAME = $(notdir $(IMAGE)) @@ -123,6 +124,7 @@ DEFCONFIG_sparc32 = sparc32_defconfig DEFCONFIG_sparc64 = sparc64_defconfig DEFCONFIG_m68k = virt_defconfig DEFCONFIG_sh4 = rts7751r2dplus_defconfig +DEFCONFIG_alpha = defconfig DEFCONFIG = $(DEFCONFIG_$(XARCH)) EXTRACONFIG_x32 = -e CONFIG_X86_X32_ABI @@ -130,6 +132,7 @@ EXTRACONFIG_arm = -e CONFIG_NAMESPACES EXTRACONFIG_armthumb = -e CONFIG_NAMESPACES EXTRACONFIG_m68k = -e CONFIG_BLK_DEV_INITRD EXTRACONFIG_sh4 = -e CONFIG_BLK_DEV_INITRD -e CONFIG_CMDLINE_FROM_BOOTLOADER +EXTRACONFIG_alpha = -e CONFIG_BLK_DEV_INITRD EXTRACONFIG = $(EXTRACONFIG_$(XARCH)) # optional tests to run (default = all) @@ -162,6 +165,7 @@ QEMU_ARCH_sparc32 = sparc QEMU_ARCH_sparc64 = sparc64 QEMU_ARCH_m68k = m68k QEMU_ARCH_sh4 = sh4 +QEMU_ARCH_alpha = alpha QEMU_ARCH = $(QEMU_ARCH_$(XARCH)) QEMU_ARCH_USER_ppc64le = ppc64le @@ -203,6 +207,7 @@ QEMU_ARGS_sparc32 = -M SS-5 -m 256M -append "console=ttyS0,115200 panic=-1 $( QEMU_ARGS_sparc64 = -M sun4u -append "console=ttyS0,115200 panic=-1 $(TEST:%=NOLIBC_TEST=%)" QEMU_ARGS_m68k = -M virt -append "console=ttyGF0,115200 panic=-1 $(TEST:%=NOLIBC_TEST=%)" QEMU_ARGS_sh4 = -M r2d -serial file:/dev/stdout -append "console=ttySC1,115200 panic=-1 $(TEST:%=NOLIBC_TEST=%)" +QEMU_ARGS_alpha = -M clipper -append "console=ttyS0 panic=-1 $(TEST:%=NOLIBC_TEST=%)" QEMU_ARGS = -m 1G $(QEMU_ARGS_$(XARCH)) $(QEMU_ARGS_BIOS) $(QEMU_ARGS_EXTRA) # OUTPUT is only set when run from the main makefile, otherwise diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c index a297ee0d6d0754dfcd9f9e5609d42c7442dabc4e..bbbb2a485f220fed69556baaf2603d9cf24a1c36 100644 --- a/tools/testing/selftests/nolibc/nolibc-test.c +++ b/tools/testing/selftests/nolibc/nolibc-test.c @@ -709,6 +709,10 @@ int run_startup(int min, int max) /* checking NULL for argv/argv0, environ and _auxv is not enough, let's compare with sbrk(0) or &end */ extern char end; char *brk = sbrk(0) != (void *)-1 ? sbrk(0) : &end; +#if defined(__alpha__) + /* the ordering above does not work on an alpha kernel */ + brk = NULL; +#endif /* differ from nolibc, both glibc and musl have no global _auxv */ const unsigned long *test_auxv = (void *)-1; #ifdef NOLIBC diff --git a/tools/testing/selftests/nolibc/run-tests.sh b/tools/testing/selftests/nolibc/run-tests.sh index e8af1fb505cf3573b4a6b37228dee764fe2e5277..8ce57d7006594c531f471d777d579c4f08d87efe 100755 --- a/tools/testing/selftests/nolibc/run-tests.sh +++ b/tools/testing/selftests/nolibc/run-tests.sh @@ -28,6 +28,7 @@ all_archs=( sparc32 sparc64 m68k sh4 + alpha ) archs="${all_archs[@]}" @@ -189,7 +190,7 @@ test_arch() { echo "Unsupported configuration" return fi - if [ "$arch" = "m68k" -o "$arch" = "sh4" ] && [ "$llvm" = "1" ]; then + if [ "$arch" = "m68k" -o "$arch" = "sh4" -o "$arch" = "alpha" ] && [ "$llvm" = "1" ]; then echo "Unsupported configuration" return fi --- base-commit: b9e50363178a40c76bebaf2f00faa2b0b6baf8d1 change-id: 20250609-nolibc-alpha-33e79644544c Best regards, -- Thomas Weißschuh <linux(a)weissschuh.net>

13 hours, 29 minutes

3
3
0 0

[PATCH v2] selftests/tty: add TIOCSTI test suite

by Abhinav Saxena via B4 Relay

From: Abhinav Saxena <xandfury(a)gmail.com> TIOCSTI is a TTY ioctl command that allows inserting characters into the terminal input queue, making it appear as if the user typed those characters. Add a test suite with four tests to verify TIOCSTI behaviour in different scenarios when dev.tty.legacy_tiocsti is both enabled and disabled: - Test TIOCSTI functionality when legacy support is enabled - Test TIOCSTI rejection when legacy support is disabled - Test capability requirements for TIOCSTI usage - Test TIOCSTI security with file descriptor passing The tests validate proper enforcement of the legacy_tiocsti sysctl introduced in commit 83efeeeb3d04 ("tty: Allow TIOCSTI to be disabled"). See tty_ioctl(4) for details on TIOCSTI behavior and security requirements. Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com> --- This patch adds comprehensive selftests for the TIOCSTI ioctl to validate proper behaviour under different system configurations. =============== The TIOCSTI ioctl allows inserting characters into the terminal input queue, making it appear as if the user typed those characters. This functionality has security implications and behaviour that varies based on system configuration. Background ========== CONFIG_LEGACY_TIOCSTI controls the default value for the dev.tty.legacy_tiocsti sysctl, which remains runtime-configurable. The dev.tty.legacy_tiocsti sysctl was introduced in commit 83efeeeb3d04 ("tty: Allow TIOCSTI to be disabled") to provide administrators control over TIOCSTI usage. When legacy_tiocsti is disabled, TIOCSTI requires CAP_SYS_ADMIN capability. However, the current implementation only checks the current process's credentials via capable(CAP_SYS_ADMIN), which doesn't validate against the file opener's credentials stored in file->f_cred. This creates a potential security scenario where an unprivileged process can open a TTY fd and pass it to a privileged process via SCM_RIGHTS. Testing ======= The test suite includes four comprehensive tests: - Test TIOCSTI functionality when legacy support is enabled - Test TIOCSTI rejection when legacy support is disabled - Test capability requirements for TIOCSTI usage - Test TIOCSTI security with file descriptor passing All patches have been validated using: - scripts/checkpatch.pl --strict (0 errors, 0 warnings) - Functional testing on kernel v6.16-rc2 - File descriptor passing security test scenarios The fd_passing_security test demonstrates the security concern. To verify, disable legacy TIOCSTI and run the test: $ echo "0" | sudo tee /proc/sys/dev/tty/legacy_tiocsti $ sudo ./tools/testing/selftests/tty/tty_tiocsti_test -t fd_passing_security Patch Overview ============== PATCH 1/1: selftests/tty: add TIOCSTI test suite Comprehensive test suite demonstrating the issue and fix validation References ========== - tty_ioctl(4) - documents TIOCSTI ioctl and capability requirements - commit 83efeeeb3d04 ("tty: Allow TIOCSTI to be disabled") - Documentation/security/credentials.rst - https://github.com/KSPP/linux/issues/156 - https://lore.kernel.org/linux-hardening/Y0m9l52AKmw6Yxi1@hostpad/ - drivers/tty/Kconfig Configuration References: [1] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri… [2] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri… [3] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri… Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com> Changes in v2: - Focused series on selftests only - Removed SELinux capability checking patch for separate submission - Link to v1: https://lore.kernel.org/r/20250622-toicsti-bug-v1-0-f374373b04b2@gmail.com --- tools/testing/selftests/tty/Makefile | 6 +- tools/testing/selftests/tty/config | 1 + tools/testing/selftests/tty/tty_tiocsti_test.c | 421 +++++++++++++++++++++++++ 3 files changed, 427 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/tty/Makefile b/tools/testing/selftests/tty/Makefile index 50d7027b2ae3..7f6fbe5a0cd5 100644 --- a/tools/testing/selftests/tty/Makefile +++ b/tools/testing/selftests/tty/Makefile @@ -1,5 +1,9 @@ # SPDX-License-Identifier: GPL-2.0 CFLAGS = -O2 -Wall -TEST_GEN_PROGS := tty_tstamp_update +TEST_GEN_PROGS := tty_tstamp_update tty_tiocsti_test +LDLIBS += -lcap include ../lib.mk + +# Add libcap for TIOCSTI test +$(OUTPUT)/tty_tiocsti_test: LDLIBS += -lcap diff --git a/tools/testing/selftests/tty/config b/tools/testing/selftests/tty/config new file mode 100644 index 000000000000..c6373aba6636 --- /dev/null +++ b/tools/testing/selftests/tty/config @@ -0,0 +1 @@ +CONFIG_LEGACY_TIOCSTI=y diff --git a/tools/testing/selftests/tty/tty_tiocsti_test.c b/tools/testing/selftests/tty/tty_tiocsti_test.c new file mode 100644 index 000000000000..6a4b497078b0 --- /dev/null +++ b/tools/testing/selftests/tty/tty_tiocsti_test.c @@ -0,0 +1,421 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * TTY Tests - TIOCSTI + * + * Copyright © 2025 Abhinav Saxena <xandfury(a)gmail.com> + */ + +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <fcntl.h> +#include <sys/ioctl.h> +#include <errno.h> +#include <stdbool.h> +#include <string.h> +#include <sys/socket.h> +#include <sys/wait.h> +#include <pwd.h> +#include <termios.h> +#include <grp.h> +#include <sys/capability.h> +#include <sys/prctl.h> + +#include "../kselftest_harness.h" + +/* Helper function to send FD via SCM_RIGHTS */ +static int send_fd_via_socket(int socket_fd, int fd_to_send) +{ + struct msghdr msg = { 0 }; + struct cmsghdr *cmsg; + char cmsg_buf[CMSG_SPACE(sizeof(int))]; + char dummy_data = 'F'; + struct iovec iov = { .iov_base = &dummy_data, .iov_len = 1 }; + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = cmsg_buf; + msg.msg_controllen = sizeof(cmsg_buf); + + cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_RIGHTS; + cmsg->cmsg_len = CMSG_LEN(sizeof(int)); + + memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int)); + + return sendmsg(socket_fd, &msg, 0) < 0 ? -1 : 0; +} + +/* Helper function to receive FD via SCM_RIGHTS */ +static int recv_fd_via_socket(int socket_fd) +{ + struct msghdr msg = { 0 }; + struct cmsghdr *cmsg; + char cmsg_buf[CMSG_SPACE(sizeof(int))]; + char dummy_data; + struct iovec iov = { .iov_base = &dummy_data, .iov_len = 1 }; + int received_fd = -1; + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = cmsg_buf; + msg.msg_controllen = sizeof(cmsg_buf); + + if (recvmsg(socket_fd, &msg, 0) < 0) + return -1; + + for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) { + if (cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SCM_RIGHTS) { + memcpy(&received_fd, CMSG_DATA(cmsg), sizeof(int)); + break; + } + } + + return received_fd; +} + +static inline bool has_cap_sys_admin(void) +{ + cap_t caps = cap_get_proc(); + + if (!caps) + return false; + + cap_flag_value_t cap_val; + bool has_cap = (cap_get_flag(caps, CAP_SYS_ADMIN, CAP_EFFECTIVE, + &cap_val) == 0) && + (cap_val == CAP_SET); + + cap_free(caps); + return has_cap; +} + +/* + * Simple privilege drop that just changes uid/gid in current process + * and also capabilities like CAP_SYS_ADMIN + */ +static inline bool drop_to_nobody(void) +{ + /* Drop supplementary groups */ + if (setgroups(0, NULL) != 0) { + printf("setgroups failed: %s", strerror(errno)); + return false; + } + + /* Change group to nobody */ + if (setgid(65534) != 0) { + printf("setgid failed: %s", strerror(errno)); + return false; + } + + /* Change user to nobody (this drops capabilities) */ + if (setuid(65534) != 0) { + printf("setuid failed: %s", strerror(errno)); + return false; + } + + /* Verify we no longer have CAP_SYS_ADMIN */ + if (has_cap_sys_admin()) { + printf("ERROR: Still have CAP_SYS_ADMIN after changing to nobody"); + return false; + } + + printf("Successfully changed to nobody (uid:%d gid:%d)\n", getuid(), + getgid()); + return true; +} + +static inline int get_legacy_tiocsti_setting(void) +{ + FILE *fp; + int value = -1; + + fp = fopen("/proc/sys/dev/tty/legacy_tiocsti", "r"); + if (!fp) { + if (errno == ENOENT) { + printf("legacy_tiocsti sysctl not available (kernel < 6.2)\n"); + } else { + printf("Cannot read legacy_tiocsti: %s\n", + strerror(errno)); + } + return -1; + } + + if (fscanf(fp, "%d", &value) == 1) { + printf("legacy_tiocsti setting=%d\n", value); + + if (value < 0 || value > 1) { + printf("legacy_tiocsti unexpected value %d\n", value); + value = -1; + } else { + printf("legacy_tiocsti=%d (%s mode)\n", value, + value == 0 ? "restricted" : "permissive"); + } + } else { + printf("Failed to parse legacy_tiocsti value"); + value = -1; + } + + fclose(fp); + return value; +} + +static inline int test_tiocsti_injection(int fd) +{ + int ret; + char test_char = 'X'; + + ret = ioctl(fd, TIOCSTI, &test_char); + if (ret == 0) { + /* Clear the injected character */ + printf("TIOCSTI injection succeeded\n"); + } else { + printf("TIOCSTI injection failed: %s (errno=%d)\n", + strerror(errno), errno); + } + return ret == 0 ? 0 : -1; +} + +FIXTURE(tty_tiocsti) +{ + int tty_fd; + char *tty_name; + bool has_tty; + bool initial_cap_sys_admin; + int legacy_tiocsti_setting; +}; + +FIXTURE_SETUP(tty_tiocsti) +{ + TH_LOG("Running as UID: %d with effective UID: %d", getuid(), + geteuid()); + + self->tty_fd = open("/dev/tty", O_RDWR); + self->has_tty = (self->tty_fd >= 0); + + if (self->tty_fd < 0) + TH_LOG("Cannot open /dev/tty: %s", strerror(errno)); + + self->tty_name = ttyname(STDIN_FILENO); + TH_LOG("Current TTY: %s", self->tty_name ? self->tty_name : "none"); + + self->initial_cap_sys_admin = has_cap_sys_admin(); + TH_LOG("Initial CAP_SYS_ADMIN: %s", + self->initial_cap_sys_admin ? "yes" : "no"); + + self->legacy_tiocsti_setting = get_legacy_tiocsti_setting(); +} + +FIXTURE_TEARDOWN(tty_tiocsti) +{ + if (self->has_tty && self->tty_fd >= 0) + close(self->tty_fd); +} + +/* Test case 1: legacy_tiocsti != 0 (permissive mode) */ +TEST_F(tty_tiocsti, permissive_mode) +{ + // clang-format off + if (self->legacy_tiocsti_setting < 0) + SKIP(return, + "legacy_tiocsti sysctl not available (kernel < 6.2)"); + + if (self->legacy_tiocsti_setting == 0) + SKIP(return, + "Test requires permissive mode (legacy_tiocsti=1)"); + // clang-format on + + ASSERT_TRUE(self->has_tty); + + if (self->initial_cap_sys_admin) { + ASSERT_TRUE(drop_to_nobody()); + ASSERT_FALSE(has_cap_sys_admin()); + } + + /* In permissive mode, TIOCSTI should work without CAP_SYS_ADMIN */ + EXPECT_EQ(test_tiocsti_injection(self->tty_fd), 0) + { + TH_LOG("TIOCSTI should succeed in permissive mode without CAP_SYS_ADMIN"); + } +} + +/* Test case 2: legacy_tiocsti == 0, without CAP_SYS_ADMIN (should fail) */ +TEST_F(tty_tiocsti, restricted_mode_nopriv) +{ + // clang-format off + if (self->legacy_tiocsti_setting < 0) + SKIP(return, + "legacy_tiocsti sysctl not available (kernel < 6.2)"); + + if (self->legacy_tiocsti_setting != 0) + SKIP(return, + "Test requires restricted mode (legacy_tiocsti=0)"); + // clang-format on + + ASSERT_TRUE(self->has_tty); + + if (self->initial_cap_sys_admin) { + ASSERT_TRUE(drop_to_nobody()); + ASSERT_FALSE(has_cap_sys_admin()); + } + /* In restricted mode, TIOCSTI should fail without CAP_SYS_ADMIN */ + EXPECT_EQ(test_tiocsti_injection(self->tty_fd), -1); + + /* + * it might fail with either EPERM or EIO + * EXPECT_TRUE(errno == EPERM || errno == EIO) + * { + * TH_LOG("Expected EPERM, got: %s", strerror(errno)); + * } + */ +} + +/* Test case 3: legacy_tiocsti == 0, with CAP_SYS_ADMIN (should succeed) */ +TEST_F(tty_tiocsti, restricted_mode_priv) +{ + // clang-format off + if (self->legacy_tiocsti_setting < 0) + SKIP(return, + "legacy_tiocsti sysctl not available (kernel < 6.2)"); + + if (self->legacy_tiocsti_setting != 0) + SKIP(return, + "Test requires restricted mode (legacy_tiocsti=0)"); + // clang-format on + + /* Must have CAP_SYS_ADMIN for this test */ + if (!self->initial_cap_sys_admin) + SKIP(return, "Test requires CAP_SYS_ADMIN"); + + ASSERT_TRUE(self->has_tty); + ASSERT_TRUE(has_cap_sys_admin()); + + /* In restricted mode, TIOCSTI should succeed with CAP_SYS_ADMIN */ + EXPECT_EQ(test_tiocsti_injection(self->tty_fd), 0) + { + TH_LOG("TIOCSTI should succeed in restricted mode with CAP_SYS_ADMIN"); + } +} + +/* Test TIOCSTI security with file descriptor passing */ +TEST_F(tty_tiocsti, fd_passing_security) +{ + // clang-format off + if (self->legacy_tiocsti_setting < 0) + SKIP(return, + "legacy_tiocsti sysctl not available (kernel < 6.2)"); + + if (self->legacy_tiocsti_setting != 0) + SKIP(return, + "Test requires restricted mode (legacy_tiocsti=0)"); + // clang-format on + + /* Must start with CAP_SYS_ADMIN */ + if (!self->initial_cap_sys_admin) + SKIP(return, "Test requires initial CAP_SYS_ADMIN"); + + int sockpair[2]; + pid_t child_pid; + + ASSERT_EQ(socketpair(AF_UNIX, SOCK_STREAM, 0, sockpair), 0); + + child_pid = fork(); + ASSERT_GE(child_pid, 0) + TH_LOG("Fork failed: %s", strerror(errno)); + + if (child_pid == 0) { + /* Child process - become unprivileged, open TTY, send FD to parent */ + close(sockpair[0]); + + TH_LOG("Child: Dropping privileges..."); + + /* Drop to nobody user (loses all capabilities) */ + drop_to_nobody(); + + /* Verify we no longer have CAP_SYS_ADMIN */ + if (has_cap_sys_admin()) { + TH_LOG("Child: Failed to drop CAP_SYS_ADMIN"); + _exit(1); + } + + TH_LOG("Child: Opening TTY as unprivileged user..."); + + int unprivileged_tty_fd = open("/dev/tty", O_RDWR); + + if (unprivileged_tty_fd < 0) { + TH_LOG("Child: Cannot open TTY: %s", strerror(errno)); + _exit(1); + } + + /* Test that we can't use TIOCSTI directly (should fail) */ + + char test_char = 'X'; + + if (ioctl(unprivileged_tty_fd, TIOCSTI, &test_char) == 0) { + TH_LOG("Child: ERROR - Direct TIOCSTI succeeded unexpectedly!"); + close(unprivileged_tty_fd); + _exit(1); + } + TH_LOG("Child: Good - Direct TIOCSTI failed as expected: %s", + strerror(errno)); + + /* Send the TTY FD to privileged parent via SCM_RIGHTS */ + TH_LOG("Child: Sending TTY FD to privileged parent..."); + if (send_fd_via_socket(sockpair[1], unprivileged_tty_fd) != 0) { + TH_LOG("Child: Failed to send FD"); + close(unprivileged_tty_fd); + _exit(1); + } + + close(unprivileged_tty_fd); + close(sockpair[1]); + _exit(0); /* Child success */ + + } else { + /* Parent process - keep CAP_SYS_ADMIN, receive FD, test TIOCSTI */ + close(sockpair[1]); + + TH_LOG("Parent: Waiting for TTY FD from unprivileged child..."); + + /* Verify we still have CAP_SYS_ADMIN */ + ASSERT_TRUE(has_cap_sys_admin()); + + /* Receive the TTY FD from unprivileged child */ + int received_fd = recv_fd_via_socket(sockpair[0]); + + ASSERT_GE(received_fd, 0) + TH_LOG("Parent: Received FD %d (opened by unprivileged process)", + received_fd); + + /* + * VULNERABILITY TEST: Try TIOCSTI with FD opened by unprivileged process + * This should FAIL even though parent has CAP_SYS_ADMIN + * because the FD was opened by unprivileged process + */ + char attack_char = 'V'; /* V for Vulnerability */ + int ret = ioctl(received_fd, TIOCSTI, &attack_char); + + TH_LOG("Parent: Testing TIOCSTI on FD from unprivileged process..."); + if (ret == 0) { + TH_LOG("*** VULNERABILITY DETECTED ***"); + TH_LOG("Privileged process can use TIOCSTI on unprivileged FD"); + } else { + TH_LOG("TIOCSTI failed on unprivileged FD: %s", + strerror(errno)); + EXPECT_EQ(errno, EPERM); + } + close(received_fd); + close(sockpair[0]); + + /* Wait for child */ + int status; + + ASSERT_EQ(waitpid(child_pid, &status, 0), child_pid); + EXPECT_EQ(WEXITSTATUS(status), 0); + ASSERT_NE(ret, 0); + } +} + +TEST_HARNESS_MAIN --- base-commit: 40f92e79b0aabbf3575e371f9054657a421a3e79 change-id: 20250618-toicsti-bug-7822b8e94a32 Best regards, -- Abhinav Saxena <xandfury(a)gmail.com>

18 hours, 11 minutes

1
0
0 0

[PATCH net-next V3 0/4] selftests: drv-net: Test XDP native support

by Mohsin Bashir

This patch series add tests to validate XDP native support for PASS, DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment. For adjustment tests, validate support for both the extension and shrinking cases across various packet sizes and offset values. The pass criteria for head/tail adjustment tests require that at-least one adjustment value works for at-least one packet size. This ensure that the variability in maximum supported head/tail adjustment offset across different drivers is being incorporated. The results reported in this series are based on fbnic. However, the series is tested against multiple other drivers including netdevism. Note: The XDP support for fbnic will be added later. --- Change-log: V3: P1: Remove exception handling upon xdp attachment failure for better debuggability P4: Handle the bound issue with the use of bpf_xdp_load_bytes() V2: https://lore.kernel.org/netdev/20250710184351.63797-1-mohsin.bashr@gmail.com V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.… Mohsin Bashir (4): selftests: drv-net: Test XDP_PASS/DROP support selftests: drv-net: Test XDP_TX support selftests: drv-net: Test tail-adjustment support selftests: drv-net: Test head-adjustment support tools/testing/selftests/drivers/net/Makefile | 1 + tools/testing/selftests/drivers/net/xdp.py | 656 ++++++++++++++++++ .../selftests/net/lib/xdp_native.bpf.c | 538 ++++++++++++++ 3 files changed, 1195 insertions(+) create mode 100755 tools/testing/selftests/drivers/net/xdp.py create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c -- 2.47.1

23 hours, 39 minutes

3
10
0 0

[PATCH V9 0/7] Add NUMA mempolicy support for KVM guest-memfd

by Shivank Garg

This series introduces NUMA-aware memory placement support for KVM guests with guest_memfd memory backends. It builds upon Fuad Tabba's work that enabled host-mapping for guest_memfd memory [1]. == Background == KVM's guest-memfd memory backend currently lacks support for NUMA policy enforcement, causing guest memory allocations to be distributed across host nodes according to kernel's default behavior, irrespective of any policy specified by the VMM. This limitation arises because conventional userspace NUMA control mechanisms like mbind(2) don't work since the memory isn't directly mapped to userspace when allocations occur. Fuad's work [1] provides the necessary mmap capability, and this series leverages it to enable mbind(2). == Implementation == This series implements proper NUMA policy support for guest-memfd by: 1. Adding mempolicy-aware allocation APIs to the filemap layer. 2. Introducing custom inodes (via a dedicated slab-allocated inode cache, kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory. 3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA policy. With these changes, VMMs can now control guest memory placement by mapping guest_memfd file descriptor and using mbind(2) to specify: - Policy modes: default, bind, interleave, or preferred - Host NUMA nodes: List of target nodes for memory allocation These Policies affect only future allocations and do not migrate existing memory. This matches mbind(2)'s default behavior which affects only new allocations unless overridden with MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags (Not supported for guest_memfd as it is unmovable by design). == Upstream Plan == Phased approach as per David's guest_memfd extension overview [2] and community calls [3]: Phase 1 (this series): 1. Focuses on shared guest_memfd support (non-CoCo VMs). 2. Builds on Fuad's host-mapping work. Phase2 (future work): 1. NUMA support for private guest_memfd (CoCo VMs). 2. Depends on SNP in-place conversion support [4]. This series provides a clean integration path for NUMA-aware memory management for guest_memfd and lays the groundwork for future confidential computing NUMA capabilities. Please review and provide feedback! Thanks, Shivank == Changelog == - v1,v2: Extended the KVM_CREATE_GUEST_MEMFD IOCTL to pass mempolicy. - v3: Introduced fbind() syscall for VMM memory-placement configuration. - v4-v6: Current approach using shared_policy support and vm_ops (based on suggestions from David [5] and guest_memfd bi-weekly upstream call discussion [6]). - v7: Use inodes to store NUMA policy instead of file [7]. - v8: Rebase on top of Fuad's V12: Host mmaping for guest_memfd memory. - v9: Rebase on top of Fuad's V13 and incorporate review comments [1] https://lore.kernel.org/all/20250709105946.4009897-1-tabba@google.com [2] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com [3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo… [4] https://lore.kernel.org/all/20250613005400.3694904-1-michael.roth@amd.com [5] https://lore.kernel.org/all/6fbef654-36e2-4be5-906e-2a648a845278@redhat.com [6] https://lore.kernel.org/all/2b77e055-98ac-43a1-a7ad-9f9065d7f38f@amd.com [7] https://lore.kernel.org/all/diqzbjumm167.fsf@ackerleytng-ctop.c.googlers.com Ackerley Tng (1): KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes Matthew Wilcox (Oracle) (2): mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio() mm/filemap: Extend __filemap_get_folio() to support NUMA memory policies Shivank Garg (4): mm/mempolicy: Export memory policy symbols KVM: guest_memfd: Add slab-allocated inode cache KVM: guest_memfd: Enforce NUMA mempolicy using shared policy KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy support fs/bcachefs/fs-io-buffered.c | 2 +- fs/btrfs/compression.c | 4 +- fs/btrfs/verity.c | 2 +- fs/erofs/zdata.c | 2 +- fs/f2fs/compress.c | 2 +- include/linux/pagemap.h | 18 +- include/uapi/linux/magic.h | 1 + mm/filemap.c | 23 +- mm/mempolicy.c | 6 + mm/readahead.c | 2 +- tools/testing/selftests/kvm/Makefile.kvm | 1 + .../testing/selftests/kvm/guest_memfd_test.c | 122 ++++++++- virt/kvm/guest_memfd.c | 255 ++++++++++++++++-- virt/kvm/kvm_main.c | 7 +- virt/kvm/kvm_mm.h | 10 +- 15 files changed, 408 insertions(+), 49 deletions(-) -- 2.43.0 --- == Earlier Postings == v8: https://lore.kernel.org/all/20250618112935.7629-1-shivankg@amd.com v7: https://lore.kernel.org/all/20250408112402.181574-1-shivankg@amd.com v6: https://lore.kernel.org/all/20250226082549.6034-1-shivankg@amd.com v5: https://lore.kernel.org/all/20250219101559.414878-1-shivankg@amd.com v4: https://lore.kernel.org/all/20250210063227.41125-1-shivankg@amd.com v3: https://lore.kernel.org/all/20241105164549.154700-1-shivankg@amd.com v2: https://lore.kernel.org/all/20240919094438.10987-1-shivankg@amd.com v1: https://lore.kernel.org/all/20240916165743.201087-1-shivankg@amd.com

1 day, 1 hour

1
7
0 0

[PATCH 0/2] tools/nolibc: add x32 support

by Thomas Weißschuh

Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net> --- Thomas Weißschuh (2): tools/nolibc: define time_t in terms of __kernel_old_time_t selftests/nolibc: add x32 test configuration tools/include/nolibc/std.h | 4 +++- tools/testing/selftests/nolibc/Makefile.nolibc | 12 ++++++++++++ tools/testing/selftests/nolibc/run-tests.sh | 7 ++++++- 3 files changed, 21 insertions(+), 2 deletions(-) --- base-commit: 750aef513c610a37a13732ec64902428b839715e change-id: 20250712-nolibc-x32-796a9da4c9ed Best regards, -- Thomas Weißschuh <linux(a)weissschuh.net>

1 day, 5 hours

2
3
0 0

[PATCH v23 net-next 0/6] DUALPI2 patch

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com> Hello, Please find the DualPI2 patch v23. This patch serise adds DualPI Improved with a Square (DualPI2) with following features: * Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague) * Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332 * Configurable overload strategies * Use of sojourn time to reliably estimate queue delay * Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332). Best regards, Chia-Yu --- v23 (13-Jul-2025) and v22 (11-Jul-2025) - Fix issue when user would like to change DualPI2 but provides an empty TCA_OPTIONS with no nested attributes (Paolo Abeni <pabeni(a)redhat.com>, Jakub Kicinski <kuba(a)kernel.org>) v21 (02-Jul-2025) - Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>) - Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>) - Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>) - Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>) - Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>) - Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub Kicinski <kuba(a)kernel.org>) - Remove redundant name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>) - Fix some typos and format issues of dualpi2 attributes v20 (21-Jun-2025) - Add one more commit to fix warning and style check on tdc.sh reported by shellcheck - Remove double-prefixed of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>) v19 (14-Jun-2025) - Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>) v18 (13-Jun-2025) - Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute - Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>) v17 (25-May-2025, Resent at 11-Jun-2025) - Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>) - Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>) - Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>) - Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>) - Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>) - Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>) - Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>) v16 (16-MAy-2025) - Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>) - Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>) - Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>) - Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>) v15 (09-May-2025) - Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>) - Fix one typo in comment of #2 - Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h v14 (05-May-2025) - Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun - Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>) - Update dualpi2.json to align with the updated variable order in pkt_sched.h - Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>) v13 (26-Apr-2025) - Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>) - Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>) v12 (22-Apr-2025) - Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>) - Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>) - Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>) - Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>) v11 (15-Apr-2025) - Replace hstimer_init with hstimer_setup in sch_dualpi2.c v10 (25-Mar-2025) - Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>) - Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>) v9 (16-Mar-2025) - Fix mem_usage error in previous version - Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking. In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets. This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0. Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 11.55 11.70 ms 350 TCP upload avg : 18.96 N/A Mbits/s 350 TCP upload sum : 18.96 N/A Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 10.81 10.70 ms 350 TCP upload avg : 18.91 N/A Mbits/s 350 TCP upload sum : 18.91 N/A Mbits/s 350 Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 12.61 12.80 ms 350 TCP upload avg : 9.48 N/A Mbits/s 350 TCP upload sum : 9.48 N/A Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 11.06 10.80 ms 350 TCP upload avg : 9.43 N/A Mbits/s 350 TCP upload sum : 9.43 N/A Mbits/s 350 Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 40.86 37.45 ms 350 TCP upload avg : 0.88 N/A Mbits/s 350 TCP upload sum : 0.88 N/A Mbits/s 350 TCP upload::1 : 0.88 0.97 Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 11.07 10.40 ms 350 TCP upload avg : 0.55 N/A Mbits/s 350 TCP upload sum : 0.55 N/A Mbits/s 350 TCP upload::1 : 0.55 0.59 Mbits/s 350 v8 (11-Mar-2025) - Fix warning messages in v7 v7 (07-Mar-2025) - Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>) v6 (04-Mar-2025) - Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>) - Update test cases in dualpi2.json - Update commit message v5 (22-Feb-2025) - A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL: Unshaped 1gigE with 4 download streams test: - Summary of tcp_4down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 1.19 1.34 ms 349 TCP download avg : 235.42 N/A Mbits/s 349 TCP download sum : 941.68 N/A Mbits/s 349 TCP download::1 : 235.19 235.39 Mbits/s 349 TCP download::2 : 235.03 235.35 Mbits/s 349 TCP download::3 : 236.89 235.44 Mbits/s 349 TCP download::4 : 234.57 235.19 Mbits/s 349 - Summary of tcp_4down run 'MQ + FQ_PIE' avg median # data pts Ping (ms) ICMP : 1.21 1.37 ms 350 TCP download avg : 235.42 N/A Mbits/s 350 TCP download sum : 941.61 N/A Mbits/s 350 TCP download::1 : 232.54 233.13 Mbits/s 350 TCP download::2 : 232.52 232.80 Mbits/s 350 TCP download::3 : 233.14 233.78 Mbits/s 350 TCP download::4 : 243.41 241.48 Mbits/s 350 - Summary of tcp_4down run 'MQ + DUALPI2' avg median # data pts Ping (ms) ICMP : 1.19 1.34 ms 349 TCP download avg : 235.42 N/A Mbits/s 349 TCP download sum : 941.68 N/A Mbits/s 349 TCP download::1 : 235.19 235.39 Mbits/s 349 TCP download::2 : 235.03 235.35 Mbits/s 349 TCP download::3 : 236.89 235.44 Mbits/s 349 TCP download::4 : 234.57 235.19 Mbits/s 349 Unshaped 1gigE with 128 download streams test: - Summary of tcp_128down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 Unshaped 10gigE with 4 download streams test: - Summary of tcp_4down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 0.22 0.23 ms 350 TCP download avg : 2354.08 N/A Mbits/s 350 TCP download sum : 9416.31 N/A Mbits/s 350 TCP download::1 : 2353.65 2352.81 Mbits/s 350 TCP download::2 : 2354.54 2354.21 Mbits/s 350 TCP download::3 : 2353.56 2353.78 Mbits/s 350 TCP download::4 : 2354.56 2354.45 Mbits/s 350 - Summary of tcp_4down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 0.20 0.19 ms 350 TCP download avg : 2354.76 N/A Mbits/s 350 TCP download sum : 9419.04 N/A Mbits/s 350 TCP download::1 : 2354.77 2353.89 Mbits/s 350 TCP download::2 : 2353.41 2354.29 Mbits/s 350 TCP download::3 : 2356.18 2354.19 Mbits/s 350 TCP download::4 : 2354.68 2353.15 Mbits/s 350 - Summary of tcp_4down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 0.24 0.24 ms 350 TCP download avg : 2354.11 N/A Mbits/s 350 TCP download sum : 9416.43 N/A Mbits/s 350 TCP download::1 : 2354.75 2353.93 Mbits/s 350 TCP download::2 : 2353.15 2353.75 Mbits/s 350 TCP download::3 : 2353.49 2353.72 Mbits/s 350 TCP download::4 : 2355.04 2353.73 Mbits/s 350 Unshaped 10gigE with 128 download streams test: - Summary of tcp_128down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 7.57 8.69 ms 350 TCP download avg : 73.97 N/A Mbits/s 350 TCP download sum : 9467.82 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 7.82 8.91 ms 350 TCP download avg : 73.97 N/A Mbits/s 350 TCP download sum : 9468.42 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 6.87 7.93 ms 350 TCP download avg : 73.95 N/A Mbits/s 350 TCP download sum : 9465.87 N/A Mbits/s 350 From the results shown above, we see small differences between combinations. - Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>) - Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>) - Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>) - Update license identifier (Dave Taht <dave.taht(a)gmail.com>) - Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>) - Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>) - Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c - Fix step_thresh in packets - Update code comments in sch_dualpi2.c v4 (22-Oct-2024) - Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>) - Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>) - Fix line length warning. v3 (19-Oct-2024) - Fix compilaiton error - Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>) v2 (18-Oct-2024) - Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>) - Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Fix line length warning --- Chia-Yu Chang (5): sched: Struct definition and parsing of dualpi2 qdisc sched: Dump configuration and statistics of dualpi2 qdisc selftests/tc-testing: Fix warning and style check on tdc.sh selftests/tc-testing: Add selftests for qdisc DualPI2 Documentation: netlink: specs: tc: Add DualPI2 specification Koen De Schepper (1): sched: Add enqueue/dequeue of dualpi2 qdisc Documentation/netlink/specs/tc.yaml | 151 ++- include/net/dropreason-core.h | 6 + include/uapi/linux/pkt_sched.h | 68 + net/sched/Kconfig | 12 + net/sched/Makefile | 1 + net/sched/sch_dualpi2.c | 1171 +++++++++++++++++ tools/testing/selftests/tc-testing/config | 1 + .../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++ tools/testing/selftests/tc-testing/tdc.sh | 6 +- 9 files changed, 1665 insertions(+), 5 deletions(-) create mode 100644 net/sched/sch_dualpi2.c create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json -- 2.34.1

1 day, 7 hours

1
6
0 0

[PATCH v2 0/4] HID: core: fix __hid_request when no report ID is used

by Benjamin Tissoires

New version (unchanged for patches 1-3), with a test added so we can detect this. Followup of https://lore.kernel.org/linux-input/c75433e0-9b47-4072-bbe8-b1d14ea97b13@ro… This initial series attempt at fixing the various bugs discovered by Alan regarding __hid_request(). Syzbot managed to create a report descriptor which presents a feature request of size 0 (still trying to extract it) and this exposed the fact that __hid_request() was incorrectly handling the case when the report ID is not used. Send a first batch of fixes now so we get the feedback from syzbot ASAP. Note: in the series, I also mentioned that the report of size 0 should be stripped out of the HID device, but I'm not entirely sure this would be a good idea in the end. Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org> --- Changes in v2: - added Tested-by from syzbot (https://lore.kernel.org/r/686e9113.050a0220.385921.0008.GAE@google.com) - added a python test for it - Link to v1: https://lore.kernel.org/r/20250709-report-size-null-v1-0-194912215cbc@kerne… --- Benjamin Tissoires (4): HID: core: ensure the allocated report buffer can contain the reserved report ID HID: core: ensure __hid_request reserves the report ID as the first byte HID: core: do not bypass hid_hw_raw_request selftests/hid: add a test case for the recent syzbot underflow drivers/hid/hid-core.c | 19 +++++-- tools/testing/selftests/hid/tests/test_mouse.py | 70 +++++++++++++++++++++++++ 2 files changed, 84 insertions(+), 5 deletions(-) --- base-commit: 1f988d0788f50d8464f957e793fab356e2937369 change-id: 20250709-report-size-null-37619ea20288 Best regards, -- Benjamin Tissoires <bentiss(a)kernel.org>

1 day, 10 hours

1
7
0 0

[PATCH v2 0/6] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

by Jiaqi Yan

Problem ======= When host APEI is unable to claim synchronous external abort (SEA) during stage-2 guest abort, today KVM directly injects an async SError into the VCPU then resumes it. The injected SError usually results in unpleasant guest kernel panic. One of the major situation of guest SEA is when VCPU consumes recoverable uncorrected memory error (UER), which is not uncommon at all in modern datacenter servers with large amounts of physical memory. Although SError and guest panic is sufficient to stop the propagation of corrupted memory there is room to recover from an UER in a more graceful manner. Proposed Solution ================= Alternatively KVM can replay the SEA to the faulting VCPU, via existing KVM_SET_VCPU_EVENTS API. If the memory poison consumption or the fault that cause SEA is not from guest kernel, the blast radius can be limited to the consuming or faulting guest userspace process, so the VM can keep running. In addition, instead of doing under the hood without involving userspace, there are benefits to redirect the SEA to VMM: - VM customers care about the disruptions caused by memory errors, and VMM usually has the responsibility to start the process of notifying the customers of memory error events in their VMs. For example some cloud provider emits a critical log in their observability UI [1], and provides playbook for customers on how to mitigate disruptions to their workloads. - VMM can protect future memory error consumption by unmapping the poisoned pages from stage-2 page table with KVM userfault, or by splitting the memslot that contains the poisoned guest pages [2]. - VMM can keep track of SEA events in the VM. When VMM thinks the status on the host or the VM is bad enough, e.g. number of distinct SEAs exceeds a threshold, it can restart the VM on another healthy host. - Behavior parity with x86 architecture. When machine check exception (MCE) is caused by VCPU, kernel or KVM signals userspace SIGBUS to let VMM either recover from the MCE, or terminate itself with VM. The prior RFC proposes to implement SIGBUS on arm64 as well, but Marc preferred VCPU exit over signal [3]. However, implementation aside, returning SEA to VMM is on par with returning MCE to VMM. Once SEA is redirected to VMM, among other actions, VMM is encouraged to inject external aborts into the faulting VCPU, which is already supported by KVM on arm64. We notice injecting instruction abort is not fully supported by KVM_SET_VCPU_EVENTS. Complement it in the patchset. New UAPIs ========= This patchset introduces following userspace-visiable changes to empower VMM to control what happens next for SEA on guest memory: - KVM_CAP_ARM_SEA_TO_USER. While taking SEA, if userspace has enabled this new capability at VM creation, and the SEA is not caused by memory allocated for stage-2 translation table, instead of injecting SError, return KVM_EXIT_ARM_SEA to userspace. - KVM_EXIT_ARM_SEA. This is the VM exit reason VMM gets. The details about the SEA is provided in arm_sea as much as possible, including sanitized ESR value at EL2, if guest virtual and physical addresses (GPA and GVA) are available and the values if available. - KVM_CAP_ARM_INJECT_EXT_IABT. VMM today can inject external data abort to VCPU via KVM_SET_VCPU_EVENTS API. However, in case of instruction abort, VMM cannot inject it via KVM_SET_VCPU_EVENTS. KVM_CAP_ARM_INJECT_EXT_IABT is just a natural extend to KVM_CAP_ARM_INJECT_EXT_DABT that tells VMM KVM_SET_VCPU_EVENTS now supports external instruction abort. * From v1 [4]: - Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer"). - Sanitize ESR_EL2 before reporting it to userspace. - Do not do KVM_EXIT_ARM_SEA when SEA is caused by memory allocated to stage-2 translation table. [1] https://cloud.google.com/solutions/sap/docs/manage-host-errors [2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com [3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org [4] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com Jiaqi Yan (5): KVM: arm64: VM exit to userspace to handle SEA KVM: arm64: Set FnV for VCPU when FAR_EL2 is invalid KVM: selftests: Test for KVM_EXIT_ARM_SEA and KVM_CAP_ARM_SEA_TO_USER KVM: selftests: Test for KVM_CAP_INJECT_EXT_IABT Documentation: kvm: new uAPI for handling SEA Raghavendra Rao Ananta (1): KVM: arm64: Allow userspace to inject external instruction aborts Documentation/virt/kvm/api.rst | 128 ++++++- arch/arm64/include/asm/kvm_emulate.h | 67 ++++ arch/arm64/include/asm/kvm_host.h | 8 + arch/arm64/include/asm/kvm_ras.h | 2 +- arch/arm64/include/uapi/asm/kvm.h | 3 +- arch/arm64/kvm/arm.c | 6 + arch/arm64/kvm/guest.c | 13 +- arch/arm64/kvm/inject_fault.c | 3 + arch/arm64/kvm/mmu.c | 59 ++- include/uapi/linux/kvm.h | 12 + tools/arch/arm64/include/asm/esr.h | 2 + tools/arch/arm64/include/uapi/asm/kvm.h | 3 +- tools/testing/selftests/kvm/Makefile.kvm | 2 + .../testing/selftests/kvm/arm64/inject_iabt.c | 98 +++++ .../testing/selftests/kvm/arm64/sea_to_user.c | 340 ++++++++++++++++++ tools/testing/selftests/kvm/lib/kvm_util.c | 1 + 16 files changed, 718 insertions(+), 29 deletions(-) create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c -- 2.49.0.1266.g31b7d2e469-goog

1 day, 16 hours

2
16
0 0

[PATCH net-next v3 0/3] netdevsim: add support for PHY devices

by Maxime Chevallier

Hi everyone, Here's a V3 for the netdevsim PHY support. This V3 includes : - A fix for a compiling issue with PHYLIB=n - An updated KConfig to only allow PHYLIB=y|n - Converted the link setting file to a bool debugfs file, relying on link state polling The idea of this series is to allow attaching virtual PHY devices to netdevsim, so that we can test PHY-related ethtool commands. This can be extended in the future for phylib testing as well. V2: https://lore.kernel.org/netdev/20250708115531.111326-1-maxime.chevallier@bo… - Fix building with PHYLIB=m - Use shellcheck on the shell scripts V1: https://lore.kernel.org/netdev/20250702082806.706973-1-maxime.chevallier@bo… Maxime Chevallier (3): net: netdevsim: Add PHY support in netdevsim selftests: ethtool: Drop the unused old_netdevs variable selftests: ethtool: Introduce ethernet PHY selftests on netdevsim drivers/net/Kconfig | 1 + drivers/net/netdevsim/Makefile | 4 + drivers/net/netdevsim/dev.c | 2 + drivers/net/netdevsim/netdev.c | 8 + drivers/net/netdevsim/netdevsim.h | 25 ++ drivers/net/netdevsim/phy.c | 375 ++++++++++++++++++ .../selftests/drivers/net/netdevsim/config | 1 + .../drivers/net/netdevsim/ethtool-common.sh | 19 +- .../drivers/net/netdevsim/ethtool-phy.sh | 64 +++ 9 files changed, 496 insertions(+), 3 deletions(-) create mode 100644 drivers/net/netdevsim/phy.c create mode 100755 tools/testing/selftests/drivers/net/netdevsim/ethtool-phy.sh -- 2.49.0

2 days, 1 hour

3
6
0 0

[PATCH v4 00/15] kunit: Introduce UAPI testing framework

by Thomas Weißschuh

Currently testing of userspace and in-kernel API use two different frameworks. kselftests for the userspace ones and Kunit for the in-kernel ones. Besides their different scopes, both have different strengths and limitations: Kunit: * Tests are normal kernel code. * They use the regular kernel toolchain. * They can be packaged and distributed as modules conveniently. Kselftests: * Tests are normal userspace code * They need a userspace toolchain. A kernel cross toolchain is likely not enough. * A fair amout of userland is required to run the tests, which means a full distro or handcrafted rootfs. * There is no way to conveniently package and run kselftests with a given kernel image. * The kselftests makefiles are not as powerful as regular kbuild. For example they are missing proper header dependency tracking or more complex compiler option modifications. Therefore kunit is much easier to run against different kernel configurations and architectures. This series aims to combine kselftests and kunit, avoiding both their limitations. It works by compiling the userspace kselftests as part of the regular kernel build, embedding them into the kunit kernel or module and executing them from there. If the kernel toolchain is not fit to produce userspace because of a missing libc, the kernel's own nolibc can be used instead. The structured TAP output from the kselftest is integrated into the kunit KTAP output transparently, the kunit parser can parse the combined logs together. Further room for improvements: * Call each test in its completely dedicated namespace * Handle additional test files besides the test executable through archives. CPIO, cramfs, etc. * Compatibility with kselftest_harness.h (in progress) * Expose the blobs in debugfs * Provide some convience wrappers around compat userprogs * Figure out a migration path/coexistence solution for kunit UAPI and tools/testing/selftests/ Output from the kunit example testcase, note the output of "example_uapi_tests". $ ./tools/testing/kunit/kunit.py run --kunitconfig lib/kunit example ... Running tests with: $ .kunit/linux kunit.filter_glob=example kunit.enable=1 mem=1G console=tty kunit_shutdown=halt [11:53:53] ================== example (10 subtests) =================== [11:53:53] [PASSED] example_simple_test [11:53:53] [SKIPPED] example_skip_test [11:53:53] [SKIPPED] example_mark_skipped_test [11:53:53] [PASSED] example_all_expect_macros_test [11:53:53] [PASSED] example_static_stub_test [11:53:53] [PASSED] example_static_stub_using_fn_ptr_test [11:53:53] [PASSED] example_priv_test [11:53:53] =================== example_params_test =================== [11:53:53] [SKIPPED] example value 3 [11:53:53] [PASSED] example value 2 [11:53:53] [PASSED] example value 1 [11:53:53] [SKIPPED] example value 0 [11:53:53] =============== [PASSED] example_params_test =============== [11:53:53] [PASSED] example_slow_test [11:53:53] ======================= (4 subtests) ======================= [11:53:53] [PASSED] procfs [11:53:53] [PASSED] userspace test 2 [11:53:53] [SKIPPED] userspace test 3: some reason [11:53:53] [PASSED] userspace test 4 [11:53:53] ================ [PASSED] example_uapi_test ================ [11:53:53] ===================== [PASSED] example ===================== [11:53:53] ============================================================ [11:53:53] Testing complete. Ran 16 tests: passed: 11, skipped: 5 [11:53:53] Elapsed time: 67.543s total, 1.823s configuring, 65.655s building, 0.058s running Based on v6.16-rc1. Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de> --- Changes in v4: - Move Kconfig.nolibc from tools/ to init/ - Drop generic userprogs nolibc integration - Drop generic blob framework - Pick up review tags from David - Extend new kunit TAP parser tests - Add MAINTAINERS entry - Allow CONFIG_KUNIT_UAPI=m - Split /proc validation into dedicated UAPI test - Trim recipient list a bit - Use KUNIT_FAIL_AND_ABORT() over KUNIT_FAIL() - Link to v3: https://lore.kernel.org/r/20250611-kunit-kselftests-v3-0-55e3d148cbc6@linut… Changes in v3: - Reintroduce CONFIG_CC_CAN_LINK_STATIC - Enable CONFIG_ARCH_HAS_NOLIBC for m68k and SPARC - Properly handle 'clean' target for userprogs - Use ramfs over tmpfs to reduce dependencies - Inherit userprogs byte order and ABI from kernel - Drop now unnecessary "#ifndef NOLIBC" - Pick up review tags - Drop usage of __private in blob.h, sparse complains and it is not really necessary - Fix execution on loongarch when using clang - Drop userprogs libgcc handling, it was ugly and is not yet necessary - Link to v2: https://lore.kernel.org/r/20250407-kunit-kselftests-v2-0-454114e287fd@linut… Changes in v2: - Rebase onto v6.15-rc1 - Add documentation and kernel docs - Resolve invalid kconfig breakages - Drop already applied patch "kbuild: implement CONFIG_HEADERS_INSTALL for Usermode Linux" - Drop userprogs CONFIG_WERROR integration, it doesn't need to be part of this series - Replace patch prefix "kconfig" with "kbuild" - Rename kunit_uapi_run_executable() to kunit_uapi_run_kselftest() - Generate private, conflict-free symbols in the blob framework - Handle kselftest exit codes - Handle SIGABRT - Forward output also to kunit debugfs log - Install a fd=0 stdin filedescriptor - Link to v1: https://lore.kernel.org/r/20250217-kunit-kselftests-v1-0-42b4524c3b0a@linut… --- Thomas Weißschuh (15): kbuild: userprogs: avoid duplication of flags inherited from kernel kbuild: userprogs: also inherit byte order and ABI from kernel kbuild: doc: add label for userprogs section init: re-add CONFIG_CC_CAN_LINK_STATIC init: add nolibc build support fs,fork,exit: export symbols necessary for KUnit UAPI support kunit: tool: Add test for nested test result reporting kunit: tool: Don't overwrite test status based on subtest counts kunit: tool: Parse skipped tests from kselftest.h kunit: Always descend into kunit directory during build kunit: qemu_configs: loongarch: Enable LSX/LSAX kunit: Introduce UAPI testing framework kunit: uapi: Add example for UAPI tests kunit: uapi: Introduce preinit executable kunit: uapi: Validate usability of /proc Documentation/dev-tools/kunit/api/index.rst | 5 + Documentation/dev-tools/kunit/api/uapi.rst | 14 + Documentation/kbuild/makefiles.rst | 2 + MAINTAINERS | 11 + Makefile | 7 +- fs/exec.c | 2 + fs/file.c | 1 + fs/filesystems.c | 2 + fs/fs_struct.c | 1 + fs/pipe.c | 2 + include/kunit/uapi.h | 77 ++++++ init/Kconfig | 7 + init/Kconfig.nolibc | 15 + init/Makefile.nolibc | 13 + kernel/exit.c | 3 + kernel/fork.c | 2 + lib/Makefile | 4 - lib/kunit/Kconfig | 14 + lib/kunit/Makefile | 27 +- lib/kunit/kunit-example-test.c | 15 + lib/kunit/kunit-example-uapi.c | 22 ++ lib/kunit/kunit-test-uapi.c | 51 ++++ lib/kunit/kunit-test.c | 23 +- lib/kunit/kunit-uapi.c | 305 +++++++++++++++++++++ lib/kunit/uapi-preinit.c | 63 +++++ tools/testing/kunit/kunit_parser.py | 13 +- tools/testing/kunit/kunit_tool_test.py | 11 + tools/testing/kunit/qemu_configs/loongarch.py | 2 + .../test_is_test_passed-failure-nested.log | 10 + .../test_data/test_is_test_passed-kselftest.log | 3 +- 30 files changed, 714 insertions(+), 13 deletions(-) --- base-commit: 9d5898b413d17510b2a41664a42390a2c79f8bf4 change-id: 20241015-kunit-kselftests-56273bc40442 Best regards, -- Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>

2 days, 9 hours

7
37
0 0

[PATCH net-next v5 0/3] selftest: net: Add selftest for netpoll

by Breno Leitao

I am submitting a new selftest for the netpoll subsystem specifically targeting the case where the RX is polling in the TX path, which is a case that we don't have any test in the tree today. This is done when netpoll_poll_dev() called, and this test creates a scenario when that is probably. The test does the following: 1) Configuring a single RX/TX queue to increase contention on the interface. 2) Generating background traffic to saturate the network, mimicking real-world congestion. 3) Sending netconsole messages to trigger netpoll polling and monitor its behavior. 4) Using dynamic netconsole targets via configfs, with the ability to delete and recreate targets during the test. 5) Running bpftrace in parallel to verify that netpoll_poll_dev() is called when expected. If it is called, then the test passes, otherwise the test is marked as skipped. In order to achieve it, I stole Jakub's bpftrace helper from [1], and did some small changes that I found useful to use the helper. So, this patchset basically contains: 1) The code stolen from Jakub 2) Improvements on bpftrace() helper 3) The selftest itself Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1] --- Changes in v5: - Rebased on top of net-next. - Calling bpftrace_stop using the defer helper. (Willem) - Link to v4: https://lore.kernel.org/r/20250702-netpoll_test-v4-0-cec227e85639@debian.org Changes in v4: - Make the test XFail if it doesn't hit the function we are looking for - Toggle the interface while the traffic is flowing. - Bumped the number of messages from 10 to 40 per iterations. * This is hitting ~15 times per run on my vng test. - Decreased the time from 15 seconds to 10 seconds, given that if it didn't hit the function in 10 seconds, 5 seconds extra will not help. - Link to v3: https://lore.kernel.org/r/20250627-netpoll_test-v3-0-575bd200c8a9@debian.org Changes in v3: - Make pylint happy (Simon) - Remove the unnecessary patch in bpftrace to raise an exception when it fails. (Jakub) - Improved the bpftrace code (Willem) - Stop sending messages if bpftrace is not alive anymore. - Link to v2: https://lore.kernel.org/r/20250625-netpoll_test-v2-0-47d27775222c@debian.org Changes in v2: - Stole Jakub's helper to run bpftrace - Removed the DEBUG option and moved logs to logging - Change the code to have a higher chance of calling netpoll_poll_dev(). In my current configuration, it is hitting multiple times during the test. - Save and restore TX/RX queue size (Jakub) - Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org --- Breno Leitao (2): selftests: drv-net: Strip '@' prefix from bpftrace map keys selftests: net: add netpoll basic functionality test Jakub Kicinski (1): selftests: drv-net: add helper/wrapper for bpftrace tools/testing/selftests/drivers/net/Makefile | 1 + .../selftests/drivers/net/lib/py/__init__.py | 3 +- .../testing/selftests/drivers/net/netpoll_basic.py | 366 +++++++++++++++++++++ tools/testing/selftests/net/lib/py/utils.py | 35 ++ 4 files changed, 404 insertions(+), 1 deletion(-) --- base-commit: 0234362d0af4649bc2ff745e94d06d0c6f0a46ce change-id: 20250612-netpoll_test-a1324d2057c8 Best regards, -- Breno Leitao <leitao(a)debian.org>

2 days, 20 hours

2
6
0 0

[PATCH v18 00/27] riscv control-flow integrity for usermode

by Deepak Gupta

Basics and overview =================== Software with larger attack surfaces (e.g. network facing apps like databases, browsers or apps relying on browser runtimes) suffer from memory corruption issues which can be utilized by attackers to bend control flow of the program to eventually gain control (by making their payload executable). Attackers are able to perform such attacks by leveraging call-sites which rely on indirect calls or return sites which rely on obtaining return address from stack memory. To mitigate such attacks, risc-v extension zicfilp enforces that all indirect calls must land on a landing pad instruction `lpad` else cpu will raise software check exception (a new cpu exception cause code on riscv). Similarly for return flow, risc-v extension zicfiss extends architecture with - `sspush` instruction to push return address on a shadow stack - `sspopchk` instruction to pop return address from shadow stack and compare with input operand (i.e. return address on stack) - `sspopchk` to raise software check exception if comparision above was a mismatch - Protection mechanism using which shadow stack is not writeable via regular store instructions More information an details can be found at extensions github repo [1]. Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel CET [3] and branch target identification (BTI) [4] on arm. Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack. x86 and arm64 support for user mode shadow stack is already in mainline. Kernel awareness for user control flow integrity ================================================ This series picks up Samuel Holland's envcfg changes [2] as well. So if those are being applied independently, they should be removed from this series. Enabling: In order to maintain compatibility and not break anything in user mode, kernel doesn't enable control flow integrity cpu extensions on binary by default. Instead exposes a prctl interface to enable, disable and lock the shadow stack or landing pad feature for a task. This allows userspace (loader) to enumerate if all objects in its address space are compiled with shadow stack and landing pad support and accordingly enable the feature. Additionally if a subsequent `dlopen` happens on a library, user mode can take a decision again to disable the feature (if incoming library is not compiled with support) OR terminate the task (if user mode policy is strict to have all objects in address space to be compiled with control flow integirty cpu feature). prctl to enable shadow stack results in allocating shadow stack from virtual memory and activating for user address space. x86 and arm64 are also following same direction due to similar reason(s). clone/fork: On clone and fork, cfi state for task is inherited by child. Shadow stack is part of virtual memory and is a writeable memory from kernel perspective (writeable via a restricted set of instructions aka shadow stack instructions) Thus kernel changes ensure that this memory is converted into read-only when fork/clone happens and COWed when fault is taken due to sspush, sspopchk or ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled, kernel will automatically allocate a shadow stack for that clone call. map_shadow_stack: x86 introduced `map_shadow_stack` system call to allow user space to explicitly map shadow stack memory in its address space. It is useful to allocate shadow for different contexts managed by a single thread (green threads or contexts) risc-v implements this system call as well. signal management: If shadow stack is enabled for a task, kernel performs an asynchronous control flow diversion to deliver the signal and eventually expects userspace to issue sigreturn so that original execution can be resumed. Even though resume context is prepared by kernel, it is in user space memory and is subject to memory corruption and corruption bugs can be utilized by attacker in this race window to perform arbitrary sigreturn and eventually bypass cfi mechanism. Another issue is how to ensure that cfi related state on sigcontext area is not trampled by legacy apps or apps compiled with old kernel headers. In order to mitigate control-flow hijacting, kernel prepares a token and place it on shadow stack before signal delivery and places address of token in sigcontext structure. During sigreturn, kernel obtains address of token from sigcontext struture, reads token from shadow stack and validates it and only then allow sigreturn to succeed. Compatiblity issue is solved by adopting dynamic sigcontext management introduced for vector extension. This series re-factor the code little bit to allow future sigcontext management easy (as proposed by Andy Chiu from SiFive) config and compilation: Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this config option picks the kernel support for user control flow integrity. This optin is presented only if toolchain has shadow stack and landing pad support. And is on purpose guarded by toolchain support. Reason being that eventually vDSO also needs to be compiled in with shadow stack and landing pad support. vDSO compile patches are not included as of now because landing pad labeling scheme is yet to settle for usermode runtime. To get more information on kernel interactions with respect to zicfilp and zicfiss, patch series adds documentation for `zicfilp` and `zicfiss` in following: Documentation/arch/riscv/zicfiss.rst Documentation/arch/riscv/zicfilp.rst How to test this series ======================= Toolchain --------- $ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev $ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static" $ make -j$(nproc) Qemu ---- Get the lastest qemu $ cd qemu $ mkdir build $ cd build $ ../configure --target-list=riscv64-softmmu $ make -j$(nproc) Opensbi ------- $ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi $ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic Linux ----- Running defconfig is fine. CFI is enabled by default if the toolchain supports it. $ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig $ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) In case you're building your own rootfs using toolchain, please make sure you pick following patch to ensure that vDSO compiled with lpad and shadow stack. "arch/riscv: compile vdso with landing pad" Branch where above patch can be picked https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1 Running ------- Modify your qemu command to have: -bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin -cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true vDSO related Opens (in the flux) ================================= I am listing these opens for laying out plan and what to expect in future patch sets. And of course for the sake of discussion. Shadow stack and landing pad enabling in vDSO ---------------------------------------------- vDSO must have shadow stack and landing pad support compiled in for task to have shadow stack and landing pad support. This patch series doesn't enable that (yet). Enabling shadow stack support in vDSO should be straight forward (intend to do that in next versions of patch set). Enabling landing pad support in vDSO requires some collaboration with toolchain folks to follow a single label scheme for all object binaries. This is necessary to ensure that all indirect call-sites are setting correct label and target landing pads are decorated with same label scheme. How many vDSOs --------------- Shadow stack instructions are carved out of zimop (may be operations) and if CPU doesn't implement zimop, they're illegal instructions. Kernel could be running on a CPU which may or may not implement zimop. And thus kernel will have to carry 2 different vDSOs and expose the appropriate one depending on whether CPU implements zimop or not. References ========== [1] - https://github.com/riscv/riscv-cfi [2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c… [3] - https://lwn.net/Articles/889475/ [4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific… [5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i… [6] - https://lwn.net/Articles/940403/ To: Thomas Gleixner <tglx(a)linutronix.de> To: Ingo Molnar <mingo(a)redhat.com> To: Borislav Petkov <bp(a)alien8.de> To: Dave Hansen <dave.hansen(a)linux.intel.com> To: x86(a)kernel.org To: H. Peter Anvin <hpa(a)zytor.com> To: Andrew Morton <akpm(a)linux-foundation.org> To: Liam R. Howlett <Liam.Howlett(a)oracle.com> To: Vlastimil Babka <vbabka(a)suse.cz> To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> To: Paul Walmsley <paul.walmsley(a)sifive.com> To: Palmer Dabbelt <palmer(a)dabbelt.com> To: Albert Ou <aou(a)eecs.berkeley.edu> To: Conor Dooley <conor(a)kernel.org> To: Rob Herring <robh(a)kernel.org> To: Krzysztof Kozlowski <krzk+dt(a)kernel.org> To: Arnd Bergmann <arnd(a)arndb.de> To: Christian Brauner <brauner(a)kernel.org> To: Peter Zijlstra <peterz(a)infradead.org> To: Oleg Nesterov <oleg(a)redhat.com> To: Eric Biederman <ebiederm(a)xmission.com> To: Kees Cook <kees(a)kernel.org> To: Jonathan Corbet <corbet(a)lwn.net> To: Shuah Khan <shuah(a)kernel.org> To: Jann Horn <jannh(a)google.com> To: Conor Dooley <conor+dt(a)kernel.org> To: Miguel Ojeda <ojeda(a)kernel.org> To: Alex Gaynor <alex.gaynor(a)gmail.com> To: Boqun Feng <boqun.feng(a)gmail.com> To: Gary Guo <gary(a)garyguo.net> To: Björn Roy Baron <bjorn3_gh(a)protonmail.com> To: Benno Lossin <benno.lossin(a)proton.me> To: Andreas Hindborg <a.hindborg(a)kernel.org> To: Alice Ryhl <aliceryhl(a)google.com> To: Trevor Gross <tmgross(a)umich.edu> Cc: linux-kernel(a)vger.kernel.org Cc: linux-fsdevel(a)vger.kernel.org Cc: linux-mm(a)kvack.org Cc: linux-riscv(a)lists.infradead.org Cc: devicetree(a)vger.kernel.org Cc: linux-arch(a)vger.kernel.org Cc: linux-doc(a)vger.kernel.org Cc: linux-kselftest(a)vger.kernel.org Cc: alistair.francis(a)wdc.com Cc: richard.henderson(a)linaro.org Cc: jim.shu(a)sifive.com Cc: andybnac(a)gmail.com Cc: kito.cheng(a)sifive.com Cc: charlie(a)rivosinc.com Cc: atishp(a)rivosinc.com Cc: evan(a)rivosinc.com Cc: cleger(a)rivosinc.com Cc: alexghiti(a)rivosinc.com Cc: samitolvanen(a)google.com Cc: broonie(a)kernel.org Cc: rick.p.edgecombe(a)intel.com Cc: rust-for-linux(a)vger.kernel.org changelog --------- v18: - rebased on 6.16-rc1 - uprobe handling clears ELP in sstatus image in pt_regs - vdso was missing shadow stack elf note for object files. added that. Additional asm file for vdso needed the elf marker flag. toolchain should complain if `-fcf-protection=full` and marker is missing for object generated from asm file. Asked toolchain folks to fix this. Although no reason to gate the merge on that. - Split up compile options for march and fcf-protection in vdso Makefile - CONFIG_RISCV_USER_CFI option is moved under "Kernel features" menu Added `arch/riscv/configs/hardening.config` fragment which selects CONFIG_RISCV_USER_CFI v17: - fixed warnings due to empty macros in usercfi.h (reported by alexg) - fixed prefixes in commit titles reported by alexg - took below uprobe with fcfi v2 patch from Zong Li and squashed it with "riscv/traps: Introduce software check exception and uprobe handling" https://lore.kernel.org/all/20250604093403.10916-1-zong.li@sifive.com/ v16: - If FWFT is not implemented or returns error for shadow stack activation, then no_usercfi is set to disable shadow stack. Although this should be picked up by extension validation and activation. Fixed this bug for zicfilp and zicfiss both. Thanks to Charlie Jenkins for reporting this. - If toolchain doesn't support cfi, cfi kselftest shouldn't build. Suggested by Charlie Jenkins. - Default for CONFIG_RISCV_USER_CFI is set to no. Charlie/Atish suggested to keep it off till we have more hardware availibility with RVA23 profile and zimop/zcmop implemented. Else this will start breaking people's workflow - Includes the fix if "!RV64 and !SBI" then definitions for FWFT in asm-offsets.c error. v15: - Toolchain has been updated to include `-fcf-protection` flag. This exists for x86 as well. Updated kernel patches to compile vDSO and selftest to compile with `fcf-protection=full` flag. - selecting CONFIG_RISCV_USERCFI selects CONFIG_RISCV_SBI. - Patch to enable shadow stack for kernel wasn't hidden behind CONFIG_RISCV_USERCFI and CONFIG_RISCV_SBI. fixed that. v14: - rebased on top of palmer/sbi-v3. Thus dropped clement's FWFT patches Updated RISCV_ISA_EXT_XXXX in hwcap and hwprobe constants. - Took Radim's suggestions on bitfields. - Placed cfi_state at the end of thread_info block so that current situation is not disturbed with respect to member fields of thread_info in single cacheline. v13: - cpu_supports_shadow_stack/cpu_supports_indirect_br_lp_instr uses riscv_has_extension_unlikely() - uses nops(count) to create nop slide - RISCV_ACQUIRE_BARRIER is not needed in `amo_user_shstk`. Removed it - changed ternaries to simply use implicit casting to convert to bool. - kernel command line allows to disable zicfilp and zicfiss independently. updated kernel-parameters.txt. - ptrace user abi for cfi uses bitmasks instead of bitfields. Added ptrace kselftest. - cosmetic and grammatical changes to documentation. v12: - It seems like I had accidently squashed arch agnostic indirect branch tracking prctl and riscv implementation of those prctls. Split them again. - set_shstk_status/set_indir_lp_status perform CSR writes only when CPU support is available. As suggested by Zong Li. - Some minor clean up in kselftests as suggested by Zong Li. v11: - patch "arch/riscv: compile vdso with landing pad" was unconditionally selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to to `lpad 0`. v10: - dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch is not that interesting to this patch series for risc-v. There are instances in arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch to expedite merging in riscv tree. - Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to validate presence of cfi based on config. - Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure we add single vdso object with cfi enabled. But a vdso object with scheme of zero labeled landing pad is least common denominator and should work with all objects of zero labeled as well as function-signature labeled objects. v9: - rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion") - dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs) - dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs) v8: - rebased on palmer/for-next - dropped samuel holland's `envcfg` context switch patches. they are in parlmer/for-next v7: - Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv" Instead using `deactivate_mm` flow to clean up. see here for more context https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.… - Changed the header include in `kselftest`. Hopefully this fixes compile issue faced by Zong Li at SiFive. - Cleaned up an orphaned change to `mm/mmap.c` in below patch "riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE" - Lock interfaces for shadow stack and indirect branch tracking expect arg == 0 Any future evolution of this interface should accordingly define how arg should be setup. - `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper `is_shadow_stack_vma`. - Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv… v6: - Picked up Samuel Holland's changes as is with `envcfg` placed in `thread` instead of `thread_info` - fixed unaligned newline escapes in kselftest - cleaned up messages in kselftest and included test output in commit message - fixed a bug in clone path reported by Zong Li - fixed a build issue if CONFIG_RISCV_ISA_V is not selected (this was introduced due to re-factoring signal context management code) v5: - rebased on v6.12-rc1 - Fixed schema related issues in device tree file - Fixed some of the documentation related issues in zicfilp/ss.rst (style issues and added index) - added `SHADOW_STACK_SET_MARKER` so that implementation can define base of shadow stack. - Fixed warnings on definitions added in usercfi.h when CONFIG_RISCV_USER_CFI is not selected. - Adopted context header based signal handling as proposed by Andy Chiu - Added support for enabling kernel mode access to shadow stack using FWFT (https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…) - Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv… (Note: I had an issue in my workflow due to which version number wasn't picked up correctly while sending out patches) v4: - rebased on 6.11-rc6 - envcfg: Converged with Samuel Holland's patches for envcfg management on per- thread basis. - vma_is_shadow_stack is renamed to is_vma_shadow_stack - picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch - signal context: using extended context management to maintain compatibility. - fixed `-Wmissing-prototypes` compiler warnings for prctl functions - Documentation fixes and amending typos. - Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/ v3: - envcfg logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been picked on per task basis, even though CPU didn't implement it. Fixed in this series. - dt-bindings As suggested, split into separate commit. fixed the messaging that spec is in public review - arch_is_shadow_stack change arch_is_shadow_stack changed to vma_is_shadow_stack - hwprobe zicfiss / zicfilp if present will get enumerated in hwprobe - selftests As suggested, added object and binary filenames to .gitignore Selftest binary anyways need to be compiled with cfi enabled compiler which will make sure that landing pad and shadow stack are enabled. Thus removed separate enable/disable tests. Cleaned up tests a bit. - Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/ v2: - Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow integrity for user mode programs can be compiled in the kernel. - Enabling of control flow integrity for user programs is left to user runtime - This patch series introduces arch agnostic `prctls` to enable shadow stack and indirect branch tracking. And implements them on riscv. --- Changes in v18: - Link to v17: https://lore.kernel.org/r/20250604-v5_user_cfi_series-v17-0-4565c2cf869f@ri… Changes in v17: - Link to v16: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-0-64f61a35eee7@ri… Changes in v16: - Link to v15: https://lore.kernel.org/r/20250502-v5_user_cfi_series-v15-0-914966471885@ri… Changes in v15: - changelog posted just below cover letter - Link to v14: https://lore.kernel.org/r/20250429-v5_user_cfi_series-v14-0-5239410d012a@ri… Changes in v14: - changelog posted just below cover letter - Link to v13: https://lore.kernel.org/r/20250424-v5_user_cfi_series-v13-0-971437de586a@ri… Changes in v13: - changelog posted just below cover letter - Link to v12: https://lore.kernel.org/r/20250314-v5_user_cfi_series-v12-0-e51202b53138@ri… Changes in v12: - changelog posted just below cover letter - Link to v11: https://lore.kernel.org/r/20250310-v5_user_cfi_series-v11-0-86b36cbfb910@ri… Changes in v11: - changelog posted just below cover letter - Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri… --- Andy Chiu (1): riscv: signal: abstract header saving for setup_sigcontext Deepak Gupta (25): mm: VM_SHADOW_STACK definition for riscv dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml) riscv: zicfiss / zicfilp enumeration riscv: zicfiss / zicfilp extension csr and bit definitions riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE riscv/mm: manufacture shadow stack pte riscv/mm: teach pte_mkwrite to manufacture shadow stack PTEs riscv/mm: write protect and shadow stack riscv/mm: Implement map_shadow_stack() syscall riscv/shstk: If needed allocate a new shadow stack on clone riscv: Implements arch agnostic shadow stack prctls prctl: arch-agnostic prctl for indirect branch tracking riscv: Implements arch agnostic indirect branch tracking prctls riscv/traps: Introduce software check exception and uprobe handling riscv/signal: save and restore of shadow stack for signal riscv/kernel: update __show_regs to print shadow stack register riscv/ptrace: riscv cfi status and state via ptrace and in core files riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe riscv: kernel command line option to opt out of user cfi riscv: enable kernel access to shadow stack memory via FWFT sbi call riscv: create a config for shadow stack and landing pad instr support riscv: Documentation for landing pad / indirect branch tracking riscv: Documentation for shadow stack on riscv kselftest/riscv: kselftest for user mode cfi Jim Shu (1): arch/riscv: compile vdso with landing pad and shadow stack note Documentation/admin-guide/kernel-parameters.txt | 8 + Documentation/arch/riscv/index.rst | 2 + Documentation/arch/riscv/zicfilp.rst | 115 +++++ Documentation/arch/riscv/zicfiss.rst | 179 +++++++ .../devicetree/bindings/riscv/extensions.yaml | 14 + arch/riscv/Kconfig | 21 + arch/riscv/Makefile | 5 +- arch/riscv/configs/hardening.config | 4 + arch/riscv/include/asm/asm-prototypes.h | 1 + arch/riscv/include/asm/assembler.h | 44 ++ arch/riscv/include/asm/cpufeature.h | 12 + arch/riscv/include/asm/csr.h | 16 + arch/riscv/include/asm/entry-common.h | 2 + arch/riscv/include/asm/hwcap.h | 2 + arch/riscv/include/asm/mman.h | 26 + arch/riscv/include/asm/mmu_context.h | 7 + arch/riscv/include/asm/pgtable.h | 30 +- arch/riscv/include/asm/processor.h | 1 + arch/riscv/include/asm/thread_info.h | 3 + arch/riscv/include/asm/usercfi.h | 95 ++++ arch/riscv/include/asm/vector.h | 3 + arch/riscv/include/uapi/asm/hwprobe.h | 2 + arch/riscv/include/uapi/asm/ptrace.h | 34 ++ arch/riscv/include/uapi/asm/sigcontext.h | 1 + arch/riscv/kernel/Makefile | 1 + arch/riscv/kernel/asm-offsets.c | 10 + arch/riscv/kernel/cpufeature.c | 27 + arch/riscv/kernel/entry.S | 33 +- arch/riscv/kernel/head.S | 27 + arch/riscv/kernel/process.c | 27 +- arch/riscv/kernel/ptrace.c | 95 ++++ arch/riscv/kernel/signal.c | 148 +++++- arch/riscv/kernel/sys_hwprobe.c | 2 + arch/riscv/kernel/sys_riscv.c | 10 + arch/riscv/kernel/traps.c | 54 ++ arch/riscv/kernel/usercfi.c | 545 +++++++++++++++++++++ arch/riscv/kernel/vdso/Makefile | 11 +- arch/riscv/kernel/vdso/flush_icache.S | 4 + arch/riscv/kernel/vdso/getcpu.S | 4 + arch/riscv/kernel/vdso/rt_sigreturn.S | 4 + arch/riscv/kernel/vdso/sys_hwprobe.S | 4 + arch/riscv/kernel/vdso/vgetrandom-chacha.S | 5 +- arch/riscv/mm/init.c | 2 +- arch/riscv/mm/pgtable.c | 16 + include/linux/cpu.h | 4 + include/linux/mm.h | 7 + include/uapi/linux/elf.h | 2 + include/uapi/linux/prctl.h | 27 + kernel/sys.c | 30 ++ tools/testing/selftests/riscv/Makefile | 2 +- tools/testing/selftests/riscv/cfi/.gitignore | 3 + tools/testing/selftests/riscv/cfi/Makefile | 16 + tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 82 ++++ tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 173 +++++++ tools/testing/selftests/riscv/cfi/shadowstack.c | 385 +++++++++++++++ tools/testing/selftests/riscv/cfi/shadowstack.h | 27 + 56 files changed, 2383 insertions(+), 31 deletions(-) --- base-commit: a2a05801de77ca5122fc34e3eb84d6359ef70389 change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2 -- - debug

2 days, 22 hours

1
28
0 0

[PATCH net-next V2 0/5] selftests: drv-net: Test XDP native support

by Mohsin Bashir

This patch series add tests to validate XDP native support for PASS, DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment. For adjustment tests, validate support for both the extension and shrinking cases across various packet sizes and offset values. The pass criteria for head/tail adjustment tests require that at-least one adjustment value works for at-least one packet size. This ensure that the variability in maximum supported head/tail adjustment offset across different drivers is being incorporated. The results reported in this series are based on fbnic. However, the series is tested against multiple other drivers including netdevism. Note: The XDP support for fbnic will be added later. --- Change-log: V2: - Remove unused and libxdp-devel headers - Fix reverse xmas tree in xdp_native.bpf.c - Reorder headers in xdp_native.bpf.c V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.… Mohsin Bashir (5): selftests: drv-net: Add bpftool util selftests: drv-net: Test XDP_PASS/DROP support selftests: drv-net: Test XDP_TX support selftests: drv-net: Test tail-adjustment support selftests: drv-net: Test head-adjustment support tools/testing/selftests/drivers/net/Makefile | 1 + .../selftests/drivers/net/lib/py/__init__.py | 2 +- tools/testing/selftests/drivers/net/xdp.py | 663 ++++++++++++++++++ tools/testing/selftests/net/lib/py/utils.py | 4 + .../selftests/net/lib/xdp_native.bpf.c | 524 ++++++++++++++ 5 files changed, 1193 insertions(+), 1 deletion(-) create mode 100755 tools/testing/selftests/drivers/net/xdp.py create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c -- 2.47.1

3 days, 1 hour

3
10
0 0

[PATCH] selftests/mm: fix split_huge_page_test for folio_split() tests.

by Zi Yan

PID_FMT does not have an offset field, so folio_split() tests are not performed. Add PID_FMT_OFFSET with an offset field and use it to perform folio_split() tests. Fixes: 80a5c494c89f ("selftests/mm: add tests for folio_split(), buddy allocator like split") Signed-off-by: Zi Yan <ziy(a)nvidia.com> --- tools/testing/selftests/mm/split_huge_page_test.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c index aa7400ed0e99..f0d9c035641d 100644 --- a/tools/testing/selftests/mm/split_huge_page_test.c +++ b/tools/testing/selftests/mm/split_huge_page_test.c @@ -31,6 +31,7 @@ uint64_t pmd_pagesize; #define INPUT_MAX 80 #define PID_FMT "%d,0x%lx,0x%lx,%d" +#define PID_FMT_OFFSET "%d,0x%lx,0x%lx,%d,%d" #define PATH_FMT "%s,0x%lx,0x%lx,%d" #define PFN_MASK ((1UL<<55)-1) @@ -483,7 +484,7 @@ void split_thp_in_pagecache_to_order_at(size_t fd_size, const char *fs_loc, write_debugfs(PID_FMT, getpid(), (uint64_t)addr, (uint64_t)addr + fd_size, order); else - write_debugfs(PID_FMT, getpid(), (uint64_t)addr, + write_debugfs(PID_FMT_OFFSET, getpid(), (uint64_t)addr, (uint64_t)addr + fd_size, order, offset); for (i = 0; i < fd_size; i++) -- 2.47.2

3 days, 4 hours

5
5
0 0

[PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs

by Lorenzo Stoakes

Historically we've made it a uAPI requirement that mremap() may only operate on a single VMA at a time. For instances where VMAs need to be resized, this makes sense, as it becomes very difficult to determine what a user actually wants should they indicate a desire to expand or shrink the size of multiple VMAs (truncate? Adjust sizes individually? Some other strategy?). However, in instances where a user is moving VMAs, it is restrictive to disallow this. This is especially the case when anonymous mapping remap may or may not be mergeable depending on whether VMAs have or have not been faulted due to anon_vma assignment and folio index alignment with vma->vm_pgoff. Often this can result in surprising impact where a moved region is faulted, then moved back and a user fails to observe a merge from otherwise compatible, adjacent VMAs. This change allows such cases to work without the user having to be cognizant of whether a prior mremap() move or other VMA operations has resulted in VMA fragmentation. In order to do this, this series performs a large amount of refactoring, most pertinently - grouping sanity checks together, separately those that check input parameters and those relating to VMAs. we also simplify the post-mmap lock drop processing for uffd and mlock()'d VMAs. With this done, we can then fairly straightforwardly implement this functionality. This works exclusively for mremap() invocations which specify MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the notification of the userland fault handler would require us to drop the mmap lock. It is also not compatible with file-backed mappings with customised get_unmapped_area() handlers as these may not honour MREMAP_FIXED. The input and output addresses ranges must not overlap. We carefully account for moves which would result in VMA iterator invalidation. While there can be gaps between VMAs in the input range, there can be no gap before the first VMA in the range. v3: * Disallowed move operation except for MREMAP_FIXED. * Disallow gap at start of aggregate range to avoid confusion. * Disallow any file-baked VMAs with custom get_unmapped_area. * Renamed multi_vma to seen_vma to be clearer. Stop reusing new_addr, use separate target_addr var to track next target address. * Check if first VMA fails multi VMA check, if so we'll allow one VMA but not multiple. * Updated the commit message for patch 9 to be clearer about gap behaviour. * Removed accidentally included debug goto statement in test (doh!). Test was and is passing regardless. * Unmap target range in test, previously we ended up moving additional VMAs unintentionally. This still all passed :) but was not what was intended. * Removed self-merge check - there is absolutely no way this can happen across multiple VMAs, as there is no means of moving VMAs such that a VMA merges with itself. v2: * Squashed uffd stub fix into series. * Propagated tags, thanks! * Fixed param naming in patch 4 as per Vlastimil. * Renamed vma_reset to vmi_needs_reset + dropped reset on unmap as per Liam. * Correctly return -EFAULT if no VMAs in input range. * Account for get_unmapped_area() disregarding MAP_FIXED and returning an altered address. * Added additional explanatatory comment to the remap_move() function. https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/ v1: https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/ Lorenzo Stoakes (10): mm/mremap: perform some simple cleanups mm/mremap: refactor initial parameter sanity checks mm/mremap: put VMA check and prep logic into helper function mm/mremap: cleanup post-processing stage of mremap mm/mremap: use an explicit uffd failure path for mremap mm/mremap: check remap conditions earlier mm/mremap: move remap_is_valid() into check_prep_vma() mm/mremap: clean up mlock populate behaviour mm/mremap: permit mremap() move of multiple VMAs tools/testing/selftests: extend mremap_test to test multi-VMA mremap fs/userfaultfd.c | 15 +- include/linux/userfaultfd_k.h | 5 + mm/mremap.c | 553 +++++++++++++++-------- tools/testing/selftests/mm/mremap_test.c | 146 +++++- 4 files changed, 518 insertions(+), 201 deletions(-) -- 2.50.0

3 days, 4 hours

2
14
0 0

[PATCH v2 0/3] module: make structure definitions always visible

by Thomas Weißschuh

Code using IS_ENABLED(CONFIG_MODULES) as a C expression may need access to the module structure definitions to compile. Make sure these structure definitions are always visible. This will conflict with commit 6bb37af62634 ("module: Move modprobe_path and modules_disabled ctl_tables into the module subsys") from the sysctl tree, but the resolution is trivial. Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de> --- Changes in v2: - Pick up tags from v1 - Keep MODULE_ARCH_INIT and 'struct module' definitions together - Link to v1: https://lore.kernel.org/r/20250612-kunit-ifdef-modules-v1-0-fdccd42dcff8@li… --- Thomas Weißschuh (3): module: move 'struct module_use' to internal.h module: make structure definitions always visible kunit: test: Drop CONFIG_MODULE ifdeffery include/linux/module.h | 29 +++++++++++------------------ kernel/module/internal.h | 7 +++++++ lib/kunit/test.c | 8 -------- 3 files changed, 18 insertions(+), 26 deletions(-) --- base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494 change-id: 20250611-kunit-ifdef-modules-0fefd13ae153 Best regards, -- Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>

3 days, 4 hours

3
6
0 0

[PATCH 0/3] module: make structure definitions always visible

by Thomas Weißschuh

Code using IS_ENABLED(CONFIG_MODULES) as a C expression may need access to the module structure definitions to compile. Make sure these structure definitions are always visible. Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de> --- Thomas Weißschuh (3): module: move 'struct module_use' to internal.h module: make structure definitions always visible kunit: test: Drop CONFIG_MODULE ifdeffery include/linux/module.h | 30 ++++++++++++------------------ kernel/module/internal.h | 7 +++++++ lib/kunit/test.c | 8 -------- 3 files changed, 19 insertions(+), 26 deletions(-) --- base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494 change-id: 20250611-kunit-ifdef-modules-0fefd13ae153 Best regards, -- Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>

3 days, 5 hours

4
15
0 0

[PATCH v5] selftests: riscv: add misaligned access testing

by Clément Léger

This selftest tests all the currently emulated instructions (except for the RV32 compressed ones which are left as a future exercise for a RV32 user). For the FPU instructions, all the FPU registers are tested. Signed-off-by: Clément Léger <cleger(a)rivosinc.com> --- This commits needs [1] for the test to be passed or it will fail due to missing sign extension. Note: This test can be executed using an SBI that support FWFT or that delegates misaligned traps by default. If using QEMU, you will need the patches mentioned at [2] so that misaligned accesses will generate a trap. Note: This commit was part of a series [3] that was partially merged. Note: the remaining checkpatch errors are not applicable to this tests which is a userspace one and does not use the kernel headers. Macros with complex values can not be enclosed in do while loop since they are generating functions. Link: https://lore.kernel.org/linux-riscv/mvmikk0goil.fsf@suse.de/ [1] Link: https://lore.kernel.org/all/20241211211933.198792-1-fkonrad@amd.com/ [2] Link: https://lore.kernel.org/linux-riscv/20250422162324.956065-1-cleger@rivosinc… [3] V5: - Sign extend LWU return value - Added sign extensions tests - Tests multiples values for GP load/store V4: - Fixed a typo in test name s/load/store V3: - Fixed a segfault and a sign extension error found when compiling with -O<x>, x != 0 (Alex) - Use inline assembly to generate the sigbus and avoid GCC optimizations V2: - Fix commit description - Fix a few errors reported by checkpatch.pl --- tools/testing/selftests/riscv/Makefile | 2 +- .../selftests/riscv/misaligned/.gitignore | 1 + .../selftests/riscv/misaligned/Makefile | 12 + .../selftests/riscv/misaligned/common.S | 33 ++ .../testing/selftests/riscv/misaligned/fpu.S | 180 +++++++++++ tools/testing/selftests/riscv/misaligned/gp.S | 113 +++++++ .../selftests/riscv/misaligned/misaligned.c | 288 ++++++++++++++++++ 7 files changed, 628 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/riscv/misaligned/.gitignore create mode 100644 tools/testing/selftests/riscv/misaligned/Makefile create mode 100644 tools/testing/selftests/riscv/misaligned/common.S create mode 100644 tools/testing/selftests/riscv/misaligned/fpu.S create mode 100644 tools/testing/selftests/riscv/misaligned/gp.S create mode 100644 tools/testing/selftests/riscv/misaligned/misaligned.c diff --git a/tools/testing/selftests/riscv/Makefile b/tools/testing/selftests/riscv/Makefile index 099b8c1f46f8..95a98ceeb3b3 100644 --- a/tools/testing/selftests/riscv/Makefile +++ b/tools/testing/selftests/riscv/Makefile @@ -5,7 +5,7 @@ ARCH ?= $(shell uname -m 2>/dev/null || echo not) ifneq (,$(filter $(ARCH),riscv)) -RISCV_SUBTARGETS ?= abi hwprobe mm sigreturn vector +RISCV_SUBTARGETS ?= abi hwprobe mm sigreturn vector misaligned else RISCV_SUBTARGETS := endif diff --git a/tools/testing/selftests/riscv/misaligned/.gitignore b/tools/testing/selftests/riscv/misaligned/.gitignore new file mode 100644 index 000000000000..5eff15a1f981 --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/.gitignore @@ -0,0 +1 @@ +misaligned diff --git a/tools/testing/selftests/riscv/misaligned/Makefile b/tools/testing/selftests/riscv/misaligned/Makefile new file mode 100644 index 000000000000..1aa40110c50d --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/Makefile @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) 2021 ARM Limited +# Originally tools/testing/arm64/abi/Makefile + +CFLAGS += -I$(top_srcdir)/tools/include + +TEST_GEN_PROGS := misaligned + +include ../../lib.mk + +$(OUTPUT)/misaligned: misaligned.c fpu.S gp.S + $(CC) -g3 -static -o$@ -march=rv64imafdc $(CFLAGS) $(LDFLAGS) $^ diff --git a/tools/testing/selftests/riscv/misaligned/common.S b/tools/testing/selftests/riscv/misaligned/common.S new file mode 100644 index 000000000000..8fa00035bd5d --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/common.S @@ -0,0 +1,33 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2025 Rivos Inc. + * + * Authors: + * Clément Léger <cleger(a)rivosinc.com> + */ + +.macro lb_sb temp, offset, src, dst + lb \temp, \offset(\src) + sb \temp, \offset(\dst) +.endm + +.macro copy_long_to temp, src, dst + lb_sb \temp, 0, \src, \dst, + lb_sb \temp, 1, \src, \dst, + lb_sb \temp, 2, \src, \dst, + lb_sb \temp, 3, \src, \dst, + lb_sb \temp, 4, \src, \dst, + lb_sb \temp, 5, \src, \dst, + lb_sb \temp, 6, \src, \dst, + lb_sb \temp, 7, \src, \dst, +.endm + +.macro sp_stack_prologue offset + addi sp, sp, -8 + sub sp, sp, \offset +.endm + +.macro sp_stack_epilogue offset + add sp, sp, \offset + addi sp, sp, 8 +.endm diff --git a/tools/testing/selftests/riscv/misaligned/fpu.S b/tools/testing/selftests/riscv/misaligned/fpu.S new file mode 100644 index 000000000000..a7ad4430a424 --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/fpu.S @@ -0,0 +1,180 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2025 Rivos Inc. + * + * Authors: + * Clément Léger <cleger(a)rivosinc.com> + */ + +#include "common.S" + +#define CASE_ALIGN 4 + +.macro fpu_load_inst fpreg, inst, precision, load_reg +.align CASE_ALIGN + \inst \fpreg, 0(\load_reg) + fmv.\precision fa0, \fpreg + j 2f +.endm + +#define flw(__fpreg) fpu_load_inst __fpreg, flw, s, a4 +#define fld(__fpreg) fpu_load_inst __fpreg, fld, d, a4 +#define c_flw(__fpreg) fpu_load_inst __fpreg, c.flw, s, a4 +#define c_fld(__fpreg) fpu_load_inst __fpreg, c.fld, d, a4 +#define c_fldsp(__fpreg) fpu_load_inst __fpreg, c.fldsp, d, sp + +.macro fpu_store_inst fpreg, inst, precision, store_reg +.align CASE_ALIGN + fmv.\precision \fpreg, fa0 + \inst \fpreg, 0(\store_reg) + j 2f +.endm + +#define fsw(__fpreg) fpu_store_inst __fpreg, fsw, s, a4 +#define fsd(__fpreg) fpu_store_inst __fpreg, fsd, d, a4 +#define c_fsw(__fpreg) fpu_store_inst __fpreg, c.fsw, s, a4 +#define c_fsd(__fpreg) fpu_store_inst __fpreg, c.fsd, d, a4 +#define c_fsdsp(__fpreg) fpu_store_inst __fpreg, c.fsdsp, d, sp + +.macro fp_test_prologue + move a4, a1 + /* + * Compute jump offset to store the correct FP register since we don't + * have indirect FP register access (or at least we don't use this + * extension so that works on all archs) + */ + sll t0, a0, CASE_ALIGN + la t2, 1f + add t0, t0, t2 + jr t0 +.align CASE_ALIGN +1: +.endm + +.macro fp_test_prologue_compressed + /* FP registers for compressed instructions starts from 8 to 16 */ + addi a0, a0, -8 + fp_test_prologue +.endm + +#define fp_test_body_compressed(__inst_func) \ + __inst_func(f8); \ + __inst_func(f9); \ + __inst_func(f10); \ + __inst_func(f11); \ + __inst_func(f12); \ + __inst_func(f13); \ + __inst_func(f14); \ + __inst_func(f15); \ +2: + +#define fp_test_body(__inst_func) \ + __inst_func(f0); \ + __inst_func(f1); \ + __inst_func(f2); \ + __inst_func(f3); \ + __inst_func(f4); \ + __inst_func(f5); \ + __inst_func(f6); \ + __inst_func(f7); \ + __inst_func(f8); \ + __inst_func(f9); \ + __inst_func(f10); \ + __inst_func(f11); \ + __inst_func(f12); \ + __inst_func(f13); \ + __inst_func(f14); \ + __inst_func(f15); \ + __inst_func(f16); \ + __inst_func(f17); \ + __inst_func(f18); \ + __inst_func(f19); \ + __inst_func(f20); \ + __inst_func(f21); \ + __inst_func(f22); \ + __inst_func(f23); \ + __inst_func(f24); \ + __inst_func(f25); \ + __inst_func(f26); \ + __inst_func(f27); \ + __inst_func(f28); \ + __inst_func(f29); \ + __inst_func(f30); \ + __inst_func(f31); \ +2: +.text + +#define __gen_test_inst(__inst, __suffix) \ +.global test_ ## __inst; \ +test_ ## __inst:; \ + fp_test_prologue ## __suffix; \ + fp_test_body ## __suffix(__inst); \ + ret + +#define gen_test_inst_compressed(__inst) \ + .option arch,+c; \ + __gen_test_inst(c_ ## __inst, _compressed) + +#define gen_test_inst(__inst) \ + .balign 16; \ + .option push; \ + .option arch,-c; \ + __gen_test_inst(__inst, ); \ + .option pop + +.macro fp_test_prologue_load_compressed_sp + copy_long_to t0, a1, sp +.endm + +.macro fp_test_epilogue_load_compressed_sp +.endm + +.macro fp_test_prologue_store_compressed_sp +.endm + +.macro fp_test_epilogue_store_compressed_sp + copy_long_to t0, sp, a1 +.endm + +#define gen_inst_compressed_sp(__inst, __type) \ + .global test_c_ ## __inst ## sp; \ + test_c_ ## __inst ## sp:; \ + sp_stack_prologue a2; \ + fp_test_prologue_## __type ## _compressed_sp; \ + fp_test_prologue_compressed; \ + fp_test_body_compressed(c_ ## __inst ## sp); \ + fp_test_epilogue_## __type ## _compressed_sp; \ + sp_stack_epilogue a2; \ + ret + +#define gen_test_load_compressed_sp(__inst) gen_inst_compressed_sp(__inst, load) +#define gen_test_store_compressed_sp(__inst) gen_inst_compressed_sp(__inst, store) + +/* + * float_fsw_reg - Set a FP register from a register containing the value + * a0 = FP register index to be set + * a1 = addr where to store register value + * a2 = address offset + * a3 = value to be store + */ +gen_test_inst(fsw) + +/* + * float_flw_reg - Get a FP register value and return it + * a0 = FP register index to be retrieved + * a1 = addr to load register from + * a2 = address offset + */ +gen_test_inst(flw) + +gen_test_inst(fsd) +#ifdef __riscv_compressed +gen_test_inst_compressed(fsd) +gen_test_store_compressed_sp(fsd) +#endif + +gen_test_inst(fld) +#ifdef __riscv_compressed +gen_test_inst_compressed(fld) +gen_test_load_compressed_sp(fld) +#endif diff --git a/tools/testing/selftests/riscv/misaligned/gp.S b/tools/testing/selftests/riscv/misaligned/gp.S new file mode 100644 index 000000000000..5abec5ccc828 --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/gp.S @@ -0,0 +1,113 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2025 Rivos Inc. + * + * Authors: + * Clément Léger <cleger(a)rivosinc.com> + */ + +#include "common.S" + +.text + +.macro __gen_test_inst inst, src_reg + \inst a2, 0(\src_reg) + move a0, a2 +.endm + +.macro gen_func_header func_name, rvc + .option arch,\rvc + .global test_\func_name + test_\func_name: +.endm + +.macro gen_test_inst inst + .option push + gen_func_header \inst, -c + __gen_test_inst \inst, a0 + .option pop + ret +.endm + +.macro gen_test_lwu + .option push + gen_func_header lwu, -c + __gen_test_inst lwu, a0 + .option pop + # RISC-V ABI states that C expect a sign extended 32 bits integer + sext.w a0, a0 + ret +.endm + +.macro __gen_test_inst_c name, src_reg + .option push + gen_func_header c_\name, +c + __gen_test_inst c.\name, \src_reg + .option pop + ret +.endm + +.macro gen_test_inst_c name + __gen_test_inst_c \name, a0 +.endm + + +.macro gen_test_inst_c_ldsp + .option push + gen_func_header c_ldsp, +c + sp_stack_prologue a1 + copy_long_to t0, a0, sp + c.ldsp a0, 0(sp) + sp_stack_epilogue a1 + .option pop + ret +.endm + +.macro lb_sp_sb_a0 reg, offset + lb_sb \reg, \offset, sp, a0 +.endm + +.macro gen_test_inst_c_sdsp + .option push + gen_func_header c_sdsp, +c + /* Misalign stack pointer */ + sp_stack_prologue a1 + /* Misalign access */ + c.sdsp a2, 0(sp) + copy_long_to t0, sp, a0 + sp_stack_epilogue a1 + .option pop + ret +.endm + + + /* + * a0 = addr to load from + * a1 = address offset + * a2 = value to be loaded + */ +gen_test_inst lh +gen_test_inst lhu +gen_test_inst lw +gen_test_lwu +gen_test_inst ld +#ifdef __riscv_compressed +gen_test_inst_c lw +gen_test_inst_c ld +gen_test_inst_c_ldsp +#endif + +/* + * a0 = addr where to store value + * a1 = address offset + * a2 = value to be stored + */ +gen_test_inst sh +gen_test_inst sw +gen_test_inst sd +#ifdef __riscv_compressed +gen_test_inst_c sw +gen_test_inst_c sd +gen_test_inst_c_sdsp +#endif + diff --git a/tools/testing/selftests/riscv/misaligned/misaligned.c b/tools/testing/selftests/riscv/misaligned/misaligned.c new file mode 100644 index 000000000000..57ddcbdc947c --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/misaligned.c @@ -0,0 +1,288 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2025 Rivos Inc. + * + * Authors: + * Clément Léger <cleger(a)rivosinc.com> + */ +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <linux/ptrace.h> +#include "../../kselftest_harness.h" + +#include <stdlib.h> +#include <stdio.h> +#include <stdint.h> +#include <float.h> +#include <errno.h> +#include <math.h> +#include <string.h> +#include <signal.h> +#include <stdbool.h> +#include <unistd.h> +#include <inttypes.h> +#include <ucontext.h> + +#include <sys/prctl.h> + +#define stringify(s) __stringify(s) +#define __stringify(s) #s + +#define U8_MAX ((u8)~0U) +#define S8_MAX ((s8)(U8_MAX >> 1)) +#define S8_MIN ((s8)(-S8_MAX - 1)) +#define U16_MAX ((u16)~0U) +#define S16_MAX ((s16)(U16_MAX >> 1)) +#define S16_MIN ((s16)(-S16_MAX - 1)) +#define U32_MAX ((u32)~0U) +#define U32_MIN ((u32)0) +#define S32_MAX ((s32)(U32_MAX >> 1)) +#define S32_MIN ((s32)(-S32_MAX - 1)) +#define U64_MAX ((u64)~0ULL) +#define S64_MAX ((s64)(U64_MAX >> 1)) +#define S64_MIN ((s64)(-S64_MAX - 1)) + +#define int16_TEST_VALUES {S16_MIN, S16_MIN/2, -1, 1, S16_MAX/2, S16_MAX} +#define int32_TEST_VALUES {S32_MIN, S32_MIN/2, -1, 1, S32_MAX/2, S32_MAX} +#define int64_TEST_VALUES {S64_MIN, S64_MIN/2, -1, 1, S64_MAX/2, S64_MAX} +#define uint16_TEST_VALUES {0, U16_MAX/2, U16_MAX} +#define uint32_TEST_VALUES {0, U32_MAX/2, U32_MAX} + +#define float_TEST_VALUES {FLT_MIN, FLT_MIN/2, FLT_MAX/2, FLT_MAX} +#define double_TEST_VALUES {DBL_MIN, DBL_MIN/2, DBL_MAX/2, DBL_MAX} + +static bool float_equal(float a, float b) +{ + float scaled_epsilon; + float difference = fabsf(a - b); + + // Scale to the largest value. + a = fabsf(a); + b = fabsf(b); + if (a > b) + scaled_epsilon = FLT_EPSILON * a; + else + scaled_epsilon = FLT_EPSILON * b; + + return difference <= scaled_epsilon; +} + +static bool double_equal(double a, double b) +{ + double scaled_epsilon; + double difference = fabsl(a - b); + + // Scale to the largest value. + a = fabs(a); + b = fabs(b); + if (a > b) + scaled_epsilon = DBL_EPSILON * a; + else + scaled_epsilon = DBL_EPSILON * b; + + return difference <= scaled_epsilon; +} + +#define fpu_load_proto(__inst, __type) \ +extern __type test_ ## __inst(unsigned long fp_reg, void *addr, unsigned long offset, __type value) + +fpu_load_proto(flw, float); +fpu_load_proto(fld, double); +fpu_load_proto(c_flw, float); +fpu_load_proto(c_fld, double); +fpu_load_proto(c_fldsp, double); + +#define fpu_store_proto(__inst, __type) \ +extern void test_ ## __inst(unsigned long fp_reg, void *addr, unsigned long offset, __type value) + +fpu_store_proto(fsw, float); +fpu_store_proto(fsd, double); +fpu_store_proto(c_fsw, float); +fpu_store_proto(c_fsd, double); +fpu_store_proto(c_fsdsp, double); + +#define gp_load_proto(__inst, __type) \ +extern __type test_ ## __inst(void *addr, unsigned long offset, __type value) + +gp_load_proto(lh, int16_t); +gp_load_proto(lhu, uint16_t); +gp_load_proto(lw, int32_t); +gp_load_proto(lwu, uint32_t); +gp_load_proto(ld, int64_t); +gp_load_proto(c_lw, int32_t); +gp_load_proto(c_ld, int64_t); +gp_load_proto(c_ldsp, int64_t); + +#define gp_store_proto(__inst, __type) \ +extern void test_ ## __inst(void *addr, unsigned long offset, __type value) + +gp_store_proto(sh, int16_t); +gp_store_proto(sw, int32_t); +gp_store_proto(sd, int64_t); +gp_store_proto(c_sw, int32_t); +gp_store_proto(c_sd, int64_t); +gp_store_proto(c_sdsp, int64_t); + +#define TEST_GP_LOAD(__inst, __type_size, __type) \ +TEST(gp_load_ ## __inst) \ +{ \ + int offset, ret, val_i; \ + uint8_t buf[16] __attribute__((aligned(16))); \ + __type ## __type_size ## _t test_val[] = __type ## __type_size ## _TEST_VALUES; \ + \ + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT); \ + ASSERT_EQ(ret, 0); \ + \ + for (offset = 1; offset < (__type_size) / 8; offset++) { \ + for (val_i = 0; val_i < ARRAY_SIZE(test_val); val_i++) { \ + __type ## __type_size ## _t ref_val = test_val[val_i]; \ + __type ## __type_size ## _t *ptr = \ + (__type ## __type_size ## _t *)(buf + offset); \ + memcpy(ptr, &ref_val, sizeof(ref_val)); \ + __type ## __type_size ## _t val = \ + test_ ## __inst(ptr, offset, ref_val); \ + EXPECT_EQ(ref_val, val); \ + } \ + } \ +} + +TEST_GP_LOAD(lh, 16, int) +TEST_GP_LOAD(lhu, 16, uint) +TEST_GP_LOAD(lw, 32, int) +TEST_GP_LOAD(lwu, 32, uint) +TEST_GP_LOAD(ld, 64, int) +#ifdef __riscv_compressed +TEST_GP_LOAD(c_lw, 32, int) +TEST_GP_LOAD(c_ld, 64, int) +TEST_GP_LOAD(c_ldsp, 64, int) +#endif + +#define TEST_GP_STORE(__inst, __type_size, __type) \ +TEST(gp_store_ ## __inst) \ +{ \ + int offset, ret, val_i; \ + uint8_t buf[16] __attribute__((aligned(16))); \ + __type ## __type_size ## _t test_val[] = \ + __type ## __type_size ## _TEST_VALUES; \ + \ + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT); \ + ASSERT_EQ(ret, 0); \ + \ + for (val_i = 0; val_i < ARRAY_SIZE(test_val); val_i++) { \ + __type ## __type_size ## _t ref_val = test_val[val_i]; \ + for (offset = 1; offset < (__type_size) / 8; offset++) { \ + __type ## __type_size ## _t val = ref_val; \ + __type ## __type_size ## _t *ptr = \ + (__type ## __type_size ## _t *)(buf + offset); \ + memset(ptr, 0, sizeof(val)); \ + test_ ## __inst(ptr, offset, val); \ + memcpy(&val, ptr, sizeof(val)); \ + EXPECT_EQ(ref_val, val); \ + } \ + } \ +} + +TEST_GP_STORE(sh, 16, int) +TEST_GP_STORE(sw, 32, int) +TEST_GP_STORE(sd, 64, int) +#ifdef __riscv_compressed +TEST_GP_STORE(c_sw, 32, int) +TEST_GP_STORE(c_sd, 64, int) +TEST_GP_STORE(c_sdsp, 64, int) +#endif + +#define __TEST_FPU_LOAD(__type, __inst, __reg_start, __reg_end) \ +TEST(fpu_load_ ## __inst) \ +{ \ + int ret, offset, fp_reg, val_i; \ + uint8_t buf[16] __attribute__((aligned(16))); \ + __type test_val[] = __type ## _TEST_VALUES; \ + \ + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT); \ + ASSERT_EQ(ret, 0); \ + \ + for (fp_reg = __reg_start; fp_reg < __reg_end; fp_reg++) { \ + for (offset = 1; offset < 4; offset++) { \ + for (val_i = 0; val_i < ARRAY_SIZE(test_val); val_i++) { \ + __type val, ref_val = test_val[val_i]; \ + void *load_addr = (buf + offset); \ + \ + memcpy(load_addr, &ref_val, sizeof(ref_val)); \ + val = test_ ## __inst(fp_reg, load_addr, offset, ref_val); \ + EXPECT_TRUE(__type ##_equal(val, ref_val)); \ + } \ + } \ + } \ +} + +#define TEST_FPU_LOAD(__type, __inst) \ + __TEST_FPU_LOAD(__type, __inst, 0, 32) +#define TEST_FPU_LOAD_COMPRESSED(__type, __inst) \ + __TEST_FPU_LOAD(__type, __inst, 8, 16) + +TEST_FPU_LOAD(float, flw) +TEST_FPU_LOAD(double, fld) +#ifdef __riscv_compressed +TEST_FPU_LOAD_COMPRESSED(double, c_fld) +TEST_FPU_LOAD_COMPRESSED(double, c_fldsp) +#endif + +#define __TEST_FPU_STORE(__type, __inst, __reg_start, __reg_end) \ +TEST(fpu_store_ ## __inst) \ +{ \ + int ret, offset, fp_reg, val_i; \ + uint8_t buf[16] __attribute__((aligned(16))); \ + __type test_val[] = __type ## _TEST_VALUES; \ + \ + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT); \ + ASSERT_EQ(ret, 0); \ + \ + for (fp_reg = __reg_start; fp_reg < __reg_end; fp_reg++) { \ + for (offset = 1; offset < 4; offset++) { \ + for (val_i = 0; val_i < ARRAY_SIZE(test_val); val_i++) { \ + __type val, ref_val = test_val[val_i]; \ + \ + void *store_addr = (buf + offset); \ + \ + test_ ## __inst(fp_reg, store_addr, offset, ref_val); \ + memcpy(&val, store_addr, sizeof(val)); \ + EXPECT_TRUE(__type ## _equal(val, ref_val)); \ + } \ + } \ + } \ +} +#define TEST_FPU_STORE(__type, __inst) \ + __TEST_FPU_STORE(__type, __inst, 0, 32) +#define TEST_FPU_STORE_COMPRESSED(__type, __inst) \ + __TEST_FPU_STORE(__type, __inst, 8, 16) + +TEST_FPU_STORE(float, fsw) +TEST_FPU_STORE(double, fsd) +#ifdef __riscv_compressed +TEST_FPU_STORE_COMPRESSED(double, c_fsd) +TEST_FPU_STORE_COMPRESSED(double, c_fsdsp) +#endif + +TEST_SIGNAL(gen_sigbus, SIGBUS) +{ + uint32_t val = 0xDEADBEEF; + uint8_t buf[16] __attribute__((aligned(16))); + int ret; + + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_SIGBUS); + ASSERT_EQ(ret, 0); + + asm volatile("sw %0, 1(%1)" : : "r"(val), "r"(buf) : "memory"); +} + +int main(int argc, char **argv) +{ + int ret, val; + + ret = prctl(PR_GET_UNALIGN, &val); + if (ret == -1 && errno == EINVAL) + ksft_exit_skip("SKIP GET_UNALIGN_CTL not supported\n"); + + exit(test_harness_run(argc, argv)); +} -- 2.43.0

3 days, 5 hours

1
0
0 0

[PATCH net-next v6] ipv6: add `force_forwarding` sysctl to enable per-interface forwarding

by Gabriel Goller

It is currently impossible to enable ipv6 forwarding on a per-interface basis like in ipv4. To enable forwarding on an ipv6 interface we need to enable it on all interfaces and disable it on the other interfaces using a netfilter rule. This is especially cumbersome if you have lots of interface and only want to enable forwarding on a few. According to the sysctl docs [0] the `net.ipv6.conf.all.forwarding` enables forwarding for all interfaces, while the interface-specific `net.ipv6.conf.<interface>.forwarding` configures the interface Host/Router configuration. Introduce a new sysctl flag `force_forwarding`, which can be set on every interface. The ip6_forwarding function will then check if the global forwarding flag OR the force_forwarding flag is active and forward the packet. To preserver backwards-compatibility reset the flag (on all interfaces) to 0 if the net.ipv6.conf.all.forwarding flag is set to 0. Add a short selftest that checks if a packet gets forwarded with and without `force_forwarding`. [0]: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt Signed-off-by: Gabriel Goller <g.goller(a)proxmox.com> Acked-by: Nicolas Dichtel <nicolas.dichtel(a)6wind.com> --- v6: * rebase * remove brackts around single line * add 'nodad' to addresses in selftest to avoid sporadic failures v5: https://lore.kernel.org/netdev/20250707094307.223975-1-g.goller@proxmox.com/ * update conf/all/forwarding docs * simplified backwards-compat comment * remove ASSERT_RTNL as it's guaranteed by __in6_dev_get_rtnl_net() already * cange ip6_forward logic so that it doesn't depend on the idev existing * move WRITE_ONCE inside device lock v4: https://lore.kernel.org/netdev/20250703160154.560239-1-g.goller@proxmox.com/ * actually write the sysctl value to the table * use ASSERT_RTNL() when forwarding the sysctl change * remove useless comments in function body * simplify forwarding and force_forwarding check in ip6_output.c * fix code backticks in Documentation (double instead of single) * add selftests v3: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/ * remove forwarding=0 setting force_forwarding=0 globally. * add min and max (0 and 1) value to sysctl. v2: https://lore.kernel.org/netdev/20250701140423.487411-1-g.goller@proxmox.com/ * rename from `do_forwarding` to `force_forwarding`. * add global `force_forwarding` flag which will enable `force_forwarding` on every interface like the `ipv4.all.forwarding` flag. * `forwarding`=0 will disable global and per-interface `force_forwarding`. * export option as NETCONFA_FORCE_FORWARDING. v1: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/ Documentation/networking/ip-sysctl.rst | 9 +- include/linux/ipv6.h | 1 + include/uapi/linux/ipv6.h | 1 + include/uapi/linux/netconf.h | 1 + include/uapi/linux/sysctl.h | 1 + net/ipv6/addrconf.c | 82 ++++++++++++++ net/ipv6/ip6_output.c | 3 +- tools/testing/selftests/net/Makefile | 1 + .../selftests/net/ipv6_force_forwarding.sh | 105 ++++++++++++++++++ 9 files changed, 201 insertions(+), 3 deletions(-) create mode 100755 tools/testing/selftests/net/ipv6_force_forwarding.sh diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 0f1251cce314..6d92bae0257a 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2281,8 +2281,8 @@ conf/all/disable_ipv6 - BOOLEAN conf/all/forwarding - BOOLEAN Enable global IPv6 forwarding between all interfaces. - IPv4 and IPv6 work differently here; e.g. netfilter must be used - to control which interfaces may forward packets and which not. + IPv4 and IPv6 work differently here; the ``force_forwarding`` flag must + be used to control which interfaces may forward packets. This also sets all interfaces' Host/Router setting 'forwarding' to the specified value. See below for details. @@ -2292,6 +2292,11 @@ conf/all/forwarding - BOOLEAN proxy_ndp - BOOLEAN Do proxy ndp. +force_forwarding - BOOLEAN + Enable forwarding on this interface only -- regardless of the setting on + ``conf/all/forwarding``. When setting ``conf.all.forwarding`` to 0, + the ``force_forwarding`` flag will be reset on all interfaces. + fwmark_reflect - BOOLEAN Controls the fwmark of kernel-generated IPv6 reply packets that are not associated with a socket for example, TCP RSTs or ICMPv6 echo replies). diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index 5aeeed22f35b..d975a86f29be 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -17,6 +17,7 @@ struct ipv6_devconf { __s32 hop_limit; __s32 mtu6; __s32 forwarding; + __s32 force_forwarding; __s32 disable_policy; __s32 proxy_ndp; __cacheline_group_end(ipv6_devconf_read_txrx); diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h index cf592d7b630f..d4d3ae774b26 100644 --- a/include/uapi/linux/ipv6.h +++ b/include/uapi/linux/ipv6.h @@ -199,6 +199,7 @@ enum { DEVCONF_NDISC_EVICT_NOCARRIER, DEVCONF_ACCEPT_UNTRACKED_NA, DEVCONF_ACCEPT_RA_MIN_LFT, + DEVCONF_FORCE_FORWARDING, DEVCONF_MAX }; diff --git a/include/uapi/linux/netconf.h b/include/uapi/linux/netconf.h index fac4edd55379..1c8c84d65ae3 100644 --- a/include/uapi/linux/netconf.h +++ b/include/uapi/linux/netconf.h @@ -19,6 +19,7 @@ enum { NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN, NETCONFA_INPUT, NETCONFA_BC_FORWARDING, + NETCONFA_FORCE_FORWARDING, __NETCONFA_MAX }; #define NETCONFA_MAX (__NETCONFA_MAX - 1) diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index 8981f00204db..63d1464cb71c 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -573,6 +573,7 @@ enum { NET_IPV6_ACCEPT_RA_FROM_LOCAL=26, NET_IPV6_ACCEPT_RA_RT_INFO_MIN_PLEN=27, NET_IPV6_RA_DEFRTR_METRIC=28, + NET_IPV6_FORCE_FORWARDING=29, __NET_IPV6_MAX }; diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index ba2ec7c870cc..580aed034849 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -239,6 +239,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = { .ndisc_evict_nocarrier = 1, .ra_honor_pio_life = 0, .ra_honor_pio_pflag = 0, + .force_forwarding = 0, }; static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { @@ -303,6 +304,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { .ndisc_evict_nocarrier = 1, .ra_honor_pio_life = 0, .ra_honor_pio_pflag = 0, + .force_forwarding = 0, }; /* Check if link is ready: is it up and is a valid qdisc available */ @@ -857,6 +859,9 @@ static void addrconf_forward_change(struct net *net, __s32 newf) idev = __in6_dev_get_rtnl_net(dev); if (idev) { int changed = (!idev->cnf.forwarding) ^ (!newf); + /* Disabling all.forwarding sets 0 to force_forwarding for all interfaces */ + if (newf == 0) + WRITE_ONCE(idev->cnf.force_forwarding, 0); WRITE_ONCE(idev->cnf.forwarding, newf); if (changed) @@ -5719,6 +5724,7 @@ static void ipv6_store_devconf(const struct ipv6_devconf *cnf, array[DEVCONF_ACCEPT_UNTRACKED_NA] = READ_ONCE(cnf->accept_untracked_na); array[DEVCONF_ACCEPT_RA_MIN_LFT] = READ_ONCE(cnf->accept_ra_min_lft); + array[DEVCONF_FORCE_FORWARDING] = READ_ONCE(cnf->force_forwarding); } static inline size_t inet6_ifla6_size(void) @@ -6747,6 +6753,75 @@ static int addrconf_sysctl_disable_policy(const struct ctl_table *ctl, int write return ret; } +static void addrconf_force_forward_change(struct net *net, __s32 newf) +{ + struct net_device *dev; + struct inet6_dev *idev; + + for_each_netdev(net, dev) { + idev = __in6_dev_get_rtnl_net(dev); + if (idev) { + int changed = (!idev->cnf.force_forwarding) ^ (!newf); + + WRITE_ONCE(idev->cnf.force_forwarding, newf); + if (changed) + inet6_netconf_notify_devconf(dev_net(dev), RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + dev->ifindex, &idev->cnf); + } + } +} + +static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct inet6_dev *idev = ctl->extra1; + struct ctl_table tmp_ctl = *ctl; + struct net *net = ctl->extra2; + int *valp = ctl->data; + int new_val = *valp; + int old_val = *valp; + loff_t pos = *ppos; + int ret; + + tmp_ctl.extra1 = SYSCTL_ZERO; + tmp_ctl.extra2 = SYSCTL_ONE; + tmp_ctl.data = &new_val; + + ret = proc_douintvec_minmax(&tmp_ctl, write, buffer, lenp, ppos); + + if (write && old_val != new_val) { + if (!rtnl_net_trylock(net)) + return restart_syscall(); + + WRITE_ONCE(*valp, new_val); + + if (valp == &net->ipv6.devconf_dflt->force_forwarding) { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + NETCONFA_IFINDEX_DEFAULT, + net->ipv6.devconf_dflt); + } else if (valp == &net->ipv6.devconf_all->force_forwarding) { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + NETCONFA_IFINDEX_ALL, + net->ipv6.devconf_all); + + addrconf_force_forward_change(net, new_val); + } else { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + idev->dev->ifindex, + &idev->cnf); + } + rtnl_net_unlock(net); + } + + if (ret) + *ppos = pos; + return ret; +} + static int minus_one = -1; static const int two_five_five = 255; static u32 ioam6_if_id_max = U16_MAX; @@ -7217,6 +7292,13 @@ static const struct ctl_table addrconf_sysctl[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_TWO, }, + { + .procname = "force_forwarding", + .data = &ipv6_devconf.force_forwarding, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = addrconf_sysctl_force_forwarding, + }, }; static int __addrconf_sysctl_register(struct net *net, char *dev_name, diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 7bd29a9ff0db..3853090d7282 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -509,7 +509,8 @@ int ip6_forward(struct sk_buff *skb) u32 mtu; idev = __in6_dev_get_safely(dev_get_by_index_rcu(net, IP6CB(skb)->iif)); - if (READ_ONCE(net->ipv6.devconf_all->forwarding) == 0) + if (!READ_ONCE(net->ipv6.devconf_all->forwarding) && + (!idev || !READ_ONCE(idev->cnf.force_forwarding))) goto error; if (skb->pkt_type != PACKET_HOST) diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 332f387615d7..f64ec8a15a77 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -112,6 +112,7 @@ TEST_PROGS += skf_net_off.sh TEST_GEN_FILES += skf_net_off TEST_GEN_FILES += tfo TEST_PROGS += tfo_passive.sh +TEST_PROGS += ipv6_force_forwarding.sh # YNL files, must be before "include ..lib.mk" YNL_GEN_FILES := busy_poller netlink-dumps diff --git a/tools/testing/selftests/net/ipv6_force_forwarding.sh b/tools/testing/selftests/net/ipv6_force_forwarding.sh new file mode 100755 index 000000000000..bf0243366caa --- /dev/null +++ b/tools/testing/selftests/net/ipv6_force_forwarding.sh @@ -0,0 +1,105 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Test IPv6 force_forwarding interface property +# +# This test verifies that the force_forwarding property works correctly: +# - When global forwarding is disabled, packets are not forwarded normally +# - When force_forwarding is enabled on an interface, packets are forwarded +# regardless of the global forwarding setting + +source lib.sh + +cleanup() { + cleanup_ns $ns1 $ns2 $ns3 +} + +trap cleanup EXIT + +setup_test() { + # Create three namespaces: sender, router, receiver + setup_ns ns1 ns2 ns3 + + # Create veth pairs: ns1 <-> ns2 <-> ns3 + ip link add name veth12 type veth peer name veth21 + ip link add name veth23 type veth peer name veth32 + + # Move interfaces to namespaces + ip link set veth12 netns $ns1 + ip link set veth21 netns $ns2 + ip link set veth23 netns $ns2 + ip link set veth32 netns $ns3 + + # Configure interfaces + ip -n $ns1 addr add 2001:db8:1::1/64 dev veth12 nodad + ip -n $ns2 addr add 2001:db8:1::2/64 dev veth21 nodad + ip -n $ns2 addr add 2001:db8:2::1/64 dev veth23 nodad + ip -n $ns3 addr add 2001:db8:2::2/64 dev veth32 nodad + + # Bring up interfaces + ip -n $ns1 link set veth12 up + ip -n $ns2 link set veth21 up + ip -n $ns2 link set veth23 up + ip -n $ns3 link set veth32 up + + # Add routes + ip -n $ns1 route add 2001:db8:2::/64 via 2001:db8:1::2 + ip -n $ns3 route add 2001:db8:1::/64 via 2001:db8:2::1 + + # Disable global forwarding + ip netns exec $ns2 sysctl -qw net.ipv6.conf.all.forwarding=0 +} + +test_force_forwarding() { + local ret=0 + + echo "TEST: force_forwarding functionality" + + # Check if force_forwarding sysctl exists + if ! ip netns exec $ns2 test -f /proc/sys/net/ipv6/conf/veth21/force_forwarding; then + echo "SKIP: force_forwarding not available" + return $ksft_skip + fi + + # Test 1: Without force_forwarding, ping should fail + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=0 + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=0 + + if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then + echo "FAIL: ping succeeded when forwarding disabled" + ret=1 + else + echo "PASS: forwarding disabled correctly" + fi + + # Test 2: With force_forwarding enabled, ping should succeed + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=1 + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=1 + + if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then + echo "PASS: force_forwarding enabled forwarding" + else + echo "FAIL: ping failed with force_forwarding enabled" + ret=1 + fi + + return $ret +} + +echo "IPv6 force_forwarding test" +echo "==========================" + +setup_test +test_force_forwarding +ret=$? + +if [ $ret -eq 0 ]; then + echo "OK" + exit 0 +elif [ $ret -eq $ksft_skip ]; then + echo "SKIP" + exit $ksft_skip +else + echo "FAIL" + exit 1 +fi -- 2.39.5

3 days, 6 hours

1
0
0 0

[PATCH net-next v5] ipv6: add `force_forwarding` sysctl to enable per-interface forwarding

by Gabriel Goller

It is currently impossible to enable ipv6 forwarding on a per-interface basis like in ipv4. To enable forwarding on an ipv6 interface we need to enable it on all interfaces and disable it on the other interfaces using a netfilter rule. This is especially cumbersome if you have lots of interface and only want to enable forwarding on a few. According to the sysctl docs [0] the `net.ipv6.conf.all.forwarding` enables forwarding for all interfaces, while the interface-specific `net.ipv6.conf.<interface>.forwarding` configures the interface Host/Router configuration. Introduce a new sysctl flag `force_forwarding`, which can be set on every interface. The ip6_forwarding function will then check if the global forwarding flag OR the force_forwarding flag is active and forward the packet. To preserver backwards-compatibility reset the flag (on all interfaces) to 0 if the net.ipv6.conf.all.forwarding flag is set to 0. Add a short selftest that checks if a packet gets forwarded with and without `force_forwarding`. [0]: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt Signed-off-by: Gabriel Goller <g.goller(a)proxmox.com> --- v5: * update conf/all/forwarding docs * simplified backwards-compat comment * remove ASSERT_RTNL as it's guaranteed by __in6_dev_get_rtnl_net() already * cange ip6_forward logic so that it doesn't depend on the idev existing * move WRITE_ONCE inside device lock v4: https://lore.kernel.org/netdev/20250703160154.560239-1-g.goller@proxmox.com/ * actually write the sysctl value to the table * use ASSERT_RTNL() when forwarding the sysctl change * remove useless comments in function body * simplify forwarding and force_forwarding check in ip6_output.c * fix code backticks in Documentation (double instead of single) * add selftests v3: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/ * remove forwarding=0 setting force_forwarding=0 globally. * add min and max (0 and 1) value to sysctl. v2: https://lore.kernel.org/netdev/20250701140423.487411-1-g.goller@proxmox.com/ * rename from `do_forwarding` to `force_forwarding`. * add global `force_forwarding` flag which will enable `force_forwarding` on every interface like the `ipv4.all.forwarding` flag. * `forwarding`=0 will disable global and per-interface `force_forwarding`. * export option as NETCONFA_FORCE_FORWARDING. v1: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/ Documentation/networking/ip-sysctl.rst | 9 +- include/linux/ipv6.h | 1 + include/uapi/linux/ipv6.h | 1 + include/uapi/linux/netconf.h | 1 + include/uapi/linux/sysctl.h | 1 + net/ipv6/addrconf.c | 83 ++++++++++++++ net/ipv6/ip6_output.c | 3 +- tools/testing/selftests/net/Makefile | 1 + .../selftests/net/ipv6_force_forwarding.sh | 105 ++++++++++++++++++ 9 files changed, 202 insertions(+), 3 deletions(-) create mode 100644 tools/testing/selftests/net/ipv6_force_forwarding.sh diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 0f1251cce314..6d92bae0257a 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2281,8 +2281,8 @@ conf/all/disable_ipv6 - BOOLEAN conf/all/forwarding - BOOLEAN Enable global IPv6 forwarding between all interfaces. - IPv4 and IPv6 work differently here; e.g. netfilter must be used - to control which interfaces may forward packets and which not. + IPv4 and IPv6 work differently here; the ``force_forwarding`` flag must + be used to control which interfaces may forward packets. This also sets all interfaces' Host/Router setting 'forwarding' to the specified value. See below for details. @@ -2292,6 +2292,11 @@ conf/all/forwarding - BOOLEAN proxy_ndp - BOOLEAN Do proxy ndp. +force_forwarding - BOOLEAN + Enable forwarding on this interface only -- regardless of the setting on + ``conf/all/forwarding``. When setting ``conf.all.forwarding`` to 0, + the ``force_forwarding`` flag will be reset on all interfaces. + fwmark_reflect - BOOLEAN Controls the fwmark of kernel-generated IPv6 reply packets that are not associated with a socket for example, TCP RSTs or ICMPv6 echo replies). diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index 5aeeed22f35b..d975a86f29be 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -17,6 +17,7 @@ struct ipv6_devconf { __s32 hop_limit; __s32 mtu6; __s32 forwarding; + __s32 force_forwarding; __s32 disable_policy; __s32 proxy_ndp; __cacheline_group_end(ipv6_devconf_read_txrx); diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h index cf592d7b630f..d4d3ae774b26 100644 --- a/include/uapi/linux/ipv6.h +++ b/include/uapi/linux/ipv6.h @@ -199,6 +199,7 @@ enum { DEVCONF_NDISC_EVICT_NOCARRIER, DEVCONF_ACCEPT_UNTRACKED_NA, DEVCONF_ACCEPT_RA_MIN_LFT, + DEVCONF_FORCE_FORWARDING, DEVCONF_MAX }; diff --git a/include/uapi/linux/netconf.h b/include/uapi/linux/netconf.h index fac4edd55379..1c8c84d65ae3 100644 --- a/include/uapi/linux/netconf.h +++ b/include/uapi/linux/netconf.h @@ -19,6 +19,7 @@ enum { NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN, NETCONFA_INPUT, NETCONFA_BC_FORWARDING, + NETCONFA_FORCE_FORWARDING, __NETCONFA_MAX }; #define NETCONFA_MAX (__NETCONFA_MAX - 1) diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index 8981f00204db..63d1464cb71c 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -573,6 +573,7 @@ enum { NET_IPV6_ACCEPT_RA_FROM_LOCAL=26, NET_IPV6_ACCEPT_RA_RT_INFO_MIN_PLEN=27, NET_IPV6_RA_DEFRTR_METRIC=28, + NET_IPV6_FORCE_FORWARDING=29, __NET_IPV6_MAX }; diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index ba2ec7c870cc..92acf44febd1 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -239,6 +239,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = { .ndisc_evict_nocarrier = 1, .ra_honor_pio_life = 0, .ra_honor_pio_pflag = 0, + .force_forwarding = 0, }; static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { @@ -303,6 +304,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { .ndisc_evict_nocarrier = 1, .ra_honor_pio_life = 0, .ra_honor_pio_pflag = 0, + .force_forwarding = 0, }; /* Check if link is ready: is it up and is a valid qdisc available */ @@ -857,6 +859,9 @@ static void addrconf_forward_change(struct net *net, __s32 newf) idev = __in6_dev_get_rtnl_net(dev); if (idev) { int changed = (!idev->cnf.forwarding) ^ (!newf); + /* Disabling all.forwarding sets 0 to force_forwarding for all interfaces */ + if (newf == 0) + WRITE_ONCE(idev->cnf.force_forwarding, newf); WRITE_ONCE(idev->cnf.forwarding, newf); if (changed) @@ -5719,6 +5724,7 @@ static void ipv6_store_devconf(const struct ipv6_devconf *cnf, array[DEVCONF_ACCEPT_UNTRACKED_NA] = READ_ONCE(cnf->accept_untracked_na); array[DEVCONF_ACCEPT_RA_MIN_LFT] = READ_ONCE(cnf->accept_ra_min_lft); + array[DEVCONF_FORCE_FORWARDING] = READ_ONCE(cnf->force_forwarding); } static inline size_t inet6_ifla6_size(void) @@ -6747,6 +6753,76 @@ static int addrconf_sysctl_disable_policy(const struct ctl_table *ctl, int write return ret; } +static void addrconf_force_forward_change(struct net *net, __s32 newf) +{ + struct net_device *dev; + struct inet6_dev *idev; + + for_each_netdev(net, dev) { + idev = __in6_dev_get_rtnl_net(dev); + if (idev) { + int changed = (!idev->cnf.force_forwarding) ^ (!newf); + + WRITE_ONCE(idev->cnf.force_forwarding, newf); + if (changed) { + inet6_netconf_notify_devconf(dev_net(dev), RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + dev->ifindex, &idev->cnf); + } + } + } +} + +static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct inet6_dev *idev = ctl->extra1; + struct ctl_table tmp_ctl = *ctl; + struct net *net = ctl->extra2; + int *valp = ctl->data; + int new_val = *valp; + int old_val = *valp; + loff_t pos = *ppos; + int ret; + + tmp_ctl.extra1 = SYSCTL_ZERO; + tmp_ctl.extra2 = SYSCTL_ONE; + tmp_ctl.data = &new_val; + + ret = proc_douintvec_minmax(&tmp_ctl, write, buffer, lenp, ppos); + + if (write && old_val != new_val) { + if (!rtnl_net_trylock(net)) + return restart_syscall(); + + WRITE_ONCE(*valp, new_val); + + if (valp == &net->ipv6.devconf_dflt->force_forwarding) { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + NETCONFA_IFINDEX_DEFAULT, + net->ipv6.devconf_dflt); + } else if (valp == &net->ipv6.devconf_all->force_forwarding) { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + NETCONFA_IFINDEX_ALL, + net->ipv6.devconf_all); + + addrconf_force_forward_change(net, new_val); + } else { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + idev->dev->ifindex, + &idev->cnf); + } + rtnl_net_unlock(net); + } + + if (ret) + *ppos = pos; + return ret; +} + static int minus_one = -1; static const int two_five_five = 255; static u32 ioam6_if_id_max = U16_MAX; @@ -7217,6 +7293,13 @@ static const struct ctl_table addrconf_sysctl[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_TWO, }, + { + .procname = "force_forwarding", + .data = &ipv6_devconf.force_forwarding, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = addrconf_sysctl_force_forwarding, + }, }; static int __addrconf_sysctl_register(struct net *net, char *dev_name, diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 7bd29a9ff0db..3853090d7282 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -509,7 +509,8 @@ int ip6_forward(struct sk_buff *skb) u32 mtu; idev = __in6_dev_get_safely(dev_get_by_index_rcu(net, IP6CB(skb)->iif)); - if (READ_ONCE(net->ipv6.devconf_all->forwarding) == 0) + if (!READ_ONCE(net->ipv6.devconf_all->forwarding) && + (!idev || !READ_ONCE(idev->cnf.force_forwarding))) goto error; if (skb->pkt_type != PACKET_HOST) diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 332f387615d7..f64ec8a15a77 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -112,6 +112,7 @@ TEST_PROGS += skf_net_off.sh TEST_GEN_FILES += skf_net_off TEST_GEN_FILES += tfo TEST_PROGS += tfo_passive.sh +TEST_PROGS += ipv6_force_forwarding.sh # YNL files, must be before "include ..lib.mk" YNL_GEN_FILES := busy_poller netlink-dumps diff --git a/tools/testing/selftests/net/ipv6_force_forwarding.sh b/tools/testing/selftests/net/ipv6_force_forwarding.sh new file mode 100644 index 000000000000..62adc9d4afc9 --- /dev/null +++ b/tools/testing/selftests/net/ipv6_force_forwarding.sh @@ -0,0 +1,105 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Test IPv6 force_forwarding interface property +# +# This test verifies that the force_forwarding property works correctly: +# - When global forwarding is disabled, packets are not forwarded normally +# - When force_forwarding is enabled on an interface, packets are forwarded +# regardless of the global forwarding setting + +source lib.sh + +cleanup() { + cleanup_ns $ns1 $ns2 $ns3 +} + +trap cleanup EXIT + +setup_test() { + # Create three namespaces: sender, router, receiver + setup_ns ns1 ns2 ns3 + + # Create veth pairs: ns1 <-> ns2 <-> ns3 + ip link add name veth12 type veth peer name veth21 + ip link add name veth23 type veth peer name veth32 + + # Move interfaces to namespaces + ip link set veth12 netns $ns1 + ip link set veth21 netns $ns2 + ip link set veth23 netns $ns2 + ip link set veth32 netns $ns3 + + # Configure interfaces + ip -n $ns1 addr add 2001:db8:1::1/64 dev veth12 + ip -n $ns2 addr add 2001:db8:1::2/64 dev veth21 + ip -n $ns2 addr add 2001:db8:2::1/64 dev veth23 + ip -n $ns3 addr add 2001:db8:2::2/64 dev veth32 + + # Bring up interfaces + ip -n $ns1 link set veth12 up + ip -n $ns2 link set veth21 up + ip -n $ns2 link set veth23 up + ip -n $ns3 link set veth32 up + + # Add routes + ip -n $ns1 route add 2001:db8:2::/64 via 2001:db8:1::2 + ip -n $ns3 route add 2001:db8:1::/64 via 2001:db8:2::1 + + # Disable global forwarding + ip netns exec $ns2 sysctl -qw net.ipv6.conf.all.forwarding=0 +} + +test_force_forwarding() { + local ret=0 + + echo "TEST: force_forwarding functionality" + + # Check if force_forwarding sysctl exists + if ! ip netns exec $ns2 test -f /proc/sys/net/ipv6/conf/veth21/force_forwarding; then + echo "SKIP: force_forwarding not available" + return $ksft_skip + fi + + # Test 1: Without force_forwarding, ping should fail + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=0 + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=0 + + if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then + echo "FAIL: ping succeeded when forwarding disabled" + ret=1 + else + echo "PASS: forwarding disabled correctly" + fi + + # Test 2: With force_forwarding enabled, ping should succeed + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=1 + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=1 + + if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then + echo "PASS: force_forwarding enabled forwarding" + else + echo "FAIL: ping failed with force_forwarding enabled" + ret=1 + fi + + return $ret +} + +echo "IPv6 force_forwarding test" +echo "==========================" + +setup_test +test_force_forwarding +ret=$? + +if [ $ret -eq 0 ]; then + echo "OK" + exit 0 +elif [ $ret -eq $ksft_skip ]; then + echo "SKIP" + exit $ksft_skip +else + echo "FAIL" + exit 1 +fi -- 2.39.5

3 days, 6 hours

3
3
0 0

[PATCH v4] selftests/mm: add process_madvise() tests

by wang lian

Add tests for process_madvise(), focusing on verifying behavior under various conditions including valid usage and error cases. Signed-off-by: wang lian <lianux.mm(a)gmail.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> Suggested-by: David Hildenbrand <david(a)redhat.com> Suggested-by: Zi Yan <ziy(a)nvidia.com> Acked-by: SeongJae Park <sj(a)kernel.org> --- Changelog v4: - Refine resource cleanup logic in test teardown to be more robust. - Improve remote_collapse test to correctly handle different THP (Transparent Huge Page) policies ('always', 'madvise', 'never'), including handling race conditions with khugepaged. - Resolve build errors Changelog v3: https://lore.kernel.org/lkml/20250703044326.65061-1-lianux.mm@gmail.com/ - Rebased onto the latest mm-stable branch to ensure clean application. - Refactor common signal handling logic into vm_util to reduce code duplication. - Improve test robustness and diagnostics based on community feedback. - Address minor code style and script corrections. Changelog v2: https://lore.kernel.org/lkml/20250630140957.4000-1-lianux.mm@gmail.com/ - Drop MADV_DONTNEED tests based on feedback. - Focus solely on process_madvise() syscall. - Improve error handling and structure. - Add future-proof flag test. - Style and comment cleanups. -V1: https://lore.kernel.org/lkml/20250621133003.4733-1-lianux.mm@gmail.com/ tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-regions.c | 51 --- tools/testing/selftests/mm/process_madv.c | 447 +++++++++++++++++++++ tools/testing/selftests/mm/run_vmtests.sh | 5 + tools/testing/selftests/mm/vm_util.c | 35 ++ tools/testing/selftests/mm/vm_util.h | 22 + 7 files changed, 511 insertions(+), 51 deletions(-) create mode 100644 tools/testing/selftests/mm/process_madv.c diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 824266982aa3..95bd9c6ead9e 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -25,6 +25,7 @@ pfnmap protection_keys protection_keys_32 protection_keys_64 +process_madv madv_populate uffd-stress uffd-unit-tests diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index ae6f994d3add..d13b3cef2a2b 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -85,6 +85,7 @@ TEST_GEN_FILES += mseal_test TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += pagemap_ioctl TEST_GEN_FILES += pfnmap +TEST_GEN_FILES += process_madv TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += uffd-stress diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index 93af3d3760f9..4cf101b0fe5e 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -9,8 +9,6 @@ #include <linux/limits.h> #include <linux/userfaultfd.h> #include <linux/fs.h> -#include <setjmp.h> -#include <signal.h> #include <stdbool.h> #include <stdio.h> #include <stdlib.h> @@ -24,24 +22,6 @@ #include "../pidfd/pidfd.h" -/* - * Ignore the checkpatch warning, as per the C99 standard, section 7.14.1.1: - * - * "If the signal occurs other than as the result of calling the abort or raise - * function, the behavior is undefined if the signal handler refers to any - * object with static storage duration other than by assigning a value to an - * object declared as volatile sig_atomic_t" - */ -static volatile sig_atomic_t signal_jump_set; -static sigjmp_buf signal_jmp_buf; - -/* - * Ignore the checkpatch warning, we must read from x but don't want to do - * anything with it in order to trigger a read page fault. We therefore must use - * volatile to stop the compiler from optimising this away. - */ -#define FORCE_READ(x) (*(volatile typeof(x) *)x) - /* * How is the test backing the mapping being tested? */ @@ -120,14 +100,6 @@ static int userfaultfd(int flags) return syscall(SYS_userfaultfd, flags); } -static void handle_fatal(int c) -{ - if (!signal_jump_set) - return; - - siglongjmp(signal_jmp_buf, c); -} - static ssize_t sys_process_madvise(int pidfd, const struct iovec *iovec, size_t n, int advice, unsigned int flags) { @@ -180,29 +152,6 @@ static bool try_read_write_buf(char *ptr) return try_read_buf(ptr) && try_write_buf(ptr); } -static void setup_sighandler(void) -{ - struct sigaction act = { - .sa_handler = &handle_fatal, - .sa_flags = SA_NODEFER, - }; - - sigemptyset(&act.sa_mask); - if (sigaction(SIGSEGV, &act, NULL)) - ksft_exit_fail_perror("sigaction"); -} - -static void teardown_sighandler(void) -{ - struct sigaction act = { - .sa_handler = SIG_DFL, - .sa_flags = SA_NODEFER, - }; - - sigemptyset(&act.sa_mask); - sigaction(SIGSEGV, &act, NULL); -} - static int open_file(const char *prefix, char *path) { int fd; diff --git a/tools/testing/selftests/mm/process_madv.c b/tools/testing/selftests/mm/process_madv.c new file mode 100644 index 000000000000..7d7509486d46 --- /dev/null +++ b/tools/testing/selftests/mm/process_madv.c @@ -0,0 +1,447 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <errno.h> +#include <setjmp.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <linux/mman.h> +#include <sys/syscall.h> +#include <unistd.h> +#include <sched.h> +#include <linux/pidfd.h> +#include <linux/uio.h> +#include "vm_util.h" + +FIXTURE(process_madvise) +{ + int pidfd; + int flag; + pid_t child_pid; +}; + +FIXTURE_SETUP(process_madvise) +{ + self->pidfd = PIDFD_SELF; + self->flag = 0; + self->child_pid = -1; + setup_sighandler(); +}; + +FIXTURE_TEARDOWN_PARENT(process_madvise) +{ + teardown_sighandler(); + if (self->child_pid > 0) { + kill(self->child_pid, SIGKILL); + waitpid(self->child_pid, NULL, 0); + } +} + +static ssize_t sys_process_madvise(int pidfd, const struct iovec *iovec, + size_t vlen, int advice, unsigned int flags) +{ + return syscall(__NR_process_madvise, pidfd, iovec, vlen, advice, flags); +} + +/* + * Enable our signal catcher and try to read the specified buffer. The + * return value indicates whether the read succeeds without a fatal + * signal. + */ +static bool try_read_buf(char *ptr) +{ + bool failed; + + /* Tell signal handler to jump back here on fatal signal. */ + signal_jump_set = true; + /* If a fatal signal arose, we will jump back here and failed is set. */ + failed = sigsetjmp(signal_jmp_buf, 0) != 0; + + if (!failed) + FORCE_READ(ptr); + + signal_jump_set = false; + return !failed; +} + +TEST_F(process_madvise, basic) +{ + const unsigned long pagesize = (unsigned long)sysconf(_SC_PAGESIZE); + const int madvise_pages = 4; + char *map; + ssize_t ret; + struct iovec vec[madvise_pages]; + + /* + * Create a single large mapping. We will pick pages from this + * mapping to advise on. This ensures we test non-contiguous iovecs. + */ + map = mmap(NULL, pagesize * 10, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (map == MAP_FAILED) + ksft_exit_skip("mmap failed, not enough memory.\n"); + + /* Fill the entire region with a known pattern. */ + memset(map, 'A', pagesize * 10); + + /* + * Setup the iovec to point to 4 non-contiguous pages + * within the mapping. + */ + vec[0].iov_base = &map[0 * pagesize]; + vec[0].iov_len = pagesize; + vec[1].iov_base = &map[3 * pagesize]; + vec[1].iov_len = pagesize; + vec[2].iov_base = &map[5 * pagesize]; + vec[2].iov_len = pagesize; + vec[3].iov_base = &map[8 * pagesize]; + vec[3].iov_len = pagesize; + + ret = sys_process_madvise(PIDFD_SELF, vec, madvise_pages, MADV_DONTNEED, + 0); + if (ret == -1 && errno == EPERM) + ksft_exit_skip( + "process_madvise() unsupported or permission denied, try running as root.\n"); + else if (errno == EINVAL) + ksft_exit_skip( + "process_madvise() unsupported or parameter invalid, please check arguments.\n"); + + /* The call should succeed and report the total bytes processed. */ + ASSERT_EQ(ret, madvise_pages * pagesize); + + /* Check that advised pages are now zero. */ + for (int i = 0; i < madvise_pages; i++) { + char *advised_page = (char *)vec[i].iov_base; + + /* Access should be successful (kernel provides a new page). */ + ASSERT_TRUE(try_read_buf(advised_page)); + /* Content must be 0, not 'A'. */ + ASSERT_EQ(*advised_page, 0); + } + + /* Check that an un-advised page in between is still 'A'. */ + char *unadvised_page = &map[1 * pagesize]; + + ASSERT_TRUE(try_read_buf(unadvised_page)); + for (int i = 0; i < pagesize; i++) + ASSERT_EQ(unadvised_page[i], 'A'); + + /* Cleanup. */ + ASSERT_EQ(munmap(map, pagesize * 10), 0); +} + +static long get_smaps_anon_huge_pages(pid_t pid, void *addr) +{ + char smaps_path[64]; + char *line = NULL; + unsigned long start, end; + long anon_huge_kb; + size_t len; + FILE *f; + bool in_vma; + + in_vma = false; + snprintf(smaps_path, sizeof(smaps_path), "/proc/%d/smaps", pid); + f = fopen(smaps_path, "r"); + if (!f) + return -1; + + while (getline(&line, &len, f) != -1) { + /* Check if the line describes a VMA range */ + if (sscanf(line, "%lx-%lx", &start, &end) == 2) { + if ((unsigned long)addr >= start && + (unsigned long)addr < end) + in_vma = true; + else + in_vma = false; + continue; + } + + /* If we are in the correct VMA, look for the AnonHugePages field */ + if (in_vma && + sscanf(line, "AnonHugePages: %ld kB", &anon_huge_kb) == 1) + break; + } + + free(line); + fclose(f); + + return (anon_huge_kb > 0) ? (anon_huge_kb * 1024) : 0; +} + +static bool is_thp_always(void) +{ + const char *path = "/sys/kernel/mm/transparent_hugepage/enabled"; + char buf[32]; + FILE *f = fopen(path, "r"); + + if (!f) + return false; + + if (fgets(buf, sizeof(buf), f)) + if (strstr(buf, "[always]")) { + fclose(f); + return true; + } + + fclose(f); + return false; +} + +/** + * TEST_F(process_madvise, remote_collapse) + * + * This test deterministically validates process_madvise() with MADV_COLLAPSE + * on a remote process, other advices are difficult to verify reliably. + * + * The test verifies that a memory region in a child process, initially + * backed by small pages, can be collapsed into a Transparent Huge Page by a + * request from the parent. The result is verified by parsing the child's + * /proc/<pid>/smaps file. + */ +TEST_F(process_madvise, remote_collapse) +{ + const unsigned long pagesize = (unsigned long)sysconf(_SC_PAGESIZE); + int pidfd; + long huge_page_size; + int pipe_info[2]; + ssize_t ret; + struct iovec vec; + + struct child_info { + pid_t pid; + void *map_addr; + } info; + + huge_page_size = default_huge_page_size(); + if (huge_page_size <= 0) + ksft_exit_skip("Could not determine a valid huge page size.\n"); + + ASSERT_EQ(pipe(pipe_info), 0); + + self->child_pid = fork(); + ASSERT_NE(self->child_pid, -1); + + if (self->child_pid == 0) { + char *map; + size_t map_size = 2 * huge_page_size; + + close(pipe_info[0]); + + map = mmap(NULL, map_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + ASSERT_NE(map, MAP_FAILED); + + /* Fault in as small pages */ + for (size_t i = 0; i < map_size; i += pagesize) + map[i] = 'A'; + + /* Send info and pause */ + info.pid = getpid(); + info.map_addr = map; + ret = write(pipe_info[1], &info, sizeof(info)); + ASSERT_EQ(ret, sizeof(info)); + close(pipe_info[1]); + + pause(); + exit(0); + } + + close(pipe_info[1]); + + /* Receive child info */ + ret = read(pipe_info[0], &info, sizeof(info)); + if (ret <= 0) { + waitpid(self->child_pid, NULL, 0); + ksft_exit_skip("Failed to read child info from pipe.\n"); + } + ASSERT_EQ(ret, sizeof(info)); + close(pipe_info[0]); + self->child_pid = info.pid; + + pidfd = syscall(__NR_pidfd_open, self->child_pid, 0); + ASSERT_GE(pidfd, 0); + + vec.iov_base = info.map_addr; + vec.iov_len = huge_page_size; + + if (is_thp_always()) { + long initial_huge_pages; + + /* + * When THP is 'always', khugepaged may pre-emptively + * collapse the pages before our MADV_COLLAPSE call. Check + * the initial state to provide a more accurate test report. + */ + initial_huge_pages = + get_smaps_anon_huge_pages(self->child_pid, info.map_addr); + + if (initial_huge_pages == 2 * huge_page_size) { + /* + * The pages were already collapsed by khugepaged. + * The test goal narrows to verifying that MADV_COLLAPSE + * correctly returns success on an already-collapsed + * region, as documented. + */ + ksft_test_result_skip( + "THP is 'always' and pages were pre-collapsed; verifying success on already-collapsed page.\n"); + + ret = sys_process_madvise(pidfd, &vec, 1, MADV_COLLAPSE, + 0); + ASSERT_EQ(ret, huge_page_size); + goto cleanup; + } + + /* + * Pages are still small, creating a race between our call + * and khugepaged. This is the main test scenario for 'always'. + */ + ret = sys_process_madvise(pidfd, &vec, 1, MADV_COLLAPSE, 0); + + if (ret == -1) { + /* + * MADV_COLLAPSE lost the race to khugepaged, which + * likely held a page lock. The kernel correctly + * reports this temporary contention with EAGAIN. + */ + if (errno == EAGAIN) { + ksft_test_result_skip( + "THP is 'always', process_madvise returned EAGAIN due to an expected race with khugepaged.\n"); + } else { + ksft_test_result_fail( + "process_madvise failed with unexpected errno %d in 'always' mode.\n", + errno); + } + goto cleanup; + } + + /* + * MADV_COLLAPSE won the race and successfully collapsed + * the pages. Verify the final state. + */ + ASSERT_EQ(ret, huge_page_size); + ASSERT_EQ(get_smaps_anon_huge_pages(self->child_pid, info.map_addr), + huge_page_size); + ksft_test_result_pass( + "THP is 'always', MADV_COLLAPSE won race and collapsed pages.\n"); + goto cleanup; + } + + /* + * THP is 'madvise' or 'never'. No race is expected with khugepaged. + * We can perform a straightforward state-change verification. + */ + ASSERT_EQ(get_smaps_anon_huge_pages(self->child_pid, info.map_addr), 0); + + ret = sys_process_madvise(pidfd, &vec, 1, MADV_COLLAPSE, 0); + if (ret == -1) { + if (errno == EINVAL) + ksft_exit_skip( + "PROCESS_MADV_ADVISE is not supported.\n"); + else if (errno == EPERM) + ksft_exit_skip( + "No process_madvise() permissions, try running as root.\n"); + goto cleanup; + } + ASSERT_EQ(ret, huge_page_size); + + ASSERT_EQ(get_smaps_anon_huge_pages(self->child_pid, info.map_addr), + huge_page_size); + + ksft_test_result_pass( + "MADV_COLLAPSE successfully verified via smaps.\n"); + +cleanup: + /* Cleanup */ + kill(self->child_pid, SIGKILL); + waitpid(self->child_pid, NULL, 0); + if (pidfd >= 0) + close(pidfd); +} + +/* + * Test process_madvise() with various invalid pidfds to ensure correct error + * handling. This includes negative fds, non-pidfd fds, and pidfds for + * processes that no longer exist. + */ +TEST_F(process_madvise, invalid_pidfd) +{ + struct iovec vec; + pid_t child_pid; + ssize_t ret; + int pidfd; + + vec.iov_base = (void *)0x1234; + vec.iov_len = 4096; + + /* Using an invalid fd number (-1) should fail with EBADF. */ + ret = sys_process_madvise(-1, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EBADF); + + /* + * Using a valid fd that is not a pidfd (e.g. stdin) should fail + * with EBADF. + */ + ret = sys_process_madvise(STDIN_FILENO, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EBADF); + + /* + * Using a pidfd for a process that has already exited should fail + * with ESRCH. + */ + child_pid = fork(); + ASSERT_NE(child_pid, -1); + + if (child_pid == 0) + exit(0); + + pidfd = syscall(__NR_pidfd_open, child_pid, 0); + ASSERT_GE(pidfd, 0); + + /* Wait for the child to ensure it has terminated. */ + waitpid(child_pid, NULL, 0); + + ret = sys_process_madvise(pidfd, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, ESRCH); + close(pidfd); +} + +/* + * Test process_madvise() with an invalid flag value. Now we only support flag=0 + * future we will use it support sync so reserve this test. + */ +TEST_F(process_madvise, flag) +{ + const unsigned long pagesize = (unsigned long)sysconf(_SC_PAGESIZE); + unsigned int invalid_flag; + struct iovec vec; + char *map; + ssize_t ret; + + map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, + 0); + if (map == MAP_FAILED) + ksft_exit_skip("mmap failed, not enough memory.\n"); + + vec.iov_base = map; + vec.iov_len = pagesize; + + invalid_flag = 0x80000000; + + ret = sys_process_madvise(PIDFD_SELF, &vec, 1, MADV_DONTNEED, + invalid_flag); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EINVAL); + + /* Cleanup. */ + ASSERT_EQ(munmap(map, pagesize), 0); +} + +TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index dddd1dd8af14..84fb51902c3e 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -65,6 +65,8 @@ separated by spaces: test pagemap_scan IOCTL - pfnmap tests for VM_PFNMAP handling +- process_madv + test process_madvise - cow test copy-on-write semantics - thp @@ -422,6 +424,9 @@ CATEGORY="hmm" run_test bash ./test_hmm.sh smoke # MADV_GUARD_INSTALL and MADV_GUARD_REMOVE tests CATEGORY="madv_guard" run_test ./guard-regions +# PROCESS_MADVISE TEST +CATEGORY="process_madv" run_test ./process_madv + # MADV_POPULATE_READ and MADV_POPULATE_WRITE tests CATEGORY="madv_populate" run_test ./madv_populate diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index 5492e3f784df..85b209260e5a 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -20,6 +20,9 @@ unsigned int __page_size; unsigned int __page_shift; +volatile sig_atomic_t signal_jump_set; +sigjmp_buf signal_jmp_buf; + uint64_t pagemap_get_entry(int fd, char *start) { const unsigned long pfn = (unsigned long)start / getpagesize(); @@ -524,3 +527,35 @@ int read_sysfs(const char *file_path, unsigned long *val) return 0; } + +static void handle_fatal(int c) +{ + if (!signal_jump_set) + return; + + siglongjmp(signal_jmp_buf, c); +} + +void setup_sighandler(void) +{ + struct sigaction act = { + .sa_handler = &handle_fatal, + .sa_flags = SA_NODEFER, + }; + + sigemptyset(&act.sa_mask); + if (sigaction(SIGSEGV, &act, NULL)) + ksft_exit_fail_perror("sigaction in setup"); +} + +void teardown_sighandler(void) +{ + struct sigaction act = { + .sa_handler = SIG_DFL, + .sa_flags = SA_NODEFER, + }; + + sigemptyset(&act.sa_mask); + if (sigaction(SIGSEGV, &act, NULL)) + ksft_exit_fail_perror("sigaction in teardown"); +} diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index b8136d12a0f8..6bc4177a2807 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -8,6 +8,8 @@ #include <unistd.h> /* _SC_PAGESIZE */ #include "../kselftest.h" #include <linux/fs.h> +#include <setjmp.h> +#include <signal.h> #define BIT_ULL(nr) (1ULL << (nr)) #define PM_SOFT_DIRTY BIT_ULL(55) @@ -61,6 +63,24 @@ static inline void skip_test_dodgy_fs(const char *op_name) ksft_test_result_skip("%s failed with ENOENT. Filesystem might be buggy (9pfs?)\n", op_name); } +/* + * Ignore the checkpatch warning, as per the C99 standard, section 7.14.1.1: + * + * "If the signal occurs other than as the result of calling the abort or raise + * function, the behavior is undefined if the signal handler refers to any + * object with static storage duration other than by assigning a value to an + * object declared as volatile sig_atomic_t" + */ +extern volatile sig_atomic_t signal_jump_set; +extern sigjmp_buf signal_jmp_buf; + +/* + * Ignore the checkpatch warning, we must read from x but don't want to do + * anything with it in order to trigger a read page fault. We therefore must use + * volatile to stop the compiler from optimising this away. + */ +#define FORCE_READ(x) (*(volatile typeof(x) *)x) + uint64_t pagemap_get_entry(int fd, char *start); bool pagemap_is_softdirty(int fd, char *start); bool pagemap_is_swapped(int fd, char *start); @@ -90,6 +110,8 @@ bool find_vma_procmap(struct procmap_fd *procmap, void *address); int close_procmap(struct procmap_fd *procmap); int write_sysfs(const char *file_path, unsigned long val); int read_sysfs(const char *file_path, unsigned long *val); +void setup_sighandler(void); +void teardown_sighandler(void); static inline int open_self_procmap(struct procmap_fd *procmap_out) { -- 2.43.0

3 days, 6 hours

4
11
0 0

[PATCH v3] selftests/mm: add process_madvise() tests

by wang lian

Add tests for process_madvise(), focusing on verifying behavior under various conditions including valid usage and error cases. Signed-off-by: wang lian <lianux.mm(a)gmail.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> Suggested-by: David Hildenbrand <david(a)redhat.com> Acked-by: SeongJae Park <sj(a)kernel.org> --- Changelog v3: - Rebased onto the latest mm-stable branch to ensure clean application. - Refactor common signal handling logic into vm_util to reduce code duplication. - Improve test robustness and diagnostics based on community feedback. - Address minor code style and script corrections. Changelog v2: - Drop MADV_DONTNEED tests based on feedback. - Focus solely on process_madvise() syscall. - Improve error handling and structure. - Add future-proof flag test. - Style and comment cleanups. tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/guard-regions.c | 51 --- tools/testing/selftests/mm/process_madv.c | 358 +++++++++++++++++++++ tools/testing/selftests/mm/run_vmtests.sh | 5 + tools/testing/selftests/mm/vm_util.c | 35 ++ tools/testing/selftests/mm/vm_util.h | 22 ++ 7 files changed, 422 insertions(+), 51 deletions(-) create mode 100644 tools/testing/selftests/mm/process_madv.c diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 824266982aa3..95bd9c6ead9e 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -25,6 +25,7 @@ pfnmap protection_keys protection_keys_32 protection_keys_64 +process_madv madv_populate uffd-stress uffd-unit-tests diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index ae6f994d3add..d13b3cef2a2b 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -85,6 +85,7 @@ TEST_GEN_FILES += mseal_test TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += pagemap_ioctl TEST_GEN_FILES += pfnmap +TEST_GEN_FILES += process_madv TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += uffd-stress diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c index 93af3d3760f9..4cf101b0fe5e 100644 --- a/tools/testing/selftests/mm/guard-regions.c +++ b/tools/testing/selftests/mm/guard-regions.c @@ -9,8 +9,6 @@ #include <linux/limits.h> #include <linux/userfaultfd.h> #include <linux/fs.h> -#include <setjmp.h> -#include <signal.h> #include <stdbool.h> #include <stdio.h> #include <stdlib.h> @@ -24,24 +22,6 @@ #include "../pidfd/pidfd.h" -/* - * Ignore the checkpatch warning, as per the C99 standard, section 7.14.1.1: - * - * "If the signal occurs other than as the result of calling the abort or raise - * function, the behavior is undefined if the signal handler refers to any - * object with static storage duration other than by assigning a value to an - * object declared as volatile sig_atomic_t" - */ -static volatile sig_atomic_t signal_jump_set; -static sigjmp_buf signal_jmp_buf; - -/* - * Ignore the checkpatch warning, we must read from x but don't want to do - * anything with it in order to trigger a read page fault. We therefore must use - * volatile to stop the compiler from optimising this away. - */ -#define FORCE_READ(x) (*(volatile typeof(x) *)x) - /* * How is the test backing the mapping being tested? */ @@ -120,14 +100,6 @@ static int userfaultfd(int flags) return syscall(SYS_userfaultfd, flags); } -static void handle_fatal(int c) -{ - if (!signal_jump_set) - return; - - siglongjmp(signal_jmp_buf, c); -} - static ssize_t sys_process_madvise(int pidfd, const struct iovec *iovec, size_t n, int advice, unsigned int flags) { @@ -180,29 +152,6 @@ static bool try_read_write_buf(char *ptr) return try_read_buf(ptr) && try_write_buf(ptr); } -static void setup_sighandler(void) -{ - struct sigaction act = { - .sa_handler = &handle_fatal, - .sa_flags = SA_NODEFER, - }; - - sigemptyset(&act.sa_mask); - if (sigaction(SIGSEGV, &act, NULL)) - ksft_exit_fail_perror("sigaction"); -} - -static void teardown_sighandler(void) -{ - struct sigaction act = { - .sa_handler = SIG_DFL, - .sa_flags = SA_NODEFER, - }; - - sigemptyset(&act.sa_mask); - sigaction(SIGSEGV, &act, NULL); -} - static int open_file(const char *prefix, char *path) { int fd; diff --git a/tools/testing/selftests/mm/process_madv.c b/tools/testing/selftests/mm/process_madv.c new file mode 100644 index 000000000000..3d26105b4781 --- /dev/null +++ b/tools/testing/selftests/mm/process_madv.c @@ -0,0 +1,358 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <errno.h> +#include <setjmp.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/mman.h> +#include <sys/syscall.h> +#include <unistd.h> +#include <sched.h> +#include <sys/pidfd.h> +#include "vm_util.h" + +#include "../pidfd/pidfd.h" + +FIXTURE(process_madvise) +{ + int pidfd; + int flag; +}; + +FIXTURE_SETUP(process_madvise) +{ + self->pidfd = PIDFD_SELF; + self->flag = 0; + setup_sighandler(); +}; + +FIXTURE_TEARDOWN(process_madvise) +{ + teardown_sighandler(); +} + +static ssize_t sys_process_madvise(int pidfd, const struct iovec *iovec, + size_t vlen, int advice, unsigned int flags) +{ + return syscall(__NR_process_madvise, pidfd, iovec, vlen, advice, flags); +} + +/* + * Enable our signal catcher and try to read the specified buffer. The + * return value indicates whether the read succeeds without a fatal + * signal. + */ +static bool try_read_buf(char *ptr) +{ + bool failed; + + /* Tell signal handler to jump back here on fatal signal. */ + signal_jump_set = true; + /* If a fatal signal arose, we will jump back here and failed is set. */ + failed = sigsetjmp(signal_jmp_buf, 0) != 0; + + if (!failed) + FORCE_READ(ptr); + + signal_jump_set = false; + return !failed; +} + +TEST_F(process_madvise, basic) +{ + const unsigned long pagesize = (unsigned long)sysconf(_SC_PAGESIZE); + const int madvise_pages = 4; + char *map; + ssize_t ret; + struct iovec vec[madvise_pages]; + + /* + * Create a single large mapping. We will pick pages from this + * mapping to advise on. This ensures we test non-contiguous iovecs. + */ + map = mmap(NULL, pagesize * 10, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (map == MAP_FAILED) + ksft_exit_skip("mmap failed, not enough memory.\n"); + + /* Fill the entire region with a known pattern. */ + memset(map, 'A', pagesize * 10); + + /* + * Setup the iovec to point to 4 non-contiguous pages + * within the mapping. + */ + vec[0].iov_base = &map[0 * pagesize]; + vec[0].iov_len = pagesize; + vec[1].iov_base = &map[3 * pagesize]; + vec[1].iov_len = pagesize; + vec[2].iov_base = &map[5 * pagesize]; + vec[2].iov_len = pagesize; + vec[3].iov_base = &map[8 * pagesize]; + vec[3].iov_len = pagesize; + + ret = sys_process_madvise(PIDFD_SELF, vec, madvise_pages, MADV_DONTNEED, + 0); + if (ret == -1 && errno == EPERM) + ksft_exit_skip( + "process_madvise() unsupported or permission denied, try running as root.\n"); + else if (errno == EINVAL) + ksft_exit_skip( + "process_madvise() unsupported or parameter invalid, please check arguments.\n"); + + /* The call should succeed and report the total bytes processed. */ + ASSERT_EQ(ret, madvise_pages * pagesize); + + /* Check that advised pages are now zero. */ + for (int i = 0; i < madvise_pages; i++) { + char *advised_page = (char *)vec[i].iov_base; + + /* Access should be successful (kernel provides a new page). */ + ASSERT_TRUE(try_read_buf(advised_page)); + /* Content must be 0, not 'A'. */ + ASSERT_EQ(*advised_page, 0); + } + + /* Check that an un-advised page in between is still 'A'. */ + char *unadvised_page = &map[1 * pagesize]; + + ASSERT_TRUE(try_read_buf(unadvised_page)); + for (int i = 0; i < pagesize; i++) + ASSERT_EQ(unadvised_page[i], 'A'); + + /* Cleanup. */ + ASSERT_EQ(munmap(map, pagesize * 10), 0); +} + +static long get_smaps_anon_huge_pages(pid_t pid, void *addr) +{ + char smaps_path[64]; + char *line = NULL; + unsigned long start, end; + long anon_huge_kb; + size_t len; + FILE *f; + bool in_vma; + + in_vma = false; + snprintf(smaps_path, sizeof(smaps_path), "/proc/%d/smaps", pid); + f = fopen(smaps_path, "r"); + if (!f) + return -1; + + while (getline(&line, &len, f) != -1) { + /* Check if the line describes a VMA range */ + if (sscanf(line, "%lx-%lx", &start, &end) == 2) { + if ((unsigned long)addr >= start && + (unsigned long)addr < end) + in_vma = true; + else + in_vma = false; + continue; + } + + /* If we are in the correct VMA, look for the AnonHugePages field */ + if (in_vma && + sscanf(line, "AnonHugePages: %ld kB", &anon_huge_kb) == 1) + break; + } + + free(line); + fclose(f); + + return (anon_huge_kb > 0) ? (anon_huge_kb * 1024) : 0; +} + +/** + * TEST_F(process_madvise, remote_collapse) + * + * This test deterministically validates process_madvise() with MADV_COLLAPSE + * on a remote process, other advices are difficult to verify reliably. + * + * The test verifies that a memory region in a child process, initially + * backed by small pages, can be collapsed into a Transparent Huge Page by a + * request from the parent. The result is verified by parsing the child's + * /proc/<pid>/smaps file. + */ +TEST_F(process_madvise, remote_collapse) +{ + const unsigned long pagesize = (unsigned long)sysconf(_SC_PAGESIZE); + pid_t child_pid; + int pidfd; + long huge_page_size; + int pipe_info[2]; + ssize_t ret; + struct iovec vec; + + struct child_info { + pid_t pid; + void *map_addr; + } info; + + huge_page_size = default_huge_page_size(); + if (huge_page_size <= 0) + ksft_exit_skip("Could not determine a valid huge page size.\n"); + + ASSERT_EQ(pipe(pipe_info), 0); + + child_pid = fork(); + ASSERT_NE(child_pid, -1); + + if (child_pid == 0) { + char *map; + size_t map_size = 2 * huge_page_size; + + close(pipe_info[0]); + + map = mmap(NULL, map_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + ASSERT_NE(map, MAP_FAILED); + + /* Fault in as small pages */ + for (size_t i = 0; i < map_size; i += pagesize) + map[i] = 'A'; + + /* Send info and pause */ + info.pid = getpid(); + info.map_addr = map; + ret = write(pipe_info[1], &info, sizeof(info)); + ASSERT_EQ(ret, sizeof(info)); + close(pipe_info[1]); + + pause(); + exit(0); + } + + close(pipe_info[1]); + + /* Receive child info */ + ret = read(pipe_info[0], &info, sizeof(info)); + if (ret <= 0) { + waitpid(child_pid, NULL, 0); + ksft_exit_skip("Failed to read child info from pipe.\n"); + } + ASSERT_EQ(ret, sizeof(info)); + close(pipe_info[0]); + child_pid = info.pid; + + pidfd = pidfd_open(child_pid, 0); + ASSERT_GE(pidfd, 0); + + /* Baseline Check from Parent's perspective */ + ASSERT_EQ(get_smaps_anon_huge_pages(child_pid, info.map_addr), 0); + + vec.iov_base = info.map_addr; + vec.iov_len = huge_page_size; + ret = sys_process_madvise(pidfd, &vec, 1, MADV_COLLAPSE, 0); + if (ret == -1) { + if (errno == EINVAL) + ksft_exit_skip( + "PROCESS_MADV_ADVISE is not supported.\n"); + else if (errno == EPERM) + ksft_exit_skip( + "No process_madvise() permissions, try running as root.\n"); + goto cleanup; + } + ASSERT_EQ(ret, huge_page_size); + + ASSERT_EQ(get_smaps_anon_huge_pages(child_pid, info.map_addr), + huge_page_size); + + ksft_test_result_pass( + "MADV_COLLAPSE successfully verified via smaps.\n"); + +cleanup: + /* Cleanup */ + kill(child_pid, SIGKILL); + waitpid(child_pid, NULL, 0); + if (pidfd >= 0) + close(pidfd); +} + +/* + * Test process_madvise() with various invalid pidfds to ensure correct error + * handling. This includes negative fds, non-pidfd fds, and pidfds for + * processes that no longer exist. + */ +TEST_F(process_madvise, invalid_pidfd) +{ + struct iovec vec; + pid_t child_pid; + ssize_t ret; + int pidfd; + + vec.iov_base = (void *)0x1234; + vec.iov_len = 4096; + + /* Using an invalid fd number (-1) should fail with EBADF. */ + ret = sys_process_madvise(-1, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EBADF); + + /* + * Using a valid fd that is not a pidfd (e.g. stdin) should fail + * with EBADF. + */ + ret = sys_process_madvise(STDIN_FILENO, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EBADF); + + /* + * Using a pidfd for a process that has already exited should fail + * with ESRCH. + */ + child_pid = fork(); + ASSERT_NE(child_pid, -1); + + if (child_pid == 0) + exit(0); + + pidfd = pidfd_open(child_pid, 0); + ASSERT_GE(pidfd, 0); + + /* Wait for the child to ensure it has terminated. */ + waitpid(child_pid, NULL, 0); + + ret = sys_process_madvise(pidfd, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, ESRCH); + close(pidfd); +} + +/* + * Test process_madvise() with an invalid flag value. Now we only support flag=0 + * future we will use it support sync so reserve this test. + */ +TEST_F(process_madvise, flag) +{ + const unsigned long pagesize = (unsigned long)sysconf(_SC_PAGESIZE); + unsigned int invalid_flag; + struct iovec vec; + char *map; + ssize_t ret; + + map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, + 0); + if (map == MAP_FAILED) + ksft_exit_skip("mmap failed, not enough memory.\n"); + + vec.iov_base = map; + vec.iov_len = pagesize; + + invalid_flag = 0x80000000; + + ret = sys_process_madvise(PIDFD_SELF, &vec, 1, MADV_DONTNEED, + invalid_flag); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EINVAL); + + /* Cleanup. */ + ASSERT_EQ(munmap(map, pagesize), 0); +} + +TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index dddd1dd8af14..84fb51902c3e 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -65,6 +65,8 @@ separated by spaces: test pagemap_scan IOCTL - pfnmap tests for VM_PFNMAP handling +- process_madv + test process_madvise - cow test copy-on-write semantics - thp @@ -422,6 +424,9 @@ CATEGORY="hmm" run_test bash ./test_hmm.sh smoke # MADV_GUARD_INSTALL and MADV_GUARD_REMOVE tests CATEGORY="madv_guard" run_test ./guard-regions +# PROCESS_MADVISE TEST +CATEGORY="process_madv" run_test ./process_madv + # MADV_POPULATE_READ and MADV_POPULATE_WRITE tests CATEGORY="madv_populate" run_test ./madv_populate diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index 5492e3f784df..85b209260e5a 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -20,6 +20,9 @@ unsigned int __page_size; unsigned int __page_shift; +volatile sig_atomic_t signal_jump_set; +sigjmp_buf signal_jmp_buf; + uint64_t pagemap_get_entry(int fd, char *start) { const unsigned long pfn = (unsigned long)start / getpagesize(); @@ -524,3 +527,35 @@ int read_sysfs(const char *file_path, unsigned long *val) return 0; } + +static void handle_fatal(int c) +{ + if (!signal_jump_set) + return; + + siglongjmp(signal_jmp_buf, c); +} + +void setup_sighandler(void) +{ + struct sigaction act = { + .sa_handler = &handle_fatal, + .sa_flags = SA_NODEFER, + }; + + sigemptyset(&act.sa_mask); + if (sigaction(SIGSEGV, &act, NULL)) + ksft_exit_fail_perror("sigaction in setup"); +} + +void teardown_sighandler(void) +{ + struct sigaction act = { + .sa_handler = SIG_DFL, + .sa_flags = SA_NODEFER, + }; + + sigemptyset(&act.sa_mask); + if (sigaction(SIGSEGV, &act, NULL)) + ksft_exit_fail_perror("sigaction in teardown"); +} diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index b8136d12a0f8..6bc4177a2807 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -8,6 +8,8 @@ #include <unistd.h> /* _SC_PAGESIZE */ #include "../kselftest.h" #include <linux/fs.h> +#include <setjmp.h> +#include <signal.h> #define BIT_ULL(nr) (1ULL << (nr)) #define PM_SOFT_DIRTY BIT_ULL(55) @@ -61,6 +63,24 @@ static inline void skip_test_dodgy_fs(const char *op_name) ksft_test_result_skip("%s failed with ENOENT. Filesystem might be buggy (9pfs?)\n", op_name); } +/* + * Ignore the checkpatch warning, as per the C99 standard, section 7.14.1.1: + * + * "If the signal occurs other than as the result of calling the abort or raise + * function, the behavior is undefined if the signal handler refers to any + * object with static storage duration other than by assigning a value to an + * object declared as volatile sig_atomic_t" + */ +extern volatile sig_atomic_t signal_jump_set; +extern sigjmp_buf signal_jmp_buf; + +/* + * Ignore the checkpatch warning, we must read from x but don't want to do + * anything with it in order to trigger a read page fault. We therefore must use + * volatile to stop the compiler from optimising this away. + */ +#define FORCE_READ(x) (*(volatile typeof(x) *)x) + uint64_t pagemap_get_entry(int fd, char *start); bool pagemap_is_softdirty(int fd, char *start); bool pagemap_is_swapped(int fd, char *start); @@ -90,6 +110,8 @@ bool find_vma_procmap(struct procmap_fd *procmap, void *address); int close_procmap(struct procmap_fd *procmap); int write_sysfs(const char *file_path, unsigned long val); int read_sysfs(const char *file_path, unsigned long *val); +void setup_sighandler(void); +void teardown_sighandler(void); static inline int open_self_procmap(struct procmap_fd *procmap_out) { -- 2.43.0

3 days, 6 hours

5
13
0 0

[PATCH bpf-next 0/3] bpf: Show precise rejected function when attaching to __noreturn and __btf_id functions

by KaFai Wan

Show precise rejected function name when attaching to __noreturn and __btf_id functions. Add selftest for attaching tracing to __btf_id functions. --- KaFai Wan (3): bpf: Show precise rejected function when attaching to __noreturn functions bpf: Show precise rejected function when attaching to __btf_id functions selftests/bpf: Add selftest for attaching tracing to __btf_id functions kernel/bpf/verifier.c | 5 ++++- .../selftests/bpf/prog_tests/tracing_btf_ids.c | 16 ++++++++++++++++ .../selftests/bpf/progs/fexit_noreturns.c | 2 +- .../selftests/bpf/progs/tracing_btf_ids.c | 15 +++++++++++++++ 4 files changed, 36 insertions(+), 2 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/tracing_btf_ids.c create mode 100644 tools/testing/selftests/bpf/progs/tracing_btf_ids.c -- 2.43.0

3 days, 6 hours

3
7
0 0

[PATCH 00/10] mm/mremap: permit mremap() move of multiple VMAs

by Lorenzo Stoakes

Historically we've made it a uAPI requirement that mremap() may only operate on a single VMA at a time. For instances where VMAs need to be resized, this makes sense, as it becomes very difficult to determine what a user actually wants should they indicate a desire to expand or shrink the size of multiple VMAs (truncate? Adjust sizes individually? Some other strategy?). However, in instances where a user is moving VMAs, it is restrictive to disallow this. This is especially the case when anonymous mapping remap may or may not be mergeable depending on whether VMAs have or have not been faulted due to anon_vma assignment and folio index alignment with vma->vm_pgoff. Often this can result in surprising impact where a moved region is faulted, then moved back and a user fails to observe a merge from otherwise compatible, adjacent VMAs. This change allows such cases to work without the user having to be cognizant of whether a prior mremap() move or other VMA operations has resulted in VMA fragmentation. In order to do this, this series performs a large amount of refactoring, most pertinently - grouping sanity checks together, separately those that check input parameters and those relating to VMAs. we also simplify the post-mmap lock drop processing for uffd and mlock()'d VMAs. With this done, we can then fairly straightforwardly implement this functionality. This works exclusively for mremap() invocations which specify MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the notification of the userland fault handler would require us to drop the mmap lock. The input and output addresses ranges must not overlap. We carefully account for moves which would result in VMA merges or would otherwise result in VMA iterator invalidation. Lorenzo Stoakes (10): mm/mremap: perform some simple cleanups mm/mremap: refactor initial parameter sanity checks mm/mremap: put VMA check and prep logic into helper function mm/mremap: cleanup post-processing stage of mremap mm/mremap: use an explicit uffd failure path for mremap mm/mremap: check remap conditions earlier mm/mremap: move remap_is_valid() into check_prep_vma() mm/mremap: clean up mlock populate behaviour mm/mremap: permit mremap() move of multiple VMAs tools/testing/selftests: extend mremap_test to test multi-VMA mremap fs/userfaultfd.c | 15 +- include/linux/userfaultfd_k.h | 1 + mm/mremap.c | 502 ++++++++++++++--------- tools/testing/selftests/mm/mremap_test.c | 145 ++++++- 4 files changed, 462 insertions(+), 201 deletions(-) -- 2.50.0

3 days, 10 hours

6
30
0 0

[PATCH v21 0/9] PCI: EP: Add RC-to-EP doorbell with platform MSI controller

by Frank Li via B4 Relay

┌────────────┐ ┌───────────────────────────────────┐ ┌────────────────┐ │ │ │ │ │ │ │ │ │ PCI Endpoint │ │ PCI Host │ │ │ │ │ │ │ │ │◄──┤ 1.platform_msi_domain_alloc_irqs()│ │ │ │ │ │ │ │ │ │ MSI ├──►│ 2.write_msi_msg() ├──►├─BAR<n> │ │ Controller │ │ update doorbell register address│ │ │ │ │ │ for BAR │ │ │ │ │ │ │ │ 3. Write BAR<n>│ │ │◄──┼───────────────────────────────────┼───┤ │ │ │ │ │ │ │ │ ├──►│ 4.Irq Handle │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ └────────────┘ └───────────────────────────────────┘ └────────────────┘ This patches based on old https://lore.kernel.org/imx/20221124055036.1630573-1-Frank.Li@nxp.com/ Original patch only target to vntb driver. But actually it is common method. This patches add new API to pci-epf-core, so any EP driver can use it. Previous v2 discussion here. https://lore.kernel.org/imx/20230911220920.1817033-1-Frank.Li@nxp.com/ Changes in v21: - Align to bar size, try to fix Niklas reported problem. - Rebase to v6.16-rc5 - Link to v20: https://lore.kernel.org/r/20250709-ep-msi-v20-0-43d56f9bd54a@nxp.com Changes in v20: - remove set epf of_node's patch and only support one epf now. - move imx6's patch to first - detail change see each patches' change log - Link to v19: https://lore.kernel.org/r/20250609-ep-msi-v19-0-77362eaa48fa@nxp.com Changes in v19: - irq part already in v6.16-rc1, only missed pcie/dts part - rebase to v6.16-rc1 - update commit message for patch IMMUTABLE check. - Link to v18: https://lore.kernel.org/r/20250414-ep-msi-v18-0-f69b49917464@nxp.com Changes in v18: - pci-ep.yaml: sort property order, fix maxvalue to 0x7ffff for msi-map-mask and iommu-map-mask - Link to v17: https://lore.kernel.org/r/20250407-ep-msi-v17-0-633ab45a31d0@nxp.com Changes in v17: - move document part to pci-ep.yaml - Link to v16: https://lore.kernel.org/r/20250404-ep-msi-v16-0-d4919d68c0d0@nxp.com Changes in v16: - remove arm64: dts: imx95-19x19-evk: Add PCIe1 endpoint function overlay file because there are better patches, which under review. - Add document for pcie-ep msi-map usage - other change to see each patch's change log About IMMUTABLE (No change for this part, tglx provide feedback) > - This IMMUTABLE thing serves no purpose, because you don't randomly > plug this end-point block on any MSI controller. They come as part > of an SoC. "Yes and no. The problem is that the EP implementation is meant to be a generic library and while GIC-ITS guarantees immutability of the address/data pair after setup, there are architectures (x86, loongson, riscv) where the base MSI controller does not and immutability is only achieved when interrupt remapping is enabled. The latter can be disabled at boot-time and then the EP implementation becomes a lottery across affinity changes. That was my concern about this library implementation and that's why I asked for a mechanism to ensure that the underlying irqdomain provides a immutable address/data pair. So it does not matter for GIC-ITS, but in the larger picture it matters. Thanks, tglx " So it does not matter for GIC-ITS, but in the larger picture it matters. - Link to v15: https://lore.kernel.org/r/20250211-ep-msi-v15-0-bcacc1f2b1a9@nxp.com Changes in v15: - rebase to v6.14-rc1 - fix build issue find by kernel test robot - Link to v14: https://lore.kernel.org/r/20250207-ep-msi-v14-0-9671b136f2b8@nxp.com Changes in v14: Marc Zyngier raised concerns about adding DOMAIN_BUS_DEVICE_PCI_EP_MSI. As a result, the approach has been reverted to the v9 method. However, there are several improvements: MSI now supports msi-map in addition to msi-parent. - The struct device: id is used as the endpoint function (EPF) device identity to map to the stream ID (sideband information). - The EPC device tree source (DTS) utilizes msi-map to provide such information. - The EPF device's of_node is set to the EPC controller’s node. This approach is commonly used for multi-function device (MFD) platform child devices, allowing them to inherit properties from the MFD device’s DTS, such as reset-cells and gpio-cells. This method is well-suited for the current case, as the EPF is inherently created/binded to the EPC and should inherit the EPC’s DTS node properties. Additionally: Since the basic IMX95 LUT support has already been merged into the mainline, a DTS and driver increment patch is added to complete the solution. The patch is rebased onto the latest linux-next tree and aligned with the new pcitest framework. - Link to v13: https://lore.kernel.org/r/20241218-ep-msi-v13-0-646e2192dc24@nxp.com Changes in v13: - Change to use DOMAIN_BUS_PCI_DEVICE_EP_MSI - Change request id as func | vfunc << 3 - Remove IRQ_DOMAIN_MSI_IMMUTABLE Thomas Gleixner: I hope capture all your points in review comments. If missed, let me know. - Link to v12: https://lore.kernel.org/r/20241211-ep-msi-v12-0-33d4532fa520@nxp.com Changes in v12: - Change to use IRQ_DOMAIN_MSI_IMMUTABLE and add help function irq_domain_msi_is_immuatble(). - split PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check to 3 patches - Link to v11: https://lore.kernel.org/r/20241209-ep-msi-v11-0-7434fa8397bd@nxp.com Changes in v11: - Change to use MSI_FLAG_MSG_IMMUTABLE - Link to v10: https://lore.kernel.org/r/20241204-ep-msi-v10-0-87c378dbcd6d@nxp.com Changes in v10: Thomas Gleixner: There are big change in pci-ep-msi.c. I am sure if go on the corrent path. The key improvement is remove only 1 function devices's limitation. I use new patch for imutable check, which relative additional feature compared to base enablement patch. - Remove patch Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all() - Add new patch irqchip/gic-v3-its: Avoid overwriting msi_prepare callback if provided by msi_domain_info - Remove only support 1 endpoint function limiation. - Create one MSI domain for each endpoint function devices. - Use "msi-map" in pci ep controler node, instead of of msi-parent. first argument is (func_no << 8 | vfunc_no) - Link to v9: https://lore.kernel.org/r/20241203-ep-msi-v9-0-a60dbc3f15dd@nxp.com Changes in v9 - Add patch platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all() - Remove patch PCI: endpoint: Add pci_epc_get_fn() API for customizable filtering - Remove API pci_epf_align_inbound_addr_lo_hi - Move doorbell_alloc in to doorbell_enable function. - Link to v8: https://lore.kernel.org/r/20241116-ep-msi-v8-0-6f1f68ffd1bb@nxp.com Changes in v8: - update helper function name to pci_epf_align_inbound_addr() - Link to v7: https://lore.kernel.org/r/20241114-ep-msi-v7-0-d4ac7aafbd2c@nxp.com Changes in v7: - Add helper function pci_epf_align_addr(); - Link to v6: https://lore.kernel.org/r/20241112-ep-msi-v6-0-45f9722e3c2a@nxp.com Changes in v6: - change doorbell_addr to doorbell_offset - use round_down() - add Niklas's test by tag - rebase to pci/endpoint - Link to v5: https://lore.kernel.org/r/20241108-ep-msi-v5-0-a14951c0d007@nxp.com Changes in v5: - Move request_irq to epf test function driver for more flexiable user case - Add fixed size bar handler - Some minor improvememtn to see each patches's changelog. - Link to v4: https://lore.kernel.org/r/20241031-ep-msi-v4-0-717da2d99b28@nxp.com Changes in v4: - Remove patch genirq/msi: Add cleanup guard define for msi_lock_descs()/msi_unlock_descs() - Use new method to avoid compatible problem. Add new command DOORBELL_ENABLE and DOORBELL_DISABLE. pcitest -B send DOORBELL_ENABLE first, EP test function driver try to remap one of BAR_N (except test register bar) to ITS MSI MMIO space. Old driver don't support new command, so failure return, not side effect. After test, DOORBELL_DISABLE command send out to recover original map, so pcitest bar test can pass as normal. - Other detail change see each patches's change log - Link to v3: https://lore.kernel.org/r/20241015-ep-msi-v3-0-cedc89a16c1a@nxp.com Change from v2 to v3 - Fixed manivannan's comments - Move common part to pci-ep-msi.c and pci-ep-msi.h - rebase to 6.12-rc1 - use RevID to distingiush old version mkdir /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 echo 16 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/msi_interrupts echo 0x080c > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/deviceid echo 0x1957 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/vendorid echo 1 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/revid ^^^^^^ to enable platform msi support. ln -s /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 /sys/kernel/config/pci_ep/controllers/4c380000.pcie-ep - use new device ID, which identify support doorbell to avoid broken compatility. Enable doorbell support only for PCI_DEVICE_ID_IMX8_DB, while other devices keep the same behavior as before. EP side RC with old driver RC with new driver PCI_DEVICE_ID_IMX8_DB no probe doorbell enabled Other device ID doorbell disabled* doorbell disabled* * Behavior remains unchanged. Change from v1 to v2 - Add missed patch for endpont/pci-epf-test.c - Move alloc and free to epc driver from epf. - Provide general help function for EPC driver to alloc platform msi irq. - Fixed manivannan's comments. Signed-off-by: Frank Li <Frank.Li(a)nxp.com> --- Frank Li (9): PCI: imx6: Add helper function imx_pcie_add_lut_by_rid() PCI: imx6: Add LUT configuration for MSI/IOMMU in Endpoint mode PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check PCI: endpoint: Add pci_epf_align_inbound_addr() helper for address alignment PCI: endpoint: pci-epf-test: Add doorbell test support misc: pci_endpoint_test: Add doorbell test case selftests: pci_endpoint: Add doorbell test case arm64: dts: imx95: Add msi-map for pci-ep device Documentation/PCI/endpoint/pci-test-howto.rst | 14 +++ arch/arm64/boot/dts/freescale/imx95.dtsi | 1 + drivers/misc/pci_endpoint_test.c | 85 ++++++++++++- drivers/pci/controller/dwc/pci-imx6.c | 25 ++-- drivers/pci/endpoint/Kconfig | 8 ++ drivers/pci/endpoint/Makefile | 1 + drivers/pci/endpoint/functions/pci-epf-test.c | 136 +++++++++++++++++++++ drivers/pci/endpoint/pci-ep-msi.c | 98 +++++++++++++++ drivers/pci/endpoint/pci-epf-core.c | 36 ++++++ include/linux/pci-ep-msi.h | 28 +++++ include/linux/pci-epf.h | 18 +++ include/uapi/linux/pcitest.h | 1 + .../selftests/pci_endpoint/pci_endpoint_test.c | 28 +++++ 13 files changed, 470 insertions(+), 9 deletions(-) --- base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a change-id: 20241010-ep-msi-8b4cab33b1be Best regards, -- Frank Li <Frank.Li(a)nxp.com>

3 days, 11 hours

2
10
0 0

[PATCH] kunit: Enable PCI on UML without triggering WARN()

by Thomas Weißschuh

Various KUnit tests require PCI infrastructure to work. All normal platforms enable PCI by default, but UML does not. Enabling PCI from .kunitconfig files is problematic as it would not be portable. So in commit 6fc3a8636a7b ("kunit: tool: Enable virtio/PCI by default on UML") PCI was enabled by way of CONFIG_UML_PCI_OVER_VIRTIO=y. However CONFIG_UML_PCI_OVER_VIRTIO requires additional configuration of CONFIG_UML_PCI_OVER_VIRTIO_DEVICE_ID or will otherwise trigger a WARN() in virtio_pcidev_init(). However there is no one correct value for UML_PCI_OVER_VIRTIO_DEVICE_ID which could be used by default. This warning is confusing when debugging test failures. On the other hand, the functionality of CONFIG_UML_PCI_OVER_VIRTIO is not used at all, given that it is completely non-functional as indicated by the WARN() in question. Instead it is only used as a way to enable CONFIG_UML_PCI which itself is not directly configurable. Instead of going through CONFIG_UML_PCI_OVER_VIRTIO, introduce a custom configuration option which enables CONFIG_UML_PCI without triggering warnings or building dead code. Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de> --- lib/kunit/Kconfig | 7 +++++++ tools/testing/kunit/configs/arch_uml.config | 5 ++--- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig index a97897edd9642f3e5df7fdd9dee26ee5cf00d6a4..c8ca155521b2455a221ddbec3f6fc55662c83475 100644 --- a/lib/kunit/Kconfig +++ b/lib/kunit/Kconfig @@ -93,4 +93,11 @@ config KUNIT_AUTORUN_ENABLED In most cases this should be left as Y. Only if additional opt-in behavior is needed should this be set to N. +config KUNIT_UML_PCI + bool "KUnit UML PCI Support" + depends on UML + select UML_PCI + help + Enables the PCI subsystem on UML for use by KUnit tests. + endif # KUNIT diff --git a/tools/testing/kunit/configs/arch_uml.config b/tools/testing/kunit/configs/arch_uml.config index 54ad8972681a2cc724e6122b19407188910b9025..28edf816aa70e6f408d9486efff8898df79ee090 100644 --- a/tools/testing/kunit/configs/arch_uml.config +++ b/tools/testing/kunit/configs/arch_uml.config @@ -1,8 +1,7 @@ # Config options which are added to UML builds by default -# Enable virtio/pci, as a lot of tests require it. -CONFIG_VIRTIO_UML=y -CONFIG_UML_PCI_OVER_VIRTIO=y +# Enable pci, as a lot of tests require it. +CONFIG_KUNIT_UML_PCI=y # Enable FORTIFY_SOURCE for wider checking. CONFIG_FORTIFY_SOURCE=y --- base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494 change-id: 20250626-kunit-uml-pci-a2b687553746 Best regards, -- Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>

3 days, 12 hours

2
1
0 0

[PATCH 0/3] signal handling support for nolibc

by Benjamin Berg

From: Benjamin Berg <benjamin.berg(a)intel.com> Hi, This patchset adds signal handling to nolibc. Initially, I would like to use this for tests. But in the long run, the goal is to use nolibc for the UML kernel itself. In both cases, signal handling will be needed. Benjamin Benjamin Berg (3): tools/nolibc: show failed run if test process crashes tools/nolibc: add more generic BITSET_* macros for FD_* tools/nolibc: add signal support tools/include/nolibc/arch-arm.h | 7 ++ tools/include/nolibc/arch-arm64.h | 3 + tools/include/nolibc/arch-loongarch.h | 3 + tools/include/nolibc/arch-m68k.h | 10 ++ tools/include/nolibc/arch-mips.h | 3 + tools/include/nolibc/arch-powerpc.h | 8 ++ tools/include/nolibc/arch-riscv.h | 3 + tools/include/nolibc/arch-s390.h | 8 +- tools/include/nolibc/arch-sh.h | 5 + tools/include/nolibc/arch-sparc.h | 47 ++++++++ tools/include/nolibc/arch-x86.h | 13 +++ tools/include/nolibc/signal.h | 103 ++++++++++++++++++ tools/include/nolibc/sys.h | 2 +- tools/include/nolibc/time.h | 3 +- tools/include/nolibc/types.h | 67 ++++++------ .../testing/selftests/nolibc/Makefile.nolibc | 3 +- tools/testing/selftests/nolibc/nolibc-test.c | 67 ++++++++++++ 17 files changed, 319 insertions(+), 36 deletions(-) -- 2.50.0

3 days, 13 hours

3
9
0 0

[PATCH -next] selftests/ftrace: Prevent potential failure in subsystem-enable test case

by Tengda Wu

The first 100 lines of trace output don't always contain 3 or more distinct events. In busy systems, they may be dominated by repetitive events like sched_stat_runtime, causing the `$count -lt 3` check to fail. Example trace: $ head -n 100 trace | grep -v ^# systemd-timesyn-266 [006] d.h2. 738.778482: sched_stat_runtime: comm=systemd-timesyn pid=266 runtime=976854 [ns] ftracetest-8751 [001] d.h2. 738.778512: sched_stat_runtime: comm=ftracetest pid=8751 runtime=938335 [ns] systemd-timesyn-266 [006] d.h1. 738.779531: sched_stat_runtime: comm=systemd-timesyn pid=266 runtime=1044284 [ns] ftracetest-8751 [001] d.h2. 738.779541: sched_stat_runtime: comm=ftracetest pid=8751 runtime=1028575 [ns] systemd-1 [007] d.h5. 738.779657: sched_stat_runtime: comm=systemd pid=1 runtime=642624 [ns] [...] With trace cleared, simply check `$count -eq 0` to confirm subsystem enablement, just like toplevel-enable.tc does. Fixes: 1a4ea83a6e67 ("selftests/ftrace: Limit length in subsystem-enable tests") Signed-off-by: Tengda Wu <wutengda(a)huaweicloud.com> --- .../selftests/ftrace/test.d/event/subsystem-enable.tc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc b/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc index b7c8f29c09a9..3a28adc7b727 100644 --- a/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc +++ b/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc @@ -19,8 +19,8 @@ echo 'sched:*' > set_event yield count=`head -n 100 trace | grep -v ^# | awk '{ print $5 }' | sort -u | wc -l` -if [ $count -lt 3 ]; then - fail "at least fork, exec and exit events should be recorded" +if [ $count -eq 0 ]; then + fail "none of scheduler events are recorded" fi do_reset @@ -30,8 +30,8 @@ echo 1 > events/sched/enable yield count=`head -n 100 trace | grep -v ^# | awk '{ print $5 }' | sort -u | wc -l` -if [ $count -lt 3 ]; then - fail "at least fork, exec and exit events should be recorded" +if [ $count -eq 0 ]; then + fail "none of scheduler events are recorded" fi do_reset -- 2.34.1

3 days, 15 hours

2
4
0 0

[PATCH net] selftests: net: lib: fix shift count out of range

by Hangbin Liu

I got the following warning when writing other tests: + handle_test_result_pass 'bond 802.3ad' '(lacp_active off)' + local 'test_name=bond 802.3ad' + shift + local 'opt_str=(lacp_active off)' + shift + log_test_result 'bond 802.3ad' '(lacp_active off)' ' OK ' + local 'test_name=bond 802.3ad' + shift + local 'opt_str=(lacp_active off)' + shift + local 'result= OK ' + shift + local retmsg= + shift /net/tools/testing/selftests/net/forwarding/../lib.sh: line 315: shift: shift count out of range This happens because an extra shift is executed even after all arguments have been consumed. Remove the last shift in log_test_result() to avoid this warning. Fixes: a923af1ceee7 ("selftests: forwarding: Convert log_test() to recognize RET values") Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com> --- tools/testing/selftests/net/lib.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh index 006fdadcc4b9..86a216e9aca8 100644 --- a/tools/testing/selftests/net/lib.sh +++ b/tools/testing/selftests/net/lib.sh @@ -312,7 +312,7 @@ log_test_result() local test_name=$1; shift local opt_str=$1; shift local result=$1; shift - local retmsg=$1; shift + local retmsg=$1 printf "TEST: %-60s [%s]\n" "$test_name $opt_str" "$result" if [[ $retmsg ]]; then -- 2.46.0

3 days, 17 hours

2
1
0 0

[PATCH net-next 0/5] ethtool: rss: report which fields are configured for hashing

by Jakub Kicinski

Add support for reading flow hash configuration via Netlink ethtool. $ ynl --family ethtool --dump rss-get [{ "header": { "dev-index": 1, "dev-name": "enp1s0" }, "hfunc": 1, "hkey": b"...", "indir": [0, 1, ...], "flow-hash": { "ether": {"l2da"}, "ah-esp4": {"ip-src", "ip-dst"}, "ah-esp6": {"ip-src", "ip-dst"}, "ah4": {"ip-src", "ip-dst"}, "ah6": {"ip-src", "ip-dst"}, "esp4": {"ip-src", "ip-dst"}, "esp6": {"ip-src", "ip-dst"}, "ip4": {"ip-src", "ip-dst"}, "ip6": {"ip-src", "ip-dst"}, "sctp4": {"ip-src", "ip-dst"}, "sctp6": {"ip-src", "ip-dst"}, "udp4": {"ip-src", "ip-dst"}, "udp6": {"ip-src", "ip-dst"} "tcp4": {"l4-b-0-1", "l4-b-2-3", "ip-src", "ip-dst"}, "tcp6": {"l4-b-0-1", "l4-b-2-3", "ip-src", "ip-dst"}, }, }] Jakub Kicinski (5): ethtool: rss: make sure dump takes the rss lock tools: ynl: decode enums in auto-ints ethtool: mark ETHER_FLOW as usable for Rx hash ethtool: rss: report which fields are configured for hashing selftests: drv-net: test RSS header field configuration Documentation/netlink/specs/ethtool.yaml | 151 ++++++++++++++++++ Documentation/networking/ethtool-netlink.rst | 9 +- include/uapi/linux/ethtool.h | 4 +- .../uapi/linux/ethtool_netlink_generated.h | 34 ++++ net/ethtool/ioctl.c | 7 +- net/ethtool/rss.c | 145 +++++++++++++---- tools/net/ynl/pyynl/lib/ynl.py | 2 + .../selftests/drivers/net/hw/rss_api.py | 47 ++++++ 8 files changed, 364 insertions(+), 35 deletions(-) -- 2.50.0

3 days, 17 hours

3
7
0 0

[PATCH 0/2] selftests/cgroup: better bound for cpu.max tests

by Shashank Balaji

cpu.max selftests (both the normal one and the nested one) test the working of throttling by setting up cpu.max, running a cpu hog process for a specified duration, and comparing usage_usec as reported by cpu.stat with the duration of the cpu hog: they should be far enough. Currently, this is done by using values_close, which has two problems: 1. Semantic: values_close is used with an error percentage of 95%, which one will not expect on seeing "values close". The intent it's actually going for is "values far". 2. Accuracy: the tests can pass even if usage_usec is upto around double the expected amount. That's too high of a margin for usage_usec. Overall, this patchset improves the readability and accuracy of the cpu.max tests. Signed-off-by: Shashank Balaji <shashank.mahadasyam(a)sony.com> --- Shashank Balaji (2): selftests/cgroup: rename `expected` to `duration` in cpu.max tests selftests/cgroup: better bound in cpu.max tests tools/testing/selftests/cgroup/test_cpu.c | 42 ++++++++++++++++++------------- 1 file changed, 24 insertions(+), 18 deletions(-) --- base-commit: 66701750d5565c574af42bef0b789ce0203e3071 change-id: 20250227-kselftest-cgroup-fix-cpu-max-56619928e99b Best regards, -- Shashank Balaji <shashank.mahadasyam(a)sony.com>

3 days, 18 hours

2
15
0 0

[PATCH] KVM: selftests: Add CONFIG_EVENTFD for irqfd selftest

by Mark Brown

In 7e9b231c402a ("KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements") we added a test for the newly added irqfd support but since this feature works with eventfds it won't work unless the kernel has been built wth eventfd support. Add CONFIG_EVENTFD to the list of required options for the KVM selftests. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- tools/testing/selftests/kvm/config | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/kvm/config b/tools/testing/selftests/kvm/config index 8835fed09e9f..96d874b239eb 100644 --- a/tools/testing/selftests/kvm/config +++ b/tools/testing/selftests/kvm/config @@ -1,5 +1,6 @@ CONFIG_KVM=y CONFIG_KVM_INTEL=y CONFIG_KVM_AMD=y +CONFIG_EVENTFD=y CONFIG_USERFAULTFD=y CONFIG_IDLE_PAGE_TRACKING=y --- base-commit: 7e9b231c402a297251b3e6e0f5cc16cef7dd3ce5 change-id: 20250709-kvm-selftests-eventfd-config-123bb022fa04 Best regards, -- Mark Brown <broonie(a)kernel.org>

3 days, 19 hours

2
1
0 0

[PATCH] selftests: breakpoints: use suspend_stats to reliably check suspend success

by Moon Hee Lee

The step_after_suspend_test verifies that the system successfully suspended and resumed by setting a timerfd and checking whether the timer fully expired. However, this method is unreliable due to timing races. In practice, the system may take time to enter suspend, during which the timer may expire just before or during the transition. As a result, the remaining time after resume may show non-zero nanoseconds, even if suspend/resume completed successfully. This leads to false test failures. Replace the timer-based check with a read from /sys/power/suspend_stats/success. This counter is incremented only after a full suspend/resume cycle, providing a reliable and race-free indicator. Also remove the unused file descriptor for /sys/power/state, which remained after switching to a system() call to trigger suspend [1]. [1] https://lore.kernel.org/all/20240930224025.2858767-1-yifei.l.liu@oracle.com/ Fixes: c66be905cda2 ("selftests: breakpoints: use remaining time to check if suspend succeed") Signed-off-by: Moon Hee Lee <moonhee.lee.ca(a)gmail.com> --- .../breakpoints/step_after_suspend_test.c | 41 ++++++++++++++----- 1 file changed, 31 insertions(+), 10 deletions(-) diff --git a/tools/testing/selftests/breakpoints/step_after_suspend_test.c b/tools/testing/selftests/breakpoints/step_after_suspend_test.c index 8d275f03e977..8d233ac95696 100644 --- a/tools/testing/selftests/breakpoints/step_after_suspend_test.c +++ b/tools/testing/selftests/breakpoints/step_after_suspend_test.c @@ -127,22 +127,42 @@ int run_test(int cpu) return KSFT_PASS; } +/* + * Reads the suspend success count from sysfs. + * Returns the count on success or exits on failure. + */ +static int get_suspend_success_count_or_fail(void) +{ + FILE *fp; + int val; + + fp = fopen("/sys/power/suspend_stats/success", "r"); + if (!fp) + ksft_exit_fail_msg( + "Failed to open suspend_stats/success: %s\n", + strerror(errno)); + + if (fscanf(fp, "%d", &val) != 1) { + fclose(fp); + ksft_exit_fail_msg( + "Failed to read suspend success count\n"); + } + + fclose(fp); + return val; +} + void suspend(void) { - int power_state_fd; int timerfd; int err; + int count_before; + int count_after; struct itimerspec spec = {}; if (getuid() != 0) ksft_exit_skip("Please run the test as root - Exiting.\n"); - power_state_fd = open("/sys/power/state", O_RDWR); - if (power_state_fd < 0) - ksft_exit_fail_msg( - "open(\"/sys/power/state\") failed %s)\n", - strerror(errno)); - timerfd = timerfd_create(CLOCK_BOOTTIME_ALARM, 0); if (timerfd < 0) ksft_exit_fail_msg("timerfd_create() failed\n"); @@ -152,14 +172,15 @@ void suspend(void) if (err < 0) ksft_exit_fail_msg("timerfd_settime() failed\n"); + count_before = get_suspend_success_count_or_fail(); + system("(echo mem > /sys/power/state) 2> /dev/null"); - timerfd_gettime(timerfd, &spec); - if (spec.it_value.tv_sec != 0 || spec.it_value.tv_nsec != 0) + count_after = get_suspend_success_count_or_fail(); + if (count_after <= count_before) ksft_exit_fail_msg("Failed to enter Suspend state\n"); close(timerfd); - close(power_state_fd); } int main(int argc, char **argv) -- 2.43.0

3 days, 22 hours

2
1
0 0

[PATCH v3 RESEND] kunit: fix longest symbol length test

by Sergio González Collado

The kunit test that checks the longests symbol length [1], has triggered warnings in some pilelines when symbol prefixes are used [2][3]. The test will to depend on !PREFIX_SYMBOLS and !CFI_CLANG as sujested in [4] and on !GCOV_KERNEL. [1] https://lore.kernel.org/rust-for-linux/CABVgOSm=5Q0fM6neBhxSbOUHBgNzmwf2V22… [2] https://lore.kernel.org/all/20250328112156.2614513-1-arnd@kernel.org/T/#u [3] https://lore.kernel.org/rust-for-linux/bbd03b37-c4d9-4a92-9be2-75aaf8c19815… [4] https://lore.kernel.org/linux-kselftest/20250427200916.GA1661412@ax162/T/#t Reviewed-by: Rae Moar <rmoar(a)google.com> Signed-off-by: Sergio González Collado <sergio.collado(a)gmail.com> Acked-by: Randy Dunlap <rdunlap(a)infradead.org> Tested-by: Randy Dunlap <rdunlap(a)infradead.org> --- lib/Kconfig.debug | 1 + lib/tests/longest_symbol_kunit.c | 3 +-- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ebe33181b6e6..4a75a52803b6 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2885,6 +2885,7 @@ config FORTIFY_KUNIT_TEST config LONGEST_SYM_KUNIT_TEST tristate "Test the longest symbol possible" if !KUNIT_ALL_TESTS depends on KUNIT && KPROBES + depends on !PREFIX_SYMBOLS && !CFI_CLANG && !GCOV_KERNEL default KUNIT_ALL_TESTS help Tests the longest symbol possible diff --git a/lib/tests/longest_symbol_kunit.c b/lib/tests/longest_symbol_kunit.c index e3c28ff1807f..9b4de3050ba7 100644 --- a/lib/tests/longest_symbol_kunit.c +++ b/lib/tests/longest_symbol_kunit.c @@ -3,8 +3,7 @@ * Test the longest symbol length. Execute with: * ./tools/testing/kunit/kunit.py run longest-symbol * --arch=x86_64 --kconfig_add CONFIG_KPROBES=y --kconfig_add CONFIG_MODULES=y - * --kconfig_add CONFIG_RETPOLINE=n --kconfig_add CONFIG_CFI_CLANG=n - * --kconfig_add CONFIG_MITIGATION_RETPOLINE=n + * --kconfig_add CONFIG_CPU_MITIGATIONS=n --kconfig_add CONFIG_GCOV_KERNEL=n */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt base-commit: 772b78c2abd85586bb90b23adff89f7303c704c7 -- 2.39.2

3 days, 23 hours

3
3
0 0

[PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY

by Suren Baghdasaryan

Reading /proc/pid/maps requires read-locking mmap_lock which prevents any other task from concurrently modifying the address space. This guarantees coherent reporting of virtual address ranges, however it can block important updates from happening. Oftentimes /proc/pid/maps readers are low priority monitoring tasks and them blocking high priority tasks results in priority inversion. Locking the entire address space is required to present fully coherent picture of the address space, however even current implementation does not strictly guarantee that by outputting vmas in page-size chunks and dropping mmap_lock in between each chunk. Address space modifications are possible while mmap_lock is dropped and userspace reading the content is expected to deal with possible concurrent address space modifications. Considering these relaxed rules, holding mmap_lock is not strictly needed as long as we can guarantee that a concurrently modified vma is reported either in its original form or after it was modified. This patchset switches from holding mmap_lock while reading /proc/pid/maps to taking per-vma locks as we walk the vma tree. This reduces the contention with tasks modifying the address space because they would have to contend for the same vma as opposed to the entire address space. Same is done for PROCMAP_QUERY ioctl which locks only the vma that fell into the requested range instead of the entire address space. Previous version of this patchset [1] tried to perform /proc/pid/maps reading under RCU, however its implementation is quite complex and the results are worse than the new version because it still relied on mmap_lock speculation which retries if any part of the address space gets modified. New implementaion is both simpler and results in less contention. Note that similar approach would not work for /proc/pid/smaps reading as it also walks the page table and that's not RCU-safe. Paul McKenney's designed a test [2] to measure mmap/munmap latencies while concurrently reading /proc/pid/maps. The test has a pair of processes scanning /proc/PID/maps, and another process unmapping and remapping 4K pages from a 128MB range of anonymous memory. At the end of each 10 second run, the latency of each mmap() or munmap() operation is measured, and for each run the maximum and mean latency is printed. The map/unmap process is started first, its PID is passed to the scanners, and then the map/unmap process waits until both scanners are running before starting its timed test. The scanners keep scanning until the specified /proc/PID/maps file disappears. This test registered close to 10x improvement in update latencies: Before the change: ./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2 0.011 0.008 0.455 0.011 0.008 0.472 0.011 0.008 0.535 0.011 0.009 0.545 ... 0.011 0.014 2.875 0.011 0.014 2.913 0.011 0.014 3.007 0.011 0.015 3.018 After the change: ./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2 0.006 0.005 0.036 0.006 0.005 0.039 0.006 0.005 0.039 0.006 0.005 0.039 ... 0.006 0.006 0.403 0.006 0.006 0.474 0.006 0.006 0.479 0.006 0.006 0.498 The patchset also adds a number of tests to check for /proc/pid/maps data coherency. They are designed to detect any unexpected data tearing while performing some common address space modifications (vma split, resize and remap). Even before these changes, reading /proc/pid/maps might have inconsistent data because the file is read page-by-page with mmap_lock being dropped between the pages. An example of user-visible inconsistency can be that the same vma is printed twice: once before it was modified and then after the modifications. For example if vma was extended, it might be found and reported twice. What is not expected is to see a gap where there should have been a vma both before and after modification. This patchset increases the chances of such tearing, therefore it's even more important now to test for unexpected inconsistencies. In [3] Lorenzo identified the following possible vma merging/splitting scenarios: Merges with changes to existing vmas: 1 Merge both - mapping a vma over another one and between two vmas which can be merged after this replacement; 2. Merge left full - mapping a vma at the end of an existing one and completely over its right neighbor; 3. Merge left partial - mapping a vma at the end of an existing one and partially over its right neighbor; 4. Merge right full - mapping a vma before the start of an existing one and completely over its left neighbor; 5. Merge right partial - mapping a vma before the start of an existing one and partially over its left neighbor; Merges without changes to existing vmas: 6. Merge both - mapping a vma into a gap between two vmas which can be merged after the insertion; 7. Merge left - mapping a vma at the end of an existing one; 8. Merge right - mapping a vma before the start end of an existing one; Splits 9. Split with new vma at the lower address; 10. Split with new vma at the higher address; If such merges or splits happen concurrently with the /proc/maps reading we might report a vma twice, once before the modification and once after it is modified: Case 1 might report overwritten and previous vma along with the final merged vma; Case 2 might report previous and the final merged vma; Case 3 might cause us to retry once we detect the temporary gap caused by shrinking of the right neighbor; Case 4 might report overritten and the final merged vma; Case 5 might cause us to retry once we detect the temporary gap caused by shrinking of the left neighbor; Case 6 might report previous vma and the gap along with the final marged vma; Case 7 might report previous and the final merged vma; Case 8 might report the original gap and the final merged vma covering the gap; Case 9 might cause us to retry once we detect the temporary gap caused by shrinking of the original vma at the vma start; Case 10 might cause us to retry once we detect the temporary gap caused by shrinking of the original vma at the vma end; In all these cases the retry mechanism prevents us from reporting possible temporary gaps. Changes since v5 [4]: - Made /proc/pid/maps tearing test a separate selftest, per Alexey Dobriyan - Changed asserts with or'ed conditions into separate ones, per Alexey Dobriyan - Added a small cleanup patch [6/8] to avoid unnecessary seq_file position type casting - Removed unnecessary is_sentinel_pos() helper - Changed titles to use fs/proc/task_mmu instead of mm/maps prefix, per David Hildenbrand - Included Lorenzo's fix for mmap lock assertion in anon_vma_name() - Reworked the last patch to avoid allocation in the rcu read section, which replaces Jeongjun Park's fix !!! NOTES FOR APPLYING THE PATCHSET !!! Applies cleanly over mm-unstable after reverting old version with fixes. The following patches should be reverted before applyng this patchset: b33ce1be8a40 ("selftests/proc: add /proc/pid/maps tearing from vma split test") b538e0580fd6 ("selftests/proc: extend /proc/pid/maps tearing test to include vma resizing") 4996b4409cc6 ("selftests/proc: extend /proc/pid/maps tearing test to include vma remapping") c39471f78d5e ("selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified") 487570f548f3 ("selftests/proc: add verbose more for tests to facilitate debugging") e1ba4969cba1 ("mm/maps: read proc/pid/maps under per-vma lock") ecb110179e77 ("mm/madvise: fixup stray mmap lock assert in anon_vma_name()") 6772c457a865 ("fs/proc/task_mmu:: execute PROCMAP_QUERY ioctl under per-vma locks") d5c67bb2c5fb ("mm/maps: move kmalloc() call location in do_procmap_query() out of RCU critical section") [1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/ [2] https://github.com/paulmckrcu/proc-mmap_sem-test [3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo… [4] https://lore.kernel.org/all/20250624193359.3865351-1-surenb@google.com/ Suren Baghdasaryan (8): selftests/proc: add /proc/pid/maps tearing from vma split test selftests/proc: extend /proc/pid/maps tearing test to include vma resizing selftests/proc: extend /proc/pid/maps tearing test to include vma remapping selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified selftests/proc: add verbose more for tests to facilitate debugging fs/proc/task_mmu: remove conversion of seq_file position to unsigned fs/proc/task_mmu: read proc/pid/maps under per-vma lock fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks fs/proc/internal.h | 5 + fs/proc/task_mmu.c | 188 +++- include/linux/mmap_lock.h | 11 + mm/madvise.c | 3 +- mm/mmap_lock.c | 88 ++ tools/testing/selftests/proc/.gitignore | 1 + tools/testing/selftests/proc/Makefile | 1 + tools/testing/selftests/proc/proc-maps-race.c | 829 ++++++++++++++++++ 8 files changed, 1098 insertions(+), 28 deletions(-) create mode 100644 tools/testing/selftests/proc/proc-maps-race.c -- 2.50.0.727.gbf7dc18ff4-goog

4 days, 1 hour

4
28
0 0

[PATCH bpf-next,v3 0/2] Clarify and Enhance XDP Rx Metadata Handling

by Song Yoong Siang

This patch set improves the documentation and selftests for XDP Rx metadata handling. The first patch clarifies the documentation around XDP metadata layout and METADATA_SIZE. The second patch enhances the BPF selftests to make XDP metadata handling more robust across different NICs. Prior to this patch set, the XDP program might accidentally overwrite the device-reserved metadata. V3: - update doc and commit msg accordingly. V2: https://lore.kernel.org/netdev/20250702030349.3275368-1-yoong.siang.song@in… - unconditionally do bpf_xdp_adjust_meta with -XDP_METADATA_SIZE (Stanislav) V1: https://lore.kernel.org/netdev/20250701042940.3272325-1-yoong.siang.song@in… Song Yoong Siang (2): doc: enhance explanation of XDP Rx metadata layout and METADATA_SIZE selftests/bpf: Enhance XDP Rx metadata handling Documentation/networking/xdp-rx-metadata.rst | 36 +++++++++++++++---- .../selftests/bpf/prog_tests/xdp_metadata.c | 2 +- .../selftests/bpf/progs/xdp_hw_metadata.c | 2 +- .../selftests/bpf/progs/xdp_metadata.c | 2 +- tools/testing/selftests/bpf/xdp_hw_metadata.c | 2 +- tools/testing/selftests/bpf/xdp_metadata.h | 7 ++++ 6 files changed, 41 insertions(+), 10 deletions(-) -- 2.34.1

4 days, 1 hour

6
14
0 0

[PATCH v20 0/9] PCI: EP: Add RC-to-EP doorbell with platform MSI controller

by Frank Li via B4 Relay

┌────────────┐ ┌───────────────────────────────────┐ ┌────────────────┐ │ │ │ │ │ │ │ │ │ PCI Endpoint │ │ PCI Host │ │ │ │ │ │ │ │ │◄──┤ 1.platform_msi_domain_alloc_irqs()│ │ │ │ │ │ │ │ │ │ MSI ├──►│ 2.write_msi_msg() ├──►├─BAR<n> │ │ Controller │ │ update doorbell register address│ │ │ │ │ │ for BAR │ │ │ │ │ │ │ │ 3. Write BAR<n>│ │ │◄──┼───────────────────────────────────┼───┤ │ │ │ │ │ │ │ │ ├──►│ 4.Irq Handle │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ └────────────┘ └───────────────────────────────────┘ └────────────────┘ This patches based on old https://lore.kernel.org/imx/20221124055036.1630573-1-Frank.Li@nxp.com/ Original patch only target to vntb driver. But actually it is common method. This patches add new API to pci-epf-core, so any EP driver can use it. Previous v2 discussion here. https://lore.kernel.org/imx/20230911220920.1817033-1-Frank.Li@nxp.com/ Changes in v20: - remove set epf of_node's patch and only support one epf now. - move imx6's patch to first - detail change see each patches' change log - Link to v19: https://lore.kernel.org/r/20250609-ep-msi-v19-0-77362eaa48fa@nxp.com Changes in v19: - irq part already in v6.16-rc1, only missed pcie/dts part - rebase to v6.16-rc1 - update commit message for patch IMMUTABLE check. - Link to v18: https://lore.kernel.org/r/20250414-ep-msi-v18-0-f69b49917464@nxp.com Changes in v18: - pci-ep.yaml: sort property order, fix maxvalue to 0x7ffff for msi-map-mask and iommu-map-mask - Link to v17: https://lore.kernel.org/r/20250407-ep-msi-v17-0-633ab45a31d0@nxp.com Changes in v17: - move document part to pci-ep.yaml - Link to v16: https://lore.kernel.org/r/20250404-ep-msi-v16-0-d4919d68c0d0@nxp.com Changes in v16: - remove arm64: dts: imx95-19x19-evk: Add PCIe1 endpoint function overlay file because there are better patches, which under review. - Add document for pcie-ep msi-map usage - other change to see each patch's change log About IMMUTABLE (No change for this part, tglx provide feedback) > - This IMMUTABLE thing serves no purpose, because you don't randomly > plug this end-point block on any MSI controller. They come as part > of an SoC. "Yes and no. The problem is that the EP implementation is meant to be a generic library and while GIC-ITS guarantees immutability of the address/data pair after setup, there are architectures (x86, loongson, riscv) where the base MSI controller does not and immutability is only achieved when interrupt remapping is enabled. The latter can be disabled at boot-time and then the EP implementation becomes a lottery across affinity changes. That was my concern about this library implementation and that's why I asked for a mechanism to ensure that the underlying irqdomain provides a immutable address/data pair. So it does not matter for GIC-ITS, but in the larger picture it matters. Thanks, tglx " So it does not matter for GIC-ITS, but in the larger picture it matters. - Link to v15: https://lore.kernel.org/r/20250211-ep-msi-v15-0-bcacc1f2b1a9@nxp.com Changes in v15: - rebase to v6.14-rc1 - fix build issue find by kernel test robot - Link to v14: https://lore.kernel.org/r/20250207-ep-msi-v14-0-9671b136f2b8@nxp.com Changes in v14: Marc Zyngier raised concerns about adding DOMAIN_BUS_DEVICE_PCI_EP_MSI. As a result, the approach has been reverted to the v9 method. However, there are several improvements: MSI now supports msi-map in addition to msi-parent. - The struct device: id is used as the endpoint function (EPF) device identity to map to the stream ID (sideband information). - The EPC device tree source (DTS) utilizes msi-map to provide such information. - The EPF device's of_node is set to the EPC controller’s node. This approach is commonly used for multi-function device (MFD) platform child devices, allowing them to inherit properties from the MFD device’s DTS, such as reset-cells and gpio-cells. This method is well-suited for the current case, as the EPF is inherently created/binded to the EPC and should inherit the EPC’s DTS node properties. Additionally: Since the basic IMX95 LUT support has already been merged into the mainline, a DTS and driver increment patch is added to complete the solution. The patch is rebased onto the latest linux-next tree and aligned with the new pcitest framework. - Link to v13: https://lore.kernel.org/r/20241218-ep-msi-v13-0-646e2192dc24@nxp.com Changes in v13: - Change to use DOMAIN_BUS_PCI_DEVICE_EP_MSI - Change request id as func | vfunc << 3 - Remove IRQ_DOMAIN_MSI_IMMUTABLE Thomas Gleixner: I hope capture all your points in review comments. If missed, let me know. - Link to v12: https://lore.kernel.org/r/20241211-ep-msi-v12-0-33d4532fa520@nxp.com Changes in v12: - Change to use IRQ_DOMAIN_MSI_IMMUTABLE and add help function irq_domain_msi_is_immuatble(). - split PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check to 3 patches - Link to v11: https://lore.kernel.org/r/20241209-ep-msi-v11-0-7434fa8397bd@nxp.com Changes in v11: - Change to use MSI_FLAG_MSG_IMMUTABLE - Link to v10: https://lore.kernel.org/r/20241204-ep-msi-v10-0-87c378dbcd6d@nxp.com Changes in v10: Thomas Gleixner: There are big change in pci-ep-msi.c. I am sure if go on the corrent path. The key improvement is remove only 1 function devices's limitation. I use new patch for imutable check, which relative additional feature compared to base enablement patch. - Remove patch Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all() - Add new patch irqchip/gic-v3-its: Avoid overwriting msi_prepare callback if provided by msi_domain_info - Remove only support 1 endpoint function limiation. - Create one MSI domain for each endpoint function devices. - Use "msi-map" in pci ep controler node, instead of of msi-parent. first argument is (func_no << 8 | vfunc_no) - Link to v9: https://lore.kernel.org/r/20241203-ep-msi-v9-0-a60dbc3f15dd@nxp.com Changes in v9 - Add patch platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all() - Remove patch PCI: endpoint: Add pci_epc_get_fn() API for customizable filtering - Remove API pci_epf_align_inbound_addr_lo_hi - Move doorbell_alloc in to doorbell_enable function. - Link to v8: https://lore.kernel.org/r/20241116-ep-msi-v8-0-6f1f68ffd1bb@nxp.com Changes in v8: - update helper function name to pci_epf_align_inbound_addr() - Link to v7: https://lore.kernel.org/r/20241114-ep-msi-v7-0-d4ac7aafbd2c@nxp.com Changes in v7: - Add helper function pci_epf_align_addr(); - Link to v6: https://lore.kernel.org/r/20241112-ep-msi-v6-0-45f9722e3c2a@nxp.com Changes in v6: - change doorbell_addr to doorbell_offset - use round_down() - add Niklas's test by tag - rebase to pci/endpoint - Link to v5: https://lore.kernel.org/r/20241108-ep-msi-v5-0-a14951c0d007@nxp.com Changes in v5: - Move request_irq to epf test function driver for more flexiable user case - Add fixed size bar handler - Some minor improvememtn to see each patches's changelog. - Link to v4: https://lore.kernel.org/r/20241031-ep-msi-v4-0-717da2d99b28@nxp.com Changes in v4: - Remove patch genirq/msi: Add cleanup guard define for msi_lock_descs()/msi_unlock_descs() - Use new method to avoid compatible problem. Add new command DOORBELL_ENABLE and DOORBELL_DISABLE. pcitest -B send DOORBELL_ENABLE first, EP test function driver try to remap one of BAR_N (except test register bar) to ITS MSI MMIO space. Old driver don't support new command, so failure return, not side effect. After test, DOORBELL_DISABLE command send out to recover original map, so pcitest bar test can pass as normal. - Other detail change see each patches's change log - Link to v3: https://lore.kernel.org/r/20241015-ep-msi-v3-0-cedc89a16c1a@nxp.com Change from v2 to v3 - Fixed manivannan's comments - Move common part to pci-ep-msi.c and pci-ep-msi.h - rebase to 6.12-rc1 - use RevID to distingiush old version mkdir /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 echo 16 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/msi_interrupts echo 0x080c > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/deviceid echo 0x1957 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/vendorid echo 1 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/revid ^^^^^^ to enable platform msi support. ln -s /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 /sys/kernel/config/pci_ep/controllers/4c380000.pcie-ep - use new device ID, which identify support doorbell to avoid broken compatility. Enable doorbell support only for PCI_DEVICE_ID_IMX8_DB, while other devices keep the same behavior as before. EP side RC with old driver RC with new driver PCI_DEVICE_ID_IMX8_DB no probe doorbell enabled Other device ID doorbell disabled* doorbell disabled* * Behavior remains unchanged. Change from v1 to v2 - Add missed patch for endpont/pci-epf-test.c - Move alloc and free to epc driver from epf. - Provide general help function for EPC driver to alloc platform msi irq. - Fixed manivannan's comments. Signed-off-by: Frank Li <Frank.Li(a)nxp.com> --- Frank Li (9): PCI: imx6: Add helper function imx_pcie_add_lut_by_rid() PCI: imx6: Add LUT configuration for MSI/IOMMU in Endpoint mode PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check PCI: endpoint: Add pci_epf_align_inbound_addr() helper for address alignment PCI: endpoint: pci-epf-test: Add doorbell test support misc: pci_endpoint_test: Add doorbell test case selftests: pci_endpoint: Add doorbell test case arm64: dts: imx95: Add msi-map for pci-ep device Documentation/PCI/endpoint/pci-test-howto.rst | 14 +++ arch/arm64/boot/dts/freescale/imx95.dtsi | 1 + drivers/misc/pci_endpoint_test.c | 85 ++++++++++++- drivers/pci/controller/dwc/pci-imx6.c | 25 ++-- drivers/pci/endpoint/Kconfig | 8 ++ drivers/pci/endpoint/Makefile | 1 + drivers/pci/endpoint/functions/pci-epf-test.c | 136 +++++++++++++++++++++ drivers/pci/endpoint/pci-ep-msi.c | 98 +++++++++++++++ drivers/pci/endpoint/pci-epf-core.c | 44 +++++++ include/linux/pci-ep-msi.h | 28 +++++ include/linux/pci-epf.h | 18 +++ include/uapi/linux/pcitest.h | 1 + .../selftests/pci_endpoint/pci_endpoint_test.c | 28 +++++ 13 files changed, 478 insertions(+), 9 deletions(-) --- base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494 change-id: 20241010-ep-msi-8b4cab33b1be Best regards, -- Frank Li <Frank.Li(a)nxp.com>

4 days, 2 hours

3
11
0 0

[PATCH v2 00/10] mm/mremap: permit mremap() move of multiple VMAs

by Lorenzo Stoakes

Historically we've made it a uAPI requirement that mremap() may only operate on a single VMA at a time. For instances where VMAs need to be resized, this makes sense, as it becomes very difficult to determine what a user actually wants should they indicate a desire to expand or shrink the size of multiple VMAs (truncate? Adjust sizes individually? Some other strategy?). However, in instances where a user is moving VMAs, it is restrictive to disallow this. This is especially the case when anonymous mapping remap may or may not be mergeable depending on whether VMAs have or have not been faulted due to anon_vma assignment and folio index alignment with vma->vm_pgoff. Often this can result in surprising impact where a moved region is faulted, then moved back and a user fails to observe a merge from otherwise compatible, adjacent VMAs. This change allows such cases to work without the user having to be cognizant of whether a prior mremap() move or other VMA operations has resulted in VMA fragmentation. In order to do this, this series performs a large amount of refactoring, most pertinently - grouping sanity checks together, separately those that check input parameters and those relating to VMAs. we also simplify the post-mmap lock drop processing for uffd and mlock()'d VMAs. With this done, we can then fairly straightforwardly implement this functionality. This works exclusively for mremap() invocations which specify MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the notification of the userland fault handler would require us to drop the mmap lock. The input and output addresses ranges must not overlap. We carefully account for moves which would result in VMA merges or would otherwise result in VMA iterator invalidation. v2: * Squashed uffd stub fix into series. * Propagated tags, thanks! * Fixed param naming in patch 4 as per Vlastimil. * Renamed vma_reset to vmi_needs_reset + dropped reset on unmap as per Liam. * Correctly return -EFAULT if no VMAs in input range. * Account for get_unmapped_area() disregarding MAP_FIXED and returning an altered address. * Added additional explanatatory comment to the remap_move() function. v1: https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/ Lorenzo Stoakes (10): mm/mremap: perform some simple cleanups mm/mremap: refactor initial parameter sanity checks mm/mremap: put VMA check and prep logic into helper function mm/mremap: cleanup post-processing stage of mremap mm/mremap: use an explicit uffd failure path for mremap mm/mremap: check remap conditions earlier mm/mremap: move remap_is_valid() into check_prep_vma() mm/mremap: clean up mlock populate behaviour mm/mremap: permit mremap() move of multiple VMAs tools/testing/selftests: extend mremap_test to test multi-VMA mremap fs/userfaultfd.c | 15 +- include/linux/userfaultfd_k.h | 5 + mm/mremap.c | 528 ++++++++++++++--------- tools/testing/selftests/mm/mremap_test.c | 145 ++++++- 4 files changed, 492 insertions(+), 201 deletions(-) -- 2.50.0

4 days, 3 hours

1
10
0 0

[PATCH v3] selftests: cachestat: add tests for mmap Refactor and enhance mmap test for cachestat validation

by Suresh K C

From: Suresh K C <suresh.k.chandrappa(a)gmail.com> This patch merges the previous two patches into a single, cohesive test case that verifies cachestat behavior with memory-mapped files using mmap(). It also refactors the test logic to reduce redundancy, improve error reporting, and clarify failure messages for both shmem and mmap file types. Changes since v2: - Merged the two patches into one as suggested - Improved error messages for better clarity Tested on x86_64 with default kernel config. Signed-off-by: Suresh K C <suresh.k.chandrappa(a)gmail.com> --- .../selftests/cachestat/test_cachestat.c | 66 +++++++++++++++---- 1 file changed, 55 insertions(+), 11 deletions(-) diff --git a/tools/testing/selftests/cachestat/test_cachestat.c b/tools/testing/selftests/cachestat/test_cachestat.c index 632ab44737ec..18188d7c639e 100644 --- a/tools/testing/selftests/cachestat/test_cachestat.c +++ b/tools/testing/selftests/cachestat/test_cachestat.c @@ -33,6 +33,11 @@ void print_cachestat(struct cachestat *cs) cs->nr_evicted, cs->nr_recently_evicted); } +enum file_type { + FILE_MMAP, + FILE_SHMEM +}; + bool write_exactly(int fd, size_t filesize) { int random_fd = open("/dev/urandom", O_RDONLY); @@ -201,8 +206,19 @@ static int test_cachestat(const char *filename, bool write_random, bool create, out: return ret; } +const char* file_type_str(enum file_type type) { + switch (type) { + case FILE_SHMEM: + return "shmem"; + case FILE_MMAP: + return "mmap"; + default: + return "unknown"; + } +} -bool test_cachestat_shmem(void) + +bool run_cachestat_test(enum file_type type) { size_t PS = sysconf(_SC_PAGESIZE); size_t filesize = PS * 512 * 2; /* 2 2MB huge pages */ @@ -212,27 +228,49 @@ bool test_cachestat_shmem(void) char *filename = "tmpshmcstat"; struct cachestat cs; bool ret = true; + int fd; unsigned long num_pages = compute_len / PS; - int fd = shm_open(filename, O_CREAT | O_RDWR, 0600); + if (type == FILE_SHMEM) + fd = shm_open(filename, O_CREAT | O_RDWR, 0600); + else + fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666); if (fd < 0) { - ksft_print_msg("Unable to create shmem file.\n"); + ksft_print_msg("Unable to create %s file.\n",file_type_str(type)); ret = false; goto out; } if (ftruncate(fd, filesize)) { - ksft_print_msg("Unable to truncate shmem file.\n"); + ksft_print_msg("Unable to truncate %s file.\n",file_type_str(type)); ret = false; goto close_fd; } - - if (!write_exactly(fd, filesize)) { - ksft_print_msg("Unable to write to shmem file.\n"); - ret = false; - goto close_fd; + switch (type){ + case FILE_SHMEM: + if (!write_exactly(fd, filesize)) { + ksft_print_msg("Unable to write to file.\n"); + ret = false; + goto close_fd; + } + break; + case FILE_MMAP: + char *map = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (map == MAP_FAILED) { + ksft_print_msg("mmap failed.\n"); + ret = false; + goto close_fd; + } + for (int i = 0; i < filesize; i++) { + map[i] = 'A'; + } + break; + default: + ksft_print_msg("Unsupported file type.\n"); + ret = false; + goto close_fd; + break; } - syscall_ret = syscall(__NR_cachestat, fd, &cs_range, &cs, 0); if (syscall_ret) { @@ -308,12 +346,18 @@ int main(void) break; } - if (test_cachestat_shmem()) + if (run_cachestat_test(FILE_SHMEM)) ksft_test_result_pass("cachestat works with a shmem file\n"); else { ksft_test_result_fail("cachestat fails with a shmem file\n"); ret = 1; } + if (run_cachestat_test(FILE_MMAP)) + ksft_test_result_pass("cachestat works with a mmap file\n"); + else { + ksft_test_result_fail("cachestat fails with a mmap file\n"); + ret = 1; + } return ret; } -- 2.43.0

4 days, 4 hours

3
2
0 0

[PATCH v4] selftests: riscv: add misaligned access testing

by Clément Léger

This selftest tests all the currently emulated instructions (except for the RV32 compressed ones which are left as a future exercise for a RV32 user). For the FPU instructions, all the FPU registers are tested. Signed-off-by: Clément Léger <cleger(a)rivosinc.com> --- Note: This test can be executed using an SBI that support FWFT or that delegates misaligned traps by default. If using QEMU, you will need the patches mentioned at [1] so that misaligned accesses will generate a trap. Note: This commit was part of a series [2] that was partially merged. Note: the remaining checkpatch errors are not applicable to this tests which is a userspace one and does not use the kernel headers. Macros with complex values can not be enclosed in do while loop since they are generating functions. Link: https://lore.kernel.org/all/20241211211933.198792-1-fkonrad@amd.com/ [1] Link: https://lore.kernel.org/linux-riscv/20250422162324.956065-1-cleger@rivosinc… [2] V4: - Fixed a typo in test name s/load/store V3: - Fixed a segfault and a sign extension error found when compiling with -O<x>, x != 0 (Alex) - Use inline assembly to generate the sigbus and avoid GCC optimizations V2: - Fix commit description - Fix a few errors reported by checkpatch.pl tools/testing/selftests/riscv/Makefile | 2 +- .../selftests/riscv/misaligned/.gitignore | 1 + .../selftests/riscv/misaligned/Makefile | 12 + .../selftests/riscv/misaligned/common.S | 33 +++ .../testing/selftests/riscv/misaligned/fpu.S | 180 +++++++++++++ tools/testing/selftests/riscv/misaligned/gp.S | 103 +++++++ .../selftests/riscv/misaligned/misaligned.c | 253 ++++++++++++++++++ 7 files changed, 583 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/riscv/misaligned/.gitignore create mode 100644 tools/testing/selftests/riscv/misaligned/Makefile create mode 100644 tools/testing/selftests/riscv/misaligned/common.S create mode 100644 tools/testing/selftests/riscv/misaligned/fpu.S create mode 100644 tools/testing/selftests/riscv/misaligned/gp.S create mode 100644 tools/testing/selftests/riscv/misaligned/misaligned.c diff --git a/tools/testing/selftests/riscv/Makefile b/tools/testing/selftests/riscv/Makefile index 099b8c1f46f8..95a98ceeb3b3 100644 --- a/tools/testing/selftests/riscv/Makefile +++ b/tools/testing/selftests/riscv/Makefile @@ -5,7 +5,7 @@ ARCH ?= $(shell uname -m 2>/dev/null || echo not) ifneq (,$(filter $(ARCH),riscv)) -RISCV_SUBTARGETS ?= abi hwprobe mm sigreturn vector +RISCV_SUBTARGETS ?= abi hwprobe mm sigreturn vector misaligned else RISCV_SUBTARGETS := endif diff --git a/tools/testing/selftests/riscv/misaligned/.gitignore b/tools/testing/selftests/riscv/misaligned/.gitignore new file mode 100644 index 000000000000..5eff15a1f981 --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/.gitignore @@ -0,0 +1 @@ +misaligned diff --git a/tools/testing/selftests/riscv/misaligned/Makefile b/tools/testing/selftests/riscv/misaligned/Makefile new file mode 100644 index 000000000000..1aa40110c50d --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/Makefile @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) 2021 ARM Limited +# Originally tools/testing/arm64/abi/Makefile + +CFLAGS += -I$(top_srcdir)/tools/include + +TEST_GEN_PROGS := misaligned + +include ../../lib.mk + +$(OUTPUT)/misaligned: misaligned.c fpu.S gp.S + $(CC) -g3 -static -o$@ -march=rv64imafdc $(CFLAGS) $(LDFLAGS) $^ diff --git a/tools/testing/selftests/riscv/misaligned/common.S b/tools/testing/selftests/riscv/misaligned/common.S new file mode 100644 index 000000000000..8fa00035bd5d --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/common.S @@ -0,0 +1,33 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2025 Rivos Inc. + * + * Authors: + * Clément Léger <cleger(a)rivosinc.com> + */ + +.macro lb_sb temp, offset, src, dst + lb \temp, \offset(\src) + sb \temp, \offset(\dst) +.endm + +.macro copy_long_to temp, src, dst + lb_sb \temp, 0, \src, \dst, + lb_sb \temp, 1, \src, \dst, + lb_sb \temp, 2, \src, \dst, + lb_sb \temp, 3, \src, \dst, + lb_sb \temp, 4, \src, \dst, + lb_sb \temp, 5, \src, \dst, + lb_sb \temp, 6, \src, \dst, + lb_sb \temp, 7, \src, \dst, +.endm + +.macro sp_stack_prologue offset + addi sp, sp, -8 + sub sp, sp, \offset +.endm + +.macro sp_stack_epilogue offset + add sp, sp, \offset + addi sp, sp, 8 +.endm diff --git a/tools/testing/selftests/riscv/misaligned/fpu.S b/tools/testing/selftests/riscv/misaligned/fpu.S new file mode 100644 index 000000000000..a7ad4430a424 --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/fpu.S @@ -0,0 +1,180 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2025 Rivos Inc. + * + * Authors: + * Clément Léger <cleger(a)rivosinc.com> + */ + +#include "common.S" + +#define CASE_ALIGN 4 + +.macro fpu_load_inst fpreg, inst, precision, load_reg +.align CASE_ALIGN + \inst \fpreg, 0(\load_reg) + fmv.\precision fa0, \fpreg + j 2f +.endm + +#define flw(__fpreg) fpu_load_inst __fpreg, flw, s, a4 +#define fld(__fpreg) fpu_load_inst __fpreg, fld, d, a4 +#define c_flw(__fpreg) fpu_load_inst __fpreg, c.flw, s, a4 +#define c_fld(__fpreg) fpu_load_inst __fpreg, c.fld, d, a4 +#define c_fldsp(__fpreg) fpu_load_inst __fpreg, c.fldsp, d, sp + +.macro fpu_store_inst fpreg, inst, precision, store_reg +.align CASE_ALIGN + fmv.\precision \fpreg, fa0 + \inst \fpreg, 0(\store_reg) + j 2f +.endm + +#define fsw(__fpreg) fpu_store_inst __fpreg, fsw, s, a4 +#define fsd(__fpreg) fpu_store_inst __fpreg, fsd, d, a4 +#define c_fsw(__fpreg) fpu_store_inst __fpreg, c.fsw, s, a4 +#define c_fsd(__fpreg) fpu_store_inst __fpreg, c.fsd, d, a4 +#define c_fsdsp(__fpreg) fpu_store_inst __fpreg, c.fsdsp, d, sp + +.macro fp_test_prologue + move a4, a1 + /* + * Compute jump offset to store the correct FP register since we don't + * have indirect FP register access (or at least we don't use this + * extension so that works on all archs) + */ + sll t0, a0, CASE_ALIGN + la t2, 1f + add t0, t0, t2 + jr t0 +.align CASE_ALIGN +1: +.endm + +.macro fp_test_prologue_compressed + /* FP registers for compressed instructions starts from 8 to 16 */ + addi a0, a0, -8 + fp_test_prologue +.endm + +#define fp_test_body_compressed(__inst_func) \ + __inst_func(f8); \ + __inst_func(f9); \ + __inst_func(f10); \ + __inst_func(f11); \ + __inst_func(f12); \ + __inst_func(f13); \ + __inst_func(f14); \ + __inst_func(f15); \ +2: + +#define fp_test_body(__inst_func) \ + __inst_func(f0); \ + __inst_func(f1); \ + __inst_func(f2); \ + __inst_func(f3); \ + __inst_func(f4); \ + __inst_func(f5); \ + __inst_func(f6); \ + __inst_func(f7); \ + __inst_func(f8); \ + __inst_func(f9); \ + __inst_func(f10); \ + __inst_func(f11); \ + __inst_func(f12); \ + __inst_func(f13); \ + __inst_func(f14); \ + __inst_func(f15); \ + __inst_func(f16); \ + __inst_func(f17); \ + __inst_func(f18); \ + __inst_func(f19); \ + __inst_func(f20); \ + __inst_func(f21); \ + __inst_func(f22); \ + __inst_func(f23); \ + __inst_func(f24); \ + __inst_func(f25); \ + __inst_func(f26); \ + __inst_func(f27); \ + __inst_func(f28); \ + __inst_func(f29); \ + __inst_func(f30); \ + __inst_func(f31); \ +2: +.text + +#define __gen_test_inst(__inst, __suffix) \ +.global test_ ## __inst; \ +test_ ## __inst:; \ + fp_test_prologue ## __suffix; \ + fp_test_body ## __suffix(__inst); \ + ret + +#define gen_test_inst_compressed(__inst) \ + .option arch,+c; \ + __gen_test_inst(c_ ## __inst, _compressed) + +#define gen_test_inst(__inst) \ + .balign 16; \ + .option push; \ + .option arch,-c; \ + __gen_test_inst(__inst, ); \ + .option pop + +.macro fp_test_prologue_load_compressed_sp + copy_long_to t0, a1, sp +.endm + +.macro fp_test_epilogue_load_compressed_sp +.endm + +.macro fp_test_prologue_store_compressed_sp +.endm + +.macro fp_test_epilogue_store_compressed_sp + copy_long_to t0, sp, a1 +.endm + +#define gen_inst_compressed_sp(__inst, __type) \ + .global test_c_ ## __inst ## sp; \ + test_c_ ## __inst ## sp:; \ + sp_stack_prologue a2; \ + fp_test_prologue_## __type ## _compressed_sp; \ + fp_test_prologue_compressed; \ + fp_test_body_compressed(c_ ## __inst ## sp); \ + fp_test_epilogue_## __type ## _compressed_sp; \ + sp_stack_epilogue a2; \ + ret + +#define gen_test_load_compressed_sp(__inst) gen_inst_compressed_sp(__inst, load) +#define gen_test_store_compressed_sp(__inst) gen_inst_compressed_sp(__inst, store) + +/* + * float_fsw_reg - Set a FP register from a register containing the value + * a0 = FP register index to be set + * a1 = addr where to store register value + * a2 = address offset + * a3 = value to be store + */ +gen_test_inst(fsw) + +/* + * float_flw_reg - Get a FP register value and return it + * a0 = FP register index to be retrieved + * a1 = addr to load register from + * a2 = address offset + */ +gen_test_inst(flw) + +gen_test_inst(fsd) +#ifdef __riscv_compressed +gen_test_inst_compressed(fsd) +gen_test_store_compressed_sp(fsd) +#endif + +gen_test_inst(fld) +#ifdef __riscv_compressed +gen_test_inst_compressed(fld) +gen_test_load_compressed_sp(fld) +#endif diff --git a/tools/testing/selftests/riscv/misaligned/gp.S b/tools/testing/selftests/riscv/misaligned/gp.S new file mode 100644 index 000000000000..f53f4c6d81dd --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/gp.S @@ -0,0 +1,103 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2025 Rivos Inc. + * + * Authors: + * Clément Léger <cleger(a)rivosinc.com> + */ + +#include "common.S" + +.text + +.macro __gen_test_inst inst, src_reg + \inst a2, 0(\src_reg) + move a0, a2 +.endm + +.macro gen_func_header func_name, rvc + .option arch,\rvc + .global test_\func_name + test_\func_name: +.endm + +.macro gen_test_inst inst + .option push + gen_func_header \inst, -c + __gen_test_inst \inst, a0 + .option pop + ret +.endm + +.macro __gen_test_inst_c name, src_reg + .option push + gen_func_header c_\name, +c + __gen_test_inst c.\name, \src_reg + .option pop + ret +.endm + +.macro gen_test_inst_c name + __gen_test_inst_c \name, a0 +.endm + + +.macro gen_test_inst_load_c_sp name + .option push + gen_func_header c_\name\()sp, +c + sp_stack_prologue a1 + copy_long_to t0, a0, sp + c.ldsp a0, 0(sp) + sp_stack_epilogue a1 + .option pop + ret +.endm + +.macro lb_sp_sb_a0 reg, offset + lb_sb \reg, \offset, sp, a0 +.endm + +.macro gen_test_inst_store_c_sp inst_name + .option push + gen_func_header c_\inst_name\()sp, +c + /* Misalign stack pointer */ + sp_stack_prologue a1 + /* Misalign access */ + c.sdsp a2, 0(sp) + copy_long_to t0, sp, a0 + sp_stack_epilogue a1 + .option pop + ret +.endm + + + /* + * a0 = addr to load from + * a1 = address offset + * a2 = value to be loaded + */ +gen_test_inst lh +gen_test_inst lhu +gen_test_inst lw +gen_test_inst lwu +gen_test_inst ld +#ifdef __riscv_compressed +gen_test_inst_c lw +gen_test_inst_c ld +gen_test_inst_load_c_sp ld +#endif + +/* + * a0 = addr where to store value + * a1 = address offset + * a2 = value to be stored + */ +gen_test_inst sh +gen_test_inst sw +gen_test_inst sd +#ifdef __riscv_compressed +gen_test_inst_c sw +gen_test_inst_c sd +gen_test_inst_store_c_sp sd +#endif + diff --git a/tools/testing/selftests/riscv/misaligned/misaligned.c b/tools/testing/selftests/riscv/misaligned/misaligned.c new file mode 100644 index 000000000000..690687da7a93 --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/misaligned.c @@ -0,0 +1,253 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2025 Rivos Inc. + * + * Authors: + * Clément Léger <cleger(a)rivosinc.com> + */ +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <linux/ptrace.h> +#include "../../kselftest_harness.h" + +#include <stdlib.h> +#include <stdio.h> +#include <stdint.h> +#include <float.h> +#include <errno.h> +#include <math.h> +#include <string.h> +#include <signal.h> +#include <stdbool.h> +#include <unistd.h> +#include <inttypes.h> +#include <ucontext.h> + +#include <sys/prctl.h> + +#define stringify(s) __stringify(s) +#define __stringify(s) #s + +#define VAL16 0x1234U +#define VAL32 0x5EADBEEFUL +#define VAL64 0x45674321D00DF789ULL + +#define VAL_float 78951.234375 +#define VAL_double 567890.512396965789589290 + +static bool float_equal(float a, float b) +{ + float scaled_epsilon; + float difference = fabsf(a - b); + + // Scale to the largest value. + a = fabsf(a); + b = fabsf(b); + if (a > b) + scaled_epsilon = FLT_EPSILON * a; + else + scaled_epsilon = FLT_EPSILON * b; + + return difference <= scaled_epsilon; +} + +static bool double_equal(double a, double b) +{ + double scaled_epsilon; + double difference = fabsl(a - b); + + // Scale to the largest value. + a = fabs(a); + b = fabs(b); + if (a > b) + scaled_epsilon = DBL_EPSILON * a; + else + scaled_epsilon = DBL_EPSILON * b; + + return difference <= scaled_epsilon; +} + +#define fpu_load_proto(__inst, __type) \ +extern __type test_ ## __inst(unsigned long fp_reg, void *addr, unsigned long offset, __type value) + +fpu_load_proto(flw, float); +fpu_load_proto(fld, double); +fpu_load_proto(c_flw, float); +fpu_load_proto(c_fld, double); +fpu_load_proto(c_fldsp, double); + +#define fpu_store_proto(__inst, __type) \ +extern void test_ ## __inst(unsigned long fp_reg, void *addr, unsigned long offset, __type value) + +fpu_store_proto(fsw, float); +fpu_store_proto(fsd, double); +fpu_store_proto(c_fsw, float); +fpu_store_proto(c_fsd, double); +fpu_store_proto(c_fsdsp, double); + +#define gp_load_proto(__inst, __type) \ +extern __type test_ ## __inst(void *addr, unsigned long offset, __type value) + +gp_load_proto(lh, uint16_t); +gp_load_proto(lhu, uint16_t); +gp_load_proto(lw, uint32_t); +gp_load_proto(lwu, uint32_t); +gp_load_proto(ld, uint64_t); +gp_load_proto(c_lw, uint32_t); +gp_load_proto(c_ld, uint64_t); +gp_load_proto(c_ldsp, uint64_t); + +#define gp_store_proto(__inst, __type) \ +extern void test_ ## __inst(void *addr, unsigned long offset, __type value) + +gp_store_proto(sh, uint16_t); +gp_store_proto(sw, uint32_t); +gp_store_proto(sd, uint64_t); +gp_store_proto(c_sw, uint32_t); +gp_store_proto(c_sd, uint64_t); +gp_store_proto(c_sdsp, uint64_t); + +#define TEST_GP_LOAD(__inst, __type_size) \ +TEST(gp_load_ ## __inst) \ +{ \ + int offset, ret; \ + uint8_t buf[16] __attribute__((aligned(16))); \ + \ + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT); \ + ASSERT_EQ(ret, 0); \ + \ + for (offset = 1; offset < (__type_size) / 8; offset++) { \ + uint ## __type_size ## _t val = VAL ## __type_size; \ + uint ## __type_size ## _t *ptr = (uint ## __type_size ## _t *)(buf + offset); \ + memcpy(ptr, &val, sizeof(val)); \ + val = test_ ## __inst(ptr, offset, val); \ + EXPECT_EQ(VAL ## __type_size, val); \ + } \ +} + +TEST_GP_LOAD(lh, 16); +TEST_GP_LOAD(lhu, 16); +TEST_GP_LOAD(lw, 32); +TEST_GP_LOAD(lwu, 32); +TEST_GP_LOAD(ld, 64); +#ifdef __riscv_compressed +TEST_GP_LOAD(c_lw, 32); +TEST_GP_LOAD(c_ld, 64); +TEST_GP_LOAD(c_ldsp, 64); +#endif + +#define TEST_GP_STORE(__inst, __type_size) \ +TEST(gp_store_ ## __inst) \ +{ \ + int offset, ret; \ + uint8_t buf[16] __attribute__((aligned(16))); \ + \ + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT); \ + ASSERT_EQ(ret, 0); \ + \ + for (offset = 1; offset < (__type_size) / 8; offset++) { \ + uint ## __type_size ## _t val = VAL ## __type_size; \ + uint ## __type_size ## _t *ptr = (uint ## __type_size ## _t *)(buf + offset); \ + memset(ptr, 0, sizeof(val)); \ + test_ ## __inst(ptr, offset, val); \ + memcpy(&val, ptr, sizeof(val)); \ + EXPECT_EQ(VAL ## __type_size, val); \ + } \ +} +TEST_GP_STORE(sh, 16); +TEST_GP_STORE(sw, 32); +TEST_GP_STORE(sd, 64); +#ifdef __riscv_compressed +TEST_GP_STORE(c_sw, 32); +TEST_GP_STORE(c_sd, 64); +TEST_GP_STORE(c_sdsp, 64); +#endif + +#define __TEST_FPU_LOAD(__type, __inst, __reg_start, __reg_end) \ +TEST(fpu_load_ ## __inst) \ +{ \ + int ret, offset, fp_reg; \ + uint8_t buf[16] __attribute__((aligned(16))); \ + \ + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT); \ + ASSERT_EQ(ret, 0); \ + \ + for (fp_reg = __reg_start; fp_reg < __reg_end; fp_reg++) { \ + for (offset = 1; offset < 4; offset++) { \ + void *load_addr = (buf + offset); \ + __type val = VAL_ ## __type ; \ + \ + memcpy(load_addr, &val, sizeof(val)); \ + val = test_ ## __inst(fp_reg, load_addr, offset, val); \ + EXPECT_TRUE(__type ##_equal(val, VAL_## __type)); \ + } \ + } \ +} +#define TEST_FPU_LOAD(__type, __inst) \ + __TEST_FPU_LOAD(__type, __inst, 0, 32) +#define TEST_FPU_LOAD_COMPRESSED(__type, __inst) \ + __TEST_FPU_LOAD(__type, __inst, 8, 16) + +TEST_FPU_LOAD(float, flw) +TEST_FPU_LOAD(double, fld) +#ifdef __riscv_compressed +TEST_FPU_LOAD_COMPRESSED(double, c_fld) +TEST_FPU_LOAD_COMPRESSED(double, c_fldsp) +#endif + +#define __TEST_FPU_STORE(__type, __inst, __reg_start, __reg_end) \ +TEST(fpu_store_ ## __inst) \ +{ \ + int ret, offset, fp_reg; \ + uint8_t buf[16] __attribute__((aligned(16))); \ + \ + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT); \ + ASSERT_EQ(ret, 0); \ + \ + for (fp_reg = __reg_start; fp_reg < __reg_end; fp_reg++) { \ + for (offset = 1; offset < 4; offset++) { \ + \ + void *store_addr = (buf + offset); \ + __type val = VAL_ ## __type ; \ + \ + test_ ## __inst(fp_reg, store_addr, offset, val); \ + memcpy(&val, store_addr, sizeof(val)); \ + EXPECT_TRUE(__type ## _equal(val, VAL_## __type)); \ + } \ + } \ +} +#define TEST_FPU_STORE(__type, __inst) \ + __TEST_FPU_STORE(__type, __inst, 0, 32) +#define TEST_FPU_STORE_COMPRESSED(__type, __inst) \ + __TEST_FPU_STORE(__type, __inst, 8, 16) + +TEST_FPU_STORE(float, fsw) +TEST_FPU_STORE(double, fsd) +#ifdef __riscv_compressed +TEST_FPU_STORE_COMPRESSED(double, c_fsd) +TEST_FPU_STORE_COMPRESSED(double, c_fsdsp) +#endif + +TEST_SIGNAL(gen_sigbus, SIGBUS) +{ + uint32_t val = VAL32; + uint8_t buf[16] __attribute__((aligned(16))); + int ret; + + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_SIGBUS); + ASSERT_EQ(ret, 0); + + asm volatile("sw %0, 1(%1)" : : "r"(val), "r"(buf) : "memory"); +} + +int main(int argc, char **argv) +{ + int ret, val; + + ret = prctl(PR_GET_UNALIGN, &val); + if (ret == -1 && errno == EINVAL) + ksft_exit_skip("SKIP GET_UNALIGN_CTL not supported\n"); + + exit(test_harness_run(argc, argv)); +} -- 2.50.0

4 days, 4 hours

2
4
0 0

[PATCH 0/3] selftests/hid: upgrade the python scripts to match hid-tools 0.10

by Benjamin Tissoires

hid-tools 0.10 fixed a test regression introduced in 6.16-rc1: the kernel might communicate with the uhid node while the test suite opens the evdev node. This leads to a full test-suite time which used to run in 6 minutes into an hour. Merge the upstream hid-tools project in the selftest kernel dir to reduce that time to something manageable again. Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org> --- Benjamin Tissoires (3): selftests/hid: run ruff format on the python part selftests/hid: sync the python tests to hid-tools 0.8 selftests/hid: sync python tests to hid-tools 0.10 tools/testing/selftests/hid/tests/base.py | 46 ++- tools/testing/selftests/hid/tests/base_device.py | 49 ++- .../selftests/hid/tests/test_apple_keyboard.py | 3 +- tools/testing/selftests/hid/tests/test_gamepad.py | 3 +- .../selftests/hid/tests/test_ite_keyboard.py | 3 +- .../testing/selftests/hid/tests/test_multitouch.py | 2 +- tools/testing/selftests/hid/tests/test_sony.py | 7 +- tools/testing/selftests/hid/tests/test_tablet.py | 11 +- .../selftests/hid/tests/test_wacom_generic.py | 445 +++++++++++++++------ 9 files changed, 412 insertions(+), 157 deletions(-) --- base-commit: 2043ae9019e0f75c7785048230586c3f3ca0a2a4 change-id: 20250709-wip-fix-ci-d03bd06f778e Best regards, -- Benjamin Tissoires <bentiss(a)kernel.org>

4 days, 4 hours

2
5
0 0

[PATCH] selftests/thermal: Remove duplicate sprintf() call in workload_hint_test

by WangYuli

Remove redundant sprintf() call that was duplicating the same operation of formatting delay_str with argv[1]. Signed-off-by: WangYuli <wangyuli(a)uniontech.com> --- .../selftests/thermal/intel/workload_hint/workload_hint_test.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/tools/testing/selftests/thermal/intel/workload_hint/workload_hint_test.c b/tools/testing/selftests/thermal/intel/workload_hint/workload_hint_test.c index a40097232967..bda006af8b1b 100644 --- a/tools/testing/selftests/thermal/intel/workload_hint/workload_hint_test.c +++ b/tools/testing/selftests/thermal/intel/workload_hint/workload_hint_test.c @@ -67,8 +67,6 @@ int main(int argc, char **argv) if (delay < 0) exit(1); - sprintf(delay_str, "%s\n", argv[1]); - sprintf(delay_str, "%s\n", argv[1]); fd = open(WORKLOAD_NOTIFICATION_DELAY_ATTRIBUTE, O_RDWR); if (fd < 0) { -- 2.50.0

4 days, 5 hours

1
1
0 0

[PATCH v3] selftests: futex: define SYS_futex on 32-bit architectures with 64-bit time_t

by Ben Zong-You Xie

From: Cynthia Huang <cynthia(a)andestech.com> Linux kernel does not provide sys_futex() on some 32-bit architectures that do not support 32-bit time representations, such as riscv32. As a result, glibc cannot define SYS_futex, causing compilation failures in tests that rely on this syscall. Define SYS_futex as SYS_futex_time64 in such cases to ensure successful compilation and compatibility. Signed-off-by: Cynthia Huang <cynthia(a)andestech.com> Signed-off-by: Ben Zong-You Xie <ben717(a)andestech.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com> --- Changes since v2: - Refine the commit message suggested by tglx - Attribute the authorship to Cynthia - Add the Reviewed-by tag from Muhammad v2 : https://lore.kernel.org/all/20250627090812.937939-1-ben717@andestech.com/ Changes since v1: - Fix the SOB chain v1 : https://lore.kernel.org/all/20250527093536.3646143-1-ben717@andestech.com/ --- tools/testing/selftests/futex/include/futextest.h | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h index ddbcfc9b7bac..7a5fd1d5355e 100644 --- a/tools/testing/selftests/futex/include/futextest.h +++ b/tools/testing/selftests/futex/include/futextest.h @@ -47,6 +47,17 @@ typedef volatile u_int32_t futex_t; FUTEX_PRIVATE_FLAG) #endif +/* + * SYS_futex is expected from system C library, in glibc some 32-bit + * architectures (e.g. RV32) are using 64-bit time_t, therefore it doesn't have + * SYS_futex defined but just SYS_futex_time64. Define SYS_futex as + * SYS_futex_time64 in this situation to ensure the compilation and the + * compatibility. + */ +#if !defined(SYS_futex) && defined(SYS_futex_time64) +#define SYS_futex SYS_futex_time64 +#endif + /** * futex() - SYS_futex syscall wrapper * @uaddr: address of first futex -- 2.34.1

4 days, 8 hours

1
0
0 0

[PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE)

by Nicolin Chen

The vIOMMU object is designed to represent a slice of an IOMMU HW for its virtualization features shared with or passed to user space (a VM mostly) in a way of HW acceleration. This extended the HWPT-based design for more advanced virtualization feature. HW QUEUE introduced by this series as a part of the vIOMMU infrastructure represents a HW accelerated queue/buffer for VM to use exclusively, e.g. - NVIDIA's Virtual Command Queue - AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer each of which allows its IOMMU HW to directly access a queue memory owned by a guest VM and allows a guest OS to control the HW queue direclty, to avoid VM Exit overheads to improve the performance. Introduce IOMMUFD_OBJ_HW_QUEUE and its pairing IOMMUFD_CMD_HW_QUEUE_ALLOC allowing VMM to forward the IOMMU-specific queue info, such as queue base address, size, and etc. Meanwhile, a guest-owned queue needs the guest kernel to control the queue by reading/writing its consumer and producer indexes, via MMIO acceses to the hardware MMIO registers. Introduce an mmap infrastructure for iommufd to support passing through a piece of MMIO region from the host physical address space to the guest physical address space. The mmap info (offset/ length) used by an mmap syscall must be pre-allocated and returned to the user space via an output driver-data during an IOMMUFD_CMD_HW_QUEUE_ALLOC call. Thus, it requires a driver-specific user data support in the vIOMMU allocation flow. As a real-world use case, this series implements a HW QUEUE support in the tegra241-cmdqv driver for VCMDQs on NVIDIA Grace CPU. In another word, it is also the Tegra CMDQV series Part-2 (user-space support), reworked from Previous RFCv1: https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/ This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to the standard SMMUv3 operating in the nested translation mode trapping CMDQ for TLBI and ATC_INV commands, this gives a huge performance improvement: 70% to 90% reductions of invalidation time were measured by various DMA unmap tests running in a guest OS. // Unmap latencies from "dma_map_benchmark -g @granule -t @threads", // by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq" @granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0 4KB 1 35.7 us 5.3 us 16KB 1 41.8 us 6.8 us 64KB 1 68.9 us 9.9 us 128KB 1 109.0 us 12.6 us 256KB 1 187.1 us 18.0 us 4KB 2 96.9 us 6.8 us 16KB 2 97.8 us 7.5 us 64KB 2 151.5 us 10.7 us 128KB 2 257.8 us 12.7 us 256KB 2 443.0 us 17.9 us This is on Github: https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v7 Paring QEMU branch for testing: https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v7 Changelog v7 * Rebased on Jason's for-next tree (iommufd_hw_queue-prep series) * Add Reviewed-by from Baolu, Jason, Pranjal * Update kdocs and notes * [iommu] Replace "u32" with "enum iommu_hw_info_type" * [iommufd] Rename vdev->id to vdev->virt_id * [iommufd] Replace macros with inline helpers * [iommufd] Report unmapped_bytes in error path * [iommufd] Add iommufd_access_is_internal helper * [iommufd] Do not drop ops->unmap check for mdevs * [iommufd] Store physical addresses in immap structure * [iommufd] Reorder access and hw_queue object allocations * [iommufd] Scan for an internal access before any unmap call * [iommufd] Drop unused ictx pointer in struct iommufd_hw_queue * [iommufd] Use kcalloc to avoid failure due to memory fragmentation * [tegra] Use "else" * [tegra] Lock destroy() using lvcmdq_mutex v6 https://lore.kernel.org/all/cover.1749884998.git.nicolinc@nvidia.com/ * Rebase on iommufd_hw_queue-prep-v2 * Add Reviewed-by from Kevin and Jason * [iommufd] Update kdocs and notes * [iommufd] Drop redundant pages[i] check * [iommufd] Allow nesting_parent_iova to be 0 * [iommufd] Add iommufd_hw_queue_alloc_phys() * [iommufd] Revise iommufd_viommu_alloc/destroy_mmap APIs * [iommufd] Move destroy ops to vdevice/hw_queue structures * [iommufd] Add union in hw_info struct to share out_data_type field * [iommufd] Replace iopt_pin/unpin_pages() with internal access APIs * [iommufd] Replace vdevice_alloc with vdevice_size and vdevice_init * [iommufd] Replace hw_queue_alloc with get_hw_queue_size/hw_queue_init * [iommufd] Replace IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA with init_phys * [smmu] Drop arm_smmu_domain_ipa_to_pa * [smmu] Update arm_smmu_impl_ops changes for vsmmu_init * [tegra] Add a vdev_to_vsid macro * [tegra] Add lvcmdq_mutex to protect multi queues * [tegra] Drop duplicated kcalloc for vintf->lvcmdqs (memory leak) v5 https://lore.kernel.org/all/cover.1747537752.git.nicolinc@nvidia.com/ * Rebase on v6.15-rc6 * Add Reviewed-by from Jason and Kevin * Correct typos in kdoc and update commit logs * [iommufd] Add a cosmetic fix * [iommufd] Drop unused num_pfns * [iommufd] Drop unnecessary check * [iommufd] Reorder patch sequence * [iommufd] Use io_remap_pfn_range() * [iommufd] Use success oriented flow * [iommufd] Fix max_npages calculation * [iommufd] Add more selftest coverage * [iommufd] Drop redundant static_assert * [iommufd] Fix mmap pfn range validation * [iommufd] Reject unmap on pinned iovas * [iommufd] Drop redundant vm_flags_set() * [iommufd] Drop iommufd_struct_destroy() * [iommufd] Drop redundant queue iova test * [iommufd] Use "mmio_addr" and "mmio_pfn" * [iommufd] Rename to "nesting_parent_iova" * [iommufd] Make iopt_pin_pages call option * [iommufd] Add ictx comparison in depend() * [iommufd] Add iommufd_object_alloc_ucmd() * [iommufd] Move kcalloc() after validations * [iommufd] Replace ictx setting with WARN_ON * [iommufd] Make hw_info's type bidirectional * [smmu] Add supported_vsmmu_type in impl_ops * [smmu] Drop impl report in smmu vendor struct * [tegra] Add IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV * [tegra] Replace "number of VINTFs" with a note * [tegra] Drop the redundant lvcmdq pointer setting * [tegra] Flag IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA * [tegra] Use "vintf_alloc_vsid" for vdevice_alloc op v4 https://lore.kernel.org/all/cover.1746757630.git.nicolinc@nvidia.com/ * Rebase on v6.15-rc5 * Add Reviewed-by from Vasant * Rename "vQUEUE" to "HW QUEUE" * Use "offset" and "length" for all mmap-related variables * [iommufd] Use u64 for guest PA * [iommufd] Fix typo in uAPI doc * [iommufd] Rename immap_id to offset * [iommufd] Drop the partial-size mmap support * [iommufd] Do not replace WARN_ON with WARN_ON_ONCE * [iommufd] Use "u64 base_addr" for queue base address * [iommufd] Use u64 base_pfn/num_pfns for immap structure * [iommufd] Correct the size passed in to mtree_alloc_range() * [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops v3 https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/ * Add Reviewed-by from Baolu, Pranjal, and Alok * Revise kdocs, uAPI docs, and commit logs * Rename "vCMDQ" back to "vQUEUE" for AMD cases * [tegra] Add tegra241_vcmdq_hw_flush_timeout() * [tegra] Rename vsmmu_alloc to alloc_vintf_user * [tegra] Use writel for SID replacement registers * [tegra] Move mmap removal call to vsmmu_destroy op * [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user() * [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED() * [iommufd] Add an object-type "owner" to immap structure * [iommufd] Drop the ictx input in the new for-driver APIs * [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle * [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers * [iommufd] Rename iommufd_ctx_alloc/free_mmap to _iommufd_alloc/destroy_mmap v2 https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/ * Add Reviewed-by from Jason * [smmu] Fix vsmmu initial value * [smmu] Support impl for hw_info * [tegra] Rename "slot" to "vsid" * [tegra] Update kdocs and commit logs * [tegra] Map/unmap LVCMDQ dynamically * [tegra] Refcount the previous LVCMDQ * [tegra] Return -EEXIST if LVCMDQ exists * [tegra] Simplify VINTF cleanup routine * [tegra] Use vmid and s2_domain in vsmmu * [tegra] Rename "mmap_pgoff" to "immap_id" * [tegra] Add more addr and length validation * [iommufd] Add more narrative to mmap's kdoc * [iommufd] Add iommufd_struct_depend/undepend() * [iommufd] Rename vcmdq_free op to vcmdq_destroy * [iommufd] Fix bug in iommu_copy_struct_to_user() * [iommufd] Drop is_io from iommufd_ctx_alloc_mmap() * [iommufd] Test the queue memory for its contiguity * [iommufd] Return -ENXIO if address or length fails * [iommufd] Do not change @min_last in mock_viommu_alloc() * [iommufd] Generalize TEGRA241_VCMDQ data in core structure * [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC * [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping v1 https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/ Thanks Nicolin Nicolin Chen (28): iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_range iommufd/viommu: Explicitly define vdev->virt_id iommu: Use enum iommu_hw_info_type for type in hw_info op iommu: Add iommu_copy_struct_to_user helper iommu: Pass in a driver-level user data structure to viommu_init op iommufd/viommu: Allow driver-specific user data for a vIOMMU object iommufd/selftest: Support user_data in mock_viommu_alloc iommufd/selftest: Add coverage for viommu data iommufd/access: Add internal APIs for HW queue to use iommufd/access: Bypass access->ops->unmap for internal use iommufd/viommu: Add driver-defined vDEVICE support iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC iommufd: Add mmap interface iommufd/selftest: Add coverage for the new mmap interface Documentation: userspace-api: iommufd: Update HW QUEUE iommu: Allow an input type in hw_info op iommufd: Allow an input data_type via iommu_hw_info iommufd/selftest: Update hw_info coverage for an input data_type iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops iommu/tegra241-cmdqv: Use request_threaded_irq iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf() iommu/tegra241-cmdqv: Do not statically map LVCMDQs iommu/tegra241-cmdqv: Add user-space use support iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 22 +- drivers/iommu/iommufd/iommufd_private.h | 50 +- drivers/iommu/iommufd/iommufd_test.h | 20 + include/linux/iommu.h | 50 +- include/linux/iommufd.h | 160 ++++++ include/uapi/linux/iommufd.h | 145 +++++- tools/testing/selftests/iommu/iommufd_utils.h | 89 +++- .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 28 +- .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 484 +++++++++++++++++- drivers/iommu/intel/iommu.c | 7 +- drivers/iommu/iommufd/device.c | 90 +++- drivers/iommu/iommufd/driver.c | 81 ++- drivers/iommu/iommufd/io_pagetable.c | 17 +- drivers/iommu/iommufd/main.c | 69 +++ drivers/iommu/iommufd/selftest.c | 153 +++++- drivers/iommu/iommufd/viommu.c | 208 +++++++- tools/testing/selftests/iommu/iommufd.c | 143 +++++- .../selftests/iommu/iommufd_fail_nth.c | 15 +- Documentation/userspace-api/iommufd.rst | 12 + 19 files changed, 1736 insertions(+), 107 deletions(-) -- 2.43.0

4 days, 9 hours

4
66
0 0

[PATCH] vdso/gettimeofday: Fix code refactoring

by Marek Szyprowski

Commit fcc8e46f768f ("vdso/gettimeofday: Return bool from clock_gettime() helpers") changed the return value from clock_gettime() helpers, but it missed updating the one call to the do_hres() function, what breaks VDSO operation on some of my ARM 32bit based test boards. Fix this. Fixes: fcc8e46f768f ("vdso/gettimeofday: Return bool from clock_gettime() helpers") Signed-off-by: Marek Szyprowski <m.szyprowski(a)samsung.com> --- lib/vdso/gettimeofday.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c index d6743ed756a1..97aa9059a5c9 100644 --- a/lib/vdso/gettimeofday.c +++ b/lib/vdso/gettimeofday.c @@ -395,7 +395,7 @@ __cvdso_gettimeofday_data(const struct vdso_time_data *vd, if (likely(tv != NULL)) { struct __kernel_timespec ts; - if (do_hres(vd, &vc[CS_HRES_COARSE], CLOCK_REALTIME, &ts)) + if (!do_hres(vd, &vc[CS_HRES_COARSE], CLOCK_REALTIME, &ts)) return gettimeofday_fallback(tv, tz); tv->tv_sec = ts.tv_sec; -- 2.34.1

4 days, 11 hours

2
2
0 0

[PATCH v8 00/29] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE)

by Nicolin Chen

The vIOMMU object is designed to represent a slice of an IOMMU HW for its virtualization features shared with or passed to user space (a VM mostly) in a way of HW acceleration. This extended the HWPT-based design for more advanced virtualization feature. HW QUEUE introduced by this series as a part of the vIOMMU infrastructure represents a HW accelerated queue/buffer for VM to use exclusively, e.g. - NVIDIA's Virtual Command Queue - AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer each of which allows its IOMMU HW to directly access a queue memory owned by a guest VM and allows a guest OS to control the HW queue direclty, to avoid VM Exit overheads to improve the performance. Introduce IOMMUFD_OBJ_HW_QUEUE and its pairing IOMMUFD_CMD_HW_QUEUE_ALLOC allowing VMM to forward the IOMMU-specific queue info, such as queue base address, size, and etc. Meanwhile, a guest-owned queue needs the guest kernel to control the queue by reading/writing its consumer and producer indexes, via MMIO acceses to the hardware MMIO registers. Introduce an mmap infrastructure for iommufd to support passing through a piece of MMIO region from the host physical address space to the guest physical address space. The mmap info (offset/ length) used by an mmap syscall must be pre-allocated and returned to the user space via an output driver-data during an IOMMUFD_CMD_HW_QUEUE_ALLOC call. Thus, it requires a driver-specific user data support in the vIOMMU allocation flow. As a real-world use case, this series implements a HW QUEUE support in the tegra241-cmdqv driver for VCMDQs on NVIDIA Grace CPU. In another word, it is also the Tegra CMDQV series Part-2 (user-space support), reworked from Previous RFCv1: https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/ This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to the standard SMMUv3 operating in the nested translation mode trapping CMDQ for TLBI and ATC_INV commands, this gives a huge performance improvement: 70% to 90% reductions of invalidation time were measured by various DMA unmap tests running in a guest OS. // Unmap latencies from "dma_map_benchmark -g @granule -t @threads", // by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq" @granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0 4KB 1 35.7 us 5.3 us 16KB 1 41.8 us 6.8 us 64KB 1 68.9 us 9.9 us 128KB 1 109.0 us 12.6 us 256KB 1 187.1 us 18.0 us 4KB 2 96.9 us 6.8 us 16KB 2 97.8 us 7.5 us 64KB 2 151.5 us 10.7 us 128KB 2 257.8 us 12.7 us 256KB 2 443.0 us 17.9 us This is on Github: https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v8 Paring QEMU branch for testing: https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v8 Changelog (attached git-diff v7..v8 at the end of this letter) v8 * Add Reviewed-by from Pranj, Kevin and Jason * Improve kdoc and comments * [iommufd] Skip selftest for no_viommu variants * [iommufd] Add unmap coverage for non internal area * [iommufd] Skip the first page when mtree_alloc_range() * [iommufd] Correct the passed in index to mtree_erase() * [iommufd] Correct variable types in iommufd_hw_queue_alloc_phys() * [iommufd] Reject iopt_unmap_iova_range() if area->num_locked is set * [tegra] Rename "SID replacement" with "SID mapping" * [tegra] Unwrap useless _tegra241_vcmdq_hw_init helper v7 https://lore.kernel.org/all/cover.1750966133.git.nicolinc@nvidia.com/ * Rebased on Jason's for-next tree (iommufd_hw_queue-prep series) * Add Reviewed-by from Baolu, Jason, Pranjal * Update kdocs and notes * [iommu] Replace "u32" with "enum iommu_hw_info_type" * [iommufd] Rename vdev->id to vdev->virt_id * [iommufd] Replace macros with inline helpers * [iommufd] Report unmapped_bytes in error path * [iommufd] Add iommufd_access_is_internal helper * [iommufd] Do not drop ops->unmap check for mdevs * [iommufd] Store physical addresses in immap structure * [iommufd] Reorder access and hw_queue object allocations * [iommufd] Scan for an internal access before any unmap call * [iommufd] Drop unused ictx pointer in struct iommufd_hw_queue * [iommufd] Use kcalloc to avoid failure due to memory fragmentation * [tegra] Use "else" * [tegra] Lock destroy() using lvcmdq_mutex v6 https://lore.kernel.org/all/cover.1749884998.git.nicolinc@nvidia.com/ * Rebase on iommufd_hw_queue-prep-v2 * Add Reviewed-by from Kevin and Jason * [iommufd] Update kdocs and notes * [iommufd] Drop redundant pages[i] check * [iommufd] Allow nesting_parent_iova to be 0 * [iommufd] Add iommufd_hw_queue_alloc_phys() * [iommufd] Revise iommufd_viommu_alloc/destroy_mmap APIs * [iommufd] Move destroy ops to vdevice/hw_queue structures * [iommufd] Add union in hw_info struct to share out_data_type field * [iommufd] Replace iopt_pin/unpin_pages() with internal access APIs * [iommufd] Replace vdevice_alloc with vdevice_size and vdevice_init * [iommufd] Replace hw_queue_alloc with get_hw_queue_size/hw_queue_init * [iommufd] Replace IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA with init_phys * [smmu] Drop arm_smmu_domain_ipa_to_pa * [smmu] Update arm_smmu_impl_ops changes for vsmmu_init * [tegra] Add a vdev_to_vsid macro * [tegra] Add lvcmdq_mutex to protect multi queues * [tegra] Drop duplicated kcalloc for vintf->lvcmdqs (memory leak) v5 https://lore.kernel.org/all/cover.1747537752.git.nicolinc@nvidia.com/ * Rebase on v6.15-rc6 * Add Reviewed-by from Jason and Kevin * Correct typos in kdoc and update commit logs * [iommufd] Add a cosmetic fix * [iommufd] Drop unused num_pfns * [iommufd] Drop unnecessary check * [iommufd] Reorder patch sequence * [iommufd] Use io_remap_pfn_range() * [iommufd] Use success oriented flow * [iommufd] Fix max_npages calculation * [iommufd] Add more selftest coverage * [iommufd] Drop redundant static_assert * [iommufd] Fix mmap pfn range validation * [iommufd] Reject unmap on pinned iovas * [iommufd] Drop redundant vm_flags_set() * [iommufd] Drop iommufd_struct_destroy() * [iommufd] Drop redundant queue iova test * [iommufd] Use "mmio_addr" and "mmio_pfn" * [iommufd] Rename to "nesting_parent_iova" * [iommufd] Make iopt_pin_pages call option * [iommufd] Add ictx comparison in depend() * [iommufd] Add iommufd_object_alloc_ucmd() * [iommufd] Move kcalloc() after validations * [iommufd] Replace ictx setting with WARN_ON * [iommufd] Make hw_info's type bidirectional * [smmu] Add supported_vsmmu_type in impl_ops * [smmu] Drop impl report in smmu vendor struct * [tegra] Add IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV * [tegra] Replace "number of VINTFs" with a note * [tegra] Drop the redundant lvcmdq pointer setting * [tegra] Flag IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA * [tegra] Use "vintf_alloc_vsid" for vdevice_alloc op v4 https://lore.kernel.org/all/cover.1746757630.git.nicolinc@nvidia.com/ * Rebase on v6.15-rc5 * Add Reviewed-by from Vasant * Rename "vQUEUE" to "HW QUEUE" * Use "offset" and "length" for all mmap-related variables * [iommufd] Use u64 for guest PA * [iommufd] Fix typo in uAPI doc * [iommufd] Rename immap_id to offset * [iommufd] Drop the partial-size mmap support * [iommufd] Do not replace WARN_ON with WARN_ON_ONCE * [iommufd] Use "u64 base_addr" for queue base address * [iommufd] Use u64 base_pfn/num_pfns for immap structure * [iommufd] Correct the size passed in to mtree_alloc_range() * [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops v3 https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/ * Add Reviewed-by from Baolu, Pranjal, and Alok * Revise kdocs, uAPI docs, and commit logs * Rename "vCMDQ" back to "vQUEUE" for AMD cases * [tegra] Add tegra241_vcmdq_hw_flush_timeout() * [tegra] Rename vsmmu_alloc to alloc_vintf_user * [tegra] Use writel for SID replacement registers * [tegra] Move mmap removal call to vsmmu_destroy op * [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user() * [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED() * [iommufd] Add an object-type "owner" to immap structure * [iommufd] Drop the ictx input in the new for-driver APIs * [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle * [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers * [iommufd] Rename iommufd_ctx_alloc/free_mmap to _iommufd_alloc/destroy_mmap v2 https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/ * Add Reviewed-by from Jason * [smmu] Fix vsmmu initial value * [smmu] Support impl for hw_info * [tegra] Rename "slot" to "vsid" * [tegra] Update kdocs and commit logs * [tegra] Map/unmap LVCMDQ dynamically * [tegra] Refcount the previous LVCMDQ * [tegra] Return -EEXIST if LVCMDQ exists * [tegra] Simplify VINTF cleanup routine * [tegra] Use vmid and s2_domain in vsmmu * [tegra] Rename "mmap_pgoff" to "immap_id" * [tegra] Add more addr and length validation * [iommufd] Add more narrative to mmap's kdoc * [iommufd] Add iommufd_struct_depend/undepend() * [iommufd] Rename vcmdq_free op to vcmdq_destroy * [iommufd] Fix bug in iommu_copy_struct_to_user() * [iommufd] Drop is_io from iommufd_ctx_alloc_mmap() * [iommufd] Test the queue memory for its contiguity * [iommufd] Return -ENXIO if address or length fails * [iommufd] Do not change @min_last in mock_viommu_alloc() * [iommufd] Generalize TEGRA241_VCMDQ data in core structure * [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC * [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping v1 https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/ Thanks Nicolin Nicolin Chen (29): iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_range iommufd: Correct virt_id kdoc at struct iommu_vdevice_alloc iommufd/viommu: Explicitly define vdev->virt_id iommu: Use enum iommu_hw_info_type for type in hw_info op iommu: Add iommu_copy_struct_to_user helper iommu: Pass in a driver-level user data structure to viommu_init op iommufd/viommu: Allow driver-specific user data for a vIOMMU object iommufd/selftest: Support user_data in mock_viommu_alloc iommufd/selftest: Add coverage for viommu data iommufd/access: Add internal APIs for HW queue to use iommufd/access: Bypass access->ops->unmap for internal use iommufd/viommu: Add driver-defined vDEVICE support iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC iommufd: Add mmap interface iommufd/selftest: Add coverage for the new mmap interface Documentation: userspace-api: iommufd: Update HW QUEUE iommu: Allow an input type in hw_info op iommufd: Allow an input data_type via iommu_hw_info iommufd/selftest: Update hw_info coverage for an input data_type iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops iommu/tegra241-cmdqv: Use request_threaded_irq iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf() iommu/tegra241-cmdqv: Do not statically map LVCMDQs iommu/tegra241-cmdqv: Add user-space use support iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 22 +- drivers/iommu/iommufd/io_pagetable.h | 5 +- drivers/iommu/iommufd/iommufd_private.h | 46 +- drivers/iommu/iommufd/iommufd_test.h | 20 + include/linux/iommu.h | 50 +- include/linux/iommufd.h | 160 ++++++ include/uapi/linux/iommufd.h | 147 +++++- tools/testing/selftests/iommu/iommufd_utils.h | 89 +++- .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 28 +- .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 477 +++++++++++++++++- drivers/iommu/intel/iommu.c | 7 +- drivers/iommu/iommufd/device.c | 87 +++- drivers/iommu/iommufd/driver.c | 82 ++- drivers/iommu/iommufd/io_pagetable.c | 13 +- drivers/iommu/iommufd/main.c | 69 +++ drivers/iommu/iommufd/pages.c | 12 +- drivers/iommu/iommufd/selftest.c | 153 +++++- drivers/iommu/iommufd/viommu.c | 215 +++++++- tools/testing/selftests/iommu/iommufd.c | 140 ++++- .../selftests/iommu/iommufd_fail_nth.c | 15 +- Documentation/userspace-api/iommufd.rst | 12 + 21 files changed, 1739 insertions(+), 110 deletions(-) -- 2.43.0 diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c index d57a3bea948c..d5d43a1c7708 100644 --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c @@ -161,7 +161,7 @@ struct tegra241_vcmdq { * @lvcmdq_mutex: Lock to serialize user-allocated lvcmdqs * @base: MMIO base address * @mmap_offset: Offset argument for mmap() syscall - * @sids: Stream ID replacement resources + * @sids: Stream ID mapping resources */ struct tegra241_vintf { struct arm_vsmmu vsmmu; @@ -183,11 +183,11 @@ struct tegra241_vintf { #define viommu_to_vintf(v) container_of(v, struct tegra241_vintf, vsmmu.core) /** - * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement + * struct tegra241_vintf_sid - Virtual Interface Stream ID Mapping * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID * @vintf: Parent VINTF pointer * @sid: Physical Stream ID - * @idx: Replacement index in the VINTF + * @idx: Mapping index in the VINTF */ struct tegra241_vintf_sid { struct iommufd_vdevice core; @@ -207,7 +207,7 @@ struct tegra241_vintf_sid { * @num_vintfs: Total number of VINTFs * @num_vcmdqs: Total number of VCMDQs * @num_lvcmdqs_per_vintf: Number of logical VCMDQs per VINTF - * @num_sids_per_vintf: Total number of SID replacements per VINTF + * @num_sids_per_vintf: Total number of SID mappings per VINTF * @vintf_ids: VINTF id allocator * @vintfs: List of VINTFs */ @@ -470,12 +470,6 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq) dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h); } -/* This function is for LVCMDQ, so @vcmdq must be mapped prior */ -static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) -{ - writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE)); -} - /* This function is for LVCMDQ, so @vcmdq must be mapped prior */ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) { @@ -486,7 +480,7 @@ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq) tegra241_vcmdq_hw_deinit(vcmdq); /* Configure and enable VCMDQ */ - _tegra241_vcmdq_hw_init(vcmdq); + writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE)); ret = vcmdq_write_config(vcmdq, VCMDQ_EN); if (ret) { @@ -1077,7 +1071,7 @@ static int tegra241_vcmdq_hw_init_user(struct tegra241_vcmdq *vcmdq) char header[64]; /* Configure the vcmdq only; User space does the enabling */ - _tegra241_vcmdq_hw_init(vcmdq); + writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE)); dev_dbg(vcmdq->cmdqv->dev, "%sinited at host PA 0x%llx size 0x%lx\n", lvcmdq_error_header(vcmdq, header, 64), @@ -1259,6 +1253,7 @@ static int tegra241_vintf_init_vsid(struct iommufd_vdevice *vdev) static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = { .destroy = tegra241_cmdqv_destroy_vintf_user, .alloc_domain_nested = arm_vsmmu_alloc_domain_nested, + /* Non-accelerated commands will be still handled by the kernel */ .cache_invalidate = arm_vsmmu_cache_invalidate, .vdevice_size = VDEVICE_STRUCT_SIZE(struct tegra241_vintf_sid, core), .vdevice_init = tegra241_vintf_init_vsid, diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index cbd86aabdd1c..e2ba21c43ad2 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -1245,26 +1245,18 @@ EXPORT_SYMBOL_NS_GPL(iommufd_access_replace, "IOMMUFD"); * run in the future. Due to this a driver must not create locking that prevents * unmap to complete while iommufd_access_destroy() is running. */ -int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova, - unsigned long length) +void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova, + unsigned long length) { struct iommufd_ioas *ioas = container_of(iopt, struct iommufd_ioas, iopt); struct iommufd_access *access; unsigned long index; - int ret = 0; xa_lock(&ioas->iopt.access_list); - /* Bypass any unmap if there is an internal access */ xa_for_each(&ioas->iopt.access_list, index, access) { - if (iommufd_access_is_internal(access)) { - ret = -EBUSY; - goto unlock; - } - } - - xa_for_each(&ioas->iopt.access_list, index, access) { - if (!iommufd_lock_obj(&access->obj)) + if (!iommufd_lock_obj(&access->obj) || + iommufd_access_is_internal(access)) continue; xa_unlock(&ioas->iopt.access_list); @@ -1273,9 +1265,7 @@ int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova, iommufd_put_object(access->ictx, &access->obj); xa_lock(&ioas->iopt.access_list); } -unlock: xa_unlock(&ioas->iopt.access_list); - return ret; } /** @@ -1290,6 +1280,7 @@ int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova, void iommufd_access_unpin_pages(struct iommufd_access *access, unsigned long iova, unsigned long length) { + bool internal = iommufd_access_is_internal(access); struct iopt_area_contig_iter iter; struct io_pagetable *iopt; unsigned long last_iova; @@ -1316,7 +1307,8 @@ void iommufd_access_unpin_pages(struct iommufd_access *access, area, iopt_area_iova_to_index(area, iter.cur_iova), iopt_area_iova_to_index( area, - min(last_iova, iopt_area_last_iova(area)))); + min(last_iova, iopt_area_last_iova(area))), + internal); WARN_ON(!iopt_area_contig_done(&iter)); up_read(&iopt->iova_rwsem); mutex_unlock(&access->ioas_lock); @@ -1365,6 +1357,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova, unsigned long length, struct page **out_pages, unsigned int flags) { + bool internal = iommufd_access_is_internal(access); struct iopt_area_contig_iter iter; struct io_pagetable *iopt; unsigned long last_iova; @@ -1374,8 +1367,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova, /* Driver's ops don't support pin_pages */ if (IS_ENABLED(CONFIG_IOMMUFD_TEST) && WARN_ON(access->iova_alignment != PAGE_SIZE || - (!iommufd_access_is_internal(access) && - !access->ops->unmap))) + (!internal && !access->ops->unmap))) return -EINVAL; if (!length) @@ -1409,7 +1401,7 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova, } rc = iopt_area_add_access(area, index, last_index, out_pages, - flags); + flags, internal); if (rc) goto err_remove; out_pages += last_index - index + 1; @@ -1432,7 +1424,8 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova, iopt_area_iova_to_index(area, iter.cur_iova), iopt_area_iova_to_index( area, min(last_iova, - iopt_area_last_iova(area)))); + iopt_area_last_iova(area))), + internal); } up_read(&iopt->iova_rwsem); mutex_unlock(&access->ioas_lock); diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c index 2c9af93217f1..153e4720ee18 100644 --- a/drivers/iommu/iommufd/driver.c +++ b/drivers/iommu/iommufd/driver.c @@ -56,8 +56,9 @@ int _iommufd_alloc_mmap(struct iommufd_ctx *ictx, struct iommufd_object *owner, immap->length = length; immap->mmio_addr = mmio_addr; - rc = mtree_alloc_range(&ictx->mt_mmap, &startp, immap, immap->length, 0, - PHYS_ADDR_MAX, GFP_KERNEL); + /* Skip the first page to ease caller identifying the returned offset */ + rc = mtree_alloc_range(&ictx->mt_mmap, &startp, immap, immap->length, + PAGE_SIZE, PHYS_ADDR_MAX, GFP_KERNEL); if (rc < 0) { kfree(immap); return rc; @@ -76,7 +77,7 @@ void _iommufd_destroy_mmap(struct iommufd_ctx *ictx, { struct iommufd_mmap *immap; - immap = mtree_erase(&ictx->mt_mmap, offset >> PAGE_SHIFT); + immap = mtree_erase(&ictx->mt_mmap, offset); WARN_ON_ONCE(!immap || immap->owner != owner); kfree(immap); } diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c index 6b8477b1f94b..abf4aadca96c 100644 --- a/drivers/iommu/iommufd/io_pagetable.c +++ b/drivers/iommu/iommufd/io_pagetable.c @@ -719,6 +719,12 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start, goto out_unlock_iova; } + /* The area is locked by an object that has not been destroyed */ + if (area->num_locks) { + rc = -EBUSY; + goto out_unlock_iova; + } + if (area_first < start || area_last > last) { rc = -ENOENT; goto out_unlock_iova; @@ -740,15 +746,7 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start, up_write(&iopt->iova_rwsem); up_read(&iopt->domains_rwsem); - rc = iommufd_access_notify_unmap(iopt, area_first, - length); - if (rc) { - down_read(&iopt->domains_rwsem); - down_write(&iopt->iova_rwsem); - area->prevent_access = false; - goto out_unlock_iova; - } - + iommufd_access_notify_unmap(iopt, area_first, length); /* Something is not responding to unmap requests. */ tries++; if (WARN_ON(tries > 100)) { diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h index c115a51d9384..b6064f4ce4af 100644 --- a/drivers/iommu/iommufd/io_pagetable.h +++ b/drivers/iommu/iommufd/io_pagetable.h @@ -48,6 +48,7 @@ struct iopt_area { int iommu_prot; bool prevent_access : 1; unsigned int num_accesses; + unsigned int num_locks; }; struct iopt_allowed { @@ -238,9 +239,9 @@ void iopt_pages_unfill_xarray(struct iopt_pages *pages, unsigned long start, int iopt_area_add_access(struct iopt_area *area, unsigned long start, unsigned long last, struct page **out_pages, - unsigned int flags); + unsigned int flags, bool lock_area); void iopt_area_remove_access(struct iopt_area *area, unsigned long start, - unsigned long last); + unsigned long last, bool unlock_area); int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte, void *data, unsigned long length, unsigned int flags); diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index ebac6a4b3538..cd14163abdd1 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -125,8 +125,8 @@ int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt, int iopt_set_dirty_tracking(struct io_pagetable *iopt, struct iommu_domain *domain, bool enable); -int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova, - unsigned long length); +void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova, + unsigned long length); int iopt_table_add_domain(struct io_pagetable *iopt, struct iommu_domain *domain); void iopt_table_remove_domain(struct io_pagetable *iopt, diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c index cbdde642d2af..301c232462bd 100644 --- a/drivers/iommu/iommufd/pages.c +++ b/drivers/iommu/iommufd/pages.c @@ -2111,7 +2111,7 @@ iopt_pages_get_exact_access(struct iopt_pages *pages, unsigned long index, */ int iopt_area_add_access(struct iopt_area *area, unsigned long start_index, unsigned long last_index, struct page **out_pages, - unsigned int flags) + unsigned int flags, bool lock_area) { struct iopt_pages *pages = area->pages; struct iopt_pages_access *access; @@ -2124,6 +2124,8 @@ int iopt_area_add_access(struct iopt_area *area, unsigned long start_index, access = iopt_pages_get_exact_access(pages, start_index, last_index); if (access) { area->num_accesses++; + if (lock_area) + area->num_locks++; access->users++; iopt_pages_fill_from_xarray(pages, start_index, last_index, out_pages); @@ -2145,6 +2147,8 @@ int iopt_area_add_access(struct iopt_area *area, unsigned long start_index, access->node.last = last_index; access->users = 1; area->num_accesses++; + if (lock_area) + area->num_locks++; interval_tree_insert(&access->node, &pages->access_itree); mutex_unlock(&pages->mutex); return 0; @@ -2166,7 +2170,7 @@ int iopt_area_add_access(struct iopt_area *area, unsigned long start_index, * must stop using the PFNs before calling this. */ void iopt_area_remove_access(struct iopt_area *area, unsigned long start_index, - unsigned long last_index) + unsigned long last_index, bool unlock_area) { struct iopt_pages *pages = area->pages; struct iopt_pages_access *access; @@ -2177,6 +2181,10 @@ void iopt_area_remove_access(struct iopt_area *area, unsigned long start_index, goto out_unlock; WARN_ON(area->num_accesses == 0 || access->users == 0); + if (unlock_area) { + WARN_ON(area->num_locks == 0); + area->num_locks--; + } area->num_accesses--; access->users--; if (access->users) diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index ce509a827721..00641204efb2 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -241,13 +241,20 @@ iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd, { struct iommufd_access *access; struct page **pages; - int max_npages, i; + size_t max_npages; + size_t length; u64 offset; + size_t i; int rc; offset = cmd->nesting_parent_iova - PAGE_ALIGN(cmd->nesting_parent_iova); - max_npages = DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE); + /* DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE) */ + if (check_add_overflow(offset, cmd->length, &length)) + return ERR_PTR(-ERANGE); + if (check_add_overflow(length, PAGE_SIZE - 1, &length)) + return ERR_PTR(-ERANGE); + max_npages = length / PAGE_SIZE; /* * Use kvcalloc() to avoid memory fragmentation for a large page array. diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 7ab9e3e928b3..e3a0cd47384d 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -112,7 +112,7 @@ struct iommufd_vdevice { /* * Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID of - * AMD IOMMU, and vRID of a nested Intel VT-d to a Context Table + * AMD IOMMU, and vRID of Intel VT-d */ u64 virt_id; diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index a2840beefa8c..111ea81f91a2 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -997,7 +997,7 @@ struct iommu_fault_alloc { * @IOMMU_VIOMMU_TYPE_DEFAULT: Reserved for future use * @IOMMU_VIOMMU_TYPE_ARM_SMMUV3: ARM SMMUv3 driver specific type * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM - * SMMUv3) Virtual Interface (VINTF) + * SMMUv3) enabled ARM SMMUv3 type */ enum iommu_viommu_type { IOMMU_VIOMMU_TYPE_DEFAULT = 0, @@ -1065,7 +1065,7 @@ struct iommu_viommu_alloc { * @dev_id: The physical device to allocate a virtual instance on the vIOMMU * @out_vdevice_id: Object handle for the vDevice. Pass to IOMMU_DESTORY * @virt_id: Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID - * of AMD IOMMU, and vRID of a nested Intel VT-d to a Context Table + * of AMD IOMMU, and vRID of Intel VT-d * * Allocate a virtual device instance (for a physical device) against a vIOMMU. * This instance holds the device's information (related to its vIOMMU) in a VM. diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index fda93a195e26..9d5b852d5e19 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -2817,33 +2817,31 @@ TEST_F(iommufd_viommu, viommu_alloc_with_data) }; uint32_t *test; - if (self->device_id) { - test_cmd_viommu_alloc(self->device_id, self->hwpt_id, - IOMMU_VIOMMU_TYPE_SELFTEST, &data, - sizeof(data), &self->viommu_id); - ASSERT_EQ(data.out_data, data.in_data); - - /* Negative mmap tests -- offset and length cannot be changed */ - test_err_mmap(ENXIO, data.out_mmap_length, - data.out_mmap_offset + PAGE_SIZE); - test_err_mmap(ENXIO, data.out_mmap_length, - data.out_mmap_offset + PAGE_SIZE * 2); - test_err_mmap(ENXIO, data.out_mmap_length / 2, - data.out_mmap_offset); - test_err_mmap(ENXIO, data.out_mmap_length * 2, - data.out_mmap_offset); - - /* Now do a correct mmap for a loopback test */ - test = mmap(NULL, data.out_mmap_length, PROT_READ | PROT_WRITE, - MAP_SHARED, self->fd, data.out_mmap_offset); - ASSERT_NE(MAP_FAILED, test); - ASSERT_EQ(data.in_data, *test); - - /* The owner of the mmap region should be blocked */ - EXPECT_ERRNO(EBUSY, - _test_ioctl_destroy(self->fd, self->viommu_id)); - munmap(test, data.out_mmap_length); - } + if (!self->device_id) + SKIP(return, "Skipping test for variant no_viommu"); + + test_cmd_viommu_alloc(self->device_id, self->hwpt_id, + IOMMU_VIOMMU_TYPE_SELFTEST, &data, sizeof(data), + &self->viommu_id); + ASSERT_EQ(data.out_data, data.in_data); + + /* Negative mmap tests -- offset and length cannot be changed */ + test_err_mmap(ENXIO, data.out_mmap_length, + data.out_mmap_offset + PAGE_SIZE); + test_err_mmap(ENXIO, data.out_mmap_length, + data.out_mmap_offset + PAGE_SIZE * 2); + test_err_mmap(ENXIO, data.out_mmap_length / 2, data.out_mmap_offset); + test_err_mmap(ENXIO, data.out_mmap_length * 2, data.out_mmap_offset); + + /* Now do a correct mmap for a loopback test */ + test = mmap(NULL, data.out_mmap_length, PROT_READ | PROT_WRITE, + MAP_SHARED, self->fd, data.out_mmap_offset); + ASSERT_NE(MAP_FAILED, test); + ASSERT_EQ(data.in_data, *test); + + /* The owner of the mmap region should be blocked */ + EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, self->viommu_id)); + munmap(test, data.out_mmap_length); } TEST_F(iommufd_viommu, vdevice_alloc) @@ -3071,61 +3069,60 @@ TEST_F(iommufd_viommu, vdevice_cache) TEST_F(iommufd_viommu, hw_queue) { + __u64 iova = MOCK_APERTURE_START, iova2; uint32_t viommu_id = self->viommu_id; - __u64 iova = MOCK_APERTURE_START; uint32_t hw_queue_id[2]; - if (viommu_id) { - /* Fail IOMMU_HW_QUEUE_TYPE_DEFAULT */ - test_err_hw_queue_alloc(EOPNOTSUPP, viommu_id, - IOMMU_HW_QUEUE_TYPE_DEFAULT, 0, iova, - PAGE_SIZE, &hw_queue_id[0]); - /* Fail queue addr and length */ - test_err_hw_queue_alloc(EINVAL, viommu_id, - IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova, - 0, &hw_queue_id[0]); - test_err_hw_queue_alloc(EOVERFLOW, viommu_id, - IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, - ~(uint64_t)0, PAGE_SIZE, - &hw_queue_id[0]); - /* Fail missing iova */ - test_err_hw_queue_alloc(ENOENT, viommu_id, - IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova, - PAGE_SIZE, &hw_queue_id[0]); - - /* Map iova */ - test_ioctl_ioas_map(buffer, PAGE_SIZE, &iova); - - /* Fail index=1 and =MAX; must start from index=0 */ - test_err_hw_queue_alloc(EIO, viommu_id, - IOMMU_HW_QUEUE_TYPE_SELFTEST, 1, iova, - PAGE_SIZE, &hw_queue_id[0]); - test_err_hw_queue_alloc(EINVAL, viommu_id, - IOMMU_HW_QUEUE_TYPE_SELFTEST, - IOMMU_TEST_HW_QUEUE_MAX, iova, - PAGE_SIZE, &hw_queue_id[0]); - - /* Allocate index=0, declare ownership of the iova */ - test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, - 0, iova, PAGE_SIZE, &hw_queue_id[0]); - /* Fail duplicate */ - test_err_hw_queue_alloc(EEXIST, viommu_id, - IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova, - PAGE_SIZE, &hw_queue_id[0]); - /* Fail unmap, due to iova ownership */ - test_err_ioctl_ioas_unmap(EBUSY, iova, PAGE_SIZE); - - /* Allocate index=1 */ - test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, - 1, iova, PAGE_SIZE, &hw_queue_id[1]); - /* Fail to destroy, due to dependency */ - EXPECT_ERRNO(EBUSY, - _test_ioctl_destroy(self->fd, hw_queue_id[0])); - - /* Destroy in descending order */ - test_ioctl_destroy(hw_queue_id[1]); - test_ioctl_destroy(hw_queue_id[0]); - } + if (!viommu_id) + SKIP(return, "Skipping test for variant no_viommu"); + + /* Fail IOMMU_HW_QUEUE_TYPE_DEFAULT */ + test_err_hw_queue_alloc(EOPNOTSUPP, viommu_id, + IOMMU_HW_QUEUE_TYPE_DEFAULT, 0, iova, PAGE_SIZE, + &hw_queue_id[0]); + /* Fail queue addr and length */ + test_err_hw_queue_alloc(EINVAL, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, + 0, iova, 0, &hw_queue_id[0]); + test_err_hw_queue_alloc(EOVERFLOW, viommu_id, + IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, ~(uint64_t)0, + PAGE_SIZE, &hw_queue_id[0]); + /* Fail missing iova */ + test_err_hw_queue_alloc(ENOENT, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, + 0, iova, PAGE_SIZE, &hw_queue_id[0]); + + /* Map iova */ + test_ioctl_ioas_map(buffer, PAGE_SIZE, &iova); + test_ioctl_ioas_map(buffer + PAGE_SIZE, PAGE_SIZE, &iova2); + + /* Fail index=1 and =MAX; must start from index=0 */ + test_err_hw_queue_alloc(EIO, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 1, + iova, PAGE_SIZE, &hw_queue_id[0]); + test_err_hw_queue_alloc(EINVAL, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, + IOMMU_TEST_HW_QUEUE_MAX, iova, PAGE_SIZE, + &hw_queue_id[0]); + + /* Allocate index=0, declare ownership of the iova */ + test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, + iova, PAGE_SIZE, &hw_queue_id[0]); + /* Fail duplicate */ + test_err_hw_queue_alloc(EEXIST, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, + 0, iova, PAGE_SIZE, &hw_queue_id[0]); + /* Fail unmap, due to iova ownership */ + test_err_ioctl_ioas_unmap(EBUSY, iova, PAGE_SIZE); + /* The 2nd page is not pinned, so it can be unmmap */ + test_ioctl_ioas_unmap(iova + PAGE_SIZE, PAGE_SIZE); + + /* Allocate index=1 */ + test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 1, + iova, PAGE_SIZE, &hw_queue_id[1]); + /* Fail to destroy, due to dependency */ + EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, hw_queue_id[0])); + + /* Destroy in descending order */ + test_ioctl_destroy(hw_queue_id[1]); + test_ioctl_destroy(hw_queue_id[0]); + /* Now it can unmap the first page */ + test_ioctl_ioas_unmap(iova, PAGE_SIZE); } FIXTURE(iommufd_device_pasid)

4 days, 13 hours

3
40
0 0

[PATCH 0/4] selftests/mm/uffd: refactor global variables

by Ujwal Kundur

This patchset refactors non-composite global variables into a common struct that can be initialized and passed around per-test instead of relying on the presence of global variables. This allows: - Better encapsulation - Debugging becomes easier -- local variable state can be viewed per stack frame, and we can more easily reason about the variable mutations Patch 1 needs to be applied first and can be followed by any of the other patches. I've ensured that the tests are passing locally (or atleast have the same output as the code on master). Ujwal Kundur (4): selftests/mm/uffd: Refactor non-composite global vars into struct selftests/mm/uffd: Swap global vars with global test options selftests/mm/uffd: Swap global variables with global test opts selftests/mm/uffd: Swap global variables with global test opts tools/testing/selftests/mm/uffd-common.c | 269 +++++----- tools/testing/selftests/mm/uffd-common.h | 78 +-- tools/testing/selftests/mm/uffd-stress.c | 226 ++++---- tools/testing/selftests/mm/uffd-unit-tests.c | 523 ++++++++++--------- tools/testing/selftests/mm/uffd-wp-mremap.c | 23 +- 5 files changed, 591 insertions(+), 528 deletions(-) -- 2.20.1

4 days, 13 hours

4
34
0 0

[PATCH v2 00/14] stackleak: Support Clang stack depth tracking

by Kees Cook

v2: - rename stackleak to kstack_erase (mingo) - address __init vs inline with KCOV changes v1: https://lore.kernel.org/lkml/20250507180852.work.231-kees@kernel.org/ RFC: https://lore.kernel.org/lkml/20250502185834.work.560-kees@kernel.org/ Hi, As part of looking at what GCC plugins could be replaced with Clang implementations, this series uses the recently landed stack depth tracking callback in Clang[1] to implement the stackleak feature. Since the Clang feature is now landed, I'm moving this out of RFC to a v1. Since this touches a lot of arch-specific Makefiles, I tried to trim the CC list down to just mailing lists in those cases, otherwise the CC was giant. Thanks! -Kees [1] https://clang.llvm.org/docs/SanitizerCoverage.html#tracing-stack-depth Kees Cook (14): stackleak: Rename STACKLEAK to KSTACK_ERASE stackleak: Rename stackleak_track_stack to __sanitizer_cov_stack_depth stackleak: Split KSTACK_ERASE_CFLAGS from GCC_PLUGINS_CFLAGS x86: Handle KCOV __init vs inline mismatches arm: Handle KCOV __init vs inline mismatches arm64: Handle KCOV __init vs inline mismatches s390: Handle KCOV __init vs inline mismatches powerpc: Handle KCOV __init vs inline mismatches mips: Handle KCOV __init vs inline mismatches loongarch: Handle KCOV __init vs inline mismatches init.h: Disable sanitizer coverage for __init and __head kstack_erase: Support Clang stack depth tracking configs/hardening: Enable CONFIG_KSTACK_ERASE configs/hardening: Enable CONFIG_INIT_ON_FREE_DEFAULT_ON arch/Kconfig | 4 +- arch/arm/Kconfig | 2 +- arch/arm64/Kconfig | 2 +- arch/riscv/Kconfig | 2 +- arch/s390/Kconfig | 2 +- arch/x86/Kconfig | 2 +- security/Kconfig.hardening | 45 +++++++++------- Makefile | 1 + arch/arm/boot/compressed/Makefile | 2 +- arch/arm/vdso/Makefile | 2 +- arch/arm64/kernel/pi/Makefile | 2 +- arch/arm64/kernel/vdso/Makefile | 3 +- arch/arm64/kvm/hyp/nvhe/Makefile | 2 +- arch/riscv/kernel/pi/Makefile | 2 +- arch/riscv/purgatory/Makefile | 2 +- arch/sparc/vdso/Makefile | 3 +- arch/x86/entry/vdso/Makefile | 3 +- arch/x86/purgatory/Makefile | 2 +- drivers/firmware/efi/libstub/Makefile | 8 +-- drivers/misc/lkdtm/Makefile | 2 +- kernel/Makefile | 10 ++-- lib/Makefile | 2 +- scripts/Makefile.gcc-plugins | 16 +----- scripts/Makefile.kstack_erase | 21 ++++++++ scripts/gcc-plugins/stackleak_plugin.c | 52 +++++++++---------- Documentation/admin-guide/sysctl/kernel.rst | 4 +- Documentation/arch/x86/x86_64/mm.rst | 2 +- Documentation/security/self-protection.rst | 2 +- .../zh_CN/security/self-protection.rst | 2 +- arch/arm64/include/asm/acpi.h | 2 +- arch/loongarch/include/asm/smp.h | 2 +- arch/mips/include/asm/time.h | 2 +- arch/s390/hypfs/hypfs.h | 2 +- arch/s390/hypfs/hypfs_diag.h | 2 +- arch/x86/entry/calling.h | 4 +- arch/x86/include/asm/acpi.h | 4 +- arch/x86/include/asm/init.h | 2 +- arch/x86/include/asm/realmode.h | 2 +- include/linux/acpi.h | 4 +- include/linux/bootconfig.h | 2 +- include/linux/efi.h | 2 +- include/linux/init.h | 4 +- include/linux/{stackleak.h => kstack_erase.h} | 20 +++---- include/linux/memblock.h | 2 +- include/linux/mfd/dbx500-prcmu.h | 2 +- include/linux/sched.h | 4 +- arch/arm/kernel/entry-common.S | 2 +- arch/arm64/kernel/entry.S | 2 +- arch/riscv/kernel/entry.S | 2 +- arch/s390/kernel/entry.S | 2 +- arch/arm/mm/cache-feroceon-l2.c | 2 +- arch/arm/mm/cache-tauros2.c | 2 +- arch/loongarch/kernel/time.c | 2 +- arch/loongarch/mm/ioremap.c | 4 +- arch/powerpc/mm/book3s64/hash_utils.c | 2 +- arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +- arch/s390/mm/init.c | 2 +- arch/x86/kernel/kvm.c | 2 +- drivers/clocksource/timer-orion.c | 2 +- .../lkdtm/{stackleak.c => kstack_erase.c} | 26 +++++----- drivers/platform/x86/thinkpad_acpi.c | 4 +- drivers/soc/ti/pm33xx.c | 2 +- fs/proc/base.c | 6 +-- kernel/fork.c | 2 +- kernel/{stackleak.c => kstack_erase.c} | 22 ++++---- tools/objtool/check.c | 4 +- tools/testing/selftests/lkdtm/config | 2 +- MAINTAINERS | 6 ++- kernel/configs/hardening.config | 6 +++ 69 files changed, 203 insertions(+), 171 deletions(-) create mode 100644 scripts/Makefile.kstack_erase rename include/linux/{stackleak.h => kstack_erase.h} (81%) rename drivers/misc/lkdtm/{stackleak.c => kstack_erase.c} (89%) rename kernel/{stackleak.c => kstack_erase.c} (87%) -- 2.34.1

4 days, 16 hours

8
28
0 0

[PATCH] selftests/filesystems: Use return value of the function 'chdir()'

by Alessandro Zanni

Fix to use the return value of the function 'chdir("/")' and check if the return is either 0 (ok) or 1 (not ok, so the test stops). The patch fies the solves the following errors: mount-notify_test.c:468:17: warning: ignoring return value of ‘chdir’ declared with attribute ‘warn_unused_result’ [-Wunused-result] 468 | chdir("/"); mount-notify_test_ns.c:489:17: warning: ignoring return value of ‘chdir’ declared with attribute ‘warn_unused_result’ [-Wunused- result] 489 | chdir("/"); To reproduce the issue, use the command: make kselftest TARGET=filesystems/statmount Signed-off-by: Alessandro Zanni <alessandro.zanni87(a)gmail.com> --- .../selftests/filesystems/mount-notify/mount-notify_test.c | 2 +- .../selftests/filesystems/mount-notify/mount-notify_test_ns.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c index 5a3b0ace1a88..a7f899599d52 100644 --- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c +++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c @@ -458,7 +458,7 @@ TEST_F(fanotify, rmdir) ASSERT_GE(ret, 0); if (ret == 0) { - chdir("/"); + ASSERT_EQ(0, chdir("/")); unshare(CLONE_NEWNS); mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL); umount2("/a", MNT_DETACH); diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c index d91946e69591..dc9eb3087a1a 100644 --- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c +++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c @@ -486,7 +486,7 @@ TEST_F(fanotify, rmdir) ASSERT_GE(ret, 0); if (ret == 0) { - chdir("/"); + ASSERT_EQ(0, chdir("/")); unshare(CLONE_NEWNS); mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL); umount2("/a", MNT_DETACH); -- 2.43.0

4 days, 18 hours

1
0
0 0

[PATCH v21 net-next 0/6] DUALPI2 patch

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com> Hello, Please find the DualPI2 patch v21. This patch serise adds DualPI Improved with a Square (DualPI2) with following features: * Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague) * Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332 * Configurable overload strategies * Use of sojourn time to reliably estimate queue delay * Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332). Best regards, Chia-Yu --- v21 (02-Jul-2025) - Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>) - Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>) - Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>) - Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>) - Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>) - Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub Kicinski <kuba(a)kernel.org>) - Remove redundant name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>) - Fix some typos and format issues of dualpi2 attributes v20 (21-Jun-2025) - Add one more commit to fix warning and style check on tdc.sh reported by shellcheck - Remove double-prefixed of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>) v19 (14-Jun-2025) - Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>) v18 (13-Jun-2025) - Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute - Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>) v17 (25-May-2025, Resent at 11-Jun-2025) - Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>) - Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>) - Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>) - Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>) - Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>) - Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>) - Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>) v16 (16-MAy-2025) - Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>) - Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>) - Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>) - Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>) v15 (09-May-2025) - Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>) - Fix one typo in comment of #2 - Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h v14 (05-May-2025) - Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun - Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>) - Update dualpi2.json to align with the updated variable order in pkt_sched.h - Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>) v13 (26-Apr-2025) - Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>) - Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>) v12 (22-Apr-2025) - Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>) - Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>) - Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>) - Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>) v11 (15-Apr-2025) - Replace hstimer_init with hstimer_setup in sch_dualpi2.c v10 (25-Mar-2025) - Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>) - Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>) v9 (16-Mar-2025) - Fix mem_usage error in previous version - Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking. In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets. This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0. Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 11.55 11.70 ms 350 TCP upload avg : 18.96 N/A Mbits/s 350 TCP upload sum : 18.96 N/A Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 10.81 10.70 ms 350 TCP upload avg : 18.91 N/A Mbits/s 350 TCP upload sum : 18.91 N/A Mbits/s 350 Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 12.61 12.80 ms 350 TCP upload avg : 9.48 N/A Mbits/s 350 TCP upload sum : 9.48 N/A Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 11.06 10.80 ms 350 TCP upload avg : 9.43 N/A Mbits/s 350 TCP upload sum : 9.43 N/A Mbits/s 350 Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 40.86 37.45 ms 350 TCP upload avg : 0.88 N/A Mbits/s 350 TCP upload sum : 0.88 N/A Mbits/s 350 TCP upload::1 : 0.88 0.97 Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 11.07 10.40 ms 350 TCP upload avg : 0.55 N/A Mbits/s 350 TCP upload sum : 0.55 N/A Mbits/s 350 TCP upload::1 : 0.55 0.59 Mbits/s 350 v8 (11-Mar-2025) - Fix warning messages in v7 v7 (07-Mar-2025) - Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>) v6 (04-Mar-2025) - Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>) - Update test cases in dualpi2.json - Update commit message v5 (22-Feb-2025) - A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL: Unshaped 1gigE with 4 download streams test: - Summary of tcp_4down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 1.19 1.34 ms 349 TCP download avg : 235.42 N/A Mbits/s 349 TCP download sum : 941.68 N/A Mbits/s 349 TCP download::1 : 235.19 235.39 Mbits/s 349 TCP download::2 : 235.03 235.35 Mbits/s 349 TCP download::3 : 236.89 235.44 Mbits/s 349 TCP download::4 : 234.57 235.19 Mbits/s 349 - Summary of tcp_4down run 'MQ + FQ_PIE' avg median # data pts Ping (ms) ICMP : 1.21 1.37 ms 350 TCP download avg : 235.42 N/A Mbits/s 350 TCP download sum : 941.61 N/A Mbits/s 350 TCP download::1 : 232.54 233.13 Mbits/s 350 TCP download::2 : 232.52 232.80 Mbits/s 350 TCP download::3 : 233.14 233.78 Mbits/s 350 TCP download::4 : 243.41 241.48 Mbits/s 350 - Summary of tcp_4down run 'MQ + DUALPI2' avg median # data pts Ping (ms) ICMP : 1.19 1.34 ms 349 TCP download avg : 235.42 N/A Mbits/s 349 TCP download sum : 941.68 N/A Mbits/s 349 TCP download::1 : 235.19 235.39 Mbits/s 349 TCP download::2 : 235.03 235.35 Mbits/s 349 TCP download::3 : 236.89 235.44 Mbits/s 349 TCP download::4 : 234.57 235.19 Mbits/s 349 Unshaped 1gigE with 128 download streams test: - Summary of tcp_128down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 Unshaped 10gigE with 4 download streams test: - Summary of tcp_4down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 0.22 0.23 ms 350 TCP download avg : 2354.08 N/A Mbits/s 350 TCP download sum : 9416.31 N/A Mbits/s 350 TCP download::1 : 2353.65 2352.81 Mbits/s 350 TCP download::2 : 2354.54 2354.21 Mbits/s 350 TCP download::3 : 2353.56 2353.78 Mbits/s 350 TCP download::4 : 2354.56 2354.45 Mbits/s 350 - Summary of tcp_4down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 0.20 0.19 ms 350 TCP download avg : 2354.76 N/A Mbits/s 350 TCP download sum : 9419.04 N/A Mbits/s 350 TCP download::1 : 2354.77 2353.89 Mbits/s 350 TCP download::2 : 2353.41 2354.29 Mbits/s 350 TCP download::3 : 2356.18 2354.19 Mbits/s 350 TCP download::4 : 2354.68 2353.15 Mbits/s 350 - Summary of tcp_4down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 0.24 0.24 ms 350 TCP download avg : 2354.11 N/A Mbits/s 350 TCP download sum : 9416.43 N/A Mbits/s 350 TCP download::1 : 2354.75 2353.93 Mbits/s 350 TCP download::2 : 2353.15 2353.75 Mbits/s 350 TCP download::3 : 2353.49 2353.72 Mbits/s 350 TCP download::4 : 2355.04 2353.73 Mbits/s 350 Unshaped 10gigE with 128 download streams test: - Summary of tcp_128down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 7.57 8.69 ms 350 TCP download avg : 73.97 N/A Mbits/s 350 TCP download sum : 9467.82 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 7.82 8.91 ms 350 TCP download avg : 73.97 N/A Mbits/s 350 TCP download sum : 9468.42 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 6.87 7.93 ms 350 TCP download avg : 73.95 N/A Mbits/s 350 TCP download sum : 9465.87 N/A Mbits/s 350 From the results shown above, we see small differences between combinations. - Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>) - Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>) - Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>) - Update license identifier (Dave Taht <dave.taht(a)gmail.com>) - Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>) - Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>) - Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c - Fix step_thresh in packets - Update code comments in sch_dualpi2.c v4 (22-Oct-2024) - Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>) - Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>) - Fix line length warning. v3 (19-Oct-2024) - Fix compilaiton error - Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>) v2 (18-Oct-2024) - Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>) - Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Fix line length warning --- Chia-Yu Chang (5): sched: Struct definition and parsing of dualpi2 qdisc sched: Dump configuration and statistics of dualpi2 qdisc selftests/tc-testing: Fix warning and style check on tdc.sh selftests/tc-testing: Add selftests for qdisc DualPI2 Documentation: netlink: specs: tc: Add DualPI2 specification Koen De Schepper (1): sched: Add enqueue/dequeue of dualpi2 qdisc Documentation/netlink/specs/tc.yaml | 151 ++- include/net/dropreason-core.h | 6 + include/uapi/linux/pkt_sched.h | 68 + net/sched/Kconfig | 12 + net/sched/Makefile | 1 + net/sched/sch_dualpi2.c | 1171 +++++++++++++++++ tools/testing/selftests/tc-testing/config | 1 + .../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++ tools/testing/selftests/tc-testing/tdc.sh | 6 +- 9 files changed, 1665 insertions(+), 5 deletions(-) create mode 100644 net/sched/sch_dualpi2.c create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json -- 2.34.1

4 days, 21 hours

4
11
0 0

[PATCH net] selftests: drv-net: rss_ctx: Add short delay between per-context traffic checks

by Nimrod Oren

A few packets may still be sent and received during the termination of the iperf processes. These late packets cause failures when they arrive on queues expected to be empty. Add a one second delay between repeated _send_traffic_check() calls in rss_ctx tests to ensure such packets are processed before the next traffic checks are performed. Example failure observed: Check failed 2 != 0 traffic on inactive queues (context 1): [0, 0, 1, 1, 386385, 397196, 0, 0, 0, 0, ...] Check failed 4 != 0 traffic on inactive queues (context 2): [0, 0, 0, 0, 2, 2, 247152, 253013, 0, 0, ...] Check failed 2 != 0 traffic on inactive queues (context 3): [0, 0, 0, 0, 0, 0, 1, 1, 282434, 283070, ...] Note: While the `noise` parameter could be used to tolerate these late packets, it would be inappropriate here. `noise` tolerates far more traffic than acceptable in this case, risking false positives. Inactive queues are supposed to see zero traffic. Fixes: 847aa551fa78 ("selftests: drv-net: rss_ctx: factor out send traffic and check") Reviewed-by: Gal Pressman <gal(a)nvidia.com> Reviewed-by: Carolina Jubran <cjubran(a)nvidia.com> Signed-off-by: Nimrod Oren <noren(a)nvidia.com> --- tools/testing/selftests/drivers/net/hw/rss_ctx.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/tools/testing/selftests/drivers/net/hw/rss_ctx.py b/tools/testing/selftests/drivers/net/hw/rss_ctx.py index 7bb552f8b182..19be69227693 100755 --- a/tools/testing/selftests/drivers/net/hw/rss_ctx.py +++ b/tools/testing/selftests/drivers/net/hw/rss_ctx.py @@ -4,6 +4,7 @@ import datetime import random import re +import time from lib.py import ksft_run, ksft_pr, ksft_exit from lib.py import ksft_eq, ksft_ne, ksft_ge, ksft_in, ksft_lt, ksft_true, ksft_raises from lib.py import NetDrvEpEnv @@ -492,6 +493,7 @@ def test_rss_context(cfg, ctx_cnt=1, create_with_cfg=None): { 'target': (2+i*2, 3+i*2), 'noise': (0, 1), 'empty': list(range(2, 2+i*2)) + list(range(4+i*2, 2+2*ctx_cnt)) }) + time.sleep(1) if requested_ctx_cnt != ctx_cnt: raise KsftSkipEx(f"Tested only {ctx_cnt} contexts, wanted {requested_ctx_cnt}") @@ -559,6 +561,7 @@ def test_rss_context_out_of_order(cfg, ctx_cnt=4): } _send_traffic_check(cfg, ports[i], f"context {i}", expected) + time.sleep(1) # Use queues 0 and 1 for normal traffic ethtool(f"-X {cfg.ifname} equal 2") -- 2.37.1

4 days, 22 hours

2
3
0 0

[PATCH v3 00/15] Consolidate iommu page table implementations (AMD)

by Jason Gunthorpe

[All the precursor patches are merged now and AMD/RISCV/VTD conversions are written] Currently each of the iommu page table formats duplicates all of the logic to maintain the page table and perform map/unmap/etc operations. There are several different versions of the algorithms between all the different formats. The io-pgtable system provides an interface to help isolate the page table code from the iommu driver, but doesn't provide tools to implement the common algorithms. This makes it very hard to improve the state of the pagetable code under the iommu domains as any proposed improvement needs to alter a large number of different driver code paths. Combined with a lack of software based testing this makes improvement in this area very hard. iommufd wants several new page table operations: - More efficient map/unmap operations, using iommufd's batching logic - unmap that returns the physical addresses into a batch as it progresses - cut that allows splitting areas so large pages can have holes poked in them dynamically (ie guestmemfd hitless shared/private transitions) - More agressive freeing of table memory to avoid waste - Fragmenting large pages so that dirty tracking can be more granular - Reassembling large pages so that VMs can run at full IO performance in migration/dirty tracking error flows - KHO integration for kernel live upgrade Together these are algorithmically complex enough to be a very significant task to go and implement in all the page table formats we support. Just the "server" focused drivers use almost all the formats (ARMv8 S1&S2 / x86 PAE / AMDv1 / VT-D SS / RISCV) Instead of doing the duplicated work, this series takes the first step to consolidate the algorithms into one places. In spirit it is similar to the work Christoph did a few years back to pull the redundant get_user_pages() implementations out of the arch code into core MM. This unlocked a great deal of improvement in that space in the following years. I would like to see the same benefit in iommu as well. My first RFC showed a bigger picture with all most all formats and more algorithms. This series reorganizes that to be narrowly focused on just enough to convert the AMD driver to use the new mechanism. kunit tests are provided that allow good testing of the algorithms and all formats on x86, nothing is arch specific. AMD is one of the simpler options as the HW is quite uniform with few different options/bugs while still requiring the complicated contiguous pages support. The HW also has a very simple range based invalidation approach that is easy to implement. The AMD v1 and AMD v2 page table formats are implemented bit for bit identical to the current code, tested using a compare kunit test that checks against the io-pgtable version (on github, see below). Updating the AMD driver to replace the io-pgtable layer with the new stuff is fairly straightforward now. The layering is fixed up in the new version so that all the invalidation goes through function pointers. Several small fixing patches have come out of this as I've been fixing the problems that the test suite uncovers in the current code, and implementing the fixed version in iommupt. On performance, there is a quite wide variety of implementation designs across all the drivers. Looking at some key performance across the main formats: iommu_map(): pgsz ,avg new,old ns, min new,old ns , min % (+ve is better) 2^12, 53,66 , 51,63 , 19.19 (AMDV1) 256*2^12, 386,1909 , 367,1795 , 79.79 256*2^21, 362,1633 , 355,1556 , 77.77 2^12, 56,62 , 52,59 , 11.11 (AMDv2) 256*2^12, 405,1355 , 357,1292 , 72.72 256*2^21, 393,1160 , 358,1114 , 67.67 2^12, 55,65 , 53,62 , 14.14 (VTD second stage) 256*2^12, 391,518 , 332,512 , 35.35 256*2^21, 383,635 , 336,624 , 46.46 2^12, 57,65 , 55,63 , 12.12 (ARM 64 bit) 256*2^12, 380,389 , 361,369 , 2.02 256*2^21, 358,419 , 345,400 , 13.13 iommu_unmap(): pgsz ,avg new,old ns, min new,old ns , min % (+ve is better) 2^12, 69,88 , 65,85 , 23.23 (AMDv1) 256*2^12, 353,6498 , 331,6029 , 94.94 256*2^21, 373,6014 , 360,5706 , 93.93 2^12, 71,72 , 66,69 , 4.04 (AMDv2) 256*2^12, 228,891 , 206,871 , 76.76 256*2^21, 254,721 , 245,711 , 65.65 2^12, 69,87 , 65,82 , 20.20 (VTD second stage) 256*2^12, 210,321 , 200,315 , 36.36 256*2^21, 255,349 , 238,342 , 30.30 2^12, 72,77 , 68,74 , 8.08 (ARM 64 bit) 256*2^12, 521,357 , 447,346 , -29.29 256*2^21, 489,358 , 433,345 , -25.25 * Above numbers include additional patches to remove the iommu_pgsize() overheads. gcc 13.3.0, i7-12700 This version provides fairly consistent performance across formats. ARM unmap performance is quite different because this version supports contiguous pages and uses a very different algorithm for unmapping. Though why it is so worse compared to AMDv1 I haven't figured out yet. The per-format commits include a more detailed chart. There is a second branch: https://github.com/jgunthorpe/linux/commits/iommu_pt_all Containing supporting work and future steps: - ARM short descriptor (32 bit), ARM long descriptor (64 bit) formats - RISCV format and RISCV conversion https://github.com/jgunthorpe/linux/commits/iommu_pt_riscv - Support for a DMA incoherent HW page table walker - VT-D second stage format and VT-D conversion https://github.com/jgunthorpe/linux/commits/iommu_pt_vtd - DART v1 & v2 format - Draft of a iommufd 'cut' operation to break down huge pages - A compare test that checks the iommupt formats against the iopgtable interface, including updating AMD to have a working iopgtable and patches to make VT-D have an iopgtable for testing. - A performance test to micro-benchmark map and unmap against iogptable My strategy is to go one by one for the drivers: - AMD driver conversion - RISCV page table and driver - Intel VT-D driver and VTDSS page table - Flushing improvements for RISCV - ARM SMMUv3 And concurrently work on the algorithm side: - debugfs content dump, like VT-D has - Cut support - Increase/Decrease page size support - map/unmap batching - KHO As we make more algorithm improvements the value to convert the drivers increases. This is on github: https://github.com/jgunthorpe/linux/commits/iommu_pt v2: - Rebase on v6.16-rc2 - s/PT_ENTRY_WORD_SIZE/PT_ITEM_WORD_SIZE/s to follow the language better - Comment and documentation updates - Add PT_TOP_PHYS_MASK to help manage alignment restrictions on the top pointer - Add missed force_aperture = true - Make pt_iommu_deinit() take care of the not-yet-inited error case internally as AMD/RISCV/VTD all shared this logic - Change gather_range() into gather_range_pages() so it also deals with the page list. This makes the following cache flushing series simpler - Fix missed update of unmap->unmapped in some error cases - Change clear_contig() to order the gather more logically - Remove goto from the error handling in __map_range_leaf() - s/log2_/oalog2_/ in places where the argument is an oaddr_t - Pass the pts to pt_table_install64/32() - Do not use SIGN_EXTEND for the AMDv2 page table because of Vasant's information on how PASID 0 works. v1: https://patch.msgid.link/r/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com - AMD driver only, many code changes RFC: https://lore.kernel.org/all/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/ Alejandro Jimenez (1): iommu/amd: Use the generic iommu page table Jason Gunthorpe (14): genpt: Generic Page Table base API genpt: Add Documentation/ files iommupt: Add the basic structure of the iommu implementation iommupt: Add the AMD IOMMU v1 page table format iommupt: Add iova_to_phys op iommupt: Add unmap_pages op iommupt: Add map_pages op iommupt: Add read_and_clear_dirty op iommupt: Add a kunit test for Generic Page Table iommupt: Add a mock pagetable format for iommufd selftest to use iommufd: Change the selftest to use iommupt instead of xarray iommupt: Add the x86 64 bit page table format iommu/amd: Remove AMD io_pgtable support iommupt: Add a kunit test for the IOMMU implementation .clang-format | 1 + Documentation/driver-api/generic_pt.rst | 140 ++ Documentation/driver-api/index.rst | 1 + drivers/iommu/Kconfig | 2 + drivers/iommu/Makefile | 1 + drivers/iommu/amd/Kconfig | 5 +- drivers/iommu/amd/Makefile | 2 +- drivers/iommu/amd/amd_iommu.h | 1 - drivers/iommu/amd/amd_iommu_types.h | 109 +- drivers/iommu/amd/io_pgtable.c | 560 -------- drivers/iommu/amd/io_pgtable_v2.c | 370 ------ drivers/iommu/amd/iommu.c | 516 ++++---- drivers/iommu/generic_pt/.kunitconfig | 13 + drivers/iommu/generic_pt/Kconfig | 72 ++ drivers/iommu/generic_pt/fmt/Makefile | 26 + drivers/iommu/generic_pt/fmt/amdv1.h | 409 ++++++ drivers/iommu/generic_pt/fmt/defs_amdv1.h | 21 + drivers/iommu/generic_pt/fmt/defs_x86_64.h | 21 + drivers/iommu/generic_pt/fmt/iommu_amdv1.c | 15 + drivers/iommu/generic_pt/fmt/iommu_mock.c | 10 + drivers/iommu/generic_pt/fmt/iommu_template.h | 48 + drivers/iommu/generic_pt/fmt/iommu_x86_64.c | 11 + drivers/iommu/generic_pt/fmt/x86_64.h | 248 ++++ drivers/iommu/generic_pt/iommu_pt.h | 1150 +++++++++++++++++ drivers/iommu/generic_pt/kunit_generic_pt.h | 717 ++++++++++ drivers/iommu/generic_pt/kunit_iommu.h | 183 +++ drivers/iommu/generic_pt/kunit_iommu_pt.h | 451 +++++++ drivers/iommu/generic_pt/pt_common.h | 354 +++++ drivers/iommu/generic_pt/pt_defs.h | 323 +++++ drivers/iommu/generic_pt/pt_fmt_defaults.h | 193 +++ drivers/iommu/generic_pt/pt_iter.h | 640 +++++++++ drivers/iommu/generic_pt/pt_log2.h | 130 ++ drivers/iommu/io-pgtable.c | 4 - drivers/iommu/iommufd/Kconfig | 1 + drivers/iommu/iommufd/iommufd_test.h | 11 +- drivers/iommu/iommufd/selftest.c | 439 +++---- include/linux/generic_pt/common.h | 166 +++ include/linux/generic_pt/iommu.h | 270 ++++ include/linux/io-pgtable.h | 2 - tools/testing/selftests/iommu/iommufd.c | 60 +- tools/testing/selftests/iommu/iommufd_utils.h | 12 + 41 files changed, 6119 insertions(+), 1589 deletions(-) create mode 100644 Documentation/driver-api/generic_pt.rst delete mode 100644 drivers/iommu/amd/io_pgtable.c delete mode 100644 drivers/iommu/amd/io_pgtable_v2.c create mode 100644 drivers/iommu/generic_pt/.kunitconfig create mode 100644 drivers/iommu/generic_pt/Kconfig create mode 100644 drivers/iommu/generic_pt/fmt/Makefile create mode 100644 drivers/iommu/generic_pt/fmt/amdv1.h create mode 100644 drivers/iommu/generic_pt/fmt/defs_amdv1.h create mode 100644 drivers/iommu/generic_pt/fmt/defs_x86_64.h create mode 100644 drivers/iommu/generic_pt/fmt/iommu_amdv1.c create mode 100644 drivers/iommu/generic_pt/fmt/iommu_mock.c create mode 100644 drivers/iommu/generic_pt/fmt/iommu_template.h create mode 100644 drivers/iommu/generic_pt/fmt/iommu_x86_64.c create mode 100644 drivers/iommu/generic_pt/fmt/x86_64.h create mode 100644 drivers/iommu/generic_pt/iommu_pt.h create mode 100644 drivers/iommu/generic_pt/kunit_generic_pt.h create mode 100644 drivers/iommu/generic_pt/kunit_iommu.h create mode 100644 drivers/iommu/generic_pt/kunit_iommu_pt.h create mode 100644 drivers/iommu/generic_pt/pt_common.h create mode 100644 drivers/iommu/generic_pt/pt_defs.h create mode 100644 drivers/iommu/generic_pt/pt_fmt_defaults.h create mode 100644 drivers/iommu/generic_pt/pt_iter.h create mode 100644 drivers/iommu/generic_pt/pt_log2.h create mode 100644 include/linux/generic_pt/common.h create mode 100644 include/linux/generic_pt/iommu.h base-commit: cd76b0248a38645a3e3f8ca4a48bffc591e9da19 -- 2.43.0

4 days, 23 hours

4
22
0 0

Re: [PATCH v2 1/2] selftests: cachestat: add tests for mmap

by Joshua Hahn

On Tue, 8 Jul 2025 23:13:01 +0530 Suresh Chandrappa <suresh.k.chandrappa(a)gmail.com> wrote: > Hi Joshua, > > Thanks for the feedback! In the first patch, both shmem and mmap operations > are present, but I hadn’t introduced any logic to distinguish between them > yet. That distinction is added in the second patch through a new API. Hi Suresh, Yes, this makes sense to me. I think what I was getting at was that we could still make a conditional statement like if (type == FILE_SHMEM) ksft_print_msg("Unable to create shmem file.\n")' else if (type == FILE_MMAP) ksft_print_msg("Unable to create mmap file.\n"); (or use a switch statement) ... And just refactor it in patch 2, as opposed to changing the behavior. But this is mostly a nit. If you are planning to merge both patches in one patch in the next version, then all of these comments shouldn't matter : -) Looking forward to the next version, have a great day! Joshua Sent using hkml (https://github.com/sjp38/hackermail)

5 days, 1 hour

1
0
0 0

[kvm-unit-tests PATCH v3 1/2] lib: Add STR_IS_Y and STR_IS_N for checking env vars

by Jesse Taube

the line: (s && (*s == '1' || *s == 'y' || *s == 'Y')) is used in a few places add a macro for it and its 'n' counterpart. Add a copy of Linux's IS_ENABLED macro to be used in GET_ENV_OR_CONFIG. Add GET_ENV_OR_CONFIG for CONFIG values which can be overridden by the environment. Signed-off-by: Jesse Taube <jesse(a)rivosinc.com> --- V1 -> V2: - New commit V2 -> V3: - Add IS_ENABLED so CONFIG_##name can be undefined - Change GET_ENV_OR_CONFIG to GET_CONFIG_OR_ENV - Fix it's to its --- lib/argv.h | 38 ++++++++++++++++++++++++++++++++++++++ lib/errata.h | 7 ++++--- riscv/sbi-tests.h | 3 ++- 3 files changed, 44 insertions(+), 4 deletions(-) diff --git a/lib/argv.h b/lib/argv.h index 0fa77725..111af078 100644 --- a/lib/argv.h +++ b/lib/argv.h @@ -14,4 +14,42 @@ extern void setup_args_progname(const char *args); extern void setup_env(char *env, int size); extern void add_setup_arg(const char *arg); +/* + * Helper macros to use CONFIG_ options in C/CPP expressions. Note that + * these only work with boolean and tristate options. + */ + +/* + * Getting something that works in C and CPP for an arg that may or may + * not be defined is tricky. Here, if we have "#define CONFIG_BOOGER 1" + * we match on the placeholder define, insert the "0," for arg1 and generate + * the triplet (0, 1, 0). Then the last step cherry picks the 2nd arg (a one). + * When CONFIG_BOOGER is not defined, we generate a (... 1, 0) pair, and when + * the last step cherry picks the 2nd arg, we get a zero. + */ +#define __ARG_PLACEHOLDER_1 0, +#define __take_second_arg(__ignored, val, ...) val +#define __is_defined(x) ___is_defined(x) +#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val) +#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0) + +/* + * IS_ENABLED(CONFIG_FOO) evaluates to 1 if CONFIG_FOO is set to '1', 0 + * otherwise. + */ +#define IS_ENABLED(option) __is_defined(option) + +#define STR_IS_Y(s) (s && (*s == '1' || *s == 'y' || *s == 'Y')) +#define STR_IS_N(s) (s && (*s == '0' || *s == 'n' || *s == 'N')) + +/* + * Get the boolean value of CONFIG_{name} + * which can be overridden by the {name} + * variable in the environment if present. + */ +#define GET_ENV_OR_CONFIG(name) ({ \ + const char *genv_s = getenv(#name); \ + genv_s ? STR_IS_Y(genv_s) : IS_ENABLED(CONFIG_##name); \ +}) + #endif diff --git a/lib/errata.h b/lib/errata.h index de8205d8..78007243 100644 --- a/lib/errata.h +++ b/lib/errata.h @@ -7,6 +7,7 @@ #ifndef _ERRATA_H_ #define _ERRATA_H_ #include <libcflat.h> +#include <argv.h> #include "config.h" @@ -28,7 +29,7 @@ static inline bool errata_force(void) return true; s = getenv("ERRATA_FORCE"); - return s && (*s == '1' || *s == 'y' || *s == 'Y'); + return STR_IS_Y(s); } static inline bool errata(const char *erratum) @@ -40,7 +41,7 @@ static inline bool errata(const char *erratum) s = getenv(erratum); - return s && (*s == '1' || *s == 'y' || *s == 'Y'); + return STR_IS_Y(s); } static inline bool errata_relaxed(const char *erratum) @@ -52,7 +53,7 @@ static inline bool errata_relaxed(const char *erratum) s = getenv(erratum); - return !(s && (*s == '0' || *s == 'n' || *s == 'N')); + return !STR_IS_N(s); } #endif diff --git a/riscv/sbi-tests.h b/riscv/sbi-tests.h index c1ebf016..4e051dca 100644 --- a/riscv/sbi-tests.h +++ b/riscv/sbi-tests.h @@ -37,6 +37,7 @@ #ifndef __ASSEMBLER__ #include <libcflat.h> +#include <argv.h> #include <asm/sbi.h> #define __sbiret_report(kfail, ret, expected_error, expected_value, \ @@ -94,7 +95,7 @@ static inline bool env_enabled(const char *env) { char *s = getenv(env); - return s && (*s == '1' || *s == 'y' || *s == 'Y'); + return STR_IS_Y(s); } void split_phys_addr(phys_addr_t paddr, unsigned long *hi, unsigned long *lo); -- 2.43.0

5 days, 1 hour

2
2
0 0

[PATCH] selftests/filesystems: Multiple declaration of __kernel_fsid_t

by Alessandro Zanni

Fix to remove a multiple declaration of the struct __kernel_fsid_t, which is already declared in posix_types.h, to solve the following errors: - mount-notify_test.c:22:3: error: conflicting types for ‘__kernel_fsid_t’; have ‘struct <anonymous>’ 22 | } __kernel_fsid_t; - usr/include/asm-generic/posix_types.h:81:3: note: previous declaration of ‘__kernel_fsid_t’ with type ‘__kernel_fsid_t’ 81 | } __kernel_fsid_t; To reproduce the issue, use the command: make kselftest TARGET=filesystems/statmount Signed-off-by: Alessandro Zanni <alessandro.zanni87(a)gmail.com> --- .../selftests/filesystems/mount-notify/mount-notify_test.c | 7 ------- .../filesystems/mount-notify/mount-notify_test_ns.c | 7 ------- 2 files changed, 14 deletions(-) diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c index 63ce708d93ed..5a3b0ace1a88 100644 --- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c +++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c @@ -15,13 +15,6 @@ #include "../statmount/statmount.h" #include "../utils.h" -// Needed for linux/fanotify.h -#ifndef __kernel_fsid_t -typedef struct { - int val[2]; -} __kernel_fsid_t; -#endif - #include <sys/fanotify.h> static const char root_mntpoint_templ[] = "/tmp/mount-notify_test_root.XXXXXX"; diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c index 090a5ca65004..d91946e69591 100644 --- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c +++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c @@ -16,13 +16,6 @@ #include "../statmount/statmount.h" #include "../utils.h" -// Needed for linux/fanotify.h -#ifndef __kernel_fsid_t -typedef struct { - int val[2]; -} __kernel_fsid_t; -#endif - #include <sys/fanotify.h> static const char root_mntpoint_templ[] = "/tmp/mount-notify_test_root.XXXXXX"; -- 2.43.0

5 days, 2 hours

1
0
0 0

[PATCH v4 00/38] Mediated vPMU 4.0 for x86

by Mingwei Zhang

With joint effort from the upstream KVM community, we come up with the 4th version of mediated vPMU for x86. We have made the following changes on top of the previous RFC v3. v3 -> v4 - Rebase whole patchset on 6.14-rc3 base. - Address Peter's comments on Perf part. - Address Sean's comments on KVM part. * Change key word "passthrough" to "mediated" in all patches * Change static enabling to user space dynamic enabling via KVM_CAP_PMU_CAPABILITY. * Only support GLOBAL_CTRL save/restore with VMCS exec_ctrl, drop the MSR save/retore list support for GLOBAL_CTRL, thus the support of mediated vPMU is constrained to SapphireRapids and later CPUs on Intel side. * Merge some small changes into a single patch. - Address Sandipan's comment on invalid pmu pointer. - Add back "eventsel_hw" and "fixed_ctr_ctrl_hw" to avoid to directly manipulate pmc->eventsel and pmu->fixed_ctr_ctrl. Testing (Intel side): - Perf-based legacy vPMU (force emulation on/off) * Kselftests pmu_counters_test, pmu_event_filter_test and vmx_pmu_caps_test pass. * KUT PMU tests pmu, pmu_lbr, pmu_pebs pass. * Basic perf counting/sampling tests in 3 scenarios, guest-only, host-only and host-guest coexistence all pass. - Mediated vPMU (force emulation on/off) * Kselftests pmu_counters_test, pmu_event_filter_test and vmx_pmu_caps_test pass. * KUT PMU tests pmu, pmu_lbr, pmu_pebs pass. * Basic perf counting/sampling tests in 3 scenarios, guest-only, host-only and host-guest coexistence all pass. - Failures. All above tests passed on Intel Granite Rapids as well except a failure on KUT/pmu_pebs. * GP counter 0 (0xfffffffffffe): PEBS record (written seq 0) is verified (including size, counters and cfg). * The pebs_data_cfg (0xb500000000) doesn't match with the effective MSR_PEBS_DATA_CFG (0x0). * This failure has nothing to do with this mediated vPMU patch set. The failure is caused by Granite Rapids supported timed PEBS which needs extra support on Qemu and KUT/pmu_pebs. These extra support would be sent in separate patches later. Testing (AMD side): - Kselftests pmu_counters_test, pmu_event_filter_test and vmx_pmu_caps_test all pass - legacy guest with KUT/pmu: * qmeu option: -cpu host, -perfctr-core * when set force_emulation_prefix=1, passes * when set force_emulation_prefix=0, passes - perfmon-v1 guest with KUT/pmu: * qmeu option: -cpu host, -perfmon-v2 * when set force_emulation_prefix=1, passes * when set force_emulation_prefix=0, passes - perfmon-v2 guest with KUT/pmu: * qmeu option: -cpu host * when set force_emulation_prefix=1, passes * when set force_emulation_prefix=0, passes - perf_fuzzer (perfmon-v2): * fails with soft lockup in guest in current version. * culprit could be between 6.13 ~ 6.14-rc3 within KVM * Series tested on 6.12 and 6.13 without issue. Note: a QEMU series is needed to run mediated vPMU v4: - https://lore.kernel.org/all/20250324123712.34096-1-dapeng1.mi@linux.intel.c… History: - RFC v3: https://lore.kernel.org/all/20240801045907.4010984-1-mizhang@google.com/ - RFC v2: https://lore.kernel.org/all/20240506053020.3911940-1-mizhang@google.com/ - RFC v1: https://lore.kernel.org/all/20240126085444.324918-1-xiong.y.zhang@linux.int… Dapeng Mi (18): KVM: x86/pmu: Introduce enable_mediated_pmu global parameter KVM: x86/pmu: Check PMU cpuid configuration from user space KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{} KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header KVM: VMX: Add macros to wrap around {secondary,tertiary}_exec_controls_changebit() KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with vm_exit/entry_ctrl KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers KVM: x86/pmu: Setup PMU MSRs' interception mode KVM: x86/pmu: Handle PMU MSRs interception and event filtering KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry KVM: x86/pmu: Handle emulated instruction for mediated vPMU KVM: nVMX: Add macros to simplify nested MSR interception setting KVM: selftests: Add mediated vPMU supported for pmu tests KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test KVM: Selftests: Fix pmu_counters_test error for mediated vPMU KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space Kan Liang (8): perf: Support get/put mediated PMU interfaces perf: Skip pmu_ctx based on event_type perf: Clean up perf ctx time perf: Add a EVENT_GUEST flag perf: Add generic exclude_guest support perf: Add switch_guest_ctx() interface perf/x86: Support switch_guest_ctx interface perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU Mingwei Zhang (5): perf/x86: Forbid PMI handler when guest own PMU perf/x86/core: Plumb mediated PMU capability from x86_pmu to x86_pmu_cap KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() KVM: x86/pmu: introduce eventsel_hw to prepare for pmu event filtering KVM: nVMX: Add nested virtualization support for mediated PMU Sandipan Das (4): perf/x86/core: Do not set bit width for unavailable counters KVM: x86/pmu: Add AMD PMU registers to direct access list KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host Xiong Zhang (3): x86/irq: Factor out common code for installing kvm irq handler perf: core/x86: Register a new vector for KVM GUEST PMI KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler arch/x86/events/amd/core.c | 2 + arch/x86/events/core.c | 40 +- arch/x86/events/intel/core.c | 5 + arch/x86/include/asm/hardirq.h | 1 + arch/x86/include/asm/idtentry.h | 1 + arch/x86/include/asm/irq.h | 2 +- arch/x86/include/asm/irq_vectors.h | 5 +- arch/x86/include/asm/kvm-x86-pmu-ops.h | 2 + arch/x86/include/asm/kvm_host.h | 10 + arch/x86/include/asm/msr-index.h | 18 +- arch/x86/include/asm/perf_event.h | 1 + arch/x86/include/asm/vmx.h | 1 + arch/x86/kernel/idt.c | 1 + arch/x86/kernel/irq.c | 39 +- arch/x86/kvm/cpuid.c | 15 + arch/x86/kvm/pmu.c | 254 ++++++++- arch/x86/kvm/pmu.h | 45 ++ arch/x86/kvm/svm/pmu.c | 148 ++++- arch/x86/kvm/svm/svm.c | 26 + arch/x86/kvm/svm/svm.h | 2 +- arch/x86/kvm/vmx/capabilities.h | 11 +- arch/x86/kvm/vmx/nested.c | 68 ++- arch/x86/kvm/vmx/pmu_intel.c | 224 ++++++-- arch/x86/kvm/vmx/vmx.c | 89 +-- arch/x86/kvm/vmx/vmx.h | 11 +- arch/x86/kvm/x86.c | 63 ++- arch/x86/kvm/x86.h | 2 + include/linux/perf_event.h | 47 +- kernel/events/core.c | 519 ++++++++++++++---- .../beauty/arch/x86/include/asm/irq_vectors.h | 5 +- .../selftests/kvm/include/kvm_test_harness.h | 13 + .../testing/selftests/kvm/include/kvm_util.h | 3 + .../selftests/kvm/include/x86/processor.h | 8 + tools/testing/selftests/kvm/lib/kvm_util.c | 23 + .../selftests/kvm/x86/pmu_counters_test.c | 24 +- .../selftests/kvm/x86/pmu_event_filter_test.c | 8 +- .../selftests/kvm/x86/vmx_pmu_caps_test.c | 2 +- 37 files changed, 1480 insertions(+), 258 deletions(-) base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319 -- 2.49.0.395.g12beb8f557-goog

5 days, 2 hours

8
120
0 0

[PATCH net-next v2 0/3] netdevsim: add support for PHY devices

by Maxime Chevallier

Hi everyone, Here's a V2 for the netdevsim PHY support, including a bugfix for NETDEVSIM=m as well as a round of shellcheck cleanups for ethtool-phy.sh. The idea of this series is to allow attaching virtual PHY devices to netdevsim, so that we can test PHY-related ethtool commands. This can be extended in the future for phylib testing as well. V1: https://lore.kernel.org/netdev/20250702082806.706973-1-maxime.chevallier@bo… Maxime Chevallier (3): net: netdevsim: Add PHY support in netdevsim selftests: ethtool: Drop the unused old_netdevs variable selftests: ethtool: Introduce ethernet PHY selftests on netdevsim drivers/net/netdevsim/Makefile | 4 + drivers/net/netdevsim/dev.c | 2 + drivers/net/netdevsim/netdev.c | 8 + drivers/net/netdevsim/netdevsim.h | 25 ++ drivers/net/netdevsim/phy.c | 398 ++++++++++++++++++ .../selftests/drivers/net/netdevsim/config | 1 + .../drivers/net/netdevsim/ethtool-common.sh | 19 +- .../drivers/net/netdevsim/ethtool-phy.sh | 64 +++ 8 files changed, 518 insertions(+), 3 deletions(-) create mode 100644 drivers/net/netdevsim/phy.c create mode 100755 tools/testing/selftests/drivers/net/netdevsim/ethtool-phy.sh -- 2.49.0

5 days, 7 hours

2
6
0 0

[PATCH v19 00/10] PCI: EP: Add RC-to-EP doorbell with platform MSI controller

by Frank Li

┌────────────┐ ┌───────────────────────────────────┐ ┌────────────────┐ │ │ │ │ │ │ │ │ │ PCI Endpoint │ │ PCI Host │ │ │ │ │ │ │ │ │◄──┤ 1.platform_msi_domain_alloc_irqs()│ │ │ │ │ │ │ │ │ │ MSI ├──►│ 2.write_msi_msg() ├──►├─BAR<n> │ │ Controller │ │ update doorbell register address│ │ │ │ │ │ for BAR │ │ │ │ │ │ │ │ 3. Write BAR<n>│ │ │◄──┼───────────────────────────────────┼───┤ │ │ │ │ │ │ │ │ ├──►│ 4.Irq Handle │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ └────────────┘ └───────────────────────────────────┘ └────────────────┘ This patches based on old https://lore.kernel.org/imx/20221124055036.1630573-1-Frank.Li@nxp.com/ Original patch only target to vntb driver. But actually it is common method. This patches add new API to pci-epf-core, so any EP driver can use it. Previous v2 discussion here. https://lore.kernel.org/imx/20230911220920.1817033-1-Frank.Li@nxp.com/ Changes in v19: - irq part already in v6.16-rc1, only missed pcie/dts part - rebase to v6.16-rc1 - update commit message for patch IMMUTABLE check. - Link to v18: https://lore.kernel.org/r/20250414-ep-msi-v18-0-f69b49917464@nxp.com Changes in v18: - pci-ep.yaml: sort property order, fix maxvalue to 0x7ffff for msi-map-mask and iommu-map-mask - Link to v17: https://lore.kernel.org/r/20250407-ep-msi-v17-0-633ab45a31d0@nxp.com Changes in v17: - move document part to pci-ep.yaml - Link to v16: https://lore.kernel.org/r/20250404-ep-msi-v16-0-d4919d68c0d0@nxp.com Changes in v16: - remove arm64: dts: imx95-19x19-evk: Add PCIe1 endpoint function overlay file because there are better patches, which under review. - Add document for pcie-ep msi-map usage - other change to see each patch's change log About IMMUTABLE (No change for this part, tglx provide feedback) > - This IMMUTABLE thing serves no purpose, because you don't randomly > plug this end-point block on any MSI controller. They come as part > of an SoC. "Yes and no. The problem is that the EP implementation is meant to be a generic library and while GIC-ITS guarantees immutability of the address/data pair after setup, there are architectures (x86, loongson, riscv) where the base MSI controller does not and immutability is only achieved when interrupt remapping is enabled. The latter can be disabled at boot-time and then the EP implementation becomes a lottery across affinity changes. That was my concern about this library implementation and that's why I asked for a mechanism to ensure that the underlying irqdomain provides a immutable address/data pair. So it does not matter for GIC-ITS, but in the larger picture it matters. Thanks, tglx " So it does not matter for GIC-ITS, but in the larger picture it matters. - Link to v15: https://lore.kernel.org/r/20250211-ep-msi-v15-0-bcacc1f2b1a9@nxp.com Changes in v15: - rebase to v6.14-rc1 - fix build issue find by kernel test robot - Link to v14: https://lore.kernel.org/r/20250207-ep-msi-v14-0-9671b136f2b8@nxp.com Changes in v14: Marc Zyngier raised concerns about adding DOMAIN_BUS_DEVICE_PCI_EP_MSI. As a result, the approach has been reverted to the v9 method. However, there are several improvements: MSI now supports msi-map in addition to msi-parent. - The struct device: id is used as the endpoint function (EPF) device identity to map to the stream ID (sideband information). - The EPC device tree source (DTS) utilizes msi-map to provide such information. - The EPF device's of_node is set to the EPC controller’s node. This approach is commonly used for multi-function device (MFD) platform child devices, allowing them to inherit properties from the MFD device’s DTS, such as reset-cells and gpio-cells. This method is well-suited for the current case, as the EPF is inherently created/binded to the EPC and should inherit the EPC’s DTS node properties. Additionally: Since the basic IMX95 LUT support has already been merged into the mainline, a DTS and driver increment patch is added to complete the solution. The patch is rebased onto the latest linux-next tree and aligned with the new pcitest framework. - Link to v13: https://lore.kernel.org/r/20241218-ep-msi-v13-0-646e2192dc24@nxp.com Changes in v13: - Change to use DOMAIN_BUS_PCI_DEVICE_EP_MSI - Change request id as func | vfunc << 3 - Remove IRQ_DOMAIN_MSI_IMMUTABLE Thomas Gleixner: I hope capture all your points in review comments. If missed, let me know. - Link to v12: https://lore.kernel.org/r/20241211-ep-msi-v12-0-33d4532fa520@nxp.com Changes in v12: - Change to use IRQ_DOMAIN_MSI_IMMUTABLE and add help function irq_domain_msi_is_immuatble(). - split PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check to 3 patches - Link to v11: https://lore.kernel.org/r/20241209-ep-msi-v11-0-7434fa8397bd@nxp.com Changes in v11: - Change to use MSI_FLAG_MSG_IMMUTABLE - Link to v10: https://lore.kernel.org/r/20241204-ep-msi-v10-0-87c378dbcd6d@nxp.com Changes in v10: Thomas Gleixner: There are big change in pci-ep-msi.c. I am sure if go on the corrent path. The key improvement is remove only 1 function devices's limitation. I use new patch for imutable check, which relative additional feature compared to base enablement patch. - Remove patch Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all() - Add new patch irqchip/gic-v3-its: Avoid overwriting msi_prepare callback if provided by msi_domain_info - Remove only support 1 endpoint function limiation. - Create one MSI domain for each endpoint function devices. - Use "msi-map" in pci ep controler node, instead of of msi-parent. first argument is (func_no << 8 | vfunc_no) - Link to v9: https://lore.kernel.org/r/20241203-ep-msi-v9-0-a60dbc3f15dd@nxp.com Changes in v9 - Add patch platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all() - Remove patch PCI: endpoint: Add pci_epc_get_fn() API for customizable filtering - Remove API pci_epf_align_inbound_addr_lo_hi - Move doorbell_alloc in to doorbell_enable function. - Link to v8: https://lore.kernel.org/r/20241116-ep-msi-v8-0-6f1f68ffd1bb@nxp.com Changes in v8: - update helper function name to pci_epf_align_inbound_addr() - Link to v7: https://lore.kernel.org/r/20241114-ep-msi-v7-0-d4ac7aafbd2c@nxp.com Changes in v7: - Add helper function pci_epf_align_addr(); - Link to v6: https://lore.kernel.org/r/20241112-ep-msi-v6-0-45f9722e3c2a@nxp.com Changes in v6: - change doorbell_addr to doorbell_offset - use round_down() - add Niklas's test by tag - rebase to pci/endpoint - Link to v5: https://lore.kernel.org/r/20241108-ep-msi-v5-0-a14951c0d007@nxp.com Changes in v5: - Move request_irq to epf test function driver for more flexiable user case - Add fixed size bar handler - Some minor improvememtn to see each patches's changelog. - Link to v4: https://lore.kernel.org/r/20241031-ep-msi-v4-0-717da2d99b28@nxp.com Changes in v4: - Remove patch genirq/msi: Add cleanup guard define for msi_lock_descs()/msi_unlock_descs() - Use new method to avoid compatible problem. Add new command DOORBELL_ENABLE and DOORBELL_DISABLE. pcitest -B send DOORBELL_ENABLE first, EP test function driver try to remap one of BAR_N (except test register bar) to ITS MSI MMIO space. Old driver don't support new command, so failure return, not side effect. After test, DOORBELL_DISABLE command send out to recover original map, so pcitest bar test can pass as normal. - Other detail change see each patches's change log - Link to v3: https://lore.kernel.org/r/20241015-ep-msi-v3-0-cedc89a16c1a@nxp.com Change from v2 to v3 - Fixed manivannan's comments - Move common part to pci-ep-msi.c and pci-ep-msi.h - rebase to 6.12-rc1 - use RevID to distingiush old version mkdir /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 echo 16 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/msi_interrupts echo 0x080c > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/deviceid echo 0x1957 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/vendorid echo 1 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/revid ^^^^^^ to enable platform msi support. ln -s /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 /sys/kernel/config/pci_ep/controllers/4c380000.pcie-ep - use new device ID, which identify support doorbell to avoid broken compatility. Enable doorbell support only for PCI_DEVICE_ID_IMX8_DB, while other devices keep the same behavior as before. EP side RC with old driver RC with new driver PCI_DEVICE_ID_IMX8_DB no probe doorbell enabled Other device ID doorbell disabled* doorbell disabled* * Behavior remains unchanged. Change from v1 to v2 - Add missed patch for endpont/pci-epf-test.c - Move alloc and free to epc driver from epf. - Provide general help function for EPC driver to alloc platform msi irq. - Fixed manivannan's comments. Signed-off-by: Frank Li <Frank.Li(a)nxp.com> --- Frank Li (10): PCI: endpoint: Set ID and of_node for function driver PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check PCI: endpoint: Add pci_epf_align_inbound_addr() helper for address alignment PCI: endpoint: pci-epf-test: Add doorbell test support misc: pci_endpoint_test: Add doorbell test case selftests: pci_endpoint: Add doorbell test case pci: imx6: Add helper function imx_pcie_add_lut_by_rid() pci: imx6: Add LUT setting for MSI/IOMMU in Endpoint mode arm64: dts: imx95: Add msi-map for pci-ep device arch/arm64/boot/dts/freescale/imx95.dtsi | 1 + drivers/misc/pci_endpoint_test.c | 82 ++++++++++++ drivers/pci/controller/dwc/pci-imx6.c | 25 ++-- drivers/pci/endpoint/Makefile | 1 + drivers/pci/endpoint/functions/pci-epf-test.c | 142 +++++++++++++++++++++ drivers/pci/endpoint/pci-ep-msi.c | 90 +++++++++++++ drivers/pci/endpoint/pci-epf-core.c | 48 +++++++ include/linux/pci-ep-msi.h | 28 ++++ include/linux/pci-epf.h | 21 +++ include/uapi/linux/pcitest.h | 1 + .../selftests/pci_endpoint/pci_endpoint_test.c | 28 ++++ 11 files changed, 459 insertions(+), 8 deletions(-) --- base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494 change-id: 20241010-ep-msi-8b4cab33b1be Best regards, --- Frank Li <Frank.Li(a)nxp.com>

5 days, 8 hours

5
36
0 0

[kvm-unit-tests PATCH v2 1/2] lib: Add STR_IS_Y and STR_IS_N for checking env vars

by Jesse Taube

the line: (s && (*s == '1' || *s == 'y' || *s == 'Y')) is used in a few places add a macro for it and it's 'n' counterpart. Add GET_CONFIG_OR_ENV for CONFIG values which can be overridden by the environment. Signed-off-by: Jesse Taube <jesse(a)rivosinc.com> --- lib/argv.h | 13 +++++++++++++ lib/errata.h | 7 ++++--- riscv/sbi-tests.h | 3 ++- 3 files changed, 19 insertions(+), 4 deletions(-) diff --git a/lib/argv.h b/lib/argv.h index 0fa77725..ecbb40d7 100644 --- a/lib/argv.h +++ b/lib/argv.h @@ -14,4 +14,17 @@ extern void setup_args_progname(const char *args); extern void setup_env(char *env, int size); extern void add_setup_arg(const char *arg); +#define STR_IS_Y(s) (s && (*s == '1' || *s == 'y' || *s == 'Y')) +#define STR_IS_N(s) (s && (*s == '0' || *s == 'n' || *s == 'N')) + +/* + * Get the boolean value of CONFIG_{name} + * which can be overridden by the {name} + * variable in the environment if present. + */ +#define GET_CONFIG_OR_ENV(name) ({ \ + const char *s = getenv(#name); \ + s ? STR_IS_Y(s) : CONFIG_##name; \ +}) + #endif diff --git a/lib/errata.h b/lib/errata.h index de8205d8..78007243 100644 --- a/lib/errata.h +++ b/lib/errata.h @@ -7,6 +7,7 @@ #ifndef _ERRATA_H_ #define _ERRATA_H_ #include <libcflat.h> +#include <argv.h> #include "config.h" @@ -28,7 +29,7 @@ static inline bool errata_force(void) return true; s = getenv("ERRATA_FORCE"); - return s && (*s == '1' || *s == 'y' || *s == 'Y'); + return STR_IS_Y(s); } static inline bool errata(const char *erratum) @@ -40,7 +41,7 @@ static inline bool errata(const char *erratum) s = getenv(erratum); - return s && (*s == '1' || *s == 'y' || *s == 'Y'); + return STR_IS_Y(s); } static inline bool errata_relaxed(const char *erratum) @@ -52,7 +53,7 @@ static inline bool errata_relaxed(const char *erratum) s = getenv(erratum); - return !(s && (*s == '0' || *s == 'n' || *s == 'N')); + return !STR_IS_N(s); } #endif diff --git a/riscv/sbi-tests.h b/riscv/sbi-tests.h index c1ebf016..4e051dca 100644 --- a/riscv/sbi-tests.h +++ b/riscv/sbi-tests.h @@ -37,6 +37,7 @@ #ifndef __ASSEMBLER__ #include <libcflat.h> +#include <argv.h> #include <asm/sbi.h> #define __sbiret_report(kfail, ret, expected_error, expected_value, \ @@ -94,7 +95,7 @@ static inline bool env_enabled(const char *env) { char *s = getenv(env); - return s && (*s == '1' || *s == 'y' || *s == 'Y'); + return STR_IS_Y(s); } void split_phys_addr(phys_addr_t paddr, unsigned long *hi, unsigned long *lo); -- 2.43.0

5 days, 8 hours

2
4
0 0

[PATCH net 0/2] bonding: fix LACP negotiation issues in passive mode

by Hangbin Liu

This patchset fixes an issue where bonding fails to establish a stable LACP negotiation when operating in passive mode (lacp_active=off). In passive mode, the current implementation only replies when the partner's state changes, which results in LACP timeout and unstable aggregator formation. With this change, the bond responds to each received LACPDU in passive mode by setting ntt = true, ensuring timely replies and stable LACP negotiation. Hangbin Liu (2): bonding: update ntt to true in passive mode selftests: bonding: add test for passive LACP mode drivers/net/bonding/bond_3ad.c | 6 ++ .../drivers/net/bonding/bond_passive_lacp.sh | 21 +++++ .../drivers/net/bonding/bond_topo_lacp.sh | 77 +++++++++++++++++++ 3 files changed, 104 insertions(+) create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_passive_lacp.sh create mode 100644 tools/testing/selftests/drivers/net/bonding/bond_topo_lacp.sh -- 2.46.0

5 days, 9 hours

1
2
0 0

[PATCH 0/2] bpf, arm64: relax constraint in BPF JIT compiler

by Alexis Lothoré (eBPF Foundation)

Hello, this series follows up on the one introducing 9+ args for tracing programs [1]. It has been observed with this series that there are cases for which we can not identify accurately the location of the target function arguments to prepare correctly the corresponding BPF trampoline. This is the case for example if: - the function consumes a struct variable _by value_ - it is passed on the stack (no more register available for it) - it has some __packed__ or __aligned(X)__ attribute As a consequence, a small restrictive check has been added to the ARM64 side, highlighting that other arch supporting 9+ args in BPF trampolines are already suffering from the same issue. After a bit of discussions and attempts, the chosen solution is, rather than applying the same constraint to all JIT compilers, to prevent such function from being encoded at all in BTF info([2]). As the pahole side is closed to be integrated, we can now remove the restrictive check from kernel side. [1] https://lore.kernel.org/bpf/20250527-many_args_arm64-v3-0-3faf7bb8e4a2@boot… [2] https://lore.kernel.org/bpf/20250707-btf_skip_structs_on_stack-v3-0-29569e0… Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore(a)bootlin.com> --- Alexis Lothoré (eBPF Foundation) (2): bpf, arm64: remove structs on stack constraint selftests/bpf: enable tracing_struct tests for arm64 arch/arm64/net/bpf_jit_comp.c | 5 ----- tools/testing/selftests/bpf/DENYLIST.aarch64 | 1 - 2 files changed, 6 deletions(-) --- base-commit: 8da1e37fc84868b50ba6a7cdf082aa3b0d11e006 change-id: 20250708-arm64_relax_jit_comp-e8889647d8d2 Best regards, -- Alexis Lothoré, Bootlin Embedded Linux and Kernel engineering https://bootlin.com

5 days, 10 hours

1
2
0 0

[PATCH 00/14] vdso: Add support for auxiliary clocks

by Thomas Weißschuh

Extend the vDSO for fast-path access to auxiliary clocks (CLOCK_AUX). The implementation is based on the generic vDSO infrastructure and works for all its supported architectures. Namely x86, arm, arm64, riscv, powerpc, loongarch and s390. No changes to userspace are necessary. Based on timers/ptp of tip.git. This also depends on v6.16-rc2 *exactly*. The specific dependency is commit 11fcf368506d ("uapi: bitops: use UAPI-safe variant of BITS_PER_LONG again"), which is available in v6.16-rc2. Unfortunately that got broken again in v6.16-rc3 by commit fc92099902fb ("tools headers: Synchronize linux/bits.h with the kernel sources"). Another fix for this is pending [0] and should make it into v6.16. [0] https://lore.kernel.org/lkml/20250630-uapi-genmask-v1-1-eb0ad956a83e@linutr… Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de> --- Thomas Weißschuh (14): selftests/timers: Add testcase for auxiliary clocks vdso/vsyscall: Introduce a helper to fill clock configurations vdso/vsyscall: Split up __arch_update_vsyscall() into __arch_update_vdso_clock() vdso/helpers: Add helpers for seqlocks of single vdso_clock vdso/gettimeofday: Return bool from clock_getres() helpers vdso/gettimeofday: Return bool from clock_gettime() helpers vdso/gettimeofday: Introduce vdso_clockid_valid() vdso/gettimeofday: Introduce vdso_set_timespec() vdso/gettimeofday: Introduce vdso_get_timestamp() vdso: Introduce aux_clock_resolution_ns() vdso/vsyscall: Update auxiliary clock data in the datapage vdso/gettimeofday: Add support for auxiliary clocks Revert "selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers" selftests/timers/auxclock: Test vDSO functionality arch/arm64/include/asm/vdso/vsyscall.h | 7 +- include/asm-generic/vdso/vsyscall.h | 6 +- include/linux/timekeeper_internal.h | 13 + include/vdso/auxclock.h | 13 + include/vdso/datapage.h | 5 + include/vdso/helpers.h | 40 ++- kernel/time/namespace.c | 5 + kernel/time/timekeeping.c | 18 +- kernel/time/vsyscall.c | 70 ++++-- lib/vdso/gettimeofday.c | 212 ++++++++++------ tools/testing/selftests/timers/.gitignore | 1 + tools/testing/selftests/timers/Makefile | 2 +- tools/testing/selftests/timers/auxclock.c | 406 ++++++++++++++++++++++++++++++ tools/testing/selftests/vDSO/Makefile | 2 - tools/testing/selftests/vDSO/parse_vdso.c | 3 +- 15 files changed, 683 insertions(+), 120 deletions(-) --- base-commit: 4e83b31e48cf2e62aeaed5cd9875c851e36a90d9 change-id: 20250630-vdso-auxclock-97abdf8e042a Best regards, -- Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>

5 days, 10 hours

5
29
0 0

[PATCH v4 0/6] replace `allow(...)` lints with `expect(...)`

by Onur Özkan

Changes in v4: - The first patch from previous versions is now broken down into multiple smaller patches. - `#![expect(dead_code)]` is reverted from `rust/kernel/opp.rs` as it failed on the kernel test bot (see [1] for more). Link: https://lore.kernel.org/all/202506291507.M9eg5kic-lkp@intel.com [1] Onur Özkan (6): rust: switch to `#[expect(...)]` in core modules rust: switch to `#[expect(...)]` in init and kunit drivers: gpu: switch to `#[expect(...)]` in nova-core/regs.rs rust: switch to `#[expect(...)]` in devres, driver and ioctl rust: remove `#[allow(clippy::unnecessary_cast)]` rust: remove `#[allow(clippy::non_send_fields_in_send_ty)]` drivers/gpu/nova-core/regs.rs | 2 +- rust/kernel/alloc/allocator_test.rs | 2 +- rust/kernel/cpufreq.rs | 1 - rust/kernel/devres.rs | 2 +- rust/kernel/driver.rs | 2 +- rust/kernel/drm/ioctl.rs | 8 ++++---- rust/kernel/error.rs | 3 +-- rust/kernel/init.rs | 6 +++--- rust/kernel/kunit.rs | 2 +- rust/kernel/types.rs | 2 +- rust/macros/helpers.rs | 2 +- 11 files changed, 15 insertions(+), 17 deletions(-) -- 2.50.0

5 days, 16 hours

5
16
0 0

[PATCH v3 00/22] ARM64 PMU Partitioning

by Colton Lewis

This series creates a new PMU scheme on ARM, a partitioned PMU that allows reserving a subset of counters for more direct guest access, significantly reducing overhead. More details, including performance benchmarks, can be read in the v1 cover letter linked below. v3: * Return to using one kernel command line parameter for reserved host counters, but set the default value meaning no partitioning to -1 so 0 isn't ambiguous. Adjust checks for partitioning accordingly. * Move the function kvm_pmu_partition_supported() back out of the driver to avoid subbing has_vhe in 32-bit ARM code. * Fix the kernel test robot build failures, one with KVM and no PMU, and one with PMU and no KVM. * Scan entire series to remove vestigial changes and update comments. v2: https://lore.kernel.org/kvm/20250620221326.1261128-1-coltonlewis@google.com/ v1: https://lore.kernel.org/kvm/20250602192702.2125115-1-coltonlewis@google.com/ Colton Lewis (21): arm64: cpufeature: Add cpucap for HPMN0 arm64: Generate sign macro for sysreg Enums KVM: arm64: Define PMI{CNTR,FILTR}_EL0 as undef_access KVM: arm64: Reorganize PMU functions perf: arm_pmuv3: Introduce method to partition the PMU perf: arm_pmuv3: Generalize counter bitmasks perf: arm_pmuv3: Keep out of guest counter partition KVM: arm64: Correct kvm_arm_pmu_get_max_counters() KVM: arm64: Set up FGT for Partitioned PMU KVM: arm64: Writethrough trapped PMEVTYPER register KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned KVM: arm64: Writethrough trapped PMOVS register KVM: arm64: Write fast path PMU register handlers KVM: arm64: Setup MDCR_EL2 to handle a partitioned PMU KVM: arm64: Account for partitioning in PMCR_EL0 access KVM: arm64: Context swap Partitioned PMU guest registers KVM: arm64: Enforce PMU event filter at vcpu_load() perf: arm_pmuv3: Handle IRQs for Partitioned PMU guest counters KVM: arm64: Inject recorded guest interrupts KVM: arm64: Add ioctl to partition the PMU when supported KVM: arm64: selftests: Add test case for partitioned PMU Marc Zyngier (1): KVM: arm64: Cleanup PMU includes Documentation/virt/kvm/api.rst | 21 + arch/arm/include/asm/arm_pmuv3.h | 38 + arch/arm64/include/asm/arm_pmuv3.h | 61 +- arch/arm64/include/asm/kvm_host.h | 18 +- arch/arm64/include/asm/kvm_pmu.h | 100 +++ arch/arm64/kernel/cpufeature.c | 8 + arch/arm64/kvm/Makefile | 2 +- arch/arm64/kvm/arm.c | 22 + arch/arm64/kvm/debug.c | 24 +- arch/arm64/kvm/hyp/include/hyp/switch.h | 233 ++++++ arch/arm64/kvm/pmu-emul.c | 674 +---------------- arch/arm64/kvm/pmu-part.c | 378 ++++++++++ arch/arm64/kvm/pmu.c | 702 ++++++++++++++++++ arch/arm64/kvm/sys_regs.c | 79 +- arch/arm64/tools/cpucaps | 1 + arch/arm64/tools/gen-sysreg.awk | 1 + arch/arm64/tools/sysreg | 6 +- drivers/perf/arm_pmuv3.c | 123 ++- include/linux/perf/arm_pmu.h | 6 + include/linux/perf/arm_pmuv3.h | 14 +- include/uapi/linux/kvm.h | 4 + tools/include/uapi/linux/kvm.h | 2 + .../selftests/kvm/arm64/vpmu_counter_access.c | 62 +- virt/kvm/kvm_main.c | 1 + 24 files changed, 1828 insertions(+), 752 deletions(-) create mode 100644 arch/arm64/kvm/pmu-part.c base-commit: 79150772457f4d45e38b842d786240c36bb1f97f -- 2.50.0.727.gbf7dc18ff4-goog

5 days, 20 hours

5
45
0 0

[PATCH v3] selftests/mm: Fix UFFDIO_API usage with proper two-step feature negotiation

by Li Wang

The current implementation of test_unmerge_uffd_wp() explicitly sets `uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP` before calling UFFDIO_API. This can cause the ioctl() call to fail with EINVAL on kernels that do not support UFFD-WP, leading the test to fail unnecessarily: # ------------------------------ # running ./ksm_functional_tests # ------------------------------ # TAP version 13 # 1..9 # # [RUN] test_unmerge # ok 1 Pages were unmerged # # [RUN] test_unmerge_zero_pages # ok 2 KSM zero pages were unmerged # # [RUN] test_unmerge_discarded # ok 3 Pages were unmerged # # [RUN] test_unmerge_uffd_wp # not ok 4 UFFDIO_API failed <----- # # [RUN] test_prot_none # ok 5 Pages were unmerged # # [RUN] test_prctl # ok 6 Setting/clearing PR_SET_MEMORY_MERGE works # # [RUN] test_prctl_fork # # No pages got merged # # [RUN] test_prctl_fork_exec # ok 7 PR_SET_MEMORY_MERGE value is inherited # # [RUN] test_prctl_unmerge # ok 8 Pages were unmerged # Bail out! 1 out of 8 tests failed # # Planned tests != run tests (9 != 8) # # Totals: pass:7 fail:1 xfail:0 xpass:0 skip:0 error:0 # [FAIL] This patch improves compatibility and robustness of the UFFD-WP test (test_unmerge_uffd_wp) by correctly implementing the UFFDIO_API two-step handshake as recommended by the userfaultfd(2) man page. Key changes: 1. Use features=0 in the initial UFFDIO_API call to query supported feature bits, rather than immediately requesting WP support. 2. Skip the test gracefully if: - UFFDIO_API fails with EINVAL (e.g. unsupported API version), or - UFFD_FEATURE_PAGEFAULT_FLAG_WP is not advertised by the kernel. 3. Close the initial userfaultfd and create a new one before enabling the required feature, since UFFDIO_API can only be called once per fd. 4. Improve diagnostics by distinguishing between expected and unexpected failures, using strerror() to report errors. This ensures the test behaves correctly across a wider range of kernel versions and configurations, while preserving the intended behavior on kernels that support UFFD-WP. Suggestted-by: David Hildenbrand <david(a)redhat.com> Signed-off-by: Li Wang <liwang(a)redhat.com> Cc: Peter Xu <peterx(a)redhat.com> Cc: Nadav Amit <nadav.amit(a)gmail.com> Cc: Aruna Ramakrishna <aruna.ramakrishna(a)oracle.com> Cc: Bagas Sanjaya <bagasdotme(a)gmail.com> Cc: Catalin Marinas <catalin.marinas(a)arm.com> Cc: Dave Hansen <dave.hansen(a)linux.intel.com> Cc: Joey Gouly <joey.gouly(a)arm.com> Cc: Johannes Weiner <hannes(a)cmpxchg.org> Cc: Keith Lucas <keith.lucas(a)oracle.com> Cc: Ryan Roberts <ryan.roberts(a)arm.com> Cc: Shuah Khan <shuah(a)kernel.org> Acked-by: David Hildenbrand <david(a)redhat.com> --- .../selftests/mm/ksm_functional_tests.c | 28 +++++++++++++++++-- 1 file changed, 26 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/mm/ksm_functional_tests.c b/tools/testing/selftests/mm/ksm_functional_tests.c index b61803e36d1c..d8bd1911dfc0 100644 --- a/tools/testing/selftests/mm/ksm_functional_tests.c +++ b/tools/testing/selftests/mm/ksm_functional_tests.c @@ -393,9 +393,13 @@ static void test_unmerge_uffd_wp(void) /* See if UFFD-WP is around. */ uffdio_api.api = UFFD_API; - uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; + uffdio_api.features = 0; if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) { - ksft_test_result_fail("UFFDIO_API failed\n"); + if (errno == EINVAL) + ksft_test_result_skip("The API version requested is not supported\n"); + else + ksft_test_result_fail("UFFDIO_API failed: %s\n", strerror(errno)); + goto close_uffd; } if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) { @@ -403,6 +407,26 @@ static void test_unmerge_uffd_wp(void) goto close_uffd; } + /* + * UFFDIO_API must only be called once to enable features. + * So we close the old userfaultfd and create a new one to + * actually enable UFFD_FEATURE_PAGEFAULT_FLAG_WP. + */ + close(uffd); + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); + if (uffd < 0) { + ksft_test_result_fail("__NR_userfaultfd failed\n"); + goto unmap; + } + + /* Now, enable it ("two-step handshake") */ + uffdio_api.api = UFFD_API; + uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; + if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) { + ksft_test_result_fail("UFFDIO_API failed: %s\n", strerror(errno)); + goto close_uffd; + } + /* Register UFFD-WP, no need for an actual handler. */ if (uffd_register(uffd, map, size, false, true, false)) { ksft_test_result_fail("UFFDIO_REGISTER_MODE_WP failed\n"); -- 2.49.0

6 days, 3 hours

2
1
0 0

[PATCH net-next v4 0/3] selftest: net: Add selftest for netpoll

by Breno Leitao

I am submitting a new selftest for the netpoll subsystem specifically targeting the case where the RX is polling in the TX path, which is a case that we don't have any test in the tree today. This is done when netpoll_poll_dev() called, and this test creates a scenario when that is probably. The test does the following: 1) Configuring a single RX/TX queue to increase contention on the interface. 2) Generating background traffic to saturate the network, mimicking real-world congestion. 3) Sending netconsole messages to trigger netpoll polling and monitor its behavior. 4) Using dynamic netconsole targets via configfs, with the ability to delete and recreate targets during the test. 5) Running bpftrace in parallel to verify that netpoll_poll_dev() is called when expected. If it is called, then the test passes, otherwise the test is marked as skipped. In order to achieve it, I stole Jakub's bpftrace helper from [1], and did some small changes that I found useful to use the helper. So, this patchset basically contains: 1) The code stolen from Jakub 2) Improvements on bpftrace() helper 3) The selftest itself Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1] --- Changes in v4: - Make the test XFail if it doesn't hit the function we are looking for - Toggle the interface while the traffic is flowing. - Bumped the number of messages from 10 to 40 per iterations. * This is hitting ~15 times per run on my vng test. - Decreased the time from 15 seconds to 10 seconds, given that if it didn't hit the function in 10 seconds, 5 seconds extra will not help. - Link to v3: https://lore.kernel.org/r/20250627-netpoll_test-v3-0-575bd200c8a9@debian.org Changes in v3: - Make pylint happy (Simon) - Remove the unnecessary patch in bpftrace to raise an exception when it fails. (Jakub) - Improved the bpftrace code (Willem) - Stop sending messages if bpftrace is not alive anymore. - Link to v2: https://lore.kernel.org/r/20250625-netpoll_test-v2-0-47d27775222c@debian.org Changes in v2: - Stole Jakub's helper to run bpftrace - Removed the DEBUG option and moved logs to logging - Change the code to have a higher chance of calling netpoll_poll_dev(). In my current configuration, it is hitting multiple times during the test. - Save and restore TX/RX queue size (Jakub) - Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org --- Breno Leitao (2): selftests: drv-net: Strip '@' prefix from bpftrace map keys selftests: net: add netpoll basic functionality test Jakub Kicinski (1): selftests: drv-net: add helper/wrapper for bpftrace tools/testing/selftests/drivers/net/Makefile | 1 + .../selftests/drivers/net/lib/py/__init__.py | 3 +- .../testing/selftests/drivers/net/netpoll_basic.py | 365 +++++++++++++++++++++ tools/testing/selftests/net/lib/py/utils.py | 35 ++ 4 files changed, 403 insertions(+), 1 deletion(-) --- base-commit: 22e1cfda0f612556f116560d2b7e9c3315636bfb change-id: 20250612-netpoll_test-a1324d2057c8 Best regards, -- Breno Leitao <leitao(a)debian.org>

6 days, 3 hours

3
7
0 0

[PATCH net-next 0/3] netdevsim: add support for PHY devices

by Maxime Chevallier

There exist numerous ethtool commands that gets their information from an interface's PHY. This series allows creating virtual PHY devices attached to netdevsim, so that we can start testing these commands. This first series adds a minimal support. The PHY that we add are only capable of saying if the link is up or down, based on a debugfs file. When accepted, this interface can be extended to allow testing further commands, and in greater details. This series also add some selftests for the "ethtool --show-phys" command. This is a first step towards having better testability for the ethtool netlink PHY commands, but this could also potentially be a stepping stone for some basic phylib tests ? Thanks, Maxime Maxime Chevallier (3): net: netdevsim: Add PHY support in netdevsim selftests: ethtool: Drop the unused old_netdevs variable selftests: ethtool: Introduce ethernet PHY selftests on netdevsim drivers/net/netdevsim/Makefile | 4 + drivers/net/netdevsim/dev.c | 2 + drivers/net/netdevsim/netdev.c | 3 + drivers/net/netdevsim/netdevsim.h | 14 + drivers/net/netdevsim/phy.c | 387 ++++++++++++++++++ .../selftests/drivers/net/netdevsim/config | 1 + .../drivers/net/netdevsim/ethtool-common.sh | 19 +- .../drivers/net/netdevsim/ethtool-phy.sh | 64 +++ 8 files changed, 491 insertions(+), 3 deletions(-) create mode 100644 drivers/net/netdevsim/phy.c create mode 100755 tools/testing/selftests/drivers/net/netdevsim/ethtool-phy.sh -- 2.49.0

6 days, 4 hours

5
12
0 0

[PATCH nf-next v3 0/2] Add IPIP flowtable SW acceleratio

by Lorenzo Bianconi

Introduce SW acceleration for IPIP tunnels in the netfilter flowtable infrastructure. --- Changes in v3: - Add outer IP header sanity checks - target nf-next tree instead of net-next - Link to v2: https://lore.kernel.org/r/20250627-nf-flowtable-ipip-v2-0-c713003ce75b@kern… Changes in v2: - Introduce IPIP flowtable selftest - Link to v1: https://lore.kernel.org/r/20250623-nf-flowtable-ipip-v1-1-2853596e3941@kern… --- Lorenzo Bianconi (2): net: netfilter: Add IPIP flowtable SW acceleration selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest net/ipv4/ipip.c | 21 ++++++++++++ net/netfilter/nf_flow_table_ip.c | 34 ++++++++++++++++-- .../selftests/net/netfilter/nft_flowtable.sh | 40 ++++++++++++++++++++++ 3 files changed, 93 insertions(+), 2 deletions(-) --- base-commit: 8b98f34ce1d8c520403362cb785231f9898eb3ff change-id: 20250623-nf-flowtable-ipip-1b3d7b08d067 Best regards, -- Lorenzo Bianconi <lorenzo(a)kernel.org>

6 days, 10 hours

3
6
0 0

[PATCH v2] selftests/kexec: fix test_kexec_jump build

by Moon Hee Lee

The test_kexec_jump program builds correctly when invoked from the top-level selftests/Makefile, which explicitly sets the OUTPUT variable. However, building directly in tools/testing/selftests/kexec fails with: make: *** No rule to make target '/test_kexec_jump', needed by 'test_kexec_jump.sh'. Stop. This failure occurs because the Makefile rule relies on $(OUTPUT), which is undefined in direct builds. Fix this by listing test_kexec_jump in TEST_GEN_PROGS, the standard way to declare generated test binaries in the kselftest framework. This ensures the binary is built regardless of invocation context and properly removed by make clean. Acked-by: Shuah Khan <skhan(a)linuxfoundation.org> Signed-off-by: Moon Hee Lee <moonhee.lee.ca(a)gmail.com> --- Changes in v2: - Dropped the .gitignore addition, as it is already handled in [1] [1] https://lore.kernel.org/r/20250623232549.3263273-1-dyudaken@gmail.com tools/testing/selftests/kexec/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/kexec/Makefile b/tools/testing/selftests/kexec/Makefile index e3000ccb9a5d..874cfdd3b75b 100644 --- a/tools/testing/selftests/kexec/Makefile +++ b/tools/testing/selftests/kexec/Makefile @@ -12,7 +12,7 @@ include ../../../scripts/Makefile.arch ifeq ($(IS_64_BIT)$(ARCH_PROCESSED),1x86) TEST_PROGS += test_kexec_jump.sh -test_kexec_jump.sh: $(OUTPUT)/test_kexec_jump +TEST_GEN_PROGS := test_kexec_jump endif include ../lib.mk -- 2.43.0

6 days, 11 hours

4
3
0 0

[PATCH bpf-next 0/2] Clarify and Enhance XDP Rx Metadata Handling

by Song Yoong Siang

This patch set improves the documentation and selftests for XDP Rx metadata handling. The first patch clarifies the documentation around XDP metadata layout and the use of bpf_xdp_adjust_meta. The second patch enhances the BPF selftests to make XDP metadata handling more robust and portable across different NICs. Prior to this patch set, the user application retrieved the xdp_meta by calculating backward from the data pointer, while the XDP program fill in the xdp_meta by calculating backward from data_meta. This approach will cause mismatch if there is device-reserved metadata. |<---sizeof(xdp_meta)--| | | struct xdp_meta rx_desc->address ^ ^ | | +----------+----------------------+------------+------+ | headroom | custom metadata | reserved | data | +----------+----------------------+------------+------+ ^ ^ ^ | | | struct xdp_meta xdp_buff->data_meta xdp_buff->data | | |<---sizeof(xdp_meta)--| Song Yoong Siang (2): doc: clarify XDP Rx metadata layout and bpf_xdp_adjust_meta usage selftests/bpf: Enhance XDP Rx Metadata Handling Documentation/networking/xdp-rx-metadata.rst | 38 +++++++++++++++++++ .../selftests/bpf/prog_tests/xdp_metadata.c | 2 +- .../selftests/bpf/progs/xdp_hw_metadata.c | 10 ++++- .../selftests/bpf/progs/xdp_metadata.c | 8 +++- tools/testing/selftests/bpf/xdp_hw_metadata.c | 2 +- tools/testing/selftests/bpf/xdp_metadata.h | 7 ++++ 6 files changed, 63 insertions(+), 4 deletions(-) -- 2.34.1

6 days, 16 hours

4
15
0 0

[PATCH net-next v2 0/7] netpoll: Factor out functions from netpoll_send_udp() and add ipv6 selftest

by Breno Leitao

Refactors the netpoll UDP transmit path to improve code clarity, maintainability, and protocol-layer encapsulation. Function netpoll_send_udp() has more than 100 LoC, which is hard to understand and review. After this patchset, it has only 32 LoC, which is more manageable. The series systematically moves the construction of protocol headers (UDP, IPv4, IPv6, Ethernet) out of the core `netpoll_send_udp()` function into dedicated static helpers: - `push_udp()` for UDP header setup - `push_ipv4()` and `push_ipv6()` for IP header setup - `push_eth()` for Ethernet header setup This results in a clean, layered abstraction that mirrors the protocol stack, reduces code duplication, and improves readability. Also, to make sure this is not breaking anything, add IPv6 selftest to netconsole tests, which will exercise this code. This test would also pick problems similiar to the one fixed by f599020702698 ("net: netpoll: Initialize UDP checksum field before checksumming"), which was embarrassin we didn't have a selftest catch it. Anyway, there are **no functional changes** intended in this patchset. Signed-off-by: Breno Leitao <leitao(a)debian.org> --- Changes in v2: - Move ethernet data from push_ipv{4,6} into push_eth() function, making the slice up even more clearer (Jakub). - Fixed some long lines to get checkpatch happier. - Link to v1: https://lore.kernel.org/r/20250627-netpoll_untagle_ip-v1-0-61a21692f84a@deb… --- Breno Leitao (7): netpoll: Improve code clarity with explicit struct size calculations netpoll: factor out UDP checksum calculation into helper netpoll: factor out IPv6 header setup into push_ipv6() helper netpoll: factor out IPv4 header setup into push_ipv4() helper netpoll: factor out UDP header setup into push_udp() helper netpoll: move Ethernet setup to push_eth() helper selftests: net: Add IPv6 support to netconsole basic tests net/core/netpoll.c | 192 +++++++++++++-------- .../selftests/drivers/net/lib/sh/lib_netcons.sh | 76 +++++++- .../testing/selftests/drivers/net/netcons_basic.sh | 53 +++--- 3 files changed, 215 insertions(+), 106 deletions(-) --- base-commit: 8efa26fcbf8a7f783fd1ce7dd2a409e9b7758df0 change-id: 20250620-netpoll_untagle_ip-e37c799a6925 Best regards, -- Breno Leitao <leitao(a)debian.org>

6 days, 16 hours

3
16
0 0

[PATCH 0/2] tracing: Fixes for filter

by Masami Hiramatsu (Google)

Hi, Here is a patch series to fix some issues on the trace event and function filters. The first patch fixes an issue that the event filter can not handle the string pointer with BTF attribute tag. This happens with CONFIG_DEBUG_INFO_BTF=y and PAHOLE_HAS_BTF_TAG=y. The second patch fixes a selftest issue on the function glob filter. Since mutex_trylock() can be an inline function, it is not a good example for ftrace. This replaces it with mutex_unlock(). Thank you, --- Masami Hiramatsu (Google) (2): tracing: Handle "(const) char __attribute() *" as string ptr type selftests: tracing: Use mutex_unlock for testing glob filter kernel/trace/trace_events_filter.c | 5 +++++ tools/testing/selftests/ftrace/test.d/ftrace/func-filter-glob.tc | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-)

6 days, 17 hours

4
7
0 0

[PATCH v2 1/2] selftests: cachestat: add tests for mmap

by Suresh K C

From: Suresh K C <suresh.k.chandrappa(a)gmail.com> Add a test case to verify cachestat behavior with memory-mapped files using mmap(). This ensures that pages accessed via mmap are correctly accounted for in the page cache. Tested on x86_64 with default kernel config Signed-off-by: Suresh K C <suresh.k.chandrappa(a)gmail.com> --- .../selftests/cachestat/test_cachestat.c | 39 ++++++++++++++++--- 1 file changed, 33 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/cachestat/test_cachestat.c b/tools/testing/selftests/cachestat/test_cachestat.c index 632ab44737ec..b6452978dae0 100644 --- a/tools/testing/selftests/cachestat/test_cachestat.c +++ b/tools/testing/selftests/cachestat/test_cachestat.c @@ -33,6 +33,11 @@ void print_cachestat(struct cachestat *cs) cs->nr_evicted, cs->nr_recently_evicted); } +enum file_type { + FILE_MMAP, + FILE_SHMEM +}; + bool write_exactly(int fd, size_t filesize) { int random_fd = open("/dev/urandom", O_RDONLY); @@ -202,7 +207,7 @@ static int test_cachestat(const char *filename, bool write_random, bool create, return ret; } -bool test_cachestat_shmem(void) +bool run_cachestat_test(enum file_type type) { size_t PS = sysconf(_SC_PAGESIZE); size_t filesize = PS * 512 * 2; /* 2 2MB huge pages */ @@ -212,27 +217,43 @@ bool test_cachestat_shmem(void) char *filename = "tmpshmcstat"; struct cachestat cs; bool ret = true; + int fd; unsigned long num_pages = compute_len / PS; - int fd = shm_open(filename, O_CREAT | O_RDWR, 0600); + if (type == FILE_SHMEM) + fd = shm_open(filename, O_CREAT | O_RDWR, 0600); + else + fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666); if (fd < 0) { - ksft_print_msg("Unable to create shmem file.\n"); + ksft_print_msg("Unable to create file.\n"); ret = false; goto out; } if (ftruncate(fd, filesize)) { - ksft_print_msg("Unable to truncate shmem file.\n"); + ksft_print_msg("Unable to truncate file.\n"); ret = false; goto close_fd; } if (!write_exactly(fd, filesize)) { - ksft_print_msg("Unable to write to shmem file.\n"); + ksft_print_msg("Unable to write to file.\n"); ret = false; goto close_fd; } + if (type == FILE_MMAP){ + char *map = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (map == MAP_FAILED) { + ksft_print_msg("mmap failed.\n"); + ret = false; + goto close_fd; + } + for (int i = 0; i < filesize; i++) { + map[i] = 'A'; + } + map[filesize - 1] = 'X'; + } syscall_ret = syscall(__NR_cachestat, fd, &cs_range, &cs, 0); if (syscall_ret) { @@ -308,12 +329,18 @@ int main(void) break; } - if (test_cachestat_shmem()) + if (run_cachestat_test(FILE_SHMEM)) ksft_test_result_pass("cachestat works with a shmem file\n"); else { ksft_test_result_fail("cachestat fails with a shmem file\n"); ret = 1; } + if (run_cachestat_test(FILE_MMAP)) + ksft_test_result_pass("cachestat works with a mmap file\n"); + else { + ksft_test_result_fail("cachestat fails with a mmap file\n"); + ret = 1; + } return ret; } -- 2.43.0

6 days, 21 hours

2
2
0 0

[PATCH] selftests: harness: Rework is_signed_type() to avoid collision with overflow.h

by Sean Christopherson

Rename is_signed_type() to is_signed_var() to avoid colliding with a macro of the same name defined by linux/overflow.h. Note, overflow.h's version takes a type as the input, whereas the harness's version takes a variable! This fixes warnings (and presumably potential test failures) in tests that utilize the selftests harness and happen to (indirectly) include overflow.h. In file included from tools/include/linux/bits.h:34, from tools/include/linux/bitops.h:14, from tools/include/linux/hashtable.h:13, from include/kvm_util.h:11, from x86/userspace_msr_exit_test.c:11: tools/include/linux/overflow.h:31:9: error: "is_signed_type" redefined [-Werror] 31 | #define is_signed_type(type) (((type)(-1)) < (type)1) | ^~~~~~~~~~~~~~ In file included from include/kvm_test_harness.h:11, from x86/userspace_msr_exit_test.c:9: ../kselftest_harness.h:754:9: note: this is the location of the previous definition 754 | #define is_signed_type(var) (!!(((__typeof__(var))(-1)) < (__typeof__(var))1)) | ^~~~~~~~~~~~~~ Opportunistically use is_signed_type() to implement is_signed_var() so that the relationship and differences are obvious. Fixes: fc92099902fb ("tools headers: Synchronize linux/bits.h with the kernel sources") Cc: Vincent Mailhol <mailhol.vincent(a)wanadoo.fr> Cc: Arnaldo Carvalho de Melo <acme(a)redhat.com> Signed-off-by: Sean Christopherson <seanjc(a)google.com> --- This is probably compile-tested only, I don't think any of the KVM selftests utilize the harness's EXPECT macros. tools/testing/selftests/kselftest_harness.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h index 2925e47db995..f3e7a46345db 100644 --- a/tools/testing/selftests/kselftest_harness.h +++ b/tools/testing/selftests/kselftest_harness.h @@ -56,6 +56,7 @@ #include <asm/types.h> #include <ctype.h> #include <errno.h> +#include <linux/overflow.h> #include <linux/unistd.h> #include <poll.h> #include <stdbool.h> @@ -751,7 +752,7 @@ for (; _metadata->trigger; _metadata->trigger = \ __bail(_assert, _metadata)) -#define is_signed_type(var) (!!(((__typeof__(var))(-1)) < (__typeof__(var))1)) +#define is_signed_var(var) is_signed_type(__typeof__(var)) #define __EXPECT(_expected, _expected_str, _seen, _seen_str, _t, _assert) do { \ /* Avoid multiple evaluation of the cases */ \ @@ -759,7 +760,7 @@ __typeof__(_seen) __seen = (_seen); \ if (!(__exp _t __seen)) { \ /* Report with actual signedness to avoid weird output. */ \ - switch (is_signed_type(__exp) * 2 + is_signed_type(__seen)) { \ + switch (is_signed_var(__exp) * 2 + is_signed_var(__seen)) { \ case 0: { \ uintmax_t __exp_print = (uintmax_t)__exp; \ uintmax_t __seen_print = (uintmax_t)__seen; \ base-commit: 78f4e737a53e1163ded2687a922fce138aee73f5 -- 2.50.0.714.g196bf9f422-goog

6 days, 21 hours

3
3
0 0

[PATCH v2 net-next] selftests: devmem: configure HDS threshold

by Taehee Yoo

The devmem TCP requires the hds-thresh value to be 0, but it doesn't change it automatically. Therefore, make configure_headersplit() sets hds-thresh value to 0. Signed-off-by: Taehee Yoo <ap420073(a)gmail.com> --- v2: - Do not implement configure_hds_thresh(). - Make configure_headersplit() sets hds-thresh to 0. tools/testing/selftests/drivers/net/hw/ncdevmem.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c index cc9b40d9c5d5..52b72de11e3b 100644 --- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c +++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c @@ -331,6 +331,12 @@ static int configure_headersplit(bool on) ret = ethtool_rings_set(ys, req); if (ret < 0) fprintf(stderr, "YNL failed: %s\n", ys->err.msg); + if (on) { + ethtool_rings_set_req_set_hds_thresh(req, 0); + ret = ethtool_rings_set(ys, req); + if (ret < 0) + fprintf(stderr, "YNL failed: %s\n", ys->err.msg); + } ethtool_rings_set_req_free(req); if (ret == 0) { @@ -338,9 +344,12 @@ static int configure_headersplit(bool on) ethtool_rings_get_req_set_header_dev_index(get_req, ifindex); get_rsp = ethtool_rings_get(ys, get_req); ethtool_rings_get_req_free(get_req); - if (get_rsp) + if (get_rsp) { fprintf(stderr, "TCP header split: %s\n", tcp_data_split_str(get_rsp->tcp_data_split)); + fprintf(stderr, "HDS threshold: %u\n", + get_rsp->hds_thresh); + } ethtool_rings_get_rsp_free(get_rsp); } -- 2.34.1

6 days, 22 hours

4
7
0 0

[kvm-unit-tests PATCH] riscv: lib: Add sbi-exit-code to configure and environment

by Jesse Taube

Add --[enable|disable]-sbi-exit-code to configure script. With the default value disabled. Add a check for SBI_PASS_EXIT_CODE in the environment, so that passing of the test status is configurable from both the environment and the configure script Signed-off-by: Jesse Taube <jesse(a)rivosinc.com> --- configure | 11 +++++++++++ lib/riscv/io.c | 12 +++++++++++- 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/configure b/configure index 20bf5042..7c949bdc 100755 --- a/configure +++ b/configure @@ -67,6 +67,7 @@ earlycon= console= efi= efi_direct= +sbi_exit_code=0 target_cpu= # Enable -Werror by default for git repositories only (i.e. developer builds) @@ -141,6 +142,9 @@ usage() { system and run from the UEFI shell. Ignored when efi isn't enabled and defaults to enabled when efi is enabled for riscv64. (arm64 and riscv64 only) + --[enable|disable]-sbi-exit-code + Enable or disable sending pass/fail exit code to SBI SRST. + (disabled by default, riscv only) EOF exit 1 } @@ -236,6 +240,12 @@ while [[ $optno -le $argc ]]; do --disable-efi-direct) efi_direct=n ;; + --enable-sbi-exit-code) + sbi_exit_code=1 + ;; + --disable-sbi-exit-code) + sbi_exit_code=0 + ;; --enable-werror) werror=-Werror ;; @@ -551,6 +561,7 @@ EOF elif [ "$arch" = "riscv32" ] || [ "$arch" = "riscv64" ]; then echo "#define CONFIG_UART_EARLY_BASE ${uart_early_addr}" >> lib/config.h [ "$console" = "sbi" ] && echo "#define CONFIG_SBI_CONSOLE" >> lib/config.h + echo "#define CONFIG_SBI_EXIT_CODE ${sbi_exit_code}" >> lib/config.h echo >> lib/config.h fi echo "#endif" >> lib/config.h diff --git a/lib/riscv/io.c b/lib/riscv/io.c index b1163404..0e666009 100644 --- a/lib/riscv/io.c +++ b/lib/riscv/io.c @@ -162,8 +162,18 @@ void halt(int code); void exit(int code) { + char *s = getenv("SBI_PASS_EXIT_CODE"); + bool pass_exit = CONFIG_SBI_EXIT_CODE; + printf("\nEXIT: STATUS=%d\n", ((code) << 1) | 1); - sbi_shutdown(code == 0); + + if (s) + pass_exit = (*s == '1' || *s == 'y' || *s == 'Y'); + + if (pass_exit) + sbi_shutdown(code == 0); + else + sbi_shutdown(true); halt(code); __builtin_unreachable(); } -- 2.43.0

1 week

2
3
0 0

[RFC PATCH 00/33] vfio: Introduce selftests for VFIO

by David Matlack

This series introduces VFIO selftests, located in tools/testing/selftests/vfio/. VFIO selftests aim to enable kernel developers to write and run tests that take the form of userspace programs that interact with VFIO and IOMMUFD uAPIs. VFIO selftests can be used to write functional tests for new features, regression tests for bugs, and performance tests for optimizations. These tests are designed to interact with real PCI devices, i.e. they do not rely on mocking out or faking any behavior in the kernel. This allows the tests to exercise not only VFIO but also IOMMUFD, the IOMMU driver, interrupt remapping, IRQ handling, etc. We chose selftests to host these tests primarily to enable integration with the existing KVM selftests. As explained in the next section, enabling KVM developers to test the interaction between VFIO and KVM is one of the motivators of this series. Motivation ----------------------------------------------------------------------- The main motivation for this series is upcoming development in the kernel to support Hypervisor Live Updates [1][2]. Live Update is a specialized reboot process where selected devices are kept operational and their kernel state is preserved and recreated across a kexec. For devices, DMA and interrupts may continue during the reboot. VFIO-bound devices are the main target, since the first usecase of Live Updates is to enable host kernel upgrades in a Cloud Computing environment without disrupting running customer VMs. To prepare for upcoming support for Live Updates in VFIO, IOMMUFD, IOMMU drivers, the PCI layer, etc., we'd like to first lay the ground work for exercising and testing VFIO from kernel selftests. This way when we eventually upstream support for Live Updates, we can also upstream tests for those changes, rather than purely relying on Live Update integration tests which would be hard to share and reproduce upstream. But even without Live Updates, VFIO and IOMMUFD are becoming an increasingly critical component of running KVM-based VMs in cloud environments. Virtualized networking and storage are increasingly being offloaded to smart NICs/cards, and demand for high performance networking, storage, and AI are also leading to NICs, SSDs, and GPUs being directly attached to VMs via VFIO. VFIO selftests increases our ability to test in several ways. - It enables developers sending VFIO, IOMMUFD, etc. commits upstream to test their changes against all existing VFIO selftests, reducing the probability of regressions. - It enables developers sending VFIO, IOMMUFD, etc. commits upstream to include tests alongside their changes, increasing the quality of the code that is merged. - It enables testing the interaction between VFIO and KVM. There are some paths in KVM that are only exercised through VFIO, such as IRQ bypass. VFIO selftests provides a helper library to enable KVM developers to write KVM selftests to test those interactions [3]. Design ----------------------------------------------------------------------- VFIO selftests are designed around interacting with with VFIO-managed PCI devices. As such, the core data struture is struct vfio_pci_device, which represents a single PCI device. struct vfio_pci_device *device; device = vfio_pci_device_init("0000:6a:01.0", iommu_mode); ... vfio_pci_device_cleanup(device); vfio_pci_device_init() sets up a container or iommufd, depending on the iommu_mode argument, to manage DMA mappings, fetches information about the device and what interrupts it supports from VFIO and caches it, and mmap()s all mappable BARs for the test to use. There are helper methods that operate on struct vfio_pci_device to do things like read and write to PCI config space, enable/disable IRQs, and map memory for DMA, struct vfio_pci_device and its methods do not care about what device they are actually interacting with. It can be a GPU, a NIC, an SSD, etc. To keep things simple initially, VFIO selftests only support a single device per group and per container/iommufd. But it should be possible to relax those restrictions in the future, e.g. to enable testing with multiple devices in the same container/iommufd. Driver Framework ----------------------------------------------------------------------- In order to support VFIO selftests where a device is generating DMA and interrupts on command, the VFIO selftests supports a driver framework. This framework abstracts away device-specific details allowing VFIO selftests to be written in a generic way, and then run against different devices depending on what hardware developers have access to. The framework also aims to support carrying drivers out-of-tree, e.g. so that companies can run VFIO selftests with custom/test hardware. Drivers must implement the following methods: - probe(): Check if the driver supports a given device. - init(): Initialize the driver. - remove(): Deinitialize the driver and reset the device. - memcpy_start(): Kick off a series of repeated memcpys (DMA reads and DMA writes). - memcpy_wait(): Wait for a memcpy operation to complete. - send_msi(): Make the device send an MSI interrupt. memcpy_start/wait() are for generating DMA. We separate the operation into 2 steps so that tests can trigger a long-running DMA operation. We expect to use this to stress test Live Updates by kicking off a long-running mempcy operation and then performing a Live Update. These methods are required to not generate any interrupts. send_msi() is used for testing MSI and MSI-x interrupts. The driver tells the test which MSI it will be using via device->driver.msi. It's the responsibility of the test to set up a region of memory and map it into the device for use by the driver, e.g. for in-memory descriptors, before calling init(). A demo of the driver framework can be found in tools/testing/selftests/vfio/vfio_pci_driver_test.c. In addition, this series introduces a new KVM selftest to demonstrate delivering a device MSI directly into a guest, which can be found in tools/testing/selftests/kvm/vfio_pci_device_irq_test.c. Tests ----------------------------------------------------------------------- There are 5 tests in this series, mostly to demonstrate as a proof-of-concept: - tools/testing/selftests/vfio/vfio_pci_device_test.c - tools/testing/selftests/vfio/vfio_pci_driver_test.c - tools/testing/selftests/vfio/vfio_iommufd_setup_test.c - tools/testing/selftests/vfio/vfio_dma_mapping_test.c - tools/testing/selftests/kvm/vfio_pci_device_irq_test.c Integrating with KVM selftests ----------------------------------------------------------------------- To support testing the interactions between VFIO and KVM, the VFIO selftests support sharing its library with the KVM selftest. The patches at the end of this series demonstrate how that works. Essentially, we allow the KVM selftests to build their own copy of tools/testing/selftests/vfio/lib/ and link it into KVM selftests binaries. This requires minimal changes to the KVM selftests Makefile. Future Areas of Development ----------------------------------------------------------------------- Library: - Driver support for devices that can be used on AMD, ARM, and other platforms. - Driver support for a device available in QEMU VMs. - Support for tests that use multiple devices. - Support for IOMMU groups with multiple devices. - Support for multiple devices sharing the same container/iommufd. - Sharing TEST_ASSERT() macros and other common code between KVM and VFIO selftests. Tests: - DMA mapping performance tests for BARs/HugeTLB/etc. - Live Update selftests. - Porting Sean's KVM selftest for posted interrupts to use the VFIO selftests library [3] This series can also be found on GitHub: https://github.com/dmatlack/linux/tree/vfio/selftests/rfc Cc: Alex Williamson <alex.williamson(a)redhat.com> Cc: Jason Gunthorpe <jgg(a)nvidia.com> Cc: Kevin Tian <kevin.tian(a)intel.com> Cc: Paolo Bonzini <pbonzini(a)redhat.com> Cc: Sean Christopherson <seanjc(a)google.com> Cc: Vipin Sharma <vipinsh(a)google.com> Cc: Josh Hilke <jrhilke(a)google.com> Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com> Cc: Saeed Mahameed <saeedm(a)nvidia.com> Cc: Saeed Mahameed <saeedm(a)nvidia.com> Cc: Adithya Jayachandran <ajayachandra(a)nvidia.com> Cc: Parav Pandit <parav(a)nvidia.com> Cc: Leon Romanovsky <leonro(a)nvidia.com> Cc: Vinicius Costa Gomes <vinicius.gomes(a)intel.com> Cc: Dave Jiang <dave.jiang(a)intel.com> Cc: Dan Williams <dan.j.williams(a)intel.com> [1] https://lore.kernel.org/all/f35359d5-63e1-8390-619f-67961443bfe1@google.com/ [2] https://lore.kernel.org/all/20250515182322.117840-1-pasha.tatashin@soleen.c… [3] https://lore.kernel.org/kvm/20250404193923.1413163-68-seanjc@google.com/ David Matlack (28): selftests: Create tools/testing/selftests/vfio vfio: selftests: Add a helper library for VFIO selftests vfio: selftests: Introduce vfio_pci_device_test tools headers: Add stub definition for __iomem tools headers: Import asm-generic MMIO helpers tools headers: Import x86 MMIO helper overrides tools headers: Import iosubmit_cmds512() tools headers: Import drivers/dma/ioat/{hw.h,registers.h} tools headers: Import drivers/dma/idxd/registers.h tools headers: Import linux/pci_ids.h vfio: selftests: Keep track of DMA regions mapped into the device vfio: selftests: Enable asserting MSI eventfds not firing vfio: selftests: Add a helper for matching vendor+device IDs vfio: selftests: Add driver framework vfio: sefltests: Add vfio_pci_driver_test vfio: selftests: Add driver for Intel CBDMA vfio: selftests: Add driver for Intel DSA vfio: selftests: Move helper to get cdev path to libvfio vfio: selftests: Encapsulate IOMMU mode vfio: selftests: Add [-i iommu_mode] option to all tests vfio: selftests: Add vfio_type1v2_mode vfio: selftests: Add iommufd_compat_type1{,v2} modes vfio: selftests: Add iommufd mode vfio: selftests: Make iommufd the default iommu_mode vfio: selftests: Add a script to help with running VFIO selftests KVM: selftests: Build and link sefltests/vfio/lib into KVM selftests KVM: selftests: Test sending a vfio-pci device IRQ to a VM KVM: selftests: Use real device MSIs in vfio_pci_device_irq_test Josh Hilke (5): vfio: selftests: Test basic VFIO and IOMMUFD integration vfio: selftests: Move vfio dma mapping test to their own file vfio: selftests: Add test to reset vfio device. vfio: selftests: Use command line to set hugepage size for DMA mapping test vfio: selftests: Validate 2M/1G HugeTLB are mapped as 2M/1G in IOMMU MAINTAINERS | 7 + tools/arch/x86/include/asm/io.h | 101 + tools/arch/x86/include/asm/special_insns.h | 27 + tools/include/asm-generic/io.h | 482 +++ tools/include/asm/io.h | 11 + tools/include/drivers/dma/idxd/registers.h | 601 +++ tools/include/drivers/dma/ioat/hw.h | 270 ++ tools/include/drivers/dma/ioat/registers.h | 251 ++ tools/include/linux/compiler.h | 4 + tools/include/linux/io.h | 4 +- tools/include/linux/pci_ids.h | 3212 +++++++++++++++++ tools/testing/selftests/Makefile | 1 + tools/testing/selftests/kvm/Makefile.kvm | 6 +- .../testing/selftests/kvm/include/kvm_util.h | 4 + tools/testing/selftests/kvm/lib/kvm_util.c | 21 + .../selftests/kvm/vfio_pci_device_irq_test.c | 173 + tools/testing/selftests/vfio/.gitignore | 7 + tools/testing/selftests/vfio/Makefile | 20 + .../testing/selftests/vfio/lib/drivers/dsa.c | 416 +++ .../testing/selftests/vfio/lib/drivers/ioat.c | 235 ++ .../selftests/vfio/lib/include/vfio_util.h | 271 ++ tools/testing/selftests/vfio/lib/libvfio.mk | 26 + .../selftests/vfio/lib/vfio_pci_device.c | 573 +++ .../selftests/vfio/lib/vfio_pci_driver.c | 126 + tools/testing/selftests/vfio/run.sh | 110 + .../selftests/vfio/vfio_dma_mapping_test.c | 239 ++ .../selftests/vfio/vfio_iommufd_setup_test.c | 133 + .../selftests/vfio/vfio_pci_device_test.c | 195 + .../selftests/vfio/vfio_pci_driver_test.c | 256 ++ 29 files changed, 7780 insertions(+), 2 deletions(-) create mode 100644 tools/arch/x86/include/asm/io.h create mode 100644 tools/arch/x86/include/asm/special_insns.h create mode 100644 tools/include/asm-generic/io.h create mode 100644 tools/include/asm/io.h create mode 100644 tools/include/drivers/dma/idxd/registers.h create mode 100644 tools/include/drivers/dma/ioat/hw.h create mode 100644 tools/include/drivers/dma/ioat/registers.h create mode 100644 tools/include/linux/pci_ids.h create mode 100644 tools/testing/selftests/kvm/vfio_pci_device_irq_test.c create mode 100644 tools/testing/selftests/vfio/.gitignore create mode 100644 tools/testing/selftests/vfio/Makefile create mode 100644 tools/testing/selftests/vfio/lib/drivers/dsa.c create mode 100644 tools/testing/selftests/vfio/lib/drivers/ioat.c create mode 100644 tools/testing/selftests/vfio/lib/include/vfio_util.h create mode 100644 tools/testing/selftests/vfio/lib/libvfio.mk create mode 100644 tools/testing/selftests/vfio/lib/vfio_pci_device.c create mode 100644 tools/testing/selftests/vfio/lib/vfio_pci_driver.c create mode 100755 tools/testing/selftests/vfio/run.sh create mode 100644 tools/testing/selftests/vfio/vfio_dma_mapping_test.c create mode 100644 tools/testing/selftests/vfio/vfio_iommufd_setup_test.c create mode 100644 tools/testing/selftests/vfio/vfio_pci_device_test.c create mode 100644 tools/testing/selftests/vfio/vfio_pci_driver_test.c base-commit: a11a72229881d8ac1d52ea727101bc9c744189c1 prerequisite-patch-id: 3bae97c9e1093148763235f47a84fa040b512d04 -- 2.49.0.1151.ga128411c76-goog

1 week

10
63
0 0

[PATCH bpf-next v3 0/2] bpf: Fix and test aux usage after do_check_insn()

by Luis Gerhorst

Fix cur_aux()->nospec_result test after do_check_insn() referring to the to-be-analyzed (potentially unsafe) instruction, not the already-analyzed (safe) instruction. This might allow a unsafe insn to slip through on a speculative path. Create some tests from the reproducer [1]. Commit d6f1c85f2253 ("bpf: Fall back to nospec for Spectre v1") should not be in any stable kernel yet, therefore bpf-next should suffice. [1] https://lore.kernel.org/bpf/685b3c1b.050a0220.2303ee.0010.GAE@google.com/ Changes since v2: - Use insn_aux variable instead of introducing prev_aux() as suggested by Eduard (and therefore also drop patch 1) - v2: https://lore.kernel.org/bpf/20250628145016.784256-1-luis.gerhorst@fau.de/ Changes since v1: - Fix compiler error due to missed rename of prev_insn_idx in first patch - v1: https://lore.kernel.org/bpf/20250628125927.763088-1-luis.gerhorst@fau.de/ Changes since RFC: - Introduce prev_aux() as suggested by Alexei. For this, we must move the env->prev_insn_idx assignment to happen directly after do_check_insn(), for which I have created a separate commit. This patch could be simplified by using a local prev_aux variable as sugested by Eduard, but I figured one might find the new assignment-strategy easier to understand (before, prev_insn_idx and env->prev_insn_idx were out-of-sync for the latter part of the loop). Also, like this we do not have an additional prev_* variable that must be kept in-sync and the local variable's usage (old prev_insn_idx, new tmp) is much more local. If you think it would be better to not take the risk and keep the fix simple by just introducing the prev_aux variable, let me know. - Change WARN_ON_ONCE() to verifier_bug_if() as suggested by Alexei - Change assertion to check instruction is BPF_JMP[32] as suggested by Eduard - RFC: https://lore.kernel.org/bpf/8734bmoemx.fsf@fau.de/ Luis Gerhorst (2): bpf: Fix aux usage after do_check_insn() selftests/bpf: Add Spectre v4 tests kernel/bpf/verifier.c | 19 ++- tools/testing/selftests/bpf/progs/bpf_misc.h | 4 + .../selftests/bpf/progs/verifier_unpriv.c | 149 ++++++++++++++++++ 3 files changed, 167 insertions(+), 5 deletions(-) base-commit: 03fe01ddd1d8be7799419ea5e5f228a0186ae8c2 -- 2.49.0

1 week

3
4
0 0

[PATCH] selftests/bpf: Set CONFIG_PACKET=y for selftests

by Saket Kumar Bhaskar

BPF selftest fails to build with below error: CLNG-BPF [test_progs] lsm_cgroup.bpf.o progs/lsm_cgroup.c:105:21: error: variable has incomplete type 'struct sockaddr_ll' 105 | struct sockaddr_ll sa = {}; | ^ progs/lsm_cgroup.c:105:9: note: forward declaration of 'struct sockaddr_ll' 105 | struct sockaddr_ll sa = {}; | ^ 1 error generated. lsm_cgroup selftest requires sockaddr_ll structure which is not there in vmlinux.h when the kernel is built with CONFIG_PACKET=m. Enabling CONFIG_PACKET=y ensures that sockaddr_ll is available in vmlinux, allowing it to be captured in the generated vmlinux.h for bpf selftests. Reported-by: Sachin P Bappalige <sachinpb(a)linux.ibm.com> Signed-off-by: Saket Kumar Bhaskar <skb99(a)linux.ibm.com> --- tools/testing/selftests/bpf/config | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config index f74e1ea0ad3b..7247833fe623 100644 --- a/tools/testing/selftests/bpf/config +++ b/tools/testing/selftests/bpf/config @@ -105,6 +105,7 @@ CONFIG_IP_NF_IPTABLES=y CONFIG_IP6_NF_IPTABLES=y CONFIG_IP6_NF_FILTER=y CONFIG_NF_NAT=y +CONFIG_PACKET=y CONFIG_RC_CORE=y CONFIG_SECURITY=y CONFIG_SECURITYFS=y -- 2.43.5

1 week

3
2
0 0

[RFC v2] tools/nolibc: add sigaction()

by Benjamin Berg

From: Benjamin Berg <benjamin.berg(a)intel.com> In preparation to add tests that use it. Note that some architectures do not have a usable linux/signal.h include file. However, in those cases we can use asm-generic/signal.h instead. Signed-off-by: Benjamin Berg <benjamin.berg(a)intel.com> --- Another attempt at signal handling for nolibc which should actually be working. Some trickery is needed to get the right definition, but I feel it is sufficiently clean this way. Submitting this as RFC mostly because I do not yet have a proper patch to add a test that uses the feature. Benjamin --- tools/include/nolibc/arch-aarch64.h | 3 + tools/include/nolibc/arch-arm.h | 7 ++ tools/include/nolibc/arch-i386.h | 13 +++ tools/include/nolibc/arch-loongarch.h | 3 + tools/include/nolibc/arch-m68k.h | 10 ++ tools/include/nolibc/arch-mips.h | 3 + tools/include/nolibc/arch-powerpc.h | 8 ++ tools/include/nolibc/arch-riscv.h | 3 + tools/include/nolibc/arch-s390.h | 8 +- tools/include/nolibc/arch-sparc.h | 43 +++++++++ tools/include/nolibc/arch-x86_64.h | 10 ++ tools/include/nolibc/signal.h | 97 ++++++++++++++++++++ tools/include/nolibc/sys.h | 2 +- tools/include/nolibc/time.h | 3 +- tools/testing/selftests/nolibc/nolibc-test.c | 52 +++++++++++ 15 files changed, 261 insertions(+), 4 deletions(-) diff --git a/tools/include/nolibc/arch-aarch64.h b/tools/include/nolibc/arch-aarch64.h index 937a348da42e..736aae6dbd47 100644 --- a/tools/include/nolibc/arch-aarch64.h +++ b/tools/include/nolibc/arch-aarch64.h @@ -10,6 +10,9 @@ #include "compiler.h" #include "crt.h" +/* Architecture has a usable linux/signal.h */ +#include <linux/signal.h> + /* Syscalls for AARCH64 : * - registers are 64-bit * - stack is 16-byte aligned diff --git a/tools/include/nolibc/arch-arm.h b/tools/include/nolibc/arch-arm.h index 1f66e7e5a444..1faf6c2dbeb8 100644 --- a/tools/include/nolibc/arch-arm.h +++ b/tools/include/nolibc/arch-arm.h @@ -10,6 +10,13 @@ #include "compiler.h" #include "crt.h" +/* Needed to get the correct struct sigaction definition */ +#define SA_RESTORER 0x04000000 + +/* Avoid linux/signal.h, it has an incorrect _NSIG and sigset_t */ +#include <asm-generic/signal.h> +#include <asm-generic/siginfo.h> + /* Syscalls for ARM in ARM or Thumb modes : * - registers are 32-bit * - stack is 8-byte aligned diff --git a/tools/include/nolibc/arch-i386.h b/tools/include/nolibc/arch-i386.h index 7c9b38e96418..fbec7490a92c 100644 --- a/tools/include/nolibc/arch-i386.h +++ b/tools/include/nolibc/arch-i386.h @@ -10,6 +10,19 @@ #include "compiler.h" #include "crt.h" +/* Needed to get the correct struct sigaction definition */ +#define SA_RESTORER 0x04000000 + +/* Restorer must be set on i386 */ +#define _NOLIBC_ARCH_NEEDS_SA_RESTORER + +/* Otherwise we would need to use sigreturn instead of rt_sigreturn */ +#define _NOLIBC_ARCH_FORCE_SIG_FLAGS SA_SIGINFO + +/* Avoid linux/signal.h, it has an incorrect _NSIG and sigset_t */ +#include <asm-generic/signal.h> +#include <asm-generic/siginfo.h> + /* Syscalls for i386 : * - mostly similar to x86_64 * - registers are 32-bit diff --git a/tools/include/nolibc/arch-loongarch.h b/tools/include/nolibc/arch-loongarch.h index 5511705303ea..68d60d04ef59 100644 --- a/tools/include/nolibc/arch-loongarch.h +++ b/tools/include/nolibc/arch-loongarch.h @@ -10,6 +10,9 @@ #include "compiler.h" #include "crt.h" +/* Architecture has a usable linux/signal.h */ +#include <linux/signal.h> + /* Syscalls for LoongArch : * - stack is 16-byte aligned * - syscall number is passed in a7 diff --git a/tools/include/nolibc/arch-m68k.h b/tools/include/nolibc/arch-m68k.h index 6dac1845f298..981b4cc55a69 100644 --- a/tools/include/nolibc/arch-m68k.h +++ b/tools/include/nolibc/arch-m68k.h @@ -13,6 +13,16 @@ #include "compiler.h" #include "crt.h" +/* + * Needed to get the correct struct sigaction definition. m68k does not use + * sa_restorer, but it is included in the structure. + */ +#define SA_RESTORER 0x04000000 + +/* Avoid linux/signal.h, it has an incorrect _NSIG and sigset_t */ +#include <asm-generic/signal.h> +#include <asm-generic/siginfo.h> + #define _NOLIBC_SYSCALL_CLOBBERLIST "memory" #define my_syscall0(num) \ diff --git a/tools/include/nolibc/arch-mips.h b/tools/include/nolibc/arch-mips.h index 753a8ed2cf69..a8837452e744 100644 --- a/tools/include/nolibc/arch-mips.h +++ b/tools/include/nolibc/arch-mips.h @@ -14,6 +14,9 @@ #error Unsupported MIPS ABI #endif +/* Architecture has a usable linux/signal.h */ +#include <linux/signal.h> + /* Syscalls for MIPS ABI O32 : * - WARNING! there's always a delayed slot! * - WARNING again, the syntax is different, registers take a '$' and numbers diff --git a/tools/include/nolibc/arch-powerpc.h b/tools/include/nolibc/arch-powerpc.h index 204564bbcd32..c846a7ddcf3c 100644 --- a/tools/include/nolibc/arch-powerpc.h +++ b/tools/include/nolibc/arch-powerpc.h @@ -10,6 +10,14 @@ #include "compiler.h" #include "crt.h" +/* Needed to get the correct struct sigaction definition */ +#define SA_RESTORER 0x04000000 +#define _NOLIBC_ARCH_NEEDS_SA_RESTORER + +/* Avoid linux/signal.h, it has an incorrect _NSIG and sigset_t */ +#include <asm-generic/signal.h> +#include <asm-generic/siginfo.h> + /* Syscalls for PowerPC : * - stack is 16-byte aligned * - syscall number is passed in r0 diff --git a/tools/include/nolibc/arch-riscv.h b/tools/include/nolibc/arch-riscv.h index 885383a86c38..709e6a262d9a 100644 --- a/tools/include/nolibc/arch-riscv.h +++ b/tools/include/nolibc/arch-riscv.h @@ -10,6 +10,9 @@ #include "compiler.h" #include "crt.h" +/* Architecture has a usable linux/signal.h */ +#include <linux/signal.h> + /* Syscalls for RISCV : * - stack is 16-byte aligned * - syscall number is passed in a7 diff --git a/tools/include/nolibc/arch-s390.h b/tools/include/nolibc/arch-s390.h index df4c3cc713ac..0dccb6d1ad64 100644 --- a/tools/include/nolibc/arch-s390.h +++ b/tools/include/nolibc/arch-s390.h @@ -5,13 +5,19 @@ #ifndef _NOLIBC_ARCH_S390_H #define _NOLIBC_ARCH_S390_H -#include <linux/signal.h> #include <linux/unistd.h> #include "compiler.h" #include "crt.h" #include "std.h" +/* Needed to get the correct struct sigaction definition */ +#define SA_RESTORER 0x04000000 + +/* Avoid linux/signal.h, it has an incorrect _NSIG and sigset_t */ +#include <asm-generic/signal.h> +#include <asm-generic/siginfo.h> + /* Syscalls for s390: * - registers are 64-bit * - syscall number is passed in r1 diff --git a/tools/include/nolibc/arch-sparc.h b/tools/include/nolibc/arch-sparc.h index 1435172f3dfe..303291d4b8fb 100644 --- a/tools/include/nolibc/arch-sparc.h +++ b/tools/include/nolibc/arch-sparc.h @@ -12,6 +12,19 @@ #include "compiler.h" #include "crt.h" +/* Otherwise we would need to use sigreturn instead of rt_sigreturn */ +#define _NOLIBC_ARCH_FORCE_SIG_FLAGS SA_SIGINFO + +/* The includes are sane, if one sets __WANT_POSIX1B_SIGNALS__ */ +#define __WANT_POSIX1B_SIGNALS__ +#include <linux/signal.h> + +/* + * sparc has ODD_RT_SIGACTION, we always pass our restorer as an argument + * to rt_sigaction. The restorer is implemented in this file. + */ +#define _NOLIBC_RT_SIGACTION_PASSES_RESTORER + /* * Syscalls for SPARC: * - registers are native word size @@ -188,4 +201,34 @@ pid_t sys_fork(void) } #define sys_fork sys_fork +#define __nolibc_stringify_1(x...) #x +#define __nolibc_stringify(x...) __stringify_1(x) + +/* The compiler insists on adding a SAVE call to the start of every function */ +#define __nolibc_sa_restorer __nolibc_sa_restorer +void __nolibc_sa_restorer (void); +#ifdef __arch64__ +__asm__( \ + ".section .text\n" \ + ".align 4 \n" \ + "__nolibc_sa_restorer:\n" \ + "nop\n" \ + "nop\n" \ + "mov " __stringify(__NR_rt_sigreturn) ", %g1 \n" \ + "t 0x6d \n"); +#else +__asm__( \ + ".section .text\n" \ + ".align 4 \n" \ + "__nolibc_sa_restorer:\n" \ + "nop\n" \ + "nop\n" \ + "mov " __stringify(__NR_rt_sigreturn) ", %g1 \n" \ + "t 0x10 \n" \ + ); +#endif + +#undef __nolibc_stringify_1(x...) +#undef __nolibc_stringify + #endif /* _NOLIBC_ARCH_SPARC_H */ diff --git a/tools/include/nolibc/arch-x86_64.h b/tools/include/nolibc/arch-x86_64.h index 67305e24dbef..9f13a2205876 100644 --- a/tools/include/nolibc/arch-x86_64.h +++ b/tools/include/nolibc/arch-x86_64.h @@ -10,6 +10,16 @@ #include "compiler.h" #include "crt.h" +/* Needed to get the correct struct sigaction definition */ +#define SA_RESTORER 0x04000000 + +/* Restorer must be set on i386 */ +#define _NOLIBC_ARCH_NEEDS_SA_RESTORER + +/* Avoid linux/signal.h, it has an incorrect _NSIG and sigset_t */ +#include <asm-generic/signal.h> +#include <asm-generic/siginfo.h> + /* Syscalls for x86_64 : * - registers are 64-bit * - syscall number is passed in rax diff --git a/tools/include/nolibc/signal.h b/tools/include/nolibc/signal.h index ac13e53ac31d..829672250ede 100644 --- a/tools/include/nolibc/signal.h +++ b/tools/include/nolibc/signal.h @@ -14,6 +14,8 @@ #include "arch.h" #include "types.h" #include "sys.h" +#include "string.h" +/* signal definitions are included by arch.h */ /* This one is not marked static as it's needed by libgcc for divide by zero */ int raise(int signal); @@ -23,4 +25,99 @@ int raise(int signal) return sys_kill(sys_getpid(), signal); } +/* + * sigaction(int signum, const struct sigaction *act, struct sigaction *oldact) + */ +#if defined(_NOLIBC_ARCH_NEEDS_SA_RESTORER) && !defined(__nolibc_sa_restorer) +static __attribute__((noreturn)) __nolibc_entrypoint __no_stack_protector +void __nolibc_sa_restorer(void) +{ + my_syscall0(__NR_rt_sigreturn); + __nolibc_entrypoint_epilogue(); +} +#endif + +static __attribute__((unused)) +int sys_rt_sigaction(int signum, const struct sigaction *act, struct sigaction *oldact) +{ + struct sigaction real_act = *act; +#if defined(SA_RESTORER) && defined(_NOLIBC_ARCH_NEEDS_SA_RESTORER) + if (!(real_act.sa_flags & SA_RESTORER)) { + real_act.sa_flags |= SA_RESTORER; + real_act.sa_restorer = __nolibc_sa_restorer; + } +#endif +#ifdef _NOLIBC_ARCH_FORCE_SIG_FLAGS + real_act.sa_flags |= _NOLIBC_ARCH_FORCE_SIG_FLAGS; +#endif + +#ifndef _NOLIBC_RT_SIGACTION_PASSES_RESTORER + return my_syscall4(__NR_rt_sigaction, signum, &real_act, oldact, + sizeof(act->sa_mask)); +#else + return my_syscall5(__NR_rt_sigaction, signum, &real_act, oldact, + __nolibc_sa_restorer, sizeof(act->sa_mask)); +#endif +} + +static __attribute__((unused)) +int sigaction(int signum, const struct sigaction *act, struct sigaction *oldact) +{ + return __sysret(sys_rt_sigaction(signum, act, oldact)); +} + +/* + * int sigemptyset(sigset_t *set) + */ +static __attribute__((unused)) +int sigemptyset(sigset_t *set) +{ + memset(set, 0, sizeof(*set)); + return 0; +} + +/* + * int sigfillset(sigset_t *set) + */ +static __attribute__((unused)) +int sigfillset(sigset_t *set) +{ + memset(set, 0xff, sizeof(*set)); + return 0; +} + +/* + * int sigaddset(sigset_t *set, int signum) + */ +static __attribute__((unused)) +int sigaddset(sigset_t *set, int signum) +{ + set->sig[(signum - 1) / (8 * sizeof(set->sig[0]))] |= + 1UL << ((signum - 1) % (8 * sizeof(set->sig[0]))); + return 0; +} + +/* + * int sigdelset(sigset_t *set, int signum) + */ +static __attribute__((unused)) +int sigdelset(sigset_t *set, int signum) +{ + set->sig[(signum - 1) / (8 * sizeof(set->sig[0]))] &= + ~(1UL << ((signum - 1) % (8 * sizeof(set->sig[0])))); + return 0; +} + +/* + * int sigismember(sigset_t *set, int signum) + */ +static __attribute__((unused)) +int sigismember(sigset_t *set, int signum) +{ + unsigned long res = + set->sig[(signum - 1) / (8 * sizeof(set->sig[0]))] & + (1UL << ((signum - 1) % (8 * sizeof(set->sig[0])))); + return !!res; +} + #endif /* _NOLIBC_SIGNAL_H */ diff --git a/tools/include/nolibc/sys.h b/tools/include/nolibc/sys.h index 9556c69a6ae1..fb9a1ccf2440 100644 --- a/tools/include/nolibc/sys.h +++ b/tools/include/nolibc/sys.h @@ -14,7 +14,6 @@ /* system includes */ #include <linux/unistd.h> -#include <linux/signal.h> /* for SIGCHLD */ #include <linux/termios.h> #include <linux/mman.h> #include <linux/fs.h> @@ -23,6 +22,7 @@ #include <linux/auxvec.h> #include <linux/fcntl.h> /* for O_* and AT_* */ #include <linux/stat.h> /* for statx() */ +/* signal definitions are included by arch.h */ #include "errno.h" #include "stdarg.h" diff --git a/tools/include/nolibc/time.h b/tools/include/nolibc/time.h index fc387940d51f..70ef31d4117f 100644 --- a/tools/include/nolibc/time.h +++ b/tools/include/nolibc/time.h @@ -14,9 +14,8 @@ #include "arch.h" #include "types.h" #include "sys.h" - -#include <linux/signal.h> #include <linux/time.h> +/* signal definitions are included by arch.h */ static __inline__ void __nolibc_timespec_user_to_kernel(const struct timespec *ts, struct __kernel_timespec *kts) diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c index dbe13000fb1a..af66b739ea18 100644 --- a/tools/testing/selftests/nolibc/nolibc-test.c +++ b/tools/testing/selftests/nolibc/nolibc-test.c @@ -1750,6 +1750,57 @@ static int run_protection(int min __attribute__((unused)), } } +volatile int signal_check; + +void test_sighandler(int signum) +{ + if (signum == SIGUSR1) { + kill(getpid(), SIGUSR2); + signal_check = 1; + } else { + signal_check++; + } +} + +int run_signal(int min, int max) +{ + struct sigaction sa = { + .sa_flags = 0, + .sa_handler = test_sighandler, + }; + int llen; /* line length */ + int ret = 0; + int res; + + (void)min; + (void)max; + + signal_check = 0; + + sigemptyset(&sa.sa_mask); + sigaddset(&sa.sa_mask, SIGUSR2); + + res = sigaction(SIGUSR1, &sa, NULL); + llen = printf("register SIGUSR1: %d", res); + EXPECT_EQ(1, 0, res); + res = sigaction(SIGUSR2, &sa, NULL); + llen = printf("register SIGUSR2: %d", res); + EXPECT_EQ(1, 0, res); + + /* Trigger the first signal. */ + kill(getpid(), SIGUSR1); + + /* If signal_check is 1 or higher, then signal emission worked */ + llen = printf("signal emission: 1 <= signal_check"); + EXPECT_GE(1, signal_check, 1); + + /* If it is 2, then signal masking worked */ + llen = printf("signal masking: 2 == signal_check"); + EXPECT_EQ(1, signal_check, 2); + + return ret; +} + /* prepare what needs to be prepared for pid 1 (stdio, /dev, /proc, etc) */ int prepare(void) { @@ -1815,6 +1866,7 @@ static const struct test test_names[] = { { .name = "stdlib", .func = run_stdlib }, { .name = "printf", .func = run_printf }, { .name = "protection", .func = run_protection }, + { .name = "signal", .func = run_signal }, { 0 } }; -- 2.50.0

1 week

3
5
0 0

[PATCH v14 3/3] selftests/rseq: Add test for mm_cid compaction

by Gabriele Monaco

A task in the kernel (task_mm_cid_work) runs somewhat periodically to compact the mm_cid for each process. Add a test to validate that it runs correctly and timely. The test spawns 1 thread pinned to each CPU, then each thread, including the main one, runs in short bursts for some time. During this period, the mm_cids should be spanning all numbers between 0 and nproc. At the end of this phase, a thread with high enough mm_cid (>= nproc/2) is selected to be the new leader, all other threads terminate. After some time, the only running thread should see 0 as mm_cid, if that doesn't happen, the compaction mechanism didn't work and the test fails. The test never fails if only 1 core is available, in which case, we cannot test anything as the only available mm_cid is 0. Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com> Acked-by: Shuah Khan <skhan(a)linuxfoundation.org> Signed-off-by: Gabriele Monaco <gmonaco(a)redhat.com> --- tools/testing/selftests/rseq/.gitignore | 1 + tools/testing/selftests/rseq/Makefile | 2 +- .../selftests/rseq/mm_cid_compaction_test.c | 200 ++++++++++++++++++ 3 files changed, 202 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore index 0fda241fa62b0..b3920c59bf401 100644 --- a/tools/testing/selftests/rseq/.gitignore +++ b/tools/testing/selftests/rseq/.gitignore @@ -3,6 +3,7 @@ basic_percpu_ops_test basic_percpu_ops_mm_cid_test basic_test basic_rseq_op_test +mm_cid_compaction_test param_test param_test_benchmark param_test_compare_twice diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile index 0d0a5fae59547..bc4d940f66d40 100644 --- a/tools/testing/selftests/rseq/Makefile +++ b/tools/testing/selftests/rseq/Makefile @@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \ param_test_benchmark param_test_compare_twice param_test_mm_cid \ param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \ - syscall_errors_test + syscall_errors_test mm_cid_compaction_test TEST_GEN_PROGS_EXTENDED = librseq.so diff --git a/tools/testing/selftests/rseq/mm_cid_compaction_test.c b/tools/testing/selftests/rseq/mm_cid_compaction_test.c new file mode 100644 index 0000000000000..7ddde3b657dd6 --- /dev/null +++ b/tools/testing/selftests/rseq/mm_cid_compaction_test.c @@ -0,0 +1,200 @@ +// SPDX-License-Identifier: LGPL-2.1 +#define _GNU_SOURCE +#include <assert.h> +#include <pthread.h> +#include <sched.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <stddef.h> + +#include "../kselftest.h" +#include "rseq.h" + +#define VERBOSE 0 +#define printf_verbose(fmt, ...) \ + do { \ + if (VERBOSE) \ + printf(fmt, ##__VA_ARGS__); \ + } while (0) + +/* 0.5 s */ +#define RUNNER_PERIOD 500000 +/* Number of runs before we terminate or get the token */ +#define THREAD_RUNS 5 + +/* + * Number of times we check that the mm_cid were compacted. + * Checks are repeated every RUNNER_PERIOD. + */ +#define MM_CID_COMPACT_TIMEOUT 10 + +struct thread_args { + int cpu; + int num_cpus; + pthread_mutex_t *token; + pthread_barrier_t *barrier; + pthread_t *tinfo; + struct thread_args *args_head; +}; + +static void __noreturn *thread_runner(void *arg) +{ + struct thread_args *args = arg; + int i, ret, curr_mm_cid; + cpu_set_t cpumask; + + CPU_ZERO(&cpumask); + CPU_SET(args->cpu, &cpumask); + ret = pthread_setaffinity_np(pthread_self(), sizeof(cpumask), &cpumask); + if (ret) { + errno = ret; + perror("Error: failed to set affinity"); + abort(); + } + pthread_barrier_wait(args->barrier); + + for (i = 0; i < THREAD_RUNS; i++) + usleep(RUNNER_PERIOD); + curr_mm_cid = rseq_current_mm_cid(); + /* + * We select one thread with high enough mm_cid to be the new leader. + * All other threads (including the main thread) will terminate. + * After some time, the mm_cid of the only remaining thread should + * converge to 0, if not, the test fails. + */ + if (curr_mm_cid >= args->num_cpus / 2 && + !pthread_mutex_trylock(args->token)) { + printf_verbose( + "cpu%d has mm_cid=%d and will be the new leader.\n", + sched_getcpu(), curr_mm_cid); + for (i = 0; i < args->num_cpus; i++) { + if (args->tinfo[i] == pthread_self()) + continue; + ret = pthread_join(args->tinfo[i], NULL); + if (ret) { + errno = ret; + perror("Error: failed to join thread"); + abort(); + } + } + pthread_barrier_destroy(args->barrier); + free(args->tinfo); + free(args->token); + free(args->barrier); + free(args->args_head); + + for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) { + curr_mm_cid = rseq_current_mm_cid(); + printf_verbose("run %d: mm_cid=%d on cpu%d.\n", i, + curr_mm_cid, sched_getcpu()); + if (curr_mm_cid == 0) + exit(EXIT_SUCCESS); + usleep(RUNNER_PERIOD); + } + exit(EXIT_FAILURE); + } + printf_verbose("cpu%d has mm_cid=%d and is going to terminate.\n", + sched_getcpu(), curr_mm_cid); + pthread_exit(NULL); +} + +int test_mm_cid_compaction(void) +{ + cpu_set_t affinity; + int i, j, ret = 0, num_threads; + pthread_t *tinfo; + pthread_mutex_t *token; + pthread_barrier_t *barrier; + struct thread_args *args; + + sched_getaffinity(0, sizeof(affinity), &affinity); + num_threads = CPU_COUNT(&affinity); + tinfo = calloc(num_threads, sizeof(*tinfo)); + if (!tinfo) { + perror("Error: failed to allocate tinfo"); + return -1; + } + args = calloc(num_threads, sizeof(*args)); + if (!args) { + perror("Error: failed to allocate args"); + ret = -1; + goto out_free_tinfo; + } + token = malloc(sizeof(*token)); + if (!token) { + perror("Error: failed to allocate token"); + ret = -1; + goto out_free_args; + } + barrier = malloc(sizeof(*barrier)); + if (!barrier) { + perror("Error: failed to allocate barrier"); + ret = -1; + goto out_free_token; + } + if (num_threads == 1) { + fprintf(stderr, "Cannot test on a single cpu. " + "Skipping mm_cid_compaction test.\n"); + /* only skipping the test, this is not a failure */ + goto out_free_barrier; + } + pthread_mutex_init(token, NULL); + ret = pthread_barrier_init(barrier, NULL, num_threads); + if (ret) { + errno = ret; + perror("Error: failed to initialise barrier"); + goto out_free_barrier; + } + for (i = 0, j = 0; i < CPU_SETSIZE && j < num_threads; i++) { + if (!CPU_ISSET(i, &affinity)) + continue; + args[j].num_cpus = num_threads; + args[j].tinfo = tinfo; + args[j].token = token; + args[j].barrier = barrier; + args[j].cpu = i; + args[j].args_head = args; + if (!j) { + /* The first thread is the main one */ + tinfo[0] = pthread_self(); + ++j; + continue; + } + ret = pthread_create(&tinfo[j], NULL, thread_runner, &args[j]); + if (ret) { + errno = ret; + perror("Error: failed to create thread"); + abort(); + } + ++j; + } + printf_verbose("Started %d threads.\n", num_threads); + + /* Also main thread will terminate if it is not selected as leader */ + thread_runner(&args[0]); + + /* only reached in case of errors */ +out_free_barrier: + free(barrier); +out_free_token: + free(token); +out_free_args: + free(args); +out_free_tinfo: + free(tinfo); + + return ret; +} + +int main(int argc, char **argv) +{ + if (!rseq_mm_cid_available()) { + fprintf(stderr, "Error: rseq_mm_cid unavailable\n"); + return -1; + } + if (test_mm_cid_compaction()) + return -1; + return 0; +} -- 2.50.0

1 week

1
0
0 0

[PATCH 0/5] selftests: vDSO: Clean up vdso_test_abi and drop vdso_test_clock_getres

by Thomas Weißschuh

Some cleanups for the vDSO selftests. Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de> --- Thomas Weißschuh (5): selftests: vDSO: vdso_test_abi: Use ksft_finished() selftests: vDSO: vdso_test_abi: Drop clock availability tests selftests: vDSO: vdso_test_abi: Use explicit indices for name array selftests: vDSO: vdso_test_abi: Test CPUTIME clocks selftests: vDSO: Drop vdso_test_clock_getres tools/testing/selftests/vDSO/Makefile | 2 - tools/testing/selftests/vDSO/vdso_test_abi.c | 57 +++------- .../selftests/vDSO/vdso_test_clock_getres.c | 123 --------------------- 3 files changed, 17 insertions(+), 165 deletions(-) --- base-commit: 437079605c26dc7c98586580a8c01b5f7f746a79 change-id: 20250707-vdso-tests-fixes-7e4ddffd7f27 Best regards, -- Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>

1 week

1
5
0 0

[PATCH] selftests/arm64: Prevent build warnings from -Wmaybe-uninitialized

by Anshuman Khandual

Arguments passed into WEXITSTATUS() should have been initialized earlier. Otherwise following warning show up while building platform selftests on arm64. Hence just zero out all the relevant local variables to avoid the build warning. Warning: ‘status’ may be used uninitialized in this function [-Wmaybe-uninitialized] Cc: Catalin Marinas <catalin.marinas(a)arm.com> Cc: Will Deacon <will(a)kernel.org> Cc: Shuah Khan <shuah(a)kernel.org> Cc: Mark Brown <broonie(a)kernel.org> Cc: linux-arm-kernel(a)lists.infradead.org Cc: linux-kselftest(a)vger.kernel.org Cc: linux-kernel(a)vger.kernel.org Signed-off-by: Anshuman Khandual <anshuman.khandual(a)arm.com> --- This applies on v6.16-rc3 tools/testing/selftests/arm64/abi/tpidr2.c | 2 +- tools/testing/selftests/arm64/fp/za-fork.c | 2 +- tools/testing/selftests/arm64/gcs/basic-gcs.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/arm64/abi/tpidr2.c b/tools/testing/selftests/arm64/abi/tpidr2.c index eb19dcc37a755..389a60e5feabf 100644 --- a/tools/testing/selftests/arm64/abi/tpidr2.c +++ b/tools/testing/selftests/arm64/abi/tpidr2.c @@ -96,7 +96,7 @@ static int write_sleep_read(void) static int write_fork_read(void) { pid_t newpid, waiting, oldpid; - int status; + int status = 0; set_tpidr2(getpid()); diff --git a/tools/testing/selftests/arm64/fp/za-fork.c b/tools/testing/selftests/arm64/fp/za-fork.c index 587b946482226..6098beb3515a0 100644 --- a/tools/testing/selftests/arm64/fp/za-fork.c +++ b/tools/testing/selftests/arm64/fp/za-fork.c @@ -24,7 +24,7 @@ int verify_fork(void); int fork_test_c(void) { pid_t newpid, waiting; - int child_status, parent_result; + int child_status = 0, parent_result; newpid = fork(); if (newpid == 0) { diff --git a/tools/testing/selftests/arm64/gcs/basic-gcs.c b/tools/testing/selftests/arm64/gcs/basic-gcs.c index 3fb9742342a34..2b350b6b7e12c 100644 --- a/tools/testing/selftests/arm64/gcs/basic-gcs.c +++ b/tools/testing/selftests/arm64/gcs/basic-gcs.c @@ -240,7 +240,7 @@ static bool map_guarded_stack(void) static bool test_fork(void) { unsigned long child_mode; - int ret, status; + int ret, status = 0; pid_t pid; bool pass = true; -- 2.30.2

1 week

3
4
0 0

LTP syscalls mseal02 and shmctl03 fails on compat mode 64-bit kernel on 32-bit rootfs

by Naresh Kamboju

The LTP syscalls mseal02 and shmctl03 failed only with compat mode testing with 64-bit kernel with 32-bit rootfs combination. Would it be possible to detect compat mode test environment and handle the test expectation in LTP test development ? Test case: - ltp-syscalls/mseal02 - ltp-syscalls/shmctl03 Test environments: - qemu-arm64-compat - qemu-x86_64-compat - x86-compat Regression Analysis: - New regression? Yes - Reproducibility? Yes Test regression: LTP mseal02.c:45: TFAIL: mseal(0xf7a8e000, 4294963201, 0) expected EINVAL: ENOMEM (12) Test regression: LTP shmctl03.c:33: TFAIL: /proc/sys/kernel/shmmax != 2147483647 got 4294967295 Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org> ## Test log tst_test.c:1953: TINFO: LTP version: 20250530 tst_test.c:1956: TINFO: Tested kernel: 6.16.0-rc4-next-20250701 #1 SMP PREEMPT @1751364932 aarch64 tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz' tst_kconfig.c:676: TINFO: CONFIG_TRACE_IRQFLAGS kernel option detected which might slow the execution tst_test.c:1774: TINFO: Overall timeout per run is 0h 21m 36s mseal02.c:45: TPASS: mseal(0xf7a8e000, 4096, 4294967295) : EINVAL (22) mseal02.c:45: TPASS: mseal(0xf7a8e001, 4096, 0) : EINVAL (22) mseal02.c:45: TFAIL: mseal(0xf7a8e000, 4294963201, 0) expected EINVAL: ENOMEM (12) mseal02.c:45: TPASS: mseal(0xf7a90000, 8192, 0) : ENOMEM (12) mseal02.c:45: TPASS: mseal(0xf7a8f000, 8192, 0) : ENOMEM (12) mseal02.c:45: TPASS: mseal(0xf7a8e000, 16384, 0) : ENOMEM (12) tst_test.c:1953: TINFO: LTP version: 20250530 tst_test.c:1956: TINFO: Tested kernel: 6.16.0-rc3-next-20250627 #1 SMP PREEMPT @1751015729 aarch64 tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz' tst_kconfig.c:676: TINFO: CONFIG_TRACE_IRQFLAGS kernel option detected which might slow the execution tst_test.c:1774: TINFO: Overall timeout per run is 0h 21m 36s shmctl03.c:31: TPASS: shmmin = 1 shmctl03.c:33: TFAIL: /proc/sys/kernel/shmmax != 2147483647 got 4294967295 shmctl03.c:34: TPASS: /proc/sys/kernel/shmmni = 4096 shmctl03.c:35: TFAIL: /proc/sys/kernel/shmall != 4278190079 got 4294967295 ## Source * Kernel version: 6.16.0-rc4-next-20250701 * Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git * Git sha: 3f804361f3b9af33e00b90ec9cb5afcc96831e60 * Git describe: 6.16.0-rc4-next-20250701 * Project details: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250701/ * Architectures: arm64 * Toolchains: gcc-13 * Build name: gcc-13-lkftconfig-compat ## Build arm64 * Test log: https://qa-reports.linaro.org/api/testruns/28971382/log_file/ * Test details 1: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250701/te… * Test details 2: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250627/te… * Test results compare 1: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250701/te… * Test results compare 2: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250627/te… * Build link: https://storage.tuxsuite.com/public/linaro/lkft/builds/2zGk12rggXwQHzqasQsW… * Kernel config: https://storage.tuxsuite.com/public/linaro/lkft/builds/2zGk12rggXwQHzqasQsW… -- Linaro LKFT https://lkft.linaro.org

1 week

2
2
0 0

[PATCH v3] selftests/mm: pagemap_scan ioctl: add PFN ZERO test cases

by Muhammad Usama Anjum

Add test cases to test the correctness of PFN ZERO flag of pagemap_scan ioctl. Test with normal pages backed memory and huge pages backed memory. Cc: David Hildenbrand <david(a)redhat.com> Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com> --- The bug has been fixed [1]. [1] https://lore.kernel.org/all/20250617143532.2375383-1-david@redhat.com Changes since v1: - Skip if madvise() fails - Skip test if use_zero_page isn't set to 1 - Keep on using memalign()+free() to allocate huge pages Changes sice v2: - Move zero page detection code out to vm_util and use that - Use mmap instead of memalign to allocate hugepage --- tools/testing/selftests/mm/cow.c | 27 +------- tools/testing/selftests/mm/pagemap_ioctl.c | 72 +++++++++++++++++++++- tools/testing/selftests/mm/vm_util.c | 23 +++++++ tools/testing/selftests/mm/vm_util.h | 2 + 4 files changed, 95 insertions(+), 29 deletions(-) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index b6cfe0a4b7dfd..b26bbf6ec4617 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -72,31 +72,6 @@ static int detect_thp_sizes(size_t sizes[], int max) return count; } -static void detect_huge_zeropage(void) -{ - int fd = open("/sys/kernel/mm/transparent_hugepage/use_zero_page", - O_RDONLY); - size_t enabled = 0; - char buf[15]; - int ret; - - if (fd < 0) - return; - - ret = pread(fd, buf, sizeof(buf), 0); - if (ret > 0 && ret < sizeof(buf)) { - buf[ret] = 0; - - enabled = strtoul(buf, NULL, 10); - if (enabled == 1) { - has_huge_zeropage = true; - ksft_print_msg("[INFO] huge zeropage is enabled\n"); - } - } - - close(fd); -} - static bool range_is_swapped(void *addr, size_t size) { for (; size; addr += pagesize, size -= pagesize) @@ -1791,7 +1766,7 @@ int main(int argc, char **argv) } nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes, ARRAY_SIZE(hugetlbsizes)); - detect_huge_zeropage(); + has_huge_zeropage = detect_huge_zeropage(); ksft_set_plan(ARRAY_SIZE(anon_test_cases) * tests_per_anon_test_case() + ARRAY_SIZE(anon_thp_test_cases) * tests_per_anon_thp_test_case() + diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c index 57b4bba2b45f3..059c6d5f971e7 100644 --- a/tools/testing/selftests/mm/pagemap_ioctl.c +++ b/tools/testing/selftests/mm/pagemap_ioctl.c @@ -1,4 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 + #define _GNU_SOURCE #include <stdio.h> #include <fcntl.h> @@ -34,8 +35,8 @@ #define PAGEMAP "/proc/self/pagemap" int pagemap_fd; int uffd; -unsigned int page_size; -unsigned int hpage_size; +size_t page_size; +size_t hpage_size; const char *progname; #define LEN(region) ((region.end - region.start)/page_size) @@ -1480,6 +1481,68 @@ static void transact_test(int page_size) extra_thread_faults); } +void zeropfn_tests(void) +{ + unsigned long long mem_size; + struct page_region vec; + int i, ret; + char *mmap_mem, *mem; + + /* Test with normal memory */ + mem_size = 10 * page_size; + mem = mmap(NULL, mem_size, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + + /* Touch each page to ensure it's mapped */ + for (i = 0; i < mem_size; i += page_size) + (void)((volatile char *)mem)[i]; + + ret = pagemap_ioctl(mem, mem_size, &vec, 1, 0, + (mem_size / page_size), PAGE_IS_PFNZERO, 0, 0, PAGE_IS_PFNZERO); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && LEN(vec) == (mem_size / page_size), + "%s all pages must have PFNZERO set\n", __func__); + + munmap(mem, mem_size); + + /* Test with huge page if user_zero_page is set to 1 */ + if (!detect_huge_zeropage()) { + ksft_test_result_skip("%s use_zero_page not supported or set to 1\n", __func__); + return; + } + + mem_size = 2 * hpage_size; + mmap_mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (mmap_mem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + + /* We need a THP-aligned memory area. */ + mem = (char *)(((uintptr_t)mmap_mem + hpage_size) & ~(hpage_size - 1)); + + ret = madvise(mem, hpage_size, MADV_HUGEPAGE); + if (!ret) { + char tmp = *mem; + + asm volatile("" : "+r" (tmp)); + + ret = pagemap_ioctl(mem, hpage_size, &vec, 1, 0, + 0, PAGE_IS_PFNZERO, 0, 0, PAGE_IS_PFNZERO); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && LEN(vec) == (hpage_size / page_size), + "%s all huge pages must have PFNZERO set\n", __func__); + } else { + ksft_test_result_skip("%s huge page not supported\n", __func__); + } + + munmap(mmap_mem, mem_size); +} + int main(int __attribute__((unused)) argc, char *argv[]) { int shmid, buf_size, fd, i, ret; @@ -1494,7 +1557,7 @@ int main(int __attribute__((unused)) argc, char *argv[]) if (init_uffd()) ksft_exit_pass(); - ksft_set_plan(115); + ksft_set_plan(117); page_size = getpagesize(); hpage_size = read_pmd_pagesize(); @@ -1669,6 +1732,9 @@ int main(int __attribute__((unused)) argc, char *argv[]) /* 16. Userfaultfd tests */ userfaultfd_tests(); + /* 17. ZEROPFN tests */ + zeropfn_tests(); + close(pagemap_fd); ksft_exit_pass(); } diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index a36734fb62f38..dde9e8ab4dc46 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -424,3 +424,26 @@ bool check_vmflag_io(void *addr) flags += flaglen; } } + +bool detect_huge_zeropage(void) +{ + int fd = open("/sys/kernel/mm/transparent_hugepage/use_zero_page", + O_RDONLY); + bool enabled = 0; + char buf[15]; + int ret; + + if (fd < 0) + return 0; + + ret = pread(fd, buf, sizeof(buf), 0); + if (ret > 0 && ret < sizeof(buf)) { + buf[ret] = 0; + + if (strtoul(buf, NULL, 10) == 1) + enabled = 1; + } + + close(fd); + return enabled; +} diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 6effafdc4d8a2..ca4c1f78ce18c 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -74,6 +74,8 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len, unsigned long get_free_hugepages(void); bool check_vmflag_io(void *addr); +bool detect_huge_zeropage(void); + /* * On ppc64 this will only work with radix 2M hugepage size */ -- 2.43.0

1 week

1
0
0 0

[PATCH] tools/nolibc: add support for clock_nanosleep() and nanosleep()

by Thomas Weißschuh

Also add some tests. Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de> --- tools/include/nolibc/time.h | 34 ++++++++++++++++++++++++++++ tools/testing/selftests/nolibc/nolibc-test.c | 1 + 2 files changed, 35 insertions(+) diff --git a/tools/include/nolibc/time.h b/tools/include/nolibc/time.h index fc387940d51f389d4233bd5712588dced31ae6e5..d02bc44d2643a5e39afa808841f7175bfab5ff7e 100644 --- a/tools/include/nolibc/time.h +++ b/tools/include/nolibc/time.h @@ -36,6 +36,8 @@ void __nolibc_timespec_kernel_to_user(const struct __kernel_timespec *kts, struc * int clock_getres(clockid_t clockid, struct timespec *res); * int clock_gettime(clockid_t clockid, struct timespec *tp); * int clock_settime(clockid_t clockid, const struct timespec *tp); + * int clock_nanosleep(clockid_t clockid, int flags, const struct timespec *rqtp, + * struct timespec *rmtp) */ static __attribute__((unused)) @@ -107,6 +109,32 @@ int clock_settime(clockid_t clockid, struct timespec *tp) return __sysret(sys_clock_settime(clockid, tp)); } +static __attribute__((unused)) +int sys_clock_nanosleep(clockid_t clockid, int flags, const struct timespec *rqtp, + struct timespec *rmtp) +{ +#if defined(__NR_clock_nanosleep) + return my_syscall4(__NR_clock_nanosleep, clockid, flags, rqtp, rmtp); +#elif defined(__NR_clock_nanosleep_time64) + struct __kernel_timespec krqtp, krmtp; + int ret; + + __nolibc_timespec_user_to_kernel(rqtp, &krqtp); + ret = my_syscall4(__NR_clock_nanosleep_time64, clockid, flags, &krqtp, &krmtp); + if (rmtp) + __nolibc_timespec_kernel_to_user(&krmtp, rmtp); + return ret; +#else + return __nolibc_enosys(__func__, clockid, flags, rqtp, rmtp); +#endif +} + +static __attribute__((unused)) +int clock_nanosleep(clockid_t clockid, int flags, const struct timespec *rqtp, + struct timespec *rmtp) +{ + return __sysret(sys_clock_nanosleep(clockid, flags, rqtp, rmtp)); +} static __inline__ double difftime(time_t time1, time_t time2) @@ -114,6 +142,12 @@ double difftime(time_t time1, time_t time2) return time1 - time2; } +static __inline__ +int nanosleep(const struct timespec *rqtp, struct timespec *rmtp) +{ + return clock_nanosleep(CLOCK_REALTIME, 0, rqtp, rmtp); +} + static __attribute__((unused)) time_t time(time_t *tptr) diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c index b5bca1dcf36e95a576ca9ffba4f7c213978a3f35..315229233930265501296dfeb9bc2838bb6fef84 100644 --- a/tools/testing/selftests/nolibc/nolibc-test.c +++ b/tools/testing/selftests/nolibc/nolibc-test.c @@ -1363,6 +1363,7 @@ int run_syscall(int min, int max) CASE_TEST(mmap_bad); EXPECT_PTRER(1, mmap(NULL, 0, PROT_READ, MAP_PRIVATE, 0, 0), MAP_FAILED, EINVAL); break; CASE_TEST(munmap_bad); EXPECT_SYSER(1, munmap(NULL, 0), -1, EINVAL); break; CASE_TEST(mmap_munmap_good); EXPECT_SYSZR(1, test_mmap_munmap()); break; + CASE_TEST(nanosleep); ts.tv_nsec = -1; EXPECT_SYSER(1, nanosleep(&ts, NULL), -1, EINVAL); break; CASE_TEST(open_tty); EXPECT_SYSNE(1, tmp = open("/dev/null", O_RDONLY), -1); if (tmp != -1) close(tmp); break; CASE_TEST(open_blah); EXPECT_SYSER(1, tmp = open("/proc/self/blah", O_RDONLY), -1, ENOENT); if (tmp != -1) close(tmp); break; CASE_TEST(openat_dir); EXPECT_SYSZR(1, test_openat()); break; --- base-commit: 1536aa0fb1e09cb50f401ec4852c60f38173d751 change-id: 20250704-nolibc-nanosleep-2476b806b0d5 Best regards, -- Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>

1 week, 1 day

3
2
0 0

[PATCH v2] selftests: futex: define SYS_futex on 32-bit architectures with 64-bit time_t

by Ben Zong-You Xie

glibc does not define SYS_futex for 32-bit architectures using 64-bit time_t e.g. riscv32, therefore this test fails to compile since it does not find SYS_futex in C library headers. Define SYS_futex as SYS_futex_time64 in this situation to ensure successful compilation and compatibility. Signed-off-by: Cynthia Huang <cynthia(a)andestech.com> Signed-off-by: Ben Zong-You Xie <ben717(a)andestech.com> --- Changes since v1: - Fix the SOB chain v1 : https://lore.kernel.org/all/20250527093536.3646143-1-ben717@andestech.com/ --- tools/testing/selftests/futex/include/futextest.h | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h index ddbcfc9b7bac..7a5fd1d5355e 100644 --- a/tools/testing/selftests/futex/include/futextest.h +++ b/tools/testing/selftests/futex/include/futextest.h @@ -47,6 +47,17 @@ typedef volatile u_int32_t futex_t; FUTEX_PRIVATE_FLAG) #endif +/* + * SYS_futex is expected from system C library, in glibc some 32-bit + * architectures (e.g. RV32) are using 64-bit time_t, therefore it doesn't have + * SYS_futex defined but just SYS_futex_time64. Define SYS_futex as + * SYS_futex_time64 in this situation to ensure the compilation and the + * compatibility. + */ +#if !defined(SYS_futex) && defined(SYS_futex_time64) +#define SYS_futex SYS_futex_time64 +#endif + /** * futex() - SYS_futex syscall wrapper * @uaddr: address of first futex -- 2.34.1

1 week, 1 day

3
2
0 0

[PATCH 0/2] selftests/nolibc: correctly report errors from printf() and friends

by Thomas Weißschuh

When an error is encountered by printf() it needs to be reported. Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de> --- Thomas Weißschuh (2): selftests/nolibc: create /dev/full when running as PID 1 selftests/nolibc: correctly report errors from printf() and friends tools/include/nolibc/stdio.h | 4 ++-- tools/testing/selftests/nolibc/nolibc-test.c | 27 ++++++++++++++++++++++++++- 2 files changed, 28 insertions(+), 3 deletions(-) --- base-commit: 1536aa0fb1e09cb50f401ec4852c60f38173d751 change-id: 20250704-nolibc-printf-error-aa54f951c0a4 Best regards, -- Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>

1 week, 1 day

2
3
0 0

[PATCH 00/15] selftests/futex: Refactor tests to use kselftest_harness.h

by André Almeida

This patch series refactors all futex selftests to use kselftest_harness.h instead of futex's logging.h, as discussed here [1]. This allows to remove a lot of boilerplate code and to simplify some parts of the test logic, mainly when the test needs to exit early. The result of this is more than 500 lines removed from tools/testing/selftests/futex/. Also, this enables new tests to use kselftest.h features like ASSERT_s and such. There are some caveats around this refactor: - logging.h had verbosity levels, while kselftest_harness.h doesn't. I created a new print function called ksft_print_dbg_msg() that prints the message if the user uses the -d flag, so now there's an equivalent of this feature. - futex_requeue_pi test accepted command line arguments to be used as test parameters (e.g. ./futex_requeue_pi -b -l -t 500000). This doesn't work with kselftest_harness.h because there's no straightforward way to send command line arguments to the test. I used FIXTURE_VARIANT() to achieve the same result, but now the parameters live inside of the test file, instead of on functional/run.sh. This increased a little bit the number of test cases for futex_requeue_pi, from 22 to 24. - test_harness_run() calls mmap(MAP_SHARED) before running the test and this has caused a side effect on test futex_numa_mpol.c. This test also calls mmap() and then try to access an address out of boundaries of this mapped memory for a "Memory out of range" subtest, where the kernel should return -EACCESS. After the refactor, the test address might be fall inside the first memory mapped region, thus being a valid address and succeeding the syscall, making the test fail. To fix that, I created a small "buffer zone" with mmap(PROT_NONE) between both mmaps. I have compared the results of run.sh before and after this patchset and didn't find any regression from the test results. Thanks, André [1] https://lore.kernel.org/lkml/87ecv6p364.ffs@tglx/ --- André Almeida (15): selftests: kselftest: Create ksft_print_dbg_msg() selftests/futex: Refactor futex_requeue_pi with kselftest_harness.h selftests/futex: Refactor futex_requeue_pi_mismatched_ops with kselftest_harness.h selftests/futex: Refactor futex_requeue_pi_signal_restart with kselftest_harness.h selftests/futex: Refactor futex_wait_timeout with kselftest_harness.h selftests/futex: Refactor futex_wait_wouldblock with kselftest_harness.h selftests/futex: Refactor futex_wait_unitialized_heap with kselftest_harness.h selftests/futex: Refactor futex_wait_private_mapped_file with kselftest_harness.h selftests/futex: Refactor futex_wait with kselftest_harness.h selftests/futex: Refactor futex_requeue with kselftest_harness.h selftests/futex: Refactor futex_waitv with kselftest_harness.h selftests/futex: Refactor futex_priv_hash with kselftest_harness.h selftests/futex: Refactor futex_numa_mpol with kselftest_harness.h selftests/futex: Drop logging.h include from futex_numa selftests/futex: Remove logging.h file tools/testing/selftests/futex/functional/Makefile | 3 +- .../selftests/futex/functional/futex_numa.c | 3 +- .../selftests/futex/functional/futex_numa_mpol.c | 57 ++--- .../selftests/futex/functional/futex_priv_hash.c | 79 +++---- .../selftests/futex/functional/futex_requeue.c | 76 ++---- .../selftests/futex/functional/futex_requeue_pi.c | 261 ++++++++++----------- .../functional/futex_requeue_pi_mismatched_ops.c | 80 ++----- .../functional/futex_requeue_pi_signal_restart.c | 129 +++------- .../selftests/futex/functional/futex_wait.c | 103 +++----- .../functional/futex_wait_private_mapped_file.c | 83 ++----- .../futex/functional/futex_wait_timeout.c | 139 +++++------ .../functional/futex_wait_uninitialized_heap.c | 76 ++---- .../futex/functional/futex_wait_wouldblock.c | 75 ++---- .../selftests/futex/functional/futex_waitv.c | 98 ++++---- tools/testing/selftests/futex/functional/run.sh | 62 +---- tools/testing/selftests/futex/include/logging.h | 148 ------------ tools/testing/selftests/kselftest.h | 13 + tools/testing/selftests/kselftest_harness.h | 13 +- 18 files changed, 491 insertions(+), 1007 deletions(-) --- base-commit: a24cc6ce1933eade12aa2b9859de0fcd2dac2c06 change-id: 20250703-tonyk-robust_test_cleanup-d1f3406365d9 Best regards, -- André Almeida <andrealmeid(a)igalia.com>

1 week, 1 day

2
16
0 0

[PATCH v13 0/5] rust: replace kernel::str::CStr w/ core::ffi::CStr

by Tamir Duberstein

This picks up from Michal Rostecki's work[0]. Per Michal's guidance I have omitted Co-authored tags, as the end result is quite different. Link: https://lore.kernel.org/rust-for-linux/20240819153656.28807-2-vadorovsky@pr… [0] Closes: https://github.com/Rust-for-Linux/linux/issues/1075 Signed-off-by: Tamir Duberstein <tamird(a)gmail.com> --- Changes in v13: - Rebase on v6.16-rc4. - Link to v12: https://lore.kernel.org/r/20250619-cstr-core-v12-0-80c9c7b45900@gmail.com Changes in v12: - Introduce `kernel::fmt::Display` to allow implementations on foreign types. - Tidy up doc comment on `str_to_cstr`. (Alice Ryhl). - Link to v11: https://lore.kernel.org/r/20250530-cstr-core-v11-0-cd9c0cbcb902@gmail.com Changes in v11: - Use `quote_spanned!` to avoid `use<'a, T>` and generally reduce manual token construction. - Add a commit to simplify `quote_spanned!`. - Drop first commit in favor of https://lore.kernel.org/rust-for-linux/20240906164448.2268368-1-paddymills@…. (Miguel Ojeda) - Correctly handle expressions such as `pr_info!("{a}", a = a = a)`. (Benno Lossin) - Avoid dealing with `}}` escapes, which is not needed. (Benno Lossin) - Revert some unnecessary changes. (Benno Lossin) - Rename `c_str_avoid_literals!` to `str_to_cstr!`. (Benno Lossin & Alice Ryhl). - Link to v10: https://lore.kernel.org/r/20250524-cstr-core-v10-0-6412a94d9d75@gmail.com Changes in v10: - Rebase on cbeaa41dfe26b72639141e87183cb23e00d4b0dd. - Implement Alice's suggestion to use a proc macro to work around orphan rules otherwise preventing `core::ffi::CStr` to be directly printed with `{}`. - Link to v9: https://lore.kernel.org/r/20250317-cstr-core-v9-0-51d6cc522f62@gmail.com Changes in v9: - Rebase on rust-next. - Restore `impl Display for BStr` which exists upstream[1]. - Link: https://doc.rust-lang.org/nightly/std/bstr/struct.ByteStr.html#impl-Display… [1] - Link to v8: https://lore.kernel.org/r/20250203-cstr-core-v8-0-cb3f26e78686@gmail.com Changes in v8: - Move `{from,as}_char_ptr` back to `CStrExt`. This reduces the diff some. - Restore `from_bytes_with_nul_unchecked_mut`, `to_cstring`. - Link to v7: https://lore.kernel.org/r/20250202-cstr-core-v7-0-da1802520438@gmail.com Changes in v7: - Rebased on mainline. - Restore functionality added in commit a321f3ad0a5d ("rust: str: add {make,to}_{upper,lower}case() to CString"). - Used `diff.algorithm patience` to improve diff readability. - Link to v6: https://lore.kernel.org/r/20250202-cstr-core-v6-0-8469cd6d29fd@gmail.com Changes in v6: - Split the work into several commits for ease of review. - Restore `{from,as}_char_ptr` to allow building on ARM (see commit message). - Add `CStrExt` to `kernel::prelude`. (Alice Ryhl) - Remove `CStrExt::from_bytes_with_nul_unchecked_mut` and restore `DerefMut for CString`. (Alice Ryhl) - Rename and hide `kernel::c_str!` to encourage use of C-String literals. - Drop implementation and invocation changes in kunit.rs. (Trevor Gross) - Drop docs on `Display` impl. (Trevor Gross) - Rewrite docs in the style of the standard library. - Restore the `test_cstr_debug` unit tests to demonstrate that the implementation has changed. Changes in v5: - Keep the `test_cstr_display*` unit tests. Changes in v4: - Provide the `CStrExt` trait with `display()` method, which returns a `CStrDisplay` wrapper with `Display` implementation. This addresses the lack of `Display` implementation for `core::ffi::CStr`. - Provide `from_bytes_with_nul_unchecked_mut()` method in `CStrExt`, which might be useful and is going to prevent manual, unsafe casts. - Fix a typo (s/preffered/prefered/). Changes in v3: - Fix the commit message. - Remove redundant braces in `use`, when only one item is imported. Changes in v2: - Do not remove `c_str` macro. While it's preferred to use C-string literals, there are two cases where `c_str` is helpful: - When working with macros, which already return a Rust string literal (e.g. `stringify!`). - When building macros, where we want to take a Rust string literal as an argument (for caller's convenience), but still use it as a C-string internally. - Use Rust literals as arguments in macros (`new_mutex`, `new_condvar`, `new_mutex`). Use the `c_str` macro to convert these literals to C-string literals. - Use `c_str` in kunit.rs for converting the output of `stringify!` to a `CStr`. - Remove `DerefMut` implementation for `CString`. --- Tamir Duberstein (5): rust: macros: reduce collections in `quote!` macro rust: support formatting of foreign types rust: replace `CStr` with `core::ffi::CStr` rust: replace `kernel::c_str!` with C-Strings rust: remove core::ffi::CStr reexport drivers/block/rnull.rs | 4 +- drivers/cpufreq/rcpufreq_dt.rs | 5 +- drivers/gpu/drm/drm_panic_qr.rs | 5 +- drivers/gpu/drm/nova/driver.rs | 10 +- drivers/gpu/nova-core/driver.rs | 6 +- drivers/gpu/nova-core/firmware.rs | 2 +- drivers/gpu/nova-core/gpu.rs | 4 +- drivers/gpu/nova-core/nova_core.rs | 2 +- drivers/net/phy/ax88796b_rust.rs | 8 +- drivers/net/phy/qt2025.rs | 6 +- rust/kernel/auxiliary.rs | 6 +- rust/kernel/block/mq.rs | 2 +- rust/kernel/clk.rs | 9 +- rust/kernel/configfs.rs | 14 +- rust/kernel/cpufreq.rs | 6 +- rust/kernel/device.rs | 9 +- rust/kernel/devres.rs | 2 +- rust/kernel/driver.rs | 4 +- rust/kernel/drm/device.rs | 4 +- rust/kernel/drm/driver.rs | 3 +- rust/kernel/drm/ioctl.rs | 2 +- rust/kernel/error.rs | 10 +- rust/kernel/faux.rs | 5 +- rust/kernel/firmware.rs | 16 +- rust/kernel/fmt.rs | 89 +++++++ rust/kernel/kunit.rs | 28 +-- rust/kernel/lib.rs | 3 +- rust/kernel/miscdevice.rs | 5 +- rust/kernel/net/phy.rs | 12 +- rust/kernel/of.rs | 5 +- rust/kernel/pci.rs | 2 +- rust/kernel/platform.rs | 6 +- rust/kernel/prelude.rs | 5 +- rust/kernel/print.rs | 4 +- rust/kernel/seq_file.rs | 6 +- rust/kernel/str.rs | 444 ++++++++++------------------------ rust/kernel/sync.rs | 7 +- rust/kernel/sync/completion.rs | 2 +- rust/kernel/sync/condvar.rs | 4 +- rust/kernel/sync/lock.rs | 4 +- rust/kernel/sync/lock/global.rs | 6 +- rust/kernel/sync/poll.rs | 1 + rust/kernel/workqueue.rs | 9 +- rust/macros/fmt.rs | 99 ++++++++ rust/macros/kunit.rs | 10 +- rust/macros/lib.rs | 19 ++ rust/macros/module.rs | 2 +- rust/macros/quote.rs | 111 ++++----- samples/rust/rust_configfs.rs | 9 +- samples/rust/rust_driver_auxiliary.rs | 7 +- samples/rust/rust_driver_faux.rs | 4 +- samples/rust/rust_driver_pci.rs | 4 +- samples/rust/rust_driver_platform.rs | 4 +- samples/rust/rust_misc_device.rs | 3 +- scripts/rustdoc_test_gen.rs | 6 +- 55 files changed, 543 insertions(+), 521 deletions(-) --- base-commit: 769e324b66b0d92d04f315d0c45a0f72737c7494 change-id: 20250201-cstr-core-d4b9b69120cf Best regards, -- Tamir Duberstein <tamird(a)gmail.com>

1 week, 2 days

4
32
0 0

[PATCH v5] selftests/futex: Convert 32bit timespec struct to 64bit version for 32bit compatibility mode

by Terry Tritton

sys_futex_wait() can not accept old_timespec32 struct, so userspace should convert it from 32bit to 64bit before syscall to support 32bit compatible mode. This fix is based off [1] Link: https://lore.kernel.org/all/20231203235117.29677-1-wegao@suse.com/ [1] Originally-by: Wei Gao <wegao(a)suse.com> Signed-off-by: Terry Tritton <terry.tritton(a)linaro.org> --- Changes in v5: - Fixed checkpatch errors Changes in v4: - Change to use __kernel_timespec as suggested by tglx Changes in v3: - Fix signed-off-by chain but for real this time Changes in v2: - Fix signed-off-by chain tools/testing/selftests/futex/include/futex2test.h | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h index ea79662405bc..ff9eebcc270c 100644 --- a/tools/testing/selftests/futex/include/futex2test.h +++ b/tools/testing/selftests/futex/include/futex2test.h @@ -5,6 +5,7 @@ * Copyright 2021 Collabora Ltd. */ #include <stdint.h> +#include <linux/time_types.h> #define u64_to_ptr(x) ((void *)(uintptr_t)(x)) @@ -65,7 +66,12 @@ struct futex32_numa { static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned long nr_waiters, unsigned long flags, struct timespec *timo, clockid_t clockid) { - return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid); + struct __kernel_timespec ts = { + .tv_sec = timo->tv_sec, + .tv_nsec = timo->tv_nsec, + }; + + return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, &ts, clockid); } /* -- 2.39.5

1 week, 2 days

1
0
0 0

[PATCH v4] selftests/futex: Convert 32bit timespec struct to 64bit version for 32bit compatibility mode

by Terry Tritton

sys_futex_wait() can not accept old_timespec32 struct, so userspace should convert it from 32bit to 64bit before syscall to support 32bit compatible mode. This fix is based off [1] Link: https://lore.kernel.org/all/20231203235117.29677-1-wegao@suse.com/ [1] Originally-by: Wei Gao <wegao(a)suse.com> Signed-off-by: Terry Tritton <terry.tritton(a)linaro.org> --- Changes in v4: - Change to use __kernel_timespec as suggested by tglx Changes in v3: - Fix signed-off-by chain but for real this time Changes in v2: - Fix signed-off-by chain tools/testing/selftests/futex/include/futex2test.h | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h index ea79662405bc..af5b92ba04ad 100644 --- a/tools/testing/selftests/futex/include/futex2test.h +++ b/tools/testing/selftests/futex/include/futex2test.h @@ -5,6 +5,7 @@ * Copyright 2021 Collabora Ltd. */ #include <stdint.h> +#include <linux/time_types.h> #define u64_to_ptr(x) ((void *)(uintptr_t)(x)) @@ -65,7 +66,12 @@ struct futex32_numa { static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned long nr_waiters, unsigned long flags, struct timespec *timo, clockid_t clockid) { - return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid); + struct __kernel_timespec ts = { + .tv_sec = timo->tv_sec, + .tv_nsec = timo->tv_nsec, + }; + + return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, &ts, clockid); } /* -- 2.39.5

1 week, 2 days

2
2
0 0

[PATCH net-next v4] ipv6: add `force_forwarding` sysctl to enable per-interface forwarding

by Gabriel Goller

It is currently impossible to enable ipv6 forwarding on a per-interface basis like in ipv4. To enable forwarding on an ipv6 interface we need to enable it on all interfaces and disable it on the other interfaces using a netfilter rule. This is especially cumbersome if you have lots of interface and only want to enable forwarding on a few. According to the sysctl docs [0] the `net.ipv6.conf.all.forwarding` enables forwarding for all interfaces, while the interface-specific `net.ipv6.conf.<interface>.forwarding` configures the interface Host/Router configuration. Introduce a new sysctl flag `force_forwarding`, which can be set on every interface. The ip6_forwarding function will then check if the global forwarding flag OR the force_forwarding flag is active and forward the packet. To preserver backwards-compatibility reset the flag (on all interfaces) to 0 if the net.ipv6.conf.all.forwarding flag is set to 0. Add a short selftest that checks if a packet gets forwarded with and without `force_forwarding`. [0]: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt Signed-off-by: Gabriel Goller <g.goller(a)proxmox.com> --- First time writing a selftest, so LMK if I did something wrong or if there is something I can improve. Thanks! v4: * actually write the sysctl value to the table * use ASSERT_RTNL() when forwarding the sysctl change * remove useless comments in function body * simplify forwarding and force_forwarding check in ip6_output.c * fix code backticks in Documentation (double instead of single) * add selftests v3: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/ * remove forwarding=0 setting force_forwarding=0 globally. * add min and max (0 and 1) value to sysctl. v2: https://lore.kernel.org/netdev/20250701140423.487411-1-g.goller@proxmox.com/ * rename from `do_forwarding` to `force_forwarding`. * add global `force_forwarding` flag which will enable `force_forwarding` on every interface like the `ipv4.all.forwarding` flag. * `forwarding`=0 will disable global and per-interface `force_forwarding`. * export option as NETCONFA_FORCE_FORWARDING. v1: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/ Documentation/networking/ip-sysctl.rst | 5 + include/linux/ipv6.h | 1 + include/uapi/linux/ipv6.h | 1 + include/uapi/linux/netconf.h | 1 + include/uapi/linux/sysctl.h | 1 + net/ipv6/addrconf.c | 91 +++++++++++++++ net/ipv6/ip6_output.c | 3 +- tools/testing/selftests/net/Makefile | 1 + .../selftests/net/ipv6_force_forwarding.sh | 105 ++++++++++++++++++ 9 files changed, 208 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/net/ipv6_force_forwarding.sh diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 0f1251cce314..ec7fa1e890f1 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2292,6 +2292,11 @@ conf/all/forwarding - BOOLEAN proxy_ndp - BOOLEAN Do proxy ndp. +force_forwarding - BOOLEAN + Enable forwarding on this interface only -- regardless of the setting on + ``conf/all/forwarding``. When setting ``conf.all.forwarding`` to 0, + the ``force_forwarding`` flag will be reset on all interfaces. + fwmark_reflect - BOOLEAN Controls the fwmark of kernel-generated IPv6 reply packets that are not associated with a socket for example, TCP RSTs or ICMPv6 echo replies). diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index 5aeeed22f35b..5380107e466c 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -19,6 +19,7 @@ struct ipv6_devconf { __s32 forwarding; __s32 disable_policy; __s32 proxy_ndp; + __s32 force_forwarding; __cacheline_group_end(ipv6_devconf_read_txrx); __s32 accept_ra; diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h index cf592d7b630f..d4d3ae774b26 100644 --- a/include/uapi/linux/ipv6.h +++ b/include/uapi/linux/ipv6.h @@ -199,6 +199,7 @@ enum { DEVCONF_NDISC_EVICT_NOCARRIER, DEVCONF_ACCEPT_UNTRACKED_NA, DEVCONF_ACCEPT_RA_MIN_LFT, + DEVCONF_FORCE_FORWARDING, DEVCONF_MAX }; diff --git a/include/uapi/linux/netconf.h b/include/uapi/linux/netconf.h index fac4edd55379..1c8c84d65ae3 100644 --- a/include/uapi/linux/netconf.h +++ b/include/uapi/linux/netconf.h @@ -19,6 +19,7 @@ enum { NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN, NETCONFA_INPUT, NETCONFA_BC_FORWARDING, + NETCONFA_FORCE_FORWARDING, __NETCONFA_MAX }; #define NETCONFA_MAX (__NETCONFA_MAX - 1) diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index 8981f00204db..63d1464cb71c 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -573,6 +573,7 @@ enum { NET_IPV6_ACCEPT_RA_FROM_LOCAL=26, NET_IPV6_ACCEPT_RA_RT_INFO_MIN_PLEN=27, NET_IPV6_RA_DEFRTR_METRIC=28, + NET_IPV6_FORCE_FORWARDING=29, __NET_IPV6_MAX }; diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index ba2ec7c870cc..dcf4e8bf8cf8 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -239,6 +239,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = { .ndisc_evict_nocarrier = 1, .ra_honor_pio_life = 0, .ra_honor_pio_pflag = 0, + .force_forwarding = 0, }; static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { @@ -303,6 +304,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { .ndisc_evict_nocarrier = 1, .ra_honor_pio_life = 0, .ra_honor_pio_pflag = 0, + .force_forwarding = 0, }; /* Check if link is ready: is it up and is a valid qdisc available */ @@ -857,6 +859,15 @@ static void addrconf_forward_change(struct net *net, __s32 newf) idev = __in6_dev_get_rtnl_net(dev); if (idev) { int changed = (!idev->cnf.forwarding) ^ (!newf); + /* + * With the introduction of force_forwarding, we need to be backwards + * compatible, so that means we need to set the force_forwarding flag + * on every interface to 0 if net.ipv6.conf.all.forwarding is set to 0. + * This allows the global forwarding flag to disable forwarding for + * all interfaces. + */ + if (newf == 0) + WRITE_ONCE(idev->cnf.force_forwarding, newf); WRITE_ONCE(idev->cnf.forwarding, newf); if (changed) @@ -5719,6 +5730,7 @@ static void ipv6_store_devconf(const struct ipv6_devconf *cnf, array[DEVCONF_ACCEPT_UNTRACKED_NA] = READ_ONCE(cnf->accept_untracked_na); array[DEVCONF_ACCEPT_RA_MIN_LFT] = READ_ONCE(cnf->accept_ra_min_lft); + array[DEVCONF_FORCE_FORWARDING] = READ_ONCE(cnf->force_forwarding); } static inline size_t inet6_ifla6_size(void) @@ -6747,6 +6759,78 @@ static int addrconf_sysctl_disable_policy(const struct ctl_table *ctl, int write return ret; } +static void addrconf_force_forward_change(struct net *net, __s32 newf) +{ + ASSERT_RTNL(); + struct net_device *dev; + struct inet6_dev *idev; + + for_each_netdev(net, dev) { + idev = __in6_dev_get_rtnl_net(dev); + if (idev) { + int changed = (!idev->cnf.force_forwarding) ^ (!newf); + + WRITE_ONCE(idev->cnf.force_forwarding, newf); + if (changed) { + inet6_netconf_notify_devconf(dev_net(dev), RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + dev->ifindex, &idev->cnf); + } + } + } +} + +static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct inet6_dev *idev = ctl->extra1; + struct net *net = ctl->extra2; + int *valp = ctl->data; + loff_t pos = *ppos; + int new_val = *valp; + int old_val = *valp; + int ret; + + struct ctl_table tmp_ctl = *ctl; + + tmp_ctl.extra1 = SYSCTL_ZERO; + tmp_ctl.extra2 = SYSCTL_ONE; + tmp_ctl.data = &new_val; + + ret = proc_douintvec_minmax(&tmp_ctl, write, buffer, lenp, ppos); + + if (write && old_val != new_val) { + if (!rtnl_net_trylock(net)) + return restart_syscall(); + + if (valp == &net->ipv6.devconf_dflt->force_forwarding) { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + NETCONFA_IFINDEX_DEFAULT, + net->ipv6.devconf_dflt); + } else if (valp == &net->ipv6.devconf_all->force_forwarding) { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + NETCONFA_IFINDEX_ALL, + net->ipv6.devconf_all); + + addrconf_force_forward_change(net, new_val); + } else { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + idev->dev->ifindex, + &idev->cnf); + } + rtnl_net_unlock(net); + } + + if (write) + WRITE_ONCE(*valp, new_val); + if (ret) + *ppos = pos; + return ret; +} + static int minus_one = -1; static const int two_five_five = 255; static u32 ioam6_if_id_max = U16_MAX; @@ -7217,6 +7301,13 @@ static const struct ctl_table addrconf_sysctl[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_TWO, }, + { + .procname = "force_forwarding", + .data = &ipv6_devconf.force_forwarding, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = addrconf_sysctl_force_forwarding, + }, }; static int __addrconf_sysctl_register(struct net *net, char *dev_name, diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 7bd29a9ff0db..440b9efced72 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -509,7 +509,8 @@ int ip6_forward(struct sk_buff *skb) u32 mtu; idev = __in6_dev_get_safely(dev_get_by_index_rcu(net, IP6CB(skb)->iif)); - if (READ_ONCE(net->ipv6.devconf_all->forwarding) == 0) + if (idev && !READ_ONCE(idev->cnf.force_forwarding) && + !READ_ONCE(net->ipv6.devconf_all->forwarding)) goto error; if (skb->pkt_type != PACKET_HOST) diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 332f387615d7..f64ec8a15a77 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -112,6 +112,7 @@ TEST_PROGS += skf_net_off.sh TEST_GEN_FILES += skf_net_off TEST_GEN_FILES += tfo TEST_PROGS += tfo_passive.sh +TEST_PROGS += ipv6_force_forwarding.sh # YNL files, must be before "include ..lib.mk" YNL_GEN_FILES := busy_poller netlink-dumps diff --git a/tools/testing/selftests/net/ipv6_force_forwarding.sh b/tools/testing/selftests/net/ipv6_force_forwarding.sh new file mode 100644 index 000000000000..62adc9d4afc9 --- /dev/null +++ b/tools/testing/selftests/net/ipv6_force_forwarding.sh @@ -0,0 +1,105 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Test IPv6 force_forwarding interface property +# +# This test verifies that the force_forwarding property works correctly: +# - When global forwarding is disabled, packets are not forwarded normally +# - When force_forwarding is enabled on an interface, packets are forwarded +# regardless of the global forwarding setting + +source lib.sh + +cleanup() { + cleanup_ns $ns1 $ns2 $ns3 +} + +trap cleanup EXIT + +setup_test() { + # Create three namespaces: sender, router, receiver + setup_ns ns1 ns2 ns3 + + # Create veth pairs: ns1 <-> ns2 <-> ns3 + ip link add name veth12 type veth peer name veth21 + ip link add name veth23 type veth peer name veth32 + + # Move interfaces to namespaces + ip link set veth12 netns $ns1 + ip link set veth21 netns $ns2 + ip link set veth23 netns $ns2 + ip link set veth32 netns $ns3 + + # Configure interfaces + ip -n $ns1 addr add 2001:db8:1::1/64 dev veth12 + ip -n $ns2 addr add 2001:db8:1::2/64 dev veth21 + ip -n $ns2 addr add 2001:db8:2::1/64 dev veth23 + ip -n $ns3 addr add 2001:db8:2::2/64 dev veth32 + + # Bring up interfaces + ip -n $ns1 link set veth12 up + ip -n $ns2 link set veth21 up + ip -n $ns2 link set veth23 up + ip -n $ns3 link set veth32 up + + # Add routes + ip -n $ns1 route add 2001:db8:2::/64 via 2001:db8:1::2 + ip -n $ns3 route add 2001:db8:1::/64 via 2001:db8:2::1 + + # Disable global forwarding + ip netns exec $ns2 sysctl -qw net.ipv6.conf.all.forwarding=0 +} + +test_force_forwarding() { + local ret=0 + + echo "TEST: force_forwarding functionality" + + # Check if force_forwarding sysctl exists + if ! ip netns exec $ns2 test -f /proc/sys/net/ipv6/conf/veth21/force_forwarding; then + echo "SKIP: force_forwarding not available" + return $ksft_skip + fi + + # Test 1: Without force_forwarding, ping should fail + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=0 + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=0 + + if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then + echo "FAIL: ping succeeded when forwarding disabled" + ret=1 + else + echo "PASS: forwarding disabled correctly" + fi + + # Test 2: With force_forwarding enabled, ping should succeed + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=1 + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=1 + + if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then + echo "PASS: force_forwarding enabled forwarding" + else + echo "FAIL: ping failed with force_forwarding enabled" + ret=1 + fi + + return $ret +} + +echo "IPv6 force_forwarding test" +echo "==========================" + +setup_test +test_force_forwarding +ret=$? + +if [ $ret -eq 0 ]; then + echo "OK" + exit 0 +elif [ $ret -eq $ksft_skip ]; then + echo "SKIP" + exit $ksft_skip +else + echo "FAIL" + exit 1 +fi -- 2.39.5

1 week, 3 days

3
5
0 0

[PATCH v2] selftests/futex: Add futex_numa to .gitignore

by Terry Tritton

futex_numa was never added to the .gitignore file. Add it. Fixes: 9140f57c1c13 ("futex,selftests: Add another FUTEX2_NUMA selftest") Signed-off-by: Terry Tritton <terry.tritton(a)linaro.org> --- Changes in v2: - Add Fixes tag tools/testing/selftests/futex/functional/.gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore index 7b24ae89594a..776ad658f75e 100644 --- a/tools/testing/selftests/futex/functional/.gitignore +++ b/tools/testing/selftests/futex/functional/.gitignore @@ -11,3 +11,4 @@ futex_wait_timeout futex_wait_uninitialized_heap futex_wait_wouldblock futex_waitv +futex_numa -- 2.39.5

1 week, 3 days

2
1
0 0

[PATCH v2] selftests/mm: pagemap_scan ioctl: add PFN ZERO test cases

by Muhammad Usama Anjum

Add test cases to test the correctness of PFN ZERO flag of pagemap_scan ioctl. Test with normal pages backed memory and huge pages backed memory. Cc: David Hildenbrand <david(a)redhat.com> Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com> --- The bug has been fixed [1]. [1] https://lore.kernel.org/all/20250617143532.2375383-1-david@redhat.com Changes since v1: - Skip if madvise() fails - Skip test if use_zero_page isn't set to 1 - Keep on using memalign()+free() to allocate huge pages --- tools/testing/selftests/mm/pagemap_ioctl.c | 86 +++++++++++++++++++++- 1 file changed, 85 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c index 57b4bba2b45f3..976ab357f4651 100644 --- a/tools/testing/selftests/mm/pagemap_ioctl.c +++ b/tools/testing/selftests/mm/pagemap_ioctl.c @@ -1,4 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 + #define _GNU_SOURCE #include <stdio.h> #include <fcntl.h> @@ -1480,6 +1481,86 @@ static void transact_test(int page_size) extra_thread_faults); } +bool is_use_zero_page_set(void) +{ + ssize_t bytes_read; + char buffer[2]; + int fd; + + fd = open("/sys/kernel/mm/transparent_hugepage/use_zero_page", O_RDONLY); + if (fd < 0) + return 0; + + bytes_read = read(fd, buffer, sizeof(buffer) - 1); + if (bytes_read <= 0) { + close(fd); + return 0; + } + + close(fd); + if (atoi(buffer) != 1) + return 0; + + return 1; +} + +void zeropfn_tests(void) +{ + unsigned long long mem_size; + struct page_region vec; + int i, ret; + char *mem; + + /* Test with normal memory */ + mem_size = 10 * page_size; + mem = mmap(NULL, mem_size, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + + /* Touch each page to ensure it's mapped */ + for (i = 0; i < mem_size; i += page_size) + (void)((volatile char *)mem)[i]; + + ret = pagemap_ioctl(mem, mem_size, &vec, 1, 0, + (mem_size / page_size), PAGE_IS_PFNZERO, 0, 0, PAGE_IS_PFNZERO); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && LEN(vec) == (mem_size / page_size), + "%s all pages must have PFNZERO set\n", __func__); + + munmap(mem, mem_size); + + /* Test with huge page if user_zero_page is set to 1 */ + if (!is_use_zero_page_set()) { + ksft_test_result_skip("%s use_zero_page not supported or set to 1\n", __func__); + return; + } + + mem_size = 10 * hpage_size; + mem = memalign(hpage_size, mem_size); + if (!mem) + ksft_exit_fail_msg("error nomem\n"); + + ret = madvise(mem, mem_size, MADV_HUGEPAGE); + if (!ret) { + for (i = 0; i < mem_size; i += hpage_size) + (void)((volatile char *)mem)[i]; + + ret = pagemap_ioctl(mem, mem_size, &vec, 1, 0, + (mem_size / page_size), PAGE_IS_PFNZERO, 0, 0, PAGE_IS_PFNZERO); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && LEN(vec) == (mem_size / page_size), + "%s all huge pages must have PFNZERO set\n", __func__); + + free(mem); + } else { + ksft_test_result_skip("%s huge page not supported\n", __func__); + } +} + int main(int __attribute__((unused)) argc, char *argv[]) { int shmid, buf_size, fd, i, ret; @@ -1494,7 +1575,7 @@ int main(int __attribute__((unused)) argc, char *argv[]) if (init_uffd()) ksft_exit_pass(); - ksft_set_plan(115); + ksft_set_plan(117); page_size = getpagesize(); hpage_size = read_pmd_pagesize(); @@ -1669,6 +1750,9 @@ int main(int __attribute__((unused)) argc, char *argv[]) /* 16. Userfaultfd tests */ userfaultfd_tests(); + /* 17. ZEROPFN tests */ + zeropfn_tests(); + close(pagemap_fd); ksft_exit_pass(); } -- 2.43.0

1 week, 3 days

2
1
0 0

[PATCH v3 0/4] kselftest/arm64: Add coverage for the interaction of vfork() and GCS

by Mark Brown

I had cause to look at the vfork() support for GCS and realised that we don't have any direct test coverage, this series does so by adding vfork() to nolibc and then using that in basic-gcs to provide some simple vfork() coverage. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Changes in v3: - Stylistic nits in the GCS vfork() test. - SPARC has a non-standard vfork() ABI which needs handling. - Link to v2: https://lore.kernel.org/r/20250610-arm64-gcs-vfork-exit-v2-0-929443dfcf82@k… Changes in v2: - Add replacement of ifdef with if defined() in nolibc since the code doesn't reflect the coding style. - Remove check for arch specific vfork(). - Link to v1: https://lore.kernel.org/r/20250609-arm64-gcs-vfork-exit-v1-0-baad0f085747@k… --- Mark Brown (4): tools/nolibc: Replace ifdef with if defined() in sys.h tools/nolibc: Provide vfork() kselftest/arm64: Add a test for vfork() with GCS selftests/nolibc: Add coverage of vfork() tools/include/nolibc/arch-sparc.h | 16 +++++++ tools/include/nolibc/sys.h | 59 ++++++++++++++++++------- tools/testing/selftests/arm64/gcs/basic-gcs.c | 63 +++++++++++++++++++++++++++ tools/testing/selftests/nolibc/nolibc-test.c | 23 ++++++++-- 4 files changed, 142 insertions(+), 19 deletions(-) --- base-commit: 86731a2a651e58953fc949573895f2fa6d456841 change-id: 20250528-arm64-gcs-vfork-exit-4a7daf7652ee Best regards, -- Mark Brown <broonie(a)kernel.org>

1 week, 3 days

2
6
0 0

[kvm-unit-tests PATCH] riscv: Use norvc over arch, -c

by Jesse Taube

The Linux kernel main tree uses "norvc" over "arch, -c" change to match this. GCC 15 started to add _zca_zcd to the assembler flags causing a bug which made "arch, -c" generate a compressed instruction. Link: https://sourceware.org/bugzilla/show_bug.cgi?id=33128 Cc: Clément Léger <cleger(a)rivosinc.com> Signed-off-by: Jesse Taube <jesse(a)rivosinc.com> --- riscv/isa-dbltrp.c | 2 +- riscv/sbi-dbtr.c | 2 +- riscv/sbi-fwft.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/riscv/isa-dbltrp.c b/riscv/isa-dbltrp.c index b7e21589..af12860c 100644 --- a/riscv/isa-dbltrp.c +++ b/riscv/isa-dbltrp.c @@ -26,7 +26,7 @@ do { \ unsigned long value = 0; \ asm volatile( \ " .option push\n" \ - " .option arch,-c\n" \ + " .option norvc\n" \ " sw %0, 0(%1)\n" \ " .option pop\n" \ : : "r" (value), "r" (ptr) : "memory"); \ diff --git a/riscv/sbi-dbtr.c b/riscv/sbi-dbtr.c index c4ccd81d..129f79b8 100644 --- a/riscv/sbi-dbtr.c +++ b/riscv/sbi-dbtr.c @@ -134,7 +134,7 @@ static __attribute__((naked)) void exec_call(void) { /* skip over nop when triggered instead of ret. */ asm volatile (".option push\n" - ".option arch, -c\n" + ".option norvc\n" "nop\n" "ret\n" ".option pop\n"); diff --git a/riscv/sbi-fwft.c b/riscv/sbi-fwft.c index 8920bcb5..fda7eb52 100644 --- a/riscv/sbi-fwft.c +++ b/riscv/sbi-fwft.c @@ -174,7 +174,7 @@ static void fwft_check_misaligned_exc_deleg(void) * Disable compression so the lw takes exactly 4 bytes and thus * can be skipped reliably from the exception handler. */ - ".option arch,-c\n" + ".option norvc\n" "lw %[val], 1(%[val_addr])\n" ".option pop\n" : [val] "+r" (ret.value) -- 2.43.0

1 week, 3 days

2
1
0 0

[PATCH v3] kunit: fix longest symbol length test

by Sergio González Collado

The kunit test that checks the longests symbol length [1], has triggered warnings in some pilelines when symbol prefixes are used [2][3]. The test will to depend on !PREFIX_SYMBOLS and !CFI_CLANG as sujested in [4] and on !GCOV_KERNEL. [1] https://lore.kernel.org/rust-for-linux/CABVgOSm=5Q0fM6neBhxSbOUHBgNzmwf2V22… [2] https://lore.kernel.org/all/20250328112156.2614513-1-arnd@kernel.org/T/#u [3] https://lore.kernel.org/rust-for-linux/bbd03b37-c4d9-4a92-9be2-75aaf8c19815… [4] https://lore.kernel.org/linux-kselftest/20250427200916.GA1661412@ax162/T/#t Reviewed-by: Rae Moar <rmoar(a)google.com> Signed-off-by: Sergio González Collado <sergio.collado(a)gmail.com> --- v2 -> v3: added dependency on !GCOV_KERNEL (to avoid __gcov_ prefix) --- v1 -> v2: added dependency on !CFI_CLANG as suggested in [3], removed CONFIG_ prefix --- lib/Kconfig.debug | 1 + lib/tests/longest_symbol_kunit.c | 3 +-- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index f9051ab610d5..e55c761eae20 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2886,6 +2886,7 @@ config FORTIFY_KUNIT_TEST config LONGEST_SYM_KUNIT_TEST tristate "Test the longest symbol possible" if !KUNIT_ALL_TESTS depends on KUNIT && KPROBES + depends on !PREFIX_SYMBOLS && !CFI_CLANG && !GCOV_KERNEL default KUNIT_ALL_TESTS help Tests the longest symbol possible diff --git a/lib/tests/longest_symbol_kunit.c b/lib/tests/longest_symbol_kunit.c index e3c28ff1807f..9b4de3050ba7 100644 --- a/lib/tests/longest_symbol_kunit.c +++ b/lib/tests/longest_symbol_kunit.c @@ -3,8 +3,7 @@ * Test the longest symbol length. Execute with: * ./tools/testing/kunit/kunit.py run longest-symbol * --arch=x86_64 --kconfig_add CONFIG_KPROBES=y --kconfig_add CONFIG_MODULES=y - * --kconfig_add CONFIG_RETPOLINE=n --kconfig_add CONFIG_CFI_CLANG=n - * --kconfig_add CONFIG_MITIGATION_RETPOLINE=n + * --kconfig_add CONFIG_CPU_MITIGATIONS=n --kconfig_add CONFIG_GCOV_KERNEL=n */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt base-commit: 1a80a098c606b285fb0a13aa992af4f86da1ff06 -- 2.39.2

1 week, 3 days

2
2
0 0

[PATCH v3] selftests/futex: Convert 32bit timespec struct to 64bit version for 32bit compatibility mode

by Terry Tritton

Futex_waitv can not accept old_timespec32 struct, so userspace should convert it from 32bit to 64bit before syscall in 32bit compatible mode. This fix is based off [1] Link: https://lore.kernel.org/all/20231203235117.29677-1-wegao@suse.com/ [1] Originally-by: Wei Gao <wegao(a)suse.com> Signed-off-by: Terry Tritton <terry.tritton(a)linaro.org> --- Changes in v3: - Fix signed-off-by chain but for real this time Changes in v2: - Fix signed-off-by chain .../testing/selftests/futex/include/futex2test.h | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h index ea79662405bc..6780e51eb2d6 100644 --- a/tools/testing/selftests/futex/include/futex2test.h +++ b/tools/testing/selftests/futex/include/futex2test.h @@ -55,6 +55,13 @@ struct futex32_numa { futex_t numa; }; +#if !defined(__LP64__) +struct timespec64 { + int64_t tv_sec; + int64_t tv_nsec; +}; +#endif + /** * futex_waitv - Wait at multiple futexes, wake on any * @waiters: Array of waiters @@ -65,7 +72,15 @@ struct futex32_numa { static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned long nr_waiters, unsigned long flags, struct timespec *timo, clockid_t clockid) { +#if !defined(__LP64__) + struct timespec64 timo64 = {0}; + + timo64.tv_sec = timo->tv_sec; + timo64.tv_nsec = timo->tv_nsec; + return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, &timo64, clockid); +#else return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid); +#endif } /* -- 2.39.5

1 week, 3 days

3
2
0 0

[PATCH] selftests: print installation complete message

by Shuah Khan

Add installation complete message to Makefile install logic. Signed-off-by: Shuah Khan <skhan(a)linuxfoundation.org> --- tools/testing/selftests/Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 9dae84a74e7f..b95de208265a 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -300,6 +300,7 @@ ifdef INSTALL_PATH else \ printf "Unable to get version from git describe\n"; \ fi + @echo "**Kselftest Installation is complete: $(INSTALL_PATH)**" else $(error Error: set INSTALL_PATH to use install) endif -- 2.47.2

1 week, 3 days

1
0
0 0

[PATCH net-next v3 7/7] selftests: net: extend SCM_PIDFD test to cover stale pidfds

by Alexander Mikhalitsyn

Extend SCM_PIDFD test scenarios to also cover dead task's pidfd retrieval and reading its exit info. Cc: linux-kselftest(a)vger.kernel.org Cc: linux-kernel(a)vger.kernel.org Cc: netdev(a)vger.kernel.org Cc: Shuah Khan <shuah(a)kernel.org> Cc: "David S. Miller" <davem(a)davemloft.net> Cc: Eric Dumazet <edumazet(a)google.com> Cc: Jakub Kicinski <kuba(a)kernel.org> Cc: Paolo Abeni <pabeni(a)redhat.com> Cc: Simon Horman <horms(a)kernel.org> Cc: Christian Brauner <brauner(a)kernel.org> Cc: Kuniyuki Iwashima <kuniyu(a)google.com> Cc: Lennart Poettering <mzxreary(a)0pointer.de> Cc: Luca Boccassi <bluca(a)debian.org> Cc: David Rheinsberg <david(a)readahead.eu> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn(a)canonical.com> Reviewed-by: Christian Brauner <brauner(a)kernel.org> --- .../testing/selftests/net/af_unix/scm_pidfd.c | 217 ++++++++++++++---- 1 file changed, 173 insertions(+), 44 deletions(-) diff --git a/tools/testing/selftests/net/af_unix/scm_pidfd.c b/tools/testing/selftests/net/af_unix/scm_pidfd.c index 7e534594167e..37e034874034 100644 --- a/tools/testing/selftests/net/af_unix/scm_pidfd.c +++ b/tools/testing/selftests/net/af_unix/scm_pidfd.c @@ -15,6 +15,7 @@ #include <sys/types.h> #include <sys/wait.h> +#include "../../pidfd/pidfd.h" #include "../../kselftest_harness.h" #define clean_errno() (errno == 0 ? "None" : strerror(errno)) @@ -26,6 +27,8 @@ #define SCM_PIDFD 0x04 #endif +#define CHILD_EXIT_CODE_OK 123 + static void child_die() { exit(1); @@ -126,16 +129,65 @@ static pid_t get_pid_from_fdinfo_file(int pidfd, const char *key, size_t keylen) return result; } +struct cmsg_data { + struct ucred *ucred; + int *pidfd; +}; + +static int parse_cmsg(struct msghdr *msg, struct cmsg_data *res) +{ + struct cmsghdr *cmsg; + int data = 0; + + if (msg->msg_flags & (MSG_TRUNC | MSG_CTRUNC)) { + log_err("recvmsg: truncated"); + return 1; + } + + for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL; + cmsg = CMSG_NXTHDR(msg, cmsg)) { + if (cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SCM_PIDFD) { + if (cmsg->cmsg_len < sizeof(*res->pidfd)) { + log_err("CMSG parse: SCM_PIDFD wrong len"); + return 1; + } + + res->pidfd = (void *)CMSG_DATA(cmsg); + } + + if (cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SCM_CREDENTIALS) { + if (cmsg->cmsg_len < sizeof(*res->ucred)) { + log_err("CMSG parse: SCM_CREDENTIALS wrong len"); + return 1; + } + + res->ucred = (void *)CMSG_DATA(cmsg); + } + } + + if (!res->pidfd) { + log_err("CMSG parse: SCM_PIDFD not found"); + return 1; + } + + if (!res->ucred) { + log_err("CMSG parse: SCM_CREDENTIALS not found"); + return 1; + } + + return 0; +} + static int cmsg_check(int fd) { struct msghdr msg = { 0 }; - struct cmsghdr *cmsg; + struct cmsg_data res; struct iovec iov; - struct ucred *ucred = NULL; int data = 0; char control[CMSG_SPACE(sizeof(struct ucred)) + CMSG_SPACE(sizeof(int))] = { 0 }; - int *pidfd = NULL; pid_t parent_pid; int err; @@ -158,53 +210,99 @@ static int cmsg_check(int fd) return 1; } - for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL; - cmsg = CMSG_NXTHDR(&msg, cmsg)) { - if (cmsg->cmsg_level == SOL_SOCKET && - cmsg->cmsg_type == SCM_PIDFD) { - if (cmsg->cmsg_len < sizeof(*pidfd)) { - log_err("CMSG parse: SCM_PIDFD wrong len"); - return 1; - } + /* send(pfd, "x", sizeof(char), 0) */ + if (data != 'x') { + log_err("recvmsg: data corruption"); + return 1; + } - pidfd = (void *)CMSG_DATA(cmsg); - } + if (parse_cmsg(&msg, &res)) { + log_err("CMSG parse: parse_cmsg() failed"); + return 1; + } - if (cmsg->cmsg_level == SOL_SOCKET && - cmsg->cmsg_type == SCM_CREDENTIALS) { - if (cmsg->cmsg_len < sizeof(*ucred)) { - log_err("CMSG parse: SCM_CREDENTIALS wrong len"); - return 1; - } + /* pidfd from SCM_PIDFD should point to the parent process PID */ + parent_pid = + get_pid_from_fdinfo_file(*res.pidfd, "Pid:", sizeof("Pid:") - 1); + if (parent_pid != getppid()) { + log_err("wrong SCM_PIDFD %d != %d", parent_pid, getppid()); + close(*res.pidfd); + return 1; + } - ucred = (void *)CMSG_DATA(cmsg); - } + close(*res.pidfd); + return 0; +} + +static int cmsg_check_dead(int fd, int expected_pid) +{ + int err; + struct msghdr msg = { 0 }; + struct cmsg_data res; + struct iovec iov; + int data = 0; + char control[CMSG_SPACE(sizeof(struct ucred)) + + CMSG_SPACE(sizeof(int))] = { 0 }; + pid_t client_pid; + struct pidfd_info info = { + .mask = PIDFD_INFO_EXIT, + }; + + iov.iov_base = &data; + iov.iov_len = sizeof(data); + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = control; + msg.msg_controllen = sizeof(control); + + err = recvmsg(fd, &msg, 0); + if (err < 0) { + log_err("recvmsg"); + return 1; } - /* send(pfd, "x", sizeof(char), 0) */ - if (data != 'x') { + if (msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC)) { + log_err("recvmsg: truncated"); + return 1; + } + + /* send(cfd, "y", sizeof(char), 0) */ + if (data != 'y') { log_err("recvmsg: data corruption"); return 1; } - if (!pidfd) { - log_err("CMSG parse: SCM_PIDFD not found"); + if (parse_cmsg(&msg, &res)) { + log_err("CMSG parse: parse_cmsg() failed"); return 1; } - if (!ucred) { - log_err("CMSG parse: SCM_CREDENTIALS not found"); + /* + * pidfd from SCM_PIDFD should point to the client_pid. + * Let's read exit information and check if it's what + * we expect to see. + */ + if (ioctl(*res.pidfd, PIDFD_GET_INFO, &info)) { + log_err("%s: ioctl(PIDFD_GET_INFO) failed", __func__); + close(*res.pidfd); return 1; } - /* pidfd from SCM_PIDFD should point to the parent process PID */ - parent_pid = - get_pid_from_fdinfo_file(*pidfd, "Pid:", sizeof("Pid:") - 1); - if (parent_pid != getppid()) { - log_err("wrong SCM_PIDFD %d != %d", parent_pid, getppid()); + if (!(info.mask & PIDFD_INFO_EXIT)) { + log_err("%s: No exit information from ioctl(PIDFD_GET_INFO)", __func__); + close(*res.pidfd); return 1; } + err = WIFEXITED(info.exit_code) ? WEXITSTATUS(info.exit_code) : 1; + if (err != CHILD_EXIT_CODE_OK) { + log_err("%s: wrong exit_code %d != %d", __func__, err, CHILD_EXIT_CODE_OK); + close(*res.pidfd); + return 1; + } + + close(*res.pidfd); return 0; } @@ -291,6 +389,24 @@ static void fill_sockaddr(struct sock_addr *addr, bool abstract) memcpy(sun_path_buf, addr->sock_name, strlen(addr->sock_name)); } +static int sk_enable_cred_pass(int sk) +{ + int on = 0; + + on = 1; + if (setsockopt(sk, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on))) { + log_err("Failed to set SO_PASSCRED"); + return 1; + } + + if (setsockopt(sk, SOL_SOCKET, SO_PASSPIDFD, &on, sizeof(on))) { + log_err("Failed to set SO_PASSPIDFD"); + return 1; + } + + return 0; +} + static void client(FIXTURE_DATA(scm_pidfd) *self, const FIXTURE_VARIANT(scm_pidfd) *variant) { @@ -299,7 +415,6 @@ static void client(FIXTURE_DATA(scm_pidfd) *self, struct ucred peer_cred; int peer_pidfd; pid_t peer_pid; - int on = 0; cfd = socket(AF_UNIX, variant->type, 0); if (cfd < 0) { @@ -322,14 +437,8 @@ static void client(FIXTURE_DATA(scm_pidfd) *self, child_die(); } - on = 1; - if (setsockopt(cfd, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on))) { - log_err("Failed to set SO_PASSCRED"); - child_die(); - } - - if (setsockopt(cfd, SOL_SOCKET, SO_PASSPIDFD, &on, sizeof(on))) { - log_err("Failed to set SO_PASSPIDFD"); + if (sk_enable_cred_pass(cfd)) { + log_err("sk_enable_cred_pass() failed"); child_die(); } @@ -340,6 +449,12 @@ static void client(FIXTURE_DATA(scm_pidfd) *self, child_die(); } + /* send something to the parent so it can receive SCM_PIDFD too and validate it */ + if (send(cfd, "y", sizeof(char), 0) == -1) { + log_err("Failed to send(cfd, \"y\", sizeof(char), 0)"); + child_die(); + } + /* skip further for SOCK_DGRAM as it's not applicable */ if (variant->type == SOCK_DGRAM) return; @@ -398,7 +513,13 @@ TEST_F(scm_pidfd, test) close(self->server); close(self->startup_pipe[0]); client(self, variant); - exit(0); + + /* + * It's a bit unusual, but in case of success we return non-zero + * exit code (CHILD_EXIT_CODE_OK) and then we expect to read it + * from ioctl(PIDFD_GET_INFO) in cmsg_check_dead(). + */ + exit(CHILD_EXIT_CODE_OK); } close(self->startup_pipe[1]); @@ -421,9 +542,17 @@ TEST_F(scm_pidfd, test) ASSERT_NE(-1, err); } - close(pfd); waitpid(self->client_pid, &child_status, 0); - ASSERT_EQ(0, WIFEXITED(child_status) ? WEXITSTATUS(child_status) : 1); + /* see comment before exit(CHILD_EXIT_CODE_OK) */ + ASSERT_EQ(CHILD_EXIT_CODE_OK, WIFEXITED(child_status) ? WEXITSTATUS(child_status) : 1); + + err = sk_enable_cred_pass(pfd); + ASSERT_EQ(0, err); + + err = cmsg_check_dead(pfd, self->client_pid); + ASSERT_EQ(0, err); + + close(pfd); } TEST_HARNESS_MAIN -- 2.43.0

1 week, 3 days

1
0
0 0

[PATCH v2] selftests/powerpc: fix "for a while" typo

by Ahelenia Ziemiańska

Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli(a)nabijaczleweli.xyz> --- v1: https://lore.kernel.org/lkml/h2ieddqja5jfrnuh3mvlxt6njrvp352t5rfzp2cvnrufop… tools/testing/selftests/powerpc/tm/tm-tmspr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/powerpc/tm/tm-tmspr.c b/tools/testing/selftests/powerpc/tm/tm-tmspr.c index dd5ddffa28b7..0d64988ffb40 100644 --- a/tools/testing/selftests/powerpc/tm/tm-tmspr.c +++ b/tools/testing/selftests/powerpc/tm/tm-tmspr.c @@ -14,7 +14,7 @@ * (1) create more threads than cpus * (2) in each thread: * (a) set TFIAR and TFHAR a unique value - * (b) loop for awhile, continually checking to see if + * (b) loop for a while, continually checking to see if * either register has been corrupted. * * (3) Loop: -- 2.39.5

1 week, 4 days

1
0
0 0

[PATCH v6 00/28] KVM: arm64: Implement support for SME

by Mark Brown

I've removed the RFC tag from this version of the series, but the items that I'm looking for feedback on remains the same: - The userspace ABI, in particular: - The vector length used for the SVE registers, access to the SVE registers and access to ZA and (if available) ZT0 depending on the current state of PSTATE.{SM,ZA}. - The use of a single finalisation for both SVE and SME. - The addition of control for enabling fine grained traps in a similar manner to FGU but without the UNDEF, I'm not clear if this is desired at all and at present this requires symmetric read and write traps like FGU. That seemed like it might be desired from an implementation point of view but we already have one case where we enable an asymmetric trap (for ARM64_WORKAROUND_AMPERE_AC03_CPU_38) and it seems generally useful to enable asymmetrically. This series implements support for SME use in non-protected KVM guests. Much of this is very similar to SVE, the main additional challenge that SME presents is that it introduces a new vector length similar to the SVE vector length and two new controls which change the registers seen by guests: - PSTATE.ZA enables the ZA matrix register and, if SME2 is supported, the ZT0 LUT register. - PSTATE.SM enables streaming mode, a new floating point mode which uses the SVE register set with the separately configured SME vector length. In streaming mode implementation of the FFR register is optional. It is also permitted to build systems which support SME without SVE, in this case when not in streaming mode no SVE registers or instructions are available. Further, there is no requirement that there be any overlap in the set of vector lengths supported by SVE and SME in a system, this is expected to be a common situation in practical systems. Since there is a new vector length to configure we introduce a new feature parallel to the existing SVE one with a new pseudo register for the streaming mode vector length. Due to the overlap with SVE caused by streaming mode rather than finalising SME as a separate feature we use the existing SVE finalisation to also finalise SME, a new define KVM_ARM_VCPU_VEC is provided to help make user code clearer. Finalising SVE and SME separately would introduce complication with register access since finalising SVE makes the SVE registers writeable by userspace and doing multiple finalisations results in an error being reported. Dealing with a state where the SVE registers are writeable due to one of SVE or SME being finalised but may have their VL changed by the other being finalised seems like needless complexity with minimal practical utility, it seems clearer to just express directly that only one finalisation can be done in the ABI. Access to the floating point registers follows the architecture: - When both SVE and SME are present: - If PSTATE.SM == 0 the vector length used for the Z and P registers is the SVE vector length. - If PSTATE.SM == 1 the vector length used for the Z and P registers is the SME vector length. - If only SME is present: - If PSTATE.SM == 0 the Z and P registers are inaccessible and the floating point state accessed via the encodings for the V registers. - If PSTATE.SM == 1 the vector length used for the Z and P registers - The SME specific ZA and ZT0 registers are only accessible if SVCR.ZA is 1. The VMM must understand this, in particular when loading state SVCR should be configured before other state. It should be noted that while the architecture refers to PSTATE.SM and PSTATE.ZA these PSTATE bits are not preserved in SPSR_ELx, they are only accessible via SVCR. There are a large number of subfeatures for SME, most of which only offer additional instructions but some of which (SME2 and FA64) add architectural state. These are configured via the ID registers as per usual. Protected KVM supported, with the implementation maintaining the existing restriction that the hypervisor will refuse to run if streaming mode or ZA is enabled. This both simplfies the code and avoids the need to allocate storage for host ZA and ZT0 state, there seems to be little practical use case for supporting this and the memory usage would be non-trivial. The new KVM_ARM_VCPU_VEC feature and ZA and ZT0 registers have not been added to the get-reg-list selftest, the idea of supporting additional features there without restructuring the program to generate all possible feature combinations has been rejected. I will post a separate series which does that restructuring. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Changes in v6: - Rebase onto v6.16-rc3. - Link to v5: https://lore.kernel.org/r/20250417-kvm-arm64-sme-v5-0-f469a2d5f574@kernel.o… Changes in v5: - Rebase onto v6.15-rc2. - Add pKVM guest support. - Always restore SVCR. - Link to v4: https://lore.kernel.org/r/20250214-kvm-arm64-sme-v4-0-d64a681adcc2@kernel.o… Changes in v4: - Rebase onto v6.14-rc2 and Mark Rutland's fixes. - Expose SME to nested guests. - Additional cleanups and test fixes following on from the rebase. - Flush register state on VMM PSTATE.{SM,ZA}. - Link to v3: https://lore.kernel.org/r/20241220-kvm-arm64-sme-v3-0-05b018c1ffeb@kernel.o… Changes in v3: - Rebase onto v6.12-rc2. - Link to v2: https://lore.kernel.org/r/20231222-kvm-arm64-sme-v2-0-da226cb180bb@kernel.o… Changes in v2: - Rebase onto v6.7-rc3. - Configure subfeatures based on host system only. - Complete nVHE support. - There was some snafu with sending v1 out, it didn't make it to the lists but in case it hit people's inboxes I'm sending as v2. --- Mark Brown (28): arm64/fpsimd: Update FA64 and ZT0 enables when loading SME state arm64/fpsimd: Decide to save ZT0 and streaming mode FFR at bind time arm64/fpsimd: Check enable bit for FA64 when saving EFI state arm64/fpsimd: Determine maximum virtualisable SME vector length KVM: arm64: Introduce non-UNDEF FGT control KVM: arm64: Pay attention to FFR parameter in SVE save and load KVM: arm64: Pull ctxt_has_ helpers to start of sysreg-sr.h KVM: arm64: Move SVE state access macros after feature test macros KVM: arm64: Rename SVE finalization constants to be more general KVM: arm64: Document the KVM ABI for SME KVM: arm64: Define internal features for SME KVM: arm64: Rename sve_state_reg_region KVM: arm64: Store vector lengths in an array KVM: arm64: Implement SME vector length configuration KVM: arm64: Support SME control registers KVM: arm64: Support TPIDR2_EL0 KVM: arm64: Support SME identification registers for guests KVM: arm64: Support SME priority registers KVM: arm64: Provide assembly for SME register access KVM: arm64: Support userspace access to streaming mode Z and P registers KVM: arm64: Flush register state on writes to SVCR.SM and SVCR.ZA KVM: arm64: Expose SME specific state to userspace KVM: arm64: Context switch SME state for guests KVM: arm64: Handle SME exceptions KVM: arm64: Expose SME to nested guests KVM: arm64: Provide interface for configuring and enabling SME for guests KVM: arm64: selftests: Add SME system registers to get-reg-list KVM: arm64: selftests: Add SME to set_id_regs test Documentation/virt/kvm/api.rst | 117 +++++++---- arch/arm64/include/asm/fpsimd.h | 26 +++ arch/arm64/include/asm/kvm_emulate.h | 6 + arch/arm64/include/asm/kvm_host.h | 168 ++++++++++++--- arch/arm64/include/asm/kvm_hyp.h | 5 +- arch/arm64/include/asm/kvm_pkvm.h | 2 +- arch/arm64/include/asm/vncr_mapping.h | 2 + arch/arm64/include/uapi/asm/kvm.h | 33 +++ arch/arm64/kernel/cpufeature.c | 2 - arch/arm64/kernel/fpsimd.c | 89 ++++---- arch/arm64/kvm/arm.c | 10 + arch/arm64/kvm/fpsimd.c | 28 ++- arch/arm64/kvm/guest.c | 252 ++++++++++++++++++++--- arch/arm64/kvm/handle_exit.c | 14 ++ arch/arm64/kvm/hyp/fpsimd.S | 28 ++- arch/arm64/kvm/hyp/include/hyp/switch.h | 175 ++++++++++++++-- arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 97 +++++---- arch/arm64/kvm/hyp/nvhe/hyp-main.c | 86 ++++++-- arch/arm64/kvm/hyp/nvhe/pkvm.c | 81 ++++++-- arch/arm64/kvm/hyp/nvhe/switch.c | 4 +- arch/arm64/kvm/hyp/nvhe/sys_regs.c | 6 + arch/arm64/kvm/hyp/vhe/switch.c | 17 +- arch/arm64/kvm/nested.c | 3 +- arch/arm64/kvm/reset.c | 156 ++++++++++---- arch/arm64/kvm/sys_regs.c | 140 ++++++++++++- include/uapi/linux/kvm.h | 1 + tools/testing/selftests/kvm/arm64/get-reg-list.c | 32 ++- tools/testing/selftests/kvm/arm64/set_id_regs.c | 29 ++- 28 files changed, 1315 insertions(+), 294 deletions(-) --- base-commit: 7204503c922cfdb4fcfce4a4ab61f4558a01a73b change-id: 20230301-kvm-arm64-sme-06a1246d3636 Best regards, -- Mark Brown <broonie(a)kernel.org>

1 week, 4 days

2
31
0 0

[PATCH v2 0/3] kselftest/arm64: Update sve-ptrace for ABI changes

by Mark Brown

Mark Rutland's recent SME fixes updated the SME ABI to reject any attempt to write FPSIMD register data via the streaming mode SVE register set but did not update the sve-ptrace test to take account of this, resulting in spurious failures. Update the test for this, and also fix another preexisting issue I noticed while looking at this. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Changes in v2: - Rebase onto v6.16-rc1. - Update fixes tag for patch 1. - Link to v1: https://lore.kernel.org/r/20250523-kselftest-arm64-ssve-fixups-v1-0-65069a2… --- Mark Brown (3): kselftest/arm64: Fix check for setting new VLs in sve-ptrace kselftest/arm64: Fix test for streaming FPSIMD write in sve-ptrace kselftest/arm64: Specify SVE data when testing VL set in sve-ptrace tools/testing/selftests/arm64/fp/sve-ptrace.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) --- base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494 change-id: 20250523-kselftest-arm64-ssve-fixups-b68ae61c1ebf Best regards, -- Mark Brown <broonie(a)kernel.org>

1 week, 4 days

2
4
0 0

[PATCH] kselftest/arm64: Convert tpidr2 test to use kselftest.h

by Mark Brown

Recent work by Thomas Weißschuh means that it is now possible to use kselftest.h with nolibc. Convert the tpidr2 test which is nolibc specific to use kselftest.h, making it look more standard and ensuring it gets the benefit of any work done on kselftest.h. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- tools/testing/selftests/arm64/abi/Makefile | 2 +- tools/testing/selftests/arm64/abi/tpidr2.c | 140 ++++++++--------------------- 2 files changed, 38 insertions(+), 104 deletions(-) diff --git a/tools/testing/selftests/arm64/abi/Makefile b/tools/testing/selftests/arm64/abi/Makefile index a6d30c620908..483488f8c2ad 100644 --- a/tools/testing/selftests/arm64/abi/Makefile +++ b/tools/testing/selftests/arm64/abi/Makefile @@ -12,4 +12,4 @@ $(OUTPUT)/syscall-abi: syscall-abi.c syscall-abi-asm.S $(OUTPUT)/tpidr2: tpidr2.c $(CC) -fno-asynchronous-unwind-tables -fno-ident -s -Os -nostdlib \ -static -include ../../../../include/nolibc/nolibc.h \ - -ffreestanding -Wall $^ -o $@ -lgcc + -I../.. -ffreestanding -Wall $^ -o $@ -lgcc diff --git a/tools/testing/selftests/arm64/abi/tpidr2.c b/tools/testing/selftests/arm64/abi/tpidr2.c index eb19dcc37a75..f58a9f89b952 100644 --- a/tools/testing/selftests/arm64/abi/tpidr2.c +++ b/tools/testing/selftests/arm64/abi/tpidr2.c @@ -3,31 +3,12 @@ #include <linux/sched.h> #include <linux/wait.h> +#include "kselftest.h" + #define SYS_TPIDR2 "S3_3_C13_C0_5" #define EXPECTED_TESTS 5 -static void putstr(const char *str) -{ - write(1, str, strlen(str)); -} - -static void putnum(unsigned int num) -{ - char c; - - if (num / 10) - putnum(num / 10); - - c = '0' + (num % 10); - write(1, &c, 1); -} - -static int tests_run; -static int tests_passed; -static int tests_failed; -static int tests_skipped; - static void set_tpidr2(uint64_t val) { asm volatile ( @@ -50,20 +31,6 @@ static uint64_t get_tpidr2(void) return val; } -static void print_summary(void) -{ - if (tests_passed + tests_failed + tests_skipped != EXPECTED_TESTS) - putstr("# UNEXPECTED TEST COUNT: "); - - putstr("# Totals: pass:"); - putnum(tests_passed); - putstr(" fail:"); - putnum(tests_failed); - putstr(" xfail:0 xpass:0 skip:"); - putnum(tests_skipped); - putstr(" error:0\n"); -} - /* Processes should start with TPIDR2 == 0 */ static int default_value(void) { @@ -105,9 +72,8 @@ static int write_fork_read(void) if (newpid == 0) { /* In child */ if (get_tpidr2() != oldpid) { - putstr("# TPIDR2 changed in child: "); - putnum(get_tpidr2()); - putstr("\n"); + ksft_print_msg("TPIDR2 changed in child: %llx\n", + get_tpidr2()); exit(0); } @@ -115,14 +81,12 @@ static int write_fork_read(void) if (get_tpidr2() == getpid()) { exit(1); } else { - putstr("# Failed to set TPIDR2 in child\n"); + ksft_print_msg("Failed to set TPIDR2 in child\n"); exit(0); } } if (newpid < 0) { - putstr("# fork() failed: -"); - putnum(-newpid); - putstr("\n"); + ksft_print_msg("fork() failed: %d\n", newpid); return 0; } @@ -132,23 +96,22 @@ static int write_fork_read(void) if (waiting < 0) { if (errno == EINTR) continue; - putstr("# waitpid() failed: "); - putnum(errno); - putstr("\n"); + ksft_print_msg("waitpid() failed: %d\n", errno); return 0; } if (waiting != newpid) { - putstr("# waitpid() returned wrong PID\n"); + ksft_print_msg("waitpid() returned wrong PID: %d != %d\n", + waiting, newpid); return 0; } if (!WIFEXITED(status)) { - putstr("# child did not exit\n"); + ksft_print_msg("child did not exit\n"); return 0; } if (getpid() != get_tpidr2()) { - putstr("# TPIDR2 corrupted in parent\n"); + ksft_print_msg("TPIDR2 corrupted in parent\n"); return 0; } @@ -188,35 +151,32 @@ static int write_clone_read(void) stack = malloc(__STACK_SIZE); if (!stack) { - putstr("# malloc() failed\n"); + ksft_print_msg("malloc() failed\n"); return 0; } ret = sys_clone(CLONE_VM, (unsigned long)stack + __STACK_SIZE, &parent_tid, 0, &child_tid); if (ret == -1) { - putstr("# clone() failed\n"); - putnum(errno); - putstr("\n"); + ksft_print_msg("clone() failed: %d\n", errno); return 0; } if (ret == 0) { /* In child */ if (get_tpidr2() != 0) { - putstr("# TPIDR2 non-zero in child: "); - putnum(get_tpidr2()); - putstr("\n"); + ksft_print_msg("TPIDR2 non-zero in child: %llx\n", + get_tpidr2()); exit(0); } if (gettid() == 0) - putstr("# Child TID==0\n"); + ksft_print_msg("Child TID==0\n"); set_tpidr2(gettid()); if (get_tpidr2() == gettid()) { exit(1); } else { - putstr("# Failed to set TPIDR2 in child\n"); + ksft_print_msg("Failed to set TPIDR2 in child\n"); exit(0); } } @@ -227,25 +187,22 @@ static int write_clone_read(void) if (waiting < 0) { if (errno == EINTR) continue; - putstr("# wait4() failed: "); - putnum(errno); - putstr("\n"); + ksft_print_msg("wait4() failed: %d\n", errno); return 0; } if (waiting != ret) { - putstr("# wait4() returned wrong PID "); - putnum(waiting); - putstr("\n"); + ksft_print_msg("wait4() returned wrong PID %d\n", + waiting); return 0; } if (!WIFEXITED(status)) { - putstr("# child did not exit\n"); + ksft_print_msg("child did not exit\n"); return 0; } if (parent != get_tpidr2()) { - putstr("# TPIDR2 corrupted in parent\n"); + ksft_print_msg("TPIDR2 corrupted in parent\n"); return 0; } @@ -253,35 +210,14 @@ static int write_clone_read(void) } } -#define run_test(name) \ - if (name()) { \ - tests_passed++; \ - } else { \ - tests_failed++; \ - putstr("not "); \ - } \ - putstr("ok "); \ - putnum(++tests_run); \ - putstr(" " #name "\n"); - -#define skip_test(name) \ - tests_skipped++; \ - putstr("ok "); \ - putnum(++tests_run); \ - putstr(" # SKIP " #name "\n"); - int main(int argc, char **argv) { int ret; - putstr("TAP version 13\n"); - putstr("1.."); - putnum(EXPECTED_TESTS); - putstr("\n"); + ksft_print_header(); + ksft_set_plan(5); - putstr("# PID: "); - putnum(getpid()); - putstr("\n"); + ksft_print_msg("PID: %d\n", getpid()); /* * This test is run with nolibc which doesn't support hwcap and @@ -290,23 +226,21 @@ int main(int argc, char **argv) */ ret = open("/proc/sys/abi/sme_default_vector_length", O_RDONLY, 0); if (ret >= 0) { - run_test(default_value); - run_test(write_read); - run_test(write_sleep_read); - run_test(write_fork_read); - run_test(write_clone_read); + ksft_test_result(default_value(), "default_value\n"); + ksft_test_result(write_read, "write_read\n"); + ksft_test_result(write_sleep_read, "write_sleep_read\n"); + ksft_test_result(write_fork_read, "write_fork_read\n"); + ksft_test_result(write_clone_read, "write_clone_read\n"); } else { - putstr("# SME support not present\n"); + ksft_print_msg("SME support not present\n"); - skip_test(default_value); - skip_test(write_read); - skip_test(write_sleep_read); - skip_test(write_fork_read); - skip_test(write_clone_read); + ksft_test_result_skip("default_value\n"); + ksft_test_result_skip("write_read\n"); + ksft_test_result_skip("write_sleep_read\n"); + ksft_test_result_skip("write_fork_read\n"); + ksft_test_result_skip("write_clone_read\n"); } - print_summary(); - - return 0; + ksft_finished(); } --- base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494 change-id: 20250527-kselftest-arm64-nolibc-header-b650a457f7f4 Best regards, -- Mark Brown <broonie(a)kernel.org>

1 week, 4 days

2
1
0 0

[PATCH v2 0/3] arm64: Support FEAT_LSFE (Large System Float Extension)

by Mark Brown

FEAT_LSFE is optional from v9.5, it adds new instructions for atomic memory operations with floating point values. We have no immediate use for it in kernel, provide a hwcap so userspace can discover it and allow the ID register field to be exposed to KVM guests. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Changes in v2: - Fix result of vi dropping in hwcap test. - Link to v1: https://lore.kernel.org/r/20250627-arm64-lsfe-v1-0-68351c4bf741@kernel.org --- Mark Brown (3): arm64/hwcap: Add hwcap for FEAT_LSFE KVM: arm64: Expose FEAT_LSFE to guests kselftest/arm64: Add lsfe to the hwcaps test Documentation/arch/arm64/elf_hwcaps.rst | 4 ++++ arch/arm64/include/asm/hwcap.h | 1 + arch/arm64/include/uapi/asm/hwcap.h | 1 + arch/arm64/kernel/cpufeature.c | 2 ++ arch/arm64/kernel/cpuinfo.c | 1 + arch/arm64/kvm/sys_regs.c | 4 +++- tools/testing/selftests/arm64/abi/hwcap.c | 21 +++++++++++++++++++++ 7 files changed, 33 insertions(+), 1 deletion(-) --- base-commit: 86731a2a651e58953fc949573895f2fa6d456841 change-id: 20250625-arm64-lsfe-0810cf98adc2 Best regards, -- Mark Brown <broonie(a)kernel.org>

1 week, 4 days

1
3
0 0

[RFC PATCH v1 0/2] kselftest/resctrl: CAT functional tests

by Ilpo Järvinen

Hi all, In the last Fall Reinette mentioned functional tests of resctrl would be preferred over selftests that are based on performance measurement. This series tries to address that shortcoming by adding some functional tests for resctrl FS interface and another that checks MSRs match to what is written through resctrl FS. The MSR test is only available for Intel CPUs at the moment. Why RFC? The new functional selftest itself works, AFAIK. However, calling ksft_test_result_skip() in cat.c if MSR reading is found to be unavailable is problematic because of how kselftest harness is architected. The kselftest.h header itself defines some variables, so including it into different .c files results in duplicating the test framework related variables (duplication of ksft_count matters in this case). The duplication problem could be worked around by creating a resctrl selftest specific wrapper for ksft_test_result_skip() into resctrl_tests.c so the accounting would occur in the "correct" .c file, but perhaps that is considered hacky and the selftest framework/build systems should be reworked to avoid duplicating variables? Ilpo Järvinen (2): kselftest/resctrl: CAT L3 resctrl FS function tests kselftest/resctrl: Add CAT L3 CBM functional test for Intel RDT tools/testing/selftests/resctrl/cat_test.c | 210 ++++++++++++++++++ tools/testing/selftests/resctrl/msr.c | 55 +++++ tools/testing/selftests/resctrl/resctrl.h | 6 + .../testing/selftests/resctrl/resctrl_tests.c | 2 + tools/testing/selftests/resctrl/resctrlfs.c | 48 ++++ 5 files changed, 321 insertions(+) create mode 100644 tools/testing/selftests/resctrl/msr.c base-commit: c1d7e19c70cbb8a19f63c190cf53e71b5f970514 -- 2.39.5

1 week, 4 days

2
5
0 0

[PATCH v2 0/7] selftests/mm: Fix false positives and skip unsupported tests

by Aboorva Devarajan

Hi all, This patch series addresses false positives in the generic mm selftests and skips tests that cannot run correctly due to missing features or system limitations. --- v1: https://lore.kernel.org/all/20250616160632.35250-1-aboorvad@linux.ibm.com/ Changes in v2: - Rebased onto the mm-new branch, top commit of the base is 3b4a8ad89f7e ("mm: add zblock allocator"). - Split some patches for clarity. - Updated virtual_address_range test to support testing 4PB VA on PPC64. - Added proper Fixes: tags. - Included a patch to skip a failing userfaultfd test when unsupported, instead of reporting a failure. --- Please let us know if you have any further comments. Thanks, Aboorva Aboorva Devarajan (3): selftests/mm: Fix child process exit codes in ksm_functional_tests selftests/mm: Skip thuge-gen if shmmax is too small or no 1G huge pages selftests/mm: Skip hugepage-mremap test if userfaultfd unavailable Donet Tom (4): mm/selftests: Fix incorrect pointer being passed to mark_range() selftests/mm: Add support to test 4PB VA on PPC64 selftest/mm: Fix ksm_funtional_test failures mm/selftests: Fix split_huge_page_test failure on systems with 64KB page size tools/testing/selftests/mm/hugepage-mremap.c | 16 ++++++++++--- .../selftests/mm/ksm_functional_tests.c | 24 +++++++++++++------ .../selftests/mm/split_huge_page_test.c | 23 +++++++++++++----- tools/testing/selftests/mm/thuge-gen.c | 11 +++++---- .../selftests/mm/virtual_address_range.c | 8 ++++++- 5 files changed, 61 insertions(+), 21 deletions(-) -- 2.43.5

1 week, 4 days

5
33
0 0

[PATCH net-next v2 0/2] Add IPIP flowtable SW acceleratio

by Lorenzo Bianconi

Introduce SW acceleration for IPIP tunnels in the netfilter flowtable infrastructure. --- Changes in v2: - Introduce IPIP flowtable selftest - Link to v1: https://lore.kernel.org/r/20250623-nf-flowtable-ipip-v1-1-2853596e3941@kern… --- Lorenzo Bianconi (2): net: netfilter: Add IPIP flowtable SW acceleration selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest net/ipv4/ipip.c | 21 ++++++++++++ net/netfilter/nf_flow_table_ip.c | 28 +++++++++++++-- .../selftests/net/netfilter/nft_flowtable.sh | 40 ++++++++++++++++++++++ 3 files changed, 87 insertions(+), 2 deletions(-) --- base-commit: 8efa26fcbf8a7f783fd1ce7dd2a409e9b7758df0 change-id: 20250623-nf-flowtable-ipip-1b3d7b08d067 Best regards, -- Lorenzo Bianconi <lorenzo(a)kernel.org>

1 week, 4 days

4
6
0 0

[PATCH bpf-next v2 18/18] selftests/bpf: add bench tests for tracing_multi

by Menglong Dong

Add bench testcase for fentry_multi, fexit_multi and fmodret_multi in bench_trigger.c. Signed-off-by: Menglong Dong <dongml2(a)chinatelecom.cn> --- v2: - use the existing bpf bench framework instead of introducing new one --- tools/testing/selftests/bpf/bench.c | 8 +++ .../selftests/bpf/benchs/bench_trigger.c | 72 +++++++++++++++++++ .../selftests/bpf/benchs/run_bench_trigger.sh | 1 + .../selftests/bpf/progs/trigger_bench.c | 22 ++++++ 4 files changed, 103 insertions(+) diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c index ddd73d06a1eb..32f1e2e936c0 100644 --- a/tools/testing/selftests/bpf/bench.c +++ b/tools/testing/selftests/bpf/bench.c @@ -510,8 +510,12 @@ extern const struct bench bench_trig_kretprobe; extern const struct bench bench_trig_kprobe_multi; extern const struct bench bench_trig_kretprobe_multi; extern const struct bench bench_trig_fentry; +extern const struct bench bench_trig_fentry_multi; +extern const struct bench bench_trig_fentry_multi_all; extern const struct bench bench_trig_fexit; +extern const struct bench bench_trig_fexit_multi; extern const struct bench bench_trig_fmodret; +extern const struct bench bench_trig_fmodret_multi; extern const struct bench bench_trig_tp; extern const struct bench bench_trig_rawtp; @@ -578,8 +582,12 @@ static const struct bench *benchs[] = { &bench_trig_kprobe_multi, &bench_trig_kretprobe_multi, &bench_trig_fentry, + &bench_trig_fentry_multi, + &bench_trig_fentry_multi_all, &bench_trig_fexit, + &bench_trig_fexit_multi, &bench_trig_fmodret, + &bench_trig_fmodret_multi, &bench_trig_tp, &bench_trig_rawtp, /* uprobes */ diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c index 82327657846e..a1844ee358f1 100644 --- a/tools/testing/selftests/bpf/benchs/bench_trigger.c +++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c @@ -226,6 +226,54 @@ static void trigger_fentry_setup(void) attach_bpf(ctx.skel->progs.bench_trigger_fentry); } +static void trigger_fentry_multi_setup(void) +{ + setup_ctx(); + bpf_program__set_autoload(ctx.skel->progs.bench_trigger_fentry_multi, true); + load_ctx(); + attach_bpf(ctx.skel->progs.bench_trigger_fentry_multi); +} + +static void trigger_fentry_multi_all_setup(void) +{ + LIBBPF_OPTS(bpf_trace_multi_opts, opts); + struct bpf_program *prog; + struct bpf_link *link; + char **syms = NULL; + size_t cnt = 0; + int i; + + setup_ctx(); + prog = ctx.skel->progs.bench_trigger_fentry_multi; + bpf_program__set_autoload(prog, true); + load_ctx(); + + if (bpf_get_ksyms(&syms, &cnt, true)) { + printf("failed to get ksyms\n"); + exit(1); + } + + for (i = 0; i < cnt; i++) { + if (strcmp(syms[i], "bpf_get_numa_node_id") == 0) + break; + } + if (i == cnt) { + printf("bpf_get_numa_node_id not found in ksyms\n"); + exit(1); + } + + printf("found %zu ksyms\n", cnt); + opts.syms = (const char **) syms; + opts.cnt = cnt; + opts.skip_invalid = true; + link = bpf_program__attach_trace_multi_opts(prog, &opts); + if (!link) { + printf("failed to attach bench_trigger_fentry_multi to all\n"); + exit(1); + } + ctx.skel->links.bench_trigger_fentry_multi = link; +} + static void trigger_fexit_setup(void) { setup_ctx(); @@ -234,6 +282,14 @@ static void trigger_fexit_setup(void) attach_bpf(ctx.skel->progs.bench_trigger_fexit); } +static void trigger_fexit_multi_setup(void) +{ + setup_ctx(); + bpf_program__set_autoload(ctx.skel->progs.bench_trigger_fexit_multi, true); + load_ctx(); + attach_bpf(ctx.skel->progs.bench_trigger_fexit_multi); +} + static void trigger_fmodret_setup(void) { setup_ctx(); @@ -246,6 +302,18 @@ static void trigger_fmodret_setup(void) attach_bpf(ctx.skel->progs.bench_trigger_fmodret); } +static void trigger_fmodret_multi_setup(void) +{ + setup_ctx(); + bpf_program__set_autoload(ctx.skel->progs.trigger_driver, false); + bpf_program__set_autoload(ctx.skel->progs.trigger_driver_kfunc, true); + bpf_program__set_autoload(ctx.skel->progs.bench_trigger_fmodret_multi, true); + load_ctx(); + /* override driver program */ + ctx.driver_prog_fd = bpf_program__fd(ctx.skel->progs.trigger_driver_kfunc); + attach_bpf(ctx.skel->progs.bench_trigger_fmodret_multi); +} + static void trigger_tp_setup(void) { setup_ctx(); @@ -512,8 +580,12 @@ BENCH_TRIG_KERNEL(kretprobe, "kretprobe"); BENCH_TRIG_KERNEL(kprobe_multi, "kprobe-multi"); BENCH_TRIG_KERNEL(kretprobe_multi, "kretprobe-multi"); BENCH_TRIG_KERNEL(fentry, "fentry"); +BENCH_TRIG_KERNEL(fentry_multi, "fentry-multi"); +BENCH_TRIG_KERNEL(fentry_multi_all, "fentry-multi-all"); BENCH_TRIG_KERNEL(fexit, "fexit"); +BENCH_TRIG_KERNEL(fexit_multi, "fexit-multi"); BENCH_TRIG_KERNEL(fmodret, "fmodret"); +BENCH_TRIG_KERNEL(fmodret_multi, "fmodret-multi"); BENCH_TRIG_KERNEL(tp, "tp"); BENCH_TRIG_KERNEL(rawtp, "rawtp"); diff --git a/tools/testing/selftests/bpf/benchs/run_bench_trigger.sh b/tools/testing/selftests/bpf/benchs/run_bench_trigger.sh index a690f5a68b6b..48a7f809d053 100755 --- a/tools/testing/selftests/bpf/benchs/run_bench_trigger.sh +++ b/tools/testing/selftests/bpf/benchs/run_bench_trigger.sh @@ -5,6 +5,7 @@ set -eufo pipefail def_tests=( \ usermode-count kernel-count syscall-count \ fentry fexit fmodret \ + fentry-multi fentry-multi-all fexit-multi fmodret-multi \ rawtp tp \ kprobe kprobe-multi \ kretprobe kretprobe-multi \ diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c index 044a6d78923e..2ff1a7568080 100644 --- a/tools/testing/selftests/bpf/progs/trigger_bench.c +++ b/tools/testing/selftests/bpf/progs/trigger_bench.c @@ -111,6 +111,13 @@ int bench_trigger_fentry(void *ctx) return 0; } +SEC("?fentry.multi/bpf_get_numa_node_id") +int bench_trigger_fentry_multi(void *ctx) +{ + inc_counter(); + return 0; +} + SEC("?fexit/bpf_get_numa_node_id") int bench_trigger_fexit(void *ctx) { @@ -118,6 +125,14 @@ int bench_trigger_fexit(void *ctx) return 0; } +SEC("?fexit.multi/bpf_get_numa_node_id") +int bench_trigger_fexit_multi(void *ctx) +{ + inc_counter(); + + return 0; +} + SEC("?fmod_ret/bpf_modify_return_test_tp") int bench_trigger_fmodret(void *ctx) { @@ -125,6 +140,13 @@ int bench_trigger_fmodret(void *ctx) return -22; } +SEC("?fmod_ret.multi/bpf_modify_return_test_tp") +int bench_trigger_fmodret_multi(void *ctx) +{ + inc_counter(); + return -22; +} + SEC("?tp/bpf_test_run/bpf_trigger_tp") int bench_trigger_tp(void *ctx) { -- 2.39.5

1 week, 4 days

1
0
0 0

[PATCH bpf-next v2 17/18] selftests/bpf: add basic testcases for tracing_multi

by Menglong Dong

In this commit, we add some testcases for the following attach types: BPF_TRACE_FENTRY_MULTI BPF_TRACE_FEXIT_MULTI BPF_MODIFY_RETURN_MULTI We reuse the testings in fentry_test.c, fexit_test.c and modify_return.c by attach the tracing bpf prog as tracing_multi. We add some functions to skip for tracing progs to bpf_get_ksyms(). The functions that in the "btf_id_deny" should be skipped. What's more, the kernel can't find the right function address according to the btf type id when duplicated function name exist. So we skip such functions that we meet. The list is not whole, so we still can fail during attaching the FENTRY_MULTI to all the kernel functions. This is something that we need to fix in the feature. Signed-off-by: Menglong Dong <dongml2(a)chinatelecom.cn> --- tools/testing/selftests/bpf/Makefile | 2 +- .../selftests/bpf/prog_tests/fentry_fexit.c | 22 +- .../selftests/bpf/prog_tests/fentry_test.c | 79 +++++-- .../selftests/bpf/prog_tests/fexit_test.c | 79 +++++-- .../selftests/bpf/prog_tests/modify_return.c | 60 +++++ .../bpf/prog_tests/tracing_multi_link.c | 210 ++++++++++++++++++ .../selftests/bpf/progs/fentry_multi_empty.c | 13 ++ .../selftests/bpf/progs/tracing_multi_test.c | 181 +++++++++++++++ .../selftests/bpf/test_kmods/bpf_testmod.c | 24 ++ tools/testing/selftests/bpf/test_progs.c | 50 +++++ tools/testing/selftests/bpf/test_progs.h | 3 + tools/testing/selftests/bpf/trace_helpers.c | 69 ++++++ 12 files changed, 744 insertions(+), 48 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/tracing_multi_link.c create mode 100644 tools/testing/selftests/bpf/progs/fentry_multi_empty.c create mode 100644 tools/testing/selftests/bpf/progs/tracing_multi_test.c diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 4863106034df..1fa0da096262 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -496,7 +496,7 @@ LINKED_SKELS := test_static_linked.skel.h linked_funcs.skel.h \ test_subskeleton.skel.h test_subskeleton_lib.skel.h \ test_usdt.skel.h -LSKELS := fentry_test.c fexit_test.c fexit_sleep.c atomics.c \ +LSKELS := fexit_sleep.c atomics.c \ trace_printk.c trace_vprintk.c map_ptr_kern.c \ core_kern.c core_kern_overflow.c test_ringbuf.c \ test_ringbuf_n.c test_ringbuf_map_key.c test_ringbuf_write.c diff --git a/tools/testing/selftests/bpf/prog_tests/fentry_fexit.c b/tools/testing/selftests/bpf/prog_tests/fentry_fexit.c index 130f5b82d2e6..84cc8b669684 100644 --- a/tools/testing/selftests/bpf/prog_tests/fentry_fexit.c +++ b/tools/testing/selftests/bpf/prog_tests/fentry_fexit.c @@ -1,32 +1,32 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2019 Facebook */ #include <test_progs.h> -#include "fentry_test.lskel.h" -#include "fexit_test.lskel.h" +#include "fentry_test.skel.h" +#include "fexit_test.skel.h" void test_fentry_fexit(void) { - struct fentry_test_lskel *fentry_skel = NULL; - struct fexit_test_lskel *fexit_skel = NULL; + struct fentry_test *fentry_skel = NULL; + struct fexit_test *fexit_skel = NULL; __u64 *fentry_res, *fexit_res; int err, prog_fd, i; LIBBPF_OPTS(bpf_test_run_opts, topts); - fentry_skel = fentry_test_lskel__open_and_load(); + fentry_skel = fentry_test__open_and_load(); if (!ASSERT_OK_PTR(fentry_skel, "fentry_skel_load")) goto close_prog; - fexit_skel = fexit_test_lskel__open_and_load(); + fexit_skel = fexit_test__open_and_load(); if (!ASSERT_OK_PTR(fexit_skel, "fexit_skel_load")) goto close_prog; - err = fentry_test_lskel__attach(fentry_skel); + err = fentry_test__attach(fentry_skel); if (!ASSERT_OK(err, "fentry_attach")) goto close_prog; - err = fexit_test_lskel__attach(fexit_skel); + err = fexit_test__attach(fexit_skel); if (!ASSERT_OK(err, "fexit_attach")) goto close_prog; - prog_fd = fexit_skel->progs.test1.prog_fd; + prog_fd = bpf_program__fd(fexit_skel->progs.test1); err = bpf_prog_test_run_opts(prog_fd, &topts); ASSERT_OK(err, "ipv6 test_run"); ASSERT_OK(topts.retval, "ipv6 test retval"); @@ -40,6 +40,6 @@ void test_fentry_fexit(void) } close_prog: - fentry_test_lskel__destroy(fentry_skel); - fexit_test_lskel__destroy(fexit_skel); + fentry_test__destroy(fentry_skel); + fexit_test__destroy(fexit_skel); } diff --git a/tools/testing/selftests/bpf/prog_tests/fentry_test.c b/tools/testing/selftests/bpf/prog_tests/fentry_test.c index aee1bc77a17f..9edd383feabd 100644 --- a/tools/testing/selftests/bpf/prog_tests/fentry_test.c +++ b/tools/testing/selftests/bpf/prog_tests/fentry_test.c @@ -1,26 +1,16 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2019 Facebook */ #include <test_progs.h> -#include "fentry_test.lskel.h" +#include "fentry_test.skel.h" #include "fentry_many_args.skel.h" -static int fentry_test_common(struct fentry_test_lskel *fentry_skel) +static int fentry_test_check(struct fentry_test *fentry_skel) { + LIBBPF_OPTS(bpf_test_run_opts, topts); int err, prog_fd, i; - int link_fd; __u64 *result; - LIBBPF_OPTS(bpf_test_run_opts, topts); - - err = fentry_test_lskel__attach(fentry_skel); - if (!ASSERT_OK(err, "fentry_attach")) - return err; - /* Check that already linked program can't be attached again. */ - link_fd = fentry_test_lskel__test1__attach(fentry_skel); - if (!ASSERT_LT(link_fd, 0, "fentry_attach_link")) - return -1; - - prog_fd = fentry_skel->progs.test1.prog_fd; + prog_fd = bpf_program__fd(fentry_skel->progs.test1); err = bpf_prog_test_run_opts(prog_fd, &topts); ASSERT_OK(err, "test_run"); ASSERT_EQ(topts.retval, 0, "test_run"); @@ -31,7 +21,28 @@ static int fentry_test_common(struct fentry_test_lskel *fentry_skel) return -1; } - fentry_test_lskel__detach(fentry_skel); + return 0; +} + +static int fentry_test_common(struct fentry_test *fentry_skel) +{ + struct bpf_link *link; + int err; + + err = fentry_test__attach(fentry_skel); + if (!ASSERT_OK(err, "fentry_attach")) + return err; + + /* Check that already linked program can't be attached again. */ + link = bpf_program__attach(fentry_skel->progs.test1); + if (!ASSERT_ERR_PTR(link, "fentry_attach_link")) + return -1; + + err = fentry_test_check(fentry_skel); + if (!ASSERT_OK(err, "fentry_test_check")) + return err; + + fentry_test__detach(fentry_skel); /* zero results for re-attach test */ memset(fentry_skel->bss, 0, sizeof(*fentry_skel->bss)); @@ -40,10 +51,10 @@ static int fentry_test_common(struct fentry_test_lskel *fentry_skel) static void fentry_test(void) { - struct fentry_test_lskel *fentry_skel = NULL; + struct fentry_test *fentry_skel = NULL; int err; - fentry_skel = fentry_test_lskel__open_and_load(); + fentry_skel = fentry_test__open_and_load(); if (!ASSERT_OK_PTR(fentry_skel, "fentry_skel_load")) goto cleanup; @@ -55,7 +66,7 @@ static void fentry_test(void) ASSERT_OK(err, "fentry_second_attach"); cleanup: - fentry_test_lskel__destroy(fentry_skel); + fentry_test__destroy(fentry_skel); } static void fentry_many_args(void) @@ -84,10 +95,42 @@ static void fentry_many_args(void) fentry_many_args__destroy(fentry_skel); } +static void fentry_multi_test(void) +{ + struct fentry_test *fentry_skel = NULL; + int err, prog_cnt; + + fentry_skel = fentry_test__open(); + if (!ASSERT_OK_PTR(fentry_skel, "fentry_skel_open")) + goto cleanup; + + prog_cnt = sizeof(fentry_skel->progs) / sizeof(long); + err = bpf_to_tracing_multi((void *)&fentry_skel->progs, prog_cnt); + if (!ASSERT_OK(err, "fentry_to_multi")) + goto cleanup; + + err = fentry_test__load(fentry_skel); + if (!ASSERT_OK(err, "fentry_skel_load")) + goto cleanup; + + err = bpf_attach_as_tracing_multi((void *)&fentry_skel->progs, + prog_cnt, + (void *)&fentry_skel->links); + if (!ASSERT_OK(err, "fentry_attach_multi")) + goto cleanup; + + err = fentry_test_check(fentry_skel); + ASSERT_OK(err, "fentry_first_attach"); +cleanup: + fentry_test__destroy(fentry_skel); +} + void test_fentry_test(void) { if (test__start_subtest("fentry")) fentry_test(); + if (test__start_subtest("fentry_multi")) + fentry_multi_test(); if (test__start_subtest("fentry_many_args")) fentry_many_args(); } diff --git a/tools/testing/selftests/bpf/prog_tests/fexit_test.c b/tools/testing/selftests/bpf/prog_tests/fexit_test.c index 1c13007e37dd..5652d02b3ad9 100644 --- a/tools/testing/selftests/bpf/prog_tests/fexit_test.c +++ b/tools/testing/selftests/bpf/prog_tests/fexit_test.c @@ -1,26 +1,16 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2019 Facebook */ #include <test_progs.h> -#include "fexit_test.lskel.h" +#include "fexit_test.skel.h" #include "fexit_many_args.skel.h" -static int fexit_test_common(struct fexit_test_lskel *fexit_skel) +static int fexit_test_check(struct fexit_test *fexit_skel) { + LIBBPF_OPTS(bpf_test_run_opts, topts); int err, prog_fd, i; - int link_fd; __u64 *result; - LIBBPF_OPTS(bpf_test_run_opts, topts); - - err = fexit_test_lskel__attach(fexit_skel); - if (!ASSERT_OK(err, "fexit_attach")) - return err; - /* Check that already linked program can't be attached again. */ - link_fd = fexit_test_lskel__test1__attach(fexit_skel); - if (!ASSERT_LT(link_fd, 0, "fexit_attach_link")) - return -1; - - prog_fd = fexit_skel->progs.test1.prog_fd; + prog_fd = bpf_program__fd(fexit_skel->progs.test1); err = bpf_prog_test_run_opts(prog_fd, &topts); ASSERT_OK(err, "test_run"); ASSERT_EQ(topts.retval, 0, "test_run"); @@ -31,7 +21,28 @@ static int fexit_test_common(struct fexit_test_lskel *fexit_skel) return -1; } - fexit_test_lskel__detach(fexit_skel); + return 0; +} + +static int fexit_test_common(struct fexit_test *fexit_skel) +{ + struct bpf_link *link; + int err; + + err = fexit_test__attach(fexit_skel); + if (!ASSERT_OK(err, "fexit_attach")) + return err; + + /* Check that already linked program can't be attached again. */ + link = bpf_program__attach(fexit_skel->progs.test1); + if (!ASSERT_ERR_PTR(link, "fexit_attach_link")) + return -1; + + err = fexit_test_check(fexit_skel); + if (!ASSERT_OK(err, "fexit_test_check")) + return err; + + fexit_test__detach(fexit_skel); /* zero results for re-attach test */ memset(fexit_skel->bss, 0, sizeof(*fexit_skel->bss)); @@ -40,10 +51,10 @@ static int fexit_test_common(struct fexit_test_lskel *fexit_skel) static void fexit_test(void) { - struct fexit_test_lskel *fexit_skel = NULL; + struct fexit_test *fexit_skel = NULL; int err; - fexit_skel = fexit_test_lskel__open_and_load(); + fexit_skel = fexit_test__open_and_load(); if (!ASSERT_OK_PTR(fexit_skel, "fexit_skel_load")) goto cleanup; @@ -55,7 +66,7 @@ static void fexit_test(void) ASSERT_OK(err, "fexit_second_attach"); cleanup: - fexit_test_lskel__destroy(fexit_skel); + fexit_test__destroy(fexit_skel); } static void fexit_many_args(void) @@ -84,10 +95,42 @@ static void fexit_many_args(void) fexit_many_args__destroy(fexit_skel); } +static void fexit_test_multi(void) +{ + struct fexit_test *fexit_skel = NULL; + int err, prog_cnt; + + fexit_skel = fexit_test__open(); + if (!ASSERT_OK_PTR(fexit_skel, "fexit_skel_open")) + goto cleanup; + + prog_cnt = sizeof(fexit_skel->progs) / sizeof(long); + err = bpf_to_tracing_multi((void *)&fexit_skel->progs, prog_cnt); + if (!ASSERT_OK(err, "fexit_to_multi")) + goto cleanup; + + err = fexit_test__load(fexit_skel); + if (!ASSERT_OK(err, "fexit_skel_load")) + goto cleanup; + + err = bpf_attach_as_tracing_multi((void *)&fexit_skel->progs, + prog_cnt, + (void *)&fexit_skel->links); + if (!ASSERT_OK(err, "fexit_attach_multi")) + goto cleanup; + + err = fexit_test_check(fexit_skel); + ASSERT_OK(err, "fexit_first_attach"); +cleanup: + fexit_test__destroy(fexit_skel); +} + void test_fexit_test(void) { if (test__start_subtest("fexit")) fexit_test(); + if (test__start_subtest("fexit_multi")) + fexit_test_multi(); if (test__start_subtest("fexit_many_args")) fexit_many_args(); } diff --git a/tools/testing/selftests/bpf/prog_tests/modify_return.c b/tools/testing/selftests/bpf/prog_tests/modify_return.c index a70c99c2f8c8..3ca454379e90 100644 --- a/tools/testing/selftests/bpf/prog_tests/modify_return.c +++ b/tools/testing/selftests/bpf/prog_tests/modify_return.c @@ -49,6 +49,56 @@ static void run_test(__u32 input_retval, __u16 want_side_effect, __s16 want_ret) modify_return__destroy(skel); } +static void run_multi_test(__u32 input_retval, __u16 want_side_effect, __s16 want_ret) +{ + struct modify_return *skel = NULL; + int err, prog_fd, prog_cnt; + __u16 side_effect; + __s16 ret; + LIBBPF_OPTS(bpf_test_run_opts, topts); + + skel = modify_return__open(); + if (!ASSERT_OK_PTR(skel, "skel_open")) + goto cleanup; + + /* stack function args is not supported by tracing multi-link yet, + * so we only enable the bpf progs without stack function args. + */ + bpf_program__set_expected_attach_type(skel->progs.fentry_test, + BPF_TRACE_FENTRY_MULTI); + bpf_program__set_expected_attach_type(skel->progs.fexit_test, + BPF_TRACE_FEXIT_MULTI); + bpf_program__set_expected_attach_type(skel->progs.fmod_ret_test, + BPF_MODIFY_RETURN_MULTI); + + err = modify_return__load(skel); + if (!ASSERT_OK(err, "skel_load")) + goto cleanup; + + prog_cnt = sizeof(skel->progs) / sizeof(long); + err = bpf_attach_as_tracing_multi((void *)&skel->progs, + prog_cnt, + (void *)&skel->links); + if (!ASSERT_OK(err, "modify_return__attach failed")) + goto cleanup; + + skel->bss->input_retval = input_retval; + prog_fd = bpf_program__fd(skel->progs.fmod_ret_test); + err = bpf_prog_test_run_opts(prog_fd, &topts); + ASSERT_OK(err, "test_run"); + + side_effect = UPPER(topts.retval); + ret = LOWER(topts.retval); + + ASSERT_EQ(ret, want_ret, "test_run ret"); + ASSERT_EQ(side_effect, want_side_effect, "modify_return side_effect"); + ASSERT_EQ(skel->bss->fentry_result, 1, "modify_return fentry_result"); + ASSERT_EQ(skel->bss->fexit_result, 1, "modify_return fexit_result"); + ASSERT_EQ(skel->bss->fmod_ret_result, 1, "modify_return fmod_ret_result"); +cleanup: + modify_return__destroy(skel); +} + /* TODO: conflict with get_func_ip_test */ void serial_test_modify_return(void) { @@ -59,3 +109,13 @@ void serial_test_modify_return(void) 0 /* want_side_effect */, -EINVAL * 2 /* want_ret */); } + +void serial_test_modify_return_multi(void) +{ + run_multi_test(0 /* input_retval */, + 2 /* want_side_effect */, + 33 /* want_ret */); + run_multi_test(-EINVAL /* input_retval */, + 1 /* want_side_effect */, + -EINVAL + 29 /* want_ret */); +} diff --git a/tools/testing/selftests/bpf/prog_tests/tracing_multi_link.c b/tools/testing/selftests/bpf/prog_tests/tracing_multi_link.c new file mode 100644 index 000000000000..1cbe6089472f --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/tracing_multi_link.c @@ -0,0 +1,210 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 ChinaTelecom */ + +#include <test_progs.h> + +#include "tracing_multi_test.skel.h" +#include "fentry_multi_empty.skel.h" + +static void test_run(struct tracing_multi_test *skel) +{ + LIBBPF_OPTS(bpf_test_run_opts, topts); + int err, prog_fd; + + skel->bss->pid = getpid(); + prog_fd = bpf_program__fd(skel->progs.fentry_cookie_test1); + err = bpf_prog_test_run_opts(prog_fd, &topts); + ASSERT_OK(err, "test_run"); + ASSERT_EQ(topts.retval, 0, "test_run"); + + ASSERT_EQ(skel->bss->fentry_test1_result, 1, "fentry_test1_result"); + ASSERT_EQ(skel->bss->fentry_test2_result, 1, "fentry_test2_result"); + ASSERT_EQ(skel->bss->fentry_test3_result, 1, "fentry_test3_result"); + ASSERT_EQ(skel->bss->fentry_test4_result, 1, "fentry_test4_result"); + ASSERT_EQ(skel->bss->fentry_test5_result, 1, "fentry_test5_result"); + ASSERT_EQ(skel->bss->fentry_test6_result, 1, "fentry_test6_result"); + ASSERT_EQ(skel->bss->fentry_test7_result, 1, "fentry_test7_result"); + ASSERT_EQ(skel->bss->fentry_test8_result, 1, "fentry_test8_result"); +} + +static void test_skel_auto_api(void) +{ + struct tracing_multi_test *skel; + int err; + + skel = tracing_multi_test__open_and_load(); + if (!ASSERT_OK_PTR(skel, "tracing_multi_test__open_and_load")) + return; + + /* disable all programs that should fail */ + bpf_program__set_autoattach(skel->progs.fentry_fail_test1, false); + bpf_program__set_autoattach(skel->progs.fentry_fail_test2, false); + bpf_program__set_autoattach(skel->progs.fentry_fail_test3, false); + bpf_program__set_autoattach(skel->progs.fentry_fail_test4, false); + bpf_program__set_autoattach(skel->progs.fentry_fail_test5, false); + bpf_program__set_autoattach(skel->progs.fentry_fail_test6, false); + + bpf_program__set_autoattach(skel->progs.fexit_fail_test1, false); + bpf_program__set_autoattach(skel->progs.fexit_fail_test2, false); + bpf_program__set_autoattach(skel->progs.fexit_fail_test3, false); + + err = tracing_multi_test__attach(skel); + if (!ASSERT_OK(err, "tracing_multi_test__attach")) + goto cleanup; + + test_run(skel); + +cleanup: + tracing_multi_test__destroy(skel); +} + +static int attach_bpf(struct bpf_program *prog, struct bpf_link **link_ptr, + bool success) +{ + struct bpf_link *link; + int err; + + link = bpf_program__attach(prog); + err = libbpf_get_error(link); + if (!ASSERT_OK(success ? err : !err, "attach_bpf")) + return err; + *link_ptr = link; + + return 0; +} + +#define attach_skel_bpf(name, success) \ + attach_bpf(skel->progs.name, &skel->links.name, success) + +static void test_skel_manual_api(void) +{ + struct tracing_multi_test *skel; + + skel = tracing_multi_test__open_and_load(); + if (!ASSERT_OK_PTR(skel, "tracing_multi_test__open_and_load")) + return; + + if (attach_skel_bpf(fentry_success_test1, true) || + attach_skel_bpf(fentry_success_test2, true) || + attach_skel_bpf(fentry_success_test3, true) || + attach_skel_bpf(fentry_success_test4, true) || + attach_skel_bpf(fexit_success_test1, true) || + attach_skel_bpf(fexit_success_test2, true) || + attach_skel_bpf(fentry_fail_test1, false) || + attach_skel_bpf(fentry_fail_test2, false) || + attach_skel_bpf(fentry_fail_test3, false) || + attach_skel_bpf(fentry_fail_test4, false) || + attach_skel_bpf(fentry_fail_test5, false) || + attach_skel_bpf(fentry_fail_test6, false) || + attach_skel_bpf(fexit_fail_test1, false) || + attach_skel_bpf(fexit_fail_test2, false) || + attach_skel_bpf(fexit_fail_test3, false) || + attach_skel_bpf(fentry_cookie_test1, true)) + goto cleanup; + + test_run(skel); + +cleanup: + tracing_multi_test__destroy(skel); +} + +static void test_attach_api(void) +{ + LIBBPF_OPTS(bpf_trace_multi_opts, opts); + struct tracing_multi_test *skel; + struct bpf_link *link; + const char *syms[8] = { + "bpf_fentry_test1", + "bpf_fentry_test2", + "bpf_fentry_test3", + "bpf_fentry_test4", + "bpf_fentry_test5", + "bpf_fentry_test6", + "bpf_fentry_test7", + "bpf_fentry_test8", + }; + __u64 cookies[] = {1, 7, 2, 3, 4, 5, 6, 8}; + + skel = tracing_multi_test__open_and_load(); + if (!ASSERT_OK_PTR(skel, "tracing_multi_test__open_and_load")) + return; + + opts.syms = syms; + opts.cookies = cookies; + opts.cnt = ARRAY_SIZE(syms); + link = bpf_program__attach_trace_multi_opts(skel->progs.fentry_cookie_test1, + &opts); + if (!ASSERT_OK_PTR(link, "bpf_program__attach_trace_multi_opts")) + goto cleanup; + skel->links.fentry_cookie_test1 = link; + + skel->bss->test_cookie = true; + test_run(skel); +cleanup: + tracing_multi_test__destroy(skel); +} + +static void test_attach_bench(bool kernel) +{ + LIBBPF_OPTS(bpf_trace_multi_opts, opts); + struct fentry_multi_empty *skel; + long attach_start_ns, attach_end_ns; + long detach_start_ns, detach_end_ns; + double attach_delta, detach_delta; + struct bpf_link *link = NULL; + char **syms = NULL; + size_t cnt = 0; + + if (!ASSERT_OK(bpf_get_ksyms(&syms, &cnt, kernel), "get_syms")) + return; + + skel = fentry_multi_empty__open_and_load(); + if (!ASSERT_OK_PTR(skel, "fentry_multi_empty__open_and_load")) + goto cleanup; + + opts.syms = (const char **) syms; + opts.cnt = cnt; + opts.skip_invalid = true; + + attach_start_ns = get_time_ns(); + link = bpf_program__attach_trace_multi_opts(skel->progs.fentry_multi_empty, + &opts); + attach_end_ns = get_time_ns(); + + if (!ASSERT_OK_PTR(link, "bpf_program__attach_trace_multi_opts")) + return; + + detach_start_ns = get_time_ns(); + bpf_link__destroy(link); + detach_end_ns = get_time_ns(); + + attach_delta = (attach_end_ns - attach_start_ns) / 1000000000.0; + detach_delta = (detach_end_ns - detach_start_ns) / 1000000000.0; + + printf("%s: found %lu functions\n", __func__, opts.cnt); + printf("%s: attached in %7.3lfs\n", __func__, attach_delta); + printf("%s: detached in %7.3lfs\n", __func__, detach_delta); + +cleanup: + fentry_multi_empty__destroy(skel); + if (syms) + free(syms); +} + +void serial_test_tracing_multi_attach_bench(void) +{ + if (test__start_subtest("kernel")) + test_attach_bench(true); + if (test__start_subtest("modules")) + test_attach_bench(false); +} + +void test_tracing_multi_attach_test(void) +{ + if (test__start_subtest("skel_auto_api")) + test_skel_auto_api(); + if (test__start_subtest("skel_manual_api")) + test_skel_manual_api(); + if (test__start_subtest("attach_api")) + test_attach_api(); +} diff --git a/tools/testing/selftests/bpf/progs/fentry_multi_empty.c b/tools/testing/selftests/bpf/progs/fentry_multi_empty.c new file mode 100644 index 000000000000..a09ba216dff8 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/fentry_multi_empty.c @@ -0,0 +1,13 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 ChinaTelecom */ +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +char _license[] SEC("license") = "GPL"; + +SEC("fentry.multi/bpf_fentry_test1") +int BPF_PROG(fentry_multi_empty) +{ + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/tracing_multi_test.c b/tools/testing/selftests/bpf/progs/tracing_multi_test.c new file mode 100644 index 000000000000..fa27851896b9 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/tracing_multi_test.c @@ -0,0 +1,181 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 ChinaTelecom */ +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +char _license[] SEC("license") = "GPL"; + +struct bpf_testmod_struct_arg_1 { + int a; +}; +struct bpf_testmod_struct_arg_2 { + long a; + long b; +}; + +__u64 test_result = 0; + +int pid = 0; +int test_cookie = 0; + +__u64 fentry_test1_result = 0; +__u64 fentry_test2_result = 0; +__u64 fentry_test3_result = 0; +__u64 fentry_test4_result = 0; +__u64 fentry_test5_result = 0; +__u64 fentry_test6_result = 0; +__u64 fentry_test7_result = 0; +__u64 fentry_test8_result = 0; + +extern const void bpf_fentry_test1 __ksym; +extern const void bpf_fentry_test2 __ksym; +extern const void bpf_fentry_test3 __ksym; +extern const void bpf_fentry_test4 __ksym; +extern const void bpf_fentry_test5 __ksym; +extern const void bpf_fentry_test6 __ksym; +extern const void bpf_fentry_test7 __ksym; +extern const void bpf_fentry_test8 __ksym; + +SEC("fentry.multi/bpf_testmod_test_struct_arg_1,bpf_testmod_test_struct_arg_13") +int BPF_PROG2(fentry_success_test1, struct bpf_testmod_struct_arg_2, a) +{ + test_result = a.a + a.b; + return 0; +} + +SEC("fentry.multi/bpf_testmod_test_struct_arg_2,bpf_testmod_test_struct_arg_10") +int BPF_PROG2(fentry_success_test2, int, a, struct bpf_testmod_struct_arg_2, b) +{ + test_result = a + b.a + b.b; + return 0; +} + +SEC("fentry.multi/bpf_testmod_test_struct_arg_1,bpf_testmod_test_struct_arg_4") +int BPF_PROG2(fentry_success_test3, struct bpf_testmod_struct_arg_2, a, int, b, + int, c) +{ + test_result = c; + return 0; +} + +SEC("fentry.multi/bpf_testmod_test_struct_arg_1,bpf_testmod_test_struct_arg_2") +int BPF_PROG2(fentry_success_test4, struct bpf_testmod_struct_arg_2, a, int, b, + int, c) +{ + test_result = c; + return 0; +} + +SEC("fentry.multi/bpf_testmod_test_struct_arg_1,bpf_testmod_test_struct_arg_1") +int BPF_PROG2(fentry_fail_test1, struct bpf_testmod_struct_arg_2, a) +{ + test_result = a.a + a.b; + return 0; +} + +SEC("fentry.multi/bpf_testmod_test_struct_arg_1,bpf_testmod_test_struct_arg_2") +int BPF_PROG2(fentry_fail_test2, struct bpf_testmod_struct_arg_2, a) +{ + test_result = a.a + a.b; + return 0; +} + +SEC("fentry.multi/bpf_testmod_test_struct_arg_1,bpf_testmod_test_struct_arg_4") +int BPF_PROG2(fentry_fail_test3, struct bpf_testmod_struct_arg_2, a) +{ + test_result = a.a + a.b; + return 0; +} + +SEC("fentry.multi/bpf_testmod_test_struct_arg_2,bpf_testmod_test_struct_arg_2") +int BPF_PROG2(fentry_fail_test4, int, a, struct bpf_testmod_struct_arg_2, b) +{ + test_result = a + b.a + b.b; + return 0; +} + +SEC("fentry.multi/bpf_testmod_test_struct_arg_2,bpf_testmod_test_struct_arg_13") +int BPF_PROG2(fentry_fail_test5, int, a, struct bpf_testmod_struct_arg_2, b) +{ + test_result = a + b.a + b.b; + return 0; +} + +SEC("fentry.multi/bpf_testmod_test_struct_arg_1,bpf_testmod_test_struct_arg_12") +int BPF_PROG2(fentry_fail_test6, struct bpf_testmod_struct_arg_2, a, int, b, + int, c) +{ + test_result = c; + return 0; +} + +SEC("fexit.multi/bpf_testmod_test_struct_arg_1,bpf_testmod_test_struct_arg_2,bpf_testmod_test_struct_arg_3") +int BPF_PROG2(fexit_success_test1, struct bpf_testmod_struct_arg_2, a, int, b, + int, c, int, retval) +{ + test_result = retval; + return 0; +} + +SEC("fexit.multi/bpf_testmod_test_struct_arg_2,bpf_testmod_test_struct_arg_12") +int BPF_PROG2(fexit_success_test2, int, a, struct bpf_testmod_struct_arg_2, b, + int, c, int, retval) +{ + test_result = a + b.a + b.b + retval; + return 0; +} + +SEC("fexit.multi/bpf_testmod_test_struct_arg_1,bpf_testmod_test_struct_arg_4") +int BPF_PROG2(fexit_fail_test1, struct bpf_testmod_struct_arg_2, a, int, b, + int, c, int, retval) +{ + test_result = retval; + return 0; +} + +SEC("fexit.multi/bpf_testmod_test_struct_arg_2,bpf_testmod_test_struct_arg_10") +int BPF_PROG2(fexit_fail_test2, int, a, struct bpf_testmod_struct_arg_2, b, + int, c, int, retval) +{ + test_result = a + b.a + b.b + retval; + return 0; +} + +SEC("fexit.multi/bpf_testmod_test_struct_arg_2,bpf_testmod_test_struct_arg_11") +int BPF_PROG2(fexit_fail_test3, int, a, struct bpf_testmod_struct_arg_2, b, + int, c, int, retval) +{ + test_result = a + b.a + b.b + retval; + return 0; +} + +static void tracing_multi_check_cookie(unsigned long long *ctx) +{ + if (bpf_get_current_pid_tgid() >> 32 != pid) + return; + + __u64 cookie = test_cookie ? bpf_get_attach_cookie(ctx) : 0; + __u64 addr = bpf_get_func_ip(ctx); + +#define SET(__var, __addr, __cookie) ({ \ + if (((const void *) addr == __addr) && \ + (!test_cookie || (cookie == __cookie))) \ + __var = 1; \ +}) + SET(fentry_test1_result, &bpf_fentry_test1, 1); + SET(fentry_test2_result, &bpf_fentry_test2, 7); + SET(fentry_test3_result, &bpf_fentry_test3, 2); + SET(fentry_test4_result, &bpf_fentry_test4, 3); + SET(fentry_test5_result, &bpf_fentry_test5, 4); + SET(fentry_test6_result, &bpf_fentry_test6, 5); + SET(fentry_test7_result, &bpf_fentry_test7, 6); + SET(fentry_test8_result, &bpf_fentry_test8, 8); +} + +SEC("fentry.multi/bpf_fentry_test1,bpf_fentry_test2,bpf_fentry_test3,bpf_fentry_test4,bpf_fentry_test5,bpf_fentry_test6,bpf_fentry_test7,bpf_fentry_test8") +int BPF_PROG(fentry_cookie_test1) +{ + tracing_multi_check_cookie(ctx); + return 0; +} diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c index e9e918cdf31f..07ea1d5d3795 100644 --- a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c +++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c @@ -128,6 +128,30 @@ bpf_testmod_test_struct_arg_9(u64 a, void *b, short c, int d, void *e, char f, return bpf_testmod_test_struct_arg_result; } +noinline int +bpf_testmod_test_struct_arg_10(int a, struct bpf_testmod_struct_arg_2 b) { + bpf_testmod_test_struct_arg_result = a + b.a + b.b; + return bpf_testmod_test_struct_arg_result; +} + +noinline struct bpf_testmod_struct_arg_2 * +bpf_testmod_test_struct_arg_11(int a, struct bpf_testmod_struct_arg_2 b, int c) { + bpf_testmod_test_struct_arg_result = a + b.a + b.b + c; + return (void *)bpf_testmod_test_struct_arg_result; +} + +noinline int +bpf_testmod_test_struct_arg_12(int a, struct bpf_testmod_struct_arg_2 b, int *c) { + bpf_testmod_test_struct_arg_result = a + b.a + b.b + *c; + return bpf_testmod_test_struct_arg_result; +} + +noinline int +bpf_testmod_test_struct_arg_13(struct bpf_testmod_struct_arg_2 b) { + bpf_testmod_test_struct_arg_result = b.a + b.b; + return bpf_testmod_test_struct_arg_result; +} + noinline int bpf_testmod_test_arg_ptr_to_struct(struct bpf_testmod_struct_arg_1 *a) { bpf_testmod_test_struct_arg_result = a->a; diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c index 309d9d4a8ace..533b714f1ca6 100644 --- a/tools/testing/selftests/bpf/test_progs.c +++ b/tools/testing/selftests/bpf/test_progs.c @@ -667,6 +667,56 @@ int bpf_find_map(const char *test, struct bpf_object *obj, const char *name) return bpf_map__fd(map); } +int bpf_to_tracing_multi(struct bpf_program **progs, int prog_cnt) +{ + enum bpf_attach_type type; + int i, err; + + for (i = 0; i < prog_cnt; i++) { + type = bpf_program__get_expected_attach_type(progs[i]); + if (type == BPF_TRACE_FENTRY) + type = BPF_TRACE_FENTRY_MULTI; + else if (type == BPF_TRACE_FEXIT) + type = BPF_TRACE_FEXIT_MULTI; + else if (type == BPF_MODIFY_RETURN) + type = BPF_MODIFY_RETURN_MULTI; + else + continue; + err = bpf_program__set_expected_attach_type(progs[i], type); + if (err) + return err; + } + + return 0; +} + +int bpf_attach_as_tracing_multi(struct bpf_program **progs, int prog_cnt, + struct bpf_link **link) +{ + struct bpf_link *__link; + int err, type; + + for (int i = 0; i < prog_cnt; i++) { + LIBBPF_OPTS(bpf_trace_multi_opts, opts); + + type = bpf_program__get_expected_attach_type(progs[i]); + if (type != BPF_TRACE_FENTRY_MULTI && + type != BPF_TRACE_FEXIT_MULTI && + type != BPF_MODIFY_RETURN_MULTI) + continue; + + opts.attach_tracing = true; + __link = bpf_program__attach_trace_multi_opts(progs[i], &opts); + err = libbpf_get_error(link); + if (err) + return err; + + link[i] = __link; + } + + return 0; +} + int compare_map_keys(int map1_fd, int map2_fd) { __u32 key, next_key; diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h index df2222a1806f..7e30c6dbf35c 100644 --- a/tools/testing/selftests/bpf/test_progs.h +++ b/tools/testing/selftests/bpf/test_progs.h @@ -496,6 +496,9 @@ int trigger_module_test_write(int write_sz); int write_sysctl(const char *sysctl, const char *value); int get_bpf_max_tramp_links_from(struct btf *btf); int get_bpf_max_tramp_links(void); +int bpf_to_tracing_multi(struct bpf_program **progs, int prog_cnt); +int bpf_attach_as_tracing_multi(struct bpf_program **progs, int prog_cnt, + struct bpf_link **link); struct netns_obj; struct netns_obj *netns_new(const char *name, bool open); diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c index d24baf244d1f..a9e9dd3be226 100644 --- a/tools/testing/selftests/bpf/trace_helpers.c +++ b/tools/testing/selftests/bpf/trace_helpers.c @@ -559,6 +559,75 @@ static bool skip_entry(char *name) if (!strncmp(name, "__ftrace_invalid_address__", sizeof("__ftrace_invalid_address__") - 1)) return true; + + /* skip functions in "btf_id_deny" */ + if (!strcmp(name, "migrate_disable")) + return true; + if (!strcmp(name, "migrate_enable")) + return true; + if (!strcmp(name, "rcu_read_unlock_strict")) + return true; + if (!strcmp(name, "preempt_count_add")) + return true; + if (!strcmp(name, "preempt_count_sub")) + return true; + if (!strcmp(name, "__rcu_read_lock")) + return true; + if (!strcmp(name, "__rcu_read_unlock")) + return true; + + /* Following symbols have multi definition in kallsyms, take + * "t_next" for example: + * + * ffffffff813c10d0 t t_next + * ffffffff813d31b0 t t_next + * ffffffff813e06b0 t t_next + * ffffffff813eb360 t t_next + * ffffffff81613360 t t_next + * + * but only one of them have corresponding mrecord: + * ffffffff81613364 t_next + * + * The kernel search the target function address by the symbol + * name "t_next" with kallsyms_lookup_name() during attaching + * and the function "0xffffffff813c10d0" can be matched, which + * doesn't have a corresponding mrecord. And this will make + * the attach failing. Skip the functions like this. + * + * The list maybe not whole, so we still can fail......We need a + * way to make the whole things right. Yes, we need fix it :/ + */ + if (!strcmp(name, "kill_pid_usb_asyncio")) + return true; + if (!strcmp(name, "t_next")) + return true; + if (!strcmp(name, "t_stop")) + return true; + if (!strcmp(name, "t_start")) + return true; + if (!strcmp(name, "p_next")) + return true; + if (!strcmp(name, "p_stop")) + return true; + if (!strcmp(name, "p_start")) + return true; + if (!strcmp(name, "mem32_serial_out")) + return true; + if (!strcmp(name, "mem32_serial_in")) + return true; + if (!strcmp(name, "io_serial_in")) + return true; + if (!strcmp(name, "io_serial_out")) + return true; + if (!strcmp(name, "event_callback")) + return true; + if (!strcmp(name, "amd_pmu_init")) + return true; + if (!strcmp(name, "sync_regs")) + return true; + if (!strcmp(name, "empty")) + return true; + return false; } -- 2.39.5

1 week, 4 days

1
0
0 0

[PATCH bpf-next v2 16/18] selftests/bpf: move get_ksyms and get_addrs to trace_helpers.c

by Menglong Dong

We need to get all the kernel function that can be traced sometimes, so we move the get_syms() and get_addrs() in kprobe_multi_test.c to trace_helpers.c and rename it to bpf_get_ksyms() and bpf_get_addrs(). Signed-off-by: Menglong Dong <dongml2(a)chinatelecom.cn> --- .../bpf/prog_tests/kprobe_multi_test.c | 220 +----------------- tools/testing/selftests/bpf/trace_helpers.c | 214 +++++++++++++++++ tools/testing/selftests/bpf/trace_helpers.h | 3 + 3 files changed, 220 insertions(+), 217 deletions(-) diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c index e19ef509ebf8..171706e78da8 100644 --- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c +++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c @@ -422,220 +422,6 @@ static void test_unique_match(void) kprobe_multi__destroy(skel); } -static size_t symbol_hash(long key, void *ctx __maybe_unused) -{ - return str_hash((const char *) key); -} - -static bool symbol_equal(long key1, long key2, void *ctx __maybe_unused) -{ - return strcmp((const char *) key1, (const char *) key2) == 0; -} - -static bool is_invalid_entry(char *buf, bool kernel) -{ - if (kernel && strchr(buf, '[')) - return true; - if (!kernel && !strchr(buf, '[')) - return true; - return false; -} - -static bool skip_entry(char *name) -{ - /* - * We attach to almost all kernel functions and some of them - * will cause 'suspicious RCU usage' when fprobe is attached - * to them. Filter out the current culprits - arch_cpu_idle - * default_idle and rcu_* functions. - */ - if (!strcmp(name, "arch_cpu_idle")) - return true; - if (!strcmp(name, "default_idle")) - return true; - if (!strncmp(name, "rcu_", 4)) - return true; - if (!strcmp(name, "bpf_dispatcher_xdp_func")) - return true; - if (!strncmp(name, "__ftrace_invalid_address__", - sizeof("__ftrace_invalid_address__") - 1)) - return true; - return false; -} - -/* Do comparision by ignoring '.llvm.<hash>' suffixes. */ -static int compare_name(const char *name1, const char *name2) -{ - const char *res1, *res2; - int len1, len2; - - res1 = strstr(name1, ".llvm."); - res2 = strstr(name2, ".llvm."); - len1 = res1 ? res1 - name1 : strlen(name1); - len2 = res2 ? res2 - name2 : strlen(name2); - - if (len1 == len2) - return strncmp(name1, name2, len1); - if (len1 < len2) - return strncmp(name1, name2, len1) <= 0 ? -1 : 1; - return strncmp(name1, name2, len2) >= 0 ? 1 : -1; -} - -static int load_kallsyms_compare(const void *p1, const void *p2) -{ - return compare_name(((const struct ksym *)p1)->name, ((const struct ksym *)p2)->name); -} - -static int search_kallsyms_compare(const void *p1, const struct ksym *p2) -{ - return compare_name(p1, p2->name); -} - -static int get_syms(char ***symsp, size_t *cntp, bool kernel) -{ - size_t cap = 0, cnt = 0; - char *name = NULL, *ksym_name, **syms = NULL; - struct hashmap *map; - struct ksyms *ksyms; - struct ksym *ks; - char buf[256]; - FILE *f; - int err = 0; - - ksyms = load_kallsyms_custom_local(load_kallsyms_compare); - if (!ASSERT_OK_PTR(ksyms, "load_kallsyms_custom_local")) - return -EINVAL; - - /* - * The available_filter_functions contains many duplicates, - * but other than that all symbols are usable in kprobe multi - * interface. - * Filtering out duplicates by using hashmap__add, which won't - * add existing entry. - */ - - if (access("/sys/kernel/tracing/trace", F_OK) == 0) - f = fopen("/sys/kernel/tracing/available_filter_functions", "r"); - else - f = fopen("/sys/kernel/debug/tracing/available_filter_functions", "r"); - - if (!f) - return -EINVAL; - - map = hashmap__new(symbol_hash, symbol_equal, NULL); - if (IS_ERR(map)) { - err = libbpf_get_error(map); - goto error; - } - - while (fgets(buf, sizeof(buf), f)) { - if (is_invalid_entry(buf, kernel)) - continue; - - free(name); - if (sscanf(buf, "%ms$*[^\n]\n", &name) != 1) - continue; - if (skip_entry(name)) - continue; - - ks = search_kallsyms_custom_local(ksyms, name, search_kallsyms_compare); - if (!ks) { - err = -EINVAL; - goto error; - } - - ksym_name = ks->name; - err = hashmap__add(map, ksym_name, 0); - if (err == -EEXIST) { - err = 0; - continue; - } - if (err) - goto error; - - err = libbpf_ensure_mem((void **) &syms, &cap, - sizeof(*syms), cnt + 1); - if (err) - goto error; - - syms[cnt++] = ksym_name; - } - - *symsp = syms; - *cntp = cnt; - -error: - free(name); - fclose(f); - hashmap__free(map); - if (err) - free(syms); - return err; -} - -static int get_addrs(unsigned long **addrsp, size_t *cntp, bool kernel) -{ - unsigned long *addr, *addrs, *tmp_addrs; - int err = 0, max_cnt, inc_cnt; - char *name = NULL; - size_t cnt = 0; - char buf[256]; - FILE *f; - - if (access("/sys/kernel/tracing/trace", F_OK) == 0) - f = fopen("/sys/kernel/tracing/available_filter_functions_addrs", "r"); - else - f = fopen("/sys/kernel/debug/tracing/available_filter_functions_addrs", "r"); - - if (!f) - return -ENOENT; - - /* In my local setup, the number of entries is 50k+ so Let us initially - * allocate space to hold 64k entries. If 64k is not enough, incrementally - * increase 1k each time. - */ - max_cnt = 65536; - inc_cnt = 1024; - addrs = malloc(max_cnt * sizeof(long)); - if (addrs == NULL) { - err = -ENOMEM; - goto error; - } - - while (fgets(buf, sizeof(buf), f)) { - if (is_invalid_entry(buf, kernel)) - continue; - - free(name); - if (sscanf(buf, "%p %ms$*[^\n]\n", &addr, &name) != 2) - continue; - if (skip_entry(name)) - continue; - - if (cnt == max_cnt) { - max_cnt += inc_cnt; - tmp_addrs = realloc(addrs, max_cnt); - if (!tmp_addrs) { - err = -ENOMEM; - goto error; - } - addrs = tmp_addrs; - } - - addrs[cnt++] = (unsigned long)addr; - } - - *addrsp = addrs; - *cntp = cnt; - -error: - free(name); - fclose(f); - if (err) - free(addrs); - return err; -} - static void do_bench_test(struct kprobe_multi_empty *skel, struct bpf_kprobe_multi_opts *opts) { long attach_start_ns, attach_end_ns; @@ -670,7 +456,7 @@ static void test_kprobe_multi_bench_attach(bool kernel) char **syms = NULL; size_t cnt = 0; - if (!ASSERT_OK(get_syms(&syms, &cnt, kernel), "get_syms")) + if (!ASSERT_OK(bpf_get_ksyms(&syms, &cnt, kernel), "bpf_get_ksyms")) return; skel = kprobe_multi_empty__open_and_load(); @@ -696,13 +482,13 @@ static void test_kprobe_multi_bench_attach_addr(bool kernel) size_t cnt = 0; int err; - err = get_addrs(&addrs, &cnt, kernel); + err = bpf_get_addrs(&addrs, &cnt, kernel); if (err == -ENOENT) { test__skip(); return; } - if (!ASSERT_OK(err, "get_addrs")) + if (!ASSERT_OK(err, "bpf_get_addrs")) return; skel = kprobe_multi_empty__open_and_load(); diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c index 81943c6254e6..d24baf244d1f 100644 --- a/tools/testing/selftests/bpf/trace_helpers.c +++ b/tools/testing/selftests/bpf/trace_helpers.c @@ -17,6 +17,7 @@ #include <linux/limits.h> #include <libelf.h> #include <gelf.h> +#include "bpf/hashmap.h" #include "bpf/libbpf_internal.h" #define TRACEFS_PIPE "/sys/kernel/tracing/trace_pipe" @@ -519,3 +520,216 @@ void read_trace_pipe(void) { read_trace_pipe_iter(trace_pipe_cb, NULL, 0); } + +static size_t symbol_hash(long key, void *ctx __maybe_unused) +{ + return str_hash((const char *) key); +} + +static bool symbol_equal(long key1, long key2, void *ctx __maybe_unused) +{ + return strcmp((const char *) key1, (const char *) key2) == 0; +} + +static bool is_invalid_entry(char *buf, bool kernel) +{ + if (kernel && strchr(buf, '[')) + return true; + if (!kernel && !strchr(buf, '[')) + return true; + return false; +} + +static bool skip_entry(char *name) +{ + /* + * We attach to almost all kernel functions and some of them + * will cause 'suspicious RCU usage' when fprobe is attached + * to them. Filter out the current culprits - arch_cpu_idle + * default_idle and rcu_* functions. + */ + if (!strcmp(name, "arch_cpu_idle")) + return true; + if (!strcmp(name, "default_idle")) + return true; + if (!strncmp(name, "rcu_", 4)) + return true; + if (!strcmp(name, "bpf_dispatcher_xdp_func")) + return true; + if (!strncmp(name, "__ftrace_invalid_address__", + sizeof("__ftrace_invalid_address__") - 1)) + return true; + return false; +} + +/* Do comparison by ignoring '.llvm.<hash>' suffixes. */ +static int compare_name(const char *name1, const char *name2) +{ + const char *res1, *res2; + int len1, len2; + + res1 = strstr(name1, ".llvm."); + res2 = strstr(name2, ".llvm."); + len1 = res1 ? res1 - name1 : strlen(name1); + len2 = res2 ? res2 - name2 : strlen(name2); + + if (len1 == len2) + return strncmp(name1, name2, len1); + if (len1 < len2) + return strncmp(name1, name2, len1) <= 0 ? -1 : 1; + return strncmp(name1, name2, len2) >= 0 ? 1 : -1; +} + +static int load_kallsyms_compare(const void *p1, const void *p2) +{ + return compare_name(((const struct ksym *)p1)->name, ((const struct ksym *)p2)->name); +} + +static int search_kallsyms_compare(const void *p1, const struct ksym *p2) +{ + return compare_name(p1, p2->name); +} + +int bpf_get_ksyms(char ***symsp, size_t *cntp, bool kernel) +{ + size_t cap = 0, cnt = 0; + char *name = NULL, *ksym_name, **syms = NULL; + struct hashmap *map; + struct ksyms *ksyms; + struct ksym *ks; + char buf[256]; + FILE *f; + int err = 0; + + ksyms = load_kallsyms_custom_local(load_kallsyms_compare); + if (!ksyms) + return -EINVAL; + + /* + * The available_filter_functions contains many duplicates, + * but other than that all symbols are usable to trace. + * Filtering out duplicates by using hashmap__add, which won't + * add existing entry. + */ + + if (access("/sys/kernel/tracing/trace", F_OK) == 0) + f = fopen("/sys/kernel/tracing/available_filter_functions", "r"); + else + f = fopen("/sys/kernel/debug/tracing/available_filter_functions", "r"); + + if (!f) + return -EINVAL; + + map = hashmap__new(symbol_hash, symbol_equal, NULL); + if (IS_ERR(map)) { + err = libbpf_get_error(map); + goto error; + } + + while (fgets(buf, sizeof(buf), f)) { + if (is_invalid_entry(buf, kernel)) + continue; + + free(name); + if (sscanf(buf, "%ms$*[^\n]\n", &name) != 1) + continue; + if (skip_entry(name)) + continue; + + ks = search_kallsyms_custom_local(ksyms, name, search_kallsyms_compare); + if (!ks) { + err = -EINVAL; + goto error; + } + + ksym_name = ks->name; + err = hashmap__add(map, ksym_name, 0); + if (err == -EEXIST) { + err = 0; + continue; + } + if (err) + goto error; + + err = libbpf_ensure_mem((void **) &syms, &cap, + sizeof(*syms), cnt + 1); + if (err) + goto error; + + syms[cnt++] = ksym_name; + } + + *symsp = syms; + *cntp = cnt; + +error: + free(name); + fclose(f); + hashmap__free(map); + if (err) + free(syms); + return err; +} + +int bpf_get_addrs(unsigned long **addrsp, size_t *cntp, bool kernel) +{ + unsigned long *addr, *addrs, *tmp_addrs; + int err = 0, max_cnt, inc_cnt; + char *name = NULL; + size_t cnt = 0; + char buf[256]; + FILE *f; + + if (access("/sys/kernel/tracing/trace", F_OK) == 0) + f = fopen("/sys/kernel/tracing/available_filter_functions_addrs", "r"); + else + f = fopen("/sys/kernel/debug/tracing/available_filter_functions_addrs", "r"); + + if (!f) + return -ENOENT; + + /* In my local setup, the number of entries is 50k+ so Let us initially + * allocate space to hold 64k entries. If 64k is not enough, incrementally + * increase 1k each time. + */ + max_cnt = 65536; + inc_cnt = 1024; + addrs = malloc(max_cnt * sizeof(long)); + if (addrs == NULL) { + err = -ENOMEM; + goto error; + } + + while (fgets(buf, sizeof(buf), f)) { + if (is_invalid_entry(buf, kernel)) + continue; + + free(name); + if (sscanf(buf, "%p %ms$*[^\n]\n", &addr, &name) != 2) + continue; + if (skip_entry(name)) + continue; + + if (cnt == max_cnt) { + max_cnt += inc_cnt; + tmp_addrs = realloc(addrs, max_cnt); + if (!tmp_addrs) { + err = -ENOMEM; + goto error; + } + addrs = tmp_addrs; + } + + addrs[cnt++] = (unsigned long)addr; + } + + *addrsp = addrs; + *cntp = cnt; + +error: + free(name); + fclose(f); + if (err) + free(addrs); + return err; +} diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h index 2ce873c9f9aa..9437bdd4afa5 100644 --- a/tools/testing/selftests/bpf/trace_helpers.h +++ b/tools/testing/selftests/bpf/trace_helpers.h @@ -41,4 +41,7 @@ ssize_t get_rel_offset(uintptr_t addr); int read_build_id(const char *path, char *build_id, size_t size); +int bpf_get_ksyms(char ***symsp, size_t *cntp, bool kernel); +int bpf_get_addrs(unsigned long **addrsp, size_t *cntp, bool kernel); + #endif -- 2.39.5

1 week, 4 days

1
0
0 0

[PATCH v2 0/4] kselftest/arm64: Add coverage for the interaction of vfork() and GCS

by Mark Brown

I had cause to look at the vfork() support for GCS and realised that we don't have any direct test coverage, this series does so by adding vfork() to nolibc and then using that in basic-gcs to provide some simple vfork() coverage. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Changes in v2: - Add replacement of ifdef with if defined() in nolibc since the code doesn't reflect the coding style. - Remove check for arch specific vfork(). - Link to v1: https://lore.kernel.org/r/20250609-arm64-gcs-vfork-exit-v1-0-baad0f085747@k… --- Mark Brown (4): tools/nolibc: Replace ifdef with if defined() in sys.h tools/nolibc: Provide vfork() kselftest/arm64: Add a test for vfork() with GCS selftests/nolibc: Add coverage of vfork() tools/include/nolibc/sys.h | 57 +++++++++++++++++------- tools/testing/selftests/arm64/gcs/basic-gcs.c | 63 +++++++++++++++++++++++++++ tools/testing/selftests/nolibc/nolibc-test.c | 23 ++++++++-- 3 files changed, 124 insertions(+), 19 deletions(-) --- base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494 change-id: 20250528-arm64-gcs-vfork-exit-4a7daf7652ee Best regards, -- Mark Brown <broonie(a)kernel.org>

1 week, 4 days

3
7
0 0

[PATCH] selftests: net: fix resource leak in napi_id_helper.c

by Malaya Kumar Rout

Resolve minor resource leaks reported by cppcheck in napi_id_helper.c cppcheck output before this patch: tools/testing/selftests/drivers/net/napi_id_helper.c:37:3: error: Resource leak: server [resourceLeak] tools/testing/selftests/drivers/net/napi_id_helper.c:46:3: error: Resource leak: server [resourceLeak] tools/testing/selftests/drivers/net/napi_id_helper.c:51:3: error: Resource leak: server [resourceLeak] tools/testing/selftests/drivers/net/napi_id_helper.c:59:3: error: Resource leak: server [resourceLeak] tools/testing/selftests/drivers/net/napi_id_helper.c:67:3: error: Resource leak: server [resourceLeak] tools/testing/selftests/drivers/net/napi_id_helper.c:76:3: error: Resource leak: server [resourceLeak] cppcheck output after this patch: No resource leaks found Signed-off-by: Malaya Kumar Rout <malayarout91(a)gmail.com> --- tools/testing/selftests/drivers/net/napi_id_helper.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/tools/testing/selftests/drivers/net/napi_id_helper.c b/tools/testing/selftests/drivers/net/napi_id_helper.c index eecd610c2109..1441b8d49b91 100644 --- a/tools/testing/selftests/drivers/net/napi_id_helper.c +++ b/tools/testing/selftests/drivers/net/napi_id_helper.c @@ -34,6 +34,7 @@ int main(int argc, char *argv[]) if (setsockopt(server, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt))) { perror("setsockopt"); + close(server); return 1; } @@ -43,11 +44,13 @@ int main(int argc, char *argv[]) if (bind(server, (struct sockaddr *)&address, sizeof(address)) < 0) { perror("bind failed"); + close(server); return 1; } if (listen(server, 1) < 0) { perror("listen"); + close(server); return 1; } @@ -56,6 +59,7 @@ int main(int argc, char *argv[]) client = accept(server, NULL, 0); if (client < 0) { perror("accept"); + close(server); return 1; } @@ -64,6 +68,7 @@ int main(int argc, char *argv[]) &optlen); if (ret != 0) { perror("getsockopt"); + close(server); return 1; } @@ -73,6 +78,7 @@ int main(int argc, char *argv[]) if (napi_id == 0) { fprintf(stderr, "napi ID is 0\n"); + close(server); return 1; } -- 2.43.0

1 week, 4 days

5
6
0 0

[PATCH] global: fix misapplications of "awhile"

by Ahelenia Ziemiańska

Of these: 7 "for a while" typos 5 "take a while" typos 1 misreading of "once in a while"? 3 awhiles used correctly remain in the tree Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli(a)nabijaczleweli.xyz> --- Documentation/trace/histogram.rst | 2 +- arch/sh/drivers/pci/common.c | 2 +- arch/sh/drivers/pci/pci-sh7780.c | 2 +- drivers/atm/lanai.c | 2 +- drivers/md/bcache/bcache.h | 2 +- drivers/md/bcache/request.c | 2 +- drivers/net/ethernet/google/gve/gve_rx_dqo.c | 2 +- drivers/scsi/hpsa.c | 2 +- drivers/tty/serial/jsm/jsm_neo.c | 2 +- fs/ocfs2/dlm/dlmrecovery.c | 2 +- sound/pci/emu10k1/emu10k1_main.c | 2 +- sound/pci/emu10k1/emupcm.c | 2 +- tools/testing/selftests/powerpc/tm/tm-tmspr.c | 2 +- 13 files changed, 13 insertions(+), 13 deletions(-) diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst index 0aada18c38c6..2b98c1720a54 100644 --- a/Documentation/trace/histogram.rst +++ b/Documentation/trace/histogram.rst @@ -249,7 +249,7 @@ Extended error information table, it should keep a running total of the number of bytes requested by that call_site. - We'll let it run for awhile and then dump the contents of the 'hist' + We'll let it run for a while and then dump the contents of the 'hist' file in the kmalloc event's subdirectory (for readability, a number of entries have been omitted):: diff --git a/arch/sh/drivers/pci/common.c b/arch/sh/drivers/pci/common.c index 9633b6147a05..f95004c67e6c 100644 --- a/arch/sh/drivers/pci/common.c +++ b/arch/sh/drivers/pci/common.c @@ -148,7 +148,7 @@ unsigned int pcibios_handle_status_errors(unsigned long addr, cmd |= PCI_STATUS_PARITY | PCI_STATUS_DETECTED_PARITY; - /* Now back off of the IRQ for awhile */ + /* Now back off of the IRQ for a while */ if (hose->err_irq) { disable_irq_nosync(hose->err_irq); hose->err_timer.expires = jiffies + HZ; diff --git a/arch/sh/drivers/pci/pci-sh7780.c b/arch/sh/drivers/pci/pci-sh7780.c index 9a624a6ee354..f41d6939a3d9 100644 --- a/arch/sh/drivers/pci/pci-sh7780.c +++ b/arch/sh/drivers/pci/pci-sh7780.c @@ -153,7 +153,7 @@ static irqreturn_t sh7780_pci_serr_irq(int irq, void *dev_id) /* Deassert SERR */ __raw_writel(SH4_PCIINTM_SDIM, hose->reg_base + SH4_PCIINTM); - /* Back off the IRQ for awhile */ + /* Back off the IRQ for a while */ disable_irq_nosync(irq); hose->serr_timer.expires = jiffies + HZ; add_timer(&hose->serr_timer); diff --git a/drivers/atm/lanai.c b/drivers/atm/lanai.c index 2a1fe3080712..0dfa2cdc897c 100644 --- a/drivers/atm/lanai.c +++ b/drivers/atm/lanai.c @@ -755,7 +755,7 @@ static void lanai_shutdown_rx_vci(const struct lanai_vcc *lvcc) /* Shutdown transmitting on card. * Unfortunately the lanai needs us to wait until all the data * drains out of the buffer before we can dealloc it, so this - * can take awhile -- up to 370ms for a full 128KB buffer + * can take a while -- up to 370ms for a full 128KB buffer * assuming everone else is quiet. In theory the time is * boundless if there's a CBR VCC holding things up. */ diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 1d33e40d26ea..7318d9800370 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -499,7 +499,7 @@ struct gc_stat { * won't automatically reattach). * * CACHE_SET_STOPPING always gets set first when we're closing down a cache set; - * we'll continue to run normally for awhile with CACHE_SET_STOPPING set (i.e. + * we'll continue to run normally for a while with CACHE_SET_STOPPING set (i.e. * flushing dirty data). * * CACHE_SET_RUNNING means all cache devices have been registered and journal diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index af345dc6fde1..87b4341cb42c 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -257,7 +257,7 @@ static CLOSURE_CALLBACK(bch_data_insert_start) /* * But if it's not a writeback write we'd rather just bail out if - * there aren't any buckets ready to write to - it might take awhile and + * there aren't any buckets ready to write to - it might take a while and * we might be starving btree writes for gc or something. */ diff --git a/drivers/net/ethernet/google/gve/gve_rx_dqo.c b/drivers/net/ethernet/google/gve/gve_rx_dqo.c index dcb0545baa50..6a0be54f1c81 100644 --- a/drivers/net/ethernet/google/gve/gve_rx_dqo.c +++ b/drivers/net/ethernet/google/gve/gve_rx_dqo.c @@ -608,7 +608,7 @@ static int gve_rx_dqo(struct napi_struct *napi, struct gve_rx_ring *rx, buf_len = compl_desc->packet_len; hdr_len = compl_desc->header_len; - /* Page might have not been used for awhile and was likely last written + /* Page might have not been used for a while and was likely last written * by a different thread. */ if (rx->dqo.page_pool) { diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c index c73a71ac3c29..0066f15153a7 100644 --- a/drivers/scsi/hpsa.c +++ b/drivers/scsi/hpsa.c @@ -7795,7 +7795,7 @@ static int hpsa_wait_for_mode_change_ack(struct ctlr_info *h) u32 doorbell_value; unsigned long flags; - /* under certain very rare conditions, this can take awhile. + /* under certain very rare conditions, this can take a while. * (e.g.: hot replace a failed 144GB drive in a RAID 5 set right * as we enter this code.) */ diff --git a/drivers/tty/serial/jsm/jsm_neo.c b/drivers/tty/serial/jsm/jsm_neo.c index e8e13bf056e2..2eb9ff26d6e8 100644 --- a/drivers/tty/serial/jsm/jsm_neo.c +++ b/drivers/tty/serial/jsm/jsm_neo.c @@ -1189,7 +1189,7 @@ static irqreturn_t neo_intr(int irq, void *voidbrd) /* * The UART triggered us with a bogus interrupt type. * It appears the Exar chip, when REALLY bogged down, will throw - * these once and awhile. + * these periodically. * Its harmless, just ignore it and move on. */ jsm_dbg(INTR, &brd->pci_dev, diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c index 67fc62a49a76..00f52812dbb0 100644 --- a/fs/ocfs2/dlm/dlmrecovery.c +++ b/fs/ocfs2/dlm/dlmrecovery.c @@ -2632,7 +2632,7 @@ static int dlm_pick_recovery_master(struct dlm_ctxt *dlm) dlm_reco_master_ready(dlm), msecs_to_jiffies(1000)); if (!dlm_reco_master_ready(dlm)) { - mlog(0, "%s: reco master taking awhile\n", + mlog(0, "%s: reco master taking a while\n", dlm->name); goto again; } diff --git a/sound/pci/emu10k1/emu10k1_main.c b/sound/pci/emu10k1/emu10k1_main.c index bbe252b8916c..6050201851b1 100644 --- a/sound/pci/emu10k1/emu10k1_main.c +++ b/sound/pci/emu10k1/emu10k1_main.c @@ -606,7 +606,7 @@ static int snd_emu10k1_ecard_init(struct snd_emu10k1 *emu) /* Step 2: Calibrate the ADC and DAC */ snd_emu10k1_ecard_write(emu, EC_DACCAL | EC_LEDN | EC_TRIM_CSN); - /* Step 3: Wait for awhile; XXX We can't get away with this + /* Step 3: Wait for a while; XXX We can't get away with this * under a real operating system; we'll need to block and wait that * way. */ snd_emu10k1_wait(emu, 48000); diff --git a/sound/pci/emu10k1/emupcm.c b/sound/pci/emu10k1/emupcm.c index 1bf6e3d652f8..ca4b03317539 100644 --- a/sound/pci/emu10k1/emupcm.c +++ b/sound/pci/emu10k1/emupcm.c @@ -991,7 +991,7 @@ static snd_pcm_uframes_t snd_emu10k1_capture_pointer(struct snd_pcm_substream *s if (!epcm->running) return 0; if (epcm->first_ptr) { - udelay(50); /* hack, it takes awhile until capture is started */ + udelay(50); /* hack, it takes a while until capture is started */ epcm->first_ptr = 0; } ptr = snd_emu10k1_ptr_read(emu, epcm->capture_idx_reg, 0) & 0x0000ffff; diff --git a/tools/testing/selftests/powerpc/tm/tm-tmspr.c b/tools/testing/selftests/powerpc/tm/tm-tmspr.c index dd5ddffa28b7..0d64988ffb40 100644 --- a/tools/testing/selftests/powerpc/tm/tm-tmspr.c +++ b/tools/testing/selftests/powerpc/tm/tm-tmspr.c @@ -14,7 +14,7 @@ * (1) create more threads than cpus * (2) in each thread: * (a) set TFIAR and TFHAR a unique value - * (b) loop for awhile, continually checking to see if + * (b) loop for a while, continually checking to see if * either register has been corrupted. * * (3) Loop: -- 2.39.5

1 week, 4 days

2
1
0 0

[PATCH v11 net-next 00/15] AccECN protocol patch series

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com> Hello, Please find the v10 AccECN protocol patch series, which covers the core functionality of Accurate ECN, AccECN negotiation, AccECN TCP options, and AccECN failure handling. The Accurate ECN draft can be found in https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28 This patch series is part of the full AccECN patch series, which is available at https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/ Best Regards, Chia-Yu --- v11 (04-Jul-2025) - Fix compilation issue of some intermediate patches in v10 v10 (03-Jul-2025) - Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>) - Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>) - Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>) - Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>) - Add READ_ONCE() over every reads of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>) - Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch - Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>) v9 (21-Jun-2025) - Use tcp_data_ecn_check() to set TCP_ECN_SEE flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>) - Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accruate ECN (Paolo Abeni <pabeni(a)redhat.com>) - Restruct the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>) - Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>) - Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>) - Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>) v8 (10-Jun-2025) - Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>) - Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>) - Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>) - Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>) - Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>) - Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>) v7 (14-May-2025) - Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>) - Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>) - Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>) - Update commit message for #9 to explain the increase in tcp_sock_write_rx group size - Modify group size of tcp_sock_write_tx in #10 based on pahole results v6 (09-May-2025) - Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>) - Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>) - Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>) - Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>) - Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>) - Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>) - Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>) - Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>) - Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>) - Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>) - Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>) - Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>) - Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15 v5 (22-Apr-2025) - Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>) v4 (18-Apr-2025) - Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>) v3 (14-Apr-2025) - Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>) v2 (18-Mar-2025) - Add one missing patch from the previous AccECN protocol preparation patch series to this patch series. --- Chia-Yu Chang (6): tcp: reorganize tcp_sock_write_txrx group for variables later tcp: ecn functions in separated include file tcp: Add wait_third_ack flag for ECN negotiation in simultaneous connect tcp: accecn: AccECN option send control tcp: accecn: AccECN option failure handling tcp: accecn: try to fit AccECN option with SACK Ilpo Järvinen (9): tcp: reorganize SYN ECN code tcp: fast path functions later tcp: AccECN core tcp: accecn: AccECN negotiation tcp: accecn: add AccECN rx byte counters tcp: accecn: AccECN needs to know delivered bytes tcp: sack option handling improvements tcp: accecn: AccECN option tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics Documentation/networking/ip-sysctl.rst | 55 +- .../networking/net_cachelines/tcp_sock.rst | 13 + include/linux/tcp.h | 33 +- include/net/netns/ipv4.h | 2 + include/net/tcp.h | 87 ++- include/net/tcp_ecn.h | 648 ++++++++++++++++++ include/uapi/linux/tcp.h | 7 + net/ipv4/syncookies.c | 4 + net/ipv4/sysctl_net_ipv4.c | 19 + net/ipv4/tcp.c | 29 +- net/ipv4/tcp_input.c | 371 ++++++++-- net/ipv4/tcp_ipv4.c | 8 +- net/ipv4/tcp_minisocks.c | 40 +- net/ipv4/tcp_output.c | 297 ++++++-- net/ipv6/syncookies.c | 2 + net/ipv6/tcp_ipv6.c | 1 + 16 files changed, 1429 insertions(+), 187 deletions(-) create mode 100644 include/net/tcp_ecn.h -- 2.34.1

1 week, 4 days

1
15
0 0

[PATCH v6 14/14] selftests/sched_ext: Add test for DL server total_bw consistency

by Joel Fernandes

Add a new kselftest to verify that the total_bw value in /sys/kernel/debug/sched/debug remains consistent across all CPUs under different sched_ext BPF program states: 1. Before a BPF scheduler is loaded 2. While a BPF scheduler is loaded and active 3. After a BPF scheduler is unloaded The test runs CPU stress threads to ensure DL server bandwidth values stabilize before checking consistency. This helps catch potential issues with DL server bandwidth accounting during sched_ext transitions. Signed-off-by: Joel Fernandes <joelagnelf(a)nvidia.com> --- tools/testing/selftests/sched_ext/Makefile | 1 + tools/testing/selftests/sched_ext/total_bw.c | 286 +++++++++++++++++++ 2 files changed, 287 insertions(+) create mode 100644 tools/testing/selftests/sched_ext/total_bw.c diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile index f0a8cba3a99f..d48be158b0a1 100644 --- a/tools/testing/selftests/sched_ext/Makefile +++ b/tools/testing/selftests/sched_ext/Makefile @@ -184,6 +184,7 @@ auto-test-targets := \ select_cpu_vtime \ rt_stall \ test_example \ + total_bw \ testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets))) diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c new file mode 100644 index 000000000000..6b81d6c51054 --- /dev/null +++ b/tools/testing/selftests/sched_ext/total_bw.c @@ -0,0 +1,286 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Test to verify that total_bw value remains consistent across all CPUs + * in different BPF program states. + * + * Copyright (C) 2025 Nvidia Corporation. + */ +#include <bpf/bpf.h> +#include <errno.h> +#include <pthread.h> +#include <scx/common.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/wait.h> +#include <unistd.h> +#include "minimal.bpf.skel.h" +#include "scx_test.h" + +#define MAX_CPUS 512 +#define STABILIZATION_TIME_SEC 5 +#define STRESS_DURATION_SEC 5 + +struct total_bw_ctx { + struct minimal *skel; + long baseline_bw[MAX_CPUS]; + int nr_cpus; +}; + +static void *cpu_stress_thread(void *arg) +{ + volatile int i; + time_t end_time = time(NULL) + STRESS_DURATION_SEC; + + while (time(NULL) < end_time) { + for (i = 0; i < 1000000; i++); + } + + return NULL; +} + +/* + * The first enqueue on a CPU causes the DL server to start, for that + * reason run stressor threads in the hopes it schedules on all CPUs. + */ +static int run_cpu_stress(int nr_cpus) +{ + pthread_t *threads; + int i, ret = 0; + + threads = calloc(nr_cpus, sizeof(pthread_t)); + if (!threads) + return -ENOMEM; + + /* Create threads to run on each CPU */ + for (i = 0; i < nr_cpus; i++) { + if (pthread_create(&threads[i], NULL, cpu_stress_thread, NULL)) { + ret = -errno; + fprintf(stderr, "Failed to create thread %d: %s\n", i, strerror(-ret)); + break; + } + } + + /* Wait for all threads to complete */ + for (i = 0; i < nr_cpus; i++) { + if (threads[i]) + pthread_join(threads[i], NULL); + } + + free(threads); + return ret; +} + +static int read_total_bw_values(long *bw_values, int max_cpus) +{ + FILE *fp; + char line[256]; + int cpu_count = 0; + + fp = fopen("/sys/kernel/debug/sched/debug", "r"); + if (!fp) { + SCX_ERR("Failed to open debug file"); + return -1; + } + + while (fgets(line, sizeof(line), fp)) { + char *bw_str = strstr(line, "total_bw"); + if (bw_str) { + bw_str = strchr(bw_str, ':'); + if (bw_str) { + /* Only store up to max_cpus values */ + if (cpu_count < max_cpus) { + bw_values[cpu_count] = atol(bw_str + 1); + } + cpu_count++; + } + } + } + + fclose(fp); + return cpu_count; +} + +static bool verify_total_bw_consistency(long *bw_values, int count) +{ + int i; + long first_value; + + if (count <= 0) + return false; + + first_value = bw_values[0]; + + for (i = 1; i < count; i++) { + if (bw_values[i] != first_value) { + SCX_ERR("Inconsistent total_bw: CPU0=%ld, CPU%d=%ld", + first_value, i, bw_values[i]); + return false; + } + } + + return true; +} + +static int fetch_verify_total_bw(long *bw_values, int nr_cpus) +{ + int attempts = 0; + int max_attempts = 10; + int count; + + /* + * The first enqueue on a CPU causes the DL server to start, for that + * reason run stressor threads in the hopes it schedules on all CPUs. + */ + if (run_cpu_stress(nr_cpus) < 0) { + SCX_ERR("Failed to run CPU stress"); + return -1; + } + + /* Let things settle down */ + sleep(STABILIZATION_TIME_SEC); + + /* Try multiple times to get stable values */ + while (attempts < max_attempts) { + count = read_total_bw_values(bw_values, nr_cpus); + fprintf(stderr, "Read %d total_bw values (testing %d CPUs)\n", count, nr_cpus); + /* If system has more CPUs than we're testing, that's OK */ + if (count < nr_cpus) { + SCX_ERR("Expected at least %d CPUs, got %d", nr_cpus, count); + attempts++; + sleep(1); + continue; + } + + /* Only verify the CPUs we're testing */ + if (verify_total_bw_consistency(bw_values, nr_cpus)) { + fprintf(stderr, "Values are consistent: %ld\n", bw_values[0]); + return 0; + } + + attempts++; + sleep(1); + } + + return -1; +} + +static enum scx_test_status setup(void **ctx) +{ + struct total_bw_ctx *test_ctx; + + if (access("/sys/kernel/debug/sched/debug", R_OK) != 0) { + fprintf(stderr, "Skipping test: debugfs sched/debug not accessible\n"); + return SCX_TEST_SKIP; + } + + test_ctx = calloc(1, sizeof(*test_ctx)); + if (!test_ctx) + return SCX_TEST_FAIL; + + test_ctx->nr_cpus = sysconf(_SC_NPROCESSORS_ONLN); + if (test_ctx->nr_cpus <= 0) { + free(test_ctx); + return SCX_TEST_FAIL; + } + + /* If system has more CPUs than MAX_CPUS, just test the first MAX_CPUS */ + if (test_ctx->nr_cpus > MAX_CPUS) { + test_ctx->nr_cpus = MAX_CPUS; + } + + /* Test scenario 1: BPF program not loaded */ + /* Read and verify baseline total_bw before loading BPF program */ + fprintf(stderr, "BPF prog initially not loaded, reading total_bw values\n"); + if (fetch_verify_total_bw(test_ctx->baseline_bw, test_ctx->nr_cpus) < 0) { + SCX_ERR("Failed to get stable baseline values"); + free(test_ctx); + return SCX_TEST_FAIL; + } + + /* Load the BPF skeleton */ + test_ctx->skel = minimal__open(); + if (!test_ctx->skel) { + free(test_ctx); + return SCX_TEST_FAIL; + } + + SCX_ENUM_INIT(test_ctx->skel); + if (minimal__load(test_ctx->skel)) { + minimal__destroy(test_ctx->skel); + free(test_ctx); + return SCX_TEST_FAIL; + } + + *ctx = test_ctx; + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct total_bw_ctx *test_ctx = ctx; + struct bpf_link *link; + long loaded_bw[MAX_CPUS]; + long unloaded_bw[MAX_CPUS]; + int i; + + /* Test scenario 2: BPF program loaded */ + link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops); + if (!link) { + SCX_ERR("Failed to attach scheduler"); + return SCX_TEST_FAIL; + } + + fprintf(stderr, "BPF program loaded, reading total_bw values\n"); + if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) { + SCX_ERR("Failed to get stable values with BPF loaded"); + bpf_link__destroy(link); + return SCX_TEST_FAIL; + } + bpf_link__destroy(link); + + /* Test scenario 3: BPF program unloaded */ + fprintf(stderr, "BPF program unloaded, reading total_bw values\n"); + if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) { + SCX_ERR("Failed to get stable values after BPF unload"); + return SCX_TEST_FAIL; + } + + /* Verify all three scenarios have the same total_bw values */ + for (i = 0; i < test_ctx->nr_cpus; i++) { + if (test_ctx->baseline_bw[i] != loaded_bw[i]) { + SCX_ERR("CPU%d: baseline_bw=%ld != loaded_bw=%ld", + i, test_ctx->baseline_bw[i], loaded_bw[i]); + return SCX_TEST_FAIL; + } + + if (test_ctx->baseline_bw[i] != unloaded_bw[i]) { + SCX_ERR("CPU%d: baseline_bw=%ld != unloaded_bw=%ld", + i, test_ctx->baseline_bw[i], unloaded_bw[i]); + return SCX_TEST_FAIL; + } + } + + fprintf(stderr, "All total_bw values are consistent across all scenarios\n"); + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct total_bw_ctx *test_ctx = ctx; + + if (test_ctx) { + if (test_ctx->skel) + minimal__destroy(test_ctx->skel); + free(test_ctx); + } +} + +struct scx_test total_bw = { + .name = "total_bw", + .description = "Verify total_bw consistency across BPF program states", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&total_bw) -- 2.34.1

1 week, 4 days

1
0
0 0

[PATCH v6 12/14] selftests/sched_ext: Add test for sched_ext dl_server

by Joel Fernandes

From: Andrea Righi <arighi(a)nvidia.com> Add a selftest to validate the correct behavior of the deadline server for the ext_sched_class. [ Joel: Replaced occurences of CFS in the test with EXT. ] Signed-off-by: Joel Fernandes <joelagnelf(a)nvidia.com> Signed-off-by: Andrea Righi <arighi(a)nvidia.com> --- tools/testing/selftests/sched_ext/Makefile | 1 + .../selftests/sched_ext/rt_stall.bpf.c | 23 ++ tools/testing/selftests/sched_ext/rt_stall.c | 213 ++++++++++++++++++ 3 files changed, 237 insertions(+) create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile index 9d9d6b4c38b0..f0a8cba3a99f 100644 --- a/tools/testing/selftests/sched_ext/Makefile +++ b/tools/testing/selftests/sched_ext/Makefile @@ -182,6 +182,7 @@ auto-test-targets := \ select_cpu_dispatch_bad_dsq \ select_cpu_dispatch_dbl_dsp \ select_cpu_vtime \ + rt_stall \ test_example \ testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets))) diff --git a/tools/testing/selftests/sched_ext/rt_stall.bpf.c b/tools/testing/selftests/sched_ext/rt_stall.bpf.c new file mode 100644 index 000000000000..80086779dd1e --- /dev/null +++ b/tools/testing/selftests/sched_ext/rt_stall.bpf.c @@ -0,0 +1,23 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * A scheduler that verified if RT tasks can stall SCHED_EXT tasks. + * + * Copyright (c) 2025 NVIDIA Corporation. + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +UEI_DEFINE(uei); + +void BPF_STRUCT_OPS(rt_stall_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops rt_stall_ops = { + .exit = (void *)rt_stall_exit, + .name = "rt_stall", +}; diff --git a/tools/testing/selftests/sched_ext/rt_stall.c b/tools/testing/selftests/sched_ext/rt_stall.c new file mode 100644 index 000000000000..d4cb545ebfd8 --- /dev/null +++ b/tools/testing/selftests/sched_ext/rt_stall.c @@ -0,0 +1,213 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2025 NVIDIA Corporation. + */ +#define _GNU_SOURCE +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <sched.h> +#include <sys/prctl.h> +#include <sys/types.h> +#include <sys/wait.h> +#include <time.h> +#include <linux/sched.h> +#include <signal.h> +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "rt_stall.bpf.skel.h" +#include "scx_test.h" +#include "../kselftest.h" + +#define CORE_ID 0 /* CPU to pin tasks to */ +#define RUN_TIME 5 /* How long to run the test in seconds */ + +/* Simple busy-wait function for test tasks */ +static void process_func(void) +{ + while (1) { + /* Busy wait */ + for (volatile unsigned long i = 0; i < 10000000UL; i++); + } +} + +/* Set CPU affinity to a specific core */ +static void set_affinity(int cpu) +{ + cpu_set_t mask; + + CPU_ZERO(&mask); + CPU_SET(cpu, &mask); + if (sched_setaffinity(0, sizeof(mask), &mask) != 0) { + perror("sched_setaffinity"); + exit(EXIT_FAILURE); + } +} + +/* Set task scheduling policy and priority */ +static void set_sched(int policy, int priority) +{ + struct sched_param param; + + param.sched_priority = priority; + if (sched_setscheduler(0, policy, &param) != 0) { + perror("sched_setscheduler"); + exit(EXIT_FAILURE); + } +} + +/* Get process runtime from /proc/<pid>/stat */ +static float get_process_runtime(int pid) +{ + char path[256]; + FILE *file; + long utime, stime; + int fields; + + snprintf(path, sizeof(path), "/proc/%d/stat", pid); + file = fopen(path, "r"); + if (file == NULL) { + perror("Failed to open stat file"); + return -1; + } + + /* Skip the first 13 fields and read the 14th and 15th */ + fields = fscanf(file, + "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu", + &utime, &stime); + fclose(file); + + if (fields != 2) { + fprintf(stderr, "Failed to read stat file\n"); + return -1; + } + + /* Calculate the total time spent in the process */ + long total_time = utime + stime; + long ticks_per_second = sysconf(_SC_CLK_TCK); + float runtime_seconds = total_time * 1.0 / ticks_per_second; + + return runtime_seconds; +} + +static enum scx_test_status setup(void **ctx) +{ + struct rt_stall *skel; + + skel = rt_stall__open(); + SCX_FAIL_IF(!skel, "Failed to open"); + SCX_ENUM_INIT(skel); + SCX_FAIL_IF(rt_stall__load(skel), "Failed to load skel"); + + *ctx = skel; + + return SCX_TEST_PASS; +} + +static bool sched_stress_test(void) +{ + float cfs_runtime, rt_runtime; + int cfs_pid, rt_pid; + float expected_min_ratio = 0.04; /* 4% */ + + ksft_print_header(); + ksft_set_plan(1); + + /* Create and set up a EXT task */ + cfs_pid = fork(); + if (cfs_pid == 0) { + set_affinity(CORE_ID); + process_func(); + exit(0); + } else if (cfs_pid < 0) { + perror("fork for EXT task"); + ksft_exit_fail(); + } + + /* Create an RT task */ + rt_pid = fork(); + if (rt_pid == 0) { + set_affinity(CORE_ID); + set_sched(SCHED_FIFO, 50); + process_func(); + exit(0); + } else if (rt_pid < 0) { + perror("fork for RT task"); + ksft_exit_fail(); + } + + /* Let the processes run for the specified time */ + sleep(RUN_TIME); + + /* Get runtime for the EXT task */ + cfs_runtime = get_process_runtime(cfs_pid); + if (cfs_runtime != -1) + ksft_print_msg("Runtime of EXT task (PID %d) is %f seconds\n", cfs_pid, cfs_runtime); + else + ksft_exit_fail_msg("Error getting runtime for EXT task (PID %d)\n", cfs_pid); + + /* Get runtime for the RT task */ + rt_runtime = get_process_runtime(rt_pid); + if (rt_runtime != -1) + ksft_print_msg("Runtime of RT task (PID %d) is %f seconds\n", rt_pid, rt_runtime); + else + ksft_exit_fail_msg("Error getting runtime for RT task (PID %d)\n", rt_pid); + + /* Kill the processes */ + kill(cfs_pid, SIGKILL); + kill(rt_pid, SIGKILL); + waitpid(cfs_pid, NULL, 0); + waitpid(rt_pid, NULL, 0); + + /* Verify that the scx task got enough runtime */ + float actual_ratio = cfs_runtime / (cfs_runtime + rt_runtime); + ksft_print_msg("EXT task got %.2f%% of total runtime\n", actual_ratio * 100); + + if (actual_ratio >= expected_min_ratio) { + ksft_test_result_pass("PASS: EXT task got more than %.2f%% of runtime\n", + expected_min_ratio * 100); + return true; + } else { + ksft_test_result_fail("FAIL: EXT task got less than %.2f%% of runtime\n", + expected_min_ratio * 100); + return false; + } +} + +static enum scx_test_status run(void *ctx) +{ + struct rt_stall *skel = ctx; + struct bpf_link *link; + bool res; + + link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops); + SCX_FAIL_IF(!link, "Failed to attach scheduler"); + + res = sched_stress_test(); + + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE)); + bpf_link__destroy(link); + + if (!res) + ksft_exit_fail(); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct rt_stall *skel = ctx; + + rt_stall__destroy(skel); +} + +struct scx_test rt_stall = { + .name = "rt_stall", + .description = "Verify that RT tasks cannot stall SCHED_EXT tasks", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&rt_stall) -- 2.34.1

1 week, 4 days

1
0
0 0

[PATCH net-next v12 0/8] Support rate management on traffic classes in devlink and mlx5

by Mark Bloch

V12: - Fixed YAML indentation in devlink.yaml. - Removed unused total variable from devlink_nl_rate_tc_bw_set(). - Quoted shell variables in devlink.sh and split declarations to fix shellcheck warnings. - Added missing DevlinkFamily imports in selftests to fix pylint warnings. - Pulled changes from net-next to enable these adjustments: Inclusion of DevlinkFamily in YNL test libs. Introduction of nlmsg_for_each_attr_type() macro and its use in nfsd. V11: - Refactored the devlink code to accept relative TC bandwidth share values instead of percentages. - Updated documentation to clarify that values are interpreted as relative shares. - Refactored the logic in mlx5 to support proportional scaling for tc-bw values. - Switched to `nlmsg_for_each_attr_type()` for cleaner attribute parsing. - Added a hardware selftest to validate TC bandwidth behavior. - Refactored esw_qos_is_node_empty for readability. V10: - Added netdevsim selftest for tc-bw ops. - Dropped header: field as it’s unnecessary for local constants in devlink.yaml. V9: - Defined DEVLINK_RATE_TCS_MAX as 8 in uapi/linux/devlink.h. - Replaced IEEE_8021QAZ_MAX_TCS with DEVLINK_RATE_TCS_MAX throughout the code. - Updated devlink-rate-tc-index-max spec to reference the correct UAPI header. V8: - Limit line width to 80 characters in mlx5 changes instead of 100. - Increase the scheduling node levels to support TC arbitration. - Ensure parent nodes are set correctly in all code paths that extend the hierarchy depth for TC arbitration. - Extended the cover letter with the ongoing discussion on devlink-rate and net-shapers. - Extended the cover letter with the Netdev talk link on this series. V7: - Fixed disabling tc-bw on leaf nodes that did not have tc-bw configured. - Fixed an issue where tc-bw was disabled on a node with assigned vports, ensuring that vport->qos.sched_node->parent is correctly updated with the cloned node. - Declared a constant for the maximum allowed Traffic Class index in devlink rate. - Added a range check to validate rate-tc-index. - Added documentation for the tc-bw argument. - Add a validation check to ensure that the total bandwidth assigned to all traffic classes sums to 100. V6: - Addressed comments on devlink patch #3. - Removed first 4 IFC patches, to be pulled from mlx5-next. V5: - Fix warning in devlink_nl_rate_tc_bw_set(). - Fix target branch of patch #4. V4: - Renamed the nested attribute for traffic class bandwidth to DEVLINK_ATTR_RATE_TC_BWS. - Changed the order of the attributes in `devlink.h`. - Refactored the initialization tc-bw array in devlink_nl_rate_tc_bw_set(). - Added extack messages to provide clear feedback on issues with tc-bw arguments. - Updated `rate-tc-bws` to support a multi-attr set, where each attribute includes an index and the corresponding bandwidth for that traffic class. - Handled the issue where the user could provide DEVLINK_ATTR_RATE_TC_BWS with duplicate indices. - Provided ynl exmaples in patch [1/5] commit message. - Take IFC patches to beginning of the series, targeted for mlx5-next. V3: - Dropped rate-tc-index, using tc-bw array index instead. - Renamed rate-bw to rate-tc-bw. - Documneted what the rate-tc-bw represents and added a range check for validation. - Intorduced devlink_nl_rate_tc_bw_set() to parse and set the TC bandwidth values. - Updated the user API in the commit message of patch 1/6 to ensure bandwidths sum equals 100. - Fixed missing filling of rate-parent in devlink_nl_rate_fill(). V2: - Included <linux/dcbnl.h> in devlink.h to resolve missing IEEE_8021QAZ_MAX_TCS definition. - Refactored the rate-tc-bw attribute structure to use a separate rate-tc-index. - Updated patch 2/6 title. This patch series extends the devlink-rate API to support traffic class (TC) bandwidth management, enabling more granular control over traffic shaping and rate limiting across multiple TCs. The API now allows users to specify bandwidth proportions for different traffic classes in a single command. This is particularly useful for managing Enhanced Transmission Selection (ETS) for groups of Virtual Functions (VFs), allowing precise bandwidth allocation across traffic classes. Additionally the series refines the QoS handling in net/mlx5 to support TC arbitration and bandwidth management on vports and rate nodes. Discussions on traffic class shaping in net-shapers began in V5 [1], where we discussed with maintainers whether net-shapers should support traffic classes and how this could be implemented. Later, after further conversations with Paolo Abeni and Simon Horman, Cosmin provided an update [2], confirming that net-shapers' tree-based hierarchy aligns well with traffic classes when treated as distinct subsets of netdev queues. Since mlx5 enforces a 1:1 mapping between TX queues and traffic classes, this approach seems feasible, though some open questions remain regarding queue reconfiguration and certain mlx5 scheduling behaviors. Building on that discussion, Cosmin has now shared a concrete implementation plan on the netdev mailing list [3]. The plan, developed in collaboration with Paolo and Simon, outlines how net-shapers can be extended to support the same use cases currently covered by devlink-rate, with the eventual goal of aligning both and simplifying the shaping infrastructure in the kernel. This work was presented at Netdev 0x19 in Zagreb [4]. There we presented how TC scheduling is enforced in mlx5 hardware, which led to discussions on the mailing list. A summary of how things work: Classification means labeling a packet with a traffic class based on the packet's DSCP or VLAN PCP field, then treating packets with different traffic classes differently during transmit processing. In a virtualized setup, VFs are untrusted and do not control classification or shaping. Classification is done by the hardware using a prio-to-TC mapping set by the hypervisor. VFs only select which send queue to use and are expected to respect the classification logic by sending each traffic class on its dedicated queue. As stated in the net-shapers plan [3], each transmit queue should carry only a single traffic class. Mixing classes in a single queue can lead to HOL blocking. In the mlx5 implementation, if the queue used does not match the classified traffic class, the hardware moves the queue to the correct TC scheduler. This movement is not a reclassification; it’s a necessary enforcement step to ensure traffic class isolation is maintained. Extend devlink-rate API to support rate management on TCs: - devlink: Extend the devlink rate API to support traffic class bandwidth management Introduce a no-op implementation: - net/mlx5: Add no-op implementation for setting tc-bw on rate objects Add support for enabling and disabling TC QoS on vports and nodes: - net/mlx5: Add support for setting tc-bw on nodes - net/mlx5: Add traffic class scheduling support for vport QoS Support for setting tc-bw on rate objects: - net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw [1] https://lore.kernel.org/netdev/20241204220931.254964-1-tariqt@nvidia.com/ [2] https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.cam… [3] https://lore.kernel.org/netdev/d9831d0c940a7b77419abe7c7330e822bbfd1cfb.cam… [4] https://netdevconf.info/0x19/sessions/talk/optimizing-bandwidth-allocation-… Carolina Jubran (8): netlink: introduce type-checking attribute iteration for nlmsg devlink: Extend devlink rate API with traffic classes bandwidth management selftest: netdevsim: Add devlink rate tc-bw test net/mlx5: Add no-op implementation for setting tc-bw on rate objects net/mlx5: Add support for setting tc-bw on nodes net/mlx5: Add traffic class scheduling support for vport QoS net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw selftests: drv-net: Add test for devlink-rate traffic class bandwidth distribution Documentation/netlink/specs/devlink.yaml | 32 +- .../networking/devlink/devlink-port.rst | 8 + .../net/ethernet/mellanox/mlx5/core/devlink.c | 2 + .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 1037 ++++++++++++++++- .../net/ethernet/mellanox/mlx5/core/esw/qos.h | 8 + .../net/ethernet/mellanox/mlx5/core/eswitch.h | 14 +- drivers/net/netdevsim/dev.c | 43 + drivers/net/netdevsim/netdevsim.h | 1 + drivers/net/vxlan/vxlan_vnifilter.c | 13 +- fs/nfsd/nfsctl.c | 36 +- include/net/devlink.h | 8 + include/net/netlink.h | 14 + include/uapi/linux/devlink.h | 9 + net/devlink/netlink_gen.c | 15 +- net/devlink/netlink_gen.h | 1 + net/devlink/rate.c | 127 ++ .../drivers/net/hw/devlink_rate_tc_bw.py | 466 ++++++++ .../drivers/net/hw/lib/py/__init__.py | 2 +- .../selftests/drivers/net/lib/py/__init__.py | 2 +- .../drivers/net/netdevsim/devlink.sh | 53 + .../testing/selftests/net/lib/py/__init__.py | 2 +- tools/testing/selftests/net/lib/py/ynl.py | 5 + 22 files changed, 1825 insertions(+), 73 deletions(-) create mode 100755 tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py base-commit: 20a0c20f82acf46d5731a11743e7c7ac4de25db8 -- 2.34.1

1 week, 4 days

2
9
0 0

[PATCH] selftests/futex: Add futex_numa to .gitignore

by Terry Tritton

futex_numa was never added to the .gitignore file. Add it. Signed-off-by: Terry Tritton <terry.tritton(a)linaro.org> --- tools/testing/selftests/futex/functional/.gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore index 7b24ae89594a..776ad658f75e 100644 --- a/tools/testing/selftests/futex/functional/.gitignore +++ b/tools/testing/selftests/futex/functional/.gitignore @@ -11,3 +11,4 @@ futex_wait_timeout futex_wait_uninitialized_heap futex_wait_wouldblock futex_waitv +futex_numa -- 2.39.5

1 week, 4 days

2
1
0 0

[PATCH] selftests/kexec: fix test_kexec_jump build and ignore generated binary

by Moon Hee Lee

The test_kexec_jump program builds correctly when invoked from the top-level selftests/Makefile, which explicitly sets the OUTPUT variable. However, building directly in tools/testing/selftests/kexec fails with: make: *** No rule to make target '/test_kexec_jump', needed by 'test_kexec_jump.sh'. Stop. This failure occurs because the Makefile rule relies on $(OUTPUT), which is undefined in direct builds. Fix this by listing test_kexec_jump in TEST_GEN_PROGS, the standard way to declare generated test binaries in the kselftest framework. This ensures the binary is built regardless of invocation context and properly removed by make clean. Also add the binary to .gitignore to avoid tracking it in version control. Signed-off-by: Moon Hee Lee <moonhee.lee.ca(a)gmail.com> --- tools/testing/selftests/kexec/.gitignore | 2 ++ tools/testing/selftests/kexec/Makefile | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/kexec/.gitignore diff --git a/tools/testing/selftests/kexec/.gitignore b/tools/testing/selftests/kexec/.gitignore new file mode 100644 index 000000000000..5f3d9e089ae8 --- /dev/null +++ b/tools/testing/selftests/kexec/.gitignore @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +test_kexec_jump diff --git a/tools/testing/selftests/kexec/Makefile b/tools/testing/selftests/kexec/Makefile index e3000ccb9a5d..874cfdd3b75b 100644 --- a/tools/testing/selftests/kexec/Makefile +++ b/tools/testing/selftests/kexec/Makefile @@ -12,7 +12,7 @@ include ../../../scripts/Makefile.arch ifeq ($(IS_64_BIT)$(ARCH_PROCESSED),1x86) TEST_PROGS += test_kexec_jump.sh -test_kexec_jump.sh: $(OUTPUT)/test_kexec_jump +TEST_GEN_PROGS := test_kexec_jump endif include ../lib.mk -- 2.43.0

1 week, 5 days

3
2
0 0

[PATCH v18 0/8] fork: Support shadow stacks in clone3()

by Mark Brown

The kernel has recently added support for shadow stacks, currently x86 only using their CET feature but both arm64 and RISC-V have equivalent features (GCS and Zicfiss respectively), I am actively working on GCS[1]. With shadow stacks the hardware maintains an additional stack containing only the return addresses for branch instructions which is not generally writeable by userspace and ensures that any returns are to the recorded addresses. This provides some protection against ROP attacks and making it easier to collect call stacks. These shadow stacks are allocated in the address space of the userspace process. Our API for shadow stacks does not currently offer userspace any flexiblity for managing the allocation of shadow stacks for newly created threads, instead the kernel allocates a new shadow stack with the same size as the normal stack whenever a thread is created with the feature enabled. The stacks allocated in this way are freed by the kernel when the thread exits or shadow stacks are disabled for the thread. This lack of flexibility and control isn't ideal, in the vast majority of cases the shadow stack will be over allocated and the implicit allocation and deallocation is not consistent with other interfaces. As far as I can tell the interface is done in this manner mainly because the shadow stack patches were in development since before clone3() was implemented. Since clone3() is readily extensible let's add support for specifying a shadow stack when creating a new thread or process, keeping the current implicit allocation behaviour if one is not specified either with clone3() or through the use of clone(). The user must provide a shadow stack pointer, this must point to memory mapped for use as a shadow stackby map_shadow_stack() with an architecture specified shadow stack token at the top of the stack. Yuri Khrustalev has raised questions from the libc side regarding discoverability of extended clone3() structure sizes[2], this seems like a general issue with clone3(). There was a suggestion to add a hwcap on arm64 which isn't ideal but is doable there, though architecture specific mechanisms would also be needed for x86 (and RISC-V if it's support gets merged before this does). The idea has, however, had strong pushback from the architecture maintainers and it is possible to detect support for this in clone3() by attempting a call with a misaligned shadow stack pointer specified so no hwcap has been added. [1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87… [2] https://lore.kernel.org/r/aCs65ccRQtJBnZ_5@arm.com Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Changes in v18: - Rebase onto v6.16-rc3. - Thanks to pointers from Yuri Khrustalev this version has been tested on x86 so I have removed the RFT tag. - Clarify clone3_shadow_stack_valid() comment about the Kconfig check. - Remove redundant GCSB DSYNCs in arm64 code. - Fix token validation on x86. - Link to v17: https://lore.kernel.org/r/20250609-clone3-shadow-stack-v17-0-8840ed97ff6f@k… Changes in v17: - Rebase onto v6.16-rc1. - Link to v16: https://lore.kernel.org/r/20250416-clone3-shadow-stack-v16-0-2ffc9ca3917b@k… Changes in v16: - Rebase onto v6.15-rc2. - Roll in fixes from x86 testing from Rick Edgecombe. - Rework so that the argument is shadow_stack_token. - Link to v15: https://lore.kernel.org/r/20250408-clone3-shadow-stack-v15-0-3fa245c6e3be@k… Changes in v15: - Rebase onto v6.15-rc1. - Link to v14: https://lore.kernel.org/r/20250206-clone3-shadow-stack-v14-0-805b53af73b9@k… Changes in v14: - Rebase onto v6.14-rc1. - Link to v13: https://lore.kernel.org/r/20241203-clone3-shadow-stack-v13-0-93b89a81a5ed@k… Changes in v13: - Rebase onto v6.13-rc1. - Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@k… Changes in v12: - Add the regular prctl() to the userspace API document since arm64 support is queued in -next. - Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k… Changes in v11: - Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and integrate arm64 support. - Rework the interface to specify a shadow stack pointer rather than a base and size like we do for the regular stack. - Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k… Changes in v10: - Integrate fixes & improvements for the x86 implementation from Rick Edgecombe. - Require that the shadow stack be VM_WRITE. - Require that the shadow stack base and size be sizeof(void *) aligned. - Clean up trailing newline. - Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke… Changes in v9: - Pull token validation earlier and report problems with an error return to parent rather than signal delivery to the child. - Verify that the top of the supplied shadow stack is VM_SHADOW_STACK. - Rework token validation to only do the page mapping once. - Drop no longer needed support for testing for signals in selftest. - Fix typo in comments. - Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke… Changes in v8: - Fix token verification with user specified shadow stack. - Don't track user managed shadow stacks for child processes. - Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke… Changes in v7: - Rebase onto v6.11-rc1. - Typo fixes. - Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke… Changes in v6: - Rebase onto v6.10-rc3. - Ensure we don't try to free the parent shadow stack in error paths of x86 arch code. - Spelling fixes in userspace API document. - Additional cleanups and improvements to the clone3() tests to support the shadow stack tests. - Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke… Changes in v5: - Rebase onto v6.8-rc2. - Rework ABI to have the user allocate the shadow stack memory with map_shadow_stack() and a token. - Force inlining of the x86 shadow stack enablement. - Move shadow stack enablement out into a shared header for reuse by other tests. - Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke… Changes in v4: - Formatting changes. - Use a define for minimum shadow stack size and move some basic validation to fork.c. - Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke… Changes in v3: - Rebase onto v6.7-rc2. - Remove stale shadow_stack in internal kargs. - If a shadow stack is specified unconditionally use it regardless of CLONE_ parameters. - Force enable shadow stacks in the selftest. - Update changelogs for RISC-V feature rename. - Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke… Changes in v2: - Rebase onto v6.7-rc1. - Remove ability to provide preallocated shadow stack, just specify the desired size. - Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke… --- Mark Brown (8): arm64/gcs: Return a success value from gcs_alloc_thread_stack() Documentation: userspace-api: Add shadow stack API documentation selftests: Provide helper header for shadow stack testing fork: Add shadow stack support to clone3() selftests/clone3: Remove redundant flushes of output streams selftests/clone3: Factor more of main loop into test_clone3() selftests/clone3: Allow tests to flag if -E2BIG is a valid error code selftests/clone3: Test shadow stack support Documentation/userspace-api/index.rst | 1 + Documentation/userspace-api/shadow_stack.rst | 44 +++++ arch/arm64/include/asm/gcs.h | 8 +- arch/arm64/kernel/process.c | 8 +- arch/arm64/mm/gcs.c | 55 +++++- arch/x86/include/asm/shstk.h | 11 +- arch/x86/kernel/process.c | 2 +- arch/x86/kernel/shstk.c | 53 ++++- include/asm-generic/cacheflush.h | 11 ++ include/linux/sched/task.h | 17 ++ include/uapi/linux/sched.h | 9 +- kernel/fork.c | 93 +++++++-- tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++---- tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++- tools/testing/selftests/ksft_shstk.h | 98 ++++++++++ 15 files changed, 620 insertions(+), 81 deletions(-) --- base-commit: 86731a2a651e58953fc949573895f2fa6d456841 change-id: 20231019-clone3-shadow-stack-15d40d2bf536 Best regards, -- Mark Brown <broonie(a)kernel.org>

1 week, 5 days

2
9
0 0

[kvm-unit-tests PATCH v3 1/2] riscv: Add RV_INSN_LEN to processor.h

by Jesse Taube

When handeling traps and faults it is offten necessary to know the size of the instruction at epc. Add RV_INSN_LEN to calculate the instruction size. Signed-off-by: Jesse Taube <jesse(a)rivosinc.com> --- lib/riscv/asm/processor.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/riscv/asm/processor.h b/lib/riscv/asm/processor.h index 40104272..631ce226 100644 --- a/lib/riscv/asm/processor.h +++ b/lib/riscv/asm/processor.h @@ -7,6 +7,8 @@ #define EXCEPTION_CAUSE_MAX 24 #define INTERRUPT_CAUSE_MAX 16 +#define RV_INSN_LEN(insn) ((((insn) & 0x3) < 0x3) ? 2 : 4) + typedef void (*exception_fn)(struct pt_regs *); struct thread_info { -- 2.43.0

1 week, 5 days

2
4
0 0

[kvm-unit-tests PATCH v2] riscv: Allow SBI_CONSOLE with no uart in device tree

by Jesse Taube

When CONFIG_SBI_CONSOLE is enabled and there is no uart defined in the device tree kvm-unit-tests fails to start. Only abort when uart is not found in device tree if SBI_CONSOLE is false Signed-off-by: Jesse Taube <jesse(a)rivosinc.com> --- lib/riscv/io.c | 25 +++++++++++++++++++------ 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/lib/riscv/io.c b/lib/riscv/io.c index fb40adb7..e64e9253 100644 --- a/lib/riscv/io.c +++ b/lib/riscv/io.c @@ -30,7 +30,6 @@ static u32 uart0_reg_width = 1; static u32 uart0_reg_shift; static struct spinlock uart_lock; -#ifndef CONFIG_SBI_CONSOLE static u32 uart0_read(u32 num) { u32 offset = num << uart0_reg_shift; @@ -54,7 +53,6 @@ static void uart0_write(u32 num, u32 val) else writel(val, uart0_base + offset); } -#endif static void uart0_init_fdt(void) { @@ -73,11 +71,16 @@ static void uart0_init_fdt(void) break; } +#ifdef CONFIG_SBI_CONSOLE + uart0_base = NULL; + return; +#else if (ret) { printf("%s: Compatible uart not found in the device tree, aborting...\n", __func__); abort(); } +#endif } else { const fdt32_t *val; int len; @@ -116,8 +119,8 @@ void io_init(void) } } -#ifdef CONFIG_SBI_CONSOLE -void puts(const char *s) +void sbi_puts(const char *s); +void sbi_puts(const char *s) { phys_addr_t addr = virt_to_phys((void *)s); unsigned long hi = upper_32_bits(addr); @@ -127,9 +130,11 @@ void puts(const char *s) sbi_ecall(SBI_EXT_DBCN, SBI_EXT_DBCN_CONSOLE_WRITE, strlen(s), lo, hi, 0, 0, 0); spin_unlock(&uart_lock); } -#else -void puts(const char *s) + +void uart0_puts(const char *s); +void uart0_puts(const char *s) { + assert(uart0_base); spin_lock(&uart_lock); while (*s) { while (!(uart0_read(UART_LSR_OFFSET) & UART_LSR_THRE)) @@ -138,7 +143,15 @@ void puts(const char *s) } spin_unlock(&uart_lock); } + +void puts(const char *s) +{ +#ifdef CONFIG_SBI_CONSOLE + sbi_puts(s); +#else + uart0_puts(s); #endif +} /* * Defining halt to take 'code' as an argument guarantees that it will -- 2.43.0

1 week, 5 days

2
2
0 0

[kvm-unit-tests PATCH v2] riscv: lib: sbi_shutdown add pass/fail exit code.

by Jesse Taube

When exiting it may be useful for the sbi implementation to know if kvm-unit-tests passed or failed. Add exit code to sbi_shutdown, and use it in exit() to pass success/failure (0/1) to sbi. Signed-off-by: Jesse Taube <jesse(a)rivosinc.com> --- lib/riscv/asm/sbi.h | 2 +- lib/riscv/io.c | 2 +- lib/riscv/sbi.c | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/lib/riscv/asm/sbi.h b/lib/riscv/asm/sbi.h index a5738a5c..de11c109 100644 --- a/lib/riscv/asm/sbi.h +++ b/lib/riscv/asm/sbi.h @@ -250,7 +250,7 @@ struct sbiret sbi_ecall(int ext, int fid, unsigned long arg0, unsigned long arg3, unsigned long arg4, unsigned long arg5); -void sbi_shutdown(void); +void sbi_shutdown(unsigned int code); struct sbiret sbi_hart_start(unsigned long hartid, unsigned long entry, unsigned long sp); struct sbiret sbi_hart_stop(void); struct sbiret sbi_hart_get_status(unsigned long hartid); diff --git a/lib/riscv/io.c b/lib/riscv/io.c index fb40adb7..0bde25d4 100644 --- a/lib/riscv/io.c +++ b/lib/riscv/io.c @@ -150,7 +150,7 @@ void halt(int code); void exit(int code) { printf("\nEXIT: STATUS=%d\n", ((code) << 1) | 1); - sbi_shutdown(); + sbi_shutdown(!!code); halt(code); __builtin_unreachable(); } diff --git a/lib/riscv/sbi.c b/lib/riscv/sbi.c index 2959378f..9dd11e9d 100644 --- a/lib/riscv/sbi.c +++ b/lib/riscv/sbi.c @@ -107,9 +107,9 @@ struct sbiret sbi_sse_inject(unsigned long event_id, unsigned long hart_id) return sbi_ecall(SBI_EXT_SSE, SBI_EXT_SSE_INJECT, event_id, hart_id, 0, 0, 0, 0); } -void sbi_shutdown(void) +void sbi_shutdown(unsigned int code) { - sbi_ecall(SBI_EXT_SRST, 0, 0, 0, 0, 0, 0, 0); + sbi_ecall(SBI_EXT_SRST, 0, 0, code, 0, 0, 0, 0); puts("SBI shutdown failed!\n"); } -- 2.43.0

1 week, 5 days

2
4
0 0

[kvm-unit-tests PATCH v8] riscv: sbi: Add SBI Debug Triggers Extension tests

by Jesse Taube

Add tests for the DBTR SBI extension. Signed-off-by: Jesse Taube <jesse(a)rivosinc.com> Reviewed-by: Charlie Jenkins <charlie(a)rivosinc.com> Tested-by: Charlie Jenkins <charlie(a)rivosinc.com> --- V1 -> V2: - Call report_prefix_pop before returning - Disable compressed instructions in exec_call, update related comment - Remove extra "| 1" in dbtr_test_load - Remove extra newlines - Remove extra tabs in check_exec - Remove typedefs from enums - Return when dbtr_install_trigger fails - s/avalible/available/g - s/unistall/uninstall/g V2 -> V3: - Change SBI_DBTR_SHMEM_INVALID_ADDR to -1UL - Move all dbtr functions to sbi-dbtr.c - Move INSN_LEN to processor.h - Update include list - Use C-style comments V3 -> V4: - Include libcflat.h - Remove #define SBI_DBTR_SHMEM_INVALID_ADDR V4 -> V5: - Sort includes - Add kfail for update triggers V5 -> V6: - Add assert in gen_tdata1 - Add prefix to dbtr_test_type - Add TRIG_STATE_DMODE - Add TRIG_STATE_RESERVED - Align function paramaters with opening parenthesis - Change OpenSBI < v1.7 to < v1.5 - Constantly use spaces in prefix rather than _ - Export split_phys_addr - Fix MCONTROL_U and MCONTROL_M mix up - Fix swapped VU and VS - Move /* to own line - Print type in dbtr_test_type - Remove _BIT suffix from macros - Remove duplicate MODE_S - Remove spaces before include - Rename tdata1,2 to trigger and control in dbtr_install_trigger - Report skip in dbtr_test_multiple - Report variables in info not pass or fail - s/save/store/g - sbi_debug_set_shmem use split_phys_addr - Use if (!report(... in dbtr_test_disable_enable V6 -> V7: - Alphabetize Makefile - Only print read info on failure - Remove return after assert - Remove unnecessary OpenSBI version check - Rename error to exit_test in check_dbtr - Use prefix in dbtr_test_num_triggers and dbtr_test_type V7 -> V8: - Add mcontrol_size mcontrol6_size - Cast McontrolType to int for assert - Set trigger size to 32BIT using mcontrol_size Debug triggers allow traps on different access sizes, which was set to any. QEMU was triggering on a 32BIT access next to the trigger address causing an erroneous trap. Set the size to 32BITs to avoid this. --- lib/riscv/asm/sbi.h | 1 + riscv/Makefile | 1 + riscv/sbi-dbtr.c | 867 ++++++++++++++++++++++++++++++++++++++++++++ riscv/sbi-tests.h | 2 + riscv/sbi.c | 3 +- 5 files changed, 873 insertions(+), 1 deletion(-) create mode 100644 riscv/sbi-dbtr.c diff --git a/lib/riscv/asm/sbi.h b/lib/riscv/asm/sbi.h index a5738a5c..78fd6e2a 100644 --- a/lib/riscv/asm/sbi.h +++ b/lib/riscv/asm/sbi.h @@ -51,6 +51,7 @@ enum sbi_ext_id { SBI_EXT_SUSP = 0x53555350, SBI_EXT_FWFT = 0x46574654, SBI_EXT_SSE = 0x535345, + SBI_EXT_DBTR = 0x44425452, }; enum sbi_ext_base_fid { diff --git a/riscv/Makefile b/riscv/Makefile index 11e68eae..9309ac12 100644 --- a/riscv/Makefile +++ b/riscv/Makefile @@ -18,6 +18,7 @@ tests += $(TEST_DIR)/sieve.$(exe) all: $(tests) $(TEST_DIR)/sbi-deps += $(TEST_DIR)/sbi-asm.o +$(TEST_DIR)/sbi-deps += $(TEST_DIR)/sbi-dbtr.o $(TEST_DIR)/sbi-deps += $(TEST_DIR)/sbi-fwft.o $(TEST_DIR)/sbi-deps += $(TEST_DIR)/sbi-sse.o diff --git a/riscv/sbi-dbtr.c b/riscv/sbi-dbtr.c new file mode 100644 index 00000000..13f64015 --- /dev/null +++ b/riscv/sbi-dbtr.c @@ -0,0 +1,867 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * SBI DBTR testsuite + * + * Copyright (C) 2025, Rivos Inc., Jesse Taube <jesse(a)rivosinc.com> + */ + +#include <libcflat.h> +#include <bitops.h> + +#include <asm/io.h> +#include <asm/processor.h> + +#include "sbi-tests.h" + +#define RV_MAX_TRIGGERS 32 + +#define SBI_DBTR_TRIG_STATE_MAPPED BIT(0) +#define SBI_DBTR_TRIG_STATE_U BIT(1) +#define SBI_DBTR_TRIG_STATE_S BIT(2) +#define SBI_DBTR_TRIG_STATE_VU BIT(3) +#define SBI_DBTR_TRIG_STATE_VS BIT(4) +#define SBI_DBTR_TRIG_STATE_HAVE_HW_TRIG BIT(5) +#define SBI_DBTR_TRIG_STATE_RESERVED GENMASK(7, 6) + +#define SBI_DBTR_TRIG_STATE_HW_TRIG_IDX_SHIFT 8 +#define SBI_DBTR_TRIG_STATE_HW_TRIG_IDX(trig_state) (trig_state >> SBI_DBTR_TRIG_STATE_HW_TRIG_IDX_SHIFT) + +#define SBI_DBTR_TDATA1_TYPE_SHIFT (__riscv_xlen - 4) +#define SBI_DBTR_TDATA1_DMODE BIT_UL(__riscv_xlen - 5) + +#define SBI_DBTR_TDATA1_MCONTROL6_LOAD BIT(0) +#define SBI_DBTR_TDATA1_MCONTROL6_STORE BIT(1) +#define SBI_DBTR_TDATA1_MCONTROL6_EXECUTE BIT(2) +#define SBI_DBTR_TDATA1_MCONTROL6_U BIT(3) +#define SBI_DBTR_TDATA1_MCONTROL6_S BIT(4) +#define SBI_DBTR_TDATA1_MCONTROL6_M BIT(6) +#define SBI_DBTR_TDATA1_MCONTROL6_SIZE_SHIFT 16 +#define SBI_DBTR_TDATA1_MCONTROL6_SIZE_MASK 0x7 +#define SBI_DBTR_TDATA1_MCONTROL6_SELECT BIT(21) +#define SBI_DBTR_TDATA1_MCONTROL6_VU BIT(23) +#define SBI_DBTR_TDATA1_MCONTROL6_VS BIT(24) + +#define SBI_DBTR_TDATA1_MCONTROL_LOAD BIT(0) +#define SBI_DBTR_TDATA1_MCONTROL_STORE BIT(1) +#define SBI_DBTR_TDATA1_MCONTROL_EXECUTE BIT(2) +#define SBI_DBTR_TDATA1_MCONTROL_U BIT(3) +#define SBI_DBTR_TDATA1_MCONTROL_S BIT(4) +#define SBI_DBTR_TDATA1_MCONTROL_M BIT(6) +#define SBI_DBTR_TDATA1_MCONTROL_SIZELO_SHIFT 16 +#define SBI_DBTR_TDATA1_MCONTROL_SIZELO_MASK 0x3 +#define SBI_DBTR_TDATA1_MCONTROL_SELECT BIT(19) +#define SBI_DBTR_TDATA1_MCONTROL_SIZEHI_SHIFT 21 +#define SBI_DBTR_TDATA1_MCONTROL_SIZEHI_MASK 0x3 + +enum McontrolType { + SBI_DBTR_TDATA1_TYPE_NONE = (0UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_LEGACY = (1UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_MCONTROL = (2UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_ICOUNT = (3UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_ITRIGGER = (4UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_ETRIGGER = (5UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_MCONTROL6 = (6UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_TMEXTTRIGGER = (7UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_RESERVED0 = (8UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_RESERVED1 = (9UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_RESERVED2 = (10UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_RESERVED3 = (11UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_CUSTOM0 = (12UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_CUSTOM1 = (13UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_CUSTOM2 = (14UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_DISABLED = (15UL << SBI_DBTR_TDATA1_TYPE_SHIFT), +}; + +enum Tdata1Size { + SIZE_ANY = 0, + SIZE_8BIT, + SIZE_16BIT, + SIZE_32BIT, + SIZE_48BIT, + SIZE_64BIT, +}; + +enum Tdata1Value { + VALUE_NONE = 0, + VALUE_LOAD = BIT(0), + VALUE_STORE = BIT(1), + VALUE_EXECUTE = BIT(2), +}; + +enum Tdata1Mode { + MODE_NONE = 0, + MODE_M = BIT(0), + MODE_U = BIT(1), + MODE_S = BIT(2), + MODE_VU = BIT(3), + MODE_VS = BIT(4), +}; + +enum sbi_ext_dbtr_fid { + SBI_EXT_DBTR_NUM_TRIGGERS = 0, + SBI_EXT_DBTR_SETUP_SHMEM, + SBI_EXT_DBTR_TRIGGER_READ, + SBI_EXT_DBTR_TRIGGER_INSTALL, + SBI_EXT_DBTR_TRIGGER_UPDATE, + SBI_EXT_DBTR_TRIGGER_UNINSTALL, + SBI_EXT_DBTR_TRIGGER_ENABLE, + SBI_EXT_DBTR_TRIGGER_DISABLE, +}; + +struct sbi_dbtr_data_msg { + unsigned long tstate; + unsigned long tdata1; + unsigned long tdata2; + unsigned long tdata3; +}; + +struct sbi_dbtr_id_msg { + unsigned long idx; +}; + +/* SBI shared mem messages layout */ +struct sbi_dbtr_shmem_entry { + union { + struct sbi_dbtr_data_msg data; + struct sbi_dbtr_id_msg id; + }; +}; + +static bool dbtr_handled; + +/* Expected to be leaf function as not to disrupt frame-pointer */ +static __attribute__((naked)) void exec_call(void) +{ + /* skip over nop when triggered instead of ret. */ + asm volatile (".option push\n" + ".option arch, -c\n" + "nop\n" + "ret\n" + ".option pop\n"); +} + +static void dbtr_exception_handler(struct pt_regs *regs) +{ + dbtr_handled = true; + + /* Reading *epc may cause a fault, skip over nop */ + if ((void *)regs->epc == exec_call) { + regs->epc += 4; + return; + } + + /* WARNING: Skips over the trapped intruction */ + regs->epc += RV_INSN_LEN(readw((void *)regs->epc)); +} + +static bool do_store(void *tdata2) +{ + bool ret; + + writel(0, tdata2); + + ret = dbtr_handled; + dbtr_handled = false; + + return ret; +} + +static bool do_load(void *tdata2) +{ + bool ret; + + readl(tdata2); + + ret = dbtr_handled; + dbtr_handled = false; + + return ret; +} + +static bool do_exec(void) +{ + bool ret; + + exec_call(); + + ret = dbtr_handled; + dbtr_handled = false; + + return ret; +} + +static unsigned long mcontrol_size(enum Tdata1Size mode) +{ + unsigned long ret = 0; + + ret |= ((mode >> 2) & SBI_DBTR_TDATA1_MCONTROL_SIZEHI_MASK) + << SBI_DBTR_TDATA1_MCONTROL_SIZEHI_SHIFT; + ret |= (mode & SBI_DBTR_TDATA1_MCONTROL_SIZELO_MASK) + << SBI_DBTR_TDATA1_MCONTROL_SIZELO_SHIFT; + + return ret; +} + +static unsigned long mcontrol6_size(enum Tdata1Size mode) +{ + return (mode & SBI_DBTR_TDATA1_MCONTROL6_SIZE_MASK) + << SBI_DBTR_TDATA1_MCONTROL6_SIZE_SHIFT; +} + +static unsigned long gen_tdata1_mcontrol(enum Tdata1Mode mode, enum Tdata1Value value) +{ + unsigned long tdata1 = SBI_DBTR_TDATA1_TYPE_MCONTROL; + + if (value & VALUE_LOAD) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_LOAD; + + if (value & VALUE_STORE) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_STORE; + + if (value & VALUE_EXECUTE) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_EXECUTE; + + if (mode & MODE_M) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_M; + + if (mode & MODE_U) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_U; + + if (mode & MODE_S) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_S; + + return tdata1; +} + +static unsigned long gen_tdata1_mcontrol6(enum Tdata1Mode mode, enum Tdata1Value value) +{ + unsigned long tdata1 = SBI_DBTR_TDATA1_TYPE_MCONTROL6; + + if (value & VALUE_LOAD) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_LOAD; + + if (value & VALUE_STORE) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_STORE; + + if (value & VALUE_EXECUTE) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_EXECUTE; + + if (mode & MODE_M) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_M; + + if (mode & MODE_U) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_U; + + if (mode & MODE_S) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_S; + + if (mode & MODE_VU) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_VU; + + if (mode & MODE_VS) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_VS; + + return tdata1; +} + +static unsigned long gen_tdata1(enum McontrolType type, enum Tdata1Value value, enum Tdata1Mode mode) +{ + switch (type) { + case SBI_DBTR_TDATA1_TYPE_MCONTROL: + return gen_tdata1_mcontrol(mode, value) | mcontrol_size(SIZE_32BIT); + case SBI_DBTR_TDATA1_TYPE_MCONTROL6: + return gen_tdata1_mcontrol6(mode, value) | mcontrol6_size(SIZE_32BIT); + default: + assert_msg(false, "Invalid mcontrol type: %u", (int)type); + } +} + +static struct sbiret sbi_debug_num_triggers(unsigned long trig_tdata1) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_NUM_TRIGGERS, trig_tdata1, 0, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_set_shmem_raw(unsigned long shmem_phys_lo, + unsigned long shmem_phys_hi, + unsigned long flags) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_SETUP_SHMEM, shmem_phys_lo, + shmem_phys_hi, flags, 0, 0, 0); +} + +static struct sbiret sbi_debug_set_shmem(void *shmem) +{ + unsigned long base_addr_lo, base_addr_hi; + + split_phys_addr(virt_to_phys(shmem), &base_addr_hi, &base_addr_lo); + return sbi_debug_set_shmem_raw(base_addr_lo, base_addr_hi, 0); +} + +static struct sbiret sbi_debug_read_triggers(unsigned long trig_idx_base, + unsigned long trig_count) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_READ, trig_idx_base, + trig_count, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_install_triggers(unsigned long trig_count) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_INSTALL, trig_count, 0, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_update_triggers(unsigned long trig_count) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_UPDATE, trig_count, 0, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_uninstall_triggers(unsigned long trig_idx_base, + unsigned long trig_idx_mask) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_UNINSTALL, trig_idx_base, + trig_idx_mask, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_enable_triggers(unsigned long trig_idx_base, + unsigned long trig_idx_mask) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_ENABLE, trig_idx_base, + trig_idx_mask, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_disable_triggers(unsigned long trig_idx_base, + unsigned long trig_idx_mask) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_DISABLE, trig_idx_base, + trig_idx_mask, 0, 0, 0, 0); +} + +static bool dbtr_install_trigger(struct sbi_dbtr_shmem_entry *shmem, void *trigger, + unsigned long control) +{ + struct sbiret sbi_ret; + bool ret; + + shmem->data.tdata1 = control; + shmem->data.tdata2 = (unsigned long)trigger; + + sbi_ret = sbi_debug_install_triggers(1); + ret = sbiret_report_error(&sbi_ret, SBI_SUCCESS, "sbi_debug_install_triggers"); + if (ret) + install_exception_handler(EXC_BREAKPOINT, dbtr_exception_handler); + + return ret; +} + +static bool dbtr_uninstall_trigger(void) +{ + struct sbiret ret; + + install_exception_handler(EXC_BREAKPOINT, NULL); + + ret = sbi_debug_uninstall_triggers(0, 1); + return sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); +} + +static unsigned long dbtr_test_num_triggers(void) +{ + struct sbiret ret; + unsigned long tdata1 = 0; + /* sbi_debug_num_triggers will return trig_max in sbiret.value when trig_tdata1 == 0 */ + + report_prefix_push("available triggers"); + + /* should be at least one trigger. */ + ret = sbi_debug_num_triggers(tdata1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_num_triggers"); + + if (ret.value == 0) { + report_fail("Returned 0 triggers available"); + } else { + report_pass("Returned triggers available"); + report_info("Returned %lu triggers available", ret.value); + } + + report_prefix_pop(); + return ret.value; +} + +static enum McontrolType dbtr_test_type(unsigned long *num_trig) +{ + struct sbiret ret; + unsigned long tdata1 = SBI_DBTR_TDATA1_TYPE_MCONTROL6; + + report_prefix_push("test type"); + report_prefix_push("sbi_debug_num_triggers"); + + ret = sbi_debug_num_triggers(tdata1); + sbiret_report_error(&ret, SBI_SUCCESS, "mcontrol6"); + *num_trig = ret.value; + if (ret.value > 0) { + report_pass("Returned mcontrol6 triggers available"); + report_info("Returned %lu mcontrol6 triggers available", + ret.value); + report_prefix_popn(2); + return tdata1; + } + + tdata1 = SBI_DBTR_TDATA1_TYPE_MCONTROL; + + ret = sbi_debug_num_triggers(tdata1); + sbiret_report_error(&ret, SBI_SUCCESS, "mcontrol"); + *num_trig = ret.value; + if (ret.value > 0) { + report_pass("Returned mcontrol triggers available"); + report_info("Returned %lu mcontrol triggers available", + ret.value); + report_prefix_popn(2); + return tdata1; + } + + report_fail("Returned 0 mcontrol(6) triggers available"); + report_prefix_popn(2); + + return SBI_DBTR_TDATA1_TYPE_NONE; +} + +static struct sbiret dbtr_test_store_install_uninstall(struct sbi_dbtr_shmem_entry *shmem, + enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + + report_prefix_push("store trigger"); + + shmem->data.tdata1 = gen_tdata1(type, VALUE_STORE, MODE_S); + shmem->data.tdata2 = (unsigned long)&test; + + ret = sbi_debug_install_triggers(1); + if (!sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_install_triggers")) { + report_prefix_pop(); + return ret; + } + + install_exception_handler(EXC_BREAKPOINT, dbtr_exception_handler); + + report(do_store(&test), "triggered"); + + if (do_load(&test)) + report_fail("triggered by load"); + + ret = sbi_debug_uninstall_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); + + if (do_store(&test)) + report_fail("triggered after uninstall"); + + install_exception_handler(EXC_BREAKPOINT, NULL); + report_prefix_pop(); + + return ret; +} + +static void dbtr_test_update(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + bool kfail; + + report_prefix_push("update trigger"); + + if (!dbtr_install_trigger(shmem, NULL, gen_tdata1(type, VALUE_NONE, MODE_NONE))) { + report_prefix_pop(); + return; + } + + shmem->id.idx = 0; + shmem->data.tdata1 = gen_tdata1(type, VALUE_STORE, MODE_S); + shmem->data.tdata2 = (unsigned long)&test; + + ret = sbi_debug_update_triggers(1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_update_triggers"); + + /* + * Known broken update_triggers. + * https://lore.kernel.org/opensbi/aDdp1UeUh7GugeHp@ghost/T/#t + */ + kfail = __sbi_get_imp_id() == SBI_IMPL_OPENSBI && + __sbi_get_imp_version() < sbi_impl_opensbi_mk_version(1, 7); + report_kfail(kfail, do_store(&test), "triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_load(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + + report_prefix_push("load trigger"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_LOAD, MODE_S))) { + report_prefix_pop(); + return; + } + + report(do_load(&test), "triggered"); + + if (do_store(&test)) + report_fail("triggered by store"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_disable_enable(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + + report_prefix_push("disable trigger"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + ret = sbi_debug_disable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_disable_triggers"); + + if (!report(!do_store(&test), "should not trigger")) { + dbtr_uninstall_trigger(); + report_prefix_pop(); + report_skip("enable trigger: no disable"); + + return; + } + + report_prefix_pop(); + report_prefix_push("enable trigger"); + + ret = sbi_debug_enable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_enable_triggers"); + + report(do_store(&test), "triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_exec(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + + report_prefix_push("exec trigger"); + /* check if loads and stores trigger exec */ + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_EXECUTE, MODE_S))) { + report_prefix_pop(); + return; + } + + if (do_load(&test)) + report_fail("triggered by load"); + + if (do_store(&test)) + report_fail("triggered by store"); + + dbtr_uninstall_trigger(); + + /* Check if exec works */ + if (!dbtr_install_trigger(shmem, exec_call, gen_tdata1(type, VALUE_EXECUTE, MODE_S))) { + report_prefix_pop(); + return; + } + report(do_exec(), "triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_read(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + const unsigned long tstatus_expected = SBI_DBTR_TRIG_STATE_S | SBI_DBTR_TRIG_STATE_MAPPED; + const unsigned long tdata1 = gen_tdata1(type, VALUE_STORE, MODE_S); + static unsigned long test; + struct sbiret ret; + + report_prefix_push("read trigger"); + if (!dbtr_install_trigger(shmem, &test, tdata1)) { + report_prefix_pop(); + return; + } + + ret = sbi_debug_read_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_read_triggers"); + + if (!report(shmem->data.tdata1 == tdata1, "tdata1 expected: 0x%016lx", tdata1)) + report_info("tdata1 found: 0x%016lx", shmem->data.tdata1); + if (!report(shmem->data.tdata2 == ((unsigned long)&test), "tdata2 expected: 0x%016lx", + (unsigned long)&test)) + report_info("tdata2 found: 0x%016lx", shmem->data.tdata2); + if (!report(shmem->data.tstate == tstatus_expected, "tstate expected: 0x%016lx", tstatus_expected)) + report_info("tstate found: 0x%016lx", shmem->data.tstate); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void check_exec(unsigned long base) +{ + struct sbiret ret; + + report(do_exec(), "exec triggered"); + + ret = sbi_debug_uninstall_triggers(base, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); +} + +static void dbtr_test_multiple(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type, + unsigned long num_trigs) +{ + static unsigned long test[2]; + struct sbiret ret; + bool have_three = num_trigs > 2; + + if (num_trigs < 2) { + report_skip("test multiple"); + return; + } + + report_prefix_push("test multiple"); + + if (!dbtr_install_trigger(shmem, &test[0], gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + if (!dbtr_install_trigger(shmem, &test[1], gen_tdata1(type, VALUE_LOAD, MODE_S))) + goto error; + if (have_three && + !dbtr_install_trigger(shmem, exec_call, gen_tdata1(type, VALUE_EXECUTE, MODE_S))) { + ret = sbi_debug_uninstall_triggers(1, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); + goto error; + } + + report(do_store(&test[0]), "store triggered"); + + if (do_load(&test[0])) + report_fail("store triggered by load"); + + report(do_load(&test[1]), "load triggered"); + + if (do_store(&test[1])) + report_fail("load triggered by store"); + + if (have_three) + check_exec(2); + + ret = sbi_debug_uninstall_triggers(1, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); + + if (do_load(&test[1])) + report_fail("load triggered after uninstall"); + + report(do_store(&test[0]), "store triggered"); + + if (!have_three && + dbtr_install_trigger(shmem, exec_call, gen_tdata1(type, VALUE_EXECUTE, MODE_S))) + check_exec(1); + +error: + ret = sbi_debug_uninstall_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); + + install_exception_handler(EXC_BREAKPOINT, NULL); + report_prefix_pop(); +} + +static void dbtr_test_multiple_types(struct sbi_dbtr_shmem_entry *shmem, unsigned long type) +{ + static unsigned long test; + + report_prefix_push("test multiple types"); + + /* check if loads and stores trigger exec */ + if (!dbtr_install_trigger(shmem, &test, + gen_tdata1(type, VALUE_EXECUTE | VALUE_LOAD | VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + report(do_load(&test), "load triggered"); + + report(do_store(&test), "store triggered"); + + dbtr_uninstall_trigger(); + + /* Check if exec works */ + if (!dbtr_install_trigger(shmem, exec_call, + gen_tdata1(type, VALUE_EXECUTE | VALUE_LOAD | VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + report(do_exec(), "exec triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_disable_uninstall(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + + report_prefix_push("disable uninstall"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + ret = sbi_debug_disable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_disable_triggers"); + + dbtr_uninstall_trigger(); + + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + report(do_store(&test), "triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_uninstall_enable(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + + report_prefix_push("uninstall enable"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + dbtr_uninstall_trigger(); + + ret = sbi_debug_enable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_enable_triggers"); + + install_exception_handler(EXC_BREAKPOINT, dbtr_exception_handler); + + report(!do_store(&test), "should not trigger"); + + install_exception_handler(EXC_BREAKPOINT, NULL); + report_prefix_pop(); +} + +static void dbtr_test_uninstall_update(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + bool kfail; + + report_prefix_push("uninstall update"); + if (!dbtr_install_trigger(shmem, NULL, gen_tdata1(type, VALUE_NONE, MODE_NONE))) { + report_prefix_pop(); + return; + } + + dbtr_uninstall_trigger(); + + shmem->id.idx = 0; + shmem->data.tdata1 = gen_tdata1(type, VALUE_STORE, MODE_S); + shmem->data.tdata2 = (unsigned long)&test; + + /* + * Known broken update_triggers. + * https://lore.kernel.org/opensbi/aDdp1UeUh7GugeHp@ghost/T/#t + */ + kfail = __sbi_get_imp_id() == SBI_IMPL_OPENSBI && + __sbi_get_imp_version() < sbi_impl_opensbi_mk_version(1, 7); + ret = sbi_debug_update_triggers(1); + sbiret_kfail_error(kfail, &ret, SBI_ERR_FAILURE, "sbi_debug_update_triggers"); + + install_exception_handler(EXC_BREAKPOINT, dbtr_exception_handler); + + report(!do_store(&test), "should not trigger"); + + install_exception_handler(EXC_BREAKPOINT, NULL); + report_prefix_pop(); +} + +static void dbtr_test_disable_read(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + const unsigned long tstatus_expected = SBI_DBTR_TRIG_STATE_S | SBI_DBTR_TRIG_STATE_MAPPED; + const unsigned long tdata1 = gen_tdata1(type, VALUE_STORE, MODE_NONE); + static unsigned long test; + struct sbiret ret; + + report_prefix_push("disable read"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + ret = sbi_debug_disable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_disable_triggers"); + + ret = sbi_debug_read_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_read_triggers"); + + if (!report(shmem->data.tdata1 == tdata1, "tdata1 expected: 0x%016lx", tdata1)) + report_info("tdata1 found: 0x%016lx", shmem->data.tdata1); + if (!report(shmem->data.tdata2 == ((unsigned long)&test), "tdata2 expected: 0x%016lx", + (unsigned long)&test)) + report_info("tdata2 found: 0x%016lx", shmem->data.tdata2); + if (!report(shmem->data.tstate == tstatus_expected, "tstate expected: 0x%016lx", tstatus_expected)) + report_info("tstate found: 0x%016lx", shmem->data.tstate); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +void check_dbtr(void) +{ + static struct sbi_dbtr_shmem_entry shmem[RV_MAX_TRIGGERS] = {}; + unsigned long num_trigs; + enum McontrolType trig_type; + struct sbiret ret; + + report_prefix_push("dbtr"); + + if (!sbi_probe(SBI_EXT_DBTR)) { + report_skip("extension not available"); + goto exit_test; + } + + num_trigs = dbtr_test_num_triggers(); + if (!num_trigs) + goto exit_test; + + trig_type = dbtr_test_type(&num_trigs); + if (trig_type == SBI_DBTR_TDATA1_TYPE_NONE) + goto exit_test; + + ret = sbi_debug_set_shmem(shmem); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_set_shmem"); + + ret = dbtr_test_store_install_uninstall(&shmem[0], trig_type); + /* install or uninstall failed */ + if (ret.error != SBI_SUCCESS) + goto exit_test; + + dbtr_test_load(&shmem[0], trig_type); + dbtr_test_exec(&shmem[0], trig_type); + dbtr_test_read(&shmem[0], trig_type); + dbtr_test_disable_enable(&shmem[0], trig_type); + dbtr_test_update(&shmem[0], trig_type); + dbtr_test_multiple_types(&shmem[0], trig_type); + dbtr_test_multiple(shmem, trig_type, num_trigs); + dbtr_test_disable_uninstall(&shmem[0], trig_type); + dbtr_test_uninstall_enable(&shmem[0], trig_type); + dbtr_test_uninstall_update(&shmem[0], trig_type); + dbtr_test_disable_read(&shmem[0], trig_type); + +exit_test: + report_prefix_pop(); +} diff --git a/riscv/sbi-tests.h b/riscv/sbi-tests.h index d5c4ae70..c1ebf016 100644 --- a/riscv/sbi-tests.h +++ b/riscv/sbi-tests.h @@ -97,8 +97,10 @@ static inline bool env_enabled(const char *env) return s && (*s == '1' || *s == 'y' || *s == 'Y'); } +void split_phys_addr(phys_addr_t paddr, unsigned long *hi, unsigned long *lo); void sbi_bad_fid(int ext); void check_sse(void); +void check_dbtr(void); #endif /* __ASSEMBLER__ */ #endif /* _RISCV_SBI_TESTS_H_ */ diff --git a/riscv/sbi.c b/riscv/sbi.c index edb1a6be..3b8aadce 100644 --- a/riscv/sbi.c +++ b/riscv/sbi.c @@ -105,7 +105,7 @@ static int rand_online_cpu(prng_state *ps) return cpu; } -static void split_phys_addr(phys_addr_t paddr, unsigned long *hi, unsigned long *lo) +void split_phys_addr(phys_addr_t paddr, unsigned long *hi, unsigned long *lo) { *lo = (unsigned long)paddr; *hi = 0; @@ -1561,6 +1561,7 @@ int main(int argc, char **argv) check_susp(); check_sse(); check_fwft(); + check_dbtr(); return report_summary(); } -- 2.43.0

1 week, 5 days

2
2
0 0

[PATCH] selftests/mm: pagemap_scan ioctl: add PFN ZERO test cases

by Muhammad Usama Anjum

Add test cases to test the correctness of PFN ZERO flag of pagemap_scan ioctl. Test with normal pages backed memory and huge pages backed memory. Cc: David Hildenbrand <david(a)redhat.com> Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com> --- The bug has been fixed [1]. [1] https://lore.kernel.org/all/20250617143532.2375383-1-david@redhat.com --- tools/testing/selftests/mm/pagemap_ioctl.c | 57 +++++++++++++++++++++- 1 file changed, 56 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c index 57b4bba2b45f3..6138de0087edf 100644 --- a/tools/testing/selftests/mm/pagemap_ioctl.c +++ b/tools/testing/selftests/mm/pagemap_ioctl.c @@ -1,4 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 + #define _GNU_SOURCE #include <stdio.h> #include <fcntl.h> @@ -1480,6 +1481,57 @@ static void transact_test(int page_size) extra_thread_faults); } +void zeropfn_tests(void) +{ + unsigned long long mem_size; + struct page_region vec; + int i, ret; + char *mem; + + /* Test with page backed memory */ + mem_size = 10 * page_size; + mem = mmap(NULL, mem_size, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0); + if (mem == MAP_FAILED) + ksft_exit_fail_msg("error nomem\n"); + + /* Touch each page to ensure it's mapped */ + for (i = 0; i < mem_size; i += page_size) + (void)((volatile char *)mem)[i]; + + ret = pagemap_ioctl(mem, mem_size, &vec, 1, 0, + (mem_size / page_size), PAGE_IS_PFNZERO, 0, 0, PAGE_IS_PFNZERO); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && LEN(vec) == (mem_size / page_size), + "%s all pages must have PFNZERO set\n", __func__); + + munmap(mem, mem_size); + + /* Test with huge page */ + mem_size = 10 * hpage_size; + mem = memalign(hpage_size, mem_size); + if (!mem) + ksft_exit_fail_msg("error nomem\n"); + + ret = madvise(mem, mem_size, MADV_HUGEPAGE); + if (ret) + ksft_exit_fail_msg("madvise failed %d %s\n", errno, strerror(errno)); + + for (i = 0; i < mem_size; i += hpage_size) + (void)((volatile char *)mem)[i]; + + ret = pagemap_ioctl(mem, mem_size, &vec, 1, 0, + (mem_size / page_size), PAGE_IS_PFNZERO, 0, 0, PAGE_IS_PFNZERO); + if (ret < 0) + ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno)); + + ksft_test_result(ret == 1 && LEN(vec) == (mem_size / page_size), + "%s all huge pages must have PFNZERO set\n", __func__); + + free(mem); +} + int main(int __attribute__((unused)) argc, char *argv[]) { int shmid, buf_size, fd, i, ret; @@ -1494,7 +1546,7 @@ int main(int __attribute__((unused)) argc, char *argv[]) if (init_uffd()) ksft_exit_pass(); - ksft_set_plan(115); + ksft_set_plan(117); page_size = getpagesize(); hpage_size = read_pmd_pagesize(); @@ -1669,6 +1721,9 @@ int main(int __attribute__((unused)) argc, char *argv[]) /* 16. Userfaultfd tests */ userfaultfd_tests(); + /* 17. ZEROPFN tests */ + zeropfn_tests(); + close(pagemap_fd); ksft_exit_pass(); } -- 2.43.0

1 week, 5 days

2
4
0 0

[PATCH v4 0/2] kunit: qemu_configs: Add MIPS configurations

by Thomas Weißschuh

Add basic support to run various MIPS variants via kunit_tool using the virtualized malta platform. Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de> --- Changes in v4: - Rebase on v6.16-rc1 - Pick up reviews from David - Clarify that GIC page is linked to vDSO - Link to v3: https://lore.kernel.org/r/20250415-kunit-mips-v3-0-4ec2461b5a7e@linutronix.… Changes in v3: - Also skip VDSO_RANDOMIZE_SIZE adjustment for kthreads - Link to v2: https://lore.kernel.org/r/20250414-kunit-mips-v2-0-4cf01e1a29e6@linutronix.… Changes in v2: - Fix usercopy kunit test by handling ABI-less tasks in stack_top() - Drop change to mm initialization. The broken test is not built by default anymore. - Link to v1: https://lore.kernel.org/r/20250212-kunit-mips-v1-0-eb49c9d76615@linutronix.… --- Thomas Weißschuh (2): MIPS: Don't crash in stack_top() for tasks without ABI or vDSO kunit: qemu_configs: Add MIPS configurations arch/mips/kernel/process.c | 16 +++++++++------- tools/testing/kunit/qemu_configs/mips.py | 18 ++++++++++++++++++ tools/testing/kunit/qemu_configs/mips64.py | 19 +++++++++++++++++++ tools/testing/kunit/qemu_configs/mips64el.py | 19 +++++++++++++++++++ tools/testing/kunit/qemu_configs/mipsel.py | 18 ++++++++++++++++++ 5 files changed, 83 insertions(+), 7 deletions(-) --- base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494 change-id: 20241014-kunit-mips-e4fe1c265ed7 Best regards, -- Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>

1 week, 5 days

3
4
0 0

[PATCH v2] selftests/futex: Convert 32bit timespec struct to 64bit version for 32bit compatibility mode

by Terry Tritton

Futex_waitv can not accept old_timespec32 struct, so userspace should convert it from 32bit to 64bit before syscall in 32bit compatible mode. This fix is based off [1] Link: https://lore.kernel.org/all/20231203235117.29677-1-wegao@suse.com/ [1] Signed-off-by: Wei Gao <wegao(a)suse.com> Signed-off-by: Terry Tritton <terry.tritton(a)linaro.org> --- .../testing/selftests/futex/include/futex2test.h | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h index ea79662405bc..6780e51eb2d6 100644 --- a/tools/testing/selftests/futex/include/futex2test.h +++ b/tools/testing/selftests/futex/include/futex2test.h @@ -55,6 +55,13 @@ struct futex32_numa { futex_t numa; }; +#if !defined(__LP64__) +struct timespec64 { + int64_t tv_sec; + int64_t tv_nsec; +}; +#endif + /** * futex_waitv - Wait at multiple futexes, wake on any * @waiters: Array of waiters @@ -65,7 +72,15 @@ struct futex32_numa { static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned long nr_waiters, unsigned long flags, struct timespec *timo, clockid_t clockid) { +#if !defined(__LP64__) + struct timespec64 timo64 = {0}; + + timo64.tv_sec = timo->tv_sec; + timo64.tv_nsec = timo->tv_nsec; + return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, &timo64, clockid); +#else return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid); +#endif } /* -- 2.39.5

1 week, 5 days

3
5
0 0

[PATCH net-next 0/7] netpoll: Factor out functions from netpoll_send_udp() and add ipv6 selftest

by Breno Leitao

Refactors the netpoll UDP transmit path to improve code clarity, maintainability, and protocol-layer encapsulation. Function netpoll_send_udp() has more than 100 LoC, which is hard to understand and review. After this patchset, it has only 32 LoC, which is more manageable. The series systematically moves the construction of protocol headers (UDP, IPv4, IPv6, Ethernet) out of the core `netpoll_send_udp()` function into dedicated static helpers: - `push_udp()` for UDP header setup - `push_ipv4()` and `push_ipv6()` for IP header setup - `push_eth()` for Ethernet header setup This results in a clean, layered abstraction that mirrors the protocol stack, reduces code duplication, and improves readability. Also, to make sure this is not breaking anything, add IPv6 selftest to netconsole tests, which will exercise this code. This test would also pick problems similiar to the one fixed by f599020702698 ("net: netpoll: Initialize UDP checksum field before checksumming"), which was embarrassin we didn't have a selftest catch it. Anyway, there are **no functional changes** intended in this patchset. Signed-off-by: Breno Leitao <leitao(a)debian.org> --- Breno Leitao (7): netpoll: Improve code clarity with explicit struct size calculations netpoll: factor out UDP checksum calculation into helper netpoll: factor out IPv6 header setup into push_ipv6() helper netpoll: factor out IPv4 header setup into push_ipv4() helper netpoll: factor out UDP header setup into push_udp() helper netpoll: move Ethernet setup to push_eth() helper selftests: net: Add IPv6 support to netconsole basic tests net/core/netpoll.c | 196 +++++++++++++-------- .../selftests/drivers/net/lib/sh/lib_netcons.sh | 74 +++++++- .../testing/selftests/drivers/net/netcons_basic.sh | 52 +++--- 3 files changed, 216 insertions(+), 106 deletions(-) --- base-commit: 8efa26fcbf8a7f783fd1ce7dd2a409e9b7758df0 change-id: 20250620-netpoll_untagle_ip-e37c799a6925 Best regards, -- Breno Leitao <leitao(a)debian.org>

1 week, 5 days

2
9
0 0

[PATCH bpf-next v2 0/3] bpf: Fix and test aux usage after do_check_insn()

by Luis Gerhorst

Fix cur_aux()->nospec_result test after do_check_insn() referring to the to-be-analyzed (potentially unsafe) instruction, not the already-analyzed (safe) instruction. This might allow a unsafe insn to slip through on a speculative path. Create some tests from the reproducer [1]. Commit d6f1c85f2253 ("bpf: Fall back to nospec for Spectre v1") should not be in any stable kernel yet, therefore bpf-next should suffice. [1] https://lore.kernel.org/bpf/685b3c1b.050a0220.2303ee.0010.GAE@google.com/ Changes since v1: - Fix compiler error due to missed rename of prev_insn_idx in first patch - v1: https://lore.kernel.org/bpf/20250628125927.763088-1-luis.gerhorst@fau.de/ Changes since RFC: - Introduce prev_aux() as suggested by Alexei. For this, we must move the env->prev_insn_idx assignment to happen directly after do_check_insn(), for which I have created a separate commit. This patch could be simplified by using a local prev_aux variable as sugested by Eduard, but I figured one might find the new assignment-strategy easier to understand (before, prev_insn_idx and env->prev_insn_idx were out-of-sync for the latter part of the loop). Also, like this we do not have an additional prev_* variable that must be kept in-sync and the local variable's usage (old prev_insn_idx, new tmp) is much more local. If you think it would be better to not take the risk and keep the fix simple by just introducing the prev_aux variable, let me know. - Change WARN_ON_ONCE() to verifier_bug_if() as suggested by Alexei - Change assertion to check instruction is BPF_JMP[32] as suggested by Eduard - RFC: https://lore.kernel.org/bpf/8734bmoemx.fsf@fau.de/ Luis Gerhorst (3): bpf: Update env->prev_insn_idx after do_check_insn() bpf: Fix aux usage after do_check_insn() selftests/bpf: Add Spectre v4 tests kernel/bpf/verifier.c | 30 ++-- tools/testing/selftests/bpf/progs/bpf_misc.h | 4 + .../selftests/bpf/progs/verifier_unpriv.c | 149 ++++++++++++++++++ 3 files changed, 174 insertions(+), 9 deletions(-) base-commit: d69bafe6ee2b5eff6099fa26626ecc2963f0f363 -- 2.49.0

1 week, 5 days

2
5
0 0

[PATCH] tools/testing/selftests: add mremap() unfaulted/faulted test cases

by Lorenzo Stoakes

Assert that mremap() behaviour is as expected when moving around unfaulted VMAs immediately adjacent to faulted ones, as well as moving around faulted VMAs and placing them back immediately adjacent to the VMA from which they were moved. This also introduces a shared helper for the syscall version of mremap() so we don't encounter any issues with libc filtering parameters. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> --- tools/testing/selftests/mm/merge.c | 599 ++++++++++++++++++++++++++- tools/testing/selftests/mm/vm_util.c | 8 + tools/testing/selftests/mm/vm_util.h | 3 + 3 files changed, 608 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/mm/merge.c b/tools/testing/selftests/mm/merge.c index 150dd5baed2b..cc4253f47f10 100644 --- a/tools/testing/selftests/mm/merge.c +++ b/tools/testing/selftests/mm/merge.c @@ -13,6 +13,7 @@ #include <sys/wait.h> #include <linux/perf_event.h> #include "vm_util.h" +#include <linux/mman.h> FIXTURE(merge) { @@ -25,7 +26,7 @@ FIXTURE_SETUP(merge) { self->page_size = psize(); /* Carve out PROT_NONE region to map over. */ - self->carveout = mmap(NULL, 12 * self->page_size, PROT_NONE, + self->carveout = mmap(NULL, 30 * self->page_size, PROT_NONE, MAP_ANON | MAP_PRIVATE, -1, 0); ASSERT_NE(self->carveout, MAP_FAILED); /* Setup PROCMAP_QUERY interface. */ @@ -34,7 +35,7 @@ FIXTURE_SETUP(merge) FIXTURE_TEARDOWN(merge) { - ASSERT_EQ(munmap(self->carveout, 12 * self->page_size), 0); + ASSERT_EQ(munmap(self->carveout, 30 * self->page_size), 0); ASSERT_EQ(close_procmap(&self->procmap), 0); /* * Clear unconditionally, as some tests set this. It is no issue if this @@ -576,4 +577,598 @@ TEST_F(merge, ksm_merge) ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 2 * page_size); } +TEST_F(merge, mremap_unfaulted_to_faulted) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2; + + /* + * Map two distinct areas: + * + * |-----------| |-----------| + * | unfaulted | | unfaulted | + * |-----------| |-----------| + * ptr ptr2 + */ + ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + + /* Offset ptr2 further away. */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Fault in ptr: + * \ + * |-----------| / |-----------| + * | faulted | \ | unfaulted | + * |-----------| / |-----------| + * ptr \ ptr2 + */ + ptr[0] = 'x'; + + /* + * Now move ptr2 adjacent to ptr: + * + * |-----------|-----------| + * | faulted | unfaulted | + * |-----------|-----------| + * ptr ptr2 + * + * It should merge: + * + * |----------------------| + * | faulted | + * |----------------------| + * ptr + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size); +} + +TEST_F(merge, mremap_unfaulted_behind_faulted) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2; + + /* + * Map two distinct areas: + * + * |-----------| |-----------| + * | unfaulted | | unfaulted | + * |-----------| |-----------| + * ptr ptr2 + */ + ptr = mmap(&carveout[6 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr2 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + + /* Offset ptr2 further away. */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Fault in ptr: + * \ + * |-----------| / |-----------| + * | faulted | \ | unfaulted | + * |-----------| / |-----------| + * ptr \ ptr2 + */ + ptr[0] = 'x'; + + /* + * Now move ptr2 adjacent, but behind, ptr: + * + * |-----------|-----------| + * | unfaulted | faulted | + * |-----------|-----------| + * ptr2 ptr + * + * It should merge: + * + * |----------------------| + * | faulted | + * |----------------------| + * ptr2 + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &carveout[page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr2)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr2); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr2 + 10 * page_size); +} + +TEST_F(merge, mremap_unfaulted_between_faulted) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2, *ptr3; + + /* + * Map three distinct areas: + * + * |-----------| |-----------| |-----------| + * | unfaulted | | unfaulted | | unfaulted | + * |-----------| |-----------| |-----------| + * ptr ptr2 ptr3 + */ + ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + ptr3 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr3, MAP_FAILED); + + /* Offset ptr3 further away. */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000); + ASSERT_NE(ptr3, MAP_FAILED); + + /* Offset ptr2 further away. */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Fault in ptr, ptr3: + * \ \ + * |-----------| / |-----------| / |-----------| + * | faulted | \ | unfaulted | \ | faulted | + * |-----------| / |-----------| / |-----------| + * ptr \ ptr2 \ ptr3 + */ + ptr[0] = 'x'; + ptr3[0] = 'x'; + + /* + * Move ptr3 back into place, leaving a place for ptr2: + * \ + * |-----------| |-----------| / |-----------| + * | faulted | | faulted | \ | unfaulted | + * |-----------| |-----------| / |-----------| + * ptr ptr3 \ ptr2 + */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]); + ASSERT_NE(ptr3, MAP_FAILED); + + /* + * Finally, move ptr2 into place: + * + * |-----------|-----------|-----------| + * | faulted | unfaulted | faulted | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + * + * It should merge, but only ptr, ptr2: + * + * |-----------------------|-----------| + * | faulted | unfaulted | + * |-----------------------|-----------| + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr3)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr3); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr3 + 5 * page_size); +} + +TEST_F(merge, mremap_unfaulted_between_faulted_unfaulted) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2, *ptr3; + + /* + * Map three distinct areas: + * + * |-----------| |-----------| |-----------| + * | unfaulted | | unfaulted | | unfaulted | + * |-----------| |-----------| |-----------| + * ptr ptr2 ptr3 + */ + ptr = mmap(&carveout[page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + ptr2 = mmap(&carveout[7 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + ptr3 = mmap(&carveout[14 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr3, MAP_FAILED); + + /* Offset ptr3 further away. */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000); + ASSERT_NE(ptr3, MAP_FAILED); + + + /* Offset ptr2 further away. */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Fault in ptr: + * \ \ + * |-----------| / |-----------| / |-----------| + * | faulted | \ | unfaulted | \ | unfaulted | + * |-----------| / |-----------| / |-----------| + * ptr \ ptr2 \ ptr3 + */ + ptr[0] = 'x'; + + /* + * Move ptr3 back into place, leaving a place for ptr2: + * \ + * |-----------| |-----------| / |-----------| + * | faulted | | unfaulted | \ | unfaulted | + * |-----------| |-----------| / |-----------| + * ptr ptr3 \ ptr2 + */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]); + ASSERT_NE(ptr3, MAP_FAILED); + + /* + * Finally, move ptr2 into place: + * + * |-----------|-----------|-----------| + * | faulted | unfaulted | unfaulted | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + * + * It should merge: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size); +} + +TEST_F(merge, mremap_unfaulted_between_correctly_placed_faulted) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2; + + /* + * Map one larger area: + * + * |-----------------------------------| + * | unfaulted | + * |-----------------------------------| + */ + ptr = mmap(&carveout[page_size], 15 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * Fault in ptr: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + */ + ptr[0] = 'x'; + + /* + * Unmap middle: + * + * |-----------| |-----------| + * | faulted | | faulted | + * |-----------| |-----------| + * + * Now the faulted areas are compatible with each other (anon_vma the + * same, vma->vm_pgoff equal to virtual page offset). + */ + ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0); + + /* + * Map a new area, ptr2: + * \ + * |-----------| |-----------| / |-----------| + * | faulted | | faulted | \ | unfaulted | + * |-----------| |-----------| / |-----------| + * ptr \ ptr2 + */ + ptr2 = mmap(&carveout[20 * page_size], 5 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Finally, move ptr2 into place: + * + * |-----------|-----------|-----------| + * | faulted | unfaulted | faulted | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + * + * It should merge: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size); +} + +TEST_F(merge, mremap_correct_placed_faulted) +{ + unsigned int page_size = self->page_size; + char *carveout = self->carveout; + struct procmap_fd *procmap = &self->procmap; + char *ptr, *ptr2, *ptr3; + + /* + * Map one larger area: + * + * |-----------------------------------| + * | unfaulted | + * |-----------------------------------| + */ + ptr = mmap(&carveout[page_size], 15 * page_size, PROT_READ | PROT_WRITE, + MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * Fault in ptr: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + */ + ptr[0] = 'x'; + + /* + * Offset the final and middle 5 pages further away: + * \ \ + * |-----------| / |-----------| / |-----------| + * | faulted | \ | faulted | \ | faulted | + * |-----------| / |-----------| / |-----------| + * ptr \ ptr2 \ ptr3 + */ + ptr3 = &ptr[10 * page_size]; + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 2000); + ASSERT_NE(ptr3, MAP_FAILED); + ptr2 = &ptr[5 * page_size]; + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Move ptr2 into its correct place: + * \ + * |-----------|-----------| / |-----------| + * | faulted | faulted | \ | faulted | + * |-----------|-----------| / |-----------| + * ptr ptr2 \ ptr3 + * + * It should merge: + * \ + * |-----------------------| / |-----------| + * | faulted | \ | faulted | + * |-----------------------| / |-----------| + * ptr \ ptr3 + */ + + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size); + + /* + * Now move ptr out of place: + * \ \ + * |-----------| / |-----------| / |-----------| + * | faulted | \ | faulted | \ | faulted | + * |-----------| / |-----------| / |-----------| + * ptr2 \ ptr \ ptr3 + */ + ptr = sys_mremap(ptr, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr + page_size * 1000); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * Now move ptr back into place: + * \ + * |-----------|-----------| / |-----------| + * | faulted | faulted | \ | faulted | + * |-----------|-----------| / |-----------| + * ptr ptr2 \ ptr3 + * + * It should merge: + * \ + * |-----------------------| / |-----------| + * | faulted | \ | faulted | + * |-----------------------| / |-----------| + * ptr \ ptr3 + */ + ptr = sys_mremap(ptr, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &carveout[page_size]); + ASSERT_NE(ptr, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 10 * page_size); + + /* + * Now move ptr out of place again: + * \ \ + * |-----------| / |-----------| / |-----------| + * | faulted | \ | faulted | \ | faulted | + * |-----------| / |-----------| / |-----------| + * ptr2 \ ptr \ ptr3 + */ + ptr = sys_mremap(ptr, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr + page_size * 1000); + ASSERT_NE(ptr, MAP_FAILED); + + /* + * Now move ptr3 back into place: + * \ + * |-----------|-----------| / |-----------| + * | faulted | faulted | \ | faulted | + * |-----------|-----------| / |-----------| + * ptr2 ptr3 \ ptr + * + * It should merge: + * \ + * |-----------------------| / |-----------| + * | faulted | \ | faulted | + * |-----------------------| / |-----------| + * ptr2 \ ptr + */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr2[5 * page_size]); + ASSERT_NE(ptr3, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr2)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr2); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr2 + 10 * page_size); + + /* + * Now move ptr back into place: + * + * |-----------|-----------------------| + * | faulted | faulted | + * |-----------|-----------------------| + * ptr ptr2 + * + * It should merge: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + * ptr + */ + ptr = sys_mremap(ptr, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &carveout[page_size]); + ASSERT_NE(ptr, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size); + + /* + * Now move ptr2 out of the way: + * \ + * |-----------| |-----------| / |-----------| + * | faulted | | faulted | \ | faulted | + * |-----------| |-----------| / |-----------| + * ptr ptr3 \ ptr2 + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr2 + page_size * 1000); + ASSERT_NE(ptr2, MAP_FAILED); + + /* + * Now move it back: + * + * |-----------|-----------|-----------| + * | faulted | faulted | faulted | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + * + * It should merge: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + * ptr + */ + ptr2 = sys_mremap(ptr2, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[5 * page_size]); + ASSERT_NE(ptr2, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size); + + /* + * Move ptr3 out of place: + * \ + * |-----------------------| / |-----------| + * | faulted | \ | faulted | + * |-----------------------| / |-----------| + * ptr \ ptr3 + */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, ptr3 + page_size * 1000); + ASSERT_NE(ptr3, MAP_FAILED); + + /* + * Now move it back: + * + * |-----------|-----------|-----------| + * | faulted | faulted | faulted | + * |-----------|-----------|-----------| + * ptr ptr2 ptr3 + * + * It should merge: + * + * |-----------------------------------| + * | faulted | + * |-----------------------------------| + * ptr + */ + ptr3 = sys_mremap(ptr3, 5 * page_size, 5 * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, &ptr[10 * page_size]); + ASSERT_NE(ptr3, MAP_FAILED); + + ASSERT_TRUE(find_vma_procmap(procmap, ptr)); + ASSERT_EQ(procmap->query.vma_start, (unsigned long)ptr); + ASSERT_EQ(procmap->query.vma_end, (unsigned long)ptr + 15 * page_size); +} + TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index 5492e3f784df..1d434772fa54 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -524,3 +524,11 @@ int read_sysfs(const char *file_path, unsigned long *val) return 0; } + +void *sys_mremap(void *old_address, unsigned long old_size, + unsigned long new_size, int flags, void *new_address) +{ + return (void *)syscall(__NR_mremap, (unsigned long)old_address, + old_size, new_size, flags, + (unsigned long)new_address); +} diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index b8136d12a0f8..797c24215b17 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -117,6 +117,9 @@ static inline void log_test_result(int result) ksft_test_result_report(result, "%s\n", test_name); } +void *sys_mremap(void *old_address, unsigned long old_size, + unsigned long new_size, int flags, void *new_address); + /* * On ppc64 this will only work with radix 2M hugepage size */ -- 2.50.0

1 week, 5 days

1
0
0 0

[PATCH v2 0/5] binder: Set up KUnit tests for alloc

by Tiffany Yang

Hello, binder_alloc_selftest provides a robust set of checks for the binder allocator, but it rarely runs because it must hook into a running binder process and block all other binder threads until it completes. The test itself is a good candidate for conversion to KUnit, and it can be further isolated from user processes by using a test-specific lru freelist instead of the global one. This series converts the selftest to KUnit to make it less burdensome to run and to set up a foundation for unit testing future binder_alloc changes. Thanks, Tiffany Tiffany Yang (5): binder: Fix selftest page indexing binder: Store lru freelist in binder_alloc binder: Scaffolding for binder_alloc KUnit tests binder: Convert binder_alloc selftests to KUnit binder: encapsulate individual alloc test cases drivers/android/Kconfig | 15 +- drivers/android/Makefile | 2 +- drivers/android/binder.c | 10 +- drivers/android/binder_alloc.c | 39 +- drivers/android/binder_alloc.h | 14 +- drivers/android/binder_alloc_selftest.c | 306 ----------- drivers/android/binder_internal.h | 4 + drivers/android/tests/.kunitconfig | 3 + drivers/android/tests/Makefile | 3 + drivers/android/tests/binder_alloc_kunit.c | 573 +++++++++++++++++++++ include/kunit/test.h | 12 + lib/kunit/user_alloc.c | 4 +- 12 files changed, 645 insertions(+), 340 deletions(-) delete mode 100644 drivers/android/binder_alloc_selftest.c create mode 100644 drivers/android/tests/.kunitconfig create mode 100644 drivers/android/tests/Makefile create mode 100644 drivers/android/tests/binder_alloc_kunit.c -- 2.50.0.727.gbf7dc18ff4-goog

1 week, 5 days

2
6
0 0

[PATCH v10 net-next 00/15] AccECN protocol patch series

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com> Hello, Please find the v10 AccECN protocol patch series, which covers the core functionality of Accurate ECN, AccECN negotiation, AccECN TCP options, and AccECN failure handling. The Accurate ECN draft can be found in https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28 This patch series is part of the full AccECN patch series, which is available at https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/ Best Regards, Chia-Yu --- v10 (03-Jul-2025) - Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>) - Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>) - Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>) - Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>) - Add READ_ONCE() over every reads of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>) - Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch - Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>) v9 (21-Jun-2025) - Use tcp_data_ecn_check() to set TCP_ECN_SEE flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>) - Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accruate ECN (Paolo Abeni <pabeni(a)redhat.com>) - Restruct the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>) - Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>) - Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>) - Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>) v8 (10-Jun-2025) - Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>) - Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>) - Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>) - Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>) - Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>) - Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>) v7 (14-May-2025) - Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>) - Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>) - Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>) - Update commit message for #9 to explain the increase in tcp_sock_write_rx group size - Modify group size of tcp_sock_write_tx in #10 based on pahole results v6 (09-May-2025) - Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>) - Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>) - Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>) - Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>) - Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>) - Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>) - Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>) - Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>) - Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>) - Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>) - Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>) - Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>) - Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15 v5 (22-Apr-2025) - Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>) v4 (18-Apr-2025) - Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>) v3 (14-Apr-2025) - Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>) v2 (18-Mar-2025) - Add one missing patch from the previous AccECN protocol preparation patch series to this patch series. --- Chia-Yu Chang (6): tcp: reorganize tcp_sock_write_txrx group for variables later tcp: ecn functions in separated include file tcp: Add wait_third_ack flag for ECN negotiation in simultaneous connect tcp: accecn: AccECN option send control tcp: accecn: AccECN option failure handling tcp: accecn: try to fit AccECN option with SACK Ilpo Järvinen (9): tcp: reorganize SYN ECN code tcp: fast path functions later tcp: AccECN core tcp: accecn: AccECN negotiation tcp: accecn: add AccECN rx byte counters tcp: accecn: AccECN needs to know delivered bytes tcp: sack option handling improvements tcp: accecn: AccECN option tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics Documentation/networking/ip-sysctl.rst | 55 +- .../networking/net_cachelines/tcp_sock.rst | 13 + include/linux/tcp.h | 33 +- include/net/netns/ipv4.h | 2 + include/net/tcp.h | 87 ++- include/net/tcp_ecn.h | 647 ++++++++++++++++++ include/uapi/linux/tcp.h | 7 + net/ipv4/syncookies.c | 4 + net/ipv4/sysctl_net_ipv4.c | 19 + net/ipv4/tcp.c | 29 +- net/ipv4/tcp_input.c | 371 ++++++++-- net/ipv4/tcp_ipv4.c | 8 +- net/ipv4/tcp_minisocks.c | 40 +- net/ipv4/tcp_output.c | 297 ++++++-- net/ipv6/syncookies.c | 2 + net/ipv6/tcp_ipv6.c | 1 + 16 files changed, 1428 insertions(+), 187 deletions(-) create mode 100644 include/net/tcp_ecn.h -- 2.34.1

1 week, 5 days

1
15
0 0

[PATCH v10 iproute2-next 0/1] DUALPI2 iproute2 patch

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com> Hello, Please find DUALPI2 iproute2 patch v10. For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332). Best Regards, Chia-Yu --- v10 (02-Jul-2025) - Replace STEP_THRESH and STEP_PACKETS w/ STEP_THRESH_PKTS and STEP_THRESH_US of net-next patch (Jakub Kicinski <kuba(a)kernel.org>) v9 (13-Jun-2025) - Fix space issue and typos (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Change 'rtt_typical' to 'typical_rtt' in tc/q_dualpi2.c (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Add the num of enum used by DualPI2 in pkt_sched.h v8 (09-May-2025) - Update pkt_sched.h with the one in nex-next - Correct a typo in the comment within pkt_sched.h (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Update manual content in man/man8/tc-dualpi2.8 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Update tc/q_dualpi2.c to fix missing blank lines and add missing case (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) v7 (05-May-2025) - Align pkt_sched.h with the v14 version of net-next due to spec modification in tc.yaml - Reorganize dualpi2_print_opt() to match the order in tc.yaml - Remove credit-queue in PRINT_JSON v6 (26-Apr-2025) - Update JSON file output due to spec modification in tc.yaml of net-next v5 (25-Mar-2025) - Use matches() to replace current strcmp() (Stephen Hemminger <stephen(a)networkplumber.org>) - Use general parse_percent() for handling scaled percentage values (Stephen Hemminger <stephen(a)networkplumber.org>) - Add print function for JSON of dualpi2 stats (Stephen Hemminger <stephen(a)networkplumber.org>) v4 (16-Mar-2025) - Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step marking. v3 (21-Feb-2025) - Add memlimit to the dualpi2 attribute, and add memory_used, max_memory_used, and memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>) - Update the manual to align with the latest implementation and clarify the queue naming and default unit - Use common "get_scaled_alpha_beta" and clean print_opt for Dualpi2 v2 (23-Oct-2024) - Rename get_float in dualpi2 to get_float_min_max in utils.c - Move get_float from iplink_can.c in utils.c (Stephen Hemminger <stephen(a)networkplumber.org>) - Add print function for JSON of dualpi2 (Stephen Hemminger <stephen(a)networkplumber.org>) --- Chia-Yu Chang (1): tc: add dualpi2 scheduler module bash-completion/tc | 11 +- include/uapi/linux/pkt_sched.h | 68 +++++ include/utils.h | 2 + ip/iplink_can.c | 14 - lib/utils.c | 30 ++ man/man8/tc-dualpi2.8 | 249 ++++++++++++++++ tc/Makefile | 1 + tc/q_dualpi2.c | 528 +++++++++++++++++++++++++++++++++ 8 files changed, 888 insertions(+), 15 deletions(-) create mode 100644 man/man8/tc-dualpi2.8 create mode 100644 tc/q_dualpi2.c -- 2.34.1

1 week, 5 days

1
1
0 0

[PATCH bpf-next,v2 0/2] Clarify and Enhance XDP Rx Metadata Handling

by Song Yoong Siang

This patch set improves the documentation and selftests for XDP Rx metadata handling. The first patch clarifies the documentation around XDP metadata layout and the use of bpf_xdp_adjust_meta. The second patch enhances the BPF selftests to make XDP metadata handling more robust and portable across different NICs. Prior to this patch set, the user application retrieved the xdp_meta by calculating backward from the data pointer, while the XDP program fill in the xdp_meta by calculating backward from data_meta. This approach will cause mismatch if there is device-reserved metadata. |<---sizeof(xdp_meta)--| | | struct xdp_meta rx_desc->address ^ ^ | | +----------+----------------------+------------+------+ | headroom | custom metadata | reserved | data | +----------+----------------------+------------+------+ ^ ^ ^ | | | struct xdp_meta xdp_buff->data_meta xdp_buff->data | | |<---sizeof(xdp_meta)--| V2: - unconditionally do bpf_xdp_adjust_meta with -XDP_METADATA_SIZE (Stanislav) V1: https://lore.kernel.org/netdev/20250701042940.3272325-1-yoong.siang.song@in… Song Yoong Siang (2): doc: clarify XDP Rx metadata layout and bpf_xdp_adjust_meta usage selftests/bpf: Enhance XDP Rx Metadata Handling Documentation/networking/xdp-rx-metadata.rst | 38 +++++++++++++++++++ .../selftests/bpf/prog_tests/xdp_metadata.c | 2 +- .../selftests/bpf/progs/xdp_hw_metadata.c | 2 +- .../selftests/bpf/progs/xdp_metadata.c | 2 +- tools/testing/selftests/bpf/xdp_hw_metadata.c | 2 +- tools/testing/selftests/bpf/xdp_metadata.h | 7 ++++ 6 files changed, 49 insertions(+), 4 deletions(-) -- 2.34.1

1 week, 5 days

1
2
0 0

[PATCH net-next] selftests/tc-testing: Enable CONFIG_IP_SET

by Sebastian Andrzej Siewior

The config snippet specifies CONFIG_NET_EMATCH_IPSET. This option depends on CONFIG_IP_SET. Set CONFIG_IP_SET to be enabled at part for tc-testing. Signed-off-by: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de> --- tools/testing/selftests/tc-testing/config | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/tc-testing/config b/tools/testing/selftests/tc-testing/config index db176fe7d0c3f..8e902f7f1a181 100644 --- a/tools/testing/selftests/tc-testing/config +++ b/tools/testing/selftests/tc-testing/config @@ -21,6 +21,7 @@ CONFIG_NF_NAT=m CONFIG_NETFILTER_XT_TARGET_LOG=m CONFIG_NET_SCHED=y +CONFIG_IP_SET=m # # Queueing/Scheduling -- 2.50.0

1 week, 5 days

4
3
0 0

[PATCH net-next v1 1/2] selftests: pp-bench: remove unneeded linux/version.h

by Mina Almasry

linux/version.h was used by the out-of-tree version, but not needed in the upstream one anymore. While I'm at it, sort the includes. Reported-by: kernel test robot <lkp(a)intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506271434.Gk0epC9H-lkp@intel.com/ Signed-off-by: Mina Almasry <almasrymina(a)google.com> --- .../selftests/net/bench/page_pool/bench_page_pool_simple.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c b/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c index f183d5e30dc6..1cd3157fb6a9 100644 --- a/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c +++ b/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c @@ -5,15 +5,12 @@ */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt +#include <linux/interrupt.h> +#include <linux/limits.h> #include <linux/module.h> #include <linux/mutex.h> - -#include <linux/version.h> #include <net/page_pool/helpers.h> -#include <linux/interrupt.h> -#include <linux/limits.h> - #include "time_bench.h" static int verbose = 1; base-commit: 8efa26fcbf8a7f783fd1ce7dd2a409e9b7758df0 -- 2.50.0.727.gbf7dc18ff4-goog

1 week, 5 days

6
10
0 0

[PATCH v20 net-next 0/6] DUALPI2 patch

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com> Hello, Please find the DualPI2 patch v20. This patch serise adds DualPI Improved with a Square (DualPI2) with following features: * Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague) * Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332 * Configurable overload strategies * Use of sojourn time to reliably estimate queue delay * Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332). Best regards, Chia-Yu --- v20 (21-Jun-2025) - Add one more commit to fix warning and style check on tdc.sh reported by shellcheck - Remove double-prefixed of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>) v19 (14-Jun-2025) - Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>) v18 (13-Jun-2025) - Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute - Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>) v17 (25-May-2025, Resent at 11-Jun-2025) - Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>) - Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>) - Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>) - Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>) - Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>) - Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>) - Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>) v16 (16-MAy-2025) - Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>) - Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>) - Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>) - Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>) v15 (09-May-2025) - Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>) - Fix one typo in comment of #2 - Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h v14 (05-May-2025) - Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun - Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>) - Update dualpi2.json to align with the updated variable order in pkt_sched.h - Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>) v13 (26-Apr-2025) - Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>) - Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>) v12 (22-Apr-2025) - Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>) - Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>) - Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>) - Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>) v11 (15-Apr-2025) - Replace hstimer_init with hstimer_setup in sch_dualpi2.c v10 (25-Mar-2025) - Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>) - Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>) v9 (16-Mar-2025) - Fix mem_usage error in previous version - Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking. In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets. This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0. Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 11.55 11.70 ms 350 TCP upload avg : 18.96 N/A Mbits/s 350 TCP upload sum : 18.96 N/A Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 10.81 10.70 ms 350 TCP upload avg : 18.91 N/A Mbits/s 350 TCP upload sum : 18.91 N/A Mbits/s 350 Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 12.61 12.80 ms 350 TCP upload avg : 9.48 N/A Mbits/s 350 TCP upload sum : 9.48 N/A Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 11.06 10.80 ms 350 TCP upload avg : 9.43 N/A Mbits/s 350 TCP upload sum : 9.43 N/A Mbits/s 350 Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 40.86 37.45 ms 350 TCP upload avg : 0.88 N/A Mbits/s 350 TCP upload sum : 0.88 N/A Mbits/s 350 TCP upload::1 : 0.88 0.97 Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 11.07 10.40 ms 350 TCP upload avg : 0.55 N/A Mbits/s 350 TCP upload sum : 0.55 N/A Mbits/s 350 TCP upload::1 : 0.55 0.59 Mbits/s 350 v8 (11-Mar-2025) - Fix warning messages in v7 v7 (07-Mar-2025) - Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>) v6 (04-Mar-2025) - Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>) - Update test cases in dualpi2.json - Update commit message v5 (22-Feb-2025) - A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL: Unshaped 1gigE with 4 download streams test: - Summary of tcp_4down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 1.19 1.34 ms 349 TCP download avg : 235.42 N/A Mbits/s 349 TCP download sum : 941.68 N/A Mbits/s 349 TCP download::1 : 235.19 235.39 Mbits/s 349 TCP download::2 : 235.03 235.35 Mbits/s 349 TCP download::3 : 236.89 235.44 Mbits/s 349 TCP download::4 : 234.57 235.19 Mbits/s 349 - Summary of tcp_4down run 'MQ + FQ_PIE' avg median # data pts Ping (ms) ICMP : 1.21 1.37 ms 350 TCP download avg : 235.42 N/A Mbits/s 350 TCP download sum : 941.61 N/A Mbits/s 350 TCP download::1 : 232.54 233.13 Mbits/s 350 TCP download::2 : 232.52 232.80 Mbits/s 350 TCP download::3 : 233.14 233.78 Mbits/s 350 TCP download::4 : 243.41 241.48 Mbits/s 350 - Summary of tcp_4down run 'MQ + DUALPI2' avg median # data pts Ping (ms) ICMP : 1.19 1.34 ms 349 TCP download avg : 235.42 N/A Mbits/s 349 TCP download sum : 941.68 N/A Mbits/s 349 TCP download::1 : 235.19 235.39 Mbits/s 349 TCP download::2 : 235.03 235.35 Mbits/s 349 TCP download::3 : 236.89 235.44 Mbits/s 349 TCP download::4 : 234.57 235.19 Mbits/s 349 Unshaped 1gigE with 128 download streams test: - Summary of tcp_128down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 Unshaped 10gigE with 4 download streams test: - Summary of tcp_4down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 0.22 0.23 ms 350 TCP download avg : 2354.08 N/A Mbits/s 350 TCP download sum : 9416.31 N/A Mbits/s 350 TCP download::1 : 2353.65 2352.81 Mbits/s 350 TCP download::2 : 2354.54 2354.21 Mbits/s 350 TCP download::3 : 2353.56 2353.78 Mbits/s 350 TCP download::4 : 2354.56 2354.45 Mbits/s 350 - Summary of tcp_4down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 0.20 0.19 ms 350 TCP download avg : 2354.76 N/A Mbits/s 350 TCP download sum : 9419.04 N/A Mbits/s 350 TCP download::1 : 2354.77 2353.89 Mbits/s 350 TCP download::2 : 2353.41 2354.29 Mbits/s 350 TCP download::3 : 2356.18 2354.19 Mbits/s 350 TCP download::4 : 2354.68 2353.15 Mbits/s 350 - Summary of tcp_4down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 0.24 0.24 ms 350 TCP download avg : 2354.11 N/A Mbits/s 350 TCP download sum : 9416.43 N/A Mbits/s 350 TCP download::1 : 2354.75 2353.93 Mbits/s 350 TCP download::2 : 2353.15 2353.75 Mbits/s 350 TCP download::3 : 2353.49 2353.72 Mbits/s 350 TCP download::4 : 2355.04 2353.73 Mbits/s 350 Unshaped 10gigE with 128 download streams test: - Summary of tcp_128down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 7.57 8.69 ms 350 TCP download avg : 73.97 N/A Mbits/s 350 TCP download sum : 9467.82 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 7.82 8.91 ms 350 TCP download avg : 73.97 N/A Mbits/s 350 TCP download sum : 9468.42 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 6.87 7.93 ms 350 TCP download avg : 73.95 N/A Mbits/s 350 TCP download sum : 9465.87 N/A Mbits/s 350 From the results shown above, we see small differences between combinations. - Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>) - Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>) - Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>) - Update license identifier (Dave Taht <dave.taht(a)gmail.com>) - Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>) - Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>) - Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c - Fix step_thresh in packets - Update code comments in sch_dualpi2.c v4 (22-Oct-2024) - Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>) - Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>) - Fix line length warning. v3 (19-Oct-2024) - Fix compilaiton error - Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>) v2 (18-Oct-2024) - Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>) - Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Fix line length warning --- Chia-Yu Chang (5): sched: Struct definition and parsing of dualpi2 qdisc sched: Dump configuration and statistics of dualpi2 qdisc selftests/tc-testing: Fix warning and style check on tdc.sh selftests/tc-testing: Add selftests for qdisc DualPI2 Documentation: netlink: specs: tc: Add DualPI2 specification Koen De Schepper (1): sched: Add enqueue/dequeue of dualpi2 qdisc Documentation/netlink/specs/tc.yaml | 166 +++ include/net/dropreason-core.h | 6 + include/uapi/linux/pkt_sched.h | 70 +- net/sched/Kconfig | 12 + net/sched/Makefile | 1 + net/sched/sch_dualpi2.c | 1146 +++++++++++++++++ tools/testing/selftests/tc-testing/config | 1 + .../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++ tools/testing/selftests/tc-testing/tdc.sh | 6 +- 9 files changed, 1658 insertions(+), 4 deletions(-) create mode 100644 net/sched/sch_dualpi2.c create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json -- 2.34.1

1 week, 5 days

7
25
0 0

[PATCH][next] selftests/bpf: Fix spelling mistake "subtration" -> "subtraction"

by Colin Ian King

There are spelling mistakes in description text. Fix these. Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com> --- tools/testing/selftests/bpf/progs/verifier_bounds.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/bpf/progs/verifier_bounds.c b/tools/testing/selftests/bpf/progs/verifier_bounds.c index e52a24e15b34..6f986ae5085e 100644 --- a/tools/testing/selftests/bpf/progs/verifier_bounds.c +++ b/tools/testing/selftests/bpf/progs/verifier_bounds.c @@ -1474,7 +1474,7 @@ __naked void sub64_full_overflow(void) } SEC("socket") -__description("64-bit subtration, partial overflow, result in unbounded reg") +__description("64-bit subtraction, partial overflow, result in unbounded reg") __success __log_level(2) __msg("3: (1f) r3 -= r2 {{.*}} R3_w=scalar()") __retval(0) @@ -1514,7 +1514,7 @@ __naked void sub32_full_overflow(void) } SEC("socket") -__description("32-bit subtration, partial overflow, result in unbounded u32 bounds") +__description("32-bit subtraction, partial overflow, result in unbounded u32 bounds") __success __log_level(2) __msg("3: (1c) w3 -= w2 {{.*}} R3_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))") __retval(0) -- 2.50.0

1 week, 5 days

2
1
0 0

[PATCH 0/5] binder: Set up KUnit tests for alloc

by Tiffany Yang

Hello, binder_alloc_selftest provides a robust set of checks for the binder allocator, but it rarely runs because it must hook into a running binder process and block all other binder threads until it completes. The test itself is a good candidate for conversion to KUnit, and it can be further isolated from user processes by using a test-specific lru freelist instead of the global one. This series converts the selftest to KUnit to make it less burdensome to run and to set up a foundation for testing future binder_alloc changes. Thanks, Tiffany Tiffany Yang (5): binder: Fix selftest page indexing binder: Store lru freelist in binder_alloc binder: Scaffolding for binder_alloc KUnit tests binder: Convert binder_alloc selftests to KUnit binder: encapsulate individual alloc test cases drivers/android/Kconfig | 15 +- drivers/android/Makefile | 2 +- drivers/android/binder.c | 10 +- drivers/android/binder_alloc.c | 39 +- drivers/android/binder_alloc.h | 14 +- drivers/android/binder_alloc_selftest.c | 306 ----------- drivers/android/binder_internal.h | 4 + drivers/android/tests/.kunitconfig | 3 + drivers/android/tests/Makefile | 3 + drivers/android/tests/binder_alloc_kunit.c | 572 +++++++++++++++++++++ include/kunit/test.h | 12 + lib/kunit/user_alloc.c | 4 +- 12 files changed, 644 insertions(+), 340 deletions(-) delete mode 100644 drivers/android/binder_alloc_selftest.c create mode 100644 drivers/android/tests/.kunitconfig create mode 100644 drivers/android/tests/Makefile create mode 100644 drivers/android/tests/binder_alloc_kunit.c -- 2.50.0.727.gbf7dc18ff4-goog

1 week, 5 days

3
8
0 0

[PATCH] selftests: ptrace: add set_syscall_info to .gitignore

by Moon Hee Lee

Add the set_syscall_info test binary to .gitignore to avoid tracking build artifacts in the ptrace selftests directory. Signed-off-by: Moon Hee Lee <moonhee.lee.ca(a)gmail.com> --- tools/testing/selftests/ptrace/.gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/ptrace/.gitignore b/tools/testing/selftests/ptrace/.gitignore index b7dde152e75a..f6be8efd57ea 100644 --- a/tools/testing/selftests/ptrace/.gitignore +++ b/tools/testing/selftests/ptrace/.gitignore @@ -3,3 +3,4 @@ get_syscall_info get_set_sud peeksiginfo vmaccess +set_syscall_info -- 2.43.0

1 week, 5 days

2
1
0 0

[RESEND PATCH v2] kallsyms: fix build without execinfo

by Achill Gilgenast

Some libc's like musl libc don't provide execinfo.h since it's not part of POSIX. In order to fix compilation on musl, only include execinfo.h if available (HAVE_BACKTRACE_SUPPORT) This was discovered with c104c16073b7 ("Kunit to check the longest symbol length") which starts to include linux/kallsyms.h with Alpine Linux' configs. Signed-off-by: Achill Gilgenast <fossdd(a)pwned.life> Cc: stable(a)vger.kernel.org --- tools/include/linux/kallsyms.h | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/tools/include/linux/kallsyms.h b/tools/include/linux/kallsyms.h index 5a37ccbec54f..f61a01dd7eb7 100644 --- a/tools/include/linux/kallsyms.h +++ b/tools/include/linux/kallsyms.h @@ -18,6 +18,7 @@ static inline const char *kallsyms_lookup(unsigned long addr, return NULL; } +#ifdef HAVE_BACKTRACE_SUPPORT #include <execinfo.h> #include <stdlib.h> static inline void print_ip_sym(const char *loglvl, unsigned long ip) @@ -30,5 +31,8 @@ static inline void print_ip_sym(const char *loglvl, unsigned long ip) free(name); } +#else +static inline void print_ip_sym(const char *loglvl, unsigned long ip) {} +#endif #endif -- 2.50.0

1 week, 5 days

3
5
0 0

[PATCH v2 1/2] selftests: cachestat: add tests for mmap

by Suresh K C

From: Suresh K C <suresh.k.chandrappa(a)gmail.com> Add a test case to verify cachestat behavior with memory-mapped files using mmap(). This ensures that pages accessed via mmap are correctly accounted for in the page cache. Tested on x86_64 with default kernel config Signed-off-by: Suresh K C <suresh.k.chandrappa(a)gmail.com> --- .../selftests/cachestat/test_cachestat.c | 39 ++++++++++++++++--- 1 file changed, 33 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/cachestat/test_cachestat.c b/tools/testing/selftests/cachestat/test_cachestat.c index 632ab44737ec..b6452978dae0 100644 --- a/tools/testing/selftests/cachestat/test_cachestat.c +++ b/tools/testing/selftests/cachestat/test_cachestat.c @@ -33,6 +33,11 @@ void print_cachestat(struct cachestat *cs) cs->nr_evicted, cs->nr_recently_evicted); } +enum file_type { + FILE_MMAP, + FILE_SHMEM +}; + bool write_exactly(int fd, size_t filesize) { int random_fd = open("/dev/urandom", O_RDONLY); @@ -202,7 +207,7 @@ static int test_cachestat(const char *filename, bool write_random, bool create, return ret; } -bool test_cachestat_shmem(void) +bool run_cachestat_test(enum file_type type) { size_t PS = sysconf(_SC_PAGESIZE); size_t filesize = PS * 512 * 2; /* 2 2MB huge pages */ @@ -212,27 +217,43 @@ bool test_cachestat_shmem(void) char *filename = "tmpshmcstat"; struct cachestat cs; bool ret = true; + int fd; unsigned long num_pages = compute_len / PS; - int fd = shm_open(filename, O_CREAT | O_RDWR, 0600); + if (type == FILE_SHMEM) + fd = shm_open(filename, O_CREAT | O_RDWR, 0600); + else + fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666); if (fd < 0) { - ksft_print_msg("Unable to create shmem file.\n"); + ksft_print_msg("Unable to create file.\n"); ret = false; goto out; } if (ftruncate(fd, filesize)) { - ksft_print_msg("Unable to truncate shmem file.\n"); + ksft_print_msg("Unable to truncate file.\n"); ret = false; goto close_fd; } if (!write_exactly(fd, filesize)) { - ksft_print_msg("Unable to write to shmem file.\n"); + ksft_print_msg("Unable to write to file.\n"); ret = false; goto close_fd; } + if (type == FILE_MMAP){ + char *map = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (map == MAP_FAILED) { + ksft_print_msg("mmap failed.\n"); + ret = false; + goto close_fd; + } + for (int i = 0; i < filesize; i++) { + map[i] = 'A'; + } + map[filesize - 1] = 'X'; + } syscall_ret = syscall(__NR_cachestat, fd, &cs_range, &cs, 0); if (syscall_ret) { @@ -308,12 +329,18 @@ int main(void) break; } - if (test_cachestat_shmem()) + if (run_cachestat_test(FILE_SHMEM)) ksft_test_result_pass("cachestat works with a shmem file\n"); else { ksft_test_result_fail("cachestat fails with a shmem file\n"); ret = 1; } + if (run_cachestat_test(FILE_MMAP)) + ksft_test_result_pass("cachestat works with a mmap file\n"); + else { + ksft_test_result_fail("cachestat fails with a mmap file\n"); + ret = 1; + } return ret; } -- 2.43.0

1 week, 6 days

1
1
0 0

[PATCH v12 0/5] rust: replace kernel::str::CStr w/ core::ffi::CStr

by Tamir Duberstein

This picks up from Michal Rostecki's work[0]. Per Michal's guidance I have omitted Co-authored tags, as the end result is quite different. Link: https://lore.kernel.org/rust-for-linux/20240819153656.28807-2-vadorovsky@pr… [0] Closes: https://github.com/Rust-for-Linux/linux/issues/1075 Signed-off-by: Tamir Duberstein <tamird(a)gmail.com> --- Changes in v12: - Introduce `kernel::fmt::Display` to allow implementations on foreign types. - Tidy up doc comment on `str_to_cstr`. (Alice Ryhl). - Link to v11: https://lore.kernel.org/r/20250530-cstr-core-v11-0-cd9c0cbcb902@gmail.com Changes in v11: - Use `quote_spanned!` to avoid `use<'a, T>` and generally reduce manual token construction. - Add a commit to simplify `quote_spanned!`. - Drop first commit in favor of https://lore.kernel.org/rust-for-linux/20240906164448.2268368-1-paddymills@…. (Miguel Ojeda) - Correctly handle expressions such as `pr_info!("{a}", a = a = a)`. (Benno Lossin) - Avoid dealing with `}}` escapes, which is not needed. (Benno Lossin) - Revert some unnecessary changes. (Benno Lossin) - Rename `c_str_avoid_literals!` to `str_to_cstr!`. (Benno Lossin & Alice Ryhl). - Link to v10: https://lore.kernel.org/r/20250524-cstr-core-v10-0-6412a94d9d75@gmail.com Changes in v10: - Rebase on cbeaa41dfe26b72639141e87183cb23e00d4b0dd. - Implement Alice's suggestion to use a proc macro to work around orphan rules otherwise preventing `core::ffi::CStr` to be directly printed with `{}`. - Link to v9: https://lore.kernel.org/r/20250317-cstr-core-v9-0-51d6cc522f62@gmail.com Changes in v9: - Rebase on rust-next. - Restore `impl Display for BStr` which exists upstream[1]. - Link: https://doc.rust-lang.org/nightly/std/bstr/struct.ByteStr.html#impl-Display… [1] - Link to v8: https://lore.kernel.org/r/20250203-cstr-core-v8-0-cb3f26e78686@gmail.com Changes in v8: - Move `{from,as}_char_ptr` back to `CStrExt`. This reduces the diff some. - Restore `from_bytes_with_nul_unchecked_mut`, `to_cstring`. - Link to v7: https://lore.kernel.org/r/20250202-cstr-core-v7-0-da1802520438@gmail.com Changes in v7: - Rebased on mainline. - Restore functionality added in commit a321f3ad0a5d ("rust: str: add {make,to}_{upper,lower}case() to CString"). - Used `diff.algorithm patience` to improve diff readability. - Link to v6: https://lore.kernel.org/r/20250202-cstr-core-v6-0-8469cd6d29fd@gmail.com Changes in v6: - Split the work into several commits for ease of review. - Restore `{from,as}_char_ptr` to allow building on ARM (see commit message). - Add `CStrExt` to `kernel::prelude`. (Alice Ryhl) - Remove `CStrExt::from_bytes_with_nul_unchecked_mut` and restore `DerefMut for CString`. (Alice Ryhl) - Rename and hide `kernel::c_str!` to encourage use of C-String literals. - Drop implementation and invocation changes in kunit.rs. (Trevor Gross) - Drop docs on `Display` impl. (Trevor Gross) - Rewrite docs in the style of the standard library. - Restore the `test_cstr_debug` unit tests to demonstrate that the implementation has changed. Changes in v5: - Keep the `test_cstr_display*` unit tests. Changes in v4: - Provide the `CStrExt` trait with `display()` method, which returns a `CStrDisplay` wrapper with `Display` implementation. This addresses the lack of `Display` implementation for `core::ffi::CStr`. - Provide `from_bytes_with_nul_unchecked_mut()` method in `CStrExt`, which might be useful and is going to prevent manual, unsafe casts. - Fix a typo (s/preffered/prefered/). Changes in v3: - Fix the commit message. - Remove redundant braces in `use`, when only one item is imported. Changes in v2: - Do not remove `c_str` macro. While it's preferred to use C-string literals, there are two cases where `c_str` is helpful: - When working with macros, which already return a Rust string literal (e.g. `stringify!`). - When building macros, where we want to take a Rust string literal as an argument (for caller's convenience), but still use it as a C-string internally. - Use Rust literals as arguments in macros (`new_mutex`, `new_condvar`, `new_mutex`). Use the `c_str` macro to convert these literals to C-string literals. - Use `c_str` in kunit.rs for converting the output of `stringify!` to a `CStr`. - Remove `DerefMut` implementation for `CString`. --- Tamir Duberstein (5): rust: macros: reduce collections in `quote!` macro rust: support formatting of foreign types rust: replace `CStr` with `core::ffi::CStr` rust: replace `kernel::c_str!` with C-Strings rust: remove core::ffi::CStr reexport drivers/block/rnull.rs | 4 +- drivers/cpufreq/rcpufreq_dt.rs | 5 +- drivers/gpu/drm/drm_panic_qr.rs | 5 +- drivers/gpu/drm/nova/driver.rs | 10 +- drivers/gpu/nova-core/driver.rs | 6 +- drivers/gpu/nova-core/firmware.rs | 2 +- drivers/gpu/nova-core/gpu.rs | 4 +- drivers/gpu/nova-core/nova_core.rs | 2 +- drivers/net/phy/ax88796b_rust.rs | 8 +- drivers/net/phy/qt2025.rs | 6 +- rust/kernel/auxiliary.rs | 6 +- rust/kernel/block/mq.rs | 2 +- rust/kernel/clk.rs | 9 +- rust/kernel/configfs.rs | 14 +- rust/kernel/cpufreq.rs | 6 +- rust/kernel/device.rs | 9 +- rust/kernel/devres.rs | 2 +- rust/kernel/driver.rs | 4 +- rust/kernel/drm/device.rs | 4 +- rust/kernel/drm/driver.rs | 3 +- rust/kernel/drm/ioctl.rs | 2 +- rust/kernel/error.rs | 10 +- rust/kernel/faux.rs | 5 +- rust/kernel/firmware.rs | 16 +- rust/kernel/fmt.rs | 89 +++++++ rust/kernel/kunit.rs | 21 +- rust/kernel/lib.rs | 3 +- rust/kernel/miscdevice.rs | 5 +- rust/kernel/net/phy.rs | 12 +- rust/kernel/of.rs | 5 +- rust/kernel/pci.rs | 2 +- rust/kernel/platform.rs | 6 +- rust/kernel/prelude.rs | 5 +- rust/kernel/print.rs | 4 +- rust/kernel/seq_file.rs | 6 +- rust/kernel/str.rs | 444 ++++++++++------------------------ rust/kernel/sync.rs | 7 +- rust/kernel/sync/condvar.rs | 4 +- rust/kernel/sync/lock.rs | 4 +- rust/kernel/sync/lock/global.rs | 6 +- rust/kernel/sync/poll.rs | 1 + rust/kernel/workqueue.rs | 9 +- rust/macros/fmt.rs | 99 ++++++++ rust/macros/kunit.rs | 10 +- rust/macros/lib.rs | 19 ++ rust/macros/module.rs | 2 +- rust/macros/quote.rs | 111 ++++----- samples/rust/rust_configfs.rs | 9 +- samples/rust/rust_driver_auxiliary.rs | 7 +- samples/rust/rust_driver_faux.rs | 4 +- samples/rust/rust_driver_pci.rs | 4 +- samples/rust/rust_driver_platform.rs | 4 +- samples/rust/rust_misc_device.rs | 3 +- scripts/rustdoc_test_gen.rs | 6 +- 54 files changed, 540 insertions(+), 515 deletions(-) --- base-commit: e04c78d86a9699d136910cfc0bdcf01087e3267e change-id: 20250201-cstr-core-d4b9b69120cf Best regards, -- Tamir Duberstein <tamird(a)gmail.com>

1 week, 6 days

5
11
0 0

[PATCH v2] selftests/mm: Add process_madvise() tests

by wang lian

This patch adds tests for the process_madvise(), focusing on verifying behavior under various conditions including valid usage and error cases. Signed-off-by: wang lian<lianux.mm(a)gmail.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> Suggested-by: David Hildenbrand <david(a)redhat.com> --- Changelog v2: - Drop MADV_DONTNEED tests based on feedback - Focus solely on process_madvise() syscall - Improve error handling and structure - Add future-proof flag test - Style and comment cleanups tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/process_madv.c | 414 ++++++++++++++++++++++ tools/testing/selftests/mm/run_vmtests.sh | 5 + 4 files changed, 421 insertions(+) create mode 100644 tools/testing/selftests/mm/process_madv.c diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 911f39d634be..a8c3be02188c 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -42,6 +42,7 @@ memfd_secret hugetlb_dio pkey_sighandler_tests_32 pkey_sighandler_tests_64 +process_madv soft-dirty split_huge_page_test ksm_tests diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 2352252f3914..725612e09582 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -86,6 +86,7 @@ TEST_GEN_FILES += mseal_test TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += pagemap_ioctl TEST_GEN_FILES += pfnmap +TEST_GEN_FILES += process_madv TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += uffd-stress diff --git a/tools/testing/selftests/mm/process_madv.c b/tools/testing/selftests/mm/process_madv.c new file mode 100644 index 000000000000..73999c8e3570 --- /dev/null +++ b/tools/testing/selftests/mm/process_madv.c @@ -0,0 +1,414 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <errno.h> +#include <setjmp.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/mman.h> +#include <sys/syscall.h> +#include <unistd.h> +#include <sched.h> +#include <sys/pidfd.h> +#include "vm_util.h" + +#include "../pidfd/pidfd.h" + +/* + * Ignore the checkpatch warning, as per the C99 standard, section 7.14.1.1: + * + * "If the signal occurs other than as the result of calling the abort or raise + * function, the behavior is undefined if the signal handler refers to any + * object with static storage duration other than by assigning a value to an + * object declared as volatile sig_atomic_t" + */ +static volatile sig_atomic_t signal_jump_set; +static sigjmp_buf signal_jmp_buf; + +/* + * Ignore the checkpatch warning, we must read from x but don't want to do + * anything with it in order to trigger a read page fault. We therefore must use + * volatile to stop the compiler from optimising this away. + */ +#define FORCE_READ(x) (*(volatile typeof(x) *)x) + +static void handle_fatal(int c) +{ + if (!signal_jump_set) + return; + + siglongjmp(signal_jmp_buf, c); +} + +FIXTURE(process_madvise) +{ + int pidfd; + int flag; +}; + +static void setup_sighandler(void) +{ + struct sigaction act = { + .sa_handler = &handle_fatal, + .sa_flags = SA_NODEFER, + }; + + sigemptyset(&act.sa_mask); + if (sigaction(SIGSEGV, &act, NULL)) + ksft_exit_fail_perror("sigaction"); +} + +static void teardown_sighandler(void) +{ + struct sigaction act = { + .sa_handler = SIG_DFL, + .sa_flags = SA_NODEFER, + }; + + sigemptyset(&act.sa_mask); + sigaction(SIGSEGV, &act, NULL); +} + +FIXTURE_SETUP(process_madvise) +{ + self->pidfd = PIDFD_SELF; + self->flag = 0; + setup_sighandler(); +}; + +FIXTURE_TEARDOWN_PARENT(process_madvise) +{ + teardown_sighandler(); +} + +static ssize_t sys_process_madvise(int pidfd, const struct iovec *iovec, + size_t vlen, int advice, unsigned int flags) +{ + return syscall(__NR_process_madvise, pidfd, iovec, vlen, advice, flags); +} + +/* + * Enable our signal catcher and try to read/write the specified buffer. The + * return value indicates whether the read/write succeeds without a fatal + * signal. + */ +static bool try_access_buf(char *ptr, bool write) +{ + bool failed; + + /* Tell signal handler to jump back here on fatal signal. */ + signal_jump_set = true; + /* If a fatal signal arose, we will jump back here and failed is set. */ + failed = sigsetjmp(signal_jmp_buf, 0) != 0; + + if (!failed) { + if (write) + *ptr = 'x'; + else + FORCE_READ(ptr); + } + + signal_jump_set = false; + return !failed; +} + +/* Try and read from a buffer, return true if no fatal signal. */ +static bool try_read_buf(char *ptr) +{ + return try_access_buf(ptr, false); +} + +TEST_F(process_madvise, basic) +{ + const unsigned long pagesize = (unsigned long)sysconf(_SC_PAGESIZE); + const int madvise_pages = 4; + char *map; + ssize_t ret; + struct iovec vec[madvise_pages]; + + /* + * Create a single large mapping. We will pick pages from this + * mapping to advise on. This ensures we test non-contiguous iovecs. + */ + map = mmap(NULL, pagesize * 10, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + ASSERT_NE(map, MAP_FAILED); + + /* Fill the entire region with a known pattern. */ + memset(map, 'A', pagesize * 10); + + /* + * Setup the iovec to point to 4 non-contiguous pages + * within the mapping. + */ + vec[0].iov_base = &map[0 * pagesize]; + vec[0].iov_len = pagesize; + vec[1].iov_base = &map[3 * pagesize]; + vec[1].iov_len = pagesize; + vec[2].iov_base = &map[5 * pagesize]; + vec[2].iov_len = pagesize; + vec[3].iov_base = &map[8 * pagesize]; + vec[3].iov_len = pagesize; + + ret = sys_process_madvise(PIDFD_SELF, vec, madvise_pages, MADV_DONTNEED, + 0); + if (ret == -1 && errno == EPERM) + ksft_exit_skip( + "process_madvise() unsupported or permission denied, try running as root.\n"); + else if (errno == EINVAL) + ksft_exit_skip( + "process_madvise() unsupported or parameter invalid, please check arguments.\n"); + + /* The call should succeed and report the total bytes processed. */ + ASSERT_EQ(ret, madvise_pages * pagesize); + + /* Check that advised pages are now zero. */ + for (int i = 0; i < madvise_pages; i++) { + char *advised_page = (char *)vec[i].iov_base; + + /* Access should be successful (kernel provides a new page). */ + ASSERT_TRUE(try_read_buf(advised_page)); + /* Content must be 0, not 'A'. */ + ASSERT_EQ(*advised_page, 0); + } + + /* Check that an un-advised page in between is still 'A'. */ + char *unadvised_page = &map[1 * pagesize]; + + ASSERT_TRUE(try_read_buf(unadvised_page)); + ASSERT_EQ(*unadvised_page, 'A'); + + /* Cleanup. */ + ASSERT_EQ(munmap(map, pagesize * 10), 0); +} + +static long get_smaps_anon_huge_pages(pid_t pid, void *addr) +{ + char smaps_path[64]; + char *line = NULL; + unsigned long start, end; + long anon_huge_kb; + size_t len; + FILE *f; + bool in_vma; + + in_vma = false; + sprintf(smaps_path, "/proc/%d/smaps", pid); + f = fopen(smaps_path, "r"); + if (!f) + return -1; + + while (getline(&line, &len, f) != -1) { + /* Check if the line describes a VMA range */ + if (sscanf(line, "%lx-%lx", &start, &end) == 2) { + if ((unsigned long)addr >= start && + (unsigned long)addr < end) + in_vma = true; + else + in_vma = false; + continue; + } + + /* If we are in the correct VMA, look for the AnonHugePages field */ + if (in_vma && + sscanf(line, "AnonHugePages: %ld kB", &anon_huge_kb) == 1) + break; + } + + free(line); + fclose(f); + + return (anon_huge_kb > 0) ? (anon_huge_kb * 1024) : 0; +} + +/** + * TEST_F(process_madvise, remote_collapse) + * + * This test deterministically validates process_madvise() with MADV_COLLAPSE + * on a remote process, other advices are difficult to verify reliably. + * + * The test verifies that a memory region in a child process, initially + * backed by small pages, can be collapsed into a Transparent Huge Page by a + * request from the parent. The result is verified by parsing the child's + * /proc/<pid>/smaps file. + */ +TEST_F(process_madvise, remote_collapse) +{ + const unsigned long pagesize = (unsigned long)sysconf(_SC_PAGESIZE); + pid_t child_pid; + int pidfd; + long huge_page_size; + int pipe_info[2]; + ssize_t ret; + struct iovec vec; + + struct child_info { + pid_t pid; + void *map_addr; + } info; + + huge_page_size = default_huge_page_size(); + if (huge_page_size <= 0) + ksft_exit_skip("Could not determine a valid huge page size.\n"); + + ASSERT_EQ(pipe(pipe_info), 0); + + child_pid = fork(); + ASSERT_NE(child_pid, -1); + + if (child_pid == 0) { + char *map; + size_t map_size = 2 * huge_page_size; + + close(pipe_info[0]); + + map = mmap(NULL, map_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + ASSERT_NE(map, MAP_FAILED); + + /* Fault in as small pages */ + for (size_t i = 0; i < map_size; i += pagesize) + map[i] = 'A'; + + /* Send info and pause */ + info.pid = getpid(); + info.map_addr = map; + ret = write(pipe_info[1], &info, sizeof(info)); + ASSERT_EQ(ret, sizeof(info)); + close(pipe_info[1]); + + pause(); + exit(0); + } + + close(pipe_info[1]); + + /* Receive child info */ + ret = read(pipe_info[0], &info, sizeof(info)); + if (ret <= 0) { + waitpid(child_pid, NULL, 0); + ksft_exit_skip("Failed to read child info from pipe.\n"); + } + ASSERT_EQ(ret, sizeof(info)); + close(pipe_info[0]); + child_pid = info.pid; + + pidfd = pidfd_open(child_pid, 0); + ASSERT_GE(pidfd, 0); + + /* Baseline Check from Parent's perspective */ + ASSERT_EQ(get_smaps_anon_huge_pages(child_pid, info.map_addr), 0); + + vec.iov_base = info.map_addr; + vec.iov_len = huge_page_size; + ret = sys_process_madvise(pidfd, &vec, 1, MADV_COLLAPSE, 0); + if (ret == -1) { + if (errno == EINVAL) + ksft_exit_skip( + "PROCESS_MADV_ADVISE is not supported.\n"); + else if (errno == EPERM) + ksft_exit_skip( + "No process_madvise() permissions, try running as root.\n"); + goto cleanup; + } + ASSERT_EQ(ret, huge_page_size); + + ASSERT_EQ(get_smaps_anon_huge_pages(child_pid, info.map_addr), + huge_page_size); + + ksft_test_result_pass( + "MADV_COLLAPSE successfully verified via smaps.\n"); + +cleanup: + /* Cleanup */ + kill(child_pid, SIGKILL); + waitpid(child_pid, NULL, 0); + if (pidfd >= 0) + close(pidfd); +} + +/* + * Test process_madvise() with various invalid pidfds to ensure correct error + * handling. This includes negative fds, non-pidfd fds, and pidfds for + * processes that no longer exist. + */ +TEST_F(process_madvise, invalid_pidfd) +{ + struct iovec vec; + pid_t child_pid; + ssize_t ret; + int pidfd; + + vec.iov_base = (void *)0x1234; + vec.iov_len = 4096; + + /* Using an invalid fd number (-1) should fail with EBADF. */ + ret = sys_process_madvise(-1, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EBADF); + + /* + * Using a valid fd that is not a pidfd (e.g. stdin) should fail + * with EBADF. + */ + ret = sys_process_madvise(STDIN_FILENO, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EBADF); + + /* + * Using a pidfd for a process that has already exited should fail + * with ESRCH. + */ + child_pid = fork(); + ASSERT_NE(child_pid, -1); + + if (child_pid == 0) + exit(0); + + pidfd = pidfd_open(child_pid, 0); + ASSERT_GE(pidfd, 0); + + /* Wait for the child to ensure it has terminated. */ + waitpid(child_pid, NULL, 0); + + ret = sys_process_madvise(pidfd, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, ESRCH); + close(pidfd); +} + +/* + * Test process_madvise() with an invalid flag value. Now we only support flag=0 + * future we will use it support sync so reserve this test. + */ +TEST_F(process_madvise, flag) +{ + const unsigned long pagesize = (unsigned long)sysconf(_SC_PAGESIZE); + unsigned int invalid_flag; + struct iovec vec; + char *map; + ssize_t ret; + + map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, + 0); + ASSERT_NE(map, MAP_FAILED); + + vec.iov_base = map; + vec.iov_len = pagesize; + + invalid_flag = 0x80000000; + + ret = sys_process_madvise(PIDFD_SELF, &vec, 1, MADV_DONTNEED, + invalid_flag); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EINVAL); + + /* Cleanup. */ + ASSERT_EQ(munmap(map, pagesize), 0); +} + +TEST_HARNESS_MAIN \ No newline at end of file diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index f96d43153fc0..5c28ebcf1ea9 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -61,6 +61,8 @@ separated by spaces: ksm tests that require >=2 NUMA nodes - pkey memory protection key tests +- process_madvise + test process_madvise - soft_dirty test soft dirty page bit semantics - pagemap @@ -424,6 +426,9 @@ CATEGORY="hmm" run_test bash ./test_hmm.sh smoke # MADV_GUARD_INSTALL and MADV_GUARD_REMOVE tests CATEGORY="madv_guard" run_test ./guard-regions +# PROCESS_MADVISE TEST +CATEGORY="process_madv" run_test ./process_madv + # MADV_DONTNEED and PROCESS_DONTNEED tests CATEGORY="madv_dontneed" run_test ./madv_dontneed -- 2.43.0

1 week, 6 days

3
3
0 0

[PATCH v3 0/7] Add support for FEAT_{LS64, LS64_V} and related tests

by Yicong Yang

From: Yicong Yang <yangyicong(a)hisilicon.com> Armv8.7 introduces single-copy atomic 64-byte loads and stores instructions and its variants named under FEAT_{LS64, LS64_V}. Add support for Armv8.7 FEAT_{LS64, LS64_V}: - Add identifying and enabling in the cpufeature list - Expose the support of these features to userspace through HWCAP3 and cpuinfo - Add related hwcap test - Handle the trap of unsupported memory (normal/uncacheable) access in a VM A real scenario for this feature is that the userspace driver can make use of this to implement direct WQE (workqueue entry) - a mechanism to fill WQE directly into the hardware. Picked Marc's 2 patches form [1] for handling the LS64 trap in a VM on emulated MMIO and the introduce of KVM_EXIT_ARM_LDST64B. [1] https://lore.kernel.org/linux-arm-kernel/20240815125959.2097734-1-maz@kerne… Tested with hwcap test [*]: [*] https://lore.kernel.org/linux-arm-kernel/20250331094320.35226-5-yangyicong@… On host: root@localhost:/tmp# dmesg | grep "All CPU(s) started" [ 0.504846] CPU: All CPU(s) started at EL2 root@localhost:/tmp# ./hwcap [...] # LS64 present ok 217 cpuinfo_match_LS64 ok 218 sigill_LS64 ok 219 # SKIP sigbus_LS64 # LS64_V present ok 220 cpuinfo_match_LS64_V ok 221 sigill_LS64_V ok 222 # SKIP sigbus_LS64_V # 115 skipped test(s) detected. Consider enabling relevant config options to improve coverage. # Totals: pass:107 fail:0 xfail:0 xpass:0 skip:115 error:0 On guest: root@localhost:/# dmesg | grep "All CPU(s) started" [ 0.451482] CPU: All CPU(s) started at EL1 root@localhost:/mnt# ./hwcap [...] # LS64 present ok 217 cpuinfo_match_LS64 ok 218 sigill_LS64 ok 219 # SKIP sigbus_LS64 # LS64_V present ok 220 cpuinfo_match_LS64_V ok 221 sigill_LS64_V ok 222 # SKIP sigbus_LS64_V # 115 skipped test(s) detected. Consider enabling relevant config options to improve coverage. # Totals: pass:107 fail:0 xfail:0 xpass:0 skip:115 error:0 Change since v2: - Handle the LS64 fault to userspace and allow userspace to inject LS64 fault - Reorder the patches to make KVM handling prior to feature support Link: https://lore.kernel.org/linux-arm-kernel/20250331094320.35226-1-yangyicong@… Change since v1: - Drop the support for LS64_ACCDATA - handle the DABT of unsupported memory type after checking the memory attributes Link: https://lore.kernel.org/linux-arm-kernel/20241202135504.14252-1-yangyicong@… Marc Zyngier (2): KVM: arm64: Add exit to userspace on {LD,ST}64B* outside of memslots KVM: arm64: Add documentation for KVM_EXIT_ARM_LDST64B Yicong Yang (5): KVM: arm64: Handle DABT caused by LS64* instructions on unsupported memory KVM: arm/arm64: Allow user injection of unsupported exclusive/atomic DABT arm64: Provide basic EL2 setup for FEAT_{LS64, LS64_V} usage at EL0/1 arm64: Add support for FEAT_{LS64, LS64_V} KVM: arm64: Enable FEAT_{LS64, LS64_V} in the supported guest Documentation/arch/arm64/booting.rst | 12 ++++++ Documentation/arch/arm64/elf_hwcaps.rst | 6 +++ Documentation/virt/kvm/api.rst | 43 +++++++++++++++++---- arch/arm64/include/asm/el2_setup.h | 12 +++++- arch/arm64/include/asm/esr.h | 8 ++++ arch/arm64/include/asm/hwcap.h | 2 + arch/arm64/include/asm/kvm_emulate.h | 7 ++++ arch/arm64/include/uapi/asm/hwcap.h | 2 + arch/arm64/include/uapi/asm/kvm.h | 3 +- arch/arm64/kernel/cpufeature.c | 51 +++++++++++++++++++++++++ arch/arm64/kernel/cpuinfo.c | 2 + arch/arm64/kvm/guest.c | 4 ++ arch/arm64/kvm/inject_fault.c | 29 ++++++++++++++ arch/arm64/kvm/mmio.c | 27 ++++++++++++- arch/arm64/kvm/mmu.c | 21 +++++++++- arch/arm64/tools/cpucaps | 2 + include/uapi/linux/kvm.h | 3 +- 17 files changed, 222 insertions(+), 12 deletions(-) -- 2.24.0

1 week, 6 days

2
12
0 0

[RFC PATCH 0/7] platform/chrome: Add Kunit tests for protocol device drivers

by Tzung-Bi Shih

The protocol device drivers under drivers/platform/chrome/ are responsible to communicate to the ChromeOS EC (Embedded Controller). They need to pack the data in a pre-defined format and check if the EC responds accordingly. The series adds some fundamental unit tests for the protocol. It calls the .cmd_xfer() and .pkt_xfer() callbacks (which are the most crucial parts for the protocol), mocks the rest of the system, and checks if the interactions are all correct. The series isn't ready for landing. It's more like a PoC for the binary-level function redirection and its use cases. The 1st patch adds ftrace stub which is originally from [1][2]. There is no follow-up discussion about the ftrace stub. As a result, the patch is still on the mailing list. The 2nd patch adds Kunit tests for cros_ec_i2c. It relies on the ftrace stub for redirecting cros_ec_{un,}register(). The 3rd patch uses static stub instead (if ftrace stub isn't really an option). However, I'm not a big fan to change the production code (i.e. adding the prologue in cros_ec_{un,}register()) for testing. The 4th patch adds Kunit tests for cros_ec_spi. It relies on the ftrace stub for redirecting cros_ec_{un,}register() again. The 5th patch calls .probe() directly instead of forcing the driver probe needs to be synchronous. In comparison with the 4th patch, I don't think this is simpler. I'd prefer to the way in the 4th patch. After talked to Masami about the work, he suggested to use Kprobes for function redirection. The 6th patch adds kprobes stub. The 7th patch uses kprobes stub instead for cros_ec_spi. Questions: - Are we going to support ftrace stub so that tests can use it? - If ftrace stub isn't on the plate (e.g. due to too many dependencies), how about the kprobes stub? Is it something we could pursue? - (minor) I'm unsure if people would prefer 'kprobes stub' vs. 'kprobe stub'. [1]: https://kunit.dev/mocking.html#binary-level-ftrace-et-al [2]: https://lore.kernel.org/linux-kselftest/20220318021314.3225240-3-davidgow@g… Daniel Latypov (1): kunit: expose ftrace-based API for stubbing out functions during tests Tzung-Bi Shih (6): platform/chrome: kunit: cros_ec_i2c: Add tests with ftrace stub platform/chrome: kunit: cros_ec_i2c: Use static stub instead platform/chrome: kunit: cros_ec_spi: Add tests with ftrace stub platform/chrome: kunit: cros_ec_spi: Call .probe() directly kunit: Expose 'kprobes stub' API to redirect functions platform/chrome: kunit: cros_ec_spi: Use kprobes stub instead drivers/platform/chrome/Kconfig | 17 + drivers/platform/chrome/Makefile | 2 + drivers/platform/chrome/cros_ec.c | 6 + drivers/platform/chrome/cros_ec_i2c_test.c | 479 +++++++++++++++++++++ drivers/platform/chrome/cros_ec_spi_test.c | 355 +++++++++++++++ include/kunit/ftrace_stub.h | 84 ++++ include/kunit/kprobes_stub.h | 19 + kernel/trace/ftrace.c | 3 + lib/kunit/Kconfig | 18 + lib/kunit/Makefile | 8 + lib/kunit/ftrace_stub.c | 139 ++++++ lib/kunit/kprobes_stub.c | 113 +++++ lib/kunit/kunit-example-test.c | 27 +- lib/kunit/stubs_example.kunitconfig | 11 + 14 files changed, 1280 insertions(+), 1 deletion(-) create mode 100644 drivers/platform/chrome/cros_ec_i2c_test.c create mode 100644 drivers/platform/chrome/cros_ec_spi_test.c create mode 100644 include/kunit/ftrace_stub.h create mode 100644 include/kunit/kprobes_stub.h create mode 100644 lib/kunit/ftrace_stub.c create mode 100644 lib/kunit/kprobes_stub.c create mode 100644 lib/kunit/stubs_example.kunitconfig -- 2.49.0.1101.gccaa498523-goog

1 week, 6 days

2
11
0 0

[PATCH v5 0/7] futex: Create set_robust_list2

by André Almeida

This patch adds a new robust_list() syscall. The current syscall can't be expanded to cover the following use case, so a new one is needed. This new syscall allows users to set multiple robust lists per process and to have either 32bit or 64bit pointers in the list. * Use case FEX-Emu[1] is an application that runs x86 and x86-64 binaries on an AArch64 Linux host. One of the tasks of FEX-Emu is to translate syscalls from one platform to another. Existing set_robust_list() can't be easily translated because of two limitations: 1) x86 apps can have 32bit pointers robust lists. For a x86-64 kernel this is not a problem, because of the compat entry point. But there's no such compat entry point for AArch64, so the kernel would do the pointer arithmetic wrongly. Is also unviable to userspace to keep track every addition/removal to the robust list and keep a 64bit version of it somewhere else to feed the kernel. Thus, the new interface has an option of telling the kernel if the list is filled with 32bit or 64bit pointers. 2) Apps can set just one robust list (in theory, x86-64 can set two if they also use the compat entry point). That means that when a x86 app asks FEX-Emu to call set_robust_list(), FEX have two options: to overwrite their own robust list pointer and make the app robust, or to ignore the app robust list and keep the emulator robust. The new interface allows for multiple robust lists per application, solving this. * Interface This is the proposed interface: long set_robust_list2(void *head, int index, unsigned int flags) `head` is the head of the userspace struct robust_list_head, just as old set_robust_list(). It needs to be a void pointer since it can point to a normal robust_list_head or a compat_robust_list_head. `flags` can be used for defining the list type: enum robust_list_type { ROBUST_LIST_32BIT, ROBUST_LIST_64BIT, }; `index` is the index in the internal robust_list's linked list (the naming starts to get confusing, I reckon). If `index == -1`, that means that user wants to set a new robust_list, and the kernel will append it in the end of the list, assign a new index and return this index to the user. If `index >= 0`, that means that user wants to re-set `*head` of an already existing list (similarly to what happens when you call set_robust_list() twice with different `*head`). If `index` is out of range, or it points to a non-existing robust_list, or if the internal list is full, an error is returned. * Implementation The implementation re-uses most of the existing robust list interface as possible. The new task_struct member `struct list_head robust_list2` is just a linked list where new lists are appended as the user requests more lists, and by futex_cleanup(), the kernel walks through the internal list feeding exit_robust_list() with the robust_list's. This implementation supports up to 10 lists (defined at ROBUST_LISTS_PER_TASK), but it was an arbitrary number for this RFC. For the described use case above, 4 should be enough, I'm not sure which should be the limit. It doesn't support list removal (should it support?). It doesn't have a proper get_robust_list2() yet as well, but I can add it in a next revision. We could also have a generic robust_list() syscall that can be used to set/get and be controlled by flags. The new interface has a `unsigned int flags` argument, making it extensible for future use cases as well. It refuses unaligned `head` addresses. It doesn't have a limit for elements in a single list (like ROBUST_LIST_LIMIT), it destroys the list as it is parsed to be safe against circular lists. * Testing This patcheset has a selftest patch that expands this one: https://lore.kernel.org/lkml/20250212131123.37431-1-andrealmeid@igalia.com/ Also, FEX-Emu added support for this interface to validate it: https://github.com/FEX-Emu/FEX/pull/3966 Feedback is very welcomed! Thanks, André [1] https://github.com/FEX-Emu/FEX Changelog: - Fixed compilation issues when CONFIG_COMPAT or CONFIG_FUTEX are not set - Rebased on top of new futex work (private hash) v4: https://lore.kernel.org/lkml/20250225183531.682556-1-andrealmeid@igalia.com/ - Refuse unaligned head pointers - Ignore ROBUST_LIST_LIMIT for lists created with this interface and make it robust against circular lists - Fix a get_robust_list() syscall bug for getting the list from another thread - Adapt selftest to use the new interface v3: https://lore.kernel.org/lkml/20241217174958.477692-1-andrealmeid@igalia.com/ - Old syscall set_robust_list() adds new head to the internal linked list of robust lists pointers, instead of having a field just for them. Remove tsk->robust_list and use only tsk->robust_list2 v2: https://lore.kernel.org/lkml/20241101162147.284993-1-andrealmeid@igalia.com/ - Added a patch to properly deal with exit_robust_list() in 64bit vs 32bit - Wired-up syscall for all archs - Added more of the cover letter to the commit message v1: https://lore.kernel.org/lkml/20241024145735.162090-1-andrealmeid@igalia.com/ --- André Almeida (7): selftests/futex: Add ASSERT_ macros selftests/futex: Create test for robust list futex: Use explicit sizes for compat_exit_robust_list futex: Create set_robust_list2 futex: Remove the limit of elements for sys_set_robust_list2 lists futex: Wire up set_robust_list2 syscall selftests: futex: Expand robust list test for the new interface arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/linux/compat.h | 12 +- include/linux/futex.h | 30 +- include/linux/sched.h | 5 +- include/uapi/asm-generic/unistd.h | 2 + include/uapi/linux/futex.h | 10 + kernel/futex/core.c | 156 ++++- kernel/futex/futex.h | 5 + kernel/futex/syscalls.c | 85 ++- kernel/sys_ni.c | 1 + scripts/syscall.tbl | 1 + .../testing/selftests/futex/functional/.gitignore | 1 + tools/testing/selftests/futex/functional/Makefile | 3 +- .../selftests/futex/functional/robust_list.c | 706 +++++++++++++++++++++ tools/testing/selftests/futex/include/logging.h | 38 ++ 29 files changed, 1021 insertions(+), 49 deletions(-) --- base-commit: a24cc6ce1933eade12aa2b9859de0fcd2dac2c06 change-id: 20250225-tonyk-robust_futex-60adeedac695 Best regards, -- André Almeida <andrealmeid(a)igalia.com>

1 week, 6 days

3
17
0 0

[RESEND PATCH net-next 6/6] selftests: net: extend SCM_PIDFD test to cover stale pidfds

by Alexander Mikhalitsyn

Extend SCM_PIDFD test scenarios to also cover dead task's pidfd retrieval and reading its exit info. Cc: linux-kselftest(a)vger.kernel.org Cc: linux-kernel(a)vger.kernel.org Cc: netdev(a)vger.kernel.org Cc: Shuah Khan <shuah(a)kernel.org> Cc: "David S. Miller" <davem(a)davemloft.net> Cc: Eric Dumazet <edumazet(a)google.com> Cc: Jakub Kicinski <kuba(a)kernel.org> Cc: Paolo Abeni <pabeni(a)redhat.com> Cc: Simon Horman <horms(a)kernel.org> Cc: Christian Brauner <brauner(a)kernel.org> Cc: Kuniyuki Iwashima <kuniyu(a)google.com> Cc: Lennart Poettering <mzxreary(a)0pointer.de> Cc: Luca Boccassi <bluca(a)debian.org> Cc: David Rheinsberg <david(a)readahead.eu> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn(a)canonical.com> --- .../testing/selftests/net/af_unix/scm_pidfd.c | 217 ++++++++++++++---- 1 file changed, 173 insertions(+), 44 deletions(-) diff --git a/tools/testing/selftests/net/af_unix/scm_pidfd.c b/tools/testing/selftests/net/af_unix/scm_pidfd.c index 7e534594167e..37e034874034 100644 --- a/tools/testing/selftests/net/af_unix/scm_pidfd.c +++ b/tools/testing/selftests/net/af_unix/scm_pidfd.c @@ -15,6 +15,7 @@ #include <sys/types.h> #include <sys/wait.h> +#include "../../pidfd/pidfd.h" #include "../../kselftest_harness.h" #define clean_errno() (errno == 0 ? "None" : strerror(errno)) @@ -26,6 +27,8 @@ #define SCM_PIDFD 0x04 #endif +#define CHILD_EXIT_CODE_OK 123 + static void child_die() { exit(1); @@ -126,16 +129,65 @@ static pid_t get_pid_from_fdinfo_file(int pidfd, const char *key, size_t keylen) return result; } +struct cmsg_data { + struct ucred *ucred; + int *pidfd; +}; + +static int parse_cmsg(struct msghdr *msg, struct cmsg_data *res) +{ + struct cmsghdr *cmsg; + int data = 0; + + if (msg->msg_flags & (MSG_TRUNC | MSG_CTRUNC)) { + log_err("recvmsg: truncated"); + return 1; + } + + for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL; + cmsg = CMSG_NXTHDR(msg, cmsg)) { + if (cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SCM_PIDFD) { + if (cmsg->cmsg_len < sizeof(*res->pidfd)) { + log_err("CMSG parse: SCM_PIDFD wrong len"); + return 1; + } + + res->pidfd = (void *)CMSG_DATA(cmsg); + } + + if (cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SCM_CREDENTIALS) { + if (cmsg->cmsg_len < sizeof(*res->ucred)) { + log_err("CMSG parse: SCM_CREDENTIALS wrong len"); + return 1; + } + + res->ucred = (void *)CMSG_DATA(cmsg); + } + } + + if (!res->pidfd) { + log_err("CMSG parse: SCM_PIDFD not found"); + return 1; + } + + if (!res->ucred) { + log_err("CMSG parse: SCM_CREDENTIALS not found"); + return 1; + } + + return 0; +} + static int cmsg_check(int fd) { struct msghdr msg = { 0 }; - struct cmsghdr *cmsg; + struct cmsg_data res; struct iovec iov; - struct ucred *ucred = NULL; int data = 0; char control[CMSG_SPACE(sizeof(struct ucred)) + CMSG_SPACE(sizeof(int))] = { 0 }; - int *pidfd = NULL; pid_t parent_pid; int err; @@ -158,53 +210,99 @@ static int cmsg_check(int fd) return 1; } - for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL; - cmsg = CMSG_NXTHDR(&msg, cmsg)) { - if (cmsg->cmsg_level == SOL_SOCKET && - cmsg->cmsg_type == SCM_PIDFD) { - if (cmsg->cmsg_len < sizeof(*pidfd)) { - log_err("CMSG parse: SCM_PIDFD wrong len"); - return 1; - } + /* send(pfd, "x", sizeof(char), 0) */ + if (data != 'x') { + log_err("recvmsg: data corruption"); + return 1; + } - pidfd = (void *)CMSG_DATA(cmsg); - } + if (parse_cmsg(&msg, &res)) { + log_err("CMSG parse: parse_cmsg() failed"); + return 1; + } - if (cmsg->cmsg_level == SOL_SOCKET && - cmsg->cmsg_type == SCM_CREDENTIALS) { - if (cmsg->cmsg_len < sizeof(*ucred)) { - log_err("CMSG parse: SCM_CREDENTIALS wrong len"); - return 1; - } + /* pidfd from SCM_PIDFD should point to the parent process PID */ + parent_pid = + get_pid_from_fdinfo_file(*res.pidfd, "Pid:", sizeof("Pid:") - 1); + if (parent_pid != getppid()) { + log_err("wrong SCM_PIDFD %d != %d", parent_pid, getppid()); + close(*res.pidfd); + return 1; + } - ucred = (void *)CMSG_DATA(cmsg); - } + close(*res.pidfd); + return 0; +} + +static int cmsg_check_dead(int fd, int expected_pid) +{ + int err; + struct msghdr msg = { 0 }; + struct cmsg_data res; + struct iovec iov; + int data = 0; + char control[CMSG_SPACE(sizeof(struct ucred)) + + CMSG_SPACE(sizeof(int))] = { 0 }; + pid_t client_pid; + struct pidfd_info info = { + .mask = PIDFD_INFO_EXIT, + }; + + iov.iov_base = &data; + iov.iov_len = sizeof(data); + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = control; + msg.msg_controllen = sizeof(control); + + err = recvmsg(fd, &msg, 0); + if (err < 0) { + log_err("recvmsg"); + return 1; } - /* send(pfd, "x", sizeof(char), 0) */ - if (data != 'x') { + if (msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC)) { + log_err("recvmsg: truncated"); + return 1; + } + + /* send(cfd, "y", sizeof(char), 0) */ + if (data != 'y') { log_err("recvmsg: data corruption"); return 1; } - if (!pidfd) { - log_err("CMSG parse: SCM_PIDFD not found"); + if (parse_cmsg(&msg, &res)) { + log_err("CMSG parse: parse_cmsg() failed"); return 1; } - if (!ucred) { - log_err("CMSG parse: SCM_CREDENTIALS not found"); + /* + * pidfd from SCM_PIDFD should point to the client_pid. + * Let's read exit information and check if it's what + * we expect to see. + */ + if (ioctl(*res.pidfd, PIDFD_GET_INFO, &info)) { + log_err("%s: ioctl(PIDFD_GET_INFO) failed", __func__); + close(*res.pidfd); return 1; } - /* pidfd from SCM_PIDFD should point to the parent process PID */ - parent_pid = - get_pid_from_fdinfo_file(*pidfd, "Pid:", sizeof("Pid:") - 1); - if (parent_pid != getppid()) { - log_err("wrong SCM_PIDFD %d != %d", parent_pid, getppid()); + if (!(info.mask & PIDFD_INFO_EXIT)) { + log_err("%s: No exit information from ioctl(PIDFD_GET_INFO)", __func__); + close(*res.pidfd); return 1; } + err = WIFEXITED(info.exit_code) ? WEXITSTATUS(info.exit_code) : 1; + if (err != CHILD_EXIT_CODE_OK) { + log_err("%s: wrong exit_code %d != %d", __func__, err, CHILD_EXIT_CODE_OK); + close(*res.pidfd); + return 1; + } + + close(*res.pidfd); return 0; } @@ -291,6 +389,24 @@ static void fill_sockaddr(struct sock_addr *addr, bool abstract) memcpy(sun_path_buf, addr->sock_name, strlen(addr->sock_name)); } +static int sk_enable_cred_pass(int sk) +{ + int on = 0; + + on = 1; + if (setsockopt(sk, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on))) { + log_err("Failed to set SO_PASSCRED"); + return 1; + } + + if (setsockopt(sk, SOL_SOCKET, SO_PASSPIDFD, &on, sizeof(on))) { + log_err("Failed to set SO_PASSPIDFD"); + return 1; + } + + return 0; +} + static void client(FIXTURE_DATA(scm_pidfd) *self, const FIXTURE_VARIANT(scm_pidfd) *variant) { @@ -299,7 +415,6 @@ static void client(FIXTURE_DATA(scm_pidfd) *self, struct ucred peer_cred; int peer_pidfd; pid_t peer_pid; - int on = 0; cfd = socket(AF_UNIX, variant->type, 0); if (cfd < 0) { @@ -322,14 +437,8 @@ static void client(FIXTURE_DATA(scm_pidfd) *self, child_die(); } - on = 1; - if (setsockopt(cfd, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on))) { - log_err("Failed to set SO_PASSCRED"); - child_die(); - } - - if (setsockopt(cfd, SOL_SOCKET, SO_PASSPIDFD, &on, sizeof(on))) { - log_err("Failed to set SO_PASSPIDFD"); + if (sk_enable_cred_pass(cfd)) { + log_err("sk_enable_cred_pass() failed"); child_die(); } @@ -340,6 +449,12 @@ static void client(FIXTURE_DATA(scm_pidfd) *self, child_die(); } + /* send something to the parent so it can receive SCM_PIDFD too and validate it */ + if (send(cfd, "y", sizeof(char), 0) == -1) { + log_err("Failed to send(cfd, \"y\", sizeof(char), 0)"); + child_die(); + } + /* skip further for SOCK_DGRAM as it's not applicable */ if (variant->type == SOCK_DGRAM) return; @@ -398,7 +513,13 @@ TEST_F(scm_pidfd, test) close(self->server); close(self->startup_pipe[0]); client(self, variant); - exit(0); + + /* + * It's a bit unusual, but in case of success we return non-zero + * exit code (CHILD_EXIT_CODE_OK) and then we expect to read it + * from ioctl(PIDFD_GET_INFO) in cmsg_check_dead(). + */ + exit(CHILD_EXIT_CODE_OK); } close(self->startup_pipe[1]); @@ -421,9 +542,17 @@ TEST_F(scm_pidfd, test) ASSERT_NE(-1, err); } - close(pfd); waitpid(self->client_pid, &child_status, 0); - ASSERT_EQ(0, WIFEXITED(child_status) ? WEXITSTATUS(child_status) : 1); + /* see comment before exit(CHILD_EXIT_CODE_OK) */ + ASSERT_EQ(CHILD_EXIT_CODE_OK, WIFEXITED(child_status) ? WEXITSTATUS(child_status) : 1); + + err = sk_enable_cred_pass(pfd); + ASSERT_EQ(0, err); + + err = cmsg_check_dead(pfd, self->client_pid); + ASSERT_EQ(0, err); + + close(pfd); } TEST_HARNESS_MAIN -- 2.43.0

1 week, 6 days

3
2
0 0

[kvm-unit-tests PATCH v7] riscv: sbi: Add SBI Debug Triggers Extension tests

by Jesse Taube

Add tests for the DBTR SBI extension. Signed-off-by: Jesse Taube <jesse(a)rivosinc.com> Reviewed-by: Charlie Jenkins <charlie(a)rivosinc.com> Tested-by: Charlie Jenkins <charlie(a)rivosinc.com> --- V1 -> V2: - Call report_prefix_pop before returning - Disable compressed instructions in exec_call, update related comment - Remove extra "| 1" in dbtr_test_load - Remove extra newlines - Remove extra tabs in check_exec - Remove typedefs from enums - Return when dbtr_install_trigger fails - s/avalible/available/g - s/unistall/uninstall/g V2 -> V3: - Change SBI_DBTR_SHMEM_INVALID_ADDR to -1UL - Move all dbtr functions to sbi-dbtr.c - Move INSN_LEN to processor.h - Update include list - Use C-style comments V3 -> V4: - Include libcflat.h - Remove #define SBI_DBTR_SHMEM_INVALID_ADDR V4 -> V5: - Sort includes - Add kfail for update triggers V5 -> V6: - Add assert in gen_tdata1 - Add prefix to dbtr_test_type - Add TRIG_STATE_DMODE - Add TRIG_STATE_RESERVED - Align function paramaters with opening parenthesis - Change OpenSBI < v1.7 to < v1.5 - Constantly use spaces in prefix rather than _ - Export split_phys_addr - Fix MCONTROL_U and MCONTROL_M mix up - Fix swapped VU and VS - Move /* to own line - Print type in dbtr_test_type - Remove _BIT suffix from macros - Remove duplicate MODE_S - Remove spaces before include - Rename tdata1,2 to trigger and control in dbtr_install_trigger - Report skip in dbtr_test_multiple - Report variables in info not pass or fail - s/save/store/g - sbi_debug_set_shmem use split_phys_addr - Use if (!report(... in dbtr_test_disable_enable V6 -> V7: - Alphabetize Makefile - Only print read info on failure - Remove return after assert - Remove unnecessary OpenSBI version check - Rename error to exit_test in check_dbtr - Use prefix in dbtr_test_num_triggers and dbtr_test_type --- lib/riscv/asm/sbi.h | 1 + riscv/Makefile | 1 + riscv/sbi-dbtr.c | 834 ++++++++++++++++++++++++++++++++++++++++++++ riscv/sbi-tests.h | 2 + riscv/sbi.c | 3 +- 5 files changed, 840 insertions(+), 1 deletion(-) create mode 100644 riscv/sbi-dbtr.c diff --git a/lib/riscv/asm/sbi.h b/lib/riscv/asm/sbi.h index a5738a5c..78fd6e2a 100644 --- a/lib/riscv/asm/sbi.h +++ b/lib/riscv/asm/sbi.h @@ -51,6 +51,7 @@ enum sbi_ext_id { SBI_EXT_SUSP = 0x53555350, SBI_EXT_FWFT = 0x46574654, SBI_EXT_SSE = 0x535345, + SBI_EXT_DBTR = 0x44425452, }; enum sbi_ext_base_fid { diff --git a/riscv/Makefile b/riscv/Makefile index 11e68eae..9309ac12 100644 --- a/riscv/Makefile +++ b/riscv/Makefile @@ -18,6 +18,7 @@ tests += $(TEST_DIR)/sieve.$(exe) all: $(tests) $(TEST_DIR)/sbi-deps += $(TEST_DIR)/sbi-asm.o +$(TEST_DIR)/sbi-deps += $(TEST_DIR)/sbi-dbtr.o $(TEST_DIR)/sbi-deps += $(TEST_DIR)/sbi-fwft.o $(TEST_DIR)/sbi-deps += $(TEST_DIR)/sbi-sse.o diff --git a/riscv/sbi-dbtr.c b/riscv/sbi-dbtr.c new file mode 100644 index 00000000..71ffcd32 --- /dev/null +++ b/riscv/sbi-dbtr.c @@ -0,0 +1,834 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * SBI DBTR testsuite + * + * Copyright (C) 2025, Rivos Inc., Jesse Taube <jesse(a)rivosinc.com> + */ + +#include <libcflat.h> +#include <bitops.h> + +#include <asm/io.h> +#include <asm/processor.h> + +#include "sbi-tests.h" + +#define RV_MAX_TRIGGERS 32 + +#define SBI_DBTR_TRIG_STATE_MAPPED BIT(0) +#define SBI_DBTR_TRIG_STATE_U BIT(1) +#define SBI_DBTR_TRIG_STATE_S BIT(2) +#define SBI_DBTR_TRIG_STATE_VU BIT(3) +#define SBI_DBTR_TRIG_STATE_VS BIT(4) +#define SBI_DBTR_TRIG_STATE_HAVE_HW_TRIG BIT(5) +#define SBI_DBTR_TRIG_STATE_RESERVED GENMASK(7, 6) + +#define SBI_DBTR_TRIG_STATE_HW_TRIG_IDX_SHIFT 8 +#define SBI_DBTR_TRIG_STATE_HW_TRIG_IDX(trig_state) (trig_state >> SBI_DBTR_TRIG_STATE_HW_TRIG_IDX_SHIFT) + +#define SBI_DBTR_TDATA1_TYPE_SHIFT (__riscv_xlen - 4) +#define SBI_DBTR_TDATA1_DMODE BIT_UL(__riscv_xlen - 5) + +#define SBI_DBTR_TDATA1_MCONTROL6_LOAD BIT(0) +#define SBI_DBTR_TDATA1_MCONTROL6_STORE BIT(1) +#define SBI_DBTR_TDATA1_MCONTROL6_EXECUTE BIT(2) +#define SBI_DBTR_TDATA1_MCONTROL6_U BIT(3) +#define SBI_DBTR_TDATA1_MCONTROL6_S BIT(4) +#define SBI_DBTR_TDATA1_MCONTROL6_M BIT(6) +#define SBI_DBTR_TDATA1_MCONTROL6_SELECT BIT(21) +#define SBI_DBTR_TDATA1_MCONTROL6_VU BIT(23) +#define SBI_DBTR_TDATA1_MCONTROL6_VS BIT(24) + +#define SBI_DBTR_TDATA1_MCONTROL_LOAD BIT(0) +#define SBI_DBTR_TDATA1_MCONTROL_STORE BIT(1) +#define SBI_DBTR_TDATA1_MCONTROL_EXECUTE BIT(2) +#define SBI_DBTR_TDATA1_MCONTROL_U BIT(3) +#define SBI_DBTR_TDATA1_MCONTROL_S BIT(4) +#define SBI_DBTR_TDATA1_MCONTROL_M BIT(6) +#define SBI_DBTR_TDATA1_MCONTROL_SELECT BIT(19) + +enum McontrolType { + SBI_DBTR_TDATA1_TYPE_NONE = (0UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_LEGACY = (1UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_MCONTROL = (2UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_ICOUNT = (3UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_ITRIGGER = (4UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_ETRIGGER = (5UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_MCONTROL6 = (6UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_TMEXTTRIGGER = (7UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_RESERVED0 = (8UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_RESERVED1 = (9UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_RESERVED2 = (10UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_RESERVED3 = (11UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_CUSTOM0 = (12UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_CUSTOM1 = (13UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_CUSTOM2 = (14UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_DISABLED = (15UL << SBI_DBTR_TDATA1_TYPE_SHIFT), +}; + +enum Tdata1Value { + VALUE_NONE = 0, + VALUE_LOAD = BIT(0), + VALUE_STORE = BIT(1), + VALUE_EXECUTE = BIT(2), +}; + +enum Tdata1Mode { + MODE_NONE = 0, + MODE_M = BIT(0), + MODE_U = BIT(1), + MODE_S = BIT(2), + MODE_VU = BIT(3), + MODE_VS = BIT(4), +}; + +enum sbi_ext_dbtr_fid { + SBI_EXT_DBTR_NUM_TRIGGERS = 0, + SBI_EXT_DBTR_SETUP_SHMEM, + SBI_EXT_DBTR_TRIGGER_READ, + SBI_EXT_DBTR_TRIGGER_INSTALL, + SBI_EXT_DBTR_TRIGGER_UPDATE, + SBI_EXT_DBTR_TRIGGER_UNINSTALL, + SBI_EXT_DBTR_TRIGGER_ENABLE, + SBI_EXT_DBTR_TRIGGER_DISABLE, +}; + +struct sbi_dbtr_data_msg { + unsigned long tstate; + unsigned long tdata1; + unsigned long tdata2; + unsigned long tdata3; +}; + +struct sbi_dbtr_id_msg { + unsigned long idx; +}; + +/* SBI shared mem messages layout */ +struct sbi_dbtr_shmem_entry { + union { + struct sbi_dbtr_data_msg data; + struct sbi_dbtr_id_msg id; + }; +}; + +static bool dbtr_handled; + +/* Expected to be leaf function as not to disrupt frame-pointer */ +static __attribute__((naked)) void exec_call(void) +{ + /* skip over nop when triggered instead of ret. */ + asm volatile (".option push\n" + ".option arch, -c\n" + "nop\n" + "ret\n" + ".option pop\n"); +} + +static void dbtr_exception_handler(struct pt_regs *regs) +{ + dbtr_handled = true; + + /* Reading *epc may cause a fault, skip over nop */ + if ((void *)regs->epc == exec_call) { + regs->epc += 4; + return; + } + + /* WARNING: Skips over the trapped intruction */ + regs->epc += RV_INSN_LEN(readw((void *)regs->epc)); +} + +static bool do_store(void *tdata2) +{ + bool ret; + + writel(0, tdata2); + + ret = dbtr_handled; + dbtr_handled = false; + + return ret; +} + +static bool do_load(void *tdata2) +{ + bool ret; + + readl(tdata2); + + ret = dbtr_handled; + dbtr_handled = false; + + return ret; +} + +static bool do_exec(void) +{ + bool ret; + + exec_call(); + + ret = dbtr_handled; + dbtr_handled = false; + + return ret; +} + +static unsigned long gen_tdata1_mcontrol(enum Tdata1Mode mode, enum Tdata1Value value) +{ + unsigned long tdata1 = SBI_DBTR_TDATA1_TYPE_MCONTROL; + + if (value & VALUE_LOAD) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_LOAD; + + if (value & VALUE_STORE) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_STORE; + + if (value & VALUE_EXECUTE) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_EXECUTE; + + if (mode & MODE_M) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_M; + + if (mode & MODE_U) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_U; + + if (mode & MODE_S) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_S; + + return tdata1; +} + +static unsigned long gen_tdata1_mcontrol6(enum Tdata1Mode mode, enum Tdata1Value value) +{ + unsigned long tdata1 = SBI_DBTR_TDATA1_TYPE_MCONTROL6; + + if (value & VALUE_LOAD) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_LOAD; + + if (value & VALUE_STORE) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_STORE; + + if (value & VALUE_EXECUTE) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_EXECUTE; + + if (mode & MODE_M) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_M; + + if (mode & MODE_U) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_U; + + if (mode & MODE_S) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_S; + + if (mode & MODE_VU) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_VU; + + if (mode & MODE_VS) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_VS; + + return tdata1; +} + +static unsigned long gen_tdata1(enum McontrolType type, enum Tdata1Value value, enum Tdata1Mode mode) +{ + switch (type) { + case SBI_DBTR_TDATA1_TYPE_MCONTROL: + return gen_tdata1_mcontrol(mode, value); + case SBI_DBTR_TDATA1_TYPE_MCONTROL6: + return gen_tdata1_mcontrol6(mode, value); + default: + assert_msg(false, "Invalid mcontrol type: %lu", type); + } +} + +static struct sbiret sbi_debug_num_triggers(unsigned long trig_tdata1) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_NUM_TRIGGERS, trig_tdata1, 0, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_set_shmem_raw(unsigned long shmem_phys_lo, + unsigned long shmem_phys_hi, + unsigned long flags) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_SETUP_SHMEM, shmem_phys_lo, + shmem_phys_hi, flags, 0, 0, 0); +} + +static struct sbiret sbi_debug_set_shmem(void *shmem) +{ + unsigned long base_addr_lo, base_addr_hi; + + split_phys_addr(virt_to_phys(shmem), &base_addr_hi, &base_addr_lo); + return sbi_debug_set_shmem_raw(base_addr_lo, base_addr_hi, 0); +} + +static struct sbiret sbi_debug_read_triggers(unsigned long trig_idx_base, + unsigned long trig_count) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_READ, trig_idx_base, + trig_count, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_install_triggers(unsigned long trig_count) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_INSTALL, trig_count, 0, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_update_triggers(unsigned long trig_count) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_UPDATE, trig_count, 0, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_uninstall_triggers(unsigned long trig_idx_base, + unsigned long trig_idx_mask) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_UNINSTALL, trig_idx_base, + trig_idx_mask, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_enable_triggers(unsigned long trig_idx_base, + unsigned long trig_idx_mask) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_ENABLE, trig_idx_base, + trig_idx_mask, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_disable_triggers(unsigned long trig_idx_base, + unsigned long trig_idx_mask) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_DISABLE, trig_idx_base, + trig_idx_mask, 0, 0, 0, 0); +} + +static bool dbtr_install_trigger(struct sbi_dbtr_shmem_entry *shmem, void *trigger, + unsigned long control) +{ + struct sbiret sbi_ret; + bool ret; + + shmem->data.tdata1 = control; + shmem->data.tdata2 = (unsigned long)trigger; + + sbi_ret = sbi_debug_install_triggers(1); + ret = sbiret_report_error(&sbi_ret, SBI_SUCCESS, "sbi_debug_install_triggers"); + if (ret) + install_exception_handler(EXC_BREAKPOINT, dbtr_exception_handler); + + return ret; +} + +static bool dbtr_uninstall_trigger(void) +{ + struct sbiret ret; + + install_exception_handler(EXC_BREAKPOINT, NULL); + + ret = sbi_debug_uninstall_triggers(0, 1); + return sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); +} + +static unsigned long dbtr_test_num_triggers(void) +{ + struct sbiret ret; + unsigned long tdata1 = 0; + /* sbi_debug_num_triggers will return trig_max in sbiret.value when trig_tdata1 == 0 */ + + report_prefix_push("available triggers"); + + /* should be at least one trigger. */ + ret = sbi_debug_num_triggers(tdata1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_num_triggers"); + + if (ret.value == 0) { + report_fail("Returned 0 triggers available"); + } else { + report_pass("Returned triggers available"); + report_info("Returned %lu triggers available", ret.value); + } + + report_prefix_pop(); + return ret.value; +} + +static enum McontrolType dbtr_test_type(unsigned long *num_trig) +{ + struct sbiret ret; + unsigned long tdata1 = SBI_DBTR_TDATA1_TYPE_MCONTROL6; + + report_prefix_push("test type"); + report_prefix_push("sbi_debug_num_triggers"); + + ret = sbi_debug_num_triggers(tdata1); + sbiret_report_error(&ret, SBI_SUCCESS, "mcontrol6"); + *num_trig = ret.value; + if (ret.value > 0) { + report_pass("Returned mcontrol6 triggers available"); + report_info("Returned %lu mcontrol6 triggers available", + ret.value); + report_prefix_popn(2); + return tdata1; + } + + tdata1 = SBI_DBTR_TDATA1_TYPE_MCONTROL; + + ret = sbi_debug_num_triggers(tdata1); + sbiret_report_error(&ret, SBI_SUCCESS, "mcontrol"); + *num_trig = ret.value; + if (ret.value > 0) { + report_pass("Returned mcontrol triggers available"); + report_info("Returned %lu mcontrol triggers available", + ret.value); + report_prefix_popn(2); + return tdata1; + } + + report_fail("Returned 0 mcontrol(6) triggers available"); + report_prefix_popn(2); + + return SBI_DBTR_TDATA1_TYPE_NONE; +} + +static struct sbiret dbtr_test_store_install_uninstall(struct sbi_dbtr_shmem_entry *shmem, + enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + + report_prefix_push("store trigger"); + + shmem->data.tdata1 = gen_tdata1(type, VALUE_STORE, MODE_S); + shmem->data.tdata2 = (unsigned long)&test; + + ret = sbi_debug_install_triggers(1); + if (!sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_install_triggers")) { + report_prefix_pop(); + return ret; + } + + install_exception_handler(EXC_BREAKPOINT, dbtr_exception_handler); + + report(do_store(&test), "triggered"); + + if (do_load(&test)) + report_fail("triggered by load"); + + ret = sbi_debug_uninstall_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); + + if (do_store(&test)) + report_fail("triggered after uninstall"); + + install_exception_handler(EXC_BREAKPOINT, NULL); + report_prefix_pop(); + + return ret; +} + +static void dbtr_test_update(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + bool kfail; + + report_prefix_push("update trigger"); + + if (!dbtr_install_trigger(shmem, NULL, gen_tdata1(type, VALUE_NONE, MODE_NONE))) { + report_prefix_pop(); + return; + } + + shmem->id.idx = 0; + shmem->data.tdata1 = gen_tdata1(type, VALUE_STORE, MODE_S); + shmem->data.tdata2 = (unsigned long)&test; + + ret = sbi_debug_update_triggers(1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_update_triggers"); + + /* + * Known broken update_triggers. + * https://lore.kernel.org/opensbi/aDdp1UeUh7GugeHp@ghost/T/#t + */ + kfail = __sbi_get_imp_id() == SBI_IMPL_OPENSBI && + __sbi_get_imp_version() < sbi_impl_opensbi_mk_version(1, 7); + report_kfail(kfail, do_store(&test), "triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_load(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + + report_prefix_push("load trigger"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_LOAD, MODE_S))) { + report_prefix_pop(); + return; + } + + report(do_load(&test), "triggered"); + + if (do_store(&test)) + report_fail("triggered by store"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_disable_enable(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + + report_prefix_push("disable trigger"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + ret = sbi_debug_disable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_disable_triggers"); + + if (!report(!do_store(&test), "should not trigger")) { + dbtr_uninstall_trigger(); + report_prefix_pop(); + report_skip("enable trigger: no disable"); + + return; + } + + report_prefix_pop(); + report_prefix_push("enable trigger"); + + ret = sbi_debug_enable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_enable_triggers"); + + report(do_store(&test), "triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_exec(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + + report_prefix_push("exec trigger"); + /* check if loads and stores trigger exec */ + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_EXECUTE, MODE_S))) { + report_prefix_pop(); + return; + } + + if (do_load(&test)) + report_fail("triggered by load"); + + if (do_store(&test)) + report_fail("triggered by store"); + + dbtr_uninstall_trigger(); + + /* Check if exec works */ + if (!dbtr_install_trigger(shmem, exec_call, gen_tdata1(type, VALUE_EXECUTE, MODE_S))) { + report_prefix_pop(); + return; + } + report(do_exec(), "triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_read(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + const unsigned long tstatus_expected = SBI_DBTR_TRIG_STATE_S | SBI_DBTR_TRIG_STATE_MAPPED; + const unsigned long tdata1 = gen_tdata1(type, VALUE_STORE, MODE_S); + static unsigned long test; + struct sbiret ret; + + report_prefix_push("read trigger"); + if (!dbtr_install_trigger(shmem, &test, tdata1)) { + report_prefix_pop(); + return; + } + + ret = sbi_debug_read_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_read_triggers"); + + if (!report(shmem->data.tdata1 == tdata1, "tdata1 expected: 0x%016lx", tdata1)) + report_info("tdata1 found: 0x%016lx", shmem->data.tdata1); + if (!report(shmem->data.tdata2 == ((unsigned long)&test), "tdata2 expected: 0x%016lx", + (unsigned long)&test)) + report_info("tdata2 found: 0x%016lx", shmem->data.tdata2); + if (!report(shmem->data.tstate == tstatus_expected, "tstate expected: 0x%016lx", tstatus_expected)) + report_info("tstate found: 0x%016lx", shmem->data.tstate); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void check_exec(unsigned long base) +{ + struct sbiret ret; + + report(do_exec(), "exec triggered"); + + ret = sbi_debug_uninstall_triggers(base, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); +} + +static void dbtr_test_multiple(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type, + unsigned long num_trigs) +{ + static unsigned long test[2]; + struct sbiret ret; + bool have_three = num_trigs > 2; + + if (num_trigs < 2) { + report_skip("test multiple"); + return; + } + + report_prefix_push("test multiple"); + + if (!dbtr_install_trigger(shmem, &test[0], gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + if (!dbtr_install_trigger(shmem, &test[1], gen_tdata1(type, VALUE_LOAD, MODE_S))) + goto error; + if (have_three && + !dbtr_install_trigger(shmem, exec_call, gen_tdata1(type, VALUE_EXECUTE, MODE_S))) { + ret = sbi_debug_uninstall_triggers(1, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); + goto error; + } + + report(do_store(&test[0]), "store triggered"); + + if (do_load(&test[0])) + report_fail("store triggered by load"); + + report(do_load(&test[1]), "load triggered"); + + if (do_store(&test[1])) + report_fail("load triggered by store"); + + if (have_three) + check_exec(2); + + ret = sbi_debug_uninstall_triggers(1, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); + + if (do_load(&test[1])) + report_fail("load triggered after uninstall"); + + report(do_store(&test[0]), "store triggered"); + + if (!have_three && + dbtr_install_trigger(shmem, exec_call, gen_tdata1(type, VALUE_EXECUTE, MODE_S))) + check_exec(1); + +error: + ret = sbi_debug_uninstall_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); + + install_exception_handler(EXC_BREAKPOINT, NULL); + report_prefix_pop(); +} + +static void dbtr_test_multiple_types(struct sbi_dbtr_shmem_entry *shmem, unsigned long type) +{ + static unsigned long test; + + report_prefix_push("test multiple types"); + + /* check if loads and stores trigger exec */ + if (!dbtr_install_trigger(shmem, &test, + gen_tdata1(type, VALUE_EXECUTE | VALUE_LOAD | VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + report(do_load(&test), "load triggered"); + + report(do_store(&test), "store triggered"); + + dbtr_uninstall_trigger(); + + /* Check if exec works */ + if (!dbtr_install_trigger(shmem, exec_call, + gen_tdata1(type, VALUE_EXECUTE | VALUE_LOAD | VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + report(do_exec(), "exec triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_disable_uninstall(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + + report_prefix_push("disable uninstall"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + ret = sbi_debug_disable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_disable_triggers"); + + dbtr_uninstall_trigger(); + + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + report(do_store(&test), "triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_uninstall_enable(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + + report_prefix_push("uninstall enable"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + dbtr_uninstall_trigger(); + + ret = sbi_debug_enable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_enable_triggers"); + + install_exception_handler(EXC_BREAKPOINT, dbtr_exception_handler); + + report(!do_store(&test), "should not trigger"); + + install_exception_handler(EXC_BREAKPOINT, NULL); + report_prefix_pop(); +} + +static void dbtr_test_uninstall_update(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + bool kfail; + + report_prefix_push("uninstall update"); + if (!dbtr_install_trigger(shmem, NULL, gen_tdata1(type, VALUE_NONE, MODE_NONE))) { + report_prefix_pop(); + return; + } + + dbtr_uninstall_trigger(); + + shmem->id.idx = 0; + shmem->data.tdata1 = gen_tdata1(type, VALUE_STORE, MODE_S); + shmem->data.tdata2 = (unsigned long)&test; + + /* + * Known broken update_triggers. + * https://lore.kernel.org/opensbi/aDdp1UeUh7GugeHp@ghost/T/#t + */ + kfail = __sbi_get_imp_id() == SBI_IMPL_OPENSBI && + __sbi_get_imp_version() < sbi_impl_opensbi_mk_version(1, 7); + ret = sbi_debug_update_triggers(1); + sbiret_kfail_error(kfail, &ret, SBI_ERR_FAILURE, "sbi_debug_update_triggers"); + + install_exception_handler(EXC_BREAKPOINT, dbtr_exception_handler); + + report(!do_store(&test), "should not trigger"); + + install_exception_handler(EXC_BREAKPOINT, NULL); + report_prefix_pop(); +} + +static void dbtr_test_disable_read(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + const unsigned long tstatus_expected = SBI_DBTR_TRIG_STATE_S | SBI_DBTR_TRIG_STATE_MAPPED; + const unsigned long tdata1 = gen_tdata1(type, VALUE_STORE, MODE_NONE); + static unsigned long test; + struct sbiret ret; + + report_prefix_push("disable read"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + ret = sbi_debug_disable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_disable_triggers"); + + ret = sbi_debug_read_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_read_triggers"); + + if (!report(shmem->data.tdata1 == tdata1, "tdata1 expected: 0x%016lx", tdata1)) + report_info("tdata1 found: 0x%016lx", shmem->data.tdata1); + if (!report(shmem->data.tdata2 == ((unsigned long)&test), "tdata2 expected: 0x%016lx", + (unsigned long)&test)) + report_info("tdata2 found: 0x%016lx", shmem->data.tdata2); + if (!report(shmem->data.tstate == tstatus_expected, "tstate expected: 0x%016lx", tstatus_expected)) + report_info("tstate found: 0x%016lx", shmem->data.tstate); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +void check_dbtr(void) +{ + static struct sbi_dbtr_shmem_entry shmem[RV_MAX_TRIGGERS] = {}; + unsigned long num_trigs; + enum McontrolType trig_type; + struct sbiret ret; + + report_prefix_push("dbtr"); + + if (!sbi_probe(SBI_EXT_DBTR)) { + report_skip("extension not available"); + goto exit_test; + } + + num_trigs = dbtr_test_num_triggers(); + if (!num_trigs) + goto exit_test; + + trig_type = dbtr_test_type(&num_trigs); + if (trig_type == SBI_DBTR_TDATA1_TYPE_NONE) + goto exit_test; + + ret = sbi_debug_set_shmem(shmem); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_set_shmem"); + + ret = dbtr_test_store_install_uninstall(&shmem[0], trig_type); + /* install or uninstall failed */ + if (ret.error != SBI_SUCCESS) + goto exit_test; + + dbtr_test_load(&shmem[0], trig_type); + dbtr_test_exec(&shmem[0], trig_type); + dbtr_test_read(&shmem[0], trig_type); + dbtr_test_disable_enable(&shmem[0], trig_type); + dbtr_test_update(&shmem[0], trig_type); + dbtr_test_multiple_types(&shmem[0], trig_type); + dbtr_test_multiple(shmem, trig_type, num_trigs); + dbtr_test_disable_uninstall(&shmem[0], trig_type); + dbtr_test_uninstall_enable(&shmem[0], trig_type); + dbtr_test_uninstall_update(&shmem[0], trig_type); + dbtr_test_disable_read(&shmem[0], trig_type); + +exit_test: + report_prefix_pop(); +} diff --git a/riscv/sbi-tests.h b/riscv/sbi-tests.h index d5c4ae70..c1ebf016 100644 --- a/riscv/sbi-tests.h +++ b/riscv/sbi-tests.h @@ -97,8 +97,10 @@ static inline bool env_enabled(const char *env) return s && (*s == '1' || *s == 'y' || *s == 'Y'); } +void split_phys_addr(phys_addr_t paddr, unsigned long *hi, unsigned long *lo); void sbi_bad_fid(int ext); void check_sse(void); +void check_dbtr(void); #endif /* __ASSEMBLER__ */ #endif /* _RISCV_SBI_TESTS_H_ */ diff --git a/riscv/sbi.c b/riscv/sbi.c index edb1a6be..3b8aadce 100644 --- a/riscv/sbi.c +++ b/riscv/sbi.c @@ -105,7 +105,7 @@ static int rand_online_cpu(prng_state *ps) return cpu; } -static void split_phys_addr(phys_addr_t paddr, unsigned long *hi, unsigned long *lo) +void split_phys_addr(phys_addr_t paddr, unsigned long *hi, unsigned long *lo) { *lo = (unsigned long)paddr; *hi = 0; @@ -1561,6 +1561,7 @@ int main(int argc, char **argv) check_susp(); check_sse(); check_fwft(); + check_dbtr(); return report_summary(); } -- 2.43.0

1 week, 6 days

2
1
0 0

[PATCH net-next v2 6/6] selftests: net: extend SCM_PIDFD test to cover stale pidfds

by Alexander Mikhalitsyn

Extend SCM_PIDFD test scenarios to also cover dead task's pidfd retrieval and reading its exit info. Cc: linux-kselftest(a)vger.kernel.org Cc: linux-kernel(a)vger.kernel.org Cc: netdev(a)vger.kernel.org Cc: Shuah Khan <shuah(a)kernel.org> Cc: "David S. Miller" <davem(a)davemloft.net> Cc: Eric Dumazet <edumazet(a)google.com> Cc: Jakub Kicinski <kuba(a)kernel.org> Cc: Paolo Abeni <pabeni(a)redhat.com> Cc: Simon Horman <horms(a)kernel.org> Cc: Christian Brauner <brauner(a)kernel.org> Cc: Kuniyuki Iwashima <kuniyu(a)google.com> Cc: Lennart Poettering <mzxreary(a)0pointer.de> Cc: Luca Boccassi <bluca(a)debian.org> Cc: David Rheinsberg <david(a)readahead.eu> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn(a)canonical.com> Reviewed-by: Christian Brauner <brauner(a)kernel.org> --- .../testing/selftests/net/af_unix/scm_pidfd.c | 217 ++++++++++++++---- 1 file changed, 173 insertions(+), 44 deletions(-) diff --git a/tools/testing/selftests/net/af_unix/scm_pidfd.c b/tools/testing/selftests/net/af_unix/scm_pidfd.c index 7e534594167e..37e034874034 100644 --- a/tools/testing/selftests/net/af_unix/scm_pidfd.c +++ b/tools/testing/selftests/net/af_unix/scm_pidfd.c @@ -15,6 +15,7 @@ #include <sys/types.h> #include <sys/wait.h> +#include "../../pidfd/pidfd.h" #include "../../kselftest_harness.h" #define clean_errno() (errno == 0 ? "None" : strerror(errno)) @@ -26,6 +27,8 @@ #define SCM_PIDFD 0x04 #endif +#define CHILD_EXIT_CODE_OK 123 + static void child_die() { exit(1); @@ -126,16 +129,65 @@ static pid_t get_pid_from_fdinfo_file(int pidfd, const char *key, size_t keylen) return result; } +struct cmsg_data { + struct ucred *ucred; + int *pidfd; +}; + +static int parse_cmsg(struct msghdr *msg, struct cmsg_data *res) +{ + struct cmsghdr *cmsg; + int data = 0; + + if (msg->msg_flags & (MSG_TRUNC | MSG_CTRUNC)) { + log_err("recvmsg: truncated"); + return 1; + } + + for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL; + cmsg = CMSG_NXTHDR(msg, cmsg)) { + if (cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SCM_PIDFD) { + if (cmsg->cmsg_len < sizeof(*res->pidfd)) { + log_err("CMSG parse: SCM_PIDFD wrong len"); + return 1; + } + + res->pidfd = (void *)CMSG_DATA(cmsg); + } + + if (cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SCM_CREDENTIALS) { + if (cmsg->cmsg_len < sizeof(*res->ucred)) { + log_err("CMSG parse: SCM_CREDENTIALS wrong len"); + return 1; + } + + res->ucred = (void *)CMSG_DATA(cmsg); + } + } + + if (!res->pidfd) { + log_err("CMSG parse: SCM_PIDFD not found"); + return 1; + } + + if (!res->ucred) { + log_err("CMSG parse: SCM_CREDENTIALS not found"); + return 1; + } + + return 0; +} + static int cmsg_check(int fd) { struct msghdr msg = { 0 }; - struct cmsghdr *cmsg; + struct cmsg_data res; struct iovec iov; - struct ucred *ucred = NULL; int data = 0; char control[CMSG_SPACE(sizeof(struct ucred)) + CMSG_SPACE(sizeof(int))] = { 0 }; - int *pidfd = NULL; pid_t parent_pid; int err; @@ -158,53 +210,99 @@ static int cmsg_check(int fd) return 1; } - for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL; - cmsg = CMSG_NXTHDR(&msg, cmsg)) { - if (cmsg->cmsg_level == SOL_SOCKET && - cmsg->cmsg_type == SCM_PIDFD) { - if (cmsg->cmsg_len < sizeof(*pidfd)) { - log_err("CMSG parse: SCM_PIDFD wrong len"); - return 1; - } + /* send(pfd, "x", sizeof(char), 0) */ + if (data != 'x') { + log_err("recvmsg: data corruption"); + return 1; + } - pidfd = (void *)CMSG_DATA(cmsg); - } + if (parse_cmsg(&msg, &res)) { + log_err("CMSG parse: parse_cmsg() failed"); + return 1; + } - if (cmsg->cmsg_level == SOL_SOCKET && - cmsg->cmsg_type == SCM_CREDENTIALS) { - if (cmsg->cmsg_len < sizeof(*ucred)) { - log_err("CMSG parse: SCM_CREDENTIALS wrong len"); - return 1; - } + /* pidfd from SCM_PIDFD should point to the parent process PID */ + parent_pid = + get_pid_from_fdinfo_file(*res.pidfd, "Pid:", sizeof("Pid:") - 1); + if (parent_pid != getppid()) { + log_err("wrong SCM_PIDFD %d != %d", parent_pid, getppid()); + close(*res.pidfd); + return 1; + } - ucred = (void *)CMSG_DATA(cmsg); - } + close(*res.pidfd); + return 0; +} + +static int cmsg_check_dead(int fd, int expected_pid) +{ + int err; + struct msghdr msg = { 0 }; + struct cmsg_data res; + struct iovec iov; + int data = 0; + char control[CMSG_SPACE(sizeof(struct ucred)) + + CMSG_SPACE(sizeof(int))] = { 0 }; + pid_t client_pid; + struct pidfd_info info = { + .mask = PIDFD_INFO_EXIT, + }; + + iov.iov_base = &data; + iov.iov_len = sizeof(data); + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = control; + msg.msg_controllen = sizeof(control); + + err = recvmsg(fd, &msg, 0); + if (err < 0) { + log_err("recvmsg"); + return 1; } - /* send(pfd, "x", sizeof(char), 0) */ - if (data != 'x') { + if (msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC)) { + log_err("recvmsg: truncated"); + return 1; + } + + /* send(cfd, "y", sizeof(char), 0) */ + if (data != 'y') { log_err("recvmsg: data corruption"); return 1; } - if (!pidfd) { - log_err("CMSG parse: SCM_PIDFD not found"); + if (parse_cmsg(&msg, &res)) { + log_err("CMSG parse: parse_cmsg() failed"); return 1; } - if (!ucred) { - log_err("CMSG parse: SCM_CREDENTIALS not found"); + /* + * pidfd from SCM_PIDFD should point to the client_pid. + * Let's read exit information and check if it's what + * we expect to see. + */ + if (ioctl(*res.pidfd, PIDFD_GET_INFO, &info)) { + log_err("%s: ioctl(PIDFD_GET_INFO) failed", __func__); + close(*res.pidfd); return 1; } - /* pidfd from SCM_PIDFD should point to the parent process PID */ - parent_pid = - get_pid_from_fdinfo_file(*pidfd, "Pid:", sizeof("Pid:") - 1); - if (parent_pid != getppid()) { - log_err("wrong SCM_PIDFD %d != %d", parent_pid, getppid()); + if (!(info.mask & PIDFD_INFO_EXIT)) { + log_err("%s: No exit information from ioctl(PIDFD_GET_INFO)", __func__); + close(*res.pidfd); return 1; } + err = WIFEXITED(info.exit_code) ? WEXITSTATUS(info.exit_code) : 1; + if (err != CHILD_EXIT_CODE_OK) { + log_err("%s: wrong exit_code %d != %d", __func__, err, CHILD_EXIT_CODE_OK); + close(*res.pidfd); + return 1; + } + + close(*res.pidfd); return 0; } @@ -291,6 +389,24 @@ static void fill_sockaddr(struct sock_addr *addr, bool abstract) memcpy(sun_path_buf, addr->sock_name, strlen(addr->sock_name)); } +static int sk_enable_cred_pass(int sk) +{ + int on = 0; + + on = 1; + if (setsockopt(sk, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on))) { + log_err("Failed to set SO_PASSCRED"); + return 1; + } + + if (setsockopt(sk, SOL_SOCKET, SO_PASSPIDFD, &on, sizeof(on))) { + log_err("Failed to set SO_PASSPIDFD"); + return 1; + } + + return 0; +} + static void client(FIXTURE_DATA(scm_pidfd) *self, const FIXTURE_VARIANT(scm_pidfd) *variant) { @@ -299,7 +415,6 @@ static void client(FIXTURE_DATA(scm_pidfd) *self, struct ucred peer_cred; int peer_pidfd; pid_t peer_pid; - int on = 0; cfd = socket(AF_UNIX, variant->type, 0); if (cfd < 0) { @@ -322,14 +437,8 @@ static void client(FIXTURE_DATA(scm_pidfd) *self, child_die(); } - on = 1; - if (setsockopt(cfd, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on))) { - log_err("Failed to set SO_PASSCRED"); - child_die(); - } - - if (setsockopt(cfd, SOL_SOCKET, SO_PASSPIDFD, &on, sizeof(on))) { - log_err("Failed to set SO_PASSPIDFD"); + if (sk_enable_cred_pass(cfd)) { + log_err("sk_enable_cred_pass() failed"); child_die(); } @@ -340,6 +449,12 @@ static void client(FIXTURE_DATA(scm_pidfd) *self, child_die(); } + /* send something to the parent so it can receive SCM_PIDFD too and validate it */ + if (send(cfd, "y", sizeof(char), 0) == -1) { + log_err("Failed to send(cfd, \"y\", sizeof(char), 0)"); + child_die(); + } + /* skip further for SOCK_DGRAM as it's not applicable */ if (variant->type == SOCK_DGRAM) return; @@ -398,7 +513,13 @@ TEST_F(scm_pidfd, test) close(self->server); close(self->startup_pipe[0]); client(self, variant); - exit(0); + + /* + * It's a bit unusual, but in case of success we return non-zero + * exit code (CHILD_EXIT_CODE_OK) and then we expect to read it + * from ioctl(PIDFD_GET_INFO) in cmsg_check_dead(). + */ + exit(CHILD_EXIT_CODE_OK); } close(self->startup_pipe[1]); @@ -421,9 +542,17 @@ TEST_F(scm_pidfd, test) ASSERT_NE(-1, err); } - close(pfd); waitpid(self->client_pid, &child_status, 0); - ASSERT_EQ(0, WIFEXITED(child_status) ? WEXITSTATUS(child_status) : 1); + /* see comment before exit(CHILD_EXIT_CODE_OK) */ + ASSERT_EQ(CHILD_EXIT_CODE_OK, WIFEXITED(child_status) ? WEXITSTATUS(child_status) : 1); + + err = sk_enable_cred_pass(pfd); + ASSERT_EQ(0, err); + + err = cmsg_check_dead(pfd, self->client_pid); + ASSERT_EQ(0, err); + + close(pfd); } TEST_HARNESS_MAIN -- 2.43.0

1 week, 6 days

1
0
0 0

[PATCH resend] rtc: Rename lib_test to rtc_lib_test

by Geert Uytterhoeven

When compiling the RTC library functions test as a module, the module has the non-descriptive name "lib_test.ko". Fix this by adding the subsystem's name as a prefix. Signed-off-by: Geert Uytterhoeven <geert(a)linux-m68k.org> --- drivers/rtc/Makefile | 2 +- drivers/rtc/{lib_test.c => rtc_lib_test.c} | 0 2 files changed, 1 insertion(+), 1 deletion(-) rename drivers/rtc/{lib_test.c => rtc_lib_test.c} (100%) diff --git a/drivers/rtc/Makefile b/drivers/rtc/Makefile index 4619aa2ac4697591..aa08d1c8a32ec23d 100644 --- a/drivers/rtc/Makefile +++ b/drivers/rtc/Makefile @@ -15,7 +15,7 @@ rtc-core-$(CONFIG_RTC_INTF_DEV) += dev.o rtc-core-$(CONFIG_RTC_INTF_PROC) += proc.o rtc-core-$(CONFIG_RTC_INTF_SYSFS) += sysfs.o -obj-$(CONFIG_RTC_LIB_KUNIT_TEST) += lib_test.o +obj-$(CONFIG_RTC_LIB_KUNIT_TEST) += rtc_lib_test.o # Keep the list ordered. diff --git a/drivers/rtc/lib_test.c b/drivers/rtc/rtc_lib_test.c similarity index 100% rename from drivers/rtc/lib_test.c rename to drivers/rtc/rtc_lib_test.c -- 2.43.0

1 week, 6 days

2
2
0 0

[PATCH v2] rtc: Rename lib_test to test_rtc_lib

by Geert Uytterhoeven

When compiling the RTC library functions test as a module, the module has the non-descriptive name "lib_test.ko". Fix this by renaming it to "test_rtc_lib.ko". Signed-off-by: Geert Uytterhoeven <geert(a)linux-m68k.org> --- v2: - s/rtc_lib_test/test_rtc_lib/. --- drivers/rtc/Makefile | 2 +- drivers/rtc/{lib_test.c => test_rtc_lib.c} | 0 2 files changed, 1 insertion(+), 1 deletion(-) rename drivers/rtc/{lib_test.c => test_rtc_lib.c} (100%) diff --git a/drivers/rtc/Makefile b/drivers/rtc/Makefile index 4619aa2ac4697591..789bddfea99d8fcd 100644 --- a/drivers/rtc/Makefile +++ b/drivers/rtc/Makefile @@ -15,7 +15,7 @@ rtc-core-$(CONFIG_RTC_INTF_DEV) += dev.o rtc-core-$(CONFIG_RTC_INTF_PROC) += proc.o rtc-core-$(CONFIG_RTC_INTF_SYSFS) += sysfs.o -obj-$(CONFIG_RTC_LIB_KUNIT_TEST) += lib_test.o +obj-$(CONFIG_RTC_LIB_KUNIT_TEST) += test_rtc_lib.o # Keep the list ordered. diff --git a/drivers/rtc/lib_test.c b/drivers/rtc/test_rtc_lib.c similarity index 100% rename from drivers/rtc/lib_test.c rename to drivers/rtc/test_rtc_lib.c -- 2.43.0

1 week, 6 days

1
0
0 0

[PATCH net-next] selftests: devmem: configure HDS threshold

by Taehee Yoo

The devmem TCP requires the hds-thresh value to be 0, but it doesn't change it automatically. Therefore, configure_hds_thresh() is added to handle this. The run_devmem_tests() now tests hds_thresh, but it skips test if the hds_thresh_max value is 0. Signed-off-by: Taehee Yoo <ap420073(a)gmail.com> --- .../selftests/drivers/net/hw/ncdevmem.c | 86 +++++++++++++++++++ 1 file changed, 86 insertions(+) diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c index cc9b40d9c5d5..d78b5e5697d7 100644 --- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c +++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c @@ -349,6 +349,72 @@ static int configure_headersplit(bool on) return ret; } +static int configure_hds_thresh(int len) +{ + struct ethtool_rings_get_req *get_req; + struct ethtool_rings_get_rsp *get_rsp; + struct ethtool_rings_set_req *req; + struct ynl_error yerr; + struct ynl_sock *ys; + int ret; + + ys = ynl_sock_create(&ynl_ethtool_family, &yerr); + if (!ys) { + fprintf(stderr, "YNL: %s\n", yerr.msg); + return -1; + } + + req = ethtool_rings_set_req_alloc(); + ethtool_rings_set_req_set_header_dev_index(req, ifindex); + ethtool_rings_set_req_set_hds_thresh(req, len); + ret = ethtool_rings_set(ys, req); + if (ret < 0) + fprintf(stderr, "YNL failed: %s\n", ys->err.msg); + ethtool_rings_set_req_free(req); + + if (ret == 0) { + get_req = ethtool_rings_get_req_alloc(); + ethtool_rings_get_req_set_header_dev_index(get_req, ifindex); + get_rsp = ethtool_rings_get(ys, get_req); + ethtool_rings_get_req_free(get_req); + if (get_rsp) + fprintf(stderr, "HDS threshold: %d\n", + get_rsp->hds_thresh); + ethtool_rings_get_rsp_free(get_rsp); + } + + ynl_sock_destroy(ys); + + return ret; +} + +static int get_hds_thresh_max(void) +{ + struct ethtool_rings_get_req *get_req; + struct ethtool_rings_get_rsp *get_rsp; + struct ynl_error yerr; + unsigned int ret = 0; + struct ynl_sock *ys; + + ys = ynl_sock_create(&ynl_ethtool_family, &yerr); + if (!ys) { + fprintf(stderr, "YNL: %s\n", yerr.msg); + return -1; + } + + get_req = ethtool_rings_get_req_alloc(); + ethtool_rings_get_req_set_header_dev_index(get_req, ifindex); + get_rsp = ethtool_rings_get(ys, get_req); + ethtool_rings_get_req_free(get_req); + if (get_rsp) + ret = get_rsp->hds_thresh_max; + ethtool_rings_get_rsp_free(get_rsp); + + ynl_sock_destroy(ys); + + return ret; +} + static int configure_rss(void) { return run_command("sudo ethtool -X %s equal %d >&2", ifname, start_queue); @@ -565,6 +631,9 @@ static int do_server(struct memory_buffer *mem) if (configure_headersplit(1)) error(1, 0, "Failed to enable TCP header split\n"); + if (configure_hds_thresh(0)) + error(1, 0, "Failed to set HDS threshold\n"); + /* Configure RSS to divert all traffic from our devmem queues */ if (configure_rss()) error(1, 0, "Failed to configure rss\n"); @@ -725,10 +794,14 @@ static int do_server(struct memory_buffer *mem) void run_devmem_tests(void) { struct memory_buffer *mem; + int hds_thresh_max = 0; struct ynl_sock *ys; mem = provider->alloc(getpagesize() * NUM_PAGES); + hds_thresh_max = get_hds_thresh_max(); + if (hds_thresh_max < 0) + error(1, 0, "Failed to get hds_thresh_max"); /* Configure RSS to divert all traffic from our devmem queues */ if (configure_rss()) error(1, 0, "rss error\n"); @@ -750,6 +823,19 @@ void run_devmem_tests(void) if (configure_headersplit(1)) error(1, 0, "Failed to configure header split\n"); + /* Do not test hds_thresh if hds_thresh_max value is 0 */ + if (hds_thresh_max) { + if (configure_hds_thresh(1)) + error(1, 0, "Failed to configure HDS threshold\n"); + + if (!bind_rx_queue(ifindex, mem->fd, create_queues(), + num_queues, &ys)) + error(1, 0, "Configure dmabuf with non-zero HDS threshold should have failed\n"); + + if (configure_hds_thresh(0)) + error(1, 0, "Failed to configure HDS threshold\n"); + } + if (bind_rx_queue(ifindex, mem->fd, create_queues(), num_queues, &ys)) error(1, 0, "Failed to bind\n"); -- 2.34.1

1 week, 6 days

3
4
0 0

[PATCH] selftests: cachestat: add tests for mmap

by Suresh K C

From: Suresh K C <suresh.k.chandrappa(a)gmail.com> Add a test case to verify cachestat behavior with memory-mapped files using mmap(). This ensures that pages accessed via mmap are correctly accounted for in the page cache. Tested on x86_64 with default kernel config Signed-off-by: Suresh K C <suresh.k.chandrappa(a)gmail.com> --- .../selftests/cachestat/test_cachestat.c | 39 ++++++++++++++++--- 1 file changed, 33 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/cachestat/test_cachestat.c b/tools/testing/selftests/cachestat/test_cachestat.c index 632ab44737ec..b6452978dae0 100644 --- a/tools/testing/selftests/cachestat/test_cachestat.c +++ b/tools/testing/selftests/cachestat/test_cachestat.c @@ -33,6 +33,11 @@ void print_cachestat(struct cachestat *cs) cs->nr_evicted, cs->nr_recently_evicted); } +enum file_type { + FILE_MMAP, + FILE_SHMEM +}; + bool write_exactly(int fd, size_t filesize) { int random_fd = open("/dev/urandom", O_RDONLY); @@ -202,7 +207,7 @@ static int test_cachestat(const char *filename, bool write_random, bool create, return ret; } -bool test_cachestat_shmem(void) +bool run_cachestat_test(enum file_type type) { size_t PS = sysconf(_SC_PAGESIZE); size_t filesize = PS * 512 * 2; /* 2 2MB huge pages */ @@ -212,27 +217,43 @@ bool test_cachestat_shmem(void) char *filename = "tmpshmcstat"; struct cachestat cs; bool ret = true; + int fd; unsigned long num_pages = compute_len / PS; - int fd = shm_open(filename, O_CREAT | O_RDWR, 0600); + if (type == FILE_SHMEM) + fd = shm_open(filename, O_CREAT | O_RDWR, 0600); + else + fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666); if (fd < 0) { - ksft_print_msg("Unable to create shmem file.\n"); + ksft_print_msg("Unable to create file.\n"); ret = false; goto out; } if (ftruncate(fd, filesize)) { - ksft_print_msg("Unable to truncate shmem file.\n"); + ksft_print_msg("Unable to truncate file.\n"); ret = false; goto close_fd; } if (!write_exactly(fd, filesize)) { - ksft_print_msg("Unable to write to shmem file.\n"); + ksft_print_msg("Unable to write to file.\n"); ret = false; goto close_fd; } + if (type == FILE_MMAP){ + char *map = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); + if (map == MAP_FAILED) { + ksft_print_msg("mmap failed.\n"); + ret = false; + goto close_fd; + } + for (int i = 0; i < filesize; i++) { + map[i] = 'A'; + } + map[filesize - 1] = 'X'; + } syscall_ret = syscall(__NR_cachestat, fd, &cs_range, &cs, 0); if (syscall_ret) { @@ -308,12 +329,18 @@ int main(void) break; } - if (test_cachestat_shmem()) + if (run_cachestat_test(FILE_SHMEM)) ksft_test_result_pass("cachestat works with a shmem file\n"); else { ksft_test_result_fail("cachestat fails with a shmem file\n"); ret = 1; } + if (run_cachestat_test(FILE_MMAP)) + ksft_test_result_pass("cachestat works with a mmap file\n"); + else { + ksft_test_result_fail("cachestat fails with a mmap file\n"); + ret = 1; + } return ret; } -- 2.43.0

1 week, 6 days

3
2
0 0

[PATCH v9 net-next 00/15] AccECN protocol patch series

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com> Hello, Please find the v9 AccECN protocol patch series, which covers the core functionality of Accurate ECN, AccECN negotiation, AccECN TCP options, and AccECN failure handling. The Accurate ECN draft can be found in https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28 This patch series is part of the full AccECN patch series, which is available at https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/ Best Regards, Chia-Yu --- v9 (21-Jun-2025) - Use tcp_data_ecn_check() to set TCP_ECN_SEE flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>) - Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accruate ECN (Paolo Abeni <pabeni(a)redhat.com>) - Restruct the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>) - Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>) - Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>) - Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>) v8 (10-Jun-2025) - Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>) - Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>) - Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>) - Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>) - Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>) - Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>) v7 (14-May-2025) - Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>) - Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>) - Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>) - Update commit message for #9 to explain the increase in tcp_sock_write_rx group size - Modify group size of tcp_sock_write_tx in #10 based on pahole results v6 (09-May-2025) - Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>) - Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>) - Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>) - Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>) - Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>) - Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>) - Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>) - Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>) - Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>) - Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>) - Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>) - Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>) - Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15 v5 (22-Apr-2025) - Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>) v4 (18-Apr-2025) - Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>) v3 (14-Apr-2025) - Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>) v2 (18-Mar-2025) - Add one missing patch from the previous AccECN protocol preparation patch series to this patch series. --- Chia-Yu Chang (3): tcp: reorganize tcp_sock_write_txrx group for variables later tcp: accecn: AccECN option failure handling tcp: accecn: try to fit AccECN option with SACK Ilpo Järvinen (12): tcp: reorganize SYN ECN code tcp: fast path functions later tcp: AccECN core tcp: accecn: AccECN negotiation tcp: accecn: add AccECN rx byte counters tcp: accecn: AccECN needs to know delivered bytes tcp: sack option handling improvements tcp: accecn: AccECN option tcp: accecn: AccECN option send control tcp: accecn: AccECN option ceb/cep heuristic tcp: accecn: AccECN ACE field multi-wrap heuristic tcp: try to avoid safer when ACKs are thinned .../networking/net_cachelines/tcp_sock.rst | 14 + include/linux/tcp.h | 34 +- include/net/netns/ipv4.h | 2 + include/net/tcp.h | 225 ++++++- include/uapi/linux/tcp.h | 7 + net/ipv4/syncookies.c | 3 + net/ipv4/sysctl_net_ipv4.c | 19 + net/ipv4/tcp.c | 30 +- net/ipv4/tcp_input.c | 622 +++++++++++++++++- net/ipv4/tcp_ipv4.c | 7 +- net/ipv4/tcp_minisocks.c | 91 ++- net/ipv4/tcp_output.c | 325 ++++++++- net/ipv6/syncookies.c | 1 + net/ipv6/tcp_ipv6.c | 1 + 14 files changed, 1285 insertions(+), 96 deletions(-) -- 2.34.1

2 weeks

3
28
0 0

[PATCH net-next v3 0/3] selftest: net: Add selftest for netpoll

by Breno Leitao

I am submitting a new selftest for the netpoll subsystem specifically targeting the case where the RX is polling in the TX path, which is a case that we don't have any test in the tree today. This is done when netpoll_poll_dev() called, and this test creates a scenario when that is probably. The test does the following: 1) Configuring a single RX/TX queue to increase contention on the interface. 2) Generating background traffic to saturate the network, mimicking real-world congestion. 3) Sending netconsole messages to trigger netpoll polling and monitor its behavior. 4) Using dynamic netconsole targets via configfs, with the ability to delete and recreate targets during the test. 5) Running bpftrace in parallel to verify that netpoll_poll_dev() is called when expected. If it is called, then the test passes, otherwise the test is marked as skipped. In order to achieve it, I stole Jakub's bpftrace helper from [1], and did some small changes that I found useful to use the helper. So, this patchset basically contains: 1) The code stolen from Jakub 2) Improvements on bpftrace() helper 3) The selftest itself Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1] --- Changes in v3: - Make pylint happy (Simon) - Remove the unnecessary patch in bpftrace to raise an exception when it fails. (Jakub) - Improved the bpftrace code (Willem) - Stop sending messages if bpftrace is not alive anymore. - Link to v2: https://lore.kernel.org/r/20250625-netpoll_test-v2-0-47d27775222c@debian.org Changes in v2: - Stole Jakub's helper to run bpftrace - Removed the DEBUG option and moved logs to logging - Change the code to have a higher chance of calling netpoll_poll_dev(). In my current configuration, it is hitting multiple times during the test. - Save and restore TX/RX queue size (Jakub) - Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org --- Breno Leitao (2): selftests: drv-net: Strip '@' prefix from bpftrace map keys selftests: net: add netpoll basic functionality test Jakub Kicinski (1): selftests: drv-net: add helper/wrapper for bpftrace tools/testing/selftests/drivers/net/Makefile | 1 + .../selftests/drivers/net/lib/py/__init__.py | 3 +- .../testing/selftests/drivers/net/netpoll_basic.py | 345 +++++++++++++++++++++ tools/testing/selftests/net/lib/py/utils.py | 35 +++ 4 files changed, 383 insertions(+), 1 deletion(-) --- base-commit: 8efa26fcbf8a7f783fd1ce7dd2a409e9b7758df0 change-id: 20250612-netpoll_test-a1324d2057c8 Best regards, -- Breno Leitao <leitao(a)debian.org>

2 weeks

3
8
0 0

[PATCH net-next 2/2] selftests: seg6: fix instaces typo in comments

by Andrea Mayer

Fix a typo: instaces -> instances The typo has been identified using codespell, and the tool does not report any additional issues in the selftests considered. Signed-off-by: Andrea Mayer <andrea.mayer(a)uniroma2.it> --- tools/testing/selftests/net/srv6_end_next_csid_l3vpn_test.sh | 2 +- tools/testing/selftests/net/srv6_end_x_next_csid_l3vpn_test.sh | 2 +- tools/testing/selftests/net/srv6_hencap_red_l3vpn_test.sh | 2 +- tools/testing/selftests/net/srv6_hl2encap_red_l2vpn_test.sh | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/net/srv6_end_next_csid_l3vpn_test.sh b/tools/testing/selftests/net/srv6_end_next_csid_l3vpn_test.sh index ba730655a7bf..4bc135e5c22c 100755 --- a/tools/testing/selftests/net/srv6_end_next_csid_l3vpn_test.sh +++ b/tools/testing/selftests/net/srv6_end_next_csid_l3vpn_test.sh @@ -594,7 +594,7 @@ setup_rt_local_sids() dev "${DUMMY_DEVNAME}" # all SIDs for VPNs start with a common locator. Routes and SRv6 - # Endpoint behavior instaces are grouped together in the 'localsid' + # Endpoint behavior instances are grouped together in the 'localsid' # table. ip -netns "${nsname}" -6 rule \ add to "${VPN_LOCATOR_SERVICE}::/16" \ diff --git a/tools/testing/selftests/net/srv6_end_x_next_csid_l3vpn_test.sh b/tools/testing/selftests/net/srv6_end_x_next_csid_l3vpn_test.sh index bedf0ce885c2..34b781a2ae74 100755 --- a/tools/testing/selftests/net/srv6_end_x_next_csid_l3vpn_test.sh +++ b/tools/testing/selftests/net/srv6_end_x_next_csid_l3vpn_test.sh @@ -681,7 +681,7 @@ setup_rt_local_sids() set_underlay_sids_reachability "${rt}" "${rt_neighs}" # all SIDs for VPNs start with a common locator. Routes and SRv6 - # Endpoint behavior instaces are grouped together in the 'localsid' + # Endpoint behavior instances are grouped together in the 'localsid' # table. ip -netns "${nsname}" -6 rule \ add to "${VPN_LOCATOR_SERVICE}::/16" \ diff --git a/tools/testing/selftests/net/srv6_hencap_red_l3vpn_test.sh b/tools/testing/selftests/net/srv6_hencap_red_l3vpn_test.sh index 3efce1718c5f..6a68c7eff1dc 100755 --- a/tools/testing/selftests/net/srv6_hencap_red_l3vpn_test.sh +++ b/tools/testing/selftests/net/srv6_hencap_red_l3vpn_test.sh @@ -395,7 +395,7 @@ setup_rt_local_sids() dev "${VRF_DEVNAME}" # all SIDs for VPNs start with a common locator. Routes and SRv6 - # Endpoint behavior instaces are grouped together in the 'localsid' + # Endpoint behavior instances are grouped together in the 'localsid' # table. ip -netns "${nsname}" -6 rule \ add to "${VPN_LOCATOR_SERVICE}::/16" \ diff --git a/tools/testing/selftests/net/srv6_hl2encap_red_l2vpn_test.sh b/tools/testing/selftests/net/srv6_hl2encap_red_l2vpn_test.sh index cabc70538ffe..0979b5316fdf 100755 --- a/tools/testing/selftests/net/srv6_hl2encap_red_l2vpn_test.sh +++ b/tools/testing/selftests/net/srv6_hl2encap_red_l2vpn_test.sh @@ -343,7 +343,7 @@ setup_rt_local_sids() encap seg6local action End dev "${DUMMY_DEVNAME}" # all SIDs for VPNs start with a common locator. Routes and SRv6 - # Endpoint behaviors instaces are grouped together in the 'localsid' + # Endpoint behaviors instances are grouped together in the 'localsid' # table. ip -netns "${nsname}" -6 rule add \ to "${VPN_LOCATOR_SERVICE}::/16" \ -- 2.20.1

2 weeks

2
1
0 0

Re: [PATCH] selftests/mm: add test for (BATCH_PROCESS)MADV_DONTNEED

by wang lian

I deeply appreciate for your criticism. This is my first patch and I will improve it based on what we have discussed so far.

2 weeks

3
5
0 0

[PATCH 0/3] arm64: Support FEAT_LSFE (Large System Float Extension)

by Mark Brown

FEAT_LSFE is optional from v9.5, it adds new instructions for atomic memory operations with floating point values. We have no immediate use for it in kernel, provide a hwcap so userspace can discover it and allow the ID register field to be exposed to KVM guests. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Mark Brown (3): arm64/hwcap: Add hwcap for FEAT_LSFE KVM: arm64: Expose FEAT_LSFE to guests kselftest/arm64: Add lsfe to the hwcaps test Documentation/arch/arm64/elf_hwcaps.rst | 4 ++++ arch/arm64/include/asm/hwcap.h | 1 + arch/arm64/include/uapi/asm/hwcap.h | 1 + arch/arm64/kernel/cpufeature.c | 2 ++ arch/arm64/kernel/cpuinfo.c | 1 + arch/arm64/kvm/sys_regs.c | 4 +++- tools/testing/selftests/arm64/abi/hwcap.c | 21 +++++++++++++++++++++ 7 files changed, 33 insertions(+), 1 deletion(-) --- base-commit: 86731a2a651e58953fc949573895f2fa6d456841 change-id: 20250625-arm64-lsfe-0810cf98adc2 Best regards, -- Mark Brown <broonie(a)kernel.org>

2 weeks

2
6
0 0

[PATCH 00/33] vfio: Introduce selftests for VFIO

by David Matlack

This series introduces VFIO selftests, located in tools/testing/selftests/vfio/. VFIO selftests aim to enable kernel developers to write and run tests that take the form of userspace programs that interact with VFIO and IOMMUFD uAPIs. VFIO selftests can be used to write functional tests for new features, regression tests for bugs, and performance tests for optimizations. These tests are designed to interact with real PCI devices, i.e. they do not rely on mocking out or faking any behavior in the kernel. This allows the tests to exercise not only VFIO but also IOMMUFD, the IOMMU driver, interrupt remapping, IRQ handling, etc. For more background on the motivation and design of this series, please see the RFC: https://lore.kernel.org/kvm/20250523233018.1702151-1-dmatlack@google.com/ This series can also be found on GitHub: https://github.com/dmatlack/linux/tree/vfio/selftests/v1 Changelog ----------------------------------------------------------------------- RFC: https://lore.kernel.org/kvm/20250523233018.1702151-1-dmatlack@google.com/ - Add symlink to linux/pci_ids.h instead of copying (Jason) - Add symlinks to drivers/dma/*/*.h instead of copying (Jason) - Automatically replicate vfio_dma_mapping_test across backing sources using fixture variants (Jason) - Automatically replicate vfio_dma_mapping_test and vfio_pci_driver_test across all iommu_modes using fixture variants (Jason) - Invert access() check in vfio_dma_mapping_test (me) - Use driver_override instead of add/remove_id (Alex) - Allow tests to get BDF from env var (Alex) - Use KSFT_FAIL instead of 1 to exit with failure (Alex) - Unconditionally create $(LIBVFIO_O_DIRS) to avoid target conflict with ../cgroup/lib/libcgroup.mk when building KVM selftests (me) - Allow VFIO selftests to run automatically by switching from TEST_GEN_PROGS_EXTENDED to TEST_GEN_PROGS. Automatically run selftests will use $VFIO_SELFTESTS_BDF environment variable to know which device to use (Alex) - Replace hardcoded SZ_4K with getpagesize() in vfio_dma_mapping_test to support platforms with other page sizes (me) - Make all global variables static where possible (me) - Pass argc and argv to test_harness_main() so that users can pass flags to the kselftest harness (me) Instructions ----------------------------------------------------------------------- Running VFIO selftests requires at a PCI device bound to vfio-pci for the tests to use. The address of this device is passed to the test as a segment:bus:device.function string, which must match the path to the device in /sys/bus/pci/devices/ (e.g. 0000:00:04.0). Once you have chosen a device, there is a helper script provided to unbind the device from its current driver, bind it to vfio-pci, export the environment variable $VFIO_SELFTESTS_BDF, and launch a shell: $ tools/testing/selftests/vfio/run.sh -d 0000:00:04.0 -s The -d option tells the script which device to use and the -s option tells the script to launch a shell. Additionally, the VFIO selftest vfio_dma_mapping_test has test cases that rely on HugeTLB pages being available, otherwise they are skipped. To enable those tests make sure at least 1 2MB and 1 1GB HugeTLB pages are available. $ echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages $ echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages To run all VFIO selftests using make: $ make -C tools/testing/selftests/vfio run_tests To run individual tests: $ tools/testing/selftests/vfio/vfio_dma_mapping_test $ tools/testing/selftests/vfio/vfio_dma_mapping_test -v iommufd_anonymous_hugetlb_2mb $ tools/testing/selftests/vfio/vfio_dma_mapping_test -r vfio_dma_mapping_test.iommufd_anonymous_hugetlb_2mb.dma_map_unmap The environment variable $VFIO_SELFTESTS_BDF can be overridden for a specific test by passing in the BDF on the command line as the last positional argument. $ tools/testing/selftests/vfio/vfio_dma_mapping_test 0000:00:04.0 $ tools/testing/selftests/vfio/vfio_dma_mapping_test -v iommufd_anonymous_hugetlb_2mb 0000:00:04.0 $ tools/testing/selftests/vfio/vfio_dma_mapping_test -r vfio_dma_mapping_test.iommufd_anonymous_hugetlb_2mb.dma_map_unmap 0000:00:04.0 When you are done, free the HugeTLB pages and exit the shell started by run.sh. Exiting the shell will cause the device to be unbound from vfio-pci and bound back to its original driver. $ echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages $ echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages $ exit It's also possible to use run.sh to run just a single test hermetically, rather than dropping into a shell: $ tools/testing/selftests/vfio/run.sh -d 0000:00:04.0 -- tools/testing/selftests/vfio/vfio_dma_mapping_test -v iommufd_anonymous Tests ----------------------------------------------------------------------- There are 5 tests in this series, mostly to demonstrate as a proof-of-concept: - tools/testing/selftests/vfio/vfio_pci_device_test.c - tools/testing/selftests/vfio/vfio_pci_driver_test.c - tools/testing/selftests/vfio/vfio_iommufd_setup_test.c - tools/testing/selftests/vfio/vfio_dma_mapping_test.c - tools/testing/selftests/kvm/vfio_pci_device_irq_test.c Future Areas of Development ----------------------------------------------------------------------- Library: - Driver support for devices that can be used on AMD, ARM, and other platforms (e.g. mlx5). - Driver support for a device available in QEMU VMs (e.g. pcie-ats-testdev [1]) - Support for tests that use multiple devices. - Support for IOMMU groups with multiple devices. - Support for multiple devices sharing the same container/iommufd. - Sharing TEST_ASSERT() macros and other common code between KVM and VFIO selftests. Tests: - DMA mapping performance tests for BARs/HugeTLB/etc. - Porting tests from https://github.com/awilliam/tests/commits/for-clg/ to selftests. - Live Update selftests. - Porting Sean's KVM selftest for posted interrupts to use the VFIO selftests library [2] Cc: Alex Williamson <alex.williamson(a)redhat.com> Cc: Jason Gunthorpe <jgg(a)nvidia.com> Cc: Kevin Tian <kevin.tian(a)intel.com> Cc: Paolo Bonzini <pbonzini(a)redhat.com> Cc: Sean Christopherson <seanjc(a)google.com> Cc: Vipin Sharma <vipinsh(a)google.com> Cc: Josh Hilke <jrhilke(a)google.com> Cc: Aaron Lewis <aaronlewis(a)google.com> Cc: Pasha Tatashin <pasha.tatashin(a)soleen.com> Cc: Saeed Mahameed <saeedm(a)nvidia.com> Cc: Adithya Jayachandran <ajayachandra(a)nvidia.com> Cc: Joel Granados <joel.granados(a)kernel.org> [1] https://github.com/Joelgranados/qemu/blob/pcie-testdev/hw/misc/pcie-ats-tes… [2] https://lore.kernel.org/kvm/20250404193923.1413163-68-seanjc@google.com/ David Matlack (28): selftests: Create tools/testing/selftests/vfio vfio: selftests: Add a helper library for VFIO selftests vfio: selftests: Introduce vfio_pci_device_test tools headers: Add stub definition for __iomem tools headers: Import asm-generic MMIO helpers tools headers: Import x86 MMIO helper overrides tools headers: Import iosubmit_cmds512() tools headers: Add symlink to linux/pci_ids.h vfio: selftests: Keep track of DMA regions mapped into the device vfio: selftests: Enable asserting MSI eventfds not firing vfio: selftests: Add a helper for matching vendor+device IDs vfio: selftests: Add driver framework vfio: sefltests: Add vfio_pci_driver_test dmaengine: ioat: Move system_has_dca_enabled() to dma.h vfio: selftests: Add driver for Intel CBDMA dmaengine: idxd: Allow registers.h to be included from tools/ vfio: selftests: Add driver for Intel DSA vfio: selftests: Move helper to get cdev path to libvfio vfio: selftests: Encapsulate IOMMU mode vfio: selftests: Replicate tests across all iommu_modes vfio: selftests: Add vfio_type1v2_mode vfio: selftests: Add iommufd_compat_type1{,v2} modes vfio: selftests: Add iommufd mode vfio: selftests: Make iommufd the default iommu_mode vfio: selftests: Add a script to help with running VFIO selftests KVM: selftests: Build and link sefltests/vfio/lib into KVM selftests KVM: selftests: Test sending a vfio-pci device IRQ to a VM KVM: selftests: Add -d option to vfio_pci_device_irq_test for device-sent MSIs Josh Hilke (5): vfio: selftests: Test basic VFIO and IOMMUFD integration vfio: selftests: Move vfio dma mapping test to their own file vfio: selftests: Add test to reset vfio device. vfio: selftests: Add DMA mapping tests for 2M and 1G HugeTLB vfio: selftests: Validate 2M/1G HugeTLB are mapped as 2M/1G in IOMMU MAINTAINERS | 7 + drivers/dma/idxd/registers.h | 4 + drivers/dma/ioat/dma.h | 2 + drivers/dma/ioat/hw.h | 3 - tools/arch/x86/include/asm/io.h | 101 +++ tools/arch/x86/include/asm/special_insns.h | 27 + tools/include/asm-generic/io.h | 482 ++++++++++++++ tools/include/asm/io.h | 11 + tools/include/linux/compiler.h | 4 + tools/include/linux/io.h | 4 +- tools/include/linux/pci_ids.h | 1 + tools/testing/selftests/Makefile | 1 + tools/testing/selftests/kvm/Makefile.kvm | 4 + .../testing/selftests/kvm/include/kvm_util.h | 4 + tools/testing/selftests/kvm/lib/kvm_util.c | 21 + .../selftests/kvm/vfio_pci_device_irq_test.c | 172 +++++ tools/testing/selftests/vfio/.gitignore | 7 + tools/testing/selftests/vfio/Makefile | 21 + .../selftests/vfio/lib/drivers/dsa/dsa.c | 416 ++++++++++++ .../vfio/lib/drivers/dsa/registers.h | 1 + .../selftests/vfio/lib/drivers/ioat/hw.h | 1 + .../selftests/vfio/lib/drivers/ioat/ioat.c | 235 +++++++ .../vfio/lib/drivers/ioat/registers.h | 1 + .../selftests/vfio/lib/include/vfio_util.h | 295 +++++++++ tools/testing/selftests/vfio/lib/libvfio.mk | 24 + .../selftests/vfio/lib/vfio_pci_device.c | 594 ++++++++++++++++++ .../selftests/vfio/lib/vfio_pci_driver.c | 126 ++++ tools/testing/selftests/vfio/run.sh | 109 ++++ .../selftests/vfio/vfio_dma_mapping_test.c | 199 ++++++ .../selftests/vfio/vfio_iommufd_setup_test.c | 127 ++++ .../selftests/vfio/vfio_pci_device_test.c | 176 ++++++ .../selftests/vfio/vfio_pci_driver_test.c | 247 ++++++++ 32 files changed, 3423 insertions(+), 4 deletions(-) create mode 100644 tools/arch/x86/include/asm/io.h create mode 100644 tools/arch/x86/include/asm/special_insns.h create mode 100644 tools/include/asm-generic/io.h create mode 100644 tools/include/asm/io.h create mode 120000 tools/include/linux/pci_ids.h create mode 100644 tools/testing/selftests/kvm/vfio_pci_device_irq_test.c create mode 100644 tools/testing/selftests/vfio/.gitignore create mode 100644 tools/testing/selftests/vfio/Makefile create mode 100644 tools/testing/selftests/vfio/lib/drivers/dsa/dsa.c create mode 120000 tools/testing/selftests/vfio/lib/drivers/dsa/registers.h create mode 120000 tools/testing/selftests/vfio/lib/drivers/ioat/hw.h create mode 100644 tools/testing/selftests/vfio/lib/drivers/ioat/ioat.c create mode 120000 tools/testing/selftests/vfio/lib/drivers/ioat/registers.h create mode 100644 tools/testing/selftests/vfio/lib/include/vfio_util.h create mode 100644 tools/testing/selftests/vfio/lib/libvfio.mk create mode 100644 tools/testing/selftests/vfio/lib/vfio_pci_device.c create mode 100644 tools/testing/selftests/vfio/lib/vfio_pci_driver.c create mode 100755 tools/testing/selftests/vfio/run.sh create mode 100644 tools/testing/selftests/vfio/vfio_dma_mapping_test.c create mode 100644 tools/testing/selftests/vfio/vfio_iommufd_setup_test.c create mode 100644 tools/testing/selftests/vfio/vfio_pci_device_test.c create mode 100644 tools/testing/selftests/vfio/vfio_pci_driver_test.c base-commit: e271ed52b344ac02d4581286961d0c40acc54c03 prerequisite-patch-id: c1decca4653262d3d2451e6fd4422ebff9c0b589 -- 2.50.0.rc2.701.gf1e915cc24-goog

2 weeks

2
39
0 0

[PATCH net-next 6/6] selftests: net: extend SCM_PIDFD test to cover stale pidfds

by Alexander Mikhalitsyn

Extend SCM_PIDFD test scenarios to also cover dead task's pidfd retrieval and reading its exit info. Cc: linux-kselftest(a)vger.kernel.org Cc: linux-kernel(a)vger.kernel.org Cc: netdev(a)vger.kernel.org Cc: Shuah Khan <shuah(a)kernel.org> Cc: "David S. Miller" <davem(a)davemloft.net> Cc: Eric Dumazet <edumazet(a)google.com> Cc: Jakub Kicinski <kuba(a)kernel.org> Cc: Paolo Abeni <pabeni(a)redhat.com> Cc: Simon Horman <horms(a)kernel.org> Cc: Christian Brauner <brauner(a)kernel.org> Cc: Kuniyuki Iwashima <kuniyu(a)amazon.com> Cc: Lennart Poettering <mzxreary(a)0pointer.de> Cc: Luca Boccassi <bluca(a)debian.org> Cc: David Rheinsberg <david(a)readahead.eu> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn(a)canonical.com> --- .../testing/selftests/net/af_unix/scm_pidfd.c | 217 ++++++++++++++---- 1 file changed, 173 insertions(+), 44 deletions(-) diff --git a/tools/testing/selftests/net/af_unix/scm_pidfd.c b/tools/testing/selftests/net/af_unix/scm_pidfd.c index 7e534594167e..37e034874034 100644 --- a/tools/testing/selftests/net/af_unix/scm_pidfd.c +++ b/tools/testing/selftests/net/af_unix/scm_pidfd.c @@ -15,6 +15,7 @@ #include <sys/types.h> #include <sys/wait.h> +#include "../../pidfd/pidfd.h" #include "../../kselftest_harness.h" #define clean_errno() (errno == 0 ? "None" : strerror(errno)) @@ -26,6 +27,8 @@ #define SCM_PIDFD 0x04 #endif +#define CHILD_EXIT_CODE_OK 123 + static void child_die() { exit(1); @@ -126,16 +129,65 @@ static pid_t get_pid_from_fdinfo_file(int pidfd, const char *key, size_t keylen) return result; } +struct cmsg_data { + struct ucred *ucred; + int *pidfd; +}; + +static int parse_cmsg(struct msghdr *msg, struct cmsg_data *res) +{ + struct cmsghdr *cmsg; + int data = 0; + + if (msg->msg_flags & (MSG_TRUNC | MSG_CTRUNC)) { + log_err("recvmsg: truncated"); + return 1; + } + + for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL; + cmsg = CMSG_NXTHDR(msg, cmsg)) { + if (cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SCM_PIDFD) { + if (cmsg->cmsg_len < sizeof(*res->pidfd)) { + log_err("CMSG parse: SCM_PIDFD wrong len"); + return 1; + } + + res->pidfd = (void *)CMSG_DATA(cmsg); + } + + if (cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SCM_CREDENTIALS) { + if (cmsg->cmsg_len < sizeof(*res->ucred)) { + log_err("CMSG parse: SCM_CREDENTIALS wrong len"); + return 1; + } + + res->ucred = (void *)CMSG_DATA(cmsg); + } + } + + if (!res->pidfd) { + log_err("CMSG parse: SCM_PIDFD not found"); + return 1; + } + + if (!res->ucred) { + log_err("CMSG parse: SCM_CREDENTIALS not found"); + return 1; + } + + return 0; +} + static int cmsg_check(int fd) { struct msghdr msg = { 0 }; - struct cmsghdr *cmsg; + struct cmsg_data res; struct iovec iov; - struct ucred *ucred = NULL; int data = 0; char control[CMSG_SPACE(sizeof(struct ucred)) + CMSG_SPACE(sizeof(int))] = { 0 }; - int *pidfd = NULL; pid_t parent_pid; int err; @@ -158,53 +210,99 @@ static int cmsg_check(int fd) return 1; } - for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL; - cmsg = CMSG_NXTHDR(&msg, cmsg)) { - if (cmsg->cmsg_level == SOL_SOCKET && - cmsg->cmsg_type == SCM_PIDFD) { - if (cmsg->cmsg_len < sizeof(*pidfd)) { - log_err("CMSG parse: SCM_PIDFD wrong len"); - return 1; - } + /* send(pfd, "x", sizeof(char), 0) */ + if (data != 'x') { + log_err("recvmsg: data corruption"); + return 1; + } - pidfd = (void *)CMSG_DATA(cmsg); - } + if (parse_cmsg(&msg, &res)) { + log_err("CMSG parse: parse_cmsg() failed"); + return 1; + } - if (cmsg->cmsg_level == SOL_SOCKET && - cmsg->cmsg_type == SCM_CREDENTIALS) { - if (cmsg->cmsg_len < sizeof(*ucred)) { - log_err("CMSG parse: SCM_CREDENTIALS wrong len"); - return 1; - } + /* pidfd from SCM_PIDFD should point to the parent process PID */ + parent_pid = + get_pid_from_fdinfo_file(*res.pidfd, "Pid:", sizeof("Pid:") - 1); + if (parent_pid != getppid()) { + log_err("wrong SCM_PIDFD %d != %d", parent_pid, getppid()); + close(*res.pidfd); + return 1; + } - ucred = (void *)CMSG_DATA(cmsg); - } + close(*res.pidfd); + return 0; +} + +static int cmsg_check_dead(int fd, int expected_pid) +{ + int err; + struct msghdr msg = { 0 }; + struct cmsg_data res; + struct iovec iov; + int data = 0; + char control[CMSG_SPACE(sizeof(struct ucred)) + + CMSG_SPACE(sizeof(int))] = { 0 }; + pid_t client_pid; + struct pidfd_info info = { + .mask = PIDFD_INFO_EXIT, + }; + + iov.iov_base = &data; + iov.iov_len = sizeof(data); + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = control; + msg.msg_controllen = sizeof(control); + + err = recvmsg(fd, &msg, 0); + if (err < 0) { + log_err("recvmsg"); + return 1; } - /* send(pfd, "x", sizeof(char), 0) */ - if (data != 'x') { + if (msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC)) { + log_err("recvmsg: truncated"); + return 1; + } + + /* send(cfd, "y", sizeof(char), 0) */ + if (data != 'y') { log_err("recvmsg: data corruption"); return 1; } - if (!pidfd) { - log_err("CMSG parse: SCM_PIDFD not found"); + if (parse_cmsg(&msg, &res)) { + log_err("CMSG parse: parse_cmsg() failed"); return 1; } - if (!ucred) { - log_err("CMSG parse: SCM_CREDENTIALS not found"); + /* + * pidfd from SCM_PIDFD should point to the client_pid. + * Let's read exit information and check if it's what + * we expect to see. + */ + if (ioctl(*res.pidfd, PIDFD_GET_INFO, &info)) { + log_err("%s: ioctl(PIDFD_GET_INFO) failed", __func__); + close(*res.pidfd); return 1; } - /* pidfd from SCM_PIDFD should point to the parent process PID */ - parent_pid = - get_pid_from_fdinfo_file(*pidfd, "Pid:", sizeof("Pid:") - 1); - if (parent_pid != getppid()) { - log_err("wrong SCM_PIDFD %d != %d", parent_pid, getppid()); + if (!(info.mask & PIDFD_INFO_EXIT)) { + log_err("%s: No exit information from ioctl(PIDFD_GET_INFO)", __func__); + close(*res.pidfd); return 1; } + err = WIFEXITED(info.exit_code) ? WEXITSTATUS(info.exit_code) : 1; + if (err != CHILD_EXIT_CODE_OK) { + log_err("%s: wrong exit_code %d != %d", __func__, err, CHILD_EXIT_CODE_OK); + close(*res.pidfd); + return 1; + } + + close(*res.pidfd); return 0; } @@ -291,6 +389,24 @@ static void fill_sockaddr(struct sock_addr *addr, bool abstract) memcpy(sun_path_buf, addr->sock_name, strlen(addr->sock_name)); } +static int sk_enable_cred_pass(int sk) +{ + int on = 0; + + on = 1; + if (setsockopt(sk, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on))) { + log_err("Failed to set SO_PASSCRED"); + return 1; + } + + if (setsockopt(sk, SOL_SOCKET, SO_PASSPIDFD, &on, sizeof(on))) { + log_err("Failed to set SO_PASSPIDFD"); + return 1; + } + + return 0; +} + static void client(FIXTURE_DATA(scm_pidfd) *self, const FIXTURE_VARIANT(scm_pidfd) *variant) { @@ -299,7 +415,6 @@ static void client(FIXTURE_DATA(scm_pidfd) *self, struct ucred peer_cred; int peer_pidfd; pid_t peer_pid; - int on = 0; cfd = socket(AF_UNIX, variant->type, 0); if (cfd < 0) { @@ -322,14 +437,8 @@ static void client(FIXTURE_DATA(scm_pidfd) *self, child_die(); } - on = 1; - if (setsockopt(cfd, SOL_SOCKET, SO_PASSCRED, &on, sizeof(on))) { - log_err("Failed to set SO_PASSCRED"); - child_die(); - } - - if (setsockopt(cfd, SOL_SOCKET, SO_PASSPIDFD, &on, sizeof(on))) { - log_err("Failed to set SO_PASSPIDFD"); + if (sk_enable_cred_pass(cfd)) { + log_err("sk_enable_cred_pass() failed"); child_die(); } @@ -340,6 +449,12 @@ static void client(FIXTURE_DATA(scm_pidfd) *self, child_die(); } + /* send something to the parent so it can receive SCM_PIDFD too and validate it */ + if (send(cfd, "y", sizeof(char), 0) == -1) { + log_err("Failed to send(cfd, \"y\", sizeof(char), 0)"); + child_die(); + } + /* skip further for SOCK_DGRAM as it's not applicable */ if (variant->type == SOCK_DGRAM) return; @@ -398,7 +513,13 @@ TEST_F(scm_pidfd, test) close(self->server); close(self->startup_pipe[0]); client(self, variant); - exit(0); + + /* + * It's a bit unusual, but in case of success we return non-zero + * exit code (CHILD_EXIT_CODE_OK) and then we expect to read it + * from ioctl(PIDFD_GET_INFO) in cmsg_check_dead(). + */ + exit(CHILD_EXIT_CODE_OK); } close(self->startup_pipe[1]); @@ -421,9 +542,17 @@ TEST_F(scm_pidfd, test) ASSERT_NE(-1, err); } - close(pfd); waitpid(self->client_pid, &child_status, 0); - ASSERT_EQ(0, WIFEXITED(child_status) ? WEXITSTATUS(child_status) : 1); + /* see comment before exit(CHILD_EXIT_CODE_OK) */ + ASSERT_EQ(CHILD_EXIT_CODE_OK, WIFEXITED(child_status) ? WEXITSTATUS(child_status) : 1); + + err = sk_enable_cred_pass(pfd); + ASSERT_EQ(0, err); + + err = cmsg_check_dead(pfd, self->client_pid); + ASSERT_EQ(0, err); + + close(pfd); } TEST_HARNESS_MAIN -- 2.43.0

2 weeks

1
0
0 0

[RFC PATCH v8 0/7] Add NUMA mempolicy support for KVM guest-memfd

by Shivank Garg

This series introduces NUMA-aware memory placement support for KVM guests with guest_memfd memory backends. It builds upon Fuad Tabba's work that enabled host-mapping for guest_memfd memory [1] and can be applied directly on KVM tree (branch:queue, base commit:7915077245) [2]. == Background == KVM's guest-memfd memory backend currently lacks support for NUMA policy enforcement, causing guest memory allocations to be distributed across host nodes according to kernel's default behavior, irrespective of any policy specified by the VMM. This limitation arises because conventional userspace NUMA control mechanisms like mbind(2) don't work since the memory isn't directly mapped to userspace when allocations occur. Fuad's work [1] provides the necessary mmap capability, and this series leverages it to enable mbind(2). == Implementation == This series implements proper NUMA policy support for guest-memfd by: 1. Adding mempolicy-aware allocation APIs to the filemap layer. 2. Introducing custom inodes (via a dedicated slab-allocated inode cache, kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory. 3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA policy. With these changes, VMMs can now control guest memory placement by mapping guest_memfd file descriptor and using mbind(2) to specify: - Policy modes: default, bind, interleave, or preferred - Host NUMA nodes: List of target nodes for memory allocation These Policies affect only future allocations and do not migrate existing memory. This matches mbind(2)'s default behavior which affects only new allocations unless overridden with MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags (Not supported for guest_memfd as it is unmovable by design). == Upstream Plan == Phased approach as per David's guest_memfd extension overview [3] and community calls [4]: Phase 1 (this series): 1. Focuses on shared guest_memfd support (non-CoCo VMs). 2. Builds on Fuad's host-mapping work. Phase2 (future work): 1. NUMA support for private guest_memfd (CoCo VMs). 2. Depends on SNP in-place conversion support [5]. This series provides a clean integration path for NUMA-aware memory management for guest_memfd and lays the groundwork for future confidential computing NUMA capabilities. Please review and provide feedback! Thanks, Shivank == Changelog == - v1,v2: Extended the KVM_CREATE_GUEST_MEMFD IOCTL to pass mempolicy. - v3: Introduced fbind() syscall for VMM memory-placement configuration. - v4-v6: Current approach using shared_policy support and vm_ops (based on suggestions from David [6] and guest_memfd bi-weekly upstream call discussion [7]). - v7: Use inodes to store NUMA policy instead of file [8]. - v8: Rebase on top of Fuad's V12: Host mmaping for guest_memfd memory. [1] https://lore.kernel.org/all/20250611133330.1514028-1-tabba@google.com [2] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=queue [3] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com [4] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo… [5] https://lore.kernel.org/all/20250613005400.3694904-1-michael.roth@amd.com [6] https://lore.kernel.org/all/6fbef654-36e2-4be5-906e-2a648a845278@redhat.com [7] https://lore.kernel.org/all/2b77e055-98ac-43a1-a7ad-9f9065d7f38f@amd.com [8] https://lore.kernel.org/all/diqzbjumm167.fsf@ackerleytng-ctop.c.googlers.com Ackerley Tng (1): KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes Shivank Garg (5): security: Export anon_inode_make_secure_inode for KVM guest_memfd mm/mempolicy: Export memory policy symbols KVM: guest_memfd: Add slab-allocated inode cache KVM: guest_memfd: Enforce NUMA mempolicy using shared policy KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy support Shivansh Dhiman (1): mm/filemap: Add mempolicy support to the filemap layer fs/anon_inodes.c | 20 +- include/linux/fs.h | 2 + include/linux/pagemap.h | 41 +++ include/uapi/linux/magic.h | 1 + mm/filemap.c | 27 +- mm/mempolicy.c | 6 + tools/testing/selftests/kvm/Makefile.kvm | 1 + .../testing/selftests/kvm/guest_memfd_test.c | 123 ++++++++- virt/kvm/guest_memfd.c | 254 ++++++++++++++++-- virt/kvm/kvm_main.c | 7 +- virt/kvm/kvm_mm.h | 10 +- 11 files changed, 456 insertions(+), 36 deletions(-) -- 2.43.0 --- == Earlier Postings == v7: https://lore.kernel.org/all/20250408112402.181574-1-shivankg@amd.com v6: https://lore.kernel.org/all/20250226082549.6034-1-shivankg@amd.com v5: https://lore.kernel.org/all/20250219101559.414878-1-shivankg@amd.com v4: https://lore.kernel.org/all/20250210063227.41125-1-shivankg@amd.com v3: https://lore.kernel.org/all/20241105164549.154700-1-shivankg@amd.com v2: https://lore.kernel.org/all/20240919094438.10987-1-shivankg@amd.com v1: https://lore.kernel.org/all/20240916165743.201087-1-shivankg@amd.com

2 weeks, 1 day

8
33
0 0

[PATCH v3 0/3] replace `allow(...)` lints with `expect(...)`

by Onur Özkan

Changes in v3: - `#![expect(internal_features)]` is reverted from `rust/compiler_builtins.rs` which was done in the first 2 versions. Onur Özkan (3): replace `#[allow(...)]` with `#[expect(...)]` rust: remove `#[allow(clippy::unnecessary_cast)]` rust: remove `#[allow(clippy::non_send_fields_in_send_ty)]` drivers/gpu/nova-core/regs.rs | 2 +- rust/kernel/alloc/allocator_test.rs | 2 +- rust/kernel/cpufreq.rs | 1 - rust/kernel/devres.rs | 2 +- rust/kernel/driver.rs | 2 +- rust/kernel/drm/ioctl.rs | 8 ++++---- rust/kernel/error.rs | 3 +-- rust/kernel/init.rs | 6 +++--- rust/kernel/kunit.rs | 2 +- rust/kernel/opp.rs | 4 ++-- rust/kernel/types.rs | 2 +- rust/macros/helpers.rs | 2 +- 12 files changed, 17 insertions(+), 19 deletions(-) -- 2.50.0

2 weeks, 1 day

4
15
0 0

[PATCH 0/6] selftests/damon: add python and drgn based DAMON sysfs functionality tests

by SeongJae Park

DAMON sysfs interface is the bridge between the user space and the kernel space for DAMON parameters. There is no good and simple test to see if the parameters are set as expected. Existing DAMON selftests therefore test end-to-end features. For example, damos_quota_goal.py runs a DAMOS scheme with quota goal set against a test program running an artificial access pattern, and see if the result is as expected. Such tests cover only a few part of DAMON. Adding more tests is also complicated. Finally, the reliability of the test itself on different systems is bad. 'drgn' is a tool that can extract kernel internal data structures like DAMON parameters. Add a test that passes specific DAMON parameters via DAMON sysfs reusing _damon_sysfs.py, extract resulting DAMON parameters via 'drgn', and compare those. Note that this test is not adding exhaustive tests of all DAMON parameters and input combinations but very basic things. Advancing the test infrastructure and adding more tests are future works. Changes from RFC (https://lore.kernel.org/20250622210330.40490-1-sj@kernel.org) - Rebase on latest mm-new SeongJae Park (6): selftests/damon: add drgn script for extracting damon status selftests/damon/_damon_sysfs: set Kdamond.pid in start() selftests/damon: add python and drgn-based DAMON sysfs test selftests/damon/sysfs.py: test monitoring attribute parameters selftests/damon/sysfs.py: test adaptive targets parameter selftests/damon/sysfs.py: test DAMOS schemes parameters setup tools/testing/selftests/damon/Makefile | 1 + tools/testing/selftests/damon/_damon_sysfs.py | 3 + .../selftests/damon/drgn_dump_damon_status.py | 161 ++++++++++++++++++ tools/testing/selftests/damon/sysfs.py | 115 +++++++++++++ 4 files changed, 280 insertions(+) create mode 100755 tools/testing/selftests/damon/drgn_dump_damon_status.py create mode 100755 tools/testing/selftests/damon/sysfs.py base-commit: 5ab6feac2d83ebbf0d0d2eedf0505878ba677dcb -- 2.39.5

2 weeks, 2 days

1
6
0 0

[PATCH bpf-next 0/3] bpf: Fix and test aux usage after do_check_insn()

by Luis Gerhorst

Fix cur_aux()->nospec_result test after do_check_insn() referring to the to-be-analyzed (potentially unsafe) instruction, not the already-analyzed (safe) instruction. This might allow a unsafe insn to slip through on a speculative path. Create some tests from the reproducer [1]. Commit d6f1c85f2253 ("bpf: Fall back to nospec for Spectre v1") should not be in any stable kernel yet, therefore bpf-next should suffice. [1] https://lore.kernel.org/bpf/685b3c1b.050a0220.2303ee.0010.GAE@google.com/ Changes since RFC: - Introduce prev_aux() as suggested by Alexei. For this, we must move the env->prev_insn_idx assignment to happen directly after do_check_insn(), for which I have created a separate commit. This patch could be simplified by using a local prev_aux variable as sugested by Eduard, but I figured one might find the new assignment-strategy easier to understand (before, prev_insn_idx and env->prev_insn_idx were out-of-sync for the latter part of the loop). Also, like this we do not have an additional prev_* variable that must be kept in-sync and the local variable's usage (old prev_insn_idx, new tmp) is much more local. If you think it would be better to not take the risk and keep the fix simple by just introducing the prev_aux variable, let me know. - Change WARN_ON_ONCE() to verifier_bug_if() as suggested by Alexei - Change assertion to check instruction is BPF_JMP[32] as suggested by Eduard - RFC: https://lore.kernel.org/bpf/8734bmoemx.fsf@fau.de/ Luis Gerhorst (3): bpf: Update env->prev_insn_idx after do_check_insn() bpf: Fix aux usage after do_check_insn() selftests/bpf: Add Spectre v4 tests kernel/bpf/verifier.c | 30 ++-- tools/testing/selftests/bpf/progs/bpf_misc.h | 4 + .../selftests/bpf/progs/verifier_unpriv.c | 149 ++++++++++++++++++ 3 files changed, 174 insertions(+), 9 deletions(-) base-commit: d69bafe6ee2b5eff6099fa26626ecc2963f0f363 -- 2.49.0

2 weeks, 2 days

1
3
0 0

[kvm-unit-tests PATCH v6] riscv: sbi: Add SBI Debug Triggers Extension tests

by Jesse Taube

Add tests for the DBTR SBI extension. Signed-off-by: Jesse Taube <jesse(a)rivosinc.com> Reviewed-by: Charlie Jenkins <charlie(a)rivosinc.com> Tested-by: Charlie Jenkins <charlie(a)rivosinc.com> --- V1 -> V2: - Call report_prefix_pop before returning - Disable compressed instructions in exec_call, update related comment - Remove extra "| 1" in dbtr_test_load - Remove extra newlines - Remove extra tabs in check_exec - Remove typedefs from enums - Return when dbtr_install_trigger fails - s/avalible/available/g - s/unistall/uninstall/g V2 -> V3: - Change SBI_DBTR_SHMEM_INVALID_ADDR to -1UL - Move all dbtr functions to sbi-dbtr.c - Move INSN_LEN to processor.h - Update include list - Use C-style comments V3 -> V4: - Include libcflat.h - Remove #define SBI_DBTR_SHMEM_INVALID_ADDR V4 -> V5: - Sort includes - Add kfail for update triggers V5 -> V6: - Add assert in gen_tdata1 - Add prefix to dbtr_test_type - Add TRIG_STATE_DMODE - Add TRIG_STATE_RESERVED - Align function paramaters with opening parenthesis - Change OpenSBI < v1.7 to < v1.5 - Constantly use spaces in prefix rather than _ - Export split_phys_addr - Fix MCONTROL_U and MCONTROL_M mix up - Fix swapped VU and VS - Move /* to own line - Print type in dbtr_test_type - Remove _BIT suffix from macros - Remove duplicate MODE_S - Remove spaces before include - Rename tdata1,2 to trigger and control in dbtr_install_trigger - Report skip in dbtr_test_multiple - Report variables in info not pass or fail - s/save/store/g - sbi_debug_set_shmem use split_phys_addr - Use if (!report(... in dbtr_test_disable_enable --- lib/riscv/asm/sbi.h | 1 + riscv/Makefile | 1 + riscv/sbi-dbtr.c | 839 ++++++++++++++++++++++++++++++++++++++++++++ riscv/sbi-tests.h | 2 + riscv/sbi.c | 3 +- 5 files changed, 845 insertions(+), 1 deletion(-) create mode 100644 riscv/sbi-dbtr.c diff --git a/lib/riscv/asm/sbi.h b/lib/riscv/asm/sbi.h index a5738a5c..78fd6e2a 100644 --- a/lib/riscv/asm/sbi.h +++ b/lib/riscv/asm/sbi.h @@ -51,6 +51,7 @@ enum sbi_ext_id { SBI_EXT_SUSP = 0x53555350, SBI_EXT_FWFT = 0x46574654, SBI_EXT_SSE = 0x535345, + SBI_EXT_DBTR = 0x44425452, }; enum sbi_ext_base_fid { diff --git a/riscv/Makefile b/riscv/Makefile index 11e68eae..55c7ac93 100644 --- a/riscv/Makefile +++ b/riscv/Makefile @@ -20,6 +20,7 @@ all: $(tests) $(TEST_DIR)/sbi-deps += $(TEST_DIR)/sbi-asm.o $(TEST_DIR)/sbi-deps += $(TEST_DIR)/sbi-fwft.o $(TEST_DIR)/sbi-deps += $(TEST_DIR)/sbi-sse.o +$(TEST_DIR)/sbi-deps += $(TEST_DIR)/sbi-dbtr.o all_deps += $($(TEST_DIR)/sbi-deps) diff --git a/riscv/sbi-dbtr.c b/riscv/sbi-dbtr.c new file mode 100644 index 00000000..7837553f --- /dev/null +++ b/riscv/sbi-dbtr.c @@ -0,0 +1,839 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * SBI DBTR testsuite + * + * Copyright (C) 2025, Rivos Inc., Jesse Taube <jesse(a)rivosinc.com> + */ + +#include <libcflat.h> +#include <bitops.h> + +#include <asm/io.h> +#include <asm/processor.h> + +#include "sbi-tests.h" + +#define RV_MAX_TRIGGERS 32 + +#define SBI_DBTR_TRIG_STATE_MAPPED BIT(0) +#define SBI_DBTR_TRIG_STATE_U BIT(1) +#define SBI_DBTR_TRIG_STATE_S BIT(2) +#define SBI_DBTR_TRIG_STATE_VU BIT(3) +#define SBI_DBTR_TRIG_STATE_VS BIT(4) +#define SBI_DBTR_TRIG_STATE_HAVE_HW_TRIG BIT(5) +#define SBI_DBTR_TRIG_STATE_RESERVED GENMASK(7, 6) + +#define SBI_DBTR_TRIG_STATE_HW_TRIG_IDX_SHIFT 8 +#define SBI_DBTR_TRIG_STATE_HW_TRIG_IDX(trig_state) (trig_state >> SBI_DBTR_TRIG_STATE_HW_TRIG_IDX_SHIFT) + +#define SBI_DBTR_TDATA1_TYPE_SHIFT (__riscv_xlen - 4) +#define SBI_DBTR_TDATA1_DMODE BIT_UL(__riscv_xlen - 5) + +#define SBI_DBTR_TDATA1_MCONTROL6_LOAD BIT(0) +#define SBI_DBTR_TDATA1_MCONTROL6_STORE BIT(1) +#define SBI_DBTR_TDATA1_MCONTROL6_EXECUTE BIT(2) +#define SBI_DBTR_TDATA1_MCONTROL6_U BIT(3) +#define SBI_DBTR_TDATA1_MCONTROL6_S BIT(4) +#define SBI_DBTR_TDATA1_MCONTROL6_M BIT(6) +#define SBI_DBTR_TDATA1_MCONTROL6_SELECT BIT(21) +#define SBI_DBTR_TDATA1_MCONTROL6_VU BIT(23) +#define SBI_DBTR_TDATA1_MCONTROL6_VS BIT(24) + +#define SBI_DBTR_TDATA1_MCONTROL_LOAD BIT(0) +#define SBI_DBTR_TDATA1_MCONTROL_STORE BIT(1) +#define SBI_DBTR_TDATA1_MCONTROL_EXECUTE BIT(2) +#define SBI_DBTR_TDATA1_MCONTROL_U BIT(3) +#define SBI_DBTR_TDATA1_MCONTROL_S BIT(4) +#define SBI_DBTR_TDATA1_MCONTROL_M BIT(6) +#define SBI_DBTR_TDATA1_MCONTROL_SELECT BIT(19) + +enum McontrolType { + SBI_DBTR_TDATA1_TYPE_NONE = (0UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_LEGACY = (1UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_MCONTROL = (2UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_ICOUNT = (3UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_ITRIGGER = (4UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_ETRIGGER = (5UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_MCONTROL6 = (6UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_TMEXTTRIGGER = (7UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_RESERVED0 = (8UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_RESERVED1 = (9UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_RESERVED2 = (10UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_RESERVED3 = (11UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_CUSTOM0 = (12UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_CUSTOM1 = (13UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_CUSTOM2 = (14UL << SBI_DBTR_TDATA1_TYPE_SHIFT), + SBI_DBTR_TDATA1_TYPE_DISABLED = (15UL << SBI_DBTR_TDATA1_TYPE_SHIFT), +}; + +enum Tdata1Value { + VALUE_NONE = 0, + VALUE_LOAD = BIT(0), + VALUE_STORE = BIT(1), + VALUE_EXECUTE = BIT(2), +}; + +enum Tdata1Mode { + MODE_NONE = 0, + MODE_M = BIT(0), + MODE_U = BIT(1), + MODE_S = BIT(2), + MODE_VU = BIT(3), + MODE_VS = BIT(4), +}; + +enum sbi_ext_dbtr_fid { + SBI_EXT_DBTR_NUM_TRIGGERS = 0, + SBI_EXT_DBTR_SETUP_SHMEM, + SBI_EXT_DBTR_TRIGGER_READ, + SBI_EXT_DBTR_TRIGGER_INSTALL, + SBI_EXT_DBTR_TRIGGER_UPDATE, + SBI_EXT_DBTR_TRIGGER_UNINSTALL, + SBI_EXT_DBTR_TRIGGER_ENABLE, + SBI_EXT_DBTR_TRIGGER_DISABLE, +}; + +struct sbi_dbtr_data_msg { + unsigned long tstate; + unsigned long tdata1; + unsigned long tdata2; + unsigned long tdata3; +}; + +struct sbi_dbtr_id_msg { + unsigned long idx; +}; + +/* SBI shared mem messages layout */ +struct sbi_dbtr_shmem_entry { + union { + struct sbi_dbtr_data_msg data; + struct sbi_dbtr_id_msg id; + }; +}; + +static bool dbtr_handled; + +/* Expected to be leaf function as not to disrupt frame-pointer */ +static __attribute__((naked)) void exec_call(void) +{ + /* skip over nop when triggered instead of ret. */ + asm volatile (".option push\n" + ".option arch, -c\n" + "nop\n" + "ret\n" + ".option pop\n"); +} + +static void dbtr_exception_handler(struct pt_regs *regs) +{ + dbtr_handled = true; + + /* Reading *epc may cause a fault, skip over nop */ + if ((void *)regs->epc == exec_call) { + regs->epc += 4; + return; + } + + /* WARNING: Skips over the trapped intruction */ + regs->epc += RV_INSN_LEN(readw((void *)regs->epc)); +} + +static bool do_store(void *tdata2) +{ + bool ret; + + writel(0, tdata2); + + ret = dbtr_handled; + dbtr_handled = false; + + return ret; +} + +static bool do_load(void *tdata2) +{ + bool ret; + + readl(tdata2); + + ret = dbtr_handled; + dbtr_handled = false; + + return ret; +} + +static bool do_exec(void) +{ + bool ret; + + exec_call(); + + ret = dbtr_handled; + dbtr_handled = false; + + return ret; +} + +static unsigned long gen_tdata1_mcontrol(enum Tdata1Mode mode, enum Tdata1Value value) +{ + unsigned long tdata1 = SBI_DBTR_TDATA1_TYPE_MCONTROL; + + if (value & VALUE_LOAD) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_LOAD; + + if (value & VALUE_STORE) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_STORE; + + if (value & VALUE_EXECUTE) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_EXECUTE; + + if (mode & MODE_M) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_M; + + if (mode & MODE_U) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_U; + + if (mode & MODE_S) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL_S; + + return tdata1; +} + +static unsigned long gen_tdata1_mcontrol6(enum Tdata1Mode mode, enum Tdata1Value value) +{ + unsigned long tdata1 = SBI_DBTR_TDATA1_TYPE_MCONTROL6; + + if (value & VALUE_LOAD) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_LOAD; + + if (value & VALUE_STORE) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_STORE; + + if (value & VALUE_EXECUTE) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_EXECUTE; + + if (mode & MODE_M) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_M; + + if (mode & MODE_U) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_U; + + if (mode & MODE_S) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_S; + + if (mode & MODE_VU) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_VU; + + if (mode & MODE_VS) + tdata1 |= SBI_DBTR_TDATA1_MCONTROL6_VS; + + return tdata1; +} + +static unsigned long gen_tdata1(enum McontrolType type, enum Tdata1Value value, enum Tdata1Mode mode) +{ + switch (type) { + case SBI_DBTR_TDATA1_TYPE_MCONTROL: + return gen_tdata1_mcontrol(mode, value); + case SBI_DBTR_TDATA1_TYPE_MCONTROL6: + return gen_tdata1_mcontrol6(mode, value); + default: + assert_msg(false, "Invalid mcontrol type: %lu", type); + return 0; + } +} + +static struct sbiret sbi_debug_num_triggers(unsigned long trig_tdata1) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_NUM_TRIGGERS, trig_tdata1, 0, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_set_shmem_raw(unsigned long shmem_phys_lo, + unsigned long shmem_phys_hi, + unsigned long flags) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_SETUP_SHMEM, shmem_phys_lo, + shmem_phys_hi, flags, 0, 0, 0); +} + +static struct sbiret sbi_debug_set_shmem(void *shmem) +{ + unsigned long base_addr_lo, base_addr_hi; + + split_phys_addr(virt_to_phys(shmem), &base_addr_hi, &base_addr_lo); + return sbi_debug_set_shmem_raw(base_addr_lo, base_addr_hi, 0); +} + +static struct sbiret sbi_debug_read_triggers(unsigned long trig_idx_base, + unsigned long trig_count) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_READ, trig_idx_base, + trig_count, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_install_triggers(unsigned long trig_count) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_INSTALL, trig_count, 0, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_update_triggers(unsigned long trig_count) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_UPDATE, trig_count, 0, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_uninstall_triggers(unsigned long trig_idx_base, + unsigned long trig_idx_mask) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_UNINSTALL, trig_idx_base, + trig_idx_mask, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_enable_triggers(unsigned long trig_idx_base, + unsigned long trig_idx_mask) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_ENABLE, trig_idx_base, + trig_idx_mask, 0, 0, 0, 0); +} + +static struct sbiret sbi_debug_disable_triggers(unsigned long trig_idx_base, + unsigned long trig_idx_mask) +{ + return sbi_ecall(SBI_EXT_DBTR, SBI_EXT_DBTR_TRIGGER_DISABLE, trig_idx_base, + trig_idx_mask, 0, 0, 0, 0); +} + +static bool dbtr_install_trigger(struct sbi_dbtr_shmem_entry *shmem, void *trigger, + unsigned long control) +{ + struct sbiret sbi_ret; + bool ret; + + shmem->data.tdata1 = control; + shmem->data.tdata2 = (unsigned long)trigger; + + sbi_ret = sbi_debug_install_triggers(1); + ret = sbiret_report_error(&sbi_ret, SBI_SUCCESS, "sbi_debug_install_triggers"); + if (ret) + install_exception_handler(EXC_BREAKPOINT, dbtr_exception_handler); + + return ret; +} + +static bool dbtr_uninstall_trigger(void) +{ + struct sbiret ret; + + install_exception_handler(EXC_BREAKPOINT, NULL); + + ret = sbi_debug_uninstall_triggers(0, 1); + return sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); +} + +static unsigned long dbtr_test_num_triggers(void) +{ + struct sbiret ret; + unsigned long tdata1 = 0; + /* sbi_debug_num_triggers will return trig_max in sbiret.value when trig_tdata1 == 0 */ + + /* should be at least one trigger. */ + ret = sbi_debug_num_triggers(tdata1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_num_triggers"); + + if (ret.value == 0) { + report_fail("sbi_debug_num_triggers: Returned 0 triggers available"); + } else { + report_pass("sbi_debug_num_triggers: Returned triggers available"); + report_info("sbi_debug_num_triggers: Returned %lu triggers available", ret.value); + } + + return ret.value; +} + +static enum McontrolType dbtr_test_type(unsigned long *num_trig) +{ + struct sbiret ret; + unsigned long tdata1 = SBI_DBTR_TDATA1_TYPE_MCONTROL6; + + report_prefix_push("test type"); + + ret = sbi_debug_num_triggers(tdata1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_num_triggers: mcontrol6"); + *num_trig = ret.value; + if (ret.value > 0) { + report_pass("sbi_debug_num_triggers: Returned mcontrol6 triggers available"); + report_info("sbi_debug_num_triggers: Returned %lu mcontrol6 triggers available", + ret.value); + report_prefix_pop(); + return tdata1; + } + + tdata1 = SBI_DBTR_TDATA1_TYPE_MCONTROL; + + ret = sbi_debug_num_triggers(tdata1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_num_triggers: mcontrol"); + *num_trig = ret.value; + if (ret.value > 0) { + report_pass("sbi_debug_num_triggers: Returned mcontrol triggers available"); + report_info("sbi_debug_num_triggers: Returned %lu mcontrol triggers available", + ret.value); + report_prefix_pop(); + return tdata1; + } + + report_fail("sbi_debug_num_triggers: Returned 0 mcontrol(6) triggers available"); + report_prefix_pop(); + + return SBI_DBTR_TDATA1_TYPE_NONE; +} + +static struct sbiret dbtr_test_store_install_uninstall(struct sbi_dbtr_shmem_entry *shmem, + enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + + report_prefix_push("store trigger"); + + shmem->data.tdata1 = gen_tdata1(type, VALUE_STORE, MODE_S); + shmem->data.tdata2 = (unsigned long)&test; + + ret = sbi_debug_install_triggers(1); + if (!sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_install_triggers")) { + report_prefix_pop(); + return ret; + } + + install_exception_handler(EXC_BREAKPOINT, dbtr_exception_handler); + + report(do_store(&test), "triggered"); + + if (do_load(&test)) + report_fail("triggered by load"); + + ret = sbi_debug_uninstall_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); + + if (do_store(&test)) + report_fail("triggered after uninstall"); + + install_exception_handler(EXC_BREAKPOINT, NULL); + report_prefix_pop(); + + return ret; +} + +static void dbtr_test_update(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + bool kfail; + + report_prefix_push("update trigger"); + + if (!dbtr_install_trigger(shmem, NULL, gen_tdata1(type, VALUE_NONE, MODE_NONE))) { + report_prefix_pop(); + return; + } + + shmem->id.idx = 0; + shmem->data.tdata1 = gen_tdata1(type, VALUE_STORE, MODE_S); + shmem->data.tdata2 = (unsigned long)&test; + + ret = sbi_debug_update_triggers(1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_update_triggers"); + + /* + * Known broken update_triggers. + * https://lore.kernel.org/opensbi/aDdp1UeUh7GugeHp@ghost/T/#t + */ + kfail = __sbi_get_imp_id() == SBI_IMPL_OPENSBI && + __sbi_get_imp_version() < sbi_impl_opensbi_mk_version(1, 7); + report_kfail(kfail, do_store(&test), "triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_load(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + + report_prefix_push("load trigger"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_LOAD, MODE_S))) { + report_prefix_pop(); + return; + } + + report(do_load(&test), "triggered"); + + if (do_store(&test)) + report_fail("triggered by store"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_disable_enable(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + + report_prefix_push("disable trigger"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + ret = sbi_debug_disable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_disable_triggers"); + + if (!report(!do_store(&test), "should not trigger")) { + dbtr_uninstall_trigger(); + report_prefix_pop(); + report_skip("enable trigger: no disable"); + + return; + } + + report_prefix_pop(); + report_prefix_push("enable trigger"); + + ret = sbi_debug_enable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_enable_triggers"); + + report(do_store(&test), "triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_exec(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + + report_prefix_push("exec trigger"); + /* check if loads and stores trigger exec */ + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_EXECUTE, MODE_S))) { + report_prefix_pop(); + return; + } + + if (do_load(&test)) + report_fail("triggered by load"); + + if (do_store(&test)) + report_fail("triggered by store"); + + dbtr_uninstall_trigger(); + + /* Check if exec works */ + if (!dbtr_install_trigger(shmem, exec_call, gen_tdata1(type, VALUE_EXECUTE, MODE_S))) { + report_prefix_pop(); + return; + } + report(do_exec(), "triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_read(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + const unsigned long tstatus_expected = SBI_DBTR_TRIG_STATE_S | SBI_DBTR_TRIG_STATE_MAPPED; + const unsigned long tdata1 = gen_tdata1(type, VALUE_STORE, MODE_S); + static unsigned long test; + struct sbiret ret; + + report_prefix_push("read trigger"); + if (!dbtr_install_trigger(shmem, &test, tdata1)) { + report_prefix_pop(); + return; + } + + ret = sbi_debug_read_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_read_triggers"); + + report(shmem->data.tdata1 == tdata1, "tdata1 expected: 0x%016lx", tdata1); + report_info("tdata1 found: 0x%016lx", shmem->data.tdata1); + report(shmem->data.tdata2 == ((unsigned long)&test), "tdata2 expected: 0x%016lx", + (unsigned long)&test); + report_info("tdata2 found: 0x%016lx", shmem->data.tdata2); + report(shmem->data.tstate == tstatus_expected, "tstate expected: 0x%016lx", tstatus_expected); + report_info("tstate found: 0x%016lx", shmem->data.tstate); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void check_exec(unsigned long base) +{ + struct sbiret ret; + + report(do_exec(), "exec triggered"); + + ret = sbi_debug_uninstall_triggers(base, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); +} + +static void dbtr_test_multiple(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type, + unsigned long num_trigs) +{ + static unsigned long test[2]; + struct sbiret ret; + bool have_three = num_trigs > 2; + + if (num_trigs < 2) { + report_skip("test multiple"); + return; + } + + report_prefix_push("test multiple"); + + if (!dbtr_install_trigger(shmem, &test[0], gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + if (!dbtr_install_trigger(shmem, &test[1], gen_tdata1(type, VALUE_LOAD, MODE_S))) + goto error; + if (have_three && + !dbtr_install_trigger(shmem, exec_call, gen_tdata1(type, VALUE_EXECUTE, MODE_S))) { + ret = sbi_debug_uninstall_triggers(1, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); + goto error; + } + + report(do_store(&test[0]), "store triggered"); + + if (do_load(&test[0])) + report_fail("store triggered by load"); + + report(do_load(&test[1]), "load triggered"); + + if (do_store(&test[1])) + report_fail("load triggered by store"); + + if (have_three) + check_exec(2); + + ret = sbi_debug_uninstall_triggers(1, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); + + if (do_load(&test[1])) + report_fail("load triggered after uninstall"); + + report(do_store(&test[0]), "store triggered"); + + if (!have_three && + dbtr_install_trigger(shmem, exec_call, gen_tdata1(type, VALUE_EXECUTE, MODE_S))) + check_exec(1); + +error: + ret = sbi_debug_uninstall_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_uninstall_triggers"); + + install_exception_handler(EXC_BREAKPOINT, NULL); + report_prefix_pop(); +} + +static void dbtr_test_multiple_types(struct sbi_dbtr_shmem_entry *shmem, unsigned long type) +{ + static unsigned long test; + + report_prefix_push("test multiple types"); + + /* check if loads and stores trigger exec */ + if (!dbtr_install_trigger(shmem, &test, + gen_tdata1(type, VALUE_EXECUTE | VALUE_LOAD | VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + report(do_load(&test), "load triggered"); + + report(do_store(&test), "store triggered"); + + dbtr_uninstall_trigger(); + + /* Check if exec works */ + if (!dbtr_install_trigger(shmem, exec_call, + gen_tdata1(type, VALUE_EXECUTE | VALUE_LOAD | VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + report(do_exec(), "exec triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_disable_uninstall(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + + report_prefix_push("disable uninstall"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + ret = sbi_debug_disable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_disable_triggers"); + + dbtr_uninstall_trigger(); + + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + report(do_store(&test), "triggered"); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +static void dbtr_test_uninstall_enable(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + + report_prefix_push("uninstall enable"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + dbtr_uninstall_trigger(); + + ret = sbi_debug_enable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_enable_triggers"); + + install_exception_handler(EXC_BREAKPOINT, dbtr_exception_handler); + + report(!do_store(&test), "should not trigger"); + + install_exception_handler(EXC_BREAKPOINT, NULL); + report_prefix_pop(); +} + +static void dbtr_test_uninstall_update(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + static unsigned long test; + struct sbiret ret; + bool kfail; + + report_prefix_push("uninstall update"); + if (!dbtr_install_trigger(shmem, NULL, gen_tdata1(type, VALUE_NONE, MODE_NONE))) { + report_prefix_pop(); + return; + } + + dbtr_uninstall_trigger(); + + shmem->id.idx = 0; + shmem->data.tdata1 = gen_tdata1(type, VALUE_STORE, MODE_S); + shmem->data.tdata2 = (unsigned long)&test; + + /* + * Known broken update_triggers. + * https://lore.kernel.org/opensbi/aDdp1UeUh7GugeHp@ghost/T/#t + */ + kfail = __sbi_get_imp_id() == SBI_IMPL_OPENSBI && + __sbi_get_imp_version() < sbi_impl_opensbi_mk_version(1, 7); + ret = sbi_debug_update_triggers(1); + sbiret_kfail_error(kfail, &ret, SBI_ERR_FAILURE, "sbi_debug_update_triggers"); + + install_exception_handler(EXC_BREAKPOINT, dbtr_exception_handler); + + report(!do_store(&test), "should not trigger"); + + install_exception_handler(EXC_BREAKPOINT, NULL); + report_prefix_pop(); +} + +static void dbtr_test_disable_read(struct sbi_dbtr_shmem_entry *shmem, enum McontrolType type) +{ + const unsigned long tstatus_expected = SBI_DBTR_TRIG_STATE_S | SBI_DBTR_TRIG_STATE_MAPPED; + const unsigned long tdata1 = gen_tdata1(type, VALUE_STORE, MODE_NONE); + static unsigned long test; + struct sbiret ret; + + report_prefix_push("disable read"); + if (!dbtr_install_trigger(shmem, &test, gen_tdata1(type, VALUE_STORE, MODE_S))) { + report_prefix_pop(); + return; + } + + ret = sbi_debug_disable_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_disable_triggers"); + + ret = sbi_debug_read_triggers(0, 1); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_read_triggers"); + + report(shmem->data.tdata1 == tdata1, "tdata1 expected: 0x%016lx", tdata1); + report_info("tdata1 found: 0x%016lx", shmem->data.tdata1); + report(shmem->data.tdata2 == ((unsigned long)&test), "tdata2 expected: 0x%016lx", + (unsigned long)&test); + report_info("tdata2 found: 0x%016lx", shmem->data.tdata2); + report(shmem->data.tstate == tstatus_expected, "tstate expected: 0x%016lx", tstatus_expected); + report_info("tstate found: 0x%016lx", shmem->data.tstate); + + dbtr_uninstall_trigger(); + report_prefix_pop(); +} + +void check_dbtr(void) +{ + static struct sbi_dbtr_shmem_entry shmem[RV_MAX_TRIGGERS] = {}; + unsigned long num_trigs; + enum McontrolType trig_type; + struct sbiret ret; + + report_prefix_push("dbtr"); + + if (!sbi_probe(SBI_EXT_DBTR)) { + report_skip("extension not available"); + report_prefix_pop(); + return; + } + + if (__sbi_get_imp_id() == SBI_IMPL_OPENSBI && + __sbi_get_imp_version() < sbi_impl_opensbi_mk_version(1, 5)) { + report_skip("OpenSBI < v1.5 detected, skipping tests"); + report_prefix_pop(); + return; + } + + num_trigs = dbtr_test_num_triggers(); + if (!num_trigs) + goto error; + + trig_type = dbtr_test_type(&num_trigs); + if (trig_type == SBI_DBTR_TDATA1_TYPE_NONE) + goto error; + + ret = sbi_debug_set_shmem(shmem); + sbiret_report_error(&ret, SBI_SUCCESS, "sbi_debug_set_shmem"); + + ret = dbtr_test_store_install_uninstall(&shmem[0], trig_type); + /* install or uninstall failed */ + if (ret.error != SBI_SUCCESS) + goto error; + + dbtr_test_load(&shmem[0], trig_type); + dbtr_test_exec(&shmem[0], trig_type); + dbtr_test_read(&shmem[0], trig_type); + dbtr_test_disable_enable(&shmem[0], trig_type); + dbtr_test_update(&shmem[0], trig_type); + dbtr_test_multiple_types(&shmem[0], trig_type); + dbtr_test_multiple(shmem, trig_type, num_trigs); + dbtr_test_disable_uninstall(&shmem[0], trig_type); + dbtr_test_uninstall_enable(&shmem[0], trig_type); + dbtr_test_uninstall_update(&shmem[0], trig_type); + dbtr_test_disable_read(&shmem[0], trig_type); + +error: + report_prefix_pop(); +} diff --git a/riscv/sbi-tests.h b/riscv/sbi-tests.h index d5c4ae70..c1ebf016 100644 --- a/riscv/sbi-tests.h +++ b/riscv/sbi-tests.h @@ -97,8 +97,10 @@ static inline bool env_enabled(const char *env) return s && (*s == '1' || *s == 'y' || *s == 'Y'); } +void split_phys_addr(phys_addr_t paddr, unsigned long *hi, unsigned long *lo); void sbi_bad_fid(int ext); void check_sse(void); +void check_dbtr(void); #endif /* __ASSEMBLER__ */ #endif /* _RISCV_SBI_TESTS_H_ */ diff --git a/riscv/sbi.c b/riscv/sbi.c index edb1a6be..3b8aadce 100644 --- a/riscv/sbi.c +++ b/riscv/sbi.c @@ -105,7 +105,7 @@ static int rand_online_cpu(prng_state *ps) return cpu; } -static void split_phys_addr(phys_addr_t paddr, unsigned long *hi, unsigned long *lo) +void split_phys_addr(phys_addr_t paddr, unsigned long *hi, unsigned long *lo) { *lo = (unsigned long)paddr; *hi = 0; @@ -1561,6 +1561,7 @@ int main(int argc, char **argv) check_susp(); check_sse(); check_fwft(); + check_dbtr(); return report_summary(); } -- 2.43.0

2 weeks, 2 days

3
3
0 0

[PATCH v2 0/3] replace `allow(...)` lints with `expect(...)`

by Onur Özkan

Changes in v2: - Removed lints are not replaced with `expect` in the first diff. - Removals are done in separate diffs for each. The `#[allow(clippy::non_send_fields_in_send_ty)]` removal was tested on 1.81 and clippy was still happy with it. I couldn't test it on 1.78 because when I go below 1.81 `menuconfig` no longer shows the Rust option. And any manual changes I make to `.config` are immediately reverted on `make` invocations. Onur Özkan (3): replace `#[allow(...)]` with `#[expect(...)]` rust: remove `#[allow(clippy::unnecessary_cast)]` rust: remove `#[allow(clippy::non_send_fields_in_send_ty)]` drivers/gpu/nova-core/regs.rs | 2 +- rust/compiler_builtins.rs | 2 +- rust/kernel/alloc/allocator_test.rs | 2 +- rust/kernel/cpufreq.rs | 1 - rust/kernel/devres.rs | 2 +- rust/kernel/driver.rs | 2 +- rust/kernel/drm/ioctl.rs | 8 ++++---- rust/kernel/error.rs | 3 +-- rust/kernel/init.rs | 6 +++--- rust/kernel/kunit.rs | 2 +- rust/kernel/opp.rs | 4 ++-- rust/kernel/types.rs | 2 +- rust/macros/helpers.rs | 2 +- 13 files changed, 18 insertions(+), 20 deletions(-) -- 2.50.0

2 weeks, 2 days

3
6
0 0

[PATCH 0/2] Possible TTY privilege escalation in TIOCSTI ioctl

by Abhinav Saxena via B4 Relay

This patch series was initially sent to security(a)k.o; resending it in public. I might follow-up with a tests series which addresses similar issues with TIOCLINUX. =============== The TIOCSTI ioctl uses capable(CAP_SYS_ADMIN) for access control, which checks the current process's credentials. However, it doesn't validate against the file opener's credentials stored in file->f_cred. This creates a potential security issue where an unprivileged process can open a TTY fd and pass it to a privileged process via SCM_RIGHTS. The privileged process may then inadvertently grant access based on its elevated privileges rather than the original opener's credentials. Background ========== As noted in previous discussion, while CONFIG_LEGACY_TIOCSTI can restrict TIOCSTI usage, it is enabled by default in most distributions. Even when CONFIG_LEGACY_TIOCSTI=n, processes with CAP_SYS_ADMIN can still use TIOCSTI according to the Kconfig documentation. Additionally, CONFIG_LEGACY_TIOCSTI controls the default value for the dev.tty.legacy_tiocsti sysctl, which remains runtime-configurable. This means the described attack vector could work on systems even with CONFIG_LEGACY_TIOCSTI=n, particularly on Ubuntu 24.04 where it's "restricted" but still functional. Solution Approach ================= This series addresses the issue through SELinux LSM integration rather than modifying core TTY credential checking to avoid potential compatibility issues with existing userspace. The enhancement adds proper current task and file credential capability validation in SELinux's selinux_file_ioctl() hook specifically for TIOCSTI operations. Testing ======= All patches have been validated using: - scripts/checkpatch.pl --strict (0 errors, 0 warnings) - Functional testing on kernel v6.16-rc2 - File descriptor passing security test scenarios - SELinux policy enforcement testing The fd_passing_security test demonstrates the security concern. To verify, disable legacy TIOCSTI and run the test: $ echo "0" | sudo tee /proc/sys/dev/tty/legacy_tiocsti $ sudo ./tools/testing/selftests/tty/tty_tiocsti_test -t fd_passing_security Patch Overview ============== PATCH 1/2: selftests/tty: add TIOCSTI test suite Comprehensive test suite demonstrating the issue and fix validation PATCH 2/2: selinux: add capability checks for TIOCSTI ioctl Core security enhancement via SELinux LSM hook References ========== - tty_ioctl(4) - documents TIOCSTI ioctl and capability requirements - commit 83efeeeb3d04 ("tty: Allow TIOCSTI to be disabled") - Documentation/security/credentials.rst - https://github.com/KSPP/linux/issues/156 - https://lore.kernel.org/linux-hardening/Y0m9l52AKmw6Yxi1@hostpad/ - drivers/tty/Kconfig Configuration References: [1] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri… [2] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri… [3] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri… To: Shuah Khan <shuah(a)kernel.org> To: Nathan Chancellor <nathan(a)kernel.org> To: Nick Desaulniers <nick.desaulniers+lkml(a)gmail.com> To: Bill Wendling <morbo(a)google.com> To: Justin Stitt <justinstitt(a)google.com> To: Paul Moore <paul(a)paul-moore.com> To: Stephen Smalley <stephen.smalley.work(a)gmail.com> To: Ondrej Mosnacek <omosnace(a)redhat.com> Cc: linux-kernel(a)vger.kernel.org Cc: linux-kselftest(a)vger.kernel.org Cc: llvm(a)lists.linux.dev Cc: selinux(a)vger.kernel.org Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com> --- Abhinav Saxena (2): selftests/tty: add TIOCSTI test suite selinux: add capability checks for TIOCSTI ioctl security/selinux/hooks.c | 6 + tools/testing/selftests/tty/Makefile | 6 +- tools/testing/selftests/tty/config | 1 + tools/testing/selftests/tty/tty_tiocsti_test.c | 421 +++++++++++++++++++++++++ 4 files changed, 433 insertions(+), 1 deletion(-) --- base-commit: 5adb635077d1b4bd65b183022775a59a378a9c00 change-id: 20250618-toicsti-bug-7822b8e94a32 Best regards, -- Abhinav Saxena <xandfury(a)gmail.com>

2 weeks, 2 days

7
10
0 0

[PATCH RFT v17 0/8] fork: Support shadow stacks in clone3()

by Mark Brown

The kernel has recently added support for shadow stacks, currently x86 only using their CET feature but both arm64 and RISC-V have equivalent features (GCS and Zicfiss respectively), I am actively working on GCS[1]. With shadow stacks the hardware maintains an additional stack containing only the return addresses for branch instructions which is not generally writeable by userspace and ensures that any returns are to the recorded addresses. This provides some protection against ROP attacks and making it easier to collect call stacks. These shadow stacks are allocated in the address space of the userspace process. Our API for shadow stacks does not currently offer userspace any flexiblity for managing the allocation of shadow stacks for newly created threads, instead the kernel allocates a new shadow stack with the same size as the normal stack whenever a thread is created with the feature enabled. The stacks allocated in this way are freed by the kernel when the thread exits or shadow stacks are disabled for the thread. This lack of flexibility and control isn't ideal, in the vast majority of cases the shadow stack will be over allocated and the implicit allocation and deallocation is not consistent with other interfaces. As far as I can tell the interface is done in this manner mainly because the shadow stack patches were in development since before clone3() was implemented. Since clone3() is readily extensible let's add support for specifying a shadow stack when creating a new thread or process, keeping the current implicit allocation behaviour if one is not specified either with clone3() or through the use of clone(). The user must provide a shadow stack pointer, this must point to memory mapped for use as a shadow stackby map_shadow_stack() with an architecture specified shadow stack token at the top of the stack. Yuri Khrustalev has raised questions from the libc side regarding discoverability of extended clone3() structure sizes[2], this seems like a general issue with clone3(). There was a suggestion to add a hwcap on arm64 which isn't ideal but is doable there, though architecture specific mechanisms would also be needed for x86 (and RISC-V if it's support gets merged before this does). Please note that the x86 portions of this code are build tested only, I don't appear to have a system that can run CET available to me. [1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87… [2] https://lore.kernel.org/r/aCs65ccRQtJBnZ_5@arm.com Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Changes in v17: - Rebase onto v6.16-rc1. - Link to v16: https://lore.kernel.org/r/20250416-clone3-shadow-stack-v16-0-2ffc9ca3917b@k… Changes in v16: - Rebase onto v6.15-rc2. - Roll in fixes from x86 testing from Rick Edgecombe. - Rework so that the argument is shadow_stack_token. - Link to v15: https://lore.kernel.org/r/20250408-clone3-shadow-stack-v15-0-3fa245c6e3be@k… Changes in v15: - Rebase onto v6.15-rc1. - Link to v14: https://lore.kernel.org/r/20250206-clone3-shadow-stack-v14-0-805b53af73b9@k… Changes in v14: - Rebase onto v6.14-rc1. - Link to v13: https://lore.kernel.org/r/20241203-clone3-shadow-stack-v13-0-93b89a81a5ed@k… Changes in v13: - Rebase onto v6.13-rc1. - Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@k… Changes in v12: - Add the regular prctl() to the userspace API document since arm64 support is queued in -next. - Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k… Changes in v11: - Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and integrate arm64 support. - Rework the interface to specify a shadow stack pointer rather than a base and size like we do for the regular stack. - Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k… Changes in v10: - Integrate fixes & improvements for the x86 implementation from Rick Edgecombe. - Require that the shadow stack be VM_WRITE. - Require that the shadow stack base and size be sizeof(void *) aligned. - Clean up trailing newline. - Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke… Changes in v9: - Pull token validation earlier and report problems with an error return to parent rather than signal delivery to the child. - Verify that the top of the supplied shadow stack is VM_SHADOW_STACK. - Rework token validation to only do the page mapping once. - Drop no longer needed support for testing for signals in selftest. - Fix typo in comments. - Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke… Changes in v8: - Fix token verification with user specified shadow stack. - Don't track user managed shadow stacks for child processes. - Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke… Changes in v7: - Rebase onto v6.11-rc1. - Typo fixes. - Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke… Changes in v6: - Rebase onto v6.10-rc3. - Ensure we don't try to free the parent shadow stack in error paths of x86 arch code. - Spelling fixes in userspace API document. - Additional cleanups and improvements to the clone3() tests to support the shadow stack tests. - Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke… Changes in v5: - Rebase onto v6.8-rc2. - Rework ABI to have the user allocate the shadow stack memory with map_shadow_stack() and a token. - Force inlining of the x86 shadow stack enablement. - Move shadow stack enablement out into a shared header for reuse by other tests. - Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke… Changes in v4: - Formatting changes. - Use a define for minimum shadow stack size and move some basic validation to fork.c. - Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke… Changes in v3: - Rebase onto v6.7-rc2. - Remove stale shadow_stack in internal kargs. - If a shadow stack is specified unconditionally use it regardless of CLONE_ parameters. - Force enable shadow stacks in the selftest. - Update changelogs for RISC-V feature rename. - Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke… Changes in v2: - Rebase onto v6.7-rc1. - Remove ability to provide preallocated shadow stack, just specify the desired size. - Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke… --- Mark Brown (8): arm64/gcs: Return a success value from gcs_alloc_thread_stack() Documentation: userspace-api: Add shadow stack API documentation selftests: Provide helper header for shadow stack testing fork: Add shadow stack support to clone3() selftests/clone3: Remove redundant flushes of output streams selftests/clone3: Factor more of main loop into test_clone3() selftests/clone3: Allow tests to flag if -E2BIG is a valid error code selftests/clone3: Test shadow stack support Documentation/userspace-api/index.rst | 1 + Documentation/userspace-api/shadow_stack.rst | 44 +++++ arch/arm64/include/asm/gcs.h | 8 +- arch/arm64/kernel/process.c | 8 +- arch/arm64/mm/gcs.c | 61 +++++- arch/x86/include/asm/shstk.h | 11 +- arch/x86/kernel/process.c | 2 +- arch/x86/kernel/shstk.c | 57 +++++- include/asm-generic/cacheflush.h | 11 ++ include/linux/sched/task.h | 17 ++ include/uapi/linux/sched.h | 9 +- kernel/fork.c | 96 +++++++-- tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++---- tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++- tools/testing/selftests/ksft_shstk.h | 98 ++++++++++ 15 files changed, 633 insertions(+), 81 deletions(-) --- base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494 change-id: 20231019-clone3-shadow-stack-15d40d2bf536 Best regards, -- Mark Brown <broonie(a)kernel.org>

2 weeks, 2 days

3
12
0 0

Re: [PATCH v2] selftests: cachestat: Refactor test to remove duplication

by Nhat Pham

On Tue, Jun 24, 2025 at 9:58 PM Suresh Chandrappa <suresh.k.chandrappa(a)gmail.com> wrote: > > Hi @nphamcs > Can you please check the modified change Is this supposed to be on top of the earlier patch you sent out? In that case, you should send both together as a patch series. > > Thanks > Suresh K C > > > On Wed, 11 Jun 2025, 23:39 Suresh K C, <suresh.k.chandrappa(a)gmail.com> wrote: >> >> From: Suresh K C <suresh.k.chandrappa(a)gmail.com> >> >> Refactored the mmap and shmem test logic into a common function >> to reduce code duplication and improve maintainability >> >> Changes in v2: >> Refactored mmap and shmem tests into a common function >> Renamed test function to run_cachestat_test() >> Removed test for /proc/cpuinfo as a general /proc test case already exists >> >> Signed-off-by: Suresh K C <suresh.k.chandrappa(a)gmail.com> >> --- >> .../selftests/cachestat/test_cachestat.c | 97 ++++++------------- >> 1 file changed, 30 insertions(+), 67 deletions(-) >> >> diff --git a/tools/testing/selftests/cachestat/test_cachestat.c b/tools/testing/selftests/cachestat/test_cachestat.c >> index 81e7f6dd2279..7c2f64175943 100644 >> --- a/tools/testing/selftests/cachestat/test_cachestat.c >> +++ b/tools/testing/selftests/cachestat/test_cachestat.c >> @@ -22,7 +22,7 @@ >> >> static const char * const dev_files[] = { >> "/dev/zero", "/dev/null", "/dev/urandom", >> - "/proc/version","/proc/cpuinfo","/proc" >> + "/proc/version","/proc" So you removed one file that you added in an earlier patch, right? Then why bother adding it in the first place...? Can you either: 1. Send the two patch together as a series. In the first patch, do not add /proc/cpuinfo. or 2. Squash them into a single patch. I'll let you decide if this is worth it. >> }; >> >> void print_cachestat(struct cachestat *cs) >> @@ -33,6 +33,11 @@ void print_cachestat(struct cachestat *cs) >> cs->nr_evicted, cs->nr_recently_evicted); >> } >> >> +enum file_type { >> + FILE_MMAP, >> + FILE_SHMEM >> +}; >> + >> bool write_exactly(int fd, size_t filesize) >> { >> int random_fd = open("/dev/urandom", O_RDONLY); >> @@ -202,66 +207,8 @@ static int test_cachestat(const char *filename, bool write_random, bool create, >> return ret; >> } >> >> -bool test_cachestat_mmap(void){ >> - >> - size_t PS = sysconf(_SC_PAGESIZE); >> - size_t filesize = PS * 512 * 2;; >> - int syscall_ret; >> - size_t compute_len = PS * 512; >> - struct cachestat_range cs_range = { PS, compute_len }; >> - char *filename = "tmpshmcstat"; >> - unsigned long num_pages = compute_len / PS; >> - struct cachestat cs; >> - bool ret = true; >> - int fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666); >> - if (fd < 0) { >> - ksft_print_msg("Unable to create mmap file.\n"); >> - ret = false; >> - goto out; >> - } >> - if (ftruncate(fd, filesize)) { >> - ksft_print_msg("Unable to truncate mmap file.\n"); >> - ret = false; >> - goto close_fd; >> - } >> - if (!write_exactly(fd, filesize)) { >> - ksft_print_msg("Unable to write to mmap file.\n"); >> - ret = false; >> - goto close_fd; >> - } >> - char *map = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); >> - if (map == MAP_FAILED) { >> - ksft_print_msg("mmap failed.\n"); >> - ret = false; >> - goto close_fd; >> - } >> - >> - for (int i = 0; i < filesize; i++) { >> - map[i] = 'A'; >> - } >> - map[filesize - 1] = 'X'; >> - >> - syscall_ret = syscall(__NR_cachestat, fd, &cs_range, &cs, 0); >> - >> - if (syscall_ret) { >> - ksft_print_msg("Cachestat returned non-zero.\n"); >> - ret = false; >> - } else { >> - print_cachestat(&cs); >> - if (cs.nr_cache + cs.nr_evicted != num_pages) { >> - ksft_print_msg("Total number of cached and evicted pages is off.\n"); >> - ret = false; >> - } >> - } >> - >> -close_fd: >> - close(fd); >> - unlink(filename); >> -out: >> - return ret; >> -} >> >> -bool test_cachestat_shmem(void) >> +bool run_cachestat_test(enum file_type type) Can you just call this function test_cachestat(enum file_type type) ? >> { >> size_t PS = sysconf(_SC_PAGESIZE); >> size_t filesize = PS * 512 * 2; /* 2 2MB huge pages */ >> @@ -271,27 +218,43 @@ bool test_cachestat_shmem(void) >> char *filename = "tmpshmcstat"; >> struct cachestat cs; >> bool ret = true; >> + int fd; >> unsigned long num_pages = compute_len / PS; >> - int fd = shm_open(filename, O_CREAT | O_RDWR, 0600); >> + if (type == FILE_SHMEM) >> + fd = shm_open(filename, O_CREAT | O_RDWR, 0600); >> + else >> + fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666); >> >> if (fd < 0) { >> - ksft_print_msg("Unable to create shmem file.\n"); >> + ksft_print_msg("Unable to create file.\n"); >> ret = false; >> goto out; >> } >> >> if (ftruncate(fd, filesize)) { >> - ksft_print_msg("Unable to truncate shmem file.\n"); >> + ksft_print_msg("Unable to truncate file.\n"); >> ret = false; >> goto close_fd; >> } >> >> if (!write_exactly(fd, filesize)) { >> - ksft_print_msg("Unable to write to shmem file.\n"); >> + ksft_print_msg("Unable to write to file.\n"); >> ret = false; >> goto close_fd; >> } >> >> + if (type == FILE_MMAP){ >> + char *map = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); >> + if (map == MAP_FAILED) { >> + ksft_print_msg("mmap failed.\n"); >> + ret = false; >> + goto close_fd; >> + } >> + for (int i = 0; i < filesize; i++) { >> + map[i] = 'A'; >> + } >> + map[filesize - 1] = 'X'; >> + } >> syscall_ret = syscall(__NR_cachestat, fd, &cs_range, &cs, 0); >> >> if (syscall_ret) { >> @@ -333,7 +296,7 @@ int main(void) >> ret = 1; >> } >> >> - for (int i = 0; i < 6; i++) { >> + for (int i = 0; i < 5; i++) { >> const char *dev_filename = dev_files[i]; >> >> if (test_cachestat(dev_filename, false, false, false, >> @@ -367,14 +330,14 @@ int main(void) >> break; >> } >> >> - if (test_cachestat_shmem()) >> + if (run_cachestat_test(FILE_SHMEM)) >> ksft_test_result_pass("cachestat works with a shmem file\n"); >> else { >> ksft_test_result_fail("cachestat fails with a shmem file\n"); >> ret = 1; >> } >> >> - if (test_cachestat_mmap()) >> + if (run_cachestat_test(FILE_MMAP)) >> ksft_test_result_pass("cachestat works with a mmap file\n"); >> else { >> ksft_test_result_fail("cachestat fails with a mmap file\n"); >> -- >> 2.43.0 >>

2 weeks, 3 days

1
0
0 0

[PATCH net-next v2 0/4] selftest: net: Add selftest for netpoll

by Breno Leitao

I am submitting a new selftest for the netpoll subsystem specifically targeting the case where the RX is polling in the TX path, which is a case that we don't have any test in the tree today. This is done when netpoll_poll_dev() called, and this test creates a scenario when that is probably. The test does the following: 1) Configuring a single RX/TX queue to increase contention on the interface. 2) Generating background traffic to saturate the network, mimicking real-world congestion. 3) Sending netconsole messages to trigger netpoll polling and monitor its behavior. 4) Using dynamic netconsole targets via configfs, with the ability to delete and recreate targets during the test. 5) Running bpftrace in parallel to verify that netpoll_poll_dev() is called when expected. If it is called, then the test passes, otherwise the test is marked as skipped. In order to achieve it, I stole Jakub's bpftrace helper from [1], and did some small changes that I found useful to use the helper. So, this patchset basically contains: 1) The code stolen from Jakub 2) Improvements on bpftrace() helper 3) The selftest itself Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1] --- Changes in v1 (from RFC): - Toggle the netconsole interfaces up and down after 5 iterations. - Moved the traffic check under DEBUG (Willem de Bruijn). - Bumped the iterations to 20 given it runs faster now. - Link to the RFC: https://lore.kernel.org/r/20250612-netpoll_test-v1-1-4774fd95933f@debian.org --- Changes in v2: - Stole Jakub's helper to run bpftrace - Removed the DEBUG option and moved logs to logging - Change the code to have a higher chance of calling netpoll_poll_dev(). In my current configuration, it is hitting multiple times during the test. - Save and restore TX/RX queue size (Jakub) - Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org --- Breno Leitao (3): selftests: drv-net: Improve bpftrace utility error handling selftests: drv-net: Strip '@' prefix from bpftrace map keys selftests: net: add netpoll basic functionality test Jakub Kicinski (1): selftests: drv-net: add helper/wrapper for bpftrace tools/testing/selftests/drivers/net/Makefile | 1 + .../testing/selftests/drivers/net/netpoll_basic.py | 344 +++++++++++++++++++++ tools/testing/selftests/net/lib/py/utils.py | 38 +++ 3 files changed, 383 insertions(+) --- base-commit: eb4c27edb4d8dbfbdcc7bc03e0394a0fab8af7d5 change-id: 20250612-netpoll_test-a1324d2057c8 Best regards, -- Breno Leitao <leitao(a)debian.org>

2 weeks, 3 days

4
18
0 0

[PATCH 0/5] sysctl: Remove last two ctl_tables from the kern_table array

by Joel Granados

This is the last series to relocate sysctl tables from kernel/sysctl.c into their respective subsystems. After the move of two ctl_tables (uevent_helper & overflow{uid,gid}), five remain. They either handle variables defined within sysctl.c or serve as a common place for variables that are defined in different architectures. These five will not be moved. Note that this series includes two auxiliary changes: Removal of an unused variable and Nix-based rework of sysctl.sh test script By decentralizing sysctl registrations, subsystem maintainers regain control over their sysctl interfaces, improving maintainability and reducing the likelihood of merge conflicts. All this is made possible by the work done to reduce the ctl_table memory footprint in commit d7a76ec87195 ("sysctl: Remove check for sentinel element in ctl_table arrays"). A few comments on the process: 1. If you prefer to merge this through a non-sysctl tree, please let me know so I can avoid conflicts in linux-next. 2. Apologies if you were copied by mistake—let me know if you'd like to be removed. 3. This series builds on [1], so please rebase accordingly for clean application. 4. Testing done by running sysctl selftests on x86_64 and 0-day. Comments/Suggestions greatly appreciated [1] https://lore.kernel.org/20250509-jag-mv_ctltables_iter2-v1-0-d0ad83f5f4c3@k… Signed-off-by: Joel Granados <joel.granados(a)kernel.org> --- Joel Granados (5): sysctl: Nixify sysctl.sh sysctl: Removed unused variable uevent: mv uevent_helper into kobject_uevent.c kernel/sys.c: Move overflow{uid,gid} sysctl into kernel/sys.c sysctl: rename kern_table -> sysctl_subsys_table include/linux/sysctl.h | 1 - kernel/sys.c | 29 +++++++++++++++++++ kernel/sysctl.c | 49 +++++++------------------------- lib/kobject_uevent.c | 20 +++++++++++++ tools/testing/selftests/sysctl/sysctl.sh | 2 +- 5 files changed, 61 insertions(+), 40 deletions(-) --- base-commit: 501dd0fbc76bcae57902ea000d9c6ccd9d5f226e change-id: 20250627-jag-sysctl-823adf5732be Best regards, -- Joel Granados <joel.granados(a)kernel.org>

2 weeks, 3 days

1
5
0 0

[PATCH net-next v11 0/8] Support rate management on traffic classes in devlink and mlx5

by Mark Bloch

V11: - Refactored the devlink code to accept relative TC bandwidth share values instead of percentages. - Updated documentation to clarify that values are interpreted as relative shares. - Refactored the logic in mlx5 to support proportional scaling for tc-bw values. - Switched to `nlmsg_for_each_attr_type()` for cleaner attribute parsing. - Added a hardware selftest to validate TC bandwidth behavior. - Refactored esw_qos_is_node_empty for readability. V10: - Added netdevsim selftest for tc-bw ops. - Dropped header: field as it’s unnecessary for local constants in devlink.yaml. V9: - Defined DEVLINK_RATE_TCS_MAX as 8 in uapi/linux/devlink.h. - Replaced IEEE_8021QAZ_MAX_TCS with DEVLINK_RATE_TCS_MAX throughout the code. - Updated devlink-rate-tc-index-max spec to reference the correct UAPI header. V8: - Limit line width to 80 characters in mlx5 changes instead of 100. - Increase the scheduling node levels to support TC arbitration. - Ensure parent nodes are set correctly in all code paths that extend the hierarchy depth for TC arbitration. - Extended the cover letter with the ongoing discussion on devlink-rate and net-shapers. - Extended the cover letter with the Netdev talk link on this series. V7: - Fixed disabling tc-bw on leaf nodes that did not have tc-bw configured. - Fixed an issue where tc-bw was disabled on a node with assigned vports, ensuring that vport->qos.sched_node->parent is correctly updated with the cloned node. - Declared a constant for the maximum allowed Traffic Class index in devlink rate. - Added a range check to validate rate-tc-index. - Added documentation for the tc-bw argument. - Add a validation check to ensure that the total bandwidth assigned to all traffic classes sums to 100. V6: - Addressed comments on devlink patch #3. - Removed first 4 IFC patches, to be pulled from mlx5-next. V5: - Fix warning in devlink_nl_rate_tc_bw_set(). - Fix target branch of patch #4. V4: - Renamed the nested attribute for traffic class bandwidth to DEVLINK_ATTR_RATE_TC_BWS. - Changed the order of the attributes in `devlink.h`. - Refactored the initialization tc-bw array in devlink_nl_rate_tc_bw_set(). - Added extack messages to provide clear feedback on issues with tc-bw arguments. - Updated `rate-tc-bws` to support a multi-attr set, where each attribute includes an index and the corresponding bandwidth for that traffic class. - Handled the issue where the user could provide DEVLINK_ATTR_RATE_TC_BWS with duplicate indices. - Provided ynl exmaples in patch [1/5] commit message. - Take IFC patches to beginning of the series, targeted for mlx5-next. V3: - Dropped rate-tc-index, using tc-bw array index instead. - Renamed rate-bw to rate-tc-bw. - Documneted what the rate-tc-bw represents and added a range check for validation. - Intorduced devlink_nl_rate_tc_bw_set() to parse and set the TC bandwidth values. - Updated the user API in the commit message of patch 1/6 to ensure bandwidths sum equals 100. - Fixed missing filling of rate-parent in devlink_nl_rate_fill(). V2: - Included <linux/dcbnl.h> in devlink.h to resolve missing IEEE_8021QAZ_MAX_TCS definition. - Refactored the rate-tc-bw attribute structure to use a separate rate-tc-index. - Updated patch 2/6 title. This patch series extends the devlink-rate API to support traffic class (TC) bandwidth management, enabling more granular control over traffic shaping and rate limiting across multiple TCs. The API now allows users to specify bandwidth proportions for different traffic classes in a single command. This is particularly useful for managing Enhanced Transmission Selection (ETS) for groups of Virtual Functions (VFs), allowing precise bandwidth allocation across traffic classes. Additionally the series refines the QoS handling in net/mlx5 to support TC arbitration and bandwidth management on vports and rate nodes. Discussions on traffic class shaping in net-shapers began in V5 [1], where we discussed with maintainers whether net-shapers should support traffic classes and how this could be implemented. Later, after further conversations with Paolo Abeni and Simon Horman, Cosmin provided an update [2], confirming that net-shapers' tree-based hierarchy aligns well with traffic classes when treated as distinct subsets of netdev queues. Since mlx5 enforces a 1:1 mapping between TX queues and traffic classes, this approach seems feasible, though some open questions remain regarding queue reconfiguration and certain mlx5 scheduling behaviors. Building on that discussion, Cosmin has now shared a concrete implementation plan on the netdev mailing list [3]. The plan, developed in collaboration with Paolo and Simon, outlines how net-shapers can be extended to support the same use cases currently covered by devlink-rate, with the eventual goal of aligning both and simplifying the shaping infrastructure in the kernel. This work was presented at Netdev 0x19 in Zagreb [4]. There we presented how TC scheduling is enforced in mlx5 hardware, which led to discussions on the mailing list. A summary of how things work: Classification means labeling a packet with a traffic class based on the packet's DSCP or VLAN PCP field, then treating packets with different traffic classes differently during transmit processing. In a virtualized setup, VFs are untrusted and do not control classification or shaping. Classification is done by the hardware using a prio-to-TC mapping set by the hypervisor. VFs only select which send queue to use and are expected to respect the classification logic by sending each traffic class on its dedicated queue. As stated in the net-shapers plan [3], each transmit queue should carry only a single traffic class. Mixing classes in a single queue can lead to HOL blocking. In the mlx5 implementation, if the queue used does not match the classified traffic class, the hardware moves the queue to the correct TC scheduler. This movement is not a reclassification; it’s a necessary enforcement step to ensure traffic class isolation is maintained. Extend devlink-rate API to support rate management on TCs: - devlink: Extend the devlink rate API to support traffic class bandwidth management Introduce a no-op implementation: - net/mlx5: Add no-op implementation for setting tc-bw on rate objects Add support for enabling and disabling TC QoS on vports and nodes: - net/mlx5: Add support for setting tc-bw on nodes - net/mlx5: Add traffic class scheduling support for vport QoS Support for setting tc-bw on rate objects: - net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw [1] https://lore.kernel.org/netdev/20241204220931.254964-1-tariqt@nvidia.com/ [2] https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.cam… [3] https://lore.kernel.org/netdev/d9831d0c940a7b77419abe7c7330e822bbfd1cfb.cam… [4] https://netdevconf.info/0x19/sessions/talk/optimizing-bandwidth-allocation-… Carolina Jubran (8): netlink: introduce type-checking attribute iteration for nlmsg devlink: Extend devlink rate API with traffic classes bandwidth management selftest: netdevsim: Add devlink rate tc-bw test net/mlx5: Add no-op implementation for setting tc-bw on rate objects net/mlx5: Add support for setting tc-bw on nodes net/mlx5: Add traffic class scheduling support for vport QoS net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw selftests: drv-net: Add test for devlink-rate traffic class bandwidth distribution Documentation/netlink/specs/devlink.yaml | 32 +- .../networking/devlink/devlink-port.rst | 8 + .../net/ethernet/mellanox/mlx5/core/devlink.c | 2 + .../net/ethernet/mellanox/mlx5/core/esw/qos.c | 1037 ++++++++++++++++- .../net/ethernet/mellanox/mlx5/core/esw/qos.h | 8 + .../net/ethernet/mellanox/mlx5/core/eswitch.h | 14 +- drivers/net/netdevsim/dev.c | 43 + drivers/net/netdevsim/netdevsim.h | 1 + drivers/net/vxlan/vxlan_vnifilter.c | 13 +- fs/nfsd/nfsctl.c | 36 +- include/net/devlink.h | 8 + include/net/netlink.h | 14 + include/uapi/linux/devlink.h | 9 + net/devlink/netlink_gen.c | 15 +- net/devlink/netlink_gen.h | 1 + net/devlink/rate.c | 129 ++ .../drivers/net/hw/devlink_rate_tc_bw.py | 466 ++++++++ .../drivers/net/netdevsim/devlink.sh | 51 + .../testing/selftests/net/lib/py/__init__.py | 2 +- tools/testing/selftests/net/lib/py/ynl.py | 5 + 20 files changed, 1823 insertions(+), 71 deletions(-) create mode 100755 tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py base-commit: 8dacfd92dbefee829ca555a860e86108fdd1d55b -- 2.34.1

2 weeks, 3 days

2
9
0 0

[PATCH net-next] selftests: forwarding: lib: Split setup_wait()

by Petr Machata

setup_wait() takes an optional argument and then is called from the top level of the test script. That confuses shellcheck, which thinks that maybe the intention is to pass $1 of the script to the function, which is never the case. To avoid having to annotate every single new test with a SC disable, split the function in two: one that takes a mandatory argument, and one that takes no argument at all. Convert the two existing users of that optional argument, both in Spectrum resource selftest, to use the new form. Clean up vxlan_bridge_1q_mc_ul.sh to not pass a now-unused argument. Signed-off-by: Petr Machata <petrm(a)nvidia.com> --- Notes: CC: Shuah Khan <shuah(a)kernel.org> CC: Matthieu Baerts <matttbe(a)kernel.org> CC: linux-kselftest(a)vger.kernel.org .../drivers/net/mlxsw/spectrum-2/resource_scale.sh | 2 +- .../drivers/net/mlxsw/spectrum/resource_scale.sh | 2 +- tools/testing/selftests/net/forwarding/lib.sh | 9 +++++++-- .../selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh | 2 +- 4 files changed, 10 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/drivers/net/mlxsw/spectrum-2/resource_scale.sh b/tools/testing/selftests/drivers/net/mlxsw/spectrum-2/resource_scale.sh index 899b6892603f..d7505b933aef 100755 --- a/tools/testing/selftests/drivers/net/mlxsw/spectrum-2/resource_scale.sh +++ b/tools/testing/selftests/drivers/net/mlxsw/spectrum-2/resource_scale.sh @@ -51,7 +51,7 @@ for current_test in ${TESTS:-$ALL_TESTS}; do fi ${current_test}_setup_prepare - setup_wait $num_netifs + setup_wait_n $num_netifs # Update target in case occupancy of a certain resource changed # following the test setup. target=$(${current_test}_get_target "$should_fail") diff --git a/tools/testing/selftests/drivers/net/mlxsw/spectrum/resource_scale.sh b/tools/testing/selftests/drivers/net/mlxsw/spectrum/resource_scale.sh index 482ebb744eba..7b98cdd0580d 100755 --- a/tools/testing/selftests/drivers/net/mlxsw/spectrum/resource_scale.sh +++ b/tools/testing/selftests/drivers/net/mlxsw/spectrum/resource_scale.sh @@ -55,7 +55,7 @@ for current_test in ${TESTS:-$ALL_TESTS}; do continue fi ${current_test}_setup_prepare - setup_wait $num_netifs + setup_wait_n $num_netifs # Update target in case occupancy of a certain resource # changed following the test setup. target=$(${current_test}_get_target "$should_fail") diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh index 83ee6a07e072..9308b2f77fed 100644 --- a/tools/testing/selftests/net/forwarding/lib.sh +++ b/tools/testing/selftests/net/forwarding/lib.sh @@ -526,9 +526,9 @@ setup_wait_dev_with_timeout() return 1 } -setup_wait() +setup_wait_n() { - local num_netifs=${1:-$NUM_NETIFS} + local num_netifs=$1; shift local i for ((i = 1; i <= num_netifs; ++i)); do @@ -539,6 +539,11 @@ setup_wait() sleep $WAIT_TIME } +setup_wait() +{ + setup_wait_n "$NUM_NETIFS" +} + wait_for_dev() { local dev=$1; shift diff --git a/tools/testing/selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh b/tools/testing/selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh index 7ec58b6b1128..462db0b603e7 100755 --- a/tools/testing/selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh +++ b/tools/testing/selftests/net/forwarding/vxlan_bridge_1q_mc_ul.sh @@ -765,7 +765,7 @@ ipv6_mcroute_fdb_sep_rx() trap cleanup EXIT setup_prepare -setup_wait "$NUM_NETIFS" +setup_wait tests_run exit "$EXIT_STATUS" -- 2.49.0

2 weeks, 3 days

2
1
0 0

[PATCH] selftests: futex: define SYS_futex on 32-bit architectures with 64-bit time_t

by Ben Zong-You Xie

glibc does not define SYS_futex for 32-bit architectures using 64-bit time_t e.g. riscv32, therefore this test fails to compile since it does not find SYS_futex in C library headers. Define SYS_futex as SYS_futex_time64 in this situation to ensure successful compilation and compatibility. Signed-off-by: Ben Zong-You Xie <ben717(a)andestech.com> Signed-off-by: Cynthia Huang <cynthia(a)andestech.com> --- tools/testing/selftests/futex/include/futextest.h | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h index ddbcfc9b7bac..7a5fd1d5355e 100644 --- a/tools/testing/selftests/futex/include/futextest.h +++ b/tools/testing/selftests/futex/include/futextest.h @@ -47,6 +47,17 @@ typedef volatile u_int32_t futex_t; FUTEX_PRIVATE_FLAG) #endif +/* + * SYS_futex is expected from system C library, in glibc some 32-bit + * architectures (e.g. RV32) are using 64-bit time_t, therefore it doesn't have + * SYS_futex defined but just SYS_futex_time64. Define SYS_futex as + * SYS_futex_time64 in this situation to ensure the compilation and the + * compatibility. + */ +#if !defined(SYS_futex) && defined(SYS_futex_time64) +#define SYS_futex SYS_futex_time64 +#endif + /** * futex() - SYS_futex syscall wrapper * @uaddr: address of first futex -- 2.34.1

2 weeks, 3 days

2
1
0 0

[PATCH] selftests/futex: Convert 32bit timespec struct to 64bit version for 32bit compatibility mode

by Terry Tritton

Futex_waitv can not accept old_timespec32 struct, so userspace should convert it from 32bit to 64bit before syscall in 32bit compatible mode. This fix is based off [1] Link: https://lore.kernel.org/all/20231203235117.29677-1-wegao@suse.com/ [1] Signed-off-by: Terry Tritton <terry.tritton(a)linaro.org> Signed-off-by: Wei Gao <wegao(a)suse.com> --- The original patch is for an identically named file and function in ltp and we need the same fix in kselftest. The patch is near identical with only a slight change to `syscall` instead of `tst_syscall`. Is the way I have tagged this appropriate? .../testing/selftests/futex/include/futex2test.h | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/tools/testing/selftests/futex/include/futex2test.h b/tools/testing/selftests/futex/include/futex2test.h index ea79662405bc..6780e51eb2d6 100644 --- a/tools/testing/selftests/futex/include/futex2test.h +++ b/tools/testing/selftests/futex/include/futex2test.h @@ -55,6 +55,13 @@ struct futex32_numa { futex_t numa; }; +#if !defined(__LP64__) +struct timespec64 { + int64_t tv_sec; + int64_t tv_nsec; +}; +#endif + /** * futex_waitv - Wait at multiple futexes, wake on any * @waiters: Array of waiters @@ -65,7 +72,15 @@ struct futex32_numa { static inline int futex_waitv(volatile struct futex_waitv *waiters, unsigned long nr_waiters, unsigned long flags, struct timespec *timo, clockid_t clockid) { +#if !defined(__LP64__) + struct timespec64 timo64 = {0}; + + timo64.tv_sec = timo->tv_sec; + timo64.tv_nsec = timo->tv_nsec; + return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, &timo64, clockid); +#else return syscall(__NR_futex_waitv, waiters, nr_waiters, flags, timo, clockid); +#endif } /* -- 2.39.5

2 weeks, 3 days

2
1
0 0

[PATCH v6 00/25] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE)

by Nicolin Chen

The vIOMMU object is designed to represent a slice of an IOMMU HW for its virtualization features shared with or passed to user space (a VM mostly) in a way of HW acceleration. This extended the HWPT-based design for more advanced virtualization feature. HW QUEUE introduced by this series as a part of the vIOMMU infrastructure represents a HW accelerated queue/buffer for VM to use exclusively, e.g. - NVIDIA's Virtual Command Queue - AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer each of which allows its IOMMU HW to directly access a queue memory owned by a guest VM and allows a guest OS to control the HW queue direclty, to avoid VM Exit overheads to improve the performance. Introduce IOMMUFD_OBJ_HW_QUEUE and its pairing IOMMUFD_CMD_HW_QUEUE_ALLOC allowing VMM to forward the IOMMU-specific queue info, such as queue base address, size, and etc. Meanwhile, a guest-owned queue needs the guest kernel to control the queue by reading/writing its consumer and producer indexes, via MMIO acceses to the hardware MMIO registers. Introduce an mmap infrastructure for iommufd to support passing through a piece of MMIO region from the host physical address space to the guest physical address space. The mmap info (offset/ length) used by an mmap syscall must be pre-allocated and returned to the user space via an output driver-data during an IOMMUFD_CMD_HW_QUEUE_ALLOC call. Thus, it requires a driver-specific user data support in the vIOMMU allocation flow. As a real-world use case, this series implements a HW QUEUE support in the tegra241-cmdqv driver for VCMDQs on NVIDIA Grace CPU. In another word, it is also the Tegra CMDQV series Part-2 (user-space support), reworked from Previous RFCv1: https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/ This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to the standard SMMUv3 operating in the nested translation mode trapping CMDQ for TLBI and ATC_INV commands, this gives a huge performance improvement: 70% to 90% reductions of invalidation time were measured by various DMA unmap tests running in a guest OS. // Unmap latencies from "dma_map_benchmark -g @granule -t @threads", // by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq" @granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0 4KB 1 35.7 us 5.3 us 16KB 1 41.8 us 6.8 us 64KB 1 68.9 us 9.9 us 128KB 1 109.0 us 12.6 us 256KB 1 187.1 us 18.0 us 4KB 2 96.9 us 6.8 us 16KB 2 97.8 us 7.5 us 64KB 2 151.5 us 10.7 us 128KB 2 257.8 us 12.7 us 256KB 2 443.0 us 17.9 us This is on Github: https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v6 Paring QEMU branch for testing: https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v6 Changelog v6 * Rebase on iommufd_hw_queue-prep-v2 * Add Reviewed-by from Kevin and Jason * [iommufd] Update kdocs and notes * [iommufd] Drop redundant pages[i] check * [iommufd] Allow nesting_parent_iova to be 0 * [iommufd] Add iommufd_hw_queue_alloc_phys() * [iommufd] Revise iommufd_viommu_alloc/destroy_mmap APIs * [iommufd] Move destroy ops to vdevice/hw_queue structures * [iommufd] Add union in hw_info struct to share out_data_type field * [iommufd] Replace iopt_pin/unpin_pages() with internal access APIs * [iommufd] Replace vdevice_alloc with vdevice_size and vdevice_init * [iommufd] Replace hw_queue_alloc with get_hw_queue_size/hw_queue_init * [iommufd] Replace IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA with init_phys * [smmu] Drop arm_smmu_domain_ipa_to_pa * [smmu] Update arm_smmu_impl_ops changes for vsmmu_init * [tegra] Add a vdev_to_vsid macro * [tegra] Add lvcmdq_mutex to protect multi queues * [tegra] Drop duplicated kcalloc for vintf->lvcmdqs (memory leak) v5 https://lore.kernel.org/all/cover.1747537752.git.nicolinc@nvidia.com/ * Rebase on v6.15-rc6 * Add Reviewed-by from Jason and Kevin * Correct typos in kdoc and update commit logs * [iommufd] Add a cosmetic fix * [iommufd] Drop unused num_pfns * [iommufd] Drop unnecessary check * [iommufd] Reorder patch sequence * [iommufd] Use io_remap_pfn_range() * [iommufd] Use success oriented flow * [iommufd] Fix max_npages calculation * [iommufd] Add more selftest coverage * [iommufd] Drop redundant static_assert * [iommufd] Fix mmap pfn range validation * [iommufd] Reject unmap on pinned iovas * [iommufd] Drop redundant vm_flags_set() * [iommufd] Drop iommufd_struct_destroy() * [iommufd] Drop redundant queue iova test * [iommufd] Use "mmio_addr" and "mmio_pfn" * [iommufd] Rename to "nesting_parent_iova" * [iommufd] Make iopt_pin_pages call option * [iommufd] Add ictx comparison in depend() * [iommufd] Add iommufd_object_alloc_ucmd() * [iommufd] Move kcalloc() after validations * [iommufd] Replace ictx setting with WARN_ON * [iommufd] Make hw_info's type bidirectional * [smmu] Add supported_vsmmu_type in impl_ops * [smmu] Drop impl report in smmu vendor struct * [tegra] Add IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV * [tegra] Replace "number of VINTFs" with a note * [tegra] Drop the redundant lvcmdq pointer setting * [tegra] Flag IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA * [tegra] Use "vintf_alloc_vsid" for vdevice_alloc op v4 https://lore.kernel.org/all/cover.1746757630.git.nicolinc@nvidia.com/ * Rebase on v6.15-rc5 * Add Reviewed-by from Vasant * Rename "vQUEUE" to "HW QUEUE" * Use "offset" and "length" for all mmap-related variables * [iommufd] Use u64 for guest PA * [iommufd] Fix typo in uAPI doc * [iommufd] Rename immap_id to offset * [iommufd] Drop the partial-size mmap support * [iommufd] Do not replace WARN_ON with WARN_ON_ONCE * [iommufd] Use "u64 base_addr" for queue base address * [iommufd] Use u64 base_pfn/num_pfns for immap structure * [iommufd] Correct the size passed in to mtree_alloc_range() * [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops v3 https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/ * Add Reviewed-by from Baolu, Pranjal, and Alok * Revise kdocs, uAPI docs, and commit logs * Rename "vCMDQ" back to "vQUEUE" for AMD cases * [tegra] Add tegra241_vcmdq_hw_flush_timeout() * [tegra] Rename vsmmu_alloc to alloc_vintf_user * [tegra] Use writel for SID replacement registers * [tegra] Move mmap removal call to vsmmu_destroy op * [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user() * [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED() * [iommufd] Add an object-type "owner" to immap structure * [iommufd] Drop the ictx input in the new for-driver APIs * [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle * [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers * [iommufd] Rename iommufd_ctx_alloc/free_mmap to _iommufd_alloc/destroy_mmap v2 https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/ * Add Reviewed-by from Jason * [smmu] Fix vsmmu initial value * [smmu] Support impl for hw_info * [tegra] Rename "slot" to "vsid" * [tegra] Update kdocs and commit logs * [tegra] Map/unmap LVCMDQ dynamically * [tegra] Refcount the previous LVCMDQ * [tegra] Return -EEXIST if LVCMDQ exists * [tegra] Simplify VINTF cleanup routine * [tegra] Use vmid and s2_domain in vsmmu * [tegra] Rename "mmap_pgoff" to "immap_id" * [tegra] Add more addr and length validation * [iommufd] Add more narrative to mmap's kdoc * [iommufd] Add iommufd_struct_depend/undepend() * [iommufd] Rename vcmdq_free op to vcmdq_destroy * [iommufd] Fix bug in iommu_copy_struct_to_user() * [iommufd] Drop is_io from iommufd_ctx_alloc_mmap() * [iommufd] Test the queue memory for its contiguity * [iommufd] Return -ENXIO if address or length fails * [iommufd] Do not change @min_last in mock_viommu_alloc() * [iommufd] Generalize TEGRA241_VCMDQ data in core structure * [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC * [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping v1 https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/ Thanks Nicolin Nicolin Chen (25): iommu: Add iommu_copy_struct_to_user helper iommu: Pass in a driver-level user data structure to viommu_init op iommufd/viommu: Allow driver-specific user data for a vIOMMU object iommufd/selftest: Support user_data in mock_viommu_alloc iommufd/selftest: Add coverage for viommu data iommufd/access: Allow access->ops to be NULL for internal use iommufd/access: Add internal APIs for HW queue to use iommufd/viommu: Add driver-defined vDEVICE support iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC iommufd: Add mmap interface iommufd/selftest: Add coverage for the new mmap interface Documentation: userspace-api: iommufd: Update HW QUEUE iommu: Allow an input type in hw_info op iommufd: Allow an input data_type via iommu_hw_info iommufd/selftest: Update hw_info coverage for an input data_type iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops iommu/tegra241-cmdqv: Use request_threaded_irq iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf() iommu/tegra241-cmdqv: Do not statically map LVCMDQs iommu/tegra241-cmdqv: Add user-space use support iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 16 +- drivers/iommu/iommufd/iommufd_private.h | 35 +- drivers/iommu/iommufd/iommufd_test.h | 20 + include/linux/iommu.h | 49 +- include/linux/iommufd.h | 156 ++++++ include/uapi/linux/iommufd.h | 145 +++++- tools/testing/selftests/iommu/iommufd_utils.h | 89 +++- .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 25 +- .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 481 +++++++++++++++++- drivers/iommu/intel/iommu.c | 4 + drivers/iommu/iommufd/device.c | 84 ++- drivers/iommu/iommufd/driver.c | 79 +++ drivers/iommu/iommufd/io_pagetable.c | 17 +- drivers/iommu/iommufd/main.c | 70 +++ drivers/iommu/iommufd/selftest.c | 150 +++++- drivers/iommu/iommufd/viommu.c | 218 +++++++- tools/testing/selftests/iommu/iommufd.c | 143 +++++- .../selftests/iommu/iommufd_fail_nth.c | 15 +- Documentation/userspace-api/iommufd.rst | 12 + 19 files changed, 1708 insertions(+), 100 deletions(-) -- 2.43.0

2 weeks, 3 days

5
76
0 0

[PATCH 0/2] membarrier: allow cpu_id to be set on more commands

by Dylan Yudaken

Hi, I noticed that only the RSEQ membarrier command allows specifying a specific cpu. I have a (extremely toy) lib I was playing around with [1] and noticed this and that being able to specify the cpu would be useful to me. I'm by no means an expert in this code though - and so could be totally missing something. Additionally this seems really difficult to actually test. I added a self test that just proces "some" interrupts are being sent, but nothing more than that. I don't know what else can be done there. Patch 1 is the main change Patch 2 is the self-test. Which maybe you don't want at all. [1]: https://github.com/DylanZA/rseqlock/commit/be7bc7214fd5aacec47e26126118f8bb… Thanks, Dylan Dylan Yudaken (2): membarrier: allow cpu_id to be set on more commands membarrier: self test for cpu specific calls kernel/sched/membarrier.c | 6 +- tools/testing/selftests/membarrier/.gitignore | 1 + tools/testing/selftests/membarrier/Makefile | 3 +- .../membarrier/membarrier_test_expedited.c | 135 ++++++++++++++++++ .../membarrier/membarrier_test_impl.h | 5 + 5 files changed, 148 insertions(+), 2 deletions(-) create mode 100644 tools/testing/selftests/membarrier/membarrier_test_expedited.c base-commit: ee88bddf7f2f5d1f1da87dd7bedc734048b70e88 -- 2.49.0

2 weeks, 4 days

3
4
0 0

[PATCH] kunit: Make default kunit_test timeout configurable via both a module parameter and a Kconfig option

by Marie Zhussupova

To accommodate varying hardware performance and use cases, the default kunit test case timeout (currently 300 seconds) is now configurable. Users can adjust the timeout by either setting the 'timeout' module parameter or the KUNIT_DEFAULT_TIMEOUT Kconfig option to their desired timeout in seconds. Signed-off-by: Marie Zhussupova <marievic(a)google.com> --- lib/kunit/Kconfig | 13 +++++++++++++ lib/kunit/test.c | 15 ++++++++------- 2 files changed, 21 insertions(+), 7 deletions(-) diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig index a97897edd964..c10ede4b1d22 100644 --- a/lib/kunit/Kconfig +++ b/lib/kunit/Kconfig @@ -93,4 +93,17 @@ config KUNIT_AUTORUN_ENABLED In most cases this should be left as Y. Only if additional opt-in behavior is needed should this be set to N. +config KUNIT_DEFAULT_TIMEOUT + int "Default value of the timeout module parameter" + default 300 + help + Sets the default timeout, in seconds, for Kunit test cases. This value + is further multiplied by a factor determined by the assigned speed + setting: 1x for `DEFAULT`, 3x for `KUNIT_SPEED_SLOW`, and 12x for + `KUNIT_SPEED_VERY_SLOW`. This allows slower tests on slower machines + sufficient time to complete. + + If unsure, the default timeout of 300 seconds is suitable for most + cases. + endif # KUNIT diff --git a/lib/kunit/test.c b/lib/kunit/test.c index 002121675605..f3c6b11f12b8 100644 --- a/lib/kunit/test.c +++ b/lib/kunit/test.c @@ -69,6 +69,13 @@ static bool enable_param; module_param_named(enable, enable_param, bool, 0); MODULE_PARM_DESC(enable, "Enable KUnit tests"); +/* + * Configure the base timeout. + */ +static unsigned long kunit_base_timeout = CONFIG_KUNIT_DEFAULT_TIMEOUT; +module_param_named(timeout, kunit_base_timeout, ulong, 0644); +MODULE_PARM_DESC(timeout, "Set the base timeout for Kunit test cases"); + /* * KUnit statistic mode: * 0 - disabled @@ -393,12 +400,6 @@ static int kunit_timeout_mult(enum kunit_speed speed) static unsigned long kunit_test_timeout(struct kunit_suite *suite, struct kunit_case *test_case) { int mult = 1; - /* - * TODO: Make the default (base) timeout configurable, so that users with - * particularly slow or fast machines can successfully run tests, while - * still taking advantage of the relative speed. - */ - unsigned long default_timeout = 300; /* * The default test timeout is 300 seconds and will be adjusted by mult @@ -409,7 +410,7 @@ static unsigned long kunit_test_timeout(struct kunit_suite *suite, struct kunit_ mult = kunit_timeout_mult(suite->attr.speed); if (test_case->attr.speed != KUNIT_SPEED_UNSET) mult = kunit_timeout_mult(test_case->attr.speed); - return mult * default_timeout * msecs_to_jiffies(MSEC_PER_SEC); + return mult * kunit_base_timeout * msecs_to_jiffies(MSEC_PER_SEC); } -- 2.50.0.rc2.761.g2dc52ea45b-goog

2 weeks, 4 days

2
1
0 0

[PATCH net-next v5] page_pool: import Jesper's page_pool benchmark

by Mina Almasry

From: Jesper Dangaard Brouer <hawk(a)kernel.org> We frequently consult with Jesper's out-of-tree page_pool benchmark to evaluate page_pool changes. Import the benchmark into the upstream linux kernel tree so that (a) we're all running the same version, (b) pave the way for shared improvements, and (c) maybe one day integrate it with nipa, if possible. Import bench_page_pool_simple from commit 35b1716d0c30 ("Add page_bench06_walk_all"), from this repository: https://github.com/netoptimizer/prototype-kernel.git Changes done during upstreaming: - Fix checkpatch issues. - Remove the tasklet logic not needed. - Move under tools/testing - Create ksft for the benchmark. - Changed slightly how the benchmark gets build. Out of tree, time_bench is built as an independent .ko. Here it is included in bench_page_pool.ko Steps to run: ``` mkdir -p /tmp/run-pp-bench make -C ./tools/testing/selftests/net/bench make -C ./tools/testing/selftests/net/bench install INSTALL_PATH=/tmp/run-pp-bench rsync --delete -avz --progress /tmp/run-pp-bench mina@$SERVER:~/ ssh mina@$SERVER << EOF cd ~/run-pp-bench && sudo ./test_bench_page_pool.sh EOF ``` Note that by default, the Makefile will build the benchmark for the currently installed kernel in /lib/modules/$(shell uname -r)/build. To build against the current tree, do: make KDIR=$(pwd) -C ./tools/testing/selftests/net/bench Output (from Jesper): ``` sudo ./test_bench_page_pool.sh (benchmark dmesg logs snipped) Fast path results: no-softirq-page_pool01 Per elem: 23 cycles(tsc) 6.571 ns ptr_ring results: no-softirq-page_pool02 Per elem: 60 cycles(tsc) 16.862 ns slow path results: no-softirq-page_pool03 Per elem: 265 cycles(tsc) 73.739 ns ``` Output (from me): ``` sudo ./test_bench_page_pool.sh (benchmark dmesg logs snipped) Fast path results: no-softirq-page_pool01 Per elem: 11 cycles(tsc) 4.177 ns ptr_ring results: no-softirq-page_pool02 Per elem: 51 cycles(tsc) 19.117 ns slow path results: no-softirq-page_pool03 Per elem: 168 cycles(tsc) 62.469 ns ``` Results of course will vary based on hardware/kernel/configs, and some variance may be there from run to run due to some noise. Cc: Jesper Dangaard Brouer <hawk(a)kernel.org> Cc: Ilias Apalodimas <ilias.apalodimas(a)linaro.org> Cc: Jakub Kicinski <kuba(a)kernel.org> Cc: Toke Høiland-Jørgensen <toke(a)toke.dk> Signed-off-by: Mina Almasry <almasrymina(a)google.com> Acked-by: Ilias Apalodimas <ilias.apalodimas(a)linaro.org> Signed-off-by: Jesper Dangaard Brouer <hawk(a)kernel.org> --- v5: https://lore.kernel.org/netdev/20250615205914.835368-1-almasrymina@google.c… - Update results in the commit message - Update to build against current tree v4: https://lore.kernel.org/netdev/20250614100853.3f2372f2@kernel.org/ - Fix more checkpatch and coccicheck issues (Jakub) v3: - Non RFC - Collect Signed-off-by from Jesper and Acked-by Ilias. - Move test_bench_page_pool.sh to address nipa complaint. - Remove `static inline` in .c files to address nipa complaint. v2: - Move under tools/selftests (Jakub) - Create ksft for it. - Remove the tasklet logic no longer needed (Jesper + Toke) RFC discussion points: - Desirable to import it? - Can the benchmark be imported as-is for an initial version? Or needs lots of modifications? - Code location. I retained the location in Jesper's tree, but a path like net/core/bench/ may make more sense. --- tools/testing/selftests/net/bench/Makefile | 7 + .../selftests/net/bench/page_pool/Makefile | 17 + .../bench/page_pool/bench_page_pool_simple.c | 276 ++++++++++++ .../net/bench/page_pool/time_bench.c | 394 ++++++++++++++++++ .../net/bench/page_pool/time_bench.h | 238 +++++++++++ .../net/bench/test_bench_page_pool.sh | 32 ++ 6 files changed, 964 insertions(+) create mode 100644 tools/testing/selftests/net/bench/Makefile create mode 100644 tools/testing/selftests/net/bench/page_pool/Makefile create mode 100644 tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c create mode 100644 tools/testing/selftests/net/bench/page_pool/time_bench.c create mode 100644 tools/testing/selftests/net/bench/page_pool/time_bench.h create mode 100755 tools/testing/selftests/net/bench/test_bench_page_pool.sh diff --git a/tools/testing/selftests/net/bench/Makefile b/tools/testing/selftests/net/bench/Makefile new file mode 100644 index 000000000000..2546c45e42f7 --- /dev/null +++ b/tools/testing/selftests/net/bench/Makefile @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: GPL-2.0 + +TEST_GEN_MODS_DIR := page_pool + +TEST_PROGS += test_bench_page_pool.sh + +include ../../lib.mk diff --git a/tools/testing/selftests/net/bench/page_pool/Makefile b/tools/testing/selftests/net/bench/page_pool/Makefile new file mode 100644 index 000000000000..0549a16ba275 --- /dev/null +++ b/tools/testing/selftests/net/bench/page_pool/Makefile @@ -0,0 +1,17 @@ +BENCH_PAGE_POOL_SIMPLE_TEST_DIR := $(realpath $(dir $(abspath $(lastword $(MAKEFILE_LIST))))) +KDIR ?= /lib/modules/$(shell uname -r)/build + +ifeq ($(V),1) +Q = +else +Q = @ +endif + +obj-m += bench_page_pool.o +bench_page_pool-y += bench_page_pool_simple.o time_bench.o + +all: + +$(Q)make -C $(KDIR) M=$(BENCH_PAGE_POOL_SIMPLE_TEST_DIR) modules + +clean: + +$(Q)make -C $(KDIR) M=$(BENCH_PAGE_POOL_SIMPLE_TEST_DIR) clean diff --git a/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c b/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c new file mode 100644 index 000000000000..f183d5e30dc6 --- /dev/null +++ b/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c @@ -0,0 +1,276 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Benchmark module for page_pool. + * + */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/module.h> +#include <linux/mutex.h> + +#include <linux/version.h> +#include <net/page_pool/helpers.h> + +#include <linux/interrupt.h> +#include <linux/limits.h> + +#include "time_bench.h" + +static int verbose = 1; +#define MY_POOL_SIZE 1024 + +static void _page_pool_put_page(struct page_pool *pool, struct page *page, + bool allow_direct) +{ + page_pool_put_page(pool, page, -1, allow_direct); +} + +/* Makes tests selectable. Useful for perf-record to analyze a single test. + * Hint: Bash shells support writing binary number like: $((2#101010) + * + * # modprobe bench_page_pool_simple run_flags=$((2#100)) + */ +static unsigned long run_flags = 0xFFFFFFFF; +module_param(run_flags, ulong, 0); +MODULE_PARM_DESC(run_flags, "Limit which bench test that runs"); + +/* Count the bit number from the enum */ +enum benchmark_bit { + bit_run_bench_baseline, + bit_run_bench_no_softirq01, + bit_run_bench_no_softirq02, + bit_run_bench_no_softirq03, +}; + +#define bit(b) (1 << (b)) +#define enabled(b) ((run_flags & (bit(b)))) + +/* notice time_bench is limited to U32_MAX nr loops */ +static unsigned long loops = 10000000; +module_param(loops, ulong, 0); +MODULE_PARM_DESC(loops, "Specify loops bench will run"); + +/* Timing at the nanosec level, we need to know the overhead + * introduced by the for loop itself + */ +static int time_bench_for_loop(struct time_bench_record *rec, void *data) +{ + uint64_t loops_cnt = 0; + int i; + + time_bench_start(rec); + /** Loop to measure **/ + for (i = 0; i < rec->loops; i++) { + loops_cnt++; + barrier(); /* avoid compiler to optimize this loop */ + } + time_bench_stop(rec, loops_cnt); + return loops_cnt; +} + +static int time_bench_atomic_inc(struct time_bench_record *rec, void *data) +{ + uint64_t loops_cnt = 0; + atomic_t cnt; + int i; + + atomic_set(&cnt, 0); + + time_bench_start(rec); + /** Loop to measure **/ + for (i = 0; i < rec->loops; i++) { + atomic_inc(&cnt); + barrier(); /* avoid compiler to optimize this loop */ + } + loops_cnt = atomic_read(&cnt); + time_bench_stop(rec, loops_cnt); + return loops_cnt; +} + +/* The ptr_ping in page_pool uses a spinlock. We need to know the minimum + * overhead of taking+releasing a spinlock, to know the cycles that can be saved + * by e.g. amortizing this via bulking. + */ +static int time_bench_lock(struct time_bench_record *rec, void *data) +{ + uint64_t loops_cnt = 0; + spinlock_t lock; + int i; + + spin_lock_init(&lock); + + time_bench_start(rec); + /** Loop to measure **/ + for (i = 0; i < rec->loops; i++) { + spin_lock(&lock); + loops_cnt++; + barrier(); /* avoid compiler to optimize this loop */ + spin_unlock(&lock); + } + time_bench_stop(rec, loops_cnt); + return loops_cnt; +} + +/* Helper for filling some page's into ptr_ring */ +static void pp_fill_ptr_ring(struct page_pool *pp, int elems) +{ + /* GFP_ATOMIC needed when under run softirq */ + gfp_t gfp_mask = GFP_ATOMIC; + struct page **array; + int i; + + array = kcalloc(elems, sizeof(struct page *), gfp_mask); + + for (i = 0; i < elems; i++) + array[i] = page_pool_alloc_pages(pp, gfp_mask); + for (i = 0; i < elems; i++) + _page_pool_put_page(pp, array[i], false); + + kfree(array); +} + +enum test_type { type_fast_path, type_ptr_ring, type_page_allocator }; + +/* Depends on compile optimizing this function */ +static int time_bench_page_pool(struct time_bench_record *rec, void *data, + enum test_type type, const char *func) +{ + uint64_t loops_cnt = 0; + gfp_t gfp_mask = GFP_ATOMIC; /* GFP_ATOMIC is not really needed */ + int i, err; + + struct page_pool *pp; + struct page *page; + + struct page_pool_params pp_params = { + .order = 0, + .flags = 0, + .pool_size = MY_POOL_SIZE, + .nid = NUMA_NO_NODE, + .dev = NULL, /* Only use for DMA mapping */ + .dma_dir = DMA_BIDIRECTIONAL, + }; + + pp = page_pool_create(&pp_params); + if (IS_ERR(pp)) { + err = PTR_ERR(pp); + pr_warn("%s: Error(%d) creating page_pool\n", func, err); + goto out; + } + pp_fill_ptr_ring(pp, 64); + + if (in_serving_softirq()) + pr_warn("%s(): in_serving_softirq fast-path\n", func); + else + pr_warn("%s(): Cannot use page_pool fast-path\n", func); + + time_bench_start(rec); + /** Loop to measure **/ + for (i = 0; i < rec->loops; i++) { + /* Common fast-path alloc that depend on in_serving_softirq() */ + page = page_pool_alloc_pages(pp, gfp_mask); + if (!page) + break; + loops_cnt++; + barrier(); /* avoid compiler to optimize this loop */ + + /* The benchmarks purpose it to test different return paths. + * Compiler should inline optimize other function calls out + */ + if (type == type_fast_path) { + /* Fast-path recycling e.g. XDP_DROP use-case */ + page_pool_recycle_direct(pp, page); + + } else if (type == type_ptr_ring) { + /* Normal return path */ + _page_pool_put_page(pp, page, false); + + } else if (type == type_page_allocator) { + /* Test if not pages are recycled, but instead + * returned back into systems page allocator + */ + get_page(page); /* cause no-recycling */ + _page_pool_put_page(pp, page, false); + put_page(page); + } else { + BUILD_BUG(); + } + } + time_bench_stop(rec, loops_cnt); +out: + page_pool_destroy(pp); + return loops_cnt; +} + +static int time_bench_page_pool01_fast_path(struct time_bench_record *rec, + void *data) +{ + return time_bench_page_pool(rec, data, type_fast_path, __func__); +} + +static int time_bench_page_pool02_ptr_ring(struct time_bench_record *rec, + void *data) +{ + return time_bench_page_pool(rec, data, type_ptr_ring, __func__); +} + +static int time_bench_page_pool03_slow(struct time_bench_record *rec, + void *data) +{ + return time_bench_page_pool(rec, data, type_page_allocator, __func__); +} + +static int run_benchmark_tests(void) +{ + uint32_t nr_loops = loops; + + /* Baseline tests */ + if (enabled(bit_run_bench_baseline)) { + time_bench_loop(nr_loops * 10, 0, "for_loop", NULL, + time_bench_for_loop); + time_bench_loop(nr_loops * 10, 0, "atomic_inc", NULL, + time_bench_atomic_inc); + time_bench_loop(nr_loops, 0, "lock", NULL, time_bench_lock); + } + + /* This test cannot activate correct code path, due to no-softirq ctx */ + if (enabled(bit_run_bench_no_softirq01)) + time_bench_loop(nr_loops, 0, "no-softirq-page_pool01", NULL, + time_bench_page_pool01_fast_path); + if (enabled(bit_run_bench_no_softirq02)) + time_bench_loop(nr_loops, 0, "no-softirq-page_pool02", NULL, + time_bench_page_pool02_ptr_ring); + if (enabled(bit_run_bench_no_softirq03)) + time_bench_loop(nr_loops, 0, "no-softirq-page_pool03", NULL, + time_bench_page_pool03_slow); + + return 0; +} + +static int __init bench_page_pool_simple_module_init(void) +{ + if (verbose) + pr_info("Loaded\n"); + + if (loops > U32_MAX) { + pr_err("Module param loops(%lu) exceeded U32_MAX(%u)\n", loops, + U32_MAX); + return -ECHRNG; + } + + run_benchmark_tests(); + + return 0; +} +module_init(bench_page_pool_simple_module_init); + +static void __exit bench_page_pool_simple_module_exit(void) +{ + if (verbose) + pr_info("Unloaded\n"); +} +module_exit(bench_page_pool_simple_module_exit); + +MODULE_DESCRIPTION("Benchmark of page_pool simple cases"); +MODULE_AUTHOR("Jesper Dangaard Brouer <netoptimizer(a)brouer.com>"); +MODULE_LICENSE("GPL"); diff --git a/tools/testing/selftests/net/bench/page_pool/time_bench.c b/tools/testing/selftests/net/bench/page_pool/time_bench.c new file mode 100644 index 000000000000..073bb36ec5f2 --- /dev/null +++ b/tools/testing/selftests/net/bench/page_pool/time_bench.c @@ -0,0 +1,394 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Benchmarking code execution time inside the kernel + * + * Copyright (C) 2014, Red Hat, Inc., Jesper Dangaard Brouer + */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/module.h> +#include <linux/time.h> + +#include <linux/perf_event.h> /* perf_event_create_kernel_counter() */ + +/* For concurrency testing */ +#include <linux/completion.h> +#include <linux/sched.h> +#include <linux/workqueue.h> +#include <linux/kthread.h> + +#include "time_bench.h" + +static int verbose = 1; + +/** TSC (Time-Stamp Counter) based ** + * See: linux/time_bench.h + * tsc_start_clock() and tsc_stop_clock() + */ + +/** Wall-clock based ** + */ + +/** PMU (Performance Monitor Unit) based ** + */ +#define PERF_FORMAT \ + (PERF_FORMAT_GROUP | PERF_FORMAT_ID | PERF_FORMAT_TOTAL_TIME_ENABLED | \ + PERF_FORMAT_TOTAL_TIME_RUNNING) + +struct raw_perf_event { + uint64_t config; /* event */ + uint64_t config1; /* umask */ + struct perf_event *save; + char *desc; +}; + +/* if HT is enable a maximum of 4 events (5 if one is instructions + * retired can be specified, if HT is disabled a maximum of 8 (9 if + * one is instructions retired) can be specified. + * + * From Table 19-1. Architectural Performance Events + * Architectures Software Developer’s Manual Volume 3: System Programming + * Guide + */ +struct raw_perf_event perf_events[] = { + { 0x3c, 0x00, NULL, "Unhalted CPU Cycles" }, + { 0xc0, 0x00, NULL, "Instruction Retired" } +}; + +#define NUM_EVTS (ARRAY_SIZE(perf_events)) + +/* WARNING: PMU config is currently broken! + */ +bool time_bench_PMU_config(bool enable) +{ + int i; + struct perf_event_attr perf_conf; + struct perf_event *perf_event; + int cpu; + + preempt_disable(); + cpu = smp_processor_id(); + pr_info("DEBUG: cpu:%d\n", cpu); + preempt_enable(); + + memset(&perf_conf, 0, sizeof(struct perf_event_attr)); + perf_conf.type = PERF_TYPE_RAW; + perf_conf.size = sizeof(struct perf_event_attr); + perf_conf.read_format = PERF_FORMAT; + perf_conf.pinned = 1; + perf_conf.exclude_user = 1; /* No userspace events */ + perf_conf.exclude_kernel = 0; /* Only kernel events */ + + for (i = 0; i < NUM_EVTS; i++) { + perf_conf.disabled = enable; + //perf_conf.disabled = (i == 0) ? 1 : 0; + perf_conf.config = perf_events[i].config; + perf_conf.config1 = perf_events[i].config1; + if (verbose) + pr_info("%s() enable PMU counter: %s\n", + __func__, perf_events[i].desc); + perf_event = perf_event_create_kernel_counter(&perf_conf, cpu, + NULL /* task */, + NULL /* overflow_handler*/, + NULL /* context */); + if (perf_event) { + perf_events[i].save = perf_event; + pr_info("%s():DEBUG perf_event success\n", __func__); + + perf_event_enable(perf_event); + } else { + pr_info("%s():DEBUG perf_event is NULL\n", __func__); + } + } + + return true; +} + +/** Generic functions ** + */ + +/* Calculate stats, store results in record */ +bool time_bench_calc_stats(struct time_bench_record *rec) +{ +#define NANOSEC_PER_SEC 1000000000 /* 10^9 */ + uint64_t ns_per_call_tmp_rem = 0; + uint32_t ns_per_call_remainder = 0; + uint64_t pmc_ipc_tmp_rem = 0; + uint32_t pmc_ipc_remainder = 0; + uint32_t pmc_ipc_div = 0; + uint32_t invoked_cnt_precision = 0; + uint32_t invoked_cnt = 0; /* 32-bit due to div_u64_rem() */ + + if (rec->flags & TIME_BENCH_LOOP) { + if (rec->invoked_cnt < 1000) { + pr_err("ERR: need more(>1000) loops(%llu) for timing\n", + rec->invoked_cnt); + return false; + } + if (rec->invoked_cnt > ((1ULL << 32) - 1)) { + /* div_u64_rem() can only support div with 32bit*/ + pr_err("ERR: Invoke cnt(%llu) too big overflow 32bit\n", + rec->invoked_cnt); + return false; + } + invoked_cnt = (uint32_t)rec->invoked_cnt; + } + + /* TSC (Time-Stamp Counter) records */ + if (rec->flags & TIME_BENCH_TSC) { + rec->tsc_interval = rec->tsc_stop - rec->tsc_start; + if (rec->tsc_interval == 0) { + pr_err("ABORT: timing took ZERO TSC time\n"); + return false; + } + /* Calculate stats */ + if (rec->flags & TIME_BENCH_LOOP) + rec->tsc_cycles = rec->tsc_interval / invoked_cnt; + else + rec->tsc_cycles = rec->tsc_interval; + } + + /* Wall-clock time calc */ + if (rec->flags & TIME_BENCH_WALLCLOCK) { + rec->time_start = rec->ts_start.tv_nsec + + (NANOSEC_PER_SEC * rec->ts_start.tv_sec); + rec->time_stop = rec->ts_stop.tv_nsec + + (NANOSEC_PER_SEC * rec->ts_stop.tv_sec); + rec->time_interval = rec->time_stop - rec->time_start; + if (rec->time_interval == 0) { + pr_err("ABORT: timing took ZERO wallclock time\n"); + return false; + } + /* Calculate stats */ + /*** Division in kernel it tricky ***/ + /* Orig: time_sec = (time_interval / NANOSEC_PER_SEC); */ + /* remainder only correct because NANOSEC_PER_SEC is 10^9 */ + rec->time_sec = div_u64_rem(rec->time_interval, NANOSEC_PER_SEC, + &rec->time_sec_remainder); + //TODO: use existing struct timespec records instead of div? + + if (rec->flags & TIME_BENCH_LOOP) { + /*** Division in kernel it tricky ***/ + /* Orig: ns = ((double)time_interval / invoked_cnt); */ + /* First get quotient */ + rec->ns_per_call_quotient = + div_u64_rem(rec->time_interval, invoked_cnt, + &ns_per_call_remainder); + /* Now get decimals .xxx precision (incorrect roundup)*/ + ns_per_call_tmp_rem = ns_per_call_remainder; + invoked_cnt_precision = invoked_cnt / 1000; + if (invoked_cnt_precision > 0) { + rec->ns_per_call_decimal = + div_u64_rem(ns_per_call_tmp_rem, + invoked_cnt_precision, + &ns_per_call_remainder); + } + } + } + + /* Performance Monitor Unit (PMU) counters */ + if (rec->flags & TIME_BENCH_PMU) { + //FIXME: Overflow handling??? + rec->pmc_inst = rec->pmc_inst_stop - rec->pmc_inst_start; + rec->pmc_clk = rec->pmc_clk_stop - rec->pmc_clk_start; + + /* Calc Instruction Per Cycle (IPC) */ + /* First get quotient */ + rec->pmc_ipc_quotient = div_u64_rem(rec->pmc_inst, rec->pmc_clk, + &pmc_ipc_remainder); + /* Now get decimals .xxx precision (incorrect roundup)*/ + pmc_ipc_tmp_rem = pmc_ipc_remainder; + pmc_ipc_div = rec->pmc_clk / 1000; + if (pmc_ipc_div > 0) { + rec->pmc_ipc_decimal = div_u64_rem(pmc_ipc_tmp_rem, + pmc_ipc_div, + &pmc_ipc_remainder); + } + } + + return true; +} + +/* Generic function for invoking a loop function and calculating + * execution time stats. The function being called/timed is assumed + * to perform a tight loop, and update the timing record struct. + */ +bool time_bench_loop(uint32_t loops, int step, char *txt, void *data, + int (*func)(struct time_bench_record *record, void *data)) +{ + struct time_bench_record rec; + + /* Setup record */ + memset(&rec, 0, sizeof(rec)); /* zero func might not update all */ + rec.version_abi = 1; + rec.loops = loops; + rec.step = step; + rec.flags = (TIME_BENCH_LOOP | TIME_BENCH_TSC | TIME_BENCH_WALLCLOCK); + + /*** Loop function being timed ***/ + if (!func(&rec, data)) { + pr_err("ABORT: function being timed failed\n"); + return false; + } + + if (rec.invoked_cnt < loops) + pr_warn("WARNING: Invoke count(%llu) smaller than loops(%d)\n", + rec.invoked_cnt, loops); + + /* Calculate stats */ + time_bench_calc_stats(&rec); + + pr_info("Type:%s Per elem: %llu cycles(tsc) %llu.%03llu ns (step:%d) - (measurement period time:%llu.%09u sec time_interval:%llu) - (invoke count:%llu tsc_interval:%llu)\n", + txt, rec.tsc_cycles, rec.ns_per_call_quotient, + rec.ns_per_call_decimal, rec.step, rec.time_sec, + rec.time_sec_remainder, rec.time_interval, rec.invoked_cnt, + rec.tsc_interval); + if (rec.flags & TIME_BENCH_PMU) + pr_info("Type:%s PMU inst/clock%llu/%llu = %llu.%03llu IPC (inst per cycle)\n", + txt, rec.pmc_inst, rec.pmc_clk, rec.pmc_ipc_quotient, + rec.pmc_ipc_decimal); + return true; +} + +/* Function getting invoked by kthread */ +static int invoke_test_on_cpu_func(void *private) +{ + struct time_bench_cpu *cpu = private; + struct time_bench_sync *sync = cpu->sync; + cpumask_t newmask = CPU_MASK_NONE; + void *data = cpu->data; + + /* Restrict CPU */ + cpumask_set_cpu(cpu->rec.cpu, &newmask); + set_cpus_allowed_ptr(current, &newmask); + + /* Synchronize start of concurrency test */ + atomic_inc(&sync->nr_tests_running); + wait_for_completion(&sync->start_event); + + /* Start benchmark function */ + if (!cpu->bench_func(&cpu->rec, data)) { + pr_err("ERROR: function being timed failed on CPU:%d(%d)\n", + cpu->rec.cpu, smp_processor_id()); + } else { + if (verbose) + pr_info("SUCCESS: ran on CPU:%d(%d)\n", cpu->rec.cpu, + smp_processor_id()); + } + cpu->did_bench_run = true; + + /* End test */ + atomic_dec(&sync->nr_tests_running); + /* Wait for kthread_stop() telling us to stop */ + while (!kthread_should_stop()) { + set_current_state(TASK_INTERRUPTIBLE); + schedule(); + } + __set_current_state(TASK_RUNNING); + return 0; +} + +void time_bench_print_stats_cpumask(const char *desc, + struct time_bench_cpu *cpu_tasks, + const struct cpumask *mask) +{ + uint64_t average = 0; + int cpu; + int step = 0; + struct sum { + uint64_t tsc_cycles; + int records; + } sum = { 0 }; + + /* Get stats */ + for_each_cpu(cpu, mask) { + struct time_bench_cpu *c = &cpu_tasks[cpu]; + struct time_bench_record *rec = &c->rec; + + /* Calculate stats */ + time_bench_calc_stats(rec); + + pr_info("Type:%s CPU(%d) %llu cycles(tsc) %llu.%03llu ns (step:%d) - (measurement period time:%llu.%09u sec time_interval:%llu) - (invoke count:%llu tsc_interval:%llu)\n", + desc, cpu, rec->tsc_cycles, rec->ns_per_call_quotient, + rec->ns_per_call_decimal, rec->step, rec->time_sec, + rec->time_sec_remainder, rec->time_interval, + rec->invoked_cnt, rec->tsc_interval); + + /* Collect average */ + sum.records++; + sum.tsc_cycles += rec->tsc_cycles; + step = rec->step; + } + + if (sum.records) /* avoid div-by-zero */ + average = sum.tsc_cycles / sum.records; + pr_info("Sum Type:%s Average: %llu cycles(tsc) CPUs:%d step:%d\n", desc, + average, sum.records, step); +} + +void time_bench_run_concurrent(uint32_t loops, int step, void *data, + const struct cpumask *mask, /* Support masking outsome CPUs*/ + struct time_bench_sync *sync, + struct time_bench_cpu *cpu_tasks, + int (*func)(struct time_bench_record *record, void *data)) +{ + int cpu, running = 0; + + if (verbose) // DEBUG + pr_warn("%s() Started on CPU:%d\n", __func__, + smp_processor_id()); + + /* Reset sync conditions */ + atomic_set(&sync->nr_tests_running, 0); + init_completion(&sync->start_event); + + /* Spawn off jobs on all CPUs */ + for_each_cpu(cpu, mask) { + struct time_bench_cpu *c = &cpu_tasks[cpu]; + + running++; + c->sync = sync; /* Send sync variable along */ + c->data = data; /* Send opaque along */ + + /* Init benchmark record */ + memset(&c->rec, 0, sizeof(struct time_bench_record)); + c->rec.version_abi = 1; + c->rec.loops = loops; + c->rec.step = step; + c->rec.flags = (TIME_BENCH_LOOP | TIME_BENCH_TSC | + TIME_BENCH_WALLCLOCK); + c->rec.cpu = cpu; + c->bench_func = func; + c->task = kthread_run(invoke_test_on_cpu_func, c, + "time_bench%d", cpu); + if (IS_ERR(c->task)) { + pr_err("%s(): Failed to start test func\n", __func__); + return; /* Argh, what about cleanup?! */ + } + } + + /* Wait until all processes are running */ + while (atomic_read(&sync->nr_tests_running) < running) { + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_timeout(10); + } + /* Kick off all CPU concurrently on completion event */ + complete_all(&sync->start_event); + + /* Wait for CPUs to finish */ + while (atomic_read(&sync->nr_tests_running)) { + set_current_state(TASK_UNINTERRUPTIBLE); + schedule_timeout(10); + } + + /* Stop the kthreads */ + for_each_cpu(cpu, mask) { + struct time_bench_cpu *c = &cpu_tasks[cpu]; + + kthread_stop(c->task); + } + + if (verbose) // DEBUG - happens often, finish on another CPU + pr_warn("%s() Finished on CPU:%d\n", __func__, + smp_processor_id()); +} diff --git a/tools/testing/selftests/net/bench/page_pool/time_bench.h b/tools/testing/selftests/net/bench/page_pool/time_bench.h new file mode 100644 index 000000000000..e113fcf341dc --- /dev/null +++ b/tools/testing/selftests/net/bench/page_pool/time_bench.h @@ -0,0 +1,238 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Benchmarking code execution time inside the kernel + * + * Copyright (C) 2014, Red Hat, Inc., Jesper Dangaard Brouer + * for licensing details see kernel-base/COPYING + */ +#ifndef _LINUX_TIME_BENCH_H +#define _LINUX_TIME_BENCH_H + +/* Main structure used for recording a benchmark run */ +struct time_bench_record { + uint32_t version_abi; + uint32_t loops; /* Requested loop invocations */ + uint32_t step; /* option for e.g. bulk invocations */ + + uint32_t flags; /* Measurements types enabled */ +#define TIME_BENCH_LOOP BIT(0) +#define TIME_BENCH_TSC BIT(1) +#define TIME_BENCH_WALLCLOCK BIT(2) +#define TIME_BENCH_PMU BIT(3) + + uint32_t cpu; /* Used when embedded in time_bench_cpu */ + + /* Records */ + uint64_t invoked_cnt; /* Returned actual invocations */ + uint64_t tsc_start; + uint64_t tsc_stop; + struct timespec64 ts_start; + struct timespec64 ts_stop; + /* PMU counters for instruction and cycles + * instructions counter including pipelined instructions + */ + uint64_t pmc_inst_start; + uint64_t pmc_inst_stop; + /* CPU unhalted clock counter */ + uint64_t pmc_clk_start; + uint64_t pmc_clk_stop; + + /* Result records */ + uint64_t tsc_interval; + uint64_t time_start, time_stop, time_interval; /* in nanosec */ + uint64_t pmc_inst, pmc_clk; + + /* Derived result records */ + uint64_t tsc_cycles; // +decimal? + uint64_t ns_per_call_quotient, ns_per_call_decimal; + uint64_t time_sec; + uint32_t time_sec_remainder; + uint64_t pmc_ipc_quotient, pmc_ipc_decimal; /* inst per cycle */ +}; + +/* For synchronizing parallel CPUs to run concurrently */ +struct time_bench_sync { + atomic_t nr_tests_running; + struct completion start_event; +}; + +/* Keep track of CPUs executing our bench function. + * + * Embed a time_bench_record for storing info per cpu + */ +struct time_bench_cpu { + struct time_bench_record rec; + struct time_bench_sync *sync; /* back ptr */ + struct task_struct *task; + /* "data" opaque could have been placed in time_bench_sync, + * but to avoid any false sharing, place it per CPU + */ + void *data; + /* Support masking outsome CPUs, mark if it ran */ + bool did_bench_run; + /* int cpu; // note CPU stored in time_bench_record */ + int (*bench_func)(struct time_bench_record *record, void *data); +}; + +/* + * Below TSC assembler code is not compatible with other archs, and + * can also fail on guests if cpu-flags are not correct. + * + * The way TSC reading is used, many iterations, does not require as + * high accuracy as described below (in Intel Doc #324264). + * + * Considering changing to use get_cycles() (#include <asm/timex.h>). + */ + +/** TSC (Time-Stamp Counter) based ** + * Recommend reading, to understand details of reading TSC accurately: + * Intel Doc #324264, "How to Benchmark Code Execution Times on Intel" + * + * Consider getting exclusive ownership of CPU by using: + * unsigned long flags; + * preempt_disable(); + * raw_local_irq_save(flags); + * _your_code_ + * raw_local_irq_restore(flags); + * preempt_enable(); + * + * Clobbered registers: "%rax", "%rbx", "%rcx", "%rdx" + * RDTSC only change "%rax" and "%rdx" but + * CPUID clears the high 32-bits of all (rax/rbx/rcx/rdx) + */ +static __always_inline uint64_t tsc_start_clock(void) +{ + /* See: Intel Doc #324264 */ + unsigned int hi, lo; + + asm volatile("CPUID\n\t" + "RDTSC\n\t" + "mov %%edx, %0\n\t" + "mov %%eax, %1\n\t" + : "=r"(hi), "=r"(lo)::"%rax", "%rbx", "%rcx", "%rdx"); + //FIXME: on 32bit use clobbered %eax + %edx + return ((uint64_t)lo) | (((uint64_t)hi) << 32); +} + +static __always_inline uint64_t tsc_stop_clock(void) +{ + /* See: Intel Doc #324264 */ + unsigned int hi, lo; + + asm volatile("RDTSCP\n\t" + "mov %%edx, %0\n\t" + "mov %%eax, %1\n\t" + "CPUID\n\t" + : "=r"(hi), "=r"(lo)::"%rax", "%rbx", "%rcx", "%rdx"); + return ((uint64_t)lo) | (((uint64_t)hi) << 32); +} + +/** Wall-clock based ** + * + * use: getnstimeofday() + * getnstimeofday(&rec->ts_start); + * getnstimeofday(&rec->ts_stop); + * + * API changed see: Documentation/core-api/timekeeping.rst + * https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#c.getnstim… + * + * We should instead use: ktime_get_real_ts64() is a direct + * replacement, but consider using monotonic time (ktime_get_ts64()) + * and/or a ktime_t based interface (ktime_get()/ktime_get_real()). + */ + +/** PMU (Performance Monitor Unit) based ** + * + * Needed for calculating: Instructions Per Cycle (IPC) + * - The IPC number tell how efficient the CPU pipelining were + */ +//lookup: perf_event_create_kernel_counter() + +bool time_bench_PMU_config(bool enable); + +/* Raw reading via rdpmc() using fixed counters + * + * From: https://github.com/andikleen/simple-pmu + */ +enum { + FIXED_SELECT = (1U << 30), /* == 0x40000000 */ + FIXED_INST_RETIRED_ANY = 0, + FIXED_CPU_CLK_UNHALTED_CORE = 1, + FIXED_CPU_CLK_UNHALTED_REF = 2, +}; + +static __always_inline unsigned int long long p_rdpmc(unsigned int in) +{ + unsigned int d, a; + + asm volatile("rdpmc" : "=d"(d), "=a"(a) : "c"(in) : "memory"); + return ((unsigned long long)d << 32) | a; +} + +/* These PMU counter needs to be enabled, but I don't have the + * configure code implemented. My current hack is running: + * sudo perf stat -e cycles:k -e instructions:k insmod lib/ring_queue_test.ko + */ +/* Reading all pipelined instruction */ +static __always_inline unsigned long long pmc_inst(void) +{ + return p_rdpmc(FIXED_SELECT | FIXED_INST_RETIRED_ANY); +} + +/* Reading CPU clock cycles */ +static __always_inline unsigned long long pmc_clk(void) +{ + return p_rdpmc(FIXED_SELECT | FIXED_CPU_CLK_UNHALTED_CORE); +} + +/* Raw reading via MSR rdmsr() is likely wrong + * FIXME: How can I know which raw MSR registers are conf for what? + */ +#define MSR_IA32_PCM0 0x400000C1 /* PERFCTR0 */ +#define MSR_IA32_PCM1 0x400000C2 /* PERFCTR1 */ +#define MSR_IA32_PCM2 0x400000C3 +static inline uint64_t msr_inst(unsigned long long *msr_result) +{ + return rdmsrq_safe(MSR_IA32_PCM0, msr_result); +} + +/** Generic functions ** + */ +bool time_bench_loop(uint32_t loops, int step, char *txt, void *data, + int (*func)(struct time_bench_record *rec, void *data)); +bool time_bench_calc_stats(struct time_bench_record *rec); + +void time_bench_run_concurrent(uint32_t loops, int step, void *data, + const struct cpumask *mask, /* Support masking outsome CPUs*/ + struct time_bench_sync *sync, struct time_bench_cpu *cpu_tasks, + int (*func)(struct time_bench_record *record, void *data)); +void time_bench_print_stats_cpumask(const char *desc, + struct time_bench_cpu *cpu_tasks, + const struct cpumask *mask); + +//FIXME: use rec->flags to select measurement, should be MACRO +static __always_inline void time_bench_start(struct time_bench_record *rec) +{ + //getnstimeofday(&rec->ts_start); + ktime_get_real_ts64(&rec->ts_start); + if (rec->flags & TIME_BENCH_PMU) { + rec->pmc_inst_start = pmc_inst(); + rec->pmc_clk_start = pmc_clk(); + } + rec->tsc_start = tsc_start_clock(); +} + +static __always_inline void time_bench_stop(struct time_bench_record *rec, + uint64_t invoked_cnt) +{ + rec->tsc_stop = tsc_stop_clock(); + if (rec->flags & TIME_BENCH_PMU) { + rec->pmc_inst_stop = pmc_inst(); + rec->pmc_clk_stop = pmc_clk(); + } + //getnstimeofday(&rec->ts_stop); + ktime_get_real_ts64(&rec->ts_stop); + rec->invoked_cnt = invoked_cnt; +} + +#endif /* _LINUX_TIME_BENCH_H */ diff --git a/tools/testing/selftests/net/bench/test_bench_page_pool.sh b/tools/testing/selftests/net/bench/test_bench_page_pool.sh new file mode 100755 index 000000000000..7b8b18cfedce --- /dev/null +++ b/tools/testing/selftests/net/bench/test_bench_page_pool.sh @@ -0,0 +1,32 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# + +set -e + +DRIVER="./page_pool/bench_page_pool.ko" +result="" + +function run_test() +{ + rmmod "bench_page_pool.ko" || true + insmod $DRIVER > /dev/null 2>&1 + result=$(dmesg | tail -10) + echo "$result" + + echo + echo "Fast path results:" + echo "${result}" | grep -o -E "no-softirq-page_pool01 Per elem: ([0-9]+) cycles$tsc$ ([0-9]+\.[0-9]+) ns" + + echo + echo "ptr_ring results:" + echo "${result}" | grep -o -E "no-softirq-page_pool02 Per elem: ([0-9]+) cycles$tsc$ ([0-9]+\.[0-9]+) ns" + + echo + echo "slow path results:" + echo "${result}" | grep -o -E "no-softirq-page_pool03 Per elem: ([0-9]+) cycles$tsc$ ([0-9]+\.[0-9]+) ns" +} + +run_test + +exit 0 base-commit: afc783fa0aab9cc093fbb04871bfda406480cf8d -- 2.50.0.rc2.701.gf1e915cc24-goog

2 weeks, 4 days

4
7
0 0

[PATCH rc v2 0/4] Fix iommufd selftest FAIL and warnings with v6.16

by Nicolin Chen

A few selftest harness changes being merged to v6.16, which exposed some bugs and vulnerabilities in the iommufd selftest code. Fix them properly. Note that the patch fixing the build warnings at mfd is not ideal, as it has possibly hit some corner case in the gcc: https://lore.kernel.org/all/aEi8DV+ReF3v3Rlf@nvidia.com/ This is on github: https://github.com/nicolinc/iommufd/commits/iommufd_selftest_fixes-v6.16 Changelog: v2 * Add "Reviewed-by" from Jason * Only use kfree() in the teardown() * Add an mmap_buffer_size for readability v1 https://lore.kernel.org/all/cover.1750049883.git.nicolinc@nvidia.com/ Thanks Nicolin Nicolin Chen (4): iommufd/selftest: Fix iommufd_dirty_tracking with large hugepage sizes iommufd/selftest: Add missing close(mfd) in memfd_mmap() iommufd/selftest: Add asserts testing global mfd iommufd/selftest: Fix build warnings due to uninitialized mfd tools/testing/selftests/iommu/iommufd_utils.h | 9 ++++- tools/testing/selftests/iommu/iommufd.c | 40 ++++++++++++++----- 2 files changed, 36 insertions(+), 13 deletions(-) -- 2.43.0

2 weeks, 4 days

2
5
0 0