- Linux-kselftest-mirror - lists.linaro.org

[PATCH] kunit: Release resource upon __kunit_add_resource() failure in the Resource API

by Marie Zhussupova

__kunit_add_resource() currently does the following things in order:
initializes the resource refcount to 1, initializes the resource, and
adds the resource to the test's resource list. Currently,
__kunit_add_resource() only fails if the resource initialization fails.

The kunit_alloc_and_get_resource() and kunit_alloc_resource() functions
allocate memory for `struct kunit_resource`. However, if the subsequent
call to __kunit_add_resource() fails, the functions return NULL without
releasing the memory.

This patch adds calls to kunit_put_resource() in these functions before
returning NULL to decrease the refcount of the resource that failed to
initialize to 0. This will trigger kunit_release_resource(), which will
both call kunit_resource->free and kfree() on the resource.

Since kunit_resource->free is user defined, comments were added to note
that kunit_resource->free() should be able to handle any inconsistent
state that may result from the resource init failure.

Signed-off-by: Marie Zhussupova <marievic(a)google.com>
---
 include/kunit/resource.h | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/kunit/resource.h b/include/kunit/resource.h
index 4ad69a2642a5..2585e9a5242d 100644
--- a/include/kunit/resource.h
+++ b/include/kunit/resource.h
@@ -216,7 +216,9 @@ static inline int kunit_add_named_resource(struct kunit *test,
  * kunit_alloc_and_get_resource() - Allocates and returns a *test managed resource*.
  * @test: The test context object.
  * @init: a user supplied function to initialize the resource.
- * @free: a user supplied function to free the resource (if needed).
+ * @free: a user supplied function to free the resource (if needed). Note that,
+ * if supplied, @free will run even if @init fails: Make sure it can handle any
+ * inconsistent state which may result.
  * @internal_gfp: gfp to use for internal allocations, if unsure, use GFP_KERNEL
  * @context: for the user to pass in arbitrary data to the init function.
  *
@@ -258,6 +260,7 @@ kunit_alloc_and_get_resource(struct kunit *test,
 		kunit_get_resource(res);
 		return res;
 	}
+	kunit_put_resource(res);
 	return NULL;
 }
 
@@ -265,7 +268,9 @@ kunit_alloc_and_get_resource(struct kunit *test,
  * kunit_alloc_resource() - Allocates a *test managed resource*.
  * @test: The test context object.
  * @init: a user supplied function to initialize the resource.
- * @free: a user supplied function to free the resource (if needed).
+ * @free: a user supplied function to free the resource (if needed). Note that,
+ * if supplied, @free will run even if @init fails: Make sure it can handle any
+ * inconsistent state which may result.
  * @internal_gfp: gfp to use for internal allocations, if unsure, use GFP_KERNEL
  * @context: for the user to pass in arbitrary data to the init function.
  *
@@ -293,6 +298,7 @@ static inline void *kunit_alloc_resource(struct kunit *test,
 	if (!__kunit_add_resource(test, init, free, res, context))
 		return res->data;
 
+	kunit_put_resource(res);
 	return NULL;
 }
 
-- 
2.51.0.rc0.205.g4a044479a3-goog
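
The failure path described in this patch can be modelled in a few lines. What follows is a toy Python model, not kernel code: the `Resource` class and the `add_resource()`, `put_resource()` and `alloc_resource()` helpers are hypothetical stand-ins for `struct kunit_resource`, `__kunit_add_resource()`, `kunit_put_resource()` and `kunit_alloc_resource()`. It only illustrates why a user-supplied free callback has to tolerate a resource whose init failed.

```python
"""Toy model of the refcount flow; all names are illustrative, not the KUnit API."""


class Resource:
    def __init__(self, free=None):
        self.refcount = 0
        self.data = None
        self.free = free        # user-supplied cleanup, may be None
        self.released = False   # stands in for kfree() having run


def put_resource(res):
    """Drop one reference; release the resource when the count reaches zero."""
    res.refcount -= 1
    if res.refcount == 0:
        if res.free is not None:
            res.free(res)       # runs even if init never completed
        res.released = True


def add_resource(resources, init, res, context):
    """Return 0 on success, or the non-zero error reported by init."""
    res.refcount = 1            # on success this reference belongs to the list
    if init is not None:
        err = init(res, context)
        if err:
            return err
    resources.append(res)
    return 0


def alloc_resource(resources, init, free, context):
    res = Resource(free)
    if add_resource(resources, init, res, context) == 0:
        return res.data
    put_resource(res)           # modelled fix: drop the initial reference
    return None


def failing_init(res, context):
    res.data = {"partial": True}    # half-initialised state left behind
    return -1


freed = []


def tolerant_free(res):
    freed.append(res.data)          # must cope with partial state


resources = []
result = alloc_resource(resources, failing_init, tolerant_free, None)
```

In this model the failed allocation returns `None`, nothing is added to the list, and the free callback still sees the partially initialised data, which is the situation the new kernel-doc comments warn about.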

[PATCH net] selftests: drv-net: wait for carrier

by Jakub Kicinski

On fast machines the tests run in quick succession so even
when tests clean up after themselves the carrier may need
some time to come back.

Specifically in NIPA when ping.py runs right after netpoll_basic.py
the first ping command fails.

Since the context manager callbacks are now common NetDrvEpEnv
gets an ip link up call as well.

Fixes: b4db9f840283 ("selftests: drivers: add scaffolding for Netlink tests in Python")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: willemb(a)google.com
CC: petrm(a)nvidia.com
CC: linux-kselftest(a)vger.kernel.org
---
 .../selftests/drivers/net/lib/py/__init__.py |  2 +-
 .../selftests/drivers/net/lib/py/env.py      | 38 +++++++++----------
 tools/testing/selftests/net/lib/py/utils.py  | 18 +++++++++
 3 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/lib/py/__init__.py b/tools/testing/selftests/drivers/net/lib/py/__init__.py
index 8711c67ad658..a07b56a75c8a 100644
--- a/tools/testing/selftests/drivers/net/lib/py/__init__.py
+++ b/tools/testing/selftests/drivers/net/lib/py/__init__.py
@@ -15,7 +15,7 @@ KSFT_DIR = (Path(__file__).parent / "../../../..").resolve()
         NlError, RtnlFamily, DevlinkFamily
     from net.lib.py import CmdExitFailure
     from net.lib.py import bkg, cmd, bpftool, bpftrace, defer, ethtool, \
-        fd_read_timeout, ip, rand_port, tool, wait_port_listen
+        fd_read_timeout, ip, rand_port, tool, wait_port_listen, wait_file
     from net.lib.py import fd_read_timeout
     from net.lib.py import KsftSkipEx, KsftFailEx, KsftXfailEx
     from net.lib.py import ksft_disruptive, ksft_exit, ksft_pr, ksft_run, \
diff --git a/tools/testing/selftests/drivers/net/lib/py/env.py b/tools/testing/selftests/drivers/net/lib/py/env.py
index 1b8bd648048f..1de63734ddec 100644
--- a/tools/testing/selftests/drivers/net/lib/py/env.py
+++ b/tools/testing/selftests/drivers/net/lib/py/env.py
@@ -4,7 +4,7 @@ import os
 import time
 from pathlib import Path
 from lib.py import KsftSkipEx, KsftXfailEx
-from lib.py import ksft_setup
+from lib.py import ksft_setup, wait_file
 from lib.py import cmd, ethtool, ip, CmdExitFailure
 from lib.py import NetNS, NetdevSimDev
 from .remote import Remote
@@ -25,6 +25,9 @@ from .remote import Remote
 
         self.env = self._load_env_file()
 
+        # Following attrs must be set be inheriting classes
+        self.dev = None
+
     def _load_env_file(self):
         env = os.environ.copy()
 
@@ -48,6 +51,19 @@ from .remote import Remote
             env[pair[0]] = pair[1]
         return ksft_setup(env)
 
+    def __enter__(self):
+        ip(f"link set dev {self.dev['ifname']} up")
+        wait_file(f"/sys/class/net/{self.dev['ifname']}/carrier",
+                  lambda x: x.strip() == "1")
+
+        return self
+
+    def __exit__(self, ex_type, ex_value, ex_tb):
+        """
+        __exit__ gets called at the end of a "with" block.
+        """
+        self.__del__()
+
 
 class NetDrvEnv(NetDrvEnvBase):
     """
@@ -72,17 +88,6 @@ from .remote import Remote
         self.ifname = self.dev['ifname']
         self.ifindex = self.dev['ifindex']
 
-    def __enter__(self):
-        ip(f"link set dev {self.dev['ifname']} up")
-
-        return self
-
-    def __exit__(self, ex_type, ex_value, ex_tb):
-        """
-        __exit__ gets called at the end of a "with" block.
-        """
-        self.__del__()
-
     def __del__(self):
         if self._ns:
             self._ns.remove()
@@ -219,15 +224,6 @@ from .remote import Remote
             raise Exception("Can't resolve remote interface name, multiple interfaces match")
         return v6[0]["ifname"] if v6 else v4[0]["ifname"]
 
-    def __enter__(self):
-        return self
-
-    def __exit__(self, ex_type, ex_value, ex_tb):
-        """
-        __exit__ gets called at the end of a "with" block.
-        """
-        self.__del__()
-
     def __del__(self):
         if self._ns:
             self._ns.remove()
diff --git a/tools/testing/selftests/net/lib/py/utils.py b/tools/testing/selftests/net/lib/py/utils.py
index f395c90fb0f1..c42bffea0d87 100644
--- a/tools/testing/selftests/net/lib/py/utils.py
+++ b/tools/testing/selftests/net/lib/py/utils.py
@@ -249,3 +249,21 @@ global_defer_queue = []
         if time.monotonic() > end:
             raise Exception("Waiting for port listen timed out")
         time.sleep(sleep)
+
+
+def wait_file(fname, test_fn, sleep=0.005, deadline=5, encoding='utf-8'):
+    """
+    Wait for file contents on the local system to satisfy a condition.
+    test_fn() should take one argument (file contents) and return whether
+    condition is met.
+    """
+    end = time.monotonic() + deadline
+
+    with open(fname, "r", encoding=encoding) as fp:
+        while True:
+            if test_fn(fp.read()):
+                break
+            fp.seek(0)
+            if time.monotonic() > end:
+                raise TimeoutError("Wait for file contents failed", fname)
+            time.sleep(sleep)
-- 
2.50.1
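
To see how the `wait_file()` helper from this patch behaves without a network device, the sketch below copies the function body from the diff and polls a temporary file instead of `/sys/class/net/<ifname>/carrier`. The temporary file, the `write_text()` helper, the timer thread that flips the contents to `1`, and the shortened deadlines are assumptions made for the demo only.

```python
import os
import tempfile
import threading
import time


def wait_file(fname, test_fn, sleep=0.005, deadline=5, encoding='utf-8'):
    """Copy of the helper added by the patch."""
    end = time.monotonic() + deadline

    with open(fname, "r", encoding=encoding) as fp:
        while True:
            if test_fn(fp.read()):
                break
            fp.seek(0)
            if time.monotonic() > end:
                raise TimeoutError("Wait for file contents failed", fname)
            time.sleep(sleep)


def write_text(path, text):
    """Hypothetical helper: overwrite the stand-in carrier file."""
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(text)


# Stand-in for the sysfs carrier attribute: starts at "0", flips to "1" later.
fd, carrier = tempfile.mkstemp()
os.close(fd)
write_text(carrier, "0\n")

timer = threading.Timer(0.05, write_text, args=(carrier, "1\n"))
timer.start()
wait_file(carrier, lambda x: x.strip() == "1", deadline=2)
timer.join()
carrier_came_up = True  # only reached if wait_file() returned normally

# If the condition is never met, the helper raises TimeoutError.
write_text(carrier, "0\n")
try:
    wait_file(carrier, lambda x: x.strip() == "1", deadline=0.05)
    timed_out = False
except TimeoutError:
    timed_out = True

os.unlink(carrier)
```

The file is opened once and rewound with `seek(0)` between reads, so each poll costs one read rather than an open/close pair.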

[PATCH net-next] selftests: netconsole: Validate interface selection by MAC address

by Andre Carvalho

Extend the existing netconsole cmdline selftest to also validate that
interface selection can be performed via MAC address.

The test now validates that netconsole works with both interface name
and MAC address, improving test coverage.

Suggested-by: Breno Leitao <leitao(a)debian.org>
Signed-off-by: Andre Carvalho <asantostc(a)gmail.com>
---
 .../selftests/drivers/net/lib/sh/lib_netcons.sh    | 10 +++-
 .../selftests/drivers/net/netcons_cmdline.sh       | 55 +++++++++++++---------
 2 files changed, 42 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh b/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh
index b6071e80ebbb6a33283ab6cd6bcb7b925aefdb43..8e1085e896472d5c87ec8b236240878a5b2d00d2 100644
--- a/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh
+++ b/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh
@@ -148,12 +148,20 @@ function create_dynamic_target() {
 # Generate the command line argument for netconsole following:
 #  netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
 function create_cmdline_str() {
+	local BINDMODE=${1:-"ifname"}
+	if [ "${BINDMODE}" == "ifname" ]
+	then
+		SRCDEV=${SRCIF}
+	else
+		SRCDEV=$(mac_get "${SRCIF}")
+	fi
+
 	DSTMAC=$(ip netns exec "${NAMESPACE}" \
 		 ip link show "${DSTIF}" | awk '/ether/ {print $2}')
 	SRCPORT="1514"
 	TGTPORT="6666"
 
-	echo "netconsole=\"+${SRCPORT}@${SRCIP}/${SRCIF},${TGTPORT}@${DSTIP}/${DSTMAC}\""
+	echo "netconsole=\"+${SRCPORT}@${SRCIP}/${SRCDEV},${TGTPORT}@${DSTIP}/${DSTMAC}\""
 }
 
 # Do not append the release to the header of the message
diff --git a/tools/testing/selftests/drivers/net/netcons_cmdline.sh b/tools/testing/selftests/drivers/net/netcons_cmdline.sh
index ad2fb8b1c46326c69af20f2c9d68e80fa8eb894f..a15149f3a905d7287258cd17f0e806fb50604cf4 100755
--- a/tools/testing/selftests/drivers/net/netcons_cmdline.sh
+++ b/tools/testing/selftests/drivers/net/netcons_cmdline.sh
@@ -17,10 +17,6 @@ source "${SCRIPTDIR}"/lib/sh/lib_netcons.sh
 check_netconsole_module
 
 modprobe netdevsim 2> /dev/null || true
-rmmod netconsole 2> /dev/null || true
-
-# The content of kmsg will be save to the following file
-OUTPUT_FILE="/tmp/${TARGET}"
 
 # Check for basic system dependency and exit if not found
 # check_for_dependencies
@@ -30,23 +26,38 @@ echo "6 5" > /proc/sys/kernel/printk
 trap do_cleanup EXIT
 # Create one namespace and two interfaces
 set_network
-# Create the command line for netconsole, with the configuration from the
-# function above
-CMDLINE="$(create_cmdline_str)"
-
-# Load the module, with the cmdline set
-modprobe netconsole "${CMDLINE}"
-
-# Listed for netconsole port inside the namespace and destination interface
-listen_port_and_save_to "${OUTPUT_FILE}" &
-# Wait for socat to start and listen to the port.
-wait_local_port_listen "${NAMESPACE}" "${PORT}" udp
-# Send the message
-echo "${MSG}: ${TARGET}" > /dev/kmsg
-# Wait until socat saves the file to disk
-busywait "${BUSYWAIT_TIMEOUT}" test -s "${OUTPUT_FILE}"
-# Make sure the message was received in the dst part
-# and exit
-validate_msg "${OUTPUT_FILE}"
+
+# Run the test twice, with different cmdline parameters
+for BINDMODE in "ifname" "mac"
+do
+	echo "Running with bind mode: ${BINDMODE}"
+	# Create the command line for netconsole, with the configuration from the
+	# function above
+	CMDLINE="$(create_cmdline_str "${BINDMODE}")"
+
+	# The content of kmsg will be save to the following file
+	OUTPUT_FILE="/tmp/${TARGET}-${BINDMODE}"
+
+	# Unload the module, if present
+	rmmod netconsole 2> /dev/null || true
+	# Load the module, with the cmdline set
+	modprobe netconsole "${CMDLINE}"
+
+	# Listed for netconsole port inside the namespace and destination interface
+	listen_port_and_save_to "${OUTPUT_FILE}" &
+	# Wait for socat to start and listen to the port.
+	wait_local_port_listen "${NAMESPACE}" "${PORT}" udp
+	# Send the message
+	echo "${MSG}: ${TARGET}" > /dev/kmsg
+	# Wait until socat saves the file to disk
+	busywait "${BUSYWAIT_TIMEOUT}" test -s "${OUTPUT_FILE}"
+	# Make sure the message was received in the dst part
+	# and exit
+	validate_msg "${OUTPUT_FILE}"
+
+	# kill socat in case it is still running
+	pkill_socat
+	echo "${BINDMODE} : Test passed" >&2
+done
 
 exit "${ksft_pass}"

---
base-commit: 37816488247ddddbc3de113c78c83572274b1e2e
change-id: 20250807-netcons-cmdline-selftest-b32e27a4bd16

Best regards,
-- 
Andre Carvalho <asantostc(a)gmail.com>
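
The two bind modes differ only in what goes into the `<dev>` slot of the `netconsole=` parameter. As a rough illustration, the Python function below mirrors the string that `create_cmdline_str()` echoes in the patch; the function signature and the example addresses (documentation-range IPs and locally administered MACs) are assumptions for the sketch, and the real test obtains these values from its network setup helpers.

```python
def create_cmdline_str(src_ip, src_if, src_mac, dst_ip, dst_mac,
                       bind_mode="ifname", src_port=1514, tgt_port=6666):
    """Build netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]."""
    # "ifname" binds by interface name; any other mode binds by source MAC.
    src_dev = src_if if bind_mode == "ifname" else src_mac
    return (f'netconsole="+{src_port}@{src_ip}/{src_dev},'
            f'{tgt_port}@{dst_ip}/{dst_mac}"')


# Hypothetical example values, not taken from the selftest.
args = dict(src_ip="192.0.2.1", src_if="eth0", src_mac="02:00:00:00:00:01",
            dst_ip="192.0.2.2", dst_mac="02:00:00:00:00:02")

by_name = create_cmdline_str(bind_mode="ifname", **args)
by_mac = create_cmdline_str(bind_mode="mac", **args)
print(by_name)
print(by_mac)
```

With these inputs the first string carries `/eth0` and the second carries `/02:00:00:00:00:01` in the device position; everything else is identical.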

[PATCH net v2 0/3] net: prevent deadlocks and mis-configuration with per-NAPI threaded config

by Jakub Kicinski

Running the test added with a recent fix on a driver with persistent
NAPI config leads to a deadlock. The deadlock is fixed by patch 3;
patch 2 addresses what I think is a more fundamental problem with the
way we implemented the config. I hope the fix makes sense, my own
thinking is definitely colored by my preference (IOW how the per-queue
config RFC was implemented).

v2: add missing kdoc
v1: https://lore.kernel.org/20250808014952.724762-1-kuba@kernel.org

Jakub Kicinski (3):
  selftests: drv-net: don't assume device has only 2 queues
  net: update NAPI threaded config even for disabled NAPIs
  net: prevent deadlocks when enabling NAPIs with mixed kthread config

 include/linux/netdevice.h                            |  5 ++++-
 net/core/dev.h                                       |  8 ++++++++
 net/core/dev.c                                       | 12 +++++++++---
 tools/testing/selftests/drivers/net/napi_threaded.py | 10 ++++++----
 4 files changed, 27 insertions(+), 8 deletions(-)

-- 
2.50.1

[PATCH v2 0/2] selftests/fchmodat2: Error handling and general cleanups

by Mark Brown

I looked at the fchmodat2() tests since I've been experiencing some
random intermittent segfaults with them in my test systems. While doing
so I noticed these two issues. Unfortunately I haven't figured out the
original problem yet, unless I managed to fix it unwittingly.

Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Rebase onto v6.17-rc1.
- Link to v1: https://lore.kernel.org/r/20250714-selftests-fchmodat2-v1-0-b74f3ee0d09c@ke…

---
Mark Brown (2):
      selftests/fchmodat2: Clean up temporary files and directories
      selftests/fchmodat2: Use ksft_finished()

 tools/testing/selftests/fchmodat2/fchmodat2_test.c | 166 ++++++++++++++-------
 1 file changed, 112 insertions(+), 54 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250711-selftests-fchmodat2-c30374c376f8

Best regards,
-- 
Mark Brown <broonie(a)kernel.org>

[PATCH v5 00/15] kunit: Introduce UAPI testing framework

by Thomas Weißschuh

Currently testing of userspace and in-kernel API uses two different
frameworks: kselftests for the userspace ones and Kunit for the
in-kernel ones. Besides their different scopes, both have different
strengths and limitations:

Kunit:
  * Tests are normal kernel code.
  * They use the regular kernel toolchain.
  * They can be packaged and distributed as modules conveniently.

Kselftests:
  * Tests are normal userspace code
  * They need a userspace toolchain.
    A kernel cross toolchain is likely not enough.
  * A fair amount of userland is required to run the tests,
    which means a full distro or handcrafted rootfs.
  * There is no way to conveniently package and run kselftests with a
    given kernel image.
  * The kselftests makefiles are not as powerful as regular kbuild.
    For example they are missing proper header dependency tracking or more
    complex compiler option modifications.

Therefore kunit is much easier to run against different kernel
configurations and architectures.
This series aims to combine kselftests and kunit, avoiding both their
limitations. It works by compiling the userspace kselftests as part of
the regular kernel build, embedding them into the kunit kernel or module
and executing them from there. If the kernel toolchain is not fit to
produce userspace because of a missing libc, the kernel's own nolibc can
be used instead.
The structured TAP output from the kselftest is integrated into the
kunit KTAP output transparently, the kunit parser can parse the combined
logs together.

Further room for improvements:
  * Call each test in its completely dedicated namespace
  * Handle additional test files besides the test executable through
    archives. CPIO, cramfs, etc.
  * Compatibility with kselftest_harness.h (in progress)
  * Expose the blobs in debugfs
  * Provide some convenience wrappers around compat userprogs
  * Figure out a migration path/coexistence solution for
    kunit UAPI and tools/testing/selftests/

Output from the kunit example testcase, note the output of
"example_uapi_tests".

$ ./tools/testing/kunit/kunit.py run --kunitconfig lib/kunit example
...
Running tests with:
$ .kunit/linux kunit.filter_glob=example kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[11:53:53] ================== example (10 subtests) ===================
[11:53:53] [PASSED] example_simple_test
[11:53:53] [SKIPPED] example_skip_test
[11:53:53] [SKIPPED] example_mark_skipped_test
[11:53:53] [PASSED] example_all_expect_macros_test
[11:53:53] [PASSED] example_static_stub_test
[11:53:53] [PASSED] example_static_stub_using_fn_ptr_test
[11:53:53] [PASSED] example_priv_test
[11:53:53] =================== example_params_test ===================
[11:53:53] [SKIPPED] example value 3
[11:53:53] [PASSED] example value 2
[11:53:53] [PASSED] example value 1
[11:53:53] [SKIPPED] example value 0
[11:53:53] =============== [PASSED] example_params_test ===============
[11:53:53] [PASSED] example_slow_test
[11:53:53] ======================= (4 subtests) =======================
[11:53:53] [PASSED] procfs
[11:53:53] [PASSED] userspace test 2
[11:53:53] [SKIPPED] userspace test 3: some reason
[11:53:53] [PASSED] userspace test 4
[11:53:53] ================ [PASSED] example_uapi_test ================
[11:53:53] ===================== [PASSED] example =====================
[11:53:53] ============================================================
[11:53:53] Testing complete. Ran 16 tests: passed: 11, skipped: 5
[11:53:53] Elapsed time: 67.543s total, 1.823s configuring, 65.655s building, 0.058s running

Based on v6.16-rc1.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v5:
- Initialize output variable of kernel_wait()
- Fix .incbin with in-tree builds
- Keep requirement of KTAP tests to have a number which was removed accidentally
- Only synthesize KTAP subtest failure if the outer one is TestStatus.FAILURE
- Use -I instead of -isystem in NOLIBC_USERCFLAGS to populate dependency files
- +To filesystem developers to all patches
- +To Luis Chamberlain for discussions about usage of usermodehelper
  (see patches 6 and 12)
- Link to v4: https://lore.kernel.org/r/20250626-kunit-kselftests-v4-0-48760534fef5@linut…

Changes in v4:
- Move Kconfig.nolibc from tools/ to init/
- Drop generic userprogs nolibc integration
- Drop generic blob framework
- Pick up review tags from David
- Extend new kunit TAP parser tests
- Add MAINTAINERS entry
- Allow CONFIG_KUNIT_UAPI=m
- Split /proc validation into dedicated UAPI test
- Trim recipient list a bit
- Use KUNIT_FAIL_AND_ABORT() over KUNIT_FAIL()
- Link to v3: https://lore.kernel.org/r/20250611-kunit-kselftests-v3-0-55e3d148cbc6@linut…

Changes in v3:
- Reintroduce CONFIG_CC_CAN_LINK_STATIC
- Enable CONFIG_ARCH_HAS_NOLIBC for m68k and SPARC
- Properly handle 'clean' target for userprogs
- Use ramfs over tmpfs to reduce dependencies
- Inherit userprogs byte order and ABI from kernel
- Drop now unnecessary "#ifndef NOLIBC"
- Pick up review tags
- Drop usage of __private in blob.h,
  sparse complains and it is not really necessary
- Fix execution on loongarch when using clang
- Drop userprogs libgcc handling, it was ugly and is not yet necessary
- Link to v2: https://lore.kernel.org/r/20250407-kunit-kselftests-v2-0-454114e287fd@linut…

Changes in v2:
- Rebase onto v6.15-rc1
- Add documentation and kernel docs
- Resolve invalid kconfig breakages
- Drop already applied patch "kbuild: implement CONFIG_HEADERS_INSTALL for Usermode Linux"
- Drop userprogs CONFIG_WERROR integration, it doesn't need to be part of this series
- Replace patch prefix "kconfig" with "kbuild"
- Rename kunit_uapi_run_executable() to kunit_uapi_run_kselftest()
- Generate private, conflict-free symbols in the blob framework
- Handle kselftest exit codes
- Handle SIGABRT
- Forward output also to kunit debugfs log
- Install a fd=0 stdin filedescriptor
- Link to v1: https://lore.kernel.org/r/20250217-kunit-kselftests-v1-0-42b4524c3b0a@linut…

---
Thomas Weißschuh (15):
      kbuild: userprogs: avoid duplication of flags inherited from kernel
      kbuild: userprogs: also inherit byte order and ABI from kernel
      kbuild: doc: add label for userprogs section
      init: re-add CONFIG_CC_CAN_LINK_STATIC
      init: add nolibc build support
      fs,fork,exit: export symbols necessary for KUnit UAPI support
      kunit: tool: Add test for nested test result reporting
      kunit: tool: Don't overwrite test status based on subtest counts
      kunit: tool: Parse skipped tests from kselftest.h
      kunit: Always descend into kunit directory during build
      kunit: qemu_configs: loongarch: Enable LSX/LSAX
      kunit: Introduce UAPI testing framework
      kunit: uapi: Add example for UAPI tests
      kunit: uapi: Introduce preinit executable
      kunit: uapi: Validate usability of /proc

 Documentation/dev-tools/kunit/api/index.rst        |   5 +
 Documentation/dev-tools/kunit/api/uapi.rst         |  14 +
 Documentation/kbuild/makefiles.rst                 |   2 +
 MAINTAINERS                                        |  11 +
 Makefile                                           |   7 +-
 fs/exec.c                                          |   2 +
 fs/file.c                                          |   1 +
 fs/filesystems.c                                   |   2 +
 fs/fs_struct.c                                     |   1 +
 fs/pipe.c                                          |   2 +
 include/kunit/uapi.h                               |  77 ++++++
 init/Kconfig                                       |   7 +
 init/Kconfig.nolibc                                |  15 +
 init/Makefile.nolibc                               |  13 +
 kernel/exit.c                                      |   3 +
 kernel/fork.c                                      |   2 +
 lib/Makefile                                       |   4 -
 lib/kunit/Kconfig                                  |  14 +
 lib/kunit/Makefile                                 |  30 +-
 lib/kunit/kunit-example-test.c                     |  15 +
 lib/kunit/kunit-example-uapi.c                     |  22 ++
 lib/kunit/kunit-test-uapi.c                        |  51 ++++
 lib/kunit/kunit-test.c                             |  23 +-
 lib/kunit/kunit-uapi.c                             | 305 +++++++++++++++++++++
 lib/kunit/uapi-preinit.c                           |  63 +++++
 tools/testing/kunit/kunit_parser.py                |  11 +-
 tools/testing/kunit/kunit_tool_test.py             |  11 +
 tools/testing/kunit/qemu_configs/loongarch.py      |   2 +
 .../test_is_test_passed-failure-nested.log         |  10 +
 .../test_data/test_is_test_passed-kselftest.log    |   3 +-
 30 files changed, 715 insertions(+), 13 deletions(-)
---
base-commit: 9d5898b413d17510b2a41664a42390a2c79f8bf4
change-id: 20241015-kunit-kselftests-56273bc40442

Best regards,
-- 
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
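
The summary line in the example output ("Ran 16 tests: passed: 11, skipped: 5") can be cross-checked against the per-test lines. The short Python sketch below does that by counting only lines whose timestamp is followed directly by a bracketed status; suite banners start with `=` and are skipped. This is an illustration written for this log excerpt and is not how `kunit_parser.py` is implemented; the regular expression and variable names are assumptions.

```python
import re
from collections import Counter

# Result lines copied from the example run in the cover letter.
LOG = """\
[11:53:53] ================== example (10 subtests) ===================
[11:53:53] [PASSED] example_simple_test
[11:53:53] [SKIPPED] example_skip_test
[11:53:53] [SKIPPED] example_mark_skipped_test
[11:53:53] [PASSED] example_all_expect_macros_test
[11:53:53] [PASSED] example_static_stub_test
[11:53:53] [PASSED] example_static_stub_using_fn_ptr_test
[11:53:53] [PASSED] example_priv_test
[11:53:53] =================== example_params_test ===================
[11:53:53] [SKIPPED] example value 3
[11:53:53] [PASSED] example value 2
[11:53:53] [PASSED] example value 1
[11:53:53] [SKIPPED] example value 0
[11:53:53] =============== [PASSED] example_params_test ===============
[11:53:53] [PASSED] example_slow_test
[11:53:53] ======================= (4 subtests) =======================
[11:53:53] [PASSED] procfs
[11:53:53] [PASSED] userspace test 2
[11:53:53] [SKIPPED] userspace test 3: some reason
[11:53:53] [PASSED] userspace test 4
[11:53:53] ================ [PASSED] example_uapi_test ================
[11:53:53] ===================== [PASSED] example =====================
"""

# Leaf results: timestamp, then a bracketed status, then the test name.
RESULT_RE = re.compile(r"^\[\d{2}:\d{2}:\d{2}\] \[(PASSED|SKIPPED|FAILED)\] (.+)$")

counts = Counter()
for line in LOG.splitlines():
    match = RESULT_RE.match(line)
    if match:
        counts[match.group(1)] += 1

total = sum(counts.values())
print(f"Ran {total} tests: passed: {counts['PASSED']}, skipped: {counts['SKIPPED']}")
```

The four userspace results (`procfs` through `userspace test 4`) are counted alongside the in-kernel ones, matching the cover letter's point that the combined log parses as one result set.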

[PATCH] seccomp:seccomp_bpf: remove unused macros

by bajing

After reviewing the code, it was found that these macros are never
referenced in the code. Just remove them.

Signed-off-by: bajing <bajing(a)cmss.chinamobile.com>
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 61acbd45ffaa..a80bcc5149bf 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -78,18 +78,6 @@
 #define PR_GET_NO_NEW_PRIVS 39
 #endif
 
-#ifndef PR_SECCOMP_EXT
-#define PR_SECCOMP_EXT 43
-#endif
-
-#ifndef SECCOMP_EXT_ACT
-#define SECCOMP_EXT_ACT 1
-#endif
-
-#ifndef SECCOMP_EXT_ACT_TSYNC
-#define SECCOMP_EXT_ACT_TSYNC 1
-#endif
-
 #ifndef SECCOMP_MODE_STRICT
 #define SECCOMP_MODE_STRICT 1
 #endif
-- 
2.33.0

[PATCH v2 0/8] selftests: vDSO: Clean up vdso_test_abi and drop vdso_test_clock_getres

by Thomas Weißschuh

Some cleanups for the vDSO selftests.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Also drop vdso_test_clock_getres from .gitignore
- Move patch to fix -Wunitialized in powerpc VDSO_CALL() into this series
- Rebase on v6.17-rc1
- Add test for clock_gettime64()
- Link to v1: https://lore.kernel.org/r/20250707-vdso-tests-fixes-v1-0-545be9781b0c@linut…

---
Thomas Weißschuh (8):
      selftests: vDSO: fix -Wunitialized in powerpc VDSO_CALL() wrapper
      selftests: vDSO: vdso_test_abi: Correctly skip whole test with missing vDSO
      selftests: vDSO: vdso_test_abi: Use ksft_finished()
      selftests: vDSO: vdso_test_abi: Drop clock availability tests
      selftests: vDSO: vdso_test_abi: Use explicit indices for name array
      selftests: vDSO: vdso_test_abi: Test CPUTIME clocks
      selftests: vDSO: vdso_test_abi: Add tests for clock_gettime64()
      selftests: vDSO: Drop vdso_test_clock_getres

 tools/testing/selftests/vDSO/.gitignore            |   1 -
 tools/testing/selftests/vDSO/Makefile              |   2 -
 tools/testing/selftests/vDSO/vdso_call.h           |   7 +-
 tools/testing/selftests/vDSO/vdso_test_abi.c       | 101 +++++++++--------
 .../selftests/vDSO/vdso_test_clock_getres.c        | 123 ---------------------
 5 files changed, 59 insertions(+), 175 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250707-vdso-tests-fixes-7e4ddffd7f27

Best regards,
-- 
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>

[PATCH net-next 0/3] bonding: support aggregator selection based on port priority

by Hangbin Liu

This patchset introduces a new per-port bonding option: `ad_actor_port_prio`.

It allows users to configure the actor's port priority, which can then be used
by the bonding driver for aggregator selection based on port priority.

This provides finer control over LACP aggregator choice, especially in setups
with multiple eligible aggregators over 2 switches.

Hangbin Liu (3):
  bonding: add support for per-port LACP actor priority
  bonding: support aggregator selection based on port priority
  selftests: bonding: add test for LACP actor port priority

 Documentation/networking/bonding.rst          | 18 ++++-
 drivers/net/bonding/bond_3ad.c                | 31 ++++++++
 drivers/net/bonding/bond_netlink.c            | 16 ++++
 drivers/net/bonding/bond_options.c            | 36 +++++++++
 include/net/bond_3ad.h                        |  2 +
 include/net/bond_options.h                    |  1 +
 include/uapi/linux/if_link.h                  |  1 +
 .../selftests/drivers/net/bonding/Makefile    |  3 +-
 .../drivers/net/bonding/bond_lacp_prio.sh     | 73 +++++++++++++++++++
 tools/testing/selftests/net/forwarding/lib.sh | 24 ------
 tools/testing/selftests/net/lib.sh            | 24 ++++++
 11 files changed, 203 insertions(+), 26 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_lacp_prio.sh

-- 
2.46.0

[PATCH v2] selftests/net: Ensure assert() triggers in psock_tpacket.c

by Wake Liu

The get_next_frame() function in psock_tpacket.c was missing a return
statement in its default switch case, leading to a compiler warning.

This was caused by a `bug_on(1)` call, which is defined as an
`assert()`, being compiled out because NDEBUG is defined during the
build.

Instead of adding a `return NULL;` which would silently hide the error
and could lead to crashes later, this change restores the original
author's intent. By adding `#undef NDEBUG` before including <assert.h>,
we ensure the assertion is active and will cause the test to abort if
this unreachable code is ever executed.

Signed-off-by: Wake Liu <wakel(a)google.com>
---
 tools/testing/selftests/net/psock_tpacket.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/net/psock_tpacket.c b/tools/testing/selftests/net/psock_tpacket.c
index 0dd909e325d9..2938045c5cf9 100644
--- a/tools/testing/selftests/net/psock_tpacket.c
+++ b/tools/testing/selftests/net/psock_tpacket.c
@@ -22,6 +22,7 @@
  *   - TPACKET_V3: RX_RING
  */
 
+#undef NDEBUG
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/types.h>
-- 
2.50.1.703.g449372360f-goog

[PATCH] selftests/net: Replace non-standard __WORDSIZE with sizeof(long) * 8

by Wake Liu

The `__WORDSIZE` macro, defined in the non-standard `<bits/wordsize.h>`
header, is a GNU extension and not universally available with all
toolchains, such as Clang when used with musl libc.

This can lead to build failures in environments where this header is
missing.

The intention of the code is to determine the bit width of a C `long`.
Replace the non-portable `__WORDSIZE` with the standard and portable
`sizeof(long) * 8` expression to achieve the same result.

This change also removes the inclusion of the now-unused
`<bits/wordsize.h>` header.

Signed-off-by: Wake Liu <wakel(a)google.com>
---
 tools/testing/selftests/net/psock_tpacket.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/psock_tpacket.c b/tools/testing/selftests/net/psock_tpacket.c
index 221270cee3ea..0dd909e325d9 100644
--- a/tools/testing/selftests/net/psock_tpacket.c
+++ b/tools/testing/selftests/net/psock_tpacket.c
@@ -33,7 +33,6 @@
 #include <ctype.h>
 #include <fcntl.h>
 #include <unistd.h>
-#include <bits/wordsize.h>
 #include <net/ethernet.h>
 #include <netinet/ip.h>
 #include <arpa/inet.h>
@@ -785,7 +784,7 @@ static int test_kernel_bit_width(void)
 
 static int test_user_bit_width(void)
 {
-	return __WORDSIZE;
+	return sizeof(long) * 8;
 }
 
 static const char *tpacket_str[] = {
-- 
2.50.1.703.g449372360f-goog
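
The `sizeof(long) * 8` expression can be checked from userspace without a C compiler. The Python sketch below asks `ctypes` for the size of the platform's C `long` and cross-checks it against the native `l` format code of the `struct` module; the variable names are illustrative and the snippet is not part of the selftest.

```python
import ctypes
import struct

# Bit width of a native C long, computed the same way as the patched test.
long_bits = ctypes.sizeof(ctypes.c_long) * 8

# Cross-check: struct's native-mode "l" code is also a C long.
struct_bits = struct.calcsize("l") * 8

# Pointer width, shown for comparison only.
pointer_bits = ctypes.sizeof(ctypes.c_void_p) * 8

print(f"long: {long_bits} bits (struct says {struct_bits}), "
      f"pointer: {pointer_bits} bits")
```

On the ILP32 and LP64 ABIs used by Linux, the `long` width equals the pointer width, which is why it can stand in for `__WORDSIZE` here. That equivalence does not hold on LLP64 platforms, where `long` stays at 32 bits.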

[PATCH net] bonding: don't set oif to bond dev when getting NS target destination

by Hangbin Liu

Unlike IPv4, IPv6 routing strictly requires the source address to be valid
on the outgoing interface. If the NS target is set to a remote VLAN interface,
and the source address is also configured on a VLAN over a bond interface,
setting the oif to the bond device will fail to retrieve the correct
destination route.

Fix this by not setting the oif to the bond device when retrieving the NS
target destination. This allows the correct destination device (the VLAN
interface) to be determined, so that bond_verify_device_path can return the
proper VLAN tags for sending NS messages.

Reported-by: David Wilder <wilder(a)us.ibm.com>
Closes: https://lore.kernel.org/netdev/aGOKggdfjv0cApTO@fedora/
Suggested-by: Jay Vosburgh <jv(a)jvosburgh.net>
Fixes: 4e24be018eb9 ("bonding: add new parameter ns_targets")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
 drivers/net/bonding/bond_main.c               |  1 -
 .../drivers/net/bonding/bond_options.sh       | 59 +++++++++++++++++++
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 257333c88710..30cf97f4e814 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3355,7 +3355,6 @@ static void bond_ns_send_all(struct bonding *bond, struct slave *slave)
 		/* Find out through which dev should the packet go */
 		memset(&fl6, 0, sizeof(struct flowi6));
 		fl6.daddr = targets[i];
-		fl6.flowi6_oif = bond->dev->ifindex;
 
 		dst = ip6_route_output(dev_net(bond->dev), NULL, &fl6);
 		if (dst->error) {
diff --git a/tools/testing/selftests/drivers/net/bonding/bond_options.sh b/tools/testing/selftests/drivers/net/bonding/bond_options.sh
index 7bc148889ca7..b3eb8a919c71 100755
--- a/tools/testing/selftests/drivers/net/bonding/bond_options.sh
+++ b/tools/testing/selftests/drivers/net/bonding/bond_options.sh
@@ -7,6 +7,7 @@ ALL_TESTS="
 	prio
 	arp_validate
 	num_grat_arp
+	vlan_over_bond
 "
 
 lib_dir=$(dirname "$0")
@@ -376,6 +377,64 @@ num_grat_arp()
 	done
 }
 
+vlan_over_bond_arp()
+{
+	local mode="$1"
+	RET=0
+
+	bond_reset "mode $mode arp_interval 100 arp_ip_target 192.0.3.10"
+	ip -n "${s_ns}" link add bond0.3 link bond0 type vlan id 3
+	ip -n "${s_ns}" link set bond0.3 up
+	ip -n "${s_ns}" addr add 192.0.3.1/24 dev bond0.3
+	ip -n "${s_ns}" addr add 2001:db8::3:1/64 dev bond0.3
+
+	slowwait_for_counter 5 5 tc_rule_handle_stats_get \
+		"dev eth0.3 ingress" 101 ".packets" "-n ${c_ns}" || RET=1
+	log_test "vlan over bond arp" "$mode"
+}
+
+vlan_over_bond_ns()
+{
+	local mode="$1"
+	RET=0
+
+	if skip_ns; then
+		log_test_skip "vlan_over_bond ns" "$mode"
+		return 0
+	fi
+
+	bond_reset "mode $mode arp_interval 100 ns_ip6_target 2001:db8::3:10"
+	ip -n "${s_ns}" link add bond0.3 link bond0 type vlan id 3
+	ip -n "${s_ns}" link set bond0.3 up
+	ip -n "${s_ns}" addr add 192.0.3.1/24 dev bond0.3
+	ip -n "${s_ns}" addr add 2001:db8::3:1/64 dev bond0.3
+
+	slowwait_for_counter 5 5 tc_rule_handle_stats_get \
+		"dev eth0.3 ingress" 102 ".packets" "-n ${c_ns}" || RET=1
+	log_test "vlan over bond ns" "$mode"
+}
+
+vlan_over_bond()
+{
+	# add vlan 3 for client
+	ip -n "${c_ns}" link add eth0.3 link eth0 type vlan id 3
+	ip -n "${c_ns}" link set eth0.3 up
+	ip -n "${c_ns}" addr add 192.0.3.10/24 dev eth0.3
+	ip -n "${c_ns}" addr add 2001:db8::3:10/64 dev eth0.3
+
+	# Add tc rule to check the vlan pkts
+	tc -n "${c_ns}" qdisc add dev eth0.3 clsact
+	tc -n "${c_ns}" filter add dev eth0.3 ingress protocol arp \
+		handle 101 flower skip_hw arp_op request \
+		arp_sip 192.0.3.1 arp_tip 192.0.3.10 action pass
+	tc -n "${c_ns}" filter add dev eth0.3 ingress protocol ipv6 \
+		handle 102 flower skip_hw ip_proto icmpv6 \
+		type 135 src_ip 2001:db8::3:1 action pass
+
+	vlan_over_bond_arp "active-backup"
+	vlan_over_bond_ns "active-backup"
+}
+
 trap cleanup EXIT
 
 setup_prepare
-- 
2.50.1

4 months, 4 weeks

[Patch v2] selftests/mm: do check_huge_anon() with a number been passed in

by Wei Yang

Currently it hard codes the number of hugepages to check for
check_huge_anon(), but it would be more reasonable to do the check based
on a number passed in.

Pass in the hugepage number and do the check based on it.

Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Donet Tom <donettom(a)linux.ibm.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Zi Yan <ziy(a)nvidia.com>

---
v2:
  * use mm-new
  * add back nr_hpages which is removed by an early commit
  * adjust the change log a little
  * drop RB and resend
---
 tools/testing/selftests/mm/split_huge_page_test.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index 5ab488fab1cd..63ac82f0b9e0 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -105,12 +105,12 @@ static char *allocate_zero_filled_hugepage(size_t len)
 	return result;
 }
 
-static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, size_t len)
+static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hpages, size_t len)
 {
 	unsigned long rss_anon_before, rss_anon_after;
 	size_t i;
 
-	if (!check_huge_anon(one_page, 4, pmd_pagesize))
+	if (!check_huge_anon(one_page, nr_hpages, pmd_pagesize))
 		ksft_exit_fail_msg("No THP is allocated\n");
 
 	rss_anon_before = rss_anon();
@@ -141,7 +141,7 @@ void split_pmd_zero_pages(void)
 	size_t len = nr_hpages * pmd_pagesize;
 
 	one_page = allocate_zero_filled_hugepage(len);
-	verify_rss_anon_split_huge_page_all_zeroes(one_page, len);
+	verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len);
 	ksft_test_result_pass("Split zero filled huge pages successful\n");
 	free(one_page);
 }
-- 
2.34.1
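Why the count matters can be sketched as follows. This is a simplified model, not the selftest helper itself: it assumes the check compares the AnonHugePages value of a mapping (in kB, as reported in /proc/self/smaps) against nr_hpages * pmd_pagesize. The names check_huge_anon_sketch, anon_huge_kb, SAMPLE_SMAPS and the 2 MiB PMD size are illustrative assumptions.

```python
import re

PMD_PAGESIZE = 2 * 1024 * 1024  # assumed 2 MiB PMD size, for illustration only

# Illustrative smaps-style entry for an 8 MiB THP-backed mapping (made-up data).
SAMPLE_SMAPS = """\
7f0000000000-7f0000800000 rw-p 00000000 00:00 0
Size:               8192 kB
AnonHugePages:      8192 kB
"""


def anon_huge_kb(smaps_entry):
    """Return the AnonHugePages value in kB from one smaps entry, or 0 if absent."""
    match = re.search(r"^AnonHugePages:\s+(\d+) kB$", smaps_entry, re.MULTILINE)
    return int(match.group(1)) if match else 0


def check_huge_anon_sketch(smaps_entry, nr_hpages, hpage_size):
    """True when exactly nr_hpages huge pages of hpage_size bytes are mapped."""
    return anon_huge_kb(smaps_entry) == nr_hpages * (hpage_size // 1024)


# The caller's own page count matches; any other count does not.
matches_four = check_huge_anon_sketch(SAMPLE_SMAPS, 4, PMD_PAGESIZE)
matches_two = check_huge_anon_sketch(SAMPLE_SMAPS, 2, PMD_PAGESIZE)
print(matches_four, matches_two)
```

In this model a hard-coded 4 only works while the caller happens to allocate four huge pages, which is the coupling the patch removes by passing nr_hpages through.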

4 months, 4 weeks

[PATCH v2] selftests/futex: fix format-security warnings in futex_priv_hash

by Nai-Chen Cheng

Fix format-security warnings by using proper format strings when
passing message variables to ksft_exit_fail_msg(),
ksft_test_result_pass(), and ksft_test_result_skip().

This prevents potential security issues and eliminates compiler warnings
when building with -Wformat-security.

Signed-off-by: Nai-Chen Cheng <bleach1827(a)gmail.com>
---
Changes in v2:
- Fix typo in subject: "selftest" -> "selftests"
- Retested compilation and functionality
- Link to v1: https://lore.kernel.org/all/20250717120606.45115-1-bleach1827@gmail.com/
---
diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c
index 24a92dc94eb8..19651087c4de 100644
--- a/tools/testing/selftests/futex/functional/futex_priv_hash.c
+++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c
@@ -184,10 +184,10 @@ int main(int argc, char *argv[])
 	futex_slots1 = futex_hash_slots_get();
 	if (futex_slots1 <= 0) {
 		ksft_print_msg("Current hash buckets: %d\n", futex_slots1);
-		ksft_exit_fail_msg(test_msg_auto_create);
+		ksft_exit_fail_msg("%s", test_msg_auto_create);
 	}
 
-	ksft_test_result_pass(test_msg_auto_create);
+	ksft_test_result_pass("%s", test_msg_auto_create);
 
 	online_cpus = sysconf(_SC_NPROCESSORS_ONLN);
 	ret = pthread_barrier_init(&barrier_main, NULL, MAX_THREADS + 1);
@@ -212,11 +212,11 @@ int main(int argc, char *argv[])
 		if (futex_slotsn < 0 || futex_slots1 == futex_slotsn) {
 			ksft_print_msg("Expected increase of hash buckets but got: %d -> %d\n",
 				       futex_slots1, futex_slotsn);
-			ksft_exit_fail_msg(test_msg_auto_inc);
+			ksft_exit_fail_msg("%s", test_msg_auto_inc);
 		}
-		ksft_test_result_pass(test_msg_auto_inc);
+		ksft_test_result_pass("%s", test_msg_auto_inc);
 	} else {
-		ksft_test_result_skip(test_msg_auto_inc);
+		ksft_test_result_skip("%s", test_msg_auto_inc);
 	}
 
 	ret = pthread_mutex_unlock(&global_lock);
-- 
2.43.0
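The warning itself is specific to C compilers, but the underlying hazard can be sketched with Python's printf-style % operator, used here purely as an analogue. The emit() helper and the sample message are hypothetical stand-ins for a printf-like function and a message variable; they are not part of the kselftest API.

```python
def emit(fmt, *args):
    """Hypothetical printf-like helper: the first argument is always a format."""
    return fmt % args


# A message variable that happens to contain a conversion specifier.
test_msg = "Expected %d hash buckets"

# Pattern the patch removes: the message itself is used as the format string,
# so the stray "%d" is interpreted and no argument exists to satisfy it.
try:
    emit(test_msg)
    unsafe_ok = True
except (TypeError, ValueError):
    unsafe_ok = False

# Pattern the patch introduces: a constant "%s" format with the message passed
# as data, so its contents are never interpreted as conversions.
safe_output = emit("%s", test_msg)

print(unsafe_ok)
print(safe_output)
```

Python raises an exception in the unsafe case; in C the equivalent call has undefined behavior instead, which is why the compiler flags it at build time.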

4 months, 4 weeks

[PATCH net-next v4 4/4] selftests: drv-net: add test for RSS on flow label

by Jakub Kicinski

Add a simple test for checking that RSS on flow label works,
and that it's rejected for IPv4 flows.

 # ./tools/testing/selftests/drivers/net/hw/rss_flow_label.py
 TAP version 13
 1..2
 ok 1 rss_flow_label.test_rss_flow_label
 ok 2 rss_flow_label.test_rss_flow_label_6only
 # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Willem de Bruijn <willemb(a)google.com>
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
v2:
 - check for RPS / RFS
v1: https://lore.kernel.org/20250722014915.3365370-5-kuba@kernel.org

CC: shuah(a)kernel.org
CC: sdf(a)fomichev.me
CC: linux-kselftest(a)vger.kernel.org
---
 .../testing/selftests/drivers/net/hw/Makefile |   1 +
 .../drivers/net/hw/rss_flow_label.py          | 167 ++++++++++++++++++
 2 files changed, 168 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/hw/rss_flow_label.py

diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index fdc97355588c..5159fd34cb33 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -18,6 +18,7 @@ TEST_PROGS = \
 	pp_alloc_fail.py \
 	rss_api.py \
 	rss_ctx.py \
+	rss_flow_label.py \
 	rss_input_xfrm.py \
 	tso.py \
 	xsk_reconfig.py \
diff --git a/tools/testing/selftests/drivers/net/hw/rss_flow_label.py b/tools/testing/selftests/drivers/net/hw/rss_flow_label.py
new file mode 100755
index 000000000000..6fa95fe27c47
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/rss_flow_label.py
@@ -0,0 +1,167 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+Tests for RSS hashing on IPv6 Flow Label.
+"""
+
+import glob
+import os
+import socket
+from lib.py import CmdExitFailure
+from lib.py import ksft_run, ksft_exit, ksft_eq, ksft_ge, ksft_in, \
+    ksft_not_in, ksft_raises, KsftSkipEx
+from lib.py import bkg, cmd, defer, fd_read_timeout, rand_port
+from lib.py import NetDrvEpEnv
+
+
+def _check_system(cfg):
+    if not hasattr(socket, "SO_INCOMING_CPU"):
+        raise KsftSkipEx("socket.SO_INCOMING_CPU was added in Python 3.11")
+
+    qcnt = len(glob.glob(f"/sys/class/net/{cfg.ifname}/queues/rx-*"))
+    if qcnt < 2:
+        raise KsftSkipEx(f"Local has only {qcnt} queues")
+
+    for f in [f"/sys/class/net/{cfg.ifname}/queues/rx-0/rps_flow_cnt",
+              f"/sys/class/net/{cfg.ifname}/queues/rx-0/rps_cpus"]:
+        try:
+            with open(f, 'r') as fp:
+                setting = fp.read().strip()
+                # CPU mask will be zeros and commas
+                if setting.replace("0", "").replace(",", ""):
+                    raise KsftSkipEx(f"RPS/RFS is configured: {f}: {setting}")
+        except FileNotFoundError:
+            pass
+
+    # 1 is the default, if someone changed it we probably shouldn't mess with it
+    af = cmd("cat /proc/sys/net/ipv6/auto_flowlabels", host=cfg.remote).stdout
+    if af.strip() != "1":
+        raise KsftSkipEx("Remote does not have auto_flowlabels enabled")
+
+
+def _ethtool_get_cfg(cfg, fl_type):
+    descr = cmd(f"ethtool -n {cfg.ifname} rx-flow-hash {fl_type}").stdout
+
+    converter = {
+        "IP SA": "s",
+        "IP DA": "d",
+        "L3 proto": "t",
+        "L4 bytes 0 & 1 [TCP/UDP src port]": "f",
+        "L4 bytes 2 & 3 [TCP/UDP dst port]": "n",
+        "IPv6 Flow Label": "l",
+    }
+
+    ret = ""
+    for line in descr.split("\n")[1:-2]:
+        # if this raises we probably need to add more keys to converter above
+        ret += converter[line]
+    return ret
+
+
+def _traffic(cfg, one_sock, one_cpu):
+    local_port = rand_port(socket.SOCK_DGRAM)
+    remote_port = rand_port(socket.SOCK_DGRAM)
+
+    sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
+    sock.bind(("", local_port))
+    sock.connect((cfg.remote_addr_v["6"], 0))
+    if one_sock:
+        send = f"exec 5<>/dev/udp/{cfg.addr_v['6']}/{local_port}; " \
+                "for i in `seq 20`; do echo a >&5; sleep 0.02; done; exec 5>&-"
+    else:
+        send = "for i in `seq 20`; do echo a | socat -t0.02 - UDP6:" \
+              f"[{cfg.addr_v['6']}]:{local_port},sourceport={remote_port}; done"
+
+    cpus = set()
+    with bkg(send, shell=True, host=cfg.remote, exit_wait=True):
+        for _ in range(20):
+            fd_read_timeout(sock.fileno(), 1)
+            cpu = sock.getsockopt(socket.SOL_SOCKET, socket.SO_INCOMING_CPU)
+            cpus.add(cpu)
+
+    if one_cpu:
+        ksft_eq(len(cpus), 1,
+                f"{one_sock=} - expected one CPU, got traffic on: {cpus=}")
+    else:
+        ksft_ge(len(cpus), 2,
+                f"{one_sock=} - expected many CPUs, got traffic on: {cpus=}")
+
+
+def test_rss_flow_label(cfg):
+    """
+    Test hashing on IPv6 flow label. Send traffic over a single socket
+    and over multiple sockets. Depend on the remote having auto-label
+    enabled so that it randomizes the label per socket.
+    """
+
+    cfg.require_ipver("6")
+    cfg.require_cmd("socat", remote=True)
+    _check_system(cfg)
+
+    # Enable flow label hashing for UDP6
+    initial = _ethtool_get_cfg(cfg, "udp6")
+    no_lbl = initial.replace("l", "")
+    if "l" not in initial:
+        try:
+            cmd(f"ethtool -N {cfg.ifname} rx-flow-hash udp6 l{no_lbl}")
+        except CmdExitFailure as exc:
+            raise KsftSkipEx("Device doesn't support Flow Label for UDP6") from exc
+
+        defer(cmd, f"ethtool -N {cfg.ifname} rx-flow-hash udp6 {initial}")
+
+    _traffic(cfg, one_sock=True, one_cpu=True)
+    _traffic(cfg, one_sock=False, one_cpu=False)
+
+    # Disable it, we should see no hashing (reset was already defer()ed)
+    cmd(f"ethtool -N {cfg.ifname} rx-flow-hash udp6 {no_lbl}")
+
+    _traffic(cfg, one_sock=False, one_cpu=True)
+
+
+def _check_v4_flow_types(cfg):
+    for fl_type in ["tcp4", "udp4", "ah4", "esp4", "sctp4"]:
+        try:
+            cur = cmd(f"ethtool -n {cfg.ifname} rx-flow-hash {fl_type}").stdout
+            ksft_not_in("Flow Label", cur,
+                        comment=f"{fl_type=} has Flow Label:" + cur)
+        except CmdExitFailure:
+            # Probably does not support this flow type
+            pass
+
+
+def test_rss_flow_label_6only(cfg):
+    """
+    Test interactions with IPv4 flow types. It should not be possible to set
+    IPv6 Flow Label hashing for an IPv4 flow type. The Flow Label should also
+    not appear in the IPv4 "current config".
+    """
+
+    with ksft_raises(CmdExitFailure) as cm:
+        cmd(f"ethtool -N {cfg.ifname} rx-flow-hash tcp4 sdfnl")
+    ksft_in("Invalid argument", cm.exception.cmd.stderr)
+
+    _check_v4_flow_types(cfg)
+
+    # Try to enable Flow Labels and check again, in case it leaks thru
+    initial = _ethtool_get_cfg(cfg, "udp6")
+    changed = initial.replace("l", "") if "l" in initial else initial + "l"
+
+    cmd(f"ethtool -N {cfg.ifname} rx-flow-hash udp6 {changed}")
+    restore = defer(cmd, f"ethtool -N {cfg.ifname} rx-flow-hash udp6 {initial}")
+
+    _check_v4_flow_types(cfg)
+    restore.exec()
+    _check_v4_flow_types(cfg)
+
+
+def main() -> None:
+    with NetDrvEpEnv(__file__, nsim_test=False) as cfg:
+        ksft_run([test_rss_flow_label,
+                  test_rss_flow_label_6only],
+                 args=(cfg, ))
+    ksft_exit()
+
+
+if __name__ == "__main__":
+    main()
-- 
2.50.1
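The way the test turns "ethtool -n" output into the letter string that "ethtool -N" accepts can be tried in isolation. The sketch below reuses the converter mapping from the patch, but the SAMPLE_OUTPUT text is only an assumed shape (a header line, one field per line, and a trailing blank line); real output depends on the device and the ethtool version. The names fields_to_letters, with_lbl and toggled are illustrative.

```python
# Letters accepted by "ethtool -N <dev> rx-flow-hash", keyed by the field
# descriptions printed by "ethtool -n" (same mapping as the test's converter).
CONVERTER = {
    "IP SA": "s",
    "IP DA": "d",
    "L3 proto": "t",
    "L4 bytes 0 & 1 [TCP/UDP src port]": "f",
    "L4 bytes 2 & 3 [TCP/UDP dst port]": "n",
    "IPv6 Flow Label": "l",
}

# Assumed, illustrative output; not captured from a real device.
SAMPLE_OUTPUT = (
    "UDP over IPV6 flows use these fields for computing Hash flow key:\n"
    "IP SA\n"
    "IP DA\n"
    "L4 bytes 0 & 1 [TCP/UDP src port]\n"
    "L4 bytes 2 & 3 [TCP/UDP dst port]\n"
    "\n"
)


def fields_to_letters(descr):
    """Map the per-field lines of the output to rx-flow-hash option letters."""
    # [1:-2] skips the header line and the two empty strings that split()
    # leaves behind for the trailing blank line.
    return "".join(CONVERTER[line] for line in descr.split("\n")[1:-2])


initial = fields_to_letters(SAMPLE_OUTPUT)
no_lbl = initial.replace("l", "")
# What test_rss_flow_label would pass to enable Flow Label hashing.
with_lbl = "l" + no_lbl
# What test_rss_flow_label_6only would pass to flip the current setting.
toggled = initial.replace("l", "") if "l" in initial else initial + "l"

print(initial, with_lbl, toggled)
```

An unknown field description raises KeyError here, which mirrors the comment in the patch about needing to extend the converter.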

4 months, 4 weeks

[PATCH v2 0/3] Better split_huge_page_test result check

by Zi Yan

This patchset uses kpageflags to get after-split folio orders for a better
split_huge_page_test result check[1]. The added gather_folio_orders() scans
through a VPN range and collects the numbers of folios at different orders.
check_folio_orders() compares the result of gather_folio_orders() to a given
list of numbers of different orders.

This patchset also adds the new order and the in-folio offset to the
pr_debug()s of the split huge page debugfs.

Changelog
===
From V1[2]:
1. Dropped split_huge_pages_pid() for loop step change to avoid messing up
   with PTE-mapped THP handling. split_huge_page_test.c is changed to
   perform split at [addr, addr + pagesize) range to limit one
   folio_split() per folio.
2. Moved pr_debug changes in Patch 2 to Patch 1.
3. Moved KPF_* to vm_util.h and used PAGEMAP_PFN instead of local PFN_MASK.
4. Used pagemap_get_pfn() helper.
5. Used char *vaddr and size_t len as inputs to gather_folio_orders() and
   check_folio_orders() instead of vpn and nr_pages.
6. Removed variable length variables and used malloc instead.

[1] https://lore.kernel.org/linux-mm/e2f32bdb-e4a4-447c-867c-31405cbba151@redha…
[2] https://lore.kernel.org/linux-mm/20250806022045.342824-1-ziy@nvidia.com/

Zi Yan (3):
  mm/huge_memory: add new_order and offset to split_huge_pages*()
    pr_debug.
  selftests/mm: add check_folio_orders() helper.
  selftests/mm: check after-split folio orders in split_huge_page_test.

 mm/huge_memory.c                              |   8 +-
 .../selftests/mm/split_huge_page_test.c       | 102 ++++++++++----
 tools/testing/selftests/mm/vm_util.c          | 133 ++++++++++++++++++
 tools/testing/selftests/mm/vm_util.h          |   7 +
 4 files changed, 217 insertions(+), 33 deletions(-)

-- 
2.47.2
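The gather/check split described above can be illustrated with a simplified model. This sketch does not read /proc/kpageflags; it assumes each page in the range has already been classified as a compound head ("H"), a compound tail ("T"), or a plain order-0 page ("."). The function names match the cover letter, but the signatures and the marker encoding are assumptions made for illustration.

```python
from collections import Counter


def gather_folio_orders(pages):
    """Count folios per order in a run of per-page markers.

    pages: sequence of "H" (compound head), "T" (compound tail) or "."
    (order-0 page). Anything that is not a head is counted as order-0 in
    this simplified model.
    """
    orders = Counter()
    i = 0
    while i < len(pages):
        if pages[i] == "H":
            nr = 1
            # A folio is its head page plus all directly following tails.
            while i + nr < len(pages) and pages[i + nr] == "T":
                nr += 1
            # Folios span a power-of-two number of pages: order = log2(nr).
            orders[nr.bit_length() - 1] += 1
            i += nr
        else:
            orders[0] += 1
            i += 1
    return orders


def check_folio_orders(pages, expected):
    """Compare the gathered per-order counts with an expected mapping."""
    return dict(gather_folio_orders(pages)) == dict(expected)


# One order-2 folio, two order-1 folios and four order-0 pages.
pages = list("HTTT" + "HT" + "HT" + "....")
print(dict(gather_folio_orders(pages)))
print(check_folio_orders(pages, {2: 1, 1: 2, 0: 4}))
```

Keeping the gathering step separate from the comparison lets a test state its expectation as a plain per-order count, which is what the cover letter describes for the selftest.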

4 months, 4 weeks

[PATCH -next] selftests/sched_ext: Remove duplicate sched.h header

by Jiapeng Chong

./tools/testing/selftests/sched_ext/hotplug.c: sched.h is included more than once.

Reported-by: Abaci Robot <abaci(a)linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=22941
Signed-off-by: Jiapeng Chong <jiapeng.chong(a)linux.alibaba.com>
---
 tools/testing/selftests/sched_ext/hotplug.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/testing/selftests/sched_ext/hotplug.c b/tools/testing/selftests/sched_ext/hotplug.c
index 1c9ceb661c43..0cfbb111a2d0 100644
--- a/tools/testing/selftests/sched_ext/hotplug.c
+++ b/tools/testing/selftests/sched_ext/hotplug.c
@@ -6,7 +6,6 @@
 #include <bpf/bpf.h>
 #include <sched.h>
 #include <scx/common.h>
-#include <sched.h>
 #include <sys/wait.h>
 #include <unistd.h>
 
-- 
2.43.5

4 months, 4 weeks

[PATCH V9 0/7] Add NUMA mempolicy support for KVM guest-memfd

by Shivank Garg

This series introduces NUMA-aware memory placement support for KVM guests
with guest_memfd memory backends. It builds upon Fuad Tabba's work that
enabled host-mapping for guest_memfd memory [1].

== Background ==
KVM's guest-memfd memory backend currently lacks support for NUMA policy
enforcement, causing guest memory allocations to be distributed across host
nodes according to kernel's default behavior, irrespective of any policy
specified by the VMM. This limitation arises because conventional userspace
NUMA control mechanisms like mbind(2) don't work since the memory isn't
directly mapped to userspace when allocations occur.
Fuad's work [1] provides the necessary mmap capability, and this series
leverages it to enable mbind(2).

== Implementation ==
This series implements proper NUMA policy support for guest-memfd by:

1. Adding mempolicy-aware allocation APIs to the filemap layer.
2. Introducing custom inodes (via a dedicated slab-allocated inode cache,
   kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory.
3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA
   policy.

With these changes, VMMs can now control guest memory placement by mapping
the guest_memfd file descriptor and using mbind(2) to specify:
- Policy modes: default, bind, interleave, or preferred
- Host NUMA nodes: List of target nodes for memory allocation

These policies affect only future allocations and do not migrate existing
memory. This matches mbind(2)'s default behavior which affects only new
allocations unless overridden with MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags (Not
supported for guest_memfd as it is unmovable by design).

== Upstream Plan ==
Phased approach as per David's guest_memfd extension overview [2] and
community calls [3]:

Phase 1 (this series):
1. Focuses on shared guest_memfd support (non-CoCo VMs).
2. Builds on Fuad's host-mapping work.

Phase 2 (future work):
1. NUMA support for private guest_memfd (CoCo VMs).
2. Depends on SNP in-place conversion support [4].

This series provides a clean integration path for NUMA-aware memory
management for guest_memfd and lays the groundwork for future confidential
computing NUMA capabilities.

Please review and provide feedback!

Thanks,
Shivank

== Changelog ==
- v1,v2: Extended the KVM_CREATE_GUEST_MEMFD IOCTL to pass mempolicy.
- v3: Introduced fbind() syscall for VMM memory-placement configuration.
- v4-v6: Current approach using shared_policy support and vm_ops (based on
  suggestions from David [5] and guest_memfd bi-weekly upstream
  call discussion [6]).
- v7: Use inodes to store NUMA policy instead of file [7].
- v8: Rebase on top of Fuad's V12: Host mmapping for guest_memfd memory.
- v9: Rebase on top of Fuad's V13 and incorporate review comments

[1] https://lore.kernel.org/all/20250709105946.4009897-1-tabba@google.com
[2] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[4] https://lore.kernel.org/all/20250613005400.3694904-1-michael.roth@amd.com
[5] https://lore.kernel.org/all/6fbef654-36e2-4be5-906e-2a648a845278@redhat.com
[6] https://lore.kernel.org/all/2b77e055-98ac-43a1-a7ad-9f9065d7f38f@amd.com
[7] https://lore.kernel.org/all/diqzbjumm167.fsf@ackerleytng-ctop.c.googlers.com

Ackerley Tng (1):
  KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes

Matthew Wilcox (Oracle) (2):
  mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()
  mm/filemap: Extend __filemap_get_folio() to support NUMA memory
    policies

Shivank Garg (4):
  mm/mempolicy: Export memory policy symbols
  KVM: guest_memfd: Add slab-allocated inode cache
  KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
  KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy
    support

 fs/bcachefs/fs-io-buffered.c                  |   2 +-
 fs/btrfs/compression.c                        |   4 +-
 fs/btrfs/verity.c                             |   2 +-
 fs/erofs/zdata.c                              |   2 +-
 fs/f2fs/compress.c                            |   2 +-
 include/linux/pagemap.h                       |  18 +-
 include/uapi/linux/magic.h                    |   1 +
 mm/filemap.c                                  |  23 +-
 mm/mempolicy.c                                |   6 +
 mm/readahead.c                                |   2 +-
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 122 ++++++++-
 virt/kvm/guest_memfd.c                        | 255 ++++++++++++++++--
 virt/kvm/kvm_main.c                           |   7 +-
 virt/kvm/kvm_mm.h                             |  10 +-
 15 files changed, 408 insertions(+), 49 deletions(-)

-- 
2.43.0

---
== Earlier Postings ==
v8: https://lore.kernel.org/all/20250618112935.7629-1-shivankg@amd.com
v7: https://lore.kernel.org/all/20250408112402.181574-1-shivankg@amd.com
v6: https://lore.kernel.org/all/20250226082549.6034-1-shivankg@amd.com
v5: https://lore.kernel.org/all/20250219101559.414878-1-shivankg@amd.com
v4: https://lore.kernel.org/all/20250210063227.41125-1-shivankg@amd.com
v3: https://lore.kernel.org/all/20241105164549.154700-1-shivankg@amd.com
v2: https://lore.kernel.org/all/20240919094438.10987-1-shivankg@amd.com
v1: https://lore.kernel.org/all/20240916165743.201087-1-shivankg@amd.com

5 months

[PATCH] selftests/coredump: Remove the read() that fails the test

by Nam Cao

Resolve a conflict between
commit 6a68d28066b6 ("selftests/coredump: Fix "socket_detect_userspace_client" test failure")
and
commit 994dc26302ed ("selftests/coredump: fix build").

The first commit adds a read() to wait for write() from another thread to
finish. But the second commit removes the write().

Now that the two commits are in the same tree, the read() gets EOF and the
test fails.

Remove this read() so that the test passes.

Signed-off-by: Nam Cao <namcao(a)linutronix.de>
---
 tools/testing/selftests/coredump/stackdump_test.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/tools/testing/selftests/coredump/stackdump_test.c b/tools/testing/selftests/coredump/stackdump_test.c
index 5a5a7a5f7e1d..a4ac80bb1003 100644
--- a/tools/testing/selftests/coredump/stackdump_test.c
+++ b/tools/testing/selftests/coredump/stackdump_test.c
@@ -446,9 +446,6 @@ TEST_F(coredump, socket_detect_userspace_client)
 		if (info.coredump_mask & PIDFD_COREDUMPED)
 			goto out;
 
-		if (read(fd_coredump, &c, 1) < 1)
-			goto out;
-
 		exit_code = EXIT_SUCCESS;
 out:
 		if (fd_peer_pidfd >= 0)
-- 
2.39.5
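The failure mode can be reproduced in miniature. The sketch below is an analogue rather than the selftest itself: a socket pair stands in for the coredump socket, and the peer closes its end without writing anything, which is assumed to be the situation left behind once the write() was removed.

```python
import socket

# A connected stream pair stands in for the coredump socket and its peer.
reader, writer = socket.socketpair()

# The peer goes away without ever writing a byte.
writer.close()

# With the peer closed and nothing queued, recv() reports EOF as b"".
data = reader.recv(1)
reader.close()

# The removed check was "read(fd, &c, 1) < 1 -> goto out": EOF always takes
# the failure path, so the test could no longer pass.
would_fail = len(data) < 1
print(data, would_fail)
```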

5 months

[PATCH v5] kunit: qemu_configs: Add MIPS configurations

by Thomas Weißschuh

Add basic support to run various MIPS variants via kunit_tool using the
virtualized malta platform.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Reviewed-by: David Gow <davidgow(a)google.com>
---
Changes in v5:
- Rebase on v6.17-rc1
- Drop already applied patch to MIPS core code and related CCs
- Link to v4: https://lore.kernel.org/r/20250611-kunit-mips-v4-0-1d8997fb2ae4@linutronix.…

Changes in v4:
- Rebase on v6.16-rc1
- Pick up reviews from David
- Clarify that GIC page is linked to vDSO
- Link to v3: https://lore.kernel.org/r/20250415-kunit-mips-v3-0-4ec2461b5a7e@linutronix.…

Changes in v3:
- Also skip VDSO_RANDOMIZE_SIZE adjustment for kthreads
- Link to v2: https://lore.kernel.org/r/20250414-kunit-mips-v2-0-4cf01e1a29e6@linutronix.…

Changes in v2:
- Fix usercopy kunit test by handling ABI-less tasks in stack_top()
- Drop change to mm initialization.
  The broken test is not built by default anymore.
- Link to v1: https://lore.kernel.org/r/20250212-kunit-mips-v1-0-eb49c9d76615@linutronix.…
---
 tools/testing/kunit/qemu_configs/mips.py     | 18 ++++++++++++++++++
 tools/testing/kunit/qemu_configs/mips64.py   | 19 +++++++++++++++++++
 tools/testing/kunit/qemu_configs/mips64el.py | 19 +++++++++++++++++++
 tools/testing/kunit/qemu_configs/mipsel.py   | 18 ++++++++++++++++++
 4 files changed, 74 insertions(+)

diff --git a/tools/testing/kunit/qemu_configs/mips.py b/tools/testing/kunit/qemu_configs/mips.py
new file mode 100644
index 0000000000000000000000000000000000000000..8899ac157b30bd2ee847eacd5b90fe6ad4e5fb04
--- /dev/null
+++ b/tools/testing/kunit/qemu_configs/mips.py
@@ -0,0 +1,18 @@
+# SPDX-License-Identifier: GPL-2.0
+
+from ..qemu_config import QemuArchParams
+
+QEMU_ARCH = QemuArchParams(linux_arch='mips',
+			   kconfig='''
+CONFIG_32BIT=y
+CONFIG_CPU_BIG_ENDIAN=y
+CONFIG_MIPS_MALTA=y
+CONFIG_SERIAL_8250=y
+CONFIG_SERIAL_8250_CONSOLE=y
+CONFIG_POWER_RESET=y
+CONFIG_POWER_RESET_SYSCON=y
+''',
+			   qemu_arch='mips',
+			   kernel_path='vmlinuz',
+			   kernel_command_line='console=ttyS0',
+			   extra_qemu_params=['-M', 'malta'])
diff --git a/tools/testing/kunit/qemu_configs/mips64.py b/tools/testing/kunit/qemu_configs/mips64.py
new file mode 100644
index 0000000000000000000000000000000000000000..1478aed05b94da4914f34c6a8affdcfe34eb88ea
--- /dev/null
+++ b/tools/testing/kunit/qemu_configs/mips64.py
@@ -0,0 +1,19 @@
+# SPDX-License-Identifier: GPL-2.0
+
+from ..qemu_config import QemuArchParams
+
+QEMU_ARCH = QemuArchParams(linux_arch='mips',
+			   kconfig='''
+CONFIG_CPU_MIPS64_R2=y
+CONFIG_64BIT=y
+CONFIG_CPU_BIG_ENDIAN=y
+CONFIG_MIPS_MALTA=y
+CONFIG_SERIAL_8250=y
+CONFIG_SERIAL_8250_CONSOLE=y
+CONFIG_POWER_RESET=y
+CONFIG_POWER_RESET_SYSCON=y
+''',
+			   qemu_arch='mips64',
+			   kernel_path='vmlinuz',
+			   kernel_command_line='console=ttyS0',
+			   extra_qemu_params=['-M', 'malta', '-cpu', '5KEc'])
diff --git a/tools/testing/kunit/qemu_configs/mips64el.py b/tools/testing/kunit/qemu_configs/mips64el.py
new file mode 100644
index 0000000000000000000000000000000000000000..300c711d7a82500b2ebcb4cf1467b6f72b5c17aa
--- /dev/null
+++ b/tools/testing/kunit/qemu_configs/mips64el.py
@@ -0,0 +1,19 @@
+# SPDX-License-Identifier: GPL-2.0
+
+from ..qemu_config import QemuArchParams
+
+QEMU_ARCH = QemuArchParams(linux_arch='mips',
+			   kconfig='''
+CONFIG_CPU_MIPS64_R2=y
+CONFIG_64BIT=y
+CONFIG_CPU_LITTLE_ENDIAN=y
+CONFIG_MIPS_MALTA=y
+CONFIG_SERIAL_8250=y
+CONFIG_SERIAL_8250_CONSOLE=y
+CONFIG_POWER_RESET=y
+CONFIG_POWER_RESET_SYSCON=y
+''',
+			   qemu_arch='mips64el',
+			   kernel_path='vmlinuz',
+			   kernel_command_line='console=ttyS0',
+			   extra_qemu_params=['-M', 'malta', '-cpu', '5KEc'])
diff --git a/tools/testing/kunit/qemu_configs/mipsel.py b/tools/testing/kunit/qemu_configs/mipsel.py
new file mode 100644
index 0000000000000000000000000000000000000000..3d3543315b45776d0e77fb5c00c8c0a89eafdffd
--- /dev/null
+++ b/tools/testing/kunit/qemu_configs/mipsel.py
@@ -0,0 +1,18 @@
+# SPDX-License-Identifier: GPL-2.0
+
+from ..qemu_config import QemuArchParams
+
+QEMU_ARCH = QemuArchParams(linux_arch='mips',
+			   kconfig='''
+CONFIG_32BIT=y
+CONFIG_CPU_LITTLE_ENDIAN=y
+CONFIG_MIPS_MALTA=y
+CONFIG_SERIAL_8250=y
+CONFIG_SERIAL_8250_CONSOLE=y
+CONFIG_POWER_RESET=y
+CONFIG_POWER_RESET_SYSCON=y
+''',
+			   qemu_arch='mipsel',
+			   kernel_path='vmlinuz',
+			   kernel_command_line='console=ttyS0',
+			   extra_qemu_params=['-M', 'malta'])

---
base-commit: 5606dd26f0b0d614e64a51e68c86e5066f9a5b71
change-id: 20241014-kunit-mips-e4fe1c265ed7

Best regards,
-- 
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>

5 months

[PATCH v2] kunit: Enable PCI on UML without triggering WARN()

by Thomas Weißschuh

Various KUnit tests require PCI infrastructure to work. All normal
platforms enable PCI by default, but UML does not. Enabling PCI from
.kunitconfig files is problematic as it would not be portable. So in
commit 6fc3a8636a7b ("kunit: tool: Enable virtio/PCI by default on UML")
PCI was enabled by way of CONFIG_UML_PCI_OVER_VIRTIO=y. However
CONFIG_UML_PCI_OVER_VIRTIO requires additional configuration of
CONFIG_UML_PCI_OVER_VIRTIO_DEVICE_ID or will otherwise trigger a WARN() in
virtio_pcidev_init(). However there is no one correct value for
UML_PCI_OVER_VIRTIO_DEVICE_ID which could be used by default.

This warning is confusing when debugging test failures.

On the other hand, the functionality of CONFIG_UML_PCI_OVER_VIRTIO is not
used at all, given that it is completely non-functional as indicated by
the WARN() in question. Instead it is only used as a way to enable
CONFIG_UML_PCI which itself is not directly configurable.

Instead of going through CONFIG_UML_PCI_OVER_VIRTIO, introduce a custom
configuration option which enables CONFIG_UML_PCI without triggering
warnings or building dead code.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Reviewed-by: Johannes Berg <johannes(a)sipsolutions.net>
---
Changes in v2:
- Rebase onto v6.17-rc1
- Pick up review from Johannes
- Link to v1: https://lore.kernel.org/r/20250627-kunit-uml-pci-v1-1-a622fa445e58@linutron…
---
 lib/kunit/Kconfig                           | 7 +++++++
 tools/testing/kunit/configs/arch_uml.config | 5 ++---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig
index c10ede4b1d2201d5f8cddeb71cc5096e21be9b6a..1823539e96da30e165fa8d395ccbd3f6754c836e 100644
--- a/lib/kunit/Kconfig
+++ b/lib/kunit/Kconfig
@@ -106,4 +106,11 @@ config KUNIT_DEFAULT_TIMEOUT
 	  If unsure, the default timeout of 300 seconds is suitable for most
 	  cases.
 
+config KUNIT_UML_PCI
+	bool "KUnit UML PCI Support"
+	depends on UML
+	select UML_PCI
+	help
+	  Enables the PCI subsystem on UML for use by KUnit tests.
+
 endif # KUNIT
diff --git a/tools/testing/kunit/configs/arch_uml.config b/tools/testing/kunit/configs/arch_uml.config
index 54ad8972681a2cc724e6122b19407188910b9025..28edf816aa70e6f408d9486efff8898df79ee090 100644
--- a/tools/testing/kunit/configs/arch_uml.config
+++ b/tools/testing/kunit/configs/arch_uml.config
@@ -1,8 +1,7 @@
 # Config options which are added to UML builds by default
 
-# Enable virtio/pci, as a lot of tests require it.
-CONFIG_VIRTIO_UML=y
-CONFIG_UML_PCI_OVER_VIRTIO=y
+# Enable pci, as a lot of tests require it.
+CONFIG_KUNIT_UML_PCI=y
 
 # Enable FORTIFY_SOURCE for wider checking.
 CONFIG_FORTIFY_SOURCE=y

---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250626-kunit-uml-pci-a2b687553746

Best regards,
-- 
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>

5 months

[linus:master] [selftests/fs/mount] c6d9775c20: kernel-selftests.filesystems/mount-notify.make.fail

by kernel test robot

Hello,

kernel test robot noticed "kernel-selftests.filesystems/mount-notify.make.fail" on:

commit: c6d9775c2066a37385e784ee2e0ce83bd6644610 ("selftests/fs/mount-notify: build with tools include dir")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linus/master      6e64f4580381e32c06ee146ca807c555b8f73e24]
[test failed on linux-next/master 442d93313caebc8ccd6d53f4572c50732a95bc48]

in testcase: kernel-selftests
version: kernel-selftests-x86_64-186f3edfdd41-1_20250803
with following parameters:

	group: filesystems

config: x86_64-rhel-9.4-kselftests
compiler: gcc-12
test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz (Skylake) with 32G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang(a)intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202508110628.65069d92-lkp@intel.com

2025-08-06 18:21:58 make -j36 TARGETS=filesystems/mount-notify
make[1]: Entering directory '/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/tools/testing/selftests/filesystems/mount-notify'
  CC       mount-notify_test
mount-notify_test.c:21:3: error: conflicting types for ‘__kernel_fsid_t’; have ‘struct <anonymous>’
   21 | } __kernel_fsid_t;
      |   ^~~~~~~~~~~~~~~
In file included from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/usr/include/asm/posix_types_64.h:18,
                 from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/usr/include/asm/posix_types.h:7,
                 from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/usr/include/linux/posix_types.h:36,
                 from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/usr/include/linux/types.h:9,
                 from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/usr/include/linux/stat.h:5,
                 from /usr/include/x86_64-linux-gnu/bits/statx.h:31,
                 from /usr/include/x86_64-linux-gnu/sys/stat.h:465,
                 from mount-notify_test.c:9:
/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/usr/include/asm-generic/posix_types.h:81:3: note: previous declaration of ‘__kernel_fsid_t’ with type ‘__kernel_fsid_t’
   81 | } __kernel_fsid_t;
      |   ^~~~~~~~~~~~~~~
make[1]: *** [../../lib.mk:222: /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/tools/testing/selftests/filesystems/mount-notify/mount-notify_test] Error 1
make[1]: Leaving directory '/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/tools/testing/selftests/filesystems/mount-notify'
make: *** [Makefile:203: all] Error 2

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250811/202508110628.65069d92-lkp@…

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

5 months

[PATCH] selftests/mm: do check_huge_anon() with a number been passed in

by Wei Yang

Currently the number of hugepages to check is hard-coded in the call to
check_huge_anon(), but we already have the number passed in.

Doing the check based on the number of hugepages passed in is more
reasonable.

Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com>
---
 tools/testing/selftests/mm/split_huge_page_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index 44a3f8a58806..bf40e6b121ab 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -111,7 +111,7 @@ static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hp
 	unsigned long rss_anon_before, rss_anon_after;
 	size_t i;
 
-	if (!check_huge_anon(one_page, 4, pmd_pagesize))
+	if (!check_huge_anon(one_page, nr_hpages, pmd_pagesize))
 		ksft_exit_fail_msg("No THP is allocated\n");
 
 	rss_anon_before = rss_anon();
-- 
2.34.1

5 months


[PATCH v3 00/13] stackleak: Support Clang stack depth tracking

by Kees Cook

v3:
 - split up and drop __init vs inline patches that went via arch trees
 - apply feedback about preferring __init to __always_inline
 - incorporate Ritesh Harjani's patch for __init cleanups in powerpc
 - wider build testing on older compilers
v2: https://lore.kernel.org/lkml/20250523043251.it.550-kees@kernel.org/
v1: https://lore.kernel.org/lkml/20250507180852.work.231-kees@kernel.org/

Hi,

As part of looking at what GCC plugins could be replaced with Clang implementations, this series uses the recently landed stack depth tracking callback in Clang[1] to implement the stackleak feature. Since the Clang feature is now landed, I'm moving this out of RFC to a v1.

Since this touches a lot of arch-specific Makefiles, I tried to trim the CC list down to just mailing lists in those cases, otherwise the CC was giant.

Thanks!

-Kees

[1] https://clang.llvm.org/docs/SanitizerCoverage.html#tracing-stack-depth

Kees Cook (12):
  stackleak: Rename STACKLEAK to KSTACK_ERASE
  stackleak: Rename stackleak_track_stack to __sanitizer_cov_stack_depth
  stackleak: Split KSTACK_ERASE_CFLAGS from GCC_PLUGINS_CFLAGS
  x86: Handle KCOV __init vs inline mismatches
  arm: Handle KCOV __init vs inline mismatches
  arm64: Handle KCOV __init vs inline mismatches
  s390: Handle KCOV __init vs inline mismatches
  mips: Handle KCOV __init vs inline mismatch
  init.h: Disable sanitizer coverage for __init and __head
  kstack_erase: Support Clang stack depth tracking
  configs/hardening: Enable CONFIG_KSTACK_ERASE
  configs/hardening: Enable CONFIG_INIT_ON_FREE_DEFAULT_ON

Ritesh Harjani (IBM) (1):
  powerpc/mm/book3s64: Move kfence and debug_pagealloc related calls to __init section

 arch/Kconfig | 4 +-
 arch/arm/Kconfig | 2 +-
 arch/arm64/Kconfig | 2 +-
 arch/riscv/Kconfig | 2 +-
 arch/s390/Kconfig | 2 +-
 arch/x86/Kconfig | 2 +-
 security/Kconfig.hardening | 45 +++++++++-------
 Makefile | 1 +
 arch/arm/boot/compressed/Makefile | 2 +-
 arch/arm/vdso/Makefile | 2 +-
 arch/arm64/kernel/pi/Makefile | 2 +-
 arch/arm64/kernel/vdso/Makefile | 3 +-
 arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
 arch/riscv/kernel/pi/Makefile | 2 +-
 arch/riscv/purgatory/Makefile | 2 +-
 arch/sparc/vdso/Makefile | 3 +-
 arch/x86/entry/vdso/Makefile | 3 +-
 arch/x86/purgatory/Makefile | 2 +-
 drivers/firmware/efi/libstub/Makefile | 8 +--
 drivers/misc/lkdtm/Makefile | 2 +-
 kernel/Makefile | 10 ++--
 lib/Makefile | 2 +-
 scripts/Makefile.gcc-plugins | 16 +-----
 scripts/Makefile.kstack_erase | 21 ++++++++
 scripts/gcc-plugins/stackleak_plugin.c | 52 +++++++++----------
 Documentation/admin-guide/sysctl/kernel.rst | 4 +-
 Documentation/arch/x86/x86_64/mm.rst | 2 +-
 Documentation/security/self-protection.rst | 2 +-
 .../zh_CN/security/self-protection.rst | 2 +-
 arch/arm64/include/asm/acpi.h | 2 +-
 arch/mips/include/asm/time.h | 2 +-
 arch/s390/hypfs/hypfs.h | 2 +-
 arch/s390/hypfs/hypfs_diag.h | 2 +-
 arch/x86/entry/calling.h | 4 +-
 arch/x86/include/asm/acpi.h | 4 +-
 arch/x86/include/asm/init.h | 2 +-
 arch/x86/include/asm/realmode.h | 2 +-
 include/linux/acpi.h | 4 +-
 include/linux/bootconfig.h | 2 +-
 include/linux/efi.h | 2 +-
 include/linux/init.h | 4 +-
 include/linux/{stackleak.h => kstack_erase.h} | 20 +++----
 include/linux/memblock.h | 2 +-
 include/linux/mfd/dbx500-prcmu.h | 2 +-
 include/linux/sched.h | 4 +-
 include/linux/smp.h | 2 +-
 arch/arm/kernel/entry-common.S | 2 +-
 arch/arm64/kernel/entry.S | 2 +-
 arch/riscv/kernel/entry.S | 2 +-
 arch/s390/kernel/entry.S | 2 +-
 arch/arm/mm/cache-feroceon-l2.c | 2 +-
 arch/arm/mm/cache-tauros2.c | 2 +-
 arch/powerpc/mm/book3s64/hash_utils.c | 6 +--
 arch/powerpc/mm/book3s64/radix_pgtable.c | 4 +-
 arch/s390/mm/init.c | 2 +-
 arch/x86/kernel/kvm.c | 2 +-
 arch/x86/mm/init_64.c | 2 +-
 drivers/clocksource/timer-orion.c | 2 +-
 .../lkdtm/{stackleak.c => kstack_erase.c} | 26 +++++-----
 drivers/soc/ti/pm33xx.c | 2 +-
 fs/proc/base.c | 6 +--
 kernel/fork.c | 2 +-
 kernel/kexec_handover.c | 4 +-
 kernel/{stackleak.c => kstack_erase.c} | 22 ++++----
 tools/objtool/check.c | 4 +-
 tools/testing/selftests/lkdtm/config | 2 +-
 MAINTAINERS | 6 ++-
 kernel/configs/hardening.config | 6 +++
 68 files changed, 204 insertions(+), 172 deletions(-)
 create mode 100644 scripts/Makefile.kstack_erase
 rename include/linux/{stackleak.h => kstack_erase.h} (81%)
 rename drivers/misc/lkdtm/{stackleak.c => kstack_erase.c} (89%)
 rename kernel/{stackleak.c => kstack_erase.c} (87%)

-- 
2.34.1
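As a rough illustration of what "stack depth tracking" means for a stack-erasing feature: something has to remember the deepest stack address reached, so that a later pass knows how much stack to clear. The toy model below is a userspace sketch under stated assumptions, not the kernel's or Clang's implementation. It assumes a downward-growing stack, and track_stack_depth(), used_stack_bytes() and lowest_sp are made-up names for illustration only.

```c
#include <stdint.h>

/* Hypothetical tracker state: lowest stack address observed so far. */
static uintptr_t lowest_sp = UINTPTR_MAX;

/*
 * Hypothetical stand-in for an instrumentation callback, called with the
 * current stack pointer on function entry. Assumes the stack grows down,
 * so a smaller address means deeper stack use.
 */
static void track_stack_depth(uintptr_t sp)
{
	if (sp < lowest_sp)
		lowest_sp = sp;
}

/* Bytes between a known stack top and the deepest point recorded so far. */
static uintptr_t used_stack_bytes(uintptr_t stack_top)
{
	return lowest_sp == UINTPTR_MAX ? 0 : stack_top - lowest_sp;
}
```

In this model an erase pass would only need to clear the range from lowest_sp up to the stack top, then reset lowest_sp.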

5 months


[PATCH -rebased 15/15] selftests/sched_ext: Add test for DL server total_bw consistency

by Joel Fernandes

Add a new kselftest to verify that the total_bw value in /sys/kernel/debug/sched/debug remains consistent across all CPUs under different sched_ext BPF program states: 1. Before a BPF scheduler is loaded 2. While a BPF scheduler is loaded and active 3. After a BPF scheduler is unloaded The test runs CPU stress threads to ensure DL server bandwidth values stabilize before checking consistency. This helps catch potential issues with DL server bandwidth accounting during sched_ext transitions. Signed-off-by: Joel Fernandes <joelagnelf(a)nvidia.com> --- tools/testing/selftests/sched_ext/Makefile | 1 + tools/testing/selftests/sched_ext/total_bw.c | 282 +++++++++++++++++++ 2 files changed, 283 insertions(+) create mode 100644 tools/testing/selftests/sched_ext/total_bw.c diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile index f0a8cba3a99f..d48be158b0a1 100644 --- a/tools/testing/selftests/sched_ext/Makefile +++ b/tools/testing/selftests/sched_ext/Makefile @@ -184,6 +184,7 @@ auto-test-targets := \ select_cpu_vtime \ rt_stall \ test_example \ + total_bw \ testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets))) diff --git a/tools/testing/selftests/sched_ext/total_bw.c b/tools/testing/selftests/sched_ext/total_bw.c new file mode 100644 index 000000000000..d70852cee358 --- /dev/null +++ b/tools/testing/selftests/sched_ext/total_bw.c @@ -0,0 +1,282 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Test to verify that total_bw value remains consistent across all CPUs + * in different BPF program states. + * + * Copyright (C) 2025 Nvidia Corporation. 
+ */ +#include <bpf/bpf.h> +#include <errno.h> +#include <pthread.h> +#include <scx/common.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/wait.h> +#include <unistd.h> +#include "minimal.bpf.skel.h" +#include "scx_test.h" + +#define MAX_CPUS 512 +#define STRESS_DURATION_SEC 5 + +struct total_bw_ctx { + struct minimal *skel; + long baseline_bw[MAX_CPUS]; + int nr_cpus; +}; + +static void *cpu_stress_thread(void *arg) +{ + volatile int i; + time_t end_time = time(NULL) + STRESS_DURATION_SEC; + + while (time(NULL) < end_time) { + for (i = 0; i < 1000000; i++); + } + + return NULL; +} + +/* + * The first enqueue on a CPU causes the DL server to start, for that + * reason run stressor threads in the hopes it schedules on all CPUs. + */ +static int run_cpu_stress(int nr_cpus) +{ + pthread_t *threads; + int i, ret = 0; + + threads = calloc(nr_cpus, sizeof(pthread_t)); + if (!threads) + return -ENOMEM; + + /* Create threads to run on each CPU */ + for (i = 0; i < nr_cpus; i++) { + if (pthread_create(&threads[i], NULL, cpu_stress_thread, NULL)) { + ret = -errno; + fprintf(stderr, "Failed to create thread %d: %s\n", i, strerror(-ret)); + break; + } + } + + /* Wait for all threads to complete */ + for (i = 0; i < nr_cpus; i++) { + if (threads[i]) + pthread_join(threads[i], NULL); + } + + free(threads); + return ret; +} + +static int read_total_bw_values(long *bw_values, int max_cpus) +{ + FILE *fp; + char line[256]; + int cpu_count = 0; + + fp = fopen("/sys/kernel/debug/sched/debug", "r"); + if (!fp) { + SCX_ERR("Failed to open debug file"); + return -1; + } + + while (fgets(line, sizeof(line), fp)) { + char *bw_str = strstr(line, "total_bw"); + if (bw_str) { + bw_str = strchr(bw_str, ':'); + if (bw_str) { + /* Only store up to max_cpus values */ + if (cpu_count < max_cpus) { + bw_values[cpu_count] = atol(bw_str + 1); + } + cpu_count++; + } + } + } + + fclose(fp); + return cpu_count; +} + +static bool verify_total_bw_consistency(long 
*bw_values, int count) +{ + int i; + long first_value; + + if (count <= 0) + return false; + + first_value = bw_values[0]; + + for (i = 1; i < count; i++) { + if (bw_values[i] != first_value) { + SCX_ERR("Inconsistent total_bw: CPU0=%ld, CPU%d=%ld", + first_value, i, bw_values[i]); + return false; + } + } + + return true; +} + +static int fetch_verify_total_bw(long *bw_values, int nr_cpus) +{ + int attempts = 0; + int max_attempts = 10; + int count; + + /* + * The first enqueue on a CPU causes the DL server to start, for that + * reason run stressor threads in the hopes it schedules on all CPUs. + */ + if (run_cpu_stress(nr_cpus) < 0) { + SCX_ERR("Failed to run CPU stress"); + return -1; + } + + /* Try multiple times to get stable values */ + while (attempts < max_attempts) { + count = read_total_bw_values(bw_values, nr_cpus); + fprintf(stderr, "Read %d total_bw values (testing %d CPUs)\n", count, nr_cpus); + /* If system has more CPUs than we're testing, that's OK */ + if (count < nr_cpus) { + SCX_ERR("Expected at least %d CPUs, got %d", nr_cpus, count); + attempts++; + sleep(1); + continue; + } + + /* Only verify the CPUs we're testing */ + if (verify_total_bw_consistency(bw_values, nr_cpus)) { + fprintf(stderr, "Values are consistent: %ld\n", bw_values[0]); + return 0; + } + + attempts++; + sleep(1); + } + + return -1; +} + +static enum scx_test_status setup(void **ctx) +{ + struct total_bw_ctx *test_ctx; + + if (access("/sys/kernel/debug/sched/debug", R_OK) != 0) { + fprintf(stderr, "Skipping test: debugfs sched/debug not accessible\n"); + return SCX_TEST_SKIP; + } + + test_ctx = calloc(1, sizeof(*test_ctx)); + if (!test_ctx) + return SCX_TEST_FAIL; + + test_ctx->nr_cpus = sysconf(_SC_NPROCESSORS_ONLN); + if (test_ctx->nr_cpus <= 0) { + free(test_ctx); + return SCX_TEST_FAIL; + } + + /* If system has more CPUs than MAX_CPUS, just test the first MAX_CPUS */ + if (test_ctx->nr_cpus > MAX_CPUS) { + test_ctx->nr_cpus = MAX_CPUS; + } + + /* Test scenario 1: BPF 
program not loaded */ + /* Read and verify baseline total_bw before loading BPF program */ + fprintf(stderr, "BPF prog initially not loaded, reading total_bw values\n"); + if (fetch_verify_total_bw(test_ctx->baseline_bw, test_ctx->nr_cpus) < 0) { + SCX_ERR("Failed to get stable baseline values"); + free(test_ctx); + return SCX_TEST_FAIL; + } + + /* Load the BPF skeleton */ + test_ctx->skel = minimal__open(); + if (!test_ctx->skel) { + free(test_ctx); + return SCX_TEST_FAIL; + } + + SCX_ENUM_INIT(test_ctx->skel); + if (minimal__load(test_ctx->skel)) { + minimal__destroy(test_ctx->skel); + free(test_ctx); + return SCX_TEST_FAIL; + } + + *ctx = test_ctx; + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct total_bw_ctx *test_ctx = ctx; + struct bpf_link *link; + long loaded_bw[MAX_CPUS]; + long unloaded_bw[MAX_CPUS]; + int i; + + /* Test scenario 2: BPF program loaded */ + link = bpf_map__attach_struct_ops(test_ctx->skel->maps.minimal_ops); + if (!link) { + SCX_ERR("Failed to attach scheduler"); + return SCX_TEST_FAIL; + } + + fprintf(stderr, "BPF program loaded, reading total_bw values\n"); + if (fetch_verify_total_bw(loaded_bw, test_ctx->nr_cpus) < 0) { + SCX_ERR("Failed to get stable values with BPF loaded"); + bpf_link__destroy(link); + return SCX_TEST_FAIL; + } + bpf_link__destroy(link); + + /* Test scenario 3: BPF program unloaded */ + fprintf(stderr, "BPF program unloaded, reading total_bw values\n"); + if (fetch_verify_total_bw(unloaded_bw, test_ctx->nr_cpus) < 0) { + SCX_ERR("Failed to get stable values after BPF unload"); + return SCX_TEST_FAIL; + } + + /* Verify all three scenarios have the same total_bw values */ + for (i = 0; i < test_ctx->nr_cpus; i++) { + if (test_ctx->baseline_bw[i] != loaded_bw[i]) { + SCX_ERR("CPU%d: baseline_bw=%ld != loaded_bw=%ld", + i, test_ctx->baseline_bw[i], loaded_bw[i]); + return SCX_TEST_FAIL; + } + + if (test_ctx->baseline_bw[i] != unloaded_bw[i]) { + SCX_ERR("CPU%d: baseline_bw=%ld != 
unloaded_bw=%ld", + i, test_ctx->baseline_bw[i], unloaded_bw[i]); + return SCX_TEST_FAIL; + } + } + + fprintf(stderr, "All total_bw values are consistent across all scenarios\n"); + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct total_bw_ctx *test_ctx = ctx; + + if (test_ctx) { + if (test_ctx->skel) + minimal__destroy(test_ctx->skel); + free(test_ctx); + } +} + +struct scx_test total_bw = { + .name = "total_bw", + .description = "Verify total_bw consistency across BPF program states", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&total_bw) -- 2.34.1
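The core of the test is small: pull the number after "total_bw" out of each debug line, then check that all collected values agree. A minimal standalone sketch of those two steps follows. It works on strings instead of the debugfs file so it can be exercised anywhere; parse_total_bw_line() and all_equal() are illustrative names, and the exact layout of the debug output is an assumption here.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Extract the value from one line of text if it contains a "total_bw" key
 * followed by ':'. Returns 1 and stores the value on a match, 0 otherwise.
 */
static int parse_total_bw_line(const char *line, long *value)
{
	const char *key = strstr(line, "total_bw");
	const char *colon;

	if (!key)
		return 0;
	colon = strchr(key, ':');
	if (!colon)
		return 0;
	*value = atol(colon + 1);
	return 1;
}

/* Return 1 if the first count values are all equal, 0 otherwise. */
static int all_equal(const long *values, int count)
{
	int i;

	if (count <= 0)
		return 0;
	for (i = 1; i < count; i++)
		if (values[i] != values[0])
			return 0;
	return 1;
}
```

Note that atol() cannot report a parse error; a stricter variant could use strtol() and check the end pointer.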

5 months


[PATCH -rebased 14/15] selftests/sched_ext: Add test for sched_ext dl_server

by Joel Fernandes

From: Andrea Righi <arighi(a)nvidia.com> Add a selftest to validate the correct behavior of the deadline server for the ext_sched_class. [ Joel: Replaced occurences of CFS in the test with EXT. ] Signed-off-by: Joel Fernandes <joelagnelf(a)nvidia.com> Signed-off-by: Andrea Righi <arighi(a)nvidia.com> --- tools/testing/selftests/sched_ext/Makefile | 1 + .../selftests/sched_ext/rt_stall.bpf.c | 23 ++ tools/testing/selftests/sched_ext/rt_stall.c | 213 ++++++++++++++++++ 3 files changed, 237 insertions(+) create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile index 9d9d6b4c38b0..f0a8cba3a99f 100644 --- a/tools/testing/selftests/sched_ext/Makefile +++ b/tools/testing/selftests/sched_ext/Makefile @@ -182,6 +182,7 @@ auto-test-targets := \ select_cpu_dispatch_bad_dsq \ select_cpu_dispatch_dbl_dsp \ select_cpu_vtime \ + rt_stall \ test_example \ testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets))) diff --git a/tools/testing/selftests/sched_ext/rt_stall.bpf.c b/tools/testing/selftests/sched_ext/rt_stall.bpf.c new file mode 100644 index 000000000000..80086779dd1e --- /dev/null +++ b/tools/testing/selftests/sched_ext/rt_stall.bpf.c @@ -0,0 +1,23 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * A scheduler that verified if RT tasks can stall SCHED_EXT tasks. + * + * Copyright (c) 2025 NVIDIA Corporation. 
+ */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +UEI_DEFINE(uei); + +void BPF_STRUCT_OPS(rt_stall_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops rt_stall_ops = { + .exit = (void *)rt_stall_exit, + .name = "rt_stall", +}; diff --git a/tools/testing/selftests/sched_ext/rt_stall.c b/tools/testing/selftests/sched_ext/rt_stall.c new file mode 100644 index 000000000000..d4cb545ebfd8 --- /dev/null +++ b/tools/testing/selftests/sched_ext/rt_stall.c @@ -0,0 +1,213 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2025 NVIDIA Corporation. + */ +#define _GNU_SOURCE +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <sched.h> +#include <sys/prctl.h> +#include <sys/types.h> +#include <sys/wait.h> +#include <time.h> +#include <linux/sched.h> +#include <signal.h> +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <unistd.h> +#include "rt_stall.bpf.skel.h" +#include "scx_test.h" +#include "../kselftest.h" + +#define CORE_ID 0 /* CPU to pin tasks to */ +#define RUN_TIME 5 /* How long to run the test in seconds */ + +/* Simple busy-wait function for test tasks */ +static void process_func(void) +{ + while (1) { + /* Busy wait */ + for (volatile unsigned long i = 0; i < 10000000UL; i++); + } +} + +/* Set CPU affinity to a specific core */ +static void set_affinity(int cpu) +{ + cpu_set_t mask; + + CPU_ZERO(&mask); + CPU_SET(cpu, &mask); + if (sched_setaffinity(0, sizeof(mask), &mask) != 0) { + perror("sched_setaffinity"); + exit(EXIT_FAILURE); + } +} + +/* Set task scheduling policy and priority */ +static void set_sched(int policy, int priority) +{ + struct sched_param param; + + param.sched_priority = priority; + if (sched_setscheduler(0, policy, &param) != 0) { + perror("sched_setscheduler"); + exit(EXIT_FAILURE); + } +} + +/* Get process runtime from /proc/<pid>/stat */ +static float get_process_runtime(int pid) +{ + char 
path[256]; + FILE *file; + long utime, stime; + int fields; + + snprintf(path, sizeof(path), "/proc/%d/stat", pid); + file = fopen(path, "r"); + if (file == NULL) { + perror("Failed to open stat file"); + return -1; + } + + /* Skip the first 13 fields and read the 14th and 15th */ + fields = fscanf(file, + "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu", + &utime, &stime); + fclose(file); + + if (fields != 2) { + fprintf(stderr, "Failed to read stat file\n"); + return -1; + } + + /* Calculate the total time spent in the process */ + long total_time = utime + stime; + long ticks_per_second = sysconf(_SC_CLK_TCK); + float runtime_seconds = total_time * 1.0 / ticks_per_second; + + return runtime_seconds; +} + +static enum scx_test_status setup(void **ctx) +{ + struct rt_stall *skel; + + skel = rt_stall__open(); + SCX_FAIL_IF(!skel, "Failed to open"); + SCX_ENUM_INIT(skel); + SCX_FAIL_IF(rt_stall__load(skel), "Failed to load skel"); + + *ctx = skel; + + return SCX_TEST_PASS; +} + +static bool sched_stress_test(void) +{ + float cfs_runtime, rt_runtime; + int cfs_pid, rt_pid; + float expected_min_ratio = 0.04; /* 4% */ + + ksft_print_header(); + ksft_set_plan(1); + + /* Create and set up a EXT task */ + cfs_pid = fork(); + if (cfs_pid == 0) { + set_affinity(CORE_ID); + process_func(); + exit(0); + } else if (cfs_pid < 0) { + perror("fork for EXT task"); + ksft_exit_fail(); + } + + /* Create an RT task */ + rt_pid = fork(); + if (rt_pid == 0) { + set_affinity(CORE_ID); + set_sched(SCHED_FIFO, 50); + process_func(); + exit(0); + } else if (rt_pid < 0) { + perror("fork for RT task"); + ksft_exit_fail(); + } + + /* Let the processes run for the specified time */ + sleep(RUN_TIME); + + /* Get runtime for the EXT task */ + cfs_runtime = get_process_runtime(cfs_pid); + if (cfs_runtime != -1) + ksft_print_msg("Runtime of EXT task (PID %d) is %f seconds\n", cfs_pid, cfs_runtime); + else + ksft_exit_fail_msg("Error getting runtime for EXT task (PID %d)\n", cfs_pid); 
+ + /* Get runtime for the RT task */ + rt_runtime = get_process_runtime(rt_pid); + if (rt_runtime != -1) + ksft_print_msg("Runtime of RT task (PID %d) is %f seconds\n", rt_pid, rt_runtime); + else + ksft_exit_fail_msg("Error getting runtime for RT task (PID %d)\n", rt_pid); + + /* Kill the processes */ + kill(cfs_pid, SIGKILL); + kill(rt_pid, SIGKILL); + waitpid(cfs_pid, NULL, 0); + waitpid(rt_pid, NULL, 0); + + /* Verify that the scx task got enough runtime */ + float actual_ratio = cfs_runtime / (cfs_runtime + rt_runtime); + ksft_print_msg("EXT task got %.2f%% of total runtime\n", actual_ratio * 100); + + if (actual_ratio >= expected_min_ratio) { + ksft_test_result_pass("PASS: EXT task got more than %.2f%% of runtime\n", + expected_min_ratio * 100); + return true; + } else { + ksft_test_result_fail("FAIL: EXT task got less than %.2f%% of runtime\n", + expected_min_ratio * 100); + return false; + } +} + +static enum scx_test_status run(void *ctx) +{ + struct rt_stall *skel = ctx; + struct bpf_link *link; + bool res; + + link = bpf_map__attach_struct_ops(skel->maps.rt_stall_ops); + SCX_FAIL_IF(!link, "Failed to attach scheduler"); + + res = sched_stress_test(); + + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE)); + bpf_link__destroy(link); + + if (!res) + ksft_exit_fail(); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct rt_stall *skel = ctx; + + rt_stall__destroy(skel); +} + +struct scx_test rt_stall = { + .name = "rt_stall", + .description = "Verify that RT tasks cannot stall SCHED_EXT tasks", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&rt_stall) -- 2.34.1
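The pass criterion boils down to two steps: read utime and stime (fields 14 and 15) from the stat line, then compare the EXT task's share of the combined runtime against the 4% floor. A small sketch of that arithmetic, using sscanf() on a string so it runs without a live process, might look like this. parse_stat_ticks() and runtime_share() are illustrative names, and like the test above the format string assumes the comm field contains no spaces.

```c
#include <stdio.h>

/*
 * Parse utime and stime (fields 14 and 15) from a /proc/<pid>/stat line.
 * Returns 0 on success, -1 if both fields could not be read.
 */
static int parse_stat_ticks(const char *line, unsigned long *utime,
			    unsigned long *stime)
{
	int n = sscanf(line,
		       "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
		       utime, stime);

	return n == 2 ? 0 : -1;
}

/* Fraction of the combined runtime that went to the first task. */
static double runtime_share(double task_seconds, double other_seconds)
{
	double total = task_seconds + other_seconds;

	return total > 0.0 ? task_seconds / total : 0.0;
}
```

For example, 0.25 s for the EXT task against 4.75 s for the RT task gives a share of about 5%, which is above the 4% minimum used by the test.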

5 months


[PATCH v3 0/7] selftests/mm: Fix false positives and skip unsupported tests

by Aboorva Devarajan

Hi all,

This patch series addresses false positives in the generic mm selftests and skips tests that cannot run correctly due to missing features or system limitations.

v2: https://lore.kernel.org/all/20250703060656.54345-1-aboorvad@linux.ibm.com/

Changes in v3:
- Rebased onto the latest mm-new branch, top commit of the base is commit 0709ddf8951f ("mm: add zblock allocator").
- Minor refactor based on the review comments.
- Included the tags from the previous version.

---
v1: https://lore.kernel.org/all/20250616160632.35250-1-aboorvad@linux.ibm.com/

Changes in v2:
- Rebased onto the mm-new branch, top commit of the base is commit 3b4a8ad89f7e ("mm: add zblock allocator").
- Split some patches for clarity.
- Updated virtual_address_range test to support testing 4PB VA on PPC64.
- Added proper Fixes: tags.
- Included a patch to skip a failing userfaultfd test when unsupported, instead of reporting a failure.

---
Please let us know if you have any further comments.

Thanks,
Aboorva

Aboorva Devarajan (3):
  selftests/mm: Fix child process exit codes in ksm_functional_tests
  selftests/mm: Skip thuge-gen test if system is not setup properly
  selftests/mm: Skip hugepage-mremap test if userfaultfd unavailable

Donet Tom (4):
  mm/selftests: Fix incorrect pointer being passed to mark_range()
  selftests/mm: Add support to test 4PB VA on PPC64
  selftest/mm: Fix ksm_funtional_test failures
  mm/selftests: Fix split_huge_page_test failure on systems with 64KB page size

 tools/testing/selftests/mm/hugepage-mremap.c | 16 +++++++++--
 .../selftests/mm/ksm_functional_tests.c | 28 +++++++++++++------
 .../selftests/mm/split_huge_page_test.c | 23 +++++++++------
 tools/testing/selftests/mm/thuge-gen.c | 11 +++++---
 .../selftests/mm/virtual_address_range.c | 13 ++++++++-
 5 files changed, 67 insertions(+), 24 deletions(-)

-- 
2.47.1

5 months


[PATCH v2 0/1] selftests/futex: Check for shmget support at runtime

by Wake Liu

Changes in v2:
- Restore RET_FAIL assignments in error paths to ensure the test's exit code accurately reflects the failure status.

Wake Liu (1):
  selftests/futex: Check for shmget support at runtime

 .../selftests/futex/functional/futex_wait.c | 49 +++++++------
 .../selftests/futex/functional/futex_waitv.c | 73 ++++++++++++-------
 2 files changed, 73 insertions(+), 49 deletions(-)

-- 
2.50.1.703.g449372360f-goog

5 months


[PATCH] selftests/filesystems/binderfs: Skip tests if user namespaces are unavailable

by Wake Liu

The binderfs selftests, specifically `binderfs_stress` and `binderfs_test_unprivileged`, depend on user namespaces to run. On kernels built without user namespace support (CONFIG_USER_NS=n), these tests will fail.

To prevent these failures, add a check for the availability of user namespaces by testing for the existence of "/proc/self/ns/user". If the check fails, skip the tests and print an informative message.

Signed-off-by: Wake Liu <wakel(a)google.com>
---
 .../selftests/filesystems/binderfs/binderfs_test.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tools/testing/selftests/filesystems/binderfs/binderfs_test.c b/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
index 81db85a5cc16..e77ed34ebd06 100644
--- a/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
+++ b/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
@@ -291,6 +291,11 @@ static int write_id_mapping(enum idmap_type type, pid_t pid, const char *buf,
 	return 0;
 }
 
+static bool has_userns(void)
+{
+	return (access("/proc/self/ns/user", F_OK) == 0);
+}
+
 static void change_userns(struct __test_metadata *_metadata, int syncfds[2])
 {
 	int ret;
@@ -378,6 +383,9 @@ static void *binder_version_thread(void *data)
  */
 TEST(binderfs_stress)
 {
+	if (!has_userns())
+		SKIP(return, "%s: user namespace not supported\n", __func__);
+
 	int fds[1000];
 	int syncfds[2];
 	pid_t pid;
@@ -502,6 +510,8 @@ TEST(binderfs_test_privileged)
 
 TEST(binderfs_test_unprivileged)
 {
+	if (!has_userns())
+		SKIP(return, "%s: user namespace not supported\n", __func__);
 	int ret;
 	int syncfds[2];
 	pid_t pid;
-- 
2.50.1.703.g449372360f-goog
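The same probe-then-skip pattern can be sketched outside the kselftest harness. The snippet below is only an illustration: feature_path_exists() and exit_code_for_feature() are made-up names, and the skip exit code of 4 is assumed to match kselftest's KSFT_SKIP.

```c
#include <stdbool.h>
#include <unistd.h>

#define SKIP_EXIT_CODE 4	/* assumed to match kselftest's KSFT_SKIP */

/* Return true if the given procfs or sysfs path exists. */
static bool feature_path_exists(const char *path)
{
	return access(path, F_OK) == 0;
}

/* Map a feature probe onto an exit code: 0 to proceed, skip code otherwise. */
static int exit_code_for_feature(const char *path)
{
	return feature_path_exists(path) ? 0 : SKIP_EXIT_CODE;
}
```

A standalone test program could call exit_code_for_feature("/proc/self/ns/user") first and return early when the result is nonzero.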

5 months


[PATCH] selftests/futex: Skip futex_waitv tests if ENOSYS

by Wake Liu

The futex_waitv() syscall was introduced in Linux 5.16. The existing test in futex_wait_timeout.c will fail on kernels older than 5.16 due to the syscall not being implemented.

Modify the test_timeout() function to check if the error returned is ENOSYS. If it is, skip the test and report it as such, rather than failing. This ensures the selftests can be run on a wider range of kernel versions without false negatives.

Signed-off-by: Wake Liu <wakel(a)google.com>
---
 .../selftests/futex/functional/futex_wait_timeout.c | 11 ++++++++---
 .../testing/selftests/futex/functional/futex_waitv.c | 8 ++++++++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/futex/functional/futex_wait_timeout.c b/tools/testing/selftests/futex/functional/futex_wait_timeout.c
index d183f878360b..323cab339814 100644
--- a/tools/testing/selftests/futex/functional/futex_wait_timeout.c
+++ b/tools/testing/selftests/futex/functional/futex_wait_timeout.c
@@ -64,9 +64,14 @@ void *get_pi_lock(void *arg)
 static void test_timeout(int res, int *ret, char *test_name, int err)
 {
 	if (!res || errno != err) {
-		ksft_test_result_fail("%s returned %d\n", test_name,
-				      res < 0 ? errno : res);
-		*ret = RET_FAIL;
+		if (errno == ENOSYS) {
+			ksft_test_result_skip("%s: %s\n",
+					      test_name, strerror(errno));
+		} else {
+			ksft_test_result_fail("%s returned %d\n", test_name,
+					      res < 0 ? errno : res);
+			*ret = RET_FAIL;
+		}
 	} else {
 		ksft_test_result_pass("%s succeeds\n", test_name);
 	}
diff --git a/tools/testing/selftests/futex/functional/futex_waitv.c b/tools/testing/selftests/futex/functional/futex_waitv.c
index 034dbfef40cb..2a86fd3ea657 100644
--- a/tools/testing/selftests/futex/functional/futex_waitv.c
+++ b/tools/testing/selftests/futex/functional/futex_waitv.c
@@ -59,6 +59,14 @@ void *waiterfn(void *arg)
 
 int main(int argc, char *argv[])
 {
+	if (!ksft_min_kernel_version(5, 16)) {
+		ksft_print_header();
+		ksft_set_plan(0);
+		ksft_print_msg("%s: FUTEX_WAITV not implemented until 5.16\n",
+			       basename(argv[0]));
+		ksft_print_cnts();
+		return KSFT_SKIP;
+	}
 	pthread_t waiter;
 	int res, ret = RET_PASS;
 	struct timespec to;
-- 
2.50.1.703.g449372360f-goog
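The three-way decision in test_timeout() (pass, skip, fail) can be expressed as a small pure function, which makes the cases easy to reason about. This is a sketch, not the patch's code: classify_result() and enum outcome are illustrative names, and unlike the patch it only consults errno when the call actually failed, on the assumption that errno may hold a stale value after a successful call.

```c
#include <errno.h>

enum outcome { OUTCOME_PASS, OUTCOME_SKIP, OUTCOME_FAIL };

/*
 * Classify a syscall result that is expected to fail with expected_err.
 * saved_errno is the errno value captured right after the call.
 */
static enum outcome classify_result(int res, int saved_errno, int expected_err)
{
	if (res < 0 && saved_errno == ENOSYS)
		return OUTCOME_SKIP;	/* syscall not implemented by this kernel */
	if (res < 0 && saved_errno == expected_err)
		return OUTCOME_PASS;	/* failed in the expected way */
	return OUTCOME_FAIL;		/* unexpected success or unexpected error */
}
```

A timeout test would pass ETIMEDOUT as expected_err; only the skip outcome leaves the overall return code untouched.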

5 months


[PATCH v3] selftests/futex: Check for shmget support at runtime

by Wake Liu

The futex tests `futex_wait.c` and `futex_waitv.c` rely on the `shmget()` syscall, which may not be available if the kernel is built without System V IPC support (CONFIG_SYSVIPC=n). This can lead to test failures on such systems. This patch modifies the tests to check for `shmget()` support at runtime by calling it and checking for an `ENOSYS` error. If `shmget()` is not supported, the tests are skipped with a clear message, improving the user experience and preventing false negatives. This approach is more robust than relying on compile-time checks and ensures that the tests run only when the required kernel features are present. Signed-off-by: Wake Liu <wakel(a)google.com> --- .../selftests/futex/functional/futex_wait.c | 57 ++++++++------- .../selftests/futex/functional/futex_waitv.c | 73 ++++++++++++------- 2 files changed, 78 insertions(+), 52 deletions(-) diff --git a/tools/testing/selftests/futex/functional/futex_wait.c b/tools/testing/selftests/futex/functional/futex_wait.c index 685140d9b93d..2a834f074959 100644 --- a/tools/testing/selftests/futex/functional/futex_wait.c +++ b/tools/testing/selftests/futex/functional/futex_wait.c @@ -48,7 +48,7 @@ static void *waiterfn(void *arg) int main(int argc, char *argv[]) { int res, ret = RET_PASS, fd, c, shm_id; - u_int32_t f_private = 0, *shared_data; + u_int32_t f_private = 0, *shared_data = NULL; unsigned int flags = FUTEX_PRIVATE_FLAG; pthread_t waiter; void *shm; @@ -96,32 +96,37 @@ int main(int argc, char *argv[]) /* Testing an anon page shared memory */ shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666); if (shm_id < 0) { - perror("shmget"); - exit(1); - } - - shared_data = shmat(shm_id, NULL, 0); - - *shared_data = 0; - futex = shared_data; - - info("Calling shared (page anon) futex_wait on futex: %p\n", futex); - if (pthread_create(&waiter, NULL, waiterfn, NULL)) - error("pthread_create failed\n", errno); - - usleep(WAKE_WAIT_US); - - info("Calling shared (page anon) futex_wake on futex: %p\n", futex); - 
res = futex_wake(futex, 1, 0); - if (res != 1) { - ksft_test_result_fail("futex_wake shared (page anon) returned: %d %s\n", - errno, strerror(errno)); - ret = RET_FAIL; + if (errno == ENOSYS) { + ksft_test_result_skip("Kernel does not support System V shared memory\n"); + } else { + ksft_test_result_fail("shmget() failed with error: %s\n", strerror(errno)); + ret = RET_FAIL; + } } else { - ksft_test_result_pass("futex_wake shared (page anon) succeeds\n"); + shared_data = shmat(shm_id, NULL, 0); + + *shared_data = 0; + futex = shared_data; + + info("Calling shared (page anon) futex_wait on futex: %p\n", futex); + if (pthread_create(&waiter, NULL, waiterfn, NULL)) + error("pthread_create failed\n", errno); + + usleep(WAKE_WAIT_US); + + info("Calling shared (page anon) futex_wake on futex: %p\n", futex); + res = futex_wake(futex, 1, 0); + if (res != 1) { + if (res < 0) + ksft_test_result_fail("futex_wake shared (page anon) failed with res=%d: %m\n", res); + else + ksft_test_result_fail("futex_wake shared (page anon) returned %d, expected 1\n", res); + ret = RET_FAIL; + } else { + ksft_test_result_pass("futex_wake shared (page anon) succeeds\n"); + } } - /* Testing a file backed shared memory */ fd = open(SHM_PATH, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR); if (fd < 0) { @@ -161,7 +166,8 @@ int main(int argc, char *argv[]) } /* Freeing resources */ - shmdt(shared_data); + if (shared_data) + shmdt(shared_data); munmap(shm, sizeof(f_private)); remove(SHM_PATH); close(fd); @@ -169,3 +175,4 @@ int main(int argc, char *argv[]) ksft_print_cnts(); return ret; } + diff --git a/tools/testing/selftests/futex/functional/futex_waitv.c b/tools/testing/selftests/futex/functional/futex_waitv.c index a94337f677e1..034dbfef40cb 100644 --- a/tools/testing/selftests/futex/functional/futex_waitv.c +++ b/tools/testing/selftests/futex/functional/futex_waitv.c @@ -110,40 +110,58 @@ int main(int argc, char *argv[]) } /* Shared waitv */ - for (i = 0; i < NR_FUTEXES; i++) { - int shm_id = 
shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666); - - if (shm_id < 0) { - perror("shmget"); - exit(1); + bool shm_supported = true; + int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666); + + if (shm_id < 0) { + if (errno == ENOSYS) { + shm_supported = false; + ksft_test_result_skip("Kernel does not support System V shared memory\n"); + } else { + ksft_test_result_fail("shmget() failed with error: %s\n", strerror(errno)); + ret = RET_FAIL; + shm_supported = false; } + } else { + shmctl(shm_id, IPC_RMID, NULL); + } - unsigned int *shared_data = shmat(shm_id, NULL, 0); + if (shm_supported) { + for (i = 0; i < NR_FUTEXES; i++) { + int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666); - *shared_data = 0; - waitv[i].uaddr = (uintptr_t)shared_data; - waitv[i].flags = FUTEX_32; - waitv[i].val = 0; - waitv[i].__reserved = 0; - } + if (shm_id < 0) { + perror("shmget"); + exit(1); + } - if (pthread_create(&waiter, NULL, waiterfn, NULL)) - error("pthread_create failed\n", errno); + unsigned int *shared_data = shmat(shm_id, NULL, 0); - usleep(WAKE_WAIT_US); + *shared_data = 0; + waitv[i].uaddr = (uintptr_t)shared_data; + waitv[i].flags = FUTEX_32; + waitv[i].val = 0; + waitv[i].__reserved = 0; + } - res = futex_wake(u64_to_ptr(waitv[NR_FUTEXES - 1].uaddr), 1, 0); - if (res != 1) { - ksft_test_result_fail("futex_wake shared returned: %d %s\n", - res ? errno : res, - res ? strerror(errno) : ""); - ret = RET_FAIL; - } else { - ksft_test_result_pass("futex_waitv shared\n"); - } + if (pthread_create(&waiter, NULL, waiterfn, NULL)) + error("pthread_create failed\n", errno); - for (i = 0; i < NR_FUTEXES; i++) - shmdt(u64_to_ptr(waitv[i].uaddr)); + usleep(WAKE_WAIT_US); + + res = futex_wake(u64_to_ptr(waitv[NR_FUTEXES - 1].uaddr), 1, 0); + if (res != 1) { + ksft_test_result_fail("futex_wake shared returned: %d %s\n", + res ? errno : res, + res ? 
strerror(errno) : ""); + ret = RET_FAIL; + } else { + ksft_test_result_pass("futex_waitv shared\n"); + } + + for (i = 0; i < NR_FUTEXES; i++) + shmdt(u64_to_ptr(waitv[i].uaddr)); + } /* Testing a waiter without FUTEX_32 flag */ waitv[0].flags = FUTEX_PRIVATE_FLAG; @@ -235,3 +253,4 @@ int main(int argc, char *argv[]) ksft_print_cnts(); return ret; } + -- 2.50.1.703.g449372360f-goog
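The skip-versus-fail decision that the patch above adds around shmget() can be looked at in isolation. The following is only a minimal userspace sketch of that decision: the names `shm_probe`, `classify_shmget()` and `probe_sysv_shm()` are hypothetical and do not appear in the selftests, and plain return codes stand in for the ksft_test_result_*() helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical result codes for this sketch, not part of kselftest. */
enum shm_probe { SHM_PROBE_OK, SHM_PROBE_SKIP, SHM_PROBE_FAIL };

/* Map an shmget() outcome onto "run the test", "skip" or "fail". */
enum shm_probe classify_shmget(int shm_id, int err)
{
	if (shm_id >= 0)
		return SHM_PROBE_OK;
	/* ENOSYS means the kernel was built without System V shared memory. */
	return err == ENOSYS ? SHM_PROBE_SKIP : SHM_PROBE_FAIL;
}

/* Try to create a throwaway segment and release it again right away. */
enum shm_probe probe_sysv_shm(void)
{
	int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666);
	int err = errno; /* only meaningful when shm_id < 0 */

	if (shm_id >= 0)
		shmctl(shm_id, IPC_RMID, NULL);
	return classify_shmget(shm_id, err);
}
```

Probing once before the loop, as the futex_waitv.c hunk does, keeps the per-futex shmget() calls out of the unsupported case entirely.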

5 months


[PATCH RFC v2 0/2] KVM: arm64: PMU: Use multiple host PMUs

by Akihiko Odaki

On heterogeneous arm64 systems, KVM's PMU emulation is based on the features of a single host PMU instance. When a vCPU is migrated to a pCPU with an incompatible PMU, counters such as PMCCNTR_EL0 stop incrementing. Although this behavior is permitted by the architecture, Windows does not handle it gracefully and may crash with a division-by-zero error. The current workaround requires VMMs to pin vCPUs to a set of pCPUs that share a compatible PMU. This is difficult to implement correctly in QEMU/libvirt, where pinning occurs after vCPU initialization, and it also restricts the guest to a subset of available pCPUs. This patch introduces the KVM_ARM_VCPU_PMU_V3_COMPOSITION attribute to create a "composite" PMU. When set, KVM exposes a PMU that is compatible with all pCPUs by advertising only a single cycle counter, a feature common to all PMUs. This allows Windows guests to run reliably on heterogeneous systems without crashing, even without vCPU pinning, and enables VMMs to schedule vCPUs across all available pCPUs, making full use of the host hardware. A QEMU patch that demonstrates the usage of the new attribute is available at: https://lore.kernel.org/qemu-devel/20250806-kvm-v1-1-d1d50b7058cd@rsg.ci.i.… ("[PATCH RFC] target/arm/kvm: Choose PMU backend") Signed-off-by: Akihiko Odaki <odaki(a)rsg.ci.i.u-tokyo.ac.jp> --- Changes in v2: - Added the KVM_ARM_VCPU_PMU_V3_COMPOSITION attribute to opt in the feature. - Added code to handle overflow. 
- Link to v1: https://lore.kernel.org/r/20250319-hybrid-v1-1-4d1ada10e705@daynix.com --- Akihiko Odaki (2): KVM: arm64: PMU: Introduce KVM_ARM_VCPU_PMU_V3_COMPOSITION KVM: arm64: selftests: Test guest PMUv3 composition Documentation/virt/kvm/devices/vcpu.rst | 30 ++ arch/arm64/include/asm/kvm_host.h | 2 + arch/arm64/include/uapi/asm/kvm.h | 1 + arch/arm64/kvm/arm.c | 5 +- arch/arm64/kvm/pmu-emul.c | 495 +++++++++++++-------- arch/arm64/kvm/sys_regs.c | 2 +- include/kvm/arm_pmu.h | 12 +- .../selftests/kvm/arm64/vpmu_counter_access.c | 148 ++++-- 8 files changed, 461 insertions(+), 234 deletions(-) --- base-commit: 8ec6d99a41e3d1dbdff2bdb3aa42951681e1e76c change-id: 20250224-hybrid-01d5ff47edd2 Best regards, -- Akihiko Odaki <odaki(a)rsg.ci.i.u-tokyo.ac.jp>

5 months


[PATCH v12 iproute2-next 0/3] DUALPI2 iproute2 patch

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>

Hello,

Please find DUALPI2 iproute2 patch v12.
For more details of DualPI2, please refer to IETF RFC9332
(https://datatracker.ietf.org/doc/html/rfc9332).

Best Regards,
Chia-Yu

---
v12 (04-Aug-2025)
- Split into 3 patches: one to move get_float(), one to add get_float_min_max(), one for dualpi2 (David Ahern <dsahern(a)kernel.org>)
- Replace matches() with strcmp() within get_packets() (David Ahern <dsahern(a)kernel.org>)
- Apply reverse xmas tree listing of variables (David Ahern <dsahern(a)kernel.org>)

v11 (18-Jul-2025)
- Replace TCA_DUALPI2 prefix with TC_DUALPI2 prefix for enums (Jakub Kicinski <kuba(a)kernel.org>)

v10 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS w/ STEP_THRESH_PKTS and STEP_THRESH_US of net-next patch (Jakub Kicinski <kuba(a)kernel.org>)

v9 (13-Jun-2025)
- Fix space issue and typos (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Change 'rtt_typical' to 'typical_rtt' in tc/q_dualpi2.c (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Add the num of enum used by DualPI2 in pkt_sched.h

v8 (09-May-2025)
- Update pkt_sched.h with the one in net-next
- Correct a typo in the comment within pkt_sched.h (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update manual content in man/man8/tc-dualpi2.8 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update tc/q_dualpi2.c to fix missing blank lines and add missing case (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)

v7 (05-May-2025)
- Align pkt_sched.h with the v14 version of net-next due to spec modification in tc.yaml
- Reorganize dualpi2_print_opt() to match the order in tc.yaml
- Remove credit-queue in PRINT_JSON

v6 (26-Apr-2025)
- Update JSON file output due to spec modification in tc.yaml of net-next

v5 (25-Mar-2025)
- Use matches() to replace current strcmp() (Stephen Hemminger <stephen(a)networkplumber.org>)
- Use general parse_percent() for handling scaled percentage values (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 stats (Stephen Hemminger <stephen(a)networkplumber.org>)

v4 (16-Mar-2025)
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step marking.

v3 (21-Feb-2025)
- Add memlimit to the dualpi2 attribute, and add memory_used, max_memory_used, and memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update the manual to align with the latest implementation and clarify the queue naming and default unit
- Use common "get_scaled_alpha_beta" and clean print_opt for Dualpi2

v2 (23-Oct-2024)
- Rename get_float in dualpi2 to get_float_min_max in utils.c
- Move get_float from iplink_can.c into utils.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 (Stephen Hemminger <stephen(a)networkplumber.org>)

---
Chia-Yu Chang (3):
  Move get_float() from ip/iplink_can.c to lib/utils.c
  Add get_float_min_max() in lib/utils.c
  tc: add dualpi2 scheduler module

 bash-completion/tc    |  11 +-
 include/utils.h       |   2 +
 ip/iplink_can.c       |  14 --
 lib/utils.c           |  30 +++
 man/man8/tc-dualpi2.8 | 249 ++++++++++++++++++++
 tc/Makefile           |   1 +
 tc/q_dualpi2.c        | 535 ++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 827 insertions(+), 15 deletions(-)
 create mode 100644 man/man8/tc-dualpi2.8
 create mode 100644 tc/q_dualpi2.c

-- 
2.34.1

5 months


[RFC PATCH 00/18] cgroup/cpuset: Enable runtime modification of

by Waiman Long

The "nohz_full" and "rcu_nocbs" boot command parameters can be used to remove a lot of kernel overhead on a specific set of isolated CPUs which can be used to run some latency/bandwidth sensitive workloads with as little kernel disturbance/noise as possible. The problem with this mode of operation is the fact that it is a static configuration which cannot be changed after boot to adjust for changes in application loading.

There is always a desire to enable runtime modification of the number of isolated CPUs that can be dedicated to this type of demanding workload. This patchset is an attempt to do just that with an amount of CPU isolation close to what can be done with the nohz_full and rcu_nocbs boot kernel parameters.

This patch series provides the ability to change the set of housekeeping CPUs at run time via the cpuset isolated partition functionality. Currently, the cpuset isolated partition is able to disable scheduler load balancing and the CPU affinity of the unbound workqueue to avoid the isolated CPUs. This patch series will extend that with other kernel noises associated with the nohz_full boot command line parameter which has the following sub-categories:
 - tick
 - timer
 - RCU
 - MISC
 - WQ
 - kthread

The rcu_nocbs is actually a subset of nohz_full focusing just on the RCU part of the kernel noises. The WQ part has already been handled by the current cpuset code.

This series focuses on the tick and RCU part of the kernel noises by actively changing their internal data structures to track changes in the list of isolated CPUs used by cpuset isolated partitions.

The dynamic update of the lists of housekeeping CPUs at run time will also have an impact on the other parts of the kernel noises that reference the lists of housekeeping CPUs at run time. The pending patch series on timer migration[1], when properly integrated, will support the timer part too.

The CPU hotplug functionality of the Linux kernel is used to facilitate the runtime change of the nohz_full isolated CPUs with minimal code changes. The CPUs that need to be switched from non-isolated to isolated or vice versa will be brought offline first, the necessary changes will be made, and the CPUs will then be brought back online.

The use of CPU hotplug, however, does have a slight drawback of freezing all the other CPUs as part of the offlining process using the stop machine feature of the kernel. That will cause noticeable latency spikes in other running applications which may be significant to sensitive applications running on isolated CPUs in other isolated partitions at the time. Hopefully we can find a way to solve this problem in the future.

One possible workaround for this is to reserve a set of nohz_full isolated CPUs at boot time using the nohz_full boot command parameter. The bringing of those nohz_full reserved CPUs into and out of isolated partitions will not invoke CPU hotplug and hence will not cause unexpected latency spikes. These reserved CPUs will only be needed if there are other existing isolated partitions running critical applications at the time when an isolated partition needs to be created.

Patches 1-4 update the CPU isolation code at kernel/sched/isolation.c to enable dynamic update of the lists of housekeeping CPUs.

Patch 5 introduces a new cpuhp_offline_cb() API for shutting down the given set of CPUs, running the given callback method and then bringing those CPUs back online again. This new API will block any incoming hotplug events from interfering with this operation.

Patches 6-9 update the cpuset partition code to use the new cpuhp API to shut down the affected CPUs, make changes to the housekeeping cpumasks and then bring those CPUs online afterward.

Patch 10 works around an issue in the DL server code that blocks the hotplug operation under certain configurations.

Patches 11-14 update the timer tick and related code to enable proper updates to the set of CPUs requiring nohz_full dynticks support.

Patch 15 enables runtime modification to the set of isolated CPUs requiring RCU NO-CB CPU support with minor changes to the RCU code.

Patches 16-18 include other miscellaneous updates to cpuset code and documentation.

This patch series is applied on top of some other cpuset patches[2] posted upstream recently.

[1] https://lore.kernel.org/lkml/20250806093855.86469-1-gmonaco@redhat.com/
[2] https://lore.kernel.org/lkml/20250806172430.1155133-1-longman@redhat.com/

Waiman Long (18):
  sched/isolation: Enable runtime update of housekeeping cpumasks
  sched/isolation: Call sched_tick_offload_init() when HK_FLAG_KERNEL_NOISE is first set
  sched/isolation: Use RCU to delay successive housekeeping cpumask updates
  sched/isolation: Add a debugfs file to dump housekeeping cpumasks
  cpu/hotplug: Add a new cpuhp_offline_cb() API
  cgroup/cpuset: Introduce a new top level isolcpus_update_mutex
  cgroup/cpuset: Allow overwriting HK_TYPE_DOMAIN housekeeping cpumask
  cgroup/cpuset: Use CPU hotplug to enable runtime nohz_full modification
  cgroup/cpuset: Revert "Include isolated cpuset CPUs in cpu_is_isolated() check"
  sched/core: Ignore DL BW deactivation error if in cpuhp_offline_cb_mode
  tick/nohz: Make nohz_full parameter optional
  tick/nohz: Introduce tick_nohz_full_update_cpus() to update tick_nohz_full_mask
  tick/nohz: Allow runtime changes in full dynticks CPUs
  tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
  cgroup/cpuset: Enable RCU NO-CB CPU offloading of newly isolated CPUs
  cgroup/cpuset: Don't set have_boot_nohz_full without any boot time nohz_full CPU
  cgroup/cpuset: Documentation updates & don't use CPU 0 for isolated partition
  cgroup/cpuset: Add pr_debug() statements for cpuhp_offline_cb() call

 Documentation/admin-guide/cgroup-v2.rst       |  33 +-
 .../admin-guide/kernel-parameters.txt         |  19 +-
 include/linux/context_tracking.h              |   8 +-
 include/linux/cpuhplock.h                     |   9 +
 include/linux/cpuset.h                        |   6 -
 include/linux/rcupdate.h                      |   2 +
 include/linux/sched/isolation.h               |   9 +-
 include/linux/tick.h                          |   2 +
 kernel/cgroup/cpuset.c                        | 344 ++++++++++++------
 kernel/context_tracking.c                     |  21 +-
 kernel/cpu.c                                  |  47 +++
 kernel/rcu/tree_nocb.h                        |   7 +-
 kernel/sched/core.c                           |   8 +-
 kernel/sched/debug.c                          |  32 ++
 kernel/sched/isolation.c                      | 151 +++++++-
 kernel/sched/sched.h                          |   2 +-
 kernel/time/tick-common.c                     |  15 +-
 kernel/time/tick-sched.c                      |  24 +-
 .../selftests/cgroup/test_cpuset_prs.sh       |  15 +-
 19 files changed, 583 insertions(+), 171 deletions(-)

-- 
2.50.0
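The offline, callback, online sequence described for the new cpuhp_offline_cb() API can be pictured with a toy model. This is only a sketch under stated assumptions: the real API's signature is not shown in this cover letter, so `model_cpus`, `model_offline_cb()` and `model_isolate()` are hypothetical names, CPU sets are plain 64-bit masks, and none of the locking or stop-machine behaviour is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Toy state: one bit per CPU. Not a kernel data structure. */
struct model_cpus {
	uint64_t online;       /* CPUs currently online */
	uint64_t housekeeping; /* CPUs allowed to run housekeeping work */
};

typedef void (*model_cb)(struct model_cpus *s, uint64_t mask);

/*
 * Hypothetical stand-in for the offline/callback/online sequence:
 * take the CPUs in @mask offline, run @cb while they are down,
 * then bring back exactly those that were online before.
 */
int model_offline_cb(struct model_cpus *s, uint64_t mask, model_cb cb)
{
	uint64_t was_online = s->online & mask;

	s->online &= ~mask;      /* step 1: offline the affected CPUs */
	cb(s, mask);             /* step 2: update masks while they are down */
	s->online |= was_online; /* step 3: bring them back online */
	return 0;
}

/* Example callback: remove the CPUs from the housekeeping set. */
void model_isolate(struct model_cpus *s, uint64_t mask)
{
	s->housekeeping &= ~mask;
}
```

The point of the ordering is that the housekeeping mask only changes while the affected CPUs are down, so nothing running on them observes a half-updated configuration.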

5 months


[PATCH 0/4] Better split_huge_page_test result check

by Zi Yan

David asked me if there is a way of checking split_huge_page_test results instead of the existing smap check[1]. This patchset uses kpageflags to get after-split folio orders for a better split_huge_page_test result check. The added gather_folio_orders() scans through a VPN range and collects the numbers of folios at different orders. check_folio_orders() compares the result of gather_folio_orders() to a given list of numbers of different orders. split_huge_page_test needs the FORCE_READ fix in [2] to work correctly. This patchset also: 1. added new order and in folio offset to the split huge page debugfs's pr_debug()s; 2. changed split_huge_pages_pid() to skip the rest of a folio if it is split by folio_split() (not changing split_folio_to_order() part since split_pte_mapped_thp test relies on its behavior). [1] https://lore.kernel.org/linux-mm/e2f32bdb-e4a4-447c-867c-31405cbba151@redha… [2] https://lore.kernel.org/linux-mm/20250805175140.241656-1-ziy@nvidia.com/ Zi Yan (4): mm/huge_memory: add new_order and offset to split_huge_pages*() pr_debug. mm/huge_memory: move to next folio after folio_split() succeeds. selftests/mm: add check_folio_orders() helper. selftests/mm: check after-split folio orders in split_huge_page_test. mm/huge_memory.c | 22 +-- .../selftests/mm/split_huge_page_test.c | 67 ++++++--- tools/testing/selftests/mm/vm_util.c | 139 ++++++++++++++++++ tools/testing/selftests/mm/vm_util.h | 2 + 4 files changed, 200 insertions(+), 30 deletions(-) -- 2.47.2
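The gather-then-compare idea behind gather_folio_orders() and check_folio_orders() can be sketched without any kernel interface. The following is an illustration only: it assumes the per-folio orders have already been obtained somehow (the real helper reads them via kpageflags), and `tally_folio_orders()` and `compare_folio_orders()` are hypothetical names, not the functions added to vm_util.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count how many folios of each order were seen.
 * folio_orders[i] is the order of the i-th folio found in the range;
 * counts[] must have nr_orders zero-initialised entries.
 */
void tally_folio_orders(const int *folio_orders, size_t nr_folios,
			int *counts, size_t nr_orders)
{
	for (size_t i = 0; i < nr_folios; i++) {
		int order = folio_orders[i];

		if (order >= 0 && (size_t)order < nr_orders)
			counts[order]++;
	}
}

/* Return 0 when every per-order count matches, -1 on the first mismatch. */
int compare_folio_orders(const int *counts, const int *expected,
			 size_t nr_orders)
{
	for (size_t i = 0; i < nr_orders; i++)
		if (counts[i] != expected[i])
			return -1;
	return 0;
}
```

For example, splitting one order-2 folio (4 pages) all the way down to order 0 should leave four order-0 folios and none of order 1 or 2, which is what an expected array of {4, 0, 0} encodes.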

5 months


[PATCH net 1/2] tls: handle data disappearing from under the TLS ULP

by Jakub Kicinski

TLS expects that it owns the receive queue of the TCP socket. This cannot be guaranteed in case the reader of the TCP socket entered before the TLS ULP was installed, or uses some non-standard read API (eg. zerocopy ones). Make sure that the TCP sequence numbers match between ->data_ready and ->recvmsg, otherwise don't trust the work that ->data_ready has done. Signed-off-by: William Liu <will(a)willsroot.io> Signed-off-by: Savino Dicanosa <savy(a)syst3mfailure.io> Link: https://lore.kernel.org/tFjq_kf7sWIG3A7CrCg_egb8CVsT_gsmHAK0_wxDPJXfIzxFAMx… Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Signed-off-by: Jakub Kicinski <kuba(a)kernel.org> --- include/net/tls.h | 1 + net/tls/tls.h | 2 +- net/tls/tls_strp.c | 17 ++++++++++++++--- net/tls/tls_sw.c | 3 ++- 4 files changed, 18 insertions(+), 5 deletions(-) diff --git a/include/net/tls.h b/include/net/tls.h index 857340338b69..37344a39e4c9 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -117,6 +117,7 @@ struct tls_strparser { bool msg_ready; struct strp_msg stm; + u32 copied_seq; struct sk_buff *anchor; struct work_struct work; diff --git a/net/tls/tls.h b/net/tls/tls.h index 774859b63f0d..4e077068e6d9 100644 --- a/net/tls/tls.h +++ b/net/tls/tls.h @@ -196,7 +196,7 @@ void tls_strp_msg_done(struct tls_strparser *strp); int tls_rx_msg_size(struct tls_strparser *strp, struct sk_buff *skb); void tls_rx_msg_ready(struct tls_strparser *strp); -void tls_strp_msg_load(struct tls_strparser *strp, bool force_refresh); +bool tls_strp_msg_load(struct tls_strparser *strp, bool force_refresh); int tls_strp_msg_cow(struct tls_sw_context_rx *ctx); struct sk_buff *tls_strp_msg_detach(struct tls_sw_context_rx *ctx); int tls_strp_msg_hold(struct tls_strparser *strp, struct sk_buff_head *dst); diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c index 095cf31bae0b..4bac58174cc3 100644 --- a/net/tls/tls_strp.c +++ b/net/tls/tls_strp.c @@ -473,9 +473,11 @@ static void tls_strp_load_anchor_with_queue(struct 
tls_strparser *strp, int len) strp->anchor->destructor = NULL; strp->stm.offset = offset; + + strp->copied_seq = tp->copied_seq; } -void tls_strp_msg_load(struct tls_strparser *strp, bool force_refresh) +bool tls_strp_msg_load(struct tls_strparser *strp, bool force_refresh) { struct strp_msg *rxm; struct tls_msg *tlm; @@ -484,8 +486,15 @@ void tls_strp_msg_load(struct tls_strparser *strp, bool force_refresh) DEBUG_NET_WARN_ON_ONCE(!strp->stm.full_len); if (!strp->copy_mode && force_refresh) { - if (WARN_ON(tcp_inq(strp->sk) < strp->stm.full_len)) - return; + struct tcp_sock *tp = tcp_sk(strp->sk); + + if (unlikely(strp->copied_seq != tp->copied_seq || + WARN_ON(tcp_inq(strp->sk) < strp->stm.full_len))) { + + WRITE_ONCE(strp->msg_ready, 0); + memset(&strp->stm, 0, sizeof(strp->stm)); + return false; + } tls_strp_load_anchor_with_queue(strp, strp->stm.full_len); } @@ -495,6 +504,8 @@ void tls_strp_msg_load(struct tls_strparser *strp, bool force_refresh) rxm->offset = strp->stm.offset; tlm = tls_msg(strp->anchor); tlm->control = strp->mark; + + return true; } /* Called with lock held on lower socket */ diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 549d1ea01a72..51c98a007dda 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -1384,7 +1384,8 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock, return sock_intr_errno(timeo); } - tls_strp_msg_load(&ctx->strp, released); + if (unlikely(!tls_strp_msg_load(&ctx->strp, released))) + return tls_rx_rec_wait(sk, psock, nonblock, false); return 1; } -- 2.50.1
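The check this patch introduces is a snapshot-and-compare on the TCP read position. The sketch below models it in userspace with simplified stand-in structures; `model_tcp`, `model_strp`, `model_data_ready()` and `model_msg_load()` are hypothetical names, and only the fields needed for the comparison are kept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the kernel structures; illustration only. */
struct model_tcp {
	uint32_t copied_seq; /* how far readers have consumed the stream */
	uint32_t inq;        /* bytes still queued for reading */
};

struct model_strp {
	bool msg_ready;
	uint32_t full_len;   /* length of the parsed record */
	uint32_t copied_seq; /* snapshot taken when the record was parsed */
};

/* Parsing side: remember where the TCP read position was. */
void model_data_ready(struct model_strp *strp, const struct model_tcp *tp,
		      uint32_t full_len)
{
	strp->full_len = full_len;
	strp->copied_seq = tp->copied_seq;
	strp->msg_ready = true;
}

/*
 * Receive side: trust the parsed record only if nobody else consumed
 * data in between. On mismatch, drop the parser state so the caller
 * can wait for a fresh record.
 */
bool model_msg_load(struct model_strp *strp, const struct model_tcp *tp)
{
	if (strp->copied_seq != tp->copied_seq || tp->inq < strp->full_len) {
		memset(strp, 0, sizeof(*strp));
		return false;
	}
	return true;
}
```

Returning a bool instead of void is what lets the caller in tls_rx_rec_wait() retry rather than proceed with stale state, as the last hunk of the patch shows.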

5 months


[PATCH v3 0/2] fscontext: do not consume log entries when returning -EMSGSIZE

by Aleksa Sarai

Userspace generally expects APIs that return -EMSGSIZE to allow for them to adjust their buffer size and retry the operation. However, the fscontext log would previously clear the message even in the -EMSGSIZE case.

Given that it is very cheap for us to check whether the buffer is too small before we remove the message from the ring buffer, let's just do that instead.

While we're at it, refactor part of fscontext_read() into a separate helper to make the ring buffer logic a bit easier to read.

Fixes: 007ec26cdc9f ("vfs: Implement logging through fs_context")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v3:
- selftests: use EXPECT_STREQ()
- v2: <https://lore.kernel.org/r/20250806-fscontext-log-cleanups-v2-0-88e9d34d142f…>

Changes in v2:
- Refactor message fetching to fetch_message_locked() which returns ERR_PTR() in error cases. [Al Viro]
- v1: <https://lore.kernel.org/r/20250806-fscontext-log-cleanups-v1-0-880597d42a5a…>

---
Aleksa Sarai (2):
  fscontext: do not consume log entries when returning -EMSGSIZE
  selftests/filesystems: add basic fscontext log tests

 fs/fsopen.c                                    |  54 +++++-----
 tools/testing/selftests/filesystems/.gitignore |   1 +
 tools/testing/selftests/filesystems/Makefile   |   2 +-
 tools/testing/selftests/filesystems/fclog.c    | 130 +++++++++++++++++++++++++
 4 files changed, 162 insertions(+), 25 deletions(-)

---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250806-fscontext-log-cleanups-50f0143674ae

Best regards,
-- 
Aleksa Sarai <cyphar(a)cyphar.com>
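The "check the size before removing the entry" behaviour can be illustrated with a small ring-buffer model. This is a userspace sketch, not the fs/fsopen.c code: `model_log` and `model_log_read()` are hypothetical names, messages are plain C strings, and returning -ENODATA for an empty log is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define LOG_SLOTS 8

/* Toy ring buffer of messages; head is the next write, tail the next read. */
struct model_log {
	const char *msgs[LOG_SLOTS];
	unsigned int head, tail;
};

/*
 * Copy the oldest message into @buf.
 * Returns the number of bytes copied, -ENODATA if the log is empty,
 * or -EMSGSIZE if @buf is too small. In the -EMSGSIZE case the entry
 * stays queued so the caller can retry with a larger buffer.
 */
long model_log_read(struct model_log *log, char *buf, size_t len)
{
	const char *msg;
	size_t n;

	if (log->head == log->tail)
		return -ENODATA;

	msg = log->msgs[log->tail % LOG_SLOTS];
	n = strlen(msg);
	if (n > len)
		return -EMSGSIZE; /* checked before the entry is consumed */

	memcpy(buf, msg, n);
	log->tail++; /* only now is the entry removed */
	return (long)n;
}
```

A caller would typically loop: on -EMSGSIZE grow the buffer and call again, which only works because the failed read left the entry in place.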

5 months


[PATCH 0/2] open_tree_attr: do not allow id-mapping changes without OPEN_TREE_CLONE

by Aleksa Sarai

As described in commit 7a54947e727b ('Merge patch series "fs: allow changing idmappings"'), open_tree_attr(2) was necessary in order to allow for a detached mount to be created and have its idmappings changed without the risk of any racing threads operating on it. For this reason, mount_setattr(2) still does not allow for id-mappings to be changed.

However, there was a bug in commit 2462651ffa76 ("fs: allow changing idmappings") which allowed users to bypass this restriction by calling open_tree_attr(2) *without* OPEN_TREE_CLONE.

can_idmap_mount() prevented this bug from allowing an attached mountpoint's id-mapping from being modified (thanks to an is_anon_ns() check), but this still allows for detached (but visible) mounts to have their id-mapping changed. This risks the same UAF and locking issues as described in the merge commit, and was likely unintentional.

For what it's worth, I found this while working on the open_tree_attr(2) man page, and was trying to figure out what open_tree_attr(2)'s behaviour was in the (slightly fruity) ~OPEN_TREE_CLONE case.

Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Aleksa Sarai (2):
  open_tree_attr: do not allow id-mapping changes without OPEN_TREE_CLONE
  selftests/mount_setattr: add smoke tests for open_tree_attr(2) bug

 fs/namespace.c                                     |  3 +-
 .../selftests/mount_setattr/mount_setattr_test.c   | 77 ++++++++++++++++++----
 2 files changed, 66 insertions(+), 14 deletions(-)

---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250808-open_tree_attr-bugfix-idmap-bb741166dc04

Best regards,
-- 
Aleksa Sarai <cyphar(a)cyphar.com>

5 months


[PATCH 0/9] kunit: Refactor and extend KUnit's

by Marie Zhussupova

Hello! KUnit offers a parameterized testing framework, where tests can be run multiple times with different inputs. Currently, the same `struct kunit` is used for each parameter execution. After each run, the test instance gets cleaned up. This creates the following limitations: a. There is no way to store resources that are accessible across the individual parameter test executions. b. It's not possible to pass additional context besides the previous parameter to `generate_params()` to get the next parameter. c. Test users are restricted to using pre-defined static arrays of parameter objects or `generate_params()` to define their parameters. There is no flexibility to pass a custom dynamic array without using `generate_params()`, which can be complex if generating the next parameter depends on more than just the single previous parameter (e.g., two or more previous parameters). This patch series resolves these limitations by: 1. [P 1] Giving each parameterized test execution its own `struct kunit`. This aligns more with the definition of a `struct kunit` as a running instance of a test. It will also remove the need to manage state, such as resetting the `test->priv` field or the `test->status_comment` after every parameter run. 2. [P 1] Introducing a parent pointer of type `struct kunit`. Behind the scenes, a parent instance for the parameterized tests will be created. It won't be used to execute any test logic, but will instead be used as a context for shared resources. Each individual running instance of a test will now have a reference to that parent instance and thus, have access to those resources. 3. [P 2] Introducing `param_init()` and `param_exit()` functions that can set up and clean up the parent instance of the parameterized tests. 
They will run once before and after the parameterized series and provide a way for the user to access the parent instance to add the parameter array or any other resources to it, including custom ones to the `test->parent->priv` field or to `test->parent->resources` via the Resource API (link below). https://elixir.bootlin.com/linux/v6.16-rc7/source/include/kunit/resource.h 4. [P 3, 4 & 5] Passing the parent `struct kunit` as an additional parameter to `generate_params()`. This provides `generate_params()` with more available context, making parameter generation much more flexible. The `generate_params()` implementations in the KCSAN and drm/xe tests have been adapted to match the new function pointer signature. 5. [P 6] Introducing a `params_data` field in `struct kunit`. This will allow the parent instance of a test to have direct storage of the parameter array, enabling features like using dynamic parameter arrays or using context beyond just the previous parameter. Thank you! -Marie Marie Zhussupova (9): kunit: Add parent kunit for parameterized test context kunit: Introduce param_init/exit for parameterized test shared context management kunit: Pass additional context to generate_params for parameterized testing kcsan: test: Update parameter generator to new signature drm/xe: Update parameter generator to new signature kunit: Enable direct registration of parameter arrays to a KUnit test kunit: Add example parameterized test with shared resources and direct static parameter array setup kunit: Add example parameterized test with direct dynamic parameter array setup Documentation: kunit: Document new parameterized test features Documentation/dev-tools/kunit/usage.rst | 455 +++++++++++++++++++++++- drivers/gpu/drm/xe/tests/xe_pci.c | 2 +- include/kunit/test.h | 98 ++++- kernel/kcsan/kcsan_test.c | 2 +- lib/kunit/kunit-example-test.c | 207 +++++++++++ lib/kunit/test.c | 82 ++++- 6 files changed, 818 insertions(+), 28 deletions(-) -- 2.50.1.552.g942d659e1b-goog
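The relationship between the parent instance, the per-run instances and a generator that receives the parent as context can be pictured with a toy model. This is not the KUnit API: `model_test`, `model_params`, `model_gen_params()` and `model_sum_params()` are hypothetical names, and the generator signature here is an assumption made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a test instance with a parent pointer. */
struct model_test {
	struct model_test *parent; /* shared context; NULL for the parent itself */
	void *priv;                /* on the parent: the registered parameter array */
};

struct model_params {
	const int *vals;
	size_t count;
};

/*
 * Generator that receives the parent as context: it walks the array
 * stored on the parent and returns NULL once the array is exhausted.
 */
const void *model_gen_params(struct model_test *parent, const void *prev)
{
	const struct model_params *arr = parent->priv;
	const int *p = prev;

	if (arr->count == 0)
		return NULL;
	if (!p)
		return &arr->vals[0];
	if (p + 1 < arr->vals + arr->count)
		return p + 1;
	return NULL;
}

/* Run once per parameter, each run with its own instance; sum the values. */
int model_sum_params(struct model_test *parent)
{
	const void *param = NULL;
	int sum = 0;

	while ((param = model_gen_params(parent, param)) != NULL) {
		struct model_test run = { .parent = parent, .priv = NULL };

		(void)run; /* a real test body would use run.parent here */
		sum += *(const int *)param;
	}
	return sum;
}
```

In this model the parent never executes test logic itself; it only holds the array, which mirrors the role the cover letter gives to the parent instance.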

5 months


selftests/futex: issue with -g option help text

by Colin King (gmail)

Hi,

I found some text that contains a spelling mistake; however, I can't parse the message either, so I'm reporting this as a minor issue that needs some attention. The issue is found in commit:

commit cda95faef7bcf26ba3f54c3cddce66d50116d146
Author: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Date: Wed Apr 16 18:29:20 2025 +0200

    selftests/futex: Add futex_priv_hash

Namely:

static void usage(char *prog)
{
        printf("Usage: %s\n", prog);
        printf(" -c Use color\n");
        printf(" -g Test global hash instead intead local immutable \n");
        printf(" -h Display this help message\n");
        printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
               VQUIET, VCRITICAL, VINFO);
}

The word "intead" for the -g option should be removed, but I'm also finding the resulting text hard to parse; perhaps it needs to be rephrased?

Colin

5 months


[PATCH v2 00/10] rust: use `core::ffi::CStr` method names

by Tamir Duberstein

This is series 2b/5 of the migration to `core::ffi::CStr`[0]. 20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com. This series depends on the prior series[0] and is intended to go through the rust tree to reduce the number of release cycles required to complete the work. Subsystem maintainers: I would appreciate your `Acked-by`s so that this can be taken through Miguel's tree (where the other series must go). [0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm… Signed-off-by: Tamir Duberstein <tamird(a)gmail.com> --- Changes in v2: - Update patch title (was nova-core, now drm/panic). - Link to v1: https://lore.kernel.org/r/20250709-core-cstr-fanout-1-v1-0-fd793b3e58a2@gma… --- Tamir Duberstein (10): drm/panic: use `core::ffi::CStr` method names rust: auxiliary: use `core::ffi::CStr` method names rust: configfs: use `core::ffi::CStr` method names rust: cpufreq: use `core::ffi::CStr` method names rust: drm: use `core::ffi::CStr` method names rust: firmware: use `core::ffi::CStr` method names rust: kunit: use `core::ffi::CStr` method names rust: miscdevice: use `core::ffi::CStr` method names rust: net: use `core::ffi::CStr` method names rust: of: use `core::ffi::CStr` method names drivers/gpu/drm/drm_panic_qr.rs | 2 +- rust/kernel/auxiliary.rs | 4 ++-- rust/kernel/configfs.rs | 4 ++-- rust/kernel/cpufreq.rs | 2 +- rust/kernel/drm/device.rs | 4 ++-- rust/kernel/firmware.rs | 2 +- rust/kernel/kunit.rs | 6 +++--- rust/kernel/miscdevice.rs | 2 +- rust/kernel/net/phy.rs | 2 +- rust/kernel/of.rs | 2 +- samples/rust/rust_configfs.rs | 2 +- 11 files changed, 16 insertions(+), 16 deletions(-) --- base-commit: cc84ef3b88f407e8bd5a5f7b6906d1e69851c856 change-id: 20250709-core-cstr-fanout-1-f20611832272 prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1 prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573 prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e prerequisite-patch-id: 
8585bf441cfab705181f5606c63483c2e88d25aa prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27 prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8 Best regards, -- Tamir Duberstein <tamird(a)gmail.com>

5 months


[PATCH v2 0/8] rust: use `kernel::{fmt,prelude::fmt!}`

by Tamir Duberstein

This is series 2a/5 of the migration to `core::ffi::CStr`[0]. 20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com. This series depends on the prior series[0] and is intended to go through the rust tree to reduce the number of release cycles required to complete the work. Subsystem maintainers: I would appreciate your `Acked-by`s so that this can be taken through Miguel's tree (where the other series must go). [0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm… Signed-off-by: Tamir Duberstein <tamird(a)gmail.com> --- Changes in v2: - Rebase on rust-next. - Drop pin-init patch, which is no longer needed. - Link to v1: https://lore.kernel.org/r/20250709-core-cstr-fanout-1-v1-0-64308e7203fc@gma… --- Tamir Duberstein (8): gpu: nova-core: use `kernel::{fmt,prelude::fmt!}` rust: alloc: use `kernel::{fmt,prelude::fmt!}` rust: block: use `kernel::{fmt,prelude::fmt!}` rust: device: use `kernel::{fmt,prelude::fmt!}` rust: file: use `kernel::{fmt,prelude::fmt!}` rust: kunit: use `kernel::{fmt,prelude::fmt!}` rust: seq_file: use `kernel::{fmt,prelude::fmt!}` rust: sync: use `kernel::{fmt,prelude::fmt!}` drivers/block/rnull.rs | 2 +- drivers/gpu/nova-core/gpu.rs | 3 +-- drivers/gpu/nova-core/regs/macros.rs | 6 +++--- rust/kernel/alloc/kbox.rs | 2 +- rust/kernel/alloc/kvec.rs | 2 +- rust/kernel/alloc/kvec/errors.rs | 2 +- rust/kernel/block/mq.rs | 2 +- rust/kernel/block/mq/gen_disk.rs | 2 +- rust/kernel/block/mq/raw_writer.rs | 3 +-- rust/kernel/device.rs | 6 +++--- rust/kernel/fs/file.rs | 5 +++-- rust/kernel/kunit.rs | 8 ++++---- rust/kernel/seq_file.rs | 6 +++--- rust/kernel/sync/arc.rs | 2 +- scripts/rustdoc_test_gen.rs | 2 +- 15 files changed, 26 insertions(+), 27 deletions(-) --- base-commit: cc84ef3b88f407e8bd5a5f7b6906d1e69851c856 change-id: 20250709-core-cstr-fanout-1-f20611832272 prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1 prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573 prerequisite-patch-id: 
a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27 prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8 Best regards, -- Tamir Duberstein <tamird(a)gmail.com>

5 months


BPF selftest: mptcp subtest failing

by Harshvardhan Jha

Hi there,

I have explicitly disabled MPTCP by default on my custom kernel, and this seems to cause the test case to fail. It still fails even after enabling MPTCP via the sysctl command or by adding an entry to /etc/sysctl.conf. I don't think this test should fail; shouldn't it account for cases where MPTCP is not enabled by default?

These are the test logs:

$ sudo tools/testing/selftests/bpf/test_progs -t mptcp
Can't find bpf_testmod.ko kernel module: -2
WARNING! Selftests relying on bpf_testmod.ko will be skipped.
run_test:PASS:bpf_prog_attach 0 nsec
run_test:PASS:connect to fd 0 nsec
verify_tsk:PASS:bpf_map_lookup_elem 0 nsec
verify_tsk:PASS:unexpected invoked count 0 nsec
verify_tsk:PASS:unexpected is_mptcp 0 nsec
test_base:PASS:run_test tcp 0 nsec
(network_helpers.c:107: errno: Protocol not available) Failed to create server socket
test_base:FAIL:start_mptcp_server unexpected start_mptcp_server: actual -1 < expected 0
#178/1 mptcp/base:FAIL
test_mptcpify:PASS:test__join_cgroup 0 nsec
create_netns:PASS:ip netns add mptcp_ns 0 nsec
create_netns:PASS:ip -net mptcp_ns link set dev lo up 0 nsec
test_mptcpify:PASS:create_netns 0 nsec
run_mptcpify:PASS:skel_open_load 0 nsec
run_mptcpify:PASS:skel_attach 0 nsec
(network_helpers.c:107: errno: Protocol not available) Failed to create server socket
run_mptcpify:FAIL:start_server unexpected start_server: actual -1 < expected 0
test_mptcpify:FAIL:run_mptcpify unexpected error: -5 (errno 92)
#178/2 mptcp/mptcpify:FAIL
#178 mptcp:FAIL

All error logs:
test_base:PASS:test__join_cgroup 0 nsec
create_netns:PASS:ip netns add mptcp_ns 0 nsec
create_netns:PASS:ip -net mptcp_ns link set dev lo up 0 nsec
test_base:PASS:create_netns 0 nsec
test_base:PASS:start_server 0 nsec
run_test:PASS:skel_open_load 0 nsec
run_test:PASS:skel_attach 0 nsec
run_test:PASS:bpf_prog_attach 0 nsec
run_test:PASS:connect to fd 0 nsec
verify_tsk:PASS:bpf_map_lookup_elem 0 nsec
verify_tsk:PASS:unexpected invoked count 0 nsec
verify_tsk:PASS:unexpected is_mptcp 0 nsec
test_base:PASS:run_test tcp 0 nsec
(network_helpers.c:107: errno: Protocol not available) Failed to create server socket
test_base:FAIL:start_mptcp_server unexpected start_mptcp_server: actual -1 < expected 0
#178/1 mptcp/base:FAIL
test_mptcpify:PASS:test__join_cgroup 0 nsec
create_netns:PASS:ip netns add mptcp_ns 0 nsec
create_netns:PASS:ip -net mptcp_ns link set dev lo up 0 nsec
test_mptcpify:PASS:create_netns 0 nsec
run_mptcpify:PASS:skel_open_load 0 nsec
run_mptcpify:PASS:skel_attach 0 nsec
(network_helpers.c:107: errno: Protocol not available) Failed to create server socket
run_mptcpify:FAIL:start_server unexpected start_server: actual -1 < expected 0
test_mptcpify:FAIL:run_mptcpify unexpected error: -5 (errno 92)
#178/2 mptcp/mptcpify:FAIL
#178 mptcp:FAIL
Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED

This is the custom patch I applied to the LTS v6.12.36 kernel for testing:

diff --git a/net/mptcp/ctrl.c b/net/mptcp/ctrl.c
index dd595d9b5e50c..bdcc4136e92ef 100644
--- a/net/mptcp/ctrl.c
+++ b/net/mptcp/ctrl.c
@@ -89,7 +89,7 @@ const char *mptcp_get_scheduler(const struct net *net)

 static void mptcp_pernet_set_defaults(struct mptcp_pernet *pernet)
 {
-	pernet->mptcp_enabled = 1;
+	pernet->mptcp_enabled = 0;
 	pernet->add_addr_timeout = TCP_RTO_MAX;
 	pernet->blackhole_timeout = 3600;
 	atomic_set(&pernet->active_disable_times, 0);

-- 
Thanks & Regards,
Harshvardhan
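
For readers following the report above: the "Protocol not available" message corresponds to ENOPROTOOPT (errno 92 in the log). The sketch below shows one way a userspace test could probe whether MPTCP sockets are usable before treating a failure as a bug. The helper names are hypothetical and are not part of the BPF selftests. One possible explanation for the sysctl not helping, offered only as an assumption, is that the test creates its own network namespace (mptcp_ns in the log), which would start from the patched per-net default.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262	/* value used by the Linux UAPI headers */
#endif

/* Hypothetical helper: does this errno mean "MPTCP is not usable here"? */
static int mptcp_errno_means_unavailable(int err)
{
	return err == ENOPROTOOPT ||	/* MPTCP disabled in this netns (errno 92) */
	       err == EPROTONOSUPPORT;	/* protocol not provided by this kernel */
}

/*
 * Hypothetical probe: returns 1 if an MPTCP socket can be created,
 * 0 if MPTCP looks unavailable (a test might skip), -1 on other errors.
 */
static int mptcp_probe(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

	if (fd >= 0) {
		close(fd);
		return 1;
	}
	return mptcp_errno_means_unavailable(errno) ? 0 : -1;
}
```

A caller would run mptcp_probe() inside the same network namespace the test uses, since the enable flag is per namespace.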

5 months


[PATCH net 0/3] net: prevent deadlocks and mis-configuration with per-NAPI threaded config

by Jakub Kicinski

Running the test added with a recent fix on a driver with persistent NAPI config leads to a deadlock. The deadlock is fixed by patch 3; patch 2 addresses what I think is a more fundamental problem with the way we implemented the config. I hope the fix makes sense; my own thinking is definitely colored by my preference (IOW how the per-queue config RFC was implemented).

Jakub Kicinski (3):
  selftests: drv-net: don't assume device has only 2 queues
  net: update NAPI threaded config even for disabled NAPIs
  net: prevent deadlocks when enabling NAPIs with mixed kthread config

 include/linux/netdevice.h | 3 ++-
 net/core/dev.h | 8 ++++++++
 net/core/dev.c | 12 +++++++++---
 tools/testing/selftests/drivers/net/napi_threaded.py | 10 ++++++----
 4 files changed, 25 insertions(+), 8 deletions(-)

-- 
2.50.1

5 months


[PATCH RFC net-next v4 00/12] vsock: add namespace support to vhost-vsock

by Bobby Eshleman

This series adds namespace support to vhost-vsock. It does not add namespaces to any of the guest transports (virtio-vsock, hyperv, or vmci).

The current revision supports only two modes: local and global. Local mode is complete isolation of namespaces, while global mode is complete sharing of CIDs between namespaces (the original behavior).

Future work may include supporting a mixed mode, which I expect to be more complicated because socket lookups will need new logic and API changes to behave differently depending on whether the lookup is part of a mixed mode CID allocation, a global CID allocation, a mixed-to-global connection (allowed), or a global-to-mixed connection (not allowed).

Modes are per-netns and write-once. This allows a system to configure namespaces independently (some may share CIDs, others are completely isolated). This also supports future mixed use cases, where there may be namespaces in global mode spinning up VMs while there are mixed mode namespaces that provide services to the VMs, but are not allowed to allocate from the global CID pool.

Thanks again for everyone's help and reviews!

Signed-off-by: Bobby Eshleman <bobbyeshleman(a)gmail.com>
To: Stefano Garzarella <sgarzare(a)redhat.com>
To: Shuah Khan <shuah(a)kernel.org>
To: David S. Miller <davem(a)davemloft.net>
To: Eric Dumazet <edumazet(a)google.com>
To: Jakub Kicinski <kuba(a)kernel.org>
To: Paolo Abeni <pabeni(a)redhat.com>
To: Simon Horman <horms(a)kernel.org>
To: Stefan Hajnoczi <stefanha(a)redhat.com>
To: Michael S. Tsirkin <mst(a)redhat.com>
To: Jason Wang <jasowang(a)redhat.com>
To: Xuan Zhuo <xuanzhuo(a)linux.alibaba.com>
To: Eugenio Pérez <eperezma(a)redhat.com>
To: K. Y. Srinivasan <kys(a)microsoft.com>
To: Haiyang Zhang <haiyangz(a)microsoft.com>
To: Wei Liu <wei.liu(a)kernel.org>
To: Dexuan Cui <decui(a)microsoft.com>
To: Bryan Tan <bryan-bt.tan(a)broadcom.com>
To: Vishnu Dasa <vishnu.dasa(a)broadcom.com>
To: Broadcom internal kernel review list <bcm-kernel-feedback-list(a)broadcom.com>
Cc: virtualization(a)lists.linux.dev
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: kvm(a)vger.kernel.org
Cc: linux-hyperv(a)vger.kernel.org
Cc: berrange(a)redhat.com

Changes in v4:
- removed RFC tag
- implemented loopback support
- renamed new tests to better reflect behavior
- completed suite of tests with permutations of ns modes and vsock_test as guest/host
- simplified socat bridging with unix socket instead of tcp + veth
- only use vsock_test for success case, socat for failure case (context in commit message)
- lots of cleanup

Changes in v3:
- add notion of "modes"
- add procfs /proc/net/vsock_ns_mode
- local and global modes only
- no /dev/vhost-vsock-netns
- vmtest.sh already merged, so new patch just adds new tests for NS
- Link to v2: https://lore.kernel.org/kvm/20250312-vsock-netns-v2-0-84bffa1aa97a@gmail.com

Changes in v2:
- only support vhost-vsock namespaces
- all g2h namespaces retain old behavior, only common API changes impacted by vhost-vsock changes
- add /dev/vhost-vsock-netns for "opt-in"
- leave /dev/vhost-vsock to old behavior
- removed netns module param
- Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com

Changes in v1:
- added 'netns' module param to vsock.ko to enable the network namespace support (disabled by default)
- added 'vsock_net_eq()' to check the "net" assigned to a socket only when 'netns' support is enabled
- Link to RFC: https://patchwork.ozlabs.org/cover/1202235/

---
Bobby Eshleman (12):
  vsock: a per-net vsock NS mode state
  vsock: add net to vsock skb cb
  vsock: add netns to af_vsock core
  vsock/virtio: add netns to virtio transport common
  vhost/vsock: add netns support
  vsock/virtio: use the global netns
  hv_sock: add netns hooks
  vsock/vmci: add netns hooks
  vsock/loopback: add netns support
  selftests/vsock: improve logging in vmtest.sh
  selftests/vsock: invoke vsock_test through helpers
  selftests/vsock: add namespace tests

 MAINTAINERS | 1 +
 drivers/vhost/vsock.c | 48 +-
 include/linux/virtio_vsock.h | 12 +
 include/net/af_vsock.h | 59 +-
 include/net/net_namespace.h | 4 +
 include/net/netns/vsock.h | 21 +
 net/vmw_vsock/af_vsock.c | 204 +++++-
 net/vmw_vsock/hyperv_transport.c | 2 +-
 net/vmw_vsock/virtio_transport.c | 5 +-
 net/vmw_vsock/virtio_transport_common.c | 14 +-
 net/vmw_vsock/vmci_transport.c | 4 +-
 net/vmw_vsock/vsock_loopback.c | 59 +-
 tools/testing/selftests/vsock/vmtest.sh | 1088 ++++++++++++++++++++++++++-----
 13 files changed, 1330 insertions(+), 191 deletions(-)
---
base-commit: dd500e4aecf25e48e874ca7628697969df679493
change-id: 20250325-vsock-vmtest-b3a21d2102c2

Best regards,
-- 
Bobby Eshleman <bobbyeshleman(a)meta.com>
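
The "per-netns and write-once" rule from the cover letter can be pictured with a toy model. This is only a sketch: the type and function names, the choice of global as the starting mode (taken from the cover letter's remark that global is the original behavior), and the -EBUSY return value are all assumptions and are not taken from the series.

```c
#include <errno.h>

/* The two modes named in the cover letter. */
enum ns_mode { NS_MODE_GLOBAL, NS_MODE_LOCAL };

/* Hypothetical per-namespace state; not the structure used by the series. */
struct ns_state {
	enum ns_mode mode;
	int written;	/* becomes 1 after the first successful write */
};

/* Assumed starting point: global mode, not yet written. */
#define NS_STATE_INIT { NS_MODE_GLOBAL, 0 }

/*
 * Write-once setter: the first valid write wins, and every later write is
 * rejected (with -EBUSY here, an assumed error code).
 */
static int ns_mode_set(struct ns_state *ns, enum ns_mode mode)
{
	if (mode != NS_MODE_GLOBAL && mode != NS_MODE_LOCAL)
		return -EINVAL;
	if (ns->written)
		return -EBUSY;
	ns->mode = mode;
	ns->written = 1;
	return 0;
}
```

Because each namespace carries its own state, one namespace can be switched to local mode while another keeps sharing CIDs globally.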

5 months


[PATCH v7 00/30] TDX KVM selftests

by Sagi Shahar

This is v7 of the TDX selftests now that the base TDX patches have been accepted. This series is based on v6.16-rc1.

No major changes from v6 aside from rebasing.

Thanks,

Changes from v6:
- Rebased on top of v6.16-rc1

Ackerley Tng (12):
  KVM: selftests: Add function to allow one-to-one GVA to GPA mappings
  KVM: selftests: Expose function that sets up sregs based on VM's mode
  KVM: selftests: Store initial stack address in struct kvm_vcpu
  KVM: selftests: Add vCPU descriptor table initialization utility
  KVM: selftests: TDX: Use KVM_TDX_CAPABILITIES to validate TDs' attribute configuration
  KVM: selftests: TDX: Update load_td_memory_region() for VM memory backed by guest memfd
  KVM: selftests: Add functions to allow mapping as shared
  KVM: selftests: KVM: selftests: Expose new vm_vaddr_alloc_private()
  KVM: selftests: TDX: Add support for TDG.MEM.PAGE.ACCEPT
  KVM: selftests: TDX: Add support for TDG.VP.VEINFO.GET
  KVM: selftests: TDX: Add TDX UPM selftest
  KVM: selftests: TDX: Add TDX UPM selftests for implicit conversion

Erdem Aktas (3):
  KVM: selftests: Add helper functions to create TDX VMs
  KVM: selftests: TDX: Add TDX lifecycle test
  KVM: selftests: TDX: Add TDX HLT exit test

Isaku Yamahata (1):
  KVM: selftests: Update kvm_init_vm_address_properties() for TDX

Roger Wang (1):
  KVM: selftests: TDX: Add TDG.VP.INFO test

Ryan Afranji (2):
  KVM: selftests: TDX: Verify the behavior when host consumes a TD private memory
  KVM: selftests: TDX: Add shared memory test

Sagi Shahar (10):
  KVM: selftests: TDX: Add report_fatal_error test
  KVM: selftests: TDX: Adding test case for TDX port IO
  KVM: selftests: TDX: Add basic TDX CPUID test
  KVM: selftests: TDX: Add basic TDG.VP.VMCALL<GetTdVmCallInfo> test
  KVM: selftests: TDX: Add TDX IO writes test
  KVM: selftests: TDX: Add TDX IO reads test
  KVM: selftests: TDX: Add TDX MSR read/write tests
  KVM: selftests: TDX: Add TDX MMIO reads test
  KVM: selftests: TDX: Add TDX MMIO writes test
  KVM: selftests: TDX: Add TDX CPUID TDVMCALL test

Yan Zhao (1):
  KVM: selftests: TDX: Test LOG_DIRTY_PAGES flag to a non-GUEST_MEMFD memslot

 tools/testing/selftests/kvm/Makefile.kvm | 8 +
 .../testing/selftests/kvm/include/kvm_util.h | 36 +
 .../selftests/kvm/include/x86/kvm_util_arch.h | 1 +
 .../selftests/kvm/include/x86/processor.h | 2 +
 .../selftests/kvm/include/x86/tdx/td_boot.h | 83 ++
 .../kvm/include/x86/tdx/td_boot_asm.h | 16 +
 .../selftests/kvm/include/x86/tdx/tdcall.h | 54 +
 .../selftests/kvm/include/x86/tdx/tdx.h | 67 +
 .../selftests/kvm/include/x86/tdx/tdx_util.h | 23 +
 .../selftests/kvm/include/x86/tdx/test_util.h | 133 ++
 tools/testing/selftests/kvm/lib/kvm_util.c | 74 +-
 .../testing/selftests/kvm/lib/x86/processor.c | 97 +-
 .../selftests/kvm/lib/x86/tdx/td_boot.S | 100 ++
 .../selftests/kvm/lib/x86/tdx/tdcall.S | 163 +++
 tools/testing/selftests/kvm/lib/x86/tdx/tdx.c | 243 ++++
 .../selftests/kvm/lib/x86/tdx/tdx_util.c | 643 +++++++++
 .../selftests/kvm/lib/x86/tdx/test_util.c | 187 +++
 .../selftests/kvm/x86/tdx_shared_mem_test.c | 129 ++
 .../testing/selftests/kvm/x86/tdx_upm_test.c | 461 ++++++
 tools/testing/selftests/kvm/x86/tdx_vm_test.c | 1254 +++++++++++++++++
 20 files changed, 3734 insertions(+), 40 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/td_boot.h
 create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/td_boot_asm.h
 create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/tdcall.h
 create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/tdx.h
 create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
 create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/test_util.h
 create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/td_boot.S
 create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/tdcall.S
 create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/tdx.c
 create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/tdx_util.c
 create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/test_util.c
 create mode 100644 tools/testing/selftests/kvm/x86/tdx_shared_mem_test.c
 create mode 100644 tools/testing/selftests/kvm/x86/tdx_upm_test.c
 create mode 100644 tools/testing/selftests/kvm/x86/tdx_vm_test.c

-- 
2.50.0.rc2.692.g299adb8693-goog

5 months


[PATCH] futex: selftests: Add description for futex_wake_op()

by Devaansh Kumar

Signed-off-by: Devaansh Kumar <devaanshk840(a)gmail.com>
---
 tools/testing/selftests/futex/include/futextest.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h
index ddbcfc9b7..c77352b97 100644
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h
@@ -134,7 +134,9 @@ futex_unlock_pi(futex_t *uaddr, int opflags)
 }

 /**
- * futex_wake_op() - FIXME: COME UP WITH A GOOD ONE LINE DESCRIPTION
+ * futex_wake_op() - atomically modify uaddr2
+ * @nr_wake: wake up to this many tasks on uaddr
+ * @nr_wake2: wake up to this many tasks on uaddr2
  */
 static inline int
 futex_wake_op(futex_t *uaddr, futex_t *uaddr2, int nr_wake, int nr_wake2,
-- 
2.49.0
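
To illustrate the operation the new kernel-doc comment describes, here is a minimal userspace sketch of FUTEX_WAKE_OP issued through the raw futex syscall. The wrapper name is hypothetical and is not the selftest's futex_wake_op() helper; the encoded operation (store 1 into *uaddr2, and wake on uaddr2 only if its old value was 0) is just one example chosen for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Illustrative wrapper (hypothetical name): wake up to nr_wake waiters on
 * uaddr, atomically store 1 into *uaddr2 and, if the old value of *uaddr2
 * was 0, also wake up to nr_wake2 waiters on uaddr2.
 * Returns the total number of woken waiters, or -1 with errno set.
 */
static long wake_op_set_one_if_zero(uint32_t *uaddr, uint32_t *uaddr2,
				    int nr_wake, int nr_wake2)
{
	/* Encode "*uaddr2 = 1; condition: old value == 0". */
	uint32_t op = FUTEX_OP(FUTEX_OP_SET, 1, FUTEX_OP_CMP_EQ, 0);

	/* For FUTEX_WAKE_OP the fourth futex argument carries nr_wake2. */
	return syscall(SYS_futex, uaddr, FUTEX_WAKE_OP, nr_wake,
		       (unsigned long)nr_wake2, uaddr2, op);
}
```

With no threads waiting on either word, the call is expected to wake nobody while still performing the atomic store to *uaddr2.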

5 months


[PATCH] selftests: arm64: Fix -Waddress warning in tpidr2 test

by Bala-Vignesh-Reddy

Resolve the compiler warning about an always-true condition in the ksft_test_result() calls in the tpidr2 test by calling each test function instead of passing its address. This silences the -Waddress warning while preserving test functionality.

Signed-off-by: Bala-Vignesh-Reddy <reddybalavignesh9979(a)gmail.com>
---
 tools/testing/selftests/arm64/abi/tpidr2.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/arm64/abi/tpidr2.c b/tools/testing/selftests/arm64/abi/tpidr2.c
index f58a9f89b952..4c89ab0f1010 100644
--- a/tools/testing/selftests/arm64/abi/tpidr2.c
+++ b/tools/testing/selftests/arm64/abi/tpidr2.c
@@ -227,10 +227,10 @@ int main(int argc, char **argv)
 	ret = open("/proc/sys/abi/sme_default_vector_length", O_RDONLY, 0);
 	if (ret >= 0) {
 		ksft_test_result(default_value(), "default_value\n");
-		ksft_test_result(write_read, "write_read\n");
-		ksft_test_result(write_sleep_read, "write_sleep_read\n");
-		ksft_test_result(write_fork_read, "write_fork_read\n");
-		ksft_test_result(write_clone_read, "write_clone_read\n");
+		ksft_test_result(write_read(), "write_read\n");
+		ksft_test_result(write_sleep_read(), "write_sleep_read\n");
+		ksft_test_result(write_fork_read(), "write_fork_read\n");
+		ksft_test_result(write_clone_read(), "write_clone_read\n");
 	} else {
 		ksft_print_msg("SME support not present\n");

-- 
2.43.0
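
The effect of the fix can be shown with a small standalone sketch. All names below are hypothetical stand-ins; is_pass() only mimics the idea that a result macro treats any non-zero condition as a pass and is not the kselftest implementation.

```c
#include <stddef.h>

/* Hypothetical stand-in for a test function that reports failure (0). */
static int always_fails(void)
{
	return 0;
}

/* Mimics a result macro that treats any non-zero condition as a pass. */
static int is_pass(int condition)
{
	return condition != 0;
}

/*
 * Buggy form: the condition is the function's address, which is never
 * NULL, so the test "passes" without the function ever running.
 */
static int buggy_result(void)
{
	/* Go through a variable so this sketch itself stays -Waddress clean. */
	int (*fn)(void) = always_fails;

	return is_pass(fn != NULL);
}

/* Fixed form: the function is called and its return value is the condition. */
static int fixed_result(void)
{
	return is_pass(always_fails());
}
```

Here buggy_result() reports a pass even though always_fails() returns 0, which is the situation the patch removes by adding the call parentheses.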

5 months


next-20250804 Unable to handle kernel execute from non-executable memory at virtual address idem_hash

by Naresh Kamboju

While booting and running the cgroup and filesystems selftests on an arm64 dragonboard-410c, the following kernel warnings and errors were noticed; the system halted and did not recover. The kernel was the Linux next tag next-20250804, built with the selftests Kconfig enabled.

Regression Analysis:
- New regression? Yes
- Reproducibility? Re-validation is in progress

First seen on next-20250804
Good: next-20250801
Bad: next-20250804

Test regression: next-20250804 Unable to handle kernel execute from non-executable memory at virtual address idem_hash
Test regression: next-20250804 refcount_t: addition on 0; use-after-free refcount_warn_saturate

Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>

## Test crash log
[ 9.811341] Unable to handle kernel NULL pointer dereference at virtual address 000000000000002e [ 9.811444] Mem abort info: [ 9.821150] ESR = 0x0000000096000004 [ 9.833499] SET = 0, FnV = 0 [ 9.833566] EA = 0, S1PTW = 0 [ 9.835511] FSC = 0x04: level 0 translation fault [ 9.838901] Data abort info: [ 9.843788] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 9.846565] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 9.851938] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 9.853510] rtc-pm8xxx 200f000.spmi:pmic@0:rtc@6000: registered as rtc0 [ 9.856992] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000856f8000 [ 9.862446] rtc-pm8xxx 200f000.spmi:pmic@0:rtc@6000: setting system clock to 1970-01-01T00:00:31 UTC (31) [ 9.868789] [000000000000002e] pgd=0000000000000000, p4d=0000000000000000 [ 9.875459] Internal error: Oops: 0000000096000004 [#1] SMP [ 9.889547] input: pm8941_pwrkey as /devices/platform/soc@0/200f000.spmi/spmi-0/0-00/200f000.spmi:pmic@0:pon@800/200f000.spmi:pmic@0:pon@800:pwrkey/input/input1 [ 9.891545] Modules linked in: qcom_spmi_temp_alarm rtc_pm8xxx qcom_pon(+) qcom_pil_info videobuf2_dma_sg ubwc_config qcom_q6v5 venus_core(+) qcom_sysmon qcom_spmi_vadc v4l2_fwnode llcc_qcom v4l2_async qcom_vadc_common qcom_common ocmem v4l2_mem2mem drm_gpuvm 
videobuf2_memops qcom_glink_smem videobuf2_v4l2 drm_exec mdt_loader qmi_helpers gpu_sched drm_dp_aux_bus qnoc_msm8916 videodev drm_display_helper qcom_stats videobuf2_common cec qcom_rng drm_client_lib mc phy_qcom_usb_hs socinfo rpmsg_ctrl display_connector rpmsg_char ramoops rmtfs_mem reed_solomon drm_kms_helper fuse drm backlight [ 9.912286] input: pm8941_resin as /devices/platform/soc@0/200f000.spmi/spmi-0/0-00/200f000.spmi:pmic@0:pon@800/200f000.spmi:pmic@0:pon@800:resin/input/input2 [ 9.941186] CPU: 2 UID: 0 PID: 221 Comm: (udev-worker) Not tainted 6.16.0-next-20250804 #1 PREEMPT [ 9.941200] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT) [ 9.941206] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 9.941215] pc : dev_pm_opp_put (/builds/linux/drivers/opp/core.c:1685) [ 9.941233] lr : core_clks_enable+0x54/0x148 venus_core [ 10.004266] sp : ffff8000842b35f0 [ 10.004273] x29: ffff8000842b35f0 x28: ffff8000842b3ba0 x27: ffff0000047be938 [ 10.004289] x26: 0000000000000000 x25: 0000000000000000 x24: ffff80007b350ba0 [ 10.004303] x23: ffff00000ba380c8 x22: ffff00000ba38080 x21: 0000000000000000 [ 10.004316] x20: 0000000000000000 x19: ffffffffffffffee x18: 00000000ffffffff [ 10.004330] x17: 0000000000000000 x16: 1fffe000017541a1 x15: ffff8000842b3560 [ 10.004344] x14: 0000000000000000 x13: 007473696c5f7974 x12: 696e696666615f65 [ 10.004358] x11: 00000000000000c0 x10: 0000000000000020 x9 : ffff80007b33f2bc [ 10.004371] x8 : ffffffffffffffde x7 : ffff0000044a4800 x6 : 0000000000000000 [ 10.004384] x5 : 0000000000000002 x4 : 00000000c0000000 x3 : 0000000000000001 [ 10.004397] x2 : 0000000000000002 x1 : ffffffffffffffde x0 : ffffffffffffffee [ 10.004412] Call trace: [ 10.004417] dev_pm_opp_put (/builds/linux/drivers/opp/core.c:1685) (P) [ 10.004435] core_clks_enable+0x54/0x148 venus_core [ 10.004504] core_power_v1+0x78/0x90 venus_core [ 10.004560] venus_runtime_resume+0x6c/0x98 venus_core [ 10.004616] pm_generic_runtime_resume 
(/builds/linux/drivers/base/power/generic_ops.c:47) [ 10.004630] __genpd_runtime_resume (/builds/linux/drivers/pmdomain/core.c:1203) [ 10.004645] genpd_runtime_resume (/builds/linux/drivers/pmdomain/core.c:1329) [ 10.004656] __rpm_callback (/builds/linux/drivers/base/power/runtime.c:406) [ 10.004668] rpm_callback (/builds/linux/drivers/base/power/runtime.c:460) [ 10.004680] rpm_resume (/builds/linux/drivers/base/power/runtime.c:934) [ 10.004692] __pm_runtime_resume (/builds/linux/drivers/base/power/runtime.c:1192) [ 10.004704] venus_probe+0x2d8/0x588 venus_core [ 10.004761] platform_probe (/builds/linux/drivers/base/platform.c:1408 (discriminator 1)) [ 10.004776] really_probe (/builds/linux/drivers/base/dd.c:581 /builds/linux/drivers/base/dd.c:659) [ 10.004788] __driver_probe_device (/builds/linux/drivers/base/dd.c:801) [ 10.004800] driver_probe_device (/builds/linux/drivers/base/dd.c:831) [ 10.004812] __driver_attach (/builds/linux/drivers/base/dd.c:1218 /builds/linux/drivers/base/dd.c:1157) [ 10.004824] bus_for_each_dev (/builds/linux/drivers/base/bus.c:370) [ 10.004835] driver_attach (/builds/linux/drivers/base/dd.c:1236) [ 10.004847] bus_add_driver (/builds/linux/drivers/base/bus.c:678) [ 10.004859] driver_register (/builds/linux/drivers/base/driver.c:249) [ 10.004871] __platform_driver_register (/builds/linux/drivers/base/platform.c:868) [ 10.004885] qcom_venus_driver_init+0x28/0xfb8 venus_core [ 10.004942] do_one_initcall (/builds/linux/init/main.c:1269) [ 10.004954] do_init_module (/builds/linux/kernel/module/main.c:3039) [ 10.004967] load_module (/builds/linux/kernel/module/main.c:3509) [ 10.004979] init_module_from_file (/builds/linux/kernel/module/main.c:3702) [ 10.004991] __arm64_sys_finit_module (/builds/linux/kernel/module/main.c:3713 /builds/linux/kernel/module/main.c:3739 /builds/linux/kernel/module/main.c:3723 /builds/linux/kernel/module/main.c:3723) [ 10.005004] invoke_syscall (/builds/linux/arch/arm64/include/asm/current.h:19 
/builds/linux/arch/arm64/kernel/syscall.c:54) [ 10.005014] el0_svc_common.constprop.0 (/builds/linux/arch/arm64/kernel/syscall.c:139) [ 10.005023] do_el0_svc (/builds/linux/arch/arm64/kernel/syscall.c:152) [ 10.005032] el0_svc (/builds/linux/arch/arm64/include/asm/irqflags.h:82 (discriminator 1) /builds/linux/arch/arm64/include/asm/irqflags.h:123 (discriminator 1) /builds/linux/arch/arm64/include/asm/irqflags.h:136 (discriminator 1) /builds/linux/arch/arm64/kernel/entry-common.c:169 (discriminator 1) /builds/linux/arch/arm64/kernel/entry-common.c:182 (discriminator 1) /builds/linux/arch/arm64/kernel/entry-common.c:880 (discriminator 1)) [ 10.005045] el0t_64_sync_handler (/builds/linux/arch/arm64/kernel/entry-common.c:899) [ 10.005058] el0t_64_sync (/builds/linux/arch/arm64/kernel/entry.S:596) [ 10.005073] Code: 910003fd f9000bf3 91004013 aa1303e0 (f9402821) All code ======== 0: 910003fd mov x29, sp 4: f9000bf3 str x19, [sp, #16] 8: 91004013 add x19, x0, #0x10 c: aa1303e0 mov x0, x19 10:* f9402821 ldr x1, [x1, #80] <-- trapping instruction Code starting with the faulting instruction =========================================== 0: f9402821 ldr x1, [x1, #80] [ 10.005082] ---[ end trace 0000000000000000 ]--- [ 10.089433] systemd-journald[147]: Time jumped backwards, rotating. 
## Test cgroup crash log selftests: cgroup: test_cpu ok 1 test_cpucg_subtree_control ok 2 test_cpucg_stats ok 3 test_cpucg_nice not ok 4 test_cpucg_weight_overprovisioned not ok 5 test_cpucg_weight_underprovisioned ok 6 test_cpucg_nested_weight_overprovisioned [ 60.273474] Unable to handle kernel execute from non-executable memory at virtual address ffff800082f89d50 [ 60.273547] Mem abort info: [ 60.282111] ESR = 0x000000008600000e [ 60.284730] EC = 0x21: IABT (current EL), IL = 32 bits [ 60.288616] SET = 0, FnV = 0 [ 60.294041] EA = 0, S1PTW = 0 [ 60.296880] FSC = 0x0e: level 2 permission fault [ 60.299953] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000082485000 [ 60.304828] [ffff800082f89d50] pgd=0000000000000000, p4d=100000008300c003, pud=100000008300d003, pmd=0068000082e00701 [ 60.311682] Internal error: Oops: 000000008600000e [#2] SMP [ 60.322146] Modules linked in: pm8916_wdt qcom_wcnss_pil snd_soc_lpass_apq8016 snd_soc_msm8916_analog snd_soc_lpass_cpu snd_soc_apq8016_sbc snd_soc_msm8916_digital snd_soc_lpass_platform snd_soc_qcom_common coresight_cpu_debug snd_soc_core coresight_tmc coresight_replicator snd_compress coresight_funnel snd_pcm_dmaengine coresight_stm stm_core coresight_cti coresight_tpiu snd_pcm coresight snd_timer qrtr msm snd adv7511 qcom_camss qcom_q6v5_mss soundcore qcom_spmi_temp_alarm rtc_pm8xxx qcom_pon qcom_pil_info videobuf2_dma_sg ubwc_config qcom_q6v5 venus_core(+) qcom_sysmon qcom_spmi_vadc v4l2_fwnode llcc_qcom v4l2_async qcom_vadc_common qcom_common ocmem v4l2_mem2mem drm_gpuvm videobuf2_memops qcom_glink_smem videobuf2_v4l2 drm_exec mdt_loader qmi_helpers gpu_sched drm_dp_aux_bus qnoc_msm8916 videodev drm_display_helper qcom_stats videobuf2_common cec qcom_rng drm_client_lib mc phy_qcom_usb_hs socinfo rpmsg_ctrl display_connector rpmsg_char ramoops rmtfs_mem reed_solomon drm_kms_helper fuse drm backlight [ 60.394361] CPU: 3 UID: 0 PID: 252 Comm: kworker/u16:7 Tainted: G D 6.16.0-next-20250804 #1 PREEMPT [ 60.416518] Tainted: 
[D]=DIE [ 60.427172] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT) [ 60.430139] Workqueue: 0x0 (async) [ 60.436813] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 60.440027] pc : idem_hash+0x58/0x800 [ 60.446967] lr : idem_hash+0x58/0x800 [ 60.450785] sp : ffff8000842b3dd0 [ 60.454429] x29: ffff8000842b3dd0 x28: 0000000000000000 x27: 0000000000000000 [ 60.457737] x26: 0000000000000000 x25: ffff000010127880 x24: ffff000010127840 [ 60.464856] x23: ffff8000828cf000 x22: 61c8864680b583eb x21: ffff000003418c28 [ 60.471973] x20: ffff0000044a4800 x19: ffff0000044a4800 x18: 0000000000000000 [ 60.479090] x17: ffff7fffbd4dd000 x16: ffff800080018000 x15: 0000000000000000 [ 60.486208] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 60.493327] x11: 00000000000000c0 x10: 0000000000000b50 x9 : ffff80008163543c [ 60.500444] x8 : ffff8000842b3bd8 x7 : 0000000000000001 x6 : ffff8000828ab000 [ 60.507563] x5 : ffff0000044a4800 x4 : ffff8000828ab3e0 x3 : 0000000000000000 [ 60.514680] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000044a4800 [ 60.521800] Call trace: [ 60.528909] idem_hash+0x58/0x800 (P) [ 60.531168] worker_thread (/builds/linux/kernel/workqueue.c:3353) [ 60.534987] kthread (/builds/linux/kernel/kthread.c:463) [ 60.538631] ret_from_fork (/builds/linux/arch/arm64/kernel/entry.S:861) [ 60.542023] Code: 00000000 00000000 00000000 00000000 (842b3d70) All code ======== ... 10: 842b3d70 .inst 0x842b3d70 ; undefined Code starting with the faulting instruction =========================================== 0: 842b3d70 .inst 0x842b3d70 ; undefined [ 60.545587] ---[ end trace 0000000000000000 ]--- [ 60.561661] note: kworker/u16:7[252] exited with preempt_count 1 ok 7 test_cpucg_nested_weight_underprovisioned # not ok 2 selftests: cgroup: test_cpu TIMEOUT 45 seconds ## Test filesystems crash log selftests: filesystems: file_stressor TAP version 13 1..1 Starting 1 tests from 1 test cases. 
RUN file_stressor.slab_typesafe_by_rcu ... [ 316.785677] ------------[ cut here ]------------ [ 316.785733] refcount_t: addition on 0; use-after-free. [ 316.789429] WARNING: lib/refcount.c:25 at refcount_warn_saturate+0x120/0x148, CPU#0: 5/88 [ 316.794336] Modules linked in: pm8916_wdt qcom_wcnss_pil snd_soc_lpass_apq8016 snd_soc_msm8916_analog snd_soc_lpass_cpu snd_soc_apq8016_sbc snd_soc_msm8916_digital snd_soc_lpass_platform snd_soc_qcom_common coresight_cpu_debug snd_soc_core coresight_tmc coresight_replicator snd_compress coresight_funnel snd_pcm_dmaengine coresight_stm stm_core coresight_cti coresight_tpiu snd_pcm coresight snd_timer qrtr msm snd adv7511 qcom_camss qcom_q6v5_mss soundcore qcom_spmi_temp_alarm rtc_pm8xxx qcom_pon qcom_pil_info videobuf2_dma_sg ubwc_config qcom_q6v5 venus_core(+) qcom_sysmon qcom_spmi_vadc v4l2_fwnode llcc_qcom v4l2_async qcom_vadc_common qcom_common ocmem v4l2_mem2mem drm_gpuvm videobuf2_memops qcom_glink_smem videobuf2_v4l2 drm_exec mdt_loader qmi_helpers gpu_sched drm_dp_aux_bus qnoc_msm8916 videodev drm_display_helper qcom_stats videobuf2_common cec qcom_rng drm_client_lib mc phy_qcom_usb_hs socinfo rpmsg_ctrl display_connector rpmsg_char ramoops rmtfs_mem reed_solomon drm_kms_helper fuse drm backlight [ 316.870196] CPU: 0 UID: 0 PID: 88 Comm: kworker/u16:5 Tainted: G D 6.16.0-next-20250804 #1 PREEMPT [ 316.892345] Tainted: [D]=DIE [ 316.903000] Hardware name: Qualcomm Technologies, Inc. 
APQ 8016 SBC (DT) [ 316.905873] Workqueue: events_unbound idle_cull_fn [ 316.912553] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 316.917156] pc : refcount_warn_saturate (/builds/linux/lib/refcount.c:25 (discriminator 1)) [ 316.924010] lr : refcount_warn_saturate (/builds/linux/lib/refcount.c:25 (discriminator 1)) [ 316.928870] sp : ffff8000839dbd10 [ 316.933727] x29: ffff8000839dbd10 x28: 0000000000000000 x27: 0000000000000000 [ 316.937208] x26: ffff000003418c28 x25: 0000000000000000 x24: ffff000003418c78 [ 316.944327] x23: 00000000000124f8 x22: ffff8000828a8000 x21: ffff8000839dbd38 [ 316.951444] x20: ffff8000828cf108 x19: ffff000003418c00 x18: 0000000000000006 [ 316.958563] x17: 0000000000000000 x16: 0000000000000000 x15: 0765076507720766 [ 316.965680] x14: 072d077207650774 x13: 0765076507720766 x12: 072d077207650774 [ 316.972799] x11: 0720072007200720 x10: ffff800082931cc0 x9 : ffff8000801ce594 [ 316.979918] x8 : 00000000ffffefff x7 : ffff800082931cc0 x6 : 80000000fffff000 [ 316.987035] x5 : 0000000000000566 x4 : 0000000000000000 x3 : 0000000000000027 [ 316.994153] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000059e0000 [ 317.001272] Call trace: [ 317.008380] refcount_warn_saturate (/builds/linux/lib/refcount.c:25 (discriminator 1)) (P) [ 317.010642] set_worker_dying (/builds/linux/include/linux/refcount.h:289 /builds/linux/include/linux/refcount.h:366 /builds/linux/include/linux/refcount.h:383 /builds/linux/include/linux/sched/task.h:116 /builds/linux/kernel/workqueue.c:2895) [ 317.015500] idle_cull_fn (/builds/linux/kernel/workqueue.c:962 /builds/linux/kernel/workqueue.c:2961) [ 317.019666] process_one_work (/builds/linux/kernel/workqueue.c:3241) [ 317.023225] worker_thread (/builds/linux/kernel/workqueue.c:3313 (discriminator 2) /builds/linux/kernel/workqueue.c:3400 (discriminator 2)) [ 317.027217] kthread (/builds/linux/kernel/kthread.c:463) [ 317.030862] ret_from_fork (/builds/linux/arch/arm64/kernel/entry.S:861) [ 
317.034249] ---[ end trace 0000000000000000 ]--- [ 317.047081] ------------[ cut here ]------------ [ 317.047142] refcount_t: saturated; leaking memory. [ 317.051602] WARNING: lib/refcount.c:22 at refcount_warn_saturate+0x74/0x148, CPU#0: 5/88 [ 317.055397] Modules linked in: pm8916_wdt qcom_wcnss_pil snd_soc_lpass_apq8016 snd_soc_msm8916_analog snd_soc_lpass_cpu snd_soc_apq8016_sbc snd_soc_msm8916_digital snd_soc_lpass_platform snd_soc_qcom_common coresight_cpu_debug snd_soc_core coresight_tmc coresight_replicator snd_compress coresight_funnel snd_pcm_dmaengine coresight_stm stm_core coresight_cti coresight_tpiu snd_pcm coresight snd_timer qrtr msm snd adv7511 qcom_camss qcom_q6v5_mss soundcore qcom_spmi_temp_alarm rtc_pm8xxx qcom_pon qcom_pil_info videobuf2_dma_sg ubwc_config qcom_q6v5 venus_core(+) qcom_sysmon qcom_spmi_vadc v4l2_fwnode llcc_qcom v4l2_async qcom_vadc_common qcom_common ocmem v4l2_mem2mem drm_gpuvm videobuf2_memops qcom_glink_smem videobuf2_v4l2 drm_exec mdt_loader qmi_helpers gpu_sched drm_dp_aux_bus qnoc_msm8916 videodev drm_display_helper qcom_stats videobuf2_common cec qcom_rng drm_client_lib mc phy_qcom_usb_hs socinfo rpmsg_ctrl display_connector rpmsg_char ramoops rmtfs_mem reed_solomon drm_kms_helper fuse drm backlight [ 317.131166] CPU: 0 UID: 0 PID: 88 Comm: kworker/u16:5 Tainted: G D W 6.16.0-next-20250804 #1 PREEMPT [ 317.153317] Tainted: [D]=DIE, [W]=WARN [ 317.163972] Hardware name: Qualcomm Technologies, Inc. 
APQ 8016 SBC (DT) [ 317.167536] Workqueue: events_unbound idle_cull_fn [ 317.174392] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 317.178993] pc : refcount_warn_saturate (/builds/linux/lib/refcount.c:22 (discriminator 1)) [ 317.185848] lr : refcount_warn_saturate (/builds/linux/lib/refcount.c:22 (discriminator 1)) [ 317.190708] sp : ffff8000839dbcd0 [ 317.195478] x29: ffff8000839dbcd0 x28: 0000000000000000 x27: 0000000000000000 [ 317.198871] x26: ffff000003418c28 x25: 0000000000000000 x24: ffff000003418c78 [ 317.205990] x23: dead000000000122 x22: dead000000000100 x21: ffff8000839dbd38 [ 317.213109] x20: ffff0000044a4828 x19: ffff0000044a4800 x18: 0000000000000000 [ 317.220226] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 317.227345] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 317.234463] x11: 00000000000000c0 x10: 0000000000000b50 x9 : ffff80008163543c [ 317.241581] x8 : ffff8000839db9e8 x7 : 0000000000000001 x6 : 0000000000000001 [ 317.248699] x5 : ffff8000828ab000 x4 : ffff8000828ab3e0 x3 : 0000000000000000 [ 317.255817] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000059e0000 [ 317.262936] Call trace: [ 317.270045] refcount_warn_saturate (/builds/linux/lib/refcount.c:22 (discriminator 1)) (P) [ 317.272306] kthread_stop (/builds/linux/include/linux/refcount.h:291 /builds/linux/include/linux/refcount.h:366 /builds/linux/include/linux/refcount.h:383 /builds/linux/include/linux/sched/task.h:116 /builds/linux/kernel/kthread.c:784) [ 317.277163] kthread_stop_put (/builds/linux/include/linux/sched/task.h:130 /builds/linux/kernel/kthread.c:812) [ 317.280897] idle_cull_fn (/builds/linux/kernel/workqueue.c:2859 /builds/linux/kernel/workqueue.c:2980) [ 317.284541] process_one_work (/builds/linux/kernel/workqueue.c:3241) [ 317.288361] worker_thread (/builds/linux/kernel/workqueue.c:3313 (discriminator 2) /builds/linux/kernel/workqueue.c:3400 (discriminator 2)) [ 317.292353] kthread 
(/builds/linux/kernel/kthread.c:463) [ 317.295999] ret_from_fork (/builds/linux/arch/arm64/kernel/entry.S:861) [ 317.299386] ---[ end trace 0000000000000000 ]--- [ 317.303294] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 317.307630] Mem abort info: [ 317.316403] ESR = 0x0000000096000004 [ 317.318879] EC = 0x25: DABT (current EL), IL = 32 bits [ 317.322723] SET = 0, FnV = 0 [ 317.328194] EA = 0, S1PTW = 0 [ 317.331025] FSC = 0x04: level 0 translation fault [ 317.334121] Data abort info: [ 317.338933] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 317.342061] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 317.347364] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 317.352515] user pgtable: 4k pages, 48-bit VAs, pgdp=000000008a592000 [ 317.357892] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 [ 317.364247] Internal error: Oops: 0000000096000004 [#3] SMP [ 317.370923] Modules linked in: pm8916_wdt qcom_wcnss_pil snd_soc_lpass_apq8016 snd_soc_msm8916_analog snd_soc_lpass_cpu snd_soc_apq8016_sbc snd_soc_msm8916_digital snd_soc_lpass_platform snd_soc_qcom_common coresight_cpu_debug snd_soc_core coresight_tmc coresight_replicator snd_compress coresight_funnel snd_pcm_dmaengine coresight_stm stm_core coresight_cti coresight_tpiu snd_pcm coresight snd_timer qrtr msm snd adv7511 qcom_camss qcom_q6v5_mss soundcore qcom_spmi_temp_alarm rtc_pm8xxx qcom_pon qcom_pil_info videobuf2_dma_sg ubwc_config qcom_q6v5 venus_core(+) qcom_sysmon qcom_spmi_vadc v4l2_fwnode llcc_qcom v4l2_async qcom_vadc_common qcom_common ocmem v4l2_mem2mem drm_gpuvm videobuf2_memops qcom_glink_smem videobuf2_v4l2 drm_exec mdt_loader qmi_helpers gpu_sched drm_dp_aux_bus qnoc_msm8916 videodev drm_display_helper qcom_stats videobuf2_common cec qcom_rng drm_client_lib mc phy_qcom_usb_hs socinfo rpmsg_ctrl display_connector rpmsg_char ramoops rmtfs_mem reed_solomon drm_kms_helper fuse drm backlight [ 317.443145] CPU: 0 UID: 0 PID: 88 Comm: kworker/u16:5 Tainted: 
G D W 6.16.0-next-20250804 #1 PREEMPT [ 317.465294] Tainted: [D]=DIE, [W]=WARN [ 317.475949] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT) [ 317.479516] Workqueue: events_unbound idle_cull_fn [ 317.486370] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 317.490972] pc : kthread_stop (/builds/linux/arch/arm64/include/asm/atomic_ll_sc.h:203 (discriminator 2) /builds/linux/arch/arm64/include/asm/atomic.h:65 (discriminator 2) /builds/linux/include/linux/atomic/atomic-arch-fallback.h:3798 (discriminator 2) /builds/linux/include/linux/atomic/atomic-long.h:1069 (discriminator 2) /builds/linux/include/asm-generic/bitops/atomic.h:18 (discriminator 2) /builds/linux/include/asm-generic/bitops/instrumented-atomic.h:29 (discriminator 2) /builds/linux/kernel/kthread.c:786 (discriminator 2)) [ 317.497825] lr : kthread_stop (/builds/linux/include/linux/refcount.h:291 /builds/linux/include/linux/refcount.h:366 /builds/linux/include/linux/refcount.h:383 /builds/linux/include/linux/sched/task.h:116 /builds/linux/kernel/kthread.c:784) [ 317.501989] sp : ffff8000839dbce0 [ 317.505981] x29: ffff8000839dbce0 x28: 0000000000000000 x27: 0000000000000000 [ 317.509288] x26: ffff000003418c28 x25: 0000000000000000 x24: ffff000003418c78 [ 317.516405] x23: dead000000000122 x22: dead000000000100 x21: 0000000000000000 [ 317.523526] x20: ffff0000044a4828 x19: ffff0000044a4800 x18: 0000000000000000 [ 317.530642] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 317.537762] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 317.544879] x11: 00000000000000c0 x10: 0000000000000b50 x9 : ffff80008163543c [ 317.551996] x8 : ffff8000839db9e8 x7 : 0000000000000001 x6 : 0000000000000001 [ 317.559116] x5 : ffff8000828ab000 x4 : ffff8000828ab3e0 x3 : 0000000000000000 [ 317.566232] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 000000000020804c [ 317.573353] Call trace: [ 317.580461] kthread_stop 
(/builds/linux/arch/arm64/include/asm/atomic_ll_sc.h:203 (discriminator 2) /builds/linux/arch/arm64/include/asm/atomic.h:65 (discriminator 2) /builds/linux/include/linux/atomic/atomic-arch-fallback.h:3798 (discriminator 2) /builds/linux/include/linux/atomic/atomic-long.h:1069 (discriminator 2) /builds/linux/include/asm-generic/bitops/atomic.h:18 (discriminator 2) /builds/linux/include/asm-generic/bitops/instrumented-atomic.h:29 (discriminator 2) /builds/linux/kernel/kthread.c:786 (discriminator 2)) (P)
[ 317.582720] kthread_stop_put (/builds/linux/include/linux/sched/task.h:130 /builds/linux/kernel/kthread.c:812)
[ 317.586884] idle_cull_fn (/builds/linux/kernel/workqueue.c:2859 /builds/linux/kernel/workqueue.c:2980)
[ 317.590531] process_one_work (/builds/linux/kernel/workqueue.c:3241)
[ 317.594351] worker_thread (/builds/linux/kernel/workqueue.c:3313 (discriminator 2) /builds/linux/kernel/workqueue.c:3400 (discriminator 2))
[ 317.598345] kthread (/builds/linux/kernel/kthread.c:463)
[ 317.601988] ret_from_fork (/builds/linux/arch/arm64/kernel/entry.S:861)
[ 317.605380] Code: c8017e60 35ffffa1 17ffffaf f98002b1 (c85f7ea0)
All code
========
   0:	c8017e60	stxr	w1, x0, [x19]
   4:	35ffffa1	cbnz	w1, 0xfffffffffffffff8
   8:	17ffffaf	b	0xfffffffffffffec4
   c:	f98002b1	prfm	pstl1strm, [x21]
  10:*	c85f7ea0	ldxr	x0, [x21]		<-- trapping instruction

Code starting with the faulting instruction
===========================================
   0:	c85f7ea0	ldxr	x0, [x21]
[ 317.608944] ---[ end trace 0000000000000000 ]---

## Source
* Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
* Git sha: 5c5a10f0be967a8950a2309ea965bae54251b50e
* Git describe: next-20250804
* Project details: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250804
* Architectures: arm64 Dragonboard-410c
* Toolchains: gcc-13
* Kconfigs: selftests/*/configs

## Build
* Test log: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250804/te…
* Test details: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250804/te…
* Test history: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250804/te…
* Test plan: https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/30oCAeKlwl…
* Build link: https://storage.tuxsuite.com/public/linaro/lkft/builds/30oC7ut8e7yXWPAtJXay…
* Kernel config: https://storage.tuxsuite.com/public/linaro/lkft/builds/30oC7ut8e7yXWPAtJXay…

--
Linaro LKFT
https://lkft.linaro.org

5 months


[PATCH] selftests/futex: Check for shmget support at runtime

by Wake Liu

The futex tests `futex_wait.c` and `futex_waitv.c` rely on the `shmget()` syscall, which may not be available if the kernel is built without System V IPC support (CONFIG_SYSVIPC=n). This can lead to test failures on such systems. This patch modifies the tests to check for `shmget()` support at runtime by calling it and checking for an `ENOSYS` error. If `shmget()` is not supported, the tests are skipped with a clear message, improving the user experience and preventing false negatives. This approach is more robust than relying on compile-time checks and ensures that the tests run only when the required kernel features are present. Signed-off-by: Wake Liu <wakel(a)google.com> --- .../selftests/futex/functional/futex_wait.c | 49 ++++++------ .../selftests/futex/functional/futex_waitv.c | 78 +++++++++++-------- 2 files changed, 72 insertions(+), 55 deletions(-) diff --git a/tools/testing/selftests/futex/functional/futex_wait.c b/tools/testing/selftests/futex/functional/futex_wait.c index 685140d9b93d..17a465313a59 100644 --- a/tools/testing/selftests/futex/functional/futex_wait.c +++ b/tools/testing/selftests/futex/functional/futex_wait.c @@ -48,7 +48,7 @@ static void *waiterfn(void *arg) int main(int argc, char *argv[]) { int res, ret = RET_PASS, fd, c, shm_id; - u_int32_t f_private = 0, *shared_data; + u_int32_t f_private = 0, *shared_data = NULL; unsigned int flags = FUTEX_PRIVATE_FLAG; pthread_t waiter; void *shm; @@ -96,32 +96,35 @@ int main(int argc, char *argv[]) /* Testing an anon page shared memory */ shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666); if (shm_id < 0) { - perror("shmget"); - exit(1); - } - - shared_data = shmat(shm_id, NULL, 0); + if (errno == ENOSYS) { + ksft_test_result_skip("Kernel does not support System V shared memory\n"); + } else { + ksft_test_result_fail("shmget() failed with error: %s\n", strerror(errno)); + ret = RET_FAIL; + } + } else { + shared_data = shmat(shm_id, NULL, 0); - *shared_data = 0; - futex = shared_data; + 
*shared_data = 0; + futex = shared_data; - info("Calling shared (page anon) futex_wait on futex: %p\n", futex); - if (pthread_create(&waiter, NULL, waiterfn, NULL)) - error("pthread_create failed\n", errno); + info("Calling shared (page anon) futex_wait on futex: %p\n", futex); + if (pthread_create(&waiter, NULL, waiterfn, NULL)) + error("pthread_create failed\n", errno); - usleep(WAKE_WAIT_US); + usleep(WAKE_WAIT_US); - info("Calling shared (page anon) futex_wake on futex: %p\n", futex); - res = futex_wake(futex, 1, 0); - if (res != 1) { - ksft_test_result_fail("futex_wake shared (page anon) returned: %d %s\n", - errno, strerror(errno)); - ret = RET_FAIL; - } else { - ksft_test_result_pass("futex_wake shared (page anon) succeeds\n"); + info("Calling shared (page anon) futex_wake on futex: %p\n", futex); + res = futex_wake(futex, 1, 0); + if (res != 1) { + ksft_test_result_fail("futex_wake shared (page anon) returned: %d %s\n", + errno, strerror(errno)); + ret = RET_FAIL; + } else { + ksft_test_result_pass("futex_wake shared (page anon) succeeds\n"); + } } - /* Testing a file backed shared memory */ fd = open(SHM_PATH, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR); if (fd < 0) { @@ -161,7 +164,8 @@ int main(int argc, char *argv[]) } /* Freeing resources */ - shmdt(shared_data); + if (shared_data) + shmdt(shared_data); munmap(shm, sizeof(f_private)); remove(SHM_PATH); close(fd); @@ -169,3 +173,4 @@ int main(int argc, char *argv[]) ksft_print_cnts(); return ret; } + diff --git a/tools/testing/selftests/futex/functional/futex_waitv.c b/tools/testing/selftests/futex/functional/futex_waitv.c index a94337f677e1..3baf5142b434 100644 --- a/tools/testing/selftests/futex/functional/futex_waitv.c +++ b/tools/testing/selftests/futex/functional/futex_waitv.c @@ -104,46 +104,62 @@ int main(int argc, char *argv[]) ksft_test_result_fail("futex_wake private returned: %d %s\n", res ? errno : res, res ? 
strerror(errno) : ""); - ret = RET_FAIL; } else { ksft_test_result_pass("futex_waitv private\n"); } /* Shared waitv */ - for (i = 0; i < NR_FUTEXES; i++) { - int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666); - - if (shm_id < 0) { - perror("shmget"); - exit(1); + bool shm_supported = true; + int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666); + + if (shm_id < 0) { + if (errno == ENOSYS) { + shm_supported = false; + ksft_test_result_skip("Kernel does not support System V shared memory\n"); + } else { + ksft_test_result_fail("shmget() failed with error: %s\n", strerror(errno)); + ret = RET_FAIL; + shm_supported = false; } + } else { + shmctl(shm_id, IPC_RMID, NULL); + } - unsigned int *shared_data = shmat(shm_id, NULL, 0); + if (shm_supported) { + for (i = 0; i < NR_FUTEXES; i++) { + int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666); - *shared_data = 0; - waitv[i].uaddr = (uintptr_t)shared_data; - waitv[i].flags = FUTEX_32; - waitv[i].val = 0; - waitv[i].__reserved = 0; - } + if (shm_id < 0) { + perror("shmget"); + exit(1); + } - if (pthread_create(&waiter, NULL, waiterfn, NULL)) - error("pthread_create failed\n", errno); + unsigned int *shared_data = shmat(shm_id, NULL, 0); - usleep(WAKE_WAIT_US); + *shared_data = 0; + waitv[i].uaddr = (uintptr_t)shared_data; + waitv[i].flags = FUTEX_32; + waitv[i].val = 0; + waitv[i].__reserved = 0; + } - res = futex_wake(u64_to_ptr(waitv[NR_FUTEXES - 1].uaddr), 1, 0); - if (res != 1) { - ksft_test_result_fail("futex_wake shared returned: %d %s\n", - res ? errno : res, - res ? 
strerror(errno) : ""); - ret = RET_FAIL; - } else { - ksft_test_result_pass("futex_waitv shared\n"); - } + if (pthread_create(&waiter, NULL, waiterfn, NULL)) + error("pthread_create failed\n", errno); - for (i = 0; i < NR_FUTEXES; i++) - shmdt(u64_to_ptr(waitv[i].uaddr)); + usleep(WAKE_WAIT_US); + + res = futex_wake(u64_to_ptr(waitv[NR_FUTEXES - 1].uaddr), 1, 0); + if (res != 1) { + ksft_test_result_fail("futex_wake shared returned: %d %s\n", + res ? errno : res, + res ? strerror(errno) : ""); + } else { + ksft_test_result_pass("futex_waitv shared\n"); + } + + for (i = 0; i < NR_FUTEXES; i++) + shmdt(u64_to_ptr(waitv[i].uaddr)); + } /* Testing a waiter without FUTEX_32 flag */ waitv[0].flags = FUTEX_PRIVATE_FLAG; @@ -158,7 +174,6 @@ int main(int argc, char *argv[]) ksft_test_result_fail("futex_waitv private returned: %d %s\n", res ? errno : res, res ? strerror(errno) : ""); - ret = RET_FAIL; } else { ksft_test_result_pass("futex_waitv without FUTEX_32\n"); } @@ -177,7 +192,6 @@ int main(int argc, char *argv[]) ksft_test_result_fail("futex_wake private returned: %d %s\n", res ? errno : res, res ? strerror(errno) : ""); - ret = RET_FAIL; } else { ksft_test_result_pass("futex_waitv with an unaligned address\n"); } @@ -195,7 +209,6 @@ int main(int argc, char *argv[]) ksft_test_result_fail("futex_waitv private returned: %d %s\n", res ? errno : res, res ? strerror(errno) : ""); - ret = RET_FAIL; } else { ksft_test_result_pass("futex_waitv NULL address in waitv.uaddr\n"); } @@ -211,7 +224,6 @@ int main(int argc, char *argv[]) ksft_test_result_fail("futex_waitv private returned: %d %s\n", res ? errno : res, res ? strerror(errno) : ""); - ret = RET_FAIL; } else { ksft_test_result_pass("futex_waitv NULL address in *waiters\n"); } @@ -227,7 +239,6 @@ int main(int argc, char *argv[]) ksft_test_result_fail("futex_waitv private returned: %d %s\n", res ? errno : res, res ? 
strerror(errno) : ""); - ret = RET_FAIL; } else { ksft_test_result_pass("futex_waitv invalid clockid\n"); } @@ -235,3 +246,4 @@ int main(int argc, char *argv[]) ksft_print_cnts(); return ret; } + -- 2.50.1.703.g449372360f-goog
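The runtime check described in this patch can be sketched in isolation as follows. This is a minimal illustration, not code from the patch: the `shm_probe` enum and the helpers `classify_shmget_errno()` and `probe_sysv_shm()` are hypothetical names, and only `shmget()`, `shmctl()` and the `ENOSYS` convention come from the original.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical outcome codes for this sketch; not part of kselftest. */
enum shm_probe { SHM_PROBE_OK, SHM_PROBE_SKIP, SHM_PROBE_FAIL };

/*
 * Map the errno left by a failed shmget() to a test outcome.
 * ENOSYS means the syscall is absent (e.g. CONFIG_SYSVIPC=n), so the
 * test should be skipped; any other error is treated as a failure.
 */
static enum shm_probe classify_shmget_errno(int err)
{
	return err == ENOSYS ? SHM_PROBE_SKIP : SHM_PROBE_FAIL;
}

/*
 * Probe for System V shared memory by creating a private segment and
 * removing it again. On failure, *err receives the errno value.
 */
static enum shm_probe probe_sysv_shm(int *err)
{
	int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666);

	if (shm_id < 0) {
		*err = errno;
		return classify_shmget_errno(*err);
	}

	/* Mark the probe segment for removal so it does not leak. */
	shmctl(shm_id, IPC_RMID, NULL);
	*err = 0;
	return SHM_PROBE_OK;
}
```

A caller would report a skip on `SHM_PROBE_SKIP`, report a failure (with `strerror()` of the saved errno) on `SHM_PROBE_FAIL`, and run the shared-memory test cases only on `SHM_PROBE_OK`.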

5 months


[PATCH] selftests/net: Ensure assert() triggers in psock_tpacket.c

by Wake Liu

The get_next_frame() function in psock_tpacket.c was missing a return
statement in its default switch case, leading to a compiler warning.

This was caused by a `bug_on(1)` call, which is defined as an
`assert()`, being compiled out because NDEBUG is defined during the
build.

Instead of adding a `return NULL;` which would silently hide the error
and could lead to crashes later, this change restores the original
author's intent. By adding `#undef NDEBUG` before including <assert.h>,
we ensure the assertion is active and will cause the test to abort if
this unreachable code is ever executed.

Signed-off-by: Wake Liu <wakel(a)google.com>
---
 tools/testing/selftests/net/psock_tpacket.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/net/psock_tpacket.c b/tools/testing/selftests/net/psock_tpacket.c
index 0dd909e325d9..a54f2eb754ce 100644
--- a/tools/testing/selftests/net/psock_tpacket.c
+++ b/tools/testing/selftests/net/psock_tpacket.c
@@ -38,6 +38,7 @@
 #include <arpa/inet.h>
 #include <stdint.h>
 #include <string.h>
+#undef NDEBUG
 #include <assert.h>
 #include <net/if.h>
 #include <inttypes.h>
--
2.50.1.703.g449372360f-goog
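The effect of placing `#undef NDEBUG` ahead of `<assert.h>` can be seen in a small standalone sketch. Everything below is illustrative and assumed rather than taken from psock_tpacket.c: `next_frame()`, `assert_is_active()` and the `frame_version` enum are hypothetical stand-ins.

```c
/*
 * Undefine NDEBUG before including <assert.h> so that assert() stays
 * active even if the build passes -DNDEBUG on the command line.
 */
#undef NDEBUG
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the frame versions a test might handle. */
enum frame_version { FRAME_V1, FRAME_V2 };

static int frame_v1, frame_v2;

/* Return a frame for known versions; abort on anything else. */
static void *next_frame(int version)
{
	switch (version) {
	case FRAME_V1:
		return &frame_v1;
	case FRAME_V2:
		return &frame_v2;
	default:
		/* Active assertion: the process aborts here instead of
		 * falling off the end of the function. */
		assert(0 && "unknown frame version");
	}
}

/*
 * Demonstration helper: returns 1 when assert() evaluates its argument
 * and 0 when it was compiled out. Side effects inside assert() are bad
 * practice in real code and are used here only to observe the macro.
 */
static int assert_is_active(void)
{
	int hits = 0;

	assert(++hits);
	return hits;
}
```

With the `#undef` in place, `assert_is_active()` returns 1 regardless of compiler flags. Without it, a `-DNDEBUG` build would compile the assertion out, and the default case would reach the end of a non-void function, which is the situation the compiler warning pointed at.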

5 months


[PATCH v3 0/3] execute PROCMAP_QUERY ioctl under per-vma lock

by Suren Baghdasaryan

With /proc/pid/maps now being read under per-vma lock protection we can
reuse parts of that code to execute PROCMAP_QUERY ioctl also without
taking mmap_lock. The change is designed to reduce mmap_lock contention
and prevent PROCMAP_QUERY ioctl calls from blocking address space
updates.

This patchset was split out of the original patchset [1] that introduced
per-vma lock usage for /proc/pid/maps reading. It contains PROCMAP_QUERY
tests, code refactoring patch to simplify the main change and the actual
transition to per-vma lock.

Changes since v2 [2]
- Added Reviewed-by, per Vlastimil Babka
- Fixed query_vma_find_by_addr() to handle lock_ctx->mmap_locked case,
  per Vlastimil Babka

[1] https://lore.kernel.org/all/20250704060727.724817-1-surenb@google.com/
[2] https://lore.kernel.org/all/20250804231552.1217132-1-surenb@google.com/

Suren Baghdasaryan (3):
  selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified
  fs/proc/task_mmu: factor out proc_maps_private fields used by PROCMAP_QUERY
  fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks

 fs/proc/internal.h | 15 +-
 fs/proc/task_mmu.c | 152 ++++++++++++------
 fs/proc/task_nommu.c | 14 +-
 tools/testing/selftests/proc/proc-maps-race.c | 65 ++++++++
 4 files changed, 184 insertions(+), 62 deletions(-)

base-commit: 8e7e0c6d09502e44aa7a8fce0821e042a6ec03d1
--
2.50.1.565.gc32cd1483b-goog

5 months


[PATCH -next] selftests/bpf: Fix warning comparing pointer to 0

by Jiapeng Chong

Avoid comparing a pointer value with 0, to make the code clearer.

./tools/testing/selftests/bpf/progs/mem_rdonly_untrusted.c:221:10-11: WARNING comparing pointer to 0.

Reported-by: Abaci Robot <abaci(a)linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=23403
Signed-off-by: Jiapeng Chong <jiapeng.chong(a)linux.alibaba.com>
---
 tools/testing/selftests/bpf/progs/mem_rdonly_untrusted.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/mem_rdonly_untrusted.c b/tools/testing/selftests/bpf/progs/mem_rdonly_untrusted.c
index 4f94c971ae86..6b725725b2bf 100644
--- a/tools/testing/selftests/bpf/progs/mem_rdonly_untrusted.c
+++ b/tools/testing/selftests/bpf/progs/mem_rdonly_untrusted.c
@@ -218,7 +218,7 @@ int null_check(void *ctx)
 	int *p;

 	p = bpf_rdonly_cast(0, 0);
-	if (p == 0)
+	if (!p)
 		/* make this a function call to avoid compiler
 		 * moving r0 assignment before check.
 		 */
--
2.43.5

5 months


[PATCH v5] selftests: riscv: add misaligned access testing

by Clément Léger

This selftest tests all the currently emulated instructions (except for the RV32 compressed ones which are left as a future exercise for a RV32 user). For the FPU instructions, all the FPU registers are tested. Signed-off-by: Clément Léger <cleger(a)rivosinc.com> --- This commits needs [1] for the test to be passed or it will fail due to missing sign extension. Note: This test can be executed using an SBI that support FWFT or that delegates misaligned traps by default. If using QEMU, you will need the patches mentioned at [2] so that misaligned accesses will generate a trap. Note: This commit was part of a series [3] that was partially merged. Note: the remaining checkpatch errors are not applicable to this tests which is a userspace one and does not use the kernel headers. Macros with complex values can not be enclosed in do while loop since they are generating functions. Link: https://lore.kernel.org/linux-riscv/mvmikk0goil.fsf@suse.de/ [1] Link: https://lore.kernel.org/all/20241211211933.198792-1-fkonrad@amd.com/ [2] Link: https://lore.kernel.org/linux-riscv/20250422162324.956065-1-cleger@rivosinc… [3] V5: - Sign extend LWU return value - Added sign extensions tests - Tests multiples values for GP load/store V4: - Fixed a typo in test name s/load/store V3: - Fixed a segfault and a sign extension error found when compiling with -O<x>, x != 0 (Alex) - Use inline assembly to generate the sigbus and avoid GCC optimizations V2: - Fix commit description - Fix a few errors reported by checkpatch.pl --- tools/testing/selftests/riscv/Makefile | 2 +- .../selftests/riscv/misaligned/.gitignore | 1 + .../selftests/riscv/misaligned/Makefile | 12 + .../selftests/riscv/misaligned/common.S | 33 ++ .../testing/selftests/riscv/misaligned/fpu.S | 180 +++++++++++ tools/testing/selftests/riscv/misaligned/gp.S | 113 +++++++ .../selftests/riscv/misaligned/misaligned.c | 288 ++++++++++++++++++ 7 files changed, 628 insertions(+), 1 deletion(-) create mode 100644 
tools/testing/selftests/riscv/misaligned/.gitignore create mode 100644 tools/testing/selftests/riscv/misaligned/Makefile create mode 100644 tools/testing/selftests/riscv/misaligned/common.S create mode 100644 tools/testing/selftests/riscv/misaligned/fpu.S create mode 100644 tools/testing/selftests/riscv/misaligned/gp.S create mode 100644 tools/testing/selftests/riscv/misaligned/misaligned.c diff --git a/tools/testing/selftests/riscv/Makefile b/tools/testing/selftests/riscv/Makefile index 099b8c1f46f8..95a98ceeb3b3 100644 --- a/tools/testing/selftests/riscv/Makefile +++ b/tools/testing/selftests/riscv/Makefile @@ -5,7 +5,7 @@ ARCH ?= $(shell uname -m 2>/dev/null || echo not) ifneq (,$(filter $(ARCH),riscv)) -RISCV_SUBTARGETS ?= abi hwprobe mm sigreturn vector +RISCV_SUBTARGETS ?= abi hwprobe mm sigreturn vector misaligned else RISCV_SUBTARGETS := endif diff --git a/tools/testing/selftests/riscv/misaligned/.gitignore b/tools/testing/selftests/riscv/misaligned/.gitignore new file mode 100644 index 000000000000..5eff15a1f981 --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/.gitignore @@ -0,0 +1 @@ +misaligned diff --git a/tools/testing/selftests/riscv/misaligned/Makefile b/tools/testing/selftests/riscv/misaligned/Makefile new file mode 100644 index 000000000000..1aa40110c50d --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/Makefile @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) 2021 ARM Limited +# Originally tools/testing/arm64/abi/Makefile + +CFLAGS += -I$(top_srcdir)/tools/include + +TEST_GEN_PROGS := misaligned + +include ../../lib.mk + +$(OUTPUT)/misaligned: misaligned.c fpu.S gp.S + $(CC) -g3 -static -o$@ -march=rv64imafdc $(CFLAGS) $(LDFLAGS) $^ diff --git a/tools/testing/selftests/riscv/misaligned/common.S b/tools/testing/selftests/riscv/misaligned/common.S new file mode 100644 index 000000000000..8fa00035bd5d --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/common.S @@ -0,0 +1,33 @@ +/* 
SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2025 Rivos Inc. + * + * Authors: + * Clément Léger <cleger(a)rivosinc.com> + */ + +.macro lb_sb temp, offset, src, dst + lb \temp, \offset(\src) + sb \temp, \offset(\dst) +.endm + +.macro copy_long_to temp, src, dst + lb_sb \temp, 0, \src, \dst, + lb_sb \temp, 1, \src, \dst, + lb_sb \temp, 2, \src, \dst, + lb_sb \temp, 3, \src, \dst, + lb_sb \temp, 4, \src, \dst, + lb_sb \temp, 5, \src, \dst, + lb_sb \temp, 6, \src, \dst, + lb_sb \temp, 7, \src, \dst, +.endm + +.macro sp_stack_prologue offset + addi sp, sp, -8 + sub sp, sp, \offset +.endm + +.macro sp_stack_epilogue offset + add sp, sp, \offset + addi sp, sp, 8 +.endm diff --git a/tools/testing/selftests/riscv/misaligned/fpu.S b/tools/testing/selftests/riscv/misaligned/fpu.S new file mode 100644 index 000000000000..a7ad4430a424 --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/fpu.S @@ -0,0 +1,180 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2025 Rivos Inc. 
+ * + * Authors: + * Clément Léger <cleger(a)rivosinc.com> + */ + +#include "common.S" + +#define CASE_ALIGN 4 + +.macro fpu_load_inst fpreg, inst, precision, load_reg +.align CASE_ALIGN + \inst \fpreg, 0(\load_reg) + fmv.\precision fa0, \fpreg + j 2f +.endm + +#define flw(__fpreg) fpu_load_inst __fpreg, flw, s, a4 +#define fld(__fpreg) fpu_load_inst __fpreg, fld, d, a4 +#define c_flw(__fpreg) fpu_load_inst __fpreg, c.flw, s, a4 +#define c_fld(__fpreg) fpu_load_inst __fpreg, c.fld, d, a4 +#define c_fldsp(__fpreg) fpu_load_inst __fpreg, c.fldsp, d, sp + +.macro fpu_store_inst fpreg, inst, precision, store_reg +.align CASE_ALIGN + fmv.\precision \fpreg, fa0 + \inst \fpreg, 0(\store_reg) + j 2f +.endm + +#define fsw(__fpreg) fpu_store_inst __fpreg, fsw, s, a4 +#define fsd(__fpreg) fpu_store_inst __fpreg, fsd, d, a4 +#define c_fsw(__fpreg) fpu_store_inst __fpreg, c.fsw, s, a4 +#define c_fsd(__fpreg) fpu_store_inst __fpreg, c.fsd, d, a4 +#define c_fsdsp(__fpreg) fpu_store_inst __fpreg, c.fsdsp, d, sp + +.macro fp_test_prologue + move a4, a1 + /* + * Compute jump offset to store the correct FP register since we don't + * have indirect FP register access (or at least we don't use this + * extension so that works on all archs) + */ + sll t0, a0, CASE_ALIGN + la t2, 1f + add t0, t0, t2 + jr t0 +.align CASE_ALIGN +1: +.endm + +.macro fp_test_prologue_compressed + /* FP registers for compressed instructions starts from 8 to 16 */ + addi a0, a0, -8 + fp_test_prologue +.endm + +#define fp_test_body_compressed(__inst_func) \ + __inst_func(f8); \ + __inst_func(f9); \ + __inst_func(f10); \ + __inst_func(f11); \ + __inst_func(f12); \ + __inst_func(f13); \ + __inst_func(f14); \ + __inst_func(f15); \ +2: + +#define fp_test_body(__inst_func) \ + __inst_func(f0); \ + __inst_func(f1); \ + __inst_func(f2); \ + __inst_func(f3); \ + __inst_func(f4); \ + __inst_func(f5); \ + __inst_func(f6); \ + __inst_func(f7); \ + __inst_func(f8); \ + __inst_func(f9); \ + __inst_func(f10); \ + 
__inst_func(f11); \ + __inst_func(f12); \ + __inst_func(f13); \ + __inst_func(f14); \ + __inst_func(f15); \ + __inst_func(f16); \ + __inst_func(f17); \ + __inst_func(f18); \ + __inst_func(f19); \ + __inst_func(f20); \ + __inst_func(f21); \ + __inst_func(f22); \ + __inst_func(f23); \ + __inst_func(f24); \ + __inst_func(f25); \ + __inst_func(f26); \ + __inst_func(f27); \ + __inst_func(f28); \ + __inst_func(f29); \ + __inst_func(f30); \ + __inst_func(f31); \ +2: +.text + +#define __gen_test_inst(__inst, __suffix) \ +.global test_ ## __inst; \ +test_ ## __inst:; \ + fp_test_prologue ## __suffix; \ + fp_test_body ## __suffix(__inst); \ + ret + +#define gen_test_inst_compressed(__inst) \ + .option arch,+c; \ + __gen_test_inst(c_ ## __inst, _compressed) + +#define gen_test_inst(__inst) \ + .balign 16; \ + .option push; \ + .option arch,-c; \ + __gen_test_inst(__inst, ); \ + .option pop + +.macro fp_test_prologue_load_compressed_sp + copy_long_to t0, a1, sp +.endm + +.macro fp_test_epilogue_load_compressed_sp +.endm + +.macro fp_test_prologue_store_compressed_sp +.endm + +.macro fp_test_epilogue_store_compressed_sp + copy_long_to t0, sp, a1 +.endm + +#define gen_inst_compressed_sp(__inst, __type) \ + .global test_c_ ## __inst ## sp; \ + test_c_ ## __inst ## sp:; \ + sp_stack_prologue a2; \ + fp_test_prologue_## __type ## _compressed_sp; \ + fp_test_prologue_compressed; \ + fp_test_body_compressed(c_ ## __inst ## sp); \ + fp_test_epilogue_## __type ## _compressed_sp; \ + sp_stack_epilogue a2; \ + ret + +#define gen_test_load_compressed_sp(__inst) gen_inst_compressed_sp(__inst, load) +#define gen_test_store_compressed_sp(__inst) gen_inst_compressed_sp(__inst, store) + +/* + * float_fsw_reg - Set a FP register from a register containing the value + * a0 = FP register index to be set + * a1 = addr where to store register value + * a2 = address offset + * a3 = value to be store + */ +gen_test_inst(fsw) + +/* + * float_flw_reg - Get a FP register value and return it + * a0 = FP 
register index to be retrieved + * a1 = addr to load register from + * a2 = address offset + */ +gen_test_inst(flw) + +gen_test_inst(fsd) +#ifdef __riscv_compressed +gen_test_inst_compressed(fsd) +gen_test_store_compressed_sp(fsd) +#endif + +gen_test_inst(fld) +#ifdef __riscv_compressed +gen_test_inst_compressed(fld) +gen_test_load_compressed_sp(fld) +#endif diff --git a/tools/testing/selftests/riscv/misaligned/gp.S b/tools/testing/selftests/riscv/misaligned/gp.S new file mode 100644 index 000000000000..5abec5ccc828 --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/gp.S @@ -0,0 +1,113 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (c) 2025 Rivos Inc. + * + * Authors: + * Clément Léger <cleger(a)rivosinc.com> + */ + +#include "common.S" + +.text + +.macro __gen_test_inst inst, src_reg + \inst a2, 0(\src_reg) + move a0, a2 +.endm + +.macro gen_func_header func_name, rvc + .option arch,\rvc + .global test_\func_name + test_\func_name: +.endm + +.macro gen_test_inst inst + .option push + gen_func_header \inst, -c + __gen_test_inst \inst, a0 + .option pop + ret +.endm + +.macro gen_test_lwu + .option push + gen_func_header lwu, -c + __gen_test_inst lwu, a0 + .option pop + # RISC-V ABI states that C expect a sign extended 32 bits integer + sext.w a0, a0 + ret +.endm + +.macro __gen_test_inst_c name, src_reg + .option push + gen_func_header c_\name, +c + __gen_test_inst c.\name, \src_reg + .option pop + ret +.endm + +.macro gen_test_inst_c name + __gen_test_inst_c \name, a0 +.endm + + +.macro gen_test_inst_c_ldsp + .option push + gen_func_header c_ldsp, +c + sp_stack_prologue a1 + copy_long_to t0, a0, sp + c.ldsp a0, 0(sp) + sp_stack_epilogue a1 + .option pop + ret +.endm + +.macro lb_sp_sb_a0 reg, offset + lb_sb \reg, \offset, sp, a0 +.endm + +.macro gen_test_inst_c_sdsp + .option push + gen_func_header c_sdsp, +c + /* Misalign stack pointer */ + sp_stack_prologue a1 + /* Misalign access */ + c.sdsp a2, 0(sp) + copy_long_to t0, sp, a0 + 
sp_stack_epilogue a1 + .option pop + ret +.endm + + + /* + * a0 = addr to load from + * a1 = address offset + * a2 = value to be loaded + */ +gen_test_inst lh +gen_test_inst lhu +gen_test_inst lw +gen_test_lwu +gen_test_inst ld +#ifdef __riscv_compressed +gen_test_inst_c lw +gen_test_inst_c ld +gen_test_inst_c_ldsp +#endif + +/* + * a0 = addr where to store value + * a1 = address offset + * a2 = value to be stored + */ +gen_test_inst sh +gen_test_inst sw +gen_test_inst sd +#ifdef __riscv_compressed +gen_test_inst_c sw +gen_test_inst_c sd +gen_test_inst_c_sdsp +#endif + diff --git a/tools/testing/selftests/riscv/misaligned/misaligned.c b/tools/testing/selftests/riscv/misaligned/misaligned.c new file mode 100644 index 000000000000..57ddcbdc947c --- /dev/null +++ b/tools/testing/selftests/riscv/misaligned/misaligned.c @@ -0,0 +1,288 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2025 Rivos Inc. + * + * Authors: + * Clément Léger <cleger(a)rivosinc.com> + */ +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <linux/ptrace.h> +#include "../../kselftest_harness.h" + +#include <stdlib.h> +#include <stdio.h> +#include <stdint.h> +#include <float.h> +#include <errno.h> +#include <math.h> +#include <string.h> +#include <signal.h> +#include <stdbool.h> +#include <unistd.h> +#include <inttypes.h> +#include <ucontext.h> + +#include <sys/prctl.h> + +#define stringify(s) __stringify(s) +#define __stringify(s) #s + +#define U8_MAX ((u8)~0U) +#define S8_MAX ((s8)(U8_MAX >> 1)) +#define S8_MIN ((s8)(-S8_MAX - 1)) +#define U16_MAX ((u16)~0U) +#define S16_MAX ((s16)(U16_MAX >> 1)) +#define S16_MIN ((s16)(-S16_MAX - 1)) +#define U32_MAX ((u32)~0U) +#define U32_MIN ((u32)0) +#define S32_MAX ((s32)(U32_MAX >> 1)) +#define S32_MIN ((s32)(-S32_MAX - 1)) +#define U64_MAX ((u64)~0ULL) +#define S64_MAX ((s64)(U64_MAX >> 1)) +#define S64_MIN ((s64)(-S64_MAX - 1)) + +#define int16_TEST_VALUES {S16_MIN, S16_MIN/2, -1, 1, S16_MAX/2, S16_MAX} +#define 
int32_TEST_VALUES {S32_MIN, S32_MIN/2, -1, 1, S32_MAX/2, S32_MAX} +#define int64_TEST_VALUES {S64_MIN, S64_MIN/2, -1, 1, S64_MAX/2, S64_MAX} +#define uint16_TEST_VALUES {0, U16_MAX/2, U16_MAX} +#define uint32_TEST_VALUES {0, U32_MAX/2, U32_MAX} + +#define float_TEST_VALUES {FLT_MIN, FLT_MIN/2, FLT_MAX/2, FLT_MAX} +#define double_TEST_VALUES {DBL_MIN, DBL_MIN/2, DBL_MAX/2, DBL_MAX} + +static bool float_equal(float a, float b) +{ + float scaled_epsilon; + float difference = fabsf(a - b); + + // Scale to the largest value. + a = fabsf(a); + b = fabsf(b); + if (a > b) + scaled_epsilon = FLT_EPSILON * a; + else + scaled_epsilon = FLT_EPSILON * b; + + return difference <= scaled_epsilon; +} + +static bool double_equal(double a, double b) +{ + double scaled_epsilon; + double difference = fabsl(a - b); + + // Scale to the largest value. + a = fabs(a); + b = fabs(b); + if (a > b) + scaled_epsilon = DBL_EPSILON * a; + else + scaled_epsilon = DBL_EPSILON * b; + + return difference <= scaled_epsilon; +} + +#define fpu_load_proto(__inst, __type) \ +extern __type test_ ## __inst(unsigned long fp_reg, void *addr, unsigned long offset, __type value) + +fpu_load_proto(flw, float); +fpu_load_proto(fld, double); +fpu_load_proto(c_flw, float); +fpu_load_proto(c_fld, double); +fpu_load_proto(c_fldsp, double); + +#define fpu_store_proto(__inst, __type) \ +extern void test_ ## __inst(unsigned long fp_reg, void *addr, unsigned long offset, __type value) + +fpu_store_proto(fsw, float); +fpu_store_proto(fsd, double); +fpu_store_proto(c_fsw, float); +fpu_store_proto(c_fsd, double); +fpu_store_proto(c_fsdsp, double); + +#define gp_load_proto(__inst, __type) \ +extern __type test_ ## __inst(void *addr, unsigned long offset, __type value) + +gp_load_proto(lh, int16_t); +gp_load_proto(lhu, uint16_t); +gp_load_proto(lw, int32_t); +gp_load_proto(lwu, uint32_t); +gp_load_proto(ld, int64_t); +gp_load_proto(c_lw, int32_t); +gp_load_proto(c_ld, int64_t); +gp_load_proto(c_ldsp, int64_t); + +#define 
gp_store_proto(__inst, __type) \ +extern void test_ ## __inst(void *addr, unsigned long offset, __type value) + +gp_store_proto(sh, int16_t); +gp_store_proto(sw, int32_t); +gp_store_proto(sd, int64_t); +gp_store_proto(c_sw, int32_t); +gp_store_proto(c_sd, int64_t); +gp_store_proto(c_sdsp, int64_t); + +#define TEST_GP_LOAD(__inst, __type_size, __type) \ +TEST(gp_load_ ## __inst) \ +{ \ + int offset, ret, val_i; \ + uint8_t buf[16] __attribute__((aligned(16))); \ + __type ## __type_size ## _t test_val[] = __type ## __type_size ## _TEST_VALUES; \ + \ + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT); \ + ASSERT_EQ(ret, 0); \ + \ + for (offset = 1; offset < (__type_size) / 8; offset++) { \ + for (val_i = 0; val_i < ARRAY_SIZE(test_val); val_i++) { \ + __type ## __type_size ## _t ref_val = test_val[val_i]; \ + __type ## __type_size ## _t *ptr = \ + (__type ## __type_size ## _t *)(buf + offset); \ + memcpy(ptr, &ref_val, sizeof(ref_val)); \ + __type ## __type_size ## _t val = \ + test_ ## __inst(ptr, offset, ref_val); \ + EXPECT_EQ(ref_val, val); \ + } \ + } \ +} + +TEST_GP_LOAD(lh, 16, int) +TEST_GP_LOAD(lhu, 16, uint) +TEST_GP_LOAD(lw, 32, int) +TEST_GP_LOAD(lwu, 32, uint) +TEST_GP_LOAD(ld, 64, int) +#ifdef __riscv_compressed +TEST_GP_LOAD(c_lw, 32, int) +TEST_GP_LOAD(c_ld, 64, int) +TEST_GP_LOAD(c_ldsp, 64, int) +#endif + +#define TEST_GP_STORE(__inst, __type_size, __type) \ +TEST(gp_store_ ## __inst) \ +{ \ + int offset, ret, val_i; \ + uint8_t buf[16] __attribute__((aligned(16))); \ + __type ## __type_size ## _t test_val[] = \ + __type ## __type_size ## _TEST_VALUES; \ + \ + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT); \ + ASSERT_EQ(ret, 0); \ + \ + for (val_i = 0; val_i < ARRAY_SIZE(test_val); val_i++) { \ + __type ## __type_size ## _t ref_val = test_val[val_i]; \ + for (offset = 1; offset < (__type_size) / 8; offset++) { \ + __type ## __type_size ## _t val = ref_val; \ + __type ## __type_size ## _t *ptr = \ + (__type ## __type_size ## _t *)(buf + offset); \ + 
memset(ptr, 0, sizeof(val)); \ + test_ ## __inst(ptr, offset, val); \ + memcpy(&val, ptr, sizeof(val)); \ + EXPECT_EQ(ref_val, val); \ + } \ + } \ +} + +TEST_GP_STORE(sh, 16, int) +TEST_GP_STORE(sw, 32, int) +TEST_GP_STORE(sd, 64, int) +#ifdef __riscv_compressed +TEST_GP_STORE(c_sw, 32, int) +TEST_GP_STORE(c_sd, 64, int) +TEST_GP_STORE(c_sdsp, 64, int) +#endif + +#define __TEST_FPU_LOAD(__type, __inst, __reg_start, __reg_end) \ +TEST(fpu_load_ ## __inst) \ +{ \ + int ret, offset, fp_reg, val_i; \ + uint8_t buf[16] __attribute__((aligned(16))); \ + __type test_val[] = __type ## _TEST_VALUES; \ + \ + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT); \ + ASSERT_EQ(ret, 0); \ + \ + for (fp_reg = __reg_start; fp_reg < __reg_end; fp_reg++) { \ + for (offset = 1; offset < 4; offset++) { \ + for (val_i = 0; val_i < ARRAY_SIZE(test_val); val_i++) { \ + __type val, ref_val = test_val[val_i]; \ + void *load_addr = (buf + offset); \ + \ + memcpy(load_addr, &ref_val, sizeof(ref_val)); \ + val = test_ ## __inst(fp_reg, load_addr, offset, ref_val); \ + EXPECT_TRUE(__type ##_equal(val, ref_val)); \ + } \ + } \ + } \ +} + +#define TEST_FPU_LOAD(__type, __inst) \ + __TEST_FPU_LOAD(__type, __inst, 0, 32) +#define TEST_FPU_LOAD_COMPRESSED(__type, __inst) \ + __TEST_FPU_LOAD(__type, __inst, 8, 16) + +TEST_FPU_LOAD(float, flw) +TEST_FPU_LOAD(double, fld) +#ifdef __riscv_compressed +TEST_FPU_LOAD_COMPRESSED(double, c_fld) +TEST_FPU_LOAD_COMPRESSED(double, c_fldsp) +#endif + +#define __TEST_FPU_STORE(__type, __inst, __reg_start, __reg_end) \ +TEST(fpu_store_ ## __inst) \ +{ \ + int ret, offset, fp_reg, val_i; \ + uint8_t buf[16] __attribute__((aligned(16))); \ + __type test_val[] = __type ## _TEST_VALUES; \ + \ + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_NOPRINT); \ + ASSERT_EQ(ret, 0); \ + \ + for (fp_reg = __reg_start; fp_reg < __reg_end; fp_reg++) { \ + for (offset = 1; offset < 4; offset++) { \ + for (val_i = 0; val_i < ARRAY_SIZE(test_val); val_i++) { \ + __type val, ref_val = 
test_val[val_i]; \ + \ + void *store_addr = (buf + offset); \ + \ + test_ ## __inst(fp_reg, store_addr, offset, ref_val); \ + memcpy(&val, store_addr, sizeof(val)); \ + EXPECT_TRUE(__type ## _equal(val, ref_val)); \ + } \ + } \ + } \ +} +#define TEST_FPU_STORE(__type, __inst) \ + __TEST_FPU_STORE(__type, __inst, 0, 32) +#define TEST_FPU_STORE_COMPRESSED(__type, __inst) \ + __TEST_FPU_STORE(__type, __inst, 8, 16) + +TEST_FPU_STORE(float, fsw) +TEST_FPU_STORE(double, fsd) +#ifdef __riscv_compressed +TEST_FPU_STORE_COMPRESSED(double, c_fsd) +TEST_FPU_STORE_COMPRESSED(double, c_fsdsp) +#endif + +TEST_SIGNAL(gen_sigbus, SIGBUS) +{ + uint32_t val = 0xDEADBEEF; + uint8_t buf[16] __attribute__((aligned(16))); + int ret; + + ret = prctl(PR_SET_UNALIGN, PR_UNALIGN_SIGBUS); + ASSERT_EQ(ret, 0); + + asm volatile("sw %0, 1(%1)" : : "r"(val), "r"(buf) : "memory"); +} + +int main(int argc, char **argv) +{ + int ret, val; + + ret = prctl(PR_GET_UNALIGN, &val); + if (ret == -1 && errno == EINVAL) + ksft_exit_skip("SKIP GET_UNALIGN_CTL not supported\n"); + + exit(test_harness_run(argc, argv)); +} -- 2.43.0
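
The float_equal() and double_equal() helpers in the patch above compare values with a tolerance scaled by the larger magnitude. A minimal sketch of that idea follows; `magnitude()` and `nearly_equal()` are hypothetical names chosen for illustration and are not part of the patch.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical helper: absolute value without depending on libm. */
static double magnitude(double x)
{
	return x < 0 ? -x : x;
}

/*
 * Relative comparison: the tolerance is DBL_EPSILON scaled by the larger
 * of the two magnitudes, so the same test is usable for values near
 * DBL_MIN and for values near DBL_MAX.
 */
static bool nearly_equal(double a, double b)
{
	double difference = magnitude(a - b);
	double largest = magnitude(a) > magnitude(b) ? magnitude(a) : magnitude(b);

	return difference <= DBL_EPSILON * largest;
}
```

A fixed absolute tolerance would either reject equal values close to DBL_MAX or accept clearly different values close to DBL_MIN, which is presumably why the test values in the patch call for a scaled one.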

5 months


[PATCH] selftests/drivers/net: replace typeof() with __auto_type

by Pranav Tyagi

Replace typeof() with __auto_type in iou-zcrx.c. __auto_type was introduced in GCC 4.9 and reduces the compile time for all compilers. No functional changes intended.

Signed-off-by: Pranav Tyagi <pranav.tyagi03(a)gmail.com>
---
 tools/testing/selftests/drivers/net/hw/iou-zcrx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 62456df947bc..85551594bf0f 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -42,8 +42,8 @@ static long page_size;
 #define SEND_SIZE (512 * 4096)
 #define min(a, b) \
 	({ \
-		typeof(a) _a = (a); \
-		typeof(b) _b = (b); \
+		__auto_type _a = (a); \
+		__auto_type _b = (b); \
 		_a < _b ? _a : _b; \
 	})
 #define min_t(t, a, b) \
--
2.49.0
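
For readers unfamiliar with the extension, here is a minimal sketch of the min() macro as it looks after this change. It assumes a compiler that supports GNU C statement expressions and __auto_type; `next_id()` and `calls` are hypothetical names added only to show that each argument is evaluated once.

```c
/*
 * GNU C statement expression. __auto_type deduces the type of each
 * temporary from its initializer, so (a) and (b) are evaluated exactly
 * once, just as with the typeof() version.
 */
#define min(a, b)			\
	({				\
		__auto_type _a = (a);	\
		__auto_type _b = (b);	\
		_a < _b ? _a : _b;	\
	})

/* Hypothetical helper with a side effect, for illustration only. */
static int calls;

static int next_id(void)
{
	return ++calls;
}
```

With these definitions, min(next_id(), 10) calls next_id() a single time, whereas a plain ((a) < (b) ? (a) : (b)) macro would call it twice.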

5 months


[PATCH] selftests/bpf/progs: use __auto_type in swap() macro

by Pranav Tyagi

Replace typeof() with __auto_type in xdp_synproxy_kern.c. __auto_type was introduced in GCC 4.9 and reduces the compile time for all compilers. No functional changes intended.

Signed-off-by: Pranav Tyagi <pranav.tyagi03(a)gmail.com>
---
 tools/testing/selftests/bpf/progs/xdp_synproxy_kern.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/xdp_synproxy_kern.c b/tools/testing/selftests/bpf/progs/xdp_synproxy_kern.c
index 62b8e29ced9f..b08738f9a0e6 100644
--- a/tools/testing/selftests/bpf/progs/xdp_synproxy_kern.c
+++ b/tools/testing/selftests/bpf/progs/xdp_synproxy_kern.c
@@ -58,7 +58,7 @@
 #define MAX_PACKET_OFF 0xffff

 #define swap(a, b) \
-	do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)
+	do { __auto_type __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

 #define __get_unaligned_t(type, ptr) ({ \
 	const struct { type x; } __attribute__((__packed__)) *__pptr = (typeof(__pptr))(ptr); \
--
2.49.0

5 months


[PATCH] selftests/mm: fix FORCE_READ to read input value correctly.

by Zi Yan

FORCE_READ() converts input value x to its pointer type then reads from address x. This is wrong. If x is a non-pointer, it would be caught easily. But all FORCE_READ() callers are trying to read from a pointer and FORCE_READ() basically reads a pointer to a pointer instead of the original typed pointer. Almost no access violation was found, except the one from split_huge_page_test.

Fix it by implementing a simplified READ_ONCE() instead.

Fixes: 3f6bfd4789a0 ("selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));"")
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
---
FORCE_READ() comes from commit 876320d71f51 ("selftests/mm: add self tests for guard page feature"). I will send a separate patch to the stable tree.

 tools/testing/selftests/mm/cow.c                  | 4 ++--
 tools/testing/selftests/mm/guard-regions.c        | 2 +-
 tools/testing/selftests/mm/hugetlb-madvise.c      | 4 +++-
 tools/testing/selftests/mm/migration.c            | 2 +-
 tools/testing/selftests/mm/pagemap_ioctl.c        | 2 +-
 tools/testing/selftests/mm/split_huge_page_test.c | 7 +++++--
 tools/testing/selftests/mm/vm_util.h              | 2 +-
 7 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index d30625c18259..c744c603d688 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -1554,8 +1554,8 @@ static void run_with_zeropage(non_anon_test_fn fn, const char *desc)
 	}

 	/* Read from the page to populate the shared zeropage. */
-	FORCE_READ(mem);
-	FORCE_READ(smem);
+	FORCE_READ(*mem);
+	FORCE_READ(*smem);

 	fn(mem, smem, pagesize);
 munmap:
diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index b0d42eb04e3a..8dd81c0a4a5a 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -145,7 +145,7 @@ static bool try_access_buf(char *ptr, bool write)
 		if (write)
 			*ptr = 'x';
 		else
-			FORCE_READ(ptr);
+			FORCE_READ(*ptr);
 	}

 	signal_jump_set = false;
diff --git a/tools/testing/selftests/mm/hugetlb-madvise.c b/tools/testing/selftests/mm/hugetlb-madvise.c
index 1afe14b9dc0c..c5940c0595be 100644
--- a/tools/testing/selftests/mm/hugetlb-madvise.c
+++ b/tools/testing/selftests/mm/hugetlb-madvise.c
@@ -50,8 +50,10 @@ void read_fault_pages(void *addr, unsigned long nr_pages)
 	unsigned long i;

 	for (i = 0; i < nr_pages; i++) {
+		unsigned long *addr2 =
+			((unsigned long *)(addr + (i * huge_page_size)));
 		/* Prevent the compiler from optimizing out the entire loop: */
-		FORCE_READ(((unsigned long *)(addr + (i * huge_page_size))));
+		FORCE_READ(*addr2);
 	}
 }

diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c
index c5a73617796a..ea945eebec2f 100644
--- a/tools/testing/selftests/mm/migration.c
+++ b/tools/testing/selftests/mm/migration.c
@@ -110,7 +110,7 @@ void *access_mem(void *ptr)
 		 * the memory access actually happens and prevents the compiler
 		 * from optimizing away this entire loop.
 		 */
-		FORCE_READ((uint64_t *)ptr);
+		FORCE_READ(*(uint64_t *)ptr);
 	}

 	return NULL;
diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c
index 0d4209eef0c3..e6face7c0166 100644
--- a/tools/testing/selftests/mm/pagemap_ioctl.c
+++ b/tools/testing/selftests/mm/pagemap_ioctl.c
@@ -1525,7 +1525,7 @@ void zeropfn_tests(void)

 	ret = madvise(mem, hpage_size, MADV_HUGEPAGE);
 	if (!ret) {
-		FORCE_READ(mem);
+		FORCE_READ(*mem);

 		ret = pagemap_ioctl(mem, hpage_size, &vec, 1, 0,
 				    0, PAGE_IS_PFNZERO, 0, 0, PAGE_IS_PFNZERO);
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index 718daceb5282..3c761228e451 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -440,8 +440,11 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd,
 	}
 	madvise(*addr, fd_size, MADV_HUGEPAGE);

-	for (size_t i = 0; i < fd_size; i++)
-		FORCE_READ((*addr + i));
+	for (size_t i = 0; i < fd_size; i++) {
+		char *addr2 = *addr + i;
+
+		FORCE_READ(*addr2);
+	}

 	if (!check_huge_file(*addr, fd_size / pmd_pagesize, pmd_pagesize)) {
 		ksft_print_msg("No large pagecache folio generated, please provide a filesystem supporting large folio\n");
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index c20298ae98ea..b55d1809debc 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -23,7 +23,7 @@
  * anything with it in order to trigger a read page fault. We therefore must use
  * volatile to stop the compiler from optimising this away.
  */
-#define FORCE_READ(x) (*(volatile typeof(x) *)x)
+#define FORCE_READ(x) (*(const volatile typeof(x) *)&(x))

 extern unsigned int __page_size;
 extern unsigned int __page_shift;
--
2.47.2

5 months


[PATCH] selftests/filesystems: replace typeof() with __auto_type

by Pranav Tyagi

Replace typeof() with __auto_type in utils.c. __auto_type was introduced in GCC 4.9 and reduces the compile time for all compilers. No functional changes intended.

Signed-off-by: Pranav Tyagi <pranav.tyagi03(a)gmail.com>
---
 tools/testing/selftests/filesystems/utils.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/filesystems/utils.c b/tools/testing/selftests/filesystems/utils.c
index c43a69dffd83..95f202e2bfb7 100644
--- a/tools/testing/selftests/filesystems/utils.c
+++ b/tools/testing/selftests/filesystems/utils.c
@@ -34,7 +34,7 @@

 #define syserror_set(__ret__, format, ...) \
 ({ \
-	typeof(__ret__) __internal_ret__ = (__ret__); \
+	__auto_type __internal_ret__ = (__ret__); \
 	errno = labs(__ret__); \
 	fprintf(stderr, "%m - " format "\n", ##__VA_ARGS__); \
 	__internal_ret__; \
--
2.49.0

5 months


[PATCH v6 0/2] libbpf: fix USDT SIB argument handling causing unrecognized register error

by Jiawei Zhao

When using GCC on x86-64 to compile a USDT program with -O1 or higher optimization, the compiler generates SIB addressing mode for global arrays and PC-relative addressing mode for global variables, e.g. "1@-96(%rbp,%rax,8)" and "-1@4+t1(%rip)".

The current USDT implementation in libbpf cannot parse these two formats, causing `bpf_program__attach_usdt()` to fail with -ENOENT (unrecognized register).

This patch series adds support for SIB addressing mode in USDT probes. The main changes include:
- add correct handling logic for SIB-addressed arguments in `parse_usdt_arg`.
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt argument spec.
- change the global variable t1 to a local variable, to avoid the compiler generating PC-relative addressing mode for it.

Testing shows that the SIB probe correctly generates the 8@(%rcx,%rax,8) argument spec and passes all validation checks.

The modification history of this patch series:

Change since v1:
- refactor the code to make it more readable
- modify the commit message to explain why and how

Change since v2:
- fix the `scale` uninitialized error

Change since v3:
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt and pass all test cases.

Change since v4:
- split the patch into two parts, one for the fix and the other for the test

Change since v5:
- Only enable optimization for x86 architecture to generate SIB addressing usdt argument spec.

Do we need to add support for PC-relative USDT argument spec handling in libbpf? I am interested in this question, but currently have no ideas. Getting offsets based on symbols requires a dependency on the symbol table. However, once the binary file is stripped, the symtab is also removed, which will cause this approach to fail. Does anyone have any thoughts on this?

Jiawei Zhao (2):
  libbpf: fix USDT SIB argument handling causing unrecognized register error
  selftests/bpf: Force -O2 for USDT selftests to cover SIB handling logic

 tools/lib/bpf/usdt.bpf.h                      | 33 +++++++++++++-
 tools/lib/bpf/usdt.c                          | 43 ++++++++++++++++---
 tools/testing/selftests/bpf/Makefile          |  8 ++++
 tools/testing/selftests/bpf/prog_tests/usdt.c | 18 +++++---
 4 files changed, 89 insertions(+), 13 deletions(-)

--
2.43.0
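
To make the SIB argument format concrete, here is a rough sketch of how a spec such as "1@-96(%rbp,%rax,8)" or "8@(%rcx,%rax,8)" could be split into its parts with sscanf(). This is not the parsing code from the series: `struct sib_arg` and `parse_sib_arg()` are hypothetical names, and register names are kept as strings rather than mapped to pt_regs offsets.

```c
#include <stdio.h>

/* Hypothetical holder for a parsed SIB argument spec; not libbpf's struct. */
struct sib_arg {
	int size;	/* operand size in bytes; negative means signed */
	long disp;	/* displacement, 0 when omitted */
	char base[8];	/* base register name, e.g. "rbp" */
	char index[8];	/* index register name, e.g. "rax" */
	int scale;	/* scale factor applied to the index register */
};

/* Returns 0 on success, -1 if the spec is not in SIB form. */
static int parse_sib_arg(const char *spec, struct sib_arg *out)
{
	/* Form with displacement: "1@-96(%rbp,%rax,8)" */
	if (sscanf(spec, " %d @ %ld ( %%%7[^,] , %%%7[^,] , %d )",
		   &out->size, &out->disp, out->base, out->index,
		   &out->scale) == 5)
		return 0;

	/* Form without displacement: "8@(%rcx,%rax,8)" */
	out->disp = 0;
	if (sscanf(spec, " %d @ ( %%%7[^,] , %%%7[^,] , %d )",
		   &out->size, out->base, out->index, &out->scale) == 4)
		return 0;

	return -1;
}
```

The address the probe reads from is then base + index * scale + disp. A PC-relative spec such as "-1@4+t1(%rip)" matches neither pattern and is rejected, which mirrors the open question raised in the cover letter.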

5 months


[PATCH] selftests/bpf/progs: replace typeof() with __auto_type

by Pranav Tyagi

Replace typeof() with __auto_type in bpf_dctcp.c. __auto_type was introduced in GCC 4.9 and reduces the compile time for all compilers. No functional changes intended.

Signed-off-by: Pranav Tyagi <pranav.tyagi03(a)gmail.com>
---
 tools/testing/selftests/bpf/progs/bpf_dctcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp.c b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
index 7cd73e75f52a..0bab6cec6bbc 100644
--- a/tools/testing/selftests/bpf/progs/bpf_dctcp.c
+++ b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
@@ -16,8 +16,8 @@
 #define min(a, b) ((a) < (b) ? (a) : (b))
 #define max(a, b) ((a) > (b) ? (a) : (b))
 #define min_not_zero(x, y) ({ \
-	typeof(x) __x = (x); \
-	typeof(y) __y = (y); \
+	__auto_type __x = (x); \
+	__auto_type __y = (y); \
 	__x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); })
 static bool before(__u32 seq1, __u32 seq2)
 {
--
2.49.0

5 months


[PATCH v2 0/2] fscontext: do not consume log entries when returning -EMSGSIZE

by Aleksa Sarai

Userspace generally expects APIs that return -EMSGSIZE to allow for them to adjust their buffer size and retry the operation. However, the fscontext log would previously clear the message even in the -EMSGSIZE case.

Given that it is very cheap for us to check whether the buffer is too small before we remove the message from the ring buffer, let's just do that instead. While we're at it, refactor some of fscontext_read() into a separate helper to make the ring buffer logic a bit easier to read.

Fixes: 007ec26cdc9f ("vfs: Implement logging through fs_context")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v2:
- Refactor message fetching to fetch_message_locked() which returns ERR_PTR() in error cases. [Al Viro]
- v1: <https://lore.kernel.org/r/20250806-fscontext-log-cleanups-v1-0-880597d42a5a…>

---
Aleksa Sarai (2):
      fscontext: do not consume log entries when returning -EMSGSIZE
      selftests/filesystems: add basic fscontext log tests

 fs/fsopen.c                                    |  54 +++++-----
 tools/testing/selftests/filesystems/.gitignore |   1 +
 tools/testing/selftests/filesystems/Makefile   |   2 +-
 tools/testing/selftests/filesystems/fclog.c    | 135 +++++++++++++++++++++++++
 4 files changed, 167 insertions(+), 25 deletions(-)
---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250806-fscontext-log-cleanups-50f0143674ae

Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>
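
The behaviour described above can be sketched with a small userspace model. This is only an illustration of "check the size before consuming": `struct msg_log`, `LOG_SLOTS` and `log_read()` are hypothetical and do not reflect the kernel's fs_context log layout or the code in fs/fsopen.c.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define LOG_SLOTS 8

/* Hypothetical in-memory message log, used only for this sketch. */
struct msg_log {
	const char *msgs[LOG_SLOTS];
	unsigned int head;	/* next slot to write */
	unsigned int tail;	/* next message to read */
};

/*
 * Copy the oldest message into buf. The size check happens before the
 * message is removed, so -EMSGSIZE leaves the log untouched and the
 * caller can retry with a larger buffer.
 */
static ssize_t log_read(struct msg_log *log, char *buf, size_t len)
{
	const char *msg;
	size_t n;

	if (log->head == log->tail)
		return -ENODATA;

	msg = log->msgs[log->tail % LOG_SLOTS];
	n = strlen(msg);
	if (n > len)
		return -EMSGSIZE;

	memcpy(buf, msg, n);
	log->tail++;		/* consume only after a successful copy */
	return n;
}
```

A caller that gets -EMSGSIZE can grow its buffer and call log_read() again, and it will still see the same message.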

5 months


[PATCH 0/2] fscontext: do not consume log entries for -EMSGSIZE case

by Aleksa Sarai

Userspace generally expects APIs that return EMSGSIZE to allow for them to adjust their buffer size and retry the operation. However, the fscontext log would previously clear the message even in the EMSGSIZE case.

Given that it is very cheap for us to check whether the buffer is too small before we remove the message from the ring buffer, let's just do that instead.

Fixes: 007ec26cdc9f ("vfs: Implement logging through fs_context")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Aleksa Sarai (2):
      fscontext: do not consume log entries for -EMSGSIZE case
      selftests/filesystems: add basic fscontext log tests

 fs/fsopen.c                                    |  22 ++--
 tools/testing/selftests/filesystems/.gitignore |   1 +
 tools/testing/selftests/filesystems/Makefile   |   2 +-
 tools/testing/selftests/filesystems/fclog.c    | 137 +++++++++++++++++++++++++
 4 files changed, 153 insertions(+), 9 deletions(-)
---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250806-fscontext-log-cleanups-50f0143674ae

Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>

5 months


[PATCH v4] selftests: filesystems: Add functional test for the abort file in fusectl

by Chen Linxuan

This patch adds a simple functional test for the "abort" file in fusectlfs (/sys/fs/fuse/connections/ID/abort).

A simple fuse daemon is added for testing.

Related discussion can be found in the link below.

Link: https://lore.kernel.org/all/CAOQ4uxjKFXOKQxPpxtS6G_nR0tpw95w0GiO68UcWg_OBhm…
Signed-off-by: Chen Linxuan <chenlinxuan(a)uniontech.com>
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
Reviewed-by: Amir Goldstein <amir73il(a)gmail.com>
Co-developed-by: Miklos Szeredi <miklos(a)szeredi.hu>
Reviewed-by: Miklos Szeredi <miklos(a)szeredi.hu>
---
Changes in v4:
- Apply patch suggested by Miklos Szeredi
- Setting up a userns environment for testing
- Fix an EBUSY on umount/rmdir
- Link to v3: https://lore.kernel.org/all/20250610021007.2800329-2-chenlinxuan@uniontech.…

Changes in v3:
- Apply changes suggested by Amir Goldstein
- Rename the test subdir to filesystems/fuse
- Verify errno when connection is aborted
- Apply changes suggested by Shuah Khan
- Update commit message
- Link to v2: https://lore.kernel.org/all/20250517012350.10317-2-chenlinxuan@uniontech.co…

Changes in v2:
- Apply changes suggested by Amir Goldstein
- Check errno
- Link to v1: https://lore.kernel.org/all/20250515073449.346774-2-chenlinxuan@uniontech.c…
---
 MAINTAINERS                                   |   1 +
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/filesystems/fuse/.gitignore     |   3 +
 .../selftests/filesystems/fuse/Makefile       |  21 +++
 .../selftests/filesystems/fuse/fuse_mnt.c     | 146 ++++++++++++++++++
 .../selftests/filesystems/fuse/fusectl_test.c | 140 +++++++++++++++++
 6 files changed, 312 insertions(+)
 create mode 100644 tools/testing/selftests/filesystems/fuse/.gitignore
 create mode 100644 tools/testing/selftests/filesystems/fuse/Makefile
 create mode 100644 tools/testing/selftests/filesystems/fuse/fuse_mnt.c
 create mode 100644 tools/testing/selftests/filesystems/fuse/fusectl_test.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a92290fffa163..04d90432c1841 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9901,6 +9901,7 @@
T: git git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git F: Documentation/filesystems/fuse* F: fs/fuse/ F: include/uapi/linux/fuse.h +F: tools/testing/selftests/filesystems/fuse/ FUTEX SUBSYSTEM M: Thomas Gleixner <tglx(a)linutronix.de> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 339b31e6a6b59..c37a76a8ca214 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -36,6 +36,7 @@ TARGETS += filesystems/fat TARGETS += filesystems/overlayfs TARGETS += filesystems/statmount TARGETS += filesystems/mount-notify +TARGETS += filesystems/fuse TARGETS += firmware TARGETS += fpu TARGETS += ftrace diff --git a/tools/testing/selftests/filesystems/fuse/.gitignore b/tools/testing/selftests/filesystems/fuse/.gitignore new file mode 100644 index 0000000000000..3e72e742d08e8 --- /dev/null +++ b/tools/testing/selftests/filesystems/fuse/.gitignore @@ -0,0 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only +fuse_mnt +fusectl_test diff --git a/tools/testing/selftests/filesystems/fuse/Makefile b/tools/testing/selftests/filesystems/fuse/Makefile new file mode 100644 index 0000000000000..612aad69a93aa --- /dev/null +++ b/tools/testing/selftests/filesystems/fuse/Makefile @@ -0,0 +1,21 @@ +# SPDX-License-Identifier: GPL-2.0-or-later + +CFLAGS += -Wall -O2 -g $(KHDR_INCLUDES) + +TEST_GEN_PROGS := fusectl_test +TEST_GEN_FILES := fuse_mnt + +include ../../lib.mk + +VAR_CFLAGS := $(shell pkg-config fuse --cflags 2>/dev/null) +ifeq ($(VAR_CFLAGS),) +VAR_CFLAGS := -D_FILE_OFFSET_BITS=64 -I/usr/include/fuse +endif + +VAR_LDLIBS := $(shell pkg-config fuse --libs 2>/dev/null) +ifeq ($(VAR_LDLIBS),) +VAR_LDLIBS := -lfuse -pthread +endif + +$(OUTPUT)/fuse_mnt: CFLAGS += $(VAR_CFLAGS) +$(OUTPUT)/fuse_mnt: LDLIBS += $(VAR_LDLIBS) diff --git a/tools/testing/selftests/filesystems/fuse/fuse_mnt.c b/tools/testing/selftests/filesystems/fuse/fuse_mnt.c new file mode 100644 index 0000000000000..d12b17f30fadc --- /dev/null +++ 
b/tools/testing/selftests/filesystems/fuse/fuse_mnt.c @@ -0,0 +1,146 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * fusectl test file-system + * Creates a simple FUSE filesystem with a single read-write file (/test) + */ + +#define FUSE_USE_VERSION 26 + +#include <fuse.h> +#include <stdio.h> +#include <string.h> +#include <errno.h> +#include <fcntl.h> +#include <stdlib.h> +#include <unistd.h> + +#define MAX(a, b) ((a) > (b) ? (a) : (b)) + +static char *content; +static size_t content_size = 0; +static const char test_path[] = "/test"; + +static int test_getattr(const char *path, struct stat *st) +{ + memset(st, 0, sizeof(*st)); + + if (!strcmp(path, "/")) { + st->st_mode = S_IFDIR | 0755; + st->st_nlink = 2; + return 0; + } + + if (!strcmp(path, test_path)) { + st->st_mode = S_IFREG | 0664; + st->st_nlink = 1; + st->st_size = content_size; + return 0; + } + + return -ENOENT; +} + +static int test_readdir(const char *path, void *buf, fuse_fill_dir_t filler, + off_t offset, struct fuse_file_info *fi) +{ + if (strcmp(path, "/")) + return -ENOENT; + + filler(buf, ".", NULL, 0); + filler(buf, "..", NULL, 0); + filler(buf, test_path + 1, NULL, 0); + + return 0; +} + +static int test_open(const char *path, struct fuse_file_info *fi) +{ + if (strcmp(path, test_path)) + return -ENOENT; + + return 0; +} + +static int test_read(const char *path, char *buf, size_t size, off_t offset, + struct fuse_file_info *fi) +{ + if (strcmp(path, test_path) != 0) + return -ENOENT; + + if (!content || content_size == 0) + return 0; + + if (offset >= content_size) + return 0; + + if (offset + size > content_size) + size = content_size - offset; + + memcpy(buf, content + offset, size); + + return size; +} + +static int test_write(const char *path, const char *buf, size_t size, + off_t offset, struct fuse_file_info *fi) +{ + size_t new_size; + + if (strcmp(path, test_path) != 0) + return -ENOENT; + + if(offset > content_size) + return -EINVAL; + + new_size = MAX(offset + size, content_size); 
+ + if (new_size > content_size) + content = realloc(content, new_size); + + content_size = new_size; + + if (!content) + return -ENOMEM; + + memcpy(content + offset, buf, size); + + return size; +} + +static int test_truncate(const char *path, off_t size) +{ + if (strcmp(path, test_path) != 0) + return -ENOENT; + + if (size == 0) { + free(content); + content = NULL; + content_size = 0; + return 0; + } + + content = realloc(content, size); + + if (!content) + return -ENOMEM; + + if (size > content_size) + memset(content + content_size, 0, size - content_size); + + content_size = size; + return 0; +} + +static struct fuse_operations memfd_ops = { + .getattr = test_getattr, + .readdir = test_readdir, + .open = test_open, + .read = test_read, + .write = test_write, + .truncate = test_truncate, +}; + +int main(int argc, char *argv[]) +{ + return fuse_main(argc, argv, &memfd_ops, NULL); +} diff --git a/tools/testing/selftests/filesystems/fuse/fusectl_test.c b/tools/testing/selftests/filesystems/fuse/fusectl_test.c new file mode 100644 index 0000000000000..8d124d1cacb26 --- /dev/null +++ b/tools/testing/selftests/filesystems/fuse/fusectl_test.c @@ -0,0 +1,140 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +// Copyright (c) 2025 Chen Linxuan <chenlinxuan(a)uniontech.com> + +#define _GNU_SOURCE + +#include <errno.h> +#include <fcntl.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/mount.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <sys/wait.h> +#include <unistd.h> +#include <dirent.h> +#include <sched.h> +#include <linux/limits.h> + +#include "../../kselftest_harness.h" + +#define FUSECTL_MOUNTPOINT "/sys/fs/fuse/connections" +#define FUSE_MOUNTPOINT "/tmp/fuse_mnt_XXXXXX" +#define FUSE_DEVICE "/dev/fuse" +#define FUSECTL_TEST_VALUE "1" + +static void write_file(struct __test_metadata *const _metadata, + const char *path, const char *val) +{ + int fd = open(path, O_WRONLY); + size_t len = strlen(val); + + ASSERT_GE(fd, 0); + 
ASSERT_EQ(write(fd, val, len), len); + ASSERT_EQ(close(fd), 0); +} + +FIXTURE(fusectl){ + char fuse_mountpoint[sizeof(FUSE_MOUNTPOINT)]; + int connection; +}; + +FIXTURE_SETUP(fusectl) +{ + const char *fuse_mnt_prog = "./fuse_mnt"; + int status, pid; + struct stat statbuf; + uid_t uid = getuid(); + gid_t gid = getgid(); + char buf[32]; + + /* Setup userns */ + ASSERT_EQ(unshare(CLONE_NEWNS|CLONE_NEWUSER), 0); + sprintf(buf, "0 %d 1", uid); + write_file(_metadata, "/proc/self/uid_map", buf); + write_file(_metadata, "/proc/self/setgroups", "deny"); + sprintf(buf, "0 %d 1", gid); + write_file(_metadata, "/proc/self/gid_map", buf); + ASSERT_EQ(mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL), 0); + + strcpy(self->fuse_mountpoint, FUSE_MOUNTPOINT); + + if (!mkdtemp(self->fuse_mountpoint)) + SKIP(return, + "Failed to create FUSE mountpoint %s", + strerror(errno)); + + if (access(FUSECTL_MOUNTPOINT, F_OK)) + SKIP(return, + "FUSE control filesystem not mounted"); + + pid = fork(); + if (pid < 0) + SKIP(return, + "Failed to fork FUSE daemon process: %s", + strerror(errno)); + + if (pid == 0) { + execlp(fuse_mnt_prog, fuse_mnt_prog, self->fuse_mountpoint, NULL); + exit(errno); + } + + waitpid(pid, &status, 0); + if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) { + SKIP(return, + "Failed to start FUSE daemon %s", + strerror(WEXITSTATUS(status))); + } + + if (stat(self->fuse_mountpoint, &statbuf)) + SKIP(return, + "Failed to stat FUSE mountpoint %s", + strerror(errno)); + + self->connection = statbuf.st_dev; +} + +FIXTURE_TEARDOWN(fusectl) +{ + umount2(self->fuse_mountpoint, MNT_DETACH); + rmdir(self->fuse_mountpoint); +} + +TEST_F(fusectl, abort) +{ + char path_buf[PATH_MAX]; + int abort_fd, test_fd, ret; + + sprintf(path_buf, "/sys/fs/fuse/connections/%d/abort", self->connection); + + ASSERT_EQ(0, access(path_buf, F_OK)); + + abort_fd = open(path_buf, O_WRONLY); + ASSERT_GE(abort_fd, 0); + + sprintf(path_buf, "%s/test", self->fuse_mountpoint); + + test_fd = open(path_buf, 
O_RDWR); + ASSERT_GE(test_fd, 0); + + ret = read(test_fd, path_buf, sizeof(path_buf)); + ASSERT_EQ(ret, 0); + + ret = write(test_fd, "test", sizeof("test")); + ASSERT_EQ(ret, sizeof("test")); + + ret = lseek(test_fd, 0, SEEK_SET); + ASSERT_GE(ret, 0); + + ret = write(abort_fd, FUSECTL_TEST_VALUE, sizeof(FUSECTL_TEST_VALUE)); + ASSERT_GT(ret, 0); + + close(abort_fd); + + ret = read(test_fd, path_buf, sizeof(path_buf)); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, ENOTCONN); +} + +TEST_HARNESS_MAIN -- 2.43.0

5 months


[PATCH v3] selftests/mm: pass filename as input param to VM_PFNMAP tests

by Sudarsan Mahendran

Enable these tests to be run on other pfnmap'ed memory like NVIDIA's EGM.

Add '--' as a separator to pass in file path. This allows passing of cmd line arguments to kselftest_harness. Use '/dev/mem' as default filename.

Existing test passes:
	pfnmap
	TAP version 13
	1..6
	# Starting 6 tests from 1 test cases.
	# PASSED: 6 / 6 tests passed.
	# Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0

Pass params to kselftest_harness:
	pfnmap -r pfnmap:mremap_fixed
	TAP version 13
	1..1
	# Starting 1 tests from 1 test cases.
	# RUN pfnmap.mremap_fixed ...
	# OK pfnmap.mremap_fixed
	ok 1 pfnmap.mremap_fixed
	# PASSED: 1 / 1 tests passed.
	# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

Pass non-existent file name as input:
	pfnmap -- /dev/blah
	TAP version 13
	1..6
	# Starting 6 tests from 1 test cases.
	# RUN pfnmap.madvise_disallowed ...
	# SKIP Cannot open '/dev/blah'

Pass non pfnmap'ed file as input:
	pfnmap -r pfnmap.madvise_disallowed -- randfile.txt
	TAP version 13
	1..1
	# Starting 1 tests from 1 test cases.
	# RUN pfnmap.madvise_disallowed ...
	# SKIP Invalid file: 'randfile.txt'.
Not pfnmap'ed Signed-off-by: Sudarsan Mahendran <sudarsanm(a)google.com> --- v2 -> v3: * Add check_vmflag_pfnmap func * Re-use existing check_vmflag_io func * Verify pfnmap using mmap addr * Rename phys_addr to offset v1 -> v2: * Add verify_pfnmap func to sanity check the input param * mmap with zero offset if filename != '/dev/mem' --- tools/testing/selftests/mm/pfnmap.c | 48 ++++++++++++++++++++-------- tools/testing/selftests/mm/vm_util.c | 14 ++++++-- tools/testing/selftests/mm/vm_util.h | 1 + 3 files changed, 47 insertions(+), 16 deletions(-) diff --git a/tools/testing/selftests/mm/pfnmap.c b/tools/testing/selftests/mm/pfnmap.c index 866ac023baf5..88659f0a90ea 100644 --- a/tools/testing/selftests/mm/pfnmap.c +++ b/tools/testing/selftests/mm/pfnmap.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0-only /* - * Basic VM_PFNMAP tests relying on mmap() of '/dev/mem' + * Basic VM_PFNMAP tests relying on mmap() of input file provided. + * Use '/dev/mem' as default. * * Copyright 2025, Red Hat, Inc. * @@ -25,6 +26,7 @@ #include "vm_util.h" static sigjmp_buf sigjmp_buf_env; +static char *file = "/dev/mem"; static void signal_handler(int sig) { @@ -51,7 +53,7 @@ static int test_read_access(char *addr, size_t size, size_t pagesize) return ret; } -static int find_ram_target(off_t *phys_addr, +static int find_ram_target(off_t *offset, unsigned long long pagesize) { unsigned long long start, end; @@ -91,7 +93,7 @@ static int find_ram_target(off_t *phys_addr, /* We need two pages. */ if (end > start + 2 * pagesize) { fclose(file); - *phys_addr = start; + *offset = start; return 0; } } @@ -100,7 +102,7 @@ static int find_ram_target(off_t *phys_addr, FIXTURE(pfnmap) { - off_t phys_addr; + off_t offset; size_t pagesize; int dev_mem_fd; char *addr1; @@ -113,23 +115,31 @@ FIXTURE_SETUP(pfnmap) { self->pagesize = getpagesize(); - /* We'll require two physical pages throughout our tests ... 
*/ - if (find_ram_target(&self->phys_addr, self->pagesize)) - SKIP(return, "Cannot find ram target in '/proc/iomem'\n"); + if (strncmp(file, "/dev/mem", strlen("/dev/mem")) == 0) { + /* We'll require two physical pages throughout our tests ... */ + if (find_ram_target(&self->offset, self->pagesize)) + SKIP(return, + "Cannot find ram target in '/proc/iomem'\n"); + } else { + self->offset = 0; + } - self->dev_mem_fd = open("/dev/mem", O_RDONLY); + self->dev_mem_fd = open(file, O_RDONLY); if (self->dev_mem_fd < 0) - SKIP(return, "Cannot open '/dev/mem'\n"); + SKIP(return, "Cannot open '%s'\n", file); self->size1 = self->pagesize * 2; self->addr1 = mmap(NULL, self->size1, PROT_READ, MAP_SHARED, - self->dev_mem_fd, self->phys_addr); + self->dev_mem_fd, self->offset); if (self->addr1 == MAP_FAILED) - SKIP(return, "Cannot mmap '/dev/mem'\n"); + SKIP(return, "Cannot mmap '%s'\n", file); + + if (!check_vmflag_pfnmap(self->addr1)) + SKIP(return, "Invalid file: '%s'. Not pfnmap'ed\n", file); /* ... and want to be able to read from them. 
*/ if (test_read_access(self->addr1, self->size1, self->pagesize)) - SKIP(return, "Cannot read-access mmap'ed '/dev/mem'\n"); + SKIP(return, "Cannot read-access mmap'ed '%s'\n", file); self->size2 = 0; self->addr2 = MAP_FAILED; @@ -182,7 +192,7 @@ TEST_F(pfnmap, munmap_split) */ self->size2 = self->pagesize; self->addr2 = mmap(NULL, self->pagesize, PROT_READ, MAP_SHARED, - self->dev_mem_fd, self->phys_addr); + self->dev_mem_fd, self->offset); ASSERT_NE(self->addr2, MAP_FAILED); } @@ -246,4 +256,14 @@ TEST_F(pfnmap, fork) ASSERT_EQ(ret, 0); } -TEST_HARNESS_MAIN +int main(int argc, char **argv) +{ + for (int i = 1; i < argc; i++) { + if (strcmp(argv[i], "--") == 0) { + if (i + 1 < argc && strlen(argv[i + 1]) > 0) + file = argv[i + 1]; + return test_harness_run(i, argv); + } + } + return test_harness_run(argc, argv); +} diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c index 5492e3f784df..2cebe4212db8 100644 --- a/tools/testing/selftests/mm/vm_util.c +++ b/tools/testing/selftests/mm/vm_util.c @@ -402,7 +402,7 @@ unsigned long get_free_hugepages(void) return fhp; } -bool check_vmflag_io(void *addr) +static bool check_vmflag(void *addr, const char *flag) { char buffer[MAX_LINE_LENGTH]; const char *flags; @@ -419,13 +419,23 @@ bool check_vmflag_io(void *addr) if (!flaglen) return false; - if (flaglen == strlen("io") && !memcmp(flags, "io", flaglen)) + if (flaglen == strlen(flag) && !memcmp(flags, flag, flaglen)) return true; flags += flaglen; } } +bool check_vmflag_io(void *addr) +{ + return check_vmflag(addr, "io"); +} + +bool check_vmflag_pfnmap(void *addr) +{ + return check_vmflag(addr, "pf"); +} + /* * Open an fd at /proc/$pid/maps and configure procmap_out ready for * PROCMAP_QUERY query. Returns 0 on success, or an error code otherwise. 
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index b8136d12a0f8..ec1f61f30fe7 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -84,6 +84,7 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len, bool miss, bool wp, bool minor, uint64_t *ioctls); unsigned long get_free_hugepages(void); bool check_vmflag_io(void *addr); +bool check_vmflag_pfnmap(void *addr); int open_procmap(pid_t pid, struct procmap_fd *procmap_out); int query_procmap(struct procmap_fd *procmap); bool find_vma_procmap(struct procmap_fd *procmap, void *address); -- 2.50.1.565.gc32cd1483b-goog
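The '--' handling in main() above can be looked at in isolation. What follows is only a minimal sketch under stated assumptions: split_harness_args() is a hypothetical helper that is not part of the patch, and it does not call the real test_harness_run(). It shows how the arguments before '--' stay with the harness while the first non-empty argument after it replaces the default path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): split argv at the first "--".
 * Returns the number of leading arguments meant for the test harness.
 * If a non-empty argument follows "--", it is stored in *path; otherwise
 * *path is left unchanged so the caller's default survives.
 */
static int split_harness_args(int argc, char **argv, const char **path)
{
	assert(argv != NULL && path != NULL);

	for (int i = 1; i < argc; i++) {
		if (strcmp(argv[i], "--") == 0) {
			if (i + 1 < argc && argv[i + 1][0] != '\0')
				*path = argv[i + 1];
			return i;	/* harness sees argv[0..i-1] only */
		}
	}
	return argc;		/* no separator: harness sees everything */
}
```

For an invocation such as `pfnmap -r pfnmap.fork -- /dev/blah`, this sketch returns 3 and sets the path to /dev/blah. Without a separator it returns argc and leaves the default untouched.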

5 months


[PATCH v2 0/3] execute PROCMAP_QUERY ioctl under per-vma lock

by Suren Baghdasaryan

With /proc/pid/maps now being read under per-vma lock protection we can reuse parts of that code to execute PROCMAP_QUERY ioctl also without taking mmap_lock. The change is designed to reduce mmap_lock contention and prevent PROCMAP_QUERY ioctl calls from blocking address space updates.

This patchset was split out of the original patchset [1] that introduced per-vma lock usage for /proc/pid/maps reading. It contains PROCMAP_QUERY tests, code refactoring patch to simplify the main change and the actual transition to per-vma lock.

Changes since v1 [2]
- Added Tested-by and Acked-by, per SeongJae Park
- Fixed NOMMU case, per Vlastimil Babka
- Renamed proc_maps_query_data to proc_maps_locking_ctx, per Vlastimil Babka

[1] https://lore.kernel.org/all/20250704060727.724817-1-surenb@google.com/
[2] https://lore.kernel.org/all/20250731220024.702621-1-surenb@google.com/

Suren Baghdasaryan (3):
  selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified
  fs/proc/task_mmu: factor out proc_maps_private fields used by PROCMAP_QUERY
  fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks

 fs/proc/internal.h                            |  15 +-
 fs/proc/task_mmu.c                            | 149 ++++++++++++------
 fs/proc/task_nommu.c                          |  14 +-
 tools/testing/selftests/proc/proc-maps-race.c |  65 ++++++++
 4 files changed, 181 insertions(+), 62 deletions(-)

base-commit: 01da54f10fddf3b01c5a3b80f6b16bbad390c302
--
2.50.1.565.gc32cd1483b-goog

5 months


[PATCH 00/17] rust: replace `kernel::c_str!` with C-Strings

by Tamir Duberstein

This series depends on step 3[0] which depends on steps 2a[1] and 2b[2] which both depend on step 1[3]. This series also has a minor merge conflict with a small change[4] that was taken through driver-core-testing. This series is marked as depending on that change; as such it contains the post-conflict patch. Subsystem maintainers: I would appreciate your `Acked-by`s so that this can be taken through Miguel's tree (where the previous series must go). Link https://lore.kernel.org/all/20250710-cstr-core-v14-0-ca7e0ca82c82@gmail.com/ [0] Link: https://lore.kernel.org/all/20250709-core-cstr-fanout-1-v1-0-64308e7203fc@g… [1] Link: https://lore.kernel.org/all/20250709-core-cstr-fanout-1-v1-0-fd793b3e58a2@g… [2] Link: https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm… [3] Link: https://lore.kernel.org/all/20250704-cstr-include-aux-v1-1-e1a404ae92ac@gma… [4] Signed-off-by: Tamir Duberstein <tamird(a)gmail.com> --- Tamir Duberstein (17): drivers: net: replace `kernel::c_str!` with C-Strings gpu: nova-core: replace `kernel::c_str!` with C-Strings rust: auxiliary: replace `kernel::c_str!` with C-Strings rust: clk: replace `kernel::c_str!` with C-Strings rust: configfs: replace `kernel::c_str!` with C-Strings rust: cpufreq: replace `kernel::c_str!` with C-Strings rust: device: replace `kernel::c_str!` with C-Strings rust: firmware: replace `kernel::c_str!` with C-Strings rust: kunit: replace `kernel::c_str!` with C-Strings rust: macros: replace `kernel::c_str!` with C-Strings rust: miscdevice: replace `kernel::c_str!` with C-Strings rust: net: replace `kernel::c_str!` with C-Strings rust: pci: replace `kernel::c_str!` with C-Strings rust: platform: replace `kernel::c_str!` with C-Strings rust: seq_file: replace `kernel::c_str!` with C-Strings rust: str: replace `kernel::c_str!` with C-Strings rust: sync: replace `kernel::c_str!` with C-Strings drivers/block/rnull.rs | 2 +- drivers/cpufreq/rcpufreq_dt.rs | 5 ++--- drivers/gpu/drm/nova/driver.rs | 10 
+++++----- drivers/gpu/nova-core/driver.rs | 6 +++--- drivers/net/phy/ax88796b_rust.rs | 7 +++---- drivers/net/phy/qt2025.rs | 5 ++--- rust/kernel/clk.rs | 6 ++---- rust/kernel/configfs.rs | 5 ++--- rust/kernel/cpufreq.rs | 3 +-- rust/kernel/device.rs | 4 +--- rust/kernel/firmware.rs | 6 +++--- rust/kernel/kunit.rs | 11 ++++------- rust/kernel/net/phy.rs | 6 ++---- rust/kernel/platform.rs | 4 ++-- rust/kernel/seq_file.rs | 4 ++-- rust/kernel/str.rs | 5 ++--- rust/kernel/sync.rs | 5 ++--- rust/kernel/sync/completion.rs | 2 +- rust/kernel/workqueue.rs | 8 ++++---- rust/macros/kunit.rs | 10 +++++----- rust/macros/module.rs | 2 +- samples/rust/rust_configfs.rs | 5 ++--- samples/rust/rust_driver_auxiliary.rs | 4 ++-- samples/rust/rust_driver_faux.rs | 4 ++-- samples/rust/rust_driver_pci.rs | 4 ++-- samples/rust/rust_driver_platform.rs | 4 ++-- samples/rust/rust_misc_device.rs | 3 +-- scripts/rustdoc_test_gen.rs | 4 ++-- 28 files changed, 63 insertions(+), 81 deletions(-) --- base-commit: 769e324b66b0d92d04f315d0c45a0f72737c7494 change-id: 20250710-core-cstr-cstrings-1faaa632f0fd prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1 prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573 prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27 prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8 prerequisite-message-id: 20250709-core-cstr-fanout-1-v1-0-64308e7203fc(a)gmail.com prerequisite-patch-id: fa79c5d8fd2762b5e488ba017e13a5774d933f81 prerequisite-patch-id: c338aa49e1319e9e802de2ad8bb0fa688bce9d9c prerequisite-patch-id: 589a352ba7f7c9aefefd84dfd3b6b20e290b0d14 prerequisite-patch-id: 29fc25261295349f6747d1bb409cf18130e9aa69 prerequisite-patch-id: 3d89601bba1fb01d190b0ba415b28ad9cbf1e209 prerequisite-patch-id: 
10923aebf24011b727f60496c0f9e0ad57e0a967 prerequisite-patch-id: 56583fd829951fb4fac843c6b1874c643b726de0 prerequisite-patch-id: 9a7e8ba460358985147efd347658be31fbc78ba2 prerequisite-patch-id: 5821a23334e317cd0351b8e4404b9e3b36b72d67 prerequisite-message-id: 20250709-core-cstr-fanout-1-v1-0-fd793b3e58a2(a)gmail.com prerequisite-patch-id: 0ccc3545ff9bf22a67b79a944705cef2fb9c2bbf prerequisite-patch-id: b1866166714606d5c11a4d7506abe4c2f86dac8d prerequisite-patch-id: 163b8ff1edaf8e48976fd5de3f64e68fc38c7277 prerequisite-patch-id: 8fee5e2daf0749362331dad4fc63d907a01b14e9 prerequisite-patch-id: 366ef1f93fb40b1d039768f2041ff79995e7e228 prerequisite-patch-id: 1d350291f9292f910081856d8f7d5e4d9545cfd1 prerequisite-patch-id: 9a6a60bd2b209126de64c16a77a3a1d229dd898c prerequisite-patch-id: 08ae5855768ec3b4c68272b86d2a0e0667c9aa47 prerequisite-patch-id: f15b54927660a03b52ffb34fb7943ac3228b7803 prerequisite-patch-id: f0dbf0a55a27fe8e199e242d1f79ea800d1ddb66 prerequisite-change-id: 20250201-cstr-core-d4b9b69120cf:v14 prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573 prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27 prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8 prerequisite-patch-id: 0ccc3545ff9bf22a67b79a944705cef2fb9c2bbf prerequisite-patch-id: b1866166714606d5c11a4d7506abe4c2f86dac8d prerequisite-patch-id: 163b8ff1edaf8e48976fd5de3f64e68fc38c7277 prerequisite-patch-id: 8fee5e2daf0749362331dad4fc63d907a01b14e9 prerequisite-patch-id: 366ef1f93fb40b1d039768f2041ff79995e7e228 prerequisite-patch-id: 1d350291f9292f910081856d8f7d5e4d9545cfd1 prerequisite-patch-id: 9a6a60bd2b209126de64c16a77a3a1d229dd898c prerequisite-patch-id: 08ae5855768ec3b4c68272b86d2a0e0667c9aa47 prerequisite-patch-id: f15b54927660a03b52ffb34fb7943ac3228b7803 prerequisite-patch-id: 
f0dbf0a55a27fe8e199e242d1f79ea800d1ddb66 prerequisite-patch-id: fa79c5d8fd2762b5e488ba017e13a5774d933f81 prerequisite-patch-id: c338aa49e1319e9e802de2ad8bb0fa688bce9d9c prerequisite-patch-id: 589a352ba7f7c9aefefd84dfd3b6b20e290b0d14 prerequisite-patch-id: 29fc25261295349f6747d1bb409cf18130e9aa69 prerequisite-patch-id: 3d89601bba1fb01d190b0ba415b28ad9cbf1e209 prerequisite-patch-id: 10923aebf24011b727f60496c0f9e0ad57e0a967 prerequisite-patch-id: 56583fd829951fb4fac843c6b1874c643b726de0 prerequisite-patch-id: 9a7e8ba460358985147efd347658be31fbc78ba2 prerequisite-patch-id: 5821a23334e317cd0351b8e4404b9e3b36b72d67 prerequisite-patch-id: 9c0a6624ed7b7e1d0373985c5c084a844e7c49ce prerequisite-patch-id: 6d8dbdf864f79fc0c2820e702a7cb87753649ca0 prerequisite-patch-id: 2bc4afce0104c13c0dd4d50923b0db2f5cd11129 prerequisite-change-id: 20250704-cstr-include-aux-7847969762a8:v1 prerequisite-patch-id: 1f79f64dd9b8a092ff039e6c7fad1430afb8ea25 Best regards, -- Tamir Duberstein <tamird(a)gmail.com>

5 months


[PATCH v2] selftests/mm: pass filename as input param to VM_PFNMAP tests

by Sudarsan Mahendran

Enable these tests to be run on other pfnmap'ed memory like NVIDIA's EGM. Add '--' as a separator to pass in file path. This allows passing of cmd line arguments to kselftest_harness. Use '/dev/mem' as default filename. Existing test passes: pfnmap TAP version 13 1..6 # Starting 6 tests from 1 test cases. # PASSED: 6 / 6 tests passed. # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0 Pass params to kselftest_harness: pfnmap -r pfnmap:mremap_fixed TAP version 13 1..1 # Starting 1 tests from 1 test cases. # RUN pfnmap.mremap_fixed ... # OK pfnmap.mremap_fixed ok 1 pfnmap.mremap_fixed # PASSED: 1 / 1 tests passed. # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 Pass non-existent file name as input: pfnmap -- /dev/blah TAP version 13 1..6 # Starting 6 tests from 1 test cases. # RUN pfnmap.madvise_disallowed ... # SKIP Cannot open '/dev/blah' Pass non pfnmap'ed file as input: pfnmap -r pfnmap.madvise_disallowed -- randfile TAP version 13 1..1 # Starting 1 tests from 1 test cases. # RUN pfnmap.madvise_disallowed ... # SKIP Invalid file: 'randfile'. Not pfnmap'ed Signed-off-by: Sudarsan Mahendran <sudarsanm(a)google.com> --- v1 -> v2: * Add verify_pfnmap func to sanity check the input param * mmap with zero offset if filename != '/dev/mem' --- tools/testing/selftests/mm/pfnmap.c | 62 ++++++++++++++++++++++++----- 1 file changed, 53 insertions(+), 9 deletions(-) diff --git a/tools/testing/selftests/mm/pfnmap.c b/tools/testing/selftests/mm/pfnmap.c index 866ac023baf5..e078b961c333 100644 --- a/tools/testing/selftests/mm/pfnmap.c +++ b/tools/testing/selftests/mm/pfnmap.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0-only /* - * Basic VM_PFNMAP tests relying on mmap() of '/dev/mem' + * Basic VM_PFNMAP tests relying on mmap() of input file provided. + * Use '/dev/mem' as default. * * Copyright 2025, Red Hat, Inc. 
* @@ -25,6 +26,7 @@ #include "vm_util.h" static sigjmp_buf sigjmp_buf_env; +static char *file = "/dev/mem"; static void signal_handler(int sig) { @@ -98,6 +100,30 @@ static int find_ram_target(off_t *phys_addr, return -ENOENT; } +static int verify_pfnmap(void) +{ + FILE *smaps_fp; + char line[512]; + int found_mmap_entry = 0; + + smaps_fp = fopen("/proc/self/smaps", "r"); + if (!smaps_fp) + return -errno; + while (fgets(line, sizeof(line), smaps_fp) != NULL) { + if (strstr(line, file) && strstr(line, " r--s ")) + found_mmap_entry = 1; + + if (found_mmap_entry && + strncmp(line, "VmFlags:", strlen("VmFlags:")) == 0) { + if (strstr(line, " pf ")) + return 0; + found_mmap_entry = 0; + } + } + fclose(smaps_fp); + return -ENOENT; +} + FIXTURE(pfnmap) { off_t phys_addr; @@ -113,23 +139,31 @@ FIXTURE_SETUP(pfnmap) { self->pagesize = getpagesize(); - /* We'll require two physical pages throughout our tests ... */ - if (find_ram_target(&self->phys_addr, self->pagesize)) - SKIP(return, "Cannot find ram target in '/proc/iomem'\n"); + if (strncmp(file, "/dev/mem", strlen("/dev/mem")) == 0) { + /* We'll require two physical pages throughout our tests ... */ + if (find_ram_target(&self->phys_addr, self->pagesize)) + SKIP(return, + "Cannot find ram target in '/proc/iomem'\n"); + } else { + self->phys_addr = 0; + } - self->dev_mem_fd = open("/dev/mem", O_RDONLY); + self->dev_mem_fd = open(file, O_RDONLY); if (self->dev_mem_fd < 0) - SKIP(return, "Cannot open '/dev/mem'\n"); + SKIP(return, "Cannot open '%s'\n", file); self->size1 = self->pagesize * 2; self->addr1 = mmap(NULL, self->size1, PROT_READ, MAP_SHARED, self->dev_mem_fd, self->phys_addr); if (self->addr1 == MAP_FAILED) - SKIP(return, "Cannot mmap '/dev/mem'\n"); + SKIP(return, "Cannot mmap '%s'\n", file); + + if (verify_pfnmap()) + SKIP(return, "Invalid file: '%s'. Not pfnmap'ed\n", file); /* ... and want to be able to read from them. 
*/ if (test_read_access(self->addr1, self->size1, self->pagesize)) - SKIP(return, "Cannot read-access mmap'ed '/dev/mem'\n"); + SKIP(return, "Cannot read-access mmap'ed '%s'\n", file); self->size2 = 0; self->addr2 = MAP_FAILED; @@ -246,4 +280,14 @@ TEST_F(pfnmap, fork) ASSERT_EQ(ret, 0); } -TEST_HARNESS_MAIN +int main(int argc, char **argv) +{ + for (int i = 1; i < argc; i++) { + if (strcmp(argv[i], "--") == 0) { + if (i + 1 < argc && strlen(argv[i + 1]) > 0) + file = argv[i + 1]; + return test_harness_run(i, argv); + } + } + return test_harness_run(argc, argv); +} -- 2.50.1.565.gc32cd1483b-goog
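The flag check that verify_pfnmap() performs on a "VmFlags:" line can also be done token by token, which avoids depending on the exact spacing around a mnemonic. The helper below is a hypothetical sketch, not code from this patch; the name vmflags_line_has() and its interface are assumptions made for illustration. It takes one line as read from /proc/<pid>/smaps and reports whether a given two-letter mnemonic such as "pf" is present.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: report whether a "VmFlags:" line from
 * /proc/<pid>/smaps carries the given flag mnemonic (e.g. "pf").
 * Tokens are compared whole, so a match does not depend on the flag
 * being surrounded by spaces on both sides.
 */
static bool vmflags_line_has(const char *line, const char *flag)
{
	static const char prefix[] = "VmFlags:";
	const char *p;
	size_t flaglen;

	assert(line != NULL && flag != NULL);
	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return false;		/* not a VmFlags line at all */

	p = line + sizeof(prefix) - 1;
	flaglen = strlen(flag);
	for (;;) {
		size_t toklen;

		p += strspn(p, " \t\n");	/* skip separators */
		toklen = strcspn(p, " \t\n");	/* length of next token */
		if (toklen == 0)
			return false;		/* end of line, no match */
		if (toklen == flaglen && memcmp(p, flag, toklen) == 0)
			return true;
		p += toklen;
	}
}
```

A caller would still have to locate the smaps entry that belongs to the mapping under test before applying this check to the VmFlags line that follows it.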

5 months


[PATCH v4 00/38] Mediated vPMU 4.0 for x86

by Mingwei Zhang

With joint effort from the upstream KVM community, we have come up with the 4th version of mediated vPMU for x86. We have made the following changes on top of the previous RFC v3.

v3 -> v4
 - Rebase whole patchset on 6.14-rc3 base.
 - Address Peter's comments on Perf part.
 - Address Sean's comments on KVM part.
   * Change key word "passthrough" to "mediated" in all patches
   * Change static enabling to user space dynamic enabling via KVM_CAP_PMU_CAPABILITY.
   * Only support GLOBAL_CTRL save/restore with VMCS exec_ctrl, drop the MSR save/restore list support for GLOBAL_CTRL, thus the support of mediated vPMU is constrained to SapphireRapids and later CPUs on Intel side.
   * Merge some small changes into a single patch.
 - Address Sandipan's comment on invalid pmu pointer.
 - Add back "eventsel_hw" and "fixed_ctr_ctrl_hw" to avoid directly manipulating pmc->eventsel and pmu->fixed_ctr_ctrl.

Testing (Intel side):
 - Perf-based legacy vPMU (force emulation on/off)
   * Kselftests pmu_counters_test, pmu_event_filter_test and vmx_pmu_caps_test pass.
   * KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
   * Basic perf counting/sampling tests in 3 scenarios, guest-only, host-only and host-guest coexistence all pass.
 - Mediated vPMU (force emulation on/off)
   * Kselftests pmu_counters_test, pmu_event_filter_test and vmx_pmu_caps_test pass.
   * KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
   * Basic perf counting/sampling tests in 3 scenarios, guest-only, host-only and host-guest coexistence all pass.
 - Failures. All above tests passed on Intel Granite Rapids as well except a failure on KUT/pmu_pebs.
   * GP counter 0 (0xfffffffffffe): PEBS record (written seq 0) is verified (including size, counters and cfg).
   * The pebs_data_cfg (0xb500000000) doesn't match with the effective MSR_PEBS_DATA_CFG (0x0).
   * This failure has nothing to do with this mediated vPMU patch set. The failure is caused by Granite Rapids supporting timed PEBS, which needs extra support in QEMU and KUT/pmu_pebs. This extra support will be sent in separate patches later.

Testing (AMD side):
 - Kselftests pmu_counters_test, pmu_event_filter_test and vmx_pmu_caps_test all pass
 - legacy guest with KUT/pmu:
   * qemu option: -cpu host, -perfctr-core
   * when set force_emulation_prefix=1, passes
   * when set force_emulation_prefix=0, passes
 - perfmon-v1 guest with KUT/pmu:
   * qemu option: -cpu host, -perfmon-v2
   * when set force_emulation_prefix=1, passes
   * when set force_emulation_prefix=0, passes
 - perfmon-v2 guest with KUT/pmu:
   * qemu option: -cpu host
   * when set force_emulation_prefix=1, passes
   * when set force_emulation_prefix=0, passes
 - perf_fuzzer (perfmon-v2):
   * fails with soft lockup in guest in current version.
   * culprit could be between 6.13 ~ 6.14-rc3 within KVM
   * Series tested on 6.12 and 6.13 without issue.

Note: a QEMU series is needed to run mediated vPMU v4:
 - https://lore.kernel.org/all/20250324123712.34096-1-dapeng1.mi@linux.intel.c…

History:
 - RFC v3: https://lore.kernel.org/all/20240801045907.4010984-1-mizhang@google.com/
 - RFC v2: https://lore.kernel.org/all/20240506053020.3911940-1-mizhang@google.com/
 - RFC v1: https://lore.kernel.org/all/20240126085444.324918-1-xiong.y.zhang@linux.int…

Dapeng Mi (18):
  KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
  KVM: x86/pmu: Check PMU cpuid configuration from user space
  KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
  KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{}
  KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
  KVM: VMX: Add macros to wrap around {secondary,tertiary}_exec_controls_changebit()
  KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
  KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with vm_exit/entry_ctrl
  KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
  KVM: x86/pmu: Setup PMU MSRs' interception mode
  KVM: x86/pmu: Handle PMU MSRs interception and event filtering
  KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
  KVM: x86/pmu: Handle emulated instruction for mediated vPMU
  KVM: nVMX: Add macros to simplify nested MSR interception setting
  KVM: selftests: Add mediated vPMU supported for pmu tests
  KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test
  KVM: Selftests: Fix pmu_counters_test error for mediated vPMU
  KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space

Kan Liang (8):
  perf: Support get/put mediated PMU interfaces
  perf: Skip pmu_ctx based on event_type
  perf: Clean up perf ctx time
  perf: Add a EVENT_GUEST flag
  perf: Add generic exclude_guest support
  perf: Add switch_guest_ctx() interface
  perf/x86: Support switch_guest_ctx interface
  perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU

Mingwei Zhang (5):
  perf/x86: Forbid PMI handler when guest own PMU
  perf/x86/core: Plumb mediated PMU capability from x86_pmu to x86_pmu_cap
  KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
  KVM: x86/pmu: introduce eventsel_hw to prepare for pmu event filtering
  KVM: nVMX: Add nested virtualization support for mediated PMU

Sandipan Das (4):
  perf/x86/core: Do not set bit width for unavailable counters
  KVM: x86/pmu: Add AMD PMU registers to direct access list
  KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors
  perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host

Xiong Zhang (3):
  x86/irq: Factor out common code for installing kvm irq handler
  perf: core/x86: Register a new vector for KVM GUEST PMI
  KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler

 arch/x86/events/amd/core.c                    |   2 +
 arch/x86/events/core.c                        |  40 +-
 arch/x86/events/intel/core.c                  |   5 +
 arch/x86/include/asm/hardirq.h                |   1 +
 arch/x86/include/asm/idtentry.h               |   1 +
 arch/x86/include/asm/irq.h                    |   2 +-
 arch/x86/include/asm/irq_vectors.h            |   5 +-
 arch/x86/include/asm/kvm-x86-pmu-ops.h        |   2 +
 arch/x86/include/asm/kvm_host.h               |  10 +
 arch/x86/include/asm/msr-index.h              |  18 +-
 arch/x86/include/asm/perf_event.h             |   1 +
 arch/x86/include/asm/vmx.h                    |   1 +
 arch/x86/kernel/idt.c                         |   1 +
 arch/x86/kernel/irq.c                         |  39 +-
 arch/x86/kvm/cpuid.c                          |  15 +
 arch/x86/kvm/pmu.c                            | 254 ++++++++-
 arch/x86/kvm/pmu.h                            |  45 ++
 arch/x86/kvm/svm/pmu.c                        | 148 ++++-
 arch/x86/kvm/svm/svm.c                        |  26 +
 arch/x86/kvm/svm/svm.h                        |   2 +-
 arch/x86/kvm/vmx/capabilities.h               |  11 +-
 arch/x86/kvm/vmx/nested.c                     |  68 ++-
 arch/x86/kvm/vmx/pmu_intel.c                  | 224 ++++++--
 arch/x86/kvm/vmx/vmx.c                        |  89 +--
 arch/x86/kvm/vmx/vmx.h                        |  11 +-
 arch/x86/kvm/x86.c                            |  63 ++-
 arch/x86/kvm/x86.h                            |   2 +
 include/linux/perf_event.h                    |  47 +-
 kernel/events/core.c                          | 519 ++++++++++++++----
 .../beauty/arch/x86/include/asm/irq_vectors.h |   5 +-
 .../selftests/kvm/include/kvm_test_harness.h  |  13 +
 .../testing/selftests/kvm/include/kvm_util.h  |   3 +
 .../selftests/kvm/include/x86/processor.h     |   8 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  23 +
 .../selftests/kvm/x86/pmu_counters_test.c     |  24 +-
 .../selftests/kvm/x86/pmu_event_filter_test.c |   8 +-
 .../selftests/kvm/x86/vmx_pmu_caps_test.c     |   2 +-
 37 files changed, 1480 insertions(+), 258 deletions(-)

base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319
--
2.49.0.395.g12beb8f557-goog

5 months


[PATCH net] selftests: net: packetdrill: xfail all problems on slow machines

by Jakub Kicinski

We keep seeing flakes on packetdrill on debug kernels, while non-debug kernels are stable, not a single flake in 200 runs. Time to give up, debug kernels appear to suffer from 10msec latency spikes and any timing-sensitive test is bound to flake.

Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: willemb(a)google.com
CC: matttbe(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
 .../selftests/net/packetdrill/ksft_runner.sh | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/tools/testing/selftests/net/packetdrill/ksft_runner.sh b/tools/testing/selftests/net/packetdrill/ksft_runner.sh
index c5b01e1bd4c7..a7e790af38ff 100755
--- a/tools/testing/selftests/net/packetdrill/ksft_runner.sh
+++ b/tools/testing/selftests/net/packetdrill/ksft_runner.sh
@@ -35,24 +35,7 @@ failfunc=ktap_test_fail
 
 if [[ -n "${KSFT_MACHINE_SLOW}" ]]; then
 	optargs+=('--tolerance_usecs=14000')
-
-	# xfail tests that are known flaky with dbg config, not fixable.
-	# still run them for coverage (and expect 100% pass without dbg).
-	declare -ar xfail_list=(
-		"tcp_blocking_blocking-connect.pkt"
-		"tcp_blocking_blocking-read.pkt"
-		"tcp_eor_no-coalesce-retrans.pkt"
-		"tcp_fast_recovery_prr-ss.*.pkt"
-		"tcp_sack_sack-route-refresh-ip-tos.pkt"
-		"tcp_slow_start_slow-start-after-win-update.pkt"
-		"tcp_timestamping.*.pkt"
-		"tcp_user_timeout_user-timeout-probe.pkt"
-		"tcp_zerocopy_cl.*.pkt"
-		"tcp_zerocopy_epoll_.*.pkt"
-		"tcp_tcp_info_tcp-info-.*-limited.pkt"
-	)
-	readonly xfail_regex="^($(printf '%s|' "${xfail_list[@]}"))$"
-	[[ "$script" =~ ${xfail_regex} ]] && failfunc=ktap_test_xfail
+	failfunc=ktap_test_xfail
 fi
 
 ktap_print_header
--
2.50.1

5 months


[PATCH v2] selftests/proc: Fix string literal warning in proc-maps-race.c

by Sukrut Heroorkar

This change resolves the non-literal format string warnings raised for proc-maps-race.c while compiling.

proc-maps-race.c:205:17: warning: format not a string literal and no format arguments [-Wformat-security]
  205 |                 printf(text);
      |                 ^~~~~~
proc-maps-race.c:209:17: warning: format not a string literal and no format arguments [-Wformat-security]
  209 |                 printf(text);
      |                 ^~~~~~
proc-maps-race.c: In function ‘print_last_lines’:
proc-maps-race.c:224:9: warning: format not a string literal and no format arguments [-Wformat-security]
  224 |         printf(start);
      |         ^~~~~~

Added the string format specifier %s to the printf calls in both print_first_lines() and print_last_lines(), thus resolving the warnings. The test executes fine after this change, which has no effect on the functional behavior of the test.

Fixes: aadc099c480f ("selftests/proc: add verbose mode for /proc/pid/maps tearing tests")
Signed-off-by: Sukrut Heroorkar <hsukrut3(a)gmail.com>
Acked-by: Suren Baghdasaryan <surenb(a)google.com>
---
Changes since v1:
- Added Fixes tag
- Included Acked-by Suren Baghdasaryan

https://lore.kernel.org/all/CAHCkknoxpKV80-S3jByY1xnRXd1Pr=v=D2a0ZcgnY0-Hny…

 tools/testing/selftests/proc/proc-maps-race.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c
index 66773685a047..94bba4553130 100644
--- a/tools/testing/selftests/proc/proc-maps-race.c
+++ b/tools/testing/selftests/proc/proc-maps-race.c
@@ -202,11 +202,11 @@ static void print_first_lines(char *text, int nr)
 		int offs = end - text;
 
 		text[offs] = '\0';
-		printf(text);
+		printf("%s", text);
 		text[offs] = '\n';
 		printf("\n");
 	} else {
-		printf(text);
+		printf("%s", text);
 	}
 }
 
@@ -221,7 +221,7 @@ static void print_last_lines(char *text, int nr)
 			nr--;
 		start--;
 	}
-	printf(start);
+	printf("%s", start);
 }
 
 static void print_boundaries(const char *title, FIXTURE_DATA(proc_maps_race) *self)
--
2.43.0
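The reason the "%s" form is preferred can be shown with a small standalone sketch. It is an illustration only and not code from proc-maps-race.c: format_text() is a hypothetical helper, and snprintf() is used instead of printf() so that the result can be inspected. Because the text is passed as an argument rather than as the format string, any '%' characters it contains are copied literally instead of being interpreted as conversion specifiers.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: copy arbitrary text into buf for display.
 * The text is an argument to "%s", never the format string itself,
 * so sequences such as "%s" or "%n" inside it are plain characters.
 */
static int format_text(char *buf, size_t size, const char *text)
{
	assert(buf != NULL && text != NULL);
	return snprintf(buf, size, "%s", text);
}
```

Had the same text been passed as the format string, a sequence like "%s" in it would have made the call read an argument that was never supplied, which is the situation -Wformat-security warns about.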

5 months


[PATCH] selftests/proc: Fix string literal warning in proc-maps-race.c

by Sukrut Heroorkar

This change resolves the non-literal format string warnings raised for proc-maps-race.c while compiling.

proc-maps-race.c:205:17: warning: format not a string literal and no format arguments [-Wformat-security]
  205 |                 printf(text);
      |                 ^~~~~~
proc-maps-race.c:209:17: warning: format not a string literal and no format arguments [-Wformat-security]
  209 |                 printf(text);
      |                 ^~~~~~
proc-maps-race.c: In function ‘print_last_lines’:
proc-maps-race.c:224:9: warning: format not a string literal and no format arguments [-Wformat-security]
  224 |         printf(start);
      |         ^~~~~~

Added the string format specifier %s to the printf calls in both print_first_lines() and print_last_lines(), thus resolving the warnings. The test executes fine after this change, which has no effect on the functional behavior of the test.

Signed-off-by: Sukrut Heroorkar <hsukrut3(a)gmail.com>
---
 tools/testing/selftests/proc/proc-maps-race.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c
index 66773685a047..94bba4553130 100644
--- a/tools/testing/selftests/proc/proc-maps-race.c
+++ b/tools/testing/selftests/proc/proc-maps-race.c
@@ -202,11 +202,11 @@ static void print_first_lines(char *text, int nr)
 		int offs = end - text;
 
 		text[offs] = '\0';
-		printf(text);
+		printf("%s", text);
 		text[offs] = '\n';
 		printf("\n");
 	} else {
-		printf(text);
+		printf("%s", text);
 	}
 }
 
@@ -221,7 +221,7 @@ static void print_last_lines(char *text, int nr)
 			nr--;
 		start--;
 	}
-	printf(start);
+	printf("%s", start);
 }
 
 static void print_boundaries(const char *title, FIXTURE_DATA(proc_maps_race) *self)
--
2.43.0

5 months


[PATCH] selftests: bpf: crypto: Improved clarity in test output message

by Noorain Eqbal

In 'crypto_setup()', the error message for an invalid buffer size was updated for grammar and clarity.

This change does not affect the test behaviour but improves the quality of the test output.

Signed-off-by: Noorain Eqbal <nooraineqbal(a)gmail.com>
---
 tools/testing/selftests/bpf/benchs/bench_bpf_crypto.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_crypto.c b/tools/testing/selftests/bpf/benchs/bench_bpf_crypto.c
index 2845edaba8db..ac91cb224373 100644
--- a/tools/testing/selftests/bpf/benchs/bench_bpf_crypto.c
+++ b/tools/testing/selftests/bpf/benchs/bench_bpf_crypto.c
@@ -83,7 +83,7 @@ static void crypto_setup(void)
 
 	sz = args.crypto_len;
 	if (!sz || sz > sizeof(ctx.skel->bss->dst)) {
-		fprintf(stderr, "invalid encrypt buffer size (source %zu, target %zu)\n",
+		fprintf(stderr, "invalid encryption buffer size: source %zu, target %zu\n",
 			sz, sizeof(ctx.skel->bss->dst));
 		exit(1);
 	}
--
2.50.1

5 months


[PATCH v11 iproute2-next 0/1] DUALPI2 iproute2 patch

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>

Hello,

Please find DUALPI2 iproute2 patch v11.

For more details of DualPI2, please refer to IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).

Best Regards,
Chia-Yu

---
v11 (18-Jul-2025)
- Replace TCA_DUALPI2 prefix with TC_DUALPI2 prefix for enums (Jakub Kicinski <kuba(a)kernel.org>)

v10 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS w/ STEP_THRESH_PKTS and STEP_THRESH_US of net-next patch (Jakub Kicinski <kuba(a)kernel.org>)

v9 (13-Jun-2025)
- Fix space issue and typos (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Change 'rtt_typical' to 'typical_rtt' in tc/q_dualpi2.c (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Add the num of enum used by DualPI2 in pkt_sched.h

v8 (09-May-2025)
- Update pkt_sched.h with the one in net-next
- Correct a typo in the comment within pkt_sched.h (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update manual content in man/man8/tc-dualpi2.8 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update tc/q_dualpi2.c to fix missing blank lines and add missing case (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)

v7 (05-May-2025)
- Align pkt_sched.h with the v14 version of net-next due to spec modification in tc.yaml
- Reorganize dualpi2_print_opt() to match the order in tc.yaml
- Remove credit-queue in PRINT_JSON

v6 (26-Apr-2025)
- Update JSON file output due to spec modification in tc.yaml of net-next

v5 (25-Mar-2025)
- Use matches() to replace current strcmp() (Stephen Hemminger <stephen(a)networkplumber.org>)
- Use general parse_percent() for handling scaled percentage values (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 stats (Stephen Hemminger <stephen(a)networkplumber.org>)

v4 (16-Mar-2025)
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step marking.

v3 (21-Feb-2025)
- Add memlimit to the dualpi2 attribute, and add memory_used, max_memory_used, and memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update the manual to align with the latest implementation and clarify the queue naming and default unit
- Use common "get_scaled_alpha_beta" and clean print_opt for Dualpi2

v2 (23-Oct-2024)
- Rename get_float in dualpi2 to get_float_min_max in utils.c
- Move get_float from iplink_can.c to utils.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 (Stephen Hemminger <stephen(a)networkplumber.org>)

---
Chia-Yu Chang (1):
  tc: add dualpi2 scheduler module

 bash-completion/tc             |  11 +-
 include/uapi/linux/pkt_sched.h |  68 +++++
 include/utils.h                |   2 +
 ip/iplink_can.c                |  14 -
 lib/utils.c                    |  30 ++
 man/man8/tc-dualpi2.8          | 249 ++++++++++++++++
 tc/Makefile                    |   1 +
 tc/q_dualpi2.c                 | 528 +++++++++++++++++++++++++++++++++
 8 files changed, 888 insertions(+), 15 deletions(-)
 create mode 100644 man/man8/tc-dualpi2.8
 create mode 100644 tc/q_dualpi2.c

--
2.34.1

5 months, 1 week

[PATCH v3 0/4] signal handling support for nolibc

by Benjamin Berg

From: Benjamin Berg <benjamin.berg(a)intel.com>

Hi,

This patchset adds signal handling to nolibc. Initially, I would like to use this for tests. But in the long run, the goal is to use nolibc for the UML kernel itself. In both cases, signal handling will be needed.

With v3 everything is now included in nolibc instead of trying to use the messy kernel headers.

Benjamin

Benjamin Berg (4):
  selftests/nolibc: fix EXPECT_NZ macro
  selftests/nolibc: remove outdated comment about construct order
  tools/nolibc: add more generic bitmask macros for FD_*
  tools/nolibc: add signal support

 tools/include/nolibc/Makefile                |   1 +
 tools/include/nolibc/arch-s390.h             |   4 +-
 tools/include/nolibc/asm-signal.h            | 237 +++++++++++++++++++
 tools/include/nolibc/signal.h                | 179 ++++++++++++++
 tools/include/nolibc/sys.h                   |   2 +-
 tools/include/nolibc/sys/wait.h              |   1 +
 tools/include/nolibc/time.h                  |   2 +-
 tools/include/nolibc/types.h                 |  81 ++++---
 tools/testing/selftests/nolibc/nolibc-test.c | 139 ++++++++++-
 9 files changed, 608 insertions(+), 38 deletions(-)
 create mode 100644 tools/include/nolibc/asm-signal.h

--
2.50.1

5 months, 1 week

[PATCH] selftests/mm: link with thp_settings when necessary

by Wei Yang

Currently all test cases are linked with thp_settings, while only 6 out of 50+ targets rely on it. Instead of making thp_settings a common dependency, link it only when necessary. Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com> Cc: Ryan Roberts <ryan.roberts(a)arm.com> --- tools/testing/selftests/mm/Makefile | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index d4f19f87053b..eea4881c918a 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -158,14 +158,19 @@ TEST_FILES += write_hugetlb_memory.sh include ../lib.mk -$(TEST_GEN_PROGS): vm_util.c thp_settings.c -$(TEST_GEN_FILES): vm_util.c thp_settings.c +$(TEST_GEN_PROGS): vm_util.c +$(TEST_GEN_FILES): vm_util.c $(OUTPUT)/uffd-stress: uffd-common.c $(OUTPUT)/uffd-unit-tests: uffd-common.c -$(OUTPUT)/uffd-wp-mremap: uffd-common.c +$(OUTPUT)/uffd-wp-mremap: uffd-common.c thp_settings.c $(OUTPUT)/protection_keys: pkey_util.c $(OUTPUT)/pkey_sighandler_tests: pkey_util.c +$(OUTPUT)/cow: thp_settings.c +$(OUTPUT)/migration: thp_settings.c +$(OUTPUT)/khugepaged: thp_settings.c +$(OUTPUT)/ksm_tests: thp_settings.c +$(OUTPUT)/soft-dirty: thp_settings.c ifeq ($(ARCH),x86_64) BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32)) -- 2.34.1

5 months, 1 week

[PATCH v5 0/2] libbpf: fix USDT SIB argument handling causing unrecognized register error

by Jiawei Zhao

When using GCC on x86-64 to compile a USDT program with -O1 or higher optimization, the compiler will generate SIB addressing mode for global arrays and PC-relative addressing mode for global variables, e.g. "1@-96(%rbp,%rax,8)" and "-1@4+t1(%rip)".

The current USDT implementation in libbpf cannot parse these two formats, causing `bpf_program__attach_usdt()` to fail with -ENOENT (unrecognized register).

This patch series adds support for SIB addressing mode in USDT probes. The main changes include:
- add correct handling logic for SIB-addressed arguments in `parse_usdt_arg`.
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt argument spec.
- change the global variable t1 to a local variable, to prevent the compiler from generating PC-relative addressing mode for it.

Testing shows that the SIB probe correctly generates 8@(%rcx,%rax,8) argument spec and passes all validation checks.

The modification history of this patch series:
Change since v1:
- refactor the code to make it more readable
- modify the commit message to explain why and how

Change since v2:
- fix the `scale` uninitialized error

Change since v3:
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt and pass all test cases.

Change since v4:
- split the patch into two parts, one for the fix and the other for the test

Do we need to add support for PC-relative USDT argument spec handling in libbpf? I have some interest in this question, but currently have no ideas. Getting offsets based on symbols requires dependency on the symbol table. However, once the binary file is stripped, the symtab will also be removed, which will cause this approach to fail. Does anyone have any thoughts on this?

Jiawei Zhao (2):
  libbpf: fix USDT SIB argument handling causing unrecognized register error
  selftests/bpf: Force -O2 for USDT selftests to cover SIB handling logic

 tools/lib/bpf/usdt.bpf.h                      | 33 +++++++++++++-
 tools/lib/bpf/usdt.c                          | 43 ++++++++++++++++---
 tools/testing/selftests/bpf/Makefile          |  5 +++
 tools/testing/selftests/bpf/prog_tests/usdt.c | 18 +++++---
 4 files changed, 86 insertions(+), 13 deletions(-)

--
2.43.0
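
To make the two SIB argument forms quoted in the cover letter concrete, here is a minimal userspace sketch that splits such a spec into size, displacement, base, index and scale. It is not the `parse_usdt_arg` code from the series: `struct sib_arg` and `parse_sib_arg()` are hypothetical names made up for illustration, and the PC-relative form is deliberately rejected because the series does not handle it either.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one parsed argument spec; not a libbpf type. */
struct sib_arg {
	int size;       /* argument size in bytes; negative means signed */
	long offset;    /* displacement, 0 when the spec omits it */
	char base[16];  /* base register name without the leading '%' */
	char index[16]; /* index register name without the leading '%' */
	int scale;      /* 1, 2, 4 or 8 */
};

/*
 * Parse "<size>@<offset>(%base,%index,<scale>)" or
 * "<size>@(%base,%index,<scale>)".
 * The described address is base + index * scale + offset.
 * Returns 0 on success and -1 for anything else.
 */
static int parse_sib_arg(const char *spec, struct sib_arg *out)
{
	int end = 0;
	int n;

	memset(out, 0, sizeof(*out));

	/* Form with an explicit displacement, e.g. "1@-96(%rbp,%rax,8)". */
	n = sscanf(spec, "%d@%ld(%%%15[^,],%%%15[^,],%d)%n",
		   &out->size, &out->offset, out->base, out->index,
		   &out->scale, &end);
	if (n != 5 || end == 0 || spec[end] != '\0') {
		/* Retry without a displacement, e.g. "8@(%rcx,%rax,8)". */
		memset(out, 0, sizeof(*out));
		end = 0;
		n = sscanf(spec, "%d@(%%%15[^,],%%%15[^,],%d)%n",
			   &out->size, out->base, out->index,
			   &out->scale, &end);
		if (n != 4 || end == 0 || spec[end] != '\0')
			return -1;
	}

	/* x86 SIB encoding only allows these scale factors. */
	if (out->scale != 1 && out->scale != 2 &&
	    out->scale != 4 && out->scale != 8)
		return -1;

	return 0;
}
```

Using `%n` plus the trailing NUL check is one way to reject specs with leftover characters; a real parser would also have to map the register names to pt_regs offsets, which this sketch leaves out.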

5 months, 1 week

[PATCH v4 0/1] libbpf: fix USDT SIB argument handling causing unrecognized register error

by Jiawei Zhao

When using GCC on x86-64 to compile a USDT program with -O1 or higher optimization, the compiler will generate SIB addressing mode for global arrays and PC-relative addressing mode for global variables, e.g. "1@-96(%rbp,%rax,8)" and "-1@4+t1(%rip)".

The current USDT implementation in libbpf cannot parse these two formats, causing `bpf_program__attach_usdt()` to fail with -ENOENT (unrecognized register).

This patch series adds support for SIB addressing mode in USDT probes. The main changes include:
- add correct handling logic for SIB-addressed arguments in `parse_usdt_arg`.
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt argument spec.
- change the global variable t1 to a local variable, to prevent the compiler from generating PC-relative addressing mode for it.

Testing shows that the SIB probe correctly generates 8@(%rcx,%rax,8) argument spec and passes all validation checks.

The modification history of this patch series:
Change since v1:
- refactor the code to make it more readable
- modify the commit message to explain why and how

Change since v2:
- fix the `scale` uninitialized error

Change since v3:
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt and pass all test cases.

Do we need to add support for PC-relative USDT argument spec handling in libbpf? I have some interest in this question, but currently have no ideas. Getting offsets based on symbols requires dependency on the symbol table. However, once the binary file is stripped, the symtab will also be removed, which will cause this approach to fail. Does anyone have any thoughts on this?

Jiawei Zhao (1):
  libbpf: fix USDT SIB argument handling causing unrecognized register error

 tools/lib/bpf/usdt.bpf.h                      | 33 +++++++++++++-
 tools/lib/bpf/usdt.c                          | 43 ++++++++++++++++---
 tools/testing/selftests/bpf/Makefile          |  5 +++
 tools/testing/selftests/bpf/prog_tests/usdt.c | 18 +++++---
 4 files changed, 86 insertions(+), 13 deletions(-)

--
2.43.0

5 months, 1 week

[PATCH] selftests/mm: pass filename as input param to VM_PFNMAP tests

by Sudarsan Mahendran

Enable these tests to be run on other pfnmap'ed memory like NVIDIA's EGM. Add '--' as a separator to pass in file path. This allows passing of cmd line arguments to kselftest_harness. Use '/dev/mem' as default filename. Existing test passes: pfnmap TAP version 13 1..6 # Starting 6 tests from 1 test cases. # PASSED: 6 / 6 tests passed. # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0 Pass params to kselftest_harness: pfnmap -r pfnmap:mremap_fixed TAP version 13 1..1 # Starting 1 tests from 1 test cases. # RUN pfnmap.mremap_fixed ... # OK pfnmap.mremap_fixed ok 1 pfnmap.mremap_fixed # PASSED: 1 / 1 tests passed. # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 Pass random file name as input: pfnmap -- /dev/blah TAP version 13 1..6 # Starting 6 tests from 1 test cases. # RUN pfnmap.madvise_disallowed ... # SKIP Cannot open '/dev/blah' Signed-off-by: Sudarsan Mahendran <sudarsanm(a)google.com> --- tools/testing/selftests/mm/pfnmap.c | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/mm/pfnmap.c b/tools/testing/selftests/mm/pfnmap.c index 866ac023baf5..2d4e8b165f91 100644 --- a/tools/testing/selftests/mm/pfnmap.c +++ b/tools/testing/selftests/mm/pfnmap.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0-only /* - * Basic VM_PFNMAP tests relying on mmap() of '/dev/mem' + * Basic VM_PFNMAP tests relying on mmap() of input file provided. + * Use '/dev/mem' as default. * * Copyright 2025, Red Hat, Inc. 
* @@ -25,6 +26,7 @@ #include "vm_util.h" static sigjmp_buf sigjmp_buf_env; +static char *file = "/dev/mem"; static void signal_handler(int sig) { @@ -117,19 +119,19 @@ FIXTURE_SETUP(pfnmap) if (find_ram_target(&self->phys_addr, self->pagesize)) SKIP(return, "Cannot find ram target in '/proc/iomem'\n"); - self->dev_mem_fd = open("/dev/mem", O_RDONLY); + self->dev_mem_fd = open(file, O_RDONLY); if (self->dev_mem_fd < 0) - SKIP(return, "Cannot open '/dev/mem'\n"); + SKIP(return, "Cannot open '%s'\n", file); self->size1 = self->pagesize * 2; self->addr1 = mmap(NULL, self->size1, PROT_READ, MAP_SHARED, self->dev_mem_fd, self->phys_addr); if (self->addr1 == MAP_FAILED) - SKIP(return, "Cannot mmap '/dev/mem'\n"); + SKIP(return, "Cannot mmap '%s'\n", file); /* ... and want to be able to read from them. */ if (test_read_access(self->addr1, self->size1, self->pagesize)) - SKIP(return, "Cannot read-access mmap'ed '/dev/mem'\n"); + SKIP(return, "Cannot read-access mmap'ed '%s'\n", file); self->size2 = 0; self->addr2 = MAP_FAILED; @@ -246,4 +248,14 @@ TEST_F(pfnmap, fork) ASSERT_EQ(ret, 0); } -TEST_HARNESS_MAIN +int main(int argc, char **argv) +{ + for (int i = 1; i < argc; i++) { + if (strcmp(argv[i], "--") == 0) { + if (i + 1 < argc && strlen(argv[i + 1]) > 0) + file = argv[i + 1]; + return test_harness_run(i, argv); + } + } + return test_harness_run(argc, argv); +} -- 2.50.1.565.gc32cd1483b-goog
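
The '--' handling in the new main() above can be isolated into a small helper that is easy to check outside the selftest harness. The sketch below is only an illustration of that splitting rule: `split_at_separator()` is a hypothetical name, and the real patch calls `test_harness_run()` directly instead of returning the count.

```c
#include <string.h>

/*
 * Split argv at the first "--".
 * Returns the number of arguments that precede the separator, or argc when
 * there is no separator. When a non-empty argument follows "--", *extra is
 * pointed at it; otherwise *extra keeps the default the caller stored there.
 */
static int split_at_separator(int argc, char **argv, const char **extra)
{
	for (int i = 1; i < argc; i++) {
		if (strcmp(argv[i], "--") == 0) {
			if (i + 1 < argc && argv[i + 1][0] != '\0')
				*extra = argv[i + 1];
			return i;
		}
	}
	return argc;
}
```

With `pfnmap -r pfnmap:mremap_fixed -- /dev/blah`, the helper returns 3, so only the first three arguments would be handed to the harness, and the file name becomes '/dev/blah'.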

5 months, 1 week

[PATCH] selftests/mm: Fix typos and improve output messages

by Swaraj Gaikwad

From: Swaraj-1925 <swarajgaikwad1925(a)gmail.com> Fixed spelling and grammar issues in test output messages to improve readability. Signed-off-by: swarajgaikwad1925(a)gmail.com --- tools/testing/selftests/mm/Makefile | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index ae6f994d3add..96985c545d16 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -48,10 +48,10 @@ ifneq (,$(wildcard $(KDIR)/Module.symvers)) ifneq (,$(wildcard $(KDIR)/include/linux/page_frag_cache.h)) TEST_GEN_MODS_DIR := page_frag else -PAGE_FRAG_WARNING = "missing page_frag_cache.h, please use a newer kernel" +PAGE_FRAG_WARNING = "Missing page_frag_cache.h, Please use a newer kernel" endif else -PAGE_FRAG_WARNING = "missing Module.symvers, please have the kernel built first" +PAGE_FRAG_WARNING = "Missing Module.symvers, Please build the kernel first" endif TEST_GEN_FILES = cow @@ -202,8 +202,8 @@ ifeq ($(CAN_BUILD_I386)$(CAN_BUILD_X86_64),01) all: warn_32bit_failure warn_32bit_failure: - @echo "Warning: you seem to have a broken 32-bit build" 2>&1; \ - echo "environment. This will reduce test coverage of 64-bit" 2>&1; \ + @echo "Warning: you seem to have a broken 32-bit build environment." 2>&1; \ + echo "This will reduce test coverage of 64-bit" 2>&1; \ echo "kernels. If you are using a Debian-like distribution," 2>&1; \ echo "try:"; 2>&1; \ echo ""; \ -- 2.50.1

5 months, 1 week

[PATCH 0/3] execute PROCMAP_QUERY ioctl under per-vma lock

by Suren Baghdasaryan

With /proc/pid/maps now being read under per-vma lock protection we can reuse parts of that code to execute PROCMAP_QUERY ioctl also without taking mmap_lock. The change is designed to reduce mmap_lock contention and prevent PROCMAP_QUERY ioctl calls from blocking address space updates.

This patchset was split out of the original patchset [1] that introduced per-vma lock usage for /proc/pid/maps reading. It contains PROCMAP_QUERY tests, code refactoring patch to simplify the main change and the actual transition to per-vma lock.

[1] https://lore.kernel.org/all/20250704060727.724817-1-surenb@google.com/

Suren Baghdasaryan (3):
  selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified
  fs/proc/task_mmu: factor out proc_maps_private fields used by PROCMAP_QUERY
  fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks

 fs/proc/internal.h                            |  15 +-
 fs/proc/task_mmu.c                            | 149 ++++++++++++------
 tools/testing/selftests/proc/proc-maps-race.c |  65 ++++++++
 3 files changed, 174 insertions(+), 55 deletions(-)

base-commit: 01da54f10fddf3b01c5a3b80f6b16bbad390c302
--
2.50.1.565.gc32cd1483b-goog

5 months, 1 week

[PATCH] selftests: breakpoints: use suspend_stats to reliably check suspend success

by Moon Hee Lee

The step_after_suspend_test verifies that the system successfully suspended and resumed by setting a timerfd and checking whether the timer fully expired. However, this method is unreliable due to timing races. In practice, the system may take time to enter suspend, during which the timer may expire just before or during the transition. As a result, the remaining time after resume may show non-zero nanoseconds, even if suspend/resume completed successfully. This leads to false test failures. Replace the timer-based check with a read from /sys/power/suspend_stats/success. This counter is incremented only after a full suspend/resume cycle, providing a reliable and race-free indicator. Also remove the unused file descriptor for /sys/power/state, which remained after switching to a system() call to trigger suspend [1]. [1] https://lore.kernel.org/all/20240930224025.2858767-1-yifei.l.liu@oracle.com/ Fixes: c66be905cda2 ("selftests: breakpoints: use remaining time to check if suspend succeed") Signed-off-by: Moon Hee Lee <moonhee.lee.ca(a)gmail.com> --- .../breakpoints/step_after_suspend_test.c | 41 ++++++++++++++----- 1 file changed, 31 insertions(+), 10 deletions(-) diff --git a/tools/testing/selftests/breakpoints/step_after_suspend_test.c b/tools/testing/selftests/breakpoints/step_after_suspend_test.c index 8d275f03e977..8d233ac95696 100644 --- a/tools/testing/selftests/breakpoints/step_after_suspend_test.c +++ b/tools/testing/selftests/breakpoints/step_after_suspend_test.c @@ -127,22 +127,42 @@ int run_test(int cpu) return KSFT_PASS; } +/* + * Reads the suspend success count from sysfs. + * Returns the count on success or exits on failure. 
+ */ +static int get_suspend_success_count_or_fail(void) +{ + FILE *fp; + int val; + + fp = fopen("/sys/power/suspend_stats/success", "r"); + if (!fp) + ksft_exit_fail_msg( + "Failed to open suspend_stats/success: %s\n", + strerror(errno)); + + if (fscanf(fp, "%d", &val) != 1) { + fclose(fp); + ksft_exit_fail_msg( + "Failed to read suspend success count\n"); + } + + fclose(fp); + return val; +} + void suspend(void) { - int power_state_fd; int timerfd; int err; + int count_before; + int count_after; struct itimerspec spec = {}; if (getuid() != 0) ksft_exit_skip("Please run the test as root - Exiting.\n"); - power_state_fd = open("/sys/power/state", O_RDWR); - if (power_state_fd < 0) - ksft_exit_fail_msg( - "open(\"/sys/power/state\") failed %s)\n", - strerror(errno)); - timerfd = timerfd_create(CLOCK_BOOTTIME_ALARM, 0); if (timerfd < 0) ksft_exit_fail_msg("timerfd_create() failed\n"); @@ -152,14 +172,15 @@ void suspend(void) if (err < 0) ksft_exit_fail_msg("timerfd_settime() failed\n"); + count_before = get_suspend_success_count_or_fail(); + system("(echo mem > /sys/power/state) 2> /dev/null"); - timerfd_gettime(timerfd, &spec); - if (spec.it_value.tv_sec != 0 || spec.it_value.tv_nsec != 0) + count_after = get_suspend_success_count_or_fail(); + if (count_after <= count_before) ksft_exit_fail_msg("Failed to enter Suspend state\n"); close(timerfd); - close(power_state_fd); } int main(int argc, char **argv) -- 2.43.0

5 months, 1 week

[PATCH v2] kho: add test for kexec handover

by Mike Rapoport

From: "Mike Rapoport (Microsoft)" <rppt(a)kernel.org> Testing kexec handover requires a kernel driver that will generate some data and preserve it with KHO on the first boot and then restore that data and verify it was preserved properly after kexec. To facilitate such test, along with the kernel driver responsible for data generation, preservation and restoration add a script that runs a kernel in a VM with a minimal /init. The /init enables KHO, loads a kernel image for kexec and runs kexec reboot. After the boot of the kexeced kernel, the driver verifies that the data was properly preserved. Signed-off-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org> --- v2 changes: * fix section mismatch warning in lib/test_kho.c * address Thomas' comments about nolibc and initrd generation v1: https://lore.kernel.org/all/20250727083733.2590139-1-rppt@kernel.org MAINTAINERS | 1 + lib/Kconfig.debug | 21 ++ lib/Makefile | 1 + lib/test_kho.c | 305 +++++++++++++++++++++++++ tools/testing/selftests/kho/arm64.conf | 9 + tools/testing/selftests/kho/init.c | 98 ++++++++ tools/testing/selftests/kho/vmtest.sh | 185 +++++++++++++++ tools/testing/selftests/kho/x86.conf | 7 + 8 files changed, 627 insertions(+) create mode 100644 lib/test_kho.c create mode 100644 tools/testing/selftests/kho/arm64.conf create mode 100644 tools/testing/selftests/kho/init.c create mode 100755 tools/testing/selftests/kho/vmtest.sh create mode 100644 tools/testing/selftests/kho/x86.conf diff --git a/MAINTAINERS b/MAINTAINERS index 10850512c118..7eada657c5e6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13356,6 +13356,7 @@ F: Documentation/admin-guide/mm/kho.rst F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h F: kernel/kexec_handover.c +F: tools/testing/selftests/kho/ KEYS-ENCRYPTED M: Mimi Zohar <zohar(a)linux.ibm.com> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ebe33181b6e6..4f82d38e3c45 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -3225,6 +3225,27 @@ config 
TEST_OBJPOOL If unsure, say N. +config TEST_KEXEC_HANDOVER + bool "Test for Kexec HandOver" + default n + depends on KEXEC_HANDOVER + help + This option enables test for Kexec HandOver (KHO). + The test consists of two parts: saving kernel data before kexec and + restoring the data after kexec and verifying that it was properly + handed over. This test module creates and saves data on the boot of + the first kernel and restores and verifies the data on the boot of + kexec'ed kernel. + + For detailed documentation about KHO, see Documentation/core-api/kho. + + To run the test run: + + tools/testing/selftests/kho/vmtest.sh -h + + If unsure, say N. + + config INT_POW_KUNIT_TEST tristate "Integer exponentiation (int_pow) test" if !KUNIT_ALL_TESTS depends on KUNIT diff --git a/lib/Makefile b/lib/Makefile index c38582f187dd..6a8d00aac3a8 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -102,6 +102,7 @@ obj-$(CONFIG_TEST_HMM) += test_hmm.o obj-$(CONFIG_TEST_FREE_PAGES) += test_free_pages.o obj-$(CONFIG_TEST_REF_TRACKER) += test_ref_tracker.o obj-$(CONFIG_TEST_OBJPOOL) += test_objpool.o +obj-$(CONFIG_TEST_KEXEC_HANDOVER) += test_kho.o obj-$(CONFIG_TEST_FPU) += test_fpu.o test_fpu-y := test_fpu_glue.o test_fpu_impl.o diff --git a/lib/test_kho.c b/lib/test_kho.c new file mode 100644 index 000000000000..c2eb899c3b45 --- /dev/null +++ b/lib/test_kho.c @@ -0,0 +1,305 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Test module for KHO + * Copyright (c) 2025 Microsoft Corporation. + * + * Authors: + * Saurabh Sengar <ssengar(a)microsoft.com> + * Mike Rapoport <rppt(a)kernel.org> + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/mm.h> +#include <linux/gfp.h> +#include <linux/slab.h> +#include <linux/kexec.h> +#include <linux/libfdt.h> +#include <linux/module.h> +#include <linux/printk.h> +#include <linux/vmalloc.h> +#include <linux/kexec_handover.h> + +#include <net/checksum.h> + +#define KHO_TEST_MAGIC 0x4b484f21 /* KHO! 
*/ +#define KHO_TEST_FDT "kho_test" +#define KHO_TEST_COMPAT "kho-test-v1" + +static long max_mem = (PAGE_SIZE << MAX_PAGE_ORDER) * 2; +module_param(max_mem, long, 0644); + +struct kho_test_state { + unsigned int nr_folios; + struct folio **folios; + struct folio *fdt; + __wsum csum; +}; + +static struct kho_test_state kho_test_state; + +static int kho_test_notifier(struct notifier_block *self, unsigned long cmd, + void *v) +{ + struct kho_test_state *state = &kho_test_state; + struct kho_serialization *ser = v; + int err = 0; + + switch (cmd) { + case KEXEC_KHO_ABORT: + return NOTIFY_DONE; + case KEXEC_KHO_FINALIZE: + /* Handled below */ + break; + default: + return NOTIFY_BAD; + } + + err |= kho_preserve_folio(state->fdt); + err |= kho_add_subtree(ser, KHO_TEST_FDT, folio_address(state->fdt)); + + return err ? NOTIFY_BAD : NOTIFY_DONE; +} + +static struct notifier_block kho_test_nb = { + .notifier_call = kho_test_notifier, +}; + +static int kho_test_save_data(struct kho_test_state *state, void *fdt) +{ + phys_addr_t *folios_info __free(kvfree) = NULL; + int err = 0; + + folios_info = kvmalloc_array(state->nr_folios, sizeof(*folios_info), + GFP_KERNEL); + if (!folios_info) + return -ENOMEM; + + for (int i = 0; i < state->nr_folios; i++) { + struct folio *folio = state->folios[i]; + unsigned int order = folio_order(folio); + + folios_info[i] = virt_to_phys(folio_address(folio)) | order; + + err = kho_preserve_folio(folio); + if (err) + return err; + } + + err |= fdt_begin_node(fdt, "data"); + err |= fdt_property(fdt, "nr_folios", &state->nr_folios, + sizeof(state->nr_folios)); + err |= fdt_property(fdt, "folios_info", folios_info, + state->nr_folios * sizeof(*folios_info)); + err |= fdt_property(fdt, "csum", &state->csum, sizeof(state->csum)); + err |= fdt_end_node(fdt); + + return err; +} + +static int kho_test_prepare_fdt(struct kho_test_state *state) +{ + const char compatible[] = KHO_TEST_COMPAT; + unsigned int magic = KHO_TEST_MAGIC; + ssize_t fdt_size; + int 
err = 0; + void *fdt; + + fdt_size = state->nr_folios * sizeof(phys_addr_t) + PAGE_SIZE; + state->fdt = folio_alloc(GFP_KERNEL, get_order(fdt_size)); + if (!state->fdt) + return -ENOMEM; + + fdt = folio_address(state->fdt); + + err |= fdt_create(fdt, fdt_size); + err |= fdt_finish_reservemap(fdt); + + err |= fdt_begin_node(fdt, ""); + err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible)); + err |= fdt_property(fdt, "magic", &magic, sizeof(magic)); + err |= kho_test_save_data(state, fdt); + err |= fdt_end_node(fdt); + + err |= fdt_finish(fdt); + + if (err) + folio_put(state->fdt); + + return err; +} + +static int kho_test_generate_data(struct kho_test_state *state) +{ + size_t alloc_size = 0; + __wsum csum = 0; + + while (alloc_size < max_mem) { + int order = get_random_u32() % NR_PAGE_ORDERS; + struct folio *folio; + unsigned int size; + void *addr; + + /* cap allocation so that we won't exceed max_mem */ + if (alloc_size + (PAGE_SIZE << order) > max_mem) { + order = get_order(max_mem - alloc_size); + if (order) + order--; + } + size = PAGE_SIZE << order; + + folio = folio_alloc(GFP_KERNEL | __GFP_NORETRY, order); + if (!folio) + goto err_free_folios; + + state->folios[state->nr_folios++] = folio; + addr = folio_address(folio); + get_random_bytes(addr, size); + csum = csum_partial(addr, size, csum); + alloc_size += size; + } + + state->csum = csum; + return 0; + +err_free_folios: + for (int i = 0; i < state->nr_folios; i++) + folio_put(state->folios[i]); + return -ENOMEM; +} + +static int kho_test_save(void) +{ + struct kho_test_state *state = &kho_test_state; + struct folio **folios __free(kvfree) = NULL; + unsigned long max_nr; + int err; + + max_mem = PAGE_ALIGN(max_mem); + max_nr = max_mem >> PAGE_SHIFT; + + folios = kvmalloc_array(max_nr, sizeof(*state->folios), GFP_KERNEL); + if (!folios) + return -ENOMEM; + state->folios = folios; + + err = kho_test_generate_data(state); + if (err) + return err; + + err = kho_test_prepare_fdt(state); + if 
(err) + return err; + + return register_kho_notifier(&kho_test_nb); +} + +static int kho_test_restore_data(const void *fdt, int node) +{ + const unsigned int *nr_folios; + const phys_addr_t *folios_info; + const __wsum *old_csum; + __wsum csum = 0; + int len; + + node = fdt_path_offset(fdt, "/data"); + + nr_folios = fdt_getprop(fdt, node, "nr_folios", &len); + if (!nr_folios || len != sizeof(*nr_folios)) + return -EINVAL; + + old_csum = fdt_getprop(fdt, node, "csum", &len); + if (!old_csum || len != sizeof(*old_csum)) + return -EINVAL; + + folios_info = fdt_getprop(fdt, node, "folios_info", &len); + if (!folios_info || len != sizeof(*folios_info) * *nr_folios) + return -EINVAL; + + for (int i = 0; i < *nr_folios; i++) { + unsigned int order = folios_info[i] & ~PAGE_MASK; + phys_addr_t phys = folios_info[i] & PAGE_MASK; + unsigned int size = PAGE_SIZE << order; + struct folio *folio; + + folio = kho_restore_folio(phys); + if (!folio) + break; + + if (folio_order(folio) != order) + break; + + csum = csum_partial(folio_address(folio), size, csum); + folio_put(folio); + } + + if (csum != *old_csum) + return -EINVAL; + + return 0; +} + +static int kho_test_restore(phys_addr_t fdt_phys) +{ + void *fdt = phys_to_virt(fdt_phys); + const unsigned int *magic; + int node, len, err; + + node = fdt_path_offset(fdt, "/"); + if (node < 0) + return -EINVAL; + + if (fdt_node_check_compatible(fdt, node, KHO_TEST_COMPAT)) + return -EINVAL; + + magic = fdt_getprop(fdt, node, "magic", &len); + if (!magic || len != sizeof(*magic)) + return -EINVAL; + + if (*magic != KHO_TEST_MAGIC) + return -EINVAL; + + err = kho_test_restore_data(fdt, node); + if (err) + return err; + + pr_info("KHO restore succeeded\n"); + return 0; +} + +static int __init kho_test_init(void) +{ + phys_addr_t fdt_phys; + int err; + + err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys); + if (!err) + return kho_test_restore(fdt_phys); + + if (err != -ENOENT) { + pr_warn("failed to retrieve %s FDT: %d\n", KHO_TEST_FDT, 
err); + return err; + } + + return kho_test_save(); +} +module_init(kho_test_init); + +static void kho_test_cleanup(void) +{ + for (int i = 0; i < kho_test_state.nr_folios; i++) + folio_put(kho_test_state.folios[i]); + + kvfree(kho_test_state.folios); +} + +static void __exit kho_test_exit(void) +{ + unregister_kho_notifier(&kho_test_nb); + kho_test_cleanup(); +} +module_exit(kho_test_exit); + +MODULE_AUTHOR("Mike Rapoport <rppt(a)kernel.org>"); +MODULE_DESCRIPTION("KHO test module"); +MODULE_LICENSE("GPL"); diff --git a/tools/testing/selftests/kho/arm64.conf b/tools/testing/selftests/kho/arm64.conf new file mode 100644 index 000000000000..ee696807cd35 --- /dev/null +++ b/tools/testing/selftests/kho/arm64.conf @@ -0,0 +1,9 @@ +QEMU_CMD="qemu-system-aarch64 -M virt -cpu max" +QEMU_KCONFIG=" +CONFIG_SERIAL_AMBA_PL010=y +CONFIG_SERIAL_AMBA_PL010_CONSOLE=y +CONFIG_SERIAL_AMBA_PL011=y +CONFIG_SERIAL_AMBA_PL011_CONSOLE=y +" +KERNEL_IMAGE="Image" +KERNEL_CMDLINE="console=ttyAMA0" diff --git a/tools/testing/selftests/kho/init.c b/tools/testing/selftests/kho/init.c new file mode 100644 index 000000000000..8044ca56fff5 --- /dev/null +++ b/tools/testing/selftests/kho/init.c @@ -0,0 +1,98 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <errno.h> +#include <stdio.h> +#include <unistd.h> +#include <fcntl.h> +#include <sys/syscall.h> +#include <sys/mount.h> +#include <sys/reboot.h> + +/* from arch/x86/include/asm/setup.h */ +#define COMMAND_LINE_SIZE 2048 + +/* from include/linux/kexex.h */ +#define KEXEC_FILE_NO_INITRAMFS 0x00000004 + +#define KHO_FINILIZE "/debugfs/kho/out/finalize" +#define KERNEL_IMAGE "/kernel" + +static int mount_filesystems(void) +{ + if (mount("debugfs", "/debugfs", "debugfs", 0, NULL) < 0) + return -1; + + return mount("proc", "/proc", "proc", 0, NULL); +} + +static int kho_enable(void) +{ + const char enable[] = "1"; + int fd; + + fd = open(KHO_FINILIZE, O_RDWR); + if (fd < 0) + return -1; + + if (write(fd, enable, sizeof(enable)) != sizeof(enable)) 
+ return 1; + + close(fd); + return 0; +} + +static long kexec_file_load(int kernel_fd, int initrd_fd, + unsigned long cmdline_len, const char *cmdline, + unsigned long flags) +{ + return syscall(__NR_kexec_file_load, kernel_fd, initrd_fd, cmdline_len, + cmdline, flags); +} + +static int kexec_load(void) +{ + char cmdline[COMMAND_LINE_SIZE]; + ssize_t len; + int fd, err; + + fd = open("/proc/cmdline", O_RDONLY); + if (fd < 0) + return -1; + + len = read(fd, cmdline, sizeof(cmdline)); + close(fd); + if (len < 0) + return -1; + + /* replace \n with \0 */ + cmdline[len - 1] = 0; + fd = open(KERNEL_IMAGE, O_RDONLY); + if (fd < 0) + return -1; + + err = kexec_file_load(fd, -1, len, cmdline, KEXEC_FILE_NO_INITRAMFS); + close(fd); + + return err ? : 0; +} + +int main(int argc, char *argv[]) +{ + if (mount_filesystems()) + goto err_reboot; + + if (kho_enable()) + goto err_reboot; + + if (kexec_load()) + goto err_reboot; + + if (reboot(RB_KEXEC)) + goto err_reboot; + + return 0; + +err_reboot: + reboot(RB_AUTOBOOT); + return -1; +} diff --git a/tools/testing/selftests/kho/vmtest.sh b/tools/testing/selftests/kho/vmtest.sh new file mode 100755 index 000000000000..3f6c17166846 --- /dev/null +++ b/tools/testing/selftests/kho/vmtest.sh @@ -0,0 +1,185 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +set -ue + +CROSS_COMPILE="${CROSS_COMPILE:-""}" + +test_dir=$(realpath "$(dirname "$0")") +kernel_dir=$(realpath "$test_dir/../../../..") + +tmp_dir=$(mktemp -d /tmp/kho-test.XXXXXXXX) +headers_dir="$tmp_dir/usr" +initrd="$tmp_dir/initrd.cpio" + +source "$test_dir/../kselftest/ktap_helpers.sh" + +function usage() { + cat <<EOF +$0 [-d build_dir] [-j jobs] [-t target_arch] [-h] +Options: + -d) path to the kernel build directory + -j) number of jobs for compilation, similar to -j in make + -t) run test for target_arch, requires CROSS_COMPILE set + supported targets: aarch64, x86_64 + -h) display this help +EOF +} + +function cleanup() { + rm -fr "$tmp_dir" + ktap_finished +} +trap 
cleanup EXIT + +function skip() { + local msg=${1:-""} + + ktap_test_skip "$msg" + exit "$KSFT_SKIP" +} + +function fail() { + local msg=${1:-""} + + ktap_test_fail "$msg" + exit "$KSFT_FAIL" +} + +function build_kernel() { + local build_dir=$1 + local make_cmd=$2 + local arch_kconfig=$3 + local kimage=$4 + + local kho_config="$tmp_dir/kho.config" + local kconfig="$build_dir/.config" + + # enable initrd, KHO and KHO test in kernel configuration + tee "$kconfig" > "$kho_config" <<EOF +CONFIG_BLK_DEV_INITRD=y +CONFIG_KEXEC_HANDOVER=y +CONFIG_TEST_KEXEC_HANDOVER=y +CONFIG_DEBUG_KERNEL=y +CONFIG_DEBUG_VM=y +$arch_kconfig +EOF + + make_cmd="$make_cmd -C $kernel_dir O=$build_dir" + $make_cmd olddefconfig + + # verify that kernel confiration has all necessary options + while read -r opt ; do + grep "$opt" "$kconfig" &>/dev/null || skip "$opt is missing" + done < "$kho_config" + + $make_cmd "$kimage" + $make_cmd headers_install INSTALL_HDR_PATH="$headers_dir" +} + +function mkinitrd() { + local kernel=$1 + + "$CROSS_COMPILE"gcc -s -static -Os -nostdinc -nostdlib \ + -fno-asynchronous-unwind-tables -fno-ident \ + -I "$headers_dir/include" \ + -I "$kernel_dir/tools/include/nolibc" \ + -o "$tmp_dir/init" "$test_dir/init.c" + + cat > "$tmp_dir/cpio_list" <<EOF +dir /dev 0755 0 0 +dir /proc 0755 0 0 +dir /debugfs 0755 0 0 +nod /dev/console 0600 0 0 c 5 1 +file /init $tmp_dir/init 0755 0 0 +file /kernel $kernel 0644 0 0 +EOF + + "$build_dir/usr/gen_init_cpio" "$tmp_dir/cpio_list" > "$initrd" +} + +function run_qemu() { + local qemu_cmd=$1 + local cmdline=$2 + local kernel=$3 + local serial="$tmp_dir/qemu.serial" + + cmdline="$cmdline kho=on panic=-1" + + $qemu_cmd -m 1G -smp 2 -no-reboot -nographic -nodefaults \ + -accel kvm -accel hvf -accel tcg \ + -serial file:"$serial" \ + -append "$cmdline" \ + -kernel "$kernel" \ + -initrd "$initrd" + + grep "KHO restore succeeded" "$serial" &> /dev/null || fail "KHO failed" +} + +function target_to_arch() { + local target=$1 + + case 
$target in + aarch64) echo "arm64" ;; + x86_64) echo "x86" ;; + *) skip "architecture $target is not supported" + esac +} + +function main() { + local build_dir="$kernel_dir/.kho" + local jobs=$(($(nproc) * 2)) + local target="$(uname -m)" + + # skip the test if any of the preparation steps fails + set -o errtrace + trap skip ERR + + while getopts 'hd:j:t:' opt; do + case $opt in + d) + build_dir="$OPTARG" + ;; + j) + jobs="$OPTARG" + ;; + t) + target="$OPTARG" + ;; + h) + usage + exit 0 + ;; + *) + echo Unknown argument "$opt" + usage + exit 1 + ;; + esac + done + + ktap_print_header + ktap_set_plan 1 + + if [[ "$target" != "$(uname -m)" ]] && [[ -z "$CROSS_COMPILE" ]]; then + skip "Cross-platform testing needs to specify CROSS_COMPILE" + fi + + mkdir -p "$build_dir" + local arch=$(target_to_arch "$target") + source "$test_dir/$arch.conf" + + # build the kernel and create initrd + # initrd includes the kernel image that will be kexec'ed + local make_cmd="make ARCH=$arch CROSS_COMPILE=$CROSS_COMPILE -j$jobs" + build_kernel "$build_dir" "$make_cmd" "$QEMU_KCONFIG" "$KERNEL_IMAGE" + + local kernel="$build_dir/arch/$arch/boot/$KERNEL_IMAGE" + mkinitrd "$kernel" + + run_qemu "$QEMU_CMD" "$KERNEL_CMDLINE" "$kernel" + + ktap_test_pass "KHO succeeded" +} + +main "$@" diff --git a/tools/testing/selftests/kho/x86.conf b/tools/testing/selftests/kho/x86.conf new file mode 100644 index 000000000000..b419e610ca22 --- /dev/null +++ b/tools/testing/selftests/kho/x86.conf @@ -0,0 +1,7 @@ +QEMU_CMD=qemu-system-x86_64 +QEMU_KCONFIG=" +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_8250_CONSOLE=y +" +KERNEL_IMAGE="bzImage" +KERNEL_CMDLINE="console=ttyS0" base-commit: 89be9a83ccf1f88522317ce02f854f30d6115c41 -- 2.47.2

5 months, 1 week


[PATCH net 0/2] bonding: fix negotiation flapping in 802.3ad passive mode

by Hangbin Liu

This patch fixes unstable LACP negotiation when bonding is configured in
passive mode (`lacp_active=off`).

Previously, the actor would stop sending LACPDUs after the initial negotiation
succeeded, so the partner timed out and restarted the negotiation cycle. This
resulted in continuous LACP state flapping.

The fix ensures the passive actor starts sending periodic LACPDUs after
receiving the first LACPDU from the partner, in accordance with IEEE
802.1AX-2020 section 6.4.1.

Off topic: Although this patch addresses a functional bug and could be
considered for `net`, I'm slightly concerned about potential regressions, as
it changes the current bonding LACP protocol behavior. It might be safer to
merge this through `net-next` first to allow broader testing. Thoughts?

Hangbin Liu (2):
  bonding: send LACPDUs periodically in passive mode after receiving
    partner's LACPDU
  selftests: bonding: add test for passive LACP mode

 drivers/net/bonding/bond_3ad.c                | 72 ++++++++++----
 drivers/net/bonding/bond_options.c            |  1 +
 include/net/bond_3ad.h                        |  1 +
 .../selftests/drivers/net/bonding/Makefile    |  3 +-
 .../drivers/net/bonding/bond_passive_lacp.sh  | 93 +++++++++++++++++++
 5 files changed, 151 insertions(+), 19 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_passive_lacp.sh

-- 
2.46.0
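
The passive-mode behaviour this cover letter describes can be pictured with a small toy model. This is only a sketch in Python, not the state machine in bond_3ad.c: the class and method names are made up for illustration, and partner timeouts are not modelled at all.

```python
from dataclasses import dataclass


@dataclass
class PassivePort:
    """Toy model of one 802.3ad port with lacp_active=off (hypothetical names)."""

    partner_seen: bool = False
    lacpdus_sent: int = 0

    def receive_lacpdu(self) -> None:
        # Hearing from the partner is what unlocks periodic transmission.
        self.partner_seen = True

    def periodic_tick(self) -> bool:
        """Run once per periodic interval; return True if an LACPDU was sent."""
        if not self.partner_seen:
            # A passive actor stays silent until the partner speaks first.
            return False
        # After the first partner LACPDU, keep sending on every interval so
        # the partner does not time out and restart negotiation.
        self.lacpdus_sent += 1
        return True
```

In this model a port that has never heard from its partner sends nothing, and once `receive_lacpdu()` has been called every later tick sends one LACPDU, which is the behaviour the fix is meant to guarantee.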

5 months, 1 week


[PATCH] selftests: bpf: Add missing symbol declarations to common header

by chenyuan_fl＠163.com

From: Yuan Chen <chenyuan(a)kylinos.cn>

Fix implicit function declaration errors in bpf_qdisc_xxx.c by adding the
required kernel symbol declarations to the shared header file
bpf_qdisc_common.h. This ensures all qdisc BPF programs can properly resolve
these kernel functions.

The added declarations include:
- bpf_qdisc_skb_drop
- bpf_qdisc_bstats_update
- bpf_kfree_skb
- bpf_skb_get_hash
- bpf_qdisc_watchdog_schedule

Using a common header prevents duplication and ensures consistency across
different qdisc implementations.

Signed-off-by: Yuan Chen <chenyuan(a)kylinos.cn>
---
 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
index 3754f581b328..4c896b3e0f65 100644
--- a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
@@ -14,6 +14,15 @@
 
 struct bpf_sk_buff_ptr;
 
+extern void bpf_qdisc_skb_drop(struct sk_buff *skb,
+			       struct bpf_sk_buff_ptr *to_free_list) __ksym;
+extern void bpf_qdisc_bstats_update(struct Qdisc *sch,
+				    const struct sk_buff *skb) __ksym;
+extern void bpf_kfree_skb(struct sk_buff *skb) __ksym;
+extern u32 bpf_skb_get_hash(struct sk_buff *skb) __ksym;
+extern void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire,
+					u64 delta_ns) __ksym;
+
 static struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
 {
 	return (struct qdisc_skb_cb *)skb->cb;
-- 
2.47.3

5 months, 1 week


[PATCH] kho: add test for kexec handover

by Mike Rapoport

From: "Mike Rapoport (Microsoft)" <rppt(a)kernel.org> Testing kexec handover requires a kernel driver that will generate some data and preserve it with KHO on the first boot and then restore that data and verify it was preserved properly after kexec. To facilitate such test, along with the kernel driver responsible for data generation, preservation and restoration add a script that runs a kernel in a VM with a minimal /init. The /init enables KHO, loads a kernel image for kexec and runs kexec reboot. After the boot of the kexeced kernel, the driver verifies that the data was properly preserved. Signed-off-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org> --- MAINTAINERS | 1 + lib/Kconfig.debug | 21 ++ lib/Makefile | 1 + lib/test_kho.c | 305 +++++++++++++++++++++++++ tools/testing/selftests/kho/arm64.conf | 9 + tools/testing/selftests/kho/init.c | 100 ++++++++ tools/testing/selftests/kho/vmtest.sh | 183 +++++++++++++++ tools/testing/selftests/kho/x86.conf | 7 + 8 files changed, 627 insertions(+) create mode 100644 lib/test_kho.c create mode 100644 tools/testing/selftests/kho/arm64.conf create mode 100644 tools/testing/selftests/kho/init.c create mode 100755 tools/testing/selftests/kho/vmtest.sh create mode 100644 tools/testing/selftests/kho/x86.conf diff --git a/MAINTAINERS b/MAINTAINERS index 10850512c118..7eada657c5e6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13356,6 +13356,7 @@ F: Documentation/admin-guide/mm/kho.rst F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h F: kernel/kexec_handover.c +F: tools/testing/selftests/kho/ KEYS-ENCRYPTED M: Mimi Zohar <zohar(a)linux.ibm.com> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ebe33181b6e6..4f82d38e3c45 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -3225,6 +3225,27 @@ config TEST_OBJPOOL If unsure, say N. +config TEST_KEXEC_HANDOVER + bool "Test for Kexec HandOver" + default n + depends on KEXEC_HANDOVER + help + This option enables test for Kexec HandOver (KHO). 
+ The test consists of two parts: saving kernel data before kexec and + restoring the data after kexec and verifying that it was properly + handed over. This test module creates and saves data on the boot of + the first kernel and restores and verifies the data on the boot of + kexec'ed kernel. + + For detailed documentation about KHO, see Documentation/core-api/kho. + + To run the test run: + + tools/testing/selftests/kho/vmtest.sh -h + + If unsure, say N. + + config INT_POW_KUNIT_TEST tristate "Integer exponentiation (int_pow) test" if !KUNIT_ALL_TESTS depends on KUNIT diff --git a/lib/Makefile b/lib/Makefile index c38582f187dd..6a8d00aac3a8 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -102,6 +102,7 @@ obj-$(CONFIG_TEST_HMM) += test_hmm.o obj-$(CONFIG_TEST_FREE_PAGES) += test_free_pages.o obj-$(CONFIG_TEST_REF_TRACKER) += test_ref_tracker.o obj-$(CONFIG_TEST_OBJPOOL) += test_objpool.o +obj-$(CONFIG_TEST_KEXEC_HANDOVER) += test_kho.o obj-$(CONFIG_TEST_FPU) += test_fpu.o test_fpu-y := test_fpu_glue.o test_fpu_impl.o diff --git a/lib/test_kho.c b/lib/test_kho.c new file mode 100644 index 000000000000..f5fe39c7c2b1 --- /dev/null +++ b/lib/test_kho.c @@ -0,0 +1,305 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Test module for KHO + * Copyright (c) 2025 Microsoft Corporation. + * + * Authors: + * Saurabh Sengar <ssengar(a)microsoft.com> + * Mike Rapoport <rppt(a)kernel.org> + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include <linux/mm.h> +#include <linux/gfp.h> +#include <linux/slab.h> +#include <linux/kexec.h> +#include <linux/libfdt.h> +#include <linux/module.h> +#include <linux/printk.h> +#include <linux/vmalloc.h> +#include <linux/kexec_handover.h> + +#include <net/checksum.h> + +#define KHO_TEST_MAGIC 0x4b484f21 /* KHO! 
*/ +#define KHO_TEST_FDT "kho_test" +#define KHO_TEST_COMPAT "kho-test-v1" + +static long max_mem = (PAGE_SIZE << MAX_PAGE_ORDER) * 2; +module_param(max_mem, long, 0644); + +struct kho_test_state { + unsigned int nr_folios; + struct folio **folios; + struct folio *fdt; + __wsum csum; +}; + +static struct kho_test_state kho_test_state; + +static int kho_test_notifier(struct notifier_block *self, unsigned long cmd, + void *v) +{ + struct kho_test_state *state = &kho_test_state; + struct kho_serialization *ser = v; + int err = 0; + + switch (cmd) { + case KEXEC_KHO_ABORT: + return NOTIFY_DONE; + case KEXEC_KHO_FINALIZE: + /* Handled below */ + break; + default: + return NOTIFY_BAD; + } + + err |= kho_preserve_folio(state->fdt); + err |= kho_add_subtree(ser, KHO_TEST_FDT, folio_address(state->fdt)); + + return err ? NOTIFY_BAD : NOTIFY_DONE; +} + +static struct notifier_block kho_test_nb = { + .notifier_call = kho_test_notifier, +}; + +static int kho_test_save_data(struct kho_test_state *state, void *fdt) +{ + phys_addr_t *folios_info __free(kvfree) = NULL; + int err = 0; + + folios_info = kvmalloc_array(state->nr_folios, sizeof(*folios_info), + GFP_KERNEL); + if (!folios_info) + return -ENOMEM; + + for (int i = 0; i < state->nr_folios; i++) { + struct folio *folio = state->folios[i]; + unsigned int order = folio_order(folio); + + folios_info[i] = virt_to_phys(folio_address(folio)) | order; + + err = kho_preserve_folio(folio); + if (err) + return err; + } + + err |= fdt_begin_node(fdt, "data"); + err |= fdt_property(fdt, "nr_folios", &state->nr_folios, + sizeof(state->nr_folios)); + err |= fdt_property(fdt, "folios_info", folios_info, + state->nr_folios * sizeof(*folios_info)); + err |= fdt_property(fdt, "csum", &state->csum, sizeof(state->csum)); + err |= fdt_end_node(fdt); + + return err; +} + +static int kho_test_prepare_fdt(struct kho_test_state *state) +{ + const char compatible[] = KHO_TEST_COMPAT; + unsigned int magic = KHO_TEST_MAGIC; + ssize_t fdt_size; + int 
err = 0; + void *fdt; + + fdt_size = state->nr_folios * sizeof(phys_addr_t) + PAGE_SIZE; + state->fdt = folio_alloc(GFP_KERNEL, get_order(fdt_size)); + if (!state->fdt) + return -ENOMEM; + + fdt = folio_address(state->fdt); + + err |= fdt_create(fdt, fdt_size); + err |= fdt_finish_reservemap(fdt); + + err |= fdt_begin_node(fdt, ""); + err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible)); + err |= fdt_property(fdt, "magic", &magic, sizeof(magic)); + err |= kho_test_save_data(state, fdt); + err |= fdt_end_node(fdt); + + err |= fdt_finish(fdt); + + if (err) + folio_put(state->fdt); + + return err; +} + +static int kho_test_generate_data(struct kho_test_state *state) +{ + size_t alloc_size = 0; + __wsum csum = 0; + + while (alloc_size < max_mem) { + int order = get_random_u32() % NR_PAGE_ORDERS; + struct folio *folio; + unsigned int size; + void *addr; + + /* cap allocation so that we won't exceed max_mem */ + if (alloc_size + (PAGE_SIZE << order) > max_mem) { + order = get_order(max_mem - alloc_size); + if (order) + order--; + } + size = PAGE_SIZE << order; + + folio = folio_alloc(GFP_KERNEL | __GFP_NORETRY, order); + if (!folio) + goto err_free_folios; + + state->folios[state->nr_folios++] = folio; + addr = folio_address(folio); + get_random_bytes(addr, size); + csum = csum_partial(addr, size, csum); + alloc_size += size; + } + + state->csum = csum; + return 0; + +err_free_folios: + for (int i = 0; i < state->nr_folios; i++) + folio_put(state->folios[i]); + return -ENOMEM; +} + +static int kho_test_save(void) +{ + struct kho_test_state *state = &kho_test_state; + struct folio **folios __free(kvfree) = NULL; + unsigned long max_nr; + int err; + + max_mem = PAGE_ALIGN(max_mem); + max_nr = max_mem >> PAGE_SHIFT; + + folios = kvmalloc_array(max_nr, sizeof(*state->folios), GFP_KERNEL); + if (!folios) + return -ENOMEM; + state->folios = folios; + + err = kho_test_generate_data(state); + if (err) + return err; + + err = kho_test_prepare_fdt(state); + if 
(err) + return err; + + return register_kho_notifier(&kho_test_nb); +} + +static int __init kho_test_restore_data(const void *fdt, int node) +{ + const unsigned int *nr_folios; + const phys_addr_t *folios_info; + const __wsum *old_csum; + __wsum csum = 0; + int len; + + node = fdt_path_offset(fdt, "/data"); + + nr_folios = fdt_getprop(fdt, node, "nr_folios", &len); + if (!nr_folios || len != sizeof(*nr_folios)) + return -EINVAL; + + old_csum = fdt_getprop(fdt, node, "csum", &len); + if (!old_csum || len != sizeof(*old_csum)) + return -EINVAL; + + folios_info = fdt_getprop(fdt, node, "folios_info", &len); + if (!folios_info || len != sizeof(*folios_info) * *nr_folios) + return -EINVAL; + + for (int i = 0; i < *nr_folios; i++) { + unsigned int order = folios_info[i] & ~PAGE_MASK; + phys_addr_t phys = folios_info[i] & PAGE_MASK; + unsigned int size = PAGE_SIZE << order; + struct folio *folio; + + folio = kho_restore_folio(phys); + if (!folio) + break; + + if (folio_order(folio) != order) + break; + + csum = csum_partial(folio_address(folio), size, csum); + folio_put(folio); + } + + if (csum != *old_csum) + return -EINVAL; + + return 0; +} + +static int kho_test_restore(phys_addr_t fdt_phys) +{ + void *fdt = phys_to_virt(fdt_phys); + const unsigned int *magic; + int node, len, err; + + node = fdt_path_offset(fdt, "/"); + if (node < 0) + return -EINVAL; + + if (fdt_node_check_compatible(fdt, node, KHO_TEST_COMPAT)) + return -EINVAL; + + magic = fdt_getprop(fdt, node, "magic", &len); + if (!magic || len != sizeof(*magic)) + return -EINVAL; + + if (*magic != KHO_TEST_MAGIC) + return -EINVAL; + + err = kho_test_restore_data(fdt, node); + if (err) + return err; + + pr_info("KHO restore succeeded\n"); + return 0; +} + +static int __init kho_test_init(void) +{ + phys_addr_t fdt_phys; + int err; + + err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys); + if (!err) + return kho_test_restore(fdt_phys); + + if (err != -ENOENT) { + pr_warn("failed to retrieve %s FDT: %d\n", 
KHO_TEST_FDT, err); + return err; + } + + return kho_test_save(); +} +module_init(kho_test_init); + +static void kho_test_cleanup(void) +{ + for (int i = 0; i < kho_test_state.nr_folios; i++) + folio_put(kho_test_state.folios[i]); + + kvfree(kho_test_state.folios); +} + +static void __exit kho_test_exit(void) +{ + unregister_kho_notifier(&kho_test_nb); + kho_test_cleanup(); +} +module_exit(kho_test_exit); + +MODULE_AUTHOR("Mike Rapoport <rppt(a)kernel.org>"); +MODULE_DESCRIPTION("KHO test module"); +MODULE_LICENSE("GPL"); diff --git a/tools/testing/selftests/kho/arm64.conf b/tools/testing/selftests/kho/arm64.conf new file mode 100644 index 000000000000..ee696807cd35 --- /dev/null +++ b/tools/testing/selftests/kho/arm64.conf @@ -0,0 +1,9 @@ +QEMU_CMD="qemu-system-aarch64 -M virt -cpu max" +QEMU_KCONFIG=" +CONFIG_SERIAL_AMBA_PL010=y +CONFIG_SERIAL_AMBA_PL010_CONSOLE=y +CONFIG_SERIAL_AMBA_PL011=y +CONFIG_SERIAL_AMBA_PL011_CONSOLE=y +" +KERNEL_IMAGE="Image" +KERNEL_CMDLINE="console=ttyAMA0" diff --git a/tools/testing/selftests/kho/init.c b/tools/testing/selftests/kho/init.c new file mode 100644 index 000000000000..8034e24c6bf6 --- /dev/null +++ b/tools/testing/selftests/kho/init.c @@ -0,0 +1,100 @@ +// SPDX-License-Identifier: GPL-2.0 + +#ifndef NOLIBC +#include <errno.h> +#include <stdio.h> +#include <unistd.h> +#include <fcntl.h> +#include <syscall.h> +#include <sys/mount.h> +#include <sys/reboot.h> +#endif + +/* from arch/x86/include/asm/setup.h */ +#define COMMAND_LINE_SIZE 2048 + +/* from include/linux/kexex.h */ +#define KEXEC_FILE_NO_INITRAMFS 0x00000004 + +#define KHO_FINILIZE "/debugfs/kho/out/finalize" +#define KERNEL_IMAGE "/kernel" + +static int mount_filesystems(void) +{ + if (mount("debugfs", "/debugfs", "debugfs", 0, NULL) < 0) + return -1; + + return mount("proc", "/proc", "proc", 0, NULL); +} + +static int kho_enable(void) +{ + const char enable[] = "1"; + int fd; + + fd = open(KHO_FINILIZE, O_RDWR); + if (fd < 0) + return -1; + + if (write(fd, enable, 
sizeof(enable)) != sizeof(enable)) + return 1; + + close(fd); + return 0; +} + +static long kexec_file_load(int kernel_fd, int initrd_fd, + unsigned long cmdline_len, const char *cmdline, + unsigned long flags) +{ + return syscall(__NR_kexec_file_load, kernel_fd, initrd_fd, cmdline_len, + cmdline, flags); +} + +static int kexec_load(void) +{ + char cmdline[COMMAND_LINE_SIZE]; + ssize_t len; + int fd, err; + + fd = open("/proc/cmdline", O_RDONLY); + if (fd < 0) + return -1; + + len = read(fd, cmdline, sizeof(cmdline)); + close(fd); + if (len < 0) + return -1; + + /* replace \n with \0 */ + cmdline[len - 1] = 0; + fd = open(KERNEL_IMAGE, O_RDONLY); + if (fd < 0) + return -1; + + err = kexec_file_load(fd, -1, len, cmdline, KEXEC_FILE_NO_INITRAMFS); + close(fd); + + return err ? : 0; +} + +int main(int argc, char *argv[]) +{ + if (mount_filesystems()) + goto err_reboot; + + if (kho_enable()) + goto err_reboot; + + if (kexec_load()) + goto err_reboot; + + if (reboot(RB_KEXEC)) + goto err_reboot; + + return 0; + +err_reboot: + reboot(RB_AUTOBOOT); + return -1; +} diff --git a/tools/testing/selftests/kho/vmtest.sh b/tools/testing/selftests/kho/vmtest.sh new file mode 100755 index 000000000000..ec70a17bd476 --- /dev/null +++ b/tools/testing/selftests/kho/vmtest.sh @@ -0,0 +1,183 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +set -ue + +CROSS_COMPILE="${CROSS_COMPILE:-""}" + +test_dir=$(realpath "$(dirname "$0")") +kernel_dir=$(realpath "$test_dir/../../../..") + +tmp_dir=$(mktemp -d /tmp/kho-test.XXXXXXXX) +headers_dir="$tmp_dir/usr" +initrd_dir="$tmp_dir/initrd" +initrd="$tmp_dir/initrd.cpio" + +source "$test_dir/../kselftest/ktap_helpers.sh" + +function usage() { + cat <<EOF +$0 [-d build_dir] [-j jobs] [-t target_arch] [-h] +Options: + -d) path to the kernel build directory + -j) number of jobs for compilation, similar to -j in make + -t) run test for target_arch, requires CROSS_COMPILE set + supported targets: aarch64, x86_64 + -h) display this help +EOF +} + 
+function cleanup() { + rm -fr "$tmp_dir" + ktap_finished +} +trap cleanup EXIT + +function skip() { + local msg=${1:-""} + + ktap_test_skip "$msg" + exit "$KSFT_SKIP" +} + +function fail() { + local msg=${1:-""} + + ktap_test_fail "$msg" + exit "$KSFT_FAIL" +} + +function build_kernel() { + local build_dir=$1 + local make_cmd=$2 + local arch_kconfig=$3 + local kimage=$4 + + local kho_config="$tmp_dir/kho.config" + local kconfig="$build_dir/.config" + + # enable initrd, KHO and KHO test in kernel configuration + tee "$kconfig" > "$kho_config" <<EOF +CONFIG_BLK_DEV_INITRD=y +CONFIG_KEXEC_HANDOVER=y +CONFIG_TEST_KEXEC_HANDOVER=y +CONFIG_DEBUG_KERNEL=y +CONFIG_DEBUG_VM=y +$arch_kconfig +EOF + + make_cmd="$make_cmd -C $kernel_dir O=$build_dir" + $make_cmd olddefconfig + + # verify that kernel confiration has all necessary options + while read -r opt ; do + grep "$opt" "$kconfig" &>/dev/null || skip "$opt is missing" + done < "$kho_config" + + $make_cmd "$kimage" + $make_cmd headers_install INSTALL_HDR_PATH="$headers_dir" +} + +function mkinitrd() { + local kernel=$1 + + mkdir -p "$initrd_dir"/{dev,debugfs,proc} + sudo mknod "$initrd_dir/dev/console" c 5 1 + + "$CROSS_COMPILE"gcc -s -static -Os -nostdinc -I"$headers_dir/include" \ + -fno-asynchronous-unwind-tables -fno-ident -nostdlib \ + -include "$test_dir/../../../include/nolibc/nolibc.h" \ + -o "$initrd_dir/init" "$test_dir/init.c" \ + + cp "$kernel" "$initrd_dir/kernel" + + pushd "$initrd_dir" &>/dev/null + find . 
| cpio -H newc --create > "$initrd" 2>/dev/null + popd &>/dev/null +} + +function run_qemu() { + local qemu_cmd=$1 + local cmdline=$2 + local kernel=$3 + local serial="$tmp_dir/qemu.serial" + + cmdline="$cmdline kho=on panic=-1" + + $qemu_cmd -m 1G -smp 2 -no-reboot -nographic -nodefaults \ + -accel kvm -accel hvf -accel tcg \ + -serial file:"$serial" \ + -append "$cmdline" \ + -kernel "$kernel" \ + -initrd "$initrd" + + grep "KHO restore succeeded" "$serial" &> /dev/null || fail "KHO failed" +} + +function target_to_arch() { + local target=$1 + + case $target in + aarch64) echo "arm64" ;; + x86_64) echo "x86" ;; + *) skip "architecture $target is not supported" + esac +} + +function main() { + local build_dir="$kernel_dir/.kho" + local jobs=$(($(nproc) * 2)) + local target="$(uname -m)" + + # skip the test if any of the preparation steps fails + set -o errtrace + trap skip ERR + + while getopts 'hd:j:t:' opt; do + case $opt in + d) + build_dir="$OPTARG" + ;; + j) + jobs="$OPTARG" + ;; + t) + target="$OPTARG" + ;; + h) + usage + exit 0 + ;; + *) + echo Unknown argument "$opt" + usage + exit 1 + ;; + esac + done + + ktap_print_header + ktap_set_plan 1 + + if [[ "$target" != "$(uname -m)" ]] && [[ -z "$CROSS_COMPILE" ]]; then + skip "Cross-platform testing needs to specify CROSS_COMPILE" + fi + + mkdir -p "$build_dir" + local arch=$(target_to_arch "$target") + source "$test_dir/$arch.conf" + + # build the kernel and create initrd + # initrd includes the kernel image that will be kexec'ed + local make_cmd="make ARCH=$arch CROSS_COMPILE=$CROSS_COMPILE -j$jobs" + build_kernel "$build_dir" "$make_cmd" "$QEMU_KCONFIG" "$KERNEL_IMAGE" + + local kernel="$build_dir/arch/$arch/boot/$KERNEL_IMAGE" + mkinitrd "$kernel" + + run_qemu "$QEMU_CMD" "$KERNEL_CMDLINE" "$kernel" + + ktap_test_pass "KHO succeeded" +} + +main "$@" diff --git a/tools/testing/selftests/kho/x86.conf b/tools/testing/selftests/kho/x86.conf new file mode 100644 index 000000000000..b419e610ca22 --- /dev/null 
+++ b/tools/testing/selftests/kho/x86.conf @@ -0,0 +1,7 @@ +QEMU_CMD=qemu-system-x86_64 +QEMU_KCONFIG=" +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_8250_CONSOLE=y +" +KERNEL_IMAGE="bzImage" +KERNEL_CMDLINE="console=ttyS0" base-commit: 89be9a83ccf1f88522317ce02f854f30d6115c41 -- 2.47.2
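
One detail of lib/test_kho.c in the patch above that is easy to miss is how `folios_info` stores both the physical address and the order of each folio in a single value: the address is page aligned, so its low bits are free to carry the order. A minimal Python sketch of that packing follows. It assumes 4 KiB pages, and the function names are illustrative rather than anything from the patch.

```python
PAGE_SHIFT = 12                  # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT
PAGE_MASK = ~(PAGE_SIZE - 1)     # clears the in-page offset bits


def pack_folio_info(phys, order):
    """Mirror of `virt_to_phys(folio_address(folio)) | order` on the save side."""
    if phys & ~PAGE_MASK:
        raise ValueError("physical address must be page aligned")
    if not 0 <= order < PAGE_SIZE:
        raise ValueError("order does not fit in the low bits")
    return phys | order


def unpack_folio_info(info):
    """Mirror of the masking done on the restore side; returns (phys, order)."""
    order = info & ~PAGE_MASK    # low bits carry the folio order
    phys = info & PAGE_MASK      # remaining bits carry the address
    return phys, order


def folio_size(order):
    """Size in bytes of a folio of the given order (PAGE_SIZE << order)."""
    return PAGE_SIZE << order
```

The restore path relies on exactly this round trip: it recovers `phys` to call `kho_restore_folio()`, and `order` both to check `folio_order()` and to compute how many bytes to feed into the checksum.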

5 months, 1 week


[PATCH v1 0/4] A couple of improvements for VMM to inject external abort to guest

by Jiaqi Yan

There are several situations where the VMM is involved in handling
synchronous external instruction or data aborts, and often the VMM needs to
inject external aborts into the guest. In addition to manipulating individual
registers with the KVM_SET_ONE_REG API, an easier way is to use the
KVM_SET_VCPU_EVENTS API.

This patchset adds two new features to the KVM_SET_VCPU_EVENTS API.
1. Extend KVM_SET_VCPU_EVENTS to support external instruction abort.
2. Allow userspace to emulate ESR_ELx.ISS by supplying ESR_ELx.
   In this way, we can also allow userspace to emulate ESR_ELx.ISS2
   in future.

The UAPI change for #1 is straightforward. However, I would appreciate
some feedback on the ABI change for #2:

  struct kvm_vcpu_events {
	struct {
		__u8 serror_pending;
		__u8 serror_has_esr;
		__u8 ext_dabt_pending;
		__u8 ext_iabt_pending;
		__u8 ext_abt_has_esr;
		__u8 pad[3];
		__u64 serror_esr;
		__u64 ext_abt_esr;  // <= +8 bytes
	} exception;
	__u32 reserved[10];         // <= -8 bytes
  };

The offset of kvm_vcpu_events.reserved changes, and the size of exception
changes. I think we can't say userspace will never access reserved, or
that it will never use sizeof(exception). Theoretically this is an ABI
break, and I want to call it out and ask whether a new ABI is needed for
feature #2. For example, is it worth introducing exception_v2 or
kvm_vcpu_events_v2?

Based on commit 7b8346bd9fce6 ("KVM: arm64: Don't attempt vLPI mappings
when vPE allocation is disabled")

Jiaqi Yan (3):
  KVM: arm64: Allow userspace to supply ESR when injecting SEA
  KVM: selftests: Test injecting external abort with ISS
  Documentation: kvm: update UAPI for injecting SEA

Raghavendra Rao Ananta (1):
  KVM: arm64: Allow userspace to inject external instruction abort

 Documentation/virt/kvm/api.rst                |  48 +++--
 arch/arm64/include/asm/kvm_emulate.h          |   9 +-
 arch/arm64/include/uapi/asm/kvm.h             |   7 +-
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/emulate-nested.c               |   6 +-
 arch/arm64/kvm/guest.c                        |  42 ++--
 arch/arm64/kvm/inject_fault.c                 |  16 +-
 include/uapi/linux/kvm.h                      |   1 +
 tools/arch/arm64/include/uapi/asm/kvm.h       |   7 +-
 .../selftests/kvm/arm64/external_aborts.c     | 191 +++++++++++++++---
 .../testing/selftests/kvm/arm64/inject_iabt.c |  98 +++++++++
 11 files changed, 352 insertions(+), 74 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c

-- 
2.50.1.565.gc32cd1483b-goog
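
To make the ABI concern in this cover letter concrete, here is a rough Python `ctypes` sketch comparing the two layouts. The "old" layout is written from memory of the current arm64 UAPI header (`pad[5]` and `reserved[12]`) and should be treated as an assumption. The "new" layout follows the struct quoted in the cover letter. The class and helper names are illustrative only.

```python
import ctypes


class OldException(ctypes.Structure):
    # Assumed pre-series layout (recalled from the arm64 UAPI header).
    _fields_ = [
        ("serror_pending", ctypes.c_uint8),
        ("serror_has_esr", ctypes.c_uint8),
        ("ext_dabt_pending", ctypes.c_uint8),
        ("pad", ctypes.c_uint8 * 5),
        ("serror_esr", ctypes.c_uint64),
    ]


class OldVcpuEvents(ctypes.Structure):
    _fields_ = [
        ("exception", OldException),
        ("reserved", ctypes.c_uint32 * 12),
    ]


class NewException(ctypes.Structure):
    # Layout proposed in the cover letter for feature #2.
    _fields_ = [
        ("serror_pending", ctypes.c_uint8),
        ("serror_has_esr", ctypes.c_uint8),
        ("ext_dabt_pending", ctypes.c_uint8),
        ("ext_iabt_pending", ctypes.c_uint8),
        ("ext_abt_has_esr", ctypes.c_uint8),
        ("pad", ctypes.c_uint8 * 3),
        ("serror_esr", ctypes.c_uint64),
        ("ext_abt_esr", ctypes.c_uint64),
    ]


class NewVcpuEvents(ctypes.Structure):
    _fields_ = [
        ("exception", NewException),
        ("reserved", ctypes.c_uint32 * 10),
    ]


def layout(cls):
    """Return (sizeof(exception), offsetof(reserved), sizeof(struct))."""
    return (cls.exception.size, cls.reserved.offset, ctypes.sizeof(cls))
```

Under these assumptions `layout(OldVcpuEvents)` is `(16, 16, 64)` and `layout(NewVcpuEvents)` is `(24, 24, 64)`. The overall size is unchanged, but `sizeof(exception)` and the offset of `reserved` both grow by 8 bytes, which is the break being asked about.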

5 months, 1 week


[PATCH v2 0/6] VMM can handle guest SEA via KVM_EXIT_ARM_SEA

by Jiaqi Yan

Problem
=======

When host APEI is unable to claim a synchronous external abort (SEA)
during a stage-2 guest abort, today KVM directly injects an async SError
into the VCPU and then resumes it. The injected SError usually results in
an unpleasant guest kernel panic.

One of the major situations of guest SEA is when a VCPU consumes a
recoverable uncorrected memory error (UER), which is not uncommon at all in
modern datacenter servers with large amounts of physical memory. Although
SError and guest panic are sufficient to stop the propagation of corrupted
memory, there is room to recover from a UER in a more graceful manner.

Proposed Solution
=================

Alternatively, KVM can replay the SEA to the faulting VCPU via the existing
KVM_SET_VCPU_EVENTS API. If the memory poison consumption or the fault
that causes the SEA is not from the guest kernel, the blast radius can be
limited to the consuming or faulting guest userspace process, so the VM
can keep running.

In addition, instead of handling this under the hood without involving
userspace, there are benefits to redirecting the SEA to the VMM:

- VM customers care about the disruptions caused by memory errors, and
  the VMM usually has the responsibility to start the process of notifying
  the customers of memory error events in their VMs. For example, some
  cloud providers emit a critical log in their observability UI [1], and
  provide a playbook for customers on how to mitigate disruptions to
  their workloads.

- The VMM can protect against future memory error consumption by unmapping
  the poisoned pages from the stage-2 page table with KVM userfault, or by
  splitting the memslot that contains the poisoned guest pages [2].

- The VMM can keep track of SEA events in the VM. When the VMM thinks the
  status of the host or the VM is bad enough, e.g. the number of distinct
  SEAs exceeds a threshold, it can restart the VM on another healthy host.

- Behavior parity with the x86 architecture. When a machine check exception
  (MCE) is caused by a VCPU, the kernel or KVM signals userspace with SIGBUS
  to let the VMM either recover from the MCE, or terminate itself along with
  the VM. The prior RFC proposed to implement SIGBUS on arm64 as well, but
  Marc preferred a VCPU exit over a signal [3]. However, implementation
  aside, returning SEA to the VMM is on par with returning MCE to the VMM.

Once the SEA is redirected to the VMM, among other actions, the VMM is
encouraged to inject external aborts into the faulting VCPU, which is
already supported by KVM on arm64. We notice that injecting an instruction
abort is not fully supported by KVM_SET_VCPU_EVENTS. This patchset
completes that support.

New UAPIs
=========

This patchset introduces the following userspace-visible changes to empower
the VMM to control what happens next for an SEA on guest memory:

- KVM_CAP_ARM_SEA_TO_USER. While taking an SEA, if userspace has enabled
  this new capability at VM creation, and the SEA is not caused by
  memory allocated for the stage-2 translation table, instead of injecting
  SError, return KVM_EXIT_ARM_SEA to userspace.

- KVM_EXIT_ARM_SEA. This is the VM exit reason the VMM gets. The details
  about the SEA are provided in arm_sea as much as possible, including the
  sanitized ESR value at EL2, whether the guest virtual and physical
  addresses (GVA and GPA) are available, and their values if so.

- KVM_CAP_ARM_INJECT_EXT_IABT. The VMM today can inject an external data
  abort into a VCPU via the KVM_SET_VCPU_EVENTS API. However, in the case of
  an instruction abort, the VMM cannot inject it via KVM_SET_VCPU_EVENTS.
  KVM_CAP_ARM_INJECT_EXT_IABT is just a natural extension of
  KVM_CAP_ARM_INJECT_EXT_DABT that tells the VMM KVM_SET_VCPU_EVENTS now
  supports external instruction abort.

* From v1 [4]:
  - Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid
    dereferencing NULL ITE pointer").
  - Sanitize ESR_EL2 before reporting it to userspace.
  - Do not do KVM_EXIT_ARM_SEA when the SEA is caused by memory allocated
    to the stage-2 translation table.

[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com

Jiaqi Yan (5):
  KVM: arm64: VM exit to userspace to handle SEA
  KVM: arm64: Set FnV for VCPU when FAR_EL2 is invalid
  KVM: selftests: Test for KVM_EXIT_ARM_SEA and KVM_CAP_ARM_SEA_TO_USER
  KVM: selftests: Test for KVM_CAP_INJECT_EXT_IABT
  Documentation: kvm: new uAPI for handling SEA

Raghavendra Rao Ananta (1):
  KVM: arm64: Allow userspace to inject external instruction aborts

 Documentation/virt/kvm/api.rst                | 128 ++++++-
 arch/arm64/include/asm/kvm_emulate.h          |  67 ++++
 arch/arm64/include/asm/kvm_host.h             |   8 +
 arch/arm64/include/asm/kvm_ras.h              |   2 +-
 arch/arm64/include/uapi/asm/kvm.h             |   3 +-
 arch/arm64/kvm/arm.c                          |   6 +
 arch/arm64/kvm/guest.c                        |  13 +-
 arch/arm64/kvm/inject_fault.c                 |   3 +
 arch/arm64/kvm/mmu.c                          |  59 ++-
 include/uapi/linux/kvm.h                      |  12 +
 tools/arch/arm64/include/asm/esr.h            |   2 +
 tools/arch/arm64/include/uapi/asm/kvm.h       |   3 +-
 tools/testing/selftests/kvm/Makefile.kvm      |   2 +
 .../testing/selftests/kvm/arm64/inject_iabt.c |  98 +++++
 .../testing/selftests/kvm/arm64/sea_to_user.c | 340 ++++++++++++++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    |   1 +
 16 files changed, 718 insertions(+), 29 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c
 create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c

-- 
2.49.0.1266.g31b7d2e469-goog

5 months, 1 week


[PATCH] kunit: tool: Accept --raw_output=full as an alias of 'all'

by David Gow

I can never remember whether --raw_output takes 'all' or 'full'. No reason
we can't support both.

For the record, 'all' is the recommended, documented option.

Signed-off-by: David Gow <davidgow(a)google.com>
---
 tools/testing/kunit/kunit.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/kunit/kunit.py b/tools/testing/kunit/kunit.py
index 7f9ae55fd6d5..cd99c1956331 100755
--- a/tools/testing/kunit/kunit.py
+++ b/tools/testing/kunit/kunit.py
@@ -228,7 +228,7 @@ def parse_tests(request: KunitParseRequest, metadata: kunit_json.Metadata, input
 		fake_test.counts.passed = 1
 
 	output: Iterable[str] = input_data
-	if request.raw_output == 'all':
+	if request.raw_output == 'all' or request.raw_output == 'full':
 		pass
 	elif request.raw_output == 'kunit':
 		output = kunit_parser.extract_tap_lines(output)
@@ -425,7 +425,7 @@ def add_parse_opts(parser: argparse.ArgumentParser) -> None:
 	parser.add_argument('--raw_output', help='If set don\'t parse output from kernel. '
 			    'By default, filters to just KUnit output. Use '
 			    '--raw_output=all to show everything',
-			     type=str, nargs='?', const='all', default=None, choices=['all', 'kunit'])
+			     type=str, nargs='?', const='all', default=None, choices=['all', 'full', 'kunit'])
 	parser.add_argument('--json',
 			    nargs='?',
 			    help='Prints parsed test results as JSON to stdout or a file if '
-- 
2.50.1.552.g942d659e1b-goog
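
For readers unfamiliar with how `nargs='?'`, `const` and `choices` interact in this patch, the option can be reproduced in isolation. The `add_argument()` call below copies the arguments from the patch (minus the help text). The `build_parser()` and `shows_everything()` helpers are hypothetical and only mirror the check in `parse_tests()`.

```python
import argparse


def build_parser():
    """Stand-alone parser carrying only the --raw_output option."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--raw_output',
                        type=str, nargs='?', const='all', default=None,
                        choices=['all', 'full', 'kunit'])
    return parser


def shows_everything(raw_output):
    """Hypothetical helper: True when the kernel output is passed through unfiltered."""
    return raw_output == 'all' or raw_output == 'full'


def parse_raw_output(argv):
    """Return the parsed --raw_output value for an argument list."""
    return build_parser().parse_args(argv).raw_output
```

With this parser a bare `--raw_output` yields `'all'` (from `const`), `--raw_output=full` yields `'full'`, and omitting the flag yields `None`. Any value outside `choices` is rejected by argparse before `parse_tests()` is reached.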

5 months, 1 week


Re: [PATCH] selftests: timers: improve adjtick output readability

by Thomas Gleixner

Vishal!

On Wed, Jul 30 2025 at 23:35, Vishal Parmar wrote:

Please do not top-post and trim your replies.

> The intent behind this change is to make output useful as is.
> for example, to provide a performance report in case of regression.

The point John was making:

>> So it might be worth looking into getting the output to be happy with
>> TAP while you're tweaking things here.

The kernel selftests are converting over to standardized TAP output
format, which is intended to aid automated testing.

So if we change the output format of this test, then we switch it over
to TAP format and do not invent yet another randomized output scheme.

> CSV format is also a good alternative if the maintainer prefers that.

The most important information is whether the test succeeded or not and
CSV format is not helping either to conform with the test output
standards.

For the success case, the actual numbers are uninteresting. In the
failure case it's sufficient to emit:

    ksft_test_result_fail("Req: NNNN, Exp: $MMMM, Res: $LLLL\n", ...);

In case of regressions (fail), a report providing this output is good
enough for the relevant maintainer/developer to start investigating.

No?

Thanks,

        tglx
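
The reporting scheme suggested in this reply (a bare pass line on success, and the requested, expected and resulting values only on failure) can be sketched as follows. This is a Python illustration of TAP-style output, not the kselftest C helpers. The function name and the tuple format are made up for the example.

```python
def tap_report(results):
    """Render (name, passed, detail) tuples as TAP-style lines.

    `detail` is only used for failing tests, where it should carry the
    numbers needed to start investigating.
    """
    lines = ["TAP version 13", "1..{}".format(len(results))]
    for number, (name, passed, detail) in enumerate(results, start=1):
        if passed:
            # Success: the measured numbers are not interesting.
            lines.append("ok {} {}".format(number, name))
        else:
            # Failure: keep the data in the test description line.
            lines.append("not ok {} {}: {}".format(number, name, detail))
    return lines
```

A test harness only has to look at the leading `ok` / `not ok` to decide pass or fail, while a developer reading the failing line still gets the values without a separate output format.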


[PATCH] selftests: ALSA: fix memory leak in utimer test

by WangYuli

Free the malloc'd buffer in TEST_F(timer_f, utimer) to prevent memory leak. Reported-by: Jun Zhan <zhanjun(a)uniontech.com> Signed-off-by: WangYuli <wangyuli(a)uniontech.com> --- tools/testing/selftests/alsa/utimer-test.c | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/alsa/utimer-test.c b/tools/testing/selftests/alsa/utimer-test.c index 32ee3ce57721..37964f311a33 100644 --- a/tools/testing/selftests/alsa/utimer-test.c +++ b/tools/testing/selftests/alsa/utimer-test.c @@ -135,6 +135,7 @@ TEST_F(timer_f, utimer) { pthread_join(ticking_thread, NULL); ASSERT_EQ(total_ticks, TICKS_COUNT); pclose(rfp); + free(buf); } TEST(wrong_timers_test) { -- 2.50.1


Crediting test authors

by Jakub Kicinski

Hi! Does anyone have ideas about crediting test authors or tests for bugs discovered? We increasingly see situations where someone adds a test then our subsystem CI uncovers a (1 in a 100 runs) bug using that test. Using reported-by doesn't feel right. But credit should go to the person who wrote the test. Is anyone else having this dilemma?


[PATCH RFC v2 0/4] procfs: make reference pidns more user-visible

by Aleksa Sarai

Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed).

/* pidns mount option for procfs */

This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include:

 * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs). But this requires forking off a helper process because you cannot just one-shot this using mount(2).

 * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process). While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate.

Things would be much easier if there was a way for userspace to just specify the pidns they want. Patch 1 implements a new "pidns" argument which can be set using fsconfig(2):

    fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
    fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or classic mount(2) / mount(8):

    // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
    mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

The initial security model I have in this RFC is to be as conservative as possible and just mirror the security model for setns(2) -- which means that you can only set pidns=... to pid namespaces that your current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN privileges over the pid namespace. This fulfils the requirements of container runtimes, but I suspect that this may be too strict for some usecases.

The pidns argument is not displayed in mountinfo -- it's not clear to me what value it would make sense to show (maybe we could just use ns_dname to provide an identifier for the namespace, but this number would be fairly useless to userspace). I'm open to suggestions. Note that PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get information about this outside of mountinfo.

/* ioctl(PROCFS_GET_PID_NAMESPACE) */

In addition, being able to figure out what pid namespace is being used by a procfs mount is quite useful when you have an administrative process (such as a container runtime) which wants to figure out the correct way of mapping PIDs between its own namespace and the namespace for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are alternative ways to do this, but they all rely on ancillary information that third-party libraries and tools do not necessarily have access to.

To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which can be used to get a reference to the pidns that a procfs is using.

It's not quite clear what is the correct security model for this API, but the current approach I've taken is to:

 * Make the ioctl only valid on the root (meaning that a process without access to the procfs root -- such as only having an fd to a procfs file or some open_tree(2)-like subset -- cannot use this API).

 * Require that the process requesting either has access to /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative access to it and can join it if they had a handle), or is in a pidns that is a direct ancestor of the target pidns (i.e. all of the pids are already visible in the procfs for the current process's pidns).

The security model for this is a little loose, as it seems to me that all of the cases mentioned are valid cases to allow access, but I'm open to suggestions for whether we need to make this stricter or looser.

Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v2:
- #ifdef CONFIG_PID_NS
- Improve cover letter wording to make it clear we're talking about two separate features with different permission models. [Andy Lutomirski]
- Fix build warnings in pidns_is_ancestor() patch. [kernel test robot]
- v1: <https://lore.kernel.org/r/20250721-procfs-pidns-api-v1-0-5cd9007e512d@cypha…>

---
Aleksa Sarai (4):
      pidns: move is-ancestor logic to helper
      procfs: add "pidns" mount option
      procfs: add PROCFS_GET_PID_NAMESPACE ioctl
      selftests/proc: add tests for new pidns APIs

 Documentation/filesystems/proc.rst        |  10 ++
 fs/proc/root.c                            | 144 ++++++++++++++-
 include/linux/pid_namespace.h             |   9 +
 include/uapi/linux/fs.h                   |   3 +
 kernel/pid_namespace.c                    |  23 ++-
 tools/testing/selftests/proc/.gitignore   |   1 +
 tools/testing/selftests/proc/Makefile     |   1 +
 tools/testing/selftests/proc/proc-pidns.c | 286 ++++++++++++++++++++++++++++++
 8 files changed, 461 insertions(+), 16 deletions(-)
---
base-commit: 4c838c7672c39ec6ec48456c6ce22d14a68f4cda
change-id: 20250717-procfs-pidns-api-8ed1583431f0

Best regards,
-- 
Aleksa Sarai <cyphar(a)cyphar.com>
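As a rough illustration of the classic mount(2) form described in this cover letter, the sketch below builds the "pidns=<path>" data string and hands it to mount(2). The helper names build_pidns_opt() and mount_proc_with_pidns() are hypothetical, the "pidns" option only exists on a kernel with this RFC series applied, and the comma check is an assumption about how comma-separated mount data is parsed, not something the series states. The mount flags are left to the caller because the cover letter does not spell them out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Build the mount(2) data string "pidns=<ns_path>" in buf.
 * Returns 0 on success, -1 if the result would not fit or if ns_path
 * contains a comma (assumed to be read as an option separator).
 */
int build_pidns_opt(char *buf, size_t len, const char *ns_path)
{
	int n;

	assert(buf != NULL && ns_path != NULL);
	if (strchr(ns_path, ','))
		return -1;

	n = snprintf(buf, len, "pidns=%s", ns_path);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Mount procfs at target, asking for the pid namespace behind ns_path.
 * Expected to fail on kernels without the proposed "pidns" option.
 * Returns whatever mount(2) returns, or -1 if the option string is invalid.
 */
int mount_proc_with_pidns(const char *target, const char *ns_path,
			  unsigned long flags)
{
	char opts[4096];

	if (build_pidns_opt(opts, sizeof(opts), ns_path) < 0)
		return -1;
	return mount("proc", target, "proc", flags, opts);
}
```

A caller would typically pass a path such as /proc/self/ns/pid, or the ns/pid link of a process in the target namespace, subject to the setns(2)-like permission checks the cover letter describes.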


[PATCH v2] selftests/bpf: Add missing kfunc declarations to fix build errors

by Jiawei Zhao

A number of BPF selftests that utilize kernel functions (kfuncs) fail to build due to missing function prototypes. This results in compilation errors, as implicit function declarations are treated as errors: error: call to undeclared function 'bpf_copy_from_user_task_str'; ISO C99 and later do not support implicit function declarations Unlike BPF helpers, kfuncs are not automatically available to BPF programs and must be explicitly declared before use. To resolve this, centralize all the necessary kfunc declarations into the `bpf_kfuncs.h` header file. This header is then included in all the test programs that were previously missing these declarations. This approach also allows for the removal of redundant local `extern` declarations from individual source files (e.g., in `irq.c`), leading to cleaner and more maintainable code. Change since v1: - Add a kfunc declaration for __bpf_trap in bpf_kfuncs.h Signed-off-by: Jiawei Zhao <phoenix500526(a)163.com> --- tools/testing/selftests/bpf/bpf_kfuncs.h | 65 +++++++++++++++++++ .../selftests/bpf/progs/bpf_iter_tasks.c | 1 + .../bpf/progs/bpf_qdisc_fail__incompl_ops.c | 1 + .../selftests/bpf/progs/bpf_qdisc_fifo.c | 1 + .../selftests/bpf/progs/bpf_qdisc_fq.c | 1 + .../selftests/bpf/progs/cgroup_read_xattr.c | 1 + .../testing/selftests/bpf/progs/dmabuf_iter.c | 1 + .../selftests/bpf/progs/dynptr_success.c | 1 + tools/testing/selftests/bpf/progs/irq.c | 6 +- .../selftests/bpf/progs/linked_list_peek.c | 1 + .../selftests/bpf/progs/rbtree_search.c | 1 + .../selftests/bpf/progs/rcu_read_lock.c | 1 + .../selftests/bpf/progs/read_cgroupfs_xattr.c | 1 + .../selftests/bpf/progs/res_spin_lock.c | 1 + .../selftests/bpf/progs/res_spin_lock_fail.c | 1 + .../struct_ops_refcounted_fail__tail_call.c | 1 + .../selftests/bpf/progs/test_spin_lock_fail.c | 1 + .../selftests/bpf/progs/verifier_bpf_trap.c | 1 + 18 files changed, 82 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h 
b/tools/testing/selftests/bpf/bpf_kfuncs.h index 8215c9b3115e..a08c865b737a 100644 --- a/tools/testing/selftests/bpf/bpf_kfuncs.h +++ b/tools/testing/selftests/bpf/bpf_kfuncs.h @@ -2,6 +2,7 @@ #define __BPF_KFUNCS__ struct bpf_sock_addr_kern; +struct bpf_res_spin_lock; /* Description * Initializes an skb-type dynptr @@ -42,6 +43,28 @@ extern bool bpf_dynptr_is_null(const struct bpf_dynptr *ptr) __ksym __weak; extern bool bpf_dynptr_is_rdonly(const struct bpf_dynptr *ptr) __ksym __weak; extern __u32 bpf_dynptr_size(const struct bpf_dynptr *ptr) __ksym __weak; extern int bpf_dynptr_clone(const struct bpf_dynptr *ptr, struct bpf_dynptr *clone__init) __ksym __weak; +extern int bpf_dynptr_copy(struct bpf_dynptr *dst_ptr, __u32 dst_off, struct bpf_dynptr *src_ptr, + __u32 src_off, __u32 size) __ksym __weak; +extern int bpf_probe_read_user_dynptr(struct bpf_dynptr *dptr, __u32 off, + __u32 size, const void *unsafe_ptr__ign) __ksym __weak; +extern int bpf_probe_read_kernel_dynptr(struct bpf_dynptr *dptr, __u32 off, + __u32 size, const void *unsafe_ptr__ign) __ksym __weak; +extern int bpf_probe_read_user_str_dynptr(struct bpf_dynptr *dptr, __u32 off, + __u32 size, const void *unsafe_ptr__ign) __ksym __weak; +extern int bpf_probe_read_kernel_str_dynptr(struct bpf_dynptr *dptr, __u32 off, + __u32 size, const void *unsafe_ptr__ign) __ksym __weak; +extern int bpf_copy_from_user_dynptr(struct bpf_dynptr *dptr, __u32 off, + __u32 size, const void *unsafe_ptr__ign) __ksym __weak; +extern int bpf_copy_from_user_str_dynptr(struct bpf_dynptr *dptr, __u32 off, + __u32 size, const void *unsafe_ptr__ign) __ksym __weak; +extern int bpf_copy_from_user_task_dynptr(struct bpf_dynptr *dptr, __u32 off, + __u32 size, const void *unsafe_ptr__ign, + struct task_struct *tsk) __ksym __weak; +extern int bpf_copy_from_user_task_str_dynptr(struct bpf_dynptr *dptr, __u32 off, + __u32 size, const void *unsafe_ptr__ign, + struct task_struct *tsk) __ksym __weak; +extern int 
bpf_copy_from_user_task_str(void *dst, __u32, const void *, + struct task_struct *, __u64) __ksym __weak; /* Description * Modify the address of a AF_UNIX sockaddr. @@ -92,4 +115,46 @@ extern int bpf_set_dentry_xattr(struct dentry *dentry, const char *name__str, const struct bpf_dynptr *value_p, int flags) __ksym __weak; extern int bpf_remove_dentry_xattr(struct dentry *dentry, const char *name__str) __ksym __weak; +extern void bpf_local_irq_save(unsigned long *) __ksym __weak; +extern void bpf_local_irq_restore(unsigned long *) __ksym __weak; +extern int bpf_copy_from_user_str(void *dst, __u32 dst__sz, + const void *unsafe_ptr__ign, __u64 flags) __ksym __weak; +extern int bpf_res_spin_lock_irqsave(struct bpf_res_spin_lock *lock, + unsigned long *flags__irq_flag) __ksym __weak; +extern void bpf_res_spin_unlock_irqrestore(struct bpf_res_spin_lock *lock, + unsigned long *flags__irq_flag) __ksym __weak; +extern int bpf_res_spin_lock(struct bpf_res_spin_lock *lock) __ksym __weak; +extern void bpf_res_spin_unlock(struct bpf_res_spin_lock *lock) __ksym __weak; + +extern struct bpf_list_node *bpf_list_front(struct bpf_list_head *head) __ksym __weak; +extern struct bpf_list_node *bpf_list_back(struct bpf_list_head *head) __ksym __weak; + +struct bpf_sk_buff_ptr; +struct sk_buff; +struct Qdisc; + +extern void bpf_qdisc_skb_drop(struct sk_buff *skb, + struct bpf_sk_buff_ptr *to_free_list) __ksym __weak; +extern void bpf_qdisc_bstats_update(struct Qdisc *sch, const struct sk_buff *skb) __ksym __weak; +extern void bpf_kfree_skb(struct sk_buff *skb) __ksym __weak; +extern __u32 bpf_skb_get_hash(struct sk_buff *) __ksym __weak; +extern void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, __u64 expire, + __u64 delta_ns) __ksym __weak; + +extern struct cgroup *bpf_cgroup_from_id(__u64 cgid) __ksym __weak; +extern void bpf_cgroup_release(struct cgroup *cgrp) __ksym __weak; +extern void bpf_rcu_read_lock(void) __ksym __weak; +extern void bpf_rcu_read_unlock(void) __ksym __weak; 
+extern struct cgroup *bpf_cgroup_ancestor(struct cgroup *cgrp, int level) __ksym __weak; + + +extern struct bpf_rb_node *bpf_rbtree_root(struct bpf_rb_root *root) __ksym __weak; +extern struct bpf_rb_node *bpf_rbtree_left(struct bpf_rb_root *root, + struct bpf_rb_node *node) __ksym __weak; +extern struct bpf_rb_node *bpf_rbtree_right(struct bpf_rb_root *root, + struct bpf_rb_node *node) __ksym __weak; + +extern void bpf_task_release(struct task_struct *p) __ksym __weak; +extern void __bpf_trap(void) __ksym __weak; + #endif diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_tasks.c b/tools/testing/selftests/bpf/progs/bpf_iter_tasks.c index 966ee5a7b066..63daf05366df 100644 --- a/tools/testing/selftests/bpf/progs/bpf_iter_tasks.c +++ b/tools/testing/selftests/bpf/progs/bpf_iter_tasks.c @@ -3,6 +3,7 @@ #include <vmlinux.h> #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> +#include "bpf_kfuncs.h" char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fail__incompl_ops.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fail__incompl_ops.c index f188062ed730..7f1a5a1b5dac 100644 --- a/tools/testing/selftests/bpf/progs/bpf_qdisc_fail__incompl_ops.c +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fail__incompl_ops.c @@ -3,6 +3,7 @@ #include <vmlinux.h> #include "bpf_experimental.h" #include "bpf_qdisc_common.h" +#include "bpf_kfuncs.h" char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c index 1de2be3e370b..9ae41518d578 100644 --- a/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c @@ -3,6 +3,7 @@ #include <vmlinux.h> #include "bpf_experimental.h" #include "bpf_qdisc_common.h" +#include "bpf_kfuncs.h" char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c index 
1a3233a275c7..f86981bc2a09 100644 --- a/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c @@ -37,6 +37,7 @@ #include <bpf/bpf_helpers.h> #include "bpf_experimental.h" #include "bpf_qdisc_common.h" +#include "bpf_kfuncs.h" char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/cgroup_read_xattr.c b/tools/testing/selftests/bpf/progs/cgroup_read_xattr.c index 092db1d0435e..50162ca905cc 100644 --- a/tools/testing/selftests/bpf/progs/cgroup_read_xattr.c +++ b/tools/testing/selftests/bpf/progs/cgroup_read_xattr.c @@ -7,6 +7,7 @@ #include <bpf/bpf_core_read.h> #include "bpf_experimental.h" #include "bpf_misc.h" +#include "bpf_kfuncs.h" char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/dmabuf_iter.c b/tools/testing/selftests/bpf/progs/dmabuf_iter.c index 13cdb11fdeb2..df0021dc54da 100644 --- a/tools/testing/selftests/bpf/progs/dmabuf_iter.c +++ b/tools/testing/selftests/bpf/progs/dmabuf_iter.c @@ -3,6 +3,7 @@ #include <vmlinux.h> #include <bpf/bpf_core_read.h> #include <bpf/bpf_helpers.h> +#include "bpf_experimental.h" /* From uapi/linux/dma-buf.h */ #define DMA_BUF_NAME_LEN 32 diff --git a/tools/testing/selftests/bpf/progs/dynptr_success.c b/tools/testing/selftests/bpf/progs/dynptr_success.c index a0391f9da2d4..95bcdf465c4b 100644 --- a/tools/testing/selftests/bpf/progs/dynptr_success.c +++ b/tools/testing/selftests/bpf/progs/dynptr_success.c @@ -8,6 +8,7 @@ #include <bpf/bpf_tracing.h> #include "bpf_misc.h" #include "errno.h" +#include "bpf_kfuncs.h" char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/irq.c b/tools/testing/selftests/bpf/progs/irq.c index 74d912b22de9..ce3b2509e6f1 100644 --- a/tools/testing/selftests/bpf/progs/irq.c +++ b/tools/testing/selftests/bpf/progs/irq.c @@ -4,13 +4,9 @@ #include <bpf/bpf_helpers.h> #include "bpf_misc.h" #include "bpf_experimental.h" +#include "bpf_kfuncs.h" unsigned long 
global_flags; - -extern void bpf_local_irq_save(unsigned long *) __weak __ksym; -extern void bpf_local_irq_restore(unsigned long *) __weak __ksym; -extern int bpf_copy_from_user_str(void *dst, u32 dst__sz, const void *unsafe_ptr__ign, u64 flags) __weak __ksym; - struct bpf_res_spin_lock lockA __hidden SEC(".data.A"); struct bpf_res_spin_lock lockB __hidden SEC(".data.B"); diff --git a/tools/testing/selftests/bpf/progs/linked_list_peek.c b/tools/testing/selftests/bpf/progs/linked_list_peek.c index 264e81bfb287..00d5299eeb0a 100644 --- a/tools/testing/selftests/bpf/progs/linked_list_peek.c +++ b/tools/testing/selftests/bpf/progs/linked_list_peek.c @@ -5,6 +5,7 @@ #include <bpf/bpf_helpers.h> #include "bpf_misc.h" #include "bpf_experimental.h" +#include "bpf_kfuncs.h" struct node_data { struct bpf_list_node l; diff --git a/tools/testing/selftests/bpf/progs/rbtree_search.c b/tools/testing/selftests/bpf/progs/rbtree_search.c index 098ef970fac1..681ea24d6877 100644 --- a/tools/testing/selftests/bpf/progs/rbtree_search.c +++ b/tools/testing/selftests/bpf/progs/rbtree_search.c @@ -5,6 +5,7 @@ #include <bpf/bpf_helpers.h> #include "bpf_misc.h" #include "bpf_experimental.h" +#include "bpf_kfuncs.h" struct node_data { struct bpf_refcount ref; diff --git a/tools/testing/selftests/bpf/progs/rcu_read_lock.c b/tools/testing/selftests/bpf/progs/rcu_read_lock.c index 43637ee2cdcd..386559f026dd 100644 --- a/tools/testing/selftests/bpf/progs/rcu_read_lock.c +++ b/tools/testing/selftests/bpf/progs/rcu_read_lock.c @@ -6,6 +6,7 @@ #include <bpf/bpf_tracing.h> #include "bpf_tracing_net.h" #include "bpf_misc.h" +#include "bpf_kfuncs.h" char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/read_cgroupfs_xattr.c b/tools/testing/selftests/bpf/progs/read_cgroupfs_xattr.c index 855f85fc5522..0575e08ae108 100644 --- a/tools/testing/selftests/bpf/progs/read_cgroupfs_xattr.c +++ b/tools/testing/selftests/bpf/progs/read_cgroupfs_xattr.c @@ -6,6 +6,7 @@ #include 
<bpf/bpf_helpers.h> #include <bpf/bpf_core_read.h> #include "bpf_experimental.h" +#include "bpf_kfuncs.h" char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/res_spin_lock.c b/tools/testing/selftests/bpf/progs/res_spin_lock.c index 22c4fb8b9266..8d21b7ae0a18 100644 --- a/tools/testing/selftests/bpf/progs/res_spin_lock.c +++ b/tools/testing/selftests/bpf/progs/res_spin_lock.c @@ -4,6 +4,7 @@ #include <bpf/bpf_tracing.h> #include <bpf/bpf_helpers.h> #include "bpf_misc.h" +#include "bpf_kfuncs.h" #define EDEADLK 35 #define ETIMEDOUT 110 diff --git a/tools/testing/selftests/bpf/progs/res_spin_lock_fail.c b/tools/testing/selftests/bpf/progs/res_spin_lock_fail.c index 330682a88c16..d643ff783798 100644 --- a/tools/testing/selftests/bpf/progs/res_spin_lock_fail.c +++ b/tools/testing/selftests/bpf/progs/res_spin_lock_fail.c @@ -6,6 +6,7 @@ #include <bpf/bpf_core_read.h> #include "bpf_misc.h" #include "bpf_experimental.h" +#include "bpf_kfuncs.h" struct arr_elem { struct bpf_res_spin_lock lock; diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__tail_call.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__tail_call.c index 3b125025a1f2..7661658848f4 100644 --- a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__tail_call.c +++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__tail_call.c @@ -4,6 +4,7 @@ #include <bpf/bpf_tracing.h> #include "../test_kmods/bpf_testmod.h" #include "bpf_misc.h" +#include "bpf_kfuncs.h" char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_spin_lock_fail.c b/tools/testing/selftests/bpf/progs/test_spin_lock_fail.c index f678ee6bd7ea..aee2791ad863 100644 --- a/tools/testing/selftests/bpf/progs/test_spin_lock_fail.c +++ b/tools/testing/selftests/bpf/progs/test_spin_lock_fail.c @@ -3,6 +3,7 @@ #include <bpf/bpf_tracing.h> #include <bpf/bpf_helpers.h> #include "bpf_experimental.h" +#include "bpf_kfuncs.h" struct foo 
{ struct bpf_spin_lock lock; diff --git a/tools/testing/selftests/bpf/progs/verifier_bpf_trap.c b/tools/testing/selftests/bpf/progs/verifier_bpf_trap.c index 35e2cdc00a01..9d89ab6f5c58 100644 --- a/tools/testing/selftests/bpf/progs/verifier_bpf_trap.c +++ b/tools/testing/selftests/bpf/progs/verifier_bpf_trap.c @@ -3,6 +3,7 @@ #include <vmlinux.h> #include <bpf/bpf_helpers.h> #include "bpf_misc.h" +#include "bpf_kfuncs.h" #if __clang_major__ >= 21 && 0 SEC("socket") -- 2.43.0
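The declaration pattern used throughout bpf_kfuncs.h (an explicit extern prototype marked __weak) has a loose userspace analogue that may help when reading the patch. The sketch below is plain C, not BPF code: optional_trace_hook() is a made-up symbol, and the null check only stands in for the kind of availability test a BPF program would do for a kfunc. It assumes a GCC or Clang toolchain targeting ELF.

```c
#include <stddef.h>

/*
 * Explicit declaration, marked weak: without the prototype the call below
 * would be an implicit declaration (an error in modern C), and without
 * "weak" the link would fail when nothing defines the symbol.
 * "optional_trace_hook" is a hypothetical name used only for illustration.
 */
extern int optional_trace_hook(int value) __attribute__((weak));

/*
 * Call the hook when something provides it; otherwise return fallback.
 * An undefined weak symbol resolves to a null address at link time.
 */
int call_hook_or_default(int value, int fallback)
{
	if (!optional_trace_hook)
		return fallback;
	return optional_trace_hook(value);
}
```

Centralizing such declarations in one header, as the patch does, keeps every user of a symbol on the same prototype.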


[PATCH] tools/nolibc: fix error return value of clock_nanosleep()

by Thomas Weißschuh

clock_nanosleep() returns a positive error value. Unlike other libc functions it *does not* return -1 nor set errno. Fix the return value and also adapt nanosleep(). Fixes: 7c02bc4088af ("tools/nolibc: add support for clock_nanosleep() and nanosleep()") Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de> --- tools/include/nolibc/time.h | 5 +++-- tools/testing/selftests/nolibc/nolibc-test.c | 1 + 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/tools/include/nolibc/time.h b/tools/include/nolibc/time.h index d02bc44d2643a5e39afa808841f7175bfab5ff7e..e9c1b976791a65c0d73268bebbcfd4f2a57a47ee 100644 --- a/tools/include/nolibc/time.h +++ b/tools/include/nolibc/time.h @@ -133,7 +133,8 @@ static __attribute__((unused)) int clock_nanosleep(clockid_t clockid, int flags, const struct timespec *rqtp, struct timespec *rmtp) { - return __sysret(sys_clock_nanosleep(clockid, flags, rqtp, rmtp)); + /* Directly return a positive error number */ + return -sys_clock_nanosleep(clockid, flags, rqtp, rmtp); } static __inline__ @@ -145,7 +146,7 @@ double difftime(time_t time1, time_t time2) static __inline__ int nanosleep(const struct timespec *rqtp, struct timespec *rmtp) { - return clock_nanosleep(CLOCK_REALTIME, 0, rqtp, rmtp); + return __sysret(sys_clock_nanosleep(CLOCK_REALTIME, 0, rqtp, rmtp)); } diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c index a297ee0d6d0754dfcd9f9e5609d42c7442dabc4e..cc4d730ac4656fb5944d50be9477a3dfefb00aa0 100644 --- a/tools/testing/selftests/nolibc/nolibc-test.c +++ b/tools/testing/selftests/nolibc/nolibc-test.c @@ -1334,6 +1334,7 @@ int run_syscall(int min, int max) CASE_TEST(chroot_root); EXPECT_SYSZR(euid0, chroot("/")); break; CASE_TEST(chroot_blah); EXPECT_SYSER(1, chroot("/proc/self/blah"), -1, ENOENT); break; CASE_TEST(chroot_exe); EXPECT_SYSER(1, chroot(argv0), -1, ENOTDIR); break; + CASE_TEST(clock_nanosleep); ts.tv_nsec = -1; EXPECT_EQ(1, EINVAL, 
clock_nanosleep(CLOCK_REALTIME, 0, &ts, NULL)); break; CASE_TEST(close_m1); EXPECT_SYSER(1, close(-1), -1, EBADF); break; CASE_TEST(close_dup); EXPECT_SYSZR(1, close(dup(0))); break; CASE_TEST(dup_0); tmp = dup(0); EXPECT_SYSNE(1, tmp, -1); close(tmp); break; --- base-commit: 260f6f4fda93c8485c8037865c941b42b9cba5d2 change-id: 20250731-nolibc-clock_nanosleep-ret-b03a299c083f Best regards, -- Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
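The return-value convention this patch restores can be observed against a hosted libc as well. The sketch below assumes Linux with glibc or another POSIX-conforming libc (it is not nolibc itself), and the function names are hypothetical. It passes an invalid tv_nsec, as the new selftest case does, and shows that clock_nanosleep() hands back the error number directly while nanosleep() returns -1 and sets errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/*
 * clock_nanosleep() with an invalid tv_nsec: the error number comes back
 * as the (positive) return value; errno is not the reporting channel.
 */
int bad_clock_nanosleep(void)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = -1 };

	return clock_nanosleep(CLOCK_REALTIME, 0, &ts, NULL);
}

/*
 * nanosleep() with the same invalid request: returns -1 and sets errno.
 * Returns the errno value seen, or 0 if the call unexpectedly succeeded.
 */
int bad_nanosleep_errno(void)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = -1 };

	errno = 0;
	if (nanosleep(&ts, NULL) == -1)
		return errno;
	return 0;
}
```

This is why the patch negates the raw syscall result in clock_nanosleep() but keeps the __sysret() wrapper for nanosleep().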


[PATCH v2] selftests/tty: add TIOCSTI test suite

by Abhinav Saxena via B4 Relay

From: Abhinav Saxena <xandfury(a)gmail.com> TIOCSTI is a TTY ioctl command that allows inserting characters into the terminal input queue, making it appear as if the user typed those characters. Add a test suite with four tests to verify TIOCSTI behaviour in different scenarios when dev.tty.legacy_tiocsti is both enabled and disabled: - Test TIOCSTI functionality when legacy support is enabled - Test TIOCSTI rejection when legacy support is disabled - Test capability requirements for TIOCSTI usage - Test TIOCSTI security with file descriptor passing The tests validate proper enforcement of the legacy_tiocsti sysctl introduced in commit 83efeeeb3d04 ("tty: Allow TIOCSTI to be disabled"). See tty_ioctl(4) for details on TIOCSTI behavior and security requirements. Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com> --- This patch adds comprehensive selftests for the TIOCSTI ioctl to validate proper behaviour under different system configurations. =============== The TIOCSTI ioctl allows inserting characters into the terminal input queue, making it appear as if the user typed those characters. This functionality has security implications and behaviour that varies based on system configuration. Background ========== CONFIG_LEGACY_TIOCSTI controls the default value for the dev.tty.legacy_tiocsti sysctl, which remains runtime-configurable. The dev.tty.legacy_tiocsti sysctl was introduced in commit 83efeeeb3d04 ("tty: Allow TIOCSTI to be disabled") to provide administrators control over TIOCSTI usage. When legacy_tiocsti is disabled, TIOCSTI requires CAP_SYS_ADMIN capability. However, the current implementation only checks the current process's credentials via capable(CAP_SYS_ADMIN), which doesn't validate against the file opener's credentials stored in file->f_cred. This creates a potential security scenario where an unprivileged process can open a TTY fd and pass it to a privileged process via SCM_RIGHTS. 
Testing ======= The test suite includes four comprehensive tests: - Test TIOCSTI functionality when legacy support is enabled - Test TIOCSTI rejection when legacy support is disabled - Test capability requirements for TIOCSTI usage - Test TIOCSTI security with file descriptor passing All patches have been validated using: - scripts/checkpatch.pl --strict (0 errors, 0 warnings) - Functional testing on kernel v6.16-rc2 - File descriptor passing security test scenarios The fd_passing_security test demonstrates the security concern. To verify, disable legacy TIOCSTI and run the test: $ echo "0" | sudo tee /proc/sys/dev/tty/legacy_tiocsti $ sudo ./tools/testing/selftests/tty/tty_tiocsti_test -t fd_passing_security Patch Overview ============== PATCH 1/1: selftests/tty: add TIOCSTI test suite Comprehensive test suite demonstrating the issue and fix validation References ========== - tty_ioctl(4) - documents TIOCSTI ioctl and capability requirements - commit 83efeeeb3d04 ("tty: Allow TIOCSTI to be disabled") - Documentation/security/credentials.rst - https://github.com/KSPP/linux/issues/156 - https://lore.kernel.org/linux-hardening/Y0m9l52AKmw6Yxi1@hostpad/ - drivers/tty/Kconfig Configuration References: [1] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri… [2] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri… [3] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/dri… Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com> Changes in v2: - Focused series on selftests only - Removed SELinux capability checking patch for separate submission - Link to v1: https://lore.kernel.org/r/20250622-toicsti-bug-v1-0-f374373b04b2@gmail.com --- tools/testing/selftests/tty/Makefile | 6 +- tools/testing/selftests/tty/config | 1 + tools/testing/selftests/tty/tty_tiocsti_test.c | 421 +++++++++++++++++++++++++ 3 files changed, 427 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/tty/Makefile 
b/tools/testing/selftests/tty/Makefile index 50d7027b2ae3..7f6fbe5a0cd5 100644 --- a/tools/testing/selftests/tty/Makefile +++ b/tools/testing/selftests/tty/Makefile @@ -1,5 +1,9 @@ # SPDX-License-Identifier: GPL-2.0 CFLAGS = -O2 -Wall -TEST_GEN_PROGS := tty_tstamp_update +TEST_GEN_PROGS := tty_tstamp_update tty_tiocsti_test +LDLIBS += -lcap include ../lib.mk + +# Add libcap for TIOCSTI test +$(OUTPUT)/tty_tiocsti_test: LDLIBS += -lcap diff --git a/tools/testing/selftests/tty/config b/tools/testing/selftests/tty/config new file mode 100644 index 000000000000..c6373aba6636 --- /dev/null +++ b/tools/testing/selftests/tty/config @@ -0,0 +1 @@ +CONFIG_LEGACY_TIOCSTI=y diff --git a/tools/testing/selftests/tty/tty_tiocsti_test.c b/tools/testing/selftests/tty/tty_tiocsti_test.c new file mode 100644 index 000000000000..6a4b497078b0 --- /dev/null +++ b/tools/testing/selftests/tty/tty_tiocsti_test.c @@ -0,0 +1,421 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * TTY Tests - TIOCSTI + * + * Copyright © 2025 Abhinav Saxena <xandfury(a)gmail.com> + */ + +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <fcntl.h> +#include <sys/ioctl.h> +#include <errno.h> +#include <stdbool.h> +#include <string.h> +#include <sys/socket.h> +#include <sys/wait.h> +#include <pwd.h> +#include <termios.h> +#include <grp.h> +#include <sys/capability.h> +#include <sys/prctl.h> + +#include "../kselftest_harness.h" + +/* Helper function to send FD via SCM_RIGHTS */ +static int send_fd_via_socket(int socket_fd, int fd_to_send) +{ + struct msghdr msg = { 0 }; + struct cmsghdr *cmsg; + char cmsg_buf[CMSG_SPACE(sizeof(int))]; + char dummy_data = 'F'; + struct iovec iov = { .iov_base = &dummy_data, .iov_len = 1 }; + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = cmsg_buf; + msg.msg_controllen = sizeof(cmsg_buf); + + cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_RIGHTS; + cmsg->cmsg_len = CMSG_LEN(sizeof(int)); + + 
memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int)); + + return sendmsg(socket_fd, &msg, 0) < 0 ? -1 : 0; +} + +/* Helper function to receive FD via SCM_RIGHTS */ +static int recv_fd_via_socket(int socket_fd) +{ + struct msghdr msg = { 0 }; + struct cmsghdr *cmsg; + char cmsg_buf[CMSG_SPACE(sizeof(int))]; + char dummy_data; + struct iovec iov = { .iov_base = &dummy_data, .iov_len = 1 }; + int received_fd = -1; + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = cmsg_buf; + msg.msg_controllen = sizeof(cmsg_buf); + + if (recvmsg(socket_fd, &msg, 0) < 0) + return -1; + + for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) { + if (cmsg->cmsg_level == SOL_SOCKET && + cmsg->cmsg_type == SCM_RIGHTS) { + memcpy(&received_fd, CMSG_DATA(cmsg), sizeof(int)); + break; + } + } + + return received_fd; +} + +static inline bool has_cap_sys_admin(void) +{ + cap_t caps = cap_get_proc(); + + if (!caps) + return false; + + cap_flag_value_t cap_val; + bool has_cap = (cap_get_flag(caps, CAP_SYS_ADMIN, CAP_EFFECTIVE, + &cap_val) == 0) && + (cap_val == CAP_SET); + + cap_free(caps); + return has_cap; +} + +/* + * Simple privilege drop that just changes uid/gid in current process + * and also capabilities like CAP_SYS_ADMIN + */ +static inline bool drop_to_nobody(void) +{ + /* Drop supplementary groups */ + if (setgroups(0, NULL) != 0) { + printf("setgroups failed: %s", strerror(errno)); + return false; + } + + /* Change group to nobody */ + if (setgid(65534) != 0) { + printf("setgid failed: %s", strerror(errno)); + return false; + } + + /* Change user to nobody (this drops capabilities) */ + if (setuid(65534) != 0) { + printf("setuid failed: %s", strerror(errno)); + return false; + } + + /* Verify we no longer have CAP_SYS_ADMIN */ + if (has_cap_sys_admin()) { + printf("ERROR: Still have CAP_SYS_ADMIN after changing to nobody"); + return false; + } + + printf("Successfully changed to nobody (uid:%d gid:%d)\n", getuid(), + getgid()); + return true; +} + 
+static inline int get_legacy_tiocsti_setting(void) +{ + FILE *fp; + int value = -1; + + fp = fopen("/proc/sys/dev/tty/legacy_tiocsti", "r"); + if (!fp) { + if (errno == ENOENT) { + printf("legacy_tiocsti sysctl not available (kernel < 6.2)\n"); + } else { + printf("Cannot read legacy_tiocsti: %s\n", + strerror(errno)); + } + return -1; + } + + if (fscanf(fp, "%d", &value) == 1) { + printf("legacy_tiocsti setting=%d\n", value); + + if (value < 0 || value > 1) { + printf("legacy_tiocsti unexpected value %d\n", value); + value = -1; + } else { + printf("legacy_tiocsti=%d (%s mode)\n", value, + value == 0 ? "restricted" : "permissive"); + } + } else { + printf("Failed to parse legacy_tiocsti value"); + value = -1; + } + + fclose(fp); + return value; +} + +static inline int test_tiocsti_injection(int fd) +{ + int ret; + char test_char = 'X'; + + ret = ioctl(fd, TIOCSTI, &test_char); + if (ret == 0) { + /* Clear the injected character */ + printf("TIOCSTI injection succeeded\n"); + } else { + printf("TIOCSTI injection failed: %s (errno=%d)\n", + strerror(errno), errno); + } + return ret == 0 ? 0 : -1; +} + +FIXTURE(tty_tiocsti) +{ + int tty_fd; + char *tty_name; + bool has_tty; + bool initial_cap_sys_admin; + int legacy_tiocsti_setting; +}; + +FIXTURE_SETUP(tty_tiocsti) +{ + TH_LOG("Running as UID: %d with effective UID: %d", getuid(), + geteuid()); + + self->tty_fd = open("/dev/tty", O_RDWR); + self->has_tty = (self->tty_fd >= 0); + + if (self->tty_fd < 0) + TH_LOG("Cannot open /dev/tty: %s", strerror(errno)); + + self->tty_name = ttyname(STDIN_FILENO); + TH_LOG("Current TTY: %s", self->tty_name ? self->tty_name : "none"); + + self->initial_cap_sys_admin = has_cap_sys_admin(); + TH_LOG("Initial CAP_SYS_ADMIN: %s", + self->initial_cap_sys_admin ? 
"yes" : "no"); + + self->legacy_tiocsti_setting = get_legacy_tiocsti_setting(); +} + +FIXTURE_TEARDOWN(tty_tiocsti) +{ + if (self->has_tty && self->tty_fd >= 0) + close(self->tty_fd); +} + +/* Test case 1: legacy_tiocsti != 0 (permissive mode) */ +TEST_F(tty_tiocsti, permissive_mode) +{ + // clang-format off + if (self->legacy_tiocsti_setting < 0) + SKIP(return, + "legacy_tiocsti sysctl not available (kernel < 6.2)"); + + if (self->legacy_tiocsti_setting == 0) + SKIP(return, + "Test requires permissive mode (legacy_tiocsti=1)"); + // clang-format on + + ASSERT_TRUE(self->has_tty); + + if (self->initial_cap_sys_admin) { + ASSERT_TRUE(drop_to_nobody()); + ASSERT_FALSE(has_cap_sys_admin()); + } + + /* In permissive mode, TIOCSTI should work without CAP_SYS_ADMIN */ + EXPECT_EQ(test_tiocsti_injection(self->tty_fd), 0) + { + TH_LOG("TIOCSTI should succeed in permissive mode without CAP_SYS_ADMIN"); + } +} + +/* Test case 2: legacy_tiocsti == 0, without CAP_SYS_ADMIN (should fail) */ +TEST_F(tty_tiocsti, restricted_mode_nopriv) +{ + // clang-format off + if (self->legacy_tiocsti_setting < 0) + SKIP(return, + "legacy_tiocsti sysctl not available (kernel < 6.2)"); + + if (self->legacy_tiocsti_setting != 0) + SKIP(return, + "Test requires restricted mode (legacy_tiocsti=0)"); + // clang-format on + + ASSERT_TRUE(self->has_tty); + + if (self->initial_cap_sys_admin) { + ASSERT_TRUE(drop_to_nobody()); + ASSERT_FALSE(has_cap_sys_admin()); + } + /* In restricted mode, TIOCSTI should fail without CAP_SYS_ADMIN */ + EXPECT_EQ(test_tiocsti_injection(self->tty_fd), -1); + + /* + * it might fail with either EPERM or EIO + * EXPECT_TRUE(errno == EPERM || errno == EIO) + * { + * TH_LOG("Expected EPERM, got: %s", strerror(errno)); + * } + */ +} + +/* Test case 3: legacy_tiocsti == 0, with CAP_SYS_ADMIN (should succeed) */ +TEST_F(tty_tiocsti, restricted_mode_priv) +{ + // clang-format off + if (self->legacy_tiocsti_setting < 0) + SKIP(return, + "legacy_tiocsti sysctl not available 
(kernel < 6.2)"); + + if (self->legacy_tiocsti_setting != 0) + SKIP(return, + "Test requires restricted mode (legacy_tiocsti=0)"); + // clang-format on + + /* Must have CAP_SYS_ADMIN for this test */ + if (!self->initial_cap_sys_admin) + SKIP(return, "Test requires CAP_SYS_ADMIN"); + + ASSERT_TRUE(self->has_tty); + ASSERT_TRUE(has_cap_sys_admin()); + + /* In restricted mode, TIOCSTI should succeed with CAP_SYS_ADMIN */ + EXPECT_EQ(test_tiocsti_injection(self->tty_fd), 0) + { + TH_LOG("TIOCSTI should succeed in restricted mode with CAP_SYS_ADMIN"); + } +} + +/* Test TIOCSTI security with file descriptor passing */ +TEST_F(tty_tiocsti, fd_passing_security) +{ + // clang-format off + if (self->legacy_tiocsti_setting < 0) + SKIP(return, + "legacy_tiocsti sysctl not available (kernel < 6.2)"); + + if (self->legacy_tiocsti_setting != 0) + SKIP(return, + "Test requires restricted mode (legacy_tiocsti=0)"); + // clang-format on + + /* Must start with CAP_SYS_ADMIN */ + if (!self->initial_cap_sys_admin) + SKIP(return, "Test requires initial CAP_SYS_ADMIN"); + + int sockpair[2]; + pid_t child_pid; + + ASSERT_EQ(socketpair(AF_UNIX, SOCK_STREAM, 0, sockpair), 0); + + child_pid = fork(); + ASSERT_GE(child_pid, 0) + TH_LOG("Fork failed: %s", strerror(errno)); + + if (child_pid == 0) { + /* Child process - become unprivileged, open TTY, send FD to parent */ + close(sockpair[0]); + + TH_LOG("Child: Dropping privileges..."); + + /* Drop to nobody user (loses all capabilities) */ + drop_to_nobody(); + + /* Verify we no longer have CAP_SYS_ADMIN */ + if (has_cap_sys_admin()) { + TH_LOG("Child: Failed to drop CAP_SYS_ADMIN"); + _exit(1); + } + + TH_LOG("Child: Opening TTY as unprivileged user..."); + + int unprivileged_tty_fd = open("/dev/tty", O_RDWR); + + if (unprivileged_tty_fd < 0) { + TH_LOG("Child: Cannot open TTY: %s", strerror(errno)); + _exit(1); + } + + /* Test that we can't use TIOCSTI directly (should fail) */ + + char test_char = 'X'; + + if (ioctl(unprivileged_tty_fd, 
TIOCSTI, &test_char) == 0) { + TH_LOG("Child: ERROR - Direct TIOCSTI succeeded unexpectedly!"); + close(unprivileged_tty_fd); + _exit(1); + } + TH_LOG("Child: Good - Direct TIOCSTI failed as expected: %s", + strerror(errno)); + + /* Send the TTY FD to privileged parent via SCM_RIGHTS */ + TH_LOG("Child: Sending TTY FD to privileged parent..."); + if (send_fd_via_socket(sockpair[1], unprivileged_tty_fd) != 0) { + TH_LOG("Child: Failed to send FD"); + close(unprivileged_tty_fd); + _exit(1); + } + + close(unprivileged_tty_fd); + close(sockpair[1]); + _exit(0); /* Child success */ + + } else { + /* Parent process - keep CAP_SYS_ADMIN, receive FD, test TIOCSTI */ + close(sockpair[1]); + + TH_LOG("Parent: Waiting for TTY FD from unprivileged child..."); + + /* Verify we still have CAP_SYS_ADMIN */ + ASSERT_TRUE(has_cap_sys_admin()); + + /* Receive the TTY FD from unprivileged child */ + int received_fd = recv_fd_via_socket(sockpair[0]); + + ASSERT_GE(received_fd, 0) + TH_LOG("Parent: Received FD %d (opened by unprivileged process)", + received_fd); + + /* + * VULNERABILITY TEST: Try TIOCSTI with FD opened by unprivileged process + * This should FAIL even though parent has CAP_SYS_ADMIN + * because the FD was opened by unprivileged process + */ + char attack_char = 'V'; /* V for Vulnerability */ + int ret = ioctl(received_fd, TIOCSTI, &attack_char); + + TH_LOG("Parent: Testing TIOCSTI on FD from unprivileged process..."); + if (ret == 0) { + TH_LOG("*** VULNERABILITY DETECTED ***"); + TH_LOG("Privileged process can use TIOCSTI on unprivileged FD"); + } else { + TH_LOG("TIOCSTI failed on unprivileged FD: %s", + strerror(errno)); + EXPECT_EQ(errno, EPERM); + } + close(received_fd); + close(sockpair[0]); + + /* Wait for child */ + int status; + + ASSERT_EQ(waitpid(child_pid, &status, 0), child_pid); + EXPECT_EQ(WEXITSTATUS(status), 0); + ASSERT_NE(ret, 0); + } +} + +TEST_HARNESS_MAIN --- base-commit: 40f92e79b0aabbf3575e371f9054657a421a3e79 change-id: 
20250618-toicsti-bug-7822b8e94a32 Best regards, -- Abhinav Saxena <xandfury(a)gmail.com>
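
The three capability test cases in the patch above (permissive_mode, restricted_mode_nopriv, restricted_mode_priv) amount to a small decision table. A minimal Python sketch of that table follows. The helper names `parse_legacy_tiocsti` and `expected_tiocsti_ok` are hypothetical and not part of the patch, and the table records only what the selftest expects, not how the kernel implements the check.

```python
from typing import Optional


def parse_legacy_tiocsti(text: str) -> Optional[int]:
    """Parse the sysctl contents; return 0 or 1, or None if unusable.

    Roughly mirrors get_legacy_tiocsti_setting() in the patch, which
    rejects anything other than 0 or 1 (it returns -1, this returns None).
    """
    try:
        value = int(text.strip())
    except ValueError:
        return None
    return value if value in (0, 1) else None


def expected_tiocsti_ok(legacy_tiocsti: int, has_cap_sys_admin: bool) -> bool:
    """Return True if the selftest expects ioctl(TIOCSTI) to succeed."""
    if legacy_tiocsti == 1:
        # permissive_mode: the test drops privileges and still expects success.
        # Assumption: a privileged caller would succeed as well (not tested).
        return True
    # Restricted mode: only a CAP_SYS_ADMIN holder is expected to succeed.
    return has_cap_sys_admin


if __name__ == "__main__":
    for legacy in (0, 1):
        for cap in (False, True):
            print(legacy, cap, expected_tiocsti_ok(legacy, cap))
```

The fd_passing_security case is deliberately left out of the sketch: it depends on which process opened the descriptor, which this two-input table cannot express.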

5 months, 1 week


[RFC PATCH 1/2] kunit: tool: Move qemu architecture dependency checks into a function

by David Gow

Currently the RISC-V qemu architecture config has some code to check for the presence of the BIOS files before loading, printing a message and quitting if it's not present. However, this prevents us from loading the architecture file even just to inspect it if the dependencies are not present.

Instead, have kunit.py look for a check_dependencies function, and call it if present only when the architecture config is being used.

This is necessary for future changes which enumerate or automatically select an architecture.

Signed-off-by: David Gow <davidgow(a)google.com>
---
 tools/testing/kunit/kunit_kernel.py       |  5 +++++
 tools/testing/kunit/qemu_configs/riscv.py | 10 ++++++----
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/tools/testing/kunit/kunit_kernel.py b/tools/testing/kunit/kunit_kernel.py
index 260d8d9aa1db..c3201a76da24 100644
--- a/tools/testing/kunit/kunit_kernel.py
+++ b/tools/testing/kunit/kunit_kernel.py
@@ -230,6 +230,11 @@ def _get_qemu_ops(config_path: str,
 	assert isinstance(spec.loader, importlib.abc.Loader)
 	spec.loader.exec_module(config)
 
+	# Check for any per-architecture dependencies
+	if hasattr(config, 'check_dependencies'):
+		if not config.check_dependencies():
+			raise ValueError('Missing dependencies for ' + config_path)
+
 	if not hasattr(config, 'QEMU_ARCH'):
 		raise ValueError('qemu_config module missing "QEMU_ARCH": ' + config_path)
 	params: qemu_config.QemuArchParams = config.QEMU_ARCH
diff --git a/tools/testing/kunit/qemu_configs/riscv.py b/tools/testing/kunit/qemu_configs/riscv.py
index c87758030ff7..3c271d1005d9 100644
--- a/tools/testing/kunit/qemu_configs/riscv.py
+++ b/tools/testing/kunit/qemu_configs/riscv.py
@@ -6,10 +6,12 @@ import sys
 OPENSBI_FILE = 'opensbi-riscv64-generic-fw_dynamic.bin'
 OPENSBI_PATH = '/usr/share/qemu/' + OPENSBI_FILE
 
-if not os.path.isfile(OPENSBI_PATH):
-	print('\n\nOpenSBI bios was not found in "' + OPENSBI_PATH + '".\n'
-	      'Please ensure that qemu-system-riscv is installed, or edit the path in "qemu_configs/riscv.py"\n')
-	sys.exit()
+def check_dependencies() -> bool:
+	if not os.path.isfile(OPENSBI_PATH):
+		print('\n\nOpenSBI bios was not found in "' + OPENSBI_PATH + '".\n'
+		      'Please ensure that qemu-system-riscv is installed, or edit the path in "qemu_configs/riscv.py"\n')
+		return False
+	return True
 
 QEMU_ARCH = QemuArchParams(linux_arch='riscv',
 			   kconfig='''
-- 
2.50.1.552.g942d659e1b-goog

5 months, 1 week
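
The optional-hook pattern in the RFC above (call `check_dependencies()` only if the config module defines it) can be tried in isolation. The following is a minimal sketch, not the kunit tool's code: `ensure_dependencies` and the two fake modules are hypothetical names made up for illustration, and only the `hasattr`-style lookup and the `ValueError` message come from the patch.

```python
import types


def ensure_dependencies(config: types.ModuleType, config_path: str) -> None:
    """Call the module's optional check_dependencies() hook.

    Raises ValueError if the hook exists and reports a missing dependency.
    Modules that do not define the hook are accepted unchanged.
    """
    check = getattr(config, 'check_dependencies', None)
    if check is not None and not check():
        raise ValueError('Missing dependencies for ' + config_path)


# Hypothetical stand-ins for two architecture config modules.
with_hook = types.ModuleType('fake_riscv')
with_hook.check_dependencies = lambda: False  # pretend the BIOS file is absent
without_hook = types.ModuleType('fake_x86_64')

ensure_dependencies(without_hook, 'x86_64.py')  # no hook, so nothing is checked
try:
    ensure_dependencies(with_hook, 'riscv.py')
except ValueError as err:
    print(err)
```

Because the check runs only when a config is actually selected, merely importing a config module to inspect it has no side effects, which is the property the commit message asks for.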


[PATCH v2] selftests: panic: Add test module to trigger kernel panic

by Vishal Parmar

This patch adds a new test module under tools/testing/selftests/panic that intentionally triggers a kernel panic for test and diagnostic purposes. The goal is to provide a reproducible and isolated kernel panic event for testing crash dump mechanisms or validating kernel panic handling behavior. The test includes: - A kernel module that calls panic() in init. - A Makefile to build the kernel module. - A run.sh script to load the module and capture panic logs. Changes in v2: - Added run.sh - Added reference output log of run.sh Signed-off-by: Vishal Parmar <vishistriker(a)gmail.com> --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/panic/Makefile | 13 +++++++++ .../selftests/panic/panic_trigger_test.c | 26 +++++++++++++++++ .../selftests/panic/reference_output_log.txt | 29 +++++++++++++++++++ tools/testing/selftests/panic/run.sh | 17 +++++++++++ 5 files changed, 86 insertions(+) create mode 100644 tools/testing/selftests/panic/Makefile create mode 100644 tools/testing/selftests/panic/panic_trigger_test.c create mode 100644 tools/testing/selftests/panic/reference_output_log.txt create mode 100755 tools/testing/selftests/panic/run.sh diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 339b31e6a6b5..7b824470a9b3 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -78,6 +78,7 @@ TARGETS += net/packetdrill TARGETS += net/rds TARGETS += net/tcp_ao TARGETS += nsfs +TARGETS += panic TARGETS += pci_endpoint TARGETS += pcie_bwctrl TARGETS += perf_events diff --git a/tools/testing/selftests/panic/Makefile b/tools/testing/selftests/panic/Makefile new file mode 100644 index 000000000000..e4a1b88a63b2 --- /dev/null +++ b/tools/testing/selftests/panic/Makefile @@ -0,0 +1,13 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-m := panic_trigger_test.o + +KDIR := $(abspath ../../../../) +PWD := $(shell pwd) + +all: + $(MAKE) -C $(KDIR) M=$(PWD) modules + +clean: + $(MAKE) -C $(KDIR) M=$(PWD) clean + rm 
-f *.mod.c *.o *.ko *.order *.symvers diff --git a/tools/testing/selftests/panic/panic_trigger_test.c b/tools/testing/selftests/panic/panic_trigger_test.c new file mode 100644 index 000000000000..4e2e043fe3ad --- /dev/null +++ b/tools/testing/selftests/panic/panic_trigger_test.c @@ -0,0 +1,26 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * panic_test.c - Module to test kernel panic + */ + +#include <linux/module.h> +#include <linux/init.h> + +static int __init panic_test_init(void) +{ + pr_info("Triggering a deliberate kernel panic now.\n"); + panic("Triggered by panic_test module."); + return 0; +} + +static void __exit panic_test_exit(void) +{ + pr_info("This should not be printed, as system panics on init.\n"); +} + +module_init(panic_test_init); +module_exit(panic_test_exit); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Vishal Parmar"); +MODULE_DESCRIPTION("Module to trigger kernel panic for testing"); diff --git a/tools/testing/selftests/panic/reference_output_log.txt b/tools/testing/selftests/panic/reference_output_log.txt new file mode 100644 index 000000000000..2c8143bf6c4a --- /dev/null +++ b/tools/testing/selftests/panic/reference_output_log.txt @@ -0,0 +1,29 @@ +[*] Inserting module: panic_trigger_test.ko +[ 30.377307] panic_trigger_test: loading out-of-tree module taints kernel. +[ 30.380328] Triggering a deliberate kernel panic now. +[ 30.382369] Kernel panic - not syncing: Triggered by panic_test module. +[ 30.383349] CPU: 1 UID: 0 PID: 99 Comm: insmod Tainted: G O 6.16.0-rc7-00140-gec2df4364666 #1 PREEMPT(voluntary) +[ 30.383349] Tainted: [O]=OOT_MODULE +[ 30.383349] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 +[ 30.383349] Call Trace: +[ 30.383349] <TASK> +[ 30.383349] panic+0x325/0x380 +[ 30.383349] ? 
__pfx_panic_test_init+0x10/0x10 [panic_trigger_test] +[ 30.383349] panic_test_init+0x1c/0xff0 [panic_trigger_test] +[ 30.383349] do_one_initcall+0x55/0x220 +[ 30.383349] do_init_module+0x5b/0x230 +[ 30.383349] __do_sys_init_module+0x150/0x180 +[ 30.383349] do_syscall_64+0xa4/0x260 +[ 30.383349] entry_SYSCALL_64_after_hwframe+0x77/0x7f +[ 30.383349] RIP: 0033:0x7f88f09177d9 +[ 30.383349] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f7 05 8 +[ 30.383349] RSP: 002b:00007ffc16317848 EFLAGS: 00000206 ORIG_RAX: 00000000000000af +[ 30.383349] RAX: ffffffffffffffda RBX: 000055a4a84b0eae RCX: 00007f88f09177d9 +[ 30.383349] RDX: 000055a4a84b0eae RSI: 00000000000019e8 RDI: 000055a4c9b4b370 +[ 30.383349] RBP: 00007ffc16317bd0 R08: 000055a4c9b4b310 R09: 00000000000019e8 +[ 30.383349] R10: 0000000000000007 R11: 0000000000000206 R12: 00007ffc16317bd8 +[ 30.383349] R13: 00007ffc16317be0 R14: 000055a4a84b0eae R15: 00007f88f0b25020 +[ 30.383349] </TASK> +[ 30.383349] Kernel Offset: 0x10000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) +[ 30.383349] ---[ end Kernel panic - not syncing: Triggered by panic_test module. ]--- + diff --git a/tools/testing/selftests/panic/run.sh b/tools/testing/selftests/panic/run.sh new file mode 100755 index 000000000000..ffa20dc22708 --- /dev/null +++ b/tools/testing/selftests/panic/run.sh @@ -0,0 +1,17 @@ +# tools/testing/selftests/panic/run.sh + +#!/bin/sh +set -e + +MOD_NAME="panic_trigger_test.ko" +LOG_FILE="panic_log.txt" + +echo "[*] Clearing dmesg..." +dmesg -c + +echo "[*] Inserting module: $MOD_NAME" +insmod ./$MOD_NAME + +echo "[*] Capturing dmesg..." +dmesg > "$LOG_FILE" + -- 2.39.5

5 months, 1 week


Re: [PATCH v5] char: misc: add test cases

by Geert Uytterhoeven

Hi Thadeu,

On Sun, 15 Jun 2025 at 23:31, Thadeu Lima de Souza Cascardo <cascardo(a)igalia.com> wrote:
>
> Add test cases for static and dynamic minor number allocation and
> deallocation.
>
> While at it, improve description and test suite name.
>
> Some of the cases include:
>
> - that static and dynamic allocation reserved the expected minors.
>
> - that registering duplicate minors or duplicate names will fail.
>
> - that failing to create a sysfs file (due to duplicate names) will
>   deallocate the dynamic minor correctly.
>
> - that dynamic allocation does not allocate a minor number in the static
>   range.
>
> - that there are no collisions when mixing dynamic and static allocations.
>
> - that opening devices with various minor device numbers work.
>
> - that registering a static number in the dynamic range won't conflict with
>   a dynamic allocation.
>
> This last test verifies the bug fixed by commit 6d04d2b554b1 ("misc:
> misc_minor_alloc to use ida for all dynamic/misc dynamic minors") has not
> regressed.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo(a)igalia.com>

Thanks for your patch, which is now commit 74d8361be3441dff ("char: misc: add test cases") in linus/master stable/master

> Changes in v5:
> - Make miscdevice unit test built-in only
> - Make unit test require CONFIG_KUNIT=y

Why were these changes made?
This means the test is no longer available if KUNIT=m, and I can no longer just load the module when I want to run the test.

> - Link to v4: https://lore.kernel.org/r/20250423-misc-dynrange-v4-0-133b5ae4ca18@igalia.c…

> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -2506,8 +2506,8 @@ config TEST_IDA
>  	tristate "Perform selftest on IDA functions"
>
>  config TEST_MISC_MINOR
> -	tristate "miscdevice KUnit test" if !KUNIT_ALL_TESTS
> -	depends on KUNIT
> +	bool "miscdevice KUnit test" if !KUNIT_ALL_TESTS
> +	depends on KUNIT=y
>  	default KUNIT_ALL_TESTS
>  	help
>  	  Kunit test for miscdevice API, specially its behavior in respect to

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert(a)linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

5 months, 1 week


[PATCH 5.10.y 4/4] selftests/memfd: add test for mapping write-sealed memfd read-only

by Isaac J. Manjarres

From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> [ Upstream commit ea0916e01d0b0f2cce1369ac1494239a79827270 ] Now we have reinstated the ability to map F_SEAL_WRITE mappings read-only, assert that we are able to do this in a test to ensure that we do not regress this again. Link: https://lkml.kernel.org/r/a6377ec470b14c0539b4600cf8fa24bf2e4858ae.17328047… Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> Cc: Jann Horn <jannh(a)google.com> Cc: Julian Orth <ju.orth(a)gmail.com> Cc: Liam R. Howlett <Liam.Howlett(a)Oracle.com> Cc: Linus Torvalds <torvalds(a)linux-foundation.org> Cc: Shuah Khan <shuah(a)kernel.org> Cc: Vlastimil Babka <vbabka(a)suse.cz> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> Cc: stable(a)vger.kernel.org Signed-off-by: Isaac J. Manjarres <isaacmanjarres(a)google.com> --- tools/testing/selftests/memfd/memfd_test.c | 43 ++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index fba322d1c67a..5d1ad547416a 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -186,6 +186,24 @@ static void *mfd_assert_mmap_shared(int fd) return p; } +static void *mfd_assert_mmap_read_shared(int fd) +{ + void *p; + + p = mmap(NULL, + mfd_def_size, + PROT_READ, + MAP_SHARED, + fd, + 0); + if (p == MAP_FAILED) { + printf("mmap() failed: %m\n"); + abort(); + } + + return p; +} + static void *mfd_assert_mmap_private(int fd) { void *p; @@ -802,6 +820,30 @@ static void test_seal_future_write(void) close(fd); } +static void test_seal_write_map_read_shared(void) +{ + int fd; + void *p; + + printf("%s SEAL-WRITE-MAP-READ\n", memfd_str); + + fd = mfd_assert_new("kern_memfd_seal_write_map_read", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + mfd_assert_add_seals(fd, F_SEAL_WRITE); + mfd_assert_has_seals(fd, F_SEAL_WRITE); + + p = mfd_assert_mmap_read_shared(fd); + + mfd_assert_read(fd); + 
mfd_assert_read_shared(fd); + mfd_fail_write(fd); + + munmap(p, mfd_def_size); + close(fd); +} + /* * Test SEAL_SHRINK * Test whether SEAL_SHRINK actually prevents shrinking @@ -1056,6 +1098,7 @@ int main(int argc, char **argv) test_seal_write(); test_seal_future_write(); + test_seal_write_map_read_shared(); test_seal_shrink(); test_seal_grow(); test_seal_resize(); -- 2.50.1.552.g942d659e1b-goog
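
The behaviour this backport tests (a write-sealed memfd can still be mapped shared and read-only) can be probed from user space. The following is a minimal Python sketch, not the selftest itself: the function name and its return strings are hypothetical, and it assumes Linux with Python 3.8 or later for `os.memfd_create` and the `fcntl` sealing constants. The "refused" outcome is what the commit message implies for kernels that lack the fix.

```python
import fcntl
import mmap
import os


def write_sealed_memfd_maps_read_only(size: int = 4096) -> str:
    """Seal a memfd against writes, then try a read-only MAP_SHARED mapping.

    Returns 'mapped' if mmap() succeeds, 'refused' if mmap() is rejected,
    or 'unsupported' if memfd sealing is not usable on this platform.
    """
    if not hasattr(os, 'memfd_create') or not hasattr(fcntl, 'F_ADD_SEALS'):
        return 'unsupported'
    try:
        fd = os.memfd_create('sketch_seal_write',
                             os.MFD_CLOEXEC | os.MFD_ALLOW_SEALING)
    except OSError:
        return 'unsupported'
    try:
        try:
            os.ftruncate(fd, size)
            # Same seal the selftest adds with mfd_assert_add_seals().
            fcntl.fcntl(fd, fcntl.F_ADD_SEALS, fcntl.F_SEAL_WRITE)
        except OSError:
            return 'unsupported'
        try:
            mapping = mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                                prot=mmap.PROT_READ)
        except OSError:
            return 'refused'
        mapping.close()
        return 'mapped'
    finally:
        os.close(fd)


if __name__ == "__main__":
    print(write_sealed_memfd_maps_read_only())
```

The sketch reports the outcome instead of aborting, so it can be run on both patched and unpatched kernels to compare results.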

5 months, 1 week


[PATCH 5.15.y 4/4] selftests/memfd: add test for mapping write-sealed memfd read-only

by Isaac J. Manjarres

From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> [ Upstream commit ea0916e01d0b0f2cce1369ac1494239a79827270 ] Now we have reinstated the ability to map F_SEAL_WRITE mappings read-only, assert that we are able to do this in a test to ensure that we do not regress this again. Link: https://lkml.kernel.org/r/a6377ec470b14c0539b4600cf8fa24bf2e4858ae.17328047… Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> Cc: Jann Horn <jannh(a)google.com> Cc: Julian Orth <ju.orth(a)gmail.com> Cc: Liam R. Howlett <Liam.Howlett(a)Oracle.com> Cc: Linus Torvalds <torvalds(a)linux-foundation.org> Cc: Shuah Khan <shuah(a)kernel.org> Cc: Vlastimil Babka <vbabka(a)suse.cz> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> Cc: stable(a)vger.kernel.org Signed-off-by: Isaac J. Manjarres <isaacmanjarres(a)google.com> --- tools/testing/selftests/memfd/memfd_test.c | 43 ++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 94df2692e6e4..15a90db80836 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -186,6 +186,24 @@ static void *mfd_assert_mmap_shared(int fd) return p; } +static void *mfd_assert_mmap_read_shared(int fd) +{ + void *p; + + p = mmap(NULL, + mfd_def_size, + PROT_READ, + MAP_SHARED, + fd, + 0); + if (p == MAP_FAILED) { + printf("mmap() failed: %m\n"); + abort(); + } + + return p; +} + static void *mfd_assert_mmap_private(int fd) { void *p; @@ -802,6 +820,30 @@ static void test_seal_future_write(void) close(fd); } +static void test_seal_write_map_read_shared(void) +{ + int fd; + void *p; + + printf("%s SEAL-WRITE-MAP-READ\n", memfd_str); + + fd = mfd_assert_new("kern_memfd_seal_write_map_read", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + mfd_assert_add_seals(fd, F_SEAL_WRITE); + mfd_assert_has_seals(fd, F_SEAL_WRITE); + + p = mfd_assert_mmap_read_shared(fd); + + mfd_assert_read(fd); + 
mfd_assert_read_shared(fd); + mfd_fail_write(fd); + + munmap(p, mfd_def_size); + close(fd); +} + /* * Test SEAL_SHRINK * Test whether SEAL_SHRINK actually prevents shrinking @@ -1056,6 +1098,7 @@ int main(int argc, char **argv) test_seal_write(); test_seal_future_write(); + test_seal_write_map_read_shared(); test_seal_shrink(); test_seal_grow(); test_seal_resize(); -- 2.50.1.552.g942d659e1b-goog

5 months, 1 week


[PATCH 6.1.y 4/4] selftests/memfd: add test for mapping write-sealed memfd read-only

by Isaac J. Manjarres

From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> [ Upstream commit ea0916e01d0b0f2cce1369ac1494239a79827270 ] Now we have reinstated the ability to map F_SEAL_WRITE mappings read-only, assert that we are able to do this in a test to ensure that we do not regress this again. Link: https://lkml.kernel.org/r/a6377ec470b14c0539b4600cf8fa24bf2e4858ae.17328047… Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> Cc: Jann Horn <jannh(a)google.com> Cc: Julian Orth <ju.orth(a)gmail.com> Cc: Liam R. Howlett <Liam.Howlett(a)Oracle.com> Cc: Linus Torvalds <torvalds(a)linux-foundation.org> Cc: Shuah Khan <shuah(a)kernel.org> Cc: Vlastimil Babka <vbabka(a)suse.cz> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> Cc: stable(a)vger.kernel.org Signed-off-by: Isaac J. Manjarres <isaacmanjarres(a)google.com> --- tools/testing/selftests/memfd/memfd_test.c | 43 ++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 94df2692e6e4..15a90db80836 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -186,6 +186,24 @@ static void *mfd_assert_mmap_shared(int fd) return p; } +static void *mfd_assert_mmap_read_shared(int fd) +{ + void *p; + + p = mmap(NULL, + mfd_def_size, + PROT_READ, + MAP_SHARED, + fd, + 0); + if (p == MAP_FAILED) { + printf("mmap() failed: %m\n"); + abort(); + } + + return p; +} + static void *mfd_assert_mmap_private(int fd) { void *p; @@ -802,6 +820,30 @@ static void test_seal_future_write(void) close(fd); } +static void test_seal_write_map_read_shared(void) +{ + int fd; + void *p; + + printf("%s SEAL-WRITE-MAP-READ\n", memfd_str); + + fd = mfd_assert_new("kern_memfd_seal_write_map_read", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + mfd_assert_add_seals(fd, F_SEAL_WRITE); + mfd_assert_has_seals(fd, F_SEAL_WRITE); + + p = mfd_assert_mmap_read_shared(fd); + + mfd_assert_read(fd); + 
mfd_assert_read_shared(fd); + mfd_fail_write(fd); + + munmap(p, mfd_def_size); + close(fd); +} + /* * Test SEAL_SHRINK * Test whether SEAL_SHRINK actually prevents shrinking @@ -1056,6 +1098,7 @@ int main(int argc, char **argv) test_seal_write(); test_seal_future_write(); + test_seal_write_map_read_shared(); test_seal_shrink(); test_seal_grow(); test_seal_resize(); -- 2.50.1.552.g942d659e1b-goog

5 months, 1 week


[PATCH 6.6.y 4/4] selftests/memfd: add test for mapping write-sealed memfd read-only

by Isaac J. Manjarres

From: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> [ Upstream commit ea0916e01d0b0f2cce1369ac1494239a79827270 ] Now we have reinstated the ability to map F_SEAL_WRITE mappings read-only, assert that we are able to do this in a test to ensure that we do not regress this again. Link: https://lkml.kernel.org/r/a6377ec470b14c0539b4600cf8fa24bf2e4858ae.17328047… Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> Cc: Jann Horn <jannh(a)google.com> Cc: Julian Orth <ju.orth(a)gmail.com> Cc: Liam R. Howlett <Liam.Howlett(a)Oracle.com> Cc: Linus Torvalds <torvalds(a)linux-foundation.org> Cc: Shuah Khan <shuah(a)kernel.org> Cc: Vlastimil Babka <vbabka(a)suse.cz> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> Cc: stable(a)vger.kernel.org Signed-off-by: Isaac J. Manjarres <isaacmanjarres(a)google.com> --- tools/testing/selftests/memfd/memfd_test.c | 43 ++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index e92b60eecb7d..9c9c82fd18a7 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -285,6 +285,24 @@ static void *mfd_assert_mmap_shared(int fd) return p; } +static void *mfd_assert_mmap_read_shared(int fd) +{ + void *p; + + p = mmap(NULL, + mfd_def_size, + PROT_READ, + MAP_SHARED, + fd, + 0); + if (p == MAP_FAILED) { + printf("mmap() failed: %m\n"); + abort(); + } + + return p; +} + static void *mfd_assert_mmap_private(int fd) { void *p; @@ -986,6 +1004,30 @@ static void test_seal_future_write(void) close(fd); } +static void test_seal_write_map_read_shared(void) +{ + int fd; + void *p; + + printf("%s SEAL-WRITE-MAP-READ\n", memfd_str); + + fd = mfd_assert_new("kern_memfd_seal_write_map_read", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + mfd_assert_add_seals(fd, F_SEAL_WRITE); + mfd_assert_has_seals(fd, F_SEAL_WRITE); + + p = mfd_assert_mmap_read_shared(fd); + + mfd_assert_read(fd); 
+ mfd_assert_read_shared(fd); + mfd_fail_write(fd); + + munmap(p, mfd_def_size); + close(fd); +} + /* * Test SEAL_SHRINK * Test whether SEAL_SHRINK actually prevents shrinking @@ -1603,6 +1645,7 @@ int main(int argc, char **argv) test_seal_write(); test_seal_future_write(); + test_seal_write_map_read_shared(); test_seal_shrink(); test_seal_grow(); test_seal_resize(); -- 2.50.1.552.g942d659e1b-goog

5 months, 1 week


[PATCH v2 0/2] seccomp: Fix a race with WAIT_KILLABLE_RECV if the tracer replies too fast

by Johannes Nixdorf

If WAIT_KILLABLE_RECV was specified, and an event is received, the tracee's syscall is not supposed to be interruptible. This was not properly ensured if the reply was sent too fast, and an interrupting signal was received before the reply was processed on the tracee side.

This series fixes the bug and adds a test case for it to the selftests.

Signed-off-by: Johannes Nixdorf <johannes(a)nixdorf.dev>
---
Changes in v2:
- Added a selftest for the bug.
- Link to v1: https://lore.kernel.org/r/20250723-seccomp-races-v1-1-bef5667ce30a@nixdorf.…

---
Johannes Nixdorf (2):
      seccomp: Fix a race with WAIT_KILLABLE_RECV if the tracer replies too fast
      selftests/seccomp: Add a test for the WAIT_KILLABLE_RECV fast reply race

 kernel/seccomp.c                              |  13 ++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 130 ++++++++++++++++++++++++++
 2 files changed, 136 insertions(+), 7 deletions(-)
---
base-commit: 89be9a83ccf1f88522317ce02f854f30d6115c41
change-id: 20250721-seccomp-races-e97897d6d94b

Best regards,
-- 
Johannes Nixdorf <johannes(a)nixdorf.dev>

5 months, 1 week


[GIT PULL] kselftest next update for Linux 6.17-rc1

by Shuah Khan

Hi Linus,

Please pull this kselftest next update for Linux 6.17-rc1.

Fixes
- false failure of subsystem event test
- glob filter test to use mutex_unlock() instead of mutex_trylock()
- several spelling errors in tests
- test_kexec_jump build errors
- pidfd test duplicate-symbol warnings for SCHED_ CPP symbols

Adds a reliable check for suspend to breakpoints suspend test

Improvements to ipc test

diff is attached.

thanks,
-- Shuah

----------------------------------------------------------------
The following changes since commit 19272b37aa4f83ca52bdf9c16d5d81bdd1354494:

  Linux 6.16-rc1 (2025-06-08 13:44:43 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-next-6.17-rc1

for you to fetch changes up to 30fb5e134f05800dc424f8aa1d69841a6bdd9a54:

  selftests/pidfd: Fix duplicate-symbol warnings for SCHED_ CPP symbols (2025-07-24 16:14:45 -0600)

----------------------------------------------------------------
linux_kselftest-next-6.17-rc1

Fixes
- false failure of subsystem event test
- glob filter test to use mutex_unlock() instead of mutex_trylock()
- several spelling errors in tests
- test_kexec_jump build errors
- pidfd test duplicate-symbol warnings for SCHED_ CPP symbols

Adds a reliable check for suspend to breakpoints suspend test

Improvements to ipc test

----------------------------------------------------------------
Ankit Chauhan (1):
      selftests/ptrace: Fix spelling mistake "multible" -> "multiple"

Jihed Chaibi (1):
      selftests/cpu-hotplug: fix typo in hotplaggable_offline_cpus function name

Masami Hiramatsu (Google) (1):
      selftests: tracing: Use mutex_unlock for testing glob filter

Moon Hee Lee (2):
      selftests: breakpoints: use suspend_stats to reliably check suspend success
      selftests/kexec: fix test_kexec_jump build

Nick Huang (1):
      selftests: ipc: Replace fail print statements with ksft_test_result_fail

Paul E. McKenney (1):
      selftests/pidfd: Fix duplicate-symbol warnings for SCHED_ CPP symbols

Shuah Khan (1):
      selftests: print installation complete message

Steven Rostedt (1):
      selftests/tracing: Fix false failure of subsystem event test

Tianyi Cui (1):
      selftests: Add version file to kselftest installation dir

 tools/testing/selftests/Makefile                   |  8 ++++
 .../breakpoints/step_after_suspend_test.c          | 41 ++++++++++++++-----
 .../selftests/cpu-hotplug/cpu-on-off-test.sh       |  4 +-
 .../ftrace/test.d/event/subsystem-enable.tc        | 28 ++++++++++++-
 .../ftrace/test.d/ftrace/func-filter-glob.tc       |  2 +-
 tools/testing/selftests/ipc/msgque.c               | 47 +++++++++++-----------
 tools/testing/selftests/kexec/Makefile             |  2 +-
 tools/testing/selftests/pidfd/pidfd.h              |  9 +++++
 tools/testing/selftests/ptrace/peeksiginfo.c       |  2 +-
 9 files changed, 102 insertions(+), 41 deletions(-)
----------------------------------------------------------------

5 months, 1 week


[GIT PULL] kunit next update for Linux 6.17-rc1

by Shuah Khan

Hi Linus,

Please pull the following kunit next update for Linux 6.17-rc1.

Corrects MODULE_IMPORT_NS() syntax documentation, makes kunit_test timeout configurable via a module parameter and a Kconfig option, fixes longest symbol length test, adds a test for static stub, and adjusts kunit_test timeout based on test_{suite,case} speed.

diff is attached.

thanks,
-- Shuah

----------------------------------------------------------------
The following changes since commit 19272b37aa4f83ca52bdf9c16d5d81bdd1354494:

  Linux 6.16-rc1 (2025-06-08 13:44:43 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-6.17-rc1

for you to fetch changes up to 34db4fba81916a2001d7a503dfcf718c08ed5c42:

  kunit: fix longest symbol length test (2025-07-10 14:02:07 -0600)

----------------------------------------------------------------
linux_kselftest-kunit-6.17-rc1

Corrects MODULE_IMPORT_NS() syntax documentation, makes kunit_test timeout configurable via a module parameter and a Kconfig option, fixes longest symbol length test, adds a test for static stub, and adjusts kunit_test timeout based on test_{suite,case} speed.

----------------------------------------------------------------
Brian Norris (1):
      Documentation: kunit: Correct MODULE_IMPORT_NS() syntax

Marie Zhussupova (1):
      kunit: Make default kunit_test timeout configurable via both a module parameter and a Kconfig option

Sergio González Collado (1):
      kunit: fix longest symbol length test

Tzung-Bi Shih (1):
      kunit: Add test for static stub

Ujwal Jain (1):
      kunit: Adjust kunit_test timeout based on test_{suite,case} speed

 Documentation/dev-tools/kunit/usage.rst |  2 +-
 include/kunit/try-catch.h               |  1 +
 lib/Kconfig.debug                       |  1 +
 lib/kunit/Kconfig                       | 13 ++++++++
 lib/kunit/kunit-test.c                  | 55 ++++++++++++++++++++++++++++++---
 lib/kunit/test.c                        | 47 ++++++++++++++++++++++++++--
 lib/kunit/try-catch-impl.h              |  4 ++-
 lib/kunit/try-catch.c                   | 29 ++---------------
 lib/tests/longest_symbol_kunit.c        |  3 +-
 9 files changed, 118 insertions(+), 37 deletions(-)
----------------------------------------------------------------

5 months, 1 week


[PATCH nf-next v5 0/2] Add IPIP flowtable SW acceleration

by Lorenzo Bianconi

Introduce SW acceleration for IPIP tunnels in the netfilter flowtable infrastructure.

---
Changes in v5:
- Rely on __ipv4_addr_hash() to compute the hash used as encap ID
- Remove unnecessary pskb_may_pull() in nf_flow_tuple_encap()
- Add nf_flow_ip4_ecanp_pop utility routine
- Link to v4: https://lore.kernel.org/r/20250718-nf-flowtable-ipip-v4-0-f8bb1c18b986@kern…

Changes in v4:
- Use the hash value of the saddr, daddr and protocol of outer IP header as encapsulation id.
- Link to v3: https://lore.kernel.org/r/20250703-nf-flowtable-ipip-v3-0-880afd319b9f@kern…

Changes in v3:
- Add outer IP header sanity checks
- target nf-next tree instead of net-next
- Link to v2: https://lore.kernel.org/r/20250627-nf-flowtable-ipip-v2-0-c713003ce75b@kern…

Changes in v2:
- Introduce IPIP flowtable selftest
- Link to v1: https://lore.kernel.org/r/20250623-nf-flowtable-ipip-v1-1-2853596e3941@kern…

---
Lorenzo Bianconi (2):
  net: netfilter: Add IPIP flowtable SW acceleration
  selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest

 include/linux/netdevice.h | 1 +
 net/ipv4/ipip.c | 28 +++++++++++
 net/netfilter/nf_flow_table_ip.c | 56 +++++++++++++++++++++-
 net/netfilter/nft_flow_offload.c | 1 +
 .../selftests/net/netfilter/nft_flowtable.sh | 40 ++++++++++++++++
 5 files changed, 124 insertions(+), 2 deletions(-)

---
base-commit: dd500e4aecf25e48e874ca7628697969df679493
change-id: 20250623-nf-flowtable-ipip-1b3d7b08d067

Best regards,
--
Lorenzo Bianconi <lorenzo(a)kernel.org>

5 months, 1 week


[PATCH v7] selftests/mm: add process_madvise() tests

by wang lian

Add tests for process_madvise(), focusing on verifying behavior under various conditions including valid usage and error cases.

Signed-off-by: wang lian <lianux.mm(a)gmail.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Suggested-by: Mark Brown <broonie(a)kernel.org>
Acked-by: SeongJae Park <sj(a)kernel.org>
Reviewed-by: Zi Yan <ziy(a)nvidia.com>
Tested-by: Zi Yan <ziy(a)nvidia.com>
---
Changelog v7:
- In the remote_collapse test, replace default_huge_page_size() with read_pmd_pagesize()
- Add a new test, invalid_vlen, to verify that process_madvise() correctly fails with EINVAL when the vlen argument exceeds UIO_MAXIOV.

Changelog v6: https://lore.kernel.org/lkml/20250721114614.40996-1-lianux.mm@gmail.com/
- Refactor child process and pidfd management to use the kselftest fixture's setup and teardown mechanism. This ensures that child processes are reliably terminated and file descriptors are closed, even when a test is aborted by an ASSERT or SKIP macro. This resolves the issue where a failed assertion could lead to a leaked child process.

Changelog v5: https://lore.kernel.org/lkml/20250714122533.3135-1-lianux.mm@gmail.com/
- Refactor the remote_collapse test to concentrate on its primary goal: confirming the successful remote invocation of process_madvise() on a child process.
- Split the validation logic for invalid pidfds out of the remote test and into two new tests (`exited_process_pidfd` and `bad_pidfd`).
- Based on the mm-new branch to ensure clean application.

Changelog v4: https://lore.kernel.org/lkml/20250710112249.58722-1-lianux.mm@gmail.com/
- Refine resource cleanup logic in test teardown to be more robust.
- Improve remote_collapse test to correctly handle different THP (Transparent Huge Page) policies ('always', 'madvise', 'never'), including handling race conditions with khugepaged.
- Resolve build errors Changelog v3: https://lore.kernel.org/lkml/20250703044326.65061-1-lianux.mm@gmail.com/ - Rebased onto the latest mm-stable branch to ensure clean application. - Refactor common signal handling logic into vm_util to reduce code duplication. - Improve test robustness and diagnostics based on community feedback. - Address minor code style and script corrections. Changelog v2: https://lore.kernel.org/lkml/20250630140957.4000-1-lianux.mm@gmail.com/ - Drop MADV_DONTNEED tests based on feedback. - Focus solely on process_madvise() syscall. - Improve error handling and structure. - Add future-proof flag test. - Style and comment cleanups. -V1: https://lore.kernel.org/lkml/20250621133003.4733-1-lianux.mm@gmail.com/ tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/process_madv.c | 344 ++++++++++++++++++++++ tools/testing/selftests/mm/run_vmtests.sh | 5 + 4 files changed, 351 insertions(+) create mode 100644 tools/testing/selftests/mm/process_madv.c diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index f2dafa0b700b..e7b23a8a05fe 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -21,6 +21,7 @@ on-fault-limit transhuge-stress pagemap_ioctl pfnmap +process_madv *.tmp* protection_keys protection_keys_32 diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index ae6f994d3add..d13b3cef2a2b 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -85,6 +85,7 @@ TEST_GEN_FILES += mseal_test TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += pagemap_ioctl TEST_GEN_FILES += pfnmap +TEST_GEN_FILES += process_madv TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += uffd-stress diff --git a/tools/testing/selftests/mm/process_madv.c b/tools/testing/selftests/mm/process_madv.c new file mode 100644 index 
000000000000..471cae8427f1 --- /dev/null +++ b/tools/testing/selftests/mm/process_madv.c @@ -0,0 +1,344 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <errno.h> +#include <setjmp.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <linux/mman.h> +#include <sys/syscall.h> +#include <unistd.h> +#include <sched.h> +#include "vm_util.h" + +#include "../pidfd/pidfd.h" + +FIXTURE(process_madvise) +{ + unsigned long page_size; + pid_t child_pid; + int remote_pidfd; + int pidfd; +}; + +FIXTURE_SETUP(process_madvise) +{ + self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); + self->pidfd = PIDFD_SELF; + self->remote_pidfd = -1; + self->child_pid = -1; +}; + +FIXTURE_TEARDOWN_PARENT(process_madvise) +{ + /* This teardown is guaranteed to run, even if tests SKIP or ASSERT */ + if (self->child_pid > 0) { + kill(self->child_pid, SIGKILL); + waitpid(self->child_pid, NULL, 0); + } + + if (self->remote_pidfd >= 0) + close(self->remote_pidfd); +} + +static ssize_t sys_process_madvise(int pidfd, const struct iovec *iovec, + size_t vlen, int advice, unsigned int flags) +{ + return syscall(__NR_process_madvise, pidfd, iovec, vlen, advice, flags); +} + +/* + * This test uses PIDFD_SELF to target the current process. The main + * goal is to verify the basic behavior of process_madvise() with + * a vector of non-contiguous memory ranges, not its cross-process + * capabilities. + */ +TEST_F(process_madvise, basic) +{ + const unsigned long pagesize = self->page_size; + const int madvise_pages = 4; + struct iovec vec[madvise_pages]; + int pidfd = self->pidfd; + ssize_t ret; + char *map; + + /* + * Create a single large mapping. We will pick pages from this + * mapping to advise on. This ensures we test non-contiguous iovecs. 
+ */ + map = mmap(NULL, pagesize * 10, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (map == MAP_FAILED) + SKIP(return, "mmap failed, not enough memory.\n"); + + /* Fill the entire region with a known pattern. */ + memset(map, 'A', pagesize * 10); + + /* + * Setup the iovec to point to 4 non-contiguous pages + * within the mapping. + */ + vec[0].iov_base = &map[0 * pagesize]; + vec[0].iov_len = pagesize; + vec[1].iov_base = &map[3 * pagesize]; + vec[1].iov_len = pagesize; + vec[2].iov_base = &map[5 * pagesize]; + vec[2].iov_len = pagesize; + vec[3].iov_base = &map[8 * pagesize]; + vec[3].iov_len = pagesize; + + ret = sys_process_madvise(pidfd, vec, madvise_pages, MADV_DONTNEED, 0); + if (ret == -1 && errno == EPERM) + SKIP(return, + "process_madvise() unsupported or permission denied, try running as root.\n"); + else if (errno == EINVAL) + SKIP(return, + "process_madvise() unsupported or parameter invalid, please check arguments.\n"); + + /* The call should succeed and report the total bytes processed. */ + ASSERT_EQ(ret, madvise_pages * pagesize); + + /* Check that advised pages are now zero. */ + for (int i = 0; i < madvise_pages; i++) { + char *advised_page = (char *)vec[i].iov_base; + + /* Content must be 0, not 'A'. */ + ASSERT_EQ(*advised_page, '\0'); + } + + /* Check that an un-advised page in between is still 'A'. */ + char *unadvised_page = &map[1 * pagesize]; + + for (int i = 0; i < pagesize; i++) + ASSERT_EQ(unadvised_page[i], 'A'); + + /* Cleanup. */ + ASSERT_EQ(munmap(map, pagesize * 10), 0); +} + +/* + * This test deterministically validates process_madvise() with MADV_COLLAPSE + * on a remote process, other advices are difficult to verify reliably. + * + * The test verifies that a memory region in a child process, + * focus on process_madv remote result, only check addresses and lengths. + * The correctness of the MADV_COLLAPSE can be found in the relevant test examples in khugepaged. 
+ */ +TEST_F(process_madvise, remote_collapse) +{ + const unsigned long pagesize = self->page_size; + long huge_page_size; + int pipe_info[2]; + ssize_t ret; + struct iovec vec; + + struct child_info { + pid_t pid; + void *map_addr; + } info; + + huge_page_size = read_pmd_pagesize(); + if (huge_page_size <= 0) + SKIP(return, "Could not determine a valid huge page size.\n"); + + ASSERT_EQ(pipe(pipe_info), 0); + + self->child_pid = fork(); + ASSERT_NE(self->child_pid, -1); + + if (self->child_pid == 0) { + char *map; + size_t map_size = 2 * huge_page_size; + + close(pipe_info[0]); + + map = mmap(NULL, map_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + ASSERT_NE(map, MAP_FAILED); + + /* Fault in as small pages */ + for (size_t i = 0; i < map_size; i += pagesize) + map[i] = 'A'; + + /* Send info and pause */ + info.pid = getpid(); + info.map_addr = map; + ret = write(pipe_info[1], &info, sizeof(info)); + ASSERT_EQ(ret, sizeof(info)); + close(pipe_info[1]); + + pause(); + exit(0); + } + + close(pipe_info[1]); + + /* Receive child info */ + ret = read(pipe_info[0], &info, sizeof(info)); + if (ret <= 0) { + waitpid(self->child_pid, NULL, 0); + SKIP(return, "Failed to read child info from pipe.\n"); + } + ASSERT_EQ(ret, sizeof(info)); + close(pipe_info[0]); + self->child_pid = info.pid; + + self->remote_pidfd = syscall(__NR_pidfd_open, self->child_pid, 0); + ASSERT_GE(self->remote_pidfd, 0); + + vec.iov_base = info.map_addr; + vec.iov_len = huge_page_size; + + ret = sys_process_madvise(self->remote_pidfd, &vec, 1, MADV_COLLAPSE, + 0); + if (ret == -1) { + if (errno == EINVAL) + SKIP(return, "PROCESS_MADV_ADVISE is not supported.\n"); + else if (errno == EPERM) + SKIP(return, + "No process_madvise() permissions, try running as root.\n"); + return; + } + + ASSERT_EQ(ret, huge_page_size); +} + +/* + * Test process_madvise() with a pidfd for a process that has already + * exited to ensure correct error handling. 
+ */ +TEST_F(process_madvise, exited_process_pidfd) +{ + const unsigned long pagesize = self->page_size; + struct iovec vec; + char *map; + ssize_t ret; + + map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, + 0); + if (map == MAP_FAILED) + SKIP(return, "mmap failed, not enough memory.\n"); + + vec.iov_base = map; + vec.iov_len = pagesize; + + /* + * Using a pidfd for a process that has already exited should fail + * with ESRCH. + */ + self->child_pid = fork(); + ASSERT_NE(self->child_pid, -1); + + if (self->child_pid == 0) + exit(0); + + self->remote_pidfd = syscall(__NR_pidfd_open, self->child_pid, 0); + ASSERT_GE(self->remote_pidfd, 0); + + /* Wait for the child to ensure it has terminated. */ + waitpid(self->child_pid, NULL, 0); + + ret = sys_process_madvise(self->remote_pidfd, &vec, 1, MADV_DONTNEED, + 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, ESRCH); +} + +/* + * Test process_madvise() with bad pidfds to ensure correct error + * handling. + */ +TEST_F(process_madvise, bad_pidfd) +{ + const unsigned long pagesize = self->page_size; + struct iovec vec; + char *map; + ssize_t ret; + + map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, + 0); + if (map == MAP_FAILED) + SKIP(return, "mmap failed, not enough memory.\n"); + + vec.iov_base = map; + vec.iov_len = pagesize; + + /* Using an invalid fd number (-1) should fail with EBADF. */ + ret = sys_process_madvise(-1, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EBADF); + + /* + * Using a valid fd that is not a pidfd (e.g. stdin) should fail + * with EBADF. + */ + ret = sys_process_madvise(STDIN_FILENO, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EBADF); +} + +/* + * Test that process_madvise() rejects vlen > UIO_MAXIOV. + * The kernel should return -EINVAL when the number of iovecs exceeds 1024. 
+ */ +TEST_F(process_madvise, invalid_vlen) +{ + const unsigned long pagesize = self->page_size; + int pidfd = self->pidfd; + struct iovec vec; + char *map; + ssize_t ret; + + map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, + 0); + if (map == MAP_FAILED) + SKIP(return, "mmap failed, not enough memory.\n"); + + vec.iov_base = map; + vec.iov_len = pagesize; + + ret = sys_process_madvise(pidfd, &vec, 1025, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EINVAL); + + /* Cleanup. */ + ASSERT_EQ(munmap(map, pagesize), 0); +} + +/* + * Test process_madvise() with an invalid flag value. Currently, only a flag + * value of 0 is supported. This test is reserved for the future, e.g., if + * synchronous flags are added. + */ +TEST_F(process_madvise, flag) +{ + const unsigned long pagesize = self->page_size; + unsigned int invalid_flag; + int pidfd = self->pidfd; + struct iovec vec; + char *map; + ssize_t ret; + + map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, + 0); + if (map == MAP_FAILED) + SKIP(return, "mmap failed, not enough memory.\n"); + + vec.iov_base = map; + vec.iov_len = pagesize; + + invalid_flag = 0x80000000; + + ret = sys_process_madvise(pidfd, &vec, 1, MADV_DONTNEED, invalid_flag); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EINVAL); + + /* Cleanup. 
*/ + ASSERT_EQ(munmap(map, pagesize), 0); +} + +TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index a38c984103ce..471e539d82b8 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -65,6 +65,8 @@ separated by spaces: test pagemap_scan IOCTL - pfnmap tests for VM_PFNMAP handling +- process_madv + test for process_madv - cow test copy-on-write semantics - thp @@ -425,6 +427,9 @@ CATEGORY="madv_guard" run_test ./guard-regions # MADV_POPULATE_READ and MADV_POPULATE_WRITE tests CATEGORY="madv_populate" run_test ./madv_populate +# PROCESS_MADV test +CATEGORY="process_madv" run_test ./process_madv + CATEGORY="vma_merge" run_test ./merge if [ -x ./memfd_secret ] -- 2.43.0

5 months, 1 week


[PATCH bpf-next v4 0/4] bpf: Show precise rejected function when attaching to __noreturn and deny list functions

by KaFai Wan

Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions. Add log for attaching tracing programs to functions in deny list. Add selftest for attaching tracing programs to functions in deny list. Migrate fexit_noreturns case into tracing_failure test suite. changes: v4: - change tracing_deny case attaching function (Yonghong Song) - add Acked-by: Yafang Shao and Yonghong Song v3: - add tracing_deny case into existing files (Alexei) - migrate fexit_noreturns into tracing_failure - change SOB https://lore.kernel.org/bpf/20250722153434.20571-1-kafai.wan@linux.dev/ v2: - change verifier log message (Alexei) - add missing Suggested-by https://lore.kernel.org/bpf/20250714120408.1627128-1-mannkafai@gmail.com/ v1: https://lore.kernel.org/all/20250710162717.3808020-1-mannkafai@gmail.com/ --- KaFai Wan (4): bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions bpf: Add log for attaching tracing programs to functions in deny list selftests/bpf: Add selftest for attaching tracing programs to functions in deny list selftests/bpf: Migrate fexit_noreturns case into tracing_failure test suite kernel/bpf/verifier.c | 5 +- .../bpf/prog_tests/fexit_noreturns.c | 9 ---- .../bpf/prog_tests/tracing_failure.c | 52 +++++++++++++++++++ .../selftests/bpf/progs/fexit_noreturns.c | 15 ------ .../selftests/bpf/progs/tracing_failure.c | 12 +++++ 5 files changed, 68 insertions(+), 25 deletions(-) delete mode 100644 tools/testing/selftests/bpf/prog_tests/fexit_noreturns.c delete mode 100644 tools/testing/selftests/bpf/progs/fexit_noreturns.c -- 2.43.0

5 months, 1 week


[PATCH] selftests: timers: improve adjtick output readability

by Vishal Parmar

Reformat the output of the `adjtick` test in tools/testing/selftests/timers/ to display results in a clean tabular format. Previously, the output was printed in a free-form manner like this: Each iteration takes about 15 seconds Estimating tick (act: 9000 usec, -100000 ppm): 9000 usec, -100000 ppm [OK] This format made it hard to visually compare values across iterations or parse results in scripts. The new output is aligned in a table with clearly labeled columns: Each iteration takes about 15 seconds --------------------------------------------------------------- | Requested (usec) | Expected (ppm) | Measured (ppm) | Result | |------------------|----------------|----------------|---------| | 9000 | -100000 | -100001 | [ OK ] | | 9250 | -75000 | -75000 | [ OK ] | ... --------------------------------------------------------------- This improves readability, consistency, and log usability for automated tooling. Signed-off-by: Vishal Parmar <vishistriker(a)gmail.com> --- tools/testing/selftests/timers/adjtick.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/timers/adjtick.c b/tools/testing/selftests/timers/adjtick.c index 777d9494b683..b6b3de04d6ae 100644 --- a/tools/testing/selftests/timers/adjtick.c +++ b/tools/testing/selftests/timers/adjtick.c @@ -128,18 +128,18 @@ int check_tick_adj(long tickval) sleep(1); ppm = ((long long)tickval * MILLION)/systick - MILLION; - printf("Estimating tick (act: %ld usec, %lld ppm): ", tickval, ppm); + printf(" | %-16ld | %-14lld |", tickval, ppm); eppm = get_ppm_drift(); - printf("%lld usec, %lld ppm", systick + (systick * eppm / MILLION), eppm); + printf(" %-14lld |", eppm); fflush(stdout); tx1.modes = 0; adjtimex(&tx1); if (tx1.offset || tx1.freq || tx1.tick != tickval) { - printf(" [ERROR]\n"); - printf("\tUnexpected adjtimex return values, make sure ntpd is not running.\n"); + printf(" [ERROR] |\n"); + printf(" Unexpected adjtimex return values, make sure ntpd is 
not running.\n"); return -1; } @@ -153,10 +153,10 @@ int check_tick_adj(long tickval) * room for interruptions during the measurement. */ if (llabs(eppm - ppm) > 100) { - printf(" [FAILED]\n"); + printf(" [FAILED]\n"); return -1; } - printf(" [OK]\n"); + printf(" [ OK ] |\n"); return 0; } @@ -175,7 +175,10 @@ int main(int argc, char **argv) return -1; } - printf("Each iteration takes about 15 seconds\n"); + printf("\n Each iteration takes about 15 seconds\n"); + printf(" ---------------------------------------------------------------\n"); + printf(" | Requested (usec) | Expected (ppm) | Measured (ppm) | Result |\n"); + printf(" |------------------|----------------|----------------|---------|\n"); systick = sysconf(_SC_CLK_TCK); systick = USEC_PER_SEC/sysconf(_SC_CLK_TCK); @@ -188,6 +191,7 @@ int main(int argc, char **argv) break; } } + printf(" ---------------------------------------------------------------\n"); /* Reset things to zero */ tx1.modes = ADJ_TICK; -- 2.39.5

5 months, 1 week


[PATCH v3 00/15] Consolidate iommu page table implementations (AMD)

by Jason Gunthorpe

[All the precursor patches are merged now and AMD/RISCV/VTD conversions are written]

Currently each of the iommu page table formats duplicates all of the logic to maintain the page table and perform map/unmap/etc operations. There are several different versions of the algorithms between all the different formats. The io-pgtable system provides an interface to help isolate the page table code from the iommu driver, but doesn't provide tools to implement the common algorithms.

This makes it very hard to improve the state of the pagetable code under the iommu domains as any proposed improvement needs to alter a large number of different driver code paths. Combined with a lack of software based testing this makes improvement in this area very hard.

iommufd wants several new page table operations:
- More efficient map/unmap operations, using iommufd's batching logic
- unmap that returns the physical addresses into a batch as it progresses
- cut that allows splitting areas so large pages can have holes poked in them dynamically (i.e. guestmemfd hitless shared/private transitions)
- More aggressive freeing of table memory to avoid waste
- Fragmenting large pages so that dirty tracking can be more granular
- Reassembling large pages so that VMs can run at full IO performance in migration/dirty tracking error flows
- KHO integration for kernel live upgrade

Together these are algorithmically complex enough to be a very significant task to go and implement in all the page table formats we support. Just the "server" focused drivers use almost all the formats (ARMv8 S1&S2 / x86 PAE / AMDv1 / VT-D SS / RISCV).

Instead of doing the duplicated work, this series takes the first step to consolidate the algorithms into one place. In spirit it is similar to the work Christoph did a few years back to pull the redundant get_user_pages() implementations out of the arch code into core MM. This unlocked a great deal of improvement in that space in the following years.
I would like to see the same benefit in iommu as well.

My first RFC showed a bigger picture with almost all formats and more algorithms. This series reorganizes that to be narrowly focused on just enough to convert the AMD driver to use the new mechanism.

kunit tests are provided that allow good testing of the algorithms and all formats on x86, nothing is arch specific.

AMD is one of the simpler options as the HW is quite uniform with few different options/bugs while still requiring the complicated contiguous pages support. The HW also has a very simple range based invalidation approach that is easy to implement.

The AMD v1 and AMD v2 page table formats are implemented bit for bit identical to the current code, tested using a compare kunit test that checks against the io-pgtable version (on github, see below).

Updating the AMD driver to replace the io-pgtable layer with the new stuff is fairly straightforward now. The layering is fixed up in the new version so that all the invalidation goes through function pointers.

Several small fixing patches have come out of this as I've been fixing the problems that the test suite uncovers in the current code, and implementing the fixed version in iommupt.

On performance, there is a quite wide variety of implementation designs across all the drivers.
Looking at some key performance across the main formats:

iommu_map():
   pgsz  , avg new,old ns , min new,old ns , min % (+ve is better)
     2^12,      53,66     ,      51,63     ,  19.19 (AMDV1)
 256*2^12,     386,1909   ,     367,1795   ,  79.79
 256*2^21,     362,1633   ,     355,1556   ,  77.77
     2^12,      56,62     ,      52,59     ,  11.11 (AMDv2)
 256*2^12,     405,1355   ,     357,1292   ,  72.72
 256*2^21,     393,1160   ,     358,1114   ,  67.67
     2^12,      55,65     ,      53,62     ,  14.14 (VTD second stage)
 256*2^12,     391,518    ,     332,512    ,  35.35
 256*2^21,     383,635    ,     336,624    ,  46.46
     2^12,      57,65     ,      55,63     ,  12.12 (ARM 64 bit)
 256*2^12,     380,389    ,     361,369    ,   2.02
 256*2^21,     358,419    ,     345,400    ,  13.13

iommu_unmap():
   pgsz  , avg new,old ns , min new,old ns , min % (+ve is better)
     2^12,      69,88     ,      65,85     ,  23.23 (AMDv1)
 256*2^12,     353,6498   ,     331,6029   ,  94.94
 256*2^21,     373,6014   ,     360,5706   ,  93.93
     2^12,      71,72     ,      66,69     ,   4.04 (AMDv2)
 256*2^12,     228,891    ,     206,871    ,  76.76
 256*2^21,     254,721    ,     245,711    ,  65.65
     2^12,      69,87     ,      65,82     ,  20.20 (VTD second stage)
 256*2^12,     210,321    ,     200,315    ,  36.36
 256*2^21,     255,349    ,     238,342    ,  30.30
     2^12,      72,77     ,      68,74     ,   8.08 (ARM 64 bit)
 256*2^12,     521,357    ,     447,346    , -29.29
 256*2^21,     489,358    ,     433,345    , -25.25

 * Above numbers include additional patches to remove the iommu_pgsize() overheads. gcc 13.3.0, i7-12700

This version provides fairly consistent performance across formats. ARM unmap performance is quite different because this version supports contiguous pages and uses a very different algorithm for unmapping. Though why it is so much worse compared to AMDv1 I haven't figured out yet.

The per-format commits include a more detailed chart.
There is a second branch:
 https://github.com/jgunthorpe/linux/commits/iommu_pt_all
Containing supporting work and future steps:
- ARM short descriptor (32 bit), ARM long descriptor (64 bit) formats
- RISCV format and RISCV conversion
  https://github.com/jgunthorpe/linux/commits/iommu_pt_riscv
- Support for a DMA incoherent HW page table walker
- VT-D second stage format and VT-D conversion
  https://github.com/jgunthorpe/linux/commits/iommu_pt_vtd
- DART v1 & v2 format
- Draft of an iommufd 'cut' operation to break down huge pages
- A compare test that checks the iommupt formats against the iopgtable interface, including updating AMD to have a working iopgtable and patches to make VT-D have an iopgtable for testing.
- A performance test to micro-benchmark map and unmap against iopgtable

My strategy is to go one by one for the drivers:
- AMD driver conversion
- RISCV page table and driver
- Intel VT-D driver and VTDSS page table
- Flushing improvements for RISCV
- ARM SMMUv3

And concurrently work on the algorithm side:
- debugfs content dump, like VT-D has
- Cut support
- Increase/Decrease page size support
- map/unmap batching
- KHO

As we make more algorithm improvements the value to convert the drivers increases.

This is on github: https://github.com/jgunthorpe/linux/commits/iommu_pt

v2:
- Rebase on v6.16-rc2
- s/PT_ENTRY_WORD_SIZE/PT_ITEM_WORD_SIZE/s to follow the language better
- Comment and documentation updates
- Add PT_TOP_PHYS_MASK to help manage alignment restrictions on the top pointer
- Add missed force_aperture = true
- Make pt_iommu_deinit() take care of the not-yet-inited error case internally as AMD/RISCV/VTD all shared this logic
- Change gather_range() into gather_range_pages() so it also deals with the page list. This makes the following cache flushing series simpler
- Fix missed update of unmap->unmapped in some error cases
- Change clear_contig() to order the gather more logically
- Remove goto from the error handling in __map_range_leaf()
- s/log2_/oalog2_/ in places where the argument is an oaddr_t
- Pass the pts to pt_table_install64/32()
- Do not use SIGN_EXTEND for the AMDv2 page table because of Vasant's information on how PASID 0 works.
v1: https://patch.msgid.link/r/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com
- AMD driver only, many code changes
RFC: https://lore.kernel.org/all/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/

Alejandro Jimenez (1):
  iommu/amd: Use the generic iommu page table

Jason Gunthorpe (14):
  genpt: Generic Page Table base API
  genpt: Add Documentation/ files
  iommupt: Add the basic structure of the iommu implementation
  iommupt: Add the AMD IOMMU v1 page table format
  iommupt: Add iova_to_phys op
  iommupt: Add unmap_pages op
  iommupt: Add map_pages op
  iommupt: Add read_and_clear_dirty op
  iommupt: Add a kunit test for Generic Page Table
  iommupt: Add a mock pagetable format for iommufd selftest to use
  iommufd: Change the selftest to use iommupt instead of xarray
  iommupt: Add the x86 64 bit page table format
  iommu/amd: Remove AMD io_pgtable support
  iommupt: Add a kunit test for the IOMMU implementation

 .clang-format | 1 +
 Documentation/driver-api/generic_pt.rst | 140 ++
 Documentation/driver-api/index.rst | 1 +
 drivers/iommu/Kconfig | 2 +
 drivers/iommu/Makefile | 1 +
 drivers/iommu/amd/Kconfig | 5 +-
 drivers/iommu/amd/Makefile | 2 +-
 drivers/iommu/amd/amd_iommu.h | 1 -
 drivers/iommu/amd/amd_iommu_types.h | 109 +-
 drivers/iommu/amd/io_pgtable.c | 560 --------
 drivers/iommu/amd/io_pgtable_v2.c | 370 ------
 drivers/iommu/amd/iommu.c | 516 ++++----
 drivers/iommu/generic_pt/.kunitconfig | 13 +
 drivers/iommu/generic_pt/Kconfig | 72 ++
 drivers/iommu/generic_pt/fmt/Makefile | 26 +
 drivers/iommu/generic_pt/fmt/amdv1.h | 409 ++++++
drivers/iommu/generic_pt/fmt/defs_amdv1.h | 21 + drivers/iommu/generic_pt/fmt/defs_x86_64.h | 21 + drivers/iommu/generic_pt/fmt/iommu_amdv1.c | 15 + drivers/iommu/generic_pt/fmt/iommu_mock.c | 10 + drivers/iommu/generic_pt/fmt/iommu_template.h | 48 + drivers/iommu/generic_pt/fmt/iommu_x86_64.c | 11 + drivers/iommu/generic_pt/fmt/x86_64.h | 248 ++++ drivers/iommu/generic_pt/iommu_pt.h | 1150 +++++++++++++++++ drivers/iommu/generic_pt/kunit_generic_pt.h | 717 ++++++++++ drivers/iommu/generic_pt/kunit_iommu.h | 183 +++ drivers/iommu/generic_pt/kunit_iommu_pt.h | 451 +++++++ drivers/iommu/generic_pt/pt_common.h | 354 +++++ drivers/iommu/generic_pt/pt_defs.h | 323 +++++ drivers/iommu/generic_pt/pt_fmt_defaults.h | 193 +++ drivers/iommu/generic_pt/pt_iter.h | 640 +++++++++ drivers/iommu/generic_pt/pt_log2.h | 130 ++ drivers/iommu/io-pgtable.c | 4 - drivers/iommu/iommufd/Kconfig | 1 + drivers/iommu/iommufd/iommufd_test.h | 11 +- drivers/iommu/iommufd/selftest.c | 439 +++---- include/linux/generic_pt/common.h | 166 +++ include/linux/generic_pt/iommu.h | 270 ++++ include/linux/io-pgtable.h | 2 - tools/testing/selftests/iommu/iommufd.c | 60 +- tools/testing/selftests/iommu/iommufd_utils.h | 12 + 41 files changed, 6119 insertions(+), 1589 deletions(-) create mode 100644 Documentation/driver-api/generic_pt.rst delete mode 100644 drivers/iommu/amd/io_pgtable.c delete mode 100644 drivers/iommu/amd/io_pgtable_v2.c create mode 100644 drivers/iommu/generic_pt/.kunitconfig create mode 100644 drivers/iommu/generic_pt/Kconfig create mode 100644 drivers/iommu/generic_pt/fmt/Makefile create mode 100644 drivers/iommu/generic_pt/fmt/amdv1.h create mode 100644 drivers/iommu/generic_pt/fmt/defs_amdv1.h create mode 100644 drivers/iommu/generic_pt/fmt/defs_x86_64.h create mode 100644 drivers/iommu/generic_pt/fmt/iommu_amdv1.c create mode 100644 drivers/iommu/generic_pt/fmt/iommu_mock.c create mode 100644 drivers/iommu/generic_pt/fmt/iommu_template.h create mode 100644 
drivers/iommu/generic_pt/fmt/iommu_x86_64.c create mode 100644 drivers/iommu/generic_pt/fmt/x86_64.h create mode 100644 drivers/iommu/generic_pt/iommu_pt.h create mode 100644 drivers/iommu/generic_pt/kunit_generic_pt.h create mode 100644 drivers/iommu/generic_pt/kunit_iommu.h create mode 100644 drivers/iommu/generic_pt/kunit_iommu_pt.h create mode 100644 drivers/iommu/generic_pt/pt_common.h create mode 100644 drivers/iommu/generic_pt/pt_defs.h create mode 100644 drivers/iommu/generic_pt/pt_fmt_defaults.h create mode 100644 drivers/iommu/generic_pt/pt_iter.h create mode 100644 drivers/iommu/generic_pt/pt_log2.h create mode 100644 include/linux/generic_pt/common.h create mode 100644 include/linux/generic_pt/iommu.h base-commit: cd76b0248a38645a3e3f8ca4a48bffc591e9da19 -- 2.43.0

5 months, 2 weeks


[PATCH v14 net-next 00/14] AccECN protocol patch series

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>

Hello,

Please find the v14 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28

This patch series is part of the full AccECN patch series, which is
available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/

Best Regards,
Chia-Yu

---
v14 (22-Jul-2025)
- Add missing const for struct tcp_sock of tcp_accecn_option_beacon_check() of #11 (Simon Horman <horms(a)kernel.org>)

v13 (18-Jul-2025)
- Implement tcp_accecn_extract_syn_ect() and tcp_accecn_reflector_flags() with static array lookup of patch #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Fix typos in comments of #6 and remove patch #7 of v12 about simultaneous connect (Paolo Abeni <pabeni(a)redhat.com>)
- Move TCP_ACCECN_E1B_INIT_OFFSET, TCP_ACCECN_E0B_INIT_OFFSET, and TCP_ACCECN_CEB_INIT_OFFSET from patch #7 to #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use static array lookup in tcp_accecn_optfield_to_ecnfield() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return false when WARN_ON_ONCE() is true in tcp_accecn_process_option() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Make synack_ecn_bytes as static const array and use const u32 pointer in tcp_options_write() of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use ALIGN() and ALIGN_DOWN() in tcp_options_fit_accecn() to pad TCP AccECN option to dword of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return TCP_ACCECN_OPT_FAIL_SEEN if WARN_ON_ONCE() is true in tcp_accecn_option_init() of #12 (Paolo Abeni <pabeni(a)redhat.com>)

v12 (04-Jul-2025)
- Fix compilation issues with some intermediate patches in v11
- Add more comments for AccECN helpers of tcp_ecn.h

v11 (03-Jul-2025)
- Fix compilation issues with some intermediate patches in v10

v10 (02-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() over every read of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)

v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEEN flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accurate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restructure the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)

v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)

v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results

v6 (09-May-2025)
- Add #3 to utilize existing holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use existing holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for readability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15

v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)

v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)

v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)

v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.

---
Chia-Yu Chang (5):
  tcp: reorganize tcp_sock_write_txrx group for variables later
  tcp: ecn functions in separated include file
  tcp: accecn: AccECN option send control
  tcp: accecn: AccECN option failure handling
  tcp: accecn: try to fit AccECN option with SACK

Ilpo Järvinen (9):
  tcp: reorganize SYN ECN code
  tcp: fast path functions later
  tcp: AccECN core
  tcp: accecn: AccECN negotiation
  tcp: accecn: add AccECN rx byte counters
  tcp: accecn: AccECN needs to know delivered bytes
  tcp: sack option handling improvements
  tcp: accecn: AccECN option
  tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics

 Documentation/networking/ip-sysctl.rst        |  55 +-
 .../networking/net_cachelines/tcp_sock.rst    |  12 +
 include/linux/tcp.h                           |  32 +-
 include/net/netns/ipv4.h                      |   2 +
 include/net/tcp.h                             |  87 ++-
 include/net/tcp_ecn.h                         | 649 ++++++++++++++++++
 include/uapi/linux/tcp.h                      |   7 +
 net/ipv4/syncookies.c                         |   4 +
 net/ipv4/sysctl_net_ipv4.c                    |  19 +
 net/ipv4/tcp.c                                |  28 +-
 net/ipv4/tcp_input.c                          | 353 ++++++++--
 net/ipv4/tcp_ipv4.c                           |   8 +-
 net/ipv4/tcp_minisocks.c                      |  40 +-
 net/ipv4/tcp_output.c                         | 294 ++++++--
 net/ipv6/syncookies.c                         |   2 +
 net/ipv6/tcp_ipv6.c                           |   1 +
 16 files changed, 1409 insertions(+), 184 deletions(-)
 create mode 100644 include/net/tcp_ecn.h

--
2.34.1

5 months, 2 weeks


[PATCH v20 4/5] binder: add transaction_report feature entry

by Carlos Llamas

From: Li Li <dualli(a)google.com>

Add "transaction_report" to the binderfs feature list, to help userspace
determine if the "BINDER_CMD_REPORT" generic netlink api is supported by
the binder driver.

Signed-off-by: Li Li <dualli(a)google.com>
Signed-off-by: Carlos Llamas <cmllamas(a)google.com>
---
 drivers/android/binderfs.c                                 | 8 ++++++++
 .../selftests/filesystems/binderfs/binderfs_test.c         | 1 +
 2 files changed, 9 insertions(+)

diff --git a/drivers/android/binderfs.c b/drivers/android/binderfs.c
index 4f827152d18e..f74a7e380261 100644
--- a/drivers/android/binderfs.c
+++ b/drivers/android/binderfs.c
@@ -59,6 +59,7 @@ struct binder_features {
 	bool oneway_spam_detection;
 	bool extended_error;
 	bool freeze_notification;
+	bool transaction_report;
 };
 
 static const struct constant_table binderfs_param_stats[] = {
@@ -76,6 +77,7 @@ static struct binder_features binder_features = {
 	.oneway_spam_detection = true,
 	.extended_error = true,
 	.freeze_notification = true,
+	.transaction_report = true,
 };
 
 static inline struct binderfs_info *BINDERFS_SB(const struct super_block *sb)
@@ -616,6 +618,12 @@ static int init_binder_features(struct super_block *sb)
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
+	dentry = binderfs_create_file(dir, "transaction_report",
+				      &binder_features_fops,
+				      &binder_features.transaction_report);
+	if (IS_ERR(dentry))
+		return PTR_ERR(dentry);
+
 	return 0;
 }
 
diff --git a/tools/testing/selftests/filesystems/binderfs/binderfs_test.c b/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
index 81db85a5cc16..39a68078a79b 100644
--- a/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
+++ b/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
@@ -65,6 +65,7 @@ static int __do_binderfs_test(struct __test_metadata *_metadata)
 		"oneway_spam_detection",
 		"extended_error",
 		"freeze_notification",
+		"transaction_report",
 	};
 
 	change_mountns(_metadata);
--
2.50.1.470.g6ba607880d-goog
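
The feature entry above is meant to be probed from userspace. A minimal sketch of such a probe follows, assuming binderfs is mounted at /dev/binderfs, that feature flags show up as files under a features/ directory of that mount, and that a supported feature reads back as "1". The mount point and the helper name binder_feature_supported are illustrative assumptions, not part of the patch.

```python
from pathlib import Path


def binder_feature_supported(name, mount="/dev/binderfs"):
    """Return True if the binderfs feature file `name` exists and reads "1".

    `mount` is an assumed mount point; adjust it to wherever binderfs is
    actually mounted on the system under test.
    """
    feature = Path(mount) / "features" / name
    try:
        return feature.read_text().strip() == "1"
    except OSError:
        # Missing file (older kernel) or unreadable mount: treat as unsupported.
        return False
```

A caller could gate use of the netlink report API on `binder_feature_supported("transaction_report")` and fall back gracefully on kernels that lack the entry.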

5 months, 2 weeks


[PATCH net-next] selftests: bpf: fix legacy netfilter options

by Jakub Kicinski

Recent commit to add NETFILTER_XTABLES_LEGACY missed setting a couple of
configs to y. They are still enabled but as modules which appears to have
upset BPF CI, e.g.:

  test_bpf_nf_ct:FAIL:iptables-legacy -t raw -A PREROUTING -j CONNMARK --set-mark 42/0 unexpected error: 768 (errno 0)

Fixes: 3c3ab65f00eb ("selftests: net: Enable legacy netfilter legacy options.")
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
Targeting net-next 'cause that's where the bad commit is.

CC: ast(a)kernel.org
CC: daniel(a)iogearbox.net
CC: andrii(a)kernel.org
CC: martin.lau(a)linux.dev
CC: eddyz87(a)gmail.com
CC: song(a)kernel.org
CC: yonghong.song(a)linux.dev
CC: john.fastabend(a)gmail.com
CC: kpsingh(a)kernel.org
CC: sdf(a)fomichev.me
CC: haoluo(a)google.com
CC: jolsa(a)kernel.org
CC: mykolal(a)fb.com
CC: shuah(a)kernel.org
CC: pablo(a)netfilter.org
CC: bigeasy(a)linutronix.de
CC: fw(a)strlen.de
CC: bpf(a)vger.kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
 tools/testing/selftests/bpf/config | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 521836776733..e8c6c77b96cb 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -97,6 +97,8 @@ CONFIG_NF_TABLES_NETDEV=y
 CONFIG_NF_TABLES_IPV4=y
 CONFIG_NF_TABLES_IPV6=y
 CONFIG_NETFILTER_INGRESS=y
+CONFIG_IP_NF_IPTABLES_LEGACY=y
+CONFIG_IP6_NF_IPTABLES_LEGACY=y
 CONFIG_NETFILTER_XTABLES_LEGACY=y
 CONFIG_NF_FLOW_TABLE=y
 CONFIG_NF_FLOW_TABLE_INET=y
--
2.50.1
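
The difference between an option being built in (=y) and built as a module (=m) is what this fix turns on. A small sketch of a check for that follows, assuming the kernel config is available as plain text in the usual CONFIG_FOO=value form. The helper name options_not_builtin is hypothetical and is not something the selftests provide.

```python
def options_not_builtin(config_text, required):
    """Return the options from `required` that are not set to 'y'.

    Options set to 'm' (module) or missing entirely are both reported,
    since either case means the feature is not built into the kernel.
    """
    values = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blanks and comments such as "# CONFIG_FOO is not set".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key] = value
    return [opt for opt in required if values.get(opt) != "y"]
```

Run against a config where the two iptables options are modules, this would list exactly the entries the patch adds to tools/testing/selftests/bpf/config.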

5 months, 2 weeks


[PATCH net v2] selftests: rtnetlink.sh: remove esp4_offload after test

by Xiumei Mu

The esp4_offload module, loaded during IPsec offload tests, should be
reset to its default settings after testing. Otherwise, leaving it enabled
could unintentionally affect subsequent test cases by keeping offload
active.

Without this fix:
$ lsmod | grep offload; ./rtnetlink.sh -t kci_test_ipsec_offload ; lsmod | grep offload;
PASS: ipsec_offload
esp4_offload 12288 0
esp4 32768 1 esp4_offload

With this fix:
$ lsmod | grep offload; ./rtnetlink.sh -t kci_test_ipsec_offload ; lsmod | grep offload;
PASS: ipsec_offload

Fixes: 2766a11161cc ("selftests: rtnetlink: add ipsec offload API test")
Signed-off-by: Xiumei Mu <xmu(a)redhat.com>
Reviewed-by: Shannon Nelson <sln(a)onemain.com>
---
Changes in v2:
- add test results in description
- Enhanced logic for rmmod esp4_offload
- fix shellcheck warning: SC2086 (The quoting issue)
---
---
 tools/testing/selftests/net/rtnetlink.sh | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 2e8243a65b50..d2298da320a6 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -673,6 +673,11 @@ kci_test_ipsec_offload()
 	sysfsf=$sysfsd/ipsec
 	sysfsnet=/sys/bus/netdevsim/devices/netdevsim0/net/
 	probed=false
+	esp4_offload_probed_default=false
+
+	if lsmod | grep -q esp4_offload; then
+		esp4_offload_probed_default=true
+	fi
 
 	if ! mount | grep -q debugfs; then
 		mount -t debugfs none /sys/kernel/debug/ &> /dev/null
@@ -766,6 +771,7 @@ EOF
 	fi
 
 	# clean up any leftovers
+	! "$esp4_offload_probed_default" && lsmod | grep -q esp4_offload && rmmod esp4_offload
 	echo 0 > /sys/bus/netdevsim/del_device
 	$probed && rmmod netdevsim
 
--
2.50.1
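
The cleanup rule in this patch is "unload the module only if the test itself caused it to be loaded". A rough sketch of that rule follows, assuming module lists are taken as text snapshots in /proc/modules format (module name in the first column). Both helper names are hypothetical and only illustrate the logic, they are not part of rtnetlink.sh.

```python
def loaded_modules(proc_modules_text):
    """Parse /proc/modules-style text into a set of module names."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def modules_to_unload(before, after, candidates):
    """Return candidates that are loaded now but were not loaded before.

    Modules that were already present before the test are left alone,
    mirroring the esp4_offload_probed_default guard in the patch.
    """
    return [m for m in candidates if m in after and m not in before]
```

Taking one snapshot before the test and one after, then calling `modules_to_unload(before, after, ["esp4_offload"])`, gives the list a cleanup step would pass to rmmod.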

5 months, 2 weeks


[PATCH -next] selftests/ftrace: Prevent potential failure in subsystem-enable test case

by Tengda Wu

The first 100 lines of trace output don't always contain 3 or more
distinct events. In busy systems, they may be dominated by repetitive
events like sched_stat_runtime, causing the `$count -lt 3` check to fail.

Example trace:

  $ head -n 100 trace | grep -v ^#
  systemd-timesyn-266 [006] d.h2. 738.778482: sched_stat_runtime: comm=systemd-timesyn pid=266 runtime=976854 [ns]
  ftracetest-8751 [001] d.h2. 738.778512: sched_stat_runtime: comm=ftracetest pid=8751 runtime=938335 [ns]
  systemd-timesyn-266 [006] d.h1. 738.779531: sched_stat_runtime: comm=systemd-timesyn pid=266 runtime=1044284 [ns]
  ftracetest-8751 [001] d.h2. 738.779541: sched_stat_runtime: comm=ftracetest pid=8751 runtime=1028575 [ns]
  systemd-1 [007] d.h5. 738.779657: sched_stat_runtime: comm=systemd pid=1 runtime=642624 [ns]
  [...]

With trace cleared, simply check `$count -eq 0` to confirm subsystem
enablement, just like toplevel-enable.tc does.

Fixes: 1a4ea83a6e67 ("selftests/ftrace: Limit length in subsystem-enable tests")
Signed-off-by: Tengda Wu <wutengda(a)huaweicloud.com>
---
 .../selftests/ftrace/test.d/event/subsystem-enable.tc | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc b/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc
index b7c8f29c09a9..3a28adc7b727 100644
--- a/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc
+++ b/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc
@@ -19,8 +19,8 @@ echo 'sched:*' > set_event
 yield
 
 count=`head -n 100 trace | grep -v ^# | awk '{ print $5 }' | sort -u | wc -l`
-if [ $count -lt 3 ]; then
-    fail "at least fork, exec and exit events should be recorded"
+if [ $count -eq 0 ]; then
+    fail "none of scheduler events are recorded"
 fi
 
 do_reset
@@ -30,8 +30,8 @@ echo 1 > events/sched/enable
 yield
 
 count=`head -n 100 trace | grep -v ^# | awk '{ print $5 }' | sort -u | wc -l`
-if [ $count -lt 3 ]; then
-    fail "at least fork, exec and exit events should be recorded"
+if [ $count -eq 0 ]; then
+    fail "none of scheduler events are recorded"
 fi
 
 do_reset
--
2.34.1
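
To see why the old threshold is fragile, the counting pipeline from the test can be approximated in Python. This is only a sketch of the `head | grep -v ^# | awk '{ print $5 }' | sort -u | wc -l` pipeline, with one stated difference: lines with fewer than five fields are skipped here, whereas awk would print an empty string for them. The function name distinct_events is hypothetical.

```python
def distinct_events(trace_text, limit=100):
    """Count distinct values of the 5th whitespace-separated field.

    Only the first `limit` lines are considered and comment lines
    (starting with '#') are ignored, as in the selftest pipeline.
    """
    events = set()
    for line in trace_text.splitlines()[:limit]:
        if line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 5:
            events.add(fields[4])
    return len(events)
```

Fed the five sample lines from the commit message, every line carries `sched_stat_runtime:` in the fifth field, so the count stays below 3 even though the subsystem is clearly enabled.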

5 months, 2 weeks


[PATCH net v2] selftests: netfilter: ipvs.sh: Explicity disable rp_filter on interface tunl0

by Yi Chen

Although setup_ns() sets net.ipv4.conf.default.rp_filter=0, loading
certain modules such as ipip will automatically create a tunl0 interface
in all netns, including newly created ones. In the script, this happens
before default.rp_filter=0 is applied; as a result, tunl0.rp_filter
remains set to 1, which causes the test to report FAIL when the ipip
module is preloaded.

Before fix:
Testing DR mode...
Testing NAT mode...
Testing Tunnel mode...
ipvs.sh: FAIL

After fix:
Testing DR mode...
Testing NAT mode...
Testing Tunnel mode...
ipvs.sh: PASS

Fixes: 7c8b89ec506e ("selftests: netfilter: remove rp_filter configuration")

v2: Fixed the format of Fixes tag.

Signed-off-by: Yi Chen <yiche(a)redhat.com>
---
 tools/testing/selftests/net/netfilter/ipvs.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/netfilter/ipvs.sh b/tools/testing/selftests/net/netfilter/ipvs.sh
index 6af2ea3ad6b8..9c9d5b38ab71 100755
--- a/tools/testing/selftests/net/netfilter/ipvs.sh
+++ b/tools/testing/selftests/net/netfilter/ipvs.sh
@@ -151,7 +151,7 @@ test_nat() {
 test_tun() {
 	ip netns exec "${ns0}" ip route add "${vip_v4}" via "${gip_v4}" dev br0
 
-	ip netns exec "${ns1}" modprobe -q ipip
+	modprobe -q ipip
 	ip netns exec "${ns1}" ip link set tunl0 up
 	ip netns exec "${ns1}" sysctl -qw net.ipv4.ip_forward=0
 	ip netns exec "${ns1}" sysctl -qw net.ipv4.conf.all.send_redirects=0
@@ -160,10 +160,10 @@ test_tun() {
 	ip netns exec "${ns1}" ipvsadm -a -i -t "${vip_v4}:${port}" -r ${rip_v4}:${port}
 	ip netns exec "${ns1}" ip addr add ${vip_v4}/32 dev lo:1
 
-	ip netns exec "${ns2}" modprobe -q ipip
 	ip netns exec "${ns2}" ip link set tunl0 up
 	ip netns exec "${ns2}" sysctl -qw net.ipv4.conf.all.arp_ignore=1
 	ip netns exec "${ns2}" sysctl -qw net.ipv4.conf.all.arp_announce=2
+	ip netns exec "${ns2}" sysctl -qw net.ipv4.conf.tunl0.rp_filter=0
 	ip netns exec "${ns2}" ip addr add "${vip_v4}/32" dev lo:1
 
 	test_service
--
2.50.1

5 months, 2 weeks


[PATCH net-next] selftests: drv-net: Wait for bkg socat to start

by Mohsin Bashir

Currently, UDP exchange is prone to failure when cmd attempts to send data
while socat in bkg is not ready. Since the behavior is probabilistic, this
can result in flakiness for XDP tests. While testing
test_xdp_native_tx_mb() on netdevsim, a failure rate of around 1% in 500
iterations was observed.

Use wait_port_listen() to ensure that the bkg socat is started and ready
to receive before cmd starts sending. With the proposed changes, a re-run
of the same test passed 100% of the time.

Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr(a)gmail.com>
---
 tools/testing/selftests/drivers/net/xdp.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/drivers/net/xdp.py b/tools/testing/selftests/drivers/net/xdp.py
index 887d662ad128..1dd8bf3bf6c9 100755
--- a/tools/testing/selftests/drivers/net/xdp.py
+++ b/tools/testing/selftests/drivers/net/xdp.py
@@ -13,7 +13,7 @@ from enum import Enum
 
 from lib.py import ksft_run, ksft_exit, ksft_eq, ksft_ne, ksft_pr
 from lib.py import KsftFailEx, NetDrvEpEnv, EthtoolFamily, NlError
-from lib.py import bkg, cmd, rand_port
+from lib.py import bkg, cmd, rand_port, wait_port_listen
 from lib.py import ip, bpftool, defer
 
 
@@ -70,6 +70,7 @@ def _exchg_udp(cfg, port, test_string):
     tx_udp_cmd = f"echo -n {test_string} | socat -t 2 -u STDIN UDP:{cfg.baddr}:{port}"
 
     with bkg(rx_udp_cmd, exit_wait=True) as nc:
+        wait_port_listen(port, proto="udp")
         cmd(tx_udp_cmd, host=cfg.remote, shell=True)
 
     return nc.stdout.strip()
@@ -310,6 +311,7 @@ def test_xdp_native_tx_mb(cfg):
     tx_udp = f"echo {test_string} | socat -t 2 -u STDIN UDP:{cfg.baddr}:{port}"
 
     with bkg(rx_udp, host=cfg.remote, exit_wait=True) as rnc:
+        wait_port_listen(port, proto="udp", host=cfg.remote)
         cmd(tx_udp, host=cfg.remote, shell=True)
 
     stats = _get_stats(prog_info['maps']['map_xdp_stats'])
--
2.47.3
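
The race described here is closed by polling until the receiver's socket exists before sending. The sketch below illustrates that idea only; it is not the selftest library's wait_port_listen() implementation. It assumes socket tables in /proc/net/udp text format, where the local address column looks like `00000000:1F90` with the port in uppercase hex. The names udp_port_bound and wait_until are hypothetical.

```python
import re
import time


def udp_port_bound(proc_net_udp_text, port):
    """Return True if some socket row has `port` as its local port."""
    # Rows look like: "  123: 00000000:1F90 00000000:0000 07 ..."
    pattern = re.compile(r"^\s*\d+:\s+[0-9A-F]+:%04X\s" % port)
    return any(pattern.match(row) for row in proc_net_udp_text.splitlines())


def wait_until(predicate, deadline=5.0, sleep=0.005):
    """Poll `predicate` until it returns True or `deadline` seconds pass."""
    end = time.monotonic() + deadline
    while True:
        if predicate():
            return True
        if time.monotonic() > end:
            return False
        time.sleep(sleep)
```

On a live system the two would be combined along the lines of `wait_until(lambda: udp_port_bound(open("/proc/net/udp").read(), port))`, which is the same "poll, then send" ordering the patch enforces.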

5 months, 2 weeks


[PATCH net-next v7] ipv6: add `force_forwarding` sysctl to enable per-interface forwarding

by Gabriel Goller

It is currently impossible to enable ipv6 forwarding on a per-interface
basis like in ipv4. To enable forwarding on an ipv6 interface we need to
enable it on all interfaces and disable it on the other interfaces using
a netfilter rule. This is especially cumbersome if you have lots of
interfaces and only want to enable forwarding on a few. According to the
sysctl docs [0] the `net.ipv6.conf.all.forwarding` enables forwarding for
all interfaces, while the interface-specific
`net.ipv6.conf.<interface>.forwarding` configures the interface
Host/Router configuration.

Introduce a new sysctl flag `force_forwarding`, which can be set on every
interface. The ip6_forward() function will then check if the global
forwarding flag OR the force_forwarding flag is active and forward the
packet.

To preserve backwards-compatibility reset the flag (on all interfaces) to
0 if the net.ipv6.conf.all.forwarding flag is set to 0.

Add a short selftest that checks if a packet gets forwarded with and
without `force_forwarding`.

[0]: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

Acked-by: Nicolas Dichtel <nicolas.dichtel(a)6wind.com>
Signed-off-by: Gabriel Goller <g.goller(a)proxmox.com>
---
v7:
* rebase
* fix typos in commit message

v6: https://lore.kernel.org/netdev/20250711124243.526735-1-g.goller@proxmox.com/
* rebase
* remove brackets around single line
* add 'nodad' to addresses in selftest to avoid sporadic failures

v5: https://lore.kernel.org/netdev/20250707094307.223975-1-g.goller@proxmox.com/
* update conf/all/forwarding docs
* simplified backwards-compat comment
* remove ASSERT_RTNL as it's guaranteed by __in6_dev_get_rtnl_net() already
* change ip6_forward logic so that it doesn't depend on the idev existing
* move WRITE_ONCE inside device lock

v4: https://lore.kernel.org/netdev/20250703160154.560239-1-g.goller@proxmox.com/
* actually write the sysctl value to the table
* use ASSERT_RTNL() when forwarding the sysctl change
* remove useless comments in function body
* simplify forwarding and force_forwarding check in ip6_output.c
* fix code backticks in Documentation (double instead of single)
* add selftests

v3: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/
* remove forwarding=0 setting force_forwarding=0 globally.
* add min and max (0 and 1) value to sysctl.

v2: https://lore.kernel.org/netdev/20250701140423.487411-1-g.goller@proxmox.com/
* rename from `do_forwarding` to `force_forwarding`.
* add global `force_forwarding` flag which will enable `force_forwarding` on every interface like the `ipv4.all.forwarding` flag.
* `forwarding`=0 will disable global and per-interface `force_forwarding`.
* export option as NETCONFA_FORCE_FORWARDING.

v1: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/ Documentation/networking/ip-sysctl.rst | 8 +- include/linux/ipv6.h | 1 + include/uapi/linux/ipv6.h | 1 + include/uapi/linux/netconf.h | 1 + include/uapi/linux/sysctl.h | 1 + net/ipv6/addrconf.c | 82 ++++++++++++++ net/ipv6/ip6_output.c | 3 +- tools/testing/selftests/net/Makefile | 1 + .../selftests/net/ipv6_force_forwarding.sh | 105 ++++++++++++++++++ 9 files changed, 200 insertions(+), 3 deletions(-) create mode 100755 tools/testing/selftests/net/ipv6_force_forwarding.sh diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 14700ea77e75..bb620f554598 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2543,8 +2543,8 @@ conf/all/disable_ipv6 - BOOLEAN conf/all/forwarding - BOOLEAN Enable global IPv6 forwarding between all interfaces. - IPv4 and IPv6 work differently here; e.g. netfilter must be used - to control which interfaces may forward packets and which not. + IPv4 and IPv6 work differently here; the ``force_forwarding`` flag must + be used to control which interfaces may forward packets. This also sets all interfaces' Host/Router setting 'forwarding' to the specified value. See below for details. @@ -2561,6 +2561,10 @@ proxy_ndp - BOOLEAN Default: 0 (disabled) +force_forwarding - BOOLEAN + Enable forwarding on this interface only -- regardless of the setting on + ``conf/all/forwarding``. When setting ``conf.all.forwarding`` to 0, + the ``force_forwarding`` flag will be reset on all interfaces. 
fwmark_reflect - BOOLEAN Controls the fwmark of kernel-generated IPv6 reply packets that are not diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index db0eb0d86b64..bc6ec2959173 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -17,6 +17,7 @@ struct ipv6_devconf { __s32 hop_limit; __s32 mtu6; __s32 forwarding; + __s32 force_forwarding; __s32 disable_policy; __s32 proxy_ndp; __cacheline_group_end(ipv6_devconf_read_txrx); diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h index cf592d7b630f..d4d3ae774b26 100644 --- a/include/uapi/linux/ipv6.h +++ b/include/uapi/linux/ipv6.h @@ -199,6 +199,7 @@ enum { DEVCONF_NDISC_EVICT_NOCARRIER, DEVCONF_ACCEPT_UNTRACKED_NA, DEVCONF_ACCEPT_RA_MIN_LFT, + DEVCONF_FORCE_FORWARDING, DEVCONF_MAX }; diff --git a/include/uapi/linux/netconf.h b/include/uapi/linux/netconf.h index fac4edd55379..1c8c84d65ae3 100644 --- a/include/uapi/linux/netconf.h +++ b/include/uapi/linux/netconf.h @@ -19,6 +19,7 @@ enum { NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN, NETCONFA_INPUT, NETCONFA_BC_FORWARDING, + NETCONFA_FORCE_FORWARDING, __NETCONFA_MAX }; #define NETCONFA_MAX (__NETCONFA_MAX - 1) diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index 8981f00204db..63d1464cb71c 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -573,6 +573,7 @@ enum { NET_IPV6_ACCEPT_RA_FROM_LOCAL=26, NET_IPV6_ACCEPT_RA_RT_INFO_MIN_PLEN=27, NET_IPV6_RA_DEFRTR_METRIC=28, + NET_IPV6_FORCE_FORWARDING=29, __NET_IPV6_MAX }; diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 4f1d7d110302..81a067a2e526 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -239,6 +239,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = { .ndisc_evict_nocarrier = 1, .ra_honor_pio_life = 0, .ra_honor_pio_pflag = 0, + .force_forwarding = 0, }; static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { @@ -303,6 +304,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { 
.ndisc_evict_nocarrier = 1, .ra_honor_pio_life = 0, .ra_honor_pio_pflag = 0, + .force_forwarding = 0, }; /* Check if link is ready: is it up and is a valid qdisc available */ @@ -857,6 +859,9 @@ static void addrconf_forward_change(struct net *net, __s32 newf) idev = __in6_dev_get_rtnl_net(dev); if (idev) { int changed = (!idev->cnf.forwarding) ^ (!newf); + /* Disabling all.forwarding sets 0 to force_forwarding for all interfaces */ + if (newf == 0) + WRITE_ONCE(idev->cnf.force_forwarding, 0); WRITE_ONCE(idev->cnf.forwarding, newf); if (changed) @@ -5710,6 +5715,7 @@ static void ipv6_store_devconf(const struct ipv6_devconf *cnf, array[DEVCONF_ACCEPT_UNTRACKED_NA] = READ_ONCE(cnf->accept_untracked_na); array[DEVCONF_ACCEPT_RA_MIN_LFT] = READ_ONCE(cnf->accept_ra_min_lft); + array[DEVCONF_FORCE_FORWARDING] = READ_ONCE(cnf->force_forwarding); } static inline size_t inet6_ifla6_size(void) @@ -6738,6 +6744,75 @@ static int addrconf_sysctl_disable_policy(const struct ctl_table *ctl, int write return ret; } +static void addrconf_force_forward_change(struct net *net, __s32 newf) +{ + struct net_device *dev; + struct inet6_dev *idev; + + for_each_netdev(net, dev) { + idev = __in6_dev_get_rtnl_net(dev); + if (idev) { + int changed = (!idev->cnf.force_forwarding) ^ (!newf); + + WRITE_ONCE(idev->cnf.force_forwarding, newf); + if (changed) + inet6_netconf_notify_devconf(dev_net(dev), RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + dev->ifindex, &idev->cnf); + } + } +} + +static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct inet6_dev *idev = ctl->extra1; + struct ctl_table tmp_ctl = *ctl; + struct net *net = ctl->extra2; + int *valp = ctl->data; + int new_val = *valp; + int old_val = *valp; + loff_t pos = *ppos; + int ret; + + tmp_ctl.extra1 = SYSCTL_ZERO; + tmp_ctl.extra2 = SYSCTL_ONE; + tmp_ctl.data = &new_val; + + ret = proc_douintvec_minmax(&tmp_ctl, write, buffer, lenp, ppos); + + if (write 
&& old_val != new_val) { + if (!rtnl_net_trylock(net)) + return restart_syscall(); + + WRITE_ONCE(*valp, new_val); + + if (valp == &net->ipv6.devconf_dflt->force_forwarding) { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + NETCONFA_IFINDEX_DEFAULT, + net->ipv6.devconf_dflt); + } else if (valp == &net->ipv6.devconf_all->force_forwarding) { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + NETCONFA_IFINDEX_ALL, + net->ipv6.devconf_all); + + addrconf_force_forward_change(net, new_val); + } else { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + idev->dev->ifindex, + &idev->cnf); + } + rtnl_net_unlock(net); + } + + if (ret) + *ppos = pos; + return ret; +} + static int minus_one = -1; static const int two_five_five = 255; static u32 ioam6_if_id_max = U16_MAX; @@ -7208,6 +7283,13 @@ static const struct ctl_table addrconf_sysctl[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_TWO, }, + { + .procname = "force_forwarding", + .data = &ipv6_devconf.force_forwarding, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = addrconf_sysctl_force_forwarding, + }, }; static int __addrconf_sysctl_register(struct net *net, char *dev_name, diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 0412f8544695..1e1410237b6e 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -511,7 +511,8 @@ int ip6_forward(struct sk_buff *skb) u32 mtu; idev = __in6_dev_get_safely(dev_get_by_index_rcu(net, IP6CB(skb)->iif)); - if (READ_ONCE(net->ipv6.devconf_all->forwarding) == 0) + if (!READ_ONCE(net->ipv6.devconf_all->forwarding) && + (!idev || !READ_ONCE(idev->cnf.force_forwarding))) goto error; if (skb->pkt_type != PACKET_HOST) diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 13e2678d418b..b31a71f2b372 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -116,6 +116,7 @@ TEST_GEN_FILES 
+= skf_net_off
 TEST_GEN_FILES += tfo
 TEST_PROGS += tfo_passive.sh
 TEST_PROGS += broadcast_pmtu.sh
+TEST_PROGS += ipv6_force_forwarding.sh
 
 # YNL files, must be before "include ..lib.mk"
 YNL_GEN_FILES := busy_poller netlink-dumps
diff --git a/tools/testing/selftests/net/ipv6_force_forwarding.sh b/tools/testing/selftests/net/ipv6_force_forwarding.sh
new file mode 100755
index 000000000000..bf0243366caa
--- /dev/null
+++ b/tools/testing/selftests/net/ipv6_force_forwarding.sh
@@ -0,0 +1,105 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test IPv6 force_forwarding interface property
+#
+# This test verifies that the force_forwarding property works correctly:
+# - When global forwarding is disabled, packets are not forwarded normally
+# - When force_forwarding is enabled on an interface, packets are forwarded
+#   regardless of the global forwarding setting
+
+source lib.sh
+
+cleanup() {
+	cleanup_ns $ns1 $ns2 $ns3
+}
+
+trap cleanup EXIT
+
+setup_test() {
+	# Create three namespaces: sender, router, receiver
+	setup_ns ns1 ns2 ns3
+
+	# Create veth pairs: ns1 <-> ns2 <-> ns3
+	ip link add name veth12 type veth peer name veth21
+	ip link add name veth23 type veth peer name veth32
+
+	# Move interfaces to namespaces
+	ip link set veth12 netns $ns1
+	ip link set veth21 netns $ns2
+	ip link set veth23 netns $ns2
+	ip link set veth32 netns $ns3
+
+	# Configure interfaces
+	ip -n $ns1 addr add 2001:db8:1::1/64 dev veth12 nodad
+	ip -n $ns2 addr add 2001:db8:1::2/64 dev veth21 nodad
+	ip -n $ns2 addr add 2001:db8:2::1/64 dev veth23 nodad
+	ip -n $ns3 addr add 2001:db8:2::2/64 dev veth32 nodad
+
+	# Bring up interfaces
+	ip -n $ns1 link set veth12 up
+	ip -n $ns2 link set veth21 up
+	ip -n $ns2 link set veth23 up
+	ip -n $ns3 link set veth32 up
+
+	# Add routes
+	ip -n $ns1 route add 2001:db8:2::/64 via 2001:db8:1::2
+	ip -n $ns3 route add 2001:db8:1::/64 via 2001:db8:2::1
+
+	# Disable global forwarding
+	ip netns exec $ns2 sysctl -qw net.ipv6.conf.all.forwarding=0
+}
+
+test_force_forwarding() {
+	local ret=0
+
+	echo "TEST: force_forwarding functionality"
+
+	# Check if force_forwarding sysctl exists
+	if ! ip netns exec $ns2 test -f /proc/sys/net/ipv6/conf/veth21/force_forwarding; then
+		echo "SKIP: force_forwarding not available"
+		return $ksft_skip
+	fi
+
+	# Test 1: Without force_forwarding, ping should fail
+	ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=0
+	ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=0
+
+	if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then
+		echo "FAIL: ping succeeded when forwarding disabled"
+		ret=1
+	else
+		echo "PASS: forwarding disabled correctly"
+	fi
+
+	# Test 2: With force_forwarding enabled, ping should succeed
+	ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=1
+	ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=1
+
+	if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then
+		echo "PASS: force_forwarding enabled forwarding"
+	else
+		echo "FAIL: ping failed with force_forwarding enabled"
+		ret=1
+	fi
+
+	return $ret
+}
+
+echo "IPv6 force_forwarding test"
+echo "=========================="
+
+setup_test
+test_force_forwarding
+ret=$?
+
+if [ $ret -eq 0 ]; then
+	echo "OK"
+	exit 0
+elif [ $ret -eq $ksft_skip ]; then
+	echo "SKIP"
+	exit $ksft_skip
+else
+	echo "FAIL"
+	exit 1
+fi
-- 
2.39.5
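
The tail of the script maps the return value of test_force_forwarding onto kselftest exit codes. A minimal sketch of that mapping as a standalone helper follows. The name `report_result` is hypothetical and not part of lib.sh, and `ksft_skip=4` is set locally here as an assumption, whereas the script itself gets it from lib.sh.

```shell
#!/bin/sh
# Sketch only: report_result is a hypothetical helper, not part of lib.sh.
ksft_skip=4   # assumed locally; the real script inherits it from lib.sh

# Print OK/SKIP/FAIL and return the matching kselftest exit code.
report_result() {
    ret=$1
    if [ "$ret" -eq 0 ]; then
        echo "OK"
        return 0
    elif [ "$ret" -eq "$ksft_skip" ]; then
        echo "SKIP"
        return "$ksft_skip"
    else
        echo "FAIL"
        return 1
    fi
}
```

Any non-zero value other than the skip code collapses to 1, which matches what the final if/elif/else block in the patch does.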


[PATCH net-next] selftests: net: Skip test if IPv6 is not configured

by Breno Leitao

Extend the `check_for_dependencies()` function in `lib_netcons.sh` to check
whether IPv6 is enabled by verifying the existence of `/proc/net/if_inet6`.
IPv6 is now a dependency of the netconsole tests.

If the file does not exist, the script will skip the test with an
appropriate message suggesting to verify whether `CONFIG_IPV6` is enabled.

This prevents the test from misbehaving if IPv6 is not configured.

Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
 tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh b/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh
index 258af805497b4..b6071e80ebbb6 100644
--- a/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh
+++ b/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh
@@ -281,6 +281,11 @@ function check_for_dependencies() {
 		exit "${ksft_skip}"
 	fi
 
+	if [ ! -f /proc/net/if_inet6 ]; then
+		echo "SKIP: IPv6 not configured. Check if CONFIG_IPV6 is enabled" >&2
+		exit "${ksft_skip}"
+	fi
+
 	if [ ! -f "${NSIM_DEV_SYS_NEW}" ]; then
 		echo "SKIP: file ${NSIM_DEV_SYS_NEW} does not exist. Check if CONFIG_NETDEVSIM is enabled" >&2
 		exit "${ksft_skip}"

---
base-commit: dd500e4aecf25e48e874ca7628697969df679493
change-id: 20250723-netcons_test_ipv6-15b1b76bb231

Best regards,
-- 
Breno Leitao <leitao(a)debian.org>
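
The dependency check added by the patch can be tried outside the test library with a small wrapper. This is only a sketch: `ipv6_configured` is a hypothetical name, and its optional path argument is an assumption added so the check can be pointed at a scratch file; the patch itself hard-codes /proc/net/if_inet6.

```shell
#!/bin/sh
# Sketch only: ipv6_configured is a hypothetical helper name.
# The optional argument exists purely so the check can be exercised
# against a scratch file; the patch hard-codes /proc/net/if_inet6.
ipv6_configured() {
    [ -f "${1:-/proc/net/if_inet6}" ]
}

if ipv6_configured; then
    echo "IPv6 is configured"
else
    echo "SKIP: IPv6 not configured. Check if CONFIG_IPV6 is enabled" >&2
fi
```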


[PATCH v3 00/10] mm/mremap: permit mremap() move of multiple VMAs

by Lorenzo Stoakes

Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.

For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).

However, in instances where a user is moving VMAs, it is restrictive to
disallow this.

This is especially the case when anonymous mapping remap may or may not be
mergeable depending on whether VMAs have or have not been faulted due to
anon_vma assignment and folio index alignment with vma->vm_pgoff.

Often this can result in surprising impact where a moved region is faulted,
then moved back and a user fails to observe a merge from otherwise
compatible, adjacent VMAs.

This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations have
resulted in VMA fragmentation.

In order to do this, this series performs a large amount of refactoring,
most pertinently grouping sanity checks together and separating those that
check input parameters from those relating to VMAs.

We also simplify the post-mmap lock drop processing for uffd and mlock()'d
VMAs.

With this done, we can then fairly straightforwardly implement this
functionality.

This works exclusively for mremap() invocations which specify
MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
notification of the userland fault handler would require us to drop the
mmap lock.

It is also not compatible with file-backed mappings with customised
get_unmapped_area() handlers as these may not honour MREMAP_FIXED.

The input and output address ranges must not overlap. We carefully
account for moves which would result in VMA iterator invalidation.

While there can be gaps between VMAs in the input range, there can be no
gap before the first VMA in the range.

v3:
* Disallowed move operation except for MREMAP_FIXED.
* Disallow gap at start of aggregate range to avoid confusion.
* Disallow any file-backed VMAs with custom get_unmapped_area.
* Renamed multi_vma to seen_vma to be clearer. Stop reusing new_addr, use
  separate target_addr var to track next target address.
* Check if first VMA fails multi VMA check, if so we'll allow one VMA but
  not multiple.
* Updated the commit message for patch 9 to be clearer about gap behaviour.
* Removed accidentally included debug goto statement in test (doh!). Test
  was and is passing regardless.
* Unmap target range in test, previously we ended up moving additional VMAs
  unintentionally. This still all passed :) but was not what was intended.
* Removed self-merge check - there is absolutely no way this can happen
  across multiple VMAs, as there is no means of moving VMAs such that a VMA
  merges with itself.

v2:
* Squashed uffd stub fix into series.
* Propagated tags, thanks!
* Fixed param naming in patch 4 as per Vlastimil.
* Renamed vma_reset to vmi_needs_reset + dropped reset on unmap as per
  Liam.
* Correctly return -EFAULT if no VMAs in input range.
* Account for get_unmapped_area() disregarding MAP_FIXED and returning an
  altered address.
* Added additional explanatory comment to the remap_move() function.
https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/

v1:
https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (10):
  mm/mremap: perform some simple cleanups
  mm/mremap: refactor initial parameter sanity checks
  mm/mremap: put VMA check and prep logic into helper function
  mm/mremap: cleanup post-processing stage of mremap
  mm/mremap: use an explicit uffd failure path for mremap
  mm/mremap: check remap conditions earlier
  mm/mremap: move remap_is_valid() into check_prep_vma()
  mm/mremap: clean up mlock populate behaviour
  mm/mremap: permit mremap() move of multiple VMAs
  tools/testing/selftests: extend mremap_test to test multi-VMA mremap

 fs/userfaultfd.c                         |  15 +-
 include/linux/userfaultfd_k.h            |   5 +
 mm/mremap.c                              | 553 +++++++++++++++--------
 tools/testing/selftests/mm/mremap_test.c | 146 +++++-
 4 files changed, 518 insertions(+), 201 deletions(-)

-- 
2.50.0


[PATCH v19 4/5] binder: add transaction_report feature entry

by Carlos Llamas

From: Li Li <dualli(a)google.com>

Add "transaction_report" to the binderfs feature list, to help userspace
determine if the "BINDER_CMD_REPORT" generic netlink api is supported by
the binder driver.

Signed-off-by: Li Li <dualli(a)google.com>
Signed-off-by: Carlos Llamas <cmllamas(a)google.com>
---
 drivers/android/binderfs.c                                | 8 ++++++++
 .../selftests/filesystems/binderfs/binderfs_test.c        | 1 +
 2 files changed, 9 insertions(+)

diff --git a/drivers/android/binderfs.c b/drivers/android/binderfs.c
index 4f827152d18e..f74a7e380261 100644
--- a/drivers/android/binderfs.c
+++ b/drivers/android/binderfs.c
@@ -59,6 +59,7 @@ struct binder_features {
 	bool oneway_spam_detection;
 	bool extended_error;
 	bool freeze_notification;
+	bool transaction_report;
 };
 
 static const struct constant_table binderfs_param_stats[] = {
@@ -76,6 +77,7 @@ static struct binder_features binder_features = {
 	.oneway_spam_detection = true,
 	.extended_error = true,
 	.freeze_notification = true,
+	.transaction_report = true,
 };
 
 static inline struct binderfs_info *BINDERFS_SB(const struct super_block *sb)
@@ -616,6 +618,12 @@ static int init_binder_features(struct super_block *sb)
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
+	dentry = binderfs_create_file(dir, "transaction_report",
+				      &binder_features_fops,
+				      &binder_features.transaction_report);
+	if (IS_ERR(dentry))
+		return PTR_ERR(dentry);
+
 	return 0;
 }
 
diff --git a/tools/testing/selftests/filesystems/binderfs/binderfs_test.c b/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
index 81db85a5cc16..39a68078a79b 100644
--- a/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
+++ b/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
@@ -65,6 +65,7 @@ static int __do_binderfs_test(struct __test_metadata *_metadata)
 		"oneway_spam_detection",
 		"extended_error",
 		"freeze_notification",
+		"transaction_report",
 	};
 
 	change_mountns(_metadata);
-- 
2.50.1.470.g6ba607880d-goog
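
Since the feature is advertised as a file in the binderfs features directory, userspace could probe for it along these lines. This is a sketch under assumptions: `binder_has_feature` is a hypothetical helper, and /dev/binderfs is only an assumed mount point; neither comes from the patch. Only the existence of the file is checked, which is all the patch guarantees.

```shell
#!/bin/sh
# Sketch only: binder_has_feature and the /dev/binderfs mount point are
# assumptions for illustration; adjust to wherever binderfs is mounted.
binder_has_feature() {
    features_dir=$1
    name=$2
    # The feature counts as advertised if its entry exists.
    [ -e "$features_dir/$name" ]
}

if binder_has_feature /dev/binderfs/features transaction_report; then
    echo "transaction_report advertised"
else
    echo "transaction_report not advertised"
fi
```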


[PATCH net v2 1/2] macsec: set IFF_UNICAST_FLT priv flag

by Stanislav Fomichev

Cosmin reports the following locking issue:

  # BUG: sleeping function called from invalid context at
  kernel/locking/mutex.c:275
  # dump_stack_lvl+0x4f/0x60
  # __might_resched+0xeb/0x140
  # mutex_lock+0x1a/0x40
  # dev_set_promiscuity+0x26/0x90
  # __dev_set_promiscuity+0x85/0x170
  # __dev_set_rx_mode+0x69/0xa0
  # dev_uc_add+0x6d/0x80
  # vlan_dev_open+0x5f/0x120 [8021q]
  # __dev_open+0x10c/0x2a0
  # __dev_change_flags+0x1a4/0x210
  # netif_change_flags+0x22/0x60
  # do_setlink.isra.0+0xdb0/0x10f0
  # rtnl_newlink+0x797/0xb00
  # rtnetlink_rcv_msg+0x1cb/0x3f0
  # netlink_rcv_skb+0x53/0x100
  # netlink_unicast+0x273/0x3b0
  # netlink_sendmsg+0x1f2/0x430

This is similar to the recent syzkaller reports in [0] and [1]. It
triggers because macsec does not advertise IFF_UNICAST_FLT although it has
a proper ndo_set_rx_mode callback that takes care of pushing uc/mc
addresses down to the real device.

In general, the dev_uc_add call path is problematic for stacking
non-IFF_UNICAST_FLT devices because we might grab the netdev instance lock
under the addr_list_lock spinlock, so this is not a systemic fix.

0: https://lore.kernel.org/netdev/686d55b4.050a0220.1ffab7.0014.GAE@google.com
1: https://lore.kernel.org/netdev/68712acf.a00a0220.26a83e.0051.GAE@google.com/

Reviewed-by: Simon Horman <horms(a)kernel.org>
Tested-by: Simon Horman <horms(a)kernel.org>
Link: https://lore.kernel.org/netdev/2aff4342b0f5b1539c02ffd8df4c7e58dd9746e7.cam…
Fixes: 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations")
Reported-by: Cosmin Ratiu <cratiu(a)nvidia.com>
Tested-by: Cosmin Ratiu <cratiu(a)nvidia.com>
Signed-off-by: Stanislav Fomichev <sdf(a)fomichev.me>
---
 drivers/net/macsec.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 7edbe76b5455..4c75d1fea552 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -3868,7 +3868,7 @@ static void macsec_setup(struct net_device *dev)
 	ether_setup(dev);
 	dev->min_mtu = 0;
 	dev->max_mtu = ETH_MAX_MTU;
-	dev->priv_flags |= IFF_NO_QUEUE;
+	dev->priv_flags |= IFF_NO_QUEUE | IFF_UNICAST_FLT;
 	dev->netdev_ops = &macsec_netdev_ops;
 	dev->needs_free_netdev = true;
 	dev->priv_destructor = macsec_free_netdev;
-- 
2.50.1


[PATCH net 0/2] bonding: fix LACP negotiation issues in passive mode

by Hangbin Liu

This patchset fixes an issue where bonding fails to establish a stable LACP
negotiation when operating in passive mode (lacp_active=off).

In passive mode, the current implementation only replies when the partner's
state changes, which results in LACP timeout and unstable aggregator
formation.

With this change, the bond responds to each received LACPDU in passive mode
by setting ntt = true, ensuring timely replies and stable LACP negotiation.

Hangbin Liu (2):
  bonding: update ntt to true in passive mode
  selftests: bonding: add test for passive LACP mode

 drivers/net/bonding/bond_3ad.c                |  6 ++
 .../drivers/net/bonding/bond_passive_lacp.sh  | 21 +++++
 .../drivers/net/bonding/bond_topo_lacp.sh     | 77 +++++++++++++++++++
 3 files changed, 104 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_passive_lacp.sh
 create mode 100644 tools/testing/selftests/drivers/net/bonding/bond_topo_lacp.sh

-- 
2.46.0


[RFC PATCH v2 0/9] KVM: Enable Nested Virt selftests

by Ganapatrao Kulkarni

This patch series makes the selftests work with NV enabled. The guest code
is run in vEL2 instead of EL1. We add a command line option to enable
testing of NV. The NV tests are disabled by default.

Modified around 12 selftests in this series.

Changes since v1:
 - Updated NV helper functions as per comments [1].
 - Modified existing test cases to run guest code in vEL2.

[1] https://lkml.iu.edu/hypermail/linux/kernel/2502.0/07001.html

Ganapatrao Kulkarni (9):
  KVM: arm64: nv: selftests: Add support to run guest code in vEL2.
  KVM: arm64: nv: selftests: Add simple test to run guest code in vEL2
  KVM: arm64: nv: selftests: Enable hypervisor timer tests to run in vEL2
  KVM: arm64: nv: selftests: enable aarch32_id_regs test to run in vEL2
  KVM: arm64: nv: selftests: Enable vgic tests to run in vEL2
  KVM: arm64: nv: selftests: Enable set_id_regs test to run in vEL2
  KVM: arm64: nv: selftests: Enable test to run in vEL2
  KVM: selftests: arm64: Extend kvm_page_table_test to run guest code in vEL2
  KVM: arm64: nv: selftests: Enable page_fault_test test to run in vEL2

 tools/testing/selftests/kvm/Makefile.kvm      |   2 +
 tools/testing/selftests/kvm/arch_timer.c      |   8 +-
 .../selftests/kvm/arm64/aarch32_id_regs.c     |  34 ++++-
 .../testing/selftests/kvm/arm64/arch_timer.c  | 118 +++++++++++++++---
 .../selftests/kvm/arm64/nv_guest_hypervisor.c |  68 ++++++++++
 .../selftests/kvm/arm64/page_fault_test.c     |  35 +++++-
 .../testing/selftests/kvm/arm64/set_id_regs.c |  57 ++++++++-
 tools/testing/selftests/kvm/arm64/vgic_init.c |  54 +++++++-
 tools/testing/selftests/kvm/arm64/vgic_irq.c  |  27 ++--
 .../selftests/kvm/arm64/vgic_lpi_stress.c     |  19 ++-
 .../testing/selftests/kvm/guest_print_test.c  |  32 +++++
 .../selftests/kvm/include/arm64/arch_timer.h  |  16 +++
 .../kvm/include/arm64/kvm_util_arch.h         |   3 +
 .../selftests/kvm/include/arm64/nv_util.h     |  45 +++++++
 .../selftests/kvm/include/arm64/vgic.h        |   1 +
 .../testing/selftests/kvm/include/kvm_util.h  |   3 +
 .../selftests/kvm/include/timer_test.h        |   1 +
 .../selftests/kvm/kvm_page_table_test.c       |  30 ++++-
 tools/testing/selftests/kvm/lib/arm64/nv.c    |  46 +++++++
 .../selftests/kvm/lib/arm64/processor.c       |  61 ++++++---
 tools/testing/selftests/kvm/lib/arm64/vgic.c  |   8 ++
 21 files changed, 604 insertions(+), 64 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/arm64/nv_guest_hypervisor.c
 create mode 100644 tools/testing/selftests/kvm/include/arm64/nv_util.h
 create mode 100644 tools/testing/selftests/kvm/lib/arm64/nv.c

-- 
2.48.1


[PATCH RFC 00/14] sparc64: vdso: Switch to generic vDSO library

by Thomas Weißschuh

The generic vDSO provides a lot of common functionality shared between
different architectures. SPARC is the last architecture not using it,
preventing some necessary code cleanup.

Make use of the generic infrastructure.

Follow-up to and replacement for Arnd's SPARC vDSO removal patches:
https://lore.kernel.org/lkml/20250707144726.4008707-1-arnd@kernel.org/

Only tested on QEMU.

Based on v6.16-rc1.
Marked as RFC for testing and review only.
Will be properly resubmitted after v6.17-rc1.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Arnd Bergmann (1):
      clocksource: remove ARCH_CLOCKSOURCE_DATA

Thomas Weißschuh (13):
      vdso: add struct __kernel_old_timeval forward declaration to gettime.h
      sparc64: time: Remove architecture-specific clocksource data
      sparc64: vdso: Link with -z noexecstack
      sparc64: vdso: Remove obsolete "fake section table" reservation
      sparc64: vdso: Replace code patching with runtime conditional
      sparc64: vdso: Move hardware counter read into header
      sparc64: vdso: Move syscall fallbacks into header
      sparc64: vdso: Introduce vdso/processor.h
      sparc64: vdso: Switch to the generic vDSO library
      sparc64: vdso2c: Drop sym_vvar_start handling
      sparc64: vdso2c: Remove symbol handling
      sparc64: vdso: Implement clock_gettime64()
      sparc64: vdso: Implement clock_getres()

 arch/sparc/Kconfig                         |   5 +-
 arch/sparc/include/asm/clocksource.h       |   9 -
 arch/sparc/include/asm/processor.h         |   3 +
 arch/sparc/include/asm/processor_32.h      |   2 -
 arch/sparc/include/asm/processor_64.h      |  25 --
 arch/sparc/include/asm/vdso.h              |   2 -
 arch/sparc/include/asm/vdso/clocksource.h  |  10 +
 arch/sparc/include/asm/vdso/gettimeofday.h | 208 ++++++++++++++++
 arch/sparc/include/asm/vdso/processor.h    |  41 ++++
 arch/sparc/include/asm/vdso/vsyscall.h     |  10 +
 arch/sparc/include/asm/vvar.h              |  75 ------
 arch/sparc/kernel/Makefile                 |   1 -
 arch/sparc/kernel/time_64.c                |   6 +-
 arch/sparc/kernel/vdso.c                   |  69 ------
 arch/sparc/vdso/Makefile                   |   8 +-
 arch/sparc/vdso/vclock_gettime.c           | 382 +++--------------------------
 arch/sparc/vdso/vdso-layout.lds.S          |  26 +-
 arch/sparc/vdso/vdso.lds.S                 |   4 +-
 arch/sparc/vdso/vdso2c.c                   |  24 --
 arch/sparc/vdso/vdso2c.h                   |  45 +---
 arch/sparc/vdso/vdso32/vdso32.lds.S        |   6 +-
 arch/sparc/vdso/vma.c                      | 274 ++-------------------
 include/linux/clocksource.h                |   6 +-
 include/vdso/gettime.h                     |   1 +
 kernel/time/Kconfig                        |   4 -
 25 files changed, 344 insertions(+), 902 deletions(-)
---
base-commit: eaa6313d2ceb2a3f1c870866621058ad6081f028
change-id: 20250722-vdso-sparc64-generic-2-25f2e058e92c

Best regards,
-- 
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>


[PATCH net-next 0/2] selftests: drv-net: Fix and improve command requirement checking

by Gal Pressman

This series fixes remote command checking and cleans up command requirement
calls across tests.

The first patch fixes require_cmd() incorrectly checking commands locally
even when remote=True was specified due to a missing host parameter.

The second patch makes require_cmd() usage explicit about local/remote
requirements, avoiding unnecessary test failures and consolidating
duplicate calls.

Gal Pressman (2):
  selftests: drv-net: Fix remote command checking in require_cmd()
  selftests: drv-net: Make command requirements explicit

 tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py | 3 +--
 tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py     | 2 +-
 tools/testing/selftests/drivers/net/hw/tso.py                | 2 +-
 tools/testing/selftests/drivers/net/lib/py/env.py            | 2 +-
 tools/testing/selftests/drivers/net/lib/py/load.py           | 2 +-
 tools/testing/selftests/drivers/net/ping.py                  | 2 +-
 6 files changed, 6 insertions(+), 7 deletions(-)

-- 
2.40.1


[PATCH net 0/3] selftests: drv-net: tso: fix issues with tso selftest

by Daniel Zahka

There are a couple of issues with the tso selftest.
- Features required for test cases are detected by searching the set
  of active features at test start, so if a feature is supported by
  hw, but disabled, the test will report that the feature under test
  is not available and fail.
- The vxlan test cases do not use the correct ip link flags based on
  the gso feature under test.
- The non-tunneled tso6 test case is showing up with the wrong name.

With all patches applied, test output is:

# Detected qstat for LSO wire-packets
TAP version 13
1..14
ok 1 tso.ipv4
# Testing with mangleid enabled
ok 2 tso.vxlan4_ipv4
ok 3 tso.vxlan4_ipv6
# Testing with mangleid enabled
ok 4 tso.vxlan_csum4_ipv4
ok 5 tso.vxlan_csum4_ipv6
# Testing with mangleid enabled
ok 6 tso.gre4_ipv4
ok 7 tso.gre4_ipv6
ok 8 tso.ipv6
# Testing with mangleid enabled
ok 9 tso.vxlan6_ipv4
ok 10 tso.vxlan6_ipv6
# Testing with mangleid enabled
ok 11 tso.vxlan_csum6_ipv4
ok 12 tso.vxlan_csum6_ipv6
# Testing with mangleid enabled
ok 13 tso.gre6_ipv4
ok 14 tso.gre6_ipv6
# Totals: pass:14 fail:0 xfail:0 xpass:0 skip:0 error:0

Daniel Zahka (3):
  selftests: drv-net: tso: enable test cases based on hw_features
  selftests: drv-net: tso: fix vxlan tunnel flags to get correct gso_type
  selftests: drv-net: tso: fix non-tunneled tso6 test case name

 tools/testing/selftests/drivers/net/hw/tso.py | 99 +++++++++++--------
 1 file changed, 59 insertions(+), 40 deletions(-)

-- 
2.47.1


[PATCH RFC] selftests/pidfd: Fix duplicate-symbol warnings for SCHED_ CPP symbols

by Paul E. McKenney

The pidfd selftests run in userspace and include both userspace and kernel
header files. On some distros (for example, CentOS), this results in
duplicate-symbol warnings in allmodconfig builds, while on other distros
(for example, Ubuntu) it does not. (This happens in recent -next trees,
including next-20250714.)

Therefore, use #undef to get rid of the userspace definitions in favor
of the kernel definitions.

Other ways of handling this include splitting up the selftest code so
that the userspace definitions go into one translation unit and the
kernel definitions into another (which might or might not be feasible)
or to adjust compiler command-line options to suppress the warnings
(which might or might not be desirable).

Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <linux-kselftest(a)vger.kernel.org>
---
 pidfd.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index efd74063126eb..6ff495398e872 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -16,6 +16,10 @@
 #include <sys/types.h>
 #include <sys/wait.h>
 
+#undef SCHED_NORMAL
+#undef SCHED_FLAG_KEEP_ALL
+#undef SCHED_FLAG_UTIL_CLAMP
+
 #include "../kselftest.h"
 #include "../clone3/clone3_selftests.h"


[PATCH] selftests/bpf: Install test modules into $INSTALL_PATH

by Ricardo B. Marlière

The tests expect the modules to be in the same working directory, but when
using a different $INSTALL_PATH they are not copied over.

Signed-off-by: Ricardo B. Marlière <rbm(a)suse.com>
---
# cd tools/testing/selftests/kselftest_install && ./run_kselftest.sh -t bpf:test_verifier
TAP version 13
1..1
# timeout set to 0
# selftests: bpf: test_verifier
# Can't find bpf_testmod.ko kernel module: -2
not ok 1 selftests: bpf: test_verifier # exit=1
---
 tools/testing/selftests/bpf/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 4863106034dfbcd35f830432322f054d897bb406..56b0565af8a76a9e784836a836935dd22e814fc0 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -877,5 +877,7 @@ override define INSTALL_RULE
 	@for DIR in $(TEST_INST_SUBDIRS); do \
 		mkdir -p $(INSTALL_PATH)/$$DIR; \
 		rsync -a $(OUTPUT)/$$DIR/*.bpf.o $(INSTALL_PATH)/$$DIR;\
+		rsync -a $(OUTPUT)/$$DIR/*.ko $(INSTALL_PATH)/$$DIR;\
+		rsync -a $(OUTPUT)/*.ko $(INSTALL_PATH);\
 	done
 endef

---
base-commit: f227e9ed4fe4f2fed40e4725d6c10860d30c2ea2
change-id: 20250724-bpf-next_for-next-f1de3e4becc8

Best regards,
-- 
Ricardo B. Marlière <rbm(a)suse.com>


[PATCH] selftests/tracing: Fix false failure of subsystem event test

by Steven Rostedt

From: Steven Rostedt <rostedt(a)goodmis.org>

The subsystem event test enables all "sched" events and makes sure there
are at least 3 different events in the output. It used to cat the entire
trace file to | wc -l, but on slow machines, that could last a very long
time. To solve that, it was changed to just read the first 100 lines of the
trace file. This can cause false failures as some events repeat so often
that the 100 lines that are examined could possibly be of only one event.

Instead, create an awk script that looks for 3 different events and will
exit out after it finds them. This will find the 3 events the test looks
for (eventually if it works), and still exit out after the test is
satisfied and not cause slower machines to run forever.

Reported-by: Tengda Wu <wutengda(a)huaweicloud.com>
Closes: https://lore.kernel.org/all/20250710130134.591066-1-wutengda@huaweicloud.co…
Fixes: 1a4ea83a6e67 ("selftests/ftrace: Limit length in subsystem-enable tests")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
 .../ftrace/test.d/event/subsystem-enable.tc   | 28 +++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc b/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc
index b7c8f29c09a9..65916bb55dfb 100644
--- a/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc
+++ b/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc
@@ -14,11 +14,35 @@ fail() { #msg
     exit_fail
 }
 
+# As reading trace can last forever, simply look for 3 different
+# events then exit out of reading the file. If there's not 3 different
+# events, then the test has failed.
+check_unique() {
+    cat trace | grep -v '^#' | awk '
+	BEGIN { cnt = 0; }
+	{
+	    for (i = 0; i < cnt; i++) {
+		if (event[i] == $5) {
+		    break;
+		}
+	    }
+	    if (i == cnt) {
+		event[cnt++] = $5;
+		if (cnt > 2) {
+		    exit;
+		}
+	    }
+	}
+	END {
+	    printf "%d", cnt;
+	}'
+}
+
 echo 'sched:*' > set_event
 
 yield
 
-count=`head -n 100 trace | grep -v ^# | awk '{ print $5 }' | sort -u | wc -l`
+count=`check_unique`
 if [ $count -lt 3 ]; then
     fail "at least fork, exec and exit events should be recorded"
 fi
@@ -29,7 +53,7 @@ echo 1 > events/sched/enable
 
 yield
 
-count=`head -n 100 trace | grep -v ^# | awk '{ print $5 }' | sort -u | wc -l`
+count=`check_unique`
 if [ $count -lt 3 ]; then
     fail "at least fork, exec and exit events should be recorded"
 fi
-- 
2.47.2
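
The early-exit counting in check_unique() can be tried on canned input without tracefs. In this sketch, `count_unique_events` is a hypothetical stand-in that reads trace text from stdin rather than the `trace` file, and the sample lines are made up for illustration; field 5 is treated as the event name, as in the patch.

```shell
#!/bin/sh
# Sketch only: count_unique_events is a hypothetical stand-in for the
# patch's check_unique(), reading trace text from stdin instead of the
# tracefs "trace" file. Field 5 is taken as the event name, as in the patch.
count_unique_events() {
    grep -v '^#' | awk '
    BEGIN { cnt = 0 }
    {
        for (i = 0; i < cnt; i++) {
            if (event[i] == $5) {
                break
            }
        }
        if (i == cnt) {
            event[cnt++] = $5
            if (cnt > 2) {
                exit    # three distinct events seen; stop reading
            }
        }
    }
    END { printf "%d", cnt }'
}

# Made-up sample lines; the fifth whitespace-separated field is the event.
sample='# tracer: nop
 bash-1 [000] d..1. 1.000000: sched_switch: prev_comm=bash
 bash-1 [000] d..1. 1.000001: sched_switch: prev_comm=bash
 bash-1 [000] d..1. 1.000002: sched_wakeup: comm=sh
 bash-1 [000] d..1. 1.000003: sched_process_fork: comm=bash
 bash-1 [000] d..1. 1.000004: sched_process_exit: comm=sh'

printf '%s\n' "$sample" | count_unique_events   # prints 3
```

The repeated sched_switch line is skipped, and reading stops at the third distinct event, so the last sample line is never examined. That is the property that keeps the test from reading an ever-growing trace file to the end.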


[PATCH v2] selftests: firmware: Add details in error logging

by Harshal

Specify details in logs of failed cases

Signed-off-by: Harshal <embedkari167(a)gmail.com>
---
v2:
- revert back to exit() instead of die() to avoid modifying system behaviour

v1: https://lore.kernel.org/all/c7c071ed-6a4e-4a9c-ba9d-c745fd42c22f@linuxfound…

 tools/testing/selftests/firmware/fw_namespace.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/firmware/fw_namespace.c b/tools/testing/selftests/firmware/fw_namespace.c
index 04757dc7e546..5b0032498ede 100644
--- a/tools/testing/selftests/firmware/fw_namespace.c
+++ b/tools/testing/selftests/firmware/fw_namespace.c
@@ -38,7 +38,7 @@ static void trigger_fw(const char *fw_name, const char *sys_path)
 
 	fd = open(sys_path, O_WRONLY);
 	if (fd < 0)
-		die("open failed: %s\n",
+		die("open of sys_path failed: %s\n",
 		    strerror(errno));
 	if (write(fd, fw_name, strlen(fw_name)) != strlen(fw_name))
 		exit(EXIT_FAILURE);
@@ -52,10 +52,10 @@ static void setup_fw(const char *fw_path)
 
 	fd = open(fw_path, O_WRONLY | O_CREAT, 0600);
 	if (fd < 0)
-		die("open failed: %s\n",
+		die("open of firmware file failed: %s\n",
 		    strerror(errno));
 	if (write(fd, fw, sizeof(fw) -1) != sizeof(fw) -1)
-		die("write failed: %s\n",
+		die("write to firmware file failed: %s\n",
 		    strerror(errno));
 	close(fd);
 }
@@ -66,7 +66,7 @@ static bool test_fw_in_ns(const char *fw_name, const char *sys_path, bool block_
 
 	if (block_fw_in_parent_ns)
 		if (mount("test", "/lib/firmware", "tmpfs", MS_RDONLY, NULL) == -1)
-			die("blocking firmware in parent ns failed\n");
+			die("blocking firmware in parent namespace failed\n");
 
 	child = fork();
 	if (child == -1) {
@@ -99,11 +99,11 @@ static bool test_fw_in_ns(const char *fw_name, const char *sys_path, bool block_
 				strerror(errno));
 		}
 		if (mount(NULL, "/", NULL, MS_SLAVE|MS_REC, NULL) == -1)
-			die("remount root in child ns failed\n");
+			die("remount root in child namespace failed\n");
 
 		if (!block_fw_in_parent_ns) {
 			if (mount("test", "/lib/firmware", "tmpfs", MS_RDONLY, NULL) == -1)
-				die("blocking firmware in child ns failed\n");
+				die("blocking firmware in child namespace failed\n");
 		} else
 			umount("/lib/firmware");
 
@@ -129,8 +129,8 @@ int main(int argc, char **argv)
 		die("error: failed to build full fw_path\n");
 
 	setup_fw(fw_path);
-
 	setvbuf(stdout, NULL, _IONBF, 0);
+
 	/* Positive case: firmware in PID1 mount namespace */
 	printf("Testing with firmware in parent namespace (assumed to be same file system as PID1)\n");
 	if (!test_fw_in_ns(fw_name, sys_path, false))
-- 
2.43.0


[PATCH v18 4/5] binder: add transaction_report feature entry

by Carlos Llamas

From: Li Li <dualli(a)google.com>

Add "transaction_report" to the binderfs feature list, to help userspace
determine if the "BINDER_CMD_REPORT" generic netlink api is supported by
the binder driver.

Signed-off-by: Li Li <dualli(a)google.com>
Signed-off-by: Carlos Llamas <cmllamas(a)google.com>
---
 drivers/android/binderfs.c                                | 8 ++++++++
 .../selftests/filesystems/binderfs/binderfs_test.c        | 1 +
 2 files changed, 9 insertions(+)

diff --git a/drivers/android/binderfs.c b/drivers/android/binderfs.c
index 4f827152d18e..f74a7e380261 100644
--- a/drivers/android/binderfs.c
+++ b/drivers/android/binderfs.c
@@ -59,6 +59,7 @@ struct binder_features {
 	bool oneway_spam_detection;
 	bool extended_error;
 	bool freeze_notification;
+	bool transaction_report;
 };
 
 static const struct constant_table binderfs_param_stats[] = {
@@ -76,6 +77,7 @@ static struct binder_features binder_features = {
 	.oneway_spam_detection = true,
 	.extended_error = true,
 	.freeze_notification = true,
+	.transaction_report = true,
 };
 
 static inline struct binderfs_info *BINDERFS_SB(const struct super_block *sb)
@@ -616,6 +618,12 @@ static int init_binder_features(struct super_block *sb)
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
+	dentry = binderfs_create_file(dir, "transaction_report",
+				      &binder_features_fops,
+				      &binder_features.transaction_report);
+	if (IS_ERR(dentry))
+		return PTR_ERR(dentry);
+
 	return 0;
 }
 
diff --git a/tools/testing/selftests/filesystems/binderfs/binderfs_test.c b/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
index 81db85a5cc16..39a68078a79b 100644
--- a/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
+++ b/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
@@ -65,6 +65,7 @@ static int __do_binderfs_test(struct __test_metadata *_metadata)
 		"oneway_spam_detection",
 		"extended_error",
 		"freeze_notification",
+		"transaction_report",
 	};
 
 	change_mountns(_metadata);
-- 
2.50.1.470.g6ba607880d-goog


Re: [PATCH net] selftests: rtnetlink.sh: remove esp4_offload after test

by Shannon Nelson

(yeah, I made the same non-ASCII mistake...)

On 7/24/25 10:49 AM, Shannon Nelson wrote:
> On 7/24/25 1:20 AM, Xiumei Mu wrote:
>> resent the reply again with "plain text mode"
>>
>> On Thu, Jul 24, 2025 at 2:25 PM Hangbin Liu<liuhangbin(a)gmail.com> wrote:
>>> Hi Xiumei,
>>> On Thu, Jul 24, 2025 at 12:55:02PM +0800, Xiumei Mu wrote:
>>>> The esp4_offload module, loaded during IPsec offload tests, should
>>>> be reset to its default settings after testing.
>>>> Otherwise, leaving it enabled could unintentionally affect subsequence
>>>> test cases by keeping offload active.
>>> Would you please show which subsequence test will be affected?
>> Any general ipsec case, which expects to be tested by default
>> behavior(without offload).
>> esp4_offload will affect the performance.
>>
>>>> Fixes: 2766a11161cc ("selftests: rtnetlink: add ipsec offload API test")
>>> It would be good to Cc the fix commit author. You can use
>>> `./scripts/get_maintainer.pl your_patch_file` to get the contacts you
>>> need to Cc.
>> I used the script to generate the cc list.
>> and I double checked the old email of the author is invalid
>> added his personal email in the cc list:
>>
>> Shannon Nelson<shannon.nelson(a)oracle.com>. -----> Shannon Nelson
>> <sln(a)onemain.com>
>>
>> get the information from here:
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a…
>
> Yep, that was me a couple of corporate email addresses ago. Thanks for
> digging up the new email address. Luckily I have a couple of mail
> filters watching for old email addresses.
>
>>>> Signed-off-by: Xiumei Mu<xmu(a)redhat.com>
>>>> ---
>>>> tools/testing/selftests/net/rtnetlink.sh | 6 ++++++
>>>> 1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
>>>> index 2e8243a65b50..5cc1b5340a1a 100755
>>>> --- a/tools/testing/selftests/net/rtnetlink.sh
>>>> +++ b/tools/testing/selftests/net/rtnetlink.sh
>>>> @@ -673,6 +673,11 @@ kci_test_ipsec_offload()
>>>> sysfsf=$sysfsd/ipsec
>>>> sysfsnet=/sys/bus/netdevsim/devices/netdevsim0/net/
>>>> probed=false
>>>> + esp4_offload_probed_default=false
>>>> +
>>>> + if lsmod | grep -q esp4_offload; then
>>>> + esp4_offload_probed_default=true
>>>> + fi
>>> If the mode is loaded by default, how to avoid the subsequence test to be
>>> failed?
>> The module is not loaded by default, but some users or testers may
>> need to load esp4_offload in their own environments.
>> Therefore, resetting it to the default configuration is the best
>> practice to prevent this self-test case from impacting subsequent
>> tests
>
> Seems reasonable to me.
>
>>>> if ! mount | grep -q debugfs; then
>>>> mount -t debugfs none /sys/kernel/debug/ &> /dev/null
>>>> @@ -766,6 +771,7 @@ EOF
>>>> fi
>>>>
>>>> # clean up any leftovers
>>>> + [ $esp4_offload_probed_default == false ] && rmmod esp4_offload
>>> The new patch need to pass shellcheck. We need to double quote the variable.
>> Thanks your comment, I will add double quote in patchv2
>
> Or you keep with the existing style as done a line or two later:
> $esp4_offload_probed_default && rmmod esp4_offload Either way,
> Reviewed-by: Shannon Nelson <sln(a)onemain.com> Cheers, sln
>>> Thanks
>>> Hangbin
>>>> echo 0 > /sys/bus/netdevsim/del_device
>>>> $probed && rmmod netdevsim
>>>>
>>>> --
>>>> 2.50.1
>>>>
>

5 months, 2 weeks

1
0

[PATCH bpf-next v3 0/4] bpf: Show precise rejected function when attaching to __noreturn and deny list functions

by KaFai Wan

Show precise rejected function when attaching fexit/fmod_ret to __noreturn
functions.
Add log for attaching tracing programs to functions in deny list.
Add selftest for attaching tracing programs to functions in deny list.
Migrate fexit_noreturns case into tracing_failure test suite.

changes:
v3:
- add tracing_deny case into existing files (Alexei)
- migrate fexit_noreturns into tracing_failure
- change SOB

v2:
- change verifier log message (Alexei)
- add missing Suggested-by
https://lore.kernel.org/bpf/20250714120408.1627128-1-mannkafai@gmail.com/

v1:
https://lore.kernel.org/all/20250710162717.3808020-1-mannkafai@gmail.com/
---
KaFai Wan (4):
  bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions
  bpf: Add log for attaching tracing programs to functions in deny list
  selftests/bpf: Add selftest for attaching tracing programs to functions in deny list
  selftests/bpf: Migrate fexit_noreturns case into tracing_failure test suite

 kernel/bpf/verifier.c | 5 +-
 .../bpf/prog_tests/fexit_noreturns.c | 9 ----
 .../bpf/prog_tests/tracing_failure.c | 52 +++++++++++++++++++
 .../selftests/bpf/progs/fexit_noreturns.c | 15 ------
 .../selftests/bpf/progs/tracing_failure.c | 12 +++++
 5 files changed, 68 insertions(+), 25 deletions(-)
 delete mode 100644 tools/testing/selftests/bpf/prog_tests/fexit_noreturns.c
 delete mode 100644 tools/testing/selftests/bpf/progs/fexit_noreturns.c

-- 
2.43.0

5 months, 2 weeks

3
11

[PATCH net-next V6 0/5] selftests: drv-net: Test XDP native support

by Mohsin Bashir

This patch series adds tests to validate XDP native support for PASS,
DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment.
For adjustment tests, validate support for both the extension and
shrinking cases across various packet sizes and offset values.

The pass criteria for head/tail adjustment tests require that at least
one adjustment value works for at least one packet size. This ensures
that the variability in the maximum supported head/tail adjustment offset
across different drivers is accounted for.

The results reported in this series are based on netdevsim. However, the
series is tested against multiple other drivers including fbnic.

Note: The XDP support for fbnic will be added later.
---
Change-log:
- Force checksum computation in netdevsim xdp hook
- Update checksum when updating packet for head/tail adjustment cases
- Use 1 as the minimum value for the data growth while adjusting tail

V5: https://lore.kernel.org/netdev/20250715210553.1568963-6-mohsin.bashr@gmail.…
V4: https://lore.kernel.org/netdev/20250714210352.1115230-1-mohsin.bashr@gmail.…
V3: https://lore.kernel.org/netdev/20250712002648.2385849-1-mohsin.bashr@gmail.…
V2: https://lore.kernel.org/netdev/20250710184351.63797-1-mohsin.bashr@gmail.com
V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.…

Jakub Kicinski (1):
  net: netdevsim: hook in XDP handling

Mohsin Bashir (4):
  selftests: drv-net: Test XDP_PASS/DROP support
  selftests: drv-net: Test XDP_TX support
  selftests: drv-net: Test tail-adjustment support
  selftests: drv-net: Test head-adjustment support

 drivers/net/netdevsim/netdev.c | 21 +-
 tools/testing/selftests/drivers/net/Makefile | 1 +
 tools/testing/selftests/drivers/net/xdp.py | 656 ++++++++++++++++++
 .../selftests/net/lib/xdp_native.bpf.c | 621 +++++++++++++++++
 4 files changed, 1298 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/drivers/net/xdp.py
 create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c

-- 
2.47.1

5 months, 2 weeks

8
20

[PATCH v3 0/4] procfs: make reference pidns more user-visible

by Aleksa Sarai

Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed). /* pidns mount option for procfs */ This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs). But this requires forking off a helper process because you cannot just one-shot this using mount(2). * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process). While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate. Things would be much easier if there was a way for userspace to just specify the pidns they want. Patch 1 implements a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); The initial security model I have in this RFC is to be as conservative as possible and just mirror the security model for setns(2) -- which means that you can only set pidns=... to pid namespaces that your current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN privileges over the pid namespace. 
This fulfils the requirements of container runtimes, but I suspect that this may be too strict for some usecases. The pidns argument is not displayed in mountinfo -- it's not clear to me what value it would make sense to show (maybe we could just use ns_dname to provide an identifier for the namespace, but this number would be fairly useless to userspace). I'm open to suggestions. Note that PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get information about this outside of mountinfo. Note that you cannot change the pidns of an already-created procfs instance. The primary reason is that allowing this to be changed would require RCU-protecting proc_pid_ns(sb) and thus auditing all of fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF the pid namespace. Since creating procfs instances is very cheap, it seems unnecessary to overcomplicate this upfront. Trying to reconfigure procfs this way errors out with -EBUSY. /* ioctl(PROCFS_GET_PID_NAMESPACE) */ In addition, being able to figure out what pid namespace is being used by a procfs mount is quite useful when you have an administrative process (such as a container runtime) which wants to figure out the correct way of mapping PIDs between its own namespace and the namespace for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are alternative ways to do this, but they all rely on ancillary information that third-party libraries and tools do not necessarily have access to. To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which can be used to get a reference to the pidns that a procfs is using. It's not quite clear what is the correct security model for this API, but the current approach I've taken is to: * Make the ioctl only valid on the root (meaning that a process without access to the procfs root -- such as only having an fd to a procfs file or some open_tree(2)-like subset -- cannot use this API). 
* Require that the process requesting either has access to /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative access to it and can join it if they had a handle), or is in a pidns that is a direct ancestor of the target pidns (i.e. all of the pids are already visible in the procfs for the current process's pidns). The security model for this is a little loose, as it seems to me that all of the cases mentioned are valid cases to allow access, but I'm open to suggestions for whether we need to make this stricter or looser. Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com> --- Changes in v3: - Disallow changing pidns for existing procfs instances, as we'd probably have to RCU-protect everything that touches the pinned pidns reference. - Improve tests with slightly nicer ASSERT_ERRNO* macros. - v2: <https://lore.kernel.org/r/20250723-procfs-pidns-api-v2-0-621e7edd8e40@cypha…> Changes in v2: - #ifdef CONFIG_PID_NS - Improve cover letter wording to make it clear we're talking about two separate features with different permission models. [Andy Lutomirski] - Fix build warnings in pidns_is_ancestor() patch. 
[kernel test robot] - v1: <https://lore.kernel.org/r/20250721-procfs-pidns-api-v1-0-5cd9007e512d@cypha…> --- Aleksa Sarai (4): pidns: move is-ancestor logic to helper procfs: add "pidns" mount option procfs: add PROCFS_GET_PID_NAMESPACE ioctl selftests/proc: add tests for new pidns APIs Documentation/filesystems/proc.rst | 12 ++ fs/proc/root.c | 156 +++++++++++++++++- include/linux/pid_namespace.h | 9 ++ include/uapi/linux/fs.h | 3 + kernel/pid_namespace.c | 23 ++- tools/testing/selftests/proc/.gitignore | 1 + tools/testing/selftests/proc/Makefile | 1 + tools/testing/selftests/proc/proc-pidns.c | 252 ++++++++++++++++++++++++++++++ 8 files changed, 441 insertions(+), 16 deletions(-) --- base-commit: 66639db858112bf6b0f76677f7517643d586e575 change-id: 20250717-procfs-pidns-api-8ed1583431f0 Best regards, -- Aleksa Sarai <cyphar(a)cyphar.com>

5 months, 2 weeks

1
4

[PATCH net] selftests: rtnetlink.sh: remove esp4_offload after test

by Xiumei Mu

The esp4_offload module, loaded during IPsec offload tests, should
be reset to its default settings after testing.
Otherwise, leaving it enabled could unintentionally affect subsequent
test cases by keeping offload active.

Fixes: 2766a11161cc ("selftests: rtnetlink: add ipsec offload API test")
Signed-off-by: Xiumei Mu <xmu(a)redhat.com>
---
 tools/testing/selftests/net/rtnetlink.sh | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 2e8243a65b50..5cc1b5340a1a 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -673,6 +673,11 @@ kci_test_ipsec_offload()
 	sysfsf=$sysfsd/ipsec
 	sysfsnet=/sys/bus/netdevsim/devices/netdevsim0/net/
 	probed=false
+	esp4_offload_probed_default=false
+
+	if lsmod | grep -q esp4_offload; then
+		esp4_offload_probed_default=true
+	fi

 	if ! mount | grep -q debugfs; then
 		mount -t debugfs none /sys/kernel/debug/ &> /dev/null
@@ -766,6 +771,7 @@ EOF
 	fi

 	# clean up any leftovers
+	[ $esp4_offload_probed_default == false ] && rmmod esp4_offload
 	echo 0 > /sys/bus/netdevsim/del_device
 	$probed && rmmod netdevsim

-- 
2.50.1
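The save-and-restore pattern in this patch can be sketched on its own as follows. The helper names module_loaded and restore_module are hypothetical and not part of rtnetlink.sh, and the optional extra arguments exist only so the logic can be exercised without touching real modules. Quoting the variable in the test avoids the usual unquoted-expansion warning from shell linters.

```sh
#!/bin/sh
# Succeed if module $1 is currently loaded.
# $2 is the module list to read; /proc/modules is the default.
module_loaded() {
	grep -q "^$1 " "${2:-/proc/modules}"
}

# Unload module $1 only if it was NOT loaded before the test ($2 = false).
# $3 is the unload command; it defaults to rmmod.
restore_module() {
	if [ "$2" = false ]; then
		"${3:-rmmod}" "$1"
	fi
}

# Intended use inside a test (shown as comments, not executed here):
#   was_loaded=false
#   module_loaded esp4_offload && was_loaded=true
#   ... run the offload test ...
#   restore_module esp4_offload "$was_loaded"
```

Recording the initial state first means the cleanup leaves a module alone when the tester had loaded it deliberately before the run.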

5 months, 2 weeks

2
2

[PATCH] selftests: netfilter: ipvs.sh: Explicity disable rp_filter on interface tunl0

by Yi Chen

Although setup_ns() sets net.ipv4.conf.default.rp_filter=0, loading
certain modules such as ipip automatically creates a tunl0 interface in
every netns, including newly created ones. In the script this happens
before default.rp_filter=0 is applied, so tunl0.rp_filter remains set to 1,
which causes the test to report FAIL when the ipip module is preloaded.

Before fix:
Testing DR mode...
Testing NAT mode...
Testing Tunnel mode...
ipvs.sh: FAIL

After fix:
Testing DR mode...
Testing NAT mode...
Testing Tunnel mode...
ipvs.sh: PASS

Fixes: 7c8b89ec5 ("selftests: netfilter: remove rp_filter configuration")
Signed-off-by: Yi Chen <yiche(a)redhat.com>
---
 tools/testing/selftests/net/netfilter/ipvs.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/netfilter/ipvs.sh b/tools/testing/selftests/net/netfilter/ipvs.sh
index 6af2ea3ad6b8..9c9d5b38ab71 100755
--- a/tools/testing/selftests/net/netfilter/ipvs.sh
+++ b/tools/testing/selftests/net/netfilter/ipvs.sh
@@ -151,7 +151,7 @@ test_nat() {
 test_tun() {
 	ip netns exec "${ns0}" ip route add "${vip_v4}" via "${gip_v4}" dev br0

-	ip netns exec "${ns1}" modprobe -q ipip
+	modprobe -q ipip
 	ip netns exec "${ns1}" ip link set tunl0 up
 	ip netns exec "${ns1}" sysctl -qw net.ipv4.ip_forward=0
 	ip netns exec "${ns1}" sysctl -qw net.ipv4.conf.all.send_redirects=0
@@ -160,10 +160,10 @@ test_tun() {
 	ip netns exec "${ns1}" ipvsadm -a -i -t "${vip_v4}:${port}" -r ${rip_v4}:${port}
 	ip netns exec "${ns1}" ip addr add ${vip_v4}/32 dev lo:1

-	ip netns exec "${ns2}" modprobe -q ipip
 	ip netns exec "${ns2}" ip link set tunl0 up
 	ip netns exec "${ns2}" sysctl -qw net.ipv4.conf.all.arp_ignore=1
 	ip netns exec "${ns2}" sysctl -qw net.ipv4.conf.all.arp_announce=2
+	ip netns exec "${ns2}" sysctl -qw net.ipv4.conf.tunl0.rp_filter=0
 	ip netns exec "${ns2}" ip addr add "${vip_v4}/32" dev lo:1

 	test_service
-- 
2.50.1
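To see why changing only the default is not enough, recall that conf/default is copied into an interface when the interface is created, and that source validation on an interface uses the larger of conf/all/rp_filter and conf/<iface>/rp_filter. The sketch below computes that effective value. The helper name effective_rp_filter and its optional directory argument are hypothetical, added so the logic can be tried against a scratch directory instead of the live sysctl tree.

```sh
#!/bin/sh
# Print the rp_filter value that applies to interface $1:
# the maximum of conf/all/rp_filter and conf/<iface>/rp_filter.
# $2 is the sysctl directory to read; the /proc path is the default.
effective_rp_filter() {
	conf="${2:-/proc/sys/net/ipv4/conf}"
	all=$(cat "$conf/all/rp_filter") || return 1
	iface=$(cat "$conf/$1/rp_filter") || return 1
	if [ "$all" -gt "$iface" ]; then
		echo "$all"
	else
		echo "$iface"
	fi
}
```

If tunl0 already exists with a value of 1 when default.rp_filter=0 is applied, tunl0 keeps its own value and the effective value stays 1 until conf.tunl0.rp_filter is cleared explicitly, which is what the patch does in ns2.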

5 months, 2 weeks

2
1

[PATCH v26 net-next 0/6] DUALPI2 patch

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com> Hello, Please find the DualPI2 patch v26. This patch series adds DualPI Improved with a Square (DualPI2) with the following features: * Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague) * Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332 * Configurable overload strategies * Use of sojourn time to reliably estimate queue delay * Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues For more details of DualPI2, please refer to IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332). Best regards, Chia-Yu --- v25 (19-Jul-2025) and v26 (22-Jul-2025) - Restructure to avoid using lock and unlock when both step_thresh are provided (Jakub Kicinski <kuba(a)kernel.org>) v24 (18-Jul-2025) - Replace TCA_DUALPI2 prefix with TC_DUALPI2 for enums in pkt_sched.h (Jakub Kicinski <kuba(a)kernel.org>) - Report error if both packet and time step thresholds are provided (Jakub Kicinski <kuba(a)kernel.org>) v22 (11-Jul-2025) and v23 (13-Jul-2025) - Fix issue when user would like to change DualPI2 but provides an empty TCA_OPTIONS with no nested attributes (Paolo Abeni <pabeni(a)redhat.com>, Jakub Kicinski <kuba(a)kernel.org>) v21 (02-Jul-2025) - Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>) - Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>) - Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>) - Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>) - Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>) - Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub 
Kicinski <kuba(a)kernel.org>) - Remove redundant name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>) - Fix some typos and format issues of dualpi2 attributes v20 (21-Jun-2025) - Add one more commit to fix warning and style check on tdc.sh reported by shellcheck - Remove double-prefixed of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>) v19 (14-Jun-2025) - Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>) v18 (13-Jun-2025) - Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute - Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>) v17 (25-May-2025, Resent at 11-Jun-2025) - Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>) - Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>) - Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>) - Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>) - Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>) - Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>) - Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>) v16 (16-MAy-2025) - Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>) - Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>) - Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>) - Add more 
descriptions with respect to DualPI2 in the cover letter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>) v15 (09-May-2025) - Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakage in #1 (Simon Horman <horms(a)kernel.org>) - Fix one typo in comment of #2 - Update tc.yaml in #5 to align with the updated enum of pkt_sched.h v14 (05-May-2025) - Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun - Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>) - Update dualpi2.json to align with the updated variable order in pkt_sched.h - Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>) v13 (26-Apr-2025) - Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>) - Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>) v12 (22-Apr-2025) - Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>) - Introduce get_memory_limit function to handle potential overflow when multiplying limit with MTU (Paolo Abeni 
<pabeni(a)redhat.com>) - Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>) - Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>) v11 (15-Apr-2025) - Replace hstimer_init with hstimer_setup in sch_dualpi2.c v10 (25-Mar-2025) - Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>) - Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>) v9 (16-Mar-2025) - Fix mem_usage error in previous version - Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking. In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets. This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0. 
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 11.55 11.70 ms 350 TCP upload avg : 18.96 N/A Mbits/s 350 TCP upload sum : 18.96 N/A Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 10.81 10.70 ms 350 TCP upload avg : 18.91 N/A Mbits/s 350 TCP upload sum : 18.91 N/A Mbits/s 350 Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 12.61 12.80 ms 350 TCP upload avg : 9.48 N/A Mbits/s 350 TCP upload sum : 9.48 N/A Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 11.06 10.80 ms 350 TCP upload avg : 9.43 N/A Mbits/s 350 TCP upload sum : 9.43 N/A Mbits/s 350 Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 40.86 37.45 ms 350 TCP upload avg : 0.88 N/A Mbits/s 350 TCP upload sum : 0.88 N/A Mbits/s 350 TCP upload::1 : 0.88 0.97 Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 11.07 10.40 ms 350 TCP upload avg : 0.55 N/A Mbits/s 350 TCP upload sum : 0.55 N/A Mbits/s 350 TCP upload::1 : 0.55 0.59 Mbits/s 350 v8 (11-Mar-2025) - Fix warning messages in v7 v7 (07-Mar-2025) - Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. 
(Cong Wang <xiyou.wangcong(a)gmail.com>) v6 (04-Mar-2025) - Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>) - Update test cases in dualpi2.json - Update commit message v5 (22-Feb-2025) - A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL: Unshaped 1gigE with 4 download streams test: - Summary of tcp_4down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 1.19 1.34 ms 349 TCP download avg : 235.42 N/A Mbits/s 349 TCP download sum : 941.68 N/A Mbits/s 349 TCP download::1 : 235.19 235.39 Mbits/s 349 TCP download::2 : 235.03 235.35 Mbits/s 349 TCP download::3 : 236.89 235.44 Mbits/s 349 TCP download::4 : 234.57 235.19 Mbits/s 349 - Summary of tcp_4down run 'MQ + FQ_PIE' avg median # data pts Ping (ms) ICMP : 1.21 1.37 ms 350 TCP download avg : 235.42 N/A Mbits/s 350 TCP download sum : 941.61 N/A Mbits/s 350 TCP download::1 : 232.54 233.13 Mbits/s 350 TCP download::2 : 232.52 232.80 Mbits/s 350 TCP download::3 : 233.14 233.78 Mbits/s 350 TCP download::4 : 243.41 241.48 Mbits/s 350 - Summary of tcp_4down run 'MQ + DUALPI2' avg median # data pts Ping (ms) ICMP : 1.19 1.34 ms 349 TCP download avg : 235.42 N/A Mbits/s 349 TCP download sum : 941.68 N/A Mbits/s 349 TCP download::1 : 235.19 235.39 Mbits/s 349 TCP download::2 : 235.03 235.35 Mbits/s 349 TCP download::3 : 236.89 235.44 Mbits/s 349 TCP download::4 : 234.57 235.19 Mbits/s 349 Unshaped 1gigE with 128 download streams test: - Summary of tcp_128down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download 
sum : 946.47 N/A Mbits/s 350 Unshaped 10gigE with 4 download streams test: - Summary of tcp_4down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 0.22 0.23 ms 350 TCP download avg : 2354.08 N/A Mbits/s 350 TCP download sum : 9416.31 N/A Mbits/s 350 TCP download::1 : 2353.65 2352.81 Mbits/s 350 TCP download::2 : 2354.54 2354.21 Mbits/s 350 TCP download::3 : 2353.56 2353.78 Mbits/s 350 TCP download::4 : 2354.56 2354.45 Mbits/s 350 - Summary of tcp_4down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 0.20 0.19 ms 350 TCP download avg : 2354.76 N/A Mbits/s 350 TCP download sum : 9419.04 N/A Mbits/s 350 TCP download::1 : 2354.77 2353.89 Mbits/s 350 TCP download::2 : 2353.41 2354.29 Mbits/s 350 TCP download::3 : 2356.18 2354.19 Mbits/s 350 TCP download::4 : 2354.68 2353.15 Mbits/s 350 - Summary of tcp_4down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 0.24 0.24 ms 350 TCP download avg : 2354.11 N/A Mbits/s 350 TCP download sum : 9416.43 N/A Mbits/s 350 TCP download::1 : 2354.75 2353.93 Mbits/s 350 TCP download::2 : 2353.15 2353.75 Mbits/s 350 TCP download::3 : 2353.49 2353.72 Mbits/s 350 TCP download::4 : 2355.04 2353.73 Mbits/s 350 Unshaped 10gigE with 128 download streams test: - Summary of tcp_128down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 7.57 8.69 ms 350 TCP download avg : 73.97 N/A Mbits/s 350 TCP download sum : 9467.82 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 7.82 8.91 ms 350 TCP download avg : 73.97 N/A Mbits/s 350 TCP download sum : 9468.42 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 6.87 7.93 ms 350 TCP download avg : 73.95 N/A Mbits/s 350 TCP download sum : 9465.87 N/A Mbits/s 350 From the results shown above, we see small differences between combinations. 
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>) - Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>) - Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>) - Update license identifier (Dave Taht <dave.taht(a)gmail.com>) - Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>) - Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>) - Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c - Fix step_thresh in packets - Update code comments in sch_dualpi2.c v4 (22-Oct-2024) - Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>) - Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>) - Fix line length warning. 
v3 (19-Oct-2024) - Fix compilation error - Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>) v2 (18-Oct-2024) - Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>) - Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Fix line length warning --- Chia-Yu Chang (5): sched: Struct definition and parsing of dualpi2 qdisc sched: Dump configuration and statistics of dualpi2 qdisc selftests/tc-testing: Fix warning and style check on tdc.sh selftests/tc-testing: Add selftests for qdisc DualPI2 Documentation: netlink: specs: tc: Add DualPI2 specification Koen De Schepper (1): sched: Add enqueue/dequeue of dualpi2 qdisc Documentation/netlink/specs/tc.yaml | 151 ++- include/net/dropreason-core.h | 6 + include/uapi/linux/pkt_sched.h | 68 + net/sched/Kconfig | 12 + net/sched/Makefile | 1 + net/sched/sch_dualpi2.c | 1175 +++++++++++++++++ tools/testing/selftests/tc-testing/config | 1 + .../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++ tools/testing/selftests/tc-testing/tdc.sh | 6 +- 9 files changed, 1669 insertions(+), 5 deletions(-) create mode 100644 net/sched/sch_dualpi2.c create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json -- 2.34.1
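The classification rule quoted in the cover letter above, IP.ECN==0b*1, can be illustrated with a small sketch. The ECN field is the low two bits of the IPv4 TOS or IPv6 traffic class byte, so the rule selects ECT(1) (0b01) and CE (0b11). The function name is hypothetical, and the sketch ignores the qdisc's configurable options; it is an illustration of the rule, not the kernel implementation.

```sh
#!/bin/sh
# Print "L" if the TOS/traffic-class byte $1 carries an L4S ECN codepoint
# (low bit of the 2-bit ECN field set), otherwise "C" for the classic queue.
dualpi2_queue_for_tos() {
	ecn=$(( $1 & 3 ))		# keep only the two ECN bits
	if [ $(( ecn & 1 )) -eq 1 ]; then
		echo L			# ECT(1) = 0b01 or CE = 0b11
	else
		echo C			# Not-ECT = 0b00 or ECT(0) = 0b10
	fi
}
```

Only the ECN bits decide the queue here, so any DSCP value in the upper six bits leaves the result unchanged.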

5 months, 2 weeks

2
7

[PATCH net-next] devlink: Fix excessive stack usage in rate TC bandwidth parsing

by Tariq Toukan

From: Carolina Jubran <cjubran(a)nvidia.com> The devlink_nl_rate_tc_bw_parse function uses a large stack array for devlink attributes, which triggers a warning about excessive stack usage: net/devlink/rate.c: In function 'devlink_nl_rate_tc_bw_parse': net/devlink/rate.c:382:1: error: the frame size of 1648 bytes is larger than 1536 bytes [-Werror=frame-larger-than=] Introduce a separate attribute set specifically for rate TC bandwidth parsing that only contains the two attributes actually used: index and bandwidth. This reduces the stack array from DEVLINK_ATTR_MAX entries to just 2 entries, solving the stack usage issue. Update devlink selftest to use the new 'index' and 'bw' attribute names consistent with the YAML spec. Example usage with ynl with the new spec: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-set --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1, "rate-tc-bws": [ {"index": 0, "bw": 50}, {"index": 1, "bw": 50}, {"index": 2, "bw": 0}, {"index": 3, "bw": 0}, {"index": 4, "bw": 0}, {"index": 5, "bw": 0}, {"index": 6, "bw": 0}, {"index": 7, "bw": 0} ] }' ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-get --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1 }' output for rate-get: {'bus-name': 'pci', 'dev-name': '0000:08:00.0', 'port-index': 1, 'rate-tc-bws': [{'bw': 50, 'index': 0}, {'bw': 50, 'index': 1}, {'bw': 0, 'index': 2}, {'bw': 0, 'index': 3}, {'bw': 0, 'index': 4}, {'bw': 0, 'index': 5}, {'bw': 0, 'index': 6}, {'bw': 0, 'index': 7}], 'rate-tx-max': 0, 'rate-tx-priority': 0, 'rate-tx-share': 0, 'rate-tx-weight': 0, 'rate-type': 'leaf'} Fixes: 566e8f108fc7 ("devlink: Extend devlink rate API with traffic classes bandwidth management") Reported-by: Arnd Bergmann <arnd(a)arndb.de> Closes: https://lore.kernel.org/netdev/20250708160652.1810573-1-arnd@kernel.org/ Reported-by: kernel test robot <lkp(a)intel.com> Closes: 
https://lore.kernel.org/oe-kbuild-all/202507171943.W7DJcs6Y-lkp@intel.com/ Suggested-by: Jakub Kicinski <kuba(a)kernel.org> Signed-off-by: Jakub Kicinski <kuba(a)kernel.org> Signed-off-by: Carolina Jubran <cjubran(a)nvidia.com> Tested-by: Carolina Jubran <cjubran(a)nvidia.com> Signed-off-by: Tariq Toukan <tariqt(a)nvidia.com> --- Documentation/netlink/specs/devlink.yaml | 26 ++++++++----------- include/uapi/linux/devlink.h | 11 ++++++-- net/devlink/netlink_gen.c | 6 ++--- net/devlink/netlink_gen.h | 2 +- net/devlink/rate.c | 20 +++++++------- .../drivers/net/hw/devlink_rate_tc_bw.py | 16 ++++++------ 6 files changed, 42 insertions(+), 39 deletions(-) diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml index 1c4bb0cbe5f0..bb87111d5e16 100644 --- a/Documentation/netlink/specs/devlink.yaml +++ b/Documentation/netlink/specs/devlink.yaml @@ -853,18 +853,6 @@ attribute-sets: type: nest multi-attr: true nested-attributes: dl-rate-tc-bws - - - name: rate-tc-index - type: u8 - checks: - max: rate-tc-index-max - - - name: rate-tc-bw - type: u32 - doc: | - Specifies the bandwidth share assigned to the Traffic Class. - The bandwidth for the traffic class is determined - in proportion to the sum of the shares of all configured classes. - name: dl-dev-stats subset-of: devlink @@ -1271,12 +1259,20 @@ attribute-sets: type: flag - name: dl-rate-tc-bws - subset-of: devlink + name-prefix: devlink-rate-tc-attr- attributes: - - name: rate-tc-index + name: index + type: u8 + checks: + max: rate-tc-index-max - - name: rate-tc-bw + name: bw + type: u32 + doc: | + Specifies the bandwidth share assigned to the Traffic Class. + The bandwidth for the traffic class is determined + in proportion to the sum of the shares of all configured classes. 
operations: enum-model: directional diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h index e72bcc239afd..9fcb25a0f447 100644 --- a/include/uapi/linux/devlink.h +++ b/include/uapi/linux/devlink.h @@ -635,8 +635,6 @@ enum devlink_attr { DEVLINK_ATTR_REGION_DIRECT, /* flag */ DEVLINK_ATTR_RATE_TC_BWS, /* nested */ - DEVLINK_ATTR_RATE_TC_INDEX, /* u8 */ - DEVLINK_ATTR_RATE_TC_BW, /* u32 */ /* Add new attributes above here, update the spec in * Documentation/netlink/specs/devlink.yaml and re-generate @@ -647,6 +645,15 @@ enum devlink_attr { DEVLINK_ATTR_MAX = __DEVLINK_ATTR_MAX - 1 }; +enum devlink_rate_tc_attr { + DEVLINK_RATE_TC_ATTR_UNSPEC, + DEVLINK_RATE_TC_ATTR_INDEX, /* u8 */ + DEVLINK_RATE_TC_ATTR_BW, /* u32 */ + + __DEVLINK_RATE_TC_ATTR_MAX, + DEVLINK_RATE_TC_ATTR_MAX = __DEVLINK_RATE_TC_ATTR_MAX - 1 +}; + /* Mapping between internal resource described by the field and system * structure */ diff --git a/net/devlink/netlink_gen.c b/net/devlink/netlink_gen.c index c50436433c18..d97c326a9045 100644 --- a/net/devlink/netlink_gen.c +++ b/net/devlink/netlink_gen.c @@ -45,9 +45,9 @@ const struct nla_policy devlink_dl_port_function_nl_policy[DEVLINK_PORT_FN_ATTR_ [DEVLINK_PORT_FN_ATTR_CAPS] = NLA_POLICY_BITFIELD32(15), }; -const struct nla_policy devlink_dl_rate_tc_bws_nl_policy[DEVLINK_ATTR_RATE_TC_BW + 1] = { - [DEVLINK_ATTR_RATE_TC_INDEX] = NLA_POLICY_MAX(NLA_U8, DEVLINK_RATE_TC_INDEX_MAX), - [DEVLINK_ATTR_RATE_TC_BW] = { .type = NLA_U32, }, +const struct nla_policy devlink_dl_rate_tc_bws_nl_policy[DEVLINK_RATE_TC_ATTR_BW + 1] = { + [DEVLINK_RATE_TC_ATTR_INDEX] = NLA_POLICY_MAX(NLA_U8, DEVLINK_RATE_TC_INDEX_MAX), + [DEVLINK_RATE_TC_ATTR_BW] = { .type = NLA_U32, }, }; const struct nla_policy devlink_dl_selftest_id_nl_policy[DEVLINK_ATTR_SELFTEST_ID_FLASH + 1] = { diff --git a/net/devlink/netlink_gen.h b/net/devlink/netlink_gen.h index fb733b5d4ff1..09cc6f264ccf 100644 --- a/net/devlink/netlink_gen.h +++ b/net/devlink/netlink_gen.h @@ -13,7 
+13,7 @@ /* Common nested types */ extern const struct nla_policy devlink_dl_port_function_nl_policy[DEVLINK_PORT_FN_ATTR_CAPS + 1]; -extern const struct nla_policy devlink_dl_rate_tc_bws_nl_policy[DEVLINK_ATTR_RATE_TC_BW + 1]; +extern const struct nla_policy devlink_dl_rate_tc_bws_nl_policy[DEVLINK_RATE_TC_ATTR_BW + 1]; extern const struct nla_policy devlink_dl_selftest_id_nl_policy[DEVLINK_ATTR_SELFTEST_ID_FLASH + 1]; /* Ops table for devlink */ diff --git a/net/devlink/rate.c b/net/devlink/rate.c index d39300a9b3d4..110b3fa8a0b1 100644 --- a/net/devlink/rate.c +++ b/net/devlink/rate.c @@ -90,8 +90,8 @@ static int devlink_rate_put_tc_bws(struct sk_buff *msg, u32 *tc_bw) if (!nla_tc_bw) return -EMSGSIZE; - if (nla_put_u8(msg, DEVLINK_ATTR_RATE_TC_INDEX, i) || - nla_put_u32(msg, DEVLINK_ATTR_RATE_TC_BW, tc_bw[i])) + if (nla_put_u8(msg, DEVLINK_RATE_TC_ATTR_INDEX, i) || + nla_put_u32(msg, DEVLINK_RATE_TC_ATTR_BW, tc_bw[i])) goto nla_put_failure; nla_nest_end(msg, nla_tc_bw); @@ -346,26 +346,26 @@ static int devlink_nl_rate_tc_bw_parse(struct nlattr *parent_nest, u32 *tc_bw, unsigned long *bitmap, struct netlink_ext_ack *extack) { - struct nlattr *tb[DEVLINK_ATTR_MAX + 1]; + struct nlattr *tb[DEVLINK_RATE_TC_ATTR_MAX + 1]; u8 tc_index; int err; - err = nla_parse_nested(tb, DEVLINK_ATTR_MAX, parent_nest, + err = nla_parse_nested(tb, DEVLINK_RATE_TC_ATTR_MAX, parent_nest, devlink_dl_rate_tc_bws_nl_policy, extack); if (err) return err; - if (!tb[DEVLINK_ATTR_RATE_TC_INDEX]) { + if (!tb[DEVLINK_RATE_TC_ATTR_INDEX]) { NL_SET_ERR_ATTR_MISS(extack, parent_nest, - DEVLINK_ATTR_RATE_TC_INDEX); + DEVLINK_RATE_TC_ATTR_INDEX); return -EINVAL; } - tc_index = nla_get_u8(tb[DEVLINK_ATTR_RATE_TC_INDEX]); + tc_index = nla_get_u8(tb[DEVLINK_RATE_TC_ATTR_INDEX]); - if (!tb[DEVLINK_ATTR_RATE_TC_BW]) { + if (!tb[DEVLINK_RATE_TC_ATTR_BW]) { NL_SET_ERR_ATTR_MISS(extack, parent_nest, - DEVLINK_ATTR_RATE_TC_BW); + DEVLINK_RATE_TC_ATTR_BW); return -EINVAL; } @@ -376,7 +376,7 @@ static int 
devlink_nl_rate_tc_bw_parse(struct nlattr *parent_nest, u32 *tc_bw, return -EINVAL; } - tc_bw[tc_index] = nla_get_u32(tb[DEVLINK_ATTR_RATE_TC_BW]); + tc_bw[tc_index] = nla_get_u32(tb[DEVLINK_RATE_TC_ATTR_BW]); return 0; } diff --git a/tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py b/tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py index 820d8a03becc..835c357919a8 100755 --- a/tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py +++ b/tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py @@ -208,14 +208,14 @@ def setup_devlink_rate(cfg): "port-index": port_index, "rate-tx-max": 125000000, "rate-tc-bws": [ - {"rate-tc-index": 0, "rate-tc-bw": 0}, - {"rate-tc-index": 1, "rate-tc-bw": 0}, - {"rate-tc-index": 2, "rate-tc-bw": 0}, - {"rate-tc-index": 3, "rate-tc-bw": 20}, - {"rate-tc-index": 4, "rate-tc-bw": 80}, - {"rate-tc-index": 5, "rate-tc-bw": 0}, - {"rate-tc-index": 6, "rate-tc-bw": 0}, - {"rate-tc-index": 7, "rate-tc-bw": 0}, + {"index": 0, "bw": 0}, + {"index": 1, "bw": 0}, + {"index": 2, "bw": 0}, + {"index": 3, "bw": 20}, + {"index": 4, "bw": 80}, + {"index": 5, "bw": 0}, + {"index": 6, "bw": 0}, + {"index": 7, "bw": 0}, ] }) except NlError as exc: base-commit: 3fc894728fb3a0d9282e81247b68c07468fe2985 -- 2.31.1
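
The renamed attributes change what a ynl caller has to send. As a minimal sketch (not part of the patch), the request body for `rate-set` could be built as below. The helper names `build_rate_tc_bws` and `build_rate_set_request` are hypothetical, and the count of 8 traffic classes is an assumption taken from the indices 0..7 used in the examples above.

```python
import json

# Assumption: 8 traffic classes, matching indices 0..7 in the examples above.
NUM_TCS = 8


def build_rate_tc_bws(shares):
    """Return the 'rate-tc-bws' list using the new 'index'/'bw' names.

    `shares` maps a traffic class index to its bandwidth share; classes
    that are not listed get a share of 0. (Hypothetical helper.)
    """
    for idx in shares:
        if not 0 <= idx < NUM_TCS:
            raise ValueError(f"traffic class index out of range: {idx}")
    return [{"index": i, "bw": shares.get(i, 0)} for i in range(NUM_TCS)]


def build_rate_set_request(bus_name, dev_name, port_index, shares):
    """Assemble the JSON-serialisable body for a 'rate-set' request."""
    return {
        "bus-name": bus_name,
        "dev-name": dev_name,
        "port-index": port_index,
        "rate-tc-bws": build_rate_tc_bws(shares),
    }


request = build_rate_set_request("pci", "0000:08:00.0", 1, {0: 50, 1: 50})
payload = json.dumps(request)
```

The resulting `payload` string is what would be passed to `cli.py --json` in the example usage from the commit message.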

5 months, 2 weeks


[PATCH net v2] selftests: drv-net: wait for iperf client to stop sending

by Nimrod Oren

A few packets may still be sent out during the termination of iperf
processes. These late packets cause failures in rss_ctx.py when they
arrive on queues expected to be empty.

Example failure observed:

  Check failed 2 != 0 traffic on inactive queues (context 1):
    [0, 0, 1, 1, 386385, 397196, 0, 0, 0, 0, ...]

  Check failed 4 != 0 traffic on inactive queues (context 2):
    [0, 0, 0, 0, 2, 2, 247152, 253013, 0, 0, ...]

  Check failed 2 != 0 traffic on inactive queues (context 3):
    [0, 0, 0, 0, 0, 0, 1, 1, 282434, 283070, ...]

To avoid such failures, wait until all client sockets for the requested
port are either closed or in the TIME_WAIT state.

Fixes: 847aa551fa78 ("selftests: drv-net: rss_ctx: factor out send traffic and check")
Signed-off-by: Nimrod Oren <noren(a)nvidia.com>
Reviewed-by: Gal Pressman <gal(a)nvidia.com>
Reviewed-by: Carolina Jubran <cjubran(a)nvidia.com>
---
Changelog:
v2:
- Replace fixed sleep with logic that waits for client sockets to close.
- Update commit title and message to reflect new approach.
v1: https://lore.kernel.org/all/20250629111812.644282-1-noren@nvidia.com/
---
 .../selftests/drivers/net/lib/py/load.py      | 23 +++++++++++++++----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/lib/py/load.py b/tools/testing/selftests/drivers/net/lib/py/load.py
index d9c10613ae67..44151b7b1a24 100644
--- a/tools/testing/selftests/drivers/net/lib/py/load.py
+++ b/tools/testing/selftests/drivers/net/lib/py/load.py
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 
+import re
 import time
 
 from lib.py import ksft_pr, cmd, ip, rand_port, wait_port_listen
@@ -10,12 +11,11 @@ class GenerateTraffic:
 
         self.env = env
 
-        if port is None:
-            port = rand_port()
-        self._iperf_server = cmd(f"iperf3 -s -1 -p {port}", background=True)
-        wait_port_listen(port)
+        self.port = rand_port() if port is None else port
+        self._iperf_server = cmd(f"iperf3 -s -1 -p {self.port}", background=True)
+        wait_port_listen(self.port)
         time.sleep(0.1)
-        self._iperf_client = cmd(f"iperf3 -c {env.addr} -P 16 -p {port} -t 86400",
+        self._iperf_client = cmd(f"iperf3 -c {env.addr} -P 16 -p {self.port} -t 86400",
                                  background=True, host=env.remote)
 
         # Wait for traffic to ramp up
@@ -56,3 +56,16 @@ class GenerateTraffic:
             ksft_pr(">> Server:")
             ksft_pr(self._iperf_server.stdout)
             ksft_pr(self._iperf_server.stderr)
+        self._wait_client_stopped()
+
+    def _wait_client_stopped(self, sleep=0.005, timeout=5):
+        end = time.monotonic() + timeout
+
+        live_port_pattern = re.compile(fr":{self.port:04X} 0[^6] ")
+
+        while time.monotonic() < end:
+            data = cmd("cat /proc/net/tcp*", host=self.env.remote).stdout
+            if not live_port_pattern.search(data):
+                return
+            time.sleep(sleep)
+        raise Exception(f"Waiting for client to stop timed out after {timeout}s")
-- 
2.40.1
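
The regular expression in `_wait_client_stopped()` is compact, so a small standalone sketch may help show what it matches. In `/proc/net/tcp`, ports are printed as four uppercase hex digits and the socket state follows the remote address as a two-digit hex value, where `06` is TIME_WAIT. The pattern therefore looks for a remote port equal to the iperf port followed by any state other than `06`. The function name `has_live_client` and the sample rows are illustrative only; the rows are trimmed to the first few columns.

```python
import re


def has_live_client(proc_net_tcp: str, port: int) -> bool:
    """Return True if any socket talking to `port` is not in TIME_WAIT.

    Mirrors the pattern from the patch: ':<PORT in hex> 0<state digit> ',
    where the state digit may be anything except 6 (TIME_WAIT == 0x06).
    """
    pattern = re.compile(rf":{port:04X} 0[^6] ")
    return bool(pattern.search(proc_net_tcp))


# Illustrative rows only; port 8080 is 0x1F90.
ESTABLISHED_ROW = "   0: 0100007F:C350 0100007F:1F90 01 00000000:00000000\n"
TIME_WAIT_ROW = "   1: 0100007F:C351 0100007F:1F90 06 00000000:00000000\n"

still_sending = has_live_client(ESTABLISHED_ROW + TIME_WAIT_ROW, 8080)
drained = has_live_client(TIME_WAIT_ROW, 8080)
```

Sockets that are fully closed no longer appear in the file at all, which is why "closed or TIME_WAIT" reduces to "no match".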

5 months, 2 weeks


[PATCH net-next v3 4/4] selftests: drv-net: add test for RSS on flow label

by Jakub Kicinski

Add a simple test for checking that RSS on flow label works,
and that it's rejected for IPv4 flows.

 # ./tools/testing/selftests/drivers/net/hw/rss_flow_label.py
 TAP version 13
 1..2
 ok 1 rss_flow_label.test_rss_flow_label
 ok 2 rss_flow_label.test_rss_flow_label_6only
 # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Willem de Bruijn <willemb(a)google.com>
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
v2:
 - check for RPS / RFS
v1: https://lore.kernel.org/20250722014915.3365370-5-kuba@kernel.org

CC: shuah(a)kernel.org
CC: sdf(a)fomichev.me
CC: linux-kselftest(a)vger.kernel.org
---
 .../testing/selftests/drivers/net/hw/Makefile |   1 +
 .../drivers/net/hw/rss_flow_label.py          | 167 ++++++++++++++++++
 2 files changed, 168 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/hw/rss_flow_label.py

diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index fdc97355588c..5159fd34cb33 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -18,6 +18,7 @@ TEST_PROGS = \
 	pp_alloc_fail.py \
 	rss_api.py \
 	rss_ctx.py \
+	rss_flow_label.py \
 	rss_input_xfrm.py \
 	tso.py \
 	xsk_reconfig.py \
diff --git a/tools/testing/selftests/drivers/net/hw/rss_flow_label.py b/tools/testing/selftests/drivers/net/hw/rss_flow_label.py
new file mode 100755
index 000000000000..6fa95fe27c47
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/rss_flow_label.py
@@ -0,0 +1,167 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+Tests for RSS hashing on IPv6 Flow Label.
+""" + +import glob +import os +import socket +from lib.py import CmdExitFailure +from lib.py import ksft_run, ksft_exit, ksft_eq, ksft_ge, ksft_in, \ + ksft_not_in, ksft_raises, KsftSkipEx +from lib.py import bkg, cmd, defer, fd_read_timeout, rand_port +from lib.py import NetDrvEpEnv + + +def _check_system(cfg): + if not hasattr(socket, "SO_INCOMING_CPU"): + raise KsftSkipEx("socket.SO_INCOMING_CPU was added in Python 3.11") + + qcnt = len(glob.glob(f"/sys/class/net/{cfg.ifname}/queues/rx-*")) + if qcnt < 2: + raise KsftSkipEx(f"Local has only {qcnt} queues") + + for f in [f"/sys/class/net/{cfg.ifname}/queues/rx-0/rps_flow_cnt", + f"/sys/class/net/{cfg.ifname}/queues/rx-0/rps_cpus"]: + try: + with open(f, 'r') as fp: + setting = fp.read().strip() + # CPU mask will be zeros and commas + if setting.replace("0", "").replace(",", ""): + raise KsftSkipEx(f"RPS/RFS is configured: {f}: {setting}") + except FileNotFoundError: + pass + + # 1 is the default, if someone changed it we probably shouldn"t mess with it + af = cmd("cat /proc/sys/net/ipv6/auto_flowlabels", host=cfg.remote).stdout + if af.strip() != "1": + raise KsftSkipEx("Remote does not have auto_flowlabels enabled") + + +def _ethtool_get_cfg(cfg, fl_type): + descr = cmd(f"ethtool -n {cfg.ifname} rx-flow-hash {fl_type}").stdout + + converter = { + "IP SA": "s", + "IP DA": "d", + "L3 proto": "t", + "L4 bytes 0 & 1 [TCP/UDP src port]": "f", + "L4 bytes 2 & 3 [TCP/UDP dst port]": "n", + "IPv6 Flow Label": "l", + } + + ret = "" + for line in descr.split("\n")[1:-2]: + # if this raises we probably need to add more keys to converter above + ret += converter[line] + return ret + + +def _traffic(cfg, one_sock, one_cpu): + local_port = rand_port(socket.SOCK_DGRAM) + remote_port = rand_port(socket.SOCK_DGRAM) + + sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) + sock.bind(("", local_port)) + sock.connect((cfg.remote_addr_v["6"], 0)) + if one_sock: + send = f"exec 5<>/dev/udp/{cfg.addr_v['6']}/{local_port}; " \ + 
"for i in `seq 20`; do echo a >&5; sleep 0.02; done; exec 5>&-" + else: + send = "for i in `seq 20`; do echo a | socat -t0.02 - UDP6:" \ + f"[{cfg.addr_v['6']}]:{local_port},sourceport={remote_port}; done" + + cpus = set() + with bkg(send, shell=True, host=cfg.remote, exit_wait=True): + for _ in range(20): + fd_read_timeout(sock.fileno(), 1) + cpu = sock.getsockopt(socket.SOL_SOCKET, socket.SO_INCOMING_CPU) + cpus.add(cpu) + + if one_cpu: + ksft_eq(len(cpus), 1, + f"{one_sock=} - expected one CPU, got traffic on: {cpus=}") + else: + ksft_ge(len(cpus), 2, + f"{one_sock=} - expected many CPUs, got traffic on: {cpus=}") + + +def test_rss_flow_label(cfg): + """ + Test hashing on IPv6 flow label. Send traffic over a single socket + and over multiple sockets. Depend on the remote having auto-label + enabled so that it randomizes the label per socket. + """ + + cfg.require_ipver("6") + cfg.require_cmd("socat", remote=True) + _check_system(cfg) + + # Enable flow label hashing for UDP6 + initial = _ethtool_get_cfg(cfg, "udp6") + no_lbl = initial.replace("l", "") + if "l" not in initial: + try: + cmd(f"ethtool -N {cfg.ifname} rx-flow-hash udp6 l{no_lbl}") + except CmdExitFailure as exc: + raise KsftSkipEx("Device doesn't support Flow Label for UDP6") from exc + + defer(cmd, f"ethtool -N {cfg.ifname} rx-flow-hash udp6 {initial}") + + _traffic(cfg, one_sock=True, one_cpu=True) + _traffic(cfg, one_sock=False, one_cpu=False) + + # Disable it, we should see no hashing (reset was already defer()ed) + cmd(f"ethtool -N {cfg.ifname} rx-flow-hash udp6 {no_lbl}") + + _traffic(cfg, one_sock=False, one_cpu=True) + + +def _check_v4_flow_types(cfg): + for fl_type in ["tcp4", "udp4", "ah4", "esp4", "sctp4"]: + try: + cur = cmd(f"ethtool -n {cfg.ifname} rx-flow-hash {fl_type}").stdout + ksft_not_in("Flow Label", cur, + comment=f"{fl_type=} has Flow Label:" + cur) + except CmdExitFailure: + # Probably does not support this flow type + pass + + +def test_rss_flow_label_6only(cfg): + """ + Test 
interactions with IPv4 flow types. It should not be possible to set + IPv6 Flow Label hashing for an IPv4 flow type. The Flow Label should also + not appear in the IPv4 "current config". + """ + + with ksft_raises(CmdExitFailure) as cm: + cmd(f"ethtool -N {cfg.ifname} rx-flow-hash tcp4 sdfnl") + ksft_in("Invalid argument", cm.exception.cmd.stderr) + + _check_v4_flow_types(cfg) + + # Try to enable Flow Labels and check again, in case it leaks thru + initial = _ethtool_get_cfg(cfg, "udp6") + changed = initial.replace("l", "") if "l" in initial else initial + "l" + + cmd(f"ethtool -N {cfg.ifname} rx-flow-hash udp6 {changed}") + restore = defer(cmd, f"ethtool -N {cfg.ifname} rx-flow-hash udp6 {initial}") + + _check_v4_flow_types(cfg) + restore.exec() + _check_v4_flow_types(cfg) + + +def main() -> None: + with NetDrvEpEnv(__file__, nsim_test=False) as cfg: + ksft_run([test_rss_flow_label, + test_rss_flow_label_6only], + args=(cfg, )) + ksft_exit() + + +if __name__ == "__main__": + main() -- 2.50.1
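
The RPS/RFS check in `_check_system()` relies on a small string trick: a disabled setting consists only of zeros and commas, so stripping both characters leaves an empty string. A minimal standalone sketch of that test is shown here; the function name `rps_is_configured` is hypothetical and the sample values are illustrative, not read from a real system.

```python
def rps_is_configured(setting: str) -> bool:
    """Return True if an rps_cpus / rps_flow_cnt value is non-zero.

    A CPU mask such as '00000000,00000000' or a plain '0' means the
    feature is off; anything left after removing zeros and commas
    means some bit or count is set.
    """
    return bool(setting.strip().replace("0", "").replace(",", ""))


mask_off = rps_is_configured("00000000,00000000\n")
mask_on = rps_is_configured("00000000,0000000f\n")
flow_cnt_on = rps_is_configured("4096\n")
```

When either file reports a configured value, the test skips, since software steering could move packets to CPUs other than the one chosen by the hardware hash.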

5 months, 2 weeks


[PATCH RFC 0/4] procfs: make reference pidns more user-visible

by Aleksa Sarai

Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs mount is
auto-selected based on the mounting process's active pidns, and the
pidns itself is basically hidden once the mount has been constructed).
This has historically meant that userspace was required to do some
special dances in order to configure the pidns of a procfs mount as
desired. Examples include:

 * In order to bypass the mnt_too_revealing() check, Kubernetes creates
   a procfs mount from an empty pidns so that user namespaced containers
   can be nested (without this, the nested containers would fail to
   mount procfs). But this requires forking off a helper process because
   you cannot just one-shot this using mount(2).

 * Container runtimes in general need to fork into a container before
   configuring its mounts, which can lead to security issues in the case
   of shared-pidns containers (a privileged process in the pidns can
   interact with your container runtime process). While
   SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
   strict need for this due to a minor uAPI wart is kind of unfortunate.

Things would be much easier if there was a way for userspace to just
specify the pidns they want. Patch 1 implements a new "pidns" argument
which can be set using fsconfig(2):

    fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
    fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or classic mount(2) / mount(8):

    // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
    mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

The initial security model I have in this RFC is to be as conservative
as possible and just mirror the security model for setns(2) -- which
means that you can only set pidns=... to pid namespaces that your
current pid namespace is a direct ancestor of. This fulfils the
requirements of container runtimes, but I suspect that this may be too
strict for some usecases.

The pidns argument is not displayed in mountinfo -- it's not clear to me
what value it would make sense to show (maybe we could just use ns_dname
to provide an identifier for the namespace, but this number would be
fairly useless to userspace). I'm open to suggestions.

In addition, being able to figure out what pid namespace is being used
by a procfs mount is quite useful when you have an administrative
process (such as a container runtime) which wants to figure out the
correct way of mapping PIDs between its own namespace and the namespace
for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
alternative ways to do this, but they all rely on ancillary information
that third-party libraries and tools do not necessarily have access to.

To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
can be used to get a reference to the pidns that a procfs is using.

It's not quite clear what is the correct security model for this API,
but the current approach I've taken is to:

 * Make the ioctl only valid on the root (meaning that a process without
   access to the procfs root -- such as only having an fd to a procfs
   file or some open_tree(2)-like subset -- cannot use this API).

 * Require that the process requesting either has access to
   /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
   pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
   access to it and can join it if they had a handle), or is in a pidns
   that is a direct ancestor of the target pidns (i.e. all of the pids
   are already visible in the procfs for the current process's pidns).

The security model for this is a little loose, as it seems to me that
all of the cases mentioned are valid cases to allow access, but I'm open
to suggestions for whether we need to make this stricter or looser.

Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Aleksa Sarai (4):
      pidns: move is-ancestor logic to helper
      procfs: add pidns= mount option
      procfs: add PROCFS_GET_PID_NAMESPACE ioctl
      selftests/proc: add tests for new pidns APIs

 Documentation/filesystems/proc.rst        |  10 ++
 fs/proc/root.c                            | 132 +++++++++++++-
 include/linux/pid_namespace.h             |   9 +
 include/uapi/linux/fs.h                   |   3 +
 kernel/pid_namespace.c                    |  21 ++-
 tools/testing/selftests/proc/.gitignore   |   1 +
 tools/testing/selftests/proc/Makefile     |   1 +
 tools/testing/selftests/proc/proc-pidns.c | 286 ++++++++++++++++++++++++++++++
 8 files changed, 448 insertions(+), 15 deletions(-)
---
base-commit: 4c838c7672c39ec6ec48456c6ce22d14a68f4cda
change-id: 20250717-procfs-pidns-api-8ed1583431f0

Best regards,
-- 
Aleksa Sarai <cyphar(a)cyphar.com>
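
The ancestor rule that the cover letter borrows from setns(2) can be illustrated with a toy model. This is only a sketch of the rule as described above, not the kernel helper from patch 1: the `PidNs` class, `is_ancestor()` and `may_use_pidns()` are all hypothetical names, and treating a namespace as its own ancestor is an assumption that follows how setns(2) permits joining the caller's current pid namespace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class PidNs:
    """Toy stand-in for a pid namespace: a name plus a parent pointer."""
    name: str
    parent: Optional["PidNs"] = None


def is_ancestor(candidate: PidNs, target: PidNs) -> bool:
    """True if `candidate` is `target` itself or one of its ancestors."""
    ns = target
    while ns is not None:
        if ns is candidate:
            return True
        ns = ns.parent
    return False


def may_use_pidns(current: PidNs, requested: PidNs) -> bool:
    """Model of the proposed pidns= check: only descendants are allowed."""
    return is_ancestor(current, requested)


init_ns = PidNs("init")
container = PidNs("container", parent=init_ns)
nested = PidNs("nested", parent=container)
sibling = PidNs("sibling", parent=init_ns)
```

In this model a runtime living in `container` may mount procfs for `nested`, but not for `sibling` or for its own parent `init_ns`.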

5 months, 2 weeks


[PATCH v2 0/5] KVM: Improve VMware guest support

by Zack Rusin

This is the second version of a series that lets us run VMware
Workstation on Linux on top of KVM.

The most significant change in this series is the introduction of
CONFIG_KVM_VMWARE which is, in general, a nice cleanup for various bits
of VMware compatibility code that have been scattered around KVM.
(first patch)

The rest of the series builds upon the VMware platform to implement
features that are needed to run VMware guests without any
modifications on top of KVM:
- ability to turn on the VMware backdoor at runtime on a per-vm basis
  (used to be a kernel boot argument only)
- support for VMware hypercalls - VMware products have a huge
  collection of hypercalls, all of which are handled in userspace,
- support for handling legacy VMware backdoor in L0 in nested configs
  - in cases where we have WS running a Windows VBS guest, the L0 would
  be KVM, L1 Hyper-V so by default VMware Tools backdoor calls end up in
  Hyper-V which cannot handle them, so introduce a cap to let L0 handle
  those.

The final change in the series is a kselftest of the VMware hypercall
functionality.

Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Sean Christopherson <seanjc(a)google.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: x86(a)kernel.org
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Zack Rusin <zack.rusin(a)broadcom.com>
Cc: Doug Covelli <doug.covelli(a)broadcom.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: Arnaldo Carvalho de Melo <acme(a)redhat.com>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Joel Stanley <joel(a)jms.id.au>
Cc: Isaku Yamahata <isaku.yamahata(a)intel.com>
Cc: kvm(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org

Zack Rusin (5):
  KVM: x86: Centralize KVM's VMware code
  KVM: x86: Allow enabling of the vmware backdoor via a cap
  KVM: x86: Add support for VMware guest specific hypercalls
  KVM: x86: Add support for legacy VMware backdoors in nested setups
  KVM: selftests: x86: Add a test for KVM_CAP_X86_VMWARE_HYPERCALL

 Documentation/virt/kvm/api.rst                |  86 +++++++-
 MAINTAINERS                                   |   9 +
 arch/x86/include/asm/kvm_host.h               |  13 ++
 arch/x86/kvm/Kconfig                          |  16 ++
 arch/x86/kvm/Makefile                         |   1 +
 arch/x86/kvm/emulate.c                        |  11 +-
 arch/x86/kvm/kvm_vmware.c                     |  85 ++++++++
 arch/x86/kvm/kvm_vmware.h                     | 189 ++++++++++++++++++
 arch/x86/kvm/pmu.c                            |  39 +---
 arch/x86/kvm/pmu.h                            |   4 -
 arch/x86/kvm/svm/nested.c                     |   6 +
 arch/x86/kvm/svm/svm.c                        |  10 +-
 arch/x86/kvm/vmx/nested.c                     |   6 +
 arch/x86/kvm/vmx/vmx.c                        |   5 +-
 arch/x86/kvm/x86.c                            |  74 +++----
 arch/x86/kvm/x86.h                            |   2 -
 include/uapi/linux/kvm.h                      |  27 +++
 tools/include/uapi/linux/kvm.h                |   3 +
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../selftests/kvm/x86/vmware_hypercall_test.c | 121 +++++++++++
 20 files changed, 614 insertions(+), 94 deletions(-)
 create mode 100644 arch/x86/kvm/kvm_vmware.c
 create mode 100644 arch/x86/kvm/kvm_vmware.h
 create mode 100644 tools/testing/selftests/kvm/x86/vmware_hypercall_test.c

-- 
2.48.1

5 months, 2 weeks


[PATCH v2] rtc: Rename lib_test to test_rtc_lib

by Geert Uytterhoeven

When compiling the RTC library functions test as a module, the module
has the non-descriptive name "lib_test.ko". Fix this by renaming it to
"test_rtc_lib.ko".

Signed-off-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
---
v2:
  - s/rtc_lib_test/test_rtc_lib/.
---
 drivers/rtc/Makefile                       | 2 +-
 drivers/rtc/{lib_test.c => test_rtc_lib.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename drivers/rtc/{lib_test.c => test_rtc_lib.c} (100%)

diff --git a/drivers/rtc/Makefile b/drivers/rtc/Makefile
index 4619aa2ac4697591..789bddfea99d8fcd 100644
--- a/drivers/rtc/Makefile
+++ b/drivers/rtc/Makefile
@@ -15,7 +15,7 @@ rtc-core-$(CONFIG_RTC_INTF_DEV) += dev.o
 rtc-core-$(CONFIG_RTC_INTF_PROC) += proc.o
 rtc-core-$(CONFIG_RTC_INTF_SYSFS) += sysfs.o
 
-obj-$(CONFIG_RTC_LIB_KUNIT_TEST) += lib_test.o
+obj-$(CONFIG_RTC_LIB_KUNIT_TEST) += test_rtc_lib.o
 
 # Keep the list ordered.
 
diff --git a/drivers/rtc/lib_test.c b/drivers/rtc/test_rtc_lib.c
similarity index 100%
rename from drivers/rtc/lib_test.c
rename to drivers/rtc/test_rtc_lib.c
-- 
2.43.0

5 months, 2 weeks


[PATCH][next] tools/testing/selftests: Fix spelling mistake "unnmap" -> "unmap"

by Colin Ian King

There is a spelling mistake in ksft_test_result_fail messages. Fix them.

Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
 tools/testing/selftests/mm/mremap_test.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index fccf9e797a0c..774cdba102fc 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -525,10 +525,10 @@ static void mremap_move_multiple_vmas(unsigned int pattern_seed,
 out:
 	if (success)
 		ksft_test_result_pass("%s%s\n", test_name,
-				      dont_unmap ? " [dontunnmap]" : "");
+				      dont_unmap ? " [dontunmap]" : "");
 	else
 		ksft_test_result_fail("%s%s\n", test_name,
-				      dont_unmap ? " [dontunnmap]" : "");
+				      dont_unmap ? " [dontunmap]" : "");
 }
 
 static void mremap_shrink_multiple_vmas(unsigned long page_size,
@@ -727,10 +727,10 @@ static void mremap_move_multiple_vmas_split(unsigned int pattern_seed,
 out:
 	if (success)
 		ksft_test_result_pass("%s%s\n", test_name,
-				      dont_unmap ? " [dontunnmap]" : "");
+				      dont_unmap ? " [dontunmap]" : "");
 	else
 		ksft_test_result_fail("%s%s\n", test_name,
-				      dont_unmap ? " [dontunnmap]" : "");
+				      dont_unmap ? " [dontunmap]" : "");
 }
 
 /* Returns the time taken for the remap on success else returns -1. */
-- 
2.50.0

5 months, 2 weeks


[PATCH/RFC] kunit/rtc: Add real support for very slow tests

by Geert Uytterhoeven

When running rtc_lib_test ("lib_test" before my "[PATCH] rtc: Rename
lib_test to rtc_lib_test") on m68k/ARAnyM:

    KTAP version 1
    1..1
        KTAP version 1
        # Subtest: rtc_lib_test_cases
        # module: rtc_lib_test
        1..2
        # rtc_time64_to_tm_test_date_range_1000: Test should be marked slow (runtime: 3.222371420s)
        ok 1 rtc_time64_to_tm_test_date_range_1000
        # rtc_time64_to_tm_test_date_range_160000: try timed out
        # rtc_time64_to_tm_test_date_range_160000: test case timed out
        # rtc_time64_to_tm_test_date_range_160000.speed: slow
        not ok 2 rtc_time64_to_tm_test_date_range_160000
    # rtc_lib_test_cases: pass:1 fail:1 skip:0 total:2
    # Totals: pass:1 fail:1 skip:0 total:2
    not ok 1 rtc_lib_test_cases

Commit 02c2d0c2a84172c3 ("kunit: Add speed attribute") added the notion
of "very slow" tests, but this is further unused and unhandled.

Hence:
  1. Introduce KUNIT_CASE_VERY_SLOW(),
  2. Increase timeout by ten; ideally this should only be done for very
     slow tests, but I couldn't find how to access
     kunit_case.attr.case from kunit_try_catch_run(),
  3. Mark rtc_time64_to_tm_test_date_range_1000 slow,
  4. Mark rtc_time64_to_tm_test_date_range_160000 very slow.

Afterwards:

    KTAP version 1
    1..1
        KTAP version 1
        # Subtest: rtc_lib_test_cases
        # module: rtc_lib_test
        1..2
        # rtc_time64_to_tm_test_date_range_1000.speed: slow
        ok 1 rtc_time64_to_tm_test_date_range_1000
        # rtc_time64_to_tm_test_date_range_160000.speed: very_slow
        ok 2 rtc_time64_to_tm_test_date_range_160000
    # rtc_lib_test_cases: pass:2 fail:0 skip:0 total:2
    # Totals: pass:2 fail:0 skip:0 total:2
    ok 1 rtc_lib_test_cases

Signed-off-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
---
 drivers/rtc/rtc_lib_test.c |  4 ++--
 include/kunit/test.h       | 11 +++++++++++
 lib/kunit/try-catch.c      |  3 ++-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/rtc/rtc_lib_test.c b/drivers/rtc/rtc_lib_test.c
index c30c759662e39b48..fd3210e39d37dbc6 100644
--- a/drivers/rtc/rtc_lib_test.c
+++ b/drivers/rtc/rtc_lib_test.c
@@ -85,8 +85,8 @@ static void rtc_time64_to_tm_test_date_range_1000(struct kunit *test)
 }
 
 static struct kunit_case rtc_lib_test_cases[] = {
-	KUNIT_CASE(rtc_time64_to_tm_test_date_range_1000),
-	KUNIT_CASE_SLOW(rtc_time64_to_tm_test_date_range_160000),
+	KUNIT_CASE_SLOW(rtc_time64_to_tm_test_date_range_1000),
+	KUNIT_CASE_VERY_SLOW(rtc_time64_to_tm_test_date_range_160000),
 	{}
 };
 
diff --git a/include/kunit/test.h b/include/kunit/test.h
index 9b773406e01f3c43..4e3c1cae5b41466e 100644
--- a/include/kunit/test.h
+++ b/include/kunit/test.h
@@ -183,6 +183,17 @@ static inline char *kunit_status_to_ok_not_ok(enum kunit_status status)
 		{ .run_case = test_name, .name = #test_name,	\
 		  .attr.speed = KUNIT_SPEED_SLOW, .module_name = KBUILD_MODNAME}
 
+/**
+ * KUNIT_CASE_VERY_SLOW - A helper for creating a &struct kunit_case
+ * with the very slow attribute
+ *
+ * @test_name: a reference to a test case function.
+ */
+
+#define KUNIT_CASE_VERY_SLOW(test_name)	\
+		{ .run_case = test_name, .name = #test_name,	\
+		  .attr.speed = KUNIT_SPEED_VERY_SLOW, .module_name = KBUILD_MODNAME}
+
 /**
  * KUNIT_CASE_PARAM - A helper for creation a parameterized &struct kunit_case
  *
diff --git a/lib/kunit/try-catch.c b/lib/kunit/try-catch.c
index 6bbe0025b0790bd2..92099c67bb21d0a4 100644
--- a/lib/kunit/try-catch.c
+++ b/lib/kunit/try-catch.c
@@ -56,7 +56,8 @@ static unsigned long kunit_test_timeout(void)
 	 * If tests timeout due to exceeding sysctl_hung_task_timeout_secs,
 	 * the task will be killed and an oops generated.
 	 */
-	return 300 * msecs_to_jiffies(MSEC_PER_SEC); /* 5 min */
+	// FIXME times ten for KUNIT_SPEED_VERY_SLOW?
+	return 10 * 300 * msecs_to_jiffies(MSEC_PER_SEC); /* 5 min */
 }
 
 void kunit_try_catch_run(struct kunit_try_catch *try_catch, void *context)
-- 
2.43.0
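
As posted, the RFC multiplies the timeout for every test: 10 * 300 seconds is 3000 seconds, or 50 minutes, even though the trailing comment in the diff still reads "5 min". The per-speed behaviour that the commit message calls ideal can be modelled in a few lines. This is only an arithmetic sketch, not KUnit code: `timeout_secs()` is a hypothetical function, and the speed strings are the ones that appear in the KTAP output above.

```python
BASE_TIMEOUT_SECS = 300   # 5 minutes, the value before the patch
VERY_SLOW_FACTOR = 10     # multiplier proposed in the RFC


def timeout_secs(speed: str) -> int:
    """Return the timeout for a test with the given speed attribute.

    Models the behaviour described as ideal in the commit message:
    only very slow tests get the longer timeout.
    """
    if speed == "very_slow":
        return VERY_SLOW_FACTOR * BASE_TIMEOUT_SECS
    return BASE_TIMEOUT_SECS


very_slow_minutes = timeout_secs("very_slow") / 60
slow_minutes = timeout_secs("slow") / 60
```

With this rule, `rtc_time64_to_tm_test_date_range_160000` would get 50 minutes while tests marked slow or left unmarked would keep the original 5 minutes.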

5 months, 2 weeks


[PATCH net-next v2 4/4] selftests: drv-net: add test for RSS on flow label

by Jakub Kicinski

Add a simple test for checking that RSS on flow label works,
and that it's rejected for IPv4 flows.

 # ./tools/testing/selftests/drivers/net/hw/rss_flow_label.py
 TAP version 13
 1..2
 ok 1 rss_flow_label.test_rss_flow_label
 ok 2 rss_flow_label.test_rss_flow_label_6only
 # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: sdf(a)fomichev.me
CC: linux-kselftest(a)vger.kernel.org
---
 .../testing/selftests/drivers/net/hw/Makefile |   1 +
 .../drivers/net/hw/rss_flow_label.py          | 151 ++++++++++++++++++
 2 files changed, 152 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/hw/rss_flow_label.py

diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index fdc97355588c..5159fd34cb33 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -18,6 +18,7 @@ TEST_PROGS = \
 	pp_alloc_fail.py \
 	rss_api.py \
 	rss_ctx.py \
+	rss_flow_label.py \
 	rss_input_xfrm.py \
 	tso.py \
 	xsk_reconfig.py \
diff --git a/tools/testing/selftests/drivers/net/hw/rss_flow_label.py b/tools/testing/selftests/drivers/net/hw/rss_flow_label.py
new file mode 100755
index 000000000000..e471e13160ae
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/rss_flow_label.py
@@ -0,0 +1,151 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+Tests for RSS hashing on IPv6 Flow Label.
+""" + +import glob +import socket +from lib.py import CmdExitFailure +from lib.py import ksft_run, ksft_exit, ksft_eq, ksft_ge, ksft_in, \ + ksft_not_in, ksft_raises, KsftSkipEx +from lib.py import bkg, cmd, defer, fd_read_timeout, rand_port +from lib.py import NetDrvEpEnv + + +def _ethtool_get_cfg(cfg, fl_type): + descr = cmd(f"ethtool -n {cfg.ifname} rx-flow-hash {fl_type}").stdout + + converter = { + "IP SA": "s", + "IP DA": "d", + "L3 proto": "t", + "L4 bytes 0 & 1 [TCP/UDP src port]": "f", + "L4 bytes 2 & 3 [TCP/UDP dst port]": "n", + "IPv6 Flow Label": "l", + } + + ret = "" + for line in descr.split("\n")[1:-2]: + # if this raises we probably need to add more keys to converter above + ret += converter[line] + return ret + + +def _traffic(cfg, one_sock, one_cpu): + local_port = rand_port(socket.SOCK_DGRAM) + remote_port = rand_port(socket.SOCK_DGRAM) + + sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) + sock.bind(("", local_port)) + sock.connect((cfg.remote_addr_v["6"], 0)) + if one_sock: + send = f"exec 5<>/dev/udp/{cfg.addr_v['6']}/{local_port}; " \ + "for i in `seq 20`; do echo a >&5; sleep 0.02; done; exec 5>&-" + else: + send = "for i in `seq 20`; do echo a | socat -t0.02 - UDP6:" \ + f"[{cfg.addr_v['6']}]:{local_port},sourceport={remote_port}; done" + + cpus = set() + with bkg(send, shell=True, host=cfg.remote, exit_wait=True): + for _ in range(20): + fd_read_timeout(sock.fileno(), 1) + cpu = sock.getsockopt(socket.SOL_SOCKET, socket.SO_INCOMING_CPU) + cpus.add(cpu) + + if one_cpu: + ksft_eq(len(cpus), 1, + f"{one_sock=} - expected one CPU, got traffic on: {cpus=}") + else: + ksft_ge(len(cpus), 2, + f"{one_sock=} - expected many CPUs, got traffic on: {cpus=}") + + +def test_rss_flow_label(cfg): + """ + Test hashing on IPv6 flow label. Send traffic over a single socket + and over multiple sockets. Depend on the remote having auto-label + enabled so that it randomizes the label per socket. 
+ """ + + cfg.require_ipver("6") + cfg.require_cmd("socat", remote=True) + if not hasattr(socket, "SO_INCOMING_CPU"): + raise KsftSkipEx("socket.SO_INCOMING_CPU was added in Python 3.11") + + # 1 is the default, if someone changed it we probably shouldn't mess with it + af = cmd("cat /proc/sys/net/ipv6/auto_flowlabels", host=cfg.remote).stdout + if af.strip() != "1": + raise KsftSkipEx("Remote does not have auto_flowlabels enabled") + + qcnt = len(glob.glob(f"/sys/class/net/{cfg.ifname}/queues/rx-*")) + if qcnt < 2: + raise KsftSkipEx(f"Local has only {qcnt} queues") + + # Enable flow label hashing for UDP6 + initial = _ethtool_get_cfg(cfg, "udp6") + no_lbl = initial.replace("l", "") + if "l" not in initial: + try: + cmd(f"ethtool -N {cfg.ifname} rx-flow-hash udp6 l{no_lbl}") + except CmdExitFailure as exc: + raise KsftSkipEx("Device doesn't support Flow Label for UDP6") from exc + + defer(cmd, f"ethtool -N {cfg.ifname} rx-flow-hash udp6 {initial}") + + _traffic(cfg, one_sock=True, one_cpu=True) + _traffic(cfg, one_sock=False, one_cpu=False) + + # Disable it, we should see no hashing (reset was already defer()ed) + cmd(f"ethtool -N {cfg.ifname} rx-flow-hash udp6 {no_lbl}") + + _traffic(cfg, one_sock=False, one_cpu=True) + + +def _check_v4_flow_types(cfg): + for fl_type in ["tcp4", "udp4", "ah4", "esp4", "sctp4"]: + try: + cur = cmd(f"ethtool -n {cfg.ifname} rx-flow-hash {fl_type}").stdout + ksft_not_in("Flow Label", cur, + comment=f"{fl_type=} has Flow Label:" + cur) + except CmdExitFailure: + # Probably does not support this flow type + pass + + +def test_rss_flow_label_6only(cfg): + """ + Test interactions with IPv4 flow types. It should not be possible to set + IPv6 Flow Label hashing for an IPv4 flow type. The Flow Label should also + not appear in the IPv4 "current config". 
+ """ + + with ksft_raises(CmdExitFailure) as cm: + cmd(f"ethtool -N {cfg.ifname} rx-flow-hash tcp4 sdfnl") + ksft_in("Invalid argument", cm.exception.cmd.stderr) + + _check_v4_flow_types(cfg) + + # Try to enable Flow Labels and check again, in case it leaks thru + initial = _ethtool_get_cfg(cfg, "udp6") + changed = initial.replace("l", "") if "l" in initial else initial + "l" + + cmd(f"ethtool -N {cfg.ifname} rx-flow-hash udp6 {changed}") + restore = defer(cmd, f"ethtool -N {cfg.ifname} rx-flow-hash udp6 {initial}") + + _check_v4_flow_types(cfg) + restore.exec() + _check_v4_flow_types(cfg) + + +def main() -> None: + with NetDrvEpEnv(__file__, nsim_test=False) as cfg: + ksft_run([test_rss_flow_label, + test_rss_flow_label_6only], + args=(cfg, )) + ksft_exit() + + +if __name__ == "__main__": + main() -- 2.50.1
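The _ethtool_get_cfg() helper in the patch above turns the human-readable ethtool listing into the single-letter field string that "ethtool -N ... rx-flow-hash" accepts. The following standalone sketch shows that conversion without touching a device. The converter table is copied from the patch; the SAMPLE text is a hypothetical stand-in for what "ethtool -n <dev> rx-flow-hash udp6" might print (a header line, one field per line, then a trailing blank line), and the function name flow_hash_letters is made up for illustration.

```python
# Standalone sketch of the letter conversion used by the test.
# SAMPLE is hypothetical output, not captured from a real device.
CONVERTER = {
    "IP SA": "s",
    "IP DA": "d",
    "L3 proto": "t",
    "L4 bytes 0 & 1 [TCP/UDP src port]": "f",
    "L4 bytes 2 & 3 [TCP/UDP dst port]": "n",
    "IPv6 Flow Label": "l",
}

SAMPLE = (
    "UDP over IPV6 flows use these fields for computing Hash flow key:\n"
    "IP SA\n"
    "IP DA\n"
    "L4 bytes 0 & 1 [TCP/UDP src port]\n"
    "L4 bytes 2 & 3 [TCP/UDP dst port]\n"
    "IPv6 Flow Label\n"
    "\n"
)


def flow_hash_letters(descr: str) -> str:
    """Map each listed hash field to its ethtool letter.

    The slice drops the header line and the two empty strings that
    result from the trailing blank line, as the patch does.
    """
    fields = descr.split("\n")[1:-2]
    # A KeyError here means the table needs another entry.
    return "".join(CONVERTER[line] for line in fields)


if __name__ == "__main__":
    initial = flow_hash_letters(SAMPLE)
    no_lbl = initial.replace("l", "")
    print("current fields:", initial)
    print("without flow label:", no_lbl)
    print("with flow label enabled:", "l" + no_lbl)
```

Keeping the original string around, as the test does with `initial`, is what lets the deferred restore command put the device back into its starting configuration.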

5 months, 2 weeks


[PATCH v6] selftests/mm: add process_madvise() tests

by wang lian

Add tests for process_madvise(), focusing on verifying behavior under various conditions including valid usage and error cases. Signed-off-by: wang lian <lianux.mm(a)gmail.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> Suggested-by: David Hildenbrand <david(a)redhat.com> Suggested-by: Zi Yan <ziy(a)nvidia.com> Suggested-by: Mark Brown <broonie(a)kernel.org> Acked-by: SeongJae Park <sj(a)kernel.org> --- Changelog v6: - Refactor child process and pidfd management to use the kselftest fixture's setup and teardown mechanism. This ensures that child processes are reliably terminated and file descriptors are closed, even when a test is aborted by an ASSERT or SKIP macro. This resolves the issue where a failed assertion could lead to a leaked child process. Changelog v5: https://lore.kernel.org/lkml/20250714122533.3135-1-lianux.mm@gmail.com/ - Refactor the remote_collapse test to concentrate on its primary goal: confirming the successful remote invocation of process_madvise() on a child process. - Split the validation logic for invalid pidfds out of the remote test and into two new tests (`exited_process_pidfd` and `bad_pidfd`). - Based on the mm-new branch to ensure clean application. Changelog v4: https://lore.kernel.org/lkml/20250710112249.58722-1-lianux.mm@gmail.com/ - Refine resource cleanup logic in test teardown to be more robust. - Improve remote_collapse test to correctly handle different THP (Transparent Huge Page) policies ('always', 'madvise', 'never'), including handling race conditions with khugepaged. - Resolve build errors. Changelog v3: https://lore.kernel.org/lkml/20250703044326.65061-1-lianux.mm@gmail.com/ - Rebased onto the latest mm-stable branch to ensure clean application. - Refactor common signal handling logic into vm_util to reduce code duplication. - Improve test robustness and diagnostics based on community feedback. - Address minor code style and script corrections. 
Changelog v2: https://lore.kernel.org/lkml/20250630140957.4000-1-lianux.mm@gmail.com/ - Drop MADV_DONTNEED tests based on feedback. - Focus solely on process_madvise() syscall. - Improve error handling and structure. - Add future-proof flag test. - Style and comment cleanups. -V1: https://lore.kernel.org/lkml/20250621133003.4733-1-lianux.mm@gmail.com/ tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/process_madv.c | 302 ++++++++++++++++++++++ tools/testing/selftests/mm/run_vmtests.sh | 5 + 4 files changed, 309 insertions(+) create mode 100644 tools/testing/selftests/mm/process_madv.c diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index f2dafa0b700b..e7b23a8a05fe --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -21,6 +21,7 @@ on-fault-limit transhuge-stress pagemap_ioctl pfnmap +process_madv *.tmp* protection_keys protection_keys_32 diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index ae6f994d3add..d13b3cef2a2b 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -85,6 +85,7 @@ TEST_GEN_FILES += mseal_test TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += pagemap_ioctl TEST_GEN_FILES += pfnmap +TEST_GEN_FILES += process_madv TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += uffd-stress diff --git a/tools/testing/selftests/mm/process_madv.c b/tools/testing/selftests/mm/process_madv.c new file mode 100644 index 000000000000..8a83eac3bfab --- /dev/null +++ b/tools/testing/selftests/mm/process_madv.c @@ -0,0 +1,302 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#define _GNU_SOURCE +#include "../kselftest_harness.h" +#include <errno.h> +#include <setjmp.h> +#include <signal.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <linux/mman.h> +#include 
<sys/syscall.h> +#include <unistd.h> +#include <sched.h> +#include "vm_util.h" + +#include "../pidfd/pidfd.h" + +FIXTURE(process_madvise) +{ + unsigned long page_size; + pid_t child_pid; + int remote_pidfd; + int pidfd; +}; + +FIXTURE_SETUP(process_madvise) +{ + self->page_size = (unsigned long)sysconf(_SC_PAGESIZE); + self->pidfd = PIDFD_SELF; + self->remote_pidfd = -1; + self->child_pid = -1; +}; + +FIXTURE_TEARDOWN_PARENT(process_madvise) +{ + /* This teardown is guaranteed to run, even if tests SKIP or ASSERT */ + if (self->child_pid > 0) { + kill(self->child_pid, SIGKILL); + waitpid(self->child_pid, NULL, 0); + } + + if (self->remote_pidfd >= 0) + close(self->remote_pidfd); +} + +static ssize_t sys_process_madvise(int pidfd, const struct iovec *iovec, + size_t vlen, int advice, unsigned int flags) +{ + return syscall(__NR_process_madvise, pidfd, iovec, vlen, advice, flags); +} + +/* + * This test uses PIDFD_SELF to target the current process. The main + * goal is to verify the basic behavior of process_madvise() with + * a vector of non-contiguous memory ranges, not its cross-process + * capabilities. + */ +TEST_F(process_madvise, basic) +{ + const unsigned long pagesize = self->page_size; + const int madvise_pages = 4; + struct iovec vec[madvise_pages]; + int pidfd = self->pidfd; + ssize_t ret; + char *map; + + /* + * Create a single large mapping. We will pick pages from this + * mapping to advise on. This ensures we test non-contiguous iovecs. + */ + map = mmap(NULL, pagesize * 10, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (map == MAP_FAILED) + SKIP(return, "mmap failed, not enough memory.\n"); + + /* Fill the entire region with a known pattern. */ + memset(map, 'A', pagesize * 10); + + /* + * Setup the iovec to point to 4 non-contiguous pages + * within the mapping. 
+ */ + vec[0].iov_base = &map[0 * pagesize]; + vec[0].iov_len = pagesize; + vec[1].iov_base = &map[3 * pagesize]; + vec[1].iov_len = pagesize; + vec[2].iov_base = &map[5 * pagesize]; + vec[2].iov_len = pagesize; + vec[3].iov_base = &map[8 * pagesize]; + vec[3].iov_len = pagesize; + + ret = sys_process_madvise(pidfd, vec, madvise_pages, MADV_DONTNEED, 0); + if (ret == -1 && errno == EPERM) + SKIP(return, + "process_madvise() unsupported or permission denied, try running as root.\n"); + else if (errno == EINVAL) + SKIP(return, + "process_madvise() unsupported or parameter invalid, please check arguments.\n"); + + /* The call should succeed and report the total bytes processed. */ + ASSERT_EQ(ret, madvise_pages * pagesize); + + /* Check that advised pages are now zero. */ + for (int i = 0; i < madvise_pages; i++) { + char *advised_page = (char *)vec[i].iov_base; + + /* Content must be 0, not 'A'. */ + ASSERT_EQ(*advised_page, '\0'); + } + + /* Check that an un-advised page in between is still 'A'. */ + char *unadvised_page = &map[1 * pagesize]; + + for (int i = 0; i < pagesize; i++) + ASSERT_EQ(unadvised_page[i], 'A'); + + /* Cleanup. */ + ASSERT_EQ(munmap(map, pagesize * 10), 0); +} + +/* + * This test deterministically validates process_madvise() with MADV_COLLAPSE + * on a remote process, other advices are difficult to verify reliably. + * + * The test verifies that a memory region in a child process, + * focus on process_madv remote result, only check addresses and lengths. + * The correctness of the MADV_COLLAPSE can be found in the relevant test examples in khugepaged. 
+ */ +TEST_F(process_madvise, remote_collapse) +{ + const unsigned long pagesize = self->page_size; + long huge_page_size; + int pipe_info[2]; + ssize_t ret; + struct iovec vec; + + struct child_info { + pid_t pid; + void *map_addr; + } info; + + huge_page_size = default_huge_page_size(); + if (huge_page_size <= 0) + SKIP(return, "Could not determine a valid huge page size.\n"); + + ASSERT_EQ(pipe(pipe_info), 0); + + self->child_pid = fork(); + ASSERT_NE(self->child_pid, -1); + + if (self->child_pid == 0) { + char *map; + size_t map_size = 2 * huge_page_size; + + close(pipe_info[0]); + + map = mmap(NULL, map_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + ASSERT_NE(map, MAP_FAILED); + + /* Fault in as small pages */ + for (size_t i = 0; i < map_size; i += pagesize) + map[i] = 'A'; + + /* Send info and pause */ + info.pid = getpid(); + info.map_addr = map; + ret = write(pipe_info[1], &info, sizeof(info)); + ASSERT_EQ(ret, sizeof(info)); + close(pipe_info[1]); + + pause(); + exit(0); + } + + close(pipe_info[1]); + + /* Receive child info */ + ret = read(pipe_info[0], &info, sizeof(info)); + if (ret <= 0) { + waitpid(self->child_pid, NULL, 0); + SKIP(return, "Failed to read child info from pipe.\n"); + } + ASSERT_EQ(ret, sizeof(info)); + close(pipe_info[0]); + self->child_pid = info.pid; + + self->remote_pidfd = syscall(__NR_pidfd_open, self->child_pid, 0); + ASSERT_GE(self->remote_pidfd, 0); + + vec.iov_base = info.map_addr; + vec.iov_len = huge_page_size; + + ret = sys_process_madvise(self->remote_pidfd, &vec, 1, MADV_COLLAPSE, 0); + if (ret == -1) { + if (errno == EINVAL) + SKIP(return, "PROCESS_MADV_ADVISE is not supported.\n"); + else if (errno == EPERM) + SKIP(return, + "No process_madvise() permissions, try running as root.\n"); + return; + } + + ASSERT_EQ(ret, huge_page_size); +} + +/* + * Test process_madvise() with a pidfd for a process that has already + * exited to ensure correct error handling. 
+ */ +TEST_F(process_madvise, exited_process_pidfd) +{ + struct iovec vec; + ssize_t ret; + int pidfd; + + vec.iov_base = (void *)0x1234; + vec.iov_len = 4096; + + /* + * Using a pidfd for a process that has already exited should fail + * with ESRCH. + */ + self->child_pid = fork(); + ASSERT_NE(self->child_pid, -1); + + if (self->child_pid == 0) + exit(0); + + pidfd = syscall(__NR_pidfd_open, self->child_pid, 0); + ASSERT_GE(pidfd, 0); + + /* Wait for the child to ensure it has terminated. */ + waitpid(self->child_pid, NULL, 0); + + ret = sys_process_madvise(pidfd, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, ESRCH); + close(pidfd); +} + +/* + * Test process_madvise() with bad pidfds to ensure correct error + * handling. + */ +TEST_F(process_madvise, bad_pidfd) +{ + struct iovec vec; + ssize_t ret; + + vec.iov_base = (void *)0x1234; + vec.iov_len = 4096; + + /* Using an invalid fd number (-1) should fail with EBADF. */ + ret = sys_process_madvise(-1, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EBADF); + + /* + * Using a valid fd that is not a pidfd (e.g. stdin) should fail + * with EBADF. + */ + ret = sys_process_madvise(STDIN_FILENO, &vec, 1, MADV_DONTNEED, 0); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EBADF); +} + +/* + * Test process_madvise() with an invalid flag value. Currently, only a flag + * value of 0 is supported. This test is reserved for the future, e.g., if + * synchronous flags are added. 
+ */ +TEST_F(process_madvise, flag) +{ + const unsigned long pagesize = self->page_size; + unsigned int invalid_flag; + int pidfd = self->pidfd; + struct iovec vec; + char *map; + ssize_t ret; + + map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, + 0); + if (map == MAP_FAILED) + SKIP(return, "mmap failed, not enough memory.\n"); + + vec.iov_base = map; + vec.iov_len = pagesize; + + invalid_flag = 0x80000000; + + ret = sys_process_madvise(pidfd, &vec, 1, MADV_DONTNEED, invalid_flag); + ASSERT_EQ(ret, -1); + ASSERT_EQ(errno, EINVAL); + + /* Cleanup. */ + ASSERT_EQ(munmap(map, pagesize), 0); +} + +TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index a38c984103ce..471e539d82b8 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -65,6 +65,8 @@ separated by spaces: test pagemap_scan IOCTL - pfnmap tests for VM_PFNMAP handling +- process_madv + test for process_madv - cow test copy-on-write semantics - thp @@ -425,6 +427,9 @@ CATEGORY="madv_guard" run_test ./guard-regions # MADV_POPULATE_READ and MADV_POPULATE_WRITE tests CATEGORY="madv_populate" run_test ./madv_populate +# PROCESS_MADV test +CATEGORY="process_madv" run_test ./process_madv + CATEGORY="vma_merge" run_test ./merge if [ -x ./memfd_secret ] -- 2.43.0
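What the "basic" test in the patch above expects can be modelled in user space without invoking the syscall at all. The sketch below is a pure simulation under stated assumptions: the 4096-byte page size is hypothetical, simulate_dontneed is an invented helper, and zeroing a bytearray only mimics the observable effect of MADV_DONTNEED on private anonymous memory (advised pages read back as zero). It shows the two things the test asserts: the return value equals the sum of the iovec lengths, and pages outside the vector keep their contents.

```python
# User-space model of the expectations in the "basic" test.
# Nothing here calls process_madvise(); PAGE is an assumed page size.
PAGE = 4096


def simulate_dontneed(mapping: bytearray, iovecs) -> int:
    """Zero every (offset, length) range and return total bytes covered."""
    total = 0
    for offset, length in iovecs:
        mapping[offset:offset + length] = bytes(length)
        total += length
    return total


# Ten pages filled with 'A', like the memset() in the test.
mapping = bytearray(b"A" * (10 * PAGE))

# Four non-contiguous pages: 0, 3, 5 and 8, as in the test's iovec setup.
advised_pages = (0, 3, 5, 8)
iovecs = [(page * PAGE, PAGE) for page in advised_pages]

ret = simulate_dontneed(mapping, iovecs)

if __name__ == "__main__":
    print("bytes processed:", ret)
    print("first byte of page 0:", mapping[0])
    print("first byte of page 1:", chr(mapping[PAGE]))
```

Choosing pages that are not adjacent is what makes the check meaningful: a page sitting between two advised pages must remain untouched.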

5 months, 2 weeks


[PATCH net 1/2] macsec: set IFF_UNICAST_FLT priv flag

by Stanislav Fomichev

Cosmin reports the following locking issue: # BUG: sleeping function called from invalid context at kernel/locking/mutex.c:275 # dump_stack_lvl+0x4f/0x60 # __might_resched+0xeb/0x140 # mutex_lock+0x1a/0x40 # dev_set_promiscuity+0x26/0x90 # __dev_set_promiscuity+0x85/0x170 # __dev_set_rx_mode+0x69/0xa0 # dev_uc_add+0x6d/0x80 # vlan_dev_open+0x5f/0x120 [8021q] # __dev_open+0x10c/0x2a0 # __dev_change_flags+0x1a4/0x210 # netif_change_flags+0x22/0x60 # do_setlink.isra.0+0xdb0/0x10f0 # rtnl_newlink+0x797/0xb00 # rtnetlink_rcv_msg+0x1cb/0x3f0 # netlink_rcv_skb+0x53/0x100 # netlink_unicast+0x273/0x3b0 # netlink_sendmsg+0x1f2/0x430 Which is similar to recent syzkaller reports in [0] and [1] and triggers because macsec does not advertise IFF_UNICAST_FLT although it has proper ndo_set_rx_mode callback that takes care of pushing uc/mc addresses down to the real device. In general, dev_uc_add call path is problematic for stacking non-IFF_UNICAST_FLT because we might grab netdev instance lock under addr_list_lock spinlock, so this is not a systemic fix. 
0: https://lore.kernel.org/netdev/686d55b4.050a0220.1ffab7.0014.GAE@google.com 1: https://lore.kernel.org/netdev/68712acf.a00a0220.26a83e.0051.GAE@google.com/ Link: 2aff4342b0f5b1539c02ffd8df4c7e58dd9746e7.camel(a)nvidia.com Fixes: 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations") Reported-by: Cosmin Ratiu <cratiu(a)nvidia.com> Tested-by: Cosmin Ratiu <cratiu(a)nvidia.com> Signed-off-by: Stanislav Fomichev <sdf(a)fomichev.me> --- drivers/net/macsec.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c index 7edbe76b5455..4c75d1fea552 100644 --- a/drivers/net/macsec.c +++ b/drivers/net/macsec.c @@ -3868,7 +3868,7 @@ static void macsec_setup(struct net_device *dev) ether_setup(dev); dev->min_mtu = 0; dev->max_mtu = ETH_MAX_MTU; - dev->priv_flags |= IFF_NO_QUEUE; + dev->priv_flags |= IFF_NO_QUEUE | IFF_UNICAST_FLT; dev->netdev_ops = &macsec_netdev_ops; dev->needs_free_netdev = true; dev->priv_destructor = macsec_free_netdev; -- 2.50.1

5 months, 2 weeks


[PATCH v2 0/7] Replace "__auto_type" with "auto"

by H. Peter Anvin

"auto" was defined as a keyword back in the K&R days, but as a storage type specifier. No one ever used it, since it was and is the default storage type for local variables. C++11 recycled the keyword to allow a type to be declared based on the type of an initializer. This was finally adopted into standard C in C23. gcc and clang provide the "__auto_type" alias keyword as an extension for pre-C23, however, there is no reason to pollute the bulk of the source base with this temporary keyword; instead define "auto" as a macro unless the compiler is running in C23+ mode. This macro is added in <linux/compiler_types.h> because that header is included in some of the tools headers, wheres <linux/compiler.h> is not as it has a bunch of very kernel-specific things in it. Changes in v2: - Restore indentation of macro backslashes (David Laight) - arch/nios2: Replace an adjacent typeof() with a similar "auto" construct (Linus Torvalds) - fs/proc/inode.c: change "__auto_type" to "const auto" (Alexey Dobriyan) --- arch/nios2/include/asm/uaccess.h | 8 ++++---- arch/x86/include/asm/bug.h | 2 +- arch/x86/include/asm/string_64.h | 6 +++--- arch/x86/include/asm/uaccess_64.h | 2 +- fs/proc/inode.c | 16 ++++++++-------- include/linux/cleanup.h | 6 +++--- include/linux/compiler.h | 2 +- include/linux/compiler_types.h | 13 +++++++++++++ include/linux/minmax.h | 6 +++--- tools/testing/selftests/bpf/prog_tests/socket_helpers.h | 9 +++++++-- tools/virtio/linux/compiler.h | 2 +- 11 files changed, 45 insertions(+), 27 deletions(-)

5 months, 2 weeks


[PATCH] kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace

by Mark Brown

The NT_ARM_SVE register set supports two data formats, the native SVE one and an alternative format where we embed a copy of user_fpsimd_data as used for NT_PRFPREG in the SVE register set. The register data is set as for a write to NT_PRFPREG and changes in vector length and streaming mode are handled as for any NT_ARM_SVE write. This has not previously been tested by fp-ptrace, add coverage of it. We do not support writes in FPSIMD format for NT_ARM_SSVE so we skip the test for anything that would leave us in streaming mode. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- tools/testing/selftests/arm64/fp/fp-ptrace.c | 66 +++++++++++++++++++++++++++- 1 file changed, 64 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/arm64/fp/fp-ptrace.c b/tools/testing/selftests/arm64/fp/fp-ptrace.c index 191c47ca0ed8..c479c97dea1a 100644 --- a/tools/testing/selftests/arm64/fp/fp-ptrace.c +++ b/tools/testing/selftests/arm64/fp/fp-ptrace.c @@ -1066,6 +1066,23 @@ static bool sve_write_supported(struct test_config *config) return true; } +static bool sve_write_fpsimd_supported(struct test_config *config) +{ + if (!sve_supported()) + return false; + + if ((config->svcr_in & SVCR_ZA) != (config->svcr_expected & SVCR_ZA)) + return false; + + if (config->svcr_expected & SVCR_SM) + return false; + + if (config->sme_vl_in != config->sme_vl_expected) + return false; + + return true; +} + static void fpsimd_write_expected(struct test_config *config) { int vl; @@ -1152,7 +1169,7 @@ static void sve_write_expected(struct test_config *config) } } -static void sve_write(pid_t child, struct test_config *config) +static void sve_write_sve(pid_t child, struct test_config *config) { struct user_sve_header *sve; struct iovec iov; @@ -1195,6 +1212,45 @@ static void sve_write(pid_t child, struct test_config *config) free(iov.iov_base); } +static void sve_write_fpsimd(pid_t child, struct test_config *config) +{ + struct user_sve_header *sve; + struct user_fpsimd_state *fpsimd; + 
struct iovec iov; + int ret, vl, vq; + + vl = vl_expected(config); + vq = __sve_vq_from_vl(vl); + + if (!vl) + return; + + iov.iov_len = SVE_PT_SVE_OFFSET + SVE_PT_SVE_SIZE(vq, + SVE_PT_REGS_FPSIMD); + iov.iov_base = malloc(iov.iov_len); + if (!iov.iov_base) { + ksft_print_msg("Failed allocating %lu byte SVE write buffer\n", + iov.iov_len); + return; + } + memset(iov.iov_base, 0, iov.iov_len); + + sve = iov.iov_base; + sve->size = iov.iov_len; + sve->flags = SVE_PT_REGS_FPSIMD; + sve->vl = vl; + + fpsimd = iov.iov_base + SVE_PT_REGS_OFFSET; + memcpy(&fpsimd->vregs, v_expected, sizeof(v_expected)); + + ret = ptrace(PTRACE_SETREGSET, child, NT_ARM_SVE, &iov); + if (ret != 0) + ksft_print_msg("Failed to write SVE: %s (%d)\n", + strerror(errno), errno); + + free(iov.iov_base); +} + static bool za_write_supported(struct test_config *config) { if ((config->svcr_in & SVCR_SM) != (config->svcr_expected & SVCR_SM)) @@ -1386,7 +1442,13 @@ static struct test_definition sve_test_defs[] = { .name = "SVE write", .supported = sve_write_supported, .set_expected_values = sve_write_expected, - .modify_values = sve_write, + .modify_values = sve_write_sve, + }, + { + .name = "SVE write FPSIMD format", + .supported = sve_write_fpsimd_supported, + .set_expected_values = fpsimd_write_expected, + .modify_values = sve_write_fpsimd, }, }; --- base-commit: 86731a2a651e58953fc949573895f2fa6d456841 change-id: 20250718-arm64-fp-ptrace-sve-fpsimd-ea20bdd9138b Best regards, -- Mark Brown <broonie(a)kernel.org>

5 months, 2 weeks


[PATCH] kselftest/arm64: Allow sve-ptrace to run on SME only systems

by Mark Brown

Currently the sve-ptrace test program only runs if the system supports SVE but since SME includes streaming SVE the tests it offers are valid even on a system that only supports SME. Since the tests already have individual hwcap checks just remove the top level test and rely on those. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- tools/testing/selftests/arm64/fp/sve-ptrace.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/tools/testing/selftests/arm64/fp/sve-ptrace.c b/tools/testing/selftests/arm64/fp/sve-ptrace.c index 7f9b6a61d369..b22303778fb0 100644 --- a/tools/testing/selftests/arm64/fp/sve-ptrace.c +++ b/tools/testing/selftests/arm64/fp/sve-ptrace.c @@ -753,9 +753,6 @@ int main(void) ksft_print_header(); ksft_set_plan(EXPECTED_TESTS); - if (!(getauxval(AT_HWCAP) & HWCAP_SVE)) - ksft_exit_skip("SVE not available\n"); - child = fork(); if (!child) return do_child(); --- base-commit: 9e8ebfe677f9101bbfe1f75d548a5aec581e8213 change-id: 20250718-arm64-sve-ptrace-sme-only-4ab49d037295 Best regards, -- Mark Brown <broonie(a)kernel.org>

5 months, 2 weeks


[PATCH 0/3] kselftest/arm64: Fixes for fp-ptrace on SME only systems

by Mark Brown

When testing SME only systems I noticed that fp-ptrace does not cope at all well with them; this series fixes the major issues so that the test program completes successfully. The reason I was looking at this is that following the recent round of fixes to ptrace we do not currently offer any mechanism for disabling streaming mode via ptrace. This series brings the program to a point where it tests the currently implemented ABI. A further series allowing the disabling of streaming mode via ptrace will follow. Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Mark Brown (3): kselftest/arm64: Test SME on SME only systems in fp-ptrace kselftest/arm64: Fix SVE write data generation for SME only systems kselftest/arm64: Handle attempts to disable SM on SME only systems tools/testing/selftests/arm64/fp/fp-ptrace.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) --- base-commit: 86731a2a651e58953fc949573895f2fa6d456841 change-id: 20250718-arm64-fp-ptrace-sme-only-ab327d7f0d32 Best regards, -- Mark Brown <broonie(a)kernel.org>

5 months, 2 weeks


[PATCH v25 net-next 0/6] DUALPI2 patch

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com> Hello, Please find the DualPI2 patch v25. This patch series adds DualPI Improved with a Square (DualPI2) with the following features: * Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague) * Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332 * Configurable overload strategies * Use of sojourn time to reliably estimate queue delay * Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues For more details of DualPI2, please refer to IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332). Best regards, Chia-Yu --- v25 (19-Jul-2025) - Fix the missing sch_tree_unlock() in v24 (Jakub Kicinski <kuba(a)kernel.org>) v24 (18-Jul-2025) - Replace TCA_DUALPI2 prefix with TC_DUALPI2 for enums in pkt_sched.h (Jakub Kicinski <kuba(a)kernel.org>) - Report error if both packet and time step thresholds are provided (Jakub Kicinski <kuba(a)kernel.org>) v23 (13-Jul-2025) and v22 (11-Jul-2025) - Fix issue when user would like to change DualPI2 but provides an empty TCA_OPTIONS with no nested attributes (Paolo Abeni <pabeni(a)redhat.com>, Jakub Kicinski <kuba(a)kernel.org>) v21 (02-Jul-2025) - Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>) - Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>) - Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>) - Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>) - Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>) - Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub Kicinski <kuba(a)kernel.org>) - Remove redundant 
name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>) - Fix some typos and format issues of dualpi2 attributes v20 (21-Jun-2025) - Add one more commit to fix warning and style check on tdc.sh reported by shellcheck - Remove the double prefix of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>) v19 (14-Jun-2025) - Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>) - Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>) v18 (13-Jun-2025) - Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute - Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>) v17 (25-May-2025, Resent on 11-Jun-2025) - Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>) - Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>) - Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>) - Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>) - Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>) - Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>) - Add comments and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>) v16 (16-May-2025) - Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>) - Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>) - Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>) - Add more descriptions with respect to DualPI2 in the cover letter 
and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>) v15 (09-May-2025) - Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakage in #1 (Simon Horman <horms(a)kernel.org>) - Fix one typo in comment of #2 - Update tc.yaml in #5 to align with the updated enum of pkt_sched.h v14 (05-May-2025) - Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun - Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>) - Update dualpi2.json to align with the updated variable order in pkt_sched.h - Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>) v13 (26-Apr-2025) - Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>) - Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>) v12 (22-Apr-2025) - Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>) - Introduce get_memory_limit function to handle potential overflow when multiplying limit with MTU (Paolo Abeni <pabeni(a)redhat.com>) - Double the packet length to 
further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>) - Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>) v11 (15-Apr-2025) - Replace hstimer_init with hstimer_setup in sch_dualpi2.c v10 (25-Mar-2025) - Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>) - Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>) v9 (16-Mar-2025) - Fix mem_usage error in previous version - Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking. In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets. This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0. 
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
                         avg   median # data pts
 Ping (ms) ICMP   :    11.55    11.70 ms       350
 TCP upload avg   :    18.96      N/A Mbits/s  350
 TCP upload sum   :    18.96      N/A Mbits/s  350
New version (v9):
                         avg   median # data pts
 Ping (ms) ICMP   :    10.81    10.70 ms       350
 TCP upload avg   :    18.91      N/A Mbits/s  350
 TCP upload sum   :    18.91      N/A Mbits/s  350

Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
                         avg   median # data pts
 Ping (ms) ICMP   :    12.61    12.80 ms       350
 TCP upload avg   :     9.48      N/A Mbits/s  350
 TCP upload sum   :     9.48      N/A Mbits/s  350
New version (v9):
                         avg   median # data pts
 Ping (ms) ICMP   :    11.06    10.80 ms       350
 TCP upload avg   :     9.43      N/A Mbits/s  350
 TCP upload sum   :     9.43      N/A Mbits/s  350

Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
                         avg   median # data pts
 Ping (ms) ICMP   :    40.86    37.45 ms       350
 TCP upload avg   :     0.88      N/A Mbits/s  350
 TCP upload sum   :     0.88      N/A Mbits/s  350
 TCP upload::1    :     0.88     0.97 Mbits/s  350
New version (v9):
                         avg   median # data pts
 Ping (ms) ICMP   :    11.07    10.40 ms       350
 TCP upload avg   :     0.55      N/A Mbits/s  350
 TCP upload sum   :     0.55      N/A Mbits/s  350
 TCP upload::1    :     0.55     0.59 Mbits/s  350

v8 (11-Mar-2025)
- Fix warning messages in v7

v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)

v6 (04-Mar-2025)
- Add modprobe for dualpi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message

v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:

Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
                         avg   median # data pts
 Ping (ms) ICMP   :     1.19     1.34 ms       349
 TCP download avg :   235.42      N/A Mbits/s  349
 TCP download sum :   941.68      N/A Mbits/s  349
 TCP download::1  :   235.19   235.39 Mbits/s  349
 TCP download::2  :   235.03   235.35 Mbits/s  349
 TCP download::3  :   236.89   235.44 Mbits/s  349
 TCP download::4  :   234.57   235.19 Mbits/s  349
- Summary of tcp_4down run 'MQ + FQ_PIE':
                         avg   median # data pts
 Ping (ms) ICMP   :     1.21     1.37 ms       350
 TCP download avg :   235.42      N/A Mbits/s  350
 TCP download sum :   941.61      N/A Mbits/s  350
 TCP download::1  :   232.54   233.13 Mbits/s  350
 TCP download::2  :   232.52   232.80 Mbits/s  350
 TCP download::3  :   233.14   233.78 Mbits/s  350
 TCP download::4  :   243.41   241.48 Mbits/s  350
- Summary of tcp_4down run 'MQ + DUALPI2':
                         avg   median # data pts
 Ping (ms) ICMP   :     1.19     1.34 ms       349
 TCP download avg :   235.42      N/A Mbits/s  349
 TCP download sum :   941.68      N/A Mbits/s  349
 TCP download::1  :   235.19   235.39 Mbits/s  349
 TCP download::2  :   235.03   235.35 Mbits/s  349
 TCP download::3  :   236.89   235.44 Mbits/s  349
 TCP download::4  :   234.57   235.19 Mbits/s  349

Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
                         avg   median # data pts
 Ping (ms) ICMP   :     1.88     1.86 ms       350
 TCP download avg :     7.39      N/A Mbits/s  350
 TCP download sum :   946.47      N/A Mbits/s  350
- Summary of tcp_128down run 'MQ + FQ_PIE':
                         avg   median # data pts
 Ping (ms) ICMP   :     1.88     1.86 ms       350
 TCP download avg :     7.39      N/A Mbits/s  350
 TCP download sum :   946.47      N/A Mbits/s  350
- Summary of tcp_128down run 'MQ + DUALPI2':
                         avg   median # data pts
 Ping (ms) ICMP   :     1.88     1.86 ms       350
 TCP download avg :     7.39      N/A Mbits/s  350
 TCP download sum :   946.47      N/A Mbits/s  350

Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
                         avg   median # data pts
 Ping (ms) ICMP   :     0.22     0.23 ms       350
 TCP download avg :  2354.08      N/A Mbits/s  350
 TCP download sum :  9416.31      N/A Mbits/s  350
 TCP download::1  :  2353.65  2352.81 Mbits/s  350
 TCP download::2  :  2354.54  2354.21 Mbits/s  350
 TCP download::3  :  2353.56  2353.78 Mbits/s  350
 TCP download::4  :  2354.56  2354.45 Mbits/s  350
- Summary of tcp_4down run 'MQ + FQ_PIE':
                         avg   median # data pts
 Ping (ms) ICMP   :     0.20     0.19 ms       350
 TCP download avg :  2354.76      N/A Mbits/s  350
 TCP download sum :  9419.04      N/A Mbits/s  350
 TCP download::1  :  2354.77  2353.89 Mbits/s  350
 TCP download::2  :  2353.41  2354.29 Mbits/s  350
 TCP download::3  :  2356.18  2354.19 Mbits/s  350
 TCP download::4  :  2354.68  2353.15 Mbits/s  350
- Summary of tcp_4down run 'MQ + DUALPI2':
                         avg   median # data pts
 Ping (ms) ICMP   :     0.24     0.24 ms       350
 TCP download avg :  2354.11      N/A Mbits/s  350
 TCP download sum :  9416.43      N/A Mbits/s  350
 TCP download::1  :  2354.75  2353.93 Mbits/s  350
 TCP download::2  :  2353.15  2353.75 Mbits/s  350
 TCP download::3  :  2353.49  2353.72 Mbits/s  350
 TCP download::4  :  2355.04  2353.73 Mbits/s  350

Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
                         avg   median # data pts
 Ping (ms) ICMP   :     7.57     8.69 ms       350
 TCP download avg :    73.97      N/A Mbits/s  350
 TCP download sum :  9467.82      N/A Mbits/s  350
- Summary of tcp_128down run 'MQ + FQ_PIE':
                         avg   median # data pts
 Ping (ms) ICMP   :     7.82     8.91 ms       350
 TCP download avg :    73.97      N/A Mbits/s  350
 TCP download sum :  9468.42      N/A Mbits/s  350
- Summary of tcp_128down run 'MQ + DUALPI2':
                         avg   median # data pts
 Ping (ms) ICMP   :     6.87     7.93 ms       350
 TCP download avg :    73.95      N/A Mbits/s  350
 TCP download sum :  9465.87      N/A Mbits/s  350

From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c

v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.

v3 (19-Oct-2024)
- Fix compilation error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)

v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning

---
Chia-Yu Chang (5):
  sched: Struct definition and parsing of dualpi2 qdisc
  sched: Dump configuration and statistics of dualpi2 qdisc
  selftests/tc-testing: Fix warning and style check on tdc.sh
  selftests/tc-testing: Add selftests for qdisc DualPI2
  Documentation: netlink: specs: tc: Add DualPI2 specification

Koen De Schepper (1):
  sched: Add enqueue/dequeue of dualpi2 qdisc

 Documentation/netlink/specs/tc.yaml                 |  151 ++-
 include/net/dropreason-core.h                       |    6 +
 include/uapi/linux/pkt_sched.h                      |   68 +
 net/sched/Kconfig                                   |   12 +
 net/sched/Makefile                                  |    1 +
 net/sched/sch_dualpi2.c                             | 1177 +++++++++++++++++
 tools/testing/selftests/tc-testing/config           |    1 +
 .../tc-testing/tc-tests/qdiscs/dualpi2.json         |  254 ++++
 tools/testing/selftests/tc-testing/tdc.sh           |    6 +-
 9 files changed, 1671 insertions(+), 5 deletions(-)
 create mode 100644 net/sched/sch_dualpi2.c
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json

--
2.34.1
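
The v9 changelog entry describes min_qlen_step only in prose. As a rough illustration, the Python model below shows how such a gate might interact with the step threshold. It is a sketch under stated assumptions, not the code in sch_dualpi2.c: the function name, every parameter name other than min_qlen_step, and the use of a queuing-delay threshold are hypothetical.

```python
def should_step_mark(qlen_pkts: int, sojourn_us: int,
                     step_thresh_us: int, min_qlen_step: int = 0) -> bool:
    """Toy model of L-queue step marking (hypothetical, not kernel code)."""
    # Gate described in the v9 changelog: step marking is considered only
    # once the L-queue holds at least min_qlen_step packets.
    if qlen_pkts < min_qlen_step:
        return False
    # Assumption: the step threshold is compared against queuing delay.
    return sojourn_us > step_thresh_us


# Old behaviour (gate fixed at 2): a single delayed packet is not marked.
old = should_step_mark(qlen_pkts=1, sojourn_us=5000,
                       step_thresh_us=1000, min_qlen_step=2)
# v9 default (gate 0): the same packet is marked.
new = should_step_mark(qlen_pkts=1, sojourn_us=5000,
                       step_thresh_us=1000, min_qlen_step=0)
print(old, new)  # prints: False True
```

In this model a low-rate flow that rarely has two packets queued never sees a step mark with the old gate, which is consistent with the higher ping times reported for the old versions at low rates.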

5 months, 2 weeks


[PATCH net-next v6] ipv6: add `force_forwarding` sysctl to enable per-interface forwarding

by Gabriel Goller

It is currently impossible to enable ipv6 forwarding on a per-interface basis like in ipv4. To enable forwarding on an ipv6 interface we need to enable it on all interfaces and disable it on the other interfaces using a netfilter rule. This is especially cumbersome if you have lots of interfaces and only want to enable forwarding on a few. According to the sysctl docs [0] the `net.ipv6.conf.all.forwarding` enables forwarding for all interfaces, while the interface-specific `net.ipv6.conf.<interface>.forwarding` configures the interface Host/Router configuration.

Introduce a new sysctl flag `force_forwarding`, which can be set on every interface. The ip6_forward() function will then check if the global forwarding flag OR the force_forwarding flag is active and forward the packet.

To preserve backwards-compatibility, reset the flag (on all interfaces) to 0 if the net.ipv6.conf.all.forwarding flag is set to 0.

Add a short selftest that checks if a packet gets forwarded with and without `force_forwarding`.
[0]: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

Signed-off-by: Gabriel Goller <g.goller(a)proxmox.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel(a)6wind.com>
---
v6:
 * rebase
 * remove brackets around single line
 * add 'nodad' to addresses in selftest to avoid sporadic failures
v5: https://lore.kernel.org/netdev/20250707094307.223975-1-g.goller@proxmox.com/
 * update conf/all/forwarding docs
 * simplified backwards-compat comment
 * remove ASSERT_RTNL as it's guaranteed by __in6_dev_get_rtnl_net() already
 * change ip6_forward logic so that it doesn't depend on the idev existing
 * move WRITE_ONCE inside device lock
v4: https://lore.kernel.org/netdev/20250703160154.560239-1-g.goller@proxmox.com/
 * actually write the sysctl value to the table
 * use ASSERT_RTNL() when forwarding the sysctl change
 * remove useless comments in function body
 * simplify forwarding and force_forwarding check in ip6_output.c
 * fix code backticks in Documentation (double instead of single)
 * add selftests
v3: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/
 * remove forwarding=0 setting force_forwarding=0 globally.
 * add min and max (0 and 1) value to sysctl.
v2: https://lore.kernel.org/netdev/20250701140423.487411-1-g.goller@proxmox.com/
 * rename from `do_forwarding` to `force_forwarding`.
 * add global `force_forwarding` flag which will enable `force_forwarding` on every interface like the `ipv4.all.forwarding` flag.
 * `forwarding`=0 will disable global and per-interface `force_forwarding`.
 * export option as NETCONFA_FORCE_FORWARDING.
v1: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/ Documentation/networking/ip-sysctl.rst | 9 +- include/linux/ipv6.h | 1 + include/uapi/linux/ipv6.h | 1 + include/uapi/linux/netconf.h | 1 + include/uapi/linux/sysctl.h | 1 + net/ipv6/addrconf.c | 82 ++++++++++++++ net/ipv6/ip6_output.c | 3 +- tools/testing/selftests/net/Makefile | 1 + .../selftests/net/ipv6_force_forwarding.sh | 105 ++++++++++++++++++ 9 files changed, 201 insertions(+), 3 deletions(-) create mode 100755 tools/testing/selftests/net/ipv6_force_forwarding.sh diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 0f1251cce314..6d92bae0257a 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2281,8 +2281,8 @@ conf/all/disable_ipv6 - BOOLEAN conf/all/forwarding - BOOLEAN Enable global IPv6 forwarding between all interfaces. - IPv4 and IPv6 work differently here; e.g. netfilter must be used - to control which interfaces may forward packets and which not. + IPv4 and IPv6 work differently here; the ``force_forwarding`` flag must + be used to control which interfaces may forward packets. This also sets all interfaces' Host/Router setting 'forwarding' to the specified value. See below for details. @@ -2292,6 +2292,11 @@ conf/all/forwarding - BOOLEAN proxy_ndp - BOOLEAN Do proxy ndp. +force_forwarding - BOOLEAN + Enable forwarding on this interface only -- regardless of the setting on + ``conf/all/forwarding``. When setting ``conf.all.forwarding`` to 0, + the ``force_forwarding`` flag will be reset on all interfaces. + fwmark_reflect - BOOLEAN Controls the fwmark of kernel-generated IPv6 reply packets that are not associated with a socket for example, TCP RSTs or ICMPv6 echo replies). 
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index 5aeeed22f35b..d975a86f29be 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -17,6 +17,7 @@ struct ipv6_devconf { __s32 hop_limit; __s32 mtu6; __s32 forwarding; + __s32 force_forwarding; __s32 disable_policy; __s32 proxy_ndp; __cacheline_group_end(ipv6_devconf_read_txrx); diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h index cf592d7b630f..d4d3ae774b26 100644 --- a/include/uapi/linux/ipv6.h +++ b/include/uapi/linux/ipv6.h @@ -199,6 +199,7 @@ enum { DEVCONF_NDISC_EVICT_NOCARRIER, DEVCONF_ACCEPT_UNTRACKED_NA, DEVCONF_ACCEPT_RA_MIN_LFT, + DEVCONF_FORCE_FORWARDING, DEVCONF_MAX }; diff --git a/include/uapi/linux/netconf.h b/include/uapi/linux/netconf.h index fac4edd55379..1c8c84d65ae3 100644 --- a/include/uapi/linux/netconf.h +++ b/include/uapi/linux/netconf.h @@ -19,6 +19,7 @@ enum { NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN, NETCONFA_INPUT, NETCONFA_BC_FORWARDING, + NETCONFA_FORCE_FORWARDING, __NETCONFA_MAX }; #define NETCONFA_MAX (__NETCONFA_MAX - 1) diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index 8981f00204db..63d1464cb71c 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -573,6 +573,7 @@ enum { NET_IPV6_ACCEPT_RA_FROM_LOCAL=26, NET_IPV6_ACCEPT_RA_RT_INFO_MIN_PLEN=27, NET_IPV6_RA_DEFRTR_METRIC=28, + NET_IPV6_FORCE_FORWARDING=29, __NET_IPV6_MAX }; diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index ba2ec7c870cc..580aed034849 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -239,6 +239,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = { .ndisc_evict_nocarrier = 1, .ra_honor_pio_life = 0, .ra_honor_pio_pflag = 0, + .force_forwarding = 0, }; static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { @@ -303,6 +304,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { .ndisc_evict_nocarrier = 1, .ra_honor_pio_life = 0, .ra_honor_pio_pflag = 0, + .force_forwarding = 0, }; /* 
Check if link is ready: is it up and is a valid qdisc available */ @@ -857,6 +859,9 @@ static void addrconf_forward_change(struct net *net, __s32 newf) idev = __in6_dev_get_rtnl_net(dev); if (idev) { int changed = (!idev->cnf.forwarding) ^ (!newf); + /* Disabling all.forwarding sets 0 to force_forwarding for all interfaces */ + if (newf == 0) + WRITE_ONCE(idev->cnf.force_forwarding, 0); WRITE_ONCE(idev->cnf.forwarding, newf); if (changed) @@ -5719,6 +5724,7 @@ static void ipv6_store_devconf(const struct ipv6_devconf *cnf, array[DEVCONF_ACCEPT_UNTRACKED_NA] = READ_ONCE(cnf->accept_untracked_na); array[DEVCONF_ACCEPT_RA_MIN_LFT] = READ_ONCE(cnf->accept_ra_min_lft); + array[DEVCONF_FORCE_FORWARDING] = READ_ONCE(cnf->force_forwarding); } static inline size_t inet6_ifla6_size(void) @@ -6747,6 +6753,75 @@ static int addrconf_sysctl_disable_policy(const struct ctl_table *ctl, int write return ret; } +static void addrconf_force_forward_change(struct net *net, __s32 newf) +{ + struct net_device *dev; + struct inet6_dev *idev; + + for_each_netdev(net, dev) { + idev = __in6_dev_get_rtnl_net(dev); + if (idev) { + int changed = (!idev->cnf.force_forwarding) ^ (!newf); + + WRITE_ONCE(idev->cnf.force_forwarding, newf); + if (changed) + inet6_netconf_notify_devconf(dev_net(dev), RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + dev->ifindex, &idev->cnf); + } + } +} + +static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct inet6_dev *idev = ctl->extra1; + struct ctl_table tmp_ctl = *ctl; + struct net *net = ctl->extra2; + int *valp = ctl->data; + int new_val = *valp; + int old_val = *valp; + loff_t pos = *ppos; + int ret; + + tmp_ctl.extra1 = SYSCTL_ZERO; + tmp_ctl.extra2 = SYSCTL_ONE; + tmp_ctl.data = &new_val; + + ret = proc_douintvec_minmax(&tmp_ctl, write, buffer, lenp, ppos); + + if (write && old_val != new_val) { + if (!rtnl_net_trylock(net)) + return restart_syscall(); + + WRITE_ONCE(*valp, 
new_val); + + if (valp == &net->ipv6.devconf_dflt->force_forwarding) { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + NETCONFA_IFINDEX_DEFAULT, + net->ipv6.devconf_dflt); + } else if (valp == &net->ipv6.devconf_all->force_forwarding) { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + NETCONFA_IFINDEX_ALL, + net->ipv6.devconf_all); + + addrconf_force_forward_change(net, new_val); + } else { + inet6_netconf_notify_devconf(net, RTM_NEWNETCONF, + NETCONFA_FORCE_FORWARDING, + idev->dev->ifindex, + &idev->cnf); + } + rtnl_net_unlock(net); + } + + if (ret) + *ppos = pos; + return ret; +} + static int minus_one = -1; static const int two_five_five = 255; static u32 ioam6_if_id_max = U16_MAX; @@ -7217,6 +7292,13 @@ static const struct ctl_table addrconf_sysctl[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_TWO, }, + { + .procname = "force_forwarding", + .data = &ipv6_devconf.force_forwarding, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = addrconf_sysctl_force_forwarding, + }, }; static int __addrconf_sysctl_register(struct net *net, char *dev_name, diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 7bd29a9ff0db..3853090d7282 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -509,7 +509,8 @@ int ip6_forward(struct sk_buff *skb) u32 mtu; idev = __in6_dev_get_safely(dev_get_by_index_rcu(net, IP6CB(skb)->iif)); - if (READ_ONCE(net->ipv6.devconf_all->forwarding) == 0) + if (!READ_ONCE(net->ipv6.devconf_all->forwarding) && + (!idev || !READ_ONCE(idev->cnf.force_forwarding))) goto error; if (skb->pkt_type != PACKET_HOST) diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 332f387615d7..f64ec8a15a77 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -112,6 +112,7 @@ TEST_PROGS += skf_net_off.sh TEST_GEN_FILES += skf_net_off TEST_GEN_FILES += tfo TEST_PROGS += tfo_passive.sh +TEST_PROGS 
+= ipv6_force_forwarding.sh # YNL files, must be before "include ..lib.mk" YNL_GEN_FILES := busy_poller netlink-dumps diff --git a/tools/testing/selftests/net/ipv6_force_forwarding.sh b/tools/testing/selftests/net/ipv6_force_forwarding.sh new file mode 100755 index 000000000000..bf0243366caa --- /dev/null +++ b/tools/testing/selftests/net/ipv6_force_forwarding.sh @@ -0,0 +1,105 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Test IPv6 force_forwarding interface property +# +# This test verifies that the force_forwarding property works correctly: +# - When global forwarding is disabled, packets are not forwarded normally +# - When force_forwarding is enabled on an interface, packets are forwarded +# regardless of the global forwarding setting + +source lib.sh + +cleanup() { + cleanup_ns $ns1 $ns2 $ns3 +} + +trap cleanup EXIT + +setup_test() { + # Create three namespaces: sender, router, receiver + setup_ns ns1 ns2 ns3 + + # Create veth pairs: ns1 <-> ns2 <-> ns3 + ip link add name veth12 type veth peer name veth21 + ip link add name veth23 type veth peer name veth32 + + # Move interfaces to namespaces + ip link set veth12 netns $ns1 + ip link set veth21 netns $ns2 + ip link set veth23 netns $ns2 + ip link set veth32 netns $ns3 + + # Configure interfaces + ip -n $ns1 addr add 2001:db8:1::1/64 dev veth12 nodad + ip -n $ns2 addr add 2001:db8:1::2/64 dev veth21 nodad + ip -n $ns2 addr add 2001:db8:2::1/64 dev veth23 nodad + ip -n $ns3 addr add 2001:db8:2::2/64 dev veth32 nodad + + # Bring up interfaces + ip -n $ns1 link set veth12 up + ip -n $ns2 link set veth21 up + ip -n $ns2 link set veth23 up + ip -n $ns3 link set veth32 up + + # Add routes + ip -n $ns1 route add 2001:db8:2::/64 via 2001:db8:1::2 + ip -n $ns3 route add 2001:db8:1::/64 via 2001:db8:2::1 + + # Disable global forwarding + ip netns exec $ns2 sysctl -qw net.ipv6.conf.all.forwarding=0 +} + +test_force_forwarding() { + local ret=0 + + echo "TEST: force_forwarding functionality" + + # Check if 
force_forwarding sysctl exists + if ! ip netns exec $ns2 test -f /proc/sys/net/ipv6/conf/veth21/force_forwarding; then + echo "SKIP: force_forwarding not available" + return $ksft_skip + fi + + # Test 1: Without force_forwarding, ping should fail + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=0 + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=0 + + if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then + echo "FAIL: ping succeeded when forwarding disabled" + ret=1 + else + echo "PASS: forwarding disabled correctly" + fi + + # Test 2: With force_forwarding enabled, ping should succeed + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=1 + ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=1 + + if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then + echo "PASS: force_forwarding enabled forwarding" + else + echo "FAIL: ping failed with force_forwarding enabled" + ret=1 + fi + + return $ret +} + +echo "IPv6 force_forwarding test" +echo "==========================" + +setup_test +test_force_forwarding +ret=$? + +if [ $ret -eq 0 ]; then + echo "OK" + exit 0 +elif [ $ret -eq $ksft_skip ]; then + echo "SKIP" + exit $ksft_skip +else + echo "FAIL" + exit 1 +fi -- 2.39.5
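
The interaction between the two flags can be hard to follow from the diff alone. The Python model below is a minimal sketch of the decision in the ip6_forward() hunk and of the reset in the addrconf_forward_change() hunk; the class and method names are hypothetical, and it ignores locking, netlink notifications and everything else the kernel code does.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class IfConf:
    forwarding: int = 0
    force_forwarding: int = 0


@dataclass
class Ipv6Conf:
    all_forwarding: int = 0
    ifaces: Dict[str, IfConf] = field(default_factory=dict)

    def set_all_forwarding(self, value: int) -> None:
        # Modelled on the addrconf_forward_change() hunk: the value is
        # propagated to every interface, and writing 0 also clears
        # force_forwarding on each of them.
        self.all_forwarding = value
        for conf in self.ifaces.values():
            if value == 0:
                conf.force_forwarding = 0
            conf.forwarding = value

    def may_forward(self, in_iface: str) -> bool:
        # Modelled on the ip6_forward() hunk: drop only when global
        # forwarding is off AND the ingress device is unknown or does
        # not have force_forwarding set.
        idev = self.ifaces.get(in_iface)
        if not self.all_forwarding and (idev is None or not idev.force_forwarding):
            return False
        return True


conf = Ipv6Conf(ifaces={"veth21": IfConf(), "veth23": IfConf()})
conf.ifaces["veth21"].force_forwarding = 1
print(conf.may_forward("veth21"), conf.may_forward("veth23"))  # True False
conf.set_all_forwarding(0)
print(conf.may_forward("veth21"))  # False
```

Note that the check looks only at the ingress interface, which is why the selftest enables force_forwarding on both veth21 and veth23: the echo request enters the router on one and the reply on the other.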

5 months, 2 weeks


[PATCH mm-stable] selftests/damon/sysfs.py: stop DAMON for dumping failures

by SeongJae Park

Commit 4ece01897627 ("selftests/damon: add python and drgn-based DAMON sysfs test") in mm-stable tree introduced sysfs.py that runs drgn for dumping DAMON status. When the DAMON status dumping fails, for reasons including drgn not being installed, the test fails without stopping DAMON. Subsequent DAMON selftests, which assume DAMON is not running when they are executed, therefore fail. Catch dumping failures and stop DAMON in that case.

Fixes: 4ece01897627 ("selftests/damon: add python and drgn-based DAMON sysfs test") # mm-stable
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Closes: https://lore.kernel.org/oe-lkp/202507220707.9c5d6247-lkp@intel.com
Signed-off-by: SeongJae Park <sj(a)kernel.org>
---
 tools/testing/selftests/damon/sysfs.py | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tools/testing/selftests/damon/sysfs.py b/tools/testing/selftests/damon/sysfs.py
index 4ff99db0d247..dbf6613529bd 100755
--- a/tools/testing/selftests/damon/sysfs.py
+++ b/tools/testing/selftests/damon/sysfs.py
@@ -8,6 +8,10 @@ import subprocess
 import _damon_sysfs
 
 def dump_damon_status_dict(pid):
+    try:
+        subprocess.check_output(['which', 'drgn'], stderr=subprocess.DEVNULL)
+    except:
+        return None, 'drgn not found'
     file_dir = os.path.dirname(os.path.abspath(__file__))
     dump_script = os.path.join(file_dir, 'drgn_dump_damon_status.py')
     rc = subprocess.call(['drgn', dump_script, pid, 'damon_dump_output'],
@@ -31,6 +35,7 @@ def main():
     status, err = dump_damon_status_dict(kdamonds.kdamonds[0].pid)
     if err is not None:
         print(err)
+        kdamonds.stop()
         exit(1)
 
     if len(status['contexts']) != 1:

base-commit: 49c3f600a9088332b3c1a6db2dc6f3516f273609
--
2.39.5
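
The pattern the patch applies, checking for a required tool and stopping DAMON before bailing out, can be sketched in isolation. The example below is an illustration rather than the patch itself: it swaps in shutil.which() for the `which` subprocess call used in the diff, and FakeKdamonds and run_check are hypothetical stand-ins for the real _damon_sysfs objects and test body.

```python
import shutil


def tool_available(name: str) -> bool:
    # shutil.which() returns None when the executable is not on PATH.
    return shutil.which(name) is not None


class FakeKdamonds:
    """Hypothetical stand-in for the object whose stop() the patch calls."""

    def __init__(self):
        self.running = True

    def stop(self):
        self.running = False


def run_check(kdamonds, tool: str) -> int:
    """Return an exit status, leaving DAMON stopped when the check fails."""
    if not tool_available(tool):
        print(f'{tool} not found')
        # Stop DAMON before reporting failure so that later tests,
        # which expect DAMON to be idle, are not affected.
        kdamonds.stop()
        return 1
    return 0


k = FakeKdamonds()
rc = run_check(k, 'surely-missing-tool-xyz')
print(rc, k.running)
```

Avoiding a bare `except:` by testing for the tool directly keeps unrelated errors, such as a KeyboardInterrupt, from being reported as "drgn not found".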

5 months, 2 weeks


[PATCH 00/22] selftests/damon/sysfs.py: test all parameters

by SeongJae Park

sysfs.py tests if the DAMON sysfs interface is passing the user-requested parameters to DAMON as expected. But only the default (minimum) parameters are being tested. This is partially because _damon_sysfs.py, which is the library for making the parameter requests, does not support all parameters. The internal DAMON status dump script (drgn_dump_damon_status.py) also does not dump all parameters.

Extend the test coverage by updating the parameter input and status dumping scripts to support all parameters, and writing additional tests using those. This increased test coverage actually found one real bug (https://lore.kernel.org/20250719181932.72944-1-sj@kernel.org).

The first seven patches (1-7) extend _damon_sysfs.py for all parameters setup. The eighth patch (8) fixes _damon_sysfs.py to use the correct max nr_accesses and age values for their type. The following three patches (9-11) extend drgn_dump_damon_status.py to dump full DAMON parameters. The following nine patches (12-20) refactor sysfs.py for general testing code reuse, and extend it for full parameters check. Finally, two patches (21 and 22) add test cases in sysfs.py for full parameters testing.

SeongJae Park (22):
  selftests/damon/_damon_sysfs: support DAMOS watermarks setup
  selftests/damon/_damon_sysfs: support DAMOS filters setup
  selftests/damon/_damon_sysfs: support monitoring intervals goal setup
  selftests/damon/_damon_sysfs: support DAMOS quota weights setup
  selftests/damon/_damon_sysfs: support DAMOS quota goal nid setup
  selftests/damon/_damon_sysfs: support DAMOS action dests setup
  selftests/damon/_damon_sysfs: support DAMOS target_nid setup
  selftests/damon/_damon_sysfs: use 2**32 - 1 as max nr_accesses and age
  selftests/damon/drgn_dump_damon_status: dump damos->migrate_dests
  selftests/damon/drgn_dump_damon_status: dump ctx->ops.id
  selftests/damon/drgn_dump_damon_status: dump DAMOS filters
  selftests/damon/sysfs.py: generalize DAMOS Watermarks commit assertion
  selftests/damon/sysfs.py: generalize DamosQuota commit assertion
  selftests/damon/sysfs.py: test quota goal commitment
  selftests/damon/sysfs.py: test DAMOS destinations commitment
  selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
  selftests/damon/sysfs.py: test DAMOS filters commitment
  selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
  selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
  selftests/damon/sysfs.py: generalize DAMON context commit assertion
  selftests/damon/sysfs.py: test non-default parameters runtime commit
  selftests/damon/sysfs.py: test runtime reduction of DAMON parameters

 tools/testing/selftests/damon/_damon_sysfs.py      | 301 +++++++++++++++++-
 .../selftests/damon/drgn_dump_damon_status.py      |  63 +++-
 tools/testing/selftests/damon/sysfs.py             | 284 +++++++++++++----
 3 files changed, 568 insertions(+), 80 deletions(-)

base-commit: fc8066077f44a4fd43f8fdb12bc238f8fbeaa3c5
--
2.39.5
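
The "generalize ... commit assertion" patches amount to comparing the parameters a test requested with what the status dump reports. A minimal sketch of that idea follows; it is not the code in sysfs.py, and the function name and the dictionary keys in the usage example are hypothetical.

```python
U32_MAX = 2**32 - 1  # the max nr_accesses/age value named in patch 8


def find_mismatches(expected, dumped, path=''):
    """Recursively compare requested parameters with a status dump.

    Returns a list of human-readable mismatch descriptions.
    """
    if isinstance(expected, dict):
        if not isinstance(dumped, dict):
            return [f'{path}: expected a dict, got {type(dumped).__name__}']
        problems = []
        for key, want in expected.items():
            sub = f'{path}.{key}' if path else str(key)
            if key not in dumped:
                problems.append(f'{sub}: missing from dump')
            else:
                problems.extend(find_mismatches(want, dumped[key], sub))
        return problems
    if isinstance(expected, list):
        if not isinstance(dumped, list) or len(expected) != len(dumped):
            return [f'{path}: list length or type differs']
        problems = []
        for i, (want, got) in enumerate(zip(expected, dumped)):
            problems.extend(find_mismatches(want, got, f'{path}[{i}]'))
        return problems
    if expected == dumped:
        return []
    return [f'{path}: expected {expected!r}, got {dumped!r}']


requested = {'attrs': {'max_nr_accesses': U32_MAX},
             'schemes': [{'action': 'stat'}]}
dumped = {'attrs': {'max_nr_accesses': U32_MAX},
          'schemes': [{'action': 'pageout'}]}
print(find_mismatches(requested, dumped))
```

Collecting every mismatch with its path, instead of stopping at the first failed assertion, makes a failing run easier to diagnose when many parameters are committed at once.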

5 months, 2 weeks


[RESEND PATCH] selftests/pidfd: align stack to fix SP alignment exception

by Shuai Xue

The pidfd_test fails on the ARM64 platform with the following error:

    Bail out! pidfd_poll check for premature notification on child thread exec test: Failed

When exception-trace is enabled, the kernel logs the details:

    #echo 1 > /proc/sys/debug/exception-trace
    #dmesg | tail -n 20
    [48628.713023] pidfd_test[1082142]: unhandled exception: SP Alignment, ESR 0x000000009a000000, SP/PC alignment exception in pidfd_test[400000+4000]
    [48628.713049] CPU: 21 PID: 1082142 Comm: pidfd_test Kdump: loaded Tainted: G W E 6.6.71-3_rc1.al8.aarch64 #1
    [48628.713051] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-1UC1P-5B/AS1111MG1, BIOS 1.2.M1.AL.P.157.00 07/29/2023
    [48628.713053] pstate: 60001800 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=-c)
    [48628.713055] pc : 0000000000402100
    [48628.713056] lr : 0000ffff98288f9c
    [48628.713056] sp : 0000ffffde49daa8
    [48628.713057] x29: 0000000000000000 x28: 0000000000000000 x27: 0000000000000000
    [48628.713060] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
    [48628.713062] x23: 0000000000000000 x22: 0000000000000000 x21: 0000000000400e80
    [48628.713065] x20: 0000000000000000 x19: 0000000000402650 x18: 0000000000000000
    [48628.713067] x17: 00000000004200d8 x16: 0000ffff98288f40 x15: 0000ffffde49b92c
    [48628.713070] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
    [48628.713072] x11: 0000000000001011 x10: 0000000000402100 x9 : 0000000000000010
    [48628.713074] x8 : 00000000000000dc x7 : 3861616239346564 x6 : 000000000000000a
    [48628.713077] x5 : 0000ffffde49daa8 x4 : 000000000000000a x3 : 0000ffffde49daa8
    [48628.713079] x2 : 0000ffffde49dadc x1 : 0000ffffde49daa8 x0 : 0000000000000000

According to ARM ARM D1.3.10.2 SP alignment checking:

> When the SP is used as the base address of a calculation, regardless of
> any offset applied by the instruction, if bits [3:0] of the SP are not
> 0b0000, there is a misaligned SP.

To fix it, align the stack to 16 bytes.

Signed-off-by: Shuai Xue <xueshuai(a)linux.alibaba.com>
---
 tools/testing/selftests/pidfd/pidfd_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/pidfd/pidfd_test.c b/tools/testing/selftests/pidfd/pidfd_test.c
index c081ae91313a..ec161a7c3ff9 100644
--- a/tools/testing/selftests/pidfd/pidfd_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_test.c
@@ -33,7 +33,7 @@ static bool have_pidfd_send_signal;
 static pid_t pidfd_clone(int flags, int *pidfd, int (*fn)(void *))
 {
 	size_t stack_size = 1024;
-	char *stack[1024] = { 0 };
+	char *stack[1024] __attribute__((aligned(16))) = {0};
 
 #ifdef __ia64__
 	return __clone2(fn, stack, stack_size, flags | SIGCHLD, NULL, pidfd);
--
2.39.3
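
The quoted rule can be checked against the register dump with a little arithmetic. The snippet below is only a worked example of the bit test; the helper names are hypothetical, and the SP value is the one printed on the "sp :" line of the log.

```python
SP_ALIGN = 16  # the quoted rule: bits [3:0] of SP must be 0b0000


def is_sp_misaligned(sp: int) -> bool:
    # A non-zero low nibble means SP is not a multiple of 16.
    return (sp & (SP_ALIGN - 1)) != 0


def align_down(addr: int, align: int = SP_ALIGN) -> int:
    # Valid for power-of-two alignments only.
    return addr & ~(align - 1)


sp_from_log = 0x0000FFFFDE49DAA8  # "sp :" value in the dmesg output
print(is_sp_misaligned(sp_from_log))  # True: the low nibble is 0x8
print(hex(align_down(sp_from_log)))   # 0xffffde49daa0
```

The logged SP is 8-byte aligned but not 16-byte aligned, which matches an array that only carries the natural alignment of its `char *` elements.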

5 months, 2 weeks


[PATCH net-next] selftests: tc: Add generic erspan_opts matching support for tc-flower

by shuali＠redhat.com

From: Li Shuang <shuali(a)redhat.com> Add test cases to tc_flower.sh to validate generic matching on ERSPAN options. Both ERSPAN Type II and Type III are covered. Also add check_tc_erspan_support() to verify whether tc supports erspan_opts. Signed-off-by: Li Shuang <shuali(a)redhat.com> --- tools/testing/selftests/net/forwarding/lib.sh | 14 +++++ .../selftests/net/forwarding/tc_flower.sh | 52 ++++++++++++++++++- 2 files changed, 65 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh index 9308b2f77fed..890b3374dacd 100644 --- a/tools/testing/selftests/net/forwarding/lib.sh +++ b/tools/testing/selftests/net/forwarding/lib.sh @@ -142,6 +142,20 @@ check_tc_version() fi } +check_tc_erspan_support() +{ + local dev=$1; shift + + tc filter add dev $dev ingress pref 1 handle 1 flower \ + erspan_opts 1:0:0:0 &> /dev/null + if [[ $? -ne 0 ]]; then + echo "SKIP: iproute2 too old; tc is missing erspan support" + return $ksft_skip + fi + tc filter del dev $dev ingress pref 1 handle 1 flower \ + erspan_opts 1:0:0:0 &> /dev/null +} + # Old versions of tc don't understand "mpls_uc" check_tc_mpls_support() { diff --git a/tools/testing/selftests/net/forwarding/tc_flower.sh b/tools/testing/selftests/net/forwarding/tc_flower.sh index b1daad19b01e..b58909a93112 100755 --- a/tools/testing/selftests/net/forwarding/tc_flower.sh +++ b/tools/testing/selftests/net/forwarding/tc_flower.sh @@ -6,7 +6,7 @@ ALL_TESTS="match_dst_mac_test match_src_mac_test match_dst_ip_test \ match_ip_tos_test match_indev_test match_ip_ttl_test match_mpls_label_test \ match_mpls_tc_test match_mpls_bos_test match_mpls_ttl_test \ - match_mpls_lse_test" + match_mpls_lse_test match_erspan_opts_test" NUM_NETIFS=2 source tc_common.sh source lib.sh @@ -676,6 +676,56 @@ match_mpls_lse_test() log_test "mpls lse match ($tcflags)" } +match_erspan_opts_test() +{ + RET=0 + + check_tc_erspan_support $h2 || return 0 + + # h1 erspan setup + 
tunnel_create erspan1 erspan 192.0.2.1 192.0.2.2 dev $h1 seq key 1001 \ + tos C ttl 64 erspan_ver 1 erspan 6789 # ERSPAN Type II + tunnel_create erspan2 erspan 192.0.2.1 192.0.2.2 dev $h1 seq key 1002 \ + tos C ttl 64 erspan_ver 2 erspan_dir egress erspan_hwid 63 \ + # ERSPAN Type III + ip link set dev erspan1 master v$h1 + ip link set dev erspan2 master v$h1 + # h2 erspan setup + ip link add ep-ex type erspan ttl 64 external # To collect tunnel info + ip link set ep-ex up + ip link set dev ep-ex master v$h2 + tc qdisc add dev ep-ex clsact + + # ERSPAN Type II [decap direction] + tc filter add dev ep-ex ingress protocol ip handle 101 flower \ + $tcflags enc_src_ip 192.0.2.1 enc_dst_ip 192.0.2.2 \ + enc_key_id 1001 erspan_opts 1:6789:0:0 \ + action drop + # ERSPAN Type III [decap direction] + tc filter add dev ep-ex ingress protocol ip handle 102 flower \ + $tcflags enc_src_ip 192.0.2.1 enc_dst_ip 192.0.2.2 \ + enc_key_id 1002 erspan_opts 2:0:1:63 action drop + + ep1mac=$(mac_get erspan1) + $MZ erspan1 -c 1 -p 64 -a $ep1mac -b $h2mac -t ip -q + tc_check_packets "dev ep-ex ingress" 101 1 + check_err $? "ERSPAN Type II" + + ep2mac=$(mac_get erspan2) + $MZ erspan2 -c 1 -p 64 -a $ep1mac -b $h2mac -t ip -q + tc_check_packets "dev ep-ex ingress" 102 1 + check_err $? "ERSPAN Type III" + + # h2 erspan cleanup + tc qdisc del dev ep-ex clsact + tunnel_destroy ep-ex + # h1 erspan cleanup + tunnel_destroy erspan2 # ERSPAN Type III + tunnel_destroy erspan1 # ERSPAN Type II + + log_test "erspan_opts match ($tcflags)" +} + setup_prepare() { h1=${NETIFS[p1]} -- 2.50.1
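
The two filters in the test above pass erspan_opts as four colon-separated numbers. Judging only from the values used there (erspan_ver 1 with index 6789 becomes 1:6789:0:0, erspan_ver 2 with egress direction and hwid 63 becomes 2:0:1:63), the fields appear to be VERSION:INDEX:DIR:HWID. The following is a minimal user-space sketch of that mapping, not code from the patch or from iproute2; struct erspan_opts and parse_erspan_opts() are hypothetical names chosen for illustration.

```c
#include <stdio.h>

/* Hypothetical container for the four fields; not a kernel or iproute2 type. */
struct erspan_opts {
	unsigned int ver;   /* 1 = ERSPAN Type II, 2 = ERSPAN Type III */
	unsigned int index; /* used with ver 1 in the test above */
	unsigned int dir;   /* used with ver 2; the test maps egress to 1 */
	unsigned int hwid;  /* used with ver 2 */
};

/*
 * Parse "VER:INDEX:DIR:HWID". Returns 0 on success and -1 when a field is
 * missing or when anything follows the fourth number.
 */
static int parse_erspan_opts(const char *s, struct erspan_opts *o)
{
	char trailing;
	int n = sscanf(s, "%u:%u:%u:%u%c",
		       &o->ver, &o->index, &o->dir, &o->hwid, &trailing);

	/* Exactly four conversions: all fields present, no trailing junk. */
	return n == 4 ? 0 : -1;
}
```

Parsing "2:0:1:63" with this helper yields ver 2, dir 1 and hwid 63, which lines up with the Type III tunnel created in the test.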

5 months, 2 weeks


[PATCH net v2 0/2] selftests: mptcp: connect: cover alt modes

by Matthieu Baerts (NGI0)

mptcp_connect.sh can be executed manually with "-m <MODE>" and "-C" to
make sure everything works as expected when using "mmap" and "sendfile"
modes instead of "poll", and with the MPTCP checksum support.

These modes should be validated, but they are not when the selftests are
executed via the kselftest helpers. It means that most CIs validating
these selftests, like NIPA for the net development trees and LKFT for
the stable ones, are not covering these modes.

To fix that, new test programs have been added, simply calling
mptcp_connect.sh with the right parameters.

The first patch can be backported up to v5.6, and the second one up to
v5.14.

Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Changes in v2:
- force using a different prefix in the subtests to avoid having the
  same test names in all mptcp_connect*.sh selftests.
- Link to v1: https://lore.kernel.org/r/20250714-net-mptcp-sft-connect-alt-v1-0-bf1c5abbe…

---
Matthieu Baerts (NGI0) (2):
      selftests: mptcp: connect: also cover alt modes
      selftests: mptcp: connect: also cover checksum

 tools/testing/selftests/net/mptcp/Makefile                  | 3 ++-
 tools/testing/selftests/net/mptcp/mptcp_connect_checksum.sh | 5 +++++
 tools/testing/selftests/net/mptcp/mptcp_connect_mmap.sh     | 5 +++++
 tools/testing/selftests/net/mptcp/mptcp_connect_sendfile.sh | 5 +++++
 4 files changed, 17 insertions(+), 1 deletion(-)
---
base-commit: b640daa2822a39ff76e70200cb2b7b892b896dce
change-id: 20250714-net-mptcp-sft-connect-alt-c1aaf073ef4e

Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>

5 months, 2 weeks


Re: [PATCH v3 5/7] perf test: Introduce storing logs for shell tests

by Ian Rogers

On Mon, Jul 21, 2025 at 6:27 AM Jakub Brnak <jbrnak(a)redhat.com> wrote: > > From: Veronika Molnarova <vmolnaro(a)redhat.com> > > Create temporary directories for storing log files for shell tests > that could help while debugging. The log files are necessary for > perftool testsuite test cases also. If the variable KEEP_TEST_LOGS > is set keep the logs, else delete them. Is there perhaps a kunit equivalent of log files so we could keep the implementations as similar as possible? Thanks, Ian > Signed-off-by: Michael Petlan <mpetlan(a)redhat.com> > Signed-off-by: Veronika Molnarova <vmolnaro(a)redhat.com> > Signed-off-by: Jakub Brnak <jbrnak(a)redhat.com> > --- > tools/perf/tests/builtin-test.c | 90 ++++++++++++++++++++++++++++++++ > tools/perf/tests/tests-scripts.c | 3 ++ > tools/perf/tests/tests-scripts.h | 1 + > 3 files changed, 94 insertions(+) > > diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c > index 4e3d2f779b01..89b180798224 100644 > --- a/tools/perf/tests/builtin-test.c > +++ b/tools/perf/tests/builtin-test.c > @@ -6,6 +6,7 @@ > */ > #include <ctype.h> > #include <fcntl.h> > +#include <ftw.h> > #include <errno.h> > #ifdef HAVE_BACKTRACE_SUPPORT > #include <execinfo.h> > @@ -282,6 +283,86 @@ static bool test_exclusive(const struct test_suite *t, int test_case) > return t->test_cases[test_case].exclusive; > } > > +static int delete_file(const char *fpath, const struct stat *sb __maybe_unused, > + int typeflag, struct FTW *ftwbuf) > +{ > + int rv = -1; > + > + /* Stop traversal if going too deep */ > + if (ftwbuf->level > 5) { > + pr_err("Tree traversal reached level %d, stopping.", ftwbuf->level); > + return rv; > + } > + > + /* Remove only expected directories */ > + if (typeflag == FTW_D || typeflag == FTW_DP){ > + const char *dirname = fpath + ftwbuf->base; > + > + if (strcmp(dirname, "logs") && strcmp(dirname, "examples") && > + strcmp(dirname, "header_tar") && strncmp(dirname, "perf_", 5)) { > + pr_err("Unknown directory 
%s", dirname); > + return rv; > + } > + } > + > + /* Attempt to remove the file */ > + rv = remove(fpath); > + if (rv) > + pr_err("Failed to remove file: %s", fpath); > + > + return rv; > +} > + > +static bool create_logs(struct test_suite *t, int pass){ > + bool store_logs = t->priv && ((struct shell_info*)(t->priv))->store_logs; > + if (pass == 1 && (!test_exclusive(t, 0) || sequential || dont_fork)) { > + /* Sequential and non-exclusive tests run on the first pass. */ > + return store_logs; > + } > + else if (pass != 1 && test_exclusive(t, 0) && !sequential && !dont_fork) { > + /* Exclusive tests without sequential run on the second pass. */ > + return store_logs; > + } > + return false; > +} > + > +static char *setup_shell_logs(const char *name) > +{ > + char template[PATH_MAX]; > + char *temp_dir; > + > + if (snprintf(template, PATH_MAX, "/tmp/perf_test_%s.XXXXXX", name) < 0) { > + pr_err("Failed to create log dir template"); > + return NULL; /* Skip the testsuite */ > + } > + > + temp_dir = mkdtemp(template); > + if (temp_dir) { > + setenv("PERFSUITE_RUN_DIR", temp_dir, 1); > + return strdup(temp_dir); > + } > + else { > + pr_err("Failed to create the temporary directory"); > + } > + > + return NULL; /* Skip the testsuite */ > +} > + > +static void cleanup_shell_logs(char *dirname) > +{ > + char *keep_logs = getenv("PERFTEST_KEEP_LOGS"); > + > + /* Check if logs should be kept or do cleanup */ > + if (dirname) { > + if (!keep_logs || strcmp(keep_logs, "y") != 0) { > + nftw(dirname, delete_file, 8, FTW_DEPTH | FTW_PHYS); > + } > + free(dirname); > + } > + > + unsetenv("PERFSUITE_RUN_DIR"); > +} > + > static bool perf_test__matches(const char *desc, int suite_num, int argc, const char *argv[]) > { > int i; > @@ -626,6 +707,7 @@ static int __cmd_test(struct test_suite **suites, int argc, const char *argv[], > for (struct test_suite **t = suites; *t; t++, curr_suite++) { > int curr_test_case; > bool suite_matched = false; > + char *tmpdir = NULL; > > if 
(!perf_test__matches(test_description(*t, -1), curr_suite, argc, argv)) { > /* > @@ -655,6 +737,13 @@ static int __cmd_test(struct test_suite **suites, int argc, const char *argv[], > } > > for (unsigned int run = 0; run < runs_per_test; run++) { > + /* Setup temporary log directories for shell test suites */ > + if (create_logs(*t, pass)) { > + tmpdir = setup_shell_logs((*t)->desc); > + > + if (tmpdir == NULL) /* Couldn't create log dir, skip test suite */ > + ((struct shell_info*)((*t)->priv))->has_setup = FAILED_SETUP; > + } > test_suite__for_each_test_case(*t, curr_test_case) { > if (!suite_matched && > !perf_test__matches(test_description(*t, curr_test_case), > @@ -667,6 +756,7 @@ static int __cmd_test(struct test_suite **suites, int argc, const char *argv[], > goto err_out; > } > } > + cleanup_shell_logs(tmpdir); > } > if (!sequential) { > /* Parallel mode starts tests but doesn't finish them. Do that now. */ > diff --git a/tools/perf/tests/tests-scripts.c b/tools/perf/tests/tests-scripts.c > index d680a878800f..d4e382898a30 100644 > --- a/tools/perf/tests/tests-scripts.c > +++ b/tools/perf/tests/tests-scripts.c > @@ -251,6 +251,7 @@ static struct test_suite* prepare_test_suite(int dir_fd) > > test_info->base_path = strdup_check(dirpath); /* Absolute path to dir */ > test_info->has_setup = NO_SETUP; > + test_info->store_logs = false; > > test_suite->priv = test_info; > test_suite->desc = NULL; > @@ -427,6 +428,8 @@ static void append_suits_in_dir(int dir_fd, > continue; > } > > + /* Store logs for testsuite is sub-directories */ > + ((struct shell_info*)(test_suite->priv))->store_logs = true; > if (is_test_script(fd, SHELL_SETUP)) { /* Check for setup existance */ > char *desc = shell_test__description(fd, SHELL_SETUP); > test_suite->desc = desc; /* Set the suite name by the setup description */ > diff --git a/tools/perf/tests/tests-scripts.h b/tools/perf/tests/tests-scripts.h > index da4dcd26140c..41da0a175e4e 100644 > --- a/tools/perf/tests/tests-scripts.h 
> +++ b/tools/perf/tests/tests-scripts.h > @@ -16,6 +16,7 @@ enum shell_setup { > struct shell_info { > const char *base_path; > enum shell_setup has_setup; > + bool store_logs; > }; > > struct test_suite **create_script_test_suites(void); > -- > 2.50.1 >

5 months, 2 weeks
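
The log-directory lifecycle in the quoted perf patch (create a unique directory with mkdtemp(), then remove it with a depth-first nftw() walk unless the logs should be kept) can be sketched in plain user space as below. This is a simplified illustration under stated assumptions, not the patch itself: make_log_dir(), drop_log_dir() and remove_entry() are hypothetical names, and the sketch leaves out the depth limit and the directory-name allow-list that the quoted delete_file() applies.

```c
#define _XOPEN_SOURCE 700 /* exposes mkdtemp(), nftw() and strdup() */
#include <assert.h>
#include <ftw.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* nftw() callback: with FTW_DEPTH, entries are visited before their parent. */
static int remove_entry(const char *path, const struct stat *sb,
			int typeflag, struct FTW *ftwbuf)
{
	(void)sb;
	(void)typeflag;
	(void)ftwbuf;
	return remove(path); /* a non-zero return stops the walk */
}

/* Create /tmp/<prefix>.XXXXXX and return a malloc'd copy of its path, or NULL. */
static char *make_log_dir(const char *prefix)
{
	char template[PATH_MAX];
	int n;

	assert(prefix != NULL);
	n = snprintf(template, sizeof(template), "/tmp/%s.XXXXXX", prefix);
	if (n < 0 || (size_t)n >= sizeof(template))
		return NULL; /* formatting error or truncated template */
	if (!mkdtemp(template))
		return NULL;
	return strdup(template);
}

/* Remove the tree unless keep is non-zero; the path is freed either way. */
static int drop_log_dir(char *dir, int keep)
{
	int rv = 0;

	if (!dir)
		return 0;
	if (!keep)
		rv = nftw(dir, remove_entry, 8, FTW_DEPTH | FTW_PHYS);
	free(dir);
	return rv;
}
```

A caller would pair the two helpers around a test run, passing a non-zero keep value when an environment variable such as the patch's PERFTEST_KEEP_LOGS asks for the logs to be preserved. FTW_PHYS keeps the walk from following symbolic links out of the temporary directory.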


[PATCH v4 0/6] binder: Set up KUnit tests for alloc

by Tiffany Yang

Hello,

binder_alloc_selftest provides a robust set of checks for the binder
allocator, but it rarely runs because it must hook into a running binder
process and block all other binder threads until it completes. The test
itself is a good candidate for conversion to KUnit, and it can be
further isolated from user processes by using a test-specific lru
freelist instead of the global one. This series converts the selftest
to KUnit to make it less burdensome to run and to set up a foundation
for unit testing future binder_alloc changes.

Thanks,
Tiffany

Tiffany Yang (6):
  binder: Fix selftest page indexing
  binder: Store lru freelist in binder_alloc
  kunit: test: Export kunit_attach_mm()
  binder: Scaffolding for binder_alloc KUnit tests
  binder: Convert binder_alloc selftests to KUnit
  binder: encapsulate individual alloc test cases

 drivers/android/Kconfig                    |  15 +-
 drivers/android/Makefile                   |   2 +-
 drivers/android/binder.c                   |  10 +-
 drivers/android/binder_alloc.c             |  39 +-
 drivers/android/binder_alloc.h             |  14 +-
 drivers/android/binder_alloc_selftest.c    | 306 -----------
 drivers/android/binder_internal.h          |   4 +
 drivers/android/tests/.kunitconfig         |   7 +
 drivers/android/tests/Makefile             |   6 +
 drivers/android/tests/binder_alloc_kunit.c | 572 +++++++++++++++++++++
 include/kunit/test.h                       |  12 +
 lib/kunit/user_alloc.c                     |   4 +-
 12 files changed, 651 insertions(+), 340 deletions(-)
 delete mode 100644 drivers/android/binder_alloc_selftest.c
 create mode 100644 drivers/android/tests/.kunitconfig
 create mode 100644 drivers/android/tests/Makefile
 create mode 100644 drivers/android/tests/binder_alloc_kunit.c

--
2.50.0.727.gbf7dc18ff4-goog

5 months, 2 weeks


[PATCH] selftests/bpf: Add LPM trie microbenchmarks

by Matt Fleming

From: Matt Fleming <mfleming(a)cloudflare.com>

Add benchmarks for the standard set of operations: lookup, update,
delete. Also, include a benchmark for trie_free() which is known to have
terrible performance for maps with many entries.

Benchmarks operate on tries without gaps in the key range, i.e. each
test begins with a trie with valid keys in the range [0, nr_entries).
This is intended to cause maximum branching when traversing the trie.

All measurements are recorded inside the kernel to remove syscall
overhead.

Most benchmarks run an XDP program to generate stats but free needs to
collect latencies using fentry/fexit on map_free_deferred() because it's
not possible to use fentry directly on lpm_trie.c since commit
c83508da5620 ("bpf: Avoid deadlock caused by nested kprobe and fentry
bpf programs") and there's no way to create/destroy a map from within an
XDP program.

Here is example output from an AMD EPYC 9684X 96-Core machine for each
of the benchmarks using a trie with 10K entries and a 32-bit prefix
length, e.g.
$ ./bench lpm-trie-$op \ --prefix_len=32 \ --producers=1 \ --nr_entries=10000 lookup: throughput 7.423 ± 0.023 M ops/s ( 7.423M ops/prod), latency 134.710 ns/op update: throughput 2.643 ± 0.015 M ops/s ( 2.643M ops/prod), latency 378.310 ns/op delete: throughput 0.712 ± 0.008 M ops/s ( 0.712M ops/prod), latency 1405.152 ns/op free: throughput 0.574 ± 0.003 K ops/s ( 0.574K ops/prod), latency 1.743 ms/op Signed-off-by: Matt Fleming <mfleming(a)cloudflare.com> --- tools/testing/selftests/bpf/Makefile | 2 + tools/testing/selftests/bpf/bench.c | 10 + tools/testing/selftests/bpf/bench.h | 1 + .../selftests/bpf/benchs/bench_lpm_trie_map.c | 345 ++++++++++++++++++ .../selftests/bpf/progs/lpm_trie_bench.c | 175 +++++++++ .../selftests/bpf/progs/lpm_trie_map.c | 19 + 6 files changed, 552 insertions(+) create mode 100644 tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c create mode 100644 tools/testing/selftests/bpf/progs/lpm_trie_bench.c create mode 100644 tools/testing/selftests/bpf/progs/lpm_trie_map.c diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 910d8d6402ef..10a5f1d0fa41 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -815,6 +815,7 @@ $(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h $(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h $(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h $(OUTPUT)/bench_sockmap.o: $(OUTPUT)/bench_sockmap_prog.skel.h +$(OUTPUT)/bench_lpm_trie_map.o: $(OUTPUT)/lpm_trie_bench.skel.h $(OUTPUT)/lpm_trie_map.skel.h $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ) $(OUTPUT)/bench: LDLIBS += -lm $(OUTPUT)/bench: $(OUTPUT)/bench.o \ @@ -836,6 +837,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \ $(OUTPUT)/bench_htab_mem.o \ $(OUTPUT)/bench_bpf_crypto.o \ $(OUTPUT)/bench_sockmap.o \ + $(OUTPUT)/bench_lpm_trie_map.o \ # $(call msg,BINARY,,$@) $(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@ 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c index ddd73d06a1eb..fd15f60fd5a8 100644 --- a/tools/testing/selftests/bpf/bench.c +++ b/tools/testing/selftests/bpf/bench.c @@ -284,6 +284,7 @@ extern struct argp bench_htab_mem_argp; extern struct argp bench_trigger_batch_argp; extern struct argp bench_crypto_argp; extern struct argp bench_sockmap_argp; +extern struct argp bench_lpm_trie_map_argp; static const struct argp_child bench_parsers[] = { { &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 }, @@ -299,6 +300,7 @@ static const struct argp_child bench_parsers[] = { { &bench_trigger_batch_argp, 0, "BPF triggering benchmark", 0 }, { &bench_crypto_argp, 0, "bpf crypto benchmark", 0 }, { &bench_sockmap_argp, 0, "bpf sockmap benchmark", 0 }, + { &bench_lpm_trie_map_argp, 0, "LPM trie map benchmark", 0 }, {}, }; @@ -558,6 +560,10 @@ extern const struct bench bench_htab_mem; extern const struct bench bench_crypto_encrypt; extern const struct bench bench_crypto_decrypt; extern const struct bench bench_sockmap; +extern const struct bench bench_lpm_trie_lookup; +extern const struct bench bench_lpm_trie_update; +extern const struct bench bench_lpm_trie_delete; +extern const struct bench bench_lpm_trie_free; static const struct bench *benchs[] = { &bench_count_global, @@ -625,6 +631,10 @@ static const struct bench *benchs[] = { &bench_crypto_encrypt, &bench_crypto_decrypt, &bench_sockmap, + &bench_lpm_trie_lookup, + &bench_lpm_trie_update, + &bench_lpm_trie_delete, + &bench_lpm_trie_free, }; static void find_benchmark(void) diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h index 005c401b3e22..bea323820ffb 100644 --- a/tools/testing/selftests/bpf/bench.h +++ b/tools/testing/selftests/bpf/bench.h @@ -46,6 +46,7 @@ struct bench_res { unsigned long gp_ns; unsigned long gp_ct; unsigned int stime; + unsigned long duration_ns; }; struct bench { diff --git 
a/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c b/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c new file mode 100644 index 000000000000..ddd7d3669e70 --- /dev/null +++ b/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c @@ -0,0 +1,345 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Cloudflare */ + +/* + * All of these benchmarks operate on tries with keys in the range + * [0, args.nr_entries), i.e. there are no gaps or partially filled + * branches of the trie for any key < args.nr_entries. + * + * This gives an idea of worst-case behaviour. + */ + +#include <argp.h> +#include <linux/time64.h> +#include <linux/if_ether.h> +#include "lpm_trie_bench.skel.h" +#include "lpm_trie_map.skel.h" +#include "bench.h" +#include "testing_helpers.h" + +static struct ctx { + struct lpm_trie_bench *bench; +} ctx; + +static struct { + __u32 nr_entries; + __u32 prefixlen; +} args = { + .nr_entries = 10000, + .prefixlen = 32, +}; + +enum { + ARG_NR_ENTRIES = 9000, + ARG_PREFIX_LEN, +}; + +static const struct argp_option opts[] = { + { "nr_entries", ARG_NR_ENTRIES, "NR_ENTRIES", 0, + "Number of unique entries in the LPM trie" }, + { "prefix_len", ARG_PREFIX_LEN, "PREFIX_LEN", 0, + "Number of prefix bits to use in the LPM trie" }, + {}, +}; + +static error_t lpm_parse_arg(int key, char *arg, struct argp_state *state) +{ + long ret; + + switch (key) { + case ARG_NR_ENTRIES: + ret = strtol(arg, NULL, 10); + if (ret < 1 || ret > UINT_MAX) { + fprintf(stderr, "Invalid nr_entries count."); + argp_usage(state); + } + args.nr_entries = ret; + break; + case ARG_PREFIX_LEN: + ret = strtol(arg, NULL, 10); + if (ret < 1 || ret > UINT_MAX) { + fprintf(stderr, "Invalid prefix_len value."); + argp_usage(state); + } + args.prefixlen = ret; + break; + default: + return ARGP_ERR_UNKNOWN; + } + return 0; +} + +const struct argp bench_lpm_trie_map_argp = { + .options = opts, + .parser = lpm_parse_arg, +}; + +static void __lpm_validate(void) +{ + if 
(env.consumer_cnt != 0) { + fprintf(stderr, "benchmark doesn't support consumer!\n"); + exit(1); + } + + if ((1UL << args.prefixlen) < args.nr_entries) { + fprintf(stderr, "prefix_len value too small for nr_entries!\n"); + exit(1); + }; +} + +enum { OP_LOOKUP = 1, OP_UPDATE, OP_DELETE, OP_FREE }; + +static void lpm_delete_validate(void) +{ + __lpm_validate(); + + if (env.producer_cnt != 1) { + fprintf(stderr, + "lpm-trie-delete requires a single producer!\n"); + exit(1); + } +} + +static void lpm_free_validate(void) +{ + __lpm_validate(); + + if (env.producer_cnt != 1) { + fprintf(stderr, "lpm-trie-free requires a single producer!\n"); + exit(1); + } +} + +static void fill_map(int map_fd) +{ + int i, err; + + for (i = 0; i < args.nr_entries; i++) { + struct trie_key { + __u32 prefixlen; + __u32 data; + } key = { args.prefixlen, i }; + __u32 val = 1; + + err = bpf_map_update_elem(map_fd, &key, &val, BPF_NOEXIST); + if (err) { + fprintf(stderr, "failed to add key %d to map: %d\n", + key.data, -err); + exit(1); + } + } +} + +static void __lpm_setup(void) +{ + ctx.bench = lpm_trie_bench__open_and_load(); + if (!ctx.bench) { + fprintf(stderr, "failed to open skeleton\n"); + exit(1); + } + + ctx.bench->bss->nr_entries = args.nr_entries; + ctx.bench->bss->prefixlen = args.prefixlen; + + if (lpm_trie_bench__attach(ctx.bench)) { + fprintf(stderr, "failed to attach skeleton\n"); + exit(1); + } +} + +static void lpm_setup(void) +{ + int fd; + + __lpm_setup(); + + fd = bpf_map__fd(ctx.bench->maps.trie_map); + fill_map(fd); +} + +static void lpm_lookup_setup(void) +{ + lpm_setup(); + + ctx.bench->bss->op = OP_LOOKUP; +} + +static void lpm_update_setup(void) +{ + lpm_setup(); + + ctx.bench->bss->op = OP_UPDATE; +} + +static void lpm_delete_setup(void) +{ + lpm_setup(); + + ctx.bench->bss->op = OP_DELETE; +} + +static void lpm_free_setup(void) +{ + __lpm_setup(); + ctx.bench->bss->op = OP_FREE; +} + +static void lpm_measure(struct bench_res *res) +{ + res->hits = 
atomic_swap(&ctx.bench->bss->hits, 0); + res->duration_ns = atomic_swap(&ctx.bench->bss->duration_ns, 0); +} + +/* For LOOKUP, UPDATE, and DELETE */ +static void *lpm_producer(void *unused __always_unused) +{ + int err; + char in[ETH_HLEN]; /* unused */ + + LIBBPF_OPTS(bpf_test_run_opts, opts, .data_in = in, + .data_size_in = sizeof(in), .repeat = 1, ); + + while (true) { + int fd = bpf_program__fd(ctx.bench->progs.run_bench); + err = bpf_prog_test_run_opts(fd, &opts); + if (err) { + fprintf(stderr, "failed to run BPF prog: %d\n", err); + exit(1); + } + + if (opts.retval < 0) { + fprintf(stderr, "BPF prog returned error: %d\n", + opts.retval); + exit(1); + } + + if (ctx.bench->bss->op == OP_DELETE && opts.retval == 1) { + /* trie_map needs to be refilled */ + fill_map(bpf_map__fd(ctx.bench->maps.trie_map)); + } + } + + return NULL; +} + +static void *lpm_free_producer(void *unused __always_unused) +{ + while (true) { + struct lpm_trie_map *skel; + + skel = lpm_trie_map__open_and_load(); + if (!skel) { + fprintf(stderr, "failed to open skeleton\n"); + exit(1); + } + + fill_map(bpf_map__fd(skel->maps.trie_free_map)); + lpm_trie_map__destroy(skel); + } + + return NULL; +} + +static __always_inline double duration_ms(struct bench_res *res) +{ + if (!res->hits) + return 0.0; + + return res->duration_ns / res->hits / NSEC_PER_MSEC; +} + +static void free_ops_report_progress(int iter, struct bench_res *res, + long delta_ns) +{ + double hits_per_sec, hits_per_prod; + double rate_divisor = 1000.0; + char rate = 'K'; + + hits_per_sec = res->hits / (res->duration_ns / (double)NSEC_PER_SEC) / + rate_divisor; + hits_per_prod = hits_per_sec / env.producer_cnt; + + printf("Iter %3d (%7.3lfus): ", iter, + (delta_ns - NSEC_PER_SEC) / 1000.0); + printf("hits %8.3lf%c/s (%7.3lf%c/prod)\n", hits_per_sec, rate, + hits_per_prod, rate); +} + +static void free_ops_report_final(struct bench_res res[], int res_cnt) +{ + double hits_mean = 0.0, hits_stddev = 0.0; + double lat_divisor = 
1000000.0; + double rate_divisor = 1000.0; + const char *unit = "ms"; + double latency = 0.0; + char rate = 'K'; + int i; + + for (i = 0; i < res_cnt; i++) { + double val = res[i].hits / rate_divisor / + (res[i].duration_ns / (double)NSEC_PER_SEC); + hits_mean += val / (0.0 + res_cnt); + latency += res[i].duration_ns / res[i].hits / (0.0 + res_cnt); + } + + if (res_cnt > 1) { + for (i = 0; i < res_cnt; i++) { + double val = + res[i].hits / rate_divisor / + (res[i].duration_ns / (double)NSEC_PER_SEC); + hits_stddev += (hits_mean - val) * (hits_mean - val) / + (res_cnt - 1.0); + } + + hits_stddev = sqrt(hits_stddev); + } + printf("Summary: throughput %8.3lf \u00B1 %5.3lf %c ops/s (%7.3lf%c ops/prod), ", + hits_mean, hits_stddev, rate, hits_mean / env.producer_cnt, + rate); + printf("latency %8.3lf %s/op\n", + latency / lat_divisor / env.producer_cnt, unit); +} + +const struct bench bench_lpm_trie_lookup = { + .name = "lpm-trie-lookup", + .argp = &bench_lpm_trie_map_argp, + .validate = __lpm_validate, + .setup = lpm_lookup_setup, + .producer_thread = lpm_producer, + .measure = lpm_measure, + .report_progress = ops_report_progress, + .report_final = ops_report_final, +}; + +const struct bench bench_lpm_trie_update = { + .name = "lpm-trie-update", + .argp = &bench_lpm_trie_map_argp, + .validate = __lpm_validate, + .setup = lpm_update_setup, + .producer_thread = lpm_producer, + .measure = lpm_measure, + .report_progress = ops_report_progress, + .report_final = ops_report_final, +}; + +const struct bench bench_lpm_trie_delete = { + .name = "lpm-trie-delete", + .argp = &bench_lpm_trie_map_argp, + .validate = lpm_delete_validate, + .setup = lpm_delete_setup, + .producer_thread = lpm_producer, + .measure = lpm_measure, + .report_progress = ops_report_progress, + .report_final = ops_report_final, +}; + +const struct bench bench_lpm_trie_free = { + .name = "lpm-trie-free", + .argp = &bench_lpm_trie_map_argp, + .validate = lpm_free_validate, + .setup = lpm_free_setup, + 
.producer_thread = lpm_free_producer, + .measure = lpm_measure, + .report_progress = free_ops_report_progress, + .report_final = free_ops_report_final, +}; diff --git a/tools/testing/selftests/bpf/progs/lpm_trie_bench.c b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c new file mode 100644 index 000000000000..c335718cc240 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c @@ -0,0 +1,175 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2025 Cloudflare */ + +#include <vmlinux.h> +#include <bpf/bpf_tracing.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_core_read.h> +#include "bpf_misc.h" + +#define BPF_OBJ_NAME_LEN 16U +#define MAX_ENTRIES 100000000 +#define NR_LOOPS 10000 + +struct trie_key { + __u32 prefixlen; + __u32 data; +}; + +char _license[] SEC("license") = "GPL"; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 512); + __type(key, struct bpf_map *); + __type(value, __u64); +} latency_free_start SEC(".maps"); + +/* Filled by userspace. 
See fill_map() in bench_lpm_trie_map.c */ +struct { + __uint(type, BPF_MAP_TYPE_LPM_TRIE); + __type(key, struct trie_key); + __type(value, __u32); + __uint(map_flags, BPF_F_NO_PREALLOC); + __uint(max_entries, MAX_ENTRIES); +} trie_map SEC(".maps"); + +long hits; +long duration_ns; + +/* Configured from userspace */ +__u64 nr_entries; +__u32 prefixlen; +__u8 op; + +static __always_inline void atomic_inc(long *cnt) +{ + __atomic_add_fetch(cnt, 1, __ATOMIC_SEQ_CST); +} + +static __always_inline long atomic_swap(long *cnt, long val) +{ + return __atomic_exchange_n(cnt, val, __ATOMIC_SEQ_CST); +} + +SEC("fentry/bpf_map_free_deferred") +int BPF_PROG(trie_free_entry, struct work_struct *work) +{ + struct bpf_map *map = container_of(work, struct bpf_map, work); + const char *name; + u32 map_type; + __u64 val; + + map_type = BPF_CORE_READ(map, map_type); + if (map_type != BPF_MAP_TYPE_LPM_TRIE) + return 0; + + /* + * Ideally we'd have access to the map ID but that's already + * freed before we enter trie_free(). 
+ */ + name = BPF_CORE_READ(map, name); + if (bpf_strncmp(name, BPF_OBJ_NAME_LEN, "trie_free_map")) + return 0; + + val = bpf_ktime_get_ns(); + bpf_map_update_elem(&latency_free_start, &map, &val, BPF_ANY); + + return 0; +} + +SEC("fexit/bpf_map_free_deferred") +int BPF_PROG(trie_free_exit, struct work_struct *work) +{ + struct bpf_map *map = container_of(work, struct bpf_map, work); + __u64 *val; + + val = bpf_map_lookup_elem(&latency_free_start, &map); + if (val) { + __sync_add_and_fetch(&duration_ns, bpf_ktime_get_ns() - *val); + atomic_inc(&hits); + bpf_map_delete_elem(&latency_free_start, &map); + } + + return 0; +} + +static void gen_random_key(struct trie_key *key) +{ + key->prefixlen = prefixlen; + key->data = bpf_get_prandom_u32() % nr_entries; +} + +static int lookup(__u32 index, __u32 *unused) +{ + struct trie_key key; + + gen_random_key(&key); + bpf_map_lookup_elem(&trie_map, &key); + return 0; +} + +static int update(__u32 index, __u32 *unused) +{ + struct trie_key key; + u32 val = bpf_get_prandom_u32(); + + gen_random_key(&key); + bpf_map_update_elem(&trie_map, &key, &val, BPF_EXIST); + return 0; +} + +long deleted_entries; +long refill; + +static int delete (__u32 index, __u32 *unused) +{ + struct trie_key key = { + .data = deleted_entries, + .prefixlen = prefixlen, + }; + + bpf_map_delete_elem(&trie_map, &key); + atomic_inc(&deleted_entries); + + /* Do we need to refill the map? 
*/ + if (deleted_entries >= nr_entries) { + atomic_swap(&refill, 1); + atomic_swap(&deleted_entries, 0); + return 1; + } + + return 0; +} + +SEC("xdp") +int BPF_PROG(run_bench) +{ + u64 start, delta; + bool need_refill = false; + + start = bpf_ktime_get_ns(); + + switch (op) { + case 1: + bpf_loop(NR_LOOPS, lookup, NULL, 0); + break; + case 2: + bpf_loop(NR_LOOPS, update, NULL, 0); + break; + case 3: + bpf_loop(NR_LOOPS, delete, NULL, 0); + need_refill = atomic_swap(&refill, 0); + break; + default: + bpf_printk("invalid benchmark operation\n"); + return -1; + } + + delta = bpf_ktime_get_ns() - start; + + __sync_add_and_fetch(&hits, NR_LOOPS); + __sync_add_and_fetch(&duration_ns, delta); + + return need_refill; +} diff --git a/tools/testing/selftests/bpf/progs/lpm_trie_map.c b/tools/testing/selftests/bpf/progs/lpm_trie_map.c new file mode 100644 index 000000000000..2ab43e2cd6c6 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/lpm_trie_map.c @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +#define MAX_ENTRIES 100000000 + +struct trie_key { + __u32 prefixlen; + __u32 data; +}; + +struct { + __uint(type, BPF_MAP_TYPE_LPM_TRIE); + __type(key, struct trie_key); + __type(value, __u32); + __uint(map_flags, BPF_F_NO_PREALLOC); + __uint(max_entries, MAX_ENTRIES); +} trie_free_map SEC(".maps"); -- 2.34.1
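
The throughput and latency columns in the example output of the commit message are both derived from the same two in-kernel counters, hits and duration_ns. For the single-producer runs shown, the figures are consistent with latency being duration_ns / hits and throughput being its reciprocal (for example 1e9 / 134.710 ns is roughly 7.423 M ops/s). The sketch below shows that arithmetic in floating point; latency_ns() and ops_per_sec() are hypothetical helper names, not functions from the patch.

```c
#include <stdio.h>

#define NS_PER_SEC 1000000000.0

/* Mean latency in ns per operation; 0.0 when nothing was measured. */
static double latency_ns(unsigned long hits, unsigned long duration_ns)
{
	return hits ? (double)duration_ns / (double)hits : 0.0;
}

/* Operations per second over the time measured inside the kernel. */
static double ops_per_sec(unsigned long hits, unsigned long duration_ns)
{
	if (!duration_ns)
		return 0.0;
	return (double)hits / ((double)duration_ns / NS_PER_SEC);
}
```

As an illustration with made-up counters, 10000 hits over 1347100 ns give 134.71 ns/op and about 7.42 M ops/s, which is in line with the lookup row above. Converting to double before dividing avoids the truncation that pure integer division of the counters would introduce.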

5 months, 2 weeks


[PATCH bpf-next v2] selftests/bpf: Add LPM trie microbenchmarks

by Matt Fleming

From: Matt Fleming <mfleming(a)cloudflare.com>

Add benchmarks for the standard set of operations: lookup, update,
delete. Also, include a benchmark for trie_free() which is known to have
terrible performance for maps with many entries.

Benchmarks operate on tries without gaps in the key range, i.e. each
test begins with a trie with valid keys in the range [0, nr_entries).
This is intended to cause maximum branching when traversing the trie.

All measurements are recorded inside the kernel to remove syscall
overhead.

Most benchmarks run an XDP program to generate stats but free needs to
collect latencies using fentry/fexit on map_free_deferred() because it's
not possible to use fentry directly on lpm_trie.c since commit
c83508da5620 ("bpf: Avoid deadlock caused by nested kprobe and fentry
bpf programs") and there's no way to create/destroy a map from within an
XDP program.

Here is example output from an AMD EPYC 9684X 96-Core machine for each
of the benchmarks using a trie with 10K entries and a 32-bit prefix
length, e.g.

  $ ./bench lpm-trie-$op \
  	--prefix_len=32 \
  	--producers=1 \
  	--nr_entries=10000

  lookup: throughput 7.423 ± 0.023 M ops/s ( 7.423M ops/prod), latency 134.710 ns/op
  update: throughput 2.643 ± 0.015 M ops/s ( 2.643M ops/prod), latency 378.310 ns/op
  delete: throughput 0.712 ± 0.008 M ops/s ( 0.712M ops/prod), latency 1405.152 ns/op
  free: throughput 0.574 ± 0.003 K ops/s ( 0.574K ops/prod), latency 1.743 ms/op

Tested-by: Jesper Dangaard Brouer <hawk(a)kernel.org>
Reviewed-by: Jesper Dangaard Brouer <hawk(a)kernel.org>
Signed-off-by: Matt Fleming <mfleming(a)cloudflare.com>
---
Changes in v2:
- Add Jesper's Tested-by and Reviewed-by tags
- Remove use of atomic_*() in favour of __sync_add_and_fetch()
- Use a file-local 'deleted_entries' in the DELETE op benchmark and add
  a comment explaining why non-atomic accesses are safe.
- Bump 'hits' with the number of bpf_loop() loops actually executed

 tools/testing/selftests/bpf/Makefile          |   2 +
 tools/testing/selftests/bpf/bench.c           |  10 +
 tools/testing/selftests/bpf/bench.h           |   1 +
 .../selftests/bpf/benchs/bench_lpm_trie_map.c | 337 ++++++++++++++++++
 .../selftests/bpf/progs/lpm_trie_bench.c      | 171 +++++++++
 .../selftests/bpf/progs/lpm_trie_map.c        |  19 +
 6 files changed, 540 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/lpm_trie_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/lpm_trie_map.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 910d8d6402ef..10a5f1d0fa41 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -815,6 +815,7 @@ $(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
 $(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
 $(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h
 $(OUTPUT)/bench_sockmap.o: $(OUTPUT)/bench_sockmap_prog.skel.h
+$(OUTPUT)/bench_lpm_trie_map.o: $(OUTPUT)/lpm_trie_bench.skel.h $(OUTPUT)/lpm_trie_map.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -836,6 +837,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_htab_mem.o \
 		 $(OUTPUT)/bench_bpf_crypto.o \
 		 $(OUTPUT)/bench_sockmap.o \
+		 $(OUTPUT)/bench_lpm_trie_map.o \
 		 #
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index ddd73d06a1eb..fd15f60fd5a8 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -284,6 +284,7 @@ extern struct argp bench_htab_mem_argp;
 extern struct argp bench_trigger_batch_argp;
 extern struct argp bench_crypto_argp;
 extern struct argp bench_sockmap_argp;
+extern struct argp bench_lpm_trie_map_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -299,6 +300,7 @@ static const struct argp_child bench_parsers[] = {
 	{ &bench_trigger_batch_argp, 0, "BPF triggering benchmark", 0 },
 	{ &bench_crypto_argp, 0, "bpf crypto benchmark", 0 },
 	{ &bench_sockmap_argp, 0, "bpf sockmap benchmark", 0 },
+	{ &bench_lpm_trie_map_argp, 0, "LPM trie map benchmark", 0 },
 	{},
 };
 
@@ -558,6 +560,10 @@ extern const struct bench bench_htab_mem;
 extern const struct bench bench_crypto_encrypt;
 extern const struct bench bench_crypto_decrypt;
 extern const struct bench bench_sockmap;
+extern const struct bench bench_lpm_trie_lookup;
+extern const struct bench bench_lpm_trie_update;
+extern const struct bench bench_lpm_trie_delete;
+extern const struct bench bench_lpm_trie_free;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -625,6 +631,10 @@ static const struct bench *benchs[] = {
 	&bench_crypto_encrypt,
 	&bench_crypto_decrypt,
 	&bench_sockmap,
+	&bench_lpm_trie_lookup,
+	&bench_lpm_trie_update,
+	&bench_lpm_trie_delete,
+	&bench_lpm_trie_free,
 };
 
 static void find_benchmark(void)
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index 005c401b3e22..bea323820ffb 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -46,6 +46,7 @@ struct bench_res {
 	unsigned long gp_ns;
 	unsigned long gp_ct;
 	unsigned int stime;
+	unsigned long duration_ns;
 };
 
 struct bench {
diff --git a/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c b/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c
new file mode 100644
index 000000000000..435b5c7ceee9
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c
@@ -0,0 +1,337 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Cloudflare */
+
+/*
+ * All of these benchmarks operate on tries with keys in the range
+ * [0, args.nr_entries), i.e. there are no gaps or partially filled
+ * branches of the trie for any key < args.nr_entries.
+ *
+ * This gives an idea of worst-case behaviour.
+ */
+
+#include <argp.h>
+#include <linux/time64.h>
+#include <linux/if_ether.h>
+#include "lpm_trie_bench.skel.h"
+#include "lpm_trie_map.skel.h"
+#include "bench.h"
+#include "testing_helpers.h"
+
+static struct ctx {
+	struct lpm_trie_bench *bench;
+} ctx;
+
+static struct {
+	__u32 nr_entries;
+	__u32 prefixlen;
+} args = {
+	.nr_entries = 10000,
+	.prefixlen = 32,
+};
+
+enum {
+	ARG_NR_ENTRIES = 9000,
+	ARG_PREFIX_LEN,
+};
+
+static const struct argp_option opts[] = {
+	{ "nr_entries", ARG_NR_ENTRIES, "NR_ENTRIES", 0,
+	  "Number of unique entries in the LPM trie" },
+	{ "prefix_len", ARG_PREFIX_LEN, "PREFIX_LEN", 0,
+	  "Number of prefix bits to use in the LPM trie" },
+	{},
+};
+
+static error_t lpm_parse_arg(int key, char *arg, struct argp_state *state)
+{
+	long ret;
+
+	switch (key) {
+	case ARG_NR_ENTRIES:
+		ret = strtol(arg, NULL, 10);
+		if (ret < 1 || ret > UINT_MAX) {
+			fprintf(stderr, "Invalid nr_entries count.");
+			argp_usage(state);
+		}
+		args.nr_entries = ret;
+		break;
+	case ARG_PREFIX_LEN:
+		ret = strtol(arg, NULL, 10);
+		if (ret < 1 || ret > UINT_MAX) {
+			fprintf(stderr, "Invalid prefix_len value.");
+			argp_usage(state);
+		}
+		args.prefixlen = ret;
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+const struct argp bench_lpm_trie_map_argp = {
+	.options = opts,
+	.parser = lpm_parse_arg,
+};
+
+static void __lpm_validate(void)
+{
+	if (env.consumer_cnt != 0) {
+		fprintf(stderr, "benchmark doesn't support consumer!\n");
+		exit(1);
+	}
+
+	if ((1UL << args.prefixlen) < args.nr_entries) {
+		fprintf(stderr, "prefix_len value too small for nr_entries!\n");
+		exit(1);
+	};
+}
+
+enum { OP_LOOKUP = 1, OP_UPDATE, OP_DELETE, OP_FREE };
+
+static void lpm_delete_validate(void)
+{
+	__lpm_validate();
+
+	if (env.producer_cnt != 1) {
+		fprintf(stderr,
+			"lpm-trie-delete requires a single producer!\n");
+		exit(1);
+	}
+}
+
+static void lpm_free_validate(void)
+{
+	__lpm_validate();
+
+	if (env.producer_cnt != 1) {
+		fprintf(stderr, "lpm-trie-free requires a single producer!\n");
+		exit(1);
+	}
+}
+
+static void fill_map(int map_fd)
+{
+	int i, err;
+
+	for (i = 0; i < args.nr_entries; i++) {
+		struct trie_key {
+			__u32 prefixlen;
+			__u32 data;
+		} key = { args.prefixlen, i };
+		__u32 val = 1;
+
+		err = bpf_map_update_elem(map_fd, &key, &val, BPF_NOEXIST);
+		if (err) {
+			fprintf(stderr, "failed to add key %d to map: %d\n",
+				key.data, -err);
+			exit(1);
+		}
+	}
+}
+
+static void __lpm_setup(void)
+{
+	ctx.bench = lpm_trie_bench__open_and_load();
+	if (!ctx.bench) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	ctx.bench->bss->nr_entries = args.nr_entries;
+	ctx.bench->bss->prefixlen = args.prefixlen;
+
+	if (lpm_trie_bench__attach(ctx.bench)) {
+		fprintf(stderr, "failed to attach skeleton\n");
+		exit(1);
+	}
+}
+
+static void lpm_setup(void)
+{
+	int fd;
+
+	__lpm_setup();
+
+	fd = bpf_map__fd(ctx.bench->maps.trie_map);
+	fill_map(fd);
+}
+
+static void lpm_lookup_setup(void)
+{
+	lpm_setup();
+
+	ctx.bench->bss->op = OP_LOOKUP;
+}
+
+static void lpm_update_setup(void)
+{
+	lpm_setup();
+
+	ctx.bench->bss->op = OP_UPDATE;
+}
+
+static void lpm_delete_setup(void)
+{
+	lpm_setup();
+
+	ctx.bench->bss->op = OP_DELETE;
+}
+
+static void lpm_free_setup(void)
+{
+	__lpm_setup();
+	ctx.bench->bss->op = OP_FREE;
+}
+
+static void lpm_measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&ctx.bench->bss->hits, 0);
+	res->duration_ns = atomic_swap(&ctx.bench->bss->duration_ns, 0);
+}
+
+/* For LOOKUP, UPDATE, and DELETE */
+static void *lpm_producer(void *unused __always_unused)
+{
+	int err;
+	char in[ETH_HLEN]; /* unused */
+
+	LIBBPF_OPTS(bpf_test_run_opts, opts, .data_in = in,
+		    .data_size_in = sizeof(in), .repeat = 1, );
+
+	while (true) {
+		int fd = bpf_program__fd(ctx.bench->progs.run_bench);
+		err = bpf_prog_test_run_opts(fd, &opts);
+		if (err) {
+			fprintf(stderr, "failed to run BPF prog: %d\n", err);
+			exit(1);
+		}
+
+		if (opts.retval < 0) {
+			fprintf(stderr, "BPF prog returned error: %d\n",
+				opts.retval);
+			exit(1);
+		}
+
+		if (ctx.bench->bss->op == OP_DELETE && opts.retval == 1) {
+			/* trie_map needs to be refilled */
+			fill_map(bpf_map__fd(ctx.bench->maps.trie_map));
+		}
+	}
+
+	return NULL;
+}
+
+static void *lpm_free_producer(void *unused __always_unused)
+{
+	while (true) {
+		struct lpm_trie_map *skel;
+
+		skel = lpm_trie_map__open_and_load();
+		if (!skel) {
+			fprintf(stderr, "failed to open skeleton\n");
+			exit(1);
+		}
+
+		fill_map(bpf_map__fd(skel->maps.trie_free_map));
+		lpm_trie_map__destroy(skel);
+	}
+
+	return NULL;
+}
+
+static void free_ops_report_progress(int iter, struct bench_res *res,
+				     long delta_ns)
+{
+	double hits_per_sec, hits_per_prod;
+	double rate_divisor = 1000.0;
+	char rate = 'K';
+
+	hits_per_sec = res->hits / (res->duration_ns / (double)NSEC_PER_SEC) /
+		       rate_divisor;
+	hits_per_prod = hits_per_sec / env.producer_cnt;
+
+	printf("Iter %3d (%7.3lfus): ", iter,
+	       (delta_ns - NSEC_PER_SEC) / 1000.0);
+	printf("hits %8.3lf%c/s (%7.3lf%c/prod)\n", hits_per_sec, rate,
+	       hits_per_prod, rate);
+}
+
+static void free_ops_report_final(struct bench_res res[], int res_cnt)
+{
+	double hits_mean = 0.0, hits_stddev = 0.0;
+	double lat_divisor = 1000000.0;
+	double rate_divisor = 1000.0;
+	const char *unit = "ms";
+	double latency = 0.0;
+	char rate = 'K';
+	int i;
+
+	for (i = 0; i < res_cnt; i++) {
+		double val = res[i].hits / rate_divisor /
+			     (res[i].duration_ns / (double)NSEC_PER_SEC);
+		hits_mean += val / (0.0 + res_cnt);
+		latency += res[i].duration_ns / res[i].hits / (0.0 + res_cnt);
+	}
+
+	if (res_cnt > 1) {
+		for (i = 0; i < res_cnt; i++) {
+			double val =
+				res[i].hits / rate_divisor /
+				(res[i].duration_ns / (double)NSEC_PER_SEC);
+			hits_stddev += (hits_mean - val) * (hits_mean - val) /
+				       (res_cnt - 1.0);
+		}
+
+		hits_stddev = sqrt(hits_stddev);
+	}
+	printf("Summary: throughput %8.3lf \u00B1 %5.3lf %c ops/s (%7.3lf%c ops/prod), ",
+	       hits_mean, hits_stddev, rate, hits_mean / env.producer_cnt,
+	       rate);
+	printf("latency %8.3lf %s/op\n",
+	       latency / lat_divisor / env.producer_cnt, unit);
+}
+
+const struct bench bench_lpm_trie_lookup = {
+	.name = "lpm-trie-lookup",
+	.argp = &bench_lpm_trie_map_argp,
+	.validate = __lpm_validate,
+	.setup = lpm_lookup_setup,
+	.producer_thread = lpm_producer,
+	.measure = lpm_measure,
+	.report_progress = ops_report_progress,
+	.report_final = ops_report_final,
+};
+
+const struct bench bench_lpm_trie_update = {
+	.name = "lpm-trie-update",
+	.argp = &bench_lpm_trie_map_argp,
+	.validate = __lpm_validate,
+	.setup = lpm_update_setup,
+	.producer_thread = lpm_producer,
+	.measure = lpm_measure,
+	.report_progress = ops_report_progress,
+	.report_final = ops_report_final,
+};
+
+const struct bench bench_lpm_trie_delete = {
+	.name = "lpm-trie-delete",
+	.argp = &bench_lpm_trie_map_argp,
+	.validate = lpm_delete_validate,
+	.setup = lpm_delete_setup,
+	.producer_thread = lpm_producer,
+	.measure = lpm_measure,
+	.report_progress = ops_report_progress,
+	.report_final = ops_report_final,
+};
+
+const struct bench bench_lpm_trie_free = {
+	.name = "lpm-trie-free",
+	.argp = &bench_lpm_trie_map_argp,
+	.validate = lpm_free_validate,
+	.setup = lpm_free_setup,
+	.producer_thread = lpm_free_producer,
+	.measure = lpm_measure,
+	.report_progress = free_ops_report_progress,
+	.report_final = free_ops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/progs/lpm_trie_bench.c b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c
new file mode 100644
index 000000000000..1a138e21e156
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c
@@ -0,0 +1,171 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Cloudflare */
+
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_misc.h"
+
+#define BPF_OBJ_NAME_LEN 16U
+#define MAX_ENTRIES 100000000
+#define NR_LOOPS 10000
+
+struct trie_key {
+	__u32 prefixlen;
+	__u32 data;
+};
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 512);
+	__type(key, struct bpf_map *);
+	__type(value, __u64);
+} latency_free_start SEC(".maps");
+
+/* Filled by userspace. See fill_map() in bench_lpm_trie_map.c */
+struct {
+	__uint(type, BPF_MAP_TYPE_LPM_TRIE);
+	__type(key, struct trie_key);
+	__type(value, __u32);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__uint(max_entries, MAX_ENTRIES);
+} trie_map SEC(".maps");
+
+long hits;
+long duration_ns;
+
+/* Configured from userspace */
+__u32 nr_entries;
+__u32 prefixlen;
+__u8 op;
+
+SEC("fentry/bpf_map_free_deferred")
+int BPF_PROG(trie_free_entry, struct work_struct *work)
+{
+	struct bpf_map *map = container_of(work, struct bpf_map, work);
+	const char *name;
+	u32 map_type;
+	__u64 val;
+
+	map_type = BPF_CORE_READ(map, map_type);
+	if (map_type != BPF_MAP_TYPE_LPM_TRIE)
+		return 0;
+
+	/*
+	 * Ideally we'd have access to the map ID but that's already
+	 * freed before we enter trie_free().
+	 */
+	name = BPF_CORE_READ(map, name);
+	if (bpf_strncmp(name, BPF_OBJ_NAME_LEN, "trie_free_map"))
+		return 0;
+
+	val = bpf_ktime_get_ns();
+	bpf_map_update_elem(&latency_free_start, &map, &val, BPF_ANY);
+
+	return 0;
+}
+
+SEC("fexit/bpf_map_free_deferred")
+int BPF_PROG(trie_free_exit, struct work_struct *work)
+{
+	struct bpf_map *map = container_of(work, struct bpf_map, work);
+	__u64 *val;
+
+	val = bpf_map_lookup_elem(&latency_free_start, &map);
+	if (val) {
+		__sync_add_and_fetch(&duration_ns, bpf_ktime_get_ns() - *val);
+		__sync_add_and_fetch(&hits, 1);
+		bpf_map_delete_elem(&latency_free_start, &map);
+	}
+
+	return 0;
+}
+
+static void gen_random_key(struct trie_key *key)
+{
+	key->prefixlen = prefixlen;
+	key->data = bpf_get_prandom_u32() % nr_entries;
+}
+
+static int lookup(__u32 index, __u32 *unused)
+{
+	struct trie_key key;
+
+	gen_random_key(&key);
+	bpf_map_lookup_elem(&trie_map, &key);
+	return 0;
+}
+
+static int update(__u32 index, __u32 *unused)
+{
+	struct trie_key key;
+	u32 val = bpf_get_prandom_u32();
+
+	gen_random_key(&key);
+	bpf_map_update_elem(&trie_map, &key, &val, BPF_EXIST);
+	return 0;
+}
+
+static __u32 deleted_entries;
+
+static int delete (__u32 index, bool *need_refill)
+{
+	struct trie_key key = {
+		.data = deleted_entries,
+		.prefixlen = prefixlen,
+	};
+
+	bpf_map_delete_elem(&trie_map, &key);
+
+	/* Do we need to refill the map? */
+	if (++deleted_entries == nr_entries) {
+		/*
+		 * Atomicity isn't required because DELETE only supports
+		 * one producer running concurrently. What we need is a
+		 * way to track how many entries have been deleted from
+		 * the trie between consecutive invocations of the BPF
+		 * prog because a single bpf_loop() call might not
+		 * delete all entries, e.g. when NR_LOOPS < nr_entries.
+		 */
+		deleted_entries = 0;
+		*need_refill = true;
+		return 1;
+	}
+
+	return 0;
+}
+
+SEC("xdp")
+int BPF_PROG(run_bench)
+{
+	bool need_refill = false;
+	u64 start, delta;
+	int loops;
+
+	start = bpf_ktime_get_ns();
+
+	switch (op) {
+	case 1:
+		loops = bpf_loop(NR_LOOPS, lookup, NULL, 0);
+		break;
+	case 2:
+		loops = bpf_loop(NR_LOOPS, update, NULL, 0);
+		break;
+	case 3:
+		loops = bpf_loop(NR_LOOPS, delete, &need_refill, 0);
+		break;
+	default:
+		bpf_printk("invalid benchmark operation\n");
+		return -1;
+	}
+
+	delta = bpf_ktime_get_ns() - start;
+
+	__sync_add_and_fetch(&duration_ns, delta);
+	__sync_add_and_fetch(&hits, loops);
+
+	return need_refill;
+}
diff --git a/tools/testing/selftests/bpf/progs/lpm_trie_map.c b/tools/testing/selftests/bpf/progs/lpm_trie_map.c
new file mode 100644
index 000000000000..2ab43e2cd6c6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/lpm_trie_map.c
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define MAX_ENTRIES 100000000
+
+struct trie_key {
+	__u32 prefixlen;
+	__u32 data;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_LPM_TRIE);
+	__type(key, struct trie_key);
+	__type(value, __u32);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__uint(max_entries, MAX_ENTRIES);
+} trie_free_map SEC(".maps");
-- 
2.34.1
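
The throughput and latency figures quoted in the commit message are two views of the same pair of counters, hits and duration_ns. A minimal sketch of that arithmetic, assuming a single producer; the struct and function names (sample, throughput_mops, latency_ns_per_op) are illustrative and do not appear in the patch:

```c
#include <stdint.h>

/* Hypothetical container for one measurement interval (not from the patch). */
struct sample {
	uint64_t hits;        /* operations completed in the interval */
	uint64_t duration_ns; /* time spent inside the measured region */
};

/* Throughput in millions of operations per second. */
static double throughput_mops(const struct sample *s)
{
	if (s->duration_ns == 0)
		return 0.0;
	return (double)s->hits / ((double)s->duration_ns / 1e9) / 1e6;
}

/* Mean latency of a single operation in nanoseconds. */
static double latency_ns_per_op(const struct sample *s)
{
	if (s->hits == 0)
		return 0.0;
	return (double)s->duration_ns / (double)s->hits;
}
```

For instance, 10000 lookups taking 1347100 ns in total give a latency of 134.71 ns/op and roughly 7.42 M ops/s, in line with the lookup figures quoted in the commit message. Note that the division is done in floating point, so sub-nanosecond precision is not truncated.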

5 months, 2 weeks

[PATCH nf-next v4 0/2] Add IPIP flowtable SW acceleration

by Lorenzo Bianconi

Introduce SW acceleration for IPIP tunnels in the netfilter flowtable
infrastructure.

---
Changes in v4:
- Use the hash value of the saddr, daddr and protocol of outer IP header
  as encapsulation id.
- Link to v3: https://lore.kernel.org/r/20250703-nf-flowtable-ipip-v3-0-880afd319b9f@kern…

Changes in v3:
- Add outer IP header sanity checks
- target nf-next tree instead of net-next
- Link to v2: https://lore.kernel.org/r/20250627-nf-flowtable-ipip-v2-0-c713003ce75b@kern…

Changes in v2:
- Introduce IPIP flowtable selftest
- Link to v1: https://lore.kernel.org/r/20250623-nf-flowtable-ipip-v1-1-2853596e3941@kern…

---
Lorenzo Bianconi (2):
      net: netfilter: Add IPIP flowtable SW acceleration
      selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest

 include/linux/netdevice.h                          |  1 +
 net/ipv4/ipip.c                                    | 25 +++++++++++
 net/netfilter/nf_flow_table_ip.c                   | 48 +++++++++++++++++++++-
 net/netfilter/nft_flow_offload.c                   |  1 +
 .../selftests/net/netfilter/nft_flowtable.sh       | 40 ++++++++++++++++++
 5 files changed, 113 insertions(+), 2 deletions(-)
---
base-commit: d61f6cb6f6ef3c70d2ccc0d9c85c508cb8017da9
change-id: 20250623-nf-flowtable-ipip-1b3d7b08d067

Best regards,
-- 
Lorenzo Bianconi <lorenzo(a)kernel.org>
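
The v4 changelog describes deriving the encapsulation id from a hash of the outer header's saddr, daddr and protocol. The sketch below only illustrates that idea. It folds the three fields with 32-bit FNV-1a, which is an assumption made for the example and not necessarily the hash the series uses; outer_hdr and encap_id() are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical view of the outer IPv4 header fields named in the changelog. */
struct outer_hdr {
	uint32_t saddr;    /* outer source address */
	uint32_t daddr;    /* outer destination address */
	uint8_t  protocol; /* outer protocol, e.g. 4 for IPIP */
};

/* Mix one byte into a 32-bit FNV-1a state. */
static uint32_t fnv1a_byte(uint32_t h, uint8_t b)
{
	return (h ^ b) * 16777619u;
}

/* Mix a 32-bit word, least significant byte first. */
static uint32_t fnv1a_u32(uint32_t h, uint32_t v)
{
	for (int i = 0; i < 4; i++)
		h = fnv1a_byte(h, (uint8_t)(v >> (8 * i)));
	return h;
}

/*
 * Assumed stand-in for the series' hash: every packet of a tunnel with the
 * same (saddr, daddr, protocol) triple maps to the same id.
 */
static uint32_t encap_id(const struct outer_hdr *h)
{
	uint32_t id = 2166136261u; /* FNV-1a offset basis */

	id = fnv1a_u32(id, h->saddr);
	id = fnv1a_u32(id, h->daddr);
	return fnv1a_byte(id, h->protocol);
}
```

The property that matters for a flowtable lookup is determinism: both directions of the offload path can recompute the id from the outer header alone, without keeping per-tunnel state. Distinct tunnels can still collide on a 32-bit id, so a real implementation would need to tolerate that.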

5 months, 2 weeks

[PATCH v2 0/2] rust: minor idiomatic fixes to doctest generator

by Tamir Duberstein

Please see individual commit messages.

Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v2:
- rustfmt.
- Alice's RB.
- Add second patch to emit information in panic rather than separately
  to stderr.
- Link to v1: https://lore.kernel.org/r/20250527-idiomatic-match-slice-v1-1-34b0b1d1d58c@…

---
Tamir Duberstein (2):
      rust: replace length checks with match
      rust: emit path candidates in panic message

 scripts/rustdoc_test_gen.rs | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)
---
base-commit: 1ce98bb2bb30713ec4374ef11ead0d7d3e856766
change-id: 20250527-idiomatic-match-slice-26a79d100e4d

Best regards,
-- 
Tamir Duberstein <tamird(a)gmail.com>

5 months, 3 weeks

[PATCH v2 00/15] selftests/futex: Refactor tests to use kselftest_harness.h

by André Almeida

This patch series refactors all futex selftests to use
kselftest_harness.h instead of futex's logging.h, as discussed here [1].

This allows removing a lot of boilerplate code and simplifying some parts
of the test logic, mainly when the test needs to exit early. The result
is more than 500 lines removed from tools/testing/selftests/futex/.
Also, this enables new tests to use kselftest.h features like the
ASSERT_*() macros and such.

There are some caveats around this refactor:

- logging.h had verbosity levels, while kselftest_harness.h doesn't. I
  created a new print function called ksft_print_dbg_msg() that prints
  the message if the user passes the -d flag, so there is now an
  equivalent of this feature.

- The futex_requeue_pi test accepted command line arguments to be used
  as test parameters (e.g. ./futex_requeue_pi -b -l -t 500000). This
  doesn't work with kselftest_harness.h because there's no
  straightforward way to pass command line arguments to the test. I used
  FIXTURE_VARIANT() to achieve the same result, but now the parameters
  live inside the test file instead of in functional/run.sh. This
  slightly increased the number of test cases for futex_requeue_pi, from
  22 to 24.

- test_harness_run() calls mmap(MAP_SHARED) before running the test, and
  this has caused a side effect on the futex_numa_mpol.c test. This test
  also calls mmap() and then tries to access an address outside the
  boundaries of this mapped memory for a "Memory out of range" subtest,
  where the kernel should return -EACCES. After the refactor, the test
  address might fall inside the first memory mapped region, thus being a
  valid address, so the syscall succeeds and the test fails. To fix
  that, I created a small "buffer zone" with mmap(PROT_NONE) between
  both mmaps.

I have compared the results of run.sh before and after this patchset and
didn't find any regression in the test results.

Thanks,
	André

[1] https://lore.kernel.org/lkml/87ecv6p364.ffs@tglx/

---
Changes in v2:
- Rebased on top of tip/master
- Dropped priv_hash global test variant now that this feature was dropped
- Added include <stdbool.h> in the first patch
- Link to v1: https://lore.kernel.org/r/20250704-tonyk-robust_test_cleanup-v1-0-c0ff4f24c…

---
André Almeida (15):
      selftests: kselftest: Create ksft_print_dbg_msg()
      selftests/futex: Refactor futex_requeue_pi with kselftest_harness.h
      selftests/futex: Refactor futex_requeue_pi_mismatched_ops with kselftest_harness.h
      selftests/futex: Refactor futex_requeue_pi_signal_restart with kselftest_harness.h
      selftests/futex: Refactor futex_wait_timeout with kselftest_harness.h
      selftests/futex: Refactor futex_wait_wouldblock with kselftest_harness.h
      selftests/futex: Refactor futex_wait_unitialized_heap with kselftest_harness.h
      selftests/futex: Refactor futex_wait_private_mapped_file with kselftest_harness.h
      selftests/futex: Refactor futex_wait with kselftest_harness.h
      selftests/futex: Refactor futex_requeue with kselftest_harness.h
      selftests/futex: Refactor futex_waitv with kselftest_harness.h
      selftests/futex: Refactor futex_priv_hash with kselftest_harness.h
      selftests/futex: Refactor futex_numa_mpol with kselftest_harness.h
      selftests/futex: Drop logging.h include from futex_numa
      selftests/futex: Remove logging.h file

 tools/testing/selftests/futex/functional/Makefile  |   3 +-
 .../selftests/futex/functional/futex_numa.c        |   3 +-
 .../selftests/futex/functional/futex_numa_mpol.c   |  57 ++---
 .../selftests/futex/functional/futex_priv_hash.c   |  49 +---
 .../selftests/futex/functional/futex_requeue.c     |  76 ++----
 .../selftests/futex/functional/futex_requeue_pi.c  | 261 ++++++++++-----------
 .../functional/futex_requeue_pi_mismatched_ops.c   |  80 ++-----
 .../functional/futex_requeue_pi_signal_restart.c   | 129 +++-------
 .../selftests/futex/functional/futex_wait.c        | 103 +++-----
 .../functional/futex_wait_private_mapped_file.c    |  83 ++-----
 .../futex/functional/futex_wait_timeout.c          | 139 +++++------
 .../functional/futex_wait_uninitialized_heap.c     |  76 ++----
 .../futex/functional/futex_wait_wouldblock.c       |  75 ++----
 .../selftests/futex/functional/futex_waitv.c       |  98 ++++----
 tools/testing/selftests/futex/functional/run.sh    |  62 +----
 tools/testing/selftests/futex/include/logging.h    | 148 ------------
 tools/testing/selftests/kselftest.h                |  14 ++
 tools/testing/selftests/kselftest_harness.h        |  13 +-
 18 files changed, 465 insertions(+), 1004 deletions(-)
---
base-commit: ed0272f0675f31642c3d445a596b544de9db405b
change-id: 20250703-tonyk-robust_test_cleanup-d1f3406365d9

Best regards,
-- 
André Almeida <andrealmeid(a)igalia.com>
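
The "buffer zone" idea from the futex_numa_mpol caveat can be sketched in isolation. The variant below is an illustration under stated assumptions rather than the code from the series: instead of placing a separate PROT_NONE mapping between two existing ones, it reserves the test region and a trailing guard page in a single PROT_NONE mapping and then opens up only the usable part, so the address just past the region is guaranteed to stay inaccessible. map_with_guard() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` usable bytes (rounded up to whole pages) followed by one
 * inaccessible guard page. Returns NULL on failure. Hypothetical helper,
 * not taken from the series.
 */
static void *map_with_guard(size_t len, size_t *mapped_len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t usable = (len + page - 1) / page * page;
	void *base;

	/* Reserve usable + guard with no access rights at all. */
	base = mmap(NULL, usable + page, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	/* Open up only the usable part; the last page stays PROT_NONE. */
	if (mprotect(base, usable, PROT_READ | PROT_WRITE) != 0) {
		munmap(base, usable + page);
		return NULL;
	}

	*mapped_len = usable;
	return base;
}
```

With such a layout, an out-of-range access at base + mapped_len can never land in a neighbouring mapping created earlier by the harness, which is the failure mode the cover letter describes.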

5 months, 3 weeks

[PATCH v8 0/6] use per-vma locks for /proc/pid/maps reads

by Suren Baghdasaryan

Reading /proc/pid/maps requires read-locking mmap_lock, which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges, however it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low priority monitoring tasks, and them blocking high priority tasks
results in priority inversion.

Locking the entire address space is required to present a fully coherent
picture of the address space, however even the current implementation
does not strictly guarantee that, because it outputs vmas in page-size
chunks and drops mmap_lock in between each chunk. Address space
modifications are possible while mmap_lock is dropped, and userspace
reading the content is expected to deal with possible concurrent address
space modifications. Considering these relaxed rules, holding mmap_lock
is not strictly needed as long as we can guarantee that a concurrently
modified vma is reported either in its original form or after it was
modified.

This patchset switches from holding mmap_lock while reading
/proc/pid/maps to taking per-vma locks as we walk the vma tree. This
reduces the contention with tasks modifying the address space because
they would have to contend for the same vma as opposed to the entire
address space. The previous version of this patchset [1] tried to perform
/proc/pid/maps reading under RCU, however its implementation is quite
complex and the results are worse than the new version because it still
relied on mmap_lock speculation, which retries if any part of the address
space gets modified. The new implementation is both simpler and results in
less contention. Note that a similar approach would not work for
/proc/pid/smaps reading as it also walks the page table and that's not
RCU-safe.

Paul McKenney designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory. At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test. The scanners keep scanning until the specified
/proc/PID/maps file disappears.

The latest results from Paul:

With stock mm-unstable, all of the runs had maximum latencies in excess
of 0.5 milliseconds, with 80% of the runs' latencies exceeding a full
millisecond and ranging up beyond 4 full milliseconds. In contrast, 99%
of the runs with this patch series applied had maximum latencies of less
than 0.5 milliseconds, with the single outlier at only 0.608
milliseconds.

From a median-performance (as opposed to maximum-latency) viewpoint, this
patch series also looks good, with stock mm weighing in at 11
microseconds and the patch series at 6 microseconds, better than a 2x
improvement.

Before the change:

./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.011     0.008     0.521
    0.011     0.008     0.552
    0.011     0.008     0.590
    0.011     0.008     0.660
    ...
    0.011     0.015     2.987
    0.011     0.015     3.038
    0.011     0.016     3.431
    0.011     0.016     4.707

After the change:

./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.006     0.005     0.026
    0.006     0.005     0.029
    0.006     0.005     0.034
    0.006     0.005     0.035
    ...
    0.006     0.006     0.421
    0.006     0.006     0.423
    0.006     0.006     0.439
    0.006     0.006     0.608

The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. An example of user-visible inconsistency
can be that the same vma is printed twice: once before it was modified
and then after the modifications. For example, if a vma was extended, it
might be found and reported twice. What is not expected is to see a gap
where there should have been a vma both before and after modification.
This patchset increases the chances of such tearing, therefore it's even
more important now to test for unexpected inconsistencies.

In [3] Lorenzo identified the following possible vma merging/splitting
scenarios:

Merges with changes to existing vmas:
1. Merge both - mapping a vma over another one and between two vmas which
   can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and
   completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and
   partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one
   and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing
   one and partially over its left neighbor;

Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be
   merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start of an existing one;

Splits:
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;

If such merges or splits happen concurrently with the /proc/pid/maps
reading, we might report a vma twice, once before the modification and
once after it is modified:

Case 1 might report the overwritten and previous vma along with the final
merged vma;
Case 2 might report the previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by
shrinking of the right neighbor;
Case 4 might report the overwritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by
shrinking of the left neighbor;
Case 6 might report the previous vma and the gap along with the final
merged vma;
Case 7 might report the previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering
the gap;
Case 9 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused
by shrinking of the original vma at the vma end;

In all these cases the retry mechanism prevents us from reporting
possible temporary gaps.

Changes since v7 [4]:
- Refactored tests to use kselftest harness, per David Hildenbrand and
  Lorenzo Stoakes
- Removed PROCMAP_QUERY selftest, per David Hildenbrand and Lorenzo
  Stoakes
- Added Acked-by, per David Hildenbrand
- Replaced sentinel values with named definitions, per Vlastimil Babka
- Added Reviewed-by, per Vlastimil Babka

!!! NOTES FOR APPLYING THE PATCHSET !!!

Applies cleanly over mm-unstable after reverting the v7 version of this
patchset (from 94951ab6fe6f to e47914e6c28f in mm-unstable).

[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo…
[4] https://lore.kernel.org/all/20250716030557.1547501-1-surenb@google.com/

Suren Baghdasaryan (6):
  selftests/proc: add /proc/pid/maps tearing from vma split test
  selftests/proc: extend /proc/pid/maps tearing test to include vma
    resizing
  selftests/proc: extend /proc/pid/maps tearing test to include vma
    remapping
  selftests/proc: add verbose mode for /proc/pid/maps tearing tests
  fs/proc/task_mmu: remove conversion of seq_file position to unsigned
  fs/proc/task_mmu: read proc/pid/maps under per-vma lock

 fs/proc/internal.h                            |   5 +
 fs/proc/task_mmu.c                            | 158 +++-
 include/linux/mmap_lock.h                     |  11 +
 mm/madvise.c                                  |   3 +-
 mm/mmap_lock.c                                |  93 +++
 tools/testing/selftests/proc/.gitignore       |   1 +
 tools/testing/selftests/proc/Makefile         |   1 +
 tools/testing/selftests/proc/proc-maps-race.c | 741 ++++++++++++++++++
 8 files changed, 997 insertions(+), 16 deletions(-)
 create mode 100644 tools/testing/selftests/proc/proc-maps-race.c

-- 
2.50.0.727.gbf7dc18ff4-goog
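
What the cover letter treats as acceptable versus unexpected output can be reduced to a comparison of neighbouring address ranges. A minimal sketch, assuming the two ranges come from vmas that were adjacent both before and after the modification; the type and function names are illustrative and do not come from proc-maps-race.c:

```c
#include <stdint.h>

/* One reported /proc/pid/maps range, [start, end). Illustrative type. */
struct vma_range {
	uint64_t start;
	uint64_t end;
};

enum seam {
	SEAM_CONTIGUOUS, /* next starts exactly where prev ends */
	SEAM_OVERLAP,    /* same area reported twice: tolerated */
	SEAM_GAP,        /* hole between the two: unexpected tearing */
};

/* Classify the boundary between two consecutively reported ranges. */
static enum seam classify_seam(const struct vma_range *prev,
			       const struct vma_range *next)
{
	if (next->start < prev->end)
		return SEAM_OVERLAP;
	if (next->start > prev->end)
		return SEAM_GAP;
	return SEAM_CONTIGUOUS;
}
```

A reader that sees SEAM_OVERLAP is observing a vma once before and once after it was modified, which the cover letter describes as an expected inconsistency. SEAM_GAP between vmas that should have been adjacent is the case the retry mechanism is meant to prevent.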

5 months, 3 weeks

[PATCH 0/7] Replace "__auto_type" with "auto"

by H. Peter Anvin

"auto" was defined as a keyword back in the K&R days, but as a storage type specifier. No one ever used it, since it was and is the default storage type for local variables. C++11 recycled the keyword to allow a type to be declared based on the type of an initializer. This was finally adopted into standard C in C23. gcc and clang provide the "__auto_type" alias keyword as an extension for pre-C23, however, there is no reason to pollute the bulk of the source base with this temporary keyword; instead define "auto" as a macro unless the compiler is running in C23+ mode. This macro is added in <linux/compiler_types.h> because that header is included in some of the tools headers, wheres <linux/compiler.h> is not as it has a bunch of very kernel-specific things in it. --- arch/nios2/include/asm/uaccess.h | 4 ++-- arch/x86/include/asm/bug.h | 2 +- arch/x86/include/asm/string_64.h | 6 +++--- arch/x86/include/asm/uaccess_64.h | 2 +- fs/proc/inode.c | 16 ++++++++-------- include/linux/cleanup.h | 4 ++-- include/linux/compiler.h | 2 +- include/linux/compiler_types.h | 13 +++++++++++++ include/linux/minmax.h | 6 +++--- tools/testing/selftests/bpf/prog_tests/socket_helpers.h | 9 +++++++-- tools/virtio/linux/compiler.h | 2 +- 11 files changed, 42 insertions(+), 24 deletions(-)

5 months, 3 weeks

[PATCH RFC 0/3] selftests/landlock: scoping abstractions

by Abhinav Saxena

Hi all,

I was starting to work on the memfd-exec[1] feature and observed that Landlock's scoped-IPC features (abstract UNIX sockets and signals) follow a consistent high-level model, which I'm calling a resource-accessor pattern:

    Resource Process <-> Accessor Process

- Resource process: owns or manages the asset
  - socket creator (bind/accept)
  - signal handler
  - memfd creator

- Accessor process: attempts to use the asset
  - socket client (connect/sendto)
  - signal sender
  - memfd executor

RESOURCE-ACCESSOR PATTERN FUNDAMENTALS
======================================

This pattern appears fundamental to Landlock scoping because:

1. Consistent enforcement model: Landlock restrictions are enforced only on the accessor side; the resource side remains unconstrained across all scope types.

2. Reflects actual security boundaries: In practice, sandboxed processes typically need to access resources created by other processes, not the reverse.

3. Scalable design: This model works consistently whether processes are in parent-child relationships or independent peer domains.

4. Real-world usage patterns: Container runtimes and sandbox orchestrators routinely start multiple workers that restrict themselves independently.

CURRENT TEST COVERAGE GAP
=========================

Existing self-tests cover hierarchical resource <-> accessor pairs but do not exercise the case where each task enters an independent domain. While 'sibling_domain' tests exist, they still use parent-child relationship patterns rather than true peer domains.

Current Coverage (Linear Hierarchies Only):
-------------------------------------------

Type 1: Parent-Child (scoped_domains)

    P1 ---- P2

Type 2: Three Generations (scoped_vs_unscoped)

    P1 ---- P2 ---- P3

Variations tested for both types:
- No domains
- Various scoped domain combinations
- Nested domains within inherited domains
- Mixed domain types (SCOPE vs OTHER vs NONE)

Missing Coverage (True Sibling Scenarios):
------------------------------------------

    Root
     |
     +-- Child A [various domain types]
     |
     +-- Child B [various domain types]

Missing test scenarios:
- A <-> B cross-sibling communication
- Mixed sibling domain combinations
- Sibling isolation enforcement
- Parent -> A, Parent -> B differential access

SOLUTION
========

This series implements the missing sibling pattern using the resource-accessor model. The tests create a fork tree that looks like this:

    coordinator (no domain)
     |
     +-- resource_proc (Domain X)   /* owns the resource */
     |
     +-- accessor_proc (Domain Y)   /* tries to access */

This directly addresses the missing coverage by creating two independent child processes that establish peer domains, rather than the hierarchical parent-child domains covered by existing tests.

Both children call landlock_restrict_self() for the first time, so their struct landlock_domain->parent pointers are NULL, creating true peer domains.

The harness exposes four test variants:

Variant name       | Resource domain | Accessor domain | Result
-------------------|-----------------|-----------------|----------
none_to_none       | none            | none            | ALLOW
none_to_scoped     | none            | scoped          | DENY
scoped_to_none     | scoped          | none            | ALLOW
scoped_to_scoped   | scoped          | scoped (peer)   | DENY

The scoped_to_scoped case was missing from current coverage.

TESTING
=======

All patches apply cleanly to v6.14-rc2 and pass on landlock/master. The helpers are small and re-use the existing kselftest_harness.h fixture/variant pattern.

All patches have been validated with scripts/checkpatch.pl --strict and show no warnings.

This series introduces **no kernel changes**, only selftests additions.

Feedback very welcome.

Thanks,
Abhinav

[1] https://github.com/landlock-lsm/linux/issues/37

Links:
- Landlock documentation: https://docs.kernel.org/userspace-api/landlock.html
- Landlock LSM kernel docs: https://docs.kernel.org/security/landlock.html
- Existing tests: tools/testing/selftests/landlock/scoped_*

Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com>
---
Abhinav Saxena (3):
      selftests/landlock: move sandbox_type to common
      selftests/landlock: add cross-domain variants
      selftests/landlock: add cross-domain signal tests

 tools/testing/selftests/landlock/scoped_common.h   |   7 +
 .../landlock/scoped_cross_domain_variants.h        |  54 +++++
 .../landlock/scoped_multiple_domain_variants.h     |   7 -
 .../selftests/landlock/scoped_signal_test.c        | 237 +++++++++++++++++++++
 4 files changed, 298 insertions(+), 7 deletions(-)
---
base-commit: 5b74b2eff1eeefe43584e5b7b348c8cd3b723d38
change-id: 20250715-landlock_abstractions-dbc0aabf1063

Best regards,
-- 
Abhinav Saxena <xandfury(a)gmail.com>
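
The four variants in the cover letter's table reduce to a single rule: only the accessor's domain decides the outcome. The sketch below encodes that rule. It is not code from the series; `enum expected_result` and `cross_domain_expected()` are hypothetical names chosen for illustration.

```c
#include <stdbool.h>

enum expected_result { ALLOW, DENY };

/*
 * Hypothetical helper mirroring the variant table: enforcement happens only
 * on the accessor side, so for peer domains the resource's own domain never
 * changes the expected outcome.
 */
static enum expected_result cross_domain_expected(bool resource_scoped,
						  bool accessor_scoped)
{
	(void)resource_scoped; /* unconstrained side: ignored on purpose */
	return accessor_scoped ? DENY : ALLOW;
}
```

Stating the rule this way makes the role of the previously missing scoped_to_scoped case visible: it is the one variant that checks a scoped resource process does not implicitly grant access to a scoped peer.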

5 months, 3 weeks


[PATCH v13 net-next 00/14] AccECN protocol patch series

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>

Hello,

Please find the v13 AccECN protocol patch series, which covers the core functionality of Accurate ECN, AccECN negotiation, AccECN TCP options, and AccECN failure handling. The Accurate ECN draft can be found in https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28

This patch series is part of the full AccECN patch series, which is available at https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/

Best Regards,
Chia-Yu

---
v13 (18-Jul-2025)
- Implement tcp_accecn_extract_syn_ect() and tcp_accecn_reflector_flags() with static array lookup of patch #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Fix typos in comments of #6 and remove patch #7 of v12 about simultaneous connect (Paolo Abeni <pabeni(a)redhat.com>)
- Move TCP_ACCECN_E1B_INIT_OFFSET, TCP_ACCECN_E0B_INIT_OFFSET, and TCP_ACCECN_CEB_INIT_OFFSET from patch #7 to #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use static array lookup in tcp_accecn_optfield_to_ecnfield() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return false when WARN_ON_ONCE() is true in tcp_accecn_process_option() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Make synack_ecn_bytes a static const array and use const u32 pointer in tcp_options_write() of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use ALIGN() and ALIGN_DOWN() in tcp_options_fit_accecn() to pad TCP AccECN option to dword of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return TCP_ACCECN_OPT_FAIL_SEEN if WARN_ON_ONCE() is true in tcp_accecn_option_init() of #12 (Paolo Abeni <pabeni(a)redhat.com>)

v12 (04-Jul-2025)
- Fix compilation issues with some intermediate patches in v11
- Add more comments for AccECN helpers of tcp_ecn.h

v11 (03-Jul-2025)
- Fix compilation issues with some intermediate patches in v10

v10 (02-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() over every read of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)

v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEEN flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accurate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restructure the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempting AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)

v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)

v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results

v6 (09-May-2025)
- Add #3 to utilize existing holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use existing holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for readability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15

v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)

v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)

v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)

v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.

---
Chia-Yu Chang (5):
  tcp: reorganize tcp_sock_write_txrx group for variables later
  tcp: ecn functions in separated include file
  tcp: accecn: AccECN option send control
  tcp: accecn: AccECN option failure handling
  tcp: accecn: try to fit AccECN option with SACK

Ilpo Järvinen (9):
  tcp: reorganize SYN ECN code
  tcp: fast path functions later
  tcp: AccECN core
  tcp: accecn: AccECN negotiation
  tcp: accecn: add AccECN rx byte counters
  tcp: accecn: AccECN needs to know delivered bytes
  tcp: sack option handling improvements
  tcp: accecn: AccECN option
  tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics

 Documentation/networking/ip-sysctl.rst        |  55 +-
 .../networking/net_cachelines/tcp_sock.rst    |  12 +
 include/linux/tcp.h                           |  32 +-
 include/net/netns/ipv4.h                      |   2 +
 include/net/tcp.h                             |  87 ++-
 include/net/tcp_ecn.h                         | 649 ++++++++++++++++++
 include/uapi/linux/tcp.h                      |   7 +
 net/ipv4/syncookies.c                         |   4 +
 net/ipv4/sysctl_net_ipv4.c                    |  19 +
 net/ipv4/tcp.c                                |  28 +-
 net/ipv4/tcp_input.c                          | 353 ++++++++--
 net/ipv4/tcp_ipv4.c                           |   8 +-
 net/ipv4/tcp_minisocks.c                      |  40 +-
 net/ipv4/tcp_output.c                         | 294 ++++++--
 net/ipv6/syncookies.c                         |   2 +
 net/ipv6/tcp_ipv6.c                           |   1 +
 16 files changed, 1409 insertions(+), 184 deletions(-)
 create mode 100644 include/net/tcp_ecn.h

-- 
2.34.1

5 months, 3 weeks


[PATCH v24 net-next 0/6] DUALPI2 patch

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>

Hello,

Please find the DualPI2 patch v24.

This patch series adds DualPI Improved with a Square (DualPI2) with the following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues

For more details of DualPI2, please refer to IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).

Best regards,
Chia-Yu

---
v24 (18-Jul-2025)
- Replace TCA_DUALPI2 prefix with TC_DUALPI2 for enums in pkt_sched.h (Jakub Kicinski <kuba(a)kernel.org>)
- Report error if both packet and time step thresholds are provided (Jakub Kicinski <kuba(a)kernel.org>)

v23 (13-Jul-2025) and v22 (11-Jul-2025)
- Fix issue when user would like to change DualPI2 but provides an empty TCA_OPTIONS with no nested attributes (Paolo Abeni <pabeni(a)redhat.com>, Jakub Kicinski <kuba(a)kernel.org>)

v21 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>)
- Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>)
- Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>)
- Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>)
- Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>)
- Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub Kicinski <kuba(a)kernel.org>)
- Remove redundant name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>)
- Fix some typos and format issues of dualpi2 attributes

v20 (21-Jun-2025)
- Add one more commit to fix warning and style check on tdc.sh reported by shellcheck
- Remove the double prefix of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>)

v19 (14-Jun-2025)
- Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>)

v18 (13-Jun-2025)
- Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute
- Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>)

v17 (25-May-2025, resent on 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)

v16 (16-May-2025)
- Add qdisc_lock() to dualpi2_timer() (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert nsec to usec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_to_nsec() to convert usec to nsec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover letter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)

v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to align with the updated enum of pkt_sched.h

v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)

v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)

v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multiplying limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)

v11 (15-Apr-2025)
- Replace hrtimer_init with hrtimer_setup in sch_dualpi2.c

v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)

v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.

  In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets. This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.

  Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
    Old versions:
                           avg     median          # data pts
      Ping (ms) ICMP :   11.55      11.70 ms              350
      TCP upload avg :   18.96        N/A Mbits/s         350
      TCP upload sum :   18.96        N/A Mbits/s         350

    New version (v9):
                           avg     median          # data pts
      Ping (ms) ICMP :   10.81      10.70 ms              350
      TCP upload avg :   18.91        N/A Mbits/s         350
      TCP upload sum :   18.91        N/A Mbits/s         350

  Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
    Old versions:
                           avg     median          # data pts
      Ping (ms) ICMP :   12.61      12.80 ms              350
      TCP upload avg :    9.48        N/A Mbits/s         350
      TCP upload sum :    9.48        N/A Mbits/s         350

    New version (v9):
                           avg     median          # data pts
      Ping (ms) ICMP :   11.06      10.80 ms              350
      TCP upload avg :    9.43        N/A Mbits/s         350
      TCP upload sum :    9.43        N/A Mbits/s         350

  Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
    Old versions:
                           avg     median          # data pts
      Ping (ms) ICMP :   40.86      37.45 ms              350
      TCP upload avg :    0.88        N/A Mbits/s         350
      TCP upload sum :    0.88        N/A Mbits/s         350
      TCP upload::1  :    0.88       0.97 Mbits/s         350

    New version (v9):
                           avg     median          # data pts
      Ping (ms) ICMP :   11.07      10.40 ms              350
      TCP upload avg :    0.55        N/A Mbits/s         350
      TCP upload sum :    0.55        N/A Mbits/s         350
      TCP upload::1  :    0.55       0.59 Mbits/s         350

v8 (11-Mar-2025)
- Fix warning messages in v7

v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)

v6 (04-Mar-2025)
- Add modprobe for dualpi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message

v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:

  Unshaped 1gigE with 4 download streams test:
  - Summary of tcp_4down run 'MQ + FQ_CODEL':
                             avg     median          # data pts
      Ping (ms) ICMP   :    1.19       1.34 ms              349
      TCP download avg :  235.42        N/A Mbits/s         349
      TCP download sum :  941.68        N/A Mbits/s         349
      TCP download::1  :  235.19     235.39 Mbits/s         349
      TCP download::2  :  235.03     235.35 Mbits/s         349
      TCP download::3  :  236.89     235.44 Mbits/s         349
      TCP download::4  :  234.57     235.19 Mbits/s         349

  - Summary of tcp_4down run 'MQ + FQ_PIE':
                             avg     median          # data pts
      Ping (ms) ICMP   :    1.21       1.37 ms              350
      TCP download avg :  235.42        N/A Mbits/s         350
      TCP download sum :  941.61        N/A Mbits/s         350
      TCP download::1  :  232.54     233.13 Mbits/s         350
      TCP download::2  :  232.52     232.80 Mbits/s         350
      TCP download::3  :  233.14     233.78 Mbits/s         350
      TCP download::4  :  243.41     241.48 Mbits/s         350

  - Summary of tcp_4down run 'MQ + DUALPI2':
                             avg     median          # data pts
      Ping (ms) ICMP   :    1.19       1.34 ms              349
      TCP download avg :  235.42        N/A Mbits/s         349
      TCP download sum :  941.68        N/A Mbits/s         349
      TCP download::1  :  235.19     235.39 Mbits/s         349
      TCP download::2  :  235.03     235.35 Mbits/s         349
      TCP download::3  :  236.89     235.44 Mbits/s         349
      TCP download::4  :  234.57     235.19 Mbits/s         349

  Unshaped 1gigE with 128 download streams test:
  - Summary of tcp_128down run 'MQ + FQ_CODEL':
                             avg     median          # data pts
      Ping (ms) ICMP   :    1.88       1.86 ms              350
      TCP download avg :    7.39        N/A Mbits/s         350
      TCP download sum :  946.47        N/A Mbits/s         350

  - Summary of tcp_128down run 'MQ + FQ_PIE':
                             avg     median          # data pts
      Ping (ms) ICMP   :    1.88       1.86 ms              350
      TCP download avg :    7.39        N/A Mbits/s         350
      TCP download sum :  946.47        N/A Mbits/s         350

  - Summary of tcp_128down run 'MQ + DUALPI2':
                             avg     median          # data pts
      Ping (ms) ICMP   :    1.88       1.86 ms              350
      TCP download avg :    7.39        N/A Mbits/s         350
      TCP download sum :  946.47        N/A Mbits/s         350

  Unshaped 10gigE with 4 download streams test:
  - Summary of tcp_4down run 'MQ + FQ_CODEL':
                             avg     median          # data pts
      Ping (ms) ICMP   :    0.22       0.23 ms              350
      TCP download avg : 2354.08        N/A Mbits/s         350
      TCP download sum : 9416.31        N/A Mbits/s         350
      TCP download::1  : 2353.65    2352.81 Mbits/s         350
      TCP download::2  : 2354.54    2354.21 Mbits/s         350
      TCP download::3  : 2353.56    2353.78 Mbits/s         350
      TCP download::4  : 2354.56    2354.45 Mbits/s         350

  - Summary of tcp_4down run 'MQ + FQ_PIE':
                             avg     median          # data pts
      Ping (ms) ICMP   :    0.20       0.19 ms              350
      TCP download avg : 2354.76        N/A Mbits/s         350
      TCP download sum : 9419.04        N/A Mbits/s         350
      TCP download::1  : 2354.77    2353.89 Mbits/s         350
      TCP download::2  : 2353.41    2354.29 Mbits/s         350
      TCP download::3  : 2356.18    2354.19 Mbits/s         350
      TCP download::4  : 2354.68    2353.15 Mbits/s         350

  - Summary of tcp_4down run 'MQ + DUALPI2':
                             avg     median          # data pts
      Ping (ms) ICMP   :    0.24       0.24 ms              350
      TCP download avg : 2354.11        N/A Mbits/s         350
      TCP download sum : 9416.43        N/A Mbits/s         350
      TCP download::1  : 2354.75    2353.93 Mbits/s         350
      TCP download::2  : 2353.15    2353.75 Mbits/s         350
      TCP download::3  : 2353.49    2353.72 Mbits/s         350
      TCP download::4  : 2355.04    2353.73 Mbits/s         350

  Unshaped 10gigE with 128 download streams test:
  - Summary of tcp_128down run 'MQ + FQ_CODEL':
                             avg     median          # data pts
      Ping (ms) ICMP   :    7.57       8.69 ms              350
      TCP download avg :   73.97        N/A Mbits/s         350
      TCP download sum : 9467.82        N/A Mbits/s         350

  - Summary of tcp_128down run 'MQ + FQ_PIE':
                             avg     median          # data pts
      Ping (ms) ICMP   :    7.82       8.91 ms              350
      TCP download avg :   73.97        N/A Mbits/s         350
      TCP download sum : 9468.42        N/A Mbits/s         350

  - Summary of tcp_128down run 'MQ + DUALPI2':
                             avg     median          # data pts
      Ping (ms) ICMP   :    6.87       7.93 ms              350
      TCP download avg :   73.95        N/A Mbits/s         350
      TCP download sum : 9465.87        N/A Mbits/s         350

  From the results shown above, we see small differences between combinations.

- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c

v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.

v3 (19-Oct-2024)
- Fix compilation error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)

v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning

---
Chia-Yu Chang (5):
  sched: Struct definition and parsing of dualpi2 qdisc
  sched: Dump configuration and statistics of dualpi2 qdisc
  selftests/tc-testing: Fix warning and style check on tdc.sh
  selftests/tc-testing: Add selftests for qdisc DualPI2
  Documentation: netlink: specs: tc: Add DualPI2 specification

Koen De Schepper (1):
  sched: Add enqueue/dequeue of dualpi2 qdisc

 Documentation/netlink/specs/tc.yaml           |  151 ++-
 include/net/dropreason-core.h                 |    6 +
 include/uapi/linux/pkt_sched.h                |   68 +
 net/sched/Kconfig                             |   12 +
 net/sched/Makefile                            |    1 +
 net/sched/sch_dualpi2.c                       | 1174 +++++++++++++++++
 tools/testing/selftests/tc-testing/config     |    1 +
 .../tc-testing/tc-tests/qdiscs/dualpi2.json   |  254 ++++
 tools/testing/selftests/tc-testing/tdc.sh     |    6 +-
 9 files changed, 1668 insertions(+), 5 deletions(-)
 create mode 100644 net/sched/sch_dualpi2.c
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json

-- 
2.34.1
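
The L4S identifier mentioned in the feature list of the cover letter (IP.ECN==0b*1) amounts to testing the low bit of the two-bit ECN field. The sketch below illustrates only that test. It is not taken from sch_dualpi2.c, where the classification mask is configurable; `enum ecn_codepoint` and `is_l4s_packet()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* The two-bit IP ECN field codepoints (RFC 3168). */
enum ecn_codepoint {
	ECN_NOT_ECT = 0x0, /* 0b00 */
	ECN_ECT_1   = 0x1, /* 0b01 */
	ECN_ECT_0   = 0x2, /* 0b10 */
	ECN_CE      = 0x3, /* 0b11 */
};

/*
 * Hypothetical helper: a packet carries the L4S identifier when the low bit
 * of its ECN field is set (0b*1), i.e. ECT(1) or CE. Such packets would go
 * to the L-queue, everything else to the C-queue.
 */
static bool is_l4s_packet(uint8_t ecn_field)
{
	return ((ecn_field & 0x3) & 0x1) != 0;
}
```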

5 months, 3 weeks


[PATCH 0/2] Fix undetected overflow when allocating IOVA

by Jason Gunthorpe

Syzkaller found this: the ALIGN() call can overflow and corrupt the allocation process. Fix the bug and add some test coverage.

Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>

Jason Gunthorpe (2):
  iommufd: Prevent ALIGN() overflow
  iommufd/selftest: Test reserved regions near ULONG_MAX

 drivers/iommu/iommufd/io_pagetable.c    | 41 +++++++++++++++----------
 tools/testing/selftests/iommu/iommufd.c | 18 +++++++++++
 2 files changed, 43 insertions(+), 16 deletions(-)

base-commit: 601b1d0d9395c711383452bd0d47037afbbb4bcf
-- 
2.43.0
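
To make the failure mode concrete, here is a small sketch of how rounding up near ULONG_MAX wraps to a low address, together with one possible way to detect it. This is not the change made in io_pagetable.c: `ALIGN_UP()` only imitates the shape of the kernel's ALIGN() for power-of-two alignments, and `align_up_checked()` is a hypothetical helper built on the gcc/clang `__builtin_add_overflow()` builtin.

```c
#include <stdbool.h>
#include <limits.h>

/* Same shape as the kernel's ALIGN() for power-of-two alignments. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/*
 * Hypothetical checked variant: returns false instead of silently wrapping
 * when rounding up would exceed ULONG_MAX. 'align' must be a power of two.
 */
static bool align_up_checked(unsigned long x, unsigned long align,
			     unsigned long *out)
{
	unsigned long sum;

	if (__builtin_add_overflow(x, align - 1, &sum))
		return false;
	*out = sum & ~(align - 1);
	return true;
}
```

For example, `ALIGN_UP(ULONG_MAX - 10, 4096UL)` evaluates to 0, which an allocator could mistake for a valid low address, while the checked variant reports the overflow to the caller.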

5 months, 3 weeks


[PATCH v2] selftests/damon: introduce _common.sh to host shared function

by Enze Li

The current test scripts contain duplicated root permission checks in multiple locations. This patch consolidates these checks into _common.sh to eliminate code redundancy.

Signed-off-by: Enze Li <lienze(a)kylinos.cn>
---
 tools/testing/selftests/damon/_common.sh                 | 11 +++++++++++
 tools/testing/selftests/damon/lru_sort.sh                |  8 +++-----
 tools/testing/selftests/damon/reclaim.sh                 |  8 +++-----
 tools/testing/selftests/damon/sysfs.sh                   | 11 ++---------
 .../damon/sysfs_update_removed_scheme_dir.sh             |  8 +++-----
 5 files changed, 22 insertions(+), 24 deletions(-)
 create mode 100644 tools/testing/selftests/damon/_common.sh

diff --git a/tools/testing/selftests/damon/_common.sh b/tools/testing/selftests/damon/_common.sh
new file mode 100644
index 000000000000..0279698f733e
--- /dev/null
+++ b/tools/testing/selftests/damon/_common.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+check_dependencies()
+{
+	if [ $EUID -ne 0 ]
+	then
+		echo "Run as root"
+		exit $ksft_skip
+	fi
+}
diff --git a/tools/testing/selftests/damon/lru_sort.sh b/tools/testing/selftests/damon/lru_sort.sh
index 61b80197c896..1e4849db78a9 100755
--- a/tools/testing/selftests/damon/lru_sort.sh
+++ b/tools/testing/selftests/damon/lru_sort.sh
@@ -1,14 +1,12 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 
+source _common.sh
+
 # Kselftest framework requirement - SKIP code is 4.
 ksft_skip=4
 
-if [ $EUID -ne 0 ]
-then
-	echo "Run as root"
-	exit $ksft_skip
-fi
+check_dependencies
 
 damon_lru_sort_enabled="/sys/module/damon_lru_sort/parameters/enabled"
 if [ ! -f "$damon_lru_sort_enabled" ]
diff --git a/tools/testing/selftests/damon/reclaim.sh b/tools/testing/selftests/damon/reclaim.sh
index 78dbc2334cbe..e56ceb035129 100755
--- a/tools/testing/selftests/damon/reclaim.sh
+++ b/tools/testing/selftests/damon/reclaim.sh
@@ -1,14 +1,12 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 
+source _common.sh
+
 # Kselftest framework requirement - SKIP code is 4.
 ksft_skip=4
 
-if [ $EUID -ne 0 ]
-then
-	echo "Run as root"
-	exit $ksft_skip
-fi
+check_dependencies
 
 damon_reclaim_enabled="/sys/module/damon_reclaim/parameters/enabled"
 if [ ! -f "$damon_reclaim_enabled" ]
diff --git a/tools/testing/selftests/damon/sysfs.sh b/tools/testing/selftests/damon/sysfs.sh
index e9a976d296e2..83e3b7f63d81 100755
--- a/tools/testing/selftests/damon/sysfs.sh
+++ b/tools/testing/selftests/damon/sysfs.sh
@@ -1,6 +1,8 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 
+source _common.sh
+
 # Kselftest frmework requirement - SKIP code is 4.
 ksft_skip=4
 
@@ -364,14 +366,5 @@ test_damon_sysfs()
 	test_kdamonds "$damon_sysfs/kdamonds"
 }
 
-check_dependencies()
-{
-	if [ $EUID -ne 0 ]
-	then
-		echo "Run as root"
-		exit $ksft_skip
-	fi
-}
-
 check_dependencies
 test_damon_sysfs "/sys/kernel/mm/damon/admin"
diff --git a/tools/testing/selftests/damon/sysfs_update_removed_scheme_dir.sh b/tools/testing/selftests/damon/sysfs_update_removed_scheme_dir.sh
index ade35576e748..35fc32beeaf7 100755
--- a/tools/testing/selftests/damon/sysfs_update_removed_scheme_dir.sh
+++ b/tools/testing/selftests/damon/sysfs_update_removed_scheme_dir.sh
@@ -1,14 +1,12 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 
+source _common.sh
+
 # Kselftest framework requirement - SKIP code is 4.
 ksft_skip=4
 
-if [ $EUID -ne 0 ]
-then
-	echo "Run as root"
-	exit $ksft_skip
-fi
+check_dependencies
 
 damon_sysfs="/sys/kernel/mm/damon/admin"
 if [ ! -d "$damon_sysfs" ]

base-commit: e2291551827fe5d2d3758c435c191d32b6d1350e
-- 
2.43.0

5 months, 3 weeks


[PATCH] selftests/damon: introduce _common.sh to host shared function

by Enze Li

The current test scripts contain duplicated root permission checks in multiple locations. This patch consolidates these checks into _common.sh to eliminate code redundancy.

Signed-off-by: Enze Li <lienze(a)kylinos.cn>
---
 tools/testing/selftests/damon/_common.sh                 | 14 ++++++++++++++
 tools/testing/selftests/damon/lru_sort.sh                |  9 ++-------
 tools/testing/selftests/damon/reclaim.sh                 |  9 ++-------
 tools/testing/selftests/damon/sysfs.sh                   | 12 +-----------
 .../damon/sysfs_update_removed_scheme_dir.sh             |  9 ++-------
 5 files changed, 21 insertions(+), 32 deletions(-)
 create mode 100644 tools/testing/selftests/damon/_common.sh

diff --git a/tools/testing/selftests/damon/_common.sh b/tools/testing/selftests/damon/_common.sh
new file mode 100644
index 000000000000..3920b619c30f
--- /dev/null
+++ b/tools/testing/selftests/damon/_common.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Kselftest frmework requirement - SKIP code is 4.
+ksft_skip=4
+
+check_dependencies()
+{
+	if [ $EUID -ne 0 ]
+	then
+		echo "Run as root"
+		exit $ksft_skip
+	fi
+}
diff --git a/tools/testing/selftests/damon/lru_sort.sh b/tools/testing/selftests/damon/lru_sort.sh
index 61b80197c896..0d128d809fd3 100755
--- a/tools/testing/selftests/damon/lru_sort.sh
+++ b/tools/testing/selftests/damon/lru_sort.sh
@@ -1,14 +1,9 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 
-# Kselftest framework requirement - SKIP code is 4.
-ksft_skip=4
+source _common.sh
 
-if [ $EUID -ne 0 ]
-then
-	echo "Run as root"
-	exit $ksft_skip
-fi
+check_dependencies
 
 damon_lru_sort_enabled="/sys/module/damon_lru_sort/parameters/enabled"
 if [ ! -f "$damon_lru_sort_enabled" ]
diff --git a/tools/testing/selftests/damon/reclaim.sh b/tools/testing/selftests/damon/reclaim.sh
index 78dbc2334cbe..41e450a696ae 100755
--- a/tools/testing/selftests/damon/reclaim.sh
+++ b/tools/testing/selftests/damon/reclaim.sh
@@ -1,14 +1,9 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 
-# Kselftest framework requirement - SKIP code is 4.
-ksft_skip=4
+source _common.sh
 
-if [ $EUID -ne 0 ]
-then
-	echo "Run as root"
-	exit $ksft_skip
-fi
+check_dependencies
 
 damon_reclaim_enabled="/sys/module/damon_reclaim/parameters/enabled"
 if [ ! -f "$damon_reclaim_enabled" ]
diff --git a/tools/testing/selftests/damon/sysfs.sh b/tools/testing/selftests/damon/sysfs.sh
index e9a976d296e2..0326b9ad55ca 100755
--- a/tools/testing/selftests/damon/sysfs.sh
+++ b/tools/testing/selftests/damon/sysfs.sh
@@ -1,8 +1,7 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 
-# Kselftest frmework requirement - SKIP code is 4.
-ksft_skip=4
+source _common.sh
 
 ensure_write_succ()
 {
@@ -364,14 +363,5 @@ test_damon_sysfs()
 	test_kdamonds "$damon_sysfs/kdamonds"
 }
 
-check_dependencies()
-{
-	if [ $EUID -ne 0 ]
-	then
-		echo "Run as root"
-		exit $ksft_skip
-	fi
-}
-
 check_dependencies
 test_damon_sysfs "/sys/kernel/mm/damon/admin"
diff --git a/tools/testing/selftests/damon/sysfs_update_removed_scheme_dir.sh b/tools/testing/selftests/damon/sysfs_update_removed_scheme_dir.sh
index ade35576e748..730165bd7f03 100755
--- a/tools/testing/selftests/damon/sysfs_update_removed_scheme_dir.sh
+++ b/tools/testing/selftests/damon/sysfs_update_removed_scheme_dir.sh
@@ -1,14 +1,9 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 
-# Kselftest framework requirement - SKIP code is 4.
-ksft_skip=4
+source _common.sh
 
-if [ $EUID -ne 0 ]
-then
-	echo "Run as root"
-	exit $ksft_skip
-fi
+check_dependencies
 
 damon_sysfs="/sys/kernel/mm/damon/admin"
 if [ ! -d "$damon_sysfs" ]

base-commit: e2291551827fe5d2d3758c435c191d32b6d1350e
-- 
2.43.0

5 months, 3 weeks


[PATCH 0/2] selftests/cgroup: better bound for cpu.max tests

by Shashank Balaji

cpu.max selftests (both the normal one and the nested one) test throttling by setting up cpu.max, running a cpu hog process for a specified duration, and comparing usage_usec as reported by cpu.stat with the duration of the cpu hog: the two values should be far apart.

Currently, this is done by using values_close, which has two problems:

1. Semantic: values_close is used with an error percentage of 95%, which one would not expect on seeing "values close". The intent is actually "values far".
2. Accuracy: the tests can pass even if usage_usec is up to around double the expected amount. That is too high a margin for usage_usec.

Overall, this patchset improves the readability and accuracy of the cpu.max tests.

Signed-off-by: Shashank Balaji <shashank.mahadasyam(a)sony.com>
---
Shashank Balaji (2):
      selftests/cgroup: rename `expected` to `duration` in cpu.max tests
      selftests/cgroup: better bound in cpu.max tests

 tools/testing/selftests/cgroup/test_cpu.c | 42 ++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 18 deletions(-)
---
base-commit: 66701750d5565c574af42bef0b789ce0203e3071
change-id: 20250227-kselftest-cgroup-fix-cpu-max-56619928e99b

Best regards,
--
Shashank Balaji <shashank.mahadasyam(a)sony.com>
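
The accuracy problem in this cover letter can be illustrated with a small sketch. It assumes a helper shaped like the selftests' values_close (the name `values_close_sketch`, its exact formula, and `old_style_check_passes` are hypothetical stand-ins, not the in-tree code) and an illustrative hog duration of 1,000,000 usec.

```c
#include <stdlib.h>

/*
 * Hypothetical stand-in modeled on a "values close" helper: a and b count
 * as close when they differ by at most err percent of their sum.
 * The in-tree definition may differ.
 */
static int values_close_sketch(long a, long b, int err)
{
	return labs(a - b) <= (a + b) / 100 * err;
}

/*
 * Old-style check as described in the cover letter: the test passes when
 * usage_usec is below the hog duration and is NOT "close" to it at a
 * 95% error margin, i.e. the check really means "values far".
 */
static int old_style_check_passes(long usage_usec, long duration_usec)
{
	return usage_usec < duration_usec &&
	       !values_close_sketch(usage_usec, duration_usec, 95);
}
```

Under these assumptions, if throttling should yield roughly 10,000 usec of usage, the sketch still accepts 20,000 usec and only starts rejecting somewhere before 30,000 usec, which matches the "around double" margin the cover letter complains about.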

5 months, 3 weeks


[PATCH bpf-next v5 0/3] Allow mmap of /sys/kernel/btf/vmlinux

by Lorenz Bauer

I'd like to cut down the memory usage of parsing vmlinux BTF in ebpf-go. With some upcoming changes the library is sitting at 5MiB for a parse. Most of that memory is simply copying the BTF blob into user space. By allowing vmlinux BTF to be mmapped read-only into user space I can cut memory usage by about 75%.

Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v5:
- Fix error return of btf_parse_raw_mmap (Andrii)
- Link to v4: https://lore.kernel.org/r/20250510-vmlinux-mmap-v4-0-69e424b2a672@isovalent…

Changes in v4:
- Go back to remap_pfn_range for aarch64 compat
- Dropped btf_new_no_copy (Andrii)
- Fixed nits in selftests (Andrii)
- Clearer error handling in the mmap handler (Andrii)
- Fixed build on s390
- Link to v3: https://lore.kernel.org/r/20250505-vmlinux-mmap-v3-0-5d53afa060e8@isovalent…

Changes in v3:
- Remove slightly confusing calculation of trailing (Alexei)
- Use vm_insert_page (Alexei)
- Simplified libbpf code
- Link to v2: https://lore.kernel.org/r/20250502-vmlinux-mmap-v2-0-95c271434519@isovalent…

Changes in v2:
- Use btf__new in selftest
- Avoid vm_iomap_memory in btf_vmlinux_mmap
- Add VM_DONTDUMP
- Add support to libbpf
- Link to v1: https://lore.kernel.org/r/20250501-vmlinux-mmap-v1-0-aa2724572598@isovalent…

---
Lorenz Bauer (3):
      btf: allow mmap of vmlinux btf
      selftests: bpf: add a test for mmapable vmlinux BTF
      libbpf: Use mmap to parse vmlinux BTF from sysfs

 include/asm-generic/vmlinux.lds.h                  |  3 +-
 kernel/bpf/sysfs_btf.c                             | 32 ++++++++
 tools/lib/bpf/btf.c                                | 89 +++++++++++++++++-----
 tools/testing/selftests/bpf/prog_tests/btf_sysfs.c | 81 ++++++++++++++++++++
 4 files changed, 186 insertions(+), 19 deletions(-)
---
base-commit: 7220eabff8cb4af3b93cd021aa853b9f5df2923f
change-id: 20250501-vmlinux-mmap-2ec5563c3ef1

Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
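
From the user-space side, the idea in this series amounts to mapping the sysfs file instead of reading it into a heap buffer. The following is a minimal sketch, not the libbpf implementation: `map_file_readonly` is a hypothetical helper, and it assumes the file reports its size through fstat() and accepts a read-only private mapping.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a whole file read-only and return the mapping,
 * or NULL on any failure. On success *len holds the mapped length and the
 * caller releases the mapping with munmap().
 */
static const void *map_file_readonly(const char *path, size_t *len)
{
	struct stat st;
	void *addr;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;

	/* The size must be known up front to size the mapping. */
	if (fstat(fd, &st) < 0 || st.st_size <= 0) {
		close(fd);
		return NULL;
	}

	addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd); /* the mapping stays valid after the descriptor is closed */
	if (addr == MAP_FAILED)
		return NULL;

	*len = (size_t)st.st_size;
	return addr;
}
```

A caller would pass "/sys/kernel/btf/vmlinux" and, on a kernel without mmap support for that file, presumably get NULL back and fall back to a plain read().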

5 months, 3 weeks


[PATCH bpf-next v2 0/3] bpf: Show precise rejected function when attaching to __noreturn and deny list functions

by KaFai Wan

Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions.
Add log for attaching tracing programs to functions in deny list.
Add selftest for attaching tracing programs to functions in deny list.

changes:
v2:
- change verifier log message (Alexei)
- add missing Suggested-by
v1: https://lore.kernel.org/all/20250710162717.3808020-1-mannkafai@gmail.com/

---
KaFai Wan (3):
  bpf: Show precise rejected function when attaching fexit/fmod_ret to __noreturn functions
  bpf: Add log for attaching tracing programs to functions in deny list
  selftests/bpf: Add selftest for attaching tracing programs to functions in deny list

 kernel/bpf/verifier.c                              |  5 ++++-
 .../selftests/bpf/prog_tests/tracing_deny.c        | 11 +++++++++++
 .../testing/selftests/bpf/progs/fexit_noreturns.c  |  2 +-
 tools/testing/selftests/bpf/progs/tracing_deny.c   | 15 +++++++++++++++
 4 files changed, 31 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tracing_deny.c
 create mode 100644 tools/testing/selftests/bpf/progs/tracing_deny.c
--
2.43.0

5 months, 3 weeks


[PATCH v23 net-next 0/6] DUALPI2 patch

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>

Hello,

Please find the DualPI2 patch v23.

This patch series adds DualPI Improved with a Square (DualPI2) with the following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues

For more details of DualPI2, please refer to IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).

Best regards,
Chia-Yu

---
v23 (13-Jul-2025) and v22 (11-Jul-2025)
- Fix issue when user would like to change DualPI2 but provides an empty TCA_OPTIONS with no nested attributes (Paolo Abeni <pabeni(a)redhat.com>, Jakub Kicinski <kuba(a)kernel.org>)

v21 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>)
- Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>)
- Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>)
- Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>)
- Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>)
- Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub Kicinski <kuba(a)kernel.org>)
- Remove redundant name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>)
- Fix some typos and format issues of dualpi2 attributes

v20 (21-Jun-2025)
- Add one more commit to fix warning and style check on tdc.sh reported by shellcheck
- Remove double prefix of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter
<donald.hunter(a)gmail.com>)

v19 (14-Jun-2025)
- Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>)

v18 (13-Jun-2025)
- Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute
- Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>)

v17 (25-May-2025, Resent on 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)

v16 (16-May-2025)
- Add qdisc_lock() in dualpi2_timer() (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover letter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)

v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to align with the updated enum of pkt_sched.h

v14 (05-May-2025) -
Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun - Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>) - Update dualpi2.json to align with the updated variable order in pkt_sched.h - Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>) v13 (26-Apr-2025) - Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>) - Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>) - Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>) v12 (22-Apr-2025) - Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>) - Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>) - Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>) - Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>) v11 (15-Apr-2025) - Replace hstimer_init with hstimer_setup in sch_dualpi2.c v10 (25-Mar-2025) - Remove leftover include in 
include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>) - Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>) - Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>) v9 (16-Mar-2025) - Fix mem_usage error in previous version - Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking. In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets. This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0. Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 11.55 11.70 ms 350 TCP upload avg : 18.96 N/A Mbits/s 350 TCP upload sum : 18.96 N/A Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 10.81 10.70 ms 350 TCP upload avg : 18.91 N/A Mbits/s 350 TCP upload sum : 18.91 N/A Mbits/s 350 Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 12.61 12.80 ms 350 TCP upload avg : 9.48 N/A Mbits/s 350 TCP upload sum : 9.48 N/A Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 11.06 10.80 ms 350 TCP upload avg : 9.43 N/A Mbits/s 350 TCP upload sum : 9.43 N/A Mbits/s 350 Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay' Old versions: avg median # data pts Ping (ms) ICMP : 40.86 37.45 ms 350 TCP upload avg : 0.88 N/A Mbits/s 350 TCP upload sum : 0.88 N/A Mbits/s 350 TCP upload::1 : 0.88 0.97 Mbits/s 350 New version (v9): avg median # data pts Ping (ms) ICMP : 11.07 10.40 ms 350 TCP 
upload avg : 0.55 N/A Mbits/s 350 TCP upload sum : 0.55 N/A Mbits/s 350 TCP upload::1 : 0.55 0.59 Mbits/s 350 v8 (11-Mar-2025) - Fix warning messages in v7 v7 (07-Mar-2025) - Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>) v6 (04-Mar-2025) - Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>) - Update test cases in dualpi2.json - Update commit message v5 (22-Feb-2025) - A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL: Unshaped 1gigE with 4 download streams test: - Summary of tcp_4down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 1.19 1.34 ms 349 TCP download avg : 235.42 N/A Mbits/s 349 TCP download sum : 941.68 N/A Mbits/s 349 TCP download::1 : 235.19 235.39 Mbits/s 349 TCP download::2 : 235.03 235.35 Mbits/s 349 TCP download::3 : 236.89 235.44 Mbits/s 349 TCP download::4 : 234.57 235.19 Mbits/s 349 - Summary of tcp_4down run 'MQ + FQ_PIE' avg median # data pts Ping (ms) ICMP : 1.21 1.37 ms 350 TCP download avg : 235.42 N/A Mbits/s 350 TCP download sum : 941.61 N/A Mbits/s 350 TCP download::1 : 232.54 233.13 Mbits/s 350 TCP download::2 : 232.52 232.80 Mbits/s 350 TCP download::3 : 233.14 233.78 Mbits/s 350 TCP download::4 : 243.41 241.48 Mbits/s 350 - Summary of tcp_4down run 'MQ + DUALPI2' avg median # data pts Ping (ms) ICMP : 1.19 1.34 ms 349 TCP download avg : 235.42 N/A Mbits/s 349 TCP download sum : 941.68 N/A Mbits/s 349 TCP download::1 : 235.19 235.39 Mbits/s 349 TCP download::2 : 235.03 235.35 Mbits/s 349 TCP download::3 : 236.89 235.44 Mbits/s 349 TCP download::4 : 234.57 235.19 Mbits/s 349 Unshaped 1gigE with 128 download streams test: - Summary of tcp_128down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) 
ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 1.88 1.86 ms 350 TCP download avg : 7.39 N/A Mbits/s 350 TCP download sum : 946.47 N/A Mbits/s 350 Unshaped 10gigE with 4 download streams test: - Summary of tcp_4down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 0.22 0.23 ms 350 TCP download avg : 2354.08 N/A Mbits/s 350 TCP download sum : 9416.31 N/A Mbits/s 350 TCP download::1 : 2353.65 2352.81 Mbits/s 350 TCP download::2 : 2354.54 2354.21 Mbits/s 350 TCP download::3 : 2353.56 2353.78 Mbits/s 350 TCP download::4 : 2354.56 2354.45 Mbits/s 350 - Summary of tcp_4down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 0.20 0.19 ms 350 TCP download avg : 2354.76 N/A Mbits/s 350 TCP download sum : 9419.04 N/A Mbits/s 350 TCP download::1 : 2354.77 2353.89 Mbits/s 350 TCP download::2 : 2353.41 2354.29 Mbits/s 350 TCP download::3 : 2356.18 2354.19 Mbits/s 350 TCP download::4 : 2354.68 2353.15 Mbits/s 350 - Summary of tcp_4down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 0.24 0.24 ms 350 TCP download avg : 2354.11 N/A Mbits/s 350 TCP download sum : 9416.43 N/A Mbits/s 350 TCP download::1 : 2354.75 2353.93 Mbits/s 350 TCP download::2 : 2353.15 2353.75 Mbits/s 350 TCP download::3 : 2353.49 2353.72 Mbits/s 350 TCP download::4 : 2355.04 2353.73 Mbits/s 350 Unshaped 10gigE with 128 download streams test: - Summary of tcp_128down run 'MQ + FQ_CODEL': avg median # data pts Ping (ms) ICMP : 7.57 8.69 ms 350 TCP download avg : 73.97 N/A Mbits/s 350 TCP download sum : 9467.82 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + FQ_PIE': avg median # data pts Ping (ms) ICMP : 7.82 8.91 ms 350 TCP download avg : 73.97 N/A Mbits/s 350 TCP download sum : 9468.42 N/A Mbits/s 350 - Summary of tcp_128down run 'MQ + DUALPI2': avg median # data pts Ping (ms) ICMP : 6.87 7.93 ms 350 TCP download avg : 73.95 N/A Mbits/s 
350 TCP download sum : 9465.87 N/A Mbits/s 350 From the results shown above, we see small differences between combinations. - Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>) - Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>) - Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>) - Update license identifier (Dave Taht <dave.taht(a)gmail.com>) - Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>) - Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>) - Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>) - Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c - Fix step_thresh in packets - Update code comments in sch_dualpi2.c v4 (22-Oct-2024) - Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>) - Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>) - Fix line length warning. 
v3 (19-Oct-2024)
- Fix compilation error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)

v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning

---
Chia-Yu Chang (5):
  sched: Struct definition and parsing of dualpi2 qdisc
  sched: Dump configuration and statistics of dualpi2 qdisc
  selftests/tc-testing: Fix warning and style check on tdc.sh
  selftests/tc-testing: Add selftests for qdisc DualPI2
  Documentation: netlink: specs: tc: Add DualPI2 specification

Koen De Schepper (1):
  sched: Add enqueue/dequeue of dualpi2 qdisc

 Documentation/netlink/specs/tc.yaml                |  151 ++-
 include/net/dropreason-core.h                      |    6 +
 include/uapi/linux/pkt_sched.h                     |   68 +
 net/sched/Kconfig                                  |   12 +
 net/sched/Makefile                                 |    1 +
 net/sched/sch_dualpi2.c                            | 1171 +++++++++++++++++
 tools/testing/selftests/tc-testing/config          |    1 +
 .../tc-testing/tc-tests/qdiscs/dualpi2.json        |  254 ++++
 tools/testing/selftests/tc-testing/tdc.sh          |    6 +-
 9 files changed, 1665 insertions(+), 5 deletions(-)
 create mode 100644 net/sched/sch_dualpi2.c
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
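
The L4S identifier mentioned in the cover letter (IP.ECN==0b*1) can be read as "the low bit of the two-bit ECN field is set", which covers ECT(1) and CE. The sketch below shows only that bit test; the enum and function names are hypothetical, and the qdisc in this series also has a configurable ECN mask that the sketch ignores.

```c
/* Hypothetical queue labels for illustration only. */
enum queue_sketch { CLASSIC_QUEUE = 0, L4S_QUEUE = 1 };

/*
 * Classify a packet from its IPv4 TOS / IPv6 traffic class byte.
 * The ECN field is the low two bits: 00 Not-ECT, 01 ECT(1),
 * 10 ECT(0), 11 CE. Codepoints matching 0b*1 go to the L-queue.
 */
static enum queue_sketch classify_by_ecn(unsigned char tos)
{
	unsigned char ecn = tos & 0x3;

	return (ecn & 0x1) ? L4S_QUEUE : CLASSIC_QUEUE;
}
```

Not-ECT and ECT(0) traffic lands in the classic queue here, which is the split the feature list describes as separating L4S traffic from the remaining traffic.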

5 months, 3 weeks


[PATCH] selftest/futex: fix format-security warnings in futex_priv_hash

by Nai-Chen Cheng

Fix format-security warnings by using proper format strings when passing message variables to the ksft_exit_fail_msg(), ksft_test_result_pass(), and ksft_test_result_skip() functions.

This prevents potential security issues and eliminates compiler warnings when building with -Wformat-security.

Signed-off-by: Nai-Chen Cheng <bleach1827(a)gmail.com>
---
 .../selftests/futex/functional/futex_priv_hash.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c index 24a92dc94eb8..19651087c4de 100644 --- a/tools/testing/selftests/futex/functional/futex_priv_hash.c +++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c @@ -184,10 +184,10 @@ int main(int argc, char *argv[]) futex_slots1 = futex_hash_slots_get(); if (futex_slots1 <= 0) { ksft_print_msg("Current hash buckets: %d\n", futex_slots1); - ksft_exit_fail_msg(test_msg_auto_create); + ksft_exit_fail_msg("%s", test_msg_auto_create); } - ksft_test_result_pass(test_msg_auto_create); + ksft_test_result_pass("%s", test_msg_auto_create); online_cpus = sysconf(_SC_NPROCESSORS_ONLN); ret = pthread_barrier_init(&barrier_main, NULL, MAX_THREADS + 1); @@ -212,11 +212,11 @@ int main(int argc, char *argv[]) if (futex_slotsn < 0 || futex_slots1 == futex_slotsn) { ksft_print_msg("Expected increase of hash buckets but got: %d -> %d\n", futex_slots1, futex_slotsn); - ksft_exit_fail_msg(test_msg_auto_inc); + ksft_exit_fail_msg("%s", test_msg_auto_inc); } - ksft_test_result_pass(test_msg_auto_inc); + ksft_test_result_pass("%s", test_msg_auto_inc); } else { - ksft_test_result_skip(test_msg_auto_inc); + ksft_test_result_skip("%s", test_msg_auto_inc); } ret = pthread_mutex_unlock(&global_lock);
--
2.43.0
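
The pattern this patch applies can be shown outside kselftest with a small sketch. `report_sketch` and `report_message` are hypothetical printf-style helpers standing in for the ksft_* functions; the point is only that a message held in a variable is passed as an argument to "%s" rather than as the format itself.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical printf-style reporter; the attribute lets the compiler
 * check format strings at call sites (GCC/Clang extension). */
__attribute__((format(printf, 3, 4)))
static int report_sketch(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf, len, fmt, ap);
	va_end(ap);
	return n;
}

/*
 * Calling report_sketch(buf, len, msg) would use msg as the format and
 * draw a -Wformat-security warning; any '%' in msg would be interpreted.
 * Passing msg through "%s" prints it literally.
 */
static int report_message(char *buf, size_t len, const char *msg)
{
	return report_sketch(buf, len, "%s", msg);
}
```

With this shape, a message such as "50%d done" comes out unchanged instead of consuming a nonexistent integer argument.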

5 months, 3 weeks


[PATCH] kselftest/arm64: Provide local defines for AT_HWCAP3

by Mark Brown

Some build environments for the selftests are not picking up the newly added AT_HWCAP3 when using the libc headers, even with headers_install (which we require already for the arm64 selftests). As a quick fix, add local definitions of the constant to the tools that use it; while auxvec.h is installed with some toolchains, it needs some persuasion to get picked up.

Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
 tools/testing/selftests/arm64/abi/hwcap.c       | 4 ++++
 tools/testing/selftests/arm64/mte/check_prctl.c | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/tools/testing/selftests/arm64/abi/hwcap.c b/tools/testing/selftests/arm64/abi/hwcap.c index 35f521e5f41c..aa902408facd 100644 --- a/tools/testing/selftests/arm64/abi/hwcap.c +++ b/tools/testing/selftests/arm64/abi/hwcap.c @@ -21,6 +21,10 @@ #define TESTS_PER_HWCAP 3 +#ifndef AT_HWCAP3 +#define AT_HWCAP3 29 +#endif + /* * Function expected to generate exception when the feature is not * supported and return when it is supported. If the specific exception
diff --git a/tools/testing/selftests/arm64/mte/check_prctl.c b/tools/testing/selftests/arm64/mte/check_prctl.c index 4c89e9538ca0..c36c4c49ff95 100644 --- a/tools/testing/selftests/arm64/mte/check_prctl.c +++ b/tools/testing/selftests/arm64/mte/check_prctl.c @@ -12,6 +12,10 @@ #include "kselftest.h" +#ifndef AT_HWCAP3 +#define AT_HWCAP3 29 +#endif + static int set_tagged_addr_ctrl(int val) { int ret;
---
base-commit: 86731a2a651e58953fc949573895f2fa6d456841
change-id: 20250710-arm64-selftest-bodge-hwcap3-b6ab30ab69cd

Best regards,
--
Mark Brown <broonie(a)kernel.org>
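
The fallback define in this patch can be exercised on its own. The sketch below repeats the guarded definition from the diff and reads the auxiliary vector entry with getauxval(); the wrapper name `read_hwcap3` is hypothetical and not part of the selftests.

```c
#include <sys/auxv.h>

/* Same guarded fallback as in the patch: only used when the libc or
 * kernel headers in the build environment do not provide the constant. */
#ifndef AT_HWCAP3
#define AT_HWCAP3 29
#endif

/*
 * Hypothetical wrapper: returns the AT_HWCAP3 bitmask, or 0 when the
 * running kernel does not supply that auxiliary vector entry.
 */
static unsigned long read_hwcap3(void)
{
	return getauxval(AT_HWCAP3);
}
```

Because the define is guarded, toolchains that already ship the constant are unaffected, and older ones still compile.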

5 months, 3 weeks


[PATCH v9 00/29] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE)

by Nicolin Chen

The vIOMMU object is designed to represent a slice of an IOMMU HW for its virtualization features shared with or passed to user space (a VM mostly) in a way of HW acceleration. This extends the HWPT-based design for more advanced virtualization features.

HW QUEUE introduced by this series as a part of the vIOMMU infrastructure represents a HW accelerated queue/buffer for VM to use exclusively, e.g.
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
each of which allows its IOMMU HW to directly access a queue memory owned by a guest VM and allows a guest OS to control the HW queue directly, avoiding VM Exit overheads and improving performance.

Introduce IOMMUFD_OBJ_HW_QUEUE and its pairing IOMMUFD_CMD_HW_QUEUE_ALLOC allowing VMM to forward the IOMMU-specific queue info, such as queue base address, size, etc.

Meanwhile, a guest-owned queue needs the guest kernel to control the queue by reading/writing its consumer and producer indexes, via MMIO accesses to the hardware MMIO registers. Introduce an mmap infrastructure for iommufd to support passing through a piece of MMIO region from the host physical address space to the guest physical address space. The mmap info (offset/length) used by an mmap syscall must be pre-allocated and returned to the user space via an output driver-data during an IOMMUFD_CMD_HW_QUEUE_ALLOC call. Thus, it requires a driver-specific user data support in the vIOMMU allocation flow.

As a real-world use case, this series implements a HW QUEUE support in the tegra241-cmdqv driver for VCMDQs on NVIDIA Grace CPU. In other words, it is also the Tegra CMDQV series Part-2 (user-space support), reworked from Previous RFCv1:
    https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/

This enables the HW accelerated feature for NVIDIA Grace CPU.
Compared to the standard SMMUv3 operating in the nested translation mode trapping CMDQ for TLBI and ATC_INV commands, this gives a huge performance improvement: 70% to 90% reductions of invalidation time were measured by various DMA unmap tests running in a guest OS.

// Unmap latencies from "dma_map_benchmark -g @granule -t @threads",
// by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq"
@granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0
    4KB        1         35.7 us          5.3 us
   16KB        1         41.8 us          6.8 us
   64KB        1         68.9 us          9.9 us
  128KB        1        109.0 us         12.6 us
  256KB        1        187.1 us         18.0 us
    4KB        2         96.9 us          6.8 us
   16KB        2         97.8 us          7.5 us
   64KB        2        151.5 us         10.7 us
  128KB        2        257.8 us         12.7 us
  256KB        2        443.0 us         17.9 us

This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v9

Pairing QEMU branch for testing (reusing v8):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v8

Changelog
v9 (attached git-diff v8..v9 at the end of this letter)
 * Add Reviewed-by from Vasant and Jason
 * [iommufd] Fix offset calculation
 * [iommufd] Add unaligned iova/length selftest coverage for hw_queue
 * [iommufd] Pass in aligned iova/length to iommufd_access_pin_pages()
 * [smmu] Change "u32 *type" at arm_smmu_hw_info() in the header
v8
 https://lore.kernel.org/all/cover.1751677708.git.nicolinc@nvidia.com/
 * Add Reviewed-by from Pranj, Kevin and Jason
 * Improve kdoc and comments
 * [iommufd] Skip selftest for no_viommu variants
 * [iommufd] Add unmap coverage for non internal area
 * [iommufd] Skip the first page when mtree_alloc_range()
 * [iommufd] Correct the passed in index to mtree_erase()
 * [iommufd] Correct variable types in iommufd_hw_queue_alloc_phys()
 * [iommufd] Reject iopt_unmap_iova_range() if area->num_locks is set
 * [tegra] Rename "SID replacement" with "SID mapping"
 * [tegra] Unwrap useless _tegra241_vcmdq_hw_init helper
v7
 https://lore.kernel.org/all/cover.1750966133.git.nicolinc@nvidia.com/
 * Rebased on Jason's for-next tree (iommufd_hw_queue-prep series)
 * Add Reviewed-by
from Baolu, Jason, Pranjal * Update kdocs and notes * [iommu] Replace "u32" with "enum iommu_hw_info_type" * [iommufd] Rename vdev->id to vdev->virt_id * [iommufd] Replace macros with inline helpers * [iommufd] Report unmapped_bytes in error path * [iommufd] Add iommufd_access_is_internal helper * [iommufd] Do not drop ops->unmap check for mdevs * [iommufd] Store physical addresses in immap structure * [iommufd] Reorder access and hw_queue object allocations * [iommufd] Scan for an internal access before any unmap call * [iommufd] Drop unused ictx pointer in struct iommufd_hw_queue * [iommufd] Use kcalloc to avoid failure due to memory fragmentation * [tegra] Use "else" * [tegra] Lock destroy() using lvcmdq_mutex v6 https://lore.kernel.org/all/cover.1749884998.git.nicolinc@nvidia.com/ * Rebase on iommufd_hw_queue-prep-v2 * Add Reviewed-by from Kevin and Jason * [iommufd] Update kdocs and notes * [iommufd] Drop redundant pages[i] check * [iommufd] Allow nesting_parent_iova to be 0 * [iommufd] Add iommufd_hw_queue_alloc_phys() * [iommufd] Revise iommufd_viommu_alloc/destroy_mmap APIs * [iommufd] Move destroy ops to vdevice/hw_queue structures * [iommufd] Add union in hw_info struct to share out_data_type field * [iommufd] Replace iopt_pin/unpin_pages() with internal access APIs * [iommufd] Replace vdevice_alloc with vdevice_size and vdevice_init * [iommufd] Replace hw_queue_alloc with get_hw_queue_size/hw_queue_init * [iommufd] Replace IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA with init_phys * [smmu] Drop arm_smmu_domain_ipa_to_pa * [smmu] Update arm_smmu_impl_ops changes for vsmmu_init * [tegra] Add a vdev_to_vsid macro * [tegra] Add lvcmdq_mutex to protect multi queues * [tegra] Drop duplicated kcalloc for vintf->lvcmdqs (memory leak) v5 https://lore.kernel.org/all/cover.1747537752.git.nicolinc@nvidia.com/ * Rebase on v6.15-rc6 * Add Reviewed-by from Jason and Kevin * Correct typos in kdoc and update commit logs * [iommufd] Add a cosmetic fix * [iommufd] Drop unused 
num_pfns * [iommufd] Drop unnecessary check * [iommufd] Reorder patch sequence * [iommufd] Use io_remap_pfn_range() * [iommufd] Use success oriented flow * [iommufd] Fix max_npages calculation * [iommufd] Add more selftest coverage * [iommufd] Drop redundant static_assert * [iommufd] Fix mmap pfn range validation * [iommufd] Reject unmap on pinned iovas * [iommufd] Drop redundant vm_flags_set() * [iommufd] Drop iommufd_struct_destroy() * [iommufd] Drop redundant queue iova test * [iommufd] Use "mmio_addr" and "mmio_pfn" * [iommufd] Rename to "nesting_parent_iova" * [iommufd] Make iopt_pin_pages call option * [iommufd] Add ictx comparison in depend() * [iommufd] Add iommufd_object_alloc_ucmd() * [iommufd] Move kcalloc() after validations * [iommufd] Replace ictx setting with WARN_ON * [iommufd] Make hw_info's type bidirectional * [smmu] Add supported_vsmmu_type in impl_ops * [smmu] Drop impl report in smmu vendor struct * [tegra] Add IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV * [tegra] Replace "number of VINTFs" with a note * [tegra] Drop the redundant lvcmdq pointer setting * [tegra] Flag IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA * [tegra] Use "vintf_alloc_vsid" for vdevice_alloc op v4 https://lore.kernel.org/all/cover.1746757630.git.nicolinc@nvidia.com/ * Rebase on v6.15-rc5 * Add Reviewed-by from Vasant * Rename "vQUEUE" to "HW QUEUE" * Use "offset" and "length" for all mmap-related variables * [iommufd] Use u64 for guest PA * [iommufd] Fix typo in uAPI doc * [iommufd] Rename immap_id to offset * [iommufd] Drop the partial-size mmap support * [iommufd] Do not replace WARN_ON with WARN_ON_ONCE * [iommufd] Use "u64 base_addr" for queue base address * [iommufd] Use u64 base_pfn/num_pfns for immap structure * [iommufd] Correct the size passed in to mtree_alloc_range() * [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops v3 https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/ * Add Reviewed-by from Baolu, Pranjal, and Alok * Revise kdocs, uAPI docs, 
and commit logs * Rename "vCMDQ" back to "vQUEUE" for AMD cases * [tegra] Add tegra241_vcmdq_hw_flush_timeout() * [tegra] Rename vsmmu_alloc to alloc_vintf_user * [tegra] Use writel for SID replacement registers * [tegra] Move mmap removal call to vsmmu_destroy op * [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user() * [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED() * [iommufd] Add an object-type "owner" to immap structure * [iommufd] Drop the ictx input in the new for-driver APIs * [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle * [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers * [iommufd] Rename iommufd_ctx_alloc/free_mmap to _iommufd_alloc/destroy_mmap v2 https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/ * Add Reviewed-by from Jason * [smmu] Fix vsmmu initial value * [smmu] Support impl for hw_info * [tegra] Rename "slot" to "vsid" * [tegra] Update kdocs and commit logs * [tegra] Map/unmap LVCMDQ dynamically * [tegra] Refcount the previous LVCMDQ * [tegra] Return -EEXIST if LVCMDQ exists * [tegra] Simplify VINTF cleanup routine * [tegra] Use vmid and s2_domain in vsmmu * [tegra] Rename "mmap_pgoff" to "immap_id" * [tegra] Add more addr and length validation * [iommufd] Add more narrative to mmap's kdoc * [iommufd] Add iommufd_struct_depend/undepend() * [iommufd] Rename vcmdq_free op to vcmdq_destroy * [iommufd] Fix bug in iommu_copy_struct_to_user() * [iommufd] Drop is_io from iommufd_ctx_alloc_mmap() * [iommufd] Test the queue memory for its contiguity * [iommufd] Return -ENXIO if address or length fails * [iommufd] Do not change @min_last in mock_viommu_alloc() * [iommufd] Generalize TEGRA241_VCMDQ data in core structure * [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC * [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping v1 https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/ Thanks Nicolin Nicolin Chen (29): iommufd: Report unmapped bytes in the 
error path of iopt_unmap_iova_range iommufd: Correct virt_id kdoc at struct iommu_vdevice_alloc iommufd/viommu: Explicitly define vdev->virt_id iommu: Use enum iommu_hw_info_type for type in hw_info op iommu: Add iommu_copy_struct_to_user helper iommu: Pass in a driver-level user data structure to viommu_init op iommufd/viommu: Allow driver-specific user data for a vIOMMU object iommufd/selftest: Support user_data in mock_viommu_alloc iommufd/selftest: Add coverage for viommu data iommufd/access: Add internal APIs for HW queue to use iommufd/access: Bypass access->ops->unmap for internal use iommufd/viommu: Add driver-defined vDEVICE support iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC iommufd: Add mmap interface iommufd/selftest: Add coverage for the new mmap interface Documentation: userspace-api: iommufd: Update HW QUEUE iommu: Allow an input type in hw_info op iommufd: Allow an input data_type via iommu_hw_info iommufd/selftest: Update hw_info coverage for an input data_type iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops iommu/tegra241-cmdqv: Use request_threaded_irq iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf() iommu/tegra241-cmdqv: Do not statically map LVCMDQs iommu/tegra241-cmdqv: Add user-space use support iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 25 +- drivers/iommu/iommufd/io_pagetable.h | 5 +- drivers/iommu/iommufd/iommufd_private.h | 46 +- drivers/iommu/iommufd/iommufd_test.h | 20 + include/linux/iommu.h | 50 +- include/linux/iommufd.h | 160 ++++++ include/uapi/linux/iommufd.h | 147 +++++- tools/testing/selftests/iommu/iommufd_utils.h | 89 +++- 
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 28 +- .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 477 +++++++++++++++++- drivers/iommu/intel/iommu.c | 7 +- drivers/iommu/iommufd/device.c | 87 +++- drivers/iommu/iommufd/driver.c | 82 ++- drivers/iommu/iommufd/io_pagetable.c | 13 +- drivers/iommu/iommufd/main.c | 69 +++ drivers/iommu/iommufd/pages.c | 12 +- drivers/iommu/iommufd/selftest.c | 153 +++++- drivers/iommu/iommufd/viommu.c | 218 +++++++- tools/testing/selftests/iommu/iommufd.c | 141 +++++- .../selftests/iommu/iommufd_fail_nth.c | 15 +- Documentation/userspace-api/iommufd.rst | 12 + 21 files changed, 1745 insertions(+), 111 deletions(-) -- 2.43.0 diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index aa25156e04a3..3fa02c51df9f 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -1045,7 +1045,8 @@ struct arm_vsmmu { }; #if IS_ENABLED(CONFIG_ARM_SMMU_V3_IOMMUFD) -void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type); +void *arm_smmu_hw_info(struct device *dev, u32 *length, + enum iommu_hw_info_type *type); size_t arm_smmu_get_viommu_size(struct device *dev, enum iommu_viommu_type viommu_type); int arm_vsmmu_init(struct iommufd_viommu *viommu, diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index 00641204efb2..91339f799916 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -206,7 +206,11 @@ static void iommufd_hw_queue_destroy_access(struct iommufd_ctx *ictx, struct iommufd_access *access, u64 base_iova, size_t length) { - iommufd_access_unpin_pages(access, base_iova, length); + u64 aligned_iova = PAGE_ALIGN_DOWN(base_iova); + u64 offset = base_iova - aligned_iova; + + iommufd_access_unpin_pages(access, aligned_iova, + PAGE_ALIGN(length + offset)); iommufd_access_detach_internal(access); iommufd_access_destroy_internal(ictx, access); } @@ -239,22 +243,23 @@ static struct 
iommufd_access * iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd, struct iommufd_viommu *viommu, phys_addr_t *base_pa) { + u64 aligned_iova = PAGE_ALIGN_DOWN(cmd->nesting_parent_iova); + u64 offset = cmd->nesting_parent_iova - aligned_iova; struct iommufd_access *access; struct page **pages; size_t max_npages; size_t length; - u64 offset; size_t i; int rc; - offset = - cmd->nesting_parent_iova - PAGE_ALIGN(cmd->nesting_parent_iova); - /* DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE) */ + /* max_npages = DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE) */ if (check_add_overflow(offset, cmd->length, &length)) return ERR_PTR(-ERANGE); if (check_add_overflow(length, PAGE_SIZE - 1, &length)) return ERR_PTR(-ERANGE); max_npages = length / PAGE_SIZE; + /* length needs to be page aligned too */ + length = max_npages * PAGE_SIZE; /* * Use kvcalloc() to avoid memory fragmentation for a large page array. @@ -274,8 +279,7 @@ iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd, if (rc) goto out_destroy; - rc = iommufd_access_pin_pages(access, cmd->nesting_parent_iova, - cmd->length, pages, 0); + rc = iommufd_access_pin_pages(access, aligned_iova, length, pages, 0); if (rc) goto out_detach; @@ -287,13 +291,12 @@ iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd, goto out_unpin; } - *base_pa = page_to_pfn(pages[0]) << PAGE_SHIFT; + *base_pa = (page_to_pfn(pages[0]) << PAGE_SHIFT) + offset; kfree(pages); return access; out_unpin: - iommufd_access_unpin_pages(access, cmd->nesting_parent_iova, - cmd->length); + iommufd_access_unpin_pages(access, aligned_iova, length); out_detach: iommufd_access_detach_internal(access); out_destroy: diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c index 9d5b852d5e19..d59d48022a24 100644 --- a/tools/testing/selftests/iommu/iommufd.c +++ b/tools/testing/selftests/iommu/iommufd.c @@ -3104,17 +3104,18 @@ TEST_F(iommufd_viommu, hw_queue) /* Allocate index=0, declare ownership of 
the iova */ test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova, PAGE_SIZE, &hw_queue_id[0]); - /* Fail duplicate */ + /* Fail duplicated index */ test_err_hw_queue_alloc(EEXIST, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova, PAGE_SIZE, &hw_queue_id[0]); /* Fail unmap, due to iova ownership */ test_err_ioctl_ioas_unmap(EBUSY, iova, PAGE_SIZE); /* The 2nd page is not pinned, so it can be unmmap */ - test_ioctl_ioas_unmap(iova + PAGE_SIZE, PAGE_SIZE); + test_ioctl_ioas_unmap(iova2, PAGE_SIZE); - /* Allocate index=1 */ + /* Allocate index=1, with an unaligned case */ test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 1, - iova, PAGE_SIZE, &hw_queue_id[1]); + iova + PAGE_SIZE / 2, PAGE_SIZE / 2, + &hw_queue_id[1]); /* Fail to destroy, due to dependency */ EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, hw_queue_id[0]));
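The alignment arithmetic in the viommu.c hunk above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `SKETCH_PAGE_SIZE`, `struct pin_span` and `compute_pin_span()` are hypothetical names, the page size is assumed to be 4 KiB, and the GCC/Clang builtin `__builtin_add_overflow()` stands in for the kernel's `check_add_overflow()`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumed page size, for illustration only */

/* Hypothetical container for the values the patch derives from (iova, length). */
struct pin_span {
	uint64_t aligned_iova; /* iova rounded down to a page boundary */
	uint64_t offset;       /* distance of the original iova into its first page */
	uint64_t length;       /* page-aligned number of bytes to pin */
	uint64_t npages;       /* DIV_ROUND_UP(offset + len, page size) */
};

/* Returns false if offset + len, or the round-up, would overflow. */
static bool compute_pin_span(uint64_t iova, uint64_t len, struct pin_span *out)
{
	uint64_t total;

	out->aligned_iova = iova & ~(SKETCH_PAGE_SIZE - 1);
	out->offset = iova - out->aligned_iova;

	if (__builtin_add_overflow(out->offset, len, &total))
		return false;
	if (__builtin_add_overflow(total, SKETCH_PAGE_SIZE - 1, &total))
		return false;

	out->npages = total / SKETCH_PAGE_SIZE;
	out->length = out->npages * SKETCH_PAGE_SIZE;
	return true;
}
```

With the selftest's unaligned case (a queue starting half a page into a page and half a page long), this yields one page to pin, starting at the page boundary, with the offset added back to the base physical address afterwards.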

5 months, 3 weeks


[PATCH v3 0/6] binder: Set up KUnit tests for alloc

by Tiffany Yang

Hello,

binder_alloc_selftest provides a robust set of checks for the binder allocator, but it rarely runs because it must hook into a running binder process and block all other binder threads until it completes. The test itself is a good candidate for conversion to KUnit, and it can be further isolated from user processes by using a test-specific lru freelist instead of the global one.

This series converts the selftest to KUnit to make it less burdensome to run and to set up a foundation for unit testing future binder_alloc changes.

Thanks,
Tiffany

Tiffany Yang (6):
  binder: Fix selftest page indexing
  binder: Store lru freelist in binder_alloc
  kunit: test: Export kunit_attach_mm()
  binder: Scaffolding for binder_alloc KUnit tests
  binder: Convert binder_alloc selftests to KUnit
  binder: encapsulate individual alloc test cases

 drivers/android/Kconfig | 15 +-
 drivers/android/Makefile | 2 +-
 drivers/android/binder.c | 10 +-
 drivers/android/binder_alloc.c | 39 +-
 drivers/android/binder_alloc.h | 14 +-
 drivers/android/binder_alloc_selftest.c | 306 -----------
 drivers/android/binder_internal.h | 4 +
 drivers/android/tests/.kunitconfig | 3 +
 drivers/android/tests/Makefile | 3 +
 drivers/android/tests/binder_alloc_kunit.c | 573 +++++++++++++++++++++
 include/kunit/test.h | 12 +
 lib/kunit/user_alloc.c | 4 +-
 12 files changed, 645 insertions(+), 340 deletions(-)
 delete mode 100644 drivers/android/binder_alloc_selftest.c
 create mode 100644 drivers/android/tests/.kunitconfig
 create mode 100644 drivers/android/tests/Makefile
 create mode 100644 drivers/android/tests/binder_alloc_kunit.c

--
2.50.0.727.gbf7dc18ff4-goog

5 months, 3 weeks


[PATCH v7 0/7] use per-vma locks for /proc/pid/maps reads

by Suren Baghdasaryan

Reading /proc/pid/maps requires read-locking mmap_lock which prevents any other task from concurrently modifying the address space. This guarantees coherent reporting of virtual address ranges, however it can block important updates from happening. Oftentimes /proc/pid/maps readers are low priority monitoring tasks and them blocking high priority tasks results in priority inversion.

Locking the entire address space is required to present a fully coherent picture of the address space, however even the current implementation does not strictly guarantee that by outputting vmas in page-size chunks and dropping mmap_lock in between each chunk. Address space modifications are possible while mmap_lock is dropped and userspace reading the content is expected to deal with possible concurrent address space modifications. Considering these relaxed rules, holding mmap_lock is not strictly needed as long as we can guarantee that a concurrently modified vma is reported either in its original form or after it was modified.

This patchset switches from holding mmap_lock while reading /proc/pid/maps to taking per-vma locks as we walk the vma tree. This reduces the contention with tasks modifying the address space because they would have to contend for the same vma as opposed to the entire address space.

The previous version of this patchset [1] tried to perform /proc/pid/maps reading under RCU, however its implementation is quite complex and the results are worse than the new version because it still relied on mmap_lock speculation which retries if any part of the address space gets modified. The new implementation is both simpler and results in less contention. Note that a similar approach would not work for /proc/pid/smaps reading as it also walks the page table and that's not RCU-safe.

Paul McKenney designed a test [2] to measure mmap/munmap latencies while concurrently reading /proc/pid/maps.
The test has a pair of processes scanning /proc/PID/maps, and another process unmapping and remapping 4K pages from a 128MB range of anonymous memory. At the end of each 10 second run, the latency of each mmap() or munmap() operation is measured, and for each run the maximum and mean latency is printed. The map/unmap process is started first, its PID is passed to the scanners, and then the map/unmap process waits until both scanners are running before starting its timed test. The scanners keep scanning until the specified /proc/PID/maps file disappears.

The latest results from Paul:

Stock mm-unstable, all of the runs had maximum latencies in excess of 0.5 milliseconds, and with 80% of the runs' latencies exceeding a full millisecond, and ranging up beyond 4 full milliseconds. In contrast, 99% of the runs with this patch series applied had maximum latencies of less than 0.5 milliseconds, with the single outlier at only 0.608 milliseconds.

From a median-performance (as opposed to maximum-latency) viewpoint, this patch series also looks good, with stock mm weighing in at 11 microseconds and patch series at 6 microseconds, better than a 2x improvement.

Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.011 0.008 0.521
0.011 0.008 0.552
0.011 0.008 0.590
0.011 0.008 0.660
...
0.011 0.015 2.987
0.011 0.015 3.038
0.011 0.016 3.431
0.011 0.016 4.707

After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.006 0.005 0.026
0.006 0.005 0.029
0.006 0.005 0.034
0.006 0.005 0.035
...
0.006 0.006 0.421
0.006 0.006 0.423
0.006 0.006 0.439
0.006 0.006 0.608

The patchset also adds a number of tests to check for /proc/pid/maps data coherency. They are designed to detect any unexpected data tearing while performing some common address space modifications (vma split, resize and remap).
Even before these changes, reading /proc/pid/maps might have inconsistent data because the file is read page-by-page with mmap_lock being dropped between the pages. An example of user-visible inconsistency can be that the same vma is printed twice: once before it was modified and then after the modifications. For example if vma was extended, it might be found and reported twice. What is not expected is to see a gap where there should have been a vma both before and after modification. This patchset increases the chances of such tearing, therefore it's even more important now to test for unexpected inconsistencies.

In [3] Lorenzo identified the following possible vma merging/splitting scenarios:

Merges with changes to existing vmas:
1. Merge both - mapping a vma over another one and between two vmas which can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing one and partially over its left neighbor;

Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start of an existing one;

Splits
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;

If such merges or splits happen concurrently with the /proc/maps reading we might report a vma twice, once before the modification and once after it is modified:

Case 1 might report overwritten and previous vma along with the final merged vma;
Case 2 might report previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by shrinking of the right neighbor;
Case 4 might report overwritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by shrinking of the left neighbor;
Case 6 might report previous vma and the gap along with the final merged vma;
Case 7 might report previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering the gap;
Case 9 might cause us to retry once we detect the temporary gap caused by shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused by shrinking of the original vma at the vma end;

In all these cases the retry mechanism prevents us from reporting possible temporary gaps.
Changes since v6 [4]:
- Updated patch 7/8 changelog, per Lorenzo Stoakes
- Added comments, per Lorenzo Stoakes
- Added Reviewed-by, per Lorenzo Stoakes and Liam Howlett
- Replaced iter with vmi, per Lorenzo Stoakes
- Renamed from lock_vma_under_mmap_lock() to lock_next_vma_under_mmap_lock(), per Lorenzo Stoakes
- Renamed lock_next_vma() parameter from addr to from_addr
- Renamed labels in lock_next_vma() to reflect fallback cases, per Lorenzo Stoakes
- Handle vma_start_read_locked() failure inside lock_next_vma_under_mmap_lock() and added fallback_to_mmap_lock() for that, per Vlastimil Babka
- Added missing vma_iter_init() after re-entering rcu read section inside lock_next_vma(), per Vlastimil Babka
- Replaced vma_iter_init() with vma_iter_set(), per Liam Howlett
- Removed the last patch converting PROCMAP_QUERY to use per-vma locks. That patch will be posted separately, per David Hildenbrand, Vlastimil Babka and Liam Howlett
- Updated performance numbers, per Paul E. McKenney

!!! NOTES FOR APPLYING THE PATCHSET !!!
Applies cleanly over mm-unstable after reverting v6 version of this patchset (from 2771a4b86aa1 to a20b00f7cf33 in mm-unstable).
[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo…
[4] https://lore.kernel.org/all/20250704060727.724817-1-surenb@google.com/

Suren Baghdasaryan (7):
  selftests/proc: add /proc/pid/maps tearing from vma split test
  selftests/proc: extend /proc/pid/maps tearing test to include vma resizing
  selftests/proc: extend /proc/pid/maps tearing test to include vma remapping
  selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified
  selftests/proc: add verbose more for tests to facilitate debugging
  fs/proc/task_mmu: remove conversion of seq_file position to unsigned
  fs/proc/task_mmu: read proc/pid/maps under per-vma lock

 fs/proc/internal.h | 5 +
 fs/proc/task_mmu.c | 155 +++-
 include/linux/mmap_lock.h | 11 +
 mm/madvise.c | 3 +-
 mm/mmap_lock.c | 93 ++
 tools/testing/selftests/proc/.gitignore | 1 +
 tools/testing/selftests/proc/Makefile | 1 +
 tools/testing/selftests/proc/proc-maps-race.c | 829 ++++++++++++++++++
 8 files changed, 1082 insertions(+), 16 deletions(-)
 create mode 100644 tools/testing/selftests/proc/proc-maps-race.c

--
2.50.0.727.gbf7dc18ff4-goog
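The coherency rule described in this cover letter (a vma may be reported twice, but a range that stays mapped must never show up as a gap) can be sketched as a small userspace check. This is not the code from proc-maps-race.c; `range_fully_reported()` is a hypothetical helper that assumes the maps text is sorted by start address, as /proc/pid/maps output is.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns true if every byte of [lo, hi) is covered by at least one
 * "start-end ..." line in maps_text. Duplicate or overlapping lines are
 * tolerated; a line that starts beyond the covered prefix leaves a gap.
 */
static bool range_fully_reported(const char *maps_text,
				 unsigned long lo, unsigned long hi)
{
	unsigned long covered = lo; /* everything in [lo, covered) has been seen */
	const char *line = maps_text;

	while (line && *line && covered < hi) {
		unsigned long start, end;

		/* Extend the covered prefix only if this line touches it. */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
		    start <= covered && end > covered)
			covered = end;

		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return covered >= hi;
}
```

A reader that sees the same vma both before and after an extension still passes this check, while a temporary hole inside a range that was mapped throughout the read does not.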

5 months, 3 weeks


[PATCH 0/2] bpf, arm64: relax constraint in BPF JIT compiler

by Alexis Lothoré (eBPF Foundation)

Hello,

this series follows up on the one introducing 9+ args for tracing programs [1]. It has been observed with this series that there are cases for which we cannot accurately identify the location of the target function arguments to correctly prepare the corresponding BPF trampoline. This is the case for example if:
- the function consumes a struct variable _by value_
- it is passed on the stack (no more register available for it)
- it has some __packed__ or __aligned(X)__ attribute

As a consequence, a small restrictive check has been added to the ARM64 side, highlighting that other archs supporting 9+ args in BPF trampolines are already suffering from the same issue. After a bit of discussion and a few attempts, the chosen solution is, rather than applying the same constraint to all JIT compilers, to prevent such functions from being encoded at all in BTF info ([2]). As the pahole side is close to being integrated, we can now remove the restrictive check from the kernel side.

[1] https://lore.kernel.org/bpf/20250527-many_args_arm64-v3-0-3faf7bb8e4a2@boot…
[2] https://lore.kernel.org/bpf/20250707-btf_skip_structs_on_stack-v3-0-29569e0…

Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore(a)bootlin.com>
---
Alexis Lothoré (eBPF Foundation) (2):
      bpf, arm64: remove structs on stack constraint
      selftests/bpf: enable tracing_struct tests for arm64

 arch/arm64/net/bpf_jit_comp.c | 5 -----
 tools/testing/selftests/bpf/DENYLIST.aarch64 | 1 -
 2 files changed, 6 deletions(-)
---
base-commit: 8da1e37fc84868b50ba6a7cdf082aa3b0d11e006
change-id: 20250708-arm64_relax_jit_comp-e8889647d8d2

Best regards,
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
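For illustration, a function matching the three conditions listed in this cover letter could look like the sketch below. `struct packed_arg` and `many_args()` are hypothetical names, not taken from the series or from the tracing_struct selftests; the sketch only shows the shape of the problematic signature.

```c
#include <stddef.h>

/* Hypothetical struct: packed, so `b` sits at offset 1 instead of a naturally aligned offset. */
struct __attribute__((packed)) packed_arg {
	char a;
	long b;
};

/*
 * Hypothetical traced function. On arm64 the first eight integer arguments
 * travel in registers x0-x7, so the ninth argument, the by-value struct `s`,
 * has no register left and is passed on the stack.
 */
static long many_args(long a1, long a2, long a3, long a4,
		      long a5, long a6, long a7, long a8,
		      struct packed_arg s)
{
	return a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + s.a + s.b;
}
```

The packed attribute changes the struct's size and alignment, which is the kind of layout detail the cover letter says a JIT cannot reliably recover when locating such an argument on the stack.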

5 months, 3 weeks


[PATCH net-next v7 0/3] selftest: net: Add selftest for netpoll

by Breno Leitao

I am submitting a new selftest for the netpoll subsystem specifically targeting the case where the RX is polling in the TX path, which is a case that we don't have any test for in the tree today. This happens when netpoll_poll_dev() is called, and this test creates a scenario where that is likely to happen.

The test does the following:

1) Configuring a single RX/TX queue to increase contention on the interface.
2) Generating background traffic to saturate the network, mimicking real-world congestion.
3) Sending netconsole messages to trigger netpoll polling and monitor its behavior.
4) Using dynamic netconsole targets via configfs, with the ability to delete and recreate targets during the test.
5) Running bpftrace in parallel to verify that netpoll_poll_dev() is called when expected. If it is called, then the test passes, otherwise the test is marked as skipped.

In order to achieve it, I stole Jakub's bpftrace helper from [1], and made some small changes that I found useful for using the helper.

So, this patchset basically contains:

1) The code stolen from Jakub
2) Improvements on bpftrace() helper
3) The selftest itself

Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1]

---
Changes in v7:
- Rebased on top of net-next
- Using `ethtool -l` json option instead of parsing it manually.
- Link to v6: https://lore.kernel.org/r/20250711-netpoll_test-v6-0-130465f286a8@debian.org

Changes in v6:
- Remove the network toggled (Jakub)
- Set ringsize and queue size (Jakub)
- Some other general improvements (Jakub)
- Link to v5: https://lore.kernel.org/r/20250709-netpoll_test-v5-0-b3737895affe@debian.org

Changes in v5:
- Rebased on top of net-next.
- Calling bpftrace_stop using the defer helper. (Willem)
- Link to v4: https://lore.kernel.org/r/20250702-netpoll_test-v4-0-cec227e85639@debian.org

Changes in v4:
- Make the test XFail if it doesn't hit the function we are looking for
- Toggle the interface while the traffic is flowing.
- Bumped the number of messages from 10 to 40 per iteration.
  * This is hitting ~15 times per run on my vng test.
- Decreased the time from 15 seconds to 10 seconds, given that if it didn't hit the function in 10 seconds, 5 seconds extra will not help.
- Link to v3: https://lore.kernel.org/r/20250627-netpoll_test-v3-0-575bd200c8a9@debian.org

Changes in v3:
- Make pylint happy (Simon)
- Remove the unnecessary patch in bpftrace to raise an exception when it fails. (Jakub)
- Improved the bpftrace code (Willem)
- Stop sending messages if bpftrace is not alive anymore.
- Link to v2: https://lore.kernel.org/r/20250625-netpoll_test-v2-0-47d27775222c@debian.org

Changes in v2:
- Stole Jakub's helper to run bpftrace
- Removed the DEBUG option and moved logs to logging
- Change the code to have a higher chance of calling netpoll_poll_dev(). In my current configuration, it is hitting multiple times during the test.
- Save and restore TX/RX queue size (Jakub)
- Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org

---
Breno Leitao (2):
      selftests: drv-net: Strip '@' prefix from bpftrace map keys
      selftests: net: add netpoll basic functionality test

Jakub Kicinski (1):
      selftests: drv-net: add helper/wrapper for bpftrace

 tools/testing/selftests/drivers/net/Makefile | 1 +
 .../selftests/drivers/net/lib/py/__init__.py | 4 +-
 .../testing/selftests/drivers/net/netpoll_basic.py | 396 +++++++++++++++++++++
 tools/testing/selftests/net/lib/py/utils.py | 35 ++
 4 files changed, 434 insertions(+), 2 deletions(-)
---
base-commit: b06c4311711c57c5e558bd29824b08f0a6e2a155
change-id: 20250612-netpoll_test-a1324d2057c8

Best regards,
--
Breno Leitao <leitao(a)debian.org>

5 months, 3 weeks


[PATCH v2 4/4] selftests/rseq: Add test for mm_cid compaction

by Gabriele Monaco

A task in the kernel (task_mm_cid_work) runs somewhat periodically to compact the mm_cid for each process. Add a test to validate that it runs correctly and timely. The test spawns 1 thread pinned to each CPU, then each thread, including the main one, runs in short bursts for some time. During this period, the mm_cids should be spanning all numbers between 0 and nproc. At the end of this phase, a thread with high enough mm_cid (>= nproc/2) is selected to be the new leader, all other threads terminate. After some time, the only running thread should see 0 as mm_cid, if that doesn't happen, the compaction mechanism didn't work and the test fails. The test never fails if only 1 core is available, in which case, we cannot test anything as the only available mm_cid is 0. Acked-by: Shuah Khan <skhan(a)linuxfoundation.org> Signed-off-by: Gabriele Monaco <gmonaco(a)redhat.com> --- tools/testing/selftests/rseq/.gitignore | 1 + tools/testing/selftests/rseq/Makefile | 2 +- .../selftests/rseq/mm_cid_compaction_test.c | 204 ++++++++++++++++++ 3 files changed, 206 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore index 0fda241fa62b0..b3920c59bf401 100644 --- a/tools/testing/selftests/rseq/.gitignore +++ b/tools/testing/selftests/rseq/.gitignore @@ -3,6 +3,7 @@ basic_percpu_ops_test basic_percpu_ops_mm_cid_test basic_test basic_rseq_op_test +mm_cid_compaction_test param_test param_test_benchmark param_test_compare_twice diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile index 0d0a5fae59547..bc4d940f66d40 100644 --- a/tools/testing/selftests/rseq/Makefile +++ b/tools/testing/selftests/rseq/Makefile @@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \ param_test_benchmark param_test_compare_twice param_test_mm_cid \ 
param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \ - syscall_errors_test + syscall_errors_test mm_cid_compaction_test TEST_GEN_PROGS_EXTENDED = librseq.so diff --git a/tools/testing/selftests/rseq/mm_cid_compaction_test.c b/tools/testing/selftests/rseq/mm_cid_compaction_test.c new file mode 100644 index 0000000000000..d13623625f5a9 --- /dev/null +++ b/tools/testing/selftests/rseq/mm_cid_compaction_test.c @@ -0,0 +1,204 @@ +// SPDX-License-Identifier: LGPL-2.1 +#define _GNU_SOURCE +#include <assert.h> +#include <pthread.h> +#include <sched.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <stddef.h> + +#include "../kselftest.h" +#include "rseq.h" + +#define VERBOSE 0 +#define printf_verbose(fmt, ...) \ + do { \ + if (VERBOSE) \ + printf(fmt, ##__VA_ARGS__); \ + } while (0) + +/* 50 ms */ +#define RUNNER_PERIOD 50000 +/* + * Number of runs before we terminate or get the token. + * The number is slowly increasing with the number of CPUs as the compaction + * process can take longer on larger systems. This is an arbitrary value. + */ +#define THREAD_RUNS (3 + args->num_cpus/8) + +/* + * Number of times we check that the mm_cid were compacted. + * Checks are repeated every RUNNER_PERIOD. 
+ */
+#define MM_CID_COMPACT_TIMEOUT 10
+
+struct thread_args {
+	int cpu;
+	int num_cpus;
+	pthread_mutex_t *token;
+	pthread_barrier_t *barrier;
+	pthread_t *tinfo;
+	struct thread_args *args_head;
+};
+
+static void __noreturn *thread_runner(void *arg)
+{
+	struct thread_args *args = arg;
+	int i, ret, curr_mm_cid;
+	cpu_set_t cpumask;
+
+	CPU_ZERO(&cpumask);
+	CPU_SET(args->cpu, &cpumask);
+	ret = pthread_setaffinity_np(pthread_self(), sizeof(cpumask), &cpumask);
+	if (ret) {
+		errno = ret;
+		perror("Error: failed to set affinity");
+		abort();
+	}
+	pthread_barrier_wait(args->barrier);
+
+	for (i = 0; i < THREAD_RUNS; i++)
+		usleep(RUNNER_PERIOD);
+	curr_mm_cid = rseq_current_mm_cid();
+	/*
+	 * We select one thread with high enough mm_cid to be the new leader.
+	 * All other threads (including the main thread) will terminate.
+	 * After some time, the mm_cid of the only remaining thread should
+	 * converge to 0, if not, the test fails.
+	 */
+	if (curr_mm_cid >= args->num_cpus / 2 &&
+	    !pthread_mutex_trylock(args->token)) {
+		printf_verbose(
+			"cpu%d has mm_cid=%d and will be the new leader.\n",
+			sched_getcpu(), curr_mm_cid);
+		for (i = 0; i < args->num_cpus; i++) {
+			if (args->tinfo[i] == pthread_self())
+				continue;
+			ret = pthread_join(args->tinfo[i], NULL);
+			if (ret) {
+				errno = ret;
+				perror("Error: failed to join thread");
+				abort();
+			}
+		}
+		pthread_barrier_destroy(args->barrier);
+		free(args->tinfo);
+		free(args->token);
+		free(args->barrier);
+		free(args->args_head);
+
+		for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) {
+			curr_mm_cid = rseq_current_mm_cid();
+			printf_verbose("run %d: mm_cid=%d on cpu%d.\n", i,
+				       curr_mm_cid, sched_getcpu());
+			if (curr_mm_cid == 0)
+				exit(EXIT_SUCCESS);
+			usleep(RUNNER_PERIOD);
+		}
+		exit(EXIT_FAILURE);
+	}
+	printf_verbose("cpu%d has mm_cid=%d and is going to terminate.\n",
+		       sched_getcpu(), curr_mm_cid);
+	pthread_exit(NULL);
+}
+
+int test_mm_cid_compaction(void)
+{
+	cpu_set_t affinity;
+	int i, j, ret = 0, num_threads;
+	pthread_t *tinfo;
+	pthread_mutex_t *token;
+	pthread_barrier_t *barrier;
+	struct thread_args *args;
+
+	sched_getaffinity(0, sizeof(affinity), &affinity);
+	num_threads = CPU_COUNT(&affinity);
+	tinfo = calloc(num_threads, sizeof(*tinfo));
+	if (!tinfo) {
+		perror("Error: failed to allocate tinfo");
+		return -1;
+	}
+	args = calloc(num_threads, sizeof(*args));
+	if (!args) {
+		perror("Error: failed to allocate args");
+		ret = -1;
+		goto out_free_tinfo;
+	}
+	token = malloc(sizeof(*token));
+	if (!token) {
+		perror("Error: failed to allocate token");
+		ret = -1;
+		goto out_free_args;
+	}
+	barrier = malloc(sizeof(*barrier));
+	if (!barrier) {
+		perror("Error: failed to allocate barrier");
+		ret = -1;
+		goto out_free_token;
+	}
+	if (num_threads == 1) {
+		fprintf(stderr, "Cannot test on a single cpu. "
+				"Skipping mm_cid_compaction test.\n");
+		/* only skipping the test, this is not a failure */
+		goto out_free_barrier;
+	}
+	pthread_mutex_init(token, NULL);
+	ret = pthread_barrier_init(barrier, NULL, num_threads);
+	if (ret) {
+		errno = ret;
+		perror("Error: failed to initialise barrier");
+		goto out_free_barrier;
+	}
+	for (i = 0, j = 0; i < CPU_SETSIZE && j < num_threads; i++) {
+		if (!CPU_ISSET(i, &affinity))
+			continue;
+		args[j].num_cpus = num_threads;
+		args[j].tinfo = tinfo;
+		args[j].token = token;
+		args[j].barrier = barrier;
+		args[j].cpu = i;
+		args[j].args_head = args;
+		if (!j) {
+			/* The first thread is the main one */
+			tinfo[0] = pthread_self();
+			++j;
+			continue;
+		}
+		ret = pthread_create(&tinfo[j], NULL, thread_runner, &args[j]);
+		if (ret) {
+			errno = ret;
+			perror("Error: failed to create thread");
+			abort();
+		}
+		++j;
+	}
+	printf_verbose("Started %d threads.\n", num_threads);
+
+	/* Also main thread will terminate if it is not selected as leader */
+	thread_runner(&args[0]);
+
+	/* only reached in case of errors */
+out_free_barrier:
+	free(barrier);
+out_free_token:
+	free(token);
+out_free_args:
+	free(args);
+out_free_tinfo:
+	free(tinfo);
+
+	return ret;
+}
+
+int main(int argc, char **argv)
+{
+	if (!rseq_mm_cid_available()) {
+		fprintf(stderr, "Error: rseq_mm_cid unavailable\n");
+		return -1;
+	}
+	if (test_mm_cid_compaction())
+		return -1;
+	return 0;
+}
-- 
2.50.1

5 months, 3 weeks

[PATCH net-next V5 0/5] net: netdevsim: hook in XDP handling

by Mohsin Bashir

This patch series adds tests to validate XDP native support for the PASS, DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment. The adjustment tests validate both the extension and shrinking cases across various packet sizes and offset values.

The pass criteria for the head/tail adjustment tests require that at least one adjustment value works for at least one packet size. This accounts for the variability in the maximum supported head/tail adjustment offset across different drivers.

The results reported in this series are based on fbnic. However, the series was tested against multiple other drivers, including netdevsim.

Note: The XDP support for fbnic will be added later.

---
Change-log:

V5:
  - Fix warning caused by rcu_dereference() in p1
  - Fix checkpatch warnings with P3, P4, and P5
V4: https://lore.kernel.org/netdev/20250714210352.1115230-1-mohsin.bashr@gmail.…
V3: https://lore.kernel.org/netdev/20250712002648.2385849-1-mohsin.bashr@gmail.…
V2: https://lore.kernel.org/netdev/20250710184351.63797-1-mohsin.bashr@gmail.com
V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.…

Jakub Kicinski (1):
  net: netdevsim: hook in XDP handling

Mohsin Bashir (4):
  selftests: drv-net: Test XDP_PASS/DROP support
  selftests: drv-net: Test XDP_TX support
  selftests: drv-net: Test tail-adjustment support
  selftests: drv-net: Test head-adjustment support

 drivers/net/netdevsim/netdev.c                |  19 +-
 tools/testing/selftests/drivers/net/Makefile  |   1 +
 tools/testing/selftests/drivers/net/xdp.py    | 656 ++++++++++++++++++
 .../selftests/net/lib/xdp_native.bpf.c        | 540 ++++++++++++++
 4 files changed, 1215 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/drivers/net/xdp.py
 create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c

-- 
2.47.1
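The pass criterion stated in the cover letter (at least one adjustment value must work for at least one packet size) can be sketched in plain C as follows. This is only an illustration of the logic, not code from the series, whose tests live in xdp.py: the names `adjust_probe_fn`, `adjust_test_passes` and `fake_driver_probe`, and the limits used by the stand-in driver, are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical probe: returns true if adjusting a packet of pkt_size bytes
 * by offset bytes (positive = extend, negative = shrink) works. A real test
 * would send traffic through the device instead.
 */
typedef bool (*adjust_probe_fn)(int pkt_size, int offset);

/* Pass as soon as one (packet size, offset) combination works. */
static bool adjust_test_passes(const int *pkt_sizes, size_t n_sizes,
			       const int *offsets, size_t n_offsets,
			       adjust_probe_fn probe)
{
	for (size_t i = 0; i < n_sizes; i++)
		for (size_t j = 0; j < n_offsets; j++)
			if (probe(pkt_sizes[i], offsets[j]))
				return true;
	return false;
}

/*
 * Stand-in for a driver with made-up limits: it can extend by at most
 * 128 bytes and refuses to shrink a packet below 64 bytes.
 */
static bool fake_driver_probe(int pkt_size, int offset)
{
	if (offset > 0)
		return offset <= 128;
	return pkt_size + offset >= 64;
}
```

With this shape, a driver that supports only small offsets still passes as long as one combination in the tested grid succeeds, which is the tolerance the cover letter asks for.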

5 months, 3 weeks

[PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY

by Suren Baghdasaryan

Reading /proc/pid/maps requires read-locking mmap_lock, which prevents any other task from concurrently modifying the address space. This guarantees coherent reporting of virtual address ranges; however, it can block important updates from happening. Oftentimes /proc/pid/maps readers are low-priority monitoring tasks, and their blocking of high-priority tasks results in priority inversion.

Locking the entire address space is required to present a fully coherent picture of the address space. However, even the current implementation does not strictly guarantee that, because it outputs vmas in page-size chunks and drops mmap_lock between chunks. Address space modifications are possible while mmap_lock is dropped, and userspace reading the content is expected to deal with possible concurrent address space modifications. Considering these relaxed rules, holding mmap_lock is not strictly needed as long as we can guarantee that a concurrently modified vma is reported either in its original form or after it was modified.

This patchset switches from holding mmap_lock while reading /proc/pid/maps to taking per-vma locks as we walk the vma tree. This reduces the contention with tasks modifying the address space because they would have to contend for the same vma as opposed to the entire address space. The same is done for the PROCMAP_QUERY ioctl, which locks only the vma that falls into the requested range instead of the entire address space.

The previous version of this patchset [1] tried to perform /proc/pid/maps reading under RCU; however, its implementation is quite complex and the results are worse than the new version because it still relied on mmap_lock speculation, which retries if any part of the address space gets modified. The new implementation is both simpler and results in less contention. Note that a similar approach would not work for /proc/pid/smaps reading, as it also walks the page table and that's not RCU-safe.

Paul McKenney designed a test [2] to measure mmap/munmap latencies while concurrently reading /proc/pid/maps. The test has a pair of processes scanning /proc/PID/maps, and another process unmapping and remapping 4K pages from a 128MB range of anonymous memory. At the end of each 10 second run, the latency of each mmap() or munmap() operation is measured, and for each run the maximum and mean latency is printed. The map/unmap process is started first, its PID is passed to the scanners, and then the map/unmap process waits until both scanners are running before starting its timed test. The scanners keep scanning until the specified /proc/PID/maps file disappears.

This test registered close to 10x improvement in update latencies:

Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.011     0.008     0.455
    0.011     0.008     0.472
    0.011     0.008     0.535
    0.011     0.009     0.545
    ...
    0.011     0.014     2.875
    0.011     0.014     2.913
    0.011     0.014     3.007
    0.011     0.015     3.018

After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.006     0.005     0.036
    0.006     0.005     0.039
    0.006     0.005     0.039
    0.006     0.005     0.039
    ...
    0.006     0.006     0.403
    0.006     0.006     0.474
    0.006     0.006     0.479
    0.006     0.006     0.498

The patchset also adds a number of tests to check for /proc/pid/maps data coherency. They are designed to detect any unexpected data tearing while performing some common address space modifications (vma split, resize and remap). Even before these changes, reading /proc/pid/maps might have returned inconsistent data because the file is read page-by-page with mmap_lock being dropped between the pages. An example of user-visible inconsistency is the same vma being printed twice: once before it was modified and then after the modifications. For example, if a vma was extended, it might be found and reported twice. What is not expected is to see a gap where there should have been a vma both before and after modification.
This patchset increases the chances of such tearing, therefore it's even more important now to test for unexpected inconsistencies.

In [3] Lorenzo identified the following possible vma merging/splitting scenarios:

Merges with changes to existing vmas:
1. Merge both - mapping a vma over another one and between two vmas which can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing one and partially over its left neighbor;

Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start of an existing one;

Splits:
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;

If such merges or splits happen concurrently with the /proc/pid/maps reading, we might report a vma twice, once before the modification and once after it is modified:

Case 1 might report the overwritten and previous vma along with the final merged vma;
Case 2 might report the previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by shrinking of the right neighbor;
Case 4 might report the overwritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by shrinking of the left neighbor;
Case 6 might report the previous vma and the gap along with the final merged vma;
Case 7 might report the previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering the gap;
Case 9 might cause us to retry once we detect the temporary gap caused by shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused by shrinking of the original vma at the vma end;

In all these cases the retry mechanism prevents us from reporting possible temporary gaps.

Changes since v5 [4]:
- Made /proc/pid/maps tearing test a separate selftest, per Alexey Dobriyan
- Changed asserts with or'ed conditions into separate ones, per Alexey Dobriyan
- Added a small cleanup patch [6/8] to avoid unnecessary seq_file position type casting
- Removed unnecessary is_sentinel_pos() helper
- Changed titles to use fs/proc/task_mmu instead of mm/maps prefix, per David Hildenbrand
- Included Lorenzo's fix for mmap lock assertion in anon_vma_name()
- Reworked the last patch to avoid allocation in the rcu read section, which replaces Jeongjun Park's fix

!!! NOTES FOR APPLYING THE PATCHSET !!!

Applies cleanly over mm-unstable after reverting old version with fixes.
The following patches should be reverted before applying this patchset:

b33ce1be8a40 ("selftests/proc: add /proc/pid/maps tearing from vma split test")
b538e0580fd6 ("selftests/proc: extend /proc/pid/maps tearing test to include vma resizing")
4996b4409cc6 ("selftests/proc: extend /proc/pid/maps tearing test to include vma remapping")
c39471f78d5e ("selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified")
487570f548f3 ("selftests/proc: add verbose more for tests to facilitate debugging")
e1ba4969cba1 ("mm/maps: read proc/pid/maps under per-vma lock")
ecb110179e77 ("mm/madvise: fixup stray mmap lock assert in anon_vma_name()")
6772c457a865 ("fs/proc/task_mmu:: execute PROCMAP_QUERY ioctl under per-vma locks")
d5c67bb2c5fb ("mm/maps: move kmalloc() call location in do_procmap_query() out of RCU critical section")

[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo…
[4] https://lore.kernel.org/all/20250624193359.3865351-1-surenb@google.com/

Suren Baghdasaryan (8):
  selftests/proc: add /proc/pid/maps tearing from vma split test
  selftests/proc: extend /proc/pid/maps tearing test to include vma resizing
  selftests/proc: extend /proc/pid/maps tearing test to include vma remapping
  selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified
  selftests/proc: add verbose more for tests to facilitate debugging
  fs/proc/task_mmu: remove conversion of seq_file position to unsigned
  fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks

 fs/proc/internal.h                            |   5 +
 fs/proc/task_mmu.c                            | 188 +++-
 include/linux/mmap_lock.h                     |  11 +
 mm/madvise.c                                  |   3 +-
 mm/mmap_lock.c                                |  88 ++
 tools/testing/selftests/proc/.gitignore       |   1 +
 tools/testing/selftests/proc/Makefile         |   1 +
 tools/testing/selftests/proc/proc-maps-race.c | 829 ++++++++++++++++++
 8 files changed, 1098 insertions(+), 28 deletions(-)
 create mode 100644 tools/testing/selftests/proc/proc-maps-race.c

-- 
2.50.0.727.gbf7dc18ff4-goog
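The cover letter above distinguishes a tolerated inconsistency (the same range reported twice) from an unexpected one (a gap where a vma should be). A reader-side check for that distinction might look like the sketch below. It is not taken from proc-maps-race.c; the names `parse_maps_range`, `classify_neighbours` and `enum vma_relation` are hypothetical, and it assumes an LP64 system where `unsigned long` holds a full address. Whether a gap is actually an error depends on the layout the test set up, so the helper only classifies.

```c
#include <stdio.h>

enum vma_relation {
	VMA_ADJACENT,	 /* next starts exactly where prev ends */
	VMA_GAP,	 /* unmapped hole between the two entries */
	VMA_OVERLAP,	 /* ranges intersect, e.g. a vma reported twice */
	VMA_PARSE_ERROR, /* one of the lines is not a maps entry */
};

/* Parse the leading "start-end" (hex) of one /proc/pid/maps line. */
static int parse_maps_range(const char *line, unsigned long *start,
			    unsigned long *end)
{
	if (sscanf(line, "%lx-%lx", start, end) != 2 || *start >= *end)
		return -1;
	return 0;
}

/* Classify two consecutive lines as they were read from the file. */
static enum vma_relation classify_neighbours(const char *prev_line,
					     const char *next_line)
{
	unsigned long prev_start, prev_end, next_start, next_end;

	if (parse_maps_range(prev_line, &prev_start, &prev_end) ||
	    parse_maps_range(next_line, &next_start, &next_end))
		return VMA_PARSE_ERROR;
	if (next_start < prev_end)
		return VMA_OVERLAP;
	if (next_start > prev_end)
		return VMA_GAP;
	return VMA_ADJACENT;
}
```

A test that maps a known, contiguous set of vmas could feed consecutive lines through `classify_neighbours()` and accept `VMA_ADJACENT` and `VMA_OVERLAP` while flagging `VMA_GAP` inside the region it controls.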

5 months, 3 weeks

[PATCH v5] selftests/mm: add process_madvise() tests

by wang lian

Add tests for process_madvise(), focusing on verifying behavior under various conditions including valid usage and error cases.

Signed-off-by: wang lian <lianux.mm(a)gmail.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Suggested-by: Zi Yan <ziy(a)nvidia.com>
Suggested-by: Mark Brown <broonie(a)kernel.org>
Acked-by: SeongJae Park <sj(a)kernel.org>
---
Changelog v5:
- Refactor the remote_collapse test to concentrate on its primary goal: confirming the successful remote invocation of process_madvise() on a child process.
- Split the validation logic for invalid pidfds out of the remote test and into two new tests (`exited_process_pidfd` and `bad_pidfd`).
- Based on the mm-new branch to ensure clean application.

Changelog v4: https://lore.kernel.org/lkml/20250710112249.58722-1-lianux.mm@gmail.com/
- Refine resource cleanup logic in test teardown to be more robust.
- Improve remote_collapse test to correctly handle different THP (Transparent Huge Page) policies ('always', 'madvise', 'never'), including handling race conditions with khugepaged.
- Resolve build errors.

Changelog v3: https://lore.kernel.org/lkml/20250703044326.65061-1-lianux.mm@gmail.com/
- Rebased onto the latest mm-stable branch to ensure clean application.
- Refactor common signal handling logic into vm_util to reduce code duplication.
- Improve test robustness and diagnostics based on community feedback.
- Address minor code style and script corrections.

Changelog v2: https://lore.kernel.org/lkml/20250630140957.4000-1-lianux.mm@gmail.com/
- Drop MADV_DONTNEED tests based on feedback.
- Focus solely on process_madvise() syscall.
- Improve error handling and structure.
- Add future-proof flag test.
- Style and comment cleanups.

-V1: https://lore.kernel.org/lkml/20250621133003.4733-1-lianux.mm@gmail.com/

 tools/testing/selftests/mm/.gitignore     |   1 +
 tools/testing/selftests/mm/Makefile       |   1 +
 tools/testing/selftests/mm/process_madv.c | 304 ++++++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh |   5 +
 4 files changed, 311 insertions(+)
 create mode 100644 tools/testing/selftests/mm/process_madv.c

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index f2dafa0b700b..e7b23a8a05fe 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -21,6 +21,7 @@ on-fault-limit
 transhuge-stress
 pagemap_ioctl
 pfnmap
+process_madv
 *.tmp*
 protection_keys
 protection_keys_32
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index ae6f994d3add..d13b3cef2a2b 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -85,6 +85,7 @@ TEST_GEN_FILES += mseal_test
 TEST_GEN_FILES += on-fault-limit
 TEST_GEN_FILES += pagemap_ioctl
 TEST_GEN_FILES += pfnmap
+TEST_GEN_FILES += process_madv
 TEST_GEN_FILES += thuge-gen
 TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += uffd-stress
diff --git a/tools/testing/selftests/mm/process_madv.c b/tools/testing/selftests/mm/process_madv.c
new file mode 100644
index 000000000000..249e2ed8dfe9
--- /dev/null
+++ b/tools/testing/selftests/mm/process_madv.c
@@ -0,0 +1,304 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#define _GNU_SOURCE
+#include "../kselftest_harness.h"
+#include <errno.h>
+#include <setjmp.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <linux/mman.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include <sched.h>
+#include "vm_util.h"
+
+#include "../pidfd/pidfd.h"
+
+FIXTURE(process_madvise)
+{
+	unsigned long page_size;
+	pid_t child_pid;
+	int pidfd;
+};
+
+FIXTURE_SETUP(process_madvise)
+{
+	self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
+	self->pidfd = PIDFD_SELF;
+	self->child_pid = -1;
+};
+
+FIXTURE_TEARDOWN_PARENT(process_madvise)
+{
+	if (self->child_pid > 0) {
+		kill(self->child_pid, SIGKILL);
+		waitpid(self->child_pid, NULL, 0);
+	}
+}
+
+static ssize_t sys_process_madvise(int pidfd, const struct iovec *iovec,
+				   size_t vlen, int advice, unsigned int flags)
+{
+	return syscall(__NR_process_madvise, pidfd, iovec, vlen, advice, flags);
+}
+
+/*
+ * This test uses PIDFD_SELF to target the current process. The main
+ * goal is to verify the basic behavior of process_madvise() with
+ * a vector of non-contiguous memory ranges, not its cross-process
+ * capabilities.
+ */
+TEST_F(process_madvise, basic)
+{
+	const unsigned long pagesize = self->page_size;
+	const int madvise_pages = 4;
+	struct iovec vec[madvise_pages];
+	int pidfd = self->pidfd;
+	ssize_t ret;
+	char *map;
+
+	/*
+	 * Create a single large mapping. We will pick pages from this
+	 * mapping to advise on. This ensures we test non-contiguous iovecs.
+	 */
+	map = mmap(NULL, pagesize * 10, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (map == MAP_FAILED)
+		SKIP(return, "mmap failed, not enough memory.\n");
+
+	/* Fill the entire region with a known pattern. */
+	memset(map, 'A', pagesize * 10);
+
+	/*
+	 * Setup the iovec to point to 4 non-contiguous pages
+	 * within the mapping.
+	 */
+	vec[0].iov_base = &map[0 * pagesize];
+	vec[0].iov_len = pagesize;
+	vec[1].iov_base = &map[3 * pagesize];
+	vec[1].iov_len = pagesize;
+	vec[2].iov_base = &map[5 * pagesize];
+	vec[2].iov_len = pagesize;
+	vec[3].iov_base = &map[8 * pagesize];
+	vec[3].iov_len = pagesize;
+
+	ret = sys_process_madvise(pidfd, vec, madvise_pages, MADV_DONTNEED, 0);
+	if (ret == -1 && errno == EPERM)
+		SKIP(return,
+		     "process_madvise() unsupported or permission denied, try running as root.\n");
+	else if (errno == EINVAL)
+		SKIP(return,
+		     "process_madvise() unsupported or parameter invalid, please check arguments.\n");
+
+	/* The call should succeed and report the total bytes processed. */
+	ASSERT_EQ(ret, madvise_pages * pagesize);
+
+	/* Check that advised pages are now zero. */
+	for (int i = 0; i < madvise_pages; i++) {
+		char *advised_page = (char *)vec[i].iov_base;
+
+		/* Content must be 0, not 'A'. */
+		ASSERT_EQ(*advised_page, '\0');
+	}
+
+	/* Check that an un-advised page in between is still 'A'. */
+	char *unadvised_page = &map[1 * pagesize];
+
+	for (int i = 0; i < pagesize; i++)
+		ASSERT_EQ(unadvised_page[i], 'A');
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(map, pagesize * 10), 0);
+}
+
+/*
+ * This test deterministically validates process_madvise() with MADV_COLLAPSE
+ * on a remote process, other advices are difficult to verify reliably.
+ *
+ * The test verifies that a memory region in a child process,
+ * focus on process_madv remote result, only check addresses and lengths.
+ * The correctness of the MADV_COLLAPSE can be found in the relevant test examples in khugepaged.
+ */
+TEST_F(process_madvise, remote_collapse)
+{
+	const unsigned long pagesize = self->page_size;
+	int pidfd;
+	long huge_page_size;
+	int pipe_info[2];
+	ssize_t ret;
+	struct iovec vec;
+
+	struct child_info {
+		pid_t pid;
+		void *map_addr;
+	} info;
+
+	huge_page_size = default_huge_page_size();
+	if (huge_page_size <= 0)
+		SKIP(return, "Could not determine a valid huge page size.\n");
+
+	ASSERT_EQ(pipe(pipe_info), 0);
+
+	self->child_pid = fork();
+	ASSERT_NE(self->child_pid, -1);
+
+	if (self->child_pid == 0) {
+		char *map;
+		size_t map_size = 2 * huge_page_size;
+
+		close(pipe_info[0]);
+
+		map = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		ASSERT_NE(map, MAP_FAILED);
+
+		/* Fault in as small pages */
+		for (size_t i = 0; i < map_size; i += pagesize)
+			map[i] = 'A';
+
+		/* Send info and pause */
+		info.pid = getpid();
+		info.map_addr = map;
+		ret = write(pipe_info[1], &info, sizeof(info));
+		ASSERT_EQ(ret, sizeof(info));
+		close(pipe_info[1]);
+
+		pause();
+		exit(0);
+	}
+
+	close(pipe_info[1]);
+
+	/* Receive child info */
+	ret = read(pipe_info[0], &info, sizeof(info));
+	if (ret <= 0) {
+		waitpid(self->child_pid, NULL, 0);
+		SKIP(return, "Failed to read child info from pipe.\n");
+	}
+	ASSERT_EQ(ret, sizeof(info));
+	close(pipe_info[0]);
+	self->child_pid = info.pid;
+
+	pidfd = syscall(__NR_pidfd_open, self->child_pid, 0);
+	ASSERT_GE(pidfd, 0);
+
+	vec.iov_base = info.map_addr;
+	vec.iov_len = huge_page_size;
+
+	ret = sys_process_madvise(pidfd, &vec, 1, MADV_COLLAPSE, 0);
+	if (ret == -1) {
+		if (errno == EINVAL)
+			SKIP(return, "PROCESS_MADV_ADVISE is not supported.\n");
+		else if (errno == EPERM)
+			SKIP(return,
+			     "No process_madvise() permissions, try running as root.\n");
+		goto cleanup;
+	}
+
+	ASSERT_EQ(ret, huge_page_size);
+
+cleanup:
+	/* Cleanup */
+	kill(self->child_pid, SIGKILL);
+	waitpid(self->child_pid, NULL, 0);
+	if (pidfd >= 0)
+		close(pidfd);
+}
+
+/*
+ * Test process_madvise() with a pidfd for a process that has already
+ * exited to ensure correct error handling.
+ */
+TEST_F(process_madvise, exited_process_pidfd)
+{
+	struct iovec vec;
+	ssize_t ret;
+	int pidfd;
+
+	vec.iov_base = (void *)0x1234;
+	vec.iov_len = 4096;
+
+	/*
+	 * Using a pidfd for a process that has already exited should fail
+	 * with ESRCH.
+	 */
+	self->child_pid = fork();
+	ASSERT_NE(self->child_pid, -1);
+
+	if (self->child_pid == 0)
+		exit(0);
+
+	pidfd = syscall(__NR_pidfd_open, self->child_pid, 0);
+	ASSERT_GE(pidfd, 0);
+
+	/* Wait for the child to ensure it has terminated. */
+	waitpid(self->child_pid, NULL, 0);
+
+	ret = sys_process_madvise(pidfd, &vec, 1, MADV_DONTNEED, 0);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, ESRCH);
+	close(pidfd);
+}
+
+/*
+ * Test process_madvise() with bad pidfds to ensure correct error
+ * handling.
+ */
+TEST_F(process_madvise, bad_pidfd)
+{
+	struct iovec vec;
+	ssize_t ret;
+
+	vec.iov_base = (void *)0x1234;
+	vec.iov_len = 4096;
+
+	/* Using an invalid fd number (-1) should fail with EBADF. */
+	ret = sys_process_madvise(-1, &vec, 1, MADV_DONTNEED, 0);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, EBADF);
+
+	/*
+	 * Using a valid fd that is not a pidfd (e.g. stdin) should fail
+	 * with EBADF.
+	 */
+	ret = sys_process_madvise(STDIN_FILENO, &vec, 1, MADV_DONTNEED, 0);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, EBADF);
+}
+
+/*
+ * Test process_madvise() with an invalid flag value. Currently, only a flag
+ * value of 0 is supported. This test is reserved for the future, e.g., if
+ * synchronous flags are added.
+ */
+TEST_F(process_madvise, flag)
+{
+	const unsigned long pagesize = self->page_size;
+	unsigned int invalid_flag;
+	int pidfd = self->pidfd;
+	struct iovec vec;
+	char *map;
+	ssize_t ret;
+
+	map = mmap(NULL, pagesize, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1,
+		   0);
+	if (map == MAP_FAILED)
+		SKIP(return, "mmap failed, not enough memory.\n");
+
+	vec.iov_base = map;
+	vec.iov_len = pagesize;
+
+	invalid_flag = 0x80000000;
+
+	ret = sys_process_madvise(pidfd, &vec, 1, MADV_DONTNEED, invalid_flag);
+	ASSERT_EQ(ret, -1);
+	ASSERT_EQ(errno, EINVAL);
+
+	/* Cleanup. */
+	ASSERT_EQ(munmap(map, pagesize), 0);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index a38c984103ce..471e539d82b8 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -65,6 +65,8 @@ separated by spaces:
 	test pagemap_scan IOCTL
 - pfnmap
 	tests for VM_PFNMAP handling
+- process_madv
+	test for process_madv
 - cow
 	test copy-on-write semantics
 - thp
@@ -425,6 +427,9 @@ CATEGORY="madv_guard" run_test ./guard-regions
 # MADV_POPULATE_READ and MADV_POPULATE_WRITE tests
 CATEGORY="madv_populate" run_test ./madv_populate
 
+# PROCESS_MADV test
+CATEGORY="process_madv" run_test ./process_madv
+
 CATEGORY="vma_merge" run_test ./merge
 
 if [ -x ./memfd_secret ]
-- 
2.43.0

5 months, 3 weeks

[PATCH net-next v12 00/10] tun: Introduce virtio-net hashing feature

by Akihiko Odaki

NOTE: I'm leaving Daynix Computing Ltd., for which I worked on this patch series, by the end of this month.

While net-next is closed, this is the last chance for me to send another version, so let me send the local changes now.

Please contact Yuri Benditovich, who is CCed on this email, for anything about this series.

virtio-net has two uses of hashes: one is RSS and the other is hash reporting. Conventionally the hash calculation was done by the VMM. However, computing the hash after the queue was chosen defeats the purpose of RSS.

Another approach is to use an eBPF steering program. This approach has another downside: it cannot report the calculated hash due to the restrictive nature of eBPF.

Introduce code to compute hashes in the kernel in order to overcome these challenges.

An alternative solution is to extend the eBPF steering program so that it will be able to report to the userspace, but it is based on context rewrites, which is in feature freeze. We can adopt kfuncs, but they will not be UAPIs. We opt for ioctl to align with other relevant UAPIs (KVM and vhost_net).

The patches for QEMU to use this new feature were submitted as an RFC and are available at:
https://patchew.org/QEMU/20250530-hash-v5-0-343d7d7a8200@daynix.com/

This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/

V1 -> V2:
  Changed to introduce a new BPF program type.

Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v12:
- Updated tools/testing/selftests/net/config.
- Split TUNSETVNETHASH.
- Link to v11: https://lore.kernel.org/r/20250317-rss-v11-0-4cacca92f31f@daynix.com

Changes in v11:
- Added the missing code to free vnet_hash in patch "tap: Introduce virtio-net hash feature".
- Link to v10: https://lore.kernel.org/r/20250313-rss-v10-0-3185d73a9af0@daynix.com

Changes in v10:
- Split common code and TUN/TAP-specific code into separate patches.
- Reverted a spurious style change in patch "tun: Introduce virtio-net hash feature".
- Added a comment explaining disable_ipv6 in tests.
- Used AF_PACKET for patch "selftest: tun: Add tests for virtio-net hashing". I also added the usage of FIXTURE_VARIANT() as the testing function now needs access to more variant-specific variables.
- Corrected the message of patch "selftest: tun: Add tests for virtio-net hashing"; it mentioned validation of configuration, which is not in the scope of this patch.
- Expanded the description of patch "selftest: tun: Add tests for virtio-net hashing".
- Added patch "tun: Allow steering eBPF program to fall back".
- Changed to handle TUNGETVNETHASHCAP before taking the rtnl lock.
- Removed redundant tests for tun_vnet_ioctl().
- Added patch "selftest: tap: Add tests for virtio-net ioctls".
- Added a design explanation of ioctls for extensibility and migration.
- Removed a few branches in patch "vhost/net: Support VIRTIO_NET_F_HASH_REPORT".
- Link to v9: https://lore.kernel.org/r/20250307-rss-v9-0-df76624025eb@daynix.com

Changes in v9:
- Added a missing return statement in patch "tun: Introduce virtio-net hash feature".
- Link to v8: https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff423@daynix.com

Changes in v8:
- Disabled IPv6 to eliminate noise in tests.
- Added a branch in tap to avoid unnecessary dissection when hash reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid adding extra NULL checks to the code handling other ioctls.
- Introduced variable named "fd" to __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com

Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com

Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun: Introduce virtio-net hash reporting feature", and "tun: Introduce virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com

Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to: https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324ca…
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com

Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and "tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com

Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into "tun: Introduce virtio-net hash reporting feature" and "tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com

---
Akihiko Odaki (10):
      virtio_net: Add functions for hashing
      net: flow_dissector: Export flow_keys_dissector_symmetric
      tun: Allow steering eBPF program to fall back
      tun: Add common virtio-net hash feature code
      tun: Introduce virtio-net hash feature
      tap: Introduce virtio-net hash feature
      selftest: tun: Test vnet ioctls without device
      selftest: tun: Add tests for virtio-net hashing
      selftest: tap: Add tests for virtio-net ioctls
      vhost/net: Support VIRTIO_NET_F_HASH_REPORT

 Documentation/networking/tuntap.rst  |   7 +
 drivers/net/Kconfig                  |   1 +
 drivers/net/ipvlan/ipvtap.c          |   2 +-
 drivers/net/macvtap.c                |   2 +-
 drivers/net/tap.c                    |  80 +++++-
 drivers/net/tun.c                    |  92 +++++--
 drivers/net/tun_vnet.h               | 165 +++++++++++-
 drivers/vhost/net.c                  |  68 ++---
 include/linux/if_tap.h               |   4 +-
 include/linux/skbuff.h               |   3 +
 include/linux/virtio_net.h           | 188 ++++++++++++++
 include/net/flow_dissector.h         |   1 +
 include/uapi/linux/if_tun.h          |  80 ++++++
 net/core/flow_dissector.c            |   3 +-
 net/core/skbuff.c                    |   4 +
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/config   |   1 +
 tools/testing/selftests/net/tap.c    | 131 +++++++++-
 tools/testing/selftests/net/tun.c    | 485 ++++++++++++++++++++++++++++++++++-
 19 files changed, 1234 insertions(+), 85 deletions(-)
---
base-commit: 5cb8274d66c611b7889565c418a8158517810f9b
change-id: 20240403-rss-e737d89efa77

Best regards,
-- 
Akihiko Odaki <akihiko.odaki(a)daynix.com>
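For readers unfamiliar with the hash that RSS relies on, the following is a minimal bit-by-bit sketch of a Toeplitz hash in plain C. It is not the series' virtio_net_toeplitz_calc(), which the changelog says was optimized along the lines of a DPDK commit; the function name `toeplitz_hash` and its signature are hypothetical, and the choice to return 0 for a too-short key is an assumption made only to keep the sketch safe.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash, textbook form: walk the input MSB first and, for every set
 * bit at position i, XOR in the 32-bit window of the key that starts at key
 * bit i. The key must therefore hold at least len + 4 bytes.
 */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
			      const uint8_t *data, size_t len)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t i;
	int bit;

	/* Assumption of this sketch: a too-short key yields "no hash". */
	if (key_len < len + 4)
		return 0;

	/* The first window is the first 32 bits of the key. */
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
		 ((uint32_t)key[2] << 8) | key[3];

	for (i = 0; i < len; i++) {
		for (bit = 7; bit >= 0; bit--) {
			if (data[i] & (1u << bit))
				hash ^= window;
			/* Slide the window left by one key bit. */
			window <<= 1;
			if (key[i + 4] & (1u << bit))
				window |= 1;
		}
	}
	return hash;
}
```

Because the hash is a XOR of key windows, it is linear over the input bits: hashing the XOR of two inputs gives the XOR of their hashes. That property is what makes table-driven or word-at-a-time optimizations possible, while the loop above favors readability.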

5 months, 3 weeks

[PATCH v12 net-next 00/15] AccECN protocol patch series

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>

Hello,

Please find the v12 AccECN protocol patch series, which covers the core functionality of Accurate ECN, AccECN negotiation, AccECN TCP options, and AccECN failure handling. The Accurate ECN draft can be found in https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28

This patch series is part of the full AccECN patch series, which is available at https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/

Best Regards,
Chia-Yu

---
v12 (04-Jul-2025)
- Fix compilation issues with some intermediate patches in v11
- Add more comments for AccECN helpers of tcp_ecn.h

v11 (03-Jul-2025)
- Fix compilation issues with some intermediate patches in v10

v10 (02-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() over every read of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)

v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEEN flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accurate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restructure the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempting AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)

v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)

v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results

v6 (09-May-2025)
- Add #3 to utilize existing holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use existing holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for readability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15

v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)

v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)

v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)

v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
--- Chia-Yu Chang (6): tcp: reorganize tcp_sock_write_txrx group for variables later tcp: ecn functions in separated include file tcp: Add wait_third_ack for ECN negotiation in simultaneous connect tcp: accecn: AccECN option send control tcp: accecn: AccECN option failure handling tcp: accecn: try to fit AccECN option with SACK Ilpo Järvinen (9): tcp: reorganize SYN ECN code tcp: fast path functions later tcp: AccECN core tcp: accecn: AccECN negotiation tcp: accecn: add AccECN rx byte counters tcp: accecn: AccECN needs to know delivered bytes tcp: sack option handling improvements tcp: accecn: AccECN option tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics Documentation/networking/ip-sysctl.rst | 55 +- .../networking/net_cachelines/tcp_sock.rst | 13 + include/linux/tcp.h | 33 +- include/net/netns/ipv4.h | 2 + include/net/tcp.h | 87 ++- include/net/tcp_ecn.h | 663 ++++++++++++++++++ include/uapi/linux/tcp.h | 7 + net/ipv4/syncookies.c | 4 + net/ipv4/sysctl_net_ipv4.c | 19 + net/ipv4/tcp.c | 29 +- net/ipv4/tcp_input.c | 371 ++++++++-- net/ipv4/tcp_ipv4.c | 8 +- net/ipv4/tcp_minisocks.c | 40 +- net/ipv4/tcp_output.c | 297 ++++++-- net/ipv6/syncookies.c | 2 + net/ipv6/tcp_ipv6.c | 1 + 16 files changed, 1444 insertions(+), 187 deletions(-) create mode 100644 include/net/tcp_ecn.h -- 2.34.1

5 months, 3 weeks


[PATCH v17 00/27] riscv control-flow integrity for usermode

by Deepak Gupta

Basics and overview
===================

Software with larger attack surfaces (e.g. network facing apps like databases, browsers or apps relying on browser runtimes) suffers from memory corruption issues which can be utilized by attackers to bend control flow of the program to eventually gain control (by making their payload executable). Attackers are able to perform such attacks by leveraging call-sites which rely on indirect calls or return sites which rely on obtaining return address from stack memory.

To mitigate such attacks, risc-v extension zicfilp enforces that all indirect calls must land on a landing pad instruction `lpad`, else cpu will raise software check exception (a new cpu exception cause code on riscv). Similarly for return flow, risc-v extension zicfiss extends architecture with

- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparison above was a mismatch
- Protection mechanism using which shadow stack is not writeable via regular store instructions

More information and details can be found at extensions github repo [1].

Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel CET [3] and branch target identification (BTI) [4] on arm. Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.

x86 and arm64 support for user mode shadow stack is already in mainline.

Kernel awareness for user control flow integrity
================================================

This series picks up Samuel Holland's envcfg changes [2] as well. So if those are being applied independently, they should be removed from this series.

Enabling:

In order to maintain compatibility and not break anything in user mode, kernel doesn't enable control flow integrity cpu extensions on binary by default. Instead it exposes a prctl interface to enable, disable and lock the shadow stack or landing pad feature for a task. This allows userspace (loader) to enumerate if all objects in its address space are compiled with shadow stack and landing pad support and accordingly enable the feature. Additionally if a subsequent `dlopen` happens on a library, user mode can take a decision again to disable the feature (if incoming library is not compiled with support) OR terminate the task (if user mode policy is strict to have all objects in address space to be compiled with control flow integrity cpu feature). prctl to enable shadow stack results in allocating shadow stack from virtual memory and activating it for user address space. x86 and arm64 are also following same direction due to similar reason(s).

clone/fork:

On clone and fork, cfi state for task is inherited by child. Shadow stack is part of virtual memory and is a writeable memory from kernel perspective (writeable via a restricted set of instructions aka shadow stack instructions). Thus kernel changes ensure that this memory is converted into read-only when fork/clone happens and COWed when fault is taken due to sspush, sspopchk or ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled, kernel will automatically allocate a shadow stack for that clone call.

map_shadow_stack:

x86 introduced `map_shadow_stack` system call to allow user space to explicitly map shadow stack memory in its address space. It is useful to allocate shadow stacks for different contexts managed by a single thread (green threads or contexts). risc-v implements this system call as well.
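
As a rough illustration of the prctl interface described under "Enabling", the sketch below only queries the shadow stack status of the calling task. It assumes the arch-agnostic `PR_GET_SHADOW_STACK_STATUS` / `PR_SHADOW_STACK_ENABLE` definitions that mainline carries in include/uapi/linux/prctl.h (the fallback values are there for older userspace headers and should be checked against your kernel). The helper names `shadow_stack_enabled()` and `shadow_stack_describe()` are made up for this example. Enabling is deliberately left out: it is normally done by the loader/libc start-up code, since a function that turns the shadow stack on cannot simply return through frames that were never pushed on it.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers. Assumption: these match the
 * arch-agnostic definitions in mainline include/uapi/linux/prctl.h. */
#ifndef PR_GET_SHADOW_STACK_STATUS
#define PR_GET_SHADOW_STACK_STATUS 74
#endif
#ifndef PR_SHADOW_STACK_ENABLE
#define PR_SHADOW_STACK_ENABLE (1UL << 0)
#endif

/*
 * Hypothetical helper: returns 1 if the shadow stack is enabled for the
 * calling task, 0 if it is disabled, or a negative errno value if the
 * kernel or CPU does not support the prctl.
 */
int shadow_stack_enabled(void)
{
	unsigned long status = 0;

	if (prctl(PR_GET_SHADOW_STACK_STATUS, (unsigned long)&status, 0, 0, 0) != 0)
		return -errno;

	return (status & PR_SHADOW_STACK_ENABLE) ? 1 : 0;
}

/* Hypothetical helper: human-readable form of the value returned above. */
const char *shadow_stack_describe(int state)
{
	if (state < 0)
		return "unsupported";
	return state ? "enabled" : "disabled";
}
```

A loader-style policy check could call `shadow_stack_enabled()` after mapping all objects and treat a negative return as "feature not available on this kernel/CPU" rather than as a fatal error.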

signal management:

If shadow stack is enabled for a task, kernel performs an asynchronous control flow diversion to deliver the signal and eventually expects userspace to issue sigreturn so that original execution can be resumed. Even though resume context is prepared by kernel, it is in user space memory and is subject to memory corruption, and corruption bugs can be utilized by an attacker in this race window to perform arbitrary sigreturn and eventually bypass cfi mechanism. Another issue is how to ensure that cfi related state on sigcontext area is not trampled by legacy apps or apps compiled with old kernel headers.

In order to mitigate control-flow hijacking, kernel prepares a token and places it on shadow stack before signal delivery and places address of token in sigcontext structure. During sigreturn, kernel obtains address of token from sigcontext structure, reads token from shadow stack and validates it, and only then allows sigreturn to succeed. Compatibility issue is solved by adopting dynamic sigcontext management introduced for vector extension. This series re-factors the code a little bit to make future sigcontext management easy (as proposed by Andy Chiu from SiFive).

config and compilation:

Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this config option picks the kernel support for user control flow integrity. This option is presented only if toolchain has shadow stack and landing pad support, and is on purpose guarded by toolchain support. Reason being that eventually vDSO also needs to be compiled in with shadow stack and landing pad support. vDSO compile patches are not included as of now because landing pad labeling scheme is yet to settle for usermode runtime.
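
The sigreturn token check described under "signal management" can be pictured with a small user-space toy model. This is only a sketch of the idea, not the kernel code from this series: the types `toy_shstk` and `toy_sigcontext`, the function names, and in particular the token format (a slot holding its own address) are assumptions made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SHSTK_SLOTS 16

/* Toy shadow stack: in reality this memory is not writeable by normal stores. */
struct toy_shstk {
	uintptr_t slot[TOY_SHSTK_SLOTS];
	size_t top;                 /* index of the next free slot */
};

/* Toy sigcontext: lives in ordinary user memory, so an attacker may corrupt it. */
struct toy_sigcontext {
	uintptr_t *token_addr;      /* where the "kernel" placed the token */
};

/* Signal delivery: push a token and record its address in the sigcontext. */
bool toy_deliver_signal(struct toy_shstk *ss, struct toy_sigcontext *sc)
{
	if (ss->top >= TOY_SHSTK_SLOTS)
		return false;

	uintptr_t *where = &ss->slot[ss->top++];
	*where = (uintptr_t)where;  /* assumption: token encodes its own address */
	sc->token_addr = where;
	return true;
}

/* sigreturn: succeed only if the recorded address holds a valid token on top. */
bool toy_sigreturn(struct toy_shstk *ss, const struct toy_sigcontext *sc)
{
	if (ss->top == 0)
		return false;

	uintptr_t *expected = &ss->slot[ss->top - 1];
	if (sc->token_addr != expected)          /* forged or stale address */
		return false;
	if (*expected != (uintptr_t)expected)    /* slot does not hold a token */
		return false;

	*expected = 0;                           /* consume token: no replay */
	ss->top--;
	return true;
}
```

In this model a sigcontext whose `token_addr` was overwritten fails the first comparison, and a second sigreturn with the same context fails because the token was consumed, which is the property the cover letter relies on to stop arbitrary sigreturn.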

To get more information on kernel interactions with respect to zicfilp and zicfiss, patch series adds documentation for `zicfilp` and `zicfiss` in following:

Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst

How to test this series
=======================

Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)

Qemu
----
Get the latest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)

Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic

Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain supports it.

$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)

In case you're building your own rootfs using toolchain, please make sure you pick following patch to ensure that vDSO is compiled with lpad and shadow stack.

"arch/riscv: compile vdso with landing pad"

Branch where above patch can be picked
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1

Running
-------

Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true

vDSO related Opens (in the flux)
=================================

I am listing these opens for laying out plan and what to expect in future patch sets. And of course for the sake of discussion.

Shadow stack and landing pad enabling in vDSO ---------------------------------------------- vDSO must have shadow stack and landing pad support compiled in for task to have shadow stack and landing pad support. This patch series doesn't enable that (yet). Enabling shadow stack support in vDSO should be straight forward (intend to do that in next versions of patch set). Enabling landing pad support in vDSO requires some collaboration with toolchain folks to follow a single label scheme for all object binaries. This is necessary to ensure that all indirect call-sites are setting correct label and target landing pads are decorated with same label scheme. How many vDSOs --------------- Shadow stack instructions are carved out of zimop (may be operations) and if CPU doesn't implement zimop, they're illegal instructions. Kernel could be running on a CPU which may or may not implement zimop. And thus kernel will have to carry 2 different vDSOs and expose the appropriate one depending on whether CPU implements zimop or not. References ========== [1] - https://github.com/riscv/riscv-cfi [2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c… [3] - https://lwn.net/Articles/889475/ [4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific… [5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i… [6] - https://lwn.net/Articles/940403/ To: Thomas Gleixner <tglx(a)linutronix.de> To: Ingo Molnar <mingo(a)redhat.com> To: Borislav Petkov <bp(a)alien8.de> To: Dave Hansen <dave.hansen(a)linux.intel.com> To: x86(a)kernel.org To: H. Peter Anvin <hpa(a)zytor.com> To: Andrew Morton <akpm(a)linux-foundation.org> To: Liam R. 
Howlett <Liam.Howlett(a)oracle.com> To: Vlastimil Babka <vbabka(a)suse.cz> To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com> To: Paul Walmsley <paul.walmsley(a)sifive.com> To: Palmer Dabbelt <palmer(a)dabbelt.com> To: Albert Ou <aou(a)eecs.berkeley.edu> To: Conor Dooley <conor(a)kernel.org> To: Rob Herring <robh(a)kernel.org> To: Krzysztof Kozlowski <krzk+dt(a)kernel.org> To: Arnd Bergmann <arnd(a)arndb.de> To: Christian Brauner <brauner(a)kernel.org> To: Peter Zijlstra <peterz(a)infradead.org> To: Oleg Nesterov <oleg(a)redhat.com> To: Eric Biederman <ebiederm(a)xmission.com> To: Kees Cook <kees(a)kernel.org> To: Jonathan Corbet <corbet(a)lwn.net> To: Shuah Khan <shuah(a)kernel.org> To: Jann Horn <jannh(a)google.com> To: Conor Dooley <conor+dt(a)kernel.org> To: Miguel Ojeda <ojeda(a)kernel.org> To: Alex Gaynor <alex.gaynor(a)gmail.com> To: Boqun Feng <boqun.feng(a)gmail.com> To: Gary Guo <gary(a)garyguo.net> To: Björn Roy Baron <bjorn3_gh(a)protonmail.com> To: Benno Lossin <benno.lossin(a)proton.me> To: Andreas Hindborg <a.hindborg(a)kernel.org> To: Alice Ryhl <aliceryhl(a)google.com> To: Trevor Gross <tmgross(a)umich.edu> Cc: linux-kernel(a)vger.kernel.org Cc: linux-fsdevel(a)vger.kernel.org Cc: linux-mm(a)kvack.org Cc: linux-riscv(a)lists.infradead.org Cc: devicetree(a)vger.kernel.org Cc: linux-arch(a)vger.kernel.org Cc: linux-doc(a)vger.kernel.org Cc: linux-kselftest(a)vger.kernel.org Cc: alistair.francis(a)wdc.com Cc: richard.henderson(a)linaro.org Cc: jim.shu(a)sifive.com Cc: andybnac(a)gmail.com Cc: kito.cheng(a)sifive.com Cc: charlie(a)rivosinc.com Cc: atishp(a)rivosinc.com Cc: evan(a)rivosinc.com Cc: cleger(a)rivosinc.com Cc: alexghiti(a)rivosinc.com Cc: samitolvanen(a)google.com Cc: broonie(a)kernel.org Cc: rick.p.edgecombe(a)intel.com Cc: rust-for-linux(a)vger.kernel.org changelog --------- v17: - fixed warnings due to empty macros in usercfi.h (reported by alexg) - fixed prefixes in commit titles reported by alexg - took below uprobe with fcfi v2 patch 
from Zong Li and squashed it with "riscv/traps: Introduce software check exception and uprobe handling" https://lore.kernel.org/all/20250604093403.10916-1-zong.li@sifive.com/ v16: - If FWFT is not implemented or returns error for shadow stack activation, then no_usercfi is set to disable shadow stack. Although this should be picked up by extension validation and activation. Fixed this bug for zicfilp and zicfiss both. Thanks to Charlie Jenkins for reporting this. - If toolchain doesn't support cfi, cfi kselftest shouldn't build. Suggested by Charlie Jenkins. - Default for CONFIG_RISCV_USER_CFI is set to no. Charlie/Atish suggested to keep it off till we have more hardware availibility with RVA23 profile and zimop/zcmop implemented. Else this will start breaking people's workflow - Includes the fix if "!RV64 and !SBI" then definitions for FWFT in asm-offsets.c error. v15: - Toolchain has been updated to include `-fcf-protection` flag. This exists for x86 as well. Updated kernel patches to compile vDSO and selftest to compile with `fcf-protection=full` flag. - selecting CONFIG_RISCV_USERCFI selects CONFIG_RISCV_SBI. - Patch to enable shadow stack for kernel wasn't hidden behind CONFIG_RISCV_USERCFI and CONFIG_RISCV_SBI. fixed that. v14: - rebased on top of palmer/sbi-v3. Thus dropped clement's FWFT patches Updated RISCV_ISA_EXT_XXXX in hwcap and hwprobe constants. - Took Radim's suggestions on bitfields. - Placed cfi_state at the end of thread_info block so that current situation is not disturbed with respect to member fields of thread_info in single cacheline. v13: - cpu_supports_shadow_stack/cpu_supports_indirect_br_lp_instr uses riscv_has_extension_unlikely() - uses nops(count) to create nop slide - RISCV_ACQUIRE_BARRIER is not needed in `amo_user_shstk`. Removed it - changed ternaries to simply use implicit casting to convert to bool. - kernel command line allows to disable zicfilp and zicfiss independently. updated kernel-parameters.txt. 
- ptrace user abi for cfi uses bitmasks instead of bitfields. Added ptrace kselftest. - cosmetic and grammatical changes to documentation. v12: - It seems like I had accidently squashed arch agnostic indirect branch tracking prctl and riscv implementation of those prctls. Split them again. - set_shstk_status/set_indir_lp_status perform CSR writes only when CPU support is available. As suggested by Zong Li. - Some minor clean up in kselftests as suggested by Zong Li. v11: - patch "arch/riscv: compile vdso with landing pad" was unconditionally selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to to `lpad 0`. v10: - dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch is not that interesting to this patch series for risc-v. There are instances in arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch to expedite merging in riscv tree. - Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to validate presence of cfi based on config. - Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure we add single vdso object with cfi enabled. But a vdso object with scheme of zero labeled landing pad is least common denominator and should work with all objects of zero labeled as well as function-signature labeled objects. v9: - rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion") - dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs) - dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs) v8: - rebased on palmer/for-next - dropped samuel holland's `envcfg` context switch patches. they are in parlmer/for-next v7: - Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv" Instead using `deactivate_mm` flow to clean up. see here for more context https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.… - Changed the header include in `kselftest`. 
Hopefully this fixes compile issue faced by Zong Li at SiFive. - Cleaned up an orphaned change to `mm/mmap.c` in below patch "riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE" - Lock interfaces for shadow stack and indirect branch tracking expect arg == 0 Any future evolution of this interface should accordingly define how arg should be setup. - `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper `is_shadow_stack_vma`. - Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv… v6: - Picked up Samuel Holland's changes as is with `envcfg` placed in `thread` instead of `thread_info` - fixed unaligned newline escapes in kselftest - cleaned up messages in kselftest and included test output in commit message - fixed a bug in clone path reported by Zong Li - fixed a build issue if CONFIG_RISCV_ISA_V is not selected (this was introduced due to re-factoring signal context management code) v5: - rebased on v6.12-rc1 - Fixed schema related issues in device tree file - Fixed some of the documentation related issues in zicfilp/ss.rst (style issues and added index) - added `SHADOW_STACK_SET_MARKER` so that implementation can define base of shadow stack. - Fixed warnings on definitions added in usercfi.h when CONFIG_RISCV_USER_CFI is not selected. - Adopted context header based signal handling as proposed by Andy Chiu - Added support for enabling kernel mode access to shadow stack using FWFT (https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…) - Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv… (Note: I had an issue in my workflow due to which version number wasn't picked up correctly while sending out patches) v4: - rebased on 6.11-rc6 - envcfg: Converged with Samuel Holland's patches for envcfg management on per- thread basis. 
- vma_is_shadow_stack is renamed to is_vma_shadow_stack - picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch - signal context: using extended context management to maintain compatibility. - fixed `-Wmissing-prototypes` compiler warnings for prctl functions - Documentation fixes and amending typos. - Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/ v3: - envcfg logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been picked on per task basis, even though CPU didn't implement it. Fixed in this series. - dt-bindings As suggested, split into separate commit. fixed the messaging that spec is in public review - arch_is_shadow_stack change arch_is_shadow_stack changed to vma_is_shadow_stack - hwprobe zicfiss / zicfilp if present will get enumerated in hwprobe - selftests As suggested, added object and binary filenames to .gitignore Selftest binary anyways need to be compiled with cfi enabled compiler which will make sure that landing pad and shadow stack are enabled. Thus removed separate enable/disable tests. Cleaned up tests a bit. - Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/ v2: - Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow integrity for user mode programs can be compiled in the kernel. - Enabling of control flow integrity for user programs is left to user runtime - This patch series introduces arch agnostic `prctls` to enable shadow stack and indirect branch tracking. And implements them on riscv. 
--- Changes in v17: - Link to v16: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-0-64f61a35eee7@ri… Changes in v16: - Link to v15: https://lore.kernel.org/r/20250502-v5_user_cfi_series-v15-0-914966471885@ri… Changes in v15: - changelog posted just below cover letter - Link to v14: https://lore.kernel.org/r/20250429-v5_user_cfi_series-v14-0-5239410d012a@ri… Changes in v14: - changelog posted just below cover letter - Link to v13: https://lore.kernel.org/r/20250424-v5_user_cfi_series-v13-0-971437de586a@ri… Changes in v13: - changelog posted just below cover letter - Link to v12: https://lore.kernel.org/r/20250314-v5_user_cfi_series-v12-0-e51202b53138@ri… Changes in v12: - changelog posted just below cover letter - Link to v11: https://lore.kernel.org/r/20250310-v5_user_cfi_series-v11-0-86b36cbfb910@ri… Changes in v11: - changelog posted just below cover letter - Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri… --- Andy Chiu (1): riscv: signal: abstract header saving for setup_sigcontext Deepak Gupta (25): mm: VM_SHADOW_STACK definition for riscv dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml) riscv: zicfiss / zicfilp enumeration riscv: zicfiss / zicfilp extension csr and bit definitions riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE riscv/mm: manufacture shadow stack pte riscv/mm: teach pte_mkwrite to manufacture shadow stack PTEs riscv/mm: write protect and shadow stack riscv/mm: Implement map_shadow_stack() syscall riscv/shstk: If needed allocate a new shadow stack on clone riscv: Implements arch agnostic shadow stack prctls prctl: arch-agnostic prctl for indirect branch tracking riscv: Implements arch agnostic indirect branch tracking prctls riscv/traps: Introduce software check exception and uprobe handling riscv/signal: save and restore of shadow stack for signal riscv/kernel: update __show_regs to 
print shadow stack register riscv/ptrace: riscv cfi status and state via ptrace and in core files riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe riscv: kernel command line option to opt out of user cfi riscv: enable kernel access to shadow stack memory via FWFT sbi call riscv: create a config for shadow stack and landing pad instr support riscv: Documentation for landing pad / indirect branch tracking riscv: Documentation for shadow stack on riscv kselftest/riscv: kselftest for user mode cfi Jim Shu (1): arch/riscv: compile vdso with landing pad Documentation/admin-guide/kernel-parameters.txt | 8 + Documentation/arch/riscv/index.rst | 2 + Documentation/arch/riscv/zicfilp.rst | 115 +++++ Documentation/arch/riscv/zicfiss.rst | 179 +++++++ .../devicetree/bindings/riscv/extensions.yaml | 14 + arch/riscv/Kconfig | 21 + arch/riscv/Makefile | 5 +- arch/riscv/include/asm/asm-prototypes.h | 1 + arch/riscv/include/asm/assembler.h | 44 ++ arch/riscv/include/asm/cpufeature.h | 12 + arch/riscv/include/asm/csr.h | 16 + arch/riscv/include/asm/entry-common.h | 2 + arch/riscv/include/asm/hwcap.h | 2 + arch/riscv/include/asm/mman.h | 26 + arch/riscv/include/asm/mmu_context.h | 7 + arch/riscv/include/asm/pgtable.h | 30 +- arch/riscv/include/asm/processor.h | 2 + arch/riscv/include/asm/thread_info.h | 3 + arch/riscv/include/asm/usercfi.h | 95 ++++ arch/riscv/include/asm/vector.h | 3 + arch/riscv/include/uapi/asm/hwprobe.h | 2 + arch/riscv/include/uapi/asm/ptrace.h | 34 ++ arch/riscv/include/uapi/asm/sigcontext.h | 1 + arch/riscv/kernel/Makefile | 1 + arch/riscv/kernel/asm-offsets.c | 10 + arch/riscv/kernel/cpufeature.c | 27 + arch/riscv/kernel/entry.S | 33 +- arch/riscv/kernel/head.S | 27 + arch/riscv/kernel/process.c | 27 +- arch/riscv/kernel/ptrace.c | 95 ++++ arch/riscv/kernel/signal.c | 148 +++++- arch/riscv/kernel/sys_hwprobe.c | 2 + arch/riscv/kernel/sys_riscv.c | 10 + arch/riscv/kernel/traps.c | 51 ++ arch/riscv/kernel/usercfi.c | 545 +++++++++++++++++++++ 
arch/riscv/kernel/vdso/Makefile | 6 + arch/riscv/kernel/vdso/flush_icache.S | 4 + arch/riscv/kernel/vdso/getcpu.S | 4 + arch/riscv/kernel/vdso/rt_sigreturn.S | 4 + arch/riscv/kernel/vdso/sys_hwprobe.S | 4 + arch/riscv/mm/init.c | 2 +- arch/riscv/mm/pgtable.c | 16 + include/linux/cpu.h | 4 + include/linux/mm.h | 7 + include/uapi/linux/elf.h | 2 + include/uapi/linux/prctl.h | 27 + kernel/sys.c | 30 ++ tools/testing/selftests/riscv/Makefile | 2 +- tools/testing/selftests/riscv/cfi/.gitignore | 3 + tools/testing/selftests/riscv/cfi/Makefile | 16 + tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 82 ++++ tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 173 +++++++ tools/testing/selftests/riscv/cfi/shadowstack.c | 385 +++++++++++++++ tools/testing/selftests/riscv/cfi/shadowstack.h | 27 + 54 files changed, 2369 insertions(+), 29 deletions(-) --- base-commit: 4181f8ad7a1061efed0219951d608d4988302af7 change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2 -- - debug

5 months, 3 weeks


[PATCH] tools/nolibc: add support for Alpha

by Thomas Weißschuh

A straightforward new architecture. Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net> --- Only tested on QEMU so far. Testing on real hardware would be very welcome. Test instructions: $ cd tools/testing/selftests/nolibc/ $ make -f Makefile.nolibc ARCH=alpha CROSS_COMPILE=alpha-linux- nolibc-test $ file nolibc-test nolibc-test: ELF 64-bit LSB executable, Alpha (unofficial), version 1 (SYSV), statically linked, not stripped $ ./nolibc-test Running test 'startup' 0 argc = 1 [OK] ... Total number of errors: 0 Exiting with status 0 --- tools/include/nolibc/arch-alpha.h | 164 +++++++++++++++++++++++++ tools/include/nolibc/arch.h | 2 + tools/testing/selftests/nolibc/Makefile.nolibc | 5 + tools/testing/selftests/nolibc/nolibc-test.c | 4 + tools/testing/selftests/nolibc/run-tests.sh | 3 +- 5 files changed, 177 insertions(+), 1 deletion(-) diff --git a/tools/include/nolibc/arch-alpha.h b/tools/include/nolibc/arch-alpha.h new file mode 100644 index 0000000000000000000000000000000000000000..6b9bb6c749b931f30ce7bd6cd125622828405604 --- /dev/null +++ b/tools/include/nolibc/arch-alpha.h @@ -0,0 +1,164 @@ +/* SPDX-License-Identifier: LGPL-2.1 OR MIT */ +/* + * Alpha specific definitions for NOLIBC + * Copyright (C) 2025 Thomas Weißschuh <linux(a)weissschuh.net> + */ + +#ifndef _NOLIBC_ARCH_ALPHA_H +#define _NOLIBC_ARCH_ALPHA_H + +#include "compiler.h" +#include "crt.h" + +/* + * Syscalls for Alpha: + * - registers are 64-bit + * - syscall number is passed in $0/v0 + * - the system call is performed by calling callsys + * - syscall return comes in $0/v0, error flag in $19/a3 + * - arguments are passed in $16/a0 to $21/a5 + * - GCC does not support symbol register names + */ + +#define my_syscall0(num) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num) \ + : "memory", "cc" \ + ); \ + _err ? 
-_ret : _ret; \ +}) + +#define my_syscall1(num, arg1) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + register long _arg1 __asm__ ("$16") = (long)(arg1); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num), "r"(_arg1) \ + : "memory", "cc" \ + ); \ + _err ? -_ret : _ret; \ +}) + +#define my_syscall2(num, arg1, arg2) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + register long _arg1 __asm__ ("$16") = (long)(arg1); \ + register long _arg2 __asm__ ("$17") = (long)(arg2); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num), "r"(_arg1), "r"(_arg2) \ + : "memory", "cc" \ + ); \ + _err ? -_ret : _ret; \ +}) + +#define my_syscall3(num, arg1, arg2, arg3) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + register long _arg1 __asm__ ("$16") = (long)(arg1); \ + register long _arg2 __asm__ ("$17") = (long)(arg2); \ + register long _arg3 __asm__ ("$18") = (long)(arg3); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num), "r"(_arg1), "r"(_arg2), "r"(_arg3) \ + : "memory", "cc" \ + ); \ + _err ? -_ret : _ret; \ +}) + +#define my_syscall4(num, arg1, arg2, arg3, arg4) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + register long _arg1 __asm__ ("$16") = (long)(arg1); \ + register long _arg2 __asm__ ("$17") = (long)(arg2); \ + register long _arg3 __asm__ ("$18") = (long)(arg3); \ + register long _arg4 __asm__ ("$19") = (long)(arg4); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num), "r"(_arg1), "r"(_arg2), "r"(_arg3), "r"(_arg4) \ + : "memory", "cc" \ + ); \ + _err ? 
-_ret : _ret; \ +}) + +#define my_syscall5(num, arg1, arg2, arg3, arg4, arg5) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + register long _arg1 __asm__ ("$16") = (long)(arg1); \ + register long _arg2 __asm__ ("$17") = (long)(arg2); \ + register long _arg3 __asm__ ("$18") = (long)(arg3); \ + register long _arg4 __asm__ ("$19") = (long)(arg4); \ + register long _arg5 __asm__ ("$20") = (long)(arg5); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num), "r"(_arg1), "r"(_arg2), "r"(_arg3), "r"(_arg4), \ + "r"(_arg5) \ + : "memory", "cc" \ + ); \ + _err ? -_ret : _ret; \ +}) + +#define my_syscall6(num, arg1, arg2, arg3, arg4, arg5, arg6) \ +({ \ + register long _num __asm__ ("$0") = (num); \ + register long _ret __asm__ ("$0"); \ + register long _err __asm__ ("$19"); \ + register long _arg1 __asm__ ("$16") = (long)(arg1); \ + register long _arg2 __asm__ ("$17") = (long)(arg2); \ + register long _arg3 __asm__ ("$18") = (long)(arg3); \ + register long _arg4 __asm__ ("$19") = (long)(arg4); \ + register long _arg5 __asm__ ("$20") = (long)(arg5); \ + register long _arg6 __asm__ ("$21") = (long)(arg6); \ + \ + __asm__ volatile ( \ + "callsys" \ + : "=r"(_ret), "=r"(_err) \ + : "r"(_num), "r"(_arg1), "r"(_arg2), "r"(_arg3), "r"(_arg4), \ + "r"(_arg5), "r"(_arg6) \ + : "memory", "cc" \ + ); \ + _err ? 
-_ret : _ret; \ +}) + +/* startup code */ +void __attribute__((weak, noreturn)) __nolibc_entrypoint __no_stack_protector _start(void) +{ + __asm__ volatile ( + "br $gp, 0f\n" /* setup $gp, so that 'lda' works */ + "0: ldgp $gp, 0($gp)\n" + "lda $27, _start_c\n" /* setup current function address for _start_c */ + "mov $sp, $16\n" /* save argc pointer to $16, as arg1 of _start_c */ + "br _start_c\n" /* transfer to c runtime */ + ); + __nolibc_entrypoint_epilogue(); +} + +#endif /* _NOLIBC_ARCH_ALPHA_H */ diff --git a/tools/include/nolibc/arch.h b/tools/include/nolibc/arch.h index 426c89198135564acca44c485e5c2d8ba36a6fe9..72585d4c04e2896a275faadf881e98286f914fb3 100644 --- a/tools/include/nolibc/arch.h +++ b/tools/include/nolibc/arch.h @@ -37,6 +37,8 @@ #include "arch-m68k.h" #elif defined(__sh__) #include "arch-sh.h" +#elif defined(__alpha__) +#include "arch-alpha.h" #else #error Unsupported Architecture #endif diff --git a/tools/testing/selftests/nolibc/Makefile.nolibc b/tools/testing/selftests/nolibc/Makefile.nolibc index 0fb759ba992ee6b1693b88f1b2e77463afa9f38b..0da33fe99bd630ec5100b5beed939d524af2b3d4 100644 --- a/tools/testing/selftests/nolibc/Makefile.nolibc +++ b/tools/testing/selftests/nolibc/Makefile.nolibc @@ -93,6 +93,7 @@ IMAGE_sparc32 = arch/sparc/boot/image IMAGE_sparc64 = arch/sparc/boot/image IMAGE_m68k = vmlinux IMAGE_sh4 = arch/sh/boot/zImage +IMAGE_alpha = vmlinux IMAGE = $(objtree)/$(IMAGE_$(XARCH)) IMAGE_NAME = $(notdir $(IMAGE)) @@ -123,6 +124,7 @@ DEFCONFIG_sparc32 = sparc32_defconfig DEFCONFIG_sparc64 = sparc64_defconfig DEFCONFIG_m68k = virt_defconfig DEFCONFIG_sh4 = rts7751r2dplus_defconfig +DEFCONFIG_alpha = defconfig DEFCONFIG = $(DEFCONFIG_$(XARCH)) EXTRACONFIG_x32 = -e CONFIG_X86_X32_ABI @@ -130,6 +132,7 @@ EXTRACONFIG_arm = -e CONFIG_NAMESPACES EXTRACONFIG_armthumb = -e CONFIG_NAMESPACES EXTRACONFIG_m68k = -e CONFIG_BLK_DEV_INITRD EXTRACONFIG_sh4 = -e CONFIG_BLK_DEV_INITRD -e CONFIG_CMDLINE_FROM_BOOTLOADER +EXTRACONFIG_alpha = -e 
CONFIG_BLK_DEV_INITRD EXTRACONFIG = $(EXTRACONFIG_$(XARCH)) # optional tests to run (default = all) @@ -162,6 +165,7 @@ QEMU_ARCH_sparc32 = sparc QEMU_ARCH_sparc64 = sparc64 QEMU_ARCH_m68k = m68k QEMU_ARCH_sh4 = sh4 +QEMU_ARCH_alpha = alpha QEMU_ARCH = $(QEMU_ARCH_$(XARCH)) QEMU_ARCH_USER_ppc64le = ppc64le @@ -203,6 +207,7 @@ QEMU_ARGS_sparc32 = -M SS-5 -m 256M -append "console=ttyS0,115200 panic=-1 $( QEMU_ARGS_sparc64 = -M sun4u -append "console=ttyS0,115200 panic=-1 $(TEST:%=NOLIBC_TEST=%)" QEMU_ARGS_m68k = -M virt -append "console=ttyGF0,115200 panic=-1 $(TEST:%=NOLIBC_TEST=%)" QEMU_ARGS_sh4 = -M r2d -serial file:/dev/stdout -append "console=ttySC1,115200 panic=-1 $(TEST:%=NOLIBC_TEST=%)" +QEMU_ARGS_alpha = -M clipper -append "console=ttyS0 panic=-1 $(TEST:%=NOLIBC_TEST=%)" QEMU_ARGS = -m 1G $(QEMU_ARGS_$(XARCH)) $(QEMU_ARGS_BIOS) $(QEMU_ARGS_EXTRA) # OUTPUT is only set when run from the main makefile, otherwise diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c index a297ee0d6d0754dfcd9f9e5609d42c7442dabc4e..bbbb2a485f220fed69556baaf2603d9cf24a1c36 100644 --- a/tools/testing/selftests/nolibc/nolibc-test.c +++ b/tools/testing/selftests/nolibc/nolibc-test.c @@ -709,6 +709,10 @@ int run_startup(int min, int max) /* checking NULL for argv/argv0, environ and _auxv is not enough, let's compare with sbrk(0) or &end */ extern char end; char *brk = sbrk(0) != (void *)-1 ? 
sbrk(0) : &end; +#if defined(__alpha__) + /* the ordering above does not work on an alpha kernel */ + brk = NULL; +#endif /* differ from nolibc, both glibc and musl have no global _auxv */ const unsigned long *test_auxv = (void *)-1; #ifdef NOLIBC diff --git a/tools/testing/selftests/nolibc/run-tests.sh b/tools/testing/selftests/nolibc/run-tests.sh index e8af1fb505cf3573b4a6b37228dee764fe2e5277..8ce57d7006594c531f471d777d579c4f08d87efe 100755 --- a/tools/testing/selftests/nolibc/run-tests.sh +++ b/tools/testing/selftests/nolibc/run-tests.sh @@ -28,6 +28,7 @@ all_archs=( sparc32 sparc64 m68k sh4 + alpha ) archs="${all_archs[@]}" @@ -189,7 +190,7 @@ test_arch() { echo "Unsupported configuration" return fi - if [ "$arch" = "m68k" -o "$arch" = "sh4" ] && [ "$llvm" = "1" ]; then + if [ "$arch" = "m68k" -o "$arch" = "sh4" -o "$arch" = "alpha" ] && [ "$llvm" = "1" ]; then echo "Unsupported configuration" return fi --- base-commit: b9e50363178a40c76bebaf2f00faa2b0b6baf8d1 change-id: 20250609-nolibc-alpha-33e79644544c Best regards, -- Thomas Weißschuh <linux(a)weissschuh.net>
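
The return convention used by the syscall macros in this patch (result in $0, error flag in $19, folded with `_err ? -_ret : _ret`) can be illustrated in plain C without inline assembly. This is only a minimal sketch and not part of the patch: `alpha_fold_result()` and `to_libc_result()` are hypothetical names, and the second helper assumes a simple "negative means error" rule for illustration.

```c
#include <errno.h>

/*
 * Hypothetical helper: fold the two-register result of an Alpha syscall
 * (value in $0, error flag in $19) into a single long, the same way the
 * macros do with "_err ? -_ret : _ret".
 */
static long alpha_fold_result(long ret, long err)
{
	return err ? -ret : ret;
}

/*
 * Hypothetical helper: turn a folded result into the usual libc
 * convention (return -1 and set errno on failure). The "negative means
 * error" check is an assumption made for this sketch.
 */
static long to_libc_result(long folded)
{
	if (folded < 0) {
		errno = (int)-folded;
		return -1;
	}
	return folded;
}
```

With these helpers, a kernel reply of value 2 with the error flag set folds to -2 and then becomes a return value of -1 with errno set to 2, while a reply with the flag clear is passed through unchanged.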

5 months, 3 weeks


[PATCH net-next V4 0/5] selftests: drv-net: Test XDP native support

by Mohsin Bashir

This patch series adds tests to validate XDP native support for the PASS, DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment. For the adjustment tests, validate support for both the extension and shrinking cases across various packet sizes and offset values.

The pass criteria for the head/tail adjustment tests require that at least one adjustment value works for at least one packet size. This accounts for the variability in the maximum supported head/tail adjustment offset across different drivers.

The results reported in this series are based on fbnic. However, the series has been tested against multiple other drivers, including netdevsim.

Note: The XDP support for fbnic will be added later.
---
Change-log:
V4:
  - Support XDP handling for netdevsim
  - Fix pylint warning with P4
  - Update commit message for P2,P3 to show pass/fail summary
V3: https://lore.kernel.org/netdev/20250712002648.2385849-1-mohsin.bashr@gmail.…
V2: https://lore.kernel.org/netdev/20250710184351.63797-1-mohsin.bashr@gmail.com
V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.…

Jakub Kicinski (1):
  net: netdevsim: hook in XDP handling

Mohsin Bashir (4):
  selftests: drv-net: Test XDP_PASS/DROP support
  selftests: drv-net: Test XDP_TX support
  selftests: drv-net: Test tail-adjustment support
  selftests: drv-net: Test head-adjustment support

 drivers/net/netdevsim/netdev.c                |  19 +-
 tools/testing/selftests/drivers/net/Makefile  |   1 +
 tools/testing/selftests/drivers/net/xdp.py    | 656 ++++++++++++++++++
 .../selftests/net/lib/xdp_native.bpf.c        | 538 ++++++++++++++
 4 files changed, 1213 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/drivers/net/xdp.py
 create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c

-- 
2.47.1
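
The pass criterion stated in this cover letter for the head/tail adjustment tests amounts to an "any cell is true" check over a grid of packet sizes and offsets. The following is a minimal C sketch of that predicate only; it is not code from the series (whose tests live in xdp.py), and the function name `adjust_test_passes()` and the row-major `results` layout are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical predicate: results is a row-major matrix with one row per
 * packet size and one column per adjustment offset; true means that the
 * combination worked. The test passes if at least one offset worked for
 * at least one packet size.
 */
static bool adjust_test_passes(const bool *results, size_t n_sizes,
			       size_t n_offsets)
{
	for (size_t i = 0; i < n_sizes * n_offsets; i++)
		if (results[i])
			return true;
	return false;
}
```

A criterion this permissive tolerates drivers whose maximum supported offset differs, at the cost of not flagging a driver that supports only a single combination.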

5 months, 3 weeks


[PATCH net-next v3 0/3] netdevsim: add support for PHY devices

by Maxime Chevallier

Hi everyone,

Here's a V3 for the netdevsim PHY support. This V3 includes:
 - A fix for a compilation issue with PHYLIB=n
 - An updated Kconfig to only allow PHYLIB=y|n
 - The link setting file converted to a bool debugfs file, relying on link state polling

The idea of this series is to allow attaching virtual PHY devices to netdevsim, so that we can test PHY-related ethtool commands. This can be extended in the future for phylib testing as well.

V2: https://lore.kernel.org/netdev/20250708115531.111326-1-maxime.chevallier@bo…
 - Fix building with PHYLIB=m
 - Use shellcheck on the shell scripts

V1: https://lore.kernel.org/netdev/20250702082806.706973-1-maxime.chevallier@bo…

Maxime Chevallier (3):
  net: netdevsim: Add PHY support in netdevsim
  selftests: ethtool: Drop the unused old_netdevs variable
  selftests: ethtool: Introduce ethernet PHY selftests on netdevsim

 drivers/net/Kconfig                           |   1 +
 drivers/net/netdevsim/Makefile                |   4 +
 drivers/net/netdevsim/dev.c                   |   2 +
 drivers/net/netdevsim/netdev.c                |   8 +
 drivers/net/netdevsim/netdevsim.h             |  25 ++
 drivers/net/netdevsim/phy.c                   | 375 ++++++++++++++++++
 .../selftests/drivers/net/netdevsim/config    |   1 +
 .../drivers/net/netdevsim/ethtool-common.sh   |  19 +-
 .../drivers/net/netdevsim/ethtool-phy.sh      |  64 +++
 9 files changed, 496 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/netdevsim/phy.c
 create mode 100755 tools/testing/selftests/drivers/net/netdevsim/ethtool-phy.sh

-- 
2.49.0

5 months, 3 weeks


[PATCH net] selftests: net: increase inter-packet timeout in udpgro.sh

by Paolo Abeni

The mentioned test is not very stable when running on top of a debug kernel build. Increase the inter-packet timeout to allow more slack in such environments.

Fixes: 3327a9c46352 ("selftests: add functionals test for UDP GRO")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
---
 tools/testing/selftests/net/udpgro.sh | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/udpgro.sh b/tools/testing/selftests/net/udpgro.sh
index 1dc337c709f8..b17e032a6d75 100755
--- a/tools/testing/selftests/net/udpgro.sh
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -48,7 +48,7 @@ run_one() {
 
 	cfg_veth
 
-	ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} &
+	ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 100 ${rx_args} &
 	local PID1=$!
 
 	wait_local_port_listen ${PEER_NS} 8000 udp
@@ -95,7 +95,7 @@ run_one_nat() {
 	# will land on the 'plain' one
 	ip netns exec "${PEER_NS}" ./udpgso_bench_rx -G ${family} -b ${addr1} -n 0 &
 	local PID1=$!
-	ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${family} -b ${addr2%/*} ${rx_args} &
+	ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 100 ${family} -b ${addr2%/*} ${rx_args} &
 	local PID2=$!
 
 	wait_local_port_listen "${PEER_NS}" 8000 udp
@@ -117,9 +117,9 @@ run_one_2sock() {
 
 	cfg_veth
 
-	ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} -p 12345 &
+	ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 100 ${rx_args} -p 12345 &
 	local PID1=$!
-	ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 2000 -R 10 ${rx_args} &
+	ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 2000 -R 100 ${rx_args} &
 	local PID2=$!
 
 	wait_local_port_listen "${PEER_NS}" 12345 udp
-- 
2.50.0

5 months, 3 weeks

