- Linux-kselftest-mirror - lists.linaro.org

[PATCH v2 0/9] Initial DMABUF support for iommufd

by Jason Gunthorpe

This series is the start of adding full DMABUF support to iommufd. Currently it is limited to only work with VFIO's DMABUF exporter. It sits on top of Leon's series to add a DMABUF exporter to VFIO:

   https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.c…

The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fds, but otherwise works the same as it does today for a memfd. The user can select a slice of the FD to map into the ioas and if the underlying alignment requirements are met it will be placed in the iommu_domain.

Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR memory from VFIO to an iommu_domain controlled by iommufd. This is used for PCI Peer to Peer support in VMs, and is the last feature that the VFIO type1 container has that iommufd couldn't do.

The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime control and is a use-after-free security problem.

Instead iommufd relies on revocable DMABUFs. Whenever VFIO thinks there should be no access to the MMIO it can shoot down the mapping in iommufd which will unmap it from the iommu_domain. There is no automatic remap; this is a safety protocol so the kernel doesn't get stuck. Userspace is expected to know it is doing something that will revoke the dmabuf and map/unmap it around the activity. E.g. when QEMU goes to issue FLR it should do the map/unmap to iommufd.

Since DMABUF is missing some key general features for this use case it relies on a "private interconnect" between VFIO and iommufd via the vfio_pci_dma_buf_iommufd_map() call.

The call confirms the DMABUF has revoke semantics and delivers a phys_addr for the memory suitable for use with iommu_map().

Medium term there is a desire to expand the supported DMABUFs to include GPU drivers to support DPDK/SPDK type use cases so future series will work to add a general concept of revoke and a general negotiation of interconnect to remove vfio_pci_dma_buf_iommufd_map().

I also plan another series to modify iommufd's vfio_compat to transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI of type1.

The latest series for interconnect negotiation to exchange a phys_addr is:

   https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com

And the discussion for design of revoke is here:

   https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/

This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf

v2:
 - Rebase on Leon's v9
 - Fix mislocking in an iopt_fill_domain() error path
 - Revise the comments around how the sub page offset works
 - Remove a useless WARN_ON in iopt_pages_rw_access()
 - Fix missed memory free in the selftest
v1: https://patch.msgid.link/r/0-v1-64bed2430cdb+31b-iommufd_dmabuf_jgg@nvidia.…

Jason Gunthorpe (9):
  vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
  iommufd: Add DMABUF to iopt_pages
  iommufd: Do not map/unmap revoked DMABUFs
  iommufd: Allow a DMABUF to be revoked
  iommufd: Allow MMIO pages in a batch
  iommufd: Have pfn_reader process DMABUF iopt_pages
  iommufd: Have iopt_map_file_pages convert the fd to a file
  iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
  iommufd/selftest: Add some tests for the dmabuf flow

 drivers/iommu/iommufd/io_pagetable.c          |  78 +++-
 drivers/iommu/iommufd/io_pagetable.h          |  54 ++-
 drivers/iommu/iommufd/ioas.c                  |   8 +-
 drivers/iommu/iommufd/iommufd_private.h       |  14 +-
 drivers/iommu/iommufd/iommufd_test.h          |  10 +
 drivers/iommu/iommufd/main.c                  |  10 +
 drivers/iommu/iommufd/pages.c                 | 414 ++++++++++++++++--
 drivers/iommu/iommufd/selftest.c              | 143 ++++++
 drivers/vfio/pci/vfio_pci_dmabuf.c            |  34 ++
 include/linux/vfio_pci_core.h                 |   4 +
 tools/testing/selftests/iommu/iommufd.c       |  43 ++
 tools/testing/selftests/iommu/iommufd_utils.h |  44 ++
 12 files changed, 786 insertions(+), 70 deletions(-)

base-commit: f836737ed56db9e2d5b047c56a31e05af0f3f116
--
2.43.0
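
The "no automatic remap" rule described in the cover letter can be pictured with a small userspace toy model. This is only a sketch: the `toy_*` names are invented for illustration and are not iommufd or VFIO interfaces, and returning -ENODEV for a map attempt during a revoke is an assumption of the sketch, since the cover letter does not say what the real code returns.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for one DMABUF-backed mapping; not an iommufd type. */
struct toy_mapping {
	bool revoked;	/* exporter currently forbids access to the MMIO */
	bool mapped;	/* mapping is currently present in the domain */
};

/* Exporter side: a revoke always tears the mapping down. */
void toy_revoke(struct toy_mapping *m)
{
	m->revoked = true;
	m->mapped = false;
}

/* Exporter side: ending the revoke does NOT bring the mapping back. */
void toy_unrevoke(struct toy_mapping *m)
{
	m->revoked = false;
}

/* Userspace side: explicit map; refused while revoked (assumed errno). */
int toy_map(struct toy_mapping *m)
{
	if (m->revoked)
		return -ENODEV;
	m->mapped = true;
	return 0;
}

/* Userspace side: explicit unmap ahead of an activity that revokes. */
void toy_unmap(struct toy_mapping *m)
{
	m->mapped = false;
}
```

In this model a VMM would call toy_unmap() before issuing FLR and toy_map() afterwards; nothing restores the mapping on its behalf once the revoke ends.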

1 month, 2 weeks


[PATCH] selftests/filesystems: Assume that TIOCGPTPEER is defined

by Mark Brown

The devpts_pts selftest has an ifdef in case an architecture does not define TIOCGPTPEER, but the handling for this is broken since we need errno to be set to EINVAL in order to skip the test as we should. Given that this ioctl() has been defined since v4.15 we may as well just assume it's there rather than write handling code which will probably never get used.

Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
 tools/testing/selftests/filesystems/devpts_pts.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/testing/selftests/filesystems/devpts_pts.c b/tools/testing/selftests/filesystems/devpts_pts.c
index b1fc9b916ace..cad7da1bd7ca 100644
--- a/tools/testing/selftests/filesystems/devpts_pts.c
+++ b/tools/testing/selftests/filesystems/devpts_pts.c
@@ -100,7 +100,7 @@ static int resolve_procfd_symlink(int fd, char *buf, size_t buflen)
 static int do_tiocgptpeer(char *ptmx, char *expected_procfd_contents)
 {
 	int ret;
-	int master = -1, slave = -1, fret = -1;
+	int master = -1, slave, fret = -1;
 
 	master = open(ptmx, O_RDWR | O_NOCTTY | O_CLOEXEC);
 	if (master < 0) {
@@ -119,9 +119,7 @@ static int do_tiocgptpeer(char *ptmx, char *expected_procfd_contents)
 		goto do_cleanup;
 	}
 
-#ifdef TIOCGPTPEER
 	slave = ioctl(master, TIOCGPTPEER, O_RDWR | O_NOCTTY | O_CLOEXEC);
-#endif
 	if (slave < 0) {
 		if (errno == EINVAL) {
 			fprintf(stderr, "TIOCGPTPEER is not supported. "
---
base-commit: ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d
change-id: 20251126-selftests-filesystems-devpts-tiocgptpeer-fbd30e579859

Best regards,
--
Mark Brown <broonie(a)kernel.org>

1 month, 2 weeks


[PATCH v6 0/2] platform/chrome: Fix an UAF via revocable primitive APIs

by Tzung-Bi Shih

The series is separated from [1] to show its independence and to make it easier to compare potential use cases. This use case uses the primitive revocable APIs directly. It relies on the revocable core part [2].

It tries to fix a UAF in the fops of cros_ec_chardev after the underlying protocol device is gone by using revocable.

The file operations make sure the resources are available when using them. Even though it has the finest grain for accessing the resources, it makes the user code verbose. Per feedback from the community, I'm looking for some subsystem level helpers so that user code can be simpler.

The 1st patch converts existing protocol devices to resource providers of cros_ec_device.

The 2nd patch converts cros_ec_chardev to a resource consumer of cros_ec_device to fix the UAF.

[1] https://lore.kernel.org/chrome-platform/20251016054204.1523139-1-tzungbi@ke…
[2] https://lore.kernel.org/chrome-platform/20251106152330.11733-1-tzungbi@kern…

v6:
- New, separated from an existing series.

Tzung-Bi Shih (2):
  platform/chrome: Protect cros_ec_device lifecycle with revocable
  platform/chrome: cros_ec_chardev: Consume cros_ec_device via revocable

 drivers/platform/chrome/cros_ec.c           |  5 ++
 drivers/platform/chrome/cros_ec_chardev.c   | 71 ++++++++++++++++-----
 include/linux/platform_data/cros_ec_proto.h |  4 ++
 3 files changed, 65 insertions(+), 15 deletions(-)

--
2.48.1
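
The provider/consumer split described above can be illustrated with a deliberately simplified userspace sketch. None of the `toy_*` names come from the series, the real revocable API is not reproduced here, and the sketch ignores concurrency entirely, which the kernel implementation has to handle.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical provider handle: owns a pointer that can be revoked. */
struct toy_provider {
	void *res;	/* NULL once the resource has been revoked */
};

/* Provider side: publish the resource. */
void toy_provider_init(struct toy_provider *p, void *res)
{
	p->res = res;
}

/* Provider side: called before the resource goes away. */
void toy_provider_revoke(struct toy_provider *p)
{
	p->res = NULL;
}

/* Consumer side: returns the resource, or NULL if it is gone. */
void *toy_try_access(const struct toy_provider *p)
{
	return p->res;
}

/* Example consumer: fails cleanly instead of touching a dead resource. */
int toy_read_value(const struct toy_provider *p, int *out)
{
	const int *dev = toy_try_access(p);

	if (!dev)
		return -ENODEV;
	*out = *dev;
	return 0;
}
```

Every consumer entry point repeats the "try access, check for NULL" step, which is the verbosity the cover letter hopes subsystem level helpers could hide.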

1 month, 2 weeks


[RFC PATCH v2 0/3] Add testable code specifications

by Gabriele Paoloni

[1] was an initial proposal defining testable code specifications for some functions in /drivers/char/mem.c. However, both a Guideline for writing such specifications and test cases tracing to them were missing.

This patchset represents a next step and is organised as follows:
- patch 1/3 contains the Guideline for writing code specifications
- patch 2/3 contains examples of code specifications defined for some functions of drivers/char/mem.c
- patch 3/3 contains examples of selftests that map to some code specifications of patch 2/3

[1] https://lore.kernel.org/all/20250821170419.70668-1-gpaoloni@redhat.com/
---
Changes from v1:
1) Added a Guideline to write code specifications in the Linux Kernel Documentation
2) Addressed Greg KH comments in /drivers/char/mem.c
3) Added example of test cases mapping to the code specifications in /drivers/char/mem.c
---
Alessandro Carminati (1):
  selftests/devmem: initial testset

Gabriele Paoloni (2):
  Documentation: add guidelines for writing testable code specifications
  /dev/mem: Add initial documentation of memory_open() and mem_fops

 .../doc-guide/code-specifications.rst    | 208 +++++++
 Documentation/doc-guide/index.rst        |   1 +
 drivers/char/mem.c                       | 231 ++++++-
 tools/testing/selftests/Makefile         |   1 +
 tools/testing/selftests/devmem/Makefile  |  13 +
 tools/testing/selftests/devmem/debug.c   |  25 +
 tools/testing/selftests/devmem/debug.h   |  14 +
 tools/testing/selftests/devmem/devmem.c  | 200 ++++++
 tools/testing/selftests/devmem/ram_map.c | 250 ++++++++
 tools/testing/selftests/devmem/ram_map.h |  38 ++
 tools/testing/selftests/devmem/secret.c  |  46 ++
 tools/testing/selftests/devmem/secret.h  |  13 +
 tools/testing/selftests/devmem/tests.c   | 569 ++++++++++++++++++
 tools/testing/selftests/devmem/tests.h   |  45 ++
 tools/testing/selftests/devmem/utils.c   | 379 ++++++++++++
 tools/testing/selftests/devmem/utils.h   | 119 ++++
 16 files changed, 2146 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/doc-guide/code-specifications.rst
 create mode 100644 tools/testing/selftests/devmem/Makefile
 create mode 100644 tools/testing/selftests/devmem/debug.c
 create mode 100644 tools/testing/selftests/devmem/debug.h
 create mode 100644 tools/testing/selftests/devmem/devmem.c
 create mode 100644 tools/testing/selftests/devmem/ram_map.c
 create mode 100644 tools/testing/selftests/devmem/ram_map.h
 create mode 100644 tools/testing/selftests/devmem/secret.c
 create mode 100644 tools/testing/selftests/devmem/secret.h
 create mode 100644 tools/testing/selftests/devmem/tests.c
 create mode 100644 tools/testing/selftests/devmem/tests.h
 create mode 100644 tools/testing/selftests/devmem/utils.c
 create mode 100644 tools/testing/selftests/devmem/utils.h

--
2.48.1

1 month, 2 weeks


[PATCH 1/2] rust: allow `unreachable_pub` for doctests

by Miguel Ojeda

Examples (i.e. doctests) may want to show public items such as structs, thus the `unreachable_pub` warning is not very helpful.

Thus allow it for all doctests.

In addition, remove it from the existing `expect`s we have in a couple doctests.

Suggested-by: Alice Ryhl <aliceryhl(a)google.com>
Link: https://lore.kernel.org/rust-for-linux/aRG9VjsaCjsvAwUn@google.com/
Signed-off-by: Miguel Ojeda <ojeda(a)kernel.org>
---
 rust/kernel/init.rs         | 2 +-
 rust/kernel/types.rs        | 2 +-
 scripts/rustdoc_test_gen.rs | 1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/rust/kernel/init.rs b/rust/kernel/init.rs
index 4949047af8d7..e476d81c1a27 100644
--- a/rust/kernel/init.rs
+++ b/rust/kernel/init.rs
@@ -67,7 +67,7 @@
 //! ```
 //!
 //! ```rust
-//! # #![expect(unreachable_pub, clippy::disallowed_names)]
+//! # #![expect(clippy::disallowed_names)]
 //! use kernel::{prelude::*, types::Opaque};
 //! use core::{ptr::addr_of_mut, marker::PhantomPinned, pin::Pin};
 //! # mod bindings {
diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
index dc0a02f5c3cf..835824788506 100644
--- a/rust/kernel/types.rs
+++ b/rust/kernel/types.rs
@@ -289,7 +289,7 @@ fn drop(&mut self) {
 /// # Examples
 ///
 /// ```
-/// # #![expect(unreachable_pub, clippy::disallowed_names)]
+/// # #![expect(clippy::disallowed_names)]
 /// use kernel::types::Opaque;
 /// # // Emulate a C struct binding which is from C, maybe uninitialized or not, only the C side
 /// # // knows.
diff --git a/scripts/rustdoc_test_gen.rs b/scripts/rustdoc_test_gen.rs
index c8f9dc2ab976..0e6a0542d1bd 100644
--- a/scripts/rustdoc_test_gen.rs
+++ b/scripts/rustdoc_test_gen.rs
@@ -208,6 +208,7 @@ macro_rules! assert_eq {{
     #[allow(unused)]
     static __DOCTEST_ANCHOR: i32 = ::core::line!() as i32 + {body_offset} + 1;
     {{
+        #![allow(unreachable_pub)]
         {body}
         main();
     }}

base-commit: e9a6fb0bcdd7609be6969112f3fbfcce3b1d4a7c
--
2.51.2

1 month, 2 weeks


[PATCH 00/32] ns: support file handles

by Christian Brauner

For a while now we have supported file handles for pidfds. This has proven to be very useful.

Extend the concept to cover namespaces as well. After this patchset it is possible to encode and decode namespace file handles using the common name_to_handle_at() and open_by_handle_at() APIs.

Namespace file descriptors can already be derived from pidfds, which means they aren't subject to overmount protection bugs. IOW, it's irrelevant if the caller would not have access to an appropriate /proc/<pid>/ns/ directory as they could always just derive the namespace based on a pidfd already.

It has the same advantage as pidfds. It's possible to reliably and for the lifetime of the system refer to a namespace without pinning any resources and to compare them.

Permission checking is kept simple. If the caller is located in the namespace the file handle refers to, they are able to open it; otherwise they must hold privilege over the owning namespace of the relevant namespace.

Both the network namespace and the mount namespace already have an associated cookie that isn't recycled and is fully exposed to userspace. Move this into ns_common and use the same id space for all namespaces so they can trivially and reliably be compared.

There's more coming based on the iterator infrastructure but the series is large enough and focuses on file handles.

Extensive selftests included. I still have various other test-suites to run but it holds up so far.

Signed-off-by: Christian Brauner <brauner(a)kernel.org>
---
Christian Brauner (32):
  pidfs: validate extensible ioctls
  nsfs: validate extensible ioctls
  block: use extensible_ioctl_valid()
  ns: move to_ns_common() to ns_common.h
  nsfs: add nsfs.h header
  ns: uniformly initialize ns_common
  mnt: use ns_common_init()
  ipc: use ns_common_init()
  cgroup: use ns_common_init()
  pid: use ns_common_init()
  time: use ns_common_init()
  uts: use ns_common_init()
  user: use ns_common_init()
  net: use ns_common_init()
  ns: remove ns_alloc_inum()
  nstree: make iterator generic
  mnt: support iterator
  cgroup: support iterator
  ipc: support iterator
  net: support iterator
  pid: support iterator
  time: support iterator
  userns: support iterator
  uts: support iterator
  ns: add to_<type>_ns() to respective headers
  nsfs: add current_in_namespace()
  nsfs: support file handles
  nsfs: support exhaustive file handles
  nsfs: add missing id retrieval support
  tools: update nsfs.h uapi header
  selftests/namespaces: add identifier selftests
  selftests/namespaces: add file handle selftests

 block/blk-integrity.c                          |    8 +-
 fs/fhandle.c                                   |    6 +
 fs/internal.h                                  |    1 +
 fs/mount.h                                     |   10 +-
 fs/namespace.c                                 |  156 +--
 fs/nsfs.c                                      |  266 +++-
 fs/pidfs.c                                     |    2 +-
 include/linux/cgroup.h                         |    5 +
 include/linux/exportfs.h                       |    6 +
 include/linux/fs.h                             |   14 +
 include/linux/ipc_namespace.h                  |    5 +
 include/linux/ns_common.h                      |   29 +
 include/linux/nsfs.h                           |   40 +
 include/linux/nsproxy.h                        |   11 -
 include/linux/nstree.h                         |   89 ++
 include/linux/pid_namespace.h                  |    5 +
 include/linux/proc_ns.h                        |   32 +-
 include/linux/time_namespace.h                 |    9 +
 include/linux/user_namespace.h                 |    5 +
 include/linux/utsname.h                        |    5 +
 include/net/net_namespace.h                    |    6 +
 include/uapi/linux/fcntl.h                     |    1 +
 include/uapi/linux/nsfs.h                      |   12 +-
 init/main.c                                    |    2 +
 ipc/msgutil.c                                  |    1 +
 ipc/namespace.c                                |   12 +-
 ipc/shm.c                                      |    2 +
 kernel/Makefile                                |    2 +-
 kernel/cgroup/cgroup.c                         |    2 +
 kernel/cgroup/namespace.c                      |   24 +-
 kernel/nstree.c                                |  233 ++++
 kernel/pid_namespace.c                         |   13 +-
 kernel/time/namespace.c                        |   23 +-
 kernel/user_namespace.c                        |   17 +-
 kernel/utsname.c                               |   28 +-
 net/core/net_namespace.c                       |   59 +-
 tools/include/uapi/linux/nsfs.h                |   23 +-
 tools/testing/selftests/namespaces/.gitignore  |    2 +
 tools/testing/selftests/namespaces/Makefile    |    7 +
 tools/testing/selftests/namespaces/config      |    7 +
 .../selftests/namespaces/file_handle_test.c    | 1410 ++++++++++++++++++++
 tools/testing/selftests/namespaces/nsid_test.c |  986 ++++++++++++++
 42 files changed, 3306 insertions(+), 270 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250905-work-namespace-c68826dda0d4
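
As a rough illustration of what "encode a namespace file handle" looks like from userspace, the sketch below calls name_to_handle_at() on an already opened namespace fd (for example one obtained from /proc/self/ns/net). The helper names ns_encode_handle() and ns_handle_equal() are invented for this example, and treating byte-wise equality of two handles as "same namespace" is an assumption drawn from the cover letter's remark about comparing them, not something it spells out. On kernels without this series the call is expected to fail.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for name_to_handle_at() and struct file_handle */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

/*
 * Encode a file handle for the object behind ns_fd.
 * On success returns 0 and stores a malloc()ed handle in *out, which the
 * caller frees. On failure returns a negative errno and leaves *out alone.
 * ns_encode_handle() is a hypothetical helper name.
 */
int ns_encode_handle(int ns_fd, struct file_handle **out)
{
	struct file_handle *fh;
	int mount_id, err;

	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return -ENOMEM;
	fh->handle_bytes = MAX_HANDLE_SZ;

	/* An empty path plus AT_EMPTY_PATH means "the fd itself". */
	if (name_to_handle_at(ns_fd, "", fh, &mount_id, AT_EMPTY_PATH) < 0) {
		err = errno;
		free(fh);
		return -err;
	}

	*out = fh;
	return 0;
}

/* Assumed comparison: type, length and payload must all match. */
int ns_handle_equal(const struct file_handle *a, const struct file_handle *b)
{
	return a->handle_type == b->handle_type &&
	       a->handle_bytes == b->handle_bytes &&
	       memcmp(a->f_handle, b->f_handle, a->handle_bytes) == 0;
}
```

Decoding goes through open_by_handle_at(), which is subject to the permission rules described in the cover letter and is left out of this sketch.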

1 month, 2 weeks


[PATCH v6 net-next 00/14] AccECN protocol case handling series

by chia-yu.chang＠nokia-bell-labs.com

From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>

Hello,

Please find the v6 AccECN case handling patch series, which covers the handling of several exceptional cases in the Accurate ECN spec (RFC9768), adds new identifiers to be used by CC modules, adds ecn_delta into rate_sample, and keeps the ACE counter for computation, etc.

This patch series is part of the full AccECN patch series, which is available at https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/

Best regards,
Chia-Yu

---
v6:
- Update comment in #3 to highlight RX path is only used for virtio-net (Paolo Abeni <pabeni(a)redhat.com>)
- Rename TCP_CONG_WANTS_ECT_1 to TCP_CONG_ECT_1_NEGOTIATION to distinguish from TCP_CONG_ECT_1_ESTABLISH (Paolo Abeni <pabeni(a)redhat.com>)
- Move TCP_CONG_ECT_1_ESTABLISH in #6 to a later patch series (Paolo Abeni <pabeni(a)redhat.com>)
- Add new synack_type instead of moving the increment of num_retrans in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Use new synack_type TCP_SYNACK_RETRANS and num_retrans for SYN/ACK retx fallback for AccECN in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Do not cast const struct into non-const in #11, and set AccECN fail mode after tcp_rtx_synack() (Paolo Abeni <pabeni(a)redhat.com>)

v5:
- Move previous #11 in v4 to a later patch after discussion with RFC author.
- Add #3 to update the comments for SKB_GSO_TCP_ECN and SKB_GSO_TCP_ACCECN. (Parav Pandit <parav(a)nvidia.com>)
- Add gro self-test for TCP CWR flag in #4. (Eric Dumazet <edumazet(a)google.com>)
- Add fixes: tag into #7 (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message of #8 and if condition check (Paolo Abeni <pabeni(a)redhat.com>)
- Add empty line between variable declarations and code in #13 (Paolo Abeni <pabeni(a)redhat.com>)

v4:
- Add previous #13 in v2 back after discussion with the RFC author.
- Add TCP_ACCECN_OPTION_PERSIST to tcp_ecn_option sysctl to ignore AccECN fallback policy on sending AccECN option.

v3:
- Add additional min() check if pkts_acked_ewma is not initialized in #1. (Paolo Abeni <pabeni(a)redhat.com>)
- Change TCP_CONG_WANTS_ECT_1 into individual flag add helper function INET_ECN_xmit_wants_ect_1() in #3. (Paolo Abeni <pabeni(a)redhat.com>)
- Add empty line between variable declarations and code in #4. (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message to fix old AccECN commits in #5. (Paolo Abeni <pabeni(a)redhat.com>)
- Remove unnecessary brackets in #10. (Paolo Abeni <pabeni(a)redhat.com>)
- Move patch #3 in v2 to a later Prague patch series and remove patch #13 in v2. (Paolo Abeni <pabeni(a)redhat.com>)

---
Chia-Yu Chang (12):
  net: update commnets for SKB_GSO_TCP_ECN and SKB_GSO_TCP_ACCECN
  selftests/net: gro: add self-test for TCP CWR flag
  tcp: ECT_1_NEGOTIATION and NEEDS_ACCECN identifiers
  tcp: disable RFC3168 fallback identifier for CC modules
  tcp: accecn: handle unexpected AccECN negotiation feedback
  tcp: accecn: retransmit downgraded SYN in AccECN negotiation
  tcp: add TCP_SYNACK_RETRANS synack_type
  tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN SYN/ACK
  tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiaion
  tcp: accecn: fallback outgoing half link to non-AccECN
  tcp: accecn: detect loss ACK w/ AccECN option and add TCP_ACCECN_OPTION_PERSIST
  tcp: accecn: enable AccECN

Ilpo Järvinen (2):
  tcp: try to avoid safer when ACKs are thinned
  gro: flushing when CWR is set negatively affects AccECN

 Documentation/networking/ip-sysctl.rst        |  4 +-
 .../networking/net_cachelines/tcp_sock.rst    |  1 +
 include/linux/skbuff.h                        | 14 ++-
 include/linux/tcp.h                           |  4 +-
 include/net/inet_ecn.h                        | 20 +++-
 include/net/tcp.h                             | 32 ++++++-
 include/net/tcp_ecn.h                         | 92 ++++++++++++++-----
 net/ipv4/inet_connection_sock.c               |  4 +
 net/ipv4/sysctl_net_ipv4.c                    |  4 +-
 net/ipv4/tcp.c                                |  2 +
 net/ipv4/tcp_cong.c                           |  5 +-
 net/ipv4/tcp_input.c                          | 37 +++++++-
 net/ipv4/tcp_minisocks.c                      | 46 +++++++---
 net/ipv4/tcp_offload.c                        |  3 +-
 net/ipv4/tcp_output.c                         | 32 ++++---
 net/ipv4/tcp_timer.c                          |  3 +
 tools/testing/selftests/net/gro.c             | 80 +++++++++++-----
 17 files changed, 295 insertions(+), 88 deletions(-)

--
2.34.1
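
Since the cover letter mentions keeping the ACE counter "for computation", here is a minimal sketch of the basic arithmetic involved. In RFC 9768 the ACE field is a 3-bit counter, so a receiver of ACKs can only see the count modulo 8. The function name ace_delta() and the ACE_MASK macro are made up for this example, and the sketch assumes fewer than 8 counter steps happened between the two observations; it is not the kernel's logic for resolving larger jumps.

```c
#include <stdint.h>

#define ACE_MASK 0x7u	/* the ACE field is 3 bits wide */

/*
 * Number of counter steps from prev_ace to ace, taken modulo 8.
 * Only correct if the counter advanced by fewer than 8 in between.
 */
unsigned int ace_delta(uint8_t prev_ace, uint8_t ace)
{
	return (ace - prev_ace) & ACE_MASK;
}
```

For example, going from 6 to 1 crosses the wrap (6, 7, 0, 1) and yields a delta of 3.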

1 month, 2 weeks


[PATCH] tools: bpf: remove runqslower tool

by Hoyeon Lee

runqslower was added in commit 9c01546d26d2 ("tools/bpf: Add runqslower tool to tools/bpf") as a BCC port to showcase early BPF CO-RE + libbpf workflows. runqslower continues to live in BCC (libbpf-tools), so there is no need to keep building and maintaining it.

Drop tools/bpf/runqslower and remove all build hooks in tools/bpf and selftests accordingly.

Signed-off-by: Hoyeon Lee <hoyeon.lee(a)suse.com>
---
 tools/bpf/Makefile                            |  13 +-
 tools/bpf/runqslower/.gitignore               |   2 -
 tools/bpf/runqslower/Makefile                 |  91 ----------
 tools/bpf/runqslower/runqslower.bpf.c         | 106 -----------
 tools/bpf/runqslower/runqslower.c             | 171 ------------------
 tools/bpf/runqslower/runqslower.h             |  13 --
 tools/testing/selftests/bpf/.gitignore        |   1 -
 tools/testing/selftests/bpf/Makefile          |  14 --
 .../selftests/bpf/test_bpftool_build.sh       |   4 -
 9 files changed, 3 insertions(+), 412 deletions(-)
 delete mode 100644 tools/bpf/runqslower/.gitignore
 delete mode 100644 tools/bpf/runqslower/Makefile
 delete mode 100644 tools/bpf/runqslower/runqslower.bpf.c
 delete mode 100644 tools/bpf/runqslower/runqslower.c
 delete mode 100644 tools/bpf/runqslower/runqslower.h

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index 062bbd6cd048..fd2585af1252 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -32,7 +32,7 @@ FEATURE_TESTS = libbfd disassembler-four-args disassembler-init-styled
 FEATURE_DISPLAY = libbfd
 
 check_feat := 1
-NON_CHECK_FEAT_TARGETS := clean bpftool_clean runqslower_clean resolve_btfids_clean
+NON_CHECK_FEAT_TARGETS := clean bpftool_clean resolve_btfids_clean
 ifdef MAKECMDGOALS
 ifeq ($(filter-out $(NON_CHECK_FEAT_TARGETS),$(MAKECMDGOALS)),)
   check_feat := 0
@@ -70,7 +70,7 @@ $(OUTPUT)%.lex.o: $(OUTPUT)%.lex.c
 
 PROGS = $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm
 
-all: $(PROGS) bpftool runqslower
+all: $(PROGS) bpftool
 
 $(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm'
 $(OUTPUT)bpf_jit_disasm: $(OUTPUT)bpf_jit_disasm.o
@@ -86,7 +86,7 @@ $(OUTPUT)bpf_exp.lex.c:
$(OUTPUT)bpf_exp.yacc.c $(OUTPUT)bpf_exp.yacc.o: $(OUTPUT)bpf_exp.yacc.c $(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.lex.c -clean: bpftool_clean runqslower_clean resolve_btfids_clean +clean: bpftool_clean resolve_btfids_clean $(call QUIET_CLEAN, bpf-progs) $(Q)$(RM) -r -- $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \ $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.* @@ -112,12 +112,6 @@ bpftool_install: bpftool_clean: $(call descend,bpftool,clean) -runqslower: - $(call descend,runqslower) - -runqslower_clean: - $(call descend,runqslower,clean) - resolve_btfids: $(call descend,resolve_btfids) @@ -125,5 +119,4 @@ resolve_btfids_clean: $(call descend,resolve_btfids,clean) .PHONY: all install clean bpftool bpftool_install bpftool_clean \ - runqslower runqslower_clean \ resolve_btfids resolve_btfids_clean diff --git a/tools/bpf/runqslower/.gitignore b/tools/bpf/runqslower/.gitignore deleted file mode 100644 index ffdb70230c8b..000000000000 --- a/tools/bpf/runqslower/.gitignore +++ /dev/null @@ -1,2 +0,0 @@ -# SPDX-License-Identifier: GPL-2.0-only -/.output diff --git a/tools/bpf/runqslower/Makefile b/tools/bpf/runqslower/Makefile deleted file mode 100644 index 78a436c4072e..000000000000 --- a/tools/bpf/runqslower/Makefile +++ /dev/null @@ -1,91 +0,0 @@ -# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) -include ../../scripts/Makefile.include - -OUTPUT ?= $(abspath .output)/ - -BPFTOOL_OUTPUT := $(OUTPUT)bpftool/ -DEFAULT_BPFTOOL := $(BPFTOOL_OUTPUT)bootstrap/bpftool -BPFTOOL ?= $(DEFAULT_BPFTOOL) -BPF_TARGET_ENDIAN ?= --target=bpf -LIBBPF_SRC := $(abspath ../../lib/bpf) -BPFOBJ_OUTPUT := $(OUTPUT)libbpf/ -BPFOBJ := $(BPFOBJ_OUTPUT)libbpf.a -BPF_DESTDIR := $(BPFOBJ_OUTPUT) -BPF_INCLUDE := $(BPF_DESTDIR)/include -INCLUDES := -I$(OUTPUT) -I$(BPF_INCLUDE) -I$(abspath ../../include/uapi) -CFLAGS := -g -Wall $(CLANG_CROSS_FLAGS) -CFLAGS += $(EXTRA_CFLAGS) -LDFLAGS += $(EXTRA_LDFLAGS) -LDLIBS += -lelf -lz - -# Try to detect best kernel BTF source 
-KERNEL_REL := $(shell uname -r) -VMLINUX_BTF_PATHS := $(if $(O),$(O)/vmlinux) \ - $(if $(KBUILD_OUTPUT),$(KBUILD_OUTPUT)/vmlinux) \ - ../../../vmlinux /sys/kernel/btf/vmlinux \ - /boot/vmlinux-$(KERNEL_REL) -VMLINUX_BTF_PATH := $(or $(VMLINUX_BTF),$(firstword \ - $(wildcard $(VMLINUX_BTF_PATHS)))) - -ifneq ($(V),1) -MAKEFLAGS += --no-print-directory -submake_extras := feature_display=0 -endif - -.DELETE_ON_ERROR: - -.PHONY: all clean runqslower libbpf_hdrs -all: runqslower - -runqslower: $(OUTPUT)/runqslower - -clean: - $(call QUIET_CLEAN, runqslower) - $(Q)$(RM) -r $(BPFOBJ_OUTPUT) $(BPFTOOL_OUTPUT) - $(Q)$(RM) $(OUTPUT)*.o $(OUTPUT)*.d - $(Q)$(RM) $(OUTPUT)*.skel.h $(OUTPUT)vmlinux.h - $(Q)$(RM) $(OUTPUT)runqslower - $(Q)$(RM) -r .output - -libbpf_hdrs: $(BPFOBJ) - -$(OUTPUT)/runqslower: $(OUTPUT)/runqslower.o $(BPFOBJ) - $(QUIET_LINK)$(CC) $(CFLAGS) $(LDFLAGS) $^ $(LDLIBS) -o $@ - -$(OUTPUT)/runqslower.o: runqslower.h $(OUTPUT)/runqslower.skel.h \ - $(OUTPUT)/runqslower.bpf.o | libbpf_hdrs - -$(OUTPUT)/runqslower.bpf.o: $(OUTPUT)/vmlinux.h runqslower.h | libbpf_hdrs - -$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(BPFTOOL) - $(QUIET_GEN)$(BPFTOOL) gen skeleton $< > $@ - -$(OUTPUT)/%.bpf.o: %.bpf.c $(BPFOBJ) | $(OUTPUT) - $(QUIET_GEN)$(CLANG) -g -O2 $(BPF_TARGET_ENDIAN) $(INCLUDES) \ - -c $(filter %.c,$^) -o $@ && \ - $(LLVM_STRIP) -g $@ - -$(OUTPUT)/%.o: %.c | $(OUTPUT) - $(QUIET_CC)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@ - -$(OUTPUT) $(BPFOBJ_OUTPUT) $(BPFTOOL_OUTPUT): - $(QUIET_MKDIR)mkdir -p $@ - -$(OUTPUT)/vmlinux.h: $(VMLINUX_BTF_PATH) | $(OUTPUT) $(BPFTOOL) -ifeq ($(VMLINUX_H),) - $(Q)if [ ! -e "$(VMLINUX_BTF_PATH)" ] ; then \ - echo "Couldn't find kernel BTF; set VMLINUX_BTF to" \ - "specify its location." 
>&2; \ - exit 1;\ - fi - $(QUIET_GEN)$(BPFTOOL) btf dump file $(VMLINUX_BTF_PATH) format c > $@ -else - $(Q)cp "$(VMLINUX_H)" $@ -endif - -$(BPFOBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(BPFOBJ_OUTPUT) - $(Q)$(MAKE) $(submake_extras) -C $(LIBBPF_SRC) OUTPUT=$(BPFOBJ_OUTPUT) \ - DESTDIR=$(BPFOBJ_OUTPUT) prefix= $(abspath $@) install_headers - -$(DEFAULT_BPFTOOL): | $(BPFTOOL_OUTPUT) - $(Q)$(MAKE) $(submake_extras) -C ../bpftool OUTPUT=$(BPFTOOL_OUTPUT) bootstrap diff --git a/tools/bpf/runqslower/runqslower.bpf.c b/tools/bpf/runqslower/runqslower.bpf.c deleted file mode 100644 index fced54a3adf6..000000000000 --- a/tools/bpf/runqslower/runqslower.bpf.c +++ /dev/null @@ -1,106 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -// Copyright (c) 2019 Facebook -#include "vmlinux.h" -#include <bpf/bpf_helpers.h> -#include "runqslower.h" - -#define TASK_RUNNING 0 -#define BPF_F_CURRENT_CPU 0xffffffffULL - -const volatile __u64 min_us = 0; -const volatile pid_t targ_pid = 0; - -struct { - __uint(type, BPF_MAP_TYPE_TASK_STORAGE); - __uint(map_flags, BPF_F_NO_PREALLOC); - __type(key, int); - __type(value, u64); -} start SEC(".maps"); - -struct { - __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY); - __uint(key_size, sizeof(u32)); - __uint(value_size, sizeof(u32)); -} events SEC(".maps"); - -/* record enqueue timestamp */ -__always_inline -static int trace_enqueue(struct task_struct *t) -{ - u32 pid = t->pid; - u64 *ptr; - - if (!pid || (targ_pid && targ_pid != pid)) - return 0; - - ptr = bpf_task_storage_get(&start, t, 0, - BPF_LOCAL_STORAGE_GET_F_CREATE); - if (!ptr) - return 0; - - *ptr = bpf_ktime_get_ns(); - return 0; -} - -SEC("tp_btf/sched_wakeup") -int handle__sched_wakeup(u64 *ctx) -{ - /* TP_PROTO(struct task_struct *p) */ - struct task_struct *p = (void *)ctx[0]; - - return trace_enqueue(p); -} - -SEC("tp_btf/sched_wakeup_new") -int handle__sched_wakeup_new(u64 *ctx) -{ - /* TP_PROTO(struct task_struct *p) */ - struct task_struct *p = (void *)ctx[0]; - - 
return trace_enqueue(p); -} - -SEC("tp_btf/sched_switch") -int handle__sched_switch(u64 *ctx) -{ - /* TP_PROTO(bool preempt, struct task_struct *prev, - * struct task_struct *next) - */ - struct task_struct *prev = (struct task_struct *)ctx[1]; - struct task_struct *next = (struct task_struct *)ctx[2]; - struct runq_event event = {}; - u64 *tsp, delta_us; - u32 pid; - - /* ivcsw: treat like an enqueue event and store timestamp */ - if (prev->__state == TASK_RUNNING) - trace_enqueue(prev); - - pid = next->pid; - - /* For pid mismatch, save a bpf_task_storage_get */ - if (!pid || (targ_pid && targ_pid != pid)) - return 0; - - /* fetch timestamp and calculate delta */ - tsp = bpf_task_storage_get(&start, next, 0, 0); - if (!tsp) - return 0; /* missed enqueue */ - - delta_us = (bpf_ktime_get_ns() - *tsp) / 1000; - if (min_us && delta_us <= min_us) - return 0; - - event.pid = pid; - event.delta_us = delta_us; - bpf_get_current_comm(&event.task, sizeof(event.task)); - - /* output */ - bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, - &event, sizeof(event)); - - bpf_task_storage_delete(&start, next); - return 0; -} - -char LICENSE[] SEC("license") = "GPL"; diff --git a/tools/bpf/runqslower/runqslower.c b/tools/bpf/runqslower/runqslower.c deleted file mode 100644 index 83c5993a139a..000000000000 --- a/tools/bpf/runqslower/runqslower.c +++ /dev/null @@ -1,171 +0,0 @@ -// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) -// Copyright (c) 2019 Facebook -#include <argp.h> -#include <stdio.h> -#include <stdlib.h> -#include <string.h> -#include <time.h> -#include <bpf/libbpf.h> -#include <bpf/bpf.h> -#include "runqslower.h" -#include "runqslower.skel.h" - -struct env { - pid_t pid; - __u64 min_us; - bool verbose; -} env = { - .min_us = 10000, -}; - -const char *argp_program_version = "runqslower 0.1"; -const char *argp_program_bug_address = "<bpf(a)vger.kernel.org>"; -const char argp_program_doc[] = -"runqslower Trace long process scheduling delays.\n" -" For Linux, 
uses eBPF, BPF CO-RE, libbpf, BTF.\n" -"\n" -"This script traces high scheduling delays between tasks being\n" -"ready to run and them running on CPU after that.\n" -"\n" -"USAGE: runqslower [-p PID] [min_us]\n" -"\n" -"EXAMPLES:\n" -" runqslower # trace run queue latency higher than 10000 us (default)\n" -" runqslower 1000 # trace run queue latency higher than 1000 us\n" -" runqslower -p 123 # trace pid 123 only\n"; - -static const struct argp_option opts[] = { - { "pid", 'p', "PID", 0, "Process PID to trace"}, - { "verbose", 'v', NULL, 0, "Verbose debug output" }, - {}, -}; - -static error_t parse_arg(int key, char *arg, struct argp_state *state) -{ - static int pos_args; - int pid; - long long min_us; - - switch (key) { - case 'v': - env.verbose = true; - break; - case 'p': - errno = 0; - pid = strtol(arg, NULL, 10); - if (errno || pid <= 0) { - fprintf(stderr, "Invalid PID: %s\n", arg); - argp_usage(state); - } - env.pid = pid; - break; - case ARGP_KEY_ARG: - if (pos_args++) { - fprintf(stderr, - "Unrecognized positional argument: %s\n", arg); - argp_usage(state); - } - errno = 0; - min_us = strtoll(arg, NULL, 10); - if (errno || min_us <= 0) { - fprintf(stderr, "Invalid delay (in us): %s\n", arg); - argp_usage(state); - } - env.min_us = min_us; - break; - default: - return ARGP_ERR_UNKNOWN; - } - return 0; -} - -int libbpf_print_fn(enum libbpf_print_level level, - const char *format, va_list args) -{ - if (level == LIBBPF_DEBUG && !env.verbose) - return 0; - return vfprintf(stderr, format, args); -} - -void handle_event(void *ctx, int cpu, void *data, __u32 data_sz) -{ - const struct runq_event *e = data; - struct tm *tm; - char ts[32]; - time_t t; - - time(&t); - tm = localtime(&t); - strftime(ts, sizeof(ts), "%H:%M:%S", tm); - printf("%-8s %-16s %-6d %14llu\n", ts, e->task, e->pid, e->delta_us); -} - -void handle_lost_events(void *ctx, int cpu, __u64 lost_cnt) -{ - printf("Lost %llu events on CPU #%d!\n", lost_cnt, cpu); -} - -int main(int argc, char **argv) 
-{ - static const struct argp argp = { - .options = opts, - .parser = parse_arg, - .doc = argp_program_doc, - }; - struct perf_buffer *pb = NULL; - struct runqslower_bpf *obj; - int err; - - err = argp_parse(&argp, argc, argv, 0, NULL, NULL); - if (err) - return err; - - libbpf_set_print(libbpf_print_fn); - - /* Use libbpf 1.0 API mode */ - libbpf_set_strict_mode(LIBBPF_STRICT_ALL); - - obj = runqslower_bpf__open(); - if (!obj) { - fprintf(stderr, "failed to open and/or load BPF object\n"); - return 1; - } - - /* initialize global data (filtering options) */ - obj->rodata->targ_pid = env.pid; - obj->rodata->min_us = env.min_us; - - err = runqslower_bpf__load(obj); - if (err) { - fprintf(stderr, "failed to load BPF object: %d\n", err); - goto cleanup; - } - - err = runqslower_bpf__attach(obj); - if (err) { - fprintf(stderr, "failed to attach BPF programs\n"); - goto cleanup; - } - - printf("Tracing run queue latency higher than %llu us\n", env.min_us); - printf("%-8s %-16s %-6s %14s\n", "TIME", "COMM", "PID", "LAT(us)"); - - pb = perf_buffer__new(bpf_map__fd(obj->maps.events), 64, - handle_event, handle_lost_events, NULL, NULL); - err = libbpf_get_error(pb); - if (err) { - pb = NULL; - fprintf(stderr, "failed to open perf buffer: %d\n", err); - goto cleanup; - } - - while ((err = perf_buffer__poll(pb, 100)) >= 0) - ; - printf("Error polling perf buffer: %d\n", err); - -cleanup: - perf_buffer__free(pb); - runqslower_bpf__destroy(obj); - - return err != 0; -} diff --git a/tools/bpf/runqslower/runqslower.h b/tools/bpf/runqslower/runqslower.h deleted file mode 100644 index 4f70f07200c2..000000000000 --- a/tools/bpf/runqslower/runqslower.h +++ /dev/null @@ -1,13 +0,0 @@ -/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ -#ifndef __RUNQSLOWER_H -#define __RUNQSLOWER_H - -#define TASK_COMM_LEN 16 - -struct runq_event { - char task[TASK_COMM_LEN]; - __u64 delta_us; - pid_t pid; -}; - -#endif /* __RUNQSLOWER_H */ diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore index be1ee7ba7ce0..e091809f07a0 100644 --- a/tools/testing/selftests/bpf/.gitignore +++ b/tools/testing/selftests/bpf/.gitignore @@ -32,7 +32,6 @@ test_cpp /cpuv4 /host-tools /tools -/runqslower /bench /veristat /sign-file diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index f00587d4ede6..79f9f96d153f 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -127,7 +127,6 @@ TEST_KMOD_TARGETS = $(addprefix $(OUTPUT)/,$(TEST_KMODS)) TEST_GEN_PROGS_EXTENDED = \ bench \ flow_dissector_load \ - runqslower \ test_cpp \ test_lirc_mode2_user \ veristat \ @@ -209,8 +208,6 @@ HOST_INCLUDE_DIR := $(INCLUDE_DIR) endif HOST_BPFOBJ := $(HOST_BUILD_DIR)/libbpf/libbpf.a RESOLVE_BTFIDS := $(HOST_BUILD_DIR)/resolve_btfids/resolve_btfids -RUNQSLOWER_OUTPUT := $(BUILD_DIR)/runqslower/ - VMLINUX_BTF_PATHS ?= $(if $(O),$(O)/vmlinux) \ $(if $(KBUILD_OUTPUT),$(KBUILD_OUTPUT)/vmlinux) \ ../../../../vmlinux \ @@ -304,17 +301,6 @@ TRUNNER_BPFTOOL := $(DEFAULT_BPFTOOL) USE_BOOTSTRAP := "bootstrap/" endif -$(OUTPUT)/runqslower: $(BPFOBJ) | $(DEFAULT_BPFTOOL) $(RUNQSLOWER_OUTPUT) - $(Q)$(MAKE) $(submake_extras) -C $(TOOLSDIR)/bpf/runqslower \ - OUTPUT=$(RUNQSLOWER_OUTPUT) VMLINUX_BTF=$(VMLINUX_BTF) \ - BPFTOOL_OUTPUT=$(HOST_BUILD_DIR)/bpftool/ \ - BPFOBJ_OUTPUT=$(BUILD_DIR)/libbpf/ \ - BPFOBJ=$(BPFOBJ) BPF_INCLUDE=$(INCLUDE_DIR) \ - BPF_TARGET_ENDIAN=$(BPF_TARGET_ENDIAN) \ - EXTRA_CFLAGS='-g $(OPT_FLAGS) $(SAN_CFLAGS) $(EXTRA_CFLAGS)' \ - EXTRA_LDFLAGS='$(SAN_LDFLAGS) $(EXTRA_LDFLAGS)' && \ - cp $(RUNQSLOWER_OUTPUT)runqslower $@ - TEST_GEN_PROGS_EXTENDED += $(TRUNNER_BPFTOOL) $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED): $(BPFOBJ) diff --git a/tools/testing/selftests/bpf/test_bpftool_build.sh b/tools/testing/selftests/bpf/test_bpftool_build.sh index 1453a53ed547..b03a87571592 100755 --- a/tools/testing/selftests/bpf/test_bpftool_build.sh +++ 
b/tools/testing/selftests/bpf/test_bpftool_build.sh @@ -90,10 +90,6 @@ echo -e "... through kbuild\n" if [ -f ".config" ] ; then make_and_clean tools/bpf - ## "make tools/bpf" sets $(OUTPUT) to ...tools/bpf/runqslower for - ## runqslower, but the default (used for the "clean" target) is .output. - ## Let's make sure we clean runqslower's directory properly. - make -C tools/bpf/runqslower OUTPUT=${KDIR_ROOT_DIR}/tools/bpf/runqslower/ clean ## $OUTPUT is overwritten in kbuild Makefile, and thus cannot be passed ## down from toplevel Makefile to bpftool's Makefile. -- 2.52.0

[PATCH bpf-next v11 0/8] bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps

by Leon Hwang

This patch set introduces the BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps, following the discussion of the need for a BPF_F_ALL_CPUS flag for percpu_array maps in the thread "[PATCH bpf-next v3 0/4] bpf: Introduce global percpu data"[1].

The goal of the BPF_F_ALL_CPUS flag is to reduce data caching overhead in light skeletons by allowing a single value to be reused to update values across all CPUs. This avoids the M:N problem where M cached values are used to update a map on an N-CPU kernel.

The BPF_F_CPU flag is accompanied by *flags*-embedded cpu info, which specifies the target CPU for the operation:

* For lookup operations: the flag field alongside cpu info enables querying a value on the specified CPU.
* For update operations: the flag field alongside cpu info enables updating the value for the specified CPU.

Links:
[1] https://lore.kernel.org/bpf/20250526162146.24429-1-leon.hwang@linux.dev/

Changes:
v10 -> v11:
* Support the combination of BPF_EXIST and BPF_F_CPU/BPF_F_ALL_CPUS for update operations.
* Fix unstable lru_percpu_hash map test using the combination of BPF_EXIST and BPF_F_CPU/BPF_F_ALL_CPUS to avoid LRU eviction (reported by Alexei).

v9 -> v10:
* Add tests to verify array and hash maps do not support BPF_F_CPU and BPF_F_ALL_CPUS flags.
* Address comment from Andrii:
  * Copy map value using copy_map_value_long for percpu_cgroup_storage maps in a separate patch.

v8 -> v9:
* Change value type from u64 to u32 in selftests.
* Address comments from Andrii:
  * Keep value_size unaligned and update everywhere for consistency when cpu flags are specified.
  * Update value by getting pointer for percpu hash and percpu cgroup_storage maps.

v7 -> v8:
* Address comments from Andrii:
  * Check BPF_F_LOCK when update percpu_array, percpu_hash and lru_percpu_hash maps.
  * Refactor flags check in __htab_map_lookup_and_delete_batch().
  * Keep value_size unaligned and copy value using copy_map_value() in __htab_map_lookup_and_delete_batch() when BPF_F_CPU is specified.
  * Update warn message in libbpf's validate_map_op().
  * Update comment of libbpf's bpf_map__lookup_elem().

v6 -> v7:
* Get correct value size for percpu_hash and lru_percpu_hash in update_batch API.
* Set 'count' as 'max_entries' in test cases for lookup_batch API.
* Address comment from Alexei:
  * Move cpu flags check into bpf_map_check_op_flags().

v5 -> v6:
* Move bpf_map_check_op_flags() from 'bpf.h' to 'syscall.c'.
* Address comments from Alexei:
  * Drop the refactoring code of data copying logic for percpu maps.
  * Drop bpf_map_check_op_flags() wrappers.

v4 -> v5:
* Address comments from Andrii:
  * Refactor data copying logic for all percpu maps.
  * Drop this_cpu_ptr() micro-optimization.
  * Drop cpu check in libbpf's validate_map_op().
  * Enhance bpf_map_check_op_flags() using *allowed flags* instead of 'extra_flags_mask'.

v3 -> v4:
* Address comments from Andrii:
  * Remove unnecessary map_type check in bpf_map_value_size().
  * Reduce code churn.
  * Remove unnecessary do_delete check in __htab_map_lookup_and_delete_batch().
  * Introduce bpf_percpu_copy_to_user() and bpf_percpu_copy_from_user().
  * Rename check_map_flags() to bpf_map_check_op_flags() with extra_flags_mask.
  * Add human-readable pr_warn() explanations in validate_map_op().
  * Use flags in bpf_map__delete_elem() and bpf_map__lookup_and_delete_elem().
  * Drop "for alignment reasons".

v3 link: https://lore.kernel.org/bpf/20250821160817.70285-1-leon.hwang@linux.dev/

v2 -> v3:
* Address comments from Alexei:
  * Use BPF_F_ALL_CPUS instead of BPF_ALL_CPUS magic.
  * Introduce these two cpu flags for all percpu maps.
* Address comments from Jiri:
  * Reduce some unnecessary u32 cast.
  * Refactor more generic map flags check function.
  * A code style issue.

v2 link: https://lore.kernel.org/bpf/20250805163017.17015-1-leon.hwang@linux.dev/

v1 -> v2:
* Address comments from Andrii:
  * Embed cpu info as high 32 bits of *flags* totally.
  * Use ERANGE instead of E2BIG.
  * Few format issues.

Leon Hwang (8):
  bpf: Introduce internal bpf_map_check_op_flags helper function
  bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags
  bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_array maps
  bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_hash and lru_percpu_hash maps
  bpf: Copy map value using copy_map_value_long for percpu_cgroup_storage maps
  bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_cgroup_storage maps
  libbpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu maps
  selftests/bpf: Add cases to test BPF_F_CPU and BPF_F_ALL_CPUS flags

 include/linux/bpf-cgroup.h                   |   4 +-
 include/linux/bpf.h                          |  44 ++-
 include/uapi/linux/bpf.h                     |   2 +
 kernel/bpf/arraymap.c                        |  32 +-
 kernel/bpf/hashtab.c                         |  96 +++--
 kernel/bpf/local_storage.c                   |  27 +-
 kernel/bpf/syscall.c                         |  68 ++--
 tools/include/uapi/linux/bpf.h               |   2 +
 tools/lib/bpf/bpf.h                          |   8 +
 tools/lib/bpf/libbpf.c                       |  26 +-
 tools/lib/bpf/libbpf.h                       |  21 +-
 .../selftests/bpf/prog_tests/percpu_alloc.c  | 335 ++++++++++++++++++
 .../selftests/bpf/progs/percpu_alloc_array.c |  32 ++
 13 files changed, 590 insertions(+), 107 deletions(-)

--
2.51.2
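
The flag layout described in the cover letter (cpu info embedded in the high 32 bits of *flags*) can be sketched in a few lines. This is only an illustration, not libbpf or kernel code: the helper names are hypothetical, and the numeric values used for BPF_F_CPU and BPF_F_ALL_CPUS are assumptions made for the example; the authoritative definitions are the ones the series adds to include/uapi/linux/bpf.h.

```python
# Assumed flag bit values, for illustration only; check
# include/uapi/linux/bpf.h for the real definitions.
BPF_F_CPU = 1 << 3
BPF_F_ALL_CPUS = 1 << 4

LOW_32_MASK = (1 << 32) - 1


def flags_for_cpu(cpu, base_flags=0):
    """Build a 64-bit flags value that targets a single CPU.

    Hypothetical helper: the CPU number goes into the high 32 bits and
    BPF_F_CPU is set in the low 32 bits, next to any other flags.
    """
    if not 0 <= cpu <= LOW_32_MASK:
        raise ValueError("cpu must fit in 32 bits")
    return (cpu << 32) | (base_flags & LOW_32_MASK) | BPF_F_CPU


def split_flags(flags):
    """Return (cpu, low_flags) from a 64-bit flags value."""
    return flags >> 32, flags & LOW_32_MASK


if __name__ == "__main__":
    flags = flags_for_cpu(3)
    cpu, low = split_flags(flags)
    print(hex(flags), cpu, bool(low & BPF_F_CPU))
```

With BPF_F_ALL_CPUS no CPU number is needed, since one value is applied to every CPU; that is what lets a caller reuse a single cached value instead of M values for N CPUs.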

[bpf-next] selftests/bpf: propagate LLVM toolchain into runqslower sub-make

by Hoyeon Lee

The runqslower build invokes a nested make, but the selected LLVM toolchain (via LLVM=-<version>) is not propagated. This causes the sub-make to call the system-default 'clang' and 'llvm-strip' even when a specific LLVM version is intended.

  # LLVM=-20 V=1 make -C tools/testing/selftests/bpf
  ...
  make -C tools/bpf/runqslower
  ...
  clang -g -O2 --target=bpfel -I... -c runqslower.bpf.c -o runqslower.bpf.o && \
  llvm-strip -g runqslower.bpf.o
  /bin/sh: 1: clang: not found
  (expected: clang-20 and llvm-strip-20)

Propagate CLANG and LLVM_STRIP to the sub-make to ensure LLVM version consistency across all builds.

Signed-off-by: Hoyeon Lee <hoyeon.lee(a)suse.com>
---
 tools/testing/selftests/bpf/Makefile | 1 +
 tools/testing/selftests/lib.mk       | 1 +
 2 files changed, 2 insertions(+)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 34ea23c63bd5..79ab69920dca 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -306,6 +306,7 @@ endif

 $(OUTPUT)/runqslower: $(BPFOBJ) | $(DEFAULT_BPFTOOL) $(RUNQSLOWER_OUTPUT)
 	$(Q)$(MAKE) $(submake_extras) -C $(TOOLSDIR)/bpf/runqslower \
+		    CLANG=$(CLANG) LLVM_STRIP=$(LLVM_STRIP) \
 		    OUTPUT=$(RUNQSLOWER_OUTPUT) VMLINUX_BTF=$(VMLINUX_BTF) \
 		    BPFTOOL_OUTPUT=$(HOST_BUILD_DIR)/bpftool/ \
 		    BPFOBJ_OUTPUT=$(BUILD_DIR)/libbpf/ \
diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index a448fae57831..f14255b2afbd 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -8,6 +8,7 @@ LLVM_SUFFIX := $(LLVM)
 endif

 CLANG := $(LLVM_PREFIX)clang$(LLVM_SUFFIX)
+LLVM_STRIP := $(LLVM_PREFIX)llvm-strip$(LLVM_SUFFIX)

 CLANG_TARGET_FLAGS_arm := arm-linux-gnueabi
 CLANG_TARGET_FLAGS_arm64 := aarch64-linux-gnu
--
2.51.1
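
To make the naming rule concrete, here is a rough sketch of how a tool name is derived from LLVM= in the lib.mk lines quoted in the diff. It assumes, as the kernel build documentation describes for the LLVM variable, that a value ending in '/' is treated as a directory prefix and a value starting with '-' as a version suffix; the function name is hypothetical and the snippet only mirrors the string concatenation, not the make logic itself.

```python
def llvm_tool(tool, llvm=""):
    """Return the command name used for an LLVM tool.

    Hypothetical helper mirroring $(LLVM_PREFIX)<tool>$(LLVM_SUFFIX):
    - LLVM=/opt/llvm/bin/ -> prefix (assumed rule: value ends with '/')
    - LLVM=-20            -> suffix (assumed rule: value starts with '-')
    - anything else       -> plain tool name
    """
    prefix = ""
    suffix = ""
    if llvm.endswith("/"):
        prefix = llvm
    elif llvm.startswith("-"):
        suffix = llvm
    return f"{prefix}{tool}{suffix}"


if __name__ == "__main__":
    for tool in ("clang", "llvm-strip"):
        print(llvm_tool(tool, "-20"))
```

Under these assumptions, LLVM=-20 yields clang-20 and llvm-strip-20, which is what the sub-make needs to receive through CLANG and LLVM_STRIP.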

[PATCH bpf-next] selftests/bpf: call bpf_get_numa_node_id() in trigger_count()

by Menglong Dong

The bench test "trig-kernel-count" can be used as a baseline comparison for fentry and other benchmarks, and the call to bpf_get_numa_node_id() should be counted as part of that baseline. So, let's call it in trigger_count().

Meanwhile, rename trigger_count() to trigger_kernel_count() to make it easier to understand.

Signed-off-by: Menglong Dong <dongml2(a)chinatelecom.cn>
---
 tools/testing/selftests/bpf/benchs/bench_trigger.c | 4 ++--
 tools/testing/selftests/bpf/progs/trigger_bench.c  | 6 ++++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
index 1e2aff007c2a..34018fc3927f 100644
--- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
+++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
@@ -180,10 +180,10 @@ static void trigger_kernel_count_setup(void)
 {
 	setup_ctx();
 	bpf_program__set_autoload(ctx.skel->progs.trigger_driver, false);
-	bpf_program__set_autoload(ctx.skel->progs.trigger_count, true);
+	bpf_program__set_autoload(ctx.skel->progs.trigger_kernel_count, true);
 	load_ctx();
 	/* override driver program */
-	ctx.driver_prog_fd = bpf_program__fd(ctx.skel->progs.trigger_count);
+	ctx.driver_prog_fd = bpf_program__fd(ctx.skel->progs.trigger_kernel_count);
 }

 static void trigger_kprobe_setup(void)
diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c
index 3d5f30c29ae3..2898b3749d07 100644
--- a/tools/testing/selftests/bpf/progs/trigger_bench.c
+++ b/tools/testing/selftests/bpf/progs/trigger_bench.c
@@ -42,12 +42,14 @@ int bench_trigger_uprobe_multi(void *ctx)
 const volatile int batch_iters = 0;

 SEC("?raw_tp")
-int trigger_count(void *ctx)
+int trigger_kernel_count(void *ctx)
 {
 	int i;

-	for (i = 0; i < batch_iters; i++)
+	for (i = 0; i < batch_iters; i++) {
 		inc_counter();
+		bpf_get_numa_node_id();
+	}

 	return 0;
 }
--
2.51.2

[PATCH v3 00/18] vfio: selftests: Support for multi-device tests

by David Matlack

This series adds support for tests that use multiple devices, and adds one new test, vfio_pci_device_init_perf_test, which measures parallel device initialization time to demonstrate the improvement from commit e908f58b6beb ("vfio/pci: Separate SR-IOV VF dev_set").

This series also breaks apart the monolithic vfio_util.h and vfio_pci_device.c into separate files, to account for all the new code. This required quite a bit of code motion so the diffstat looks large. The final layout is more granular and provides a better separation of the IOMMU code from the device code.

Final layout:

  C files:
    - tools/testing/selftests/vfio/lib/libvfio.c
    - tools/testing/selftests/vfio/lib/iommu.c
    - tools/testing/selftests/vfio/lib/iova_allocator.c
    - tools/testing/selftests/vfio/lib/vfio_pci_device.c
    - tools/testing/selftests/vfio/lib/vfio_pci_driver.c

  H files:
    - tools/testing/selftests/vfio/lib/include/libvfio.h
    - tools/testing/selftests/vfio/lib/include/libvfio/assert.h
    - tools/testing/selftests/vfio/lib/include/libvfio/iommu.h
    - tools/testing/selftests/vfio/lib/include/libvfio/iova_allocator.h
    - tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
    - tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_driver.h

Notably, vfio_util.h is now gone and replaced with libvfio.h.

This series is based on vfio/next plus Alex Mastro's series to add the IOVA allocator [1]. It should apply cleanly to vfio/next once Alex's series is merged by Linus into the next 6.18 rc and then merged into vfio/next.

This series can be found on GitHub:

  https://github.com/dmatlack/linux/tree/vfio/selftests/init_perf_test/v3

[1] https://lore.kernel.org/kvm/20251111-iova-ranges-v3-0-7960244642c5@fb.com/

Cc: Alex Mastro <amastro(a)fb.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Josh Hilke <jrhilke(a)google.com>
Cc: Raghavendra Rao Ananta <rananta(a)google.com>
Cc: Vipin Sharma <vipinsh(a)google.com>

v3:
 - Replace literal with NSEC_PER_SEC (Alex Mastro)
 - Fix Makefile accumulate vs. assignment (Alex Mastro)

v2: https://lore.kernel.org/kvm/20251112192232.442761-1-dmatlack@google.com/

v1: https://lore.kernel.org/kvm/20251008232531.1152035-1-dmatlack@google.com/

David Matlack (18):
  vfio: selftests: Move run.sh into scripts directory
  vfio: selftests: Split run.sh into separate scripts
  vfio: selftests: Allow passing multiple BDFs on the command line
  vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
  vfio: selftests: Introduce struct iommu
  vfio: selftests: Support multiple devices in the same container/iommufd
  vfio: selftests: Eliminate overly chatty logging
  vfio: selftests: Prefix logs with device BDF where relevant
  vfio: selftests: Upgrade driver logging to dev_err()
  vfio: selftests: Rename struct vfio_dma_region to dma_region
  vfio: selftests: Move IOMMU library code into iommu.c
  vfio: selftests: Move IOVA allocator into iova_allocator.c
  vfio: selftests: Stop passing device for IOMMU operations
  vfio: selftests: Rename vfio_util.h to libvfio.h
  vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
  vfio: selftests: Split libvfio.h into separate header files
  vfio: selftests: Eliminate INVALID_IOVA
  vfio: selftests: Add vfio_pci_device_init_perf_test

 tools/testing/selftests/vfio/Makefile         |  10 +-
 .../selftests/vfio/lib/drivers/dsa/dsa.c      |  36 +-
 .../selftests/vfio/lib/drivers/ioat/ioat.c    |  18 +-
 .../selftests/vfio/lib/include/libvfio.h      |  26 +
 .../vfio/lib/include/libvfio/assert.h         |  54 ++
 .../vfio/lib/include/libvfio/iommu.h          |  76 +++
 .../vfio/lib/include/libvfio/iova_allocator.h |  23 +
 .../lib/include/libvfio/vfio_pci_device.h     | 125 ++++
 .../lib/include/libvfio/vfio_pci_driver.h     |  97 +++
 .../selftests/vfio/lib/include/vfio_util.h    | 331 -----------
 tools/testing/selftests/vfio/lib/iommu.c      | 465 +++++++++++++++
 .../selftests/vfio/lib/iova_allocator.c       |  94 +++
 tools/testing/selftests/vfio/lib/libvfio.c    |  78 +++
 tools/testing/selftests/vfio/lib/libvfio.mk   |   5 +-
 .../selftests/vfio/lib/vfio_pci_device.c      | 555 +-----------------
 .../selftests/vfio/lib/vfio_pci_driver.c      |  16 +-
 tools/testing/selftests/vfio/run.sh           | 109 ----
 .../testing/selftests/vfio/scripts/cleanup.sh |  41 ++
 tools/testing/selftests/vfio/scripts/lib.sh   |  42 ++
 tools/testing/selftests/vfio/scripts/run.sh   |  16 +
 tools/testing/selftests/vfio/scripts/setup.sh |  48 ++
 .../selftests/vfio/vfio_dma_mapping_test.c    |  46 +-
 .../selftests/vfio/vfio_iommufd_setup_test.c  |   2 +-
 .../vfio/vfio_pci_device_init_perf_test.c     | 168 ++++++
 .../selftests/vfio/vfio_pci_device_test.c     |  12 +-
 .../selftests/vfio/vfio_pci_driver_test.c     |  51 +-
 26 files changed, 1481 insertions(+), 1063 deletions(-)
 create mode 100644 tools/testing/selftests/vfio/lib/include/libvfio.h
 create mode 100644 tools/testing/selftests/vfio/lib/include/libvfio/assert.h
 create mode 100644 tools/testing/selftests/vfio/lib/include/libvfio/iommu.h
 create mode 100644 tools/testing/selftests/vfio/lib/include/libvfio/iova_allocator.h
 create mode 100644 tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
 create mode 100644 tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_driver.h
 delete mode 100644 tools/testing/selftests/vfio/lib/include/vfio_util.h
 create mode 100644 tools/testing/selftests/vfio/lib/iommu.c
 create mode 100644 tools/testing/selftests/vfio/lib/iova_allocator.c
 create mode 100644 tools/testing/selftests/vfio/lib/libvfio.c
 delete mode 100755 tools/testing/selftests/vfio/run.sh
 create mode 100755 tools/testing/selftests/vfio/scripts/cleanup.sh
 create mode 100755 tools/testing/selftests/vfio/scripts/lib.sh
 create mode 100755 tools/testing/selftests/vfio/scripts/run.sh
 create mode 100755 tools/testing/selftests/vfio/scripts/setup.sh
 create mode 100644 tools/testing/selftests/vfio/vfio_pci_device_init_perf_test.c

base-commit: fa804aa4ac1b091ef2ec2981f08a1c28aaeba8e7
prerequisite-patch-id: dcf23dcc1198960bda3102eefaa21df60b2e4c54
prerequisite-patch-id: e32e56d5bf7b6c7dd40d737aa3521560407e00f5
prerequisite-patch-id: 4f79a41bf10a4c025ba5f433551b46035aa15878
prerequisite-patch-id: f903a45f0c32319138cd93a007646ab89132b18c
--
2.52.0.rc2.455.g230fcf2819-goog

Re: [PATCH v3] selftests/seccomp: Fix indentation and rebase error logging patch

by Sameeksha Sankpal

Hi Shuah,

Thanks for pointing that out. Apologies for missing the mailing lists earlier. Resending this follow-up with the correct CC list and in plain text format.

Please let me know if there’s anything else I should improve in this patch. I’m happy to resend it as v4 if needed.

Thanks,
Sameeksha

On Mon, 24 Nov 2025 at 23:59, Shuah Khan <skhan(a)linuxfoundation.org> wrote:
>
> On 11/21/25 23:21, Sameeksha Sankpal wrote:
> > Hi,
> > Just following up on this patch.
> > It’s been a few months, so I wanted to check if there is anything else I
> > should address or improve to move it forward.
>
> I see that you didn't cc any mailing list on this email? Please keep
> everybody in the loop when you send responses.
>
> >
> > Thanks,
> > Sameeksha Sankpal
> >
> > On Fri, 30 May 2025 at 04:25, Sameeksha Sankpal <sameekshasankpal(a)gmail.com>
> > wrote:
> >
> >> Rebase the error logging enhancement for get_proc_stat() against the
> >> upstream seccomp tree with proper indentation formatting.
> >>
> >> Suggested-by: Kees Cook <kees(a)kernel.org>
> >> Signed-off-by: Sameeksha Sankpal <sameekshasankpal(a)gmail.com>
> >> ---
> >> v1 -> v2:
> >> - Used TH_LOG instead of printf for error logging
> >> - Moved variable declaration to the top of the function
> >> - Applied review suggestion by Kees Cook
> >>
> >> v2 -> v3:
> >> - Rebased against upstream seccomp tree (was previously against v1)
> >> - Fixed indentation to use tabs instead of spaces
> >> - Used scripts/checkpatch.pl to check the patch for common errors
> >> - Removed the blank line before S-o-b added in v2
> >>
> >> tools/testing/selftests/seccomp/seccomp_bpf.c | 5 +++++
> >> 1 file changed, 5 insertions(+)
> >>
> >> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c
> >> b/tools/testing/selftests/seccomp/seccomp_bpf.c
> >> index 61acbd45ffaa..dbd7e705a2af 100644
> >> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> >> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> >> @@ -4508,9 +4508,14 @@ static char get_proc_stat(struct __test_metadata
> >> *_metadata, pid_t pid)
> >>         char proc_path[100] = {0};
> >>         char status;
> >>         char *line;
> >> +       int rc;
> >>
> >>         snprintf(proc_path, sizeof(proc_path), "/proc/%d/stat", pid);
> >>         ASSERT_EQ(get_nth(_metadata, proc_path, 3, &line), 1);
> >> +       rc = get_nth(_metadata, proc_path, 3, &line);
> >> +       ASSERT_EQ(rc, 1) {
> >> +               TH_LOG("user_notification_fifo: failed to read stat for
> >> PID %d (rc=%d)", pid, rc);
> >> +       }
> >>
> >>         status = *line;
> >>         free(line);
> >> --
> >> 2.43.0
> >>
> >>
> >
> thanks,
> -- Shuah

[PATCH] selftests/net: initialize char variable to null

by Ankit Khushwaha

The char pointer variables in 'so_txtime.c' and 'txtimestamp.c' are left uninitialized when the switch default case is taken, which raises the following warnings:

txtimestamp.c:240:2: warning: variable 'tsname' is used uninitialized whenever switch default is taken [-Wsometimes-uninitialized]

so_txtime.c:210:3: warning: variable 'reason' is used uninitialized whenever switch default is taken [-Wsometimes-uninitialized]

Initialize these variables to NULL to fix this.

Signed-off-by: Ankit Khushwaha <ankitkhushwaha.linux(a)gmail.com>
---
 tools/testing/selftests/net/so_txtime.c   | 2 +-
 tools/testing/selftests/net/txtimestamp.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/so_txtime.c b/tools/testing/selftests/net/so_txtime.c
index 8457b7ccbc09..b76df1efc2ef 100644
--- a/tools/testing/selftests/net/so_txtime.c
+++ b/tools/testing/selftests/net/so_txtime.c
@@ -174,7 +174,7 @@ static int do_recv_errqueue_timeout(int fdt)
 	msg.msg_controllen = sizeof(control);

 	while (1) {
-		const char *reason;
+		const char *reason = NULL;

 		ret = recvmsg(fdt, &msg, MSG_ERRQUEUE);
 		if (ret == -1 && errno == EAGAIN)
diff --git a/tools/testing/selftests/net/txtimestamp.c b/tools/testing/selftests/net/txtimestamp.c
index dae91eb97d69..bcc14688661d 100644
--- a/tools/testing/selftests/net/txtimestamp.c
+++ b/tools/testing/selftests/net/txtimestamp.c
@@ -217,7 +217,7 @@ static void print_timestamp_usr(void)
 static void print_timestamp(struct scm_timestamping *tss, int tstype,
 			    int tskey, int payload_len)
 {
-	const char *tsname;
+	const char *tsname = NULL;

 	validate_key(tskey, tstype);

--
2.52.0

[PATCH] selftests: tpm2: Fix ill defined assertions

by Maurice Hieronymus

Remove parentheses around assert statements in Python. With parentheses, assert always evaluates to True, making the checks ineffective.

Signed-off-by: Maurice Hieronymus <mhi(a)mailbox.org>
---
 tools/testing/selftests/tpm2/tpm2.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/tpm2/tpm2.py b/tools/testing/selftests/tpm2/tpm2.py
index bba8cb54548e..3d130c30bc7c 100644
--- a/tools/testing/selftests/tpm2/tpm2.py
+++ b/tools/testing/selftests/tpm2/tpm2.py
@@ -437,7 +437,7 @@ class Client:

     def extend_pcr(self, i, dig, bank_alg = TPM2_ALG_SHA1):
         ds = get_digest_size(bank_alg)
-        assert(ds == len(dig))
+        assert ds == len(dig)

         auth_cmd = AuthCommand()

@@ -589,7 +589,7 @@ class Client:
     def seal(self, parent_key, data, auth_value, policy_dig,
              name_alg = TPM2_ALG_SHA1):
         ds = get_digest_size(name_alg)
-        assert(not policy_dig or ds == len(policy_dig))
+        assert not policy_dig or ds == len(policy_dig)

         attributes = 0
         if not policy_dig:

base-commit: 821e6e2a328bb907d40f8d1141d8b6c079aa7340
--
2.50.1
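
As a side note on the Python semantics involved, here is a minimal sketch (the function and variable names are hypothetical, not taken from tpm2.py). In Python, parentheses around a single expression do not create a tuple, so assert(x) is evaluated like assert x. The form that can never fail is the one where the operand is a non-empty tuple, typically written as assert (x, "message"). Dropping the parentheses is therefore probably best seen as a guard against drifting into that tuple form.

```python
def check_digest(ds, dig):
    # Parenthesized single expression: evaluated like `assert ds == len(dig)`.
    assert(ds == len(dig))
    return True


def raises_assertion(fn, *args):
    """Return True if calling fn(*args) raises AssertionError."""
    try:
        fn(*args)
    except AssertionError:
        return True
    return False


# A failing condition still raises, with or without the parentheses
# (assuming assertions are enabled, i.e. Python is not run with -O).
mismatch_detected = raises_assertion(check_digest, 20, b"abc")

# The always-true pitfall needs a tuple: any non-empty tuple is truthy,
# so asserting on (condition, "message") can never fail.
condition_and_message = (False, "digest size mismatch")
tuple_is_truthy = bool(condition_and_message)

if __name__ == "__main__":
    print(mismatch_detected, tuple_is_truthy)
```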

[PATCH net-next 12/12] selftests: net: selftest for ipvlan-macnat mode

by Dmitry Skorodumov

Implemented a self-test for ipvlan in l2macnat mode. The test verifies: 1) It's not possible to configure an ip in l2macnat mode on ipvtap 2) It creates several net namespaces - Default namespace emulates host, - ipvlan-tst-phy emulates some host in remote network - ipvlan-tst-0/1 emulate VMs on host. Test verifies, that MAC addresses are as expected in ARP/NEIGH tables: all MACs in 'tst-phy' points to "host" mac-address all MACs in Default and tst are real ones 3) The l2macnat mode has limited number of addresses remembered on port. Test verifies, that this limit really works. Signed-off-by: Dmitry Skorodumov <skorodumov.dmitry(a)huawei.com> --- tools/testing/selftests/net/Makefile | 2 + .../selftests/net/ipvtap_macnat_bridge.py | 168 +++++++++ .../selftests/net/ipvtap_macnat_test.sh | 333 ++++++++++++++++++ 3 files changed, 503 insertions(+) create mode 100755 tools/testing/selftests/net/ipvtap_macnat_bridge.py create mode 100755 tools/testing/selftests/net/ipvtap_macnat_test.sh diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index b5127e968108..050d864f0bd9 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -49,6 +49,7 @@ TEST_PROGS := \ ipv6_flowlabel.sh \ ipv6_force_forwarding.sh \ ipv6_route_update_soft_lockup.sh \ + ipvtap_macnat_test.sh \ l2_tos_ttl_inherit.sh \ l2tp.sh \ link_netns.py \ @@ -191,6 +192,7 @@ TEST_GEN_PROGS := \ TEST_FILES := \ fcnal-test.sh \ in_netns.sh \ + ipvtap_macnat_bridge.py \ lib.sh \ settings \ setup_loopback.sh \ diff --git a/tools/testing/selftests/net/ipvtap_macnat_bridge.py b/tools/testing/selftests/net/ipvtap_macnat_bridge.py new file mode 100755 index 000000000000..7dc4a626e5bb --- /dev/null +++ b/tools/testing/selftests/net/ipvtap_macnat_bridge.py @@ -0,0 +1,168 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: GPL-2.0 + +""" +Script to bridge ipvtap and tap, +needed to simulate behaviour of virtual machine using ipvtap. 
+ +ipvtap in macnat mode cannot have IP address. +Due to limitations of ipvtap, it also cannot be plugged +into bridge. +Use this script to connect ipvtap and tap and assing IP to tap. +""" + +import socket +import os +import select +import sys +import signal +import fcntl +import struct +import subprocess + +# Linux TUN/TAP constants +TUNSETIFF = 0x400454ca +IFF_TUN = 0x0001 +IFF_TAP = 0x0002 +IFF_NO_PI = 0x1000 + +ns_name = "non-initialized" + +class TapBridge: + """Simple class to bridge ipvtap and tap interfaces""" + def __init__(self, tap, ipvtap, buffer_size=65536): + self.tap_name = tap + self.ipvtap_name = ipvtap + self.buffer_size = buffer_size + self.running = False + + def open_ipvtap_sock(self, tap_name): + """Open a IPVTAP interface using raw socket""" + try: + sock = socket.socket(socket.AF_PACKET, + socket.SOCK_RAW, + socket.ntohs(0x0003)) + sock.bind((tap_name, 0)) + sock.setblocking(False) + print(f"Connected to IPVTAP interface: {tap_name}") + return sock + + except (OSError, IOError) as e: + print(f"Error opening IPVTAP interface {tap_name}: {e}") + return None + + def create_tap_interface(self, tap_name): + """Create and configure a TAP interface using /dev/net/tun""" + try: + # Open the tun device + tun_fd = os.open('/dev/net/tun', os.O_RDWR) + if tun_fd < 0: + raise OSError("Failed to open /dev/net/tun (err: {os.errno})") + + # Prepare the ifr structure + tap_name_bytes = tap_name.encode('utf-8') + ifr = struct.pack('16sH', tap_name_bytes, IFF_TAP | IFF_NO_PI) + + # Set the interface name and flags + result = fcntl.ioctl(tun_fd, TUNSETIFF, ifr) + + # Get the actual interface name that was set + unpacked = struct.unpack('16sH', result) + actual_name = unpacked[0].split(b'\x00')[0].decode() + print(f"Created TAP interface: {actual_name}") + + return tun_fd + + except (OSError, IOError) as e: + print(f"Error creating TAP interface {tap_name}: {e}") + return None + + def forward_data(self, from_fd, to_fd, description): + """Forward data from one 
file descriptor to another""" + try: + data = os.read(from_fd, self.buffer_size) + if data: + os.write(to_fd, data) + return True + return False + + except BlockingIOError: + return True + except (OSError, IOError) as e: + print(f"Error forwarding data {description}: {e}") + return False + + def run(self): + """Main bridge loop""" + # Create TAP interfaces + tap1_fd = self.create_tap_interface(self.tap_name) + + sock = self.open_ipvtap_sock(self.ipvtap_name) + tap2_fd = sock.fileno() + + if tap1_fd is None or tap2_fd is None: + print("Failed to create TAP interfaces") + return + + print("Press Ctrl+C to stop\n") + + self.running = True + stats = {'tap1_to_tap2': 0, 'tap2_to_tap1': 0} + while self.running: + try: + # Use select to monitor both file descriptors + readable, _, _ = select.select([tap1_fd, tap2_fd], [], [], 1.0) + + for fd in readable: + if fd == tap1_fd: + descr = f"from {self.tap_name} to {self.ipvtap_name}" + if self.forward_data(tap1_fd, tap2_fd, descr): + stats['tap1_to_tap2'] += 1 + else: + self.running = False + elif fd == tap2_fd: + descr = f"from {self.ipvtap_name} to {self.tap_name}" + if self.forward_data(tap2_fd, tap1_fd, descr): + stats['tap2_to_tap1'] += 1 + else: + self.running = False + + except KeyboardInterrupt: + print("\nShutting down...") + self.running = False + except (OSError, IOError) as e: + print(f"Error in main loop: {e}") + self.running = False + + # Cleanup + os.close(tap1_fd) + os.close(tap2_fd) + print(f"Bridge stopped in {ns_name}. 
Stats: {stats}") + + +def signal_handler(_sig, _frame): + """SIGINT handler for macnat bridge""" + print(f'\nReceived interrupt signal, shutting down bridge in {ns_name}') + sys.exit(0) + + +if __name__ == "__main__": + ns_name = subprocess.getoutput("ip netns identify") or "default" + + signal.signal(signal.SIGINT, signal_handler) + + # Check if running as root + if os.geteuid() != 0: + print("ERROR: This script must be run as root!") + sys.exit(1) + + if len(sys.argv) != 3: + print("Usage: tap_bridge.py tap_name ipvtap_name") + sys.exit(1) + + TAP = sys.argv[1] + IPVTAP = sys.argv[2] + + print(f"Starting TAP bridge between {TAP} and {IPVTAP} in {ns_name}") + bridge = TapBridge(TAP, IPVTAP) + bridge.run() diff --git a/tools/testing/selftests/net/ipvtap_macnat_test.sh b/tools/testing/selftests/net/ipvtap_macnat_test.sh new file mode 100755 index 000000000000..927d75af776b --- /dev/null +++ b/tools/testing/selftests/net/ipvtap_macnat_test.sh @@ -0,0 +1,333 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Tests for ipvtap in macnat mode + +NS_TST0=ipvlan-tst-0 +NS_TST1=ipvlan-tst-1 +NS_PHY=ipvlan-tst-phy + +IP_HOST=172.25.0.1 +IP_PHY=172.25.0.2 +IP_TST0=172.25.0.10 +IP_TST1=172.25.0.30 + +IP_OK0=("172.25.0.10" "172.25.0.11" "172.25.0.12" "172.25.0.13") +IP6_OK0=("fc00::10" "fc00::11" "fc00::12" "fc00::13" ) + +IP_OVFL0="172.25.0.14" +IP6_OVFL0="fc00::14" + +IP6_HOST=fc00::1 +IP6_PHY=fc00::2 +IP6_TST0=fc00::10 +IP6_TST1=fc00::30 + +MAC_HOST="92:3a:00:00:00:01" +MAC_PHY="92:3a:00:00:00:02" +MAC_TST0="92:3a:00:00:00:10" +MAC_TST1="92:3a:00:00:00:30" + +VETH_HOST=vethtst +VETH_PHY=vethtst.p + +# +# The testing environment looks this way: +# +# |------HOST------| |------PHY-------| +# | veth<----------------->veth | +# |------|--|------| |----------------| +# | | +# | | |-----TST0-------| +# | |------------|----ipvtap | +# | |----------------| +# | +# | |-----TST1-------| +# |---------------|----ivtap | +# |----------------| +# +# The macnat mode is for virtual 
machines, so ipvtap-interface is supposed +# to be used only for traffic monitoring and doesn't have ip-address. +# +# To simulate a virtual machine on ipvtap, we create TAP-interfaces +# in TST environments and assing IP-addresses to them. +# TAP and IPVTAP are connected with simple python script. +# + +ns_run() { + ns=$1 + shift + if [[ "$ns" == "default" ]]; then + "$@" >/dev/null + else + ip netns exec "$ns" "$@" >/dev/null + fi +} + +configure_ns() { + local ns=$1 + local n=$2 + local ip=$3 + local ip6=$4 + local mac=$5 + + ns_run "$ns" ip link set lo up + + if ! ip link add netns "$ns" name "ipvtap0.$n" link $VETH_HOST \ + type ipvtap mode l2macnat bridge; then + exit_error "FAIL: Failed to configure ipvtap link." + fi + ns_run "$ns" ip link set "ipvtap0.$n" up + + ns_run "$ns" ip tuntap add mode tap "tap0.$n" + ns_run "$ns" ip link set dev "tap0.$n" address "$mac" + # disable dad + ns_run "$ns" sysctl -w "net/ipv6/conf/tap0.$n/accept_dad"=0 + ns_run "$ns" ip link set "tap0.$n" up + ns_run "$ns" ip a a "$ip/24" dev "tap0.$n" + ns_run "$ns" ip a a "$ip6/64" dev "tap0.$n" +} + +start_macnat_bridge() { + local ns=$1 + local n=$2 + ip netns exec "$ns" python3 ipvtap_macnat_bridge.py \ + "tap0.$n" "ipvtap0.$n" & +} + +configure_veth() { + local ns=$1 + local veth=$2 + local ip=$3 + local ip6=$4 + local mac=$5 + + ns_run "$ns" ip link set lo up + ns_run "$ns" ethtool -K "$veth" tx off rx off + ns_run "$ns" ip link set dev "$veth" address "$mac" + ns_run "$ns" ip link set "$veth" up + ns_run "$ns" ip a a "$ip/24" dev "$veth" + ns_run "$ns" ip a a "$ip6/64" dev "$veth" +} + +setup_env() { + ip netns add $NS_TST0 + ip netns add $NS_TST1 + ip netns add $NS_PHY + + # setup simulated other-host (phy) and host itself + ip link add $VETH_HOST type veth peer name $VETH_PHY \ + netns $NS_PHY >/dev/null + + # host config + configure_veth default $VETH_HOST $IP_HOST $IP6_HOST $MAC_HOST + configure_veth $NS_PHY $VETH_PHY $IP_PHY $IP6_PHY $MAC_PHY + + # TST namespaces config + 
configure_ns $NS_TST0 0 $IP_TST0 $IP6_TST0 $MAC_TST0 + configure_ns $NS_TST1 1 $IP_TST1 $IP6_TST1 $MAC_TST1 +} + +ping_all() { + # This will learn MAC/IP addresses on ipvtap + local ns=$1 + + ns_run "$ns" ping -c 1 $IP_TST0 + ns_run "$ns" ping -c 1 $IP6_TST0 + + ns_run "$ns" ping -c 1 $IP_TST1 + ns_run "$ns" ping -c 1 $IP6_TST1 + + ns_run "$ns" ping -c 1 $IP_HOST + ns_run "$ns" ping -c 1 $IP6_HOST + + ns_run "$ns" ping -c 1 $IP_PHY + ns_run "$ns" ping -c 1 $IP6_PHY +} + +check_mac_eq() { + # Ensure IP corresponds to MAC. + local ns=$1 + local ip=$2 + local mac=$3 + local dev=$4 + + if [[ "$ns" == "default" ]]; then + out=$( + ip neigh show "$ip" dev "$dev" \ + | grep "$ip" \ + | grep "$mac" + ) + else + out=$( + ip netns exec "$ns" \ + ip neigh show "$ip" dev "$dev" \ + | grep "$ip" \ + | grep "$mac" + ) + fi + + if [[ $out'X' == "X" ]]; then + exit_error "FAIL: '$ip' is not '$mac'" + fi +} + +cleanup_env() { + ip link del $VETH_HOST + ip netns del $NS_TST0 + ip netns del $NS_TST1 + ip netns del $NS_PHY +} + +exit_error() { + echo "$1" + exit 1 +} + +test_check_mac() { + # All IPs in NS_PHY should have MAC of the host + check_mac_eq $NS_PHY $IP_TST0 $MAC_HOST $VETH_PHY + check_mac_eq $NS_PHY $IP6_TST0 $MAC_HOST $VETH_PHY + check_mac_eq $NS_PHY $IP_TST1 $MAC_HOST $VETH_PHY + check_mac_eq $NS_PHY $IP6_TST1 $MAC_HOST $VETH_PHY + check_mac_eq $NS_PHY $IP_HOST $MAC_HOST $VETH_PHY + check_mac_eq $NS_PHY $IP6_HOST $MAC_HOST $VETH_PHY + + # All IPs in TST0 should have corresponding MAC + check_mac_eq $NS_TST0 $IP_HOST $MAC_HOST tap0.0 + check_mac_eq $NS_TST0 $IP6_HOST $MAC_HOST tap0.0 + check_mac_eq $NS_TST0 $IP_TST1 $MAC_TST1 tap0.0 + check_mac_eq $NS_TST0 $IP6_TST1 $MAC_TST1 tap0.0 + check_mac_eq $NS_TST0 $IP_PHY $MAC_PHY tap0.0 + check_mac_eq $NS_TST0 $IP6_PHY $MAC_PHY tap0.0 + + # All IPs in host should have corresponding MAC + check_mac_eq default $IP_TST0 $MAC_TST0 $VETH_HOST + check_mac_eq default $IP6_TST0 $MAC_TST0 $VETH_HOST + check_mac_eq default $IP_TST1 
$MAC_TST1 $VETH_HOST + check_mac_eq default $IP6_TST1 $MAC_TST1 $VETH_HOST + check_mac_eq default $IP_PHY $MAC_PHY $VETH_HOST + check_mac_eq default $IP6_PHY $MAC_PHY $VETH_HOST +} + +test_ip_add() { + # adding IPs to ipvtap should be forbidden and should fail + if ns_run $NS_TST0 ip a a 172.26.0.1/24 dev ipvtap0.0; then + exit_error "FAIL: Module allowed to add ip to ipvtap." + fi + + if ns_run $NS_TST0 ip a a fc01::1/64 dev ipvtap0.0; then + exit_error "FAIL: Module allowed to add ip6 to ipvtap." + fi +} + +test_ip_overflow() { + # The ipvtap remembers limited number of addresses on interface. + # Let's overflow it and check that oldest one doesn't work. + + ns_run $NS_TST0 ip addr flush dev tap0.0 + + # Add exactly 4 ip addresses + for ip in "${IP_OK0[@]}"; do + ns_run $NS_TST0 ip a a "$ip/24" dev tap0.0 + ns_run $NS_TST0 ping -c 1 $IP_HOST -I "$ip" + done + + # Initial check that ping works + if ! ping -c 2 $IP_TST0; then + exit_error "FAIL: Failed to ping tst0" + fi + + # Add 1 more ip addresses + ns_run "$NS_TST0" ip a a $IP_OVFL0/24 dev tap0.0 + ns_run $NS_TST0 ping -c 1 $IP_HOST -I $IP_OVFL0 + # check that ping to oldest one from host fails. + echo "the next ping should fail:" + if ping -c 2 $IP_TST0; then + exit_error "FAIL: IP-0 still exists on interface" + fi + + # ping host using address-0 and force relearn of IP0. + # Host should be able ping after that + ns_run $NS_TST0 ping -c 1 $IP_HOST -I $IP_TST0 + + if ! ping -c 2 $IP_TST0; then + exit_error "FAIL: Failed to ping tst0 at stage 3" + fi +} + +test_ip6_overflow() { + # The ipvtap stores limited number of addresses on interface. + # Let's overflow it and check that oldest one doesn't work. + + ns_run $NS_TST0 ip addr flush dev tap0.0 + + # Add exactly 4 ip addresses + for ip6 in "${IP6_OK0[@]}"; do + ns_run $NS_TST0 ip a a "$ip6/64" dev tap0.0 + ns_run $NS_TST0 ping -c 1 $IP6_HOST -I "$ip6" + done + + # Initial check that ping6 works + if ! 
ping -c 2 $IP6_TST0; then + exit_error "FAIL: Failed to ping6 tst0" + fi + + # Add 1 more ip6 addresses + ns_run $NS_TST0 ip a a $IP6_OVFL0/64 dev tap0.0 + ns_run $NS_TST0 ping -c 1 $IP6_HOST -I $IP6_OVFL0 + # check that ping to oldest one from host fails. + echo "the next ping should fail:" + if ping -c 2 $IP6_TST0; then + exit_error "FAIL: IP6-0 still exists on interface" + fi + + # ping host using address-0 and force relearn of IP0. + # Host should be able ping after that + ns_run $NS_TST0 ping -c 1 $IP6_HOST -I $IP6_TST0 + if ! ping -c 2 $IP6_TST0; then + exit_error "FAIL: Failed to ping6 tst0 at stage 3" + fi +} + +exec_test() { + echo "TEST: $2" + $1 + echo "PASSED: $2" +} + +trap cleanup_env EXIT + +echo "ipvlan macnat tests" +echo "===================" + +modprobe -q tap +modprobe -q ipvlan +modprobe -q ipvtap + +setup_env + +exec_test test_ip_add "ip add not allowed" + +start_macnat_bridge $NS_TST0 0 +mb_pid1=$! +start_macnat_bridge $NS_TST1 1 +mb_pid2=$! + +echo "<<< Preparation: pinging all...." +ping_all default +ping_all $NS_TST0 +ping_all $NS_TST1 +ping_all $NS_PHY +echo "Finished preparational pinging all. >>>" + +exec_test test_check_mac "mac correctness" +exec_test test_ip_overflow "ip learn capacity overflow" +exec_test test_ip6_overflow "ip6 learn capacity overflow" + +kill -INT $mb_pid1 +kill -INT $mb_pid2 +wait $mb_pid1 +wait $mb_pid2 + +echo "All tests passed" -- 2.25.1
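The overflow tests above rely on the interface remembering only a limited number of addresses, with the oldest one dropping out when a fifth is learned. A minimal sketch of that behaviour follows. It is a toy model only: the class name `LearnedAddrs`, the capacity of 4 (taken from the test comments) and the oldest-first eviction rule are assumptions for illustration, not the ipvlan driver's implementation.

```python
from collections import OrderedDict


class LearnedAddrs:
    """Toy model of a table that learns at most `capacity` addresses (hypothetical)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._addrs = OrderedDict()  # insertion order == learning order

    def learn(self, ip):
        """Record a source address seen in traffic from the guest."""
        if ip in self._addrs:
            return
        if len(self._addrs) >= self.capacity:
            self._addrs.popitem(last=False)  # assumed policy: drop the oldest entry
        self._addrs[ip] = True

    def reachable(self, ip):
        """An address is reachable from the host only while it is in the table."""
        return ip in self._addrs


table = LearnedAddrs(capacity=4)
for ip in ["172.25.0.10", "172.25.0.11", "172.25.0.12", "172.25.0.13"]:
    table.learn(ip)

before = table.reachable("172.25.0.10")     # learned, so the first ping works
table.learn("172.25.0.14")                  # the overflow address pushes out the oldest
after = table.reachable("172.25.0.10")      # "the next ping should fail"
table.learn("172.25.0.10")                  # guest pings the host from address 0 again
relearned = table.reachable("172.25.0.10")  # stage 3: host can ping again
print(before, after, relearned)
```

The three values correspond to the three ping checks in test_ip_overflow(): initial success, failure after overflow, and success after relearning.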

1 month, 2 weeks


[PATCH 0/5] mm, kvm: add guest_memfd support for uffd minor faults

by Mike Rapoport

From: "Mike Rapoport (Microsoft)" <rppt(a)kernel.org>

Hi,

These patches allow guest_memfd to notify userspace about minor page faults using userfaultfd and let userspace resolve these page faults using UFFDIO_CONTINUE.

To allow UFFDIO_CONTINUE outside of the core mm I added a get_shmem_folio() callback to vm_ops that allows an address space backing a VMA to return a folio that exists in its page cache (patch 2).

In order for guest_memfd to notify userspace about page faults, there is a new VM_FAULT_UFFD_MINOR that a ->fault() handler can return to inform the page fault handler that it needs to call handle_userfault() to complete the fault (patch 3).

Patch 4 plumbs these new goodies into guest_memfd.

This series is the minimal change I've been able to come up with to allow integration of guest_memfd with uffd. Refactoring uffd and making the mfill_atomic() flow more linear would have been a nice improvement, but it is way out of the scope of enabling uffd with guest_memfd.

v2 changes:
* Introduce VM_FAULT_UFFD_MINOR to avoid exporting handle_userfault()
* Simplify vma_can_mfill_atomic()
* Rename get_pagecache_folio() to get_shared_folio() and use inode instead of vma as its argument

v1: https://lore.kernel.org/all/20251117114631.2029447-1-rppt@kernel.org

Mike Rapoport (Microsoft) (4):
  userfaultfd: move vma_can_userfault out of line
  userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
  mm: introduce VM_FAULT_UFFD_MINOR fault reason
  guest_memfd: add support for userfaultfd minor mode

Nikita Kalyazin (1):
  KVM: selftests: test userfaultfd minor for guest_memfd

 include/linux/mm.h                            |   9 ++
 include/linux/mm_types.h                      |   3 +
 include/linux/userfaultfd_k.h                 |  36 +-----
 mm/memory.c                                   |   2 +
 mm/shmem.c                                    |  21 +++-
 mm/userfaultfd.c                              |  80 +++++++++++---
 .../testing/selftests/kvm/guest_memfd_test.c  | 103 ++++++++++++++++++
 virt/kvm/guest_memfd.c                        |  29 +++++
 8 files changed, 232 insertions(+), 51 deletions(-)

base-commit: 6a23ae0a96a600d1d12557add110e0bb6e32730c
--
2.50.1

1 month, 2 weeks


[PATCH] selftests/run_kselftest.sh: Add `--skip` argument option

by Ricardo B. Marlière

Currently the only way of excluding certain tests from a collection is by passing all the other tests explicitly via `--test`. Therefore, if the user wants to skip a single test the resulting command line might be too big, depending on the collection. Add an option `--skip` that takes care of that.

Signed-off-by: Ricardo B. Marlière <rbm(a)suse.com>
---
 tools/testing/selftests/run_kselftest.sh | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/tools/testing/selftests/run_kselftest.sh b/tools/testing/selftests/run_kselftest.sh
index d4be97498b32..84d45254675c 100755
--- a/tools/testing/selftests/run_kselftest.sh
+++ b/tools/testing/selftests/run_kselftest.sh
@@ -30,6 +30,7 @@ Usage: $0 [OPTIONS]
   -s | --summary		Print summary with detailed log in output.log (conflict with -p)
   -p | --per-test-log		Print test log in /tmp with each test name (conflict with -s)
   -t | --test COLLECTION:TEST	Run TEST from COLLECTION
+  -S | --skip COLLECTION:TEST	Skip TEST from COLLECTION
   -c | --collection COLLECTION	Run all tests from COLLECTION
   -l | --list			List the available collection:test entries
   -d | --dry-run		Don't actually run any tests
@@ -43,6 +44,7 @@ EOF
 
 COLLECTIONS=""
 TESTS=""
+SKIP=""
 dryrun=""
 kselftest_override_timeout=""
 ERROR_ON_FAIL=true
@@ -58,6 +60,9 @@ while true; do
 		-t | --test)
 			TESTS="$TESTS $2"
 			shift 2 ;;
+		-S | --skip)
+			SKIP="$SKIP $2"
+			shift 2 ;;
 		-c | --collection)
 			COLLECTIONS="$COLLECTIONS $2"
 			shift 2 ;;
@@ -109,6 +114,12 @@ if [ -n "$TESTS" ]; then
 	done
 	available="$(echo "$valid" | sed -e 's/ /\n/g')"
 fi
+# Remove tests to be skipped from available list
+if [ -n "$SKIP" ]; then
+	for skipped in $SKIP ; do
+		available="$(echo "$available" | grep -v "^${skipped}$")"
+	done
+fi
 
 kselftest_failures_file="$(mktemp --tmpdir kselftest-failures-XXXXXX)"
 export kselftest_failures_file

---
base-commit: a2f7990d330937a204b86b9cafbfef82f87a8693
change-id: 20251125-selftests-add_skip_opt-0f3fd24d7afa

Best regards,
--
Ricardo B. Marlière <rbm(a)suse.com>
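The filtering step added by this patch can be sketched in Python as follows. This is only an illustration of the idea, assuming entries are `collection:test` strings; the function name `filter_skipped` and the sample entries are hypothetical. One difference is worth noting: the patch's `grep -v "^${skipped}$"` treats the entry as a regular expression, so a `.` in a test name matches any character, whereas this sketch compares names literally.

```python
def filter_skipped(available, skip):
    """Return the entries of `available` that are not listed in `skip`."""
    skipped = set(skip)
    # Whole-entry comparison mirrors the ^...$ anchoring in the shell version.
    return [entry for entry in available if entry not in skipped]


# Hypothetical collection:test entries, for illustration only.
available = ["net:ping.sh", "net:ping.sh.extra", "futex:run.sh"]
remaining = filter_skipped(available, ["net:ping.sh"])
print(remaining)
```

Because the match is anchored to the whole entry, skipping `net:ping.sh` leaves `net:ping.sh.extra` in the list.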

1 month, 2 weeks


[PATCH v7 00/11] arm64: entry: Convert to Generic Entry

by Jinjie Ruan

Currently, x86, RISC-V and LoongArch use the Generic Entry, which makes maintainers' work easier and the code more elegant. arm64 has already switched to the Generic IRQ Entry in commit b3cf07851b6c ("arm64: entry: Switch to generic IRQ entry"), so it is time to completely convert arm64 to Generic Entry.

The goal is to bring arm64 in line with other architectures that already use the generic entry infrastructure, reducing duplicated code and making it easier to share future changes in entry/exit paths, such as "Syscall User Dispatch".

This patch set is rebased on v6.18-rc6.

The performance benchmarks from perf bench basic syscall on real hardware are below:

| Metric     | W/O Generic Framework | With Generic Framework | Change |
| ---------- | --------------------- | ---------------------- | ------ |
| Total time | 2.813 [sec]           | 2.930 [sec]            | ↑4%    |
| usecs/op   | 0.281349              | 0.293006               | ↑4%    |
| ops/sec    | 3,554,299             | 3,412,894              | ↓4%    |

Compared to the earlier arch-specific handling, performance decreased by approximately 4%.

It was tested OK with the following test cases on the QEMU virt platform:
- Perf tests.
- Different `dynamic preempt` mode switch.
- Pseudo NMI tests.
- Stress-ng CPU stress test.
- MTE test case in Documentation/arch/arm64/memory-tagging-extension.rst and all test cases in tools/testing/selftests/arm64/mte/*.
- "sud" selftest testcase.
- get_syscall_info, peeksiginfo in tools/testing/selftests/ptrace.

The test QEMU configuration is as follows:

	qemu-system-aarch64 \
		-M virt,gic-version=3,virtualization=on,mte=on \
		-cpu max,pauth-impdef=on \
		-kernel Image \
		-smp 8,sockets=1,cores=4,threads=2 \
		-m 512m \
		-nographic \
		-no-reboot \
		-device virtio-rng-pci \
		-append "root=/dev/vda rw console=ttyAMA0 kgdboc=ttyAMA0,115200 \
			earlycon preempt=voluntary irqchip.gicv3_pseudo_nmi=1" \
		-drive if=none,file=images/rootfs.ext4,format=raw,id=hd0 \
		-device virtio-blk-device,drive=hd0

Changes in v7:
- Support "Syscall User Dispatch" by implementing arch_syscall_is_vdso_sigreturn() as kemal suggested.
- Add aarch64 support for "sud" selftest testcase, which tested ok with the patch series.
- Fix the kernel test robot warning for arch_ptrace_report_syscall_entry() and arch_ptrace_report_syscall_exit() in asm/entry-common.h.
- Add perf syscall performance test.
- Link to v6: https://lore.kernel.org/all/20250916082611.2972008-1-ruanjinjie@huawei.com/

Changes in v6:
- Rebased on v6.17-rc5-next as arm64 generic irq entry has merged.
- Update the commit message.
- Link to v5: https://lore.kernel.org/all/20241206101744.4161990-1-ruanjinjie@huawei.com/

Changes in v5:
- Do not change arm32 and keep the interrupts_enabled() macro for the gicv3 driver.
- Move irqentry_state definition into arch/arm64/kernel/entry-common.c.
- Avoid removing the __enter_from_*() and __exit_to_*() wrappers.
- Update "irqentry_state_t ret/irq_state" to "state" to keep it consistent.
- Use generic irq entry header for PREEMPT_DYNAMIC after splitting the generic entry.
- Also refactor the ARM64 syscall code.
- Introduce arch_ptrace_report_syscall_entry/exit(), instead of arch_pre/post_report_syscall_entry/exit() to simplify code.
- Make the syscall patches clearly separated.
- Update the commit message.
- Link to v4: https://lore.kernel.org/all/20241025100700.3714552-1-ruanjinjie@huawei.com/

Changes in v4:
- Rework/cleanup split into a few patches as Mark suggested.
- Replace interrupts_enabled() macro with regs_irqs_disabled(), instead of leaving it here.
- Remove rcu and lockdep state in pt_regs by using temporary irqentry_state_t as Mark suggested.
- Remove some unnecessary intermediate functions to make it clear.
- Rework preempt irq and PREEMPT_DYNAMIC code to make the switch more clear.
- arch_prepare_*_entry/exit() -> arch_pre_*_entry/exit().
- Expand the arch functions comment.
- Make arch functions closer to their callers.
- Declare saved_reg in for block.
- Remove arch_exit_to_kernel_mode_prepare(), arch_enter_from_kernel_mode().
- Adjust "Add few arch functions to use generic entry" patch to be the penultimate.
- Update the commit message.
- Add suggested-by.
- Link to v3: https://lore.kernel.org/all/20240629085601.470241-1-ruanjinjie@huawei.com/

Changes in v3:
- Test the MTE test cases.
- Handle forget_syscall() in arch_post_report_syscall_entry()
- Make the arch funcs not use __weak as Thomas suggested, so move the arch funcs to entry-common.h, and make arch_forget_syscall() folded in arch_post_report_syscall_entry() as suggested.
- Move report_single_step() to thread_info.h for arm64
- Change __always_inline() to inline, add inline for the other arch funcs.
- Remove unused signal.h for entry-common.h.
- Add Suggested-by.
- Update the commit message.

Changes in v2:
- Add tested-by.
- Fix a bug where arch_post_report_syscall_entry() was not called in syscall_trace_enter() if ptrace_report_syscall_entry() returned non-zero.
- Refactor report_syscall().
- Add comment for arch_prepare_report_syscall_exit().
- Adjust entry-common.h header file inclusion to alphabetical order.
- Update the commit message.

Jinjie Ruan (10):
  arm64/ptrace: Split report_syscall()
  arm64/ptrace: Refactor syscall_trace_enter/exit()
  arm64/ptrace: Refator el0_svc_common()
  entry: Add syscall_exit_to_user_mode_prepare() helper
  arm64/ptrace: Handle ptrace_report_syscall_entry() error
  arm64/ptrace: Expand secure_computing() in place
  arm64/ptrace: Use syscall_get_arguments() heleper
  entry: Add arch_ptrace_report_syscall_entry/exit()
  entry: Add has_syscall_work() helper
  arm64: entry: Convert to generic entry

kemal (1):
  selftests: sud_test: Support aarch64

 arch/arm64/Kconfig                            |  2 +-
 arch/arm64/include/asm/entry-common.h         | 69 ++++++++++++++
 arch/arm64/include/asm/syscall.h              | 29 +++++-
 arch/arm64/include/asm/thread_info.h          | 22 +----
 arch/arm64/kernel/debug-monitors.c            |  7 ++
 arch/arm64/kernel/ptrace.c                    | 90 -------------------
 arch/arm64/kernel/signal.c                    |  2 +-
 arch/arm64/kernel/syscall.c                   | 31 ++-----
 include/linux/entry-common.h                  | 42 ++++++---
 kernel/entry/syscall-common.c                 | 43 ++++++++-
 .../syscall_user_dispatch/sud_test.c          |  4 +
 11 files changed, 188 insertions(+), 153 deletions(-)

--
2.34.1
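
The roughly 4% figure quoted for the perf bench syscall results can be checked directly from the three rows of the benchmark table. The short calculation below is a sketch; the helper name `pct_change` is ours, and the numbers are copied from the cover letter.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (without generic entry, with generic entry), as reported in the cover letter
rows = {
    "Total time [sec]": (2.813, 2.930),
    "usecs/op": (0.281349, 0.293006),
    "ops/sec": (3554299, 3412894),
}

changes = {name: pct_change(b, a) for name, (b, a) in rows.items()}
for name, delta in changes.items():
    print(f"{name}: {delta:+.2f}%")
```

Time per operation goes up and operations per second go down by about the same relative amount, which is what the "approximately 4%" summary expresses.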

1 month, 2 weeks


[PATCH bpf-next v4 0/3] bpf: Fix FIONREAD and copied_seq issues

by Jiayuan Chen

syzkaller reported a bug [1] where a socket using sockmap, after being unloaded, exposed incorrect copied_seq calculation. The selftest I provided can be used to reproduce the issue reported by syzkaller.

TCP recvmsg seq # bug 2: copied E92C873, seq E68D125, rcvnxt E7CEB7C, fl 40
WARNING: CPU: 1 PID: 5997 at net/ipv4/tcp.c:2724 tcp_recvmsg_locked+0xb2f/0x2910 net/ipv4/tcp.c:2724
Call Trace:
 <TASK>
 receive_fallback_to_copy net/ipv4/tcp.c:1968 [inline]
 tcp_zerocopy_receive+0x131a/0x2120 net/ipv4/tcp.c:2200
 do_tcp_getsockopt+0xe28/0x26c0 net/ipv4/tcp.c:4713
 tcp_getsockopt+0xdf/0x100 net/ipv4/tcp.c:4812
 do_sock_getsockopt+0x34d/0x440 net/socket.c:2421
 __sys_getsockopt+0x12f/0x260 net/socket.c:2450
 __do_sys_getsockopt net/socket.c:2457 [inline]
 __se_sys_getsockopt net/socket.c:2454 [inline]
 __x64_sys_getsockopt+0xbd/0x160 net/socket.c:2454
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xcd/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

A sockmap socket maintains its own receive queue (ingress_msg) which may contain data from either its own protocol stack or forwarded from other sockets.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

The issue occurs when reading from ingress_msg: we update tp->copied_seq by default, but if the data comes from other sockets (not the socket's own protocol stack), tcp->rcv_nxt remains unchanged. Later, when converting back to a native socket, reads may fail as copied_seq could be significantly larger than rcv_nxt.

Additionally, FIONREAD calculation based on copied_seq and rcv_nxt is insufficient for sockmap sockets, requiring separate field tracking.

[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983

---
v1 -> v4: Use skmsg.sk instead of extending BPF_F_XXX macro and fix CI failure reported by CI
v1: https://lore.kernel.org/bpf/20251117110736.293040-1-jiayuan.chen@linux.dev/

Jiayuan Chen (3):
  bpf, sockmap: Fix incorrect copied_seq calculation
  bpf, sockmap: Fix FIONREAD for sockmap
  bpf, selftest: Add tests for FIONREAD and copied_seq

 include/linux/skmsg.h                         |  58 ++++-
 net/core/skmsg.c                              |  28 ++-
 net/ipv4/tcp_bpf.c                            |  26 ++-
 net/ipv4/udp_bpf.c                            |  25 ++-
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 203 +++++++++++++++++-
 .../bpf/progs/test_sockmap_pass_prog.c        |   8 +
 6 files changed, 331 insertions(+), 17 deletions(-)

--
2.43.0
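
The copied_seq problem described in this cover letter can be illustrated with a toy model that tracks only the two counters. Everything below is a simplified sketch: `ToySock`, its methods and the `only_count_own` switch are hypothetical names, and the "fixed" rule shown (advance copied_seq only for data that came from the socket's own stack) is one plausible accounting rule, not necessarily how the series implements it.

```python
class ToySock:
    """Toy socket holding only the sequence counters discussed above."""

    def __init__(self):
        self.rcv_nxt = 0       # advanced by the socket's own protocol stack
        self.copied_seq = 0    # advanced when userspace reads data
        self.ingress_msg = []  # sockmap queue of (length, from_own_stack)

    def receive_native(self, length):
        """Data arrives through this socket's own stack and is queued to itself."""
        self.rcv_nxt += length
        self.ingress_msg.append((length, True))

    def receive_redirected(self, length):
        """Data redirected from another socket: rcv_nxt is not touched."""
        self.ingress_msg.append((length, False))

    def read_all(self, only_count_own):
        """Drain the queue, updating copied_seq according to the chosen rule."""
        for length, own in self.ingress_msg:
            if own or not only_count_own:
                self.copied_seq += length
        self.ingress_msg.clear()


def run(only_count_own):
    sk = ToySock()
    sk.receive_native(100)      # 100 bytes from FD1's own stack
    sk.receive_redirected(500)  # 500 bytes forwarded from FD2
    sk.read_all(only_count_own)
    return sk.copied_seq, sk.rcv_nxt


buggy = run(only_count_own=False)  # copied_seq runs ahead of rcv_nxt
fixed = run(only_count_own=True)   # counters stay consistent
print(buggy, fixed)
```

In the first case copied_seq ends up larger than rcv_nxt, which is the inconsistency that later trips the warning in tcp_recvmsg_locked() once the socket is a plain TCP socket again.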

1 month, 2 weeks


[PATCH net-next v11 00/13] vsock: add namespace support to vhost-vsock and loopback

by Bobby Eshleman

This series adds namespace support to vhost-vsock and loopback. It does not add namespaces to any of the other guest transports (virtio-vsock, hyperv, or vmci). The current revision supports two modes: local and global. Local mode is complete isolation of namespaces, while global mode is complete sharing between namespaces of CIDs (the original behavior). The mode is set using /proc/sys/net/vsock/ns_mode. Modes are per-netns and write-once. This allows a system to configure namespaces independently (some may share CIDs, others are completely isolated). This also supports future possible mixed use cases, where there may be namespaces in global mode spinning up VMs while there are mixed mode namespaces that provide services to the VMs, but are not allowed to allocate from the global CID pool (this mode is not implemented in this series). If a socket or VM is created when a namespace is global but the namespace changes to local, the socket or VM will continue working normally. That is, the socket or VM assumes the mode behavior of the namespace at the time the socket/VM was created. The original mode is captured in vsock_create() and so occurs at the time of socket(2) and accept(2) for sockets and open(2) on /dev/vhost-vsock for VMs. This prevents a socket/VM connection from suddenly breaking due to a namespace mode change. Any new sockets/VMs created after the mode change will adopt the new mode's behavior. 
Additionally, added tests for the new namespace features: tools/testing/selftests/vsock/vmtest.sh 1..29 ok 1 vm_server_host_client ok 2 vm_client_host_server ok 3 vm_loopback ok 4 ns_vm_local_mode_rejected ok 5 ns_host_vsock_ns_mode_ok ok 6 ns_host_vsock_ns_mode_write_once_ok ok 7 ns_global_same_cid_fails ok 8 ns_local_same_cid_ok ok 9 ns_global_local_same_cid_ok ok 10 ns_local_global_same_cid_ok ok 11 ns_diff_global_host_connect_to_global_vm_ok ok 12 ns_diff_global_host_connect_to_local_vm_fails ok 13 ns_diff_global_vm_connect_to_global_host_ok ok 14 ns_diff_global_vm_connect_to_local_host_fails ok 15 ns_diff_local_host_connect_to_local_vm_fails ok 16 ns_diff_local_vm_connect_to_local_host_fails ok 17 ns_diff_global_to_local_loopback_local_fails ok 18 ns_diff_local_to_global_loopback_fails ok 19 ns_diff_local_to_local_loopback_fails ok 20 ns_diff_global_to_global_loopback_ok ok 21 ns_same_local_loopback_ok ok 22 ns_same_local_host_connect_to_local_vm_ok ok 23 ns_same_local_vm_connect_to_local_host_ok ok 24 ns_mode_change_connection_continue_vm_ok ok 25 ns_mode_change_connection_continue_host_ok ok 26 ns_mode_change_connection_continue_both_ok ok 27 ns_delete_vm_ok ok 28 ns_delete_host_ok ok 29 ns_delete_both_ok SUMMARY: PASS=29 SKIP=0 FAIL=0 Dependent on series: https://lore.kernel.org/all/20251108-vsock-selftests-fixes-and-improvements… Thanks again for everyone's help and reviews! Suggested-by: Sargun Dhillon <sargun(a)sargun.me> Signed-off-by: Bobby Eshleman <bobbyeshleman(a)gmail.com> Changes in v11: - vmtest: add a patch to use ss in wait_for_listener functions and support vsock, tcp, and unix. Change all patches to use the new functions. - vmtest: add a patch to re-use vm dmesg / warn counting functions - Link to v10: https://lore.kernel.org/r/20251117-vsock-vmtest-v10-0-df08f165bf3e@meta.com Changes in v10: - Combine virtio common patches into one (Stefano) - Resolve vsock_loopback virtio_transport_reset_no_sock() issue with info->vsk setting. 
This eliminates the need for skb->cb, so remove skb->cb patches. - many line width 80 fixes - Link to v9: https://lore.kernel.org/all/20251111-vsock-vmtest-v9-0-852787a37bed@meta.com Changes in v9: - reorder loopback patch after patch for virtio transport common code - remove module ordering tests patch because loopback no longer depends on pernet ops - major simplifications in vsock_loopback - added a new patch for blocking local mode for guests, added test case to check - add net ref tracking to vsock_loopback patch - Link to v8: https://lore.kernel.org/r/20251023-vsock-vmtest-v8-0-dea984d02bb0@meta.com Changes in v8: - Break generic cleanup/refactoring patches into standalone series, remove those from this series - Link to dependency: https://lore.kernel.org/all/20251022-vsock-selftests-fixes-and-improvements… - Link to v7: https://lore.kernel.org/r/20251021-vsock-vmtest-v7-0-0661b7b6f081@meta.com Changes in v7: - fix hv_sock build - break out vmtest patches into distinct, more well-scoped patches - change `orig_net_mode` to `net_mode` - many fixes and style changes in per-patch change sets (see individual patches for specific changes) - optimize `virtio_vsock_skb_cb` layout - update commit messages with more useful descriptions - vsock_loopback: use orig_net_mode instead of current net mode - add tests for edge cases (ns deletion, mode changing, loopback module load ordering) - Link to v6: https://lore.kernel.org/r/20250916-vsock-vmtest-v6-0-064d2eb0c89d@meta.com Changes in v6: - define behavior when mode changes to local while socket/VM is alive - af_vsock: clarify description of CID behavior - af_vsock: use stronger langauge around CID rules (dont use "may") - af_vsock: improve naming of buf/buffer - af_vsock: improve string length checking on proc writes - vsock_loopback: add space in struct to clarify lock protection - vsock_loopback: do proper cleanup/unregister on vsock_loopback_exit() - vsock_loopback: use virtio_vsock_skb_net() instead of sock_net() - 
vsock_loopback: set loopback to NULL after kfree() - vsock_loopback: use pernet_operations and remove callback mechanism - vsock_loopback: add macros for "global" and "local" - vsock_loopback: fix length checking - vmtest.sh: check for namespace support in vmtest.sh - Link to v5: https://lore.kernel.org/r/20250827-vsock-vmtest-v5-0-0ba580bede5b@meta.com Changes in v5: - /proc/net/vsock_ns_mode -> /proc/sys/net/vsock/ns_mode - vsock_global_net -> vsock_global_dummy_net - fix netns lookup in vhost_vsock to respect pid namespaces - add callbacks for vsock_loopback to avoid circular dependency - vmtest.sh loads vsock_loopback module - remove vsock_net_mode_can_set() - change vsock_net_write_mode() to return true/false based on success - make vsock_net_mode enum instead of u8 - Link to v4: https://lore.kernel.org/r/20250805-vsock-vmtest-v4-0-059ec51ab111@meta.com Changes in v4: - removed RFC tag - implemented loopback support - renamed new tests to better reflect behavior - completed suite of tests with permutations of ns modes and vsock_test as guest/host - simplified socat bridging with unix socket instead of tcp + veth - only use vsock_test for success case, socat for failure case (context in commit message) - lots of cleanup Changes in v3: - add notion of "modes" - add procfs /proc/net/vsock_ns_mode - local and global modes only - no /dev/vhost-vsock-netns - vmtest.sh already merged, so new patch just adds new tests for NS - Link to v2: https://lore.kernel.org/kvm/20250312-vsock-netns-v2-0-84bffa1aa97a@gmail.com Changes in v2: - only support vhost-vsock namespaces - all g2h namespaces retain old behavior, only common API changes impacted by vhost-vsock changes - add /dev/vhost-vsock-netns for "opt-in" - leave /dev/vhost-vsock to old behavior - removed netns module param - Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com Changes in v1: - added 'netns' module param to vsock.ko to enable the network namespace support (disabled by 
default) - added 'vsock_net_eq()' to check the "net" assigned to a socket only when 'netns' support is enabled - Link to RFC: https://patchwork.ozlabs.org/cover/1202235/ --- Bobby Eshleman (13): vsock: a per-net vsock NS mode state vsock: add netns to vsock core vsock: reject bad VSOCK_NET_MODE_LOCAL configuration for G2H virtio: set skb owner of virtio_transport_reset_no_sock() reply vsock: add netns support to virtio transports selftests/vsock: add namespace helpers to vmtest.sh selftests/vsock: prepare vm management helpers for namespaces selftests/vsock: add vm_dmesg_{warn,oops}_count() helpers selftests/vsock: use ss to wait for listeners instead of /proc/net selftests/vsock: add tests for proc sys vsock ns_mode selftests/vsock: add namespace tests for CID collisions selftests/vsock: add tests for host <-> vm connectivity with namespaces selftests/vsock: add tests for namespace deletion and mode changes MAINTAINERS | 1 + drivers/vhost/vsock.c | 57 +- include/linux/virtio_vsock.h | 8 +- include/net/af_vsock.h | 64 +- include/net/net_namespace.h | 4 + include/net/netns/vsock.h | 17 + net/vmw_vsock/af_vsock.c | 290 ++++++++- net/vmw_vsock/hyperv_transport.c | 6 + net/vmw_vsock/virtio_transport.c | 29 +- net/vmw_vsock/virtio_transport_common.c | 69 +- net/vmw_vsock/vmci_transport.c | 12 + net/vmw_vsock/vsock_loopback.c | 20 +- tools/testing/selftests/vsock/vmtest.sh | 1087 +++++++++++++++++++++++++++++-- 13 files changed, 1560 insertions(+), 104 deletions(-) --- base-commit: 962ac5ca99a5c3e7469215bf47572440402dfd59 change-id: 20250325-vsock-vmtest-b3a21d2102c2 prerequisite-message-id: <20251022-vsock-selftests-fixes-and-improvements-v1-0-edeb179d6463(a)meta.com> prerequisite-patch-id: a2eecc3851f2509ed40009a7cab6990c6d7cfff5 prerequisite-patch-id: 501db2100636b9c8fcb3b64b8b1df797ccbede85 prerequisite-patch-id: ba1a2f07398a035bc48ef72edda41888614be449 prerequisite-patch-id: fd5cc5445aca9355ce678e6d2bfa89fab8a57e61 prerequisite-patch-id: 
795ab4432ffb0843e22b580374782e7e0d99b909 prerequisite-patch-id: 1499d263dc933e75366c09e045d2125ca39f7ddd prerequisite-patch-id: f92d99bb1d35d99b063f818a19dcda999152d74c prerequisite-patch-id: e3296f38cdba6d903e061cff2bbb3e7615e8e671 prerequisite-patch-id: bc4662b4710d302d4893f58708820fc2a0624325 prerequisite-patch-id: f8991f2e98c2661a706183fde6b35e2b8d9aedcf prerequisite-patch-id: 44bf9ed69353586d284e5ee63d6fffa30439a698 prerequisite-patch-id: d50621bc630eeaf608bbaf260370c8dabf6326df Best regards, -- Bobby Eshleman <bobbyeshleman(a)meta.com>
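
The local and global modes described in this cover letter can be pictured with a small model of CID allocation. This is a sketch under stated assumptions: the class `Netns`, its methods, and the default mode of "global" are illustrative choices based on the description above ("the original behavior"), not the kernel's data structures.

```python
class Netns:
    """Toy network namespace with a write-once vsock mode (hypothetical model)."""

    def __init__(self, global_pool):
        self.mode = "global"             # assumed default: the original behavior
        self._written = False
        self._global_pool = global_pool  # CIDs shared by all global-mode namespaces
        self._local_pool = set()         # CIDs private to this namespace

    def write_mode(self, mode):
        """Model of writing ns_mode: allowed once per namespace."""
        if self._written:
            raise PermissionError("ns_mode is write-once")
        if mode not in ("global", "local"):
            raise ValueError(mode)
        self.mode = mode
        self._written = True

    def allocate_cid(self, cid):
        """Claim `cid`; return the mode captured at creation time, or None if taken."""
        pool = self._global_pool if self.mode == "global" else self._local_pool
        if cid in pool:
            return None
        pool.add(cid)
        return self.mode


shared = set()
g1, g2, l1, l2 = (Netns(shared) for _ in range(4))
l1.write_mode("local")
l2.write_mode("local")

results = {
    "global_then_global": (g1.allocate_cid(42), g2.allocate_cid(42)),
    "local_then_local": (l1.allocate_cid(42), l2.allocate_cid(42)),
}
print(results)
```

The two entries mirror the ns_global_same_cid_fails and ns_local_same_cid_ok test names: two global-mode namespaces collide on the same CID, while two local-mode namespaces do not.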

1 month, 2 weeks


[PATCH v3] selftests/futex: Remove static keyword from 'head'

by Ankit Khushwaha

'head' is defined as 'static struct robust_list_head' and stores the address of the local variable 'struct lock_struct a', which raises the -Wdangling-pointer warning:

robust_list.c: In function ‘child_circular_list’:
robust_list.c:522:24: warning: storing the address of local variable ‘a’ in ‘head.list.next’ [-Wdangling-pointer=]
  522 |         head.list.next = &a.list;
      |         ~~~~~~~~~~~~~~~^~~~~~~~~
robust_list.c:513:28: note: ‘a’ declared here
  513 |         struct lock_struct a, b, c;
      |                            ^
robust_list.c:512:40: note: ‘head’ declared here
  512 |         static struct robust_list_head head;
      |                                        ^~~~

Since 'head' doesn't need static storage duration, remove the static keyword from it to fix this.

Signed-off-by: Ankit Khushwaha <ankitkhushwaha.linux(a)gmail.com>
---
v3: Updated the patch name and msg as suggested by André.
v2: https://lore.kernel.org/all/20251118170907.108832-1-ankitkhushwaha.linux@gm…
    Added changes suggested by André.
v1: https://lore.kernel.org/all/20251118162619.50586-1-ankitkhushwaha.linux@gma…
---
 tools/testing/selftests/futex/functional/robust_list.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/futex/functional/robust_list.c b/tools/testing/selftests/futex/functional/robust_list.c
index e7d1254e18ca..ef21a7ec9def 100644
--- a/tools/testing/selftests/futex/functional/robust_list.c
+++ b/tools/testing/selftests/futex/functional/robust_list.c
@@ -509,7 +509,7 @@ TEST(test_robust_list_multiple_elements)
 
 static int child_circular_list(void *arg)
 {
-	static struct robust_list_head head;
+	struct robust_list_head head;
 	struct lock_struct a, b, c;
 	int ret;
 
--
2.52.0

1 month, 2 weeks

1
0
0 0

[PATCH v2] selftests/futex: Fix storing address of local variable

by Ankit Khushwaha

In "child_circular_list()" the address of the local variable ‘lock_struct a’ is assigned to "head.list.next", raising the following warning:

robust_list.c: In function ‘child_circular_list’:
robust_list.c:522:24: warning: storing the address of local variable ‘a’ in ‘head.list.next’ [-Wdangling-pointer=]
  522 |         head.list.next = &a.list;
      |         ~~~~~~~~~~~~~~~^~~~~~~~~
robust_list.c:513:28: note: ‘a’ declared here
  513 |         struct lock_struct a, b, c;
      |                            ^
robust_list.c:512:40: note: ‘head’ declared here
  512 |         static struct robust_list_head head;
      |                                        ^~~~

Remove the static keyword of "head" to fix this.

Signed-off-by: Ankit Khushwaha <ankitkhushwaha.linux(a)gmail.com>
---
changelog:
v2: Added changes suggested by André.
v1: https://lore.kernel.org/all/20251118162619.50586-1-ankitkhushwaha.linux@gma…
---
 tools/testing/selftests/futex/functional/robust_list.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/futex/functional/robust_list.c b/tools/testing/selftests/futex/functional/robust_list.c
index e7d1254e18ca..ef21a7ec9def 100644
--- a/tools/testing/selftests/futex/functional/robust_list.c
+++ b/tools/testing/selftests/futex/functional/robust_list.c
@@ -509,7 +509,7 @@ TEST(test_robust_list_multiple_elements)
 static int child_circular_list(void *arg)
 {
-	static struct robust_list_head head;
+	struct robust_list_head head;
 	struct lock_struct a, b, c;
 	int ret;

--
2.51.1

1 month, 2 weeks

2
2
0 0

[PATCH net-next] selftests: af_unix: don't use SKIP for expected failures

by Jakub Kicinski

netdev CI reserves SKIP in selftests for cases which can't be executed due to setup issues, like missing or old commands. Tests which are expected to fail must use XFAIL.

Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: kuniyu(a)google.com
CC: adelodunolaoluwa(a)yahoo.com
CC: shuah(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
 tools/testing/selftests/net/af_unix/unix_connreset.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/af_unix/unix_connreset.c b/tools/testing/selftests/net/af_unix/unix_connreset.c
index bffef2b54bfd..6eb936207b31 100644
--- a/tools/testing/selftests/net/af_unix/unix_connreset.c
+++ b/tools/testing/selftests/net/af_unix/unix_connreset.c
@@ -161,8 +161,12 @@ TEST_F(unix_sock, reset_closed_embryo)
 	char buf[16] = {};
 	ssize_t n;

-	if (variant->socket_type == SOCK_DGRAM)
-		SKIP(return, "This test only applies to SOCK_STREAM and SOCK_SEQPACKET");
+	if (variant->socket_type == SOCK_DGRAM) {
+		snprintf(_metadata->results->reason,
+			 sizeof(_metadata->results->reason),
+			 "Test only applies to SOCK_STREAM and SOCK_SEQPACKET");
+		exit(KSFT_XFAIL);
+	}

 	/* Close server without accept()ing */
 	close(self->server);
--
2.51.1
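The SKIP/XFAIL convention described above can be written down as a small mapping. This is a sketch only: the exit codes mirror the KSFT_* values in tools/testing/selftests/kselftest.h, while `enum not_run_reason` and `result_for()` are hypothetical helpers invented for illustration and are not part of the harness.

```c
/* Exit codes mirrored from tools/testing/selftests/kselftest.h. */
enum ksft_code {
	KSFT_PASS  = 0,
	KSFT_FAIL  = 1,
	KSFT_XFAIL = 2,
	KSFT_XPASS = 3,
	KSFT_SKIP  = 4,
};

/* Hypothetical reasons a test case does not produce a normal pass. */
enum not_run_reason {
	REASON_NONE,		/* test ran normally */
	REASON_MISSING_TOOL,	/* setup issue: command missing or too old */
	REASON_NOT_APPLICABLE,	/* variant is expected to fail by design */
};

/* Map a reason to the result code the netdev CI convention asks for. */
int result_for(enum not_run_reason reason)
{
	switch (reason) {
	case REASON_MISSING_TOOL:
		return KSFT_SKIP;
	case REASON_NOT_APPLICABLE:
		return KSFT_XFAIL;
	default:
		return KSFT_PASS;
	}
}
```

The SOCK_DGRAM variant in this patch falls into the second bucket: nothing is missing from the environment, the case simply does not apply.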

1 month, 2 weeks

3
2
0 0

[PATCH net-next] selftests: netconsole: ensure required log level is set on netcons_basic

by Andre Carvalho

This commit ensures that the required log level is set at the start of each test iteration. Part of the cleanup performed at the end of each test iteration (do_cleanup in lib_netcons.sh) resets the log level to the values defined at the time the test script started. This may cause further test iterations to fail if the default values are not sufficient.

Signed-off-by: Andre Carvalho <asantostc(a)gmail.com>
---
 tools/testing/selftests/drivers/net/netcons_basic.sh | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/netcons_basic.sh b/tools/testing/selftests/drivers/net/netcons_basic.sh
index a3446b569976..2022f3061738 100755
--- a/tools/testing/selftests/drivers/net/netcons_basic.sh
+++ b/tools/testing/selftests/drivers/net/netcons_basic.sh
@@ -28,8 +28,6 @@ OUTPUT_FILE="/tmp/${TARGET}"

 # Check for basic system dependency and exit if not found
 check_for_dependencies
-# Set current loglevel to KERN_INFO(6), and default to KERN_NOTICE(5)
-echo "6 5" > /proc/sys/kernel/printk
 # Remove the namespace, interfaces and netconsole target on exit
 trap cleanup EXIT

@@ -39,6 +37,9 @@ do
 	for IP_VERSION in "ipv6" "ipv4"
 	do
 		echo "Running with target mode: ${FORMAT} (${IP_VERSION})"
+		# Set current loglevel to KERN_INFO(6), and default to
+		# KERN_NOTICE(5)
+		echo "6 5" > /proc/sys/kernel/printk
 		# Create one namespace and two interfaces
 		set_network "${IP_VERSION}"
 		# Create a dynamic target for netconsole

---
base-commit: e2c20036a8879476c88002730d8a27f4e3c32d4b
change-id: 20251121-netcons-basic-loglevel-69e2715c1029

Best regards,
--
Andre Carvalho <asantostc(a)gmail.com>

1 month, 2 weeks

3
2
0 0

[PATCH net-next 0/5] selftests: hw-net: toeplitz: read config from the NIC directly

by Jakub Kicinski

First patch here tries to auto-disable building the iouring sample. Our CI will still run the iouring test(s), of course, but it looks like the liburing updates aren't very quick in distros and having to hack around it when developing unrelated tests is a bit annoying.

Remaining 4 patches iron out running the Toeplitz hash test against real NICs. I tested mlx5, bnxt and fbnic, they all pass now. I switched to using YNL directly in the C code, can't see a reason to get the info in Python and pass it to C via argv. The old code likely did this because it predates YNL.

Jakub Kicinski (5):
  selftests: hw-net: auto-disable building the iouring C code
  selftests: hw-net: toeplitz: make sure NICs have pure Toeplitz configured
  selftests: hw-net: toeplitz: read the RSS key directly from C
  selftests: hw-net: toeplitz: read indirection table from the device
  selftests: hw-net: toeplitz: give the test up to 4 seconds

 .../testing/selftests/drivers/net/hw/Makefile | 23 ++++++-
 .../selftests/drivers/net/hw/toeplitz.c | 65 ++++++++++++++++++-
 .../selftests/drivers/net/hw/toeplitz.py | 28 ++++----
 3 files changed, 98 insertions(+), 18 deletions(-)

--
2.51.1
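For context on what the test verifies, the Toeplitz hash itself is short. Below is a minimal, self-contained sketch of the generic algorithm, not the selftest's code: `toeplitz_hash()` is a name chosen here, and the caller is assumed to supply a key at least four bytes longer than the input.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash over 'len' input bytes. 'key' must hold at least
 * len + 4 bytes (RSS keys are commonly 40 bytes or longer).
 */
uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *data, size_t len)
{
	/* 32-bit window into the key, starting at its first four bytes. */
	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
			  ((uint32_t)key[2] << 8) | (uint32_t)key[3];
	uint32_t hash = 0;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		for (bit = 7; bit >= 0; bit--) {
			/* XOR in the current window for every set input bit. */
			if (data[i] & (1u << bit))
				hash ^= window;
			/* Slide the window left by one key bit. */
			window <<= 1;
			if (key[i + 4] & (1u << bit))
				window |= 1;
		}
	}
	return hash;
}
```

Because the result depends on both the key and, on a real NIC, the indirection table that maps hash values to queues, a test that hardcodes either one can disagree with the device, which is presumably why the series reads both from the NIC.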

1 month, 2 weeks

4
13
0 0

[PATCH bpf-next v3] selftests/bpf: Fix htab_update/reenter_update selftest failure

by Saket Kumar Bhaskar

Since commit 31158ad02ddb ("rqspinlock: Add deadlock detection and recovery") the update path on re-entrancy now reports deadlock via -EDEADLK instead of the previous -EBUSY.

Also, the way reentrancy was exercised (via fentry/lookup_elem_raw) has been fragile because lookup_elem_raw may be inlined (find_kernel_btf_id() will return -ESRCH).

To fix this fentry is attached to bpf_obj_free_fields() instead of lookup_elem_raw() and:
- The htab map is made to use a BTF-described struct val with a struct bpf_timer so that check_and_free_fields() reliably calls bpf_obj_free_fields() on element replacement.
- The selftest is updated to do two updates to the same key (insert + replace) in prog_test.
- The selftest is updated to align the expected errno with the kernel’s current behavior.

Signed-off-by: Saket Kumar Bhaskar <skb99(a)linux.ibm.com>
---
Changes since v2:
Addressed CI failures:
* Initialize key to 0 before the first update.
* Used pointer value to pass for update and memset rather than &value.

v2: https://lore.kernel.org/all/20251114152653.356782-1-skb99@linux.ibm.com/

Changes since v1:
Addressed comments from Alexei:
* Fixed the scenario where test may fail when lookup_elem_raw() is inlined.

v1: https://lore.kernel.org/all/20251106052628.349117-1-skb99@linux.ibm.com/

 .../selftests/bpf/prog_tests/htab_update.c | 37 ++++++++++++++-----
 .../testing/selftests/bpf/progs/htab_update.c | 19 +++++++---
 2 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/htab_update.c b/tools/testing/selftests/bpf/prog_tests/htab_update.c
index 2bc85f4814f4..d0b405eb2966 100644
--- a/tools/testing/selftests/bpf/prog_tests/htab_update.c
+++ b/tools/testing/selftests/bpf/prog_tests/htab_update.c
@@ -15,17 +15,17 @@ struct htab_update_ctx {
 static void test_reenter_update(void)
 {
 	struct htab_update *skel;
-	unsigned int key, value;
+	void *value = NULL;
+	unsigned int key, value_size;
 	int err;

 	skel = htab_update__open();
 	if (!ASSERT_OK_PTR(skel, "htab_update__open"))
 		return;

-	/* lookup_elem_raw() may be inlined and find_kernel_btf_id() will return -ESRCH */
-	bpf_program__set_autoload(skel->progs.lookup_elem_raw, true);
+	bpf_program__set_autoload(skel->progs.bpf_obj_free_fields, true);
 	err = htab_update__load(skel);
-	if (!ASSERT_TRUE(!err || err == -ESRCH, "htab_update__load") || err)
+	if (!ASSERT_TRUE(!err, "htab_update__load") || err)
 		goto out;

 	skel->bss->pid = getpid();
@@ -33,14 +33,33 @@ static void test_reenter_update(void)
 	if (!ASSERT_OK(err, "htab_update__attach"))
 		goto out;

-	/* Will trigger the reentrancy of bpf_map_update_elem() */
+	value_size = bpf_map__value_size(skel->maps.htab);
+
+	value = calloc(1, value_size);
+	if (!ASSERT_OK_PTR(value, "calloc value"))
+		goto out;
+	/*
+	 * First update: plain insert. This should NOT trigger the re-entrancy
+	 * path, because there is no old element to free yet.
+	 */
 	key = 0;
-	value = 0;
-	err = bpf_map_update_elem(bpf_map__fd(skel->maps.htab), &key, &value, 0);
-	if (!ASSERT_OK(err, "add element"))
+	err = bpf_map_update_elem(bpf_map__fd(skel->maps.htab), &key, value, BPF_ANY);
+	if (!ASSERT_OK(err, "first update (insert)"))
+		goto out;
+
+	/*
+	 * Second update: replace existing element with same key and trigger
+	 * the reentrancy of bpf_map_update_elem().
+	 * check_and_free_fields() calls bpf_obj_free_fields() on the old
+	 * value, which is where fentry program runs and performs a nested
+	 * bpf_map_update_elem(), triggering -EDEADLK.
+	 */
+	memset(value, 0, value_size);
+	err = bpf_map_update_elem(bpf_map__fd(skel->maps.htab), &key, value, BPF_ANY);
+	if (!ASSERT_OK(err, "second update (replace)"))
 		goto out;

-	ASSERT_EQ(skel->bss->update_err, -EBUSY, "no reentrancy");
+	ASSERT_EQ(skel->bss->update_err, -EDEADLK, "no reentrancy");
 out:
 	htab_update__destroy(skel);
 }
diff --git a/tools/testing/selftests/bpf/progs/htab_update.c b/tools/testing/selftests/bpf/progs/htab_update.c
index 7481bb30b29b..195d3b2fba00 100644
--- a/tools/testing/selftests/bpf/progs/htab_update.c
+++ b/tools/testing/selftests/bpf/progs/htab_update.c
@@ -6,24 +6,31 @@

 char _license[] SEC("license") = "GPL";

+/* Map value type: has BTF-managed field (bpf_timer) */
+struct val {
+	struct bpf_timer t;
+	__u64 payload;
+};
+
 struct {
 	__uint(type, BPF_MAP_TYPE_HASH);
 	__uint(max_entries, 1);
-	__uint(key_size, sizeof(__u32));
-	__uint(value_size, sizeof(__u32));
+	__type(key, __u32);
+	__type(value, struct val);
 } htab SEC(".maps");

 int pid = 0;
 int update_err = 0;

-SEC("?fentry/lookup_elem_raw")
-int lookup_elem_raw(void *ctx)
+SEC("?fentry/bpf_obj_free_fields")
+int bpf_obj_free_fields(void *ctx)
 {
-	__u32 key = 0, value = 1;
+	__u32 key = 0;
+	struct val value = { .payload = 1 };

 	if ((bpf_get_current_pid_tgid() >> 32) != pid)
 		return 0;

-	update_err = bpf_map_update_elem(&htab, &key, &value, 0);
+	update_err = bpf_map_update_elem(&htab, &key, &value, BPF_ANY);
 	return 0;
 }
--
2.51.0
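The insert-then-replace sequence is easier to follow with a toy model. The sketch below is a userspace simplification under stated assumptions: `struct toy_map`, `toy_update()` and `nested_update_hook()` are hypothetical, and a plain busy flag stands in for the kernel's per-bucket rqspinlock.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical single-slot "map"; 'busy' stands in for the bucket lock. */
struct toy_map {
	int busy;				/* set while an update is in progress */
	int has_value;
	int value;
	void (*on_free)(struct toy_map *map);	/* hook run when an old value is freed */
	int nested_err;				/* result of an update issued from the hook */
};

int toy_update(struct toy_map *map, int value)
{
	if (map->busy)
		return -EDEADLK;	/* nested update: report instead of deadlocking */

	map->busy = 1;
	/* Only a replace frees an old value, so only a replace runs the hook. */
	if (map->has_value && map->on_free)
		map->on_free(map);
	map->value = value;
	map->has_value = 1;
	map->busy = 0;
	return 0;
}

/* Plays the role of the fentry program: tries a nested update. */
void nested_update_hook(struct toy_map *map)
{
	map->nested_err = toy_update(map, 1);
}
```

In this model the first update leaves `nested_err` untouched because there is nothing to free; the second update succeeds from the outer caller's point of view, while the nested one records -EDEADLK, matching the two assertions the selftest makes.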

1 month, 2 weeks

3
2
0 0

[PATCH] selftests: tracing: Add tprobe enable/disable testcase

by Masami Hiramatsu (Google)

From: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>

Commit 2867495dea86 ("tracing: tprobe-events: Register tracepoint when enable tprobe event") caused a regression and tprobe did not work. To prevent similar problems, add a testcase which enables/disables a tprobe and checks the results.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
---
 .../test.d/dynevent/enable_disable_tprobe.tc | 40 ++++++++++++++++++++
 1 file changed, 40 insertions(+)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/enable_disable_tprobe.tc

diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/enable_disable_tprobe.tc b/tools/testing/selftests/ftrace/test.d/dynevent/enable_disable_tprobe.tc
new file mode 100644
index 000000000000..c1f1cafa30f3
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/enable_disable_tprobe.tc
@@ -0,0 +1,40 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Generic dynamic event - enable/disable tracepoint probe events
+# requires: dynamic_events "t[:[<group>/][<event>]] <tracepoint> [<args>]":README
+
+echo 0 > events/enable
+echo > dynamic_events
+
+TRACEPOINT=sched_switch
+ENABLEFILE=events/tracepoints/myprobe/enable
+
+:;: "Add tracepoint event on $TRACEPOINT" ;:
+
+echo "t:myprobe ${TRACEPOINT}" >> dynamic_events
+
+:;: "Check enable/disable to ensure it works" ;:
+
+echo 1 > $ENABLEFILE
+
+grep -q $TRACEPOINT trace
+
+echo 0 > $ENABLEFILE
+
+echo > trace
+
+! grep -q $TRACEPOINT trace
+
+:;: "Repeat enable/disable to ensure it works" ;:
+
+echo 1 > $ENABLEFILE
+
+grep -q $TRACEPOINT trace
+
+echo 0 > $ENABLEFILE
+
+echo > trace
+
+! grep -q $TRACEPOINT trace
+
+exit 0

1 month, 2 weeks

4
7
0 0

[PATCH bpf-next v10 0/8] bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps

by Leon Hwang

This patch set introduces the BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps, as the requirement of BPF_F_ALL_CPUS flag for percpu_array maps was discussed in the thread of "[PATCH bpf-next v3 0/4] bpf: Introduce global percpu data"[1]. The goal of BPF_F_ALL_CPUS flag is to reduce data caching overhead in light skeletons by allowing a single value to be reused to update values across all CPUs. This avoids the M:N problem where M cached values are used to update a map on N CPUs kernel. The BPF_F_CPU flag is accompanied by *flags*-embedded cpu info, which specifies the target CPU for the operation: * For lookup operations: the flag field alongside cpu info enable querying a value on the specified CPU. * For update operations: the flag field alongside cpu info enable updating value for specified CPU. Links: [1] https://lore.kernel.org/bpf/20250526162146.24429-1-leon.hwang@linux.dev/ Changes: v9 -> v10: * Add tests to verify array and hash maps do not support BPF_F_CPU and BPF_F_ALL_CPUS flags. * Address comment from Andrii: * Copy map value using copy_map_value_long for percpu_cgroup_storage maps in a separate patch. v8 -> v9: * Change value type from u64 to u32 in selftests. * Address comments from Andrii: * Keep value_size unaligned and update everywhere for consistency when cpu flags are specified. * Update value by getting pointer for percpu hash and percpu cgroup_storage maps. v7 -> v8: * Address comments from Andrii: * Check BPF_F_LOCK when update percpu_array, percpu_hash and lru_percpu_hash maps. * Refactor flags check in __htab_map_lookup_and_delete_batch(). * Keep value_size unaligned and copy value using copy_map_value() in __htab_map_lookup_and_delete_batch() when BPF_F_CPU is specified. * Update warn message in libbpf's validate_map_op(). * Update comment of libbpf's bpf_map__lookup_elem(). v6 -> v7: * Get correct value size for percpu_hash and lru_percpu_hash in update_batch API. * Set 'count' as 'max_entries' in test cases for lookup_batch API. 
* Address comment from Alexei: * Move cpu flags check into bpf_map_check_op_flags(). v5 -> v6: * Move bpf_map_check_op_flags() from 'bpf.h' to 'syscall.c'. * Address comments from Alexei: * Drop the refactoring code of data copying logic for percpu maps. * Drop bpf_map_check_op_flags() wrappers. v4 -> v5: * Address comments from Andrii: * Refactor data copying logic for all percpu maps. * Drop this_cpu_ptr() micro-optimization. * Drop cpu check in libbpf's validate_map_op(). * Enhance bpf_map_check_op_flags() using *allowed flags* instead of 'extra_flags_mask'. v3 -> v4: * Address comments from Andrii: * Remove unnecessary map_type check in bpf_map_value_size(). * Reduce code churn. * Remove unnecessary do_delete check in __htab_map_lookup_and_delete_batch(). * Introduce bpf_percpu_copy_to_user() and bpf_percpu_copy_from_user(). * Rename check_map_flags() to bpf_map_check_op_flags() with extra_flags_mask. * Add human-readable pr_warn() explanations in validate_map_op(). * Use flags in bpf_map__delete_elem() and bpf_map__lookup_and_delete_elem(). * Drop "for alignment reasons". v3 link: https://lore.kernel.org/bpf/20250821160817.70285-1-leon.hwang@linux.dev/ v2 -> v3: * Address comments from Alexei: * Use BPF_F_ALL_CPUS instead of BPF_ALL_CPUS magic. * Introduce these two cpu flags for all percpu maps. * Address comments from Jiri: * Reduce some unnecessary u32 cast. * Refactor more generic map flags check function. * A code style issue. v2 link: https://lore.kernel.org/bpf/20250805163017.17015-1-leon.hwang@linux.dev/ v1 -> v2: * Address comments from Andrii: * Embed cpu info as high 32 bits of *flags* totally. * Use ERANGE instead of E2BIG. * Few format issues. 
Leon Hwang (8): bpf: Introduce internal bpf_map_check_op_flags helper function bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_array maps bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_hash and lru_percpu_hash maps bpf: Copy map value using copy_map_value_long for percpu_cgroup_storage maps bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_cgroup_storage maps libbpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu maps selftests/bpf: Add cases to test BPF_F_CPU and BPF_F_ALL_CPUS flags include/linux/bpf-cgroup.h | 4 +- include/linux/bpf.h | 44 ++- include/uapi/linux/bpf.h | 2 + kernel/bpf/arraymap.c | 29 +- kernel/bpf/hashtab.c | 94 ++++-- kernel/bpf/local_storage.c | 27 +- kernel/bpf/syscall.c | 65 ++-- tools/include/uapi/linux/bpf.h | 2 + tools/lib/bpf/bpf.h | 8 + tools/lib/bpf/libbpf.c | 26 +- tools/lib/bpf/libbpf.h | 21 +- .../selftests/bpf/prog_tests/percpu_alloc.c | 312 ++++++++++++++++++ .../selftests/bpf/progs/percpu_alloc_array.c | 32 ++ 13 files changed, 562 insertions(+), 104 deletions(-) -- 2.51.2
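The changelog mentions that the CPU number is embedded in the high 32 bits of *flags* and that ERANGE is used for an out-of-range CPU. A minimal sketch of that encoding follows. The `SKETCH_*` values are placeholders assumed for illustration (the real definitions are in include/uapi/linux/bpf.h), and `check_cpu_flags()` is a hypothetical check whose exact rules are an assumption, not a copy of bpf_map_check_op_flags().

```c
#include <errno.h>
#include <stdint.h>

/* Placeholder values assumed for this sketch, not taken from the uapi header. */
#define SKETCH_F_CPU		(1ULL << 3)
#define SKETCH_F_ALL_CPUS	(1ULL << 4)

/* Build a flags word that targets one CPU: CPU number in the high 32 bits. */
uint64_t flags_for_cpu(uint32_t cpu)
{
	return ((uint64_t)cpu << 32) | SKETCH_F_CPU;
}

/* Extract the CPU number again. */
uint32_t cpu_from_flags(uint64_t flags)
{
	return (uint32_t)(flags >> 32);
}

/*
 * Hypothetical validation: the two flags are treated as mutually
 * exclusive, a CPU number is only meaningful together with
 * SKETCH_F_CPU, and the CPU must exist.
 */
int check_cpu_flags(uint64_t flags, uint32_t nr_cpus)
{
	uint32_t cpu = cpu_from_flags(flags);

	if ((flags & SKETCH_F_CPU) && (flags & SKETCH_F_ALL_CPUS))
		return -EINVAL;
	if (!(flags & SKETCH_F_CPU) && cpu != 0)
		return -EINVAL;
	if ((flags & SKETCH_F_CPU) && cpu >= nr_cpus)
		return -ERANGE;
	return 0;
}
```

Packing the CPU into the existing flags word means the lookup and update entry points keep their current argument lists, which is presumably why this layout was chosen over a new parameter.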

1 month, 2 weeks

2
10
0 0

[PATCH v6 0/9] futex: Create {set,get}_robust_list2() syscalls

by André Almeida

Hello, This version is a complete rewrite of the syscall (thanks Thomas for the suggestions!). * Use case The use-case for the new syscalls is detailed in the last patch version: https://lore.kernel.org/lkml/20250626-tonyk-robust_futex-v5-0-179194dbde8f@… * The syscall interface Documented at patches 3/9 "futex: Create set_robust_list2() syscall" and 4/9 "futex: Create get_robust_list2() syscall". * Testing I expanded the current robust list selftest to use the new interface, and also ported the original syscall to use the new syscall internals, and everything survived the tests. * Changelog Changes from v5: - Complete interface rewrite, there are so many changes but the main ones are the following points - Array of robust lists now has a static size, allocated once during the first usage of the list - Now that the list of robust lists have a fixed size, I removed the logic of having a command for creating a new index on the list. To simplify things for everyone, userspace just need to call set_robust_list2(head, 32-bit/64-bit type, index). - Created get_robust_list2() - The new code can be better integrated with the original interface - v5: https://lore.kernel.org/r/20250626-tonyk-robust_futex-v5-0-179194dbde8f@iga… Feedback is very welcomed! 
--- André Almeida (9): futex: Use explicit sizes for compat_robust_list structs futex: Make exit_robust_list32() unconditionally available for 64-bit kernels futex: Create set_robust_list2() syscall futex: Create get_robust_list2() syscall futex: Wire up set_robust_list2 syscall futex: Wire up get_robust_list2 syscall selftests/futex: Expand for set_robust_list2() selftests/futex: Expand for get_robust_list2() futex: Use new robust list API internally arch/alpha/kernel/syscalls/syscall.tbl | 2 + arch/arm/tools/syscall.tbl | 2 + arch/m68k/kernel/syscalls/syscall.tbl | 2 + arch/microblaze/kernel/syscalls/syscall.tbl | 2 + arch/mips/kernel/syscalls/syscall_n32.tbl | 2 + arch/mips/kernel/syscalls/syscall_n64.tbl | 2 + arch/mips/kernel/syscalls/syscall_o32.tbl | 2 + arch/parisc/kernel/syscalls/syscall.tbl | 2 + arch/powerpc/kernel/syscalls/syscall.tbl | 2 + arch/s390/kernel/syscalls/syscall.tbl | 2 + arch/sh/kernel/syscalls/syscall.tbl | 2 + arch/sparc/kernel/syscalls/syscall.tbl | 2 + arch/x86/entry/syscalls/syscall_32.tbl | 2 + arch/x86/entry/syscalls/syscall_64.tbl | 2 + arch/xtensa/kernel/syscalls/syscall.tbl | 2 + include/linux/compat.h | 13 +- include/linux/futex.h | 30 +- include/linux/sched.h | 6 +- include/uapi/asm-generic/unistd.h | 7 +- include/uapi/linux/futex.h | 26 ++ kernel/futex/core.c | 140 ++++-- kernel/futex/syscalls.c | 134 +++++- kernel/sys_ni.c | 2 + scripts/syscall.tbl | 1 + .../selftests/futex/functional/robust_list.c | 504 +++++++++++++++++++-- 25 files changed, 788 insertions(+), 105 deletions(-) --- base-commit: c42ba5a87bdccbca11403b7ca8bad1a57b833732 change-id: 20250225-tonyk-robust_futex-60adeedac695 Best regards, -- André Almeida <andrealmeid(a)igalia.com>

1 month, 2 weeks

4
18
0 0

[PATCH 0/2] selftests/nolibc: fix loongarch build with recent versions of clang

by Thomas Weißschuh

LLVM 21 switched to -mcmodel=medium for LoongArch64 compilations. This code model uses R_LARCH_ECALL36 relocations which might not be supported by GNU ld which the nolibc testsuite uses by default.

Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (2):
      selftests/nolibc: use lld to link loongarch binaries
      selftests/nolibc: error out on linker warnings

 tools/testing/selftests/nolibc/Makefile.nolibc | 1 +
 tools/testing/selftests/nolibc/run-tests.sh | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)
---
base-commit: 6059e06967aaac9bf736c6cec75b9bccaf5bbe18
change-id: 20251121-nolibc-lld-f32af4983cc0

Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>

1 month, 3 weeks

2
3
0 0

[PATCH bpf v1] selftests: test_tag: prog_tag is calculated using SHA256.

by Xing Guo

Commit 603b44162325 ("bpf: Update the bpf_prog_calc_tag to use SHA256") changed the digest used for prog_tag to SHA256 but did not update the tests accordingly. Switch test_tag to sha256 to fix it.

Fixes: 603b44162325 ("bpf: Update the bpf_prog_calc_tag to use SHA256")
Signed-off-by: Xing Guo <higuoxing(a)gmail.com>
---
 tools/testing/selftests/bpf/test_tag.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_tag.c b/tools/testing/selftests/bpf/test_tag.c
index 5546b05a0486..f1300047c1e0 100644
--- a/tools/testing/selftests/bpf/test_tag.c
+++ b/tools/testing/selftests/bpf/test_tag.c
@@ -116,7 +116,7 @@ static void tag_from_alg(int insns, uint8_t *tag, uint32_t len)
 	static const struct sockaddr_alg alg = {
 		.salg_family	= AF_ALG,
 		.salg_type	= "hash",
-		.salg_name	= "sha1",
+		.salg_name	= "sha256",
 	};
 	int fd_base, fd_alg, ret;
 	ssize_t size;
--
2.51.2
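For readers who have not used AF_ALG, the mechanism the test relies on looks roughly like this. It is a sketch rather than the selftest's tag_from_alg(): `sha256_tag()` is a hypothetical helper, it hashes an arbitrary buffer instead of BPF instructions, and truncating the digest to the caller's tag length is an assumption made for illustration.

```c
#include <linux/if_alg.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hash 'len' bytes with the kernel's "sha256" via AF_ALG and copy the
 * first 'tag_len' bytes of the digest into 'tag' (tag_len <= 32).
 * Returns 0 on success, -1 if AF_ALG or sha256 is unavailable.
 */
int sha256_tag(const void *data, size_t len, uint8_t *tag, size_t tag_len)
{
	struct sockaddr_alg alg = {
		.salg_family = AF_ALG,
		.salg_type   = "hash",
		.salg_name   = "sha256",
	};
	uint8_t digest[32];
	int fd_base, fd_alg, ret = -1;

	if (tag_len > sizeof(digest))
		return -1;

	/* The bound socket selects the algorithm ... */
	fd_base = socket(AF_ALG, SOCK_SEQPACKET, 0);
	if (fd_base < 0)
		return -1;
	if (bind(fd_base, (struct sockaddr *)&alg, sizeof(alg)) < 0)
		goto out_base;

	/* ... and each accept()ed socket is one hashing operation. */
	fd_alg = accept(fd_base, NULL, NULL);
	if (fd_alg < 0)
		goto out_base;

	if (write(fd_alg, data, len) == (ssize_t)len &&
	    read(fd_alg, digest, sizeof(digest)) == (ssize_t)sizeof(digest)) {
		memcpy(tag, digest, tag_len);
		ret = 0;
	}
	close(fd_alg);
out_base:
	close(fd_base);
	return ret;
}
```

Since the algorithm is selected purely by the `salg_name` string, the one-line change in the patch is all that is needed to make the userspace side compute the same digest family as the kernel.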

1 month, 3 weeks

2
1
0 0

[PATCH] selftests/iommu: Fix array-bounds warning in get_hw_info

by Nirbhay Sharma

GCC warns about potential out-of-bounds access when the test provides a buffer smaller than struct iommu_test_hw_info:

iommufd_utils.h:817:37: warning: array subscript 'struct iommu_test_hw_info[0]' is partly outside array bounds of 'struct iommu_test_hw_info_buffer_smaller[1]' [-Warray-bounds=]
  817 |                         assert(!info->flags);
      |                                 ~~~~^~~~~~~

The warning occurs because 'info' is cast to a pointer to the full 8-byte struct at the top of the function, but the buffer_smaller test case passes only a 4-byte buffer. While the code correctly checks data_len before accessing each field, GCC's flow analysis with inlining doesn't recognize that the size check protects the access.

Fix this by accessing fields through appropriately-typed pointers that match the actual field sizes (__u32), declared only after the bounds check. This makes the relationship between the size check and memory access explicit to the compiler.

Signed-off-by: Nirbhay Sharma <nirbhay.lkd(a)gmail.com>
---
 tools/testing/selftests/iommu/iommufd_utils.h | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 9f472c20c190..37c1b994008c 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -770,7 +770,6 @@ static int _test_cmd_get_hw_info(int fd, __u32 device_id, __u32 data_type,
 				 void *data, size_t data_len,
 				 uint32_t *capabilities, uint8_t *max_pasid)
 {
-	struct iommu_test_hw_info *info = (struct iommu_test_hw_info *)data;
 	struct iommu_hw_info cmd = {
 		.size = sizeof(cmd),
 		.dev_id = device_id,
@@ -810,11 +809,19 @@ static int _test_cmd_get_hw_info(int fd, __u32 device_id, __u32 data_type,
 		}
 	}

-	if (info) {
-		if (data_len >= offsetofend(struct iommu_test_hw_info, test_reg))
-			assert(info->test_reg == IOMMU_HW_INFO_SELFTEST_REGVAL);
-		if (data_len >= offsetofend(struct iommu_test_hw_info, flags))
-			assert(!info->flags);
+	if (data) {
+		if (data_len >= offsetofend(struct iommu_test_hw_info,
+					    test_reg)) {
+			__u32 *test_reg = (__u32 *)data + 1;
+
+			assert(*test_reg == IOMMU_HW_INFO_SELFTEST_REGVAL);
+		}
+		if (data_len >= offsetofend(struct iommu_test_hw_info,
+					    flags)) {
+			__u32 *flags = data;
+
+			assert(!*flags);
+		}
 	}

 	if (max_pasid)
--
2.48.1
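The idea of checking each field only after its own bounds check can be shown in isolation. The sketch below is an illustration, not the patch: `struct hw_info`, `OFFSETOFEND`, `SKETCH_REGVAL` and `check_hw_info()` are hypothetical names, and it copies fields out with memcpy() rather than using the typed __u32 pointers the patch uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical two-field layout, in the order the patch relies on. */
struct hw_info {
	uint32_t flags;
	uint32_t test_reg;
};

/* End offset of a member, like the kernel's offsetofend(). */
#define OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

#define SKETCH_REGVAL 0xdeadbeefu	/* placeholder expected value */

/*
 * Validate only the fields that fit in 'len' bytes. Each field is copied
 * out after its own bounds check, so no access is ever made through a
 * pointer to the full struct.
 * Returns 0 if every visible field looks right, -1 otherwise.
 */
int check_hw_info(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t v;

	if (!data)
		return 0;
	if (len >= OFFSETOFEND(struct hw_info, flags)) {
		memcpy(&v, p + offsetof(struct hw_info, flags), sizeof(v));
		if (v != 0)
			return -1;
	}
	if (len >= OFFSETOFEND(struct hw_info, test_reg)) {
		memcpy(&v, p + offsetof(struct hw_info, test_reg), sizeof(v));
		if (v != SKETCH_REGVAL)
			return -1;
	}
	return 0;
}
```

With a 4-byte buffer only the first branch runs, so the compiler never sees an 8-byte access to a 4-byte object, which is the relationship the commit message wants to make explicit.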

1 month, 3 weeks

2
1
0 0

[PATCH v2 00/10] KVM: nVMX: Improve performance for unmanaged guest memory

by griffoul＠gmail.com

From: Fred Griffoul <fgriffo(a)amazon.co.uk>

This patch series addresses both performance and correctness issues in nested VMX when handling guest memory.

During nested VMX operations, L0 (KVM) accesses specific L1 guest pages to manage L2 execution. These pages fall into two categories: pages accessed only by L0 (such as the L1 MSR bitmap page or the eVMCS page), and pages passed to the L2 guest via vmcs02 (such as APIC access, virtual APIC, and posted interrupt descriptor pages).

The current implementation uses kvm_vcpu_map/unmap, which causes two issues.

First, the current approach is missing proper invalidation handling in critical scenarios. Enlightened VMCS (eVMCS) pages can become stale when memslots are modified, as there is no mechanism to invalidate the cached mappings. Similarly, APIC access and virtual APIC pages can be migrated by the host, but without proper notification through mmu_notifier callbacks, the mappings become invalid and can lead to incorrect behavior.

Second, for unmanaged guest memory (memory not directly mapped by the kernel, such as memory passed with the mem= parameter or guest_memfd for non-CoCo VMs), this workflow invokes expensive memremap/memunmap operations on every L2 VM entry/exit cycle. This creates significant overhead that impacts nested virtualization performance.

This series replaces kvm_host_map with gfn_to_pfn_cache in nested VMX. The pfncache infrastructure maintains persistent mappings as long as the page GPA does not change, eliminating the memremap/memunmap overhead on every VM entry/exit cycle. Additionally, pfncache provides proper invalidation handling via mmu_notifier callbacks and memslots generation check, ensuring that mappings are correctly updated during both memslot updates and page migration events.

As an example, a microbenchmark using memslot_perf_test with 8192 memslots demonstrates huge improvements in nested VMX operations with unmanaged guest memory:

                   Before    After     Improvement
  map:             26.12s    1.54s     ~17x faster
  unmap:           40.00s    0.017s    ~2353x faster
  unmap chunked:   10.07s    0.005s    ~2014x faster

The series is organized as follows:

Patches 1-5 handle the L1 MSR bitmap page and system pages (APIC access, virtual APIC, and posted interrupt descriptor). Patch 1 converts the MSR bitmap to use gfn_to_pfn_cache. Patches 2-3 restore and complete "guest-uses-pfn" support in pfncache. Patch 4 converts the system pages to use gfn_to_pfn_cache. Patch 5 adds a selftest for cache invalidation and memslot updates.

Patches 6-7 add enlightened VMCS support. Patch 6 avoids accessing eVMCS fields after they are copied into the cached vmcs12 structure. Patch 7 converts eVMCS page mapping to use gfn_to_pfn_cache.

Patches 8-10 implement persistent nested context to handle L2 vCPU multiplexing and migration between L1 vCPUs. Patch 8 introduces the nested context management infrastructure. Patch 9 integrates pfncache with persistent nested context. Patch 10 adds a selftest for this L2 vCPU context switching.

v2:
- Extended series to support enlightened VMCS (eVMCS).
- Added persistent nested context for improved L2 vCPU handling.
- Added additional selftests.

Suggested-by: dwmw(a)amazon.co.uk

Fred Griffoul (10):
  KVM: nVMX: Implement cache for L1 MSR bitmap
  KVM: pfncache: Restore guest-uses-pfn support
  KVM: x86: Add nested state validation for pfncache support
  KVM: nVMX: Implement cache for L1 APIC pages
  KVM: selftests: Add nested VMX APIC cache invalidation test
  KVM: nVMX: Cache evmcs fields to ensure consistency during VM-entry
  KVM: nVMX: Replace evmcs kvm_host_map with pfncache
  KVM: x86: Add nested context management
  KVM: nVMX: Use nested context for pfncache persistence
  KVM: selftests: Add L2 vcpu context switch test

 arch/x86/include/asm/kvm_host.h | 32 ++
 arch/x86/include/uapi/asm/kvm.h | 2 +
 arch/x86/kvm/Makefile | 2 +-
 arch/x86/kvm/nested.c | 199 ++++++++
 arch/x86/kvm/vmx/hyperv.c | 5 +-
 arch/x86/kvm/vmx/hyperv.h | 33 +-
 arch/x86/kvm/vmx/nested.c | 463 ++++++++++++++----
 arch/x86/kvm/vmx/vmx.c | 8 +
 arch/x86/kvm/vmx/vmx.h | 16 +-
 arch/x86/kvm/x86.c | 19 +-
 include/linux/kvm_host.h | 34 +-
 include/linux/kvm_types.h | 1 +
 tools/testing/selftests/kvm/Makefile.kvm | 2 +
 .../selftests/kvm/x86/vmx_apic_update_test.c | 302 ++++++++++++
 .../selftests/kvm/x86/vmx_l2_switch_test.c | 416 ++++++++++++++++
 virt/kvm/kvm_main.c | 3 +-
 virt/kvm/kvm_mm.h | 6 +-
 virt/kvm/pfncache.c | 43 +-
 18 files changed, 1467 insertions(+), 119 deletions(-)
 create mode 100644 arch/x86/kvm/nested.c
 create mode 100644 tools/testing/selftests/kvm/x86/vmx_apic_update_test.c
 create mode 100644 tools/testing/selftests/kvm/x86/vmx_l2_switch_test.c

--
2.43.0
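Why a persistent cache removes the per-entry cost can be seen in a toy model. This is a userspace sketch under loud assumptions: `struct toy_cache` and its helpers are hypothetical and far simpler than the kernel's gfn_to_pfn_cache; the remap counter merely stands in for the expensive memremap() path.

```c
#include <stdint.h>

/* Hypothetical toy cache; not the kernel's struct gfn_to_pfn_cache. */
struct toy_cache {
	int valid;
	uint64_t gpa;		/* guest physical address currently mapped */
	uint64_t generation;	/* memslot generation seen at map time */
	unsigned long remaps;	/* how many expensive remaps were needed */
};

/* Stand-in for the costly memremap() path. */
static void toy_remap(struct toy_cache *c, uint64_t gpa, uint64_t generation)
{
	c->gpa = gpa;
	c->generation = generation;
	c->valid = 1;
	c->remaps++;
}

/* Called on each simulated VM entry: reuse the mapping when still valid. */
void toy_cache_get(struct toy_cache *c, uint64_t gpa, uint64_t generation)
{
	if (c->valid && c->gpa == gpa && c->generation == generation)
		return;
	toy_remap(c, gpa, generation);
}

/* Stand-in for an mmu_notifier invalidation of the backing page. */
void toy_cache_invalidate(struct toy_cache *c)
{
	c->valid = 0;
}
```

A thousand entries against the same GPA and generation cost one remap in this model, while a memslot generation bump, an invalidation or a new GPA each force exactly one more.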

1 month, 3 weeks

2
13
0 0

[PATCH bpf-next v1 0/3] bpf: Fix FIONREAD and copied_seq issues

by Jiayuan Chen

syzkaller reported a bug [1] where a socket using sockmap, after being unloaded, exposed incorrect copied_seq calculation. The selftest I provided can be used to reproduce the issue reported by syzkaller. TCP recvmsg seq # bug 2: copied E92C873, seq E68D125, rcvnxt E7CEB7C, fl 40 WARNING: CPU: 1 PID: 5997 at net/ipv4/tcp.c:2724 tcp_recvmsg_locked+0xb2f/0x2910 net/ipv4/tcp.c:2724 Call Trace: <TASK> receive_fallback_to_copy net/ipv4/tcp.c:1968 [inline] tcp_zerocopy_receive+0x131a/0x2120 net/ipv4/tcp.c:2200 do_tcp_getsockopt+0xe28/0x26c0 net/ipv4/tcp.c:4713 tcp_getsockopt+0xdf/0x100 net/ipv4/tcp.c:4812 do_sock_getsockopt+0x34d/0x440 net/socket.c:2421 __sys_getsockopt+0x12f/0x260 net/socket.c:2450 __do_sys_getsockopt net/socket.c:2457 [inline] __se_sys_getsockopt net/socket.c:2454 [inline] __x64_sys_getsockopt+0xbd/0x160 net/socket.c:2454 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0xfa0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f A sockmap socket maintains its own receive queue (ingress_msg) which may contain data from either its own protocol stack or forwarded from other sockets. FD1:read() -- FD1->copied_seq++ | [read data] | [enqueue data] v [sockmap] -> ingress to self -> ingress_msg queue FD1 native stack ------> ^ -- FD1->rcv_nxt++ -> redirect to other | [enqueue data] | | | ingress to FD1 v ^ ... | [sockmap] FD2 native stack The issue occurs when reading from ingress_msg: we update tp->copied_seq by default, but if the data comes from other sockets (not the socket's own protocol stack), tcp->rcv_nxt remains unchanged. Later, when converting back to a native socket, reads may fail as copied_seq could be significantly larger than rcv_nxt. Additionally, FIONREAD calculation based on copied_seq and rcv_nxt is insufficient for sockmap sockets, requiring separate field tracking. 
[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983 Jiayuan Chen (3): bpf, sockmap: Fix incorrect copied_seq calculation bpf, sockmap: Fix FIONREAD for sockmap bpf, selftest: Add tests for FIONREAD and copied_seq include/linux/skmsg.h | 71 ++++++- net/core/skmsg.c | 20 +- net/ipv4/tcp_bpf.c | 26 ++- net/ipv4/udp_bpf.c | 25 ++- .../selftests/bpf/prog_tests/sockmap_basic.c | 192 +++++++++++++++++- .../bpf/progs/test_sockmap_pass_prog.c | 8 + 6 files changed, 325 insertions(+), 17 deletions(-) -- 2.43.0
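The sequence-number drift described above can be modelled in a few lines. The sketch is an assumption-laden illustration and not the series' implementation: `struct toy_sock`, `enum origin` and the helpers are hypothetical, and tracking the origin at read time is just one possible way to keep the invariant.

```c
#include <stdint.h>

enum origin { FROM_OWN_STACK, FROM_OTHER_SOCKET };

struct toy_sock {
	uint32_t rcv_nxt;	/* next sequence number expected from the peer */
	uint32_t copied_seq;	/* first sequence number not yet read by userspace */
};

/* Data lands in the ingress queue; only own-stack data advances rcv_nxt. */
void toy_enqueue(struct toy_sock *sk, enum origin from, uint32_t len)
{
	if (from == FROM_OWN_STACK)
		sk->rcv_nxt += len;
}

/* Behaviour described as buggy: every read advances copied_seq. */
void toy_read_always(struct toy_sock *sk, enum origin from, uint32_t len)
{
	(void)from;
	sk->copied_seq += len;
}

/* One way to keep the invariant: only own-stack data advances copied_seq. */
void toy_read_tracked(struct toy_sock *sk, enum origin from, uint32_t len)
{
	if (from == FROM_OWN_STACK)
		sk->copied_seq += len;
}

/* On a native socket, copied_seq must never run ahead of rcv_nxt. */
int toy_consistent(const struct toy_sock *sk)
{
	return (int32_t)(sk->rcv_nxt - sk->copied_seq) >= 0;
}
```

Once the socket leaves the sockmap, the native receive path assumes the invariant checked by `toy_consistent()`, which is where the warning in the report fires.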

1 month, 3 weeks

3
8

[PATCH bpf-next v2 0/2] selftests/bpf: networking test cleanups

by Hoyeon Lee

This series finishes the sockaddr_storage migration in the networking selftests by removing the remaining open-coded IPv4/IPv6 wrappers (addr_port/tuple in cls_redirect, sa46 in select_reuseport). The tests now use sockaddr_storage directly. No other custom socket-address wrappers remain after this series, so the churn stops here and behavior is unchanged. --- Changes in v2: - Drop the tuple wrapper entirely in cls_redirect and rely on ss_family - Limit the series to patches 1/2 (3/4 applied; 5 sent separately) Hoyeon Lee (2): selftests/bpf: use sockaddr_storage directly in cls_redirect test selftests/bpf: use sockaddr_storage instead of sa46 in select_reuseport test .../selftests/bpf/prog_tests/cls_redirect.c | 122 ++++++------------ .../bpf/prog_tests/select_reuseport.c | 67 +++++----- 2 files changed, 77 insertions(+), 112 deletions(-) -- 2.51.1

1 month, 3 weeks

3
4

[PATCH v2 0/4] KVM: selftests: Test SET_NESTED_STATE with 48-bit L2 on 57-bit L1

by Jim Mattson

Prior to commit 9245fd6b8531 ("KVM: x86: model canonical checks more precisely"), KVM_SET_NESTED_STATE would fail if the state was captured with L2 active, L1 had CR4.LA57 set, L2 did not, and the VMCS12.HOST_GSBASE (or other host-state field checked for canonicality) had an address greater than 48 bits wide. Add a regression test that reproduces the KVM_SET_NESTED_STATE failure conditions. To do so, the first three patches add support for 5-level paging in the selftest L1 VM. v1 -> v2 Ended the page walking loops before visiting 4K mappings [Yosry] Changed VM_MODE_PXXV48_4K into VM_MODE_PXXVYY_4K; use 5-level paging when possible [Sean] Removed the check for non-NULL vmx_pages in guest_code() [Yosry] Jim Mattson (4): KVM: selftests: Use a loop to create guest page tables KVM: selftests: Use a loop to walk guest page tables KVM: selftests: Change VM_MODE_PXXV48_4K to VM_MODE_PXXVYY_4K KVM: selftests: Add a VMX test for LA57 nested state tools/testing/selftests/kvm/Makefile.kvm | 1 + .../testing/selftests/kvm/include/kvm_util.h | 4 +- .../selftests/kvm/include/x86/processor.h | 2 +- .../selftests/kvm/lib/arm64/processor.c | 2 +- tools/testing/selftests/kvm/lib/kvm_util.c | 30 ++-- .../testing/selftests/kvm/lib/x86/processor.c | 80 +++++------ tools/testing/selftests/kvm/lib/x86/vmx.c | 6 +- .../kvm/x86/vmx_la57_nested_state_test.c | 134 ++++++++++++++++++ 8 files changed, 197 insertions(+), 62 deletions(-) create mode 100644 tools/testing/selftests/kvm/x86/vmx_la57_nested_state_test.c -- 2.51.1.851.g4ebd6896fd-goog
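The failure condition above turns on whether an address is canonical for a given virtual-address width. As a minimal sketch of that rule (not KVM's code): an address is canonical for an N-bit width when bits 63 down to N-1 are all copies of bit N-1. The helper name `is_canonical` and the sample `gs_base` value are illustrative assumptions, not taken from the series.

```python
def is_canonical(addr: int, va_bits: int) -> bool:
    """Return True if the 64-bit addr is canonical for va_bits of virtual address.

    Canonical means bits 63..(va_bits - 1) are all equal (all 0 or all 1).
    """
    addr &= (1 << 64) - 1                      # treat addr as an unsigned 64-bit value
    upper = addr >> (va_bits - 1)              # bits 63..(va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1   # same number of bits, all set
    return upper == 0 or upper == all_ones


# Hypothetical host GS base that is only canonical with 57-bit addresses.
gs_base = 0xFF00_0000_0000_0000

print(is_canonical(gs_base, 57))  # checked against the 5-level (LA57) width
print(is_canonical(gs_base, 48))  # checked against the 4-level width
```

With this sample value the 57-bit check passes and the 48-bit check fails, which is the kind of mismatch the cover letter describes when a host-state field is validated against the wrong width.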

1 month, 3 weeks

3
7

[PATCH v2 0/3] arm64/sme: Support disabling streaming mode via ptrace on SME only systems

by Mark Brown

Currently it is not possible to disable streaming mode via ptrace on SME only systems, the interface for doing this is to write via NT_ARM_SVE but such writes will be rejected on a system without SVE support. Enable this functionality by allowing userspace to write SVE_PT_REGS_FPSIMD format data via NT_ARM_SVE with the vector length set to 0 on SME only systems. Such writes currently error since we require that a vector length is specified which should minimise the risk that existing software is relying on current behaviour. Reads are not supported since I am not aware of any use case for this and there is some risk that an existing userspace application may be confused if it reads NT_ARM_SVE on a system without SVE. Existing kernels will return FPSIMD formatted register state from NT_ARM_SVE if full SVE state is not stored, for example if the task has not used SVE. Returning a vector length of 0 would create a risk that software could try to do things like allocate space for register state with zero sizes, while returning a vector length of 128 bits would look like SVE is supported. It seems safer to just not make the changes to add read support. It remains possible for userspace to detect a SME only system via the ptrace interface only since reads of NT_ARM_SSVE and NT_ARM_ZA will succeed while reads of NT_ARM_SVE will fail. Read/write access to the FPSIMD registers in non-streaming mode is available via REGSET_FPR. The aim is to make a minimally invasive change, no operation that would previously have succeeded will be affected, and we use a previously defined interface in new circumstances rather than define completely new ABI. 
Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Changes in v2: - Rebase onto v6.18-rc1 - Link to v1: https://lore.kernel.org/r/20250820-arm64-sme-ptrace-sme-only-v1-0-f7c22b287… --- Mark Brown (3): arm64/sme: Support disabling streaming mode via ptrace on SME only systems kselftst/arm64: Test NT_ARM_SVE FPSIMD format writes on non-SVE systems kselftest/arm64: Cover disabling streaming mode without SVE in fp-ptrace Documentation/arch/arm64/sve.rst | 5 +++ arch/arm64/kernel/ptrace.c | 40 +++++++++++++++--- tools/testing/selftests/arm64/fp/fp-ptrace.c | 5 +-- tools/testing/selftests/arm64/fp/sve-ptrace.c | 61 +++++++++++++++++++++++++++ 4 files changed, 100 insertions(+), 11 deletions(-) --- base-commit: cb6649f6217c0331b885cf787f1d175963e2a1d2 change-id: 20250717-arm64-sme-ptrace-sme-only-1fb850600ea0 Best regards, -- Mark Brown <broonie(a)kernel.org>

1 month, 3 weeks

5
7

[PATCH v4 0/9] introduce VM_MAYBE_GUARD and make it sticky

by Lorenzo Stoakes

Currently, guard regions are not visible to users except through /proc/$pid/pagemap, with no explicit visibility at the VMA level. This makes the feature less useful, as it isn't entirely apparent which VMAs may have these entries present, especially when performing actions which walk through memory regions such as those performed by CRIU. This series addresses this issue by introducing the VM_MAYBE_GUARD flag which fulfils this role, updating the smaps logic to display an entry for these. The semantics of this flag are that a guard region MAY be present if set (we cannot be sure, as we can't efficiently track whether an MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if not set the VMA definitely does NOT have any guard regions present. It's problematic to establish this flag without further action, because that means that VMAs with guard regions in them become non-mergeable with adjacent VMAs for no especially good reason. To work around this, this series also introduces the concept of 'sticky' VMA flags - that is flags which: a. if set in one VMA and not in another still permit those VMAs to be merged (if otherwise compatible). b. When they are merged, the resultant VMA must have the flag set. The VMA logic is updated to propagate these flags correctly. Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve an issue with file-backed guard regions - previously these established an anon_vma object for file-backed mappings solely to have vma_needs_copy() correctly propagate guard region mappings to child processes. We introduce a new flag alias VM_COPY_ON_FORK (which currently only specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly for this flag and to copy page tables if it is present, which resolves this issue. Additionally, we add the ability for allow-listed VMA flags to be atomically writable with only mmap/VMA read locks held. 
The only flag we allow so far is VM_MAYBE_GUARD, which we carefully ensure does not cause any races by being allowed to do so. This allows us to maintain guard region installation as a read-locked operation and not endure the overhead of obtaining a write lock here. Finally we introduce extensive VMA userland tests to assert that the sticky VMA logic behaves correctly as well as guard region self tests to assert that smaps visibility is correctly implemented. v4: * Propagated tags, thanks all! * Folded all fixups into series (thanks to Andrew for his patience with these :) * Added patch to correct an issue raised by Pedro - we can't unconditionally set newflags |= vma->vm_flags because on split/noop we're overwriting them. * In new patch, corrected horrible formatting of vma_modify_*() while we are here. * In new patch, added kdoc as 3 kernel developers, including the author of the code (!!) have been confused by this. Make explicitly clear what each does. * In new patch, make vm_flags_ptr parameter a pointer for vma_modify_flags, and have the function correctly update the flags on merge, abstracting this mess somewhat and avoiding case-by-case open-coding of the fix. Describe clearly what's going on in the kdoc. * Fixed typo reported by Jane and Liam, I must have been very tired... :) * When introducing the new patch, we couldn't reference sticky VMA flags yet as the concept had not yet been introduced. So update the patch that introduces sticky flags to change the comments to reference the concept now established. v3: * Propagated tags thanks Vlastimil & Pedro! :) * Fixed doc nit as per Pedro. * Added vma_flag_test_atomic() in preparation for fixing retract_page_tables() (see below). We make this not require any locks, as we serialise on the page table lock in retract_page_tables(). 
* Split the atomic flag enablement and actually setting the flag for guard install into two separate commits so we clearly separate the various VMA flag implementation details and us enabling this feature. * Mentioned setting anon_vma for anonymous mappings in commit message as per Vlastimil. * Fixed an issue with retract_page_tables() whereby madvise(..., MADV_COLLAPSE) relies upon file-backed VMAs not being collapsed due to the UFFD WP VMA flag being set or the VMA having vma->anon_vma set (i.e. being a MAP_PRIVATE file-backed VMA). This was updated to also check for VM_MAYBE_GUARD. * Introduced MADV_COLLAPSE self test to assert that the behaviour is correct. I first reproduced the issue locally and then adapted the test to assert that this no longer occurs. * Mentioned KCSAN permissiveness in commit message as per Pedro. * Mentioned mmap/VMA read lock excluding mmap/VMA write lock and thus avoiding meaningful RMW races in commit message as per Vlastimil. * Mentioned previous unconditional vma->anon_vma installation on guard region installation as per Vlastimil. * Avoided having merging compromised by reordering patches such that the sticky VMA functionality is implemented prior to VM_MAYBE_GUARD being utilised upon guard region installation, rendering Vlastimil's request to mention this in a commit message unnecessary. * Separated out sticky and copy on fork patches as per Pedro. * Added VM_PFNMAP, VM_MIXEDMAP, VM_UFFD_WP to VM_COPY_ON_FORK to make things more consistent and clean. * Added mention of why generally VM_STICKY should be VM_COPY_ON_FORK in copy on fork patch. https://lore.kernel.org/all/cover.1762531708.git.lorenzo.stoakes@oracle.com/ v2: * Separated out userland VMA tests for sticky behaviour as per Suren. * Added the concept of atomic writable VMA flags as per Pedro and Vlastimil. * Made VM_MAYBE_GUARD an atomic writable flag so we don't have to take a VMA write lock in madvise() as per Pedro and Vlastimil. 
https://lore.kernel.org/all/cover.1762422915.git.lorenzo.stoakes@oracle.com/ v1: https://lore.kernel.org/all/cover.1761756437.git.lorenzo.stoakes@oracle.com/ Lorenzo Stoakes (9): mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps mm: add atomic VMA flags and set VM_MAYBE_GUARD as such mm: update vma_modify_flags() to handle residual flags, document mm: implement sticky VMA flags mm: introduce copy-on-fork VMAs and make VM_MAYBE_GUARD one mm: set the VM_MAYBE_GUARD flag on guard region install tools/testing/vma: add VMA sticky userland tests tools/testing/selftests/mm: add MADV_COLLAPSE test case tools/testing/selftests/mm: add smaps visibility guard region test Documentation/filesystems/proc.rst | 5 +- fs/proc/task_mmu.c | 1 + include/linux/mm.h | 101 +++++++++++ include/trace/events/mmflags.h | 1 + mm/khugepaged.c | 71 +++++--- mm/madvise.c | 24 ++- mm/memory.c | 14 +- mm/mlock.c | 2 +- mm/mprotect.c | 2 +- mm/mseal.c | 9 +- mm/vma.c | 78 +++++---- mm/vma.h | 138 +++++++++++---- tools/testing/selftests/mm/guard-regions.c | 185 +++++++++++++++++++++ tools/testing/selftests/mm/vm_util.c | 5 + tools/testing/selftests/mm/vm_util.h | 1 + tools/testing/vma/vma.c | 92 ++++++++-- tools/testing/vma/vma_internal.h | 55 ++++++ 17 files changed, 650 insertions(+), 134 deletions(-) -- 2.51.2
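The two rules for 'sticky' VMA flags described in this cover letter (a differing sticky flag does not block a merge, and the merged VMA keeps the flag) can be sketched as plain bit arithmetic. This is only an illustration under assumptions: the bit value chosen for `VM_MAYBE_GUARD` and the helper names `can_merge` and `merged_flags` are hypothetical, and the real merge logic in mm/vma.c checks far more than flags.

```python
VM_READ = 0x1
VM_WRITE = 0x2
VM_MAYBE_GUARD = 0x100        # illustrative bit value, not the kernel's
VM_STICKY = VM_MAYBE_GUARD    # set of flags treated as sticky in this sketch


def can_merge(flags_a: int, flags_b: int) -> bool:
    """Rule (a): flags are compatible if they match once sticky bits are ignored."""
    return (flags_a & ~VM_STICKY) == (flags_b & ~VM_STICKY)


def merged_flags(flags_a: int, flags_b: int) -> int:
    """Rule (b): the merged VMA carries a sticky flag if either side had it."""
    if not can_merge(flags_a, flags_b):
        raise ValueError("flags differ in non-sticky bits; VMAs cannot merge")
    return flags_a | (flags_b & VM_STICKY)


guarded = VM_READ | VM_WRITE | VM_MAYBE_GUARD
plain = VM_READ | VM_WRITE
print(hex(merged_flags(plain, guarded)))  # result keeps VM_MAYBE_GUARD set
```

Propagating the flag on merge is what keeps the "MAY be present" guarantee intact: once any part of the merged range might contain guard regions, the whole VMA has to advertise it.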

1 month, 3 weeks

6
19

[PATCH bpf-next v3 0/3] bpf: Fix FIONREAD and copied_seq issues

by Jiayuan Chen

syzkaller reported a bug [1] where a socket using sockmap, after being unloaded, exposed incorrect copied_seq calculation. The selftest I provided can be used to reproduce the issue reported by syzkaller.

TCP recvmsg seq # bug 2: copied E92C873, seq E68D125, rcvnxt E7CEB7C, fl 40
WARNING: CPU: 1 PID: 5997 at net/ipv4/tcp.c:2724 tcp_recvmsg_locked+0xb2f/0x2910 net/ipv4/tcp.c:2724
Call Trace:
 <TASK>
 receive_fallback_to_copy net/ipv4/tcp.c:1968 [inline]
 tcp_zerocopy_receive+0x131a/0x2120 net/ipv4/tcp.c:2200
 do_tcp_getsockopt+0xe28/0x26c0 net/ipv4/tcp.c:4713
 tcp_getsockopt+0xdf/0x100 net/ipv4/tcp.c:4812
 do_sock_getsockopt+0x34d/0x440 net/socket.c:2421
 __sys_getsockopt+0x12f/0x260 net/socket.c:2450
 __do_sys_getsockopt net/socket.c:2457 [inline]
 __se_sys_getsockopt net/socket.c:2454 [inline]
 __x64_sys_getsockopt+0xbd/0x160 net/socket.c:2454
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xcd/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

A sockmap socket maintains its own receive queue (ingress_msg) which may contain data from either its own protocol stack or forwarded from other sockets.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

The issue occurs when reading from ingress_msg: we update tp->copied_seq by default, but if the data comes from other sockets (not the socket's own protocol stack), tcp->rcv_nxt remains unchanged. Later, when converting back to a native socket, reads may fail as copied_seq could be significantly larger than rcv_nxt. Additionally, FIONREAD calculation based on copied_seq and rcv_nxt is insufficient for sockmap sockets, requiring separate field tracking.
[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983 --- v1 -> v3: Use skmsg.sk instead of extending BPF_F_XXX macro and fix CI failure reported by ci v1: https://lore.kernel.org/bpf/20251117110736.293040-1-jiayuan.chen@linux.dev/ Jiayuan Chen (3): bpf, sockmap: Fix incorrect copied_seq calculation bpf, sockmap: Fix FIONREAD for sockmap bpf, selftest: Add tests for FIONREAD and copied_seq include/linux/skmsg.h | 48 ++++- net/core/skmsg.c | 28 ++- net/ipv4/tcp_bpf.c | 26 ++- net/ipv4/udp_bpf.c | 25 ++- .../selftests/bpf/prog_tests/sockmap_basic.c | 203 +++++++++++++++++- .../bpf/progs/test_sockmap_pass_prog.c | 8 + 6 files changed, 322 insertions(+), 16 deletions(-) -- 2.43.0
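The counter drift described in this cover letter can be reproduced with a toy model. The sketch below is not the kernel code: the class `SockmapSocket`, its methods, and the `track_origin` switch are hypothetical names used only to show why advancing copied_seq for redirected data lets it run past rcv_nxt, and how remembering each message's origin avoids that.

```python
from collections import deque


class SockmapSocket:
    """Toy model of one TCP socket attached to a sockmap (illustrative only)."""

    def __init__(self):
        self.copied_seq = 0
        self.rcv_nxt = 0
        self.ingress_msg = deque()  # entries are (nbytes, came_from_own_stack)

    def rx_native(self, nbytes):
        # Data from the socket's own protocol stack advances rcv_nxt.
        self.rcv_nxt += nbytes
        self.ingress_msg.append((nbytes, True))

    def rx_redirected(self, nbytes):
        # Data redirected from another socket never touches this rcv_nxt.
        self.ingress_msg.append((nbytes, False))

    def read_all(self, track_origin):
        total = 0
        while self.ingress_msg:
            nbytes, own = self.ingress_msg.popleft()
            # Without origin tracking every byte read bumps copied_seq.
            if own or not track_origin:
                self.copied_seq += nbytes
            total += nbytes
        return total


def run(track_origin):
    s = SockmapSocket()
    s.rx_native(100)       # 100 bytes from FD1's own stack
    s.rx_redirected(500)   # 500 bytes redirected from FD2
    s.read_all(track_origin)
    return s.copied_seq, s.rcv_nxt


print(run(track_origin=False))  # copied_seq overshoots rcv_nxt
print(run(track_origin=True))   # counters stay consistent
```

In the untracked run copied_seq ends at 600 while rcv_nxt is 100, which is the inconsistent state a later native read would trip over once the socket leaves the sockmap.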

1 month, 3 weeks

2
4

[PATCH 0/9] Initial DMABUF support for iommufd

by Jason Gunthorpe

This series is the start of adding full DMABUF support to iommufd. Currently it is limited to only work with VFIO's DMABUF exporter. It sits on top of Leon's series to add a DMABUF exporter to VFIO: https://lore.kernel.org/r/20251106-dmabuf-vfio-v7-0-2503bf390699@nvidia.com The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fd's, but otherwise works the same as it does today for a memfd. The user can select a slice of the FD to map into the ioas and if the underlying alignment requirements are met it will be placed in the iommu_domain. Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR memory from VFIO to an iommu_domain controlled by iommufd. This is used for PCI Peer to Peer support in VMs, and is the last feature that the VFIO type 1 container has that iommufd couldn't do. The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime control and is a use-after-free security problem. Instead iommufd relies on revokable DMABUFs. Whenever VFIO thinks there should be no access to the MMIO it can shoot down the mapping in iommufd which will unmap it from the iommu_domain. There is no automatic remap, this is a safety protocol so the kernel doesn't get stuck. Userspace is expected to know it is doing something that will revoke the dmabuf and map/unmap it around the activity. Eg when QEMU goes to issue FLR it should do the map/unmap to iommufd. Since DMABUF is missing some key general features for this use case it relies on a "private interconnect" between VFIO and iommufd via the vfio_pci_dma_buf_iommufd_map() call. The call confirms the DMABUF has revoke semantics and delivers a phys_addr for the memory suitable for use with iommu_map(). Medium term there is a desire to expand the supported DMABUFs to include GPU drivers to support DPDK/SPDK type use cases so future series will work to add a general concept of revoke and a general negotiation of interconnect to remove vfio_pci_dma_buf_iommufd_map(). 
I also plan another series to modify iommufd's vfio_compat to transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI of type1. The latest series for interconnect negotiation to exchange a phys_addr is: https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com And the discussion for design of revoke is here: https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/ This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf v2: - Rebase on Leon's v7 - Fix mislocking in an iopt_fill_domain() error path v1: https://patch.msgid.link/r/0-v1-64bed2430cdb+31b-iommufd_dmabuf_jgg@nvidia.… Jason Gunthorpe (9): vfio/pci: Add vfio_pci_dma_buf_iommufd_map() iommufd: Add DMABUF to iopt_pages iommufd: Do not map/unmap revoked DMABUFs iommufd: Allow a DMABUF to be revoked iommufd: Allow MMIO pages in a batch iommufd: Have pfn_reader process DMABUF iopt_pages iommufd: Have iopt_map_file_pages convert the fd to a file iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE iommufd/selftest: Add some tests for the dmabuf flow drivers/iommu/iommufd/io_pagetable.c | 78 +++- drivers/iommu/iommufd/io_pagetable.h | 53 ++- drivers/iommu/iommufd/ioas.c | 8 +- drivers/iommu/iommufd/iommufd_private.h | 14 +- drivers/iommu/iommufd/iommufd_test.h | 10 + drivers/iommu/iommufd/main.c | 10 + drivers/iommu/iommufd/pages.c | 407 ++++++++++++++++-- drivers/iommu/iommufd/selftest.c | 142 ++++++ drivers/vfio/pci/vfio_pci_dmabuf.c | 34 ++ include/linux/vfio_pci_core.h | 4 + tools/testing/selftests/iommu/iommufd.c | 43 ++ tools/testing/selftests/iommu/iommufd_utils.h | 44 ++ 12 files changed, 781 insertions(+), 66 deletions(-) base-commit: bb04e92c86b44b3e36532099b68de1e889acfee7 -- 2.43.0

1 month, 3 weeks

6
44

[RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults

by Mike Rapoport

From: "Mike Rapoport (Microsoft)" <rppt(a)kernel.org> Hi, These patches allow guest_memfd to notify userspace about minor page faults using userfaultfd and let userspace resolve these page faults using UFFDIO_CONTINUE. To allow UFFDIO_CONTINUE outside of the core mm I added a get_pagecache_folio() callback to vm_ops that allows an address space backing a VMA to return a folio that exists in its page cache (patch 2). In order for guest_memfd to notify userspace about page faults, it has to call handle_userfault() and since guest_memfd may be a part of kvm module, handle_userfault() is exported for kvm module (patch 3). Note that patch 3 changelog does not provide motivation for enabling uffd in guest_memfd, mainly because I can't say I understand why that is required :) Would be great to hear from KVM folks about it. This series is the minimal change I've been able to come up with to allow integration of guest_memfd with uffd and while refactoring uffd and making mfill_atomic() flow more linear would have been a nice improvement, it's way out of the scope of enabling uffd with guest_memfd. Mike Rapoport (Microsoft) (3): userfaultfd: move vma_can_userfault out of line userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE userfaultfd, guest_memfd: support userfault minor mode in guest_memfd Nikita Kalyazin (1): KVM: selftests: test userfaultfd minor for guest_memfd fs/userfaultfd.c | 4 +- include/linux/mm.h | 9 ++ include/linux/userfaultfd_k.h | 36 +----- include/uapi/linux/userfaultfd.h | 8 +- mm/shmem.c | 20 ++++ mm/userfaultfd.c | 88 ++++++++++++--- .../testing/selftests/kvm/guest_memfd_test.c | 103 ++++++++++++++++++ virt/kvm/guest_memfd.c | 30 +++++ 8 files changed, 245 insertions(+), 53 deletions(-) base-commit: 6146a0f1dfae5d37442a9ddcba012add260bceb0 -- 2.50.1

1 month, 3 weeks

4
12

[PATCH v3 00/10] KVM: nVMX: Improve performance for unmanaged guest memory

by Fred Griffoul

From: Fred Griffoul <fgriffo(a)amazon.co.uk> This patch series addresses both performance and correctness issues in nested VMX when handling guest memory. During nested VMX operations, L0 (KVM) accesses specific L1 guest pages to manage L2 execution. These pages fall into two categories: pages accessed only by L0 (such as the L1 MSR bitmap page or the eVMCS page), and pages passed to the L2 guest via vmcs02 (such as APIC access, virtual APIC, and posted interrupt descriptor pages). The current implementation uses kvm_vcpu_map/unmap, which causes two issues. First, the current approach is missing proper invalidation handling in critical scenarios. Enlightened VMCS (eVMCS) pages can become stale when memslots are modified, as there is no mechanism to invalidate the cached mappings. Similarly, APIC access and virtual APIC pages can be migrated by the host, but without proper notification through mmu_notifier callbacks, the mappings become invalid and can lead to incorrect behavior. Second, for unmanaged guest memory (memory not directly mapped by the kernel, such as memory passed with the mem= parameter or guest_memfd for non-CoCo VMs), this workflow invokes expensive memremap/memunmap operations on every L2 VM entry/exit cycle. This creates significant overhead that impacts nested virtualization performance. This series replaces kvm_host_map with gfn_to_pfn_cache in nested VMX. The pfncache infrastructure maintains persistent mappings as long as the page GPA does not change, eliminating the memremap/memunmap overhead on every VM entry/exit cycle. Additionally, pfncache provides proper invalidation handling via mmu_notifier callbacks and memslots generation check, ensuring that mappings are correctly updated during both memslot updates and page migration events. 
As an example, a microbenchmark using memslot_perf_test with 8192 memslots demonstrates huge improvements in nested VMX operations with unmanaged guest memory (this is a synthetic benchmark run on AWS EC2 Nitro instances, and the results are not representative of typical nested virtualization workloads):

                   Before    After     Improvement
  map:             26.12s    1.54s     ~17x faster
  unmap:           40.00s    0.017s    ~2353x faster
  unmap chunked:   10.07s    0.005s    ~2014x faster

The series is organized as follows: Patches 1-5 handle the L1 MSR bitmap page and system pages (APIC access, virtual APIC, and posted interrupt descriptor). Patch 1 converts the MSR bitmap to use gfn_to_pfn_cache. Patches 2-3 restore and complete "guest-uses-pfn" support in pfncache. Patch 4 converts the system pages to use gfn_to_pfn_cache. Patch 5 adds a selftest for cache invalidation and memslot updates. Patches 6-7 add enlightened VMCS support. Patch 6 avoids accessing eVMCS fields after they are copied into the cached vmcs12 structure. Patch 7 converts eVMCS page mapping to use gfn_to_pfn_cache. Patches 8-10 implement persistent nested context to handle L2 vCPU multiplexing and migration between L1 vCPUs. Patch 8 introduces the nested context management infrastructure. Patch 9 integrates pfncache with persistent nested context. Patch 10 adds a selftest for this L2 vCPU context switching. v3: - fixed warnings reported by kernel test robot in patches 7 and 8. v2: - Extended series to support enlightened VMCS (eVMCS). - Added persistent nested context for improved L2 vCPU handling. - Added additional selftests. 
Suggested-by: dwmw(a)amazon.co.uk Fred Griffoul (10): KVM: nVMX: Implement cache for L1 MSR bitmap KVM: pfncache: Restore guest-uses-pfn support KVM: x86: Add nested state validation for pfncache support KVM: nVMX: Implement cache for L1 APIC pages KVM: selftests: Add nested VMX APIC cache invalidation test KVM: nVMX: Cache evmcs fields to ensure consistency during VM-entry KVM: nVMX: Replace evmcs kvm_host_map with pfncache KVM: x86: Add nested context management KVM: nVMX: Use nested context for pfncache persistence KVM: selftests: Add L2 vcpu context switch test arch/x86/include/asm/kvm_host.h | 32 ++ arch/x86/include/uapi/asm/kvm.h | 2 + arch/x86/kvm/Makefile | 2 +- arch/x86/kvm/nested.c | 199 ++++++++ arch/x86/kvm/vmx/hyperv.c | 5 +- arch/x86/kvm/vmx/hyperv.h | 33 +- arch/x86/kvm/vmx/nested.c | 469 ++++++++++++++---- arch/x86/kvm/vmx/vmx.c | 8 + arch/x86/kvm/vmx/vmx.h | 16 +- arch/x86/kvm/x86.c | 19 +- include/linux/kvm_host.h | 34 +- include/linux/kvm_types.h | 1 + tools/testing/selftests/kvm/Makefile.kvm | 2 + .../selftests/kvm/x86/vmx_apic_update_test.c | 302 +++++++++++ .../selftests/kvm/x86/vmx_l2_switch_test.c | 416 ++++++++++++++++ virt/kvm/kvm_main.c | 3 +- virt/kvm/kvm_mm.h | 6 +- virt/kvm/pfncache.c | 43 +- 18 files changed, 1469 insertions(+), 123 deletions(-) create mode 100644 arch/x86/kvm/nested.c create mode 100644 tools/testing/selftests/kvm/x86/vmx_apic_update_test.c create mode 100644 tools/testing/selftests/kvm/x86/vmx_l2_switch_test.c base-commit: 6b36119b94d0b2bb8cea9d512017efafd461d6ac prerequisite-patch-id: afd3db49735b65c8a642de8dab7d0160d5da4b67 -- 2.43.0
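The improvement factors quoted for the memslot_perf_test microbenchmark are the before/after ratios of the reported timings. A small sketch recomputing them from the figures in the cover letter (the dictionary layout and the `speedup` helper are illustrative, not part of the series):

```python
# (before, after) timings in seconds, as quoted in the cover letter.
timings = {
    "map": (26.12, 1.54),
    "unmap": (40.00, 0.017),
    "unmap chunked": (10.07, 0.005),
}


def speedup(before: float, after: float) -> float:
    """Ratio of the old time to the new time."""
    return before / after


for name, (before, after) in timings.items():
    print(f"{name}: ~{round(speedup(before, after))}x faster")
```

The rounded ratios match the quoted ~17x, ~2353x and ~2014x, so the table is internally consistent; as the cover letter notes, these are synthetic numbers and not representative of typical workloads.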

1 month, 3 weeks

1
10

[PATCH bpf-next v2 0/3] bpf: Fix FIONREAD and copied_seq issues

by Jiayuan Chen

syzkaller reported a bug [1] where a socket using sockmap, after being unloaded, exposed incorrect copied_seq calculation. The selftest I provided can be used to reproduce the issue reported by syzkaller.

TCP recvmsg seq # bug 2: copied E92C873, seq E68D125, rcvnxt E7CEB7C, fl 40
WARNING: CPU: 1 PID: 5997 at net/ipv4/tcp.c:2724 tcp_recvmsg_locked+0xb2f/0x2910 net/ipv4/tcp.c:2724
Call Trace:
 <TASK>
 receive_fallback_to_copy net/ipv4/tcp.c:1968 [inline]
 tcp_zerocopy_receive+0x131a/0x2120 net/ipv4/tcp.c:2200
 do_tcp_getsockopt+0xe28/0x26c0 net/ipv4/tcp.c:4713
 tcp_getsockopt+0xdf/0x100 net/ipv4/tcp.c:4812
 do_sock_getsockopt+0x34d/0x440 net/socket.c:2421
 __sys_getsockopt+0x12f/0x260 net/socket.c:2450
 __do_sys_getsockopt net/socket.c:2457 [inline]
 __se_sys_getsockopt net/socket.c:2454 [inline]
 __x64_sys_getsockopt+0xbd/0x160 net/socket.c:2454
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xcd/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

A sockmap socket maintains its own receive queue (ingress_msg) which may contain data from either its own protocol stack or forwarded from other sockets.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

The issue occurs when reading from ingress_msg: we update tp->copied_seq by default, but if the data comes from other sockets (not the socket's own protocol stack), tcp->rcv_nxt remains unchanged. Later, when converting back to a native socket, reads may fail as copied_seq could be significantly larger than rcv_nxt. Additionally, FIONREAD calculation based on copied_seq and rcv_nxt is insufficient for sockmap sockets, requiring separate field tracking.
[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983 --- v1 -> v2: Use skmsg.sk instead of extending BPF_F_XXX macro v1: https://lore.kernel.org/bpf/20251117110736.293040-1-jiayuan.chen@linux.dev/ Jiayuan Chen (3): bpf, sockmap: Fix incorrect copied_seq calculation bpf, sockmap: Fix FIONREAD for sockmap bpf, selftest: Add tests for FIONREAD and copied_seq include/linux/skmsg.h | 48 ++++- net/core/skmsg.c | 29 ++- net/ipv4/tcp_bpf.c | 26 ++- net/ipv4/udp_bpf.c | 25 ++- .../selftests/bpf/prog_tests/sockmap_basic.c | 203 +++++++++++++++++- .../bpf/progs/test_sockmap_pass_prog.c | 8 + 6 files changed, 323 insertions(+), 16 deletions(-) -- 2.43.0

1 month, 3 weeks

3
5

[PATCH net-next v3 0/4] netconsole: Allow userdata buffer to grow dynamically

by Gustavo Luiz Duarte

The current netconsole implementation allocates a static buffer for extradata (userdata + sysdata) with a fixed size of MAX_EXTRADATA_ENTRY_LEN * MAX_EXTRADATA_ITEMS bytes for every target, regardless of whether userspace actually uses this feature. This forces us to keep MAX_EXTRADATA_ITEMS small (16), which is restrictive for users who need to attach more metadata to their log messages. This patch series enables dynamic allocation of the userdata buffer, allowing it to grow on-demand based on actual usage. The series: 1. Refactors send_fragmented_body() to simplify handling of separated userdata and sysdata (patch 1/4) 2. Splits userdata and sysdata into separate buffers (patch 2/4) 3. Implements dynamic allocation for the userdata buffer (patch 3/4) 4. Increases MAX_USERDATA_ITEMS from 16 to 256 now that we can do so without memory waste (patch 4/4) Benefits: - No memory waste when userdata is not used - Targets that use userdata only consume what they need - Users can attach significantly more metadata without impacting systems that don't use this feature Signed-off-by: Gustavo Luiz Duarte <gustavold(a)gmail.com> --- Changes in v3: - Split calculating the length of the formatted userdata string into a separate function calc_userdata_len(). - Exit update_userdata() immediately if we hit WARN due to too many userdata entries. 
- Use offset instead of len to save userdata_length in update_userdata() - Link to v2: https://lore.kernel.org/r/20251113-netconsole_dynamic_extradata-v2-0-18cf7f… Changes in v2: - Added null pointer checks for userdata and sysdata buffers - Added MAX_SYSDATA_ITEMS to enum sysdata_feature - Moved code out of ifdef in send_msg_no_fragmentation() - Renamed variables in send_fragmented_body() to make it easier to reason about the code - Link to v1: https://lore.kernel.org/r/20251105-netconsole_dynamic_extradata-v1-0-142890… --- Gustavo Luiz Duarte (4): netconsole: Simplify send_fragmented_body() netconsole: Split userdata and sysdata netconsole: Dynamic allocation of userdata buffer netconsole: Increase MAX_USERDATA_ITEMS drivers/net/netconsole.c | 386 +++++++++++---------- .../selftests/drivers/net/netcons_overflow.sh | 2 +- 2 files changed, 195 insertions(+), 193 deletions(-) --- base-commit: 45a1cd8346ca245a1ca475b26eb6ceb9d8b7c6f0 change-id: 20251007-netconsole_dynamic_extradata-21bd9d726568 Best regards, -- Gustavo Duarte <gustavold(a)meta.com>
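To illustrate the memory argument in this cover letter, the sketch below compares a fixed per-target buffer with a buffer sized from the entries actually present. It rests on stated assumptions: the value 256 for `MAX_EXTRADATA_ENTRY_LEN` is assumed (the cover letter gives only the item count of 16), the " key=value\n" entry format is assumed, and `static_bytes` and `userdata_len` are hypothetical helpers rather than the driver's functions.

```python
MAX_EXTRADATA_ENTRY_LEN = 256  # assumed value; not stated in the cover letter
MAX_EXTRADATA_ITEMS = 16       # item limit quoted in the cover letter


def static_bytes(n_targets: int) -> int:
    """Fixed allocation: every target pays for the full buffer."""
    return n_targets * MAX_EXTRADATA_ENTRY_LEN * MAX_EXTRADATA_ITEMS


def userdata_len(entries: dict) -> int:
    """Bytes needed for one target, assuming a ' key=value\\n' line per entry."""
    return sum(len(f" {key}={value}\n") for key, value in entries.items())


targets = [{}, {}, {"host": "a1"}]  # only one of three targets uses userdata
print(static_bytes(len(targets)))               # bytes reserved up front
print(sum(userdata_len(t) for t in targets))    # bytes needed on demand
```

Under these assumptions three targets reserve 12288 bytes statically while needing 9, which is why raising the item limit only becomes affordable once the buffer grows with actual usage.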

1 month, 3 weeks

3
6

[PATCH net-next 0/6] selftests: drv-net: Fix issues in devlink_rate_tc_bw.py

by Carolina Jubran

Hi, This series fixes issues in the devlink_rate_tc_bw.py selftest and introduces a new Iperf3Runner that helps with measurement handling. Thanks, Carolina Carolina Jubran (6): selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS selftests: drv-net: introduce Iperf3Runner for measurement use cases selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py .../testing/selftests/drivers/net/hw/Makefile | 1 + .../drivers/net/hw/devlink_rate_tc_bw.py | 174 ++++++++---------- .../drivers/net/hw/lib/py/__init__.py | 5 +- .../selftests/drivers/net/lib/py/__init__.py | 5 +- .../selftests/drivers/net/lib/py/load.py | 84 ++++++++- 5 files changed, 157 insertions(+), 112 deletions(-) -- 2.38.1

1 month, 3 weeks


[PATCH net-next v3 00/12] selftests: drv-net: convert GRO and Toeplitz tests to work for drivers in NIPA

by Jakub Kicinski

Main objective of this series is to convert the gro.sh and toeplitz.sh tests to be "NIPA-compatible" - meaning make use of the Python env, which lets us run the tests against either netdevsim or a real device. The tests seem to have been written with a different flow in mind. Namely they source different bash "setup" scripts depending on arguments passed to the test. While I have nothing against the use of bash and the overall architecture - the existing code needs quite a bit of work (don't assume MAC/IP addresses, support remote endpoint over SSH). If I'm the one fixing it, I'd rather convert them to our "simplistic" Python. This series rewrites the tests in Python while addressing their shortcomings. The functionality of running the test over loopback on a real device is retained but with a different method of invocation (see the last patch). Once again we are dealing with a script which runs over a variety of protocols (combination of [ipv4, ipv6, ipip] x [tcp, udp]). The first 4 patches add support for test variants to our scripts. We use the term "variant" in the same sense as the C kselftest_harness.h - variant is just a set of static input arguments. Note that neither GRO nor the Toeplitz test fully passes for me on any HW I have access to. But this is unrelated to the conversion. This series is not making any real functional changes to the tests, it is limited to improving the "test harness" scripts. 
v3:
 - [patch 1] Exception -> BaseException
 - [patch 3] use named tuple instead of attaching attrs directly to a func
 - [patch 9] restore the comment about retries in GRO test
 - [patch 10] use open() instead of echo
 - [patch 10] move MTU changes to _setup() to handle all the config related stuff in that function
v2: https://lore.kernel.org/20251118215126.2225826-1-kuba@kernel.org
 - [patch 5] fix accidental modification of gitignore
 - [patch 8] fix typo in "compared"
 - [patch 9] fix typo I -> It
 - [patch 10] fix typo configure -> configured
v1: https://lore.kernel.org/20251117205810.1617533-1-kuba@kernel.org

Jakub Kicinski (12):
  selftests: net: py: coding style improvements
  selftests: net: py: extract the case generation logic
  selftests: net: py: add test variants
  selftests: drv-net: xdp: use variants for qstat tests
  selftests: net: relocate gro and toeplitz tests to drivers/net
  selftests: net: py: support ksft ready without wait
  selftests: net: py: read ip link info about remote dev
  netdevsim: pass packets thru GRO on Rx
  selftests: drv-net: add a Python version of the GRO test
  selftests: drv-net: hw: convert the Toeplitz test to Python
  netdevsim: add loopback support
  selftests: net: remove old setup_* scripts

 tools/testing/selftests/drivers/net/Makefile | 2 +
 .../testing/selftests/drivers/net/hw/Makefile | 6 +-
 tools/testing/selftests/net/Makefile | 7 -
 tools/testing/selftests/net/lib/Makefile | 1 +
 drivers/net/netdevsim/netdev.c | 26 ++-
 .../testing/selftests/{ => drivers}/net/gro.c | 5 +-
 .../{net => drivers/net/hw}/toeplitz.c | 7 +-
 .../testing/selftests/drivers/net/.gitignore | 1 +
 tools/testing/selftests/drivers/net/gro.py | 164 ++++++++++++++
 .../selftests/drivers/net/hw/.gitignore | 1 +
 .../drivers/net/hw/lib/py/__init__.py | 4 +-
 .../selftests/drivers/net/hw/toeplitz.py | 209 ++++++++++++++++++
 .../selftests/drivers/net/lib/py/__init__.py | 4 +-
 .../selftests/drivers/net/lib/py/env.py | 2 +
 tools/testing/selftests/drivers/net/xdp.py | 42 ++--
 tools/testing/selftests/net/.gitignore | 2 -
 tools/testing/selftests/net/gro.sh | 105 ---------
 .../selftests/net/lib/ksft_setup_loopback.sh | 111 ++++++++++
 .../testing/selftests/net/lib/py/__init__.py | 5 +-
 tools/testing/selftests/net/lib/py/ksft.py | 91 ++++++--
 tools/testing/selftests/net/lib/py/nsim.py | 2 +-
 tools/testing/selftests/net/lib/py/utils.py | 20 +-
 tools/testing/selftests/net/setup_loopback.sh | 120 ----------
 tools/testing/selftests/net/setup_veth.sh | 45 ----
 tools/testing/selftests/net/toeplitz.sh | 199 -----------------
 .../testing/selftests/net/toeplitz_client.sh | 28 ---
 26 files changed, 632 insertions(+), 577 deletions(-)
 rename tools/testing/selftests/{ => drivers}/net/gro.c (99%)
 rename tools/testing/selftests/{net => drivers/net/hw}/toeplitz.c (99%)
 create mode 100755 tools/testing/selftests/drivers/net/gro.py
 create mode 100755 tools/testing/selftests/drivers/net/hw/toeplitz.py
 delete mode 100755 tools/testing/selftests/net/gro.sh
 create mode 100755 tools/testing/selftests/net/lib/ksft_setup_loopback.sh
 delete mode 100644 tools/testing/selftests/net/setup_loopback.sh
 delete mode 100644 tools/testing/selftests/net/setup_veth.sh
 delete mode 100755 tools/testing/selftests/net/toeplitz.sh
 delete mode 100755 tools/testing/selftests/net/toeplitz_client.sh

-- 
2.51.1
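
The cover letter describes a variant as "just a set of static input arguments" and mentions the [ipv4, ipv6, ipip] x [tcp, udp] matrix. A minimal sketch of that idea follows. It is not the API added by the series: `Variant`, `make_variants`, `run_case` and `fake_gro_test` are hypothetical names used only for illustration.

```python
import itertools
from collections import namedtuple

# Hypothetical container for one variant; not the type used by the ksft library.
Variant = namedtuple("Variant", ["name", "args"])


def make_variants(ip_modes, protocols):
    """Build one Variant per (ip mode, protocol) combination."""
    return [
        Variant(name=f"{ip}_{proto}", args=(ip, proto))
        for ip, proto in itertools.product(ip_modes, protocols)
    ]


def run_case(test_fn, variants):
    """Call test_fn once per variant and collect results keyed by variant name."""
    return {v.name: test_fn(*v.args) for v in variants}


def fake_gro_test(ip_mode, protocol):
    # Stand-in test body: it only echoes its static inputs.
    return f"ran {protocol} over {ip_mode}"


variants = make_variants(["ipv4", "ipv6", "ipip"], ["tcp", "udp"])
results = run_case(fake_gro_test, variants)
```

Each test body is written once and the harness expands it into six named cases, which is the property the cover letter relies on when it says the same script runs over a variety of protocols.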

1 month, 3 weeks


[PATCH v1] selftests: hid: tests: test_wacom_generic: add base test for display devices and opaque devices

by Alex Tran

Verify Wacom devices set INPUT_PROP_DIRECT appropriately on display devices and INPUT_PROP_POINTER appropriately on opaque devices. Tests are defined in the base class and disabled for inapplicable device types. Signed-off-by: Alex Tran <alex.t.tran(a)gmail.com> --- .../selftests/hid/tests/test_wacom_generic.py | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/hid/tests/test_wacom_generic.py b/tools/testing/selftests/hid/tests/test_wacom_generic.py index 2d6d04f0f..aa2a175f2 100644 --- a/tools/testing/selftests/hid/tests/test_wacom_generic.py +++ b/tools/testing/selftests/hid/tests/test_wacom_generic.py @@ -600,15 +600,17 @@ class BaseTest: def test_prop_direct(self): """ - Todo: Verify that INPUT_PROP_DIRECT is set on display devices. + Verify that INPUT_PROP_DIRECT is set on display devices. """ - pass + evdev = self.uhdev.get_evdev() + assert libevdev.INPUT_PROP_DIRECT in evdev.properties def test_prop_pointer(self): """ - Todo: Verify that INPUT_PROP_POINTER is set on opaque devices. + Verify that INPUT_PROP_POINTER is set on opaque devices. """ - pass + evdev = self.uhdev.get_evdev() + assert libevdev.INPUT_PROP_POINTER in evdev.properties class PenTabletTest(BaseTest.TestTablet): @@ -622,6 +624,8 @@ class TouchTabletTest(BaseTest.TestTablet): class TestOpaqueTablet(PenTabletTest): + test_prop_direct = None + def create_device(self): return OpaqueTablet() @@ -864,6 +868,7 @@ class TestPTHX60_Pen(TestOpaqueCTLTablet): class TestDTH2452Tablet(test_multitouch.BaseTest.TestMultitouch, TouchTabletTest): ContactIds = namedtuple("ContactIds", "contact_id, tracking_id, slot_num") + test_prop_pointer = None def create_device(self): return test_multitouch.Digitizer( -- 2.51.0

1 month, 3 weeks


[PATCH net-next v10 00/11] vsock: add namespace support to vhost-vsock and loopback

by Bobby Eshleman

This series adds namespace support to vhost-vsock and loopback. It does not add namespaces to any of the other guest transports (virtio-vsock, hyperv, or vmci). The current revision supports two modes: local and global. Local mode is complete isolation of namespaces, while global mode is complete sharing between namespaces of CIDs (the original behavior). The mode is set using /proc/sys/net/vsock/ns_mode. Modes are per-netns and write-once. This allows a system to configure namespaces independently (some may share CIDs, others are completely isolated). This also supports future possible mixed use cases, where there may be namespaces in global mode spinning up VMs while there are mixed mode namespaces that provide services to the VMs, but are not allowed to allocate from the global CID pool (this mode is not implemented in this series). If a socket or VM is created when a namespace is global but the namespace changes to local, the socket or VM will continue working normally. That is, the socket or VM assumes the mode behavior of the namespace at the time the socket/VM was created. The original mode is captured in vsock_create() and so occurs at the time of socket(2) and accept(2) for sockets and open(2) on /dev/vhost-vsock for VMs. This prevents a socket/VM connection from suddenly breaking due to a namespace mode change. Any new sockets/VMs created after the mode change will adopt the new mode's behavior. 
Additionally, added tests for the new namespace features: tools/testing/selftests/vsock/vmtest.sh 1..29 ok 1 vm_server_host_client ok 2 vm_client_host_server ok 3 vm_loopback ok 4 ns_guest_local_mode_rejected ok 5 ns_host_vsock_ns_mode_ok ok 6 ns_host_vsock_ns_mode_write_once_ok ok 7 ns_global_same_cid_fails ok 8 ns_local_same_cid_ok ok 9 ns_global_local_same_cid_ok ok 10 ns_local_global_same_cid_ok ok 11 ns_diff_global_host_connect_to_global_vm_ok ok 12 ns_diff_global_host_connect_to_local_vm_fails ok 13 ns_diff_global_vm_connect_to_global_host_ok ok 14 ns_diff_global_vm_connect_to_local_host_fails ok 15 ns_diff_local_host_connect_to_local_vm_fails ok 16 ns_diff_local_vm_connect_to_local_host_fails ok 17 ns_diff_global_to_local_loopback_local_fails ok 18 ns_diff_local_to_global_loopback_fails ok 19 ns_diff_local_to_local_loopback_fails ok 20 ns_diff_global_to_global_loopback_ok ok 21 ns_same_local_loopback_ok ok 22 ns_same_local_host_connect_to_local_vm_ok ok 23 ns_same_local_vm_connect_to_local_host_ok ok 24 ns_mode_change_connection_continue_vm_ok ok 25 ns_mode_change_connection_continue_host_ok ok 26 ns_mode_change_connection_continue_both_ok ok 27 ns_delete_vm_ok ok 28 ns_delete_host_ok ok 29 ns_delete_both_ok SUMMARY: PASS=29 SKIP=0 FAIL=0 Dependent on series: https://lore.kernel.org/all/20251108-vsock-selftests-fixes-and-improvements… Thanks again for everyone's help and reviews! Suggested-by: Sargun Dhillon <sargun(a)sargun.me> Signed-off-by: Bobby Eshleman <bobbyeshleman(a)gmail.com> To: Stefano Garzarella <sgarzare(a)redhat.com> To: Shuah Khan <shuah(a)kernel.org> To: David S. Miller <davem(a)davemloft.net> To: Eric Dumazet <edumazet(a)google.com> To: Jakub Kicinski <kuba(a)kernel.org> To: Paolo Abeni <pabeni(a)redhat.com> To: Simon Horman <horms(a)kernel.org> To: Stefan Hajnoczi <stefanha(a)redhat.com> To: Michael S. 
Tsirkin <mst(a)redhat.com> To: Jason Wang <jasowang(a)redhat.com> To: Xuan Zhuo <xuanzhuo(a)linux.alibaba.com> To: Eugenio Pérez <eperezma(a)redhat.com> To: K. Y. Srinivasan <kys(a)microsoft.com> To: Haiyang Zhang <haiyangz(a)microsoft.com> To: Wei Liu <wei.liu(a)kernel.org> To: Dexuan Cui <decui(a)microsoft.com> To: Bryan Tan <bryan-bt.tan(a)broadcom.com> To: Vishnu Dasa <vishnu.dasa(a)broadcom.com> To: Broadcom internal kernel review list <bcm-kernel-feedback-list(a)broadcom.com> Cc: virtualization(a)lists.linux.dev Cc: netdev(a)vger.kernel.org Cc: linux-kselftest(a)vger.kernel.org Cc: linux-kernel(a)vger.kernel.org Cc: kvm(a)vger.kernel.org Cc: linux-hyperv(a)vger.kernel.org Cc: berrange(a)redhat.com Cc: Sargun Dhillon <sargun(a)sargun.me> Changes in v10: - Combine virtio common patches into one (Stefano) - Resolve vsock_loopback virtio_transport_reset_no_sock() issue with info->vsk setting. This eliminates the need for skb->cb, so remove skb->cb patches. - many line width 80 fixes - Link to v9: https://lore.kernel.org/all/20251111-vsock-vmtest-v9-0-852787a37bed@meta.com Changes in v9: - reorder loopback patch after patch for virtio transport common code - remove module ordering tests patch because loopback no longer depends on pernet ops - major simplifications in vsock_loopback - added a new patch for blocking local mode for guests, added test case to check - add net ref tracking to vsock_loopback patch - Link to v8: https://lore.kernel.org/r/20251023-vsock-vmtest-v8-0-dea984d02bb0@meta.com Changes in v8: - Break generic cleanup/refactoring patches into standalone series, remove those from this series - Link to dependency: https://lore.kernel.org/all/20251022-vsock-selftests-fixes-and-improvements… - Link to v7: https://lore.kernel.org/r/20251021-vsock-vmtest-v7-0-0661b7b6f081@meta.com Changes in v7: - fix hv_sock build - break out vmtest patches into distinct, more well-scoped patches - change `orig_net_mode` to `net_mode` - many fixes and style changes in 
per-patch change sets (see individual patches for specific changes) - optimize `virtio_vsock_skb_cb` layout - update commit messages with more useful descriptions - vsock_loopback: use orig_net_mode instead of current net mode - add tests for edge cases (ns deletion, mode changing, loopback module load ordering) - Link to v6: https://lore.kernel.org/r/20250916-vsock-vmtest-v6-0-064d2eb0c89d@meta.com Changes in v6: - define behavior when mode changes to local while socket/VM is alive - af_vsock: clarify description of CID behavior - af_vsock: use stronger language around CID rules (don't use "may") - af_vsock: improve naming of buf/buffer - af_vsock: improve string length checking on proc writes - vsock_loopback: add space in struct to clarify lock protection - vsock_loopback: do proper cleanup/unregister on vsock_loopback_exit() - vsock_loopback: use virtio_vsock_skb_net() instead of sock_net() - vsock_loopback: set loopback to NULL after kfree() - vsock_loopback: use pernet_operations and remove callback mechanism - vsock_loopback: add macros for "global" and "local" - vsock_loopback: fix length checking - vmtest.sh: check for namespace support in vmtest.sh - Link to v5: https://lore.kernel.org/r/20250827-vsock-vmtest-v5-0-0ba580bede5b@meta.com Changes in v5: - /proc/net/vsock_ns_mode -> /proc/sys/net/vsock/ns_mode - vsock_global_net -> vsock_global_dummy_net - fix netns lookup in vhost_vsock to respect pid namespaces - add callbacks for vsock_loopback to avoid circular dependency - vmtest.sh loads vsock_loopback module - remove vsock_net_mode_can_set() - change vsock_net_write_mode() to return true/false based on success - make vsock_net_mode enum instead of u8 - Link to v4: https://lore.kernel.org/r/20250805-vsock-vmtest-v4-0-059ec51ab111@meta.com Changes in v4: - removed RFC tag - implemented loopback support - renamed new tests to better reflect behavior - completed suite of tests with permutations of ns modes and vsock_test as guest/host - simplified socat 
bridging with unix socket instead of tcp + veth - only use vsock_test for success case, socat for failure case (context in commit message) - lots of cleanup Changes in v3: - add notion of "modes" - add procfs /proc/net/vsock_ns_mode - local and global modes only - no /dev/vhost-vsock-netns - vmtest.sh already merged, so new patch just adds new tests for NS - Link to v2: https://lore.kernel.org/kvm/20250312-vsock-netns-v2-0-84bffa1aa97a@gmail.com Changes in v2: - only support vhost-vsock namespaces - all g2h namespaces retain old behavior, only common API changes impacted by vhost-vsock changes - add /dev/vhost-vsock-netns for "opt-in" - leave /dev/vhost-vsock to old behavior - removed netns module param - Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com Changes in v1: - added 'netns' module param to vsock.ko to enable the network namespace support (disabled by default) - added 'vsock_net_eq()' to check the "net" assigned to a socket only when 'netns' support is enabled - Link to RFC: https://patchwork.ozlabs.org/cover/1202235/ --- Bobby Eshleman (11): vsock: a per-net vsock NS mode state vsock: add netns to vsock core vsock: reject bad VSOCK_NET_MODE_LOCAL configuration for G2H vsock: add netns support to virtio transports virtio: set skb owner of virtio_transport_reset_no_sock() reply selftests/vsock: add namespace helpers to vmtest.sh selftests/vsock: prepare vm management helpers for namespaces selftests/vsock: add tests for proc sys vsock ns_mode selftests/vsock: add namespace tests for CID collisions selftests/vsock: add tests for host <-> vm connectivity with namespaces selftests/vsock: add tests for namespace deletion and mode changes MAINTAINERS | 1 + drivers/vhost/vsock.c | 57 +- include/linux/virtio_vsock.h | 8 +- include/net/af_vsock.h | 58 +- include/net/net_namespace.h | 4 + include/net/netns/vsock.h | 17 + net/vmw_vsock/af_vsock.c | 294 ++++++++- net/vmw_vsock/hyperv_transport.c | 6 + net/vmw_vsock/virtio_transport.c | 
29 +- net/vmw_vsock/virtio_transport_common.c | 69 +- net/vmw_vsock/vmci_transport.c | 7 + net/vmw_vsock/vsock_loopback.c | 20 +- tools/testing/selftests/vsock/vmtest.sh | 1037 +++++++++++++++++++++++++++++-- 13 files changed, 1514 insertions(+), 93 deletions(-) --- base-commit: 962ac5ca99a5c3e7469215bf47572440402dfd59 change-id: 20250325-vsock-vmtest-b3a21d2102c2 prerequisite-message-id: <20251022-vsock-selftests-fixes-and-improvements-v1-0-edeb179d6463(a)meta.com> prerequisite-patch-id: a2eecc3851f2509ed40009a7cab6990c6d7cfff5 prerequisite-patch-id: 501db2100636b9c8fcb3b64b8b1df797ccbede85 prerequisite-patch-id: ba1a2f07398a035bc48ef72edda41888614be449 prerequisite-patch-id: fd5cc5445aca9355ce678e6d2bfa89fab8a57e61 prerequisite-patch-id: 795ab4432ffb0843e22b580374782e7e0d99b909 prerequisite-patch-id: 1499d263dc933e75366c09e045d2125ca39f7ddd prerequisite-patch-id: f92d99bb1d35d99b063f818a19dcda999152d74c prerequisite-patch-id: e3296f38cdba6d903e061cff2bbb3e7615e8e671 prerequisite-patch-id: bc4662b4710d302d4893f58708820fc2a0624325 prerequisite-patch-id: f8991f2e98c2661a706183fde6b35e2b8d9aedcf prerequisite-patch-id: 44bf9ed69353586d284e5ee63d6fffa30439a698 prerequisite-patch-id: d50621bc630eeaf608bbaf260370c8dabf6326df Best regards, -- Bobby Eshleman <bobbyeshleman(a)meta.com>
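
The mode rules in the cover letter (per-netns, write-once, captured by a socket or VM at creation time) can be illustrated with a small toy model. This is not kernel code and makes several assumptions that are labeled in the comments: the default mode, the error raised on a second write, and the reachability rule, which is only inferred from the test names listed above.

```python
class Namespace:
    """Toy stand-in for one network namespace's vsock ns_mode."""

    def __init__(self, name):
        self.name = name
        # Assumption: a namespace starts in "global" mode (the original behavior).
        self.mode = "global"
        self._written = False

    def write_mode(self, mode):
        if mode not in ("global", "local"):
            raise ValueError(mode)
        if self._written:
            # Assumption: the exception type is illustrative only; the cover
            # letter says the mode is write-once but does not give an errno.
            raise PermissionError("ns_mode is write-once")
        self.mode = mode
        self._written = True


class Endpoint:
    """A socket or VM; it keeps the mode its namespace had at creation time."""

    def __init__(self, ns):
        self.ns = ns
        self.mode = ns.mode


def can_connect(a, b):
    # Assumption inferred from the listed test names: endpoints in the same
    # namespace can talk; across namespaces both must have captured "global".
    if a.ns is b.ns:
        return True
    return a.mode == "global" and b.mode == "global"


ns_a, ns_b = Namespace("a"), Namespace("b")
old_vm = Endpoint(ns_a)      # created while ns_a is still global
ns_a.write_mode("local")     # later mode change
new_vm = Endpoint(ns_a)      # created after the change
host_b = Endpoint(ns_b)
```

In this model `old_vm` keeps working across namespaces after the mode change while `new_vm` is isolated, which mirrors the "continues working normally" paragraph of the cover letter.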

1 month, 3 weeks


[PATCH RESEND v2 1/1] selftest/sched: skip the test if smt is not enabled

by Yifei Liu

Core scheduling is intended for SMT-enabled CPUs. When SMT is disabled, the test returns failure and emits many error messages, none of which clearly points to SMT as the cause. It only mentions "not a core sched system" among many other messages. For example: Not a core sched system tid=210574, / tgid=210574 / pgid=210574: ffffffffffffffff Not a core sched system tid=210575, / tgid=210575 / pgid=210574: ffffffffffffffff Not a core sched system tid=210577, / tgid=210575 / pgid=210574: ffffffffffffffff (similar things many other times) With this patch, the test first reads /sys/devices/system/cpu/smt/active; if the file cannot be opened or its value is 0, the test is skipped with an explanatory message. This helps developers understand why it is skipped and avoids unnecessary attention when running the full selftest suite. Cc: stable(a)vger.kernel.org Signed-off-by: Yifei Liu <yifei.l.liu(a)oracle.com> --- tools/testing/selftests/sched/cs_prctl_test.c | 23 ++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/sched/cs_prctl_test.c b/tools/testing/selftests/sched/cs_prctl_test.c index 52d97fae4dbd..7ce8088cde6a 100644 --- a/tools/testing/selftests/sched/cs_prctl_test.c +++ b/tools/testing/selftests/sched/cs_prctl_test.c @@ -32,6 +32,8 @@ #include <stdlib.h> #include <string.h> +#include "../kselftest.h" + #if __GLIBC_PREREQ(2, 30) == 0 #include <sys/syscall.h> static pid_t gettid(void) @@ -109,6 +111,22 @@ static void handle_usage(int rc, char *msg) exit(rc); } +int check_smt(void) +{ + int c = 0; + FILE *file; + + file = fopen("/sys/devices/system/cpu/smt/active", "r"); + if (!file) + return 0; + c = fgetc(file) - 0x30; + fclose(file); + if (c == 0 || c == 1) + return c; + //if fgetc returns EOF or -1 for correupted files, return 0. 
+ return 0; +} + static unsigned long get_cs_cookie(int pid) { unsigned long long cookie; @@ -271,7 +289,10 @@ int main(int argc, char *argv[]) delay = -1; srand(time(NULL)); - + if (!check_smt()) { + ksft_test_result_skip("smt not enabled\n"); + return 1; + } /* put into separate process group */ if (setpgid(0, 0) != 0) handle_error("process group"); -- 2.50.1
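
For readers who want to try the same check outside the C test, here is a minimal Python sketch of the logic in check_smt(): read the first character of the sysfs file and treat anything other than "1" (including a missing or unreadable file) as "SMT not active". The function name `smt_active` and its `path` parameter are hypothetical; the parameter exists only so the check can be exercised against a temporary file.

```python
SMT_ACTIVE_PATH = "/sys/devices/system/cpu/smt/active"


def smt_active(path=SMT_ACTIVE_PATH):
    """Return True only if the first character of the file is '1'."""
    try:
        with open(path, "r") as f:
            first = f.read(1)
    except OSError:
        # File missing or unreadable: treat as SMT not active, like the patch.
        return False
    # An empty or unexpected value also counts as "not active".
    return first == "1"


if __name__ == "__main__":
    if not smt_active():
        print("SKIP: smt not enabled")
```

As in the patch, every failure mode collapses to "not active", so the caller only has to handle one skip condition.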

1 month, 3 weeks

