Linux-kselftest-mirror - lists.linaro.org

[PATCH bpf-next v2 0/2] selftests/bpf: networking test cleanups

by Hoyeon Lee

This series finishes the sockaddr_storage migration in the networking selftests by removing the remaining open-coded IPv4/IPv6 wrappers (addr_port/tuple in cls_redirect, sa46 in select_reuseport). The tests now use sockaddr_storage directly. No other custom socket-address wrappers remain after this series, so the churn stops here and behavior is unchanged.

---
Changes in v2:
- Drop the tuple wrapper entirely in cls_redirect and rely on ss_family
- Limit the series to patches 1/2 (3/4 applied; 5 sent separately)

Hoyeon Lee (2):
  selftests/bpf: use sockaddr_storage directly in cls_redirect test
  selftests/bpf: use sockaddr_storage instead of sa46 in select_reuseport test

 .../selftests/bpf/prog_tests/cls_redirect.c   | 122 ++++++------------
 .../bpf/prog_tests/select_reuseport.c         |  67 +++++-----
 2 files changed, 77 insertions(+), 112 deletions(-)

--
2.51.1
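For readers unfamiliar with the pattern, here is a minimal userspace sketch of using sockaddr_storage directly and dispatching on ss_family instead of carrying a custom IPv4/IPv6 wrapper. It is not code from the series: make_sockaddr() and sockaddr_port() are hypothetical helper names chosen for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not from the series): fill a sockaddr_storage from a
 * textual address. The family is recorded in ss_family, so no separate
 * wrapper type is needed to remember whether this is IPv4 or IPv6.
 */
static int make_sockaddr(int family, const char *ip, unsigned short port,
			 struct sockaddr_storage *ss, socklen_t *len)
{
	assert(ss && len && ip);

	memset(ss, 0, sizeof(*ss));
	ss->ss_family = family;

	switch (family) {
	case AF_INET: {
		struct sockaddr_in *in4 = (struct sockaddr_in *)ss;

		in4->sin_port = htons(port);
		*len = sizeof(*in4);
		return inet_pton(AF_INET, ip, &in4->sin_addr) == 1 ? 0 : -1;
	}
	case AF_INET6: {
		struct sockaddr_in6 *in6 = (struct sockaddr_in6 *)ss;

		in6->sin6_port = htons(port);
		*len = sizeof(*in6);
		return inet_pton(AF_INET6, ip, &in6->sin6_addr) == 1 ? 0 : -1;
	}
	default:
		return -1; /* unsupported family */
	}
}

/* Hypothetical helper: read the port back by dispatching on ss_family. */
static int sockaddr_port(const struct sockaddr_storage *ss)
{
	switch (ss->ss_family) {
	case AF_INET:
		return ntohs(((const struct sockaddr_in *)ss)->sin_port);
	case AF_INET6:
		return ntohs(((const struct sockaddr_in6 *)ss)->sin6_port);
	default:
		return -1;
	}
}
```

The resulting storage can be passed to bind() or connect() as (struct sockaddr *)&ss together with the returned length.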

2 weeks, 4 days


[PATCH v2 0/4] KVM: selftests: Test SET_NESTED_STATE with 48-bit L2 on 57-bit L1

by Jim Mattson

Prior to commit 9245fd6b8531 ("KVM: x86: model canonical checks more precisely"), KVM_SET_NESTED_STATE would fail if the state was captured with L2 active, L1 had CR4.LA57 set, L2 did not, and the VMCS12.HOST_GSBASE (or other host-state field checked for canonicality) had an address greater than 48 bits wide.

Add a regression test that reproduces the KVM_SET_NESTED_STATE failure conditions. To do so, the first three patches add support for 5-level paging in the selftest L1 VM.

v1 -> v2
  Ended the page walking loops before visiting 4K mappings [Yosry]
  Changed VM_MODE_PXXV48_4K into VM_MODE_PXXVYY_4K; use 5-level paging when possible [Sean]
  Removed the check for non-NULL vmx_pages in guest_code() [Yosry]

Jim Mattson (4):
  KVM: selftests: Use a loop to create guest page tables
  KVM: selftests: Use a loop to walk guest page tables
  KVM: selftests: Change VM_MODE_PXXV48_4K to VM_MODE_PXXVYY_4K
  KVM: selftests: Add a VMX test for LA57 nested state

 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/include/kvm_util.h  |   4 +-
 .../selftests/kvm/include/x86/processor.h     |   2 +-
 .../selftests/kvm/lib/arm64/processor.c       |   2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    |  30 ++--
 .../testing/selftests/kvm/lib/x86/processor.c |  80 +++++------
 tools/testing/selftests/kvm/lib/x86/vmx.c     |   6 +-
 .../kvm/x86/vmx_la57_nested_state_test.c      | 134 ++++++++++++++++++
 8 files changed, 197 insertions(+), 62 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86/vmx_la57_nested_state_test.c

--
2.51.1.851.g4ebd6896fd-goog
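The failure condition hinges on an address being canonical for 57-bit virtual addresses (LA57) but not for 48-bit ones. A small standalone sketch of that check follows. It is an illustration only, not KVM's implementation: is_canonical() is a hypothetical helper, and the sample value stands in for a host-state field such as HOST_GSBASE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: an address is canonical for a given virtual address
 * width when bits 63..(vaddr_bits - 1) are all zero or all one, i.e. the
 * upper bits are a sign extension of the top implemented bit.
 */
static bool is_canonical(uint64_t addr, unsigned int vaddr_bits)
{
	uint64_t upper, all_ones;

	assert(vaddr_bits >= 2 && vaddr_bits <= 64);

	upper = addr >> (vaddr_bits - 1);
	all_ones = UINT64_MAX >> (vaddr_bits - 1);

	return upper == 0 || upper == all_ones;
}

/* Illustrative value only: bits 63..55 set, everything below clear. */
#define EXAMPLE_HOST_BASE 0xff80000000000000ULL
```

With this definition, EXAMPLE_HOST_BASE passes the 57-bit check but fails the 48-bit one, which is the kind of value the cover letter describes as "greater than 48 bits wide".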

2 weeks, 4 days


[PATCH v2 0/3] arm64/sme: Support disabling streaming mode via ptrace on SME only systems

by Mark Brown

Currently it is not possible to disable streaming mode via ptrace on SME only systems: the interface for doing this is to write via NT_ARM_SVE, but such writes will be rejected on a system without SVE support. Enable this functionality by allowing userspace to write SVE_PT_REGS_FPSIMD format data via NT_ARM_SVE with the vector length set to 0 on SME only systems. Such writes currently error since we require that a vector length is specified, which should minimise the risk that existing software is relying on current behaviour.

Reads are not supported since I am not aware of any use case for this and there is some risk that an existing userspace application may be confused if it reads NT_ARM_SVE on a system without SVE. Existing kernels will return FPSIMD formatted register state from NT_ARM_SVE if full SVE state is not stored, for example if the task has not used SVE. Returning a vector length of 0 would create a risk that software could try to do things like allocate space for register state with zero sizes, while returning a vector length of 128 bits would look like SVE is supported. It seems safer to just not make the changes to add read support.

It remains possible for userspace to detect a SME only system via the ptrace interface only since reads of NT_ARM_SSVE and NT_ARM_ZA will succeed while reads of NT_ARM_SVE will fail. Read/write access to the FPSIMD registers in non-streaming mode is available via REGSET_FPR.

The aim is to make a minimally invasive change: no operation that would previously have succeeded will be affected, and we use a previously defined interface in new circumstances rather than define completely new ABI.

Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Rebase onto v6.18-rc1
- Link to v1: https://lore.kernel.org/r/20250820-arm64-sme-ptrace-sme-only-v1-0-f7c22b287…

---
Mark Brown (3):
  arm64/sme: Support disabling streaming mode via ptrace on SME only systems
  kselftst/arm64: Test NT_ARM_SVE FPSIMD format writes on non-SVE systems
  kselftest/arm64: Cover disabling streaming mode without SVE in fp-ptrace

 Documentation/arch/arm64/sve.rst              |  5 +++
 arch/arm64/kernel/ptrace.c                    | 40 +++++++++++++++---
 tools/testing/selftests/arm64/fp/fp-ptrace.c  |  5 +--
 tools/testing/selftests/arm64/fp/sve-ptrace.c | 61 +++++++++++++++++++++++++++
 4 files changed, 100 insertions(+), 11 deletions(-)
---
base-commit: cb6649f6217c0331b885cf787f1d175963e2a1d2
change-id: 20250717-arm64-sme-ptrace-sme-only-1fb850600ea0

Best regards,
--
Mark Brown <broonie(a)kernel.org>

2 weeks, 4 days


[PATCH v4 0/9] introduce VM_MAYBE_GUARD and make it sticky

by Lorenzo Stoakes

Currently, guard regions are not visible to users except through /proc/$pid/pagemap, with no explicit visibility at the VMA level.

This makes the feature less useful, as it isn't entirely apparent which VMAs may have these entries present, especially when performing actions which walk through memory regions such as those performed by CRIU.

This series addresses this issue by introducing the VM_MAYBE_GUARD flag which fulfils this role, updating the smaps logic to display an entry for these.

The semantics of this flag are that a guard region MAY be present if set (we cannot be sure, as we can't efficiently track whether an MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if not set the VMA definitely does NOT have any guard regions present.

It's problematic to establish this flag without further action, because that means that VMAs with guard regions in them become non-mergeable with adjacent VMAs for no especially good reason.

To work around this, this series also introduces the concept of 'sticky' VMA flags - that is flags which:

a. if set in one VMA and not in another still permit those VMAs to be merged (if otherwise compatible).

b. When they are merged, the resultant VMA must have the flag set.

The VMA logic is updated to propagate these flags correctly.

Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve an issue with file-backed guard regions - previously these established an anon_vma object for file-backed mappings solely to have vma_needs_copy() correctly propagate guard region mappings to child processes.

We introduce a new flag alias VM_COPY_ON_FORK (which currently only specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly for this flag and to copy page tables if it is present, which resolves this issue.

Additionally, we add the ability for allow-listed VMA flags to be atomically writable with only mmap/VMA read locks held. The only flag we allow so far is VM_MAYBE_GUARD, which we carefully ensure does not cause any races by being allowed to do so.

This allows us to maintain guard region installation as a read-locked operation and not endure the overhead of obtaining a write lock here.

Finally we introduce extensive VMA userland tests to assert that the sticky VMA logic behaves correctly as well as guard region self tests to assert that smaps visibility is correctly implemented.

v4:
* Propagated tags, thanks all!
* Folded all fixups into series (thanks to Andrew for his patience with these :)
* Added patch to correct an issue raised by Pedro - we can't unconditionally set newflags |= vma->vm_flags because on split/noop we're overwriting them.
* In new patch, corrected horrible formatting of vma_modify_*() while we are here.
* In new patch, added kdoc as 3 kernel developers, including the author of the code (!!) have been confused by this. Make explicitly clear what each does.
* In new patch, make vm_flags_ptr parameter a pointer for vma_modify_flags, and have the function correctly update the flags on merge, abstracting this mess somewhat and avoiding case-by-case open-coding of the fix. Describe clearly what's going on in the kdoc.
* Fixed typo reported by Jane and Liam, I must have been very tired... :)
* When introducing the new patch, we couldn't reference sticky VMA flags yet as the concept had not yet been introduced. So update the patch that introduces sticky flags to change the comments to reference the concept now established.

v3:
* Propagated tags thanks Vlastimil & Pedro! :)
* Fixed doc nit as per Pedro.
* Added vma_flag_test_atomic() in preparation for fixing retract_page_tables() (see below). We make this not require any locks, as we serialise on the page table lock in retract_page_tables().
* Split the atomic flag enablement and actually setting the flag for guard install into two separate commits so we clearly separate the various VMA flag implementation details and us enabling this feature.
* Mentioned setting anon_vma for anonymous mappings in commit message as per Vlastimil.
* Fixed an issue with retract_page_tables() whereby madvise(..., MADV_COLLAPSE) relies upon file-backed VMAs not being collapsed due to the UFFD WP VMA flag being set or the VMA having vma->anon_vma set (i.e. being a MAP_PRIVATE file-backed VMA). This was updated to also check for VM_MAYBE_GUARD.
* Introduced MADV_COLLAPSE self test to assert that the behaviour is correct. I first reproduced the issue locally and then adapted the test to assert that this no longer occurs.
* Mentioned KCSAN permissiveness in commit message as per Pedro.
* Mentioned mmap/VMA read lock excluding mmap/VMA write lock and thus avoiding meaningful RMW races in commit message as per Vlastimil.
* Mentioned previous unconditional vma->anon_vma installation on guard region installation as per Vlastimil.
* Avoided having merging compromised by reordering patches such that the sticky VMA functionality is implemented prior to VM_MAYBE_GUARD being utilised upon guard region installation, rendering Vlastimil's request to mention this in a commit message unnecessary.
* Separated out sticky and copy on fork patches as per Pedro.
* Added VM_PFNMAP, VM_MIXEDMAP, VM_UFFD_WP to VM_COPY_ON_FORK to make things more consistent and clean.
* Added mention of why generally VM_STICKY should be VM_COPY_ON_FORK in copy on fork patch.
https://lore.kernel.org/all/cover.1762531708.git.lorenzo.stoakes@oracle.com/

v2:
* Separated out userland VMA tests for sticky behaviour as per Suren.
* Added the concept of atomic writable VMA flags as per Pedro and Vlastimil.
* Made VM_MAYBE_GUARD an atomic writable flag so we don't have to take a VMA write lock in madvise() as per Pedro and Vlastimil.
https://lore.kernel.org/all/cover.1762422915.git.lorenzo.stoakes@oracle.com/

v1:
https://lore.kernel.org/all/cover.1761756437.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (9):
  mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps
  mm: add atomic VMA flags and set VM_MAYBE_GUARD as such
  mm: update vma_modify_flags() to handle residual flags, document
  mm: implement sticky VMA flags
  mm: introduce copy-on-fork VMAs and make VM_MAYBE_GUARD one
  mm: set the VM_MAYBE_GUARD flag on guard region install
  tools/testing/vma: add VMA sticky userland tests
  tools/testing/selftests/mm: add MADV_COLLAPSE test case
  tools/testing/selftests/mm: add smaps visibility guard region test

 Documentation/filesystems/proc.rst         |   5 +-
 fs/proc/task_mmu.c                         |   1 +
 include/linux/mm.h                         | 101 +++++++++++
 include/trace/events/mmflags.h             |   1 +
 mm/khugepaged.c                            |  71 +++++---
 mm/madvise.c                               |  24 ++-
 mm/memory.c                                |  14 +-
 mm/mlock.c                                 |   2 +-
 mm/mprotect.c                              |   2 +-
 mm/mseal.c                                 |   9 +-
 mm/vma.c                                   |  78 +++++----
 mm/vma.h                                   | 138 +++++++++++----
 tools/testing/selftests/mm/guard-regions.c | 185 +++++++++++++++++++++
 tools/testing/selftests/mm/vm_util.c       |   5 +
 tools/testing/selftests/mm/vm_util.h       |   1 +
 tools/testing/vma/vma.c                    |  92 ++++++++--
 tools/testing/vma/vma_internal.h           |  55 ++++++
 17 files changed, 650 insertions(+), 134 deletions(-)

--
2.51.2
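The two 'sticky' rules from the cover letter (a difference in a sticky flag does not block a merge, and the merged VMA keeps the flag) can be sketched in a few lines. This is a toy model under stated assumptions, not the kernel's merge logic: the TOY_* bit values and both helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_vm_flags_t;

/* Hypothetical bit assignments, chosen only for this example. */
#define TOY_VM_READ		(1ULL << 0)
#define TOY_VM_WRITE		(1ULL << 1)
#define TOY_VM_MAYBE_GUARD	(1ULL << 2)

/* The set of flags treated as sticky in this toy model. */
#define TOY_VM_STICKY		TOY_VM_MAYBE_GUARD

/* Rule (a): flags are compatible if they differ only in sticky bits. */
static bool toy_flags_mergeable(toy_vm_flags_t a, toy_vm_flags_t b)
{
	return ((a ^ b) & ~TOY_VM_STICKY) == 0;
}

/* Rule (b): the merged VMA carries every sticky bit set in either input. */
static toy_vm_flags_t toy_merged_flags(toy_vm_flags_t a, toy_vm_flags_t b)
{
	assert(toy_flags_mergeable(a, b));

	return a | (b & TOY_VM_STICKY);
}
```

In this model a read/write VMA with TOY_VM_MAYBE_GUARD merges with a plain read/write neighbour, and the result has TOY_VM_MAYBE_GUARD set regardless of which side carried it.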

2 weeks, 4 days


[PATCH bpf-next v3 0/3] bpf: Fix FIONREAD and copied_seq issues

by Jiayuan Chen

syzkaller reported a bug [1] where a socket using sockmap, after being unloaded, exposed incorrect copied_seq calculation. The selftest I provided can be used to reproduce the issue reported by syzkaller.

TCP recvmsg seq # bug 2: copied E92C873, seq E68D125, rcvnxt E7CEB7C, fl 40
WARNING: CPU: 1 PID: 5997 at net/ipv4/tcp.c:2724 tcp_recvmsg_locked+0xb2f/0x2910 net/ipv4/tcp.c:2724
Call Trace:
 <TASK>
 receive_fallback_to_copy net/ipv4/tcp.c:1968 [inline]
 tcp_zerocopy_receive+0x131a/0x2120 net/ipv4/tcp.c:2200
 do_tcp_getsockopt+0xe28/0x26c0 net/ipv4/tcp.c:4713
 tcp_getsockopt+0xdf/0x100 net/ipv4/tcp.c:4812
 do_sock_getsockopt+0x34d/0x440 net/socket.c:2421
 __sys_getsockopt+0x12f/0x260 net/socket.c:2450
 __do_sys_getsockopt net/socket.c:2457 [inline]
 __se_sys_getsockopt net/socket.c:2454 [inline]
 __x64_sys_getsockopt+0xbd/0x160 net/socket.c:2454
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xcd/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

A sockmap socket maintains its own receive queue (ingress_msg) which may contain data from either its own protocol stack or forwarded from other sockets.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

The issue occurs when reading from ingress_msg: we update tp->copied_seq by default, but if the data comes from other sockets (not the socket's own protocol stack), tcp->rcv_nxt remains unchanged. Later, when converting back to a native socket, reads may fail as copied_seq could be significantly larger than rcv_nxt.

Additionally, FIONREAD calculation based on copied_seq and rcv_nxt is insufficient for sockmap sockets, requiring separate field tracking.

[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983

---
v1 -> v3: Use skmsg.sk instead of extending BPF_F_XXX macro and fix CI failure reported by ci
v1: https://lore.kernel.org/bpf/20251117110736.293040-1-jiayuan.chen@linux.dev/

Jiayuan Chen (3):
  bpf, sockmap: Fix incorrect copied_seq calculation
  bpf, sockmap: Fix FIONREAD for sockmap
  bpf, selftest: Add tests for FIONREAD and copied_seq

 include/linux/skmsg.h                         |  48 ++++-
 net/core/skmsg.c                              |  28 ++-
 net/ipv4/tcp_bpf.c                            |  26 ++-
 net/ipv4/udp_bpf.c                            |  25 ++-
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 203 +++++++++++++++++-
 .../bpf/progs/test_sockmap_pass_prog.c        |   8 +
 6 files changed, 322 insertions(+), 16 deletions(-)

--
2.43.0
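To make the sequence-number problem concrete, here is a toy userspace model of the two counters. It is a sketch under assumptions, not the patch's code: the struct and the toy_* functions are invented for illustration, and seq_before() only mirrors the idea of a wrap-safe 32-bit comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "a is before b" comparison on 32-bit sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Toy stand-in for the two TCP fields discussed in the cover letter. */
struct toy_tcp_sock {
	uint32_t rcv_nxt;	/* advanced when the socket's own stack receives data */
	uint32_t copied_seq;	/* advanced when userspace reads data */
};

/* Buggy model: every read from ingress_msg advances copied_seq. */
static void toy_read_buggy(struct toy_tcp_sock *tp, uint32_t len,
			   bool from_own_stack)
{
	assert(tp);
	(void)from_own_stack;
	tp->copied_seq += len;
}

/* Fixed model: only data that came from the socket's own stack counts. */
static void toy_read_fixed(struct toy_tcp_sock *tp, uint32_t len,
			   bool from_own_stack)
{
	assert(tp);
	if (from_own_stack)
		tp->copied_seq += len;
}

/* A native socket expects copied_seq to never run ahead of rcv_nxt. */
static bool toy_seq_sane(const struct toy_tcp_sock *tp)
{
	return !seq_before(tp->rcv_nxt, tp->copied_seq);
}
```

Plugging in the values from the warning above (copied E92C873, rcvnxt E7CEB7C) shows copied_seq running 0x15DCF7 bytes ahead of rcv_nxt, which is the state the buggy model reaches after reading redirected data.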

2 weeks, 4 days


[PATCH 0/9] Initial DMABUF support for iommufd

by Jason Gunthorpe

This series is the start of adding full DMABUF support to iommufd. Currently it is limited to only work with VFIO's DMABUF exporter. It sits on top of Leon's series to add a DMABUF exporter to VFIO:

https://lore.kernel.org/r/20251106-dmabuf-vfio-v7-0-2503bf390699@nvidia.com

The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fd's, but otherwise works the same as it does today for a memfd. The user can select a slice of the FD to map into the ioas and if the underlying alignment requirements are met it will be placed in the iommu_domain.

Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR memory from VFIO to an iommu_domain controlled by iommufd. This is used for PCI Peer to Peer support in VMs, and is the last feature that the VFIO type 1 container has that iommufd couldn't do.

The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime control and is a use-after-free security problem.

Instead iommufd relies on revokable DMABUFs. Whenever VFIO thinks there should be no access to the MMIO it can shoot down the mapping in iommufd which will unmap it from the iommu_domain. There is no automatic remap, this is a safety protocol so the kernel doesn't get stuck. Userspace is expected to know it is doing something that will revoke the dmabuf and map/unmap it around the activity. Eg when QEMU goes to issue FLR it should do the map/unmap to iommufd.

Since DMABUF is missing some key general features for this use case it relies on a "private interconnect" between VFIO and iommufd via the vfio_pci_dma_buf_iommufd_map() call.

The call confirms the DMABUF has revoke semantics and delivers a phys_addr for the memory suitable for use with iommu_map().

Medium term there is a desire to expand the supported DMABUFs to include GPU drivers to support DPDK/SPDK type use cases so future series will work to add a general concept of revoke and a general negotiation of interconnect to remove vfio_pci_dma_buf_iommufd_map().

I also plan another series to modify iommufd's vfio_compat to transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI of type1.

The latest series for interconnect negotiation to exchange a phys_addr is:
https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com

And the discussion for design of revoke is here:
https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/

This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf

v2:
 - Rebase on Leon's v7
 - Fix mislocking in an iopt_fill_domain() error path
v1: https://patch.msgid.link/r/0-v1-64bed2430cdb+31b-iommufd_dmabuf_jgg@nvidia.…

Jason Gunthorpe (9):
  vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
  iommufd: Add DMABUF to iopt_pages
  iommufd: Do not map/unmap revoked DMABUFs
  iommufd: Allow a DMABUF to be revoked
  iommufd: Allow MMIO pages in a batch
  iommufd: Have pfn_reader process DMABUF iopt_pages
  iommufd: Have iopt_map_file_pages convert the fd to a file
  iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
  iommufd/selftest: Add some tests for the dmabuf flow

 drivers/iommu/iommufd/io_pagetable.c          |  78 +++-
 drivers/iommu/iommufd/io_pagetable.h          |  53 ++-
 drivers/iommu/iommufd/ioas.c                  |   8 +-
 drivers/iommu/iommufd/iommufd_private.h       |  14 +-
 drivers/iommu/iommufd/iommufd_test.h          |  10 +
 drivers/iommu/iommufd/main.c                  |  10 +
 drivers/iommu/iommufd/pages.c                 | 407 ++++++++++++++++--
 drivers/iommu/iommufd/selftest.c              | 142 ++++++
 drivers/vfio/pci/vfio_pci_dmabuf.c            |  34 ++
 include/linux/vfio_pci_core.h                 |   4 +
 tools/testing/selftests/iommu/iommufd.c       |  43 ++
 tools/testing/selftests/iommu/iommufd_utils.h |  44 ++
 12 files changed, 781 insertions(+), 66 deletions(-)

base-commit: bb04e92c86b44b3e36532099b68de1e889acfee7
--
2.43.0

2 weeks, 4 days


[RFC PATCH 0/4] mm, kvm: add guest_memfd support for uffd minor faults

by Mike Rapoport

From: "Mike Rapoport (Microsoft)" <rppt(a)kernel.org>

Hi,

These patches allow guest_memfd to notify userspace about minor page faults using userfaultfd and let userspace resolve these page faults using UFFDIO_CONTINUE.

To allow UFFDIO_CONTINUE outside of the core mm I added a get_pagecache_folio() callback to vm_ops that allows an address space backing a VMA to return a folio that exists in its page cache (patch 2).

In order for guest_memfd to notify userspace about page faults, it has to call handle_userfault() and since guest_memfd may be a part of kvm module, handle_userfault() is exported for kvm module (patch 3).

Note that patch 3 changelog does not provide motivation for enabling uffd in guest_memfd, mainly because I can't say I understand why that is required :) Would be great to hear from KVM folks about it.

This series is the minimal change I've been able to come up with to allow integration of guest_memfd with uffd and while refactoring uffd and making mfill_atomic() flow more linear would have been a nice improvement, it's way out of the scope of enabling uffd with guest_memfd.

Mike Rapoport (Microsoft) (3):
  userfaultfd: move vma_can_userfault out of line
  userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
  userfaultfd, guest_memfd: support userfault minor mode in guest_memfd

Nikita Kalyazin (1):
  KVM: selftests: test userfaultfd minor for guest_memfd

 fs/userfaultfd.c                              |   4 +-
 include/linux/mm.h                            |   9 ++
 include/linux/userfaultfd_k.h                 |  36 +-----
 include/uapi/linux/userfaultfd.h              |   8 +-
 mm/shmem.c                                    |  20 ++++
 mm/userfaultfd.c                              |  88 ++++++++++++---
 .../testing/selftests/kvm/guest_memfd_test.c  | 103 ++++++++++++++++++
 virt/kvm/guest_memfd.c                        |  30 +++++
 8 files changed, 245 insertions(+), 53 deletions(-)

base-commit: 6146a0f1dfae5d37442a9ddcba012add260bceb0
--
2.50.1

2 weeks, 4 days


[PATCH v3 00/10] KVM: nVMX: Improve performance for unmanaged guest memory

by Fred Griffoul

From: Fred Griffoul <fgriffo(a)amazon.co.uk>

This patch series addresses both performance and correctness issues in nested VMX when handling guest memory.

During nested VMX operations, L0 (KVM) accesses specific L1 guest pages to manage L2 execution. These pages fall into two categories: pages accessed only by L0 (such as the L1 MSR bitmap page or the eVMCS page), and pages passed to the L2 guest via vmcs02 (such as APIC access, virtual APIC, and posted interrupt descriptor pages).

The current implementation uses kvm_vcpu_map/unmap, which causes two issues.

First, the current approach is missing proper invalidation handling in critical scenarios. Enlightened VMCS (eVMCS) pages can become stale when memslots are modified, as there is no mechanism to invalidate the cached mappings. Similarly, APIC access and virtual APIC pages can be migrated by the host, but without proper notification through mmu_notifier callbacks, the mappings become invalid and can lead to incorrect behavior.

Second, for unmanaged guest memory (memory not directly mapped by the kernel, such as memory passed with the mem= parameter or guest_memfd for non-CoCo VMs), this workflow invokes expensive memremap/memunmap operations on every L2 VM entry/exit cycle. This creates significant overhead that impacts nested virtualization performance.

This series replaces kvm_host_map with gfn_to_pfn_cache in nested VMX. The pfncache infrastructure maintains persistent mappings as long as the page GPA does not change, eliminating the memremap/memunmap overhead on every VM entry/exit cycle. Additionally, pfncache provides proper invalidation handling via mmu_notifier callbacks and memslots generation check, ensuring that mappings are correctly updated during both memslot updates and page migration events.

As an example, a microbenchmark using memslot_perf_test with 8192 memslots demonstrates huge improvements in nested VMX operations with unmanaged guest memory (this is a synthetic benchmark run on AWS EC2 Nitro instances, and the results are not representative of typical nested virtualization workloads):

                  Before     After      Improvement
  map:            26.12s     1.54s      ~17x faster
  unmap:          40.00s     0.017s     ~2353x faster
  unmap chunked:  10.07s     0.005s     ~2014x faster

The series is organized as follows:

Patches 1-5 handle the L1 MSR bitmap page and system pages (APIC access, virtual APIC, and posted interrupt descriptor). Patch 1 converts the MSR bitmap to use gfn_to_pfn_cache. Patches 2-3 restore and complete "guest-uses-pfn" support in pfncache. Patch 4 converts the system pages to use gfn_to_pfn_cache. Patch 5 adds a selftest for cache invalidation and memslot updates.

Patches 6-7 add enlightened VMCS support. Patch 6 avoids accessing eVMCS fields after they are copied into the cached vmcs12 structure. Patch 7 converts eVMCS page mapping to use gfn_to_pfn_cache.

Patches 8-10 implement persistent nested context to handle L2 vCPU multiplexing and migration between L1 vCPUs. Patch 8 introduces the nested context management infrastructure. Patch 9 integrates pfncache with persistent nested context. Patch 10 adds a selftest for this L2 vCPU context switching.

v3:
 - fixed warnings reported by kernel test robot in patches 7 and 8.

v2:
 - Extended series to support enlightened VMCS (eVMCS).
 - Added persistent nested context for improved L2 vCPU handling.
 - Added additional selftests.

Suggested-by: dwmw(a)amazon.co.uk

Fred Griffoul (10):
  KVM: nVMX: Implement cache for L1 MSR bitmap
  KVM: pfncache: Restore guest-uses-pfn support
  KVM: x86: Add nested state validation for pfncache support
  KVM: nVMX: Implement cache for L1 APIC pages
  KVM: selftests: Add nested VMX APIC cache invalidation test
  KVM: nVMX: Cache evmcs fields to ensure consistency during VM-entry
  KVM: nVMX: Replace evmcs kvm_host_map with pfncache
  KVM: x86: Add nested context management
  KVM: nVMX: Use nested context for pfncache persistence
  KVM: selftests: Add L2 vcpu context switch test

 arch/x86/include/asm/kvm_host.h               |  32 ++
 arch/x86/include/uapi/asm/kvm.h               |   2 +
 arch/x86/kvm/Makefile                         |   2 +-
 arch/x86/kvm/nested.c                         | 199 ++++++++
 arch/x86/kvm/vmx/hyperv.c                     |   5 +-
 arch/x86/kvm/vmx/hyperv.h                     |  33 +-
 arch/x86/kvm/vmx/nested.c                     | 469 ++++++++++++++----
 arch/x86/kvm/vmx/vmx.c                        |   8 +
 arch/x86/kvm/vmx/vmx.h                        |  16 +-
 arch/x86/kvm/x86.c                            |  19 +-
 include/linux/kvm_host.h                      |  34 +-
 include/linux/kvm_types.h                     |   1 +
 tools/testing/selftests/kvm/Makefile.kvm      |   2 +
 .../selftests/kvm/x86/vmx_apic_update_test.c  | 302 +++++++++++
 .../selftests/kvm/x86/vmx_l2_switch_test.c    | 416 ++++++++++++++++
 virt/kvm/kvm_main.c                           |   3 +-
 virt/kvm/kvm_mm.h                             |   6 +-
 virt/kvm/pfncache.c                           |  43 +-
 18 files changed, 1469 insertions(+), 123 deletions(-)
 create mode 100644 arch/x86/kvm/nested.c
 create mode 100644 tools/testing/selftests/kvm/x86/vmx_apic_update_test.c
 create mode 100644 tools/testing/selftests/kvm/x86/vmx_l2_switch_test.c

base-commit: 6b36119b94d0b2bb8cea9d512017efafd461d6ac
prerequisite-patch-id: afd3db49735b65c8a642de8dab7d0160d5da4b67
--
2.43.0
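The difference between mapping on every entry/exit and keeping a persistent, invalidatable mapping can be illustrated with a toy cache. This is a sketch only and not the kernel's gfn_to_pfn_cache: struct toy_gpc and the toy_gpc_* functions are hypothetical, and the "generation" field merely stands in for the memslots generation check mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy cache entry: remembers which GPA it maps and under which generation. */
struct toy_gpc {
	uint64_t gpa;
	uint64_t generation;
	bool valid;
	unsigned int remaps;	/* counts the expensive (re)mapping operations */
};

/* Models an invalidation event, e.g. the backing page being migrated. */
static void toy_gpc_invalidate(struct toy_gpc *gpc)
{
	assert(gpc);
	gpc->valid = false;
}

/*
 * Returns true if an expensive remap was needed, false if the existing
 * mapping could be reused because neither the GPA nor the generation changed.
 */
static bool toy_gpc_refresh(struct toy_gpc *gpc, uint64_t gpa,
			    uint64_t generation)
{
	assert(gpc);

	if (gpc->valid && gpc->gpa == gpa && gpc->generation == generation)
		return false;

	gpc->gpa = gpa;
	gpc->generation = generation;
	gpc->valid = true;
	gpc->remaps++;
	return true;
}
```

In this model, calling toy_gpc_refresh() once per simulated VM entry with an unchanged GPA and generation costs a single remap in total, whereas a map/unmap-per-entry scheme would pay the mapping cost every time.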

2 weeks, 4 days


[PATCH bpf-next v2 0/3] bpf: Fix FIONREAD and copied_seq issues

by Jiayuan Chen

syzkaller reported a bug [1] where a socket using sockmap, after being unloaded, exposed incorrect copied_seq calculation. The selftest I provided can be used to reproduce the issue reported by syzkaller.

TCP recvmsg seq # bug 2: copied E92C873, seq E68D125, rcvnxt E7CEB7C, fl 40
WARNING: CPU: 1 PID: 5997 at net/ipv4/tcp.c:2724 tcp_recvmsg_locked+0xb2f/0x2910 net/ipv4/tcp.c:2724
Call Trace:
 <TASK>
 receive_fallback_to_copy net/ipv4/tcp.c:1968 [inline]
 tcp_zerocopy_receive+0x131a/0x2120 net/ipv4/tcp.c:2200
 do_tcp_getsockopt+0xe28/0x26c0 net/ipv4/tcp.c:4713
 tcp_getsockopt+0xdf/0x100 net/ipv4/tcp.c:4812
 do_sock_getsockopt+0x34d/0x440 net/socket.c:2421
 __sys_getsockopt+0x12f/0x260 net/socket.c:2450
 __do_sys_getsockopt net/socket.c:2457 [inline]
 __se_sys_getsockopt net/socket.c:2454 [inline]
 __x64_sys_getsockopt+0xbd/0x160 net/socket.c:2454
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xcd/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

A sockmap socket maintains its own receive queue (ingress_msg) which may contain data from either its own protocol stack or forwarded from other sockets.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

The issue occurs when reading from ingress_msg: we update tp->copied_seq by default, but if the data comes from other sockets (not the socket's own protocol stack), tcp->rcv_nxt remains unchanged. Later, when converting back to a native socket, reads may fail as copied_seq could be significantly larger than rcv_nxt.

Additionally, FIONREAD calculation based on copied_seq and rcv_nxt is insufficient for sockmap sockets, requiring separate field tracking.

[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983

---
v1 -> v2: Use skmsg.sk instead of extending BPF_F_XXX macro
v1: https://lore.kernel.org/bpf/20251117110736.293040-1-jiayuan.chen@linux.dev/

Jiayuan Chen (3):
  bpf, sockmap: Fix incorrect copied_seq calculation
  bpf, sockmap: Fix FIONREAD for sockmap
  bpf, selftest: Add tests for FIONREAD and copied_seq

 include/linux/skmsg.h                         |  48 ++++-
 net/core/skmsg.c                              |  29 ++-
 net/ipv4/tcp_bpf.c                            |  26 ++-
 net/ipv4/udp_bpf.c                            |  25 ++-
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 203 +++++++++++++++++-
 .../bpf/progs/test_sockmap_pass_prog.c        |   8 +
 6 files changed, 323 insertions(+), 16 deletions(-)

--
2.43.0

2 weeks, 5 days


[PATCH net-next v3 0/4] netconsole: Allow userdata buffer to grow dynamically

by Gustavo Luiz Duarte

The current netconsole implementation allocates a static buffer for extradata (userdata + sysdata) with a fixed size of MAX_EXTRADATA_ENTRY_LEN * MAX_EXTRADATA_ITEMS bytes for every target, regardless of whether userspace actually uses this feature. This forces us to keep MAX_EXTRADATA_ITEMS small (16), which is restrictive for users who need to attach more metadata to their log messages.

This patch series enables dynamic allocation of the userdata buffer, allowing it to grow on-demand based on actual usage. The series:

1. Refactors send_fragmented_body() to simplify handling of separated userdata and sysdata (patch 1/4)
2. Splits userdata and sysdata into separate buffers (patch 2/4)
3. Implements dynamic allocation for the userdata buffer (patch 3/4)
4. Increases MAX_USERDATA_ITEMS from 16 to 256 now that we can do so without memory waste (patch 4/4)

Benefits:
- No memory waste when userdata is not used
- Targets that use userdata only consume what they need
- Users can attach significantly more metadata without impacting systems that don't use this feature

Signed-off-by: Gustavo Luiz Duarte <gustavold(a)gmail.com>
---
Changes in v3:
- Split calculating the length of the formatted userdata string into a separate function calc_userdata_len().
- Exit update_userdata() immediately if we hit WARN due to too many userdata entries.
- Use offset instead of len to save userdata_length in update_userdata()
- Link to v2: https://lore.kernel.org/r/20251113-netconsole_dynamic_extradata-v2-0-18cf7f…

Changes in v2:
- Added null pointer checks for userdata and sysdata buffers
- Added MAX_SYSDATA_ITEMS to enum sysdata_feature
- Moved code out of ifdef in send_msg_no_fragmentation()
- Renamed variables in send_fragmented_body() to make it easier to reason about the code
- Link to v1: https://lore.kernel.org/r/20251105-netconsole_dynamic_extradata-v1-0-142890…

---
Gustavo Luiz Duarte (4):
  netconsole: Simplify send_fragmented_body()
  netconsole: Split userdata and sysdata
  netconsole: Dynamic allocation of userdata buffer
  netconsole: Increase MAX_USERDATA_ITEMS

 drivers/net/netconsole.c                           | 386 +++++++++++----------
 .../selftests/drivers/net/netcons_overflow.sh      |   2 +-
 2 files changed, 195 insertions(+), 193 deletions(-)
---
base-commit: 45a1cd8346ca245a1ca475b26eb6ceb9d8b7c6f0
change-id: 20251007-netconsole_dynamic_extradata-21bd9d726568

Best regards,
--
Gustavo Duarte <gustavold(a)meta.com>
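The general shape of "compute the length first, then allocate exactly that much" can be sketched in userspace C. This is not the driver code: struct toy_entry and the toy_* functions are hypothetical, malloc() stands in for a kernel allocation, and the " key=value\n" layout is an assumption about how each entry is formatted.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userdata entry for this sketch. */
struct toy_entry {
	const char *key;
	const char *value;
};

/* First pass: length of all entries formatted as " key=value\n". */
static size_t toy_calc_userdata_len(const struct toy_entry *e, size_t n)
{
	size_t len = 0;

	for (size_t i = 0; i < n; i++)
		len += 1 + strlen(e[i].key) + 1 + strlen(e[i].value) + 1;

	return len;
}

/*
 * Second pass: allocate exactly what is needed and format into it.
 * Returns NULL (and *out_len == 0) when there is nothing to store, so a
 * target without userdata consumes no buffer at all. The caller frees.
 */
static char *toy_build_userdata(const struct toy_entry *e, size_t n,
				size_t *out_len)
{
	size_t len = toy_calc_userdata_len(e, n);
	size_t offset = 0;
	char *buf;

	*out_len = 0;
	if (len == 0)
		return NULL;

	buf = malloc(len + 1);	/* +1 for the terminating NUL */
	if (!buf)
		return NULL;

	for (size_t i = 0; i < n; i++) {
		int w = snprintf(buf + offset, len + 1 - offset, " %s=%s\n",
				 e[i].key, e[i].value);

		assert(w > 0 && (size_t)w <= len - offset);
		offset += (size_t)w;
	}

	*out_len = offset;
	return buf;
}
```

Tracking the write position with an offset, as the v3 changelog mentions, means the final offset is also the stored length, so no separate strlen() pass is needed afterwards.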

2 weeks, 5 days

