This series fixes two issues in the bonding 802.3ad implementation
related to port state management and churn detection, and extends the
selftests to cover them:
1. When disabling a port, we need to set AD_RX_PORT_DISABLED to ensure
proper state machine transitions, preventing ports from getting stuck
in AD_RX_CURRENT state.
2. The ad_churn_machine implementation is restructured to follow IEEE
802.1AX-2014 specifications correctly. The current implementation has
several issues: it doesn't transition to "none" state immediately when
synchronization is achieved, and can get stuck in churned state in
multi-aggregator scenarios.
3. Selftests are enhanced to validate both mux state machine and churn
state logic under aggregator selection and failover scenarios.
These changes ensure proper LACP state machine behavior and fix issues
where ports could remain in incorrect states during aggregator failover.
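To illustrate the churn behaviour item 2 aims for, here is a minimal
standalone model. It is only a sketch and not the code in bond_3ad.c: the
enum, struct and function names are hypothetical, time is counted in
abstract ticks, the timeout value is an assumption, and the transitions
reflect my reading of the churn detection machine in IEEE 802.1AX-2014.

```c
#include <stdbool.h>

/* Hypothetical churn states; names are illustrative, not those in bond_3ad.c. */
enum churn_state { CHURN_MONITOR, CHURN_NONE, CHURN_CHURNED };

struct churn_model {
	enum churn_state state;
	int timer;		/* ticks left before declaring churn */
};

#define CHURN_TIMEOUT_TICKS 60	/* assumed timeout, in abstract ticks */

/* Enter the monitoring state with a fresh timer. */
static void churn_reset(struct churn_model *c)
{
	c->state = CHURN_MONITOR;
	c->timer = CHURN_TIMEOUT_TICKS;
}

/* Advance the model by one tick given the port's current sync status. */
static void churn_tick(struct churn_model *c, bool in_sync)
{
	if (in_sync) {
		/* Synchronization leads straight to "none" from any state. */
		c->state = CHURN_NONE;
		return;
	}

	switch (c->state) {
	case CHURN_NONE:
		/* Lost sync: start monitoring again with a fresh timer. */
		churn_reset(c);
		break;
	case CHURN_MONITOR:
		/* Still out of sync: declare churn once the timer runs out. */
		if (--c->timer <= 0)
			c->state = CHURN_CHURNED;
		break;
	case CHURN_CHURNED:
		/* Stay churned until sync is regained. */
		break;
	}
}
```

In this model a port that regains synchronization leaves both the
monitoring and the churned state on the very next tick, which is the
property the restructured machine is meant to guarantee.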
Hangbin Liu (3):
bonding: set AD_RX_PORT_DISABLED when disabling a port
bonding: restructure ad_churn_machine
selftests: bonding: add mux and churn state testing
drivers/net/bonding/bond_3ad.c | 105 ++++++++++++++----
.../selftests/drivers/net/bonding/Makefile | 2 +-
...nd_lacp_prio.sh => bond_lacp_ad_select.sh} | 73 ++++++++++++
3 files changed, 159 insertions(+), 21 deletions(-)
rename tools/testing/selftests/drivers/net/bonding/{bond_lacp_prio.sh => bond_lacp_ad_select.sh} (64%)
--
2.50.1
Hello,
This version is a complete rewrite of the syscall (thanks Thomas for the
suggestions!).
* Use case
The use-case for the new syscalls is detailed in the last patch version:
https://lore.kernel.org/lkml/20250626-tonyk-robust_futex-v5-0-179194dbde8f@…
* The syscall interface
Documented in patches 3/9 "futex: Create set_robust_list2() syscall" and
4/9 "futex: Create get_robust_list2() syscall".
* Testing
I expanded the current robust list selftest to use the new interface,
and also ported the original syscall to use the new syscall internals,
and everything survived the tests.
* Changelog
Changes from v5:
- Complete interface rewrite. There are many changes, but the main
ones are the following:
- Array of robust lists now has a static size, allocated once during the
first usage of the list
- Now that the list of robust lists has a fixed size, I removed the
logic of having a command for creating a new index on the list. To
simplify things for everyone, userspace just needs to call
set_robust_list2(head, 32-bit/64-bit type, index).
- Created get_robust_list2()
- The new code can be better integrated with the original interface
- v5: https://lore.kernel.org/r/20250626-tonyk-robust_futex-v5-0-179194dbde8f@iga…
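As a rough picture of the fixed-size design described in the changelog,
the following user-space model keeps a static table of list heads that is
addressed by index. It is only a sketch: the table size, the type tags,
the error convention and the model_* function names are all assumptions
made for illustration. The real syscall signatures and semantics are the
ones documented in patches 3/9 and 4/9.

```c
#include <stddef.h>

#define ROBUST_LIST_SLOTS 4	/* assumed size; the series defines the real limit */

/* Hypothetical type tags standing in for the 32-bit/64-bit list flavours. */
enum robust_type { ROBUST_NONE = 0, ROBUST_32, ROBUST_64 };

struct robust_slot {
	const void *head;
	enum robust_type type;
};

/* Fixed-size table, modelling the array allocated once on first use. */
struct robust_table {
	struct robust_slot slots[ROBUST_LIST_SLOTS];
};

/* Model of set_robust_list2(head, type, index): 0 on success, -1 on bad input. */
static int model_set_robust_list2(struct robust_table *t, const void *head,
				  enum robust_type type, unsigned int index)
{
	if (index >= ROBUST_LIST_SLOTS || type == ROBUST_NONE)
		return -1;
	t->slots[index].head = head;
	t->slots[index].type = type;
	return 0;
}

/* Model of get_robust_list2(): the head stored at index, or NULL. */
static const void *model_get_robust_list2(const struct robust_table *t,
					  unsigned int index)
{
	if (index >= ROBUST_LIST_SLOTS)
		return NULL;
	return t->slots[index].head;
}
```

Because the table has a fixed size, no separate "create index" command is
needed in the model: a caller simply picks a slot and sets it.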
Feedback is very welcome!
---
André Almeida (9):
futex: Use explicit sizes for compat_robust_list structs
futex: Make exit_robust_list32() unconditionally available for 64-bit kernels
futex: Create set_robust_list2() syscall
futex: Create get_robust_list2() syscall
futex: Wire up set_robust_list2 syscall
futex: Wire up get_robust_list2 syscall
selftests/futex: Expand for set_robust_list2()
selftests/futex: Expand for get_robust_list2()
futex: Use new robust list API internally
arch/alpha/kernel/syscalls/syscall.tbl | 2 +
arch/arm/tools/syscall.tbl | 2 +
arch/m68k/kernel/syscalls/syscall.tbl | 2 +
arch/microblaze/kernel/syscalls/syscall.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 2 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +
arch/parisc/kernel/syscalls/syscall.tbl | 2 +
arch/powerpc/kernel/syscalls/syscall.tbl | 2 +
arch/s390/kernel/syscalls/syscall.tbl | 2 +
arch/sh/kernel/syscalls/syscall.tbl | 2 +
arch/sparc/kernel/syscalls/syscall.tbl | 2 +
arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
arch/xtensa/kernel/syscalls/syscall.tbl | 2 +
include/linux/compat.h | 13 +-
include/linux/futex.h | 30 +-
include/linux/sched.h | 6 +-
include/uapi/asm-generic/unistd.h | 7 +-
include/uapi/linux/futex.h | 26 ++
kernel/futex/core.c | 140 ++++--
kernel/futex/syscalls.c | 134 +++++-
kernel/sys_ni.c | 2 +
scripts/syscall.tbl | 1 +
.../selftests/futex/functional/robust_list.c | 504 +++++++++++++++++++--
25 files changed, 788 insertions(+), 105 deletions(-)
---
base-commit: c42ba5a87bdccbca11403b7ca8bad1a57b833732
change-id: 20250225-tonyk-robust_futex-60adeedac695
Best regards,
--
André Almeida <andrealmeid(a)igalia.com>
nolibc currently uses 32-bit types for various APIs. These are
problematic as their reduced value range can lead to truncated values.
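A small sketch of the kind of truncation meant here. The helper names are
hypothetical and do not come from nolibc. The narrowing cast relies on the
wrapping behaviour that GCC and Clang implement for out-of-range
conversions.

```c
#include <stdint.h>

/* Narrow a 64-bit seconds count the way a 32-bit field would store it. */
static int32_t store_in_32bit_field(int64_t seconds)
{
	return (int32_t)(uint32_t)seconds;
}

/* Report whether a value survives a round trip through the 32-bit field. */
static int fits_in_32bit_field(int64_t seconds)
{
	return (int64_t)store_in_32bit_field(seconds) == seconds;
}
```

Any timestamp past 2147483647 seconds (early 2038 for a Unix epoch) no
longer round-trips through the narrow field, which is why 64-bit types are
preferred.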
Intended for 6.19.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Changes in v2:
- Drop already applied ino_t and off_t patches.
- Also handle 'struct timeval'.
- Make the progression of the series a bit clearer.
- Add compatibility assertions.
- Link to v1: https://lore.kernel.org/r/20251029-nolibc-uapi-types-v1-0-e79de3b215d8@weis…
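The idea behind the compatibility assertions can be sketched with plain
C11 _Static_assert. This is not the __nolibc_static_assert() macro added
by the series, whose definition is in the patch itself. Both struct names
below are hypothetical stand-ins for a libc-side type and the kernel
layout it has to match.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a kernel 64-bit timespec layout. */
struct kernel_ts64 {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* Hypothetical stand-in for the libc-side struct that must match it. */
struct libc_ts {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* Fail the build if the two layouts ever diverge. */
_Static_assert(sizeof(struct libc_ts) == sizeof(struct kernel_ts64),
	       "timespec size mismatch");
_Static_assert(offsetof(struct libc_ts, tv_sec) ==
	       offsetof(struct kernel_ts64, tv_sec), "tv_sec offset mismatch");
_Static_assert(offsetof(struct libc_ts, tv_nsec) ==
	       offsetof(struct kernel_ts64, tv_nsec), "tv_nsec offset mismatch");

/* With identical layouts, a byte copy is enough and no conversion is needed. */
static struct kernel_ts64 to_kernel(const struct libc_ts *ts)
{
	struct kernel_ts64 k;

	memcpy(&k, ts, sizeof(k));
	return k;
}
```

Once such assertions hold, explicit field-by-field conversions between the
two types become unnecessary.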
---
Thomas Weißschuh (13):
tools/nolibc/poll: use kernel types for system call invocations
tools/nolibc/poll: drop __NR_poll fallback
tools/nolibc/select: drop non-pselect based implementations
tools/nolibc/time: drop invocation of gettimeofday system call
tools/nolibc: prefer explicit 64-bit time-related system calls
tools/nolibc/gettimeofday: avoid libgcc 64-bit divisions
tools/nolibc/select: avoid libgcc 64-bit multiplications
tools/nolibc: use custom structs timespec and timeval
tools/nolibc: always use 64-bit time types
selftests/nolibc: test compatibility of nolibc and kernel time types
tools/nolibc: remove time conversions
tools/nolibc: add __nolibc_static_assert()
selftests/nolibc: add static assertions around time types handling
tools/include/nolibc/arch-s390.h | 3 +
tools/include/nolibc/compiler.h | 2 +
tools/include/nolibc/poll.h | 14 ++--
tools/include/nolibc/std.h | 2 +-
tools/include/nolibc/sys/select.h | 25 ++-----
tools/include/nolibc/sys/time.h | 6 +-
tools/include/nolibc/sys/timerfd.h | 32 +++------
tools/include/nolibc/time.h | 102 +++++++++------------------
tools/include/nolibc/types.h | 17 ++++-
tools/testing/selftests/nolibc/nolibc-test.c | 27 +++++++
10 files changed, 107 insertions(+), 123 deletions(-)
---
base-commit: 586e8d5137dfcddfccca44c3b992b92d2be79347
change-id: 20251001-nolibc-uapi-types-1c072d10fcc7
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
LLVM 21 switched to -mcmodel=medium for LoongArch64 compilations.
This code model uses R_LARCH_CALL36 relocations, which might not be
supported by GNU ld, which the nolibc testsuite uses by default.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (2):
selftests/nolibc: use lld to link loongarch binaries
selftests/nolibc: error out on linker warnings
tools/testing/selftests/nolibc/Makefile.nolibc | 1 +
tools/testing/selftests/nolibc/run-tests.sh | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
---
base-commit: 6059e06967aaac9bf736c6cec75b9bccaf5bbe18
change-id: 20251121-nolibc-lld-f32af4983cc0
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
GCC warns about potential out-of-bounds access when the test provides
a buffer smaller than struct iommu_test_hw_info:
iommufd_utils.h:817:37: warning: array subscript 'struct
iommu_test_hw_info[0]' is partly outside array bounds of 'struct
iommu_test_hw_info_buffer_smaller[1]'
[-Warray-bounds=]
817 | assert(!info->flags);
| ~~~~^~~~~~~
The warning occurs because 'info' is cast to a pointer to the full
8-byte struct at the top of the function, but the buffer_smaller test
case passes only a 4-byte buffer. While the code correctly checks
data_len before accessing each field, GCC's flow analysis with inlining
doesn't recognize that the size check protects the access.
Fix this by accessing fields through appropriately-typed pointers that
match the actual field sizes (__u32), declared only after the bounds
check. This makes the relationship between the size check and memory
access explicit to the compiler.
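The pattern can be shown in isolation with a standalone sketch. The struct
and function names here are hypothetical, and offsetofend() is redefined
locally so the example compiles outside the kernel tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte layout mirroring the two __u32 fields discussed above. */
struct hw_info {
	uint32_t flags;
	uint32_t test_reg;
};

#define offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/*
 * Return 1 if every field that fits in the buffer has the expected value.
 * Each field is read through a uint32_t pointer formed only after the
 * length check, so no full-struct pointer to a short buffer ever exists.
 */
static int check_hw_info(const void *data, size_t data_len,
			 uint32_t expect_reg)
{
	if (!data)
		return 1;
	if (data_len >= offsetofend(struct hw_info, test_reg)) {
		const uint32_t *test_reg = (const uint32_t *)data + 1;

		if (*test_reg != expect_reg)
			return 0;
	}
	if (data_len >= offsetofend(struct hw_info, flags)) {
		const uint32_t *flags = data;

		if (*flags != 0)
			return 0;
	}
	return 1;
}
```

A 4-byte buffer only ever reaches the flags branch, so the compiler never
sees an access that could extend past the end of the buffer.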
Signed-off-by: Nirbhay Sharma <nirbhay.lkd(a)gmail.com>
---
tools/testing/selftests/iommu/iommufd_utils.h | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 9f472c20c190..37c1b994008c 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -770,7 +770,6 @@ static int _test_cmd_get_hw_info(int fd, __u32 device_id, __u32 data_type,
void *data, size_t data_len,
uint32_t *capabilities, uint8_t *max_pasid)
{
- struct iommu_test_hw_info *info = (struct iommu_test_hw_info *)data;
struct iommu_hw_info cmd = {
.size = sizeof(cmd),
.dev_id = device_id,
@@ -810,11 +809,19 @@ static int _test_cmd_get_hw_info(int fd, __u32 device_id, __u32 data_type,
}
}
- if (info) {
- if (data_len >= offsetofend(struct iommu_test_hw_info, test_reg))
- assert(info->test_reg == IOMMU_HW_INFO_SELFTEST_REGVAL);
- if (data_len >= offsetofend(struct iommu_test_hw_info, flags))
- assert(!info->flags);
+ if (data) {
+ if (data_len >= offsetofend(struct iommu_test_hw_info,
+ test_reg)) {
+ __u32 *test_reg = (__u32 *)data + 1;
+
+ assert(*test_reg == IOMMU_HW_INFO_SELFTEST_REGVAL);
+ }
+ if (data_len >= offsetofend(struct iommu_test_hw_info,
+ flags)) {
+ __u32 *flags = data;
+
+ assert(!*flags);
+ }
}
if (max_pasid)
--
2.48.1
From: Fred Griffoul <fgriffo(a)amazon.co.uk>
This patch series addresses both performance and correctness issues in
nested VMX when handling guest memory.
During nested VMX operations, L0 (KVM) accesses specific L1 guest pages
to manage L2 execution. These pages fall into two categories: pages
accessed only by L0 (such as the L1 MSR bitmap page or the eVMCS page),
and pages passed to the L2 guest via vmcs02 (such as APIC access,
virtual APIC, and posted interrupt descriptor pages).
The current implementation uses kvm_vcpu_map/unmap, which causes two
issues.
First, the current approach is missing proper invalidation handling in
critical scenarios. Enlightened VMCS (eVMCS) pages can become stale when
memslots are modified, as there is no mechanism to invalidate the cached
mappings. Similarly, APIC access and virtual APIC pages can be migrated
by the host, but without proper notification through mmu_notifier
callbacks, the mappings become invalid and can lead to incorrect
behavior.
Second, for unmanaged guest memory (memory not directly mapped by the
kernel, such as memory passed with the mem= parameter or guest_memfd for
non-CoCo VMs), this workflow invokes expensive memremap/memunmap
operations on every L2 VM entry/exit cycle. This creates significant
overhead that impacts nested virtualization performance.
This series replaces kvm_host_map with gfn_to_pfn_cache in nested VMX.
The pfncache infrastructure maintains persistent mappings as long as the
page GPA does not change, eliminating the memremap/memunmap overhead on
every VM entry/exit cycle. Additionally, pfncache provides proper
invalidation handling via mmu_notifier callbacks and a memslot generation
check, ensuring that mappings are correctly updated during both memslot
updates and page migration events.
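The caching idea can be sketched with a toy model. This is not the
gfn_to_pfn_cache API: the struct, its fields and the function names are
hypothetical, and a counter stands in for the cost of a real
memremap/memunmap cycle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy cache entry; field names are illustrative, not those of gfn_to_pfn_cache. */
struct map_cache {
	uint64_t gpa;		/* guest physical address currently cached */
	uint64_t generation;	/* memslot generation seen at mapping time */
	bool valid;		/* cleared by an invalidation callback */
	unsigned int remaps;	/* counts the expensive (re)mapping operations */
};

/* Invalidation hook, standing in for an mmu_notifier callback. */
static void cache_invalidate(struct map_cache *c)
{
	c->valid = false;
}

/*
 * Ensure the cache maps @gpa under @generation. Remap only when the
 * address changed, the generation moved on, or the entry was invalidated.
 * Returns true if the existing mapping could be reused.
 */
static bool cache_refresh(struct map_cache *c, uint64_t gpa,
			  uint64_t generation)
{
	if (c->valid && c->gpa == gpa && c->generation == generation)
		return true;

	c->gpa = gpa;
	c->generation = generation;
	c->valid = true;
	c->remaps++;	/* models the cost of a fresh mapping */
	return false;
}
```

In this model, repeated entry/exit cycles on an unchanged page reuse the
mapping for free, while an invalidation or a memslot generation bump
forces exactly one remap on the next use.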
As an example, a microbenchmark using memslot_perf_test with 8192
memslots demonstrates huge improvements in nested VMX operations with
unmanaged guest memory:
                  Before    After     Improvement
  map:            26.12s    1.54s     ~17x faster
  unmap:          40.00s    0.017s    ~2353x faster
  unmap chunked:  10.07s    0.005s    ~2014x faster
The series is organized as follows:
Patches 1-5 handle the L1 MSR bitmap page and system pages (APIC access,
virtual APIC, and posted interrupt descriptor). Patch 1 converts the MSR
bitmap to use gfn_to_pfn_cache. Patches 2-3 restore and complete
"guest-uses-pfn" support in pfncache. Patch 4 converts the system pages
to use gfn_to_pfn_cache. Patch 5 adds a selftest for cache invalidation
and memslot updates.
Patches 6-7 add enlightened VMCS support. Patch 6 avoids accessing eVMCS
fields after they are copied into the cached vmcs12 structure. Patch 7
converts eVMCS page mapping to use gfn_to_pfn_cache.
Patches 8-10 implement persistent nested context to handle L2 vCPU
multiplexing and migration between L1 vCPUs. Patch 8 introduces the
nested context management infrastructure. Patch 9 integrates pfncache
with persistent nested context. Patch 10 adds a selftest for this L2
vCPU context switching.
v2:
- Extended series to support enlightened VMCS (eVMCS).
- Added persistent nested context for improved L2 vCPU handling.
- Added additional selftests.
Suggested-by: dwmw(a)amazon.co.uk
Fred Griffoul (10):
KVM: nVMX: Implement cache for L1 MSR bitmap
KVM: pfncache: Restore guest-uses-pfn support
KVM: x86: Add nested state validation for pfncache support
KVM: nVMX: Implement cache for L1 APIC pages
KVM: selftests: Add nested VMX APIC cache invalidation test
KVM: nVMX: Cache evmcs fields to ensure consistency during VM-entry
KVM: nVMX: Replace evmcs kvm_host_map with pfncache
KVM: x86: Add nested context management
KVM: nVMX: Use nested context for pfncache persistence
KVM: selftests: Add L2 vcpu context switch test
arch/x86/include/asm/kvm_host.h | 32 ++
arch/x86/include/uapi/asm/kvm.h | 2 +
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/nested.c | 199 ++++++++
arch/x86/kvm/vmx/hyperv.c | 5 +-
arch/x86/kvm/vmx/hyperv.h | 33 +-
arch/x86/kvm/vmx/nested.c | 463 ++++++++++++++----
arch/x86/kvm/vmx/vmx.c | 8 +
arch/x86/kvm/vmx/vmx.h | 16 +-
arch/x86/kvm/x86.c | 19 +-
include/linux/kvm_host.h | 34 +-
include/linux/kvm_types.h | 1 +
tools/testing/selftests/kvm/Makefile.kvm | 2 +
.../selftests/kvm/x86/vmx_apic_update_test.c | 302 ++++++++++++
.../selftests/kvm/x86/vmx_l2_switch_test.c | 416 ++++++++++++++++
virt/kvm/kvm_main.c | 3 +-
virt/kvm/kvm_mm.h | 6 +-
virt/kvm/pfncache.c | 43 +-
18 files changed, 1467 insertions(+), 119 deletions(-)
create mode 100644 arch/x86/kvm/nested.c
create mode 100644 tools/testing/selftests/kvm/x86/vmx_apic_update_test.c
create mode 100644 tools/testing/selftests/kvm/x86/vmx_l2_switch_test.c
--
2.43.0
syzkaller reported a bug [1] where a socket using sockmap, after being
unloaded, exposed an incorrect copied_seq calculation. The selftest added
in this series can be used to reproduce the issue reported by syzkaller.
TCP recvmsg seq # bug 2: copied E92C873, seq E68D125, rcvnxt E7CEB7C, fl 40
WARNING: CPU: 1 PID: 5997 at net/ipv4/tcp.c:2724 tcp_recvmsg_locked+0xb2f/0x2910 net/ipv4/tcp.c:2724
Call Trace:
<TASK>
receive_fallback_to_copy net/ipv4/tcp.c:1968 [inline]
tcp_zerocopy_receive+0x131a/0x2120 net/ipv4/tcp.c:2200
do_tcp_getsockopt+0xe28/0x26c0 net/ipv4/tcp.c:4713
tcp_getsockopt+0xdf/0x100 net/ipv4/tcp.c:4812
do_sock_getsockopt+0x34d/0x440 net/socket.c:2421
__sys_getsockopt+0x12f/0x260 net/socket.c:2450
__do_sys_getsockopt net/socket.c:2457 [inline]
__se_sys_getsockopt net/socket.c:2454 [inline]
__x64_sys_getsockopt+0xbd/0x160 net/socket.c:2454
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xcd/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
A sockmap socket maintains its own receive queue (ingress_msg) which may
contain data from either its own protocol stack or forwarded from other
sockets.
                                                    FD1:read()
                                                    --  FD1->copied_seq++
                                                        |  [read data]
                                                        |
                               [enqueue data]           v
                 [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                ^
-- FD1->rcv_nxt++              -> redirect to other      | [enqueue data]
                                      |                  |
                                      |             ingress to FD1
                                      v                  ^
                                     ...                 |  [sockmap]
                                                    FD2 native stack
The issue occurs when reading from ingress_msg: we update tp->copied_seq
by default, but if the data comes from other sockets (not the socket's
own protocol stack), tp->rcv_nxt remains unchanged. Later, when
converting back to a native socket, reads may fail as copied_seq could
be significantly larger than rcv_nxt.
Additionally, the FIONREAD calculation based on copied_seq and rcv_nxt is
insufficient for sockmap sockets, so the queued byte count needs to be
tracked in a separate field.
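The accounting problem can be sketched with a toy model. It is not the
code in skmsg.c or tcp_bpf.c: the struct is not struct tcp_sock, msg_bytes
is a hypothetical counter standing in for the separately tracked field,
and the helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy socket with only the counters discussed above; not struct tcp_sock. */
struct toy_sock {
	uint32_t rcv_nxt;	/* bytes received by the socket's own stack */
	uint32_t copied_seq;	/* bytes handed to user space */
	uint32_t msg_bytes;	/* hypothetical: bytes queued in ingress_msg */
};

/* Data arrives on the socket's own protocol stack and is queued to itself. */
static void rx_native(struct toy_sock *sk, uint32_t len)
{
	sk->rcv_nxt += len;
	sk->msg_bytes += len;
}

/* Data is redirected in from another socket; rcv_nxt must not move. */
static void rx_redirected(struct toy_sock *sk, uint32_t len)
{
	sk->msg_bytes += len;
}

/* Read from ingress_msg; advance copied_seq only for native data. */
static void read_msg(struct toy_sock *sk, uint32_t len, bool from_self)
{
	sk->msg_bytes -= len;
	if (from_self)
		sk->copied_seq += len;
}

/* Invariant a native TCP socket relies on: copied_seq never passes rcv_nxt. */
static bool seq_consistent(const struct toy_sock *sk)
{
	return (int32_t)(sk->rcv_nxt - sk->copied_seq) >= 0;
}
```

In the model, msg_bytes is what a FIONREAD-style query would report while
the socket is in the map, and copied_seq stays within rcv_nxt as long as
redirected data is not counted against it.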
[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
Jiayuan Chen (3):
bpf, sockmap: Fix incorrect copied_seq calculation
bpf, sockmap: Fix FIONREAD for sockmap
bpf, selftest: Add tests for FIONREAD and copied_seq
include/linux/skmsg.h | 71 ++++++-
net/core/skmsg.c | 20 +-
net/ipv4/tcp_bpf.c | 26 ++-
net/ipv4/udp_bpf.c | 25 ++-
.../selftests/bpf/prog_tests/sockmap_basic.c | 192 +++++++++++++++++-
.../bpf/progs/test_sockmap_pass_prog.c | 8 +
6 files changed, 325 insertions(+), 17 deletions(-)
--
2.43.0