The `__WORDSIZE` macro, defined in the non-standard `<bits/wordsize.h>`
header, is a GNU extension and not universally available with all
toolchains, such as Clang when used with musl libc.
This can lead to build failures in environments where this header is
missing.
The intention of the code is to determine the bit width of a C `long`.
Replace the non-portable `__WORDSIZE` with the standard and portable
`sizeof(long) * 8` expression to achieve the same result.
This change also removes the inclusion of the now-unused
`<bits/wordsize.h>` header.
Signed-off-by: Wake Liu <wakel(a)google.com>
---
tools/testing/selftests/net/psock_tpacket.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/psock_tpacket.c b/tools/testing/selftests/net/psock_tpacket.c
index 221270cee3ea..0dd909e325d9 100644
--- a/tools/testing/selftests/net/psock_tpacket.c
+++ b/tools/testing/selftests/net/psock_tpacket.c
@@ -33,7 +33,6 @@
#include <ctype.h>
#include <fcntl.h>
#include <unistd.h>
-#include <bits/wordsize.h>
#include <net/ethernet.h>
#include <netinet/ip.h>
#include <arpa/inet.h>
@@ -785,7 +784,7 @@ static int test_kernel_bit_width(void)
static int test_user_bit_width(void)
{
- return __WORDSIZE;
+ return sizeof(long) * 8;
}
static const char *tpacket_str[] = {
--
2.50.1.703.g449372360f-goog
Unlike IPv4, IPv6 routing strictly requires the source address to be valid
on the outgoing interface. If the NS target is set to a remote VLAN interface,
and the source address is also configured on a VLAN over a bond interface,
setting the oif to the bond device will fail to retrieve the correct
destination route.
Fix this by not setting the oif to the bond device when retrieving the NS
target destination. This allows the correct destination device (the VLAN
interface) to be determined, so that bond_verify_device_path can return the
proper VLAN tags for sending NS messages.
Reported-by: David Wilder <wilder(a)us.ibm.com>
Closes: https://lore.kernel.org/netdev/aGOKggdfjv0cApTO@fedora/
Suggested-by: Jay Vosburgh <jv(a)jvosburgh.net>
Fixes: 4e24be018eb9 ("bonding: add new parameter ns_targets")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
drivers/net/bonding/bond_main.c | 1 -
.../drivers/net/bonding/bond_options.sh | 59 +++++++++++++++++++
2 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 257333c88710..30cf97f4e814 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3355,7 +3355,6 @@ static void bond_ns_send_all(struct bonding *bond, struct slave *slave)
/* Find out through which dev should the packet go */
memset(&fl6, 0, sizeof(struct flowi6));
fl6.daddr = targets[i];
- fl6.flowi6_oif = bond->dev->ifindex;
dst = ip6_route_output(dev_net(bond->dev), NULL, &fl6);
if (dst->error) {
diff --git a/tools/testing/selftests/drivers/net/bonding/bond_options.sh b/tools/testing/selftests/drivers/net/bonding/bond_options.sh
index 7bc148889ca7..b3eb8a919c71 100755
--- a/tools/testing/selftests/drivers/net/bonding/bond_options.sh
+++ b/tools/testing/selftests/drivers/net/bonding/bond_options.sh
@@ -7,6 +7,7 @@ ALL_TESTS="
prio
arp_validate
num_grat_arp
+ vlan_over_bond
"
lib_dir=$(dirname "$0")
@@ -376,6 +377,64 @@ num_grat_arp()
done
}
+vlan_over_bond_arp()
+{
+ local mode="$1"
+ RET=0
+
+ bond_reset "mode $mode arp_interval 100 arp_ip_target 192.0.3.10"
+ ip -n "${s_ns}" link add bond0.3 link bond0 type vlan id 3
+ ip -n "${s_ns}" link set bond0.3 up
+ ip -n "${s_ns}" addr add 192.0.3.1/24 dev bond0.3
+ ip -n "${s_ns}" addr add 2001:db8::3:1/64 dev bond0.3
+
+ slowwait_for_counter 5 5 tc_rule_handle_stats_get \
+ "dev eth0.3 ingress" 101 ".packets" "-n ${c_ns}" || RET=1
+ log_test "vlan over bond arp" "$mode"
+}
+
+vlan_over_bond_ns()
+{
+ local mode="$1"
+ RET=0
+
+ if skip_ns; then
+ log_test_skip "vlan_over_bond ns" "$mode"
+ return 0
+ fi
+
+ bond_reset "mode $mode arp_interval 100 ns_ip6_target 2001:db8::3:10"
+ ip -n "${s_ns}" link add bond0.3 link bond0 type vlan id 3
+ ip -n "${s_ns}" link set bond0.3 up
+ ip -n "${s_ns}" addr add 192.0.3.1/24 dev bond0.3
+ ip -n "${s_ns}" addr add 2001:db8::3:1/64 dev bond0.3
+
+ slowwait_for_counter 5 5 tc_rule_handle_stats_get \
+ "dev eth0.3 ingress" 102 ".packets" "-n ${c_ns}" || RET=1
+ log_test "vlan over bond ns" "$mode"
+}
+
+vlan_over_bond()
+{
+ # add vlan 3 for client
+ ip -n "${c_ns}" link add eth0.3 link eth0 type vlan id 3
+ ip -n "${c_ns}" link set eth0.3 up
+ ip -n "${c_ns}" addr add 192.0.3.10/24 dev eth0.3
+ ip -n "${c_ns}" addr add 2001:db8::3:10/64 dev eth0.3
+
+ # Add tc rule to check the vlan pkts
+ tc -n "${c_ns}" qdisc add dev eth0.3 clsact
+ tc -n "${c_ns}" filter add dev eth0.3 ingress protocol arp \
+ handle 101 flower skip_hw arp_op request \
+ arp_sip 192.0.3.1 arp_tip 192.0.3.10 action pass
+ tc -n "${c_ns}" filter add dev eth0.3 ingress protocol ipv6 \
+ handle 102 flower skip_hw ip_proto icmpv6 \
+ type 135 src_ip 2001:db8::3:1 action pass
+
+ vlan_over_bond_arp "active-backup"
+ vlan_over_bond_ns "active-backup"
+}
+
trap cleanup EXIT
setup_prepare
--
2.50.1
Currently it hard codes the number of hugepage to check for
check_huge_anon(), but it would be more reasonable to do the check based
on a number passed in.
Pass in the hugepage number and do the check based on it.
Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Donet Tom <donettom(a)linux.ibm.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Zi Yan <ziy(a)nvidia.com>
---
v2:
* use mm-new
* add back nr_hpages which is removed by an early commit
* adjust the change log a little
* drop RB and resend
---
tools/testing/selftests/mm/split_huge_page_test.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index 5ab488fab1cd..63ac82f0b9e0 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -105,12 +105,12 @@ static char *allocate_zero_filled_hugepage(size_t len)
return result;
}
-static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, size_t len)
+static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hpages, size_t len)
{
unsigned long rss_anon_before, rss_anon_after;
size_t i;
- if (!check_huge_anon(one_page, 4, pmd_pagesize))
+ if (!check_huge_anon(one_page, nr_hpages, pmd_pagesize))
ksft_exit_fail_msg("No THP is allocated\n");
rss_anon_before = rss_anon();
@@ -141,7 +141,7 @@ void split_pmd_zero_pages(void)
size_t len = nr_hpages * pmd_pagesize;
one_page = allocate_zero_filled_hugepage(len);
- verify_rss_anon_split_huge_page_all_zeroes(one_page, len);
+ verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len);
ksft_test_result_pass("Split zero filled huge pages successful\n");
free(one_page);
}
--
2.34.1
--
Hi,
PERDIS SUPER U is a leading retail group in France with numerous
outlets across the country. After reviewing your company profile and
products, we’re very interested in establishing a long-term partnership.
Kindly share your product catalog or website so we can review your
offerings and pricing. We are ready to place orders and begin
cooperation.Please note: Our payment terms are SWIFT, 14 days after
delivery.
Looking forward to your response.
Best regards,
Dominique Schelcher
Director, PERDIS SUPER U
RUE DE SAVOIE, 45600 SAINT-PÈRE-SUR-LOIRE
VAT: FR65380071464
www.magasins-u.com
This patchset uses kpageflags to get after-split folio orders for a better
split_huge_page_test result check[1]. The added gather_folio_orders() scans
through a VPN range and collects the numbers of folios at different orders.
check_folio_orders() compares the result of gather_folio_orders() to
a given list of numbers of different orders.
This patchset also added new order and in folio offset to the split huge
page debugfs's pr_debug()s;
Changelog
===
From V1[2]:
1. Dropped split_huge_pages_pid() for loop step change to avoid messing
up with PTE-mapped THP handling. split_huge_page_test.c is changed to
perform split at [addr, addr + pagesize) range to limit one
folio_split() per folio.
2. Moved pr_debug changes in Patch 2 to Patch 1.
3. Moved KPF_* to vm_util.h and used PAGEMAP_PFN instead of local PFN_MASK.
4. Used pagemap_get_pfn() helper.
5. Used char *vaddr and size_t len as inputs to gather_folio_orders() and
check_folio_orders() instead of vpn and nr_pages.
6. Removed variable length variables and used malloc instead.
[1] https://lore.kernel.org/linux-mm/e2f32bdb-e4a4-447c-867c-31405cbba151@redha…
[2] https://lore.kernel.org/linux-mm/20250806022045.342824-1-ziy@nvidia.com/
Zi Yan (3):
mm/huge_memory: add new_order and offset to split_huge_pages*()
pr_debug.
selftests/mm: add check_folio_orders() helper.
selftests/mm: check after-split folio orders in split_huge_page_test.
mm/huge_memory.c | 8 +-
.../selftests/mm/split_huge_page_test.c | 102 ++++++++++----
tools/testing/selftests/mm/vm_util.c | 133 ++++++++++++++++++
tools/testing/selftests/mm/vm_util.h | 7 +
4 files changed, 217 insertions(+), 33 deletions(-)
--
2.47.2
This series introduces NUMA-aware memory placement support for KVM guests
with guest_memfd memory backends. It builds upon Fuad Tabba's work that
enabled host-mapping for guest_memfd memory [1].
== Background ==
KVM's guest-memfd memory backend currently lacks support for NUMA policy
enforcement, causing guest memory allocations to be distributed across host
nodes according to kernel's default behavior, irrespective of any policy
specified by the VMM. This limitation arises because conventional userspace
NUMA control mechanisms like mbind(2) don't work since the memory isn't
directly mapped to userspace when allocations occur.
Fuad's work [1] provides the necessary mmap capability, and this series
leverages it to enable mbind(2).
== Implementation ==
This series implements proper NUMA policy support for guest-memfd by:
1. Adding mempolicy-aware allocation APIs to the filemap layer.
2. Introducing custom inodes (via a dedicated slab-allocated inode cache,
kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory.
3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA
policy.
With these changes, VMMs can now control guest memory placement by mapping
guest_memfd file descriptor and using mbind(2) to specify:
- Policy modes: default, bind, interleave, or preferred
- Host NUMA nodes: List of target nodes for memory allocation
These Policies affect only future allocations and do not migrate existing
memory. This matches mbind(2)'s default behavior which affects only new
allocations unless overridden with MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags (Not
supported for guest_memfd as it is unmovable by design).
== Upstream Plan ==
Phased approach as per David's guest_memfd extension overview [2] and
community calls [3]:
Phase 1 (this series):
1. Focuses on shared guest_memfd support (non-CoCo VMs).
2. Builds on Fuad's host-mapping work.
Phase2 (future work):
1. NUMA support for private guest_memfd (CoCo VMs).
2. Depends on SNP in-place conversion support [4].
This series provides a clean integration path for NUMA-aware memory
management for guest_memfd and lays the groundwork for future confidential
computing NUMA capabilities.
Please review and provide feedback!
Thanks,
Shivank
== Changelog ==
- v1,v2: Extended the KVM_CREATE_GUEST_MEMFD IOCTL to pass mempolicy.
- v3: Introduced fbind() syscall for VMM memory-placement configuration.
- v4-v6: Current approach using shared_policy support and vm_ops (based on
suggestions from David [5] and guest_memfd bi-weekly upstream
call discussion [6]).
- v7: Use inodes to store NUMA policy instead of file [7].
- v8: Rebase on top of Fuad's V12: Host mmaping for guest_memfd memory.
- v9: Rebase on top of Fuad's V13 and incorporate review comments
[1] https://lore.kernel.org/all/20250709105946.4009897-1-tabba@google.com
[2] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[4] https://lore.kernel.org/all/20250613005400.3694904-1-michael.roth@amd.com
[5] https://lore.kernel.org/all/6fbef654-36e2-4be5-906e-2a648a845278@redhat.com
[6] https://lore.kernel.org/all/2b77e055-98ac-43a1-a7ad-9f9065d7f38f@amd.com
[7] https://lore.kernel.org/all/diqzbjumm167.fsf@ackerleytng-ctop.c.googlers.com
Ackerley Tng (1):
KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes
Matthew Wilcox (Oracle) (2):
mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()
mm/filemap: Extend __filemap_get_folio() to support NUMA memory
policies
Shivank Garg (4):
mm/mempolicy: Export memory policy symbols
KVM: guest_memfd: Add slab-allocated inode cache
KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy
support
fs/bcachefs/fs-io-buffered.c | 2 +-
fs/btrfs/compression.c | 4 +-
fs/btrfs/verity.c | 2 +-
fs/erofs/zdata.c | 2 +-
fs/f2fs/compress.c | 2 +-
include/linux/pagemap.h | 18 +-
include/uapi/linux/magic.h | 1 +
mm/filemap.c | 23 +-
mm/mempolicy.c | 6 +
mm/readahead.c | 2 +-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 122 ++++++++-
virt/kvm/guest_memfd.c | 255 ++++++++++++++++--
virt/kvm/kvm_main.c | 7 +-
virt/kvm/kvm_mm.h | 10 +-
15 files changed, 408 insertions(+), 49 deletions(-)
--
2.43.0
---
== Earlier Postings ==
v8: https://lore.kernel.org/all/20250618112935.7629-1-shivankg@amd.com
v7: https://lore.kernel.org/all/20250408112402.181574-1-shivankg@amd.com
v6: https://lore.kernel.org/all/20250226082549.6034-1-shivankg@amd.com
v5: https://lore.kernel.org/all/20250219101559.414878-1-shivankg@amd.com
v4: https://lore.kernel.org/all/20250210063227.41125-1-shivankg@amd.com
v3: https://lore.kernel.org/all/20241105164549.154700-1-shivankg@amd.com
v2: https://lore.kernel.org/all/20240919094438.10987-1-shivankg@amd.com
v1: https://lore.kernel.org/all/20240916165743.201087-1-shivankg@amd.com
Resolve a conflict between
commit 6a68d28066b6 ("selftests/coredump: Fix "socket_detect_userspace_client" test failure")
and
commit 994dc26302ed ("selftests/coredump: fix build")
The first commit adds a read() to wait for write() from another thread to
finish. But the second commit removes the write().
Now that the two commits are in the same tree, the read() now gets EOF and
the test fails.
Remove this read() so that the test passes.
Signed-off-by: Nam Cao <namcao(a)linutronix.de>
---
tools/testing/selftests/coredump/stackdump_test.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/coredump/stackdump_test.c b/tools/testing/selftests/coredump/stackdump_test.c
index 5a5a7a5f7e1d..a4ac80bb1003 100644
--- a/tools/testing/selftests/coredump/stackdump_test.c
+++ b/tools/testing/selftests/coredump/stackdump_test.c
@@ -446,9 +446,6 @@ TEST_F(coredump, socket_detect_userspace_client)
if (info.coredump_mask & PIDFD_COREDUMPED)
goto out;
- if (read(fd_coredump, &c, 1) < 1)
- goto out;
-
exit_code = EXIT_SUCCESS;
out:
if (fd_peer_pidfd >= 0)
--
2.39.5
Various KUnit tests require PCI infrastructure to work. All normal
platforms enable PCI by default, but UML does not. Enabling PCI from
.kunitconfig files is problematic as it would not be portable. So in
commit 6fc3a8636a7b ("kunit: tool: Enable virtio/PCI by default on UML")
PCI was enabled by way of CONFIG_UML_PCI_OVER_VIRTIO=y. However
CONFIG_UML_PCI_OVER_VIRTIO requires additional configuration of
CONFIG_UML_PCI_OVER_VIRTIO_DEVICE_ID or will otherwise trigger a WARN() in
virtio_pcidev_init(). However there is no one correct value for
UML_PCI_OVER_VIRTIO_DEVICE_ID which could be used by default.
This warning is confusing when debugging test failures.
On the other hand, the functionality of CONFIG_UML_PCI_OVER_VIRTIO is not
used at all, given that it is completely non-functional as indicated by
the WARN() in question. Instead it is only used as a way to enable
CONFIG_UML_PCI which itself is not directly configurable.
Instead of going through CONFIG_UML_PCI_OVER_VIRTIO, introduce a custom
configuration option which enables CONFIG_UML_PCI without triggering
warnings or building dead code.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Reviewed-by: Johannes Berg <johannes(a)sipsolutions.net>
---
Changes in v2:
- Rebase onto v6.17-rc1
- Pick up review from Johannes
- Link to v1: https://lore.kernel.org/r/20250627-kunit-uml-pci-v1-1-a622fa445e58@linutron…
---
lib/kunit/Kconfig | 7 +++++++
tools/testing/kunit/configs/arch_uml.config | 5 ++---
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig
index c10ede4b1d2201d5f8cddeb71cc5096e21be9b6a..1823539e96da30e165fa8d395ccbd3f6754c836e 100644
--- a/lib/kunit/Kconfig
+++ b/lib/kunit/Kconfig
@@ -106,4 +106,11 @@ config KUNIT_DEFAULT_TIMEOUT
If unsure, the default timeout of 300 seconds is suitable for most
cases.
+config KUNIT_UML_PCI
+ bool "KUnit UML PCI Support"
+ depends on UML
+ select UML_PCI
+ help
+ Enables the PCI subsystem on UML for use by KUnit tests.
+
endif # KUNIT
diff --git a/tools/testing/kunit/configs/arch_uml.config b/tools/testing/kunit/configs/arch_uml.config
index 54ad8972681a2cc724e6122b19407188910b9025..28edf816aa70e6f408d9486efff8898df79ee090 100644
--- a/tools/testing/kunit/configs/arch_uml.config
+++ b/tools/testing/kunit/configs/arch_uml.config
@@ -1,8 +1,7 @@
# Config options which are added to UML builds by default
-# Enable virtio/pci, as a lot of tests require it.
-CONFIG_VIRTIO_UML=y
-CONFIG_UML_PCI_OVER_VIRTIO=y
+# Enable pci, as a lot of tests require it.
+CONFIG_KUNIT_UML_PCI=y
# Enable FORTIFY_SOURCE for wider checking.
CONFIG_FORTIFY_SOURCE=y
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250626-kunit-uml-pci-a2b687553746
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Hello,
kernel test robot noticed "kernel-selftests.filesystems/mount-notify.make.fail" on:
commit: c6d9775c2066a37385e784ee2e0ce83bd6644610 ("selftests/fs/mount-notify: build with tools include dir")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
[test failed on linus/master 6e64f4580381e32c06ee146ca807c555b8f73e24]
[test failed on linux-next/master 442d93313caebc8ccd6d53f4572c50732a95bc48]
in testcase: kernel-selftests
version: kernel-selftests-x86_64-186f3edfdd41-1_20250803
with following parameters:
group: filesystems
config: x86_64-rhel-9.4-kselftests
compiler: gcc-12
test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz (Skylake) with 32G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang(a)intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202508110628.65069d92-lkp@intel.com
2025-08-06 18:21:58 make -j36 TARGETS=filesystems/mount-notify
make[1]: Entering directory '/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/tools/testing/selftests/filesystems/mount-notify'
CC mount-notify_test
mount-notify_test.c:21:3: error: conflicting types for ‘__kernel_fsid_t’; have ‘struct <anonymous>’
21 | } __kernel_fsid_t;
| ^~~~~~~~~~~~~~~
In file included from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/usr/include/asm/posix_types_64.h:18,
from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/usr/include/asm/posix_types.h:7,
from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/usr/include/linux/posix_types.h:36,
from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/usr/include/linux/types.h:9,
from /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/usr/include/linux/stat.h:5,
from /usr/include/x86_64-linux-gnu/bits/statx.h:31,
from /usr/include/x86_64-linux-gnu/sys/stat.h:465,
from mount-notify_test.c:9:
/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/usr/include/asm-generic/posix_types.h:81:3: note: previous declaration of ‘__kernel_fsid_t’ with type ‘__kernel_fsid_t’
81 | } __kernel_fsid_t;
| ^~~~~~~~~~~~~~~
make[1]: *** [../../lib.mk:222: /usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/tools/testing/selftests/filesystems/mount-notify/mount-notify_test] Error 1
make[1]: Leaving directory '/usr/src/perf_selftests-x86_64-rhel-9.4-kselftests-c6d9775c2066a37385e784ee2e0ce83bd6644610/tools/testing/selftests/filesystems/mount-notify'
make: *** [Makefile:203: all] Error 2
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250811/202508110628.65069d92-lkp@…
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Currently it hard coded the number of hugepage to check for
check_huge_anon(), but we already have the number passed in.
Do the check based on the number of hugepage passed in is more
reasonable.
Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com>
---
tools/testing/selftests/mm/split_huge_page_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index 44a3f8a58806..bf40e6b121ab 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -111,7 +111,7 @@ static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hp
unsigned long rss_anon_before, rss_anon_after;
size_t i;
- if (!check_huge_anon(one_page, 4, pmd_pagesize))
+ if (!check_huge_anon(one_page, nr_hpages, pmd_pagesize))
ksft_exit_fail_msg("No THP is allocated\n");
rss_anon_before = rss_anon();
--
2.34.1
The SBI Firmware Feature extension allows the S-mode to request some
specific features (either hardware or software) to be enabled. This
series uses this extension to request misaligned access exception
delegation to S-mode in order to let the kernel handle it. It also adds
support for the KVM FWFT SBI extension based on the misaligned access
handling infrastructure.
FWFT SBI extension is part of the SBI V3.0 specifications [1]. It can be
tested using the qemu provided at [2] which contains the series from
[3]. Upstream kvm-unit-tests can be used inside kvm to tests the correct
delegation of misaligned exceptions. Upstream OpenSBI can be used.
Note: Since SBI V3.0 is not yet ratified, FWFT extension API is split
between interface only and implementation, allowing to pick only the
interface which do not have hard dependencies on SBI.
The tests can be run using the kselftest from series [4].
$ qemu-system-riscv64 \
-cpu rv64,trap-misaligned-access=true,v=true \
-M virt \
-m 1024M \
-bios fw_dynamic.bin \
-kernel Image
...
# ./misaligned
TAP version 13
1..23
# Starting 23 tests from 1 test cases.
# RUN global.gp_load_lh ...
# OK global.gp_load_lh
ok 1 global.gp_load_lh
# RUN global.gp_load_lhu ...
# OK global.gp_load_lhu
ok 2 global.gp_load_lhu
# RUN global.gp_load_lw ...
# OK global.gp_load_lw
ok 3 global.gp_load_lw
# RUN global.gp_load_lwu ...
# OK global.gp_load_lwu
ok 4 global.gp_load_lwu
# RUN global.gp_load_ld ...
# OK global.gp_load_ld
ok 5 global.gp_load_ld
# RUN global.gp_load_c_lw ...
# OK global.gp_load_c_lw
ok 6 global.gp_load_c_lw
# RUN global.gp_load_c_ld ...
# OK global.gp_load_c_ld
ok 7 global.gp_load_c_ld
# RUN global.gp_load_c_ldsp ...
# OK global.gp_load_c_ldsp
ok 8 global.gp_load_c_ldsp
# RUN global.gp_load_sh ...
# OK global.gp_load_sh
ok 9 global.gp_load_sh
# RUN global.gp_load_sw ...
# OK global.gp_load_sw
ok 10 global.gp_load_sw
# RUN global.gp_load_sd ...
# OK global.gp_load_sd
ok 11 global.gp_load_sd
# RUN global.gp_load_c_sw ...
# OK global.gp_load_c_sw
ok 12 global.gp_load_c_sw
# RUN global.gp_load_c_sd ...
# OK global.gp_load_c_sd
ok 13 global.gp_load_c_sd
# RUN global.gp_load_c_sdsp ...
# OK global.gp_load_c_sdsp
ok 14 global.gp_load_c_sdsp
# RUN global.fpu_load_flw ...
# OK global.fpu_load_flw
ok 15 global.fpu_load_flw
# RUN global.fpu_load_fld ...
# OK global.fpu_load_fld
ok 16 global.fpu_load_fld
# RUN global.fpu_load_c_fld ...
# OK global.fpu_load_c_fld
ok 17 global.fpu_load_c_fld
# RUN global.fpu_load_c_fldsp ...
# OK global.fpu_load_c_fldsp
ok 18 global.fpu_load_c_fldsp
# RUN global.fpu_store_fsw ...
# OK global.fpu_store_fsw
ok 19 global.fpu_store_fsw
# RUN global.fpu_store_fsd ...
# OK global.fpu_store_fsd
ok 20 global.fpu_store_fsd
# RUN global.fpu_store_c_fsd ...
# OK global.fpu_store_c_fsd
ok 21 global.fpu_store_c_fsd
# RUN global.fpu_store_c_fsdsp ...
# OK global.fpu_store_c_fsdsp
ok 22 global.fpu_store_c_fsdsp
# RUN global.gen_sigbus ...
[12797.988647] misaligned[618]: unhandled signal 7 code 0x1 at 0x0000000000014dc0 in misaligned[4dc0,10000+76000]
[12797.988990] CPU: 0 UID: 0 PID: 618 Comm: misaligned Not tainted 6.13.0-rc6-00008-g4ec4468967c9-dirty #51
[12797.989169] Hardware name: riscv-virtio,qemu (DT)
[12797.989264] epc : 0000000000014dc0 ra : 0000000000014d00 sp : 00007fffe165d100
[12797.989407] gp : 000000000008f6e8 tp : 0000000000095760 t0 : 0000000000000008
[12797.989544] t1 : 00000000000965d8 t2 : 000000000008e830 s0 : 00007fffe165d160
[12797.989692] s1 : 000000000000001a a0 : 0000000000000000 a1 : 0000000000000002
[12797.989831] a2 : 0000000000000000 a3 : 0000000000000000 a4 : ffffffffdeadbeef
[12797.989964] a5 : 000000000008ef61 a6 : 626769735f6e0000 a7 : fffffffffffff000
[12797.990094] s2 : 0000000000000001 s3 : 00007fffe165d838 s4 : 00007fffe165d848
[12797.990238] s5 : 000000000000001a s6 : 0000000000010442 s7 : 0000000000010200
[12797.990391] s8 : 000000000000003a s9 : 0000000000094508 s10: 0000000000000000
[12797.990526] s11: 0000555567460668 t3 : 00007fffe165d070 t4 : 00000000000965d0
[12797.990656] t5 : fefefefefefefeff t6 : 0000000000000073
[12797.990756] status: 0000000200004020 badaddr: 000000000008ef61 cause: 0000000000000006
[12797.990911] Code: 8793 8791 3423 fcf4 3783 fc84 c737 dead 0713 eef7 (c398) 0001
# OK global.gen_sigbus
ok 23 global.gen_sigbus
# PASSED: 23 / 23 tests passed.
# Totals: pass:23 fail:0 xfail:0 xpass:0 skip:0 error:0
With kvm-tools:
# lkvm run -k sbi.flat -m 128
Info: # lkvm run -k sbi.flat -m 128 -c 1 --name guest-97
Info: Removed ghost socket file "/root/.lkvm//guest-97.sock".
##########################################################################
# kvm-unit-tests
##########################################################################
... [test messages elided]
PASS: sbi: fwft: FWFT extension probing no error
PASS: sbi: fwft: get/set reserved feature 0x6 error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0x3fffffff error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0x80000000 error == SBI_ERR_DENIED
PASS: sbi: fwft: get/set reserved feature 0xbfffffff error == SBI_ERR_DENIED
PASS: sbi: fwft: misaligned_deleg: Get misaligned deleg feature no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature invalid value error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature invalid value error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value 0
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value no error
PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value 1
PASS: sbi: fwft: misaligned_deleg: Verify misaligned load exception trap in supervisor
SUMMARY: 50 tests, 2 unexpected failures, 12 skipped
This series is available at [5].
Link: https://github.com/riscv-non-isa/riscv-sbi-doc/releases/download/vv3.0-rc2/… [1]
Link: https://github.com/rivosinc/qemu/tree/dev/cleger/misaligned [2]
Link: https://lore.kernel.org/all/20241211211933.198792-3-fkonrad@amd.com/T/ [3]
Link: https://lore.kernel.org/linux-riscv/20250414123543.1615478-1-cleger@rivosin… [4]
Link: https://github.com/rivosinc/linux/tree/dev/cleger/fwft [5]
---
V8:
- Move misaligned_access_speed under CONFIG_RISCV_MISALIGNED and add a
separate commit for that.
V7:
- Fix ifdefery build problems
- Move sbi_fwft_is_supported with fwft_set_req struct
- Added Atish Reviewed-by
- Updated KVM vcpu cfg hedeleg value in set_delegation
- Changed SBI ETIME error mapping to ETIMEDOUT
- Fixed a few typo reported by Alok
V6:
- Rename FWFT interface to remove "_local"
- Fix test for MEDELEG values in KVM FWFT support
- Add __init for unaligned_access_init()
- Rebased on master
V5:
- Return ERANGE as mapping for SBI_ERR_BAD_RANGE
- Removed unused sbi_fwft_get()
- Fix kernel for sbi_fwft_local_set_cpumask()
- Fix indentation for sbi_fwft_local_set()
- Remove spurious space in kvm_sbi_fwft_ops.
- Rebased on origin/master
- Remove fixes commits and sent them as a separate series [4]
V4:
- Check SBI version 3.0 instead of 2.0 for FWFT presence
- Use long for kvm_sbi_fwft operation return value
- Init KVM sbi extension even if default_disabled
- Remove revert_on_fail parameter for sbi_fwft_feature_set().
- Fix comments for sbi_fwft_set/get()
- Only handle local features (there are no globals yet in the spec)
- Add new SBI errors to sbi_err_map_linux_errno()
V3:
- Added comment about kvm sbi fwft supported/set/get callback
requirements
- Move struct kvm_sbi_fwft_feature in kvm_sbi_fwft.c
- Add a FWFT interface
V2:
- Added Kselftest for misaligned testing
- Added get_user() usage instead of __get_user()
- Reenable interrupt when possible in misaligned access handling
- Document that riscv supports unaligned-traps
- Fix KVM extension state when an init function is present
- Rework SBI misaligned accesses trap delegation code
- Added support for CPU hotplugging
- Added KVM SBI reset callback
- Added reset for KVM SBI FWFT lock
- Return SBI_ERR_DENIED_LOCKED when LOCK flag is set
Clément Léger (14):
riscv: sbi: add Firmware Feature (FWFT) SBI extensions definitions
riscv: sbi: remove useless parenthesis
riscv: sbi: add new SBI error mappings
riscv: sbi: add FWFT extension interface
riscv: sbi: add SBI FWFT extension calls
riscv: misaligned: request misaligned exception from SBI
riscv: misaligned: use on_each_cpu() for scalar misaligned access
probing
riscv: misaligned: declare misaligned_access_speed under
CONFIG_RISCV_MISALIGNED
riscv: misaligned: move emulated access uniformity check in a function
riscv: misaligned: add a function to check misalign trap delegability
RISC-V: KVM: add SBI extension init()/deinit() functions
RISC-V: KVM: add SBI extension reset callback
RISC-V: KVM: add support for FWFT SBI extension
RISC-V: KVM: add support for SBI_FWFT_MISALIGNED_DELEG
arch/riscv/include/asm/cpufeature.h | 14 +-
arch/riscv/include/asm/kvm_host.h | 5 +-
arch/riscv/include/asm/kvm_vcpu_sbi.h | 12 +
arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h | 29 +++
arch/riscv/include/asm/sbi.h | 60 +++++
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kernel/sbi.c | 81 ++++++-
arch/riscv/kernel/traps_misaligned.c | 112 ++++++++-
arch/riscv/kernel/unaligned_access_speed.c | 8 +-
arch/riscv/kvm/Makefile | 1 +
arch/riscv/kvm/vcpu.c | 4 +-
arch/riscv/kvm/vcpu_sbi.c | 54 +++++
arch/riscv/kvm/vcpu_sbi_fwft.c | 257 +++++++++++++++++++++
arch/riscv/kvm/vcpu_sbi_sta.c | 3 +-
14 files changed, 620 insertions(+), 21 deletions(-)
create mode 100644 arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h
create mode 100644 arch/riscv/kvm/vcpu_sbi_fwft.c
--
2.49.0
Hi all,
This patch series addresses false positives in the generic mm selftests
and skips tests that cannot run correctly due to missing features or system
limitations.
v2: https://lore.kernel.org/all/20250703060656.54345-1-aboorvad@linux.ibm.com/
Changes in v3:
- Rebased onto the latest mm-new branch, top commit of the base is commit 0709ddf8951f ("mm: add zblock allocator").
- Minor refactor based on the review comments.
- Included the tags from the previous version.
---
v1: https://lore.kernel.org/all/20250616160632.35250-1-aboorvad@linux.ibm.com/
Changes in v2:
- Rebased onto the mm-new branch, top commit of the base is commit 3b4a8ad89f7e ("mm: add zblock allocator").
- Split some patches for clarity.
- Updated virtual_address_range test to support testing 4PB VA on PPC64.
- Added proper Fixes: tags.
- Included a patch to skip a failing userfaultfd test when unsupported,
instead of reporting a failure.
---
Please let us know if you have any further comments.
Thanks,
Aboorva
Aboorva Devarajan (3):
selftests/mm: Fix child process exit codes in ksm_functional_tests
selftests/mm: Skip thuge-gen test if system is not setup properly
selftests/mm: Skip hugepage-mremap test if userfaultfd unavailable
Donet Tom (4):
mm/selftests: Fix incorrect pointer being passed to mark_range()
selftests/mm: Add support to test 4PB VA on PPC64
selftest/mm: Fix ksm_funtional_test failures
mm/selftests: Fix split_huge_page_test failure on systems with 64KB
page size
tools/testing/selftests/mm/hugepage-mremap.c | 16 +++++++++--
.../selftests/mm/ksm_functional_tests.c | 28 +++++++++++++------
.../selftests/mm/split_huge_page_test.c | 23 +++++++++------
tools/testing/selftests/mm/thuge-gen.c | 11 +++++---
.../selftests/mm/virtual_address_range.c | 13 ++++++++-
5 files changed, 67 insertions(+), 24 deletions(-)
--
2.47.1
Changes in v2:
- Restore RET_FAIL assignments in error paths to ensure the test's exit
code accurately reflects the failure status.
Wake Liu (1):
selftests/futex: Check for shmget support at runtime
.../selftests/futex/functional/futex_wait.c | 49 +++++++------
.../selftests/futex/functional/futex_waitv.c | 73 ++++++++++++-------
2 files changed, 73 insertions(+), 49 deletions(-)
--
2.50.1.703.g449372360f-goog
The binderfs selftests, specifically `binderfs_stress` and
`binderfs_test_unprivileged`, depend on user namespaces to run.
On kernels built without user namespace support (CONFIG_USER_NS=n),
these tests will fail.
To prevent these failures, add a check for the availability of user
namespaces by testing for the existence of "/proc/self/ns/user". If
the check fails, skip the tests and print an informative message.
Signed-off-by: Wake Liu <wakel(a)google.com>
---
.../selftests/filesystems/binderfs/binderfs_test.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/tools/testing/selftests/filesystems/binderfs/binderfs_test.c b/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
index 81db85a5cc16..e77ed34ebd06 100644
--- a/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
+++ b/tools/testing/selftests/filesystems/binderfs/binderfs_test.c
@@ -291,6 +291,11 @@ static int write_id_mapping(enum idmap_type type, pid_t pid, const char *buf,
return 0;
}
+static bool has_userns(void)
+{
+ return (access("/proc/self/ns/user", F_OK) == 0);
+}
+
static void change_userns(struct __test_metadata *_metadata, int syncfds[2])
{
int ret;
@@ -378,6 +383,9 @@ static void *binder_version_thread(void *data)
*/
TEST(binderfs_stress)
{
+ if (!has_userns())
+ SKIP(return, "%s: user namespace not supported\n", __func__);
+
int fds[1000];
int syncfds[2];
pid_t pid;
@@ -502,6 +510,8 @@ TEST(binderfs_test_privileged)
TEST(binderfs_test_unprivileged)
{
+ if (!has_userns())
+ SKIP(return, "%s: user namespace not supported\n", __func__);
int ret;
int syncfds[2];
pid_t pid;
--
2.50.1.703.g449372360f-goog
The futex_waitv() syscall was introduced in Linux 5.16. The existing
test in futex_wait_timeout.c will fail on kernels older than 5.16
due to the syscall not being implemented.
Modify the test_timeout() function to check if the error returned
is ENOSYS. If it is, skip the test and report it as such, rather than
failing. This ensures the selftests can be run on a wider range of
kernel versions without false negatives.
Signed-off-by: Wake Liu <wakel(a)google.com>
---
.../selftests/futex/functional/futex_wait_timeout.c | 11 ++++++++---
.../testing/selftests/futex/functional/futex_waitv.c | 8 ++++++++
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/futex/functional/futex_wait_timeout.c b/tools/testing/selftests/futex/functional/futex_wait_timeout.c
index d183f878360b..323cab339814 100644
--- a/tools/testing/selftests/futex/functional/futex_wait_timeout.c
+++ b/tools/testing/selftests/futex/functional/futex_wait_timeout.c
@@ -64,9 +64,14 @@ void *get_pi_lock(void *arg)
static void test_timeout(int res, int *ret, char *test_name, int err)
{
if (!res || errno != err) {
- ksft_test_result_fail("%s returned %d\n", test_name,
- res < 0 ? errno : res);
- *ret = RET_FAIL;
+ if (errno == ENOSYS) {
+ ksft_test_result_skip("%s: %s\n",
+ test_name, strerror(errno));
+ } else {
+ ksft_test_result_fail("%s returned %d\n", test_name,
+ res < 0 ? errno : res);
+ *ret = RET_FAIL;
+ }
} else {
ksft_test_result_pass("%s succeeds\n", test_name);
}
diff --git a/tools/testing/selftests/futex/functional/futex_waitv.c b/tools/testing/selftests/futex/functional/futex_waitv.c
index 034dbfef40cb..2a86fd3ea657 100644
--- a/tools/testing/selftests/futex/functional/futex_waitv.c
+++ b/tools/testing/selftests/futex/functional/futex_waitv.c
@@ -59,6 +59,14 @@ void *waiterfn(void *arg)
int main(int argc, char *argv[])
{
+ if (!ksft_min_kernel_version(5, 16)) {
+ ksft_print_header();
+ ksft_set_plan(0);
+ ksft_print_msg("%s: FUTEX_WAITV not implemented until 5.16\n",
+ basename(argv[0]));
+ ksft_print_cnts();
+ return KSFT_SKIP;
+ }
pthread_t waiter;
int res, ret = RET_PASS;
struct timespec to;
--
2.50.1.703.g449372360f-goog
On heterogeneous arm64 systems, KVM's PMU emulation is based on the
features of a single host PMU instance. When a vCPU is migrated to a
pCPU with an incompatible PMU, counters such as PMCCNTR_EL0 stop
incrementing.
Although this behavior is permitted by the architecture, Windows does
not handle it gracefully and may crash with a division-by-zero error.
The current workaround requires VMMs to pin vCPUs to a set of pCPUs
that share a compatible PMU. This is difficult to implement correctly in
QEMU/libvirt, where pinning occurs after vCPU initialization, and it
also restricts the guest to a subset of available pCPUs.
This patch introduces the KVM_ARM_VCPU_PMU_V3_COMPOSITION attribute to
create a "composite" PMU. When set, KVM exposes a PMU that is compatible
with all pCPUs by advertising only a single cycle counter, a feature
common to all PMUs.
This allows Windows guests to run reliably on heterogeneous systems
without crashing, even without vCPU pinning, and enables VMMs to
schedule vCPUs across all available pCPUs, making full use of the host
hardware.
A QEMU patch that demonstrates the usage of the new attribute is
available at:
https://lore.kernel.org/qemu-devel/20250806-kvm-v1-1-d1d50b7058cd@rsg.ci.i.…
("[PATCH RFC] target/arm/kvm: Choose PMU backend")
Signed-off-by: Akihiko Odaki <odaki(a)rsg.ci.i.u-tokyo.ac.jp>
---
Changes in v2:
- Added the KVM_ARM_VCPU_PMU_V3_COMPOSITION attribute to opt in the
feature.
- Added code to handle overflow.
- Link to v1: https://lore.kernel.org/r/20250319-hybrid-v1-1-4d1ada10e705@daynix.com
---
Akihiko Odaki (2):
KVM: arm64: PMU: Introduce KVM_ARM_VCPU_PMU_V3_COMPOSITION
KVM: arm64: selftests: Test guest PMUv3 composition
Documentation/virt/kvm/devices/vcpu.rst | 30 ++
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/include/uapi/asm/kvm.h | 1 +
arch/arm64/kvm/arm.c | 5 +-
arch/arm64/kvm/pmu-emul.c | 495 +++++++++++++--------
arch/arm64/kvm/sys_regs.c | 2 +-
include/kvm/arm_pmu.h | 12 +-
.../selftests/kvm/arm64/vpmu_counter_access.c | 148 ++++--
8 files changed, 461 insertions(+), 234 deletions(-)
---
base-commit: 8ec6d99a41e3d1dbdff2bdb3aa42951681e1e76c
change-id: 20250224-hybrid-01d5ff47edd2
Best regards,
--
Akihiko Odaki <odaki(a)rsg.ci.i.u-tokyo.ac.jp>
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find DUALPI2 iproute2 patch v12.
For more details of DualPI2, please refer IETF RFC9332
(https://datatracker.ietf.org/doc/html/rfc9332).
Best Regards,
Chia-Yu
---
v12 (04-Aug-2025)
- Split into 3 patches: one move get_float(), one add get_float_min_max(), one for dualpi2 (David Ahern <dsahern(a)kernel.org>)
- Repalce matches() with strcmp() within get_packets() (David Ahern <dsahern(a)kernel.org>)
- Apply reverse xmas tree listing of variables (David Ahern <dsahern(a)kernel.org>)
v11 (18-Jul-2025)
- Replace TCA_DUALPI2 prefix with TC_DUALPI2 prefix for enums (Jakub Kicinski <kuba(a)kernel.org>)
v10 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS w/ STEP_THRESH_PKTS and STEP_THRESH_US of net-next patch (Jakub Kicinski <kuba(a)kernel.org>)
v9 (13-Jun-2025)
- Fix space issue and typos (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Change 'rtt_typical' to 'typical_rtt' in tc/q_dualpi2.c (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Add the num of enum used by DualPI2 in pkt_sched.h
v8 (09-May-2025)
- Update pkt_sched.h with the one in nex-next
- Correct a typo in the comment within pkt_sched.h (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update manual content in man/man8/tc-dualpi2.8 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update tc/q_dualpi2.c to fix missing blank lines and add missing case (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
v7 (05-May-2025)
- Align pkt_sched.h with the v14 version of net-next due to spec modification in tc.yaml
- Reorganize dualpi2_print_opt() to match the order in tc.yaml
- Remove credit-queue in PRINT_JSON
v6 (26-Apr-2025)
- Update JSON file output due to spec modification in tc.yaml of net-next
v5 (25-Mar-2025)
- Use matches() to replace current strcmp() (Stephen Hemminger <stephen(a)networkplumber.org>)
- Use general parse_percent() for handling scaled percentage values (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 stats (Stephen Hemminger <stephen(a)networkplumber.org>)
v4 (16-Mar-2025)
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step marking.
v3 (21-Feb-2025)
- Add memlimit to the dualpi2 attribute, and add memory_used, max_memory_used, and memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update the manual to align with the latest implementation and clarify the queue naming and default unit
- Use common "get_scaled_alpha_beta" and clean print_opt for Dualpi2
v2 (23-Oct-2024)
- Rename get_float in dualpi2 to get_float_min_max in utils.c
- Move get_float from iplink_can.c in utils.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 (Stephen Hemminger <stephen(a)networkplumber.org>)
---
Chia-Yu Chang (3):
Move get_float() from ip/iplink_can.c to lib/utils.c
Add get_float_min_max() in lib/utils.c
tc: add dualpi2 scheduler module
bash-completion/tc | 11 +-
include/utils.h | 2 +
ip/iplink_can.c | 14 --
lib/utils.c | 30 +++
man/man8/tc-dualpi2.8 | 249 ++++++++++++++++++++
tc/Makefile | 1 +
tc/q_dualpi2.c | 535 ++++++++++++++++++++++++++++++++++++++++++
7 files changed, 827 insertions(+), 15 deletions(-)
create mode 100644 man/man8/tc-dualpi2.8
create mode 100644 tc/q_dualpi2.c
--
2.34.1
The "nohz_full" and "rcu_nocbs" boot command parameters can be used to
remove a lot of kernel overhead on a specific set of isolated CPUs which
can be used to run some latency/bandwidth sensitive workloads with as
little kernel disturbance/noise as possible. The problem with this mode
of operation is the fact that it is a static configuration which cannot
be changed after boot to adjust for changes in application loading.
There is always a desire to enable runtime modification of the number
of isolated CPUs that can be dedicated to this type of demanding
workloads. This patchset is an attempt to do just that with an amount of
CPU isolation close to what can be done with the nohz_full and rcu_nocbs
boot kernel parameters.
This patch series provides the ability to change the set of housekeeping
CPUs at run time via the cpuset isolated partition functionality.
Currently, the cpuset isolated partition is able to disable scheduler
load balancing and the CPU affinity of the unbound workqueue to avoid the
isolated CPUs. This patch series will extend that with other kernel noises
associated with the nohz_full boot command line parameter which has the
following sub-categories:
- tick
- timer
- RCU
- MISC
- WQ
- kthread
The rcu_nocbs is actually a subset of nohz_full focusing just on the
RCU part of the kernel noises. The WQ part has already been handled by
the current cpuset code.
This series focuses on the tick and RCU part of the kernel noises by
actively changing their internal data structures to track changes in
the list of isolated CPUs used by cpuset isolated partitions.
The dynamic update of the lists of housekeeping CPUs at run time will
also have impact on the other part of the kernel noises that reference
the lists of housekeeping CPUs at run time.
The pending patch series on timer migration[1], when properly integrated
will support the timer part too.
The CPU hotplug functionality of the Linux kernel is used to facilitate
the runtime change of the nohz_full isolated CPUs with minimal code
changes. The CPUs that need to be switched from non-isolated to
isolated or vice versa will be brought offline first, making the
necessary changes and then brought back online afterward.
The use of CPU hotplug, however, does have a slight drawback of
freezing all the other CPUs in part of the offlining process using
the stop machine feature of the kernel. That will cause a noticeable
latency spikes in other running applications which may be significant
to sensitive applications running on isolated CPUs in other isolated
partitions at the time. Hopefully we can find a way to solve this
problem in the future.
One possible workaround for this is to reserve a set of nohz_full
isolated CPUs at boot time using the nohz_full boot command parameter.
The bringing of those nohz_full reserved CPUs into and out of isolated
partitions will not invoke CPU hotplug and hence will not cause
unexpected latency spikes. These reserved CPUs will only be needed
if there are other existing isolated partitions running critical
applications at the time when an isolated partition needs to be created.
Patches 1-4 updates the CPU isolation code at kernel/sched/isolation.c
to enable dynamic update of the lists of housekeeping CPUs.
Patch 5 introduces a new cpuhp_offline_cb() API for shutting down the
given set of CPUs, running the given callback method and then bringing
those CPUs back online again. This new API will block any incoming
hotplug events from interfering this operation.
Patches 6-9 updates the cpuset partition code to use the new cpuhp API
to shut down the affect CPUs, making changes to the housekeeping
cpumasks and then bring those CPUs online afterward.
Patch 10 works around an issue in the DL server code that block the
hotplug operation under certain configurations.
Patch 11-14 updates the timer tick and related code to enable proper
updates to the set of CPUs requiring nohz_full dynticks support.
Patch 15 enables runtime modification to the set of isolated CPUs
requiring RCU NO-CB CPU support with minor changes to the RCU code.
Patches 16-18 includes other miscellaneous updates to cpuset code and
documentation.
This patch series is applied on top of some other cpuset patches[1]
posted upstream recently.
[1] https://lore.kernel.org/lkml/20250806093855.86469-1-gmonaco@redhat.com/
[2] https://lore.kernel.org/lkml/20250806172430.1155133-1-longman@redhat.com/
Waiman Long (18):
sched/isolation: Enable runtime update of housekeeping cpumasks
sched/isolation: Call sched_tick_offload_init() when
HK_FLAG_KERNEL_NOISE is first set
sched/isolation: Use RCU to delay successive housekeeping cpumask
updates
sched/isolation: Add a debugfs file to dump housekeeping cpumasks
cpu/hotplug: Add a new cpuhp_offline_cb() API
cgroup/cpuset: Introduce a new top level isolcpus_update_mutex
cgroup/cpuset: Allow overwriting HK_TYPE_DOMAIN housekeeping cpumask
cgroup/cpuset: Use CPU hotplug to enable runtime nohz_full
modification
cgroup/cpuset: Revert "Include isolated cpuset CPUs in
cpu_is_isolated() check"
sched/core: Ignore DL BW deactivation error if in
cpuhp_offline_cb_mode
tick/nohz: Make nohz_full parameter optional
tick/nohz: Introduce tick_nohz_full_update_cpus() to update
tick_nohz_full_mask
tick/nohz: Allow runtime changes in full dynticks CPUs
tick: Pass timer tick job to an online HK CPU in tick_cpu_dying()
cgroup/cpuset: Enable RCU NO-CB CPU offloading of newly isolated CPUs
cgroup/cpuset: Don't set have_boot_nohz_full without any boot time
nohz_full CPU
cgroup/cpuset: Documentation updates & don't use CPU 0 for isolated
partition
cgroup/cpuset: Add pr_debug() statements for cpuhp_offline_cb() call
Documentation/admin-guide/cgroup-v2.rst | 33 +-
.../admin-guide/kernel-parameters.txt | 19 +-
include/linux/context_tracking.h | 8 +-
include/linux/cpuhplock.h | 9 +
include/linux/cpuset.h | 6 -
include/linux/rcupdate.h | 2 +
include/linux/sched/isolation.h | 9 +-
include/linux/tick.h | 2 +
kernel/cgroup/cpuset.c | 344 ++++++++++++------
kernel/context_tracking.c | 21 +-
kernel/cpu.c | 47 +++
kernel/rcu/tree_nocb.h | 7 +-
kernel/sched/core.c | 8 +-
kernel/sched/debug.c | 32 ++
kernel/sched/isolation.c | 151 +++++++-
kernel/sched/sched.h | 2 +-
kernel/time/tick-common.c | 15 +-
kernel/time/tick-sched.c | 24 +-
.../selftests/cgroup/test_cpuset_prs.sh | 15 +-
19 files changed, 583 insertions(+), 171 deletions(-)
--
2.50.0
David asked me if there is a way of checking split_huge_page_test
results instead of the existing smap check[1]. This patchset uses
kpageflags to get after-split folio orders for a better
split_huge_page_test result check. The added gather_folio_orders() scans
through a VPN range and collects the numbers of folios at different orders.
check_folio_orders() compares the result of gather_folio_orders() to
a given list of numbers of different orders.
split_huge_page_test needs the FORCE_READ fix in [2] to work correctly.
This patchset also:
1. added new order and in folio offset to the split huge page debugfs's
pr_debug()s;
2. changed split_huge_pages_pid() to skip the rest of a folio if it is
split by folio_split() (not changing split_folio_to_order() part
since split_pte_mapped_thp test relies on its behavior).
[1] https://lore.kernel.org/linux-mm/e2f32bdb-e4a4-447c-867c-31405cbba151@redha…
[2] https://lore.kernel.org/linux-mm/20250805175140.241656-1-ziy@nvidia.com/
Zi Yan (4):
mm/huge_memory: add new_order and offset to split_huge_pages*()
pr_debug.
mm/huge_memory: move to next folio after folio_split() succeeds.
selftests/mm: add check_folio_orders() helper.
selftests/mm: check after-split folio orders in split_huge_page_test.
mm/huge_memory.c | 22 +--
.../selftests/mm/split_huge_page_test.c | 67 ++++++---
tools/testing/selftests/mm/vm_util.c | 139 ++++++++++++++++++
tools/testing/selftests/mm/vm_util.h | 2 +
4 files changed, 200 insertions(+), 30 deletions(-)
--
2.47.2
Userspace generally expects APIs that return -EMSGSIZE to allow for them
to adjust their buffer size and retry the operation. However, the
fscontext log would previously clear the message even in the -EMSGSIZE
case.
Given that it is very cheap for us to check whether the buffer is too
small before we remove the message from the ring buffer, let's just do
that instead. While we're at it, refactor some fscontext_read() into a
separate helper to make the ring buffer logic a bit easier to read.
Fixes: 007ec26cdc9f ("vfs: Implement logging through fs_context")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v3:
- selftests: use EXPECT_STREQ()
- v2: <https://lore.kernel.org/r/20250806-fscontext-log-cleanups-v2-0-88e9d34d142f…>
Changes in v2:
- Refactor message fetching to fetch_message_locked() which returns
ERR_PTR() in error cases. [Al Viro]
- v1: <https://lore.kernel.org/r/20250806-fscontext-log-cleanups-v1-0-880597d42a5a…>
---
Aleksa Sarai (2):
fscontext: do not consume log entries when returning -EMSGSIZE
selftests/filesystems: add basic fscontext log tests
fs/fsopen.c | 54 +++++-----
tools/testing/selftests/filesystems/.gitignore | 1 +
tools/testing/selftests/filesystems/Makefile | 2 +-
tools/testing/selftests/filesystems/fclog.c | 130 +++++++++++++++++++++++++
4 files changed, 162 insertions(+), 25 deletions(-)
---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250806-fscontext-log-cleanups-50f0143674ae
Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>
As described in commit 7a54947e727b ('Merge patch series "fs: allow
changing idmappings"'), open_tree_attr(2) was necessary in order to
allow for a detached mount to be created and have its idmappings changed
without the risk of any racing threads operating on it. For this reason,
mount_setattr(2) still does not allow for id-mappings to be changed.
However, there was a bug in commit 2462651ffa76 ("fs: allow changing
idmappings") which allowed users to bypass this restriction by calling
open_tree_attr(2) *without* OPEN_TREE_CLONE.
can_idmap_mount() prevented this bug from allowing an attached
mountpoint's id-mapping from being modified (thanks to an is_anon_ns()
check), but this still allows for detached (but visible) mounts to have
their be id-mapping changed. This risks the same UAF and locking issues
as described in the merge commit, and was likely unintentional.
For what it's worth, I found this while working on the open_tree_attr(2)
man page, and was trying to figure out what open_tree_attr(2)'s
behaviour was in the (slightly fruity) ~OPEN_TREE_CLONE case.
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Aleksa Sarai (2):
open_tree_attr: do not allow id-mapping changes without OPEN_TREE_CLONE
selftests/mount_setattr: add smoke tests for open_tree_attr(2) bug
fs/namespace.c | 3 +-
.../selftests/mount_setattr/mount_setattr_test.c | 77 ++++++++++++++++++----
2 files changed, 66 insertions(+), 14 deletions(-)
---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250808-open_tree_attr-bugfix-idmap-bb741166dc04
Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>
Hello!
KUnit offers a parameterized testing framework, where tests can be
run multiple times with different inputs.
Currently, the same `struct kunit` is used for each parameter
execution. After each run, the test instance gets cleaned up.
This creates the following limitations:
a. There is no way to store resources that are accessible across
the individual parameter test executions.
b. It's not possible to pass additional context besides the
previous parameter to `generate_params()` to get the next
parameter.
c. Test users are restricted to using pre-defined static arrays
of parameter objects or `generate_params()` to define their
parameters. There is no flexibility to pass a custom dynamic
array without using `generate_params()`, which can be complex
if generating the next parameter depends on more than just
the single previous parameter (e.g., two or more previous
parameters).
This patch series resolves these limitations by:
1. [P 1] Giving each parameterized test execution its own
`struct kunit`. This aligns more with the definition of a
`struct kunit` as a running instance of a test. It will also
remove the need to manage state, such as resetting the
`test->priv` field or the `test->status_comment` after every
parameter run.
2. [P 1] Introducing a parent pointer of type `struct kunit`.
Behind the scenes, a parent instance for the parameterized
tests will be created. It won't be used to execute any test
logic, but will instead be used as a context for shared
resources. Each individual running instance of a test will
now have a reference to that parent instance and thus, have
access to those resources.
3. [P 2] Introducing `param_init()` and `param_exit()` functions
that can set up and clean up the parent instance of the
parameterized tests. They will run once before and after the
parameterized series and provide a way for the user to
access the parent instance to add the parameter array or any
other resources to it, including custom ones to the
`test->parent->priv` field or to `test->parent->resources`
via the Resource API (link below).
https://elixir.bootlin.com/linux/v6.16-rc7/source/include/kunit/resource.h
4. [P 3, 4 & 5] Passing the parent `struct kunit` as an additional
parameter to `generate_params()`. This provides
`generate_params()` with more available context, making
parameter generation much more flexible. The
`generate_params()` implementations in the KCSAN and drm/xe
tests have been adapted to match the new function pointer
signature.
5. [P 6] Introducing a `params_data` field in `struct kunit`.
This will allow the parent instance of a test to have direct
storage of the parameter array, enabling features like using
dynamic parameter arrays or using context beyond just the
previous parameter.
Thank you!
-Marie
Marie Zhussupova (9):
kunit: Add parent kunit for parameterized test context
kunit: Introduce param_init/exit for parameterized test shared context
management
kunit: Pass additional context to generate_params for parameterized
testing
kcsan: test: Update parameter generator to new signature
drm/xe: Update parameter generator to new signature
kunit: Enable direct registration of parameter arrays to a KUnit test
kunit: Add example parameterized test with shared resources and direct
static parameter array setup
kunit: Add example parameterized test with direct dynamic parameter
array setup
Documentation: kunit: Document new parameterized test features
Documentation/dev-tools/kunit/usage.rst | 455 +++++++++++++++++++++++-
drivers/gpu/drm/xe/tests/xe_pci.c | 2 +-
include/kunit/test.h | 98 ++++-
kernel/kcsan/kcsan_test.c | 2 +-
lib/kunit/kunit-example-test.c | 207 +++++++++++
lib/kunit/test.c | 82 ++++-
6 files changed, 818 insertions(+), 28 deletions(-)
--
2.50.1.552.g942d659e1b-goog
Hi,
I found some text that contains a spelling mistake, however I can't
parse the message either, so I'm reporting this as a minor issue that
needs some attention.
The issue is found in commit:
commit cda95faef7bcf26ba3f54c3cddce66d50116d146
Author: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Date: Wed Apr 16 18:29:20 2025 +0200
selftests/futex: Add futex_priv_hash
Namely:
static void usage(char *prog)
{
printf("Usage: %s\n", prog);
printf(" -c Use color\n");
printf(" -g Test global hash instead intead local immutable
\n");
printf(" -h Display this help message\n");
printf(" -v L Verbosity level: %d=QUIET %d=CRITICAL %d=INFO\n",
VQUIET, VCRITICAL, VINFO);
}
there word "intead" for the -g option should be removed, but I'm also
finding the resulting text hard to parse, perhaps it needs to be rephrased?
Colin
This is series 2b/5 of the migration to `core::ffi::CStr`[0].
20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com.
This series depends on the prior series[0] and is intended to go through
the rust tree to reduce the number of release cycles required to
complete the work.
Subsystem maintainers: I would appreciate your `Acked-by`s so that this
can be taken through Miguel's tree (where the other series must go).
[0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm…
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v2:
- Update patch title (was nova-core, now drm/panic).
- Link to v1: https://lore.kernel.org/r/20250709-core-cstr-fanout-1-v1-0-fd793b3e58a2@gma…
---
Tamir Duberstein (10):
drm/panic: use `core::ffi::CStr` method names
rust: auxiliary: use `core::ffi::CStr` method names
rust: configfs: use `core::ffi::CStr` method names
rust: cpufreq: use `core::ffi::CStr` method names
rust: drm: use `core::ffi::CStr` method names
rust: firmware: use `core::ffi::CStr` method names
rust: kunit: use `core::ffi::CStr` method names
rust: miscdevice: use `core::ffi::CStr` method names
rust: net: use `core::ffi::CStr` method names
rust: of: use `core::ffi::CStr` method names
drivers/gpu/drm/drm_panic_qr.rs | 2 +-
rust/kernel/auxiliary.rs | 4 ++--
rust/kernel/configfs.rs | 4 ++--
rust/kernel/cpufreq.rs | 2 +-
rust/kernel/drm/device.rs | 4 ++--
rust/kernel/firmware.rs | 2 +-
rust/kernel/kunit.rs | 6 +++---
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/net/phy.rs | 2 +-
rust/kernel/of.rs | 2 +-
samples/rust/rust_configfs.rs | 2 +-
11 files changed, 16 insertions(+), 16 deletions(-)
---
base-commit: cc84ef3b88f407e8bd5a5f7b6906d1e69851c856
change-id: 20250709-core-cstr-fanout-1-f20611832272
prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1
prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573
prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e
prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa
prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec
prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27
prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
This is series 2a/5 of the migration to `core::ffi::CStr`[0].
20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com.
This series depends on the prior series[0] and is intended to go through
the rust tree to reduce the number of release cycles required to
complete the work.
Subsystem maintainers: I would appreciate your `Acked-by`s so that this
can be taken through Miguel's tree (where the other series must go).
[0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm…
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v2:
- Rebase on rust-next.
- Drop pin-init patch, which is no longer needed.
- Link to v1: https://lore.kernel.org/r/20250709-core-cstr-fanout-1-v1-0-64308e7203fc@gma…
---
Tamir Duberstein (8):
gpu: nova-core: use `kernel::{fmt,prelude::fmt!}`
rust: alloc: use `kernel::{fmt,prelude::fmt!}`
rust: block: use `kernel::{fmt,prelude::fmt!}`
rust: device: use `kernel::{fmt,prelude::fmt!}`
rust: file: use `kernel::{fmt,prelude::fmt!}`
rust: kunit: use `kernel::{fmt,prelude::fmt!}`
rust: seq_file: use `kernel::{fmt,prelude::fmt!}`
rust: sync: use `kernel::{fmt,prelude::fmt!}`
drivers/block/rnull.rs | 2 +-
drivers/gpu/nova-core/gpu.rs | 3 +--
drivers/gpu/nova-core/regs/macros.rs | 6 +++---
rust/kernel/alloc/kbox.rs | 2 +-
rust/kernel/alloc/kvec.rs | 2 +-
rust/kernel/alloc/kvec/errors.rs | 2 +-
rust/kernel/block/mq.rs | 2 +-
rust/kernel/block/mq/gen_disk.rs | 2 +-
rust/kernel/block/mq/raw_writer.rs | 3 +--
rust/kernel/device.rs | 6 +++---
rust/kernel/fs/file.rs | 5 +++--
rust/kernel/kunit.rs | 8 ++++----
rust/kernel/seq_file.rs | 6 +++---
rust/kernel/sync/arc.rs | 2 +-
scripts/rustdoc_test_gen.rs | 2 +-
15 files changed, 26 insertions(+), 27 deletions(-)
---
base-commit: cc84ef3b88f407e8bd5a5f7b6906d1e69851c856
change-id: 20250709-core-cstr-fanout-1-f20611832272
prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1
prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573
prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e
prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa
prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec
prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27
prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
Hi there,
I have explicitly disabled mptpcp by default on my custom kernel and
this seems to be causing the test case to fail. Even after enabling
mtpcp via sysctl command or adding an entry to /etc/sysctl.conf this
fails. I don't think this test should be failing and should account for
cases where mptcp has not been enabled by default?
These are the test logs:
$ sudo tools/testing/selftests/bpf/test_progs -t mptcp
Can't find bpf_testmod.ko kernel module: -2
WARNING! Selftests relying on bpf_testmod.ko will be skipped.
run_test:PASS:bpf_prog_attach 0 nsec
run_test:PASS:connect to fd 0 nsec
verify_tsk:PASS:bpf_map_lookup_elem 0 nsec
verify_tsk:PASS:unexpected invoked count 0 nsec
verify_tsk:PASS:unexpected is_mptcp 0 nsec
test_base:PASS:run_test tcp 0 nsec
(network_helpers.c:107: errno: Protocol not available) Failed to create
server socket
test_base:FAIL:start_mptcp_server unexpected start_mptcp_server: actual
-1 < expected 0
#178/1 mptcp/base:FAIL
test_mptcpify:PASS:test__join_cgroup 0 nsec
create_netns:PASS:ip netns add mptcp_ns 0 nsec
create_netns:PASS:ip -net mptcp_ns link set dev lo up 0 nsec
test_mptcpify:PASS:create_netns 0 nsec
run_mptcpify:PASS:skel_open_load 0 nsec
run_mptcpify:PASS:skel_attach 0 nsec
(network_helpers.c:107: errno: Protocol not available) Failed to create
server socket
run_mptcpify:FAIL:start_server unexpected start_server: actual -1 <
expected 0
test_mptcpify:FAIL:run_mptcpify unexpected error: -5 (errno 92)
#178/2 mptcp/mptcpify:FAIL
#178 mptcp:FAIL
All error logs:
test_base:PASS:test__join_cgroup 0 nsec
create_netns:PASS:ip netns add mptcp_ns 0 nsec
create_netns:PASS:ip -net mptcp_ns link set dev lo up 0 nsec
test_base:PASS:create_netns 0 nsec
test_base:PASS:start_server 0 nsec
run_test:PASS:skel_open_load 0 nsec
run_test:PASS:skel_attach 0 nsec
run_test:PASS:bpf_prog_attach 0 nsec
run_test:PASS:connect to fd 0 nsec
verify_tsk:PASS:bpf_map_lookup_elem 0 nsec
verify_tsk:PASS:unexpected invoked count 0 nsec
verify_tsk:PASS:unexpected is_mptcp 0 nsec
test_base:PASS:run_test tcp 0 nsec
(network_helpers.c:107: errno: Protocol not available) Failed to create
server socket
test_base:FAIL:start_mptcp_server unexpected start_mptcp_server: actual
-1 < expected 0
#178/1 mptcp/base:FAIL
test_mptcpify:PASS:test__join_cgroup 0 nsec
create_netns:PASS:ip netns add mptcp_ns 0 nsec
create_netns:PASS:ip -net mptcp_ns link set dev lo up 0 nsec
test_mptcpify:PASS:create_netns 0 nsec
run_mptcpify:PASS:skel_open_load 0 nsec
run_mptcpify:PASS:skel_attach 0 nsec
(network_helpers.c:107: errno: Protocol not available) Failed to create
server socket
run_mptcpify:FAIL:start_server unexpected start_server: actual -1 <
expected 0
test_mptcpify:FAIL:run_mptcpify unexpected error: -5 (errno 92)
#178/2 mptcp/mptcpify:FAIL
#178 mptcp:FAIL
Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED
This is the custom patch I had applied on the LTS v6.12.36 kernel and
tested it:
diff --git a/net/mptcp/ctrl.c b/net/mptcp/ctrl.c
index dd595d9b5e50c..bdcc4136e92ef 100644
--- a/net/mptcp/ctrl.c
+++ b/net/mptcp/ctrl.c
@@ -89,7 +89,7 @@ const char *mptcp_get_scheduler(const struct net *net)
static void mptcp_pernet_set_defaults(struct mptcp_pernet *pernet)
{
- pernet->mptcp_enabled = 1;
+ pernet->mptcp_enabled = 0;
pernet->add_addr_timeout = TCP_RTO_MAX;
pernet->blackhole_timeout = 3600;
atomic_set(&pernet->active_disable_times, 0);
--
Thanks & Regards, Harshvardhan
Running the test added with a recent fix on a driver with persistent
NAPI config leads to a deadlock. The deadlock is fixed by patch 3,
patch 2 is I think a more fundamental problem with the way we
implemented the config.
I hope the fix makes sense, my own thinking is definitely colored
by my preference (IOW how the per-queue config RFC was implemented).
Jakub Kicinski (3):
selftests: drv-net: don't assume device has only 2 queues
net: update NAPI threaded config even for disabled NAPIs
net: prevent deadlocks when enabling NAPIs with mixed kthread config
include/linux/netdevice.h | 3 ++-
net/core/dev.h | 8 ++++++++
net/core/dev.c | 12 +++++++++---
tools/testing/selftests/drivers/net/napi_threaded.py | 10 ++++++----
4 files changed, 25 insertions(+), 8 deletions(-)
--
2.50.1
This series adds namespace support to vhost-vsock. It does not add
namespaces to any of the guest transports (virtio-vsock, hyperv, or
vmci).
The current revision only supports two modes: local or global. Local
mode is complete isolation of namespaces, while global mode is complete
sharing between namespaces of CIDs (the original behavior).
Future may include supporting a mixed mode, which I expect to be more
complicated because socket lookups will have to include new logic and
API changes to behave differently based on if the lookup is part of a
mixed mode CID allocation, a global CID allocation, a mixed-to-global
connection (allowed), or a global-to-mixed connection (not allowed).
Modes are per-netns and write-once. This allows a system to configure
namespaces independently (some may share CIDs, others are completely
isolated). This also supports future mixed use cases, where there may be
namespaces in global mode spinning up VMs while there are
mixed mode namespaces that provide services to the VMs, but are not
allowed to allocate from the global CID pool.
Thanks again for everyone's help and reviews!
Signed-off-by: Bobby Eshleman <bobbyeshleman(a)gmail.com>
To: Stefano Garzarella <sgarzare(a)redhat.com>
To: Shuah Khan <shuah(a)kernel.org>
To: David S. Miller <davem(a)davemloft.net>
To: Eric Dumazet <edumazet(a)google.com>
To: Jakub Kicinski <kuba(a)kernel.org>
To: Paolo Abeni <pabeni(a)redhat.com>
To: Simon Horman <horms(a)kernel.org>
To: Stefan Hajnoczi <stefanha(a)redhat.com>
To: Michael S. Tsirkin <mst(a)redhat.com>
To: Jason Wang <jasowang(a)redhat.com>
To: Xuan Zhuo <xuanzhuo(a)linux.alibaba.com>
To: Eugenio Pérez <eperezma(a)redhat.com>
To: K. Y. Srinivasan <kys(a)microsoft.com>
To: Haiyang Zhang <haiyangz(a)microsoft.com>
To: Wei Liu <wei.liu(a)kernel.org>
To: Dexuan Cui <decui(a)microsoft.com>
To: Bryan Tan <bryan-bt.tan(a)broadcom.com>
To: Vishnu Dasa <vishnu.dasa(a)broadcom.com>
To: Broadcom internal kernel review list <bcm-kernel-feedback-list(a)broadcom.com>
Cc: virtualization(a)lists.linux.dev
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: kvm(a)vger.kernel.org
Cc: linux-hyperv(a)vger.kernel.org
Cc: berrange(a)redhat.com
Changes in v4:
- removed RFC tag
- implemented loopback support
- renamed new tests to better reflect behavior
- completed suite of tests with permutations of ns modes and vsock_test
as guest/host
- simplified socat bridging with unix socket instead of tcp + veth
- only use vsock_test for success case, socat for failure case (context
in commit message)
- lots of cleanup
Changes in v3:
- add notion of "modes"
- add procfs /proc/net/vsock_ns_mode
- local and global modes only
- no /dev/vhost-vsock-netns
- vmtest.sh already merged, so new patch just adds new tests for NS
- Link to v2:
https://lore.kernel.org/kvm/20250312-vsock-netns-v2-0-84bffa1aa97a@gmail.com
Changes in v2:
- only support vhost-vsock namespaces
- all g2h namespaces retain old behavior, only common API changes
impacted by vhost-vsock changes
- add /dev/vhost-vsock-netns for "opt-in"
- leave /dev/vhost-vsock to old behavior
- removed netns module param
- Link to v1:
https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
Changes in v1:
- added 'netns' module param to vsock.ko to enable the
network namespace support (disabled by default)
- added 'vsock_net_eq()' to check the "net" assigned to a socket
only when 'netns' support is enabled
- Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
---
Bobby Eshleman (12):
vsock: a per-net vsock NS mode state
vsock: add net to vsock skb cb
vsock: add netns to af_vsock core
vsock/virtio: add netns to virtio transport common
vhost/vsock: add netns support
vsock/virtio: use the global netns
hv_sock: add netns hooks
vsock/vmci: add netns hooks
vsock/loopback: add netns support
selftests/vsock: improve logging in vmtest.sh
selftests/vsock: invoke vsock_test through helpers
selftests/vsock: add namespace tests
MAINTAINERS | 1 +
drivers/vhost/vsock.c | 48 +-
include/linux/virtio_vsock.h | 12 +
include/net/af_vsock.h | 59 +-
include/net/net_namespace.h | 4 +
include/net/netns/vsock.h | 21 +
net/vmw_vsock/af_vsock.c | 204 +++++-
net/vmw_vsock/hyperv_transport.c | 2 +-
net/vmw_vsock/virtio_transport.c | 5 +-
net/vmw_vsock/virtio_transport_common.c | 14 +-
net/vmw_vsock/vmci_transport.c | 4 +-
net/vmw_vsock/vsock_loopback.c | 59 +-
tools/testing/selftests/vsock/vmtest.sh | 1088 ++++++++++++++++++++++++++-----
13 files changed, 1330 insertions(+), 191 deletions(-)
---
base-commit: dd500e4aecf25e48e874ca7628697969df679493
change-id: 20250325-vsock-vmtest-b3a21d2102c2
Best regards,
--
Bobby Eshleman <bobbyeshleman(a)meta.com>
Signed-off-by: Devaansh Kumar <devaanshk840(a)gmail.com>
---
tools/testing/selftests/futex/include/futextest.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h
index ddbcfc9b7..c77352b97 100644
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h
@@ -134,7 +134,9 @@ futex_unlock_pi(futex_t *uaddr, int opflags)
}
/**
- * futex_wake_op() - FIXME: COME UP WITH A GOOD ONE LINE DESCRIPTION
+ * futex_wake_op() - atomically modify uaddr2
+ * @nr_wake: wake up to this many tasks on uaddr
+ * @nr_wake2: wake up to this many tasks on uaddr2
*/
static inline int
futex_wake_op(futex_t *uaddr, futex_t *uaddr2, int nr_wake, int nr_wake2,
--
2.49.0
The get_next_frame() function in psock_tpacket.c was missing a return
statement in its default switch case, leading to a compiler warning.
This was caused by a `bug_on(1)` call, which is defined as an
`assert()`, being compiled out because NDEBUG is defined during the
build.
Instead of adding a `return NULL;` which would silently hide the error
and could lead to crashes later, this change restores the original
author's intent. By adding `#undef NDEBUG` before including <assert.h>,
we ensure the assertion is active and will cause the test to abort if
this unreachable code is ever executed.
Signed-off-by: Wake Liu <wakel(a)google.com>
---
tools/testing/selftests/net/psock_tpacket.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/psock_tpacket.c b/tools/testing/selftests/net/psock_tpacket.c
index 0dd909e325d9..a54f2eb754ce 100644
--- a/tools/testing/selftests/net/psock_tpacket.c
+++ b/tools/testing/selftests/net/psock_tpacket.c
@@ -38,6 +38,7 @@
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
+#undef NDEBUG
#include <assert.h>
#include <net/if.h>
#include <inttypes.h>
--
2.50.1.703.g449372360f-goog
With /proc/pid/maps now being read under per-vma lock protection we can
reuse parts of that code to execute PROCMAP_QUERY ioctl also without
taking mmap_lock. The change is designed to reduce mmap_lock contention
and prevent PROCMAP_QUERY ioctl calls from blocking address space updates.
This patchset was split out of the original patchset [1] that introduced
per-vma lock usage for /proc/pid/maps reading. It contains PROCMAP_QUERY
tests, code refactoring patch to simplify the main change and the actual
transition to per-vma lock.
Changes since v2 [2]
- Added Reviewed-by, per Vlastimil Babka
- Fixed query_vma_find_by_addr() to handle lock_ctx->mmap_locked case,
per Vlastimil Babka
[1] https://lore.kernel.org/all/20250704060727.724817-1-surenb@google.com/
[2] https://lore.kernel.org/all/20250804231552.1217132-1-surenb@google.com/
Suren Baghdasaryan (3):
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
fs/proc/task_mmu: factor out proc_maps_private fields used by
PROCMAP_QUERY
fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks
fs/proc/internal.h | 15 +-
fs/proc/task_mmu.c | 152 ++++++++++++------
fs/proc/task_nommu.c | 14 +-
tools/testing/selftests/proc/proc-maps-race.c | 65 ++++++++
4 files changed, 184 insertions(+), 62 deletions(-)
base-commit: 8e7e0c6d09502e44aa7a8fce0821e042a6ec03d1
--
2.50.1.565.gc32cd1483b-goog
Avoid pointer type value compared with 0 to make code clear.
./tools/testing/selftests/bpf/progs/mem_rdonly_untrusted.c:221:10-11: WARNING comparing pointer to 0.
Reported-by: Abaci Robot <abaci(a)linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=23403
Signed-off-by: Jiapeng Chong <jiapeng.chong(a)linux.alibaba.com>
---
tools/testing/selftests/bpf/progs/mem_rdonly_untrusted.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/progs/mem_rdonly_untrusted.c b/tools/testing/selftests/bpf/progs/mem_rdonly_untrusted.c
index 4f94c971ae86..6b725725b2bf 100644
--- a/tools/testing/selftests/bpf/progs/mem_rdonly_untrusted.c
+++ b/tools/testing/selftests/bpf/progs/mem_rdonly_untrusted.c
@@ -218,7 +218,7 @@ int null_check(void *ctx)
int *p;
p = bpf_rdonly_cast(0, 0);
- if (p == 0)
+ if (!p)
/* make this a function call to avoid compiler
* moving r0 assignment before check.
*/
--
2.43.5
FORCE_READ() converts input value x to its pointer type then reads from
address x. This is wrong. If x is a non-pointer, it would be caught it
easily. But all FORCE_READ() callers are trying to read from a pointer and
FORCE_READ() basically reads a pointer to a pointer instead of the original
typed pointer. Almost no access violation was found, except the one from
split_huge_page_test.
Fix it by implementing a simplified READ_ONCE() instead.
Fixes: 3f6bfd4789a0 ("selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));"")
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
---
FORCE_READ() comes from commit 876320d71f51 ("selftests/mm: add self tests for
guard page feature"). I will a separate patch to stable tree.
tools/testing/selftests/mm/cow.c | 4 ++--
tools/testing/selftests/mm/guard-regions.c | 2 +-
tools/testing/selftests/mm/hugetlb-madvise.c | 4 +++-
tools/testing/selftests/mm/migration.c | 2 +-
tools/testing/selftests/mm/pagemap_ioctl.c | 2 +-
tools/testing/selftests/mm/split_huge_page_test.c | 7 +++++--
tools/testing/selftests/mm/vm_util.h | 2 +-
7 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index d30625c18259..c744c603d688 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -1554,8 +1554,8 @@ static void run_with_zeropage(non_anon_test_fn fn, const char *desc)
}
/* Read from the page to populate the shared zeropage. */
- FORCE_READ(mem);
- FORCE_READ(smem);
+ FORCE_READ(*mem);
+ FORCE_READ(*smem);
fn(mem, smem, pagesize);
munmap:
diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index b0d42eb04e3a..8dd81c0a4a5a 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -145,7 +145,7 @@ static bool try_access_buf(char *ptr, bool write)
if (write)
*ptr = 'x';
else
- FORCE_READ(ptr);
+ FORCE_READ(*ptr);
}
signal_jump_set = false;
diff --git a/tools/testing/selftests/mm/hugetlb-madvise.c b/tools/testing/selftests/mm/hugetlb-madvise.c
index 1afe14b9dc0c..c5940c0595be 100644
--- a/tools/testing/selftests/mm/hugetlb-madvise.c
+++ b/tools/testing/selftests/mm/hugetlb-madvise.c
@@ -50,8 +50,10 @@ void read_fault_pages(void *addr, unsigned long nr_pages)
unsigned long i;
for (i = 0; i < nr_pages; i++) {
+ unsigned long *addr2 =
+ ((unsigned long *)(addr + (i * huge_page_size)));
/* Prevent the compiler from optimizing out the entire loop: */
- FORCE_READ(((unsigned long *)(addr + (i * huge_page_size))));
+ FORCE_READ(*addr2);
}
}
diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c
index c5a73617796a..ea945eebec2f 100644
--- a/tools/testing/selftests/mm/migration.c
+++ b/tools/testing/selftests/mm/migration.c
@@ -110,7 +110,7 @@ void *access_mem(void *ptr)
* the memory access actually happens and prevents the compiler
* from optimizing away this entire loop.
*/
- FORCE_READ((uint64_t *)ptr);
+ FORCE_READ(*(uint64_t *)ptr);
}
return NULL;
diff --git a/tools/testing/selftests/mm/pagemap_ioctl.c b/tools/testing/selftests/mm/pagemap_ioctl.c
index 0d4209eef0c3..e6face7c0166 100644
--- a/tools/testing/selftests/mm/pagemap_ioctl.c
+++ b/tools/testing/selftests/mm/pagemap_ioctl.c
@@ -1525,7 +1525,7 @@ void zeropfn_tests(void)
ret = madvise(mem, hpage_size, MADV_HUGEPAGE);
if (!ret) {
- FORCE_READ(mem);
+ FORCE_READ(*mem);
ret = pagemap_ioctl(mem, hpage_size, &vec, 1, 0,
0, PAGE_IS_PFNZERO, 0, 0, PAGE_IS_PFNZERO);
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index 718daceb5282..3c761228e451 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -440,8 +440,11 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd,
}
madvise(*addr, fd_size, MADV_HUGEPAGE);
- for (size_t i = 0; i < fd_size; i++)
- FORCE_READ((*addr + i));
+ for (size_t i = 0; i < fd_size; i++) {
+ char *addr2 = *addr + i;
+
+ FORCE_READ(*addr2);
+ }
if (!check_huge_file(*addr, fd_size / pmd_pagesize, pmd_pagesize)) {
ksft_print_msg("No large pagecache folio generated, please provide a filesystem supporting large folio\n");
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index c20298ae98ea..b55d1809debc 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -23,7 +23,7 @@
* anything with it in order to trigger a read page fault. We therefore must use
* volatile to stop the compiler from optimising this away.
*/
-#define FORCE_READ(x) (*(volatile typeof(x) *)x)
+#define FORCE_READ(x) (*(const volatile typeof(x) *)&(x))
extern unsigned int __page_size;
extern unsigned int __page_shift;
--
2.47.2
When using GCC on x86-64 to compile an usdt prog with -O1 or higher
optimization, the compiler will generate SIB addressing mode for global
array and PC-relative addressing mode for global variable,
e.g. "1@-96(%rbp,%rax,8)" and "-1@4+t1(%rip)".
The current USDT implementation in libbpf cannot parse these two formats,
causing `bpf_program__attach_usdt()` to fail with -ENOENT
(unrecognized register).
This patch series adds support for SIB addressing mode in USDT probes.
The main changes include:
- add correct handling logic for SIB-addressed arguments in
`parse_usdt_arg`.
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt
argument spec.
- change the global variable t1 to a local variable, to avoid compiler
generating PC-relative addressing mode for it.
Testing shows that the SIB probe correctly generates 8@(%rcx,%rax,8)
argument spec and passes all validation checks.
The modification history of this patch series:
Change since v1:
- refactor the code to make it more readable
- modify the commit message to explain why and how
Change since v2:
- fix the `scale` uninitialized error
Change since v3:
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt
and pass all test cases.
Change since v4:
- split the patch into two parts, one for the fix and the other for the
test
Change since v5:
- Only enable optimization for x86 architecture to generate SIB addressing
usdt argument spec.
Do we need to add support for PC-relative USDT argument spec handling in
libbpf? I have some interest in this question, but currently have no
ideas. Getting offsets based on symbols requires dependency on the symbol
table. However, once the binary file is stripped, the symtab will also be
removed, which will cause this approach to fail. Does anyone have any
thoughts on this?
Jiawei Zhao (2):
libbpf: fix USDT SIB argument handling causing unrecognized register
error
selftests/bpf: Force -O2 for USDT selftests to cover SIB handling
logic
tools/lib/bpf/usdt.bpf.h | 33 +++++++++++++-
tools/lib/bpf/usdt.c | 43 ++++++++++++++++---
tools/testing/selftests/bpf/Makefile | 8 ++++
tools/testing/selftests/bpf/prog_tests/usdt.c | 18 +++++---
4 files changed, 89 insertions(+), 13 deletions(-)
--
2.43.0
Userspace generally expects APIs that return -EMSGSIZE to allow for them
to adjust their buffer size and retry the operation. However, the
fscontext log would previously clear the message even in the -EMSGSIZE
case.
Given that it is very cheap for us to check whether the buffer is too
small before we remove the message from the ring buffer, let's just do
that instead. While we're at it, refactor some fscontext_read() into a
separate helper to make the ring buffer logic a bit easier to read.
Fixes: 007ec26cdc9f ("vfs: Implement logging through fs_context")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v2:
- Refactor message fetching to fetch_message_locked() which returns
ERR_PTR() in error cases. [Al Viro]
- v1: <https://lore.kernel.org/r/20250806-fscontext-log-cleanups-v1-0-880597d42a5a…>
---
Aleksa Sarai (2):
fscontext: do not consume log entries when returning -EMSGSIZE
selftests/filesystems: add basic fscontext log tests
fs/fsopen.c | 54 +++++-----
tools/testing/selftests/filesystems/.gitignore | 1 +
tools/testing/selftests/filesystems/Makefile | 2 +-
tools/testing/selftests/filesystems/fclog.c | 135 +++++++++++++++++++++++++
4 files changed, 167 insertions(+), 25 deletions(-)
---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250806-fscontext-log-cleanups-50f0143674ae
Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>
Userspace generally expects APIs that return EMSGSIZE to allow for them
to adjust their buffer size and retry the operation. However, the
fscontext log would previously clear the message even in the EMSGSIZE
case.
Given that it is very cheap for us to check whether the buffer is too
small before we remove the message from the ring buffer, let's just do
that instead.
Fixes: 007ec26cdc9f ("vfs: Implement logging through fs_context")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Aleksa Sarai (2):
fscontext: do not consume log entries for -EMSGSIZE case
selftests/filesystems: add basic fscontext log tests
fs/fsopen.c | 22 ++--
tools/testing/selftests/filesystems/.gitignore | 1 +
tools/testing/selftests/filesystems/Makefile | 2 +-
tools/testing/selftests/filesystems/fclog.c | 137 +++++++++++++++++++++++++
4 files changed, 153 insertions(+), 9 deletions(-)
---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250806-fscontext-log-cleanups-50f0143674ae
Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>
With /proc/pid/maps now being read under per-vma lock protection we can
reuse parts of that code to execute PROCMAP_QUERY ioctl also without
taking mmap_lock. The change is designed to reduce mmap_lock contention
and prevent PROCMAP_QUERY ioctl calls from blocking address space updates.
This patchset was split out of the original patchset [1] that introduced
per-vma lock usage for /proc/pid/maps reading. It contains PROCMAP_QUERY
tests, code refactoring patch to simplify the main change and the actual
transition to per-vma lock.
Changes since v1 [2]
- Added Tested-by and Acked-by, per SeongJae Park
- Fixed NOMMU case, per Vlastimil Babka
- Renamed proc_maps_query_data to proc_maps_locking_ctx,
per Vlastimil Babka
[1] https://lore.kernel.org/all/20250704060727.724817-1-surenb@google.com/
[2] https://lore.kernel.org/all/20250731220024.702621-1-surenb@google.com/
Suren Baghdasaryan (3):
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
fs/proc/task_mmu: factor out proc_maps_private fields used by
PROCMAP_QUERY
fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks
fs/proc/internal.h | 15 +-
fs/proc/task_mmu.c | 149 ++++++++++++------
fs/proc/task_nommu.c | 14 +-
tools/testing/selftests/proc/proc-maps-race.c | 65 ++++++++
4 files changed, 181 insertions(+), 62 deletions(-)
base-commit: 01da54f10fddf3b01c5a3b80f6b16bbad390c302
--
2.50.1.565.gc32cd1483b-goog
With joint effort from the upstream KVM community, we come up with the
4th version of mediated vPMU for x86. We have made the following changes
on top of the previous RFC v3.
v3 -> v4
- Rebase whole patchset on 6.14-rc3 base.
- Address Peter's comments on Perf part.
- Address Sean's comments on KVM part.
* Change key word "passthrough" to "mediated" in all patches
* Change static enabling to user space dynamic enabling via KVM_CAP_PMU_CAPABILITY.
* Only support GLOBAL_CTRL save/restore with VMCS exec_ctrl, drop the MSR
save/retore list support for GLOBAL_CTRL, thus the support of mediated
vPMU is constrained to SapphireRapids and later CPUs on Intel side.
* Merge some small changes into a single patch.
- Address Sandipan's comment on invalid pmu pointer.
- Add back "eventsel_hw" and "fixed_ctr_ctrl_hw" to avoid to directly
manipulate pmc->eventsel and pmu->fixed_ctr_ctrl.
Testing (Intel side):
- Perf-based legacy vPMU (force emulation on/off)
* Kselftests pmu_counters_test, pmu_event_filter_test and
vmx_pmu_caps_test pass.
* KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
* Basic perf counting/sampling tests in 3 scenarios, guest-only,
host-only and host-guest coexistence all pass.
- Mediated vPMU (force emulation on/off)
* Kselftests pmu_counters_test, pmu_event_filter_test and
vmx_pmu_caps_test pass.
* KUT PMU tests pmu, pmu_lbr, pmu_pebs pass.
* Basic perf counting/sampling tests in 3 scenarios, guest-only,
host-only and host-guest coexistence all pass.
- Failures. All above tests passed on Intel Granite Rapids as well
except a failure on KUT/pmu_pebs.
* GP counter 0 (0xfffffffffffe): PEBS record (written seq 0)
is verified (including size, counters and cfg).
* The pebs_data_cfg (0xb500000000) doesn't match with the
effective MSR_PEBS_DATA_CFG (0x0).
* This failure has nothing to do with this mediated vPMU patch set. The
failure is caused by Granite Rapids supported timed PEBS which needs
extra support on Qemu and KUT/pmu_pebs. These extra support would be
sent in separate patches later.
Testing (AMD side):
- Kselftests pmu_counters_test, pmu_event_filter_test and
vmx_pmu_caps_test all pass
- legacy guest with KUT/pmu:
* qmeu option: -cpu host, -perfctr-core
* when set force_emulation_prefix=1, passes
* when set force_emulation_prefix=0, passes
- perfmon-v1 guest with KUT/pmu:
* qmeu option: -cpu host, -perfmon-v2
* when set force_emulation_prefix=1, passes
* when set force_emulation_prefix=0, passes
- perfmon-v2 guest with KUT/pmu:
* qmeu option: -cpu host
* when set force_emulation_prefix=1, passes
* when set force_emulation_prefix=0, passes
- perf_fuzzer (perfmon-v2):
* fails with soft lockup in guest in current version.
* culprit could be between 6.13 ~ 6.14-rc3 within KVM
* Series tested on 6.12 and 6.13 without issue.
Note: a QEMU series is needed to run mediated vPMU v4:
- https://lore.kernel.org/all/20250324123712.34096-1-dapeng1.mi@linux.intel.c…
History:
- RFC v3: https://lore.kernel.org/all/20240801045907.4010984-1-mizhang@google.com/
- RFC v2: https://lore.kernel.org/all/20240506053020.3911940-1-mizhang@google.com/
- RFC v1: https://lore.kernel.org/all/20240126085444.324918-1-xiong.y.zhang@linux.int…
Dapeng Mi (18):
KVM: x86/pmu: Introduce enable_mediated_pmu global parameter
KVM: x86/pmu: Check PMU cpuid configuration from user space
KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
KVM: x86/pmu: Add perf_capabilities field in struct kvm_host_values{}
KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
KVM: VMX: Add macros to wrap around
{secondary,tertiary}_exec_controls_changebit()
KVM: x86/pmu: Check if mediated vPMU can intercept rdpmc
KVM: x86/pmu/vmx: Save/load guest IA32_PERF_GLOBAL_CTRL with
vm_exit/entry_ctrl
KVM: x86/pmu: Optimize intel/amd_pmu_refresh() helpers
KVM: x86/pmu: Setup PMU MSRs' interception mode
KVM: x86/pmu: Handle PMU MSRs interception and event filtering
KVM: x86/pmu: Switch host/guest PMU context at vm-exit/vm-entry
KVM: x86/pmu: Handle emulated instruction for mediated vPMU
KVM: nVMX: Add macros to simplify nested MSR interception setting
KVM: selftests: Add mediated vPMU supported for pmu tests
KVM: Selftests: Support mediated vPMU for vmx_pmu_caps_test
KVM: Selftests: Fix pmu_counters_test error for mediated vPMU
KVM: x86/pmu: Expose enable_mediated_pmu parameter to user space
Kan Liang (8):
perf: Support get/put mediated PMU interfaces
perf: Skip pmu_ctx based on event_type
perf: Clean up perf ctx time
perf: Add a EVENT_GUEST flag
perf: Add generic exclude_guest support
perf: Add switch_guest_ctx() interface
perf/x86: Support switch_guest_ctx interface
perf/x86/intel: Support PERF_PMU_CAP_MEDIATED_VPMU
Mingwei Zhang (5):
perf/x86: Forbid PMI handler when guest own PMU
perf/x86/core: Plumb mediated PMU capability from x86_pmu to
x86_pmu_cap
KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
KVM: x86/pmu: introduce eventsel_hw to prepare for pmu event filtering
KVM: nVMX: Add nested virtualization support for mediated PMU
Sandipan Das (4):
perf/x86/core: Do not set bit width for unavailable counters
KVM: x86/pmu: Add AMD PMU registers to direct access list
KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest
write to event selectors
perf/x86/amd: Support PERF_PMU_CAP_MEDIATED_VPMU for AMD host
Xiong Zhang (3):
x86/irq: Factor out common code for installing kvm irq handler
perf: core/x86: Register a new vector for KVM GUEST PMI
KVM: x86/pmu: Register KVM_GUEST_PMI_VECTOR handler
arch/x86/events/amd/core.c | 2 +
arch/x86/events/core.c | 40 +-
arch/x86/events/intel/core.c | 5 +
arch/x86/include/asm/hardirq.h | 1 +
arch/x86/include/asm/idtentry.h | 1 +
arch/x86/include/asm/irq.h | 2 +-
arch/x86/include/asm/irq_vectors.h | 5 +-
arch/x86/include/asm/kvm-x86-pmu-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 10 +
arch/x86/include/asm/msr-index.h | 18 +-
arch/x86/include/asm/perf_event.h | 1 +
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kernel/idt.c | 1 +
arch/x86/kernel/irq.c | 39 +-
arch/x86/kvm/cpuid.c | 15 +
arch/x86/kvm/pmu.c | 254 ++++++++-
arch/x86/kvm/pmu.h | 45 ++
arch/x86/kvm/svm/pmu.c | 148 ++++-
arch/x86/kvm/svm/svm.c | 26 +
arch/x86/kvm/svm/svm.h | 2 +-
arch/x86/kvm/vmx/capabilities.h | 11 +-
arch/x86/kvm/vmx/nested.c | 68 ++-
arch/x86/kvm/vmx/pmu_intel.c | 224 ++++++--
arch/x86/kvm/vmx/vmx.c | 89 +--
arch/x86/kvm/vmx/vmx.h | 11 +-
arch/x86/kvm/x86.c | 63 ++-
arch/x86/kvm/x86.h | 2 +
include/linux/perf_event.h | 47 +-
kernel/events/core.c | 519 ++++++++++++++----
.../beauty/arch/x86/include/asm/irq_vectors.h | 5 +-
.../selftests/kvm/include/kvm_test_harness.h | 13 +
.../testing/selftests/kvm/include/kvm_util.h | 3 +
.../selftests/kvm/include/x86/processor.h | 8 +
tools/testing/selftests/kvm/lib/kvm_util.c | 23 +
.../selftests/kvm/x86/pmu_counters_test.c | 24 +-
.../selftests/kvm/x86/pmu_event_filter_test.c | 8 +-
.../selftests/kvm/x86/vmx_pmu_caps_test.c | 2 +-
37 files changed, 1480 insertions(+), 258 deletions(-)
base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319
--
2.49.0.395.g12beb8f557-goog
We keep seeing flakes on packetdrill on debug kernels, while
non-debug kernels are stable, not a single flake in 200 runs.
Time to give up, debug kernels appear to suffer from 10msec
latency spikes and any timing-sensitive test is bound to flake.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: shuah(a)kernel.org
CC: willemb(a)google.com
CC: matttbe(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
.../selftests/net/packetdrill/ksft_runner.sh | 19 +------------------
1 file changed, 1 insertion(+), 18 deletions(-)
diff --git a/tools/testing/selftests/net/packetdrill/ksft_runner.sh b/tools/testing/selftests/net/packetdrill/ksft_runner.sh
index c5b01e1bd4c7..a7e790af38ff 100755
--- a/tools/testing/selftests/net/packetdrill/ksft_runner.sh
+++ b/tools/testing/selftests/net/packetdrill/ksft_runner.sh
@@ -35,24 +35,7 @@ failfunc=ktap_test_fail
if [[ -n "${KSFT_MACHINE_SLOW}" ]]; then
optargs+=('--tolerance_usecs=14000')
-
- # xfail tests that are known flaky with dbg config, not fixable.
- # still run them for coverage (and expect 100% pass without dbg).
- declare -ar xfail_list=(
- "tcp_blocking_blocking-connect.pkt"
- "tcp_blocking_blocking-read.pkt"
- "tcp_eor_no-coalesce-retrans.pkt"
- "tcp_fast_recovery_prr-ss.*.pkt"
- "tcp_sack_sack-route-refresh-ip-tos.pkt"
- "tcp_slow_start_slow-start-after-win-update.pkt"
- "tcp_timestamping.*.pkt"
- "tcp_user_timeout_user-timeout-probe.pkt"
- "tcp_zerocopy_cl.*.pkt"
- "tcp_zerocopy_epoll_.*.pkt"
- "tcp_tcp_info_tcp-info-.*-limited.pkt"
- )
- readonly xfail_regex="^($(printf '%s|' "${xfail_list[@]}"))$"
- [[ "$script" =~ ${xfail_regex} ]] && failfunc=ktap_test_xfail
+ failfunc=ktap_test_xfail
fi
ktap_print_header
--
2.50.1
This change resolves non literal string format warning invoked
for proc-maps-race.c while compiling.
proc-maps-race.c:205:17: warning: format not a string literal and no format arguments [-Wformat-security]
205 | printf(text);
| ^~~~~~
proc-maps-race.c:209:17: warning: format not a string literal and no format arguments [-Wformat-security]
209 | printf(text);
| ^~~~~~
proc-maps-race.c: In function ‘print_last_lines’:
proc-maps-race.c:224:9: warning: format not a string literal and no format arguments [-Wformat-security]
224 | printf(start);
| ^~~~~~
Added string format specifier %s for the printf calls
in both print_first_lines() and print_last_lines() thus
resolving the warnings invoked.
The test executes fine after this change thus causing no
affect to the functional behavior of the test.
Fixes: aadc099c480f ("selftests/proc: add verbose mode for /proc/pid/maps tearing tests")
Signed-off-by: Sukrut Heroorkar <hsukrut3(a)gmail.com>
Acked-by: Suren Baghdasaryan <surenb(a)google.com>
---
Changes since v1:
- Added Fixes tag
- Included Acked-by Suren Baghdasaryan
https://lore.kernel.org/all/CAHCkknoxpKV80-S3jByY1xnRXd1Pr=v=D2a0ZcgnY0-Hny…
tools/testing/selftests/proc/proc-maps-race.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c
index 66773685a047..94bba4553130 100644
--- a/tools/testing/selftests/proc/proc-maps-race.c
+++ b/tools/testing/selftests/proc/proc-maps-race.c
@@ -202,11 +202,11 @@ static void print_first_lines(char *text, int nr)
int offs = end - text;
text[offs] = '\0';
- printf(text);
+ printf("%s", text);
text[offs] = '\n';
printf("\n");
} else {
- printf(text);
+ printf("%s", text);
}
}
@@ -221,7 +221,7 @@ static void print_last_lines(char *text, int nr)
nr--;
start--;
}
- printf(start);
+ printf("%s", start);
}
static void print_boundaries(const char *title, FIXTURE_DATA(proc_maps_race) *self)
--
2.43.0
This change resolves non literal string format warning invoked
for proc-maps-race.c while compiling.
proc-maps-race.c:205:17: warning: format not a string literal and no format arguments [-Wformat-security]
205 | printf(text);
| ^~~~~~
proc-maps-race.c:209:17: warning: format not a string literal and no format arguments [-Wformat-security]
209 | printf(text);
| ^~~~~~
proc-maps-race.c: In function ‘print_last_lines’:
proc-maps-race.c:224:9: warning: format not a string literal and no format arguments [-Wformat-security]
224 | printf(start);
| ^~~~~~
Added string format specifier %s for the printf calls
in both print_first_lines() and print_last_lines() thus
resolving the warnings invoked.
The test executes fine after this change thus causing no
affect to the functional behavior of the test.
Signed-off-by: Sukrut Heroorkar <hsukrut3(a)gmail.com>
---
tools/testing/selftests/proc/proc-maps-race.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c
index 66773685a047..94bba4553130 100644
--- a/tools/testing/selftests/proc/proc-maps-race.c
+++ b/tools/testing/selftests/proc/proc-maps-race.c
@@ -202,11 +202,11 @@ static void print_first_lines(char *text, int nr)
int offs = end - text;
text[offs] = '\0';
- printf(text);
+ printf("%s", text);
text[offs] = '\n';
printf("\n");
} else {
- printf(text);
+ printf("%s", text);
}
}
@@ -221,7 +221,7 @@ static void print_last_lines(char *text, int nr)
nr--;
start--;
}
- printf(start);
+ printf("%s", start);
}
static void print_boundaries(const char *title, FIXTURE_DATA(proc_maps_race) *self)
--
2.43.0
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find DUALPI2 iproute2 patch v11.
For more details of DualPI2, please refer IETF RFC9332
(https://datatracker.ietf.org/doc/html/rfc9332).
Best Regards,
Chia-Yu
---
v11 (18-Jul-2025)
- Replace TCA_DUALPI2 prefix with TC_DUALPI2 prefix for enums (Jakub Kicinski <kuba(a)kernel.org>)
v10 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS w/ STEP_THRESH_PKTS and STEP_THRESH_US of net-next patch (Jakub Kicinski <kuba(a)kernel.org>)
v9 (13-Jun-2025)
- Fix space issue and typos (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Change 'rtt_typical' to 'typical_rtt' in tc/q_dualpi2.c (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Add the num of enum used by DualPI2 in pkt_sched.h
v8 (09-May-2025)
- Update pkt_sched.h with the one in nex-next
- Correct a typo in the comment within pkt_sched.h (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update manual content in man/man8/tc-dualpi2.8 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update tc/q_dualpi2.c to fix missing blank lines and add missing case (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
v7 (05-May-2025)
- Align pkt_sched.h with the v14 version of net-next due to spec modification in tc.yaml
- Reorganize dualpi2_print_opt() to match the order in tc.yaml
- Remove credit-queue in PRINT_JSON
v6 (26-Apr-2025)
- Update JSON file output due to spec modification in tc.yaml of net-next
v5 (25-Mar-2025)
- Use matches() to replace current strcmp() (Stephen Hemminger <stephen(a)networkplumber.org>)
- Use general parse_percent() for handling scaled percentage values (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 stats (Stephen Hemminger <stephen(a)networkplumber.org>)
v4 (16-Mar-2025)
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step marking.
v3 (21-Feb-2025)
- Add memlimit to the dualpi2 attribute, and add memory_used, max_memory_used, and memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update the manual to align with the latest implementation and clarify the queue naming and default unit
- Use common "get_scaled_alpha_beta" and clean print_opt for Dualpi2
v2 (23-Oct-2024)
- Rename get_float in dualpi2 to get_float_min_max in utils.c
- Move get_float from iplink_can.c in utils.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Add print function for JSON of dualpi2 (Stephen Hemminger <stephen(a)networkplumber.org>)
---
Chia-Yu Chang (1):
tc: add dualpi2 scheduler module
bash-completion/tc | 11 +-
include/uapi/linux/pkt_sched.h | 68 +++++
include/utils.h | 2 +
ip/iplink_can.c | 14 -
lib/utils.c | 30 ++
man/man8/tc-dualpi2.8 | 249 ++++++++++++++++
tc/Makefile | 1 +
tc/q_dualpi2.c | 528 +++++++++++++++++++++++++++++++++
8 files changed, 888 insertions(+), 15 deletions(-)
create mode 100644 man/man8/tc-dualpi2.8
create mode 100644 tc/q_dualpi2.c
--
2.34.1
From: Benjamin Berg <benjamin.berg(a)intel.com>
Hi,
This patchset adds signal handling to nolibc. Initially, I would like to
use this for tests. But in the long run, the goal is to use nolibc for
the UML kernel itself. In both cases, signal handling will be needed.
With v3 everything is now included in nolibc instead of trying to use
the messy kernel headers.
Benjamin
Benjamin Berg (4):
selftests/nolibc: fix EXPECT_NZ macro
selftests/nolibc: remove outdated comment about construct order
tools/nolibc: add more generic bitmask macros for FD_*
tools/nolibc: add signal support
tools/include/nolibc/Makefile | 1 +
tools/include/nolibc/arch-s390.h | 4 +-
tools/include/nolibc/asm-signal.h | 237 +++++++++++++++++++
tools/include/nolibc/signal.h | 179 ++++++++++++++
tools/include/nolibc/sys.h | 2 +-
tools/include/nolibc/sys/wait.h | 1 +
tools/include/nolibc/time.h | 2 +-
tools/include/nolibc/types.h | 81 ++++---
tools/testing/selftests/nolibc/nolibc-test.c | 139 ++++++++++-
9 files changed, 608 insertions(+), 38 deletions(-)
create mode 100644 tools/include/nolibc/asm-signal.h
--
2.50.1
Currently all test cases are linked with thp_settings, while only 6
out of 50+ targets rely on it.
Instead of making thp_settings as a common dependency, link it only
when necessary.
Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
---
tools/testing/selftests/mm/Makefile | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index d4f19f87053b..eea4881c918a 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -158,14 +158,19 @@ TEST_FILES += write_hugetlb_memory.sh
include ../lib.mk
-$(TEST_GEN_PROGS): vm_util.c thp_settings.c
-$(TEST_GEN_FILES): vm_util.c thp_settings.c
+$(TEST_GEN_PROGS): vm_util.c
+$(TEST_GEN_FILES): vm_util.c
$(OUTPUT)/uffd-stress: uffd-common.c
$(OUTPUT)/uffd-unit-tests: uffd-common.c
-$(OUTPUT)/uffd-wp-mremap: uffd-common.c
+$(OUTPUT)/uffd-wp-mremap: uffd-common.c thp_settings.c
$(OUTPUT)/protection_keys: pkey_util.c
$(OUTPUT)/pkey_sighandler_tests: pkey_util.c
+$(OUTPUT)/cow: thp_settings.c
+$(OUTPUT)/migration: thp_settings.c
+$(OUTPUT)/khugepaged: thp_settings.c
+$(OUTPUT)/ksm_tests: thp_settings.c
+$(OUTPUT)/soft-dirty: thp_settings.c
ifeq ($(ARCH),x86_64)
BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
--
2.34.1
When using GCC on x86-64 to compile an usdt prog with -O1 or higher
optimization, the compiler will generate SIB addressing mode for global
array and PC-relative addressing mode for global variable,
e.g. "1@-96(%rbp,%rax,8)" and "-1@4+t1(%rip)".
The current USDT implementation in libbpf cannot parse these two formats,
causing `bpf_program__attach_usdt()` to fail with -ENOENT
(unrecognized register).
This patch series adds support for SIB addressing mode in USDT probes.
The main changes include:
- add correct handling logic for SIB-addressed arguments in
`parse_usdt_arg`.
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt
argument spec.
- change the global variable t1 to a local variable, to avoid compiler
generating PC-relative addressing mode for it.
Testing shows that the SIB probe correctly generates 8@(%rcx,%rax,8)
argument spec and passes all validation checks.
The modification history of this patch series:
Change since v1:
- refactor the code to make it more readable
- modify the commit message to explain why and how
Change since v2:
- fix the `scale` uninitialized error
Change since v3:
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt
and pass all test cases.
Change since v4:
- split the patch into two parts, one for the fix and the other for the
test
Do we need to add support for PC-relative USDT argument spec handling in
libbpf? I have some interest in this question, but currently have no
ideas. Getting offsets based on symbols requires dependency on the symbol
table. However, once the binary file is stripped, the symtab will also be
removed, which will cause this approach to fail. Does anyone have any
thoughts on this?
Jiawei Zhao (2):
libbpf: fix USDT SIB argument handling causing unrecognized register
error
selftests/bpf: Force -O2 for USDT selftests to cover SIB handling
logic
tools/lib/bpf/usdt.bpf.h | 33 +++++++++++++-
tools/lib/bpf/usdt.c | 43 ++++++++++++++++---
tools/testing/selftests/bpf/Makefile | 5 +++
tools/testing/selftests/bpf/prog_tests/usdt.c | 18 +++++---
4 files changed, 86 insertions(+), 13 deletions(-)
--
2.43.0
When using GCC on x86-64 to compile an usdt prog with -O1 or higher
optimization, the compiler will generate SIB addressing mode for global
array and PC-relative addressing mode for global variable,
e.g. "1@-96(%rbp,%rax,8)" and "-1@4+t1(%rip)".
The current USDT implementation in libbpf cannot parse these two formats,
causing `bpf_program__attach_usdt()` to fail with -ENOENT
(unrecognized register).
This patch series adds support for SIB addressing mode in USDT probes.
The main changes include:
- add correct handling logic for SIB-addressed arguments in
`parse_usdt_arg`.
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt
argument spec.
- change the global variable t1 to a local variable, to avoid compiler
generating PC-relative addressing mode for it.
Testing shows that the SIB probe correctly generates 8@(%rcx,%rax,8)
argument spec and passes all validation checks.
The modification history of this patch series:
Change since v1:
- refactor the code to make it more readable
- modify the commit message to explain why and how
Change since v2:
- fix the `scale` uninitialized error
Change since v3:
- force -O2 optimization for usdt.test.o to generate SIB addressing usdt
and pass all test cases.
Do we need to add support for PC-relative USDT argument spec handling in libbpf?
I have some interest in this question, but currently have no ideas. Getting offsets
based on symbols requires dependency on the symbol table. However, once the binary
file is stripped, the symtab will also be removed, which will cause this approach
to fail. Does anyone have any thoughts on this?
Jiawei Zhao (1):
libbpf: fix USDT SIB argument handling causing unrecognized register
error
tools/lib/bpf/usdt.bpf.h | 33 +++++++++++++-
tools/lib/bpf/usdt.c | 43 ++++++++++++++++---
tools/testing/selftests/bpf/Makefile | 5 +++
tools/testing/selftests/bpf/prog_tests/usdt.c | 18 +++++---
4 files changed, 86 insertions(+), 13 deletions(-)
--
2.43.0
With /proc/pid/maps now being read under per-vma lock protection we can
reuse parts of that code to execute PROCMAP_QUERY ioctl also without
taking mmap_lock. The change is designed to reduce mmap_lock contention
and prevent PROCMAP_QUERY ioctl calls from blocking address space updates.
This patchset was split out of the original patchset [1] that introduced
per-vma lock usage for /proc/pid/maps reading. It contains PROCMAP_QUERY
tests, code refactoring patch to simplify the main change and the actual
transition to per-vma lock.
[1] https://lore.kernel.org/all/20250704060727.724817-1-surenb@google.com/
Suren Baghdasaryan (3):
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
fs/proc/task_mmu: factor out proc_maps_private fields used by
PROCMAP_QUERY
fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks
fs/proc/internal.h | 15 +-
fs/proc/task_mmu.c | 149 ++++++++++++------
tools/testing/selftests/proc/proc-maps-race.c | 65 ++++++++
3 files changed, 174 insertions(+), 55 deletions(-)
base-commit: 01da54f10fddf3b01c5a3b80f6b16bbad390c302
--
2.50.1.565.gc32cd1483b-goog
The step_after_suspend_test verifies that the system successfully
suspended and resumed by setting a timerfd and checking whether the
timer fully expired. However, this method is unreliable due to timing
races.
In practice, the system may take time to enter suspend, during which the
timer may expire just before or during the transition. As a result,
the remaining time after resume may show non-zero nanoseconds, even if
suspend/resume completed successfully. This leads to false test failures.
Replace the timer-based check with a read from
/sys/power/suspend_stats/success. This counter is incremented only
after a full suspend/resume cycle, providing a reliable and race-free
indicator.
Also remove the unused file descriptor for /sys/power/state, which
remained after switching to a system() call to trigger suspend [1].
[1] https://lore.kernel.org/all/20240930224025.2858767-1-yifei.l.liu@oracle.com/
Fixes: c66be905cda2 ("selftests: breakpoints: use remaining time to check if suspend succeed")
Signed-off-by: Moon Hee Lee <moonhee.lee.ca(a)gmail.com>
---
.../breakpoints/step_after_suspend_test.c | 41 ++++++++++++++-----
1 file changed, 31 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/breakpoints/step_after_suspend_test.c b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
index 8d275f03e977..8d233ac95696 100644
--- a/tools/testing/selftests/breakpoints/step_after_suspend_test.c
+++ b/tools/testing/selftests/breakpoints/step_after_suspend_test.c
@@ -127,22 +127,42 @@ int run_test(int cpu)
return KSFT_PASS;
}
+/*
+ * Reads the suspend success count from sysfs.
+ * Returns the count on success or exits on failure.
+ */
+static int get_suspend_success_count_or_fail(void)
+{
+ FILE *fp;
+ int val;
+
+ fp = fopen("/sys/power/suspend_stats/success", "r");
+ if (!fp)
+ ksft_exit_fail_msg(
+ "Failed to open suspend_stats/success: %s\n",
+ strerror(errno));
+
+ if (fscanf(fp, "%d", &val) != 1) {
+ fclose(fp);
+ ksft_exit_fail_msg(
+ "Failed to read suspend success count\n");
+ }
+
+ fclose(fp);
+ return val;
+}
+
void suspend(void)
{
- int power_state_fd;
int timerfd;
int err;
+ int count_before;
+ int count_after;
struct itimerspec spec = {};
if (getuid() != 0)
ksft_exit_skip("Please run the test as root - Exiting.\n");
- power_state_fd = open("/sys/power/state", O_RDWR);
- if (power_state_fd < 0)
- ksft_exit_fail_msg(
- "open(\"/sys/power/state\") failed %s)\n",
- strerror(errno));
-
timerfd = timerfd_create(CLOCK_BOOTTIME_ALARM, 0);
if (timerfd < 0)
ksft_exit_fail_msg("timerfd_create() failed\n");
@@ -152,14 +172,15 @@ void suspend(void)
if (err < 0)
ksft_exit_fail_msg("timerfd_settime() failed\n");
+ count_before = get_suspend_success_count_or_fail();
+
system("(echo mem > /sys/power/state) 2> /dev/null");
- timerfd_gettime(timerfd, &spec);
- if (spec.it_value.tv_sec != 0 || spec.it_value.tv_nsec != 0)
+ count_after = get_suspend_success_count_or_fail();
+ if (count_after <= count_before)
ksft_exit_fail_msg("Failed to enter Suspend state\n");
close(timerfd);
- close(power_state_fd);
}
int main(int argc, char **argv)
--
2.43.0
This patch fixes unstable LACP negotiation when bonding is configured in
passive mode (`lacp_active=off`).
Previously, the actor would stop sending LACPDUs after initial negotiation
succeeded, leading to the partner timing out and restarting the negotiation
cycle. This resulted in continuous LACP state flapping.
The fix ensures the passive actor starts sending periodic LACPDUs after
receiving the first LACPDU from the partner, in accordance with IEEE
802.1AX-2020 section 6.4.1.
Out of topic:
Although this patch addresses a functional bug and could be considered for
`net`, I'm slightly concerned about potential regressions, as it changes
the current bonding LACP protocol behavior.
It might be safer to merge this through `net-next` first to allow broader
testing. Thoughts?
Hangbin Liu (2):
bonding: send LACPDUs periodically in passive mode after receiving
partner's LACPDU
selftests: bonding: add test for passive LACP mode
drivers/net/bonding/bond_3ad.c | 72 ++++++++++----
drivers/net/bonding/bond_options.c | 1 +
include/net/bond_3ad.h | 1 +
.../selftests/drivers/net/bonding/Makefile | 3 +-
.../drivers/net/bonding/bond_passive_lacp.sh | 93 +++++++++++++++++++
5 files changed, 151 insertions(+), 19 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_passive_lacp.sh
--
2.46.0
There are several situations where VMM is involved when handling
synchronous external instruction or data aborts, and often VMM
needs to inject external aborts to guest. In addition to manipulating
individual registers with KVM_SET_ONE_REG API, an easier way is to
use the KVM_SET_VCPU_EVENTS API.
This patchset adds two new features to the KVM_SET_VCPU_EVENTS API.
1. Extend KVM_SET_VCPU_EVENTS to support external instruction abort.
2. Allow userspace to emulate ESR_ELx.ISS by supplying ESR_ELx.
In this way, we can also allow userspace to emulate ESR_ELx.ISS2
in future.
The UAPI change for #1 is straightforward. However, I would appreciate
some feedback on the ABI change for #2:
struct kvm_vcpu_events {
struct {
__u8 serror_pending;
__u8 serror_has_esr;
__u8 ext_dabt_pending;
__u8 ext_iabt_pending;
__u8 ext_abt_has_esr;
__u8 pad[3];
__u64 serror_esr;
__u64 ext_abt_esr; // <= +8 bytes
} exception;
__u32 reserved[10]; // <= -8 bytes
};
The offset to kvm_vcpu_events.reserved changes, and the size of
exception changes. I think we can't say userspace will never access
reserved, or they will never use sizeof(exception). Theoretically this
is an ABI break and I want to call it out and ask if a new ABI is needed
for feature #2. For example, is it worthy to introduce exception_v2
or kvm_vcpu_events_v2.
Based on commit 7b8346bd9fce6 ("KVM: arm64: Don't attempt vLPI mappings
when vPE allocation is disabled")
Jiaqi Yan (3):
KVM: arm64: Allow userspace to supply ESR when injecting SEA
KVM: selftests: Test injecting external abort with ISS
Documentation: kvm: update UAPI for injecting SEA
Raghavendra Rao Ananta (1):
KVM: arm64: Allow userspace to inject external instruction abort
Documentation/virt/kvm/api.rst | 48 +++--
arch/arm64/include/asm/kvm_emulate.h | 9 +-
arch/arm64/include/uapi/asm/kvm.h | 7 +-
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/emulate-nested.c | 6 +-
arch/arm64/kvm/guest.c | 42 ++--
arch/arm64/kvm/inject_fault.c | 16 +-
include/uapi/linux/kvm.h | 1 +
tools/arch/arm64/include/uapi/asm/kvm.h | 7 +-
.../selftests/kvm/arm64/external_aborts.c | 191 +++++++++++++++---
.../testing/selftests/kvm/arm64/inject_iabt.c | 98 +++++++++
11 files changed, 352 insertions(+), 74 deletions(-)
create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c
--
2.50.1.565.gc32cd1483b-goog
Problem
=======
When host APEI is unable to claim synchronous external abort (SEA)
during stage-2 guest abort, today KVM directly injects an async SError
into the VCPU then resumes it. The injected SError usually results in
unpleasant guest kernel panic.
One of the major situation of guest SEA is when VCPU consumes recoverable
uncorrected memory error (UER), which is not uncommon at all in modern
datacenter servers with large amounts of physical memory. Although SError
and guest panic is sufficient to stop the propagation of corrupted memory
there is room to recover from an UER in a more graceful manner.
Proposed Solution
=================
Alternatively KVM can replay the SEA to the faulting VCPU, via existing
KVM_SET_VCPU_EVENTS API. If the memory poison consumption or the fault
that cause SEA is not from guest kernel, the blast radius can be limited
to the consuming or faulting guest userspace process, so the VM can keep
running.
In addition, instead of doing under the hood without involving userspace,
there are benefits to redirect the SEA to VMM:
- VM customers care about the disruptions caused by memory errors, and
VMM usually has the responsibility to start the process of notifying
the customers of memory error events in their VMs. For example some
cloud provider emits a critical log in their observability UI [1], and
provides playbook for customers on how to mitigate disruptions to
their workloads.
- VMM can protect future memory error consumption by unmapping the poisoned
pages from stage-2 page table with KVM userfault, or by splitting the
memslot that contains the poisoned guest pages [2].
- VMM can keep track of SEA events in the VM. When VMM thinks the status
on the host or the VM is bad enough, e.g. number of distinct SEAs
exceeds a threshold, it can restart the VM on another healthy host.
- Behavior parity with x86 architecture. When machine check exception
(MCE) is caused by VCPU, kernel or KVM signals userspace SIGBUS to
let VMM either recover from the MCE, or terminate itself with VM.
The prior RFC proposes to implement SIGBUS on arm64 as well, but
Marc preferred VCPU exit over signal [3]. However, implementation
aside, returning SEA to VMM is on par with returning MCE to VMM.
Once SEA is redirected to VMM, among other actions, VMM is encouraged
to inject external aborts into the faulting VCPU, which is already
supported by KVM on arm64. We notice injecting instruction abort is not
fully supported by KVM_SET_VCPU_EVENTS. Complement it in the patchset.
New UAPIs
=========
This patchset introduces following userspace-visiable changes to empower
VMM to control what happens next for SEA on guest memory:
- KVM_CAP_ARM_SEA_TO_USER. While taking SEA, if userspace has enabled
this new capability at VM creation, and the SEA is not caused by
memory allocated for stage-2 translation table, instead of injecting
SError, return KVM_EXIT_ARM_SEA to userspace.
- KVM_EXIT_ARM_SEA. This is the VM exit reason VMM gets. The details
about the SEA is provided in arm_sea as much as possible, including
sanitized ESR value at EL2, if guest virtual and physical addresses
(GPA and GVA) are available and the values if available.
- KVM_CAP_ARM_INJECT_EXT_IABT. VMM today can inject external data abort
to VCPU via KVM_SET_VCPU_EVENTS API. However, in case of instruction
abort, VMM cannot inject it via KVM_SET_VCPU_EVENTS.
KVM_CAP_ARM_INJECT_EXT_IABT is just a natural extend to
KVM_CAP_ARM_INJECT_EXT_DABT that tells VMM KVM_SET_VCPU_EVENTS now
supports external instruction abort.
* From v1 [4]:
- Rebased on commit 4d62121ce9b5 ("KVM: arm64: vgic-debug: Avoid
dereferencing NULL ITE pointer").
- Sanitize ESR_EL2 before reporting it to userspace.
- Do not do KVM_EXIT_ARM_SEA when SEA is caused by memory allocated to
stage-2 translation table.
[1] https://cloud.google.com/solutions/sap/docs/manage-host-errors
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com
[3] https://lore.kernel.org/kvm/86pljbqqh0.wl-maz@kernel.org
[4] https://lore.kernel.org/kvm/20250505161412.1926643-1-jiaqiyan@google.com
Jiaqi Yan (5):
KVM: arm64: VM exit to userspace to handle SEA
KVM: arm64: Set FnV for VCPU when FAR_EL2 is invalid
KVM: selftests: Test for KVM_EXIT_ARM_SEA and KVM_CAP_ARM_SEA_TO_USER
KVM: selftests: Test for KVM_CAP_INJECT_EXT_IABT
Documentation: kvm: new uAPI for handling SEA
Raghavendra Rao Ananta (1):
KVM: arm64: Allow userspace to inject external instruction aborts
Documentation/virt/kvm/api.rst | 128 ++++++-
arch/arm64/include/asm/kvm_emulate.h | 67 ++++
arch/arm64/include/asm/kvm_host.h | 8 +
arch/arm64/include/asm/kvm_ras.h | 2 +-
arch/arm64/include/uapi/asm/kvm.h | 3 +-
arch/arm64/kvm/arm.c | 6 +
arch/arm64/kvm/guest.c | 13 +-
arch/arm64/kvm/inject_fault.c | 3 +
arch/arm64/kvm/mmu.c | 59 ++-
include/uapi/linux/kvm.h | 12 +
tools/arch/arm64/include/asm/esr.h | 2 +
tools/arch/arm64/include/uapi/asm/kvm.h | 3 +-
tools/testing/selftests/kvm/Makefile.kvm | 2 +
.../testing/selftests/kvm/arm64/inject_iabt.c | 98 +++++
.../testing/selftests/kvm/arm64/sea_to_user.c | 340 ++++++++++++++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 1 +
16 files changed, 718 insertions(+), 29 deletions(-)
create mode 100644 tools/testing/selftests/kvm/arm64/inject_iabt.c
create mode 100644 tools/testing/selftests/kvm/arm64/sea_to_user.c
--
2.49.0.1266.g31b7d2e469-goog
Vishal!
On Wed, Jul 30 2025 at 23:35, Vishal Parmar wrote:
Please do not top-post and trim your replies.
> The intent behind this change is to make output useful as is.
> for example, to provide a performance report in case of regression.
The point John was making:
>> So it might be worth looking into getting the output to be happy with
>> TAP while you're tweaking things here.
The kernel selftests are converting over to standardized TAP output
format, which is intended to aid automated testing.
So if we change the outpot format of this test, then we switch it over to
TAP format and do not invent yet another randomized output scheme.
> CSV format is also a good alternative if the maintainer prefers that.
The most important information is whether the test succeeded or not and
CSV format is not helping either to conform with the test output
standards.
For the success case, the actual numbers are uninteresting. In the
failure case it's sufficient to emit:
ksft_test_result_fail("Req: NNNN, Exp: $MMMM, Res: $LLLL\n", ...);
In case of regressions (fail), a report providing this output is good
enough for the relevant maintainer/developer to start investigating.
No?
Thanks,
tglx
Hi!
Does anyone have ideas about crediting test authors or tests for bugs
discovered? We increasingly see situations where someone adds a test
then our subsystem CI uncovers a (1 in a 100 runs) bug using that test.
Using reported-by doesn't feel right. But credit should go to the
person who wrote the test. Is anyone else having this dilemma?
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs mount is
auto-selected based on the mounting process's active pidns, and the
pidns itself is basically hidden once the mount has been constructed).
/* pidns mount option for procfs */
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns of a
procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes creates
a procfs mount from an empty pidns so that user namespaced containers
can be nested (without this, the nested containers would fail to
mount procfs). But this requires forking off a helper process because
you cannot just one-shot this using mount(2).
* Container runtimes in general need to fork into a container before
configuring its mounts, which can lead to security issues in the case
of shared-pidns containers (a privileged process in the pidns can
interact with your container runtime process). While
SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
strict need for this due to a minor uAPI wart is kind of unfortunate.
Things would be much easier if there was a way for userspace to just
specify the pidns they want. Patch 1 implements a new "pidns" argument
which can be set using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
The initial security model I have in this RFC is to be as conservative
as possible and just mirror the security model for setns(2) -- which
means that you can only set pidns=... to pid namespaces that your
current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
privileges over the pid namespace. This fulfils the requirements of
container runtimes, but I suspect that this may be too strict for some
usecases.
The pidns argument is not displayed in mountinfo -- it's not clear to me
what value it would make sense to show (maybe we could just use ns_dname
to provide an identifier for the namespace, but this number would be
fairly useless to userspace). I'm open to suggestions. Note that
PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
information about this outside of mountinfo.
/* ioctl(PROCFS_GET_PID_NAMESPACE) */
In addition, being able to figure out what pid namespace is being used
by a procfs mount is quite useful when you have an administrative
process (such as a container runtime) which wants to figure out the
correct way of mapping PIDs between its own namespace and the namespace
for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
alternative ways to do this, but they all rely on ancillary information
that third-party libraries and tools do not necessarily have access to.
To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
can be used to get a reference to the pidns that a procfs is using.
It's not quite clear what is the correct security model for this API,
but the current approach I've taken is to:
* Make the ioctl only valid on the root (meaning that a process without
access to the procfs root -- such as only having an fd to a procfs
file or some open_tree(2)-like subset -- cannot use this API).
* Require that the process requesting either has access to
/proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
access to it and can join it if they had a handle), or is in a pidns
that is a direct ancestor of the target pidns (i.e. all of the pids
are already visible in the procfs for the current process's pidns).
The security model for this is a little loose, as it seems to me that
all of the cases mentioned are valid cases to allow access, but I'm open
to suggestions for whether we need to make this stricter or looser.
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v2:
- #ifdef CONFIG_PID_NS
- Improve cover letter wording to make it clear we're talking about two
separate features with different permission models. [Andy Lutomirski]
- Fix build warnings in pidns_is_ancestor() patch. [kernel test robot]
- v1: <https://lore.kernel.org/r/20250721-procfs-pidns-api-v1-0-5cd9007e512d@cypha…>
---
Aleksa Sarai (4):
pidns: move is-ancestor logic to helper
procfs: add "pidns" mount option
procfs: add PROCFS_GET_PID_NAMESPACE ioctl
selftests/proc: add tests for new pidns APIs
Documentation/filesystems/proc.rst | 10 ++
fs/proc/root.c | 144 ++++++++++++++-
include/linux/pid_namespace.h | 9 +
include/uapi/linux/fs.h | 3 +
kernel/pid_namespace.c | 23 ++-
tools/testing/selftests/proc/.gitignore | 1 +
tools/testing/selftests/proc/Makefile | 1 +
tools/testing/selftests/proc/proc-pidns.c | 286 ++++++++++++++++++++++++++++++
8 files changed, 461 insertions(+), 16 deletions(-)
---
base-commit: 4c838c7672c39ec6ec48456c6ce22d14a68f4cda
change-id: 20250717-procfs-pidns-api-8ed1583431f0
Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>
Currently the RISC-V qemu architecture config has some code to check for
the presence of the BIOS files before loading, printing a message and
quitting if it's not present.
However, this prevents us from loading the architecture file even just
to inspect it if the dependencies are not present. Instead, have kunit.py
look for a check_dependencies function, and call it if present only when
the architecture config is being used.
This is necessary for future changes which enumerate or automatically
select an architecture.
Signed-off-by: David Gow <davidgow(a)google.com>
---
tools/testing/kunit/kunit_kernel.py | 5 +++++
tools/testing/kunit/qemu_configs/riscv.py | 10 ++++++----
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/tools/testing/kunit/kunit_kernel.py b/tools/testing/kunit/kunit_kernel.py
index 260d8d9aa1db..c3201a76da24 100644
--- a/tools/testing/kunit/kunit_kernel.py
+++ b/tools/testing/kunit/kunit_kernel.py
@@ -230,6 +230,11 @@ def _get_qemu_ops(config_path: str,
assert isinstance(spec.loader, importlib.abc.Loader)
spec.loader.exec_module(config)
+ # Check for any per-architecture dependencies
+ if hasattr(config, 'check_dependencies'):
+ if not config.check_dependencies():
+ raise ValueError('Missing dependencies for ' + config_path)
+
if not hasattr(config, 'QEMU_ARCH'):
raise ValueError('qemu_config module missing "QEMU_ARCH": ' + config_path)
params: qemu_config.QemuArchParams = config.QEMU_ARCH
diff --git a/tools/testing/kunit/qemu_configs/riscv.py b/tools/testing/kunit/qemu_configs/riscv.py
index c87758030ff7..3c271d1005d9 100644
--- a/tools/testing/kunit/qemu_configs/riscv.py
+++ b/tools/testing/kunit/qemu_configs/riscv.py
@@ -6,10 +6,12 @@ import sys
OPENSBI_FILE = 'opensbi-riscv64-generic-fw_dynamic.bin'
OPENSBI_PATH = '/usr/share/qemu/' + OPENSBI_FILE
-if not os.path.isfile(OPENSBI_PATH):
- print('\n\nOpenSBI bios was not found in "' + OPENSBI_PATH + '".\n'
- 'Please ensure that qemu-system-riscv is installed, or edit the path in "qemu_configs/riscv.py"\n')
- sys.exit()
+def check_dependencies() -> bool:
+ if not os.path.isfile(OPENSBI_PATH):
+ print('\n\nOpenSBI bios was not found in "' + OPENSBI_PATH + '".\n'
+ 'Please ensure that qemu-system-riscv is installed, or edit the path in "qemu_configs/riscv.py"\n')
+ return False
+ return True
QEMU_ARCH = QemuArchParams(linux_arch='riscv',
kconfig='''
--
2.50.1.552.g942d659e1b-goog
Hi Thadeu,,
On Sun, 15 Jun 2025 at 23:31, Thadeu Lima de Souza Cascardo
<cascardo(a)igalia.com> wrote:
>
> Add test cases for static and dynamic minor number allocation and
> deallocation.
>
> While at it, improve description and test suite name.
>
> Some of the cases include:
>
> - that static and dynamic allocation reserved the expected minors.
>
> - that registering duplicate minors or duplicate names will fail.
>
> - that failing to create a sysfs file (due to duplicate names) will
> deallocate the dynamic minor correctly.
>
> - that dynamic allocation does not allocate a minor number in the static
> range.
>
> - that there are no collisions when mixing dynamic and static allocations.
>
> - that opening devices with various minor device numbers work.
>
> - that registering a static number in the dynamic range won't conflict with
> a dynamic allocation.
>
> This last test verifies the bug fixed by commit 6d04d2b554b1 ("misc:
> misc_minor_alloc to use ida for all dynamic/misc dynamic minors") has not
> regressed.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo(a)igalia.com>
Thanks for your patch, which is now commit 74d8361be3441dff ("char:
misc: add test cases") in linus/master stable/master
> Changes in v5:
> - Make miscdevice unit test built-in only
> - Make unit test require CONFIG_KUNIT=y
Why were these changes made? This means the test is no longer available
if KUNIT=m, and I can no longer just load the module when I want to
run the test.
> - Link to v4: https://lore.kernel.org/r/20250423-misc-dynrange-v4-0-133b5ae4ca18@igalia.c…
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -2506,8 +2506,8 @@ config TEST_IDA
> tristate "Perform selftest on IDA functions"
>
> config TEST_MISC_MINOR
> - tristate "miscdevice KUnit test" if !KUNIT_ALL_TESTS
> - depends on KUNIT
> + bool "miscdevice KUnit test" if !KUNIT_ALL_TESTS
> + depends on KUNIT=y
> default KUNIT_ALL_TESTS
> help
> Kunit test for miscdevice API, specially its behavior in respect to
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert(a)linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
If WAIT_KILLABLE_RECV was specified, and an event is received, the
tracee's syscall is not supposed to be interruptible. This was not properly
ensured if the reply was sent too fast, and an interrupting signal was
received before the reply was processed on the tracee side.
This series fixes the bug and adds a test case for it to the selftests.
Signed-off-by: Johannes Nixdorf <johannes(a)nixdorf.dev>
---
Changes in v2:
- Added a selftest for the bug.
- Link to v1: https://lore.kernel.org/r/20250723-seccomp-races-v1-1-bef5667ce30a@nixdorf.…
---
Johannes Nixdorf (2):
seccomp: Fix a race with WAIT_KILLABLE_RECV if the tracer replies too fast
selftests/seccomp: Add a test for the WAIT_KILLABLE_RECV fast reply race
kernel/seccomp.c | 13 ++-
tools/testing/selftests/seccomp/seccomp_bpf.c | 130 ++++++++++++++++++++++++++
2 files changed, 136 insertions(+), 7 deletions(-)
---
base-commit: 89be9a83ccf1f88522317ce02f854f30d6115c41
change-id: 20250721-seccomp-races-e97897d6d94b
Best regards,
--
Johannes Nixdorf <johannes(a)nixdorf.dev>
Hi Linus,
Please pull this kselftest next update for Linux 6.17-rc1.
Fixes
- false failure of subsystem event test
- glob filter test to use mutex_unlock() instead of mutex_trylock()
- several spelling errors in tests
- test_kexec_jump build errors
- pidfd test duplicate-symbol warnings for SCHED_ CPP symbols
Adds a reliable check for suspend to breakpoints suspend test
Improvements to ipc test
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 19272b37aa4f83ca52bdf9c16d5d81bdd1354494:
Linux 6.16-rc1 (2025-06-08 13:44:43 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-next-6.17-rc1
for you to fetch changes up to 30fb5e134f05800dc424f8aa1d69841a6bdd9a54:
selftests/pidfd: Fix duplicate-symbol warnings for SCHED_ CPP symbols (2025-07-24 16:14:45 -0600)
----------------------------------------------------------------
linux_kselftest-next-6.17-rc1
Fixes
- false failure of subsystem event test
- glob filter test to use mutex_unlock() instead of mutex_trylock()
- several spelling errors in tests
- test_kexec_jump build errors
- pidfd test duplicate-symbol warnings for SCHED_ CPP symbols
Adds a reliable check for suspend to breakpoints suspend test
Improvements to ipc test
----------------------------------------------------------------
Ankit Chauhan (1):
selftests/ptrace: Fix spelling mistake "multible" -> "multiple"
Jihed Chaibi (1):
selftests/cpu-hotplug: fix typo in hotplaggable_offline_cpus function name
Masami Hiramatsu (Google) (1):
selftests: tracing: Use mutex_unlock for testing glob filter
Moon Hee Lee (2):
selftests: breakpoints: use suspend_stats to reliably check suspend success
selftests/kexec: fix test_kexec_jump build
Nick Huang (1):
selftests: ipc: Replace fail print statements with ksft_test_result_fail
Paul E. McKenney (1):
selftests/pidfd: Fix duplicate-symbol warnings for SCHED_ CPP symbols
Shuah Khan (1):
selftests: print installation complete message
Steven Rostedt (1):
selftests/tracing: Fix false failure of subsystem event test
Tianyi Cui (1):
selftests: Add version file to kselftest installation dir
tools/testing/selftests/Makefile | 8 ++++
.../breakpoints/step_after_suspend_test.c | 41 ++++++++++++++-----
.../selftests/cpu-hotplug/cpu-on-off-test.sh | 4 +-
.../ftrace/test.d/event/subsystem-enable.tc | 28 ++++++++++++-
.../ftrace/test.d/ftrace/func-filter-glob.tc | 2 +-
tools/testing/selftests/ipc/msgque.c | 47 +++++++++++-----------
tools/testing/selftests/kexec/Makefile | 2 +-
tools/testing/selftests/pidfd/pidfd.h | 9 +++++
tools/testing/selftests/ptrace/peeksiginfo.c | 2 +-
9 files changed, 102 insertions(+), 41 deletions(-)
----------------------------------------------------------------
Hi Linus,
Please pull the following kunit next update for Linux 6.17-rc1.
Corrects MODULE_IMPORT_NS() syntax documentation, makes kunit_test timeout
configurable via a module parameter and a Kconfig option, fixes longest
symbol length test, adds a test for static stub, and adjusts kunit_test
timeout based on test_{suite,case} speed.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 19272b37aa4f83ca52bdf9c16d5d81bdd1354494:
Linux 6.16-rc1 (2025-06-08 13:44:43 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest-kunit-6.17-rc1
for you to fetch changes up to 34db4fba81916a2001d7a503dfcf718c08ed5c42:
kunit: fix longest symbol length test (2025-07-10 14:02:07 -0600)
----------------------------------------------------------------
linux_kselftest-kunit-6.17-rc1
Corrects MODULE_IMPORT_NS() syntax documentation, makes kunit_test timeout
configurable via a module parameter and a Kconfig option, fixes longest
symbol length test, adds a test for static stub, and adjusts kunit_test
timeout based on test_{suite,case} speed.
----------------------------------------------------------------
Brian Norris (1):
Documentation: kunit: Correct MODULE_IMPORT_NS() syntax
Marie Zhussupova (1):
kunit: Make default kunit_test timeout configurable via both a module parameter and a Kconfig option
Sergio González Collado (1):
kunit: fix longest symbol length test
Tzung-Bi Shih (1):
kunit: Add test for static stub
Ujwal Jain (1):
kunit: Adjust kunit_test timeout based on test_{suite,case} speed
Documentation/dev-tools/kunit/usage.rst | 2 +-
include/kunit/try-catch.h | 1 +
lib/Kconfig.debug | 1 +
lib/kunit/Kconfig | 13 ++++++++
lib/kunit/kunit-test.c | 55 ++++++++++++++++++++++++++++++---
lib/kunit/test.c | 47 ++++++++++++++++++++++++++--
lib/kunit/try-catch-impl.h | 4 ++-
lib/kunit/try-catch.c | 29 ++---------------
lib/tests/longest_symbol_kunit.c | 3 +-
9 files changed, 118 insertions(+), 37 deletions(-)
----------------------------------------------------------------
Introduce SW acceleration for IPIP tunnels in the netfilter flowtable
infrastructure.
---
Changes in v5:
- Rely on __ipv4_addr_hash() to compute the hash used as encap ID
- Remove unnecessary pskb_may_pull() in nf_flow_tuple_encap()
- Add nf_flow_ip4_ecanp_pop utility routine
- Link to v4: https://lore.kernel.org/r/20250718-nf-flowtable-ipip-v4-0-f8bb1c18b986@kern…
Changes in v4:
- Use the hash value of the saddr, daddr and protocol of outer IP header as
encapsulation id.
- Link to v3: https://lore.kernel.org/r/20250703-nf-flowtable-ipip-v3-0-880afd319b9f@kern…
Changes in v3:
- Add outer IP header sanity checks
- target nf-next tree instead of net-next
- Link to v2: https://lore.kernel.org/r/20250627-nf-flowtable-ipip-v2-0-c713003ce75b@kern…
Changes in v2:
- Introduce IPIP flowtable selftest
- Link to v1: https://lore.kernel.org/r/20250623-nf-flowtable-ipip-v1-1-2853596e3941@kern…
---
Lorenzo Bianconi (2):
net: netfilter: Add IPIP flowtable SW acceleration
selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest
include/linux/netdevice.h | 1 +
net/ipv4/ipip.c | 28 +++++++++++
net/netfilter/nf_flow_table_ip.c | 56 +++++++++++++++++++++-
net/netfilter/nft_flow_offload.c | 1 +
.../selftests/net/netfilter/nft_flowtable.sh | 40 ++++++++++++++++
5 files changed, 124 insertions(+), 2 deletions(-)
---
base-commit: dd500e4aecf25e48e874ca7628697969df679493
change-id: 20250623-nf-flowtable-ipip-1b3d7b08d067
Best regards,
--
Lorenzo Bianconi <lorenzo(a)kernel.org>
Show precise rejected function when attaching fexit/fmod_ret to __noreturn
functions.
Add log for attaching tracing programs to functions in deny list.
Add selftest for attaching tracing programs to functions in deny list.
Migrate fexit_noreturns case into tracing_failure test suite.
changes:
v4:
- change tracing_deny case attaching function (Yonghong Song)
- add Acked-by: Yafang Shao and Yonghong Song
v3:
- add tracing_deny case into existing files (Alexei)
- migrate fexit_noreturns into tracing_failure
- change SOB
https://lore.kernel.org/bpf/20250722153434.20571-1-kafai.wan@linux.dev/
v2:
- change verifier log message (Alexei)
- add missing Suggested-by
https://lore.kernel.org/bpf/20250714120408.1627128-1-mannkafai@gmail.com/
v1:
https://lore.kernel.org/all/20250710162717.3808020-1-mannkafai@gmail.com/
---
KaFai Wan (4):
bpf: Show precise rejected function when attaching fexit/fmod_ret to
__noreturn functions
bpf: Add log for attaching tracing programs to functions in deny list
selftests/bpf: Add selftest for attaching tracing programs to
functions in deny list
selftests/bpf: Migrate fexit_noreturns case into tracing_failure test
suite
kernel/bpf/verifier.c | 5 +-
.../bpf/prog_tests/fexit_noreturns.c | 9 ----
.../bpf/prog_tests/tracing_failure.c | 52 +++++++++++++++++++
.../selftests/bpf/progs/fexit_noreturns.c | 15 ------
.../selftests/bpf/progs/tracing_failure.c | 12 +++++
5 files changed, 68 insertions(+), 25 deletions(-)
delete mode 100644 tools/testing/selftests/bpf/prog_tests/fexit_noreturns.c
delete mode 100644 tools/testing/selftests/bpf/progs/fexit_noreturns.c
--
2.43.0
[All the precursor patches are merged now and AMD/RISCV/VTD conversions
are written]
Currently each of the iommu page table formats duplicates all of the logic
to maintain the page table and perform map/unmap/etc operations. There are
several different versions of the algorithms between all the different
formats. The io-pgtable system provides an interface to help isolate the
page table code from the iommu driver, but doesn't provide tools to
implement the common algorithms.
This makes it very hard to improve the state of the pagetable code under
the iommu domains as any proposed improvement needs to alter a large
number of different driver code paths. Combined with a lack of software
based testing this makes improvement in this area very hard.
iommufd wants several new page table operations:
- More efficient map/unmap operations, using iommufd's batching logic
- unmap that returns the physical addresses into a batch as it progresses
- cut that allows splitting areas so large pages can have holes
poked in them dynamically (ie guestmemfd hitless shared/private
transitions)
- More agressive freeing of table memory to avoid waste
- Fragmenting large pages so that dirty tracking can be more granular
- Reassembling large pages so that VMs can run at full IO performance
in migration/dirty tracking error flows
- KHO integration for kernel live upgrade
Together these are algorithmically complex enough to be a very significant
task to go and implement in all the page table formats we support. Just
the "server" focused drivers use almost all the formats (ARMv8 S1&S2 / x86
PAE / AMDv1 / VT-D SS / RISCV)
Instead of doing the duplicated work, this series takes the first step to
consolidate the algorithms into one places. In spirit it is similar to the
work Christoph did a few years back to pull the redundant get_user_pages()
implementations out of the arch code into core MM. This unlocked a great
deal of improvement in that space in the following years. I would like to
see the same benefit in iommu as well.
My first RFC showed a bigger picture with all most all formats and more
algorithms. This series reorganizes that to be narrowly focused on just
enough to convert the AMD driver to use the new mechanism.
kunit tests are provided that allow good testing of the algorithms and all
formats on x86, nothing is arch specific.
AMD is one of the simpler options as the HW is quite uniform with few
different options/bugs while still requiring the complicated contiguous
pages support. The HW also has a very simple range based invalidation
approach that is easy to implement.
The AMD v1 and AMD v2 page table formats are implemented bit for bit
identical to the current code, tested using a compare kunit test that
checks against the io-pgtable version (on github, see below).
Updating the AMD driver to replace the io-pgtable layer with the new stuff
is fairly straightforward now. The layering is fixed up in the new version
so that all the invalidation goes through function pointers.
Several small fixing patches have come out of this as I've been fixing the
problems that the test suite uncovers in the current code, and
implementing the fixed version in iommupt.
On performance, there is a quite wide variety of implementation designs
across all the drivers. Looking at some key performance across
the main formats:
iommu_map():
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 53,66 , 51,63 , 19.19 (AMDV1)
256*2^12, 386,1909 , 367,1795 , 79.79
256*2^21, 362,1633 , 355,1556 , 77.77
2^12, 56,62 , 52,59 , 11.11 (AMDv2)
256*2^12, 405,1355 , 357,1292 , 72.72
256*2^21, 393,1160 , 358,1114 , 67.67
2^12, 55,65 , 53,62 , 14.14 (VTD second stage)
256*2^12, 391,518 , 332,512 , 35.35
256*2^21, 383,635 , 336,624 , 46.46
2^12, 57,65 , 55,63 , 12.12 (ARM 64 bit)
256*2^12, 380,389 , 361,369 , 2.02
256*2^21, 358,419 , 345,400 , 13.13
iommu_unmap():
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 69,88 , 65,85 , 23.23 (AMDv1)
256*2^12, 353,6498 , 331,6029 , 94.94
256*2^21, 373,6014 , 360,5706 , 93.93
2^12, 71,72 , 66,69 , 4.04 (AMDv2)
256*2^12, 228,891 , 206,871 , 76.76
256*2^21, 254,721 , 245,711 , 65.65
2^12, 69,87 , 65,82 , 20.20 (VTD second stage)
256*2^12, 210,321 , 200,315 , 36.36
256*2^21, 255,349 , 238,342 , 30.30
2^12, 72,77 , 68,74 , 8.08 (ARM 64 bit)
256*2^12, 521,357 , 447,346 , -29.29
256*2^21, 489,358 , 433,345 , -25.25
* Above numbers include additional patches to remove the iommu_pgsize()
overheads. gcc 13.3.0, i7-12700
This version provides fairly consistent performance across formats. ARM
unmap performance is quite different because this version supports
contiguous pages and uses a very different algorithm for unmapping. Though
why it is so worse compared to AMDv1 I haven't figured out yet.
The per-format commits include a more detailed chart.
There is a second branch:
https://github.com/jgunthorpe/linux/commits/iommu_pt_all
Containing supporting work and future steps:
- ARM short descriptor (32 bit), ARM long descriptor (64 bit) formats
- RISCV format and RISCV conversion
https://github.com/jgunthorpe/linux/commits/iommu_pt_riscv
- Support for a DMA incoherent HW page table walker
- VT-D second stage format and VT-D conversion
https://github.com/jgunthorpe/linux/commits/iommu_pt_vtd
- DART v1 & v2 format
- Draft of a iommufd 'cut' operation to break down huge pages
- A compare test that checks the iommupt formats against the iopgtable
interface, including updating AMD to have a working iopgtable and patches
to make VT-D have an iopgtable for testing.
- A performance test to micro-benchmark map and unmap against iogptable
My strategy is to go one by one for the drivers:
- AMD driver conversion
- RISCV page table and driver
- Intel VT-D driver and VTDSS page table
- Flushing improvements for RISCV
- ARM SMMUv3
And concurrently work on the algorithm side:
- debugfs content dump, like VT-D has
- Cut support
- Increase/Decrease page size support
- map/unmap batching
- KHO
As we make more algorithm improvements the value to convert the drivers
increases.
This is on github: https://github.com/jgunthorpe/linux/commits/iommu_pt
v2:
- Rebase on v6.16-rc2
- s/PT_ENTRY_WORD_SIZE/PT_ITEM_WORD_SIZE/s to follow the language better
- Comment and documentation updates
- Add PT_TOP_PHYS_MASK to help manage alignment restrictions on the top
pointer
- Add missed force_aperture = true
- Make pt_iommu_deinit() take care of the not-yet-inited error case
internally as AMD/RISCV/VTD all shared this logic
- Change gather_range() into gather_range_pages() so it also deals with
the page list. This makes the following cache flushing series simpler
- Fix missed update of unmap->unmapped in some error cases
- Change clear_contig() to order the gather more logically
- Remove goto from the error handling in __map_range_leaf()
- s/log2_/oalog2_/ in places where the argument is an oaddr_t
- Pass the pts to pt_table_install64/32()
- Do not use SIGN_EXTEND for the AMDv2 page table because of Vasant's
information on how PASID 0 works.
v1: https://patch.msgid.link/r/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com
- AMD driver only, many code changes
RFC: https://lore.kernel.org/all/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/
Alejandro Jimenez (1):
iommu/amd: Use the generic iommu page table
Jason Gunthorpe (14):
genpt: Generic Page Table base API
genpt: Add Documentation/ files
iommupt: Add the basic structure of the iommu implementation
iommupt: Add the AMD IOMMU v1 page table format
iommupt: Add iova_to_phys op
iommupt: Add unmap_pages op
iommupt: Add map_pages op
iommupt: Add read_and_clear_dirty op
iommupt: Add a kunit test for Generic Page Table
iommupt: Add a mock pagetable format for iommufd selftest to use
iommufd: Change the selftest to use iommupt instead of xarray
iommupt: Add the x86 64 bit page table format
iommu/amd: Remove AMD io_pgtable support
iommupt: Add a kunit test for the IOMMU implementation
.clang-format | 1 +
Documentation/driver-api/generic_pt.rst | 140 ++
Documentation/driver-api/index.rst | 1 +
drivers/iommu/Kconfig | 2 +
drivers/iommu/Makefile | 1 +
drivers/iommu/amd/Kconfig | 5 +-
drivers/iommu/amd/Makefile | 2 +-
drivers/iommu/amd/amd_iommu.h | 1 -
drivers/iommu/amd/amd_iommu_types.h | 109 +-
drivers/iommu/amd/io_pgtable.c | 560 --------
drivers/iommu/amd/io_pgtable_v2.c | 370 ------
drivers/iommu/amd/iommu.c | 516 ++++----
drivers/iommu/generic_pt/.kunitconfig | 13 +
drivers/iommu/generic_pt/Kconfig | 72 ++
drivers/iommu/generic_pt/fmt/Makefile | 26 +
drivers/iommu/generic_pt/fmt/amdv1.h | 409 ++++++
drivers/iommu/generic_pt/fmt/defs_amdv1.h | 21 +
drivers/iommu/generic_pt/fmt/defs_x86_64.h | 21 +
drivers/iommu/generic_pt/fmt/iommu_amdv1.c | 15 +
drivers/iommu/generic_pt/fmt/iommu_mock.c | 10 +
drivers/iommu/generic_pt/fmt/iommu_template.h | 48 +
drivers/iommu/generic_pt/fmt/iommu_x86_64.c | 11 +
drivers/iommu/generic_pt/fmt/x86_64.h | 248 ++++
drivers/iommu/generic_pt/iommu_pt.h | 1150 +++++++++++++++++
drivers/iommu/generic_pt/kunit_generic_pt.h | 717 ++++++++++
drivers/iommu/generic_pt/kunit_iommu.h | 183 +++
drivers/iommu/generic_pt/kunit_iommu_pt.h | 451 +++++++
drivers/iommu/generic_pt/pt_common.h | 354 +++++
drivers/iommu/generic_pt/pt_defs.h | 323 +++++
drivers/iommu/generic_pt/pt_fmt_defaults.h | 193 +++
drivers/iommu/generic_pt/pt_iter.h | 640 +++++++++
drivers/iommu/generic_pt/pt_log2.h | 130 ++
drivers/iommu/io-pgtable.c | 4 -
drivers/iommu/iommufd/Kconfig | 1 +
drivers/iommu/iommufd/iommufd_test.h | 11 +-
drivers/iommu/iommufd/selftest.c | 439 +++----
include/linux/generic_pt/common.h | 166 +++
include/linux/generic_pt/iommu.h | 270 ++++
include/linux/io-pgtable.h | 2 -
tools/testing/selftests/iommu/iommufd.c | 60 +-
tools/testing/selftests/iommu/iommufd_utils.h | 12 +
41 files changed, 6119 insertions(+), 1589 deletions(-)
create mode 100644 Documentation/driver-api/generic_pt.rst
delete mode 100644 drivers/iommu/amd/io_pgtable.c
delete mode 100644 drivers/iommu/amd/io_pgtable_v2.c
create mode 100644 drivers/iommu/generic_pt/.kunitconfig
create mode 100644 drivers/iommu/generic_pt/Kconfig
create mode 100644 drivers/iommu/generic_pt/fmt/Makefile
create mode 100644 drivers/iommu/generic_pt/fmt/amdv1.h
create mode 100644 drivers/iommu/generic_pt/fmt/defs_amdv1.h
create mode 100644 drivers/iommu/generic_pt/fmt/defs_x86_64.h
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_amdv1.c
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_mock.c
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_template.h
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_x86_64.c
create mode 100644 drivers/iommu/generic_pt/fmt/x86_64.h
create mode 100644 drivers/iommu/generic_pt/iommu_pt.h
create mode 100644 drivers/iommu/generic_pt/kunit_generic_pt.h
create mode 100644 drivers/iommu/generic_pt/kunit_iommu.h
create mode 100644 drivers/iommu/generic_pt/kunit_iommu_pt.h
create mode 100644 drivers/iommu/generic_pt/pt_common.h
create mode 100644 drivers/iommu/generic_pt/pt_defs.h
create mode 100644 drivers/iommu/generic_pt/pt_fmt_defaults.h
create mode 100644 drivers/iommu/generic_pt/pt_iter.h
create mode 100644 drivers/iommu/generic_pt/pt_log2.h
create mode 100644 include/linux/generic_pt/common.h
create mode 100644 include/linux/generic_pt/iommu.h
base-commit: cd76b0248a38645a3e3f8ca4a48bffc591e9da19
--
2.43.0
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v14 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best Regards,
Chia-Yu
---
v14 (22-Jul-2025)
- Add missing const for struct tcp_sock of tcp_accecn_option_beacon_check() of #11 (Simon Horman <horms(a)kernel.org>)
v13 (18-Jul-2025)
- Implement tcp_accecn_extract_syn_ect() and tcp_accecn_reflector_flags() with static array lookup of patch #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Fix typos in comments of #6 and remove patch #7 of v12 about simulatenous connect (Paolo Abeni <pabeni(a)redhat.com>)
- Move TCP_ACCECN_E1B_INIT_OFFSET, TCP_ACCECN_E0B_INIT_OFFSET, and TCP_ACCECN_CEB_INIT_OFFSET from patch #7 to #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use static array lookup in tcp_accecn_optfield_to_ecnfield() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return false when WARN_ON_ONCE() is true in tcp_accecn_process_option() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Make synack_ecn_bytes as static const array and use const u32 pointer in tcp_options_write() of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use ALIGN() and ALIGN_DOWN() in tcp_options_fit_accecn() to pad TCP AccECN option to dword of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return TCP_ACCECN_OPT_FAIL_SEEN if WARN_ON_ONCE() is true in tcp_accecn_option_init() of #12 (Paolo Abeni <pabeni(a)redhat.com>)
v12 (04-Jul-2025)
- Fix compilation issues with some intermediate patches in v11
- Add more comments for AccECN helpers of tcp_ecn.h
v11 (03-Jul-2025)
- Fix compilation issues with some intermediate patches in v10
v10 (02-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() over every reads of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)
v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEE flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accruate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restruct the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)
v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)
v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results
v6 (09-May-2025)
- Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
---
Chia-Yu Chang (5):
tcp: reorganize tcp_sock_write_txrx group for variables later
tcp: ecn functions in separated include file
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (9):
tcp: reorganize SYN ECN code
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics
Documentation/networking/ip-sysctl.rst | 55 +-
.../networking/net_cachelines/tcp_sock.rst | 12 +
include/linux/tcp.h | 32 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 87 ++-
include/net/tcp_ecn.h | 649 ++++++++++++++++++
include/uapi/linux/tcp.h | 7 +
net/ipv4/syncookies.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 28 +-
net/ipv4/tcp_input.c | 353 ++++++++--
net/ipv4/tcp_ipv4.c | 8 +-
net/ipv4/tcp_minisocks.c | 40 +-
net/ipv4/tcp_output.c | 294 ++++++--
net/ipv6/syncookies.c | 2 +
net/ipv6/tcp_ipv6.c | 1 +
16 files changed, 1409 insertions(+), 184 deletions(-)
create mode 100644 include/net/tcp_ecn.h
--
2.34.1
Although setup_ns() set net.ipv4.conf.default.rp_filter=0,
loading certain module such as ipip will automatically create a tunl0 interface
in all netns including new created ones. In the script, this is before than
default.rp_filter=0 applied, as a result tunl0.rp_filter remains set to 1
which causes the test report FAIL when ipip module is preloaded.
Before fix:
Testing DR mode...
Testing NAT mode...
Testing Tunnel mode...
ipvs.sh: FAIL
After fix:
Testing DR mode...
Testing NAT mode...
Testing Tunnel mode...
ipvs.sh: PASS
Fixes: 7c8b89ec506e ("selftests: netfilter: remove rp_filter configuration")
v2: Fixed the format of Fixes tag.
Signed-off-by: Yi Chen <yiche(a)redhat.com>
---
tools/testing/selftests/net/netfilter/ipvs.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/netfilter/ipvs.sh b/tools/testing/selftests/net/netfilter/ipvs.sh
index 6af2ea3ad6b8..9c9d5b38ab71 100755
--- a/tools/testing/selftests/net/netfilter/ipvs.sh
+++ b/tools/testing/selftests/net/netfilter/ipvs.sh
@@ -151,7 +151,7 @@ test_nat() {
test_tun() {
ip netns exec "${ns0}" ip route add "${vip_v4}" via "${gip_v4}" dev br0
- ip netns exec "${ns1}" modprobe -q ipip
+ modprobe -q ipip
ip netns exec "${ns1}" ip link set tunl0 up
ip netns exec "${ns1}" sysctl -qw net.ipv4.ip_forward=0
ip netns exec "${ns1}" sysctl -qw net.ipv4.conf.all.send_redirects=0
@@ -160,10 +160,10 @@ test_tun() {
ip netns exec "${ns1}" ipvsadm -a -i -t "${vip_v4}:${port}" -r ${rip_v4}:${port}
ip netns exec "${ns1}" ip addr add ${vip_v4}/32 dev lo:1
- ip netns exec "${ns2}" modprobe -q ipip
ip netns exec "${ns2}" ip link set tunl0 up
ip netns exec "${ns2}" sysctl -qw net.ipv4.conf.all.arp_ignore=1
ip netns exec "${ns2}" sysctl -qw net.ipv4.conf.all.arp_announce=2
+ ip netns exec "${ns2}" sysctl -qw net.ipv4.conf.tunl0.rp_filter=0
ip netns exec "${ns2}" ip addr add "${vip_v4}/32" dev lo:1
test_service
--
2.50.1
Currently, UDP exchange is prone to failure when cmd attempt to send data
while socat in bkg is not ready. Since, the behavior is probabilistic, this
can result in flakiness for XDP tests. While testing
test_xdp_native_tx_mb() on netdevsim, a failure rate of around 1% in 500
500 iterations was observed.
Use wait_port_listen() to ensure that the bkg socat is started and ready to
receive before cmd start sending. With proposed changes, a re-run of the
same test passed 100% of time.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr(a)gmail.com>
---
tools/testing/selftests/drivers/net/xdp.py | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/xdp.py b/tools/testing/selftests/drivers/net/xdp.py
index 887d662ad128..1dd8bf3bf6c9 100755
--- a/tools/testing/selftests/drivers/net/xdp.py
+++ b/tools/testing/selftests/drivers/net/xdp.py
@@ -13,7 +13,7 @@ from enum import Enum
from lib.py import ksft_run, ksft_exit, ksft_eq, ksft_ne, ksft_pr
from lib.py import KsftFailEx, NetDrvEpEnv, EthtoolFamily, NlError
-from lib.py import bkg, cmd, rand_port
+from lib.py import bkg, cmd, rand_port, wait_port_listen
from lib.py import ip, bpftool, defer
@@ -70,6 +70,7 @@ def _exchg_udp(cfg, port, test_string):
tx_udp_cmd = f"echo -n {test_string} | socat -t 2 -u STDIN UDP:{cfg.baddr}:{port}"
with bkg(rx_udp_cmd, exit_wait=True) as nc:
+ wait_port_listen(port, proto="udp")
cmd(tx_udp_cmd, host=cfg.remote, shell=True)
return nc.stdout.strip()
@@ -310,6 +311,7 @@ def test_xdp_native_tx_mb(cfg):
tx_udp = f"echo {test_string} | socat -t 2 -u STDIN UDP:{cfg.baddr}:{port}"
with bkg(rx_udp, host=cfg.remote, exit_wait=True) as rnc:
+ wait_port_listen(port, proto="udp", host=cfg.remote)
cmd(tx_udp, host=cfg.remote, shell=True)
stats = _get_stats(prog_info['maps']['map_xdp_stats'])
--
2.47.3
It is currently impossible to enable ipv6 forwarding on a per-interface
basis like in ipv4. To enable forwarding on an ipv6 interface we need to
enable it on all interfaces and disable it on the other interfaces using
a netfilter rule. This is especially cumbersome if you have lots of
interfaces and only want to enable forwarding on a few. According to the
sysctl docs [0] the `net.ipv6.conf.all.forwarding` enables forwarding
for all interfaces, while the interface-specific
`net.ipv6.conf.<interface>.forwarding` configures the interface
Host/Router configuration.
Introduce a new sysctl flag `force_forwarding`, which can be set on every
interface. The ip6_forwarding function will then check if the global
forwarding flag OR the force_forwarding flag is active and forward the
packet.
To preserve backwards-compatibility reset the flag (on all interfaces)
to 0 if the net.ipv6.conf.all.forwarding flag is set to 0.
Add a short selftest that checks if a packet gets forwarded with and
without `force_forwarding`.
[0]: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
Acked-by: Nicolas Dichtel <nicolas.dichtel(a)6wind.com>
Signed-off-by: Gabriel Goller <g.goller(a)proxmox.com>
---
v7:
* rebase
* fix typos in commit message
v6: https://lore.kernel.org/netdev/20250711124243.526735-1-g.goller@proxmox.com/
* rebase
* remove brackets around single line
* add 'nodad' to addresses in selftest to avoid sporadic failures
v5: https://lore.kernel.org/netdev/20250707094307.223975-1-g.goller@proxmox.com/
* update conf/all/forwarding docs
* simplified backwards-compat comment
* remove ASSERT_RTNL as it's guaranteed by __in6_dev_get_rtnl_net()
already
* cange ip6_forward logic so that it doesn't depend on the idev
existing
* move WRITE_ONCE inside device lock
v4: https://lore.kernel.org/netdev/20250703160154.560239-1-g.goller@proxmox.com/
* actually write the sysctl value to the table
* use ASSERT_RTNL() when forwarding the sysctl change
* remove useless comments in function body
* simplify forwarding and force_forwarding check in ip6_output.c
* fix code backticks in Documentation (double instead of single)
* add selftests
v3: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/
* remove forwarding=0 setting force_forwarding=0 globally.
* add min and max (0 and 1) value to sysctl.
v2: https://lore.kernel.org/netdev/20250701140423.487411-1-g.goller@proxmox.com/
* rename from `do_forwarding` to `force_forwarding`.
* add global `force_forwarding` flag which will enable
`force_forwarding` on every interface like the
`ipv4.all.forwarding` flag.
* `forwarding`=0 will disable global and per-interface
`force_forwarding`.
* export option as NETCONFA_FORCE_FORWARDING.
v1: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/
Documentation/networking/ip-sysctl.rst | 8 +-
include/linux/ipv6.h | 1 +
include/uapi/linux/ipv6.h | 1 +
include/uapi/linux/netconf.h | 1 +
include/uapi/linux/sysctl.h | 1 +
net/ipv6/addrconf.c | 82 ++++++++++++++
net/ipv6/ip6_output.c | 3 +-
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/ipv6_force_forwarding.sh | 105 ++++++++++++++++++
9 files changed, 200 insertions(+), 3 deletions(-)
create mode 100755 tools/testing/selftests/net/ipv6_force_forwarding.sh
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 14700ea77e75..bb620f554598 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -2543,8 +2543,8 @@ conf/all/disable_ipv6 - BOOLEAN
conf/all/forwarding - BOOLEAN
Enable global IPv6 forwarding between all interfaces.
- IPv4 and IPv6 work differently here; e.g. netfilter must be used
- to control which interfaces may forward packets and which not.
+ IPv4 and IPv6 work differently here; the ``force_forwarding`` flag must
+ be used to control which interfaces may forward packets.
This also sets all interfaces' Host/Router setting
'forwarding' to the specified value. See below for details.
@@ -2561,6 +2561,10 @@ proxy_ndp - BOOLEAN
Default: 0 (disabled)
+force_forwarding - BOOLEAN
+ Enable forwarding on this interface only -- regardless of the setting on
+ ``conf/all/forwarding``. When setting ``conf.all.forwarding`` to 0,
+ the ``force_forwarding`` flag will be reset on all interfaces.
fwmark_reflect - BOOLEAN
Controls the fwmark of kernel-generated IPv6 reply packets that are not
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index db0eb0d86b64..bc6ec2959173 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -17,6 +17,7 @@ struct ipv6_devconf {
__s32 hop_limit;
__s32 mtu6;
__s32 forwarding;
+ __s32 force_forwarding;
__s32 disable_policy;
__s32 proxy_ndp;
__cacheline_group_end(ipv6_devconf_read_txrx);
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index cf592d7b630f..d4d3ae774b26 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -199,6 +199,7 @@ enum {
DEVCONF_NDISC_EVICT_NOCARRIER,
DEVCONF_ACCEPT_UNTRACKED_NA,
DEVCONF_ACCEPT_RA_MIN_LFT,
+ DEVCONF_FORCE_FORWARDING,
DEVCONF_MAX
};
diff --git a/include/uapi/linux/netconf.h b/include/uapi/linux/netconf.h
index fac4edd55379..1c8c84d65ae3 100644
--- a/include/uapi/linux/netconf.h
+++ b/include/uapi/linux/netconf.h
@@ -19,6 +19,7 @@ enum {
NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
NETCONFA_INPUT,
NETCONFA_BC_FORWARDING,
+ NETCONFA_FORCE_FORWARDING,
__NETCONFA_MAX
};
#define NETCONFA_MAX (__NETCONFA_MAX - 1)
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 8981f00204db..63d1464cb71c 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -573,6 +573,7 @@ enum {
NET_IPV6_ACCEPT_RA_FROM_LOCAL=26,
NET_IPV6_ACCEPT_RA_RT_INFO_MIN_PLEN=27,
NET_IPV6_RA_DEFRTR_METRIC=28,
+ NET_IPV6_FORCE_FORWARDING=29,
__NET_IPV6_MAX
};
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4f1d7d110302..81a067a2e526 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -239,6 +239,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = {
.ndisc_evict_nocarrier = 1,
.ra_honor_pio_life = 0,
.ra_honor_pio_pflag = 0,
+ .force_forwarding = 0,
};
static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
@@ -303,6 +304,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
.ndisc_evict_nocarrier = 1,
.ra_honor_pio_life = 0,
.ra_honor_pio_pflag = 0,
+ .force_forwarding = 0,
};
/* Check if link is ready: is it up and is a valid qdisc available */
@@ -857,6 +859,9 @@ static void addrconf_forward_change(struct net *net, __s32 newf)
idev = __in6_dev_get_rtnl_net(dev);
if (idev) {
int changed = (!idev->cnf.forwarding) ^ (!newf);
+ /* Disabling all.forwarding sets 0 to force_forwarding for all interfaces */
+ if (newf == 0)
+ WRITE_ONCE(idev->cnf.force_forwarding, 0);
WRITE_ONCE(idev->cnf.forwarding, newf);
if (changed)
@@ -5710,6 +5715,7 @@ static void ipv6_store_devconf(const struct ipv6_devconf *cnf,
array[DEVCONF_ACCEPT_UNTRACKED_NA] =
READ_ONCE(cnf->accept_untracked_na);
array[DEVCONF_ACCEPT_RA_MIN_LFT] = READ_ONCE(cnf->accept_ra_min_lft);
+ array[DEVCONF_FORCE_FORWARDING] = READ_ONCE(cnf->force_forwarding);
}
static inline size_t inet6_ifla6_size(void)
@@ -6738,6 +6744,75 @@ static int addrconf_sysctl_disable_policy(const struct ctl_table *ctl, int write
return ret;
}
+static void addrconf_force_forward_change(struct net *net, __s32 newf)
+{
+ struct net_device *dev;
+ struct inet6_dev *idev;
+
+ for_each_netdev(net, dev) {
+ idev = __in6_dev_get_rtnl_net(dev);
+ if (idev) {
+ int changed = (!idev->cnf.force_forwarding) ^ (!newf);
+
+ WRITE_ONCE(idev->cnf.force_forwarding, newf);
+ if (changed)
+ inet6_netconf_notify_devconf(dev_net(dev), RTM_NEWNETCONF,
+ NETCONFA_FORCE_FORWARDING,
+ dev->ifindex, &idev->cnf);
+ }
+ }
+}
+
+static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct inet6_dev *idev = ctl->extra1;
+ struct ctl_table tmp_ctl = *ctl;
+ struct net *net = ctl->extra2;
+ int *valp = ctl->data;
+ int new_val = *valp;
+ int old_val = *valp;
+ loff_t pos = *ppos;
+ int ret;
+
+ tmp_ctl.extra1 = SYSCTL_ZERO;
+ tmp_ctl.extra2 = SYSCTL_ONE;
+ tmp_ctl.data = &new_val;
+
+ ret = proc_douintvec_minmax(&tmp_ctl, write, buffer, lenp, ppos);
+
+ if (write && old_val != new_val) {
+ if (!rtnl_net_trylock(net))
+ return restart_syscall();
+
+ WRITE_ONCE(*valp, new_val);
+
+ if (valp == &net->ipv6.devconf_dflt->force_forwarding) {
+ inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
+ NETCONFA_FORCE_FORWARDING,
+ NETCONFA_IFINDEX_DEFAULT,
+ net->ipv6.devconf_dflt);
+ } else if (valp == &net->ipv6.devconf_all->force_forwarding) {
+ inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
+ NETCONFA_FORCE_FORWARDING,
+ NETCONFA_IFINDEX_ALL,
+ net->ipv6.devconf_all);
+
+ addrconf_force_forward_change(net, new_val);
+ } else {
+ inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
+ NETCONFA_FORCE_FORWARDING,
+ idev->dev->ifindex,
+ &idev->cnf);
+ }
+ rtnl_net_unlock(net);
+ }
+
+ if (ret)
+ *ppos = pos;
+ return ret;
+}
+
static int minus_one = -1;
static const int two_five_five = 255;
static u32 ioam6_if_id_max = U16_MAX;
@@ -7208,6 +7283,13 @@ static const struct ctl_table addrconf_sysctl[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_TWO,
},
+ {
+ .procname = "force_forwarding",
+ .data = &ipv6_devconf.force_forwarding,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = addrconf_sysctl_force_forwarding,
+ },
};
static int __addrconf_sysctl_register(struct net *net, char *dev_name,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 0412f8544695..1e1410237b6e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -511,7 +511,8 @@ int ip6_forward(struct sk_buff *skb)
u32 mtu;
idev = __in6_dev_get_safely(dev_get_by_index_rcu(net, IP6CB(skb)->iif));
- if (READ_ONCE(net->ipv6.devconf_all->forwarding) == 0)
+ if (!READ_ONCE(net->ipv6.devconf_all->forwarding) &&
+ (!idev || !READ_ONCE(idev->cnf.force_forwarding)))
goto error;
if (skb->pkt_type != PACKET_HOST)
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 13e2678d418b..b31a71f2b372 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -116,6 +116,7 @@ TEST_GEN_FILES += skf_net_off
TEST_GEN_FILES += tfo
TEST_PROGS += tfo_passive.sh
TEST_PROGS += broadcast_pmtu.sh
+TEST_PROGS += ipv6_force_forwarding.sh
# YNL files, must be before "include ..lib.mk"
YNL_GEN_FILES := busy_poller netlink-dumps
diff --git a/tools/testing/selftests/net/ipv6_force_forwarding.sh b/tools/testing/selftests/net/ipv6_force_forwarding.sh
new file mode 100755
index 000000000000..bf0243366caa
--- /dev/null
+++ b/tools/testing/selftests/net/ipv6_force_forwarding.sh
@@ -0,0 +1,105 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test IPv6 force_forwarding interface property
+#
+# This test verifies that the force_forwarding property works correctly:
+# - When global forwarding is disabled, packets are not forwarded normally
+# - When force_forwarding is enabled on an interface, packets are forwarded
+# regardless of the global forwarding setting
+
+source lib.sh
+
+cleanup() {
+ cleanup_ns $ns1 $ns2 $ns3
+}
+
+trap cleanup EXIT
+
+setup_test() {
+ # Create three namespaces: sender, router, receiver
+ setup_ns ns1 ns2 ns3
+
+ # Create veth pairs: ns1 <-> ns2 <-> ns3
+ ip link add name veth12 type veth peer name veth21
+ ip link add name veth23 type veth peer name veth32
+
+ # Move interfaces to namespaces
+ ip link set veth12 netns $ns1
+ ip link set veth21 netns $ns2
+ ip link set veth23 netns $ns2
+ ip link set veth32 netns $ns3
+
+ # Configure interfaces
+ ip -n $ns1 addr add 2001:db8:1::1/64 dev veth12 nodad
+ ip -n $ns2 addr add 2001:db8:1::2/64 dev veth21 nodad
+ ip -n $ns2 addr add 2001:db8:2::1/64 dev veth23 nodad
+ ip -n $ns3 addr add 2001:db8:2::2/64 dev veth32 nodad
+
+ # Bring up interfaces
+ ip -n $ns1 link set veth12 up
+ ip -n $ns2 link set veth21 up
+ ip -n $ns2 link set veth23 up
+ ip -n $ns3 link set veth32 up
+
+ # Add routes
+ ip -n $ns1 route add 2001:db8:2::/64 via 2001:db8:1::2
+ ip -n $ns3 route add 2001:db8:1::/64 via 2001:db8:2::1
+
+ # Disable global forwarding
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.all.forwarding=0
+}
+
+test_force_forwarding() {
+ local ret=0
+
+ echo "TEST: force_forwarding functionality"
+
+ # Check if force_forwarding sysctl exists
+ if ! ip netns exec $ns2 test -f /proc/sys/net/ipv6/conf/veth21/force_forwarding; then
+ echo "SKIP: force_forwarding not available"
+ return $ksft_skip
+ fi
+
+ # Test 1: Without force_forwarding, ping should fail
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=0
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=0
+
+ if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then
+ echo "FAIL: ping succeeded when forwarding disabled"
+ ret=1
+ else
+ echo "PASS: forwarding disabled correctly"
+ fi
+
+ # Test 2: With force_forwarding enabled, ping should succeed
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=1
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=1
+
+ if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then
+ echo "PASS: force_forwarding enabled forwarding"
+ else
+ echo "FAIL: ping failed with force_forwarding enabled"
+ ret=1
+ fi
+
+ return $ret
+}
+
+echo "IPv6 force_forwarding test"
+echo "=========================="
+
+setup_test
+test_force_forwarding
+ret=$?
+
+if [ $ret -eq 0 ]; then
+ echo "OK"
+ exit 0
+elif [ $ret -eq $ksft_skip ]; then
+ echo "SKIP"
+ exit $ksft_skip
+else
+ echo "FAIL"
+ exit 1
+fi
--
2.39.5
Extend the `check_for_dependencies()` function in `lib_netcons.sh` to check
whether IPv6 is enabled by verifying the existence of
`/proc/net/if_inet6`. Having IPv6 is a now a dependency of netconsole
tests. If the file does not exist, the script will skip the test with an
appropriate message suggesting to verify if `CONFIG_IPV6` is enabled.
This prevents the test to misbehave if IPv6 is not configured.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh b/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh
index 258af805497b4..b6071e80ebbb6 100644
--- a/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh
+++ b/tools/testing/selftests/drivers/net/lib/sh/lib_netcons.sh
@@ -281,6 +281,11 @@ function check_for_dependencies() {
exit "${ksft_skip}"
fi
+ if [ ! -f /proc/net/if_inet6 ]; then
+ echo "SKIP: IPv6 not configured. Check if CONFIG_IPV6 is enabled" >&2
+ exit "${ksft_skip}"
+ fi
+
if [ ! -f "${NSIM_DEV_SYS_NEW}" ]; then
echo "SKIP: file ${NSIM_DEV_SYS_NEW} does not exist. Check if CONFIG_NETDEVSIM is enabled" >&2
exit "${ksft_skip}"
---
base-commit: dd500e4aecf25e48e874ca7628697969df679493
change-id: 20250723-netcons_test_ipv6-15b1b76bb231
Best regards,
--
Breno Leitao <leitao(a)debian.org>
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when anonymous mapping remap may or may not be
mergeable depending on whether VMAs have or have not been faulted due to
anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this can result in surprising impact where a moved region is faulted,
then moved back and a user fails to observe a merge from otherwise
compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations has
resulted in VMA fragmentation.
In order to do this, this series performs a large amount of refactoring,
most pertinently - grouping sanity checks together, separately those that
check input parameters and those relating to VMAs.
we also simplify the post-mmap lock drop processing for uffd and mlock()'d
VMAs.
With this done, we can then fairly straightforwardly implement this
functionality.
This works exclusively for mremap() invocations which specify
MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
notification of the userland fault handler would require us to drop the
mmap lock.
It is also not compatible with file-backed mappings with customised
get_unmapped_area() handlers as these may not honour MREMAP_FIXED.
The input and output addresses ranges must not overlap. We carefully
account for moves which would result in VMA iterator invalidation.
While there can be gaps between VMAs in the input range, there can be no
gap before the first VMA in the range.
v3:
* Disallowed move operation except for MREMAP_FIXED.
* Disallow gap at start of aggregate range to avoid confusion.
* Disallow any file-baked VMAs with custom get_unmapped_area.
* Renamed multi_vma to seen_vma to be clearer. Stop reusing new_addr, use
separate target_addr var to track next target address.
* Check if first VMA fails multi VMA check, if so we'll allow one VMA but
not multiple.
* Updated the commit message for patch 9 to be clearer about gap behaviour.
* Removed accidentally included debug goto statement in test (doh!). Test
was and is passing regardless.
* Unmap target range in test, previously we ended up moving additional VMAs
unintentionally. This still all passed :) but was not what was intended.
* Removed self-merge check - there is absolutely no way this can happen
across multiple VMAs, as there is no means of moving VMAs such that a VMA
merges with itself.
v2:
* Squashed uffd stub fix into series.
* Propagated tags, thanks!
* Fixed param naming in patch 4 as per Vlastimil.
* Renamed vma_reset to vmi_needs_reset + dropped reset on unmap as per
Liam.
* Correctly return -EFAULT if no VMAs in input range.
* Account for get_unmapped_area() disregarding MAP_FIXED and returning an
altered address.
* Added additional explanatatory comment to the remap_move() function.
https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/
v1:
https://lore.kernel.org/all/cover.1751865330.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (10):
mm/mremap: perform some simple cleanups
mm/mremap: refactor initial parameter sanity checks
mm/mremap: put VMA check and prep logic into helper function
mm/mremap: cleanup post-processing stage of mremap
mm/mremap: use an explicit uffd failure path for mremap
mm/mremap: check remap conditions earlier
mm/mremap: move remap_is_valid() into check_prep_vma()
mm/mremap: clean up mlock populate behaviour
mm/mremap: permit mremap() move of multiple VMAs
tools/testing/selftests: extend mremap_test to test multi-VMA mremap
fs/userfaultfd.c | 15 +-
include/linux/userfaultfd_k.h | 5 +
mm/mremap.c | 553 +++++++++++++++--------
tools/testing/selftests/mm/mremap_test.c | 146 +++++-
4 files changed, 518 insertions(+), 201 deletions(-)
--
2.50.0
Cosmin reports the following locking issue:
# BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:275
# dump_stack_lvl+0x4f/0x60
# __might_resched+0xeb/0x140
# mutex_lock+0x1a/0x40
# dev_set_promiscuity+0x26/0x90
# __dev_set_promiscuity+0x85/0x170
# __dev_set_rx_mode+0x69/0xa0
# dev_uc_add+0x6d/0x80
# vlan_dev_open+0x5f/0x120 [8021q]
# __dev_open+0x10c/0x2a0
# __dev_change_flags+0x1a4/0x210
# netif_change_flags+0x22/0x60
# do_setlink.isra.0+0xdb0/0x10f0
# rtnl_newlink+0x797/0xb00
# rtnetlink_rcv_msg+0x1cb/0x3f0
# netlink_rcv_skb+0x53/0x100
# netlink_unicast+0x273/0x3b0
# netlink_sendmsg+0x1f2/0x430
Which is similar to recent syzkaller reports in [0] and [1] and triggers
because macsec does not advertise IFF_UNICAST_FLT although it has proper
ndo_set_rx_mode callback that takes care of pushing uc/mc addresses
down to the real device.
In general, dev_uc_add call path is problematic for stacking
non-IFF_UNICAST_FLT because we might grab netdev instance lock under
addr_list_lock spinlock, so this is not a systemic fix.
0: https://lore.kernel.org/netdev/686d55b4.050a0220.1ffab7.0014.GAE@google.com
1: https://lore.kernel.org/netdev/68712acf.a00a0220.26a83e.0051.GAE@google.com/
Reviewed-by: Simon Horman <horms(a)kernel.org>
Tested-by: Simon Horman <horms(a)kernel.org>
Link: https://lore.kernel.org/netdev/2aff4342b0f5b1539c02ffd8df4c7e58dd9746e7.cam…
Fixes: 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations")
Reported-by: Cosmin Ratiu <cratiu(a)nvidia.com>
Tested-by: Cosmin Ratiu <cratiu(a)nvidia.com>
Signed-off-by: Stanislav Fomichev <sdf(a)fomichev.me>
---
drivers/net/macsec.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 7edbe76b5455..4c75d1fea552 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -3868,7 +3868,7 @@ static void macsec_setup(struct net_device *dev)
ether_setup(dev);
dev->min_mtu = 0;
dev->max_mtu = ETH_MAX_MTU;
- dev->priv_flags |= IFF_NO_QUEUE;
+ dev->priv_flags |= IFF_NO_QUEUE | IFF_UNICAST_FLT;
dev->netdev_ops = &macsec_netdev_ops;
dev->needs_free_netdev = true;
dev->priv_destructor = macsec_free_netdev;
--
2.50.1
This patchset fixes an issue where bonding fails to establish a stable LACP
negotiation when operating in passive mode (lacp_active=off).
In passive mode, the current implementation only replies when the partner's
state changes, which results in LACP timeout and unstable aggregator formation.
With this change, the bond responds to each received LACPDU in passive mode
by setting ntt = true, ensuring timely replies and stable LACP negotiation.
Hangbin Liu (2):
bonding: update ntt to true in passive mode
selftests: bonding: add test for passive LACP mode
drivers/net/bonding/bond_3ad.c | 6 ++
.../drivers/net/bonding/bond_passive_lacp.sh | 21 +++++
.../drivers/net/bonding/bond_topo_lacp.sh | 77 +++++++++++++++++++
3 files changed, 104 insertions(+)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_passive_lacp.sh
create mode 100644 tools/testing/selftests/drivers/net/bonding/bond_topo_lacp.sh
--
2.46.0
This patch series makes the selftest work with NV enabled. The guest code
is run in vEL2 instead of EL1. We add a command line option to enable
testing of NV. The NV tests are disabled by default.
Modified around 12 selftests in this series.
Changes since v1:
- Updated NV helper functions as per comments [1].
- Modified existing testscases to run guest code in vEL2.
[1] https://lkml.iu.edu/hypermail/linux/kernel/2502.0/07001.html
Ganapatrao Kulkarni (9):
KVM: arm64: nv: selftests: Add support to run guest code in vEL2.
KVM: arm64: nv: selftests: Add simple test to run guest code in vEL2
KVM: arm64: nv: selftests: Enable hypervisor timer tests to run in
vEL2
KVM: arm64: nv: selftests: enable aarch32_id_regs test to run in vEL2
KVM: arm64: nv: selftests: Enable vgic tests to run in vEL2
KVM: arm64: nv: selftests: Enable set_id_regs test to run in vEL2
KVM: arm64: nv: selftests: Enable test to run in vEL2
KVM: selftests: arm64: Extend kvm_page_table_test to run guest code in
vEL2
KVM: arm64: nv: selftests: Enable page_fault_test test to run in vEL2
tools/testing/selftests/kvm/Makefile.kvm | 2 +
tools/testing/selftests/kvm/arch_timer.c | 8 +-
.../selftests/kvm/arm64/aarch32_id_regs.c | 34 ++++-
.../testing/selftests/kvm/arm64/arch_timer.c | 118 +++++++++++++++---
.../selftests/kvm/arm64/nv_guest_hypervisor.c | 68 ++++++++++
.../selftests/kvm/arm64/page_fault_test.c | 35 +++++-
.../testing/selftests/kvm/arm64/set_id_regs.c | 57 ++++++++-
tools/testing/selftests/kvm/arm64/vgic_init.c | 54 +++++++-
tools/testing/selftests/kvm/arm64/vgic_irq.c | 27 ++--
.../selftests/kvm/arm64/vgic_lpi_stress.c | 19 ++-
.../testing/selftests/kvm/guest_print_test.c | 32 +++++
.../selftests/kvm/include/arm64/arch_timer.h | 16 +++
.../kvm/include/arm64/kvm_util_arch.h | 3 +
.../selftests/kvm/include/arm64/nv_util.h | 45 +++++++
.../selftests/kvm/include/arm64/vgic.h | 1 +
.../testing/selftests/kvm/include/kvm_util.h | 3 +
.../selftests/kvm/include/timer_test.h | 1 +
.../selftests/kvm/kvm_page_table_test.c | 30 ++++-
tools/testing/selftests/kvm/lib/arm64/nv.c | 46 +++++++
.../selftests/kvm/lib/arm64/processor.c | 61 ++++++---
tools/testing/selftests/kvm/lib/arm64/vgic.c | 8 ++
21 files changed, 604 insertions(+), 64 deletions(-)
create mode 100644 tools/testing/selftests/kvm/arm64/nv_guest_hypervisor.c
create mode 100644 tools/testing/selftests/kvm/include/arm64/nv_util.h
create mode 100644 tools/testing/selftests/kvm/lib/arm64/nv.c
--
2.48.1
This series fixes remote command checking and cleans up command
requirement calls across tests.
The first patch fixes require_cmd() incorrectly checking commands
locally even when remote=True was specified due to a missing host
parameter.
The second patch makes require_cmd() usage explicit about local/remote
requirements, avoiding unnecessary test failures and consolidating
duplicate calls.
Gal Pressman (2):
selftests: drv-net: Fix remote command checking in require_cmd()
selftests: drv-net: Make command requirements explicit
tools/testing/selftests/drivers/net/hw/devlink_rate_tc_bw.py | 3 +--
tools/testing/selftests/drivers/net/hw/rss_input_xfrm.py | 2 +-
tools/testing/selftests/drivers/net/hw/tso.py | 2 +-
tools/testing/selftests/drivers/net/lib/py/env.py | 2 +-
tools/testing/selftests/drivers/net/lib/py/load.py | 2 +-
tools/testing/selftests/drivers/net/ping.py | 2 +-
6 files changed, 6 insertions(+), 7 deletions(-)
--
2.40.1
There are a couple issues with the tso selftest.
- Features required for test cases are detected by searching the set
of active features at test start, so if a feature is supported by
hw, but disabled, the test will report that the feature under test
is not available and fail.
- The vxlan test cases do not use the correct ip link flags based on
the gso feature under test
- The non-tunneled tso6 test case is showing up with the wrong name.
With all patches applied test output is:
# Detected qstat for LSO wire-packets
TAP version 13
1..14
ok 1 tso.ipv4
# Testing with mangleid enabled
ok 2 tso.vxlan4_ipv4
ok 3 tso.vxlan4_ipv6
# Testing with mangleid enabled
ok 4 tso.vxlan_csum4_ipv4
ok 5 tso.vxlan_csum4_ipv6
# Testing with mangleid enabled
ok 6 tso.gre4_ipv4
ok 7 tso.gre4_ipv6
ok 8 tso.ipv6
# Testing with mangleid enabled
ok 9 tso.vxlan6_ipv4
ok 10 tso.vxlan6_ipv6
# Testing with mangleid enabled
ok 11 tso.vxlan_csum6_ipv4
ok 12 tso.vxlan_csum6_ipv6
# Testing with mangleid enabled
ok 13 tso.gre6_ipv4
ok 14 tso.gre6_ipv6
# Totals: pass:14 fail:0 xfail:0 xpass:0 skip:0 error:0
Daniel Zahka (3):
selftests: drv-net: tso: enable test cases based on hw_features
selftests: drv-net: tso: fix vxlan tunnel flags to get correct
gso_type
selftests: drv-net: tso: fix non-tunneled tso6 test case name
tools/testing/selftests/drivers/net/hw/tso.py | 99 +++++++++++--------
1 file changed, 59 insertions(+), 40 deletions(-)
--
2.47.1
The pidfd selftests run in userspace and include both userspace and kernel
header files. On some distros (for example, CentOS), this results in
duplicate-symbol warnings in allmodconfig builds, while on other distros
(for example, Ubuntu) it does not. (This happens in recent -next trees,
including next-20250714.)
Therefore, use #undef to get rid of the userspace definitions in favor
of the kernel definitions.
Other ways of handling this include splitting up the selftest code so
that the userspace definitions go into one translation unit and the
kernel definitions into another (which might or might not be feasible)
or to adjust compiler command-line options to suppress the warnings
(which might or might not be desirable).
Signed-off-by: Paul E. McKenney <paulmck(a)kernel.org>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <linux-kselftest(a)vger.kernel.org>
---
pidfd.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index efd74063126eb..6ff495398e872 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -16,6 +16,10 @@
#include <sys/types.h>
#include <sys/wait.h>
+#undef SCHED_NORMAL
+#undef SCHED_FLAG_KEEP_ALL
+#undef SCHED_FLAG_UTIL_CLAMP
+
#include "../kselftest.h"
#include "../clone3/clone3_selftests.h"
The tests expect the modules to be in the same working directory, but when
using a different $INSTALL_PATH they are not copied over.
Signed-off-by: Ricardo B. Marlière <rbm(a)suse.com>
---
# cd tools/testing/selftests/kselftest_install && ./run_kselftest.sh -t bpf:test_verifier
TAP version 13
1..1
# timeout set to 0
# selftests: bpf: test_verifier
# Can't find bpf_testmod.ko kernel module: -2
not ok 1 selftests: bpf: test_verifier # exit=1
---
tools/testing/selftests/bpf/Makefile | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 4863106034dfbcd35f830432322f054d897bb406..56b0565af8a76a9e784836a836935dd22e814fc0 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -877,5 +877,7 @@ override define INSTALL_RULE
@for DIR in $(TEST_INST_SUBDIRS); do \
mkdir -p $(INSTALL_PATH)/$$DIR; \
rsync -a $(OUTPUT)/$$DIR/*.bpf.o $(INSTALL_PATH)/$$DIR;\
+ rsync -a $(OUTPUT)/$$DIR/*.ko $(INSTALL_PATH)/$$DIR;\
+ rsync -a $(OUTPUT)/*.ko $(INSTALL_PATH);\
done
endef
---
base-commit: f227e9ed4fe4f2fed40e4725d6c10860d30c2ea2
change-id: 20250724-bpf-next_for-next-f1de3e4becc8
Best regards,
--
Ricardo B. Marlière <rbm(a)suse.com>
From: Steven Rostedt <rostedt(a)goodmis.org>
The subsystem event test enables all "sched" events and makes sure there's
at least 3 different events in the output. It used to cat the entire trace
file to | wc -l, but on slow machines, that could last a very long time.
To solve that, it was changed to just read the first 100 lines of the
trace file. This can cause false failures as some events repeat so often,
that the 100 lines that are examined could possibly be of only one event.
Instead, create an awk script that looks for 3 different events and will
exit out after it finds them. This will find the 3 events the test looks
for (eventually if it works), and still exit out after the test is
satisfied and not cause slower machines to run forever.
Reported-by: Tengda Wu <wutengda(a)huaweicloud.com>
Closes: https://lore.kernel.org/all/20250710130134.591066-1-wutengda@huaweicloud.co…
Fixes: 1a4ea83a6e67 ("selftests/ftrace: Limit length in subsystem-enable tests")
Signed-off-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
---
.../ftrace/test.d/event/subsystem-enable.tc | 28 +++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc b/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc
index b7c8f29c09a9..65916bb55dfb 100644
--- a/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc
+++ b/tools/testing/selftests/ftrace/test.d/event/subsystem-enable.tc
@@ -14,11 +14,35 @@ fail() { #msg
exit_fail
}
+# As reading trace can last forever, simply look for 3 different
+# events then exit out of reading the file. If there's not 3 different
+# events, then the test has failed.
+check_unique() {
+ cat trace | grep -v '^#' | awk '
+ BEGIN { cnt = 0; }
+ {
+ for (i = 0; i < cnt; i++) {
+ if (event[i] == $5) {
+ break;
+ }
+ }
+ if (i == cnt) {
+ event[cnt++] = $5;
+ if (cnt > 2) {
+ exit;
+ }
+ }
+ }
+ END {
+ printf "%d", cnt;
+ }'
+}
+
echo 'sched:*' > set_event
yield
-count=`head -n 100 trace | grep -v ^# | awk '{ print $5 }' | sort -u | wc -l`
+count=`check_unique`
if [ $count -lt 3 ]; then
fail "at least fork, exec and exit events should be recorded"
fi
@@ -29,7 +53,7 @@ echo 1 > events/sched/enable
yield
-count=`head -n 100 trace | grep -v ^# | awk '{ print $5 }' | sort -u | wc -l`
+count=`check_unique`
if [ $count -lt 3 ]; then
fail "at least fork, exec and exit events should be recorded"
fi
--
2.47.2
(yeah, I made the same non-ASCII mistake...)
On 7/24/25 10:49 AM, Shannon Nelson wrote:
> On 7/24/25 1:20 AM, Xiumei Mu wrote:
>> resent the reply again with "plain text mode"
>>
>> On Thu, Jul 24, 2025 at 2:25 PM Hangbin Liu<liuhangbin(a)gmail.com> wrote:
>>> Hi Xiumei,
>>> On Thu, Jul 24, 2025 at 12:55:02PM +0800, Xiumei Mu wrote:
>>>> The esp4_offload module, loaded during IPsec offload tests, should
>>>> be reset to its default settings after testing.
>>>> Otherwise, leaving it enabled could unintentionally affect subsequence
>>>> test cases by keeping offload active.
>>> Would you please show which subsequence test will be affected?
>> Any general ipsec case, which expects to be tested by default
>> behavior(without offload).
>> esp4_offload will affect the performance.
>>
>>>> Fixes: 2766a11161cc ("selftests: rtnetlink: add ipsec offload API test")
>>> It would be good to Cc the fix commit author. You can use
>>> `./scripts/get_maintainer.pl your_patch_file` to get the contacts you
>>> need to Cc.
>> I used the script to generate the cc list.
>> and I double checked the old email of the author is invalid
>> added his personal email in the cc list:
>>
>> Shannon Nelson<shannon.nelson(a)oracle.com>. -----> Shannon Nelson
>> <sln(a)onemain.com>
>>
>> get the information from here:
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a…
>
> Yep, that was me a couple of corporate email addresses ago. Thanks for
> digging up the new email address. Luckily I have a couple of mail
> filters watching for old email addresses.
>
>>>> Signed-off-by: Xiumei Mu<xmu(a)redhat.com>
>>>> ---
>>>> tools/testing/selftests/net/rtnetlink.sh | 6 ++++++
>>>> 1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
>>>> index 2e8243a65b50..5cc1b5340a1a 100755
>>>> --- a/tools/testing/selftests/net/rtnetlink.sh
>>>> +++ b/tools/testing/selftests/net/rtnetlink.sh
>>>> @@ -673,6 +673,11 @@ kci_test_ipsec_offload()
>>>> sysfsf=$sysfsd/ipsec
>>>> sysfsnet=/sys/bus/netdevsim/devices/netdevsim0/net/
>>>> probed=false
>>>> + esp4_offload_probed_default=false
>>>> +
>>>> + if lsmod | grep -q esp4_offload; then
>>>> + esp4_offload_probed_default=true
>>>> + fi
>>> If the mode is loaded by default, how to avoid the subsequence test to be
>>> failed?
>> The module is not loaded by default, but some users or testers may
>> need to load esp4_offload in their own environments.
>> Therefore, resetting it to the default configuration is the best
>> practice to prevent this self-test case from impacting subsequent
>> tests
>
> Seems reasonable to me.
>
>>>> if ! mount | grep -q debugfs; then
>>>> mount -t debugfs none /sys/kernel/debug/ &> /dev/null
>>>> @@ -766,6 +771,7 @@ EOF
>>>> fi
>>>>
>>>> # clean up any leftovers
>>>> + [ $esp4_offload_probed_default == false ] && rmmod esp4_offload
>>> The new patch need to pass shellcheck. We need to double quote the variable.
>> Thanks your comment, I will add double quote in patchv2
>
> Or you keep with the existing style as done a line or two later:
> $esp4_offload_probed_default && rmmod esp4_offload Either way,
> Reviewed-by: Shannon Nelson <sln(a)onemain.com> Cheers, sln
>>> Thanks
>>> Hangbin
>>>> echo 0 > /sys/bus/netdevsim/del_device
>>>> $probed && rmmod netdevsim
>>>>
>>>> --
>>>> 2.50.1
>>>>
>
Show precise rejected function when attaching fexit/fmod_ret to __noreturn
functions.
Add log for attaching tracing programs to functions in deny list.
Add selftest for attaching tracing programs to functions in deny list.
Migrate fexit_noreturns case into tracing_failure test suite.
changes:
v3:
- add tracing_deny case into existing files (Alexei)
- migrate fexit_noreturns into tracing_failure
- change SOB
v2:
- change verifier log message (Alexei)
- add missing Suggested-by
https://lore.kernel.org/bpf/20250714120408.1627128-1-mannkafai@gmail.com/
v1:
https://lore.kernel.org/all/20250710162717.3808020-1-mannkafai@gmail.com/
---
KaFai Wan (4):
bpf: Show precise rejected function when attaching fexit/fmod_ret to
__noreturn functions
bpf: Add log for attaching tracing programs to functions in deny list
selftests/bpf: Add selftest for attaching tracing programs to
functions in deny list
selftests/bpf: Migrate fexit_noreturns case into tracing_failure test
suite
kernel/bpf/verifier.c | 5 +-
.../bpf/prog_tests/fexit_noreturns.c | 9 ----
.../bpf/prog_tests/tracing_failure.c | 52 +++++++++++++++++++
.../selftests/bpf/progs/fexit_noreturns.c | 15 ------
.../selftests/bpf/progs/tracing_failure.c | 12 +++++
5 files changed, 68 insertions(+), 25 deletions(-)
delete mode 100644 tools/testing/selftests/bpf/prog_tests/fexit_noreturns.c
delete mode 100644 tools/testing/selftests/bpf/progs/fexit_noreturns.c
--
2.43.0
This patch series add tests to validate XDP native support for PASS,
DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment.
For adjustment tests, validate support for both the extension and
shrinking cases across various packet sizes and offset values.
The pass criteria for head/tail adjustment tests require that at-least
one adjustment value works for at-least one packet size. This ensure
that the variability in maximum supported head/tail adjustment offset
across different drivers is being incorporated.
The results reported in this series are based on netdevsim. However, the
series is tested against multiple other drivers including fbnic.
Note: The XDP support for fbnic will be added later.
---
Change-log:
- Force checksum computation in netdevsim xdp hook
- Update checksum when updating packet for head/tail adjustment cases
- Use 1 as the minimum value for the data growth while adjusting tail
V5: https://lore.kernel.org/netdev/20250715210553.1568963-6-mohsin.bashr@gmail.…
V4: https://lore.kernel.org/netdev/20250714210352.1115230-1-mohsin.bashr@gmail.…
V3: https://lore.kernel.org/netdev/20250712002648.2385849-1-mohsin.bashr@gmail.…
V2: https://lore.kernel.org/netdev/20250710184351.63797-1-mohsin.bashr@gmail.com
V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.…
Jakub Kicinski (1):
net: netdevsim: hook in XDP handling
Mohsin Bashir (4):
selftests: drv-net: Test XDP_PASS/DROP support
selftests: drv-net: Test XDP_TX support
selftests: drv-net: Test tail-adjustment support
selftests: drv-net: Test head-adjustment support
drivers/net/netdevsim/netdev.c | 21 +-
tools/testing/selftests/drivers/net/Makefile | 1 +
tools/testing/selftests/drivers/net/xdp.py | 656 ++++++++++++++++++
.../selftests/net/lib/xdp_native.bpf.c | 621 +++++++++++++++++
4 files changed, 1298 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/drivers/net/xdp.py
create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c
--
2.47.1
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs mount is
auto-selected based on the mounting process's active pidns, and the
pidns itself is basically hidden once the mount has been constructed).
/* pidns mount option for procfs */
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns of a
procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes creates
a procfs mount from an empty pidns so that user namespaced containers
can be nested (without this, the nested containers would fail to
mount procfs). But this requires forking off a helper process because
you cannot just one-shot this using mount(2).
* Container runtimes in general need to fork into a container before
configuring its mounts, which can lead to security issues in the case
of shared-pidns containers (a privileged process in the pidns can
interact with your container runtime process). While
SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
strict need for this due to a minor uAPI wart is kind of unfortunate.
Things would be much easier if there was a way for userspace to just
specify the pidns they want. Patch 1 implements a new "pidns" argument
which can be set using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
The initial security model I have in this RFC is to be as conservative
as possible and just mirror the security model for setns(2) -- which
means that you can only set pidns=... to pid namespaces that your
current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
privileges over the pid namespace. This fulfils the requirements of
container runtimes, but I suspect that this may be too strict for some
usecases.
The pidns argument is not displayed in mountinfo -- it's not clear to me
what value it would make sense to show (maybe we could just use ns_dname
to provide an identifier for the namespace, but this number would be
fairly useless to userspace). I'm open to suggestions. Note that
PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
information about this outside of mountinfo.
Note that you cannot change the pidns of an already-created procfs
instance. The primary reason is that allowing this to be changed would
require RCU-protecting proc_pid_ns(sb) and thus auditing all of
fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF
the pid namespace. Since creating procfs instances is very cheap, it
seems unnecessary to overcomplicate this upfront. Trying to reconfigure
procfs this way errors out with -EBUSY.
/* ioctl(PROCFS_GET_PID_NAMESPACE) */
In addition, being able to figure out what pid namespace is being used
by a procfs mount is quite useful when you have an administrative
process (such as a container runtime) which wants to figure out the
correct way of mapping PIDs between its own namespace and the namespace
for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
alternative ways to do this, but they all rely on ancillary information
that third-party libraries and tools do not necessarily have access to.
To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
can be used to get a reference to the pidns that a procfs is using.
It's not quite clear what is the correct security model for this API,
but the current approach I've taken is to:
* Make the ioctl only valid on the root (meaning that a process without
access to the procfs root -- such as only having an fd to a procfs
file or some open_tree(2)-like subset -- cannot use this API).
* Require that the process requesting either has access to
/proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
access to it and can join it if they had a handle), or is in a pidns
that is a direct ancestor of the target pidns (i.e. all of the pids
are already visible in the procfs for the current process's pidns).
The security model for this is a little loose, as it seems to me that
all of the cases mentioned are valid cases to allow access, but I'm open
to suggestions for whether we need to make this stricter or looser.
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v3:
- Disallow changing pidns for existing procfs instances, as we'd
probably have to RCU-protect everything that touches the pinned pidns
reference.
- Improve tests with slightly nicer ASSERT_ERRNO* macros.
- v2: <https://lore.kernel.org/r/20250723-procfs-pidns-api-v2-0-621e7edd8e40@cypha…>
Changes in v2:
- #ifdef CONFIG_PID_NS
- Improve cover letter wording to make it clear we're talking about two
separate features with different permission models. [Andy Lutomirski]
- Fix build warnings in pidns_is_ancestor() patch. [kernel test robot]
- v1: <https://lore.kernel.org/r/20250721-procfs-pidns-api-v1-0-5cd9007e512d@cypha…>
---
Aleksa Sarai (4):
pidns: move is-ancestor logic to helper
procfs: add "pidns" mount option
procfs: add PROCFS_GET_PID_NAMESPACE ioctl
selftests/proc: add tests for new pidns APIs
Documentation/filesystems/proc.rst | 12 ++
fs/proc/root.c | 156 +++++++++++++++++-
include/linux/pid_namespace.h | 9 ++
include/uapi/linux/fs.h | 3 +
kernel/pid_namespace.c | 23 ++-
tools/testing/selftests/proc/.gitignore | 1 +
tools/testing/selftests/proc/Makefile | 1 +
tools/testing/selftests/proc/proc-pidns.c | 252 ++++++++++++++++++++++++++++++
8 files changed, 441 insertions(+), 16 deletions(-)
---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250717-procfs-pidns-api-8ed1583431f0
Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>
Although setup_ns() set net.ipv4.conf.default.rp_filter=0,
loading certain module such as ipip will automatically create a tunl0 interface
in all netns including new created ones, this in script is before than
default.rp_filter=0 applied, as a result tunl0.rp_filter remains set to 1
which causes the test report FAIL when ipip module is preloaded.
Before fix:
Testing DR mode...
Testing NAT mode...
Testing Tunnel mode...
ipvs.sh: FAIL
After fix:
Testing DR mode...
Testing NAT mode...
Testing Tunnel mode...
ipvs.sh: PASS
Fixes: ("7c8b89ec5 selftests: netfilter: remove rp_filter configuration")
Signed-off-by: Yi Chen <yiche(a)redhat.com>
---
tools/testing/selftests/net/netfilter/ipvs.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/netfilter/ipvs.sh b/tools/testing/selftests/net/netfilter/ipvs.sh
index 6af2ea3ad6b8..9c9d5b38ab71 100755
--- a/tools/testing/selftests/net/netfilter/ipvs.sh
+++ b/tools/testing/selftests/net/netfilter/ipvs.sh
@@ -151,7 +151,7 @@ test_nat() {
test_tun() {
ip netns exec "${ns0}" ip route add "${vip_v4}" via "${gip_v4}" dev br0
- ip netns exec "${ns1}" modprobe -q ipip
+ modprobe -q ipip
ip netns exec "${ns1}" ip link set tunl0 up
ip netns exec "${ns1}" sysctl -qw net.ipv4.ip_forward=0
ip netns exec "${ns1}" sysctl -qw net.ipv4.conf.all.send_redirects=0
@@ -160,10 +160,10 @@ test_tun() {
ip netns exec "${ns1}" ipvsadm -a -i -t "${vip_v4}:${port}" -r ${rip_v4}:${port}
ip netns exec "${ns1}" ip addr add ${vip_v4}/32 dev lo:1
- ip netns exec "${ns2}" modprobe -q ipip
ip netns exec "${ns2}" ip link set tunl0 up
ip netns exec "${ns2}" sysctl -qw net.ipv4.conf.all.arp_ignore=1
ip netns exec "${ns2}" sysctl -qw net.ipv4.conf.all.arp_announce=2
+ ip netns exec "${ns2}" sysctl -qw net.ipv4.conf.tunl0.rp_filter=0
ip netns exec "${ns2}" ip addr add "${vip_v4}/32" dev lo:1
test_service
--
2.50.1
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v26.
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
---
v25 (19-Jul-2025) and v26 (22-Jul-2025)
- Restruct to avoid using lock and unlock when both step_thresh are provided (Jakub Kicinski <kuba(a)kernel.org>)
v24 (18-Jul-2025)
- Replace TCA_DUALPI2 prefix with TC_DUALPI2 for enums in pkt_sched.h (Jakub Kicinski <kuba(a)kernel.org>)
- Report error if both packet and time step thresholds are provided (Jakub Kicinski <kuba(a)kernel.org>)
v22 (11-Jul-2025) and v23 (13-Jul-2025)
- Fix issue when user would like to change DualPI2 but provides an empty TCA_OPTIONS with no nested attributes (Paolo Abeni <pabeni(a)redhat.com>, Jakub Kicinski <kuba(a)kernel.org>)
v21 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>)
- Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>)
- Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>)
- Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>)
- Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>)
- Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub Kicinski <kuba(a)kernel.org>)
- Remove redundant name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>)
- Fix some typos and format issues of dualpi2 attributes
v20 (21-Jun-2025)
- Add one more commit to fix warning and style check on tdc.sh reported by shellcheck
- Remove double-prefixed of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>)
v19 (14-Jun-2025)
- Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>)
v18 (13-Jun-2025)
- Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute
- Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>)
v17 (25-May-2025, Resent at 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
---
Chia-Yu Chang (5):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Fix warning and style check on tdc.sh
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 151 ++-
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 68 +
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1175 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 6 +-
9 files changed, 1669 insertions(+), 5 deletions(-)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs mount is
auto-selected based on the mounting process's active pidns, and the
pidns itself is basically hidden once the mount has been constructed).
This has historically meant that userspace was required to do some
special dances in order to configure the pidns of a procfs mount as
desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes creates
a procfs mount from an empty pidns so that user namespaced containers
can be nested (without this, the nested containers would fail to
mount procfs). But this requires forking off a helper process because
you cannot just one-shot this using mount(2).
* Container runtimes in general need to fork into a container before
configuring its mounts, which can lead to security issues in the case
of shared-pidns containers (a privileged process in the pidns can
interact with your container runtime process). While
SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
strict need for this due to a minor uAPI wart is kind of unfortunate.
Things would be much easier if there was a way for userspace to just
specify the pidns they want. Patch 1 implements a new "pidns" argument
which can be set using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
The initial security model I have in this RFC is to be as conservative
as possible and just mirror the security model for setns(2) -- which
means that you can only set pidns=... to pid namespaces that your
current pid namespace is a direct ancestor of. This fulfils the
requirements of container runtimes, but I suspect that this may be too
strict for some usecases.
The pidns argument is not displayed in mountinfo -- it's not clear to me
what value it would make sense to show (maybe we could just use ns_dname
to provide an identifier for the namespace, but this number would be
fairly useless to userspace). I'm open to suggestions.
In addition, being able to figure out what pid namespace is being used
by a procfs mount is quite useful when you have an administrative
process (such as a container runtime) which wants to figure out the
correct way of mapping PIDs between its own namespace and the namespace
for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
alternative ways to do this, but they all rely on ancillary information
that third-party libraries and tools do not necessarily have access to.
To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
can be used to get a reference to the pidns that a procfs is using.
It's not quite clear what is the correct security model for this API,
but the current approach I've taken is to:
* Make the ioctl only valid on the root (meaning that a process without
access to the procfs root -- such as only having an fd to a procfs
file or some open_tree(2)-like subset -- cannot use this API).
* Require that the process requesting either has access to
/proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
access to it and can join it if they had a handle), or is in a pidns
that is a direct ancestor of the target pidns (i.e. all of the pids
are already visible in the procfs for the current process's pidns).
The security model for this is a little loose, as it seems to me that
all of the cases mentioned are valid cases to allow access, but I'm open
to suggestions for whether we need to make this stricter or looser.
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Aleksa Sarai (4):
pidns: move is-ancestor logic to helper
procfs: add pidns= mount option
procfs: add PROCFS_GET_PID_NAMESPACE ioctl
selftests/proc: add tests for new pidns APIs
Documentation/filesystems/proc.rst | 10 ++
fs/proc/root.c | 132 +++++++++++++-
include/linux/pid_namespace.h | 9 +
include/uapi/linux/fs.h | 3 +
kernel/pid_namespace.c | 21 ++-
tools/testing/selftests/proc/.gitignore | 1 +
tools/testing/selftests/proc/Makefile | 1 +
tools/testing/selftests/proc/proc-pidns.c | 286 ++++++++++++++++++++++++++++++
8 files changed, 448 insertions(+), 15 deletions(-)
---
base-commit: 4c838c7672c39ec6ec48456c6ce22d14a68f4cda
change-id: 20250717-procfs-pidns-api-8ed1583431f0
Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>
This is the second version of a series that lets us run VMware
Workstation on Linux on top of KVM.
The most significant change in this series is the introduction of
CONFIG_KVM_VMWARE which is, in general, a nice cleanup for various
bits of VMware compatibility code that have been scattered around KVM.
(first patch)
The rest of the series builds upon the VMware platform to implement
features that are needed to run VMware guests without any
modifications on top of KVM:
- ability to turn on the VMware backdoor at runtime on a per-vm basis
(used to be a kernel boot argument only)
- support for VMware hypercalls - VMware products have a huge
collection of hypercalls, all of which are handled in userspace,
- support for handling legacy VMware backdoor in L0 in nested configs
- in cases where we have WS running a Windows VBS guest, the L0 would
be KVM, L1 Hyper-V so by default VMware Tools backdoor calls endup in
Hyper-V which can not handle them, so introduce a cap to let L0 handle
those.
The final change in the series is a kselftest of the VMware hypercall
functionality.
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Sean Christopherson <seanjc(a)google.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: x86(a)kernel.org
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Zack Rusin <zack.rusin(a)broadcom.com>
Cc: Doug Covelli <doug.covelli(a)broadcom.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: Arnaldo Carvalho de Melo <acme(a)redhat.com>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Joel Stanley <joel(a)jms.id.au>
Cc: Isaku Yamahata <isaku.yamahata(a)intel.com>
Cc: kvm(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Zack Rusin (5):
KVM: x86: Centralize KVM's VMware code
KVM: x86: Allow enabling of the vmware backdoor via a cap
KVM: x86: Add support for VMware guest specific hypercalls
KVM: x86: Add support for legacy VMware backdoors in nested setups
KVM: selftests: x86: Add a test for KVM_CAP_X86_VMWARE_HYPERCALL
Documentation/virt/kvm/api.rst | 86 +++++++-
MAINTAINERS | 9 +
arch/x86/include/asm/kvm_host.h | 13 ++
arch/x86/kvm/Kconfig | 16 ++
arch/x86/kvm/Makefile | 1 +
arch/x86/kvm/emulate.c | 11 +-
arch/x86/kvm/kvm_vmware.c | 85 ++++++++
arch/x86/kvm/kvm_vmware.h | 189 ++++++++++++++++++
arch/x86/kvm/pmu.c | 39 +---
arch/x86/kvm/pmu.h | 4 -
arch/x86/kvm/svm/nested.c | 6 +
arch/x86/kvm/svm/svm.c | 10 +-
arch/x86/kvm/vmx/nested.c | 6 +
arch/x86/kvm/vmx/vmx.c | 5 +-
arch/x86/kvm/x86.c | 74 +++----
arch/x86/kvm/x86.h | 2 -
include/uapi/linux/kvm.h | 27 +++
tools/include/uapi/linux/kvm.h | 3 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../selftests/kvm/x86/vmware_hypercall_test.c | 121 +++++++++++
20 files changed, 614 insertions(+), 94 deletions(-)
create mode 100644 arch/x86/kvm/kvm_vmware.c
create mode 100644 arch/x86/kvm/kvm_vmware.h
create mode 100644 tools/testing/selftests/kvm/x86/vmware_hypercall_test.c
--
2.48.1
Cosmin reports the following locking issue:
# BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:275
# dump_stack_lvl+0x4f/0x60
# __might_resched+0xeb/0x140
# mutex_lock+0x1a/0x40
# dev_set_promiscuity+0x26/0x90
# __dev_set_promiscuity+0x85/0x170
# __dev_set_rx_mode+0x69/0xa0
# dev_uc_add+0x6d/0x80
# vlan_dev_open+0x5f/0x120 [8021q]
# __dev_open+0x10c/0x2a0
# __dev_change_flags+0x1a4/0x210
# netif_change_flags+0x22/0x60
# do_setlink.isra.0+0xdb0/0x10f0
# rtnl_newlink+0x797/0xb00
# rtnetlink_rcv_msg+0x1cb/0x3f0
# netlink_rcv_skb+0x53/0x100
# netlink_unicast+0x273/0x3b0
# netlink_sendmsg+0x1f2/0x430
Which is similar to recent syzkaller reports in [0] and [1] and triggers
because macsec does not advertise IFF_UNICAST_FLT although it has proper
ndo_set_rx_mode callback that takes care of pushing uc/mc addresses
down to the real device.
In general, dev_uc_add call path is problematic for stacking
non-IFF_UNICAST_FLT because we might grab netdev instance lock under
addr_list_lock spinlock, so this is not a systemic fix.
0: https://lore.kernel.org/netdev/686d55b4.050a0220.1ffab7.0014.GAE@google.com
1: https://lore.kernel.org/netdev/68712acf.a00a0220.26a83e.0051.GAE@google.com/
Link: 2aff4342b0f5b1539c02ffd8df4c7e58dd9746e7.camel(a)nvidia.com
Fixes: 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations")
Reported-by: Cosmin Ratiu <cratiu(a)nvidia.com>
Tested-by: Cosmin Ratiu <cratiu(a)nvidia.com>
Signed-off-by: Stanislav Fomichev <sdf(a)fomichev.me>
---
drivers/net/macsec.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 7edbe76b5455..4c75d1fea552 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -3868,7 +3868,7 @@ static void macsec_setup(struct net_device *dev)
ether_setup(dev);
dev->min_mtu = 0;
dev->max_mtu = ETH_MAX_MTU;
- dev->priv_flags |= IFF_NO_QUEUE;
+ dev->priv_flags |= IFF_NO_QUEUE | IFF_UNICAST_FLT;
dev->netdev_ops = &macsec_netdev_ops;
dev->needs_free_netdev = true;
dev->priv_destructor = macsec_free_netdev;
--
2.50.1
"auto" was defined as a keyword back in the K&R days, but as a storage
type specifier. No one ever used it, since it was and is the default
storage type for local variables.
C++11 recycled the keyword to allow a type to be declared based on the
type of an initializer. This was finally adopted into standard C in
C23.
gcc and clang provide the "__auto_type" alias keyword as an extension
for pre-C23, however, there is no reason to pollute the bulk of the
source base with this temporary keyword; instead define "auto" as a
macro unless the compiler is running in C23+ mode.
This macro is added in <linux/compiler_types.h> because that header is
included in some of the tools headers, wheres <linux/compiler.h> is
not as it has a bunch of very kernel-specific things in it.
Changes in v2:
- Restore indentation of macro backslashes (David Laight)
- arch/nios2: Replace an adjacent typeof() with a similar "auto" construct
(Linus Torvalds)
- fs/proc/inode.c: change "__auto_type" to "const auto" (Alexey Dobriyan)
---
arch/nios2/include/asm/uaccess.h | 8 ++++----
arch/x86/include/asm/bug.h | 2 +-
arch/x86/include/asm/string_64.h | 6 +++---
arch/x86/include/asm/uaccess_64.h | 2 +-
fs/proc/inode.c | 16 ++++++++--------
include/linux/cleanup.h | 6 +++---
include/linux/compiler.h | 2 +-
include/linux/compiler_types.h | 13 +++++++++++++
include/linux/minmax.h | 6 +++---
tools/testing/selftests/bpf/prog_tests/socket_helpers.h | 9 +++++++--
tools/virtio/linux/compiler.h | 2 +-
11 files changed, 45 insertions(+), 27 deletions(-)
Currently the sve-ptrace test program only runs if the system supports
SVE but since SME includes streaming SVE the tests it offers are valid
even on a system that only supports SME. Since the tests already have
individual hwcap checks just remove the top level test and rely on those.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/arm64/fp/sve-ptrace.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/tools/testing/selftests/arm64/fp/sve-ptrace.c b/tools/testing/selftests/arm64/fp/sve-ptrace.c
index 7f9b6a61d369..b22303778fb0 100644
--- a/tools/testing/selftests/arm64/fp/sve-ptrace.c
+++ b/tools/testing/selftests/arm64/fp/sve-ptrace.c
@@ -753,9 +753,6 @@ int main(void)
ksft_print_header();
ksft_set_plan(EXPECTED_TESTS);
- if (!(getauxval(AT_HWCAP) & HWCAP_SVE))
- ksft_exit_skip("SVE not available\n");
-
child = fork();
if (!child)
return do_child();
---
base-commit: 9e8ebfe677f9101bbfe1f75d548a5aec581e8213
change-id: 20250718-arm64-sve-ptrace-sme-only-4ab49d037295
Best regards,
--
Mark Brown <broonie(a)kernel.org>
When testing SME only systems I noticed that fp-ptrace does not cope at
all well with them, this series fixes the major issues so that the test
program completes successfully. The reason I was looking at this is
that following the recent round of fixes to ptrace we do not currently
offer any mechanism for disabling streaming mode via ptrace, this series
brings the program to a point where it tests the currently implemented
ABI. A further series allowing the disabling of streaming mode via
ptrace will follow.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (3):
kselftest/arm64: Test SME on SME only systems in fp-ptrace
kselftest/arm64: Fix SVE write data generation for SME only systems
kselftest/arm64: Handle attempts to disable SM on SME only systems
tools/testing/selftests/arm64/fp/fp-ptrace.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
---
base-commit: 86731a2a651e58953fc949573895f2fa6d456841
change-id: 20250718-arm64-fp-ptrace-sme-only-ab327d7f0d32
Best regards,
--
Mark Brown <broonie(a)kernel.org>
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v25.
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
---
v25 (19-Jul-2025)
- Fix the missing sch_tree_unlock() in v24 (Jakub Kicinski <kuba(a)kernel.org>)
v24 (18-Jul-2025)
- Replace TCA_DUALPI2 prefix with TC_DUALPI2 for enums in pkt_sched.h (Jakub Kicinski <kuba(a)kernel.org>)
- Report error if both packet and time step thresholds are provided (Jakub Kicinski <kuba(a)kernel.org>)
v23 (13-Jul-2025) and v22 (11-Jul-2025)
- Fix issue when user would like to change DualPI2 but provides an empty TCA_OPTIONS with no nested attributes (Paolo Abeni <pabeni(a)redhat.com>, Jakub Kicinski <kuba(a)kernel.org>)
v21 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>)
- Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>)
- Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>)
- Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>)
- Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>)
- Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub Kicinski <kuba(a)kernel.org>)
- Remove redundant name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>)
- Fix some typos and format issues of dualpi2 attributes
v20 (21-Jun-2025)
- Add one more commit to fix warning and style check on tdc.sh reported by shellcheck
- Remove double-prefixed of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>)
v19 (14-Jun-2025)
- Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>)
v18 (13-Jun-2025)
- Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute
- Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>)
v17 (25-May-2025, Resent at 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
---
Chia-Yu Chang (5):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Fix warning and style check on tdc.sh
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 151 ++-
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 68 +
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1177 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 6 +-
9 files changed, 1671 insertions(+), 5 deletions(-)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
It is currently impossible to enable ipv6 forwarding on a per-interface
basis like in ipv4. To enable forwarding on an ipv6 interface we need to
enable it on all interfaces and disable it on the other interfaces using
a netfilter rule. This is especially cumbersome if you have lots of
interface and only want to enable forwarding on a few. According to the
sysctl docs [0] the `net.ipv6.conf.all.forwarding` enables forwarding
for all interfaces, while the interface-specific
`net.ipv6.conf.<interface>.forwarding` configures the interface
Host/Router configuration.
Introduce a new sysctl flag `force_forwarding`, which can be set on every
interface. The ip6_forwarding function will then check if the global
forwarding flag OR the force_forwarding flag is active and forward the
packet.
To preserver backwards-compatibility reset the flag (on all interfaces)
to 0 if the net.ipv6.conf.all.forwarding flag is set to 0.
Add a short selftest that checks if a packet gets forwarded with and
without `force_forwarding`.
[0]: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
Signed-off-by: Gabriel Goller <g.goller(a)proxmox.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel(a)6wind.com>
---
v6:
* rebase
* remove brackts around single line
* add 'nodad' to addresses in selftest to avoid sporadic failures
v5: https://lore.kernel.org/netdev/20250707094307.223975-1-g.goller@proxmox.com/
* update conf/all/forwarding docs
* simplified backwards-compat comment
* remove ASSERT_RTNL as it's guaranteed by __in6_dev_get_rtnl_net()
already
* cange ip6_forward logic so that it doesn't depend on the idev
existing
* move WRITE_ONCE inside device lock
v4: https://lore.kernel.org/netdev/20250703160154.560239-1-g.goller@proxmox.com/
* actually write the sysctl value to the table
* use ASSERT_RTNL() when forwarding the sysctl change
* remove useless comments in function body
* simplify forwarding and force_forwarding check in ip6_output.c
* fix code backticks in Documentation (double instead of single)
* add selftests
v3: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/
* remove forwarding=0 setting force_forwarding=0 globally.
* add min and max (0 and 1) value to sysctl.
v2: https://lore.kernel.org/netdev/20250701140423.487411-1-g.goller@proxmox.com/
* rename from `do_forwarding` to `force_forwarding`.
* add global `force_forwarding` flag which will enable
`force_forwarding` on every interface like the
`ipv4.all.forwarding` flag.
* `forwarding`=0 will disable global and per-interface
`force_forwarding`.
* export option as NETCONFA_FORCE_FORWARDING.
v1: https://lore.kernel.org/netdev/20250702074619.139031-1-g.goller@proxmox.com/
Documentation/networking/ip-sysctl.rst | 9 +-
include/linux/ipv6.h | 1 +
include/uapi/linux/ipv6.h | 1 +
include/uapi/linux/netconf.h | 1 +
include/uapi/linux/sysctl.h | 1 +
net/ipv6/addrconf.c | 82 ++++++++++++++
net/ipv6/ip6_output.c | 3 +-
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/ipv6_force_forwarding.sh | 105 ++++++++++++++++++
9 files changed, 201 insertions(+), 3 deletions(-)
create mode 100755 tools/testing/selftests/net/ipv6_force_forwarding.sh
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 0f1251cce314..6d92bae0257a 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -2281,8 +2281,8 @@ conf/all/disable_ipv6 - BOOLEAN
conf/all/forwarding - BOOLEAN
Enable global IPv6 forwarding between all interfaces.
- IPv4 and IPv6 work differently here; e.g. netfilter must be used
- to control which interfaces may forward packets and which not.
+ IPv4 and IPv6 work differently here; the ``force_forwarding`` flag must
+ be used to control which interfaces may forward packets.
This also sets all interfaces' Host/Router setting
'forwarding' to the specified value. See below for details.
@@ -2292,6 +2292,11 @@ conf/all/forwarding - BOOLEAN
proxy_ndp - BOOLEAN
Do proxy ndp.
+force_forwarding - BOOLEAN
+ Enable forwarding on this interface only -- regardless of the setting on
+ ``conf/all/forwarding``. When setting ``conf.all.forwarding`` to 0,
+ the ``force_forwarding`` flag will be reset on all interfaces.
+
fwmark_reflect - BOOLEAN
Controls the fwmark of kernel-generated IPv6 reply packets that are not
associated with a socket for example, TCP RSTs or ICMPv6 echo replies).
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 5aeeed22f35b..d975a86f29be 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -17,6 +17,7 @@ struct ipv6_devconf {
__s32 hop_limit;
__s32 mtu6;
__s32 forwarding;
+ __s32 force_forwarding;
__s32 disable_policy;
__s32 proxy_ndp;
__cacheline_group_end(ipv6_devconf_read_txrx);
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index cf592d7b630f..d4d3ae774b26 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -199,6 +199,7 @@ enum {
DEVCONF_NDISC_EVICT_NOCARRIER,
DEVCONF_ACCEPT_UNTRACKED_NA,
DEVCONF_ACCEPT_RA_MIN_LFT,
+ DEVCONF_FORCE_FORWARDING,
DEVCONF_MAX
};
diff --git a/include/uapi/linux/netconf.h b/include/uapi/linux/netconf.h
index fac4edd55379..1c8c84d65ae3 100644
--- a/include/uapi/linux/netconf.h
+++ b/include/uapi/linux/netconf.h
@@ -19,6 +19,7 @@ enum {
NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
NETCONFA_INPUT,
NETCONFA_BC_FORWARDING,
+ NETCONFA_FORCE_FORWARDING,
__NETCONFA_MAX
};
#define NETCONFA_MAX (__NETCONFA_MAX - 1)
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 8981f00204db..63d1464cb71c 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -573,6 +573,7 @@ enum {
NET_IPV6_ACCEPT_RA_FROM_LOCAL=26,
NET_IPV6_ACCEPT_RA_RT_INFO_MIN_PLEN=27,
NET_IPV6_RA_DEFRTR_METRIC=28,
+ NET_IPV6_FORCE_FORWARDING=29,
__NET_IPV6_MAX
};
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index ba2ec7c870cc..580aed034849 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -239,6 +239,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = {
.ndisc_evict_nocarrier = 1,
.ra_honor_pio_life = 0,
.ra_honor_pio_pflag = 0,
+ .force_forwarding = 0,
};
static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
@@ -303,6 +304,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
.ndisc_evict_nocarrier = 1,
.ra_honor_pio_life = 0,
.ra_honor_pio_pflag = 0,
+ .force_forwarding = 0,
};
/* Check if link is ready: is it up and is a valid qdisc available */
@@ -857,6 +859,9 @@ static void addrconf_forward_change(struct net *net, __s32 newf)
idev = __in6_dev_get_rtnl_net(dev);
if (idev) {
int changed = (!idev->cnf.forwarding) ^ (!newf);
+ /* Disabling all.forwarding sets 0 to force_forwarding for all interfaces */
+ if (newf == 0)
+ WRITE_ONCE(idev->cnf.force_forwarding, 0);
WRITE_ONCE(idev->cnf.forwarding, newf);
if (changed)
@@ -5719,6 +5724,7 @@ static void ipv6_store_devconf(const struct ipv6_devconf *cnf,
array[DEVCONF_ACCEPT_UNTRACKED_NA] =
READ_ONCE(cnf->accept_untracked_na);
array[DEVCONF_ACCEPT_RA_MIN_LFT] = READ_ONCE(cnf->accept_ra_min_lft);
+ array[DEVCONF_FORCE_FORWARDING] = READ_ONCE(cnf->force_forwarding);
}
static inline size_t inet6_ifla6_size(void)
@@ -6747,6 +6753,75 @@ static int addrconf_sysctl_disable_policy(const struct ctl_table *ctl, int write
return ret;
}
+static void addrconf_force_forward_change(struct net *net, __s32 newf)
+{
+ struct net_device *dev;
+ struct inet6_dev *idev;
+
+ for_each_netdev(net, dev) {
+ idev = __in6_dev_get_rtnl_net(dev);
+ if (idev) {
+ int changed = (!idev->cnf.force_forwarding) ^ (!newf);
+
+ WRITE_ONCE(idev->cnf.force_forwarding, newf);
+ if (changed)
+ inet6_netconf_notify_devconf(dev_net(dev), RTM_NEWNETCONF,
+ NETCONFA_FORCE_FORWARDING,
+ dev->ifindex, &idev->cnf);
+ }
+ }
+}
+
+static int addrconf_sysctl_force_forwarding(const struct ctl_table *ctl, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct inet6_dev *idev = ctl->extra1;
+ struct ctl_table tmp_ctl = *ctl;
+ struct net *net = ctl->extra2;
+ int *valp = ctl->data;
+ int new_val = *valp;
+ int old_val = *valp;
+ loff_t pos = *ppos;
+ int ret;
+
+ tmp_ctl.extra1 = SYSCTL_ZERO;
+ tmp_ctl.extra2 = SYSCTL_ONE;
+ tmp_ctl.data = &new_val;
+
+ ret = proc_douintvec_minmax(&tmp_ctl, write, buffer, lenp, ppos);
+
+ if (write && old_val != new_val) {
+ if (!rtnl_net_trylock(net))
+ return restart_syscall();
+
+ WRITE_ONCE(*valp, new_val);
+
+ if (valp == &net->ipv6.devconf_dflt->force_forwarding) {
+ inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
+ NETCONFA_FORCE_FORWARDING,
+ NETCONFA_IFINDEX_DEFAULT,
+ net->ipv6.devconf_dflt);
+ } else if (valp == &net->ipv6.devconf_all->force_forwarding) {
+ inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
+ NETCONFA_FORCE_FORWARDING,
+ NETCONFA_IFINDEX_ALL,
+ net->ipv6.devconf_all);
+
+ addrconf_force_forward_change(net, new_val);
+ } else {
+ inet6_netconf_notify_devconf(net, RTM_NEWNETCONF,
+ NETCONFA_FORCE_FORWARDING,
+ idev->dev->ifindex,
+ &idev->cnf);
+ }
+ rtnl_net_unlock(net);
+ }
+
+ if (ret)
+ *ppos = pos;
+ return ret;
+}
+
static int minus_one = -1;
static const int two_five_five = 255;
static u32 ioam6_if_id_max = U16_MAX;
@@ -7217,6 +7292,13 @@ static const struct ctl_table addrconf_sysctl[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_TWO,
},
+ {
+ .procname = "force_forwarding",
+ .data = &ipv6_devconf.force_forwarding,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = addrconf_sysctl_force_forwarding,
+ },
};
static int __addrconf_sysctl_register(struct net *net, char *dev_name,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7bd29a9ff0db..3853090d7282 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -509,7 +509,8 @@ int ip6_forward(struct sk_buff *skb)
u32 mtu;
idev = __in6_dev_get_safely(dev_get_by_index_rcu(net, IP6CB(skb)->iif));
- if (READ_ONCE(net->ipv6.devconf_all->forwarding) == 0)
+ if (!READ_ONCE(net->ipv6.devconf_all->forwarding) &&
+ (!idev || !READ_ONCE(idev->cnf.force_forwarding)))
goto error;
if (skb->pkt_type != PACKET_HOST)
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 332f387615d7..f64ec8a15a77 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -112,6 +112,7 @@ TEST_PROGS += skf_net_off.sh
TEST_GEN_FILES += skf_net_off
TEST_GEN_FILES += tfo
TEST_PROGS += tfo_passive.sh
+TEST_PROGS += ipv6_force_forwarding.sh
# YNL files, must be before "include ..lib.mk"
YNL_GEN_FILES := busy_poller netlink-dumps
diff --git a/tools/testing/selftests/net/ipv6_force_forwarding.sh b/tools/testing/selftests/net/ipv6_force_forwarding.sh
new file mode 100755
index 000000000000..bf0243366caa
--- /dev/null
+++ b/tools/testing/selftests/net/ipv6_force_forwarding.sh
@@ -0,0 +1,105 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test IPv6 force_forwarding interface property
+#
+# This test verifies that the force_forwarding property works correctly:
+# - When global forwarding is disabled, packets are not forwarded normally
+# - When force_forwarding is enabled on an interface, packets are forwarded
+# regardless of the global forwarding setting
+
+source lib.sh
+
+cleanup() {
+ cleanup_ns $ns1 $ns2 $ns3
+}
+
+trap cleanup EXIT
+
+setup_test() {
+ # Create three namespaces: sender, router, receiver
+ setup_ns ns1 ns2 ns3
+
+ # Create veth pairs: ns1 <-> ns2 <-> ns3
+ ip link add name veth12 type veth peer name veth21
+ ip link add name veth23 type veth peer name veth32
+
+ # Move interfaces to namespaces
+ ip link set veth12 netns $ns1
+ ip link set veth21 netns $ns2
+ ip link set veth23 netns $ns2
+ ip link set veth32 netns $ns3
+
+ # Configure interfaces
+ ip -n $ns1 addr add 2001:db8:1::1/64 dev veth12 nodad
+ ip -n $ns2 addr add 2001:db8:1::2/64 dev veth21 nodad
+ ip -n $ns2 addr add 2001:db8:2::1/64 dev veth23 nodad
+ ip -n $ns3 addr add 2001:db8:2::2/64 dev veth32 nodad
+
+ # Bring up interfaces
+ ip -n $ns1 link set veth12 up
+ ip -n $ns2 link set veth21 up
+ ip -n $ns2 link set veth23 up
+ ip -n $ns3 link set veth32 up
+
+ # Add routes
+ ip -n $ns1 route add 2001:db8:2::/64 via 2001:db8:1::2
+ ip -n $ns3 route add 2001:db8:1::/64 via 2001:db8:2::1
+
+ # Disable global forwarding
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.all.forwarding=0
+}
+
+test_force_forwarding() {
+ local ret=0
+
+ echo "TEST: force_forwarding functionality"
+
+ # Check if force_forwarding sysctl exists
+ if ! ip netns exec $ns2 test -f /proc/sys/net/ipv6/conf/veth21/force_forwarding; then
+ echo "SKIP: force_forwarding not available"
+ return $ksft_skip
+ fi
+
+ # Test 1: Without force_forwarding, ping should fail
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=0
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=0
+
+ if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then
+ echo "FAIL: ping succeeded when forwarding disabled"
+ ret=1
+ else
+ echo "PASS: forwarding disabled correctly"
+ fi
+
+ # Test 2: With force_forwarding enabled, ping should succeed
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth21.force_forwarding=1
+ ip netns exec $ns2 sysctl -qw net.ipv6.conf.veth23.force_forwarding=1
+
+ if ip netns exec $ns1 ping -6 -c 1 -W 2 2001:db8:2::2 &>/dev/null; then
+ echo "PASS: force_forwarding enabled forwarding"
+ else
+ echo "FAIL: ping failed with force_forwarding enabled"
+ ret=1
+ fi
+
+ return $ret
+}
+
+echo "IPv6 force_forwarding test"
+echo "=========================="
+
+setup_test
+test_force_forwarding
+ret=$?
+
+if [ $ret -eq 0 ]; then
+ echo "OK"
+ exit 0
+elif [ $ret -eq $ksft_skip ]; then
+ echo "SKIP"
+ exit $ksft_skip
+else
+ echo "FAIL"
+ exit 1
+fi
--
2.39.5
sysfs.py tests if DAMON sysfs interface is passing the user-requested
parameters to DAMON as expected. But only the default (minimum)
parameters are being tested. This is partially because _damon_sysfs.py,
which is the library for making the parameter requests, is not
supporting the entire parameters. The internal DAMON status dump script
(drgn_dump_damon_status.py) is also not dumping entire parameters.
Extend the test coverage by updating parameters input and status dumping
scripts to support all parameters, and writing additional tests using
those.
This increased test coverage actually found one real bug
(https://lore.kernel.org/20250719181932.72944-1-sj@kernel.org).
First seven patches (1-7) extend _damon_sysfs.py for all parameters
setup. The eight patch (8) fixes _damon_sysfs.py to use correct max
nr_acceses and age values for their type. Following three patches
(9-11) extend drgn_dump_damon_status.py to dump full DAMON parameters.
Following nine patches (12-20) refactor sysfs.py for general testing
code reuse, and extend it for full parameters check. Finally, two
patches (21 and 22) add test cases in sysfs.py for full parameters
testing.
SeongJae Park (22):
selftests/damon/_damon_sysfs: support DAMOS watermarks setup
selftests/damon/_damon_sysfs: support DAMOS filters setup
selftests/damon/_damon_sysfs: support monitoring intervals goal setup
selftests/damon/_damon_sysfs: support DAMOS quota weights setup
selftests/damon/_damon_sysfs: support DAMOS quota goal nid setup
selftests/damon/_damon_sysfs: support DAMOS action dests setup
selftests/damon/_damon_sysfs: support DAMOS target_nid setup
selftests/damon/_damon_sysfs: use 2**32 - 1 as max nr_accesses and age
selftests/damon/drgn_dump_damon_status: dump damos->migrate_dests
selftests/damon/drgn_dump_damon_status: dump ctx->ops.id
selftests/damon/drgn_dump_damon_status: dump DAMOS filters
selftests/damon/sysfs.py: generalize DAMOS Watermarks commit assertion
selftests/damon/sysfs.py: generalize DamosQuota commit assertion
selftests/damon/sysfs.py: test quota goal commitment
selftests/damon/sysfs.py: test DAMOS destinations commitment
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: test DAMOS filters commitment
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: generalize monitoring attributes commit
assertion
selftests/damon/sysfs.py: generalize DAMON context commit assertion
selftests/damon/sysfs.py: test non-default parameters runtime commit
selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
tools/testing/selftests/damon/_damon_sysfs.py | 301 +++++++++++++++++-
.../selftests/damon/drgn_dump_damon_status.py | 63 +++-
tools/testing/selftests/damon/sysfs.py | 284 +++++++++++++----
3 files changed, 568 insertions(+), 80 deletions(-)
base-commit: fc8066077f44a4fd43f8fdb12bc238f8fbeaa3c5
--
2.39.5
mptcp_connect.sh can be executed manually with "-m <MODE>" and "-C" to
make sure everything works as expected when using "mmap" and "sendfile"
modes instead of "poll", and with the MPTCP checksum support.
These modes should be validated, but they are not when the selftests are
executed via the kselftest helpers. It means that most CIs validating
these selftests, like NIPA for the net development trees and LKFT for
the stable ones, are not covering these modes.
To fix that, new test programs have been added, simply calling
mptcp_connect.sh with the right parameters.
The first patch can be backported up to v5.6, and the second one up to
v5.14.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Changes in v2:
- force using a different prefix in the subtests to avoid having the
same test names in all mptcp_connect*.sh selftests.
- Link to v1: https://lore.kernel.org/r/20250714-net-mptcp-sft-connect-alt-v1-0-bf1c5abbe…
---
Matthieu Baerts (NGI0) (2):
selftests: mptcp: connect: also cover alt modes
selftests: mptcp: connect: also cover checksum
tools/testing/selftests/net/mptcp/Makefile | 3 ++-
tools/testing/selftests/net/mptcp/mptcp_connect_checksum.sh | 5 +++++
tools/testing/selftests/net/mptcp/mptcp_connect_mmap.sh | 5 +++++
tools/testing/selftests/net/mptcp/mptcp_connect_sendfile.sh | 5 +++++
4 files changed, 17 insertions(+), 1 deletion(-)
---
base-commit: b640daa2822a39ff76e70200cb2b7b892b896dce
change-id: 20250714-net-mptcp-sft-connect-alt-c1aaf073ef4e
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Hello,
binder_alloc_selftest provides a robust set of checks for the binder
allocator, but it rarely runs because it must hook into a running binder
process and block all other binder threads until it completes. The test
itself is a good candidate for conversion to KUnit, and it can be
further isolated from user processes by using a test-specific lru
freelist instead of the global one. This series converts the selftest
to KUnit to make it less burdensome to run and to set up a foundation
for unit testing future binder_alloc changes.
Thanks,
Tiffany
Tiffany Yang (6):
binder: Fix selftest page indexing
binder: Store lru freelist in binder_alloc
kunit: test: Export kunit_attach_mm()
binder: Scaffolding for binder_alloc KUnit tests
binder: Convert binder_alloc selftests to KUnit
binder: encapsulate individual alloc test cases
drivers/android/Kconfig | 15 +-
drivers/android/Makefile | 2 +-
drivers/android/binder.c | 10 +-
drivers/android/binder_alloc.c | 39 +-
drivers/android/binder_alloc.h | 14 +-
drivers/android/binder_alloc_selftest.c | 306 -----------
drivers/android/binder_internal.h | 4 +
drivers/android/tests/.kunitconfig | 7 +
drivers/android/tests/Makefile | 6 +
drivers/android/tests/binder_alloc_kunit.c | 572 +++++++++++++++++++++
include/kunit/test.h | 12 +
lib/kunit/user_alloc.c | 4 +-
12 files changed, 651 insertions(+), 340 deletions(-)
delete mode 100644 drivers/android/binder_alloc_selftest.c
create mode 100644 drivers/android/tests/.kunitconfig
create mode 100644 drivers/android/tests/Makefile
create mode 100644 drivers/android/tests/binder_alloc_kunit.c
--
2.50.0.727.gbf7dc18ff4-goog
Introduce SW acceleration for IPIP tunnels in the netfilter flowtable
infrastructure.
---
Changes in v4:
- Use the hash value of the saddr, daddr and protocol of outer IP header as
encapsulation id.
- Link to v3: https://lore.kernel.org/r/20250703-nf-flowtable-ipip-v3-0-880afd319b9f@kern…
Changes in v3:
- Add outer IP header sanity checks
- target nf-next tree instead of net-next
- Link to v2: https://lore.kernel.org/r/20250627-nf-flowtable-ipip-v2-0-c713003ce75b@kern…
Changes in v2:
- Introduce IPIP flowtable selftest
- Link to v1: https://lore.kernel.org/r/20250623-nf-flowtable-ipip-v1-1-2853596e3941@kern…
---
Lorenzo Bianconi (2):
net: netfilter: Add IPIP flowtable SW acceleration
selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest
include/linux/netdevice.h | 1 +
net/ipv4/ipip.c | 25 +++++++++++
net/netfilter/nf_flow_table_ip.c | 48 +++++++++++++++++++++-
net/netfilter/nft_flow_offload.c | 1 +
.../selftests/net/netfilter/nft_flowtable.sh | 40 ++++++++++++++++++
5 files changed, 113 insertions(+), 2 deletions(-)
---
base-commit: d61f6cb6f6ef3c70d2ccc0d9c85c508cb8017da9
change-id: 20250623-nf-flowtable-ipip-1b3d7b08d067
Best regards,
--
Lorenzo Bianconi <lorenzo(a)kernel.org>
This patch series refactors all futex selftests to use
kselftest_harness.h instead of futex's logging.h, as discussed here [1].
This allows to remove a lot of boilerplate code and to simplify some
parts of the test logic, mainly when the test needs to exit early. The
result of this is more than 500 lines removed from
tools/testing/selftests/futex/. Also, this enables new tests to use
kselftest.h features like ASSERT_s and such.
There are some caveats around this refactor:
- logging.h had verbosity levels, while kselftest_harness.h doesn't. I
created a new print function called ksft_print_dbg_msg() that prints
the message if the user uses the -d flag, so now there's an
equivalent of this feature.
- futex_requeue_pi test accepted command line arguments to be used as
test parameters (e.g. ./futex_requeue_pi -b -l -t 500000). This
doesn't work with kselftest_harness.h because there's no
straightforward way to send command line arguments to the test.
I used FIXTURE_VARIANT() to achieve the same result, but now the
parameters live inside of the test file, instead of on
functional/run.sh. This increased a little bit the number of test
cases for futex_requeue_pi, from 22 to 24.
- test_harness_run() calls mmap(MAP_SHARED) before running the test and
this has caused a side effect on test futex_numa_mpol.c. This test
also calls mmap() and then try to access an address out of
boundaries of this mapped memory for a "Memory out of range" subtest,
where the kernel should return -EACCESS. After the refactor, the test
address might be fall inside the first memory mapped region, thus
being a valid address and succeeding the syscall, making the test
fail. To fix that, I created a small "buffer zone" with
mmap(PROT_NONE) between both mmaps.
I have compared the results of run.sh before and after this patchset and
didn't find any regression from the test results.
Thanks,
André
[1] https://lore.kernel.org/lkml/87ecv6p364.ffs@tglx/
---
Changes in v2:
- Rebased on top of tip/master
- Dropped priv_hash global test variant now that this feature was
dropped
- Added include <stdbool.h> in the first patch
- Link to v1: https://lore.kernel.org/r/20250704-tonyk-robust_test_cleanup-v1-0-c0ff4f24c…
---
André Almeida (15):
selftests: kselftest: Create ksft_print_dbg_msg()
selftests/futex: Refactor futex_requeue_pi with kselftest_harness.h
selftests/futex: Refactor futex_requeue_pi_mismatched_ops with kselftest_harness.h
selftests/futex: Refactor futex_requeue_pi_signal_restart with kselftest_harness.h
selftests/futex: Refactor futex_wait_timeout with kselftest_harness.h
selftests/futex: Refactor futex_wait_wouldblock with kselftest_harness.h
selftests/futex: Refactor futex_wait_unitialized_heap with kselftest_harness.h
selftests/futex: Refactor futex_wait_private_mapped_file with kselftest_harness.h
selftests/futex: Refactor futex_wait with kselftest_harness.h
selftests/futex: Refactor futex_requeue with kselftest_harness.h
selftests/futex: Refactor futex_waitv with kselftest_harness.h
selftests/futex: Refactor futex_priv_hash with kselftest_harness.h
selftests/futex: Refactor futex_numa_mpol with kselftest_harness.h
selftests/futex: Drop logging.h include from futex_numa
selftests/futex: Remove logging.h file
tools/testing/selftests/futex/functional/Makefile | 3 +-
.../selftests/futex/functional/futex_numa.c | 3 +-
.../selftests/futex/functional/futex_numa_mpol.c | 57 ++---
.../selftests/futex/functional/futex_priv_hash.c | 49 +---
.../selftests/futex/functional/futex_requeue.c | 76 ++----
.../selftests/futex/functional/futex_requeue_pi.c | 261 ++++++++++-----------
.../functional/futex_requeue_pi_mismatched_ops.c | 80 ++-----
.../functional/futex_requeue_pi_signal_restart.c | 129 +++-------
.../selftests/futex/functional/futex_wait.c | 103 +++-----
.../functional/futex_wait_private_mapped_file.c | 83 ++-----
.../futex/functional/futex_wait_timeout.c | 139 +++++------
.../functional/futex_wait_uninitialized_heap.c | 76 ++----
.../futex/functional/futex_wait_wouldblock.c | 75 ++----
.../selftests/futex/functional/futex_waitv.c | 98 ++++----
tools/testing/selftests/futex/functional/run.sh | 62 +----
tools/testing/selftests/futex/include/logging.h | 148 ------------
tools/testing/selftests/kselftest.h | 14 ++
tools/testing/selftests/kselftest_harness.h | 13 +-
18 files changed, 465 insertions(+), 1004 deletions(-)
---
base-commit: ed0272f0675f31642c3d445a596b544de9db405b
change-id: 20250703-tonyk-robust_test_cleanup-d1f3406365d9
Best regards,
--
André Almeida <andrealmeid(a)igalia.com>
Reading /proc/pid/maps requires read-locking mmap_lock which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges, however it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low priority monitoring tasks and them blocking high priority tasks
results in priority inversion.
Locking the entire address space is required to present fully coherent
picture of the address space, however even current implementation does not
strictly guarantee that by outputting vmas in page-size chunks and
dropping mmap_lock in between each chunk. Address space modifications are
possible while mmap_lock is dropped and userspace reading the content is
expected to deal with possible concurrent address space modifications.
Considering these relaxed rules, holding mmap_lock is not strictly needed
as long as we can guarantee that a concurrently modified vma is reported
either in its original form or after it was modified.
This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces the
contention with tasks modifying the address space because they would have
to contend for the same vma as opposed to the entire address space.
Previous version of this patchset [1] tried to perform /proc/pid/maps
reading under RCU, however its implementation is quite complex and the
results are worse than the new version because it still relied on
mmap_lock speculation which retries if any part of the address space gets
modified. New implementaion is both simpler and results in less
contention. Note that similar approach would not work for /proc/pid/smaps
reading as it also walks the page table and that's not RCU-safe.
Paul McKenney's designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory. At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test. The scanners keep scanning until the specified
/proc/PID/maps file disappears.
The latest results from Paul:
Stock mm-unstable, all of the runs had maximum latencies in excess of
0.5 milliseconds, and with 80% of the runs' latencies exceeding a full
millisecond, and ranging up beyond 4 full milliseconds. In contrast,
99% of the runs with this patch series applied had maximum latencies
of less than 0.5 milliseconds, with the single outlier at only 0.608
milliseconds.
From a median-performance (as opposed to maximum-latency) viewpoint,
this patch series also looks good, with stock mm weighing in at 11
microseconds and patch series at 6 microseconds, better than a 2x
improvement.
Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.011 0.008 0.521
0.011 0.008 0.552
0.011 0.008 0.590
0.011 0.008 0.660
...
0.011 0.015 2.987
0.011 0.015 3.038
0.011 0.016 3.431
0.011 0.016 4.707
After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.006 0.005 0.026
0.006 0.005 0.029
0.006 0.005 0.034
0.006 0.005 0.035
...
0.006 0.006 0.421
0.006 0.006 0.423
0.006 0.006 0.439
0.006 0.006 0.608
The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. An example of user-visible inconsistency
can be that the same vma is printed twice: once before it was modified and
then after the modifications. For example if vma was extended, it might be
found and reported twice. What is not expected is to see a gap where there
should have been a vma both before and after modification. This patchset
increases the chances of such tearing, therefore it's even more important
now to test for unexpected inconsistencies.
In [3] Lorenzo identified the following possible vma merging/splitting
scenarios:
Merges with changes to existing vmas:
1 Merge both - mapping a vma over another one and between two vmas which
can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and
completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and
partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one
and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing one
and partially over its left neighbor;
Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be
merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start end of an existing one;
Splits
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;
If such merges or splits happen concurrently with the /proc/maps reading
we might report a vma twice, once before the modification and once after
it is modified:
Case 1 might report overwritten and previous vma along with the final
merged vma;
Case 2 might report previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by
shrinking of the right neighbor;
Case 4 might report overritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by
shrinking of the left neighbor;
Case 6 might report previous vma and the gap along with the final marged
vma;
Case 7 might report previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering the
gap;
Case 9 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma end;
In all these cases the retry mechanism prevents us from reporting possible
temporary gaps.
Changes since v7 [4]:
- Refactored tests to use kselftest harness, per David Hildenbrand and
Lorenzo Stoakes
- Removed PROCMAP_QUERY selftest, per David Hildenbrand and
Lorenzo Stoakes
- Added Acked-by, per David Hildenbrand
- Replaced sentinels values with named definitions, per Vlastimil Babka
- Added Reviewed-by, per Vlastimil Babka
!!! NOTES FOR APPLYING THE PATCHSET !!!
Applies cleanly over mm-unstable after reverting v7 version of this
patchset (from 94951ab6fe6f to e47914e6c28f in mm-unstable).
[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo…
[4] https://lore.kernel.org/all/20250716030557.1547501-1-surenb@google.com/
Suren Baghdasaryan (6):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: add verbose mode for /proc/pid/maps tearing tests
fs/proc/task_mmu: remove conversion of seq_file position to unsigned
fs/proc/task_mmu: read proc/pid/maps under per-vma lock
fs/proc/internal.h | 5 +
fs/proc/task_mmu.c | 158 +++-
include/linux/mmap_lock.h | 11 +
mm/madvise.c | 3 +-
mm/mmap_lock.c | 93 +++
tools/testing/selftests/proc/.gitignore | 1 +
tools/testing/selftests/proc/Makefile | 1 +
tools/testing/selftests/proc/proc-maps-race.c | 741 ++++++++++++++++++
8 files changed, 997 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/proc/proc-maps-race.c
--
2.50.0.727.gbf7dc18ff4-goog
Hi ,
Interested in getting the GSX 2025 attendee list?
Expo Name: Global Security Exchange (GSX) 2025
Total Number of records: 17,000 records
List includes: Company Name, Contact Name, Job Title, Mailing Address, Phone, Emails, etc.
Are you considering buying these leads? If yes, I can send you the pricing information.
Awaiting your message
Regards
Ben Graham
Demand Generation Manager
US Marketing Data Inc.,
Please reply with REMOVE if you don't wish to receive further emails
"auto" was defined as a keyword back in the K&R days, but as a storage
type specifier. No one ever used it, since it was and is the default
storage type for local variables.
C++11 recycled the keyword to allow a type to be declared based on the
type of an initializer. This was finally adopted into standard C in
C23.
gcc and clang provide the "__auto_type" alias keyword as an extension
for pre-C23, however, there is no reason to pollute the bulk of the
source base with this temporary keyword; instead define "auto" as a
macro unless the compiler is running in C23+ mode.
This macro is added in <linux/compiler_types.h> because that header is
included in some of the tools headers, wheres <linux/compiler.h> is
not as it has a bunch of very kernel-specific things in it.
---
arch/nios2/include/asm/uaccess.h | 4 ++--
arch/x86/include/asm/bug.h | 2 +-
arch/x86/include/asm/string_64.h | 6 +++---
arch/x86/include/asm/uaccess_64.h | 2 +-
fs/proc/inode.c | 16 ++++++++--------
include/linux/cleanup.h | 4 ++--
include/linux/compiler.h | 2 +-
include/linux/compiler_types.h | 13 +++++++++++++
include/linux/minmax.h | 6 +++---
tools/testing/selftests/bpf/prog_tests/socket_helpers.h | 9 +++++++--
tools/virtio/linux/compiler.h | 2 +-
11 files changed, 42 insertions(+), 24 deletions(-)
Hi all,
I was starting to work on the memfd-exec[1] feature and observed that
Landlock's scoped-IPC features (abstract UNIX sockets and signals)
follow a consistent high-level model, which I'm calling a
resource-accessor pattern:
Resource Process <-> Accessor Process
- Resource process: owns or manages the asset
- socket creator (bind/accept)
- signal handler
- memfd creator
- Accessor process: attempts to use the asset
- socket client (connect/sendto)
- signal sender
- memfd executor
RESOURCE-ACCESSOR PATTERN FUNDAMENTALS
======================================
This pattern appears fundamental to Landlock scoping because:
1. Consistent enforcement model: Landlock restrictions are enforced
only on the accessor side; the resource side remains unconstrained
across all scope types.
2. Reflects actual security boundaries: In practice, sandboxed
processes typically need to access resources created by other
processes, not the reverse.
3. Scalable design: This model works consistently whether processes
are in parent-child relationships or independent peer domains.
4. Real-world usage patterns: Container runtimes and sandbox
orchestrators routinely start multiple workers that restrict
themselves independently.
CURRENT TEST COVERAGE GAP
=========================
Existing self-tests cover hierarchical resource <-> accessor pairs
but do not exercise the case where each task enters an independent
domain. While 'sibling_domain' tests exist, they still use
parent-child relationship patterns rather than true peer domains.
Current Coverage (Linear Hierarchies Only):
-------------------------------------------
Type 1: Parent-Child (scoped_domains)
P1 ---- P2
Type 2: Three Generations (scoped_vs_unscoped)
P1 ---- P2 ---- P3
Variations tested for both types:
- No domains
- Various scoped domain combinations
- Nested domains within inherited domains
- Mixed domain types (SCOPE vs OTHER vs NONE)
Missing Coverage (True Sibling Scenarios):
------------------------------------------
Root
|
+-- Child A [various domain types]
|
+-- Child B [various domain types]
Missing test scenarios:
- A <-> B cross-sibling communication
- Mixed sibling domain combinations
- Sibling isolation enforcement
- Parent -> A, Parent -> B differential access
SOLUTION
========
This series implements the missing sibling pattern using the
resource-accessor model. The tests create a fork tree that looks
like this:
coordinator (no domain)
|
+-- resource_proc (Domain X) /* owns the resource */
|
+-- accessor_proc (Domain Y) /* tries to access */
This directly addresses the missing coverage by creating two
independent child processes that establish peer domains, rather than
the hierarchical parent-child domains covered by existing tests.
Both children call landlock_restrict_self() for the first time, so
their struct landlock_domain->parent pointers are NULL, creating
true peer domains. The harness exposes four test variants:
Variant name | Resource domain | Accessor domain | Result
-------------------|-----------------|-----------------|----------
none_to_none | none | none | ALLOW
none_to_scoped | none | scoped | DENY
scoped_to_none | scoped | none | ALLOW
scoped_to_scoped | scoped | scoped (peer) | DENY
The scoped_to_scoped case was missing from current coverage.
TESTING
=======
All patches apply cleanly to v6.14-rc2 and pass on landlock/master.
The helpers are small and re-use the existing kselftest_harness.h
fixture/variant pattern. All patches have been validated with
scripts/checkpatch.pl --strict and show no warnings.
This series introduces **no kernel changes**, only selftests additions.
Feedback very welcome.
Thanks,
Abhinav
[1] https://github.com/landlock-lsm/linux/issues/37
Links:
- Landlock documentation: https://docs.kernel.org/userspace-api/landlock.html
- Landlock LSM kernel docs: https://docs.kernel.org/security/landlock.html
- Existing tests: tools/testing/selftests/landlock/scoped_*
Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com>
---
Abhinav Saxena (3):
selftests/landlock: move sandbox_type to common
selftests/landlock: add cross-domain variants
selftests/landlock: add cross-domain signal tests
tools/testing/selftests/landlock/scoped_common.h | 7 +
.../landlock/scoped_cross_domain_variants.h | 54 +++++
.../landlock/scoped_multiple_domain_variants.h | 7 -
.../selftests/landlock/scoped_signal_test.c | 237 +++++++++++++++++++++
4 files changed, 298 insertions(+), 7 deletions(-)
---
base-commit: 5b74b2eff1eeefe43584e5b7b348c8cd3b723d38
change-id: 20250715-landlock_abstractions-dbc0aabf1063
Best regards,
--
Abhinav Saxena <xandfury(a)gmail.com>
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v10 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best Regards,
Chia-Yu
---
v13 (18-Jul-2025)
- Implement tcp_accecn_extract_syn_ect() and tcp_accecn_reflector_flags() with static array lookup of patch #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Fix typos in comments of #6 and remove patch #7 of v12 about simulatenous connect (Paolo Abeni <pabeni(a)redhat.com>)
- Move TCP_ACCECN_E1B_INIT_OFFSET, TCP_ACCECN_E0B_INIT_OFFSET, and TCP_ACCECN_CEB_INIT_OFFSET from patch #7 to #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use static array lookup in tcp_accecn_optfield_to_ecnfield() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return false when WARN_ON_ONCE() is true in tcp_accecn_process_option() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Make synack_ecn_bytes as static const array and use const u32 pointer in tcp_options_write() of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use ALIGN() and ALIGN_DOWN() in tcp_options_fit_accecn() to pad TCP AccECN option to dword of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return TCP_ACCECN_OPT_FAIL_SEEN if WARN_ON_ONCE() is true in tcp_accecn_option_init() of #12 (Paolo Abeni <pabeni(a)redhat.com>)
v12 (04-Jul-2025)
- Fix compilation issues with some intermediate patches in v11
- Add more comments for AccECN helpers of tcp_ecn.h
v11 (03-Jul-2025)
- Fix compilation issues with some intermediate patches in v10
v10 (02-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() over every reads of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)
v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEE flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accruate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restruct the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)
v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)
v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results
v6 (09-May-2025)
- Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
---
Chia-Yu Chang (5):
tcp: reorganize tcp_sock_write_txrx group for variables later
tcp: ecn functions in separated include file
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (9):
tcp: reorganize SYN ECN code
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics
Documentation/networking/ip-sysctl.rst | 55 +-
.../networking/net_cachelines/tcp_sock.rst | 12 +
include/linux/tcp.h | 32 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 87 ++-
include/net/tcp_ecn.h | 649 ++++++++++++++++++
include/uapi/linux/tcp.h | 7 +
net/ipv4/syncookies.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 28 +-
net/ipv4/tcp_input.c | 353 ++++++++--
net/ipv4/tcp_ipv4.c | 8 +-
net/ipv4/tcp_minisocks.c | 40 +-
net/ipv4/tcp_output.c | 294 ++++++--
net/ipv6/syncookies.c | 2 +
net/ipv6/tcp_ipv6.c | 1 +
16 files changed, 1409 insertions(+), 184 deletions(-)
create mode 100644 include/net/tcp_ecn.h
--
2.34.1
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v24.
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
---
v24 (18-Jul-2025)
- Replace TCA_DUALPI2 prefix with TC_DUALPI2 for enums in pkt_sched.h (Jakub Kicinski <kuba(a)kernel.org>)
- Report error if both packet and time step thresholds are provided (Jakub Kicinski <kuba(a)kernel.org>)
v23 (13-Jul-2025) and v22 (11-Jul-2025)
- Fix issue when user would like to change DualPI2 but provides an empty TCA_OPTIONS with no nested attributes (Paolo Abeni <pabeni(a)redhat.com>, Jakub Kicinski <kuba(a)kernel.org>)
v21 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>)
- Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>)
- Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>)
- Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>)
- Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>)
- Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub Kicinski <kuba(a)kernel.org>)
- Remove redundant name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>)
- Fix some typos and format issues of dualpi2 attributes
v20 (21-Jun-2025)
- Add one more commit to fix warning and style check on tdc.sh reported by shellcheck
- Remove double-prefixed of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>)
v19 (14-Jun-2025)
- Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>)
v18 (13-Jun-2025)
- Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute
- Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>)
v17 (25-May-2025, Resent at 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
---
Chia-Yu Chang (5):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Fix warning and style check on tdc.sh
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 151 ++-
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 68 +
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1174 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 6 +-
9 files changed, 1668 insertions(+), 5 deletions(-)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
Syzkaller found this, the ALIGN() call can overflow and corrupt the
allocation process. Fix the bug and add some test coverage.
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
Jason Gunthorpe (2):
iommufd: Prevent ALIGN() overflow
iommufd/selftest: Test reserved regions near ULONG_MAX
drivers/iommu/iommufd/io_pagetable.c | 41 +++++++++++++++----------
tools/testing/selftests/iommu/iommufd.c | 18 +++++++++++
2 files changed, 43 insertions(+), 16 deletions(-)
base-commit: 601b1d0d9395c711383452bd0d47037afbbb4bcf
--
2.43.0
cpu.max selftests (both the normal one and the nested one) test the
working of throttling by setting up cpu.max, running a cpu hog process
for a specified duration, and comparing usage_usec as reported by
cpu.stat with the duration of the cpu hog: they should be far enough.
Currently, this is done by using values_close, which has two problems:
1. Semantic: values_close is used with an error percentage of 95%, which
one will not expect on seeing "values close". The intent it's
actually going for is "values far".
2. Accuracy: the tests can pass even if usage_usec is upto around double
the expected amount. That's too high of a margin for usage_usec.
Overall, this patchset improves the readability and accuracy of the
cpu.max tests.
Signed-off-by: Shashank Balaji <shashank.mahadasyam(a)sony.com>
---
Shashank Balaji (2):
selftests/cgroup: rename `expected` to `duration` in cpu.max tests
selftests/cgroup: better bound in cpu.max tests
tools/testing/selftests/cgroup/test_cpu.c | 42 ++++++++++++++++++-------------
1 file changed, 24 insertions(+), 18 deletions(-)
---
base-commit: 66701750d5565c574af42bef0b789ce0203e3071
change-id: 20250227-kselftest-cgroup-fix-cpu-max-56619928e99b
Best regards,
--
Shashank Balaji <shashank.mahadasyam(a)sony.com>
I'd like to cut down the memory usage of parsing vmlinux BTF in ebpf-go.
With some upcoming changes the library is sitting at 5MiB for a parse.
Most of that memory is simply copying the BTF blob into user space.
By allowing vmlinux BTF to be mmapped read-only into user space I can
cut memory usage by about 75%.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v5:
- Fix error return of btf_parse_raw_mmap (Andrii)
- Link to v4: https://lore.kernel.org/r/20250510-vmlinux-mmap-v4-0-69e424b2a672@isovalent…
Changes in v4:
- Go back to remap_pfn_range for aarch64 compat
- Dropped btf_new_no_copy (Andrii)
- Fixed nits in selftests (Andrii)
- Clearer error handling in the mmap handler (Andrii)
- Fixed build on s390
- Link to v3: https://lore.kernel.org/r/20250505-vmlinux-mmap-v3-0-5d53afa060e8@isovalent…
Changes in v3:
- Remove slightly confusing calculation of trailing (Alexei)
- Use vm_insert_page (Alexei)
- Simplified libbpf code
- Link to v2: https://lore.kernel.org/r/20250502-vmlinux-mmap-v2-0-95c271434519@isovalent…
Changes in v2:
- Use btf__new in selftest
- Avoid vm_iomap_memory in btf_vmlinux_mmap
- Add VM_DONTDUMP
- Add support to libbpf
- Link to v1: https://lore.kernel.org/r/20250501-vmlinux-mmap-v1-0-aa2724572598@isovalent…
---
Lorenz Bauer (3):
btf: allow mmap of vmlinux btf
selftests: bpf: add a test for mmapable vmlinux BTF
libbpf: Use mmap to parse vmlinux BTF from sysfs
include/asm-generic/vmlinux.lds.h | 3 +-
kernel/bpf/sysfs_btf.c | 32 ++++++++
tools/lib/bpf/btf.c | 89 +++++++++++++++++-----
tools/testing/selftests/bpf/prog_tests/btf_sysfs.c | 81 ++++++++++++++++++++
4 files changed, 186 insertions(+), 19 deletions(-)
---
base-commit: 7220eabff8cb4af3b93cd021aa853b9f5df2923f
change-id: 20250501-vmlinux-mmap-2ec5563c3ef1
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the DualPI2 patch v23.
This patch serise adds DualPI Improved with a Square (DualPI2) with following features:
* Supports congestion controls that comply with the Prague requirements in RFC9331 (e.g. TCP-Prague)
* Coupled dual-queue that separates the L4S traffic in a low latency queue (L-queue), without harming remaining traffic that is scheduled in classic queue (C-queue) due to congestion-coupling using PI2 as defined in RFC9332
* Configurable overload strategies
* Use of sojourn time to reliably estimate queue delay
* Supports ECN L4S-identifier (IP.ECN==0b*1) to classify traffic into respective queues
For more details of DualPI2, please refer IETF RFC9332 (https://datatracker.ietf.org/doc/html/rfc9332).
Best regards,
Chia-Yu
---
v23 (13-Jul-2025) and v22 (11-Jul-2025)
- Fix issue when user would like to change DualPI2 but provides an empty TCA_OPTIONS with no nested attributes (Paolo Abeni <pabeni(a)redhat.com>, Jakub Kicinski <kuba(a)kernel.org>)
v21 (02-Jul-2025)
- Replace STEP_THRESH and STEP_PACKETS with STEP_THRESH_PKTS and STEP_THRESH_US (Jakub Kicinski <kuba(a)kernel.org>)
- Move READ_ONCE and WRITE_ONCE to later DualPI2 patches (Jakub Kicinski <kuba(a)kernel.org>)
- Replace NLA_POLICY_FULL_RANGE with NLA_POLICY_RANGE (Jakub Kicinski <kuba(a)kernel.org>)
- Set extra error message for dualpi2_change (Jakub Kicinski <kuba(a)kernel.org>)
- Drop redundant else for better readability (Paolo Abeni <pabeni(a)redhat.com>)
- Replace step-thresh and step-packets with step-thresh-pkts and step-thresh-us (Jakub Kicinski <kuba(a)kernel.org>)
- Remove redundant name-prefix and simplify entries of dualpi2 enums (Jakub Kicinski <kuba(a)kernel.org>)
- Fix some typos and format issues of dualpi2 attributes
v20 (21-Jun-2025)
- Add one more commit to fix warning and style check on tdc.sh reported by shellcheck
- Remove double-prefixed of "tc_tc_dualpi2_attrs" in tc-user.h (Donald Hunter <donald.hunter(a)gmail.com>)
v19 (14-Jun-2025)
- Fix one typo in the comment of #1 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Update commit message of #4 (ALOK TIWARI <alok.a.tiwari(a)oracle.com>)
- Wrap long lines of Documentation/netlink/specs/tc.yaml to within 80 characters (Jakub Kicinski <kuba(a)kernel.org>)
v18 (13-Jun-2025)
- Add the num of enum used by DualPI2 and fix name and name-prefix of DualPI2 enum and attribute
- Replace from_timer() with timer_container_of() (Pedro Tammela <pctammela(a)mojatatu.com>)
v17 (25-May-2025, Resent at 11-Jun-2025)
- Replace 0xffffffff with U32_MAX (Paolo Abeni <pabeni(a)redhat.com>)
- Use helper function qdisc_dequeue_internal() and add new helper function skb_apply_step() (Paolo Abeni <pabeni(a)redhat.com>)
- Add s64 casting when calculating the delta of the PI controller (Paolo Abeni <pabeni(a)redhat.com>)
- Change the drop reason into SKB_DROP_REASON_QDISC_CONGESTED for drop_early (Paolo Abeni <pabeni(a)redhat.com>)
- Modify the condition to remove the original skb when enqueuing multiple GSO segments (Paolo Abeni <pabeni(a)redhat.com>)
- Add READ_ONCE() in dualpi2_dump_stat() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments, brackets, and brackets for readability (Paolo Abeni <pabeni(a)redhat.com>)
v16 (16-MAy-2025)
- Add qdisc_lock() to dualpi2_timer() in dualpi2_timer (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce convert_ns_to_usec() to convert usec to nsec without overflow in #1 (Paolo Abeni <pabeni(a)redhat.com>)
- Update convert_us_tonsec() to convert nsec to usec without overflow in #2 (Paolo Abeni <pabeni(a)redhat.com>)
- Add more descriptions with respect to DualPI2 in the cover ltter and add changelog in each patch (Paolo Abeni <pabeni(a)redhat.com>)
v15 (09-May-2025)
- Add enum of TCA_DUALPI2_ECN_MASK_CLA_ECT to remove potential leakeage in #1 (Simon Horman <horms(a)kernel.org>)
- Fix one typo in comment of #2
- Update tc.yaml in #5 to aligh with the updated enum of pkt_sched.h
v14 (05-May-2025)
- Modify tc.yaml: (1) Replace flags with enum and remove enum-as-flags, (2) Remove credit-queue in xstats, and (3) Change attribute types (Donald Hunter <donald.hun
- Add enum and fix the ordering of variables in pkt_sched.h to align with the modified tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add validators for DROP_OVERLOAD, DROP_EARLY, ECN_MASK, and SPLIT_GSO in sch_dualpi2.c (Donald Hunter <donald.hunter(a)gmail.com>)
- Update dualpi2.json to align with the updated variable order in pkt_sched.h
- Reorder patches (Donald Hunter <donald.hunter(a)gmail.com>)
v13 (26-Apr-2025)
- Use dashes in member names to follow YNL conventions in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Define enumerations separately for flags of drop-early, drop-overload, ecn-mask, credit-queue in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the types of split-gso and step-packets into flag in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Revert to u32/u8 types for tc-dualpi2-xstats members in tc.yaml (Donald Hunter <donald.hunter(a)gmail.com>)
- Add new test cases in tc-tests/qdiscs/dualpi2.json to cover all dualpi2 parameters (Donald Hunter <donald.hunter(a)gmail.com>)
- Change the type of TCA_DUALPI2_STEP_PACKETS into NLA_FLAG (Donald Hunter <donald.hunter(a)gmail.com>)
v12 (22-Apr-2025)
- Remove anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Replace u32/u8 with uint and s32 with int in tc spec document (Paolo Abeni <pabeni(a)redhat.com>)
- Introduce get_memory_limit function to handle potential overflow when multipling limit with MTU (Paolo Abeni <pabeni(a)redhat.com>)
- Double the packet length to further include packet overhead in memory_limit (Paolo Abeni <pabeni(a)redhat.com>)
- Remove the check of qdisc_qlen(sch) when calling qdisc_tree_reduce_backlog (Paolo Abeni <pabeni(a)redhat.com>)
v11 (15-Apr-2025)
- Replace hstimer_init with hstimer_setup in sch_dualpi2.c
v10 (25-Mar-2025)
- Remove leftover include in include/linux/netdevice.h and anonymous struct in sch_dualpi2.c (Paolo Abeni <pabeni(a)redhat.com>)
- Use kfree_skb_reason() and add SKB_DROP_REASON_DUALPI2_STEP_DROP drop reason (Paolo Abeni <pabeni(a)redhat.com>)
- Split sch_dualpi2.c into 3 patches (and overall 5 patches): Struct definition & parsing, Dump stats & configuration, Enqueue/Dequeue (Paolo Abeni <pabeni(a)redhat.com>)
v9 (16-Mar-2025)
- Fix mem_usage error in previous version
- Add min_qlen_step to the dualpi2 attribute as the minimum queue length in number of packets in the L-queue to start step threshold marking.
In previous versions, this value was fixed to 2, so the step threshold was applied to mark packets in the L queue only when the queue length of the L queue was greater than or equal to 2 packets.
This will cause larger queuing delays for L4S traffic at low rates (<20Mbps). So we parameterize it and change the default value to 0.
Comparison of tcp_1down run 'HTB 20Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 11.55 11.70 ms 350
TCP upload avg : 18.96 N/A Mbits/s 350
TCP upload sum : 18.96 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 10.81 10.70 ms 350
TCP upload avg : 18.91 N/A Mbits/s 350
TCP upload sum : 18.91 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 12.61 12.80 ms 350
TCP upload avg : 9.48 N/A Mbits/s 350
TCP upload sum : 9.48 N/A Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.06 10.80 ms 350
TCP upload avg : 9.43 N/A Mbits/s 350
TCP upload sum : 9.43 N/A Mbits/s 350
Comparison of tcp_1down run 'HTB 10Mbit + DUALPI2 + 10ms base delay'
Old versions:
avg median # data pts
Ping (ms) ICMP : 40.86 37.45 ms 350
TCP upload avg : 0.88 N/A Mbits/s 350
TCP upload sum : 0.88 N/A Mbits/s 350
TCP upload::1 : 0.88 0.97 Mbits/s 350
New version (v9):
avg median # data pts
Ping (ms) ICMP : 11.07 10.40 ms 350
TCP upload avg : 0.55 N/A Mbits/s 350
TCP upload sum : 0.55 N/A Mbits/s 350
TCP upload::1 : 0.55 0.59 Mbits/s 350
v8 (11-Mar-2025)
- Fix warning messages in v7
v7 (07-Mar-2025)
- Separate into 3 patches to avoid mixing changes of documentation, selftest, and code. (Cong Wang <xiyou.wangcong(a)gmail.com>)
v6 (04-Mar-2025)
- Add modprobe for dulapi2 in tc-testing script tc-testing/tdc.sh (Jakub Kicinski <kuba(a)kernel.org>)
- Update test cases in dualpi2.json
- Update commit message
v5 (22-Feb-2025)
- A comparison was done between MQ + DUALPI2, MQ + FQ_PIE, MQ + FQ_CODEL:
Unshaped 1gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
- Summary of tcp_4down run 'MQ + FQ_PIE'
avg median # data pts
Ping (ms) ICMP : 1.21 1.37 ms 350
TCP download avg : 235.42 N/A Mbits/s 350
TCP download sum : 941.61 N/A Mbits/s 350
TCP download::1 : 232.54 233.13 Mbits/s 350
TCP download::2 : 232.52 232.80 Mbits/s 350
TCP download::3 : 233.14 233.78 Mbits/s 350
TCP download::4 : 243.41 241.48 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2'
avg median # data pts
Ping (ms) ICMP : 1.19 1.34 ms 349
TCP download avg : 235.42 N/A Mbits/s 349
TCP download sum : 941.68 N/A Mbits/s 349
TCP download::1 : 235.19 235.39 Mbits/s 349
TCP download::2 : 235.03 235.35 Mbits/s 349
TCP download::3 : 236.89 235.44 Mbits/s 349
TCP download::4 : 234.57 235.19 Mbits/s 349
Unshaped 1gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 1.88 1.86 ms 350
TCP download avg : 7.39 N/A Mbits/s 350
TCP download sum : 946.47 N/A Mbits/s 350
Unshaped 10gigE with 4 download streams test:
- Summary of tcp_4down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 0.22 0.23 ms 350
TCP download avg : 2354.08 N/A Mbits/s 350
TCP download sum : 9416.31 N/A Mbits/s 350
TCP download::1 : 2353.65 2352.81 Mbits/s 350
TCP download::2 : 2354.54 2354.21 Mbits/s 350
TCP download::3 : 2353.56 2353.78 Mbits/s 350
TCP download::4 : 2354.56 2354.45 Mbits/s 350
- Summary of tcp_4down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 0.20 0.19 ms 350
TCP download avg : 2354.76 N/A Mbits/s 350
TCP download sum : 9419.04 N/A Mbits/s 350
TCP download::1 : 2354.77 2353.89 Mbits/s 350
TCP download::2 : 2353.41 2354.29 Mbits/s 350
TCP download::3 : 2356.18 2354.19 Mbits/s 350
TCP download::4 : 2354.68 2353.15 Mbits/s 350
- Summary of tcp_4down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 0.24 0.24 ms 350
TCP download avg : 2354.11 N/A Mbits/s 350
TCP download sum : 9416.43 N/A Mbits/s 350
TCP download::1 : 2354.75 2353.93 Mbits/s 350
TCP download::2 : 2353.15 2353.75 Mbits/s 350
TCP download::3 : 2353.49 2353.72 Mbits/s 350
TCP download::4 : 2355.04 2353.73 Mbits/s 350
Unshaped 10gigE with 128 download streams test:
- Summary of tcp_128down run 'MQ + FQ_CODEL':
avg median # data pts
Ping (ms) ICMP : 7.57 8.69 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9467.82 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + FQ_PIE':
avg median # data pts
Ping (ms) ICMP : 7.82 8.91 ms 350
TCP download avg : 73.97 N/A Mbits/s 350
TCP download sum : 9468.42 N/A Mbits/s 350
- Summary of tcp_128down run 'MQ + DUALPI2':
avg median # data pts
Ping (ms) ICMP : 6.87 7.93 ms 350
TCP download avg : 73.95 N/A Mbits/s 350
TCP download sum : 9465.87 N/A Mbits/s 350
From the results shown above, we see small differences between combinations.
- Update commit message to include results of no_split_gso and split_gso (Dave Taht <dave.taht(a)gmail.com> and Paolo Abeni <pabeni(a)redhat.com>)
- Add memlimit in the dualpi2 attribute, and add memory_used, max_memory_used, memory_limit in dualpi2 stats (Dave Taht <dave.taht(a)gmail.com>)
- Update note in sch_dualpi2.c related to BBRv3 status (Dave Taht <dave.taht(a)gmail.com>)
- Update license identifier (Dave Taht <dave.taht(a)gmail.com>)
- Add selftest in tools/testing/selftests/tc-testing (Cong Wang <xiyou.wangcong(a)gmail.com>)
- Use netlink policies for parameter checks (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Modify texts & fix typos in Documentation/netlink/specs/tc.yaml (Dave Taht <dave.taht(a)gmail.com>)
- Add descriptions of packet counter statistics and the reset function of sch_dualpi2.c
- Fix step_thresh in packets
- Update code comments in sch_dualpi2.c
v4 (22-Oct-2024)
- Update statement in Kconfig for DualPI2 (Stephen Hemminger <stephen(a)networkplumber.org>)
- Put a blank line after #define in sch_dualpi2.c (Stephen Hemminger <stephen(a)networkplumber.org>)
- Fix line length warning.
v3 (19-Oct-2024)
- Fix compilaiton error
- Update Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Oct-2024)
- Add Documentation/netlink/specs/tc.yaml (Jakub Kicinski <kuba(a)kernel.org>)
- Use dualpi2 instead of skb prefix (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Replace nla_parse_nested_deprecated with nla_parse_nested (Jamal Hadi Salim <jhs(a)mojatatu.com>)
- Fix line length warning
---
Chia-Yu Chang (5):
sched: Struct definition and parsing of dualpi2 qdisc
sched: Dump configuration and statistics of dualpi2 qdisc
selftests/tc-testing: Fix warning and style check on tdc.sh
selftests/tc-testing: Add selftests for qdisc DualPI2
Documentation: netlink: specs: tc: Add DualPI2 specification
Koen De Schepper (1):
sched: Add enqueue/dequeue of dualpi2 qdisc
Documentation/netlink/specs/tc.yaml | 151 ++-
include/net/dropreason-core.h | 6 +
include/uapi/linux/pkt_sched.h | 68 +
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 1171 +++++++++++++++++
tools/testing/selftests/tc-testing/config | 1 +
.../tc-testing/tc-tests/qdiscs/dualpi2.json | 254 ++++
tools/testing/selftests/tc-testing/tdc.sh | 6 +-
9 files changed, 1665 insertions(+), 5 deletions(-)
create mode 100644 net/sched/sch_dualpi2.c
create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/dualpi2.json
--
2.34.1
Some build environments for the selftests are not picking up the newly
added AT_HWCAP3 when using the libc headers, even with headers_install
(which we require already for the arm64 selftests). As a quick fix add
local definitions of the constant to tools use it, while auxvec.h is
installed with some toolchains it needs some persuasion to get picked up.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/arm64/abi/hwcap.c | 4 ++++
tools/testing/selftests/arm64/mte/check_prctl.c | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/tools/testing/selftests/arm64/abi/hwcap.c b/tools/testing/selftests/arm64/abi/hwcap.c
index 35f521e5f41c..aa902408facd 100644
--- a/tools/testing/selftests/arm64/abi/hwcap.c
+++ b/tools/testing/selftests/arm64/abi/hwcap.c
@@ -21,6 +21,10 @@
#define TESTS_PER_HWCAP 3
+#ifndef AT_HWCAP3
+#define AT_HWCAP3 29
+#endif
+
/*
* Function expected to generate exception when the feature is not
* supported and return when it is supported. If the specific exception
diff --git a/tools/testing/selftests/arm64/mte/check_prctl.c b/tools/testing/selftests/arm64/mte/check_prctl.c
index 4c89e9538ca0..c36c4c49ff95 100644
--- a/tools/testing/selftests/arm64/mte/check_prctl.c
+++ b/tools/testing/selftests/arm64/mte/check_prctl.c
@@ -12,6 +12,10 @@
#include "kselftest.h"
+#ifndef AT_HWCAP3
+#define AT_HWCAP3 29
+#endif
+
static int set_tagged_addr_ctrl(int val)
{
int ret;
---
base-commit: 86731a2a651e58953fc949573895f2fa6d456841
change-id: 20250710-arm64-selftest-bodge-hwcap3-b6ab30ab69cd
Best regards,
--
Mark Brown <broonie(a)kernel.org>
The vIOMMU object is designed to represent a slice of an IOMMU HW for its
virtualization features shared with or passed to user space (a VM mostly)
in a way of HW acceleration. This extended the HWPT-based design for more
advanced virtualization feature.
HW QUEUE introduced by this series as a part of the vIOMMU infrastructure
represents a HW accelerated queue/buffer for VM to use exclusively, e.g.
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
each of which allows its IOMMU HW to directly access a queue memory owned
by a guest VM and allows a guest OS to control the HW queue direclty, to
avoid VM Exit overheads to improve the performance.
Introduce IOMMUFD_OBJ_HW_QUEUE and its pairing IOMMUFD_CMD_HW_QUEUE_ALLOC
allowing VMM to forward the IOMMU-specific queue info, such as queue base
address, size, and etc.
Meanwhile, a guest-owned queue needs the guest kernel to control the queue
by reading/writing its consumer and producer indexes, via MMIO acceses to
the hardware MMIO registers. Introduce an mmap infrastructure for iommufd
to support passing through a piece of MMIO region from the host physical
address space to the guest physical address space. The mmap info (offset/
length) used by an mmap syscall must be pre-allocated and returned to the
user space via an output driver-data during an IOMMUFD_CMD_HW_QUEUE_ALLOC
call. Thus, it requires a driver-specific user data support in the vIOMMU
allocation flow.
As a real-world use case, this series implements a HW QUEUE support in the
tegra241-cmdqv driver for VCMDQs on NVIDIA Grace CPU. In another word, it
is also the Tegra CMDQV series Part-2 (user-space support), reworked from
Previous RFCv1:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to
the standard SMMUv3 operating in the nested translation mode trapping CMDQ
for TLBI and ATC_INV commands, this gives a huge performance improvement:
70% to 90% reductions of invalidation time were measured by various DMA
unmap tests running in a guest OS.
// Unmap latencies from "dma_map_benchmark -g @granule -t @threads",
// by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq"
@granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0
4KB 1 35.7 us 5.3 us
16KB 1 41.8 us 6.8 us
64KB 1 68.9 us 9.9 us
128KB 1 109.0 us 12.6 us
256KB 1 187.1 us 18.0 us
4KB 2 96.9 us 6.8 us
16KB 2 97.8 us 7.5 us
64KB 2 151.5 us 10.7 us
128KB 2 257.8 us 12.7 us
256KB 2 443.0 us 17.9 us
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v9
Paring QEMU branch for testing (reusing v8):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v8
Changelog
v9 (attached git-diff v8..v9 at the end of this letter)
* Add Reviewed-by from Vasant and Jason
* [iommufd] Fix offset calculation
* [iommufd] Add unaligned iova/length selftest coverage for hw_queue
* [iommufd] Pass in aligned iova/length to iommufd_access_pin_pages()
* [smmu] Change "u32 *type" at arm_smmu_hw_info() in the header
v8
https://lore.kernel.org/all/cover.1751677708.git.nicolinc@nvidia.com/
* Add Reviewed-by from Pranj, Kevin and Jason
* Improve kdoc and comments
* [iommufd] Skip selftest for no_viommu variants
* [iommufd] Add unmap coverage for non internal area
* [iommufd] Skip the first page when mtree_alloc_range()
* [iommufd] Correct the passed in index to mtree_erase()
* [iommufd] Correct variable types in iommufd_hw_queue_alloc_phys()
* [iommufd] Reject iopt_unmap_iova_range() if area->num_locks is set
* [tegra] Rename "SID replacement" with "SID mapping"
* [tegra] Unwrap useless _tegra241_vcmdq_hw_init helper
v7
https://lore.kernel.org/all/cover.1750966133.git.nicolinc@nvidia.com/
* Rebased on Jason's for-next tree (iommufd_hw_queue-prep series)
* Add Reviewed-by from Baolu, Jason, Pranjal
* Update kdocs and notes
* [iommu] Replace "u32" with "enum iommu_hw_info_type"
* [iommufd] Rename vdev->id to vdev->virt_id
* [iommufd] Replace macros with inline helpers
* [iommufd] Report unmapped_bytes in error path
* [iommufd] Add iommufd_access_is_internal helper
* [iommufd] Do not drop ops->unmap check for mdevs
* [iommufd] Store physical addresses in immap structure
* [iommufd] Reorder access and hw_queue object allocations
* [iommufd] Scan for an internal access before any unmap call
* [iommufd] Drop unused ictx pointer in struct iommufd_hw_queue
* [iommufd] Use kcalloc to avoid failure due to memory fragmentation
* [tegra] Use "else"
* [tegra] Lock destroy() using lvcmdq_mutex
v6
https://lore.kernel.org/all/cover.1749884998.git.nicolinc@nvidia.com/
* Rebase on iommufd_hw_queue-prep-v2
* Add Reviewed-by from Kevin and Jason
* [iommufd] Update kdocs and notes
* [iommufd] Drop redundant pages[i] check
* [iommufd] Allow nesting_parent_iova to be 0
* [iommufd] Add iommufd_hw_queue_alloc_phys()
* [iommufd] Revise iommufd_viommu_alloc/destroy_mmap APIs
* [iommufd] Move destroy ops to vdevice/hw_queue structures
* [iommufd] Add union in hw_info struct to share out_data_type field
* [iommufd] Replace iopt_pin/unpin_pages() with internal access APIs
* [iommufd] Replace vdevice_alloc with vdevice_size and vdevice_init
* [iommufd] Replace hw_queue_alloc with get_hw_queue_size/hw_queue_init
* [iommufd] Replace IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA with init_phys
* [smmu] Drop arm_smmu_domain_ipa_to_pa
* [smmu] Update arm_smmu_impl_ops changes for vsmmu_init
* [tegra] Add a vdev_to_vsid macro
* [tegra] Add lvcmdq_mutex to protect multi queues
* [tegra] Drop duplicated kcalloc for vintf->lvcmdqs (memory leak)
v5
https://lore.kernel.org/all/cover.1747537752.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc6
* Add Reviewed-by from Jason and Kevin
* Correct typos in kdoc and update commit logs
* [iommufd] Add a cosmetic fix
* [iommufd] Drop unused num_pfns
* [iommufd] Drop unnecessary check
* [iommufd] Reorder patch sequence
* [iommufd] Use io_remap_pfn_range()
* [iommufd] Use success oriented flow
* [iommufd] Fix max_npages calculation
* [iommufd] Add more selftest coverage
* [iommufd] Drop redundant static_assert
* [iommufd] Fix mmap pfn range validation
* [iommufd] Reject unmap on pinned iovas
* [iommufd] Drop redundant vm_flags_set()
* [iommufd] Drop iommufd_struct_destroy()
* [iommufd] Drop redundant queue iova test
* [iommufd] Use "mmio_addr" and "mmio_pfn"
* [iommufd] Rename to "nesting_parent_iova"
* [iommufd] Make iopt_pin_pages call option
* [iommufd] Add ictx comparison in depend()
* [iommufd] Add iommufd_object_alloc_ucmd()
* [iommufd] Move kcalloc() after validations
* [iommufd] Replace ictx setting with WARN_ON
* [iommufd] Make hw_info's type bidirectional
* [smmu] Add supported_vsmmu_type in impl_ops
* [smmu] Drop impl report in smmu vendor struct
* [tegra] Add IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV
* [tegra] Replace "number of VINTFs" with a note
* [tegra] Drop the redundant lvcmdq pointer setting
* [tegra] Flag IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA
* [tegra] Use "vintf_alloc_vsid" for vdevice_alloc op
v4
https://lore.kernel.org/all/cover.1746757630.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc5
* Add Reviewed-by from Vasant
* Rename "vQUEUE" to "HW QUEUE"
* Use "offset" and "length" for all mmap-related variables
* [iommufd] Use u64 for guest PA
* [iommufd] Fix typo in uAPI doc
* [iommufd] Rename immap_id to offset
* [iommufd] Drop the partial-size mmap support
* [iommufd] Do not replace WARN_ON with WARN_ON_ONCE
* [iommufd] Use "u64 base_addr" for queue base address
* [iommufd] Use u64 base_pfn/num_pfns for immap structure
* [iommufd] Correct the size passed in to mtree_alloc_range()
* [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops
v3
https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/
* Add Reviewed-by from Baolu, Pranjal, and Alok
* Revise kdocs, uAPI docs, and commit logs
* Rename "vCMDQ" back to "vQUEUE" for AMD cases
* [tegra] Add tegra241_vcmdq_hw_flush_timeout()
* [tegra] Rename vsmmu_alloc to alloc_vintf_user
* [tegra] Use writel for SID replacement registers
* [tegra] Move mmap removal call to vsmmu_destroy op
* [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user()
* [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED()
* [iommufd] Add an object-type "owner" to immap structure
* [iommufd] Drop the ictx input in the new for-driver APIs
* [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle
* [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers
* [iommufd] Rename iommufd_ctx_alloc/free_mmap to
_iommufd_alloc/destroy_mmap
v2
https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/
* Add Reviewed-by from Jason
* [smmu] Fix vsmmu initial value
* [smmu] Support impl for hw_info
* [tegra] Rename "slot" to "vsid"
* [tegra] Update kdocs and commit logs
* [tegra] Map/unmap LVCMDQ dynamically
* [tegra] Refcount the previous LVCMDQ
* [tegra] Return -EEXIST if LVCMDQ exists
* [tegra] Simplify VINTF cleanup routine
* [tegra] Use vmid and s2_domain in vsmmu
* [tegra] Rename "mmap_pgoff" to "immap_id"
* [tegra] Add more addr and length validation
* [iommufd] Add more narrative to mmap's kdoc
* [iommufd] Add iommufd_struct_depend/undepend()
* [iommufd] Rename vcmdq_free op to vcmdq_destroy
* [iommufd] Fix bug in iommu_copy_struct_to_user()
* [iommufd] Drop is_io from iommufd_ctx_alloc_mmap()
* [iommufd] Test the queue memory for its contiguity
* [iommufd] Return -ENXIO if address or length fails
* [iommufd] Do not change @min_last in mock_viommu_alloc()
* [iommufd] Generalize TEGRA241_VCMDQ data in core structure
* [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC
* [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping
v1
https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (29):
iommufd: Report unmapped bytes in the error path of
iopt_unmap_iova_range
iommufd: Correct virt_id kdoc at struct iommu_vdevice_alloc
iommufd/viommu: Explicitly define vdev->virt_id
iommu: Use enum iommu_hw_info_type for type in hw_info op
iommu: Add iommu_copy_struct_to_user helper
iommu: Pass in a driver-level user data structure to viommu_init op
iommufd/viommu: Allow driver-specific user data for a vIOMMU object
iommufd/selftest: Support user_data in mock_viommu_alloc
iommufd/selftest: Add coverage for viommu data
iommufd/access: Add internal APIs for HW queue to use
iommufd/access: Bypass access->ops->unmap for internal use
iommufd/viommu: Add driver-defined vDEVICE support
iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct
iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl
iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers
iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC
iommufd: Add mmap interface
iommufd/selftest: Add coverage for the new mmap interface
Documentation: userspace-api: iommufd: Update HW QUEUE
iommu: Allow an input type in hw_info op
iommufd: Allow an input data_type via iommu_hw_info
iommufd/selftest: Update hw_info coverage for an input data_type
iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops
iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/tegra241-cmdqv: Simplify deinit flow in
tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 25 +-
drivers/iommu/iommufd/io_pagetable.h | 5 +-
drivers/iommu/iommufd/iommufd_private.h | 46 +-
drivers/iommu/iommufd/iommufd_test.h | 20 +
include/linux/iommu.h | 50 +-
include/linux/iommufd.h | 160 ++++++
include/uapi/linux/iommufd.h | 147 +++++-
tools/testing/selftests/iommu/iommufd_utils.h | 89 +++-
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 28 +-
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 477 +++++++++++++++++-
drivers/iommu/intel/iommu.c | 7 +-
drivers/iommu/iommufd/device.c | 87 +++-
drivers/iommu/iommufd/driver.c | 82 ++-
drivers/iommu/iommufd/io_pagetable.c | 13 +-
drivers/iommu/iommufd/main.c | 69 +++
drivers/iommu/iommufd/pages.c | 12 +-
drivers/iommu/iommufd/selftest.c | 153 +++++-
drivers/iommu/iommufd/viommu.c | 218 +++++++-
tools/testing/selftests/iommu/iommufd.c | 141 +++++-
.../selftests/iommu/iommufd_fail_nth.c | 15 +-
Documentation/userspace-api/iommufd.rst | 12 +
21 files changed, 1745 insertions(+), 111 deletions(-)
--
2.43.0
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index aa25156e04a3..3fa02c51df9f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1045,7 +1045,8 @@ struct arm_vsmmu {
};
#if IS_ENABLED(CONFIG_ARM_SMMU_V3_IOMMUFD)
-void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type);
+void *arm_smmu_hw_info(struct device *dev, u32 *length,
+ enum iommu_hw_info_type *type);
size_t arm_smmu_get_viommu_size(struct device *dev,
enum iommu_viommu_type viommu_type);
int arm_vsmmu_init(struct iommufd_viommu *viommu,
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 00641204efb2..91339f799916 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -206,7 +206,11 @@ static void iommufd_hw_queue_destroy_access(struct iommufd_ctx *ictx,
struct iommufd_access *access,
u64 base_iova, size_t length)
{
- iommufd_access_unpin_pages(access, base_iova, length);
+ u64 aligned_iova = PAGE_ALIGN_DOWN(base_iova);
+ u64 offset = base_iova - aligned_iova;
+
+ iommufd_access_unpin_pages(access, aligned_iova,
+ PAGE_ALIGN(length + offset));
iommufd_access_detach_internal(access);
iommufd_access_destroy_internal(ictx, access);
}
@@ -239,22 +243,23 @@ static struct iommufd_access *
iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd,
struct iommufd_viommu *viommu, phys_addr_t *base_pa)
{
+ u64 aligned_iova = PAGE_ALIGN_DOWN(cmd->nesting_parent_iova);
+ u64 offset = cmd->nesting_parent_iova - aligned_iova;
struct iommufd_access *access;
struct page **pages;
size_t max_npages;
size_t length;
- u64 offset;
size_t i;
int rc;
- offset =
- cmd->nesting_parent_iova - PAGE_ALIGN(cmd->nesting_parent_iova);
- /* DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE) */
+ /* max_npages = DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE) */
if (check_add_overflow(offset, cmd->length, &length))
return ERR_PTR(-ERANGE);
if (check_add_overflow(length, PAGE_SIZE - 1, &length))
return ERR_PTR(-ERANGE);
max_npages = length / PAGE_SIZE;
+ /* length needs to be page aligned too */
+ length = max_npages * PAGE_SIZE;
/*
* Use kvcalloc() to avoid memory fragmentation for a large page array.
@@ -274,8 +279,7 @@ iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd,
if (rc)
goto out_destroy;
- rc = iommufd_access_pin_pages(access, cmd->nesting_parent_iova,
- cmd->length, pages, 0);
+ rc = iommufd_access_pin_pages(access, aligned_iova, length, pages, 0);
if (rc)
goto out_detach;
@@ -287,13 +291,12 @@ iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd,
goto out_unpin;
}
- *base_pa = page_to_pfn(pages[0]) << PAGE_SHIFT;
+ *base_pa = (page_to_pfn(pages[0]) << PAGE_SHIFT) + offset;
kfree(pages);
return access;
out_unpin:
- iommufd_access_unpin_pages(access, cmd->nesting_parent_iova,
- cmd->length);
+ iommufd_access_unpin_pages(access, aligned_iova, length);
out_detach:
iommufd_access_detach_internal(access);
out_destroy:
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 9d5b852d5e19..d59d48022a24 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -3104,17 +3104,18 @@ TEST_F(iommufd_viommu, hw_queue)
/* Allocate index=0, declare ownership of the iova */
test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 0,
iova, PAGE_SIZE, &hw_queue_id[0]);
- /* Fail duplicate */
+ /* Fail duplicated index */
test_err_hw_queue_alloc(EEXIST, viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST,
0, iova, PAGE_SIZE, &hw_queue_id[0]);
/* Fail unmap, due to iova ownership */
test_err_ioctl_ioas_unmap(EBUSY, iova, PAGE_SIZE);
/* The 2nd page is not pinned, so it can be unmmap */
- test_ioctl_ioas_unmap(iova + PAGE_SIZE, PAGE_SIZE);
+ test_ioctl_ioas_unmap(iova2, PAGE_SIZE);
- /* Allocate index=1 */
+ /* Allocate index=1, with an unaligned case */
test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST, 1,
- iova, PAGE_SIZE, &hw_queue_id[1]);
+ iova + PAGE_SIZE / 2, PAGE_SIZE / 2,
+ &hw_queue_id[1]);
/* Fail to destroy, due to dependency */
EXPECT_ERRNO(EBUSY, _test_ioctl_destroy(self->fd, hw_queue_id[0]));
Hello,
binder_alloc_selftest provides a robust set of checks for the binder
allocator, but it rarely runs because it must hook into a running binder
process and block all other binder threads until it completes. The test
itself is a good candidate for conversion to KUnit, and it can be
further isolated from user processes by using a test-specific lru
freelist instead of the global one. This series converts the selftest
to KUnit to make it less burdensome to run and to set up a foundation
for unit testing future binder_alloc changes.
Thanks,
Tiffany
Tiffany Yang (6):
binder: Fix selftest page indexing
binder: Store lru freelist in binder_alloc
kunit: test: Export kunit_attach_mm()
binder: Scaffolding for binder_alloc KUnit tests
binder: Convert binder_alloc selftests to KUnit
binder: encapsulate individual alloc test cases
drivers/android/Kconfig | 15 +-
drivers/android/Makefile | 2 +-
drivers/android/binder.c | 10 +-
drivers/android/binder_alloc.c | 39 +-
drivers/android/binder_alloc.h | 14 +-
drivers/android/binder_alloc_selftest.c | 306 -----------
drivers/android/binder_internal.h | 4 +
drivers/android/tests/.kunitconfig | 3 +
drivers/android/tests/Makefile | 3 +
drivers/android/tests/binder_alloc_kunit.c | 573 +++++++++++++++++++++
include/kunit/test.h | 12 +
lib/kunit/user_alloc.c | 4 +-
12 files changed, 645 insertions(+), 340 deletions(-)
delete mode 100644 drivers/android/binder_alloc_selftest.c
create mode 100644 drivers/android/tests/.kunitconfig
create mode 100644 drivers/android/tests/Makefile
create mode 100644 drivers/android/tests/binder_alloc_kunit.c
--
2.50.0.727.gbf7dc18ff4-goog
Reading /proc/pid/maps requires read-locking mmap_lock which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges, however it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low priority monitoring tasks and them blocking high priority tasks
results in priority inversion.
Locking the entire address space is required to present fully coherent
picture of the address space, however even current implementation does not
strictly guarantee that by outputting vmas in page-size chunks and
dropping mmap_lock in between each chunk. Address space modifications are
possible while mmap_lock is dropped and userspace reading the content is
expected to deal with possible concurrent address space modifications.
Considering these relaxed rules, holding mmap_lock is not strictly needed
as long as we can guarantee that a concurrently modified vma is reported
either in its original form or after it was modified.
This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces the
contention with tasks modifying the address space because they would have
to contend for the same vma as opposed to the entire address space.
Previous version of this patchset [1] tried to perform /proc/pid/maps
reading under RCU, however its implementation is quite complex and the
results are worse than the new version because it still relied on
mmap_lock speculation which retries if any part of the address space gets
modified. New implementaion is both simpler and results in less
contention. Note that similar approach would not work for /proc/pid/smaps
reading as it also walks the page table and that's not RCU-safe.
Paul McKenney's designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory. At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test. The scanners keep scanning until the specified
/proc/PID/maps file disappears.
The latest results from Paul:
Stock mm-unstable, all of the runs had maximum latencies in excess of
0.5 milliseconds, and with 80% of the runs' latencies exceeding a full
millisecond, and ranging up beyond 4 full milliseconds. In contrast,
99% of the runs with this patch series applied had maximum latencies
of less than 0.5 milliseconds, with the single outlier at only 0.608
milliseconds.
From a median-performance (as opposed to maximum-latency) viewpoint,
this patch series also looks good, with stock mm weighing in at 11
microseconds and patch series at 6 microseconds, better than a 2x
improvement.
Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.011 0.008 0.521
0.011 0.008 0.552
0.011 0.008 0.590
0.011 0.008 0.660
...
0.011 0.015 2.987
0.011 0.015 3.038
0.011 0.016 3.431
0.011 0.016 4.707
After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.006 0.005 0.026
0.006 0.005 0.029
0.006 0.005 0.034
0.006 0.005 0.035
...
0.006 0.006 0.421
0.006 0.006 0.423
0.006 0.006 0.439
0.006 0.006 0.608
The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. An example of user-visible inconsistency
can be that the same vma is printed twice: once before it was modified and
then after the modifications. For example if vma was extended, it might be
found and reported twice. What is not expected is to see a gap where there
should have been a vma both before and after modification. This patchset
increases the chances of such tearing, therefore it's even more important
now to test for unexpected inconsistencies.
In [3] Lorenzo identified the following possible vma merging/splitting
scenarios:
Merges with changes to existing vmas:
1 Merge both - mapping a vma over another one and between two vmas which
can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and
completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and
partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one
and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing one
and partially over its left neighbor;
Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be
merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start end of an existing one;
Splits
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;
If such merges or splits happen concurrently with the /proc/maps reading
we might report a vma twice, once before the modification and once after
it is modified:
Case 1 might report overwritten and previous vma along with the final
merged vma;
Case 2 might report previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by
shrinking of the right neighbor;
Case 4 might report overritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by
shrinking of the left neighbor;
Case 6 might report previous vma and the gap along with the final marged
vma;
Case 7 might report previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering the
gap;
Case 9 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma end;
In all these cases the retry mechanism prevents us from reporting possible
temporary gaps.
Changes since v6 [4]:
- Updated patch 7/8 changelog, per Lorenzo Stoakes
- Added comments, per Lorenzo Stoakes
- Added Reviewed-by, per Lorenzo Stoakes and Liam Howlett
- Replaced iter with vmi, per Lorenzo Stoakes
- Renamed from lock_vma_under_mmap_lock() to
lock_next_vma_under_mmap_lock(), per Lorenzo Stoakes
- Renamed lock_next_vma() parameter from addr to from_addr
- Renamed labels in lock_next_vma() to reflect fallback cases,
per Lorenzo Stoakes
- Handle vma_start_read_locked() failure inside
lock_next_vma_under_mmap_lock() and added fallback_to_mmap_lock()
for that, per Vlastimil Babka
- Added missing vma_iter_init() after re-entering rcu read section inside
lock_next_vma(), per Vlastimil Babka
- Replaced vma_iter_init() with vma_iter_set(), per Liam Howlett
- Removed the last patch converting PROCMAP_QUERY to use per-vma locks.
That patch will be posted separately,
per David Hildenbrand, Vlastimil Babka and Liam Howlett
- Updated performance numbers, per Paul E. McKenney
!!! NOTES FOR APPLYING THE PATCHSET !!!
Applies cleanly over mm-unstable after reverting v6 version of this
patchset (from 2771a4b86aa1 to a20b00f7cf33 in mm-unstable).
[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo…
[4] https://lore.kernel.org/all/20250704060727.724817-1-surenb@google.com/
Suren Baghdasaryan (7):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
selftests/proc: add verbose more for tests to facilitate debugging
fs/proc/task_mmu: remove conversion of seq_file position to unsigned
fs/proc/task_mmu: read proc/pid/maps under per-vma lock
fs/proc/internal.h | 5 +
fs/proc/task_mmu.c | 155 +++-
include/linux/mmap_lock.h | 11 +
mm/madvise.c | 3 +-
mm/mmap_lock.c | 93 ++
tools/testing/selftests/proc/.gitignore | 1 +
tools/testing/selftests/proc/Makefile | 1 +
tools/testing/selftests/proc/proc-maps-race.c | 829 ++++++++++++++++++
8 files changed, 1082 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/proc/proc-maps-race.c
--
2.50.0.727.gbf7dc18ff4-goog
Hello,
this series follows up on the one introducing 9+ args for tracing
programs [1]. It has been observed with this series that there are cases
for which we can not identify accurately the location of the target
function arguments to prepare correctly the corresponding BPF
trampoline. This is the case for example if:
- the function consumes a struct variable _by value_
- it is passed on the stack (no more register available for it)
- it has some __packed__ or __aligned(X)__ attribute
As a consequence, a small restrictive check has been added to the ARM64
side, highlighting that other arch supporting 9+ args in BPF trampolines
are already suffering from the same issue. After a bit of discussions
and attempts, the chosen solution is, rather than applying the same
constraint to all JIT compilers, to prevent such function from being
encoded at all in BTF info([2]). As the pahole side is closed to be
integrated, we can now remove the restrictive check from kernel side.
[1] https://lore.kernel.org/bpf/20250527-many_args_arm64-v3-0-3faf7bb8e4a2@boot…
[2] https://lore.kernel.org/bpf/20250707-btf_skip_structs_on_stack-v3-0-29569e0…
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore(a)bootlin.com>
---
Alexis Lothoré (eBPF Foundation) (2):
bpf, arm64: remove structs on stack constraint
selftests/bpf: enable tracing_struct tests for arm64
arch/arm64/net/bpf_jit_comp.c | 5 -----
tools/testing/selftests/bpf/DENYLIST.aarch64 | 1 -
2 files changed, 6 deletions(-)
---
base-commit: 8da1e37fc84868b50ba6a7cdf082aa3b0d11e006
change-id: 20250708-arm64_relax_jit_comp-e8889647d8d2
Best regards,
--
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
I am submitting a new selftest for the netpoll subsystem specifically
targeting the case where the RX is polling in the TX path, which is
a case that we don't have any test in the tree today. This is done when
netpoll_poll_dev() called, and this test creates a scenario when that is
probably.
The test does the following:
1) Configuring a single RX/TX queue to increase contention on the
interface.
2) Generating background traffic to saturate the network, mimicking
real-world congestion.
3) Sending netconsole messages to trigger netpoll polling and monitor
its behavior.
4) Using dynamic netconsole targets via configfs, with the ability to
delete and recreate targets during the test.
5) Running bpftrace in parallel to verify that netpoll_poll_dev() is
called when expected. If it is called, then the test passes,
otherwise the test is marked as skipped.
In order to achieve it, I stole Jakub's bpftrace helper from [1], and
did some small changes that I found useful to use the helper.
So, this patchset basically contains:
1) The code stolen from Jakub
2) Improvements on bpftrace() helper
3) The selftest itself
Link: https://lore.kernel.org/all/20250421222827.283737-22-kuba@kernel.org/ [1]
---
Changes in v7:
- Rebased on top of net-next
- Using `ethtool -l` json option instead of parsing it manually.
- Link to v6: https://lore.kernel.org/r/20250711-netpoll_test-v6-0-130465f286a8@debian.org
Changes in v6:
- Remove the network toggled (Jakub)
- Set ringsize and queue size (Jakub)
- Some other general improvements (Jakub)
- Link to v5: https://lore.kernel.org/r/20250709-netpoll_test-v5-0-b3737895affe@debian.org
Changes in v5:
- Rebased on top of net-next.
- Calling bpftrace_stop using the defer helper. (Willem)
- Link to v4: https://lore.kernel.org/r/20250702-netpoll_test-v4-0-cec227e85639@debian.org
Changes in v4:
- Make the test XFail if it doesn't hit the function we are looking for
- Toggle the interface while the traffic is flowing.
- Bumped the number of messages from 10 to 40 per iterations.
* This is hitting ~15 times per run on my vng test.
- Decreased the time from 15 seconds to 10 seconds, given that if
it didn't hit the function in 10 seconds, 5 seconds extra will not
help.
- Link to v3: https://lore.kernel.org/r/20250627-netpoll_test-v3-0-575bd200c8a9@debian.org
Changes in v3:
- Make pylint happy (Simon)
- Remove the unnecessary patch in bpftrace to raise an exception when it
fails. (Jakub)
- Improved the bpftrace code (Willem)
- Stop sending messages if bpftrace is not alive anymore.
- Link to v2: https://lore.kernel.org/r/20250625-netpoll_test-v2-0-47d27775222c@debian.org
Changes in v2:
- Stole Jakub's helper to run bpftrace
- Removed the DEBUG option and moved logs to logging
- Change the code to have a higher chance of calling netpoll_poll_dev().
In my current configuration, it is hitting multiple times during the
test.
- Save and restore TX/RX queue size (Jakub)
- Link to v1: https://lore.kernel.org/r/20250620-netpoll_test-v1-1-5068832f72fc@debian.org
---
Breno Leitao (2):
selftests: drv-net: Strip '@' prefix from bpftrace map keys
selftests: net: add netpoll basic functionality test
Jakub Kicinski (1):
selftests: drv-net: add helper/wrapper for bpftrace
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/py/__init__.py | 4 +-
.../testing/selftests/drivers/net/netpoll_basic.py | 396 +++++++++++++++++++++
tools/testing/selftests/net/lib/py/utils.py | 35 ++
4 files changed, 434 insertions(+), 2 deletions(-)
---
base-commit: b06c4311711c57c5e558bd29824b08f0a6e2a155
change-id: 20250612-netpoll_test-a1324d2057c8
Best regards,
--
Breno Leitao <leitao(a)debian.org>
A task in the kernel (task_mm_cid_work) runs somewhat periodically to
compact the mm_cid for each process. Add a test to validate that it runs
correctly and timely.
The test spawns 1 thread pinned to each CPU, then each thread, including
the main one, runs in short bursts for some time. During this period, the
mm_cids should be spanning all numbers between 0 and nproc.
At the end of this phase, a thread with high enough mm_cid (>= nproc/2)
is selected to be the new leader, all other threads terminate.
After some time, the only running thread should see 0 as mm_cid, if that
doesn't happen, the compaction mechanism didn't work and the test fails.
The test never fails if only 1 core is available, in which case, we
cannot test anything as the only available mm_cid is 0.
Acked-by: Shuah Khan <skhan(a)linuxfoundation.org>
Signed-off-by: Gabriele Monaco <gmonaco(a)redhat.com>
---
tools/testing/selftests/rseq/.gitignore | 1 +
tools/testing/selftests/rseq/Makefile | 2 +-
.../selftests/rseq/mm_cid_compaction_test.c | 204 ++++++++++++++++++
3 files changed, 206 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c
diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 0fda241fa62b0..b3920c59bf401 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -3,6 +3,7 @@ basic_percpu_ops_test
basic_percpu_ops_mm_cid_test
basic_test
basic_rseq_op_test
+mm_cid_compaction_test
param_test
param_test_benchmark
param_test_compare_twice
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 0d0a5fae59547..bc4d940f66d40 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -17,7 +17,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
- syscall_errors_test
+ syscall_errors_test mm_cid_compaction_test
TEST_GEN_PROGS_EXTENDED = librseq.so
diff --git a/tools/testing/selftests/rseq/mm_cid_compaction_test.c b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
new file mode 100644
index 0000000000000..d13623625f5a9
--- /dev/null
+++ b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
@@ -0,0 +1,204 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+
+#include "../kselftest.h"
+#include "rseq.h"
+
+#define VERBOSE 0
+#define printf_verbose(fmt, ...) \
+ do { \
+ if (VERBOSE) \
+ printf(fmt, ##__VA_ARGS__); \
+ } while (0)
+
+/* 50 ms */
+#define RUNNER_PERIOD 50000
+/*
+ * Number of runs before we terminate or get the token.
+ * The number is slowly increasing with the number of CPUs as the compaction
+ * process can take longer on larger systems. This is an arbitrary value.
+ */
+#define THREAD_RUNS (3 + args->num_cpus/8)
+
+/*
+ * Number of times we check that the mm_cid were compacted.
+ * Checks are repeated every RUNNER_PERIOD.
+ */
+#define MM_CID_COMPACT_TIMEOUT 10
+
+struct thread_args {
+ int cpu;
+ int num_cpus;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ pthread_t *tinfo;
+ struct thread_args *args_head;
+};
+
+static void __noreturn *thread_runner(void *arg)
+{
+ struct thread_args *args = arg;
+ int i, ret, curr_mm_cid;
+ cpu_set_t cpumask;
+
+ CPU_ZERO(&cpumask);
+ CPU_SET(args->cpu, &cpumask);
+ ret = pthread_setaffinity_np(pthread_self(), sizeof(cpumask), &cpumask);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to set affinity");
+ abort();
+ }
+ pthread_barrier_wait(args->barrier);
+
+ for (i = 0; i < THREAD_RUNS; i++)
+ usleep(RUNNER_PERIOD);
+ curr_mm_cid = rseq_current_mm_cid();
+ /*
+ * We select one thread with high enough mm_cid to be the new leader.
+ * All other threads (including the main thread) will terminate.
+ * After some time, the mm_cid of the only remaining thread should
+ * converge to 0, if not, the test fails.
+ */
+ if (curr_mm_cid >= args->num_cpus / 2 &&
+ !pthread_mutex_trylock(args->token)) {
+ printf_verbose(
+ "cpu%d has mm_cid=%d and will be the new leader.\n",
+ sched_getcpu(), curr_mm_cid);
+ for (i = 0; i < args->num_cpus; i++) {
+ if (args->tinfo[i] == pthread_self())
+ continue;
+ ret = pthread_join(args->tinfo[i], NULL);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to join thread");
+ abort();
+ }
+ }
+ pthread_barrier_destroy(args->barrier);
+ free(args->tinfo);
+ free(args->token);
+ free(args->barrier);
+ free(args->args_head);
+
+ for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) {
+ curr_mm_cid = rseq_current_mm_cid();
+ printf_verbose("run %d: mm_cid=%d on cpu%d.\n", i,
+ curr_mm_cid, sched_getcpu());
+ if (curr_mm_cid == 0)
+ exit(EXIT_SUCCESS);
+ usleep(RUNNER_PERIOD);
+ }
+ exit(EXIT_FAILURE);
+ }
+ printf_verbose("cpu%d has mm_cid=%d and is going to terminate.\n",
+ sched_getcpu(), curr_mm_cid);
+ pthread_exit(NULL);
+}
+
+int test_mm_cid_compaction(void)
+{
+ cpu_set_t affinity;
+ int i, j, ret = 0, num_threads;
+ pthread_t *tinfo;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ struct thread_args *args;
+
+ sched_getaffinity(0, sizeof(affinity), &affinity);
+ num_threads = CPU_COUNT(&affinity);
+ tinfo = calloc(num_threads, sizeof(*tinfo));
+ if (!tinfo) {
+ perror("Error: failed to allocate tinfo");
+ return -1;
+ }
+ args = calloc(num_threads, sizeof(*args));
+ if (!args) {
+ perror("Error: failed to allocate args");
+ ret = -1;
+ goto out_free_tinfo;
+ }
+ token = malloc(sizeof(*token));
+ if (!token) {
+ perror("Error: failed to allocate token");
+ ret = -1;
+ goto out_free_args;
+ }
+ barrier = malloc(sizeof(*barrier));
+ if (!barrier) {
+ perror("Error: failed to allocate barrier");
+ ret = -1;
+ goto out_free_token;
+ }
+ if (num_threads == 1) {
+ fprintf(stderr, "Cannot test on a single cpu. "
+ "Skipping mm_cid_compaction test.\n");
+ /* only skipping the test, this is not a failure */
+ goto out_free_barrier;
+ }
+ pthread_mutex_init(token, NULL);
+ ret = pthread_barrier_init(barrier, NULL, num_threads);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to initialise barrier");
+ goto out_free_barrier;
+ }
+ for (i = 0, j = 0; i < CPU_SETSIZE && j < num_threads; i++) {
+ if (!CPU_ISSET(i, &affinity))
+ continue;
+ args[j].num_cpus = num_threads;
+ args[j].tinfo = tinfo;
+ args[j].token = token;
+ args[j].barrier = barrier;
+ args[j].cpu = i;
+ args[j].args_head = args;
+ if (!j) {
+ /* The first thread is the main one */
+ tinfo[0] = pthread_self();
+ ++j;
+ continue;
+ }
+ ret = pthread_create(&tinfo[j], NULL, thread_runner, &args[j]);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to create thread");
+ abort();
+ }
+ ++j;
+ }
+ printf_verbose("Started %d threads.\n", num_threads);
+
+ /* Also main thread will terminate if it is not selected as leader */
+ thread_runner(&args[0]);
+
+ /* only reached in case of errors */
+out_free_barrier:
+ free(barrier);
+out_free_token:
+ free(token);
+out_free_args:
+ free(args);
+out_free_tinfo:
+ free(tinfo);
+
+ return ret;
+}
+
+int main(int argc, char **argv)
+{
+ if (!rseq_mm_cid_available()) {
+ fprintf(stderr, "Error: rseq_mm_cid unavailable\n");
+ return -1;
+ }
+ if (test_mm_cid_compaction())
+ return -1;
+ return 0;
+}
--
2.50.1
This patch series add tests to validate XDP native support for PASS,
DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment.
For adjustment tests, validate support for both the extension and
shrinking cases across various packet sizes and offset values.
The pass criteria for head/tail adjustment tests require that at-least
one adjustment value works for at-least one packet size. This ensure
that the variability in maximum supported head/tail adjustment offset
across different drivers is being incorporated.
The results reported in this series are based on fbnic. However, the
series is tested against multiple other drivers including netdevism.
Note: The XDP support for fbnic will be added later.
---
Change-log:
V5:
- Fix warning caused by rcu_dereference() in p1
- Fix checkpatch warnings with P3, P4, and P5
V4: https://lore.kernel.org/netdev/20250714210352.1115230-1-mohsin.bashr@gmail.…
V3: https://lore.kernel.org/netdev/20250712002648.2385849-1-mohsin.bashr@gmail.…
V2: https://lore.kernel.org/netdev/20250710184351.63797-1-mohsin.bashr@gmail.com
V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.…
Jakub Kicinski (1):
net: netdevsim: hook in XDP handling
Mohsin Bashir (4):
selftests: drv-net: Test XDP_PASS/DROP support
selftests: drv-net: Test XDP_TX support
selftests: drv-net: Test tail-adjustment support
selftests: drv-net: Test head-adjustment support
drivers/net/netdevsim/netdev.c | 19 +-
tools/testing/selftests/drivers/net/Makefile | 1 +
tools/testing/selftests/drivers/net/xdp.py | 656 ++++++++++++++++++
.../selftests/net/lib/xdp_native.bpf.c | 540 ++++++++++++++
4 files changed, 1215 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/drivers/net/xdp.py
create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c
--
2.47.1
Reading /proc/pid/maps requires read-locking mmap_lock which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges, however it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low priority monitoring tasks and them blocking high priority tasks
results in priority inversion.
Locking the entire address space is required to present fully coherent
picture of the address space, however even current implementation does not
strictly guarantee that by outputting vmas in page-size chunks and
dropping mmap_lock in between each chunk. Address space modifications are
possible while mmap_lock is dropped and userspace reading the content is
expected to deal with possible concurrent address space modifications.
Considering these relaxed rules, holding mmap_lock is not strictly needed
as long as we can guarantee that a concurrently modified vma is reported
either in its original form or after it was modified.
This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces the
contention with tasks modifying the address space because they would have
to contend for the same vma as opposed to the entire address space. Same
is done for PROCMAP_QUERY ioctl which locks only the vma that fell into
the requested range instead of the entire address space. Previous version
of this patchset [1] tried to perform /proc/pid/maps reading under RCU,
however its implementation is quite complex and the results are worse than
the new version because it still relied on mmap_lock speculation which
retries if any part of the address space gets modified. New implementaion
is both simpler and results in less contention. Note that similar approach
would not work for /proc/pid/smaps reading as it also walks the page table
and that's not RCU-safe.
Paul McKenney's designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory. At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test. The scanners keep scanning until the specified
/proc/PID/maps file disappears. This test registered close to 10x
improvement in update latencies:
Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.011 0.008 0.455
0.011 0.008 0.472
0.011 0.008 0.535
0.011 0.009 0.545
...
0.011 0.014 2.875
0.011 0.014 2.913
0.011 0.014 3.007
0.011 0.015 3.018
After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
0.006 0.005 0.036
0.006 0.005 0.039
0.006 0.005 0.039
0.006 0.005 0.039
...
0.006 0.006 0.403
0.006 0.006 0.474
0.006 0.006 0.479
0.006 0.006 0.498
The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. An example of user-visible inconsistency
can be that the same vma is printed twice: once before it was modified and
then after the modifications. For example if vma was extended, it might be
found and reported twice. What is not expected is to see a gap where there
should have been a vma both before and after modification. This patchset
increases the chances of such tearing, therefore it's even more important
now to test for unexpected inconsistencies.
In [3] Lorenzo identified the following possible vma merging/splitting
scenarios:
Merges with changes to existing vmas:
1 Merge both - mapping a vma over another one and between two vmas which
can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and
completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and
partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one
and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing one
and partially over its left neighbor;
Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be
merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start end of an existing one;
Splits
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;
If such merges or splits happen concurrently with the /proc/maps reading
we might report a vma twice, once before the modification and once after
it is modified:
Case 1 might report overwritten and previous vma along with the final
merged vma;
Case 2 might report previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by
shrinking of the right neighbor;
Case 4 might report overritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by
shrinking of the left neighbor;
Case 6 might report previous vma and the gap along with the final marged
vma;
Case 7 might report previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering the
gap;
Case 9 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma end;
In all these cases the retry mechanism prevents us from reporting possible
temporary gaps.
Changes since v5 [4]:
- Made /proc/pid/maps tearing test a separate selftest,
per Alexey Dobriyan
- Changed asserts with or'ed conditions into separate ones,
per Alexey Dobriyan
- Added a small cleanup patch [6/8] to avoid unnecessary seq_file position
type casting
- Removed unnecessary is_sentinel_pos() helper
- Changed titles to use fs/proc/task_mmu instead of mm/maps prefix,
per David Hildenbrand
- Included Lorenzo's fix for mmap lock assertion in anon_vma_name()
- Reworked the last patch to avoid allocation in the rcu read section,
which replaces Jeongjun Park's fix
!!! NOTES FOR APPLYING THE PATCHSET !!!
Applies cleanly over mm-unstable after reverting old version with fixes.
The following patches should be reverted before applyng this patchset:
b33ce1be8a40 ("selftests/proc: add /proc/pid/maps tearing from vma split test")
b538e0580fd6 ("selftests/proc: extend /proc/pid/maps tearing test to include vma resizing")
4996b4409cc6 ("selftests/proc: extend /proc/pid/maps tearing test to include vma remapping")
c39471f78d5e ("selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified")
487570f548f3 ("selftests/proc: add verbose more for tests to facilitate debugging")
e1ba4969cba1 ("mm/maps: read proc/pid/maps under per-vma lock")
ecb110179e77 ("mm/madvise: fixup stray mmap lock assert in anon_vma_name()")
6772c457a865 ("fs/proc/task_mmu:: execute PROCMAP_QUERY ioctl under per-vma locks")
d5c67bb2c5fb ("mm/maps: move kmalloc() call location in do_procmap_query() out of RCU critical section")
[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.lo…
[4] https://lore.kernel.org/all/20250624193359.3865351-1-surenb@google.com/
Suren Baghdasaryan (8):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
selftests/proc: add verbose more for tests to facilitate debugging
fs/proc/task_mmu: remove conversion of seq_file position to unsigned
fs/proc/task_mmu: read proc/pid/maps under per-vma lock
fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks
fs/proc/internal.h | 5 +
fs/proc/task_mmu.c | 188 +++-
include/linux/mmap_lock.h | 11 +
mm/madvise.c | 3 +-
mm/mmap_lock.c | 88 ++
tools/testing/selftests/proc/.gitignore | 1 +
tools/testing/selftests/proc/Makefile | 1 +
tools/testing/selftests/proc/proc-maps-race.c | 829 ++++++++++++++++++
8 files changed, 1098 insertions(+), 28 deletions(-)
create mode 100644 tools/testing/selftests/proc/proc-maps-race.c
--
2.50.0.727.gbf7dc18ff4-goog
NOTE: I'm leaving Daynix Computing Ltd., for which I worked on this
patch series, by the end of this month.
While net-next is closed, this is the last chance for me to send another
version so let me send the local changes now.
Please contact Yuri Benditovich, who is CCed on this email, for anything
about this series.
virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.
Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.
Introduce the code to compute hashes to the kernel in order to overcome
thse challenges.
An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).
The patches for QEMU to use this new feature was submitted as RFC and
is available at:
https://patchew.org/QEMU/20250530-hash-v5-0-343d7d7a8200@daynix.com/
This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/
V1 -> V2:
Changed to introduce a new BPF program type.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v12:
- Updated tools/testing/selftests/net/config.
- Split TUNSETVNETHASH.
- Link to v11: https://lore.kernel.org/r/20250317-rss-v11-0-4cacca92f31f@daynix.com
Changes in v11:
- Added the missing code to free vnet_hash in patch
"tap: Introduce virtio-net hash feature".
- Link to v10: https://lore.kernel.org/r/20250313-rss-v10-0-3185d73a9af0@daynix.com
Changes in v10:
- Split common code and TUN/TAP-specific code into separate patches.
- Reverted a spurious style change in patch "tun: Introduce virtio-net
hash feature".
- Added a comment explaining disable_ipv6 in tests.
- Used AF_PACKET for patch "selftest: tun: Add tests for
virtio-net hashing". I also added the usage of FIXTURE_VARIANT() as
the testing function now needs access to more variant-specific
variables.
- Corrected the message of patch "selftest: tun: Add tests for
virtio-net hashing"; it mentioned validation of configuration but
it is not scope of this patch.
- Expanded the description of patch "selftest: tun: Add tests for
virtio-net hashing".
- Added patch "tun: Allow steering eBPF program to fall back".
- Changed to handle TUNGETVNETHASHCAP before taking the rtnl lock.
- Removed redundant tests for tun_vnet_ioctl().
- Added patch "selftest: tap: Add tests for virtio-net ioctls".
- Added a design explanation of ioctls for extensibility and migration.
- Removed a few branches in patch
"vhost/net: Support VIRTIO_NET_F_HASH_REPORT".
- Link to v9: https://lore.kernel.org/r/20250307-rss-v9-0-df76624025eb@daynix.com
Changes in v9:
- Added a missing return statement in patch
"tun: Introduce virtio-net hash feature".
- Link to v8: https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff423@daynix.com
Changes in v8:
- Disabled IPv6 to eliminate noises in tests.
- Added a branch in tap to avoid unnecessary dissection when hash
reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid
adding extra NULL checks to the code handling other ioctls.
- Introduced variable named "fd" to __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com
Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for
VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com
Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
Introduce virtio-net hash reporting feature", and "tun: Introduce
virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324ca…
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
"tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com
Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
"tun: Introduce virtio-net hash reporting feature" and
"tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
---
Akihiko Odaki (10):
virtio_net: Add functions for hashing
net: flow_dissector: Export flow_keys_dissector_symmetric
tun: Allow steering eBPF program to fall back
tun: Add common virtio-net hash feature code
tun: Introduce virtio-net hash feature
tap: Introduce virtio-net hash feature
selftest: tun: Test vnet ioctls without device
selftest: tun: Add tests for virtio-net hashing
selftest: tap: Add tests for virtio-net ioctls
vhost/net: Support VIRTIO_NET_F_HASH_REPORT
Documentation/networking/tuntap.rst | 7 +
drivers/net/Kconfig | 1 +
drivers/net/ipvlan/ipvtap.c | 2 +-
drivers/net/macvtap.c | 2 +-
drivers/net/tap.c | 80 +++++-
drivers/net/tun.c | 92 +++++--
drivers/net/tun_vnet.h | 165 +++++++++++-
drivers/vhost/net.c | 68 ++---
include/linux/if_tap.h | 4 +-
include/linux/skbuff.h | 3 +
include/linux/virtio_net.h | 188 ++++++++++++++
include/net/flow_dissector.h | 1 +
include/uapi/linux/if_tun.h | 80 ++++++
net/core/flow_dissector.c | 3 +-
net/core/skbuff.c | 4 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/config | 1 +
tools/testing/selftests/net/tap.c | 131 +++++++++-
tools/testing/selftests/net/tun.c | 485 ++++++++++++++++++++++++++++++++++-
19 files changed, 1234 insertions(+), 85 deletions(-)
---
base-commit: 5cb8274d66c611b7889565c418a8158517810f9b
change-id: 20240403-rss-e737d89efa77
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v12 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best Regards,
Chia-Yu
---
v12 (04-Jul-2025)
- Fix compilation issues with some intermediate patches in v11
- Add more comments for AccECN helpers of tcp_ecn.h
v11 (03-Jul-2025)
- Fix compilation issues with some intermediate patches in v10
v10 (02-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() over every reads of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)
v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEE flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accruate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restruct the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)
v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)
v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results
v6 (09-May-2025)
- Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
---
Chia-Yu Chang (6):
tcp: reorganize tcp_sock_write_txrx group for variables later
tcp: ecn functions in separated include file
tcp: Add wait_third_ack for ECN negotiation in simultaneous connect
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (9):
tcp: reorganize SYN ECN code
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics
Documentation/networking/ip-sysctl.rst | 55 +-
.../networking/net_cachelines/tcp_sock.rst | 13 +
include/linux/tcp.h | 33 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 87 ++-
include/net/tcp_ecn.h | 663 ++++++++++++++++++
include/uapi/linux/tcp.h | 7 +
net/ipv4/syncookies.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 29 +-
net/ipv4/tcp_input.c | 371 ++++++++--
net/ipv4/tcp_ipv4.c | 8 +-
net/ipv4/tcp_minisocks.c | 40 +-
net/ipv4/tcp_output.c | 297 ++++++--
net/ipv6/syncookies.c | 2 +
net/ipv6/tcp_ipv6.c | 1 +
16 files changed, 1444 insertions(+), 187 deletions(-)
create mode 100644 include/net/tcp_ecn.h
--
2.34.1
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 and arm64 support for user mode shadow stack is already in mainline.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
Get the lastest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Branch where above patch can be picked
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
To: Thomas Gleixner <tglx(a)linutronix.de>
To: Ingo Molnar <mingo(a)redhat.com>
To: Borislav Petkov <bp(a)alien8.de>
To: Dave Hansen <dave.hansen(a)linux.intel.com>
To: x86(a)kernel.org
To: H. Peter Anvin <hpa(a)zytor.com>
To: Andrew Morton <akpm(a)linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett(a)oracle.com>
To: Vlastimil Babka <vbabka(a)suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
To: Paul Walmsley <paul.walmsley(a)sifive.com>
To: Palmer Dabbelt <palmer(a)dabbelt.com>
To: Albert Ou <aou(a)eecs.berkeley.edu>
To: Conor Dooley <conor(a)kernel.org>
To: Rob Herring <robh(a)kernel.org>
To: Krzysztof Kozlowski <krzk+dt(a)kernel.org>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Christian Brauner <brauner(a)kernel.org>
To: Peter Zijlstra <peterz(a)infradead.org>
To: Oleg Nesterov <oleg(a)redhat.com>
To: Eric Biederman <ebiederm(a)xmission.com>
To: Kees Cook <kees(a)kernel.org>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
To: Jann Horn <jannh(a)google.com>
To: Conor Dooley <conor+dt(a)kernel.org>
To: Miguel Ojeda <ojeda(a)kernel.org>
To: Alex Gaynor <alex.gaynor(a)gmail.com>
To: Boqun Feng <boqun.feng(a)gmail.com>
To: Gary Guo <gary(a)garyguo.net>
To: Björn Roy Baron <bjorn3_gh(a)protonmail.com>
To: Benno Lossin <benno.lossin(a)proton.me>
To: Andreas Hindborg <a.hindborg(a)kernel.org>
To: Alice Ryhl <aliceryhl(a)google.com>
To: Trevor Gross <tmgross(a)umich.edu>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: linux-riscv(a)lists.infradead.org
Cc: devicetree(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: alistair.francis(a)wdc.com
Cc: richard.henderson(a)linaro.org
Cc: jim.shu(a)sifive.com
Cc: andybnac(a)gmail.com
Cc: kito.cheng(a)sifive.com
Cc: charlie(a)rivosinc.com
Cc: atishp(a)rivosinc.com
Cc: evan(a)rivosinc.com
Cc: cleger(a)rivosinc.com
Cc: alexghiti(a)rivosinc.com
Cc: samitolvanen(a)google.com
Cc: broonie(a)kernel.org
Cc: rick.p.edgecombe(a)intel.com
Cc: rust-for-linux(a)vger.kernel.org
changelog
---------
v17:
- fixed warnings due to empty macros in usercfi.h (reported by alexg)
- fixed prefixes in commit titles reported by alexg
- took below uprobe with fcfi v2 patch from Zong Li and squashed it with
"riscv/traps: Introduce software check exception and uprobe handling"
https://lore.kernel.org/all/20250604093403.10916-1-zong.li@sifive.com/
v16:
- If FWFT is not implemented or returns error for shadow stack activation, then
no_usercfi is set to disable shadow stack. Although this should be picked up
by extension validation and activation. Fixed this bug for zicfilp and zicfiss
both. Thanks to Charlie Jenkins for reporting this.
- If toolchain doesn't support cfi, cfi kselftest shouldn't build. Suggested by
Charlie Jenkins.
- Default for CONFIG_RISCV_USER_CFI is set to no. Charlie/Atish suggested to
keep it off till we have more hardware availibility with RVA23 profile and
zimop/zcmop implemented. Else this will start breaking people's workflow
- Includes the fix if "!RV64 and !SBI" then definitions for FWFT in
asm-offsets.c error.
v15:
- Toolchain has been updated to include `-fcf-protection` flag. This
exists for x86 as well. Updated kernel patches to compile vDSO and
selftest to compile with `fcf-protection=full` flag.
- selecting CONFIG_RISCV_USERCFI selects CONFIG_RISCV_SBI.
- Patch to enable shadow stack for kernel wasn't hidden behind
CONFIG_RISCV_USERCFI and CONFIG_RISCV_SBI. fixed that.
v14:
- rebased on top of palmer/sbi-v3. Thus dropped clement's FWFT patches
Updated RISCV_ISA_EXT_XXXX in hwcap and hwprobe constants.
- Took Radim's suggestions on bitfields.
- Placed cfi_state at the end of thread_info block so that current situation
is not disturbed with respect to member fields of thread_info in single
cacheline.
v13:
- cpu_supports_shadow_stack/cpu_supports_indirect_br_lp_instr uses
riscv_has_extension_unlikely()
- uses nops(count) to create nop slide
- RISCV_ACQUIRE_BARRIER is not needed in `amo_user_shstk`. Removed it
- changed ternaries to simply use implicit casting to convert to bool.
- kernel command line allows to disable zicfilp and zicfiss independently.
updated kernel-parameters.txt.
- ptrace user abi for cfi uses bitmasks instead of bitfields. Added ptrace
kselftest.
- cosmetic and grammatical changes to documentation.
v12:
- It seems like I had accidently squashed arch agnostic indirect branch
tracking prctl and riscv implementation of those prctls. Split them again.
- set_shstk_status/set_indir_lp_status perform CSR writes only when CPU
support is available. As suggested by Zong Li.
- Some minor clean up in kselftests as suggested by Zong Li.
v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally
selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to
to `lpad 0`.
v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch
is not that interesting to this patch series for risc-v. There are instances in
arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch
to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to
validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure
we add single vdso object with cfi enabled. But a vdso object with scheme of
zero labeled landing pad is least common denominator and should work with all
objects of zero labeled as well as function-signature labeled objects.
v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs)
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in parlmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
Changes in v17:
- Link to v16: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-0-64f61a35eee7@ri…
Changes in v16:
- Link to v15: https://lore.kernel.org/r/20250502-v5_user_cfi_series-v15-0-914966471885@ri…
Changes in v15:
- changelog posted just below cover letter
- Link to v14: https://lore.kernel.org/r/20250429-v5_user_cfi_series-v14-0-5239410d012a@ri…
Changes in v14:
- changelog posted just below cover letter
- Link to v13: https://lore.kernel.org/r/20250424-v5_user_cfi_series-v13-0-971437de586a@ri…
Changes in v13:
- changelog posted just below cover letter
- Link to v12: https://lore.kernel.org/r/20250314-v5_user_cfi_series-v12-0-e51202b53138@ri…
Changes in v12:
- changelog posted just below cover letter
- Link to v11: https://lore.kernel.org/r/20250310-v5_user_cfi_series-v11-0-86b36cbfb910@ri…
Changes in v11:
- changelog posted just below cover letter
- Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Deepak Gupta (25):
mm: VM_SHADOW_STACK definition for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv/mm: manufacture shadow stack pte
riscv/mm: teach pte_mkwrite to manufacture shadow stack PTEs
riscv/mm: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
riscv: Implements arch agnostic shadow stack prctls
prctl: arch-agnostic prctl for indirect branch tracking
riscv: Implements arch agnostic indirect branch tracking prctls
riscv/traps: Introduce software check exception and uprobe handling
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: kernel command line option to opt out of user cfi
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Jim Shu (1):
arch/riscv: compile vdso with landing pad
Documentation/admin-guide/kernel-parameters.txt | 8 +
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 179 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 21 +
arch/riscv/Makefile | 5 +-
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/assembler.h | 44 ++
arch/riscv/include/asm/cpufeature.h | 12 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 26 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 2 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 95 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 34 ++
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 10 +
arch/riscv/kernel/cpufeature.c | 27 +
arch/riscv/kernel/entry.S | 33 +-
arch/riscv/kernel/head.S | 27 +
arch/riscv/kernel/process.c | 27 +-
arch/riscv/kernel/ptrace.c | 95 ++++
arch/riscv/kernel/signal.c | 148 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 51 ++
arch/riscv/kernel/usercfi.c | 545 +++++++++++++++++++++
arch/riscv/kernel/vdso/Makefile | 6 +
arch/riscv/kernel/vdso/flush_icache.S | 4 +
arch/riscv/kernel/vdso/getcpu.S | 4 +
arch/riscv/kernel/vdso/rt_sigreturn.S | 4 +
arch/riscv/kernel/vdso/sys_hwprobe.S | 4 +
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 16 +
include/linux/cpu.h | 4 +
include/linux/mm.h | 7 +
include/uapi/linux/elf.h | 2 +
include/uapi/linux/prctl.h | 27 +
kernel/sys.c | 30 ++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 16 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 82 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 173 +++++++
tools/testing/selftests/riscv/cfi/shadowstack.c | 385 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 27 +
54 files changed, 2369 insertions(+), 29 deletions(-)
---
base-commit: 4181f8ad7a1061efed0219951d608d4988302af7
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
This patch series add tests to validate XDP native support for PASS,
DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment.
For adjustment tests, validate support for both the extension and
shrinking cases across various packet sizes and offset values.
The pass criteria for head/tail adjustment tests require that at-least
one adjustment value works for at-least one packet size. This ensure
that the variability in maximum supported head/tail adjustment offset
across different drivers is being incorporated.
The results reported in this series are based on fbnic. However, the
series is tested against multiple other drivers including netdevism.
Note: The XDP support for fbnic will be added later.
---
Change-log:
V4:
- Support XDP handling for netdevsim
- Fix pylint warning with P4
- Update commit message for P2,P3 to show pass/fail summary
V3: https://lore.kernel.org/netdev/20250712002648.2385849-1-mohsin.bashr@gmail.…
V2: https://lore.kernel.org/netdev/20250710184351.63797-1-mohsin.bashr@gmail.com
V1: https://lore.kernel.org/netdev/20250709173707.3177206-1-mohsin.bashr@gmail.…
Jakub Kicinski (1):
net: netdevsim: hook in XDP handling
Mohsin Bashir (4):
selftests: drv-net: Test XDP_PASS/DROP support
selftests: drv-net: Test XDP_TX support
selftests: drv-net: Test tail-adjustment support
selftests: drv-net: Test head-adjustment support
drivers/net/netdevsim/netdev.c | 19 +-
tools/testing/selftests/drivers/net/Makefile | 1 +
tools/testing/selftests/drivers/net/xdp.py | 656 ++++++++++++++++++
.../selftests/net/lib/xdp_native.bpf.c | 538 ++++++++++++++
4 files changed, 1213 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/drivers/net/xdp.py
create mode 100644 tools/testing/selftests/net/lib/xdp_native.bpf.c
--
2.47.1
Hi everyone,
Here's a V3 for the netdevsim PHY support. This V3 includes :
- A fix for a compiling issue with PHYLIB=n
- An updated KConfig to only allow PHYLIB=y|n
- Converted the link setting file to a bool debugfs file, relying on
link state polling
The idea of this series is to allow attaching virtual PHY devices to
netdevsim, so that we can test PHY-related ethtool commands. This can be
extended in the future for phylib testing as well.
V2: https://lore.kernel.org/netdev/20250708115531.111326-1-maxime.chevallier@bo…
- Fix building with PHYLIB=m
- Use shellcheck on the shell scripts
V1: https://lore.kernel.org/netdev/20250702082806.706973-1-maxime.chevallier@bo…
Maxime Chevallier (3):
net: netdevsim: Add PHY support in netdevsim
selftests: ethtool: Drop the unused old_netdevs variable
selftests: ethtool: Introduce ethernet PHY selftests on netdevsim
drivers/net/Kconfig | 1 +
drivers/net/netdevsim/Makefile | 4 +
drivers/net/netdevsim/dev.c | 2 +
drivers/net/netdevsim/netdev.c | 8 +
drivers/net/netdevsim/netdevsim.h | 25 ++
drivers/net/netdevsim/phy.c | 375 ++++++++++++++++++
.../selftests/drivers/net/netdevsim/config | 1 +
.../drivers/net/netdevsim/ethtool-common.sh | 19 +-
.../drivers/net/netdevsim/ethtool-phy.sh | 64 +++
9 files changed, 496 insertions(+), 3 deletions(-)
create mode 100644 drivers/net/netdevsim/phy.c
create mode 100755 tools/testing/selftests/drivers/net/netdevsim/ethtool-phy.sh
--
2.49.0
There is a spelling mistake in a ksft_exit_fail_msg message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/futex/functional/futex_priv_hash.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/futex/functional/futex_priv_hash.c b/tools/testing/selftests/futex/functional/futex_priv_hash.c
index a9cedc365102..aea001ac4946 100644
--- a/tools/testing/selftests/futex/functional/futex_priv_hash.c
+++ b/tools/testing/selftests/futex/functional/futex_priv_hash.c
@@ -122,7 +122,7 @@ static void futex_dummy_op(void)
}
ret = pthread_mutex_timedlock(&lock, &timeout);
if (ret == 0)
- ksft_exit_fail_msg("Succeffuly locked an already locked mutex.\n");
+ ksft_exit_fail_msg("Successfully locked an already locked mutex.\n");
if (ret != ETIMEDOUT)
ksft_exit_fail_msg("pthread_mutex_timedlock() did not timeout: %d.\n", ret);
--
2.50.0
Move the generic `FORCE_READ` macro from `guard-regions.c` to the shared
`vm_util.h` header to promote code reuse.
In `guard-regions.c`, replace `ksft_exit_skip()` with the `SKIP()` macro
to ensure only the current test is skipped on permission failure, instead
of terminating the entire test binary.
Signed-off-by: wang lian <lianux.mm(a)gmail.com>
---
tools/testing/selftests/mm/guard-regions.c | 9 +--------
tools/testing/selftests/mm/vm_util.h | 7 +++++++
2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index 93af3d3760f9..b0d42eb04e3a 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -35,13 +35,6 @@
static volatile sig_atomic_t signal_jump_set;
static sigjmp_buf signal_jmp_buf;
-/*
- * Ignore the checkpatch warning, we must read from x but don't want to do
- * anything with it in order to trigger a read page fault. We therefore must use
- * volatile to stop the compiler from optimising this away.
- */
-#define FORCE_READ(x) (*(volatile typeof(x) *)x)
-
/*
* How is the test backing the mapping being tested?
*/
@@ -582,7 +575,7 @@ TEST_F(guard_regions, process_madvise)
/* OK we don't have permission to do this, skip. */
if (count == -1 && errno == EPERM)
- ksft_exit_skip("No process_madvise() permissions, try running as root.\n");
+ SKIP(return, "No process_madvise() permissions, try running as root.\n");
/* Returns the number of bytes advised. */
ASSERT_EQ(count, 6 * page_size);
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 2b154c287591..c20298ae98ea 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -18,6 +18,13 @@
#define PM_SWAP BIT_ULL(62)
#define PM_PRESENT BIT_ULL(63)
+/*
+ * Ignore the checkpatch warning, we must read from x but don't want to do
+ * anything with it in order to trigger a read page fault. We therefore must use
+ * volatile to stop the compiler from optimising this away.
+ */
+#define FORCE_READ(x) (*(volatile typeof(x) *)x)
+
extern unsigned int __page_size;
extern unsigned int __page_shift;
--
2.43.0
mptcp_connect.sh can be executed manually with "-m <MODE>" and "-C" to
make sure everything works as expected when using "mmap" and "sendfile"
modes instead of "poll", and with the MPTCP checksum support.
These modes should be validated, but they are not when the selftests are
executed via the kselftest helpers. It means that most CIs validating
these selftests, like NIPA for the net development trees and LKFT for
the stable ones, are not covering these modes.
To fix that, new test programs have been added, simply calling
mptcp_connect.sh with the right parameters.
The first patch can be backported up to v5.6, and the second one up to
v5.14.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (2):
selftests: mptcp: connect: also cover alt modes
selftests: mptcp: connect: also cover checksum
tools/testing/selftests/net/mptcp/Makefile | 3 ++-
tools/testing/selftests/net/mptcp/mptcp_connect_checksum.sh | 4 ++++
tools/testing/selftests/net/mptcp/mptcp_connect_mmap.sh | 4 ++++
tools/testing/selftests/net/mptcp/mptcp_connect_sendfile.sh | 4 ++++
4 files changed, 14 insertions(+), 1 deletion(-)
---
base-commit: b640daa2822a39ff76e70200cb2b7b892b896dce
change-id: 20250714-net-mptcp-sft-connect-alt-c1aaf073ef4e
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
This patch series refactors all futex selftests to use
kselftest_harness.h instead of futex's logging.h, as discussed here [1].
This allows to remove a lot of boilerplate code and to simplify some
parts of the test logic, mainly when the test needs to exit early. The
result of this is more than 500 lines removed from
tools/testing/selftests/futex/. Also, this enables new tests to use
kselftest.h features like ASSERT_s and such.
There are some caveats around this refactor:
- logging.h had verbosity levels, while kselftest_harness.h doesn't. I
created a new print function called ksft_print_dbg_msg() that prints
the message if the user uses the -d flag, so now there's an
equivalent of this feature.
- futex_requeue_pi test accepted command line arguments to be used as
test parameters (e.g. ./futex_requeue_pi -b -l -t 500000). This
doesn't work with kselftest_harness.h because there's no
straightforward way to send command line arguments to the test.
I used FIXTURE_VARIANT() to achieve the same result, but now the
parameters live inside of the test file, instead of on
functional/run.sh. This increased a little bit the number of test
cases for futex_requeue_pi, from 22 to 24.
- test_harness_run() calls mmap(MAP_SHARED) before running the test and
this has caused a side effect on test futex_numa_mpol.c. This test
also calls mmap() and then try to access an address out of
boundaries of this mapped memory for a "Memory out of range" subtest,
where the kernel should return -EACCESS. After the refactor, the test
address might be fall inside the first memory mapped region, thus
being a valid address and succeeding the syscall, making the test
fail. To fix that, I created a small "buffer zone" with
mmap(PROT_NONE) between both mmaps.
I have compared the results of run.sh before and after this patchset and
didn't find any regression from the test results.
Thanks,
André
[1] https://lore.kernel.org/lkml/87ecv6p364.ffs@tglx/
---
André Almeida (15):
selftests: kselftest: Create ksft_print_dbg_msg()
selftests/futex: Refactor futex_requeue_pi with kselftest_harness.h
selftests/futex: Refactor futex_requeue_pi_mismatched_ops with kselftest_harness.h
selftests/futex: Refactor futex_requeue_pi_signal_restart with kselftest_harness.h
selftests/futex: Refactor futex_wait_timeout with kselftest_harness.h
selftests/futex: Refactor futex_wait_wouldblock with kselftest_harness.h
selftests/futex: Refactor futex_wait_unitialized_heap with kselftest_harness.h
selftests/futex: Refactor futex_wait_private_mapped_file with kselftest_harness.h
selftests/futex: Refactor futex_wait with kselftest_harness.h
selftests/futex: Refactor futex_requeue with kselftest_harness.h
selftests/futex: Refactor futex_waitv with kselftest_harness.h
selftests/futex: Refactor futex_priv_hash with kselftest_harness.h
selftests/futex: Refactor futex_numa_mpol with kselftest_harness.h
selftests/futex: Drop logging.h include from futex_numa
selftests/futex: Remove logging.h file
tools/testing/selftests/futex/functional/Makefile | 3 +-
.../selftests/futex/functional/futex_numa.c | 3 +-
.../selftests/futex/functional/futex_numa_mpol.c | 57 ++---
.../selftests/futex/functional/futex_priv_hash.c | 79 +++----
.../selftests/futex/functional/futex_requeue.c | 76 ++----
.../selftests/futex/functional/futex_requeue_pi.c | 261 ++++++++++-----------
.../functional/futex_requeue_pi_mismatched_ops.c | 80 ++-----
.../functional/futex_requeue_pi_signal_restart.c | 129 +++-------
.../selftests/futex/functional/futex_wait.c | 103 +++-----
.../functional/futex_wait_private_mapped_file.c | 83 ++-----
.../futex/functional/futex_wait_timeout.c | 139 +++++------
.../functional/futex_wait_uninitialized_heap.c | 76 ++----
.../futex/functional/futex_wait_wouldblock.c | 75 ++----
.../selftests/futex/functional/futex_waitv.c | 98 ++++----
tools/testing/selftests/futex/functional/run.sh | 62 +----
tools/testing/selftests/futex/include/logging.h | 148 ------------
tools/testing/selftests/kselftest.h | 13 +
tools/testing/selftests/kselftest_harness.h | 13 +-
18 files changed, 491 insertions(+), 1007 deletions(-)
---
base-commit: a24cc6ce1933eade12aa2b9859de0fcd2dac2c06
change-id: 20250703-tonyk-robust_test_cleanup-d1f3406365d9
Best regards,
--
André Almeida <andrealmeid(a)igalia.com>
Hello everyone,
I am reaching out to announce that we are once again planning to
gather to discuss testing and dependability related topics at the
Kernel Testing & Dependability Micro-conference (a.k.a. Testing MC)
at Linux Plumbers Conference Japan 2025.
- https://lpc.events/event/19/sessions/228/
The Linux Plumbers 2025 Kernel Testing & Dependability track focuses
on advancing the current state of testing of the Linux Kernel and its
related infrastructure.
The main purpose is to improve software quality and dependability for
applications that require predictability and trust.
We aim to create connections between folks working on similar
projects, and help individual projects make progress.
This track is intended to promote collaboration between all the
communities and people interested in the Kernel testing &
dependability.
This will help move the conversation forward from where we left off at
the LPC 2024 Kernel Testing & Dependability MC.
We ask that any topic discussions focus on issues/problems they are
facing and possible alternatives to resolving them.
The Micro-conference is open to all topics related to testing on
Linux, not necessarily in the kernel space.
Suggested topics:
- KernelCI: Maestro, kci-dev, kci-deploy, kci-gitlab, new dashboard, KCIDB-ng
- Improve sanitizers: KFENCE, KCSAN, KASAN, UBSAN
- Using Clang for better testing coverage: Now that the kernel fully
supports building with Clang, how can all that work be leveraged into
using Clang's features?
- Consolidating toolchains: reference collection for increased
reproducibility and quality control.
- How to spread KUnit throughout the kernel?
- Building and testing in-kernel Rust code.
- Identify missing features that will provide assurance in safety
critical systems.
- Which test coverage infrastructures are most effective to provide
evidence for kernel quality assurance? How should it be measured?
- Explore ways to improve testing framework and tests in the kernel
with a specific goal to increase traceability and code coverage.
- Regression Testing for safety: Prioritize configurations and tests
critical and important for quality and dependability.
- Transitioning to test-driven kernel release cycles for mainline and
stable: How to start relying on passing tests before releasing a new
tag?
- Explore how do SBOMs figure into dependability?
- Kernel benchmarking and kernel performance evaluation.
We invite you to submit proposals for discussions.
Proposals can be submitted here,
by 11:59PM UTC on Wednesday, September 10, 2025:
- https://lpc.events/event/19/abstracts/
Please send any inquiries to the MC leads:
Arisu Tachibana <arisu.tachibana(a)miraclelinux.com>
Shuah Khan <shuah(a)kernel.org>
Sasha Levin <sashal(a)kernel.org>
Guillaume Tucker <gtucker(a)gtucker.io>
thank you,
-- Arisu
The rtnetlink test for preferred lifetime of an address is quite flaky.
Problems started around the 6.16 merge window in May. The test fails
with:
FAIL: preferred_lft addresses remaining
and unlike most of our flakes this one fails on the "normal" kernel
builds, not the builds with kernel/configs/debug.config. I suspect
the flakes may be related to power saving, since the expirations
run from a "power efficient" workqueue. Adding a short sleep seems
to decrease the flakes by 8x but they still happen. With this
patch in place we get a flake every couple of weeks, not every
couple of days. Better ideas welcome..
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: liuhangbin(a)gmail.com
CC: shuah(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/rtnetlink.sh | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 2e8243a65b50..b9e1497ea27a 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -299,6 +299,11 @@ kci_test_addrlft()
done
sleep 5
+ # Schedule out for a bit, address GC runs from the power efficient WQ
+ # if the long sleep above has put the whole system into sleep state
+ # the WQ may have not had a chance to run.
+ sleep 0.1
+
run_cmd_grep_fail "10.23.11." ip addr show dev "$devdummy"
if [ $? -eq 0 ]; then
check_err 1
--
2.50.0
This is series 2a/5 of the migration to `core::ffi::CStr`[0].
20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com.
This series depends on the prior series[0] and is intended to go through
the rust tree to reduce the number of release cycles required to
complete the work.
Subsystem maintainers: I would appreciate your `Acked-by`s so that this
can be taken through Miguel's tree (where the other series must go).
[0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm…
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Tamir Duberstein (9):
gpu: nova-core: use `kernel::{fmt,prelude::fmt!}`
rust: alloc: use `kernel::{fmt,prelude::fmt!}`
rust: block: use `kernel::{fmt,prelude::fmt!}`
rust: device: use `kernel::{fmt,prelude::fmt!}`
rust: file: use `kernel::{fmt,prelude::fmt!}`
rust: kunit: use `kernel::{fmt,prelude::fmt!}`
rust: pin-init: use `kernel::{fmt,prelude::fmt!}`
rust: seq_file: use `kernel::{fmt,prelude::fmt!}`
rust: sync: use `kernel::{fmt,prelude::fmt!}`
drivers/block/rnull.rs | 2 +-
drivers/gpu/nova-core/gpu.rs | 3 +--
drivers/gpu/nova-core/regs/macros.rs | 6 +++---
rust/kernel/alloc/kbox.rs | 2 +-
rust/kernel/alloc/kvec.rs | 2 +-
rust/kernel/alloc/kvec/errors.rs | 2 +-
rust/kernel/block/mq.rs | 2 +-
rust/kernel/block/mq/gen_disk.rs | 2 +-
rust/kernel/block/mq/raw_writer.rs | 3 +--
rust/kernel/device.rs | 6 +++---
rust/kernel/fs/file.rs | 5 +++--
rust/kernel/init.rs | 4 ++--
rust/kernel/kunit.rs | 8 ++++----
rust/kernel/seq_file.rs | 6 +++---
rust/kernel/sync/arc.rs | 3 +--
scripts/rustdoc_test_gen.rs | 2 +-
16 files changed, 28 insertions(+), 30 deletions(-)
---
base-commit: 769e324b66b0d92d04f315d0c45a0f72737c7494
change-id: 20250709-core-cstr-fanout-1-f20611832272
prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1
prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573
prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e
prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa
prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec
prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27
prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
This is series 2b/5 of the migration to `core::ffi::CStr`[0].
20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com.
This series depends on the prior series[0] and is intended to go through
the rust tree to reduce the number of release cycles required to
complete the work.
Subsystem maintainers: I would appreciate your `Acked-by`s so that this
can be taken through Miguel's tree (where the other series must go).
[0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm…
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Tamir Duberstein (10):
gpu: nova-core: use `core::ffi::CStr` method names
rust: auxiliary: use `core::ffi::CStr` method names
rust: configfs: use `core::ffi::CStr` method names
rust: cpufreq: use `core::ffi::CStr` method names
rust: drm: use `core::ffi::CStr` method names
rust: firmware: use `core::ffi::CStr` method names
rust: kunit: use `core::ffi::CStr` method names
rust: miscdevice: use `core::ffi::CStr` method names
rust: net: use `core::ffi::CStr` method names
rust: of: use `core::ffi::CStr` method names
drivers/gpu/drm/drm_panic_qr.rs | 2 +-
rust/kernel/auxiliary.rs | 4 ++--
rust/kernel/configfs.rs | 4 ++--
rust/kernel/cpufreq.rs | 2 +-
rust/kernel/drm/device.rs | 4 ++--
rust/kernel/firmware.rs | 2 +-
rust/kernel/kunit.rs | 6 +++---
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/net/phy.rs | 2 +-
rust/kernel/of.rs | 2 +-
samples/rust/rust_configfs.rs | 2 +-
11 files changed, 16 insertions(+), 16 deletions(-)
---
base-commit: 769e324b66b0d92d04f315d0c45a0f72737c7494
change-id: 20250709-core-cstr-fanout-1-f20611832272
prerequisite-change-id: 20250704-core-cstr-prepare-9b9e6a7bd57e:v1
prerequisite-patch-id: 83b1239d1805f206711a5a936bbb61c83227d573
prerequisite-patch-id: a0355dd0efcc945b0565dc4e5a0f42b5a3d29c7e
prerequisite-patch-id: 8585bf441cfab705181f5606c63483c2e88d25aa
prerequisite-patch-id: 04ec344c0bc23f90dbeac10afe26df1a86ce53ec
prerequisite-patch-id: a2fc6cd05fce6d6da8d401e9f8a905bb5c0b2f27
prerequisite-patch-id: f14c099c87562069f25fb7aea6d9aae4086c49a8
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>