Introduce SW acceleration for IPIP tunnels in the netfilter flowtable
infrastructure.
---
Changes in v6:
- Rebase on top of nf-next main branch
- Link to v5: https://lore.kernel.org/r/20250721-nf-flowtable-ipip-v5-0-0865af9e58c6@kern…
Changes in v5:
- Rely on __ipv4_addr_hash() to compute the hash used as encap ID
- Remove unnecessary pskb_may_pull() in nf_flow_tuple_encap()
- Add nf_flow_ip4_ecanp_pop utility routine
- Link to v4: https://lore.kernel.org/r/20250718-nf-flowtable-ipip-v4-0-f8bb1c18b986@kern…
Changes in v4:
- Use the hash value of the saddr, daddr and protocol of outer IP header as
encapsulation id.
- Link to v3: https://lore.kernel.org/r/20250703-nf-flowtable-ipip-v3-0-880afd319b9f@kern…
Changes in v3:
- Add outer IP header sanity checks
- target nf-next tree instead of net-next
- Link to v2: https://lore.kernel.org/r/20250627-nf-flowtable-ipip-v2-0-c713003ce75b@kern…
Changes in v2:
- Introduce IPIP flowtable selftest
- Link to v1: https://lore.kernel.org/r/20250623-nf-flowtable-ipip-v1-1-2853596e3941@kern…
---
Lorenzo Bianconi (2):
net: netfilter: Add IPIP flowtable SW acceleration
selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest
include/linux/netdevice.h | 1 +
net/ipv4/ipip.c | 28 +++++++++++
net/netfilter/nf_flow_table_ip.c | 56 +++++++++++++++++++++-
net/netfilter/nft_flow_offload.c | 1 +
.../selftests/net/netfilter/nft_flowtable.sh | 40 ++++++++++++++++
5 files changed, 124 insertions(+), 2 deletions(-)
---
base-commit: bab3ce404553de56242d7b09ad7ea5b70441ea41
change-id: 20250623-nf-flowtable-ipip-1b3d7b08d067
Best regards,
--
Lorenzo Bianconi <lorenzo(a)kernel.org>
Hi all,
This series updates the drv-net XDP program used by the new xdp.py selftest
to use the bpf_dynptr APIs for packet access.
The selftest itself is unchanged.
The original program accessed packet headers directly via
ctx->data/data_end, implicitly assuming headers are always in the linear
region. That assumption is incorrect for multi-buffer XDP and does not
hold across all drivers. For example, mlx5 with striding RQ can leave the
linear area empty, causing the multi-buffer cases to fail.
Switching to bpf_xdp_load/store_bytes would work but always incurs copies.
Instead, this series adopts bpf_dynptr, which provides safe,
verifier-checked access across both linear and fragmented areas while
avoiding copies.
Amery Hung has also proposed a series [1] that addresses the same issues in
the program, but through the use of bpf_xdp_pull_data. My series is not
intended as a replacement for that work, but rather as an exploration of
another viable solution, each of which may be preferable under different
circumstances.
In cases where the program does not return XDP_PASS, I believe dynptr has
an advantage since it avoids an extra copy. Conversely, when the program
returns XDP_PASS, bpf_xdp_pull_data may be preferable, as the copy will
be performed in any case during skb creation.
It may make sense to split the work into two separate programs, allowing us
to test both solutions independently. Alternatively, we can consider a
combined approach, where the more fitting solution is applied for each use
case. I welcome feedback on which direction would be most useful.
[1] https://lore.kernel.org/all/20250905173352.3759457-1-ameryhung@gmail.com/
Thanks!
Nimrod
Nimrod Oren (5):
selftests: drv-net: Test XDP_TX with bpf_dynptr
selftests: drv-net: Test XDP tail adjustment with bpf_dynptr
selftests: drv-net: Test XDP head adjustment with bpf_dynptr
selftests: drv-net: Adjust XDP header data with bpf_dynptr
selftests: drv-net: Check XDP header data with bpf_dynptr
.../selftests/net/lib/xdp_native.bpf.c | 219 ++++++++----------
1 file changed, 96 insertions(+), 123 deletions(-)
--
2.45.0
This series fixes issues in devlink_rate_tc_bw.py selftest that made
its checks unreliable and its documentation inconsistent with the
actual configuration.
Thanks
Carolina Jubran (3):
selftests: drv-net: Fix and clarify TC bandwidth split in
devlink_rate_tc_bw.py
selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
selftests: drv-net: Relax total BW check in devlink_rate_tc_bw.py
.../drivers/net/hw/devlink_rate_tc_bw.py | 102 ++++++++----------
1 file changed, 44 insertions(+), 58 deletions(-)
--
2.38.1
The loop in bench_sockmap_prog_destroy() has two issues:
1. Using 'sizeof(ctx.fds)' as the loop bound results in the number of
bytes, not the number of file descriptors, causing the loop to iterate
far more times than intended.
2. The condition 'ctx.fds[0] > 0' incorrectly checks only the first fd for
all iterations, potentially leaving file descriptors unclosed. Change
it to 'ctx.fds[i] > 0' to check each fd properly.
These fixes ensure correct cleanup of all file descriptors when the
benchmark exits.
Signed-off-by: Jiayuan Chen <jiayuan.chen(a)linux.dev>
Reported-by: Dan Carpenter <dan.carpenter(a)linaro.org>
Closes: https://lore.kernel.org/bpf/aLqfWuRR9R_KTe5e@stanley.mountain/
---
tools/testing/selftests/bpf/benchs/bench_sockmap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/benchs/bench_sockmap.c b/tools/testing/selftests/bpf/benchs/bench_sockmap.c
index 8ebf563a67a2..cfc072aa7fff 100644
--- a/tools/testing/selftests/bpf/benchs/bench_sockmap.c
+++ b/tools/testing/selftests/bpf/benchs/bench_sockmap.c
@@ -10,6 +10,7 @@
#include <argp.h>
#include "bench.h"
#include "bench_sockmap_prog.skel.h"
+#include "bpf_util.h"
#define FILE_SIZE (128 * 1024)
#define DATA_REPEAT_SIZE 10
@@ -124,8 +125,8 @@ static void bench_sockmap_prog_destroy(void)
{
int i;
- for (i = 0; i < sizeof(ctx.fds); i++) {
- if (ctx.fds[0] > 0)
+ for (i = 0; i < ARRAY_SIZE(ctx.fds); i++) {
+ if (ctx.fds[i] > 0)
close(ctx.fds[i]);
}
--
2.43.0
Two patches here, first fixes the issue where tunnel core doesn't
actually extract DF bit from the outer IP header, even though both
OVS and TC flower allow matching on it. More details in the commit
message.
The second is a selftest for openvswitch that reproduces the issue,
but also just adds some basic coverage for the tunnel metadata
extraction and related openvswitch uAPI.
Ilya Maximets (2):
net: dst_metadata: fix IP_DF bit not extracted from tunnel headers
selftests: openvswitch: add a simple test for tunnel metadata
include/net/dst_metadata.h | 11 ++-
.../selftests/net/openvswitch/openvswitch.sh | 88 +++++++++++++++++--
2 files changed, 90 insertions(+), 9 deletions(-)
--
2.50.1
This is based on mm-unstable.
I will only CC non-MM folks on the cover letter and the respective patch
to not flood too many inboxes (the lists receive all patches).
--
As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.
To recap, the reason we currently need nth_page() within a folio is because
on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.
While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.
So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.
Likely, many people have no idea when nth_page() is required and when
it might be dropped.
We refer to such problematic PFN ranges and "non-contiguous pages".
If we only deal with "contiguous pages", there is not need for nth_page().
Besides that "obvious" folio case, we might end up using nth_page()
within CMA allocations (again, could span memory sections), and in
one corner case (kfence) when processing memblock allocations (again,
could span memory sections).
So let's handle all that, add sanity checks, and remove nth_page().
Patch #1 -> #5 : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch #6 -> #13 : disallow folios to have non-contiguous pages
Patch #14 -> #20 : remove nth_page() usage within folios
Patch #22 : disallow CMA allocations of non-contiguous pages
Patch #23 -> #33 : sanity+check + remove nth_page() usage within SG entry
Patch #34 : sanity-check + remove nth_page() usage in
unpin_user_page_range_dirty_lock()
Patch #35 : remove nth_page() in kfence
Patch #36 : adjust stale comment regarding nth_page
Patch #37 : mm: remove nth_page()
A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.
[1] https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-G…
v1 -> v2:
* "fs: hugetlbfs: cleanup folio in adjust_range_hwpoison()"
-> Add comment for loop and remove comment of function regarding
copy_page_to_iter().
* Various smaller patch description tweaks I am not going to list for my
sanity
* "mips: mm: convert __flush_dcache_pages() to
__flush_dcache_folio_pages()"
-> Fix flush_dcache_page()
-> Drop "extern"
* "mm/gup: remove record_subpages()"
-> Added
* "mm/hugetlb: check for unreasonable folio sizes when registering hstate"
-> Refine comment
* "mm/cma: refuse handing out non-contiguous page ranges"
-> Add comment above loop
* "mm/page_alloc: reject unreasonable folio/compound page sizes in
alloc_contig_range_noprof()"
-> Added comment above check
* "mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()"
-> Refined comment
RFC -> v1:
* "wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel
config"
-> Mention that it was never really relevant for the test
* "mm/mm_init: make memmap_init_compound() look more like
prep_compound_page()"
-> Mention the setup of page links
* "mm: limit folio/compound page sizes in problematic kernel configs"
-> Improve comment for PUD handling, mentioning hugetlb and dax
* "mm: simplify folio_page() and folio_page_idx()"
-> Call variable "n"
* "mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()"
-> Keep __init_single_page() and refer to the usage of
memblock_reserved_mark_noinit()
* "fs: hugetlbfs: cleanup folio in adjust_range_hwpoison()"
* "fs: hugetlbfs: remove nth_page() usage within folio in
adjust_range_hwpoison()"
-> Separate nth_page() removal from cleanups
-> Further improve cleanups
* "io_uring/zcrx: remove nth_page() usage within folio"
-> Keep the io_copy_cache for now and limit to nth_page() removal
* "mm/gup: drop nth_page() usage within folio when recording subpages"
-> Cleanup record_subpages as bit
* "mm/cma: refuse handing out non-contiguous page ranges"
-> Replace another instance of "pfn_to_page(pfn)" where we already have
the page
* "scatterlist: disallow non-contigous page ranges in a single SG entry"
-> We have to EXPORT the symbol. I thought about moving it to mm_inline.h,
but I really don't want to include that in include/linux/scatterlist.h
* "ata: libata-eh: drop nth_page() usage within SG entry"
* "mspro_block: drop nth_page() usage within SG entry"
* "memstick: drop nth_page() usage within SG entry"
* "mmc: drop nth_page() usage within SG entry"
-> Keep PAGE_SHIFT
* "scsi: scsi_lib: drop nth_page() usage within SG entry"
* "scsi: sg: drop nth_page() usage within SG entry"
-> Split patches, Keep PAGE_SHIFT
* "crypto: remove nth_page() usage within SG entry"
-> Keep PAGE_SHIFT
* "kfence: drop nth_page() usage"
-> Keep modifying i and use "start_pfn" only instead
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Mike Rapoport <rppt(a)kernel.org>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: Marek Szyprowski <m.szyprowski(a)samsung.com>
Cc: Robin Murphy <robin.murphy(a)arm.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Brendan Jackman <jackmanb(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: Dennis Zhou <dennis(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Christoph Lameter <cl(a)gentwo.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: x86(a)kernel.org
Cc: linux-arm-kernel(a)lists.infradead.org
Cc: linux-mips(a)vger.kernel.org
Cc: linux-s390(a)vger.kernel.org
Cc: linux-crypto(a)vger.kernel.org
Cc: linux-ide(a)vger.kernel.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: dri-devel(a)lists.freedesktop.org
Cc: linux-mmc(a)vger.kernel.org
Cc: linux-arm-kernel(a)axis.com
Cc: linux-scsi(a)vger.kernel.org
Cc: kvm(a)vger.kernel.org
Cc: virtualization(a)lists.linux.dev
Cc: linux-mm(a)kvack.org
Cc: io-uring(a)vger.kernel.org
Cc: iommu(a)lists.linux.dev
Cc: kasan-dev(a)googlegroups.com
Cc: wireguard(a)lists.zx2c4.com
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-riscv(a)lists.infradead.org
David Hildenbrand (37):
mm: stop making SPARSEMEM_VMEMMAP user-selectable
arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu
kernel config
mm/page_alloc: reject unreasonable folio/compound page sizes in
alloc_contig_range_noprof()
mm/memremap: reject unreasonable folio/compound page sizes in
memremap_pages()
mm/hugetlb: check for unreasonable folio sizes when registering hstate
mm/mm_init: make memmap_init_compound() look more like
prep_compound_page()
mm: sanity-check maximum folio size in folio_set_order()
mm: limit folio/compound page sizes in problematic kernel configs
mm: simplify folio_page() and folio_page_idx()
mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
mm/mm/percpu-km: drop nth_page() usage within single allocation
fs: hugetlbfs: remove nth_page() usage within folio in
adjust_range_hwpoison()
fs: hugetlbfs: cleanup folio in adjust_range_hwpoison()
mm/pagewalk: drop nth_page() usage within folio in folio_walk_start()
mm/gup: drop nth_page() usage within folio when recording subpages
mm/gup: remove record_subpages()
io_uring/zcrx: remove nth_page() usage within folio
mips: mm: convert __flush_dcache_pages() to
__flush_dcache_folio_pages()
mm/cma: refuse handing out non-contiguous page ranges
dma-remap: drop nth_page() in dma_common_contiguous_remap()
scatterlist: disallow non-contigous page ranges in a single SG entry
ata: libata-sff: drop nth_page() usage within SG entry
drm/i915/gem: drop nth_page() usage within SG entry
mspro_block: drop nth_page() usage within SG entry
memstick: drop nth_page() usage within SG entry
mmc: drop nth_page() usage within SG entry
scsi: scsi_lib: drop nth_page() usage within SG entry
scsi: sg: drop nth_page() usage within SG entry
vfio/pci: drop nth_page() usage within SG entry
crypto: remove nth_page() usage within SG entry
mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()
kfence: drop nth_page() usage
block: update comment of "struct bio_vec" regarding nth_page()
mm: remove nth_page()
arch/arm64/Kconfig | 1 -
arch/mips/include/asm/cacheflush.h | 11 +++--
arch/mips/mm/cache.c | 8 ++--
arch/s390/Kconfig | 1 -
arch/x86/Kconfig | 1 -
crypto/ahash.c | 4 +-
crypto/scompress.c | 8 ++--
drivers/ata/libata-sff.c | 6 +--
drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +-
drivers/memstick/core/mspro_block.c | 3 +-
drivers/memstick/host/jmb38x_ms.c | 3 +-
drivers/memstick/host/tifm_ms.c | 3 +-
drivers/mmc/host/tifm_sd.c | 4 +-
drivers/mmc/host/usdhi6rol0.c | 4 +-
drivers/scsi/scsi_lib.c | 3 +-
drivers/scsi/sg.c | 3 +-
drivers/vfio/pci/pds/lm.c | 3 +-
drivers/vfio/pci/virtio/migrate.c | 3 +-
fs/hugetlbfs/inode.c | 36 +++++---------
include/crypto/scatterwalk.h | 4 +-
include/linux/bvec.h | 7 +--
include/linux/mm.h | 48 +++++++++++++++----
include/linux/page-flags.h | 5 +-
include/linux/scatterlist.h | 3 +-
io_uring/zcrx.c | 4 +-
kernel/dma/remap.c | 2 +-
mm/Kconfig | 3 +-
mm/cma.c | 39 +++++++++------
mm/gup.c | 36 +++++++-------
mm/hugetlb.c | 22 +++++----
mm/internal.h | 1 +
mm/kfence/core.c | 12 +++--
mm/memremap.c | 3 ++
mm/mm_init.c | 15 +++---
mm/page_alloc.c | 10 +++-
mm/pagewalk.c | 2 +-
mm/percpu-km.c | 2 +-
mm/util.c | 36 ++++++++++++++
tools/testing/scatterlist/linux/mm.h | 1 -
.../selftests/wireguard/qemu/kernel.config | 1 -
40 files changed, 217 insertions(+), 146 deletions(-)
base-commit: b73c6f2b5712809f5f386780ac46d1d78c31b2e6
--
2.50.1
This patchset introduces a new per-port bonding option: `ad_actor_port_prio`.
It allows users to configure the actor's port priority, which can then be used
by the bonding driver for aggregator selection based on port priority.
This provides finer control over LACP aggregator choice, especially in setups
with multiple eligible aggregators over 2 switches.
v5:
a) rename 'prio' to 'actor_port_prio' in bond_ad_select_tbl (Jay Vosburgh)
b) update document description
v4:
a) fix actor_port_prio minimal value (Jay Vosburgh)
b) fix ad_agg_selection_test comment order (Paolo Abeni)
c) restruct selftest, reduce duplication (Paolo Abeni)
v3:
a) add comments when init slave port_priority (Jonas Gorski)
b) rename ad_lacp_port_prio to lacp_port_prio (Jay Vosburgh)
v2:
a) set default bond option value for port priority (Nikolay Aleksandrov)
b) fix __agg_ports_priority coding style (Nikolay Aleksandrov)
c) fix shellcheck warns
Hangbin Liu (3):
bonding: add support for per-port LACP actor priority
bonding: support aggregator selection based on port priority
selftests: bonding: add test for LACP actor port priority
Documentation/networking/bonding.rst | 25 +++-
drivers/net/bonding/bond_3ad.c | 31 +++++
drivers/net/bonding/bond_netlink.c | 16 +++
drivers/net/bonding/bond_options.c | 45 +++++++-
include/net/bond_3ad.h | 2 +
include/net/bond_options.h | 1 +
include/uapi/linux/if_link.h | 1 +
.../selftests/drivers/net/bonding/Makefile | 3 +-
.../drivers/net/bonding/bond_lacp_prio.sh | 108 ++++++++++++++++++
tools/testing/selftests/net/forwarding/lib.sh | 24 ----
tools/testing/selftests/net/lib.sh | 24 ++++
11 files changed, 247 insertions(+), 33 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_lacp_prio.sh
--
2.50.1
The three patches fix the va_high_addr_switch.sh test failure on x86_64.
Patch 1 fixes the hugepage setup issue that nr_hugepages is reset too
early in run_vmtests.sh and break the later va_high_addr_switch testing.
Patch 2 adds hugepage setup in va_high_addr_switch test, so that it can
still work if vm_runtests.sh changes the hugepage setup someday.
Patch 3 fixes the test failure caused by the hint addr align method change
in hugetlb_get_unmapped_area().
Changes in v2:
- patch 1 renames nr_hugepgs_origin to orig_nr_hugepgs
- add a patch 2 to setup hugeapges in va_high_addr_switch test
Chunyu Hu (3):
selftests/mm: fix hugepages cleanup too early
selftests/mm: alloc hugepages in va_high_addr_switch test
selftests/mm: fix va_high_addr_switch.sh failure on x86_64
tools/testing/selftests/mm/run_vmtests.sh | 9 ++++-
.../selftests/mm/va_high_addr_switch.c | 4 +-
.../selftests/mm/va_high_addr_switch.sh | 37 +++++++++++++++++++
3 files changed, 46 insertions(+), 4 deletions(-)
--
2.49.0
This patchset ensures that the number of hugepages is correctly set in the
system so that the uffd-stress test does not fail due to the racy nature of
the test. Patch 1 changes the hugepage constraint in the run_vmtests.sh
script, whereas patch 2 changes the constraint in the test itself.
---
Based on 6.17-rc5.
Dev Jain (2):
selftests/mm/uffd-stress: make test operate on less hugetlb memory
selftests/mm/uffd-stress: stricten constraint on free hugepages needed
before the test
tools/testing/selftests/mm/run_vmtests.sh | 10 +++++++---
tools/testing/selftests/mm/uffd-stress.c | 17 +++++++++++------
2 files changed, 18 insertions(+), 9 deletions(-)
--
2.30.2
This patchset ensures that the number of hugepages is correctly set in the
system so that the uffd-stress test does not fail due to the racy nature of
the test. Patch 1 corrects the hugepage constraint in the run_vmtests.sh
script, whereas patch 2 corrects the constraint in the test itself.
Dev Jain (2):
selftests/mm/uffd-stress: Make test operate on less hugetlb memory
selftests/mm/uffd-stress: Stricten constraint on free hugepages before
the test
tools/testing/selftests/mm/run_vmtests.sh | 2 +-
tools/testing/selftests/mm/uffd-stress.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
--
2.30.2
Some high-level virtual drivers need to compute features from their
lower devices, but each currently has its own implementation and may
miss some feature computations. This patch set introduces a common function
to compute features for such devices.
Currently, bonding, team, and bridge have been updated to use the new
helper.
v2:
a) remove hard_header_len setting. I will set needed_headroom for bond/team
in a separate patch as bridge has it's own ways. (Ido Schimmel)
b) Add test file to Makefile, set RET=0 to a proper location. (Ido Schimmel)
Hangbin Liu (5):
net: add a common function to compute features from lowers devices
bonding: use common function to compute the features
team: use common function to compute the features
net: bridge: use common function to compute the features
selftests/net: add offload checking test for virtual interface
drivers/net/bonding/bond_main.c | 99 +----------
drivers/net/team/team_core.c | 73 +-------
include/linux/netdevice.h | 19 +++
net/bridge/br_if.c | 22 +--
net/core/dev.c | 76 +++++++++
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/config | 2 +
tools/testing/selftests/net/vdev_offload.sh | 176 ++++++++++++++++++++
8 files changed, 285 insertions(+), 183 deletions(-)
create mode 100755 tools/testing/selftests/net/vdev_offload.sh
--
2.50.1
From: Fred Griffoul <griffoul(a)casper.infradead.org>
This patch series addresses performance issues in nested VMX when
handling unmanaged guest memory. Unmanaged guest memory refers to memory
not directly mapped by the kernel (no struct page), such as memory
passed with the mem= parameter or guest_memfd for non-Confidential
Computing (CoCo) VMs.
Current Problem:
During nested VMX operations, the system frequently accesses specific
guest pages during L2 VM entry/exit cycles. The current workflow:
1. kvm_vcpu_map() invokes memremap() for unmanaged memory.
2. The system either directly accesses mapped memory via nested VMX or
passes it to the L2 guest through vmcs02.
3. kvm_vcpu_unmap() invokes memunmap()
This repeated map/unmap cycle creates significant performance overhead
due to expensive remapping operations.
Solution approach:
Our solution replaces kvm_host_map with gfn_to_pfn_cache in nested VMX.
It addresses two distinct types of guest pages.
First, we handle the L1 MSR bitmap page, which requires read-only access
for folding L1 and L0 MSR bitmap. We implement this conversion to
gfn_to_pfn_cache in patch 1.
Second, we tackle system pages, including APIC access, virtual APIC, and
posted interrupt descriptor pages. These pages are more complex as
they're accessed by both nested VMX code _and_ passed to the L2 guest in
vmcs02 fields. This requires to restore and complete the
"guest-uses-pfn" support in pfncache through patches 2 and 3, followed
by implementing kvm_host_map replacement with caches in patch 4.
Testing:
Patch 5 introduces a new selftest to verify cache invalidation and
memslot update functionality.
The changes are available in a git repository at:
git://git.infradead.org/users/griffoul/linux.git tags/nvmx-gpc-v1
Suggested-by: dwmw(a)amazon.co.uk
Fred Griffoul (5):
KVM: nVMX: Implement cache for L1 MSR bitmap
KVM: pfncache: Restore guest-uses-pfn support
KVM: x86: Add nested state validation for pfncache support
KVM: nVMX: Implement cache for L1 APIC pages
KVM: selftests: Add nested VMX APIC cache invalidation test
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/nested.c | 213 +++++++++---
arch/x86/kvm/vmx/vmx.h | 10 +-
arch/x86/kvm/x86.c | 14 +-
include/linux/kvm_host.h | 34 +-
include/linux/kvm_types.h | 1 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../selftests/kvm/x86/vmx_apic_update_test.c | 302 ++++++++++++++++++
virt/kvm/kvm_main.c | 3 +-
virt/kvm/kvm_mm.h | 6 +-
virt/kvm/pfncache.c | 43 ++-
11 files changed, 575 insertions(+), 53 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86/vmx_apic_update_test.c
--
2.51.0
The TODO about using the number of vCPUs instead of vcpu.id + 1
was already addressed by commit 376bc1b458c9 ("KVM: selftests: Don't
assume vcpu->id is '0' in xAPIC state test"). The comment is now
stale and can be removed.
Signed-off-by: Sukrut Heroorkar <hsukrut3(a)gmail.com>
---
tools/testing/selftests/kvm/x86/xapic_state_test.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86/xapic_state_test.c b/tools/testing/selftests/kvm/x86/xapic_state_test.c
index fdebff1165c7..3b4814c55722 100644
--- a/tools/testing/selftests/kvm/x86/xapic_state_test.c
+++ b/tools/testing/selftests/kvm/x86/xapic_state_test.c
@@ -120,8 +120,8 @@ static void test_icr(struct xapic_vcpu *x)
__test_icr(x, icr | i);
/*
- * Send all flavors of IPIs to non-existent vCPUs. TODO: use number of
- * vCPUs, not vcpu.id + 1. Arbitrarily use vector 0xff.
+ * Send all flavors of IPIs to non-existent vCPUs. Arbitrarily use
+ * vector 0xff.
*/
icr = APIC_INT_ASSERT | 0xff;
for (i = 0; i < 0xff; i++) {
--
2.43.0
Replace the hardcoded 0xff in test_icr() with the actual number of vcpus
created for the vm. This address the existing TODO and keeps the test
correct if it is ever run with multiple vcpus.
Signed-off-by: Sukrut Heroorkar <hsukrut3(a)gmail.com>
---
tools/testing/selftests/kvm/x86/xapic_state_test.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/x86/xapic_state_test.c b/tools/testing/selftests/kvm/x86/xapic_state_test.c
index fdebff1165c7..4af36682503e 100644
--- a/tools/testing/selftests/kvm/x86/xapic_state_test.c
+++ b/tools/testing/selftests/kvm/x86/xapic_state_test.c
@@ -56,6 +56,17 @@ static void x2apic_guest_code(void)
} while (1);
}
+static unsigned int vm_nr_vcpus(struct kvm_vm *vm)
+{
+ struct kvm_vcpu *vcpu;
+ unsigned int count = 0;
+
+ list_for_each_entry(vcpu, &vm->vcpus, list)
+ count++;
+
+ return count;
+}
+
static void ____test_icr(struct xapic_vcpu *x, uint64_t val)
{
struct kvm_vcpu *vcpu = x->vcpu;
@@ -124,7 +135,7 @@ static void test_icr(struct xapic_vcpu *x)
* vCPUs, not vcpu.id + 1. Arbitrarily use vector 0xff.
*/
icr = APIC_INT_ASSERT | 0xff;
- for (i = 0; i < 0xff; i++) {
+ for (i = 0; i < vm_nr_vcpus(vcpu->vm); i++) {
if (i == vcpu->id)
continue;
for (j = 0; j < 8; j++)
--
2.43.0
Recent changes to make netlink socket memory accounting must
have broken the implicit assumption of the netlink-dump test
that we can fit exactly 64 dumps into the socket. Handle the
failure mode properly, and increase the dump count to 80
to make sure we still run into the error condition if
the default buffer size increases in the future.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
tools/testing/selftests/net/netlink-dumps.c | 43 ++++++++++++++++-----
1 file changed, 33 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/net/netlink-dumps.c b/tools/testing/selftests/net/netlink-dumps.c
index 07423f256f96..7618ebe528a4 100644
--- a/tools/testing/selftests/net/netlink-dumps.c
+++ b/tools/testing/selftests/net/netlink-dumps.c
@@ -31,9 +31,18 @@ struct ext_ack {
const char *str;
};
-/* 0: no done, 1: done found, 2: extack found, -1: error */
-static int nl_get_extack(char *buf, size_t n, struct ext_ack *ea)
+enum get_ea_ret {
+ ERROR = -1,
+ NO_CTRL = 0,
+ FOUND_DONE,
+ FOUND_ERR,
+ FOUND_EXTACK,
+};
+
+static enum get_ea_ret
+nl_get_extack(char *buf, size_t n, struct ext_ack *ea)
{
+ enum get_ea_ret ret = NO_CTRL;
const struct nlmsghdr *nlh;
const struct nlattr *attr;
ssize_t rem;
@@ -41,15 +50,19 @@ static int nl_get_extack(char *buf, size_t n, struct ext_ack *ea)
for (rem = n; rem > 0; NLMSG_NEXT(nlh, rem)) {
nlh = (struct nlmsghdr *)&buf[n - rem];
if (!NLMSG_OK(nlh, rem))
- return -1;
+ return ERROR;
- if (nlh->nlmsg_type != NLMSG_DONE)
+ if (nlh->nlmsg_type == NLMSG_ERROR)
+ ret = FOUND_ERR;
+ else if (nlh->nlmsg_type == NLMSG_DONE)
+ ret = FOUND_DONE;
+ else
continue;
ea->err = -*(int *)NLMSG_DATA(nlh);
if (!(nlh->nlmsg_flags & NLM_F_ACK_TLVS))
- return 1;
+ return ret;
ynl_attr_for_each(attr, nlh, sizeof(int)) {
switch (ynl_attr_type(attr)) {
@@ -68,10 +81,10 @@ static int nl_get_extack(char *buf, size_t n, struct ext_ack *ea)
}
}
- return 2;
+ return FOUND_EXTACK;
}
- return 0;
+ return ret;
}
static const struct {
@@ -99,9 +112,9 @@ static const struct {
TEST(dump_extack)
{
int netlink_sock;
+ int i, cnt, ret;
char buf[8192];
int one = 1;
- int i, cnt;
ssize_t n;
netlink_sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
@@ -118,7 +131,7 @@ TEST(dump_extack)
ASSERT_EQ(n, 0);
/* Dump so many times we fill up the buffer */
- cnt = 64;
+ cnt = 80;
for (i = 0; i < cnt; i++) {
n = send(netlink_sock, &dump_neigh_bad,
sizeof(dump_neigh_bad), 0);
@@ -140,10 +153,20 @@ TEST(dump_extack)
}
ASSERT_GE(n, (ssize_t)sizeof(struct nlmsghdr));
- EXPECT_EQ(nl_get_extack(buf, n, &ea), 2);
+ ret = nl_get_extack(buf, n, &ea);
+ /* Once we fill the buffer we'll see one ENOBUFS followed
+ * by a number of EBUSYs. Then the last recv() will finally
+ * trigger and complete the dump.
+ */
+ if (ret == FOUND_ERR && (ea.err == ENOBUFS || ea.err == EBUSY))
+ continue;
+ EXPECT_EQ(ret, FOUND_EXTACK);
+ EXPECT_EQ(ea.err, EINVAL);
EXPECT_EQ(ea.attr_offs,
sizeof(struct nlmsghdr) + sizeof(struct ndmsg));
}
+ /* Make sure last message was a full DONE+extack */
+ EXPECT_EQ(ret, FOUND_EXTACK);
}
static const struct {
--
2.51.0
The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling.
When GCS is active a secondary stack called the Guarded Control Stack is
maintained, protected with a memory attribute which means that it can
only be written with specific GCS operations. The current GCS pointer
can not be directly written to by userspace. When a BL is executed the
value stored in LR is also pushed onto the GCS, and when a RET is
executed the top of the GCS is popped and compared to LR with a fault
being raised if the values do not match. GCS operations may only be
performed on GCS pages, a data abort is generated if they are not.
The combination of hardware enforcement and lack of extra instructions
in the function entry and exit paths should result in something which
has less overhead and is more difficult to attack than a purely software
implementation like clang's shadow stacks.
This series implements support for managing GCS for KVM guests, it also
includes a fix for S1PIE which has also been sent separately as this
feature is a dependency for GCS. It is based on:
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/gcs
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v15:
- Rebase onto v6.17-rc1.
- Link to v14: https://lore.kernel.org/r/20241005-arm64-gcs-v14-0-59060cd6092b@kernel.org
Changes in v14:
- Rebase onto arm64/for-next/gcs which includes all the non-KVM support.
- Manage the fine grained traps for GCS instructions.
- Manage PSTATE.EXLOCK when delivering exceptions to KVM guests.
- Link to v13: https://lore.kernel.org/r/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.org
Changes in v13:
- Rebase onto v6.12-rc1.
- Allocate VM_HIGH_ARCH_6 since protection keys used all the existing
bits.
- Implement mm_release() and free transparently allocated GCSs there.
- Use bit 32 of AT_HWCAP for GCS due to AT_HWCAP2 being filled.
- Since we now only set GCSCRE0_EL1 on change ensure that it is
initialised with GCSPR_EL0 accessible to EL0.
- Fix OOM handling on thread copy.
- Link to v12: https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec947436a@kernel.org
Changes in v12:
- Clarify and simplify the signal handling code so we work with the
register state.
- When checking for write aborts to shadow stack pages ensure the fault
is a data abort.
- Depend on !UPROBES.
- Comment cleanups.
- Link to v11: https://lore.kernel.org/r/20240822-arm64-gcs-v11-0-41b81947ecb5@kernel.org
Changes in v11:
- Remove the dependency on the addition of clone3() support for shadow
stacks, rebasing onto v6.11-rc3.
- Make ID_AA64PFR1_EL1.GCS writeable in KVM.
- Hide GCS registers when GCS is not enabled for KVM guests.
- Require HCRX_EL2.GCSEn if booting at EL1.
- Require that GCSCR_EL1 and GCSCRE0_EL1 be initialised regardless of
if we boot at EL2 or EL1.
- Remove some stray use of bit 63 in signal cap tokens.
- Warn if we see a GCS with VM_SHARED.
- Remove rdundant check for VM_WRITE in fault handling.
- Cleanups and clarifications in the ABI document.
- Clean up and improve documentation of some sync placement.
- Only set the EL0 GCS mode if it's actually changed.
- Various minor fixes and tweaks.
- Link to v10: https://lore.kernel.org/r/20240801-arm64-gcs-v10-0-699e2bd2190b@kernel.org
Changes in v10:
- Fix issues with THP.
- Tighten up requirements for initialising GCSCR*.
- Only generate GCS signal frames for threads using GCS.
- Only context switch EL1 GCS registers if S1PIE is enabled.
- Move context switch of GCSCRE0_EL1 to EL0 context switch.
- Make GCS registers unconditionally visible to userspace.
- Use FHU infrastructure.
- Don't change writability of ID_AA64PFR1_EL1 for KVM.
- Remove unused arguments from alloc_gcs().
- Typo fixes.
- Link to v9: https://lore.kernel.org/r/20240625-arm64-gcs-v9-0-0f634469b8f0@kernel.org
Changes in v9:
- Rebase onto v6.10-rc3.
- Restructure and clarify memory management fault handling.
- Fix up basic-gcs for the latest clone3() changes.
- Convert to newly merged KVM ID register based feature configuration.
- Fixes for NV traps.
- Link to v8: https://lore.kernel.org/r/20240203-arm64-gcs-v8-0-c9fec77673ef@kernel.org
Changes in v8:
- Invalidate signal cap token on stack when consuming.
- Typo and other trivial fixes.
- Don't try to use process_vm_write() on GCS, it intentionally does not
work.
- Fix leak of thread GCSs.
- Rebase onto latest clone3() series.
- Link to v7: https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org
Changes in v7:
- Rebase onto v6.7-rc2 via the clone3() patch series.
- Change the token used to cap the stack during signal handling to be
compatible with GCSPOPM.
- Fix flags for new page types.
- Fold in support for clone3().
- Replace copy_to_user_gcs() with put_user_gcs().
- Link to v6: https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org
Changes in v6:
- Rebase onto v6.6-rc3.
- Add some more gcsb_dsync() barriers following spec clarifications.
- Due to ongoing discussion around clone()/clone3() I've not updated
anything there, the behaviour is the same as on previous versions.
- Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org
Changes in v5:
- Don't map any permissions for user GCSs, we always use EL0 accessors
or use a separate mapping of the page.
- Reduce the standard size of the GCS to RLIMIT_STACK/2.
- Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
- Clarifications and fixes to documentation.
- More tests.
- Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org
Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of
stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org
Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org
Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size
requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org
---
Mark Brown (6):
arm64/gcs: Ensure FGTs for EL1 GCS instructions are disabled
KVM: arm64: Manage GCS access and registers for guests
KVM: arm64: Forward GCS exceptions to nested guests
KVM: arm64: Set PSTATE.EXLOCK when entering an exception
KVM: arm64: Allow GCS to be enabled for guests
KVM: selftests: arm64: Add GCS registers to get-reg-list
arch/arm64/include/asm/el2_setup.h | 4 +++
arch/arm64/include/asm/kvm_emulate.h | 3 ++
arch/arm64/include/asm/kvm_host.h | 14 +++++++++
arch/arm64/include/asm/vncr_mapping.h | 2 ++
arch/arm64/include/uapi/asm/ptrace.h | 1 +
arch/arm64/kvm/handle_exit.c | 14 +++++++--
arch/arm64/kvm/hyp/exception.c | 37 ++++++++++++++++++++++++
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 31 ++++++++++++++++++++
arch/arm64/kvm/hyp/vhe/sysreg-sr.c | 10 +++++++
arch/arm64/kvm/sys_regs.c | 32 ++++++++++++++++++--
tools/testing/selftests/kvm/arm64/get-reg-list.c | 12 ++++++++
11 files changed, 155 insertions(+), 5 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20230303-arm64-gcs-e311ab0d8729
Best regards,
--
Mark Brown <broonie(a)kernel.org>
+lists
Please keep discussions on-list unless there's something that can't/shouldn't be
posted publicly, e.g. for confidentiality or security reasons.
On Tue, Sep 02, 2025, Faruqui, Aqib wrote:
> I suppose a fix for blindly using PAGE_SIZE in subsequent macros:
>
> #ifdef PAGE_SIZE
> #undef PAGE_SIZE
> #endif
> #define PAGE_SIZE (1ULL << PAGE_SHIFT)
>
> Is no better and is instead blindly suppressing the compiler's redefinition warning.
>
> I'm having trouble finding what causes the conflict, any advice here?
Maybe try a newer compiler? E.g. gcc-14.2 will spit out the exact location of the
previous definition.
In file included from include/x86/svm_util.h:13,
from include/x86/sev.h:15,
from lib/x86/sev.c:5:
include/x86/processor.h:373:9: error: "PAGE_SIZE" redefined [-Werror]
373 | #define PAGE_SIZE (1ULL << PAGE_SHIFT)
| ^~~~~~~~~
include/x86/processor.h:370:9: note: this is the location of the previous definition
370 | #define PAGE_SIZE BIT(12)
| ^~~~~~~~~
Fix to use the return value of the function 'chdir("/")' and check if the
return is either 0 (ok) or 1 (not ok, so the test stops).
The patch fies the solves the following errors:
mount-notify_test.c:468:17: warning: ignoring return value of ‘chdir’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
468 | chdir("/");
mount-notify_test_ns.c:489:17: warning: ignoring return value of
‘chdir’ declared with attribute ‘warn_unused_result’ [-Wunused-
result]
489 | chdir("/");
To reproduce the issue, use the command:
make kselftest TARGET=filesystems/statmount
Signed-off-by: Alessandro Zanni <alessandro.zanni87(a)gmail.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
---
.../selftests/filesystems/mount-notify/mount-notify_test.c | 2 +-
.../selftests/filesystems/mount-notify/mount-notify_test_ns.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
index 5a3b0ace1a88..a7f899599d52 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
@@ -458,7 +458,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
index d91946e69591..dc9eb3087a1a 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
@@ -486,7 +486,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
--
2.43.0
Fix to use the return value of the function 'chdir("/")' and check if the
return is either 0 (ok) or 1 (not ok, so the test stops).
The patch fies the solves the following errors:
mount-notify_test.c:468:17: warning: ignoring return value of ‘chdir’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
468 | chdir("/");
mount-notify_test_ns.c:489:17: warning: ignoring return value of
‘chdir’ declared with attribute ‘warn_unused_result’ [-Wunused-
result]
489 | chdir("/");
To reproduce the issue, use the command:
make kselftest TARGET=filesystems/statmount
Signed-off-by: Alessandro Zanni <alessandro.zanni87(a)gmail.com>
---
.../selftests/filesystems/mount-notify/mount-notify_test.c | 2 +-
.../selftests/filesystems/mount-notify/mount-notify_test_ns.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
index 5a3b0ace1a88..a7f899599d52 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
@@ -458,7 +458,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
index d91946e69591..dc9eb3087a1a 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
@@ -486,7 +486,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
--
2.43.0
From: Feng Yang <yangfeng(a)kylinos.cn>
The error message printed here only uses the previous err value,
which results in it being printed as 0.
When bpf_map__attach_struct_ops encounters an error,
it uses libbpf_err_ptr(err) to set errno = -err and returns NULL.
Therefore, Using -errno can fix this issue.
Fix before:
run_subtest:FAIL:1019 bpf_map__attach_struct_ops failed for map pro_epilogue: err=0
Fix after:
run_subtest:FAIL:1019 bpf_map__attach_struct_ops failed for map pro_epilogue: err=-9
Signed-off-by: Feng Yang <yangfeng(a)kylinos.cn>
---
Changes in v3:
- Use -errno here directly, thanks: Andrii Nakryiko.
- Link to v2: https://lore.kernel.org/all/20250829014125.198653-1-yangfeng59949@163.com/
---
Changes in v2:
- Use libbpf_get_error, thanks: Alexei Starovoitov.
- Link to v1: https://lore.kernel.org/all/20250828081507.1380218-1-yangfeng59949@163.com/
---
tools/testing/selftests/bpf/test_loader.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/test_loader.c b/tools/testing/selftests/bpf/test_loader.c
index 78423cf89e01..33d59c093a27 100644
--- a/tools/testing/selftests/bpf/test_loader.c
+++ b/tools/testing/selftests/bpf/test_loader.c
@@ -1083,7 +1083,7 @@ void run_subtest(struct test_loader *tester,
link = bpf_map__attach_struct_ops(map);
if (!link) {
PRINT_FAIL("bpf_map__attach_struct_ops failed for map %s: err=%d\n",
- bpf_map__name(map), err);
+ bpf_map__name(map), -errno);
goto tobj_cleanup;
}
links[links_cnt++] = link;
--
2.25.1
This is based on mm-unstable and was cross-compiled heavily.
I should probably have already dropped the RFC label but I want to hear
first if I ignored some corner case (SG entries?) and I need to do
at least a bit more testing.
I will only CC non-MM folks on the cover letter and the respective patch
to not flood too many inboxes (the lists receive all patches).
---
As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.
To recap, the reason we currently need nth_page() within a folio is because
on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.
While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.
So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.
Likely, many people have no idea when nth_page() is required and when
it might be dropped.
We refer to such problematic PFN ranges and "non-contiguous pages".
If we only deal with "contiguous pages", there is not need for nth_page().
Besides that "obvious" folio case, we might end up using nth_page()
within CMA allocations (again, could span memory sections), and in
one corner case (kfence) when processing memblock allocations (again,
could span memory sections).
So let's handle all that, add sanity checks, and remove nth_page().
Patch #1 -> #5 : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch #6 -> #12 : disallow folios to have non-contiguous pages
Patch #13 -> #20 : remove nth_page() usage within folios
Patch #21 : disallow CMA allocations of non-contiguous pages
Patch #22 -> #31 : sanity+check + remove nth_page() usage within SG entry
Patch #32 : sanity-check + remove nth_page() usage in
unpin_user_page_range_dirty_lock()
Patch #33 : remove nth_page() in kfence
Patch #34 : adjust stale comment regarding nth_page
Patch #35 : mm: remove nth_page()
A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.
[1] https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-G…
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Mike Rapoport <rppt(a)kernel.org>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: Marek Szyprowski <m.szyprowski(a)samsung.com>
Cc: Robin Murphy <robin.murphy(a)arm.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Brendan Jackman <jackmanb(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: Dennis Zhou <dennis(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Christoph Lameter <cl(a)gentwo.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: x86(a)kernel.org
Cc: linux-arm-kernel(a)lists.infradead.org
Cc: linux-mips(a)vger.kernel.org
Cc: linux-s390(a)vger.kernel.org
Cc: linux-crypto(a)vger.kernel.org
Cc: linux-ide(a)vger.kernel.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: dri-devel(a)lists.freedesktop.org
Cc: linux-mmc(a)vger.kernel.org
Cc: linux-arm-kernel(a)axis.com
Cc: linux-scsi(a)vger.kernel.org
Cc: kvm(a)vger.kernel.org
Cc: virtualization(a)lists.linux.dev
Cc: linux-mm(a)kvack.org
Cc: io-uring(a)vger.kernel.org
Cc: iommu(a)lists.linux.dev
Cc: kasan-dev(a)googlegroups.com
Cc: wireguard(a)lists.zx2c4.com
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-riscv(a)lists.infradead.org
David Hildenbrand (35):
mm: stop making SPARSEMEM_VMEMMAP user-selectable
arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu
kernel config
mm/page_alloc: reject unreasonable folio/compound page sizes in
alloc_contig_range_noprof()
mm/memremap: reject unreasonable folio/compound page sizes in
memremap_pages()
mm/hugetlb: check for unreasonable folio sizes when registering hstate
mm/mm_init: make memmap_init_compound() look more like
prep_compound_page()
mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
mm: sanity-check maximum folio size in folio_set_order()
mm: limit folio/compound page sizes in problematic kernel configs
mm: simplify folio_page() and folio_page_idx()
mm/mm/percpu-km: drop nth_page() usage within single allocation
fs: hugetlbfs: remove nth_page() usage within folio in
adjust_range_hwpoison()
mm/pagewalk: drop nth_page() usage within folio in folio_walk_start()
mm/gup: drop nth_page() usage within folio when recording subpages
io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage
io_uring/zcrx: remove nth_page() usage within folio
mips: mm: convert __flush_dcache_pages() to
__flush_dcache_folio_pages()
mm/cma: refuse handing out non-contiguous page ranges
dma-remap: drop nth_page() in dma_common_contiguous_remap()
scatterlist: disallow non-contigous page ranges in a single SG entry
ata: libata-eh: drop nth_page() usage within SG entry
drm/i915/gem: drop nth_page() usage within SG entry
mspro_block: drop nth_page() usage within SG entry
memstick: drop nth_page() usage within SG entry
mmc: drop nth_page() usage within SG entry
scsi: core: drop nth_page() usage within SG entry
vfio/pci: drop nth_page() usage within SG entry
crypto: remove nth_page() usage within SG entry
mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()
kfence: drop nth_page() usage
block: update comment of "struct bio_vec" regarding nth_page()
mm: remove nth_page()
arch/arm64/Kconfig | 1 -
arch/mips/include/asm/cacheflush.h | 11 +++--
arch/mips/mm/cache.c | 8 ++--
arch/s390/Kconfig | 1 -
arch/x86/Kconfig | 1 -
crypto/ahash.c | 4 +-
crypto/scompress.c | 8 ++--
drivers/ata/libata-sff.c | 6 +--
drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +-
drivers/memstick/core/mspro_block.c | 3 +-
drivers/memstick/host/jmb38x_ms.c | 3 +-
drivers/memstick/host/tifm_ms.c | 3 +-
drivers/mmc/host/tifm_sd.c | 4 +-
drivers/mmc/host/usdhi6rol0.c | 4 +-
drivers/scsi/scsi_lib.c | 3 +-
drivers/scsi/sg.c | 3 +-
drivers/vfio/pci/pds/lm.c | 3 +-
drivers/vfio/pci/virtio/migrate.c | 3 +-
fs/hugetlbfs/inode.c | 25 ++++------
include/crypto/scatterwalk.h | 4 +-
include/linux/bvec.h | 7 +--
include/linux/mm.h | 48 +++++++++++++++----
include/linux/page-flags.h | 5 +-
include/linux/scatterlist.h | 4 +-
io_uring/zcrx.c | 34 ++++---------
kernel/dma/remap.c | 2 +-
mm/Kconfig | 3 +-
mm/cma.c | 36 +++++++++-----
mm/gup.c | 13 +++--
mm/hugetlb.c | 23 ++++-----
mm/internal.h | 1 +
mm/kfence/core.c | 17 ++++---
mm/memremap.c | 3 ++
mm/mm_init.c | 13 ++---
mm/page_alloc.c | 5 +-
mm/pagewalk.c | 2 +-
mm/percpu-km.c | 2 +-
mm/util.c | 33 +++++++++++++
tools/testing/scatterlist/linux/mm.h | 1 -
.../selftests/wireguard/qemu/kernel.config | 1 -
40 files changed, 203 insertions(+), 150 deletions(-)
base-commit: c0e3b3f33ba7b767368de4afabaf7c1ddfdc3872
--
2.50.1
From: Vivek Yadav <vivekyadav1207731111(a)gmail.com>
Hi all,
This small series makes cosmetic style cleanups in the arm64 kselftests
to improve readability and suppress checkpatch warnings. These changes
are purely cosmetic and do not affect functionality.
Changes in this series:
* Suppress unnecessary checkpatch warning in a comment
* Add parentheses around sizeof for clarity
* Remove redundant blank line
---
Vivek Yadav (3):
kselftest/arm64: Remove extra blank line
kselftest/arm64: Supress warning and improve readability
kselftest/arm64: Add parentheses around sizeof for clarity
tools/testing/selftests/arm64/abi/hwcap.c | 1 -
tools/testing/selftests/arm64/bti/assembler.h | 1 -
tools/testing/selftests/arm64/fp/fp-ptrace.c | 1 -
tools/testing/selftests/arm64/fp/fp-stress.c | 4 ++--
tools/testing/selftests/arm64/fp/sve-ptrace.c | 2 +-
tools/testing/selftests/arm64/fp/vec-syscfg.c | 1 -
tools/testing/selftests/arm64/fp/zt-ptrace.c | 1 -
tools/testing/selftests/arm64/gcs/gcs-locking.c | 1 -
8 files changed, 3 insertions(+), 9 deletions(-)
--
2.25.1
Two small cleanups.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (2):
kselftest/arm64/gcs: Correctly check return value when disabling GCS
kselftest/arm64/gcs: Use nolibc's getauxval()
tools/testing/selftests/arm64/gcs/basic-gcs.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250821-nolibc-gcs-fixes-11cf7585bb74
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Fix to use the return value of the function 'chdir("/")' and check if the
return is either 0 (ok) or 1 (not ok, so the test stops).
The patch fies the solves the following errors:
mount-notify_test.c:468:17: warning: ignoring return value of ‘chdir’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
468 | chdir("/");
mount-notify_test_ns.c:489:17: warning: ignoring return value of
‘chdir’ declared with attribute ‘warn_unused_result’ [-Wunused-
result]
489 | chdir("/");
To reproduce the issue, use the command:
make kselftest TARGET=filesystems/statmount
Signed-off-by: Alessandro Zanni <alessandro.zanni87(a)gmail.com>
---
.../selftests/filesystems/mount-notify/mount-notify_test.c | 2 +-
.../selftests/filesystems/mount-notify/mount-notify_test_ns.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
index 5a3b0ace1a88..a7f899599d52 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
@@ -458,7 +458,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
index d91946e69591..dc9eb3087a1a 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
@@ -486,7 +486,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
--
2.43.0
This series adds ONE_REG interface for SBI FWFT extension implemented
by KVM RISC-V. This was missed out in accepted SBI FWFT patches for
KVM RISC-V.
These patches can also be found in the riscv_kvm_fwft_one_reg_v3 branch
at: https://github.com/avpatel/linux.git
Changes since v2:
- Re-based on latest KVM RISC-V queue
- Improved FWFT ONE_REG interface to allow enabling/disabling each
FWFT feature from KVM userspace
Changes since v1:
- Dropped have_state in PATCH4 as suggested by Drew
- Added Drew's Reviewed-by in appropriate patches
Anup Patel (6):
RISC-V: KVM: Set initial value of hedeleg in kvm_arch_vcpu_create()
RISC-V: KVM: Introduce feature specific reset for SBI FWFT
RISC-V: KVM: Introduce optional ONE_REG callbacks for SBI extensions
RISC-V: KVM: Move copy_sbi_ext_reg_indices() to SBI implementation
RISC-V: KVM: Implement ONE_REG interface for SBI FWFT state
KVM: riscv: selftests: Add SBI FWFT to get-reg-list test
arch/riscv/include/asm/kvm_vcpu_sbi.h | 22 +-
arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h | 1 +
arch/riscv/include/uapi/asm/kvm.h | 15 ++
arch/riscv/kvm/vcpu.c | 3 +-
arch/riscv/kvm/vcpu_onereg.c | 60 +----
arch/riscv/kvm/vcpu_sbi.c | 172 +++++++++++--
arch/riscv/kvm/vcpu_sbi_fwft.c | 227 ++++++++++++++++--
arch/riscv/kvm/vcpu_sbi_sta.c | 63 +++--
.../selftests/kvm/riscv/get-reg-list.c | 32 +++
9 files changed, 467 insertions(+), 128 deletions(-)
--
2.43.0
When many ADD_ADDR need to be sent, it can take some time to send each
of them, and create new subflows. Some CIs seem to occasionally have
issues with these tests, especially with "debug" kernels.
Two subtests will now run for a slightly longer time: the last two where
3 or more ADD_ADDR are sent during the test.
Reviewed-by: Geliang Tang <geliang(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
tools/testing/selftests/net/mptcp/mptcp_join.sh | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index e9e11a9e60fd5374c8a98c3b7159ccbca8053030..b41cebfa1f921ce9ea6a88a908bf6d5e6027b367 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -2268,7 +2268,8 @@ signal_address_tests()
pm_nl_add_endpoint $ns1 10.0.3.1 flags signal
pm_nl_add_endpoint $ns1 10.0.4.1 flags signal
pm_nl_set_limits $ns2 3 3
- run_tests $ns1 $ns2 10.0.1.1
+ speed=slow \
+ run_tests $ns1 $ns2 10.0.1.1
chk_join_nr 3 3 3
chk_add_nr 3 3
fi
@@ -2280,7 +2281,8 @@ signal_address_tests()
pm_nl_add_endpoint $ns1 10.0.3.1 flags signal
pm_nl_add_endpoint $ns1 10.0.14.1 flags signal
pm_nl_set_limits $ns2 3 3
- run_tests $ns1 $ns2 10.0.1.1
+ speed=slow \
+ run_tests $ns1 $ns2 10.0.1.1
join_syn_tx=3 \
chk_join_nr 1 1 1
chk_add_nr 3 3
--
2.51.0
ADD_ADDR can be retransmitted, and with, the parent commit, these
retransmissions can be sent quicker: from 2 minutes to less than one
second.
To avoid false positives where retransmitted ADD_ADDR causes higher
counters than expected, it is required to be more tolerant. Errors are
now only reported when fewer ADD_ADDRs have been sent/received, except
if no ADD_ADDR are expected.
Before the parent commit, the tolerance was present for each tests where
the ADD_ADDR could be retransmitted in a reasonable time (1 sec). Now
that all tests can have retransmitted ADD_ADDR, it is normal to apply
the same tolerance for all tests.
An alternative could be to disable the ADD_ADDR retransmissions by
default, but that's changing the default kernel behaviour. Plus,
ADD_ADDR retransmissions can be required for some tests. To avoid adding
exceptions to many tests, it seems better to increase the tolerance.
Later, we could add a new MIB counter to identify the ADD_ADDR
retransmissions, and remove the tolerance when this counter is
available.
Reviewed-by: Geliang Tang <geliang(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
tools/testing/selftests/net/mptcp/mptcp_join.sh | 19 +++++++------------
1 file changed, 7 insertions(+), 12 deletions(-)
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 2f046167a0b6cc6fb5531a033d8d95c9ea399cf9..e9e11a9e60fd5374c8a98c3b7159ccbca8053030 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -358,6 +358,7 @@ reset_with_add_addr_timeout()
tables="${ip6tables}"
fi
+ # set a maximum, to avoid too long timeout with exponential backoff
ip netns exec $ns1 sysctl -q net.mptcp.add_addr_timeout=1
if ! ip netns exec $ns2 $tables -A OUTPUT -p tcp \
@@ -1669,7 +1670,6 @@ chk_add_nr()
local tx=""
local rx=""
local count
- local timeout
if [[ $ns_invert = "invert" ]]; then
ns_tx=$ns2
@@ -1678,15 +1678,13 @@ chk_add_nr()
rx=" server"
fi
- timeout=$(ip netns exec ${ns_tx} sysctl -n net.mptcp.add_addr_timeout)
-
print_check "add addr rx${rx}"
count=$(mptcp_lib_get_counter ${ns_rx} "MPTcpExtAddAddr")
if [ -z "$count" ]; then
print_skip
- # if the test configured a short timeout tolerate greater then expected
- # add addrs options, due to retransmissions
- elif [ "$count" != "$add_nr" ] && { [ "$timeout" -gt 1 ] || [ "$count" -lt "$add_nr" ]; }; then
+ # Tolerate more ADD_ADDR then expected (if any), due to retransmissions
+ elif [ "$count" != "$add_nr" ] &&
+ { [ "$add_nr" -eq 0 ] || [ "$count" -lt "$add_nr" ]; }; then
fail_test "got $count ADD_ADDR[s] expected $add_nr"
else
print_ok
@@ -1774,18 +1772,15 @@ chk_add_tx_nr()
{
local add_tx_nr=$1
local echo_tx_nr=$2
- local timeout
local count
- timeout=$(ip netns exec $ns1 sysctl -n net.mptcp.add_addr_timeout)
-
print_check "add addr tx"
count=$(mptcp_lib_get_counter ${ns1} "MPTcpExtAddAddrTx")
if [ -z "$count" ]; then
print_skip
- # if the test configured a short timeout tolerate greater then expected
- # add addrs options, due to retransmissions
- elif [ "$count" != "$add_tx_nr" ] && { [ "$timeout" -gt 1 ] || [ "$count" -lt "$add_tx_nr" ]; }; then
+ # Tolerate more ADD_ADDR then expected (if any), due to retransmissions
+ elif [ "$count" != "$add_tx_nr" ] &&
+ { [ "$add_tx_nr" -eq 0 ] || [ "$count" -lt "$add_tx_nr" ]; }; then
fail_test "got $count ADD_ADDR[s] TX, expected $add_tx_nr"
else
print_ok
--
2.51.0
From: Geliang Tang <tanggeliang(a)kylinos.cn>
Currently the ADD_ADDR option is retransmitted with a fixed timeout. This
patch makes the retransmission timeout adaptive by using the maximum RTO
among all the subflows, while still capping it at the configured maximum
value (add_addr_timeout_max). This improves responsiveness when
establishing new subflows.
Specifically:
1. Adds mptcp_adjust_add_addr_timeout() helper to compute the adaptive
timeout.
2. Uses maximum subflow RTO (icsk_rto) when available.
3. Applies exponential backoff based on retransmission count.
4. Maintains fallback to configured max timeout when no RTO data exists.
This slightly changes the behaviour of the MPTCP "add_addr_timeout"
sysctl knob to be used as a maximum instead of a fixed value. But this
is seen as an improvement: the ADD_ADDR might be sent quicker than
before to improve the overall MPTCP connection. Also, the default
value is set to 2 min, which was already way too long, and caused the
ADD_ADDR not to be retransmitted for connections shorter than 2 minutes.
Suggested-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/576
Reviewed-by: Christoph Paasch <cpaasch(a)openai.com>
Signed-off-by: Geliang Tang <tanggeliang(a)kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
v2: no changes.
---
Documentation/networking/mptcp-sysctl.rst | 8 +++++---
net/mptcp/pm.c | 28 ++++++++++++++++++++++++----
2 files changed, 29 insertions(+), 7 deletions(-)
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst
index 1683c139821e3ba6d9eaa3c59330a523d29f1164..1eb6af26b4a7acdedd575a126c576210a78f0d4d 100644
--- a/Documentation/networking/mptcp-sysctl.rst
+++ b/Documentation/networking/mptcp-sysctl.rst
@@ -8,9 +8,11 @@ MPTCP Sysfs variables
===============================
add_addr_timeout - INTEGER (seconds)
- Set the timeout after which an ADD_ADDR control message will be
- resent to an MPTCP peer that has not acknowledged a previous
- ADD_ADDR message.
+ Set the maximum value of timeout after which an ADD_ADDR control message
+ will be resent to an MPTCP peer that has not acknowledged a previous
+ ADD_ADDR message. A dynamically estimated retransmission timeout based
+ on the estimated connection round-trip-time is used if this value is
+ lower than the maximum one.
Do not retransmit if set to 0.
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 136a380602cae872b76560649c924330e5f42533..204e1f61212e2be77a8476f024b59be67d04b80a 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -268,6 +268,27 @@ int mptcp_pm_mp_prio_send_ack(struct mptcp_sock *msk,
return -EINVAL;
}
+static unsigned int mptcp_adjust_add_addr_timeout(struct mptcp_sock *msk)
+{
+ const struct net *net = sock_net((struct sock *)msk);
+ unsigned int rto = mptcp_get_add_addr_timeout(net);
+ struct mptcp_subflow_context *subflow;
+ unsigned int max = 0;
+
+ mptcp_for_each_subflow(msk, subflow) {
+ struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+ struct inet_connection_sock *icsk = inet_csk(ssk);
+
+ if (icsk->icsk_rto > max)
+ max = icsk->icsk_rto;
+ }
+
+ if (max && max < rto)
+ rto = max;
+
+ return rto;
+}
+
static void mptcp_pm_add_timer(struct timer_list *timer)
{
struct mptcp_pm_add_entry *entry = timer_container_of(entry, timer,
@@ -292,7 +313,7 @@ static void mptcp_pm_add_timer(struct timer_list *timer)
goto out;
}
- timeout = mptcp_get_add_addr_timeout(sock_net(sk));
+ timeout = mptcp_adjust_add_addr_timeout(msk);
if (!timeout)
goto out;
@@ -307,7 +328,7 @@ static void mptcp_pm_add_timer(struct timer_list *timer)
if (entry->retrans_times < ADD_ADDR_RETRANS_MAX)
sk_reset_timer(sk, timer,
- jiffies + timeout);
+ jiffies + (timeout << entry->retrans_times));
spin_unlock_bh(&msk->pm.lock);
@@ -348,7 +369,6 @@ bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk,
{
struct mptcp_pm_add_entry *add_entry = NULL;
struct sock *sk = (struct sock *)msk;
- struct net *net = sock_net(sk);
unsigned int timeout;
lockdep_assert_held(&msk->pm.lock);
@@ -374,7 +394,7 @@ bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk,
timer_setup(&add_entry->add_timer, mptcp_pm_add_timer, 0);
reset_timer:
- timeout = mptcp_get_add_addr_timeout(net);
+ timeout = mptcp_adjust_add_addr_timeout(msk);
if (timeout)
sk_reset_timer(sk, &add_entry->add_timer, jiffies + timeout);
--
2.51.0
Changes in v2:
- Optimized the logic in descriptions. (Song Liu)
- Created a new header file to declare kfuncs for future extensions included by other files. (Christian Loehle)
- Fixed some logical issues in the code. (Christian Loehle)
Reference:
[1] https://lore.kernel.org/bpf/20250829101137.9507-1-yikai.lin@vivo.com/
Summary
----------
Hi, everyone,
This patch set introduces an extensible cpuidle governor framework
using BPF struct_ops, enabling dynamic implementation of idle-state selection policies
via BPF programs.
Motivation
----------
As is well-known, CPUs support multiple idle states (e.g., C0, C1, C2, ...),
where deeper states reduce power consumption, but results in longer wakeup latency,
potentially affecting performance.
Existing generic cpuidle governors operate effectively in common scenarios
but exhibit suboptimal behavior in specific Android phone's use cases.
Our testing reveals that during low-utilization scenarios
(e.g., screen-off background tasks like music playback with CPU utilization <10%),
the C0 state occupies ~50% of idle time, causing significant energy inefficiency.
Reducing C0 to ≤20% could yield ≥5% power savings on mobile phones.
To address this, we expect:
1.Dynamic governor switching to power-saved policies for low cpu utilization scenarios (e.g., screen-off mode)
2.Dynamic switching to alternate governors for high-performance scenarios (e.g., gaming)
OverView
----------
The BPF cpuidle ext governor registers at postcore_initcall()
but remains disabled by default due to its low priority "rating" with value "1".
Activation requires adjust higer "rating" than other governors within BPF.
Core Components:
1.**struct cpuidle_gov_ext_ops** – BPF-overridable operations:
- ops.enable()/ops.disable(): enable or disable callback
- ops.select(): cpu Idle-state selection logic
- ops.set_stop_tick(): Scheduler tick management after state selection
- ops.reflect(): feedback info about previous idle state.
- ops.init()/ops.deinit(): Initialization or cleanup.
2.**Critical kfuncs for kernel state access**:
- bpf_cpuidle_ext_gov_update_rating():
Activate ext governor by raising rating must be called from "ops.init()"
- bpf_cpuidle_ext_gov_latency_req(): get idle-state latency constraints
- bpf_tick_nohz_get_sleep_length(): get CPU sleep duration in tickless mode
Future work
----------
1. Scenario detection: Identifying low-utilization states (e.g., screen-off + background music)
2. Policy optimization: Optimizing state-selection algorithms for specific scenarios
Is it related to sched_ext?
---------------------------
The cpuidle framework is as follows.
----------------------------------------------------------
Scheduler Core
----------------------------------------------------------
|
v
----------------------------------------------------------
| FAIR Class | EXT Class | IDLE Class |
----------------------------------------------------------
| | | |
| | | v
| | | ------------------------
| | | enter_cpu_idle()
| | | ------------------------
| | | |
| | | v
| | | ------------------------------
| | | | CPUIDLE Governor |
| | | ------------------------------
| | | | | |
| | | v v v
| | |-----------------------------------
| | | default | | other | | BPF ext |
| | | Governor | | Governor | | Governor | <<===Here is the feature we add.
| | |-----------------------------------
| | | | | |
| | | v v v
| | |-------------------------------------
| | | select idle state
| | |-------------------------------------
Whereas cpuidle is invoked after switching to idle class when no tasks are present in the scheduling RQ.
They are not directly related, so implementing kfuncs or other extensions through sched_ext is not feasible.
Lin Yikai (2):
cpuidle: Implement BPF extensible cpuidle governor class
selftests/bpf: Add selftests for cpuidle_gov_ext
drivers/cpuidle/Kconfig | 12 +
drivers/cpuidle/governors/Makefile | 1 +
drivers/cpuidle/governors/ext.c | 537 ++++++++++++++++++
.../bpf/prog_tests/test_cpuidle_gov_ext.c | 28 +
.../selftests/bpf/progs/cpuidle_common.h | 13 +
.../selftests/bpf/progs/cpuidle_gov_ext.c | 200 +++++++
6 files changed, 791 insertions(+)
create mode 100644 drivers/cpuidle/governors/ext.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c
create mode 100644 tools/testing/selftests/bpf/progs/cpuidle_common.h
create mode 100644 tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c
--
2.43.0
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v16 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28, and it
will be RFC9768.
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best Regards,
Chia-Yu
---
v16 (6-Sep-2025)
- Use TCP_ECN_IN_ACCECN_OUT_ACCECN, TCP_ECN_IN_ECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_ECN in comments of tcp_ecn_send_syn() (Eric Dumazet <edumazet(a)google.com>)
- Add tcpi_accecn_fail_mode and tcpi_accecn_opt_seen to make tcp_info be multiple of 64 bits in patch #12
v15 (14-Aug-205)
- Update pahole results in commit messages
- Accurate ECN will become RFC9768
v14 (22-Jul-2025)
- Add missing const for struct tcp_sock of tcp_accecn_option_beacon_check() of #11 (Simon Horman <horms(a)kernel.org>)
v13 (18-Jul-2025)
- Implement tcp_accecn_extract_syn_ect() and tcp_accecn_reflector_flags() with static array lookup of patch #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Fix typos in comments of #6 and remove patch #7 of v12 about simulatenous connect (Paolo Abeni <pabeni(a)redhat.com>)
- Move TCP_ACCECN_E1B_INIT_OFFSET, TCP_ACCECN_E0B_INIT_OFFSET, and TCP_ACCECN_CEB_INIT_OFFSET from patch #7 to #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use static array lookup in tcp_accecn_optfield_to_ecnfield() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return false when WARN_ON_ONCE() is true in tcp_accecn_process_option() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Make synack_ecn_bytes as static const array and use const u32 pointer in tcp_options_write() of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use ALIGN() and ALIGN_DOWN() in tcp_options_fit_accecn() to pad TCP AccECN option to dword of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return TCP_ACCECN_OPT_FAIL_SEEN if WARN_ON_ONCE() is true in tcp_accecn_option_init() of #12 (Paolo Abeni <pabeni(a)redhat.com>)
v12 (04-Jul-2025)
- Fix compilation issues with some intermediate patches in v11
- Add more comments for AccECN helpers of tcp_ecn.h
v11 (03-Jul-2025)
- Fix compilation issues with some intermediate patches in v10
v10 (02-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() over every reads of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)
v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEE flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accruate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restruct the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)
v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)
v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results
v6 (09-May-2025)
- Add #3 to utilize exisintg holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use exisintg holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for reabability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
---
Chia-Yu Chang (5):
tcp: reorganize tcp_sock_write_txrx group for variables later
tcp: ecn functions in separated include file
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (9):
tcp: reorganize SYN ECN code
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics
Documentation/networking/ip-sysctl.rst | 55 +-
.../networking/net_cachelines/tcp_sock.rst | 12 +
include/linux/tcp.h | 32 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 87 ++-
include/net/tcp_ecn.h | 642 ++++++++++++++++++
include/uapi/linux/tcp.h | 9 +
net/ipv4/syncookies.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 30 +-
net/ipv4/tcp_input.c | 353 ++++++++--
net/ipv4/tcp_ipv4.c | 8 +-
net/ipv4/tcp_minisocks.c | 40 +-
net/ipv4/tcp_output.c | 294 ++++++--
net/ipv6/syncookies.c | 2 +
net/ipv6/tcp_ipv6.c | 1 +
16 files changed, 1406 insertions(+), 184 deletions(-)
create mode 100644 include/net/tcp_ecn.h
--
2.34.1
devmem test fails on NIPA. Most likely we get skb(s) with readable
frags (why?) but the failure manifests as an OOM. The OOM happens
because ncdevmem spams the following message:
recvmsg ret=-1
recvmsg: Bad address
As of today, ncdevmem can't deal with various reasons of EFAULT:
- falling back to regular recvmsg for non-devmem skbs
- increasing ctrl_data size (can't happen with ncdevmem's large buffer)
Exit (cleanly) with error when recvmsg returns EFAULT. This should at
least cause the test to cleanup its state.
Signed-off-by: Stanislav Fomichev <sdf(a)fomichev.me>
---
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index 8dc9511d046f..c0a22938bed2 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -945,6 +945,10 @@ static int do_server(struct memory_buffer *mem)
continue;
if (ret < 0) {
perror("recvmsg");
+ if (errno == EFAULT) {
+ pr_err("received EFAULT, won't recover");
+ goto err_close_client;
+ }
continue;
}
if (ret == 0) {
--
2.51.0
Create a netconsole test that puts a lot of pressure on the netconsole
list manipulation. Do it by creating dynamic targets and deleting
targets while messages are being sent. Also put interface down while the
In order to do it, refactor create_dynamic_target(), so it can be used to
create random targets in the torture test.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes in v2:
- Reuse the netconsole creation from lib_netcons.sh. Thus, refactoring
the create_dynamic_target() (Jakub)
- Move the "wait" to after all the messages has been sent.
- Link to v1: https://lore.kernel.org/r/20250902-netconsole_torture-v1-1-03c6066598e9@deb…
---
Breno Leitao (2):
selftest: netcons: refactor target creation
selftest: netcons: create a torture test
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 30 +++--
.../selftests/drivers/net/netcons_torture.sh | 127 +++++++++++++++++++++
3 files changed, 147 insertions(+), 11 deletions(-)
---
base-commit: 2fd4161d0d2547650d9559d57fc67b4e0a26a9e3
change-id: 20250902-netconsole_torture-8fc23f0aca99
Best regards,
--
Breno Leitao <leitao(a)debian.org>
When the bpf ring buffer is full, new events can not be recorded util
the consumer consumes some events to free space. This may cause critical
events to be discarded, such as in fault diagnostic, where recent events
are more critical than older ones.
So add ovewrite mode for bpf ring buffer. In this mode, the new event
overwrites the oldest event when the buffer is full.
v2:
- remove libbpf changes (Andrii)
- update overwrite benchmark
v1:
https://lore.kernel.org/bpf/20250804022101.2171981-1-xukuohai@huaweicloud.c…
Xu Kuohai (3):
bpf: Add overwrite mode for bpf ring buffer
selftests/bpf: Add test for overwrite ring buffer
selftests/bpf/benchs: Add producer and overwrite bench for ring buffer
include/uapi/linux/bpf.h | 4 +
kernel/bpf/ringbuf.c | 159 +++++++++++++++---
tools/include/uapi/linux/bpf.h | 4 +
tools/testing/selftests/bpf/Makefile | 3 +-
tools/testing/selftests/bpf/bench.c | 2 +
.../selftests/bpf/benchs/bench_ringbufs.c | 95 ++++++++++-
.../bpf/benchs/run_bench_ringbufs.sh | 4 +
.../selftests/bpf/prog_tests/ringbuf.c | 74 ++++++++
.../selftests/bpf/progs/ringbuf_bench.c | 10 ++
.../bpf/progs/test_ringbuf_overwrite.c | 98 +++++++++++
10 files changed, 418 insertions(+), 35 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c
--
2.43.0