FEAT_LSFE is optional from v9.5, it adds new instructions for atomic
memory operations with floating point values. We have no immediate use
for it in kernel, provide a hwcap so userspace can discover it and allow
the ID register field to be exposed to KVM guests.
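
For illustration, userspace discovery of a hwcap could look roughly like
the sketch below. It parses the auxiliary vector from /proc/self/auxv
instead of calling getauxval(). The AT_HWCAP3 value and the HWCAP3_LSFE
bit position used here are assumptions made for the example and should be
checked against the uapi headers of the kernel in use.

```python
import struct

# Assumed values for illustration only; verify against <linux/auxvec.h>
# and arch/arm64/include/uapi/asm/hwcap.h before relying on them.
AT_NULL = 0
AT_HWCAP3 = 29          # assumption
HWCAP3_LSFE = 1 << 2    # assumption


def parse_auxv(data, word_size=8):
    """Decode raw auxv bytes into a {type: value} dict (native byte order)."""
    fmt = "=QQ" if word_size == 8 else "=II"
    entry = struct.calcsize(fmt)
    auxv = {}
    for off in range(0, len(data) - entry + 1, entry):
        a_type, a_val = struct.unpack_from(fmt, data, off)
        if a_type == AT_NULL:
            break  # AT_NULL terminates the vector
        auxv[a_type] = a_val
    return auxv


def has_lsfe(auxv):
    """True if the (assumed) LSFE bit is set in the AT_HWCAP3 entry."""
    return bool(auxv.get(AT_HWCAP3, 0) & HWCAP3_LSFE)


def read_own_auxv(path="/proc/self/auxv"):
    """Read and decode this process's auxiliary vector (Linux only)."""
    with open(path, "rb") as f:
        return parse_auxv(f.read(), struct.calcsize("P"))
```

On a Linux system, has_lsfe(read_own_auxv()) would report whether the
assumed bit is set; on hardware without the feature it simply returns
False.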
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v4:
- Rebase onto arm64/for-next/cpufeature, note that both patches have
build dependencies on this.
- Drop unneeded cc clobber in hwcap.
- Use STRFADD as the instruction probed in hwcap.
- Link to v3: https://lore.kernel.org/r/20250818-arm64-lsfe-v3-0-af6f4d66eb39@kernel.org
Changes in v3:
- Rebase onto v6.17-rc1.
- Link to v2: https://lore.kernel.org/r/20250703-arm64-lsfe-v2-0-eced80999cb4@kernel.org
Changes in v2:
- Fix result of vi dropping in hwcap test.
- Link to v1: https://lore.kernel.org/r/20250627-arm64-lsfe-v1-0-68351c4bf741@kernel.org
---
Mark Brown (2):
KVM: arm64: Expose FEAT_LSFE to guests
kselftest/arm64: Add lsfe to the hwcaps test
arch/arm64/kvm/sys_regs.c | 4 +++-
tools/testing/selftests/arm64/abi/hwcap.c | 21 +++++++++++++++++++++
2 files changed, 24 insertions(+), 1 deletion(-)
---
base-commit: 220928e52cb03d223b3acad3888baf0687486d21
change-id: 20250625-arm64-lsfe-0810cf98adc2
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Syzkaller found this, fput runs the release from a work queue so the
refcount remains elevated during abort. This is tricky so move more
handling of files into the core code.
Add a WARN_ON to catch things like this more reliably without relying on
KASAN.
Update the fail_nth test to succeed on 6.17 kernels.
Jason Gunthorpe (3):
iommufd: Fix race during abort for file descriptors
iommufd: WARN if an object is aborted with an elevated refcount
iommufd/selftest: Update the fail_nth limit
drivers/iommu/iommufd/device.c | 3 +-
drivers/iommu/iommufd/eventq.c | 9 +----
drivers/iommu/iommufd/iommufd_private.h | 3 +-
drivers/iommu/iommufd/main.c | 39 +++++++++++++++++--
.../selftests/iommu/iommufd_fail_nth.c | 2 +-
5 files changed, 42 insertions(+), 14 deletions(-)
base-commit: 1046d40b0e78d2cd63f6183629699b629b21f877
--
2.43.0
[ based on kvm/next ]
Unmapping virtual machine guest memory from the host kernel's direct map is a
successful mitigation against Spectre-style transient execution issues: If the
kernel page tables do not contain entries pointing to guest memory, then any
attempted speculative read through the direct map will necessarily be blocked
by the MMU before any observable microarchitectural side-effects happen. This
means that Spectre-gadgets and similar cannot be used to target virtual machine
memory. Roughly 60% of speculative execution issues fall into this category [1,
Table 1].
This patch series extends guest_memfd with the ability to remove its memory
from the host kernel's direct map, to be able to attain the above protection
for KVM guests running inside guest_memfd.
Additionally, a Firecracker branch with support for these VMs can be found on
GitHub [2].
For more details, please refer to the v5 cover letter [v5]. No
substantial changes in design have taken place since.
=== Changes Since v5 ===
- Fix up error handling for set_direct_map_[in]valid_noflush() (Mike)
- Fix capability check for KVM_GUEST_MEMFD_NO_DIRECT_MAP (Mike)
- Make secretmem_aops static in mm/secretmem.c (Mike)
- Fixup some more comments in gup.c that referred to secretmem
specifically to instead point to AS_NO_DIRECT_MAP (Mike)
- New patch (PATCH 4/11) to avoid ifdeffery in kvm_gmem_free_folio() (Mike)
- vma_is_no_direct_map() -> vma_has_no_direct_map() rename (David)
- Squash some patches (David)
- Fix up const-ness of parameters to new functions in pagemap.h (Fuad)
[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[2]: https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hidi…
[RFCv1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFCv2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
[RFCv3]: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk/
[v4]: https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/
[v5]: https://lore.kernel.org/kvm/20250828093902.2719-1-roypat@amazon.co.uk/
Elliot Berman (1):
filemap: Pass address_space mapping to ->free_folio()
Patrick Roy (10):
arch: export set_direct_map_valid_noflush to KVM module
mm: introduce AS_NO_DIRECT_MAP
KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate
KVM: guest_memfd: Add flag to remove from direct map
KVM: selftests: load elf via bounce buffer
KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd
!= -1
KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in existing
selftests
KVM: selftests: Test guest execution from direct map removed gmem
Documentation/filesystems/locking.rst | 2 +-
Documentation/virt/kvm/api.rst | 5 ++
arch/arm64/include/asm/kvm_host.h | 12 ++++
arch/arm64/mm/pageattr.c | 1 +
arch/loongarch/mm/pageattr.c | 1 +
arch/riscv/mm/pageattr.c | 1 +
arch/s390/mm/pageattr.c | 1 +
arch/x86/mm/pat/set_memory.c | 1 +
fs/nfs/dir.c | 11 ++--
fs/orangefs/inode.c | 3 +-
include/linux/fs.h | 2 +-
include/linux/kvm_host.h | 9 +++
include/linux/pagemap.h | 16 +++++
include/linux/secretmem.h | 18 ------
include/uapi/linux/kvm.h | 2 +
lib/buildid.c | 4 +-
mm/filemap.c | 9 +--
mm/gup.c | 19 ++----
mm/mlock.c | 2 +-
mm/secretmem.c | 11 ++--
mm/vmscan.c | 4 +-
.../testing/selftests/kvm/guest_memfd_test.c | 2 +
.../testing/selftests/kvm/include/kvm_util.h | 37 ++++++++---
.../testing/selftests/kvm/include/test_util.h | 8 +++
tools/testing/selftests/kvm/lib/elf.c | 8 +--
tools/testing/selftests/kvm/lib/io.c | 23 +++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 61 +++++++++++--------
tools/testing/selftests/kvm/lib/test_util.c | 8 +++
tools/testing/selftests/kvm/lib/x86/sev.c | 1 +
.../selftests/kvm/pre_fault_memory_test.c | 1 +
.../selftests/kvm/set_memory_region_test.c | 50 +++++++++++++--
.../kvm/x86/private_mem_conversions_test.c | 7 ++-
virt/kvm/guest_memfd.c | 56 ++++++++++++++---
virt/kvm/kvm_main.c | 5 ++
34 files changed, 288 insertions(+), 113 deletions(-)
base-commit: a6ad54137af92535cfe32e19e5f3bc1bb7dbd383
--
2.50.1
[Lots of changes in comments thanks to Randy]
Currently each of the iommu page table formats duplicates all of the logic
to maintain the page table and perform map/unmap/etc operations. There are
several different versions of the algorithms between all the different
formats. The io-pgtable system provides an interface to help isolate the
page table code from the iommu driver, but doesn't provide tools to
implement the common algorithms.
This makes it very hard to improve the state of the pagetable code under
the iommu domains as any proposed improvement needs to alter a large
number of different driver code paths. Combined with a lack of software
based testing this makes improvement in this area very hard.
iommufd wants several new page table operations:
- More efficient map/unmap operations, using iommufd's batching logic
- unmap that returns the physical addresses into a batch as it progresses
- cut that allows splitting areas so large pages can have holes
poked in them dynamically (ie guestmemfd hitless shared/private
transitions)
- More aggressive freeing of table memory to avoid waste
- Fragmenting large pages so that dirty tracking can be more granular
- Reassembling large pages so that VMs can run at full IO performance
in migration/dirty tracking error flows
- KHO integration for kernel live upgrade
Together these are algorithmically complex enough to be a very significant
task to go and implement in all the page table formats we support. Just
the "server" focused drivers use almost all the formats (ARMv8 S1&S2 / x86
PAE / AMDv1 / VT-D SS / RISCV)
Instead of doing the duplicated work, this series takes the first step to
consolidate the algorithms into one place. In spirit it is similar to the
work Christoph did a few years back to pull the redundant get_user_pages()
implementations out of the arch code into core MM. This unlocked a great
deal of improvement in that space in the following years. I would like to
see the same benefit in iommu as well.
My first RFC showed a bigger picture with almost all formats and more
algorithms. This series reorganizes that to be narrowly focused on just
enough to convert the AMD driver to use the new mechanism.
kunit tests are provided that allow good testing of the algorithms and all
formats on x86, nothing is arch specific.
AMD is one of the simpler options as the HW is quite uniform with few
different options/bugs while still requiring the complicated contiguous
pages support. The HW also has a very simple range based invalidation
approach that is easy to implement.
The AMD v1 and AMD v2 page table formats are implemented bit for bit
identical to the current code, tested using a compare kunit test that
checks against the io-pgtable version (on github, see below).
Updating the AMD driver to replace the io-pgtable layer with the new stuff
is fairly straightforward now. The layering is fixed up in the new version
so that all the invalidation goes through function pointers.
Several small fixing patches have come out of this as I've been fixing the
problems that the test suite uncovers in the current code, and
implementing the fixed version in iommupt.
On performance, there is a quite wide variety of implementation designs
across all the drivers. Looking at some key performance across
the main formats:
iommu_map():
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 53,66 , 51,63 , 19.19 (AMDv1)
256*2^12, 386,1909 , 367,1795 , 79.79
256*2^21, 362,1633 , 355,1556 , 77.77
2^12, 56,62 , 52,59 , 11.11 (AMDv2)
256*2^12, 405,1355 , 357,1292 , 72.72
256*2^21, 393,1160 , 358,1114 , 67.67
2^12, 55,65 , 53,62 , 14.14 (VTD second stage)
256*2^12, 391,518 , 332,512 , 35.35
256*2^21, 383,635 , 336,624 , 46.46
2^12, 57,65 , 55,63 , 12.12 (ARM 64 bit)
256*2^12, 380,389 , 361,369 , 2.02
256*2^21, 358,419 , 345,400 , 13.13
iommu_unmap():
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 69,88 , 65,85 , 23.23 (AMDv1)
256*2^12, 353,6498 , 331,6029 , 94.94
256*2^21, 373,6014 , 360,5706 , 93.93
2^12, 71,72 , 66,69 , 4.04 (AMDv2)
256*2^12, 228,891 , 206,871 , 76.76
256*2^21, 254,721 , 245,711 , 65.65
2^12, 69,87 , 65,82 , 20.20 (VTD second stage)
256*2^12, 210,321 , 200,315 , 36.36
256*2^21, 255,349 , 238,342 , 30.30
2^12, 72,77 , 68,74 , 8.08 (ARM 64 bit)
256*2^12, 521,357 , 447,346 , -29.29
256*2^21, 489,358 , 433,345 , -25.25
* Above numbers include additional patches to remove the iommu_pgsize()
overheads. gcc 13.3.0, i7-12700
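
As a rough aid to reading the tables, the sketch below computes a relative
improvement from the "min new,old ns" columns, assuming the percentage is
(old - new) / old. That formula is an assumption, not something stated in
this letter; it lands close to, but not exactly on, the published figures,
so the tables were presumably produced with slightly different inputs or
rounding.

```python
def improvement_pct(new_ns, old_ns):
    """Percentage of time saved by the new code; positive means faster."""
    if old_ns <= 0:
        raise ValueError("old_ns must be positive")
    return (old_ns - new_ns) / old_ns * 100.0


# (label, min new ns, min old ns) copied from the tables above
rows = [
    ("map   AMDv1      2^12", 51, 63),
    ("map   AMDv1      256*2^12", 367, 1795),
    ("unmap ARM 64 bit 256*2^12", 447, 346),
]

for label, new_ns, old_ns in rows:
    print(f"{label}: {improvement_pct(new_ns, old_ns):+.2f}%")
```

A negative result, as in the ARM unmap row, means the new code is slower
than the old code for that case.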
This version provides fairly consistent performance across formats. ARM
unmap performance is quite different because this version supports
contiguous pages and uses a very different algorithm for unmapping. Though
why it is so much worse compared to AMDv1 I haven't figured out yet.
The per-format commits include a more detailed chart.
There is a second branch:
https://github.com/jgunthorpe/linux/commits/iommu_pt_all
Containing supporting work and future steps:
- ARM short descriptor (32 bit), ARM long descriptor (64 bit) formats
- RISCV format and RISCV conversion
https://github.com/jgunthorpe/linux/commits/iommu_pt_riscv
- Support for a DMA incoherent HW page table walker
- VT-D second stage format and VT-D conversion
https://github.com/jgunthorpe/linux/commits/iommu_pt_vtd
- DART v1 & v2 format
- Draft of a iommufd 'cut' operation to break down huge pages
- A compare test that checks the iommupt formats against the iopgtable
interface, including updating AMD to have a working iopgtable and patches
to make VT-D have an iopgtable for testing.
- A performance test to micro-benchmark map and unmap against iopgtable
My strategy is to go one by one for the drivers:
- AMD driver conversion
- RISCV page table and driver
- Intel VT-D driver and VTDSS page table
- Flushing improvements for RISCV
- ARM SMMUv3
And concurrently work on the algorithm side:
- debugfs content dump, like VT-D has
- Cut support
- Increase/Decrease page size support
- map/unmap batching
- KHO
As we make more algorithm improvements the value to convert the drivers
increases.
This is on github: https://github.com/jgunthorpe/linux/commits/iommu_pt
v4:
- Text grammar updates and kdoc fixes
v3: https://patch.msgid.link/r/0-v4-0d6a6726a372+18959-iommu_pt_jgg@nvidia.com
- Rebase on v6.16-rc3
- Integrate the HATS/HATDis changes
- Remove 'default n' from kconfig
- Remove unused 'PT_FIXED_TOP_LEVEL'
- Improve comments and documentation
- Fix some compile warnings from kbuild robots
v2: https://patch.msgid.link/r/0-v3-a93aab628dbc+521-iommu_pt_jgg@nvidia.com
- Rebase on v6.16-rc2
- s/PT_ENTRY_WORD_SIZE/PT_ITEM_WORD_SIZE/s to follow the language better
- Comment and documentation updates
- Add PT_TOP_PHYS_MASK to help manage alignment restrictions on the top
pointer
- Add missed force_aperture = true
- Make pt_iommu_deinit() take care of the not-yet-inited error case
internally as AMD/RISCV/VTD all shared this logic
- Change gather_range() into gather_range_pages() so it also deals with
the page list. This makes the following cache flushing series simpler
- Fix missed update of unmap->unmapped in some error cases
- Change clear_contig() to order the gather more logically
- Remove goto from the error handling in __map_range_leaf()
- s/log2_/oalog2_/ in places where the argument is an oaddr_t
- Pass the pts to pt_table_install64/32()
- Do not use SIGN_EXTEND for the AMDv2 page table because of Vasant's
information on how PASID 0 works.
v1: https://patch.msgid.link/r/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com
- AMD driver only, many code changes
RFC: https://lore.kernel.org/all/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/
Alejandro Jimenez (1):
iommu/amd: Use the generic iommu page table
Jason Gunthorpe (14):
genpt: Generic Page Table base API
genpt: Add Documentation/ files
iommupt: Add the basic structure of the iommu implementation
iommupt: Add the AMD IOMMU v1 page table format
iommupt: Add iova_to_phys op
iommupt: Add unmap_pages op
iommupt: Add map_pages op
iommupt: Add read_and_clear_dirty op
iommupt: Add a kunit test for Generic Page Table
iommupt: Add a mock pagetable format for iommufd selftest to use
iommufd: Change the selftest to use iommupt instead of xarray
iommupt: Add the x86 64 bit page table format
iommu/amd: Remove AMD io_pgtable support
iommupt: Add a kunit test for the IOMMU implementation
.clang-format | 1 +
Documentation/driver-api/generic_pt.rst | 140 ++
Documentation/driver-api/index.rst | 1 +
drivers/iommu/Kconfig | 2 +
drivers/iommu/Makefile | 1 +
drivers/iommu/amd/Kconfig | 5 +-
drivers/iommu/amd/Makefile | 2 +-
drivers/iommu/amd/amd_iommu.h | 1 -
drivers/iommu/amd/amd_iommu_types.h | 109 +-
drivers/iommu/amd/io_pgtable.c | 560 --------
drivers/iommu/amd/io_pgtable_v2.c | 370 ------
drivers/iommu/amd/iommu.c | 538 ++++----
drivers/iommu/generic_pt/.kunitconfig | 13 +
drivers/iommu/generic_pt/Kconfig | 67 +
drivers/iommu/generic_pt/fmt/Makefile | 26 +
drivers/iommu/generic_pt/fmt/amdv1.h | 409 ++++++
drivers/iommu/generic_pt/fmt/defs_amdv1.h | 21 +
drivers/iommu/generic_pt/fmt/defs_x86_64.h | 21 +
drivers/iommu/generic_pt/fmt/iommu_amdv1.c | 15 +
drivers/iommu/generic_pt/fmt/iommu_mock.c | 10 +
drivers/iommu/generic_pt/fmt/iommu_template.h | 48 +
drivers/iommu/generic_pt/fmt/iommu_x86_64.c | 11 +
drivers/iommu/generic_pt/fmt/x86_64.h | 248 ++++
drivers/iommu/generic_pt/iommu_pt.h | 1149 +++++++++++++++++
drivers/iommu/generic_pt/kunit_generic_pt.h | 717 ++++++++++
drivers/iommu/generic_pt/kunit_iommu.h | 183 +++
drivers/iommu/generic_pt/kunit_iommu_pt.h | 451 +++++++
drivers/iommu/generic_pt/pt_common.h | 355 +++++
drivers/iommu/generic_pt/pt_defs.h | 323 +++++
drivers/iommu/generic_pt/pt_fmt_defaults.h | 193 +++
drivers/iommu/generic_pt/pt_iter.h | 636 +++++++++
drivers/iommu/generic_pt/pt_log2.h | 130 ++
drivers/iommu/io-pgtable.c | 4 -
drivers/iommu/iommufd/Kconfig | 1 +
drivers/iommu/iommufd/iommufd_test.h | 11 +-
drivers/iommu/iommufd/selftest.c | 438 +++----
include/linux/generic_pt/common.h | 166 +++
include/linux/generic_pt/iommu.h | 270 ++++
include/linux/io-pgtable.h | 2 -
tools/testing/selftests/iommu/iommufd.c | 60 +-
tools/testing/selftests/iommu/iommufd_utils.h | 12 +
41 files changed, 6128 insertions(+), 1592 deletions(-)
create mode 100644 Documentation/driver-api/generic_pt.rst
delete mode 100644 drivers/iommu/amd/io_pgtable.c
delete mode 100644 drivers/iommu/amd/io_pgtable_v2.c
create mode 100644 drivers/iommu/generic_pt/.kunitconfig
create mode 100644 drivers/iommu/generic_pt/Kconfig
create mode 100644 drivers/iommu/generic_pt/fmt/Makefile
create mode 100644 drivers/iommu/generic_pt/fmt/amdv1.h
create mode 100644 drivers/iommu/generic_pt/fmt/defs_amdv1.h
create mode 100644 drivers/iommu/generic_pt/fmt/defs_x86_64.h
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_amdv1.c
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_mock.c
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_template.h
create mode 100644 drivers/iommu/generic_pt/fmt/iommu_x86_64.c
create mode 100644 drivers/iommu/generic_pt/fmt/x86_64.h
create mode 100644 drivers/iommu/generic_pt/iommu_pt.h
create mode 100644 drivers/iommu/generic_pt/kunit_generic_pt.h
create mode 100644 drivers/iommu/generic_pt/kunit_iommu.h
create mode 100644 drivers/iommu/generic_pt/kunit_iommu_pt.h
create mode 100644 drivers/iommu/generic_pt/pt_common.h
create mode 100644 drivers/iommu/generic_pt/pt_defs.h
create mode 100644 drivers/iommu/generic_pt/pt_fmt_defaults.h
create mode 100644 drivers/iommu/generic_pt/pt_iter.h
create mode 100644 drivers/iommu/generic_pt/pt_log2.h
create mode 100644 include/linux/generic_pt/common.h
create mode 100644 include/linux/generic_pt/iommu.h
base-commit: 8da0d63bd5726ff656bfa1eacb45d6f5cce65616
--
2.43.0
This patch simplifies kublk's implementation of the feature list
command, fixes a bug where a feature was missing, and adds a test to
ensure that similar bugs do not happen in the future.
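
The kind of check the new test performs can be sketched in a few lines.
This is not the kublk code or the shell test from the series; the flag
values and names below are hypothetical and only show the idea of
reporting feature bits that have no name in the map.

```python
def unnamed_features(feature_mask, feat_map):
    """Return the set bits of feature_mask that have no entry in feat_map."""
    missing = []
    bit = 0
    while feature_mask >> bit:
        flag = 1 << bit
        if feature_mask & flag and flag not in feat_map:
            missing.append(flag)
        bit += 1
    return missing


# Hypothetical feature map: bit value -> printable name
feat_map = {
    1 << 0: "FEATURE_A",
    1 << 1: "FEATURE_B",
}

# Hypothetical mask as reported by a driver that also knows a third bit
for flag in unnamed_features(0b111, feat_map):
    print(f"feature {flag:#x} has no name in feat_map")
```

A non-empty result is the failure case: either the map is missing an
entry, or an older map is being checked against a newer feature set.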
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Changes in v2:
- Add log lines to new test in failure case, to tell the user how to fix
the test, and to indicate that the failure is expected when running
an old test suite against a new kernel (Ming Lei)
- Link to v1: https://lore.kernel.org/r/20250916-ublk_features-v1-0-52014be9cde5@purestor…
---
Uday Shankar (3):
selftests: ublk: kublk: simplify feat_map definition
selftests: ublk: kublk: add UBLK_F_BUF_REG_OFF_DAEMON to feat_map
selftests: ublk: add test to verify that feat_map is complete
tools/testing/selftests/ublk/Makefile | 1 +
tools/testing/selftests/ublk/kublk.c | 32 +++++++++++++------------
tools/testing/selftests/ublk/test_generic_13.sh | 20 ++++++++++++++++
3 files changed, 38 insertions(+), 15 deletions(-)
---
base-commit: da7b97ba0d219a14a83e9cc93f98b53939f12944
change-id: 20250916-ublk_features-07af4e321e5a
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
Currently the UAPI headers are always installed into the source directory.
When building out-of-tree this doesn't work, as the include path will be
wrong and it dirties the source tree, leading to complaints from kbuild.
Make sure the 'headers' target installs the UAPI headers in the correct
location.
The real target directory can come from multiple places. To handle them all
extract the target directory from KHDR_INCLUDES.
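
The extraction described above can be mimicked outside of make. The sketch
below approximates the $(filter %/usr/include,...) and
$(patsubst %/usr/include,%,...) steps in Python. The function name and the
sample paths are made up for illustration, and the "-isystem <dir>" form
of KHDR_INCLUDES is an assumption about how the variable is commonly set.

```python
SUFFIX = "/usr/include"


def khdr_output(khdr_includes):
    """Return the build-tree roots of all words ending in /usr/include.

    Like make, the input is split on whitespace, words that do not end
    in the suffix are dropped, and the suffix is stripped from the rest.
    """
    words = khdr_includes.split()
    return [w[: -len(SUFFIX)] for w in words if w.endswith(SUFFIX)]


# Hypothetical out-of-tree and in-tree settings
print(khdr_output("-isystem /build/out/usr/include"))
print(khdr_output("-isystem /src/linux/usr/include"))
```

The "-isystem" word is discarded by the filter step, which is why only the
directory survives.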
Reported-by: Jason Gunthorpe <jgg(a)nvidia.com>
Closes: https://lore.kernel.org/lkml/20250917153209.GA2023406@nvidia.com/
Fixes: 1a59f5d31569 ("selftests: Add headers target")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg(a)nvidia.com>
---
tools/testing/selftests/lib.mk | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 5303900339292e618dee4fd7ff8a7c2fa3209a68..a448fae57831d86098806adaff53f6f1a747febb 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -228,7 +228,10 @@ $(OUTPUT)/%:%.S
$(LINK.S) $^ $(LDLIBS) -o $@
endif
+# Extract the expected header directory
+khdr_output := $(patsubst %/usr/include,%,$(filter %/usr/include,$(KHDR_INCLUDES)))
+
headers:
- $(Q)$(MAKE) -C $(top_srcdir) headers
+ $(Q)$(MAKE) -f $(top_srcdir)/Makefile -C $(khdr_output) headers
.PHONY: run_tests all clean install emit_tests gen_mods_dir clean_mods_dir headers
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250918-kselftest-uapi-out-of-tree-98d50f59040c
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
[ I think at this point everyone is OK with the ABI, and the x86
implementation has been tested so hopefully we are near to being
able to get this merged? If there are any outstanding issues let
me know and I can look at addressing them. The one possible issue
I am aware of is that the RISC-V shadow stack support was briefly
in -next but got dropped along with the general RISC-V issues during
the last merge window, rebasing for that is still in progress. I
guess ideally this could be applied on a branch and then pulled into
the RISC-V tree? ]
The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and makes it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process, keeping the current
implicit allocation behaviour if one is not specified either with
clone3() or through the use of clone(). The user must provide a shadow
stack pointer, this must point to memory mapped for use as a shadow
stack by map_shadow_stack() with an architecture specified shadow stack
token at the top of the stack.
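
To make the shape of the extended argument block concrete, the sketch
below packs a clone3() argument structure with one extra 64-bit field
appended. The first eleven fields follow the existing struct clone_args
layout; the position and the name of the new field (shstk_token, as used
in the v21 changelog) are assumptions for illustration, and the example
only builds bytes, it does not issue the system call.

```python
import struct

# Existing struct clone_args fields, in order (88 bytes in total).
BASE_FIELDS = (
    "flags", "pidfd", "child_tid", "parent_tid", "exit_signal",
    "stack", "stack_size", "tls", "set_tid", "set_tid_size", "cgroup",
)
# Assumption: the series appends a single 64-bit field with this name.
FIELDS = BASE_FIELDS + ("shstk_token",)


def pack_clone_args(**values):
    """Pack the named fields into native-endian u64s; unset fields are 0."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown clone_args fields: {sorted(unknown)}")
    fmt = "=%dQ" % len(FIELDS)
    return struct.pack(fmt, *(values.get(name, 0) for name in FIELDS))


# Hypothetical address of a token placed by the caller at the top of a
# region obtained from map_shadow_stack().
args = pack_clone_args(shstk_token=0x7000_0000_FFF8)
print(len(args))
```

Leaving shstk_token at zero corresponds to the case where no shadow stack
is specified and the kernel keeps its current implicit allocation.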
Yuri Khrustalev has raised questions from the libc side regarding
discoverability of extended clone3() structure sizes[2], this seems like
a general issue with clone3(). There was a suggestion to add a hwcap on
arm64 which isn't ideal but is doable there, though architecture
specific mechanisms would also be needed for x86 (and RISC-V if its
support gets merged before this does). The idea has, however, had
strong pushback from the architecture maintainers and it is possible to
detect support for this in clone3() by attempting a call with a
misaligned shadow stack pointer specified so no hwcap has been added.
[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87…
[2] https://lore.kernel.org/r/aCs65ccRQtJBnZ_5@arm.com
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v21:
- Rebase onto https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git kernel-6.18.clone3
- Rename shadow_stack_token to shstk_token, since it's a simple rename I've
kept the acks and reviews but I dropped the tested-bys just to be safe.
- Link to v20: https://lore.kernel.org/r/20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@k…
Changes in v20:
- Comment fixes and clarifications in x86 arch_shstk_validate_clone()
from Rick Edgecombe.
- Spelling fix in documentation.
- Link to v19: https://lore.kernel.org/r/20250819-clone3-shadow-stack-v19-0-bc957075479b@k…
Changes in v19:
- Rebase onto v6.17-rc1.
- Link to v18: https://lore.kernel.org/r/20250702-clone3-shadow-stack-v18-0-7965d2b694db@k…
Changes in v18:
- Rebase onto v6.16-rc3.
- Thanks to pointers from Yuri Khrustalev this version has been tested
on x86 so I have removed the RFT tag.
- Clarify clone3_shadow_stack_valid() comment about the Kconfig check.
- Remove redundant GCSB DSYNCs in arm64 code.
- Fix token validation on x86.
- Link to v17: https://lore.kernel.org/r/20250609-clone3-shadow-stack-v17-0-8840ed97ff6f@k…
Changes in v17:
- Rebase onto v6.16-rc1.
- Link to v16: https://lore.kernel.org/r/20250416-clone3-shadow-stack-v16-0-2ffc9ca3917b@k…
Changes in v16:
- Rebase onto v6.15-rc2.
- Roll in fixes from x86 testing from Rick Edgecombe.
- Rework so that the argument is shadow_stack_token.
- Link to v15: https://lore.kernel.org/r/20250408-clone3-shadow-stack-v15-0-3fa245c6e3be@k…
Changes in v15:
- Rebase onto v6.15-rc1.
- Link to v14: https://lore.kernel.org/r/20250206-clone3-shadow-stack-v14-0-805b53af73b9@k…
Changes in v14:
- Rebase onto v6.14-rc1.
- Link to v13: https://lore.kernel.org/r/20241203-clone3-shadow-stack-v13-0-93b89a81a5ed@k…
Changes in v13:
- Rebase onto v6.13-rc1.
- Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@k…
Changes in v12:
- Add the regular prctl() to the userspace API document since arm64
support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k…
Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and
integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a
base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k…
Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick
Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…
Changes in v9:
- Pull token validation earlier and report problems with an error return
to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke…
Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke…
Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke…
Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke…
Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (8):
arm64/gcs: Return a success value from gcs_alloc_thread_stack()
Documentation: userspace-api: Add shadow stack API documentation
selftests: Provide helper header for shadow stack testing
fork: Add shadow stack support to clone3()
selftests/clone3: Remove redundant flushes of output streams
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
selftests/clone3: Test shadow stack support
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/shadow_stack.rst | 44 +++++
arch/arm64/include/asm/gcs.h | 8 +-
arch/arm64/kernel/process.c | 8 +-
arch/arm64/mm/gcs.c | 55 +++++-
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 53 ++++-
include/asm-generic/cacheflush.h | 11 ++
include/linux/sched/task.h | 17 ++
include/uapi/linux/sched.h | 9 +-
kernel/fork.c | 93 +++++++--
tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++----
tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++-
tools/testing/selftests/ksft_shstk.h | 98 ++++++++++
15 files changed, 620 insertions(+), 81 deletions(-)
---
base-commit: 76cea30ad520238160bf8f5e2f2803fcd7a08d22
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Currently it is not possible to disable streaming mode via ptrace on SME
only systems, the interface for doing this is to write via NT_ARM_SVE but
such writes will be rejected on a system without SVE support. Enable this
functionality by allowing userspace to write SVE_PT_REGS_FPSIMD format data
via NT_ARM_SVE with the vector length set to 0 on SME only systems. Such
writes currently error since we require that a vector length is specified
which should minimise the risk that existing software is relying on current
behaviour.
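
To make the proposed write concrete, the following sketch builds only the
bytes that a tracer would hand to PTRACE_SETREGSET for NT_ARM_SVE; it does
not call ptrace. The field layout mirrors struct user_sve_header followed
by struct user_fpsimd_state as commonly defined in the arm64 uapi headers,
and the constant values are assumptions to verify against asm/ptrace.h.

```python
import struct

SVE_PT_REGS_FPSIMD = 0    # assumed flag value selecting FPSIMD-format data
HEADER_FMT = "<IIHHHH"    # size, max_size, vl, max_vl, flags, reserved
HEADER_SIZE = struct.calcsize(HEADER_FMT)


def build_fpsimd_write(vregs, fpsr=0, fpcr=0):
    """Build an NT_ARM_SVE payload in FPSIMD format with vl set to 0.

    vregs is a list of 32 integers, one per 128-bit V register.
    """
    if len(vregs) != 32:
        raise ValueError("expected 32 V registers")
    body = b"".join(v.to_bytes(16, "little") for v in vregs)
    body += struct.pack("<II8x", fpsr, fpcr)   # fpsr, fpcr, 8 reserved bytes
    size = HEADER_SIZE + len(body)
    # vl = 0 is the new case described above; max_size and max_vl are
    # left as zero in this sketch.
    header = struct.pack(HEADER_FMT, size, 0, 0, 0, SVE_PT_REGS_FPSIMD, 0)
    return header + body


payload = build_fpsimd_write([0] * 32)
print(len(payload))
```

A tracer would pass this buffer through an iovec; whether the write is
accepted then depends on the kernel carrying this series and on the system
being SME only.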
Reads are not supported since I am not aware of any use case for this and
there is some risk that an existing userspace application may be confused if
it reads NT_ARM_SVE on a system without SVE. Existing kernels will return
FPSIMD formatted register state from NT_ARM_SVE if full SVE state is not
stored, for example if the task has not used SVE. Returning a vector length
of 0 would create a risk that software could try to do things like allocate
space for register state with zero sizes, while returning a vector length of
128 bits would look like SVE is supported. It seems safer to just not make
the changes to add read support.
It remains possible for userspace to detect a SME only system via the ptrace
interface only since reads of NT_ARM_SSVE and NT_ARM_ZA will succeed while
reads of NT_ARM_SVE will fail. Read/write access to the FPSIMD registers in
non-streaming mode is available via REGSET_FPR.
The aim is to make a minimally invasive change, no operation that would
previously have succeeded will be affected, and we use a previously defined
interface in new circumstances rather than define completely new ABI.
The series starts with some enhancements to sve-ptrace to cover some
further corners of existing behaviours in order to reduce the risk of
inadvertent changes, implements the proposed new ABI, then extends both
sve-ptrace and fp-ptrace to exercise it.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (5):
kselftest/arm64: Verify that we reject out of bounds VLs in sve-ptrace
kselftest/arm64: Check that unsupported regsets fail in sve-ptrace
arm64/sme: Support disabling streaming mode via ptrace on SME only systems
kselftst/arm64: Test NT_ARM_SVE FPSIMD format writes on non-SVE systems
kselftest/arm64: Cover disabling streaming mode without SVE in fp-ptrace
Documentation/arch/arm64/sve.rst | 5 +
arch/arm64/kernel/ptrace.c | 40 ++++++--
tools/testing/selftests/arm64/fp/fp-ptrace.c | 5 +-
tools/testing/selftests/arm64/fp/sve-ptrace.c | 139 +++++++++++++++++++++++++-
4 files changed, 177 insertions(+), 12 deletions(-)
---
base-commit: 768361ab16ce943ef3577cea204dc81aa4a47517
change-id: 20250717-arm64-sme-ptrace-sme-only-1fb850600ea0
prerequisite-change-id: 20250808-arm64-fp-trace-macro-02ede083da51
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hi,
The pre-existing kselftest for TPM2 is a derived work of my earlier Python
based rudimentary TPM2 stack called 'tpm2-scripts'.
In order to get more coverage and a more maintainable and extensible test
suite I'd like to eventually rewrite the tests with bash and tpm2sh, which
is a TPM2 CLI written in Rust and based on my new TPM2 stack [1] [2].
Given linux-rust work, would it be acceptable to require cargo to install
a runner for kselftest? I'm now finishing off version 0.11 of the tool,
which will take some time (versions before that are honestly quite bad,
don't try them) but after that this would be something I'd like to
put together.
NOTE: while tpm2-protocol itself is Apache/MIT, tpm2sh is a GPL3 licensed
command-line program (for what it is worth).
[1] https://github.com/puavo-org/tpm2sh
[2] https://git.kernel.org/pub/scm/linux/kernel/git/jarkko/tpm2-protocol.git/ab…
BR, Jarkko
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v2 AccECN case handling patch series, which covers the
handling of several exceptional cases in the Accurate ECN spec (RFC9768),
adds new identifiers to be used by CC modules, adds ecn_delta into
rate_sample, and keeps the ACE counter for computation, etc.
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best regards,
Chia-Yu
---
Chia-Yu Chang (11):
tcp: L4S ECT(1) identifier and NEEDS_ACCECN for CC modules
tcp: disable RFC3168 fallback identifier for CC modules
tcp: accecn: handle unexpected AccECN negotiation feedback
tcp: accecn: retransmit downgraded SYN in AccECN negotiation
tcp: move increment of num_retrans
tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN
SYN/ACK
tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiaion
tcp: accecn: fallback outgoing half link to non-AccECN
tcp: accecn: verify ACE counter in 1st ACK after AccECN negotiation
tcp: accecn: stop sending AccECN opt when loss ACK w/ option
tcp: accecn: enable AccECN
Ilpo Järvinen (3):
tcp: try to avoid safer when ACKs are thinned
gro: flushing when CWR is set negatively affects AccECN
tcp: accecn: Add ece_delta to rate_sample
.../networking/net_cachelines/tcp_sock.rst | 1 +
include/linux/tcp.h | 4 +-
include/net/inet_ecn.h | 20 +++-
include/net/tcp.h | 30 +++++-
include/net/tcp_ecn.h | 85 ++++++++++++-----
net/ipv4/sysctl_net_ipv4.c | 2 +-
net/ipv4/tcp.c | 2 +
net/ipv4/tcp_cong.c | 9 +-
net/ipv4/tcp_input.c | 91 +++++++++++++------
net/ipv4/tcp_minisocks.c | 40 +++++---
net/ipv4/tcp_offload.c | 3 +-
net/ipv4/tcp_output.c | 38 +++++---
12 files changed, 239 insertions(+), 86 deletions(-)
--
2.34.1
Soft offlining a HugeTLB page reduces the HugeTLB page pool.
Commit 56374430c5dfc ("mm/memory-failure: userspace controls soft-offlining pages")
introduced the following sysctl interface to control soft offline:
/proc/sys/vm/enable_soft_offline
The interface does not distinguish between page types:
0 - Soft offline is disabled
1 - Soft offline is enabled
Convert enable_soft_offline to a bitmask and support disabling soft
offline for HugeTLB pages:
Bits:
0 - Enable soft offline
1 - Disable soft offline for HugeTLB pages
Supported values:
0 - Soft offline is disabled
1 - Soft offline is enabled
3 - Soft offline is enabled (disabled for HugeTLB pages)
Existing behavior is preserved.
Update documentation and HugeTLB soft offline self tests.
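
As a rough illustration of the bitmask semantics described above, the
following userspace sketch models the decision. The bit names mirror the
patch, but soft_offline_policy() is a hypothetical helper written for this
example and is not part of the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Bit names mirror the patch; the helper below is illustrative only. */
#define SOFT_OFFLINE_ENABLED      (1u << 0)
#define SOFT_OFFLINE_SKIP_HUGETLB (1u << 1)

/*
 * Hypothetical userspace model of the policy check: returns 0 if a soft
 * offline request would proceed, or -EOPNOTSUPP if the sysctl value
 * disables it for this page type.
 */
int soft_offline_policy(unsigned int sysctl_value, bool is_hugetlb)
{
	/* The sysctl handler accepts values in the range 0..3. */
	assert(sysctl_value <= 3);

	/* Bit 0 clear: soft offline is disabled for every page type. */
	if (!(sysctl_value & SOFT_OFFLINE_ENABLED))
		return -EOPNOTSUPP;

	/* Bit 1 set: soft offline is additionally disabled for HugeTLB. */
	if ((sysctl_value & SOFT_OFFLINE_SKIP_HUGETLB) && is_hugetlb)
		return -EOPNOTSUPP;

	return 0;
}
```

With this model, values 0 and 1 behave as before for all page types, which
is the sense in which existing behavior is preserved; only value 3 changes
the outcome, and only for HugeTLB pages.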
Reported-by: Shawn Fan <shawn.fan(a)intel.com>
Suggested-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Kyle Meyer <kyle.meyer(a)hpe.com>
---
Tony's patch:
* https://lore.kernel.org/all/20250904155720.22149-1-tony.luck@intel.com
v1:
* https://lore.kernel.org/all/aMGkAI3zKlVsO0S2@hpe.com
v1 -> v2:
* Make the interface extensible, as suggested by David.
* Preserve existing behavior, as suggested by Jiaqi and David.
Why clear errno in self tests?
madvise() does not set errno when it's successful and errno is set by madvise()
during test_soft_offline_common(3) causing test_soft_offline_common(1) to fail:
# Test soft-offline when enabled_soft_offline=1
# Hugepagesize is 1048576kB
# enable_soft_offline => 1
# Before MADV_SOFT_OFFLINE nr_hugepages=7
# Allocated 0x80000000 bytes of hugetlb pages
# MADV_SOFT_OFFLINE 0x7fd600000000 ret=0, errno=95
# MADV_SOFT_OFFLINE should ret 0
# After MADV_SOFT_OFFLINE nr_hugepages=6
not ok 2 Test soft-offline when enabled_soft_offline=1
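
The stale errno seen in the log above can be reproduced in isolation. This
is a minimal sketch, not the selftest code: close(-1) stands in for the
failing madvise() call, getppid() stands in for the later successful one,
and errno_after_success() is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Returns the value of errno observed after a successful call that follows
 * a failed one. If clear_first is non-zero, errno is reset before the
 * successful call, as the selftest change does before madvise().
 */
int errno_after_success(int clear_first)
{
	/* close(-1) fails and sets errno to EBADF. */
	int ret = close(-1);

	assert(ret == -1 && errno == EBADF);

	if (clear_first)
		errno = 0;

	/* getppid() cannot fail and leaves errno as it found it. */
	(void)getppid();

	return errno;
}
```

Without the reset the caller still sees the old error code, so a check that
prints or compares errno after a successful call reports a failure that did
not happen.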
---
.../ABI/testing/sysfs-memory-page-offline | 3 ++
Documentation/admin-guide/sysctl/vm.rst | 28 ++++++++++++++++---
mm/memory-failure.c | 17 +++++++++--
.../selftests/mm/hugetlb-soft-offline.c | 19 ++++++++++---
4 files changed, 56 insertions(+), 11 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-memory-page-offline b/Documentation/ABI/testing/sysfs-memory-page-offline
index 00f4e35f916f..d3f05ed6605e 100644
--- a/Documentation/ABI/testing/sysfs-memory-page-offline
+++ b/Documentation/ABI/testing/sysfs-memory-page-offline
@@ -20,6 +20,9 @@ Description:
number, or a error when the offlining failed. Reading
the file is not allowed.
+ Soft-offline can be controlled via sysctl, see:
+ Documentation/admin-guide/sysctl/vm.rst
+
What: /sys/devices/system/memory/hard_offline_page
Date: Sep 2009
KernelVersion: 2.6.33
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 4d71211fdad8..ace73480eb9d 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -309,19 +309,39 @@ physical memory) vs performance / capacity implications in transparent and
HugeTLB cases.
For all architectures, enable_soft_offline controls whether to soft offline
-memory pages. When set to 1, kernel attempts to soft offline the pages
-whenever it thinks needed. When set to 0, kernel returns EOPNOTSUPP to
-the request to soft offline the pages. Its default value is 1.
+memory pages.
+
+enable_soft_offline is a bitmask:
+
+Bits::
+
+ 0 - Enable soft offline
+ 1 - Disable soft offline for HugeTLB pages
+
+Supported values::
+
+ 0 - Soft offline is disabled
+ 1 - Soft offline is enabled
+ 3 - Soft offline is enabled (disabled for HugeTLB pages)
+
+The default value is 1.
+
+If soft offline is disabled for the requested page type, EOPNOTSUPP is returned.
It is worth mentioning that after setting enable_soft_offline to 0, the
following requests to soft offline pages will not be performed:
+- Request to soft offline from sysfs (soft_offline_page).
+
- Request to soft offline pages from RAS Correctable Errors Collector.
-- On ARM, the request to soft offline pages from GHES driver.
+- On ARM and X86, the request to soft offline pages from GHES driver.
- On PARISC, the request to soft offline pages from Page Deallocation Table.
+Note:
+ Soft offlining a HugeTLB page reduces the HugeTLB page pool.
+
extfrag_threshold
=================
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fc30ca4804bf..0ad9ae11d9e8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -64,11 +64,14 @@
#include "internal.h"
#include "ras/ras_event.h"
+#define SOFT_OFFLINE_ENABLED BIT(0)
+#define SOFT_OFFLINE_SKIP_HUGETLB BIT(1)
+
static int sysctl_memory_failure_early_kill __read_mostly;
static int sysctl_memory_failure_recovery __read_mostly = 1;
-static int sysctl_enable_soft_offline __read_mostly = 1;
+static int sysctl_enable_soft_offline __read_mostly = SOFT_OFFLINE_ENABLED;
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
@@ -150,7 +153,7 @@ static const struct ctl_table memory_failure_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
- .extra2 = SYSCTL_ONE,
+ .extra2 = SYSCTL_THREE,
}
};
@@ -2799,12 +2802,20 @@ int soft_offline_page(unsigned long pfn, int flags)
return -EIO;
}
- if (!sysctl_enable_soft_offline) {
+ if (!(sysctl_enable_soft_offline & SOFT_OFFLINE_ENABLED)) {
pr_info_once("disabled by /proc/sys/vm/enable_soft_offline\n");
put_ref_page(pfn, flags);
return -EOPNOTSUPP;
}
+ if (sysctl_enable_soft_offline & SOFT_OFFLINE_SKIP_HUGETLB) {
+ if (folio_test_hugetlb(pfn_folio(pfn))) {
+ pr_info_once("disabled for HugeTLB pages by /proc/sys/vm/enable_soft_offline\n");
+ put_ref_page(pfn, flags);
+ return -EOPNOTSUPP;
+ }
+ }
+
mutex_lock(&mf_mutex);
if (PageHWPoison(page)) {
diff --git a/tools/testing/selftests/mm/hugetlb-soft-offline.c b/tools/testing/selftests/mm/hugetlb-soft-offline.c
index f086f0e04756..b87c8778cadf 100644
--- a/tools/testing/selftests/mm/hugetlb-soft-offline.c
+++ b/tools/testing/selftests/mm/hugetlb-soft-offline.c
@@ -5,6 +5,8 @@
* offlining failed with EOPNOTSUPP.
* - if enable_soft_offline = 1, a hugepage should be dissolved and
* nr_hugepages/free_hugepages should be reduced by 1.
+ * - if enable_soft_offline = 3, hugepages should stay intact and soft
+ * offlining failed with EOPNOTSUPP.
*
* Before running, make sure more than 2 hugepages of default_hugepagesz
* are allocated. For example, if /proc/meminfo/Hugepagesize is 2048kB:
@@ -32,6 +34,9 @@
#define EPREFIX " !!! "
+#define SOFT_OFFLINE_ENABLED (1 << 0)
+#define SOFT_OFFLINE_SKIP_HUGETLB (1 << 1)
+
static int do_soft_offline(int fd, size_t len, int expect_errno)
{
char *filemap = NULL;
@@ -56,6 +61,7 @@ static int do_soft_offline(int fd, size_t len, int expect_errno)
ksft_print_msg("Allocated %#lx bytes of hugetlb pages\n", len);
hwp_addr = filemap + len / 2;
+ errno = 0;
ret = madvise(hwp_addr, pagesize, MADV_SOFT_OFFLINE);
ksft_print_msg("MADV_SOFT_OFFLINE %p ret=%d, errno=%d\n",
hwp_addr, ret, errno);
@@ -83,7 +89,7 @@ static int set_enable_soft_offline(int value)
char cmd[256] = {0};
FILE *cmdfile = NULL;
- if (value != 0 && value != 1)
+ if (value < 0 || value > 3)
return -EINVAL;
sprintf(cmd, "echo %d > /proc/sys/vm/enable_soft_offline", value);
@@ -155,13 +161,17 @@ static int create_hugetlbfs_file(struct statfs *file_stat)
static void test_soft_offline_common(int enable_soft_offline)
{
int fd;
- int expect_errno = enable_soft_offline ? 0 : EOPNOTSUPP;
+ int expect_errno = 0;
struct statfs file_stat;
unsigned long hugepagesize_kb = 0;
unsigned long nr_hugepages_before = 0;
unsigned long nr_hugepages_after = 0;
int ret;
+ if (!(enable_soft_offline & SOFT_OFFLINE_ENABLED) ||
+ (enable_soft_offline & SOFT_OFFLINE_SKIP_HUGETLB))
+ expect_errno = EOPNOTSUPP;
+
ksft_print_msg("Test soft-offline when enabled_soft_offline=%d\n",
enable_soft_offline);
@@ -198,7 +208,7 @@ static void test_soft_offline_common(int enable_soft_offline)
// No need for the hugetlbfs file from now on.
close(fd);
- if (enable_soft_offline) {
+ if (expect_errno == 0) {
if (nr_hugepages_before != nr_hugepages_after + 1) {
ksft_test_result_fail("MADV_SOFT_OFFLINE should reduced 1 hugepage\n");
return;
@@ -219,8 +229,9 @@ static void test_soft_offline_common(int enable_soft_offline)
int main(int argc, char **argv)
{
ksft_print_header();
- ksft_set_plan(2);
+ ksft_set_plan(3);
+ test_soft_offline_common(3);
test_soft_offline_common(1);
test_soft_offline_common(0);
--
2.51.0
Hi everyone,
This patchset introduces a new BPF program type that allows overriding
a tracepoint probe function registered via register_trace_*.
Motivation
----------
Tracepoint probe functions registered via register_trace_* in the kernel
cannot be dynamically modified; changing a probe function requires recompiling
the kernel and rebooting. Nor can BPF programs change an existing
probe function.
Overriding a tracepoint provides a way to apply patches to the kernel quickly
(such as security fixes), through predefined static tracepoints,
without waiting for upstream integration.
This patchset demonstrates how to override probe functions with a BPF program.
Overview
--------
This patchset adds BPF_PROG_TYPE_RAW_TRACEPOINT_OVERRIDE program type.
When this type of BPF program attaches, it overrides the target tracepoint
probe function.
It also adds a new struct type, "tracepoint_func_snapshot", which extends
the tracepoint structure. It records the original probe function
registered by the kernel when the BPF program is attached, so that the
original can be restored after detachment.
Critical steps
--------------
1. Attach: Attach programs via the raw_tracepoint_open syscall.
2. Override:
(a) Locate the target probe by `probe_name`.
(b) Override target probe with the BPF program.
(c) Save the BPF program and target probe function into "tracepoint_func_snapshot".
3. Restore: When the BPF program is detached, automatically restore
the original probe function from earlier saved snapshot.
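
The override and restore steps above can be modelled in plain userspace C.
This is only a sketch of the snapshot idea: every name below (toy_tracepoint,
toy_snapshot, toy_override, toy_restore and the two probes) is hypothetical,
and a function pointer stands in for both the kernel probe and the BPF
program.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*probe_fn)(int);

/* Hypothetical stand-in for a tracepoint with one registered probe. */
struct toy_tracepoint {
	probe_fn probe;
};

/* Hypothetical stand-in for "tracepoint_func_snapshot". */
struct toy_snapshot {
	probe_fn original;    /* probe that was registered before the override */
	probe_fn replacement; /* probe installed by the override */
};

/* Step 2: record the current probe, then install the replacement. */
void toy_override(struct toy_tracepoint *tp, struct toy_snapshot *snap,
		  probe_fn replacement)
{
	assert(tp && snap && replacement);
	snap->original = tp->probe;
	snap->replacement = replacement;
	tp->probe = replacement;
}

/* Step 3: put the recorded probe back and clear the snapshot. */
void toy_restore(struct toy_tracepoint *tp, struct toy_snapshot *snap)
{
	assert(tp && snap && snap->original);
	tp->probe = snap->original;
	snap->original = NULL;
	snap->replacement = NULL;
}

/* Two example probes so the effect of the override is observable. */
int toy_original_probe(int x) { return x + 1; }
int toy_patched_probe(int x) { return x * 2; }
```

Keeping both pointers in the snapshot lets detach code tell which
replacement it is undoing. The real series additionally has to locate the
probe by name and synchronise with running tracepoint callers, which this
model does not attempt.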
Future work
-----------
This patchset is intended as a first step toward supporting BPF programs
that can override tracepoint probes. The current implementation may not yet
cover all use cases or handle every corner case.
I welcome feedback and suggestions from the community, and will continue to
refine and improve the design based on comments and real-world requirements.
Thanks!
Fuyu
Fuyu Zhao (3):
bpf: Introduce BPF_PROG_TYPE_RAW_TRACEPOINT_OVERRIDE
libbpf: Add support for BPF_PROG_TYPE_RAW_TRACEPOINT_OVERRIDE
selftests/bpf: Add selftest for "raw_tp.o"
include/linux/bpf_types.h | 2 +
include/linux/trace_events.h | 9 +
include/linux/tracepoint-defs.h | 6 +
include/linux/tracepoint.h | 3 +
include/uapi/linux/bpf.h | 2 +
kernel/bpf/syscall.c | 35 +++-
kernel/trace/bpf_trace.c | 31 +++
kernel/tracepoint.c | 190 +++++++++++++++++-
tools/include/uapi/linux/bpf.h | 2 +
tools/lib/bpf/bpf.c | 1 +
tools/lib/bpf/bpf.h | 3 +-
tools/lib/bpf/libbpf.c | 27 ++-
tools/lib/bpf/libbpf.h | 3 +-
.../bpf/prog_tests/raw_tp_override_test_run.c | 23 +++
.../bpf/progs/test_raw_tp_override_test_run.c | 20 ++
.../selftests/bpf/test_kmods/bpf_testmod.c | 7 +
16 files changed, 352 insertions(+), 12 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/raw_tp_override_test_run.c
create mode 100644 tools/testing/selftests/bpf/progs/test_raw_tp_override_test_run.c
--
2.43.0
This patch series introduces LANDLOCK_SCOPE_MEMFD_EXEC, a new Landlock
scoping mechanism that restricts execution of anonymous memory file
descriptors (memfd) created via memfd_create(2). This addresses security
gaps where processes can bypass W^X policies and execute arbitrary code
through anonymous memory objects.
Fixes: https://github.com/landlock-lsm/linux/issues/37
SECURITY PROBLEM
================
Current Landlock filesystem restrictions do not cover memfd objects,
allowing processes to:
1. Read-to-execute bypass: Create writable memfd, inject code,
then execute via mmap(PROT_EXEC) or direct execve()
2. Anonymous execution: Execute code without touching the filesystem via
execve("/proc/self/fd/N") where N is a memfd descriptor
3. Cross-domain access violations: Pass memfd between processes to
bypass domain restrictions
These scenarios can occur in sandboxed environments where filesystem
access is restricted but memfd creation remains possible.
IMPLEMENTATION
==============
The implementation adds hierarchical execution control through domain
scoping:
Core Components:
- is_memfd_file(): Reliable memfd detection via "memfd:" dentry prefix
- domain_is_scoped(): Cross-domain hierarchy checking (moved to domain.c)
- LSM hooks: mmap_file, file_mprotect, bprm_creds_for_exec
- Creation-time restrictions: hook_file_alloc_security
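
The "memfd:" name prefix used for detection is also visible from userspace
through /proc. The sketch below is an illustrative analogue only, not the
kernel-side is_memfd_file(): it assumes /proc is mounted, and both helper
names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create an anonymous memory file; "demo" is an arbitrary debug name. */
int make_demo_memfd(void)
{
	return memfd_create("demo", MFD_CLOEXEC);
}

/*
 * Returns 1 if the /proc/self/fd link target of fd starts with "/memfd:",
 * 0 if it does not, and -1 if the link cannot be read.
 */
int fd_looks_like_memfd(int fd)
{
	char link[64];
	char target[256];
	ssize_t n;

	assert(fd >= 0);

	snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
	n = readlink(link, target, sizeof(target) - 1);
	if (n < 0)
		return -1;
	target[n] = '\0';

	return strncmp(target, "/memfd:", 7) == 0;
}
```

A path comparison like this is only a heuristic in userspace, since a
regular file could be given a similar-looking name; it is shown here just to
make the naming convention concrete.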
Security Matrix:
Execution decisions follow domain hierarchy rules preventing both
same-domain bypass attempts and cross-domain access violations while
preserving legitimate hierarchical access patterns.
Domain Hierarchy with LANDLOCK_SCOPE_MEMFD_EXEC:
===============================================
Root (no domain) - No restrictions
|
+-- Domain A [SCOPE_MEMFD_EXEC] Layer 1
| +-- memfd_A (tagged with Domain A as creator)
| |
| +-- Domain A1 (child) [NO SCOPE] Layer 2
| | +-- Inherits Layer 1 restrictions from parent
| | +-- memfd_A1 (can create, inherits restrictions)
| | +-- Domain A1a [SCOPE_MEMFD_EXEC] Layer 3
| | +-- memfd_A1a (tagged with Domain A1a)
| |
| +-- Domain A2 (child) [SCOPE_MEMFD_EXEC] Layer 2
| +-- memfd_A2 (tagged with Domain A2 as creator)
| +-- CANNOT access memfd_A1 (different subtree)
|
+-- Domain B [SCOPE_MEMFD_EXEC] Layer 1
+-- memfd_B (tagged with Domain B as creator)
+-- CANNOT access ANY memfd from Domain A subtree
Execution Decision Matrix:
========================
Executor-> | A | A1 | A1a | A2 | B | Root
Creator | | | | | |
------------|-----|----|-----|----|----|-----
Domain A | X | X | X | X | X | Y
Domain A1 | Y | X | X | X | X | Y
Domain A1a | Y | Y | X | X | X | Y
Domain A2 | Y | X | X | X | X | Y
Domain B | X | X | X | X | X | Y
Root | Y | Y | Y | Y | Y | Y
Legend: Y = Execution allowed, X = Execution denied
Scenarios Covered:
- Direct mmap(PROT_EXEC) on memfd files
- Two-stage mmap(PROT_READ) + mprotect(PROT_EXEC) bypass attempts
- execve("/proc/self/fd/N") anonymous execution
- execveat() and fexecve() file descriptor execution
- Cross-process memfd inheritance and IPC passing
TESTING
=======
All patches have been validated with:
- scripts/checkpatch.pl --strict (clean)
- Selftests covering same-domain restrictions, cross-domain
hierarchy enforcement, and regular file isolation
- KUnit tests for memfd detection edge cases
DISCLAIMER
==========
My understanding of Landlock scoping semantics may be limited, but this
implementation reflects my current understanding based on available
documentation and code analysis. I welcome feedback and corrections
regarding the scoping logic and domain hierarchy enforcement.
Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com>
---
Abhinav Saxena (4):
landlock: add LANDLOCK_SCOPE_MEMFD_EXEC scope
landlock: implement memfd detection
landlock: add memfd exec LSM hooks and scoping
selftests/landlock: add memfd execution tests
include/uapi/linux/landlock.h | 5 +
security/landlock/.kunitconfig | 1 +
security/landlock/audit.c | 4 +
security/landlock/audit.h | 1 +
security/landlock/cred.c | 14 -
security/landlock/domain.c | 67 ++++
security/landlock/domain.h | 4 +
security/landlock/fs.c | 405 ++++++++++++++++++++-
security/landlock/limits.h | 2 +-
security/landlock/task.c | 67 ----
.../selftests/landlock/scoped_memfd_exec_test.c | 325 +++++++++++++++++
11 files changed, 812 insertions(+), 83 deletions(-)
---
base-commit: 5b74b2eff1eeefe43584e5b7b348c8cd3b723d38
change-id: 20250716-memfd-exec-ac0d582018c3
Best regards,
--
Abhinav Saxena <xandfury(a)gmail.com>
kunit_kcalloc() may fail. parse_filter_attr_test() uses the returned
pointer without checking it, and then writes to parsed_filters[j],
which can lead to a NULL pointer dereference under low-memory
conditions.
Use KUNIT_ASSERT_NOT_ERR_OR_NULL() to abort the test on allocation
failure, per KUnit guidance for unsafe-to-continue cases.
Fixes: 1c9fd080dffe ("kunit: fix uninitialized variables bug in attributes filtering")
Fixes: 76066f93f1df ("kunit: add tests for filtering attributes")
Signed-off-by: Guangshuo Li <lgs201920130244(a)gmail.com>
---
lib/kunit/executor_test.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/lib/kunit/executor_test.c b/lib/kunit/executor_test.c
index f0090c2729cd..084636ad8a72 100644
--- a/lib/kunit/executor_test.c
+++ b/lib/kunit/executor_test.c
@@ -127,6 +127,10 @@ static void parse_filter_attr_test(struct kunit *test)
parsed_filters = kunit_kcalloc(test, filter_count, sizeof(*parsed_filters),
GFP_KERNEL);
+
+ /* Abort test if allocation failed. */
+ KUNIT_ASSERT_NOT_ERR_OR_NULL(test, parsed_filters);
+
for (j = 0; j < filter_count; j++) {
parsed_filters[j] = kunit_next_attr_filter(&filter, &err);
KUNIT_ASSERT_EQ_MSG(test, err, 0, "failed to parse filter from '%s'", filters);
--
2.43.0
For a while now we have supported file handles for pidfds. This has
proven to be very useful.
Extend the concept to cover namespaces as well. After this patchset it
is possible to encode and decode namespace file handles using the
common name_to_handle_at() and open_by_handle_at() APIs.
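
A minimal sketch of the encoding side, assuming a kernel with this series
applied. encode_ns_handle() is a hypothetical helper name; on kernels
without namespace file handle support the name_to_handle_at() call is
expected to fail, and the helper then reports the errno value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/*
 * Try to encode a file handle for a namespace path such as
 * "/proc/self/ns/net". On success returns a malloc()ed handle (the caller
 * frees it) and stores 0 in *err. On failure returns NULL and stores the
 * errno value in *err.
 */
struct file_handle *encode_ns_handle(const char *path, int *err)
{
	struct file_handle *fh;
	int mount_id;

	assert(path && err);

	/* Reserve room for the largest handle the kernel may return. */
	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh) {
		*err = ENOMEM;
		return NULL;
	}
	fh->handle_bytes = MAX_HANDLE_SZ;

	if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
		*err = errno;	/* save before free() can disturb it */
		free(fh);
		return NULL;
	}

	*err = 0;
	return fh;
}
```

Decoding with open_by_handle_at() is left out of the sketch because it is
subject to the permission rules of the series.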
Namespace file descriptors can already be derived from pidfds which
means they aren't subject to overmount protection bugs. IOW, it's
irrelevant if the caller would not have access to an appropriate
/proc/<pid>/ns/ directory as they could always just derive the namespace
based on a pidfd already.
It has the same advantage as pidfds. It's possible to reliably and for
the lifetime of the system refer to a namespace without pinning any
resources and to compare them.
Permission checking is kept simple. If the caller is located in the
namespace the file handle refers to, they are able to open it; otherwise
they must hold privilege over the owning namespace of the relevant
namespace.
Both the network namespace and the mount namespace already have an
associated cookie that isn't recycled and is fully exposed to userspace.
Move this into ns_common and use the same id space for all namespaces so
they can trivially and reliably be compared.
There's more coming based on the iterator infrastructure but the series
is large enough and focuses on file handles.
Extensive selftests included.
Signed-off-by: Christian Brauner <brauner(a)kernel.org>
---
Changes in v2:
- Address various review comments.
- Use a common NS_GET_ID ioctl() instead of individual ioctls.
- Link to v1: https://lore.kernel.org/20250910-work-namespace-v1-0-4dd56e7359d8@kernel.org
---
Christian Brauner (33):
pidfs: validate extensible ioctls
nsfs: drop tautological ioctl() check
nsfs: validate extensible ioctls
block: use extensible_ioctl_valid()
ns: move to_ns_common() to ns_common.h
nsfs: add nsfs.h header
ns: uniformly initialize ns_common
cgroup: use ns_common_init()
ipc: use ns_common_init()
mnt: use ns_common_init()
net: use ns_common_init()
pid: use ns_common_init()
time: use ns_common_init()
user: use ns_common_init()
uts: use ns_common_init()
ns: remove ns_alloc_inum()
nstree: make iterator generic
mnt: support ns lookup
cgroup: support ns lookup
ipc: support ns lookup
net: support ns lookup
pid: support ns lookup
time: support ns lookup
user: support ns lookup
uts: support ns lookup
ns: add to_<type>_ns() to respective headers
nsfs: add current_in_namespace()
nsfs: support file handles
nsfs: support exhaustive file handles
nsfs: add missing id retrieval support
tools: update nsfs.h uapi header
selftests/namespaces: add identifier selftests
selftests/namespaces: add file handle selftests
block/blk-integrity.c | 8 +-
fs/fhandle.c | 6 +
fs/internal.h | 1 +
fs/mount.h | 10 +-
fs/namespace.c | 156 +--
fs/nsfs.c | 201 ++-
fs/pidfs.c | 2 +-
include/linux/cgroup.h | 5 +
include/linux/exportfs.h | 6 +
include/linux/fs.h | 14 +
include/linux/ipc_namespace.h | 5 +
include/linux/ns_common.h | 29 +
include/linux/nsfs.h | 40 +
include/linux/nsproxy.h | 11 -
include/linux/nstree.h | 89 ++
include/linux/pid_namespace.h | 5 +
include/linux/proc_ns.h | 32 +-
include/linux/time_namespace.h | 9 +
include/linux/user_namespace.h | 5 +
include/linux/utsname.h | 5 +
include/net/net_namespace.h | 6 +
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/nsfs.h | 15 +-
init/main.c | 2 +
ipc/msgutil.c | 1 +
ipc/namespace.c | 12 +-
ipc/shm.c | 2 +
kernel/Makefile | 2 +-
kernel/cgroup/cgroup.c | 2 +
kernel/cgroup/namespace.c | 24 +-
kernel/nstree.c | 233 ++++
kernel/pid_namespace.c | 13 +-
kernel/time/namespace.c | 23 +-
kernel/user_namespace.c | 17 +-
kernel/utsname.c | 28 +-
net/core/net_namespace.c | 59 +-
tools/include/uapi/linux/nsfs.h | 17 +-
tools/testing/selftests/namespaces/.gitignore | 2 +
tools/testing/selftests/namespaces/Makefile | 7 +
tools/testing/selftests/namespaces/config | 7 +
.../selftests/namespaces/file_handle_test.c | 1429 ++++++++++++++++++++
tools/testing/selftests/namespaces/nsid_test.c | 986 ++++++++++++++
42 files changed, 3257 insertions(+), 270 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250905-work-namespace-c68826dda0d4
basic-gcs has its own make rule to handle the special compiler
invocation to build against nolibc. This rule does not respect the
$(CFLAGS) passed by the Makefile from the parent directory.
However these $(CFLAGS) set up the include path to include the UAPI
headers from the current kernel.
Due to this the asm/hwcap.h header is used from the toolchain instead of
the UAPI and the definition of HWCAP_GCS is not found.
Restructure the rule for basic-gcs to respect the $(CFLAGS).
Also drop those options which are already provided by $(CFLAGS).
Reported-by: Naresh Kamboju <naresh.kamboju(a)linaro.org>
Closes: https://lore.kernel.org/lkml/CA+G9fYv77X+kKz2YT6xw7=9UrrotTbQ6fgNac7oohOg8B…
Fixes: a985fe638344 ("kselftest/arm64/gcs: Use nolibc's getauxval()")
Tested-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
tools/testing/selftests/arm64/gcs/Makefile | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/arm64/gcs/Makefile b/tools/testing/selftests/arm64/gcs/Makefile
index d2f3497a9103fc12ebc90c7f4e33ab9b846c6c8a..1fbbf0ca1f0291d00882920eb2d1efbf99882ec1 100644
--- a/tools/testing/selftests/arm64/gcs/Makefile
+++ b/tools/testing/selftests/arm64/gcs/Makefile
@@ -14,11 +14,11 @@ LDLIBS+=-lpthread
include ../../lib.mk
$(OUTPUT)/basic-gcs: basic-gcs.c
- $(CC) -g -fno-asynchronous-unwind-tables -fno-ident -s -Os -nostdlib \
- -static -include ../../../../include/nolibc/nolibc.h \
+ $(CC) $(CFLAGS) -fno-asynchronous-unwind-tables -fno-ident -s -nostdlib -nostdinc \
+ -static -I../../../../include/nolibc -include ../../../../include/nolibc/nolibc.h \
-I../../../../../usr/include \
-std=gnu99 -I../.. -g \
- -ffreestanding -Wall $^ -o $@ -lgcc
+ -ffreestanding $^ -o $@ -lgcc
$(OUTPUT)/gcs-stress-thread: gcs-stress-thread.S
$(CC) -nostdlib $^ -o $@
---
base-commit: 14a41628c470f4aa069075cdcf6ec0138b6cf1da
change-id: 20250916-arm64-gcs-nolibc-7d1f03a2a3cf
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Fix a memory leak in netpoll and introduce netconsole selftests that
expose the issue when running with kmemleak detection enabled.
This patchset includes a selftest for netpoll with multiple concurrent
users (netconsole + bonding), which simulates the scenario from test[1]
that originally demonstrated the issue allegedly fixed by commit
efa95b01da18 ("netpoll: fix use after free") - a commit that is now
being reverted.
Sending this to "net" branch because this is a fix, and the selftest
might help with the backports validation.
Link: https://lore.kernel.org/lkml/96b940137a50e5c387687bb4f57de8b0435a653f.14048… [1]
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes in v5:
- Set CONFIG_BONDING=m in selftests/drivers/net/config.
- Link to v4: https://lore.kernel.org/r/20250917-netconsole_torture-v4-0-0a5b3b8f81ce@deb…
Changes in v4:
- Added an additional selftest to test multiple netpoll users in
parallel
- Link to v3: https://lore.kernel.org/r/20250905-netconsole_torture-v3-0-875c7febd316@deb…
Changes in v3:
- This patchset is a merge of the fix and the selftest together as
recommended by Jakub.
Changes in v2:
- Reuse the netconsole creation from lib_netcons.sh. Thus, refactoring
the create_dynamic_target() (Jakub)
- Move the "wait" to after all the messages have been sent.
- Link to v1: https://lore.kernel.org/r/20250902-netconsole_torture-v1-1-03c6066598e9@deb…
---
Breno Leitao (4):
net: netpoll: fix incorrect refcount handling causing incorrect cleanup
selftest: netcons: refactor target creation
selftest: netcons: create a torture test
selftest: netcons: add test for netconsole over bonded interfaces
net/core/netpoll.c | 7 +-
tools/testing/selftests/drivers/net/Makefile | 2 +
tools/testing/selftests/drivers/net/config | 1 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 197 ++++++++++++++++++---
.../selftests/drivers/net/netcons_over_bonding.sh | 76 ++++++++
.../selftests/drivers/net/netcons_torture.sh | 127 +++++++++++++
6 files changed, 385 insertions(+), 25 deletions(-)
---
base-commit: 5e87fdc37f8dc619549d49ba5c951b369ce7c136
change-id: 20250902-netconsole_torture-8fc23f0aca99
Best regards,
--
Breno Leitao <leitao(a)debian.org>
From: Yicong Yang <yangyicong(a)hisilicon.com>
Armv8.7 introduces single-copy atomic 64-byte load and store
instructions and their variants, named under FEAT_{LS64, LS64_V}.
Add support for Armv8.7 FEAT_{LS64, LS64_V}:
- Add identifying and enabling in the cpufeature list
- Expose the support of these features to userspace through HWCAP3
and cpuinfo
- Add related hwcap test
- Handle the trap of unsupported memory (normal/uncacheable) access in a VM
A real scenario for this feature is that a userspace driver can use it
to implement direct WQE (workqueue entry), a mechanism to fill a WQE
directly into the hardware.
Picked Marc's 2 patches from [1] for handling the LS64 trap in a VM on emulated
MMIO and the introduction of KVM_EXIT_ARM_LDST64B.
[1] https://lore.kernel.org/linux-arm-kernel/20240815125959.2097734-1-maz@kerne…
Tested with updated hwcap test:
[root@localhost tmp]# dmesg | grep "All CPU(s) started"
[ 14.789859] CPU: All CPU(s) started at EL2
[root@localhost tmp]# ./hwcap
# LS64 present
ok 217 cpuinfo_match_LS64
ok 218 sigill_LS64
ok 219 # SKIP sigbus_LS64_V
# LS64_V present
ok 220 cpuinfo_match_LS64_V
ok 221 sigill_LS64_V
ok 222 # SKIP sigbus_LS64_V
# 115 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
# Totals: pass:107 fail:0 xfail:0 xpass:0 skip:115 error:0
root@localhost:/mnt# dmesg | grep "All CPU(s) started"
[ 0.281152] CPU: All CPU(s) started at EL1
root@localhost:/mnt# ./hwcap
# LS64 present
ok 217 cpuinfo_match_LS64
ok 218 sigill_LS64
ok 219 # SKIP sigbus_LS64
# LS64_V present
ok 220 cpuinfo_match_LS64_V
ok 221 sigill_LS64_V
ok 222 # SKIP sigbus_LS64_V
# 115 skipped test(s) detected. Consider enabling relevant config options to improve coverage.
# Totals: pass:107 fail:0 xfail:0 xpass:0 skip:115 error:0
Change since v3:
- Inject DABT fault for LS64 fault on unsupported memory but with valid memslot
Link: https://lore.kernel.org/linux-arm-kernel/20250626080906.64230-1-yangyicong@…
Change since v2:
- Handle the LS64 fault to userspace and allow userspace to inject LS64 fault
- Reorder the patches to make KVM handling prior to feature support
Link: https://lore.kernel.org/linux-arm-kernel/20250331094320.35226-1-yangyicong@…
Change since v1:
- Drop the support for LS64_ACCDATA
- handle the DABT of unsupported memory type after checking the memory attributes
Link: https://lore.kernel.org/linux-arm-kernel/20241202135504.14252-1-yangyicong@…
Marc Zyngier (2):
KVM: arm64: Add exit to userspace on {LD,ST}64B* outside of memslots
KVM: arm64: Add documentation for KVM_EXIT_ARM_LDST64B
Yicong Yang (5):
KVM: arm64: Handle DABT caused by LS64* instructions on unsupported
memory
arm64: Provide basic EL2 setup for FEAT_{LS64, LS64_V} usage at EL0/1
arm64: Add support for FEAT_{LS64, LS64_V}
KVM: arm64: Enable FEAT_{LS64, LS64_V} in the supported guest
kselftest/arm64: Add HWCAP test for FEAT_{LS64, LS64_V}
Documentation/arch/arm64/booting.rst | 12 +++
Documentation/arch/arm64/elf_hwcaps.rst | 6 ++
Documentation/virt/kvm/api.rst | 43 +++++++++--
arch/arm64/include/asm/el2_setup.h | 12 ++-
arch/arm64/include/asm/esr.h | 8 ++
arch/arm64/include/asm/hwcap.h | 2 +
arch/arm64/include/asm/kvm_emulate.h | 7 ++
arch/arm64/include/uapi/asm/hwcap.h | 2 +
arch/arm64/kernel/cpufeature.c | 51 +++++++++++++
arch/arm64/kernel/cpuinfo.c | 2 +
arch/arm64/kvm/inject_fault.c | 29 ++++++++
arch/arm64/kvm/mmio.c | 27 ++++++-
arch/arm64/kvm/mmu.c | 14 +++-
arch/arm64/tools/cpucaps | 2 +
include/uapi/linux/kvm.h | 3 +-
tools/testing/selftests/arm64/abi/hwcap.c | 90 +++++++++++++++++++++++
16 files changed, 299 insertions(+), 11 deletions(-)
--
2.24.0
Hi all,
This is a second version of a series I sent some time ago. It continues
the work of migrating the script tests into prog_tests.
The test_xsk.sh script covers many AF_XDP use cases. The tests it runs
are defined in xskxceiver.c. Since this script is used to test real
hardware, the goal here is to leave it as it is, and only integrate the
tests that run on veth peers into the test_progs framework.
Some tests are flaky so they can't be integrated in the CI as they are.
I think that fixing their flakiness would require a significant amount of
work. So, as a first step, I've excluded them from the list of tests
migrated to the CI (see PATCH 13). If these tests get fixed at some
point, integrating them into the CI will be straightforward.
PATCH 1 extracts test_xsk[.c/.h] from xskxceiver[.c/.h] to make the
tests available to test_progs.
PATCH 2 to 5 fix small issues in the current test
PATCH 7 to 12 handle all errors to release resources instead of calling
exit() when any error occurs.
PATCH 13 isolates some flaky tests
PATCH 14 integrates the non-flaky tests into the test_progs framework
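To illustrate the "release resources instead of calling exit()" idea from PATCH 7 to 12, here is a minimal sketch. It is not code from the series: struct toy_test, toy_test_setup() and the fail_second knob are hypothetical names used only to show the pattern of unwinding and returning an error to the caller.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a test's resources; not a type from test_xsk.c. */
struct toy_test {
	void *umem;
	void *pkt_stream;
};

/*
 * Set up the toy test. On failure, free what was already allocated and
 * return a negative errno instead of calling exit(), so the caller can
 * report the failure and carry on with the next test.
 * fail_second only exists to simulate the second allocation failing.
 */
int toy_test_setup(struct toy_test *t, bool fail_second)
{
	t->umem = malloc(4096);
	if (!t->umem)
		return -ENOMEM;

	t->pkt_stream = fail_second ? NULL : malloc(256);
	if (!t->pkt_stream) {
		free(t->umem);
		t->umem = NULL;
		return -ENOMEM;
	}
	return 0;
}

/* Release everything owned by the toy test; safe to call more than once. */
void toy_test_teardown(struct toy_test *t)
{
	free(t->pkt_stream);
	free(t->umem);
	t->pkt_stream = NULL;
	t->umem = NULL;
}
```

A caller would check the return value of toy_test_setup() and mark the test as failed, rather than letting the whole process terminate.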
Maciej, I've fixed the bug you found in the initial series. I've
looked for any hardware able to run test_xsk.sh in my office, but I
couldn't find one ... So here again, only the veth part has been tested,
sorry about that.
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
---
Changes in v3:
- Rebase on latest bpf-next_base to integrate commit c9110e6f7237 ("selftests/bpf:
Fix count write in testapp_xdp_metadata_copy()").
- Move XDP_METADATA_COPY_* tests from flaky-tests to nominal tests
- Link to v2: https://lore.kernel.org/r/20250902-xsk-v2-0-17c6345d5215@bootlin.com
Changes in v2:
- Rebase on the latest bpf-next_base and integrate the newly added tests
to the work (adjust_tail* and tx_queue_consumer tests)
- Re-order patches to split xskxceiver sooner.
- Fix the bug reported by Maciej.
- Fix verbose mode in test_xsk.sh by keeping kselftest (remove PATCH 1,
7 and 8)
- Link to v1: https://lore.kernel.org/r/20250313-xsk-v1-0-7374729a93b9@bootlin.com
---
Bastien Curutchet (eBPF Foundation) (14):
selftests/bpf: test_xsk: Split xskxceiver
selftests/bpf: test_xsk: Initialize bitmap before use
selftests/bpf: test_xsk: Fix memory leaks
selftests/bpf: test_xsk: Wrap test clean-up in functions
selftests/bpf: test_xsk: Release resources when swap fails
selftests/bpf: test_xsk: Add return value to init_iface()
selftests/bpf: test_xsk: Don't exit immediately when xsk_attach fails
selftests/bpf: test_xsk: Don't exit immediately when gettimeofday fails
selftests/bpf: test_xsk: Don't exit immediately when workers fail
selftests/bpf: test_xsk: Don't exit immediately if validate_traffic fails
selftests/bpf: test_xsk: Don't exit immediately on allocation failures
selftests/bpf: test_xsk: Move exit_with_error to xskxceiver.c
selftests/bpf: test_xsk: Isolate flaky tests
selftests/bpf: test_xsk: Integrate test_xsk.c to test_progs framework
tools/testing/selftests/bpf/Makefile | 11 +-
tools/testing/selftests/bpf/prog_tests/test_xsk.c | 2604 ++++++++++++++++++++
tools/testing/selftests/bpf/prog_tests/test_xsk.h | 294 +++
tools/testing/selftests/bpf/prog_tests/xsk.c | 146 ++
tools/testing/selftests/bpf/xskxceiver.c | 2685 +--------------------
tools/testing/selftests/bpf/xskxceiver.h | 156 --
6 files changed, 3170 insertions(+), 2726 deletions(-)
---
base-commit: e4e08c130231eb8071153ab5f056874d8f70430b
change-id: 20250218-xsk-0cf90e975d14
Best regards,
--
Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v19 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28, and it
will be RFC9768.
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best Regards,
Chia-Yu
---
v19 (16-Sep-2025)
- Repost remaining 10 patches in this series, as the first 4 patches are applied (Jakub Kicinski <kuba(a)kernel.org>)
v18 (11-Sep-2025)
- Reorder tcpi_accecn_fail_mode and tcpi_accecn_opt_seen to avoid adding fields in the middle of tcp_info (Eric Dumazet <edumazet(a)google.com>)
v17 (8-Sep-2025)
- Change tcp_ecn_mode_max from 5 to 2 to disable AccECN enablement before the whole AccECN feature has been accepted
v16 (6-Sep-2025)
- Use TCP_ECN_IN_ACCECN_OUT_ACCECN, TCP_ECN_IN_ECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_ECN in comments of tcp_ecn_send_syn() (Eric Dumazet <edumazet(a)google.com>)
- Add tcpi_accecn_fail_mode and tcpi_accecn_opt_seen to make tcp_info be multiple of 64 bits in patch #12
v15 (14-Aug-2025)
- Update pahole results in commit messages
- Accurate ECN will become RFC9768
v14 (22-Jul-2025)
- Add missing const for struct tcp_sock of tcp_accecn_option_beacon_check() of #11 (Simon Horman <horms(a)kernel.org>)
v13 (18-Jul-2025)
- Implement tcp_accecn_extract_syn_ect() and tcp_accecn_reflector_flags() with static array lookup of patch #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Fix typos in comments of #6 and remove patch #7 of v12 about simultaneous connect (Paolo Abeni <pabeni(a)redhat.com>)
- Move TCP_ACCECN_E1B_INIT_OFFSET, TCP_ACCECN_E0B_INIT_OFFSET, and TCP_ACCECN_CEB_INIT_OFFSET from patch #7 to #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use static array lookup in tcp_accecn_optfield_to_ecnfield() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return false when WARN_ON_ONCE() is true in tcp_accecn_process_option() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Make synack_ecn_bytes as static const array and use const u32 pointer in tcp_options_write() of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use ALIGN() and ALIGN_DOWN() in tcp_options_fit_accecn() to pad TCP AccECN option to dword of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return TCP_ACCECN_OPT_FAIL_SEEN if WARN_ON_ONCE() is true in tcp_accecn_option_init() of #12 (Paolo Abeni <pabeni(a)redhat.com>)
v12 (04-Jul-2025)
- Fix compilation issues with some intermediate patches in v11
- Add more comments for AccECN helpers of tcp_ecn.h
v11 (03-Jul-2025)
- Fix compilation issues with some intermediate patches in v10
v10 (02-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() over every reads of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)
v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEEN flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accurate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restructure the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempting AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)
v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)
v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results
v6 (09-May-2025)
- Add #3 to utilize existing holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use existing holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for readability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
---
Chia-Yu Chang (3):
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (7):
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics
Documentation/networking/ip-sysctl.rst | 55 +-
.../networking/net_cachelines/tcp_sock.rst | 12 +
include/linux/tcp.h | 28 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 33 ++
include/net/tcp_ecn.h | 554 +++++++++++++++++-
include/uapi/linux/tcp.h | 9 +
net/ipv4/syncookies.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 30 +-
net/ipv4/tcp_input.c | 318 +++++++++-
net/ipv4/tcp_ipv4.c | 8 +-
net/ipv4/tcp_minisocks.c | 40 +-
net/ipv4/tcp_output.c | 239 +++++++-
net/ipv6/syncookies.c | 2 +
net/ipv6/tcp_ipv6.c | 1 +
16 files changed, 1278 insertions(+), 76 deletions(-)
--
2.34.1
For a while now we have supported file handles for pidfds. This has
proven to be very useful.
Extend the concept to cover namespaces as well. After this patchset it
is possible to encode and decode namespace file handles using the
common name_to_handle_at() and open_by_handle_at() APIs.
Namespace file descriptors can already be derived from pidfds which
means they aren't subject to overmount protection bugs. IOW, it's
irrelevant if the caller would not have access to an appropriate
/proc/<pid>/ns/ directory as they could always just derive the namespace
based on a pidfd already.
It has the same advantage as pidfds. It's possible to reliably and for
the lifetime of the system refer to a namespace without pinning any
resources and to compare them.
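As a rough userspace illustration of the encode side, the sketch below opens the caller's network namespace and asks name_to_handle_at() for a handle with AT_EMPTY_PATH. This is only a sketch under assumptions: encode_netns_handle() is a hypothetical helper, not part of the series, and on a kernel without this patchset the call is expected to fail (the helper then returns NULL with errno set). Decoding with open_by_handle_at() is left out because it depends on the details of the new uapi.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Try to encode a file handle for the caller's network namespace.
 * Returns a malloc()ed handle on success (the caller frees it), or NULL
 * with errno set, e.g. when the kernel cannot export nsfs file handles.
 */
struct file_handle *encode_netns_handle(int *mount_id)
{
	struct file_handle *fh;
	int fd, saved;

	fd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return NULL;

	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh) {
		close(fd);
		errno = ENOMEM;
		return NULL;
	}
	/* Tell the kernel how much room f_handle[] has. */
	fh->handle_bytes = MAX_HANDLE_SZ;

	if (name_to_handle_at(fd, "", fh, mount_id, AT_EMPTY_PATH) < 0) {
		saved = errno;
		free(fh);
		close(fd);
		errno = saved;
		return NULL;
	}
	close(fd);
	return fh;
}
```

Two handles obtained this way for the same namespace could then be compared byte for byte, without keeping any namespace file descriptor open in between.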
Permission checking is kept simple. If the caller is located in the
namespace the file handle refers to, they are able to open it; otherwise
they must hold privilege over the owning namespace of the relevant
namespace.
Both the network namespace and the mount namespace already have an
associated cookie that isn't recycled and is fully exposed to userspace.
Move this into ns_common and use the same id space for all namespaces so
they can trivially and reliably be compared.
There's more coming based on the iterator infrastructure but the series
is large enough and focuses on file handles.
Extensive selftests included. I still have various other test-suites to
run but it holds up so far.
Signed-off-by: Christian Brauner <brauner(a)kernel.org>
---
Christian Brauner (32):
pidfs: validate extensible ioctls
nsfs: validate extensible ioctls
block: use extensible_ioctl_valid()
ns: move to_ns_common() to ns_common.h
nsfs: add nsfs.h header
ns: uniformly initialize ns_common
mnt: use ns_common_init()
ipc: use ns_common_init()
cgroup: use ns_common_init()
pid: use ns_common_init()
time: use ns_common_init()
uts: use ns_common_init()
user: use ns_common_init()
net: use ns_common_init()
ns: remove ns_alloc_inum()
nstree: make iterator generic
mnt: support iterator
cgroup: support iterator
ipc: support iterator
net: support iterator
pid: support iterator
time: support iterator
userns: support iterator
uts: support iterator
ns: add to_<type>_ns() to respective headers
nsfs: add current_in_namespace()
nsfs: support file handles
nsfs: support exhaustive file handles
nsfs: add missing id retrieval support
tools: update nsfs.h uapi header
selftests/namespaces: add identifier selftests
selftests/namespaces: add file handle selftests
block/blk-integrity.c | 8 +-
fs/fhandle.c | 6 +
fs/internal.h | 1 +
fs/mount.h | 10 +-
fs/namespace.c | 156 +--
fs/nsfs.c | 266 +++-
fs/pidfs.c | 2 +-
include/linux/cgroup.h | 5 +
include/linux/exportfs.h | 6 +
include/linux/fs.h | 14 +
include/linux/ipc_namespace.h | 5 +
include/linux/ns_common.h | 29 +
include/linux/nsfs.h | 40 +
include/linux/nsproxy.h | 11 -
include/linux/nstree.h | 89 ++
include/linux/pid_namespace.h | 5 +
include/linux/proc_ns.h | 32 +-
include/linux/time_namespace.h | 9 +
include/linux/user_namespace.h | 5 +
include/linux/utsname.h | 5 +
include/net/net_namespace.h | 6 +
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/nsfs.h | 12 +-
init/main.c | 2 +
ipc/msgutil.c | 1 +
ipc/namespace.c | 12 +-
ipc/shm.c | 2 +
kernel/Makefile | 2 +-
kernel/cgroup/cgroup.c | 2 +
kernel/cgroup/namespace.c | 24 +-
kernel/nstree.c | 233 ++++
kernel/pid_namespace.c | 13 +-
kernel/time/namespace.c | 23 +-
kernel/user_namespace.c | 17 +-
kernel/utsname.c | 28 +-
net/core/net_namespace.c | 59 +-
tools/include/uapi/linux/nsfs.h | 23 +-
tools/testing/selftests/namespaces/.gitignore | 2 +
tools/testing/selftests/namespaces/Makefile | 7 +
tools/testing/selftests/namespaces/config | 7 +
.../selftests/namespaces/file_handle_test.c | 1410 ++++++++++++++++++++
tools/testing/selftests/namespaces/nsid_test.c | 986 ++++++++++++++
42 files changed, 3306 insertions(+), 270 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250905-work-namespace-c68826dda0d4
The active-backup bonding mode supports XFRM ESP offload. However, when
a bond is added using a command like `ip link add bond0 type bond mode 1
miimon 100`, the `ethtool -k` command shows that the XFRM ESP offload is
disabled. This occurs because, in bond_newlink(), we change bond link
first and register bond device later. So the XFRM feature update in
bond_option_mode_set() is not called as the bond device is not yet
registered, leading to the offload feature not being set successfully.
To resolve this issue, we can modify the code order in bond_newlink() to
ensure that the bond device is registered first before changing the bond
link parameters. This change will allow the XFRM ESP offload feature to be
correctly enabled.
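The ordering problem can be shown with a deliberately simplified userspace model. Nothing below is kernel code: struct toy_bond and the toy_* helpers are hypothetical, and the only behaviour modelled is the one described above, namely that a feature update requested before registration is silently skipped.

```c
#include <stdbool.h>

/* Toy stand-in for a bond device; not the kernel's struct net_device. */
struct toy_bond {
	bool registered;   /* set by toy_register() */
	int mode;          /* 1 stands for active-backup */
	bool esp_offload;  /* what "ethtool -k" would report in this model */
};

/* Feature updates are a no-op until the device is registered. */
static void toy_update_features(struct toy_bond *b)
{
	if (!b->registered)
		return;
	b->esp_offload = (b->mode == 1);
}

/* Apply the link options, then try to refresh the feature flags. */
static void toy_changelink(struct toy_bond *b, int mode)
{
	b->mode = mode;
	toy_update_features(b);
}

static void toy_register(struct toy_bond *b)
{
	b->registered = true;
}

/* Old order: options first, registration second. */
bool toy_newlink_old(int mode)
{
	struct toy_bond b = { 0 };

	toy_changelink(&b, mode);
	toy_register(&b);
	return b.esp_offload;
}

/* New order: registration first, options second. */
bool toy_newlink_new(int mode)
{
	struct toy_bond b = { 0 };

	toy_register(&b);
	toy_changelink(&b, mode);
	return b.esp_offload;
}
```

In this model toy_newlink_old(1) leaves the offload flag clear while toy_newlink_new(1) sets it, which mirrors the before and after behaviour described in the commit message.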
Fixes: 007ab5345545 ("bonding: fix feature flag setting at init time")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
v2: rebase to latest net, no code update
---
drivers/net/bonding/bond_main.c | 2 +-
drivers/net/bonding/bond_netlink.c | 16 +++++++++-------
include/net/bonding.h | 1 +
3 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 57be04f6cb11..f4f0feddd9fa 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4411,7 +4411,7 @@ void bond_work_init_all(struct bonding *bond)
INIT_DELAYED_WORK(&bond->slave_arr_work, bond_slave_arr_handler);
}
-static void bond_work_cancel_all(struct bonding *bond)
+void bond_work_cancel_all(struct bonding *bond)
{
cancel_delayed_work_sync(&bond->mii_work);
cancel_delayed_work_sync(&bond->arp_work);
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index 57fff2421f1b..7a9d73ec8e91 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -579,20 +579,22 @@ static int bond_newlink(struct net_device *bond_dev,
struct rtnl_newlink_params *params,
struct netlink_ext_ack *extack)
{
+ struct bonding *bond = netdev_priv(bond_dev);
struct nlattr **data = params->data;
struct nlattr **tb = params->tb;
int err;
- err = bond_changelink(bond_dev, tb, data, extack);
- if (err < 0)
+ err = register_netdevice(bond_dev);
+ if (err)
return err;
- err = register_netdevice(bond_dev);
- if (!err) {
- struct bonding *bond = netdev_priv(bond_dev);
+ netif_carrier_off(bond_dev);
+ bond_work_init_all(bond);
- netif_carrier_off(bond_dev);
- bond_work_init_all(bond);
+ err = bond_changelink(bond_dev, tb, data, extack);
+ if (err) {
+ bond_work_cancel_all(bond);
+ unregister_netdevice(bond_dev);
}
return err;
diff --git a/include/net/bonding.h b/include/net/bonding.h
index e06f0d63b2c1..bd56ad976cfb 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -711,6 +711,7 @@ struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
void bond_slave_arr_work_rearm(struct bonding *bond, unsigned long delay);
void bond_work_init_all(struct bonding *bond);
+void bond_work_cancel_all(struct bonding *bond);
#ifdef CONFIG_PROC_FS
void bond_create_proc_entry(struct bonding *bond);
--
2.50.1
From: Lance Yang <lance.yang(a)linux.dev>
The madv_populate and soft-dirty kselftests currently fail on systems where
CONFIG_MEM_SOFT_DIRTY is disabled.
Introduce a new helper softdirty_supported() into vm_util.c/h to ensure
tests are properly skipped when the feature is not enabled.
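For readers without the selftest tree at hand, the check can be approximated in a standalone program. The sketch below is not the vm_util.c implementation: vma_has_flag() and softdirty_probe() are hypothetical helpers that look for the "sd" entry on the VmFlags line of /proc/self/smaps for a freshly created anonymous mapping.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Look up the VMA containing addr in /proc/self/smaps and report whether
 * its VmFlags line lists the given two-letter flag. Returns 1 if present,
 * 0 if absent, -1 if the file, the mapping or its VmFlags line is missing.
 */
int vma_has_flag(const void *addr, const char *flag)
{
	unsigned long target = (unsigned long)addr, start, end;
	char line[4096];
	int in_vma = 0, ret = -1;
	FILE *f = fopen("/proc/self/smaps", "r");

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		/* Only mapping header lines parse as "start-end". */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
			in_vma = target >= start && target < end;
			continue;
		}
		if (in_vma && !strncmp(line, "VmFlags:", 8)) {
			char *tok = strtok(line + 8, " \n");

			ret = 0;
			for (; tok; tok = strtok(NULL, " \n"))
				if (!strcmp(tok, flag))
					ret = 1;
			break;
		}
	}
	fclose(f);
	return ret;
}

/* Map one anonymous page and check whether it carries the "sd" flag. */
int softdirty_probe(void)
{
	size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
	void *addr = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
			  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	int ret;

	if (addr == MAP_FAILED)
		return -1;
	ret = vma_has_flag(addr, "sd");
	munmap(addr, pagesize);
	return ret;
}
```

A test would skip its soft-dirty cases whenever softdirty_probe() does not return 1.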
Acked-by: David Hildenbrand <david(a)redhat.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Signed-off-by: Lance Yang <lance.yang(a)linux.dev>
---
v2 -> v3:
- Optimize softdirty_supported() by directly assigning check_vmflag()
result (per David)
- Pick AB from David - thanks!
- https://lore.kernel.org/lkml/20250917122750.36608-1-lance.yang@linux.dev
v1 -> v2:
- Rename softdirty_is_supported() to softdirty_supported() (per David)
- Drop aarch64 specific handling (per David)
- https://lore.kernel.org/lkml/20250917055913.49759-1-lance.yang@linux.dev
tools/testing/selftests/mm/madv_populate.c | 21 ++-------------------
tools/testing/selftests/mm/soft-dirty.c | 5 ++++-
tools/testing/selftests/mm/vm_util.c | 17 +++++++++++++++++
tools/testing/selftests/mm/vm_util.h | 1 +
4 files changed, 24 insertions(+), 20 deletions(-)
diff --git a/tools/testing/selftests/mm/madv_populate.c b/tools/testing/selftests/mm/madv_populate.c
index b6fabd5c27ed..d8d11bc67ddc 100644
--- a/tools/testing/selftests/mm/madv_populate.c
+++ b/tools/testing/selftests/mm/madv_populate.c
@@ -264,23 +264,6 @@ static void test_softdirty(void)
munmap(addr, SIZE);
}
-static int system_has_softdirty(void)
-{
- /*
- * There is no way to check if the kernel supports soft-dirty, other
- * than by writing to a page and seeing if the bit was set. But the
- * tests are intended to check that the bit gets set when it should, so
- * doing that check would turn a potentially legitimate fail into a
- * skip. Fortunately, we know for sure that arm64 does not support
- * soft-dirty. So for now, let's just use the arch as a corse guide.
- */
-#if defined(__aarch64__)
- return 0;
-#else
- return 1;
-#endif
-}
-
int main(int argc, char **argv)
{
int nr_tests = 16;
@@ -288,7 +271,7 @@ int main(int argc, char **argv)
pagesize = getpagesize();
- if (system_has_softdirty())
+ if (softdirty_supported())
nr_tests += 5;
ksft_print_header();
@@ -300,7 +283,7 @@ int main(int argc, char **argv)
test_holes();
test_populate_read();
test_populate_write();
- if (system_has_softdirty())
+ if (softdirty_supported())
test_softdirty();
err = ksft_get_fail_cnt();
diff --git a/tools/testing/selftests/mm/soft-dirty.c b/tools/testing/selftests/mm/soft-dirty.c
index 8a3f2b4b2186..4ee4db3750c1 100644
--- a/tools/testing/selftests/mm/soft-dirty.c
+++ b/tools/testing/selftests/mm/soft-dirty.c
@@ -200,8 +200,11 @@ int main(int argc, char **argv)
int pagesize;
ksft_print_header();
- ksft_set_plan(15);
+ if (!softdirty_supported())
+ ksft_exit_skip("soft-dirty is not supported\n");
+
+ ksft_set_plan(15);
pagemap_fd = open(PAGEMAP_FILE_PATH, O_RDONLY);
if (pagemap_fd < 0)
ksft_exit_fail_msg("Failed to open %s\n", PAGEMAP_FILE_PATH);
diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index 56e9bd541edd..e33cda301dad 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -449,6 +449,23 @@ bool check_vmflag_pfnmap(void *addr)
return check_vmflag(addr, "pf");
}
+bool softdirty_supported(void)
+{
+ char *addr;
+ bool supported = false;
+ const size_t pagesize = getpagesize();
+
+ /* New mappings are expected to be marked with VM_SOFTDIRTY (sd). */
+ addr = mmap(0, pagesize, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ if (addr == MAP_FAILED)
+ ksft_exit_fail_msg("mmap failed\n");
+
+ supported = check_vmflag(addr, "sd");
+ munmap(addr, pagesize);
+ return supported;
+}
+
/*
* Open an fd at /proc/$pid/maps and configure procmap_out ready for
* PROCMAP_QUERY query. Returns 0 on success, or an error code otherwise.
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 07c4acfd84b6..26c30fdc0241 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -104,6 +104,7 @@ bool find_vma_procmap(struct procmap_fd *procmap, void *address);
int close_procmap(struct procmap_fd *procmap);
int write_sysfs(const char *file_path, unsigned long val);
int read_sysfs(const char *file_path, unsigned long *val);
+bool softdirty_supported(void);
static inline int open_self_procmap(struct procmap_fd *procmap_out)
{
--
2.49.0
This patch simplifies kublk's implementation of the feature list
command, fixes a bug where a feature was missing, and adds a test to
ensure that similar bugs do not happen in the future.
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Uday Shankar (3):
selftests: ublk: kublk: simplify feat_map definition
selftests: ublk: kublk: add UBLK_F_BUF_REG_OFF_DAEMON to feat_map
selftests: ublk: add test to verify that feat_map is complete
tools/testing/selftests/ublk/Makefile | 1 +
tools/testing/selftests/ublk/kublk.c | 32 +++++++++++++------------
tools/testing/selftests/ublk/test_generic_13.sh | 16 +++++++++++++
3 files changed, 34 insertions(+), 15 deletions(-)
---
base-commit: da7b97ba0d219a14a83e9cc93f98b53939f12944
change-id: 20250916-ublk_features-07af4e321e5a
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
Hi,
While staring at epoll, I noticed ep_events_available() looks wrong. I
wrote a small program to confirm, and yes it is definitely wrong.
This series adds a reproducer to kselftest, and fixes the bug.
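For context, the property at stake is that epoll_wait() must not return 0 while an event is pending. The sketch below is not the reproducer from this series (that one needs multiple waiters); ready_fd_is_reported() is a hypothetical single-threaded helper that only demonstrates the expected behaviour for an fd that is already readable.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Returns 1 if a zero-timeout epoll_wait() reports a pipe that already
 * has data queued, 0 if it reports nothing, -1 on setup failure.
 */
int ready_fd_is_reported(void)
{
	struct epoll_event ev = { .events = EPOLLIN }, out;
	int pfd[2], epfd, n = -1;

	if (pipe(pfd) < 0)
		return -1;
	epfd = epoll_create1(0);
	if (epfd < 0)
		goto out_pipe;

	ev.data.fd = pfd[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
		goto out_ep;
	if (write(pfd[1], "x", 1) != 1)
		goto out_ep;

	/* The event is already pending, so timeout 0 should still see it. */
	n = epoll_wait(epfd, &out, 1, 0);
out_ep:
	close(epfd);
out_pipe:
	close(pfd[0]);
	close(pfd[1]);
	return n;
}
```

A false negative would show up as this kind of call returning 0 even though data is queued.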
Nam Cao (2):
selftests/eventpoll: Add test for multiple waiters
eventpoll: Fix epoll_wait() report false negative
fs/eventpoll.c | 16 +------
.../filesystems/epoll/epoll_wakeup_test.c | 45 +++++++++++++++++++
2 files changed, 47 insertions(+), 14 deletions(-)
--
2.39.5
Unlike IPv4, IPv6 routing strictly requires the source address to be valid
on the outgoing interface. If the NS target is set to a remote VLAN interface,
and the source address is also configured on a VLAN over a bond interface,
setting the oif to the bond device will fail to retrieve the correct
destination route.
Fix this by not setting the oif to the bond device when retrieving the NS
target destination. This allows the correct destination device (the VLAN
interface) to be determined, so that bond_verify_device_path can return the
proper VLAN tags for sending NS messages.
Reported-by: David Wilder <wilder(a)us.ibm.com>
Closes: https://lore.kernel.org/netdev/aGOKggdfjv0cApTO@fedora/
Suggested-by: Jay Vosburgh <jv(a)jvosburgh.net>
Tested-by: David Wilder <wilder(a)us.ibm.com>
Acked-by: Jay Vosburgh <jv(a)jvosburgh.net>
Fixes: 4e24be018eb9 ("bonding: add new parameter ns_targets")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
v4: rebase to latest net
v3: no update
v2: split the patch into 2 parts, the kernel change and test update (Jay Vosburgh)
---
drivers/net/bonding/bond_main.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 8832bc9f107b..57be04f6cb11 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3356,7 +3356,6 @@ static void bond_ns_send_all(struct bonding *bond, struct slave *slave)
/* Find out through which dev should the packet go */
memset(&fl6, 0, sizeof(struct flowi6));
fl6.daddr = targets[i];
- fl6.flowi6_oif = bond->dev->ifindex;
dst = ip6_route_output(dev_net(bond->dev), NULL, &fl6);
if (dst->error) {
--
2.50.1
This series adds namespace support to vhost-vsock and loopback. It does
not add namespaces to any of the other guest transports (virtio-vsock,
hyperv, or vmci).
The current revision supports two modes: local and global. Local
mode is complete isolation of namespaces, while global mode is complete
sharing between namespaces of CIDs (the original behavior).
The mode is set using /proc/sys/net/vsock/ns_mode.
Modes are per-netns and write-once. This allows a system to configure
namespaces independently (some may share CIDs, others are completely
isolated). This also supports future possible mixed use cases, where
there may be namespaces in global mode spinning up VMs while there are
mixed mode namespaces that provide services to the VMs, but are not
allowed to allocate from the global CID pool (this mode not implemented
in this series).
If a socket or VM is created when a namespace is global but the
namespace changes to local, the socket or VM will continue working
normally. That is, the socket or VM assumes the mode behavior of the
namespace at the time the socket/VM was created. The original mode is
captured in vsock_create() and so occurs at the time of socket(2) and
accept(2) for sockets and open(2) on /dev/vhost-vsock for VMs. This
prevents a socket/VM connection from suddenly breaking due to a
namespace mode change. Any new sockets/VMs created after the mode change
will adopt the new mode's behavior.
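The mode semantics described above can be summarised with a small userspace model. This is only a sketch: the toy_* names are hypothetical, and the reachability rule in toy_can_connect() (same namespace always allowed, different namespaces only when both sides are global) is an assumption inferred from the test names listed in this cover letter, not taken from the code.

```c
#include <stdbool.h>

enum toy_ns_mode { TOY_MODE_GLOBAL, TOY_MODE_LOCAL };

/* Toy network namespace: global by default, mode may be written once. */
struct toy_netns {
	int id;
	enum toy_ns_mode mode;
	bool mode_written;
};

/* Toy socket/VM: remembers the namespace mode seen at creation time. */
struct toy_vsock {
	int ns_id;
	enum toy_ns_mode mode;
};

/* Write-once: the first write succeeds, later writes are rejected. */
bool toy_ns_set_mode(struct toy_netns *ns, enum toy_ns_mode mode)
{
	if (ns->mode_written)
		return false;
	ns->mode = mode;
	ns->mode_written = true;
	return true;
}

/* Snapshot the namespace mode, as described for vsock_create(). */
struct toy_vsock toy_vsock_create(const struct toy_netns *ns)
{
	struct toy_vsock sk = { .ns_id = ns->id, .mode = ns->mode };

	return sk;
}

/* Assumed rule: same namespace, or both endpoints in global mode. */
bool toy_can_connect(const struct toy_vsock *a, const struct toy_vsock *b)
{
	if (a->ns_id == b->ns_id)
		return true;
	return a->mode == TOY_MODE_GLOBAL && b->mode == TOY_MODE_GLOBAL;
}
```

In this model a socket created before the namespace switches to local keeps behaving as global, while sockets created afterwards are isolated from other namespaces.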
Additionally, added tests for the new namespace features:
tools/testing/selftests/vsock/vmtest.sh
1..22
ok 1 vm_server_host_client
ok 2 vm_client_host_server
ok 3 vm_loopback
ok 4 host_vsock_ns_mode_ok
ok 5 host_vsock_ns_mode_write_once_ok
ok 6 global_same_cid_fails
ok 7 local_same_cid_ok
ok 8 global_local_same_cid_ok
ok 9 local_global_same_cid_ok
ok 10 diff_ns_global_host_connect_to_global_vm_ok
ok 11 diff_ns_global_host_connect_to_local_vm_fails
ok 12 diff_ns_global_vm_connect_to_global_host_ok
ok 13 diff_ns_global_vm_connect_to_local_host_fails
ok 14 diff_ns_local_host_connect_to_local_vm_fails
ok 15 diff_ns_local_vm_connect_to_local_host_fails
ok 16 diff_ns_global_to_local_loopback_local_fails
ok 17 diff_ns_local_to_global_loopback_fails
ok 18 diff_ns_local_to_local_loopback_fails
ok 19 diff_ns_global_to_global_loopback_ok
ok 20 same_ns_local_loopback_ok
ok 21 same_ns_local_host_connect_to_local_vm_ok
ok 22 same_ns_local_vm_connect_to_local_host_ok
SUMMARY: PASS=22 SKIP=0 FAIL=0
Log: /tmp/vsock_vmtest_OQC4.log
Thanks again for everyone's help and reviews!
Signed-off-by: Bobby Eshleman <bobbyeshleman(a)gmail.com>
To: Stefano Garzarella <sgarzare(a)redhat.com>
To: Shuah Khan <shuah(a)kernel.org>
To: David S. Miller <davem(a)davemloft.net>
To: Eric Dumazet <edumazet(a)google.com>
To: Jakub Kicinski <kuba(a)kernel.org>
To: Paolo Abeni <pabeni(a)redhat.com>
To: Simon Horman <horms(a)kernel.org>
To: Stefan Hajnoczi <stefanha(a)redhat.com>
To: Michael S. Tsirkin <mst(a)redhat.com>
To: Jason Wang <jasowang(a)redhat.com>
To: Xuan Zhuo <xuanzhuo(a)linux.alibaba.com>
To: Eugenio Pérez <eperezma(a)redhat.com>
To: K. Y. Srinivasan <kys(a)microsoft.com>
To: Haiyang Zhang <haiyangz(a)microsoft.com>
To: Wei Liu <wei.liu(a)kernel.org>
To: Dexuan Cui <decui(a)microsoft.com>
To: Bryan Tan <bryan-bt.tan(a)broadcom.com>
To: Vishnu Dasa <vishnu.dasa(a)broadcom.com>
To: Broadcom internal kernel review list <bcm-kernel-feedback-list(a)broadcom.com>
Cc: virtualization(a)lists.linux.dev
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: kvm(a)vger.kernel.org
Cc: linux-hyperv(a)vger.kernel.org
Cc: berrange(a)redhat.com
Changes in v6:
- define behavior when mode changes to local while socket/VM is alive
- af_vsock: clarify description of CID behavior
- af_vsock: use stronger language around CID rules (don't use "may")
- af_vsock: improve naming of buf/buffer
- af_vsock: improve string length checking on proc writes
- vsock_loopback: add space in struct to clarify lock protection
- vsock_loopback: do proper cleanup/unregister on vsock_loopback_exit()
- vsock_loopback: use virtio_vsock_skb_net() instead of sock_net()
- vsock_loopback: set loopback to NULL after kfree()
- vsock_loopback: use pernet_operations and remove callback mechanism
- vsock_loopback: add macros for "global" and "local"
- vsock_loopback: fix length checking
- vmtest.sh: check for namespace support in vmtest.sh
- Link to v5: https://lore.kernel.org/r/20250827-vsock-vmtest-v5-0-0ba580bede5b@meta.com
Changes in v5:
- /proc/net/vsock_ns_mode -> /proc/sys/net/vsock/ns_mode
- vsock_global_net -> vsock_global_dummy_net
- fix netns lookup in vhost_vsock to respect pid namespaces
- add callbacks for vsock_loopback to avoid circular dependency
- vmtest.sh loads vsock_loopback module
- remove vsock_net_mode_can_set()
- change vsock_net_write_mode() to return true/false based on success
- make vsock_net_mode enum instead of u8
- Link to v4: https://lore.kernel.org/r/20250805-vsock-vmtest-v4-0-059ec51ab111@meta.com
Changes in v4:
- removed RFC tag
- implemented loopback support
- renamed new tests to better reflect behavior
- completed suite of tests with permutations of ns modes and vsock_test
as guest/host
- simplified socat bridging with unix socket instead of tcp + veth
- only use vsock_test for success case, socat for failure case (context
in commit message)
- lots of cleanup
Changes in v3:
- add notion of "modes"
- add procfs /proc/net/vsock_ns_mode
- local and global modes only
- no /dev/vhost-vsock-netns
- vmtest.sh already merged, so new patch just adds new tests for NS
- Link to v2:
https://lore.kernel.org/kvm/20250312-vsock-netns-v2-0-84bffa1aa97a@gmail.com
Changes in v2:
- only support vhost-vsock namespaces
- all g2h namespaces retain old behavior, only common API changes
impacted by vhost-vsock changes
- add /dev/vhost-vsock-netns for "opt-in"
- leave /dev/vhost-vsock to old behavior
- removed netns module param
- Link to v1:
https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
Changes in v1:
- added 'netns' module param to vsock.ko to enable the
network namespace support (disabled by default)
- added 'vsock_net_eq()' to check the "net" assigned to a socket
only when 'netns' support is enabled
- Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
---
Bobby Eshleman (9):
vsock: a per-net vsock NS mode state
vsock: add net to vsock skb cb
vsock: add netns to vsock core
vsock/loopback: add netns support
vsock/virtio: add netns to virtio transport common
vhost/vsock: add netns support
selftests/vsock: improve logging in vmtest.sh
selftests/vsock: invoke vsock_test through helpers
selftests/vsock: add namespace tests
MAINTAINERS | 1 +
drivers/vhost/vsock.c | 78 ++-
include/linux/virtio_vsock.h | 24 +
include/net/af_vsock.h | 71 +-
include/net/net_namespace.h | 4 +
include/net/netns/vsock.h | 26 +
net/vmw_vsock/af_vsock.c | 219 +++++-
net/vmw_vsock/hyperv_transport.c | 2 +-
net/vmw_vsock/virtio_transport.c | 6 +-
net/vmw_vsock/virtio_transport_common.c | 18 +-
net/vmw_vsock/vmci_transport.c | 6 +-
net/vmw_vsock/vsock_loopback.c | 102 ++-
tools/testing/selftests/vsock/vmtest.sh | 1133 +++++++++++++++++++++++++++----
13 files changed, 1501 insertions(+), 189 deletions(-)
---
base-commit: 949ddfb774fe527cebfa3f769804344940f7ed2e
change-id: 20250325-vsock-vmtest-b3a21d2102c2
Best regards,
--
Bobby Eshleman <bobbyeshleman(a)meta.com>
Prior to commit 9245fd6b8531 ("KVM: x86: model canonical checks more
precisely"), KVM_SET_NESTED_STATE would fail if the state was captured
with L2 active, L1 had CR4.LA57 set, L2 did not, and the
VMCS12.HOST_GSBASE (or other host-state field checked for canonicality)
had an address greater than 48 bits wide.
Add a regression test that reproduces the KVM_SET_NESTED_STATE failure
conditions. To do so, the first three patches add support for 5-level
paging in the selftest L1 VM.
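To illustrate the failure condition described above, here is a minimal
sketch of a canonical-address check. It assumes the usual x86-64 rule that
bits 63 down to N-1 must all be equal for an N-bit virtual address width.
The helper name sketch_is_canonical() is hypothetical and is not taken from
KVM or from the selftests.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the KVM implementation): an address is
 * canonical for a vaddr_bits-wide address space when bits
 * 63..vaddr_bits-1 are all zeros or all ones.
 */
bool sketch_is_canonical(uint64_t addr, unsigned int vaddr_bits)
{
	uint64_t upper = addr >> (vaddr_bits - 1);
	uint64_t all_ones = UINT64_MAX >> (vaddr_bits - 1);

	return upper == 0 || upper == all_ones;
}
```

Under this rule 0x0000800000000000 is canonical with 57-bit addresses
(CR4.LA57 set) but not with 48-bit addresses, which is presumably the kind
of host-state value that distinguishes the L1 and L2 views in the test.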
Jim Mattson (4):
KVM: selftests: Use a loop to create guest page tables
KVM: selftests: Use a loop to walk guest page tables
KVM: selftests: Add VM_MODE_PXXV57_4K VM mode
KVM: selftests: Add a VMX test for LA57 nested state
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/include/kvm_util.h | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 21 +++
.../testing/selftests/kvm/lib/x86/processor.c | 66 ++++-----
tools/testing/selftests/kvm/lib/x86/vmx.c | 7 +-
.../kvm/x86/vmx_la57_nested_state_test.c | 137 ++++++++++++++++++
6 files changed, 195 insertions(+), 38 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86/vmx_la57_nested_state_test.c
--
2.51.0.470.ga7dc726c21-goog
This patch series refactors all futex selftests to use
kselftest_harness.h instead of futex's logging.h, as discussed here [1].
This removes a lot of boilerplate code and simplifies some
parts of the test logic, mainly when the test needs to exit early. The
result is more than 500 lines removed from
tools/testing/selftests/futex/. It also enables new tests to use
kselftest_harness.h features such as the ASSERT_*() macros.
There are some caveats around this refactor:
- logging.h had verbosity levels, while kselftest_harness.h doesn't. I
created a new print function called ksft_print_dbg_msg() that prints
the message if the user uses the -d flag, so now there's an
equivalent of this feature.
- futex_requeue_pi test accepted command line arguments to be used as
test parameters (e.g. ./futex_requeue_pi -b -l -t 500000). This
doesn't work with kselftest_harness.h because there's no
straightforward way to send command line arguments to the test.
I used FIXTURE_VARIANT() to achieve the same result, but now the
parameters live inside the test file instead of in
functional/run.sh. This slightly increased the number of test
cases for futex_requeue_pi, from 22 to 24.
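The move from command line arguments to in-file variants can be pictured
with a plain C table of parameter sets. This is only an analogue of what
FIXTURE_VARIANT() and FIXTURE_VARIANT_ADD() provide in kselftest_harness.h,
not the harness itself. The struct, the variant names and the assumed
meaning of the old -b, -l and -t options (broadcast, locked, timeout in
nanoseconds) are all illustrative assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical parameter set. The fields are assumed to mirror the old
 * -b (broadcast), -l (locked) and -t (timeout in ns) options.
 */
struct sketch_variant {
	const char *name;
	bool broadcast;
	bool locked;
	long timeout_ns;
};

/* Parameter combinations that used to be spelled out in run.sh. */
static const struct sketch_variant sketch_variants[] = {
	{ "default",                false, false, 0 },
	{ "broadcast",              true,  false, 0 },
	{ "broadcast_locked_500us", true,  true,  500000 },
};

typedef bool (*sketch_test_fn)(const struct sketch_variant *v);

/* Run one test body against every variant; return how many passed. */
size_t sketch_run_variants(sketch_test_fn fn)
{
	size_t i, passed = 0;

	for (i = 0; i < sizeof(sketch_variants) / sizeof(sketch_variants[0]); i++)
		if (fn(&sketch_variants[i]))
			passed++;
	return passed;
}

/* Placeholder test body: only checks that the parameters are sane. */
bool sketch_params_sane(const struct sketch_variant *v)
{
	return v->name != NULL && v->timeout_ns >= 0;
}
```

Keeping the table next to the test body is what makes the parameters part
of the test file rather than of the runner script.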
Although there are a lot of patches, they follow a simple pattern:
- The usage() function, options parsing, kselftest setup
(ksft_print_header(), ksft_set_plan(), ksft_print_cnts()), are all
gone
- info() is replaced with ksft_print_dbg_msg()
- Exit on error paths are replaced with ksft_exit_fail_msg()
- fail/pass replaced with their ksft_ equivalents.
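The info() to ksft_print_dbg_msg() replacement listed above relies on a
print helper that stays silent unless debugging was requested. A minimal
sketch of that idea follows, assuming a global flag that the -d option
would set. sketch_print_dbg_msg(), sketch_set_debug() and the "# dbg: "
prefix are hypothetical and are not the actual kselftest.h code.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical switch; assumed to be set when the user passes -d. */
static bool sketch_dbg_enabled;

void sketch_set_debug(bool on)
{
	sketch_dbg_enabled = on;
}

/*
 * Print a formatted message only when debugging is enabled.
 * Returns the number of characters vprintf() wrote for the message
 * itself (the prefix is not counted), or 0 when debugging is off.
 */
int sketch_print_dbg_msg(const char *fmt, ...)
{
	va_list ap;
	int n;

	if (!sketch_dbg_enabled)
		return 0;

	fputs("# dbg: ", stdout);
	va_start(ap, fmt);
	n = vprintf(fmt, ap);
	va_end(ap);
	return n;
}
```

Because the check lives inside the helper, call sites do not need their own
verbosity tests, which is one way the old logging.h levels can collapse
into a single flag.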
I have compared the results of run.sh before and after this patchset and
didn't find any regression from the test results.
Thanks,
André
[1] https://lore.kernel.org/lkml/87ecv6p364.ffs@tglx/
---
Changes in v3:
- Rebased on top of tip/locking/futex
- Link to v2: https://lore.kernel.org/r/20250720-tonyk-robust_test_cleanup-v2-0-1f9bcb5b7…
Changes in v2:
- Rebased on top of tip/master
- Dropped priv_hash global test variant now that this feature was
dropped
- Added include <stdbool.h> in the first patch
- Link to v1: https://lore.kernel.org/r/20250704-tonyk-robust_test_cleanup-v1-0-c0ff4f24c…
---
André Almeida (15):
selftests: kselftest: Create ksft_print_dbg_msg()
selftests/futex: Refactor futex_requeue_pi with kselftest_harness.h
selftests/futex: Refactor futex_requeue_pi_mismatched_ops with kselftest_harness.h
selftests/futex: Refactor futex_requeue_pi_signal_restart with kselftest_harness.h
selftests/futex: Refactor futex_wait_timeout with kselftest_harness.h
selftests/futex: Refactor futex_wait_wouldblock with kselftest_harness.h
selftests/futex: Refactor futex_wait_unitialized_heap with kselftest_harness.h
selftests/futex: Refactor futex_wait_private_mapped_file with kselftest_harness.h
selftests/futex: Refactor futex_wait with kselftest_harness.h
selftests/futex: Refactor futex_requeue with kselftest_harness.h
selftests/futex: Refactor futex_waitv with kselftest_harness.h
selftests/futex: Refactor futex_priv_hash with kselftest_harness.h
selftests/futex: Refactor futex_numa_mpol with kselftest_harness.h
selftests/futex: Drop logging.h include from futex_numa
selftests/futex: Remove logging.h file
tools/testing/selftests/futex/functional/Makefile | 3 +-
.../selftests/futex/functional/futex_numa.c | 3 +-
.../selftests/futex/functional/futex_numa_mpol.c | 39 +--
.../selftests/futex/functional/futex_priv_hash.c | 48 +---
.../selftests/futex/functional/futex_requeue.c | 76 ++----
.../selftests/futex/functional/futex_requeue_pi.c | 261 ++++++++++-----------
.../functional/futex_requeue_pi_mismatched_ops.c | 86 ++-----
.../functional/futex_requeue_pi_signal_restart.c | 129 +++-------
.../selftests/futex/functional/futex_wait.c | 103 +++-----
.../functional/futex_wait_private_mapped_file.c | 83 ++-----
.../futex/functional/futex_wait_timeout.c | 139 +++++------
.../functional/futex_wait_uninitialized_heap.c | 76 ++----
.../futex/functional/futex_wait_wouldblock.c | 75 ++----
.../selftests/futex/functional/futex_waitv.c | 98 ++++----
tools/testing/selftests/futex/functional/run.sh | 61 +----
tools/testing/selftests/futex/include/logging.h | 148 ------------
tools/testing/selftests/kselftest.h | 14 ++
tools/testing/selftests/kselftest_harness.h | 13 +-
18 files changed, 456 insertions(+), 999 deletions(-)
---
base-commit: ed323aeda5e09fa1ab95946673939c8c425c329c
change-id: 20250703-tonyk-robust_test_cleanup-d1f3406365d9
Best regards,
--
André Almeida <andrealmeid(a)igalia.com>
Fix a memory leak in netpoll and introduce netconsole selftests that
expose the issue when running with kmemleak detection enabled.
This patchset includes a selftest for netpoll with multiple concurrent
users (netconsole + bonding), which simulates the scenario from test[1]
that originally demonstrated the issue allegedly fixed by commit
efa95b01da18 ("netpoll: fix use after free") - a commit that is now
being reverted.
Sending this to the "net" branch because this is a fix, and the selftest
might help with validating the backports.
Link: https://lore.kernel.org/lkml/96b940137a50e5c387687bb4f57de8b0435a653f.14048… [1]
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes in v4:
- Added an additional selftest to test multiple netpoll users in
parallel
- Link to v3: https://lore.kernel.org/r/20250905-netconsole_torture-v3-0-875c7febd316@deb…
Changes in v3:
- This patchset is a merge of the fix and the selftest together as
recommended by Jakub.
Changes in v2:
- Reuse the netconsole creation from lib_netcons.sh. Thus, refactoring
the create_dynamic_target() (Jakub)
- Move the "wait" to after all the messages have been sent.
- Link to v1: https://lore.kernel.org/r/20250902-netconsole_torture-v1-1-03c6066598e9@deb…
---
Breno Leitao (4):
net: netpoll: fix incorrect refcount handling causing incorrect cleanup
selftest: netcons: refactor target creation
selftest: netcons: create a torture test
selftest: netcons: add test for netconsole over bonded interfaces
net/core/netpoll.c | 7 +-
tools/testing/selftests/drivers/net/Makefile | 2 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 197 ++++++++++++++++++---
.../selftests/drivers/net/netcons_over_bonding.sh | 76 ++++++++
.../selftests/drivers/net/netcons_torture.sh | 127 +++++++++++++
5 files changed, 384 insertions(+), 25 deletions(-)
---
base-commit: 5e87fdc37f8dc619549d49ba5c951b369ce7c136
change-id: 20250902-netconsole_torture-8fc23f0aca99
Best regards,
--
Breno Leitao <leitao(a)debian.org>
For testing the functionality of the vDSO, it is necessary to build
userspace programs for multiple different architectures.
It is additional work to acquire matching userspace cross-compilers with
full C libraries and then build root images out of those.
The kernel tree already contains nolibc, a small, header-only C library.
By using it, it is possible to build userspace programs without any
additional dependencies.
For example the kernel.org crosstools or multi-target clang can be used
to build test programs for a multitude of architectures.
While nolibc is very limited, it is enough for many selftests.
With some minor adjustments it is possible to make parse_vdso.c
compatible with nolibc.
As an example, vdso_standalone_test_x86 is now built from the same C
code as the regular vdso_test_gettimeofday, while still being completely
standalone.
Also drop the dependency of parse_vdso.c on the elf.h header from libc and only
use the one from the kernel's UAPI.
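One of the adjustments, going by the patch title "Test __SIZEOF_LONG__
instead of ULONG_MAX" in the list below, is to pick the ELF class from a
compiler-predefined macro, which needs no header at all. A minimal sketch
of that idea follows. The SKETCH_ELF_BITS and sketch_elf_bits() names are
hypothetical, and the real parse_vdso.c may be structured differently.

```c
/*
 * Select the ELF class from a macro that GCC and clang predefine,
 * so the choice does not depend on any libc header.
 */
#if __SIZEOF_LONG__ == 8
#define SKETCH_ELF_BITS 64
#elif __SIZEOF_LONG__ == 4
#define SKETCH_ELF_BITS 32
#else
#error "unsupported size of long"
#endif

/* Expose the selected width so callers can inspect it at run time. */
int sketch_elf_bits(void)
{
	return SKETCH_ELF_BITS;
}
```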
While this series is useful on its own now, it will also integrate with the
kunit UAPI framework currently under development:
https://lore.kernel.org/lkml/20250217-kunit-kselftests-v1-0-42b4524c3b0a@li…
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Provide a limits.h header in nolibc
- Pick up Reviewed-by tags from Kees
- Link to v1: https://lore.kernel.org/r/20250203-parse_vdso-nolibc-v1-0-9cb6268d77be@linu…
---
Thomas Weißschuh (16):
MAINTAINERS: Add vDSO selftests
elf, uapi: Add definition for STN_UNDEF
elf, uapi: Add definition for DT_GNU_HASH
elf, uapi: Add definitions for VER_FLG_BASE and VER_FLG_WEAK
elf, uapi: Add type ElfXX_Versym
elf, uapi: Add types ElfXX_Verdef and ElfXX_Veraux
tools/include: Add uapi/linux/elf.h
selftests: Add headers target
tools/nolibc: add limits.h shim header
selftests: vDSO: vdso_standalone_test_x86: Use vdso_init_form_sysinfo_ehdr
selftests: vDSO: parse_vdso: Drop vdso_init_from_auxv()
selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers
selftests: vDSO: parse_vdso: Test __SIZEOF_LONG__ instead of ULONG_MAX
selftests: vDSO: vdso_test_gettimeofday: Clean up includes
selftests: vDSO: vdso_test_gettimeofday: Make compatible with nolibc
selftests: vDSO: vdso_standalone_test_x86: Switch to nolibc
MAINTAINERS | 1 +
include/uapi/linux/elf.h | 38 ++
tools/include/nolibc/Makefile | 1 +
tools/include/nolibc/limits.h | 7 +
tools/include/uapi/linux/elf.h | 524 +++++++++++++++++++++
tools/testing/selftests/lib.mk | 5 +-
tools/testing/selftests/vDSO/Makefile | 11 +-
tools/testing/selftests/vDSO/parse_vdso.c | 19 +-
tools/testing/selftests/vDSO/parse_vdso.h | 1 -
.../selftests/vDSO/vdso_standalone_test_x86.c | 143 +-----
.../selftests/vDSO/vdso_test_gettimeofday.c | 4 +-
11 files changed, 590 insertions(+), 164 deletions(-)
---
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
change-id: 20241017-parse_vdso-nolibc-e069baa7ff48
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
The generic vDSO provides a lot of common functionality shared between
different architectures. SPARC is the last architecture not using it,
preventing some necessary code cleanup.
Make use of the generic infrastructure.
Follow-up to and replacement for Arnd's SPARC vDSO removal patches:
https://lore.kernel.org/lkml/20250707144726.4008707-1-arnd@kernel.org/
Tested on a Niagara T4 and QEMU.
This has a semantic conflict with my series "vdso: Reject absolute
relocations during build" [0]. The last patch of this series expects all users
of the generic vDSO library to use the vdsocheck tool.
This is not the case (yet) for SPARC64. I do have the patches for the
integration; the specifics will depend on which series is applied first.
Based on tip/timers/vdso.
[0] https://lore.kernel.org/lkml/20250812-vdso-absolute-reloc-v4-0-61a8b615e5ec…
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v3:
- Allocate vDSO data pages dynamically (and lots of preparations for that)
- Drop clock_getres()
- Fix 32bit clock_gettime() syscall fallback
- Link to v2: https://lore.kernel.org/r/20250815-vdso-sparc64-generic-2-v2-0-b5ff80672347…
Changes in v2:
- Rebase on v6.17-rc1
- Drop RFC state
- Fix typo in commit message
- Drop duplicate 'select GENERIC_TIME_VSYSCALL'
- Merge "sparc64: time: Remove architecture-specific clocksource data" into the
main conversion patch. It violated the check in __clocksource_register_scale()
- Link to v1: https://lore.kernel.org/r/20250724-vdso-sparc64-generic-2-v1-0-e376a3bd24d1…
---
Arnd Bergmann (1):
clocksource: remove ARCH_CLOCKSOURCE_DATA
Thomas Weißschuh (35):
selftests: vDSO: vdso_test_correctness: Handle different tv_usec types
arm64: vDSO: getrandom: Explicitly include asm/alternative.h
arm64: vDSO: gettimeofday: Explicitly include vdso/clocksource.h
arm64: vDSO: compat_gettimeofday: Add explicit includes
ARM: vdso: gettimeofday: Add explicit includes
powerpc/vdso/gettimeofday: Explicitly include vdso/time32.h
powerpc/vdso: Explicitly include asm/cputable.h and asm/feature-fixups.h
LoongArch: vDSO: Explicitly include asm/vdso/vdso.h
MIPS: vdso: Add include guard to asm/vdso/vdso.h
MIPS: vdso: Explicitly include asm/vdso/vdso.h
random: vDSO: Add explicit includes
vdso/gettimeofday: Add explicit includes
vdso/helpers: Explicitly include vdso/processor.h
vdso/datapage: Remove inclusion of gettimeofday.h
vdso/datapage: Trim down unnecessary includes
random: vDSO: trim vDSO includes
random: vDSO: remove ifdeffery
random: vDSO: split out datapage update into helper functions
random: vDSO: only access vDSO datapage after random_init()
s390/time: Set up vDSO datapage later
vdso/datastore: Reduce scope of some variables in vvar_fault()
vdso/datastore: Drop inclusion of linux/mmap_lock.h
vdso/datastore: Map pages through struct page
vdso/datastore: Allocate data pages dynamically
sparc64: vdso: Link with -z noexecstack
sparc64: vdso: Remove obsolete "fake section table" reservation
sparc64: vdso: Replace code patching with runtime conditional
sparc64: vdso: Move hardware counter read into header
sparc64: vdso: Move syscall fallbacks into header
sparc64: vdso: Introduce vdso/processor.h
sparc64: vdso: Switch to the generic vDSO library
sparc64: vdso2c: Drop sym_vvar_start handling
sparc64: vdso2c: Remove symbol handling
sparc64: vdso: Implement clock_gettime64()
clocksource: drop include of asm/clocksource.h from linux/clocksource.h
arch/arm/include/asm/vdso/gettimeofday.h | 2 +
arch/arm64/include/asm/vdso/compat_gettimeofday.h | 3 +
arch/arm64/include/asm/vdso/gettimeofday.h | 2 +
arch/arm64/kernel/vdso/vgetrandom.c | 2 +
arch/loongarch/kernel/process.c | 1 +
arch/loongarch/kernel/vdso.c | 1 +
arch/mips/include/asm/vdso/vdso.h | 5 +
arch/mips/kernel/vdso.c | 1 +
arch/powerpc/include/asm/vdso/gettimeofday.h | 1 +
arch/powerpc/include/asm/vdso/processor.h | 3 +
arch/s390/kernel/time.c | 4 +-
arch/sparc/Kconfig | 3 +-
arch/sparc/include/asm/clocksource.h | 9 -
arch/sparc/include/asm/processor.h | 3 +
arch/sparc/include/asm/processor_32.h | 2 -
arch/sparc/include/asm/processor_64.h | 25 --
arch/sparc/include/asm/vdso.h | 2 -
arch/sparc/include/asm/vdso/clocksource.h | 10 +
arch/sparc/include/asm/vdso/gettimeofday.h | 184 ++++++++++
arch/sparc/include/asm/vdso/processor.h | 41 +++
arch/sparc/include/asm/vdso/vsyscall.h | 10 +
arch/sparc/include/asm/vvar.h | 75 ----
arch/sparc/kernel/Makefile | 1 -
arch/sparc/kernel/time_64.c | 6 +-
arch/sparc/kernel/vdso.c | 69 ----
arch/sparc/vdso/Makefile | 8 +-
arch/sparc/vdso/vclock_gettime.c | 380 ++-------------------
arch/sparc/vdso/vdso-layout.lds.S | 26 +-
arch/sparc/vdso/vdso.lds.S | 2 -
arch/sparc/vdso/vdso2c.c | 24 --
arch/sparc/vdso/vdso2c.h | 45 +--
arch/sparc/vdso/vdso32/vdso32.lds.S | 4 +-
arch/sparc/vdso/vma.c | 274 +--------------
drivers/char/random.c | 75 ++--
include/linux/clocksource.h | 8 -
include/linux/vdso_datastore.h | 6 +
include/vdso/datapage.h | 23 +-
include/vdso/helpers.h | 1 +
init/main.c | 2 +
kernel/time/Kconfig | 4 -
lib/vdso/datastore.c | 73 ++--
lib/vdso/getrandom.c | 3 +
lib/vdso/gettimeofday.c | 17 +
.../testing/selftests/vDSO/vdso_test_correctness.c | 8 +-
44 files changed, 451 insertions(+), 997 deletions(-)
---
base-commit: 5f84f6004e298bd41c9e4ed45c18447954b1dce6
change-id: 20250722-vdso-sparc64-generic-2-25f2e058e92c
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
From: Lance Yang <lance.yang(a)linux.dev>
The madv_populate and soft-dirty kselftests currently fail on systems where
CONFIG_MEM_SOFT_DIRTY is disabled.
Introduce a new helper softdirty_supported() into vm_util.c/h to ensure
tests are properly skipped when the feature is not enabled.
Signed-off-by: Lance Yang <lance.yang(a)linux.dev>
---
v1 -> v2:
- Rename softdirty_is_supported() to softdirty_supported() (per David)
- Drop aarch64 specific handling (per David)
- https://lore.kernel.org/lkml/20250917055913.49759-1-lance.yang@linux.dev
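The detection described in the commit message can be tried outside the
selftest tree with the self-contained sketch below. It assumes that
VmFlags in /proc/self/smaps lists "sd" for a fresh anonymous mapping when
soft-dirty tracking is built in. The sketch_* names are hypothetical
stand-ins for the check_vmflag() and softdirty_supported() helpers in
vm_util.c, whose actual implementation may differ.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return true if the VMA containing addr lists flag in its VmFlags line. */
bool sketch_vma_has_flag(const void *addr, const char *flag)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[1024];
	bool in_vma = false, found = false;

	if (!f)
		return false;

	while (fgets(line, sizeof(line), f)) {
		unsigned long start, end;

		/* A "start-end perms ..." line opens a new VMA record. */
		if (sscanf(line, "%lx-%lx ", &start, &end) == 2) {
			in_vma = (uintptr_t)addr >= start && (uintptr_t)addr < end;
			continue;
		}
		if (in_vma && strncmp(line, "VmFlags:", 8) == 0) {
			char *tok;

			/* Flags are space-separated two-letter codes. */
			for (tok = strtok(line + 8, " \n"); tok; tok = strtok(NULL, " \n")) {
				if (strcmp(tok, flag) == 0) {
					found = true;
					break;
				}
			}
			break;
		}
	}
	fclose(f);
	return found;
}

/* 1: soft-dirty available, 0: not available, -1: the probe mapping failed. */
int sketch_softdirty_probe(void)
{
	size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
	void *addr = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
			  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	int ret;

	/* mmap() signals failure with MAP_FAILED, not with NULL. */
	if (addr == MAP_FAILED)
		return -1;

	ret = sketch_vma_has_flag(addr, "sd") ? 1 : 0;
	munmap(addr, pagesize);
	return ret;
}
```

Probing a VMA flag rather than writing to a page keeps the check
independent of the behaviour the tests are meant to verify, so a genuine
failure is not turned into a skip.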
tools/testing/selftests/mm/madv_populate.c | 21 ++-------------------
tools/testing/selftests/mm/soft-dirty.c | 5 ++++-
tools/testing/selftests/mm/vm_util.c | 19 +++++++++++++++++++
tools/testing/selftests/mm/vm_util.h | 1 +
4 files changed, 26 insertions(+), 20 deletions(-)
diff --git a/tools/testing/selftests/mm/madv_populate.c b/tools/testing/selftests/mm/madv_populate.c
index b6fabd5c27ed..d8d11bc67ddc 100644
--- a/tools/testing/selftests/mm/madv_populate.c
+++ b/tools/testing/selftests/mm/madv_populate.c
@@ -264,23 +264,6 @@ static void test_softdirty(void)
munmap(addr, SIZE);
}
-static int system_has_softdirty(void)
-{
- /*
- * There is no way to check if the kernel supports soft-dirty, other
- * than by writing to a page and seeing if the bit was set. But the
- * tests are intended to check that the bit gets set when it should, so
- * doing that check would turn a potentially legitimate fail into a
- * skip. Fortunately, we know for sure that arm64 does not support
- * soft-dirty. So for now, let's just use the arch as a corse guide.
- */
-#if defined(__aarch64__)
- return 0;
-#else
- return 1;
-#endif
-}
-
int main(int argc, char **argv)
{
int nr_tests = 16;
@@ -288,7 +271,7 @@ int main(int argc, char **argv)
pagesize = getpagesize();
- if (system_has_softdirty())
+ if (softdirty_supported())
nr_tests += 5;
ksft_print_header();
@@ -300,7 +283,7 @@ int main(int argc, char **argv)
test_holes();
test_populate_read();
test_populate_write();
- if (system_has_softdirty())
+ if (softdirty_supported())
test_softdirty();
err = ksft_get_fail_cnt();
diff --git a/tools/testing/selftests/mm/soft-dirty.c b/tools/testing/selftests/mm/soft-dirty.c
index 8a3f2b4b2186..4ee4db3750c1 100644
--- a/tools/testing/selftests/mm/soft-dirty.c
+++ b/tools/testing/selftests/mm/soft-dirty.c
@@ -200,8 +200,11 @@ int main(int argc, char **argv)
int pagesize;
ksft_print_header();
- ksft_set_plan(15);
+ if (!softdirty_supported())
+ ksft_exit_skip("soft-dirty is not supported\n");
+
+ ksft_set_plan(15);
pagemap_fd = open(PAGEMAP_FILE_PATH, O_RDONLY);
if (pagemap_fd < 0)
ksft_exit_fail_msg("Failed to open %s\n", PAGEMAP_FILE_PATH);
diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index 56e9bd541edd..ac41d10454a5 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -449,6 +449,25 @@ bool check_vmflag_pfnmap(void *addr)
return check_vmflag(addr, "pf");
}
+bool softdirty_supported(void)
+{
+ char *addr;
+ bool supported = false;
+ const size_t pagesize = getpagesize();
+
+ /* New mappings are expected to be marked with VM_SOFTDIRTY (sd). */
+ addr = mmap(0, pagesize, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ if (addr == MAP_FAILED)
+ ksft_exit_fail_msg("mmap failed\n");
+
+ if (check_vmflag(addr, "sd"))
+ supported = true;
+
+ munmap(addr, pagesize);
+ return supported;
+}
+
/*
* Open an fd at /proc/$pid/maps and configure procmap_out ready for
* PROCMAP_QUERY query. Returns 0 on success, or an error code otherwise.
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 07c4acfd84b6..26c30fdc0241 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -104,6 +104,7 @@ bool find_vma_procmap(struct procmap_fd *procmap, void *address);
int close_procmap(struct procmap_fd *procmap);
int write_sysfs(const char *file_path, unsigned long val);
int read_sysfs(const char *file_path, unsigned long *val);
+bool softdirty_supported(void);
static inline int open_self_procmap(struct procmap_fd *procmap_out)
{
--
2.49.0
From: Lance Yang <lance.yang(a)linux.dev>
The madv_populate and soft-dirty kselftests currently fail on systems where
CONFIG_MEM_SOFT_DIRTY is disabled.
Introduce a new helper softdirty_is_supported() into vm_util.c/h to ensure
tests are properly skipped when the feature is not enabled.
Suggested-by: David Hildenbrand <david(a)redhat.com>
Signed-off-by: Lance Yang <lance.yang(a)linux.dev>
---
tools/testing/selftests/mm/madv_populate.c | 21 ++--------------
tools/testing/selftests/mm/soft-dirty.c | 5 +++-
tools/testing/selftests/mm/vm_util.c | 28 ++++++++++++++++++++++
tools/testing/selftests/mm/vm_util.h | 1 +
4 files changed, 35 insertions(+), 20 deletions(-)
diff --git a/tools/testing/selftests/mm/madv_populate.c b/tools/testing/selftests/mm/madv_populate.c
index b6fabd5c27ed..43dac7783004 100644
--- a/tools/testing/selftests/mm/madv_populate.c
+++ b/tools/testing/selftests/mm/madv_populate.c
@@ -264,23 +264,6 @@ static void test_softdirty(void)
munmap(addr, SIZE);
}
-static int system_has_softdirty(void)
-{
- /*
- * There is no way to check if the kernel supports soft-dirty, other
- * than by writing to a page and seeing if the bit was set. But the
- * tests are intended to check that the bit gets set when it should, so
- * doing that check would turn a potentially legitimate fail into a
- * skip. Fortunately, we know for sure that arm64 does not support
- * soft-dirty. So for now, let's just use the arch as a corse guide.
- */
-#if defined(__aarch64__)
- return 0;
-#else
- return 1;
-#endif
-}
-
int main(int argc, char **argv)
{
int nr_tests = 16;
@@ -288,7 +271,7 @@ int main(int argc, char **argv)
pagesize = getpagesize();
- if (system_has_softdirty())
+ if (softdirty_is_supported())
nr_tests += 5;
ksft_print_header();
@@ -300,7 +283,7 @@ int main(int argc, char **argv)
test_holes();
test_populate_read();
test_populate_write();
- if (system_has_softdirty())
+ if (softdirty_is_supported())
test_softdirty();
err = ksft_get_fail_cnt();
diff --git a/tools/testing/selftests/mm/soft-dirty.c b/tools/testing/selftests/mm/soft-dirty.c
index 8a3f2b4b2186..98e42d2ac32a 100644
--- a/tools/testing/selftests/mm/soft-dirty.c
+++ b/tools/testing/selftests/mm/soft-dirty.c
@@ -200,8 +200,11 @@ int main(int argc, char **argv)
int pagesize;
ksft_print_header();
- ksft_set_plan(15);
+ if (!softdirty_is_supported())
+ ksft_exit_skip("soft-dirty is not supported\n");
+
+ ksft_set_plan(15);
pagemap_fd = open(PAGEMAP_FILE_PATH, O_RDONLY);
if (pagemap_fd < 0)
ksft_exit_fail_msg("Failed to open %s\n", PAGEMAP_FILE_PATH);
diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index 56e9bd541edd..3173335df775 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -449,6 +449,34 @@ bool check_vmflag_pfnmap(void *addr)
return check_vmflag(addr, "pf");
}
+bool softdirty_is_supported(void)
+{
+ char *addr;
+ int ret = 0;
+ size_t pagesize;
+
+ /* We know for sure that arm64 does not support soft-dirty. */
+#if defined(__aarch64__)
+ return ret;
+#endif
+ pagesize = getpagesize();
+ /*
+ * __mmap_complete() always sets VM_SOFTDIRTY for new VMAs, so we
+ * just mmap a small region and check its VmFlags in /proc/self/smaps
+ * for the "sd" flag.
+ */
+ addr = mmap(0, pagesize, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ if (addr == MAP_FAILED)
+ ksft_exit_fail_msg("mmap failed\n");
+
+ if (check_vmflag(addr, "sd"))
+ ret = 1;
+
+ munmap(addr, pagesize);
+ return ret;
+}
+
/*
* Open an fd at /proc/$pid/maps and configure procmap_out ready for
* PROCMAP_QUERY query. Returns 0 on success, or an error code otherwise.
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 07c4acfd84b6..87ad8e0d92c0 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -104,6 +104,7 @@ bool find_vma_procmap(struct procmap_fd *procmap, void *address);
int close_procmap(struct procmap_fd *procmap);
int write_sysfs(const char *file_path, unsigned long val);
int read_sysfs(const char *file_path, unsigned long *val);
+bool softdirty_is_supported(void);
static inline int open_self_procmap(struct procmap_fd *procmap_out)
{
--
2.49.0
Arnd sent the v1 of the series in July, and it was bogus. So with a
little help from claude-sonnet I built up the missing ioctls tests and
tried to figure out a way to apply Arnd's logic without breaking the
existing ioctls.
The end result is in patch 3/3, which makes use of subfunctions to keep
the main ioctl code path clean.
Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org>
---
Changes in v3:
- dropped the co-developed-by tag and put a blurb instead
- change the attribution of patch 3/3 to me as requested by Arnd.
- Link to v2: https://lore.kernel.org/r/20250826-b4-hidraw-ioctls-v2-0-c7726b236719@kerne…
Changes in v2:
- add new hidraw ioctls tests
- refactor Arnd's patch to keep the existing error path logic
- Link to v1: https://lore.kernel.org/linux-input/20250711072847.2836962-1-arnd@kernel.or…
---
Benjamin Tissoires (3):
selftests/hid: hidraw: add more coverage for hidraw ioctls
selftests/hid: hidraw: forge wrong ioctls and tests them
HID: hidraw: tighten ioctl command parsing
drivers/hid/hidraw.c | 224 ++++++++-------
include/uapi/linux/hidraw.h | 2 +
tools/testing/selftests/hid/hid_common.h | 6 +
tools/testing/selftests/hid/hidraw.c | 473 +++++++++++++++++++++++++++++++
4 files changed, 603 insertions(+), 102 deletions(-)
---
base-commit: 02d6eeedbc36d4b309d5518778071a749ef79c4e
change-id: 20250825-b4-hidraw-ioctls-66f34297032a
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
This commit is a rewrite almost from scratch of vmtest.sh.
By relying on virtme-ng, we get rid of boot2container, reducing the
total bootup time (and network requirements). That means that we are
relying on the programs being installed on the host, but that shouldn't
be an issue. The generation of the kconfig is also now handled by
virtme-ng, so that's one less thing to worry about.
I used tools/testing/selftests/vsock/vmtest.sh as a base and modified it
to look mostly like my previous script:
- removed the custom ssh handling
- make use of vng for compiling, which enables remote compilation
(and potentially compilation inside a container on the remote host)
- change the verbosity logic by having 2 levels:
- first one shows the tests outputs
- second level also shows the VM logs
- instead of only running the compiled kernel when it is built, if we
are in the kernel tree, use the kernel artifacts there (and complain
if they are not built)
- adapted the tests list to match the HID subsystem tests
Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org>
---
I have been using virtme-ng in my workflow for a few months.
Now it's time to automate the manual commands I've been running in
vmtest.sh.
---
tools/testing/selftests/hid/vmtest.sh | 668 +++++++++++++++++++++-------------
1 file changed, 423 insertions(+), 245 deletions(-)
diff --git a/tools/testing/selftests/hid/vmtest.sh b/tools/testing/selftests/hid/vmtest.sh
index db534e9099a8a4684346eed0067d397ffa6f80cf..ecbd57f775a044b4d076b4800ca0068f9533056c 100755
--- a/tools/testing/selftests/hid/vmtest.sh
+++ b/tools/testing/selftests/hid/vmtest.sh
@@ -1,296 +1,474 @@
#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (c) 2025 Red Hat
+# Copyright (c) 2025 Meta Platforms, Inc. and affiliates
+#
+# Dependencies:
+# * virtme-ng
+# * busybox-static (used by virtme-ng)
+# * qemu (used by virtme-ng)
+
+readonly SCRIPT_DIR="$(cd -P -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)"
+readonly KERNEL_CHECKOUT=$(realpath "${SCRIPT_DIR}"/../../../../)
+
+source "${SCRIPT_DIR}"/../kselftest/ktap_helpers.sh
+
+readonly HID_BPF_TEST="${SCRIPT_DIR}"/hid_bpf
+readonly HIDRAW_TEST="${SCRIPT_DIR}"/hidraw
+readonly HID_BPF_PROGS="${KERNEL_CHECKOUT}/drivers/hid/bpf/progs"
+readonly SSH_GUEST_PORT=22
+readonly WAIT_PERIOD=3
+readonly WAIT_PERIOD_MAX=60
+readonly WAIT_TOTAL=$(( WAIT_PERIOD * WAIT_PERIOD_MAX ))
+readonly QEMU_PIDFILE=$(mktemp /tmp/qemu_hid_vmtest_XXXX.pid)
+
+readonly QEMU_OPTS="\
+ --pidfile ${QEMU_PIDFILE} \
+"
+readonly KERNEL_CMDLINE=""
+readonly LOG=$(mktemp /tmp/hid_vmtest_XXXX.log)
+readonly TEST_NAMES=(vm_hid_bpf vm_hidraw vm_pytest)
+readonly TEST_DESCS=(
+ "Run hid_bpf tests in the VM."
+ "Run hidraw tests in the VM."
+ "Run the hid-tools test-suite in the VM."
+)
+
+VERBOSE=0
+SHELL_MODE=0
+BUILD_HOST=""
+BUILD_HOST_PODMAN_CONTAINER_NAME=""
+
+usage() {
+ local name
+ local desc
+ local i
+
+ echo
+ echo "$0 [OPTIONS] [TEST]... [-- tests-args]"
+ echo "If no TEST argument is given, all tests will be run."
+ echo
+ echo "Options"
+ echo " -b: build the kernel from the current source tree and use it for guest VMs"
+ echo " -H: hostname for remote build host (used with -b)"
+ echo " -p: podman container name for remote build host (used with -b)"
+ echo " Example: -H beefyserver -p vng"
+ echo " -q: set the path to or name of qemu binary"
+ echo " -s: start a shell in the VM instead of running tests"
+ echo " -v: more verbose output (can be repeated multiple times)"
+ echo
+ echo "Available tests"
+
+ for ((i = 0; i < ${#TEST_NAMES[@]}; i++)); do
+ name=${TEST_NAMES[${i}]}
+ desc=${TEST_DESCS[${i}]}
+ printf "\t%-35s%-35s\n" "${name}" "${desc}"
+ done
+ echo
-set -u
-set -e
-
-# This script currently only works for x86_64
-ARCH="$(uname -m)"
-case "${ARCH}" in
-x86_64)
- QEMU_BINARY=qemu-system-x86_64
- BZIMAGE="arch/x86/boot/bzImage"
- ;;
-*)
- echo "Unsupported architecture"
exit 1
- ;;
-esac
-SCRIPT_DIR="$(dirname $(realpath $0))"
-OUTPUT_DIR="$SCRIPT_DIR/results"
-KCONFIG_REL_PATHS=("${SCRIPT_DIR}/config" "${SCRIPT_DIR}/config.common" "${SCRIPT_DIR}/config.${ARCH}")
-B2C_URL="https://gitlab.freedesktop.org/gfx-ci/boot2container/-/raw/main/vm2c.py"
-NUM_COMPILE_JOBS="$(nproc)"
-LOG_FILE_BASE="$(date +"hid_selftests.%Y-%m-%d_%H-%M-%S")"
-LOG_FILE="${LOG_FILE_BASE}.log"
-EXIT_STATUS_FILE="${LOG_FILE_BASE}.exit_status"
-CONTAINER_IMAGE="registry.freedesktop.org/bentiss/hid/fedora/39:2023-11-22.1"
-
-TARGETS="${TARGETS:=$(basename ${SCRIPT_DIR})}"
-DEFAULT_COMMAND="pip3 install hid-tools; make -C tools/testing/selftests TARGETS=${TARGETS} run_tests"
-
-usage()
-{
- cat <<EOF
-Usage: $0 [-j N] [-s] [-b] [-d <output_dir>] -- [<command>]
-
-<command> is the command you would normally run when you are in
-the source kernel direcory. e.g:
-
- $0 -- ./tools/testing/selftests/hid/hid_bpf
-
-If no command is specified and a debug shell (-s) is not requested,
-"${DEFAULT_COMMAND}" will be run by default.
-
-If you build your kernel using KBUILD_OUTPUT= or O= options, these
-can be passed as environment variables to the script:
-
- O=<kernel_build_path> $0 -- ./tools/testing/selftests/hid/hid_bpf
-
-or
-
- KBUILD_OUTPUT=<kernel_build_path> $0 -- ./tools/testing/selftests/hid/hid_bpf
-
-Options:
-
- -u) Update the boot2container script to a newer version.
- -d) Update the output directory (default: ${OUTPUT_DIR})
- -b) Run only the build steps for the kernel and the selftests
- -j) Number of jobs for compilation, similar to -j in make
- (default: ${NUM_COMPILE_JOBS})
- -s) Instead of powering off the VM, start an interactive
- shell. If <command> is specified, the shell runs after
- the command finishes executing
-EOF
}
-download()
-{
- local file="$1"
+die() {
+ echo "$*" >&2
+ exit "${KSFT_FAIL}"
+}
- echo "Downloading $file..." >&2
- curl -Lsf "$file" -o "${@:2}"
+vm_ssh() {
+ # vng --ssh-client keeps shouting "Warning: Permanently added 'virtme-ng%22'
+ # (ED25519) to the list of known hosts.",
+ # So replace the command with what's actually called and add the "-q" option
+ stdbuf -oL ssh -q \
+ -F ${HOME}/.cache/virtme-ng/.ssh/virtme-ng-ssh.conf \
+ -l root virtme-ng%${SSH_GUEST_PORT} \
+ "$@"
+ return $?
}
-recompile_kernel()
-{
- local kernel_checkout="$1"
- local make_command="$2"
+cleanup() {
+ if [[ -s "${QEMU_PIDFILE}" ]]; then
+ pkill -SIGTERM -F "${QEMU_PIDFILE}" > /dev/null 2>&1
+ fi
- cd "${kernel_checkout}"
+ # If failure occurred during or before qemu start up, then we need
+ # to clean this up ourselves.
+ if [[ -e "${QEMU_PIDFILE}" ]]; then
+ rm "${QEMU_PIDFILE}"
+ fi
+}
+
+check_args() {
+ local found
- ${make_command} olddefconfig
- ${make_command} headers
- ${make_command}
+ for arg in "$@"; do
+ found=0
+ for name in "${TEST_NAMES[@]}"; do
+ if [[ "${name}" = "${arg}" ]]; then
+ found=1
+ break
+ fi
+ done
+
+ if [[ "${found}" -eq 0 ]]; then
+ echo "${arg} is not an available test" >&2
+ usage
+ fi
+ done
+
+ for arg in "$@"; do
+ if ! command -v > /dev/null "test_${arg}"; then
+ echo "Test ${arg} not found" >&2
+ usage
+ fi
+ done
+}
+
+check_deps() {
+ for dep in vng ${QEMU} busybox pkill ssh pytest; do
+ if [[ ! -x $(command -v "${dep}") ]]; then
+ echo -e "skip: dependency ${dep} not found!\n"
+ exit "${KSFT_SKIP}"
+ fi
+ done
+
+ if [[ ! -x $(command -v "${HID_BPF_TEST}") ]]; then
+ printf "skip: %s not found!" "${HID_BPF_TEST}"
+ printf " Please build the kselftest hid_bpf target.\n"
+ exit "${KSFT_SKIP}"
+ fi
+
+ if [[ ! -x $(command -v "${HIDRAW_TEST}") ]]; then
+ printf "skip: %s not found!" "${HIDRAW_TEST}"
+ printf " Please build the kselftest hidraw target.\n"
+ exit "${KSFT_SKIP}"
+ fi
}
-update_selftests()
-{
- local kernel_checkout="$1"
- local selftests_dir="${kernel_checkout}/tools/testing/selftests/hid"
+check_vng() {
+ local tested_versions
+ local version
+ local ok
- cd "${selftests_dir}"
- ${make_command}
+ tested_versions=("1.36" "1.37")
+ version="$(vng --version)"
+
+ ok=0
+ for tv in "${tested_versions[@]}"; do
+ if [[ "${version}" == *"${tv}"* ]]; then
+ ok=1
+ break
+ fi
+ done
+
+ if [[ ! "${ok}" -eq 1 ]]; then
+ printf "warning: vng version '%s' has not been tested and may " "${version}" >&2
+ printf "not function properly.\n\tThe following versions have been tested: " >&2
+ echo "${tested_versions[@]}" >&2
+ fi
}
-run_vm()
-{
- local run_dir="$1"
- local b2c="$2"
- local kernel_bzimage="$3"
- local command="$4"
- local post_command=""
-
- cd "${run_dir}"
-
- if ! which "${QEMU_BINARY}" &> /dev/null; then
- cat <<EOF
-Could not find ${QEMU_BINARY}
-Please install qemu or set the QEMU_BINARY environment variable.
-EOF
+handle_build() {
+ if [[ ! "${BUILD}" -eq 1 ]]; then
+ return
+ fi
+
+ if [[ ! -d "${KERNEL_CHECKOUT}" ]]; then
+ echo "-b requires vmtest.sh called from the kernel source tree" >&2
exit 1
fi
- # alpine (used in post-container requires the PATH to have /bin
- export PATH=$PATH:/bin
+ pushd "${KERNEL_CHECKOUT}" &>/dev/null
- if [[ "${debug_shell}" != "yes" ]]
- then
- touch ${OUTPUT_DIR}/${LOG_FILE}
- command="mount bpffs -t bpf /sys/fs/bpf/; set -o pipefail ; ${command} 2>&1 | tee ${OUTPUT_DIR}/${LOG_FILE}"
- post_command="cat ${OUTPUT_DIR}/${LOG_FILE}"
- else
- command="mount bpffs -t bpf /sys/fs/bpf/; ${command}"
+ if ! vng --kconfig --config "${SCRIPT_DIR}"/config; then
+ die "failed to generate .config for kernel source tree (${KERNEL_CHECKOUT})"
fi
- set +e
- $b2c --command "${command}" \
- --kernel ${kernel_bzimage} \
- --workdir ${OUTPUT_DIR} \
- --image ${CONTAINER_IMAGE}
+ local vng_args=("-v" "--config" "${SCRIPT_DIR}/config" "--build")
- echo $? > ${OUTPUT_DIR}/${EXIT_STATUS_FILE}
+ if [[ -n "${BUILD_HOST}" ]]; then
+ vng_args+=("--build-host" "${BUILD_HOST}")
+ fi
- set -e
+ if [[ -n "${BUILD_HOST_PODMAN_CONTAINER_NAME}" ]]; then
+ vng_args+=("--build-host-exec-prefix" \
+ "podman exec -ti ${BUILD_HOST_PODMAN_CONTAINER_NAME}")
+ fi
- ${post_command}
-}
+ if ! vng "${vng_args[@]}"; then
+ die "failed to build kernel from source tree (${KERNEL_CHECKOUT})"
+ fi
-is_rel_path()
-{
- local path="$1"
+ if ! make -j$(nproc) -C "${HID_BPF_PROGS}"; then
+ die "failed to build HID bpf objects from source tree (${HID_BPF_PROGS})"
+ fi
- [[ ${path:0:1} != "/" ]]
+ if ! make -j$(nproc) -C "${SCRIPT_DIR}"; then
+ die "failed to build HID selftests from source tree (${SCRIPT_DIR})"
+ fi
+
+ popd &>/dev/null
}
-do_update_kconfig()
-{
- local kernel_checkout="$1"
- local kconfig_file="$2"
+vm_start() {
+ local logfile=/dev/null
+ local verbose_opt=""
+ local kernel_opt=""
+ local qemu
- rm -f "$kconfig_file" 2> /dev/null
+ qemu=$(command -v "${QEMU}")
- for config in "${KCONFIG_REL_PATHS[@]}"; do
- local kconfig_src="${config}"
- cat "$kconfig_src" >> "$kconfig_file"
- done
-}
+ if [[ "${VERBOSE}" -eq 2 ]]; then
+ verbose_opt="--verbose"
+ logfile=/dev/stdout
+ fi
-update_kconfig()
-{
- local kernel_checkout="$1"
- local kconfig_file="$2"
-
- if [[ -f "${kconfig_file}" ]]; then
- local local_modified="$(stat -c %Y "${kconfig_file}")"
-
- for config in "${KCONFIG_REL_PATHS[@]}"; do
- local kconfig_src="${config}"
- local src_modified="$(stat -c %Y "${kconfig_src}")"
- # Only update the config if it has been updated after the
- # previously cached config was created. This avoids
- # unnecessarily compiling the kernel and selftests.
- if [[ "${src_modified}" -gt "${local_modified}" ]]; then
- do_update_kconfig "$kernel_checkout" "$kconfig_file"
- # Once we have found one outdated configuration
- # there is no need to check other ones.
- break
- fi
- done
- else
- do_update_kconfig "$kernel_checkout" "$kconfig_file"
+ # If we are running from within the kernel source tree, use the kernel source tree
+ # as the kernel to boot, otherwise use the currently running kernel.
+ if [[ "$(realpath "$(pwd)")" == "${KERNEL_CHECKOUT}"* ]]; then
+ kernel_opt="${KERNEL_CHECKOUT}"
fi
-}
-main()
-{
- local script_dir="$(cd -P -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)"
- local kernel_checkout=$(realpath "${script_dir}"/../../../../)
- # By default the script searches for the kernel in the checkout directory but
- # it also obeys environment variables O= and KBUILD_OUTPUT=
- local kernel_bzimage="${kernel_checkout}/${BZIMAGE}"
- local command="${DEFAULT_COMMAND}"
- local update_b2c="no"
- local debug_shell="no"
- local build_only="no"
-
- while getopts ':hsud:j:b' opt; do
- case ${opt} in
- u)
- update_b2c="yes"
- ;;
- d)
- OUTPUT_DIR="$OPTARG"
- ;;
- j)
- NUM_COMPILE_JOBS="$OPTARG"
- ;;
- s)
- command="/bin/sh"
- debug_shell="yes"
- ;;
- b)
- build_only="yes"
- ;;
- h)
- usage
- exit 0
- ;;
- \? )
- echo "Invalid Option: -$OPTARG"
- usage
- exit 1
- ;;
- : )
- echo "Invalid Option: -$OPTARG requires an argument"
- usage
- exit 1
- ;;
- esac
- done
- shift $((OPTIND -1))
-
- # trap 'catch "$?"' EXIT
- if [[ "${build_only}" == "no" && "${debug_shell}" == "no" ]]; then
- if [[ $# -eq 0 ]]; then
- echo "No command specified, will run ${DEFAULT_COMMAND} in the vm"
- else
- command="$@"
-
- if [[ "${command}" == "/bin/bash" || "${command}" == "bash" ]]
- then
- debug_shell="yes"
- fi
+ vng \
+ --run \
+ ${kernel_opt} \
+ ${verbose_opt} \
+ --qemu-opts="${QEMU_OPTS}" \
+ --qemu="${qemu}" \
+ --user root \
+ --append "${KERNEL_CMDLINE}" \
+ --ssh "${SSH_GUEST_PORT}" \
+ --rw &> ${logfile} &
+
+ local vng_pid=$!
+ local elapsed=0
+
+ while [[ ! -s "${QEMU_PIDFILE}" ]]; do
+ if ! kill -0 "${vng_pid}" 2>/dev/null; then
+ echo "vng process (PID ${vng_pid}) exited early, check logs for details" >&2
+ die "failed to boot VM"
fi
- fi
- local kconfig_file="${OUTPUT_DIR}/latest.config"
- local make_command="make -j ${NUM_COMPILE_JOBS} KCONFIG_CONFIG=${kconfig_file}"
+ if [[ ${elapsed} -ge ${WAIT_TOTAL} ]]; then
+ echo "Timed out after ${WAIT_TOTAL} seconds waiting for VM to boot" >&2
+ die "failed to boot VM"
+ fi
- # Figure out where the kernel is being built.
- # O takes precedence over KBUILD_OUTPUT.
- if [[ "${O:=""}" != "" ]]; then
- if is_rel_path "${O}"; then
- O="$(realpath "${PWD}/${O}")"
+ sleep 1
+ elapsed=$((elapsed + 1))
+ done
+}
+
+vm_wait_for_ssh() {
+ local i
+
+ i=0
+ while true; do
+ if [[ ${i} -gt ${WAIT_PERIOD_MAX} ]]; then
+ die "Timed out waiting for guest ssh"
fi
- kernel_bzimage="${O}/${BZIMAGE}"
- make_command="${make_command} O=${O}"
- elif [[ "${KBUILD_OUTPUT:=""}" != "" ]]; then
- if is_rel_path "${KBUILD_OUTPUT}"; then
- KBUILD_OUTPUT="$(realpath "${PWD}/${KBUILD_OUTPUT}")"
+ if vm_ssh -- true; then
+ break
fi
- kernel_bzimage="${KBUILD_OUTPUT}/${BZIMAGE}"
- make_command="${make_command} KBUILD_OUTPUT=${KBUILD_OUTPUT}"
+ i=$(( i + 1 ))
+ sleep ${WAIT_PERIOD}
+ done
+}
+
+vm_mount_bpffs() {
+ vm_ssh -- mount bpffs -t bpf /sys/fs/bpf
+}
+
+__log_stdin() {
+ stdbuf -oL awk '{ printf "%s:\t%s\n","'"${prefix}"'", $0; fflush() }'
+}
+
+__log_args() {
+ echo "$*" | awk '{ printf "%s:\t%s\n","'"${prefix}"'", $0 }'
+}
+
+log() {
+ local verbose="$1"
+ shift
+
+ local prefix="$1"
+
+ shift
+ local redirect=
+ if [[ ${verbose} -le 0 ]]; then
+ redirect=/dev/null
+ else
+ redirect=/dev/stdout
+ fi
+
+ if [[ "$#" -eq 0 ]]; then
+ __log_stdin | tee -a "${LOG}" > ${redirect}
+ else
+ __log_args "$@" | tee -a "${LOG}" > ${redirect}
fi
+}
- local b2c="${OUTPUT_DIR}/vm2c.py"
+log_setup() {
+ log $((VERBOSE-1)) "setup" "$@"
+}
- echo "Output directory: ${OUTPUT_DIR}"
+log_host() {
+ local testname=$1
- mkdir -p "${OUTPUT_DIR}"
- update_kconfig "${kernel_checkout}" "${kconfig_file}"
+ shift
+ log $((VERBOSE-1)) "test:${testname}:host" "$@"
+}
- recompile_kernel "${kernel_checkout}" "${make_command}"
- update_selftests "${kernel_checkout}" "${make_command}"
+log_guest() {
+ local testname=$1
- if [[ "${build_only}" == "no" ]]; then
- if [[ "${update_b2c}" == "no" && ! -f "${b2c}" ]]; then
- echo "vm2c script not found in ${b2c}"
- update_b2c="yes"
- fi
+ shift
+ log ${VERBOSE} "# test:${testname}" "$@"
+}
- if [[ "${update_b2c}" == "yes" ]]; then
- download $B2C_URL $b2c
- chmod +x $b2c
- fi
+test_vm_hid_bpf() {
+ local testname="${FUNCNAME[0]#test_}"
- run_vm "${kernel_checkout}" $b2c "${kernel_bzimage}" "${command}"
- if [[ "${debug_shell}" != "yes" ]]; then
- echo "Logs saved in ${OUTPUT_DIR}/${LOG_FILE}"
- fi
+ vm_ssh -- "${HID_BPF_TEST}" \
+ 2>&1 | log_guest "${testname}"
+
+ return ${PIPESTATUS[0]}
+}
+
+test_vm_hidraw() {
+ local testname="${FUNCNAME[0]#test_}"
+
+ vm_ssh -- "${HIDRAW_TEST}" \
+ 2>&1 | log_guest "${testname}"
+
+ return ${PIPESTATUS[0]}
+}
+
+test_vm_pytest() {
+ local testname="${FUNCNAME[0]#test_}"
- exit $(cat ${OUTPUT_DIR}/${EXIT_STATUS_FILE})
+ shift
+
+ vm_ssh -- pytest ${SCRIPT_DIR}/tests --color=yes "$@" \
+ 2>&1 | log_guest "${testname}"
+
+ return ${PIPESTATUS[0]}
+}
+
+run_test() {
+ local vm_oops_cnt_before
+ local vm_error_cnt_before
+ local vm_oops_cnt_after
+ local vm_error_cnt_after
+ local name
+ local rc
+
+ vm_oops_cnt_before=$(vm_ssh -- dmesg | grep -c -i 'Oops')
+ vm_error_cnt_before=$(vm_ssh -- dmesg --level=err | wc -l)
+
+ name=$(echo "${1}" | awk '{ print $1 }')
+ eval test_"${name}" "$@"
+ rc=$?
+
+ vm_oops_cnt_after=$(vm_ssh -- dmesg | grep -i 'Oops' | wc -l)
+ if [[ ${vm_oops_cnt_after} -gt ${vm_oops_cnt_before} ]]; then
+ echo "FAIL: kernel oops detected on vm" | log_host "${name}"
+ rc=$KSFT_FAIL
+ fi
+
+ vm_error_cnt_after=$(vm_ssh -- dmesg --level=err | wc -l)
+ if [[ ${vm_error_cnt_after} -gt ${vm_error_cnt_before} ]]; then
+ echo "FAIL: kernel error detected on vm" | log_host "${name}"
+ vm_ssh -- dmesg --level=err | log_host "${name}"
+ rc=$KSFT_FAIL
fi
+
+ return "${rc}"
}
-main "$@"
+QEMU="qemu-system-$(uname -m)"
+
+while getopts :hvsbq:H:p: o
+do
+ case $o in
+ v) VERBOSE=$((VERBOSE+1));;
+ s) SHELL_MODE=1;;
+ b) BUILD=1;;
+ q) QEMU=$OPTARG;;
+ H) BUILD_HOST=$OPTARG;;
+ p) BUILD_HOST_PODMAN_CONTAINER_NAME=$OPTARG;;
+ h|*) usage;;
+ esac
+done
+shift $((OPTIND-1))
+
+trap cleanup EXIT
+
+PARAMS=""
+
+if [[ ${#} -eq 0 ]]; then
+ ARGS=("${TEST_NAMES[@]}")
+else
+ ARGS=()
+ COUNT=0
+ for arg in $@; do
+ COUNT=$((COUNT+1))
+ if [[ x"$arg" == x"--" ]]; then
+ break
+ fi
+ ARGS+=($arg)
+ done
+ shift $COUNT
+ PARAMS="$@"
+fi
+
+if [[ "${SHELL_MODE}" -eq 0 ]]; then
+ check_args "${ARGS[@]}"
+ echo "1..${#ARGS[@]}"
+fi
+check_deps
+check_vng
+handle_build
+
+log_setup "Booting up VM"
+vm_start
+vm_wait_for_ssh
+vm_mount_bpffs
+log_setup "VM booted up"
+
+if [[ "${SHELL_MODE}" -eq 1 ]]; then
+ log_setup "Starting interactive shell in VM"
+ echo "Starting shell in VM. Use 'exit' to quit and shutdown the VM."
+ CURRENT_DIR="$(pwd)"
+ vm_ssh -t -- "cd '${CURRENT_DIR}' && exec bash -l"
+ exit "$KSFT_PASS"
+fi
+
+cnt_pass=0
+cnt_fail=0
+cnt_skip=0
+cnt_total=0
+for arg in "${ARGS[@]}"; do
+ run_test "${arg}" "${PARAMS}"
+ rc=$?
+ if [[ ${rc} -eq $KSFT_PASS ]]; then
+ cnt_pass=$(( cnt_pass + 1 ))
+ echo "ok ${cnt_total} ${arg}"
+ elif [[ ${rc} -eq $KSFT_SKIP ]]; then
+ cnt_skip=$(( cnt_skip + 1 ))
+ echo "ok ${cnt_total} ${arg} # SKIP"
+ elif [[ ${rc} -eq $KSFT_FAIL ]]; then
+ cnt_fail=$(( cnt_fail + 1 ))
+ echo "not ok ${cnt_total} ${arg} # exit=$rc"
+ fi
+ cnt_total=$(( cnt_total + 1 ))
+done
+
+echo "SUMMARY: PASS=${cnt_pass} SKIP=${cnt_skip} FAIL=${cnt_fail}"
+echo "Log: ${LOG}"
+
+if [ $((cnt_pass + cnt_skip)) -eq ${cnt_total} ]; then
+ exit "$KSFT_PASS"
+else
+ exit "$KSFT_FAIL"
+fi
---
base-commit: b80a75cf6999fb79971b41eaec7af2bb4b514714
change-id: 20250818-virtme-ng-f73db7e61235
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
From: Lance Yang <lance.yang(a)linux.dev>
The madv_populate and soft-dirty kselftests currently fail on systems where
CONFIG_MEM_SOFT_DIRTY is disabled.
Introduce a new helper softdirty_is_supported() into vm_util.c/h to ensure
tests are properly skipped when the feature is not enabled.
Signed-off-by: Lance Yang <lance.yang(a)linux.dev>
---
tools/testing/selftests/mm/madv_populate.c | 21 ++--------------
tools/testing/selftests/mm/soft-dirty.c | 5 +++-
tools/testing/selftests/mm/vm_util.c | 28 ++++++++++++++++++++++
tools/testing/selftests/mm/vm_util.h | 1 +
4 files changed, 35 insertions(+), 20 deletions(-)
diff --git a/tools/testing/selftests/mm/madv_populate.c b/tools/testing/selftests/mm/madv_populate.c
index b6fabd5c27ed..43dac7783004 100644
--- a/tools/testing/selftests/mm/madv_populate.c
+++ b/tools/testing/selftests/mm/madv_populate.c
@@ -264,23 +264,6 @@ static void test_softdirty(void)
munmap(addr, SIZE);
}
-static int system_has_softdirty(void)
-{
- /*
- * There is no way to check if the kernel supports soft-dirty, other
- * than by writing to a page and seeing if the bit was set. But the
- * tests are intended to check that the bit gets set when it should, so
- * doing that check would turn a potentially legitimate fail into a
- * skip. Fortunately, we know for sure that arm64 does not support
- * soft-dirty. So for now, let's just use the arch as a corse guide.
- */
-#if defined(__aarch64__)
- return 0;
-#else
- return 1;
-#endif
-}
-
int main(int argc, char **argv)
{
int nr_tests = 16;
@@ -288,7 +271,7 @@ int main(int argc, char **argv)
pagesize = getpagesize();
- if (system_has_softdirty())
+ if (softdirty_is_supported())
nr_tests += 5;
ksft_print_header();
@@ -300,7 +283,7 @@ int main(int argc, char **argv)
test_holes();
test_populate_read();
test_populate_write();
- if (system_has_softdirty())
+ if (softdirty_is_supported())
test_softdirty();
err = ksft_get_fail_cnt();
diff --git a/tools/testing/selftests/mm/soft-dirty.c b/tools/testing/selftests/mm/soft-dirty.c
index 8a3f2b4b2186..98e42d2ac32a 100644
--- a/tools/testing/selftests/mm/soft-dirty.c
+++ b/tools/testing/selftests/mm/soft-dirty.c
@@ -200,8 +200,11 @@ int main(int argc, char **argv)
int pagesize;
ksft_print_header();
- ksft_set_plan(15);
+ if (!softdirty_is_supported())
+ ksft_exit_skip("soft-dirty is not supported\n");
+
+ ksft_set_plan(15);
pagemap_fd = open(PAGEMAP_FILE_PATH, O_RDONLY);
if (pagemap_fd < 0)
ksft_exit_fail_msg("Failed to open %s\n", PAGEMAP_FILE_PATH);
diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index 56e9bd541edd..3173335df775 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -449,6 +449,34 @@ bool check_vmflag_pfnmap(void *addr)
return check_vmflag(addr, "pf");
}
+bool softdirty_is_supported(void)
+{
+ char *addr;
+ int ret = 0;
+ size_t pagesize;
+
+ /* We know for sure that arm64 does not support soft-dirty. */
+#if defined(__aarch64__)
+ return ret;
+#endif
+ pagesize = getpagesize();
+ /*
+ * __mmap_complete() always sets VM_SOFTDIRTY for new VMAs, so we
+ * just mmap a small region and check its VmFlags in /proc/self/smaps
+ * for the "sd" flag.
+ */
+ addr = mmap(0, pagesize, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ if (addr == MAP_FAILED)
+ ksft_exit_fail_msg("mmap failed\n");
+
+ if (check_vmflag(addr, "sd"))
+ ret = 1;
+
+ munmap(addr, pagesize);
+ return ret;
+}
+
/*
* Open an fd at /proc/$pid/maps and configure procmap_out ready for
* PROCMAP_QUERY query. Returns 0 on success, or an error code otherwise.
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 07c4acfd84b6..87ad8e0d92c0 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -104,6 +104,7 @@ bool find_vma_procmap(struct procmap_fd *procmap, void *address);
int close_procmap(struct procmap_fd *procmap);
int write_sysfs(const char *file_path, unsigned long val);
int read_sysfs(const char *file_path, unsigned long *val);
+bool softdirty_is_supported(void);
static inline int open_self_procmap(struct procmap_fd *procmap_out)
{
--
2.49.0
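The softdirty_is_supported() helper above relies on check_vmflag(), whose body is not part of this patch. As a rough illustration only, the "sd" lookup in /proc/self/smaps could be sketched as follows; vmflags_line_has() and vma_has_flag() are hypothetical names, not the selftests' actual helpers.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if a "VmFlags:" line from /proc/<pid>/smaps lists the
 * given flag as a whole, whitespace-separated token.
 * (Hypothetical helper, for illustration only.)
 */
static bool vmflags_line_has(const char *line, const char *flag)
{
	const char *p;
	size_t flen = strlen(flag);

	if (strncmp(line, "VmFlags:", 8) != 0)
		return false;

	p = line + 8;
	while (*p) {
		size_t n;

		p += strspn(p, " \t\n");	/* skip separators */
		n = strcspn(p, " \t\n");	/* length of next token */
		if (n > 0 && n == flen && strncmp(p, flag, n) == 0)
			return true;
		p += n;
	}
	return false;
}

/*
 * Find the mapping that contains addr in /proc/self/smaps and check
 * whether its VmFlags line carries the given flag. Returns false if
 * the file cannot be read or the mapping is not found.
 */
static bool vma_has_flag(const void *addr, const char *flag)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[512], perms[8];
	unsigned long start, end;
	bool in_vma = false, found = false;

	if (!f)
		return false;

	while (fgets(line, sizeof(line), f)) {
		/* Mapping header lines look like "start-end perms ...". */
		if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) == 3) {
			in_vma = (unsigned long)addr >= start &&
				 (unsigned long)addr < end;
		} else if (in_vma && vmflags_line_has(line, flag)) {
			found = true;
			break;
		}
	}

	fclose(f);
	return found;
}
```

A caller would mmap() a page and pass its address to vma_has_flag(addr, "sd"). Matching whole tokens rather than substrings avoids false positives from longer flag names.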
Some ublk selftests have strange behavior when fio is not installed.
While most tests behave correctly (run if they don't need fio, or skip
if they need fio), the following tests have different behavior:
- test_null_01, test_null_02, test_generic_01, test_generic_02, and
test_generic_12 try to run fio without checking if it exists first,
and fail on any failure of the fio command (including "fio command
not found"). So these tests fail when they should skip.
- test_stress_05 runs fio without checking if it exists first, but
doesn't fail on fio command failure. This test passes, but that pass
is misleading as the test doesn't do anything useful without fio
installed. So this test passes when it should skip.
Fix these issues by adding _have_program fio checks to the top of all of
these tests.
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
Changes in v2:
- Also fix test_generic_01, test_generic_02, test_generic_12, which fail
on systems where bpftrace is installed but fio is not (Mohit Gupta)
- Link to v1: https://lore.kernel.org/r/20250916-ublk_fio-v1-1-8d522539eed7@purestorage.c…
---
tools/testing/selftests/ublk/test_generic_01.sh | 4 ++++
tools/testing/selftests/ublk/test_generic_02.sh | 4 ++++
tools/testing/selftests/ublk/test_generic_12.sh | 4 ++++
tools/testing/selftests/ublk/test_null_01.sh | 4 ++++
tools/testing/selftests/ublk/test_null_02.sh | 4 ++++
tools/testing/selftests/ublk/test_stress_05.sh | 4 ++++
6 files changed, 24 insertions(+)
diff --git a/tools/testing/selftests/ublk/test_generic_01.sh b/tools/testing/selftests/ublk/test_generic_01.sh
index 9227a208ba53128e4a202298316ff77e05607595..21a31cd5491aa79ffe3ad458a0055e832c619325 100755
--- a/tools/testing/selftests/ublk/test_generic_01.sh
+++ b/tools/testing/selftests/ublk/test_generic_01.sh
@@ -10,6 +10,10 @@ if ! _have_program bpftrace; then
exit "$UBLK_SKIP_CODE"
fi
+if ! _have_program fio; then
+ exit "$UBLK_SKIP_CODE"
+fi
+
_prep_test "null" "sequential io order"
dev_id=$(_add_ublk_dev -t null)
diff --git a/tools/testing/selftests/ublk/test_generic_02.sh b/tools/testing/selftests/ublk/test_generic_02.sh
index 3e80121e3bf5e191aa9ffe1f85e1693be4fdc2d2..12920768b1a080d37fcdff93de7a0439101de09e 100755
--- a/tools/testing/selftests/ublk/test_generic_02.sh
+++ b/tools/testing/selftests/ublk/test_generic_02.sh
@@ -10,6 +10,10 @@ if ! _have_program bpftrace; then
exit "$UBLK_SKIP_CODE"
fi
+if ! _have_program fio; then
+ exit "$UBLK_SKIP_CODE"
+fi
+
_prep_test "null" "sequential io order for MQ"
dev_id=$(_add_ublk_dev -t null -q 2)
diff --git a/tools/testing/selftests/ublk/test_generic_12.sh b/tools/testing/selftests/ublk/test_generic_12.sh
index 7abbb00d251df9403857b1c6f53aec8bf8eab176..b4046201b4d99ef5355b845ebea2c9a3924276a5 100755
--- a/tools/testing/selftests/ublk/test_generic_12.sh
+++ b/tools/testing/selftests/ublk/test_generic_12.sh
@@ -10,6 +10,10 @@ if ! _have_program bpftrace; then
exit "$UBLK_SKIP_CODE"
fi
+if ! _have_program fio; then
+ exit "$UBLK_SKIP_CODE"
+fi
+
_prep_test "null" "do imbalanced load, it should be balanced over I/O threads"
NTHREADS=6
diff --git a/tools/testing/selftests/ublk/test_null_01.sh b/tools/testing/selftests/ublk/test_null_01.sh
index a34203f726685787da80b0e32da95e0fcb90d0b1..c2cb8f7a09fe37a9956d067fd56b28dc7ca6bd68 100755
--- a/tools/testing/selftests/ublk/test_null_01.sh
+++ b/tools/testing/selftests/ublk/test_null_01.sh
@@ -6,6 +6,10 @@
TID="null_01"
ERR_CODE=0
+if ! _have_program fio; then
+ exit "$UBLK_SKIP_CODE"
+fi
+
_prep_test "null" "basic IO test"
dev_id=$(_add_ublk_dev -t null)
diff --git a/tools/testing/selftests/ublk/test_null_02.sh b/tools/testing/selftests/ublk/test_null_02.sh
index 5633ca8766554b22be252c7cb2d13de1bf923b90..8accd35beb55c149f74b23f0fb562e12cbf3e362 100755
--- a/tools/testing/selftests/ublk/test_null_02.sh
+++ b/tools/testing/selftests/ublk/test_null_02.sh
@@ -6,6 +6,10 @@
TID="null_02"
ERR_CODE=0
+if ! _have_program fio; then
+ exit "$UBLK_SKIP_CODE"
+fi
+
_prep_test "null" "basic IO test with zero copy"
dev_id=$(_add_ublk_dev -t null -z)
diff --git a/tools/testing/selftests/ublk/test_stress_05.sh b/tools/testing/selftests/ublk/test_stress_05.sh
index 566cfd90d192ce8c1f98ca2539792d54a787b3d1..274295061042e5db3f4f0846ae63ea9b787fb2ee 100755
--- a/tools/testing/selftests/ublk/test_stress_05.sh
+++ b/tools/testing/selftests/ublk/test_stress_05.sh
@@ -5,6 +5,10 @@
TID="stress_05"
ERR_CODE=0
+if ! _have_program fio; then
+ exit "$UBLK_SKIP_CODE"
+fi
+
run_io_and_remove()
{
local size=$1
---
base-commit: da7b97ba0d219a14a83e9cc93f98b53939f12944
change-id: 20250916-ublk_fio-1910998b00b3
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
The declarations of these kfuncs were removed in commit 2f9838e25790
("selftests/bpf: Cleanup bpf qdisc selftests"), but they are still
referenced by multiple tests. Add them back; without them, the build
fails with the following errors.
```
progs/bpf_qdisc_fail__incompl_ops.c:13:2: error: call to undeclared function 'bpf_qdisc_skb_drop'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
13 | bpf_qdisc_skb_drop(skb, to_free);
| ^
1 error generated.
progs/bpf_qdisc_fifo.c:38:3: error: call to undeclared function 'bpf_qdisc_skb_drop'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
38 | bpf_qdisc_skb_drop(skb, to_free);
| ^
progs/bpf_qdisc_fq.c:280:11: error: call to undeclared function 'bpf_skb_get_hash'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
280 | hash = bpf_skb_get_hash(skb) & q.orphan_mask;
| ^
progs/bpf_qdisc_fq.c:287:11: error: call to undeclared function 'bpf_skb_get_hash'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
287 | hash = bpf_skb_get_hash(skb) & q.orphan_mask;
| ^
progs/bpf_qdisc_fq.c:375:3: error: call to undeclared function 'bpf_qdisc_skb_drop'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
375 | bpf_qdisc_skb_drop(skb, to_free);
| ^
progs/bpf_qdisc_fifo.c:71:2: error: call to undeclared function 'bpf_qdisc_bstats_update'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
71 | bpf_qdisc_bstats_update(sch, skb);
| ^
progs/bpf_qdisc_fifo.c:106:4: error: call to undeclared function 'bpf_kfree_skb'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
106 | bpf_kfree_skb(skb);
| ^
3 errors generated.
progs/bpf_qdisc_fq.c:614:3: error: call to undeclared function 'bpf_qdisc_bstats_update'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
614 | bpf_qdisc_bstats_update(sch, skb);
| ^
progs/bpf_qdisc_fq.c:619:3: error: call to undeclared function 'bpf_qdisc_watchdog_schedule'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
619 | bpf_qdisc_watchdog_schedule(sch, cb_ctx.expire, q.timer_slack);
| ^
5 errors generated.
```
Fixes: 2f9838e25790 ("selftests/bpf: Cleanup bpf qdisc selftests")
Signed-off-by: Xing Guo <higuoxing(a)gmail.com>
---
tools/testing/selftests/bpf/progs/bpf_qdisc_common.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
index 3754f581b328..7e7f2fe04f22 100644
--- a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
@@ -14,6 +14,12 @@
struct bpf_sk_buff_ptr;
+u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
+void bpf_kfree_skb(struct sk_buff *p) __ksym;
+void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
+void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns) __ksym;
+void bpf_qdisc_bstats_update(struct Qdisc *sch, const struct sk_buff *skb) __ksym;
+
static struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
{
return (struct qdisc_skb_cb *)skb->cb;
--
2.51.0
Some ublk selftests have strange behavior when fio is not installed.
While most tests behave correctly (run if they don't need fio, or skip
if they need fio), the following tests have different behavior:
- test_null_01 and test_null_02 try to run fio without checking if it
exists first, and fail on any failure of the fio command (including
"fio command not found"). So these tests fail when they should skip.
- test_stress_05 runs fio without checking if it exists first, but
doesn't fail on fio command failure. This test passes, but that pass
is misleading as the test doesn't do anything useful without fio
installed. So this test passes when it should skip.
Fix these issues by adding _have_program fio checks to the top of all
three of these tests.
Signed-off-by: Uday Shankar <ushankar(a)purestorage.com>
---
tools/testing/selftests/ublk/test_null_01.sh | 4 ++++
tools/testing/selftests/ublk/test_null_02.sh | 4 ++++
tools/testing/selftests/ublk/test_stress_05.sh | 4 ++++
3 files changed, 12 insertions(+)
diff --git a/tools/testing/selftests/ublk/test_null_01.sh b/tools/testing/selftests/ublk/test_null_01.sh
index a34203f726685787da80b0e32da95e0fcb90d0b1..c2cb8f7a09fe37a9956d067fd56b28dc7ca6bd68 100755
--- a/tools/testing/selftests/ublk/test_null_01.sh
+++ b/tools/testing/selftests/ublk/test_null_01.sh
@@ -6,6 +6,10 @@
TID="null_01"
ERR_CODE=0
+if ! _have_program fio; then
+ exit "$UBLK_SKIP_CODE"
+fi
+
_prep_test "null" "basic IO test"
dev_id=$(_add_ublk_dev -t null)
diff --git a/tools/testing/selftests/ublk/test_null_02.sh b/tools/testing/selftests/ublk/test_null_02.sh
index 5633ca8766554b22be252c7cb2d13de1bf923b90..8accd35beb55c149f74b23f0fb562e12cbf3e362 100755
--- a/tools/testing/selftests/ublk/test_null_02.sh
+++ b/tools/testing/selftests/ublk/test_null_02.sh
@@ -6,6 +6,10 @@
TID="null_02"
ERR_CODE=0
+if ! _have_program fio; then
+ exit "$UBLK_SKIP_CODE"
+fi
+
_prep_test "null" "basic IO test with zero copy"
dev_id=$(_add_ublk_dev -t null -z)
diff --git a/tools/testing/selftests/ublk/test_stress_05.sh b/tools/testing/selftests/ublk/test_stress_05.sh
index 566cfd90d192ce8c1f98ca2539792d54a787b3d1..274295061042e5db3f4f0846ae63ea9b787fb2ee 100755
--- a/tools/testing/selftests/ublk/test_stress_05.sh
+++ b/tools/testing/selftests/ublk/test_stress_05.sh
@@ -5,6 +5,10 @@
TID="stress_05"
ERR_CODE=0
+if ! _have_program fio; then
+ exit "$UBLK_SKIP_CODE"
+fi
+
run_io_and_remove()
{
local size=$1
---
base-commit: da7b97ba0d219a14a83e9cc93f98b53939f12944
change-id: 20250916-ublk_fio-1910998b00b3
Best regards,
--
Uday Shankar <ushankar(a)purestorage.com>
Check whether the watchdog device supports the WDIOF_KEEPALIVEPING option
before entering the keep_alive() ping test loop. This fixes watchdog-test
silently looping when ioctl-based ping is not supported by the device: in
that case, exit the test instead of getting stuck in a loop that keeps
executing a failing keep_alive().
Fixes: d89d08ffd2c5 ("selftests: watchdog: Fix ioctl SET* error paths to take oneshot exit path")
Signed-off-by: Akhilesh Patil <akhilesh(a)ee.iitb.ac.in>
---
Testing:
# wdt_test_1 -f /dev/watchdog0 -i
watchdog_info:
identity: m41t93 rtc Watchdog
firmware_version: 0
Support/Status: Set timeout (in seconds)
Support/Status: Watchdog triggers a management or other external alarm not a reboot
# wdt_test_1 -f /dev/watchdog0 -d -t 5 -p 2 -e
Watchdog card disabled.
Watchdog timeout set to 5 seconds.
Watchdog ping rate set to 2 seconds.
Watchdog card enabled.
WDIOC_KEEPALIVE not supported by this device
Without this change:
# wdt_test_2 -f /dev/watchdog0 -d -t 5 -p 2 -e
Watchdog card disabled.
Watchdog timeout set to 5 seconds.
Watchdog ping rate set to 2 seconds.
Watchdog card enabled.
Watchdog Ticking Away!
^C
(The test is stuck here forever, silently.)
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index a1f506ba5578..4f09c5db0c7f 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -332,6 +332,12 @@ int main(int argc, char *argv[])
if (oneshot)
goto end;
+ /* Check if WDIOF_KEEPALIVEPING is supported */
+ if (!(info.options & WDIOF_KEEPALIVEPING)) {
+ printf("WDIOC_KEEPALIVE not supported by this device\n");
+ goto end;
+ }
+
printf("Watchdog Ticking Away!\n");
/*
--
2.34.1
FEAT_LSFE is optional from v9.5, it adds new instructions for atomic
memory operations with floating point values. We have no immediate use
for it in kernel, provide a hwcap so userspace can discover it and allow
the ID register field to be exposed to KVM guests.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v3:
- Rebase onto v6.17-rc1.
- Link to v2: https://lore.kernel.org/r/20250703-arm64-lsfe-v2-0-eced80999cb4@kernel.org
Changes in v2:
- Fix result of vi dropping in hwcap test.
- Link to v1: https://lore.kernel.org/r/20250627-arm64-lsfe-v1-0-68351c4bf741@kernel.org
---
Mark Brown (3):
arm64/hwcap: Add hwcap for FEAT_LSFE
KVM: arm64: Expose FEAT_LSFE to guests
kselftest/arm64: Add lsfe to the hwcaps test
Documentation/arch/arm64/elf_hwcaps.rst | 4 ++++
arch/arm64/include/asm/hwcap.h | 1 +
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/kernel/cpufeature.c | 2 ++
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kvm/sys_regs.c | 4 +++-
tools/testing/selftests/arm64/abi/hwcap.c | 21 +++++++++++++++++++++
7 files changed, 33 insertions(+), 1 deletion(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250625-arm64-lsfe-0810cf98adc2
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This is series 2b/5 of the migration to `core::ffi::CStr`[0].
20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com.
This series depends on the prior series[0] and is intended to go through
the rust tree to reduce the number of release cycles required to
complete the work.
Subsystem maintainers: I would appreciate your `Acked-by`s so that this
can be taken through Miguel's tree (where the other series must go).
[0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm…
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v3:
- Add a patch to deal with new code in acpi.
- Drop incorrectly applied Acked-by tags from Danilo.
- Link to v2: https://lore.kernel.org/r/20250719-core-cstr-fanout-1-v2-0-e1cb53f6d233@gma…
Changes in v2:
- Update patch title (was nova-core, now drm/panic).
- Link to v1: https://lore.kernel.org/r/20250709-core-cstr-fanout-1-v1-0-fd793b3e58a2@gma…
---
Tamir Duberstein (11):
drm/panic: use `core::ffi::CStr` method names
rust: auxiliary: use `core::ffi::CStr` method names
rust: configfs: use `core::ffi::CStr` method names
rust: cpufreq: use `core::ffi::CStr` method names
rust: drm: use `core::ffi::CStr` method names
rust: firmware: use `core::ffi::CStr` method names
rust: kunit: use `core::ffi::CStr` method names
rust: miscdevice: use `core::ffi::CStr` method names
rust: net: use `core::ffi::CStr` method names
rust: of: use `core::ffi::CStr` method names
rust: acpi: use `core::ffi::CStr` method names
drivers/gpu/drm/drm_panic_qr.rs | 2 +-
rust/kernel/acpi.rs | 7 ++-----
rust/kernel/auxiliary.rs | 4 ++--
rust/kernel/configfs.rs | 4 ++--
rust/kernel/cpufreq.rs | 2 +-
rust/kernel/drm/device.rs | 4 ++--
rust/kernel/firmware.rs | 2 +-
rust/kernel/kunit.rs | 6 +++---
rust/kernel/miscdevice.rs | 2 +-
rust/kernel/net/phy.rs | 2 +-
rust/kernel/of.rs | 2 +-
samples/rust/rust_configfs.rs | 2 +-
12 files changed, 18 insertions(+), 21 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250709-core-cstr-fanout-1-f20611832272
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
This is series 2a/5 of the migration to `core::ffi::CStr`[0].
20250704-core-cstr-prepare-v1-0-a91524037783(a)gmail.com.
This series depends on the prior series[0] and is intended to go through
the rust tree to reduce the number of release cycles required to
complete the work.
Subsystem maintainers: I would appreciate your `Acked-by`s so that this
can be taken through Miguel's tree (where the other series must go).
[0] https://lore.kernel.org/all/20250704-core-cstr-prepare-v1-0-a91524037783@gm…
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v3:
- Add a patch to address new code in device.rs.
- Drop incorrectly applied Acked-by tags from Danilo.
- Link to v2: https://lore.kernel.org/r/20250719-core-cstr-fanout-1-v2-0-1ab5ba189c6e@gma…
Changes in v2:
- Rebase on rust-next.
- Drop pin-init patch, which is no longer needed.
- Link to v1: https://lore.kernel.org/r/20250709-core-cstr-fanout-1-v1-0-64308e7203fc@gma…
---
Tamir Duberstein (9):
gpu: nova-core: use `kernel::{fmt,prelude::fmt!}`
rust: alloc: use `kernel::{fmt,prelude::fmt!}`
rust: block: use `kernel::{fmt,prelude::fmt!}`
rust: device: use `kernel::{fmt,prelude::fmt!}`
rust: file: use `kernel::{fmt,prelude::fmt!}`
rust: kunit: use `kernel::{fmt,prelude::fmt!}`
rust: seq_file: use `kernel::{fmt,prelude::fmt!}`
rust: sync: use `kernel::{fmt,prelude::fmt!}`
rust: device: use `kernel::{fmt,prelude::fmt!}`
drivers/block/rnull.rs | 2 +-
drivers/gpu/nova-core/gpu.rs | 3 +--
drivers/gpu/nova-core/regs/macros.rs | 6 +++---
rust/kernel/alloc/kbox.rs | 2 +-
rust/kernel/alloc/kvec.rs | 2 +-
rust/kernel/alloc/kvec/errors.rs | 2 +-
rust/kernel/block/mq.rs | 2 +-
rust/kernel/block/mq/gen_disk.rs | 2 +-
rust/kernel/block/mq/raw_writer.rs | 3 +--
rust/kernel/device.rs | 6 +++---
rust/kernel/device/property.rs | 23 ++++++++++++-----------
rust/kernel/fs/file.rs | 5 +++--
rust/kernel/kunit.rs | 8 ++++----
rust/kernel/seq_file.rs | 6 +++---
rust/kernel/sync/arc.rs | 2 +-
scripts/rustdoc_test_gen.rs | 2 +-
16 files changed, 38 insertions(+), 38 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250709-core-cstr-fanout-1-f20611832272
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
Unlike IPv4, IPv6 routing strictly requires the source address to be valid
on the outgoing interface. If the NS target is set to a remote VLAN interface,
and the source address is also configured on a VLAN over a bond interface,
setting the oif to the bond device will fail to retrieve the correct
destination route.
Fix this by not setting the oif to the bond device when retrieving the NS
target destination. This allows the correct destination device (the VLAN
interface) to be determined, so that bond_verify_device_path can return the
proper VLAN tags for sending NS messages.
Reported-by: David Wilder <wilder(a)us.ibm.com>
Closes: https://lore.kernel.org/netdev/aGOKggdfjv0cApTO@fedora/
Suggested-by: Jay Vosburgh <jv(a)jvosburgh.net>
Tested-by: David Wilder <wilder(a)us.ibm.com>
Acked-by: Jay Vosburgh <jv(a)jvosburgh.net>
Fixes: 4e24be018eb9 ("bonding: add new parameter ns_targets")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
v3: no update
v2: split the patch into 2 parts, the kernel change and test update (Jay Vosburgh)
---
drivers/net/bonding/bond_main.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 257333c88710..30cf97f4e814 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3355,7 +3355,6 @@ static void bond_ns_send_all(struct bonding *bond, struct slave *slave)
/* Find out through which dev should the packet go */
memset(&fl6, 0, sizeof(struct flowi6));
fl6.daddr = targets[i];
- fl6.flowi6_oif = bond->dev->ifindex;
dst = ip6_route_output(dev_net(bond->dev), NULL, &fl6);
if (dst->error) {
--
2.50.1
Checkpatch.pl expects at least 4 lines of help text.
Extend the help text to make checkpatch.pl happy.
Fixes: 031cdd3bc3f3 ("kunit: Enable PCI on UML without triggering WARN()")
Suggested-by: Shuah Khan <skhan(a)linuxfoundation.org>
Link: https://lore.kernel.org/lkml/3dc95227-2be9-48a0-bdea-3f283d9b2a38@linuxfoun…
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Feel free to fold this into the original commit.
---
lib/kunit/Kconfig | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig
index 1823539e96da30e165fa8d395ccbd3f6754c836e..7a6af361d2fc6276b9667be8c694b0c80e33c1e8 100644
--- a/lib/kunit/Kconfig
+++ b/lib/kunit/Kconfig
@@ -112,5 +112,9 @@ config KUNIT_UML_PCI
select UML_PCI
help
Enables the PCI subsystem on UML for use by KUnit tests.
+ Some KUnit tests require the PCI core which is not enabled by
+ default on UML.
+
+ If unsure, say N.
endif # KUNIT
---
base-commit: f20e264262f1e6a6e5302249e37da355d844b52b
change-id: 20250916-kunit-pci-kconfig-357264bb45f4
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
The `kunit_test` proc macro only checks for the `test` attribute
immediately preceding a `fn`. If the function is disabled via a `cfg`,
the generated code would result in a compile error referencing a
non-existent function [1].
This collects attributes and specifically cherry-picks `cfg` attributes
to be duplicated inside KUnit wrapper functions such that a test function
disabled via `cfg` compiles and is ignored correctly.
Link: https://lore.kernel.org/rust-for-linux/CANiq72==48=69hYiDo1321pCzgn_n1_jg=e… [1]
Closes: https://github.com/Rust-for-Linux/linux/issues/1185
Suggested-by: Miguel Ojeda <ojeda(a)kernel.org>
Signed-off-by: Kaibo Ma <ent3rm4n(a)gmail.com>
---
rust/kernel/kunit.rs | 7 +++++++
rust/macros/kunit.rs | 46 ++++++++++++++++++++++++++++++++------------
2 files changed, 41 insertions(+), 12 deletions(-)
diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
index 41efd8759..32640dfc9 100644
--- a/rust/kernel/kunit.rs
+++ b/rust/kernel/kunit.rs
@@ -357,4 +357,11 @@ fn rust_test_kunit_example_test() {
fn rust_test_kunit_in_kunit_test() {
assert!(in_kunit_test());
}
+
+ #[test]
+ #[cfg(not(all()))]
+ fn rust_test_kunit_always_disabled_test() {
+ // This test should never run because of the `cfg`.
+ assert!(false);
+ }
}
diff --git a/rust/macros/kunit.rs b/rust/macros/kunit.rs
index 81d18149a..850a321e5 100644
--- a/rust/macros/kunit.rs
+++ b/rust/macros/kunit.rs
@@ -5,6 +5,7 @@
//! Copyright (c) 2023 José Expósito <jose.exposito89(a)gmail.com>
use proc_macro::{Delimiter, Group, TokenStream, TokenTree};
+use std::collections::HashMap;
use std::fmt::Write;
pub(crate) fn kunit_tests(attr: TokenStream, ts: TokenStream) -> TokenStream {
@@ -41,20 +42,32 @@ pub(crate) fn kunit_tests(attr: TokenStream, ts: TokenStream) -> TokenStream {
// Get the functions set as tests. Search for `[test]` -> `fn`.
let mut body_it = body.stream().into_iter();
let mut tests = Vec::new();
+ let mut attributes: HashMap<String, TokenStream> = HashMap::new();
while let Some(token) = body_it.next() {
match token {
- TokenTree::Group(ident) if ident.to_string() == "[test]" => match body_it.next() {
- Some(TokenTree::Ident(ident)) if ident.to_string() == "fn" => {
- let test_name = match body_it.next() {
- Some(TokenTree::Ident(ident)) => ident.to_string(),
- _ => continue,
- };
- tests.push(test_name);
+ TokenTree::Punct(ref p) if p.as_char() == '#' => match body_it.next() {
+ Some(TokenTree::Group(g)) if g.delimiter() == Delimiter::Bracket => {
+ if let Some(TokenTree::Ident(name)) = g.stream().into_iter().next() {
+ // Collect attributes because we need to find which are tests. We also
+ // need to copy `cfg` attributes so tests can be conditionally enabled.
+ attributes
+ .entry(name.to_string())
+ .or_default()
+ .extend([token, TokenTree::Group(g)]);
+ }
+ continue;
}
- _ => continue,
+ _ => (),
},
+ TokenTree::Ident(i) if i.to_string() == "fn" && attributes.contains_key("test") => {
+ if let Some(TokenTree::Ident(test_name)) = body_it.next() {
+ tests.push((test_name, attributes.remove("cfg").unwrap_or_default()))
+ }
+ }
+
_ => (),
}
+ attributes.clear();
}
// Add `#[cfg(CONFIG_KUNIT="y")]` before the module declaration.
@@ -100,11 +113,20 @@ pub(crate) fn kunit_tests(attr: TokenStream, ts: TokenStream) -> TokenStream {
let mut test_cases = "".to_owned();
let mut assert_macros = "".to_owned();
let path = crate::helpers::file();
- for test in &tests {
+ let num_tests = tests.len();
+ for (test, cfg_attr) in tests {
let kunit_wrapper_fn_name = format!("kunit_rust_wrapper_{test}");
- // An extra `use` is used here to reduce the length of the message.
+ // Append any `cfg` attributes the user might have written on their tests so we don't
+ // attempt to call them when they are `cfg`'d out. An extra `use` is used here to reduce
+ // the length of the assert message.
let kunit_wrapper = format!(
- "unsafe extern \"C\" fn {kunit_wrapper_fn_name}(_test: *mut ::kernel::bindings::kunit) {{ use ::kernel::kunit::is_test_result_ok; assert!(is_test_result_ok({test}())); }}",
+ r#"unsafe extern "C" fn {kunit_wrapper_fn_name}(_test: *mut ::kernel::bindings::kunit)
+ {{
+ {cfg_attr} {{
+ use ::kernel::kunit::is_test_result_ok;
+ assert!(is_test_result_ok({test}()));
+ }}
+ }}"#,
);
writeln!(kunit_macros, "{kunit_wrapper}").unwrap();
writeln!(
@@ -139,7 +161,7 @@ macro_rules! assert_eq {{
writeln!(
kunit_macros,
"static mut TEST_CASES: [::kernel::bindings::kunit_case; {}] = [\n{test_cases} ::kernel::kunit::kunit_case_null(),\n];",
- tests.len() + 1
+ num_tests + 1
)
.unwrap();
--
2.50.1
The active-backup bonding mode supports XFRM ESP offload. However, when
a bond is added using a command like `ip link add bond0 type bond mode 1
miimon 100`, the `ethtool -k` command shows that the XFRM ESP offload is
disabled. This occurs because, in bond_newlink(), we change bond link
first and register bond device later. So the XFRM feature update in
bond_option_mode_set() is not called as the bond device is not yet
registered, leading to the offload feature not being set successfully.
To resolve this issue, we can modify the code order in bond_newlink() to
ensure that the bond device is registered first before changing the bond
link parameters. This change will allow the XFRM ESP offload feature to be
correctly enabled.
Fixes: 007ab5345545 ("bonding: fix feature flag setting at init time")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
drivers/net/bonding/bond_main.c | 2 +-
drivers/net/bonding/bond_netlink.c | 16 +++++++++-------
include/net/bonding.h | 1 +
3 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 257333c88710..2182b34226ca 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4411,7 +4411,7 @@ void bond_work_init_all(struct bonding *bond)
INIT_DELAYED_WORK(&bond->slave_arr_work, bond_slave_arr_handler);
}
-static void bond_work_cancel_all(struct bonding *bond)
+void bond_work_cancel_all(struct bonding *bond)
{
cancel_delayed_work_sync(&bond->mii_work);
cancel_delayed_work_sync(&bond->arp_work);
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index 57fff2421f1b..7a9d73ec8e91 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -579,20 +579,22 @@ static int bond_newlink(struct net_device *bond_dev,
struct rtnl_newlink_params *params,
struct netlink_ext_ack *extack)
{
+ struct bonding *bond = netdev_priv(bond_dev);
struct nlattr **data = params->data;
struct nlattr **tb = params->tb;
int err;
- err = bond_changelink(bond_dev, tb, data, extack);
- if (err < 0)
+ err = register_netdevice(bond_dev);
+ if (err)
return err;
- err = register_netdevice(bond_dev);
- if (!err) {
- struct bonding *bond = netdev_priv(bond_dev);
+ netif_carrier_off(bond_dev);
+ bond_work_init_all(bond);
- netif_carrier_off(bond_dev);
- bond_work_init_all(bond);
+ err = bond_changelink(bond_dev, tb, data, extack);
+ if (err) {
+ bond_work_cancel_all(bond);
+ unregister_netdevice(bond_dev);
}
return err;
diff --git a/include/net/bonding.h b/include/net/bonding.h
index e06f0d63b2c1..bd56ad976cfb 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -711,6 +711,7 @@ struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
void bond_slave_arr_work_rearm(struct bonding *bond, unsigned long delay);
void bond_work_init_all(struct bonding *bond);
+void bond_work_cancel_all(struct bonding *bond);
#ifdef CONFIG_PROC_FS
void bond_create_proc_entry(struct bonding *bond);
--
2.50.1
RX devmem sometimes fails on NIPA:
https://netdev-3.bots.linux.dev/vmksft-fbnic-qemu-dbg/results/294402/7-devm…
Both RSS and flow steering are properly installed, but the wait_port_listen
fails. Try to remove sleep(1) to see if the cause of the failure is
spending too much time during RX setup. I don't see a good reason to
have sleep in the first place. If there needs to be a delay between
installing the rules and receiving the traffic, let's add it to the
callers (devmem.py) instead.
Signed-off-by: Stanislav Fomichev <sdf(a)fomichev.me>
---
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index c0a22938bed2..3288ed04ce08 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -872,8 +872,6 @@ static int do_server(struct memory_buffer *mem)
goto err_reset_rss;
}
- sleep(1);
-
if (bind_rx_queue(ifindex, mem->fd, create_queues(), num_queues, &ys)) {
pr_err("Failed to bind");
goto err_reset_flow_steering;
--
2.51.0
Here are some small unrelated cleanups collected when working on some
fixes recently.
- Patches 1 & 2: close file descriptors in exit paths in the selftests.
- Patch 3: fix a wrong type (int instead of u32) when parsing a netlink message.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Geliang Tang (2):
selftests: mptcp: close server file descriptors
selftests: mptcp: close server IPC descriptors
Matthieu Baerts (NGI0) (1):
mptcp: pm: netlink: fix if-idx type
net/mptcp/pm_netlink.c | 2 +-
tools/testing/selftests/net/mptcp/mptcp_inq.c | 9 +++++++--
tools/testing/selftests/net/mptcp/mptcp_sockopt.c | 9 +++++++--
3 files changed, 15 insertions(+), 5 deletions(-)
---
base-commit: dc2f650f7e6857bf384069c1a56b2937a1ee370d
change-id: 20250912-net-next-mptcp-minor-fixes-6-18-a10e141ae3e7
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
This series should fix the recent instabilities seen by MPTCP and NIPA
CIs where the 'mptcp_connect.sh' tests fail regularly when running the
'disconnect' subtests with "plain" TCP sockets, e.g.
# INFO: disconnect
# 63 ns1 MPTCP -> ns1 (10.0.1.1:20001 ) MPTCP (duration 996ms) [ OK ]
# 64 ns1 MPTCP -> ns1 (10.0.1.1:20002 ) TCP (duration 851ms) [ OK ]
# 65 ns1 TCP -> ns1 (10.0.1.1:20003 ) MPTCP Unexpected revents: POLLERR/POLLNVAL(19)
# (duration 896ms) [FAIL] file received by server does not match (in, out):
# -rw-r--r-- 1 root root 11112852 Aug 19 09:16 /tmp/tmp.hlJe5DoMoq.disconnect
# Trailing bytes are:
# /{ga 6@=#.8:-rw------- 1 root root 10085368 Aug 19 09:16 /tmp/tmp.blClunilxx
# Trailing bytes are:
# /{ga 6@=#.8:66 ns1 MPTCP -> ns1 (dead:beef:1::1:20004) MPTCP (duration 987ms) [ OK ]
# 67 ns1 MPTCP -> ns1 (dead:beef:1::1:20005) TCP (duration 911ms) [ OK ]
# 68 ns1 TCP -> ns1 (dead:beef:1::1:20006) MPTCP (duration 980ms) [ OK ]
# [FAIL] Tests of the full disconnection have failed
These issues started to be visible after some behavioural changes in
TCP, where too quick re-connections after a shutdown() can now be more
easily rejected. Patch 3 modifies the selftests to wait, but this
resolution revealed an issue in MPTCP which is fixed by patch 1 (a fix
for v5.9 kernel).
Patches 2 and 4 improve some errors reported by the selftests, and patch
5 helps with the debugging of such issues.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Note: The last two patches are not strictly fixes, but they are useful
in case similar issues happen again. That's why they have been added
here in this series for -net. If that's an issue, please drop them, and
I can re-send them later on.
---
Matthieu Baerts (NGI0) (5):
mptcp: propagate shutdown to subflows when possible
selftests: mptcp: connect: catch IO errors on listen side
selftests: mptcp: avoid spurious errors on TCP disconnect
selftests: mptcp: print trailing bytes with od
selftests: mptcp: connect: print pcap prefix
net/mptcp/protocol.c | 16 ++++++++++++++++
tools/testing/selftests/net/mptcp/mptcp_connect.c | 11 ++++++-----
tools/testing/selftests/net/mptcp/mptcp_connect.sh | 6 +++++-
tools/testing/selftests/net/mptcp/mptcp_lib.sh | 2 +-
4 files changed, 28 insertions(+), 7 deletions(-)
---
base-commit: 2690cb089502b80b905f2abdafd1bf2d54e1abef
change-id: 20250912-net-mptcp-fix-sft-connect-f095ad7a6e36
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
During the connection establishment, a peer can tell the other one that
it cannot establish new subflows to the initial IP address and port by
setting the 'C' flag [1]. Doing so makes sense when the sender is behind
a strict NAT, operating behind a legacy Layer 4 load balancer, or using
anycast IP address for example.
When this 'C' flag is set, the path-managers must then not try to
establish new subflows to the other peer's initial IP address and port.
The in-kernel PM has access to this info, but the userspace PM did not,
which prevented the userspace daemon from complying with RFC 8684.
Here are a few fixes related to this 'C' flag (aka 'deny-join-id0'):
- Patch 1: add remote_deny_join_id0 info on passive connections. A fix
for v5.14.
- Patch 2: let the userspace PM daemon know about the deny_join_id0
attribute, so when set, it can avoid creating new subflows to the
initial IP address and port. A fix for v5.19.
- Patch 3: a validation for the previous commit.
- Patch 4: record the deny_join_id0 info when TFO is used. A fix for
v6.2.
- Patch 5: not related to deny-join-id0, but it fixes error messages in
the sockopt selftests to avoid confusion. A fix for v6.5.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Geliang Tang (1):
selftests: mptcp: sockopt: fix error messages
Matthieu Baerts (NGI0) (4):
mptcp: set remote_deny_join_id0 on SYN recv
mptcp: pm: nl: announce deny-join-id0 flag
selftests: mptcp: userspace pm: validate deny-join-id0 flag
mptcp: tfo: record 'deny join id0' info
Documentation/netlink/specs/mptcp_pm.yaml | 4 ++--
include/uapi/linux/mptcp.h | 2 ++
include/uapi/linux/mptcp_pm.h | 4 ++--
net/mptcp/options.c | 6 +++---
net/mptcp/pm_netlink.c | 7 +++++++
net/mptcp/subflow.c | 4 ++++
tools/testing/selftests/net/mptcp/mptcp_sockopt.c | 16 ++++++++++------
tools/testing/selftests/net/mptcp/pm_nl_ctl.c | 7 +++++++
tools/testing/selftests/net/mptcp/userspace_pm.sh | 14 +++++++++++---
9 files changed, 48 insertions(+), 16 deletions(-)
---
base-commit: 2690cb089502b80b905f2abdafd1bf2d54e1abef
change-id: 20250912-net-mptcp-pm-uspace-deny_join_id0-b6111e4e7e69
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
[ based on kvm/next ]
Implement guest_memfd allocation and population via the write syscall.
This is useful in non-CoCo use cases where the host can access guest
memory. Even though the same can also be achieved via userspace mapping
and memcpying from userspace, write provides a more performant option
because it does not need to set page tables and it does not cause a page
fault for every page like memcpy would. Note that memcpy cannot be
accelerated via MADV_POPULATE_WRITE as it is not supported by
guest_memfd and relies on GUP.
Populating 512MiB of guest_memfd on a x86 machine:
- via memcpy: 436 ms
- via write: 202 ms (-54%)
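As a rough illustration of the write-based population path described above,
the sketch below copies a buffer into a file descriptor with pwrite(),
retrying on short writes and EINTR. It assumes the descriptor would come
from KVM_CREATE_GUEST_MEMFD and that positional writes are accepted by this
series; populate_via_write() is a hypothetical helper name and is not taken
from the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy len bytes from buf into fd starting at file offset off.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int populate_via_write(int fd, const void *buf, size_t len, off_t off)
{
	const char *p = buf;

	assert(buf != NULL || len == 0);

	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (n == 0) {
			/* No progress: report an error instead of spinning. */
			errno = EIO;
			return -1;
		}

		/* Short write: advance and continue with the remainder. */
		p += n;
		off += n;
		len -= (size_t)n;
	}

	return 0;
}
```

Unlike the mmap plus memcpy approach, this path needs no userspace mapping
of the file, which is where the cover letter attributes the saving.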
v5:
- Replace the call to the unexported filemap_remove_folio with
zeroing the bytes that could not be copied
- Fix checkpatch findings
v4:
- https://lore.kernel.org/kvm/20250828153049.3922-1-kalyazin@amazon.com
- Switch from implementing the write callback to write_iter
- Remove conditional compilation
v3:
- https://lore.kernel.org/kvm/20250303130838.28812-1-kalyazin@amazon.com
- David/Mike D: Only compile support for the write syscall if
CONFIG_KVM_GMEM_SHARED_MEM (now gone) is enabled.
v2:
- https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com
- Switch from an ioctl to the write syscall to implement population
v1:
- https://lore.kernel.org/kvm/20241024095429.54052-1-kalyazin@amazon.com
Nikita Kalyazin (2):
KVM: guest_memfd: add generic population via write
KVM: selftests: update guest_memfd write tests
.../testing/selftests/kvm/guest_memfd_test.c | 86 +++++++++++++++++--
virt/kvm/guest_memfd.c | 62 ++++++++++++-
2 files changed, 141 insertions(+), 7 deletions(-)
base-commit: a6ad54137af92535cfe32e19e5f3bc1bb7dbd383
--
2.50.1
After commit 5c3bf6cba791 ("bonding: assign random address if device
address is same as bond"), bonding will erroneously randomize the MAC
address of the first interface added to the bond if fail_over_mac =
follow.
Correct this by additionally testing for the bond being empty before
randomizing the MAC.
Fixes: 5c3bf6cba791 ("bonding: assign random address if device address is same as bond")
Reported-by: Qiuling Ren <qren(a)redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
drivers/net/bonding/bond_main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 257333c88710..8832bc9f107b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2132,6 +2132,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
memcpy(ss.__data, bond_dev->dev_addr, bond_dev->addr_len);
} else if (bond->params.fail_over_mac == BOND_FOM_FOLLOW &&
BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP &&
+ bond_has_slaves(bond) &&
memcmp(slave_dev->dev_addr, bond_dev->dev_addr, bond_dev->addr_len) == 0) {
/* Set slave to random address to avoid duplicate mac
* address in later fail over.
--
2.50.1
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v18 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28, and it
will be RFC9768.
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best Regards,
Chia-Yu
---
v18 (11-Sep-2025)
- Reorder tcpi_accecn_fail_mode and tcpi_accecn_opt_seen to avoid adding fields in the middle of tcp_info (Eric Dumazet <edumazet(a)google.com>)
v17 (8-Sep-2025)
- Change tcp_ecn_mode_max from 5 to 2 to disable AccECN enablement before the whole AccECN feature has been accepted
v16 (6-Sep-2025)
- Use TCP_ECN_IN_ACCECN_OUT_ACCECN, TCP_ECN_IN_ECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_ECN in comments of tcp_ecn_send_syn() (Eric Dumazet <edumazet(a)google.com>)
- Add tcpi_accecn_fail_mode and tcpi_accecn_opt_seen to make tcp_info be multiple of 64 bits in patch #12
v15 (14-Aug-2025)
- Update pahole results in commit messages
- Accurate ECN will become RFC9768
v14 (22-Jul-2025)
- Add missing const for struct tcp_sock of tcp_accecn_option_beacon_check() of #11 (Simon Horman <horms(a)kernel.org>)
v13 (18-Jul-2025)
- Implement tcp_accecn_extract_syn_ect() and tcp_accecn_reflector_flags() with static array lookup of patch #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Fix typos in comments of #6 and remove patch #7 of v12 about simultaneous connect (Paolo Abeni <pabeni(a)redhat.com>)
- Move TCP_ACCECN_E1B_INIT_OFFSET, TCP_ACCECN_E0B_INIT_OFFSET, and TCP_ACCECN_CEB_INIT_OFFSET from patch #7 to #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use static array lookup in tcp_accecn_optfield_to_ecnfield() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return false when WARN_ON_ONCE() is true in tcp_accecn_process_option() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Make synack_ecn_bytes as static const array and use const u32 pointer in tcp_options_write() of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use ALIGN() and ALIGN_DOWN() in tcp_options_fit_accecn() to pad TCP AccECN option to dword of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return TCP_ACCECN_OPT_FAIL_SEEN if WARN_ON_ONCE() is true in tcp_accecn_option_init() of #12 (Paolo Abeni <pabeni(a)redhat.com>)
v12 (04-Jul-2025)
- Fix compilation issues with some intermediate patches in v11
- Add more comments for AccECN helpers of tcp_ecn.h
v11 (03-Jul-2025)
- Fix compilation issues with some intermediate patches in v10
v10 (02-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() over every reads of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)
v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEEN flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accurate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restructure the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempt AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)
v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)
v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results
v6 (09-May-2025)
- Add #3 to utilize existing holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use existing holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for readability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
---
Chia-Yu Chang (5):
tcp: reorganize tcp_sock_write_txrx group for variables later
tcp: ecn functions in separated include file
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (9):
tcp: reorganize SYN ECN code
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics
Documentation/networking/ip-sysctl.rst | 55 +-
.../networking/net_cachelines/tcp_sock.rst | 12 +
include/linux/tcp.h | 32 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 87 ++-
include/net/tcp_ecn.h | 642 ++++++++++++++++++
include/uapi/linux/tcp.h | 9 +
net/ipv4/syncookies.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 30 +-
net/ipv4/tcp_input.c | 353 ++++++++--
net/ipv4/tcp_ipv4.c | 8 +-
net/ipv4/tcp_minisocks.c | 40 +-
net/ipv4/tcp_output.c | 294 ++++++--
net/ipv6/syncookies.c | 2 +
net/ipv6/tcp_ipv6.c | 1 +
16 files changed, 1406 insertions(+), 184 deletions(-)
create mode 100644 include/net/tcp_ecn.h
--
2.34.1
This series primarily adds support for DECLARE_PCI_FIXUP_*() in modules.
There are a few drivers that already use this, and so they are
presumably broken when built as modules.
While at it, I wrote some unit tests that emulate a fake PCI device, and
let the PCI framework match/not-match its vendor/device IDs. This test
can be built into the kernel or built as a module.
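The ID matching that these tests exercise can be pictured with a small
userspace model. This is only a sketch, not code from the series:
struct fake_fixup, FAKE_ANY_ID, fake_fixup_matches() and fake_run_fixups()
are hypothetical names, and the wildcard behaviour is assumed to follow the
PCI_ANY_ID convention used by the in-kernel fixup tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PCI_ANY_ID truncated to 16 bits. */
#define FAKE_ANY_ID 0xffffu

/* Hypothetical, simplified fixup table entry: IDs plus a hook. */
struct fake_fixup {
	uint16_t vendor;
	uint16_t device;
	void (*hook)(int *counter);
};

/* An entry matches when each ID is equal or is the wildcard. */
static bool fake_fixup_matches(const struct fake_fixup *f,
			       uint16_t vendor, uint16_t device)
{
	return (f->vendor == vendor || f->vendor == FAKE_ANY_ID) &&
	       (f->device == device || f->device == FAKE_ANY_ID);
}

/* Run every matching hook; return how many hooks ran. */
static size_t fake_run_fixups(const struct fake_fixup *table, size_t n,
			      uint16_t vendor, uint16_t device, int *counter)
{
	size_t ran = 0;

	assert(table != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!fake_fixup_matches(&table[i], vendor, device))
			continue;
		table[i].hook(counter);
		ran++;
	}
	return ran;
}

/* Example hook: records that it was called. */
static void fake_hook(int *counter)
{
	(*counter)++;
}
```

A table mixing exact and wildcard entries shows the match/not-match split:
a fake device with a listed vendor runs both its exact entry and the
vendor-wide wildcard entry, while an unlisted vendor runs none.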
I also include some infrastructure changes (patch 3 and 4), so that
ARCH=um (the default for kunit.py), ARCH=arm, and ARCH=arm64 will run
these tests by default. These patches have different maintainers and are
independent, so they can probably be picked up separately. I included
them because otherwise the tests in patch 2 aren't so easy to run.
Brian Norris (4):
PCI: Support FIXUP quirks in modules
PCI: Add KUnit tests for FIXUP quirks
um: Select PCI_DOMAINS_GENERIC
kunit: qemu_configs: Add PCI to arm, arm64
arch/um/Kconfig | 1 +
drivers/pci/Kconfig | 11 ++
drivers/pci/Makefile | 1 +
drivers/pci/fixup-test.c | 197 ++++++++++++++++++++++
drivers/pci/quirks.c | 62 +++++++
include/linux/module.h | 18 ++
kernel/module/main.c | 26 +++
tools/testing/kunit/qemu_configs/arm.py | 1 +
tools/testing/kunit/qemu_configs/arm64.py | 1 +
9 files changed, 318 insertions(+)
create mode 100644 drivers/pci/fixup-test.c
--
2.51.0.384.g4c02a37b29-goog
For systems having CONFIG_NR_CPUS set to > 1024 in kernel config
the selftest fails as arena_spin_lock_irqsave() returns EOPNOTSUPP
(e.g. on powerpc the default value for CONFIG_NR_CPUS is 8192).
Skip the selftest if the bpf program returns EOPNOTSUPP, with a
descriptive message logged.
Tested-by: Venkat Rao Bagalkote <venkat88(a)linux.ibm.com>
Signed-off-by: Saket Kumar Bhaskar <skb99(a)linux.ibm.com>
---
Changes since v2:
* Separated arena_spin_lock selftest fix patch from the arena
patchset as it has to go via bpf-next tree.
* For EOPNOTSUPP set test_skip to 3, to differentiate it from
scenarios when run conditions are not met as suggested by Hari.
* Tweaked message displayed on SKIP to remove display of online
cpus.
v2:https://lore.kernel.org/all/20250829165135.1273071-1-skb99@linux.ibm.com/
Changes since v1:
Addressed comments from Alexei:
* Removed skel->rodata->nr_cpus = get_nprocs() and its usage to get
currently online cpus(as it needs to be updated from userspace).
v1:https://lore.kernel.org/all/20250805062747.3479221-1-skb99@linux.ibm.com/
---
.../selftests/bpf/prog_tests/arena_spin_lock.c | 13 +++++++++++++
tools/testing/selftests/bpf/progs/arena_spin_lock.c | 5 ++++-
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c b/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
index 0223fce4db2b..693fd86fbde6 100644
--- a/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
+++ b/tools/testing/selftests/bpf/prog_tests/arena_spin_lock.c
@@ -40,8 +40,13 @@ static void *spin_lock_thread(void *arg)
err = bpf_prog_test_run_opts(prog_fd, &topts);
ASSERT_OK(err, "test_run err");
+
+ if (topts.retval == -EOPNOTSUPP)
+ goto end;
+
ASSERT_EQ((int)topts.retval, 0, "test_run retval");
+end:
pthread_exit(arg);
}
@@ -63,6 +68,7 @@ static void test_arena_spin_lock_size(int size)
skel = arena_spin_lock__open_and_load();
if (!ASSERT_OK_PTR(skel, "arena_spin_lock__open_and_load"))
return;
+
if (skel->data->test_skip == 2) {
test__skip();
goto end;
@@ -86,6 +92,13 @@ static void test_arena_spin_lock_size(int size)
goto end_barrier;
}
+ if (skel->data->test_skip == 3) {
+ printf("%s:SKIP: CONFIG_NR_CPUS exceed the maximum supported by arena spinlock\n",
+ __func__);
+ test__skip();
+ goto end_barrier;
+ }
+
ASSERT_EQ(skel->bss->counter, repeat * nthreads, "check counter value");
end_barrier:
diff --git a/tools/testing/selftests/bpf/progs/arena_spin_lock.c b/tools/testing/selftests/bpf/progs/arena_spin_lock.c
index c4500c37f85e..086b57a426cf 100644
--- a/tools/testing/selftests/bpf/progs/arena_spin_lock.c
+++ b/tools/testing/selftests/bpf/progs/arena_spin_lock.c
@@ -37,8 +37,11 @@ int prog(void *ctx)
#if defined(ENABLE_ATOMICS_TESTS) && defined(__BPF_FEATURE_ADDR_SPACE_CAST)
unsigned long flags;
- if ((ret = arena_spin_lock_irqsave(&lock, flags)))
+ if ((ret = arena_spin_lock_irqsave(&lock, flags))) {
+ if (ret == -EOPNOTSUPP)
+ test_skip = 3;
return ret;
+ }
if (counter != limit)
counter++;
bpf_repeat(cs_count);
--
2.43.5
Various KUnit tests require PCI infrastructure to work. All normal
platforms enable PCI by default, but UML does not. Enabling PCI from
.kunitconfig files is problematic as it would not be portable. So in
commit 6fc3a8636a7b ("kunit: tool: Enable virtio/PCI by default on UML")
PCI was enabled by way of CONFIG_UML_PCI_OVER_VIRTIO=y. However
CONFIG_UML_PCI_OVER_VIRTIO requires additional configuration of
CONFIG_UML_PCI_OVER_VIRTIO_DEVICE_ID or will otherwise trigger a WARN() in
virtio_pcidev_init(). However there is no one correct value for
UML_PCI_OVER_VIRTIO_DEVICE_ID which could be used by default.
This warning is confusing when debugging test failures.
On the other hand, the functionality of CONFIG_UML_PCI_OVER_VIRTIO is not
used at all, given that it is completely non-functional as indicated by
the WARN() in question. Instead it is only used as a way to enable
CONFIG_UML_PCI which itself is not directly configurable.
Instead of going through CONFIG_UML_PCI_OVER_VIRTIO, introduce a custom
configuration option which enables CONFIG_UML_PCI without triggering
warnings or building dead code.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
Reviewed-by: Johannes Berg <johannes(a)sipsolutions.net>
---
Changes in v2:
- Rebase onto v6.17-rc1
- Pick up review from Johannes
- Link to v1: https://lore.kernel.org/r/20250627-kunit-uml-pci-v1-1-a622fa445e58@linutron…
---
lib/kunit/Kconfig | 7 +++++++
tools/testing/kunit/configs/arch_uml.config | 5 ++---
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig
index c10ede4b1d2201d5f8cddeb71cc5096e21be9b6a..1823539e96da30e165fa8d395ccbd3f6754c836e 100644
--- a/lib/kunit/Kconfig
+++ b/lib/kunit/Kconfig
@@ -106,4 +106,11 @@ config KUNIT_DEFAULT_TIMEOUT
If unsure, the default timeout of 300 seconds is suitable for most
cases.
+config KUNIT_UML_PCI
+ bool "KUnit UML PCI Support"
+ depends on UML
+ select UML_PCI
+ help
+ Enables the PCI subsystem on UML for use by KUnit tests.
+
endif # KUNIT
diff --git a/tools/testing/kunit/configs/arch_uml.config b/tools/testing/kunit/configs/arch_uml.config
index 54ad8972681a2cc724e6122b19407188910b9025..28edf816aa70e6f408d9486efff8898df79ee090 100644
--- a/tools/testing/kunit/configs/arch_uml.config
+++ b/tools/testing/kunit/configs/arch_uml.config
@@ -1,8 +1,7 @@
# Config options which are added to UML builds by default
-# Enable virtio/pci, as a lot of tests require it.
-CONFIG_VIRTIO_UML=y
-CONFIG_UML_PCI_OVER_VIRTIO=y
+# Enable pci, as a lot of tests require it.
+CONFIG_KUNIT_UML_PCI=y
# Enable FORTIFY_SOURCE for wider checking.
CONFIG_FORTIFY_SOURCE=y
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250626-kunit-uml-pci-a2b687553746
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
This series is a trimmed down version of the previous, more generic series[1].
In this new series, only the -Wunreachable-code flag is added and the dead
code reported by the resulting warnings is removed.
[1] https://lore.kernel.org/all/20250822082145.4145617-1-usama.anjum@collabora.…
Muhammad Usama Anjum (2):
selftests/mm: Add -Wunreachable-code and fix warnings
selftests/mm: protection_keys: Fix dead code
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/hmm-tests.c | 5 ++---
tools/testing/selftests/mm/pkey_sighandler_tests.c | 2 +-
tools/testing/selftests/mm/protection_keys.c | 4 +---
tools/testing/selftests/mm/split_huge_page_test.c | 2 +-
5 files changed, 6 insertions(+), 8 deletions(-)
--
2.47.3
[ I think at this point everyone is OK with the ABI, and the x86
implementation has been tested so hopefully we are near to being
able to get this merged? If there are any outstanding issues let
me know and I can look at addressing them. The one possible issue
I am aware of is that the RISC-V shadow stack support was briefly
in -next but got dropped along with the general RISC-V issues during
the last merge window, rebasing for that is still in progress. I
guess ideally this could be applied on a branch and then pulled into
the RISC-V tree? ]
The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and makes it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal: in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process, keeping the current
implicit allocation behaviour if one is not specified either with
clone3() or through the use of clone(). The user must provide a shadow
stack pointer, this must point to memory mapped for use as a shadow
stack by map_shadow_stack() with an architecture specified shadow stack
token at the top of the stack.
Yuri Khrustalev has raised questions from the libc side regarding
discoverability of extended clone3() structure sizes[2], this seems like
a general issue with clone3(). There was a suggestion to add a hwcap on
arm64 which isn't ideal but is doable there, though architecture
specific mechanisms would also be needed for x86 (and RISC-V if its
support gets merged before this does). The idea has, however, had
strong pushback from the architecture maintainers and it is possible to
detect support for this in clone3() by attempting a call with a
misaligned shadow stack pointer specified so no hwcap has been added.
[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87…
[2] https://lore.kernel.org/r/aCs65ccRQtJBnZ_5@arm.com
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v20:
- Comment fixes and clarifications in x86 arch_shstk_validate_clone()
from Rick Edgecombe.
- Spelling fix in documentation.
- Link to v19: https://lore.kernel.org/r/20250819-clone3-shadow-stack-v19-0-bc957075479b@k…
Changes in v19:
- Rebase onto v6.17-rc1.
- Link to v18: https://lore.kernel.org/r/20250702-clone3-shadow-stack-v18-0-7965d2b694db@k…
Changes in v18:
- Rebase onto v6.16-rc3.
- Thanks to pointers from Yuri Khrustalev this version has been tested
on x86 so I have removed the RFT tag.
- Clarify clone3_shadow_stack_valid() comment about the Kconfig check.
- Remove redundant GCSB DSYNCs in arm64 code.
- Fix token validation on x86.
- Link to v17: https://lore.kernel.org/r/20250609-clone3-shadow-stack-v17-0-8840ed97ff6f@k…
Changes in v17:
- Rebase onto v6.16-rc1.
- Link to v16: https://lore.kernel.org/r/20250416-clone3-shadow-stack-v16-0-2ffc9ca3917b@k…
Changes in v16:
- Rebase onto v6.15-rc2.
- Roll in fixes from x86 testing from Rick Edgecombe.
- Rework so that the argument is shadow_stack_token.
- Link to v15: https://lore.kernel.org/r/20250408-clone3-shadow-stack-v15-0-3fa245c6e3be@k…
Changes in v15:
- Rebase onto v6.15-rc1.
- Link to v14: https://lore.kernel.org/r/20250206-clone3-shadow-stack-v14-0-805b53af73b9@k…
Changes in v14:
- Rebase onto v6.14-rc1.
- Link to v13: https://lore.kernel.org/r/20241203-clone3-shadow-stack-v13-0-93b89a81a5ed@k…
Changes in v13:
- Rebase onto v6.13-rc1.
- Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@k…
Changes in v12:
- Add the regular prctl() to the userspace API document since arm64
support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@k…
Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and
integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a
base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k…
Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick
Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…
Changes in v9:
- Pull token validation earlier and report problems with an error return
to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke…
Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke…
Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke…
Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke…
Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (8):
arm64/gcs: Return a success value from gcs_alloc_thread_stack()
Documentation: userspace-api: Add shadow stack API documentation
selftests: Provide helper header for shadow stack testing
fork: Add shadow stack support to clone3()
selftests/clone3: Remove redundant flushes of output streams
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
selftests/clone3: Test shadow stack support
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/shadow_stack.rst | 44 +++++
arch/arm64/include/asm/gcs.h | 8 +-
arch/arm64/kernel/process.c | 8 +-
arch/arm64/mm/gcs.c | 55 +++++-
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 53 ++++-
include/asm-generic/cacheflush.h | 11 ++
include/linux/sched/task.h | 17 ++
include/uapi/linux/sched.h | 9 +-
kernel/fork.c | 93 +++++++--
tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++----
tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++-
tools/testing/selftests/ksft_shstk.h | 98 ++++++++++
15 files changed, 620 insertions(+), 81 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs mount is
auto-selected based on the mounting process's active pidns, and the
pidns itself is basically hidden once the mount has been constructed).
/* pidns mount option for procfs */
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns of a
procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes creates
a procfs mount from an empty pidns so that user namespaced containers
can be nested (without this, the nested containers would fail to
mount procfs). But this requires forking off a helper process because
you cannot just one-shot this using mount(2).
* Container runtimes in general need to fork into a container before
configuring its mounts, which can lead to security issues in the case
of shared-pidns containers (a privileged process in the pidns can
interact with your container runtime process). While
SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
strict need for this due to a minor uAPI wart is kind of unfortunate.
Things would be much easier if there was a way for userspace to just
specify the pidns they want. Patch 1 implements a new "pidns" argument
which can be set using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
The initial security model I have in this RFC is to be as conservative
as possible and just mirror the security model for setns(2) -- which
means that you can only set pidns=... to pid namespaces that your
current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
privileges over the pid namespace. This fulfils the requirements of
container runtimes, but I suspect that this may be too strict for some
usecases.
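The ancestry requirement can be illustrated with a toy model. This is a
sketch under stated assumptions rather than the helper added in patch 1:
struct toy_pidns and toy_pidns_is_ancestor() are hypothetical, the model
assumes each namespace records its parent and nesting depth, and it treats
a namespace as its own ancestor. The CAP_SYS_ADMIN check is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each namespace knows its parent and depth. */
struct toy_pidns {
	const struct toy_pidns *parent;	/* NULL for the initial namespace */
	unsigned int level;		/* 0 for the initial namespace */
};

/*
 * Return true if @ancestor is @child itself or appears on @child's
 * parent chain. Walk up from @child until reaching @ancestor's depth,
 * then compare.
 */
static bool toy_pidns_is_ancestor(const struct toy_pidns *child,
				  const struct toy_pidns *ancestor)
{
	const struct toy_pidns *ns = child;

	assert(child != NULL && ancestor != NULL);
	while (ns && ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}
```

In this model a mounting process in namespace A could name A or any
namespace nested below A, but not a sibling or a parent of A.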
The pidns argument is not displayed in mountinfo -- it's not clear to me
what value it would make sense to show (maybe we could just use ns_dname
to provide an identifier for the namespace, but this number would be
fairly useless to userspace). I'm open to suggestions. Note that
PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
information about this outside of mountinfo.
Note that you cannot change the pidns of an already-created procfs
instance. The primary reason is that allowing this to be changed would
require RCU-protecting proc_pid_ns(sb) and thus auditing all of
fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF
the pid namespace. Since creating procfs instances is very cheap, it
seems unnecessary to overcomplicate this upfront. Trying to reconfigure
procfs this way errors out with -EBUSY.
/* ioctl(PROCFS_GET_PID_NAMESPACE) */
In addition, being able to figure out what pid namespace is being used
by a procfs mount is quite useful when you have an administrative
process (such as a container runtime) which wants to figure out the
correct way of mapping PIDs between its own namespace and the namespace
for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
alternative ways to do this, but they all rely on ancillary information
that third-party libraries and tools do not necessarily have access to.
To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
can be used to get a reference to the pidns that a procfs is using.
Rather than copying the (fairly strict) security model for setns(2),
apply a slightly looser model to better match what userspace can already
do:
* Make the ioctl only valid on the root (meaning that a process without
access to the procfs root -- such as only having an fd to a procfs
file or some open_tree(2)-like subset -- cannot use this API). This
means that the process already has some level of access to the
/proc/$pid directories.
* If the calling process is in an ancestor pidns, then they can already
create pidfd for processes inside the pidns, which is morally
equivalent to a pidns file descriptor according to setns(2). So it
seems reasonable to just allow it in this case. (The justification
for this model was suggested by Christian.)
* If the process has access to /proc/1/ns/pid already (i.e. has
ptrace-read access to the pidns pid1), then this ioctl is equivalent
to just opening a handle to it that way.
Ideally we would check for ptrace-read access against all processes
in the pidns (which is very likely to be true for at least one
process, as SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set
by most programs), but this would obviously not scale.
I'm open to suggestions for whether we need to make this stricter (or
possibly allow more cases).
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
---
Changes in v4:
- Remove unneeded EXPORT_SYMBOL_GPL. [Christian Brauner]
- Return -EOPNOTSUPP for new APIs for CONFIG_PID_NS=n rather than
pretending they don't exist entirely. [Christian Brauner]
- PROCFS_IOCTL_MAGIC conflicts with XSDFEC_MAGIC, so we need to allocate
subvalues more carefully (switch to _IO(PROCFS_IOCTL_MAGIC, 32)).
- Add some more selftests for PROCFS_GET_PID_NAMESPACE.
- Reword argument for PROCFS_GET_PID_NAMESPACE security model based on
Christian's suggestion, and remove CAP_SYS_ADMIN edge-case (in most
cases, such a process would also have ptrace-read credentials over the
pidns pid1).
- v3: <https://lore.kernel.org/r/20250724-procfs-pidns-api-v3-0-4c685c910923@cypha…>
Changes in v3:
- Disallow changing pidns for existing procfs instances, as we'd
probably have to RCU-protect everything that touches the pinned pidns
reference.
- Improve tests with slightly nicer ASSERT_ERRNO* macros.
- v2: <https://lore.kernel.org/r/20250723-procfs-pidns-api-v2-0-621e7edd8e40@cypha…>
Changes in v2:
- #ifdef CONFIG_PID_NS
- Improve cover letter wording to make it clear we're talking about two
separate features with different permission models. [Andy Lutomirski]
- Fix build warnings in pidns_is_ancestor() patch. [kernel test robot]
- v1: <https://lore.kernel.org/r/20250721-procfs-pidns-api-v1-0-5cd9007e512d@cypha…>
---
Aleksa Sarai (4):
pidns: move is-ancestor logic to helper
procfs: add "pidns" mount option
procfs: add PROCFS_GET_PID_NAMESPACE ioctl
selftests/proc: add tests for new pidns APIs
Documentation/filesystems/proc.rst | 12 ++
fs/proc/root.c | 166 +++++++++++++++-
include/linux/pid_namespace.h | 9 +
include/uapi/linux/fs.h | 4 +
kernel/pid_namespace.c | 22 ++-
tools/testing/selftests/proc/.gitignore | 1 +
tools/testing/selftests/proc/Makefile | 1 +
tools/testing/selftests/proc/proc-pidns.c | 315 ++++++++++++++++++++++++++++++
8 files changed, 514 insertions(+), 16 deletions(-)
---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250717-procfs-pidns-api-8ed1583431f0
Best regards,
--
Aleksa Sarai <cyphar(a)cyphar.com>
From: Benjamin Berg <benjamin.berg(a)intel.com>
For a while now, we have discussed that it may be better to avoid using
libc inside UML as it may be interfering in unexpected ways with kernel
functionality. A major point of concern is that there is no guarantee
that the libc is not using any address space that may conflict with
kernel addresses.
This patchset is an attempt to start a nolibc port of UML. The goal is
to port UML to use nolibc in smaller chunks to make the switch more
manageable.
There are three parts to this patchset:
* Two patches to use tools/include headers instead of kernel headers
for userspace files.
* A few nolibc fixes and a new NOLIBC_NO_STARTCODE compile flag for it
* Finally nolibc build support for UML and switching two files
The first two parts could be merged independently. The last step to use
nolibc inside UML obviously depends on the first two.
Benjamin
Benjamin Berg (9):
tools compiler.h: fix __used definition
um: use tools/include for user files
tools/nolibc/stdio: remove perror if NOLIBC_IGNORE_ERRNO is set
tools/nolibc/dirent: avoid errno in readdir_r
tools/nolibc: use __fallthrough__ rather than fallthrough
tools/nolibc: add option to disable startup code
um: add infrastructure to build files using nolibc
um: use nolibc for the --showconfig implementation
um: switch ptrace FP register access to nolibc
arch/um/Makefile | 32 ++++++++++++++++---
.../um/include/shared/generated/asm-offsets.h | 1 +
.../include/shared/generated/user_constants.h | 1 +
arch/um/include/shared/init.h | 2 +-
arch/um/include/shared/os.h | 2 ++
arch/um/include/shared/user.h | 5 ---
arch/um/kernel/Makefile | 2 +-
arch/um/kernel/skas/stub.c | 1 +
arch/um/kernel/skas/stub_exe.c | 4 +--
arch/um/os-Linux/skas/process.c | 6 ++--
arch/um/os-Linux/start_up.c | 4 +--
arch/um/scripts/Makefile.rules | 10 ++++--
arch/x86/um/Makefile | 6 ++--
arch/x86/um/os-Linux/Makefile | 5 ++-
arch/x86/um/os-Linux/registers.c | 22 +++++--------
arch/x86/um/user-offsets.c | 1 -
tools/include/linux/compiler.h | 2 +-
tools/include/nolibc/arch-arm.h | 2 ++
tools/include/nolibc/arch-arm64.h | 2 ++
tools/include/nolibc/arch-loongarch.h | 2 ++
tools/include/nolibc/arch-m68k.h | 2 ++
tools/include/nolibc/arch-mips.h | 2 ++
tools/include/nolibc/arch-powerpc.h | 2 ++
tools/include/nolibc/arch-riscv.h | 2 ++
tools/include/nolibc/arch-s390.h | 2 ++
tools/include/nolibc/arch-sh.h | 2 ++
tools/include/nolibc/arch-sparc.h | 2 ++
tools/include/nolibc/arch-x86.h | 4 +++
tools/include/nolibc/compiler.h | 4 +--
tools/include/nolibc/crt.h | 3 ++
tools/include/nolibc/dirent.h | 6 ++--
tools/include/nolibc/stackprotector.h | 2 ++
tools/include/nolibc/stdio.h | 2 ++
tools/include/nolibc/stdlib.h | 2 ++
tools/include/nolibc/sys.h | 3 +-
tools/include/nolibc/sys/auxv.h | 3 ++
36 files changed, 108 insertions(+), 47 deletions(-)
create mode 120000 arch/um/include/shared/generated/asm-offsets.h
create mode 120000 arch/um/include/shared/generated/user_constants.h
--
2.51.0
On Sun, Sep 14, 2025 at 6:24 AM Chris Mason <clm(a)meta.com> wrote:
>
> On Fri, 8 Aug 2025 08:28:49 -0700 Suren Baghdasaryan <surenb(a)google.com> wrote:
>
> > Utilize per-vma locks to stabilize vma after lookup without taking
> > mmap_lock during PROCMAP_QUERY ioctl execution. If vma lock is
> > contended, we fall back to mmap_lock but take it only momentarily
> > to lock the vma and release the mmap_lock. In a very unlikely case
> > of vm_refcnt overflow, this fall back path will fail and ioctl is
> > done under mmap_lock protection.
> >
> > This change is designed to reduce mmap_lock contention and prevent
> > PROCMAP_QUERY ioctl calls from blocking address space updates.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
> > Acked-by: SeongJae Park <sj(a)kernel.org>
> > ---
> > fs/proc/task_mmu.c | 103 +++++++++++++++++++++++++++++++++++++--------
> > 1 file changed, 85 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index c0968d293b61..e64cf40ce9c4 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -132,6 +132,12 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
>
> [ ... ]
>
> > +static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_locking_ctx *lock_ctx,
> > + unsigned long addr)
> > +{
> > + struct mm_struct *mm = lock_ctx->mm;
> > + struct vm_area_struct *vma;
> > + struct vma_iterator vmi;
> > +
> > + if (lock_ctx->mmap_locked)
> > + return find_vma(mm, addr);
> > +
> > + /* Unlock previously locked VMA and find the next one under RCU */
> > + unlock_ctx_vma(lock_ctx);
> > + rcu_read_lock();
> > + vma_iter_init(&vmi, mm, addr);
> > + vma = lock_next_vma(mm, &vmi, addr);
> > + rcu_read_unlock();
> > +
> > + if (!vma)
> > + return NULL;
> > +
> > + if (!IS_ERR(vma)) {
> > + lock_ctx->locked_vma = vma;
> > + return vma;
> > + }
> > +
> > + if (PTR_ERR(vma) == -EAGAIN) {
> > + /* Fallback to mmap_lock on vma->vm_refcnt overflow */
> > + mmap_read_lock(mm);
>
> I know it's just a (very rare) fallback, but should we be using
> mmap_read_lock_killable() for consistency? I can see this impacting oom
> kills or other times we really want to be able to get rid of procs.
That's a good idea. From a quick look it seems safe to fail with
-EINTR here, which will propagate all the way to do_procmap_query().
Do you want to post a fixup patch?
Thanks,
Suren.
>
> -chris
Two patches here. The first fixes an issue where the tunnel core doesn't
actually extract the DF bit from the outer IP header, even though both
OVS and TC flower allow matching on it. More details are in the commit
message.
The second is a selftest for openvswitch that reproduces the issue,
but also just adds some basic coverage for the tunnel metadata
extraction and related openvswitch uAPI.
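For readers unfamiliar with the bit in question, here is a minimal
userspace sketch of what "extracting the DF bit from the outer IP header"
means at the wire level. It is not the kernel change itself: the helper
name outer_ipv4_df_set() and the IPV4_FLAG_DF constant are made up for
illustration, and the function assumes it is handed a pointer to at
least 20 bytes of a raw IPv4 header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Don't Fragment flag inside the 16-bit flags/fragment-offset field. */
#define IPV4_FLAG_DF 0x4000u

/*
 * Hypothetical helper (not a kernel API): report whether the DF bit is
 * set in a raw IPv4 header. 'iphdr' must point to at least 20 bytes.
 */
static bool outer_ipv4_df_set(const uint8_t *iphdr)
{
	uint16_t frag_off;

	assert(iphdr != NULL);

	/* Bytes 6-7 hold flags + fragment offset, big-endian on the wire. */
	frag_off = (uint16_t)((iphdr[6] << 8) | iphdr[7]);

	return (frag_off & IPV4_FLAG_DF) != 0;
}
```

Reading the two bytes explicitly keeps the sketch independent of host
byte order.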
Version 2:
* Added missing tun_dst NULL check.
* Added Reviewed-by from Aaron for the selftest.
Version 1:
https://lore.kernel.org/netdev/20250905133105.3940420-1-i.maximets@ovn.org/
Ilya Maximets (2):
net: dst_metadata: fix IP_DF bit not extracted from tunnel headers
selftests: openvswitch: add a simple test for tunnel metadata
include/net/dst_metadata.h | 11 ++-
.../selftests/net/openvswitch/openvswitch.sh | 88 +++++++++++++++++--
2 files changed, 90 insertions(+), 9 deletions(-)
--
2.50.1
The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling.
When GCS is active a secondary stack called the Guarded Control Stack is
maintained, protected with a memory attribute which means that it can
only be written with specific GCS operations. The current GCS pointer
can not be directly written to by userspace. When a BL is executed the
value stored in LR is also pushed onto the GCS, and when a RET is
executed the top of the GCS is popped and compared to LR with a fault
being raised if the values do not match. GCS operations may only be
performed on GCS pages, a data abort is generated if they are not.
The combination of hardware enforcement and lack of extra instructions
in the function entry and exit paths should result in something which
has less overhead and is more difficult to attack than a purely software
implementation like clang's shadow stacks.
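As a rough illustration of the BL/RET behaviour described above, the
following userspace C model keeps a second copy of return addresses and
checks it on return. It is only a sketch under stated assumptions: the
names gcs_model, gcs_model_bl() and gcs_model_ret() and the fixed depth
are invented for this example, the empty-stack case is simply treated as
a fault, and real GCS is enforced by hardware rather than by code like
this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GCS_MODEL_DEPTH 64	/* arbitrary depth for the model */

struct gcs_model {
	uint64_t entries[GCS_MODEL_DEPTH];
	size_t top;		/* number of valid entries */
};

/* Model of BL: the value stored in LR is also pushed onto the GCS. */
static void gcs_model_bl(struct gcs_model *gcs, uint64_t lr)
{
	assert(gcs->top < GCS_MODEL_DEPTH);
	gcs->entries[gcs->top++] = lr;
}

/*
 * Model of RET: pop the top of the GCS and compare it to LR.
 * Returns false where the hardware would raise a fault.
 */
static bool gcs_model_ret(struct gcs_model *gcs, uint64_t lr)
{
	if (gcs->top == 0)
		return false;	/* model assumption: empty stack is a fault */

	return gcs->entries[--gcs->top] == lr;
}
```

A return to a corrupted LR is rejected because the popped entry no
longer matches.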
This series implements support for managing GCS for KVM guests.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v16:
- Rebase onto v6.17-rc3.
- Also expose the feature to nested guests.
- Implement emulation of EXLOCK when returning from nested guests.
- Rename enter_exception_gcs() to compute_exlock().
- Move all ID_AA64PFR1_EL1 handling to the final kernel patch.
- Drop unneeded forwarding of GCS exceptions.
- Commit and cover message updates.
- Link to v15: https://lore.kernel.org/r/20250820-arm64-gcs-v15-0-5e334da18b84@kernel.org
Changes in v15:
- Rebase onto v6.17-rc1.
- Link to v14: https://lore.kernel.org/r/20241005-arm64-gcs-v14-0-59060cd6092b@kernel.org
Changes in v14:
- Rebase onto arm64/for-next/gcs which includes all the non-KVM support.
- Manage the fine grained traps for GCS instructions.
- Manage PSTATE.EXLOCK when delivering exceptions to KVM guests.
- Link to v13: https://lore.kernel.org/r/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.org
Changes in v13:
- Rebase onto v6.12-rc1.
- Allocate VM_HIGH_ARCH_6 since protection keys used all the existing
bits.
- Implement mm_release() and free transparently allocated GCSs there.
- Use bit 32 of AT_HWCAP for GCS due to AT_HWCAP2 being filled.
- Since we now only set GCSCRE0_EL1 on change ensure that it is
initialised with GCSPR_EL0 accessible to EL0.
- Fix OOM handling on thread copy.
- Link to v12: https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec947436a@kernel.org
Changes in v12:
- Clarify and simplify the signal handling code so we work with the
register state.
- When checking for write aborts to shadow stack pages ensure the fault
is a data abort.
- Depend on !UPROBES.
- Comment cleanups.
- Link to v11: https://lore.kernel.org/r/20240822-arm64-gcs-v11-0-41b81947ecb5@kernel.org
Changes in v11:
- Remove the dependency on the addition of clone3() support for shadow
stacks, rebasing onto v6.11-rc3.
- Make ID_AA64PFR1_EL1.GCS writeable in KVM.
- Hide GCS registers when GCS is not enabled for KVM guests.
- Require HCRX_EL2.GCSEn if booting at EL1.
- Require that GCSCR_EL1 and GCSCRE0_EL1 be initialised regardless of
if we boot at EL2 or EL1.
- Remove some stray use of bit 63 in signal cap tokens.
- Warn if we see a GCS with VM_SHARED.
- Remove redundant check for VM_WRITE in fault handling.
- Cleanups and clarifications in the ABI document.
- Clean up and improve documentation of some sync placement.
- Only set the EL0 GCS mode if it's actually changed.
- Various minor fixes and tweaks.
- Link to v10: https://lore.kernel.org/r/20240801-arm64-gcs-v10-0-699e2bd2190b@kernel.org
Changes in v10:
- Fix issues with THP.
- Tighten up requirements for initialising GCSCR*.
- Only generate GCS signal frames for threads using GCS.
- Only context switch EL1 GCS registers if S1PIE is enabled.
- Move context switch of GCSCRE0_EL1 to EL0 context switch.
- Make GCS registers unconditionally visible to userspace.
- Use FHU infrastructure.
- Don't change writability of ID_AA64PFR1_EL1 for KVM.
- Remove unused arguments from alloc_gcs().
- Typo fixes.
- Link to v9: https://lore.kernel.org/r/20240625-arm64-gcs-v9-0-0f634469b8f0@kernel.org
Changes in v9:
- Rebase onto v6.10-rc3.
- Restructure and clarify memory management fault handling.
- Fix up basic-gcs for the latest clone3() changes.
- Convert to newly merged KVM ID register based feature configuration.
- Fixes for NV traps.
- Link to v8: https://lore.kernel.org/r/20240203-arm64-gcs-v8-0-c9fec77673ef@kernel.org
Changes in v8:
- Invalidate signal cap token on stack when consuming.
- Typo and other trivial fixes.
- Don't try to use process_vm_write() on GCS, it intentionally does not
work.
- Fix leak of thread GCSs.
- Rebase onto latest clone3() series.
- Link to v7: https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org
Changes in v7:
- Rebase onto v6.7-rc2 via the clone3() patch series.
- Change the token used to cap the stack during signal handling to be
compatible with GCSPOPM.
- Fix flags for new page types.
- Fold in support for clone3().
- Replace copy_to_user_gcs() with put_user_gcs().
- Link to v6: https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org
Changes in v6:
- Rebase onto v6.6-rc3.
- Add some more gcsb_dsync() barriers following spec clarifications.
- Due to ongoing discussion around clone()/clone3() I've not updated
anything there, the behaviour is the same as on previous versions.
- Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org
Changes in v5:
- Don't map any permissions for user GCSs, we always use EL0 accessors
or use a separate mapping of the page.
- Reduce the standard size of the GCS to RLIMIT_STACK/2.
- Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
- Clarifications and fixes to documentation.
- More tests.
- Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org
Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of
stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org
Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org
Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size
requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org
---
Mark Brown (6):
arm64/gcs: Ensure FGTs for EL1 GCS instructions are disabled
KVM: arm64: Manage GCS access and registers for guests
KVM: arm64: Set PSTATE.EXLOCK when entering an exception
KVM: arm64: Validate GCS exception lock when emulating ERET
KVM: arm64: Allow GCS to be enabled for guests
KVM: selftests: arm64: Add GCS registers to get-reg-list
arch/arm64/include/asm/el2_setup.h | 5 +++
arch/arm64/include/asm/kvm_emulate.h | 3 ++
arch/arm64/include/asm/kvm_host.h | 14 +++++++++
arch/arm64/include/asm/vncr_mapping.h | 2 ++
arch/arm64/include/uapi/asm/ptrace.h | 1 +
arch/arm64/kvm/emulate-nested.c | 40 +++++++++++++++++++++++-
arch/arm64/kvm/hyp/exception.c | 37 ++++++++++++++++++++++
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 31 ++++++++++++++++++
arch/arm64/kvm/hyp/vhe/sysreg-sr.c | 10 ++++++
arch/arm64/kvm/nested.c | 7 +++--
arch/arm64/kvm/sys_regs.c | 32 +++++++++++++++++--
tools/testing/selftests/kvm/arm64/get-reg-list.c | 12 +++++++
12 files changed, 188 insertions(+), 6 deletions(-)
---
base-commit: 1b237f190eb3d36f52dffe07a40b5eb210280e00
change-id: 20230303-arm64-gcs-e311ab0d8729
Best regards,
--
Mark Brown <broonie(a)kernel.org>
The futex_numa_mpol test requires libnuma, which is not available on
all platforms. When the test is not built, the run.sh script fails
because it unconditionally tries to execute the test binary.
Check for the futex_numa_mpol executable before running it. If the
binary is not present, print a skip message and continue.
This allows the test suite to run successfully on platforms that do
not have libnuma and therefore do not build the futex_numa_mpol
test.
Signed-off-by: Wake Liu <wakel(a)google.com>
---
tools/testing/selftests/futex/functional/run.sh | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/futex/functional/run.sh b/tools/testing/selftests/futex/functional/run.sh
index 81739849f299..f3e43eb806bf 100755
--- a/tools/testing/selftests/futex/functional/run.sh
+++ b/tools/testing/selftests/futex/functional/run.sh
@@ -88,4 +88,8 @@ echo
./futex_priv_hash -g $COLOR
echo
-./futex_numa_mpol $COLOR
+if [ -x ./futex_numa_mpol ]; then
+ ./futex_numa_mpol $COLOR
+else
+ echo "SKIP: futex_numa_mpol (not built)"
+fi
--
2.51.0.355.g5224444f11-goog
This series adds comprehensive testing infrastructure for Netlink
and Generic Netlink.
The implementation includes both a kernel module and userspace tests to
verify correct Generic Netlink and Netlink behavior under
various conditions.
Yana Bashlykova (15):
genetlink: add sysfs test module for Generic Netlink
genetlink: add TEST_GENL family for netlink testing
genetlink: add PARALLEL_GENL test family
genetlink: add test case for duplicate genl family registration
genetlink: add test case for family with invalid ops
genetlink: add netlink notifier support
genetlink: add THIRD_GENL family
genetlink: verify unregister fails for non-registered family
genetlink: add LARGE_GENL stress test family
selftests: net: genetlink: add packet capture test infrastructure
selftests: net: genetlink: add /proc/net/netlink test
selftests: net: genetlink: add Generic Netlink controller tests
selftests: net: genetlink: add large family ID resolution test
selftests: net: genetlink: add Netlink and Generic Netlink test suite
selftests: net: genetlink: fix expectation for large family resolution
drivers/net/Kconfig | 2 +
drivers/net/Makefile | 2 +
drivers/net/genetlink/Kconfig | 8 +
drivers/net/genetlink/Makefile | 3 +
.../net-pf-16-proto-16-family-PARALLEL_GENL.c | 1921 ++++++
tools/testing/selftests/net/Makefile | 6 +
tools/testing/selftests/net/genetlink.c | 5152 +++++++++++++++++
7 files changed, 7094 insertions(+)
create mode 100644 drivers/net/genetlink/Kconfig
create mode 100644 drivers/net/genetlink/Makefile
create mode 100644 drivers/net/genetlink/net-pf-16-proto-16-family-PARALLEL_GENL.c
create mode 100644 tools/testing/selftests/net/genetlink.c
--
2.34.1
Soft offlining a HugeTLB page reduces the available HugeTLB page pool.
Since HugeTLB pages are preallocated, reducing the available HugeTLB
page pool can cause allocation failures.
/proc/sys/vm/enable_soft_offline provides a sysctl interface to
disable/enable soft offline:
0 - Soft offline is disabled.
1 - Soft offline is enabled.
The current sysctl interface does not distinguish between HugeTLB pages
and other page types.
Disable soft offline for HugeTLB pages by default (1) and extend the
sysctl interface to preserve existing behavior (2):
0 - Soft offline is disabled.
1 - Soft offline is enabled (excluding HugeTLB pages).
2 - Soft offline is enabled (including HugeTLB pages).
Update documentation for the sysctl interface, reference the sysctl
interface in the sysfs ABI documentation, and update HugeTLB soft
offline selftests.
Reported-by: Shawn Fan <shawn.fan(a)intel.com>
Suggested-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Kyle Meyer <kyle.meyer(a)hpe.com>
---
Tony's original patch disabled soft offline for HugeTLB pages when
a correctable memory error reported via GHES (with "error threshold
exceeded" set) happened to be on a HugeTLB page:
https://lore.kernel.org/all/20250904155720.22149-1-tony.luck@intel.com
This patch disables soft offline for HugeTLB pages by default
(not just from GHES).
---
.../ABI/testing/sysfs-memory-page-offline | 6 ++++
Documentation/admin-guide/sysctl/vm.rst | 18 ++++++++---
mm/memory-failure.c | 21 ++++++++++--
.../selftests/mm/hugetlb-soft-offline.c | 32 +++++++++++++------
4 files changed, 60 insertions(+), 17 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-memory-page-offline b/Documentation/ABI/testing/sysfs-memory-page-offline
index 00f4e35f916f..befb89ae39ec 100644
--- a/Documentation/ABI/testing/sysfs-memory-page-offline
+++ b/Documentation/ABI/testing/sysfs-memory-page-offline
@@ -20,6 +20,12 @@ Description:
number, or a error when the offlining failed. Reading
the file is not allowed.
+ Soft-offline can be disabled/enabled via sysctl:
+ /proc/sys/vm/enable_soft_offline
+
+ For details, see:
+ Documentation/admin-guide/sysctl/vm.rst
+
What: /sys/devices/system/memory/hard_offline_page
Date: Sep 2009
KernelVersion: 2.6.33
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 4d71211fdad8..ae56372bd604 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -309,19 +309,29 @@ physical memory) vs performance / capacity implications in transparent and
HugeTLB cases.
For all architectures, enable_soft_offline controls whether to soft offline
-memory pages. When set to 1, kernel attempts to soft offline the pages
-whenever it thinks needed. When set to 0, kernel returns EOPNOTSUPP to
-the request to soft offline the pages. Its default value is 1.
+memory pages:
+
+- 0: Soft offline is disabled.
+- 1: Soft offline is enabled (excluding HugeTLB pages).
+- 2: Soft offline is enabled (including HugeTLB pages).
+
+The default is 1.
+
+If soft offline is disabled for the requested page type, EOPNOTSUPP is returned.
It is worth mentioning that after setting enable_soft_offline to 0, the
following requests to soft offline pages will not be performed:
+- Request to soft offline from sysfs (soft_offline_page).
+
- Request to soft offline pages from RAS Correctable Errors Collector.
-- On ARM, the request to soft offline pages from GHES driver.
+- On ARM and X86, the request to soft offline pages from GHES driver.
- On PARISC, the request to soft offline pages from Page Deallocation Table.
+Note: Soft offlining a HugeTLB page reduces the HugeTLB page pool.
+
extfrag_threshold
=================
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fc30ca4804bf..cb59a99b48c5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -64,11 +64,18 @@
#include "internal.h"
#include "ras/ras_event.h"
+enum soft_offline {
+ SOFT_OFFLINE_DISABLED = 0,
+ SOFT_OFFLINE_ENABLED_SKIP_HUGETLB,
+ SOFT_OFFLINE_ENABLED
+};
+
static int sysctl_memory_failure_early_kill __read_mostly;
static int sysctl_memory_failure_recovery __read_mostly = 1;
-static int sysctl_enable_soft_offline __read_mostly = 1;
+static int sysctl_enable_soft_offline __read_mostly =
+ SOFT_OFFLINE_ENABLED_SKIP_HUGETLB;
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
@@ -150,7 +157,7 @@ static const struct ctl_table memory_failure_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
- .extra2 = SYSCTL_ONE,
+ .extra2 = SYSCTL_TWO,
}
};
@@ -2799,12 +2806,20 @@ int soft_offline_page(unsigned long pfn, int flags)
return -EIO;
}
- if (!sysctl_enable_soft_offline) {
+ if (sysctl_enable_soft_offline == SOFT_OFFLINE_DISABLED) {
pr_info_once("disabled by /proc/sys/vm/enable_soft_offline\n");
put_ref_page(pfn, flags);
return -EOPNOTSUPP;
}
+ if (sysctl_enable_soft_offline == SOFT_OFFLINE_ENABLED_SKIP_HUGETLB) {
+ if (folio_test_hugetlb(pfn_folio(pfn))) {
+ pr_info_once("disabled for HugeTLB pages by /proc/sys/vm/enable_soft_offline\n");
+ put_ref_page(pfn, flags);
+ return -EOPNOTSUPP;
+ }
+ }
+
mutex_lock(&mf_mutex);
if (PageHWPoison(page)) {
diff --git a/tools/testing/selftests/mm/hugetlb-soft-offline.c b/tools/testing/selftests/mm/hugetlb-soft-offline.c
index f086f0e04756..7e2873cd0a6d 100644
--- a/tools/testing/selftests/mm/hugetlb-soft-offline.c
+++ b/tools/testing/selftests/mm/hugetlb-soft-offline.c
@@ -1,10 +1,15 @@
// SPDX-License-Identifier: GPL-2.0
/*
* Test soft offline behavior for HugeTLB pages:
- * - if enable_soft_offline = 0, hugepages should stay intact and soft
- * offlining failed with EOPNOTSUPP.
- * - if enable_soft_offline = 1, a hugepage should be dissolved and
- * nr_hugepages/free_hugepages should be reduced by 1.
+ *
+ * - if enable_soft_offline = 0 (SOFT_OFFLINE_DISABLED), HugeTLB pages
+ * should stay intact and soft offlining failed with EOPNOTSUPP.
+ *
+ * - if enable_soft_offline = 1 (SOFT_OFFLINE_ENABLED_SKIP_HUGETLB), HugeTLB pages
+ * should stay intact and soft offlining failed with EOPNOTSUPP.
+ *
+ * - if enable_soft_offline = 2 (SOFT_OFFLINE_ENABLED), a HugeTLB page should be
+ * dissolved and nr_hugepages/free_hugepages should be reduced by 1.
*
* Before running, make sure more than 2 hugepages of default_hugepagesz
* are allocated. For example, if /proc/meminfo/Hugepagesize is 2048kB:
@@ -32,6 +37,12 @@
#define EPREFIX " !!! "
+enum soft_offline {
+ SOFT_OFFLINE_DISABLED = 0,
+ SOFT_OFFLINE_ENABLED_SKIP_HUGETLB,
+ SOFT_OFFLINE_ENABLED
+};
+
static int do_soft_offline(int fd, size_t len, int expect_errno)
{
char *filemap = NULL;
@@ -83,7 +94,7 @@ static int set_enable_soft_offline(int value)
char cmd[256] = {0};
FILE *cmdfile = NULL;
- if (value != 0 && value != 1)
+ if (value < SOFT_OFFLINE_DISABLED || value > SOFT_OFFLINE_ENABLED)
return -EINVAL;
sprintf(cmd, "echo %d > /proc/sys/vm/enable_soft_offline", value);
@@ -155,7 +166,7 @@ static int create_hugetlbfs_file(struct statfs *file_stat)
static void test_soft_offline_common(int enable_soft_offline)
{
int fd;
- int expect_errno = enable_soft_offline ? 0 : EOPNOTSUPP;
+ int expect_errno = (enable_soft_offline == SOFT_OFFLINE_ENABLED) ? 0 : EOPNOTSUPP;
struct statfs file_stat;
unsigned long hugepagesize_kb = 0;
unsigned long nr_hugepages_before = 0;
@@ -198,7 +209,7 @@ static void test_soft_offline_common(int enable_soft_offline)
// No need for the hugetlbfs file from now on.
close(fd);
- if (enable_soft_offline) {
+ if (enable_soft_offline == SOFT_OFFLINE_ENABLED) {
if (nr_hugepages_before != nr_hugepages_after + 1) {
ksft_test_result_fail("MADV_SOFT_OFFLINE should reduced 1 hugepage\n");
return;
@@ -219,10 +230,11 @@ static void test_soft_offline_common(int enable_soft_offline)
int main(int argc, char **argv)
{
ksft_print_header();
- ksft_set_plan(2);
+ ksft_set_plan(3);
- test_soft_offline_common(1);
- test_soft_offline_common(0);
+ test_soft_offline_common(SOFT_OFFLINE_ENABLED);
+ test_soft_offline_common(SOFT_OFFLINE_ENABLED_SKIP_HUGETLB);
+ test_soft_offline_common(SOFT_OFFLINE_DISABLED);
ksft_finished();
}
--
2.51.0
The __always_unused macro is used in several tests. Add it to kselftest.h,
which is a central location that the tests already include. Then use the
new macro in those tests.
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
tools/testing/selftests/kselftest.h | 4 ++++
tools/testing/selftests/mm/protection_keys.c | 2 +-
tools/testing/selftests/net/ovpn/ovpn-cli.c | 3 ++-
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kselftest.h b/tools/testing/selftests/kselftest.h
index 661d31c4b558c..274480e3573ab 100644
--- a/tools/testing/selftests/kselftest.h
+++ b/tools/testing/selftests/kselftest.h
@@ -92,6 +92,10 @@
#endif
#define __printf(a, b) __attribute__((format(printf, a, b)))
+#ifndef __always_unused
+#define __always_unused __attribute__((__unused__))
+#endif
+
#ifndef __maybe_unused
#define __maybe_unused __attribute__((__unused__))
#endif
diff --git a/tools/testing/selftests/mm/protection_keys.c b/tools/testing/selftests/mm/protection_keys.c
index 6281d4c61b50e..2085982dba696 100644
--- a/tools/testing/selftests/mm/protection_keys.c
+++ b/tools/testing/selftests/mm/protection_keys.c
@@ -1302,7 +1302,7 @@ static void test_mprotect_with_pkey_0(int *ptr, u16 pkey)
static void test_ptrace_of_child(int *ptr, u16 pkey)
{
- __attribute__((__unused__)) int peek_result;
+ __always_unused int peek_result;
pid_t child_pid;
void *ignored = 0;
long ret;
diff --git a/tools/testing/selftests/net/ovpn/ovpn-cli.c b/tools/testing/selftests/net/ovpn/ovpn-cli.c
index 9201f2905f2ce..688a5fa6fdacd 100644
--- a/tools/testing/selftests/net/ovpn/ovpn-cli.c
+++ b/tools/testing/selftests/net/ovpn/ovpn-cli.c
@@ -32,9 +32,10 @@
#include <sys/socket.h>
+#include "../../kselftest.h"
+
/* defines to make checkpatch happy */
#define strscpy strncpy
-#define __always_unused __attribute__((__unused__))
/* libnl < 3.5.0 does not set the NLA_F_NESTED on its own, therefore we
* have to explicitly do it to prevent the kernel from failing upon
--
2.47.3
These three patches fix the va_high_addr_switch.sh test failure on x86_64.
Patch 1 fixes a hugepage setup issue: nr_hugepages is reset too early in
run_vmtests.sh, which breaks the later va_high_addr_switch testing.
Patch 2 adds hugepage setup to the va_high_addr_switch test, so that it can
still work if run_vmtests.sh changes the hugepage setup someday.
Patch 3 fixes the test failure caused by the hint addr align method change
in hugetlb_get_unmapped_area().
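To make the alignment point in patch 3 concrete, here is a small sketch
of rounding a hint address to a hugepage boundary instead of a base page
boundary. The helper align_up() is a generic illustration and is not
taken from the test; whether the test rounds up or down, and which
hugepage size it uses, are not stated in this cover letter, so any size
passed in (2 MiB, say) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative helper: round 'addr' up to the next multiple of 'align'.
 * 'align' must be a non-zero power of two, e.g. a hugepage size.
 */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);

	return (addr + align - 1) & ~(align - 1);
}
```

An address that is only 4 KiB aligned, such as one page below the
47-bit boundary, moves to the next 2 MiB boundary, while an address that
is already hugepage aligned is left unchanged.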
Changes in v3:
- patch 1 adds the Acked-by from David
- patch 3 changes the mmap hint addr to hugepage aligned from page aligned
Changes in v2:
- patch 1 renames nr_hugepgs_origin to orig_nr_hugepgs
- add a patch 2 to set up hugepages in va_high_addr_switch test
Chunyu Hu (3):
selftests/mm: fix hugepages cleanup too early
selftests/mm: alloc hugepages in va_high_addr_switch test
selftests/mm: fix va_high_addr_switch.sh failure on x86_64
tools/testing/selftests/mm/run_vmtests.sh | 9 ++++-
.../selftests/mm/va_high_addr_switch.c | 4 +-
.../selftests/mm/va_high_addr_switch.sh | 37 +++++++++++++++++++
3 files changed, 46 insertions(+), 4 deletions(-)
--
2.49.0
Some high-level virtual drivers need to compute features from their
lower devices, but each currently has its own implementation and may
miss some feature computations. This patch set introduces a common function
to compute features for such devices.
Currently, bonding, team, and bridge have been updated to use the new
helper.
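The general idea of deriving an upper device's features from its lower
devices can be sketched in plain C as below. This is a simplified model
and not the helper added by the series: the function name
compute_upper_features(), the two mask parameters, and the choice to
return no features when there are no lowers are all assumptions made for
illustration. Features in all_mask are kept only if every lower device
has them; features in any_mask are kept if at least one lower has them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: combine the feature bitmasks of 'n' lower devices.
 *   all_mask - bits that must be present on every lower device
 *   any_mask - bits that are enough to find on a single lower device
 */
static uint64_t compute_upper_features(const uint64_t *lower, size_t n,
				       uint64_t all_mask, uint64_t any_mask)
{
	uint64_t all = all_mask;
	uint64_t any = 0;
	size_t i;

	assert(lower != NULL || n == 0);

	if (n == 0)
		return 0;	/* model assumption: no lowers, no features */

	for (i = 0; i < n; i++) {
		all &= lower[i];		/* intersection */
		any |= lower[i] & any_mask;	/* union, limited to any_mask */
	}

	return all | any;
}
```

Keeping this logic in one place is what lets bonding, team, and bridge
share it rather than each carrying a slightly different copy.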
v3:
a) fix hw_enc_features assign order (Sabrina Dubroca)
b) set virtual dev feature definition in netdev_features.h (Jakub Kicinski)
c) remove unneeded err in team_del_slave (Stanislav Fomichev)
d) remove NETIF_F_HW_ESP test as it needs to be tested with GSO pkts (Sabrina Dubroca)
v2:
a) remove hard_header_len setting. I will set needed_headroom for bond/team
in a separate patch as bridge has its own ways. (Ido Schimmel)
b) Add test file to Makefile, set RET=0 to a proper location. (Ido Schimmel)
Hangbin Liu (5):
net: add a common function to compute features from lowers devices
bonding: use common function to compute the features
team: use common function to compute the features
net: bridge: use common function to compute the features
selftests/net: add offload checking test for virtual interface
drivers/net/bonding/bond_main.c | 99 +-------------
drivers/net/team/team_core.c | 78 +----------
include/linux/netdev_features.h | 18 +++
include/linux/netdevice.h | 1 +
net/bridge/br_if.c | 22 +---
net/core/dev.c | 76 +++++++++++
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/config | 1 +
tools/testing/selftests/net/vdev_offload.sh | 137 ++++++++++++++++++++
9 files changed, 246 insertions(+), 187 deletions(-)
create mode 100755 tools/testing/selftests/net/vdev_offload.sh
--
2.50.1
This patch series introduces support for the PROBE_MEM32,
bpf_addr_space_cast and PROBE_ATOMIC instructions in the powerpc BPF JIT,
facilitating the implementation of BPF arena and arena atomics.
All selftests related to bpf_arena and bpf_arena_atomic (except
load_acquire/store_release) are passing:
# ./test_progs -t arena_list
#5/1 arena_list/arena_list_1:OK
#5/2 arena_list/arena_list_1000:OK
#5 arena_list:OK
Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
# ./test_progs -t arena_htab
#4/1 arena_htab/arena_htab_llvm:OK
#4/2 arena_htab/arena_htab_asm:OK
#4 arena_htab:OK
Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
# ./test_progs -t verifier_arena
#464/1 verifier_arena/basic_alloc1:OK
#464/2 verifier_arena/basic_alloc2:OK
#464/3 verifier_arena/basic_alloc3:OK
#464/4 verifier_arena/iter_maps1:OK
#464/5 verifier_arena/iter_maps2:OK
#464/6 verifier_arena/iter_maps3:OK
#464 verifier_arena:OK
#465/1 verifier_arena_large/big_alloc1:OK
#465/2 verifier_arena_large/big_alloc2:OK
#465 verifier_arena_large:OK
Summary: 2/8 PASSED, 0 SKIPPED, 0 FAILED
# ./test_progs -t arena_atomics
#3/1 arena_atomics/add:OK
#3/2 arena_atomics/sub:OK
#3/3 arena_atomics/and:OK
#3/4 arena_atomics/or:OK
#3/5 arena_atomics/xor:OK
#3/6 arena_atomics/cmpxchg:OK
#3/7 arena_atomics/xchg:OK
#3/8 arena_atomics/uaf:OK
#3/9 arena_atomics/load_acquire:SKIP
#3/10 arena_atomics/store_release:SKIP
#3 arena_atomics:OK (SKIP: 2/10)
Summary: 1/8 PASSED, 2 SKIPPED, 0 FAILED
Changes since v2:
* Dropped arena_spin_lock selftest fix patch from the patchset as it has
to go via bpf-next while these changes will go via powerpc tree.
v2:https://lore.kernel.org/all/20250829165135.1273071-1-skb99@linux.ibm.com/
Changes since v1:
Addressed comments from Chris:
* Squashed introduction of bpf_jit_emit_probe_mem_store() and its usage in
one patch.
* Defined and used PPC_RAW_RLDICL_DOT to avoid the CMPDI.
* Removed conditional statement for fixup[0] = PPC_RAW_LI(dst_reg, 0);
* Indicated this change is limited to powerpc64 in subject.
Addressed comments from Alexei:
* Removed skel->rodata->nr_cpus = get_nprocs() and its usage to get
currently online cpus (as it needs to be updated from userspace).
Addressed comments from Hari:
* Updated the bpf jit stack layout and associated macros to accommodate
new NVR.
v1:https://lore.kernel.org/all/20250805062747.3479221-1-skb99@linux.ibm.com/
Saket Kumar Bhaskar (4):
powerpc64/bpf: Implement PROBE_MEM32 pseudo instructions
powerpc64/bpf: Implement bpf_addr_space_cast instruction
powerpc64/bpf: Introduce bpf_jit_emit_atomic_ops() to emit atomic
instructions
powerpc64/bpf: Implement PROBE_ATOMIC instructions
arch/powerpc/include/asm/ppc-opcode.h | 1 +
arch/powerpc/net/bpf_jit.h | 6 +-
arch/powerpc/net/bpf_jit_comp.c | 32 +-
arch/powerpc/net/bpf_jit_comp32.c | 2 +-
arch/powerpc/net/bpf_jit_comp64.c | 401 +++++++++++++++++++-------
5 files changed, 330 insertions(+), 112 deletions(-)
--
2.43.5
Some high-level virtual drivers need to compute features from their
lower devices, but each currently has its own implementation and may
miss some feature computations. This patch set introduces a common function
to compute features for such devices.
Currently, bonding, team, and bridge have been updated to use the new
helper.
Hangbin Liu (5):
net: add a common function to compute features from lowers devices
bonding: use common function to compute the features
team: use common function to compute the features
net: bridge: use common function to compute the features
selftests/net: add offload checking test for virtual interface
drivers/net/bonding/bond_main.c | 99 +----------
drivers/net/team/team_core.c | 73 +-------
include/linux/netdevice.h | 19 +++
net/bridge/br_if.c | 22 +--
net/core/dev.c | 79 +++++++++
tools/testing/selftests/net/config | 2 +
tools/testing/selftests/net/vdev_offload.sh | 174 ++++++++++++++++++++
7 files changed, 285 insertions(+), 183 deletions(-)
create mode 100755 tools/testing/selftests/net/vdev_offload.sh
--
2.50.1
From: Jack Thomson <jackabt(a)amazon.com>
Overview:
This patch series adds ARM64 support for the KVM_PRE_FAULT_MEMORY
feature, which was previously only available on x86 [1]. This allows
a reduction in the number of stage-2 faults during execution. This is
beneficial in post-copy migration scenarios, particularly in memory
intensive applications, where high latencies are experienced due to
the stage-2 faults when pre-populating memory via UFFD / memcpy.
Patch Overview:
- The first patch is a preparatory refactor.
- The second patch is adding a page walk flag for pre-faulting.
- The third patch adds support for the KVM_PRE_FAULT_MEMORY ioctl
on arm64.
- The fourth patch fixes an issue with unaligned mmap allocations
in the selftests.
- The fifth patch updates the pre_fault_memory_test to support
arm64.
- The last patch extends the pre_fault_memory_test to cover
different vm memory backings.
[1]: https://lore.kernel.org/kvm/20240710174031.312055-1-pbonzini@redhat.com
Jack Thomson (6):
KVM: arm64: Add __gmem_abort and __user_mem_abort
KVM: arm64: Add KVM_PGTABLE_WALK_PRE_FAULT walk flag
KVM: arm64: Add pre_fault_memory implementation
KVM: selftests: Fix unaligned mmap allocations
KVM: selftests: Enable pre_fault_memory_test for arm64
KVM: selftests: Add option for different backing in pre-fault tests
arch/arm64/include/asm/kvm_pgtable.h | 3 +
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/hyp/pgtable.c | 6 +-
arch/arm64/kvm/mmu.c | 97 +++++++++++++--
tools/testing/selftests/kvm/Makefile.kvm | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 12 +-
.../selftests/kvm/pre_fault_memory_test.c | 110 +++++++++++++-----
8 files changed, 186 insertions(+), 45 deletions(-)
base-commit: 42188667be387867d2bf763d028654cbad046f7b
--
2.43.0
Add an operation, SECCOMP_CLONE_FILTER, that can copy the seccomp
filters from another process to the current process.
Changes from v1 to v2:
* Fixed locking issues. Thanks Al, Alexei, and Kees :)
* Allow filters to be cloned if CAP_SYS_ADMIN or no new privs
is set
* I initially had only CAP_SYS_ADMIN, but I can't think of a
way no new privs is harmful here, so I added it. Thanks, Kees
* Switch to passing in pidfd directly rather than a pointer to a
pidfd
* This more closely aligns with other pidfd syscalls
* Fixed warning in the sample code reported by the test robot
* Various cleanups and improvements in the selftest
Note that I left in the restriction that the target process
has no seccomp filters already loaded. I could see this
limitation being removed in a later patchset, but there are
requests for this feature at present.
Finally, I re-ran the performance numbers and updated the patch
with the latest numbers. The locking changes significantly sped
up the clone operation, and it's now ~1900x faster than the
current method.
Tom Hromatka (1):
seccomp: Add SECCOMP_CLONE_FILTER operation
.../userspace-api/seccomp_filter.rst | 10 ++
include/uapi/linux/seccomp.h | 1 +
kernel/seccomp.c | 48 ++++++
samples/seccomp/.gitignore | 1 +
samples/seccomp/Makefile | 2 +-
samples/seccomp/clone-filter.c | 150 ++++++++++++++++++
tools/include/uapi/linux/seccomp.h | 1 +
tools/testing/selftests/seccomp/seccomp_bpf.c | 114 +++++++++++++
8 files changed, 326 insertions(+), 1 deletion(-)
create mode 100644 samples/seccomp/clone-filter.c
--
2.47.3
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v17 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28, and it
will be RFC9768.
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best Regards,
Chia-Yu
---
v17 (8-Sep-2025)
- Change tcp_ecn_mode_max from 5 to 2 to prevent enabling AccECN before the whole AccECN feature has been accepted
v16 (6-Sep-2025)
- Use TCP_ECN_IN_ACCECN_OUT_ACCECN, TCP_ECN_IN_ECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_ECN in comments of tcp_ecn_send_syn() (Eric Dumazet <edumazet(a)google.com>)
- Add tcpi_accecn_fail_mode and tcpi_accecn_opt_seen to make tcp_info be multiple of 64 bits in patch #12
v15 (14-Aug-2025)
- Update pahole results in commit messages
- Accurate ECN will become RFC9768
v14 (22-Jul-2025)
- Add missing const for struct tcp_sock of tcp_accecn_option_beacon_check() of #11 (Simon Horman <horms(a)kernel.org>)
v13 (18-Jul-2025)
- Implement tcp_accecn_extract_syn_ect() and tcp_accecn_reflector_flags() with static array lookup of patch #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Fix typos in comments of #6 and remove patch #7 of v12 about simultaneous connect (Paolo Abeni <pabeni(a)redhat.com>)
- Move TCP_ACCECN_E1B_INIT_OFFSET, TCP_ACCECN_E0B_INIT_OFFSET, and TCP_ACCECN_CEB_INIT_OFFSET from patch #7 to #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use static array lookup in tcp_accecn_optfield_to_ecnfield() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return false when WARN_ON_ONCE() is true in tcp_accecn_process_option() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Make synack_ecn_bytes as static const array and use const u32 pointer in tcp_options_write() of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use ALIGN() and ALIGN_DOWN() in tcp_options_fit_accecn() to pad TCP AccECN option to dword of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return TCP_ACCECN_OPT_FAIL_SEEN if WARN_ON_ONCE() is true in tcp_accecn_option_init() of #12 (Paolo Abeni <pabeni(a)redhat.com>)
v12 (04-Jul-2025)
- Fix compilation issues with some intermediate patches in v11
- Add more comments for AccECN helpers of tcp_ecn.h
v11 (03-Jul-2025)
- Fix compilation issues with some intermediate patches in v10
v10 (02-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separate patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() around every read of a sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)
v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEEN flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accurate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restructure the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the first retransmitted SYN still attempting AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)
v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)
v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results
v6 (09-May-2025)
- Add #3 to utilize existing holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use existing holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for readability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
---
Chia-Yu Chang (5):
tcp: reorganize tcp_sock_write_txrx group for variables later
tcp: ecn functions in separated include file
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (9):
tcp: reorganize SYN ECN code
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics
Documentation/networking/ip-sysctl.rst | 55 +-
.../networking/net_cachelines/tcp_sock.rst | 12 +
include/linux/tcp.h | 32 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 87 ++-
include/net/tcp_ecn.h | 642 ++++++++++++++++++
include/uapi/linux/tcp.h | 9 +
net/ipv4/syncookies.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 30 +-
net/ipv4/tcp_input.c | 353 ++++++++--
net/ipv4/tcp_ipv4.c | 8 +-
net/ipv4/tcp_minisocks.c | 40 +-
net/ipv4/tcp_output.c | 294 ++++++--
net/ipv6/syncookies.c | 2 +
net/ipv6/tcp_ipv6.c | 1 +
16 files changed, 1406 insertions(+), 184 deletions(-)
create mode 100644 include/net/tcp_ecn.h
--
2.34.1
There are currently no kernel tests that verify setting and getting
options of the team driver.
In the future, options may be added that implicitly change other
options, which will make it useful to have tests like these that show
nothing breaks. There will be a follow up patch to this that adds new
"rx_enabled" and "tx_enabled" options, which will implicitly affect the
"enabled" option value and vice versa.
The tests use teamnl to first set options to specific values and then
get them to compare against the set values.
Signed-off-by: Marc Harvey <marcharvey(a)google.com>
---
Changes in v3:
- Applied minor style changes based on v2 feedback.
- Link to v2: https://lore.kernel.org/netdev/20250904015424.1228665-1-marcharvey@google.c…
Changes in v2:
- Fixed shellcheck failures.
- Fixed test failing in vng by adding a config option to enable the
team driver's active backup mode.
- Link to v1: https://lore.kernel.org/netdev/20250902235504.4190036-1-marcharvey@google.c…
.../selftests/drivers/net/team/Makefile | 6 +-
.../testing/selftests/drivers/net/team/config | 1 +
.../selftests/drivers/net/team/options.sh | 188 ++++++++++++++++++
3 files changed, 193 insertions(+), 2 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/team/options.sh
diff --git a/tools/testing/selftests/drivers/net/team/Makefile b/tools/testing/selftests/drivers/net/team/Makefile
index eaf6938f100e..89d854c7e674 100644
--- a/tools/testing/selftests/drivers/net/team/Makefile
+++ b/tools/testing/selftests/drivers/net/team/Makefile
@@ -1,11 +1,13 @@
# SPDX-License-Identifier: GPL-2.0
# Makefile for net selftests
-TEST_PROGS := dev_addr_lists.sh propagation.sh
+TEST_PROGS := dev_addr_lists.sh propagation.sh options.sh
TEST_INCLUDES := \
../bonding/lag_lib.sh \
../../../net/forwarding/lib.sh \
- ../../../net/lib.sh
+ ../../../net/lib.sh \
+ ../../../net/in_netns.sh \
+ ../../../net/lib/sh/defer.sh
include ../../../lib.mk
diff --git a/tools/testing/selftests/drivers/net/team/config b/tools/testing/selftests/drivers/net/team/config
index 636b3525b679..558e1d0cf565 100644
--- a/tools/testing/selftests/drivers/net/team/config
+++ b/tools/testing/selftests/drivers/net/team/config
@@ -3,4 +3,5 @@ CONFIG_IPV6=y
CONFIG_MACVLAN=y
CONFIG_NETDEVSIM=m
CONFIG_NET_TEAM=y
+CONFIG_NET_TEAM_MODE_ACTIVEBACKUP=y
CONFIG_NET_TEAM_MODE_LOADBALANCE=y
diff --git a/tools/testing/selftests/drivers/net/team/options.sh b/tools/testing/selftests/drivers/net/team/options.sh
new file mode 100755
index 000000000000..44888f32b513
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/team/options.sh
@@ -0,0 +1,188 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# These tests verify basic set and get functionality of the team
+# driver options over netlink.
+
+# Run in private netns.
+test_dir="$(dirname "$0")"
+if [[ $# -eq 0 ]]; then
+ "${test_dir}"/../../../net/in_netns.sh "$0" __subprocess
+ exit $?
+fi
+
+ALL_TESTS="
+ team_test_options
+"
+
+source "${test_dir}/../../../net/lib.sh"
+
+TEAM_PORT="team0"
+MEMBER_PORT="dummy0"
+
+setup()
+{
+ ip link add name "${MEMBER_PORT}" type dummy
+ ip link add name "${TEAM_PORT}" type team
+}
+
+get_and_check_value()
+{
+ local option_name="$1"
+ local expected_value="$2"
+ local port_flag="$3"
+
+ local value_from_get
+
+ if ! value_from_get=$(teamnl "${TEAM_PORT}" getoption "${option_name}" \
+ "${port_flag}"); then
+ echo "Could not get option '${option_name}'" >&2
+ return 1
+ fi
+
+ if [[ "${value_from_get}" != "${expected_value}" ]]; then
+ echo "Incorrect value for option '${option_name}'" >&2
+ echo "get (${value_from_get}) != set (${expected_value})" >&2
+ return 1
+ fi
+}
+
+set_and_check_get()
+{
+ local option_name="$1"
+ local option_value="$2"
+ local port_flag="$3"
+
+ local value_from_get
+
+ if ! teamnl "${TEAM_PORT}" setoption "${option_name}" \
+ "${option_value}" "${port_flag}"; then
+ echo "'setoption ${option_name} ${option_value}' failed" >&2
+ return 1
+ fi
+
+ get_and_check_value "${option_name}" "${option_value}" "${port_flag}"
+ return $?
+}
+
+# Get a "port flag" to pass to the `teamnl` command.
+# E.g. $1="dummy0" -> "--port=dummy0",
+# $1="" -> ""
+get_port_flag()
+{
+ local port_name="$1"
+
+ if [[ -n "${port_name}" ]]; then
+ echo "--port=${port_name}"
+ fi
+}
+
+attach_port_if_specified()
+{
+ local port_name="$1"
+
+ if [[ -n "${port_name}" ]]; then
+ ip link set dev "${port_name}" master "${TEAM_PORT}"
+ return $?
+ fi
+}
+
+detach_port_if_specified()
+{
+ local port_name="$1"
+
+ if [[ -n "${port_name}" ]]; then
+ ip link set dev "${port_name}" nomaster
+ return $?
+ fi
+}
+
+# Test that an option's get value matches its set value.
+# Globals:
+# RET - Used by testing infra like `check_err`.
+# EXIT_STATUS - Used by `log_test` for whole script exit value.
+# Arguments:
+# option_name - The name of the option.
+# value_1 - The first value to try setting.
+# value_2 - The second value to try setting.
+# port_name - The (optional) name of the attached port.
+team_test_option()
+{
+ local option_name="$1"
+ local value_1="$2"
+ local value_2="$3"
+ local possible_values="$2 $3 $2"
+ local port_name="$4"
+ local port_flag
+
+ RET=0
+
+ echo "Setting '${option_name}' to '${value_1}' and '${value_2}'"
+
+ attach_port_if_specified "${port_name}"
+ check_err $? "Couldn't attach ${port_name} to master"
+ port_flag=$(get_port_flag "${port_name}")
+
+ # Set and get both possible values.
+ for value in ${possible_values}; do
+ set_and_check_get "${option_name}" "${value}" "${port_flag}"
+ check_err $? "Failed to set '${option_name}' to '${value}'"
+ done
+
+ detach_port_if_specified "${port_name}"
+ check_err $? "Couldn't detach ${port_name} from its master"
+
+ log_test "Set + Get '${option_name}' test"
+}
+
+# Test that getting a non-existent option fails.
+# Globals:
+# RET - Used by testing infra like `check_err`.
+# EXIT_STATUS - Used by `log_test` for whole script exit value.
+# Arguments:
+# option_name - The name of the option.
+# port_name - The (optional) name of the attached port.
+team_test_get_option_fails()
+{
+ local option_name="$1"
+ local port_name="$2"
+ local port_flag
+
+ RET=0
+
+ attach_port_if_specified "${port_name}"
+ check_err $? "Couldn't attach ${port_name} to master"
+ port_flag=$(get_port_flag "${port_name}")
+
+ # Just confirm that getting the value fails.
+ teamnl "${TEAM_PORT}" getoption "${option_name}" "${port_flag}"
+ check_fail $? "Shouldn't be able to get option '${option_name}'"
+
+ detach_port_if_specified "${port_name}"
+
+ log_test "Get '${option_name}' fails"
+}
+
+team_test_options()
+{
+ # Wrong option name behavior.
+ team_test_get_option_fails fake_option1
+ team_test_get_option_fails fake_option2 "${MEMBER_PORT}"
+
+ # Correct set and get behavior.
+ team_test_option mode activebackup loadbalance
+ team_test_option notify_peers_count 0 5
+ team_test_option notify_peers_interval 0 5
+ team_test_option mcast_rejoin_count 0 5
+ team_test_option mcast_rejoin_interval 0 5
+ team_test_option enabled true false "${MEMBER_PORT}"
+ team_test_option user_linkup true false "${MEMBER_PORT}"
+ team_test_option user_linkup_enabled true false "${MEMBER_PORT}"
+ team_test_option priority 10 20 "${MEMBER_PORT}"
+ team_test_option queue_id 0 1 "${MEMBER_PORT}"
+}
+
+require_command teamnl
+setup
+tests_run
+exit "${EXIT_STATUS}"
--
2.51.0.384.g4c02a37b29-goog
From: Zhou Yuhang <zhouyuhang(a)kylinos.cn>
On x86_64, the size of struct flock is 32 bytes,
and the layout of this structure may be as follows:
+------------+ offset 0
| l_type | 2 bytes
+------------+ offset 2
| l_whence | 2 bytes
+------------+ offset 4
| padding | 4 bytes
+------------+ offset 8
| l_start | 8 bytes
+------------+ offset 16
| l_len | 8 bytes
+------------+ offset 24
| l_pid | 4 bytes
+------------+ offset 28
| padding | 4 bytes
+------------+ offset 32
The struct flock variables fl and fl2 are not initialized after definition.
The padding bytes in the structure may contain random values,
which could cause memcmp() to return a non-zero value,
potentially leading to test failure. The output is as follows:
# [INFO] opened fds 3 4
# [SUCCESS] set OFD read lock on first fd
# [SUCCESS] read and write locks conflicted
# [SUCCESS] F_UNLCK test returns: locked, type 0 pid -1 len 3
# [FAIL] F_UNLCK test returns: locked, type 0 pid -1 len 3
Initialize them to zero to solve this problem.
Signed-off-by: Zhou Yuhang <zhouyuhang(a)kylinos.cn>
---
changes in v2:
- Add a description of the struct flock layout to the commit message.
---
tools/testing/selftests/filelock/ofdlocks.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/filelock/ofdlocks.c b/tools/testing/selftests/filelock/ofdlocks.c
index a55b79810ab2..84e25505bebb 100644
--- a/tools/testing/selftests/filelock/ofdlocks.c
+++ b/tools/testing/selftests/filelock/ofdlocks.c
@@ -36,6 +36,8 @@ int main(void)
{
int rc;
struct flock fl, fl2;
+ memset(&fl, 0, sizeof(fl));
+ memset(&fl2, 0, sizeof(fl2));
int fd = open("/tmp/aa", O_RDWR | O_CREAT | O_EXCL, 0600);
int fd2 = open("/tmp/aa", O_RDONLY);
--
2.33.0
MADV_COLLAPSE is part of linux/mman.h and needs to be included
for this selftest for glibc compatibility. It is also included
in other tests that use MADV_COLLAPSE.
Fixes: d9c7ff4dae62 ("selftests: prctl: introduce tests for disabling THPs completely")
Reported-by: Mark Brown <broonie(a)kernel.org>
Closes: https://lore.kernel.org/all/c8249725-e91d-4c51-b9bb-40305e61e20d@sirena.org…
Signed-off-by: Usama Arif <usamaarif642(a)gmail.com>
---
tools/testing/selftests/mm/prctl_thp_disable.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/mm/prctl_thp_disable.c b/tools/testing/selftests/mm/prctl_thp_disable.c
index feb711dca3a1d..84b4a4b345af5 100644
--- a/tools/testing/selftests/mm/prctl_thp_disable.c
+++ b/tools/testing/selftests/mm/prctl_thp_disable.c
@@ -9,6 +9,7 @@
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
+#include <linux/mman.h>
#include <sys/prctl.h>
#include <sys/wait.h>
--
2.47.3
Fix a memory leak issue on netpoll and create a netconsole test that exposes
the problem, when run with kmemleak enabled.
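For readers who want to check for such a leak by hand, a minimal sketch follows. It assumes a kernel built with CONFIG_DEBUG_KMEMLEAK=y and debugfs mounted at /sys/kernel/debug; the helper name kmemleak_report is hypothetical and is not part of this series.

```shell
#!/bin/sh
# Sketch: trigger a kmemleak scan and print whatever it reports.
# Assumes CONFIG_DEBUG_KMEMLEAK=y and debugfs at /sys/kernel/debug.
# kmemleak_report is a hypothetical helper name.

# $1 = path to the kmemleak control file (optional)
kmemleak_report()
{
	kmemleak_file="${1:-/sys/kernel/debug/kmemleak}"

	# Bail out early if kmemleak is not available or not writable.
	if [ ! -w "${kmemleak_file}" ]; then
		echo "kmemleak not available at ${kmemleak_file}" >&2
		return 2
	fi

	# Writing "scan" requests an immediate memory scan.
	echo scan > "${kmemleak_file}" || return 1

	# Reading the file lists suspected leaks (empty if none).
	cat "${kmemleak_file}"
}
```

Run as root after the test, for example `kmemleak_report`; empty output means that scan reported nothing, and a second scan a little later may be needed before recently freed references show up.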
This series merges two patches that I previously sent individually
into a single patchset[1][2].
Link: https://lore.kernel.org/all/20250904-netconsole_torture-v2-0-5775ed5dc366@d… [1]
Link: https://lore.kernel.org/all/20250902165426.6d6cd172@kernel.org/ [2]
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes in v3:
- this patchset is a merge of the fix and the selftest together as
recommended by Jakub.
Changes in v2:
- Reuse the netconsole creation from lib_netcons.sh. Thus, refactoring
the create_dynamic_target() (Jakub)
- Move the "wait" to after all the messages have been sent.
- Link to v1: https://lore.kernel.org/r/20250902-netconsole_torture-v1-1-03c6066598e9@deb…
---
Breno Leitao (3):
netpoll: fix incorrect refcount handling causing incorrect cleanup
selftest: netcons: refactor target creation
selftest: netcons: create a torture test
net/core/netpoll.c | 7 +-
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 30 +++--
.../selftests/drivers/net/netcons_torture.sh | 127 +++++++++++++++++++++
4 files changed, 152 insertions(+), 13 deletions(-)
---
base-commit: d69eb204c255c35abd9e8cb621484e8074c75eaa
change-id: 20250902-netconsole_torture-8fc23f0aca99
Best regards,
--
Breno Leitao <leitao(a)debian.org>
sockmap_redir was introduced to comprehensively test the BPF redirection.
This series strives to increase the tested sockmap/sockhash code coverage;
adds support for skipping the actual redirect part, allowing to simply
SK_DROP or SK_PASS the packet.
BPF_MAP_TYPE_SOCKMAP
BPF_MAP_TYPE_SOCKHASH
x
sk_msg-to-egress
sk_msg-to-ingress
sk_skb-to-egress
sk_skb-to-ingress
x
AF_INET, SOCK_STREAM
AF_INET6, SOCK_STREAM
AF_INET, SOCK_DGRAM
AF_INET6, SOCK_DGRAM
AF_UNIX, SOCK_STREAM
AF_UNIX, SOCK_DGRAM
AF_VSOCK, SOCK_STREAM
AF_VSOCK, SOCK_SEQPACKET
x
SK_REDIRECT
SK_DROP
SK_PASS
Patch 5 ("Support no-redirect SK_DROP/SK_PASS") implements the feature.
Patches 3 ("Rename functions") and 4 ("Let test specify skel's
redirect_type") make preparatory changes.
I also took the opportunity to clean up (Patch 1, "Simplify try_recv()")
and improve a bit (Patch 2, "Fix OOB handling").
$ cd tools/testing/selftests/bpf
$ make
$ sudo ./test_progs -t sockmap_redir
...
Summary: 1/720 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Michal Luczaj <mhal(a)rbox.co>
---
Michal Luczaj (5):
selftests/bpf: sockmap_redir: Simplify try_recv()
selftests/bpf: sockmap_redir: Fix OOB handling
selftests/bpf: sockmap_redir: Rename functions
selftests/bpf: sockmap_redir: Let test specify skel's redirect_type
selftests/bpf: sockmap_redir: Support no-redirect SK_DROP/SK_PASS
.../selftests/bpf/prog_tests/sockmap_redir.c | 143 +++++++++++++++------
1 file changed, 105 insertions(+), 38 deletions(-)
---
base-commit: e8a6a9d3e8cc539d281e77b9df2439f223ec8153
change-id: 20250523-redir-test-pass-drop-2f2a5edca6e1
Best regards,
--
Michal Luczaj <mhal(a)rbox.co>
We recently missed detecting an issue during early testing because
the default (!all) tests would not trigger it and even when running
"all" tests it only would happen sometimes because of races.
So let's allow for an easy way to specify "GUP all pages in a single
call", extend the test matrix and extend our default (!all) tests.
By GUP'ing all pages in a single call, with the default size of 128MiB
we'll cover multiple leaf page tables / PMDs on architectures with sane
THP sizes.
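As a rough illustration of the arithmetic behind "-n -1" (a sketch, not part of the patch): the number of pages passed to a single GUP call is the region size divided by the page size. The 4 KiB page size and 2 MiB PMD coverage below are assumptions that match common x86-64 configurations, and the helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: page count for "GUP all pages in a single call".
# pages_in_region and pmds_spanned are hypothetical helper names.

# $1 = region size in bytes, $2 = page size in bytes
pages_in_region()
{
	echo $(( $1 / $2 ))
}

# $1 = region size in bytes, $2 = bytes covered by one PMD
# (assumes the region starts on a PMD boundary)
pmds_spanned()
{
	echo $(( ($1 + $2 - 1) / $2 ))
}

size=$(( 128 * 1024 * 1024 ))    # default size named above (128 MiB)
psize=4096                       # assumed base page size
pmd_size=$(( 2 * 1024 * 1024 ))  # assumed PMD coverage (2 MiB)

echo "pages per call: $(pages_in_region "${size}" "${psize}")"
echo "PMDs spanned:   $(pmds_spanned "${size}" "${pmd_size}")"
```

Under these assumptions a single call covers 32768 pages across 64 PMD-sized regions, whereas "-n 512" stays within one such region at a time.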
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Mike Rapoport <rppt(a)kernel.org>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Signed-off-by: David Hildenbrand <david(a)redhat.com>
---
tools/testing/selftests/mm/gup_test.c | 2 ++
tools/testing/selftests/mm/run_vmtests.sh | 8 +++++---
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/mm/gup_test.c b/tools/testing/selftests/mm/gup_test.c
index bdeaac67ff9aa..8900b840c17a7 100644
--- a/tools/testing/selftests/mm/gup_test.c
+++ b/tools/testing/selftests/mm/gup_test.c
@@ -139,6 +139,8 @@ int main(int argc, char **argv)
break;
case 'n':
nr_pages = atoi(optarg);
+ if (nr_pages < 0)
+ nr_pages = size / psize();
break;
case 't':
thp = 1;
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 9e88cc25b9df2..6240e579b3ba5 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -138,7 +138,7 @@ run_gup_matrix() {
# -n: How many pages to fetch together? 512 is special
# because it's default thp size (or 2M on x86), 123 to
# just test partial gup when hit a huge in whatever form
- for num in "-n 1" "-n 512" "-n 123"; do
+ for num in "-n 1" "-n 512" "-n 123" "-n -1"; do
CATEGORY="gup_test" run_test ./gup_test \
$huge $test_cmd $write $share $num
done
@@ -313,9 +313,11 @@ if $RUN_ALL; then
run_gup_matrix
else
# get_user_pages_fast() benchmark
- CATEGORY="gup_test" run_test ./gup_test -u
+ CATEGORY="gup_test" run_test ./gup_test -u -n 1
+ CATEGORY="gup_test" run_test ./gup_test -u -n -1
# pin_user_pages_fast() benchmark
- CATEGORY="gup_test" run_test ./gup_test -a
+ CATEGORY="gup_test" run_test ./gup_test -a -n 1
+ CATEGORY="gup_test" run_test ./gup_test -a -n -1
fi
# Dump pages 0, 19, and 4096, using pin_user_pages:
CATEGORY="gup_test" run_test ./gup_test -ct -F 0x1 0 19 0x1000
--
2.50.1
This series fixes issues in devlink_rate_tc_bw.py selftest that made
its checks unreliable and its documentation inconsistent with the
actual configuration.
V2:
- Dropped the patch that relaxed the total bandwidth check. Jakub
suggested addressing the instability with interval-based measurement
and by migrating to load.py. That will be handled in a follow-up.
- Link to V1: https://lore.kernel.org/netdev/20250831080641.1828455-1-cjubran@nvidia.com/
Thanks
Carolina Jubran (2):
selftests: drv-net: Fix and clarify TC bandwidth split in
devlink_rate_tc_bw.py
selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
.../drivers/net/hw/devlink_rate_tc_bw.py | 100 ++++++++----------
1 file changed, 43 insertions(+), 57 deletions(-)
--
2.38.1
┌────────────┐ ┌───────────────────────────────────┐ ┌────────────────┐
│ │ │ │ │ │
│ │ │ PCI Endpoint │ │ PCI Host │
│ │ │ │ │ │
│ │◄──┤ 1.platform_msi_domain_alloc_irqs()│ │ │
│ │ │ │ │ │
│ MSI ├──►│ 2.write_msi_msg() ├──►├─BAR<n> │
│ Controller │ │ update doorbell register address│ │ │
│ │ │ for BAR │ │ │
│ │ │ │ │ 3. Write BAR<n>│
│ │◄──┼───────────────────────────────────┼───┤ │
│ │ │ │ │ │
│ ├──►│ 4.Irq Handle │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
└────────────┘ └───────────────────────────────────┘ └────────────────┘
These patches are based on the old series at https://lore.kernel.org/imx/20221124055036.1630573-1-Frank.Li@nxp.com/
The original patch only targeted the vntb driver, but the method is
actually a common one.
These patches add a new API to pci-epf-core so that any EP driver can use it.
The previous v2 discussion is here:
https://lore.kernel.org/imx/20230911220920.1817033-1-Frank.Li@nxp.com/
Changes in v21:
- Align to BAR size to try to fix the problem reported by Niklas.
- Rebase to v6.16-rc5
- Link to v20: https://lore.kernel.org/r/20250709-ep-msi-v20-0-43d56f9bd54a@nxp.com
Changes in v20:
- remove the patch that sets the epf's of_node; only one epf is supported now.
- move the imx6 patch to the front
- for detailed changes, see each patch's change log
- Link to v19: https://lore.kernel.org/r/20250609-ep-msi-v19-0-77362eaa48fa@nxp.com
Changes in v19:
- the irq part is already in v6.16-rc1; only the pcie/dts part is missing
- rebase to v6.16-rc1
- update the commit message of the IMMUTABLE check patch.
- Link to v18: https://lore.kernel.org/r/20250414-ep-msi-v18-0-f69b49917464@nxp.com
Changes in v18:
- pci-ep.yaml: sort property order, fix maxvalue to 0x7ffff for msi-map-mask and
iommu-map-mask
- Link to v17: https://lore.kernel.org/r/20250407-ep-msi-v17-0-633ab45a31d0@nxp.com
Changes in v17:
- move document part to pci-ep.yaml
- Link to v16: https://lore.kernel.org/r/20250404-ep-msi-v16-0-d4919d68c0d0@nxp.com
Changes in v16:
- remove arm64: dts: imx95-19x19-evk: Add PCIe1 endpoint function overlay file
because there are better patches, which are under review.
- Add documentation for pcie-ep msi-map usage
- for other changes, see each patch's change log
About IMMUTABLE (no change for this part; tglx provided feedback)
> - This IMMUTABLE thing serves no purpose, because you don't randomly
> plug this end-point block on any MSI controller. They come as part
> of an SoC.
"Yes and no. The problem is that the EP implementation is meant to be a
generic library and while GIC-ITS guarantees immutability of the
address/data pair after setup, there are architectures (x86, loongson,
riscv) where the base MSI controller does not and immutability is only
achieved when interrupt remapping is enabled. The latter can be disabled
at boot-time and then the EP implementation becomes a lottery across
affinity changes.
That was my concern about this library implementation and that's why I
asked for a mechanism to ensure that the underlying irqdomain provides a
immutable address/data pair.
So it does not matter for GIC-ITS, but in the larger picture it matters.
Thanks,
tglx
"
- Link to v15: https://lore.kernel.org/r/20250211-ep-msi-v15-0-bcacc1f2b1a9@nxp.com
Changes in v15:
- rebase to v6.14-rc1
- fix build issue found by kernel test robot
- Link to v14: https://lore.kernel.org/r/20250207-ep-msi-v14-0-9671b136f2b8@nxp.com
Changes in v14:
Marc Zyngier raised concerns about adding DOMAIN_BUS_DEVICE_PCI_EP_MSI. As
a result, the approach has been reverted to the v9 method. However, there
are several improvements:
MSI now supports msi-map in addition to msi-parent.
- The struct device: id is used as the endpoint function (EPF) device
identity to map to the stream ID (sideband information).
- The EPC device tree source (DTS) utilizes msi-map to provide such
information.
- The EPF device's of_node is set to the EPC controller’s node. This
approach is commonly used for multi-function device (MFD) platform child
devices, allowing them to inherit properties from the MFD device’s DTS,
such as reset-cells and gpio-cells. This method is well-suited for the
current case, as the EPF is inherently created/bound to the EPC and
should inherit the EPC’s DTS node properties.
Additionally:
Since the basic IMX95 LUT support has already been merged into the
mainline, a DTS and driver increment patch is added to complete the
solution. The patch is rebased onto the latest linux-next tree and
aligned with the new pcitest framework.
- Link to v13: https://lore.kernel.org/r/20241218-ep-msi-v13-0-646e2192dc24@nxp.com
Changes in v13:
- Change to use DOMAIN_BUS_PCI_DEVICE_EP_MSI
- Change request id as func | vfunc << 3
- Remove IRQ_DOMAIN_MSI_IMMUTABLE
Thomas Gleixner:
I hope I captured all your points from the review comments. If I missed any, let me know.
- Link to v12: https://lore.kernel.org/r/20241211-ep-msi-v12-0-33d4532fa520@nxp.com
Changes in v12:
- Change to use IRQ_DOMAIN_MSI_IMMUTABLE and add helper function
irq_domain_msi_is_immuatble().
- split PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check to 3 patches
- Link to v11: https://lore.kernel.org/r/20241209-ep-msi-v11-0-7434fa8397bd@nxp.com
Changes in v11:
- Change to use MSI_FLAG_MSG_IMMUTABLE
- Link to v10: https://lore.kernel.org/r/20241204-ep-msi-v10-0-87c378dbcd6d@nxp.com
Changes in v10:
Thomas Gleixner:
There are big changes in pci-ep-msi.c. I am not sure if I am on the
correct path. The key improvement is removing the limitation of only one
function device.
I use a new patch for the immutable check, which is an additional
feature relative to the base enablement patch.
- Remove patch Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Add new patch irqchip/gic-v3-its: Avoid overwriting msi_prepare callback if provided by msi_domain_info
- Remove the limitation of supporting only one endpoint function.
- Create one MSI domain for each endpoint function device.
- Use "msi-map" in the PCI EP controller node instead of msi-parent. The first
argument is
(func_no << 8 | vfunc_no)
- Link to v9: https://lore.kernel.org/r/20241203-ep-msi-v9-0-a60dbc3f15dd@nxp.com
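The msi-map first argument described in the v10 changelog above packs the function and virtual function numbers into one value. A small sketch of that encoding follows, based only on the formula quoted there; the helper name and the sample numbers are hypothetical, and later revisions of the series changed the approach.

```shell
#!/bin/sh
# Sketch: encode (func_no << 8 | vfunc_no) as quoted in the v10 changelog.
# epf_msi_map_id is a hypothetical helper name.

# $1 = func_no, $2 = vfunc_no
epf_msi_map_id()
{
	echo $(( ($1 << 8) | $2 ))
}

# Function 1, virtual function 2 (sample values only).
printf '0x%04x\n' "$(epf_msi_map_id 1 2)"   # prints 0x0102
```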
Changes in v9:
- Add patch platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Remove patch PCI: endpoint: Add pci_epc_get_fn() API for customizable filtering
- Remove API pci_epf_align_inbound_addr_lo_hi
- Move doorbell_alloc into the doorbell_enable function.
- Link to v8: https://lore.kernel.org/r/20241116-ep-msi-v8-0-6f1f68ffd1bb@nxp.com
Changes in v8:
- update helper function name to pci_epf_align_inbound_addr()
- Link to v7: https://lore.kernel.org/r/20241114-ep-msi-v7-0-d4ac7aafbd2c@nxp.com
Changes in v7:
- Add helper function pci_epf_align_addr();
- Link to v6: https://lore.kernel.org/r/20241112-ep-msi-v6-0-45f9722e3c2a@nxp.com
Changes in v6:
- change doorbell_addr to doorbell_offset
- use round_down()
- add Niklas's Tested-by tag
- rebase to pci/endpoint
- Link to v5: https://lore.kernel.org/r/20241108-ep-msi-v5-0-a14951c0d007@nxp.com
Changes in v5:
- Move request_irq to the epf test function driver for more flexible use cases
- Add fixed size bar handler
- Some minor improvements; see each patch's changelog.
- Link to v4: https://lore.kernel.org/r/20241031-ep-msi-v4-0-717da2d99b28@nxp.com
Changes in v4:
- Remove patch genirq/msi: Add cleanup guard define for msi_lock_descs()/msi_unlock_descs()
- Use a new method to avoid compatibility problems.
Add new commands DOORBELL_ENABLE and DOORBELL_DISABLE.
pcitest -B sends DOORBELL_ENABLE first; the EP test function driver tries to
remap one of BAR_N (except the test register BAR) to the ITS MSI MMIO space. Old
drivers don't support the new command, so it returns failure with no side effects.
After the test, the DOORBELL_DISABLE command is sent to restore the original
mapping, so the pcitest BAR test can pass as normal.
- For other detailed changes, see each patch's changelog
- Link to v3: https://lore.kernel.org/r/20241015-ep-msi-v3-0-cedc89a16c1a@nxp.com
Change from v2 to v3
- Addressed Manivannan's comments
- Move common part to pci-ep-msi.c and pci-ep-msi.h
- rebase to 6.12-rc1
- use RevID to distinguish old version
mkdir /sys/kernel/config/pci_ep/functions/pci_epf_test/func1
echo 16 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/msi_interrupts
echo 0x080c > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/deviceid
echo 0x1957 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/vendorid
echo 1 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/revid
^^^^^^ to enable platform msi support.
ln -s /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 /sys/kernel/config/pci_ep/controllers/4c380000.pcie-ep
- use a new device ID, which identifies doorbell support, to avoid breaking
compatibility.
Enable doorbell support only for PCI_DEVICE_ID_IMX8_DB, while other devices
keep the same behavior as before.
EP side                 RC with old driver   RC with new driver
PCI_DEVICE_ID_IMX8_DB   no probe             doorbell enabled
Other device ID         doorbell disabled*   doorbell disabled*
* Behavior remains unchanged.
Change from v1 to v2
- Add missed patch for endpoint/pci-epf-test.c
- Move alloc and free to epc driver from epf.
- Provide general helper function for EPC driver to alloc platform msi irq.
- Addressed Manivannan's comments.
Signed-off-by: Frank Li <Frank.Li(a)nxp.com>
---
Frank Li (9):
PCI: imx6: Add helper function imx_pcie_add_lut_by_rid()
PCI: imx6: Add LUT configuration for MSI/IOMMU in Endpoint mode
PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check
PCI: endpoint: Add pci_epf_align_inbound_addr() helper for address alignment
PCI: endpoint: pci-epf-test: Add doorbell test support
misc: pci_endpoint_test: Add doorbell test case
selftests: pci_endpoint: Add doorbell test case
arm64: dts: imx95: Add msi-map for pci-ep device
Documentation/PCI/endpoint/pci-test-howto.rst | 14 +++
arch/arm64/boot/dts/freescale/imx95.dtsi | 1 +
drivers/misc/pci_endpoint_test.c | 85 ++++++++++++-
drivers/pci/controller/dwc/pci-imx6.c | 25 ++--
drivers/pci/endpoint/Kconfig | 8 ++
drivers/pci/endpoint/Makefile | 1 +
drivers/pci/endpoint/functions/pci-epf-test.c | 136 +++++++++++++++++++++
drivers/pci/endpoint/pci-ep-msi.c | 98 +++++++++++++++
drivers/pci/endpoint/pci-epf-core.c | 36 ++++++
include/linux/pci-ep-msi.h | 28 +++++
include/linux/pci-epf.h | 18 +++
include/uapi/linux/pcitest.h | 1 +
.../selftests/pci_endpoint/pci_endpoint_test.c | 28 +++++
13 files changed, 470 insertions(+), 9 deletions(-)
---
base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a
change-id: 20241010-ep-msi-8b4cab33b1be
Best regards,
--
Frank Li <Frank.Li(a)nxp.com>
Here are various unrelated fixes:
- Patch 1: Fix a wrong attribute type in the MPTCP Netlink specs. A fix
for v6.7.
- Patch 2: Avoid mentioning a deprecated MPTCP sysctl knob in the doc. A
fix for v6.15.
- Patch 3: Handle new warnings from ShellCheck v0.11.0. This prevents
some warnings reported by some CIs. If it is not good material for
'net', please drop it and I can resend it later, targeting 'net-next'.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (3):
netlink: specs: mptcp: fix if-idx attribute type
doc: mptcp: net.mptcp.pm_type is deprecated
selftests: mptcp: shellcheck: support v0.11.0
Documentation/netlink/specs/mptcp_pm.yaml | 2 +-
Documentation/networking/mptcp.rst | 8 ++++----
tools/testing/selftests/net/mptcp/diag.sh | 2 +-
tools/testing/selftests/net/mptcp/mptcp_connect.sh | 2 +-
tools/testing/selftests/net/mptcp/mptcp_join.sh | 2 +-
tools/testing/selftests/net/mptcp/mptcp_sockopt.sh | 2 +-
tools/testing/selftests/net/mptcp/pm_netlink.sh | 5 +++--
tools/testing/selftests/net/mptcp/simult_flows.sh | 2 +-
tools/testing/selftests/net/mptcp/userspace_pm.sh | 2 +-
9 files changed, 14 insertions(+), 13 deletions(-)
---
base-commit: e2a10daba84968f6b5777d150985fd7d6abc9c84
change-id: 20250908-net-mptcp-misc-fixes-6-17-rc5-7550f5f90b66
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
This is v10 of the TDX selftests.
This series is based on v6.17-rc4 and has a dependency on
"KVM: TDX: Force split irqchip for TDX at irqchip creation time" [1]
Changes from v9 [2]:
- Rebased on top of v6.17-rc4.
- Addressed the comments from v9.
- Removed special handling for split irqchip in the test code in favor
of the KVM fix in [1].
- Removed outdated support for VM memory not backed by guest_memfd.
- Split "KVM: selftests: Hook TDX support to vm and vcpu creation" into
4 separate patches.
[1] https://lore.kernel.org/lkml/20250904062007.622530-1-sagis@google.com/
[2] https://lore.kernel.org/lkml/20250821042915.3712925-1-sagis@google.com/
Ackerley Tng (2):
KVM: selftests: Add helpers to init TDX memory and finalize VM
KVM: selftests: Add ucall support for TDX
Erdem Aktas (2):
KVM: selftests: Add TDX boot code
KVM: selftests: Add support for TDX TDCALL from guest
Isaku Yamahata (2):
KVM: selftests: Update kvm_init_vm_address_properties() for TDX
KVM: selftests: TDX: Use KVM_TDX_CAPABILITIES to validate TDs'
attribute configuration
Sagi Shahar (15):
KVM: selftests: Allocate pgd in virt_map() as necessary
KVM: selftests: Expose functions to get default sregs values
KVM: selftests: Expose function to allocate guest vCPU stack
KVM: selftests: Expose segment definitons to assembly files
KVM: selftests: Add kbuild definitons
KVM: selftests: Define structs to pass parameters to TDX boot code
KVM: selftests: Set up TDX boot code region
KVM: selftests: Set up TDX boot parameters region
KVM: selftests: Add helper to initialize TDX VM
KVM: selftests: Call TDX init when creating a new TDX vm
KVM: selftests: Setup memory regions for TDX on vm creation
KVM: selftests: Call KVM_TDX_INIT_VCPU when creating a new TDX vcpu
KVM: selftests: Set entry point for TDX guest code
KVM: selftests: Add wrapper for TDX MMIO from guest
KVM: selftests: Add TDX lifecycle test
tools/include/linux/kbuild.h | 18 +
tools/testing/selftests/kvm/Makefile.kvm | 32 ++
.../selftests/kvm/include/x86/processor.h | 35 ++
.../selftests/kvm/include/x86/processor_asm.h | 12 +
.../selftests/kvm/include/x86/tdx/td_boot.h | 74 ++++
.../kvm/include/x86/tdx/td_boot_asm.h | 16 +
.../selftests/kvm/include/x86/tdx/tdcall.h | 34 ++
.../selftests/kvm/include/x86/tdx/tdx.h | 14 +
.../selftests/kvm/include/x86/tdx/tdx_util.h | 86 +++++
.../testing/selftests/kvm/include/x86/ucall.h | 4 +-
tools/testing/selftests/kvm/lib/kvm_util.c | 10 +-
.../testing/selftests/kvm/lib/x86/processor.c | 91 +++--
.../selftests/kvm/lib/x86/tdx/td_boot.S | 60 +++
.../kvm/lib/x86/tdx/td_boot_offsets.c | 21 ++
.../selftests/kvm/lib/x86/tdx/tdcall.S | 93 +++++
.../kvm/lib/x86/tdx/tdcall_offsets.c | 16 +
tools/testing/selftests/kvm/lib/x86/tdx/tdx.c | 23 ++
.../selftests/kvm/lib/x86/tdx/tdx_util.c | 354 ++++++++++++++++++
tools/testing/selftests/kvm/lib/x86/ucall.c | 45 ++-
tools/testing/selftests/kvm/x86/tdx_vm_test.c | 31 ++
20 files changed, 1032 insertions(+), 37 deletions(-)
create mode 100644 tools/include/linux/kbuild.h
create mode 100644 tools/testing/selftests/kvm/include/x86/processor_asm.h
create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/td_boot.h
create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/td_boot_asm.h
create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/tdcall.h
create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/tdx.h
create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/td_boot.S
create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/td_boot_offsets.c
create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/tdcall.S
create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/tdcall_offsets.c
create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/tdx.c
create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/tdx_util.c
create mode 100644 tools/testing/selftests/kvm/x86/tdx_vm_test.c
--
2.51.0.338.gd7d06c2dae-goog
During my testing, I found that guest debugging with 'DR6.BD' does not
work in instruction emulation, as the current code only considers the
guest's DR7. Upon reviewing the code, I also observed that the checks
for the userspace guest debugging feature and the guest's own debugging
feature are repeated in different places during instruction
emulation, but the overall logic is the same. If guest debugging
is enabled, it needs to exit to userspace; otherwise, a #DB
exception needs to be injected into the guest. Therefore, as
suggested by Jiangshan Lai, some cleanup has been done for #DB
handling in instruction emulation in this patchset. A new
function named 'kvm_inject_emulated_db()' is introduced to
consolidate all the checking logic. Moreover, I hope we can make
the #DB interception path use the same function as well.
Additionally, when I looked into the single-step #DB handling in
instruction emulation, I noticed that the interrupt shadow is toggled,
but it is not considered in the single-step #DB injection. This
oversight causes VM entry to fail on VMX (due to pending debug
exceptions checking) or breaks the 'MOV SS' suppressed #DB. For the
latter, I have kept the behavior for now in my patchset, as I need some
suggestions.
Hou Wenlong (7):
KVM: x86: Set guest DR6 by kvm_queue_exception_p() in instruction
emulation
KVM: x86: Check guest debug in DR access instruction emulation
KVM: x86: Only check effective code breakpoint in emulation
KVM: x86: Consolidate KVM_GUESTDBG_SINGLESTEP check into the
kvm_inject_emulated_db()
KVM: VMX: Set 'BS' bit in pending debug exceptions during instruction
emulation
KVM: selftests: Verify guest debug DR7.GD checking during instruction
emulation
KVM: selftests: Verify 'BS' bit checking in pending debug exception
during VM entry
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/emulate.c | 14 +--
arch/x86/kvm/kvm_emulate.h | 7 +-
arch/x86/kvm/vmx/main.c | 9 ++
arch/x86/kvm/vmx/vmx.c | 14 ++-
arch/x86/kvm/vmx/x86_ops.h | 1 +
arch/x86/kvm/x86.c | 109 +++++++++++-------
arch/x86/kvm/x86.h | 7 ++
.../selftests/kvm/include/x86/processor.h | 3 +-
tools/testing/selftests/kvm/x86/debug_regs.c | 64 +++++++++-
11 files changed, 167 insertions(+), 63 deletions(-)
base-commit: ecbcc2461839e848970468b44db32282e5059925
--
2.31.1
Unlike IPv4, IPv6 routing strictly requires the source address to be valid
on the outgoing interface. If the NS target is set to a remote VLAN interface,
and the source address is also configured on a VLAN over a bond interface,
setting the oif to the bond device will fail to retrieve the correct
destination route.
Fix this by not setting the oif to the bond device when retrieving the NS
target destination. This allows the correct destination device (the VLAN
interface) to be determined, so that bond_verify_device_path can return the
proper VLAN tags for sending NS messages.
Reported-by: David Wilder <wilder(a)us.ibm.com>
Closes: https://lore.kernel.org/netdev/aGOKggdfjv0cApTO@fedora/
Suggested-by: Jay Vosburgh <jv(a)jvosburgh.net>
Fixes: 4e24be018eb9 ("bonding: add new parameter ns_targets")
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
v2: split the patch into 2 parts, the kernel change and test update (Jay Vosburgh)
---
drivers/net/bonding/bond_main.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 257333c88710..30cf97f4e814 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3355,7 +3355,6 @@ static void bond_ns_send_all(struct bonding *bond, struct slave *slave)
/* Find out through which dev should the packet go */
memset(&fl6, 0, sizeof(struct flowi6));
fl6.daddr = targets[i];
- fl6.flowi6_oif = bond->dev->ifindex;
dst = ip6_route_output(dev_net(bond->dev), NULL, &fl6);
if (dst->error) {
--
2.50.1
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba(a)kernel.org>:
On Sun, 07 Sep 2025 17:32:41 +0200 you wrote:
> Currently, the MPTCP ADD_ADDR notifications are retransmitted after a
> fixed timeout controlled by the net.mptcp.add_addr_timeout sysctl knob,
> if the corresponding "echo" packets are not received before. This can be
> too slow (or too quick), especially with a too cautious default value
> set to 2 minutes.
>
> - Patch 1: make ADD_ADDR retransmission timeout adaptive, using the
> TCP's retransmission timeout. The corresponding sysctl knob is now
> used as a maximum value.
>
> [...]
Here is the summary with links:
- [net-next,1/3] mptcp: make ADD_ADDR retransmission timeout adaptive
https://git.kernel.org/netdev/net-next/c/30549eebc4d8
- [net-next,2/3] selftests: mptcp: join: tolerate more ADD_ADDR
https://git.kernel.org/netdev/net-next/c/63c31d42cf6f
- [net-next,3/3] selftests: mptcp: join: allow more time to send ADD_ADDR
https://git.kernel.org/netdev/net-next/c/e2cda6343bfe
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
Commit 5c3bf6cba791 ("bonding: assign random address if device address is
same as bond") fixed an issue where, after releasing the first slave and
re-adding it to the bond with fail_over_mac=follow, both the active and
backup slaves could end up with duplicate MAC addresses. To avoid this,
the new slave was assigned a random address.
However, if this happens when adding the very first slave, the bond’s
hardware address is set to match the slave’s. Later, during the
fail_over_mac=follow check, the slave’s MAC is randomized because it
naturally matches the bond, which is incorrect.
The issue is normally hidden since the first slave usually becomes the
active one, which restores the bond's MAC address. However, if another
slave is selected as the initial active interface, the issue becomes visible.
Fix this by assigning a random address only when slaves already exist in
the bond.
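The resulting condition can be modeled outside the kernel as a small predicate. This is only a sketch: struct bond_model and should_randomize_mac() are hypothetical stand-ins for the state that bond_enslave() consults, and slave_cnt > 0 plays the role of bond_has_slaves().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical stand-in for the bond state checked in bond_enslave(). */
struct bond_model {
	bool fom_follow;		/* fail_over_mac=follow */
	bool active_backup;		/* mode is active-backup */
	int slave_cnt;			/* slaves already in the bond */
	unsigned char mac[ETH_ALEN];	/* bond hardware address */
};

/* Randomize the new slave's MAC only if it would duplicate an existing one. */
static bool should_randomize_mac(const struct bond_model *bond,
				 const unsigned char *slave_mac)
{
	return bond->fom_follow &&
	       bond->active_backup &&
	       bond->slave_cnt > 0 &&	/* the check added by this fix */
	       memcmp(slave_mac, bond->mac, ETH_ALEN) == 0;
}
```

For the very first slave, slave_cnt is 0, so the predicate is false even though the bond has just inherited that slave's address.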
Fixes: 5c3bf6cba791 ("bonding: assign random address if device address is same as bond")
Reported-by: Qiuling Ren <qren(a)redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin(a)gmail.com>
---
drivers/net/bonding/bond_main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 257333c88710..8832bc9f107b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2132,6 +2132,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
memcpy(ss.__data, bond_dev->dev_addr, bond_dev->addr_len);
} else if (bond->params.fail_over_mac == BOND_FOM_FOLLOW &&
BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP &&
+ bond_has_slaves(bond) &&
memcmp(slave_dev->dev_addr, bond_dev->dev_addr, bond_dev->addr_len) == 0) {
/* Set slave to random address to avoid duplicate mac
* address in later fail over.
--
2.50.1
The pmtu test takes nearly an hour when run on a debug kernel
(10min on a normal kernel, so the debug slow down is quite significant).
NIPA tries to ensure all results are delivered by a certain deadline
so this prevents it from retrying the test in case of a flake.
Looks like one of the slowest operations in the test is calling out
to ./openvswitch/ovs-dpctl.py to remove potential leftover OvS interfaces.
Check whether the interfaces exist in the first place in sysfs,
since it can be done directly in bash it is very fast.
This should save us around 20-30% of the test runtime.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
tools/testing/selftests/net/pmtu.sh | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh
index 88e914c4eef9..a3323c21f001 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -1089,10 +1089,11 @@ cleanup() {
cleanup_all_ns
- ip link del veth_A-C 2>/dev/null
- ip link del veth_A-R1 2>/dev/null
- cleanup_del_ovs_internal
- cleanup_del_ovs_vswitchd
+ [ -e "/sys/class/net/veth_A-C" ] && ip link del veth_A-C
+ [ -e "/sys/class/net/veth_A-R1" ] && ip link del veth_A-R1
+ [ -e "/sys/class/net/ovs_br0" ] && cleanup_del_ovs_internal
+ [ -e "/sys/class/net/ovs_br0" ] && cleanup_del_ovs_vswitchd
+
rm -f "$tmpoutfile"
}
--
2.51.0
Introduce SW acceleration for IPIP tunnels in the netfilter flowtable
infrastructure.
---
Changes in v6:
- Rebase on top of nf-next main branch
- Link to v5: https://lore.kernel.org/r/20250721-nf-flowtable-ipip-v5-0-0865af9e58c6@kern…
Changes in v5:
- Rely on __ipv4_addr_hash() to compute the hash used as encap ID
- Remove unnecessary pskb_may_pull() in nf_flow_tuple_encap()
- Add nf_flow_ip4_ecanp_pop utility routine
- Link to v4: https://lore.kernel.org/r/20250718-nf-flowtable-ipip-v4-0-f8bb1c18b986@kern…
Changes in v4:
- Use the hash value of the saddr, daddr and protocol of outer IP header as
encapsulation id.
- Link to v3: https://lore.kernel.org/r/20250703-nf-flowtable-ipip-v3-0-880afd319b9f@kern…
Changes in v3:
- Add outer IP header sanity checks
- target nf-next tree instead of net-next
- Link to v2: https://lore.kernel.org/r/20250627-nf-flowtable-ipip-v2-0-c713003ce75b@kern…
Changes in v2:
- Introduce IPIP flowtable selftest
- Link to v1: https://lore.kernel.org/r/20250623-nf-flowtable-ipip-v1-1-2853596e3941@kern…
---
Lorenzo Bianconi (2):
net: netfilter: Add IPIP flowtable SW acceleration
selftests: netfilter: nft_flowtable.sh: Add IPIP flowtable selftest
include/linux/netdevice.h | 1 +
net/ipv4/ipip.c | 28 +++++++++++
net/netfilter/nf_flow_table_ip.c | 56 +++++++++++++++++++++-
net/netfilter/nft_flow_offload.c | 1 +
.../selftests/net/netfilter/nft_flowtable.sh | 40 ++++++++++++++++
5 files changed, 124 insertions(+), 2 deletions(-)
---
base-commit: bab3ce404553de56242d7b09ad7ea5b70441ea41
change-id: 20250623-nf-flowtable-ipip-1b3d7b08d067
Best regards,
--
Lorenzo Bianconi <lorenzo(a)kernel.org>
Hi all,
This series updates the drv-net XDP program used by the new xdp.py selftest
to use the bpf_dynptr APIs for packet access.
The selftest itself is unchanged.
The original program accessed packet headers directly via
ctx->data/data_end, implicitly assuming headers are always in the linear
region. That assumption is incorrect for multi-buffer XDP and does not
hold across all drivers. For example, mlx5 with striding RQ can leave the
linear area empty, causing the multi-buffer cases to fail.
Switching to bpf_xdp_load/store_bytes would work but always incurs copies.
Instead, this series adopts bpf_dynptr, which provides safe,
verifier-checked access across both linear and fragmented areas while
avoiding copies.
Amery Hung has also proposed a series [1] that addresses the same issues in
the program, but through the use of bpf_xdp_pull_data. My series is not
intended as a replacement for that work, but rather as an exploration of
another viable solution, each of which may be preferable under different
circumstances.
In cases where the program does not return XDP_PASS, I believe dynptr has
an advantage since it avoids an extra copy. Conversely, when the program
returns XDP_PASS, bpf_xdp_pull_data may be preferable, as the copy will
be performed in any case during skb creation.
It may make sense to split the work into two separate programs, allowing us
to test both solutions independently. Alternatively, we can consider a
combined approach, where the more fitting solution is applied for each use
case. I welcome feedback on which direction would be most useful.
[1] https://lore.kernel.org/all/20250905173352.3759457-1-ameryhung@gmail.com/
Thanks!
Nimrod
Nimrod Oren (5):
selftests: drv-net: Test XDP_TX with bpf_dynptr
selftests: drv-net: Test XDP tail adjustment with bpf_dynptr
selftests: drv-net: Test XDP head adjustment with bpf_dynptr
selftests: drv-net: Adjust XDP header data with bpf_dynptr
selftests: drv-net: Check XDP header data with bpf_dynptr
.../selftests/net/lib/xdp_native.bpf.c | 219 ++++++++----------
1 file changed, 96 insertions(+), 123 deletions(-)
--
2.45.0
This series fixes issues in devlink_rate_tc_bw.py selftest that made
its checks unreliable and its documentation inconsistent with the
actual configuration.
Thanks
Carolina Jubran (3):
selftests: drv-net: Fix and clarify TC bandwidth split in
devlink_rate_tc_bw.py
selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
selftests: drv-net: Relax total BW check in devlink_rate_tc_bw.py
.../drivers/net/hw/devlink_rate_tc_bw.py | 102 ++++++++----------
1 file changed, 44 insertions(+), 58 deletions(-)
--
2.38.1
The loop in bench_sockmap_prog_destroy() has two issues:
1. Using 'sizeof(ctx.fds)' as the loop bound results in the number of
bytes, not the number of file descriptors, causing the loop to iterate
far more times than intended.
2. The condition 'ctx.fds[0] > 0' incorrectly checks only the first fd for
all iterations, potentially leaving file descriptors unclosed. Change
it to 'ctx.fds[i] > 0' to check each fd properly.
These fixes ensure correct cleanup of all file descriptors when the
benchmark exits.
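The difference between the two loop bounds can be seen in a small standalone sketch. struct demo_ctx and count_open_fds() are hypothetical and the array length of 5 is an assumption for illustration; the real ctx in bench_sockmap.c has more fields, and the selftest takes ARRAY_SIZE from bpf_util.h rather than defining it locally.

```c
#include <assert.h>
#include <stddef.h>

/* Local definition for the sketch; the selftest gets this from bpf_util.h. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical context with an assumed number of descriptors. */
struct demo_ctx {
	int fds[5];
};

/* Count the entries a cleanup loop would close (fd > 0). */
static size_t count_open_fds(const struct demo_ctx *ctx)
{
	size_t i, n = 0;

	/* ARRAY_SIZE() gives the element count; sizeof() alone gives bytes. */
	for (i = 0; i < ARRAY_SIZE(ctx->fds); i++) {
		if (ctx->fds[i] > 0)	/* index with i, not a fixed 0 */
			n++;
	}
	return n;
}
```

With 4-byte ints, sizeof(ctx->fds) in this sketch would be 20, so the original bound would walk well past the 5 valid entries.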
Signed-off-by: Jiayuan Chen <jiayuan.chen(a)linux.dev>
Reported-by: Dan Carpenter <dan.carpenter(a)linaro.org>
Closes: https://lore.kernel.org/bpf/aLqfWuRR9R_KTe5e@stanley.mountain/
---
tools/testing/selftests/bpf/benchs/bench_sockmap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/benchs/bench_sockmap.c b/tools/testing/selftests/bpf/benchs/bench_sockmap.c
index 8ebf563a67a2..cfc072aa7fff 100644
--- a/tools/testing/selftests/bpf/benchs/bench_sockmap.c
+++ b/tools/testing/selftests/bpf/benchs/bench_sockmap.c
@@ -10,6 +10,7 @@
#include <argp.h>
#include "bench.h"
#include "bench_sockmap_prog.skel.h"
+#include "bpf_util.h"
#define FILE_SIZE (128 * 1024)
#define DATA_REPEAT_SIZE 10
@@ -124,8 +125,8 @@ static void bench_sockmap_prog_destroy(void)
{
int i;
- for (i = 0; i < sizeof(ctx.fds); i++) {
- if (ctx.fds[0] > 0)
+ for (i = 0; i < ARRAY_SIZE(ctx.fds); i++) {
+ if (ctx.fds[i] > 0)
close(ctx.fds[i]);
}
--
2.43.0
Two patches here, first fixes the issue where tunnel core doesn't
actually extract DF bit from the outer IP header, even though both
OVS and TC flower allow matching on it. More details in the commit
message.
The second is a selftest for openvswitch that reproduces the issue,
but also just adds some basic coverage for the tunnel metadata
extraction and related openvswitch uAPI.
Ilya Maximets (2):
net: dst_metadata: fix IP_DF bit not extracted from tunnel headers
selftests: openvswitch: add a simple test for tunnel metadata
include/net/dst_metadata.h | 11 ++-
.../selftests/net/openvswitch/openvswitch.sh | 88 +++++++++++++++++--
2 files changed, 90 insertions(+), 9 deletions(-)
--
2.50.1
This is based on mm-unstable.
I will only CC non-MM folks on the cover letter and the respective patch
to not flood too many inboxes (the lists receive all patches).
--
As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.
To recap, the reason we currently need nth_page() within a folio is because
on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.
While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.
So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to then go from (++pfn)->page, which is rather nasty.
Likely, many people have no idea when nth_page() is required and when
it might be dropped.
We refer to such problematic PFN ranges as "non-contiguous pages".
If we only deal with "contiguous pages", there is no need for nth_page().
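To make the problem concrete, here is a simplified userspace model, not kernel code. It assumes a toy layout with two memory sections of four pages each, whose memmaps are deliberately placed out of order in a backing array. The names mirror the kernel helpers, but the definitions are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_SECTION 4UL
#define NR_SECTIONS       2UL

/* Toy struct page; the real one encodes the section in page->flags. */
struct page {
	unsigned long section;
};

/* Per-section memmaps carved out of one array, intentionally not adjacent. */
static struct page backing[16];
static struct page *section_memmap[NR_SECTIONS];

static void memmap_init(void)
{
	unsigned long s, i;

	section_memmap[0] = &backing[8];
	section_memmap[1] = &backing[0];
	for (s = 0; s < NR_SECTIONS; s++)
		for (i = 0; i < PAGES_PER_SECTION; i++)
			section_memmap[s][i].section = s;
}

static struct page *pfn_to_page(unsigned long pfn)
{
	return &section_memmap[pfn / PAGES_PER_SECTION][pfn % PAGES_PER_SECTION];
}

static unsigned long page_to_pfn(const struct page *page)
{
	return page->section * PAGES_PER_SECTION +
	       (unsigned long)(page - section_memmap[page->section]);
}

/* Goes page -> pfn -> page, so it stays correct across sections. */
static struct page *nth_page(struct page *page, unsigned long n)
{
	return pfn_to_page(page_to_pfn(page) + n);
}
```

In this model, incrementing the struct page pointer for PFN 3 does not yield the struct page for PFN 4, while nth_page() does. Within a single section both approaches agree.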
Besides that "obvious" folio case, we might end up using nth_page()
within CMA allocations (again, could span memory sections), and in
one corner case (kfence) when processing memblock allocations (again,
could span memory sections).
So let's handle all that, add sanity checks, and remove nth_page().
Patch #1 -> #5 : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch #6 -> #13 : disallow folios to have non-contiguous pages
Patch #14 -> #20 : remove nth_page() usage within folios
Patch #22 : disallow CMA allocations of non-contiguous pages
Patch #23 -> #33 : sanity-check + remove nth_page() usage within SG entry
Patch #34 : sanity-check + remove nth_page() usage in
unpin_user_page_range_dirty_lock()
Patch #35 : remove nth_page() in kfence
Patch #36 : adjust stale comment regarding nth_page
Patch #37 : mm: remove nth_page()
A lot of this is inspired by the discussion at [1] between Linus, Jason
and me, so kudos to them.
[1] https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-G…
v1 -> v2:
* "fs: hugetlbfs: cleanup folio in adjust_range_hwpoison()"
-> Add comment for loop and remove comment of function regarding
copy_page_to_iter().
* Various smaller patch description tweaks I am not going to list for my
sanity
* "mips: mm: convert __flush_dcache_pages() to
__flush_dcache_folio_pages()"
-> Fix flush_dcache_page()
-> Drop "extern"
* "mm/gup: remove record_subpages()"
-> Added
* "mm/hugetlb: check for unreasonable folio sizes when registering hstate"
-> Refine comment
* "mm/cma: refuse handing out non-contiguous page ranges"
-> Add comment above loop
* "mm/page_alloc: reject unreasonable folio/compound page sizes in
alloc_contig_range_noprof()"
-> Added comment above check
* "mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()"
-> Refined comment
RFC -> v1:
* "wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel
config"
-> Mention that it was never really relevant for the test
* "mm/mm_init: make memmap_init_compound() look more like
prep_compound_page()"
-> Mention the setup of page links
* "mm: limit folio/compound page sizes in problematic kernel configs"
-> Improve comment for PUD handling, mentioning hugetlb and dax
* "mm: simplify folio_page() and folio_page_idx()"
-> Call variable "n"
* "mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()"
-> Keep __init_single_page() and refer to the usage of
memblock_reserved_mark_noinit()
* "fs: hugetlbfs: cleanup folio in adjust_range_hwpoison()"
* "fs: hugetlbfs: remove nth_page() usage within folio in
adjust_range_hwpoison()"
-> Separate nth_page() removal from cleanups
-> Further improve cleanups
* "io_uring/zcrx: remove nth_page() usage within folio"
-> Keep the io_copy_cache for now and limit to nth_page() removal
* "mm/gup: drop nth_page() usage within folio when recording subpages"
-> Cleanup record_subpages a bit
* "mm/cma: refuse handing out non-contiguous page ranges"
-> Replace another instance of "pfn_to_page(pfn)" where we already have
the page
* "scatterlist: disallow non-contigous page ranges in a single SG entry"
-> We have to EXPORT the symbol. I thought about moving it to mm_inline.h,
but I really don't want to include that in include/linux/scatterlist.h
* "ata: libata-eh: drop nth_page() usage within SG entry"
* "mspro_block: drop nth_page() usage within SG entry"
* "memstick: drop nth_page() usage within SG entry"
* "mmc: drop nth_page() usage within SG entry"
-> Keep PAGE_SHIFT
* "scsi: scsi_lib: drop nth_page() usage within SG entry"
* "scsi: sg: drop nth_page() usage within SG entry"
-> Split patches, Keep PAGE_SHIFT
* "crypto: remove nth_page() usage within SG entry"
-> Keep PAGE_SHIFT
* "kfence: drop nth_page() usage"
-> Keep modifying i and use "start_pfn" only instead
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Mike Rapoport <rppt(a)kernel.org>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: Marek Szyprowski <m.szyprowski(a)samsung.com>
Cc: Robin Murphy <robin.murphy(a)arm.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Brendan Jackman <jackmanb(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: Dennis Zhou <dennis(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Christoph Lameter <cl(a)gentwo.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: x86(a)kernel.org
Cc: linux-arm-kernel(a)lists.infradead.org
Cc: linux-mips(a)vger.kernel.org
Cc: linux-s390(a)vger.kernel.org
Cc: linux-crypto(a)vger.kernel.org
Cc: linux-ide(a)vger.kernel.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: dri-devel(a)lists.freedesktop.org
Cc: linux-mmc(a)vger.kernel.org
Cc: linux-arm-kernel(a)axis.com
Cc: linux-scsi(a)vger.kernel.org
Cc: kvm(a)vger.kernel.org
Cc: virtualization(a)lists.linux.dev
Cc: linux-mm(a)kvack.org
Cc: io-uring(a)vger.kernel.org
Cc: iommu(a)lists.linux.dev
Cc: kasan-dev(a)googlegroups.com
Cc: wireguard(a)lists.zx2c4.com
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-riscv(a)lists.infradead.org
David Hildenbrand (37):
mm: stop making SPARSEMEM_VMEMMAP user-selectable
arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu
kernel config
mm/page_alloc: reject unreasonable folio/compound page sizes in
alloc_contig_range_noprof()
mm/memremap: reject unreasonable folio/compound page sizes in
memremap_pages()
mm/hugetlb: check for unreasonable folio sizes when registering hstate
mm/mm_init: make memmap_init_compound() look more like
prep_compound_page()
mm: sanity-check maximum folio size in folio_set_order()
mm: limit folio/compound page sizes in problematic kernel configs
mm: simplify folio_page() and folio_page_idx()
mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
mm/mm/percpu-km: drop nth_page() usage within single allocation
fs: hugetlbfs: remove nth_page() usage within folio in
adjust_range_hwpoison()
fs: hugetlbfs: cleanup folio in adjust_range_hwpoison()
mm/pagewalk: drop nth_page() usage within folio in folio_walk_start()
mm/gup: drop nth_page() usage within folio when recording subpages
mm/gup: remove record_subpages()
io_uring/zcrx: remove nth_page() usage within folio
mips: mm: convert __flush_dcache_pages() to
__flush_dcache_folio_pages()
mm/cma: refuse handing out non-contiguous page ranges
dma-remap: drop nth_page() in dma_common_contiguous_remap()
scatterlist: disallow non-contigous page ranges in a single SG entry
ata: libata-sff: drop nth_page() usage within SG entry
drm/i915/gem: drop nth_page() usage within SG entry
mspro_block: drop nth_page() usage within SG entry
memstick: drop nth_page() usage within SG entry
mmc: drop nth_page() usage within SG entry
scsi: scsi_lib: drop nth_page() usage within SG entry
scsi: sg: drop nth_page() usage within SG entry
vfio/pci: drop nth_page() usage within SG entry
crypto: remove nth_page() usage within SG entry
mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()
kfence: drop nth_page() usage
block: update comment of "struct bio_vec" regarding nth_page()
mm: remove nth_page()
arch/arm64/Kconfig | 1 -
arch/mips/include/asm/cacheflush.h | 11 +++--
arch/mips/mm/cache.c | 8 ++--
arch/s390/Kconfig | 1 -
arch/x86/Kconfig | 1 -
crypto/ahash.c | 4 +-
crypto/scompress.c | 8 ++--
drivers/ata/libata-sff.c | 6 +--
drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +-
drivers/memstick/core/mspro_block.c | 3 +-
drivers/memstick/host/jmb38x_ms.c | 3 +-
drivers/memstick/host/tifm_ms.c | 3 +-
drivers/mmc/host/tifm_sd.c | 4 +-
drivers/mmc/host/usdhi6rol0.c | 4 +-
drivers/scsi/scsi_lib.c | 3 +-
drivers/scsi/sg.c | 3 +-
drivers/vfio/pci/pds/lm.c | 3 +-
drivers/vfio/pci/virtio/migrate.c | 3 +-
fs/hugetlbfs/inode.c | 36 +++++---------
include/crypto/scatterwalk.h | 4 +-
include/linux/bvec.h | 7 +--
include/linux/mm.h | 48 +++++++++++++++----
include/linux/page-flags.h | 5 +-
include/linux/scatterlist.h | 3 +-
io_uring/zcrx.c | 4 +-
kernel/dma/remap.c | 2 +-
mm/Kconfig | 3 +-
mm/cma.c | 39 +++++++++------
mm/gup.c | 36 +++++++-------
mm/hugetlb.c | 22 +++++----
mm/internal.h | 1 +
mm/kfence/core.c | 12 +++--
mm/memremap.c | 3 ++
mm/mm_init.c | 15 +++---
mm/page_alloc.c | 10 +++-
mm/pagewalk.c | 2 +-
mm/percpu-km.c | 2 +-
mm/util.c | 36 ++++++++++++++
tools/testing/scatterlist/linux/mm.h | 1 -
.../selftests/wireguard/qemu/kernel.config | 1 -
40 files changed, 217 insertions(+), 146 deletions(-)
base-commit: b73c6f2b5712809f5f386780ac46d1d78c31b2e6
--
2.50.1
This patchset introduces a new per-port bonding option: `ad_actor_port_prio`.
It allows users to configure the actor's port priority, which can then be used
by the bonding driver for aggregator selection based on port priority.
This provides finer control over LACP aggregator choice, especially in setups
with multiple eligible aggregators over 2 switches.
v5:
a) rename 'prio' to 'actor_port_prio' in bond_ad_select_tbl (Jay Vosburgh)
b) update document description
v4:
a) fix actor_port_prio minimal value (Jay Vosburgh)
b) fix ad_agg_selection_test comment order (Paolo Abeni)
c) restructure selftest, reduce duplication (Paolo Abeni)
v3:
a) add comments when init slave port_priority (Jonas Gorski)
b) rename ad_lacp_port_prio to lacp_port_prio (Jay Vosburgh)
v2:
a) set default bond option value for port priority (Nikolay Aleksandrov)
b) fix __agg_ports_priority coding style (Nikolay Aleksandrov)
c) fix shellcheck warns
Hangbin Liu (3):
bonding: add support for per-port LACP actor priority
bonding: support aggregator selection based on port priority
selftests: bonding: add test for LACP actor port priority
Documentation/networking/bonding.rst | 25 +++-
drivers/net/bonding/bond_3ad.c | 31 +++++
drivers/net/bonding/bond_netlink.c | 16 +++
drivers/net/bonding/bond_options.c | 45 +++++++-
include/net/bond_3ad.h | 2 +
include/net/bond_options.h | 1 +
include/uapi/linux/if_link.h | 1 +
.../selftests/drivers/net/bonding/Makefile | 3 +-
.../drivers/net/bonding/bond_lacp_prio.sh | 108 ++++++++++++++++++
tools/testing/selftests/net/forwarding/lib.sh | 24 ----
tools/testing/selftests/net/lib.sh | 24 ++++
11 files changed, 247 insertions(+), 33 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_lacp_prio.sh
--
2.50.1
The three patches fix the va_high_addr_switch.sh test failure on x86_64.
Patch 1 fixes a hugepage setup issue: nr_hugepages is reset too early in
run_vmtests.sh, which breaks the later va_high_addr_switch test.
Patch 2 adds hugepage setup to the va_high_addr_switch test, so that it
still works if run_vmtests.sh changes its hugepage setup someday.
Patch 3 fixes the test failure caused by the change of the hint address
alignment method in hugetlb_get_unmapped_area().
Changes in v2:
- patch 1 renames nr_hugepgs_origin to orig_nr_hugepgs
- add a patch 2 to set up hugepages in va_high_addr_switch test
Chunyu Hu (3):
selftests/mm: fix hugepages cleanup too early
selftests/mm: alloc hugepages in va_high_addr_switch test
selftests/mm: fix va_high_addr_switch.sh failure on x86_64
tools/testing/selftests/mm/run_vmtests.sh | 9 ++++-
.../selftests/mm/va_high_addr_switch.c | 4 +-
.../selftests/mm/va_high_addr_switch.sh | 37 +++++++++++++++++++
3 files changed, 46 insertions(+), 4 deletions(-)
--
2.49.0
This patchset ensures that the number of hugepages is correctly set in the
system so that the uffd-stress test does not fail due to the racy nature of
the test. Patch 1 changes the hugepage constraint in the run_vmtests.sh
script, whereas patch 2 changes the constraint in the test itself.
---
Based on 6.17-rc5.
Dev Jain (2):
selftests/mm/uffd-stress: make test operate on less hugetlb memory
selftests/mm/uffd-stress: stricten constraint on free hugepages needed
before the test
tools/testing/selftests/mm/run_vmtests.sh | 10 +++++++---
tools/testing/selftests/mm/uffd-stress.c | 17 +++++++++++------
2 files changed, 18 insertions(+), 9 deletions(-)
--
2.30.2
This patchset ensures that the number of hugepages is correctly set in the
system so that the uffd-stress test does not fail due to the racy nature of
the test. Patch 1 corrects the hugepage constraint in the run_vmtests.sh
script, whereas patch 2 corrects the constraint in the test itself.
Dev Jain (2):
selftests/mm/uffd-stress: Make test operate on less hugetlb memory
selftests/mm/uffd-stress: Stricten constraint on free hugepages before
the test
tools/testing/selftests/mm/run_vmtests.sh | 2 +-
tools/testing/selftests/mm/uffd-stress.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
--
2.30.2
Some high-level virtual drivers need to compute features from their
lower devices, but each currently has its own implementation and may
miss some feature computations. This patch set introduces a common function
to compute features for such devices.
Currently, bonding, team, and bridge have been updated to use the new
helper.
v2:
a) remove hard_header_len setting. I will set needed_headroom for bond/team
in a separate patch as bridge has its own way. (Ido Schimmel)
b) Add test file to Makefile, set RET=0 to a proper location. (Ido Schimmel)
Hangbin Liu (5):
net: add a common function to compute features from lowers devices
bonding: use common function to compute the features
team: use common function to compute the features
net: bridge: use common function to compute the features
selftests/net: add offload checking test for virtual interface
drivers/net/bonding/bond_main.c | 99 +----------
drivers/net/team/team_core.c | 73 +-------
include/linux/netdevice.h | 19 +++
net/bridge/br_if.c | 22 +--
net/core/dev.c | 76 +++++++++
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/config | 2 +
tools/testing/selftests/net/vdev_offload.sh | 176 ++++++++++++++++++++
8 files changed, 285 insertions(+), 183 deletions(-)
create mode 100755 tools/testing/selftests/net/vdev_offload.sh
--
2.50.1
From: Fred Griffoul <griffoul(a)casper.infradead.org>
This patch series addresses performance issues in nested VMX when
handling unmanaged guest memory. Unmanaged guest memory refers to memory
not directly mapped by the kernel (no struct page), such as memory
passed with the mem= parameter or guest_memfd for non-Confidential
Computing (CoCo) VMs.
Current Problem:
During nested VMX operations, the system frequently accesses specific
guest pages during L2 VM entry/exit cycles. The current workflow:
1. kvm_vcpu_map() invokes memremap() for unmanaged memory.
2. The system either directly accesses mapped memory via nested VMX or
passes it to the L2 guest through vmcs02.
3. kvm_vcpu_unmap() invokes memunmap().
This repeated map/unmap cycle creates significant performance overhead
due to expensive remapping operations.
Solution approach:
Our solution replaces kvm_host_map with gfn_to_pfn_cache in nested VMX.
It addresses two distinct types of guest pages.
First, we handle the L1 MSR bitmap page, which requires read-only access
for folding L1 and L0 MSR bitmap. We implement this conversion to
gfn_to_pfn_cache in patch 1.
Second, we tackle system pages, including APIC access, virtual APIC, and
posted interrupt descriptor pages. These pages are more complex as
they're accessed by both nested VMX code _and_ passed to the L2 guest in
vmcs02 fields. This requires restoring and completing the
"guest-uses-pfn" support in pfncache through patches 2 and 3, followed
by implementing kvm_host_map replacement with caches in patch 4.
Testing:
Patch 5 introduces a new selftest to verify cache invalidation and
memslot update functionality.
The changes are available in a git repository at:
git://git.infradead.org/users/griffoul/linux.git tags/nvmx-gpc-v1
Suggested-by: dwmw(a)amazon.co.uk
Fred Griffoul (5):
KVM: nVMX: Implement cache for L1 MSR bitmap
KVM: pfncache: Restore guest-uses-pfn support
KVM: x86: Add nested state validation for pfncache support
KVM: nVMX: Implement cache for L1 APIC pages
KVM: selftests: Add nested VMX APIC cache invalidation test
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/nested.c | 213 +++++++++---
arch/x86/kvm/vmx/vmx.h | 10 +-
arch/x86/kvm/x86.c | 14 +-
include/linux/kvm_host.h | 34 +-
include/linux/kvm_types.h | 1 +
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../selftests/kvm/x86/vmx_apic_update_test.c | 302 ++++++++++++++++++
virt/kvm/kvm_main.c | 3 +-
virt/kvm/kvm_mm.h | 6 +-
virt/kvm/pfncache.c | 43 ++-
11 files changed, 575 insertions(+), 53 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86/vmx_apic_update_test.c
--
2.51.0
The TODO about using the number of vCPUs instead of vcpu.id + 1
was already addressed by commit 376bc1b458c9 ("KVM: selftests: Don't
assume vcpu->id is '0' in xAPIC state test"). The comment is now
stale and can be removed.
Signed-off-by: Sukrut Heroorkar <hsukrut3(a)gmail.com>
---
tools/testing/selftests/kvm/x86/xapic_state_test.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86/xapic_state_test.c b/tools/testing/selftests/kvm/x86/xapic_state_test.c
index fdebff1165c7..3b4814c55722 100644
--- a/tools/testing/selftests/kvm/x86/xapic_state_test.c
+++ b/tools/testing/selftests/kvm/x86/xapic_state_test.c
@@ -120,8 +120,8 @@ static void test_icr(struct xapic_vcpu *x)
__test_icr(x, icr | i);
/*
- * Send all flavors of IPIs to non-existent vCPUs. TODO: use number of
- * vCPUs, not vcpu.id + 1. Arbitrarily use vector 0xff.
+ * Send all flavors of IPIs to non-existent vCPUs. Arbitrarily use
+ * vector 0xff.
*/
icr = APIC_INT_ASSERT | 0xff;
for (i = 0; i < 0xff; i++) {
--
2.43.0
Replace the hardcoded 0xff in test_icr() with the actual number of vcpus
created for the vm. This addresses the existing TODO and keeps the test
correct if it is ever run with multiple vcpus.
Signed-off-by: Sukrut Heroorkar <hsukrut3(a)gmail.com>
---
tools/testing/selftests/kvm/x86/xapic_state_test.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/x86/xapic_state_test.c b/tools/testing/selftests/kvm/x86/xapic_state_test.c
index fdebff1165c7..4af36682503e 100644
--- a/tools/testing/selftests/kvm/x86/xapic_state_test.c
+++ b/tools/testing/selftests/kvm/x86/xapic_state_test.c
@@ -56,6 +56,17 @@ static void x2apic_guest_code(void)
} while (1);
}
+static unsigned int vm_nr_vcpus(struct kvm_vm *vm)
+{
+ struct kvm_vcpu *vcpu;
+ unsigned int count = 0;
+
+ list_for_each_entry(vcpu, &vm->vcpus, list)
+ count++;
+
+ return count;
+}
+
static void ____test_icr(struct xapic_vcpu *x, uint64_t val)
{
struct kvm_vcpu *vcpu = x->vcpu;
@@ -124,7 +135,7 @@ static void test_icr(struct xapic_vcpu *x)
* vCPUs, not vcpu.id + 1. Arbitrarily use vector 0xff.
*/
icr = APIC_INT_ASSERT | 0xff;
- for (i = 0; i < 0xff; i++) {
+ for (i = 0; i < vm_nr_vcpus(vcpu->vm); i++) {
if (i == vcpu->id)
continue;
for (j = 0; j < 8; j++)
--
2.43.0
Recent changes to netlink socket memory accounting must
have broken the implicit assumption of the netlink-dump test
that we can fit exactly 64 dumps into the socket. Handle the
failure mode properly, and increase the dump count to 80
to make sure we still run into the error condition if
the default buffer size increases in the future.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
tools/testing/selftests/net/netlink-dumps.c | 43 ++++++++++++++++-----
1 file changed, 33 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/net/netlink-dumps.c b/tools/testing/selftests/net/netlink-dumps.c
index 07423f256f96..7618ebe528a4 100644
--- a/tools/testing/selftests/net/netlink-dumps.c
+++ b/tools/testing/selftests/net/netlink-dumps.c
@@ -31,9 +31,18 @@ struct ext_ack {
const char *str;
};
-/* 0: no done, 1: done found, 2: extack found, -1: error */
-static int nl_get_extack(char *buf, size_t n, struct ext_ack *ea)
+enum get_ea_ret {
+ ERROR = -1,
+ NO_CTRL = 0,
+ FOUND_DONE,
+ FOUND_ERR,
+ FOUND_EXTACK,
+};
+
+static enum get_ea_ret
+nl_get_extack(char *buf, size_t n, struct ext_ack *ea)
{
+ enum get_ea_ret ret = NO_CTRL;
const struct nlmsghdr *nlh;
const struct nlattr *attr;
ssize_t rem;
@@ -41,15 +50,19 @@ static int nl_get_extack(char *buf, size_t n, struct ext_ack *ea)
for (rem = n; rem > 0; NLMSG_NEXT(nlh, rem)) {
nlh = (struct nlmsghdr *)&buf[n - rem];
if (!NLMSG_OK(nlh, rem))
- return -1;
+ return ERROR;
- if (nlh->nlmsg_type != NLMSG_DONE)
+ if (nlh->nlmsg_type == NLMSG_ERROR)
+ ret = FOUND_ERR;
+ else if (nlh->nlmsg_type == NLMSG_DONE)
+ ret = FOUND_DONE;
+ else
continue;
ea->err = -*(int *)NLMSG_DATA(nlh);
if (!(nlh->nlmsg_flags & NLM_F_ACK_TLVS))
- return 1;
+ return ret;
ynl_attr_for_each(attr, nlh, sizeof(int)) {
switch (ynl_attr_type(attr)) {
@@ -68,10 +81,10 @@ static int nl_get_extack(char *buf, size_t n, struct ext_ack *ea)
}
}
- return 2;
+ return FOUND_EXTACK;
}
- return 0;
+ return ret;
}
static const struct {
@@ -99,9 +112,9 @@ static const struct {
TEST(dump_extack)
{
int netlink_sock;
+ int i, cnt, ret;
char buf[8192];
int one = 1;
- int i, cnt;
ssize_t n;
netlink_sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
@@ -118,7 +131,7 @@ TEST(dump_extack)
ASSERT_EQ(n, 0);
/* Dump so many times we fill up the buffer */
- cnt = 64;
+ cnt = 80;
for (i = 0; i < cnt; i++) {
n = send(netlink_sock, &dump_neigh_bad,
sizeof(dump_neigh_bad), 0);
@@ -140,10 +153,20 @@ TEST(dump_extack)
}
ASSERT_GE(n, (ssize_t)sizeof(struct nlmsghdr));
- EXPECT_EQ(nl_get_extack(buf, n, &ea), 2);
+ ret = nl_get_extack(buf, n, &ea);
+ /* Once we fill the buffer we'll see one ENOBUFS followed
+ * by a number of EBUSYs. Then the last recv() will finally
+ * trigger and complete the dump.
+ */
+ if (ret == FOUND_ERR && (ea.err == ENOBUFS || ea.err == EBUSY))
+ continue;
+ EXPECT_EQ(ret, FOUND_EXTACK);
+ EXPECT_EQ(ea.err, EINVAL);
EXPECT_EQ(ea.attr_offs,
sizeof(struct nlmsghdr) + sizeof(struct ndmsg));
}
+ /* Make sure last message was a full DONE+extack */
+ EXPECT_EQ(ret, FOUND_EXTACK);
}
static const struct {
--
2.51.0
The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling.
When GCS is active a secondary stack called the Guarded Control Stack is
maintained, protected with a memory attribute which means that it can
only be written with specific GCS operations. The current GCS pointer
can not be directly written to by userspace. When a BL is executed the
value stored in LR is also pushed onto the GCS, and when a RET is
executed the top of the GCS is popped and compared to LR with a fault
being raised if the values do not match. GCS operations may only be
performed on GCS pages, a data abort is generated if they are not.
The combination of hardware enforcement and lack of extra instructions
in the function entry and exit paths should result in something which
has less overhead and is more difficult to attack than a purely software
implementation like clang's shadow stacks.
This series implements support for managing GCS for KVM guests. It also
includes a fix for S1PIE, which has also been sent separately, as this
feature is a dependency for GCS. It is based on:
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/gcs
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v15:
- Rebase onto v6.17-rc1.
- Link to v14: https://lore.kernel.org/r/20241005-arm64-gcs-v14-0-59060cd6092b@kernel.org
Changes in v14:
- Rebase onto arm64/for-next/gcs which includes all the non-KVM support.
- Manage the fine grained traps for GCS instructions.
- Manage PSTATE.EXLOCK when delivering exceptions to KVM guests.
- Link to v13: https://lore.kernel.org/r/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.org
Changes in v13:
- Rebase onto v6.12-rc1.
- Allocate VM_HIGH_ARCH_6 since protection keys used all the existing
bits.
- Implement mm_release() and free transparently allocated GCSs there.
- Use bit 32 of AT_HWCAP for GCS due to AT_HWCAP2 being filled.
- Since we now only set GCSCRE0_EL1 on change ensure that it is
initialised with GCSPR_EL0 accessible to EL0.
- Fix OOM handling on thread copy.
- Link to v12: https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec947436a@kernel.org
Changes in v12:
- Clarify and simplify the signal handling code so we work with the
register state.
- When checking for write aborts to shadow stack pages ensure the fault
is a data abort.
- Depend on !UPROBES.
- Comment cleanups.
- Link to v11: https://lore.kernel.org/r/20240822-arm64-gcs-v11-0-41b81947ecb5@kernel.org
Changes in v11:
- Remove the dependency on the addition of clone3() support for shadow
stacks, rebasing onto v6.11-rc3.
- Make ID_AA64PFR1_EL1.GCS writeable in KVM.
- Hide GCS registers when GCS is not enabled for KVM guests.
- Require HCRX_EL2.GCSEn if booting at EL1.
- Require that GCSCR_EL1 and GCSCRE0_EL1 be initialised regardless of
if we boot at EL2 or EL1.
- Remove some stray use of bit 63 in signal cap tokens.
- Warn if we see a GCS with VM_SHARED.
- Remove redundant check for VM_WRITE in fault handling.
- Cleanups and clarifications in the ABI document.
- Clean up and improve documentation of some sync placement.
- Only set the EL0 GCS mode if it's actually changed.
- Various minor fixes and tweaks.
- Link to v10: https://lore.kernel.org/r/20240801-arm64-gcs-v10-0-699e2bd2190b@kernel.org
Changes in v10:
- Fix issues with THP.
- Tighten up requirements for initialising GCSCR*.
- Only generate GCS signal frames for threads using GCS.
- Only context switch EL1 GCS registers if S1PIE is enabled.
- Move context switch of GCSCRE0_EL1 to EL0 context switch.
- Make GCS registers unconditionally visible to userspace.
- Use FHU infrastructure.
- Don't change writability of ID_AA64PFR1_EL1 for KVM.
- Remove unused arguments from alloc_gcs().
- Typo fixes.
- Link to v9: https://lore.kernel.org/r/20240625-arm64-gcs-v9-0-0f634469b8f0@kernel.org
Changes in v9:
- Rebase onto v6.10-rc3.
- Restructure and clarify memory management fault handling.
- Fix up basic-gcs for the latest clone3() changes.
- Convert to newly merged KVM ID register based feature configuration.
- Fixes for NV traps.
- Link to v8: https://lore.kernel.org/r/20240203-arm64-gcs-v8-0-c9fec77673ef@kernel.org
Changes in v8:
- Invalidate signal cap token on stack when consuming.
- Typo and other trivial fixes.
- Don't try to use process_vm_write() on GCS, it intentionally does not
work.
- Fix leak of thread GCSs.
- Rebase onto latest clone3() series.
- Link to v7: https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org
Changes in v7:
- Rebase onto v6.7-rc2 via the clone3() patch series.
- Change the token used to cap the stack during signal handling to be
compatible with GCSPOPM.
- Fix flags for new page types.
- Fold in support for clone3().
- Replace copy_to_user_gcs() with put_user_gcs().
- Link to v6: https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org
Changes in v6:
- Rebase onto v6.6-rc3.
- Add some more gcsb_dsync() barriers following spec clarifications.
- Due to ongoing discussion around clone()/clone3() I've not updated
anything there, the behaviour is the same as on previous versions.
- Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org
Changes in v5:
- Don't map any permissions for user GCSs, we always use EL0 accessors
or use a separate mapping of the page.
- Reduce the standard size of the GCS to RLIMIT_STACK/2.
- Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
- Clarifications and fixes to documentation.
- More tests.
- Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org
Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of
stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org
Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org
Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size
requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org
---
Mark Brown (6):
arm64/gcs: Ensure FGTs for EL1 GCS instructions are disabled
KVM: arm64: Manage GCS access and registers for guests
KVM: arm64: Forward GCS exceptions to nested guests
KVM: arm64: Set PSTATE.EXLOCK when entering an exception
KVM: arm64: Allow GCS to be enabled for guests
KVM: selftests: arm64: Add GCS registers to get-reg-list
arch/arm64/include/asm/el2_setup.h | 4 +++
arch/arm64/include/asm/kvm_emulate.h | 3 ++
arch/arm64/include/asm/kvm_host.h | 14 +++++++++
arch/arm64/include/asm/vncr_mapping.h | 2 ++
arch/arm64/include/uapi/asm/ptrace.h | 1 +
arch/arm64/kvm/handle_exit.c | 14 +++++++--
arch/arm64/kvm/hyp/exception.c | 37 ++++++++++++++++++++++++
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 31 ++++++++++++++++++++
arch/arm64/kvm/hyp/vhe/sysreg-sr.c | 10 +++++++
arch/arm64/kvm/sys_regs.c | 32 ++++++++++++++++++--
tools/testing/selftests/kvm/arm64/get-reg-list.c | 12 ++++++++
11 files changed, 155 insertions(+), 5 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20230303-arm64-gcs-e311ab0d8729
Best regards,
--
Mark Brown <broonie(a)kernel.org>
+lists
Please keep discussions on-list unless there's something that can't/shouldn't be
posted publicly, e.g. for confidentiality or security reasons.
On Tue, Sep 02, 2025, Faruqui, Aqib wrote:
> I suppose a fix for blindly using PAGE_SIZE in subsequent macros:
>
> #ifdef PAGE_SIZE
> #undef PAGE_SIZE
> #endif
> #define PAGE_SIZE (1ULL << PAGE_SHIFT)
>
> Is no better and is instead blindly suppressing the compiler's redefinition warning.
>
> I'm having trouble finding what causes the conflict, any advice here?
Maybe try a newer compiler? E.g. gcc-14.2 will spit out the exact location of the
previous definition.
In file included from include/x86/svm_util.h:13,
from include/x86/sev.h:15,
from lib/x86/sev.c:5:
include/x86/processor.h:373:9: error: "PAGE_SIZE" redefined [-Werror]
373 | #define PAGE_SIZE (1ULL << PAGE_SHIFT)
| ^~~~~~~~~
include/x86/processor.h:370:9: note: this is the location of the previous definition
370 | #define PAGE_SIZE BIT(12)
| ^~~~~~~~~
Use the return value of chdir("/") and assert that it is 0 (success);
chdir() returns -1 on failure, in which case the test stops.
The patch fixes the following warnings:
mount-notify_test.c:468:17: warning: ignoring return value of ‘chdir’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
468 | chdir("/");
mount-notify_test_ns.c:489:17: warning: ignoring return value of
‘chdir’ declared with attribute ‘warn_unused_result’ [-Wunused-
result]
489 | chdir("/");
To reproduce the issue, use the command:
make kselftest TARGET=filesystems/statmount
Signed-off-by: Alessandro Zanni <alessandro.zanni87(a)gmail.com>
Reviewed-by: Jan Kara <jack(a)suse.cz>
---
.../selftests/filesystems/mount-notify/mount-notify_test.c | 2 +-
.../selftests/filesystems/mount-notify/mount-notify_test_ns.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
index 5a3b0ace1a88..a7f899599d52 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
@@ -458,7 +458,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
index d91946e69591..dc9eb3087a1a 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
@@ -486,7 +486,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
--
2.43.0
Use the return value of chdir("/") and assert that it is 0 (success);
chdir() returns -1 on failure, in which case the test stops.
The patch fixes the following warnings:
mount-notify_test.c:468:17: warning: ignoring return value of ‘chdir’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
468 | chdir("/");
mount-notify_test_ns.c:489:17: warning: ignoring return value of
‘chdir’ declared with attribute ‘warn_unused_result’ [-Wunused-
result]
489 | chdir("/");
To reproduce the issue, use the command:
make kselftest TARGET=filesystems/statmount
Signed-off-by: Alessandro Zanni <alessandro.zanni87(a)gmail.com>
---
.../selftests/filesystems/mount-notify/mount-notify_test.c | 2 +-
.../selftests/filesystems/mount-notify/mount-notify_test_ns.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
index 5a3b0ace1a88..a7f899599d52 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
@@ -458,7 +458,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
index d91946e69591..dc9eb3087a1a 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
@@ -486,7 +486,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
--
2.43.0
From: Feng Yang <yangfeng(a)kylinos.cn>
The error message printed here uses the stale err value from an earlier
step, which results in it being printed as 0.
When bpf_map__attach_struct_ops() encounters an error,
it uses libbpf_err_ptr(err) to set errno = -err and returns NULL.
Therefore, using -errno fixes this issue.
Before the fix:
run_subtest:FAIL:1019 bpf_map__attach_struct_ops failed for map pro_epilogue: err=0
After the fix:
run_subtest:FAIL:1019 bpf_map__attach_struct_ops failed for map pro_epilogue: err=-9
Signed-off-by: Feng Yang <yangfeng(a)kylinos.cn>
---
Changes in v3:
- Use -errno here directly, thanks: Andrii Nakryiko.
- Link to v2: https://lore.kernel.org/all/20250829014125.198653-1-yangfeng59949@163.com/
---
Changes in v2:
- Use libbpf_get_error, thanks: Alexei Starovoitov.
- Link to v1: https://lore.kernel.org/all/20250828081507.1380218-1-yangfeng59949@163.com/
---
tools/testing/selftests/bpf/test_loader.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/test_loader.c b/tools/testing/selftests/bpf/test_loader.c
index 78423cf89e01..33d59c093a27 100644
--- a/tools/testing/selftests/bpf/test_loader.c
+++ b/tools/testing/selftests/bpf/test_loader.c
@@ -1083,7 +1083,7 @@ void run_subtest(struct test_loader *tester,
link = bpf_map__attach_struct_ops(map);
if (!link) {
PRINT_FAIL("bpf_map__attach_struct_ops failed for map %s: err=%d\n",
- bpf_map__name(map), err);
+ bpf_map__name(map), -errno);
goto tobj_cleanup;
}
links[links_cnt++] = link;
--
2.25.1
This is based on mm-unstable and was cross-compiled heavily.
I should probably have already dropped the RFC label but I want to hear
first if I ignored some corner case (SG entries?) and I need to do
at least a bit more testing.
I will only CC non-MM folks on the cover letter and the respective patch
to not flood too many inboxes (the lists receive all patches).
---
As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.
To recap, the reason we currently need nth_page() within a folio is because
on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.
While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.
So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to then go from (++pfn)->page, which is rather nasty.
Likely, many people have no idea when nth_page() is required and when
it might be dropped.
We refer to such problematic PFN ranges as "non-contiguous pages".
If we only deal with "contiguous pages", there is no need for nth_page().
Besides that "obvious" folio case, we might end up using nth_page()
within CMA allocations (again, could span memory sections), and in
one corner case (kfence) when processing memblock allocations (again,
could span memory sections).
So let's handle all that, add sanity checks, and remove nth_page().
Patch #1 -> #5 : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch #6 -> #12 : disallow folios to have non-contiguous pages
Patch #13 -> #20 : remove nth_page() usage within folios
Patch #21 : disallow CMA allocations of non-contiguous pages
Patch #22 -> #31 : sanity-check + remove nth_page() usage within SG entry
Patch #32 : sanity-check + remove nth_page() usage in
unpin_user_page_range_dirty_lock()
Patch #33 : remove nth_page() in kfence
Patch #34 : adjust stale comment regarding nth_page
Patch #35 : mm: remove nth_page()
A lot of this is inspired by the discussion at [1] between Linus, Jason
and me, so kudos to them.
[1] https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-G…
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Mike Rapoport <rppt(a)kernel.org>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: Marek Szyprowski <m.szyprowski(a)samsung.com>
Cc: Robin Murphy <robin.murphy(a)arm.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Brendan Jackman <jackmanb(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: Dennis Zhou <dennis(a)kernel.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Christoph Lameter <cl(a)gentwo.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: x86(a)kernel.org
Cc: linux-arm-kernel(a)lists.infradead.org
Cc: linux-mips(a)vger.kernel.org
Cc: linux-s390(a)vger.kernel.org
Cc: linux-crypto(a)vger.kernel.org
Cc: linux-ide(a)vger.kernel.org
Cc: intel-gfx(a)lists.freedesktop.org
Cc: dri-devel(a)lists.freedesktop.org
Cc: linux-mmc(a)vger.kernel.org
Cc: linux-arm-kernel(a)axis.com
Cc: linux-scsi(a)vger.kernel.org
Cc: kvm(a)vger.kernel.org
Cc: virtualization(a)lists.linux.dev
Cc: linux-mm(a)kvack.org
Cc: io-uring(a)vger.kernel.org
Cc: iommu(a)lists.linux.dev
Cc: kasan-dev(a)googlegroups.com
Cc: wireguard(a)lists.zx2c4.com
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-riscv(a)lists.infradead.org
David Hildenbrand (35):
mm: stop making SPARSEMEM_VMEMMAP user-selectable
arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu
kernel config
mm/page_alloc: reject unreasonable folio/compound page sizes in
alloc_contig_range_noprof()
mm/memremap: reject unreasonable folio/compound page sizes in
memremap_pages()
mm/hugetlb: check for unreasonable folio sizes when registering hstate
mm/mm_init: make memmap_init_compound() look more like
prep_compound_page()
mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
mm: sanity-check maximum folio size in folio_set_order()
mm: limit folio/compound page sizes in problematic kernel configs
mm: simplify folio_page() and folio_page_idx()
mm/mm/percpu-km: drop nth_page() usage within single allocation
fs: hugetlbfs: remove nth_page() usage within folio in
adjust_range_hwpoison()
mm/pagewalk: drop nth_page() usage within folio in folio_walk_start()
mm/gup: drop nth_page() usage within folio when recording subpages
io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage
io_uring/zcrx: remove nth_page() usage within folio
mips: mm: convert __flush_dcache_pages() to
__flush_dcache_folio_pages()
mm/cma: refuse handing out non-contiguous page ranges
dma-remap: drop nth_page() in dma_common_contiguous_remap()
scatterlist: disallow non-contigous page ranges in a single SG entry
ata: libata-eh: drop nth_page() usage within SG entry
drm/i915/gem: drop nth_page() usage within SG entry
mspro_block: drop nth_page() usage within SG entry
memstick: drop nth_page() usage within SG entry
mmc: drop nth_page() usage within SG entry
scsi: core: drop nth_page() usage within SG entry
vfio/pci: drop nth_page() usage within SG entry
crypto: remove nth_page() usage within SG entry
mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()
kfence: drop nth_page() usage
block: update comment of "struct bio_vec" regarding nth_page()
mm: remove nth_page()
arch/arm64/Kconfig | 1 -
arch/mips/include/asm/cacheflush.h | 11 +++--
arch/mips/mm/cache.c | 8 ++--
arch/s390/Kconfig | 1 -
arch/x86/Kconfig | 1 -
crypto/ahash.c | 4 +-
crypto/scompress.c | 8 ++--
drivers/ata/libata-sff.c | 6 +--
drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +-
drivers/memstick/core/mspro_block.c | 3 +-
drivers/memstick/host/jmb38x_ms.c | 3 +-
drivers/memstick/host/tifm_ms.c | 3 +-
drivers/mmc/host/tifm_sd.c | 4 +-
drivers/mmc/host/usdhi6rol0.c | 4 +-
drivers/scsi/scsi_lib.c | 3 +-
drivers/scsi/sg.c | 3 +-
drivers/vfio/pci/pds/lm.c | 3 +-
drivers/vfio/pci/virtio/migrate.c | 3 +-
fs/hugetlbfs/inode.c | 25 ++++------
include/crypto/scatterwalk.h | 4 +-
include/linux/bvec.h | 7 +--
include/linux/mm.h | 48 +++++++++++++++----
include/linux/page-flags.h | 5 +-
include/linux/scatterlist.h | 4 +-
io_uring/zcrx.c | 34 ++++---------
kernel/dma/remap.c | 2 +-
mm/Kconfig | 3 +-
mm/cma.c | 36 +++++++++-----
mm/gup.c | 13 +++--
mm/hugetlb.c | 23 ++++-----
mm/internal.h | 1 +
mm/kfence/core.c | 17 ++++---
mm/memremap.c | 3 ++
mm/mm_init.c | 13 ++---
mm/page_alloc.c | 5 +-
mm/pagewalk.c | 2 +-
mm/percpu-km.c | 2 +-
mm/util.c | 33 +++++++++++++
tools/testing/scatterlist/linux/mm.h | 1 -
.../selftests/wireguard/qemu/kernel.config | 1 -
40 files changed, 203 insertions(+), 150 deletions(-)
base-commit: c0e3b3f33ba7b767368de4afabaf7c1ddfdc3872
--
2.50.1
From: Vivek Yadav <vivekyadav1207731111(a)gmail.com>
Hi all,
This small series makes cosmetic style cleanups in the arm64 kselftests
to improve readability and suppress checkpatch warnings. These changes
are purely cosmetic and do not affect functionality.
Changes in this series:
* Suppress unnecessary checkpatch warning in a comment
* Add parentheses around sizeof for clarity
* Remove redundant blank line
---
Vivek Yadav (3):
kselftest/arm64: Remove extra blank line
kselftest/arm64: Supress warning and improve readability
kselftest/arm64: Add parentheses around sizeof for clarity
tools/testing/selftests/arm64/abi/hwcap.c | 1 -
tools/testing/selftests/arm64/bti/assembler.h | 1 -
tools/testing/selftests/arm64/fp/fp-ptrace.c | 1 -
tools/testing/selftests/arm64/fp/fp-stress.c | 4 ++--
tools/testing/selftests/arm64/fp/sve-ptrace.c | 2 +-
tools/testing/selftests/arm64/fp/vec-syscfg.c | 1 -
tools/testing/selftests/arm64/fp/zt-ptrace.c | 1 -
tools/testing/selftests/arm64/gcs/gcs-locking.c | 1 -
8 files changed, 3 insertions(+), 9 deletions(-)
--
2.25.1
Two small cleanups.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (2):
kselftest/arm64/gcs: Correctly check return value when disabling GCS
kselftest/arm64/gcs: Use nolibc's getauxval()
tools/testing/selftests/arm64/gcs/basic-gcs.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20250821-nolibc-gcs-fixes-11cf7585bb74
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Use the return value of chdir("/") and check whether it is 0 (ok) or
-1 (not ok, so the test stops).
The patch fixes the following warnings:
mount-notify_test.c:468:17: warning: ignoring return value of ‘chdir’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
468 | chdir("/");
mount-notify_test_ns.c:489:17: warning: ignoring return value of ‘chdir’
declared with attribute ‘warn_unused_result’ [-Wunused-result]
489 | chdir("/");
To reproduce the issue, use the command:
make kselftest TARGET=filesystems/statmount
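Outside of the kselftest harness, the same idea (never ignore the result of chdir()) can be sketched as below. The helper name change_dir_checked() is hypothetical and only serves as an illustration; the patch itself uses ASSERT_EQ() from the harness.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: change directory and report failure instead of
 * silently ignoring the return value of chdir(). */
static int change_dir_checked(const char *path)
{
	assert(path != NULL);

	/* chdir() returns 0 on success and -1 on error, with errno set. */
	if (chdir(path) != 0) {
		perror("chdir");
		return -1;
	}
	return 0;
}
```

A caller would then stop early when change_dir_checked("/") returns -1, which is what the assertion in the test achieves.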
Signed-off-by: Alessandro Zanni <alessandro.zanni87(a)gmail.com>
---
.../selftests/filesystems/mount-notify/mount-notify_test.c | 2 +-
.../selftests/filesystems/mount-notify/mount-notify_test_ns.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
index 5a3b0ace1a88..a7f899599d52 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
@@ -458,7 +458,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
diff --git a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
index d91946e69591..dc9eb3087a1a 100644
--- a/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
+++ b/tools/testing/selftests/filesystems/mount-notify/mount-notify_test_ns.c
@@ -486,7 +486,7 @@ TEST_F(fanotify, rmdir)
ASSERT_GE(ret, 0);
if (ret == 0) {
- chdir("/");
+ ASSERT_EQ(0, chdir("/"));
unshare(CLONE_NEWNS);
mount("", "/", NULL, MS_REC|MS_PRIVATE, NULL);
umount2("/a", MNT_DETACH);
--
2.43.0
This series adds ONE_REG interface for SBI FWFT extension implemented
by KVM RISC-V. This was missed out in accepted SBI FWFT patches for
KVM RISC-V.
These patches can also be found in the riscv_kvm_fwft_one_reg_v3 branch
at: https://github.com/avpatel/linux.git
Changes since v2:
- Re-based on latest KVM RISC-V queue
- Improved FWFT ONE_REG interface to allow enabling/disabling each
FWFT feature from KVM userspace
Changes since v1:
- Dropped have_state in PATCH4 as suggested by Drew
- Added Drew's Reviewed-by in appropriate patches
Anup Patel (6):
RISC-V: KVM: Set initial value of hedeleg in kvm_arch_vcpu_create()
RISC-V: KVM: Introduce feature specific reset for SBI FWFT
RISC-V: KVM: Introduce optional ONE_REG callbacks for SBI extensions
RISC-V: KVM: Move copy_sbi_ext_reg_indices() to SBI implementation
RISC-V: KVM: Implement ONE_REG interface for SBI FWFT state
KVM: riscv: selftests: Add SBI FWFT to get-reg-list test
arch/riscv/include/asm/kvm_vcpu_sbi.h | 22 +-
arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h | 1 +
arch/riscv/include/uapi/asm/kvm.h | 15 ++
arch/riscv/kvm/vcpu.c | 3 +-
arch/riscv/kvm/vcpu_onereg.c | 60 +----
arch/riscv/kvm/vcpu_sbi.c | 172 +++++++++++--
arch/riscv/kvm/vcpu_sbi_fwft.c | 227 ++++++++++++++++--
arch/riscv/kvm/vcpu_sbi_sta.c | 63 +++--
.../selftests/kvm/riscv/get-reg-list.c | 32 +++
9 files changed, 467 insertions(+), 128 deletions(-)
--
2.43.0
When many ADD_ADDR need to be sent, it can take some time to send each
of them, and create new subflows. Some CIs seem to occasionally have
issues with these tests, especially with "debug" kernels.
Two subtests will now run for a slightly longer time: the last two where
3 or more ADD_ADDR are sent during the test.
Reviewed-by: Geliang Tang <geliang(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
tools/testing/selftests/net/mptcp/mptcp_join.sh | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index e9e11a9e60fd5374c8a98c3b7159ccbca8053030..b41cebfa1f921ce9ea6a88a908bf6d5e6027b367 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -2268,7 +2268,8 @@ signal_address_tests()
pm_nl_add_endpoint $ns1 10.0.3.1 flags signal
pm_nl_add_endpoint $ns1 10.0.4.1 flags signal
pm_nl_set_limits $ns2 3 3
- run_tests $ns1 $ns2 10.0.1.1
+ speed=slow \
+ run_tests $ns1 $ns2 10.0.1.1
chk_join_nr 3 3 3
chk_add_nr 3 3
fi
@@ -2280,7 +2281,8 @@ signal_address_tests()
pm_nl_add_endpoint $ns1 10.0.3.1 flags signal
pm_nl_add_endpoint $ns1 10.0.14.1 flags signal
pm_nl_set_limits $ns2 3 3
- run_tests $ns1 $ns2 10.0.1.1
+ speed=slow \
+ run_tests $ns1 $ns2 10.0.1.1
join_syn_tx=3 \
chk_join_nr 1 1 1
chk_add_nr 3 3
--
2.51.0
ADD_ADDR can be retransmitted, and with the parent commit, these
retransmissions can be sent quicker: from 2 minutes to less than one
second.
To avoid false positives where retransmitted ADD_ADDR causes higher
counters than expected, it is required to be more tolerant. Errors are
now only reported when fewer ADD_ADDRs have been sent/received, except
if no ADD_ADDR are expected.
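The tolerance rule described above can be written down as a small predicate. This is only a sketch in C of the condition used by the shell checks; the function and parameter names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when the observed ADD_ADDR counter should be reported as a
 * failure. Names are illustrative, not taken from the selftests. */
static bool add_addr_count_is_error(int count, int expected)
{
	assert(count >= 0 && expected >= 0);

	if (count == expected)
		return false;

	/* No ADD_ADDR expected: any ADD_ADDR seen is an error. */
	if (expected == 0)
		return true;

	/* Otherwise only "fewer than expected" is an error: extra ones
	 * may simply be retransmissions. */
	return count < expected;
}
```

With this rule, 5 ADD_ADDR seen for 3 expected is tolerated, while 2 seen for 3 expected, or 1 seen for 0 expected, is still reported.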
Before the parent commit, the tolerance was present for each test where
the ADD_ADDR could be retransmitted in a reasonable time (1 sec). Now
that all tests can have retransmitted ADD_ADDR, it is normal to apply
the same tolerance for all tests.
An alternative could be to disable the ADD_ADDR retransmissions by
default, but that's changing the default kernel behaviour. Plus,
ADD_ADDR retransmissions can be required for some tests. To avoid adding
exceptions to many tests, it seems better to increase the tolerance.
Later, we could add a new MIB counter to identify the ADD_ADDR
retransmissions, and remove the tolerance when this counter is
available.
Reviewed-by: Geliang Tang <geliang(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
tools/testing/selftests/net/mptcp/mptcp_join.sh | 19 +++++++------------
1 file changed, 7 insertions(+), 12 deletions(-)
diff --git a/tools/testing/selftests/net/mptcp/mptcp_join.sh b/tools/testing/selftests/net/mptcp/mptcp_join.sh
index 2f046167a0b6cc6fb5531a033d8d95c9ea399cf9..e9e11a9e60fd5374c8a98c3b7159ccbca8053030 100755
--- a/tools/testing/selftests/net/mptcp/mptcp_join.sh
+++ b/tools/testing/selftests/net/mptcp/mptcp_join.sh
@@ -358,6 +358,7 @@ reset_with_add_addr_timeout()
tables="${ip6tables}"
fi
+ # set a maximum, to avoid too long timeout with exponential backoff
ip netns exec $ns1 sysctl -q net.mptcp.add_addr_timeout=1
if ! ip netns exec $ns2 $tables -A OUTPUT -p tcp \
@@ -1669,7 +1670,6 @@ chk_add_nr()
local tx=""
local rx=""
local count
- local timeout
if [[ $ns_invert = "invert" ]]; then
ns_tx=$ns2
@@ -1678,15 +1678,13 @@ chk_add_nr()
rx=" server"
fi
- timeout=$(ip netns exec ${ns_tx} sysctl -n net.mptcp.add_addr_timeout)
-
print_check "add addr rx${rx}"
count=$(mptcp_lib_get_counter ${ns_rx} "MPTcpExtAddAddr")
if [ -z "$count" ]; then
print_skip
- # if the test configured a short timeout tolerate greater then expected
- # add addrs options, due to retransmissions
- elif [ "$count" != "$add_nr" ] && { [ "$timeout" -gt 1 ] || [ "$count" -lt "$add_nr" ]; }; then
+ # Tolerate more ADD_ADDR then expected (if any), due to retransmissions
+ elif [ "$count" != "$add_nr" ] &&
+ { [ "$add_nr" -eq 0 ] || [ "$count" -lt "$add_nr" ]; }; then
fail_test "got $count ADD_ADDR[s] expected $add_nr"
else
print_ok
@@ -1774,18 +1772,15 @@ chk_add_tx_nr()
{
local add_tx_nr=$1
local echo_tx_nr=$2
- local timeout
local count
- timeout=$(ip netns exec $ns1 sysctl -n net.mptcp.add_addr_timeout)
-
print_check "add addr tx"
count=$(mptcp_lib_get_counter ${ns1} "MPTcpExtAddAddrTx")
if [ -z "$count" ]; then
print_skip
- # if the test configured a short timeout tolerate greater then expected
- # add addrs options, due to retransmissions
- elif [ "$count" != "$add_tx_nr" ] && { [ "$timeout" -gt 1 ] || [ "$count" -lt "$add_tx_nr" ]; }; then
+ # Tolerate more ADD_ADDR then expected (if any), due to retransmissions
+ elif [ "$count" != "$add_tx_nr" ] &&
+ { [ "$add_tx_nr" -eq 0 ] || [ "$count" -lt "$add_tx_nr" ]; }; then
fail_test "got $count ADD_ADDR[s] TX, expected $add_tx_nr"
else
print_ok
--
2.51.0
From: Geliang Tang <tanggeliang(a)kylinos.cn>
Currently the ADD_ADDR option is retransmitted with a fixed timeout. This
patch makes the retransmission timeout adaptive by using the maximum RTO
among all the subflows, while still capping it at the configured maximum
value (add_addr_timeout_max). This improves responsiveness when
establishing new subflows.
Specifically:
1. Adds mptcp_adjust_add_addr_timeout() helper to compute the adaptive
timeout.
2. Uses maximum subflow RTO (icsk_rto) when available.
3. Applies exponential backoff based on retransmission count.
4. Maintains fallback to configured max timeout when no RTO data exists.
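As a rough userspace sketch of the four points above (not the kernel code itself): the subflow RTOs are passed in as a plain array, and RETRANS_MAX, adjust_add_addr_timeout() and next_timer_delay() are illustrative names and values assumed here.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative cap on the number of retransmissions; the value is an
 * assumption made for this sketch. */
#define RETRANS_MAX 3U

/* Use the largest subflow RTO, but never exceed the configured maximum.
 * With no RTO data (no subflows, or all zero), fall back to the maximum. */
static unsigned int adjust_add_addr_timeout(const unsigned int *subflow_rto,
					    size_t nr_subflows,
					    unsigned int configured_max)
{
	unsigned int rto = configured_max;
	unsigned int max = 0;

	assert(subflow_rto != NULL || nr_subflows == 0);

	for (size_t i = 0; i < nr_subflows; i++) {
		if (subflow_rto[i] > max)
			max = subflow_rto[i];
	}

	if (max && max < rto)
		rto = max;

	return rto;
}

/* Exponential backoff: double the base timeout per retransmission done. */
static unsigned int next_timer_delay(unsigned int timeout,
				     unsigned int retrans_times)
{
	assert(retrans_times <= RETRANS_MAX);
	return timeout << retrans_times;
}
```

Note that a configured maximum of 0 still yields 0 here, matching the documented "do not retransmit if set to 0" behaviour.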
This slightly changes the behaviour of the MPTCP "add_addr_timeout"
sysctl knob to be used as a maximum instead of a fixed value. But this
is seen as an improvement: the ADD_ADDR might be sent quicker than
before to improve the overall MPTCP connection. Also, the default
value is set to 2 min, which was already way too long, and caused the
ADD_ADDR not to be retransmitted for connections shorter than 2 minutes.
Suggested-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/576
Reviewed-by: Christoph Paasch <cpaasch(a)openai.com>
Signed-off-by: Geliang Tang <tanggeliang(a)kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
v2: no changes.
---
Documentation/networking/mptcp-sysctl.rst | 8 +++++---
net/mptcp/pm.c | 28 ++++++++++++++++++++++++----
2 files changed, 29 insertions(+), 7 deletions(-)
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst
index 1683c139821e3ba6d9eaa3c59330a523d29f1164..1eb6af26b4a7acdedd575a126c576210a78f0d4d 100644
--- a/Documentation/networking/mptcp-sysctl.rst
+++ b/Documentation/networking/mptcp-sysctl.rst
@@ -8,9 +8,11 @@ MPTCP Sysfs variables
===============================
add_addr_timeout - INTEGER (seconds)
- Set the timeout after which an ADD_ADDR control message will be
- resent to an MPTCP peer that has not acknowledged a previous
- ADD_ADDR message.
+ Set the maximum value of timeout after which an ADD_ADDR control message
+ will be resent to an MPTCP peer that has not acknowledged a previous
+ ADD_ADDR message. A dynamically estimated retransmission timeout based
+ on the estimated connection round-trip-time is used if this value is
+ lower than the maximum one.
Do not retransmit if set to 0.
diff --git a/net/mptcp/pm.c b/net/mptcp/pm.c
index 136a380602cae872b76560649c924330e5f42533..204e1f61212e2be77a8476f024b59be67d04b80a 100644
--- a/net/mptcp/pm.c
+++ b/net/mptcp/pm.c
@@ -268,6 +268,27 @@ int mptcp_pm_mp_prio_send_ack(struct mptcp_sock *msk,
return -EINVAL;
}
+static unsigned int mptcp_adjust_add_addr_timeout(struct mptcp_sock *msk)
+{
+ const struct net *net = sock_net((struct sock *)msk);
+ unsigned int rto = mptcp_get_add_addr_timeout(net);
+ struct mptcp_subflow_context *subflow;
+ unsigned int max = 0;
+
+ mptcp_for_each_subflow(msk, subflow) {
+ struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+ struct inet_connection_sock *icsk = inet_csk(ssk);
+
+ if (icsk->icsk_rto > max)
+ max = icsk->icsk_rto;
+ }
+
+ if (max && max < rto)
+ rto = max;
+
+ return rto;
+}
+
static void mptcp_pm_add_timer(struct timer_list *timer)
{
struct mptcp_pm_add_entry *entry = timer_container_of(entry, timer,
@@ -292,7 +313,7 @@ static void mptcp_pm_add_timer(struct timer_list *timer)
goto out;
}
- timeout = mptcp_get_add_addr_timeout(sock_net(sk));
+ timeout = mptcp_adjust_add_addr_timeout(msk);
if (!timeout)
goto out;
@@ -307,7 +328,7 @@ static void mptcp_pm_add_timer(struct timer_list *timer)
if (entry->retrans_times < ADD_ADDR_RETRANS_MAX)
sk_reset_timer(sk, timer,
- jiffies + timeout);
+ jiffies + (timeout << entry->retrans_times));
spin_unlock_bh(&msk->pm.lock);
@@ -348,7 +369,6 @@ bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk,
{
struct mptcp_pm_add_entry *add_entry = NULL;
struct sock *sk = (struct sock *)msk;
- struct net *net = sock_net(sk);
unsigned int timeout;
lockdep_assert_held(&msk->pm.lock);
@@ -374,7 +394,7 @@ bool mptcp_pm_alloc_anno_list(struct mptcp_sock *msk,
timer_setup(&add_entry->add_timer, mptcp_pm_add_timer, 0);
reset_timer:
- timeout = mptcp_get_add_addr_timeout(net);
+ timeout = mptcp_adjust_add_addr_timeout(msk);
if (timeout)
sk_reset_timer(sk, &add_entry->add_timer, jiffies + timeout);
--
2.51.0
Changes in v2:
- Optimized the logic in descriptions. (Song Liu)
- Created a new header file to declare kfuncs for future extensions included by other files. (Christian Loehle)
- Fixed some logical issues in the code. (Christian Loehle)
Reference:
[1] https://lore.kernel.org/bpf/20250829101137.9507-1-yikai.lin@vivo.com/
Summary
----------
Hi, everyone,
This patch set introduces an extensible cpuidle governor framework
using BPF struct_ops, enabling dynamic implementation of idle-state selection policies
via BPF programs.
Motivation
----------
As is well-known, CPUs support multiple idle states (e.g., C0, C1, C2, ...),
where deeper states reduce power consumption, but result in longer wakeup latency,
potentially affecting performance.
Existing generic cpuidle governors operate effectively in common scenarios
but exhibit suboptimal behavior in specific Android phone use cases.
Our testing reveals that during low-utilization scenarios
(e.g., screen-off background tasks like music playback with CPU utilization <10%),
the C0 state occupies ~50% of idle time, causing significant energy inefficiency.
Reducing C0 to ≤20% could yield ≥5% power savings on mobile phones.
To address this, we expect:
1. Dynamic governor switching to power-saving policies for low CPU utilization scenarios (e.g., screen-off mode)
2. Dynamic switching to alternate governors for high-performance scenarios (e.g., gaming)
Overview
----------
The BPF cpuidle ext governor registers at postcore_initcall()
but remains disabled by default due to its low priority "rating" with value "1".
Activation requires raising its "rating" above that of the other governors from within BPF.
Core Components:
1.**struct cpuidle_gov_ext_ops** – BPF-overridable operations:
- ops.enable()/ops.disable(): enable or disable callback
- ops.select(): cpu Idle-state selection logic
- ops.set_stop_tick(): Scheduler tick management after state selection
- ops.reflect(): feedback info about previous idle state.
- ops.init()/ops.deinit(): Initialization or cleanup.
2.**Critical kfuncs for kernel state access**:
- bpf_cpuidle_ext_gov_update_rating():
Activates the ext governor by raising its rating; must be called from "ops.init()"
- bpf_cpuidle_ext_gov_latency_req(): get idle-state latency constraints
- bpf_tick_nohz_get_sleep_length(): get CPU sleep duration in tickless mode
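To make the roles of "rating" and ops.select() more concrete, here is a small userspace model. It is only a sketch under assumptions: the struct layouts, the single select callback, and the helper names are hypothetical and much simpler than the actual cpuidle and struct_ops interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Toy idle state: deeper states save more power but wake up more slowly. */
struct idle_state {
	const char *name;
	unsigned int exit_latency_us;
};

/* Hypothetical, heavily simplified stand-in for the overridable callbacks. */
struct gov_ops {
	int (*select)(const struct idle_state *states, size_t nr,
		      unsigned int latency_req_us);
};

struct governor {
	const char *name;
	unsigned int rating;
	struct gov_ops ops;
};

/* Example policy: pick the deepest state whose exit latency still meets the
 * latency constraint. States are assumed sorted from shallow to deep. */
static int select_deepest_allowed(const struct idle_state *states, size_t nr,
				  unsigned int latency_req_us)
{
	int idx = 0;

	assert(states != NULL && nr > 0);

	for (size_t i = 1; i < nr; i++) {
		if (states[i].exit_latency_us <= latency_req_us)
			idx = (int)i;
	}
	return idx;
}

/* In this model, the governor with the highest rating is the one in use. */
static const struct governor *pick_governor(const struct governor *govs,
					    size_t nr)
{
	const struct governor *best = NULL;

	for (size_t i = 0; i < nr; i++) {
		if (!best || govs[i].rating > best->rating)
			best = &govs[i];
	}
	return best;
}
```

In this model an "ext" governor registered with rating 1 is never picked while another governor has a higher rating; raising its rating above the others is what makes its select callback take effect.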
Future work
----------
1. Scenario detection: Identifying low-utilization states (e.g., screen-off + background music)
2. Policy optimization: Optimizing state-selection algorithms for specific scenarios
Is it related to sched_ext?
---------------------------
The cpuidle framework is as follows.
----------------------------------------------------------
Scheduler Core
----------------------------------------------------------
|
v
----------------------------------------------------------
| FAIR Class | EXT Class | IDLE Class |
----------------------------------------------------------
| | | |
| | | v
| | | ------------------------
| | | enter_cpu_idle()
| | | ------------------------
| | | |
| | | v
| | | ------------------------------
| | | | CPUIDLE Governor |
| | | ------------------------------
| | | | | |
| | | v v v
| | |-----------------------------------
| | | default | | other | | BPF ext |
| | | Governor | | Governor | | Governor | <<===Here is the feature we add.
| | |-----------------------------------
| | | | | |
| | | v v v
| | |-------------------------------------
| | | select idle state
| | |-------------------------------------
cpuidle is invoked only after switching to the idle class, when no tasks are present in the scheduling RQ.
sched_ext and cpuidle are therefore not directly related, so implementing kfuncs or other extensions through sched_ext is not feasible.
Lin Yikai (2):
cpuidle: Implement BPF extensible cpuidle governor class
selftests/bpf: Add selftests for cpuidle_gov_ext
drivers/cpuidle/Kconfig | 12 +
drivers/cpuidle/governors/Makefile | 1 +
drivers/cpuidle/governors/ext.c | 537 ++++++++++++++++++
.../bpf/prog_tests/test_cpuidle_gov_ext.c | 28 +
.../selftests/bpf/progs/cpuidle_common.h | 13 +
.../selftests/bpf/progs/cpuidle_gov_ext.c | 200 +++++++
6 files changed, 791 insertions(+)
create mode 100644 drivers/cpuidle/governors/ext.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpuidle_gov_ext.c
create mode 100644 tools/testing/selftests/bpf/progs/cpuidle_common.h
create mode 100644 tools/testing/selftests/bpf/progs/cpuidle_gov_ext.c
--
2.43.0
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Please find the v16 AccECN protocol patch series, which covers the core
functionality of Accurate ECN, AccECN negotiation, AccECN TCP options,
and AccECN failure handling. The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28, and it
will be RFC9768.
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best Regards,
Chia-Yu
---
v16 (6-Sep-2025)
- Use TCP_ECN_IN_ACCECN_OUT_ACCECN, TCP_ECN_IN_ECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_ECN in comments of tcp_ecn_send_syn() (Eric Dumazet <edumazet(a)google.com>)
- Add tcpi_accecn_fail_mode and tcpi_accecn_opt_seen to make tcp_info be multiple of 64 bits in patch #12
v15 (14-Aug-2025)
- Update pahole results in commit messages
- Accurate ECN will become RFC9768
v14 (22-Jul-2025)
- Add missing const for struct tcp_sock of tcp_accecn_option_beacon_check() of #11 (Simon Horman <horms(a)kernel.org>)
v13 (18-Jul-2025)
- Implement tcp_accecn_extract_syn_ect() and tcp_accecn_reflector_flags() with static array lookup of patch #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Fix typos in comments of #6 and remove patch #7 of v12 about simultaneous connect (Paolo Abeni <pabeni(a)redhat.com>)
- Move TCP_ACCECN_E1B_INIT_OFFSET, TCP_ACCECN_E0B_INIT_OFFSET, and TCP_ACCECN_CEB_INIT_OFFSET from patch #7 to #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use static array lookup in tcp_accecn_optfield_to_ecnfield() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return false when WARN_ON_ONCE() is true in tcp_accecn_process_option() of patch #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Make synack_ecn_bytes as static const array and use const u32 pointer in tcp_options_write() of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Use ALIGN() and ALIGN_DOWN() in tcp_options_fit_accecn() to pad TCP AccECN option to dword of #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Return TCP_ACCECN_OPT_FAIL_SEEN if WARN_ON_ONCE() is true in tcp_accecn_option_init() of #12 (Paolo Abeni <pabeni(a)redhat.com>)
v12 (04-Jul-2025)
- Fix compilation issues with some intermediate patches in v11
- Add more comments for AccECN helpers of tcp_ecn.h
v11 (03-Jul-2025)
- Fix compilation issues with some intermediate patches in v10
v10 (02-Jul-2025)
- Add new patch of separated header file include/net/tcp_ecn.h to include ECN and AccECN functions (Eric Dumazet <edumazet(a)google.com>)
- Add comments on the AccECN helper functions in tcp_ecn.h (Eric Dumazet <edumazet(a)google.com>)
- Add documentation of tcp_ecn, tcp_ecn_option, tcp_ecn_beacon in ip-sysctl.rst to the corresponding patch (Eric Dumazet <edumazet(a)google.com>)
- Split wait third ACK functionality into a separated patch from AccECN negotiation patch (Eric Dumazet <edumazet(a)google.com>)
- Add READ_ONCE() over every read of sysctl for all patches in the series (Eric Dumazet <edumazet(a)google.com>)
- Merge heuristics of AccECN option ceb/cep and ACE field multi-wrap into a single patch
- Add a table of SACK block reduction and required AccECN field in patch #15 commit message (Eric Dumazet <edumazet(a)google.com>)
v9 (21-Jun-2025)
- Use tcp_data_ecn_check() to set TCP_ECN_SEEN flag only for RFC3168 ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments about setting TCP_ECN_SEEN flag for RFC3168 and Accurate ECN (Paolo Abeni <pabeni(a)redhat.com>)
- Restructure the code in the for loop of tcp_accecn_process_option() (Paolo Abeni <pabeni(a)redhat.com>)
- Remove ecn_bytes and add use_synack_ecn_bytes flag to identify whether syn_ack_bytes or received_ecn_bytes is used (Paolo Abeni <pabeni(a)redhat.com>)
- Replace leftover_bytes and leftover_size with leftover_highbyte and leftover_lowbyte and add comments in tcp_options_write() (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments and commit message about the 1st retx SYN still attempting AccECN negotiation (Paolo Abeni <pabeni(a)redhat.com>)
v8 (10-Jun-2025)
- Add new helper function tcp_ecn_received_counters_payload() in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Set opts->num_sack_blocks=0 to avoid potential undefined value in #8 (Paolo Abeni <pabeni(a)redhat.com>)
- Reset leftover_size to 2 once leftover_bytes is used in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_opt_demand_min() in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Add new helper function tcp_accecn_saw_opt_fail_recv() in #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Update tcp_options_fit_accecn() to avoid using recursion in #14 (Paolo Abeni <pabeni(a)redhat.com>)
v7 (14-May-2025)
- Modify group sizes of tcp_sock_write_txrx and tcp_sock_write_rx in #3 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Fix the issue in #4 and #5 where the RFC3168 ECN behavior in tcp_ecn_send() is changed (Paolo Abeni <pabeni(a)redhat.com>)
- Modify group size of tcp_sock_write_txrx in #4 and #6 based on pahole results (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message for #9 to explain the increase in tcp_sock_write_rx group size
- Modify group size of tcp_sock_write_tx in #10 based on pahole results
v6 (09-May-2025)
- Add #3 to utilize existing holes of tcp_sock_write_txrx group for later patches (#4, #9, #10) with new u8 members (Paolo Abeni <pabeni(a)redhat.com>)
- Add pahole outcomes before and after commit in #4, #5, #6, #9, #10, #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Define new helper function tcp_send_ack_reflect_ect() for sending ACK with reflected ECT in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add comments for function tcp_ecn_rcv_synack() in #5 (Paolo Abeni <pabeni(a)redhat.com>)
- Add enum/define to be used by sysctl_tcp_ecn in #5, sysctl_tcp_ecn_option in #9, and sysctl_tcp_ecn_option_beacon in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move accecn_fail_mode and saw_accecn_opt in #5 and #11 to use existing holes of tcp_sock (Paolo Abeni <pabeni(a)redhat.com>)
- Change data type of new members of tcp_request_sock and move them to the end of struct in #5 and #11 (Paolo Abeni <pabeni(a)redhat.com>)
- Move new members of tcp_info to the end of struct in #6 (Paolo Abeni <pabeni(a)redhat.com>)
- Merge previous #7 into #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Mask ecnfield with INET_ECN_MASK to remove WARN_ONCE in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Reduce the indentation levels for readability in #9 and #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Move delivered_ecn_bytes to the RX group in #9, accecn_opt_tstamp to the TX group in #10, pkts_acked_ewma to the RX group in #15 (Paolo Abeni <pabeni(a)redhat.com>)
- Add changes in Documentation/networking/net_cachelines/tcp_sock.rst for new tcp_sock members in #3, #5, #6, #9, #10, #15
v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms(a)kernel.org>)
v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms(a)kernel.org>)
v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba(a)kernel.org>)
v2 (18-Mar-2025)
- Add one missing patch from the previous AccECN protocol preparation patch series to this patch series.
---
Chia-Yu Chang (5):
tcp: reorganize tcp_sock_write_txrx group for variables later
tcp: ecn functions in separated include file
tcp: accecn: AccECN option send control
tcp: accecn: AccECN option failure handling
tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (9):
tcp: reorganize SYN ECN code
tcp: fast path functions later
tcp: AccECN core
tcp: accecn: AccECN negotiation
tcp: accecn: add AccECN rx byte counters
tcp: accecn: AccECN needs to know delivered bytes
tcp: sack option handling improvements
tcp: accecn: AccECN option
tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics
Documentation/networking/ip-sysctl.rst | 55 +-
.../networking/net_cachelines/tcp_sock.rst | 12 +
include/linux/tcp.h | 32 +-
include/net/netns/ipv4.h | 2 +
include/net/tcp.h | 87 ++-
include/net/tcp_ecn.h | 642 ++++++++++++++++++
include/uapi/linux/tcp.h | 9 +
net/ipv4/syncookies.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 19 +
net/ipv4/tcp.c | 30 +-
net/ipv4/tcp_input.c | 353 ++++++++--
net/ipv4/tcp_ipv4.c | 8 +-
net/ipv4/tcp_minisocks.c | 40 +-
net/ipv4/tcp_output.c | 294 ++++++--
net/ipv6/syncookies.c | 2 +
net/ipv6/tcp_ipv6.c | 1 +
16 files changed, 1406 insertions(+), 184 deletions(-)
create mode 100644 include/net/tcp_ecn.h
--
2.34.1
devmem test fails on NIPA. Most likely we get skb(s) with readable
frags (why?) but the failure manifests as an OOM. The OOM happens
because ncdevmem spams the following message:
recvmsg ret=-1
recvmsg: Bad address
As of today, ncdevmem can't recover from the various causes of EFAULT, which would require:
- falling back to regular recvmsg for non-devmem skbs
- increasing ctrl_data size (can't happen with ncdevmem's large buffer)
Exit (cleanly) with error when recvmsg returns EFAULT. This should at
least cause the test to clean up its state.
Signed-off-by: Stanislav Fomichev <sdf(a)fomichev.me>
---
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index 8dc9511d046f..c0a22938bed2 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -945,6 +945,10 @@ static int do_server(struct memory_buffer *mem)
continue;
if (ret < 0) {
perror("recvmsg");
+ if (errno == EFAULT) {
+ pr_err("received EFAULT, won't recover");
+ goto err_close_client;
+ }
continue;
}
if (ret == 0) {
--
2.51.0
Create a netconsole test that puts a lot of pressure on the netconsole
list manipulation. Do it by creating dynamic targets and deleting
targets while messages are being sent. Also put interface down while the
messages are being sent.
In order to do it, refactor create_dynamic_target(), so it can be used to
create random targets in the torture test.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
Changes in v2:
- Reuse the netconsole creation from lib_netcons.sh. Thus, refactoring
the create_dynamic_target() (Jakub)
- Move the "wait" to after all the messages have been sent.
- Link to v1: https://lore.kernel.org/r/20250902-netconsole_torture-v1-1-03c6066598e9@deb…
---
Breno Leitao (2):
selftest: netcons: refactor target creation
selftest: netcons: create a torture test
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 30 +++--
.../selftests/drivers/net/netcons_torture.sh | 127 +++++++++++++++++++++
3 files changed, 147 insertions(+), 11 deletions(-)
---
base-commit: 2fd4161d0d2547650d9559d57fc67b4e0a26a9e3
change-id: 20250902-netconsole_torture-8fc23f0aca99
Best regards,
--
Breno Leitao <leitao(a)debian.org>
When the bpf ring buffer is full, new events cannot be recorded until
the consumer consumes some events to free space. This may cause critical
events to be discarded, such as in fault diagnosis, where recent events
are more critical than older ones.
So add an overwrite mode for the bpf ring buffer. In this mode, the new event
overwrites the oldest event when the buffer is full.
v2:
- remove libbpf changes (Andrii)
- update overwrite benchmark
v1:
https://lore.kernel.org/bpf/20250804022101.2171981-1-xukuohai@huaweicloud.c…
Xu Kuohai (3):
bpf: Add overwrite mode for bpf ring buffer
selftests/bpf: Add test for overwrite ring buffer
selftests/bpf/benchs: Add producer and overwrite bench for ring buffer
include/uapi/linux/bpf.h | 4 +
kernel/bpf/ringbuf.c | 159 +++++++++++++++---
tools/include/uapi/linux/bpf.h | 4 +
tools/testing/selftests/bpf/Makefile | 3 +-
tools/testing/selftests/bpf/bench.c | 2 +
.../selftests/bpf/benchs/bench_ringbufs.c | 95 ++++++++++-
.../bpf/benchs/run_bench_ringbufs.sh | 4 +
.../selftests/bpf/prog_tests/ringbuf.c | 74 ++++++++
.../selftests/bpf/progs/ringbuf_bench.c | 10 ++
.../bpf/progs/test_ringbuf_overwrite.c | 98 +++++++++++
10 files changed, 418 insertions(+), 35 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/test_ringbuf_overwrite.c
--
2.43.0
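To illustrate the difference between the two behaviours described in the
ring buffer cover letter above, here is a minimal bash sketch. It is not the
kernel implementation: a bash array stands in for the ring buffer, and the
`fill` function, the `capacity` variable and the mode names `default` and
`overwrite` are hypothetical names chosen only for this illustration.

```bash
#!/bin/bash
# Illustration only: a fixed-capacity buffer modelled with a bash array.
capacity=3

# fill MODE EVENT...
# MODE "default":   new events are dropped once the buffer is full.
# MODE "overwrite": the oldest event is dropped to make room for the new one.
fill()
{
    local mode=$1; shift
    local -a ring=()
    local ev

    for ev in "$@"; do
        if (( ${#ring[@]} < capacity )); then
            ring+=("$ev")
        elif [[ "$mode" == overwrite ]]; then
            # Drop the oldest entry, then append the new event.
            ring=("${ring[@]:1}" "$ev")
        fi
        # In default mode a full buffer simply discards the new event.
    done

    echo "${ring[*]}"
}

default_result=$(fill default 1 2 3 4 5)
overwrite_result=$(fill overwrite 1 2 3 4 5)

echo "default:   ${default_result}"
echo "overwrite: ${overwrite_result}"
```

With five events and room for three, the default mode keeps the three oldest
events while the overwrite mode keeps the three most recent ones, which is
the property the cover letter wants for fault diagnosis.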
Print a message so that people reading dmesg know that these NULL
dereferences are not a bug, but instead a deliberate part of
the testing.
Signed-off-by: Dan Carpenter <dan.carpenter(a)linaro.org>
---
lib/kunit/kunit-test.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/lib/kunit/kunit-test.c b/lib/kunit/kunit-test.c
index 8c01eabd4eaf..a8b6e16f4465 100644
--- a/lib/kunit/kunit-test.c
+++ b/lib/kunit/kunit-test.c
@@ -119,6 +119,8 @@ static void kunit_test_null_dereference(void *data)
struct kunit *test = data;
int *null = NULL;
+ pr_info("Triggering deliberate NULL dereference.\n");
+
*null = 0;
KUNIT_FAIL(test, "This line should never be reached\n");
--
2.47.2
Hello Jiayuan Chen,
Commit 7b2fa44de5e7 ("selftest/bpf/benchs: Add benchmark for sockmap
usage") from Apr 7, 2025 (linux-next), leads to the following Smatch
static checker warning:
tools/testing/selftests/bpf/benchs/bench_sockmap.c:129 bench_sockmap_prog_destroy()
error: buffer overflow 'ctx.fds' 5 <= 19
tools/testing/selftests/bpf/benchs/bench_sockmap.c
123 static void bench_sockmap_prog_destroy(void)
124 {
125 int i;
126
127 for (i = 0; i < sizeof(ctx.fds); i++) {
^^^^^^^^^^^^^^^
This should be ARRAY_SIZE(ctx.fds) otherwise it's a buffer overflow.
128 if (ctx.fds[0] > 0)
^^^^^^^^^^
Instead of .fds[0] it should be .fds[i], right?
--> 129 close(ctx.fds[i]);
130 }
131
132 bench_sockmap_prog__destroy(ctx.skel);
133 }
regards,
dan carpenter
There are currently no kernel tests that verify setting and getting
options of the team driver.
In the future, options may be added that implicitly change other
options, which will make it useful to have tests like these that show
nothing breaks. There will be a follow up patch to this that adds new
"rx_enabled" and "tx_enabled" options, which will implicitly affect the
"enabled" option value and vice versa.
The tests use teamnl to first set options to specific values and then
get them to compare to the set values.
Signed-off-by: Marc Harvey <marcharvey(a)google.com>
---
Changes in v2:
- Fixed shellcheck failures.
- Fixed test failing in vng by adding a config option to enable the
team driver's active backup mode.
- Link to v1: https://lore.kernel.org/netdev/20250902235504.4190036-1-marcharvey@google.c…
.../selftests/drivers/net/team/Makefile | 6 +-
.../testing/selftests/drivers/net/team/config | 1 +
.../selftests/drivers/net/team/options.sh | 192 ++++++++++++++++++
3 files changed, 197 insertions(+), 2 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/team/options.sh
diff --git a/tools/testing/selftests/drivers/net/team/Makefile b/tools/testing/selftests/drivers/net/team/Makefile
index eaf6938f100e..8b00b70ce67f 100644
--- a/tools/testing/selftests/drivers/net/team/Makefile
+++ b/tools/testing/selftests/drivers/net/team/Makefile
@@ -1,11 +1,13 @@
# SPDX-License-Identifier: GPL-2.0
# Makefile for net selftests
-TEST_PROGS := dev_addr_lists.sh propagation.sh
+TEST_PROGS := dev_addr_lists.sh propagation.sh options.sh
TEST_INCLUDES := \
../bonding/lag_lib.sh \
../../../net/forwarding/lib.sh \
- ../../../net/lib.sh
+ ../../../net/lib.sh \
+ ../../../net/in_netns.sh \
+ ../../../net/lib/sh/defer.sh \
include ../../../lib.mk
diff --git a/tools/testing/selftests/drivers/net/team/config b/tools/testing/selftests/drivers/net/team/config
index 636b3525b679..558e1d0cf565 100644
--- a/tools/testing/selftests/drivers/net/team/config
+++ b/tools/testing/selftests/drivers/net/team/config
@@ -3,4 +3,5 @@ CONFIG_IPV6=y
CONFIG_MACVLAN=y
CONFIG_NETDEVSIM=m
CONFIG_NET_TEAM=y
+CONFIG_NET_TEAM_MODE_ACTIVEBACKUP=y
CONFIG_NET_TEAM_MODE_LOADBALANCE=y
diff --git a/tools/testing/selftests/drivers/net/team/options.sh b/tools/testing/selftests/drivers/net/team/options.sh
new file mode 100755
index 000000000000..82bf22aa3480
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/team/options.sh
@@ -0,0 +1,192 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# These tests verify basic set and get functionality of the team
+# driver options over netlink.
+
+# Run in private netns.
+test_dir="$(dirname "$0")"
+if [[ $# -eq 0 ]]; then
+ "${test_dir}"/../../../net/in_netns.sh "$0" __subprocess
+ exit $?
+fi
+
+ALL_TESTS="
+ team_test_options
+"
+
+source "${test_dir}/../../../net/lib.sh"
+
+TEAM_PORT="team0"
+MEMBER_PORT="dummy0"
+
+setup()
+{
+ ip link add name "${MEMBER_PORT}" type dummy
+ ip link add name "${TEAM_PORT}" type team
+}
+
+get_and_check_value()
+{
+ local option_name="$1"
+ local expected_value="$2"
+ local port_flag="$3"
+
+ local value_from_get
+
+ if ! value_from_get=$(teamnl "${TEAM_PORT}" getoption "${option_name}" \
+ "${port_flag}"); then
+ echo "Could not get option '${option_name}'" >&2
+ return 1
+ fi
+
+ if [[ "${value_from_get}" != "${expected_value}" ]]; then
+ echo "Incorrect value for option '${option_name}'" >&2
+ echo "get (${value_from_get}) != set (${expected_value})" >&2
+ return 1
+ fi
+}
+
+set_and_check_get()
+{
+ local option_name="$1"
+ local option_value="$2"
+ local port_flag="$3"
+
+ local value_from_get
+
+ if ! teamnl "${TEAM_PORT}" setoption "${option_name}" "${option_value}" \
+ "${port_flag}"; then
+ echo "'setoption ${option_name} ${option_value}' failed" >&2
+ return 1
+ fi
+
+ get_and_check_value "${option_name}" "${option_value}" "${port_flag}"
+ return $?
+}
+
+# Get a "port flag" to pass to the `teamnl` command.
+# E.g. $1="dummy0" -> "--port=dummy0",
+# $1="" -> ""
+get_port_flag()
+{
+ local port_name="$1"
+
+ if [[ -n "${port_name}" ]]; then
+ echo "--port=${port_name}"
+ fi
+}
+
+attach_port_if_specified()
+{
+ local port_name="${1}"
+
+ if [[ -n "${port_name}" ]]; then
+ ip link set dev "${port_name}" master "${TEAM_PORT}"
+ return $?
+ fi
+}
+
+detach_port_if_specified()
+{
+ local port_name="${1}"
+
+ if [[ -n "${port_name}" ]]; then
+ ip link set dev "${port_name}" nomaster
+ return $?
+ fi
+}
+
+#######################################
+# Test that an option's get value matches its set value.
+# Globals:
+# RET - Used by testing infra like `check_err`.
+# EXIT_STATUS - Used by `log_test` to set the whole script's exit value.
+# Arguments:
+# option_name - The name of the option.
+# value_1 - The first value to try setting.
+# value_2 - The second value to try setting.
+# port_name - The (optional) name of the attached port.
+#######################################
+team_test_option()
+{
+ local option_name="$1"
+ local value_1="$2"
+ local value_2="$3"
+ local possible_values="$2 $3 $2"
+ local port_name="$4"
+ local port_flag
+
+ RET=0
+
+ echo "Setting '${option_name}' to '${value_1}' and '${value_2}'"
+
+ attach_port_if_specified "${port_name}"
+ check_err $? "Couldn't attach ${port_name} to master"
+ port_flag=$(get_port_flag "${port_name}")
+
+ # Set and get both possible values.
+ for value in ${possible_values}; do
+ set_and_check_get "${option_name}" "${value}" "${port_flag}"
+ check_err $? "Failed to set '${option_name}' to '${value}'"
+ done
+
+ detach_port_if_specified "${port_name}"
+ check_err $? "Couldn't detach ${port_name} from its master"
+
+ log_test "Set + Get '${option_name}' test"
+}
+
+#######################################
+# Test that getting a non-existent option fails.
+# Globals:
+# RET - Used by testing infra like `check_err`.
+# EXIT_STATUS - Used by `log_test` to set the whole script's exit value.
+# Arguments:
+# option_name - The name of the option.
+# port_name - The (optional) name of the attached port.
+#######################################
+team_test_get_option_fails()
+{
+ local option_name="$1"
+ local port_name="$2"
+ local port_flag
+
+ RET=0
+
+ attach_port_if_specified "${port_name}"
+ check_err $? "Couldn't attach ${port_name} to master"
+ port_flag=$(get_port_flag "${port_name}")
+
+ # Just confirm that getting the value fails.
+ teamnl "${TEAM_PORT}" getoption "${option_name}" "${port_flag}"
+ check_fail $? "Shouldn't be able to get option '${option_name}'"
+
+ detach_port_if_specified "${port_name}"
+
+ log_test "Get '${option_name}' fails"
+}
+
+team_test_options()
+{
+ # Wrong option name behavior.
+ team_test_get_option_fails fake_option1
+ team_test_get_option_fails fake_option2 "${MEMBER_PORT}"
+
+ # Correct set and get behavior.
+ team_test_option mode activebackup loadbalance
+ team_test_option notify_peers_count 0 5
+ team_test_option notify_peers_interval 0 5
+ team_test_option mcast_rejoin_count 0 5
+ team_test_option mcast_rejoin_interval 0 5
+ team_test_option enabled true false "${MEMBER_PORT}"
+ team_test_option user_linkup true false "${MEMBER_PORT}"
+ team_test_option user_linkup_enabled true false "${MEMBER_PORT}"
+ team_test_option priority 10 20 "${MEMBER_PORT}"
+ team_test_option queue_id 0 1 "${MEMBER_PORT}"
+}
+
+require_command teamnl
+setup
+tests_run
+exit "${EXIT_STATUS}"
--
2.51.0.338.gd7d06c2dae-goog
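The set-then-get round trip that options.sh performs can be sketched without
a team device. This is only an illustration of the pattern, assuming a bash
associative array in place of the driver: `fake_setoption`, `fake_getoption`
and `set_and_check` are hypothetical stand-ins and do not call teamnl.

```bash
#!/bin/bash
# Stand-in for the driver's option table (hypothetical, for illustration).
declare -A OPTIONS=()

fake_setoption()
{
    OPTIONS[$1]=$2
}

# Fails, like a get of an unknown option, when the name was never set.
fake_getoption()
{
    [[ -n "${OPTIONS[$1]+set}" ]] || return 1
    echo "${OPTIONS[$1]}"
}

# Set an option, read it back, and succeed only if both values match.
set_and_check()
{
    local name=$1
    local value=$2
    local got

    fake_setoption "$name" "$value" || return 1
    got=$(fake_getoption "$name") || return 1
    [[ "$got" == "$value" ]]
}

# Same shape as the test: try value 1, value 2, then value 1 again.
for value in activebackup loadbalance activebackup; do
    if set_and_check mode "$value"; then
        echo "mode=$value: ok"
    else
        echo "mode=$value: FAIL" >&2
    fi
done
```

Cycling through the first value a second time, as the patch does with
"$2 $3 $2", checks that an option can be changed back and not only changed
once.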
Usually the autodefer helpers in lib.sh are expected to be run in a context
where success is the expected outcome. However, when using them for feature
detection, failure can legitimately occur. But the failed command still
schedules a cleanup, which will likely fail again.
Instead, only schedule deferred cleanup when the positive command succeeds.
This way of organizing the cleanup has the added benefit that now the
return code from these functions reflects whether the command passed.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: linux-kselftest(a)vger.kernel.org
tools/testing/selftests/net/lib.sh | 32 +++++++++++++++---------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/tools/testing/selftests/net/lib.sh b/tools/testing/selftests/net/lib.sh
index c7add0dc4c60..80cf1a75136c 100644
--- a/tools/testing/selftests/net/lib.sh
+++ b/tools/testing/selftests/net/lib.sh
@@ -547,8 +547,8 @@ ip_link_add()
{
local name=$1; shift
- ip link add name "$name" "$@"
- defer ip link del dev "$name"
+ ip link add name "$name" "$@" && \
+ defer ip link del dev "$name"
}
ip_link_set_master()
@@ -556,8 +556,8 @@ ip_link_set_master()
local member=$1; shift
local master=$1; shift
- ip link set dev "$member" master "$master"
- defer ip link set dev "$member" nomaster
+ ip link set dev "$member" master "$master" && \
+ defer ip link set dev "$member" nomaster
}
ip_link_set_addr()
@@ -566,8 +566,8 @@ ip_link_set_addr()
local addr=$1; shift
local old_addr=$(mac_get "$name")
- ip link set dev "$name" address "$addr"
- defer ip link set dev "$name" address "$old_addr"
+ ip link set dev "$name" address "$addr" && \
+ defer ip link set dev "$name" address "$old_addr"
}
ip_link_has_flag()
@@ -590,8 +590,8 @@ ip_link_set_up()
local name=$1; shift
if ! ip_link_is_up "$name"; then
- ip link set dev "$name" up
- defer ip link set dev "$name" down
+ ip link set dev "$name" up && \
+ defer ip link set dev "$name" down
fi
}
@@ -600,8 +600,8 @@ ip_link_set_down()
local name=$1; shift
if ip_link_is_up "$name"; then
- ip link set dev "$name" down
- defer ip link set dev "$name" up
+ ip link set dev "$name" down && \
+ defer ip link set dev "$name" up
fi
}
@@ -609,20 +609,20 @@ ip_addr_add()
{
local name=$1; shift
- ip addr add dev "$name" "$@"
- defer ip addr del dev "$name" "$@"
+ ip addr add dev "$name" "$@" && \
+ defer ip addr del dev "$name" "$@"
}
ip_route_add()
{
- ip route add "$@"
- defer ip route del "$@"
+ ip route add "$@" && \
+ defer ip route del "$@"
}
bridge_vlan_add()
{
- bridge vlan add "$@"
- defer bridge vlan del "$@"
+ bridge vlan add "$@" && \
+ defer bridge vlan del "$@"
}
wait_local_port_listen()
--
2.49.0
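The effect of the "command && defer" change above can be sketched in
isolation. This is a simplified illustration under stated assumptions, not
the lib.sh code: `defer` here is a hypothetical stub that only records the
cleanup in an array, and `add_old` and `add_new` are made-up wrappers that
mirror the before and after shape of the helpers.

```bash
#!/bin/bash
# Hypothetical stub: record the cleanup command instead of scheduling it.
CLEANUPS=()
defer()
{
    CLEANUPS+=("$*")
}

# Old shape: the cleanup is scheduled even if the command failed, and the
# wrapper returns the status of defer rather than that of the command.
add_old()
{
    "$@"
    defer echo "undo $*"
}

# New shape: the cleanup is scheduled only on success, and a failure of the
# command becomes the return code of the wrapper.
add_new()
{
    "$@" && \
        defer echo "undo $*"
}

if add_old false; then old_rc=0; else old_rc=$?; fi
old_scheduled=${#CLEANUPS[@]}

CLEANUPS=()
if add_new false; then new_rc=0; else new_rc=$?; fi
new_scheduled=${#CLEANUPS[@]}

echo "old: rc=${old_rc} scheduled=${old_scheduled}"
echo "new: rc=${new_rc} scheduled=${new_scheduled}"
```

In this sketch the old shape hides the failure (status 0, one cleanup
queued), while the new shape reports it (status 1, nothing queued), which is
what makes the helpers usable for feature detection.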
The fact that all cleanup (ideally) goes through the defer framework makes
debugging of these commands a bit tricky. However, this also gives us a
nice point to place a hook along the lines of PAUSE_ON_FAIL. When the
environment variable DEFER_PAUSE_ON_FAIL is set, and a cleanup command
results in non-zero exit status, show a bit of debuginfo and give the user
an opportunity to interrupt the execution altogether.
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: linux-kselftest(a)vger.kernel.org
tools/testing/selftests/net/lib/sh/defer.sh | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/tools/testing/selftests/net/lib/sh/defer.sh b/tools/testing/selftests/net/lib/sh/defer.sh
index 6c642f3d0ced..47ab78c4d465 100644
--- a/tools/testing/selftests/net/lib/sh/defer.sh
+++ b/tools/testing/selftests/net/lib/sh/defer.sh
@@ -1,6 +1,10 @@
#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
+# Whether to pause and allow debugging when an executed deferred command has a
+# non-zero exit code.
+: "${DEFER_PAUSE_ON_FAIL:=no}"
+
# map[(scope_id,track,cleanup_id) -> cleanup_command]
# track={d=default | p=priority}
declare -A __DEFER__JOBS
@@ -38,8 +42,20 @@ __defer__run()
local track=$1; shift
local defer_ix=$1; shift
local defer_key=$(__defer__defer_key $track $defer_ix)
+ local ret
eval ${__DEFER__JOBS[$defer_key]}
+ ret=$?
+
+ if [[ "$DEFER_PAUSE_ON_FAIL" == yes && "$ret" -ne 0 ]]; then
+ echo "Deferred command (track $track index $defer_ix):"
+ echo " ${__DEFER__JOBS[$defer_key]}"
+ echo "... ended with an exit status of $ret"
+ echo "Hit enter to continue, 'q' to quit"
+ read a
+ [[ "$a" == q ]] && exit 1
+ fi
+
unset __DEFER__JOBS[$defer_key]
}
--
2.49.0
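A reduced version of the DEFER_PAUSE_ON_FAIL hook above can be exercised on
its own. This is a sketch, assuming a single cleanup command passed as a
string: `run_cleanup` is a hypothetical helper, and unlike the patch it
returns 2 on 'q' instead of exiting, so that it can be driven
non-interactively.

```bash
#!/bin/bash
# Default to "no" unless the caller already set the variable.
: "${DEFER_PAUSE_ON_FAIL:=no}"

# Run one cleanup command; on failure, optionally pause for the user.
# Returns 2 if the user answered 'q', 0 otherwise.
run_cleanup()
{
    local cmd=$1
    local ret reply

    if eval "$cmd"; then ret=0; else ret=$?; fi

    if [[ "$DEFER_PAUSE_ON_FAIL" == yes && "$ret" -ne 0 ]]; then
        echo "Cleanup '$cmd' ended with an exit status of $ret" >&2
        echo "Hit enter to continue, 'q' to quit" >&2
        read -r reply
        [[ "$reply" == q ]] && return 2
    fi

    return 0
}

# A successful cleanup never pauses, whatever the setting.
run_cleanup true
```

Feeding the answer on stdin, for example `run_cleanup false <<< "q"` with
DEFER_PAUSE_ON_FAIL=yes, takes the quit path without a terminal.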
Currently the way deferred commands are stored and invoked causes any
whitespace to act as an argument separator when the command is executed.
To make it possible to use spaces in deferred commands, store the commands
quoted, and then eval the string prior to execution.
Fixes: a6e263f125cd ("selftests: net: lib: Introduce deferred commands")
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
---
Notes:
CC: Shuah Khan <shuah(a)kernel.org>
CC: linux-kselftest(a)vger.kernel.org
tools/testing/selftests/net/lib/sh/defer.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/lib/sh/defer.sh b/tools/testing/selftests/net/lib/sh/defer.sh
index 082f5d38321b..6c642f3d0ced 100644
--- a/tools/testing/selftests/net/lib/sh/defer.sh
+++ b/tools/testing/selftests/net/lib/sh/defer.sh
@@ -39,7 +39,7 @@ __defer__run()
local defer_ix=$1; shift
local defer_key=$(__defer__defer_key $track $defer_ix)
- ${__DEFER__JOBS[$defer_key]}
+ eval ${__DEFER__JOBS[$defer_key]}
unset __DEFER__JOBS[$defer_key]
}
@@ -49,7 +49,7 @@ __defer__schedule()
local ndefers=$(__defer__ndefers $track)
local ndefers_key=$(__defer__ndefer_key $track)
local defer_key=$(__defer__defer_key $track $ndefers)
- local defer="$@"
+ local defer="${@@Q}"
__DEFER__JOBS[$defer_key]="$defer"
__DEFER__NJOBS[$ndefers_key]=$((ndefers + 1))
--
2.49.0
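The whitespace problem fixed by the quoting patch above can be reproduced in
a few lines. This is a minimal sketch rather than the defer.sh code: the two
arrays and the `schedule_unquoted`, `schedule_quoted` and `count_args`
helpers are hypothetical names. Like the patch, it relies on the `${@@Q}`
expansion, which needs bash 4.4 or later.

```bash
#!/bin/bash
JOBS_UNQUOTED=()
JOBS_QUOTED=()

# Old behaviour: the arguments are flattened into one space-joined string.
schedule_unquoted()
{
    local cmd="$*"
    JOBS_UNQUOTED+=("$cmd")
}

# New behaviour: every argument is stored shell-quoted.
schedule_quoted()
{
    local cmd="${@@Q}"
    JOBS_QUOTED+=("$cmd")
}

# Helper that reports how many arguments it received.
count_args()
{
    echo "$#"
}

schedule_unquoted count_args "a b" c
schedule_quoted count_args "a b" c

# Plain expansion word-splits the stored string, so "a b" becomes two words.
unquoted_result=$(${JOBS_UNQUOTED[0]})

# eval re-parses the quoted string, so "a b" stays a single argument.
quoted_result=$(eval "${JOBS_QUOTED[0]}")

echo "unquoted: ${unquoted_result} arguments"
echo "quoted:   ${quoted_result} arguments"
```

The stored quoted string looks like `'count_args' 'a b' 'c'`, which is why
it has to go through eval instead of being expanded directly.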
This series adds namespace support to vhost-vsock and loopback. It does
not add namespaces to any of the other guest transports (virtio-vsock,
hyperv, or vmci).
The current revision only supports two modes: local or global. Local
mode is complete isolation of namespaces, while global mode is complete
sharing between namespaces of CIDs (the original behavior).
The mode is set using /proc/sys/net/vsock/ns_mode.
Modes are per-netns and write-once. This allows a system to configure
namespaces independently (some may share CIDs, others are completely
isolated). This also supports future possible mixed use cases, where
there may be namespaces in global mode spinning up VMs while there are
mixed mode namespaces that provide services to the VMs, but are not
allowed to allocate from the global CID pool.
Additionally, added tests for the new semantics:
tools/testing/selftests/vsock/vmtest.sh
1..22
ok 1 vm_server_host_client
ok 2 vm_client_host_server
ok 3 vm_loopback
ok 4 host_vsock_ns_mode_ok
ok 5 host_vsock_ns_mode_write_once_ok
ok 6 global_same_cid_fails
ok 7 local_same_cid_ok
ok 8 global_local_same_cid_ok
ok 9 local_global_same_cid_ok
ok 10 diff_ns_global_host_connect_to_global_vm_ok
ok 11 diff_ns_global_host_connect_to_local_vm_fails
ok 12 diff_ns_global_vm_connect_to_global_host_ok
ok 13 diff_ns_global_vm_connect_to_local_host_fails
ok 14 diff_ns_local_host_connect_to_local_vm_fails
ok 15 diff_ns_local_vm_connect_to_local_host_fails
ok 16 diff_ns_global_to_local_loopback_local_fails
ok 17 diff_ns_local_to_global_loopback_fails
ok 18 diff_ns_local_to_local_loopback_fails
ok 19 diff_ns_global_to_global_loopback_ok
ok 20 same_ns_local_loopback_ok
ok 21 same_ns_local_host_connect_to_local_vm_ok
ok 22 same_ns_local_vm_connect_to_local_host_ok
SUMMARY: PASS=22 SKIP=0 FAIL=0
Log: /tmp/vsock_vmtest_OQC4.log
Thanks again for everyone's help and reviews!
Signed-off-by: Bobby Eshleman <bobbyeshleman(a)gmail.com>
To: Stefano Garzarella <sgarzare(a)redhat.com>
To: Shuah Khan <shuah(a)kernel.org>
To: David S. Miller <davem(a)davemloft.net>
To: Eric Dumazet <edumazet(a)google.com>
To: Jakub Kicinski <kuba(a)kernel.org>
To: Paolo Abeni <pabeni(a)redhat.com>
To: Simon Horman <horms(a)kernel.org>
To: Stefan Hajnoczi <stefanha(a)redhat.com>
To: Michael S. Tsirkin <mst(a)redhat.com>
To: Jason Wang <jasowang(a)redhat.com>
To: Xuan Zhuo <xuanzhuo(a)linux.alibaba.com>
To: Eugenio Pérez <eperezma(a)redhat.com>
To: K. Y. Srinivasan <kys(a)microsoft.com>
To: Haiyang Zhang <haiyangz(a)microsoft.com>
To: Wei Liu <wei.liu(a)kernel.org>
To: Dexuan Cui <decui(a)microsoft.com>
To: Bryan Tan <bryan-bt.tan(a)broadcom.com>
To: Vishnu Dasa <vishnu.dasa(a)broadcom.com>
To: Broadcom internal kernel review list <bcm-kernel-feedback-list(a)broadcom.com>
Cc: virtualization(a)lists.linux.dev
Cc: netdev(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: linux-kernel(a)vger.kernel.org
Cc: kvm(a)vger.kernel.org
Cc: linux-hyperv(a)vger.kernel.org
Cc: berrange(a)redhat.com
Changes in v5:
- /proc/net/vsock_ns_mode -> /proc/sys/net/vsock/ns_mode
- vsock_global_net -> vsock_global_dummy_net
- fix netns lookup in vhost_vsock to respect pid namespaces
- add callbacks for vsock_loopback to avoid circular dependency
- vmtest.sh loads vsock_loopback module
- remove vsock_net_mode_can_set()
- change vsock_net_write_mode() to return true/false based on success
- make vsock_net_mode enum instead of u8
- Link to v4: https://lore.kernel.org/r/20250805-vsock-vmtest-v4-0-059ec51ab111@meta.com
Changes in v4:
- removed RFC tag
- implemented loopback support
- renamed new tests to better reflect behavior
- completed suite of tests with permutations of ns modes and vsock_test
as guest/host
- simplified socat bridging with unix socket instead of tcp + veth
- only use vsock_test for success case, socat for failure case (context
in commit message)
- lots of cleanup
Changes in v3:
- add notion of "modes"
- add procfs /proc/net/vsock_ns_mode
- local and global modes only
- no /dev/vhost-vsock-netns
- vmtest.sh already merged, so new patch just adds new tests for NS
- Link to v2:
https://lore.kernel.org/kvm/20250312-vsock-netns-v2-0-84bffa1aa97a@gmail.com
Changes in v2:
- only support vhost-vsock namespaces
- all g2h namespaces retain old behavior, only common API changes
impacted by vhost-vsock changes
- add /dev/vhost-vsock-netns for "opt-in"
- leave /dev/vhost-vsock to old behavior
- removed netns module param
- Link to v1:
https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
Changes in v1:
- added 'netns' module param to vsock.ko to enable the
network namespace support (disabled by default)
- added 'vsock_net_eq()' to check the "net" assigned to a socket
only when 'netns' support is enabled
- Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
---
Bobby Eshleman (9):
vsock: a per-net vsock NS mode state
vsock: add net to vsock skb cb
vsock: add netns to vsock core
vsock/loopback: add netns support
vsock/virtio: add netns to virtio transport common
vhost/vsock: add netns support
selftests/vsock: improve logging in vmtest.sh
selftests/vsock: invoke vsock_test through helpers
selftests/vsock: add namespace tests
MAINTAINERS | 1 +
drivers/vhost/vsock.c | 30 +-
include/linux/virtio_vsock.h | 12 +
include/net/af_vsock.h | 89 ++-
include/net/net_namespace.h | 4 +
include/net/netns/vsock.h | 25 +
net/vmw_vsock/af_vsock.c | 312 ++++++++-
net/vmw_vsock/hyperv_transport.c | 2 +-
net/vmw_vsock/virtio_transport.c | 5 +-
net/vmw_vsock/virtio_transport_common.c | 14 +-
net/vmw_vsock/vmci_transport.c | 4 +-
net/vmw_vsock/vsock_loopback.c | 76 ++-
tools/testing/selftests/vsock/vmtest.sh | 1092 ++++++++++++++++++++++++++-----
13 files changed, 1475 insertions(+), 191 deletions(-)
---
base-commit: 242041164339594ca019481d54b4f68a7aaff64e
change-id: 20250325-vsock-vmtest-b3a21d2102c2
Best regards,
--
Bobby Eshleman <bobbyeshleman(a)meta.com>
Add the benchmark testcase "kprobe-multi-all", which will hook all the
kernel functions during the testing.
This series is separated out from [1].
Changes since V2:
* add a comment to attach_ksyms_all noting that the test should not be run
on a debug kernel
Changes since V1:
* introduce trace_blacklist instead of copy-pasting strcmp in the 2nd
patch
* use fprintf() instead of printf() in 3rd patch
Link: https://lore.kernel.org/bpf/20250817024607.296117-1-dongml2@chinatelecom.cn/ [1]
Menglong Dong (3):
selftests/bpf: move get_ksyms and get_addrs to trace_helpers.c
selftests/bpf: skip recursive functions for kprobe_multi
selftests/bpf: add benchmark testing for kprobe-multi-all
tools/testing/selftests/bpf/bench.c | 4 +
.../selftests/bpf/benchs/bench_trigger.c | 61 +++++
.../selftests/bpf/benchs/run_bench_trigger.sh | 4 +-
.../bpf/prog_tests/kprobe_multi_test.c | 220 +---------------
.../selftests/bpf/progs/trigger_bench.c | 12 +
tools/testing/selftests/bpf/trace_helpers.c | 234 ++++++++++++++++++
tools/testing/selftests/bpf/trace_helpers.h | 3 +
7 files changed, 319 insertions(+), 219 deletions(-)
--
2.51.0
The two patches fix the va_high_addr_switch.sh test failure on x86_64.
Patch 1 fixes a hugepages setup issue: nr_hugepages is reset too early in
run_vmtests.sh, which breaks the later va_high_addr_switch test.
Patch 2 fixes the test failure caused by the hint addr align method change
in hugetlb_get_unmapped_area().
Chunyu Hu (2):
selftests/mm: fix hugepages cleanup too early
selftests/mm: fix va_high_addr_switch.sh failure on x86_64
tools/testing/selftests/mm/run_vmtests.sh | 9 +++++++--
tools/testing/selftests/mm/va_high_addr_switch.c | 4 ++--
2 files changed, 9 insertions(+), 4 deletions(-)
--
2.49.0
From: Dong Yang <dayss1224(a)gmail.com>
Add supported KVM test cases and fix the compilation dependencies.
---
Changes in v3:
- Reorder patches to fix build dependencies
- Sort common supported test cases alphabetically
- Move ucall_common.h include from common header to specific source files
Changes in v2:
- Delete some duplicate KVM test cases on riscv
- Add missing headers to fix the build for new RISC-V KVM selftests
Dong Yang (1):
KVM: riscv: selftests: Add missing headers for new testcases
Quan Zhou (2):
KVM: riscv: selftests: Use the existing RISCV_FENCE macro in
`rseq-riscv.h`
KVM: riscv: selftests: Add common supported test cases
tools/testing/selftests/kvm/Makefile.kvm | 6 ++++++
tools/testing/selftests/kvm/access_tracking_perf_test.c | 1 +
tools/testing/selftests/kvm/include/riscv/processor.h | 1 +
.../selftests/kvm/memslot_modification_stress_test.c | 1 +
tools/testing/selftests/kvm/memslot_perf_test.c | 1 +
tools/testing/selftests/rseq/rseq-riscv.h | 3 +--
6 files changed, 11 insertions(+), 2 deletions(-)
--
2.34.1
Create a netconsole test that puts a lot of pressure on the netconsole
list manipulation. Do it by creating dynamic targets and deleting
targets while messages are being sent. Also bring the interface down while the
messages are being sent, as well as creating parallel targets.
The code launches three background jobs on distinct schedules:
* Toggle netcons target every 30 iterations
* create and delete random_target every 50 iterations
* toggle iface every 70 iterations
This creates multiple concurrency sources that interact with netconsole
states. This is good practice to simulate stress, and exercise netpoll
and netconsole locks.
This test already found an issue as reported in [1]
Link: https://lore.kernel.org/all/20250901-netpoll_memleak-v1-1-34a181977dfc@debi… [1]
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/netcons_torture.sh | 133 +++++++++++++++++++++
2 files changed, 134 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/Makefile b/tools/testing/selftests/drivers/net/Makefile
index 984ece05f7f92..2b253b1ff4f38 100644
--- a/tools/testing/selftests/drivers/net/Makefile
+++ b/tools/testing/selftests/drivers/net/Makefile
@@ -17,6 +17,7 @@ TEST_PROGS := \
netcons_fragmented_msg.sh \
netcons_overflow.sh \
netcons_sysdata.sh \
+ netcons_torture.sh \
netpoll_basic.py \
ping.py \
queues.py \
diff --git a/tools/testing/selftests/drivers/net/netcons_torture.sh b/tools/testing/selftests/drivers/net/netcons_torture.sh
new file mode 100755
index 0000000000000..d41884c83cab3
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/netcons_torture.sh
@@ -0,0 +1,133 @@
+#!/usr/bin/env bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Repeatedly sends kernel messages, toggles netconsole targets on and off,
+# creates and deletes targets in parallel, and toggles the source interface to
+# simulate stress conditions.
+#
+# This test aims to verify the robustness of netconsole under dynamic
+# configurations and concurrent operations.
+#
+# The major goal is to run this test with LOCKDEP, Kmemleak and KASAN to make
+# sure no issues are reported.
+#
+# Author: Breno Leitao <leitao(a)debian.org>
+
+set -euo pipefail
+
+SCRIPTDIR=$(dirname "$(readlink -e "${BASH_SOURCE[0]}")")
+
+source "${SCRIPTDIR}"/lib/sh/lib_netcons.sh
+
+# Number of times the main loop runs
+ITERATIONS=${1:-1000}
+
+# Only test extended format
+FORMAT="extended"
+# And ipv6 only
+IP_VERSION="ipv6"
+
+# Create, enable and delete some targets.
+create_and_delete_random_target() {
+ COUNT=1
+ RND_PREFIX=$(mktemp -u netcons_rnd_XXXX_)
+
+ if [ -d "${NETCONS_CONFIGFS}/${RND_PREFIX}${COUNT}" ] || \
+ [ -d "${NETCONS_CONFIGFS}/${RND_PREFIX}0" ]; then
+ echo "Function didn't finish yet, skipping it." >&2
+ return
+ fi
+
+ # enable COUNT targets
+ for i in $(seq 0 ${COUNT})
+ do
+ RND_TARGET="${RND_PREFIX}"${i}
+ RND_TARGET_PATH="${NETCONS_CONFIGFS}"/"${RND_TARGET}"
+
+ # Basic population so the target can come up
+ mkdir "${RND_TARGET_PATH}"
+ echo "${DSTIP}" > "${RND_TARGET_PATH}"/remote_ip
+ echo "${SRCIP}" > "${RND_TARGET_PATH}"/local_ip
+ echo "${DSTMAC}" > "${RND_TARGET_PATH}"/remote_mac
+ echo "${SRCIF}" > "${RND_TARGET_PATH}"/dev_name
+
+ echo 1 > "${RND_TARGET_PATH}"/enabled
+ done
+
+ echo "netconsole selftest: ${COUNT} additional target was created" > /dev/kmsg
+ # disable them all
+ for i in $(seq 0 ${COUNT})
+ do
+ RND_TARGET="${RND_PREFIX}"${i}
+ RND_TARGET_PATH="${NETCONS_CONFIGFS}"/"${RND_TARGET}"
+ echo 0 > "${RND_TARGET_PATH}"/enabled
+ rmdir "${RND_TARGET_PATH}"
+ done
+}
+
+# Disable and enable the target mid-air, while messages
+# are being transmitted.
+toggle_netcons_target() {
+ for i in $(seq 2)
+ do
+ if [ ! -d "${NETCONS_PATH}" ]
+ then
+ break
+ fi
+ echo 0 > "${NETCONS_PATH}"/enabled 2> /dev/null || true
+ # Try to enable a bit harder, given it might fail to enable
+ # Write to `enabled` might fail depending on the lock, which is
+ # highly contentious here
+ for _ in $(seq 5)
+ do
+ echo 1 > "${NETCONS_PATH}"/enabled 2> /dev/null || true
+ done
+ done
+}
+
+toggle_iface(){
+ ip link set "${SRCIF}" down
+ ip link set "${SRCIF}" up
+}
+
+# Start here
+
+modprobe netdevsim 2> /dev/null || true
+modprobe netconsole 2> /dev/null || true
+
+# Check for basic system dependency and exit if not found
+check_for_dependencies
+# Set current loglevel to KERN_INFO(6), and default to KERN_NOTICE(5)
+echo "6 5" > /proc/sys/kernel/printk
+# Remove the namespace, interfaces and netconsole target on exit
+trap cleanup EXIT
+# Create one namespace and two interfaces
+set_network "${IP_VERSION}"
+# Create a dynamic target for netconsole
+create_dynamic_target "${FORMAT}"
+
+for i in $(seq "$ITERATIONS")
+do
+ for _ in $(seq 10)
+ do
+ echo "${MSG}: ${TARGET} ${i}" > /dev/kmsg
+ wait
+ done
+
+ if (( i % 30 == 0 )); then
+ toggle_netcons_target &
+ fi
+
+ if (( i % 50 == 0 )); then
+ # create some targets, enable them, send msg and disable
+ # all in a parallel thread
+ create_and_delete_random_target &
+ fi
+
+ if (( i % 70 == 0 )); then
+ toggle_iface &
+ fi
+done
+wait
+
+exit "${ksft_pass}"
---
base-commit: 2fd4161d0d2547650d9559d57fc67b4e0a26a9e3
change-id: 20250902-netconsole_torture-8fc23f0aca99
Best regards,
--
Breno Leitao <leitao(a)debian.org>
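The main loop of the torture script above starts its background disturbances on fixed iteration multiples (30, 50 and 70). A minimal Python sketch of that schedule follows. The function name and the iteration counts used with it are illustrative assumptions; the script's real ITERATIONS value is not shown in this excerpt.

```python
# Sketch of the disturbance schedule used by the torture loop.
# The periods mirror the `i % N == 0` checks in the shell script.
PERIODS = {
    "toggle_netcons_target": 30,
    "create_and_delete_random_target": 50,
    "toggle_iface": 70,
}


def schedule(iterations):
    """Map each iteration number to the background jobs started in it."""
    plan = {}
    for i in range(1, iterations + 1):  # `seq N` counts from 1
        jobs = [name for name, period in PERIODS.items() if i % period == 0]
        if jobs:
            plan[i] = jobs
    return plan
```

For example, `schedule(100)` contains entries for iterations 30, 50, 60, 70, 90 and 100, so the three kinds of disturbance drift in and out of phase with each other rather than always firing together.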
This is v9 of the TDX selftests.
Thanks everyone for the thorough review on v8 [1]. I tried addressing
all the comments. I'm terribly sorry if I missed something.
The original v8 series [1] was split to make reviewing the test framework
changes easier. This series includes the original patches up to the TDX
lifecycle test which is the first TDX selftest in the series.
This series is based on v6.17-rc2
Changes from v8:
- Rebased on top of v6.17-rc2
- Drop several patches which are no longer needed now that TDX support
is integrated into the common flow.
- Split several patches to make reviewing easier.
- Massive refactor compared to v8 to pull TDX special handling into
__vm_create() and vm_vcpu_add() instead of creating separate functions
for TDX.
- Use kbuild to expose values from C to assembly code.
- Move setup of the reset vectors to C code as suggested by Sean.
- Drop redundant cpuid masking functions which are no longer necessary.
- Initialize TDX protected pages one at a time instead of allocating
large chunks of memory.
- Add UCALL support for TDX to align with the rest of the selftests.
- Minor fixes to kselftest_harness.h and virt_map() that were identified
as part of this work.
[1] https://lore.kernel.org/lkml/20250807201628.1185915-1-sagis@google.com/
Ackerley Tng (2):
KVM: selftests: Add helpers to init TDX memory and finalize VM
KVM: selftests: Add ucall support for TDX
Erdem Aktas (2):
KVM: selftests: Add TDX boot code
KVM: selftests: Add support for TDX TDCALL from guest
Isaku Yamahata (2):
KVM: selftests: Update kvm_init_vm_address_properties() for TDX
KVM: selftests: TDX: Use KVM_TDX_CAPABILITIES to validate TDs'
attribute configuration
Sagi Shahar (13):
KVM: selftests: Include overflow.h instead of redefining
is_signed_type()
KVM: selftests: Allocate pgd in virt_map() as necessary
KVM: selftests: Expose functions to get default sregs values
KVM: selftests: Expose function to allocate guest vCPU stack
KVM: selftests: Expose segment definitons to assembly files
KVM: selftests: Add kbuild definitons
KVM: selftests: Define structs to pass parameters to TDX boot code
KVM: selftests: Set up TDX boot code region
KVM: selftests: Set up TDX boot parameters region
KVM: selftests: Add helper to initialize TDX VM
KVM: selftests: Hook TDX support to vm and vcpu creation
KVM: selftests: Add wrapper for TDX MMIO from guest
KVM: selftests: Add TDX lifecycle test
tools/include/linux/kbuild.h | 18 +
tools/testing/selftests/kselftest_harness.h | 3 +-
tools/testing/selftests/kvm/Makefile.kvm | 32 ++
.../selftests/kvm/include/x86/processor.h | 8 +
.../selftests/kvm/include/x86/processor_asm.h | 12 +
.../selftests/kvm/include/x86/tdx/td_boot.h | 81 ++++
.../kvm/include/x86/tdx/td_boot_asm.h | 16 +
.../selftests/kvm/include/x86/tdx/tdcall.h | 34 ++
.../selftests/kvm/include/x86/tdx/tdx.h | 14 +
.../selftests/kvm/include/x86/tdx/tdx_util.h | 86 ++++
.../testing/selftests/kvm/include/x86/ucall.h | 4 +-
tools/testing/selftests/kvm/lib/kvm_util.c | 25 +-
.../testing/selftests/kvm/lib/x86/processor.c | 122 ++++--
.../selftests/kvm/lib/x86/tdx/td_boot.S | 60 +++
.../kvm/lib/x86/tdx/td_boot_offsets.c | 21 +
.../selftests/kvm/lib/x86/tdx/tdcall.S | 93 +++++
.../kvm/lib/x86/tdx/tdcall_offsets.c | 16 +
tools/testing/selftests/kvm/lib/x86/tdx/tdx.c | 22 +
.../selftests/kvm/lib/x86/tdx/tdx_util.c | 391 ++++++++++++++++++
tools/testing/selftests/kvm/lib/x86/ucall.c | 45 +-
tools/testing/selftests/kvm/x86/tdx_vm_test.c | 31 ++
21 files changed, 1095 insertions(+), 39 deletions(-)
create mode 100644 tools/include/linux/kbuild.h
create mode 100644 tools/testing/selftests/kvm/include/x86/processor_asm.h
create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/td_boot.h
create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/td_boot_asm.h
create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/tdcall.h
create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/tdx.h
create mode 100644 tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/td_boot.S
create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/td_boot_offsets.c
create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/tdcall.S
create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/tdcall_offsets.c
create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/tdx.c
create mode 100644 tools/testing/selftests/kvm/lib/x86/tdx/tdx_util.c
create mode 100644 tools/testing/selftests/kvm/x86/tdx_vm_test.c
--
2.51.0.rc1.193.gad69d77794-goog
There are currently no kernel tests that verify setting and getting
options of the team driver.
In the future, options may be added that implicitly change other
options, which will make it useful to have tests like these that show
nothing breaks. There will be a follow up patch to this that adds new
"rx_enabled" and "tx_enabled" options, which will implicitly affect the
"enabled" option value and vice versa.
The tests use teamnl to first set options to specific values and then
get them back to compare against the set values.
Signed-off-by: Marc Harvey <marcharvey(a)google.com>
---
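The set-then-get check described above can be sketched in Python as follows. This is only an illustration of the round-trip idea: FakeTeam is a hypothetical in-memory stand-in, whereas the real test shells out to teamnl.

```python
# In-memory stand-in for `teamnl <dev> setoption/getoption`.
# All names in this sketch are hypothetical.
class FakeTeam:
    def __init__(self):
        self._options = {}

    def setoption(self, name, value):
        self._options[name] = str(value)

    def getoption(self, name):
        return self._options[name]  # KeyError models an unknown option


def check_roundtrip(team, name, value_1, value_2):
    """Set value_1, value_2, value_1 in turn and verify each read-back."""
    for value in (value_1, value_2, value_1):
        team.setoption(name, value)
        got = team.getoption(name)
        if got != str(value):
            return f"get ({got}) != set ({value})"
    return None  # None means every read-back matched
```

Cycling back to the first value after the second checks that the option can move in both directions.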
.../selftests/drivers/net/team/Makefile | 6 +-
.../selftests/drivers/net/team/options.sh | 194 ++++++++++++++++++
2 files changed, 198 insertions(+), 2 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/team/options.sh
diff --git a/tools/testing/selftests/drivers/net/team/Makefile b/tools/testing/selftests/drivers/net/team/Makefile
index eaf6938f100e..8b00b70ce67f 100644
--- a/tools/testing/selftests/drivers/net/team/Makefile
+++ b/tools/testing/selftests/drivers/net/team/Makefile
@@ -1,11 +1,13 @@
# SPDX-License-Identifier: GPL-2.0
# Makefile for net selftests
-TEST_PROGS := dev_addr_lists.sh propagation.sh
+TEST_PROGS := dev_addr_lists.sh propagation.sh options.sh
TEST_INCLUDES := \
../bonding/lag_lib.sh \
../../../net/forwarding/lib.sh \
- ../../../net/lib.sh
+ ../../../net/lib.sh \
+ ../../../net/in_netns.sh \
+ ../../../net/lib/sh/defer.sh \
include ../../../lib.mk
diff --git a/tools/testing/selftests/drivers/net/team/options.sh b/tools/testing/selftests/drivers/net/team/options.sh
new file mode 100755
index 000000000000..b9c7aa357ad5
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/team/options.sh
@@ -0,0 +1,194 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# These tests verify basic set and get functionality of the team
+# driver options over netlink.
+
+# Run in private netns.
+test_dir="$(dirname "$0")"
+if [[ $# -eq 0 ]]; then
+ "${test_dir}"/../../../net/in_netns.sh "$0" __subprocess
+ exit $?
+fi
+
+ALL_TESTS="
+ team_test_options
+"
+
+source "${test_dir}/../../../net/lib.sh"
+
+TEAM_PORT="team0"
+MEMBER_PORT="dummy0"
+
+setup()
+{
+ ip link add name "${MEMBER_PORT}" type dummy
+ ip link add name "${TEAM_PORT}" type team
+}
+
+get_and_check_value()
+{
+ local option_name="$1"
+ local expected_value="$2"
+ local port_flag="$3"
+
+ local value_from_get
+
+ value_from_get=$(teamnl "${TEAM_PORT}" getoption "${option_name}" \
+ "${port_flag}")
+ if [[ $? != 0 ]]; then
+ echo "Could not get option '${option_name}'" >&2
+ return 1
+ fi
+
+ if [[ "${value_from_get}" != "${expected_value}" ]]; then
+ echo "Incorrect value for option '${option_name}'" >&2
+ echo "get (${value_from_get}) != set (${expected_value})" >&2
+ return 1
+ fi
+}
+
+set_and_check_get()
+{
+ local option_name="$1"
+ local option_value="$2"
+ local port_flag="$3"
+
+ local value_from_get
+
+ teamnl "${TEAM_PORT}" setoption "${option_name}" "${option_value}" \
+ "${port_flag}"
+ if [[ $? != 0 ]]; then
+ echo "'setoption ${option_name} ${option_value}' failed" >&2
+ return 1
+ fi
+
+ get_and_check_value "${option_name}" "${option_value}" "${port_flag}"
+ return $?
+}
+
+# Get a "port flag" to pass to the `teamnl` command.
+# E.g. $1="dummy0" -> "--port=dummy0",
+# $1="" -> ""
+get_port_flag()
+{
+ local port_name="$1"
+
+ if [[ -n "${port_name}" ]]; then
+ echo "--port=${port_name}"
+ fi
+}
+
+attach_port_if_specified()
+{
+ local port_name="${1}"
+
+ if [[ -n "${port_name}" ]]; then
+ ip link set dev "${port_name}" master "${TEAM_PORT}"
+ return $?
+ fi
+}
+
+detach_port_if_specified()
+{
+ local port_name="${1}"
+
+ if [[ -n "${port_name}" ]]; then
+ ip link set dev "${port_name}" nomaster
+ return $?
+ fi
+}
+
+#######################################
+# Test that an option's get value matches its set value.
+# Globals:
+# RET - Used by testing infra like `check_err`.
+# EXIT_STATUS - Used by `log_test` to set the whole script's exit value.
+# Arguments:
+# option_name - The name of the option.
+# value_1 - The first value to try setting.
+# value_2 - The second value to try setting.
+# port_name - The (optional) name of the attached port.
+#######################################
+team_test_option()
+{
+ local option_name="$1"
+ local value_1="$2"
+ local value_2="$3"
+ local possible_values="$2 $3 $2"
+ local port_name="$4"
+ local port_flag
+
+ RET=0
+
+ echo "Setting '${option_name}' to '${value_1}' and '${value_2}'"
+
+ attach_port_if_specified "${port_name}"
+ check_err $? "Couldn't attach ${port_name} to master"
+ port_flag=$(get_port_flag "${port_name}")
+
+ # Set and get both possible values.
+ for value in ${possible_values}; do
+ set_and_check_get "${option_name}" "${value}" "${port_flag}"
+ check_err $? "Failed to set '${option_name}' to '${value}'"
+ done
+
+ detach_port_if_specified "${port_name}"
+ check_err $? "Couldn't detach ${port_name} from its master"
+
+ log_test "Set + Get '${option_name}' test"
+}
+
+#######################################
+# Test that getting a nonexistent option fails.
+# Globals:
+# RET - Used by testing infra like `check_err`.
+# EXIT_STATUS - Used by `log_test` to set the whole script's exit value.
+# Arguments:
+# option_name - The name of the option.
+# port_name - The (optional) name of the attached port.
+#######################################
+team_test_get_option_fails()
+{
+ local option_name="$1"
+ local port_name="$2"
+ local port_flag
+
+ RET=0
+
+ attach_port_if_specified "${port_name}"
+ check_err $? "Couldn't attach ${port_name} to master"
+ port_flag=$(get_port_flag "${port_name}")
+
+ # Just confirm that getting the value fails.
+ teamnl "${TEAM_PORT}" getoption "${option_name}" "${port_flag}"
+ check_fail $? "Shouldn't be able to get option '${option_name}'"
+
+ detach_port_if_specified "${port_name}"
+
+ log_test "Get '${option_name}' fails"
+}
+
+team_test_options()
+{
+ # Wrong option name behavior.
+ team_test_get_option_fails fake_option1
+ team_test_get_option_fails fake_option2 "${MEMBER_PORT}"
+
+ # Correct set and get behavior.
+ team_test_option mode activebackup loadbalance
+ team_test_option notify_peers_count 0 5
+ team_test_option notify_peers_interval 0 5
+ team_test_option mcast_rejoin_count 0 5
+ team_test_option mcast_rejoin_interval 0 5
+ team_test_option enabled true false "${MEMBER_PORT}"
+ team_test_option user_linkup true false "${MEMBER_PORT}"
+ team_test_option user_linkup_enabled true false "${MEMBER_PORT}"
+ team_test_option priority 10 20 "${MEMBER_PORT}"
+ team_test_option queue_id 0 1 "${MEMBER_PORT}"
+}
+
+require_command teamnl
+setup
+tests_run
+exit "${EXIT_STATUS}"
--
2.51.0.355.g5224444f11-goog
Add the benchmark testcase "kprobe-multi-all", which will hook all the
kernel functions during the testing.
This series is separated out from [1].
Changes since V2:
* add a comment to attach_ksyms_all noting that the test should not be
run on a debug kernel
Changes since V1:
* introduce trace_blacklist instead of copy-pasting strcmp in the 2nd
patch
* use fprintf() instead of printf() in 3rd patch
Link: https://lore.kernel.org/bpf/20250817024607.296117-1-dongml2@chinatelecom.cn/ [1]
Menglong Dong (3):
selftests/bpf: move get_ksyms and get_addrs to trace_helpers.c
selftests/bpf: skip recursive functions for kprobe_multi
selftests/bpf: add benchmark testing for kprobe-multi-all
tools/testing/selftests/bpf/bench.c | 4 +
.../selftests/bpf/benchs/bench_trigger.c | 61 +++++
.../selftests/bpf/benchs/run_bench_trigger.sh | 4 +-
.../bpf/prog_tests/kprobe_multi_test.c | 220 +---------------
.../selftests/bpf/progs/trigger_bench.c | 12 +
tools/testing/selftests/bpf/trace_helpers.c | 234 ++++++++++++++++++
tools/testing/selftests/bpf/trace_helpers.h | 3 +
7 files changed, 319 insertions(+), 219 deletions(-)
--
2.51.0
Currently, even if some subtests fail, the end result will still yield
"ok 1 selftests: bpf: test_xsk.sh". Fix it by exiting with 1 if there are
any failures.
Signed-off-by: Ricardo B. Marlière <rbm(a)suse.com>
---
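To illustrate why the exit status matters, here is a small Python sketch of a runner that derives its verdict purely from the child's exit code. It is a simplified model, not the kselftest runner itself: it assumes 0 means pass and anything else means fail, and the two child scripts are made up for the example.

```python
import subprocess
import sys


def run_and_report(script_body, name):
    """Run a Python snippet as a child process and report it TAP-style."""
    # Simplified assumption: exit status 0 is a pass, anything else a failure.
    result = subprocess.run([sys.executable, "-c", script_body])
    verdict = "ok" if result.returncode == 0 else "not ok"
    return f"{verdict} 1 {name}"


# A child that counts failures but never exits non-zero is reported "ok".
SWALLOWS_FAILURES = "failures = 2\nif failures == 0: print('All tests successful!')"
# Mirroring the fix: exit with 1 when any failure was counted.
PROPAGATES_FAILURES = "import sys\nfailures = 2\nsys.exit(1 if failures else 0)"
```

With these definitions, `run_and_report(SWALLOWS_FAILURES, "demo")` reports a pass despite the two counted failures, while the second variant is reported as failing.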
tools/testing/selftests/bpf/test_xsk.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_xsk.sh b/tools/testing/selftests/bpf/test_xsk.sh
index 65aafe0003db054e9dfd156092fed53b07be06a0..62db060298a4a3b4391ee4cfa50557cf4a62d3d5 100755
--- a/tools/testing/selftests/bpf/test_xsk.sh
+++ b/tools/testing/selftests/bpf/test_xsk.sh
@@ -241,4 +241,6 @@ done
if [ $failures -eq 0 ]; then
echo "All tests successful!"
+else
+ exit 1
fi
---
base-commit: 5b6d6fe1ca7b712c74f78426bb23c465fd34b322
change-id: 20250828-selftests-bpf-test_xsk_ret-1eb27dbac071
Best regards,
--
Ricardo B. Marlière <rbm(a)suse.com>
This series contains 4 independent new features:
- Patch 1: use HMAC-SHA256 library instead of open-coded HMAC.
- Patch 2: selftests: check for unexpected fallback counter increments.
- Patches 3-4: record subflows in RPS table, for aRFS support.
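As background for patch 1, the difference between an open-coded HMAC and a library call can be sketched in Python. This is an illustration only: it uses Python's hashlib and hmac modules, not the kernel API touched by the patch, and both function names are made up for the example.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 block size in bytes


def hmac_sha256_open_coded(key, msg):
    """HMAC-SHA256 written out by hand, following RFC 2104."""
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()  # long keys are hashed first
    key = key.ljust(BLOCK_SIZE, b"\x00")    # then zero-padded to one block
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(ipad + msg).digest()
    return hashlib.sha256(opad + inner).digest()


def hmac_sha256_library(key, msg):
    """The same computation delegated to the standard library."""
    return hmac.new(key, msg, hashlib.sha256).digest()
```

Both functions return the same 32-byte digest for any key and message; the library version simply removes the padding and double-hash bookkeeping from the caller.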
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Changes in v2:
- Drop previous patches 2 ("mptcp: make ADD_ADDR retransmission timeout
adaptive") + 3 ("selftests: mptcp: remove add_addr_timeout settings"):
They were introducing instabilities in the selftests.
- Rebased. Other patches have not been modified.
- Link to v1: https://lore.kernel.org/r/20250901-net-next-mptcp-misc-feat-6-18-v1-0-80ae8…
---
Christoph Paasch (2):
net: Add rfs_needed() helper
mptcp: record subflows in RPS table
Eric Biggers (1):
mptcp: use HMAC-SHA256 library instead of open-coded HMAC
Gang Yan (1):
selftests: mptcp: add checks for fallback counters
include/net/rps.h | 85 ++++++++++------
net/mptcp/crypto.c | 35 +------
net/mptcp/protocol.c | 21 ++++
tools/testing/selftests/net/mptcp/mptcp_join.sh | 123 ++++++++++++++++++++++++
4 files changed, 202 insertions(+), 62 deletions(-)
---
base-commit: cd8a4cfa6bb43a441901e82f5c222dddc75a18a3
change-id: 20250829-net-next-mptcp-misc-feat-6-18-722fa87a60f1
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
One fix for occasional failures I found while testing and a bunch of
cleanups that should make that test easier to digest.
Tested on x86-64, the test seems to reliably pass.
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Cc: Nico Pache <npache(a)redhat.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Wei Yang <richard.weiyang(a)gmail.com>
--
Mostly a resend, because I accidentally disabled "cccover = true" in my
git config so people were only CCed on the cover letter.
v1 -> v2:
* "selftests/mm: split_huge_page_test: fix occasional is_backed_by_folio()
wrong results"
-> Fixup missing ")" in patch description
David Hildenbrand (2):
selftests/mm: split_huge_page_test: fix occasional
is_backed_by_folio() wrong results
selftests/mm: split_huge_page_test: cleanups for split_pte_mapped_thp
test
.../selftests/mm/split_huge_page_test.c | 138 ++++++++++--------
1 file changed, 81 insertions(+), 57 deletions(-)
base-commit: ef42a39c44ef6da64ae3495d27e28dd6fca62a51
--
2.50.1
The rss_ctx test has gotten pretty flaky after I increased
the queue count in NIPA 2->3. Not 100% clear why. We get
a lot of failures in the rss_ctx.test_hitless_key_update case.
Looking closer it appears that the failures are mostly due
to startup costs. I measured the following timing for ethtool -X:
- python cmd(shell=True) : 150-250msec
- python cmd(shell=False) : 50- 70msec
- timed in bash : 45- 55msec
- YNL Netlink call : 2- 4msec
- .set_rxfh callback : 1- 2msec
The target in the test was set to 200msec. We were mostly measuring
ethtool startup cost it seems. Switch to YNL since it's 100x faster.
Lower the pass criterion to 150msec. There is no real science behind this
number, but since we removed some overhead, drivers which previously passed
200msec should easily pass 150msec now.
Separately we should probably follow up on defaulting to shell=False,
when script doesn't explicitly ask for True, because the overhead
is rather significant.
Switch from _rss_key_rand() to random.randbytes(), YNL takes a binary
array rather than array of ints.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
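The process startup overhead discussed above can be observed with a rough sketch like the one below. It times plain subprocess.run(), not the selftest cmd() helper, and it assumes a POSIX `true` binary is available, so the absolute numbers will differ from the measurements quoted in the commit message.

```python
import subprocess
import time


def time_spawn(shell, repeats=20):
    """Return the mean wall-clock seconds needed to run a no-op command."""
    # `true` is assumed to exist, as on any POSIX system.
    command = "true" if shell else ["true"]
    start = time.perf_counter()
    for _ in range(repeats):
        subprocess.run(command, shell=shell, check=True)
    return (time.perf_counter() - start) / repeats


with_shell = time_spawn(shell=True)
without_shell = time_spawn(shell=False)
print(f"shell=True : {with_shell * 1000:.2f} msec per call")
print(f"shell=False: {without_shell * 1000:.2f} msec per call")
```

Any operation timed through such a wrapper includes this spawn cost, which is why a threshold on the order of 100msec can end up measuring the tooling rather than the driver.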
v2:
- increase the threshold to safer 150msec
- mention change away from _rss_key_rand()
v1: https://lore.kernel.org/20250829220712.327920-1-kuba@kernel.org
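For the move away from _rss_key_rand(), the following sketch contrasts the two key representations. The helper names are hypothetical stand-ins (the real helpers in rss_ctx.py are not reproduced here), and the key length of 40 is only an example value.

```python
import random


def rand_key_as_ints(length):
    """Random key as a list of ints, the shape the old helper is assumed to return."""
    return [random.randint(0, 255) for _ in range(length)]


def key_to_ethtool_str(key):
    """Colon-separated hex, the textual form `ethtool -X ... hkey` accepts."""
    return ":".join(f"{byte:02x}" for byte in key)


key_len = 40  # illustrative; the test reads the real length from the device
binary_key = random.randbytes(key_len)  # bytes object, needs Python 3.9+
```

A bytes object can be passed to the Netlink call directly, while the ethtool command line needs the key rendered as text first, for example `key_to_ethtool_str(binary_key)`.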
---
tools/testing/selftests/drivers/net/hw/rss_ctx.py | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/rss_ctx.py b/tools/testing/selftests/drivers/net/hw/rss_ctx.py
index 9838b8457e5a..a5562a9f729f 100755
--- a/tools/testing/selftests/drivers/net/hw/rss_ctx.py
+++ b/tools/testing/selftests/drivers/net/hw/rss_ctx.py
@@ -335,19 +335,20 @@ from lib.py import ethtool, ip, defer, GenerateTraffic, CmdExitFailure
data = get_rss(cfg)
key_len = len(data['rss-hash-key'])
- key = _rss_key_rand(key_len)
+ ethnl = EthtoolFamily()
+ key = random.randbytes(key_len)
tgen = GenerateTraffic(cfg)
try:
errors0, carrier0 = get_drop_err_sum(cfg)
t0 = datetime.datetime.now()
- ethtool(f"-X {cfg.ifname} hkey " + _rss_key_str(key))
+ ethnl.rss_set({"header": {"dev-index": cfg.ifindex}, "hkey": key})
t1 = datetime.datetime.now()
errors1, carrier1 = get_drop_err_sum(cfg)
finally:
tgen.wait_pkts_and_stop(5000)
- ksft_lt((t1 - t0).total_seconds(), 0.2)
+ ksft_lt((t1 - t0).total_seconds(), 0.15)
ksft_eq(errors1 - errors1, 0)
ksft_eq(carrier1 - carrier0, 0)
--
2.51.0
Clean up tests which expect shell=True without explicitly passing
that param to cmd(). There seems to be only one such case, and
in fact it's better converted to a direct write.
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
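A minimal sketch of the direct-write approach, using a temporary file as a stand-in for the sysfs attribute so it can run without privileges. The helper name write_attr is made up for this example.

```python
import os
import tempfile


def write_attr(path, value):
    """Write a value to a sysfs-style attribute file without a shell."""
    with open(path, "wb") as fp:
        fp.write(str(value).encode("utf-8"))


# A temporary file stands in for /sys/class/net/<ifname>/threaded.
with tempfile.TemporaryDirectory() as tmpdir:
    attr = os.path.join(tmpdir, "threaded")
    write_attr(attr, 1)
    with open(attr, "rb") as fp:
        written = fp.read()
```

Unlike `echo`, this writes the value without a trailing newline, and no shell is involved in interpreting the redirection.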
tools/testing/selftests/drivers/net/napi_threaded.py | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/napi_threaded.py b/tools/testing/selftests/drivers/net/napi_threaded.py
index ed66efa481b0..f4be72b2145a 100755
--- a/tools/testing/selftests/drivers/net/napi_threaded.py
+++ b/tools/testing/selftests/drivers/net/napi_threaded.py
@@ -24,7 +24,8 @@ from lib.py import cmd, defer, ethtool
def _set_threaded_state(cfg, threaded) -> None:
- cmd(f"echo {threaded} > /sys/class/net/{cfg.ifname}/threaded")
+ with open(f"/sys/class/net/{cfg.ifname}/threaded", "wb") as fp:
+ fp.write(str(threaded).encode('utf-8'))
def _setup_deferred_cleanup(cfg) -> None:
--
2.51.0
This patch improves the utils.py module by removing unused imports
(errno, random), simplifying the fd_read_timeout() function by
eliminating an unnecessary else clause, and cleaning up code style in the
defer class constructor.
Additionally, it renames the parameter in rand_port() from 'type' to
'stype' to avoid shadowing the built-in Python name 'type', improving
code clarity and preventing potential issues.
These changes enhance code readability and maintainability without
affecting functionality.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
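The shadowing problem behind the 'type' to 'stype' rename can be shown with a small standalone sketch. Both functions are made up for illustration and are unrelated to the real rand_port() beyond the parameter naming.

```python
def describe_shadowed(type="stream"):
    """A parameter named `type` hides the built-in inside this function."""
    try:
        return type(42).__name__  # calls the string argument, not the built-in
    except TypeError as exc:
        return f"TypeError: {exc}"


def describe_renamed(stype="stream"):
    """With the parameter renamed, the built-in type() works as usual."""
    return type(stype).__name__
```

Calling `describe_shadowed()` ends up in the except branch because the string argument is not callable, whereas `describe_renamed()` can still reach the built-in.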
tools/testing/selftests/net/lib/py/utils.py | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/net/lib/py/utils.py b/tools/testing/selftests/net/lib/py/utils.py
index b188cac49738f..1cdc8e6d6b603 100644
--- a/tools/testing/selftests/net/lib/py/utils.py
+++ b/tools/testing/selftests/net/lib/py/utils.py
@@ -1,9 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
-import errno
import json as _json
import os
-import random
import re
import select
import socket
@@ -21,8 +19,7 @@ def fd_read_timeout(fd, timeout):
rlist, _, _ = select.select([fd], [], [], timeout)
if rlist:
return os.read(fd, 1024)
- else:
- raise TimeoutError("Timeout waiting for fd read")
+ raise TimeoutError("Timeout waiting for fd read")
class cmd:
@@ -138,8 +135,6 @@ global_defer_queue = []
class defer:
def __init__(self, func, *args, **kwargs):
- global global_defer_queue
-
if not callable(func):
raise Exception("defer created with un-callable object, did you call the function instead of passing its name?")
@@ -227,11 +222,11 @@ def bpftrace(expr, json=None, ns=None, host=None, timeout=None):
return cmd_obj
-def rand_port(type=socket.SOCK_STREAM):
+def rand_port(stype=socket.SOCK_STREAM):
"""
Get a random unprivileged port.
"""
- with socket.socket(socket.AF_INET6, type) as s:
+ with socket.socket(socket.AF_INET6, stype) as s:
s.bind(("", 0))
return s.getsockname()[1]
---
base-commit: 864ecc4a6dade82d3f70eab43dad0e277aa6fc78
change-id: 20250901-fix-02eb26114040
Best regards,
--
Breno Leitao <leitao(a)debian.org>