This patch series suggests fixes for several corner cases in the RISC-V
vector ptrace implementation:
- init vector context with proper vlenb, to avoid reading zero vlenb
by an early attached debugger
- follow gdbserver expectations and return ENODATA instead of EINVAL
if vector extension is supported but not yet activated for the
traced process
- validate input vector csr registers in ptrace, to maintain an accurate
view of the tracee's vector context across multiple halt/resume
debug cycles
For detailed description see the appropriate commit messages. A new test
suite v_ptrace is added into the tools/testing/selftests/riscv/vector
to verify some of the vector ptrace functionality and corner cases.
Previous versions:
- v3: https://lore.kernel.org/linux-riscv/20251025210655.43099-1-geomatsi@gmail.c…
- v2: https://lore.kernel.org/linux-riscv/20250821173957.563472-1-geomatsi@gmail.…
- v1: https://lore.kernel.org/linux-riscv/20251007115840.2320557-1-geomatsi@gmail…
Changes in v4:
The form 'vsetvli x0, x0, ...' can only be used if VLMAX remains
unchanged, see spec 6.2. This condition was not met by the initial
values in the selftests w.r.t. the initial zeroed context. QEMU accepted
such values, but actual hardware (c908, BananaPi CanMV Zero board) did
not, setting vill. So fix the selftests after testing on hardware:
- replace 'vsetvli x0, x0, ...' by 'vsetvli rd, x0, ...'
- fixed instruction returns VLMAX, so use it in checks as well
- replace fixed vlenb == 16 in the syscall test
Changes in v3:
Address the review comments by Andy Chiu and rework the approach:
- drop forced vector context save entirely
- perform strict validation of vector csr regs in ptrace
Changes in v2:
- add thread_info flag to allow to force vector context save
- force vector context save after vector ptrace to ensure valid vector
context in the next ptrace operations
- force vector context save on the first context switch after vector
context init to get proper vlenb
---
Ilya Mamay (1):
riscv: ptrace: return ENODATA for inactive vector extension
Sergey Matyukevich (8):
selftests: riscv: test ptrace vector interface
selftests: riscv: verify initial vector state with ptrace
riscv: vector: init vector context with proper vlenb
riscv: csr: define vtype registers elements
riscv: ptrace: validate input vector csr registers
selftests: riscv: verify ptrace rejects invalid vector csr inputs
selftests: riscv: verify ptrace accepts valid vector csr values
selftests: riscv: verify syscalls discard vector context
arch/riscv/include/asm/csr.h | 11 +
arch/riscv/kernel/ptrace.c | 72 +-
arch/riscv/kernel/vector.c | 12 +-
.../testing/selftests/riscv/vector/.gitignore | 1 +
tools/testing/selftests/riscv/vector/Makefile | 5 +-
.../testing/selftests/riscv/vector/v_ptrace.c | 754 ++++++++++++++++++
6 files changed, 847 insertions(+), 8 deletions(-)
create mode 100644 tools/testing/selftests/riscv/vector/v_ptrace.c
base-commit: e811c33b1f137be26a20444b79db8cbc1fca1c89
--
2.51.0
At the moment, the ability to direct-inject vLPIs is only enableable
on an all-or-nothing per-VM basis, causing unnecessary I/O performance
loss in cases where a VM's vCPU count exceeds available vPEs. This RFC
introduces per-vCPU control over vLPI injection to realize potential
I/O performance gain in such situations.
Background
----------
The value of dynamically enabling the direct injection of vLPIs on a
per-vCPU basis is the ability to run guest VMs with simultaneous
hardware-forwarded and software-forwarded message-signaled interrupts.
Currently, hardware-forwarded vLPI direct injection on a KVM guest
requires GICv4 and is enabled on a per-VM, all-or-nothing basis. vLPI
injection enablment happens in two stages:
1) At vGIC initialization, allocate direct injection structures for
each vCPU (doorbell IRQ, vPE table entry, virtual pending table,
vPEID).
2) When a PCI device is configured for passthrough, map its MSIs to
vLPIs using the structures allocated in step 1.
Step 1 is all-or-nothing; if any vCPU cannot be configured with the
vPE structures necessary for direct injection, the vPEs of all vCPUs
are torn down and direct injection is disabled VM-wide.
This universality of direct vLPI injection enablement sparks several
issues, with the most pressing being performance degradation on
overcommitted hosts.
VM-wide vLPI enablement creates resource inefficiency when guest
VMs have more vCPUs than the host has available vPEIDs. The amount of
vPEIDs (and consequently, vPEs) a host can allocate is constrained by
hardware and defined by GICD_TYPER2.VID + 1 (ITS_MAX_VPEID). Since
direct injection requires a vCPU to be assigned a vPEID, at most
ITS_MAX_VPEID vCPUs can be configured for direct injection at a time.
Because vLPI direct injection is all-or-nothing on a VM, if a new guest
VM would exhaust remaining vPEIDs, all vCPUs on that VM would fall back
to hypervisor-forwarded LPIs, causing considerable I/O performance
degradation.
Such performance degradation is exemplified on hosts with CPU
overcommitment. Overcommitting an arbitrarily high number of vCPUs
enables a VM's vCPU count to easily exceed the host's available vPEIDs.
Even with marginally more vCPUs than vPEIDs, the current all-or-nothing
vLPI paradigm disables direct injection entirely. This creates two
problems: first, a single many-vCPU overcommitted VM loses all direct
injection despite having vPEIDs available; second, on multi-tenant
hosts, VMs booted first consume all vPEIDs, leaving later VMs without
direct injection regardless of their I/O intensity. Per-vCPU control
would allow userspace to allocate available vPEIDs across VMs based on
I/O workload rather than boot order or per-VM vCPU count. This per-vCPU
granularity recovers most of the direct injection performance benefit
instead of losing it completely.
To allow this per-vCPU granularity, this RFC introduces three new ioctls
to the KVM API that enables userspace the ability to activate/deactivate
direct vLPI injection capability and resources to vCPUs ad-hoc during VM
runtime.
This RFC proposes userspace control, rather than kernel control, over
vPEID allocation for simplicity of implementation, ease of testability,
and autonomy over resource usage. In the future, the vLPI enable/disable
building blocks from this RFC may be used to implement a full vPE
allocation policy in the kernel.
The solution comes in several parts
-----------------------------------
1) [P 1] General declarations (ioctl definitions/stubs, kconfig option)
2) [P 2] Conditionally disable auto vLPI injection init routines
To prevent vCPUs from exceeding vPEID allocation limits upon VM boot,
disable automatic vPEID allocation in the GICv4 initialization
routine when the per-vCPU kconfig is active. Likewise, disable
automatic hardware forwarding for PCI device-backed MSIs upon device
registration.
3) [P 3-6] Implement per-vCPU vLPI enablement routine, which:
a) Creates per-vCPU doorbell IRQ on new vCPU-scoped, rather than
VM-scoped, interrupt domain hierarchies.
b) Allocates per-vCPU vPE table entries and virtual pending table,
linking them to the vCPU's doorbell IRQ.
c) Iterates through interrupt translation table to set hardware
forwarding for all PCI device–backed interrupts targeting the
specific vCPU.
3) [P 7-8] Implement per-vCPU vLPI disablement routine, which
a) Iterates through interrupt translation table to unset hardware
forwarding for all interrupts targeting the specific vCPU.
b) Frees per-vCPU vPE table entries, virtual pending table, and
doorbell IRQ, then removes vgic_dist's pointer to the vCPU's
freed vPE.
4) [P 9] Couple vSGI enablement with per-vCPU vPE allocation
Since vSGIs cannot be direct-injected without an allocated vPE on
the receiving vCPU, couple vSGI enablement with vLPI enablement
on GICv4.1.
5) [P 10-13] Write selftests for vLPI direct injection
PCI devices cannot be passed through to selftest guests, so
define an ioctl that mocks a hardware source for software-defined
MSI interrupts and sets vLPI "hardware" forwarding for the MSIs. Use
these vLPIs to selftest per-vCPU vLPI enablement/disablement ioctls.
Testing
-------
Testing has been carried out via selftests and QEMU-emulated guests.
Selftests have covered diverse vLPI configurations and race conditions.
These include:
1) Stress testing LPI injection across multiple vCPUs while
concurrently and repeatedly toggling the vCPUs' vLPI
injection capability.
2) Enabling/disabling vLPI direct injection while scheduling or
unscheduling a vCPU.
3) Allocating and freeing a single vPEID to multiple vCPUs, ensuring
reusability.
4) Attempting to allocate a vPEID when all are already allocated,
validating an error is thrown.
5) Calling enable/disable vLPI ioctls when GIC is not initialized.
6) Idempotent ioctl calls.
PCI device passthrough and interrupt injection to QEMU guest
demonstrated:
1) Complete hypervisor circumvention when vLPI injection is enabled on
a vCPU, hypervisor forwarding when vLPI injection is disabled.
2) Interrupts are not lost when received during per-vCPU vLPI state
transitions.
Caveats
-------
1) Pending interrupts are flushed when vLPI injection is disabled for a
vCPU; hardware pending state is not transfered to software. This may
cause pending interrupts to be lost upon vPE disablement.
Unlike vSGIs, vLPIs do not expose their pending state through a
GICD_ISPENDR register. Thus, we would need to read the pending state
of the vLPI from the vPT. To read the pending status of the vLPI from
vPT, we would need to invalidate any vPT cache associated with the
vCPU's vPE. This requires unmapping the vPE and halting the vCPU,
which would be incredibly expensive and unecessary given that MSIs
are usually recoverable by the driver.
2) Direct-injected vSGIs (GICv4.1) require vCPUs to have associated
vPEs. Since disabling vLPI injection on a vCPU frees its
vPE, vSGI direct injection must simultaenously be disabled as well.
At the moment, we use the per-vCPU vSGI toggle mechanism introduced
in commit bacf2c6 to enable/disable vSGI injection alongside vLPI
injection.
Maximilian Dittgen (13):
KVM: Introduce config option for per-vCPU vLPI enablement
KVM: arm64: Disable auto vCPU vPE assignment with per-vCPU vLPI config
KVM: arm64: Refactor out locked section of
kvm_vgic_v4_set_forwarding()
KVM: arm64: Implement vLPI QUERY ioctl for per-vCPU vLPI injection API
KVM: arm64: Implement vLPI ENABLE ioctl for per-vCPU vLPI injection
API
KVM: arm64: Resolve race between vCPU scheduling and vLPI enablement
KVM: arm64: Implement vLPI DISABLE ioctl for per-vCPU vLPI Injection
API
KVM: arm64: Make per-vCPU vLPI control ioctls atomic
KVM: arm64: Couple vSGI enablement with per-vCPU vPE allocation
KVM: selftests: fix MAPC RDbase target formatting in vgic_lpi_stress
KVM: Ioctl to set up userspace-injected MSIs as software-bypassing
vLPIs
KVM: arm64: selftests: Add support for stress testing direct-injected
vLPIs
KVM: arm64: selftests: Add test for per-vCPU vLPI control API
Documentation/virt/kvm/api.rst | 56 +++
arch/arm64/kvm/arm.c | 89 +++++
arch/arm64/kvm/vgic/vgic-its.c | 142 ++++++-
arch/arm64/kvm/vgic/vgic-v3.c | 14 +-
arch/arm64/kvm/vgic/vgic-v4.c | 370 +++++++++++++++++-
arch/arm64/kvm/vgic/vgic.h | 10 +
drivers/irqchip/Kconfig | 13 +
drivers/irqchip/irq-gic-v3-its.c | 58 ++-
drivers/irqchip/irq-gic-v4.c | 75 +++-
include/kvm/arm_vgic.h | 8 +
include/linux/irqchip/arm-gic-v3.h | 5 +
include/linux/irqchip/arm-gic-v4.h | 10 +-
include/linux/kvm_host.h | 11 +
include/uapi/linux/kvm.h | 22 ++
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../selftests/kvm/arm64/per_vcpu_vlpi.c | 274 +++++++++++++
.../selftests/kvm/arm64/vgic_lpi_stress.c | 181 ++++++++-
.../selftests/kvm/lib/arm64/gic_v3_its.c | 9 +-
18 files changed, 1307 insertions(+), 41 deletions(-)
create mode 100644 tools/testing/selftests/kvm/arm64/per_vcpu_vlpi.c
--
2.50.1 (Apple Git-155)
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Christof Hellmis
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
This patchset introduces target resume capability to netconsole allowing
it to recover targets when underlying low-level interface comes back
online.
The patchset starts by refactoring netconsole state representation in
order to allow representing deactivated targets (targets that are
disabled due to interfaces going down).
It then modifies netconsole to handle NETDEV_UP events for such targets
and setups netpoll. Targets are matched with incoming interfaces
depending on how they were initially bound in netconsole (by mac or
interface name).
The patchset includes a selftest that validates netconsole target state
transitions and that target is functional after resumed.
Signed-off-by: Andre Carvalho <asantostc(a)gmail.com>
---
Changes in v5:
- patch 3: Set (de)enslaved target as DISABLED instead of DEACTIVATED to prevent
resuming it.
- selftest: Fix test cleanup by moving trap line to outside of loop and remove
unneeded 'local' keyword
- Rename maybe_resume_target to resume_target, add netconsole_ prefix to
process_resumable_targets.
- Hold device reference before calling __netpoll_setup.
- Link to v4: https://lore.kernel.org/r/20251116-netcons-retrigger-v4-0-5290b5f140c2@gmai…
Changes in v4:
- Simplify selftest cleanup, removing trap setup in loop.
- Drop netpoll helper (__setup_netpoll_hold) and manage reference inside
netconsole.
- Move resume_list processing logic to separate function.
- Link to v3: https://lore.kernel.org/r/20251109-netcons-retrigger-v3-0-1654c280bbe6@gmai…
Changes in v3:
- Resume by mac or interface name depending on how target was created.
- Attempt to resume target without holding target list lock, by moving
the target to a temporary list. This is required as netpoll may
attempt to allocate memory.
- Link to v2: https://lore.kernel.org/r/20250921-netcons-retrigger-v2-0-a0e84006237f@gmai…
Changes in v2:
- Attempt to resume target in the same thread, instead of using
workqueue .
- Add wrapper around __netpoll_setup (patch 4).
- Renamed resume_target to maybe_resume_target and moved conditionals to
inside its implementation, keeping code more clear.
- Verify that device addr matches target mac address when target was
setup using mac.
- Update selftest to cover targets bound by mac and interface name.
- Fix typo in selftest comment and sort tests alphabetically in
Makefile.
- Link to v1:
https://lore.kernel.org/r/20250909-netcons-retrigger-v1-0-3aea904926cf@gmai…
---
Andre Carvalho (3):
netconsole: convert 'enabled' flag to enum for clearer state management
netconsole: resume previously deactivated target
selftests: netconsole: validate target resume
Breno Leitao (2):
netconsole: add target_state enum
netconsole: add STATE_DEACTIVATED to track targets disabled by low level
drivers/net/netconsole.c | 155 +++++++++++++++++----
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 35 ++++-
.../selftests/drivers/net/netcons_resume.sh | 97 +++++++++++++
4 files changed, 254 insertions(+), 34 deletions(-)
---
base-commit: a057e8e4ac5b1ddd12be590e2e039fa08d0c8aa4
change-id: 20250816-netcons-retrigger-a4f547bfc867
Best regards,
--
Andre Carvalho <asantostc(a)gmail.com>
This patch series introduces LANDLOCK_SCOPE_MEMFD_EXEC, a new Landlock
scoping mechanism that restricts execution of anonymous memory file
descriptors (memfd) created via memfd_create(2). This addresses security
gaps where processes can bypass W^X policies and execute arbitrary code
through anonymous memory objects.
Fixes: https://github.com/landlock-lsm/linux/issues/37
SECURITY PROBLEM
================
Current Landlock filesystem restrictions do not cover memfd objects,
allowing processes to:
1. Read-to-execute bypass: Create writable memfd, inject code,
then execute via mmap(PROT_EXEC) or direct execve()
2. Anonymous execution: Execute code without touching the filesystem via
execve("/proc/self/fd/N") where N is a memfd descriptor
3. Cross-domain access violations: Pass memfd between processes to
bypass domain restrictions
These scenarios can occur in sandboxed environments where filesystem
access is restricted but memfd creation remains possible.
IMPLEMENTATION
==============
The implementation adds hierarchical execution control through domain
scoping:
Core Components:
- is_memfd_file(): Reliable memfd detection via "memfd:" dentry prefix
- domain_is_scoped(): Cross-domain hierarchy checking (moved to domain.c)
- LSM hooks: mmap_file, file_mprotect, bprm_creds_for_exec
- Creation-time restrictions: hook_file_alloc_security
Security Matrix:
Execution decisions follow domain hierarchy rules preventing both
same-domain bypass attempts and cross-domain access violations while
preserving legitimate hierarchical access patterns.
Domain Hierarchy with LANDLOCK_SCOPE_MEMFD_EXEC:
===============================================
Root (no domain) - No restrictions
|
+-- Domain A [SCOPE_MEMFD_EXEC] Layer 1
| +-- memfd_A (tagged with Domain A as creator)
| |
| +-- Domain A1 (child) [NO SCOPE] Layer 2
| | +-- Inherits Layer 1 restrictions from parent
| | +-- memfd_A1 (can create, inherits restrictions)
| | +-- Domain A1a [SCOPE_MEMFD_EXEC] Layer 3
| | +-- memfd_A1a (tagged with Domain A1a)
| |
| +-- Domain A2 (child) [SCOPE_MEMFD_EXEC] Layer 2
| +-- memfd_A2 (tagged with Domain A2 as creator)
| +-- CANNOT access memfd_A1 (different subtree)
|
+-- Domain B [SCOPE_MEMFD_EXEC] Layer 1
+-- memfd_B (tagged with Domain B as creator)
+-- CANNOT access ANY memfd from Domain A subtree
Execution Decision Matrix:
========================
Executor-> | A | A1 | A1a | A2 | B | Root
Creator | | | | | |
------------|-----|----|-----|----|----|-----
Domain A | X | X | X | X | X | Y
Domain A1 | Y | X | X | X | X | Y
Domain A1a | Y | Y | X | X | X | Y
Domain A2 | Y | X | X | X | X | Y
Domain B | X | X | X | X | X | Y
Root | Y | Y | Y | Y | Y | Y
Legend: Y = Execution allowed, X = Execution denied
Scenarios Covered:
- Direct mmap(PROT_EXEC) on memfd files
- Two-stage mmap(PROT_READ) + mprotect(PROT_EXEC) bypass attempts
- execve("/proc/self/fd/N") anonymous execution
- execveat() and fexecve() file descriptor execution
- Cross-process memfd inheritance and IPC passing
TESTING
=======
All patches have been validated with:
- scripts/checkpatch.pl --strict (clean)
- Selftests covering same-domain restrictions, cross-domain
hierarchy enforcement, and regular file isolation
- KUnit tests for memfd detection edge cases
DISCLAIMER
==========
My understanding of Landlock scoping semantics may be limited, but this
implementation reflects my current understanding based on available
documentation and code analysis. I welcome feedback and corrections
regarding the scoping logic and domain hierarchy enforcement.
Signed-off-by: Abhinav Saxena <xandfury(a)gmail.com>
---
Abhinav Saxena (4):
landlock: add LANDLOCK_SCOPE_MEMFD_EXEC scope
landlock: implement memfd detection
landlock: add memfd exec LSM hooks and scoping
selftests/landlock: add memfd execution tests
include/uapi/linux/landlock.h | 5 +
security/landlock/.kunitconfig | 1 +
security/landlock/audit.c | 4 +
security/landlock/audit.h | 1 +
security/landlock/cred.c | 14 -
security/landlock/domain.c | 67 ++++
security/landlock/domain.h | 4 +
security/landlock/fs.c | 405 ++++++++++++++++++++-
security/landlock/limits.h | 2 +-
security/landlock/task.c | 67 ----
.../selftests/landlock/scoped_memfd_exec_test.c | 325 +++++++++++++++++
11 files changed, 812 insertions(+), 83 deletions(-)
---
base-commit: 5b74b2eff1eeefe43584e5b7b348c8cd3b723d38
change-id: 20250716-memfd-exec-ac0d582018c3
Best regards,
--
Abhinav Saxena <xandfury(a)gmail.com>
Currently the vDSO selftests use the time-related types from libc.
This works on glibc by chance today but will break with other libc
implementations or on distributions which switch to 64-bit times
everywhere.
The kernel's UAPI headers provide the proper types to use with the vDSO
(and raw syscalls) but are not necessarily compatible with libc types.
Introduce a new header which makes the UAPI headers compatible with the
libc.
Also contains some related cleanups.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
---
Changes in v2:
- Use __kernel_old_time_t in vdso_time_t.
- Add vdso_syscalls.h.
- Add a test for the time() function.
- Validate return value of syscall(clock_getres) in vdso_test_abi
- Link to v1: https://lore.kernel.org/r/20251111-vdso-test-types-v1-0-03b31f88c659@linutr…
---
Thomas Weißschuh (14):
Revert "selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers"
selftests: vDSO: Introduce vdso_types.h
selftests: vDSO: Introduce vdso_syscalls.h
selftests: vDSO: vdso_test_gettimeofday: Remove nolibc checks
selftests: vDSO: vdso_test_gettimeofday: Use types from vdso_types.h
selftests: vDSO: vdso_test_abi: Use types from vdso_types.h
selftests: vDSO: vdso_test_abi: Validate return value of syscall(clock_getres)
selftests: vDSO: vdso_test_abi: Use system call wrappers from vdso_syscalls.h
selftests: vDSO: vdso_test_correctness: Drop SYS_getcpu fallbacks
selftests: vDSO: vdso_test_correctness: Make ts_leq() and tv_leq() more generic
selftests: vDSO: vdso_test_correctness: Use types from vdso_types.h
selftests: vDSO: vdso_test_correctness: Use system call wrappers from vdso_syscalls.h
selftests: vDSO: vdso_test_correctness: Use facilities from parse_vdso.c
selftests: vDSO: vdso_test_correctness: Add a test for time()
tools/testing/selftests/vDSO/Makefile | 6 +-
tools/testing/selftests/vDSO/parse_vdso.c | 3 +-
tools/testing/selftests/vDSO/vdso_syscalls.h | 93 ++++++++++
tools/testing/selftests/vDSO/vdso_test_abi.c | 46 +++--
.../testing/selftests/vDSO/vdso_test_correctness.c | 190 +++++++++++----------
.../selftests/vDSO/vdso_test_gettimeofday.c | 9 +-
tools/testing/selftests/vDSO/vdso_types.h | 70 ++++++++
7 files changed, 285 insertions(+), 132 deletions(-)
---
base-commit: 1b2eb8c1324859864f4aa79dc3cfbb2f7ef5c524
change-id: 20251110-vdso-test-types-68ce0c712b79
Best regards,
--
Thomas Weißschuh <thomas.weissschuh(a)linutronix.de>
The `FIXTURE(args)` macro defines an empty `struct _test_data_args`,
leading to `sizeof(struct _test_data_args)` evaluating to 0. This
caused a build error due to a compiler warning on a `memset` call
with a zero size argument.
Adding a dummy member to the struct ensures its size is non-zero,
resolving the build issue.
Signed-off-by: Wake Liu <wakel(a)google.com>
---
tools/testing/selftests/futex/functional/futex_requeue_pi.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/futex/functional/futex_requeue_pi.c b/tools/testing/selftests/futex/functional/futex_requeue_pi.c
index f299d75848cd..000fec468835 100644
--- a/tools/testing/selftests/futex/functional/futex_requeue_pi.c
+++ b/tools/testing/selftests/futex/functional/futex_requeue_pi.c
@@ -52,6 +52,7 @@ struct thread_arg {
FIXTURE(args)
{
+ char dummy;
};
FIXTURE_SETUP(args)
--
2.52.0.rc1.455.g30608eb744-goog
test_memcg_sock() currently requires that memory.stat's "sock " counter
is exactly zero immediately after the TCP server exits. On a busy system
this assumption is too strict:
- Socket memory may be freed with a small delay (e.g. RCU callbacks).
- memcg statistics are updated asynchronously via the rstat flushing
worker, so the "sock " value in memory.stat can stay non-zero for a
short period of time even after all socket memory has been uncharged.
As a result, test_memcg_sock() can intermittently fail even though socket
memory accounting is working correctly.
Make the test more robust by polling memory.stat for the "sock "
counter and allowing it some time to drop to zero instead of checking
it only once. The timeout is set to 3 seconds to cover the periodic
rstat flush interval (FLUSH_TIME = 2*HZ by default) plus some
scheduling slack. If the counter does not become zero within the
timeout, the test still fails as before.
On my test system, running test_memcontrol 50 times produced:
- Before this patch: 6/50 runs passed.
- After this patch: 50/50 runs passed.
Suggested-by: Lance Yang <lance.yang(a)linux.dev>
Signed-off-by: Guopeng Zhang <zhangguopeng(a)kylinos.cn>
---
v2:
- Mention the periodic rstat flush interval (FLUSH_TIME = 2*HZ) in
the comment and clarify the rationale for the 3s timeout.
- Replace the hard-coded retry count and wait interval with macros
to avoid magic numbers and make the 3s timeout calculation explicit.
---
.../selftests/cgroup/test_memcontrol.c | 30 ++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 4e1647568c5b..7bea656658a2 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -24,6 +24,9 @@
static bool has_localevents;
static bool has_recursiveprot;
+#define MEMCG_SOCKSTAT_WAIT_RETRIES 30 /* 3s total */
+#define MEMCG_SOCKSTAT_WAIT_INTERVAL_US (100 * 1000) /* 100 ms */
+
int get_temp_fd(void)
{
return open(".", O_TMPFILE | O_RDWR | O_EXCL);
@@ -1384,6 +1387,8 @@ static int test_memcg_sock(const char *root)
int bind_retries = 5, ret = KSFT_FAIL, pid, err;
unsigned short port;
char *memcg;
+ long sock_post = -1;
+ int i;
memcg = cg_name(root, "memcg_test");
if (!memcg)
@@ -1432,7 +1437,30 @@ static int test_memcg_sock(const char *root)
if (cg_read_long(memcg, "memory.current") < 0)
goto cleanup;
- if (cg_read_key_long(memcg, "memory.stat", "sock "))
+ /*
+ * memory.stat is updated asynchronously via the memcg rstat
+ * flushing worker, which runs periodically (every 2 seconds,
+ * see FLUSH_TIME). On a busy system, the "sock " counter may
+ * stay non-zero for a short period of time after the TCP
+ * connection is closed and all socket memory has been
+ * uncharged.
+ *
+ * Poll memory.stat for up to 3 seconds (~FLUSH_TIME plus some
+ * scheduling slack) and require that the "sock " counter
+ * eventually drops to zero.
+ */
+ for (i = 0; i < MEMCG_SOCKSTAT_WAIT_RETRIES; i++) {
+ sock_post = cg_read_key_long(memcg, "memory.stat", "sock ");
+ if (sock_post < 0)
+ goto cleanup;
+
+ if (!sock_post)
+ break;
+
+ usleep(MEMCG_SOCKSTAT_WAIT_INTERVAL_US);
+ }
+
+ if (sock_post)
goto cleanup;
ret = KSFT_PASS;
--
2.25.1
test_memcg_sock() currently requires that memory.stat's "sock " counter
is exactly zero immediately after the TCP server exits. On a busy system
this assumption is too strict:
- Socket memory may be freed with a small delay (e.g. RCU callbacks).
- memcg statistics are updated asynchronously via the rstat flushing
worker, so the "sock " value in memory.stat can stay non-zero for a
short period of time even after all socket memory has been uncharged.
As a result, test_memcg_sock() can intermittently fail even though socket
memory accounting is working correctly.
Make the test more robust by polling memory.stat for the "sock " counter
and allowing it some time to drop to zero instead of checking it only
once. If the counter does not become zero within the timeout, the test
still fails as before.
On my test system, running test_memcontrol 50 times produced:
- Before this patch: 6/50 runs passed.
- After this patch: 50/50 runs passed.
Signed-off-by: Guopeng Zhang <zhangguopeng(a)kylinos.cn>
---
.../selftests/cgroup/test_memcontrol.c | 24 ++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index 4e1647568c5b..86d9981cddd8 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -1384,6 +1384,8 @@ static int test_memcg_sock(const char *root)
int bind_retries = 5, ret = KSFT_FAIL, pid, err;
unsigned short port;
char *memcg;
+ long sock_post = -1;
+ int i, retries = 30;
memcg = cg_name(root, "memcg_test");
if (!memcg)
@@ -1432,7 +1434,27 @@ static int test_memcg_sock(const char *root)
if (cg_read_long(memcg, "memory.current") < 0)
goto cleanup;
- if (cg_read_key_long(memcg, "memory.stat", "sock "))
+ /*
+ * memory.stat is updated asynchronously via the memcg rstat
+ * flushing worker, so the "sock " counter may stay non-zero
+ * for a short period of time after the TCP connection is
+ * closed and all socket memory has been uncharged.
+ *
+ * Poll memory.stat for up to 3 seconds and require that the
+ * "sock " counter eventually drops to zero.
+ */
+ for (i = 0; i < retries; i++) {
+ sock_post = cg_read_key_long(memcg, "memory.stat", "sock ");
+ if (sock_post < 0)
+ goto cleanup;
+
+ if (!sock_post)
+ break;
+
+ usleep(100 * 1000); /* 100ms */
+ }
+
+ if (sock_post)
goto cleanup;
ret = KSFT_PASS;
--
2.25.1
Here are various unrelated fixes:
- Patch 1: Fix window space computation for fallback connections which
can affect ACK generation. A fix for v5.11.
- Patch 2: Avoid unneeded subflow-level drops due to unsynced received
window. A fix for v5.11.
- Patch 3: Avoid premature close for fallback connections with PREEMPT
kernels. A fix for v5.12.
- Patch 4: Reset instead of fallback in case of data in the MPTCP
out-of-order queue. A fix for v5.7.
- Patches 5-7: Avoid also sending "plain" TCP reset when closing with an
MP_FASTCLOSE. A fix for v6.1.
- Patches 8-9: Longer timeout for background connections in MPTCP Join
selftests. An additional fix for recent patches for v5.13/v6.1.
- Patches 10-11: Fix typo in a check introduce in a recent refactoring.
A fix for v6.15.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Gang Yan (2):
mptcp: fix address removal logic in mptcp_pm_nl_rm_addr
selftests: mptcp: add a check for 'add_addr_accepted'
Matthieu Baerts (NGI0) (3):
selftests: mptcp: join: fastclose: remove flaky marks
selftests: mptcp: join: endpoints: longer timeout
selftests: mptcp: join: userspace: longer timeout
Paolo Abeni (6):
mptcp: fix ack generation for fallback msk
mptcp: avoid unneeded subflow-level drops
mptcp: fix premature close in case of fallback
mptcp: do not fallback when OoO is present
mptcp: decouple mptcp fastclose from tcp close
mptcp: fix duplicate reset on fastclose
net/mptcp/options.c | 54 +++++++++++++++++++++-
net/mptcp/pm_kernel.c | 2 +-
net/mptcp/protocol.c | 59 +++++++++++++++++--------
net/mptcp/protocol.h | 3 +-
tools/testing/selftests/net/mptcp/mptcp_join.sh | 27 ++++++-----
5 files changed, 113 insertions(+), 32 deletions(-)
---
base-commit: 8e0a754b0836d996802713bbebc87bc1cc17925c
change-id: 20251117-net-mptcp-misc-fixes-6-18-rc6-835d94cdc095
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
The current netconsole implementation allocates a static buffer for
extradata (userdata + sysdata) with a fixed size of
MAX_EXTRADATA_ENTRY_LEN * MAX_EXTRADATA_ITEMS bytes for every target,
regardless of whether userspace actually uses this feature. This forces
us to keep MAX_EXTRADATA_ITEMS small (16), which is restrictive for
users who need to attach more metadata to their log messages.
This patch series enables dynamic allocation of the userdata buffer,
allowing it to grow on-demand based on actual usage. The series:
1. Refactors send_fragmented_body() to simplify handling of separated
userdata and sysdata (patch 1/4)
2. Splits userdata and sysdata into separate buffers (patch 2/4)
3. Implements dynamic allocation for the userdata buffer (patch 3/4)
4. Increases MAX_USERDATA_ITEMS from 16 to 256 now that we can do so
without memory waste (patch 4/4)
Benefits:
- No memory waste when userdata is not used
- Targets that use userdata only consume what they need
- Users can attach significantly more metadata without impacting systems
that don't use this feature
Signed-off-by: Gustavo Luiz Duarte <gustavold(a)gmail.com>
---
Changes in v2:
- Added null pointer checks for userdata and sysdata buffers
- Added MAX_SYSDATA_ITEMS to enum sysdata_feature
- Moved code out of ifdef in send_msg_no_fragmentation()
- Renamed variables in send_fragmented_body() to make it easier to
reason about the code
- Link to v1: https://lore.kernel.org/r/20251105-netconsole_dynamic_extradata-v1-0-142890…
---
Gustavo Luiz Duarte (4):
netconsole: Simplify send_fragmented_body()
netconsole: Split userdata and sysdata
netconsole: Dynamic allocation of userdata buffer
netconsole: Increase MAX_USERDATA_ITEMS
drivers/net/netconsole.c | 370 ++++++++++-----------
.../selftests/drivers/net/netcons_overflow.sh | 2 +-
2 files changed, 179 insertions(+), 193 deletions(-)
---
base-commit: 68fa5b092efab37a4f08a47b22bb8ca98f7f6223
change-id: 20251007-netconsole_dynamic_extradata-21bd9d726568
Best regards,
--
Gustavo Duarte <gustavold(a)meta.com>
From: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Since the ftrace fprobe is both fgraph and ftrace based implemented,
the selftest needs to be updated. This does not count the actual
number of lines, but just check the differences.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
---
.../ftrace/test.d/dynevent/add_remove_fprobe.tc | 18 ++++--------------
1 file changed, 4 insertions(+), 14 deletions(-)
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc
index 2506f464811b..47067a5e3cb0 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe.tc
@@ -28,25 +28,21 @@ test -d events/fprobes/myevent1
test -d events/fprobes/myevent2
echo 1 > events/fprobes/myevent1/enable
-# Make sure the event is attached and is the only one
+# Make sure the event is attached.
grep -q $PLACE enabled_functions
cnt=`cat enabled_functions | wc -l`
-if [ $cnt -ne $((ocnt + 1)) ]; then
+if [ $cnt -eq $ocnt ]; then
exit_fail
fi
echo 1 > events/fprobes/myevent2/enable
-# It should till be the only attached function
-cnt=`cat enabled_functions | wc -l`
-if [ $cnt -ne $((ocnt + 1)) ]; then
- exit_fail
-fi
+cnt2=`cat enabled_functions | wc -l`
echo 1 > events/fprobes/myevent3/enable
# If the function is different, the attached function should be increased
grep -q $PLACE2 enabled_functions
cnt=`cat enabled_functions | wc -l`
-if [ $cnt -ne $((ocnt + 2)) ]; then
+if [ $cnt -eq $cnt2 ]; then
exit_fail
fi
@@ -56,12 +52,6 @@ echo "-:myevent2" >> dynamic_events
grep -q myevent1 dynamic_events
! grep -q myevent2 dynamic_events
-# should still have 2 left
-cnt=`cat enabled_functions | wc -l`
-if [ $cnt -ne $((ocnt + 2)) ]; then
- exit_fail
-fi
-
echo 0 > events/fprobes/enable
echo > dynamic_events
vgic_lpi_stress sends MAPTI and MAPC commands during guest GIC
setup to map interrupt events to ITT entries and collection IDs
to redistributors, respectively.
Theoretically, we have no guarantee that the ITS will
finish handling these mapping commands before the selftest
calls KVM_SIGNAL_MSI to inject LPIs to the guest. If LPIs
are injected before ITS mapping completes, the ITS cannot
properly pass the interrupt on to the redistributor.
In practice, KVM processes ITS commands synchronously, so
SYNC calls are functionally unnecessary and ignored in
vgic_its_handle_command().
However, selftests should test based on ARM specification and
be blind to KVM-specific implementation optimizations. Thus,
we must update the test to be architecturally compliant and
logically correct.
Fix by adding a SYNC command to the selftests ITS library,
then calling SYNC after ITS mapping to ensure mapping
completes before signal_lpi() writes to GITS_TRANSLATER.
This patch depends on commit a24f7afce048 ("KVM: selftests:
fix MAPC RDbase target formatting in vgic_lpi_stress"), which
is queued in kvmarm/fixes.
Signed-off-by: Maximilian Dittgen <mdittgen(a)amazon.de>
---
Validated by the following debug logging to the GITS_CMD_SYNC handler
in vgic_its_handle_command():
kvm_info("ITS SYNC command: %016llx %016llx %016llx %016llx\n",
its_cmd[0], its_cmd[1], its_cmd[2], its_cmd[3]);
Initialized a selftest guest with 4 vCPUs by:
./vgic_lpi_stress -v 4
Confirmed that an ITS SYNC was successfully called for all 4 vCPUs:
kvm [5094]: ITS SYNC command: 0000000000000005 0000000000000000 0000000000000000 0000000000000000
kvm [5094]: ITS SYNC command: 0000000000000005 0000000000000000 0000000000010000 0000000000000000
kvm [5094]: ITS SYNC command: 0000000000000005 0000000000000000 0000000000020000 0000000000000000
kvm [5094]: ITS SYNC command: 0000000000000005 0000000000000000 0000000000030000 0000000000000000
---
tools/testing/selftests/kvm/arm64/vgic_lpi_stress.c | 4 ++++
.../testing/selftests/kvm/include/arm64/gic_v3_its.h | 1 +
tools/testing/selftests/kvm/lib/arm64/gic_v3_its.c | 11 +++++++++++
3 files changed, 16 insertions(+)
diff --git a/tools/testing/selftests/kvm/arm64/vgic_lpi_stress.c b/tools/testing/selftests/kvm/arm64/vgic_lpi_stress.c
index 687d04463983..e857a605f577 100644
--- a/tools/testing/selftests/kvm/arm64/vgic_lpi_stress.c
+++ b/tools/testing/selftests/kvm/arm64/vgic_lpi_stress.c
@@ -118,6 +118,10 @@ static void guest_setup_gic(void)
guest_setup_its_mappings();
guest_invalidate_all_rdists();
+
+ /* SYNC to ensure ITS setup is complete */
+ for (cpuid = 0; cpuid < test_data.nr_cpus; cpuid++)
+ its_send_sync_cmd(test_data.cmdq_base_va, cpuid);
}
static void guest_code(size_t nr_lpis)
diff --git a/tools/testing/selftests/kvm/include/arm64/gic_v3_its.h b/tools/testing/selftests/kvm/include/arm64/gic_v3_its.h
index 3722ed9c8f96..58feef3eb386 100644
--- a/tools/testing/selftests/kvm/include/arm64/gic_v3_its.h
+++ b/tools/testing/selftests/kvm/include/arm64/gic_v3_its.h
@@ -15,5 +15,6 @@ void its_send_mapc_cmd(void *cmdq_base, u32 vcpu_id, u32 collection_id, bool val
void its_send_mapti_cmd(void *cmdq_base, u32 device_id, u32 event_id,
u32 collection_id, u32 intid);
void its_send_invall_cmd(void *cmdq_base, u32 collection_id);
+void its_send_sync_cmd(void *cmdq_base, u32 vcpu_id);
#endif // __SELFTESTS_GIC_V3_ITS_H__
diff --git a/tools/testing/selftests/kvm/lib/arm64/gic_v3_its.c b/tools/testing/selftests/kvm/lib/arm64/gic_v3_its.c
index 0e2f8ed90f30..d9ee331074ea 100644
--- a/tools/testing/selftests/kvm/lib/arm64/gic_v3_its.c
+++ b/tools/testing/selftests/kvm/lib/arm64/gic_v3_its.c
@@ -253,3 +253,14 @@ void its_send_invall_cmd(void *cmdq_base, u32 collection_id)
its_send_cmd(cmdq_base, &cmd);
}
+
+void its_send_sync_cmd(void *cmdq_base, u32 vcpu_id)
+{
+ struct its_cmd_block cmd = {};
+
+ its_encode_cmd(&cmd, GITS_CMD_SYNC);
+ its_encode_target(&cmd, procnum_to_rdbase(vcpu_id));
+
+ its_send_cmd(cmdq_base, &cmd);
+}
+
--
2.50.1 (Apple Git-155)
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Christof Hellmis
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
The printf statement attempts to print the DMA direction string using
the syntax 'dir[directions]', which is an invalid array access. The
variable 'dir' is an integer, and 'directions' is a char pointer array.
This incorrect syntax should be 'directions[dir]', using 'dir' as the
index into the 'directions' array. Fix this by correcting the array
access from 'dir[directions]' to 'directions[dir]'.
Signed-off-by: Zhang Chujun <zhangchujun(a)cmss.chinamobile.com>
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c
index b12f1f9babf8..b925756373ce 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -118,7 +118,7 @@ int main(int argc, char **argv)
}
printf("dma mapping benchmark: threads:%d seconds:%d node:%d dir:%s granule: %d\n",
- threads, seconds, node, dir[directions], granule);
+ threads, seconds, node, directions[dir], granule);
printf("average map latency(us):%.1f standard deviation:%.1f\n",
map.avg_map_100ns/10.0, map.map_stddev/10.0);
printf("average unmap latency(us):%.1f standard deviation:%.1f\n",
--
2.50.1.windows.1
Parsing KTAP is quite an inconvenience, but most of the time the thing
you really want to know is "did anything fail"?
Let's give the user the his information without them needing
to parse anything.
Because of the use of subshells and namespaces, this needs to be
communicated via a file. Just write arbitrary data into the file and
treat non-empty content as a signal that something failed.
In case any user depends on the current behaviour, such as running this
from a script with `set -e` and parsing the result for failures
afterwards, add a flag they can set to get the old behaviour, namely
--no-error-on-fail.
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
---
Changes in v3:
- Fixed quoting
- Link to v2: https://lore.kernel.org/r/20251014-b4-ksft-error-on-fail-v2-1-b3e2657237b8@…
Changes in v2:
- Fixed bug in report_failure()
- Made error-on-fail the default
- Link to v1: https://lore.kernel.org/r/20251007-b4-ksft-error-on-fail-v1-1-71bf058f5662@…
---
tools/testing/selftests/kselftest/runner.sh | 14 ++++++++++----
tools/testing/selftests/run_kselftest.sh | 14 ++++++++++++++
2 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/kselftest/runner.sh b/tools/testing/selftests/kselftest/runner.sh
index 2c3c58e65a419f5ee8d7dc51a37671237a07fa0b..3a62039fa6217f3453423ff011575d0a1eb8c275 100644
--- a/tools/testing/selftests/kselftest/runner.sh
+++ b/tools/testing/selftests/kselftest/runner.sh
@@ -44,6 +44,12 @@ tap_timeout()
fi
}
+report_failure()
+{
+ echo "not ok $*"
+ echo "$*" >> "$kselftest_failures_file"
+}
+
run_one()
{
DIR="$1"
@@ -105,7 +111,7 @@ run_one()
echo "# $TEST_HDR_MSG"
if [ ! -e "$TEST" ]; then
echo "# Warning: file $TEST is missing!"
- echo "not ok $test_num $TEST_HDR_MSG"
+ report_failure "$test_num $TEST_HDR_MSG"
else
if [ -x /usr/bin/stdbuf ]; then
stdbuf="/usr/bin/stdbuf --output=L "
@@ -123,7 +129,7 @@ run_one()
interpreter=$(head -n 1 "$TEST" | cut -c 3-)
cmd="$stdbuf $interpreter ./$BASENAME_TEST"
else
- echo "not ok $test_num $TEST_HDR_MSG"
+ report_failure "$test_num $TEST_HDR_MSG"
return
fi
fi
@@ -137,9 +143,9 @@ run_one()
echo "ok $test_num $TEST_HDR_MSG # SKIP"
elif [ $rc -eq $timeout_rc ]; then \
echo "#"
- echo "not ok $test_num $TEST_HDR_MSG # TIMEOUT $kselftest_timeout seconds"
+ report_failure "$test_num $TEST_HDR_MSG # TIMEOUT $kselftest_timeout seconds"
else
- echo "not ok $test_num $TEST_HDR_MSG # exit=$rc"
+ report_failure "$test_num $TEST_HDR_MSG # exit=$rc"
fi)
cd - >/dev/null
fi
diff --git a/tools/testing/selftests/run_kselftest.sh b/tools/testing/selftests/run_kselftest.sh
index 0443beacf3621ae36cb12ffd57f696ddef3526b5..d4be97498b32e975c63a1167d3060bdeba674c8c 100755
--- a/tools/testing/selftests/run_kselftest.sh
+++ b/tools/testing/selftests/run_kselftest.sh
@@ -33,6 +33,7 @@ Usage: $0 [OPTIONS]
-c | --collection COLLECTION Run all tests from COLLECTION
-l | --list List the available collection:test entries
-d | --dry-run Don't actually run any tests
+ -f | --no-error-on-fail Don't exit with an error just because tests failed
-n | --netns Run each test in namespace
-h | --help Show this usage info
-o | --override-timeout Number of seconds after which we timeout
@@ -44,6 +45,7 @@ COLLECTIONS=""
TESTS=""
dryrun=""
kselftest_override_timeout=""
+ERROR_ON_FAIL=true
while true; do
case "$1" in
-s | --summary)
@@ -65,6 +67,9 @@ while true; do
-d | --dry-run)
dryrun="echo"
shift ;;
+ -f | --no-error-on-fail)
+ ERROR_ON_FAIL=false
+ shift ;;
-n | --netns)
RUN_IN_NETNS=1
shift ;;
@@ -105,9 +110,18 @@ if [ -n "$TESTS" ]; then
available="$(echo "$valid" | sed -e 's/ /\n/g')"
fi
+kselftest_failures_file="$(mktemp --tmpdir kselftest-failures-XXXXXX)"
+export kselftest_failures_file
+
collections=$(echo "$available" | cut -d: -f1 | sort | uniq)
for collection in $collections ; do
[ -w /dev/kmsg ] && echo "kselftest: Running tests in $collection" >> /dev/kmsg
tests=$(echo "$available" | grep "^$collection:" | cut -d: -f2)
($dryrun cd "$collection" && $dryrun run_many $tests)
done
+
+failures="$(cat "$kselftest_failures_file")"
+rm "$kselftest_failures_file"
+if "$ERROR_ON_FAIL" && [ "$failures" ]; then
+ exit 1
+fi
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20251007-b4-ksft-error-on-fail-0c2cb3246041
Best regards,
--
Brendan Jackman <jackmanb(a)google.com>
This series adds support for tests that use multiple devices, and adds
one new test, vfio_pci_device_init_perf_test, which measures parallel
device initialization time to demonstrate the improvement from commit
e908f58b6beb ("vfio/pci: Separate SR-IOV VF dev_set").
This series also breaks apart the monolithic vfio_util.h and
vfio_pci_device.c into separate files, to account for all the new code.
This required quite a bit of code motion so the diffstat looks large.
The final layout is more granular and provides a better separation of
the IOMMU code from the device code.
Final layout:
C files:
- tools/testing/selftests/vfio/lib/iommu.c
- tools/testing/selftests/vfio/lib/iova_allocator.c
- tools/testing/selftests/vfio/lib/libvfio.c
- tools/testing/selftests/vfio/lib/vfio_pci_device.c
- tools/testing/selftests/vfio/lib/vfio_pci_driver.c
H files:
- tools/testing/selftests/vfio/lib/include/libvfio.h
- tools/testing/selftests/vfio/lib/include/libvfio/assert.h
- tools/testing/selftests/vfio/lib/include/libvfio/iommu.h
- tools/testing/selftests/vfio/lib/include/libvfio/iova_allocator.h
- tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
- tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_driver.h
Notably, vfio_util.h is now gone and replaced with libvfio.h.
This series is based on vfio/next plus Alex Mastro's series to add the
IOVA allocator [1]. It should apply cleanly to vfio/next once Alex's
series is merged into 6.18 and then into vfio/next.
This series can be found on GitHub:
https://github.com/dmatlack/linux/tree/vfio/selftests/init_perf_test/v2
[1] https://lore.kernel.org/kvm/20251111-iova-ranges-v3-0-7960244642c5@fb.com/
Cc: Alex Mastro <amastro(a)fb.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: Josh Hilke <jrhilke(a)google.com>
Cc: Raghavendra Rao Ananta <rananta(a)google.com>
Cc: Vipin Sharma <vipinsh(a)google.com>
v2:
- Require tests to call iommu_init() and manage struct iommu objects
rather than implicitly doing it in vfio_pci_device_init().
- Drop all the device wrappers for IOMMU methods and require tests to
interact with the iommu_*() helper functions directly.
- Add a commit to eliminate INVALID_IOVA. This is a simple cleanup I've
been meaning to make.
- Upgrade some driver logging to error (Raghavendra)
- Remove plurality from helper function that fetches BDF from
environment variable (Raghavendra)
- Fix cleanup.sh to only delete the device directory when cleaning up
all devices (Raghavendra)
v1: https://lore.kernel.org/kvm/20251008232531.1152035-1-dmatlack@google.com/
David Matlack (18):
vfio: selftests: Move run.sh into scripts directory
vfio: selftests: Split run.sh into separate scripts
vfio: selftests: Allow passing multiple BDFs on the command line
vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
vfio: selftests: Introduce struct iommu
vfio: selftests: Support multiple devices in the same
container/iommufd
vfio: selftests: Eliminate overly chatty logging
vfio: selftests: Prefix logs with device BDF where relevant
vfio: selftests: Upgrade driver logging to dev_err()
vfio: selftests: Rename struct vfio_dma_region to dma_region
vfio: selftests: Move IOMMU library code into iommu.c
vfio: selftests: Move IOVA allocator into iova_allocator.c
vfio: selftests: Stop passing device for IOMMU operations
vfio: selftests: Rename vfio_util.h to libvfio.h
vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
vfio: selftests: Split libvfio.h into separate header files
vfio: selftests: Eliminate INVALID_IOVA
vfio: selftests: Add vfio_pci_device_init_perf_test
tools/testing/selftests/vfio/Makefile | 9 +-
.../selftests/vfio/lib/drivers/dsa/dsa.c | 36 +-
.../selftests/vfio/lib/drivers/ioat/ioat.c | 18 +-
.../selftests/vfio/lib/include/libvfio.h | 26 +
.../vfio/lib/include/libvfio/assert.h | 54 ++
.../vfio/lib/include/libvfio/iommu.h | 76 +++
.../vfio/lib/include/libvfio/iova_allocator.h | 23 +
.../lib/include/libvfio/vfio_pci_device.h | 125 ++++
.../lib/include/libvfio/vfio_pci_driver.h | 97 +++
.../selftests/vfio/lib/include/vfio_util.h | 331 -----------
tools/testing/selftests/vfio/lib/iommu.c | 465 +++++++++++++++
.../selftests/vfio/lib/iova_allocator.c | 94 +++
tools/testing/selftests/vfio/lib/libvfio.c | 78 +++
tools/testing/selftests/vfio/lib/libvfio.mk | 5 +-
.../selftests/vfio/lib/vfio_pci_device.c | 555 +-----------------
.../selftests/vfio/lib/vfio_pci_driver.c | 16 +-
tools/testing/selftests/vfio/run.sh | 109 ----
.../testing/selftests/vfio/scripts/cleanup.sh | 41 ++
tools/testing/selftests/vfio/scripts/lib.sh | 42 ++
tools/testing/selftests/vfio/scripts/run.sh | 16 +
tools/testing/selftests/vfio/scripts/setup.sh | 48 ++
.../selftests/vfio/vfio_dma_mapping_test.c | 46 +-
.../selftests/vfio/vfio_iommufd_setup_test.c | 2 +-
.../vfio/vfio_pci_device_init_perf_test.c | 167 ++++++
.../selftests/vfio/vfio_pci_device_test.c | 12 +-
.../selftests/vfio/vfio_pci_driver_test.c | 51 +-
26 files changed, 1479 insertions(+), 1063 deletions(-)
create mode 100644 tools/testing/selftests/vfio/lib/include/libvfio.h
create mode 100644 tools/testing/selftests/vfio/lib/include/libvfio/assert.h
create mode 100644 tools/testing/selftests/vfio/lib/include/libvfio/iommu.h
create mode 100644 tools/testing/selftests/vfio/lib/include/libvfio/iova_allocator.h
create mode 100644 tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_device.h
create mode 100644 tools/testing/selftests/vfio/lib/include/libvfio/vfio_pci_driver.h
delete mode 100644 tools/testing/selftests/vfio/lib/include/vfio_util.h
create mode 100644 tools/testing/selftests/vfio/lib/iommu.c
create mode 100644 tools/testing/selftests/vfio/lib/iova_allocator.c
create mode 100644 tools/testing/selftests/vfio/lib/libvfio.c
delete mode 100755 tools/testing/selftests/vfio/run.sh
create mode 100755 tools/testing/selftests/vfio/scripts/cleanup.sh
create mode 100755 tools/testing/selftests/vfio/scripts/lib.sh
create mode 100755 tools/testing/selftests/vfio/scripts/run.sh
create mode 100755 tools/testing/selftests/vfio/scripts/setup.sh
create mode 100644 tools/testing/selftests/vfio/vfio_pci_device_init_perf_test.c
base-commit: 0ed3a30fd996cb0cac872432cf25185fda7e5316
prerequisite-patch-id: dcf23dcc1198960bda3102eefaa21df60b2e4c54
prerequisite-patch-id: e32e56d5bf7b6c7dd40d737aa3521560407e00f5
prerequisite-patch-id: 4f79a41bf10a4c025ba5f433551b46035aa15878
prerequisite-patch-id: f903a45f0c32319138cd93a007646ab89132b18c
--
2.52.0.rc1.455.g30608eb744-goog
Main objective of this series is to convert the gro.sh and toeplitz.sh
tests to be "NIPA-compatible" - meaning make use of the Python env,
which lets us run the tests against either netdevsim or a real device.
The tests seem to have been written with a different flow in mind.
Namely they source different bash "setup" scripts depending on arguments
passed to the test. While I have nothing against the use of bash and
the overall architecture - the existing code needs quite a bit of work
(don't assume MAC/IP addresses, support remote endpoint over SSH).
If I'm the one fixing it, I'd rather convert them to our "simplistic"
Python.
This series rewrites the tests in Python while addressing their
shortcomings. The functionality of running the test over loopback
on a real device is retained but with a different method of invocation
(see the last patch).
Once again we are dealing with a script which run over a variety of
protocols (combination of [ipv4, ipv6, ipip] x [tcp, udp]). The first
4 patches add support for test variants to our scripts. We use the
term "variant" in the same sense as the C kselftest_harness.h -
variant is just a set of static input arguments.
Note that neither GRO nor the Toeplitz test fully passes for me on
any HW I have access to. But this is unrelated to the conversion.
This series is not making any real functional changes to the tests,
it is limited to improving the "test harness" scripts.
v2:
[patch 5] fix accidental modification of gitignore
[patch 8] fix typo in "compared"
[patch 9] fix typo I -> It
[patch 10] fix typoe configure -> configured
v1: https://lore.kernel.org/20251117205810.1617533-1-kuba@kernel.org
Jakub Kicinski (12):
selftests: net: py: coding style improvements
selftests: net: py: extract the case generation logic
selftests: net: py: add test variants
selftests: drv-net: xdp: use variants for qstat tests
selftests: net: relocate gro and toeplitz tests to drivers/net
selftests: net: py: support ksft ready without wait
selftests: net: py: read ip link info about remote dev
netdevsim: pass packets thru GRO on Rx
selftests: drv-net: add a Python version of the GRO test
selftests: drv-net: hw: convert the Toeplitz test to Python
netdevsim: add loopback support
selftests: net: remove old setup_* scripts
tools/testing/selftests/drivers/net/Makefile | 2 +
.../testing/selftests/drivers/net/hw/Makefile | 6 +-
tools/testing/selftests/net/Makefile | 7 -
tools/testing/selftests/net/lib/Makefile | 1 +
drivers/net/netdevsim/netdev.c | 26 ++-
.../testing/selftests/{ => drivers}/net/gro.c | 5 +-
.../{net => drivers/net/hw}/toeplitz.c | 7 +-
.../testing/selftests/drivers/net/.gitignore | 1 +
tools/testing/selftests/drivers/net/gro.py | 161 ++++++++++++++
.../selftests/drivers/net/hw/.gitignore | 1 +
.../drivers/net/hw/lib/py/__init__.py | 4 +-
.../selftests/drivers/net/hw/toeplitz.py | 208 ++++++++++++++++++
.../selftests/drivers/net/lib/py/__init__.py | 4 +-
.../selftests/drivers/net/lib/py/env.py | 2 +
tools/testing/selftests/drivers/net/xdp.py | 42 ++--
tools/testing/selftests/net/.gitignore | 2 -
tools/testing/selftests/net/gro.sh | 105 ---------
.../selftests/net/lib/ksft_setup_loopback.sh | 111 ++++++++++
.../testing/selftests/net/lib/py/__init__.py | 5 +-
tools/testing/selftests/net/lib/py/ksft.py | 93 ++++++--
tools/testing/selftests/net/lib/py/nsim.py | 2 +-
tools/testing/selftests/net/lib/py/utils.py | 20 +-
tools/testing/selftests/net/setup_loopback.sh | 120 ----------
tools/testing/selftests/net/setup_veth.sh | 45 ----
tools/testing/selftests/net/toeplitz.sh | 199 -----------------
.../testing/selftests/net/toeplitz_client.sh | 28 ---
26 files changed, 630 insertions(+), 577 deletions(-)
rename tools/testing/selftests/{ => drivers}/net/gro.c (99%)
rename tools/testing/selftests/{net => drivers/net/hw}/toeplitz.c (99%)
create mode 100755 tools/testing/selftests/drivers/net/gro.py
create mode 100755 tools/testing/selftests/drivers/net/hw/toeplitz.py
delete mode 100755 tools/testing/selftests/net/gro.sh
create mode 100755 tools/testing/selftests/net/lib/ksft_setup_loopback.sh
delete mode 100644 tools/testing/selftests/net/setup_loopback.sh
delete mode 100644 tools/testing/selftests/net/setup_veth.sh
delete mode 100755 tools/testing/selftests/net/toeplitz.sh
delete mode 100755 tools/testing/selftests/net/toeplitz_client.sh
--
2.51.1
v23:
fixed some of the "CHECK:" reported on checkpatch --strict.
Accepted Joel's suggestion for kselftest's Makefile.
CONFIG_RISCV_USER_CFI is enabled when zicfiss, zicfilp and fcf-protection
are all present in toolchain
v22: fixing build error due to -march=zicfiss being picked in gcc-13 and above
but not actually doing any codegen or recognizing instruction for zicfiss.
Change in v22 makes dependence on `-fcf-protection=full` compiler flag to
ensure that toolchain has support and then only CONFIG_RISCV_USER_CFI will be
visible in menuconfig.
v21: fixed build errors.
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 and arm64 support for user mode shadow stack is already in mainline.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
Get the lastest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
To: Thomas Gleixner <tglx(a)linutronix.de>
To: Ingo Molnar <mingo(a)redhat.com>
To: Borislav Petkov <bp(a)alien8.de>
To: Dave Hansen <dave.hansen(a)linux.intel.com>
To: x86(a)kernel.org
To: H. Peter Anvin <hpa(a)zytor.com>
To: Andrew Morton <akpm(a)linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett(a)oracle.com>
To: Vlastimil Babka <vbabka(a)suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
To: Paul Walmsley <paul.walmsley(a)sifive.com>
To: Palmer Dabbelt <palmer(a)dabbelt.com>
To: Albert Ou <aou(a)eecs.berkeley.edu>
To: Conor Dooley <conor(a)kernel.org>
To: Rob Herring <robh(a)kernel.org>
To: Krzysztof Kozlowski <krzk+dt(a)kernel.org>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Christian Brauner <brauner(a)kernel.org>
To: Peter Zijlstra <peterz(a)infradead.org>
To: Oleg Nesterov <oleg(a)redhat.com>
To: Eric Biederman <ebiederm(a)xmission.com>
To: Kees Cook <kees(a)kernel.org>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
To: Jann Horn <jannh(a)google.com>
To: Conor Dooley <conor+dt(a)kernel.org>
To: Miguel Ojeda <ojeda(a)kernel.org>
To: Alex Gaynor <alex.gaynor(a)gmail.com>
To: Boqun Feng <boqun.feng(a)gmail.com>
To: Gary Guo <gary(a)garyguo.net>
To: Björn Roy Baron <bjorn3_gh(a)protonmail.com>
To: Benno Lossin <benno.lossin(a)proton.me>
To: Andreas Hindborg <a.hindborg(a)kernel.org>
To: Alice Ryhl <aliceryhl(a)google.com>
To: Trevor Gross <tmgross(a)umich.edu>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: linux-riscv(a)lists.infradead.org
Cc: devicetree(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: alistair.francis(a)wdc.com
Cc: richard.henderson(a)linaro.org
Cc: jim.shu(a)sifive.com
Cc: andybnac(a)gmail.com
Cc: kito.cheng(a)sifive.com
Cc: charlie(a)rivosinc.com
Cc: atishp(a)rivosinc.com
Cc: evan(a)rivosinc.com
Cc: cleger(a)rivosinc.com
Cc: alexghiti(a)rivosinc.com
Cc: samitolvanen(a)google.com
Cc: broonie(a)kernel.org
Cc: rick.p.edgecombe(a)intel.com
Cc: rust-for-linux(a)vger.kernel.org
changelog
---------
v23:
- fixed some of the "CHECK:" reported on checkpatch --strict.
- Accepted Joel's suggestion for kselftest's Makefile.
- CONFIG_RISCV_USER_CFI is enabled when zicfiss, zicfilp and fcf-protection
are all present in toolchain
v22:
- CONFIG_RISCV_USER_CFI was by default "n". With dual vdso support it is
default "y" (if toolchain supports it). Fixing build error due to
"-march=zicfiss" being picked in gcc-13 partially. gcc-13 only recognizes the
flag but not actually doing any codegen or recognizing instruction for zicfiss.
Change in v22 makes dependence on `-fcf-protection=full` compiler flag to
ensure that toolchain has support and then only CONFIG_RISCV_USER_CFI will be
visible in menuconfig.
- picked up tags and some cosmetic changes in commit message for dual vdso
patch.
v21:
- Fixing build errors due to changes in arch/riscv/include/asm/vdso.h
Using #ifdef instead of IS_ENABLED in arch/riscv/include/asm/vdso.h
vdso-cfi-offsets.h should be included only when CONFIG_RISCV_USER_CFI
is selected.
v20:
- rebased on v6.18-rc1.
- Added two vDSO support. If `CONFIG_RISCV_USER_CFI` is selected
two vDSOs are compiled (one for hardware prior to RVA23 and one
for RVA23 onwards). Kernel exposes RVA23 vDSO if hardware/cpu
implements zimop else exposes existing vDSO to userspace.
- default selection for `CONFIG_RISCV_USER_CFI` is "Yes".
- replaced "__ASSEMBLY__" with "__ASSEMBLER__"
v19:
- riscv_nousercfi was `int`. changed it to unsigned long.
Thanks to Alex Ghiti for reporting it. It was a bug.
- ELP is cleared on trap entry only when CONFIG_64BIT.
- restore ssp back on return to usermode was being done
before `riscv_v_context_nesting_end` on trap exit path.
If kernel shadow stack were enabled this would result in
kernel operating on user shadow stack and panic (as I found
in my testing of kcfi patch series). So fixed that.
v18:
- rebased on 6.16-rc1
- uprobe handling clears ELP in sstatus image in pt_regs
- vdso was missing shadow stack elf note for object files.
added that. Additional asm file for vdso needed the elf marker
flag. toolchain should complain if `-fcf-protection=full` and
marker is missing for object generated from asm file. Asked
toolchain folks to fix this. Although no reason to gate the merge
on that.
- Split up compile options for march and fcf-protection in vdso
Makefile
- CONFIG_RISCV_USER_CFI option is moved under "Kernel features" menu
Added `arch/riscv/configs/hardening.config` fragment which selects
CONFIG_RISCV_USER_CFI
v17:
- fixed warnings due to empty macros in usercfi.h (reported by alexg)
- fixed prefixes in commit titles reported by alexg
- took below uprobe with fcfi v2 patch from Zong Li and squashed it with
"riscv/traps: Introduce software check exception and uprobe handling"
https://lore.kernel.org/all/20250604093403.10916-1-zong.li@sifive.com/
v16:
- If FWFT is not implemented or returns error for shadow stack activation, then
no_usercfi is set to disable shadow stack. Although this should be picked up
by extension validation and activation. Fixed this bug for zicfilp and zicfiss
both. Thanks to Charlie Jenkins for reporting this.
- If toolchain doesn't support cfi, cfi kselftest shouldn't build. Suggested by
Charlie Jenkins.
- Default for CONFIG_RISCV_USER_CFI is set to no. Charlie/Atish suggested to
keep it off till we have more hardware availibility with RVA23 profile and
zimop/zcmop implemented. Else this will start breaking people's workflow
- Includes the fix if "!RV64 and !SBI" then definitions for FWFT in
asm-offsets.c error.
v15:
- Toolchain has been updated to include `-fcf-protection` flag. This
exists for x86 as well. Updated kernel patches to compile vDSO and
selftest to compile with `fcf-protection=full` flag.
- selecting CONFIG_RISCV_USERCFI selects CONFIG_RISCV_SBI.
- Patch to enable shadow stack for kernel wasn't hidden behind
CONFIG_RISCV_USERCFI and CONFIG_RISCV_SBI. fixed that.
v14:
- rebased on top of palmer/sbi-v3. Thus dropped clement's FWFT patches
Updated RISCV_ISA_EXT_XXXX in hwcap and hwprobe constants.
- Took Radim's suggestions on bitfields.
- Placed cfi_state at the end of thread_info block so that current situation
is not disturbed with respect to member fields of thread_info in single
cacheline.
v13:
- cpu_supports_shadow_stack/cpu_supports_indirect_br_lp_instr uses
riscv_has_extension_unlikely()
- uses nops(count) to create nop slide
- RISCV_ACQUIRE_BARRIER is not needed in `amo_user_shstk`. Removed it
- changed ternaries to simply use implicit casting to convert to bool.
- kernel command line allows to disable zicfilp and zicfiss independently.
updated kernel-parameters.txt.
- ptrace user abi for cfi uses bitmasks instead of bitfields. Added ptrace
kselftest.
- cosmetic and grammatical changes to documentation.
v12:
- It seems like I had accidently squashed arch agnostic indirect branch
tracking prctl and riscv implementation of those prctls. Split them again.
- set_shstk_status/set_indir_lp_status perform CSR writes only when CPU
support is available. As suggested by Zong Li.
- Some minor clean up in kselftests as suggested by Zong Li.
v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally
selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1` to
to `lpad 0`.
v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch
is not that interesting to this patch series for risc-v. There are instances in
arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch
to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to
validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure
we add single vdso object with cfi enabled. But a vdso object with scheme of
zero labeled landing pad is least common denominator and should work with all
objects of zero labeled as well as function-signature labeled objects.
v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs)
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in parlmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
Changes in v23:
- Link to v22: https://lore.kernel.org/r/20251023-v5_user_cfi_series-v22-0-1935270f7636@ri…
Changes in v22:
- Link to v21: https://lore.kernel.org/r/20251015-v5_user_cfi_series-v21-0-6a07856e90e7@ri…
Changes in v21:
- Link to v20: https://lore.kernel.org/r/20251013-v5_user_cfi_series-v20-0-b9de4be9912e@ri…
Changes in v20:
- Link to v19: https://lore.kernel.org/r/20250731-v5_user_cfi_series-v19-0-09b468d7beab@ri…
Changes in v19:
- Link to v18: https://lore.kernel.org/r/20250711-v5_user_cfi_series-v18-0-a8ee62f9f38e@ri…
Changes in v18:
- Link to v17: https://lore.kernel.org/r/20250604-v5_user_cfi_series-v17-0-4565c2cf869f@ri…
Changes in v17:
- Link to v16: https://lore.kernel.org/r/20250522-v5_user_cfi_series-v16-0-64f61a35eee7@ri…
Changes in v16:
- Link to v15: https://lore.kernel.org/r/20250502-v5_user_cfi_series-v15-0-914966471885@ri…
Changes in v15:
- changelog posted just below cover letter
- Link to v14: https://lore.kernel.org/r/20250429-v5_user_cfi_series-v14-0-5239410d012a@ri…
Changes in v14:
- changelog posted just below cover letter
- Link to v13: https://lore.kernel.org/r/20250424-v5_user_cfi_series-v13-0-971437de586a@ri…
Changes in v13:
- changelog posted just below cover letter
- Link to v12: https://lore.kernel.org/r/20250314-v5_user_cfi_series-v12-0-e51202b53138@ri…
Changes in v12:
- changelog posted just below cover letter
- Link to v11: https://lore.kernel.org/r/20250310-v5_user_cfi_series-v11-0-86b36cbfb910@ri…
Changes in v11:
- changelog posted just below cover letter
- Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Deepak Gupta (26):
mm: VM_SHADOW_STACK definition for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv/mm: manufacture shadow stack pte
riscv/mm: teach pte_mkwrite to manufacture shadow stack PTEs
riscv/mm: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
riscv: Implements arch agnostic shadow stack prctls
prctl: arch-agnostic prctl for indirect branch tracking
riscv: Implements arch agnostic indirect branch tracking prctls
riscv/traps: Introduce software check exception and uprobe handling
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: kernel command line option to opt out of user cfi
riscv: enable kernel access to shadow stack memory via FWFT sbi call
arch/riscv: dual vdso creation logic and select vdso based on hw
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Jim Shu (1):
arch/riscv: compile vdso with landing pad and shadow stack note
Documentation/admin-guide/kernel-parameters.txt | 8 +
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 179 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 22 +
arch/riscv/Makefile | 8 +-
arch/riscv/configs/hardening.config | 4 +
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/assembler.h | 44 ++
arch/riscv/include/asm/cpufeature.h | 12 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 26 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 1 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 95 ++++
arch/riscv/include/asm/vdso.h | 13 +-
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 34 ++
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 2 +
arch/riscv/kernel/asm-offsets.c | 10 +
arch/riscv/kernel/cpufeature.c | 27 +
arch/riscv/kernel/entry.S | 38 ++
arch/riscv/kernel/head.S | 27 +
arch/riscv/kernel/process.c | 27 +-
arch/riscv/kernel/ptrace.c | 95 ++++
arch/riscv/kernel/signal.c | 148 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 54 ++
arch/riscv/kernel/usercfi.c | 545 +++++++++++++++++++++
arch/riscv/kernel/vdso.c | 7 +
arch/riscv/kernel/vdso/Makefile | 40 +-
arch/riscv/kernel/vdso/flush_icache.S | 4 +
arch/riscv/kernel/vdso/gen_vdso_offsets.sh | 4 +-
arch/riscv/kernel/vdso/getcpu.S | 4 +
arch/riscv/kernel/vdso/note.S | 3 +
arch/riscv/kernel/vdso/rt_sigreturn.S | 4 +
arch/riscv/kernel/vdso/sys_hwprobe.S | 4 +
arch/riscv/kernel/vdso/vgetrandom-chacha.S | 5 +-
arch/riscv/kernel/vdso_cfi/Makefile | 25 +
arch/riscv/kernel/vdso_cfi/vdso-cfi.S | 11 +
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 16 +
include/linux/cpu.h | 4 +
include/linux/mm.h | 7 +
include/uapi/linux/elf.h | 2 +
include/uapi/linux/prctl.h | 27 +
kernel/sys.c | 30 ++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 2 +
tools/testing/selftests/riscv/cfi/Makefile | 23 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 82 ++++
tools/testing/selftests/riscv/cfi/cfitests.c | 173 +++++++
tools/testing/selftests/riscv/cfi/shadowstack.c | 385 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 27 +
62 files changed, 2481 insertions(+), 41 deletions(-)
---
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
This patch adds support for the Zalasr ISA extension, which supplies the
real load acquire/store release instructions.
The specification can be found here:
https://github.com/riscv/riscv-zalasr/blob/main/chapter2.adoc
This patch seires has been tested with ltp on Qemu with Brensan's zalasr
support patch[1].
Some false positive spacing error happens during patch checking. Thus I
CCed maintainers of checkpatch.pl as well.
[1] https://lore.kernel.org/all/CAGPSXwJEdtqW=nx71oufZp64nK6tK=0rytVEcz4F-gfvCO…
v4:
- Apply acquire/release semantics to arch_atomic operations. Thanks
to Andrea.
v3:
- Apply acquire/release semantics to arch_xchg/arch_cmpxchg operations
so as to ensure FENCE.TSO ordering between operations which precede the
UNLOCK+LOCK sequence and operations which follow the sequence. Thanks
to Andrea.
- Support hwprobe of Zalasr.
- Allow Zalasr extensions for Guest/VM.
v2:
- Adjust the order of Zalasr and Zalrsc in dt-bindings. Thanks to
Conor.
Xu Lu (10):
riscv: Add ISA extension parsing for Zalasr
dt-bindings: riscv: Add Zalasr ISA extension description
riscv: hwprobe: Export Zalasr extension
riscv: Introduce Zalasr instructions
riscv: Apply Zalasr to smp_load_acquire/smp_store_release
riscv: Apply acquire/release semantics to arch_xchg/arch_cmpxchg
operations
riscv: Apply acquire/release semantics to arch_atomic operations
riscv: Remove arch specific __atomic_acquire/release_fence
RISC-V: KVM: Allow Zalasr extensions for Guest/VM
RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test
Documentation/arch/riscv/hwprobe.rst | 5 +-
.../devicetree/bindings/riscv/extensions.yaml | 5 +
arch/riscv/include/asm/atomic.h | 70 ++++++++-
arch/riscv/include/asm/barrier.h | 91 +++++++++--
arch/riscv/include/asm/cmpxchg.h | 144 +++++++++---------
arch/riscv/include/asm/fence.h | 4 -
arch/riscv/include/asm/hwcap.h | 1 +
arch/riscv/include/asm/insn-def.h | 79 ++++++++++
arch/riscv/include/uapi/asm/hwprobe.h | 1 +
arch/riscv/include/uapi/asm/kvm.h | 1 +
arch/riscv/kernel/cpufeature.c | 1 +
arch/riscv/kernel/sys_hwprobe.c | 1 +
arch/riscv/kvm/vcpu_onereg.c | 2 +
.../selftests/kvm/riscv/get-reg-list.c | 4 +
14 files changed, 314 insertions(+), 95 deletions(-)
--
2.20.1
Currently, guard regions are not visible to users except through
/proc/$pid/pagemap, with no explicit visibility at the VMA level.
This makes the feature less useful, as it isn't entirely apparent which
VMAs may have these entries present, especially when performing actions
which walk through memory regions such as those performed by CRIU.
This series addresses this issue by introducing the VM_MAYBE_GUARD flag
which fulfils this role, updating the smaps logic to display an entry for
these.
The semantics of this flag are that a guard region MAY be present if set
(we cannot be sure, as we can't efficiently track whether an
MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if
not set the VMA definitely does NOT have any guard regions present.
It's problematic to establish this flag without further action, because
that means that VMAs with guard regions in them become non-mergeable with
adjacent VMAs for no especially good reason.
To work around this, this series also introduces the concept of 'sticky'
VMA flags - that is flags which:
a. if set in one VMA and not in another still permit those VMAs to be
merged (if otherwise compatible).
b. When they are merged, the resultant VMA must have the flag set.
The VMA logic is updated to propagate these flags correctly.
Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve
an issue with file-backed guard regions - previously these established an
anon_vma object for file-backed mappings solely to have vma_needs_copy()
correctly propagate guard region mappings to child processes.
We introduce a new flag alias VM_COPY_ON_FORK (which currently only
specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly
for this flag and to copy page tables if it is present, which resolves this
issue.
Additionally, we add the ability for allow-listed VMA flags to be
atomically writable with only mmap/VMA read locks held.
The only flag we allow so far is VM_MAYBE_GUARD, which we carefully ensure
does not cause any races by being allowed to do so.
This allows us to maintain guard region installation as a read-locked
operation and not endure the overhead of obtaining a write lock here.
Finally we introduce extensive VMA userland tests to assert that the sticky
VMA logic behaves correctly as well as guard region self tests to assert
that smaps visibility is correctly implemented.
v3:
* Propagated tags thanks Vlastimil & Pedro! :)
* Fixed doc nit as per Pedro.
* Added vma_flag_test_atomic() in preparation for fixing
retract_page_tables() (see below). We make this not require any locks, as
we serialise on the page table lock in retract_page_tables().
* Split the atomic flag enablement and actually setting the flag for guard
install into two separate commits so we clearly separate the various VMA
flag implementation details and us enabling this feature.
* Mentioned setting anon_vma for anonymous mappings in commit message as
per Vlastimil.
* Fixed an issue with retract_page_tables() whereby madvise(...,
MADV_COLLAPSE) relies upon file-backed VMAs not being collapsed due to
the UFFD WP VMA flag being set or the VMA having vma->anon_vma set
(i.e. being a MAP_PRIVATE file-backed VMA). This was updated to also
check for VM_MAYBE_GUARD.
* Introduced MADV_COLLAPSE self test to assert that the behaviour is
correct. I first reproduced the issue locally and then adapted the test
to assert that this no longer occurs.
* Mentioned KCSAN permissiveness in commit message as per Pedro.
* Mentioned mmap/VMA read lock excluding mmap/VMA write lock and thus
avoiding meaningful RMW races in commit message as per Vlastimil.
* Mentioned previous unconditional vma->anon_vma installation on guard
region installation as per Vlastimil.
* Avoided having merging compromised by reordering patches such that the
sticky VMA functionality is implemented prior to VM_MAYBE_GUARD being
utilised upon guard region installation, rendering Vlastimil's request to
mention this in a commit message unnecessary.
* Separated out sticky and copy on fork patches as per Pedro.
* Added VM_PFNMAP, VM_MIXEDMAP, VM_UFFD_WP to VM_COPY_ON_FORK to make
things more consistent and clean.
* Added mention of why generally VM_STICKY should be VM_COPY_ON_FORK in
copy on fork patch.
v2:
* Separated out userland VMA tests for sticky behaviour as per Suren.
* Added the concept of atomic writable VMA flags as per Pedro and Vlastimil.
* Made VM_MAYBE_GUARD an atomic writable flag so we don't have to take a VMA
write lock in madvise() as per Pedro and Vlastimil.
https://lore.kernel.org/all/cover.1762422915.git.lorenzo.stoakes@oracle.com/
v1:
https://lore.kernel.org/all/cover.1761756437.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (8):
mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps
mm: add atomic VMA flags and set VM_MAYBE_GUARD as such
mm: implement sticky VMA flags
mm: introduce copy-on-fork VMAs and make VM_MAYBE_GUARD one
mm: set the VM_MAYBE_GUARD flag on guard region install
tools/testing/vma: add VMA sticky userland tests
tools/testing/selftests/mm: add MADV_COLLAPSE test case
tools/testing/selftests/mm: add smaps visibility guard region test
Documentation/filesystems/proc.rst | 5 +-
fs/proc/task_mmu.c | 1 +
include/linux/mm.h | 102 ++++++++++++
include/trace/events/mmflags.h | 1 +
mm/khugepaged.c | 72 +++++---
mm/madvise.c | 22 ++-
mm/memory.c | 14 +-
mm/vma.c | 22 +--
tools/testing/selftests/mm/guard-regions.c | 185 +++++++++++++++++++++
tools/testing/selftests/mm/vm_util.c | 5 +
tools/testing/selftests/mm/vm_util.h | 1 +
tools/testing/vma/vma.c | 89 ++++++++--
tools/testing/vma/vma_internal.h | 56 +++++++
13 files changed, 511 insertions(+), 64 deletions(-)
--
2.51.0
The vector regset uses the maximum possible vlenb 8192 to allocate a
2^18 bytes buffer to copy the vector register. But most platforms
don’t support the largest vlenb.
The regset has 2 users, ptrace syscall and coredump. When handling the
PTRACE_GETREGSET requests from ptrace syscall, Linux will prepare a
kernel buffer which size is min(user buffer size, limit). A malicious
user process might overwhelm a memory-constrainted system when the
buffer limit is very large. The coredump uses regset_get_alloc() to
get the context of vector register. But this API allocates buffer
before checking whether the target process uses vector extension, this
wastes time to prepare a large memory buffer.
The buffer limit can be determined after getting platform vlenb in the
early boot stage, this can let the regset buffer match real hardware
limits. Also add .active callbacks to let the coredump skip vector part
when target process doesn't use it.
After this patchset, userspace process needs 2 ptrace syscalls to
retrieve the vector regset with PTRACE_GETREGSET. The first ptrace call
only reads the header to get the vlenb information. Then prepare a
suitable buffer to get the register context. The new vector ptrace
kselftest demonstrates it.
---
v2:
- fix issues in vector ptrace kselftest (Andy)
Yong-Xuan Wang (2):
riscv: ptrace: Optimize the allocation of vector regset
selftests: riscv: Add test for the Vector ptrace interface
arch/riscv/include/asm/vector.h | 1 +
arch/riscv/kernel/ptrace.c | 24 +++-
arch/riscv/kernel/vector.c | 2 +
tools/testing/selftests/riscv/vector/Makefile | 5 +-
.../selftests/riscv/vector/vstate_ptrace.c | 134 ++++++++++++++++++
5 files changed, 162 insertions(+), 4 deletions(-)
create mode 100644 tools/testing/selftests/riscv/vector/vstate_ptrace.c
--
2.43.0
The user_notification_wait_killable_after_reply test fails due to an
unhandled error when a traced syscall is interrupted by a signal.
When a signal arrives after the tracer has received a seccomp
notification but before it has replied, the notification can become
stale. Any subsequent reply (like with SECCOMP_IOCTL_NOTIF_ADDFD)
will fail with -ENOENT.
This patch fixes the test by handling the -ENOENT return value from
SECCOMP_IOCTL_NOTIF_ADDFD, preventing the test from failing
incorrectly. The loop counter is decremented to re-run the iteration
for the restarted syscall.
Signed-off-by: Wake Liu <wakel(a)google.com>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 574fdd102eb5..c3e598c9c4ee 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -5048,8 +5048,12 @@ TEST(user_notification_wait_killable_after_reply)
addfd.id = req.id;
addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
addfd.srcfd = 0;
- ASSERT_GE(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), 0)
- kill(pid, SIGKILL);
+ ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
+ if (ret < 0 && errno == ENOENT) {
+ i--;
+ continue;
+ }
+ ASSERT_GE(ret, 0);
}
/*
--
2.52.0.rc1.455.g30608eb744-goog
From: Chia-Yu Chang <chia-yu.chang(a)nokia-bell-labs.com>
Hello,
Plesae find the v5 AccECN case handling patch series, which covers
several excpetional case handling of Accurate ECN spec (RFC9768),
adds new identifiers to be used by CC modules, adds ecn_delta into
rate_sample, and keeps the ACE counter for computation, etc.
This patch series is part of the full AccECN patch series, which is available at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
Best regards,
Chia-Yu
---
v6:
- Update comment in #3 to highlight RX path is only used for virtio-net (Paolo Abeni <pabeni(a)redhat.com>)
- Rename TCP_CONG_WANTS_ECT_1 to TCP_CONG_ECT_1_NEGOTIATION to distiguish from TCP_CONG_ECT_1_ESTABLISH (Paolo Abeni <pabeni(a)redhat.com>)
- Move TCP_CONG_ECT_1_ESTABLISH in #6 to latter patch series (Paolo Abeni <pabeni(a)redhat.com>)
- Add new synack_type instead of moving the increment of num_retran in #9 (Paolo Abeni <pabeni(a)redhat.com>)
- Use new synack_type TCP_SYNACK_RETRANS and num_retrans for SYN/ACK retx fallbackk for AccECN in #10 (Paolo Abeni <pabeni(a)redhat.com>)
- Do not cast const struct into non-const in #11, and set AccECN fail mode after tcp_rtx_synack() (Paolo Abeni <pabeni(a)redhat.com>)
v5:
- Move previous #11 in v4 in latter patch after discussion with RFC author.
- Add #3 to update the comments for SKB_GSO_TCP_ECN and SKB_GSO_TCP_ACCECN. (Parav Pandit <parav(a)nvidia.com>)
- Add gro self-test for TCP CWR flag in #4. (Eric Dumazet <edumazet(a)google.com>)
- Add fixes: tag into #7 (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message of #8 and if condition check (Paolo Abeni <pabeni(a)redhat.com>)
- Add empty line between variable declarations and code in #13 (Paolo Abeni <pabeni(a)redhat.com>)
v4:
- Add previous #13 in v2 back after dicussion with the RFC author.
- Add TCP_ACCECN_OPTION_PERSIST to tcp_ecn_option sysctl to ignore AccECN fallback policy on sending AccECN option.
v3:
- Add additional min() check if pkts_acked_ewma is not initialized in #1. (Paolo Abeni <pabeni(a)redhat.com>)
- Change TCP_CONG_WANTS_ECT_1 into individual flag add helper function INET_ECN_xmit_wants_ect_1() in #3. (Paolo Abeni <pabeni(a)redhat.com>)
- Add empty line between variable declarations and code in #4. (Paolo Abeni <pabeni(a)redhat.com>)
- Update commit message to fix old AccECN commits in #5. (Paolo Abeni <pabeni(a)redhat.com>)
- Remove unnecessary brackets in #10. (Paolo Abeni <pabeni(a)redhat.com>)
- Move patch #3 in v2 to a later Prague patch serise and remove patch #13 in v2. (Paolo Abeni <pabeni(a)redhat.com>)
---
Chia-Yu Chang (12):
net: update commnets for SKB_GSO_TCP_ECN and SKB_GSO_TCP_ACCECN
selftests/net: gro: add self-test for TCP CWR flag
tcp: ECT_1_NEGOTIATION and NEEDS_ACCECN identifiers
tcp: disable RFC3168 fallback identifier for CC modules
tcp: accecn: handle unexpected AccECN negotiation feedback
tcp: accecn: retransmit downgraded SYN in AccECN negotiation
tcp: add TCP_SYNACK_RETRANS synack_type
tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN
SYN/ACK
tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiaion
tcp: accecn: fallback outgoing half link to non-AccECN
tcp: accecn: detect loss ACK w/ AccECN option and add
TCP_ACCECN_OPTION_PERSIST
tcp: accecn: enable AccECN
Ilpo Järvinen (2):
tcp: try to avoid safer when ACKs are thinned
gro: flushing when CWR is set negatively affects AccECN
Documentation/networking/ip-sysctl.rst | 4 +-
.../networking/net_cachelines/tcp_sock.rst | 1 +
include/linux/skbuff.h | 14 ++-
include/linux/tcp.h | 4 +-
include/net/inet_ecn.h | 20 +++-
include/net/tcp.h | 32 ++++++-
include/net/tcp_ecn.h | 92 ++++++++++++++-----
net/ipv4/inet_connection_sock.c | 4 +
net/ipv4/sysctl_net_ipv4.c | 4 +-
net/ipv4/tcp.c | 2 +
net/ipv4/tcp_cong.c | 5 +-
net/ipv4/tcp_input.c | 37 +++++++-
net/ipv4/tcp_minisocks.c | 46 +++++++---
net/ipv4/tcp_offload.c | 3 +-
net/ipv4/tcp_output.c | 32 ++++---
net/ipv4/tcp_timer.c | 3 +
tools/testing/selftests/net/gro.c | 80 +++++++++++-----
17 files changed, 295 insertions(+), 88 deletions(-)
--
2.34.1
Dzień dobry,
pomagamy przedsiębiorcom wprowadzić model wymiany walut, który minimalizuje wahania kosztów przy rozliczeniach międzynarodowych.
Kiedyv możemy umówić się na 15-minutową rozmowę, aby zaprezentować, jak taki model mógłby działać w Państwa firmie - z gwarancją indywidualnych kursów i pełnym uproszczeniem płatności? Proszę o propozycję dogodnego terminu.
Pozdrawiam
Marek Poradecki
This commit introduces checks for kernel version and seccomp filter flag
support to the seccomp selftests. It also includes conditional header
inclusions using __GLIBC_PREREQ.
Some tests were gated by kernel version, and adjustments were made for
flags introduced after kernel 5.4. This ensures the selftests can run
and pass correctly on kernel versions 5.4 and later, preventing failures
due to features not present in older kernels.
The use of __GLIBC_PREREQ ensures proper compilation and functionality
across different glibc versions in a mainline Linux kernel context.
While it might appear redundant in specific build environments due to
global overrides, it is crucial for upstream correctness and portability.
Signed-off-by: Wake Liu <wakel(a)google.com>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 108 ++++++++++++++++--
1 file changed, 99 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 61acbd45ffaa..9b660cff5a4a 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -13,12 +13,14 @@
* we need to use the kernel's siginfo.h file and trick glibc
* into accepting it.
*/
+#if defined(__GLIBC__) && defined(__GLIBC_PREREQ)
#if !__GLIBC_PREREQ(2, 26)
# include <asm/siginfo.h>
# define __have_siginfo_t 1
# define __have_sigval_t 1
# define __have_sigevent_t 1
#endif
+#endif
#include <errno.h>
#include <linux/filter.h>
@@ -300,6 +302,26 @@ int seccomp(unsigned int op, unsigned int flags, void *args)
}
#endif
+int seccomp_flag_supported(int flag)
+{
+ /*
+ * Probes if a seccomp filter flag is supported by the kernel.
+ *
+ * When an unsupported flag is passed to seccomp(SECCOMP_SET_MODE_FILTER, ...),
+ * the kernel returns EINVAL.
+ *
+ * When a supported flag is passed, the kernel proceeds to validate the
+ * filter program pointer. By passing NULL for the filter program,
+ * the kernel attempts to dereference a bad address, resulting in EFAULT.
+ *
+ * Therefore, checking for EFAULT indicates that the flag itself was
+ * recognized and supported by the kernel.
+ */
+ if (seccomp(SECCOMP_SET_MODE_FILTER, flag, NULL) == -1 && errno == EFAULT)
+ return 1;
+ return 0;
+}
+
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
@@ -2436,13 +2458,12 @@ TEST(detect_seccomp_filter_flags)
ASSERT_NE(ENOSYS, errno) {
TH_LOG("Kernel does not support seccomp syscall!");
}
- EXPECT_EQ(-1, ret);
- EXPECT_EQ(EFAULT, errno) {
- TH_LOG("Failed to detect that a known-good filter flag (0x%X) is supported!",
- flag);
- }
- all_flags |= flag;
+ if (seccomp_flag_supported(flag))
+ all_flags |= flag;
+ else
+ TH_LOG("Filter flag (0x%X) is not found to be supported!",
+ flag);
}
/*
@@ -2870,6 +2891,12 @@ TEST_F(TSYNC, two_siblings_with_one_divergence)
TEST_F(TSYNC, two_siblings_with_one_divergence_no_tid_in_err)
{
+ /* Depends on 5189149 (seccomp: allow TSYNC and USER_NOTIF together) */
+ if (!seccomp_flag_supported(SECCOMP_FILTER_FLAG_TSYNC_ESRCH)) {
+ SKIP(return, "Kernel does not support SECCOMP_FILTER_FLAG_TSYNC_ESRCH");
+ return;
+ }
+
long ret, flags;
void *status;
@@ -3475,6 +3502,11 @@ TEST(user_notification_basic)
TEST(user_notification_with_tsync)
{
+ /* Depends on 5189149 (seccomp: allow TSYNC and USER_NOTIF together) */
+ if (!seccomp_flag_supported(SECCOMP_FILTER_FLAG_TSYNC_ESRCH)) {
+ SKIP(return, "Kernel does not support SECCOMP_FILTER_FLAG_TSYNC_ESRCH");
+ return;
+ }
int ret;
unsigned int flags;
@@ -3966,6 +3998,13 @@ TEST(user_notification_filter_empty)
TEST(user_ioctl_notification_filter_empty)
{
+ /* Depends on 95036a7 (seccomp: interrupt SECCOMP_IOCTL_NOTIF_RECV
+ * when all users have exited) */
+ if (!ksft_min_kernel_version(6, 11)) {
+ SKIP(return, "Kernel version < 6.11");
+ return;
+ }
+
pid_t pid;
long ret;
int status, p[2];
@@ -4119,6 +4158,12 @@ int get_next_fd(int prev_fd)
TEST(user_notification_addfd)
{
+ /* Depends on 0ae71c7 (seccomp: Support atomic "addfd + send reply") */
+ if (!ksft_min_kernel_version(5, 14)) {
+ SKIP(return, "Kernel version < 5.14");
+ return;
+ }
+
pid_t pid;
long ret;
int status, listener, memfd, fd, nextfd;
@@ -4281,6 +4326,12 @@ TEST(user_notification_addfd)
TEST(user_notification_addfd_rlimit)
{
+ /* Depends on 7cf97b1 (seccomp: Introduce addfd ioctl to seccomp user notifier) */
+ if (!ksft_min_kernel_version(5, 9)) {
+ SKIP(return, "Kernel version < 5.9");
+ return;
+ }
+
pid_t pid;
long ret;
int status, listener, memfd;
@@ -4326,9 +4377,12 @@ TEST(user_notification_addfd_rlimit)
EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
EXPECT_EQ(errno, EMFILE);
- addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
- EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
- EXPECT_EQ(errno, EMFILE);
+ /* Depends on 0ae71c7 (seccomp: Support atomic "addfd + send reply") */
+ if (ksft_min_kernel_version(5, 14)) {
+ addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd), -1);
+ EXPECT_EQ(errno, EMFILE);
+ }
addfd.newfd = 100;
addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
@@ -4356,6 +4410,12 @@ TEST(user_notification_addfd_rlimit)
TEST(user_notification_sync)
{
+ /* Depends on 48a1084 (seccomp: add the synchronous mode for seccomp_unotify) */
+ if (!ksft_min_kernel_version(6, 6)) {
+ SKIP(return, "Kernel version < 6.6");
+ return;
+ }
+
struct seccomp_notif req = {};
struct seccomp_notif_resp resp = {};
int status, listener;
@@ -4520,6 +4580,12 @@ static char get_proc_stat(struct __test_metadata *_metadata, pid_t pid)
TEST(user_notification_fifo)
{
+ /* Depends on 4cbf6f6 (seccomp: Use FIFO semantics to order notifications) */
+ if (!ksft_min_kernel_version(5, 19)) {
+ SKIP(return, "Kernel version < 5.19");
+ return;
+ }
+
struct seccomp_notif_resp resp = {};
struct seccomp_notif req = {};
int i, status, listener;
@@ -4623,6 +4689,12 @@ static long get_proc_syscall(struct __test_metadata *_metadata, int pid)
/* Ensure non-fatal signals prior to receive are unmodified */
TEST(user_notification_wait_killable_pre_notification)
{
+ /* Depends on c2aa2df (seccomp: Add wait_killable semantic to seccomp user notifier) */
+ if (!ksft_min_kernel_version(5, 19)) {
+ SKIP(return, "Kernel version < 5.19");
+ return;
+ }
+
struct sigaction new_action = {
.sa_handler = signal_handler,
};
@@ -4693,6 +4765,12 @@ TEST(user_notification_wait_killable_pre_notification)
/* Ensure non-fatal signals after receive are blocked */
TEST(user_notification_wait_killable)
{
+ /* Depends on c2aa2df (seccomp: Add wait_killable semantic to seccomp user notifier) */
+ if (!ksft_min_kernel_version(5, 19)) {
+ SKIP(return, "Kernel version < 5.19");
+ return;
+ }
+
struct sigaction new_action = {
.sa_handler = signal_handler,
};
@@ -4772,6 +4850,12 @@ TEST(user_notification_wait_killable)
/* Ensure fatal signals after receive are not blocked */
TEST(user_notification_wait_killable_fatal)
{
+ /* Depends on c2aa2df (seccomp: Add wait_killable semantic to seccomp user notifier) */
+ if (!ksft_min_kernel_version(5, 19)) {
+ SKIP(return, "Kernel version < 5.19");
+ return;
+ }
+
struct seccomp_notif req = {};
int listener, status;
pid_t pid;
@@ -4854,6 +4938,12 @@ static void *tsync_vs_dead_thread_leader_sibling(void *_args)
*/
TEST(tsync_vs_dead_thread_leader)
{
+ /* Depends on bfafe5e (seccomp: release task filters when the task exits) */
+ if (!ksft_min_kernel_version(6, 11)) {
+ SKIP(return, "Kernel version < 6.11");
+ return;
+ }
+
int status;
pid_t pid;
long ret;
--
2.50.1.703.g449372360f-goog
syzbot ci has tested the following series
[v4] ipvlan: support mac-nat mode
https://lore.kernel.org/all/20251118100046.2944392-1-skorodumov.dmitry@huaw…
* [PATCH net-next 01/13] ipvlan: Support MACNAT mode
* [PATCH net-next 02/13] ipvlan: macnat: Handle rx mcast-ip and unicast eth
* [PATCH net-next 03/13] ipvlan: Forget all IP when device goes down
* [PATCH net-next 04/13] ipvlan: Support IPv6 in macnat mode.
* [PATCH net-next 05/13] ipvlan: Fix compilation warning about __be32 -> u32
* [PATCH net-next 06/13] ipvlan: Make the addrs_lock be per port
* [PATCH net-next 07/13] ipvlan: Take addr_lock in ipvlan_open()
* [PATCH net-next 08/13] ipvlan: Don't allow children to use IPs of main
* [PATCH net-next 09/13] ipvlan: const-specifier for functions that use iaddr
* [PATCH net-next 10/13] ipvlan: Common code from v6/v4 validator_event
* [PATCH net-next 11/13] ipvlan: common code to handle ipv6/ipv4 address events
* [PATCH net-next 12/13] ipvlan: Ignore PACKET_LOOPBACK in handle_mode_l2()
* [PATCH net-next 13/13] selftests: drv-net: selftest for ipvlan-macnat mode
and found the following issue:
WARNING: suspicious RCU usage in ipvlan_addr_event
Full report is available here:
https://ci.syzbot.org/series/e483b93a-1063-4c8a-b0e2-89530e79768b
***
WARNING: suspicious RCU usage in ipvlan_addr_event
tree: net-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base: c99ebb6132595b4b288a413981197eb076547c5a
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/ac5af6f3-6b14-4e35-9d81-ee1522de3952/config
8021q: adding VLAN 0 to HW filter on device batadv0
=============================
WARNING: suspicious RCU usage
syzkaller #0 Not tainted
-----------------------------
drivers/net/ipvlan/ipvlan.h:128 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by syz-executor/5984:
#0: ffffffff8f2cc248 (rtnl_mutex){+.+.}-{4:4}, at: inet_rtm_newaddr+0x3b0/0x18b0
#1: ffffffff8f39d9b0 ((inetaddr_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x54/0x90
stack backtrace:
CPU: 1 UID: 0 PID: 5984 Comm: syz-executor Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250
lockdep_rcu_suspicious+0x140/0x1d0
ipvlan_addr_event+0x60b/0x950
notifier_call_chain+0x1b6/0x3e0
blocking_notifier_call_chain+0x6a/0x90
__inet_insert_ifa+0xa13/0xbf0
inet_rtm_newaddr+0xf3a/0x18b0
rtnetlink_rcv_msg+0x7cf/0xb70
netlink_rcv_skb+0x208/0x470
netlink_unicast+0x82f/0x9e0
netlink_sendmsg+0x805/0xb30
__sock_sendmsg+0x21c/0x270
__sys_sendto+0x3bd/0x520
__x64_sys_sendto+0xde/0x100
do_syscall_64+0xfa/0xfa0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f711f191503
Code: 64 89 02 48 c7 c0 ff ff ff ff eb b7 66 2e 0f 1f 84 00 00 00 00 00 90 80 3d 61 70 22 00 00 41 89 ca 74 14 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 75 c3 0f 1f 40 00 55 48 83 ec 30 44 89 4c 24
RSP: 002b:00007ffc44b05f28 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00007f711ff14620 RCX: 00007f711f191503
RDX: 0000000000000028 RSI: 00007f711ff14670 RDI: 0000000000000003
RBP: 0000000000000001 R08: 00007ffc44b05f44 R09: 000000000000000c
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
R13: 0000000000000000 R14: 00007f711ff14670 R15: 0000000000000000
</TASK>
syz-executor (5984) used greatest stack depth: 19864 bytes left
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot(a)syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller(a)googlegroups.com.
Main objective of this series is to convert the gro.sh and toeplitz.sh
tests to be "NIPA-compatible" - meaning make use of the Python env,
which lets us run the tests against either netdevsim or a real device.
The tests seem to have been written with a different flow in mind.
Namely they source different bash "setup" scripts depending on arguments
passed to the test. While I have nothing against the use of bash and
the overall architecture - the existing code needs quite a bit of work
(don't assume MAC/IP addresses, support remote endpoint over SSH).
If I'm the one fixing it, I'd rather convert them to our "simplistic"
Python.
This series rewrites the tests in Python while addressing their
shortcomings. The functionality of running the test over loopback
on a real device is retained but with a different method of invocation
(see the last patch).
Once again we are dealing with a script which run over a variety of
protocols (combination of [ipv4, ipv6, ipip] x [tcp, udp]). The first
4 patches add support for test variants to our scripts. We use the
term "variant" in the same sense as the C kselftest_harness.h -
variant is just a set of static input arguments.
Note that neither GRO nor the Toeplitz test fully passes for me on
any HW I have access to. But this is unrelated to the conversion.
This series is not making any real functional changes to the tests,
it is limited to improving the "test harness" scripts.
Jakub Kicinski (12):
selftests: net: py: coding style improvements
selftests: net: py: extract the case generation logic
selftests: net: py: add test variants
selftests: drv-net: xdp: use variants for qstat tests
selftests: net: relocate gro and toeplitz tests to drivers/net
selftests: net: py: support ksft ready without wait
selftests: net: py: read ip link info about remote dev
netdevsim: pass packets thru GRO on Rx
selftests: drv-net: add a Python version of the GRO test
selftests: drv-net: hw: convert the Toeplitz test to Python
netdevsim: add loopback support
selftests: net: remove old setup_* scripts
tools/testing/selftests/drivers/net/Makefile | 2 +
.../testing/selftests/drivers/net/hw/Makefile | 6 +-
tools/testing/selftests/net/Makefile | 7 -
tools/testing/selftests/net/lib/Makefile | 1 +
drivers/net/netdevsim/netdev.c | 26 ++-
.../testing/selftests/{ => drivers}/net/gro.c | 5 +-
.../{net => drivers/net/hw}/toeplitz.c | 7 +-
.../testing/selftests/drivers/net/.gitignore | 1 +
tools/testing/selftests/drivers/net/gro.py | 161 ++++++++++++++
.../selftests/drivers/net/hw/.gitignore | 3 +-
.../drivers/net/hw/lib/py/__init__.py | 4 +-
.../selftests/drivers/net/hw/toeplitz.py | 208 ++++++++++++++++++
.../selftests/drivers/net/lib/py/__init__.py | 4 +-
.../selftests/drivers/net/lib/py/env.py | 2 +
tools/testing/selftests/drivers/net/xdp.py | 42 ++--
tools/testing/selftests/net/.gitignore | 2 -
tools/testing/selftests/net/gro.sh | 105 ---------
.../selftests/net/lib/ksft_setup_loopback.sh | 111 ++++++++++
.../testing/selftests/net/lib/py/__init__.py | 5 +-
tools/testing/selftests/net/lib/py/ksft.py | 93 ++++++--
tools/testing/selftests/net/lib/py/nsim.py | 2 +-
tools/testing/selftests/net/lib/py/utils.py | 20 +-
tools/testing/selftests/net/setup_loopback.sh | 120 ----------
tools/testing/selftests/net/setup_veth.sh | 45 ----
tools/testing/selftests/net/toeplitz.sh | 199 -----------------
.../testing/selftests/net/toeplitz_client.sh | 28 ---
26 files changed, 631 insertions(+), 578 deletions(-)
rename tools/testing/selftests/{ => drivers}/net/gro.c (99%)
rename tools/testing/selftests/{net => drivers/net/hw}/toeplitz.c (99%)
create mode 100755 tools/testing/selftests/drivers/net/gro.py
create mode 100755 tools/testing/selftests/drivers/net/hw/toeplitz.py
delete mode 100755 tools/testing/selftests/net/gro.sh
create mode 100755 tools/testing/selftests/net/lib/ksft_setup_loopback.sh
delete mode 100644 tools/testing/selftests/net/setup_loopback.sh
delete mode 100644 tools/testing/selftests/net/setup_veth.sh
delete mode 100755 tools/testing/selftests/net/toeplitz.sh
delete mode 100755 tools/testing/selftests/net/toeplitz_client.sh
--
2.51.1
Since commit 31158ad02ddb ("rqspinlock: Add deadlock detection
and recovery") the updated path on re-entrancy now reports deadlock
via -EDEADLK instead of the previous -EBUSY.
Also, the way reentrancy was exercised (via fentry/lookup_elem_raw)
has been fragile because lookup_elem_raw may be inlined
(find_kernel_btf_id() will return -ESRCH).
To fix this fentry is attached to bpf_obj_free_fields() instead of
lookup_elem_raw() and:
- The htab map is made to use a BTF-described struct val with a
struct bpf_timer so that check_and_free_fields() reliably calls
bpf_obj_free_fields() on element replacement.
- The selftest is updated to do two updates to the same key (insert +
replace) in prog_test.
- The selftest is updated to align with expected errno with the
kernel’s current behavior.
Signed-off-by: Saket Kumar Bhaskar <skb99(a)linux.ibm.com>
---
Changes since v2:
Addressed CI failures:
* Initialize key to 0 before the first update.
* Used pointer value to pass for update and memset rather than
&value.
v2: https://lore.kernel.org/all/20251114152653.356782-1-skb99@linux.ibm.com/
Changes since v1:
Addressed comments from Alexei:
* Fixed the scenario where test may fail when lookup_elem_raw()
is inlined.
v1: https://lore.kernel.org/all/20251106052628.349117-1-skb99@linux.ibm.com/
.../selftests/bpf/prog_tests/htab_update.c | 37 ++++++++++++++-----
.../testing/selftests/bpf/progs/htab_update.c | 19 +++++++---
2 files changed, 41 insertions(+), 15 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/htab_update.c b/tools/testing/selftests/bpf/prog_tests/htab_update.c
index 2bc85f4814f4..d0b405eb2966 100644
--- a/tools/testing/selftests/bpf/prog_tests/htab_update.c
+++ b/tools/testing/selftests/bpf/prog_tests/htab_update.c
@@ -15,17 +15,17 @@ struct htab_update_ctx {
static void test_reenter_update(void)
{
struct htab_update *skel;
- unsigned int key, value;
+ void *value = NULL;
+ unsigned int key, value_size;
int err;
skel = htab_update__open();
if (!ASSERT_OK_PTR(skel, "htab_update__open"))
return;
- /* lookup_elem_raw() may be inlined and find_kernel_btf_id() will return -ESRCH */
- bpf_program__set_autoload(skel->progs.lookup_elem_raw, true);
+ bpf_program__set_autoload(skel->progs.bpf_obj_free_fields, true);
err = htab_update__load(skel);
- if (!ASSERT_TRUE(!err || err == -ESRCH, "htab_update__load") || err)
+ if (!ASSERT_TRUE(!err, "htab_update__load") || err)
goto out;
skel->bss->pid = getpid();
@@ -33,14 +33,33 @@ static void test_reenter_update(void)
if (!ASSERT_OK(err, "htab_update__attach"))
goto out;
- /* Will trigger the reentrancy of bpf_map_update_elem() */
+ value_size = bpf_map__value_size(skel->maps.htab);
+
+ value = calloc(1, value_size);
+ if (!ASSERT_OK_PTR(value, "calloc value"))
+ goto out;
+ /*
+ * First update: plain insert. This should NOT trigger the re-entrancy
+ * path, because there is no old element to free yet.
+ */
key = 0;
- value = 0;
- err = bpf_map_update_elem(bpf_map__fd(skel->maps.htab), &key, &value, 0);
- if (!ASSERT_OK(err, "add element"))
+ err = bpf_map_update_elem(bpf_map__fd(skel->maps.htab), &key, value, BPF_ANY);
+ if (!ASSERT_OK(err, "first update (insert)"))
+ goto out;
+
+ /*
+ * Second update: replace existing element with same key and trigger
+ * the reentrancy of bpf_map_update_elem().
+ * check_and_free_fields() calls bpf_obj_free_fields() on the old
+ * value, which is where fentry program runs and performs a nested
+ * bpf_map_update_elem(), triggering -EDEADLK.
+ */
+ memset(value, 0, value_size);
+ err = bpf_map_update_elem(bpf_map__fd(skel->maps.htab), &key, value, BPF_ANY);
+ if (!ASSERT_OK(err, "second update (replace)"))
goto out;
- ASSERT_EQ(skel->bss->update_err, -EBUSY, "no reentrancy");
+ ASSERT_EQ(skel->bss->update_err, -EDEADLK, "no reentrancy");
out:
htab_update__destroy(skel);
}
diff --git a/tools/testing/selftests/bpf/progs/htab_update.c b/tools/testing/selftests/bpf/progs/htab_update.c
index 7481bb30b29b..195d3b2fba00 100644
--- a/tools/testing/selftests/bpf/progs/htab_update.c
+++ b/tools/testing/selftests/bpf/progs/htab_update.c
@@ -6,24 +6,31 @@
char _license[] SEC("license") = "GPL";
+/* Map value type: has BTF-managed field (bpf_timer) */
+struct val {
+ struct bpf_timer t;
+ __u64 payload;
+};
+
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 1);
- __uint(key_size, sizeof(__u32));
- __uint(value_size, sizeof(__u32));
+ __type(key, __u32);
+ __type(value, struct val);
} htab SEC(".maps");
int pid = 0;
int update_err = 0;
-SEC("?fentry/lookup_elem_raw")
-int lookup_elem_raw(void *ctx)
+SEC("?fentry/bpf_obj_free_fields")
+int bpf_obj_free_fields(void *ctx)
{
- __u32 key = 0, value = 1;
+ __u32 key = 0;
+ struct val value = { .payload = 1 };
if ((bpf_get_current_pid_tgid() >> 32) != pid)
return 0;
- update_err = bpf_map_update_elem(&htab, &key, &value, 0);
+ update_err = bpf_map_update_elem(&htab, &key, &value, BPF_ANY);
return 0;
}
--
2.51.0
Here are a bunch of small improvements to the MPTCP selftests:
- Patch 1: move code to mptcp_lib.sh to prepare the new features.
- Patch 2: simplify mptcp_lib_pr_err_stats helper use.
- Patch 3: remove unused last column from nstat output.
- Patch 4: improve stats dump in mptcp_join.sh.
- Patch 5: get counters from nstat history and simplify mptcp_connect.sh.
- Patch 6: avoid taking the same packet trace twice.
- Patch 7: wait for an event instead of a fix time.
- Patch 8: instead of using 'timeout' and print the stats after, another
internal timeout is used: if it fires, it will print stats, then stop
everything. This avoids confusions around stats in case of timeout.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (8):
selftests: mptcp: lib: introduce 'nstat_{init,get}'
selftests: mptcp: lib: remove stats files args
selftests: mptcp: lib: stats: remove nstat rate columns
selftests: mptcp: join: dump stats from history
selftests: mptcp: lib: get counters from nstat history
selftests: mptcp: connect: avoid double packet traces
selftests: mptcp: wait for port instead of sleep
selftests: mptcp: get stats just before timing out
tools/testing/selftests/net/mptcp/mptcp_connect.sh | 140 ++++++++++-----------
tools/testing/selftests/net/mptcp/mptcp_join.sh | 65 +++++-----
tools/testing/selftests/net/mptcp/mptcp_lib.sh | 58 +++++++--
tools/testing/selftests/net/mptcp/mptcp_sockopt.sh | 43 ++++---
tools/testing/selftests/net/mptcp/simult_flows.sh | 44 ++++---
tools/testing/selftests/net/mptcp/userspace_pm.sh | 3 +-
6 files changed, 203 insertions(+), 150 deletions(-)
---
base-commit: df58ee7d8faf353ebf5d4703c35fcf3e578e9b1b
change-id: 20251114-net-next-mptcp-sft-count-cache-stats-timeout-faa64482db92
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Examples (i.e. doctests) may want to use names such as `foo`, thus the
`clippy::disallowed_names` lint gets in the way.
Thus allow it for all doctests.
In addition, remove it from the existing `expect`s we have in a few
doctests.
This does not mean that we should stop trying to find good names for
our examples, though.
Suggested-by: Alice Ryhl <aliceryhl(a)google.com>
Link: https://lore.kernel.org/rust-for-linux/aRHSLChi5HYXW4-9@google.com/
Signed-off-by: Miguel Ojeda <ojeda(a)kernel.org>
---
rust/kernel/init.rs | 3 +--
rust/kernel/types.rs | 1 -
scripts/rustdoc_test_gen.rs | 2 +-
3 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/rust/kernel/init.rs b/rust/kernel/init.rs
index e476d81c1a27..899b9a962762 100644
--- a/rust/kernel/init.rs
+++ b/rust/kernel/init.rs
@@ -30,7 +30,7 @@
//! ## General Examples
//!
//! ```rust
-//! # #![expect(clippy::disallowed_names, clippy::undocumented_unsafe_blocks)]
+//! # #![expect(clippy::undocumented_unsafe_blocks)]
//! use kernel::types::Opaque;
//! use pin_init::pin_init_from_closure;
//!
@@ -67,7 +67,6 @@
//! ```
//!
//! ```rust
-//! # #![expect(clippy::disallowed_names)]
//! use kernel::{prelude::*, types::Opaque};
//! use core::{ptr::addr_of_mut, marker::PhantomPinned, pin::Pin};
//! # mod bindings {
diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
index 835824788506..9c5e7dbf1632 100644
--- a/rust/kernel/types.rs
+++ b/rust/kernel/types.rs
@@ -289,7 +289,6 @@ fn drop(&mut self) {
/// # Examples
///
/// ```
-/// # #![expect(clippy::disallowed_names)]
/// use kernel::types::Opaque;
/// # // Emulate a C struct binding which is from C, maybe uninitialized or not, only the C side
/// # // knows.
diff --git a/scripts/rustdoc_test_gen.rs b/scripts/rustdoc_test_gen.rs
index 0e6a0542d1bd..be0561049660 100644
--- a/scripts/rustdoc_test_gen.rs
+++ b/scripts/rustdoc_test_gen.rs
@@ -208,7 +208,7 @@ macro_rules! assert_eq {{
#[allow(unused)]
static __DOCTEST_ANCHOR: i32 = ::core::line!() as i32 + {body_offset} + 1;
{{
- #![allow(unreachable_pub)]
+ #![allow(unreachable_pub, clippy::disallowed_names)]
{body}
main();
}}
--
2.51.2
Commit 4dfd4bba8578 ("selftests/mm/uffd: refactor non-composite global
vars into struct") moved some of the operations previously implemented
in uffd_setup_environment() earlier in the main test loop.
The calculation of nr_pages, which involves a division by page_size, now
occurs before checking that default_huge_page_size() returns a non-zero
This leads to a division-by-zero error on systems with !CONFIG_HUGETLB.
Fix this by relocating the non-zero page_size check before the nr_pages
calculation, as it was originally implemented.
Cc: stable(a)vger.kernel.org
Fixes: 4dfd4bba8578 ("selftests/mm/uffd: refactor non-composite global vars into struct")
Signed-off-by: Carlos Llamas <cmllamas(a)google.com>
---
tools/testing/selftests/mm/uffd-unit-tests.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c
index 9e3be2ee7f1b..f917b4c4c943 100644
--- a/tools/testing/selftests/mm/uffd-unit-tests.c
+++ b/tools/testing/selftests/mm/uffd-unit-tests.c
@@ -1758,10 +1758,15 @@ int main(int argc, char *argv[])
uffd_test_ops = mem_type->mem_ops;
uffd_test_case_ops = test->test_case_ops;
- if (mem_type->mem_flag & (MEM_HUGETLB_PRIVATE | MEM_HUGETLB))
+ if (mem_type->mem_flag & (MEM_HUGETLB_PRIVATE | MEM_HUGETLB)) {
gopts.page_size = default_huge_page_size();
- else
+ if (gopts.page_size == 0) {
+ uffd_test_skip("huge page size is 0, feature missing?");
+ continue;
+ }
+ } else {
gopts.page_size = psize();
+ }
/* Ensure we have at least 2 pages */
gopts.nr_pages = MAX(UFFD_TEST_MEM_SIZE, gopts.page_size * 2)
@@ -1776,12 +1781,6 @@ int main(int argc, char *argv[])
continue;
uffd_test_start("%s on %s", test->name, mem_type->name);
- if ((mem_type->mem_flag == MEM_HUGETLB ||
- mem_type->mem_flag == MEM_HUGETLB_PRIVATE) &&
- (default_huge_page_size() == 0)) {
- uffd_test_skip("huge page size is 0, feature missing?");
- continue;
- }
if (!uffd_feature_supported(test)) {
uffd_test_skip("feature missing");
continue;
--
2.51.2.1041.gc1ab5b90ca-goog