This series improves the CPU cost of RX token management by adding an
attribute to NETDEV_CMD_BIND_RX that configures sockets using the
binding to avoid the xarray allocator and instead use a per-binding niov
array and a uref field in niov.
Improvement is ~13% cpu util per RX user thread.
Using kperf, the following results were observed:
Before:
Average RX worker idle %: 13.13, flows 4, test runs 11
After:
Average RX worker idle %: 26.32, flows 4, test runs 11
Two other approaches were tested, but with no improvement. Namely, 1)
using a hashmap for tokens and 2) keeping an xarray of atomic counters
but using RCU so that the hotpath could be mostly lockless. Neither of
these approaches proved better than the simple array in terms of CPU.
The attribute NETDEV_A_DMABUF_AUTORELEASE is added to toggle the
optimization. It is an optional attribute and defaults to 0 (i.e.,
optimization on).
Signed-off-by: Bobby Eshleman <bobbyeshleman(a)meta.com>
Changes in v9:
- fixed build with NET_DEVMEM=n
- fixed bug in rx bindings count logic
- Link to v8: https://lore.kernel.org/r/20260107-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v8:
- change static branch logic (only set when enabled, otherwise just
always revert back to disabled)
- fix missing tests
- Link to v7: https://lore.kernel.org/r/20251119-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v7:
- use netlink instead of sockopt (Stan)
- restrict system to only one mode, dmabuf bindings can not co-exist
with different modes (Stan)
- use static branching to enforce single system-wide mode (Stan)
- Link to v6: https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v6:
- renamed 'net: devmem: use niov array for token management' to refer to
optionality of new config
- added documentation and tests
- make autorelease flag per-socket sockopt instead of binding
field / sysctl
- many per-patch changes (see Changes sections per-patch)
- Link to v5: https://lore.kernel.org/r/20251023-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v5:
- add sysctl to opt-out of performance benefit, back to old token release
- Link to v4: https://lore.kernel.org/all/20250926-scratch-bobbyeshleman-devmem-tcp-token…
Changes in v4:
- rebase to net-next
- Link to v3: https://lore.kernel.org/r/20250926-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory
footprint
- fallback to cleaning up references in dmabuf unbind if socket
leaked tokens
- drop ethtool patch
- Link to v2: https://lore.kernel.org/r/20250911-scratch-bobbyeshleman-devmem-tcp-token-u…
Changes in v2:
- net: ethtool: prevent user from breaking devmem single-binding rule
(Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- remove WARNs on invalid user input (Mina)
- remove extraneous binding ref get (Mina)
- remove WARN for changed binding (Mina)
- always use GFP_ZERO for binding->vec (Mina)
- fix length of alloc for urefs
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- Link to v1: https://lore.kernel.org/r/20250902-scratch-bobbyeshleman-devmem-tcp-token-u…
---
Bobby Eshleman (5):
net: devmem: rename tx_vec to vec in dmabuf binding
net: devmem: refactor sock_devmem_dontneed for autorelease split
net: devmem: implement autorelease token management
net: devmem: document NETDEV_A_DMABUF_AUTORELEASE netlink attribute
selftests: drv-net: devmem: add autorelease test
Documentation/netlink/specs/netdev.yaml | 12 +++
Documentation/networking/devmem.rst | 70 +++++++++++++
include/net/netmem.h | 1 +
include/net/sock.h | 7 +-
include/uapi/linux/netdev.h | 1 +
net/core/devmem.c | 116 ++++++++++++++++++----
net/core/devmem.h | 29 +++++-
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 10 +-
net/core/sock.c | 103 ++++++++++++++-----
net/ipv4/tcp.c | 76 +++++++++++---
net/ipv4/tcp_ipv4.c | 11 +-
net/ipv4/tcp_minisocks.c | 3 +-
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/drivers/net/hw/devmem.py | 21 +++-
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 19 ++--
16 files changed, 407 insertions(+), 78 deletions(-)
---
base-commit: 6ad078fa0ababa8de2a2b39f476d2abd179a3cf6
change-id: 20250829-scratch-bobbyeshleman-devmem-tcp-token-upstream-292be174d503
Best regards,
--
Bobby Eshleman <bobbyeshleman(a)meta.com>
Much work has recently gone into supporting block device integrity data
(sometimes called "metadata") in Linux. Many NVMe devices these days
support metadata transfers and/or automatic protection information
generation and verification. However, ublk devices can't yet advertise
integrity data capabilities. This patch series wires up support for
integrity data in ublk. The ublk feature is referred to as "integrity"
rather than "metadata" to match the block layer's name for it and to
avoid confusion with the existing and unrelated UBLK_IO_F_META.
To advertise support for integrity data, a ublk server fills out the
struct ublk_params's integrity field and sets UBLK_PARAM_TYPE_INTEGRITY.
The struct ublk_param_integrity flags and csum_type fields use the
existing LBMD_PI_* constants from the linux/fs.h UAPI header. The ublk
driver fills out a corresponding struct blk_integrity.
When a request with integrity data is issued to the ublk device, the
ublk driver sets UBLK_IO_F_INTEGRITY in struct ublksrv_io_desc's
op_flags field. This is necessary for a ublk server for which
bi_offload_capable() returns true to distinguish requests with integrity
data from those without.
Integrity data transfers can currently only be performed via the ublk
user copy mechanism. The overhead of zero-copy buffer registration makes
it less appealing for the small transfers typical of integrity data.
Additionally, neither io_uring NVMe passthru nor IORING_RW_ATTR_FLAG_PI
currently allow an io_uring registered buffer for the integrity data.
The ki_pos field of the struct kiocb passed to the user copy
->{read,write}_iter() callback gains a bit UBLKSRV_IO_INTEGRITY_FLAG for
a ublk server to indicate whether to access the request's data or
integrity data.
Not yet supported is an analogue for the IO_INTEGRITY_CHK_*/BIP_CHECK_*
flags to ask the ublk server to verify the guard, reftag, and/or apptag
of a request's protection information. The user copy mechanism currently
forbids a ublk server from reading the data/integrity buffer of a
read-direction request. We could potentially relax this restriction for
integrity data on reads. Alternatively, the ublk driver could verify the
requested fields as part of the user copy operation.
v4:
- Add max_integrity_segments to struct ublk_param_integrity (Ming)
- Move UBLKSRV_IO_INTEGRITY_FLAG to avoid overflow from
QID + UBLKSRV_IO_BUF_OFFSET (Ming)
- Check UBLK_F_INTEGRITY when UBLKSRV_IO_INTEGRITY_FLAG is used (Ming)
- Initialize integrity backing file to disable integrity checks (Ming)
v3:
- Drop support for communicating BIP_CHECK_* for now until the interface
is decided
- Add Reviewed-by tags
v2:
- Communicate BIP_CHECK_* flags and expected reftag seed and app tag to
ublk server
- Add UBLK_F_INTEGRITY feature flag (Ming)
- Don't change the definition of UBLKSRV_IO_BUF_TOTAL_BITS (Ming)
- Drop patches already applied
- Add Reviewed-by tags
Caleb Sander Mateos (16):
blk-integrity: take const pointer in blk_integrity_rq()
ublk: move ublk flag check functions earlier
ublk: set UBLK_IO_F_INTEGRITY in ublksrv_io_desc
ublk: split out ublk_copy_user_bvec() helper
ublk: split out ublk_user_copy() helper
ublk: inline ublk_check_and_get_req() into ublk_user_copy()
ublk: move offset check out of __ublk_check_and_get_req()
ublk: optimize ublk_user_copy() on daemon task
selftests: ublk: display UBLK_F_INTEGRITY support
selftests: ublk: add utility to get block device metadata size
selftests: ublk: add kublk support for integrity params
selftests: ublk: implement integrity user copy in kublk
selftests: ublk: support non-O_DIRECT backing files
selftests: ublk: add integrity data support to loop target
selftests: ublk: add integrity params test
selftests: ublk: add end-to-end integrity test
Stanley Zhang (3):
ublk: support UBLK_PARAM_TYPE_INTEGRITY in device creation
ublk: implement integrity user copy
ublk: support UBLK_F_INTEGRITY
drivers/block/ublk_drv.c | 358 +++++++++++++------
include/linux/blk-integrity.h | 6 +-
include/uapi/linux/ublk_cmd.h | 25 ++
tools/testing/selftests/ublk/Makefile | 7 +-
tools/testing/selftests/ublk/common.c | 4 +-
tools/testing/selftests/ublk/fault_inject.c | 1 +
tools/testing/selftests/ublk/file_backed.c | 90 ++++-
tools/testing/selftests/ublk/kublk.c | 90 ++++-
tools/testing/selftests/ublk/kublk.h | 37 +-
tools/testing/selftests/ublk/metadata_size.c | 36 ++
tools/testing/selftests/ublk/null.c | 1 +
tools/testing/selftests/ublk/stripe.c | 6 +-
tools/testing/selftests/ublk/test_common.sh | 10 +
tools/testing/selftests/ublk/test_loop_08.sh | 111 ++++++
tools/testing/selftests/ublk/test_null_04.sh | 166 +++++++++
15 files changed, 817 insertions(+), 131 deletions(-)
create mode 100644 tools/testing/selftests/ublk/metadata_size.c
create mode 100755 tools/testing/selftests/ublk/test_loop_08.sh
create mode 100755 tools/testing/selftests/ublk/test_null_04.sh
--
2.45.2
Fix a couple of bugs in the RPS cases of the Toeplitz selftest.
Gal Pressman (2):
selftests: drv-net: fix RPS mask handling in toeplitz test
selftests: drv-net: fix RPS mask handling for high CPU numbers
tools/testing/selftests/drivers/net/hw/toeplitz.c | 4 ++--
tools/testing/selftests/drivers/net/hw/toeplitz.py | 9 ++++++---
2 files changed, 8 insertions(+), 5 deletions(-)
--
2.40.1
v4:
- Add tags and fix patch 4 commit log
v3:
- Patch 2: Change the condition for calling reset_partition_data() to
(new_prs <= 0).
- Patch 4: Update commit log and code comment to clarify the change.
- Add a new patch 5 to move the empty cpus/mems check to
cpuset1_validate_change().
v2:
- Patch 1: additional comment
- Patch 2: simplify the conditions for triggering call to
compute_excpus().
- Patch 3: update description of cpuset.cpus.exclusive in cgroup-v2.rst
to reflect the new behavior and change the name of the new
cpus_excl_conflict() parameter to xcpus_changed.
- Patch 4: update description of cpuset.cpus.partition in cgroup-v2.rst
to clarify what exclusive CPUs will be used when a partition is
created.
This patch series is inspired by the cpuset patch sent by Sun Shaojie [1].
The idea is to avoid invalidating sibling partitions when there is a
cpuset.cpus conflict. However this patch series does it in a slightly
different way to make its behavior more consistent with other cpuset
properties.
The first 3 patches are just some cleanup and minor bug fixes on
issues found during the investigation process. The last one is
the major patch that changes the way cpuset.cpus is being handled
during the partition creation process. Instead of invalidating sibling
partitions when there is a conflict, it will strip out the conflicting
exclusive CPUs and assign the remaining non-conflicting exclusive
CPUs to the new partition unless there is no more CPU left which will
fail the partition creation process. It is similar to the idea that
cpuset.cpus.effective may only contain a subset of CPUs specified in
cpuset.cpus. So cpuset.cpus.exclusive.effective may contain only a
subset of cpuset.cpus when a partition is created without setting
cpuset.cpus.exclusive.
Even setting cpuset.cpus.exclusive instead of cpuset.cpus may not
guarantee all the requested CPUs can be granted if parent doesn't have
access to some of those exclusive CPUs. The difference is that conflicts
from siblings is not possible with cpuset.cpus.exclusive as long as it
can be set successfully without failure.
[1] https://lore.kernel.org/lkml/20251117015708.977585-1-sunshaojie@kylinos.cn/
Waiman Long (5):
cgroup/cpuset: Streamline rm_siblings_excl_cpus()
cgroup/cpuset: Consistently compute effective_xcpus in
update_cpumasks_hier()
cgroup/cpuset: Don't fail cpuset.cpus change in v2
cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus
conflict
cgroup/cpuset: Move the v1 empty cpus/mems check to
cpuset1_validate_change()
Documentation/admin-guide/cgroup-v2.rst | 40 +++--
kernel/cgroup/cpuset-internal.h | 12 ++
kernel/cgroup/cpuset-v1.c | 33 ++++
kernel/cgroup/cpuset.c | 163 ++++++------------
.../selftests/cgroup/test_cpuset_prs.sh | 29 +++-
5 files changed, 150 insertions(+), 127 deletions(-)
--
2.52.0
v5:
- Fix patch 4 doc change as reported by Michael.
v4:
- Add tags and fix patch 4 commit log
v3:
- Patch 2: Change the condition for calling reset_partition_data() to
(new_prs <= 0).
- Patch 4: Update commit log and code comment to clarify the change.
- Add a new patch 5 to move the empty cpus/mems check to
cpuset1_validate_change().
v2:
- Patch 1: additional comment
- Patch 2: simplify the conditions for triggering call to
compute_excpus().
- Patch 3: update description of cpuset.cpus.exclusive in cgroup-v2.rst
to reflect the new behavior and change the name of the new
cpus_excl_conflict() parameter to xcpus_changed.
- Patch 4: update description of cpuset.cpus.partition in cgroup-v2.rst
to clarify what exclusive CPUs will be used when a partition is
created.
This patch series is inspired by the cpuset patch sent by Sun Shaojie [1].
The idea is to avoid invalidating sibling partitions when there is a
cpuset.cpus conflict. However this patch series does it in a slightly
different way to make its behavior more consistent with other cpuset
properties.
The first 3 patches are just some cleanup and minor bug fixes on
issues found during the investigation process. The last one is
the major patch that changes the way cpuset.cpus is being handled
during the partition creation process. Instead of invalidating sibling
partitions when there is a conflict, it will strip out the conflicting
exclusive CPUs and assign the remaining non-conflicting exclusive
CPUs to the new partition unless there is no more CPU left which will
fail the partition creation process. It is similar to the idea that
cpuset.cpus.effective may only contain a subset of CPUs specified in
cpuset.cpus. So cpuset.cpus.exclusive.effective may contain only a
subset of cpuset.cpus when a partition is created without setting
cpuset.cpus.exclusive.
Even setting cpuset.cpus.exclusive instead of cpuset.cpus may not
guarantee all the requested CPUs can be granted if parent doesn't have
access to some of those exclusive CPUs. The difference is that conflicts
from siblings is not possible with cpuset.cpus.exclusive as long as it
can be set successfully without failure.
[1] https://lore.kernel.org/lkml/20251117015708.977585-1-sunshaojie@kylinos.cn/
Waiman Long (5):
cgroup/cpuset: Streamline rm_siblings_excl_cpus()
cgroup/cpuset: Consistently compute effective_xcpus in
update_cpumasks_hier()
cgroup/cpuset: Don't fail cpuset.cpus change in v2
cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus
conflict
cgroup/cpuset: Move the v1 empty cpus/mems check to
cpuset1_validate_change()
Documentation/admin-guide/cgroup-v2.rst | 41 +++--
kernel/cgroup/cpuset-internal.h | 12 ++
kernel/cgroup/cpuset-v1.c | 33 ++++
kernel/cgroup/cpuset.c | 163 ++++++------------
.../selftests/cgroup/test_cpuset_prs.sh | 29 +++-
5 files changed, 151 insertions(+), 127 deletions(-)
--
2.52.0
This patchset introduces target resume capability to netconsole allowing
it to recover targets when underlying low-level interface comes back
online.
The patchset starts by refactoring netconsole state representation in
order to allow representing deactivated targets (targets that are
disabled due to interfaces unregister).
It then modifies netconsole to handle NETDEV_REGISTER events for such
targets, setups netpoll and forces the device UP. Targets are matched with
incoming interfaces depending on how they were bound in netconsole
(by mac or interface name). For these reasons, we also attempt resuming
on NETDEV_CHANGENAME.
The patchset includes a selftest that validates netconsole target state
transitions and that target is functional after resumed.
Signed-off-by: Andre Carvalho <asantostc(a)gmail.com>
---
Changes in v10:
- Define wrappers around dynamic_netconsole_mutex lock/unlock and use
them on process_resume_target to avoid build failures and #ifdefs
inside callsite (suggested by Breno).
- Refactored other dynamic_netconsole_mutex to use the wrappers for
consistency.
- Ensure we cancel pending working during removal of dynamic targets,
which requires also holding dynamic_netconsole_mutex.
- Introduce standalone workqueue to avoid potential leaks during module
cleanup, flushing all pending resume events before removing all
targets.
- Link to v9: https://lore.kernel.org/r/20260104-netcons-retrigger-v9-0-38aa643d2283@gmai…
Changes in v9:
- Hold dynamic_netconsole_mutex on process_resume_target.
- Cleanup dev_name as part of netconsole_process_cleanups_core to ensure
we correctly resume by mac (for targets bound by mac)
- Link to v8: https://lore.kernel.org/r/20251128-netcons-retrigger-v8-0-0bccbf4c6385@gmai…
Changes in v8:
- Handle NETDEV_REGISTER/CHANGENAME instead of NETDEV_UP (and force the device
UP), to increase the chances of succesfully resuming a target. This
requires using a workqueue instead of inline in the event notifier as
we can't UP the device otherwise.
- Link to v7: https://lore.kernel.org/r/20251126-netcons-retrigger-v7-0-1d86dba83b1c@gmai…
Changes in v7:
- selftest: use ${EXIT_STATUS} instead of ${ksft_pass} to avoid
shellcheck warning
- Link to v6: https://lore.kernel.org/r/20251121-netcons-retrigger-v6-0-9c03f5a2bd6f@gmai…
Changes in v6:
- Rebase on top of net-next to resolve conflicts, no functional changes.
- Link to v5: https://lore.kernel.org/r/20251119-netcons-retrigger-v5-0-2c7dda6055d6@gmai…
Changes in v5:
- patch 3: Set (de)enslaved target as DISABLED instead of DEACTIVATED to prevent
resuming it.
- selftest: Fix test cleanup by moving trap line to outside of loop and remove
unneeded 'local' keyword
- Rename maybe_resume_target to resume_target, add netconsole_ prefix to
process_resumable_targets.
- Hold device reference before calling __netpoll_setup.
- Link to v4: https://lore.kernel.org/r/20251116-netcons-retrigger-v4-0-5290b5f140c2@gmai…
Changes in v4:
- Simplify selftest cleanup, removing trap setup in loop.
- Drop netpoll helper (__setup_netpoll_hold) and manage reference inside
netconsole.
- Move resume_list processing logic to separate function.
- Link to v3: https://lore.kernel.org/r/20251109-netcons-retrigger-v3-0-1654c280bbe6@gmai…
Changes in v3:
- Resume by mac or interface name depending on how target was created.
- Attempt to resume target without holding target list lock, by moving
the target to a temporary list. This is required as netpoll may
attempt to allocate memory.
- Link to v2: https://lore.kernel.org/r/20250921-netcons-retrigger-v2-0-a0e84006237f@gmai…
Changes in v2:
- Attempt to resume target in the same thread, instead of using
workqueue .
- Add wrapper around __netpoll_setup (patch 4).
- Renamed resume_target to maybe_resume_target and moved conditionals to
inside its implementation, keeping code more clear.
- Verify that device addr matches target mac address when target was
setup using mac.
- Update selftest to cover targets bound by mac and interface name.
- Fix typo in selftest comment and sort tests alphabetically in
Makefile.
- Link to v1:
https://lore.kernel.org/r/20250909-netcons-retrigger-v1-0-3aea904926cf@gmai…
---
Andre Carvalho (5):
netconsole: convert 'enabled' flag to enum for clearer state management
netconsole: clear dev_name for devices bound by mac
netconsole: introduce helpers for dynamic_netconsole_mutex lock/unlock
netconsole: resume previously deactivated target
selftests: netconsole: validate target resume
Breno Leitao (2):
netconsole: add target_state enum
netconsole: add STATE_DEACTIVATED to track targets disabled by low level
drivers/net/netconsole.c | 305 ++++++++++++++++-----
tools/testing/selftests/drivers/net/Makefile | 1 +
.../selftests/drivers/net/lib/sh/lib_netcons.sh | 35 ++-
.../selftests/drivers/net/netcons_resume.sh | 97 +++++++
4 files changed, 364 insertions(+), 74 deletions(-)
---
base-commit: 60d8484c4cec811f5ceb6550655df74490d1a165
change-id: 20250816-netcons-retrigger-a4f547bfc867
Best regards,
--
Andre Carvalho <asantostc(a)gmail.com>
This patch series builds upon the discussion in
"[PATCH bpf-next v4 0/4] bpf: Improve error reporting for freplace attachment failure" [1].
This patch series introduces support for *common attributes* in the BPF
syscall, providing a unified mechanism for passing shared metadata across
all BPF commands.
The initial set of common attributes includes:
1. 'log_buf': User-provided buffer for storing log output.
2. 'log_size': Size of the provided log buffer.
3. 'log_level': Verbosity level for logging.
4. 'log_true_size': The size of log reported by kernel.
With this extension, the BPF syscall will be able to return meaningful
error messages (e.g., failures of creating map), improving debuggability
and user experience.
Links:
[1] https://lore.kernel.org/bpf/20250224153352.64689-1-leon.hwang@linux.dev/
Changes:
v4 -> v5:
* Rework reporting 'log_true_size' for prog_load, btf_load, and map_create
(per Alexei).
RFC v3 -> v4:
* Drop RFC.
* Address comments from Andrii:
* Add parentheses in 'sys_bpf_ext()'.
* Avoid creating new fd in 'probe_sys_bpf_ext()'.
* Add a new struct to wrap log fields in libbpf.
* Address comments from Alexei:
* Do not skip writing to user space when log_true_size is zero.
* Do not use 'bool' arguments.
* Drop the adding WARN_ON_ONCE()'s.
RFC v2 -> RFC v3:
* Rename probe_sys_bpf_extended to probe_sys_bpf_ext.
* Refactor reporting 'log_true_size' for prog_load.
* Refactor reporting 'btf_log_true_size' for btf_load.
* Add warnings for internal bugs in map_create.
* Check log_true_size in test cases.
* Address comment from Alexei:
* Change kvzalloc/kvfree to kzalloc/kfree.
* Address comments from Andrii:
* Move BPF_COMMON_ATTRS to 'enum bpf_cmd' alongside brief comment.
* Add bpf_check_uarg_tail_zero() for extra checks.
* Rename sys_bpf_extended to sys_bpf_ext.
* Rename sys_bpf_fd_extended to sys_bpf_ext_fd.
* Probe the new feature using NULL and -EFAULT.
* Move probe_sys_bpf_ext to libbpf_internal.h and drop LIBBPF_API.
* Return -EUSERS when log attrs are conflict between bpf_attr and
bpf_common_attr.
* Avoid touching bpf_vlog_init().
* Update the reason messages in map_create.
* Finalize the log using __cleanup().
* Report log size to users.
* Change type of log_buf from '__u64' to 'const char *' and cast type
using ptr_to_u64() in bpf_map_create().
* Do not return -EOPNOTSUPP when kernel doesn't support this feature
in bpf_map_create().
* Add log_level support for map creation for consistency.
* Address comment from Eduard:
* Use common_attrs->log_level instead of BPF_LOG_FIXED.
RFC v1 -> RFC v2:
* Fix build error reported by test bot.
* Address comments from Alexei:
* Drop new uapi for freplace.
* Add common attributes support for prog_load and btf_load.
* Add common attributes support for map_create.
Leon Hwang (9):
bpf: Extend BPF syscall with common attributes support
libbpf: Add support for extended bpf syscall
bpf: Refactor reporting log_true_size for prog_load
bpf: Add syscall common attributes support for prog_load
bpf: Refactor reporting btf_log_true_size for btf_load
bpf: Add syscall common attributes support for btf_load
bpf: Add syscall common attributes support for map_create
libbpf: Add common attr support for map_create
selftests/bpf: Add tests to verify map create failure log
include/linux/bpf.h | 19 +-
include/linux/bpf_verifier.h | 17 ++
include/linux/btf.h | 3 +-
include/linux/syscalls.h | 3 +-
include/uapi/linux/bpf.h | 8 +
kernel/bpf/btf.c | 32 +---
kernel/bpf/log.c | 103 +++++++++++
kernel/bpf/syscall.c | 122 ++++++++++---
kernel/bpf/verifier.c | 19 +-
tools/include/uapi/linux/bpf.h | 8 +
tools/lib/bpf/bpf.c | 49 ++++-
tools/lib/bpf/bpf.h | 17 +-
tools/lib/bpf/features.c | 8 +
tools/lib/bpf/libbpf_internal.h | 3 +
.../selftests/bpf/prog_tests/map_init.c | 168 ++++++++++++++++++
15 files changed, 518 insertions(+), 61 deletions(-)
--
2.52.0
This is a simple ipvtap test to test handling
IP-address add/remove on ipvlan interface.
It creates a veth-interface and then creates several
network-namespace with ipvlan0 interface in it linked to veth.
Then it starts to add/remove addresses on ipvlan0 interfaces
in several threads.
At finish, it checks that there is no duplicated addresses.
Signed-off-by: Dmitry Skorodumov <skorodumov.dmitry(a)huawei.com>
---
v5:
- Execution time of ip-set thread is limited with 1 minute.
- Don't use KSFT_MACHINE_SLOW.
- Fixed position of "CONFIG_TAP=m" in config
v4:
- Removed unneeded modprobe
- Number of threads is 8, if KSFT_MACHINE_SLOW==yes.
It is needed, since on debug-build test may take more than 15 minutes.
- Now veth is created in own namespace
- Added comment about why test adds/removes random ip
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/config | 2 +
tools/testing/selftests/net/ipvtap_test.sh | 168 +++++++++++++++++++++
3 files changed, 171 insertions(+)
create mode 100755 tools/testing/selftests/net/ipvtap_test.sh
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index b66ba04f19d9..45c4ea381bc3 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -48,6 +48,7 @@ TEST_PROGS := \
ipv6_flowlabel.sh \
ipv6_force_forwarding.sh \
ipv6_route_update_soft_lockup.sh \
+ ipvtap_test.sh \
l2_tos_ttl_inherit.sh \
l2tp.sh \
link_netns.py \
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 1e1f253118f5..b84362b9b508 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -48,6 +48,7 @@ CONFIG_IPV6_SEG6_LWTUNNEL=y
CONFIG_IPV6_SIT=y
CONFIG_IPV6_VTI=y
CONFIG_IPVLAN=m
+CONFIG_IPVTAP=m
CONFIG_KALLSYMS=y
CONFIG_L2TP=m
CONFIG_L2TP_ETH=m
@@ -116,6 +117,7 @@ CONFIG_PROC_SYSCTL=y
CONFIG_PSAMPLE=m
CONFIG_RPS=y
CONFIG_SYSFS=y
+CONFIG_TAP=m
CONFIG_TCP_MD5SIG=y
CONFIG_TEST_BLACKHOLE_DEV=m
CONFIG_TEST_BPF=m
diff --git a/tools/testing/selftests/net/ipvtap_test.sh b/tools/testing/selftests/net/ipvtap_test.sh
new file mode 100755
index 000000000000..354ca7ce8584
--- /dev/null
+++ b/tools/testing/selftests/net/ipvtap_test.sh
@@ -0,0 +1,168 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Simple tests for ipvtap
+
+
+#
+# The testing environment looks this way:
+#
+# |------HNS-------| |------PHY-------|
+# | veth<----------------->veth |
+# |------|--|------| |----------------|
+# | |
+# | | |-----TST0-------|
+# | |------------|----ipvlan |
+# | |----------------|
+# |
+# | |-----TST1-------|
+# |---------------|----ipvlan |
+# |----------------|
+#
+
+ALL_TESTS="
+ test_ip_set
+"
+
+source lib.sh
+
+DEBUG=0
+
+VETH_HOST=vethtst.h
+VETH_PHY=vethtst.p
+
+NS_COUNT=32
+IP_ITERATIONS=1024
+IPSET_TIMEOUT="60s"
+
+ns_run() {
+ ns=$1
+ shift
+ if [[ "$ns" == "global" ]]; then
+ "$@" >/dev/null
+ else
+ ip netns exec "$ns" "$@" >/dev/null
+ fi
+}
+
+test_ip_setup_env() {
+ setup_ns NS_PHY
+ setup_ns HST_NS
+
+ # setup simulated other-host (phy) and host itself
+ ns_run "$HST_NS" ip link add $VETH_HOST type veth peer name $VETH_PHY \
+ netns "$NS_PHY" >/dev/null
+ ns_run "$HST_NS" ip link set $VETH_HOST up
+ ns_run "$NS_PHY" ip link set $VETH_PHY up
+
+ for ((i=0; i<NS_COUNT; i++)); do
+ setup_ns ipvlan_ns_$i
+ ns="ipvlan_ns_$i"
+ if [ "$DEBUG" = "1" ]; then
+ echo "created NS ${!ns}"
+ fi
+ if ! ns_run "$HST_NS" ip link add netns ${!ns} ipvlan0 \
+ link $VETH_HOST \
+ type ipvtap mode l2 bridge; then
+ exit_error "FAIL: Failed to configure ipvlan link."
+ fi
+ done
+}
+
+test_ip_cleanup_env() {
+ ns_run "$HST_NS" ip link del $VETH_HOST
+ cleanup_all_ns
+}
+
+exit_error() {
+ echo "$1"
+ exit $ksft_fail
+}
+
+rnd() {
+ echo $(( RANDOM % 32 + 16 ))
+}
+
+test_ip_set_thread() {
+ # Here we are trying to create some IP conflicts between namespaces.
+ # If just add/remove IP, nothing interesting will happen.
+ # But if add random IP and then remove random IP,
+ # eventually conflicts start to apear.
+ ip link set ipvlan0 up
+ for ((i=0; i<IP_ITERATIONS; i++)); do
+ v=$(rnd)
+ ip a a "172.25.0.$v/24" dev ipvlan0 2>/dev/null
+ ip a a "fc00::$v/64" dev ipvlan0 2>/dev/null
+ v=$(rnd)
+ ip a d "172.25.0.$v/24" dev ipvlan0 2>/dev/null
+ ip a d "fc00::$v/64" dev ipvlan0 2>/dev/null
+ done
+}
+
+test_ip_set() {
+ RET=0
+
+ trap test_ip_cleanup_env EXIT
+
+ test_ip_setup_env
+
+ declare -A ns_pids
+ for ((i=0; i<NS_COUNT; i++)); do
+ ns="ipvlan_ns_$i"
+ ns_run ${!ns} timeout "$IPSET_TIMEOUT" \
+ bash -c "$0 test_ip_set_thread"&
+ ns_pids[$i]=$!
+ done
+
+ for ((i=0; i<NS_COUNT; i++)); do
+ wait "${ns_pids[$i]}"
+ done
+
+ declare -A all_ips
+ for ((i=0; i<NS_COUNT; i++)); do
+ ns="ipvlan_ns_$i"
+ ip_output=$(ip netns exec ${!ns} ip a l dev ipvlan0 | grep inet)
+ while IFS= read -r nsip_out; do
+ if [[ -z $nsip_out ]]; then
+ continue;
+ fi
+ nsip=$(awk '{print $2}' <<< "$nsip_out")
+ if [[ -v all_ips[$nsip] ]]; then
+ RET=$ksft_fail
+ log_test "conflict for $nsip"
+ return "$RET"
+ else
+ all_ips[$nsip]=$i
+ fi
+ done <<< "$ip_output"
+ done
+
+ if [ "$DEBUG" = "1" ]; then
+ for key in "${!all_ips[@]}"; do
+ echo "$key: ${all_ips[$key]}"
+ done
+ fi
+
+ trap - EXIT
+ test_ip_cleanup_env
+
+ log_test "test multithreaded ip set"
+}
+
+if [[ "$1" == "-d" ]]; then
+ DEBUG=1
+ shift
+fi
+
+if [[ "$1" == "-t" ]]; then
+ shift
+ TESTS="$*"
+fi
+
+if [[ "$1" == "test_ip_set_thread" ]]; then
+ test_ip_set_thread
+else
+ require_command ip
+
+ tests_run
+fi
--
2.43.0
I've removed the RFC tag from this version of the series, but the items
that I'm looking for feedback on remains the same:
- The userspace ABI, in particular:
- The vector length used for the SVE registers, access to the SVE
registers and access to ZA and (if available) ZT0 depending on
the current state of PSTATE.{SM,ZA}.
- The use of a single finalisation for both SVE and SME.
- The addition of control for enabling fine grained traps in a similar
manner to FGU but without the UNDEF, I'm not clear if this is desired
at all and at present this requires symmetric read and write traps like
FGU. That seemed like it might be desired from an implementation
point of view but we already have one case where we enable an
asymmetric trap (for ARM64_WORKAROUND_AMPERE_AC03_CPU_38) and it
seems generally useful to enable asymmetrically.
This series implements support for SME use in non-protected KVM guests.
Much of this is very similar to SVE, the main additional challenge that
SME presents is that it introduces a new vector length similar to the
SVE vector length and two new controls which change the registers seen
by guests:
- PSTATE.ZA enables the ZA matrix register and, if SME2 is supported,
the ZT0 LUT register.
- PSTATE.SM enables streaming mode, a new floating point mode which
uses the SVE register set with the separately configured SME vector
length. In streaming mode implementation of the FFR register is
optional.
It is also permitted to build systems which support SME without SVE, in
this case when not in streaming mode no SVE registers or instructions
are available. Further, there is no requirement that there be any
overlap in the set of vector lengths supported by SVE and SME in a
system, this is expected to be a common situation in practical systems.
Since there is a new vector length to configure we introduce a new
feature parallel to the existing SVE one with a new pseudo register for
the streaming mode vector length. Due to the overlap with SVE caused by
streaming mode rather than finalising SME as a separate feature we use
the existing SVE finalisation to also finalise SME, a new define
KVM_ARM_VCPU_VEC is provided to help make user code clearer. Finalising
SVE and SME separately would introduce complication with register access
since finalising SVE makes the SVE registers writeable by userspace and
doing multiple finalisations results in an error being reported.
Dealing with a state where the SVE registers are writeable due to one of
SVE or SME being finalised but may have their VL changed by the other
being finalised seems like needless complexity with minimal practical
utility, it seems clearer to just express directly that only one
finalisation can be done in the ABI.
Access to the floating point registers follows the architecture:
- When both SVE and SME are present:
- If PSTATE.SM == 0 the vector length used for the Z and P registers
is the SVE vector length.
- If PSTATE.SM == 1 the vector length used for the Z and P registers
is the SME vector length.
- If only SME is present:
- If PSTATE.SM == 0 the Z and P registers are inaccessible and the
floating point state accessed via the encodings for the V registers.
- If PSTATE.SM == 1 the vector length used for the Z and P registers
- The SME specific ZA and ZT0 registers are only accessible if SVCR.ZA is 1.
The VMM must understand this, in particular when loading state SVCR
should be configured before other state. It should be noted that while
the architecture refers to PSTATE.SM and PSTATE.ZA these PSTATE bits are
not preserved in SPSR_ELx, they are only accessible via SVCR.
There are a large number of subfeatures for SME, most of which only
offer additional instructions but some of which (SME2 and FA64) add
architectural state. These are configured via the ID registers as per
usual.
Protected KVM supported, with the implementation maintaining the
existing restriction that the hypervisor will refuse to run if streaming
mode or ZA is enabled. This both simplfies the code and avoids the need
to allocate storage for host ZA and ZT0 state, there seems to be little
practical use case for supporting this and the memory usage would be
non-trivial.
The new KVM_ARM_VCPU_VEC feature and ZA and ZT0 registers have not been
added to the get-reg-list selftest, the idea of supporting additional
features there without restructuring the program to generate all
possible feature combinations has been rejected. I will post a separate
series which does that restructuring.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v9:
- Rebase onto v6.19-rc1.
- ABI document clarifications.
- Add changes dropping asserts on single bit wide bitfields in set_id_regs.
- Link to v8: https://lore.kernel.org/r/20250902-kvm-arm64-sme-v8-0-2cb2199c656c@kernel.o…
Changes in v8:
- Small fixes in ABI documentation.
- Link to v7: https://lore.kernel.org/r/20250822-kvm-arm64-sme-v7-0-7a65d82b8b10@kernel.o…
Changes in v7:
- Rebase onto v6.17-rc1.
- Handle SMIDR_EL1 as a VM wide ID register and use this in feat_sme_smps().
- Expose affinity fields in SMIDR_EL1.
- Remove SMPRI_EL1 from vcpu_sysreg, the value is always 0 currently.
- Prevent userspace writes to SMPRIMAP_EL2.
- Link to v6: https://lore.kernel.org/r/20250625-kvm-arm64-sme-v6-0-114cff4ffe04@kernel.o…
Changes in v6:
- Rebase onto v6.16-rc3.
- Link to v5: https://lore.kernel.org/r/20250417-kvm-arm64-sme-v5-0-f469a2d5f574@kernel.o…
Changes in v5:
- Rebase onto v6.15-rc2.
- Add pKVM guest support.
- Always restore SVCR.
- Link to v4: https://lore.kernel.org/r/20250214-kvm-arm64-sme-v4-0-d64a681adcc2@kernel.o…
Changes in v4:
- Rebase onto v6.14-rc2 and Mark Rutland's fixes.
- Expose SME to nested guests.
- Additional cleanups and test fixes following on from the rebase.
- Flush register state on VMM PSTATE.{SM,ZA}.
- Link to v3: https://lore.kernel.org/r/20241220-kvm-arm64-sme-v3-0-05b018c1ffeb@kernel.o…
Changes in v3:
- Rebase onto v6.12-rc2.
- Link to v2: https://lore.kernel.org/r/20231222-kvm-arm64-sme-v2-0-da226cb180bb@kernel.o…
Changes in v2:
- Rebase onto v6.7-rc3.
- Configure subfeatures based on host system only.
- Complete nVHE support.
- There was some snafu with sending v1 out, it didn't make it to the
lists but in case it hit people's inboxes I'm sending as v2.
---
Mark Brown (30):
arm64/sysreg: Update SMIDR_EL1 to DDI0601 2025-06
arm64/fpsimd: Update FA64 and ZT0 enables when loading SME state
arm64/fpsimd: Decide to save ZT0 and streaming mode FFR at bind time
arm64/fpsimd: Check enable bit for FA64 when saving EFI state
arm64/fpsimd: Determine maximum virtualisable SME vector length
KVM: arm64: Pay attention to FFR parameter in SVE save and load
KVM: arm64: Pull ctxt_has_ helpers to start of sysreg-sr.h
KVM: arm64: Move SVE state access macros after feature test macros
KVM: arm64: Rename SVE finalization constants to be more general
KVM: arm64: Document the KVM ABI for SME
KVM: arm64: Define internal features for SME
KVM: arm64: Rename sve_state_reg_region
KVM: arm64: Store vector lengths in an array
KVM: arm64: Implement SME vector length configuration
KVM: arm64: Support SME control registers
KVM: arm64: Support TPIDR2_EL0
KVM: arm64: Support SME identification registers for guests
KVM: arm64: Support SME priority registers
KVM: arm64: Provide assembly for SME register access
KVM: arm64: Support userspace access to streaming mode Z and P registers
KVM: arm64: Flush register state on writes to SVCR.SM and SVCR.ZA
KVM: arm64: Expose SME specific state to userspace
KVM: arm64: Context switch SME state for guests
KVM: arm64: Handle SME exceptions
KVM: arm64: Expose SME to nested guests
KVM: arm64: Provide interface for configuring and enabling SME for guests
KVM: arm64: selftests: Remove spurious check for single bit safe values
KVM: arm64: selftests: Skip impossible invalid value tests
KVM: arm64: selftests: Add SME system registers to get-reg-list
KVM: arm64: selftests: Add SME to set_id_regs test
Documentation/virt/kvm/api.rst | 120 ++++++++---
arch/arm64/include/asm/fpsimd.h | 26 +++
arch/arm64/include/asm/kvm_emulate.h | 6 +
arch/arm64/include/asm/kvm_host.h | 163 ++++++++++++---
arch/arm64/include/asm/kvm_hyp.h | 5 +-
arch/arm64/include/asm/kvm_pkvm.h | 2 +-
arch/arm64/include/asm/vncr_mapping.h | 2 +
arch/arm64/include/uapi/asm/kvm.h | 33 +++
arch/arm64/kernel/cpufeature.c | 2 -
arch/arm64/kernel/fpsimd.c | 89 ++++----
arch/arm64/kvm/arm.c | 10 +
arch/arm64/kvm/config.c | 11 +-
arch/arm64/kvm/fpsimd.c | 28 ++-
arch/arm64/kvm/guest.c | 252 ++++++++++++++++++++---
arch/arm64/kvm/handle_exit.c | 14 ++
arch/arm64/kvm/hyp/fpsimd.S | 28 ++-
arch/arm64/kvm/hyp/include/hyp/switch.h | 168 +++++++++++++--
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 110 ++++++----
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 86 ++++++--
arch/arm64/kvm/hyp/nvhe/pkvm.c | 85 ++++++--
arch/arm64/kvm/hyp/nvhe/switch.c | 4 +-
arch/arm64/kvm/hyp/nvhe/sys_regs.c | 6 +
arch/arm64/kvm/hyp/vhe/switch.c | 17 +-
arch/arm64/kvm/hyp/vhe/sysreg-sr.c | 7 +
arch/arm64/kvm/nested.c | 3 +-
arch/arm64/kvm/reset.c | 156 ++++++++++----
arch/arm64/kvm/sys_regs.c | 140 ++++++++++++-
arch/arm64/tools/sysreg | 8 +-
include/uapi/linux/kvm.h | 1 +
tools/testing/selftests/kvm/arm64/get-reg-list.c | 15 +-
tools/testing/selftests/kvm/arm64/set_id_regs.c | 84 ++++++--
31 files changed, 1367 insertions(+), 314 deletions(-)
---
base-commit: 3e7f562e20ee87a25e104ef4fce557d39d62fa85
change-id: 20230301-kvm-arm64-sme-06a1246d3636
Best regards,
--
Mark Brown <broonie(a)kernel.org>