This series reduces the CPU cost of RX token management by adding a socket option that configures the socket to bypass the xarray allocator and instead use a niov array together with a uref field in niov.
The improvement is ~13% CPU utilization per RX user thread.
Using kperf, the following results were observed:
Before: Average RX worker idle %: 13.13, flows 4, test runs 11
After:  Average RX worker idle %: 26.32, flows 4, test runs 11
Two other approaches were tested but showed no improvement: 1) using a hashmap for tokens, and 2) keeping an xarray of atomic counters but using RCU so that the hot path could be mostly lockless. Neither proved better than the simple array in terms of CPU.
The sockopt SO_DEVMEM_AUTORELEASE is added to toggle the optimization. It defaults to 0 (i.e., optimization on).
Note that prior revs reported only a ~5% gain. That lower figure was measured with CPU frequency boosting unknowingly disabled. With boosting enabled, a consistent ~13% improvement is measured for both kperf and nccl workloads.
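For illustration, here is a minimal user-space sketch of the knob (not taken from the selftests; devmem_set_autorelease()/devmem_get_autorelease() are hypothetical helper names, and the fallback define mirrors the asm-generic value added by this series, which may differ on architectures with their own socket.h):

#include <sys/socket.h>

#ifndef SO_DEVMEM_AUTORELEASE
#define SO_DEVMEM_AUTORELEASE 85	/* asm-generic value from this series */
#endif

/* Opt back into the old behavior: the kernel releases outstanding tokens
 * when the socket closes.
 */
static int devmem_set_autorelease(int fd, int enable)
{
	return setsockopt(fd, SOL_SOCKET, SO_DEVMEM_AUTORELEASE,
			  &enable, sizeof(enable));
}

/* Read the current mode; with this series the default is 0. */
static int devmem_get_autorelease(int fd)
{
	int val;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_DEVMEM_AUTORELEASE, &val, &len) < 0)
		return -1;
	return val;
}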
To: David S. Miller davem@davemloft.net
To: Eric Dumazet edumazet@google.com
To: Jakub Kicinski kuba@kernel.org
To: Paolo Abeni pabeni@redhat.com
To: Simon Horman horms@kernel.org
To: Kuniyuki Iwashima kuniyu@google.com
To: Willem de Bruijn willemb@google.com
To: Neal Cardwell ncardwell@google.com
To: David Ahern dsahern@kernel.org
To: Mina Almasry almasrymina@google.com
To: Arnd Bergmann arnd@arndb.de
To: Jonathan Corbet corbet@lwn.net
To: Andrew Lunn andrew+netdev@lunn.ch
To: Shuah Khan shuah@kernel.org
Cc: Stanislav Fomichev sdf@fomichev.me
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Bobby Eshleman bobbyeshleman@meta.com
Changes in v6:
- renamed 'net: devmem: use niov array for token management' to refer to optionality of new config
- added documentation and tests
- make autorelease flag per-socket sockopt instead of binding field / sysctl
- many per-patch changes (see Changes sections per-patch)
- Link to v5: https://lore.kernel.org/r/20251023-scratch-bobbyeshleman-devmem-tcp-token-up...

Changes in v5:
- add sysctl to opt-out of performance benefit, back to old token release
- Link to v4: https://lore.kernel.org/all/20250926-scratch-bobbyeshleman-devmem-tcp-token-...

Changes in v4:
- rebase to net-next
- Link to v3: https://lore.kernel.org/r/20250926-scratch-bobbyeshleman-devmem-tcp-token-up...

Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory footprint
- fallback to cleaning up references in dmabuf unbind if socket leaked tokens
- drop ethtool patch
- Link to v2: https://lore.kernel.org/r/20250911-scratch-bobbyeshleman-devmem-tcp-token-up...

Changes in v2:
- net: ethtool: prevent user from breaking devmem single-binding rule (Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- remove WARNs on invalid user input (Mina)
- remove extraneous binding ref get (Mina)
- remove WARN for changed binding (Mina)
- always use GFP_ZERO for binding->vec (Mina)
- fix length of alloc for urefs
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- Link to v1: https://lore.kernel.org/r/20250902-scratch-bobbyeshleman-devmem-tcp-token-up...
---
Bobby Eshleman (6):
      net: devmem: rename tx_vec to vec in dmabuf binding
      net: devmem: refactor sock_devmem_dontneed for autorelease split
      net: devmem: prepare for autorelease rx token management
      net: devmem: add SO_DEVMEM_AUTORELEASE for autorelease control
      net: devmem: document SO_DEVMEM_AUTORELEASE socket option
      net: devmem: add tests for SO_DEVMEM_AUTORELEASE socket option
 Documentation/networking/devmem.rst               |  70 +++++++++-
 include/net/netmem.h                               |   1 +
 include/net/sock.h                                 |  13 +-
 include/uapi/asm-generic/socket.h                  |   2 +
 net/core/devmem.c                                  |  54 +++++---
 net/core/devmem.h                                  |   4 +-
 net/core/sock.c                                    | 152 ++++++++++++++++++----
 net/ipv4/tcp.c                                     |  69 ++++++++--
 net/ipv4/tcp_ipv4.c                                |  11 +-
 net/ipv4/tcp_minisocks.c                           |   5 +-
 tools/include/uapi/asm-generic/socket.h            |   2 +
 tools/testing/selftests/drivers/net/hw/devmem.py   | 115 +++++++++++++++-
 tools/testing/selftests/drivers/net/hw/ncdevmem.c  |  20 ++-
 13 files changed, 453 insertions(+), 65 deletions(-)
---
base-commit: 255d75ef029f33f75fcf5015052b7302486f7ad2
change-id: 20250829-scratch-bobbyeshleman-devmem-tcp-token-upstream-292be174d503
Best regards,
From: Bobby Eshleman bobbyeshleman@meta.com
Rename the 'tx_vec' field in struct net_devmem_dmabuf_binding to 'vec'. This field holds pointers to net_iov structures. The rename prepares for reusing 'vec' for both TX and RX directions.
No functional change intended.
Reviewed-by: Mina Almasry almasrymina@google.com
Signed-off-by: Bobby Eshleman bobbyeshleman@meta.com
---
 net/core/devmem.c | 22 +++++++++++-----------
 net/core/devmem.h |  2 +-
 2 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/net/core/devmem.c b/net/core/devmem.c index 1d04754bc756..4dee2666dd07 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -75,7 +75,7 @@ void __net_devmem_dmabuf_binding_free(struct work_struct *wq) dma_buf_detach(binding->dmabuf, binding->attachment); dma_buf_put(binding->dmabuf); xa_destroy(&binding->bound_rxqs); - kvfree(binding->tx_vec); + kvfree(binding->vec); kfree(binding); }
@@ -232,10 +232,10 @@ net_devmem_bind_dmabuf(struct net_device *dev, }
if (direction == DMA_TO_DEVICE) { - binding->tx_vec = kvmalloc_array(dmabuf->size / PAGE_SIZE, - sizeof(struct net_iov *), - GFP_KERNEL); - if (!binding->tx_vec) { + binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE, + sizeof(struct net_iov *), + GFP_KERNEL); + if (!binding->vec) { err = -ENOMEM; goto err_unmap; } @@ -249,7 +249,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, dev_to_node(&dev->dev)); if (!binding->chunk_pool) { err = -ENOMEM; - goto err_tx_vec; + goto err_vec; }
virtual = 0; @@ -295,7 +295,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov), net_devmem_get_dma_addr(niov)); if (direction == DMA_TO_DEVICE) - binding->tx_vec[owner->area.base_virtual / PAGE_SIZE + i] = niov; + binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov; }
virtual += len; @@ -315,8 +315,8 @@ net_devmem_bind_dmabuf(struct net_device *dev, gen_pool_for_each_chunk(binding->chunk_pool, net_devmem_dmabuf_free_chunk_owner, NULL); gen_pool_destroy(binding->chunk_pool); -err_tx_vec: - kvfree(binding->tx_vec); +err_vec: + kvfree(binding->vec); err_unmap: dma_buf_unmap_attachment_unlocked(binding->attachment, binding->sgt, direction); @@ -363,7 +363,7 @@ struct net_devmem_dmabuf_binding *net_devmem_get_binding(struct sock *sk, int err = 0;
binding = net_devmem_lookup_dmabuf(dmabuf_id); - if (!binding || !binding->tx_vec) { + if (!binding || !binding->vec) { err = -EINVAL; goto out_err; } @@ -414,7 +414,7 @@ net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, *off = virt_addr % PAGE_SIZE; *size = PAGE_SIZE - *off;
- return binding->tx_vec[virt_addr / PAGE_SIZE]; + return binding->vec[virt_addr / PAGE_SIZE]; }
/*** "Dmabuf devmem memory provider" ***/ diff --git a/net/core/devmem.h b/net/core/devmem.h index 101150d761af..2ada54fb63d7 100644 --- a/net/core/devmem.h +++ b/net/core/devmem.h @@ -63,7 +63,7 @@ struct net_devmem_dmabuf_binding { * address. This array is convenient to map the virtual addresses to * net_iovs in the TX path. */ - struct net_iov **tx_vec; + struct net_iov **vec;
struct work_struct unbind_w; };
From: Bobby Eshleman bobbyeshleman@meta.com
Refactor sock_devmem_dontneed() in preparation for supporting both autorelease and manual token release modes.
Split the function into two parts:
- sock_devmem_dontneed(): handles input validation, token allocation, and copying from userspace
- sock_devmem_dontneed_autorelease(): performs the actual token release via xarray lookup and page pool put
This separation allows a future commit to add a parallel sock_devmem_dontneed_manual_release() function that uses a different token tracking mechanism (per-niov reference counting) without duplicating the input validation logic.
The refactoring is purely mechanical with no functional change; it is only intended to minimize noise in the subsequent patches.
Reviewed-by: Mina Almasry almasrymina@google.com
Signed-off-by: Bobby Eshleman bobbyeshleman@meta.com
---
 net/core/sock.c | 52 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 20 deletions(-)
diff --git a/net/core/sock.c b/net/core/sock.c index 7a9bbc2afcf0..5562f517d889 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1082,30 +1082,13 @@ static int sock_reserve_memory(struct sock *sk, int bytes) #define MAX_DONTNEED_FRAGS 1024
static noinline_for_stack int -sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) +sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens, + unsigned int num_tokens) { - unsigned int num_tokens, i, j, k, netmem_num = 0; - struct dmabuf_token *tokens; + unsigned int i, j, k, netmem_num = 0; int ret = 0, num_frags = 0; netmem_ref netmems[16];
- if (!sk_is_tcp(sk)) - return -EBADF; - - if (optlen % sizeof(*tokens) || - optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS) - return -EINVAL; - - num_tokens = optlen / sizeof(*tokens); - tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL); - if (!tokens) - return -ENOMEM; - - if (copy_from_sockptr(tokens, optval, optlen)) { - kvfree(tokens); - return -EFAULT; - } - xa_lock_bh(&sk->sk_user_frags); for (i = 0; i < num_tokens; i++) { for (j = 0; j < tokens[i].token_count; j++) { @@ -1135,6 +1118,35 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) for (k = 0; k < netmem_num; k++) WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
+ return ret; +} + +static noinline_for_stack int +sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) +{ + struct dmabuf_token *tokens; + unsigned int num_tokens; + int ret; + + if (!sk_is_tcp(sk)) + return -EBADF; + + if (optlen % sizeof(*tokens) || + optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS) + return -EINVAL; + + num_tokens = optlen / sizeof(*tokens); + tokens = kvmalloc_array(num_tokens, sizeof(*tokens), GFP_KERNEL); + if (!tokens) + return -ENOMEM; + + if (copy_from_sockptr(tokens, optval, optlen)) { + kvfree(tokens); + return -EFAULT; + } + + ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens); + kvfree(tokens); return ret; }
From: Bobby Eshleman bobbyeshleman@meta.com
Add an alternative token management implementation (autorelease vs. non-autorelease) that replaces xarray-based token lookups with direct array access, using page offsets as dmabuf tokens. When enabled, this eliminates the xarray overhead and reduces CPU utilization in devmem RX threads by approximately 13%.
This patch changes the meaning of tokens when this option is used. Tokens previously referred to unique fragments of pages. With this option, tokens instead represent references to pages, not fragments. Because of this, multiple tokens may refer to the same page and therefore carry identical values (e.g., two small fragments may coexist on the same page). The token and offset pair that the user receives still uniquely identifies a fragment if needed. This assumes the user is not attempting to sort or uniq the token list by token value alone.
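As a rough user-space sketch of what this means for a receiver (illustrative only, not taken from ncdevmem.c; it assumes SO_DEVMEM_DMABUF, SO_DEVMEM_DONTNEED, MSG_SOCK_DEVMEM and struct dmabuf_cmsg / struct dmabuf_token are visible from the installed uapi headers):

#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/uio.h>		/* struct dmabuf_cmsg, struct dmabuf_token */

static int recv_and_release_one(int fd)
{
	char linear[4096];
	char ctrl[CMSG_SPACE(sizeof(struct dmabuf_cmsg)) * 64];
	struct iovec iov = { .iov_base = linear, .iov_len = sizeof(linear) };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl,
		.msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cm;

	if (recvmsg(fd, &msg, MSG_SOCK_DEVMEM) < 0)
		return -1;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct dmabuf_cmsg *dc;
		struct dmabuf_token tok;

		if (cm->cmsg_level != SOL_SOCKET ||
		    cm->cmsg_type != SO_DEVMEM_DMABUF)
			continue;

		dc = (struct dmabuf_cmsg *)CMSG_DATA(cm);

		/* ... consume dc->frag_size bytes at dc->frag_offset ... */

		/* With autorelease off, frag_token is the niov's page index
		 * in the dmabuf, so it may repeat across fragments; return
		 * one token per fragment received and never dedupe by token
		 * value alone.  Real code would batch these calls.
		 */
		tok.token_start = dc->frag_token;
		tok.token_count = 1;
		setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED,
			   &tok, sizeof(tok));
	}

	return 0;
}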
This introduces a restriction: devmem RX sockets cannot switch dmabuf bindings when autorelease is off. This is necessary because 32-bit tokens lack sufficient bits to encode both large dmabuf page counts and binding/queue IDs. For example, a system with 8 NICs and 32 queues needs 8 bits for binding IDs, leaving only 24 bits for pages (64GB max). This restriction aligns with common usage, as steering flows to different queues/devices is often undesirable for TCP.
This patch adds an atomic uref counter to net_iov for tracking user references via binding->vec. The pp_ref_count is only updated on uref transitions from zero to one or from one to zero, to minimize atomic overhead. If a user closes the socket before returning all of its tokens, the binding finishes the uref release when it is unbound.
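Restated as a simplified sketch (the helper names are hypothetical; the actual changes below open-code this in the RX and DONTNEED paths):

/* Only the 0 -> 1 and 1 -> 0 uref transitions touch pp_ref_count. */
static void niov_uref_get(struct net_iov *niov)
{
	if (atomic_inc_return(&niov->uref) == 1)
		atomic_long_inc(&niov->pp_ref_count);	/* first user reference */
}

static void niov_uref_put(struct net_iov *niov)
{
	if (atomic_dec_and_test(&niov->uref))		/* last user reference */
		WARN_ON_ONCE(!napi_pp_put_page(net_iov_to_netmem(niov)));
}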
A flag "autorelease" is added per-socket. This will be used for enabling the old behavior of the kernel releasing references for the sockets upon close(2) (autorelease), instead of requiring that socket users do this themselves. The autorelease flag is always true in this patch, meaning that the old (non-optimized) behavior is kept unconditionally. A future patch supports a user-facing knob to toggle this feature and will change the default to false for the improved performance.
An outstanding_urefs counter is added per-socket so that changes to the autorelease mode can be rejected for active sockets.
The dmabuf unbind path always checks for any leaked urefs.
Signed-off-by: Bobby Eshleman bobbyeshleman@meta.com
---
Changes in v6:
- remove sk_devmem_info.autorelease, using binding->autorelease instead
- move binding->autorelease check to outside of net_devmem_dmabuf_binding_put_urefs() (Mina)
- remove overly defensive net_is_devmem_iov() (Mina)
- add comment about multiple urefs mapping to a single netmem ref (Mina)
- remove overly defensive netmem NULL and netmem_is_net_iov checks (Mina)
- use niov without casting back and forth with netmem (Mina)
- move the autorelease flag from per-binding to per-socket (Mina)
- remove the batching logic in sock_devmem_dontneed_manual_release() (Mina)
- move autorelease check inside tcp_xa_pool_commit() (Mina)
- remove single-binding restriction for autorelease mode (Mina)
- unbind always checks for leaked urefs
Changes in v5:
- remove unused variables
- introduce autorelease flag, preparing for a future patch to toggle the new behavior

Changes in v3:
- make urefs per-binding instead of per-socket, reducing memory footprint
- fallback to cleaning up references in dmabuf unbind if socket leaked tokens
- drop ethtool patch

Changes in v2:
- always use GFP_ZERO for binding->vec (Mina)
- remove WARN for changed binding (Mina)
- remove extraneous binding ref get (Mina)
- remove WARNs on invalid user input (Mina)
- pre-assign niovs in binding->vec for RX case (Mina)
- use atomic_set(, 0) to initialize sk_user_frags.urefs
- fix length of alloc for urefs
---
 include/net/netmem.h     |  1 +
 include/net/sock.h       | 13 +++++++--
 net/core/devmem.c        | 42 ++++++++++++++++++++++-------
 net/core/devmem.h        |  2 +-
 net/core/sock.c          | 55 +++++++++++++++++++++++++++++++-----
 net/ipv4/tcp.c           | 69 ++++++++++++++++++++++++++++++++++++----------
 net/ipv4/tcp_ipv4.c      | 11 +++++---
 net/ipv4/tcp_minisocks.c |  5 +++-
 8 files changed, 161 insertions(+), 37 deletions(-)
diff --git a/include/net/netmem.h b/include/net/netmem.h index 9e10f4ac50c3..80d2263ba4ed 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -112,6 +112,7 @@ struct net_iov { }; struct net_iov_area *owner; enum net_iov_type type; + atomic_t uref; };
struct net_iov_area { diff --git a/include/net/sock.h b/include/net/sock.h index c7e58b8e8a90..548fabacff7c 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -350,7 +350,11 @@ struct sk_filter; * @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS * @sk_scm_unused: unused flags for scm_recv() * @ns_tracker: tracker for netns reference - * @sk_user_frags: xarray of pages the user is holding a reference on. + * @sk_devmem_info: the devmem binding information for the socket + * @frags: xarray of tokens for autorelease mode + * @binding: pointer to the dmabuf binding + * @outstanding_urefs: count of outstanding user references + * @autorelease: if true, tokens released on close; if false, user must release * @sk_owner: reference to the real owner of the socket that calls * sock_lock_init_class_and_name(). */ @@ -579,7 +583,12 @@ struct sock { struct numa_drop_counters *sk_drop_counters; struct rcu_head sk_rcu; netns_tracker ns_tracker; - struct xarray sk_user_frags; + struct { + struct xarray frags; + struct net_devmem_dmabuf_binding *binding; + atomic_t outstanding_urefs; + bool autorelease; + } sk_devmem_info;
#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES) struct module *sk_owner; diff --git a/net/core/devmem.c b/net/core/devmem.c index 4dee2666dd07..904d19e58f4b 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -11,6 +11,7 @@ #include <linux/genalloc.h> #include <linux/mm.h> #include <linux/netdevice.h> +#include <linux/skbuff_ref.h> #include <linux/types.h> #include <net/netdev_queues.h> #include <net/netdev_rx_queue.h> @@ -116,6 +117,24 @@ void net_devmem_free_dmabuf(struct net_iov *niov) gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE); }
+static void +net_devmem_dmabuf_binding_put_urefs(struct net_devmem_dmabuf_binding *binding) +{ + int i; + + for (i = 0; i < binding->dmabuf->size / PAGE_SIZE; i++) { + struct net_iov *niov; + netmem_ref netmem; + + niov = binding->vec[i]; + netmem = net_iov_to_netmem(niov); + + /* Multiple urefs map to only a single netmem ref. */ + if (atomic_xchg(&niov->uref, 0) > 0) + WARN_ON_ONCE(!napi_pp_put_page(netmem)); + } +} + void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) { struct netdev_rx_queue *rxq; @@ -143,6 +162,10 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) __net_mp_close_rxq(binding->dev, rxq_idx, &mp_params); }
+ /* Clean up any lingering urefs from sockets that had autorelease + * disabled. + */ + net_devmem_dmabuf_binding_put_urefs(binding); net_devmem_dmabuf_binding_put(binding); }
@@ -231,14 +254,13 @@ net_devmem_bind_dmabuf(struct net_device *dev, goto err_detach; }
- if (direction == DMA_TO_DEVICE) { - binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE, - sizeof(struct net_iov *), - GFP_KERNEL); - if (!binding->vec) { - err = -ENOMEM; - goto err_unmap; - } + /* Used by TX and also by RX when socket has autorelease disabled */ + binding->vec = kvmalloc_array(dmabuf->size / PAGE_SIZE, + sizeof(struct net_iov *), + GFP_KERNEL | __GFP_ZERO); + if (!binding->vec) { + err = -ENOMEM; + goto err_unmap; }
/* For simplicity we expect to make PAGE_SIZE allocations, but the @@ -292,10 +314,10 @@ net_devmem_bind_dmabuf(struct net_device *dev, niov = &owner->area.niovs[i]; niov->type = NET_IOV_DMABUF; niov->owner = &owner->area; + atomic_set(&niov->uref, 0); page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov), net_devmem_get_dma_addr(niov)); - if (direction == DMA_TO_DEVICE) - binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov; + binding->vec[owner->area.base_virtual / PAGE_SIZE + i] = niov; }
virtual += len; diff --git a/net/core/devmem.h b/net/core/devmem.h index 2ada54fb63d7..d4eb28d079bb 100644 --- a/net/core/devmem.h +++ b/net/core/devmem.h @@ -61,7 +61,7 @@ struct net_devmem_dmabuf_binding {
/* Array of net_iov pointers for this binding, sorted by virtual * address. This array is convenient to map the virtual addresses to - * net_iovs in the TX path. + * net_iovs. */ struct net_iov **vec;
diff --git a/net/core/sock.c b/net/core/sock.c index 5562f517d889..465645c1d74f 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -87,6 +87,7 @@
#include <linux/unaligned.h> #include <linux/capability.h> +#include <linux/dma-buf.h> #include <linux/errno.h> #include <linux/errqueue.h> #include <linux/types.h> @@ -151,6 +152,7 @@ #include <uapi/linux/pidfd.h>
#include "dev.h" +#include "devmem.h"
static DEFINE_MUTEX(proto_list_mutex); static LIST_HEAD(proto_list); @@ -1081,6 +1083,43 @@ static int sock_reserve_memory(struct sock *sk, int bytes) #define MAX_DONTNEED_TOKENS 128 #define MAX_DONTNEED_FRAGS 1024
+static noinline_for_stack int +sock_devmem_dontneed_manual_release(struct sock *sk, struct dmabuf_token *tokens, + unsigned int num_tokens) +{ + struct net_iov *niov; + unsigned int i, j; + netmem_ref netmem; + unsigned int token; + int num_frags = 0; + int ret; + + if (!sk->sk_devmem_info.binding) + return -EINVAL; + + for (i = 0; i < num_tokens; i++) { + for (j = 0; j < tokens[i].token_count; j++) { + token = tokens[i].token_start + j; + if (token >= sk->sk_devmem_info.binding->dmabuf->size / PAGE_SIZE) + break; + + if (++num_frags > MAX_DONTNEED_FRAGS) + return ret; + + niov = sk->sk_devmem_info.binding->vec[token]; + if (atomic_dec_and_test(&niov->uref)) { + netmem = net_iov_to_netmem(niov); + WARN_ON_ONCE(!napi_pp_put_page(netmem)); + } + ret++; + } + } + + atomic_sub(ret, &sk->sk_devmem_info.outstanding_urefs); + + return ret; +} + static noinline_for_stack int sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens, unsigned int num_tokens) @@ -1089,32 +1128,32 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens, int ret = 0, num_frags = 0; netmem_ref netmems[16];
- xa_lock_bh(&sk->sk_user_frags); + xa_lock_bh(&sk->sk_devmem_info.frags); for (i = 0; i < num_tokens; i++) { for (j = 0; j < tokens[i].token_count; j++) { if (++num_frags > MAX_DONTNEED_FRAGS) goto frag_limit_reached;
netmem_ref netmem = (__force netmem_ref)__xa_erase( - &sk->sk_user_frags, tokens[i].token_start + j); + &sk->sk_devmem_info.frags, tokens[i].token_start + j);
if (!netmem || WARN_ON_ONCE(!netmem_is_net_iov(netmem))) continue;
netmems[netmem_num++] = netmem; if (netmem_num == ARRAY_SIZE(netmems)) { - xa_unlock_bh(&sk->sk_user_frags); + xa_unlock_bh(&sk->sk_devmem_info.frags); for (k = 0; k < netmem_num; k++) WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); netmem_num = 0; - xa_lock_bh(&sk->sk_user_frags); + xa_lock_bh(&sk->sk_devmem_info.frags); } ret++; } }
frag_limit_reached: - xa_unlock_bh(&sk->sk_user_frags); + xa_unlock_bh(&sk->sk_devmem_info.frags); for (k = 0; k < netmem_num; k++) WARN_ON_ONCE(!napi_pp_put_page(netmems[k]));
@@ -1145,7 +1184,11 @@ sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) return -EFAULT; }
- ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens); + if (sk->sk_devmem_info.autorelease) + ret = sock_devmem_dontneed_autorelease(sk, tokens, num_tokens); + else + ret = sock_devmem_dontneed_manual_release(sk, tokens, + num_tokens);
kvfree(tokens); return ret; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index a9345aa5a2e5..052875c1b547 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -260,6 +260,7 @@ #include <linux/memblock.h> #include <linux/highmem.h> #include <linux/cache.h> +#include <linux/dma-buf.h> #include <linux/err.h> #include <linux/time.h> #include <linux/slab.h> @@ -492,7 +493,10 @@ void tcp_init_sock(struct sock *sk)
set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags); sk_sockets_allocated_inc(sk); - xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1); + xa_init_flags(&sk->sk_devmem_info.frags, XA_FLAGS_ALLOC1); + sk->sk_devmem_info.binding = NULL; + atomic_set(&sk->sk_devmem_info.outstanding_urefs, 0); + sk->sk_devmem_info.autorelease = true; } EXPORT_IPV6_MOD(tcp_init_sock);
@@ -2424,11 +2428,11 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
/* Commit part that has been copied to user space. */ for (i = 0; i < p->idx; i++) - __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY, + __xa_cmpxchg(&sk->sk_devmem_info.frags, p->tokens[i], XA_ZERO_ENTRY, (__force void *)p->netmems[i], GFP_KERNEL); /* Rollback what has been pre-allocated and is no longer needed. */ for (; i < p->max; i++) - __xa_erase(&sk->sk_user_frags, p->tokens[i]); + __xa_erase(&sk->sk_devmem_info.frags, p->tokens[i]);
p->max = 0; p->idx = 0; @@ -2436,14 +2440,17 @@ static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p)
static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p) { + if (!sk->sk_devmem_info.autorelease) + return; + if (!p->max) return;
- xa_lock_bh(&sk->sk_user_frags); + xa_lock_bh(&sk->sk_devmem_info.frags);
tcp_xa_pool_commit_locked(sk, p);
- xa_unlock_bh(&sk->sk_user_frags); + xa_unlock_bh(&sk->sk_devmem_info.frags); }
static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p, @@ -2454,18 +2461,18 @@ static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p, if (p->idx < p->max) return 0;
- xa_lock_bh(&sk->sk_user_frags); + xa_lock_bh(&sk->sk_devmem_info.frags);
tcp_xa_pool_commit_locked(sk, p);
for (k = 0; k < max_frags; k++) { - err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k], + err = __xa_alloc(&sk->sk_devmem_info.frags, &p->tokens[k], XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL); if (err) break; }
- xa_unlock_bh(&sk->sk_user_frags); + xa_unlock_bh(&sk->sk_devmem_info.frags);
p->max = k; p->idx = 0; @@ -2479,12 +2486,14 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, unsigned int offset, struct msghdr *msg, int remaining_len) { + struct net_devmem_dmabuf_binding *binding = NULL; struct dmabuf_cmsg dmabuf_cmsg = { 0 }; struct tcp_xa_pool tcp_xa_pool; unsigned int start; int i, copy, n; int sent = 0; int err = 0; + int refs;
tcp_xa_pool.max = 0; tcp_xa_pool.idx = 0; @@ -2536,6 +2545,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; struct net_iov *niov; u64 frag_offset; + u32 token; int end;
/* !skb_frags_readable() should indicate that ALL the @@ -2568,13 +2578,32 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, start; dmabuf_cmsg.frag_offset = frag_offset; dmabuf_cmsg.frag_size = copy; - err = tcp_xa_pool_refill(sk, &tcp_xa_pool, - skb_shinfo(skb)->nr_frags - i); - if (err) + + binding = net_devmem_iov_binding(niov); + + if (!sk->sk_devmem_info.binding) + sk->sk_devmem_info.binding = binding; + + if (sk->sk_devmem_info.binding != binding) { + err = -EFAULT; goto out; + } + + if (sk->sk_devmem_info.autorelease) { + err = tcp_xa_pool_refill(sk, &tcp_xa_pool, + skb_shinfo(skb)->nr_frags - i); + if (err) + goto out; + + dmabuf_cmsg.frag_token = + tcp_xa_pool.tokens[tcp_xa_pool.idx]; + } else { + token = net_iov_virtual_addr(niov) >> PAGE_SHIFT; + dmabuf_cmsg.frag_token = token; + } +
/* Will perform the exchange later */ - dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx]; dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
offset += copy; @@ -2587,8 +2616,15 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, if (err) goto out;
- atomic_long_inc(&niov->pp_ref_count); - tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag); + if (sk->sk_devmem_info.autorelease) { + atomic_long_inc(&niov->pp_ref_count); + tcp_xa_pool.netmems[tcp_xa_pool.idx++] = + skb_frag_netmem(frag); + } else { + if (atomic_inc_return(&niov->uref) == 1) + atomic_long_inc(&niov->pp_ref_count); + refs++; + }
sent += copy;
@@ -2599,6 +2635,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, }
tcp_xa_pool_commit(sk, &tcp_xa_pool); + if (!remaining_len) goto out;
@@ -2617,9 +2654,13 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
out: tcp_xa_pool_commit(sk, &tcp_xa_pool); + if (!sent) sent = err;
+ if (refs > 0) + atomic_add(refs, &sk->sk_devmem_info.outstanding_urefs); + return sent; }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 40a76da5364a..dbb7a71e3cce 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -89,6 +89,9 @@
#include <crypto/md5.h>
+#include <linux/dma-buf.h> +#include "../core/devmem.h" + #include <trace/events/tcp.h>
#ifdef CONFIG_TCP_MD5SIG @@ -2493,7 +2496,7 @@ static void tcp_release_user_frags(struct sock *sk) unsigned long index; void *netmem;
- xa_for_each(&sk->sk_user_frags, index, netmem) + xa_for_each(&sk->sk_devmem_info.frags, index, netmem) WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem)); #endif } @@ -2502,9 +2505,11 @@ void tcp_v4_destroy_sock(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk);
- tcp_release_user_frags(sk); + if (sk->sk_devmem_info.binding && sk->sk_devmem_info.autorelease) + tcp_release_user_frags(sk);
- xa_destroy(&sk->sk_user_frags); + xa_destroy(&sk->sk_devmem_info.frags); + sk->sk_devmem_info.binding = NULL;
trace_tcp_destroy_sock(sk);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index ded2cf1f6006..a017dea35bb8 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -663,7 +663,10 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
- xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1); + xa_init_flags(&newsk->sk_devmem_info.frags, XA_FLAGS_ALLOC1); + newsk->sk_devmem_info.binding = NULL; + atomic_set(&newsk->sk_devmem_info.outstanding_urefs, 0); + newsk->sk_devmem_info.autorelease = true;
return newsk; }
Hi Bobby,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 255d75ef029f33f75fcf5015052b7302486f7ad2]
url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/net-devmem-ren...
base:   255d75ef029f33f75fcf5015052b7302486f7ad2
patch link:    https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-up...
patch subject: [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20251105/202511052307.doTV8fDn-lkp@i...)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251105/202511052307.doTV8fDn-lkp@i...)
If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot lkp@intel.com
| Closes: https://lore.kernel.org/oe-kbuild-all/202511052307.doTV8fDn-lkp@intel.com/
All warnings (new ones prefixed by >>):
net/core/sock.c:1107:12: warning: variable 'ret' is uninitialized when used here [-Wuninitialized]
    1107 |                 return ret;
         |                        ^~~
   net/core/sock.c:1095:9: note: initialize the variable 'ret' to silence this warning
    1095 |         int ret;
         |                ^
         |                 = 0
   1 warning generated.
--
net/ipv4/tcp.c:2626:6: warning: variable 'refs' is uninitialized when used here [-Wuninitialized]
    2626 |                                         refs++;
         |                                         ^~~~
   net/ipv4/tcp.c:2496:10: note: initialize the variable 'refs' to silence this warning
    2496 |         int refs;
         |                 ^
         |                  = 0
   1 warning generated.
vim +/ret +1107 net/core/sock.c
1085 1086 static noinline_for_stack int 1087 sock_devmem_dontneed_manual_release(struct sock *sk, struct dmabuf_token *tokens, 1088 unsigned int num_tokens) 1089 { 1090 struct net_iov *niov; 1091 unsigned int i, j; 1092 netmem_ref netmem; 1093 unsigned int token; 1094 int num_frags = 0; 1095 int ret; 1096 1097 if (!sk->sk_devmem_info.binding) 1098 return -EINVAL; 1099 1100 for (i = 0; i < num_tokens; i++) { 1101 for (j = 0; j < tokens[i].token_count; j++) { 1102 token = tokens[i].token_start + j; 1103 if (token >= sk->sk_devmem_info.binding->dmabuf->size / PAGE_SIZE) 1104 break; 1105 1106 if (++num_frags > MAX_DONTNEED_FRAGS)
1107 return ret;
1108 1109 niov = sk->sk_devmem_info.binding->vec[token]; 1110 if (atomic_dec_and_test(&niov->uref)) { 1111 netmem = net_iov_to_netmem(niov); 1112 WARN_ON_ONCE(!napi_pp_put_page(netmem)); 1113 } 1114 ret++; 1115 } 1116 } 1117 1118 atomic_sub(ret, &sk->sk_devmem_info.outstanding_urefs); 1119 1120 return ret; 1121 } 1122
Hi Bobby,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 255d75ef029f33f75fcf5015052b7302486f7ad2]
url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/net-devmem-ren...
base:   255d75ef029f33f75fcf5015052b7302486f7ad2
patch link:    https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-up...
patch subject: [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
config: arc-nsimosci_hs_defconfig (https://download.01.org/0day-ci/archive/20251106/202511060345.AQs0FTNg-lkp@i...)
compiler: arc-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251106/202511060345.AQs0FTNg-lkp@i...)
If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot lkp@intel.com
| Closes: https://lore.kernel.org/oe-kbuild-all/202511060345.AQs0FTNg-lkp@intel.com/
All warnings (new ones prefixed by >>):
net/ipv4/tcp.c: In function 'tcp_recvmsg_dmabuf':
net/ipv4/tcp.c:2661:12: warning: 'refs' is used uninitialized [-Wuninitialized]
    2661 |         if (refs > 0)
         |            ^
   net/ipv4/tcp.c:2496:13: note: 'refs' was declared here
    2496 |         int refs;
         |             ^~~~
vim +/refs +2661 net/ipv4/tcp.c
2481 2482 /* On error, returns the -errno. On success, returns number of bytes sent to the 2483 * user. May not consume all of @remaining_len. 2484 */ 2485 static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, 2486 unsigned int offset, struct msghdr *msg, 2487 int remaining_len) 2488 { 2489 struct net_devmem_dmabuf_binding *binding = NULL; 2490 struct dmabuf_cmsg dmabuf_cmsg = { 0 }; 2491 struct tcp_xa_pool tcp_xa_pool; 2492 unsigned int start; 2493 int i, copy, n; 2494 int sent = 0; 2495 int err = 0; 2496 int refs; 2497 2498 tcp_xa_pool.max = 0; 2499 tcp_xa_pool.idx = 0; 2500 do { 2501 start = skb_headlen(skb); 2502 2503 if (skb_frags_readable(skb)) { 2504 err = -ENODEV; 2505 goto out; 2506 } 2507 2508 /* Copy header. */ 2509 copy = start - offset; 2510 if (copy > 0) { 2511 copy = min(copy, remaining_len); 2512 2513 n = copy_to_iter(skb->data + offset, copy, 2514 &msg->msg_iter); 2515 if (n != copy) { 2516 err = -EFAULT; 2517 goto out; 2518 } 2519 2520 offset += copy; 2521 remaining_len -= copy; 2522 2523 /* First a dmabuf_cmsg for # bytes copied to user 2524 * buffer. 2525 */ 2526 memset(&dmabuf_cmsg, 0, sizeof(dmabuf_cmsg)); 2527 dmabuf_cmsg.frag_size = copy; 2528 err = put_cmsg_notrunc(msg, SOL_SOCKET, 2529 SO_DEVMEM_LINEAR, 2530 sizeof(dmabuf_cmsg), 2531 &dmabuf_cmsg); 2532 if (err) 2533 goto out; 2534 2535 sent += copy; 2536 2537 if (remaining_len == 0) 2538 goto out; 2539 } 2540 2541 /* after that, send information of dmabuf pages through a 2542 * sequence of cmsg 2543 */ 2544 for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { 2545 skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; 2546 struct net_iov *niov; 2547 u64 frag_offset; 2548 u32 token; 2549 int end; 2550 2551 /* !skb_frags_readable() should indicate that ALL the 2552 * frags in this skb are dmabuf net_iovs. We're checking 2553 * for that flag above, but also check individual frags 2554 * here. If the tcp stack is not setting 2555 * skb_frags_readable() correctly, we still don't want 2556 * to crash here. 
2557 */ 2558 if (!skb_frag_net_iov(frag)) { 2559 net_err_ratelimited("Found non-dmabuf skb with net_iov"); 2560 err = -ENODEV; 2561 goto out; 2562 } 2563 2564 niov = skb_frag_net_iov(frag); 2565 if (!net_is_devmem_iov(niov)) { 2566 err = -ENODEV; 2567 goto out; 2568 } 2569 2570 end = start + skb_frag_size(frag); 2571 copy = end - offset; 2572 2573 if (copy > 0) { 2574 copy = min(copy, remaining_len); 2575 2576 frag_offset = net_iov_virtual_addr(niov) + 2577 skb_frag_off(frag) + offset - 2578 start; 2579 dmabuf_cmsg.frag_offset = frag_offset; 2580 dmabuf_cmsg.frag_size = copy; 2581 2582 binding = net_devmem_iov_binding(niov); 2583 2584 if (!sk->sk_devmem_info.binding) 2585 sk->sk_devmem_info.binding = binding; 2586 2587 if (sk->sk_devmem_info.binding != binding) { 2588 err = -EFAULT; 2589 goto out; 2590 } 2591 2592 if (sk->sk_devmem_info.autorelease) { 2593 err = tcp_xa_pool_refill(sk, &tcp_xa_pool, 2594 skb_shinfo(skb)->nr_frags - i); 2595 if (err) 2596 goto out; 2597 2598 dmabuf_cmsg.frag_token = 2599 tcp_xa_pool.tokens[tcp_xa_pool.idx]; 2600 } else { 2601 token = net_iov_virtual_addr(niov) >> PAGE_SHIFT; 2602 dmabuf_cmsg.frag_token = token; 2603 } 2604 2605 2606 /* Will perform the exchange later */ 2607 dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov); 2608 2609 offset += copy; 2610 remaining_len -= copy; 2611 2612 err = put_cmsg_notrunc(msg, SOL_SOCKET, 2613 SO_DEVMEM_DMABUF, 2614 sizeof(dmabuf_cmsg), 2615 &dmabuf_cmsg); 2616 if (err) 2617 goto out; 2618 2619 if (sk->sk_devmem_info.autorelease) { 2620 atomic_long_inc(&niov->pp_ref_count); 2621 tcp_xa_pool.netmems[tcp_xa_pool.idx++] = 2622 skb_frag_netmem(frag); 2623 } else { 2624 if (atomic_inc_return(&niov->uref) == 1) 2625 atomic_long_inc(&niov->pp_ref_count); 2626 refs++; 2627 } 2628 2629 sent += copy; 2630 2631 if (remaining_len == 0) 2632 goto out; 2633 } 2634 start = end; 2635 } 2636 2637 tcp_xa_pool_commit(sk, &tcp_xa_pool); 2638 2639 if (!remaining_len) 2640 goto out; 2641 2642 /* if remaining_len is not satisfied yet, we need to go to the 2643 * next frag in the frag_list to satisfy remaining_len. 2644 */ 2645 skb = skb_shinfo(skb)->frag_list ?: skb->next; 2646 2647 offset = offset - start; 2648 } while (skb); 2649 2650 if (remaining_len) { 2651 err = -EFAULT; 2652 goto out; 2653 } 2654 2655 out: 2656 tcp_xa_pool_commit(sk, &tcp_xa_pool); 2657 2658 if (!sent) 2659 sent = err; 2660
2661 if (refs > 0)
2662 atomic_add(refs, &sk->sk_devmem_info.outstanding_urefs); 2663 2664 return sent; 2665 } 2666
Hi Bobby,
kernel test robot noticed the following build warnings:
url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/net-devmem-ren...
base:   255d75ef029f33f75fcf5015052b7302486f7ad2
patch link:    https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-up...
patch subject: [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
config: openrisc-randconfig-r073-20251105 (https://download.01.org/0day-ci/archive/20251106/202511060119.MAzcsLoN-lkp@i...)
compiler: or1k-linux-gcc (GCC) 10.5.0
If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot lkp@intel.com
| Reported-by: Dan Carpenter dan.carpenter@linaro.org
| Closes: https://lore.kernel.org/r/202511060119.MAzcsLoN-lkp@intel.com/
New smatch warnings:
net/core/sock.c:1107 sock_devmem_dontneed_manual_release() error: uninitialized symbol 'ret'.
vim +/ret +1107 net/core/sock.c
45aa39492cf4dd Bobby Eshleman 2025-11-04 1086 static noinline_for_stack int 45aa39492cf4dd Bobby Eshleman 2025-11-04 1087 sock_devmem_dontneed_manual_release(struct sock *sk, struct dmabuf_token *tokens, 45aa39492cf4dd Bobby Eshleman 2025-11-04 1088 unsigned int num_tokens) 45aa39492cf4dd Bobby Eshleman 2025-11-04 1089 { 45aa39492cf4dd Bobby Eshleman 2025-11-04 1090 struct net_iov *niov; 45aa39492cf4dd Bobby Eshleman 2025-11-04 1091 unsigned int i, j; 45aa39492cf4dd Bobby Eshleman 2025-11-04 1092 netmem_ref netmem; 45aa39492cf4dd Bobby Eshleman 2025-11-04 1093 unsigned int token; 45aa39492cf4dd Bobby Eshleman 2025-11-04 1094 int num_frags = 0; 45aa39492cf4dd Bobby Eshleman 2025-11-04 1095 int ret;
ret needs to be = 0;
45aa39492cf4dd Bobby Eshleman 2025-11-04 1096 45aa39492cf4dd Bobby Eshleman 2025-11-04 1097 if (!sk->sk_devmem_info.binding) 45aa39492cf4dd Bobby Eshleman 2025-11-04 1098 return -EINVAL; 45aa39492cf4dd Bobby Eshleman 2025-11-04 1099 45aa39492cf4dd Bobby Eshleman 2025-11-04 1100 for (i = 0; i < num_tokens; i++) { 45aa39492cf4dd Bobby Eshleman 2025-11-04 1101 for (j = 0; j < tokens[i].token_count; j++) { 45aa39492cf4dd Bobby Eshleman 2025-11-04 1102 token = tokens[i].token_start + j; 45aa39492cf4dd Bobby Eshleman 2025-11-04 1103 if (token >= sk->sk_devmem_info.binding->dmabuf->size / PAGE_SIZE) 45aa39492cf4dd Bobby Eshleman 2025-11-04 1104 break; 45aa39492cf4dd Bobby Eshleman 2025-11-04 1105 45aa39492cf4dd Bobby Eshleman 2025-11-04 1106 if (++num_frags > MAX_DONTNEED_FRAGS) 45aa39492cf4dd Bobby Eshleman 2025-11-04 @1107 return ret;
Uninitialized. It's always a good idea to test code with CONFIG_INIT_STACK_ALL_PATTERN.
45aa39492cf4dd Bobby Eshleman 2025-11-04 1108 45aa39492cf4dd Bobby Eshleman 2025-11-04 1109 niov = sk->sk_devmem_info.binding->vec[token]; 45aa39492cf4dd Bobby Eshleman 2025-11-04 1110 if (atomic_dec_and_test(&niov->uref)) { 45aa39492cf4dd Bobby Eshleman 2025-11-04 1111 netmem = net_iov_to_netmem(niov); 45aa39492cf4dd Bobby Eshleman 2025-11-04 1112 WARN_ON_ONCE(!napi_pp_put_page(netmem)); 45aa39492cf4dd Bobby Eshleman 2025-11-04 1113 } 45aa39492cf4dd Bobby Eshleman 2025-11-04 1114 ret++;
Uninitialized.
45aa39492cf4dd Bobby Eshleman 2025-11-04 1115 } 45aa39492cf4dd Bobby Eshleman 2025-11-04 1116 } 45aa39492cf4dd Bobby Eshleman 2025-11-04 1117 45aa39492cf4dd Bobby Eshleman 2025-11-04 1118 atomic_sub(ret, &sk->sk_devmem_info.outstanding_urefs); 45aa39492cf4dd Bobby Eshleman 2025-11-04 1119 45aa39492cf4dd Bobby Eshleman 2025-11-04 1120 return ret; 45aa39492cf4dd Bobby Eshleman 2025-11-04 1121 }
Hi Bobby,
kernel test robot noticed the following build warnings:
url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/net-devmem-ren...
base:   255d75ef029f33f75fcf5015052b7302486f7ad2
patch link:    https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-up...
patch subject: [PATCH net-next v6 3/6] net: devmem: prepare for autorelease rx token management
config: nios2-randconfig-r071-20251105 (https://download.01.org/0day-ci/archive/20251106/202511060352.2kPPX3xE-lkp@i...)
compiler: nios2-linux-gcc (GCC) 8.5.0
If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot lkp@intel.com
| Reported-by: Dan Carpenter dan.carpenter@linaro.org
| Closes: https://lore.kernel.org/r/202511060352.2kPPX3xE-lkp@intel.com/
New smatch warnings:
net/ipv4/tcp.c:2626 tcp_recvmsg_dmabuf() error: uninitialized symbol 'refs'.
vim +/refs +2626 net/ipv4/tcp.c
8f0b3cc9a4c102 Mina Almasry 2024-09-10 2485 static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2486 unsigned int offset, struct msghdr *msg, 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2487 int remaining_len) 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2488 { 45aa39492cf4dd Bobby Eshleman 2025-11-04 2489 struct net_devmem_dmabuf_binding *binding = NULL; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2490 struct dmabuf_cmsg dmabuf_cmsg = { 0 }; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2491 struct tcp_xa_pool tcp_xa_pool; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2492 unsigned int start; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2493 int i, copy, n; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2494 int sent = 0; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2495 int err = 0; 45aa39492cf4dd Bobby Eshleman 2025-11-04 2496 int refs;
refs needs to be initialized to zero. Best practice is to use CONFIG_INIT_STACK_ALL_PATTERN for testing.
8f0b3cc9a4c102 Mina Almasry 2024-09-10 2497 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2498 tcp_xa_pool.max = 0; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2499 tcp_xa_pool.idx = 0; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2500 do { 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2501 start = skb_headlen(skb); 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2502 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2503 if (skb_frags_readable(skb)) { 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2504 err = -ENODEV; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2505 goto out; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2506 } 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2507 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2508 /* Copy header. */ 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2509 copy = start - offset; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2510 if (copy > 0) { 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2511 copy = min(copy, remaining_len); 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2512 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2513 n = copy_to_iter(skb->data + offset, copy, 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2514 &msg->msg_iter); 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2515 if (n != copy) { 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2516 err = -EFAULT; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2517 goto out; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2518 } 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2519 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2520 offset += copy; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2521 remaining_len -= copy; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2522 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2523 /* First a dmabuf_cmsg for # bytes copied to user 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2524 * buffer. 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2525 */ 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2526 memset(&dmabuf_cmsg, 0, sizeof(dmabuf_cmsg)); 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2527 dmabuf_cmsg.frag_size = copy; 18912c520674ec Stanislav Fomichev 2025-02-24 2528 err = put_cmsg_notrunc(msg, SOL_SOCKET, 18912c520674ec Stanislav Fomichev 2025-02-24 2529 SO_DEVMEM_LINEAR, 18912c520674ec Stanislav Fomichev 2025-02-24 2530 sizeof(dmabuf_cmsg), 18912c520674ec Stanislav Fomichev 2025-02-24 2531 &dmabuf_cmsg); 18912c520674ec Stanislav Fomichev 2025-02-24 2532 if (err) 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2533 goto out; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2534 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2535 sent += copy; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2536 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2537 if (remaining_len == 0) 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2538 goto out; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2539 } 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2540 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2541 /* after that, send information of dmabuf pages through a 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2542 * sequence of cmsg 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2543 */ 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2544 for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2545 skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2546 struct net_iov *niov; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2547 u64 frag_offset; 45aa39492cf4dd Bobby Eshleman 2025-11-04 2548 u32 token; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2549 int end; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2550 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2551 /* !skb_frags_readable() should indicate that ALL the 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2552 * frags in this skb are dmabuf net_iovs. 
We're checking 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2553 * for that flag above, but also check individual frags 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2554 * here. If the tcp stack is not setting 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2555 * skb_frags_readable() correctly, we still don't want 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2556 * to crash here. 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2557 */ 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2558 if (!skb_frag_net_iov(frag)) { 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2559 net_err_ratelimited("Found non-dmabuf skb with net_iov"); 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2560 err = -ENODEV; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2561 goto out; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2562 } 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2563 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2564 niov = skb_frag_net_iov(frag); 69e39537b66232 Pavel Begunkov 2025-02-04 2565 if (!net_is_devmem_iov(niov)) { 69e39537b66232 Pavel Begunkov 2025-02-04 2566 err = -ENODEV; 69e39537b66232 Pavel Begunkov 2025-02-04 2567 goto out; 69e39537b66232 Pavel Begunkov 2025-02-04 2568 } 69e39537b66232 Pavel Begunkov 2025-02-04 2569 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2570 end = start + skb_frag_size(frag); 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2571 copy = end - offset; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2572 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2573 if (copy > 0) { 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2574 copy = min(copy, remaining_len); 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2575 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2576 frag_offset = net_iov_virtual_addr(niov) + 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2577 skb_frag_off(frag) + offset - 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2578 start; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2579 dmabuf_cmsg.frag_offset = frag_offset; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2580 dmabuf_cmsg.frag_size = copy; 45aa39492cf4dd Bobby Eshleman 2025-11-04 2581 45aa39492cf4dd Bobby Eshleman 2025-11-04 2582 binding = net_devmem_iov_binding(niov); 45aa39492cf4dd Bobby Eshleman 2025-11-04 2583 45aa39492cf4dd Bobby Eshleman 2025-11-04 2584 if (!sk->sk_devmem_info.binding) 45aa39492cf4dd Bobby Eshleman 2025-11-04 2585 sk->sk_devmem_info.binding = binding; 45aa39492cf4dd Bobby Eshleman 2025-11-04 2586 45aa39492cf4dd Bobby Eshleman 2025-11-04 2587 if (sk->sk_devmem_info.binding != binding) { 45aa39492cf4dd Bobby Eshleman 2025-11-04 2588 err = -EFAULT; 45aa39492cf4dd Bobby Eshleman 2025-11-04 2589 goto out; 45aa39492cf4dd Bobby Eshleman 2025-11-04 2590 } 45aa39492cf4dd Bobby Eshleman 2025-11-04 2591 45aa39492cf4dd Bobby Eshleman 2025-11-04 2592 if (sk->sk_devmem_info.autorelease) { 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2593 err = tcp_xa_pool_refill(sk, &tcp_xa_pool, 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2594 skb_shinfo(skb)->nr_frags - i); 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2595 if (err) 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2596 goto out; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2597 45aa39492cf4dd Bobby Eshleman 2025-11-04 2598 dmabuf_cmsg.frag_token = 45aa39492cf4dd Bobby Eshleman 2025-11-04 2599 tcp_xa_pool.tokens[tcp_xa_pool.idx]; 45aa39492cf4dd Bobby Eshleman 2025-11-04 2600 } else { 45aa39492cf4dd Bobby Eshleman 2025-11-04 2601 token = net_iov_virtual_addr(niov) >> PAGE_SHIFT; 45aa39492cf4dd Bobby Eshleman 2025-11-04 2602 dmabuf_cmsg.frag_token = token; 45aa39492cf4dd Bobby Eshleman 2025-11-04 2603 } 45aa39492cf4dd Bobby Eshleman 2025-11-04 2604 45aa39492cf4dd Bobby Eshleman 2025-11-04 2605 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2606 /* Will 
perform the exchange later */ 297d389e9e5bd4 Pavel Begunkov 2025-02-04 2607 dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov); 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2608 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2609 offset += copy; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2610 remaining_len -= copy; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2611 18912c520674ec Stanislav Fomichev 2025-02-24 2612 err = put_cmsg_notrunc(msg, SOL_SOCKET, 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2613 SO_DEVMEM_DMABUF, 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2614 sizeof(dmabuf_cmsg), 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2615 &dmabuf_cmsg); 18912c520674ec Stanislav Fomichev 2025-02-24 2616 if (err) 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2617 goto out; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2618 45aa39492cf4dd Bobby Eshleman 2025-11-04 2619 if (sk->sk_devmem_info.autorelease) { 45aa39492cf4dd Bobby Eshleman 2025-11-04 2620 atomic_long_inc(&niov->pp_ref_count); 45aa39492cf4dd Bobby Eshleman 2025-11-04 2621 tcp_xa_pool.netmems[tcp_xa_pool.idx++] = 45aa39492cf4dd Bobby Eshleman 2025-11-04 2622 skb_frag_netmem(frag); 45aa39492cf4dd Bobby Eshleman 2025-11-04 2623 } else { 45aa39492cf4dd Bobby Eshleman 2025-11-04 2624 if (atomic_inc_return(&niov->uref) == 1) 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2625 atomic_long_inc(&niov->pp_ref_count); 45aa39492cf4dd Bobby Eshleman 2025-11-04 @2626 refs++; ^^^^^^
45aa39492cf4dd Bobby Eshleman 2025-11-04 2627 } 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2628 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2629 sent += copy; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2630 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2631 if (remaining_len == 0) 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2632 goto out; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2633 } 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2634 start = end; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2635 } 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2636 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2637 tcp_xa_pool_commit(sk, &tcp_xa_pool); 45aa39492cf4dd Bobby Eshleman 2025-11-04 2638 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2639 if (!remaining_len) 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2640 goto out; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2641 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2642 /* if remaining_len is not satisfied yet, we need to go to the 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2643 * next frag in the frag_list to satisfy remaining_len. 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2644 */ 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2645 skb = skb_shinfo(skb)->frag_list ?: skb->next; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2646 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2647 offset = offset - start; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2648 } while (skb); 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2649 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2650 if (remaining_len) { 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2651 err = -EFAULT; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2652 goto out; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2653 } 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2654 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2655 out: 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2656 tcp_xa_pool_commit(sk, &tcp_xa_pool); 45aa39492cf4dd Bobby Eshleman 2025-11-04 2657 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2658 if (!sent) 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2659 sent = err; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2660 45aa39492cf4dd Bobby Eshleman 2025-11-04 2661 if (refs > 0) 45aa39492cf4dd Bobby Eshleman 2025-11-04 2662 atomic_add(refs, &sk->sk_devmem_info.outstanding_urefs); 45aa39492cf4dd Bobby Eshleman 2025-11-04 2663 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2664 return sent; 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2665 } 8f0b3cc9a4c102 Mina Almasry 2024-09-10 2666
From: Bobby Eshleman bobbyeshleman@meta.com
Add SO_DEVMEM_AUTORELEASE socket option to allow applications to control token release behavior on a per-socket basis.
The socket option accepts boolean values (0 or 1):
- 1 (true): outstanding tokens are automatically released when the socket closes
- 0 (false): outstanding tokens are released when the dmabuf is unbound
The option can only be changed when the socket has no outstanding tokens, enforced by checking:
1. The frags xarray is empty (no tokens in autorelease mode)
2. The outstanding_urefs counter is zero (no tokens in manual mode)
This restriction prevents inconsistent token tracking state between acquisition and release calls. If either condition fails, setsockopt returns -EBUSY.
The default state is autorelease off.
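For example (illustrative user-space fragment; fd is assumed to be a devmem TCP socket), the toggle has to happen while no tokens are outstanding:

	int val = 1;

	/* Succeeds: no tokens handed out yet. */
	if (setsockopt(fd, SOL_SOCKET, SO_DEVMEM_AUTORELEASE, &val, sizeof(val)))
		perror("SO_DEVMEM_AUTORELEASE");

	/* ... recvmsg(fd, ..., MSG_SOCK_DEVMEM) hands out tokens ... */

	/* Fails with EBUSY until every outstanding token has been returned
	 * via SO_DEVMEM_DONTNEED (manual mode) or the frags xarray is empty
	 * again (autorelease mode).
	 */
	val = 0;
	if (setsockopt(fd, SOL_SOCKET, SO_DEVMEM_AUTORELEASE, &val, sizeof(val)))
		perror("SO_DEVMEM_AUTORELEASE");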
Signed-off-by: Bobby Eshleman bobbyeshleman@meta.com
---
 include/uapi/asm-generic/socket.h       |  2 ++
 net/core/sock.c                         | 51 +++++++++++++++++++++++++++++++++
 net/ipv4/tcp.c                          |  2 +-
 tools/include/uapi/asm-generic/socket.h |  2 ++
 4 files changed, 56 insertions(+), 1 deletion(-)
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index 53b5a8c002b1..59302318bb34 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -150,6 +150,8 @@ #define SO_INQ 84 #define SCM_INQ SO_INQ
+#define SO_DEVMEM_AUTORELEASE 85 + #if !defined(__KERNEL__)
#if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__)) diff --git a/net/core/sock.c b/net/core/sock.c index 465645c1d74f..27af476f3cd3 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1160,6 +1160,46 @@ sock_devmem_dontneed_autorelease(struct sock *sk, struct dmabuf_token *tokens, return ret; }
+static noinline_for_stack int +sock_devmem_set_autorelease(struct sock *sk, sockptr_t optval, unsigned int optlen) +{ + int val; + + if (!sk_is_tcp(sk)) + return -EBADF; + + if (optlen < sizeof(int)) + return -EINVAL; + + if (copy_from_sockptr(&val, optval, sizeof(val))) + return -EFAULT; + + /* Validate that val is 0 or 1 */ + if (val != 0 && val != 1) + return -EINVAL; + + sockopt_lock_sock(sk); + + /* Can only change autorelease if: + * 1. No tokens in the frags xarray (autorelease mode) + * 2. No outstanding urefs (manual release mode) + */ + if (!xa_empty(&sk->sk_devmem_info.frags)) { + sockopt_release_sock(sk); + return -EBUSY; + } + + if (atomic_read(&sk->sk_devmem_info.outstanding_urefs) > 0) { + sockopt_release_sock(sk); + return -EBUSY; + } + + sk->sk_devmem_info.autorelease = !!val; + + sockopt_release_sock(sk); + return 0; +} + static noinline_for_stack int sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) { @@ -1351,6 +1391,9 @@ int sk_setsockopt(struct sock *sk, int level, int optname, #ifdef CONFIG_PAGE_POOL case SO_DEVMEM_DONTNEED: return sock_devmem_dontneed(sk, optval, optlen); + + case SO_DEVMEM_AUTORELEASE: + return sock_devmem_set_autorelease(sk, optval, optlen); #endif case SO_SNDTIMEO_OLD: case SO_SNDTIMEO_NEW: @@ -2208,6 +2251,14 @@ int sk_getsockopt(struct sock *sk, int level, int optname, v.val = READ_ONCE(sk->sk_txrehash); break;
+#ifdef CONFIG_PAGE_POOL + case SO_DEVMEM_AUTORELEASE: + if (!sk_is_tcp(sk)) + return -EBADF; + v.val = sk->sk_devmem_info.autorelease; + break; +#endif + default: /* We implement the SO_SNDLOWAT etc to not be settable * (1003.1g 7). diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 052875c1b547..8226ba892b36 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -496,7 +496,7 @@ void tcp_init_sock(struct sock *sk) xa_init_flags(&sk->sk_devmem_info.frags, XA_FLAGS_ALLOC1); sk->sk_devmem_info.binding = NULL; atomic_set(&sk->sk_devmem_info.outstanding_urefs, 0); - sk->sk_devmem_info.autorelease = true; + sk->sk_devmem_info.autorelease = false; } EXPORT_IPV6_MOD(tcp_init_sock);
diff --git a/tools/include/uapi/asm-generic/socket.h b/tools/include/uapi/asm-generic/socket.h
index f333a0ac4ee4..9710a3d7cc4d 100644
--- a/tools/include/uapi/asm-generic/socket.h
+++ b/tools/include/uapi/asm-generic/socket.h
@@ -147,6 +147,8 @@
 
 #define SO_PASSRIGHTS		83
 
+#define SO_DEVMEM_AUTORELEASE	85
+
 #if !defined(__KERNEL__)
 
#if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
Hi Bobby,
kernel test robot noticed the following build errors:
[auto build test ERROR on 255d75ef029f33f75fcf5015052b7302486f7ad2]
url:    https://github.com/intel-lab-lkp/linux/commits/Bobby-Eshleman/net-devmem-ren...
base:   255d75ef029f33f75fcf5015052b7302486f7ad2
patch link:    https://lore.kernel.org/r/20251104-scratch-bobbyeshleman-devmem-tcp-token-up...
patch subject: [PATCH net-next v6 4/6] net: devmem: add SO_DEVMEM_AUTORELEASE for autorelease control
config: sparc64-randconfig-002-20251105 (https://download.01.org/0day-ci/archive/20251106/202511060041.0DVnS1CV-lkp@i...)
compiler: sparc64-linux-gcc (GCC) 10.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251106/202511060041.0DVnS1CV-lkp@i...)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot lkp@intel.com
| Closes: https://lore.kernel.org/oe-kbuild-all/202511060041.0DVnS1CV-lkp@intel.com/
All errors (new ones prefixed by >>):
net/core/sock.c: In function 'sk_setsockopt':
net/core/sock.c:1395:7: error: 'SO_DEVMEM_AUTORELEASE' undeclared (first use in this function)
    1395 |  case SO_DEVMEM_AUTORELEASE:
         |       ^~~~~~~~~~~~~~~~~~~~~
   net/core/sock.c:1395:7: note: each undeclared identifier is reported only once for each function it appears in
   net/core/sock.c: In function 'sk_getsockopt':
   net/core/sock.c:2255:7: error: 'SO_DEVMEM_AUTORELEASE' undeclared (first use in this function)
    2255 |  case SO_DEVMEM_AUTORELEASE:
         |       ^~~~~~~~~~~~~~~~~~~~~
vim +/SO_DEVMEM_AUTORELEASE +1395 net/core/sock.c
  1288	int sk_setsockopt(struct sock *sk, int level, int optname,
  1289			  sockptr_t optval, unsigned int optlen)
  1290	{
[..]
  1391	#ifdef CONFIG_PAGE_POOL
  1392		case SO_DEVMEM_DONTNEED:
  1393			return sock_devmem_dontneed(sk, optval, optlen);
  1394	
> 1395		case SO_DEVMEM_AUTORELEASE:
  1396			return sock_devmem_set_autorelease(sk, optval, optlen);
  1397	#endif
  1398		case SO_SNDTIMEO_OLD:
  1399		case SO_SNDTIMEO_NEW:
[..]
From: Bobby Eshleman bobbyeshleman@meta.com
Update devmem.rst documentation to describe the new SO_DEVMEM_AUTORELEASE socket option and its usage.
Document the following:
- The two token release modes (automatic vs manual)
- How to use SO_DEVMEM_AUTORELEASE to control the behavior
- Performance benefits of disabling autorelease (~10% CPU reduction)
- Restrictions and caveats of manual token release
- Usage examples for both getsockopt and setsockopt
Signed-off-by: Bobby Eshleman bobbyeshleman@meta.com
---
 Documentation/networking/devmem.rst | 70 +++++++++++++++++++++++++++++++++++--
 1 file changed, 68 insertions(+), 2 deletions(-)
diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
index a6cd7236bfbd..1bfce686dce6 100644
--- a/Documentation/networking/devmem.rst
+++ b/Documentation/networking/devmem.rst
@@ -215,8 +215,8 @@ Freeing frags
 -------------
 
 Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
-processes the frag. The user must return the frag to the kernel via
-SO_DEVMEM_DONTNEED::
+processes the frag. Users should return tokens to the kernel via
+SO_DEVMEM_DONTNEED when they are done processing the data::
 
 	ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
 			 sizeof(token));
 
@@ -235,6 +235,72 @@ can be less than the tokens provided by the user in case of:
 (a) an internal kernel leak bug.
 (b) the user passed more than 1024 frags.
 
+
+Autorelease Control
+~~~~~~~~~~~~~~~~~~~
+
+The SO_DEVMEM_AUTORELEASE socket option controls what happens to outstanding
+tokens (tokens not released via SO_DEVMEM_DONTNEED) when the socket closes::
+
+	int autorelease = 0;  /* 0 = manual release, 1 = automatic release */
+	ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_AUTORELEASE,
+			 &autorelease, sizeof(autorelease));
+
+	/* Query current setting */
+	int current_val;
+	socklen_t len = sizeof(current_val);
+	ret = getsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_AUTORELEASE,
+			 &current_val, &len);
+
+When autorelease is disabled (default):
+
+- Outstanding tokens are NOT released when the socket closes
+- Outstanding tokens are only released when the dmabuf is unbound
+- Provides better performance by eliminating xarray overhead (~10% CPU reduction)
+- Kernel tracks tokens via atomic reference counters in net_iov structures
+
+When autorelease is enabled:
+
+- Outstanding tokens are automatically released when the socket closes
+- Backwards compatible behavior
+- Kernel tracks tokens in an xarray per socket
+
+Important: In both modes, applications should call SO_DEVMEM_DONTNEED to
+return tokens as soon as they are done processing. The autorelease setting only
+affects what happens to tokens that are still outstanding when close() is called.
+
+The autorelease setting can only be changed when the socket has no outstanding
+tokens. If tokens are present, setsockopt returns -EBUSY.
+
+
+Performance Considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Disabling autorelease provides approximately ~10% CPU utilization improvement in
+RX workloads by:
+
+- Eliminating xarray allocations and lookups for token tracking
+- Using atomic reference counters instead
+- Reducing lock contention on the xarray spinlock
+
+However, applications must ensure all tokens are released via
+SO_DEVMEM_DONTNEED before closing the socket, otherwise the backing pages will
+remain pinned until the dmabuf is unbound.
+
+
+Caveats
+~~~~~~~
+
+- With autorelease disabled, sockets cannot switch between different dmabuf
+  bindings. This restriction exists because tokens in this mode do not encode
+  the binding information necessary to perform the token release.
+
+- Applications using manual release mode (autorelease=0) must ensure all tokens
+  are returned via SO_DEVMEM_DONTNEED before socket close to avoid resource
+  leaks during the lifetime of the dmabuf binding. Tokens not released before
+  close() will only be freed when the dmabuf is unbound.
+
+
 TX Interface
 ============
On 11/04, Bobby Eshleman wrote:
> From: Bobby Eshleman bobbyeshleman@meta.com
[..]
> +Autorelease Control
> +~~~~~~~~~~~~~~~~~~~
Have you considered an option to have this flag on the dmabuf binding itself? This will let us keep everything in ynl and not add a new socket option. I think also semantically, this is a property of the binding and not the socket? (not sure what's gonna happen if we have autorelease=on and autorelease=off sockets receiving to the same dmabuf)
On Wed, Nov 5, 2025 at 9:34 AM Stanislav Fomichev stfomichev@gmail.com wrote:
> On 11/04, Bobby Eshleman wrote:
> > From: Bobby Eshleman bobbyeshleman@meta.com
> [..]
> > +Autorelease Control
> > +~~~~~~~~~~~~~~~~~~~
> Have you considered an option to have this flag on the dmabuf binding itself? This will let us keep everything in ynl and not add a new socket option. I think also semantically, this is a property of the binding and not the socket? (not sure what's gonna happen if we have autorelease=on and autorelease=off sockets receiving to the same dmabuf)
I think this thread (and maybe other comments on that patch) is the context that missed your inbox:
https://lore.kernel.org/netdev/aQIoxVO3oICd8U8Q@devvm11784.nha0.facebook.com...
Let us know if you disagree.
On 11/05, Mina Almasry wrote:
> On Wed, Nov 5, 2025 at 9:34 AM Stanislav Fomichev stfomichev@gmail.com wrote:
> > On 11/04, Bobby Eshleman wrote:
> > > From: Bobby Eshleman bobbyeshleman@meta.com
> > [..]
> > > +Autorelease Control
> > > +~~~~~~~~~~~~~~~~~~~
> > Have you considered an option to have this flag on the dmabuf binding itself? This will let us keep everything in ynl and not add a new socket option. I think also semantically, this is a property of the binding and not the socket? (not sure what's gonna happen if we have autorelease=on and autorelease=off sockets receiving to the same dmabuf)
> I think this thread (and maybe other comments on that patch) is the context that missed your inbox:
> https://lore.kernel.org/netdev/aQIoxVO3oICd8U8Q@devvm11784.nha0.facebook.com...
> Let us know if you disagree.
Thanks, I did miss that whole v5 because I was OOO, let me take a look!
On 11/05, Stanislav Fomichev wrote:
> On 11/05, Mina Almasry wrote:
> > On Wed, Nov 5, 2025 at 9:34 AM Stanislav Fomichev stfomichev@gmail.com wrote:
> > > On 11/04, Bobby Eshleman wrote:
> > > > From: Bobby Eshleman bobbyeshleman@meta.com
> > > [..]
> > > > +Autorelease Control
> > > > +~~~~~~~~~~~~~~~~~~~
> > > Have you considered an option to have this flag on the dmabuf binding itself? This will let us keep everything in ynl and not add a new socket option. I think also semantically, this is a property of the binding and not the socket? (not sure what's gonna happen if we have autorelease=on and autorelease=off sockets receiving to the same dmabuf)
> > I think this thread (and maybe other comments on that patch) is the context that missed your inbox:
> > https://lore.kernel.org/netdev/aQIoxVO3oICd8U8Q@devvm11784.nha0.facebook.com...
> > Let us know if you disagree.
> Thanks, I did miss that whole v5 because I was OOO, let me take a look!
Thank you for the context!
I think that the current approach is ok, we can go with that, but I wonder whether we can simplify things a bit? What if we prohibit the co-existence of autorelease=on and autorelease=off sockets on the system? The first binding basically locks the kernel path into one way or the other (presumably by using static-branch) and prohibits new bindings that use a different mode. It will let us still keep the mode on the binding and will help us not think about the co-existence (we can also still keep things like one-dmabuf-per-socket restrictions in the new mode, etc).
I think for you, Mina, this should still work? You have a knob to go back to the old mode if needed. At the same time, we keep the UAPI surface smaller and keep the path more simple. Ideally, we can also deprecate the old mode at some point (if you manage to successfully migrate of course). WDYT?
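As a rough sketch of what that "first binding locks the mode" idea could look like via a static branch (purely illustrative; none of the symbols below exist in the tree, and the names net_devmem_manual_release, net_devmem_mode and net_devmem_bind_mode() are hypothetical):

    #include <linux/jump_label.h>
    #include <linux/mutex.h>

    /* Hypothetical sketch -- not part of this series. */
    static DEFINE_STATIC_KEY_FALSE(net_devmem_manual_release);
    static int net_devmem_mode = -1;   /* -1 unset, 0 autorelease, 1 manual */
    static DEFINE_MUTEX(net_devmem_mode_lock);

    /* Called when a dmabuf binding is created with its requested mode. */
    static int net_devmem_bind_mode(bool manual)
    {
        int ret = 0;

        mutex_lock(&net_devmem_mode_lock);
        if (net_devmem_mode < 0) {
            /* First binding picks the mode for the whole system. */
            net_devmem_mode = manual;
            if (manual)
                static_branch_enable(&net_devmem_manual_release);
        } else if (net_devmem_mode != manual) {
            /* Later bindings must agree with the first one. */
            ret = -EBUSY;
        }
        mutex_unlock(&net_devmem_mode_lock);
        return ret;
    }

Hot paths would then test static_branch_unlikely(&net_devmem_manual_release) rather than a per-socket flag; how the mode resets once the last binding goes away is left open.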
On Wed, Nov 05, 2025 at 03:17:06PM -0800, Stanislav Fomichev wrote:
> On 11/05, Stanislav Fomichev wrote:
> Thank you for the context!
> I think that the current approach is ok, we can go with that, but I wonder whether we can simplify things a bit? What if we prohibit the co-existence of autorelease=on and autorelease=off sockets on the system? The first binding basically locks the kernel path into one way or the other (presumably by using static-branch) and prohibits new bindings that use a different mode. It will let us still keep the mode on the binding and will help us not think about the co-existence (we can also still keep things like one-dmabuf-per-socket restrictions in the new mode, etc).
That approach is okay by me.
Best, Bobby
> I think for you, Mina, this should still work? You have a knob to go back to the old mode if needed. At the same time, we keep the UAPI surface smaller and keep the path more simple. Ideally, we can also deprecate the old mode at some point (if you manage to successfully migrate of course). WDYT?
On Wed, Nov 05, 2025 at 09:34:03AM -0800, Stanislav Fomichev wrote:
> On 11/04, Bobby Eshleman wrote:
> > From: Bobby Eshleman bobbyeshleman@meta.com
> [..]
> > +Autorelease Control
> > +~~~~~~~~~~~~~~~~~~~
> Have you considered an option to have this flag on the dmabuf binding itself? This will let us keep everything in ynl and not add a new socket option. I think also semantically, this is a property of the binding and not the socket? (not sure what's gonna happen if we have autorelease=on and autorelease=off sockets receiving to the same dmabuf)
This was our initial instinct too, and it was the implementation in the prior version, but we opted for a socket-based property because it simplifies backwards compatibility with multi-binding steering rules. In that case, where bindings may have different autorelease settings, the recv path would need to error out once any binding with a different autorelease value was detected, because the dontneed path doesn't have any context to know whether a specific token is part of the socket's xarray (autorelease=on) or part of the binding->vec (autorelease=off).
At the socket level we can just prevent the mode switch by counting outstanding references... to do this at the binding level, I think we would have to revert to the ethtool approach we experimented with earlier (trying to resolve steering rules to queues, then checking their binding->autorelease values and making sure they are consistent).
This should work out of the box for mixed modes, given the outstanding-ref rule.
We should probably add a test specifically for that...
Best, Bobby
From: Bobby Eshleman bobbyeshleman@meta.com
Add -A flag to ncdevmem to set autorelease mode.
Add tests for the SO_DEVMEM_AUTORELEASE socket option:
New tests include:
- check_sockopt_autorelease_default: Verifies default value is 0
- check_sockopt_autorelease_set_0: Tests setting to 0 and reading back
- check_sockopt_autorelease_set_1: Tests toggling from 0 to 1
- check_sockopt_autorelease_invalid: Tests invalid value (2) returns EINVAL
- check_autorelease_disabled: Tests ncdevmem in manual token release mode
- check_autorelease_enabled: Tests ncdevmem in autorelease mode
All check_sockopt tests gracefully skip with KsftSkipEx if SO_DEVMEM_AUTORELEASE is not supported by the kernel.
Signed-off-by: Bobby Eshleman bobbyeshleman@meta.com
---
 tools/testing/selftests/drivers/net/hw/devmem.py  | 115 +++++++++++++++++++++-
 tools/testing/selftests/drivers/net/hw/ncdevmem.c |  20 +++-
 2 files changed, 133 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/devmem.py b/tools/testing/selftests/drivers/net/hw/devmem.py
index 45c2d49d55b6..29ec179d651f 100755
--- a/tools/testing/selftests/drivers/net/hw/devmem.py
+++ b/tools/testing/selftests/drivers/net/hw/devmem.py
@@ -1,6 +1,9 @@
 #!/usr/bin/env python3
 # SPDX-License-Identifier: GPL-2.0
 
+import socket
+import errno
+
 from os import path
 from lib.py import ksft_run, ksft_exit
 from lib.py import ksft_eq, KsftSkipEx
@@ -63,12 +66,122 @@ def check_tx_chunks(cfg) -> None:
     ksft_eq(socat.stdout.strip(), "hello\nworld")
 
 
+@ksft_disruptive
+def check_autorelease_disabled(cfg) -> None:
+    """Test RX with autorelease disabled (requires manual token release in ncdevmem)"""
+    require_devmem(cfg)
+
+    port = rand_port()
+    socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}"
+    listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} -c {cfg.remote_addr} -v 7 -A 0"
+
+    with bkg(listen_cmd, exit_wait=True) as ncdevmem:
+        wait_port_listen(port)
+        cmd(f"yes $(echo -e \x01\x02\x03\x04\x05\x06) | \
+            head -c 1K | {socat}", host=cfg.remote, shell=True)
+
+    ksft_eq(ncdevmem.ret, 0)
+
+
+@ksft_disruptive
+def check_autorelease_enabled(cfg) -> None:
+    """Test RX with autorelease enabled (requires token autorelease in ncdevmem)"""
+    require_devmem(cfg)
+
+    port = rand_port()
+    socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}"
+    listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} -c {cfg.remote_addr} -v 7 -A 1"
+
+    with bkg(listen_cmd, exit_wait=True) as ncdevmem:
+        wait_port_listen(port)
+        cmd(f"yes $(echo -e \x01\x02\x03\x04\x05\x06) | \
+            head -c 1K | {socat}", host=cfg.remote, shell=True)
+
+    ksft_eq(ncdevmem.ret, 0)
+
+
+def check_sockopt_autorelease_default(cfg) -> None:
+    """Test that SO_DEVMEM_AUTORELEASE default is 0"""
+    SO_DEVMEM_AUTORELEASE = 85
+
+    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    try:
+        val = sock.getsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE)
+        ksft_eq(val, 0, "Default autorelease should be 0")
+    except OSError as e:
+        if e.errno == errno.ENOPROTOOPT:
+            raise KsftSkipEx("SO_DEVMEM_AUTORELEASE not supported")
+        raise
+    finally:
+        sock.close()
+
+
+def check_sockopt_autorelease_set_0(cfg) -> None:
+    """Test setting SO_DEVMEM_AUTORELEASE to 0"""
+    SO_DEVMEM_AUTORELEASE = 85
+
+    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    try:
+        sock.setsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE, 0)
+        val = sock.getsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE)
+        ksft_eq(val, 0, "Autorelease should be 0 after setting")
+    except OSError as e:
+        if e.errno == errno.ENOPROTOOPT:
+            raise KsftSkipEx("SO_DEVMEM_AUTORELEASE not supported")
+        raise
+    finally:
+        sock.close()
+
+
+def check_sockopt_autorelease_set_1(cfg) -> None:
+    """Test setting SO_DEVMEM_AUTORELEASE to 1"""
+    SO_DEVMEM_AUTORELEASE = 85
+
+    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    try:
+        # First set to 0
+        sock.setsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE, 0)
+        # Then set back to 1
+        sock.setsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE, 1)
+        val = sock.getsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE)
+        ksft_eq(val, 1, "Autorelease should be 1 after setting")
+    except OSError as e:
+        if e.errno == errno.ENOPROTOOPT:
+            raise KsftSkipEx("SO_DEVMEM_AUTORELEASE not supported")
+        raise
+    finally:
+        sock.close()
+
+
+def check_sockopt_autorelease_invalid(cfg) -> None:
+    """Test that SO_DEVMEM_AUTORELEASE rejects invalid values"""
+    SO_DEVMEM_AUTORELEASE = 85
+
+    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    try:
+        try:
+            sock.setsockopt(socket.SOL_SOCKET, SO_DEVMEM_AUTORELEASE, 2)
+            raise Exception("setsockopt should have failed with EINVAL")
+        except OSError as e:
+            if e.errno == errno.ENOPROTOOPT:
+                raise KsftSkipEx("SO_DEVMEM_AUTORELEASE not supported")
+            ksft_eq(e.errno, errno.EINVAL, "Should fail with EINVAL for invalid value")
+    finally:
+        sock.close()
+
+
 def main() -> None:
     with NetDrvEpEnv(__file__) as cfg:
         cfg.bin_local = path.abspath(path.dirname(__file__) + "/ncdevmem")
         cfg.bin_remote = cfg.remote.deploy(cfg.bin_local)
 
-        ksft_run([check_rx, check_tx, check_tx_chunks],
+        ksft_run([check_rx, check_tx, check_tx_chunks,
+                  check_autorelease_enabled,
+                  check_autorelease_disabled,
+                  check_sockopt_autorelease_default,
+                  check_sockopt_autorelease_set_0,
+                  check_sockopt_autorelease_set_1,
+                  check_sockopt_autorelease_invalid],
                  args=(cfg, ))
     ksft_exit()
diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index 3288ed04ce08..34d608d07bec 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -83,6 +83,10 @@
 #define MSG_SOCK_DEVMEM 0x2000000
 #endif
 
+#ifndef SO_DEVMEM_AUTORELEASE
+#define SO_DEVMEM_AUTORELEASE 85
+#endif
+
 #define MAX_IOV 1024
 
 static size_t max_chunk;
@@ -97,6 +101,7 @@ static unsigned int ifindex;
 static unsigned int dmabuf_id;
 static uint32_t tx_dmabuf_id;
 static int waittime_ms = 500;
+static int autorelease = -1;
 
 /* System state loaded by current_config_load() */
 #define MAX_FLOWS 8
@@ -890,6 +895,16 @@ static int do_server(struct memory_buffer *mem)
 	if (enable_reuseaddr(socket_fd))
 		goto err_close_socket;
 
+	if (autorelease >= 0) {
+		ret = setsockopt(socket_fd, SOL_SOCKET, SO_DEVMEM_AUTORELEASE,
+				 &autorelease, sizeof(autorelease));
+		if (ret) {
+			pr_err("SO_DEVMEM_AUTORELEASE failed");
+			goto err_close_socket;
+		}
+		fprintf(stderr, "Set SO_DEVMEM_AUTORELEASE to %d\n", autorelease);
+	}
+
 	fprintf(stderr, "binding to address %s:%d\n", server_ip,
 		ntohs(server_sin.sin6_port));
 
@@ -1397,7 +1412,7 @@ int main(int argc, char *argv[])
 	int is_server = 0, opt;
 	int ret, err = 1;
 
-	while ((opt = getopt(argc, argv, "ls:c:p:v:q:t:f:z:")) != -1) {
+	while ((opt = getopt(argc, argv, "ls:c:p:v:q:t:f:z:A:")) != -1) {
 		switch (opt) {
 		case 'l':
 			is_server = 1;
@@ -1426,6 +1441,9 @@
 		case 'z':
 			max_chunk = atoi(optarg);
 			break;
+		case 'A':
+			autorelease = atoi(optarg);
+			break;
 		case '?':
 			fprintf(stderr, "unknown option: %c\n", optopt);
 			break;