On Fri, May 2, 2025 at 4:47 AM Paolo Abeni <pabeni@redhat.com> wrote:
Hi,
On 4/29/25 5:26 AM, Mina Almasry wrote:
Augment dmabuf binding to be able to handle TX. In addition to all the RX binding state, we also create the tx_vec needed for the TX path.
Provide an API for sendmsg to be able to send dmabufs bound to this device:
- Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
- MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.
Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY implementation, while disabling instances where MSG_ZEROCOPY falls back to copying.
We additionally pipe the binding down to the new zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems instead of the traditional page netmems.
We also special case skb_frag_dma_map to return the dma-address of these dmabuf net_iovs instead of attempting to map pages.
The TX path may release the dmabuf in a context where we cannot wait. This happens when the user unbinds a TX dmabuf while there are still references to its netmems in the TX path. In that case, the netmems will be put_netmem'd from a context where we can't unmap the dmabuf. Resolve this by making __net_devmem_dmabuf_binding_free schedule_work'd.
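To make the deferred-free pattern concrete, here is a minimal sketch. Only __net_devmem_dmabuf_binding_free is named above; the field names (ref, unbind_w) and the put() helper are assumptions for illustration:

#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Minimal stand-in for the real binding; field names are assumed. */
struct net_devmem_dmabuf_binding {
	refcount_t ref;
	struct work_struct unbind_w;
};

static void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
{
	struct net_devmem_dmabuf_binding *binding =
		container_of(wq, typeof(*binding), unbind_w);

	/* Process context: unmapping the dmabuf may sleep safely here. */
	kfree(binding);
}

static void net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
{
	if (!refcount_dec_and_test(&binding->ref))
		return;

	/* The final put can come from a context that cannot wait
	 * (e.g. TX completion), so defer the teardown to a worker.
	 */
	INIT_WORK(&binding->unbind_w, __net_devmem_dmabuf_binding_free);
	schedule_work(&binding->unbind_w);
}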
Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat of the implementation came from devmem TCP RFC v1[1], which included the TX path, but Stan did all the rebasing on top of netmem/net_iov.
Cc: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
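For reference, a rough userspace sketch of the sendmsg API described above. SCM_DEVMEM_DMABUF, dmabuf_tx_cmsg, and the MSG_ZEROCOPY requirement come from the series; the struct layout, the SCM value, and how tx_dmabuf_id was obtained are assumptions here:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/types.h>

#ifndef SCM_DEVMEM_DMABUF
#define SCM_DEVMEM_DMABUF 79	/* value assumed; see the series' uapi */
#endif

/* Layout assumed from the commit message. */
struct dmabuf_tx_cmsg {
	__u32 dmabuf_id;
};

static ssize_t send_from_dmabuf(int fd, void *base, size_t len,
				__u32 tx_dmabuf_id)
{
	char ctrl[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))] = {};
	struct dmabuf_tx_cmsg ddmabuf = { .dmabuf_id = tx_dmabuf_id };
	struct iovec iov = { .iov_base = base, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl,
		.msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
	cmsg->cmsg_len = CMSG_LEN(sizeof(ddmabuf));
	memcpy(CMSG_DATA(cmsg), &ddmabuf, sizeof(ddmabuf));

	/* MSG_ZEROCOPY is required; the iov is interpreted against the
	 * bound dmabuf rather than ordinary user memory.
	 */
	return sendmsg(fd, &msg, MSG_ZEROCOPY);
}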
I'm sorry for the late feedback. A bunch of things I did not notice before...
@@ -701,6 +743,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,

 	if (msg && msg->msg_ubuf && msg->sg_from_iter)
 		ret = msg->sg_from_iter(skb, from, length);
+	else if (unlikely(binding))
I'm unsure if the unlikely() here (and in similar tests below) is worthwhile: depending on the actual workload, this condition could be very likely.
Right. For now I'm guessing the plain MSG_ZEROCOPY use case in the else branch is the more common workload, so I'm putting the devmem use case on the unlikely path to avoid regressing existing use cases. We could revisit this in the future. In my tests, the devmem workload doesn't seem affected by this unlikely.
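For context, with the series applied the dispatch in __zerocopy_sg_from_iter() reads roughly as below; zerocopy_fill_skb_from_devmem() is named in the commit message, but the exact signatures in this sketch are assumed:

	if (msg && msg->msg_ubuf && msg->sg_from_iter)
		ret = msg->sg_from_iter(skb, from, length);
	else if (unlikely(binding))
		/* devmem TX: fill the skb with net_iov netmems */
		ret = zerocopy_fill_skb_from_devmem(skb, from, length,
						    binding);
	else
		/* classic MSG_ZEROCOPY: pin pages from user memory */
		ret = zerocopy_fill_skb_from_iter(skb, from, length);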
[...]
@@ -1066,11 +1067,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	int process_backlog = 0;
+	bool sockc_valid = true;
 	int zc = 0;
 	long timeo;

 	flags = msg->msg_flags;

+	sockc = (struct sockcm_cookie){ .tsflags = READ_ONCE(sk->sk_tsflags),
+					.dmabuf_id = 0 };
The '.dmabuf_id = 0' part is not needed (the designated initializer zeroes the remaining fields anyway), and possibly the code is clearer without it.
+	if (msg->msg_controllen) {
+		err = sock_cmsg_send(sk, msg, &sockc);
+		if (unlikely(err))
+			/* Don't return error until MSG_FASTOPEN has been
+			 * processed; that may succeed even if the cmsg is
+			 * invalid.
+			 */
+			sockc_valid = false;
+	}
 	if ((flags & MSG_ZEROCOPY) && size) {
 		if (msg->msg_ubuf) {
 			uarg = msg->msg_ubuf;
@@ -1078,7 +1092,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				zc = MSG_ZEROCOPY;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			skb = tcp_write_queue_tail(sk);
-			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
+			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb),
+						    sockc_valid && !!sockc.dmabuf_id);
If sock_cmsg_send() failed and the user did not provide a dmabuf_id, memory accounting will be incorrect.
Forgive me, but I don't see it. sockc_valid will be false, so msg_zerocopy_realloc will do the normal MSG_ZEROCOPY accounting. Then below we check sockc_valid in place of where we did the sock_cmsg_send before, and goto out_err. I assume the goto out_err should undo the memory accounting in the new code as in the old code. Can you elaborate on the bug you see?
 			if (!uarg) {
 				err = -ENOBUFS;
 				goto out_err;

@@ -1087,12 +1102,27 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				zc = MSG_ZEROCOPY;
 			else
 				uarg_to_msgzc(uarg)->zerocopy = 0;
+
+			if (sockc_valid && sockc.dmabuf_id) {
+				binding = net_devmem_get_binding(sk, sockc.dmabuf_id);
+				if (IS_ERR(binding)) {
+					err = PTR_ERR(binding);
+					binding = NULL;
+					goto out_err;
+				}
+			}
 		}
 	} else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) {
 		if (sk->sk_route_caps & NETIF_F_SG)
 			zc = MSG_SPLICE_PAGES;
 	}
+	if (sockc_valid && sockc.dmabuf_id &&
+	    (!(flags & MSG_ZEROCOPY) || !sock_flag(sk, SOCK_ZEROCOPY))) {
+		err = -EINVAL;
+		goto out_err;
+	}
+
 	if (unlikely(flags & MSG_FASTOPEN || inet_test_bit(DEFER_CONNECT, sk)) && !tp->repair) {
@@ -1131,14 +1161,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		/* 'common' sending to sendq */
 	}

-	sockc = (struct sockcm_cookie) { .tsflags = READ_ONCE(sk->sk_tsflags)};
-	if (msg->msg_controllen) {
-		err = sock_cmsg_send(sk, msg, &sockc);
-		if (unlikely(err)) {
-			err = -EINVAL;
-			goto out_err;
-		}
-	}
+	if (!sockc_valid)
+		goto out_err;
Here 'err' could have been zeroed by tcp_sendmsg_fastopen(), and out_err could emit a wrong return value.
Good point indeed.
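For concreteness, one possible shape of a fix (a sketch only, not from the series as posted): carry sock_cmsg_send()'s return value instead of a bool, so the deferred check can't report a stale err:

	int sockc_err = 0;

	if (msg->msg_controllen)
		sockc_err = sock_cmsg_send(sk, msg, &sockc);
	/* Don't return the error until MSG_FASTOPEN has been processed. */

	/* ... MSG_FASTOPEN / deferred-connect handling ... */

	if (sockc_err) {
		err = sockc_err;
		goto out_err;
	}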
Possibly keeping the 'dmabuf_id' handling out of sock_cmsg_send(), in a separate helper, could simplify the handling here?
This should be possible as well.
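For illustration, such a helper might look roughly like this; the name tcp_parse_dmabuf_cmsg and the cmsg payload layout are assumptions:

#include <linux/socket.h>

static int tcp_parse_dmabuf_cmsg(struct msghdr *msg, u32 *dmabuf_id)
{
	struct cmsghdr *cmsg;

	*dmabuf_id = 0;

	for_each_cmsghdr(cmsg, msg) {
		if (!CMSG_OK(msg, cmsg))
			return -EINVAL;
		if (cmsg->cmsg_level != SOL_SOCKET ||
		    cmsg->cmsg_type != SCM_DEVMEM_DMABUF)
			continue;
		if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
			return -EINVAL;
		*dmabuf_id = *(u32 *)CMSG_DATA(cmsg);
	}

	return 0;
}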