On Fri, May 2, 2025 at 4:47 AM Paolo Abeni <pabeni@redhat.com> wrote:
Hi,
On 4/29/25 5:26 AM, Mina Almasry wrote:
Augment dmabuf binding to be able to handle TX. In addition to all the RX binding state, we also create the tx_vec needed for the TX path.
Provide an API for sendmsg to be able to send dmabufs bound to this device:
- Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
- MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.
Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY implementation, while disabling instances where MSG_ZEROCOPY falls back to copying.
We additionally pipe the binding down to the new zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems instead of the traditional page netmems.
We also special case skb_frag_dma_map to return the dma-address of these dmabuf net_iovs instead of attempting to map pages.
The TX path may release the dmabuf in a context where we cannot wait. This happens when the user unbinds a TX dmabuf while there are still references to its netmems in the TX path. In that case, the netmems will be put_netmem'd from a context where we can't unmap the dmabuf. Resolve this by making __net_devmem_dmabuf_binding_free schedule_work'd.
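To make the deferred-free pattern concrete, here is a minimal sketch. Only __net_devmem_dmabuf_binding_free is named above; the field names (ref, unbind_w) and the put() helper are assumptions for illustration:

#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Minimal stand-in for the real binding; field names are assumed. */
struct net_devmem_dmabuf_binding {
	refcount_t ref;
	struct work_struct unbind_w;
};

static void __net_devmem_dmabuf_binding_free(struct work_struct *wq)
{
	struct net_devmem_dmabuf_binding *binding =
		container_of(wq, typeof(*binding), unbind_w);

	/* Process context: unmapping the dmabuf may sleep safely here. */
	kfree(binding);
}

static void net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding)
{
	if (!refcount_dec_and_test(&binding->ref))
		return;

	/* The final put can come from a context that cannot wait
	 * (e.g. TX completion), so defer the teardown to a worker.
	 */
	INIT_WORK(&binding->unbind_w, __net_devmem_dmabuf_binding_free);
	schedule_work(&binding->unbind_w);
}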
Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat of the implementation came from devmem TCP RFC v1[1], which included the TX path, but Stan did all the rebasing on top of netmem/net_iov.
Cc: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
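For reference, a rough userspace sketch of the sendmsg API described above. SCM_DEVMEM_DMABUF, dmabuf_tx_cmsg, and the MSG_ZEROCOPY requirement come from the series; the struct layout, the SCM value, and how tx_dmabuf_id was obtained are assumptions here:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/types.h>

#ifndef SCM_DEVMEM_DMABUF
#define SCM_DEVMEM_DMABUF 79	/* value assumed; see the series' uapi */
#endif

/* Layout assumed from the commit message. */
struct dmabuf_tx_cmsg {
	__u32 dmabuf_id;
};

static ssize_t send_from_dmabuf(int fd, void *base, size_t len,
				__u32 tx_dmabuf_id)
{
	char ctrl[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))] = {};
	struct dmabuf_tx_cmsg ddmabuf = { .dmabuf_id = tx_dmabuf_id };
	struct iovec iov = { .iov_base = base, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl,
		.msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
	cmsg->cmsg_len = CMSG_LEN(sizeof(ddmabuf));
	memcpy(CMSG_DATA(cmsg), &ddmabuf, sizeof(ddmabuf));

	/* MSG_ZEROCOPY is required; the iov is interpreted against the
	 * bound dmabuf rather than ordinary user memory.
	 */
	return sendmsg(fd, &msg, MSG_ZEROCOPY);
}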
I'm sorry for the late feedback. A bunch of things I did not notice before...
@@ -701,6 +743,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,

 	if (msg && msg->msg_ubuf && msg->sg_from_iter)
 		ret = msg->sg_from_iter(skb, from, length);
+	else if (unlikely(binding))
I'm unsure if the unlikely() here (and in similar tests below) is worthwhile: depending on the actual workload, this condition could be very likely.
Right. For now I'm guessing the plain MSG_ZEROCOPY use case in the else branch is the more common workload, so I'm putting the devmem use case on the unlikely path to avoid regressing existing use cases. We could revisit this in the future. In my tests, the devmem workload doesn't seem affected by this unlikely.
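For context, with the series applied the dispatch in __zerocopy_sg_from_iter() reads roughly as below; zerocopy_fill_skb_from_devmem() is named in the commit message, but the exact signatures in this sketch are assumed:

	if (msg && msg->msg_ubuf && msg->sg_from_iter)
		ret = msg->sg_from_iter(skb, from, length);
	else if (unlikely(binding))
		/* devmem TX: fill the skb with net_iov netmems */
		ret = zerocopy_fill_skb_from_devmem(skb, from, length,
						    binding);
	else
		/* classic MSG_ZEROCOPY: pin pages from user memory */
		ret = zerocopy_fill_skb_from_iter(skb, from, length);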
[...]
@@ -1066,11 +1067,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	int process_backlog = 0;
+	bool sockc_valid = true;
 	int zc = 0;
 	long timeo;

 	flags = msg->msg_flags;

+	sockc = (struct sockcm_cookie){ .tsflags = READ_ONCE(sk->sk_tsflags),
+					.dmabuf_id = 0 };
The '.dmabuf_id = 0' part is not needed (the designated initializer zeroes the remaining fields anyway), and possibly the code is clearer without it.
+	if (msg->msg_controllen) {
+		err = sock_cmsg_send(sk, msg, &sockc);
+		if (unlikely(err))
+			/* Don't return error until MSG_FASTOPEN has been
+			 * processed; that may succeed even if the cmsg is
+			 * invalid.
+			 */
+			sockc_valid = false;
+	}
 	if ((flags & MSG_ZEROCOPY) && size) {
 		if (msg->msg_ubuf) {
 			uarg = msg->msg_ubuf;
@@ -1078,7 +1092,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				zc = MSG_ZEROCOPY;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			skb = tcp_write_queue_tail(sk);
-			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
+			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb),
+						    sockc_valid && !!sockc.dmabuf_id);
If sock_cmsg_send() failed and the user did not provide a dmabuf_id, memory accounting will be incorrect.
Forgive me, but I don't see it. sockc_valid will be false, so msg_zerocopy_realloc will do the normal MSG_ZEROCOPY accounting. Then below we check sockc_valid in place of where we did the sock_cmsg_send before, and goto out_err. I assume the goto out_err should undo the memory accounting in the new code as in the old code. Can you elaborate on the bug you see?
 			if (!uarg) {
 				err = -ENOBUFS;
 				goto out_err;

@@ -1087,12 +1102,27 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				zc = MSG_ZEROCOPY;
 			else
 				uarg_to_msgzc(uarg)->zerocopy = 0;
+
+			if (sockc_valid && sockc.dmabuf_id) {
+				binding = net_devmem_get_binding(sk, sockc.dmabuf_id);
+				if (IS_ERR(binding)) {
+					err = PTR_ERR(binding);
+					binding = NULL;
+					goto out_err;
+				}
+			}
 		}
 	} else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) {
 		if (sk->sk_route_caps & NETIF_F_SG)
 			zc = MSG_SPLICE_PAGES;
 	}
+	if (sockc_valid && sockc.dmabuf_id &&
+	    (!(flags & MSG_ZEROCOPY) || !sock_flag(sk, SOCK_ZEROCOPY))) {
+		err = -EINVAL;
+		goto out_err;
+	}
+
 	if (unlikely(flags & MSG_FASTOPEN || inet_test_bit(DEFER_CONNECT, sk)) && !tp->repair) {
@@ -1131,14 +1161,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		/* 'common' sending to sendq */
 	}

-	sockc = (struct sockcm_cookie) { .tsflags = READ_ONCE(sk->sk_tsflags)};
-	if (msg->msg_controllen) {
-		err = sock_cmsg_send(sk, msg, &sockc);
-		if (unlikely(err)) {
-			err = -EINVAL;
-			goto out_err;
-		}
-	}
+	if (!sockc_valid)
+		goto out_err;
Here 'err' could have been zeroed by tcp_sendmsg_fastopen(), and out_err could emit a wrong return value.
Good point indeed.
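For concreteness, one possible shape of a fix (a sketch only, not from the series as posted): carry sock_cmsg_send()'s return value instead of a bool, so the deferred check can't report a stale err:

	int sockc_err = 0;

	if (msg->msg_controllen)
		sockc_err = sock_cmsg_send(sk, msg, &sockc);
	/* Don't return the error until MSG_FASTOPEN has been processed. */

	/* ... MSG_FASTOPEN / deferred-connect handling ... */

	if (sockc_err) {
		err = sockc_err;
		goto out_err;
	}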
Possibly keeping the 'dmabuf_id' handling out of sock_cmsg_send(), in a separate helper, could simplify the handling here?
This should be possible as well.
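For illustration, such a helper might look roughly like this; the name tcp_parse_dmabuf_cmsg and the cmsg payload layout are assumptions:

#include <linux/socket.h>

static int tcp_parse_dmabuf_cmsg(struct msghdr *msg, u32 *dmabuf_id)
{
	struct cmsghdr *cmsg;

	*dmabuf_id = 0;

	for_each_cmsghdr(cmsg, msg) {
		if (!CMSG_OK(msg, cmsg))
			return -EINVAL;
		if (cmsg->cmsg_level != SOL_SOCKET ||
		    cmsg->cmsg_type != SCM_DEVMEM_DMABUF)
			continue;
		if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32)))
			return -EINVAL;
		*dmabuf_id = *(u32 *)CMSG_DATA(cmsg);
	}

	return 0;
}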