Hi Willem, sorry for the late reply, some holiday vacations and other priorities pulled me away from this a bit. I'm getting ready to post RFC V2, so answering some questions ahead of when I do that.
On Thu, Dec 26, 2024 at 1:52 PM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: ...
+static int zerocopy_fill_skb_from_devmem(struct sk_buff *skb,
+					 struct msghdr *msg,
+					 struct iov_iter *from, int length)
+{
+	int i = skb_shinfo(skb)->nr_frags;
+	int orig_length = length;
+	netmem_ref netmem;
+	size_t size;
+
+	while (length && iov_iter_count(from)) {
+		if (i == MAX_SKB_FRAGS)
+			return -EMSGSIZE;
+
+		size = min_t(size_t, iter_iov_len(from), length);
+		if (!size)
+			return -EFAULT;
On error, should caller skb_zerocopy_iter_stream rewind from, rather than (or as well as) msg->msg_iter?
Ah, so this was confusing because there were 2 iterators to keep track of: (a) binding->tx_iter, which is `from`, and (b) msg->msg_iter.

Stan suggested removing binding->tx_iter entirely, so that we're back to using only 1 iterator, msg->msg_iter. That simplifies the code greatly, and I think it addresses this comment as well: with a single iterator there is no need to keep `from` advanced/reverted in lockstep with msg->msg_iter.
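To make that concrete, here is roughly how the error path looks with one iterator. The caller below is a hypothetical sketch of the v2 design, not the literal patch code:

	/* Minimal sketch, assuming msg->msg_iter is the only iterator in
	 * play; fill_devmem_frags is a hypothetical caller.
	 */
	static int fill_devmem_frags(struct sk_buff *skb, struct msghdr *msg,
				     int length)
	{
		size_t before = iov_iter_count(&msg->msg_iter);
		int err;

		err = zerocopy_fill_skb_from_devmem(skb, msg, &msg->msg_iter,
						    length);
		if (err)
			/* One revert on one iterator; no second cursor to
			 * keep in sync with msg->msg_iter.
			 */
			iov_iter_revert(&msg->msg_iter,
					before - iov_iter_count(&msg->msg_iter));
		return err;
	}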
+		netmem = net_iov_to_netmem(iter_iov(from)->iov_base);
+
+		get_netmem(netmem);
+		skb_add_rx_frag_netmem(skb, i, netmem, from->iov_offset,
+				       size, PAGE_SIZE);
+
+		iov_iter_advance(from, size);
+		length -= size;
+		i++;
+	}
+
+	iov_iter_advance(&msg->msg_iter, orig_length);
What does this do if sendmsg is called with NULL as buffer?
So even if iov_base == NULL, the iterator is created anyway; it will span addresses 0 through iov_len.

In the next iteration I've applied Stan's suggestion to use iov_base as the offset into the dma-buf to send from. I think it ends up being a much cleaner UAPI, but let me know what you think.
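For reference, here is roughly what the userspace usage looks like under that scheme. This is a sketch only; the cmsg name (SCM_DEVMEM_DMABUF) and the dmabuf_id/sock_fd variables are assumptions based on this series, not final UAPI:

	/* Send 8192 bytes starting at offset 4096 of the bound dmabuf.
	 * Per Stan's suggestion, iov_base carries the offset into the
	 * dmabuf rather than a virtual address.
	 */
	char ctrl[CMSG_SPACE(sizeof(__u32))] = {};
	struct iovec iov = {
		.iov_base = (void *)4096,	/* offset into the dmabuf */
		.iov_len = 8192,
	};
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl,
		.msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_DEVMEM_DMABUF;	/* assumed name */
	cm->cmsg_len = CMSG_LEN(sizeof(__u32));
	*(__u32 *)CMSG_DATA(cm) = dmabuf_id;	/* id from the netlink bind */

	ret = sendmsg(sock_fd, &msg, MSG_ZEROCOPY);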
...
 struct net_iov *
 net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
@@ -109,6 +112,13 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
 	unsigned long xa_idx;
 	unsigned int rxq_idx;

 	xa_erase(&net_devmem_dmabuf_bindings, binding->id);

+	/* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
+	 * erase.
+	 */
+	synchronize_net();
What precisely does this protect?
synchronize_net() ensures no packet is in flight inside an RCU read-side section. But a packet can still be in flight otherwise, such as posted to the device or queued in a qdisc.
The TX data path does a net_devmem_lookup_dmabuf() to look up the dmabuf_id provided by the user.

But that dmabuf_id may be unbound via net_devmem_unbind_dmabuf() by the user at any time, so some synchronization is needed to make sure we don't send from a dmabuf that is being freed in another thread.

The synchronization in this patch is such that the lookup path obtains a reference under the RCU read lock, and the unbind control path waits a full RCU grace period before dropping its reference via net_devmem_dmabuf_binding_put(). net_devmem_dmabuf_binding_put() triggers freeing the binding if the refcount hits zero.
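In other words, the lookup side looks roughly like the sketch below. It assumes a net_devmem_dmabuf_binding_get() counterpart to _put() that does a refcount_inc_not_zero(); illustrative only, not the literal patch code:

	struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 dmabuf_id)
	{
		struct net_devmem_dmabuf_binding *binding;

		rcu_read_lock();
		binding = xa_load(&net_devmem_dmabuf_bindings, dmabuf_id);
		/* unbind does xa_erase() and then waits a grace period
		 * before its final put, so a binding found here cannot be
		 * freed under us before we either take a reference or fail
		 * and return NULL.
		 */
		if (binding && !net_devmem_dmabuf_binding_get(binding))
			binding = NULL;
		rcu_read_unlock();

		return binding;
	}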
...
+struct net_devmem_dmabuf_binding *
+net_devmem_get_sockc_binding(struct sock *sk, struct sockcm_cookie *sockc)
+{
+	struct net_devmem_dmabuf_binding *binding;
+	int err = 0;
+
+	binding = net_devmem_lookup_dmabuf(sockc->dmabuf_id);
This lookup is from the global xarray net_devmem_dmabuf_bindings.
Is there a check that the socket is sending out through the device to which this dmabuf was bound with netlink? Should there be? (e.g., SO_BINDTODEVICE).
Yes, I think it may be an issue if the user triggers a send from a different netdevice, because when we bind a dmabuf we indeed bind it to a specific netdevice.

One option is, as you say, to require TX sockets to be bound and to check that we're bound to the correct netdev. I also wonder if I can make this work without SO_BINDTODEVICE, by querying the netdev the socket is currently trying to send out of and doing a check in tcp_sendmsg(). I'm not sure if this is possible, but I'll give it a look.
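Something along these lines is what I have in mind. It assumes the binding records the net_device it was bound to (binding->dev); the function name and exact placement are placeholders, not settled code:

	static int net_devmem_check_binding_dev(struct sock *sk,
						struct net_devmem_dmabuf_binding *binding)
	{
		struct dst_entry *dst = __sk_dst_get(sk);

		/* Reject the send if the socket's current egress device
		 * does not match the device the dmabuf was bound to via
		 * netlink.
		 */
		if (!dst || !dst->dev || dst->dev != binding->dev)
			return -ENODEV;

		return 0;
	}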
--
Thanks,
Mina