On Mon, Mar 3, 2025 at 4:20 PM Jakub Kicinski <kuba@kernel.org> wrote:
On Fri, 28 Feb 2025 17:29:13 -0800 Mina Almasry wrote:
On Fri, Feb 28, 2025 at 4:38 PM Jakub Kicinski <kuba@kernel.org> wrote:
On Thu, 27 Feb 2025 04:12:02 +0000 Mina Almasry wrote:
static inline void __skb_frag_ref(skb_frag_t *frag)
{
-	get_page(skb_frag_page(frag));
+	get_netmem(skb_frag_netmem(frag));
}
Silently handling types of memory the caller may not be expecting always worries me.
Sorry, I'm not following. What caller is not expecting netmem? Here we're making sure __skb_frag_ref() handles netmem correctly, i.e. we were not expecting netmem here before, and after this patch we'll handle it correctly.
Why do we need this?
The MSG_ZEROCOPY TX path takes a page reference on the passed memory in zerocopy_fill_skb_from_iter() that kfree_skb() later drops when the skb is sent. We need an equivalent for netmem, which only supports pp refs today. This is my attempt at implementing a page_ref equivalent for net_iovs and generic netmem.
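To make that concrete, here is a rough sketch of the helper pair I have in mind (sketch only: netmem_is_net_iov()/netmem_to_page()/netmem_to_net_iov() are the existing netmem.h helpers, and net_devmem_{get,put}_net_iov() stand in for whatever ends up holding the reference on the net_iov's binding):

void get_netmem(netmem_ref netmem)
{
	/* net_iov: hold a reference on the niov/binding instead of a page ref. */
	if (netmem_is_net_iov(netmem)) {
		net_devmem_get_net_iov(netmem_to_net_iov(netmem));
		return;
	}
	get_page(netmem_to_page(netmem));
}

void put_netmem(netmem_ref netmem)
{
	if (netmem_is_net_iov(netmem)) {
		net_devmem_put_net_iov(netmem_to_net_iov(netmem));
		return;
	}
	put_page(netmem_to_page(netmem));
}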
I think __skb_frag_[un]ref is used elsewhere in the TX path too, tcp_mtu_probe for example calls skb_frag_ref eventually.
Any such caller must be inspected to make sure it generates / anticipates skbs with appropriate pp_recycle and readable settings. It's possible that adding a set of _netmem APIs would be too much churn, but if it's not - it'd make it clear which parts of the kernel we have inspected.
I see, you're concerned about interactions between skb->pp_recycle and skb->unreadable. My strategy to handle this is to take what works for pages, and extend it as carefully as possible to unreadable net_iovs/skbs. Abstractly speaking, I think skb_frag_ref/unref should be able to obtain/drop a ref on a frag regardless of what the underlying memory type is.
Currently __skb_frag_ref() obtains a page ref regardless of skb->pp_recycle setting, and skb_page_unref() drops a page_ref if !skb->pp_recycle and a pp ref if skb->pp_recycle. I extend the logic so that it applies as-is to net_iovs and unreadable skbs.
After this patch, __skb_frag_ref() obtains a 'page ref equivalent' on the net_iov regardless of the pp_recycle setting. My conjecture is that since that works for pages, it should also work for net_iovs.
Also after this patch, skb_page_unref will drop a pp ref on the net_iov if skb->pp_recycle is set, and drop a 'page ref equivalent' on the net_iov if !skb->pp_recycle. The conjecture again is that since that works for pages, it should extend to net_iovs.
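Sketching the unref side as I understand it (not the exact patch, just the semantics described above, and assuming napi_pp_put_page() keeps its current netmem_ref signature):

static inline void skb_page_unref(netmem_ref netmem, bool recycle)
{
#ifdef CONFIG_PAGE_POOL
	/* skb->pp_recycle set: try to return the frag to its page_pool. */
	if (recycle && napi_pp_put_page(netmem))
		return;
#endif
	/* !skb->pp_recycle (or pp declined): drop the page ref equivalent,
	 * which covers both pages and net_iovs after the patch.
	 */
	put_netmem(netmem);
}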
With regard to the callers, my thinking is that since we preserve the semantics of __skb_frag_ref and skb_page_unref (and so skb_frag_ref/skb_frag_unref by extension), all existing callers of skb_frag_[un]ref should work as-is. Here is some analysis of individual callers:
1. skb_release_data -> __skb_frag_unref -> skb_page_unref
This code path expects to drop a pp ref if skb->pp_recycle, and a page ref if !skb->pp_recycle. We make sure to do this regardless of whether the skb is filled with net_iovs or pages, and the conjecture is that since the logic to acquire/drop a pp or page ref works for pages, it should work for net_iovs as well.
2. __skb_zcopy_downgrade_managed -> skb_frag_ref
Same thing here: since this code expects to get a page ref on the frag, we obtain the net_iov equivalent of a page ref, and that should work as well for net_iovs as it does for pages. I could look at more callers, but the general thinking is the same. Am I off the rails here?
Of course, we could leave skb_frag_[un]ref as-is and create an skb_frag_[un]ref_netmem which supports net_iovs, but my ambition here was to make skb_frag_[un]ref itself support netmem rather than fork the frag refcounting - if that seems like a good idea?
In general, I'm surprised by the lack of bug reports for devmem.
I guess we did a good job making sure we don't regress the page paths.
:)
The lack of support in any driver that qemu will run is an issue. I wonder if the fact that devmem needs some setup is also an issue. We need header split enabled, a udmabuf created, the netlink API bound, and then a connection referring to what was created, and we don't support loopback. I think all of that may make it difficult for syzbot to repro. I've had it on my todo list to investigate this more.
Can you think of any way we could expose this more to syzbot? First thing that comes to mind is a simple hack in netdevsim, to make it insert a netmem handle (allocated locally, not a real memory provider), every N packets (controllable via debugfs). Would that work?
Yes, great idea. I don't see why it wouldn't work.
We don't expect mixing of net_iovs and pages in the same skb, but netdevsim could create one net_iov skb every N skbs.
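Just to sketch the knob (hypothetical names, not existing netdevsim code), it could be as simple as a debugfs u32:

/* Hypothetical: inject one locally allocated net_iov skb every N skbs;
 * 0 disables injection. Name and placement are made up for illustration.
 */
static u32 nsim_netmem_every_n;

static void nsim_netmem_debugfs_init(struct dentry *ddir)
{
	debugfs_create_u32("netmem_every_n", 0600, ddir, &nsim_netmem_every_n);
}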
I guess I'm not totally sure what makes something discoverable to syzbot. Is a netdevsim hack toggleable via debugfs sufficient for syzbot? I'll investigate and ask.
Yeah, my unreliable memory is that syzbot has a mixed record discovering problems with debugfs. If you could ask Dmitry for advice that'd be ideal.
Yes, I took a look here and discussed with Willem. Long story short is that syzbot support is possible but with a handful of changes. We'll look into that.
Long story long, for syzbot support I don't think netdevsim itself will be useful. It's our understanding so far that syzbot doesn't do anything special with netdevsim. We'll need to add queue API/page_pool/unreadable netmem support to one of the drivers qemu (syzbot) uses, and that should get syzbot fuzzing the control plane.
To get syzbot to fuzz the data plane, I think we need to set up a special syzbot instance which configures udmabuf/rss/flow steering/netlink binding and starts injecting packets through the data path. Syzbot would not discover a working config on its own. I'm told it's rare to set up specialized syzbot instances, but we could make the case that this coverage is important enough.
Hacking netdevsim like you suggested would be useful as well, but outside of syzbot, AFAICT so far. We can run the existing netdevsim data path tests with netdevsim in 'unreadable netmem mode' and see if that reproduces issues, although I'm not sure yet how to integrate that with continuous testing.
--
Thanks,
Mina