On 11/7/23 23:03, Mina Almasry wrote:
On Tue, Nov 7, 2023 at 2:55 PM David Ahern dsahern@kernel.org wrote:
On 11/7/23 3:10 PM, Mina Almasry wrote:
On Mon, Nov 6, 2023 at 3:44 PM David Ahern dsahern@kernel.org wrote:
On 11/5/23 7:44 PM, Mina Almasry wrote:
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index eeeda849115c..1c351c138a5b 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding { };
#ifdef CONFIG_DMA_SHARED_BUFFER +struct page_pool_iov * +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding); +void netdev_free_devmem(struct page_pool_iov *ppiov);
netdev_{alloc,free}_dmabuf?
Can do.
I say that because a dmabuf can be host memory, at least I am not aware of a restriction that a dmabuf is device memory.
In my limited experience dma-buf is generally device memory, and that's really its use case. CONFIG_UDMABUF is a driver that mocks dma-buf with a memfd which I think is used for testing. But I can do the rename, it's more clear anyway, I think.
config UDMABUF bool "userspace dmabuf misc driver" default n depends on DMA_SHARED_BUFFER depends on MEMFD_CREATE || COMPILE_TEST help A driver to let userspace turn memfd regions into dma-bufs. Qemu can use this to create host dmabufs for guest framebuffers.
Qemu is just a userspace process; it is no way a special one.
Treating host memory as a dmabuf should radically simplify the io_uring extension of this set.
I agree actually, and I was about to make that comment to David Wei's series once I have the time.
David, your io_uring RX zerocopy proposal actually works with devmem TCP, if you're inclined to do that instead, what you'd do roughly is (I think):
That would be a Frankenstein's monster api with no good reason for it. You bind memory via netlink because you don't have a proper context to work with otherwise, io_uring serves as the context with a separate and precise abstraction around queues. Same with dmabufs, it totally makes sense for device memory, but wrapping host memory into a file just to immediately unwrap it back with no particular benefits from doing so doesn't seem like a good uapi. Currently, the difference will be hidden by io_uring.
And we'd still need to have a hook in pp's get page to grab buffers from the buffer ring instead of refilling via SO_DEVMEM_DONTNEED and a callback for when skbs are dropped. It's just instead of a new pp ops it'll be a branch in the devmem path. io_uring might want to use the added iov format in the future for device memory or even before that, io_uring doesn't really care whether it's pages or not.
It's also my big concern from how many optimisations it'll fence us off. With the current io_uring RFC I can get rid of all buffer atomic refcounting and replace it with a single percpu counting per skb. Hopefully, that will be doable after we place it on top of pp providers.
- Allocate a memfd,
- Use CONFIG_UDMABUF to create a dma-buf out of that memfd.
- Bind the dma-buf to the NIC using the netlink API in this RFC.
- Your io_uring extensions and io_uring uapi should work as-is almost
on top of this series, I think.
If you do this the incoming packets should land into your memfd, which may or may not work for you. In the future if you feel inclined to use device memory, this approach that I'm describing here would be more extensible to device memory, because you'd already be using dma-bufs for your user memory; you'd just replace one kind of dma-buf (UDMABUF) with another.
That the io_uring set needs to dive into page_pools is just wrong - complicating the design and code and pushing io_uring into a realm it does not need to be involved in.
I disagree. How does it complicate it? io_uring will be just a yet another provider implementing the callbacks of the API created for such use cases and not changing common pp/net bits. The rest of code is in io_uring implementing interaction with userspace and other usability features, but there will be anyway some amount of code if we want to have a convenient and performant api via io_uring.
Most (all?) of this patch set can work with any memory; only device memory is unreadable.