On Mon, Apr 01, 2024 at 12:22:24PM -0700, Mina Almasry wrote:
> On Thu, Mar 28, 2024 at 12:31 AM Christoph Hellwig <hch@infradead.org> wrote:
> > On Tue, Mar 26, 2024 at 01:19:20PM -0700, Mina Almasry wrote:
> > > Are you envisioning that dmabuf support would be added to the block layer
> > Yes.
> > > (which I understand is part of the VFS and not driver specific),
> > The block layer isn't really the VFS, it's just another core stack like the network stack.
> > > or as part of the specific storage driver (like nvme for example)? If we can add dmabuf support to the block layer itself, that sounds awesome. We may then be able to do devmem TCP on all/most storage devices without having to modify each individual driver.
> > I suspect we'll still need to touch the drivers to understand it, but hopefully all the main infrastructure can live in the block layer.
> > > In your estimation, is adding dmabuf support to the block layer something technically feasible & acceptable upstream? I notice you suggested it, so I'm guessing yes to both, but I thought I'd confirm.
> > I think so, and I know there has been quite some interest to at least pre-register userspace memory so that the iommu mapping overhead can be paid up front. It also is a much better interface for Peer to Peer transfers than what we currently have.
Thanks for copying me on this. This sounds really great.
Also, P2PDMA requires the PCI root complex to support this kind of direct transfer, and IIUC dmabuf does not have such a hardware dependency.
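To make that a bit more concrete for myself, here is a rough sketch of what the importer side could look like if the block layer (or an individual driver) were handed a dma-buf fd. blk_dmabuf_ctx and blk_dmabuf_ctx_init are made-up names for illustration, not from any posted patch; only the dma_buf_*() calls are the existing in-kernel dma-buf API:

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/device.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

/* Made-up context: one attachment + one long-lived mapping per buffer. */
struct blk_dmabuf_ctx {
	struct dma_buf *dbuf;
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
};

static int blk_dmabuf_ctx_init(struct blk_dmabuf_ctx *ctx,
			       struct device *dma_dev, int fd)
{
	/* Take a reference on the dma-buf behind the fd. */
	ctx->dbuf = dma_buf_get(fd);
	if (IS_ERR(ctx->dbuf))
		return PTR_ERR(ctx->dbuf);

	/* Attach the buffer to the device that will DMA to/from it. */
	ctx->attach = dma_buf_attach(ctx->dbuf, dma_dev);
	if (IS_ERR(ctx->attach)) {
		dma_buf_put(ctx->dbuf);
		return PTR_ERR(ctx->attach);
	}

	/*
	 * Map once and keep the sg_table for the lifetime of the binding,
	 * so the iommu mapping cost is paid up front rather than per I/O.
	 */
	ctx->sgt = dma_buf_map_attachment_unlocked(ctx->attach,
						   DMA_BIDIRECTIONAL);
	if (IS_ERR(ctx->sgt)) {
		dma_buf_detach(ctx->dbuf, ctx->attach);
		dma_buf_put(ctx->dbuf);
		return PTR_ERR(ctx->sgt);
	}
	return 0;
}

The sg_table from that one-time map can then be reused across requests, and from the importer's side it looks the same whether the exporter backs the buffer with host memory or device memory.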
> I think this is positively thrilling news for me. I was worried that adding devmem TCP support to storage devices would involve using a non-dmabuf standard of buffer sharing like pci_p2pdma (drivers/pci/p2pdma.c), and that would require messy changes to pci_p2pdma that would get nacked. Also, it would require adding pci_p2pdma support to devmem TCP, which is a can of worms. If adding dma-buf support to storage devices is feasible and desirable, that's a much better approach IMO: (a) it may work with devmem TCP without any changes needed on the netdev side of things, and (b) dma-buf support may be generically useful and a good contribution even outside of devmem TCP.
I think the major difference is the interface: p2pdma exposes an mmap()ed memory region rather than an fd (https://lwn.net/Articles/906092/).
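For reference, this is roughly how the two look from userspace as I understand them. The p2pmem sysfs path is the interface from the series in the LWN link above (worth double-checking against the current ABI docs), and udmabuf is just one convenient dma-buf exporter for host memory here, nothing specific to devmem TCP:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/udmabuf.h>

/* (a) P2PDMA: device memory is handed out by mmap()ing a sysfs file. */
static void *p2pmem_alloc(const char *pci_bdf, size_t len)
{
	char path[256];
	int fd;

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/p2pmem/allocate", pci_bdf);
	fd = open(path, O_RDWR);
	if (fd < 0)
		return MAP_FAILED;
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* (b) dma-buf: the buffer is a file descriptor that gets passed around. */
static int host_mem_as_dmabuf(int sealed_memfd, uint64_t offset, uint64_t size)
{
	struct udmabuf_create create = {
		.memfd = sealed_memfd,	/* memfd sealed with F_SEAL_SHRINK */
		.offset = offset,
		.size = size,
	};
	int ret, dev = open("/dev/udmabuf", O_RDWR);

	if (dev < 0)
		return -1;
	ret = ioctl(dev, UDMABUF_CREATE, &create);	/* returns a dma-buf fd */
	close(dev);
	return ret;
}

The point is just the shape of the handle userspace ends up holding: a mapped VMA in the p2pdma case, a file descriptor in the dma-buf case.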
> I don't have a concrete user for devmem TCP with storage devices, but the use case is very similar to the GPU one, and I imagine the perf benefits can be significant in some setups.
We have storage use cases at ByteDance: we use NVMe SSDs to cache videos transferred over the network, so moving data directly from SSD to NIC would help a lot.
Thanks!