On Thu, Aug 18, 2022 at 01:07:16PM +0200, Christian König wrote:
> On 17.08.22 at 18:11, Jason Gunthorpe wrote:
> > dma-buf has become a way to safely acquire a handle to non-struct page memory that can still have lifetime controlled by the exporter. Notably RDMA can now import dma-buf FDs and build them into MRs which allows for PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory from PCI device BARs.
> > This series supports a use case for SPDK where an NVMe device will be owned by SPDK through VFIO but interacting with an RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P.
> > However, as a general mechanism, it can support many other scenarios with VFIO. I imagine this dmabuf approach to be usable by iommufd as well for generic and safe P2P mappings.
> In general looks good to me, but we really need to get away from using sg_tables for this here.
> The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen this incorrectly used so many times that I can't count them any more.
> Would that be somehow avoidable? Or could you at least explain the use case a bit better?
I didn't see a way; maybe you know of one.
VFIO needs to maintain, for each vfio_device, a list of the dmabuf FDs that the user has created:
int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
				  struct vfio_device_feature_dma_buf __user *arg,
				  size_t argsz)
{
	down_write(&vdev->memory_lock);
	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
	up_write(&vdev->memory_lock);
And dmabuf FDs are removed from the list when the user closes the FD:
static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
{
	down_write(&priv->vdev->memory_lock);
	list_del_init(&priv->dmabufs_elm);
	up_write(&priv->vdev->memory_lock);
Which then poses the problem: how do you iterate over only the dma_bufs that are still alive in order to execute move?
This seems necessary as parts of the dma_buf have already been destroyed by the time the user's release function is called.
Which I solved like this:
	down_write(&vdev->memory_lock);
	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
		if (!dma_buf_try_get(priv->dmabuf))
			continue;
So the scenarios resolve as:
 - Concurrent release is not in progress: dma_buf_try_get() succeeds and prevents concurrent release from starting
 - Release has started but not reached its memory_lock: dma_buf_try_get() fails
 - Release has started but passed its memory_lock: dmabuf is not on the list so dma_buf_try_get() is not called.
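To make the three cases above concrete, here is a rough userspace sketch, not the kernel code. It assumes dma_buf_try_get() behaves like an "increment unless already zero" on the reference count. Every name in it (struct buf, struct dev, buf_try_get(), buf_add(), buf_release(), dev_move()) is a hypothetical stand-in, and a pthread mutex stands in for the write side of memory_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these names are kernel API. */
struct buf {
	atomic_long refs;	/* models the refcount behind the dmabuf FD */
	struct buf *next;	/* models dmabufs_elm */
};

struct dev {
	pthread_mutex_t memory_lock;	/* models vdev->memory_lock */
	struct buf *dmabufs;		/* models vdev->dmabufs */
};

/* Take a reference only if the count has not already reached zero. */
static bool buf_try_get(struct buf *b)
{
	long old = atomic_load(&b->refs);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&b->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Creation: start with one reference (the FD) and publish on the list. */
static void buf_add(struct dev *d, struct buf *b)
{
	atomic_init(&b->refs, 1);
	pthread_mutex_lock(&d->memory_lock);
	b->next = d->dmabufs;
	d->dmabufs = b;
	pthread_mutex_unlock(&d->memory_lock);
}

/* Release callback: runs after the count hit zero, unlinks under the lock. */
static void buf_release(struct dev *d, struct buf *b)
{
	pthread_mutex_lock(&d->memory_lock);
	for (struct buf **p = &d->dmabufs; *p; p = &(*p)->next) {
		if (*p == b) {
			*p = b->next;
			break;
		}
	}
	pthread_mutex_unlock(&d->memory_lock);
}

/* Walk the list as the move path would; return how many buffers were live. */
static int dev_move(struct dev *d)
{
	int live = 0;

	pthread_mutex_lock(&d->memory_lock);
	for (struct buf *b = d->dmabufs; b; b = b->next) {
		long prev;

		if (!buf_try_get(b))
			continue;	/* release already started: skip it */
		live++;			/* the move notification would go here */
		/*
		 * Drop the temporary reference. This sketch assumes the FD
		 * reference is still held, so the count cannot hit zero here;
		 * it does not model a final put inside the walk.
		 */
		prev = atomic_fetch_sub(&b->refs, 1);
		assert(prev > 1);
	}
	pthread_mutex_unlock(&d->memory_lock);
	return live;
}
```

With two buffers on the list, dev_move() visits both. Dropping one buffer's count to zero without yet calling buf_release() models "release has started but not reached its memory_lock", and dev_move() then skips that buffer. Once buf_release() has run, the buffer is no longer on the list and the walk never sees it.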
Jason