On Wed, Sep 07, 2022 at 08:32:23AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
> > - DMABUF abuses dma_map_resource() for P2P and thus doesn't work in certain special cases.
> Not just certain special cases, but one of the main use cases. Basically P2P can happen in two ways:
> a) through a PCIe switch, or b) through connected root ports
Yes, we tested both, both work.
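For reference, whether a given provider/client pair can do P2P at all - case a), or case b) on a root complex the core has whitelisted - can already be probed via pci_p2pdma_distance_many(). A minimal illustrative wrapper (the helper is real, the wrapper name is made up):

#include <linux/pci-p2pdma.h>

/*
 * Illustrative sketch only: true if P2P DMA from @provider's BARs to
 * @client is supported by the topology (through a common switch, or
 * through root ports the P2PDMA core has whitelisted).
 */
static bool p2p_usable(struct pci_dev *provider, struct device *client)
{
	return pci_p2pdma_distance_many(provider, &client, 1, true) >= 0;
}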
> The open code version here only supports a), and only supports it when there is no offset between the 'physical' address of the BAR as seen by the PCIe endpoint and the CPU physical address Linux uses for it. x86 usually (always?) doesn't have an offset there, but other architectures often do.
The PCI offset is some embedded thing - I've never seen it in a server platform.
Frankly, it is just bad SoC design, and there is good reason why a non-zero offset needs to be avoided. As soon as you create aliases between the address spaces you invite trouble. IIRC an SoC I once used put the memory at 0 -> 4G and then put the only PCI aperture at 4G -> 4G+N. However, this design requires 64-bit PCI support, which at the time the platform didn't have. So they used the PCI offset to hackily alias the aperture over the DDR. I don't remember if they threw out a bit of DDR to resolve the alias, or if they just didn't support PCI switches.
In any case, it is a complete mess. You either drastically limit your BAR size, don't support PCI switches, or lose a lot of DDR.
I also seem to remember that the IOMMU and the PCI offset don't play nicely together - so for the VFIO use case, where the IOMMU is present, I'm pretty sure we can very safely assume a 0 offset. That seems confirmed by the fact that VFIO has never handled the PCI offset in its own P2P path, and P2P works fine in VMs across a wide range of platforms.
That said, I agree we should really have APIs that support this properly, and dma_map_resource is certainly technically wrong.
So, would you be OK with this series if I try to make a dma_map_p2p() that resolves the offset issue?
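Strawman only - nothing like this exists in the tree today, but the signature could look something like:

/*
 * Hypothetical sketch of a dma_map_p2p(): map @size bytes of @provider's
 * BAR @bar, starting at @offset into it, for DMA issued by @dev. In the
 * direct case it would return the bus address (honouring any PCI offset);
 * behind an IOMMU it would map the CPU physical and return the IOVA.
 */
dma_addr_t dma_map_p2p(struct device *dev, struct pci_dev *provider,
		       int bar, unsigned long offset, size_t size,
		       enum dma_data_direction dir, unsigned long attrs);

The open-coded dma_map_resource() call in the diff below would then collapse into a single dma_map_p2p(attachment->dev, priv->vdev->pdev, priv->index, priv->offset, ...) call.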
> Last but not least I don't really see how the code would even work when an IOMMU is used, as dma_map_resource will return an IOVA that is only understood by the IOMMU itself, and not the other endpoint.
I don't understand this.
__iommu_dma_map() will put the given phys into the iommu_domain associated with 'dev' and return the IOVA it picked.
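Roughly, condensed from drivers/iommu/dma-iommu.c (error unwinding and the IOVA-alignment details trimmed; a sketch, not verbatim):

static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
		size_t size, int prot, u64 dma_mask)
{
	struct iommu_domain *domain = iommu_get_dma_domain(dev);
	dma_addr_t iova;

	/* Carve a free IOVA range out of dev's domain... */
	iova = iommu_dma_alloc_iova(domain, size, dma_mask, dev);
	if (!iova)
		return DMA_MAPPING_ERROR;

	/* ...point it at the given CPU physical address... */
	if (iommu_map_atomic(domain, iova, phys, size, prot))
		return DMA_MAPPING_ERROR;

	/* ...and hand back the IOVA that dev must issue for its DMA. */
	return iova;
}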
Here 'dev' is the importing device; it is the device that will issue the DMA:
+		dma_addr = dma_map_resource(
+			attachment->dev,
+			pci_resource_start(priv->vdev->pdev, priv->index) +
+				priv->offset,
+			priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
eg attachment->dev is the PCI device of the RDMA device, not the VFIO device.
'phys' is the CPU physical of the PCI BAR page, which with 0 PCI offset is the right thing to program into the IO page table.
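If we did want to cover a non-zero offset in the direct (no-IOMMU) path, the fix would be to feed the peer endpoint the bus address instead of the CPU physical - pci_bus_address() already exists for that. A sketch, reusing the names from the diff above:

	/*
	 * Direct-map sketch: the bus address of the BAR page, as the peer
	 * PCIe endpoint must address it. With a zero PCI offset this is
	 * identical to pci_resource_start() + priv->offset as used above.
	 */
	pci_bus_addr_t bus_addr =
		pci_bus_address(priv->vdev->pdev, priv->index) + priv->offset;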
> How was this code even tested?
It was tested on a few platforms. Like I said above, the cases where it doesn't work are special, largely embedded, and not anything we have in our labs - AFAIK.
Jason