On Wed, Sep 07, 2022 at 07:29:58AM -0700, Christoph Hellwig wrote:
On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
Yes, you said that, and I said that when the AMD driver first merged it - but it went in anyhow and now people are using it in a bunch of places.
drm folks made up their own weird rules, if they internally stick to it they have to listen to it given that they ignore review comments, but it violates the scatterlist API and has not business anywhere else in the kernel. And yes, there probably is a reason or two why the drm code is unusually error prone.
That may be, but it is creating problems if DRM gets to do X crazy thing and nobody else can..
So, we have two issues here
1) DMABUF abuses the scatter list, but this is very constrainted we have this weird special "DMABUF scatterlist" that is only touched by DMABUF importers. The imports signal that they understand the format with a flag. This is ugly and would be nice to clean to a dma mapped address list of some sort.
I spent alot of time a few years ago removing driver touches of the SGL and preparing the RDMA stack to do this kind of change, at least.
2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in certain special cases.
Rather than jump to ZONE_DEVICE and map_sgl I would like to continue to support non-struct page mapping. So, I would suggest adding a dma_map_p2p() that can cover off the special cases, include the two struct devices as arguments with a physical PFN/size. Do the same logic we do under the ZONE_DEVICE stuff.
Drivers can choose if they want to pay the memory cost of ZONE_DEVICE and get faster dma_map or get slower dma_map and save memory.
I still think we can address them incrementally - but the dma_map_p2p() might be small enough to sort out right away, if you are OK with it.
Why would small BARs be problematic for the pages? The pages are more a problem for gigantic BARs do the memory overhead.
How do I get a struct page * for a 4k BAR in vfio?
I guess we have different definitions of small then :)
But unless my understanding of the code is out out of data, memremap_pages just requires the (virtual) start address to be 2MB aligned, not the size. Adding Dan for comments.
Don't we need the virtual start address to equal the physical pfn for everything to work properly? eg pfn_to_page?
And we can't over-allocate because another driver might want to also use ZONE_DEVICE pages for its BAR that is now creating a collision.
So, at least as is, the memmap stuff seems unable to support the case we have with VFIO.
That being said, what is the point of mapping say a 4k BAR for p2p? You're not going to save a measurable amount of CPU overhead if that is the only place you transfer to.
For the purpose this series is chasing, it is for doorbell rings. The actual data transfer may still bounce through CPU memory (if a CMB is not available), but the latency reduction of directly signaling the peer device that the transfer is ready is the key objective.
Bouncing an interrupt through the CPU to cause it to do a writel() is very tiem consuming, especially on slow ARM devices, while we have adequate memory bandwidth for data transfer.
When I look at iommufd, it is for generality and compat. We don't have knowledge of what the guest will do, so regardless of BAR size we have to create P2P iommu mappings for every kind of PCI BAR. It is what vfio is currently doing.
Jason