The content of this message was lost. It was probably cross-posted to multiple lists and previously handled on another list.
On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
Yes, you said that, and I said that when the AMD driver first merged it - but it went in anyhow and now people are using it in a bunch of places.
drm folks made up their own weird rules, if they internally stick to it they have to listen to it given that they ignore review comments, but it violates the scatterlist API and has no business anywhere else in the kernel. And yes, there probably is a reason or two why the drm code is unusually error prone.
Why would small BARs be problematic for the pages? The pages are more a problem for gigantic BARs due to the memory overhead.
How do I get a struct page * for a 4k BAR in vfio?
I guess we have different definitions of small then :)
But unless my understanding of the code is out of date, memremap_pages just requires the (virtual) start address to be 2MB aligned, not the size. Adding Dan for comments.
That being said, what is the point of mapping say a 4k BAR for p2p? You're not going to save a measurable amount of CPU overhead if that is the only place you transfer to.
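To put a rough number on the memory overhead mentioned above, here is a minimal userspace sketch. It assumes 64 bytes of struct page per page, which is typical for 64-bit kernels but not guaranteed; memmap_overhead_bytes() and ASSUMED_STRUCT_PAGE_BYTES are hypothetical names for illustration, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one struct page costs 64 bytes (typical on 64-bit kernels). */
#define ASSUMED_STRUCT_PAGE_BYTES 64ULL

/*
 * Hypothetical helper: bytes of memmap needed to give every page of a
 * BAR its own struct page. Rounds a partial last page up to a full page.
 */
static uint64_t memmap_overhead_bytes(uint64_t bar_bytes, uint64_t page_bytes)
{
	assert(page_bytes != 0);
	uint64_t pages = (bar_bytes + page_bytes - 1) / page_bytes;
	return pages * ASSUMED_STRUCT_PAGE_BYTES;
}
```

Under that assumption a 4K BAR costs a single 64-byte entry, while a 64GB BAR with 4K pages costs 1GB of memmap, which is why the overhead only starts to hurt for very large BARs.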
On Wed, Sep 7, 2022 at 5:30 PM Christoph Hellwig hch@infradead.org wrote:
On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
Yes, you said that, and I said that when the AMD driver first merged it - but it went in anyhow and now people are using it in a bunch of places.
drm folks made up their own weird rules, if they internally stick to it they have to listen to it given that they ignore review comments, but it violates the scatterlist API and has no business anywhere else in the kernel. And yes, there probably is a reason or two why the drm code is unusually error prone.
Why would small BARs be problematic for the pages? The pages are more a problem for gigantic BARs due to the memory overhead.
How do I get a struct page * for a 4k BAR in vfio?
I guess we have different definitions of small then :)
But unless my understanding of the code is out of date, memremap_pages just requires the (virtual) start address to be 2MB aligned, not the size. Adding Dan for comments.
That being said, what is the point of mapping say a 4k BAR for p2p? You're not going to save a measurable amount of CPU overhead if that is the only place you transfer to.
I don't know what Jason had in mind, but I can see a use for that for writing to doorbells of a device. Today, usually what happens is that peer A reads/writes to peer B's memory through the large BAR and then signals the host that the operation was completed. Then the host s/w writes to the doorbell of peer B to let it know it can continue with the execution as the data is now ready (or can be recycled). I can imagine peer A writing directly to the doorbell of peer B, and usually for that we would like to expose a very small area, probably a single 4K page.
Oded
On Wed, Sep 07, 2022 at 07:29:58AM -0700, Christoph Hellwig wrote:
On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
Yes, you said that, and I said that when the AMD driver first merged it - but it went in anyhow and now people are using it in a bunch of places.
drm folks made up their own weird rules, if they internally stick to it they have to listen to it given that they ignore review comments, but it violates the scatterlist API and has no business anywhere else in the kernel. And yes, there probably is a reason or two why the drm code is unusually error prone.
That may be, but it is creating problems if DRM gets to do X crazy thing and nobody else can..
So, we have two issues here
1) DMABUF abuses the scatter list, but this is very constrained: we have this weird special "DMABUF scatterlist" that is only touched by DMABUF importers. The importers signal that they understand the format with a flag. This is ugly and would be nice to clean up into a dma mapped address list of some sort.
I spent a lot of time a few years ago removing driver touches of the SGL and preparing the RDMA stack to do this kind of change, at least.
2) DMABUF abuses dma_map_resource() for P2P and thus doesn't work in certain special cases.
Rather than jump to ZONE_DEVICE and map_sgl I would like to continue to support non-struct page mapping. So, I would suggest adding a dma_map_p2p() that can cover off the special cases, include the two struct devices as arguments with a physical PFN/size. Do the same logic we do under the ZONE_DEVICE stuff.
Drivers can choose if they want to pay the memory cost of ZONE_DEVICE and get faster dma_map or get slower dma_map and save memory.
I still think we can address them incrementally - but the dma_map_p2p() might be small enough to sort out right away, if you are OK with it.
Why would small BARs be problematic for the pages? The pages are more a problem for gigantic BARs due to the memory overhead.
How do I get a struct page * for a 4k BAR in vfio?
I guess we have different definitions of small then :)
But unless my understanding of the code is out of date, memremap_pages just requires the (virtual) start address to be 2MB aligned, not the size. Adding Dan for comments.
Don't we need the virtual start address to equal the physical pfn for everything to work properly? eg pfn_to_page?
And we can't over-allocate because another driver might want to also use ZONE_DEVICE pages for its BAR that is now creating a collision.
So, at least as is, the memmap stuff seems unable to support the case we have with VFIO.
That being said, what is the point of mapping say a 4k BAR for p2p? You're not going to save a measurable amount of CPU overhead if that is the only place you transfer to.
For the purpose this series is chasing, it is for doorbell rings. The actual data transfer may still bounce through CPU memory (if a CMB is not available), but the latency reduction of directly signaling the peer device that the transfer is ready is the key objective.
Bouncing an interrupt through the CPU to cause it to do a writel() is very time consuming, especially on slow ARM devices, while we have adequate memory bandwidth for data transfer.
When I look at iommufd, it is for generality and compat. We don't have knowledge of what the guest will do, so regardless of BAR size we have to create P2P iommu mappings for every kind of PCI BAR. It is what vfio is currently doing.
Jason
On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
- DMABUF abuses dma_map_resource() for P2P and thus doesn't work in certain special cases.
Not just certain special cases, but one of the main use cases. Basically P2P can happen in two ways:
a) through a PCIe switch, or b) through connected root ports
The open-coded version here only supports a), and only supports it when there is no offset between the 'physical' address of the BAR as seen by the PCIe endpoint and the Linux way. x86 usually (always?) doesn't have an offset there, but other architectures often do.
Last but not least I don't really see how the code would even work when an IOMMU is used, as dma_map_resource will return an IOVA that is only understood by the IOMMU itself, and not the other endpoint.
How was this code even tested?
On Wed, Sep 07, 2022 at 08:32:23AM -0700, Christoph Hellwig wrote:
On Wed, Sep 07, 2022 at 12:23:28PM -0300, Jason Gunthorpe wrote:
- DMABUF abuses dma_map_resource() for P2P and thus doesn't work in certain special cases.
Not just certain special cases, but one of the main use cases. Basically P2P can happen in two ways:
a) through a PCIe switch, or b) through connected root ports
Yes, we tested both, both work.
The open-coded version here only supports a), and only supports it when there is no offset between the 'physical' address of the BAR as seen by the PCIe endpoint and the Linux way. x86 usually (always?) doesn't have an offset there, but other architectures often do.
The PCI offset is some embedded thing - I've never seen it in a server platform.
Frankly, it is just bad SOC design and there is good reason why non-zero needs to be avoided. As soon as you create aliases between the address spaces you invite trouble. IIRC a SOC I used once put the memory at 0 -> 4G then put the only PCI aperture at 4g -> 4g+N. However this design requires 64 bit PCI support, which at the time, the platform didn't have. So they used PCI offset to hackily alias the aperture over the DDR. I don't remember if they threw out a bit of DDR to resolve the alias, or if they just didn't support PCI switches.
In any case, it is a complete mess. You either drastically limit your BAR size, don't support PCI switches or lose a lot of DDR.
I also seem to remember that iommu and PCI offset don't play nice together - so for the VFIO use case where the iommu is present I'm pretty sure we can very safely assume 0 offset. That seems confirmed by the fact that VFIO has never handled PCI offset in its own P2P path and P2P works fine in VMs across a wide range of platforms.
That said, I agree we should really have APIs that support this properly, and dma_map_resource is certainly technically wrong.
So, would you be OK with this series if I try to make a dma_map_p2p() that resolves the offset issue?
Last but not least I don't really see how the code would even work when an IOMMU is used, as dma_map_resource will return an IOVA that is only understood by the IOMMU itself, and not the other endpoint.
I don't understand this.
__iommu_dma_map() will put the given phys into the iommu_domain associated with 'dev' and return the IOVA it picked.
Here 'dev' is the importing device, it is the device that will issue the DMA:
+    dma_addr = dma_map_resource(
+        attachment->dev,
+        pci_resource_start(priv->vdev->pdev, priv->index) +
+            priv->offset,
+        priv->dmabuf->size, dir, DMA_ATTR_SKIP_CPU_SYNC);
eg attachment->dev is the PCI device of the RDMA device, not the VFIO device.
'phys' is the CPU physical of the PCI BAR page, which with 0 PCI offset is the right thing to program into the IO page table.
How was this code even tested?
It was tested on a few platforms, like I said above, the cases where it doesn't work are special, largely embedded, and not anything we have in our labs - AFAIK.
Jason
On Wed, Sep 07, 2022 at 01:12:52PM -0300, Jason Gunthorpe wrote:
The PCI offset is some embedded thing - I've never seen it in a server platform.
That's not actually true, e.g. some power systems definitely had it, although I don't know if the current ones do.
But that's not the point. The offset is a configuration fully supported by Linux, and something that just works by using the proper APIs. Doing some handwaving about embedded only or bad design doesn't matter. There is a reason why we have these proper APIs and no one has any business bypassing them.
I also seem to remember that iommu and PCI offset don't play nice together - so for the VFIO use case where the iommu is present I'm pretty sure we can very safely assume 0 offset. That seems confirmed by the fact that VFIO has never handled PCI offset in its own P2P path and P2P works fine in VMs across a wide range of platforms.
I think the offset is one of the reasons why IOVA windows can be reserved (and maybe also why ppc is so weird).
So, would you be OK with this series if I try to make a dma_map_p2p() that resolves the offset issue?
Well, if it also solves the other issue of invalid scatterlists leaking outside of drm we can think about it.
Last but not least I don't really see how the code would even work when an IOMMU is used, as dma_map_resource will return an IOVA that is only understood by the IOMMU itself, and not the other endpoint.
I don't understand this.
__iommu_dma_map() will put the given phys into the iommu_domain associated with 'dev' and return the IOVA it picked.
Yes, __iommu_dma_map creates an IOVA for the mapped remote BAR. That is the right thing if the I/O goes through the host bridge, but it is the wrong thing if the I/O goes through the switch - in that case the IOVA generated is not something that the endpoint that owns the BAR can even understand.
Take a look at iommu_dma_map_sg and pci_p2pdma_map_segment to see how this is handled.
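The distinction drawn above can be modelled in a few lines of userspace C. This is only a sketch of the idea, not the kernel's implementation: struct bar_model, enum p2p_route and p2p_dma_addr() are hypothetical names, and iova_base stands in for whatever IOVA the IOMMU layer would have assigned to the BAR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two P2P paths discussed in the thread. */
enum p2p_route {
	ROUTE_VIA_SWITCH,      /* TLPs are routed peer to peer below the host bridge */
	ROUTE_VIA_HOST_BRIDGE  /* TLPs go up to the root complex and the IOMMU */
};

struct bar_model {
	uint64_t cpu_phys_start; /* CPU physical address of the BAR */
	uint64_t bus_start;      /* address the BAR decodes on the PCI bus */
};

/*
 * Address the importing device must be programmed with to reach
 * 'offset' bytes into the peer's BAR.
 */
static uint64_t p2p_dma_addr(const struct bar_model *bar, uint64_t offset,
			     enum p2p_route route, uint64_t iova_base)
{
	assert(bar != 0);
	if (route == ROUTE_VIA_SWITCH) {
		/* Never translated by the IOMMU: the peer only matches its bus address. */
		return bar->bus_start + offset;
	}
	/* Translated by the IOMMU, which maps the IOVA to cpu_phys_start. */
	return iova_base + offset;
}
```

With a non-zero PCI offset (bus_start != cpu_phys_start) or with switch routing, neither address equals the CPU physical address, so a caller that always hands out the CPU physical address or the IOVA gets at least one of the cases wrong.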
On Fri, Sep 09, 2022 at 06:24:35AM -0700, Christoph Hellwig wrote:
On Wed, Sep 07, 2022 at 01:12:52PM -0300, Jason Gunthorpe wrote:
The PCI offset is some embedded thing - I've never seen it in a server platform.
That's not actually true, e.g. some power systems definitely had it, although I don't know if the current ones do.
I thought those were all power embedded systems.
There is a reason why we have these proper APIs and no one has any business bypassing them.
Yes, we should try to support these things, but you said this patch didn't work and wasn't tested - that is not true at all.
And it isn't like we have APIs just sitting here to solve this specific problem. So let's make something.
So, would you be OK with this series if I try to make a dma_map_p2p() that resolves the offset issue?
Well, if it also solves the other issue of invalid scatterlists leaking outside of drm we can think about it.
The scatterlist stuff has already leaked outside of DRM anyhow.
Again, I think it is very problematic to let DRM get away with things and then insist all the poor non-DRM people be responsible to clean up their mess.
I'm skeptical I can fix AMD GPU, but I can try to create a DMABUF op that returns something that is not a scatterlist and teach RDMA to use it. So at least the VFIO/RDMA part can avoid the scatter list abuse. I expected to need non-scatterlist for iommufd anyhow.
Coupled with a series to add some dma_map_resource_pci() that handles the PCI_P2PDMA_MAP_BUS_ADDR and the PCI offset, would it be an agreeable direction?
Take a look at iommu_dma_map_sg and pci_p2pdma_map_segment to see how this is handled.
So there is a bug in all these DMABUF implementations, they do ignore the PCI_P2PDMA_MAP_BUS_ADDR "distance type".
This isn't a real-world problem for VFIO because VFIO is largely incompatible with the non-ACS configuration that would trigger PCI_P2PDMA_MAP_BUS_ADDR, which explains why we never saw any problem. All our systems have ACS turned on so we can use VFIO.
I'm unclear how Habana or AMD have avoided a problem here..
This is much more serious than the pci offset in my mind.
Thanks, Jason
On 2022-09-07 16:23, Jason Gunthorpe wrote:
On Wed, Sep 07, 2022 at 07:29:58AM -0700, Christoph Hellwig wrote:
On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
Yes, you said that, and I said that when the AMD driver first merged it - but it went in anyhow and now people are using it in a bunch of places.
drm folks made up their own weird rules, if they internally stick to it they have to listen to it given that they ignore review comments, but it violates the scatterlist API and has no business anywhere else in the kernel. And yes, there probably is a reason or two why the drm code is unusually error prone.
That may be, but it is creating problems if DRM gets to do X crazy thing and nobody else can..
So, we have two issues here
DMABUF abuses the scatter list, but this is very constrained: we have this weird special "DMABUF scatterlist" that is only touched by DMABUF importers. The importers signal that they understand the format with a flag. This is ugly and would be nice to clean up into a dma mapped address list of some sort.
I spent a lot of time a few years ago removing driver touches of the SGL and preparing the RDMA stack to do this kind of change, at least.
DMABUF abuses dma_map_resource() for P2P and thus doesn't work in certain special cases.
FWIW, dma_map_resource() *is* for P2P in general. The classic case of one device poking at another's registers that was the original motivation is a standalone DMA engine reading/writing a peripheral device's FIFO, so the very similar inter-device doorbell signal is absolutely in scope too; VRAM might be a slightly greyer area, but if it's still not page-backed kernel memory then I reckon that's fair game.
The only trouble is that it's not geared for *PCI* P2P when that may or may not happen entirely upstream of IOMMU translation.
Robin.
On Wed, Sep 07, 2022 at 05:31:14PM +0100, Robin Murphy wrote:
The only trouble is that it's not geared for *PCI* P2P when that may or may not happen entirely upstream of IOMMU translation.
This is why PCI users have to call the pci_distance stuff before using dma_map_resource(), it ensures the PCI fabric is set up in a way that is consistent with the iommu. eg if we have IOMMU turned on then the fabric must have ACS/etc to ensure that all TLPs are translated.
PCI P2P is very complicated and fragile, sadly.
Thanks, Jason
Christoph Hellwig wrote:
On Wed, Sep 07, 2022 at 09:33:11AM -0300, Jason Gunthorpe wrote:
Yes, you said that, and I said that when the AMD driver first merged it - but it went in anyhow and now people are using it in a bunch of places.
drm folks made up their own weird rules, if they internally stick to it they have to listen to it given that they ignore review comments, but it violates the scatterlist API and has no business anywhere else in the kernel. And yes, there probably is a reason or two why the drm code is unusually error prone.
Why would small BARs be problematic for the pages? The pages are more a problem for gigantic BARs due to the memory overhead.
How do I get a struct page * for a 4k BAR in vfio?
I guess we have different definitions of small then :)
But unless my understanding of the code is out of date, memremap_pages just requires the (virtual) start address to be 2MB aligned, not the size. Adding Dan for comments.
The minimum granularity for sparse_add_section() that memremap_pages uses internally is 2MB, so start and end need to be 2MB aligned. Details here:
ba72b4c8cf60 mm/sparsemem: support sub-section hotplug
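The constraint described above, that both start and end must be 2MB aligned, can be written as a small userspace check. This is a sketch only: range_is_subsection_aligned() and SUBSECTION_BYTES are hypothetical names, and the 2MB figure is taken from the message above rather than read from any kernel header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 2MB granularity, as stated in the thread. */
#define SUBSECTION_BYTES (2ULL << 20)

/*
 * Hypothetical check: true only when both the start and the end of
 * [start, start + size) fall on a 2MB boundary.
 */
static bool range_is_subsection_aligned(uint64_t start, uint64_t size)
{
	assert(size != 0);
	return (start % SUBSECTION_BYTES) == 0 &&
	       ((start + size) % SUBSECTION_BYTES) == 0;
}
```

A 4K BAR fails this check even when its start address happens to be 2MB aligned, because its end is not, which is the VFIO case raised earlier in the thread.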
On 2022-09-07 13:33, Jason Gunthorpe wrote:
On Wed, Sep 07, 2022 at 05:05:57AM -0700, Christoph Hellwig wrote:
On Tue, Sep 06, 2022 at 08:48:28AM -0300, Jason Gunthorpe wrote:
Right, this whole thing is the "standard" that dmabuf has adopted instead of the struct pages. Once the AMD GPU driver started doing this some time ago other drivers followed.
But it is simply wrong. The scatterlist requires struct page backing. In theory a physical address would be enough, but when Dan Williams sent patches for that Linus shot them down.
Yes, you said that, and I said that when the AMD driver first merged it - but it went in anyhow and now people are using it in a bunch of places.
I'm happy that Christian wants to start trying to fix it, and will help him, but it doesn't really impact this. Whatever fix is cooked up will apply equally to vfio and habana.
We've just added support for P2P segments in scatterlists, can that not be used here?
Robin.
That being said, the scatterlist is the wrong interface here (and probably for most of its uses). We really want a low-level struct with just the dma_address and length for the DMA side, and leave it separate from whatever is used to generate it (in most cases that would be a bio_vec).
Oh definitely
Now we have struct pages, almost, but I'm not sure if their limits are compatible with VFIO? This has to work for small bars as well.
Why would small BARs be problematic for the pages? The pages are more a problem for gigantic BARs due to the memory overhead.
How do I get a struct page * for a 4k BAR in vfio?
The docs say:
..hotplug api on memory block boundaries. The implementation relies on this lack of user-api constraint to allow sub-section sized memory ranges to be specified to :c:func:`arch_add_memory`, the top-half of memory hotplug. Sub-section support allows for 2MB as the cross-arch common alignment granularity for :c:func:`devm_memremap_pages`.
Jason
linaro-mm-sig@lists.linaro.org