On Tue, Jun 22, 2021 at 11:42:27AM +0300, Oded Gabbay wrote:
On Tue, Jun 22, 2021 at 9:37 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 22.06.21 um 01:29 schrieb Jason Gunthorpe:
On Mon, Jun 21, 2021 at 10:24:16PM +0300, Oded Gabbay wrote:
Another thing I want to emphasize is that we are doing p2p only through the export/import of the FD. We do *not* allow the user to mmap the dma-buf as we do not support direct IO. So there is no access to these pages from userspace.
Arguably mmapping the memory is a better choice, and is the direction that Logan's series goes in. Here the use of DMABUF was specifically designed to allow hitless revocation of the memory, which this isn't even using.
The major problem with this approach is that DMA-buf is also used for memory which isn't CPU accessible.
That isn't an issue here because the memory is only intended to be used with P2P transfers so it must be CPU accessible.
That was one of the reasons we didn't even consider using the memory-mapping approach for GPUs.
Well, now we have DEVICE_PRIVATE memory that can meet this need too.. Just nobody has wired it up to hmm_range_fault()
So you are taking the hit of very limited hardware support and reduced performance just to squeeze into DMABUF..
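For reference, a minimal sketch of what "wiring it up to hmm_range_fault()" means on the CPU-VA side, using the standard mmu_interval_notifier retry pattern; this is an assumption-laden illustration, and my_pgmap_owner is a placeholder for whatever owner cookie the driver registered its DEVICE_PRIVATE pages with, not anything from this patch set:

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mm.h>

static int fault_user_range(struct mmu_interval_notifier *notifier,
			    unsigned long start, unsigned long npages,
			    unsigned long *pfns, void *my_pgmap_owner)
{
	struct hmm_range range = {
		.notifier = notifier,
		.start = start,
		.end = start + (npages << PAGE_SHIFT),
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
		/* DEVICE_PRIVATE pages owned by this driver are returned
		 * to it instead of being migrated back to system RAM. */
		.dev_private_owner = my_pgmap_owner,
	};
	int ret;

	do {
		range.notifier_seq = mmu_interval_read_begin(notifier);
		mmap_read_lock(notifier->mm);
		ret = hmm_range_fault(&range);
		mmap_read_unlock(notifier->mm);
		if (ret && ret != -EBUSY)
			return ret;
		/* A real driver re-checks the sequence under its own lock
		 * before programming device page tables. */
	} while (ret == -EBUSY ||
		 mmu_interval_read_retry(notifier, range.notifier_seq));

	return 0;
}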
Thanks Jason for the clarification, but I honestly prefer to use DMA-BUF at the moment. It gives us just what we need (even more than what we need, as you pointed out), it is *already* integrated and tested in the RDMA subsystem, and I feel comfortable using it as I'm somewhat familiar with it from my AMD days.
You still have the issue that this patch is doing all of this P2P stuff wrong - following the already NAK'd AMD approach.
I'll go and read Logan's patch-set to see if that will work for us in the future. Please remember, as Daniel said, we don't have struct page backing our device memory, so if that is a requirement to connect to Logan's work, then I don't think we will want to do it at this point.
It is trivial to get the struct page for a PCI BAR.
Jason
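For context on "trivial to get the struct page for a PCI BAR": a hedged sketch of the existing p2pdma API that does exactly that. The BAR index and chunk size are placeholders, not values from this driver:

#include <linux/pci.h>
#include <linux/pci-p2pdma.h>
#include <linux/sizes.h>
#include <linux/mm.h>

static struct page *bar_chunk_to_page(struct pci_dev *pdev)
{
	void *vaddr;
	int rc;

	/* Register the BAR with the p2pdma core; this creates ZONE_DEVICE
	 * struct pages (type MEMORY_DEVICE_PCI_P2PDMA) covering the BAR. */
	rc = pci_p2pdma_add_resource(pdev, 0 /* placeholder BAR index */,
				     pci_resource_len(pdev, 0), 0);
	if (rc)
		return NULL;

	/* Carve out a chunk; the returned kernel address is backed by the
	 * struct pages created above, so virt_to_page() just works.
	 * A real driver frees it with pci_free_p2pmem() when done. */
	vaddr = pci_alloc_p2pmem(pdev, SZ_2M /* placeholder size */);
	if (!vaddr)
		return NULL;

	return virt_to_page(vaddr);
}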
Am 22.06.21 um 14:01 schrieb Jason Gunthorpe:
On Tue, Jun 22, 2021 at 11:42:27AM +0300, Oded Gabbay wrote:
On Tue, Jun 22, 2021 at 9:37 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 22.06.21 um 01:29 schrieb Jason Gunthorpe:
On Mon, Jun 21, 2021 at 10:24:16PM +0300, Oded Gabbay wrote:
Another thing I want to emphasize is that we are doing p2p only through the export/import of the FD. We do *not* allow the user to mmap the dma-buf as we do not support direct IO. So there is no access to these pages from userspace.
Arguably mmapping the memory is a better choice, and is the direction that Logan's series goes in. Here the use of DMABUF was specifically designed to allow hitless revocation of the memory, which this isn't even using.
The major problem with this approach is that DMA-buf is also used for memory which isn't CPU accessible.
That isn't an issue here because the memory is only intended to be used with P2P transfers so it must be CPU accessible.
No, P2P in particular is often done on memory resources which are not even remotely CPU accessible.
That's one of the major reasons why we use P2P in the first place. See the whole XGMI implementation for example.
Thanks Jason for the clarification, but I honestly prefer to use DMA-BUF at the moment. It gives us just what we need (even more than what we need, as you pointed out), it is *already* integrated and tested in the RDMA subsystem, and I feel comfortable using it as I'm somewhat familiar with it from my AMD days.
That was one of the reasons we didn't even consider using the memory-mapping approach for GPUs.
Well, now we have DEVICE_PRIVATE memory that can meet this need too.. Just nobody has wired it up to hmm_range_fault()
So you are taking the hit of very limited hardware support and reduced performance just to squeeze into DMABUF..
You still have the issue that this patch is doing all of this P2P stuff wrong - following the already NAK'd AMD approach.
Well that stuff was NAKed because we still use sg_tables, not because we don't want to allocate struct pages.
The plan is to push this forward since DEVICE_PRIVATE clearly can't handle all of our use cases and is not really a good fit to be honest.
IOMMU is now working as well, so as far as I can see we are all good here.
I'll go and read Logan's patch-set to see if that will work for us in the future. Please remember, as Daniel said, we don't have struct page backing our device memory, so if that is a requirement to connect to Logan's work, then I don't think we will want to do it at this point.
It is trivial to get the struct page for a PCI BAR.
Yeah, but it doesn't make much sense. Why should we create a struct page for something that isn't even memory in a lot of cases?
Regards, Christian.
On Tue, Jun 22, 2021 at 02:23:03PM +0200, Christian König wrote:
Am 22.06.21 um 14:01 schrieb Jason Gunthorpe:
On Tue, Jun 22, 2021 at 11:42:27AM +0300, Oded Gabbay wrote:
On Tue, Jun 22, 2021 at 9:37 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 22.06.21 um 01:29 schrieb Jason Gunthorpe:
On Mon, Jun 21, 2021 at 10:24:16PM +0300, Oded Gabbay wrote:
Another thing I want to emphasize is that we are doing p2p only through the export/import of the FD. We do *not* allow the user to mmap the dma-buf as we do not support direct IO. So there is no access to these pages from userspace.
Arguably mmapping the memory is a better choice, and is the direction that Logan's series goes in. Here the use of DMABUF was specifically designed to allow hitless revocation of the memory, which this isn't even using.
The major problem with this approach is that DMA-buf is also used for memory which isn't CPU accessible.
That isn't an issue here because the memory is only intended to be used with P2P transfers so it must be CPU accessible.
No, P2P in particular is often done on memory resources which are not even remotely CPU accessible.
That is a special AMD thing; P2P here is PCI P2P, and all PCI memory is CPU accessible.
So you are taking the hit of very limited hardware support and reduced performance just to squeeze into DMABUF..
You still have the issue that this patch is doing all of this P2P stuff wrong - following the already NAK'd AMD approach.
Well that stuff was NAKed because we still use sg_tables, not because we don't want to allocate struct pages.
sg lists in general.
The plan is to push this forward since DEVICE_PRIVATE clearly can't handle all of our use cases and is not really a good fit to be honest.
IOMMU is now working as well, so as far as I can see we are all good here.
How? Is that more AMD special stuff?
This patch series never calls into the iommu driver, AFAICT.
I'll go and read Logan's patch-set to see if that will work for us in the future. Please remember, as Daniel said, we don't have struct page backing our device memory, so if that is a requirement to connect to Logan's work, then I don't think we will want to do it at this point.
It is trivial to get the struct page for a PCI BAR.
Yeah, but it doesn't make much sense. Why should we create a struct page for something that isn't even memory in a lot of cases?
Because the iommu and other places need this handle to set up their stuff. Nobody has yet been brave enough to try to change those flows to be able to use a physical CPU address.
This is why we have a special struct page type just for PCI BAR memory.
Jason
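A hedged illustration of how that special page type is consumed today, similar to what the NVMe driver does: the struct page is the handle that tells the mapping code whether an sg entry points at PCI BAR space or at system RAM. The helper name below is a placeholder:

#include <linux/dma-mapping.h>
#include <linux/pci-p2pdma.h>
#include <linux/scatterlist.h>
#include <linux/mm.h>
#include <linux/memremap.h>

static int map_sg_maybe_p2p(struct device *dma_dev, struct scatterlist *sgl,
			    int nents, enum dma_data_direction dir)
{
	/* BAR-backed pages (MEMORY_DEVICE_PCI_P2PDMA) go through the
	 * p2pdma mapping helper, which resolves PCI bus addresses and has
	 * already validated the path between the two devices. */
	if (is_pci_p2pdma_page(sg_page(sgl)))
		return pci_p2pdma_map_sg(dma_dev, sgl, nents, dir);

	/* Ordinary system memory uses the regular streaming DMA API. */
	return dma_map_sg(dma_dev, sgl, nents, dir);
}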
Am 22.06.21 um 17:23 schrieb Jason Gunthorpe:
On Tue, Jun 22, 2021 at 02:23:03PM +0200, Christian König wrote:
Am 22.06.21 um 14:01 schrieb Jason Gunthorpe:
On Tue, Jun 22, 2021 at 11:42:27AM +0300, Oded Gabbay wrote:
On Tue, Jun 22, 2021 at 9:37 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 22.06.21 um 01:29 schrieb Jason Gunthorpe:
On Mon, Jun 21, 2021 at 10:24:16PM +0300, Oded Gabbay wrote:
Another thing I want to emphasize is that we are doing p2p only through the export/import of the FD. We do *not* allow the user to mmap the dma-buf as we do not support direct IO. So there is no access to these pages from userspace.
Arguably mmapping the memory is a better choice, and is the direction that Logan's series goes in. Here the use of DMABUF was specifically designed to allow hitless revocation of the memory, which this isn't even using.
The major problem with this approach is that DMA-buf is also used for memory which isn't CPU accessible.
That isn't an issue here because the memory is only intended to be used with P2P transfers so it must be CPU accessible.
No, P2P in particular is often done on memory resources which are not even remotely CPU accessible.
That is a special AMD thing; P2P here is PCI P2P, and all PCI memory is CPU accessible.
No, absolutely not. NVidia GPUs work exactly the same way.
And you have tons of similar cases in embedded and SoC systems where intermediate memory between devices isn't directly addressable with the CPU.
So you are taking the hit of very limited hardware support and reduced performance just to squeeze into DMABUF..
You still have the issue that this patch is doing all of this P2P stuff wrong - following the already NAK'd AMD approach.
Well that stuff was NAKed because we still use sg_tables, not because we don't want to allocate struct pages.
sg lists in general.
The plan is to push this forward since DEVICE_PRIVATE clearly can't handle all of our use cases and is not really a good fit to be honest.
IOMMU is now working as well, so as far as I can see we are all good here.
How? Is that more AMD special stuff?
No, just using the dma_map_resource() interface.
We have that working on tons of IOMMU enabled systems.
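A hedged sketch of the dma_map_resource() path being described here: the importing device maps a raw BAR physical address, with no struct page involved. bar_phys, size, and the function name are placeholders:

#include <linux/dma-mapping.h>

static dma_addr_t map_peer_bar(struct device *importer,
			       phys_addr_t bar_phys, size_t size)
{
	dma_addr_t dma_addr;

	/* With an IOMMU this allocates an IOVA pointing at the peer's BAR;
	 * in the direct-mapped case it is essentially the bus address. */
	dma_addr = dma_map_resource(importer, bar_phys, size,
				    DMA_BIDIRECTIONAL, 0);
	if (dma_mapping_error(importer, dma_addr))
		return DMA_MAPPING_ERROR;

	/* Undo with dma_unmap_resource(importer, dma_addr, size,
	 * DMA_BIDIRECTIONAL, 0) when the mapping is torn down. */
	return dma_addr;
}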
This patch series never calls into the iommu driver, AFAICT.
I'll go and read Logan's patch-set to see if that will work for us in the future. Please remember, as Daniel said, we don't have struct page backing our device memory, so if that is a requirement to connect to Logan's work, then I don't think we will want to do it at this point.
It is trivial to get the struct page for a PCI BAR.
Yeah, but it doesn't make much sense. Why should we create a struct page for something that isn't even memory in a lot of cases?
Because the iommu and other places need this handle to set up their stuff. Nobody has yet been brave enough to try to change those flows to be able to use a physical CPU address.
Well that is certainly not true. I'm just not sure if that works with all IOMMU drivers, though.
I would need to ping Felix about when the support for this was merged.
Regards, Christian.
This is why we have a special struct page type just for PCI BAR memory.
Jason
On Tue, Jun 22, 2021 at 05:29:01PM +0200, Christian König wrote:
Am 22.06.21 um 17:23 schrieb Jason Gunthorpe:
On Tue, Jun 22, 2021 at 02:23:03PM +0200, Christian König wrote:
Am 22.06.21 um 14:01 schrieb Jason Gunthorpe:
On Tue, Jun 22, 2021 at 11:42:27AM +0300, Oded Gabbay wrote:
On Tue, Jun 22, 2021 at 9:37 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 22.06.21 um 01:29 schrieb Jason Gunthorpe:
On Mon, Jun 21, 2021 at 10:24:16PM +0300, Oded Gabbay wrote:
Another thing I want to emphasize is that we are doing p2p only through the export/import of the FD. We do *not* allow the user to mmap the dma-buf as we do not support direct IO. So there is no access to these pages from userspace.
Arguably mmapping the memory is a better choice, and is the direction that Logan's series goes in. Here the use of DMABUF was specifically designed to allow hitless revocation of the memory, which this isn't even using.
The major problem with this approach is that DMA-buf is also used for memory which isn't CPU accessible.
That isn't an issue here because the memory is only intended to be used with P2P transfers so it must be CPU accessible.
No, P2P in particular is often done on memory resources which are not even remotely CPU accessible.
That is a special AMD thing; P2P here is PCI P2P, and all PCI memory is CPU accessible.
No, absolutely not. NVidia GPUs work exactly the same way.
And you have tons of similar cases in embedded and SoC systems where intermediate memory between devices isn't directly addressable with the CPU.
None of that is PCI P2P.
It is all some specialty direct transfer.
You can't reasonably call dma_map_resource() on non-CPU-mapped memory, for instance; what address would you pass?
Do not confuse "I am doing transfers between two HW blocks" with PCI Peer to Peer DMA transfers - the latter is a very narrow subcase.
No, just using the dma_map_resource() interface.
Ick, but yes, that does "work". Logan's series is better.
I'll go and read Logan's patch-set to see if that will work for us in the future. Please remember, as Daniel said, we don't have struct page backing our device memory, so if that is a requirement to connect to Logan's work, then I don't think we will want to do it at this point.
It is trivial to get the struct page for a PCI BAR.
Yeah, but it doesn't make much sense. Why should we create a struct page for something that isn't even memory in a lot of cases?
Because the iommu and other places need this handle to set up their stuff. Nobody has yet been brave enough to try to change those flows to be able to use a physical CPU address.
Well that is certainly not true. I'm just not sure if that works with all IOMMU drivers, though.
Huh? All the iommu interfaces except for dma_map_resource() are struct page based. dma_map_resource() is slow and limited in what it can do.
Jason
Am 22.06.21 um 17:40 schrieb Jason Gunthorpe:
On Tue, Jun 22, 2021 at 05:29:01PM +0200, Christian König wrote:
[SNIP] No, absolutely not. NVidia GPUs work exactly the same way.
And you have tons of similar cases in embedded and SoC systems where intermediate memory between devices isn't directly addressable with the CPU.
None of that is PCI P2P.
It is all some specialty direct transfer.
You can't reasonably call dma_map_resource() on non-CPU-mapped memory, for instance; what address would you pass?
Do not confuse "I am doing transfers between two HW blocks" with PCI Peer to Peer DMA transfers - the latter is a very narrow subcase.
No, just using the dma_map_resource() interface.
Ick, but yes, that does "work". Logan's series is better.
No, it isn't. It makes devices depend on allocating struct pages for their BARs, which is neither necessary nor desired.
How do you prevent direct I/O on those pages for example?
Allocating struct pages has its use cases, for example exposing VRAM as memory for HMM. But that is something very specific and should not limit PCIe P2P DMA in general.
[SNIP] Well that is certainly not true. I'm just not sure if that works with all IOMMU drivers, though.
Huh? All the iommu interfaces except for dma_map_resource() are struct page based. dma_map_resource() is slow and limited in what it can do.
Yeah, but that is exactly the functionality we need. And as far as I can see that is also what Oded wants here.
Mapping stuff into userspace and then doing direct DMA to it is only a very limited use case and we need to be more flexible here.
Christian.
Jason
On Tue, Jun 22, 2021 at 05:48:10PM +0200, Christian König wrote:
Am 22.06.21 um 17:40 schrieb Jason Gunthorpe:
On Tue, Jun 22, 2021 at 05:29:01PM +0200, Christian König wrote:
[SNIP] No, absolutely not. NVidia GPUs work exactly the same way.
And you have tons of similar cases in embedded and SoC systems where intermediate memory between devices isn't directly addressable with the CPU.
None of that is PCI P2P.
It is all some specialty direct transfer.
You can't reasonably call dma_map_resource() on non-CPU-mapped memory, for instance; what address would you pass?
Do not confuse "I am doing transfers between two HW blocks" with PCI Peer to Peer DMA transfers - the latter is a very narrow subcase.
No, just using the dma_map_resource() interface.
Ick, but yes, that does "work". Logan's series is better.
No, it isn't. It makes devices depend on allocating struct pages for their BARs, which is neither necessary nor desired.
Which dramatically reduces the cost of establishing DMA mappings; a loop of dma_map_resource() is very expensive.
How do you prevent direct I/O on those pages for example?
GUP fails.
Allocating struct pages has its use cases, for example exposing VRAM as memory for HMM. But that is something very specific and should not limit PCIe P2P DMA in general.
Sure, but that is an ideal we are far from obtaining, and nobody wants to work on it, preferring to do hacky hacky like this.
If you believe in this then remove the scatter list from dmabuf, add a new set of dma_map* APIs to work on physical addresses and all the other stuff needed.
Otherwise, we have what we have and drivers don't get to opt out. This is why the stuff in AMDGPU was NAK'd.
Jason
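A hedged illustration of the cost argument above: with struct pages behind the exporter's memory, an importer can map a whole buffer with one dma_map_sgtable() call, while the page-less path needs one dma_map_resource() call per contiguous BAR chunk. The chunk structure and helper names are placeholders:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

struct bar_chunk {
	phys_addr_t phys;
	size_t len;
	dma_addr_t dma;
};

/* Page-backed buffer: one call covers every entry in the sg_table
 * (assuming the DMA layer understands the pages behind it). */
static int map_with_pages(struct device *dev, struct sg_table *sgt)
{
	return dma_map_sgtable(dev, sgt, DMA_BIDIRECTIONAL, 0);
}

/* Page-less buffer: one dma_map_resource() call per contiguous chunk,
 * each of which may cost an IOMMU allocation. */
static int map_without_pages(struct device *dev, struct bar_chunk *chunks,
			     unsigned int nr_chunks)
{
	unsigned int i;

	for (i = 0; i < nr_chunks; i++) {
		chunks[i].dma = dma_map_resource(dev, chunks[i].phys,
						 chunks[i].len,
						 DMA_BIDIRECTIONAL, 0);
		if (dma_mapping_error(dev, chunks[i].dma))
			return -EIO;	/* caller unwinds prior mappings */
	}
	return 0;
}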
Am 22.06.21 um 18:05 schrieb Jason Gunthorpe:
On Tue, Jun 22, 2021 at 05:48:10PM +0200, Christian König wrote:
Am 22.06.21 um 17:40 schrieb Jason Gunthorpe:
On Tue, Jun 22, 2021 at 05:29:01PM +0200, Christian König wrote:
[SNIP] No, absolutely not. NVidia GPUs work exactly the same way.
And you have tons of similar cases in embedded and SoC systems where intermediate memory between devices isn't directly addressable with the CPU.
None of that is PCI P2P.
It is all some specialty direct transfer.
You can't reasonably call dma_map_resource() on non-CPU-mapped memory, for instance; what address would you pass?
Do not confuse "I am doing transfers between two HW blocks" with PCI Peer to Peer DMA transfers - the latter is a very narrow subcase.
No, just using the dma_map_resource() interface.
Ick, but yes, that does "work". Logan's series is better.
No, it isn't. It makes devices depend on allocating struct pages for their BARs, which is neither necessary nor desired.
Which dramatically reduces the cost of establishing DMA mappings; a loop of dma_map_resource() is very expensive.
Yeah, but that is perfectly ok. Our BAR allocations are either in chunks of at least 2MiB or only a single 4KiB page.
Oded might run into more performance problems, but those DMA-buf mappings are usually set up only once.
How do you prevent direct I/O on those pages for example?
GUP fails.
At least that is reassuring.
Allocating struct pages has its use cases, for example exposing VRAM as memory for HMM. But that is something very specific and should not limit PCIe P2P DMA in general.
Sure, but that is an ideal we are far from obtaining, and nobody wants to work on it, preferring to do hacky hacky like this.
If you believe in this then remove the scatter list from dmabuf, add a new set of dma_map* APIs to work on physical addresses and all the other stuff needed.
Yeah, that's something I totally agree on. And I actually hoped that the new P2P work for PCIe would go in that direction, but that didn't materialize.
But allocating struct pages for PCIe BARs, which are essentially registers and not memory, is much more hacky than the dma_map_resource() approach.
To reiterate why I think having struct pages for those BARs is a bad idea: our doorbells on AMD GPUs are write and read pointers for ring buffers.
When you write to the BAR you essentially tell the firmware that you have either filled the ring buffer or read a bunch of it. This in turn triggers an interrupt in the hardware/firmware, which may have been asleep.
By using PCIe P2P we want to avoid the round trip to the CPU when one device has filled the ring buffer and another device must be woken up to process it.
Think of it as MSI-X in reverse; allocating struct pages for those BARs just to work around the shortcomings of the DMA API makes no sense at all to me.
We also do have the VRAM BAR, and for HMM we do allocate struct pages for the address range exposed there. But this is a different use case.
Regards, Christian.
Otherwise, we have what we have and drivers don't get to opt out. This is why the stuff in AMDGPU was NAK'd.
Jason
On Wed, Jun 23, 2021 at 10:57:35AM +0200, Christian König wrote:
No, it isn't. It makes devices depend on allocating struct pages for their BARs, which is neither necessary nor desired.
Which dramatically reduces the cost of establishing DMA mappings; a loop of dma_map_resource() is very expensive.
Yeah, but that is perfectly ok. Our BAR allocations are either in chunks of at least 2MiB or only a single 4KiB page.
And very small, apparently.
Allocating struct pages has its use cases, for example exposing VRAM as memory for HMM. But that is something very specific and should not limit PCIe P2P DMA in general.
Sure, but that is an ideal we are far from obtaining, and nobody wants to work on it, preferring to do hacky hacky like this.
If you believe in this then remove the scatter list from dmabuf, add a new set of dma_map* APIs to work on physical addresses and all the other stuff needed.
Yeah, that's something I totally agree on. And I actually hoped that the new P2P work for PCIe would go in that direction, but that didn't materialize.
It is a lot of work and the only gain is to save a bit of memory for struct pages. Not a very big payoff.
But allocating struct pages for PCIe BARs, which are essentially registers and not memory, is much more hacky than the dma_map_resource() approach.
It doesn't really matter. The pages are in a special zone and are only being used as handles for the BAR memory.
By using PCIe P2P we want to avoid the round trip to the CPU when one device has filled the ring buffer and another device must be woken up to process it.
Sure, we all have these scenarios; what is inside the memory doesn't really matter. The mechanism is generic and the struct pages don't care much if they point at something memory-like or at something register-like.
They are already in big trouble because you can't portably use CPU instructions to access them anyhow.
Jason
Am 2021-06-22 um 11:29 a.m. schrieb Christian König:
Am 22.06.21 um 17:23 schrieb Jason Gunthorpe:
On Tue, Jun 22, 2021 at 02:23:03PM +0200, Christian König wrote:
Am 22.06.21 um 14:01 schrieb Jason Gunthorpe:
On Tue, Jun 22, 2021 at 11:42:27AM +0300, Oded Gabbay wrote:
On Tue, Jun 22, 2021 at 9:37 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 22.06.21 um 01:29 schrieb Jason Gunthorpe:
On Mon, Jun 21, 2021 at 10:24:16PM +0300, Oded Gabbay wrote:
Another thing I want to emphasize is that we are doing p2p only through the export/import of the FD. We do *not* allow the user to mmap the dma-buf as we do not support direct IO. So there is no access to these pages from userspace.
Arguably mmapping the memory is a better choice, and is the direction that Logan's series goes in. Here the use of DMABUF was specifically designed to allow hitless revocation of the memory, which this isn't even using.
The major problem with this approach is that DMA-buf is also used for memory which isn't CPU accessible.
That isn't an issue here because the memory is only intended to be used with P2P transfers so it must be CPU accessible.
No, P2P in particular is often done on memory resources which are not even remotely CPU accessible.
That is a special AMD thing; P2P here is PCI P2P, and all PCI memory is CPU accessible.
No, absolutely not. NVidia GPUs work exactly the same way.
And you have tons of similar cases in embedded and SoC systems where intermediate memory between devices isn't directly addressable with the CPU.
So you are taking the hit of very limited hardware support and reduced performance just to squeeze into DMABUF..
You still have the issue that this patch is doing all of this P2P stuff wrong - following the already NAK'd AMD approach.
Well that stuff was NAKed because we still use sg_tables, not because we don't want to allocate struct pages.
sg lists in general.
The plan is to push this forward since DEVICE_PRIVATE clearly can't handle all of our use cases and is not really a good fit to be honest.
IOMMU is now working as well, so as far as I can see we are all good here.
How? Is that more AMD special stuff?
No, just using the dma_map_resource() interface.
We have that working on tons of IOMMU enabled systems.
This patch series never calls into the iommu driver, AFAICT.
I'll go and read Logan's patch-set to see if that will work for us in the future. Please remember, as Daniel said, we don't have struct page backing our device memory, so if that is a requirement to connect to Logan's work, then I don't think we will want to do it at this point.
It is trivial to get the struct page for a PCI BAR.
Yeah, but it doesn't make much sense. Why should we create a struct page for something that isn't even memory in a lot of cases?
Because the iommu and other places need this handle to set up their stuff. Nobody has yet been brave enough to try to change those flows to be able to use a physical CPU address.
Well that is certainly not true. I'm just not sure if that works with all IOMMU drivers, though.
I would need to ping Felix about when the support for this was merged.
We have been working on IOMMU support for all our multi-GPU memory mappings in KFD. The PCIe P2P side of this is currently only merged on our internal branch. Before we can actually use this, we need CONFIG_DMABUF_MOVE_NOTIFY enabled (which is still documented as experimental and disabled by default). Otherwise we'll end up pinning all our VRAM.
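For reference, a hedged sketch of the importer-side contract this depends on: a dynamic dma-buf attachment whose move_notify callback lets the exporter relocate the buffer instead of keeping it pinned, with allow_peer2peer opting in to mappings of PCIe BAR memory. my_invalidate_mapping() is a placeholder for driver-specific teardown:

#include <linux/dma-buf.h>

/* Placeholder: driver-specific teardown of any cached DMA mapping. */
void my_invalidate_mapping(void *importer_priv);

static void my_move_notify(struct dma_buf_attachment *attach)
{
	/* Exporter is about to move/evict the buffer: drop our mapping and
	 * re-map with dma_buf_map_attachment() on next use. */
	my_invalidate_mapping(attach->importer_priv);
}

static const struct dma_buf_attach_ops my_attach_ops = {
	.allow_peer2peer = true,	/* importer can DMA to BAR memory */
	.move_notify = my_move_notify,
};

static struct dma_buf_attachment *my_attach(struct dma_buf *dmabuf,
					    struct device *dev, void *priv)
{
	return dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, priv);
}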
I think we'll try to put together an upstream patch series of all our PCIe P2P support in a few weeks or so. This will include IOMMU mappings, checking that PCIe P2P is actually possible between two devices, and KFD topology updates to correctly report those capabilities to user mode.
It will not use struct pages for exported VRAM buffers.
Regards, Felix
Regards, Christian.
This is why we have a special struct page type just for PCI BAR memory.
Jason