On 12.12.22 04:00, Tomasz Figa wrote:
[SNIP]
What we could do is to force all exporters to use begin/end_cpu_access() even on their own buffers and look at all the importers when the access is completed. But that would increase the complexity of the handling in the exporter.
I feel like they should be doing so anyway, because it often depends on the SoC integration whether the DMA can do cache snooping or not.
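To make the point above concrete, here is a minimal importer-side sketch (the function name and surrounding driver structure are hypothetical; the dma_buf_begin/end_cpu_access() calls are the real kernel API) showing how CPU access would be bracketed so that whatever cache maintenance the platform needs can be inserted by the framework, independent of whether this particular SoC integration is cache-coherent:

```c
#include <linux/dma-buf.h>
#include <linux/dma-direction.h>

/* Hypothetical importer helper: read device-produced data through the CPU. */
static int importer_read_buffer(struct dma_buf *dmabuf)
{
	int ret;

	/* Gives the exporter a chance to invalidate stale CPU cache lines. */
	ret = dma_buf_begin_cpu_access(dmabuf, DMA_FROM_DEVICE);
	if (ret)
		return ret;

	/* ... CPU reads of the (vmap'ed) buffer contents go here ... */

	/* Hands ownership back; cache maintenance again up to the exporter. */
	return dma_buf_end_cpu_access(dmabuf, DMA_FROM_DEVICE);
}
```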
Yeah, but that wouldn't help without the exporter taking care of the importers' cache-flushing needs. And that's exactly what we try hard to avoid, because it puts code into every exporter which is never tested except in the case where that exporter is combined with a non-coherent importer.
That's a testing nightmare, because then you have code everywhere which is actually exercised only in very few combinations of exporter and importer.
Although arguably, there is a corner case today where if one uses dma_alloc_coherent() to get a buffer with a coherent CPU mapping for device X that is declared as cache-coherent, one also expects not to need to call begin/end_cpu_access(), but those would be needed if the buffer was to be imported by device Y that is not cache-coherent...
Sounds like after all it's a mess. I guess your patches make it one step closer to something sensible, import would fail in such cases. Although arguably we should be able to still export from driver Y and import to driver X just fine if Y allocated the buffer as coherent - otherwise we would break existing users for whom things worked fine.
Allocating the buffer as coherent won't help in this case because we actually do want CPU caching for performant access. It's just that some device needs a cache flush before it sees all the changes.
As far as I can see, without adding additional complexity to the exporter this can only be achieved in two ways:
1. Switch the role of the exporter and importer. This way the device with the need for the cache flush is the exporter and in control of the operations on its buffers.
2. We use DMA-buf as neutral mediator. Since DMA-buf keeps track of who has mapped the buffers it inserts the appropriate dma_sync_*_for_device() calls.
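Option 2 could be sketched roughly as follows. This is a hypothetical mediator helper, not existing DMA-buf code; it assumes the framework walks its tracked attachments and inserts the real dma_sync_single_for_device() call for every importer that is not cache-coherent:

```c
#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/dma-map-ops.h>

/*
 * Hypothetical mediator step: before non-coherent importers access the
 * buffer, flush CPU caches on their behalf.  The attachments list and
 * dev_is_dma_coherent() are real kernel interfaces; the helper itself
 * and its call site are assumptions for illustration.
 */
static void mediator_flush_for_importers(struct dma_buf *dmabuf,
					 dma_addr_t dma_addr, size_t size)
{
	struct dma_buf_attachment *attach;

	list_for_each_entry(attach, &dmabuf->attachments, node)
		if (!dev_is_dma_coherent(attach->dev))
			dma_sync_single_for_device(attach->dev, dma_addr,
						   size, DMA_TO_DEVICE);
}
```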
[SNIP] The best we can do is to reject combinations which won't work in the kernel and then userspace could react accordingly.
The question is whether userspace is able to deal with it, given the limited amount of information it gets from the kernel. Sure, there is always the ultimate fallback of memcpy(), but in many cases that would be totally unusable due to severe performance impact. If we were to remove the existing extent of implicit handling from the kernel, I think we need to first give the userspace the information necessary to explicitly handle the fallback to the same extent.
Good point.
We also need to think about backwards compatibility. Simply removing the implicit fallback cases would probably break a lot of userspace, so an opt-in behavior is likely needed initially...
Yes, I'm completely aware of that as well.
We can't hard-break userspace, even if the previous approach didn't work 100% correctly.
That's essentially the reason why we have DMA-buf heaps. Those heaps expose system memory buffers with certain properties (size, CMA, DMA bit restrictions etc...) and also implement the callback functions for CPU cache maintenance.
How do DMA-buf heaps change anything here? We already have CPU cache maintenance callbacks in the DMA-buf API: begin/end_cpu_access() for CPU accesses and dma_buf_map/unmap_attachment() for device accesses (which arguably shouldn't actually do CPU cache maintenance, unless that's implied by how either of the involved DMA engines works).
DMA-buf heaps are the neutral man in the middle.
The implementation keeps track of all the attached importers and should make sure that the allocated memory fits the needs of everyone. In addition to that, calls to the DMA API cache-management functions are inserted whenever CPU access begins/ends.
I think in the current design it only knows all the importers after the buffer is already allocated, so it doesn't necessarily have a way to handle the allocation constraints. Something would have to be done to get all the importers attached before the allocation actually takes place.
That's already in place. See the attach and map callbacks.
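A sketch of what "already in place" could mean for an exporter: the attach callback runs per importer, so a heap could in principle accumulate constraints there and defer the real allocation to the first mapping. The heap_buffer type and dma_mask accumulation below are hypothetical; only the dma_buf_ops attach signature and dma_get_mask() are real kernel interfaces:

```c
#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>

/* Hypothetical per-buffer state of a constraint-aware heap exporter. */
struct heap_buffer {
	u64 dma_mask;	/* intersection of all importers' masks */
};

static int heap_attach(struct dma_buf *dmabuf,
		       struct dma_buf_attachment *attach)
{
	struct heap_buffer *buf = dmabuf->priv;

	/* Narrow the allocation constraints to what this importer can reach;
	 * the actual allocation would be deferred until first map_dma_buf. */
	buf->dma_mask &= dma_get_mask(attach->dev);
	return 0;
}
```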
I'm just not sure if the heaps fully implement it like this.
The problem is that in this particular case the exporter provides the buffer as is, e.g. with dirty CPU caches. And the importer can't deal with that.
Why does the exporter leave the buffer with dirty CPU caches?
Because exporters always export buffers the way they would use them themselves. And in this case that means writing to the buffer with the CPU.
Sorry for the question not being very clear. I meant: How do the CPU caches get dirty in that case?
The exporter wrote to it. As far as I understand, the exporter just copies things from A to B with memcpy() to construct the buffer contents.
Okay, so it's just due to CPU access, basically what we touched on a few paragraphs above.
Yes, I've never seen a device which actually dirties the CPU cache. But I would never rule out that such a device exists.
Regards, Christian.
[SNIP]
Yes, totally agree. The problem is really that we mixed a bunch of MM and DMA functions into one API.
The bounce buffers are something we should really have in a separate component.
Then the functionality of allocating system memory for a specific device or devices should be something provided by the MM.
And finally, making this memory or any other CPU address accessible to a device (IOMMU programming etc.) should then be part of the DMA API.
Remember that actually making the memory accessible to a device often needs to be handled already as a part of the allocation (e.g. dma_mask in the non-IOMMU case). So I tend to think that the current division of responsibilities is mostly fine - the dma_alloc family should be seen as a part of MM already, especially with all the recent improvements from Christoph, like dma_alloc_pages().
Yes, that's indeed a very interesting development which as far as I can see goes into the right direction.
That said, it may indeed make sense to move things like IOMMU mapping management out of the dma_alloc() and just reduce those functions to simply returning a set of pages that satisfy the allocation constraints. One would need to call dma_map() after the allocation, but that's actually a fair expectation. Sometimes it might be preferred to pre-allocate the memory, but only map it into the device address space when it's really necessary.
What I'm still missing is the functionality to allocate pages for multiple devices and proper error codes when dma_map() can't make the page accessible to a device.
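The split described above might look like the following. This two-step allocate-then-map flow is a hypothetical sketch (today dma_alloc_coherent()/dma_alloc_pages() allocate and map in one call), but alloc_pages(), dma_map_page() and dma_mapping_error() are the real interfaces, and the dma_mapping_error() check is exactly the kind of proper error reporting asked for here:

```c
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/*
 * Hypothetical split flow: allocate plain pages first, map them into the
 * device address space only when actually needed, and report a proper
 * error when the mapping cannot make the pages accessible to the device.
 */
static dma_addr_t alloc_then_map(struct device *dev, size_t size,
				 struct page **out_page)
{
	struct page *page = alloc_pages(GFP_KERNEL, get_order(size));
	dma_addr_t addr;

	if (!page)
		return DMA_MAPPING_ERROR;

	addr = dma_map_page(dev, page, 0, size, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, addr)) {
		__free_pages(page, get_order(size));
		return DMA_MAPPING_ERROR;
	}

	*out_page = page;
	return addr;
}
```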
Agreed. Although again, I think the more challenging part would be to get the complete list of devices involved before the allocation happens.
Best regards, Tomasz
Regards, Christian.
It's a use-case that is working fine today with many devices (e.g. network adapters) in the ARM world, exactly because the architecture specific implementation of the DMA API inserts the cache maintenance operations on buffer ownership transfer.
Yeah, I'm perfectly aware of that. The problem is that exactly that design totally breaks GPUs on Xen DOM0 for example.
And Xen is just one example; I can certainly say from experience that this design was a really really bad idea because it favors just one use case while making other use cases practically impossible if not really hard to implement.
Sorry, I haven't worked with Xen. Could you elaborate what's the problem that this introduces for it?
That's a bit longer topic. The AMD Xen devs are already working on this as far as I know. I can ping internally how far they got with sending the patches out to avoid this problem.
Hmm, I see. It might be a good data point to understand in which direction we should be going, but I guess we can wait until they send some patches.
There was just recently a longer thread on the amd-gfx mailing list about that. I think looking in there as well might be beneficial.
Okay, let me check. Thanks,
Best regards, Tomasz