On Fri, Dec 9, 2022 at 11:28 AM Christian König christian.koenig@amd.com wrote:
Am 09.12.22 um 09:26 schrieb Tomasz Figa:
[SNIP] Although I think the most common case on mainstream Linux today is properly allocating for device X (e.g. V4L2 video decoder or DRM-based GPU) and hoping that other devices would accept the buffers just fine, which isn't a given on most platforms (although often it's just about pixel format, width/height/stride alignment, tiling, etc. rather than the memory itself). That's why ChromiumOS has minigbm and Android has gralloc that act as the central point of knowledge on buffer allocation.
Yeah, completely agree. The crux is really that we need to improve those user space allocators by providing more information from the kernel.
- We don't properly communicate allocation requirements to userspace.
E.g. even if you allocate from DMA-Heaps userspace can currently only guess if normal, CMA or even device specific memory is needed.
DMA-buf heaps actually make it even more difficult for the userspace, because now it needs to pick the right heap. With allocation built into the specific UAPI (like V4L2), it's at least possible to allocate for one specific device without having any knowledge about allocation constraints in the userspace.
Yes, same what Daniel said as well. We need to provide some more hints which allocator to use from the kernel.
This information doesn't necessarily have to come from the kernel. For KMS it makes sense for the kernel to provide it but Vulkan drivers for example could have the information entirely in the UMD. The important point is that user space can somehow query which heap can be used with which constraints for a specific operation. At that point it's entirely a user space problem.
> So if a device driver uses cached system memory on an architecture which > devices which can't access it the right approach is clearly to reject > the access. I'd like to accent the fact that "requires cache maintenance" != "can't access".
Well that depends. As said above the exporter exports the buffer as it was allocated.
If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.
Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?
Yes, exactly that. What happens in this particular use case is that one device driver wrote to it's internal buffer with the CPU (so some cache lines where dirty) and then a device which couldn't deal with that tried to access it.
If so, shouldn't that driver surround its CPU accesses with begin/end_cpu_access() in the first place?
The problem is that the roles are reversed. The callbacks let the exporter knows that it needs to flush the caches when the importer is done accessing the buffer with the CPU.
But here the exporter is the one accessing the buffer with the CPU and the importer then accesses stale data because it doesn't snoop the caches.
What we could do is to force all exporters to use begin/end_cpu_access() even on it's own buffers and look at all the importers when the access is completed. But that would increase the complexity of the handling in the exporter.
In other words we would then have code in the exporters which is only written for handling the constrains of the importers. This has a wide variety of consequences, especially that this functionality of the exporter can't be tested without a proper importer.
I was also thinking about reversing the role of exporter and importer in the kernel, but came to the conclusion that doing this under the hood without userspace knowing about it is probably not going to work either.
The case that I was suggesting was of a hardware block that actually sits behind the CPU cache and thus dirties it on writes, not the driver doing that. (I haven't personally encountered such a system, though.)
Never heard of anything like that either, but who knows.
We could say that all device drivers must always look at the coherency of the devices which want to access their buffers. But that would horrible complicate things for maintaining the drivers because then drivers would need to take into account requirements from other drivers while allocating their internal buffers.
I think it's partially why we have the allocation part of the DMA mapping API, but currently it's only handling requirements of one device. And we don't have any information from the userspace what other devices the buffer would be used with...
Exactly that, yes.
Actually, do we even have such information in the userspace today? Let's say I do a video call in a web browser on a typical Linux system. I have a V4L2 camera, VAAPI video encoder and X11 display. The V4L2 camera fills in buffers with video frames and both encoder and display consume them. Do we have a central place which would know that a buffer needs to be allocated that works with the producer and all consumers?
Both X11 and Wayland have protocols which can be used to display a certain DMA-buf handle, their feedback packages contain information how ideal your configuration is, e.g. if the DMA-buf handle could be used or if an extra copy was needed etc...
Similar exists between VAAPI and V4L2 as far as I know, but as you noted as well here it's indeed more about negotiating pixel format, stride, padding, alignment etc...
The best we can do is to reject combinations which won't work in the kernel and then userspace could react accordingly.
That's essentially the reason why we have DMA-buf heaps. Those heaps expose system memory buffers with certain properties (size, CMA, DMA bit restrictions etc...) and also implement the callback functions for CPU cache maintenance.
How do DMA-buf heaps change anything here? We already have CPU cache maintenance callbacks in the DMA-buf API - the begin/end_cpu_access() for CPU accesses and dmabuf_map/unmap_attachment() for device accesses (which arguably shouldn't actually do CPU cache maintenance, unless that's implied by how either of the involved DMA engines work).
DMA-buf heaps are the neutral man in the middle.
The implementation keeps track of all the attached importers and should make sure that the allocated memory fits the need of everyone. Additional to that calls to the cache DMA-api cache management functions are inserted whenever CPU access begins/ends.
That's the best we can do for system memory sharing, only device specific memory can't be allocated like this.
We do have the system and CMA dma-buf heap for cases where a device driver which wants to access the buffer with caches enabled. So this is not a limitation in functionality, it's just a matter of correctly using it.
V4L2 also has the ability to allocate buffers and map them with caches enabled.
Yeah, when that's a requirement for the V4L2 device it also makes perfect sense.
It's just that we shouldn't force any specific allocation behavior on a device driver because of the requirements of a different device.
Giving an example a V4L2 device shouldn't be forced to use videobuf2-dma-contig because some other device needs CMA. Instead we should use the common DMA-buf heaps which implement this as neutral supplier of system memory buffers.
Agreed, but that's only possible if we have a central entity that understands what devices the requested buffer would be used with. My understanding is that we're nowhere close to that with mainstream Linux today.
// As a side note, videobuf2-dma-contig can actually allocate discontiguous memory, if the device is behind an IOMMU. Actually with the recent DMA API improvements, we could probably coalesce vb2-dma-contig and vb2-dma-sg into one vb2-dma backend.
That would probably make live a little bit simpler, yes.
The problem is that in this particular case the exporter provides the buffer as is, e.g. with dirty CPU caches. And the importer can't deal with that.
Why does the exporter leave the buffer with dirty CPU caches?
Because exporters always export the buffers as they would use it. And in this case that means writing with the CPU to it.
Sorry for the question not being very clear. I meant: How do the CPU caches get dirty in that case?
The exporter wrote to it. As far as I understand the exporter just copies things from A to B with memcpy to construct the buffer content.
[SNIP]
Yes, totally agree. The problem is really that we moved bunch of MM and DMA functions in one API.
The bounce buffers are something we should really have in a separate component.
Then the functionality of allocating system memory for a specific device or devices should be something provided by the MM.
And finally making this memory or any other CPU address accessible to a device (IOMMU programming etc..) should then be part of an DMA API.
Remember that actually making the memory accessible to a device often needs to be handled already as a part of the allocation (e.g. dma_mask in the non-IOMMU case). So I tend to think that the current division of responsibilities is mostly fine - the dma_alloc family should be seen as a part of MM already, especially with all the recent improvements from Christoph, like dma_alloc_pages().
Yes, that's indeed a very interesting development which as far as I can see goes into the right direction.
That said, it may indeed make sense to move things like IOMMU mapping management out of the dma_alloc() and just reduce those functions to simply returning a set of pages that satisfy the allocation constraints. One would need to call dma_map() after the allocation, but that's actually a fair expectation. Sometimes it might be preferred to pre-allocate the memory, but only map it into the device address space when it's really necessary.
What I'm still missing is the functionality to allocate pages for multiple devices and proper error codes when dma_map() can't make the page accessible to a device.
Regards, Christian.
>> It's a use-case that is working fine today with many devices (e.g. network >> adapters) in the ARM world, exactly because the architecture specific >> implementation of the DMA API inserts the cache maintenance operations >> on buffer ownership transfer. > Yeah, I'm perfectly aware of that. The problem is that exactly that > design totally breaks GPUs on Xen DOM0 for example. > > And Xen is just one example, I can certainly say from experience that > this design was a really really bad idea because it favors just one use > case while making other use cases practically impossible if not really > hard to implement. Sorry, I haven't worked with Xen. Could you elaborate what's the problem that this introduces for it?
That's a bit longer topic. The AMD XEN devs are already working on this as far as I know. I can ping internally how far they got with sending the patches out to avoid this problem.
Hmm, I see. It might be a good data point to understand in which direction we should be going, but I guess we can wait until they send some patches.
There was just recently a longer thread on the amd-gfx mailing list about that. I think looking in there as well might be beneficial.
Okay, let me check. Thanks,
Best regards, Tomasz