Hi guys,
we are currently working on Freesync and direct scan out from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver specific, but the question is: should we have something more generic for this?
After all, the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.
On Mon, Feb 15, 2021 at 09:58:08AM +0100, Christian König wrote:
Hi guys,
we are currently working on Freesync and direct scan out from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver specific, but the question is: should we have something more generic for this?
After all, the system memory access pattern is a PCIe extension and as such something generic.
Yes it's a problem, and it's a complete mess. So the de facto rules are:
1. importer has to do coherent transactions to snoop cpu caches.
This way both modes work:
- if the buffer is cached, we're fine
- if the buffer is not cached, but the exporter has flushed all the caches, you're mostly just wasting time on inefficient bus cycles. Also this doesn't work if your CPU doesn't just drop clean cachelines, like I think some ARM CPUs are prone to do; no idea about AMD, Intel is guaranteed to drop them, which is why the uncached scanout for integrated Intel + amd discrete "works".
2. exporter picks the mode freely, and can even change it at runtime (i915 does this, since we don't have an "allocate for scanout" flag wired through consistently). This doesn't work on arm, where the rule is "all devices in the same system must use the same mode".
3. This should be solved at the dma-buf layer, but the dma-api refuses to tell you this information (at least for dma_alloc_coherent). And I'm not going to deal with the bikeshed that would bring into my inbox. Or at least there's always been screaming that drivers shouldn't peek behind the abstraction.
So I think if AMD also guarantees to drop clean cachelines, just do the same thing we do right now for intel integrated + discrete amd, but in reverse. It's fragile, but it does work.
What we imo shouldn't do is driver private interfaces here, that's just going to make the problem worse long term. Or at least driver-private interfaces that span across drivers behind dma-buf, because imo this is really a problem that dma-buf should solve.
If you do want to solve this at the dma-buf level I can try and point you at the respective i915 and amdgpu code that makes the magic work - I've had to fix it a few times in the past. I'm not sure whether we'd need to pass the dynamic nature through though, i.e. whether we want to be able to scan out imported dma-buf and hence request they be used in uncached mode. -Daniel
On Tuesday, 16 February 2021 at 10:25 +0100, Daniel Vetter wrote:
So I think if AMD also guarantees to drop clean cachelines, just do the same thing we do right now for intel integrated + discrete amd, but in reverse. It's fragile, but it does work.
Sorry to disrupt, but if you pass V4L2 vmalloc data to the Intel display driver, you also get nice dirt on the screen. If you have a UVC webcam that produces a pixel format compatible with your display, you can reproduce the issue quite easily with:
gst-launch-1.0 v4l2src device=/dev/video0 ! kmssink
P.S. some frame rates are less likely to exhibit the issue; make sure you create movement to see it.
The only solution I could think of (not implemented) was to detect in the attach() call what the importers can do (with dev->coherent_dma_mask if I recall), and otherwise flush the cache immediately and keep flushing it from then on before signalling DQBUF (in the vb2 workqueue or the dqbuf ioctl, I don't have an idea yet). I bet this idea is inapplicable where you have fences, which we don't have in v4l2.
This idea was hinted at by Robert Becket (now in CC), but perhaps I picked it up wrong, am explaining it wrong, etc. I'm no expert, I just noticed there wasn't really a good plan for that, so one needs to make one up. I'm not aware how an importer could know how the memory was allocated by the exporter, and worse, how an importer could figure out that the exporter is going to produce buffers with a hot CPU cache (the UVC driver does memcpy from USB chunks of variable size to produce a fixed-size image).
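(Roughly, the idea would look like this, as a sketch only; the importer_snoops() predicate and the exporter-private buffer flag are made up for illustration, I don't know of a proper way to query this from the attaching device today:)

#include <linux/dma-buf.h>

struct hyp_vb2_buf {		/* hypothetical exporter-private state */
	bool clean_cache_on_dqbuf;
};

/* Hypothetical exporter attach() callback: if the importing device
 * cannot snoop the CPU cache, remember that every DQBUF must clean
 * dirty cachelines before the buffer is handed over. */
static int hyp_vb2_attach(struct dma_buf *dbuf,
			  struct dma_buf_attachment *attach)
{
	struct hyp_vb2_buf *buf = dbuf->priv;

	if (!importer_snoops(attach->dev))	/* made-up predicate */
		buf->clean_cache_on_dqbuf = true;

	return 0;
}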
Nicolas
Hi Nicolas,
On Wed, 22 Jun 2022 at 20:39, Nicolas Dufresne nicolas@ndufresne.ca wrote:
On Tuesday, 16 February 2021 at 10:25 +0100, Daniel Vetter wrote:
So I think if AMD also guarantees to drop clean cachelines, just do the same thing we do right now for intel integrated + discrete amd, but in reverse. It's fragile, but it does work.
Sorry to disrupt, but if you pass V4L2 vmalloc data to the Intel display driver, you also get nice dirt on the screen. If you have a UVC webcam that produces a pixel format compatible with your display, you can reproduce the issue quite easily with:
gst-launch-1.0 v4l2src device=/dev/video0 ! kmssink
P.S. some frame rates are less likely to exhibit the issue; make sure you create movement to see it.
Right, this is because the UVC data in a vmalloc() area is not necessarily flushed from the CPU cache, and the importer expects it will be.
The only solution I could think of (not implemented) was to detect in the attach() call what the importers can do (with dev->coherent_dma_mask if I recall), and otherwise flush the cache immediately and keep flushing it from then on before signalling DQBUF (in the vb2 workqueue or the dqbuf ioctl, I don't have an idea yet). I bet this idea is inapplicable where you have fences, which we don't have in v4l2.
This idea was hinted at by Robert Becket (now in CC), but perhaps I picked it up wrong, am explaining it wrong, etc. I'm no expert, I just noticed there wasn't really a good plan for that, so one needs to make one up. I'm not aware how an importer could know how the memory was allocated by the exporter, and worse, how an importer could figure out that the exporter is going to produce buffers with a hot CPU cache (the UVC driver does memcpy from USB chunks of variable size to produce a fixed-size image).
This is exactly what Christian was saying above.
Cheers, Daniel
On 23.06.22 at 01:34, Daniel Stone wrote:
Hi Nicolas,
On Wed, 22 Jun 2022 at 20:39, Nicolas Dufresne nicolas@ndufresne.ca wrote:
On Tuesday, 16 February 2021 at 10:25 +0100, Daniel Vetter wrote:
So I think if AMD also guarantees to drop clean cachelines, just do the same thing we do right now for intel integrated + discrete amd, but in reverse. It's fragile, but it does work.
Sorry to disrupt, but if you pass V4L2 vmalloc data to the Intel display driver, you also get nice dirt on the screen. If you have a UVC webcam that produces a pixel format compatible with your display, you can reproduce the issue quite easily with:
gst-launch-1.0 v4l2src device=/dev/video0 ! kmssink
P.S. some frame rates are less likely to exhibit the issue; make sure you create movement to see it.
Right, this is because the UVC data in a vmalloc() area is not necessarily flushed from the CPU cache, and the importer expects it will be.
Yeah, but that is something perfectly valid for an exporter to do. So the bug is not in UVC.
The only solution I could think of (not implemented) was to detect in the attach() call what the importers can do (with dev->coherent_dma_mask if I recall), and otherwise flush the cache immediately and keep flushing it from then on before signalling DQBUF (in the vb2 workqueue or the dqbuf ioctl, I don't have an idea yet). I bet this idea is inapplicable where you have fences, which we don't have in v4l2.
This idea was hinted at by Robert Becket (now in CC), but perhaps I picked it up wrong, am explaining it wrong, etc. I'm no expert, I just noticed there wasn't really a good plan for that, so one needs to make one up. I'm not aware how an importer could know how the memory was allocated by the exporter, and worse, how an importer could figure out that the exporter is going to produce buffers with a hot CPU cache (the UVC driver does memcpy from USB chunks of variable size to produce a fixed-size image).
This is exactly what Christian was saying above.
Well more or less.
The exporter isn't doing anything wrong here. DMA-buf are supposed to be CPU cached and can also be cache hot.
The importer needs to be able to deal with that. Either by flushing the CPU cache manually (awkward), rejecting the DMA-buf for this use case (display scanout) or working around that inside its driver (extra copy, different hw settings, etc...).
Regards, Christian.
Cheers, Daniel
On Thu, 23 Jun 2022 08:59:41 +0200 Christian König christian.koenig@amd.com wrote:
The exporter isn't doing anything wrong here. DMA-buf are supposed to be CPU cached and can also be cache hot.
Hi,
what is that statement based on?
Were the (mandatory for CPU access) cpu_access_begin/end functions & ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache is fully flushed out?
Thanks, pq
On 23.06.22 at 09:13, Pekka Paalanen wrote:
On Thu, 23 Jun 2022 08:59:41 +0200 Christian König christian.koenig@amd.com wrote:
The exporter isn't doing anything wrong here. DMA-buf are supposed to be CPU cached and can also be cache hot.
Hi,
what is that statement based on?
On the design documentation of DMA-buf and the actual driver implementations.
Coherency and snooping of the CPU cache is mandatory for devices and root complexes in the PCI specification. Incoherent access is just an extension.
We inherited that by basing DMA-buf on the Linux kernel DMA-API which in turn is largely based on the PCI specification.
Were the (mandatory for CPU access) cpu_access_begin/end functions & ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache is fully flushed out?
No, those functions are to inform the exporter that the importer has started and finished accessing the buffer using the CPU.
There is no signaling in the other direction. In other words the exporter doesn't inform the importer about CPU accesses because it is the owner of the buffer.
It's the responsibility of the importer to make sure that it can actually access the data in the buffer. If it can't guarantee that, the importer shouldn't import the buffer in the first place.
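(For reference, the direction those calls work in, seen from the CPU user's side; this is just the existing DMA_BUF_IOCTL_SYNC bracketing around a plain CPU write, the fd and the mmap()ed pointer are assumed to exist already:)

#include <linux/dma-buf.h>
#include <sys/ioctl.h>
#include <string.h>

/* Bracket a CPU write to an mmap()ed DMA-buf. This informs the
 * exporter about the CPU access; it does not inform any importer. */
static int cpu_fill(int dmabuf_fd, void *map, size_t size)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync))
		return -1;

	memset(map, 0xff, size);	/* the actual CPU access */

	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}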
Regards, Christian.
Thanks, pq
On Thursday, 23.06.2022 at 09:26 +0200, Christian König wrote:
On 23.06.22 at 09:13, Pekka Paalanen wrote:
On Thu, 23 Jun 2022 08:59:41 +0200 Christian König christian.koenig@amd.com wrote:
The exporter isn't doing anything wrong here. DMA-buf are supposed to be CPU cached and can also be cache hot.
Hi,
what is that statement based on?
On the design documentation of DMA-buf and the actual driver implementations.
Coherency and snooping of the CPU cache is mandatory for devices and root complexes in the PCI specification. Incoherent access is just an extension.
We inherited that by basing DMA-buf on the Linux kernel DMA-API which in turn is largely based on the PCI specification.
Were the (mandatory for CPU access) cpu_access_begin/end functions & ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache is fully flushed out?
No, those functions are to inform the exporter that the importer has started and finished accessing the buffer using the CPU.
There is no signaling in the other direction. In other words the exporter doesn't inform the importer about CPU accesses because it is the owner of the buffer.
It's the responsibility of the importer to make sure that it can actually access the data in the buffer. If it can't guarantee that, the importer shouldn't import the buffer in the first place.
This is not really correct. DMA-buf inherited the map/unmap part from the DMA API, which on cache coherent architectures is mostly a no-op or ties into the IOMMU implementation to set up the pagetables for the translation. On non cache coherent architectures this is the point where any necessary cache maintenance happens. DRM breaks this model by caching the DMA-buf mapping for performance reasons.
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those sync are no-ops on cache coherent architectures, but do any necessary cache maintenance on non coherent arches.
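(For context, the classic DMA API pattern being referred to; a minimal sketch with placeholder dev/buf, error paths trimmed:)

#include <linux/dma-mapping.h>

/* Keep a streaming mapping alive and do explicit ownership transfers.
 * The sync calls are no-ops on cache coherent arches and do the cache
 * maintenance on non coherent ones. */
static int streaming_example(struct device *dev, void *buf, size_t size)
{
	dma_addr_t dma = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	/* ... device writes into the buffer ... */

	dma_sync_single_for_cpu(dev, dma, size, DMA_FROM_DEVICE);
	/* ... CPU may now look at buf ... */
	dma_sync_single_for_device(dev, dma, size, DMA_FROM_DEVICE);

	/* ... more device access, eventually: */
	dma_unmap_single(dev, dma, size, DMA_FROM_DEVICE);
	return 0;
}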
Regards, Lucas
On 23.06.22 at 10:04, Lucas Stach wrote:
On Thursday, 23.06.2022 at 09:26 +0200, Christian König wrote:
On 23.06.22 at 09:13, Pekka Paalanen wrote:
On Thu, 23 Jun 2022 08:59:41 +0200 Christian König christian.koenig@amd.com wrote:
The exporter isn't doing anything wrong here. DMA-buf are supposed to be CPU cached and can also be cache hot.
Hi,
what is that statement based on?
On the design documentation of DMA-buf and the actual driver implementations.
Coherency and snooping of the CPU cache is mandatory for devices and root complexes in the PCI specification. Incoherent access is just an extension.
We inherited that by basing DMA-buf on the Linux kernel DMA-API which in turn is largely based on the PCI specification.
Were the (mandatory for CPU access) cpu_access_begin/end functions & ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache is fully flushed out?
No, those functions are to inform the exporter that the importer has started and finished accessing the buffer using the CPU.
There is no signaling in the other direction. In other words the exporter doesn't inform the importer about CPU accesses because it is the owner of the buffer.
It's the responsibility of the importer to make sure that it can actually access the data in the buffer. If it can't guarantee that, the importer shouldn't import the buffer in the first place.
This is not really correct. DMA-buf inherited the map/unmap part from the DMA API, which on cache coherent architectures is mostly a no-op or ties into the IOMMU implementation to set up the pagetables for the translation. On non cache coherent architectures this is the point where any necessary cache maintenance happens. DRM breaks this model by caching the DMA-buf mapping for performance reasons.
That's not only because of performance reasons, but also because of correctness.
At least the Vulkan API and a bunch of OpenGL extensions make it mandatory for the buffer to be cache coherent. The kernel is simply not informed about domain transfers.
For example you can just do a CPU copy to a ring buffer and the expectation is that an already running shader sees that.
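(Purely to illustrate that expectation, a userspace-style sketch with made-up pointers; the ring and the write-pointer register are assumed to be coherent mappings obtained from a Vulkan/GL style allocation, and real code would also need the appropriate memory barrier before the write-pointer update:)

#include <stdint.h>
#include <string.h>

/* The CPU just writes; no ioctl, no cache flush. The GPU that is
 * already executing is expected to see the new commands through
 * coherency alone. */
static void emit_commands(uint32_t *ring, volatile uint32_t *wptr,
			  uint32_t offset, const uint32_t *cmds, size_t count)
{
	memcpy(&ring[offset], cmds, count * sizeof(*cmds));
	/* a real driver issues a write barrier here */
	*wptr = offset + count;	/* doorbell / write pointer update */
}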
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those sync are no-ops on cache coherent architectures, but do any necessary cache maintenance on non coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
Regards, Christian.
Regards, Lucas
On Thursday, 23.06.2022 at 10:14 +0200, Christian König wrote:
On 23.06.22 at 10:04, Lucas Stach wrote:
On Thursday, 23.06.2022 at 09:26 +0200, Christian König wrote:
On 23.06.22 at 09:13, Pekka Paalanen wrote:
On Thu, 23 Jun 2022 08:59:41 +0200 Christian König christian.koenig@amd.com wrote:
The exporter isn't doing anything wrong here. DMA-buf are supposed to be CPU cached and can also be cache hot.
Hi,
what is that statement based on?
On the design documentation of DMA-buf and the actual driver implementations.
Coherency and snooping of the CPU cache is mandatory for devices and root complexes in the PCI specification. Incoherent access is just an extension.
We inherited that by basing DMA-buf on the Linux kernel DMA-API which in turn is largely based on the PCI specification.
Were the (mandatory for CPU access) cpu_access_begin/end functions & ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache is fully flushed out?
No, those functions are to inform the exporter that the importer has started and finished accessing the buffer using the CPU.
There is no signaling in the other direction. In other words the exporter doesn't inform the importer about CPU accesses because it is the owner of the buffer.
It's the responsibility of the importer to make sure that it can actually access the data in the buffer. If it can't guarantee that, the importer shouldn't import the buffer in the first place.
This is not really correct. DMA-buf inherited the map/unmap part from the DMA API, which on cache coherent architectures is mostly a no-op or ties into the IOMMU implementation to set up the pagetables for the translation. On non cache coherent architectures this is the point where any necessary cache maintenance happens. DRM breaks this model by caching the DMA-buf mapping for performance reasons.
That's not only because of performance reasons, but also because of correctness.
At least the Vulkan API and a bunch of OpenGL extensions make it mandatory for the buffer to be cache coherent. The kernel is simply not informed about domain transfers.
For example you can just do a CPU copy to a ring buffer and the expectation is that an already running shader sees that.
Yes, that one is not really an issue as you know that at buffer creation time and can make sure to map those buffers uncached on non coherent arches. If there are no explicit domain transfer points, non coherent arches must bite the bullet and bypass the CPU caches, running performance into the ground.
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those sync are no-ops on cache coherent architectures, but do any necessary cache maintenance on non coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
In the example given here with UVC video, you don't know that the buffer will be exported and needs to be coherent without synchronization points, due to the mapping cache at the DRM side. So V4L2 naturally allocates the buffers from CPU cached memory. If the expectation is that those buffers are device coherent without relying on the map/unmap_attachment calls, then V4L2 needs to always synchronize caches on DQBUF when the buffer is allocated from CPU cached memory and a single DMA-buf attachment exists. And while writing this I realize that this is probably exactly what V4L2 should do...
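(Roughly like this, as a sketch only; the cpu_cached/attachment bookkeeping passed in here is made up, only dma_sync_sgtable_for_device() is an existing helper:)

#include <linux/dma-mapping.h>

/* Hypothetical hook on the DQBUF path: before the filled buffer is
 * handed out, clean dirty CPU cachelines so a non-snooping importer
 * sees the new frame. Skipped when the backing store isn't CPU cached
 * or nobody has attached. */
static void hyp_buf_finish(struct device *alloc_dev, struct sg_table *sgt,
			   bool cpu_cached, bool has_attachment)
{
	if (cpu_cached && has_attachment)
		dma_sync_sgtable_for_device(alloc_dev, sgt, DMA_TO_DEVICE);
}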
Regards, Lucas
On 23.06.22 at 10:58, Lucas Stach wrote:
On Thursday, 23.06.2022 at 10:14 +0200, Christian König wrote:
On 23.06.22 at 10:04, Lucas Stach wrote:
On Thursday, 23.06.2022 at 09:26 +0200, Christian König wrote:
On 23.06.22 at 09:13, Pekka Paalanen wrote:
On Thu, 23 Jun 2022 08:59:41 +0200 Christian König christian.koenig@amd.com wrote:
The exporter isn't doing anything wrong here. DMA-buf are supposed to be CPU cached and can also be cache hot.
Hi,
what is that statement based on?
On the design documentation of DMA-buf and the actual driver implementations.
Coherency and snooping of the CPU cache is mandatory for devices and root complexes in the PCI specification. Incoherent access is just an extension.
We inherited that by basing DMA-buf on the Linux kernel DMA-API which in turn is largely based on the PCI specification.
Were the (mandatory for CPU access) cpu_access_begin/end functions & ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache is fully flushed out?
No, those functions are to inform the exporter that the importer has started and finished accessing the buffer using the CPU.
There is no signaling in the other direction. In other words the exporter doesn't inform the importer about CPU accesses because it is the owner of the buffer.
It's the responsibility of the importer to make sure that it can actually access the data in the buffer. If it can't guarantee that, the importer shouldn't import the buffer in the first place.
This is not really correct. DMA-buf inherited the map/unmap part from the DMA API, which on cache coherent architectures is mostly a no-op or ties into the IOMMU implementation to set up the pagetables for the translation. On non cache coherent architectures this is the point where any necessary cache maintenance happens. DRM breaks this model by caching the DMA-buf mapping for performance reasons.
That's not only because of performance reasons, but also because of correctness.
At least the Vulkan API and a bunch of OpenGL extensions make it mandatory for the buffer to be cache coherent. The kernel is simply not informed about domain transfers.
For example you can just do a CPU copy to a ring buffer and the expectation is that an already running shader sees that.
Yes, that one is not really an issue as you know that at buffer creation time and can make sure to map those buffers uncached on non coherent arches. If there are no explicit domain transfer points, non coherent arches must bite the bullet and bypass the CPU caches, running performance into the ground.
Yes, exactly that was what this mail thread was about. But this case is currently not supported by DMA-buf.
In other words, cache coherency is currently mandatory for everybody involved.
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those sync are no-ops on cache coherent architectures, but do any necessary cache maintenance on non coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
In the example given here with UVC video, you don't know that the buffer will be exported and needs to be coherent without synchronization points, due to the mapping cache at the DRM side. So V4L2 naturally allocates the buffers from CPU cached memory. If the expectation is that those buffers are device coherent without relying on the map/unmap_attachment calls, then V4L2 needs to always synchronize caches on DQBUF when the buffer is allocated from CPU cached memory and a single DMA-buf attachment exists. And while writing this I realize that this is probably exactly what V4L2 should do...
No, the expectation is that the importer can deal with whatever the exporter provides.
If the importer can't access the DMA-buf coherently it's his job to handle that gracefully.
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
Regards, Christian.
Regards, Lucas
On Thursday, 23.06.2022 at 11:09 +0200, Christian König wrote:
On 23.06.22 at 10:58, Lucas Stach wrote:
On Thursday, 23.06.2022 at 10:14 +0200, Christian König wrote:
On 23.06.22 at 10:04, Lucas Stach wrote:
On Thursday, 23.06.2022 at 09:26 +0200, Christian König wrote:
On 23.06.22 at 09:13, Pekka Paalanen wrote:
On Thu, 23 Jun 2022 08:59:41 +0200 Christian König christian.koenig@amd.com wrote:
The exporter isn't doing anything wrong here. DMA-buf are supposed to be CPU cached and can also be cache hot.
Hi,
what is that statement based on?
On the design documentation of DMA-buf and the actual driver implementations.
Coherency and snooping of the CPU cache is mandatory for devices and root complexes in the PCI specification. Incoherent access is just an extension.
We inherited that by basing DMA-buf on the Linux kernel DMA-API which in turn is largely based on the PCI specification.
Were the (mandatory for CPU access) cpu_access_begin/end functions & ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache is fully flushed out?
No, those functions are to inform the exporter that the importer has started and finished accessing the buffer using the CPU.
There is no signaling in the other direction. In other words the exporter doesn't inform the importer about CPU accesses because it is the owner of the buffer.
It's the responsibility of the importer to make sure that it can actually access the data in the buffer. If it can't guarantee that, the importer shouldn't import the buffer in the first place.
This is not really correct. DMA-buf inherited the map/unmap part from the DMA API, which on cache coherent architectures is mostly a no-op or ties into the IOMMU implementation to set up the pagetables for the translation. On non cache coherent architectures this is the point where any necessary cache maintenance happens. DRM breaks this model by caching the DMA-buf mapping for performance reasons.
That's not only because of performance reasons, but also because of correctness.
At least the Vulkan API and a bunch of OpenGL extensions make it mandatory for the buffer to be cache coherent. The kernel is simply not informed about domain transfers.
For example you can just do a CPU copy to a ring buffer and the expectation is that an already running shader sees that.
Yes, that one is not really an issue as you know that at buffer creation time and can make sure to map those buffers uncached on non coherent arches. If there are no explicit domain transfer points, non coherent arches must bite the bullet and bypass the CPU caches, running performance into the ground.
Yes, exactly that was what this mail thread was about. But this case is currently not supported by DMA-buf.
In other words, cache coherency is currently mandatory for everybody involved.
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those sync are no-ops on cache coherent architectures, but do any necessary cache maintenance on non coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
In the example given here with UVC video, you don't know that the buffer will be exported and needs to be coherent without synchronization points, due to the mapping cache at the DRM side. So V4L2 naturally allocates the buffers from CPU cached memory. If the expectation is that those buffers are device coherent without relying on the map/unmap_attachment calls, then V4L2 needs to always synchronize caches on DQBUF when the buffer is allocated from CPU cached memory and a single DMA-buf attachment exists. And while writing this I realize that this is probably exactly what V4L2 should do...
No, the expectation is that the importer can deal with whatever the exporter provides.
If the importer can't access the DMA-buf coherently it's his job to handle that gracefully.
How does the importer know that the memory behind the DMA-buf is in CPU cached memory?
If you now tell me that an importer always needs to assume this and reject the import if it can't do snooping, then any DMA-buf usage on most ARM SoCs is currently invalid usage. On most of the multimedia targeted ARM SoCs being unable to snoop the cache is the norm, not an exception.
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to the dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow to skip the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
Regards, Lucas
On 23.06.22 at 11:33, Lucas Stach wrote:
[SNIP]
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those sync are no-ops on cache coherent architectures, but do any necessary cache maintenance on non coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
In the example given here with UVC video, you don't know that the buffer will be exported and needs to be coherent without synchronization points, due to the mapping cache at the DRM side. So V4L2 naturally allocates the buffers from CPU cached memory. If the expectation is that those buffers are device coherent without relying on the map/unmap_attachment calls, then V4L2 needs to always synchronize caches on DQBUF when the buffer is allocated from CPU cached memory and a single DMA-buf attachment exists. And while writing this I realize that this is probably exactly what V4L2 should do...
No, the expectation is that the importer can deal with whatever the exporter provides.
If the importer can't access the DMA-buf coherently it's his job to handle that gracefully.
How does the importer know that the memory behind the DMA-buf is in CPU cached memory?
If you now tell me that an importer always needs to assume this and reject the import if it can't do snooping, then any DMA-buf usage on most ARM SoCs is currently invalid usage.
Yes, exactly that. I've pointed out a couple of times now that a lot of ARM SoCs don't implement that the way we need it.
We already had tons of bug reports because somebody attached a random PCI root complex to an ARM SoC and expected it to work with for example an AMD GPU.
Non-cache coherent applications are currently not really supported by the DMA-buf framework in any way.
On most of the multimedia targeted ARM SoCs being unable to snoop the cache is the norm, not an exception.
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to the dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow to skip the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
Well for the AMD and Intel use cases we at least have the opportunity to signal cache flushing, but I'm not sure if that counts for everybody.
What we would rather do for those use cases is an indicator on the DMA-buf if the underlying backing store is CPU cached or not. The importer can then cleanly reject the use cases where it can't support CPU cache snooping.
This then results in the normal fallback paths which we have anyway for those use cases because DMA-buf sharing is not always possible.
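(A purely hypothetical sketch of that indicator; no such field or helper exists in dma_buf today:)

#include <linux/dma-buf.h>
#include <linux/errno.h>

/* Made-up query for the proposed indicator on the DMA-buf. */
static bool dma_buf_is_cpu_cached(struct dma_buf *dmabuf);

/* Importer side: a display engine that cannot snoop the CPU cache
 * cleanly rejects CPU cached buffers at attach time, and userspace
 * takes its normal non-DMA-buf fallback path instead. */
static int hyp_scanout_attach_check(struct dma_buf *dmabuf)
{
	if (dma_buf_is_cpu_cached(dmabuf))
		return -EOPNOTSUPP;

	return 0;
}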
Regards, Christian.
Regards, Lucas
On Thursday, 23.06.2022 at 11:46 +0200, Christian König wrote:
On 23.06.22 at 11:33, Lucas Stach wrote:
[SNIP]
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those sync are no-ops on cache coherent architectures, but do any necessary cache maintenance on non coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
In the example given here with UVC video, you don't know that the buffer will be exported and needs to be coherent without synchronization points, due to the mapping cache at the DRM side. So V4L2 naturally allocates the buffers from CPU cached memory. If the expectation is that those buffers are device coherent without relying on the map/unmap_attachment calls, then V4L2 needs to always synchronize caches on DQBUF when the buffer is allocated from CPU cached memory and a single DMA-buf attachment exists. And while writing this I realize that this is probably exactly what V4L2 should do...
No, the expectation is that the importer can deal with whatever the exporter provides.
If the importer can't access the DMA-buf coherently it's his job to handle that gracefully.
How does the importer know that the memory behind the DMA-buf is in CPU cached memory?
If you now tell me that an importer always needs to assume this and reject the import if it can't do snooping, then any DMA-buf usage on most ARM SoCs is currently invalid usage.
Yes, exactly that. I've pointed out a couple of times now that a lot of ARM SoCs don't implement that the way we need it.
We already had tons of bug reports because somebody attached a random PCI root complex to an ARM SoC and expected it to work with for example an AMD GPU.
Non-cache coherent applications are currently not really supported by the DMA-buf framework in any way.
I'm not talking about bolting on a PCIe root complex, with its implicit inherited "PCI is cache coherent" expectations, to an ARM SoC, but just the standard VPU/GPU/display engines are not snooping on most ARM SoCs.
On most of the multimedia targeted ARM SoCs being unable to snoop the cache is the norm, not an exception.
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to the dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow to skip the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
Well for the AMD and Intel use cases we at least have the opportunity to signal cache flushing, but I'm not sure if that counts for everybody.
Sure, all the non-coherent arches have some way to do the cache maintenance in some explicit way. Non coherent and no cache maintenance instruction would be a recipe for disaster. ;)
What we would rather do for those use cases is an indicator on the DMA-buf if the underlying backing store is CPU cached or not. The importer can then cleanly reject the use cases where it can't support CPU cache snooping.
This then results in the normal fallback paths which we have anyway for those use cases because DMA-buf sharing is not always possible.
That's a very x86 centric world view you have there. 99% of DMA-buf uses on those cheap ARM SoCs are non-snooping. We cannot do any fallbacks here, as the whole graphics world on those SoCs with their different IP cores mixed together depends on DMA-buf sharing working efficiently even when the SoC is mostly non coherent.
In fact DMA-buf sharing works fine on most of those SoCs because everyone just assumes that all the accelerators don't snoop, so the memory shared via DMA-buf is mostly CPU uncached. It only falls apart for uses like the UVC cameras, where the shared buffer ends up being CPU cached.
Non-coherent without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
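(From the importer side that could look roughly like this; the dma_buf_sync_attachment_for_device() name and semantics are invented here purely for illustration, nothing like it exists:)

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>

/* Invented API: the importer announces a non-snooping device access on
 * its attachment. The exporter cleans dirty CPU cachelines if the
 * backing store is CPU cached, and no-ops if it is uncached/wc. */
int dma_buf_sync_attachment_for_device(struct dma_buf_attachment *attach,
				       enum dma_data_direction dir);

/* Usage in a non-snooping scanout path, before programming the plane: */
static int hyp_prepare_scanout(struct dma_buf_attachment *attach)
{
	return dma_buf_sync_attachment_for_device(attach, DMA_TO_DEVICE);
}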
Regards, Lucas
On 23.06.22 at 12:13, Lucas Stach wrote:
[SNIP]
On most of the multimedia targeted ARM SoCs being unable to snoop the cache is the norm, not an exception.
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to the dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow to skip the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
Well for the AMD and Intel use cases we at least have the opportunity to signal cache flushing, but I'm not sure if that counts for everybody.
Sure, all the non-coherent arches have some way to do the cache maintenance in some explicit way. Non coherent and no cache maintenance instruction would be a recipe for disaster. ;)
What we would rather do for those use cases is an indicator on the DMA-buf if the underlying backing store is CPU cached or not. The importer can then cleanly reject the use cases where it can't support CPU cache snooping.
This then results in the normal fallback paths which we have anyway for those use cases because DMA-buf sharing is not always possible.
That's a very x86 centric world view you have there. 99% of DMA-buf uses on those cheap ARM SoCs are non-snooping. We cannot do any fallbacks here, as the whole graphics world on those SoCs with their different IP cores mixed together depends on DMA-buf sharing working efficiently even when the SoC is mostly non coherent.
In fact DMA-buf sharing works fine on most of those SoCs because everyone just assumes that all the accelerators don't snoop, so the memory shared via DMA-buf is mostly CPU uncached. It only falls apart for uses like the UVC cameras, where the shared buffer ends up being CPU cached.
Well then the existing DMA-buf framework is not what you want to use for this.
Non-coherent without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
Regards, Christian.
Regards, Lucas
Hi Christian,
On Thu, 23 Jun 2022 at 12:11, Christian König christian.koenig@amd.com wrote:
In fact DMA-buf sharing works fine on most of those SoCs because everyone just assumes that all the accelerators don't snoop, so the memory shared via DMA-buf is mostly CPU uncached. It only falls apart for uses like the UVC cameras, where the shared buffer ends up being CPU cached.
Well then the existing DMA-buf framework is not what you want to use for this.
Non-coherent without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
For some strange reason, 'let's do buffer sharing but make sure it doesn't work on Arm' wasn't exactly the intention of the groups who came together under the Linaro umbrella to create dmabuf.
If it's really your belief that dmabuf requires universal snooping, I recommend you send the patch to update the documentation, as well as to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
Cheers, Daniel
On 23.06.22 at 13:27, Daniel Stone wrote:
Hi Christian,
On Thu, 23 Jun 2022 at 12:11, Christian König christian.koenig@amd.com wrote:
In fact DMA-buf sharing works fine on most of those SoCs because everyone just assumes that all the accelerators don't snoop, so the memory shared via DMA-buf is mostly CPU uncached. It only falls apart for uses like the UVC cameras, where the shared buffer ends up being CPU cached.
Well then the existing DMA-buf framework is not what you want to use for this.
Non-coherent without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
For some strange reason, 'let's do buffer sharing but make sure it doesn't work on Arm' wasn't exactly the intention of the groups who came together under the Linaro umbrella to create dmabuf.
If it's really your belief that dmabuf requires universal snooping, I recommend you send the patch to update the documentation, as well as to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
Well, to be honest I think that would indeed be necessary.
What we have created are essentially two different worlds, one for PCI devices and one for the rest.
This was indeed not the intention, but it's a fact that basically all DMA-buf based PCI drivers assume coherent access.
Regards, Christian.
Cheers, Daniel
On Thu, Jun 23, 2022 at 01:32:18PM +0200, Christian König wrote:
On 23.06.22 at 13:27, Daniel Stone wrote:
Hi Christian,
On Thu, 23 Jun 2022 at 12:11, Christian König christian.koenig@amd.com wrote:
In fact DMA-buf sharing works fine on most of those SoCs because everyone just assumes that all the accelerators don't snoop, so the memory shared via DMA-buf is mostly CPU uncached. It only falls apart for uses like the UVC cameras, where the shared buffer ends up being CPU cached.
Well then the existing DMA-buf framework is not what you want to use for this.
Non-coherent without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs, and I don't really see a way to do this differently.
For some strange reason, 'let's do buffer sharing but make sure it doesn't work on Arm' wasn't exactly the intention of the groups who came together under the Linaro umbrella to create dmabuf.
If it's really your belief that dmabuf requires universal snooping, I recommend you send the patch to update the documentation, as well as to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
Well, to be honest I think that would indeed be necessary.
What we have created are essentially two different worlds, one for PCI devices and one for the rest.
This was indeed not the intention, but it's a fact that basically all DMA-buf based PCI drivers assume coherent access.
dma-buf does not require universal snooping.
It does de facto require that all device access is coherent with all other device access, and consistent with the exporter's notion of how cpu coherency is achieved. Note that coherent does not mean snooping: as long as all devices do unsnooped access and the exporter either does wc/uc or flushes caches, that's perfectly fine, and that is how all the arm soc dma-buf sharing works.
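(For the wc/uc side of that, the usual way an exporter ends up with such memory via the DMA API, sketched with a placeholder device:)

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* Write-combined allocation: CPU writes bypass the cache, so unsnooped
 * device reads see consistent data without any flushing. */
static void *alloc_wc_backing(struct device *dev, size_t size,
			      dma_addr_t *dma_addr)
{
	return dma_alloc_wc(dev, size, dma_addr, GFP_KERNEL);
}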
We did originally have the wording in there that you have to map/unmap around every device access, but that got dropped because no one was doing that anyway.
Now where this totally breaks down is how we make this work, because the idea was that dma_buf_attach validates this all. Where this means all the hilarious reasons buffer sharing might not work:
- wrong coherency mode (cpu cached or not)
- not contiguous (we do check that, but only once we get the sg from dma_buf_attachment_map, which strictly speaking is a bit too late but most drivers do attach&map as one step so not that bad in practice)
- whether the dma api will throw in bounce buffers or not
- random shit like "oh this is in the wrong memory bank", which I think never landed in upstream
p2p connectivity is about the only one that gets this right, yay. And the only reason we can even get it right is because all the information is exposed to drivers fully.
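(For illustration, the importer-side attach/map step where e.g. the contiguity constraint can only be checked today; error handling trimmed, and the single-segment requirement is just an example of a device constraint:)

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/errno.h>
#include <linux/scatterlist.h>

/* Attach to the dma-buf, map it, and bail out if the device needs one
 * contiguous range but didn't get it. The check only happens at map
 * time, i.e. "a bit too late" as mentioned above. */
static struct sg_table *hyp_import(struct dma_buf *dmabuf, struct device *dev)
{
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;

	attach = dma_buf_attach(dmabuf, dev);
	if (IS_ERR(attach))
		return ERR_CAST(attach);

	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		dma_buf_detach(dmabuf, attach);
		return sgt;
	}

	if (sgt->nents != 1) {	/* device can only handle one segment */
		dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
		dma_buf_detach(dmabuf, attach);
		return ERR_PTR(-EINVAL);
	}

	return sgt;
}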
The issue is that the device dma api refuses to share this information because it would "leak". Which sucks, because we have de facto built every single cross-device use-case of dma-buf on the assumption we can check this (up to gl/vk specs), but oh well.
So in practice this gets sorted out by endless piles of hacks to make individual use-cases work.
Oh and: This is definitely not limited to arm socs. x86 socs with intel at least have exactly all the same issues, and they get solved by adding various shitty hacks to the involved drivers (like i915+amdgpu). Luckily the intel camera driver isn't in upstream yet, since that would break a bunch of the hacks, because suddenly there would be two cpu cache incoherent devices in an x86 system.
Ideally someone fixes this, but I'm not hopeful.
I recommend pouring more drinks.
What is definitely not correct is claiming that dma-buf wasn't meant for this. We discussed cache coherency issues endlessly in budapest 12 or so years ago, I was there. It's just that the reality of the current implementation is falling short, and every time someone tries to fix it we get shouted down by dma api maintainers for looking behind their curtain.
tldr; You have to magically know to not use cpu cached allocators on these machines.
Aside: This is also why vgem allocates wc memory on x86. It's the least common denominator that works. arm unfortunately doesn't allow you to allocate wc memory, so there, stuff is simply somewhat broken. -Daniel
Hi Daniel,
On 25.06.22 at 00:02, Daniel Vetter wrote:
On Thu, Jun 23, 2022 at 01:32:18PM +0200, Christian König wrote:
On 23.06.22 at 13:27, Daniel Stone wrote:
[SNIP] If it's really your belief that dmabuf requires universal snooping, I recommend you send the patch to update the documentation, as well as to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
Well, to be honest I think that would indeed be necessary.
What we have created are essentially two different worlds, one for PCI devices and one for the rest.
This was indeed not the intention, but it's a fact that basically all DMA-buf based PCI drivers assume coherent access.
dma-buf does not require universal snooping.
It does de facto require that all device access is coherent with all other device access, and consistent with the exporter's notion of how cpu coherency is achieved. Note that coherent does not mean snooping: as long as all devices do unsnooped access and the exporter either does wc/uc or flushes caches, that's perfectly fine, and that is how all the arm soc dma-buf sharing works.
We should probably start documenting that better.
We did originally have the wording in there that you have to map/unmap around every device access, but that got dropped because no one was doing that anyway.
Now where this totally breaks down is how we make this work, because the idea was that dma_buf_attach validates this all. Where this means all the hilarious reasons buffer sharing might not work:
- wrong coherency mode (cpu cached or not)
- not contiguous (we do check that, but only once we get the sg from dma_buf_attachment_map, which strictly speaking is a bit too late but most drivers do attach&map as one step so not that bad in practice)
- whether the dma api will throw in bounce buffers or not
- random shit like "oh this is in the wrong memory bank", which I think never landed in upstream
p2p connectivity is about the only one that gets this right, yay. And the only reason we can even get it right is because all the information is exposed to drivers fully.
Yeah, that's why I designed P2P that way :)
I also don't think it's that bad, at least for radeon, nouveau and amdgpu all the migration restrictions are actually handled correctly.
In other words when a DMA-buf is about to be used by another device we use TTM to move the buffer around so that it can actually be accessed by that device.
What I haven't foreseen in here is that we need to deal with different caching behaviors between exporter and importer.
The issue is that the device dma api refuses to share this information because it would "leak". Which sucks, because we have de facto built every single cross-device use-case of dma-buf on the assumption we can check this (up to gl/vk specs), but oh well.
So in practice this gets sorted out by endless piles of hacks to make individual use-cases work.
Oh and: This is definitely not limited to arm socs. x86 socs with intel at least have exactly all the same issues, and they get solved by adding various shitty hacks to the involved drivers (like i915+amdgpu). Luckily the intel camera driver isn't in upstream yet, since that would break a bunch of the hacks, because suddenly there would be two cpu cache incoherent devices in an x86 system.
Ideally someone fixes this, but I'm not hopeful.
I recommend pouring more drinks.
What is definitely not correct is claiming that dma-buf wasn't meant for this. We discussed cache coherency issues endlessly in budapest 12 or so years ago, I was there. It's just that the reality of the current implementation is falling short, and every time someone tries to fix it we get shouted down by dma api maintainers for looking behind their curtain.
Well that explains this, I've joined the party a year later and haven't witnessed all of this.
tldr; You have to magically know to not use cpu cached allocators on these machines.
Or reject the attachment. As far as I can see that is still the cleanest option.
Regards, Christian.
Aside: This is also why vgem allocates wc memory on x86. It's the least common denominator that works. arm unfortunately doesn't allow you to allocate wc memory, so there, stuff is simply somewhat broken. -Daniel
On Mon, Jul 04, 2022 at 03:48:03PM +0200, Christian König wrote:
Hi Daniel,
On 25.06.22 at 00:02, Daniel Vetter wrote:
On Thu, Jun 23, 2022 at 01:32:18PM +0200, Christian König wrote:
On 23.06.22 at 13:27, Daniel Stone wrote:
[SNIP] If it's really your belief that dmabuf requires universal snooping, I recommend you send the patch to update the documentation, as well as to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
Well, to be honest I think that would indeed be necessary.
What we have created are essentially two different worlds, one for PCI devices and one for the rest.
This was indeed not the intention, but it's a fact that basically all DMA-buf based PCI drivers assume coherent access.
dma-buf does not require universal snooping.
It does de facto require that all device access is coherent with all other device access, and consistent with the exporter's notion of how cpu coherency is achieved. Note that coherent does not mean snooping: as long as all devices do unsnooped access and the exporter either does wc/uc or flushes caches, that's perfectly fine, and that is how all the arm soc dma-buf sharing works.
We should probably start documenting that better.
Agreed :-)
Are you volunteering to type up something that reflects the current sorry state of affairs? I'm not sure I'm the best since I guess I've been too badly involved in this ...
We did originally have the wording in there that you have to map/unmap around every device access, but that got dropped because no one was doing that anyway.
Now where this totally breaks down is how we make this work, because the idea was that dma_buf_attach validates this all. Where this means all the hilarious reasons buffer sharing might not work:
- wrong coherency mode (cpu cached or not)
- not contiguous (we do check that, but only once we get the sg from dma_buf_attachment_map, which strictly speaking is a bit too late but most drivers do attach&map as one step so not that bad in practice)
- whether the dma api will throw in bounce buffers or not
- random shit like "oh this is in the wrong memory bank", which I think never landed in upstream
p2p connectivity is about the only one that gets this right, yay. And the only reason we can even get it right is because all the information is exposed to drivers fully.
Yeah, that's why I designed P2P that way :)
I also don't think it's that bad, at least for radeon, nouveau and amdgpu all the migration restrictions are actually handled correctly.
In other words when a DMA-buf is about to be used by another device we use TTM to move the buffer around so that it can actually be accessed by that device.
What I haven't foreseen in here is that we need to deal with different caching behaviors between exporter and importer.
Yeah we should have done caching explicitly and full opt-in like with p2p. The trouble is that this would have been a multi-year fight with dma api folks, who insist it must be all transparent. So the politically clever thing was to just ignore the problem and land dma-buf, but it comes back to bite us now :-/
The issue is that the device dma api refuses to share this information because it would "leak". Which sucks, because we have de facto built every single cross-device use-case of dma-buf on the assumption we can check this (up to gl/vk specs), but oh well.
So in practice this gets sorted out by endless piles of hacks to make individual use-cases work.
Oh and: This is definitely not limited to arm socs. x86 socs with intel at least have exactly all the same issues, and they get solved by adding various shitty hacks to the involved drivers (like i915+amdgpu). Luckily the intel camera driver isn't in upstream yet, since that would break a bunch of the hacks, because suddenly there would be two cpu cache incoherent devices in an x86 system.
Ideally someone fixes this, but I'm not hopeful.
I recommend pouring more drinks.
What is definitely not correct is claiming that dma-buf wasn't meant for this. We discussed cache coherency issues endlessly in Budapest 12 or so years ago, I was there. It's just that the reality of the current implementation is falling short, and every time someone tries to fix it we get shouted down by dma api maintainers for looking behind their curtain.
Well that explains this, I've joined the party a year later and haven't witnessed all of this.
Yay, cleared up another confusion!
tldr; You have to magically know to not use cpu cached allocators on these machines.
Or reject the attachment. As far as I can see that is still the cleanest option.
Yeah rejecting is always an ok thing if it just doesn't work. -Daniel
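As a rough sketch of the "reject the attachment" option discussed above: an exporter can fail its attach callback when its backing store and the importer's capabilities don't match, which pushes userspace onto its existing fallback/copy paths. dev_snoops_cpu_caches() and my_bo below are made-up placeholders for whatever platform and driver knowledge is actually available; this is an illustration, not an existing API.

/*
 * Sketch only: reject an importer that cannot snoop CPU caches when the
 * backing store is CPU cached. dev_snoops_cpu_caches() is hypothetical.
 */
#include <linux/dma-buf.h>
#include <linux/errno.h>

struct my_bo {
	bool cpu_cached;		/* backing store is CPU cached memory */
};

static int my_dmabuf_attach(struct dma_buf *dmabuf,
			    struct dma_buf_attachment *attach)
{
	struct my_bo *bo = dmabuf->priv;

	if (bo->cpu_cached && !dev_snoops_cpu_caches(attach->dev))
		return -EOPNOTSUPP;	/* force the fallback/copy path */

	return 0;
}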
Am 09.08.22 um 16:46 schrieb Daniel Vetter:
On Mon, Jul 04, 2022 at 03:48:03PM +0200, Christian König wrote:
Hi Daniel,
Am 25.06.22 um 00:02 schrieb Daniel Vetter:
On Thu, Jun 23, 2022 at 01:32:18PM +0200, Christian König wrote:
Am 23.06.22 um 13:27 schrieb Daniel Stone:
[SNIP] If it's really your belief that dmabuf requires universal snooping, I recommend you send the patch to update the documentation, as well as to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
Well, to be honest I think that would indeed be necessary.
What we have created are essentially two different worlds, one for PCI devices and one for the rest.
This was indeed not the intention, but it's a fact that basically all DMA-buf based PCI drivers assume coherent access.
dma-buf does not require universal snooping.
It does de facto require that all device access is coherent with all other device access, and consistent with the exporter's notion of how cpu coherency is achieved. Note that coherent does not mean snooping: as long as all devices do unsnooped access and the exporter either does wc/uc or flushes caches, that's perfectly fine, and that's how all the arm soc dma-buf sharing works.
We should probably start documenting that better.
Agreed :-)
Are you volunteering to type up something that reflects the current sorry state of affairs? I'm not sure I'm the best since I guess I've been too badly involved in this ...
Yeah, already working on this. But you know, normal human being with two hands and one head.
With all the workload I'm pretty sure people would have cloned me by now if tech would be just a bit more advanced.
Christian.
We did originally have the wording in there that you have to map/unmap around every device access, but that got dropped because no one was doing that anyway.
Now where this totally breaks down is how we make this work, because the idea was that dma_buf_attach validates this all. Where this means all the hilarious reasons buffer sharing might not work:
- wrong coherency mode (cpu cached or not)
- not contiguous (we do check that, but only once we get the sg from dma_buf_attachment_map, which strictly speaking is a bit too late but most drivers do attach&map as one step so not that bad in practice)
- whether the dma api will throw in bounce buffers or not
- random shit like "oh this is in the wrong memory bank", which I think never landed in upstream
p2p connectivity is about the only one that gets this right, yay. And the only reason we can even get it right is because all the information is exposed to drivers fully.
Yeah, that's why I designed P2P that way :)
I also don't think it's that bad, at least for radeon, nouveau and amdgpu all the migration restrictions are actually handled correctly.
In other words when a DMA-buf is about to be used by another device we use TTM to move the buffer around so that it can actually be accessed by that device.
What I haven't foreseen in here is that we need to deal with different caching behaviors between exporter and importer.
Yeah we should have done caching explicitly and full opt-in like with p2p. The trouble is that this would have been a multi-year fight with dma api folks, who insist it must be all transparent. So the politically clever thing was to just ignore the problem and land dma-buf, but it comes back to bite us now :-/
The issue is that the device dma api refuses to share this information because it would "leak". Which sucks, because we have de facto built every single cross-device use-case of dma-buf on the assumption we can check this (up to gl/vk specs), but oh well.
So in practice this gets sorted out by endless piles of hacks to make individual use-cases work.
Oh and: This is definitely not limited to arm socs. x86 socs with intel at least have exactly all the same issues, and they get solved by adding various shitty hacks to the involved drivers (like i915+amdgpu). Luckily the intel camera driver isn't in upstream yet, since that would break a bunch of the hacks once there are suddenly 2 cpu cache incoherent devices in an x86 system.
Ideally someone fixes this, but I'm not hopeful.
I recommend pouring more drinks.
What is definitely not correct is claiming that dma-buf wasn't meant for this. We discussed cache coherency issues endlessly in Budapest 12 or so years ago, I was there. It's just that the reality of the current implementation is falling short, and every time someone tries to fix it we get shouted down by dma api maintainers for looking behind their curtain.
Well that explains this, I've joined the party a year later and haven't witnessed all of this.
Yay, cleared up another confusion!
tldr; You have to magically know to not use cpu cached allocators on these machines.
Or reject the attachment. As far as I can see that is still the cleanest option.
Yeah rejecting is always an ok thing if it just doesn't work. -Daniel
Am Donnerstag, dem 23.06.2022 um 13:10 +0200 schrieb Christian König:
Am 23.06.22 um 12:13 schrieb Lucas Stach:
[SNIP]
On most of the multimedia targeted ARM SoCs being unable to snoop the cache is the norm, not an exception.
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to the dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow to skip the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
Well for the AMD and Intel use cases we at least have the opportunity to signal cache flushing, but I'm not sure if that counts for everybody.
Sure, all the non-coherent arches have some way to do the cache maintenance in some explicit way. Non coherent and no cache maintenance instruction would be a recipe for disaster. ;)
What we would rather do for those use cases is an indicator on the DMA-buf if the underlying backing store is CPU cached or not. The importer can then cleanly reject the use cases where it can't support CPU cache snooping.
This then results in the normal fallback paths which we have anyway for those use cases because DMA-buf sharing is not always possible.
That's a very x86 centric world view you have there. 99% of DMA-buf uses on those cheap ARM SoCs is non-snooping. We can not do any fallbacks here, as the whole graphics world on those SoCs with their different IP cores mixed together depends on DMA-buf sharing working efficiently even when the SoC is mostly non coherent.
In fact DMA-buf sharing works fine on most of those SoCs because everyone just assumes that all the accelerators don't snoop, so the memory shared via DMA-buf is mostly CPU uncached. It only falls apart for uses like the UVC cameras, where the shared buffer ends up being CPU cached.
Well then the existing DMA-buf framework is not what you want to use for this.
Sorry, but this is just ignoring reality. You try to flag 8+ years of DMA-buf usage on non-coherent arches as "you shouldn't do this". At this point there are probably a lot more users (drivers) of DMA-buf in the kernel for devices, which are used on non-coherent arches, than there are on coherent arches.
Non-coherent without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
Non coherent access, including your non-snoop scanout, and no domain transition signal just doesn't go together when you want to solve things in a generic way.
Remember that in a fully (not only IO) coherent system the CPU isn't the only agent that may cache the content you are trying to access here. The dirty cacheline could reasonably still be sitting in a GPU or VPU cache, so you need some way to clean those cachelines, which isn't a magic "importer knows how to call CPU cache clean instructions".
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs and I don't really see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for "coherent" didn't immediately turn something up within the DMA-buf dirs.
Regards, Lucas
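To make the dma_sync-style idea Lucas floats above a bit more concrete, here is a purely hypothetical sketch; none of these ops exist in the upstream dma-buf API, and the exporter-side no-op for already uncached memory is exactly the behaviour he describes.

/*
 * Hypothetical dma-buf extension, for illustration only: an importer-driven
 * domain transition that the exporter can turn into a no-op when the
 * backing store is already CPU uncached.
 */
#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>

struct my_bo {
	bool cpu_cached;		/* made-up exporter buffer object */
	struct sg_table *sgt;		/* mapping handed out at map time */
};

static int my_begin_noncoherent_access(struct dma_buf_attachment *attach,
				       enum dma_data_direction dir)
{
	struct my_bo *bo = attach->dmabuf->priv;

	if (!bo->cpu_cached)
		return 0;	/* uncached backing store: nothing to clean */

	/* clean dirty CPU cache lines before the non-snooping device access */
	dma_sync_sgtable_for_device(attach->dev, bo->sgt, dir);
	return 0;
}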
Am 23.06.22 um 13:29 schrieb Lucas Stach:
[SNIP]
Well then the existing DMA-buf framework is not what you want to use for this.
Sorry, but this is just ignoring reality. You try to flag 8+ years of DMA-buf usage on non-coherent arches as "you shouldn't do this". At this point there are probably a lot more users (drivers) of DMA-buf in the kernel for devices, which are used on non-coherent arches, than there are on coherent arches.
Well, it's my reality that people come up with bug reports about that and we have been pushing back on this with the explanation "Hey this is not supported" as long as I can think about it.
I mean I even had somebody from ARM who told me that this is not going to work with our GPUs on a specific SoC. That there are ARM internal use cases which just seem to work because all the devices are non-coherent is completely new to me.
I'm as much surprised as you are about this lack of agreement about such fundamental stuff.
Non-coherent without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
Non coherent access, including your non-snoop scanout, and no domain transition signal just doesn't go together when you want to solve things in a generic way.
Yeah, that's the stuff I totally agree on.
See we absolutely do have the requirement of implementing coherent access without domain transitions for Vulkan and OpenGL+extensions.
When we now have to introduce domain transitions to get non coherent access working we are essentially splitting all the drivers into coherent and non-coherent ones.
That doesn't sound like it would improve interop.
Remember that in a fully (not only IO) coherent system the CPU isn't the only agent that may cache the content you are trying to access here. The dirty cacheline could reasonably still be sitting in a GPU or VPU cache, so you need some way to clean those cachelines, which isn't a magic "importer knows how to call CPU cache clean instructions".
IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I need to double check as well, that's way too long ago) this was kicked out because of the requirements above.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs and I don't really see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for "coherent" didn't immediately turn something up within the DMA-buf dirs.
Search for "cache coherency management". It's quite a while ago, but I do remember helping to review that stuff.
Regards, Christian.
Regards, Lucas
Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
Am 23.06.22 um 13:29 schrieb Lucas Stach:
[SNIP]
Well then the existing DMA-buf framework is not what you want to use for this.
Sorry, but this is just ignoring reality. You try to flag 8+ years of DMA-buf usage on non-coherent arches as "you shouldn't do this". At this point there are probably a lot more users (drivers) of DMA-buf in the kernel for devices, which are used on non-coherent arches, than there are on coherent arches.
Well, it's my reality that people come up with bug reports about that and we have been pushing back on this with the explanation "Hey this is not supported" as long as I can think about it.
I mean I even had somebody from ARM who told me that this is not going to work with our GPUs on a specific SoC. That there are ARM internal use cases which just seem to work because all the devices are non-coherent is completely new to me.
Yes, trying to hook up a peripheral that assumes cache snooping in some design details to a non coherent SoC may end up exploding in various ways. On the other hand you can work around most of those assumptions by marking the memory as uncached to the CPU, which may tank performance, but will work from a correctness PoV.
I'm as much surprised as you are about this lack of agreement about such fundamental stuff.
Non-coherent without explicit domain transfer points is just not going to work. So why can't we solve the issue for DMA-buf in the same way as the DMA API already solved it years ago: by adding the equivalent of the dma_sync calls that do cache maintenance when necessary? On x86 (or any system where things are mostly coherent) you could still no-op them for the common case and only trigger cache cleaning if the importer explicitly says that it is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions because that would break the intended use case.
Non coherent access, including your non-snoop scanout, and no domain transition signal just doesn't go together when you want to solve things in a generic way.
Yeah, that's the stuff I totally agree on.
See we absolutely do have the requirement of implementing coherent access without domain transitions for Vulkan and OpenGL+extensions.
Coherent can mean 2 different things:
- CPU cached with snooping from the IO device
- CPU uncached
The Vulkan and GL "coherent" uses are really coherent without explicit domain transitions, so on non coherent arches that require the transitions the only way to implement this is by making the memory CPU uncached. Which from a performance PoV will probably not be what app developers expect, but will still expose the correct behavior.
When we now have to introduce domain transitions to get non coherent access working we are essentially splitting all the drivers into coherent and non-coherent ones.
That doesn't sound like it would improve interop.
Remember that in a fully (not only IO) coherent system the CPU isn't the only agent that may cache the content you are trying to access here. The dirty cacheline could reasonably still be sitting in a GPU or VPU cache, so you need some way to clean those cachelines, which isn't a magic "importer knows how to call CPU cache clean instructions".
IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I need to double check as well, that's way too long ago) this was kicked out because of the requirements above.
The DMA_BUF_IOCTL_SYNC is available in upstream, with the explicit documentation that "userspace can not rely on coherent access".
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs and I don't really see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for "coherent" didn't immediately turn something up within the DMA-buf dirs.
Search for "cache coherency management". It's quite a while ago, but I do remember helping to review that stuff.
That only turns up the lines in DMA_BUF_IOCTL_SYNC doc, which are saying the exact opposite of the DMA-buf is always coherent.
I also don't see why you think that both world views are so totally different. We could just require explicit domain transitions for non-snoop access, which would probably solve your scanout issue and would not be a problem for most ARM systems, where we could no-op this if the buffer is already in uncached memory and at the same time keep the "x86 assumes cached + snooped access by default" semantics.
Regards, Lucas
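For reference, this is roughly how the DMA_BUF_IOCTL_SYNC mentioned above is used from userspace to bracket CPU access to an mmap'ed dma-buf; as quoted, the documentation is explicit that this does not give userspace a coherency guarantee across arbitrary exporters, only a chance for the exporter to do its cache maintenance.

/* Bracket a CPU write to an mmap'ed dma-buf with the sync ioctl. */
#include <linux/dma-buf.h>	/* struct dma_buf_sync, DMA_BUF_IOCTL_SYNC */
#include <sys/ioctl.h>
#include <string.h>

static int cpu_fill(int dmabuf_fd, void *map, size_t len)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync))
		return -1;

	memset(map, 0, len);	/* CPU access happens between start and end */

	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}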
Am 23.06.22 um 14:14 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
Am 23.06.22 um 13:29 schrieb Lucas Stach: [SNIP] I mean I even had somebody from ARM who told me that this is not going to work with our GPUs on a specific SoC. That there are ARM internal use cases which just seem to work because all the devices are non-coherent is completely new to me.
Yes, trying to hook up a peripheral that assumes cache snooping in some design details to a non coherent SoC may end up exploding in various ways. On the other hand you can work around most of those assumptions by marking the memory as uncached to the CPU, which may tank performance, but will work from a correctness PoV.
Yeah, and exactly that's what I meant with "DMA-buf is not the framework for this".
See we do support using uncached/not snooped memory in DMA-buf, but only for the exporter side.
For example the AMD and Intel GPUs have a per buffer flag for this.
The importer on the other hand needs to be able to handle whatever the exporter provides.
[SNIP]
Non coherent access, including your non-snoop scanout, and no domain transition signal just doesn't go together when you want to solve things in a generic way.
Yeah, that's the stuff I totally agree on.
See we absolutely do have the requirement of implementing coherent access without domain transitions for Vulkan and OpenGL+extensions.
Coherent can mean 2 different things:
- CPU cached with snooping from the IO device
- CPU uncached
The Vulkan and GL "coherent" uses are really coherent without explicit domain transitions, so on non coherent arches that require the transitions the only way to implement this is by making the memory CPU uncached. Which from a performance PoV will probably not be what app developers expect, but will still expose the correct behavior.
Quite a bummer for performance, but yes that should work.
Remember that in a fully (not only IO) coherent system the CPU isn't the only agent that may cache the content you are trying to access here. The dirty cacheline could reasonably still be sitting in a GPU or VPU cache, so you need some way to clean those cachelines, which isn't a magic "importer knows how to call CPU cache clean instructions".
IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I need to double check as well, that's way to long ago) this was kicked out because of the requirements above.
The DMA_BUF_IOCTL_SYNC is available in upstream, with the explicit documentation that "userspace can not rely on coherent access".
Yeah, double checked that as well. This is for the coherency case on the exporter side.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs and I don't really see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for "coherent" didn't immediately turn something up within the DMA-buf dirs.
Search for "cache coherency management". It's quite a while ago, but I do remember helping to review that stuff.
That only turns up the lines in DMA_BUF_IOCTL_SYNC doc, which are saying the exact opposite of the DMA-buf is always coherent.
Sounds like I'm not making clear what I want to say here: For the exporter using cache coherent memory is optional, for the importer it isn't.
For the exporter it is perfectly valid to use kmalloc, get_free_page etc... on his buffers as long as it uses the DMA API to give the importer access to it.
The importer on the other hand needs to be able to deal with that. When this is not the case then the importer somehow needs to work around that.
Either by flushing the CPU caches or by rejecting using the imported buffer for this specific use case (like AMD and Intel drivers should be doing).
If the Intel or ARM display drivers need non-cached memory and don't reject buffer where they don't know this then that's certainly a bug in those drivers.
Otherwise we would need to change all DMA-buf exporters to use a special function for allocating non-coherent memory and that is certainly not going to fly.
I also don't see why you think that both world views are so totally different. We could just require explicit domain transitions for non-snoop access, which would probably solve your scanout issue and would not be a problem for most ARM systems, where we could no-op this if the buffer is already in uncached memory and at the same time keep the "x86 assumes cached + snooped access by default" semantics.
Well the key point is we intentionally rejected that design previously because it created all kinds of trouble as well.
For this limited use case of doing a domain transition right before scanout it might make sense, but that's just one use case.
Regards, Christian.
Regards, Lucas
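As a concrete illustration of the per-buffer flag Christian mentions for AMD GPUs, this is roughly how a userspace allocator asks amdgpu for write-combined (CPU uncached) GTT memory through libdrm; the surrounding device initialisation is omitted and the alignment value is just an example.

#include <stdint.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

/* Allocate a buffer in GTT, mapped write-combined (uncached) for the CPU. */
static int alloc_uswc_bo(amdgpu_device_handle dev, uint64_t size,
			 amdgpu_bo_handle *bo)
{
	struct amdgpu_bo_alloc_request req = {
		.alloc_size = size,
		.phys_alignment = 4096,
		.preferred_heap = AMDGPU_GEM_DOMAIN_GTT,
		.flags = AMDGPU_GEM_CREATE_CPU_GTT_USWC,  /* the per-buffer flag */
	};

	return amdgpu_bo_alloc(dev, &req, bo);
}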
Am Donnerstag, dem 23.06.2022 um 14:52 +0200 schrieb Christian König:
Am 23.06.22 um 14:14 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
Am 23.06.22 um 13:29 schrieb Lucas Stach: [SNIP] I mean I even had somebody from ARM who told me that this is not going to work with our GPUs on a specific SoC. That there are ARM internal use cases which just seem to work because all the devices are non-coherent is completely new to me.
Yes, trying to hook up a peripheral that assumes cache snooping in some design details to a non coherent SoC may end up exploding in various ways. On the other hand you can work around most of those assumptions by marking the memory as uncached to the CPU, which may tank performance, but will work from a correctness PoV.
Yeah, and exactly that's what I meant with "DMA-buf is not the framework for this".
See we do support using uncached/not snooped memory in DMA-buf, but only for the exporter side.
For example the AMD and Intel GPUs have a per buffer flag for this.
The importer on the other hand needs to be able to handle whatever the exporter provides.
I fail to construct a case where you want the Vulkan/GL "no domain transition" coherent semantic without the allocator knowing about this. If you need this and the system is non-snooping, surely the allocator will choose uncached memory.
I agree that you absolutely need to fail the usage when someone imports a CPU cached buffer and then tries to use it as GL coherent on a non-snooping system. That simply will not work.
[SNIP]
Non coherent access, including your non-snoop scanout, and no domain transition signal just doesn't go together when you want to solve things in a generic way.
Yeah, that's the stuff I totally agree on.
See we absolutely do have the requirement of implementing coherent access without domain transitions for Vulkan and OpenGL+extensions.
Coherent can mean 2 different things:
- CPU cached with snooping from the IO device
- CPU uncached
The Vulkan and GL "coherent" uses are really coherent without explicit domain transitions, so on non coherent arches that require the transitions the only way to implement this is by making the memory CPU uncached. Which from a performance PoV will probably not be what app developers expect, but will still expose the correct behavior.
Quite a bummer for performance, but yes that should work.
Remember that in a fully (not only IO) coherent system the CPU isn't the only agent that may cache the content you are trying to access here. The dirty cacheline could reasonably still be sitting in a GPU or VPU cache, so you need some way to clean those cachelines, which isn't a magic "importer knows how to call CPU cache clean instructions".
IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I need to double check as well, that's way too long ago) this was kicked out because of the requirements above.
The DMA_BUF_IOCTL_SYNC is available in upstream, with the explicit documentation that "userspace can not rely on coherent access".
Yeah, double checked that as well. This is for the coherency case on the exporter side.
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs and I don't really see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for "coherent" didn't immediately turn something up within the DMA-buf dirs.
Search for "cache coherency management". It's quite a while ago, but I do remember helping to review that stuff.
That only turns up the lines in DMA_BUF_IOCTL_SYNC doc, which are saying the exact opposite of the DMA-buf is always coherent.
Sounds like I'm not making clear what I want to say here: For the exporter using cache coherent memory is optional, for the importer it isn't.
For the exporter it is perfectly valid to use kmalloc, get_free_page etc... on his buffers as long as it uses the DMA API to give the importer access to it.
And here is where our line of thought diverges: the DMA API allows snooping and non-snooping devices to work together just fine, as it has explicit domain transitions, which are no-ops if both devices are snooping, but will do the necessary cache maintenance when one of them is non-snooping but the memory is CPU cached.
I don't see why DMA-buf should be any different here. Yes, you can not support the "no domain transition" sharing when the memory is CPU cached and one of the devices is non-snooping, but you can support 99% of real use-cases like the non-snooped scanout or the UVC video import.
The importer on the other hand needs to be able to deal with that. When this is not the case then the importer somehow needs to work around that.
Why? The importer maps the dma-buf via dma_buf_map_attachment, which in most cases triggers a map via the DMA API on the exporter side. This map via the DMA API will already do the right thing in terms of cache management, it's just that we explicitly disable it via DMA_ATTR_SKIP_CPU_SYNC in DRM because we know that the mapping will be cached, which violates the DMA API explicit domain transition anyway.
Either by flushing the CPU caches or by rejecting using the imported buffer for this specific use case (like AMD and Intel drivers should be doing).
If the Intel or ARM display drivers need non-cached memory and don't reject buffer where they don't know this then that's certainly a bug in those drivers.
It's not just display drivers, video codec accelerators and most GPUs in this space are also non-snooping. In the ARM SoC world everyone just assumes you are non-snooping, which is why things work for most cases and only a handful like the UVC video import is broken.
Otherwise we would need to change all DMA-buf exporters to use a special function for allocating non-coherent memory and that is certainly not going to fly.
I also don't see why you think that both world views are so totally different. We could just require explicit domain transitions for non-snoop access, which would probably solve your scanout issue and would not be a problem for most ARM systems, where we could no-op this if the buffer is already in uncached memory and at the same time keep the "x86 assumes cached + snooped access by default" semantics.
Well the key point is we intentionally rejected that design previously because it created all kinds of trouble as well.
I would really like to know what issues popped up there. Moving the dma-buf attachment to work more like a buffer used with the DMA API seems like a good thing to me.
For this limited use case of doing a domain transition right before scanout it might make sense, but that's just one use case.
The only case I see that we still couldn't support with a change in that direction is the GL coherent access to an imported buffer that has been allocated from CPU cached memory on a system with non-snooping agents. Which to me sounds like a pretty niche use-case, but I would be happy to be proven wrong.
Regards, Lucas
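To illustrate the DMA_ATTR_SKIP_CPU_SYNC point above, a stripped-down exporter map callback might look like the sketch below; my_bo is again a stand-in and the sg_table handling and error paths are simplified. Dropping the attr is what would let the DMA API do the cache maintenance Lucas describes, at the cost of doing it on every map.

/* Simplified exporter-side map, showing where the CPU sync gets skipped. */
#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>

struct my_bo {
	struct sg_table *sgt;	/* assumed to be built elsewhere */
};

static struct sg_table *my_map_dma_buf(struct dma_buf_attachment *attach,
				       enum dma_data_direction dir)
{
	struct my_bo *bo = attach->dmabuf->priv;
	struct sg_table *sgt = bo->sgt;
	int ret;

	/*
	 * DMA_ATTR_SKIP_CPU_SYNC suppresses the cache maintenance the DMA API
	 * would otherwise perform here; without it, dma_map_sgtable() would
	 * clean the CPU caches on behalf of a non-snooping importer.
	 */
	ret = dma_map_sgtable(attach->dev, sgt, dir, DMA_ATTR_SKIP_CPU_SYNC);
	if (ret)
		return ERR_PTR(ret);

	return sgt;
}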
Am 23.06.22 um 17:26 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 14:52 +0200 schrieb Christian König:
Am 23.06.22 um 14:14 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
Am 23.06.22 um 13:29 schrieb Lucas Stach: [SNIP] I mean I even had somebody from ARM who told me that this is not going to work with our GPUs on a specific SoC. That there are ARM internal use cases which just seem to work because all the devices are non-coherent is completely new to me.
Yes, trying to hook up a peripheral that assumes cache snooping in some design details to a non coherent SoC may end up exploding in various ways. On the other hand you can work around most of those assumptions by marking the memory as uncached to the CPU, which may tank performance, but will work from a correctness PoV.
Yeah, and exactly that's what I meant with "DMA-buf is not the framework for this".
See we do support using uncached/not snooped memory in DMA-buf, but only for the exporter side.
For example the AMD and Intel GPUs have a per buffer flag for this.
The importer on the other hand needs to be able to handle whatever the exporter provides.
I fail to construct a case where you want the Vulkan/GL "no domain transition" coherent semantic without the allocator knowing about this. If you need this and the system is non-snooping, surely the allocator will choose uncached memory.
No it won't. The allocator in the exporter is independent of the importer.
That is an important and intentional design decision, cause otherwise you wouldn't have exporters/importers in the first place and rather a centralized allocation pool like what dma-heap implements.
See the purpose of DMA-buf is to expose the buffers in the way the exporter wants to expose them. So when the exporting driver wants to allocate normal cached system memory then that is perfectly fine and completely fits into this design.
Otherwise we would need to adjust all exporters to the importers, which is potentially not even possible.
I agree that you absolutely need to fail the usage when someone imports a CPU cached buffer and then tries to use it as GL coherent on a non-snooping system. That simply will not work.
Exactly that, yes. That's what the attach callback is good for.
See we already have tons of cases where buffers can't be shared because they weren't initially allocated in a way the importer can deal with. But that's perfectly ok and intentional.
For example just take a configuration where a dedicated GPU clones the display with an integrated GPU. The dedicated GPU needs the image in local memory for scanout which is usually not accessible to the integrated GPU.
So either attaching the DMA-buf or creating the KMS framebuffer config will fail and we are running into the fallback path which involves an extra copy. And that is perfectly fine and intentional since this configuration is not supported by the hardware.
[SNIP]
You can of course use DMA-buf in an incoherent environment, but then you can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of the complaints that it doesn't work on most ARM SoCs and I don't really see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for "coherent" didn't immediately turn something up within the DMA-buf dirs.
Search for "cache coherency management". It's quite a while ago, but I do remember helping to review that stuff.
That only turns up the lines in DMA_BUF_IOCTL_SYNC doc, which are saying the exact opposite of the DMA-buf is always coherent.
Sounds like I'm not making clear what I want to say here: For the exporter using cache coherent memory is optional, for the importer it isn't.
For the exporter it is perfectly valid to use kmalloc, get_free_page etc... on his buffers as long as it uses the DMA API to give the importer access to it.
And here is where our line of thought diverges: the DMA API allows snooping and non-snooping devices to work together just fine, as it has explicit domain transitions, which are no-ops if both devices are snooping, but will do the necessary cache maintenance when one of them is non-snooping but the memory is CPU cached.
I don't see why DMA-buf should be any different here. Yes, you can not support the "no domain transition" sharing when the memory is CPU cached and one of the devices is non-snooping, but you can support 99% of real use-cases like the non-snooped scanout or the UVC video import.
Well I didn't say we couldn't do it that way. What I'm saying is that it was intentionally decided against.
We could re-iterate that decision, but this would mean that all existing exporters would now need to provide additional functionality.
The importer on the other hand needs to be able to deal with that. When this is not the case then the importer somehow needs to work around that.
Why? The importer maps the dma-buf via dma_buf_map_attachment, which in most cases triggers a map via the DMA API on the exporter side. This map via the DMA API will already do the right thing in terms of cache management, it's just that we explicitly disable it via DMA_ATTR_SKIP_CPU_SYNC in DRM because we know that the mapping will be cached, which violates the DMA API explicit domain transition anyway.
Why doesn't the importer simply call dma_sync_sg_for_device() as necessary? See the importer already knows when it needs to access the buffer and as far as I can see has all the necessary variables to do the sync.
The exporter on the other hand doesn't know that. So we would need to transport this information.
Another fundamental problem is that the DMA API isn't designed for device to device transitions. In other words you have CPU->device and device->CPU transition, but not device->device. As far as I can see the DMA API should already have the necessary information if things like cache flushes are necessary or not.
Either by flushing the CPU caches or by rejecting using the imported buffer for this specific use case (like AMD and Intel drivers should be doing).
If the Intel or ARM display drivers need non-cached memory and don't reject buffer where they don't know this then that's certainly a bug in those drivers.
It's not just display drivers, video codec accelerators and most GPUs in this space are also non-snooping. In the ARM SoC world everyone just assumes you are non-snooping, which is why things work for most cases and only a handful like the UVC video import is broken.
That is really interesting to know, but I still think that DMA-buf was absolutely not designed for this use case.
From my point of view the primary reason for this was laptops with both dedicated and integrated GPUs, webcams etc...
That you have a huge number of ARM specific devices which can interop with themselves, but not with devices outside of their domain is not something foreseen here.
Regards, Christian.
Otherwise we would need to change all DMA-buf exporters to use a special function for allocating non-coherent memory and that is certainly not going to fly.
I also don't see why you think that both world views are so totally different. We could just require explicit domain transitions for non-snoop access, which would probably solve your scanout issue and would not be a problem for most ARM systems, where we could no-op this if the buffer is already in uncached memory and at the same time keep the "x86 assumes cached + snooped access by default" semantics.
Well the key point is we intentionally rejected that design previously because it created all kinds of trouble as well.
I would really like to know what issues popped up there. Moving the dma-buf attachment to work more like a buffer used with the DMA API seems like a good thing to me.
For this limited use case of doing a domain transition right before scanout it might make sense, but that's just one use case.
The only case I see that we still couldn't support with a change in that direction is the GL coherent access to an imported buffer that has been allocated from CPU cached memory on a system with non-snooping agents. Which to me sounds like a pretty niche use-case, but I would be happy to be proven wrong.
Regards, Lucas
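A minimal sketch of the importer-side sync Christian asks about, assuming the importer kept the sg_table it got back from dma_buf_map_attachment(); as Lucas explains in the following reply, this only covers dirty CPU cache lines and not lines held in other devices' caches on fully coherent systems.

#include <linux/dma-mapping.h>

/* Before a non-snooping engine reads the buffer (e.g. scanout): clean
 * dirty CPU cache lines so the device sees what the CPU wrote. */
static void prepare_nonsnoop_read(struct device *importer_dev,
				  struct sg_table *sgt)
{
	dma_sync_sgtable_for_device(importer_dev, sgt, DMA_TO_DEVICE);
}

/* After a non-snooping engine has written the buffer (e.g. capture),
 * before the CPU reads it: invalidate stale CPU cache lines. */
static void finish_nonsnoop_write(struct device *importer_dev,
				  struct sg_table *sgt)
{
	dma_sync_sgtable_for_cpu(importer_dev, sgt, DMA_FROM_DEVICE);
}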
Am Freitag, dem 24.06.2022 um 08:54 +0200 schrieb Christian König:
Am 23.06.22 um 17:26 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 14:52 +0200 schrieb Christian König:
Am 23.06.22 um 14:14 schrieb Lucas Stach:
Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
Am 23.06.22 um 13:29 schrieb Lucas Stach: [SNIP] I mean I even had somebody from ARM who told me that this is not going to work with our GPUs on a specific SoC. That there are ARM internal use cases which just seem to work because all the devices are non-coherent is completely new to me.
Yes, trying to hook up a peripheral that assumes cache snooping in some design details to a non coherent SoC may end up exploding in various ways. On the other hand you can work around most of those assumptions by marking the memory as uncached to the CPU, which may tank performance, but will work from a correctness PoV.
Yeah, and exactly that's what I meant with "DMA-buf is not the framework for this".
See we do support using uncached/not snooped memory in DMA-buf, but only for the exporter side.
For example the AMD and Intel GPUs have a per buffer flag for this.
The importer on the other hand needs to be able to handle whatever the exporter provides.
I fail to construct a case where you want the Vulkan/GL "no domain transition" coherent semantic without the allocator knowing about this. If you need this and the system is non-snooping, surely the allocator will choose uncached memory.
No it won't. The allocator in the exporter is independent of the importer.
That is an important and intentional design decision, cause otherwise you wouldn't have exporters/importers in the first place and rather a centralized allocation pool like what dma-heap implements.
See the purpose of DMA-buf is to expose the buffers in the way the exporter wants to expose them. So when the exporting driver wants to allocate normal cached system memory then that is perfectly fine and completely fits into this design.
I'm specifically talking about the case where a snooping exporter would allocate the GL coherent buffer and a non-snooping importer would need to access that buffer with the same "no domain transition needed" semantic. That is the thing which we can not make work at all and need to fail the attach. If both the exporter and importer are non-snooping you would probably get uncached memory, as long as the exporter knows how the buffer will be used. Is there a real use-case where the exporter doesn't know that the buffer will be used as GL/Vulkan coherent and we can't do fallback on the importer side?
Otherwise we would need to adjust all exporters to the importers, which is potentially not even possible.
I agree that you absolutely need to fail the usage when someone imports a CPU cached buffer and then tries to use it as GL coherent on a non-snooping system. That simply will not work.
Exactly that, yes. That's what the attach callback is good for.
See we already have tons of cases where buffers can't be shared because they weren't initially allocated in a way the importer can deal with. But that's perfectly ok and intentional.
For example just take a configuration where a dedicated GPU clones the display with an integrated GPU. The dedicated GPU needs the image in local memory for scanout which is usually not accessible to the integrated GPU.
So either attaching the DMA-buf or creating the KMS framebuffer config will fail and we are running into the fallback path which involves an extra copy. And that is perfectly fine and intentional since this configuration is not supported by the hardware.
[SNIP]
And here is where our line of thought diverges: the DMA API allows snooping and non-snooping devices to work together just fine, as it has explicit domain transitions, which are no-ops if both devices are snooping, but will do the necessary cache maintenance when one of them is non-snooping but the memory is CPU cached.
I don't see why DMA-buf should be any different here. Yes, you can not support the "no domain transition" sharing when the memory is CPU cached and one of the devices is non-snooping, but you can support 99% of real use-cases like the non-snooped scanout or the UVC video import.
Well I didn't say we couldn't do it that way. What I'm saying is that it was intentionally decided against.
We could re-iterate that decision, but this would mean that all existing exporters would now need to provide additional functionality.
The way I see it we would only need this for exporters that potentially export CPU cached memory, but need to interop with non-snooping importers. I guess that can be done on a case-by-case basis and wouldn't be a big flag day operation.
The importer on the other hand needs to be able to deal with that. When this is not the case then the importer somehow needs to work around that.
Why? The importer maps the dma-buf via dma_buf_map_attachment, which in most cases triggers a map via the DMA API on the exporter side. This map via the DMA API will already do the right thing in terms of cache management, it's just that we explicitly disable it via DMA_ATTR_SKIP_CPU_SYNC in DRM because we know that the mapping will be cached, which violates the DMA API explicit domain transition anyway.
Why doesn't the importer simply call dma_sync_sg_for_device() as necessary? See the importer already knows when it needs to access the buffer and as far as I can see has all the necessary variables to do the sync.
First, it wouldn't be symmetric with the dma_buf_map_attachment, where the actual dma_map_sg also happens on the exporter side.
Second, that is again a very x86 with PCI centric view. The importer flushing CPU caches by calling dma_sync_sg_for_device will only suffice in a world where devices are IO coherent, i.e. they snoop the CPU cache but don't participate fully in the system coherency due to never keeping dirty cache lines for buffers in system memory.
On fully coherent systems like ARM with AMBA CHI or x86 with CXL.cache all devices with access to the buffer can keep dirty cache lines in their device private caches, so any access from a non-snooping agent will require a cache clean on all those devices, which would basically require the dma_buf_sync to be a broadcast operation to the exporter and all attached fully coherent importers.
The exporter on the other hand doesn't know that. So we would need to transport this information.
Another fundamental problem is that the DMA API isn't designed for device to device transitions. In other words you have CPU->device and device->CPU transition, but not device->device. As far as I can see the DMA API should already have the necessary information if things like cache flushes are necessary or not.
Don't you contradict the second part here with the first? The DMA API doesn't have the necessary information about needed cache cleaning on the exporters or other attached importers side, when you only call the dma_sync on the importer, which is exactly why I'm arguing for putting it in the dma_buf ops so we can do the necessary operations on other attached clients to make a device->device transition working reliably.
Either by flushing the CPU caches or by rejecting using the imported buffer for this specific use case (like AMD and Intel drivers should be doing).
If the Intel or ARM display drivers need non-cached memory and don't reject buffer where they don't know this then that's certainly a bug in those drivers.
It's not just display drivers, video codec accelerators and most GPUs in this space are also non-snooping. In the ARM SoC world everyone just assumes you are non-snooping, which is why things work for most cases and only a handful like the UVC video import is broken.
That is really interesting to know, but I still think that DMA-buf was absolutely not designed for this use case.
From my point of view the primary reason for this was laptops with both dedicated and integrated GPUs, webcams etc...
That you have a huge number of ARM specific devices which can interop with themselves, but not with devices outside of their domain is not something foreseen here.
Our recollection of history might differ here, but as Daniel remarked kind of snarkily, most of the initial contributors to dma-buf were from Linaro and TI, both of which were focused on getting device interop working on ARM devices, which at the time were overwhelmingly non-snooping. So I kind of doubt that dma-buf wasn't designed for this use-case.
Regards, Lucas
Le jeudi 23 juin 2022 à 11:33 +0200, Lucas Stach a écrit :
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to the dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow to skip the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
Telling the exporter for every scan is unneeded overhead. If that information is made available "properly", then tracking it in attach/detach is sufficient and lightweight.
Nicolas
Am Montag, dem 27.06.2022 um 09:54 -0400 schrieb Nicolas Dufresne:
Le jeudi 23 juin 2022 à 11:33 +0200, Lucas Stach a écrit :
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to the dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow to skip the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
Telling the exporter for every scan is unneeded overhead. If that information is made available "properly", then tracking it in attach/detach is sufficient and lightweight.
That isn't sufficient. The AMD GPU is a single device, but internally has different engines that have different capabilities with regard to snooping the caches. So you will likely end up with needing the cache clean if the V4L2 buffer is going directly to scanout, which doesn't snoop, but if the usage changes to sampling you don't need any cache flushes.
Also I don't see a big overhead when comparing a kernel internal call that tells the exporter that the importer is going to access the buffer without snooping and thus needs the cache clean once every frame and the need to always clean the cache before DQBUF when a potentially non-snooping importer is attached.
Regards, Lucas
Le lundi 27 juin 2022 à 16:06 +0200, Lucas Stach a écrit :
Am Montag, dem 27.06.2022 um 09:54 -0400 schrieb Nicolas Dufresne:
Le jeudi 23 juin 2022 à 11:33 +0200, Lucas Stach a écrit :
See for example on AMD/Intel hardware most of the engines can perfectly deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and snoop the CPU cache or not because we don't know how the imported DMA-buf will be used later on.
So for those mixed use cases, wouldn't it help to have something similar to the dma_sync in the DMA-buf API, so your scanout usage can tell the exporter that it's going to do non-snoop access and any dirty cache lines must be cleaned? Signaling this to the exporter would allow to skip the cache maintenance if the buffer is in CPU uncached memory, which again is a default case for the ARM SoC world.
Telling the exporter for every scan is unneeded overhead. If that information is made available "properly", then tracking it in attach/detach is sufficient and lightweight.
That isn't sufficient. The AMD GPU is a single device, but internally has different engines that have different capabilities with regard to snooping the caches. So you will likely end up with needing the cache clean if the V4L2 buffer is going directly to scanout, which doesn't snoop, but if the usage changes to sampling you don't need any cache flushes.
Also I don't see a big overhead when comparing a kernel internal call that tells the exporter that the importer is going to access the buffer without snooping and thus needs the cache clean once every frame and the need to always clean the cache before DQBUF when a potentially non-snooping importer is attached.
Ack, thanks for the information.
Regards, Lucas
Hi,
Le jeudi 23 juin 2022 à 10:58 +0200, Lucas Stach a écrit :
In the DMA API keeping things mapped is also a valid use-case, but then you need to do explicit domain transfers via the dma_sync_* family, which DMA-buf has not inherited. Again those sync are no-ops on cache coherent architectures, but do any necessary cache maintenance on non coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of the buffer.
In the example given here with UVC video, you don't know that the buffer will be exported and needs to be coherent without synchronization points, due to the mapping cache at the DRM side. So V4L2 naturally allocates the buffers from CPU cached memory. If the expectation is that those buffers are device coherent without relying on the map/unmap_attachment calls, then V4L2 needs to always synchronize caches on DQBUF when the buffer is allocated from CPU cached memory and a single DMA-buf attachment exists. And while writing this I realize that this is probably exactly what V4L2 should do...
I'm not sure we are making any progress here. Doing so will just regress performance of coherent devices used to render UVC video feeds. In fact, they are all coherent except the display controller (on Intel). What my colleague suggested I try (with the expectation that some adaptation will be needed, perhaps new signalling flags) is to read the dma_coherency_mask values on the devices that call attach() and adapt the v4l2 exporter accordingly.
It's likely wrong as-is and not intended to be used for that, but the value is that it tries to fix the problem, unlike what I'm reading here.
Nicolas
Hi Christian
Am 15.02.21 um 09:58 schrieb Christian König:
Hi guys,
we are currently working on Freesync and direct scan out from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver specific, but the question is should we have something more generic for this?
I had a patchset here that extends iosys-map (former dma-buf-map) with caching information. I'll post a copy.
Sorry for being late to reply.
Best regards Thomas
After all the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.
Am 23.06.22 um 10:13 schrieb Thomas Zimmermann:
Hi Christian
Am 15.02.21 um 09:58 schrieb Christian König:
Hi guys,
we are currently working on Freesync and direct scan out from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver specific, but the question is should we have something more generic for this?
I had a patchset here that extends iosys-map (former dma-buf-map) with caching information. I'll post a copy.
Oh, nice. But why on iosys-map? We need that per DMA-buf.
Thanks, Christian.
Sorry for being late to reply.
Best regards Thomas
After all the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.
Hi
Am 23.06.22 um 10:26 schrieb Christian König:
Am 23.06.22 um 10:13 schrieb Thomas Zimmermann:
Hi Christian
Am 15.02.21 um 09:58 schrieb Christian König:
Hi guys,
we are currently working on Freesync and direct scan out from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver specific, but the question is should we have something more generic for this?
I had a patchset here that extends iosys-map (former dma-buf-map) with caching information. I'll post a copy.
Oh, nice. But why on iosys-map? We need that per DMA-buf.
It's returned via the dma-buf's vmap call within the iosys-map structure. If the dma-buf moves, the following vmap calls always return updated caching information. Maybe it's not quite what you need for Freesync?
I'll use this for format-conversion helpers, which do some optimizations for uncached memory.
Anyway, I have to look for that patch...
Best regards Thomas
Thanks, Christian.
Sorry for being late to reply.
Best regards Thomas
After all the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.
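For context, this is roughly how an importer gets the iosys-map today via dma_buf_vmap(); the is_iomem flag is what currently exists, while the caching information Thomas refers to would be an additional field from his not-yet-posted patchset (hypothetical here). Locking around the vmap is omitted.

#include <linux/dma-buf.h>
#include <linux/iosys-map.h>
#include <linux/printk.h>

static void peek_vmap_info(struct dma_buf *dmabuf)
{
	struct iosys_map map;

	if (dma_buf_vmap(dmabuf, &map))
		return;

	/*
	 * Today the map only distinguishes system memory from I/O memory;
	 * a "cached or not" indicator would have to come from the proposed
	 * iosys-map extension.
	 */
	if (map.is_iomem)
		pr_info("buffer is mapped as I/O memory\n");

	dma_buf_vunmap(dmabuf, &map);
}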
On Mon, Feb 15, 2021 at 12:58 AM Christian König christian.koenig@amd.com wrote:
Hi guys,
we are currently working on Freesync and direct scan out from system memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan out from uncached system memory and we currently don't have a way to communicate that through DMA-buf.
For our specific use case at hand we are going to implement something driver specific, but the question is should we have something more generic for this?
I'm a bit late to this party (and sorry, I didn't read the entire thread), but it occurs to me that dmabuf mmap_info[1] would also get you what you need, i.e. the display driver importing the dma-buf could check whether the exporter is mapping cached or not, and reject the import if needed?
[1] https://patchwork.freedesktop.org/patch/496069/?series=106847&rev=2
After all the system memory access pattern is a PCIe extension and as such something generic.
Regards, Christian.