Try to address the DMA-buf coherency problem

List overview All Threads
Download

newer

older

[PATCH] drm/msm: Add missing check...

[PATCH] dma-buf: fix...

Christian König

20 Oct 2022 20 Oct '22

12:13 p.m.

Hi guys,

after finding that we essentially have two separate worlds for coherent sharing of buffer through DMA-buf I thought I will tackle that problem a bit and at least allow the framework to reject attachments which won't work.

So those patches here add a new dma_coherent flag to each DMA-buf object telling the framework that dev_is_dma_coherent() needs to return true for an importing device to be able to attach. Since we should always have a fallback path this should give userspace the chance to still keep the use case working, either by doing a CPU copy instead or reversing the roles of exporter and importer.

For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear the flag for their buffers if special handling like the USWC flag in amdgpu or the uncached allocations for radeon/nouveau are in use.

Additional to that importers can also check the flag if they have some non-snooping operations like the special scanout case for amdgpu for example.

The patches are only smoke tested and the solution isn't ideal, but as far as I can see should at least keep things working.

Please review and/or comment, Christian.

Show replies by date

Christian König

20 Oct 20 Oct

12:13 p.m.

New subject: [PATCH 1/3] dma-buf: add dma_coherent flag

Start to fix the coherency issues for non PCI devices by adding a dma_coherent flag to the DMA-buf.

The flag should be set by the exporter if only devices which can do coherent DMA-access with respect to the CPU cache are allowed to access the buffer.

Signed-off-by: Christian König christian.koenig@amd.com --- drivers/dma-buf/dma-buf.c | 5 +++++ include/linux/dma-buf.h | 16 ++++++++++++++++ 2 files changed, 21 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index c40d72d318fd..7509807bf485 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -15,6 +15,7 @@ #include <linux/slab.h> #include <linux/dma-buf.h> #include <linux/dma-fence.h> +#include <linux/dma-map-ops.h> #include <linux/anon_inodes.h> #include <linux/export.h> #include <linux/debugfs.h> @@ -635,6 +636,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)

dmabuf->priv = exp_info->priv; dmabuf->ops = exp_info->ops; + dmabuf->dma_coherent = exp_info->coherent; dmabuf->size = exp_info->size; dmabuf->exp_name = exp_info->exp_name; dmabuf->owner = exp_info->owner; @@ -894,6 +896,9 @@ dma_buf_dynamic_attach(struct dma_buf *dmabuf, struct device *dev, if (WARN_ON(importer_ops && !importer_ops->move_notify)) return ERR_PTR(-EINVAL);

+ if (dmabuf->dma_coherent && !dev_is_dma_coherent(dev)) + return ERR_PTR(-EINVAL); + attach = kzalloc(sizeof(*attach), GFP_KERNEL); if (!attach) return ERR_PTR(-ENOMEM); diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 6fa8d4e29719..f2083b94b116 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -326,6 +326,20 @@ struct dma_buf { /** @ops: dma_buf_ops associated with this buffer object. */ const struct dma_buf_ops *ops;

+ /** + * @dma_coherent: + * + * True if the buffer is backed by DMA coherent memory with respect to + * the CPU cache even if the architecture can support incoherent + * devices. + * + * Usually mirrors the result of dev_is_dma_coherent() of the exporter, + * but can be cleared by the exporter to allow incoherent devices + * access to the buffer if the exporter takes care of coherency for CPU + * accesses. + */ + bool dma_coherent; + /** * @vmapping_counter: * @@ -524,6 +538,7 @@ struct dma_buf_attachment { * @ops: Attach allocator-defined dma buf ops to the new buffer * @size: Size of the buffer - invariant over the lifetime of the buffer * @flags: mode flags for the file + * @coherent: If DMA accesses must be coherent to the CPU cache. * @resv: reservation-object, NULL to allocate default one * @priv: Attach private data of allocator to this buffer * @@ -536,6 +551,7 @@ struct dma_buf_export_info { const struct dma_buf_ops *ops; size_t size; int flags; + bool coherent; struct dma_resv *resv; void *priv; };

-- 2.25.1

Christian König

12:13 p.m.

New subject: [PATCH 2/3] drm/prime: set the dma_coherent flag for export

When a device driver is snooping the CPU cache during access we assume that all importers need to be able to snoop the CPU cache as well.

Signed-off-by: Christian König christian.koenig@amd.com --- drivers/gpu/drm/drm_prime.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 20e109a802ae..d5c70b6fe8a4 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -28,6 +28,7 @@

#include <linux/export.h> #include <linux/dma-buf.h> +#include <linux/dma-map-ops.h> #include <linux/rbtree.h> #include <linux/module.h>

@@ -889,6 +890,7 @@ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj, .size = obj->size, .flags = flags, .priv = obj, + .coherent = dev_is_dma_coherent(dev->dev), .resv = obj->resv, };

-- 2.25.1

Rob Clark

2:43 p.m.

New subject: [PATCH 2/3] drm/prime: set the dma_coherent flag for export

On Thu, Oct 20, 2022 at 5:13 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:

...

When a device driver is snooping the CPU cache during access we assume that all importers need to be able to snoop the CPU cache as well.

Signed-off-by: Christian König christian.koenig@amd.com

drivers/gpu/drm/drm_prime.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 20e109a802ae..d5c70b6fe8a4 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -28,6 +28,7 @@

#include <linux/export.h> #include <linux/dma-buf.h> +#include <linux/dma-map-ops.h> #include <linux/rbtree.h> #include <linux/module.h>

@@ -889,6 +890,7 @@ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj, .size = obj->size, .flags = flags, .priv = obj,
          .coherent = dev_is_dma_coherent(dev->dev),

To set the coherent flag correctly, I think I'd need a way to override on a per buffer basis, since coherency is a property of the gpu pgtables (which in the msm case is an immutable property of the gem object). We also have some awkwardness that drm->dev isn't actually the GPU, thanks to the kernels device model seeing a collection of other small devices shoehorned into a single drm device to fit userspace's view of the world. So relying on drm->dev isn't really going to give sensible results.

I guess msm could just bury our heads in the sand and continue to do things the way we have been (buffers that are mapped cached-coherent are only self-shared) but would be nice to catch if userspace tried to import one into (for ex) v4l2..

BR, -R

...

            .resv = obj->resv,
    };

-- 2.25.1

Christian König

2:56 p.m.

New subject: [PATCH 2/3] drm/prime: set the dma_coherent flag for export

Am 20.10.22 um 16:43 schrieb Rob Clark:

...

On Thu, Oct 20, 2022 at 5:13 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:

...
When a device driver is snooping the CPU cache during access we assume that all importers need to be able to snoop the CPU cache as well.

Signed-off-by: Christian König christian.koenig@amd.com

drivers/gpu/drm/drm_prime.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 20e109a802ae..d5c70b6fe8a4 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -28,6 +28,7 @@

#include <linux/export.h> #include <linux/dma-buf.h> +#include <linux/dma-map-ops.h> #include <linux/rbtree.h> #include <linux/module.h>

@@ -889,6 +890,7 @@ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj, .size = obj->size, .flags = flags, .priv = obj,
          .coherent = dev_is_dma_coherent(dev->dev),
To set the coherent flag correctly, I think I'd need a way to override on a per buffer basis, since coherency is a property of the gpu pgtables (which in the msm case is an immutable property of the gem object). We also have some awkwardness that drm->dev isn't actually the GPU, thanks to the kernels device model seeing a collection of other small devices shoehorned into a single drm device to fit userspace's view of the world. So relying on drm->dev isn't really going to give sensible results.

Yeah, I've the same problem for amdgpu where some buffers are snooped while others aren't.

But this should be unproblematic since the flag can always be cleared by the driver later on (it just can't be set).

Additional to that I've just noted that armada, i915, omap and tegra use their own DMA-buf export function. MSM could do the same as well if the device itself is marked as not coherent while some buffers are mapped cache coherent.

Regards, Christian.

...

I guess msm could just bury our heads in the sand and continue to do things the way we have been (buffers that are mapped cached-coherent are only self-shared) but would be nice to catch if userspace tried to import one into (for ex) v4l2..

BR, -R

...
             .resv = obj->resv,
     };
-- 2.25.1

Rob Clark

28 Oct 28 Oct

1:11 a.m.

New subject: [PATCH 2/3] drm/prime: set the dma_coherent flag for export

On Thu, Oct 20, 2022 at 7:57 AM Christian König christian.koenig@amd.com wrote:

...

Am 20.10.22 um 16:43 schrieb Rob Clark:

...
On Thu, Oct 20, 2022 at 5:13 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:

...
When a device driver is snooping the CPU cache during access we assume that all importers need to be able to snoop the CPU cache as well.

Signed-off-by: Christian König christian.koenig@amd.com

drivers/gpu/drm/drm_prime.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 20e109a802ae..d5c70b6fe8a4 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -28,6 +28,7 @@

#include <linux/export.h> #include <linux/dma-buf.h> +#include <linux/dma-map-ops.h> #include <linux/rbtree.h> #include <linux/module.h>

@@ -889,6 +890,7 @@ struct dma_buf *drm_gem_prime_export(struct drm_gem_object *obj, .size = obj->size, .flags = flags, .priv = obj,
          .coherent = dev_is_dma_coherent(dev->dev),
To set the coherent flag correctly, I think I'd need a way to override on a per buffer basis, since coherency is a property of the gpu pgtables (which in the msm case is an immutable property of the gem object). We also have some awkwardness that drm->dev isn't actually the GPU, thanks to the kernels device model seeing a collection of other small devices shoehorned into a single drm device to fit userspace's view of the world. So relying on drm->dev isn't really going to give sensible results.
Yeah, I've the same problem for amdgpu where some buffers are snooped while others aren't.

But this should be unproblematic since the flag can always be cleared by the driver later on (it just can't be set).

Additional to that I've just noted that armada, i915, omap and tegra use their own DMA-buf export function. MSM could do the same as well if the device itself is marked as not coherent while some buffers are mapped cache coherent.

yeah, I guess that would work.. it would be a bit unfortunate to need to use our own export function, but I guess it is a small price to pay and I like the overall idea, so a-b for the series

For the VMM case, it would be nice to expose this to userspace, but I've sent a patch to do this in an msm specific way, and I guess at least solving that problem for one driver and better than the current state of "iff driver == "i915" { it's mmap'd cached } else { it's writecombine }" in crosvm

Admittedly the VMM case is a rather special case compared to your average userspace dealing with dmabuf's, but it would be nice to get out of the current situation where it is having to make assumptions which are quite possibly wrong, so giving the VMM some information even if it is "the cachability isn't static, you should bail now if your arch can't cope" would be an improvement. (For background, this case is also a bit specific for android/gralloc.. for driver allocated buffers in a VM, with the native usermode driver (UMD) in guest, you still have some control within the UMD)

BR, -R

...

Regards, Christian.

...
I guess msm could just bury our heads in the sand and continue to do things the way we have been (buffers that are mapped cached-coherent are only self-shared) but would be nice to catch if userspace tried to import one into (for ex) v4l2..

BR, -R

...
             .resv = obj->resv,
     };
-- 2.25.1

Christian König

20 Oct 20 Oct

12:13 p.m.

New subject: [PATCH 3/3] media: videobuf2: set dma_coherent flag for DMA-buf

When a device is snooping the CPU cache we assume that all importers must snoop the CPU cache as well.

Execpt for vmalloc allocations since those implement mmap() imports must always snoop the cache or we will run into coherency problems.

Signed-off-by: Christian König christian.koenig@amd.com --- drivers/media/common/videobuf2/videobuf2-dma-contig.c | 2 ++ drivers/media/common/videobuf2/videobuf2-dma-sg.c | 2 ++ drivers/media/common/videobuf2/videobuf2-vmalloc.c | 1 + 3 files changed, 5 insertions(+)

diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c index 555bd40fa472..57433310c55c 100644 --- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c +++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c @@ -11,6 +11,7 @@ */

#include <linux/dma-buf.h> +#include <linux/dma-map-ops.h> #include <linux/module.h> #include <linux/refcount.h> #include <linux/scatterlist.h> @@ -507,6 +508,7 @@ static struct dma_buf *vb2_dc_get_dmabuf(struct vb2_buffer *vb, exp_info.size = buf->size; exp_info.flags = flags; exp_info.priv = buf; + exp_info.coherent = dev_is_dma_coherent(vb->vb2_queue->dev);

if (!buf->sgt_base) buf->sgt_base = vb2_dc_get_base_sgt(buf); diff --git a/drivers/media/common/videobuf2/videobuf2-dma-sg.c b/drivers/media/common/videobuf2/videobuf2-dma-sg.c index 36981a5b5c53..dbdd753e4a39 100644 --- a/drivers/media/common/videobuf2/videobuf2-dma-sg.c +++ b/drivers/media/common/videobuf2/videobuf2-dma-sg.c @@ -10,6 +10,7 @@ * the Free Software Foundation. */

+#include <linux/dma-map-ops.h> #include <linux/module.h> #include <linux/mm.h> #include <linux/refcount.h> @@ -522,6 +523,7 @@ static struct dma_buf *vb2_dma_sg_get_dmabuf(struct vb2_buffer *vb, exp_info.size = buf->size; exp_info.flags = flags; exp_info.priv = buf; + exp_info.coherent = dev_is_dma_coherent(vb->vb2_queue->dev);

if (WARN_ON(!buf->dma_sgt)) return NULL; diff --git a/drivers/media/common/videobuf2/videobuf2-vmalloc.c b/drivers/media/common/videobuf2/videobuf2-vmalloc.c index 41db707e43a4..0b6874733e86 100644 --- a/drivers/media/common/videobuf2/videobuf2-vmalloc.c +++ b/drivers/media/common/videobuf2/videobuf2-vmalloc.c @@ -341,6 +341,7 @@ static struct dma_buf *vb2_vmalloc_get_dmabuf(struct vb2_buffer *vb, exp_info.size = buf->size; exp_info.flags = flags; exp_info.priv = buf; + exp_info.coherent = true;

if (WARN_ON(!buf->vaddr)) return NULL;

-- 2.25.1

Hans Verkuil

4 Nov 4 Nov

10:42 a.m.

New subject: [PATCH 3/3] media: videobuf2: set dma_coherent flag for DMA-buf

Hi Christian,

On 20/10/2022 14:13, Christian König wrote:

...

When a device is snooping the CPU cache we assume that all importers must snoop the CPU cache as well.

Execpt for vmalloc allocations since those implement mmap() imports must always snoop the cache or we will run into coherency problems.

Looks reasonable.

For this series:

Acked-by: Hans Verkuil hverkuil-cisco@xs4all.nl

Regards,

Hans

...

Signed-off-by: Christian König christian.koenig@amd.com

drivers/media/common/videobuf2/videobuf2-dma-contig.c | 2 ++ drivers/media/common/videobuf2/videobuf2-dma-sg.c | 2 ++ drivers/media/common/videobuf2/videobuf2-vmalloc.c | 1 + 3 files changed, 5 insertions(+)

diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c index 555bd40fa472..57433310c55c 100644 --- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c +++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c @@ -11,6 +11,7 @@ */ #include <linux/dma-buf.h> +#include <linux/dma-map-ops.h> #include <linux/module.h> #include <linux/refcount.h> #include <linux/scatterlist.h> @@ -507,6 +508,7 @@ static struct dma_buf *vb2_dc_get_dmabuf(struct vb2_buffer *vb, exp_info.size = buf->size; exp_info.flags = flags; exp_info.priv = buf;

exp_info.coherent = dev_is_dma_coherent(vb->vb2_queue->dev);

if (!buf->sgt_base) buf->sgt_base = vb2_dc_get_base_sgt(buf); diff --git a/drivers/media/common/videobuf2/videobuf2-dma-sg.c b/drivers/media/common/videobuf2/videobuf2-dma-sg.c index 36981a5b5c53..dbdd753e4a39 100644 --- a/drivers/media/common/videobuf2/videobuf2-dma-sg.c +++ b/drivers/media/common/videobuf2/videobuf2-dma-sg.c @@ -10,6 +10,7 @@

the Free Software Foundation.

*/ +#include <linux/dma-map-ops.h> #include <linux/module.h> #include <linux/mm.h> #include <linux/refcount.h> @@ -522,6 +523,7 @@ static struct dma_buf *vb2_dma_sg_get_dmabuf(struct vb2_buffer *vb, exp_info.size = buf->size; exp_info.flags = flags; exp_info.priv = buf;

exp_info.coherent = dev_is_dma_coherent(vb->vb2_queue->dev);

if (WARN_ON(!buf->dma_sgt)) return NULL; diff --git a/drivers/media/common/videobuf2/videobuf2-vmalloc.c b/drivers/media/common/videobuf2/videobuf2-vmalloc.c index 41db707e43a4..0b6874733e86 100644 --- a/drivers/media/common/videobuf2/videobuf2-vmalloc.c +++ b/drivers/media/common/videobuf2/videobuf2-vmalloc.c @@ -341,6 +341,7 @@ static struct dma_buf *vb2_vmalloc_get_dmabuf(struct vb2_buffer *vb, exp_info.size = buf->size; exp_info.flags = flags; exp_info.priv = buf;

exp_info.coherent = true;

if (WARN_ON(!buf->vaddr)) return NULL;

Christian König

27 Oct 27 Oct

12:14 p.m.

Am 20.10.22 um 14:13 schrieb Christian König:

...

Hi guys,

after finding that we essentially have two separate worlds for coherent sharing of buffer through DMA-buf I thought I will tackle that problem a bit and at least allow the framework to reject attachments which won't work.

So those patches here add a new dma_coherent flag to each DMA-buf object telling the framework that dev_is_dma_coherent() needs to return true for an importing device to be able to attach. Since we should always have a fallback path this should give userspace the chance to still keep the use case working, either by doing a CPU copy instead or reversing the roles of exporter and importer.

For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear the flag for their buffers if special handling like the USWC flag in amdgpu or the uncached allocations for radeon/nouveau are in use.

Additional to that importers can also check the flag if they have some non-snooping operations like the special scanout case for amdgpu for example.

The patches are only smoke tested and the solution isn't ideal, but as far as I can see should at least keep things working.

Gentle ping on this. Lucas, Daniel and Nicolas you have been rather active in the last discussion. Do you mind taking a look?

Thanks, Christian.

...

Please review and/or comment, Christian.

Lucas Stach

28 Oct 28 Oct

8:09 a.m.

Hi Christian,

Am Donnerstag, dem 20.10.2022 um 14:13 +0200 schrieb Christian König:

...

Hi guys,

after finding that we essentially have two separate worlds for coherent sharing of buffer through DMA-buf I thought I will tackle that problem a bit and at least allow the framework to reject attachments which won't work.

So those patches here add a new dma_coherent flag to each DMA-buf object telling the framework that dev_is_dma_coherent() needs to return true for an importing device to be able to attach. Since we should always have a fallback path this should give userspace the chance to still keep the use case working, either by doing a CPU copy instead or reversing the roles of exporter and importer.

The fallback would likely be a CPU copy with the appropriate cache flushes done by the device driver, which would be a massiv performance issue.

...

For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear the flag for their buffers if special handling like the USWC flag in amdgpu or the uncached allocations for radeon/nouveau are in use.

I don't think the V4L2 part works for most ARM systems. The default there is for devices to be noncoherent unless explicitly marked otherwise. I don't think any of the "devices" writing the video buffers in cached memory with the CPU do this. While we could probably mark them as coherent, I don't think this is moving in the right direction.

...

Additional to that importers can also check the flag if they have some non-snooping operations like the special scanout case for amdgpu for example.

The patches are only smoke tested and the solution isn't ideal, but as far as I can see should at least keep things working.

I would like to see this solved properly. Where I think properly means we make things work on systems with mixed coherent/noncoherent masters in the same way the DMA API does on such systems: by inserting the proper cache maintenance operations.

I also think that we should keep in mind that the world is moving into a direction where DMA masters may not only snoop the CPU caches (what is the definition of cache coherent on x86), but actually take part in the system coherence and are able to have writeback caches for shared data on their own. I can only speculate, as I haven't seen the amdgpu side yet, but I think this proposal is moving in the other direction by assuming a central system cache, where the importer has some magic way to clean this central cache.

Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU dma-buf sharing work on ARM systems and seem to hold some strong opinions on how this should work, I guess I need to make some time available to type it up, so we can discuss over coder rather than abstract ideas. If I come up with something that works for my use-cases would you be up for taking a shot at a amdgpu implementation?

Regards, Lucas

Christian König

8:40 a.m.

Hi Lucas,

Am 28.10.22 um 10:09 schrieb Lucas Stach:

...

Hi Christian,

Am Donnerstag, dem 20.10.2022 um 14:13 +0200 schrieb Christian König:

...
Hi guys,

after finding that we essentially have two separate worlds for coherent sharing of buffer through DMA-buf I thought I will tackle that problem a bit and at least allow the framework to reject attachments which won't work.

So those patches here add a new dma_coherent flag to each DMA-buf object telling the framework that dev_is_dma_coherent() needs to return true for an importing device to be able to attach. Since we should always have a fallback path this should give userspace the chance to still keep the use case working, either by doing a CPU copy instead or reversing the roles of exporter and importer.

The fallback would likely be a CPU copy with the appropriate cache flushes done by the device driver, which would be a massiv performance issue.

But essentially the right thing to do. The only alternative I can see is to reverse the role of exporter and importer.

...

...
For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear the flag for their buffers if special handling like the USWC flag in amdgpu or the uncached allocations for radeon/nouveau are in use.

I don't think the V4L2 part works for most ARM systems. The default there is for devices to be noncoherent unless explicitly marked otherwise. I don't think any of the "devices" writing the video buffers in cached memory with the CPU do this. While we could probably mark them as coherent, I don't think this is moving in the right direction.

Well why not? Those devices are coherent in the sense of the DMA API that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.

We could come up with a better name for coherency, e.g. snooping for example. But that is just an documentation detail.

...

...
Additional to that importers can also check the flag if they have some non-snooping operations like the special scanout case for amdgpu for example.

The patches are only smoke tested and the solution isn't ideal, but as far as I can see should at least keep things working.

I would like to see this solved properly. Where I think properly means we make things work on systems with mixed coherent/noncoherent masters in the same way the DMA API does on such systems: by inserting the proper cache maintenance operations.

And this the exact wrong approach as far as I can see. As Daniel noted as well we absolutely need some kind of coherency between exporter and importer.

This can either be provided by the PCI specification which makes it mandatory for device to snoop the caches or by platform devices agreeing that they only work on uncached memory.

Explicit cache flush operations are simple not part of the design of DMA-buf because they are not part of the design of the higher level APIs like Vulkan and OpenGL.

Adding this to DMA-buf for the rather special use case would completely break that and make live much more complicated for everybody.

...

I also think that we should keep in mind that the world is moving into a direction where DMA masters may not only snoop the CPU caches (what is the definition of cache coherent on x86), but actually take part in the system coherence and are able to have writeback caches for shared data on their own. I can only speculate, as I haven't seen the amdgpu side yet, but I think this proposal is moving in the other direction by assuming a central system cache, where the importer has some magic way to clean this central cache.

What you mean is CXL: https://en.wikipedia.org/wiki/Compute_Express_Link

And yes we support that in a bunch of configurations and also have worked with that and amdgpu with DMA-buf based shared.

This should not be a problem with this approach.

...

Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU dma-buf sharing work on ARM systems and seem to hold some strong opinions on how this should work, I guess I need to make some time available to type it up, so we can discuss over coder rather than abstract ideas. If I come up with something that works for my use-cases would you be up for taking a shot at a amdgpu implementation?

Well, not really. As I said I see what you have in mind here as completely wrong approach we will certainly not support in any GPU driver.

What we can do is to request the use case which won't work and this is exactly what the patch here does.

Regards, Christian.

...

Regards, Lucas

Lucas Stach

11:42 a.m.

Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:

...

Hi Lucas,

Am 28.10.22 um 10:09 schrieb Lucas Stach:

...
Hi Christian,

Am Donnerstag, dem 20.10.2022 um 14:13 +0200 schrieb Christian König:

...
Hi guys,

after finding that we essentially have two separate worlds for coherent sharing of buffer through DMA-buf I thought I will tackle that problem a bit and at least allow the framework to reject attachments which won't work.

So those patches here add a new dma_coherent flag to each DMA-buf object telling the framework that dev_is_dma_coherent() needs to return true for an importing device to be able to attach. Since we should always have a fallback path this should give userspace the chance to still keep the use case working, either by doing a CPU copy instead or reversing the roles of exporter and importer.

The fallback would likely be a CPU copy with the appropriate cache flushes done by the device driver, which would be a massiv performance issue.

But essentially the right thing to do. The only alternative I can see is to reverse the role of exporter and importer.

I don't think that would work generally either, as buffer exporter and importer isn't always a 1:1 thing. As soon as any attached importer has a different coherency behavior than the others, things fall apart.

...

...
...
For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear the flag for their buffers if special handling like the USWC flag in amdgpu or the uncached allocations for radeon/nouveau are in use.

I don't think the V4L2 part works for most ARM systems. The default there is for devices to be noncoherent unless explicitly marked otherwise. I don't think any of the "devices" writing the video buffers in cached memory with the CPU do this. While we could probably mark them as coherent, I don't think this is moving in the right direction.

Well why not? Those devices are coherent in the sense of the DMA API that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.

We could come up with a better name for coherency, e.g. snooping for example. But that is just an documentation detail.

I agree that those devices copying data into a CPU cacheable buffer should be marked as coherent, just not sure right now if other things like DMA mappings are done on that device, which would require the cache maintenance.

...

...
...
Additional to that importers can also check the flag if they have some non-snooping operations like the special scanout case for amdgpu for example.

The patches are only smoke tested and the solution isn't ideal, but as far as I can see should at least keep things working.

I would like to see this solved properly. Where I think properly means we make things work on systems with mixed coherent/noncoherent masters in the same way the DMA API does on such systems: by inserting the proper cache maintenance operations.

And this the exact wrong approach as far as I can see. As Daniel noted as well we absolutely need some kind of coherency between exporter and importer.

I think it's important that we are very specific about the thing we are talking about here: I guess when you say coherency you mean hardware enforced coherency on cacheable memory, which is the default on x86/PCI.

The other way to enforce coherency is to either insert cache maintenance operations, or make sure that the buffer is not cacheable by any device taking part in the sharing, including the CPU.

...

This can either be provided by the PCI specification which makes it mandatory for device to snoop the caches or by platform devices agreeing that they only work on uncached memory.

What you disregard here is the fact that there are many systems out there with mixed topologies, where some masters are part of the coherency domain and some are not.

We have two options here: either mandate that coherency for dma-bufs need to be established by hardware, which is the option that you strongly prefer, which means forcing all buffers to be uncacheable in a system with masters that are not coherent, or allowing some form of bracketed DMA access with cache maintenance ops.

...

Explicit cache flush operations are simple not part of the design of DMA-buf because they are not part of the design of the higher level APIs like Vulkan and OpenGL.

I'm aware that some graphics APIs have been living in a universe blissfully unaware of systems without hardware enforced coherency. But that isn't the only use for dma-bufs.

I totally agree that some higher level API primitives aren't possible without coherency at the hardware level and for those uses we should require either HW enforced coherency or uncachable memory. But I don't think we should make things slow deliberately on systems that allow to optimize things with the help of bracketed access.

If I understood things right your amdgpu use-case even falls into this category: normally you would want to use cacheable memory for everything, but just make sure to clean the caches before using the buffer with the non-coherent display engine.

...

Adding this to DMA-buf for the rather special use case would completely break that and make live much more complicated for everybody.

...
I also think that we should keep in mind that the world is moving into a direction where DMA masters may not only snoop the CPU caches (what is the definition of cache coherent on x86), but actually take part in the system coherence and are able to have writeback caches for shared data on their own. I can only speculate, as I haven't seen the amdgpu side yet, but I think this proposal is moving in the other direction by assuming a central system cache, where the importer has some magic way to clean this central cache.

What you mean is CXL: https://en.wikipedia.org/wiki/Compute_Express_Link

Or ARM AMBA CHI.

...

And yes we support that in a bunch of configurations and also have worked with that and amdgpu with DMA-buf based shared.

This should not be a problem with this approach.

It works as long as all masters sharing the buffer are accessing the buffer through the HW cache coherence facilities provided by CXL. As soon as a master wants to bypass it (like your nosnoop scanout) you need some way to force all other masters sharing access to the buffer to clean their caches.

...

...
Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU dma-buf sharing work on ARM systems and seem to hold some strong opinions on how this should work, I guess I need to make some time available to type it up, so we can discuss over coder rather than abstract ideas. If I come up with something that works for my use-cases would you be up for taking a shot at a amdgpu implementation?

Well, not really. As I said I see what you have in mind here as completely wrong approach we will certainly not support in any GPU driver.

What we can do is to request the use case which won't work and this is exactly what the patch here does.

Did you mean to write "prevent the use case which won't work" here?

Regards, Lucas

Christian König

2:26 p.m.

Am 28.10.22 um 13:42 schrieb Lucas Stach:

...

Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:

...
But essentially the right thing to do. The only alternative I can see is to reverse the role of exporter and importer.

I don't think that would work generally either, as buffer exporter and importer isn't always a 1:1 thing. As soon as any attached importer has a different coherency behavior than the others, things fall apart.

I've just mentioned it because somebody noted that when you reverse the roles of exporter and importer with the V4L driver and i915 then the use case suddenly starts working.

...

...
...
...
For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear the flag for their buffers if special handling like the USWC flag in amdgpu or the uncached allocations for radeon/nouveau are in use.

I don't think the V4L2 part works for most ARM systems. The default there is for devices to be noncoherent unless explicitly marked otherwise. I don't think any of the "devices" writing the video buffers in cached memory with the CPU do this. While we could probably mark them as coherent, I don't think this is moving in the right direction.

Well why not? Those devices are coherent in the sense of the DMA API that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.

We could come up with a better name for coherency, e.g. snooping for example. But that is just an documentation detail.

I agree that those devices copying data into a CPU cacheable buffer should be marked as coherent, just not sure right now if other things like DMA mappings are done on that device, which would require the cache maintenance.

Yeah, good point.

...

...
And this the exact wrong approach as far as I can see. As Daniel noted as well we absolutely need some kind of coherency between exporter and importer.

I think it's important that we are very specific about the thing we are talking about here: I guess when you say coherency you mean hardware enforced coherency on cacheable memory, which is the default on x86/PCI.

Well, no. What I mean with coherency is that the devices don't need insert special operation to access each others data.

This can be archived by multiple approaches, e.g. by the PCI coherency requirements, device internal connections (XGMI, NVLink, CXL etc...) as well as using uncached system memory.

The key point is what we certainly don't want is special operations which say: Ok, now device A can access the data, now device B..... because this breaks tons of use cases.

...

The other way to enforce coherency is to either insert cache maintenance operations, or make sure that the buffer is not cacheable by any device taking part in the sharing, including the CPU.

Yes and no. When we want the devices to interact with each other without the CPU then we need some negotiated coherency between the two.

...

...
This can either be provided by the PCI specification which makes it mandatory for device to snoop the caches or by platform devices agreeing that they only work on uncached memory.

What you disregard here is the fact that there are many systems out there with mixed topologies, where some masters are part of the coherency domain and some are not.

We have two options here: either mandate that coherency for dma-bufs need to be established by hardware, which is the option that you strongly prefer, which means forcing all buffers to be uncacheable in a system with masters that are not coherent, or allowing some form of bracketed DMA access with cache maintenance ops.

Well I don't prefer that option, it's just the only one I can see. One of the main goals of DMA-buf is to allow device to share data without the need for CPU interactions.

In other words we negotiate the high level properties and then the device can talk to each other without explicitly noting who is accessing what.

And this concept is completely incompatible with maintenance ops. We made that mistake with SWIOTLB for the DMA API and I don't really want to repeat that stunt.

...

...
Explicit cache flush operations are simple not part of the design of DMA-buf because they are not part of the design of the higher level APIs like Vulkan and OpenGL.

I'm aware that some graphics APIs have been living in a universe blissfully unaware of systems without hardware enforced coherency. But that isn't the only use for dma-bufs.

Yeah, but the other use cases are extremely limited. As far as I can see

...

I totally agree that some higher level API primitives aren't possible without coherency at the hardware level and for those uses we should require either HW enforced coherency or uncachable memory. But I don't think we should make things slow deliberately on systems that allow to optimize things with the help of bracketed access.

If I understood things right your amdgpu use-case even falls into this category: normally you would want to use cacheable memory for everything, but just make sure to clean the caches before using the buffer with the non-coherent display engine.

No, that won't work like this. The caching attributes must be coherent for the display engine to work correctly.

...

...
Adding this to DMA-buf for the rather special use case would completely break that and make live much more complicated for everybody.

...
I also think that we should keep in mind that the world is moving into a direction where DMA masters may not only snoop the CPU caches (what is the definition of cache coherent on x86), but actually take part in the system coherence and are able to have writeback caches for shared data on their own. I can only speculate, as I haven't seen the amdgpu side yet, but I think this proposal is moving in the other direction by assuming a central system cache, where the importer has some magic way to clean this central cache.

What you mean is CXL: https://en.wikipedia.org/wiki/Compute_Express_Link

Or ARM AMBA CHI.

...
And yes we support that in a bunch of configurations and also have worked with that and amdgpu with DMA-buf based shared.

This should not be a problem with this approach.

It works as long as all masters sharing the buffer are accessing the buffer through the HW cache coherence facilities provided by CXL. As soon as a master wants to bypass it (like your nosnoop scanout) you need some way to force all other masters sharing access to the buffer to clean their caches.

That won't work like this. The problem is that this is an APU and so the display is part of the CPU. When either the MTRR or PAT says that the physical address is cacheable the engine might just hang on access.

...

...
...
Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU dma-buf sharing work on ARM systems and seem to hold some strong opinions on how this should work, I guess I need to make some time available to type it up, so we can discuss over coder rather than abstract ideas. If I come up with something that works for my use-cases would you be up for taking a shot at a amdgpu implementation?

Well, not really. As I said I see what you have in mind here as completely wrong approach we will certainly not support in any GPU driver.

What we can do is to request the use case which won't work and this is exactly what the patch here does.

Did you mean to write "prevent the use case which won't work" here?

Oh, yes. To fast typing.

Regards, Christian.

...

Regards, Lucas

Nicolas Dufresne

3:46 p.m.

Hi,

just dropping some real live use case, sorry I'm not really proposing solutions, I believe you are much more knowledgeable in this regard.

Le vendredi 28 octobre 2022 à 16:26 +0200, Christian König a écrit :

...

Am 28.10.22 um 13:42 schrieb Lucas Stach:

...
Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:

...
But essentially the right thing to do. The only alternative I can see is to reverse the role of exporter and importer.

I don't think that would work generally either, as buffer exporter and importer isn't always a 1:1 thing. As soon as any attached importer has a different coherency behavior than the others, things fall apart.

I've just mentioned it because somebody noted that when you reverse the roles of exporter and importer with the V4L driver and i915 then the use case suddenly starts working.

Though, its not generically possible to reverse these roles. If you want to do so, you endup having to do like Android (gralloc) and ChromeOS (minigbm), because you will have to allocate DRM buffers that knows about importer specific requirements. See link [1] for what it looks like for RK3399, with Motion Vector size calculation copied from the kernel driver into a userspace lib (arguably that was available from V4L2 sizeimage, but this is technically difficult to communicate within the software layers). If you could let the decoder export (with proper cache management) the non-generic code would not be needed.

Another case where reversing the role is difficult is for case where you need to multiplex the streams (let's use a camera to illustrate) and share that with multiple processes. In these uses case, the DRM importers are volatile, which one do you abuse to do allocation from ? In multimedia server like PipeWire, you are not really aware if the camera will be used by DRM or not, and if something "special" is needed in term of role inversion. It is relatively easy to deal with matching modifiers, but using downstream (display/gpu) as an exporter is always difficult (and require some level of abuse and guessing).

[1] https://android.googlesource.com/platform/external/minigbm/+/refs/heads/mast...

...

...
...
...
...
For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear the flag for their buffers if special handling like the USWC flag in amdgpu or the uncached allocations for radeon/nouveau are in use.

I don't think the V4L2 part works for most ARM systems. The default there is for devices to be noncoherent unless explicitly marked otherwise. I don't think any of the "devices" writing the video buffers in cached memory with the CPU do this. While we could probably mark them as coherent, I don't think this is moving in the right direction.

Well why not? Those devices are coherent in the sense of the DMA API that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.

We could come up with a better name for coherency, e.g. snooping for example. But that is just an documentation detail.

I agree that those devices copying data into a CPU cacheable buffer should be marked as coherent, just not sure right now if other things like DMA mappings are done on that device, which would require the cache maintenance.

Yeah, good point.

...
...
And this the exact wrong approach as far as I can see. As Daniel noted as well we absolutely need some kind of coherency between exporter and importer.

I think it's important that we are very specific about the thing we are talking about here: I guess when you say coherency you mean hardware enforced coherency on cacheable memory, which is the default on x86/PCI.

Well, no. What I mean with coherency is that the devices don't need insert special operation to access each others data.

This can be archived by multiple approaches, e.g. by the PCI coherency requirements, device internal connections (XGMI, NVLink, CXL etc...) as well as using uncached system memory.

The key point is what we certainly don't want is special operations which say: Ok, now device A can access the data, now device B..... because this breaks tons of use cases.

I'm coming back again with the multiplexing case. We keep having mixed uses case with multiple receiver. In some case, data may endup on CPU while being encoded in HW. Current approach of disabling cache does work, but CPU algorithm truly suffer in performance. Doing a full memcpy to a cached buffer helps, but remains slower then if the cache had been snooped by the importer (encoder here) driver.

...

...
The other way to enforce coherency is to either insert cache maintenance operations, or make sure that the buffer is not cacheable by any device taking part in the sharing, including the CPU.

Yes and no. When we want the devices to interact with each other without the CPU then we need some negotiated coherency between the two.

...
...
This can either be provided by the PCI specification which makes it mandatory for device to snoop the caches or by platform devices agreeing that they only work on uncached memory.

What you disregard here is the fact that there are many systems out there with mixed topologies, where some masters are part of the coherency domain and some are not.

We have two options here: either mandate that coherency for dma-bufs need to be established by hardware, which is the option that you strongly prefer, which means forcing all buffers to be uncacheable in a system with masters that are not coherent, or allowing some form of bracketed DMA access with cache maintenance ops.

Well I don't prefer that option, it's just the only one I can see. One of the main goals of DMA-buf is to allow device to share data without the need for CPU interactions.

In other words we negotiate the high level properties and then the device can talk to each other without explicitly noting who is accessing what.

And this concept is completely incompatible with maintenance ops. We made that mistake with SWIOTLB for the DMA API and I don't really want to repeat that stunt.

...
...
Explicit cache flush operations are simple not part of the design of DMA-buf because they are not part of the design of the higher level APIs like Vulkan and OpenGL.

I'm aware that some graphics APIs have been living in a universe blissfully unaware of systems without hardware enforced coherency. But that isn't the only use for dma-bufs.

Yeah, but the other use cases are extremely limited. As far as I can see

...
I totally agree that some higher level API primitives aren't possible without coherency at the hardware level and for those uses we should require either HW enforced coherency or uncachable memory. But I don't think we should make things slow deliberately on systems that allow to optimize things with the help of bracketed access.

If I understood things right your amdgpu use-case even falls into this category: normally you would want to use cacheable memory for everything, but just make sure to clean the caches before using the buffer with the non-coherent display engine.

No, that won't work like this. The caching attributes must be coherent for the display engine to work correctly.

...
...
Adding this to DMA-buf for the rather special use case would completely break that and make live much more complicated for everybody.

...
I also think that we should keep in mind that the world is moving into a direction where DMA masters may not only snoop the CPU caches (what is the definition of cache coherent on x86), but actually take part in the system coherence and are able to have writeback caches for shared data on their own. I can only speculate, as I haven't seen the amdgpu side yet, but I think this proposal is moving in the other direction by assuming a central system cache, where the importer has some magic way to clean this central cache.

What you mean is CXL: https://en.wikipedia.org/wiki/Compute_Express_Link

Or ARM AMBA CHI.

...
And yes we support that in a bunch of configurations and also have worked with that and amdgpu with DMA-buf based shared.

This should not be a problem with this approach.

It works as long as all masters sharing the buffer are accessing the buffer through the HW cache coherence facilities provided by CXL. As soon as a master wants to bypass it (like your nosnoop scanout) you need some way to force all other masters sharing access to the buffer to clean their caches.

That won't work like this. The problem is that this is an APU and so the display is part of the CPU. When either the MTRR or PAT says that the physical address is cacheable the engine might just hang on access.

...
...
...
Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU dma-buf sharing work on ARM systems and seem to hold some strong opinions on how this should work, I guess I need to make some time available to type it up, so we can discuss over coder rather than abstract ideas. If I come up with something that works for my use-cases would you be up for taking a shot at a amdgpu implementation?

Well, not really. As I said I see what you have in mind here as completely wrong approach we will certainly not support in any GPU driver.

What we can do is to request the use case which won't work and this is exactly what the patch here does.

Did you mean to write "prevent the use case which won't work" here?

Oh, yes. To fast typing.

Regards, Christian.

...
Regards, Lucas

Christian König

5:50 p.m.

Hi Nicolas,

Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:

...

Hi,

just dropping some real live use case, sorry I'm not really proposing solutions, I believe you are much more knowledgeable in this regard.

Well, I think each of us has a lot of specialized knowledge. For example I don't know to much about gralloc/minigbm. So this input is very valuable.

...

Le vendredi 28 octobre 2022 à 16:26 +0200, Christian König a écrit :

...
Am 28.10.22 um 13:42 schrieb Lucas Stach:

...
Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:

...
But essentially the right thing to do. The only alternative I can see is to reverse the role of exporter and importer.

I don't think that would work generally either, as buffer exporter and importer isn't always a 1:1 thing. As soon as any attached importer has a different coherency behavior than the others, things fall apart.

I've just mentioned it because somebody noted that when you reverse the roles of exporter and importer with the V4L driver and i915 then the use case suddenly starts working.

Though, its not generically possible to reverse these roles. If you want to do so, you endup having to do like Android (gralloc) and ChromeOS (minigbm), because you will have to allocate DRM buffers that knows about importer specific requirements. See link [1] for what it looks like for RK3399, with Motion Vector size calculation copied from the kernel driver into a userspace lib (arguably that was available from V4L2 sizeimage, but this is technically difficult to communicate within the software layers). If you could let the decoder export (with proper cache management) the non-generic code would not be needed.

Yeah, but I can also reverse the argument:

Getting the parameters for V4L right so that we can share the image is tricky, but getting the parameters so that the stuff is actually directly displayable by GPUs is even trickier.

Essentially you need to look at both sides and interference to get to a common ground, e.g. alignment, pitch, width/height, padding, etc.....

Deciding from which side to allocate from is just one step in this process. For example most dGPUs can't display directly from system memory altogether, but it is possible to allocate the DMA-buf through the GPU driver and then write into device memory with P2P PCI transfers.

So as far as I can see switching importer and exporter roles and even having performant extra fallbacks should be a standard feature of userspace.

...

Another case where reversing the role is difficult is for case where you need to multiplex the streams (let's use a camera to illustrate) and share that with multiple processes. In these uses case, the DRM importers are volatile, which one do you abuse to do allocation from ? In multimedia server like PipeWire, you are not really aware if the camera will be used by DRM or not, and if something "special" is needed in term of role inversion. It is relatively easy to deal with matching modifiers, but using downstream (display/gpu) as an exporter is always difficult (and require some level of abuse and guessing).

Oh, very good point! Yeah we do have use cases for this where an input buffer is both displayed as well as encoded.

...

...
Well, no. What I mean with coherency is that the devices don't need insert special operation to access each others data.

This can be archived by multiple approaches, e.g. by the PCI coherency requirements, device internal connections (XGMI, NVLink, CXL etc...) as well as using uncached system memory.

The key point is what we certainly don't want is special operations which say: Ok, now device A can access the data, now device B..... because this breaks tons of use cases.

I'm coming back again with the multiplexing case. We keep having mixed uses case with multiple receiver. In some case, data may endup on CPU while being encoded in HW. Current approach of disabling cache does work, but CPU algorithm truly suffer in performance. Doing a full memcpy to a cached buffer helps, but remains slower then if the cache had been snooped by the importer (encoder here) driver.

Yeah, that was another reason why we ended up rather having an extra copy than working with uncached buffers for display as well.

Regards, Christian.

Daniel Stone

6:47 p.m.

Hi Christian,

On Fri, 28 Oct 2022 at 18:50, Christian König ckoenig.leichtzumerken@gmail.com wrote:

...

Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:

...
Though, its not generically possible to reverse these roles. If you want to do so, you endup having to do like Android (gralloc) and ChromeOS (minigbm), because you will have to allocate DRM buffers that knows about importer specific requirements. See link [1] for what it looks like for RK3399, with Motion Vector size calculation copied from the kernel driver into a userspace lib (arguably that was available from V4L2 sizeimage, but this is technically difficult to communicate within the software layers). If you could let the decoder export (with proper cache management) the non-generic code would not be needed.

Yeah, but I can also reverse the argument:

Getting the parameters for V4L right so that we can share the image is tricky, but getting the parameters so that the stuff is actually directly displayable by GPUs is even trickier.

Essentially you need to look at both sides and interference to get to a common ground, e.g. alignment, pitch, width/height, padding, etc.....

Deciding from which side to allocate from is just one step in this process. For example most dGPUs can't display directly from system memory altogether, but it is possible to allocate the DMA-buf through the GPU driver and then write into device memory with P2P PCI transfers.

So as far as I can see switching importer and exporter roles and even having performant extra fallbacks should be a standard feature of userspace.

...
Another case where reversing the role is difficult is for case where you need to multiplex the streams (let's use a camera to illustrate) and share that with multiple processes. In these uses case, the DRM importers are volatile, which one do you abuse to do allocation from ? In multimedia server like PipeWire, you are not really aware if the camera will be used by DRM or not, and if something "special" is needed in term of role inversion. It is relatively easy to deal with matching modifiers, but using downstream (display/gpu) as an exporter is always difficult (and require some level of abuse and guessing).

Oh, very good point! Yeah we do have use cases for this where an input buffer is both displayed as well as encoded.

This is the main issue, yeah.

For a standard media player, they would try to allocate through V4L2 and decode through that into locally-allocated buffers. All they know is that there's a Wayland server at the other end of a socket somewhere which will want to import the FD. The server does give you some hints along the way: it will tell you that importing into a particular GPU target device is necessary as the ultimate fallback, and importing into a particular KMS device is preferable as the optimal path to hit an overlay.

So let's say that the V4L2 client does what you're proposing: it allocates a buffer chain, schedules a decode into that buffer, and passes it along to the server to import. The server fails to import the buffer into the GPU, and tells the client this. The client then ... well, it doesn't know that it needs to allocate within the GPU instead, but it knows that doing so might be one thing which would make the request succeed.

But the client is just a video player. It doesn't understand how to allocate BOs for Panfrost or AMD or etnaviv. So without a universal allocator (again ...), 'just allocate on the GPU' isn't a useful response to the client.

I fully understand your point about APIs like Vulkan not sensibly allowing bracketing, and that's fine. On the other hand, a lot of extant usecases (camera/codec -> GPU/display, GPU -> codec, etc) on Arm just cannot fulfill complete coherency. On a lot of these platforms, despite what you might think about the CPU/GPU capabilities, the bottleneck is _always_ memory bandwidth, so mandating extra copies is an absolute non-starter, and would instantly cripple billions of devices. Lucas has been pretty gentle, but to be more clear, this is not an option and won't be for at least the next decade.

So we obviously need a third way at this point, because 'all devices must always be coherent' vs. 'cache must be an unknown' can't work. How about this as a suggestion: we have some unused flags in the PRIME ioctls. Can we add a flag for 'import must be coherent'?

That flag wouldn't be set for the existing ecosystem Lucas/Nicolas/myself are talking about, where we have explicit handover points and users are fully able to perform cache maintenance. For newer APIs where it's not possible to properly express that bracketing, they would always set that flag (unless we add an API carve-out where the client promises to do whatever is required to maintain that).

Would that be viable?

Cheers, Daniel

Christian König

1 Nov 1 Nov

5:40 p.m.

Am 28.10.22 um 20:47 schrieb Daniel Stone:

...

Hi Christian,

On Fri, 28 Oct 2022 at 18:50, Christian König ckoenig.leichtzumerken@gmail.com wrote:

...
Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:

...
Though, its not generically possible to reverse these roles. If you want to do so, you endup having to do like Android (gralloc) and ChromeOS (minigbm), because you will have to allocate DRM buffers that knows about importer specific requirements. See link [1] for what it looks like for RK3399, with Motion Vector size calculation copied from the kernel driver into a userspace lib (arguably that was available from V4L2 sizeimage, but this is technically difficult to communicate within the software layers). If you could let the decoder export (with proper cache management) the non-generic code would not be needed.

Yeah, but I can also reverse the argument:

Getting the parameters for V4L right so that we can share the image is tricky, but getting the parameters so that the stuff is actually directly displayable by GPUs is even trickier.

Essentially you need to look at both sides and interference to get to a common ground, e.g. alignment, pitch, width/height, padding, etc.....

Deciding from which side to allocate from is just one step in this process. For example most dGPUs can't display directly from system memory altogether, but it is possible to allocate the DMA-buf through the GPU driver and then write into device memory with P2P PCI transfers.

So as far as I can see switching importer and exporter roles and even having performant extra fallbacks should be a standard feature of userspace.

...
Another case where reversing the role is difficult is for case where you need to multiplex the streams (let's use a camera to illustrate) and share that with multiple processes. In these uses case, the DRM importers are volatile, which one do you abuse to do allocation from ? In multimedia server like PipeWire, you are not really aware if the camera will be used by DRM or not, and if something "special" is needed in term of role inversion. It is relatively easy to deal with matching modifiers, but using downstream (display/gpu) as an exporter is always difficult (and require some level of abuse and guessing).

Oh, very good point! Yeah we do have use cases for this where an input buffer is both displayed as well as encoded.

This is the main issue, yeah.

For a standard media player, they would try to allocate through V4L2 and decode through that into locally-allocated buffers. All they know is that there's a Wayland server at the other end of a socket somewhere which will want to import the FD. The server does give you some hints along the way: it will tell you that importing into a particular GPU target device is necessary as the ultimate fallback, and importing into a particular KMS device is preferable as the optimal path to hit an overlay.

So let's say that the V4L2 client does what you're proposing: it allocates a buffer chain, schedules a decode into that buffer, and passes it along to the server to import. The server fails to import the buffer into the GPU, and tells the client this. The client then ... well, it doesn't know that it needs to allocate within the GPU instead, but it knows that doing so might be one thing which would make the request succeed.

But the client is just a video player. It doesn't understand how to allocate BOs for Panfrost or AMD or etnaviv. So without a universal allocator (again ...), 'just allocate on the GPU' isn't a useful response to the client.

Well exactly that's the point I'm raising: The client *must* understand that!

See we need to be able to handle all restrictions here, coherency of the data is just one of them.

For example the much more important question is the location of the data and for this allocating from the V4L2 device is in most cases just not going to fly.

The more common case is that you need to allocate from the GPU and then import that into the V4L2 device. The background is that all dGPUs I know of need the data inside local memory (VRAM) to be able to scan out from it.

...

I fully understand your point about APIs like Vulkan not sensibly allowing bracketing, and that's fine. On the other hand, a lot of extant usecases (camera/codec -> GPU/display, GPU -> codec, etc) on Arm just cannot fulfill complete coherency. On a lot of these platforms, despite what you might think about the CPU/GPU capabilities, the bottleneck is _always_ memory bandwidth, so mandating extra copies is an absolute non-starter, and would instantly cripple billions of devices. Lucas has been pretty gentle, but to be more clear, this is not an option and won't be for at least the next decade.

Well x86 pretty much has the same restrictions.

For example the scanout buffer is usually always in local memory because you often scan out at up to 120Hz while your recording is only 30fps and most of the time lower resolution.

Pumping all that data 120 time a second over the PCIe bus would just not be doable in a lot of use cases.

...

So we obviously need a third way at this point, because 'all devices must always be coherent' vs. 'cache must be an unknown' can't work. How about this as a suggestion: we have some unused flags in the PRIME ioctls. Can we add a flag for 'import must be coherent'?

That's pretty much exactly what my patch set does. It just keeps userspace out of the way and says that creating the initial connection between the devices fails if they can't talk directly with each other.

Maybe we should move that into userspace so that the involved components know of hand that a certain approach won't work?

...

That flag wouldn't be set for the existing ecosystem Lucas/Nicolas/myself are talking about, where we have explicit handover points and users are fully able to perform cache maintenance. For newer APIs where it's not possible to properly express that bracketing, they would always set that flag (unless we add an API carve-out where the client promises to do whatever is required to maintain that).

Would that be viable?

No, as I said. Explicit handover points are just an absolutely no-go. We just have way to many use cases which don't work with that idea.

As I said we made the same mistake with the DMA-Api and even more 20 years later are still running into problems because of that.

Just try to run any dGPU under a XEN hypervisor with memory fragmentation for a very good example why this is such a bad idea.

Regards, Christian.

...

Cheers, Daniel

Nicolas Dufresne

9:09 p.m.

Le mardi 01 novembre 2022 à 18:40 +0100, Christian König a écrit :

...

Am 28.10.22 um 20:47 schrieb Daniel Stone:

...
Hi Christian,

On Fri, 28 Oct 2022 at 18:50, Christian König ckoenig.leichtzumerken@gmail.com wrote:

...
Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:

...
Though, its not generically possible to reverse these roles. If you want to do so, you endup having to do like Android (gralloc) and ChromeOS (minigbm), because you will have to allocate DRM buffers that knows about importer specific requirements. See link [1] for what it looks like for RK3399, with Motion Vector size calculation copied from the kernel driver into a userspace lib (arguably that was available from V4L2 sizeimage, but this is technically difficult to communicate within the software layers). If you could let the decoder export (with proper cache management) the non-generic code would not be needed.

Yeah, but I can also reverse the argument:

Getting the parameters for V4L right so that we can share the image is tricky, but getting the parameters so that the stuff is actually directly displayable by GPUs is even trickier.

Essentially you need to look at both sides and interference to get to a common ground, e.g. alignment, pitch, width/height, padding, etc.....

Deciding from which side to allocate from is just one step in this process. For example most dGPUs can't display directly from system memory altogether, but it is possible to allocate the DMA-buf through the GPU driver and then write into device memory with P2P PCI transfers.

So as far as I can see switching importer and exporter roles and even having performant extra fallbacks should be a standard feature of userspace.

...
Another case where reversing the role is difficult is for case where you need to multiplex the streams (let's use a camera to illustrate) and share that with multiple processes. In these uses case, the DRM importers are volatile, which one do you abuse to do allocation from ? In multimedia server like PipeWire, you are not really aware if the camera will be used by DRM or not, and if something "special" is needed in term of role inversion. It is relatively easy to deal with matching modifiers, but using downstream (display/gpu) as an exporter is always difficult (and require some level of abuse and guessing).

Oh, very good point! Yeah we do have use cases for this where an input buffer is both displayed as well as encoded.

This is the main issue, yeah.

For a standard media player, they would try to allocate through V4L2 and decode through that into locally-allocated buffers. All they know is that there's a Wayland server at the other end of a socket somewhere which will want to import the FD. The server does give you some hints along the way: it will tell you that importing into a particular GPU target device is necessary as the ultimate fallback, and importing into a particular KMS device is preferable as the optimal path to hit an overlay.

So let's say that the V4L2 client does what you're proposing: it allocates a buffer chain, schedules a decode into that buffer, and passes it along to the server to import. The server fails to import the buffer into the GPU, and tells the client this. The client then ... well, it doesn't know that it needs to allocate within the GPU instead, but it knows that doing so might be one thing which would make the request succeed.

But the client is just a video player. It doesn't understand how to allocate BOs for Panfrost or AMD or etnaviv. So without a universal allocator (again ...), 'just allocate on the GPU' isn't a useful response to the client.

Well exactly that's the point I'm raising: The client *must* understand that!

See we need to be able to handle all restrictions here, coherency of the data is just one of them.

For example the much more important question is the location of the data and for this allocating from the V4L2 device is in most cases just not going to fly.

It feels like this is a generic statement and there is no reason it could not be the other way around. I have colleague who integrated PCIe CODEC (Blaize Xplorer X1600P PCIe Accelerator) hosting their own RAM. There was large amount of ways to use it. Of course, in current state of DMABuf, you have to be an exporter to do anything fancy, but it did not have to be like this, its a design choice. I'm not sure in the end what was the final method used, the driver isn't yet upstream, so maybe that is not even final. What I know is that there is various condition you may use the CODEC for which the optimal location will vary. As an example, using the post processor or not, see my next comment for more details.

...

The more common case is that you need to allocate from the GPU and then import that into the V4L2 device. The background is that all dGPUs I know of need the data inside local memory (VRAM) to be able to scan out from it.

The reality is that what is common to you, might not be to others. In my work, most ARM SoC have display that just handle direct scannout from cameras and codecs. The only case the commonly fails is whenever we try to display UVC created dmabuf, which have dirty CPU write cache and this is the type of thing we'd like to see solved. I think this series was addressing it in principle, but failing the import and the raised point is that this wasn't the optimal way.

There is a community project called LibreELEC, if you aren't aware, they run Khodi with direct scanout of video stream on a wide variety of SoC and they use the CODEC as exporter all the time. They simply don't have cases were the opposite is needed (or any kind of remote RAM to deal with). In fact, FFMPEG does not really offer you any API to reverse the allocation.

...

...
I fully understand your point about APIs like Vulkan not sensibly allowing bracketing, and that's fine. On the other hand, a lot of extant usecases (camera/codec -> GPU/display, GPU -> codec, etc) on Arm just cannot fulfill complete coherency. On a lot of these platforms, despite what you might think about the CPU/GPU capabilities, the bottleneck is _always_ memory bandwidth, so mandating extra copies is an absolute non-starter, and would instantly cripple billions of devices. Lucas has been pretty gentle, but to be more clear, this is not an option and won't be for at least the next decade.

Well x86 pretty much has the same restrictions.

For example the scanout buffer is usually always in local memory because you often scan out at up to 120Hz while your recording is only 30fps and most of the time lower resolution.

Pumping all that data 120 time a second over the PCIe bus would just not be doable in a lot of use cases.

This is good point for this case. Though, the effect of using remote RAM in CODEC can be very dramatic. In some case, the buffer you are going to display are what we call the reference frames. That means that while you are displaying these, the CODEC needs to read from these in order to construct the following frames. Most of the time, reads are massively slower with remote RAM, and over- uploading, like you describe here is going to be the most optimal path.

Note that in some other cases, the buffers are called secondary buffers, which is the outcome of a post processor embedded into the CODEC. In that case, remote RAM may be fine, it really depends on the write speed (though usually really good compared to reads). So yes, in a case of high refresh rate, a CODEC with post processor may do a better job (if you have a single display path for that buffer).

p.s. Note that the reason we support reference frame display even if secondary buffer is possible is often because we are limited on memory bandwidth. For let's say 4K60, a secondary render will require an extra 18gb writes (on the best platform, might require equivalent reads on other platforms, like Mediatek, Samsung Exynos and some other).

...

...
So we obviously need a third way at this point, because 'all devices must always be coherent' vs. 'cache must be an unknown' can't work. How about this as a suggestion: we have some unused flags in the PRIME ioctls. Can we add a flag for 'import must be coherent'?

That's pretty much exactly what my patch set does. It just keeps userspace out of the way and says that creating the initial connection between the devices fails if they can't talk directly with each other.

Maybe we should move that into userspace so that the involved components know of hand that a certain approach won't work?

Anything that an be predict without trial and error is great idea for sure. Though, we have to be realist, there is no guarantied way to be sure other then trying. So I would not be too worried. Imho, I lacked the time to try it out, but the current implementation should in theory make software like GStreamer fallback to memcpy for simple cases like UVC Cameras. Very simple gstreamer pipeline most folks can run (and that usually has produced artifacts for years) is:

gst-launch-1.0 v4l2src ! queue ! kmssink

p.s. you need to turn off any drm master, may not work on multi-display setup, kmssink is very limited, but it does implement memcpy fallback if dmabuf import fails (just that it does not fail).

...

...
That flag wouldn't be set for the existing ecosystem Lucas/Nicolas/myself are talking about, where we have explicit handover points and users are fully able to perform cache maintenance. For newer APIs where it's not possible to properly express that bracketing, they would always set that flag (unless we add an API carve-out where the client promises to do whatever is required to maintain that).

Would that be viable?

No, as I said. Explicit handover points are just an absolutely no-go. We just have way to many use cases which don't work with that idea.

As I said we made the same mistake with the DMA-Api and even more 20 years later are still running into problems because of that.

Just try to run any dGPU under a XEN hypervisor with memory fragmentation for a very good example why this is such a bad idea.

Regards, Christian.

...
Cheers, Daniel

Christian König

2 Nov 2 Nov

11:18 a.m.

Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

...

[SNIP]

...
...
But the client is just a video player. It doesn't understand how to allocate BOs for Panfrost or AMD or etnaviv. So without a universal allocator (again ...), 'just allocate on the GPU' isn't a useful response to the client.

Well exactly that's the point I'm raising: The client *must* understand that!

See we need to be able to handle all restrictions here, coherency of the data is just one of them.

For example the much more important question is the location of the data and for this allocating from the V4L2 device is in most cases just not going to fly.

It feels like this is a generic statement and there is no reason it could not be the other way around.

And exactly that's my point. You always need to look at both ways to share the buffer and can't assume that one will always work.

As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys hasn't pushed back on this practice already.

This only works because the Wayland as well as X display pipeline is smart enough to insert an extra copy when it find that an imported buffer can't be used as a framebuffer directly.

...

I have colleague who integrated PCIe CODEC (Blaize Xplorer X1600P PCIe Accelerator) hosting their own RAM. There was large amount of ways to use it. Of course, in current state of DMABuf, you have to be an exporter to do anything fancy, but it did not have to be like this, its a design choice. I'm not sure in the end what was the final method used, the driver isn't yet upstream, so maybe that is not even final. What I know is that there is various condition you may use the CODEC for which the optimal location will vary. As an example, using the post processor or not, see my next comment for more details.

Yeah, and stuff like this was already discussed multiple times. Local memory of devices can only be made available by the exporter, not the importer.

So in the case of separated camera and encoder you run into exactly the same limitation that some device needs the allocation to happen on the camera while others need it on the encoder.

...

...
The more common case is that you need to allocate from the GPU and then import that into the V4L2 device. The background is that all dGPUs I know of need the data inside local memory (VRAM) to be able to scan out from it.

The reality is that what is common to you, might not be to others. In my work, most ARM SoC have display that just handle direct scannout from cameras and codecs.

...

The only case the commonly fails is whenever we try to display UVC created dmabuf,

Well, exactly that's not correct! The whole x86 use cases of direct display for dGPUs are broken because media players think they can do the simple thing and offload all the problematic cases to the display server.

This is absolutely *not* the common use case you describe here, but rather something completely special to ARM.

...

which have dirty CPU write cache and this is the type of thing we'd like to see solved. I think this series was addressing it in principle, but failing the import and the raised point is that this wasn't the optimal way.

There is a community project called LibreELEC, if you aren't aware, they run Khodi with direct scanout of video stream on a wide variety of SoC and they use the CODEC as exporter all the time. They simply don't have cases were the opposite is needed (or any kind of remote RAM to deal with). In fact, FFMPEG does not really offer you any API to reverse the allocation.

Ok, let me try to explain it once more. It sounds like I wasn't able to get my point through.

That we haven't heard anybody screaming that x86 doesn't work is just because we handle the case that a buffer isn't directly displayable in X/Wayland anyway, but this is absolutely not the optimal solution.

The argument that you want to keep the allocation on the codec side is completely false as far as I can see.

We already had numerous projects where we reported this practice as bugs to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.

This is just a software solution which works because of coincident and not because of engineering.

Regards, Christian.

Lucas Stach

11:39 a.m.

Hi Christian,

going to reply in more detail when I have some more time, so just some quick thoughts for now.

Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:

...

Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

...
[SNIP]

...
...
But the client is just a video player. It doesn't understand how to allocate BOs for Panfrost or AMD or etnaviv. So without a universal allocator (again ...), 'just allocate on the GPU' isn't a useful response to the client.

Well exactly that's the point I'm raising: The client *must* understand that!

See we need to be able to handle all restrictions here, coherency of the data is just one of them.

For example the much more important question is the location of the data and for this allocating from the V4L2 device is in most cases just not going to fly.

It feels like this is a generic statement and there is no reason it could not be the other way around.

And exactly that's my point. You always need to look at both ways to share the buffer and can't assume that one will always work.

As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys hasn't pushed back on this practice already.

This only works because the Wayland as well as X display pipeline is smart enough to insert an extra copy when it find that an imported buffer can't be used as a framebuffer directly.

With bracketed access you could even make this case work, as the dGPU would be able to slurp a copy of the dma-buf into LMEM for scanout.

...

...
I have colleague who integrated PCIe CODEC (Blaize Xplorer X1600P PCIe Accelerator) hosting their own RAM. There was large amount of ways to use it. Of course, in current state of DMABuf, you have to be an exporter to do anything fancy, but it did not have to be like this, its a design choice. I'm not sure in the end what was the final method used, the driver isn't yet upstream, so maybe that is not even final. What I know is that there is various condition you may use the CODEC for which the optimal location will vary. As an example, using the post processor or not, see my next comment for more details.

Yeah, and stuff like this was already discussed multiple times. Local memory of devices can only be made available by the exporter, not the importer.

So in the case of separated camera and encoder you run into exactly the same limitation that some device needs the allocation to happen on the camera while others need it on the encoder.

...
...
The more common case is that you need to allocate from the GPU and then import that into the V4L2 device. The background is that all dGPUs I know of need the data inside local memory (VRAM) to be able to scan out from it.

The reality is that what is common to you, might not be to others. In my work, most ARM SoC have display that just handle direct scannout from cameras and codecs.

...
The only case the commonly fails is whenever we try to display UVC created dmabuf,

Well, exactly that's not correct! The whole x86 use cases of direct display for dGPUs are broken because media players think they can do the simple thing and offload all the problematic cases to the display server.

This is absolutely *not* the common use case you describe here, but rather something completely special to ARM.

It the normal case for a lot of ARM SoCs. That world is certainly not any less big than the x86 dGPU world. A huge number of devices are ARM based set-top boxes and other video players. Just because it is a special case for you doesn't mean it's a global special case.

...

...
which have dirty CPU write cache and this is the type of thing we'd like to see solved. I think this series was addressing it in principle, but failing the import and the raised point is that this wasn't the optimal way.

There is a community project called LibreELEC, if you aren't aware, they run Khodi with direct scanout of video stream on a wide variety of SoC and they use the CODEC as exporter all the time. They simply don't have cases were the opposite is needed (or any kind of remote RAM to deal with). In fact, FFMPEG does not really offer you any API to reverse the allocation.

Ok, let me try to explain it once more. It sounds like I wasn't able to get my point through.

That we haven't heard anybody screaming that x86 doesn't work is just because we handle the case that a buffer isn't directly displayable in X/Wayland anyway, but this is absolutely not the optimal solution.

The argument that you want to keep the allocation on the codec side is completely false as far as I can see.

We already had numerous projects where we reported this practice as bugs to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.

And on a lot of ARM SoCs it's exactly the right thing to do. Many codecs need contiguous memory there, so importing a scatter-gather buffer from the GPU via dma-buf will simply not work.

...

This is just a software solution which works because of coincident and not because of engineering.

By mandating a software fallback for the cases where you would need bracketed access to the dma-buf, you simply shift the problem into userspace. Userspace then creates the bracket by falling back to some other import option that mostly do a copy and then the appropriate cache maintenance.

While I understand your sentiment about the DMA-API design being inconvenient when things are just coherent by system design, the DMA- API design wasn't done this way due to bad engineering, but due to the fact that performant DMA access on some systems just require this kind of bracketing.

Regards, Lucas

Christian König

12:21 p.m.

Hi Lucas,

Am 02.11.22 um 12:39 schrieb Lucas Stach:

...

Hi Christian,

going to reply in more detail when I have some more time, so just some quick thoughts for now.

Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:

...
Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

...
[SNIP]

As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys hasn't pushed back on this practice already.

This only works because the Wayland as well as X display pipeline is smart enough to insert an extra copy when it find that an imported buffer can't be used as a framebuffer directly.

With bracketed access you could even make this case work, as the dGPU would be able to slurp a copy of the dma-buf into LMEM for scanout.

Well, this copy is what we are trying to avoid here. The codec should pump the data into LMEM in the first place.

...

...
...
The only case the commonly fails is whenever we try to display UVC created dmabuf,

Well, exactly that's not correct! The whole x86 use cases of direct display for dGPUs are broken because media players think they can do the simple thing and offload all the problematic cases to the display server.

This is absolutely *not* the common use case you describe here, but rather something completely special to ARM.

It the normal case for a lot of ARM SoCs.

Yeah, but it's not the normal case for everybody.

We had numerous projects where customers wanted to pump video data directly from a decoder into an GPU frame or from a GPU frame into an encoder.

The fact that media frameworks doesn't support that out of the box is simply a bug.

...

That world is certainly not any less big than the x86 dGPU world. A huge number of devices are ARM based set-top boxes and other video players. Just because it is a special case for you doesn't mean it's a global special case.

Ok, let's stop with that. This isn't helpful in the technical discussion.

...

...
That we haven't heard anybody screaming that x86 doesn't work is just because we handle the case that a buffer isn't directly displayable in X/Wayland anyway, but this is absolutely not the optimal solution.

The argument that you want to keep the allocation on the codec side is completely false as far as I can see.

We already had numerous projects where we reported this practice as bugs to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.

And on a lot of ARM SoCs it's exactly the right thing to do.

Yeah and that's fine, it just doesn't seem to work in all cases.

For both x86 as well as the case here that the CPU cache might be dirty the exporter needs to be the device with the requirements.

For x86 dGPUs that's the backing store is some local memory. For the non-coherent ARM devices it's that the CPU cache is not dirty.

For a device driver which solely works with cached system memory inserting cache flush operations is something it would never do for itself. It would just be doing this for the importer and exactly that would be bad design because we then have handling for the display driver outside of the driver.

...

...
This is just a software solution which works because of coincident and not because of engineering.

By mandating a software fallback for the cases where you would need bracketed access to the dma-buf, you simply shift the problem into userspace. Userspace then creates the bracket by falling back to some other import option that mostly do a copy and then the appropriate cache maintenance.

While I understand your sentiment about the DMA-API design being inconvenient when things are just coherent by system design, the DMA- API design wasn't done this way due to bad engineering, but due to the fact that performant DMA access on some systems just require this kind of bracketing.

Well, this is exactly what I'm criticizing on the DMA-API. Instead of giving you a proper error code when something won't work in a specific way it just tries to hide the requirements inside the DMA layer.

For example when your device can only access 32bits the DMA-API transparently insert bounce buffers instead of giving you a proper error code that the memory in question can't be accessed.

This just tries to hide the underlying problem instead of pushing it into the upper layer where it can be handled much more gracefully.

Regards, Christian.

...

Regards, Lucas

Lucas Stach

5:10 p.m.

Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:

...

Hi Lucas,

Am 02.11.22 um 12:39 schrieb Lucas Stach:

...
Hi Christian,

going to reply in more detail when I have some more time, so just some quick thoughts for now.

Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:

...
Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

...
[SNIP]

As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys hasn't pushed back on this practice already.

This only works because the Wayland as well as X display pipeline is smart enough to insert an extra copy when it find that an imported buffer can't be used as a framebuffer directly.

With bracketed access you could even make this case work, as the dGPU would be able to slurp a copy of the dma-buf into LMEM for scanout.

Well, this copy is what we are trying to avoid here. The codec should pump the data into LMEM in the first place.

That's very use-case specific. If the codec used those frames as reference frames for further coding operations, allocating them in LMEM of a remote device doesn't sound like a clever idea. But then I think this is a whole different discussion on its own.

...

...
...
...
The only case the commonly fails is whenever we try to display UVC created dmabuf,

Well, exactly that's not correct! The whole x86 use cases of direct display for dGPUs are broken because media players think they can do the simple thing and offload all the problematic cases to the display server.

This is absolutely *not* the common use case you describe here, but rather something completely special to ARM.

It the normal case for a lot of ARM SoCs.

Yeah, but it's not the normal case for everybody.

We had numerous projects where customers wanted to pump video data directly from a decoder into an GPU frame or from a GPU frame into an encoder.

The fact that media frameworks doesn't support that out of the box is simply a bug.

...
That world is certainly not any less big than the x86 dGPU world. A huge number of devices are ARM based set-top boxes and other video players. Just because it is a special case for you doesn't mean it's a global special case.

Ok, let's stop with that. This isn't helpful in the technical discussion.

Agreed. I think we should try to understand the fact that there are very different world-views and use-cases for dma-buf right now and that breaking one or the other is simply no option. What may be an obscure special case on x86 may be very common on ARM and vice versa.

...

...
...
That we haven't heard anybody screaming that x86 doesn't work is just because we handle the case that a buffer isn't directly displayable in X/Wayland anyway, but this is absolutely not the optimal solution.

The argument that you want to keep the allocation on the codec side is completely false as far as I can see.

We already had numerous projects where we reported this practice as bugs to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.

And on a lot of ARM SoCs it's exactly the right thing to do.

Yeah and that's fine, it just doesn't seem to work in all cases.

For both x86 as well as the case here that the CPU cache might be dirty the exporter needs to be the device with the requirements.

For x86 dGPUs that's the backing store is some local memory. For the non-coherent ARM devices it's that the CPU cache is not dirty.

For a device driver which solely works with cached system memory inserting cache flush operations is something it would never do for itself.

It's exactly what a device driver working with cacheable memory buffers on a architecture with non-coherent DMA masters would do. In fact it's the main reason why the bracketing with explicit ownership transfer exists in DMA API: to allow the DMA API implementation to do the necessary cache maintenance operations on architectures that need them.

...

It would just be doing this for the importer and exactly that would be bad design because we then have handling for the display driver outside of the driver.

The driver would have to do those cache maintenance operations if it directly worked with a non-coherent device. Doing it for the importer is just doing it for another device, not the one directly managed by the exporter.

I really don't see the difference to the other dma-buf ops: in dma_buf_map_attachment the exporter maps the dma-buf on behalf and into the address space of the importer. Why would cache maintenance be any different?

...

...
...
This is just a software solution which works because of coincident and not because of engineering.

By mandating a software fallback for the cases where you would need bracketed access to the dma-buf, you simply shift the problem into userspace. Userspace then creates the bracket by falling back to some other import option that mostly do a copy and then the appropriate cache maintenance.

While I understand your sentiment about the DMA-API design being inconvenient when things are just coherent by system design, the DMA- API design wasn't done this way due to bad engineering, but due to the fact that performant DMA access on some systems just require this kind of bracketing.

Well, this is exactly what I'm criticizing on the DMA-API. Instead of giving you a proper error code when something won't work in a specific way it just tries to hide the requirements inside the DMA layer.

For example when your device can only access 32bits the DMA-API transparently insert bounce buffers instead of giving you a proper error code that the memory in question can't be accessed.

This just tries to hide the underlying problem instead of pushing it into the upper layer where it can be handled much more gracefully.

How would you expect the DMA API to behave on a system where the device driver is operating on cacheable memory, but the device is non- coherent? Telling the driver that this just doesn't work? It's a use- case that is working fine today with many devices (e.g. network adapters) in the ARM world, exactly because the architecture specific implementation of the DMA API inserts the cache maintenance operations on buffer ownership transfer.

Regards, Lucas

Christian König

7:13 p.m.

Am 02.11.22 um 18:10 schrieb Lucas Stach:

...

Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König: [SNIP]

...
It would just be doing this for the importer and exactly that would be bad design because we then have handling for the display driver outside of the driver.

The driver would have to do those cache maintenance operations if it directly worked with a non-coherent device. Doing it for the importer is just doing it for another device, not the one directly managed by the exporter.

I really don't see the difference to the other dma-buf ops: in dma_buf_map_attachment the exporter maps the dma-buf on behalf and into the address space of the importer. Why would cache maintenance be any different?

The issue here is the explicit ownership transfer.

We intentionally decided against that because it breaks tons of use cases and is at least by me and a couple of others seen as generally design failure of the Linux DMA-API.

DMA-Buf let's the exporter setup the DMA addresses the importer uses to be able to directly decided where a certain operation should go. E.g. we have cases where for example a P2P write doesn't even go to memory, but rather a doorbell BAR to trigger another operation. Throwing in CPU round trips for explicit ownership transfer completely breaks that concept.

Additional to that a very basic concept of DMA-buf is that the exporter provides the buffer as it is and just double checks if the importer can access it. For example we have XGMI links which makes memory accessible to other devices on the same bus, but not to PCIe device and not even to the CPU. Otherwise you wouldn't be able to implement things like secure decoding where the data isn't even accessible outside the device to device link.

So if a device driver uses cached system memory on an architecture which devices which can't access it the right approach is clearly to reject the access.

What we can do is to reverse the role of the exporter and importer and let the device which needs uncached memory take control. This way this device can insert operations as needed, e.g. flush read caches or invalidate write caches.

This is what we have already done in DMA-buf and what already works perfectly fine with use cases which are even more complicated than a simple write cache invalidation.

...

...
...
...
This is just a software solution which works because of coincident and not because of engineering.

By mandating a software fallback for the cases where you would need bracketed access to the dma-buf, you simply shift the problem into userspace. Userspace then creates the bracket by falling back to some other import option that mostly do a copy and then the appropriate cache maintenance.

While I understand your sentiment about the DMA-API design being inconvenient when things are just coherent by system design, the DMA- API design wasn't done this way due to bad engineering, but due to the fact that performant DMA access on some systems just require this kind of bracketing.

Well, this is exactly what I'm criticizing on the DMA-API. Instead of giving you a proper error code when something won't work in a specific way it just tries to hide the requirements inside the DMA layer.

For example when your device can only access 32bits the DMA-API transparently insert bounce buffers instead of giving you a proper error code that the memory in question can't be accessed.

This just tries to hide the underlying problem instead of pushing it into the upper layer where it can be handled much more gracefully.

How would you expect the DMA API to behave on a system where the device driver is operating on cacheable memory, but the device is non- coherent? Telling the driver that this just doesn't work?

Yes, exactly that.

It's the job of the higher level to prepare the buffer a device work with, not the one of the lower level.

In other words in a proper design the higher level would prepare the memory in a way the device driver can work with it, not the other way around.

When a device driver gets memory it can't work with the correct response is to throw an error and bubble that up into a layer where it can be handled gracefully.

For example instead of using bounce buffers under the hood the DMA layer the MM should make sure that when you call read() with O_DIRECT that the pages in question are accessible by the device.

...

It's a use-case that is working fine today with many devices (e.g. network adapters) in the ARM world, exactly because the architecture specific implementation of the DMA API inserts the cache maintenance operations on buffer ownership transfer.

Yeah, I'm perfectly aware of that. The problem is that exactly that design totally breaks GPUs on Xen DOM0 for example.

And Xen is just one example, I can certainly say from experience that this design was a really really bad idea because it favors just one use case while making other use cases practically impossible if not really hard to implement.

Regards, Christian.

...

Regards, Lucas

Lucas Stach

7:48 p.m.

Am Mittwoch, dem 02.11.2022 um 20:13 +0100 schrieb Christian König:

...

Am 02.11.22 um 18:10 schrieb Lucas Stach:

...
Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König: [SNIP]

...
It would just be doing this for the importer and exactly that would be bad design because we then have handling for the display driver outside of the driver.

The driver would have to do those cache maintenance operations if it directly worked with a non-coherent device. Doing it for the importer is just doing it for another device, not the one directly managed by the exporter.

I really don't see the difference to the other dma-buf ops: in dma_buf_map_attachment the exporter maps the dma-buf on behalf and into the address space of the importer. Why would cache maintenance be any different?

The issue here is the explicit ownership transfer.

We intentionally decided against that because it breaks tons of use cases and is at least by me and a couple of others seen as generally design failure of the Linux DMA-API.

DMA-Buf let's the exporter setup the DMA addresses the importer uses to be able to directly decided where a certain operation should go. E.g. we have cases where for example a P2P write doesn't even go to memory, but rather a doorbell BAR to trigger another operation. Throwing in CPU round trips for explicit ownership transfer completely breaks that concept.

I'm not sure where you see the CPU in that scenario. A explicit ownership transfer could also trigger things like a cache invalidate on the receiving end of a P2P transfer. A CPU cache invalidate/flush is just one possible use-case of the ownership transfer.

And sure, some uses will just not work when a explicit ownership transfer is required, as the usage model and/or API doesn't allow for it. But lots of uses will work with explicit ownership transfer, actually there is no other way to properly deal with systems where the sharing masters may be non-coherent, which is norm rather than the exception in lots of embedded uses, regardless of the inconvenience this might cause for some other areas.

...

Additional to that a very basic concept of DMA-buf is that the exporter provides the buffer as it is and just double checks if the importer can access it. For example we have XGMI links which makes memory accessible to other devices on the same bus, but not to PCIe device and not even to the CPU. Otherwise you wouldn't be able to implement things like secure decoding where the data isn't even accessible outside the device to device link.

So if a device driver uses cached system memory on an architecture which devices which can't access it the right approach is clearly to reject the access.

What we can do is to reverse the role of the exporter and importer and let the device which needs uncached memory take control. This way this device can insert operations as needed, e.g. flush read caches or invalidate write caches.

And that will only solve one aspect of the allocation restriction maze, namely the cacheable attribute, but it might as well break things like contiguous requirements.

...

This is what we have already done in DMA-buf and what already works perfectly fine with use cases which are even more complicated than a simple write cache invalidation.

...
...
...
...
This is just a software solution which works because of coincident and not because of engineering.

By mandating a software fallback for the cases where you would need bracketed access to the dma-buf, you simply shift the problem into userspace. Userspace then creates the bracket by falling back to some other import option that mostly do a copy and then the appropriate cache maintenance.

While I understand your sentiment about the DMA-API design being inconvenient when things are just coherent by system design, the DMA- API design wasn't done this way due to bad engineering, but due to the fact that performant DMA access on some systems just require this kind of bracketing.

Well, this is exactly what I'm criticizing on the DMA-API. Instead of giving you a proper error code when something won't work in a specific way it just tries to hide the requirements inside the DMA layer.

For example when your device can only access 32bits the DMA-API transparently insert bounce buffers instead of giving you a proper error code that the memory in question can't be accessed.

This just tries to hide the underlying problem instead of pushing it into the upper layer where it can be handled much more gracefully.

How would you expect the DMA API to behave on a system where the device driver is operating on cacheable memory, but the device is non- coherent? Telling the driver that this just doesn't work?

Yes, exactly that.

It's the job of the higher level to prepare the buffer a device work with, not the one of the lower level.

In other words in a proper design the higher level would prepare the memory in a way the device driver can work with it, not the other way around.

When a device driver gets memory it can't work with the correct response is to throw an error and bubble that up into a layer where it can be handled gracefully.

How would this situation be handled more gracefully than a cache flush/invalidate. Certainly the solution wouldn't be to make the Linux network stack operate on non-cacheable memory? To be clear: that's a rhetorical question, as unaligned access, which is quite common in many network uses, to non-cacheable memory is also not allowed on most system designs, so this isn't even an option.

...

For example instead of using bounce buffers under the hood the DMA layer the MM should make sure that when you call read() with O_DIRECT that the pages in question are accessible by the device.

...
It's a use-case that is working fine today with many devices (e.g. network adapters) in the ARM world, exactly because the architecture specific implementation of the DMA API inserts the cache maintenance operations on buffer ownership transfer.

Yeah, I'm perfectly aware of that. The problem is that exactly that design totally breaks GPUs on Xen DOM0 for example.

And Xen is just one example, I can certainly say from experience that this design was a really really bad idea because it favors just one use case while making other use cases practically impossible if not really hard to implement.

So maybe we should agree that we both have uses in mind which are practically impossible to implement when restricting dma-buf to one world view? You are thinking about uses that need full coherence without any bracketing and I'm thinking about uses that won't ever work with usable performance without bracketing.

So maybe we should circle back to daniels suggestion to have a flag to keep both worlds apart? Reject the import as you do in this series when the "must not use bracketing" flag is set and allow explicit ownership transfers otherwise seems like a workable solution to me.

Regards, Lucas

Tomasz Figa

17 Nov 17 Nov

9:35 a.m.

Hi Christian and everyone,

On Thu, Nov 3, 2022 at 4:14 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:

...

Am 02.11.22 um 18:10 schrieb Lucas Stach:

...
Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König: [SNIP]

...
It would just be doing this for the importer and exactly that would be bad design because we then have handling for the display driver outside of the driver.

The driver would have to do those cache maintenance operations if it directly worked with a non-coherent device. Doing it for the importer is just doing it for another device, not the one directly managed by the exporter.

I really don't see the difference to the other dma-buf ops: in dma_buf_map_attachment the exporter maps the dma-buf on behalf and into the address space of the importer. Why would cache maintenance be any different?

The issue here is the explicit ownership transfer.

We intentionally decided against that because it breaks tons of use cases and is at least by me and a couple of others seen as generally design failure of the Linux DMA-API.

First of all, thanks for starting the discussion and sorry for being late to the party. May I ask you to keep me on CC for any changes that touch the V4L2 videobuf2 framework, as a maintainer of it? I'm okay being copied on the entire series, no need to pick the specific patches. Thanks in advance.

I agree that we have some design issues in the current DMA-buf framework, but I'd try to approach it a bit differently. Instead of focusing on the issues in the current design, could we write down our requirements and try to come up with how a correct design would look like? (A lot of that has been already mentioned in this thread, but I find it quite difficult to follow and it might not be a complete view either.)

That said, let me address a few aspects already mentioned, to make sure that everyone is on the same page.

...

DMA-Buf let's the exporter setup the DMA addresses the importer uses to be able to directly decided where a certain operation should go. E.g. we have cases where for example a P2P write doesn't even go to memory, but rather a doorbell BAR to trigger another operation. Throwing in CPU round trips for explicit ownership transfer completely breaks that concept.

It sounds like we should have a dma_dev_is_coherent_with_dev() which accepts two (or an array?) of devices and tells the caller whether the devices need explicit ownership transfer. Based on that, your drivers would install the DMA completion (presumably IRQ) handlers or not. It's necessary since it's not uncommon that devices A and B could be in the same coherency domain, while C could be in a different one, but you may still want them to exchange data through DMA-bufs. Even if it means the need for some extra round trips it would likely be more efficient than a full memory copy (might not be true 100% of the time).

...

Additional to that a very basic concept of DMA-buf is that the exporter provides the buffer as it is and just double checks if the importer can access it. For example we have XGMI links which makes memory accessible to other devices on the same bus, but not to PCIe device and not even to the CPU. Otherwise you wouldn't be able to implement things like secure decoding where the data isn't even accessible outside the device to device link.

Fully agreed.

...

So if a device driver uses cached system memory on an architecture which devices which can't access it the right approach is clearly to reject the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

...

What we can do is to reverse the role of the exporter and importer and let the device which needs uncached memory take control. This way this device can insert operations as needed, e.g. flush read caches or invalidate write caches.

(Putting aside the cases when the access is really impossible at all.) Correct me if I'm wrong, but isn't that because we don't have a proper hook for the importer to tell the DMA-buf framework to prepare the buffer for its access?

...

This is what we have already done in DMA-buf and what already works perfectly fine with use cases which are even more complicated than a simple write cache invalidation.

...
...
...
...
This is just a software solution which works because of coincident and not because of engineering.

By mandating a software fallback for the cases where you would need bracketed access to the dma-buf, you simply shift the problem into userspace. Userspace then creates the bracket by falling back to some other import option that mostly do a copy and then the appropriate cache maintenance.

While I understand your sentiment about the DMA-API design being inconvenient when things are just coherent by system design, the DMA- API design wasn't done this way due to bad engineering, but due to the fact that performant DMA access on some systems just require this kind of bracketing.

Well, this is exactly what I'm criticizing on the DMA-API. Instead of giving you a proper error code when something won't work in a specific way it just tries to hide the requirements inside the DMA layer.

For example when your device can only access 32bits the DMA-API transparently insert bounce buffers instead of giving you a proper error code that the memory in question can't be accessed.

This just tries to hide the underlying problem instead of pushing it into the upper layer where it can be handled much more gracefully.

How would you expect the DMA API to behave on a system where the device driver is operating on cacheable memory, but the device is non- coherent? Telling the driver that this just doesn't work?

Yes, exactly that.

It's the job of the higher level to prepare the buffer a device work with, not the one of the lower level.

What are higher and lower levels here?

As per the existing design of the DMA mapping framework, the framework handles the system DMA architecture details and DMA master drivers take care of invoking the right DMA mapping operations around the DMA accesses. This makes sense to me, as DMA master drivers have no idea about the specific SoCs or buses they're plugged into, while the DMA mapping framework has no idea when the DMA accesses are taking place.

...

In other words in a proper design the higher level would prepare the memory in a way the device driver can work with it, not the other way around.

When a device driver gets memory it can't work with the correct response is to throw an error and bubble that up into a layer where it can be handled gracefully.

For example instead of using bounce buffers under the hood the DMA layer the MM should make sure that when you call read() with O_DIRECT that the pages in question are accessible by the device.

I tend to agree with you if it's about a costly software "emulation" like bounce buffers, but cache maintenance is a hardware feature existing there by default and it's often much cheaper to operate on cached memory and synchronize the caches rather than have everything in uncached (or write-combined) memory.

...

...
It's a use-case that is working fine today with many devices (e.g. network adapters) in the ARM world, exactly because the architecture specific implementation of the DMA API inserts the cache maintenance operations on buffer ownership transfer.

Yeah, I'm perfectly aware of that. The problem is that exactly that design totally breaks GPUs on Xen DOM0 for example.

And Xen is just one example, I can certainly say from experience that this design was a really really bad idea because it favors just one use case while making other use cases practically impossible if not really hard to implement.

Sorry, I haven't worked with Xen. Could you elaborate what's the problem that this introduces for it?

Best regards, Tomasz

Christian König

12:10 p.m.

Hi Tomasz,

Am 17.11.22 um 10:35 schrieb Tomasz Figa:

...

Hi Christian and everyone,

On Thu, Nov 3, 2022 at 4:14 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:

...
Am 02.11.22 um 18:10 schrieb Lucas Stach:

...
Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König: [SNIP]

...
It would just be doing this for the importer and exactly that would be bad design because we then have handling for the display driver outside of the driver.

The driver would have to do those cache maintenance operations if it directly worked with a non-coherent device. Doing it for the importer is just doing it for another device, not the one directly managed by the exporter.

I really don't see the difference to the other dma-buf ops: in dma_buf_map_attachment the exporter maps the dma-buf on behalf and into the address space of the importer. Why would cache maintenance be any different?

The issue here is the explicit ownership transfer.

We intentionally decided against that because it breaks tons of use cases and is at least by me and a couple of others seen as generally design failure of the Linux DMA-API.

First of all, thanks for starting the discussion and sorry for being late to the party. May I ask you to keep me on CC for any changes that touch the V4L2 videobuf2 framework, as a maintainer of it? I'm okay being copied on the entire series, no need to pick the specific patches. Thanks in advance.

Sorry for that, I've only added people involved in the previous discussion. Going to keep you in the loop.

...

I agree that we have some design issues in the current DMA-buf framework, but I'd try to approach it a bit differently. Instead of focusing on the issues in the current design, could we write down our requirements and try to come up with how a correct design would look like? (A lot of that has been already mentioned in this thread, but I find it quite difficult to follow and it might not be a complete view either.)

Well, exactly that's what we disagree on.

As far as I can see the current design of DMA-buf is perfectly capable of handling all the requirements.

A brief summary of the requirements with some implementation notes:

1. Device drivers export their memory as it is. E.g. no special handling for importers on the exporter side. If an importer can't deal with a specific format, layout, caching etc... of the data the correct approach is to reject the attachment. Those parameters are controlled by userspace and negotiating them is explicitly not part of the framework.

2. Accesses of the CPU to a buffer are bracket int begin_cpu_access() and end_cpu_access() calls. Here we can insert the CPU cache invalidation/flushes as necessary.

3. Accesses of the device HW to a buffer are represented with dma_fence objects. It's explicitly allowed to have multiple devices access the buffer at the same time.

4. Access to the DMA-buf by the HW of an importer is setup by the exporter. In other words the exporter provides a bunch of DMA addresses the importer should access. The importer should not try to come up with those on its own.

...

That said, let me address a few aspects already mentioned, to make sure that everyone is on the same page.

...
DMA-Buf let's the exporter setup the DMA addresses the importer uses to be able to directly decided where a certain operation should go. E.g. we have cases where for example a P2P write doesn't even go to memory, but rather a doorbell BAR to trigger another operation. Throwing in CPU round trips for explicit ownership transfer completely breaks that concept.

It sounds like we should have a dma_dev_is_coherent_with_dev() which accepts two (or an array?) of devices and tells the caller whether the devices need explicit ownership transfer.

No, exactly that's the concept I'm pushing back on very hard here.

In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly possible to implement this using the dma_fence objects mentioned above. E.g. drivers can already look at who is accessing a buffer currently and can even grab explicit ownership of it by adding their own dma_fence objects.

The only exception is CPU based access, e.g. when something is written with the CPU a cache flush might be necessary and when something is read with the CPU a cache invalidation might be necessary.

...

...
So if a device driver uses cached system memory on an architecture which devices which can't access it the right approach is clearly to reject the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

We do have the system and CMA dma-buf heap for cases where a device driver which wants to access the buffer with caches enabled. So this is not a limitation in functionality, it's just a matter of correctly using it.

...

...
What we can do is to reverse the role of the exporter and importer and let the device which needs uncached memory take control. This way this device can insert operations as needed, e.g. flush read caches or invalidate write caches.

(Putting aside the cases when the access is really impossible at all.) Correct me if I'm wrong, but isn't that because we don't have a proper hook for the importer to tell the DMA-buf framework to prepare the buffer for its access?

No, we do have the explicit begin_cpu_access() and end_cpu_access() calls which are even exporter to userspace through IOCTLs for this.

The problem is that in this particular case the exporter provides the buffer as is, e.g. with dirty CPU caches. And the importer can't deal with that.

Either reverting the roles of the importer or exporter or switching over to the common DMA-heaps implementation for the buffer would solve that.

...

...
It's the job of the higher level to prepare the buffer a device work with, not the one of the lower level.

What are higher and lower levels here?

The MM as higher level and the DMA mapping framework as lower level.

...

As per the existing design of the DMA mapping framework, the framework handles the system DMA architecture details and DMA master drivers take care of invoking the right DMA mapping operations around the DMA accesses. This makes sense to me, as DMA master drivers have no idea about the specific SoCs or buses they're plugged into, while the DMA mapping framework has no idea when the DMA accesses are taking place.

Yeah and exactly that's a bit problematic. In an ideal world device drivers wouldn't see memory they can't access in the first place.

For example what we currently do is the following: 1. get_user_pages() grabs a reference to the pages we want to DMA to/from. 2. DMA mapping framework is called by the driver to create an sg-table. 3. The DMA mapping framework sees that this won't work and inserts bounce buffers. 4. The driver does the DMA to the bounce buffers instead. 5. The DMA mapping framework copies the data to the real location.

This is highly problematic since it removes the control of what happens here. E.g. drivers can't prevent using bounce buffers when they need coherent access for DMA-buf for example.

What we should have instead is that bounce buffers are applied at a higher level, for example by get_user_pages() or more general in the MM.

...

...
In other words in a proper design the higher level would prepare the memory in a way the device driver can work with it, not the other way around.

When a device driver gets memory it can't work with the correct response is to throw an error and bubble that up into a layer where it can be handled gracefully.

For example instead of using bounce buffers under the hood the DMA layer the MM should make sure that when you call read() with O_DIRECT that the pages in question are accessible by the device.

I tend to agree with you if it's about a costly software "emulation" like bounce buffers, but cache maintenance is a hardware feature existing there by default and it's often much cheaper to operate on cached memory and synchronize the caches rather than have everything in uncached (or write-combined) memory.

Well that we should have proper cache maintenance is really not the question. The question is where to do that?

E.g. in the DMA-mapping framework which as far as I can see should only take care of translating addresses.

Or the higher level (get_user_pages() is just one example of that) which says ok device X wants to access memory Y I need to make sure that caches are flushed/invalidated.

...

...
...
It's a use-case that is working fine today with many devices (e.g. network adapters) in the ARM world, exactly because the architecture specific implementation of the DMA API inserts the cache maintenance operations on buffer ownership transfer.

Yeah, I'm perfectly aware of that. The problem is that exactly that design totally breaks GPUs on Xen DOM0 for example.

And Xen is just one example, I can certainly say from experience that this design was a really really bad idea because it favors just one use case while making other use cases practically impossible if not really hard to implement.

Sorry, I haven't worked with Xen. Could you elaborate what's the problem that this introduces for it?

That's a bit longer topic. The AMD XEN devs are already working on this as far as I know. I can ping internally how far they got with sending the patches out to avoid this problem.

Thanks for the input, Christian.

...

Best regards, Tomasz

Nicolas Dufresne

3:38 p.m.

Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :

...

...
...
DMA-Buf let's the exporter setup the DMA addresses the importer uses to be able to directly decided where a certain operation should go. E.g. we have cases where for example a P2P write doesn't even go to memory, but rather a doorbell BAR to trigger another operation. Throwing in CPU round trips for explicit ownership transfer completely breaks that concept.

It sounds like we should have a dma_dev_is_coherent_with_dev() which accepts two (or an array?) of devices and tells the caller whether the devices need explicit ownership transfer.

No, exactly that's the concept I'm pushing back on very hard here.

In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

I'm not pushing for this solution, but really felt the need to correct you here. I have quite some experience with ownership transfer mechanism, as this is how GStreamer framework works since 2000. Concurrent access is a really common use cases and it is quite well defined in that context. The bracketing system (in this case called map() unmap(), with flag stating the usage intention like reads and write) is combined the the refcount. The basic rules are simple:

- An object with a refcount higher then 2 is shared, hence read-only - An object with refcount of one, mapped for writes becomes exclusive - Non exclusive writes can be done, but that has to be explicit (intentional), we didn't go as far as Rust in that domain - Wrappers around these object can use mechanism like "copy-on-write" and can also maintain the state of shadow buffers (e.g. GL upload slow cases) even with concurrent access.

Just hope it clarify, Rust language works, yet its all based on explicit ownership transfers. Its not limiting, but it requires a different way of thinking how data is to be accessed.

Nicolas

Rob Clark

18 Nov 18 Nov

7:32 p.m.

On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne nicolas@ndufresne.ca wrote:

...

Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :

...
...
...
DMA-Buf let's the exporter setup the DMA addresses the importer uses to be able to directly decided where a certain operation should go. E.g. we have cases where for example a P2P write doesn't even go to memory, but rather a doorbell BAR to trigger another operation. Throwing in CPU round trips for explicit ownership transfer completely breaks that concept.

It sounds like we should have a dma_dev_is_coherent_with_dev() which accepts two (or an array?) of devices and tells the caller whether the devices need explicit ownership transfer.

No, exactly that's the concept I'm pushing back on very hard here.

In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

I'm not pushing for this solution, but really felt the need to correct you here. I have quite some experience with ownership transfer mechanism, as this is how GStreamer framework works since 2000. Concurrent access is a really common use cases and it is quite well defined in that context. The bracketing system (in this case called map() unmap(), with flag stating the usage intention like reads and write) is combined the the refcount. The basic rules are simple:

This is all CPU oriented, I think Christian is talking about the case where ownership transfer happens without CPU involvement, such as via GPU waiting on a fence

BR, -R

Nicolas Dufresne

19 Nov 19 Nov

8:35 p.m.

Le vendredi 18 novembre 2022 à 11:32 -0800, Rob Clark a écrit :

...

On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne nicolas@ndufresne.ca wrote:

...
Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :

...
...
...
DMA-Buf let's the exporter setup the DMA addresses the importer uses to be able to directly decided where a certain operation should go. E.g. we have cases where for example a P2P write doesn't even go to memory, but rather a doorbell BAR to trigger another operation. Throwing in CPU round trips for explicit ownership transfer completely breaks that concept.

It sounds like we should have a dma_dev_is_coherent_with_dev() which accepts two (or an array?) of devices and tells the caller whether the devices need explicit ownership transfer.

No, exactly that's the concept I'm pushing back on very hard here.

In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

I'm not pushing for this solution, but really felt the need to correct you here. I have quite some experience with ownership transfer mechanism, as this is how GStreamer framework works since 2000. Concurrent access is a really common use cases and it is quite well defined in that context. The bracketing system (in this case called map() unmap(), with flag stating the usage intention like reads and write) is combined the the refcount. The basic rules are simple:

This is all CPU oriented, I think Christian is talking about the case where ownership transfer happens without CPU involvement, such as via GPU waiting on a fence

HW fences and proper ownership isn't incompatible at all. Even if you have no software involved during the usage, software still need to share the dmabuf (at least once), and sharing modify the ownership, and can be made explicit.

p.s. I will agree if someone raises that this is totally off topic

Nicolas

...

BR, -R

Daniel Vetter

22 Nov 22 Nov

2:36 p.m.

On Fri, Nov 18, 2022 at 11:32:19AM -0800, Rob Clark wrote:

...

On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne nicolas@ndufresne.ca wrote:

...
Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :

...
...
...
DMA-Buf let's the exporter setup the DMA addresses the importer uses to be able to directly decided where a certain operation should go. E.g. we have cases where for example a P2P write doesn't even go to memory, but rather a doorbell BAR to trigger another operation. Throwing in CPU round trips for explicit ownership transfer completely breaks that concept.

It sounds like we should have a dma_dev_is_coherent_with_dev() which accepts two (or an array?) of devices and tells the caller whether the devices need explicit ownership transfer.

No, exactly that's the concept I'm pushing back on very hard here.

In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

I'm not pushing for this solution, but really felt the need to correct you here. I have quite some experience with ownership transfer mechanism, as this is how GStreamer framework works since 2000. Concurrent access is a really common use cases and it is quite well defined in that context. The bracketing system (in this case called map() unmap(), with flag stating the usage intention like reads and write) is combined the the refcount. The basic rules are simple:

This is all CPU oriented, I think Christian is talking about the case where ownership transfer happens without CPU involvement, such as via GPU waiting on a fence

Yeah for pure device2device handover the rule pretty much has to be that any coherency management that needs to be done must be done from the device side (flushing device side caches and stuff like that) only. But under the assumption that _all_ cpu side management has been done already before the first device access started.

And then the map/unmap respectively begin/end_cpu_access can be used what it was meant for with cpu side invalidation/flushing and stuff like that, while having pretty clear handover/ownership rules and hopefully not doing no unecessary flushes. And all that while allowing device acces to be pipelined. Worst case the exporter has to insert some pipelined cache flushes as a dma_fence pipelined work of its own between the device access when moving from one device to the other. That last part sucks a bit right now, because we don't have any dma_buf_attachment method which does this syncing without recreating the mapping, but in reality this is solved by caching mappings in the exporter (well dma-buf layer) nowadays.

True concurrent access like vk/compute expects is still a model that dma-buf needs to support on top, but that's a special case and pretty much needs hw that supports such concurrent access without explicit handover and fencing.

Aside from some historical accidents and still a few warts, I do think dma-buf does support both of these models. Of course in the case of gpu/drm drivers, userspace must know whats possible and act accordingly, otherwise you just get to keep all the pieces. -Daniel

-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch

Christian König

5:33 p.m.

Am 22.11.22 um 15:36 schrieb Daniel Vetter:

...

On Fri, Nov 18, 2022 at 11:32:19AM -0800, Rob Clark wrote:

...
On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne nicolas@ndufresne.ca wrote:

...
Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :

...
...
...
DMA-Buf let's the exporter setup the DMA addresses the importer uses to be able to directly decided where a certain operation should go. E.g. we have cases where for example a P2P write doesn't even go to memory, but rather a doorbell BAR to trigger another operation. Throwing in CPU round trips for explicit ownership transfer completely breaks that concept.

It sounds like we should have a dma_dev_is_coherent_with_dev() which accepts two (or an array?) of devices and tells the caller whether the devices need explicit ownership transfer.

No, exactly that's the concept I'm pushing back on very hard here.

In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

I'm not pushing for this solution, but really felt the need to correct you here. I have quite some experience with ownership transfer mechanism, as this is how GStreamer framework works since 2000. Concurrent access is a really common use cases and it is quite well defined in that context. The bracketing system (in this case called map() unmap(), with flag stating the usage intention like reads and write) is combined the the refcount. The basic rules are simple:

This is all CPU oriented, I think Christian is talking about the case where ownership transfer happens without CPU involvement, such as via GPU waiting on a fence

Yeah for pure device2device handover the rule pretty much has to be that any coherency management that needs to be done must be done from the device side (flushing device side caches and stuff like that) only. But under the assumption that _all_ cpu side management has been done already before the first device access started.

And then the map/unmap respectively begin/end_cpu_access can be used what it was meant for with cpu side invalidation/flushing and stuff like that, while having pretty clear handover/ownership rules and hopefully not doing no unecessary flushes. And all that while allowing device acces to be pipelined. Worst case the exporter has to insert some pipelined cache flushes as a dma_fence pipelined work of its own between the device access when moving from one device to the other. That last part sucks a bit right now, because we don't have any dma_buf_attachment method which does this syncing without recreating the mapping, but in reality this is solved by caching mappings in the exporter (well dma-buf layer) nowadays.

True concurrent access like vk/compute expects is still a model that dma-buf needs to support on top, but that's a special case and pretty much needs hw that supports such concurrent access without explicit handover and fencing.

Aside from some historical accidents and still a few warts, I do think dma-buf does support both of these models.

We should have come up with dma-heaps earlier and make it clear that exporting a DMA-buf from a device gives you something device specific which might or might not work with others.

Apart from that I agree, DMA-buf should be capable of handling this. Question left is what documentation is missing to make it clear how things are supposed to work?

Regards, Christian.

...

Of course in the case of gpu/drm drivers, userspace must know whats possible and act accordingly, otherwise you just get to keep all the pieces. -Daniel

Daniel Vetter

6:26 p.m.

On Tue, 22 Nov 2022 at 18:34, Christian König christian.koenig@amd.com wrote:

...

Am 22.11.22 um 15:36 schrieb Daniel Vetter:

...
On Fri, Nov 18, 2022 at 11:32:19AM -0800, Rob Clark wrote:

...
On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne nicolas@ndufresne.ca wrote:

...
Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :

...
...
> DMA-Buf let's the exporter setup the DMA addresses the importer uses to > be able to directly decided where a certain operation should go. E.g. we > have cases where for example a P2P write doesn't even go to memory, but > rather a doorbell BAR to trigger another operation. Throwing in CPU > round trips for explicit ownership transfer completely breaks that > concept. It sounds like we should have a dma_dev_is_coherent_with_dev() which accepts two (or an array?) of devices and tells the caller whether the devices need explicit ownership transfer.

No, exactly that's the concept I'm pushing back on very hard here.

In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

I'm not pushing for this solution, but really felt the need to correct you here. I have quite some experience with ownership transfer mechanism, as this is how GStreamer framework works since 2000. Concurrent access is a really common use cases and it is quite well defined in that context. The bracketing system (in this case called map() unmap(), with flag stating the usage intention like reads and write) is combined the the refcount. The basic rules are simple:

This is all CPU oriented, I think Christian is talking about the case where ownership transfer happens without CPU involvement, such as via GPU waiting on a fence

Yeah for pure device2device handover the rule pretty much has to be that any coherency management that needs to be done must be done from the device side (flushing device side caches and stuff like that) only. But under the assumption that _all_ cpu side management has been done already before the first device access started.

And then the map/unmap respectively begin/end_cpu_access can be used what it was meant for with cpu side invalidation/flushing and stuff like that, while having pretty clear handover/ownership rules and hopefully not doing no unecessary flushes. And all that while allowing device acces to be pipelined. Worst case the exporter has to insert some pipelined cache flushes as a dma_fence pipelined work of its own between the device access when moving from one device to the other. That last part sucks a bit right now, because we don't have any dma_buf_attachment method which does this syncing without recreating the mapping, but in reality this is solved by caching mappings in the exporter (well dma-buf layer) nowadays.

True concurrent access like vk/compute expects is still a model that dma-buf needs to support on top, but that's a special case and pretty much needs hw that supports such concurrent access without explicit handover and fencing.

Aside from some historical accidents and still a few warts, I do think dma-buf does support both of these models.

We should have come up with dma-heaps earlier and make it clear that exporting a DMA-buf from a device gives you something device specific which might or might not work with others.

Yeah, but engineering practicalities were pretty clear that no one would rewrite the entire Xorg stack and all the drivers just to make that happen for prime.

...

Apart from that I agree, DMA-buf should be capable of handling this. Question left is what documentation is missing to make it clear how things are supposed to work?

Given the historical baggage of existing use-case, I think the only way out is that we look at concrete examples from real world users that break, and figure out how to fix them. Without breaking any of the existing mess.

One idea might be that we have a per-platform dma_buf_legacy_coherency_mode(), which tells you what mode (cpu cache snooping or uncached memory) you need to use to make sure that all devices agree. On x86 the rule might be that it's cpu cache snooping by default, but if you have an integrated gpu then everyone needs to use uncached. That sucks, but at least we could keep the existing mess going and clean it up. Everyone else would be uncached, except maybe arm64 servers with pcie connectors. Essentially least common denominator to make this work. Note that uncached actually means device access doesn't snoop, the cpu side you can handle with either uc/wc mappings or explicit flushing.

Then once we have that we could implement the coherency negotiation protocol on top as an explicit opt-in, so that you can still use coherent buffers across two pcie gpus even if you also have an integrated gpu.

Doing only the new protocol without some means to keep the existing pile of carefully hacked up assumptions would break things, and we can't do that. Also I have no idea whether that global legacy device coherency mode would work. Also we might more than just snooped/unsnoop, since depending upon architecture you might want to only snoop one transaction (reads vs writes) instead of both of them: If writes snoop then cpu reads would never need to invalidate caches beforehand, but writes would still need to flush (and would give you faster reads on the device side since those can still bypass snooping). Some igpu platforms work like that, but I'm not sure whether there's any other device that would care enough about these for this to matter. Yes it's a hw mis-design (well I don't like it personally), they fixed it :-)

Cheers, Daniel

-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch

Pekka Paalanen

23 Nov 23 Nov

8:33 a.m.

On Tue, 22 Nov 2022 18:33:59 +0100 Christian König christian.koenig@amd.com wrote:

...

We should have come up with dma-heaps earlier and make it clear that exporting a DMA-buf from a device gives you something device specific which might or might not work with others.

Apart from that I agree, DMA-buf should be capable of handling this. Question left is what documentation is missing to make it clear how things are supposed to work?

Perhaps somewhat related from Daniel Stone that seems to have been forgotten: https://lore.kernel.org/dri-devel/20210905122742.86029-1-daniels@collabora.c...

It aimed mostly at userspace, but sounds to me like the coherency stuff could use a section of its own there?

Thanks, pq

Daniel Vetter

4:33 p.m.

On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:

...

On Tue, 22 Nov 2022 18:33:59 +0100 Christian König christian.koenig@amd.com wrote:

...
We should have come up with dma-heaps earlier and make it clear that exporting a DMA-buf from a device gives you something device specific which might or might not work with others.

Apart from that I agree, DMA-buf should be capable of handling this. Question left is what documentation is missing to make it clear how things are supposed to work?

Perhaps somewhat related from Daniel Stone that seems to have been forgotten: https://lore.kernel.org/dri-devel/20210905122742.86029-1-daniels@collabora.c...

It aimed mostly at userspace, but sounds to me like the coherency stuff could use a section of its own there?

Hm yeah it would be great to land that and then eventually extend. Daniel? -Daniel

-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch

Nicolas Dufresne

25 Nov 25 Nov

4:40 p.m.

Le mercredi 23 novembre 2022 à 17:33 +0100, Daniel Vetter a écrit :

...

On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:

...
On Tue, 22 Nov 2022 18:33:59 +0100 Christian König christian.koenig@amd.com wrote:

...
We should have come up with dma-heaps earlier and make it clear that exporting a DMA-buf from a device gives you something device specific which might or might not work with others.

Apart from that I agree, DMA-buf should be capable of handling this. Question left is what documentation is missing to make it clear how things are supposed to work?

Perhaps somewhat related from Daniel Stone that seems to have been forgotten: https://lore.kernel.org/dri-devel/20210905122742.86029-1-daniels@collabora.c...

It aimed mostly at userspace, but sounds to me like the coherency stuff could use a section of its own there?

Hm yeah it would be great to land that and then eventually extend. Daniel?

There is a lot of things documented in this document that have been said to be completely wrong user-space behaviour in this thread. But it seems to pre-date the DMA Heaps. The document also assume that DMA Heaps completely solves the CMA vs system memory issue. But it also underline a very important aspect, that userland is not aware which one to use. What this document suggest though seems more realist then what has been said here.

Its overall a great document, it unfortunate that it only makes it into the DRM mailing list.

Nicolas

Daniel Vetter

30 Nov 30 Nov

10:30 a.m.

On Fri, Nov 25, 2022 at 11:40:04AM -0500, Nicolas Dufresne wrote:

...

Le mercredi 23 novembre 2022 à 17:33 +0100, Daniel Vetter a écrit :

...
On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:

...
On Tue, 22 Nov 2022 18:33:59 +0100 Christian König christian.koenig@amd.com wrote:

...
We should have come up with dma-heaps earlier and make it clear that exporting a DMA-buf from a device gives you something device specific which might or might not work with others.

Apart from that I agree, DMA-buf should be capable of handling this. Question left is what documentation is missing to make it clear how things are supposed to work?

Perhaps somewhat related from Daniel Stone that seems to have been forgotten: https://lore.kernel.org/dri-devel/20210905122742.86029-1-daniels@collabora.c...

It aimed mostly at userspace, but sounds to me like the coherency stuff could use a section of its own there?

Hm yeah it would be great to land that and then eventually extend. Daniel?

There is a lot of things documented in this document that have been said to be completely wrong user-space behaviour in this thread. But it seems to pre-date the DMA Heaps. The document also assume that DMA Heaps completely solves the CMA vs system memory issue. But it also underline a very important aspect, that userland is not aware which one to use. What this document suggest though seems more realist then what has been said here.

Its overall a great document, it unfortunate that it only makes it into the DRM mailing list.

The doc is more about document the current status quo/best practices, which is very much not using dma-heaps.

The issue there is that currently userspace has no idea which dma-heap to use for shared buffers, plus not all allocators are exposed through heaps to begin with. We had this noted as a todo item (add some device->heap sysfs links was the idea), until that's done all you can do is hardcode the right heaps for the right usage in userspace, which is what android does. Plus android doesn't have dgpu, so doesn't need the missing ttm heap.

But yeah the long-term aspiration also hasn't changed, because the dma-heap todo list is also very, very old by now :-/ -Daniel

-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch

Christian König

2 Dec 2 Dec

3:27 p.m.

Am 30.11.22 um 11:30 schrieb Daniel Vetter:

...

On Fri, Nov 25, 2022 at 11:40:04AM -0500, Nicolas Dufresne wrote:

...
Le mercredi 23 novembre 2022 à 17:33 +0100, Daniel Vetter a écrit :

...
On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:

...
On Tue, 22 Nov 2022 18:33:59 +0100 Christian König christian.koenig@amd.com wrote:

...
We should have come up with dma-heaps earlier and make it clear that exporting a DMA-buf from a device gives you something device specific which might or might not work with others.

Apart from that I agree, DMA-buf should be capable of handling this. Question left is what documentation is missing to make it clear how things are supposed to work?

Perhaps somewhat related from Daniel Stone that seems to have been forgotten: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne...

It aimed mostly at userspace, but sounds to me like the coherency stuff could use a section of its own there?

Hm yeah it would be great to land that and then eventually extend. Daniel?

There is a lot of things documented in this document that have been said to be completely wrong user-space behaviour in this thread. But it seems to pre-date the DMA Heaps. The document also assume that DMA Heaps completely solves the CMA vs system memory issue. But it also underline a very important aspect, that userland is not aware which one to use. What this document suggest though seems more realist then what has been said here.

Its overall a great document, it unfortunate that it only makes it into the DRM mailing list.

The doc is more about document the current status quo/best practices, which is very much not using dma-heaps.

The issue there is that currently userspace has no idea which dma-heap to use for shared buffers, plus not all allocators are exposed through heaps to begin with. We had this noted as a todo item (add some device->heap sysfs links was the idea), until that's done all you can do is hardcode the right heaps for the right usage in userspace, which is what android does. Plus android doesn't have dgpu, so doesn't need the missing ttm heap.

Hui? Why do you think we should have a TTM heap in the first place?

As far as I can see we need three different ways of allocation: 1. Normal system memory. 2. CMA based. 3. Device specific.

When any of the participating devices needs CMA then user space must use the CMA heap, when any of the participating devices needs device specific memory then user space must use device specific memory (from that device).

It still can be that two devices can't talk with each other because both needs the buffer to be allocated from device specific memory for a particular use case, but at least all the cases for both none CPU cache coherent ARM devices as well as device specific memory for scanout should be handled with this.

Regards, Christian.

...

But yeah the long-term aspiration also hasn't changed, because the dma-heap todo list is also very, very old by now :-/ -Daniel

Daniel Vetter

5 Jan 5 Jan

11:46 a.m.

On Fri, Dec 02, 2022 at 04:27:08PM +0100, Christian König wrote:

...

Am 30.11.22 um 11:30 schrieb Daniel Vetter:

...
On Fri, Nov 25, 2022 at 11:40:04AM -0500, Nicolas Dufresne wrote:

...
Le mercredi 23 novembre 2022 à 17:33 +0100, Daniel Vetter a écrit :

...
On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:

...
On Tue, 22 Nov 2022 18:33:59 +0100 Christian König christian.koenig@amd.com wrote:

...
We should have come up with dma-heaps earlier and make it clear that exporting a DMA-buf from a device gives you something device specific which might or might not work with others.

Apart from that I agree, DMA-buf should be capable of handling this. Question left is what documentation is missing to make it clear how things are supposed to work?

Perhaps somewhat related from Daniel Stone that seems to have been forgotten: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne...

It aimed mostly at userspace, but sounds to me like the coherency stuff could use a section of its own there?

Hm yeah it would be great to land that and then eventually extend. Daniel?

There is a lot of things documented in this document that have been said to be completely wrong user-space behaviour in this thread. But it seems to pre-date the DMA Heaps. The document also assume that DMA Heaps completely solves the CMA vs system memory issue. But it also underline a very important aspect, that userland is not aware which one to use. What this document suggest though seems more realist then what has been said here.

Its overall a great document, it unfortunate that it only makes it into the DRM mailing list.

The doc is more about document the current status quo/best practices, which is very much not using dma-heaps.

The issue there is that currently userspace has no idea which dma-heap to use for shared buffers, plus not all allocators are exposed through heaps to begin with. We had this noted as a todo item (add some device->heap sysfs links was the idea), until that's done all you can do is hardcode the right heaps for the right usage in userspace, which is what android does. Plus android doesn't have dgpu, so doesn't need the missing ttm heap.

Hui? Why do you think we should have a TTM heap in the first place?

[sorry for the really late reply]

Extremely late reply, but p2p buffer sharing when you want it in vram is only something ttm can provide you. And if the goal is to have all buffer sharing needs served by dma-heaps, then eventually that'll be needed too.

But probably totally fine to just not have it for another few years (decades?).

...

As far as I can see we need three different ways of allocation:

Normal system memory.

CMA based.

Device specific.

When any of the participating devices needs CMA then user space must use the CMA heap, when any of the participating devices needs device specific memory then user space must use device specific memory (from that device).

It still can be that two devices can't talk with each other because both needs the buffer to be allocated from device specific memory for a particular use case, but at least all the cases for both none CPU cache coherent ARM devices as well as device specific memory for scanout should be handled with this.

Yeah on ARM this should be enough. -Daniel

...

Regards, Christian.

...
But yeah the long-term aspiration also hasn't changed, because the dma-heap todo list is also very, very old by now :-/ -Daniel

-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch

Tomasz Figa

5 Dec 5 Dec

6:41 a.m.

Hi Christian,

On Thu, Nov 17, 2022 at 9:11 PM Christian König christian.koenig@amd.com wrote:

...

Hi Tomasz,

Am 17.11.22 um 10:35 schrieb Tomasz Figa:

...
Hi Christian and everyone,

On Thu, Nov 3, 2022 at 4:14 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:

...
Am 02.11.22 um 18:10 schrieb Lucas Stach:

...
Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König: [SNIP]

...
It would just be doing this for the importer and exactly that would be bad design because we then have handling for the display driver outside of the driver.

The driver would have to do those cache maintenance operations if it directly worked with a non-coherent device. Doing it for the importer is just doing it for another device, not the one directly managed by the exporter.

I really don't see the difference to the other dma-buf ops: in dma_buf_map_attachment the exporter maps the dma-buf on behalf and into the address space of the importer. Why would cache maintenance be any different?

The issue here is the explicit ownership transfer.

We intentionally decided against that because it breaks tons of use cases and is at least by me and a couple of others seen as generally design failure of the Linux DMA-API.

First of all, thanks for starting the discussion and sorry for being late to the party. May I ask you to keep me on CC for any changes that touch the V4L2 videobuf2 framework, as a maintainer of it? I'm okay being copied on the entire series, no need to pick the specific patches. Thanks in advance.

Sorry for that, I've only added people involved in the previous discussion. Going to keep you in the loop.

No worries. Thanks.

Sorry, for being late with the reply, had a bit of vacation and then some catching up last week.

...

...
I agree that we have some design issues in the current DMA-buf framework, but I'd try to approach it a bit differently. Instead of focusing on the issues in the current design, could we write down our requirements and try to come up with how a correct design would look like? (A lot of that has been already mentioned in this thread, but I find it quite difficult to follow and it might not be a complete view either.)

Well, exactly that's what we disagree on.

As far as I can see the current design of DMA-buf is perfectly capable of handling all the requirements.

A brief summary of the requirements with some implementation notes:

Device drivers export their memory as it is. E.g. no special handling

for importers on the exporter side. If an importer can't deal with a specific format, layout, caching etc... of the data the correct approach is to reject the attachment. Those parameters are controlled by userspace and negotiating them is explicitly not part of the framework.

Ack. I believe it matches the current implementation of the DMA-buf framework, although as you mentioned, the swiotlb feature of the DMA mapping framework kind of violates this.

...

Accesses of the CPU to a buffer are bracket int begin_cpu_access()

and end_cpu_access() calls. Here we can insert the CPU cache invalidation/flushes as necessary.

Ack. I think a part of the problem today is that there exist userspace and kernel code instances that don't insert them and assume that some magic keeps the cache clean...

...

Accesses of the device HW to a buffer are represented with dma_fence

objects. It's explicitly allowed to have multiple devices access the buffer at the same time.

Ack. Again there exists kernel code that doesn't honor or provide DMA fences (e.g. V4L2).

...

Access to the DMA-buf by the HW of an importer is setup by the exporter. In other words the exporter provides a bunch of DMA addresses the

importer should access. The importer should not try to come up with those on its own.

...
That said, let me address a few aspects already mentioned, to make sure that everyone is on the same page.

...
DMA-Buf let's the exporter setup the DMA addresses the importer uses to be able to directly decided where a certain operation should go. E.g. we have cases where for example a P2P write doesn't even go to memory, but rather a doorbell BAR to trigger another operation. Throwing in CPU round trips for explicit ownership transfer completely breaks that concept.

It sounds like we should have a dma_dev_is_coherent_with_dev() which accepts two (or an array?) of devices and tells the caller whether the devices need explicit ownership transfer.

No, exactly that's the concept I'm pushing back on very hard here.

In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly possible to implement this using the dma_fence objects mentioned above. E.g. drivers can already look at who is accessing a buffer currently and can even grab explicit ownership of it by adding their own dma_fence objects.

The only exception is CPU based access, e.g. when something is written with the CPU a cache flush might be necessary and when something is read with the CPU a cache invalidation might be necessary.

Okay, that's much clearer now, thanks for clarifying this. So we should be covered for the cache maintenance needs originating from CPU accesses already, +/- the broken cases which don't call the begin/end CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines, we should be covered either by the userspace explicitly flushing the hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing begin/end CPU access calls.)

...

...
...
So if a device driver uses cached system memory on an architecture which devices which can't access it the right approach is clearly to reject the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?

...

We do have the system and CMA dma-buf heap for cases where a device driver which wants to access the buffer with caches enabled. So this is not a limitation in functionality, it's just a matter of correctly using it.

V4L2 also has the ability to allocate buffers and map them with caches enabled.

...

...
...
What we can do is to reverse the role of the exporter and importer and let the device which needs uncached memory take control. This way this device can insert operations as needed, e.g. flush read caches or invalidate write caches.

(Putting aside the cases when the access is really impossible at all.) Correct me if I'm wrong, but isn't that because we don't have a proper hook for the importer to tell the DMA-buf framework to prepare the buffer for its access?

No, we do have the explicit begin_cpu_access() and end_cpu_access() calls which are even exporter to userspace through IOCTLs for this.

Ack.

...

The problem is that in this particular case the exporter provides the buffer as is, e.g. with dirty CPU caches. And the importer can't deal with that.

Why does the exporter leave the buffer with dirty CPU caches?

...

Either reverting the roles of the importer or exporter or switching over to the common DMA-heaps implementation for the buffer would solve that.

...
...
It's the job of the higher level to prepare the buffer a device work with, not the one of the lower level.

What are higher and lower levels here?

The MM as higher level and the DMA mapping framework as lower level.

Hmm, I generally consider the DMA mapping framework a part of MM, especially its allocation capabilities. It heavily relies on low level MM primitives internally and exposes another set of low level primitives that is more useful for device drivers. At least it should be seen this way, but I agree that it currently includes things that shouldn't necessarily be there, like the transparent bouncing.

...

...
As per the existing design of the DMA mapping framework, the framework handles the system DMA architecture details and DMA master drivers take care of invoking the right DMA mapping operations around the DMA accesses. This makes sense to me, as DMA master drivers have no idea about the specific SoCs or buses they're plugged into, while the DMA mapping framework has no idea when the DMA accesses are taking place.

Yeah and exactly that's a bit problematic. In an ideal world device drivers wouldn't see memory they can't access in the first place.

For example what we currently do is the following:

get_user_pages() grabs a reference to the pages we want to DMA to/from.

DMA mapping framework is called by the driver to create an sg-table.

The DMA mapping framework sees that this won't work and inserts

bounce buffers. 4. The driver does the DMA to the bounce buffers instead. 5. The DMA mapping framework copies the data to the real location.

This is highly problematic since it removes the control of what happens here. E.g. drivers can't prevent using bounce buffers when they need coherent access for DMA-buf for example.

What we should have instead is that bounce buffers are applied at a higher level, for example by get_user_pages() or more general in the MM.

I tend to agree with you on this, but I see a lot of challenges that would need to be solved if we want to make it otherwise. The whole reason for transparent bouncing came from the fact that many existing subsystems didn't really care about the needs of the underlying hardware, e.g. block or network subsystems would just pass random pages to the drivers. I think the idea of DMA mapping handling this internally was to avoid implementing the bouncing here and there in each block or network driver that needs it. (Arguably, an optional helper could be provided instead for use in those subsystems...)

...

...
...
In other words in a proper design the higher level would prepare the memory in a way the device driver can work with it, not the other way around.

When a device driver gets memory it can't work with the correct response is to throw an error and bubble that up into a layer where it can be handled gracefully.

For example instead of using bounce buffers under the hood the DMA layer the MM should make sure that when you call read() with O_DIRECT that the pages in question are accessible by the device.

I tend to agree with you if it's about a costly software "emulation" like bounce buffers, but cache maintenance is a hardware feature existing there by default and it's often much cheaper to operate on cached memory and synchronize the caches rather than have everything in uncached (or write-combined) memory.

Well that we should have proper cache maintenance is really not the question. The question is where to do that?

E.g. in the DMA-mapping framework which as far as I can see should only take care of translating addresses.

The DMA mapping framework is actually a DMA allocation and mapping framework. Generic memory allocation primitives (like alloc_pages(), kmalloc()) shouldn't be used for buffers that are expected to be passed to DMA engines - there exist DMA-aware replacements, like dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(), etc.

...

Or the higher level (get_user_pages() is just one example of that) which says ok device X wants to access memory Y I need to make sure that caches are flushed/invalidated.

Okay, so here comes the userptr case, which I really feel like is just doomed at the very start, because it does nothing to accommodate hardware needs at allocation time and then just expects some magic to happen to make it possible for the hardware to make use of the buffer.

That said, it should be still possible to handle that pretty well for hardware that allows it, i.e. without much memory access constraints. What would be still missing if we just use the existing gup() + dma_map() + dma_sync() sequence?

...

...
...
...
It's a use-case that is working fine today with many devices (e.g. network adapters) in the ARM world, exactly because the architecture specific implementation of the DMA API inserts the cache maintenance operations on buffer ownership transfer.

Yeah, I'm perfectly aware of that. The problem is that exactly that design totally breaks GPUs on Xen DOM0 for example.

And Xen is just one example, I can certainly say from experience that this design was a really really bad idea because it favors just one use case while making other use cases practically impossible if not really hard to implement.

Sorry, I haven't worked with Xen. Could you elaborate what's the problem that this introduces for it?

That's a bit longer topic. The AMD XEN devs are already working on this as far as I know. I can ping internally how far they got with sending the patches out to avoid this problem.

Hmm, I see. It might be a good data point to understand in which direction we should be going, but I guess we can wait until they send some patches.

Best regards, Tomasz

Christian König

8:28 a.m.

Hi Tomasz,

Am 05.12.22 um 07:41 schrieb Tomasz Figa:

...

[SNIP]

...
In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly possible to implement this using the dma_fence objects mentioned above. E.g. drivers can already look at who is accessing a buffer currently and can even grab explicit ownership of it by adding their own dma_fence objects.

The only exception is CPU based access, e.g. when something is written with the CPU a cache flush might be necessary and when something is read with the CPU a cache invalidation might be necessary.

Okay, that's much clearer now, thanks for clarifying this. So we should be covered for the cache maintenance needs originating from CPU accesses already, +/- the broken cases which don't call the begin/end CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines, we should be covered either by the userspace explicitly flushing the hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing begin/end CPU access calls.)

Well there are multiple problems here:

1. A lot of userspace applications/frameworks assume that it can allocate the buffer anywhere and it just works.

This isn't true at all, we have tons of cases where device can only access their special memory for certain use cases. Just look at scanout for displaying on dGPU, neither AMD nor NVidia supports system memory here. Similar cases exists for audio/video codecs where intermediate memory is only accessible by certain devices because of content protection.

2. We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only guess if normal, CMA or even device specific memory is needed.

3. We seem to lack some essential parts of those restrictions in the documentation.

...

...
...
...
So if a device driver uses cached system memory on an architecture which devices which can't access it the right approach is clearly to reject the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one device driver wrote to it's internal buffer with the CPU (so some cache lines where dirty) and then a device which couldn't deal with that tried to access it.

We could say that all device drivers must always look at the coherency of the devices which want to access their buffers. But that would horrible complicate things for maintaining the drivers because then drivers would need to take into account requirements from other drivers while allocating their internal buffers.

That's essentially the reason why we have DMA-buf heaps. Those heaps expose system memory buffers with certain properties (size, CMA, DMA bit restrictions etc...) and also implement the callback functions for CPU cache maintenance.

...

...
We do have the system and CMA dma-buf heap for cases where a device driver which wants to access the buffer with caches enabled. So this is not a limitation in functionality, it's just a matter of correctly using it.

V4L2 also has the ability to allocate buffers and map them with caches enabled.

Yeah, when that's a requirement for the V4L2 device it also makes perfect sense.

It's just that we shouldn't force any specific allocation behavior on a device driver because of the requirements of a different device.

Giving an example a V4L2 device shouldn't be forced to use videobuf2-dma-contig because some other device needs CMA. Instead we should use the common DMA-buf heaps which implement this as neutral supplier of system memory buffers.

...

...
The problem is that in this particular case the exporter provides the buffer as is, e.g. with dirty CPU caches. And the importer can't deal with that.

Why does the exporter leave the buffer with dirty CPU caches?

Because exporters always export the buffers as they would use it. And in this case that means writing with the CPU to it.

...

...
Either reverting the roles of the importer or exporter or switching over to the common DMA-heaps implementation for the buffer would solve that.

...
...
It's the job of the higher level to prepare the buffer a device work with, not the one of the lower level.

What are higher and lower levels here?

The MM as higher level and the DMA mapping framework as lower level.

Hmm, I generally consider the DMA mapping framework a part of MM, especially its allocation capabilities. It heavily relies on low level MM primitives internally and exposes another set of low level primitives that is more useful for device drivers. At least it should be seen this way, but I agree that it currently includes things that shouldn't necessarily be there, like the transparent bouncing.

Exactly that, yes! Good to hear that more people think that way.

Christoph made some comments which sounded like he agreed to some of the points as well, but nobody ever said it so clearly.

...

...
...
As per the existing design of the DMA mapping framework, the framework handles the system DMA architecture details and DMA master drivers take care of invoking the right DMA mapping operations around the DMA accesses. This makes sense to me, as DMA master drivers have no idea about the specific SoCs or buses they're plugged into, while the DMA mapping framework has no idea when the DMA accesses are taking place.

Yeah and exactly that's a bit problematic. In an ideal world device drivers wouldn't see memory they can't access in the first place.

For example what we currently do is the following:

get_user_pages() grabs a reference to the pages we want to DMA to/from.

DMA mapping framework is called by the driver to create an sg-table.

The DMA mapping framework sees that this won't work and inserts

bounce buffers. 4. The driver does the DMA to the bounce buffers instead. 5. The DMA mapping framework copies the data to the real location.

This is highly problematic since it removes the control of what happens here. E.g. drivers can't prevent using bounce buffers when they need coherent access for DMA-buf for example.

What we should have instead is that bounce buffers are applied at a higher level, for example by get_user_pages() or more general in the MM.

I tend to agree with you on this, but I see a lot of challenges that would need to be solved if we want to make it otherwise. The whole reason for transparent bouncing came from the fact that many existing subsystems didn't really care about the needs of the underlying hardware, e.g. block or network subsystems would just pass random pages to the drivers. I think the idea of DMA mapping handling this internally was to avoid implementing the bouncing here and there in each block or network driver that needs it. (Arguably, an optional helper could be provided instead for use in those subsystems...)

Yes, totally agree. The problem is really that we moved bunch of MM and DMA functions in one API.

The bounce buffers are something we should really have in a separate component.

Then the functionality of allocating system memory for a specific device or devices should be something provided by the MM.

And finally making this memory or any other CPU address accessible to a device (IOMMU programming etc..) should then be part of an DMA API.

...

...
...
...
In other words in a proper design the higher level would prepare the memory in a way the device driver can work with it, not the other way around.

When a device driver gets memory it can't work with the correct response is to throw an error and bubble that up into a layer where it can be handled gracefully.

For example instead of using bounce buffers under the hood the DMA layer the MM should make sure that when you call read() with O_DIRECT that the pages in question are accessible by the device.

I tend to agree with you if it's about a costly software "emulation" like bounce buffers, but cache maintenance is a hardware feature existing there by default and it's often much cheaper to operate on cached memory and synchronize the caches rather than have everything in uncached (or write-combined) memory.

Well that we should have proper cache maintenance is really not the question. The question is where to do that?

E.g. in the DMA-mapping framework which as far as I can see should only take care of translating addresses.

The DMA mapping framework is actually a DMA allocation and mapping framework. Generic memory allocation primitives (like alloc_pages(), kmalloc()) shouldn't be used for buffers that are expected to be passed to DMA engines - there exist DMA-aware replacements, like dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(), etc.

...
Or the higher level (get_user_pages() is just one example of that) which says ok device X wants to access memory Y I need to make sure that caches are flushed/invalidated.

Okay, so here comes the userptr case, which I really feel like is just doomed at the very start, because it does nothing to accommodate hardware needs at allocation time and then just expects some magic to happen to make it possible for the hardware to make use of the buffer.

That said, it should be still possible to handle that pretty well for hardware that allows it, i.e. without much memory access constraints. What would be still missing if we just use the existing gup() + dma_map() + dma_sync() sequence?

Error or rather fallback handling and *not* transparently inserting bounce buffers.

The whole implicit bounce buffers concept falls apart as soon as you don't have a stream access any more.

...

...
...
...
...
It's a use-case that is working fine today with many devices (e.g. network adapters) in the ARM world, exactly because the architecture specific implementation of the DMA API inserts the cache maintenance operations on buffer ownership transfer.

Yeah, I'm perfectly aware of that. The problem is that exactly that design totally breaks GPUs on Xen DOM0 for example.

And Xen is just one example, I can certainly say from experience that this design was a really really bad idea because it favors just one use case while making other use cases practically impossible if not really hard to implement.

Sorry, I haven't worked with Xen. Could you elaborate what's the problem that this introduces for it?

That's a bit longer topic. The AMD XEN devs are already working on this as far as I know. I can ping internally how far they got with sending the patches out to avoid this problem.

Hmm, I see. It might be a good data point to understand in which direction we should be going, but I guess we can wait until they send some patches.

There was just recently a longer thread on the amd-gfx mailing list about that. I think looking in there as well might be beneficial.

Regards, Christian.

...

Best regards, Tomasz

Nicolas Dufresne

6 Dec 6 Dec

6:26 p.m.

Le lundi 05 décembre 2022 à 09:28 +0100, Christian König a écrit :

...

Hi Tomasz,

Am 05.12.22 um 07:41 schrieb Tomasz Figa:

...
[SNIP]

...
In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly possible to implement this using the dma_fence objects mentioned above. E.g. drivers can already look at who is accessing a buffer currently and can even grab explicit ownership of it by adding their own dma_fence objects.

The only exception is CPU based access, e.g. when something is written with the CPU a cache flush might be necessary and when something is read with the CPU a cache invalidation might be necessary.

Okay, that's much clearer now, thanks for clarifying this. So we should be covered for the cache maintenance needs originating from CPU accesses already, +/- the broken cases which don't call the begin/end CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines, we should be covered either by the userspace explicitly flushing the hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing begin/end CPU access calls.)

Well there are multiple problems here:

A lot of userspace applications/frameworks assume that it can

allocate the buffer anywhere and it just works.

I know you have said that about 10 times, perhaps I'm about to believe it, but why do you think userspace assumes this ? Did you actually read code that does this (that isn't meant to run on controlled environment). And can you provide some example of broken generic userspace ? The DMABuf flow is meant to be trial and error. At least in GStreamer, yes, mostly only device allocation (when genericly usable) is implemented, but the code that has been contribute will try and fallback back like documented. Still fails sometimes, but that's exactly the kind of kernel bugs your patchset is trying to address. I don't blame anyone here, since why would folks on GStreamer/FFMPEG or any other "generic media framework" spend so much time implement "per linux device code", when non- embedded (constraint) linux is just handful of users (compare to Windows, Android, iOS users).

To me, this shouldn't be #1 issue. Perhaps it should simply be replaced by userspace not supporting DMABuf Heaps. Perhaps add that Linux distribution don't always enable (or allow normal users to access) heaps (though you point 2. gets in the way) ? Unlike virtual memory, I don't think there is very good accounting and reclaiming mechanism for that memory, hence opening these means any userspace could possibly impair the system functioning. If you can't e.g. limit their usage within containers, this is pretty difficult for generic linux to carry. This is a wider problem of course, which likely affect a lot of GPU usage too, but perhaps it should be in the lower priority part of the todo.

...

This isn't true at all, we have tons of cases where device can only access their special memory for certain use cases. Just look at scanout for displaying on dGPU, neither AMD nor NVidia supports system memory here. Similar cases exists for audio/video codecs where intermediate memory is only accessible by certain devices because of content protection.

nit: content protection is not CODEC specific, its a platform feature, its also not really a thing upstream yet from what I'm aware of. This needs unified design and documentation imho, but also enough standardisation so that a generic application can use it. Right now, content protection people have been complaining that V4L2 (and most generic userspace) don't work with their design, rather then trying to figure-out a design that works with existing API.

...

We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only guess if normal, CMA or even device specific memory is needed.

We seem to lack some essential parts of those restrictions in the

documentation.

Agreed (can't always disagree).

regards, Nicolas

...

...
...
...
...
So if a device driver uses cached system memory on an architecture which devices which can't access it the right approach is clearly to reject the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one device driver wrote to it's internal buffer with the CPU (so some cache lines where dirty) and then a device which couldn't deal with that tried to access it.

We could say that all device drivers must always look at the coherency of the devices which want to access their buffers. But that would horrible complicate things for maintaining the drivers because then drivers would need to take into account requirements from other drivers while allocating their internal buffers.

That's essentially the reason why we have DMA-buf heaps. Those heaps expose system memory buffers with certain properties (size, CMA, DMA bit restrictions etc...) and also implement the callback functions for CPU cache maintenance.

...
...
We do have the system and CMA dma-buf heap for cases where a device driver which wants to access the buffer with caches enabled. So this is not a limitation in functionality, it's just a matter of correctly using it.

V4L2 also has the ability to allocate buffers and map them with caches enabled.

Yeah, when that's a requirement for the V4L2 device it also makes perfect sense.

It's just that we shouldn't force any specific allocation behavior on a device driver because of the requirements of a different device.

Giving an example a V4L2 device shouldn't be forced to use videobuf2-dma-contig because some other device needs CMA. Instead we should use the common DMA-buf heaps which implement this as neutral supplier of system memory buffers.

...
...
The problem is that in this particular case the exporter provides the buffer as is, e.g. with dirty CPU caches. And the importer can't deal with that.

Why does the exporter leave the buffer with dirty CPU caches?

Because exporters always export the buffers as they would use it. And in this case that means writing with the CPU to it.

...
...
Either reverting the roles of the importer or exporter or switching over to the common DMA-heaps implementation for the buffer would solve that.

...
...
It's the job of the higher level to prepare the buffer a device work with, not the one of the lower level.

What are higher and lower levels here?

The MM as higher level and the DMA mapping framework as lower level.

Hmm, I generally consider the DMA mapping framework a part of MM, especially its allocation capabilities. It heavily relies on low level MM primitives internally and exposes another set of low level primitives that is more useful for device drivers. At least it should be seen this way, but I agree that it currently includes things that shouldn't necessarily be there, like the transparent bouncing.

Exactly that, yes! Good to hear that more people think that way.

Christoph made some comments which sounded like he agreed to some of the points as well, but nobody ever said it so clearly.

...
...
...
As per the existing design of the DMA mapping framework, the framework handles the system DMA architecture details and DMA master drivers take care of invoking the right DMA mapping operations around the DMA accesses. This makes sense to me, as DMA master drivers have no idea about the specific SoCs or buses they're plugged into, while the DMA mapping framework has no idea when the DMA accesses are taking place.

Yeah and exactly that's a bit problematic. In an ideal world device drivers wouldn't see memory they can't access in the first place.

For example what we currently do is the following:

get_user_pages() grabs a reference to the pages we want to DMA to/from.

DMA mapping framework is called by the driver to create an sg-table.

The DMA mapping framework sees that this won't work and inserts

bounce buffers. 4. The driver does the DMA to the bounce buffers instead. 5. The DMA mapping framework copies the data to the real location.

This is highly problematic since it removes the control of what happens here. E.g. drivers can't prevent using bounce buffers when they need coherent access for DMA-buf for example.

What we should have instead is that bounce buffers are applied at a higher level, for example by get_user_pages() or more general in the MM.

I tend to agree with you on this, but I see a lot of challenges that would need to be solved if we want to make it otherwise. The whole reason for transparent bouncing came from the fact that many existing subsystems didn't really care about the needs of the underlying hardware, e.g. block or network subsystems would just pass random pages to the drivers. I think the idea of DMA mapping handling this internally was to avoid implementing the bouncing here and there in each block or network driver that needs it. (Arguably, an optional helper could be provided instead for use in those subsystems...)

Yes, totally agree. The problem is really that we moved bunch of MM and DMA functions in one API.

The bounce buffers are something we should really have in a separate component.

Then the functionality of allocating system memory for a specific device or devices should be something provided by the MM.

And finally making this memory or any other CPU address accessible to a device (IOMMU programming etc..) should then be part of an DMA API.

...
...
...
...
In other words in a proper design the higher level would prepare the memory in a way the device driver can work with it, not the other way around.

When a device driver gets memory it can't work with the correct response is to throw an error and bubble that up into a layer where it can be handled gracefully.

For example instead of using bounce buffers under the hood the DMA layer the MM should make sure that when you call read() with O_DIRECT that the pages in question are accessible by the device.

I tend to agree with you if it's about a costly software "emulation" like bounce buffers, but cache maintenance is a hardware feature existing there by default and it's often much cheaper to operate on cached memory and synchronize the caches rather than have everything in uncached (or write-combined) memory.

Well that we should have proper cache maintenance is really not the question. The question is where to do that?

E.g. in the DMA-mapping framework which as far as I can see should only take care of translating addresses.

The DMA mapping framework is actually a DMA allocation and mapping framework. Generic memory allocation primitives (like alloc_pages(), kmalloc()) shouldn't be used for buffers that are expected to be passed to DMA engines - there exist DMA-aware replacements, like dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(), etc.

...
Or the higher level (get_user_pages() is just one example of that) which says ok device X wants to access memory Y I need to make sure that caches are flushed/invalidated.

Okay, so here comes the userptr case, which I really feel like is just doomed at the very start, because it does nothing to accommodate hardware needs at allocation time and then just expects some magic to happen to make it possible for the hardware to make use of the buffer.

That said, it should be still possible to handle that pretty well for hardware that allows it, i.e. without much memory access constraints. What would be still missing if we just use the existing gup() + dma_map() + dma_sync() sequence?

Error or rather fallback handling and *not* transparently inserting bounce buffers.

The whole implicit bounce buffers concept falls apart as soon as you don't have a stream access any more.

...
...
...
...
...
It's a use-case that is working fine today with many devices (e.g. network adapters) in the ARM world, exactly because the architecture specific implementation of the DMA API inserts the cache maintenance operations on buffer ownership transfer.

Yeah, I'm perfectly aware of that. The problem is that exactly that design totally breaks GPUs on Xen DOM0 for example.

And Xen is just one example, I can certainly say from experience that this design was a really really bad idea because it favors just one use case while making other use cases practically impossible if not really hard to implement.

Sorry, I haven't worked with Xen. Could you elaborate what's the problem that this introduces for it?

That's a bit longer topic. The AMD XEN devs are already working on this as far as I know. I can ping internally how far they got with sending the patches out to avoid this problem.

Hmm, I see. It might be a good data point to understand in which direction we should be going, but I guess we can wait until they send some patches.

There was just recently a longer thread on the amd-gfx mailing list about that. I think looking in there as well might be beneficial.

Regards, Christian.

...
Best regards, Tomasz

Christian König

6:37 p.m.

Am 06.12.22 um 19:26 schrieb Nicolas Dufresne:

...

Le lundi 05 décembre 2022 à 09:28 +0100, Christian König a écrit :

...
Hi Tomasz,

Am 05.12.22 um 07:41 schrieb Tomasz Figa:

...
[SNIP]

...
In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly possible to implement this using the dma_fence objects mentioned above. E.g. drivers can already look at who is accessing a buffer currently and can even grab explicit ownership of it by adding their own dma_fence objects.

The only exception is CPU based access, e.g. when something is written with the CPU a cache flush might be necessary and when something is read with the CPU a cache invalidation might be necessary.

Okay, that's much clearer now, thanks for clarifying this. So we should be covered for the cache maintenance needs originating from CPU accesses already, +/- the broken cases which don't call the begin/end CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines, we should be covered either by the userspace explicitly flushing the hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing begin/end CPU access calls.)

Well there are multiple problems here:

A lot of userspace applications/frameworks assume that it can

allocate the buffer anywhere and it just works.

I know you have said that about 10 times, perhaps I'm about to believe it, but why do you think userspace assumes this ? Did you actually read code that does this (that isn't meant to run on controlled environment).

Yes, absolutely.

Kodi for example used to do this and it was actually me who pointed out that this whole approach is flawed in the first place.

And we had tons of customer projects which ran into trouble with that. It became better in the last few years, but only after pushing back on this many many times.

Regards, Christian.

...

And can you provide some example of broken generic userspace ? The DMABuf flow is meant to be trial and error. At least in GStreamer, yes, mostly only device allocation (when genericly usable) is implemented, but the code that has been contribute will try and fallback back like documented. Still fails sometimes, but that's exactly the kind of kernel bugs your patchset is trying to address. I don't blame anyone here, since why would folks on GStreamer/FFMPEG or any other "generic media framework" spend so much time implement "per linux device code", when non- embedded (constraint) linux is just handful of users (compare to Windows, Android, iOS users).

To me, this shouldn't be #1 issue. Perhaps it should simply be replaced by userspace not supporting DMABuf Heaps. Perhaps add that Linux distribution don't always enable (or allow normal users to access) heaps (though you point 2. gets in the way) ? Unlike virtual memory, I don't think there is very good accounting and reclaiming mechanism for that memory, hence opening these means any userspace could possibly impair the system functioning. If you can't e.g. limit their usage within containers, this is pretty difficult for generic linux to carry. This is a wider problem of course, which likely affect a lot of GPU usage too, but perhaps it should be in the lower priority part of the todo.

...
This isn't true at all, we have tons of cases where device can only access their special memory for certain use cases. Just look at scanout for displaying on dGPU, neither AMD nor NVidia supports system memory here. Similar cases exists for audio/video codecs where intermediate memory is only accessible by certain devices because of content protection.

nit: content protection is not CODEC specific, its a platform feature, its also not really a thing upstream yet from what I'm aware of. This needs unified design and documentation imho, but also enough standardisation so that a generic application can use it. Right now, content protection people have been complaining that V4L2 (and most generic userspace) don't work with their design, rather then trying to figure-out a design that works with existing API.

...

We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only guess if normal, CMA or even device specific memory is needed.

We seem to lack some essential parts of those restrictions in the

documentation.

Agreed (can't always disagree).

regards, Nicolas

...
...
...
...
...
So if a device driver uses cached system memory on an architecture which devices which can't access it the right approach is clearly to reject the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one device driver wrote to it's internal buffer with the CPU (so some cache lines where dirty) and then a device which couldn't deal with that tried to access it.

We could say that all device drivers must always look at the coherency of the devices which want to access their buffers. But that would horrible complicate things for maintaining the drivers because then drivers would need to take into account requirements from other drivers while allocating their internal buffers.

That's essentially the reason why we have DMA-buf heaps. Those heaps expose system memory buffers with certain properties (size, CMA, DMA bit restrictions etc...) and also implement the callback functions for CPU cache maintenance.

...
...
We do have the system and CMA dma-buf heap for cases where a device driver which wants to access the buffer with caches enabled. So this is not a limitation in functionality, it's just a matter of correctly using it.

V4L2 also has the ability to allocate buffers and map them with caches enabled.

Yeah, when that's a requirement for the V4L2 device it also makes perfect sense.

It's just that we shouldn't force any specific allocation behavior on a device driver because of the requirements of a different device.

Giving an example a V4L2 device shouldn't be forced to use videobuf2-dma-contig because some other device needs CMA. Instead we should use the common DMA-buf heaps which implement this as neutral supplier of system memory buffers.

...
...
The problem is that in this particular case the exporter provides the buffer as is, e.g. with dirty CPU caches. And the importer can't deal with that.

Why does the exporter leave the buffer with dirty CPU caches?

Because exporters always export the buffers as they would use it. And in this case that means writing with the CPU to it.

...
...
Either reverting the roles of the importer or exporter or switching over to the common DMA-heaps implementation for the buffer would solve that.

...
...
It's the job of the higher level to prepare the buffer a device work with, not the one of the lower level.

What are higher and lower levels here?

The MM as higher level and the DMA mapping framework as lower level.

Hmm, I generally consider the DMA mapping framework a part of MM, especially its allocation capabilities. It heavily relies on low level MM primitives internally and exposes another set of low level primitives that is more useful for device drivers. At least it should be seen this way, but I agree that it currently includes things that shouldn't necessarily be there, like the transparent bouncing.

Exactly that, yes! Good to hear that more people think that way.

Christoph made some comments which sounded like he agreed to some of the points as well, but nobody ever said it so clearly.

...
...
...
As per the existing design of the DMA mapping framework, the framework handles the system DMA architecture details and DMA master drivers take care of invoking the right DMA mapping operations around the DMA accesses. This makes sense to me, as DMA master drivers have no idea about the specific SoCs or buses they're plugged into, while the DMA mapping framework has no idea when the DMA accesses are taking place.

Yeah and exactly that's a bit problematic. In an ideal world device drivers wouldn't see memory they can't access in the first place.

For example what we currently do is the following:

get_user_pages() grabs a reference to the pages we want to DMA to/from.

DMA mapping framework is called by the driver to create an sg-table.

The DMA mapping framework sees that this won't work and inserts

bounce buffers. 4. The driver does the DMA to the bounce buffers instead. 5. The DMA mapping framework copies the data to the real location.

This is highly problematic since it removes the control of what happens here. E.g. drivers can't prevent using bounce buffers when they need coherent access for DMA-buf for example.

What we should have instead is that bounce buffers are applied at a higher level, for example by get_user_pages() or more general in the MM.

I tend to agree with you on this, but I see a lot of challenges that would need to be solved if we want to make it otherwise. The whole reason for transparent bouncing came from the fact that many existing subsystems didn't really care about the needs of the underlying hardware, e.g. block or network subsystems would just pass random pages to the drivers. I think the idea of DMA mapping handling this internally was to avoid implementing the bouncing here and there in each block or network driver that needs it. (Arguably, an optional helper could be provided instead for use in those subsystems...)

Yes, totally agree. The problem is really that we moved bunch of MM and DMA functions in one API.

The bounce buffers are something we should really have in a separate component.

Then the functionality of allocating system memory for a specific device or devices should be something provided by the MM.

And finally making this memory or any other CPU address accessible to a device (IOMMU programming etc..) should then be part of an DMA API.

...
...
...
...
In other words in a proper design the higher level would prepare the memory in a way the device driver can work with it, not the other way around.

When a device driver gets memory it can't work with the correct response is to throw an error and bubble that up into a layer where it can be handled gracefully.

For example instead of using bounce buffers under the hood the DMA layer the MM should make sure that when you call read() with O_DIRECT that the pages in question are accessible by the device.

I tend to agree with you if it's about a costly software "emulation" like bounce buffers, but cache maintenance is a hardware feature existing there by default and it's often much cheaper to operate on cached memory and synchronize the caches rather than have everything in uncached (or write-combined) memory.

Well that we should have proper cache maintenance is really not the question. The question is where to do that?

E.g. in the DMA-mapping framework which as far as I can see should only take care of translating addresses.

The DMA mapping framework is actually a DMA allocation and mapping framework. Generic memory allocation primitives (like alloc_pages(), kmalloc()) shouldn't be used for buffers that are expected to be passed to DMA engines - there exist DMA-aware replacements, like dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(), etc.

...
Or the higher level (get_user_pages() is just one example of that) which says ok device X wants to access memory Y I need to make sure that caches are flushed/invalidated.

Okay, so here comes the userptr case, which I really feel like is just doomed at the very start, because it does nothing to accommodate hardware needs at allocation time and then just expects some magic to happen to make it possible for the hardware to make use of the buffer.

That said, it should be still possible to handle that pretty well for hardware that allows it, i.e. without much memory access constraints. What would be still missing if we just use the existing gup() + dma_map() + dma_sync() sequence?

Error or rather fallback handling and *not* transparently inserting bounce buffers.

The whole implicit bounce buffers concept falls apart as soon as you don't have a stream access any more.

...
...
...
...
> It's a use-case that is working fine today with many devices (e.g. network > adapters) in the ARM world, exactly because the architecture specific > implementation of the DMA API inserts the cache maintenance operations > on buffer ownership transfer. Yeah, I'm perfectly aware of that. The problem is that exactly that design totally breaks GPUs on Xen DOM0 for example.

And Xen is just one example, I can certainly say from experience that this design was a really really bad idea because it favors just one use case while making other use cases practically impossible if not really hard to implement.

Sorry, I haven't worked with Xen. Could you elaborate what's the problem that this introduces for it?

That's a bit longer topic. The AMD XEN devs are already working on this as far as I know. I can ping internally how far they got with sending the patches out to avoid this problem.

Hmm, I see. It might be a good data point to understand in which direction we should be going, but I guess we can wait until they send some patches.

There was just recently a longer thread on the amd-gfx mailing list about that. I think looking in there as well might be beneficial.

Regards, Christian.

...
Best regards, Tomasz

Tomasz Figa

9 Dec 9 Dec

8:26 a.m.

On Mon, Dec 5, 2022 at 5:29 PM Christian König christian.koenig@amd.com wrote:

...

Hi Tomasz,

Am 05.12.22 um 07:41 schrieb Tomasz Figa:

...
[SNIP]

...
In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly possible to implement this using the dma_fence objects mentioned above. E.g. drivers can already look at who is accessing a buffer currently and can even grab explicit ownership of it by adding their own dma_fence objects.

The only exception is CPU based access, e.g. when something is written with the CPU a cache flush might be necessary and when something is read with the CPU a cache invalidation might be necessary.

Okay, that's much clearer now, thanks for clarifying this. So we should be covered for the cache maintenance needs originating from CPU accesses already, +/- the broken cases which don't call the begin/end CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines, we should be covered either by the userspace explicitly flushing the hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing begin/end CPU access calls.)

Well there are multiple problems here:

A lot of userspace applications/frameworks assume that it can

allocate the buffer anywhere and it just works.

This isn't true at all, we have tons of cases where device can only access their special memory for certain use cases. Just look at scanout for displaying on dGPU, neither AMD nor NVidia supports system memory here. Similar cases exists for audio/video codecs where intermediate memory is only accessible by certain devices because of content protection.

Ack.

Although I think the most common case on mainstream Linux today is properly allocating for device X (e.g. V4L2 video decoder or DRM-based GPU) and hoping that other devices would accept the buffers just fine, which isn't a given on most platforms (although often it's just about pixel format, width/height/stride alignment, tiling, etc. rather than the memory itself). That's why ChromiumOS has minigbm and Android has gralloc that act as the central point of knowledge on buffer allocation.

...

We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only guess if normal, CMA or even device specific memory is needed.

DMA-buf heaps actually make it even more difficult for the userspace, because now it needs to pick the right heap. With allocation built into the specific UAPI (like V4L2), it's at least possible to allocate for one specific device without having any knowledge about allocation constraints in the userspace.

...

We seem to lack some essential parts of those restrictions in the

documentation.

Ack.

...

...
...
...
...
So if a device driver uses cached system memory on an architecture which devices which can't access it the right approach is clearly to reject the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one device driver wrote to it's internal buffer with the CPU (so some cache lines where dirty) and then a device which couldn't deal with that tried to access it.

If so, shouldn't that driver surround its CPU accesses with begin/end_cpu_access() in the first place?

The case that I was suggesting was of a hardware block that actually sits behind the CPU cache and thus dirties it on writes, not the driver doing that. (I haven't personally encountered such a system, though.)

...

We could say that all device drivers must always look at the coherency of the devices which want to access their buffers. But that would horrible complicate things for maintaining the drivers because then drivers would need to take into account requirements from other drivers while allocating their internal buffers.

I think it's partially why we have the allocation part of the DMA mapping API, but currently it's only handling requirements of one device. And we don't have any information from the userspace what other devices the buffer would be used with...

Actually, do we even have such information in the userspace today? Let's say I do a video call in a web browser on a typical Linux system. I have a V4L2 camera, VAAPI video encoder and X11 display. The V4L2 camera fills in buffers with video frames and both encoder and display consume them. Do we have a central place which would know that a buffer needs to be allocated that works with the producer and all consumers?

...

That's essentially the reason why we have DMA-buf heaps. Those heaps expose system memory buffers with certain properties (size, CMA, DMA bit restrictions etc...) and also implement the callback functions for CPU cache maintenance.

How do DMA-buf heaps change anything here? We already have CPU cache maintenance callbacks in the DMA-buf API - the begin/end_cpu_access() for CPU accesses and dmabuf_map/unmap_attachment() for device accesses (which arguably shouldn't actually do CPU cache maintenance, unless that's implied by how either of the involved DMA engines work).

...

...
...
We do have the system and CMA dma-buf heap for cases where a device driver which wants to access the buffer with caches enabled. So this is not a limitation in functionality, it's just a matter of correctly using it.

V4L2 also has the ability to allocate buffers and map them with caches enabled.

Yeah, when that's a requirement for the V4L2 device it also makes perfect sense.

It's just that we shouldn't force any specific allocation behavior on a device driver because of the requirements of a different device.

Giving an example a V4L2 device shouldn't be forced to use videobuf2-dma-contig because some other device needs CMA. Instead we should use the common DMA-buf heaps which implement this as neutral supplier of system memory buffers.

Agreed, but that's only possible if we have a central entity that understands what devices the requested buffer would be used with. My understanding is that we're nowhere close to that with mainstream Linux today.

// As a side note, videobuf2-dma-contig can actually allocate discontiguous memory, if the device is behind an IOMMU. Actually with the recent DMA API improvements, we could probably coalesce vb2-dma-contig and vb2-dma-sg into one vb2-dma backend.

...

...
...
The problem is that in this particular case the exporter provides the buffer as is, e.g. with dirty CPU caches. And the importer can't deal with that.

Why does the exporter leave the buffer with dirty CPU caches?

Because exporters always export the buffers as they would use it. And in this case that means writing with the CPU to it.

Sorry for the question not being very clear. I meant: How do the CPU caches get dirty in that case?

...

...
...
Either reverting the roles of the importer or exporter or switching over to the common DMA-heaps implementation for the buffer would solve that.

...
...
It's the job of the higher level to prepare the buffer a device work with, not the one of the lower level.

What are higher and lower levels here?

The MM as higher level and the DMA mapping framework as lower level.

Hmm, I generally consider the DMA mapping framework a part of MM, especially its allocation capabilities. It heavily relies on low level MM primitives internally and exposes another set of low level primitives that is more useful for device drivers. At least it should be seen this way, but I agree that it currently includes things that shouldn't necessarily be there, like the transparent bouncing.

Exactly that, yes! Good to hear that more people think that way.

Christoph made some comments which sounded like he agreed to some of the points as well, but nobody ever said it so clearly.

An interesting anecdote would be a case where we had the dma_mask set incorrectly for one of the devices, which ended up triggering the implicit bouncing and hurting the performance a lot, despite all of the functional testing passing just fine...

...

...
...
...
As per the existing design of the DMA mapping framework, the framework handles the system DMA architecture details and DMA master drivers take care of invoking the right DMA mapping operations around the DMA accesses. This makes sense to me, as DMA master drivers have no idea about the specific SoCs or buses they're plugged into, while the DMA mapping framework has no idea when the DMA accesses are taking place.

Yeah and exactly that's a bit problematic. In an ideal world device drivers wouldn't see memory they can't access in the first place.

For example what we currently do is the following:

get_user_pages() grabs a reference to the pages we want to DMA to/from.

DMA mapping framework is called by the driver to create an sg-table.

The DMA mapping framework sees that this won't work and inserts

bounce buffers. 4. The driver does the DMA to the bounce buffers instead. 5. The DMA mapping framework copies the data to the real location.

This is highly problematic since it removes the control of what happens here. E.g. drivers can't prevent using bounce buffers when they need coherent access for DMA-buf for example.

What we should have instead is that bounce buffers are applied at a higher level, for example by get_user_pages() or more general in the MM.

I tend to agree with you on this, but I see a lot of challenges that would need to be solved if we want to make it otherwise. The whole reason for transparent bouncing came from the fact that many existing subsystems didn't really care about the needs of the underlying hardware, e.g. block or network subsystems would just pass random pages to the drivers. I think the idea of DMA mapping handling this internally was to avoid implementing the bouncing here and there in each block or network driver that needs it. (Arguably, an optional helper could be provided instead for use in those subsystems...)

Yes, totally agree. The problem is really that we moved bunch of MM and DMA functions in one API.

The bounce buffers are something we should really have in a separate component.

Then the functionality of allocating system memory for a specific device or devices should be something provided by the MM.

And finally making this memory or any other CPU address accessible to a device (IOMMU programming etc..) should then be part of an DMA API.

Remember that actually making the memory accessible to a device often needs to be handled already as a part of the allocation (e.g. dma_mask in the non-IOMMU case). So I tend to think that the current division of responsibilities is mostly fine - the dma_alloc family should be seen as a part of MM already, especially with all the recent improvements from Christoph, like dma_alloc_pages().

That said, it may indeed make sense to move things like IOMMU mapping management out of the dma_alloc() and just reduce those functions to simply returning a set of pages that satisfy the allocation constraints. One would need to call dma_map() after the allocation, but that's actually a fair expectation. Sometimes it might be preferred to pre-allocate the memory, but only map it into the device address space when it's really necessary.

...

...
...
...
...
In other words in a proper design the higher level would prepare the memory in a way the device driver can work with it, not the other way around.

When a device driver gets memory it can't work with the correct response is to throw an error and bubble that up into a layer where it can be handled gracefully.

For example instead of using bounce buffers under the hood the DMA layer the MM should make sure that when you call read() with O_DIRECT that the pages in question are accessible by the device.

I tend to agree with you if it's about a costly software "emulation" like bounce buffers, but cache maintenance is a hardware feature existing there by default and it's often much cheaper to operate on cached memory and synchronize the caches rather than have everything in uncached (or write-combined) memory.

Well that we should have proper cache maintenance is really not the question. The question is where to do that?

E.g. in the DMA-mapping framework which as far as I can see should only take care of translating addresses.

The DMA mapping framework is actually a DMA allocation and mapping framework. Generic memory allocation primitives (like alloc_pages(), kmalloc()) shouldn't be used for buffers that are expected to be passed to DMA engines - there exist DMA-aware replacements, like dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(), etc.

...
Or the higher level (get_user_pages() is just one example of that) which says ok device X wants to access memory Y I need to make sure that caches are flushed/invalidated.

Okay, so here comes the userptr case, which I really feel like is just doomed at the very start, because it does nothing to accommodate hardware needs at allocation time and then just expects some magic to happen to make it possible for the hardware to make use of the buffer.

That said, it should be still possible to handle that pretty well for hardware that allows it, i.e. without much memory access constraints. What would be still missing if we just use the existing gup() + dma_map() + dma_sync() sequence?

Error or rather fallback handling and *not* transparently inserting bounce buffers.

The whole implicit bounce buffers concept falls apart as soon as you don't have a stream access any more.

I see. So I think we agree on that. :)

...

...
...
...
...
...
It's a use-case that is working fine today with many devices (e.g. network adapters) in the ARM world, exactly because the architecture specific implementation of the DMA API inserts the cache maintenance operations on buffer ownership transfer.

Yeah, I'm perfectly aware of that. The problem is that exactly that design totally breaks GPUs on Xen DOM0 for example.

And Xen is just one example, I can certainly say from experience that this design was a really really bad idea because it favors just one use case while making other use cases practically impossible if not really hard to implement.

Sorry, I haven't worked with Xen. Could you elaborate what's the problem that this introduces for it?

That's a bit longer topic. The AMD XEN devs are already working on this as far as I know. I can ping internally how far they got with sending the patches out to avoid this problem.

Hmm, I see. It might be a good data point to understand in which direction we should be going, but I guess we can wait until they send some patches.

There was just recently a longer thread on the amd-gfx mailing list about that. I think looking in there as well might be beneficial.

Okay, let me check. Thanks,

Best regards, Tomasz

Pekka Paalanen

9:32 a.m.

On Fri, 9 Dec 2022 17:26:06 +0900 Tomasz Figa tfiga@chromium.org wrote:

...

On Mon, Dec 5, 2022 at 5:29 PM Christian König christian.koenig@amd.com wrote:

...
Hi Tomasz,

Am 05.12.22 um 07:41 schrieb Tomasz Figa:

...
[SNIP]

...
In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly possible to implement this using the dma_fence objects mentioned above. E.g. drivers can already look at who is accessing a buffer currently and can even grab explicit ownership of it by adding their own dma_fence objects.

The only exception is CPU based access, e.g. when something is written with the CPU a cache flush might be necessary and when something is read with the CPU a cache invalidation might be necessary.

Okay, that's much clearer now, thanks for clarifying this. So we should be covered for the cache maintenance needs originating from CPU accesses already, +/- the broken cases which don't call the begin/end CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines, we should be covered either by the userspace explicitly flushing the hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing begin/end CPU access calls.)

Well there are multiple problems here:

A lot of userspace applications/frameworks assume that it can

allocate the buffer anywhere and it just works.

This isn't true at all, we have tons of cases where device can only access their special memory for certain use cases. Just look at scanout for displaying on dGPU, neither AMD nor NVidia supports system memory here. Similar cases exists for audio/video codecs where intermediate memory is only accessible by certain devices because of content protection.

Ack.

Although I think the most common case on mainstream Linux today is properly allocating for device X (e.g. V4L2 video decoder or DRM-based GPU) and hoping that other devices would accept the buffers just fine, which isn't a given on most platforms (although often it's just about pixel format, width/height/stride alignment, tiling, etc. rather than the memory itself). That's why ChromiumOS has minigbm and Android has gralloc that act as the central point of knowledge on buffer allocation.

Hi,

as an anecdote, when I was improving Mutter's cross-DRM-device handling (for DisplayLink uses) a few years ago, I implemented several different approaches of where to allocate, to try until going for the slowest but guaranteed to work case of copying every update into and out of sysram.

It seems there are two different approaches in general for allocation and sharing:

1. Try different things until it works or you run out of options

pro: - no need for a single software component to know everything about every device in the system

con: - who bothers with fallbacks, if the first try works on my system for my use case I test with? I.e. cost of code in users. - trial-and-error can be very laborious (allocate, share with all devices, populate, test) - the search space might be huge

2. Have a central component that knows what to do

pro: - It might work on the first attempt, so no fallbacks in users. - It might be optimal.

con: - You need a software component that knows everything about every single combination of hardware in existence, multiplied by use cases.

Neither seems good, which brings us back to https://github.com/cubanismo/allocator .

...

...

We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only guess if normal, CMA or even device specific memory is needed.

DMA-buf heaps actually make it even more difficult for the userspace, because now it needs to pick the right heap. With allocation built into the specific UAPI (like V4L2), it's at least possible to allocate for one specific device without having any knowledge about allocation constraints in the userspace.

...

We seem to lack some essential parts of those restrictions in the

documentation.

Ack.

...
...
...
...
...
So if a device driver uses cached system memory on an architecture which devices which can't access it the right approach is clearly to reject the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one device driver wrote to it's internal buffer with the CPU (so some cache lines where dirty) and then a device which couldn't deal with that tried to access it.

If so, shouldn't that driver surround its CPU accesses with begin/end_cpu_access() in the first place?

The case that I was suggesting was of a hardware block that actually sits behind the CPU cache and thus dirties it on writes, not the driver doing that. (I haven't personally encountered such a system, though.)

...
We could say that all device drivers must always look at the coherency of the devices which want to access their buffers. But that would horrible complicate things for maintaining the drivers because then drivers would need to take into account requirements from other drivers while allocating their internal buffers.

I think it's partially why we have the allocation part of the DMA mapping API, but currently it's only handling requirements of one device. And we don't have any information from the userspace what other devices the buffer would be used with...

Actually, do we even have such information in the userspace today? Let's say I do a video call in a web browser on a typical Linux system. I have a V4L2 camera, VAAPI video encoder and X11 display. The V4L2 camera fills in buffers with video frames and both encoder and display consume them. Do we have a central place which would know that a buffer needs to be allocated that works with the producer and all consumers?

I have a vague belief that many, many years ago, in the early days of dmabuf development, there was the idea of the sequence: - create a dmabuf handle - share the handle with all devices that would need access - *then* do the allocation with kernel-internal negotiation to fill all devices' needs, if at all possible

Obviously that didn't happen. I think today's dmabuf Wayland protocol would support this though.

Anyway, Wayland can tell the app which DRM devices a buffer needs to work with as a GPU texture and potentially on same/another DRM device as a KMS framebuffer, so theoretically the app could know.

Thanks, pq

Alex Deucher

5:07 p.m.

On Fri, Dec 9, 2022 at 4:32 AM Pekka Paalanen ppaalanen@gmail.com wrote:

...

On Fri, 9 Dec 2022 17:26:06 +0900 Tomasz Figa tfiga@chromium.org wrote:

...
On Mon, Dec 5, 2022 at 5:29 PM Christian König christian.koenig@amd.com wrote:

...
Hi Tomasz,

Am 05.12.22 um 07:41 schrieb Tomasz Figa:

...
[SNIP]

...
In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly possible to implement this using the dma_fence objects mentioned above. E.g. drivers can already look at who is accessing a buffer currently and can even grab explicit ownership of it by adding their own dma_fence objects.

The only exception is CPU based access, e.g. when something is written with the CPU a cache flush might be necessary and when something is read with the CPU a cache invalidation might be necessary.

Okay, that's much clearer now, thanks for clarifying this. So we should be covered for the cache maintenance needs originating from CPU accesses already, +/- the broken cases which don't call the begin/end CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines, we should be covered either by the userspace explicitly flushing the hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing begin/end CPU access calls.)

Well there are multiple problems here:

A lot of userspace applications/frameworks assume that it can

allocate the buffer anywhere and it just works.

This isn't true at all, we have tons of cases where device can only access their special memory for certain use cases. Just look at scanout for displaying on dGPU, neither AMD nor NVidia supports system memory here. Similar cases exists for audio/video codecs where intermediate memory is only accessible by certain devices because of content protection.

Ack.

Although I think the most common case on mainstream Linux today is properly allocating for device X (e.g. V4L2 video decoder or DRM-based GPU) and hoping that other devices would accept the buffers just fine, which isn't a given on most platforms (although often it's just about pixel format, width/height/stride alignment, tiling, etc. rather than the memory itself). That's why ChromiumOS has minigbm and Android has gralloc that act as the central point of knowledge on buffer allocation.

Hi,

as an anecdote, when I was improving Mutter's cross-DRM-device handling (for DisplayLink uses) a few years ago, I implemented several different approaches of where to allocate, to try until going for the slowest but guaranteed to work case of copying every update into and out of sysram.

It seems there are two different approaches in general for allocation and sharing:

Try different things until it works or you run out of options

pro:

no need for a single software component to know everything about every device in the system

con:

who bothers with fallbacks, if the first try works on my system for my use case I test with? I.e. cost of code in users.

trial-and-error can be very laborious (allocate, share with all devices, populate, test)

the search space might be huge

Even that is fraught with difficulty. We had a ton of bug reports over the years claiming amdgpu was broken when users tried to use displaylink devices in combination with AMD dGPUs because the performance was so slow. The problem was that rather than using dma-buf, the compositor was just mmaping the the dGPU BAR, which happens to be uncached write combined and across the PCI bus, and copying the data to the displaylink device. Read access to a PCI BAR with the CPU is on the order of 10s of MB per second.

Alex

...

Have a central component that knows what to do

pro:

It might work on the first attempt, so no fallbacks in users.

It might be optimal.

con:

You need a software component that knows everything about every single combination of hardware in existence, multiplied by use cases.

Neither seems good, which brings us back to https://github.com/cubanismo/allocator .

...
...

We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only guess if normal, CMA or even device specific memory is needed.

DMA-buf heaps actually make it even more difficult for the userspace, because now it needs to pick the right heap. With allocation built into the specific UAPI (like V4L2), it's at least possible to allocate for one specific device without having any knowledge about allocation constraints in the userspace.

...

We seem to lack some essential parts of those restrictions in the

documentation.

Ack.

...
...
...
...
> So if a device driver uses cached system memory on an architecture which > devices which can't access it the right approach is clearly to reject > the access. I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one device driver wrote to it's internal buffer with the CPU (so some cache lines where dirty) and then a device which couldn't deal with that tried to access it.

If so, shouldn't that driver surround its CPU accesses with begin/end_cpu_access() in the first place?

The case that I was suggesting was of a hardware block that actually sits behind the CPU cache and thus dirties it on writes, not the driver doing that. (I haven't personally encountered such a system, though.)

...
We could say that all device drivers must always look at the coherency of the devices which want to access their buffers. But that would horrible complicate things for maintaining the drivers because then drivers would need to take into account requirements from other drivers while allocating their internal buffers.

I think it's partially why we have the allocation part of the DMA mapping API, but currently it's only handling requirements of one device. And we don't have any information from the userspace what other devices the buffer would be used with...

Actually, do we even have such information in the userspace today? Let's say I do a video call in a web browser on a typical Linux system. I have a V4L2 camera, VAAPI video encoder and X11 display. The V4L2 camera fills in buffers with video frames and both encoder and display consume them. Do we have a central place which would know that a buffer needs to be allocated that works with the producer and all consumers?

I have a vague belief that many, many years ago, in the early days of dmabuf development, there was the idea of the sequence:

create a dmabuf handle

share the handle with all devices that would need access

*then* do the allocation with kernel-internal negotiation to fill all devices' needs, if at all possible

Obviously that didn't happen. I think today's dmabuf Wayland protocol would support this though.

Anyway, Wayland can tell the app which DRM devices a buffer needs to work with as a GPU texture and potentially on same/another DRM device as a KMS framebuffer, so theoretically the app could know.

Thanks, pq

Daniel Vetter

5 Jan 5 Jan

11:50 a.m.

On Fri, Dec 09, 2022 at 12:07:49PM -0500, Alex Deucher wrote:

...

On Fri, Dec 9, 2022 at 4:32 AM Pekka Paalanen ppaalanen@gmail.com wrote:

...
On Fri, 9 Dec 2022 17:26:06 +0900 Tomasz Figa tfiga@chromium.org wrote:

...
On Mon, Dec 5, 2022 at 5:29 PM Christian König christian.koenig@amd.com wrote:

...
Hi Tomasz,

Am 05.12.22 um 07:41 schrieb Tomasz Figa:

...
[SNIP]

...
In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly possible to implement this using the dma_fence objects mentioned above. E.g. drivers can already look at who is accessing a buffer currently and can even grab explicit ownership of it by adding their own dma_fence objects.

The only exception is CPU based access, e.g. when something is written with the CPU a cache flush might be necessary and when something is read with the CPU a cache invalidation might be necessary.

Okay, that's much clearer now, thanks for clarifying this. So we should be covered for the cache maintenance needs originating from CPU accesses already, +/- the broken cases which don't call the begin/end CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines, we should be covered either by the userspace explicitly flushing the hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing begin/end CPU access calls.)

Well there are multiple problems here:

A lot of userspace applications/frameworks assume that it can

allocate the buffer anywhere and it just works.

This isn't true at all, we have tons of cases where device can only access their special memory for certain use cases. Just look at scanout for displaying on dGPU, neither AMD nor NVidia supports system memory here. Similar cases exists for audio/video codecs where intermediate memory is only accessible by certain devices because of content protection.

Ack.

Although I think the most common case on mainstream Linux today is properly allocating for device X (e.g. V4L2 video decoder or DRM-based GPU) and hoping that other devices would accept the buffers just fine, which isn't a given on most platforms (although often it's just about pixel format, width/height/stride alignment, tiling, etc. rather than the memory itself). That's why ChromiumOS has minigbm and Android has gralloc that act as the central point of knowledge on buffer allocation.

Hi,

as an anecdote, when I was improving Mutter's cross-DRM-device handling (for DisplayLink uses) a few years ago, I implemented several different approaches of where to allocate, to try until going for the slowest but guaranteed to work case of copying every update into and out of sysram.

It seems there are two different approaches in general for allocation and sharing:

Try different things until it works or you run out of options

pro:

no need for a single software component to know everything about every device in the system

con:

who bothers with fallbacks, if the first try works on my system for my use case I test with? I.e. cost of code in users.

trial-and-error can be very laborious (allocate, share with all devices, populate, test)

the search space might be huge

Even that is fraught with difficulty. We had a ton of bug reports over the years claiming amdgpu was broken when users tried to use displaylink devices in combination with AMD dGPUs because the performance was so slow. The problem was that rather than using dma-buf, the compositor was just mmaping the the dGPU BAR, which happens to be uncached write combined and across the PCI bus, and copying the data to the displaylink device. Read access to a PCI BAR with the CPU is on the order of 10s of MB per second.

Yeah you shouldn't readback dma-buf directly when you get them from a gl/vk driver, but instead use the tools these apis provide for cpu access.

But that just puts us even more firmly into the "who even bothers" territory because people just hack up something that happens to work on their igpu (where at least on x86 you nowadays tend to get fully cpu cached buffers). -Daniel

...

Alex

...

Have a central component that knows what to do

pro:

It might work on the first attempt, so no fallbacks in users.

It might be optimal.

con:

You need a software component that knows everything about every single combination of hardware in existence, multiplied by use cases.

Neither seems good, which brings us back to https://github.com/cubanismo/allocator .

...
...

We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only guess if normal, CMA or even device specific memory is needed.

DMA-buf heaps actually make it even more difficult for the userspace, because now it needs to pick the right heap. With allocation built into the specific UAPI (like V4L2), it's at least possible to allocate for one specific device without having any knowledge about allocation constraints in the userspace.

...

We seem to lack some essential parts of those restrictions in the

documentation.

Ack.

...
...
...
>> So if a device driver uses cached system memory on an architecture which >> devices which can't access it the right approach is clearly to reject >> the access. > I'd like to accent the fact that "requires cache maintenance" != "can't access". Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one device driver wrote to it's internal buffer with the CPU (so some cache lines where dirty) and then a device which couldn't deal with that tried to access it.

If so, shouldn't that driver surround its CPU accesses with begin/end_cpu_access() in the first place?

The case that I was suggesting was of a hardware block that actually sits behind the CPU cache and thus dirties it on writes, not the driver doing that. (I haven't personally encountered such a system, though.)

...
We could say that all device drivers must always look at the coherency of the devices which want to access their buffers. But that would horrible complicate things for maintaining the drivers because then drivers would need to take into account requirements from other drivers while allocating their internal buffers.

I think it's partially why we have the allocation part of the DMA mapping API, but currently it's only handling requirements of one device. And we don't have any information from the userspace what other devices the buffer would be used with...

Actually, do we even have such information in the userspace today? Let's say I do a video call in a web browser on a typical Linux system. I have a V4L2 camera, VAAPI video encoder and X11 display. The V4L2 camera fills in buffers with video frames and both encoder and display consume them. Do we have a central place which would know that a buffer needs to be allocated that works with the producer and all consumers?

I have a vague belief that many, many years ago, in the early days of dmabuf development, there was the idea of the sequence:

create a dmabuf handle

share the handle with all devices that would need access

*then* do the allocation with kernel-internal negotiation to fill all devices' needs, if at all possible

Obviously that didn't happen. I think today's dmabuf Wayland protocol would support this though.

Anyway, Wayland can tell the app which DRM devices a buffer needs to work with as a GPU texture and potentially on same/another DRM device as a KMS framebuffer, so theoretically the app could know.

Thanks, pq

Linaro-mm-sig mailing list -- linaro-mm-sig@lists.linaro.org To unsubscribe send an email to linaro-mm-sig-leave@lists.linaro.org

-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch

Tomasz Figa

12 Dec 12 Dec

3:13 a.m.

On Fri, Dec 9, 2022 at 6:32 PM Pekka Paalanen ppaalanen@gmail.com wrote:

...

On Fri, 9 Dec 2022 17:26:06 +0900 Tomasz Figa tfiga@chromium.org wrote:

...
On Mon, Dec 5, 2022 at 5:29 PM Christian König christian.koenig@amd.com wrote:

...
Hi Tomasz,

Am 05.12.22 um 07:41 schrieb Tomasz Figa:

...
[SNIP]

...
In other words explicit ownership transfer is not something we would want as requirement in the framework, cause otherwise we break tons of use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly possible to implement this using the dma_fence objects mentioned above. E.g. drivers can already look at who is accessing a buffer currently and can even grab explicit ownership of it by adding their own dma_fence objects.

The only exception is CPU based access, e.g. when something is written with the CPU a cache flush might be necessary and when something is read with the CPU a cache invalidation might be necessary.

Okay, that's much clearer now, thanks for clarifying this. So we should be covered for the cache maintenance needs originating from CPU accesses already, +/- the broken cases which don't call the begin/end CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines, we should be covered either by the userspace explicitly flushing the hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing begin/end CPU access calls.)

Well there are multiple problems here:

A lot of userspace applications/frameworks assume that it can

allocate the buffer anywhere and it just works.

This isn't true at all, we have tons of cases where device can only access their special memory for certain use cases. Just look at scanout for displaying on dGPU, neither AMD nor NVidia supports system memory here. Similar cases exists for audio/video codecs where intermediate memory is only accessible by certain devices because of content protection.

Ack.

Although I think the most common case on mainstream Linux today is properly allocating for device X (e.g. V4L2 video decoder or DRM-based GPU) and hoping that other devices would accept the buffers just fine, which isn't a given on most platforms (although often it's just about pixel format, width/height/stride alignment, tiling, etc. rather than the memory itself). That's why ChromiumOS has minigbm and Android has gralloc that act as the central point of knowledge on buffer allocation.

Hi,

as an anecdote, when I was improving Mutter's cross-DRM-device handling (for DisplayLink uses) a few years ago, I implemented several different approaches of where to allocate, to try until going for the slowest but guaranteed to work case of copying every update into and out of sysram.

It seems there are two different approaches in general for allocation and sharing:

Try different things until it works or you run out of options

pro:

no need for a single software component to know everything about every device in the system

con:

who bothers with fallbacks, if the first try works on my system for my use case I test with? I.e. cost of code in users.

trial-and-error can be very laborious (allocate, share with all devices, populate, test)

the search space might be huge

Have a central component that knows what to do

pro:

It might work on the first attempt, so no fallbacks in users.

It might be optimal.

con:

You need a software component that knows everything about every single combination of hardware in existence, multiplied by use cases.

Neither seems good, which brings us back to https://github.com/cubanismo/allocator .

I need to refresh my memory on how far we went with that and what the stoppers were, but it really sounds like we need it to make things really work on a mainstream Linux system.

When I was still participating in the discussions, I remember the idea was to have an API exposed by various components, like EGL, Vulkan, etc. to describe the constraints. Maybe to simplify the problem, instead of having this complex negotiation between different APIs, we could have a plugin-based library and plugins would be installed together with the various API implementations (e.g. even proprietary Nvidia drivers could provide an open source allocation plugin to tell the central allocator library about its requirements)?

I also recall we were stopped by lack of a generic DMA-buf allocation functionality exposed to the userspace, but that now exists as DMA-buf heaps.

...

...
...

We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only guess if normal, CMA or even device specific memory is needed.

DMA-buf heaps actually make it even more difficult for the userspace, because now it needs to pick the right heap. With allocation built into the specific UAPI (like V4L2), it's at least possible to allocate for one specific device without having any knowledge about allocation constraints in the userspace.

...

We seem to lack some essential parts of those restrictions in the

documentation.

Ack.

...
...
...
...
> So if a device driver uses cached system memory on an architecture which > devices which can't access it the right approach is clearly to reject > the access. I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one device driver wrote to it's internal buffer with the CPU (so some cache lines where dirty) and then a device which couldn't deal with that tried to access it.

If so, shouldn't that driver surround its CPU accesses with begin/end_cpu_access() in the first place?

The case that I was suggesting was of a hardware block that actually sits behind the CPU cache and thus dirties it on writes, not the driver doing that. (I haven't personally encountered such a system, though.)

...
We could say that all device drivers must always look at the coherency of the devices which want to access their buffers. But that would horrible complicate things for maintaining the drivers because then drivers would need to take into account requirements from other drivers while allocating their internal buffers.

I think it's partially why we have the allocation part of the DMA mapping API, but currently it's only handling requirements of one device. And we don't have any information from the userspace what other devices the buffer would be used with...

Actually, do we even have such information in the userspace today? Let's say I do a video call in a web browser on a typical Linux system. I have a V4L2 camera, VAAPI video encoder and X11 display. The V4L2 camera fills in buffers with video frames and both encoder and display consume them. Do we have a central place which would know that a buffer needs to be allocated that works with the producer and all consumers?

I have a vague belief that many, many years ago, in the early days of dmabuf development, there was the idea of the sequence:

create a dmabuf handle

share the handle with all devices that would need access

*then* do the allocation with kernel-internal negotiation to fill all devices' needs, if at all possible

Yeah, I had such a recollection as well. I think the difficulty is to get the userspace to tell the kernel about all the devices before the buffer needs to be allocated.

...

Obviously that didn't happen. I think today's dmabuf Wayland protocol would support this though.

Anyway, Wayland can tell the app which DRM devices a buffer needs to work with as a GPU texture and potentially on same/another DRM device as a KMS framebuffer, so theoretically the app could know.

Good to know, thanks.

Best regards, Tomasz

Christian König

9 Dec 9 Dec

10:27 a.m.

Am 09.12.22 um 09:26 schrieb Tomasz Figa:

...

[SNIP] Although I think the most common case on mainstream Linux today is properly allocating for device X (e.g. V4L2 video decoder or DRM-based GPU) and hoping that other devices would accept the buffers just fine, which isn't a given on most platforms (although often it's just about pixel format, width/height/stride alignment, tiling, etc. rather than the memory itself). That's why ChromiumOS has minigbm and Android has gralloc that act as the central point of knowledge on buffer allocation.

Yeah, completely agree. The crux is really that we need to improve those user space allocators by providing more information from the kernel.

...

...

We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only guess if normal, CMA or even device specific memory is needed.

DMA-buf heaps actually make it even more difficult for the userspace, because now it needs to pick the right heap. With allocation built into the specific UAPI (like V4L2), it's at least possible to allocate for one specific device without having any knowledge about allocation constraints in the userspace.

Yes, same what Daniel said as well. We need to provide some more hints which allocator to use from the kernel.

...

...
...
...
...
...
So if a device driver uses cached system memory on an architecture which devices which can't access it the right approach is clearly to reject the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one device driver wrote to it's internal buffer with the CPU (so some cache lines where dirty) and then a device which couldn't deal with that tried to access it.

If so, shouldn't that driver surround its CPU accesses with begin/end_cpu_access() in the first place?

The problem is that the roles are reversed. The callbacks let the exporter knows that it needs to flush the caches when the importer is done accessing the buffer with the CPU.

But here the exporter is the one accessing the buffer with the CPU and the importer then accesses stale data because it doesn't snoop the caches.

What we could do is to force all exporters to use begin/end_cpu_access() even on it's own buffers and look at all the importers when the access is completed. But that would increase the complexity of the handling in the exporter.

In other words we would then have code in the exporters which is only written for handling the constrains of the importers. This has a wide variety of consequences, especially that this functionality of the exporter can't be tested without a proper importer.

I was also thinking about reversing the role of exporter and importer in the kernel, but came to the conclusion that doing this under the hood without userspace knowing about it is probably not going to work either.

...

The case that I was suggesting was of a hardware block that actually sits behind the CPU cache and thus dirties it on writes, not the driver doing that. (I haven't personally encountered such a system, though.)

Never heard of anything like that either, but who knows.

...

...
We could say that all device drivers must always look at the coherency of the devices which want to access their buffers. But that would horrible complicate things for maintaining the drivers because then drivers would need to take into account requirements from other drivers while allocating their internal buffers.

I think it's partially why we have the allocation part of the DMA mapping API, but currently it's only handling requirements of one device. And we don't have any information from the userspace what other devices the buffer would be used with...

Exactly that, yes.

...

Actually, do we even have such information in the userspace today? Let's say I do a video call in a web browser on a typical Linux system. I have a V4L2 camera, VAAPI video encoder and X11 display. The V4L2 camera fills in buffers with video frames and both encoder and display consume them. Do we have a central place which would know that a buffer needs to be allocated that works with the producer and all consumers?

Both X11 and Wayland have protocols which can be used to display a certain DMA-buf handle, their feedback packages contain information how ideal your configuration is, e.g. if the DMA-buf handle could be used or if an extra copy was needed etc...

Similar exists between VAAPI and V4L2 as far as I know, but as you noted as well here it's indeed more about negotiating pixel format, stride, padding, alignment etc...

The best we can do is to reject combinations which won't work in the kernel and then userspace could react accordingly.

...

...
That's essentially the reason why we have DMA-buf heaps. Those heaps expose system memory buffers with certain properties (size, CMA, DMA bit restrictions etc...) and also implement the callback functions for CPU cache maintenance.

How do DMA-buf heaps change anything here? We already have CPU cache maintenance callbacks in the DMA-buf API - the begin/end_cpu_access() for CPU accesses and dmabuf_map/unmap_attachment() for device accesses (which arguably shouldn't actually do CPU cache maintenance, unless that's implied by how either of the involved DMA engines work).

DMA-buf heaps are the neutral man in the middle.

The implementation keeps track of all the attached importers and should make sure that the allocated memory fits the need of everyone. Additional to that calls to the cache DMA-api cache management functions are inserted whenever CPU access begins/ends.

That's the best we can do for system memory sharing, only device specific memory can't be allocated like this.

...

...
...
...
We do have the system and CMA dma-buf heap for cases where a device driver which wants to access the buffer with caches enabled. So this is not a limitation in functionality, it's just a matter of correctly using it.

V4L2 also has the ability to allocate buffers and map them with caches enabled.

Yeah, when that's a requirement for the V4L2 device it also makes perfect sense.

It's just that we shouldn't force any specific allocation behavior on a device driver because of the requirements of a different device.

Giving an example a V4L2 device shouldn't be forced to use videobuf2-dma-contig because some other device needs CMA. Instead we should use the common DMA-buf heaps which implement this as neutral supplier of system memory buffers.

Agreed, but that's only possible if we have a central entity that understands what devices the requested buffer would be used with. My understanding is that we're nowhere close to that with mainstream Linux today.

// As a side note, videobuf2-dma-contig can actually allocate discontiguous memory, if the device is behind an IOMMU. Actually with the recent DMA API improvements, we could probably coalesce vb2-dma-contig and vb2-dma-sg into one vb2-dma backend.

That would probably make live a little bit simpler, yes.

...

...
...
...
The problem is that in this particular case the exporter provides the buffer as is, e.g. with dirty CPU caches. And the importer can't deal with that.

Why does the exporter leave the buffer with dirty CPU caches?

Because exporters always export the buffers as they would use it. And in this case that means writing with the CPU to it.

Sorry for the question not being very clear. I meant: How do the CPU caches get dirty in that case?

The exporter wrote to it. As far as I understand the exporter just copies things from A to B with memcpy to construct the buffer content.

...

[SNIP]

...
Yes, totally agree. The problem is really that we moved bunch of MM and DMA functions in one API.

The bounce buffers are something we should really have in a separate component.

Then the functionality of allocating system memory for a specific device or devices should be something provided by the MM.

And finally making this memory or any other CPU address accessible to a device (IOMMU programming etc..) should then be part of an DMA API.

Remember that actually making the memory accessible to a device often needs to be handled already as a part of the allocation (e.g. dma_mask in the non-IOMMU case). So I tend to think that the current division of responsibilities is mostly fine - the dma_alloc family should be seen as a part of MM already, especially with all the recent improvements from Christoph, like dma_alloc_pages().

Yes, that's indeed a very interesting development which as far as I can see goes into the right direction.

...

That said, it may indeed make sense to move things like IOMMU mapping management out of the dma_alloc() and just reduce those functions to simply returning a set of pages that satisfy the allocation constraints. One would need to call dma_map() after the allocation, but that's actually a fair expectation. Sometimes it might be preferred to pre-allocate the memory, but only map it into the device address space when it's really necessary.

What I'm still missing is the functionality to allocate pages for multiple devices and proper error codes when dma_map() can't make the page accessible to a device.

Regards, Christian.

...

...
...
...
...
...
> It's a use-case that is working fine today with many devices (e.g. network > adapters) in the ARM world, exactly because the architecture specific > implementation of the DMA API inserts the cache maintenance operations > on buffer ownership transfer. Yeah, I'm perfectly aware of that. The problem is that exactly that design totally breaks GPUs on Xen DOM0 for example.

And Xen is just one example, I can certainly say from experience that this design was a really really bad idea because it favors just one use case while making other use cases practically impossible if not really hard to implement.

Sorry, I haven't worked with Xen. Could you elaborate what's the problem that this introduces for it?

That's a bit longer topic. The AMD XEN devs are already working on this as far as I know. I can ping internally how far they got with sending the patches out to avoid this problem.

Hmm, I see. It might be a good data point to understand in which direction we should be going, but I guess we can wait until they send some patches.

There was just recently a longer thread on the amd-gfx mailing list about that. I think looking in there as well might be beneficial.

Okay, let me check. Thanks,

Best regards, Tomasz

Tomasz Figa

12 Dec 12 Dec

3 a.m.

On Fri, Dec 9, 2022 at 7:27 PM Christian König christian.koenig@amd.com wrote:

...

Am 09.12.22 um 09:26 schrieb Tomasz Figa:

[snip]

...

Yes, same what Daniel said as well. We need to provide some more hints which allocator to use from the kernel.

...
...
...
...
...
> So if a device driver uses cached system memory on an architecture which > devices which can't access it the right approach is clearly to reject > the access. I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one device driver wrote to it's internal buffer with the CPU (so some cache lines where dirty) and then a device which couldn't deal with that tried to access it.

If so, shouldn't that driver surround its CPU accesses with begin/end_cpu_access() in the first place?

The problem is that the roles are reversed. The callbacks let the exporter knows that it needs to flush the caches when the importer is done accessing the buffer with the CPU.

But here the exporter is the one accessing the buffer with the CPU and the importer then accesses stale data because it doesn't snoop the caches.

What we could do is to force all exporters to use begin/end_cpu_access() even on it's own buffers and look at all the importers when the access is completed. But that would increase the complexity of the handling in the exporter.

I feel like they should be doing so anyway, because it often depends on the SoC integration whether the DMA can do cache snooping or not.

Although arguably, there is a corner case today where if one uses dma_alloc_coherent() to get a buffer with a coherent CPU mapping for device X that is declared as cache-coherent, one also expects not to need to call begin/end_cpu_access(), but those would be needed if the buffer was to be imported by device Y that is not cache-coherent...

Sounds like after all it's a mess. I guess your patches make it one step closer to something sensible, import would fail in such cases. Although arguably we should be able to still export from driver Y and import to driver X just fine if Y allocated the buffer as coherent - otherwise we would break existing users for whom things worked fine.

...

In other words we would then have code in the exporters which is only written for handling the constrains of the importers. This has a wide variety of consequences, especially that this functionality of the exporter can't be tested without a proper importer.

I was also thinking about reversing the role of exporter and importer in the kernel, but came to the conclusion that doing this under the hood without userspace knowing about it is probably not going to work either.

...
The case that I was suggesting was of a hardware block that actually sits behind the CPU cache and thus dirties it on writes, not the driver doing that. (I haven't personally encountered such a system, though.)

Never heard of anything like that either, but who knows.

...
...
We could say that all device drivers must always look at the coherency of the devices which want to access their buffers. But that would horrible complicate things for maintaining the drivers because then drivers would need to take into account requirements from other drivers while allocating their internal buffers.

I think it's partially why we have the allocation part of the DMA mapping API, but currently it's only handling requirements of one device. And we don't have any information from the userspace what other devices the buffer would be used with...

Exactly that, yes.

...
Actually, do we even have such information in the userspace today? Let's say I do a video call in a web browser on a typical Linux system. I have a V4L2 camera, VAAPI video encoder and X11 display. The V4L2 camera fills in buffers with video frames and both encoder and display consume them. Do we have a central place which would know that a buffer needs to be allocated that works with the producer and all consumers?

Both X11 and Wayland have protocols which can be used to display a certain DMA-buf handle, their feedback packages contain information how ideal your configuration is, e.g. if the DMA-buf handle could be used or if an extra copy was needed etc...

Similar exists between VAAPI and V4L2 as far as I know, but as you noted as well here it's indeed more about negotiating pixel format, stride, padding, alignment etc...

The best we can do is to reject combinations which won't work in the kernel and then userspace could react accordingly.

The question is whether userspace is able to deal with it, given the limited amount of information it gets from the kernel. Sure, there is always the ultimate fallback of memcpy(), but in many cases that would be totally unusable due to severe performance impact. If we were to remove the existing extent of implicit handling from the kernel, I think we need to first give the userspace the information necessary to explicitly handle the fallback to the same extent.

We also need to think about backwards compatibility. Simply removing the implicit fallback cases would probably break a lot of userspace, so an opt-in behavior is likely needed initially...

...

...
...
That's essentially the reason why we have DMA-buf heaps. Those heaps expose system memory buffers with certain properties (size, CMA, DMA bit restrictions etc...) and also implement the callback functions for CPU cache maintenance.

How do DMA-buf heaps change anything here? We already have CPU cache maintenance callbacks in the DMA-buf API - the begin/end_cpu_access() for CPU accesses and dmabuf_map/unmap_attachment() for device accesses (which arguably shouldn't actually do CPU cache maintenance, unless that's implied by how either of the involved DMA engines work).

DMA-buf heaps are the neutral man in the middle.

The implementation keeps track of all the attached importers and should make sure that the allocated memory fits the need of everyone. Additional to that calls to the cache DMA-api cache management functions are inserted whenever CPU access begins/ends.

I think in current design, it only knows all the importers after the buffer is already allocated, so it doesn't necessarily have a way to handle the allocation constraints. Something would have to be done to get all the importers attached before the allocation actually takes place.

...

That's the best we can do for system memory sharing, only device specific memory can't be allocated like this.

...
...
...
...
We do have the system and CMA dma-buf heap for cases where a device driver which wants to access the buffer with caches enabled. So this is not a limitation in functionality, it's just a matter of correctly using it.

V4L2 also has the ability to allocate buffers and map them with caches enabled.

Yeah, when that's a requirement for the V4L2 device it also makes perfect sense.

It's just that we shouldn't force any specific allocation behavior on a device driver because of the requirements of a different device.

Giving an example a V4L2 device shouldn't be forced to use videobuf2-dma-contig because some other device needs CMA. Instead we should use the common DMA-buf heaps which implement this as neutral supplier of system memory buffers.

Agreed, but that's only possible if we have a central entity that understands what devices the requested buffer would be used with. My understanding is that we're nowhere close to that with mainstream Linux today.

// As a side note, videobuf2-dma-contig can actually allocate discontiguous memory, if the device is behind an IOMMU. Actually with the recent DMA API improvements, we could probably coalesce vb2-dma-contig and vb2-dma-sg into one vb2-dma backend.

That would probably make live a little bit simpler, yes.

...
...
...
...
The problem is that in this particular case the exporter provides the buffer as is, e.g. with dirty CPU caches. And the importer can't deal with that.

Why does the exporter leave the buffer with dirty CPU caches?

Because exporters always export the buffers as they would use it. And in this case that means writing with the CPU to it.

Sorry for the question not being very clear. I meant: How do the CPU caches get dirty in that case?

The exporter wrote to it. As far as I understand the exporter just copies things from A to B with memcpy to construct the buffer content.

Okay, so it's just due to CPU access and basically what we touched a few paragraphs above.

...

...
[SNIP]

...
Yes, totally agree. The problem is really that we moved bunch of MM and DMA functions in one API.

The bounce buffers are something we should really have in a separate component.

Then the functionality of allocating system memory for a specific device or devices should be something provided by the MM.

And finally making this memory or any other CPU address accessible to a device (IOMMU programming etc..) should then be part of an DMA API.

Remember that actually making the memory accessible to a device often needs to be handled already as a part of the allocation (e.g. dma_mask in the non-IOMMU case). So I tend to think that the current division of responsibilities is mostly fine - the dma_alloc family should be seen as a part of MM already, especially with all the recent improvements from Christoph, like dma_alloc_pages().

Yes, that's indeed a very interesting development which as far as I can see goes into the right direction.

...
That said, it may indeed make sense to move things like IOMMU mapping management out of the dma_alloc() and just reduce those functions to simply returning a set of pages that satisfy the allocation constraints. One would need to call dma_map() after the allocation, but that's actually a fair expectation. Sometimes it might be preferred to pre-allocate the memory, but only map it into the device address space when it's really necessary.

What I'm still missing is the functionality to allocate pages for multiple devices and proper error codes when dma_map() can't make the page accessible to a device.

Agreed. Although again, I think the more challenging part would be to get the complete list of devices involved before the allocation happens.

Best regards, Tomasz

...

Regards, Christian.

...
...
...
...
...
>> It's a use-case that is working fine today with many devices (e.g. network >> adapters) in the ARM world, exactly because the architecture specific >> implementation of the DMA API inserts the cache maintenance operations >> on buffer ownership transfer. > Yeah, I'm perfectly aware of that. The problem is that exactly that > design totally breaks GPUs on Xen DOM0 for example. > > And Xen is just one example, I can certainly say from experience that > this design was a really really bad idea because it favors just one use > case while making other use cases practically impossible if not really > hard to implement. Sorry, I haven't worked with Xen. Could you elaborate what's the problem that this introduces for it?

That's a bit longer topic. The AMD XEN devs are already working on this as far as I know. I can ping internally how far they got with sending the patches out to avoid this problem.

Hmm, I see. It might be a good data point to understand in which direction we should be going, but I guess we can wait until they send some patches.

There was just recently a longer thread on the amd-gfx mailing list about that. I think looking in there as well might be beneficial.

Okay, let me check. Thanks,

Best regards, Tomasz

Christian König

11:15 a.m.

Am 12.12.22 um 04:00 schrieb Tomasz Figa:

...

[SNIP]

...
What we could do is to force all exporters to use begin/end_cpu_access() even on it's own buffers and look at all the importers when the access is completed. But that would increase the complexity of the handling in the exporter.

I feel like they should be doing so anyway, because it often depends on the SoC integration whether the DMA can do cache snooping or not.

Yeah, but wouldn't help without the exporter taking care of the importers needs of cache flushing. And that's exactly what we try hard to avoid because it creates code in every exporter which is never tested except for the case that you combine this exporter with an non coherent importer.

That's a testing nightmare because then you everywhere has code which only in very few combinations of exporter and importer is actually used.

...

Although arguably, there is a corner case today where if one uses dma_alloc_coherent() to get a buffer with a coherent CPU mapping for device X that is declared as cache-coherent, one also expects not to need to call begin/end_cpu_access(), but those would be needed if the buffer was to be imported by device Y that is not cache-coherent...

Sounds like after all it's a mess. I guess your patches make it one step closer to something sensible, import would fail in such cases. Although arguably we should be able to still export from driver Y and import to driver X just fine if Y allocated the buffer as coherent - otherwise we would break existing users for whom things worked fine.

Allocating the buffer as coherent won't help in this case because we actually do want CPU caching for performant access. It's just that some device needs a cache flush before it sees all the changes.

As far as I can see without adding additional complexity to the exporter this can only be archived in two ways:

1. Switch the role of the exporter and importer. This way the device with the need for the cache flush is the exporter and in control of the operations on its buffers.

2. We use DMA-buf as neutral mediator. Since DMA-buf keeps track of who has mapped the buffers it inserts the appropriate dma_sync_*_for_device() calls.

...

...
[SNIP] The best we can do is to reject combinations which won't work in the kernel and then userspace could react accordingly.

The question is whether userspace is able to deal with it, given the limited amount of information it gets from the kernel. Sure, there is always the ultimate fallback of memcpy(), but in many cases that would be totally unusable due to severe performance impact. If we were to remove the existing extent of implicit handling from the kernel, I think we need to first give the userspace the information necessary to explicitly handle the fallback to the same extent.

Good point.

...

We also need to think about backwards compatibility. Simply removing the implicit fallback cases would probably break a lot of userspace, so an opt-in behavior is likely needed initially...

Yes, I'm completely aware of that as well.

We can't hard break userspace even if the previous approach didn't worked 100% correctly.

...

...
...
...
That's essentially the reason why we have DMA-buf heaps. Those heaps expose system memory buffers with certain properties (size, CMA, DMA bit restrictions etc...) and also implement the callback functions for CPU cache maintenance.

How do DMA-buf heaps change anything here? We already have CPU cache maintenance callbacks in the DMA-buf API - the begin/end_cpu_access() for CPU accesses and dmabuf_map/unmap_attachment() for device accesses (which arguably shouldn't actually do CPU cache maintenance, unless that's implied by how either of the involved DMA engines work).

DMA-buf heaps are the neutral man in the middle.

The implementation keeps track of all the attached importers and should make sure that the allocated memory fits the need of everyone. Additional to that calls to the cache DMA-api cache management functions are inserted whenever CPU access begins/ends.

I think in current design, it only knows all the importers after the buffer is already allocated, so it doesn't necessarily have a way to handle the allocation constraints. Something would have to be done to get all the importers attached before the allocation actually takes place.

That's already in place. See the attach and map callbacks.

I'm just not sure if heaps fully implements it like this.

...

...
...
...
...
...
The problem is that in this particular case the exporter provides the buffer as is, e.g. with dirty CPU caches. And the importer can't deal with that.

Why does the exporter leave the buffer with dirty CPU caches?

Because exporters always export the buffers as they would use it. And in this case that means writing with the CPU to it.

Sorry for the question not being very clear. I meant: How do the CPU caches get dirty in that case?

The exporter wrote to it. As far as I understand the exporter just copies things from A to B with memcpy to construct the buffer content.

Okay, so it's just due to CPU access and basically what we touched a few paragraphs above.

Yes, I've never seen a device which actually dirties the CPU cache. But I would never rule out that such a device exists.

Regards, Christian.

...

...
...
[SNIP]

...
Yes, totally agree. The problem is really that we moved bunch of MM and DMA functions in one API.

The bounce buffers are something we should really have in a separate component.

Then the functionality of allocating system memory for a specific device or devices should be something provided by the MM.

And finally making this memory or any other CPU address accessible to a device (IOMMU programming etc..) should then be part of an DMA API.

Remember that actually making the memory accessible to a device often needs to be handled already as a part of the allocation (e.g. dma_mask in the non-IOMMU case). So I tend to think that the current division of responsibilities is mostly fine - the dma_alloc family should be seen as a part of MM already, especially with all the recent improvements from Christoph, like dma_alloc_pages().

Yes, that's indeed a very interesting development which as far as I can see goes into the right direction.

...
That said, it may indeed make sense to move things like IOMMU mapping management out of the dma_alloc() and just reduce those functions to simply returning a set of pages that satisfy the allocation constraints. One would need to call dma_map() after the allocation, but that's actually a fair expectation. Sometimes it might be preferred to pre-allocate the memory, but only map it into the device address space when it's really necessary.

What I'm still missing is the functionality to allocate pages for multiple devices and proper error codes when dma_map() can't make the page accessible to a device.

Agreed. Although again, I think the more challenging part would be to get the complete list of devices involved before the allocation happens.

Best regards, Tomasz

...
Regards, Christian.

...
...
...
...
>>> It's a use-case that is working fine today with many devices (e.g. network >>> adapters) in the ARM world, exactly because the architecture specific >>> implementation of the DMA API inserts the cache maintenance operations >>> on buffer ownership transfer. >> Yeah, I'm perfectly aware of that. The problem is that exactly that >> design totally breaks GPUs on Xen DOM0 for example. >> >> And Xen is just one example, I can certainly say from experience that >> this design was a really really bad idea because it favors just one use >> case while making other use cases practically impossible if not really >> hard to implement. > Sorry, I haven't worked with Xen. Could you elaborate what's the > problem that this introduces for it? That's a bit longer topic. The AMD XEN devs are already working on this as far as I know. I can ping internally how far they got with sending the patches out to avoid this problem.

Hmm, I see. It might be a good data point to understand in which direction we should be going, but I guess we can wait until they send some patches.

There was just recently a longer thread on the amd-gfx mailing list about that. I think looking in there as well might be beneficial.

Okay, let me check. Thanks,

Best regards, Tomasz

Sebastian Wick

5 Jan 5 Jan

8:39 p.m.

On Fri, Dec 9, 2022 at 11:28 AM Christian König christian.koenig@amd.com wrote:

...

Am 09.12.22 um 09:26 schrieb Tomasz Figa:

...
[SNIP] Although I think the most common case on mainstream Linux today is properly allocating for device X (e.g. V4L2 video decoder or DRM-based GPU) and hoping that other devices would accept the buffers just fine, which isn't a given on most platforms (although often it's just about pixel format, width/height/stride alignment, tiling, etc. rather than the memory itself). That's why ChromiumOS has minigbm and Android has gralloc that act as the central point of knowledge on buffer allocation.

Yeah, completely agree. The crux is really that we need to improve those user space allocators by providing more information from the kernel.

...
...

We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only guess if normal, CMA or even device specific memory is needed.

DMA-buf heaps actually make it even more difficult for the userspace, because now it needs to pick the right heap. With allocation built into the specific UAPI (like V4L2), it's at least possible to allocate for one specific device without having any knowledge about allocation constraints in the userspace.

Yes, same what Daniel said as well. We need to provide some more hints which allocator to use from the kernel.

This information doesn't necessarily have to come from the kernel. For KMS it makes sense for the kernel to provide it but Vulkan drivers for example could have the information entirely in the UMD. The important point is that user space can somehow query which heap can be used with which constraints for a specific operation. At that point it's entirely a user space problem.

...

...
...
...
...
...
> So if a device driver uses cached system memory on an architecture which > devices which can't access it the right approach is clearly to reject > the access. I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it was allocated.

If that means the the exporter provides a piece of memory which requires CPU cache snooping to access correctly then the best thing we can do is to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice? Do I assume correctly that it's about sharing a buffer between one DMA engine that is cache-coherent and another that is non-coherent, where the first one ends up having its accesses always go through some kind of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one device driver wrote to it's internal buffer with the CPU (so some cache lines where dirty) and then a device which couldn't deal with that tried to access it.

If so, shouldn't that driver surround its CPU accesses with begin/end_cpu_access() in the first place?

The problem is that the roles are reversed. The callbacks let the exporter knows that it needs to flush the caches when the importer is done accessing the buffer with the CPU.

But here the exporter is the one accessing the buffer with the CPU and the importer then accesses stale data because it doesn't snoop the caches.

What we could do is to force all exporters to use begin/end_cpu_access() even on it's own buffers and look at all the importers when the access is completed. But that would increase the complexity of the handling in the exporter.

In other words we would then have code in the exporters which is only written for handling the constrains of the importers. This has a wide variety of consequences, especially that this functionality of the exporter can't be tested without a proper importer.

I was also thinking about reversing the role of exporter and importer in the kernel, but came to the conclusion that doing this under the hood without userspace knowing about it is probably not going to work either.

...
The case that I was suggesting was of a hardware block that actually sits behind the CPU cache and thus dirties it on writes, not the driver doing that. (I haven't personally encountered such a system, though.)

Never heard of anything like that either, but who knows.

...
...
We could say that all device drivers must always look at the coherency of the devices which want to access their buffers. But that would horrible complicate things for maintaining the drivers because then drivers would need to take into account requirements from other drivers while allocating their internal buffers.

I think it's partially why we have the allocation part of the DMA mapping API, but currently it's only handling requirements of one device. And we don't have any information from the userspace what other devices the buffer would be used with...

Exactly that, yes.

...
Actually, do we even have such information in the userspace today? Let's say I do a video call in a web browser on a typical Linux system. I have a V4L2 camera, VAAPI video encoder and X11 display. The V4L2 camera fills in buffers with video frames and both encoder and display consume them. Do we have a central place which would know that a buffer needs to be allocated that works with the producer and all consumers?

Both X11 and Wayland have protocols which can be used to display a certain DMA-buf handle, their feedback packages contain information how ideal your configuration is, e.g. if the DMA-buf handle could be used or if an extra copy was needed etc...

Similar exists between VAAPI and V4L2 as far as I know, but as you noted as well here it's indeed more about negotiating pixel format, stride, padding, alignment etc...

The best we can do is to reject combinations which won't work in the kernel and then userspace could react accordingly.

...
...
That's essentially the reason why we have DMA-buf heaps. Those heaps expose system memory buffers with certain properties (size, CMA, DMA bit restrictions etc...) and also implement the callback functions for CPU cache maintenance.

How do DMA-buf heaps change anything here? We already have CPU cache maintenance callbacks in the DMA-buf API - the begin/end_cpu_access() for CPU accesses and dmabuf_map/unmap_attachment() for device accesses (which arguably shouldn't actually do CPU cache maintenance, unless that's implied by how either of the involved DMA engines work).

DMA-buf heaps are the neutral man in the middle.

The implementation keeps track of all the attached importers and should make sure that the allocated memory fits the need of everyone. Additional to that calls to the cache DMA-api cache management functions are inserted whenever CPU access begins/ends.

That's the best we can do for system memory sharing, only device specific memory can't be allocated like this.

...
...
...
...
We do have the system and CMA dma-buf heap for cases where a device driver which wants to access the buffer with caches enabled. So this is not a limitation in functionality, it's just a matter of correctly using it.

V4L2 also has the ability to allocate buffers and map them with caches enabled.

Yeah, when that's a requirement for the V4L2 device it also makes perfect sense.

It's just that we shouldn't force any specific allocation behavior on a device driver because of the requirements of a different device.

Giving an example a V4L2 device shouldn't be forced to use videobuf2-dma-contig because some other device needs CMA. Instead we should use the common DMA-buf heaps which implement this as neutral supplier of system memory buffers.

Agreed, but that's only possible if we have a central entity that understands what devices the requested buffer would be used with. My understanding is that we're nowhere close to that with mainstream Linux today.

// As a side note, videobuf2-dma-contig can actually allocate discontiguous memory, if the device is behind an IOMMU. Actually with the recent DMA API improvements, we could probably coalesce vb2-dma-contig and vb2-dma-sg into one vb2-dma backend.

That would probably make live a little bit simpler, yes.

...
...
...
...
The problem is that in this particular case the exporter provides the buffer as is, e.g. with dirty CPU caches. And the importer can't deal with that.

Why does the exporter leave the buffer with dirty CPU caches?

Because exporters always export the buffers as they would use it. And in this case that means writing with the CPU to it.

Sorry for the question not being very clear. I meant: How do the CPU caches get dirty in that case?

The exporter wrote to it. As far as I understand the exporter just copies things from A to B with memcpy to construct the buffer content.

...
[SNIP]

...
Yes, totally agree. The problem is really that we moved bunch of MM and DMA functions in one API.

The bounce buffers are something we should really have in a separate component.

Then the functionality of allocating system memory for a specific device or devices should be something provided by the MM.

And finally making this memory or any other CPU address accessible to a device (IOMMU programming etc..) should then be part of an DMA API.

Remember that actually making the memory accessible to a device often needs to be handled already as a part of the allocation (e.g. dma_mask in the non-IOMMU case). So I tend to think that the current division of responsibilities is mostly fine - the dma_alloc family should be seen as a part of MM already, especially with all the recent improvements from Christoph, like dma_alloc_pages().

Yes, that's indeed a very interesting development which as far as I can see goes into the right direction.

...
That said, it may indeed make sense to move things like IOMMU mapping management out of the dma_alloc() and just reduce those functions to simply returning a set of pages that satisfy the allocation constraints. One would need to call dma_map() after the allocation, but that's actually a fair expectation. Sometimes it might be preferred to pre-allocate the memory, but only map it into the device address space when it's really necessary.

What I'm still missing is the functionality to allocate pages for multiple devices and proper error codes when dma_map() can't make the page accessible to a device.

Regards, Christian.

...
...
...
...
...
>> It's a use-case that is working fine today with many devices (e.g. network >> adapters) in the ARM world, exactly because the architecture specific >> implementation of the DMA API inserts the cache maintenance operations >> on buffer ownership transfer. > Yeah, I'm perfectly aware of that. The problem is that exactly that > design totally breaks GPUs on Xen DOM0 for example. > > And Xen is just one example, I can certainly say from experience that > this design was a really really bad idea because it favors just one use > case while making other use cases practically impossible if not really > hard to implement. Sorry, I haven't worked with Xen. Could you elaborate what's the problem that this introduces for it?

That's a bit longer topic. The AMD XEN devs are already working on this as far as I know. I can ping internally how far they got with sending the patches out to avoid this problem.

Hmm, I see. It might be a good data point to understand in which direction we should be going, but I guess we can wait until they send some patches.

There was just recently a longer thread on the amd-gfx mailing list about that. I think looking in there as well might be beneficial.

Okay, let me check. Thanks,

Best regards, Tomasz

Rob Clark

4 Nov 4 Nov

2:58 p.m.

On Wed, Nov 2, 2022 at 5:21 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:

...

Hi Lucas,

Am 02.11.22 um 12:39 schrieb Lucas Stach:

...
Hi Christian,

going to reply in more detail when I have some more time, so just some quick thoughts for now.

Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:

...
Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

...
[SNIP]

As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys hasn't pushed back on this practice already.

This only works because the Wayland as well as X display pipeline is smart enough to insert an extra copy when it find that an imported buffer can't be used as a framebuffer directly.

With bracketed access you could even make this case work, as the dGPU would be able to slurp a copy of the dma-buf into LMEM for scanout.

Well, this copy is what we are trying to avoid here. The codec should pump the data into LMEM in the first place.

For the dGPU VRAM case, shouldn't this be amdgpu re-importing it's own dmabuf, which more or less bypasses dma-buf to get back the backing GEM obj?

BR, -R

Christian König

3 p.m.

Am 04.11.22 um 15:58 schrieb Rob Clark:

...

On Wed, Nov 2, 2022 at 5:21 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:

...
Hi Lucas,

Am 02.11.22 um 12:39 schrieb Lucas Stach:

...
Hi Christian,

going to reply in more detail when I have some more time, so just some quick thoughts for now.

Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:

...
Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

...
[SNIP]

As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys hasn't pushed back on this practice already.

This only works because the Wayland as well as X display pipeline is smart enough to insert an extra copy when it find that an imported buffer can't be used as a framebuffer directly.

With bracketed access you could even make this case work, as the dGPU would be able to slurp a copy of the dma-buf into LMEM for scanout.

Well, this copy is what we are trying to avoid here. The codec should pump the data into LMEM in the first place.

For the dGPU VRAM case, shouldn't this be amdgpu re-importing it's own dmabuf, which more or less bypasses dma-buf to get back the backing GEM obj?

When the codec and scanout is on the same device, then yes.

But we already had cases where the codec was on the dGPU and scanout happened on the integrated engine.

Regards, Christian.

...

BR, -R

Pekka Paalanen

2 Nov 2 Nov

12:19 p.m.

On Wed, 2 Nov 2022 12:18:01 +0100 Christian König ckoenig.leichtzumerken@gmail.com wrote:

...

Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

...
[SNIP]

...
...
But the client is just a video player. It doesn't understand how to allocate BOs for Panfrost or AMD or etnaviv. So without a universal allocator (again ...), 'just allocate on the GPU' isn't a useful response to the client.

Well exactly that's the point I'm raising: The client *must* understand that!

See we need to be able to handle all restrictions here, coherency of the data is just one of them.

For example the much more important question is the location of the data and for this allocating from the V4L2 device is in most cases just not going to fly.

It feels like this is a generic statement and there is no reason it could not be the other way around.

And exactly that's my point. You always need to look at both ways to share the buffer and can't assume that one will always work.

As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys hasn't pushed back on this practice already.

What should we Wayland people be pushing back on exactly? And where is our authority and opportunity to do so?

The Wayland protocol dmabuf extension allows a graceful failure if the Wayland compositor cannot use the given dmabuf at all, giving the client an opportunity to try something else. The Wayland protocol also tells clients which DRM rendering device at minimum the dmabuf needs to be compatible with. It even tells which KMS device the dmabuf could be put on direct scanout if the dmabuf was suitable for that and direct scanout is otherwise possible.

What the client (application) does with all that information is up to the client. That code is not part of Wayland.

I'm sure we would be happy to add protocol for anything that https://github.com/cubanismo/allocator needs to become the universal optimal buffer allocator library.

...

This only works because the Wayland as well as X display pipeline is smart enough to insert an extra copy when it find that an imported buffer can't be used as a framebuffer directly.

The only fallback Wayland compositors tend to do is use OpenGL/Vulkan rendering for composition if direct scanout on a KMS plane does not work. There are many reasons why direct scanout may not be possible in addition to hardware and drivers not agreeing to do it with the given set of buffers.

A general purpose (read: desktop) Wayland compositor simply cannot live without being able to fall back from KMS composition to software/GPU composition.

But yes, there are use cases where that fallback is as good as failing completely. Those are not desktops but more like set-top-boxes and TVs.

Thanks, pq

Christian König

12:27 p.m.

Am 02.11.22 um 13:19 schrieb Pekka Paalanen:

...

On Wed, 2 Nov 2022 12:18:01 +0100 Christian König ckoenig.leichtzumerken@gmail.com wrote:

...
Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

...
[SNIP]

...
...
But the client is just a video player. It doesn't understand how to allocate BOs for Panfrost or AMD or etnaviv. So without a universal allocator (again ...), 'just allocate on the GPU' isn't a useful response to the client.

Well exactly that's the point I'm raising: The client *must* understand that!

See we need to be able to handle all restrictions here, coherency of the data is just one of them.

For example the much more important question is the location of the data and for this allocating from the V4L2 device is in most cases just not going to fly.

It feels like this is a generic statement and there is no reason it could not be the other way around.

And exactly that's my point. You always need to look at both ways to share the buffer and can't assume that one will always work.

As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys hasn't pushed back on this practice already.

What should we Wayland people be pushing back on exactly? And where is our authority and opportunity to do so?

The Wayland protocol dmabuf extension allows a graceful failure if the Wayland compositor cannot use the given dmabuf at all, giving the client an opportunity to try something else.

That's exactly what I meant with pushing back :)

I wasn't aware that this handling is already implemented.

...

The Wayland protocol also tells clients which DRM rendering device at minimum the dmabuf needs to be compatible with. It even tells which KMS device the dmabuf could be put on direct scanout if the dmabuf was suitable for that and direct scanout is otherwise possible.

Yeah, perfect. Exactly that's what's needed here.

...

What the client (application) does with all that information is up to the client. That code is not part of Wayland.

I'm sure we would be happy to add protocol for anything that https://github.com/cubanismo/allocator needs to become the universal optimal buffer allocator library.

From what you wrote it's already perfectly covered.

...

...
This only works because the Wayland as well as X display pipeline is smart enough to insert an extra copy when it find that an imported buffer can't be used as a framebuffer directly.

The only fallback Wayland compositors tend to do is use OpenGL/Vulkan rendering for composition if direct scanout on a KMS plane does not work. There are many reasons why direct scanout may not be possible in addition to hardware and drivers not agreeing to do it with the given set of buffers.

A general purpose (read: desktop) Wayland compositor simply cannot live without being able to fall back from KMS composition to software/GPU composition.

But yes, there are use cases where that fallback is as good as failing completely. Those are not desktops but more like set-top-boxes and TVs.

Completely agree to this approach.

The only problem is that media players tend to not implement a way to allow direct scanout because of those fallback paths.

But as you said that's their decision.

Thanks, Christian.

...

Thanks, pq

Pekka Paalanen

12:50 p.m.

On Wed, 2 Nov 2022 13:27:18 +0100 Christian König ckoenig.leichtzumerken@gmail.com wrote:

...

Am 02.11.22 um 13:19 schrieb Pekka Paalanen:

...
On Wed, 2 Nov 2022 12:18:01 +0100 Christian König ckoenig.leichtzumerken@gmail.com wrote:

...
Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

...
[SNIP]

...
...
But the client is just a video player. It doesn't understand how to allocate BOs for Panfrost or AMD or etnaviv. So without a universal allocator (again ...), 'just allocate on the GPU' isn't a useful response to the client.

Well exactly that's the point I'm raising: The client *must* understand that!

See we need to be able to handle all restrictions here, coherency of the data is just one of them.

For example the much more important question is the location of the data and for this allocating from the V4L2 device is in most cases just not going to fly.

It feels like this is a generic statement and there is no reason it could not be the other way around.

And exactly that's my point. You always need to look at both ways to share the buffer and can't assume that one will always work.

As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys hasn't pushed back on this practice already.

What should we Wayland people be pushing back on exactly? And where is our authority and opportunity to do so?

The Wayland protocol dmabuf extension allows a graceful failure if the Wayland compositor cannot use the given dmabuf at all, giving the client an opportunity to try something else.

That's exactly what I meant with pushing back :)

I wasn't aware that this handling is already implemented.

Well... it's "optional". The Wayland dmabuf protocol has two cases for a client/app:

a) you do the right thing and wait for the compositor to ack that the dmabuf works (the reply should come pretty much immediately, not needing the compositor to actually composite), or

b) you just send the dmabuf and continue as if it always worked. If it doesn't, you might get a protocol error later and be disconnected.

Guess which one Mesa uses?

I've been told Mesa has no way to handle a failure there, so it doesn't. I would not be surprised if others just copy that behaviour.

I recall personally being against adding option b) to begin with, but there it is, added for Mesa IIRC.

...

...
The Wayland protocol also tells clients which DRM rendering device at minimum the dmabuf needs to be compatible with. It even tells which KMS device the dmabuf could be put on direct scanout if the dmabuf was suitable for that and direct scanout is otherwise possible.

Yeah, perfect. Exactly that's what's needed here.

...
What the client (application) does with all that information is up to the client. That code is not part of Wayland.

I'm sure we would be happy to add protocol for anything that https://github.com/cubanismo/allocator needs to become the universal optimal buffer allocator library.

From what you wrote it's already perfectly covered.

Cool. Now if clients would just use it...

Thanks, pq

...

...
...
This only works because the Wayland as well as X display pipeline is smart enough to insert an extra copy when it find that an imported buffer can't be used as a framebuffer directly.

The only fallback Wayland compositors tend to do is use OpenGL/Vulkan rendering for composition if direct scanout on a KMS plane does not work. There are many reasons why direct scanout may not be possible in addition to hardware and drivers not agreeing to do it with the given set of buffers.

A general purpose (read: desktop) Wayland compositor simply cannot live without being able to fall back from KMS composition to software/GPU composition.

But yes, there are use cases where that fallback is as good as failing completely. Those are not desktops but more like set-top-boxes and TVs.

Completely agree to this approach.

The only problem is that media players tend to not implement a way to allow direct scanout because of those fallback paths.

But as you said that's their decision.

Thanks, Christian.

...
Thanks, pq

Christian König

12:56 p.m.

Am 02.11.22 um 13:50 schrieb Pekka Paalanen:

...

On Wed, 2 Nov 2022 13:27:18 +0100 Christian König ckoenig.leichtzumerken@gmail.com wrote:

...
Am 02.11.22 um 13:19 schrieb Pekka Paalanen:

...
On Wed, 2 Nov 2022 12:18:01 +0100 Christian König ckoenig.leichtzumerken@gmail.com wrote:

...
Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

...
[SNIP]

...
> But the client is just a video player. It doesn't understand how to > allocate BOs for Panfrost or AMD or etnaviv. So without a universal > allocator (again ...), 'just allocate on the GPU' isn't a useful > response to the client. Well exactly that's the point I'm raising: The client *must* understand that!

See we need to be able to handle all restrictions here, coherency of the data is just one of them.

For example the much more important question is the location of the data and for this allocating from the V4L2 device is in most cases just not going to fly.

It feels like this is a generic statement and there is no reason it could not be the other way around.

And exactly that's my point. You always need to look at both ways to share the buffer and can't assume that one will always work.

As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys hasn't pushed back on this practice already.

What should we Wayland people be pushing back on exactly? And where is our authority and opportunity to do so?

The Wayland protocol dmabuf extension allows a graceful failure if the Wayland compositor cannot use the given dmabuf at all, giving the client an opportunity to try something else.

That's exactly what I meant with pushing back :)

I wasn't aware that this handling is already implemented.

Well... it's "optional". The Wayland dmabuf protocol has two cases for a client/app:

a) you do the right thing and wait for the compositor to ack that the dmabuf works (the reply should come pretty much immediately, not needing the compositor to actually composite), or

b) you just send the dmabuf and continue as if it always worked. If it doesn't, you might get a protocol error later and be disconnected.

Guess which one Mesa uses?

I've been told Mesa has no way to handle a failure there, so it doesn't. I would not be surprised if others just copy that behaviour.

I recall personally being against adding option b) to begin with, but there it is, added for Mesa IIRC.

Well I'm not sure if those projects actually used X or Wayland, but we did had cases where this was really used.

IIRC in the case of Prime offloading we now allocate a buffer from the displaying device in Mesa and do the copy from the rendering GPU to the displaying GPU on the client side.

The background is that this gave us much better control which engine and parameters where used for the copy resulting in a nice performance improvement.

Regards, Christian.

Nicolas Dufresne

3 Nov 3 Nov

10:16 p.m.

Le mercredi 02 novembre 2022 à 12:18 +0100, Christian König a écrit :

...

Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

...
[SNIP]

...
...
But the client is just a video player. It doesn't understand how to allocate BOs for Panfrost or AMD or etnaviv. So without a universal allocator (again ...), 'just allocate on the GPU' isn't a useful response to the client.

Well exactly that's the point I'm raising: The client *must* understand that!

See we need to be able to handle all restrictions here, coherency of the data is just one of them.

For example the much more important question is the location of the data and for this allocating from the V4L2 device is in most cases just not going to fly.

It feels like this is a generic statement and there is no reason it could not be the other way around.

And exactly that's my point. You always need to look at both ways to share the buffer and can't assume that one will always work.

As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.

That paragraph is a bit sloppy. By "you guys" you mean what exactly ? Normal users will let V4L2 device allocate and write into their own memory (the device fill it, not "you guys"). This is done like this simply because this is guarantied to work with the V4L2 device. Most V4L2 device produces known by userpsace pixel formats and layout, for which userspace know for sure it can implement a GPU shader or software fallback for. I'm still to see one of these format that cannot be efficiently imported into a modern GPU and converted using shaders. I'm not entirely sure what/which GPU a dGPU is compared to a GPU btw.

In many cases, camera kind of V4L2 devices will have 1 producer for many consumers. Consider your photo application, the streams will likely be capture and displayed while being encoded by one of more CODEC, while being streamed to a Machine Learning model for analyses. The software complexity to communicate back the list of receiver devices and implementing all their non-standard way to allocate memory, so all the combination of trial and error is just ridiculously high. Remember that each GPU have their own allocation methods and corner cases, this is simply not manageable by "you guys", which I pretty much assume is everyone writing software for Generic Linux these days (non-Android/ChromeOS).

...

To be honest I'm really surprised that the Wayland guys hasn't pushed back on this practice already.

This only works because the Wayland as well as X display pipeline is smart enough to insert an extra copy when it find that an imported buffer can't be used as a framebuffer directly.

This is a bit inaccurate. The compositor I've worked with (Gnome and Weston) will only memcpy SHM. For DMABuf, they will fail importation if its not usable either by the display or the GPU. Specially on the GPU side though (which is the ultimate compositor fallback), there exists efficient HW copy mechanism that may be used, and this is fine, since unlike your scannout example, it won't be uploading over and over, but will do later re-display from a remote copy (or transformed copy). Or if you prefer, its cached at the cost of higher memory usage.

I think it would be preferable to speak about device to device sharing, since V4L2 vs GPU is not really representative of the program. I think V4L2 vs GPU and "you guys" simply contribute to the never ending, and needless friction around that difficulty that exists with current support for memory sharing in Linux.

...

...
I have colleague who integrated PCIe CODEC (Blaize Xplorer X1600P PCIe Accelerator) hosting their own RAM. There was large amount of ways to use it. Of course, in current state of DMABuf, you have to be an exporter to do anything fancy, but it did not have to be like this, its a design choice. I'm not sure in the end what was the final method used, the driver isn't yet upstream, so maybe that is not even final. What I know is that there is various condition you may use the CODEC for which the optimal location will vary. As an example, using the post processor or not, see my next comment for more details.

Yeah, and stuff like this was already discussed multiple times. Local memory of devices can only be made available by the exporter, not the importer.

So in the case of separated camera and encoder you run into exactly the same limitation that some device needs the allocation to happen on the camera while others need it on the encoder.

...
...
The more common case is that you need to allocate from the GPU and then import that into the V4L2 device. The background is that all dGPUs I know of need the data inside local memory (VRAM) to be able to scan out from it.

The reality is that what is common to you, might not be to others. In my work, most ARM SoC have display that just handle direct scannout from cameras and codecs.

...
The only case the commonly fails is whenever we try to display UVC created dmabuf,

Well, exactly that's not correct! The whole x86 use cases of direct display for dGPUs are broken because media players think they can do the simple thing and offload all the problematic cases to the display server.

This is absolutely *not* the common use case you describe here, but rather something completely special to ARM.

sigh .. The UVC failures was first discovered on my Intel PC, and later reproduced on ARM. Userspace expected driver(s) (V4L2 exports, DRM imports) should have rejected the DMABuf import (I kind of know, I wrote this part). From a userspace point of you, unlike what you stipulate here, there was no fault. You said already that importer / exporter role is to be tried, the order you try should not matter. So yes, today's userspace may lack the ability to flip the roles, but at least it tries, and if the driver does not fail, you can't blame userspace for at least trying to achieve decent performance.

I'd like to remind that this is clearly all kernel bugs, and we cannot state that kernel drivers "are broken because media player". Just the fact that this thread starts from a kernel changes kind of prove it. Would be nice also for you to understand that I'm not against the method used in this patchset, but I'm not against a bracketing mechanism either, as I think the former can improve, where the first one only give more "correct" results.

...

...
which have dirty CPU write cache and this is the type of thing we'd like to see solved. I think this series was addressing it in principle, but failing the import and the raised point is that this wasn't the optimal way.

There is a community project called LibreELEC, if you aren't aware, they run Khodi with direct scanout of video stream on a wide variety of SoC and they use the CODEC as exporter all the time. They simply don't have cases were the opposite is needed (or any kind of remote RAM to deal with). In fact, FFMPEG does not really offer you any API to reverse the allocation.

Ok, let me try to explain it once more. It sounds like I wasn't able to get my point through.

That we haven't heard anybody screaming that x86 doesn't work is just because we handle the case that a buffer isn't directly displayable in X/Wayland anyway, but this is absolutely not the optimal solution.

Basically, you are complaining that compositor will use GPU shaders to adapt the buffers for the display. Most display don't do or have limited YUV support, flipping the roles or bracketing won't help that. Using a GPU shader to adapt it, like compositor and userspace do seems all right. And yes, sometimes the memory will get imported into the GPU very efficiently, something in the mid- range, and other times some GPU stack (which is userspace) will memcpy. But remember that the GPU stack is programmed to work with a specific GPU, not the higher level userland.

...

The argument that you want to keep the allocation on the codec side is completely false as far as I can see.

I haven't made this argument, and don't intend to do so. There is nothing in this thread that should be interpreted as I want, or not want. I want the same thing as everyone on this list, which is both performance and correct results.

...

We already had numerous projects where we reported this practice as bugs to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.

Links ? Remember that I do read every single bugs and emails around GStreamer project. I do maintain older and newer V4L2 support in there. I also did contribute a lot to the mechanism GStreamer have in-place to reverse the allocation. In fact, its implemented, the problem being that on generic Linux, the receiver element, like the GL element and the display sink don't have any API they can rely on to allocate memory. Thus, they don't implement what we call the allocation offer in GStreamer term. Very often though, on other modern OS, or APIs like VA, the memory offer is replaced by a context. So the allocation is done from a "context" which is neither an importer or an exporter. This is mostly found on MacOS and Windows.

Was there APIs suggested to actually make it manageable by userland to allocate from the GPU? Yes, this what Linux Device Allocator idea is for. Is that API ready, no.

Can we at least implement some DRM memory allocation, yes, but remember that, until very recently, the DRM drivers used by the display path was not exposed through Wayland. This issue has only been resolved recently, it will take some time before this propagate through compositors code. And you need compositor implementation to do GL and multimedia stack implementation. Please, keep in mind before raising bad practice by GStreamer and FFMPEG developers that getting all the bit and pieces in place require back and forth, there has been huge gaps that these devs were not able to overcome yet. Also, remember that these stack don't have any contract to support Linux. They support it to the best of their knowledge and capabilities, along with Windows, MacOS, IOS, Android and more. And to my experience, memory sharing have different challenges in all of these OS.

...

This is just a software solution which works because of coincident and not because of engineering.

Another argument I can't really agree with, there is a lot of effort put into fallback (mostly GPU fallback) in various software stack. These fallback are engineered to guaranty you can display your frames. That case I've raised should have ended well with a GPU/CPU fallback, but a kernel bug broke the ability to fallback. If the kernel had rejected the import (your series ?), or offered a bracketing mechanism (for the UVC case, both method would have worked), the end results would have just worked.

I would not disagree if someone states that DMAbuf in UVC driver is an abuse. The driver simply memcpy chunk of variable size data streamed by the usb camera into a normal memory buffer. So why is that exported as a dmabuf ? I don't have an strong opinion on that, but if you think this is wrong, then it proves my point that this is a kernel bug. The challenge here is to come up with how we will fix this, and sharing a good understanding of what today's userspace do, and why they do so is key to make proper designs. As I started, writing code for DMABuf subsystem is out of reach for me, I can only share what existing software do, and why it does it like this.

Nicolas

...

Regards, Christian.

Christian König

4 Nov 4 Nov

9:03 a.m.

Am 03.11.22 um 23:16 schrieb Nicolas Dufresne:

...

[SNIP]

...
We already had numerous projects where we reported this practice as bugs to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.

Links ? Remember that I do read every single bugs and emails around GStreamer project. I do maintain older and newer V4L2 support in there. I also did contribute a lot to the mechanism GStreamer have in-place to reverse the allocation. In fact, its implemented, the problem being that on generic Linux, the receiver element, like the GL element and the display sink don't have any API they can rely on to allocate memory. Thus, they don't implement what we call the allocation offer in GStreamer term. Very often though, on other modern OS, or APIs like VA, the memory offer is replaced by a context. So the allocation is done from a "context" which is neither an importer or an exporter. This is mostly found on MacOS and Windows.

Was there APIs suggested to actually make it manageable by userland to allocate from the GPU? Yes, this what Linux Device Allocator idea is for. Is that API ready, no.

Well, that stuff is absolutely ready: https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/heaps/system_... What do you think I'm talking about all the time?

DMA-buf has a lengthy section about CPU access to buffers and clearly documents how all of that is supposed to work: https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L11... This includes braketing of CPU access with dma_buf_begin_cpu_access() and dma_buf_end_cpu_access(), as well as transaction management between devices and the CPU and even implicit synchronization.

This specification is then implemented by the different drivers including V4L2: https://elixir.bootlin.com/linux/latest/source/drivers/media/common/videobuf...

As well as the different DRM drivers: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/gem/i915... https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/am...

This design was then used by us with various media players on different customer projects, including QNAP https://www.qnap.com/en/product/ts-877 as well as the newest Tesla https://www.amd.com/en/products/embedded-automotive-solutions

I won't go into the details here, but we are using exactly the approach I've outlined to let userspace control the DMA between the different device in question. I'm one of the main designers of that and our multimedia and mesa team has up-streamed quite a number of changes for this project.

I'm not that well into different ARM based solutions because we are just recently getting results that this starts to work with AMD GPUs, but I'm pretty sure that the design should be able to handle that as well.

So we have clearly prove that this design works, even with special requirements which are way more complex than what we are discussing here. We had cases where we used GStreamer to feed DMA-buf handles into multiple devices with different format requirements and that seems to work fine.

-----

But enough of this rant. As I wrote Lucas as well this doesn't help us any further in the technical discussion.

The only technical argument I have is that if some userspace applications fail to use the provided UAPI while others use it correctly then this is clearly not a good reason to change the UAPI, but rather an argument to change the applications.

If the application should be kept simple and device independent then allocating the buffer from the device independent DMA heaps would be enough as well. Cause that provider implements the necessary handling for dma_buf_begin_cpu_access() and dma_buf_end_cpu_access().

I'm a bit surprised that we are arguing about stuff like this because we spend a lot of effort trying to document this. Daniel gave me the job to fix this documentation, but after reading through it multiple times now I can't seem to find where the design and the desired behavior is unclear.

What is clearly a bug in the kernel is that we don't reject things which won't work correctly and this is what this patch here addresses. What we could talk about is backward compatibility for this patch, cause it might look like it breaks things which previously used to work at least partially.

Regards, Christian.

Nicolas Dufresne

1:38 p.m.

Le vendredi 04 novembre 2022 à 10:03 +0100, Christian König a écrit :

...

Am 03.11.22 um 23:16 schrieb Nicolas Dufresne:

...
[SNIP]

Was there APIs suggested to actually make it manageable by userland to allocate from the GPU? Yes, this what Linux Device Allocator idea is for. Is that API ready, no.

Well, that stuff is absolutely ready: https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/heaps/system_... What do you think I'm talking about all the time?

I'm aware of DMA Heap, still have few gaps, but this unrelated to coherency (we can discuss offline, with Daniel S.). For DMABuf Heap, its used in many forks by vendors in production. There is an upstream proposal for GStreamer, but review comments were never addressed, in short, its stalled, and it waiting for a volunteer. It might also be based on very old implementation of DMABuf Heap, needs to be verified in depth for sure as the time have passed.

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/1391

...

DMA-buf has a lengthy section about CPU access to buffers and clearly documents how all of that is supposed to work: https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L11... This includes braketing of CPU access with dma_buf_begin_cpu_access() and dma_buf_end_cpu_access(), as well as transaction management between devices and the CPU and even implicit synchronization.

This specification is then implemented by the different drivers including V4L2: https://elixir.bootlin.com/linux/latest/source/drivers/media/common/videobuf...

As well as the different DRM drivers: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/gem/i915... https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/am...

I know, I've implement the userspace bracketing for this in GStreamer [1] before DMAbuf Heap was merged and was one of the reporter for the missing bracketing in VB2. Was tested against i915 driver. Note, this is just a fallback, the performance is terrible, memory exported by (at least my old i915 HW) is not cacheable on CPU. Though, between corrupted image and bad performance or just bad performance, we decided that it was better to have the second. When the DMABuf is backed by CPU cacheable memory, peformance is great and CPU fallback works. Work is in progress to better handle these two cases generically. For now, sometimes the app need to get involved, but this is only happens on embedded/controlled kind of use cases. What matters is that application that needs this can do it.

[1] https://gitlab.freedesktop.org/gstreamer/gstreamer/-/blob/main/subprojects/g...

...

This design was then used by us with various media players on different customer projects, including QNAP https://www.qnap.com/en/product/ts-877 as well as the newest Tesla https://www.amd.com/en/products/embedded-automotive-solutions

I won't go into the details here, but we are using exactly the approach I've outlined to let userspace control the DMA between the different device in question. I'm one of the main designers of that and our multimedia and mesa team has up-streamed quite a number of changes for this project.

I'm not that well into different ARM based solutions because we are just recently getting results that this starts to work with AMD GPUs, but I'm pretty sure that the design should be able to handle that as well.

So we have clearly prove that this design works, even with special requirements which are way more complex than what we are discussing here. We had cases where we used GStreamer to feed DMA-buf handles into multiple devices with different format requirements and that seems to work fine.

Sounds like you have a love/hate relationship with GStreamer. Glad the framework is working for you too. The framework have had bidirectional memory allocation for over a decade, it also has context sharing for stacks like D3D11,12/GL/Vulkan/CUDA etc. I strictly didn't understand what you were complaining about. As a vendor, you can solve all this in your BSP. Though, translating BSP patches into a generic upstream-able features is not as simple. The solution that works for vendor is usually the most cost effective one. I'm sure, Tesla or AMD Automotive are no exceptions.

...

What is clearly a bug in the kernel is that we don't reject things which won't work correctly and this is what this patch here addresses. What we could talk about is backward compatibility for this patch, cause it might look like it breaks things which previously used to work at least partially.

I did read your code review (I don't class this discussion as a code review). You where asked to address several issues, clearly a v2 is needed.

1. Rob Clark stated the coherency is not homogeneous in many device drivers. So your patch will yield many false positives. He asked if you could think of a "per attachement" solution, as splitting drivers didn't seem like the best approach (and it would have to happen at the same time anyway).

2. Lucas raised a concern, unfortunately no proof yet, that this may cause regressions in existing userspace application. You stated that the application must be wrong, yet this would break Linus rule #1. I'm not participating in that discussion, I tried to reproduced, but appart from writing an app that will break, I haven't found yet. But it feels like the way forward is to ensure each exporter driver can override this in case it has a good reason to do so.

As a third option, a safer approach would be to avoid reusing a flawed mechanism to detect device coherency (or rephrased, accept that device coherency isn't homogeneous). Communicate this back another way. this means patching all the drivers, but at least each driver maintainer will be able to judge with their HW knowledge if this is going to regress or not. When I first looked at it, I wanted to experiment with enabling cache in vb2 contiguous allocator. I was thinking that perhaps I could have bracketing, and state changes triggered by the attach call, all based on the device coherency, but now that I saw Rob Clark comment, I feel like this is flawed.

happy v2, Nicolas

Christian König

2:51 p.m.

Am 04.11.22 um 14:38 schrieb Nicolas Dufresne:

...

Le vendredi 04 novembre 2022 à 10:03 +0100, Christian König a écrit :

...
Am 03.11.22 um 23:16 schrieb Nicolas Dufresne:

...
[SNIP]

Was there APIs suggested to actually make it manageable by userland to allocate from the GPU? Yes, this what Linux Device Allocator idea is for. Is that API ready, no.

Well, that stuff is absolutely ready: https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/heaps/system_... What do you think I'm talking about all the time?

I'm aware of DMA Heap, still have few gaps, but this unrelated to coherency (we can discuss offline, with Daniel S.). For DMABuf Heap, its used in many forks by vendors in production. There is an upstream proposal for GStreamer, but review comments were never addressed, in short, its stalled, and it waiting for a volunteer. It might also be based on very old implementation of DMABuf Heap, needs to be verified in depth for sure as the time have passed.

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/1391

Well this is actually quite related to coherency.

As general purpose framework it's mandatory for DMA Heaps to support the coherency callbacks. And as far as I can see they are indeed correctly implemented.

So if you allocate from a DMA Heap it should be guaranteed that sharing with all devices work correctly, no matter if they write with the CPU or don't snoop the CPU cache. That's essentially what those heaps are good for.

I just don't know how you should select between the contiguous and the system heap since I'm not deep enough in the userspace side of this.

We should have come up with heaps earlier and don't implement so many different system memory allocators in the drivers in the first place.

...

...
DMA-buf has a lengthy section about CPU access to buffers and clearly documents how all of that is supposed to work: https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L11... This includes braketing of CPU access with dma_buf_begin_cpu_access() and dma_buf_end_cpu_access(), as well as transaction management between devices and the CPU and even implicit synchronization.

This specification is then implemented by the different drivers including V4L2: https://elixir.bootlin.com/linux/latest/source/drivers/media/common/videobuf...

As well as the different DRM drivers: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/gem/i915... https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/am...

I know, I've implement the userspace bracketing for this in GStreamer [1] before DMAbuf Heap was merged and was one of the reporter for the missing bracketing in VB2. Was tested against i915 driver. Note, this is just a fallback, the performance is terrible, memory exported by (at least my old i915 HW) is not cacheable on CPU. Though, between corrupted image and bad performance or just bad performance, we decided that it was better to have the second. When the DMABuf is backed by CPU cacheable memory, peformance is great and CPU fallback works. Work is in progress to better handle these two cases generically. For now, sometimes the app need to get involved, but this is only happens on embedded/controlled kind of use cases. What matters is that application that needs this can do it.

[1] https://gitlab.freedesktop.org/gstreamer/gstreamer/-/blob/main/subprojects/g...

Yeah, I've heard complains about that before with people asking if we couldn't map buffers exporter by the GPU as cacheable and then flush it on demand.

For i915 that's more or less possible because it just uses stolen system memory, but pretty much any other dGPU uses local memory and mapping a PCIe BAR cacheable is really a no-go.

...

...
This design was then used by us with various media players on different customer projects, including QNAP https://www.qnap.com/en/product/ts-877 as well as the newest Tesla https://www.amd.com/en/products/embedded-automotive-solutions

I won't go into the details here, but we are using exactly the approach I've outlined to let userspace control the DMA between the different device in question. I'm one of the main designers of that and our multimedia and mesa team has up-streamed quite a number of changes for this project.

I'm not that well into different ARM based solutions because we are just recently getting results that this starts to work with AMD GPUs, but I'm pretty sure that the design should be able to handle that as well.

So we have clearly prove that this design works, even with special requirements which are way more complex than what we are discussing here. We had cases where we used GStreamer to feed DMA-buf handles into multiple devices with different format requirements and that seems to work fine.

Sounds like you have a love/hate relationship with GStreamer. Glad the framework is working for you too. The framework have had bidirectional memory allocation for over a decade, it also has context sharing for stacks like D3D11,12/GL/Vulkan/CUDA etc. I strictly didn't understand what you were complaining about. As a vendor, you can solve all this in your BSP. Though, translating BSP patches into a generic upstream-able features is not as simple. The solution that works for vendor is usually the most cost effective one. I'm sure, Tesla or AMD Automotive are no exceptions.

The framework is indeed great. It's just that customers often come up with their own sources/sinks which then obviously don't go upstream.

The LGPL was sometimes helpful to get people involved, but most often just leaves to much opportunity to develop a throw away solution.

...

...

What is clearly a bug in the kernel is that we don't reject things which won't work correctly and this is what this patch here addresses. What we could talk about is backward compatibility for this patch, cause it might look like it breaks things which previously used to work at least partially.

I did read your code review (I don't class this discussion as a code review). You where asked to address several issues, clearly a v2 is needed.

Rob Clark stated the coherency is not homogeneous in many device drivers. So

your patch will yield many false positives. He asked if you could think of a "per attachement" solution, as splitting drivers didn't seem like the best approach (and it would have to happen at the same time anyway).

That should already be addressed. The flag is intentionally per buffer and only gets check as pre condition the framework itself.

So for example importing a buffer which requires CPU cache coherency is still possible, but then creating the framebuffer fails. This is pretty much the handling we already have, just not driver specific any more.

...

Lucas raised a concern, unfortunately no proof yet, that this may cause

regressions in existing userspace application. You stated that the application must be wrong, yet this would break Linus rule #1. I'm not participating in that discussion, I tried to reproduced, but appart from writing an app that will break, I haven't found yet. But it feels like the way forward is to ensure each exporter driver can override this in case it has a good reason to do so.

Yes, that's what I meant that we should discuss this. Essentially we have the option between an application using corrupted data or breaking in the first place. I tend to say that the second is less harmful, but that's not a hard opinion.

...

As a third option, a safer approach would be to avoid reusing a flawed mechanism to detect device coherency (or rephrased, accept that device coherency isn't homogeneous). Communicate this back another way. this means patching all the drivers, but at least each driver maintainer will be able to judge with their HW knowledge if this is going to regress or not. When I first looked at it, I wanted to experiment with enabling cache in vb2 contiguous allocator. I was thinking that perhaps I could have bracketing, and state changes triggered by the attach call, all based on the device coherency, but now that I saw Rob Clark comment, I feel like this is flawed.

Yeah, that won't work. Daniel V also noted multiple times that we could do this in the attach callback, but attaching should really be something done only once.

If you want device which write to the buffers with the CPU and devices which don't snoop to work together on system memory then DMA Heaps are really the way to go.

Just going to extend the patch set to the cases we have in amdgpu and i915 as well.

Thanks, Christian.

...

happy v2, Nicolas

913

days inactive

990

days old

linaro-mm-sig@lists.linaro.org

60 comments

participants

tags (0)

participants (12)

Alex Deucher
Christian König
Christian König
Daniel Stone
Daniel Vetter
Hans Verkuil
Lucas Stach
Nicolas Dufresne
Pekka Paalanen
Rob Clark
Sebastian Wick
Tomasz Figa