Re: [Linaro-mm-sig] [RFC 1/2] dma-buf: Introduce dma buffer sharing mechanism

8 Nov 2011

      On Thu, Nov 3, 2011 at 3:04 AM, Marek Szyprowski
m.szyprowski@samsung.com wrote:
...
Hello,
I'm sorry for a late reply, but after Kernel Summit/ELC I have some comments.
On Friday, October 14, 2011 5:35 PM Daniel Vetter wrote:
...
On Fri, Oct 14, 2011 at 12:00:58PM +0200, Tomasz Stanislawski wrote:
...
...
+/**

struct dma_buf_ops - operations possible on struct dma_buf

@create: creates a struct dma_buf of a fixed size. Actual allocation

does not happen here.

The 'create' ops is not present in dma_buf_ops.
...

@attach: allows different devices to 'attach' themselves to the given

buffer. It might return -EBUSY to signal that backing storage

is already allocated and incompatible with the requirements

of requesting device. [optional]

@detach: detach a given device from this buffer. [optional]

@get_scatterlist: returns list of scatter pages allocated, increases

usecount of the buffer. Requires atleast one attach to be

called before. Returned sg list should already be mapped

into _device_ address space.

You must add a comment that this call 'may sleep'.
I like the get_scatterlist idea. It allows the exported to create a
valid scatterlist for a client in a elegant way.
I do not like this whole attachment idea. The problem is that
currently there is no support in DMA framework for allocation for
multiple devices. As long as no such a support exists, there is no
generic way to handle attribute negotiations and buffer allocations
that involve multiple devices. So the exporter drivers would have to
implement more or less hacky solutions to handle memory requirements
and choosing the device that allocated memory.
Currently, AFAIK there is even no generic way for a driver to
acquire its own DMA memory requirements.
Therefore all logic hidden beneath 'attachment' is pointless. I
think that support for attach/detach (and related stuff) should be
postponed until support for multi-device allocation is added to DMA
framework.
Imo we clearly need this to make the multi-device-driver with insane dma
requirements work on arm. And rewriting the buffer handling in
participating subsystem twice isn't really a great plan. I envision that
on platforms where we need this madness, the driver must call back to the
dma subsytem to create a dma_buf. The dma subsytem should be already aware
of all the requirements and hence should be able to handle them..
...
I don't say the attachment list idea is wrong but adding attachment
stuff creates an illusion that problem of multi-device allocations
is somehow magically solved. We should not force the developers of
exporter drivers to solve the problem that is not solvable yet.
Well, this is why we need to create a decent support infrastructure for
platforms (= arm madness) that needs this, so that device drivers and
subsystem don't need to invent that wheel on their own. Which as you point
out, they actually can't.
The real question is whether it is possible to create any generic support
infrastructure. I really doubt. IMHO this is something that will be hacked for
each 'product release' and will never read the mainline...
...
...
The other problem are the APIs. For example, the V4L2 subsystem
assumes that memory is allocated after successful VIDIOC_REQBUFS
with V4L2_MEMORY_MMAP memory type. Therefore attach would be
automatically followed by get_scatterlist, blocking possibility of
any buffer migrations in future.
Well, pardon to break the news, but v4l needs to rework the buffer
handling. If you want to share buffers with a gpu driver, you _have_ to
life with the fact that gpus do fully dynamic buffer management, meaning:

buffers get allocated and destroyed on the fly, meaning static reqbuf

just went out the window (we obviously cache buffer objects and reuse
  them for performance, as long as the processing pipeline doesn't really
  change).

buffers get moved around in memory, meaning you either need full-blown

sync-objects with a callback to drivers to tear-down mappings on-demand,
  or every driver needs to guarnatee to call put_scatterlist in a
  reasonable short time. The latter is probably the more natural thing for
  v4l devices.
I'm really not convinced if it is possible to go for the completely dynamic
buffer management, especially if we are implementing a proof-of-concept
solution. Please notice the following facts:

all v4l2 drivers do the 'static' buffer management - memory is being

allocated on REQBUF() call and then mapped permanently into both userspace
and dma (io) address space.
Is this strictly true if we are introducing a new 'enum v4l2_memory'
for dmabuf's?  Shouldn't that give us some flexibility, especially if
the v4l2 device is only the importer, not the allocator, of the
memory.
and a couple more notes:
1) for V4L2_MEMORY_DMABUF we don't want the importing v4l2 device to
create a userspace memory mapping
2) there is nothing stopping a dmabuf exporter from being more strict
about static allocation of buffers..  we just want the API to be
flexible enough so that exporting devices which are more dynamic can
continue to do so.
...

dma-mapping api is very limited in the area of the dynamic buffer management,

this API has been designed definitely for static buffer allocation and mapping.
It looks that fully dynamic buffer management requires a complete change of
v4l2 api principles (V4L3?) and a completely new DMA API interface. That's
probably the reason by none of the GPU driver relies on the DMA-mapping API
and implements custom solution for managing the mappings.
This reminds me one more issue I've noticed in the current dma buf proof-of-
concept. You assumed that the exporter will be responsible for mapping the
buffer into io address space of all the client devices. What if the device
needs additional custom hooks/hacks during the mappings? This will be a serious
problem for the current GPU drivers for example. IMHO the API will be much
clearer if each client driver will map the scatter list gathered from the
dma buf by itself. Only the client driver has the complete knowledge how
to do this correctly for this particular device. This way it will also work
with devices that don't do the real DMA (like for example USB devices that
copy all data from usb packets to the target buffer with the cpu).
The exporter doesn't map.. it returns a scatterlist to the importer.
But the exporter does allocate and pin backing pages.  And it is
preferable if the exporter has the opportunity to wait until as much
is known about the various importing devices to know if it must
allocate contiguous pages, or pages in a certain range.
That said, on a platform where everything had iommu's or somehow
didn't have any particular memory requirements, or where the exporter
had the strictest requirements (or at least knew of the strictest
requirements), then the exporter is free to allocate/pin the backing
pages earlier, such as even before the buffer is exported.
BR,
-R
...
Best regards
Marek Szyprowski
Samsung Poland R&D Center

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

Re: [Linaro-mm-sig] [RFC 1/2] dma-buf: Introduce dma buffer sharing mechanism

Best regards