Hi Daniel,
On Tuesday 17 April 2012 19:21:37 Daniel Vetter wrote:
On Tue, Apr 17, 2012 at 04:23:08PM +0200, Tomasz Stanislawski wrote:
On 04/17/2012 03:03 PM, Daniel Vetter wrote:
On Tue, Apr 17, 2012 at 01:40:47PM +0200, Tomasz Stanislawski wrote:
Hello everyone, I would like to ask whether there is agreement on the expected behaviour of a DMABUF exporter in dma_buf_map_attachment.
According to the DMABUF spec the exporter should return a scatterlist mapped into the importer's DMA space. However, there have been concerns about this concept.
I made a short survey of the mapping strategies used by the DMABUF patches for some drivers:
V4L2 - support for dmabuf importing, hopefully consistent with the dmabuf spec. The patch "v4l: vb2-dma-contig: change map/unmap behaviour for importers" implements DMA mapping on the importer side. However, the patch can be dropped at no cost.
Exynos DRM - the latest version implements mapping on the exporter side
Omap/DRM - 'mapping' is done on the exporter side by setting a physical address as the DMA address in the scatterlist. dma_map_sg should be used for this purpose.
nouveau/i915 by Dave Airlie - mapping for the client is done on the importer side. http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-dmabuf2
Does this mean that it is agreed that the exporter is responsible for mapping the buffer into the client's address space?
Hi Daniel,
No, and imo your examples are void: 4. Dave Airlie's latest rfc patches implement dma api mapping on the importer's side. For i915 see e.g. commit http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-dmabuf2&id=f5c6bc7c5483ceaa79ed23c6b0af9b967ebc009f which directly grabs the sg list from the dma_buf exporter and shoves it into the intel gtt functions that require an already dma-mapped sg list.
Sorry, my mistake. I was thinking about 'exporter' when I wrote 'importer'. I agree that i915 is consistent with the DMABUF spec. Sorry for the confusion.
Doesn't matter, drm/i915 exporter code in that branch does the dma_map_sg for the attached device:
http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-dmabuf2&id=3212...afcce81c993e274a4d1ff28ddf86b3b0
I think Tomasz agrees with you on this one. The latest DMABUF exporter support in DRM/i915 implements mapping to the importer's device in the exporter driver.
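To illustrate what that means in practice, a heavily simplified sketch of such an exporter-side map_dma_buf callback is shown below. This is not the actual drm/i915 code; example_map_dma_buf and struct example_buffer are made-up names, and error handling is minimal.

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

struct example_buffer {
	struct sg_table *sgt;		/* exporter's view of the pages */
};

static struct sg_table *example_map_dma_buf(struct dma_buf_attachment *attach,
					    enum dma_data_direction dir)
{
	struct example_buffer *buf = attach->dmabuf->priv;
	struct scatterlist *src, *dst;
	struct sg_table *sgt;
	int i;

	sgt = kmalloc(sizeof(*sgt), GFP_KERNEL);
	if (!sgt)
		return ERR_PTR(-ENOMEM);

	if (sg_alloc_table(sgt, buf->sgt->nents, GFP_KERNEL)) {
		kfree(sgt);
		return ERR_PTR(-ENOMEM);
	}

	/* Duplicate the exporter's sg list ... */
	dst = sgt->sgl;
	for_each_sg(buf->sgt->sgl, src, buf->sgt->nents, i) {
		sg_set_page(dst, sg_page(src), src->length, src->offset);
		dst = sg_next(dst);
	}

	/* ... and map it into the *importer's* DMA address space. */
	if (!dma_map_sg(attach->dev, sgt->sgl, sgt->nents, dir)) {
		sg_free_table(sgt);
		kfree(sgt);
		return ERR_PTR(-ENOMEM);
	}

	return sgt;
}

The matching unmap_dma_buf callback would dma_unmap_sg() for attach->dev and free the table again.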
- Omap: Afaik the address returned is from the TILER, which can be accessed from any hw device. So it's already mapped into device address space, just not through an iommu managed by dma-api (but conceptually on the same level). This use case is very much one of the reasons for exporting the sg list already mapped.
No. If the client device is behind an IOMMU then it cannot access the TILER based only on a physical address. The client has to map the physical address into the device's address space. This can be done by transforming the paddr into a scatterlist and mapping it using dma_map_sg. Otherwise the physical address is useless from the perspective of the client device.
Therefore returning a phys address as a dma_address is not a valid way of mapping memory.
Of course if none of the HW pieces uses an IOMMU then everything works fine. However, imo the introduction of an IOMMU for some client device should __not__ force any change to the exporter code.
Well, there are 3 ways to make this work with the current dma_buf spec:
- There are no other iommus (like you've noticed).
- There are other iommus, but the tiler area is mapped 1:1 into them (current approach used in omap according to Rob Clarke).
- Things from the tiler are mapped on-demand. Currently we don't have any dma interface to remap random pfns, so using these iommus managed by the dma api on top of the omap tiler won't work. What would work is adding a platform specific bit into the device's dma options to tell the omap tiler code to either map the dma_buf into the tiler or use the platform dma api to map it into device address space.
This use case of a special, driver-managed iommu which is tightly integrated with e.g. the gpu, but can also be accessed by other related devices, is the core scenario I have in mind for mandating device address space mappings. Yes, this needs some special code in the exporter to make it work, but e.g. with omap it's the only way to share buffer objects in a tiled layout (all mapped into device address space through the tiler). For fast video processing pipelines, using tiled objects is crucial for performance, and on many devices the solution for a common tiling layout is a common tiling unit like omap's tiler.
So I don't see the problem here.
I don't see any issue with the first two use cases (no IOMMU or 1:1 mappings), as I expect they will be all we need for tiling units. Tiling units expose a physically contiguous aperture, and IOMMU mappings are thus pretty useless. I just cross my fingers and hope we won't get tiling units that expose non-contiguous apertures.
However, I'm not really sure what the implementation will look like. The OMAP tiler can be considered in a way as a special-purpose IOMMU. Even if we use 1:1 IOMMU mappings for the importer, we will need to setup those, resulting in two cascaded IOMMUs. I don't think the DMA API should handle that, so I expect the exporter to come up with an SG list of physical addresses corresponding to the tiler-mapped aperture, and map it to the importer's device using the DMA API. For tiling units that are part of the exporter, or at least tightly coupled with them, I don't expect issues there. For independent tiling units that can be used by several exporter devices, we will need a tiler API to map ranges on demand. We should be careful here, as creating a second IOMMU API wouldn't be such a great idea.
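As a rough sketch of that last step, and assuming the tiler-mapped aperture is physically contiguous, page aligned and backed by struct pages (which is not guaranteed in general), the exporter-side mapping could look something like this; map_aperture_for_importer is a made-up name:

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/pfn.h>
#include <linux/scatterlist.h>

/* Map a physically contiguous range (e.g. a tiler-backed aperture at
 * 'paddr') into the importer's DMA address space. */
static int map_aperture_for_importer(struct device *importer,
				     phys_addr_t paddr, size_t size,
				     struct sg_table *sgt)
{
	int ret;

	ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
	if (ret)
		return ret;

	/* Assumes the range is covered by the kernel's page map. */
	sg_set_page(sgt->sgl, pfn_to_page(PFN_DOWN(paddr)), size, 0);

	if (!dma_map_sg(importer, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL)) {
		sg_free_table(sgt);
		return -EIO;
	}

	/* sg_dma_address(sgt->sgl) is now valid for the importer. */
	return 0;
}

For independent tiling units the open question is who performs the tiler mapping itself and when, which is where the on-demand tiler API mentioned above would come in.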
The problem of selecting how to tile the data is also left unanswered. Should the user see multiple instances of the same buffer, each with a different layout, and select the one the importer will need, or should that be handled automagically in the kernel?
- Exynos uses the iommu api as a nice interface to handle its graphics aperture (akin to the omap tiler or the i915 gtt). Last time I checked, the discussion about how this should be integrated (and at what level) with the dma api was still ongoing. So it's imo way too early to conclude that exynos doesn't return device dma addresses, because this iommu might not end up being managed by the core dma api.
- That's your code ;-)
Last but not least your dma_pages_from_sg (or whatever it's called) trick
The function was called dma_get_pages. As mentioned before, this function was a workaround for deficiencies in the DMA API. I agree that it is not possible to transform a buffer into a list of pages in the general case, i.e. there might be no struct pages for a given range of physical addresses at which RAM is available.
However, if it is possible to transform a DMA buffer into a list of pages then the DMA subsystem is the best place to do it. That is why dma_get_pages was introduced into the DMA API in my RFC. The alternative might be dma_get_sg to obtain a scatterlist, but that is only a cosmetic difference from dma_get_pages.
Still, a hack that works sometimes but not always is guaranteed to blow up sooner or later. I'd like to get this right and not rework it too often, hence my stubborn opinion on it.
I'm pretty sure none of us want to get it wrong ;-)
I agree that dma_get_pages() is the wrong level of abstraction, as pages can't correctly describe all possible buffers. dma_get_sg() is slightly better; do you have another proposition?
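To make the comparison concrete, a dma_get_sg() helper could take a shape along the following lines. This is purely illustrative; the signature is an assumption on my side, not what the RFC actually proposes, and export_example is a made-up name.

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * Hypothetical helper: fill 'sgt' with a description of the memory
 * backing a buffer that was allocated with dma_alloc_*() for 'dev',
 * so that it can later be mapped to another device with dma_map_sg().
 */
int dma_get_sg(struct device *dev, struct sg_table *sgt,
	       void *cpu_addr, dma_addr_t dma_addr, size_t size);

static int export_example(struct device *alloc_dev, struct device *importer,
			  size_t size)
{
	struct sg_table sgt;
	dma_addr_t dma_addr;
	void *vaddr;
	int ret;

	vaddr = dma_alloc_coherent(alloc_dev, size, &dma_addr, GFP_KERNEL);
	if (!vaddr)
		return -ENOMEM;

	ret = dma_get_sg(alloc_dev, &sgt, vaddr, dma_addr, size);
	if (ret)
		return ret;

	/* Map the same memory into the importer's address space. */
	if (!dma_map_sg(importer, sgt.sgl, sgt.nents, DMA_BIDIRECTIONAL))
		return -EIO;

	return 0;
}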
You can criticize as much as you like, but please try to help us find a better way to export a DMA buffer allocated by the DMA API using functions like dma_alloc_{coherent/noncoherent/writecombine}.
To be frank, the current dma api is not suitable to be used with dma_buf if one of your importers cannot deal with struct pages allocated with page_alloc, which means all the devices that need CMA memory allocated through the dma api.
I don't think the problem is on the importer's side. Whether the memory comes from page_alloc, from a tiling unit or from a reserved memory block results from how the exporter has allocated memory in the first place. From the importer's point of view all we need is an SG list mapped to the importer's device. Whether the memory is backed by struct page or not doesn't matter once the mapping has been created.
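Concretely, all the importer should need is something along the lines of the sketch below (example_import is a made-up name, error handling is minimal):

#include <linux/dma-buf.h>
#include <linux/err.h>

static int example_import(struct device *dev, struct dma_buf *dmabuf)
{
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;

	attach = dma_buf_attach(dmabuf, dev);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	/* Per the spec, the returned sg table is already mapped for 'dev'. */
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		dma_buf_detach(dmabuf, attach);
		return PTR_ERR(sgt);
	}

	/* sg_dma_address()/sg_dma_len() can be programmed into the device. */

	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
	dma_buf_detach(dmabuf, attach);
	return 0;
}

Whether the memory behind that sg table has struct pages is then an exporter-internal detail.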
Now for the fully dynamic & insane case of a drm device being the exporter, I think we can live with this. Atm we only have seriously useful drm drivers on x86 and there we don't need CMA.
But I agree that this isn't a workable solution on arm and embedded platforms, and I think we need to pimp up the dma api to be useful for the basic case of sharing a statically allocated piece of memory.
My plan here is that we should have a default dma_buf exporter implementation that uses these new dma api interfaces so that for the usual case you don't have to write implementations for all these nutty dma_buf interface pieces. That would only be required for special cases that need the full generality of the interface (i.e. a drm exporter) or for memory allocated in a special manner (e.g. v4l userpointer buffer or memory allocated from a carveout not backed by struct pages).
To make that happen we need a new way to allocate coherent dma memory, so I guess we'll have a
void *dma_alloc_attrs_for_devices(struct device **devices, int num_devices, dma_addr_t **dma_handles, gfp_t gfp, struct dma_attrs *attrs)
which would allocate coherent dma memory suitable for all the num_devices listed in devices, handing back the dma_handles in the array pointed to by dma_handles.
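For clarity, such an interface would presumably be used along these lines; the function does not exist, the exact semantics of the dma_handles argument are my guess, and a size argument would presumably be needed as well but is omitted here to match the prototype as sketched:

#include <linux/dma-attrs.h>
#include <linux/dma-mapping.h>
#include <linux/kernel.h>

/* Prototype as proposed above (hypothetical). */
void *dma_alloc_attrs_for_devices(struct device **devices, int num_devices,
				  dma_addr_t **dma_handles, gfp_t gfp,
				  struct dma_attrs *attrs);

/* Allocate one buffer usable by both a capture and a display device. */
static void *alloc_for_pipeline(struct device *capture, struct device *display,
				dma_addr_t *capture_dma, dma_addr_t *display_dma)
{
	struct device *devices[] = { capture, display };
	dma_addr_t *handles[] = { capture_dma, display_dma };

	return dma_alloc_attrs_for_devices(devices, ARRAY_SIZE(devices),
					   handles, GFP_KERNEL, NULL);
}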
This would require a one-size-fits-them-all generic memory allocator that knows about all possible devices in a system. We've already discussed this in the past, and I'd *really* like to avoid that as it would quickly become insanely complex to handle. Such an API would also require knowing all the devices that a buffer will be used with beforehand, which isn't provided by the current DMABUF API.
For the no-iommu case (i.e. dma_generic_alloc_coherent) it would be enough to simply & together all the dma masks. On platforms that require more magic (e.g. CMA) this would be more complicated.
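The mask intersection itself is trivial, roughly as follows (common_dma_mask is only an illustration):

#include <linux/dma-mapping.h>

static u64 common_dma_mask(struct device **devices, int num_devices)
{
	u64 mask = DMA_BIT_MASK(64);
	int i;

	/* A device with no mask set defaults to 32 bits via dma_get_mask(). */
	for (i = 0; i < num_devices; i++)
		mask &= dma_get_mask(devices[i]);

	return mask;
}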
To make dynamic pipeline reconfiguration for the example dma_buf exporter possible, the example exporter could alloc a new buffer when a new device gets attached and copy over the data from the old buffer to the new one, then release the old one. If that doesn't work, it can simply return -EBUSY or -ENOMEM.
If we later on also want to support the streaming dma api better for common dma_buf exporters, we can look into that. But I think that would be a second step after getting this dma api extension in.
I hope we've found the real reason for our disagreement on the dma_buf spec, if I'm totally off, please keep on shouting at me.
I don't think there's such a fundamental disagreement. The spec requires exporters to map buffers to the importer's device, and that's what is implemented by the latest work-in-progress patches. The V4L2 patches that move mapping to the importer's side are getting dropped.
Creating the importer's device mapping in the exporter driver was a bit counter-intuitive to me to start with, but I don't have a strong preference now. Things like tiler mappings might be easier to handle in the exporter's device, but at the end of the day the core issue is mapping a piece of memory allocated for one device to another device.
I agree that passing physical addresses (regardless of whether those physical addresses are struct page pointers, PFNs or anything else) around is probably not a good solution. We already have a DMA API to abstract physical memory and provide devices with a DMA address; we should use and extend it. However, DMA addresses are specific to a device, while dma-buf needs to share buffers between separate devices (otherwise it would be pretty pointless), so they can't be used to describe a cross-device buffer.
When allocating a buffer using the DMA API, memory is "allocated" behind the scenes and mapped to the device address space ("allocated" in this case means anything from plain physical memory allocation to reservation of a special-purpose memory range, like in the OMAP TILER example). All the allocator device driver gets to see is the DMA address and/or the DMA scatter list. Then, when we want to share the memory with a second device, we need a way to map the memory to the second device's address space. Whether that operation is performed in the importer or exporter driver isn't the core issue, in both cases the mapping will be delegated to the DMA API.
We currently have no way in the DMA API to map memory allocated and mapped to one device to a second device. From an API point of view, we need a call to perform that mapping. That call will need a pointer to the device to map the memory to, and a reference to the memory. One possible solution is to identify the memory by the combination of allocator device pointer and dma address. This would be pretty simple for the caller, but it might be complex to implement in the DMA API. I haven't looked at how this could be done yet.
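For illustration only, such a call could take a shape along the following lines; the name and signature are hypothetical and only meant to show what information the DMA API would need.

#include <linux/dma-direction.h>
#include <linux/dma-mapping.h>

/*
 * Hypothetical DMA API extension: take memory that was allocated for
 * and mapped to 'alloc_dev' (identified by its DMA address on that
 * device) and map it into 'importer's address space, returning the
 * importer-visible DMA address in '*importer_addr'.
 */
int dma_map_for_device(struct device *alloc_dev, dma_addr_t alloc_addr,
		       size_t size, struct device *importer,
		       dma_addr_t *importer_addr,
		       enum dma_data_direction dir);

How the DMA API would internally resolve the (allocator device, DMA address) pair back to the underlying memory is exactly the part I haven't looked at yet.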