Hi Daniel,
On Thursday 19 April 2012 15:59:42 Daniel Vetter wrote:
On Thu, Apr 19, 2012 at 03:20:30PM +0200, Laurent Pinchart wrote:
On Tuesday 17 April 2012 19:21:37 Daniel Vetter wrote:
On Tue, Apr 17, 2012 at 04:23:08PM +0200, Tomasz Stanislawski wrote:
[snip]
The function was called dma_get_pages. As mentioned before, this function was a workaround for deficiencies in the DMA api. I agree that it is not possible to transform a buffer into a list of pages in the general case, i.e. there might be no struct pages for a given range of physical addresses at which RAM is available.
However, if it is possible to transform a DMA buffer into a list of pages, then the DMA subsystem is the best place to do it. That is why dma_get_pages was introduced to the DMA api in my RFC. The alternative might be dma_get_sg to obtain the scatterlist, but that is only a cosmetic difference from dma_get_pages.
Still, a hack that works sometimes but not always is guaranteed to blow up sooner or later. I'd like to get this right and not rework it too often, hence my stubborn opinion on it.
I'm pretty sure none of us want to get it wrong ;-)
I agree that dma_get_pages() is the wrong level of abstraction, as pages can't correctly describe all possible buffers. dma_get_sg() is slightly better. Do you have another proposal?
Hm, I haven't seen a dma_get_sg proposal floating around anywhere, can you give a pointer?
I don't think one has been posted. My point was that a DMA API function that returned a scatter list instead of a table of struct page would be better.
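For illustration, a prototype for such a function could look roughly like the following. This is only a sketch to make the idea concrete; the signature is my guess, nothing with this shape has actually been posted:

    /*
     * Hypothetical sketch, not a posted proposal: fill an sg_table that
     * describes the memory behind a coherent allocation, so that it can
     * later be mapped to another device with dma_map_sg().
     */
    int dma_get_sg(struct device *dev, struct sg_table *sgt,
                   void *cpu_addr, dma_addr_t dma_addr, size_t size);

The exporter that allocated the buffer with dma_alloc_coherent() would call this once and hand the resulting sg_table to whoever needs to map the buffer to another device.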
You can criticize as much as you like, but please try to help us find a better way to export a DMA buffer allocated by the DMA api using functions like dma_alloc_{coherent/noncoherent/writecombine}.
To be frank, the current dma api is not suitable to be used with dma_buf if one of your importers cannot deal with struct pages allocated with page_alloc. Which means all the devices that need memory allocated from CMA through the dma api.
I don't think the problem is on the importer's side. Whether the memory comes from page_alloc, from a tiling unit or from a reserved memory block results from how the exporter has allocated memory in the first place. From the importer's point of view all we need is an SG list mapped to the importer's device. Whether the memory is backed by struct page or not doesn't matter once the mapping has been created.
Bad phrasing on my side, the problem is not with the importer. But with the current dma interfaces the exporter simply cannot allocate the memory required by the importer if the importer needs contiguous coherent dma memory.
Now for the fully dynamic & insane case of a drm device being the exporter, I think we can live with this. Atm we only have seriously useful drm drivers on x86 and there we don't need CMA.
But I agree that this isn't a workable solution on arm and embedded platforms, and I think we need to pimp up the dma api to be useful for the basic case of sharing a statically allocated piece of memory.
My plan here is that we should have a default dma_buf exporter implementation that uses these new dma api interfaces so that for the usual case you don't have to write implementations for all these nutty dma_buf interface pieces. That would only be required for special cases that need the full generality of the interface (i.e. a drm exporter) or for memory allocated in a special manner (e.g. v4l userpointer buffer or memory allocated from a carveout not backed by struct pages).
To make that happen we need a new way to allocate coherent dma memory, so I guess we'll have a
void * dma_alloc_attrs_for_devices(struct device **devices, int num_devices,
dma_addr_t **dma_handles, gfp_t gfp struct dma_attrs *attrs)
which would allocate coherent dma memory suitable for all the num_devices listed in devices, handing back the dma_handles in the array pointed to by dma_handles.
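As a rough usage sketch of what a default dma_buf exporter could do with such an interface (the helper and device names below are made up; note that the prototype above doesn't spell out a size argument, which a real interface would obviously need):

    /* Hypothetical usage of the proposed interface; dev_a/dev_b stand for
     * whatever two devices share the buffer. */
    static void *example_alloc_for_two(struct device *dev_a, dma_addr_t *addr_a,
                                       struct device *dev_b, dma_addr_t *addr_b)
    {
            struct device *devices[] = { dev_a, dev_b };
            dma_addr_t *handles[] = { addr_a, addr_b };

            /* No special attrs for the simple case. */
            return dma_alloc_attrs_for_devices(devices, ARRAY_SIZE(devices),
                                               handles, GFP_KERNEL, NULL);
    }

On success, *addr_a and *addr_b would hold the dma address of the same underlying memory as seen by each device.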
This would require a one-size-fits-them-all generic memory allocator that knows about all possible devices in a system. We've already discussed this in the past, and I'd *really* like to avoid that as it would quickly become insanely complex to handle. Such an API would also require knowing all the devices that a buffer will be used with beforehand, which isn't provided by the current DMABUF API.
The idea is that we reallocate this memory when a new device shows up. If that is not possible, you'll simply get an error. I think this would also be more in line with v4l behaviour, which demands that memory is allocated up-front (or must fail).
That's an interesting idea, although I'm not sure if we should really do so. Sure, it completely abstracts memory allocation from userspace and provides something that just works (that is, in a couple of years' time, when we'll be done with the implementation :-)). However, the performance impact of reallocating the memory is likely prohibitive, so we will run into issues if applications don't act smartly. Providing an API that just works but then fails in some cases because of performance issues (especially when the issues are potentially random-looking from a userspace perspective) is asking for trouble. I'd rather have a solid API that requires a bit of thinking on userspace's side, to make sure userspace developers think about what they are doing.
And I don't think this memory allocator will be too complicated to implement. One thing to note is that it only tackles coherent dma memory, i.e. static allocation, fixed kernel mapping. That leaves out tons of special cases needed in drm land, with all their implications on the dma_buf interface.
The other thing is that the dma api can already allocate this memory and we only need to figure out a way to aggregate all these constraints and allocate it from the right memory pool. For simple stuff like the dma_mask we can just & them together; for crazier stuff like contiguous memory or a special CMA pool we could use flags and check which allocators support all the flags we need.
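As a very rough sketch of what that aggregation could look like (the structure, the flags and the device_get_alloc_flags() helper are all made up for illustration):

    /* Hypothetical constraint aggregation, for illustration only. */
    struct dma_constraints {
            u64 dma_mask;
            unsigned long flags;    /* e.g. "needs contiguous", "use CMA pool X" */
    };

    static void dma_merge_constraints(struct dma_constraints *c,
                                      struct device **devices, int num_devices)
    {
            int i;

            c->dma_mask = ~0ULL;
            c->flags = 0;

            for (i = 0; i < num_devices; i++) {
                    /* The simple part: & the dma masks together. */
                    if (devices[i]->dma_mask)
                            c->dma_mask &= *devices[i]->dma_mask;
                    /* The crazier part: collect allocator flags, then pick an
                     * allocator that supports all of them.
                     * device_get_alloc_flags() doesn't exist, it stands in for
                     * however those flags would be discovered. */
                    c->flags |= device_get_alloc_flags(devices[i]);
            }
    }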
One of those crazy constraints comes from Samsung. They need to allocate the luma and chroma planes (for NV formats) in different memory banks for some of their IP blocks. Flags could be used to direct allocation to a specific CMA pool, but we would then need to know about the buffer content format to know what flag to set.
I've already looked a bit through the dma generic stuff and it shouldn't be too hard to integrate it there. Integration with CMA pools will be uglier, but should be doable.
Once the memory is allocated, the only thing left is wrapping the allocation in a sg list and calling the map_sg function of the relevant dma_map_ops for each device.
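A minimal sketch of that last step, assuming the allocation is backed by struct pages (which is exactly the assumption that doesn't hold everywhere and where the dma api would have to help out in a real implementation):

    /* Sketch only: wrap a coherent allocation in a one-entry sg list and map
     * it to one more device. virt_to_page() on coherent memory is the weak
     * spot; a real implementation needs the dma api to provide the pages (or
     * an equivalent description) instead. */
    static int example_map_to_device(struct device *dev, void *cpu_addr,
                                     size_t size, struct sg_table *sgt)
    {
            int ret;

            ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
            if (ret)
                    return ret;

            sg_set_page(sgt->sgl, virt_to_page(cpu_addr), size, 0);

            /* dma_map_sg() ends up calling the map_sg op of the device's
             * dma_map_ops. */
            if (!dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL)) {
                    sg_free_table(sgt);
                    return -ENOMEM;
            }

            return 0;
    }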
Totally crazy boards/SoCs/platforms are simply left with creating their own hacks, but I think for somewhat sane platforms this should work.
For the no-iommu case (i.e. dma_generic_alloc_coherent) it would be enough to simply & together all the dma masks. On platforms that require more magic (e.g. CMA) this would be more complicated.
To make dynamic pipeline reconfiguration for the example dma_buf exporter possible, the example exporter could allocate a new buffer when a new device gets attached, copy over the data from the old buffer to the new one, and then release the old one. If that doesn't work, it can simply return -EBUSY or -ENOMEM.
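In pseudo-code, the attach path of such an example exporter could look roughly like this; every example_* helper and the example_buffer structure are made up, the point is only the reallocate-and-copy fallback:

    /* Hypothetical attach logic for the example exporter. */
    static int example_exporter_attach(struct example_buffer *buf,
                                       struct device *new_dev)
    {
            void *new_vaddr;

            /* Fast path: the current allocation already satisfies the new
             * device's constraints, only a new mapping is needed. */
            if (example_allocation_fits(buf, new_dev))
                    return example_map_existing(buf, new_dev);

            /* Otherwise allocate a buffer that fits all devices including the
             * new one, copy the data over and release the old allocation. */
            new_vaddr = example_alloc_for_all(buf, new_dev);
            if (!new_vaddr)
                    return -ENOMEM; /* or -EBUSY if reallocation isn't allowed */

            memcpy(new_vaddr, buf->vaddr, buf->size);
            example_release_old(buf);
            buf->vaddr = new_vaddr;

            return example_map_existing(buf, new_dev);
    }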
If we later on also want to support the streaming dma api better for common dma_buf exporters, we can look into that. But I think that would be a second step after getting this dma api extension in.
I hope we've found the real reason for our disagreement on the dma_buf spec, if I'm totally off, please keep on shouting at me.
I don't think there's such a fundamental disagreement. The spec requires exporters to map buffers to the importer's device, and that's what is implemented by the latest work-in-progress patches. The V4L2 patches that move mapping to the importer's side are getting dropped.
Creating the importer's device mapping in the exporter driver was a bit counter-intuitive to me to start with, but I don't have a strong preference now. Things like tiler mappings might be easier to handle in the exporter's device, but at the end of the day the core issue is mapping a piece of memory allocated for one device to another device.
I agree that passing physical addresses around (regardless of whether those physical addresses are struct page pointers, PFNs or anything else) is probably not a good solution. We already have a DMA API to abstract physical memory and provide devices with a DMA address, we should use and extend it. However, DMA addresses are specific to a device, and since dma-buf needs to share buffers between separate devices (otherwise it would be pretty pointless), they can't be used to describe a cross-device buffer.
When allocating a buffer using the DMA API, memory is "allocated" behind the scenes and mapped to the device address space ("allocated" in this case means anything from plain physical memory allocation to reservation of a special-purpose memory range, like in the OMAP TILER example). All the allocator device driver gets to see is the DMA address and/or the DMA scatter list. Then, when we want to share the memory with a second device, we need a way to map the memory to the second device's address space. Whether that operation is performed in the importer or exporter driver isn't the core issue, in both cases the mapping will be delegated to the DMA API.
We currently have no way in the DMA API to map memory allocated and mapped to one device to a second device. From an API point of view, we need a call to perform that mapping. That call will need a pointer to the device to map the memory to, and a reference to the memory. One possible solution is to identify the memory by the combination of allocator device pointer and dma address. This would be pretty simple for the caller, but it might be complex to implement in the DMA API. I haven't looked at how this could be done yet.
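For the sake of discussion, such a call could look something like this (name and signature are a strawman of mine, not an existing or proposed interface):

    /* Strawman only: map memory that was allocated for alloc_dev, and is
     * identified there by its dma address, into target_dev's address space,
     * returning the dma address as seen by target_dev. */
    int dma_map_for_device(struct device *alloc_dev, dma_addr_t alloc_addr,
                           size_t size, struct device *target_dev,
                           dma_addr_t *target_addr);

The hard part is obviously how the dma api gets back from the (alloc_dev, alloc_addr) pair to the underlying memory, which is the implementation complexity mentioned above.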
I think the allocate&map for multiple devices issue is solvable, see above. It's gonna be a bit of work to piece things together, though ;-)
Do you think it would be that complex to split allocation and mapping? We could of course also provide an allocate & map function, but if we can separate the two operations we could also add an importer at runtime if needed.
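To make the split concrete, the flow I have in mind is something like the following, reusing the dma_map_for_device() strawman from above (camera_dev and display_dev are placeholders, everything here is hypothetical):

    /* Hypothetical split flow: allocate once for the first device, then only
     * map when further importers show up later. */
    static int example_split_flow(struct device *camera_dev,
                                  struct device *display_dev, size_t size)
    {
            dma_addr_t cam_addr, disp_addr;
            void *vaddr;

            /* Allocation plus the first mapping, using the existing
             * interface. */
            vaddr = dma_alloc_coherent(camera_dev, size, &cam_addr, GFP_KERNEL);
            if (!vaddr)
                    return -ENOMEM;

            /* Later, when the display imports the buffer, only a mapping is
             * needed; dma_map_for_device() is the strawman from above. */
            return dma_map_for_device(camera_dev, cam_addr, size,
                                      display_dev, &disp_addr);
    }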