Hi Christian,
I'm going to reply in more detail when I have more time, so just some quick thoughts for now.
On Wednesday, 2022-11-02 at 12:18 +0100, Christian König wrote:
On 01.11.22 at 22:09, Nicolas Dufresne wrote:
[SNIP]
But the client is just a video player. It doesn't understand how to allocate BOs for Panfrost or AMD or etnaviv. So without a universal allocator (again ...), 'just allocate on the GPU' isn't a useful response to the client.
Well exactly that's the point I'm raising: The client *must* understand that!
See we need to be able to handle all restrictions here, coherency of the data is just one of them.
For example, the much more important question is the location of the data, and for this, allocating from the V4L2 device is in most cases just not going to fly.
It feels like this is a generic statement and there is no reason it could not be the other way around.
And exactly that's my point. You always need to look at both ways to share the buffer and can't assume that one will always work.
As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.
To be honest I'm really surprised that the Wayland guys haven't pushed back on this practice already.
This only works because the Wayland as well as the X display pipeline is smart enough to insert an extra copy when it finds that an imported buffer can't be used as a framebuffer directly.
With bracketed access you could even make this case work, as the dGPU would be able to slurp a copy of the dma-buf into LMEM for scanout.
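For reference, userspace can already do this kind of bracketing for CPU accesses today via the DMA_BUF_IOCTL_SYNC ioctl; the device-side bracket discussed here would be new. A minimal sketch of a CPU write bracket (dmabuf_fd is a placeholder, error handling omitted):

  #include <sys/ioctl.h>
  #include <linux/dma-buf.h>

  /* Tell the exporter a CPU write access starts, so it can do any
     cache maintenance that is needed before the CPU touches the data. */
  struct dma_buf_sync sync = { .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE };
  ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

  /* ... CPU writes through the mmap()ed view of the dma-buf ... */

  /* Close the bracket so devices see coherent data again. */
  sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
  ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);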
I have a colleague who integrated a PCIe CODEC (Blaize Xplorer X1600P PCIe Accelerator) hosting its own RAM. There were a lot of ways to use it. Of course, in the current state of DMABuf, you have to be the exporter to do anything fancy, but it did not have to be like this; it's a design choice. I'm not sure what method was used in the end, the driver isn't upstream yet, so maybe that is not even final. What I do know is that there are various conditions you may use the CODEC under, and the optimal location will vary. As an example, using the post-processor or not; see my next comment for more details.
Yeah, and stuff like this was already discussed multiple times. Local memory of devices can only be made available by the exporter, not the importer.
So in the case of separated camera and encoder you run into exactly the same limitation: some devices need the allocation to happen on the camera while others need it on the encoder.
The more common case is that you need to allocate from the GPU and then import that into the V4L2 device. The background is that all dGPUs I know of need the data inside local memory (VRAM) to be able to scan out from it.
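To illustrate that direction, a rough sketch of GPU-side allocation followed by a V4L2 import (using a DRM dumb buffer for brevity, where a real player would allocate through the GPU driver or GBM; device paths and sizes are placeholders, error handling omitted):

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <drm/drm.h>
  #include <drm/drm_mode.h>
  #include <linux/videodev2.h>

  /* Allocate on the DRM device and export the buffer as a dma-buf. */
  int drm_fd = open("/dev/dri/card0", O_RDWR | O_CLOEXEC);
  struct drm_mode_create_dumb create = { .width = 1920, .height = 1080, .bpp = 32 };
  ioctl(drm_fd, DRM_IOCTL_MODE_CREATE_DUMB, &create);

  struct drm_prime_handle prime = { .handle = create.handle,
      .flags = DRM_CLOEXEC | DRM_RDWR };
  ioctl(drm_fd, DRM_IOCTL_PRIME_HANDLE_TO_FD, &prime);

  /* Import the dma-buf into the V4L2 device instead of letting it allocate. */
  int v4l2_fd = open("/dev/video0", O_RDWR);
  struct v4l2_requestbuffers req = { .count = 1,
      .type = V4L2_BUF_TYPE_VIDEO_CAPTURE, .memory = V4L2_MEMORY_DMABUF };
  ioctl(v4l2_fd, VIDIOC_REQBUFS, &req);

  struct v4l2_buffer buf = { .type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
      .memory = V4L2_MEMORY_DMABUF };
  buf.m.fd = prime.fd;
  ioctl(v4l2_fd, VIDIOC_QBUF, &buf);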
The reality is that what is common to you might not be to others. In my work, most ARM SoCs have display controllers that can scan out directly from cameras and codecs.
The only case that commonly fails is when we try to display UVC-created dmabufs,
Well, exactly that's not correct! The whole x86 use cases of direct display for dGPUs are broken because media players think they can do the simple thing and offload all the problematic cases to the display server.
This is absolutely *not* the common use case you describe here, but rather something completely special to ARM.
It's the normal case for a lot of ARM SoCs. That world is certainly not any smaller than the x86 dGPU world. A huge number of devices are ARM-based set-top boxes and other video players. Just because it is a special case for you doesn't mean it's a global special case.
which have a dirty CPU write cache, and this is the type of thing we'd like to see solved. I think this series was addressing it in principle, but by failing the import, and the point raised is that this wasn't the optimal way.
There is a community project called LibreELEC, if you aren't aware; they run Kodi with direct scanout of the video stream on a wide variety of SoCs, and they use the CODEC as the exporter all the time. They simply don't have cases where the opposite is needed (or any kind of remote RAM to deal with). In fact, FFmpeg does not really offer you any API to reverse the allocation.
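For completeness, the exporter-side flow those projects use is roughly this (sketch; v4l2_fd is a placeholder, error handling omitted):

  /* Let V4L2 allocate (V4L2_MEMORY_MMAP), then export buffer 0 as a dma-buf. */
  struct v4l2_exportbuffer expbuf = {
      .type  = V4L2_BUF_TYPE_VIDEO_CAPTURE,
      .index = 0,
      .flags = O_CLOEXEC,
  };
  ioctl(v4l2_fd, VIDIOC_EXPBUF, &expbuf);
  /* expbuf.fd can now be handed to the compositor,
     e.g. via the zwp_linux_dmabuf_v1 Wayland protocol. */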
Ok, let me try to explain it once more. It sounds like I wasn't able to get my point through.
That we haven't heard anybody screaming that x86 doesn't work is just because we handle the case of a buffer not being directly displayable in X/Wayland anyway, but this is absolutely not the optimal solution.
The argument that you want to keep the allocation on the codec side is completely false as far as I can see.
We have already had numerous projects where we reported this practice as a bug to the GStreamer and FFmpeg projects, because it won't work on x86 with dGPUs.
And on a lot of ARM SoCs it's exactly the right thing to do. Many codecs need contiguous memory there, so importing a scatter-gather buffer from the GPU via dma-buf will simply not work.
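(For contiguous allocations there is at least a vendor-neutral option nowadays: the CMA dma-heap. A rough sketch; the heap name and buffer size are system-dependent placeholders, error handling omitted:)

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/dma-heap.h>

  int heap_fd = open("/dev/dma_heap/linux,cma", O_RDWR | O_CLOEXEC);
  struct dma_heap_allocation_data alloc = {
      .len      = 1920 * 1080 * 4,        /* placeholder size */
      .fd_flags = O_RDWR | O_CLOEXEC,
  };
  ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
  /* alloc.fd is a dma-buf backed by physically contiguous memory. */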
This is just a software solution which works by coincidence and not because of engineering.
By mandating a software fallback for the cases where you would need bracketed access to the dma-buf, you simply shift the problem into userspace. Userspace then creates the bracket by falling back to some other import option that mostly does a copy and then the appropriate cache maintenance.
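Concretely, that userspace bracket tends to look like this today (sketch; src_fd, dst and size are placeholders, headers as in the earlier snippets, error handling omitted):

  #include <string.h>
  #include <sys/mman.h>

  /* Open the bracket for a CPU read of the imported dma-buf. */
  struct dma_buf_sync sync = { .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ };
  ioctl(src_fd, DMA_BUF_IOCTL_SYNC, &sync);

  void *src = mmap(NULL, size, PROT_READ, MAP_SHARED, src_fd, 0);
  memcpy(dst, src, size);                 /* the extra copy into a usable buffer */
  munmap(src, size);

  sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ;
  ioctl(src_fd, DMA_BUF_IOCTL_SYNC, &sync);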
While I understand your sentiment about the DMA-API design being inconvenient when things are just coherent by system design, the DMA-API design wasn't done this way due to bad engineering, but due to the fact that performant DMA access on some systems just requires this kind of bracketing.
Regards,
Lucas