[Linaro-mm-sig] Re: Try to address the DMA-buf coherency problem

17 Nov 2022


Hi Christian and everyone,
On Thu, Nov 3, 2022 at 4:14 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
...
Am 02.11.22 um 18:10 schrieb Lucas Stach:
...
Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
[SNIP]
...
It would just be doing this for the importer and exactly that
would be bad design because we then have handling for the display driver
outside of the driver.
The driver would have to do those cache maintenance operations if it
directly worked with a non-coherent device. Doing it for the importer
is just doing it for another device, not the one directly managed by
the exporter.
I really don't see the difference to the other dma-buf ops: in
dma_buf_map_attachment the exporter maps the dma-buf on behalf of the
importer and into the importer's address space. Why would cache
maintenance be any different?
The issue here is the explicit ownership transfer.
We intentionally decided against that because it breaks tons of use
cases and is, at least by me and a couple of others, seen as a general
design failure of the Linux DMA-API.
First of all, thanks for starting the discussion and sorry for being
late to the party. May I ask you to keep me on CC for any changes that
touch the V4L2 videobuf2 framework, as a maintainer of it? I'm okay
being copied on the entire series, no need to pick the specific
patches. Thanks in advance.
I agree that we have some design issues in the current DMA-buf
framework, but I'd try to approach it a bit differently. Instead of
focusing on the issues in the current design, could we write down our
requirements and try to come up with what a correct design would look
like? (A lot of that has been already mentioned in this thread, but I
find it quite difficult to follow and it might not be a complete view
either.)
That said, let me address a few aspects already mentioned, to make
sure that everyone is on the same page.
...
DMA-buf lets the exporter set up the DMA addresses the importer uses, to
be able to directly decide where a certain operation should go. E.g. we
have cases where for example a P2P write doesn't even go to memory, but
rather a doorbell BAR to trigger another operation. Throwing in CPU
round trips for explicit ownership transfer completely breaks that concept.
It sounds like we should have a dma_dev_is_coherent_with_dev() which
accepts two devices (or an array of them?) and tells the caller whether
the devices need explicit ownership transfer. Based on that, your drivers
would install the DMA completion (presumably IRQ) handlers or not.
It's necessary since it's not uncommon that devices A and B could be
in the same coherency domain, while C could be in a different one, but
you may still want them to exchange data through DMA-bufs. Even if it
means the need for some extra round trips it would likely be more
efficient than a full memory copy (might not be true 100% of the
time).
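To make the dma_dev_is_coherent_with_dev() idea a bit more concrete,
here is a minimal userspace sketch in C. It is only a model: no such
function exists in the kernel today, struct model_device and its
coherency_domain field are made up for illustration, and a real
implementation would have to derive this from the bus and platform
topology rather than from a single integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a device; not a kernel structure. */
struct model_device {
	const char *name;
	int coherency_domain; /* assumption: equal ids mean mutually coherent */
};

/*
 * Hypothetical helper named after the suggestion above.
 * Returns true when every device in the array shares one coherency
 * domain, i.e. no explicit ownership transfer would be needed.
 */
bool dma_dev_is_coherent_with_dev(const struct model_device *const *devs,
				  size_t count)
{
	assert(devs != NULL);

	for (size_t i = 1; i < count; i++) {
		if (devs[i]->coherency_domain != devs[0]->coherency_domain)
			return false;
	}
	return true;
}
```

With devices A and B in domain 0 and C in domain 1, the helper returns
true for {A, B} and false for {A, B, C}, so only the second combination
would need the extra round trips.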
...
In addition to that, a very basic concept of DMA-buf is that the exporter
provides the buffer as it is and just double checks if the importer can
access it. For example we have XGMI links which make memory accessible
to other devices on the same bus, but not to PCIe devices and not even to
the CPU. Otherwise you wouldn't be able to implement things like secure
decoding where the data isn't even accessible outside the device to
device link.
Fully agreed.
...
So if a device driver uses cached system memory on an architecture with
devices which can't access it, the right approach is clearly to reject
the access.
I'd like to stress the fact that "requires cache maintenance" != "can't access".
...
What we can do is to reverse the role of the exporter and importer and
let the device which needs uncached memory take control. This way this
device can insert operations as needed, e.g. flush read caches or
invalidate write caches.
(Putting aside the cases when the access is really impossible at all.)
Correct me if I'm wrong, but isn't that because we don't have a proper
hook for the importer to tell the DMA-buf framework to prepare the
buffer for its access?
...
This is what we have already done in DMA-buf and what already works
perfectly fine with use cases which are even more complicated than a
simple write cache invalidation.
...
This is just a software solution which works by coincidence and
not because of engineering.
By mandating a software fallback for the cases where you would need
bracketed access to the dma-buf, you simply shift the problem into
userspace. Userspace then creates the bracket by falling back to some
other import option that mostly does a copy and then the appropriate
cache maintenance.
While I understand your sentiment about the DMA-API design being
inconvenient when things are just coherent by system design, the
DMA-API design wasn't done this way due to bad engineering, but due to
the fact that performant DMA access on some systems just requires this
kind of bracketing.
Well, this is exactly what I'm criticizing about the DMA-API. Instead of
giving you a proper error code when something won't work in a specific
way it just tries to hide the requirements inside the DMA layer.
For example, when your device can only access 32 bits, the DMA-API
transparently inserts bounce buffers instead of giving you a proper error
code that the memory in question can't be accessed.
This just tries to hide the underlying problem instead of pushing it
into the upper layer where it can be handled much more gracefully.
How would you expect the DMA API to behave on a system where the device
driver is operating on cacheable memory, but the device is
non-coherent? Telling the driver that this just doesn't work?
Yes, exactly that.
It's the job of the higher level to prepare the buffer a device works
with, not the one of the lower level.
What are higher and lower levels here?
As per the existing design of the DMA mapping framework, the framework
handles the system DMA architecture details and DMA master drivers
take care of invoking the right DMA mapping operations around the DMA
accesses. This makes sense to me, as DMA master drivers have no idea
about the specific SoCs or buses they're plugged into, while the DMA
mapping framework has no idea when the DMA accesses are taking place.
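As a rough illustration of that split, here is a small userspace model
in C. The driver side only brackets its DMA with "sync for device" and
"sync for CPU" calls, and the framework side alone decides whether any
cache maintenance happens. All names here (struct model_buffer,
model_sync_for_device(), model_sync_for_cpu()) are hypothetical and only
loosely mirror what dma_sync_single_for_device() and
dma_sync_single_for_cpu() do in the kernel; real direction handling is
left out.

```c
#include <assert.h>
#include <stdbool.h>

enum model_owner { OWNER_CPU, OWNER_DEVICE };

/* Hypothetical model of a DMA buffer; not a kernel structure. */
struct model_buffer {
	enum model_owner owner;
	bool device_coherent;           /* known to the framework, not the driver */
	unsigned int cache_cleans;      /* write-backs done before device access */
	unsigned int cache_invalidates; /* invalidations done before CPU access */
};

/* The driver calls this right before it starts a DMA transfer. */
void model_sync_for_device(struct model_buffer *buf)
{
	assert(buf->owner == OWNER_CPU);
	if (!buf->device_coherent)
		buf->cache_cleans++;    /* framework decides: clean the caches */
	buf->owner = OWNER_DEVICE;
}

/* The driver calls this after the DMA completion has been signalled. */
void model_sync_for_cpu(struct model_buffer *buf)
{
	assert(buf->owner == OWNER_DEVICE);
	if (!buf->device_coherent)
		buf->cache_invalidates++; /* framework decides: invalidate */
	buf->owner = OWNER_CPU;
}
```

The driver code is identical for coherent and non-coherent devices; in
this model only the counters differ, which is the point of keeping the
architecture details in the framework.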
...
In other words in a proper design the higher level would prepare the
memory in a way the device driver can work with it, not the other way
around.
When a device driver gets memory it can't work with the correct response
is to throw an error and bubble that up into a layer where it can be
handled gracefully.
For example, instead of the DMA layer using bounce buffers under the hood,
the MM should make sure that when you call read() with O_DIRECT the
pages in question are accessible by the device.
I tend to agree with you if it's about a costly software "emulation"
like bounce buffers, but cache maintenance is a hardware feature
existing there by default and it's often much cheaper to operate on
cached memory and synchronize the caches rather than have everything
in uncached (or write-combined) memory.
...
It's a use-case that is working fine today with many devices (e.g. network
adapters) in the ARM world, exactly because the architecture specific
implementation of the DMA API inserts the cache maintenance operations
on buffer ownership transfer.
Yeah, I'm perfectly aware of that. The problem is that exactly that
design totally breaks GPUs on Xen DOM0 for example.
And Xen is just one example, I can certainly say from experience that
this design was a really really bad idea because it favors just one use
case while making other use cases practically impossible if not really
hard to implement.
Sorry, I haven't worked with Xen. Could you elaborate what's the
problem that this introduces for it?
Best regards,
Tomasz
