Hi,
This is v5 of my patchset, which adds a new DMABUF import interface to FunctionFS.
Daniel / Sima suggested that I should cache the dma_buf_attachment while the DMABUF is attached to the interface, instead of mapping/unmapping the DMABUF for every transfer (also because unmapping is not possible in the dma_fence's critical section). This meant adding new dma_buf_begin_access() / dma_buf_end_access() functions that the driver can call to ensure cache coherency. These two functions are provided by the new patch [1/6], and an implementation for udmabuf was added in [2/6] - see the changelog below.
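To make the flow concrete, here is a minimal importer-side sketch of what this enables. Illustration only: the my_* names are hypothetical, reservation locking and error unwinding are elided, and only dma_buf_begin_access()/dma_buf_end_access() come from this series.

#include <linux/dma-buf.h>
#include <linux/dma-fence.h>
#include <linux/err.h>

static int my_hw_queue_dma(struct sg_table *sgt);	/* hypothetical hardware hook */

struct my_transfer {
	struct dma_buf_attachment *attach;	/* cached for the attach lifetime */
	struct sg_table *sgt;			/* cached mapping, reused per transfer */
};

/* Once, when the DMABUF is attached to the interface. */
static int my_attach_dmabuf(struct my_transfer *xfer, struct dma_buf *dmabuf,
			    struct device *dev)
{
	xfer->attach = dma_buf_attach(dmabuf, dev);
	if (IS_ERR(xfer->attach))
		return PTR_ERR(xfer->attach);

	xfer->sgt = dma_buf_map_attachment(xfer->attach, DMA_BIDIRECTIONAL);
	return PTR_ERR_OR_ZERO(xfer->sgt);
}

/* Per transfer; legal inside the dma_fence critical section,
 * unlike dma_buf_unmap_attachment(). */
static int my_start_transfer(struct my_transfer *xfer)
{
	int ret;

	ret = dma_buf_begin_access(xfer->attach, xfer->sgt, DMA_TO_DEVICE);
	if (ret)
		return ret;

	return my_hw_queue_dma(xfer->sgt);
}

/* In the DMA-complete handler, before signalling the fence. */
static void my_transfer_done(struct my_transfer *xfer, struct dma_fence *fence)
{
	dma_buf_end_access(xfer->attach, xfer->sgt, DMA_TO_DEVICE);
	dma_fence_signal(fence);
}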
This patchset was successfully tested with CONFIG_LOCKDEP enabled; no errors were reported in dmesg while using the interface.
This interface is being used at Analog Devices to transfer data from high-speed transceivers to USB in a zero-copy fashion, also using the DMABUF import interface to the IIO subsystem, which is being upstreamed in parallel [1]. The two are used by the Libiio software [2].
On a ZCU102 board with an FMComms3 daughter board, the combination of these two new interfaces yields a drastic throughput improvement, from about 127 MiB/s using IIO's buffer read/write interface + read/write on the FunctionFS endpoints, to about 274 MiB/s when passing DMABUFs around, with lower CPU usage (0.85 load avg. before vs. 0.65 after).
Right now, *technically* there are no users of this interface: Analog Devices wants to wait until both interfaces are accepted upstream before merging the DMABUF code in Libiio into the main branch, and Jonathan wants to wait and see whether this patchset is accepted before greenlighting the DMABUF interface in IIO as well. I don't think this is really a problem; once everybody is happy with their part of the cake, we can merge them all at once.
This is obviously for 6.9, and based on next-20240119.
Changelog:
- [1/6]: New patch
- [2/6]: New patch
- [5/6]:
  - Cache the dma_buf_attachment while the DMABUF is attached.
  - Use dma_buf_begin/end_access() to ensure that the DMABUF data will be coherent to the hardware.
  - Remove comment about cache-management and dma_buf_unmap_attachment(), since we now use dma_buf_begin/end_access().
  - Select DMA_SHARED_BUFFER in Kconfig entry
  - Add Christian's ACK
Cheers, -Paul
[1] https://lore.kernel.org/linux-iio/219abc43b4fdd4a13b307ed2efaa0e6869e68e3f.c...
[2] https://github.com/analogdevicesinc/libiio/tree/pcercuei/dev-new-dmabuf-api
Paul Cercueil (6):
  dma-buf: Add dma_buf_{begin,end}_access()
  dma-buf: udmabuf: Implement .{begin,end}_access
  usb: gadget: Support already-mapped DMA SGs
  usb: gadget: functionfs: Factorize wait-for-endpoint code
  usb: gadget: functionfs: Add DMABUF import interface
  Documentation: usb: Document FunctionFS DMABUF API
 Documentation/usb/functionfs.rst    |  36 ++
 drivers/dma-buf/dma-buf.c           |  66 ++++
 drivers/dma-buf/udmabuf.c           |  27 ++
 drivers/usb/gadget/Kconfig          |   1 +
 drivers/usb/gadget/function/f_fs.c  | 502 ++++++++++++++++++++++++++--
 drivers/usb/gadget/udc/core.c       |   7 +-
 include/linux/dma-buf.h             |  37 ++
 include/linux/usb/gadget.h          |   2 +
 include/uapi/linux/usb/functionfs.h |  41 +++
 9 files changed, 698 insertions(+), 21 deletions(-)
These functions should be used by device drivers when they start and stop accessing the data of a DMABUF. They allow DMABUF importers to cache the dma_buf_attachment while ensuring that the data they want to access is available for their device when the DMA transfers take place.
Signed-off-by: Paul Cercueil <paul@crapouillou.net>
---
v5: New patch
---
 drivers/dma-buf/dma-buf.c | 66 +++++++++++++++++++++++++++++++++++++++
 include/linux/dma-buf.h   | 37 ++++++++++++++++++++++
 2 files changed, 103 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..a8bab6c18fcd 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -830,6 +830,8 @@ static struct sg_table * __map_dma_buf(struct dma_buf_attachment *attach,
  * - dma_buf_mmap()
  * - dma_buf_begin_cpu_access()
  * - dma_buf_end_cpu_access()
+ * - dma_buf_begin_access()
+ * - dma_buf_end_access()
  * - dma_buf_map_attachment_unlocked()
  * - dma_buf_unmap_attachment_unlocked()
  * - dma_buf_vmap_unlocked()
@@ -1602,6 +1604,70 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map)
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
 
+/**
+ * @dma_buf_begin_access - Call before any hardware access from/to the DMABUF
+ * @attach:    [in]    attachment used for hardware access
+ * @sg_table:  [in]    scatterlist used for the DMA transfer
+ * @direction: [in]    direction of DMA transfer
+ */
+int dma_buf_begin_access(struct dma_buf_attachment *attach,
+			 struct sg_table *sgt, enum dma_data_direction dir)
+{
+	struct dma_buf *dmabuf;
+	bool cookie;
+	int ret;
+
+	if (WARN_ON(!attach))
+		return -EINVAL;
+
+	dmabuf = attach->dmabuf;
+
+	if (!dmabuf->ops->begin_access)
+		return 0;
+
+	cookie = dma_fence_begin_signalling();
+	ret = dmabuf->ops->begin_access(attach, sgt, dir);
+	dma_fence_end_signalling(cookie);
+
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
+
+/**
+ * @dma_buf_end_access - Call after any hardware access from/to the DMABUF
+ * @attach:    [in]    attachment used for hardware access
+ * @sg_table:  [in]    scatterlist used for the DMA transfer
+ * @direction: [in]    direction of DMA transfer
+ */
+int dma_buf_end_access(struct dma_buf_attachment *attach,
+		       struct sg_table *sgt, enum dma_data_direction dir)
+{
+	struct dma_buf *dmabuf;
+	bool cookie;
+	int ret;
+
+	if (WARN_ON(!attach))
+		return -EINVAL;
+
+	dmabuf = attach->dmabuf;
+
+	if (!dmabuf->ops->end_access)
+		return 0;
+
+	cookie = dma_fence_begin_signalling();
+	ret = dmabuf->ops->end_access(attach, sgt, dir);
+	dma_fence_end_signalling(cookie);
+
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_end_access, DMA_BUF);
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 8ff4add71f88..8ba612c7cc16 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -246,6 +246,38 @@ struct dma_buf_ops {
 	 */
 	int (*end_cpu_access)(struct dma_buf *, enum dma_data_direction);
 
+	/**
+	 * @begin_access:
+	 *
+	 * This is called from dma_buf_begin_access() when a device driver
+	 * wants to access the data of the DMABUF. The exporter can use this
+	 * to flush/sync the caches if needed.
+	 *
+	 * This callback is optional.
+	 *
+	 * Returns:
+	 *
+	 * 0 on success or a negative error code on failure.
+	 */
+	int (*begin_access)(struct dma_buf_attachment *, struct sg_table *,
+			    enum dma_data_direction);
+
+	/**
+	 * @end_access:
+	 *
+	 * This is called from dma_buf_end_access() when a device driver is
+	 * done accessing the data of the DMABUF. The exporter can use this
+	 * to flush/sync the caches if needed.
+	 *
+	 * This callback is optional.
+	 *
+	 * Returns:
+	 *
+	 * 0 on success or a negative error code on failure.
+	 */
+	int (*end_access)(struct dma_buf_attachment *, struct sg_table *,
+			  enum dma_data_direction);
+
 	/**
 	 * @mmap:
 	 *
@@ -606,6 +638,11 @@ void dma_buf_detach(struct dma_buf *dmabuf,
 int dma_buf_pin(struct dma_buf_attachment *attach);
 void dma_buf_unpin(struct dma_buf_attachment *attach);
 
+int dma_buf_begin_access(struct dma_buf_attachment *attach,
+			 struct sg_table *sgt, enum dma_data_direction dir);
+int dma_buf_end_access(struct dma_buf_attachment *attach,
+		       struct sg_table *sgt, enum dma_data_direction dir);
+
 struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info);
 
 int dma_buf_fd(struct dma_buf *dmabuf, int flags);
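For the other side of the contract, here is a sketch of what an exporter's .begin_access/.end_access implementation could look like. Patch [2/6] wires this up for udmabuf; the body below is an assumption for illustration, not the actual udmabuf code.

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>

static int my_exporter_begin_access(struct dma_buf_attachment *attach,
				    struct sg_table *sgt,
				    enum dma_data_direction dir)
{
	/* Make the buffer contents visible to the accessing device. */
	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
		dma_sync_sgtable_for_device(attach->dev, sgt, dir);

	return 0;
}

static int my_exporter_end_access(struct dma_buf_attachment *attach,
				  struct sg_table *sgt,
				  enum dma_data_direction dir)
{
	/* Make the device-written data visible to other observers. */
	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
		dma_sync_sgtable_for_cpu(attach->dev, sgt, dir);

	return 0;
}

The hooks deliberately mirror the existing begin_cpu_access/end_cpu_access callbacks, but are driven by an importer's DMA access rather than by CPU access.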
Hi Paul,
kernel test robot noticed the following build warnings:
[auto build test WARNING on usb/usb-testing]
[also build test WARNING on usb/usb-next usb/usb-linus drm-misc/drm-misc-next lwn/docs-next linus/master v6.7 next-20240119]
[If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information]
url:    https://github.com/intel-lab-lkp/linux/commits/Paul-Cercueil/dma-buf-Add-dma...
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git usb-testing
patch link:    https://lore.kernel.org/r/20240119141402.44262-2-paul%40crapouillou.net
patch subject: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()
config: arm-randconfig-001-20240120 (https://download.01.org/0day-ci/archive/20240121/202401210406.YYgVcAC1-lkp@i...)
compiler: clang version 18.0.0git (https://github.com/llvm/llvm-project d92ce344bf641e6bb025b41b3f1a77dd25e2b3e9)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240121/202401210406.YYgVcAC1-lkp@i...)
If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401210406.YYgVcAC1-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/dma-buf/dma-buf.c:1608: warning: Cannot understand  * @dma_buf_begin_access - Call before any hardware access from/to the DMABUF
   on line 1608 - I thought it was a doc line
>> drivers/dma-buf/dma-buf.c:1640: warning: Cannot understand  * @dma_buf_end_access - Call after any hardware access from/to the DMABUF
   on line 1640 - I thought it was a doc line
vim +1608 drivers/dma-buf/dma-buf.c
  1606	
  1607	/**
> 1608	 * @dma_buf_begin_access - Call before any hardware access from/to the DMABUF
  1609	 * @attach:    [in]    attachment used for hardware access
  1610	 * @sg_table:  [in]    scatterlist used for the DMA transfer
  1611	 * @direction: [in]    direction of DMA transfer
  1612	 */
  1613	int dma_buf_begin_access(struct dma_buf_attachment *attach,
  1614				 struct sg_table *sgt, enum dma_data_direction dir)
  1615	{
  1616		struct dma_buf *dmabuf;
  1617		bool cookie;
  1618		int ret;
  1619	
  1620		if (WARN_ON(!attach))
  1621			return -EINVAL;
  1622	
  1623		dmabuf = attach->dmabuf;
  1624	
  1625		if (!dmabuf->ops->begin_access)
  1626			return 0;
  1627	
  1628		cookie = dma_fence_begin_signalling();
  1629		ret = dmabuf->ops->begin_access(attach, sgt, dir);
  1630		dma_fence_end_signalling(cookie);
  1631	
  1632		if (WARN_ON_ONCE(ret))
  1633			return ret;
  1634	
  1635		return 0;
  1636	}
  1637	EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
  1638	
  1639	/**
> 1640	 * @dma_buf_end_access - Call after any hardware access from/to the DMABUF
  1641	 * @attach:    [in]    attachment used for hardware access
  1642	 * @sg_table:  [in]    scatterlist used for the DMA transfer
  1643	 * @direction: [in]    direction of DMA transfer
  1644	 */
  1645	int dma_buf_end_access(struct dma_buf_attachment *attach,
  1646			       struct sg_table *sgt, enum dma_data_direction dir)
  1647	{
  1648		struct dma_buf *dmabuf;
  1649		bool cookie;
  1650		int ret;
  1651	
  1652		if (WARN_ON(!attach))
  1653			return -EINVAL;
  1654	
  1655		dmabuf = attach->dmabuf;
  1656	
  1657		if (!dmabuf->ops->end_access)
  1658			return 0;
  1659	
  1660		cookie = dma_fence_begin_signalling();
  1661		ret = dmabuf->ops->end_access(attach, sgt, dir);
  1662		dma_fence_end_signalling(cookie);
  1663	
  1664		if (WARN_ON_ONCE(ret))
  1665			return ret;
  1666	
  1667		return 0;
  1668	}
  1669	EXPORT_SYMBOL_NS_GPL(dma_buf_end_access, DMA_BUF);
  1670	
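For reference, both warnings come from the leading '@' in the function kernel-doc headers: kernel-doc expects a bare function name there ('@' introduces a parameter or struct member), and the parameter names should match the C signature (sgt/dir rather than sg_table/direction). A fixed header would presumably look like:

/**
 * dma_buf_begin_access - Call before any hardware access from/to the DMABUF
 * @attach:	[in]	attachment used for hardware access
 * @sgt:	[in]	scatterlist used for the DMA transfer
 * @dir:	[in]	direction of DMA transfer
 */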
On 19.01.24 at 15:13, Paul Cercueil wrote:
These functions should be used by device drivers when they start and stop accessing the data of a DMABUF. They allow DMABUF importers to cache the dma_buf_attachment while ensuring that the data they want to access is available for their device when the DMA transfers take place.
As Daniel already noted as well, this is a complete no-go from the DMA-buf design point of view.
Regards, Christian.
[snip]
Hi Christian,
On Monday 22 January 2024 at 11:35 +0100, Christian König wrote:
On 19.01.24 at 15:13, Paul Cercueil wrote:
These functions should be used by device drivers when they start and stop accessing the data of a DMABUF. They allow DMABUF importers to cache the dma_buf_attachment while ensuring that the data they want to access is available for their device when the DMA transfers take place.
As Daniel already noted as well, this is a complete no-go from the DMA-buf design point of view.
What do you mean "as Daniel already noted"? It was him who suggested this.
Regards, Christian.
Cheers, -Paul
[snip]
On 22.01.24 at 12:01, Paul Cercueil wrote:
Hi Christian,
On Monday 22 January 2024 at 11:35 +0100, Christian König wrote:
On 19.01.24 at 15:13, Paul Cercueil wrote:
These functions should be used by device drivers when they start and stop accessing the data of a DMABUF. They allow DMABUF importers to cache the dma_buf_attachment while ensuring that the data they want to access is available for their device when the DMA transfers take place.
As Daniel already noted as well, this is a complete no-go from the DMA-buf design point of view.
What do you mean "as Daniel already noted"? It was him who suggested this.
Sorry, I hadn't fully caught up on the discussion.
In general, DMA-buf is built around the idea that the data can be accessed coherently by the involved devices.
Having a begin/end of access for devices was brought up multiple times, but so far it has been rejected for good reasons.
That an exporter has to call extra functions to access its own buffers is a complete no-go for the design, since this forces exporters into doing extra steps to allow importers to access their data.
That in turn is pretty much un-testable unless you have every possible importer around while testing the exporter.
Regards, Christian.
[snip]
Hi Christian,
On Monday 22 January 2024 at 14:41 +0100, Christian König wrote:
On 22.01.24 at 12:01, Paul Cercueil wrote:
Hi Christian,
On Monday 22 January 2024 at 11:35 +0100, Christian König wrote:
On 19.01.24 at 15:13, Paul Cercueil wrote:
These functions should be used by device drivers when they start and stop accessing the data of a DMABUF. They allow DMABUF importers to cache the dma_buf_attachment while ensuring that the data they want to access is available for their device when the DMA transfers take place.
As Daniel already noted as well, this is a complete no-go from the DMA-buf design point of view.
What do you mean "as Daniel already noted"? It was him who suggested this.
Sorry, I hadn't fully caught up on the discussion.
In general, DMA-buf is built around the idea that the data can be accessed coherently by the involved devices.
Having a begin/end of access for devices was brought up multiple times, but so far it has been rejected for good reasons.
I would argue that if it was brought up multiple times, then there are also good reasons to support such a mechanism.
That an exporter has to call extra functions to access its own buffers is a complete no-go for the design, since this forces exporters into doing extra steps to allow importers to access their data.
Then what if we add these dma_buf_{begin,end}_access(), with implementations only for "dumb" exporters, e.g. udmabuf or the dmabuf heaps? And only importers (who cache the mapping and actually care about non-coherency) would have to call these.
At the very least, is there a way to check that "the data can be accessed coherently by the involved devices"? So that my importer can EPERM if there is no coherency vs. a device that's already attached.
Cheers, -Paul
[snip]
On 23.01.24 at 11:10, Paul Cercueil wrote:
Hi Christian,
On Monday 22 January 2024 at 14:41 +0100, Christian König wrote:
On 22.01.24 at 12:01, Paul Cercueil wrote:
Hi Christian,
On Monday 22 January 2024 at 11:35 +0100, Christian König wrote:
On 19.01.24 at 15:13, Paul Cercueil wrote:
These functions should be used by device drivers when they start and stop accessing the data of a DMABUF. They allow DMABUF importers to cache the dma_buf_attachment while ensuring that the data they want to access is available for their device when the DMA transfers take place.
As Daniel already noted as well, this is a complete no-go from the DMA-buf design point of view.
What do you mean "as Daniel already noted"? It was him who suggested this.
Sorry, I hadn't fully caught up on the discussion.
In general, DMA-buf is built around the idea that the data can be accessed coherently by the involved devices.
Having a begin/end of access for devices was brought up multiple times, but so far it has been rejected for good reasons.
I would argue that if it was brought up multiple times, then there are also good reasons to support such a mechanism.
That an exporter has to call extra functions to access its own buffers is a complete no-go for the design, since this forces exporters into doing extra steps to allow importers to access their data.
Then what if we add these dma_buf_{begin,end}_access(), with implementations only for "dumb" exporters, e.g. udmabuf or the dmabuf heaps? And only importers (who cache the mapping and actually care about non-coherency) would have to call these.
No, the problem is still that you would have to change all importers to mandatorily use dma_buf_begin/end().
But going a step back, caching the mapping is irrelevant for coherency. Even if you don't cache the mapping, you don't get coherency.
In other words, exporters are not required to call sync_to_cpu or sync_to_device when you create a mapping.
What exactly is your use case here? And why does coherency matter?
At the very least, is there a way to check that "the data can be accessed coherently by the involved devices"? So that my importer can EPERM if there is no coherency vs. a device that's already attached.
Yeah, there is functionality for this in the DMA subsystem. I once created prototype patches for enforcing the same coherency approach between importer and exporter, but we never got around to upstreaming them.
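A sketch of the kind of check being asked about, assuming dev_is_dma_coherent() from <linux/dma-map-ops.h> is the relevant functionality (an assumption; the prototype patches mentioned above are not shown in this thread):

#include <linux/dma-map-ops.h>
#include <linux/errno.h>

/*
 * Hypothetical attach-time check: refuse to mix devices with different
 * coherency properties on the same buffer.  The function name and the
 * device pointers are illustrative, not real API.
 */
static int my_check_attach_coherency(struct device *new_dev,
				     struct device *already_attached_dev)
{
	if (dev_is_dma_coherent(new_dev) !=
	    dev_is_dma_coherent(already_attached_dev))
		return -EPERM;	/* the importer could then refuse the attach */

	return 0;
}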
Regards, Christian.
[snip]
On Tuesday 23 January 2024 at 12:52 +0100, Christian König wrote:
On 23.01.24 at 11:10, Paul Cercueil wrote:
Hi Christian,
On Monday 22 January 2024 at 14:41 +0100, Christian König wrote:
On 22.01.24 at 12:01, Paul Cercueil wrote:
Hi Christian,
On Monday 22 January 2024 at 11:35 +0100, Christian König wrote:
On 19.01.24 at 15:13, Paul Cercueil wrote:
These functions should be used by device drivers when they start and stop accessing the data of a DMABUF. They allow DMABUF importers to cache the dma_buf_attachment while ensuring that the data they want to access is available for their device when the DMA transfers take place.
As Daniel already noted as well, this is a complete no-go from the DMA-buf design point of view.
What do you mean "as Daniel already noted"? It was him who suggested this.
Sorry, I hadn't fully caught up on the discussion.
In general, DMA-buf is built around the idea that the data can be accessed coherently by the involved devices.
Having a begin/end of access for devices was brought up multiple times, but so far it has been rejected for good reasons.
I would argue that if it was brought up multiple times, then there are also good reasons to support such a mechanism.
That an exporter has to call extra functions to access its own buffers is a complete no-go for the design, since this forces exporters into doing extra steps to allow importers to access their data.
Then what if we add these dma_buf_{begin,end}_access(), with implementations only for "dumb" exporters, e.g. udmabuf or the dmabuf heaps? And only importers (who cache the mapping and actually care about non-coherency) would have to call these.
No, the problem is still that you would have to change all importers to mandatorily use dma_buf_begin/end().
But going a step back, caching the mapping is irrelevant for coherency. Even if you don't cache the mapping, you don't get coherency.
You actually do - at least with udmabuf, as in that case dma_buf_map_attachment() / dma_buf_unmap_attachment() will handle cache coherency when the SGs are mapped/unmapped.
The problem was then that dma_buf_unmap_attachment() cannot be called before the dma_fence is signaled, and calling it after is already too late (because the fence would be signaled before the data is sync'd).
Daniel / Sima suggested then that I cache the mapping and add new functions to ensure cache coherency, which is what these patches are about.
In other words, exporters are not required to call sync_to_cpu or sync_to_device when you create a mapping.
What exactly is your use case here? And why does coherency matter?
My use-case is, I create DMABUFs with udmabuf, that I attach to USB/functionfs with the interface introduced by this patchset. I attach them to IIO with a similar interface (being upstreamed in parallel), and transfer data from USB to IIO and vice-versa in a zero-copy fashion.
This works perfectly fine as long as the USB and IIO hardware are coherent between themselves, which is the case on most of our boards. However I do have a board (with a Xilinx Ultrascale SoC) where it is not the case, and cache flushes/sync are needed. So I was trying to rework these new interfaces to work on that system too.
If this really is a no-no, then I am fine with the assumption that devices sharing a DMABUF must be coherent between themselves; but that's something that should probably be enforced rather than assumed.
(and I *think* there is a way to force coherency in the Ultrascale's interconnect - we're investigating it)
Cheers, -Paul
[snip]
On 23.01.24 at 14:02, Paul Cercueil wrote:
[SNIP]
That an exporter has to call extra functions to access its own buffers is a complete no-go for the design, since this forces exporters into doing extra steps to allow importers to access their data.
Then what if we add these dma_buf_{begin,end}_access(), with implementations only for "dumb" exporters, e.g. udmabuf or the dmabuf heaps? And only importers (who cache the mapping and actually care about non-coherency) would have to call these.
No, the problem is still that you would have to change all importers to mandatorily use dma_buf_begin/end().
But going a step back, caching the mapping is irrelevant for coherency. Even if you don't cache the mapping, you don't get coherency.
You actually do - at least with udmabuf, as in that case dma_buf_map_attachment() / dma_buf_unmap_attachment() will handle cache coherency when the SGs are mapped/unmapped.
Well I just double checked the source in 6.7.1 and I can't see udmabuf doing anything for cache coherency in map/unmap.
All it does is call dma_map_sgtable() and dma_unmap_sgtable() to create and destroy the SG table, and those are not supposed to sync anything to the CPU cache.
In other words drivers usually use DMA_ATTR_SKIP_CPU_SYNC here, it's just that this is missing from udmabuf.
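To illustrate the DMA_ATTR_SKIP_CPU_SYNC point with the actual DMA API (an aside, not code from the series):

#include <linux/dma-mapping.h>

/*
 * dma_map_sgtable() performs an implicit CPU cache sync unless the
 * caller opts out via the attrs argument; udmabuf passes 0 today.
 * A driver that manages coherency explicitly would map like this:
 */
static int map_without_implicit_sync(struct device *dev,
				     struct sg_table *sgt,
				     enum dma_data_direction dir)
{
	return dma_map_sgtable(dev, sgt, dir, DMA_ATTR_SKIP_CPU_SYNC);
}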
The problem was then that dma_buf_unmap_attachment cannot be called before the dma_fence is signaled, and calling it after is already too late (because the fence would be signaled before the data is sync'd).
Well what sync are you talking about? CPU sync? In DMA-buf that is handled differently.
For importers it's mandatory that they can be coherent with the exporter. That usually means they can snoop the CPU cache if the exporter can snoop the CPU cache.
For exporters you can implement the begin/end CPU access functions which allows you to implement something even if your exporting device can't snoop the CPU cache.
Daniel / Sima suggested then that I cache the mapping and add new functions to ensure cache coherency, which is what these patches are about.
Yeah, I've now caught up on the latest mails. Sorry, I hadn't seen that before.
In other words, exporters are not required to call sync_to_cpu or sync_to_device when you create a mapping.
What exactly is your use case here? And why does coherency matter?
My use-case is, I create DMABUFs with udmabuf, that I attach to USB/functionfs with the interface introduced by this patchset. I attach them to IIO with a similar interface (being upstreamed in parallel), and transfer data from USB to IIO and vice-versa in a zero-copy fashion.
This works perfectly fine as long as the USB and IIO hardware are coherent between themselves, which is the case on most of our boards. However I do have a board (with a Xilinx Ultrascale SoC) where it is not the case, and cache flushes/sync are needed. So I was trying to rework these new interfaces to work on that system too.
Yeah, that sounds strongly like one of the use cases we have rejected so far.
If this really is a no-no, then I am fine with the assumption that devices sharing a DMABUF must be coherent between themselves; but that's something that should probably be enforced rather than assumed.
(and I *think* there is a way to force coherency in the Ultrascale's interconnect - we're investigating it)
What you can do instead of using udmabuf or dma-heaps is have the device which can't provide coherency act as the exporter of the buffers.
The exporter is allowed to call sync_for_cpu/sync_for_device on its own buffers and also gets begin/end CPU access notifications. So you can then handle coherency between the exporter and the CPU.
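A sketch of that suggestion (the shape is assumed; the my_* names are hypothetical): with the non-coherent device as the exporter, it can legally bracket its own transfers with the DMA API sync calls, and no importer-visible API change is needed.

#include <linux/dma-mapping.h>
#include <linux/dma-fence.h>

struct my_exporter {			/* hypothetical exporter state */
	struct device *dev;
	struct sg_table *sgt;
	struct dma_fence *fence;
};

static void my_hw_kick(struct my_exporter *priv);	/* hypothetical hardware start */

/* Before the exporting device reads the buffer it owns. */
static void my_exporter_start_dma(struct my_exporter *priv)
{
	dma_sync_sgtable_for_device(priv->dev, priv->sgt, DMA_TO_DEVICE);
	my_hw_kick(priv);
}

/* After the exporting device wrote the buffer, before signalling. */
static void my_exporter_dma_complete(struct my_exporter *priv)
{
	dma_sync_sgtable_for_cpu(priv->dev, priv->sgt, DMA_FROM_DEVICE);
	dma_fence_signal(priv->fence);
}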
If you really don't have coherency between devices then that would be a really new use case and we would need much more agreement on how to do this.
Regards, Christian.
[snip]
Hi Christian,
On Tuesday 23 January 2024 at 14:28 +0100, Christian König wrote:
On 23.01.24 at 14:02, Paul Cercueil wrote:
[SNIP]
That an exporter has to call extra functions to access its own buffers is a complete no-go for the design, since this forces exporters into doing extra steps to allow importers to access their data.
Then what if we add these dma_buf_{begin,end}_access(), with implementations only for "dumb" exporters, e.g. udmabuf or the dmabuf heaps? And only importers (who cache the mapping and actually care about non-coherency) would have to call these.
No, the problem is still that you would have to change all importers to mandatorily use dma_buf_begin/end().
But going a step back, caching the mapping is irrelevant for coherency. Even if you don't cache the mapping, you don't get coherency.
You actually do - at least with udmabuf, as in that case dma_buf_map_attachment() / dma_buf_unmap_attachment() will handle cache coherency when the SGs are mapped/unmapped.
Well I just double checked the source in 6.7.1 and I can't see udmabuf doing anything for cache coherency in map/unmap.
All it does is call dma_map_sgtable() and dma_unmap_sgtable() to create and destroy the SG table, and those are not supposed to sync anything to the CPU cache.
In other words drivers usually use DMA_ATTR_SKIP_CPU_SYNC here, it's just that this is missing from udmabuf.
Ok.
The problem was then that dma_buf_unmap_attachment cannot be called before the dma_fence is signaled, and calling it after is already too late (because the fence would be signaled before the data is sync'd).
Well what sync are you talking about? CPU sync? In DMA-buf that is handled differently. For importers it's mandatory that they can be coherent with the exporter. That usually means they can snoop the CPU cache if the exporter can snoop the CPU cache.
I seem to have such a system where one device can snoop the CPU cache and the other cannot. Therefore if I want to support it properly, I do need cache flush/sync. I don't actually try to access the data using the CPU (and when I do, I call the sync start/end ioctls).
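[The "sync start/end ioctls" mentioned here are DMA_BUF_IOCTL_SYNC from the uapi <linux/dma-buf.h>; a minimal userspace sketch of the bracketing, where fd is assumed to be a DMABUF file descriptor:]

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

static int dmabuf_sync(int fd, __u64 start_or_end)
{
	struct dma_buf_sync sync = {
		.flags = start_or_end | DMA_BUF_SYNC_RW,
	};

	return ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
}

/* usage:
 *	dmabuf_sync(fd, DMA_BUF_SYNC_START);
 *	... CPU reads/writes the buffer ...
 *	dmabuf_sync(fd, DMA_BUF_SYNC_END);
 */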
For exporters you can implement the begin/end CPU access functions which allows you to implement something even if your exporting device can't snoop the CPU cache.
That only works if the importers call the begin_cpu_access() / end_cpu_access(), which they don't.
Daniel / Sima suggested then that I cache the mapping and add new functions to ensure cache coherency, which is what these patches are about.
Yeah, I've now caught up on the latest mail. Sorry, I hadn't seen that before.
In other words exporters are not required to call sync_to_cpu or sync_to_device when you create a mapping.
What exactly is your use case here? And why does coherency matter?
My use case is: I create DMABUFs with udmabuf, which I attach to USB/functionfs with the interface introduced by this patchset. I attach them to IIO with a similar interface (being upstreamed in parallel), and transfer data from USB to IIO and vice-versa in a zero-copy fashion.
This works perfectly fine as long as the USB and IIO hardware are coherent between themselves, which is the case on most of our boards. However I do have a board (with a Xilinx Ultrascale SoC) where it is not the case, and cache flushes/sync are needed. So I was trying to rework these new interfaces to work on that system too.
Yeah, that sounds strongly like one of the use cases we have rejected so far.
If this really is a no-no, then I am fine with the assumption that devices sharing a DMABUF must be coherent between themselves; but that's something that should probably be enforced rather than assumed.
(and I *think* there is a way to force coherency in the Ultrascale's interconnect - we're investigating it)
What you can do, instead of using udmabuf or dma-heaps, is have the device which can't provide coherency act as the exporter of the buffers. The exporter is allowed to call sync_for_cpu/sync_for_device on its own buffers and also gets begin/end CPU access notifications. So you can then handle coherency between the exporter and the CPU.
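[A sketch of what that could look like in such an exporter's begin_cpu_access callback, assuming it keeps its own device and scatterlist in a hypothetical private struct:]

static int my_begin_cpu_access(struct dma_buf *dmabuf,
			       enum dma_data_direction dir)
{
	struct my_export *priv = dmabuf->priv;	/* hypothetical */

	/* Give the CPU a coherent view of the exporter's buffer. */
	dma_sync_sgtable_for_cpu(priv->dev, priv->sgt, dir);
	return 0;
}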
But again that would only work if the importers would call begin_cpu_access() / end_cpu_access(), which they don't, because they don't actually access the data using the CPU.
Unless you mean that the exporter can call sync_for_cpu/sync_for_device before/after every single DMA transfer so that the data appears coherent to the importers, without them having to call begin_cpu_access() / end_cpu_access().
In which case - this would still multiply the complexity; my USB-functionfs interface here (and IIO interface in the separate patchset) are not device-specific, so I'd rather keep them importers.
If you really don't have coherency between devices then that would be a really new use case and we would need much more agreement on how to do this.
[snip]
Agreed. Designing a good generic solution would be better.
With that said...
Let's keep it out of this USB-functionfs interface for now. The interface does work perfectly fine on platforms that don't have coherency problems. The coherency issue in itself really is a tangential issue.
So I will send a v6 where I don't try to force the cache coherency - and instead assume that the attached devices are coherent between themselves.
But it would be even better to have a way to detect non-coherency and return an error on attach.
Cheers, -Paul
On 1/24/24 4:58 AM, Paul Cercueil wrote:
[SNIP]
I seem to have such a system where one device can snoop the CPU cache and the other cannot. Therefore if I want to support it properly, I do need cache flush/sync. I don't actually try to access the data using the CPU (and when I do, I call the sync start/end ioctls).
If you don't access the data using the CPU, then how did the data end up in the CPU caches? If you have a device that can write-allocate into your CPU cache, but some other device in the system cannot snoop that data back out, then that is just broken and those devices cannot reasonably share buffers.
Now we do have systems where some hardware can snoop CPU(or L3) caches and others cannot, but they will not *allocate* into those caches (unless they also have the ability to sync them without CPU in the loop).
Your problem may be if you are still using udmabuf driver as your DMA-BUF exporter, which as said before is broken (and I just sent some patches with a few fixes just for you :)). For udmabuf, data starts in the CPU domain (in caches) and is only ever synced for the CPU, not for attached devices. So in this case the writing device might update those cache lines but a non-snooping reader would never see those updates.
I'm not saying there isn't a need for these new {begin,end}_access() functions. I can think of a few interesting use cases, but as you say below that would be good to work out in a different series.
Andrew
Hi Andrew,
Le mercredi 24 janvier 2024 à 09:38 -0600, Andrew Davis a écrit :
On 1/24/24 4:58 AM, Paul Cercueil wrote:
[SNIP]
If you don't access the data using the CPU, then how did the data end up in the CPU caches? If you have a device that can write-allocate into your CPU cache, but some other device in the system cannot snoop that data back out, then that is just broken and those devices cannot reasonably share buffers.
I think that's what happens, yes.
Now we do have systems where some hardware can snoop CPU(or L3) caches and others cannot, but they will not *allocate* into those caches (unless they also have the ability to sync them without CPU in the loop).
Your problem may be if you are still using udmabuf driver as your DMA-BUF exporter, which as said before is broken (and I just sent some patches with a few fixes just for you :)). For udmabuf, data starts in the CPU domain (in caches) and is only ever synced for the CPU, not for attached devices. So in this case the writing device might update those cache lines but a non-snooping reader would never see those updates.
I tried today with the system dma-heap, and the behaviour was the same. Adding an implementation of .begin_access()/.end_access() to it made it work there too.
I'm not saying there isn't a need for these new {begin,end}_access() functions. I can think of a few interesting use cases, but as you say below that would be good to work out in a different series.
Yep, but it's a can of worms I'd rather not open if I can avoid it :)
Cheers, -Paul
Am 24.01.24 um 11:58 schrieb Paul Cercueil:
[SNIP]
The problem was then that dma_buf_unmap_attachment cannot be called before the dma_fence is signaled, and calling it after is already too late (because the fence would be signaled before the data is sync'd).
Well what sync are you talking about? CPU sync? In DMA-buf that is handled differently. For importers it's mandatory that they can be coherent with the exporter. That usually means they can snoop the CPU cache if the exporter can snoop the CPU cache.
I seem to have such a system where one device can snoop the CPU cache and the other cannot. Therefore if I want to support it properly, I do need cache flush/sync. I don't actually try to access the data using the CPU (and when I do, I call the sync start/end ioctls).
Usually that isn't a problem as long as you don't access the data with the CPU.
[SNIP]
(and I *think* there is a way to force coherency in the Ultrascale's interconnect - we're investigating it)
What you can do, instead of using udmabuf or dma-heaps, is have the device which can't provide coherency act as the exporter of the buffers. The exporter is allowed to call sync_for_cpu/sync_for_device on its own buffers and also gets begin/end CPU access notifications. So you can then handle coherency between the exporter and the CPU.
But again that would only work if the importers would call begin_cpu_access() / end_cpu_access(), which they don't, because they don't actually access the data using the CPU.
Wow, that is a completely new use case then.
Neither DMA-buf nor the DMA subsystem in Linux actually supports this as far as I can see.
Unless you mean that the exporter can call sync_for_cpu/sync_for_device before/after every single DMA transfer so that the data appears coherent to the importers, without them having to call begin_cpu_access() / end_cpu_access().
Yeah, I mean the importers don't have to call begin_cpu_access() / end_cpu_access() if they don't do CPU access :)
What you can still do as exporter is to call sync_for_device() and sync_for_cpu() before and after each operation on your non-coherent device. Paired with the fence signaling that should still work fine then.
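[A rough sketch of that sequence for a device-to-memory transfer, with hypothetical exporter state (struct my_dev, my_hw_start_dma()):]

static void my_start_transfer(struct my_dev *md)
{
	/* Hand the buffer to the non-coherent device... */
	dma_sync_sgtable_for_device(md->dev, md->sgt, DMA_FROM_DEVICE);
	my_hw_start_dma(md);	/* hypothetical */
}

static void my_transfer_complete(struct my_dev *md)
{
	/* ...and only signal the fence once the data is visible again. */
	dma_sync_sgtable_for_cpu(md->dev, md->sgt, DMA_FROM_DEVICE);
	dma_fence_signal(md->fence);
}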
But taking a step back, this use case is not something even the low-level DMA subsystem supports. That sync_for_cpu() does the right thing is coincidence, not proper engineering.
What you need is a sync_device_to_device() which does the appropriate actions depending on which devices are involved.
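[No such API exists today; purely as a strawman, its shape might be:]

/* Hypothetical - nothing like this exists in the DMA subsystem yet: */
int dma_sync_device_to_device(struct device *from, struct device *to,
			      struct sg_table *sgt,
			      enum dma_data_direction dir);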
In which case - this would still multiply the complexity; my USB-functionfs interface here (and IIO interface in the separate patchset) are not device-specific, so I'd rather keep them importers.
If you really don't have coherency between devices then that would be a really new use case and we would need much more agreement on how to do this.
[snip]
Agreed. Designing a good generic solution would be better.
With that said...
Let's keep it out of this USB-functionfs interface for now. The interface does work perfectly fine on platforms that don't have coherency problems. The coherency issue in itself really is a tangential issue.
Yeah, completely agree.
So I will send a v6 where I don't try to force the cache coherency - and instead assume that the attached devices are coherent between themselves.
But it would be even better to have a way to detect non-coherency and return an error on attach.
Take a look into the DMA subsystem. I'm pretty sure we already have something like this in there.
If nothing else helps you could take a look if the coherent memory access mask is non-zero or something like that.
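[A sketch of that kind of check at attach time, assuming dev_is_dma_coherent() from <linux/dma-map-ops.h> is usable in this context - note it falls back to reporting "coherent" on architectures that don't track this per device:]

static int my_attach(struct dma_buf *dmabuf,
		     struct dma_buf_attachment *attach)
{
	/* Refuse importers that are not DMA-coherent. */
	if (!dev_is_dma_coherent(attach->dev))
		return -EINVAL;

	return 0;
}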
Regards, Christian.
On Thu, Jan 25, 2024 at 04:00:16PM +0100, Christian König wrote:
Am 24.01.24 um 11:58 schrieb Paul Cercueil:
[SNIP]
Jumping in way late, and apologies to everyone, since yes I indeed suggested this entire mess to Paul in some private thread.
And worse, I think we need it, it's just that we got away without it thus far.
So way back at the og dma-buf kick-off dma coherency was discussed, and a few things were noted:
- the dma api only supports device<->cpu coherency
- getting the full coherency model off the ground right away is probably too hard, so we made the decision that where it matters, relevant flushing needs to be done in dma_buf_map/unmap.
If you look at the earliest patches for dma-buf we had pretty clear language that all dma-operations should be bracketed with map/unmap. Of course that didn't work out for drm at all, and we had to first get dma_resv_lock and dma_fence landed, and then your dynamic exporter/importer support in, just to get the buffer migration functionality working - which was only one of the things that bracketing everything with map/unmap was supposed to take care of.
The other was coherency management. But looking through archives I think this was already agreed to be postponed for later in the original kick-off meeting and never further discussed on the mailing list.
This worked for a fairly long time, because thus far dma-buf was used on fairly reasonable architectures where all participating devices are coherent enough.
We did have to add the cpu access flushing fairly quickly because there's a lot of SoC chips (including intel) where that was necessary, but even that was added later on, as an opt-in and without fixing everyone. See fc13020e086b ("dma-buf: add support for kernel cpu access").
The ioctl to allow userspace to do flushing was added even later on, and there the entire yolo opt-in situation is even worse. c11e391da2a8 ("dma-buf: Add ioctls to allow userspace to flush") was only in 2016, 5 years after dma-buf landed.
It looks like it's finally time to add the device side flushing functions we've talked about first over 12 years ago :-)
The reason this pops up now is that, unlike other dma-buf users on maybe somewhat more funky architectures, Paul's patches want to use dma_fence for synchronization of the dma operations. Which means you cannot call the full dma_buf_map/unmap dance, because that takes dma_resv_lock - an absolute no-go in a dma_fence critical path.
And yes in those 12 years the dma-api hasn't gained the device2device sync support we'd need, but neither has it gained the multiple devices <-> cpu sync support we'd strictly need for dma-buf. So yes this is all a terrible hodge-podge of hacks, but if we'd require theoretically perfect code we'd still have zero dma-buf support in upstream.
This also includes how we landed these extensions: none of them in the past have landed with an "update all existing exporters/importers" rule. We talked about that every time, and rejected it every time, for imo pretty good reasons - the perf impact tends to be way too harsh if you impose over-flushing on everyone, including the reasonable platforms. And we currently can't do less than over-flushing with the current dma-api interfaces, because we don't have the specific flush functions we'd need. So really this isn't a worse abuse of the dma-api than what we have. It's definitely a bit wasteful, since the functions we use do in theory flush too much. But in practice, on these funky architectures, they flush enough.
There's also the very hard issue of actually trying to optimize flushes, because a dma operation might only access part of a buffer, and you might interleave read/write access by different devices in very innovative ways. So I'm firmly on the "make it work first, then fast" side of things.
So dma-buf will continue to be a thing that's tested for specific combos, and then we'll patch them. It's a decade-plus tradition at this point.
Which is all a very long-winded way of saying that yes, I think we need this, and we would have needed it 12 years ago already if we'd aimed for perfect.
I have a bunch of detail comments on the patch itself, but I guess we first need to find consensus on whether it's a good idea in the first place.
Cheers, Sima
Am 25.01.24 um 19:01 schrieb Daniel Vetter:
On Thu, Jan 25, 2024 at 04:00:16PM +0100, Christian König wrote:
Am 24.01.24 um 11:58 schrieb Paul Cercueil:
[SNIP]
I have a bunch of detail comments on the patch itself, but I guess we first need to find consensus on whether it's a good idea in the first place.
Well I think we should have some solution, but I'm not sure if it should be part of DMA-buf.
Essentially a DMA-buf exporter exports its buffers the way it uses them, and the importer (or the DMA-buf subsystem) is then the one who says: OK, I can use this, or I can't use this, or I need to call extra functions to use this, or whatever.
It's not the job of the exporter to provide the coherency for the importer, because otherwise we would have a lot of code in the exporter which can only be tested when you have the right importer around. And I strongly think that this is a no-go for having a reliable solution.
That's why I think the approach of having DMA-buf callbacks is most likely the wrong thing to do.
What should happen instead is that the DMA subsystem provides functionality with which devices that don't have coherency through their connection can say: I want to access this data, please make sure to flush the appropriate caches. But that's just a very, very rough design idea.
This will become more pressing with CXL on the horizon, I think.
Regards, Christian.
On Fri, Jan 26, 2024 at 05:42:50PM +0100, Christian König wrote:
Am 25.01.24 um 19:01 schrieb Daniel Vetter:
On Thu, Jan 25, 2024 at 04:00:16PM +0100, Christian König wrote:
Am 24.01.24 um 11:58 schrieb Paul Cercueil:
[SNIP]
Well I think we should have some solution, but I'm not sure if it should be part of DMA-buf.
Essentially a DMA-buf exporter exports its buffers the way it uses them, and the importer (or the DMA-buf subsystem) is then the one who says: OK, I can use this, or I can't use this, or I need to call extra functions to use this, or whatever.
It's not the job of the exporter to provide the coherency for the importer, because otherwise we would have a lot of code in the exporter which can only be tested when you have the right importer around. And I strongly think that this is a no-go for having a reliable solution.
The trouble is that if you have memory other than stuff allocated by the dma-api or mapped using the dma-api, then by necessity the exporter has to deal with this.
Which is the exact same reason we also force the exporters to deal with the cpu cache flushing - your argument that it's not great to replicate this everywhere holds there equally.
The other thing is that right now the exporter is the only one who actually knows what kind of dma coherency rules apply for a certain piece of memory. E.g. on i915-gem even if it's dma_map_sg mapped the underlying i915-gem buffer might be non-coherent, and i915-gem makes it all work by doing the appropriate amount of clflush.
Similar funky things happen in other cases.
So unless we add an interface which allows importers to figure out how much flushing is needed, currently the exporter is the only one who knows (because it can inspect the struct device at dma_buf_attach time).
We could flip this around, but it would be a rather serious departure from the dma-buf design approach thus far.
That's why I think the approach of having DMA-buf callbacks is most likely the wrong thing to do.
What should happen instead is that the DMA subsystem provides functionality with which devices that don't have coherency through their connection can say: I want to access this data, please make sure to flush the appropriate caches. But that's just a very, very rough design idea.
This will become more pressing with CXL on the horizon, I think.
Yeah CXL will make this all even more fun, but we are firmly there already with devices deciding per-buffer (or sometimes even per-access with intel's MOCS stuff) what coherency mode to use for a buffer.
Also Arm SoCs generally have both coherent and non-coherent device interconnects, and I think some devices can even switch with runtime flags which mode they use for a specific transition.
CXL just extends this to pcie devices.
So the mess is here, how do we deal with it?
My take is that the opt-in callback addition is far from great, but it's in line with how we extended dma-buf the past decade plus too. So unless someone's volunteering to pour some serious time into re-engineering this all (including testing all the different device/driver<->device/driver interactions) I think there's only really one other option: To not support these cases at all. And I don't really like that, because it means people will hack together something even worse in their drivers.
By adding it to dma-buf it'll stare us in our faces at least :-/
Cheers, Sima
Am 30.01.24 um 10:01 schrieb Daniel Vetter:
On Fri, Jan 26, 2024 at 05:42:50PM +0100, Christian König wrote:
[SNIP]
Well I think we should have some solution, but I'm not sure if it should be part of DMA-buf.
Essentially a DMA-buf exporter exports its buffers the way it uses them, and the importer (or the DMA-buf subsystem) is then the one who says: OK, I can use this, or I can't use this, or I need to call extra functions to use this, or whatever.
It's not the job of the exporter to provide the coherency for the importer, because otherwise we would have a lot of code in the exporter which can only be tested when you have the right importer around. And I strongly think that this is a no-go for having a reliable solution.
The trouble is that if you have memory other than stuff allocated by the dma-api or mapped using the dma-api, then by necessity the exporter has to deal with this.
Yes, I was thinking about that as well.
Which is the exact same reason we also force the exporters to deal with the cpu cache flushing - your argument that it's not great to replicate this everywhere holds there equally.
And I'm not really happy with that either.
The other thing is that right now the exporter is the only one who actually knows what kind of dma coherency rules apply for a certain piece of memory. E.g. on i915-gem even if it's dma_map_sg mapped the underlying i915-gem buffer might be non-coherent, and i915-gem makes it all work by doing the appropriate amount of clflush.
Yeah, exactly that's the reason why I think that this stuff doesn't belong into exporters/drivers.
Looking at what kind of hacks and workarounds we have in both amdgpu as well as i915 it's pretty clear that we need to improve this design somehow.
Similar funky things happen in other cases.
So unless we add an interface which allows importers to figure out how much flushing is needed, currently the exporter is the only one who knows (because it can inspect the struct device at dma_buf_attach time).
We could flip this around, but it would be a rather serious departure from the dma-buf design approach thus far.
Well clients already give the DMA-direction to exporters when creating the mapping and get an appropriate sg_table in return.
All we need to do is get the information about what flushing is needed into the object returned here, so that the DMA API can work with it.
Christoph Hellwig pretty much nailed it when he said that the problem with the sg_table is that it mixes input and output parameters of the DMA-API.
I would extend that and say that we need a mapping object the DMA-API can work with so that it can know what needs to be done when devices request that data is made coherent between them or the CPU.
That's why I think the approach of having DMA-buf callbacks is most likely the wrong thing to do.
What should happen instead is that the DMA subsystem provides functionality with which devices that don't have coherency through their connection can say: I want to access this data, please make sure to flush the appropriate caches. But that's just a very, very rough design idea.
This will become more pressing with CXL on the horizon, I think.
Yeah CXL will make this all even more fun, but we are firmly there already with devices deciding per-buffer (or sometimes even per-access with intel's MOCS stuff) what coherency mode to use for a buffer.
Also Arm SoCs generally have both coherent and non-coherent device interconnects, and I think some devices can even switch with runtime flags which mode they use for a specific transition.
CXL just extends this to pcie devices.
So the mess is here, how do we deal with it?
I would say we start with the DMA-API by getting away from sg_tables to something cleaner and state oriented.
My take is that the opt-in callback addition is far from great, but it's in line with how we extended dma-buf the past decade plus too. So unless someone's volunteering to pour some serious time into re-engineering this all (including testing all the different device/driver<->device/driver interactions) I think there's only really one other option: To not support these cases at all. And I don't really like that, because it means people will hack together something even worse in their drivers.
By adding it to dma-buf it'll stare us in our faces at least :-/
Yeah, it's the path of least resistance. But with CXL on the horizon, and more and more drivers using it, I think it's predictable that this will sooner or later blow up.
Cheers, Christian.
Hi Christian,
(Your email software is configured for HTML btw)
Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :
Am 30.01.24 um 10:01 schrieb Daniel Vetter:
On Fri, Jan 26, 2024 at 05:42:50PM +0100, Christian König wrote:
[SNIP] Well I think we should have some solution, but I'm not sure if it should be part of DMA-buf.
Essentially a DMA-buf exports the buffers as he uses it and the importer (or the DMA-buf subsystem) is then the one who says ok I can use this or I can't use this or I need to call extra functions to use this or whatever.
It's not the job of the exporter to provide the coherency for the importer, cause otherwise we would have a lot of code in the exporter which can only be tested when you have the right importer around. And I strongly think that this is a no-go for having a reliable solution.
The trouble is, that if you have other memory than stuff allocated by the dma-api or mapped using the dma-api, then by necessity the exporter has to deal with this.
Yes, I was thinking about that as well.
Which is the exact same reason we also force the exporters to deal with the cpu cache flushing - you're argument that it's not great to replicate this everywhere holds there equally.
And I'm not really happy with that either.
The other thing is that right now the exporter is the only one who actually knows what kind of dma coherency rules apply for a certain piece of memory. E.g. on i915-gem even if it's dma_map_sg mapped the underlying i915-gem buffer might be non-coherent, and i915-gem makes it all work by doing the appropriate amount of clflush.
Yeah, exactly that's the reason why I think that this stuff doesn't belong into exporters/drivers. Looking at what kind of hacks and workarounds we have in both amdgpu as well as i915 it's pretty clear that we need to improve this design somehow.
Similar funky things happen in other cases.
So unless we add an interface which allows importers to figure out how much flushing is needed, currently the exporter is the only one who knows (because it can inspect the struct device at dma_buf_attach time).
We could flip this around, but it would be a rather serious depature from the dma-buf design approach thus far.
Well clients already give the DMA-direction to exporters when creating the mapping and get an appropriate sg_table in return. All we need to do is getting the information what flushing is needed into the object returned here so that the DMA API can work with it. Christoph Hellwig pretty much nailed it when he said that the problem with the sg_table is that it mixes input and output parameters of the DMA-API. I would extend that and say that we need a mapping object the DMA- API can work with so that it can know what needs to be done when devices request that data is made coherent between them or the CPU.
That's why I think the approach of having DMA-buf callbacks is most likely the wrong thing to do.
What should happen instead is that the DMA subsystem provides functionality which to devices which don't support coherency through it's connection to say I want to access this data, please make sure to flush the appropriate catches. But that's just a very very rough design idea.
This will become more pressing with CXL on the horizon, I think.
Yeah CXL will make this all even more fun, but we are firmly there already with devices deciding per-buffer (or sometimes even per-access with intel's MOCS stuff) what coherency mode to use for a buffer.
Also arm soc generally have both coherent and non-coherent device interconnects, and I think some devices can switch with runtime flags too which mode they use for a specific transition.
CXL just extends this to pcie devices.
So the mess is here, how do we deal with it?
I would say we start with the DMA-API by getting away from sg_tables to something cleaner and state oriented.
FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset, which is just a dead simple
struct dma_vec {
	dma_addr_t addr;
	size_t len;
};
(The rationale for introducing it in the IIO DMABUF patchset was that the "scatterlist" wouldn't allow me to change the transfer size.)
So I believe a new "sg_table"-like could just be an array of struct dma_vec + flags.
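For illustration, here is a rough sketch of what such a mapping object could look like; everything below is hypothetical and not part of any posted patch:

/* Hypothetical sketch: an "sg_table"-like mapping object built from
 * struct dma_vec entries plus flags. Names are illustrative only. */
#include <linux/types.h>

struct dma_vec_table {
	struct dma_vec *vecs;	/* one entry per contiguous DMA chunk */
	unsigned int nents;	/* number of entries in @vecs */
	u32 flags;		/* e.g. coherency hints; hypothetical */
};

Unlike struct scatterlist, this would carry only the DMA-side output of the mapping, with no struct page input mixed in.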
Cheers, -Paul
On Tue, Jan 30, 2024 at 10:48:23AM +0100, Paul Cercueil wrote:
Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :
I would say we start with the DMA-API by getting away from sg_tables to something cleaner and state oriented.
FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset, which is just a dead simple
struct dma_vec {
	dma_addr_t addr;
	size_t len;
};
(The rationale for introducing it in the IIO DMABUF patchset was that the "scatterlist" wouldn't allow me to change the transfer size.)
So I believe a new "sg_table"-like could just be an array of struct dma_vec + flags.
Yeah that's pretty much the proposal I've seen, split the sg table into input data (struct page + len) and output data (which is the dma_addr_t + len you have above).
The part I don't expect to ever happen, because it hasn't happened in the past 20 or so years, is that the dma-api will give us information about what is needed to keep the buffers coherent between various devices and the cpu. -Sima
Am 30.01.24 um 11:40 schrieb Daniel Vetter:
On Tue, Jan 30, 2024 at 10:48:23AM +0100, Paul Cercueil wrote:
Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :
I would say we start with the DMA-API by getting away from sg_tables to something cleaner and state oriented.
FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset, which is just a dead simple
struct dma_vec {
	dma_addr_t addr;
	size_t len;
};
(The rationale for introducing it in the IIO DMABUF patchset was that the "scatterlist" wouldn't allow me to change the transfer size.)
So I believe a new "sg_table"-like could just be an array of struct dma_vec + flags.
Yeah that's pretty much the proposal I've seen, split the sg table into input data (struct page + len) and output data (which is the dma_addr_t + len you have above).
I would extend that a bit and say we have an array with dma_addr+power_of_two_order and a header structure with lower bit offset and some DMA transaction flags.
But this is something which can be worked on as an optimization later. For a start, this proposal here looks good to me as well.
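A rough sketch of that packed variant, purely illustrative (no such interface exists): each entry is a naturally aligned DMA address with the power-of-two order folded into its low bits, and the header carries the sub-chunk offset and transaction flags.

/* Hypothetical sketch of the packed layout described above. Since a
 * chunk of PAGE_SIZE << order bytes is naturally aligned, the low
 * bits of the address are free to store the order. */
#include <linux/mm.h>
#include <linux/types.h>

struct dma_addr_table {
	unsigned int offset;	/* byte offset into the first chunk */
	u32 flags;		/* DMA transaction flags; illustrative */
	unsigned int nents;
	dma_addr_t entries[];	/* addr | order, see helpers below */
};

static inline dma_addr_t dma_entry_addr(dma_addr_t entry)
{
	return entry & ~(dma_addr_t)(PAGE_SIZE - 1);
}

static inline unsigned int dma_entry_order(dma_addr_t entry)
{
	return (unsigned int)(entry & (PAGE_SIZE - 1));
}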
The part I don't expect to ever happen, because it hasn't happened in the past 20 or so years, is that the dma-api will give us information about what is needed to keep the buffers coherent between various devices and the cpu.
Well maybe that's what we are doing wrong.
Instead of asking the dma-api about the necessary information we should give the API the opportunity to work for us.
In other words, we don't need the information about buffer coherency; what we need is that the API works for us and fulfills the requirements we have.
So the question is really what should we propose to change on the DMA-api side to get this working as expected?
Regards, Christian.
On Tue, Jan 30, 2024 at 02:09:45PM +0100, Christian König wrote:
Am 30.01.24 um 11:40 schrieb Daniel Vetter:
On Tue, Jan 30, 2024 at 10:48:23AM +0100, Paul Cercueil wrote:
Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :
I would say we start with the DMA-API by getting away from sg_tables to something cleaner and state oriented.
FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset, which is just a dead simple
struct dma_vec {
	dma_addr_t addr;
	size_t len;
};
(The rationale for introducing it in the IIO DMABUF patchset was that the "scatterlist" wouldn't allow me to change the transfer size.)
So I believe a new "sg_table"-like could just be an array of struct dma_vec + flags.
Yeah that's pretty much the proposal I've seen, split the sg table into input data (struct page + len) and output data (which is the dma_addr_t + len you have above).
I would extend that a bit and say we have an array with dma_addr+power_of_two_order and a header structure with lower bit offset and some DMA transaction flags.
But this is something which can be worked on as an optimization later. For a start, this proposal here looks good to me as well.
The part I don't expect to ever happen, because it hasn't happened in the past 20 or so years, is that the dma-api will give us information about what is needed to keep the buffers coherent between various devices and the cpu.
Well maybe that's what we are doing wrong.
Instead of asking the dma-api about the necessary information we should give the API the opportunity to work for us.
In other words, we don't need the information about buffer coherency; what we need is that the API works for us and fulfills the requirements we have.
So the question is really what should we propose to change on the DMA-api side to get this working as expected?
So one thing I've been pondering, kinda picking up your point about CXL, is that we do make the coherency protocol more explicit by adding a coherency mode to dma_buf that the exporter sets. Some ideas for values this could have:
- ATOMIC_COHERENT: Fully cache coherent, including device/cpu atomics. This would be for CXL. Non-CXL devices could still participate with the old model using explicit device flushes, but must at least comply with CPU_COHERENT.
There's also the power9-only nvlink that would fit here, but I guess going forward CXL (and cache-coherent integrated gpu) would really be the only users of this flag.
Peer2peer would have the same rules, otherwise doesn't really make sense. Also we might even want to forbid non-CXL imports for these buffers? Not sure on that.
- CPU_COHERENT: device transactions do snoop cpu caches, but devices might do their own caching which isn't snooped by the cpu and needs explicit device-side invalidate/flushing. This means pcie importers are not allowed to use pcie no-snoop transactions, intel igpu wouldn't be allowed to use MOCS that do the same, similar for arm integrated devices.
Importers can skip all explicit cache management apis like dma_buf_begin/end_cpu_access, or the newly proposed dma_buf_begin/end_device_access here.
We'd need to figure out what exactly this means for peer2peer transactions, I have no idea whether the no-snoop flag even does anything for those.
We might also want to split up CPU_COHERENT into CPU_COHERENT_WB and CPU_COHERENT_WC, so that importers know whether cpu reads are going to be crawling or not.
- MEMORY_COHERENT: device transactions do not snoop any caches, but promise that all transactions are fully flushed to system memory. Any device transactions which do fill cpu caches must call the dma_buf_begin/end_device_access functions proposed here. Any cpu access must be bracketed by calls to dma_buf_begin/end_cpu_access.
If your device does fill cpu caches, then essentially you'd not be able to import such buffers. Not sure whether we need to put the responsibility of checking that onto importers or exporters. Ideally core dma-buf.c code would check this.
Also maybe the cpu WC mapping mode would actually need to be a sub-mode for MEMORY_COHERENT, because all cpu wc achieves is to avoid the need to call dma_buf_begin/end_cpu_access, you would still need your devices to be memory coherent. And if they're not, then you cannot use that dma-buf.
Or maybe alternatively we need to guarantee that exporters which set MEMORY_COHERENT implement dma_buf_begin/end_device_access to make things work for these cpu-coherent but not memory-coherent devices. This becomes very tricky with device/arch/bus specific details I think.
- DMA_API_COHERENT: The memory is allocated or mapped by the dma-api, and the exact coherency mode is not known. Importers _must_ bracket all cpu and device access with the respective dma_buf functions. This is essentially the "we have no idea" default.
Note that exporters might export memory allocated with dma_map_alloc with MEMORY_COHERENT or CPU_COHERENT if they know how the memory exactly works. E.g. for most arm soc gpu/display drivers we can assume that the dma-api gives us MEMORY_COHERENT or CPU_COHERENT_WC, and just use that. Essentially this would make the current implicit assumptions explicit.
udmabuf would need to set this, definitely if Paul's patches to add the explicit device flushes land.
- DEFAULT_COHERENT: This would be the backwards-compat legacy yolo behavior. I'm not sure whether we should alias that with DMA_API_COHERENT or leave it as a special value to mark exporters which haven't been updated for the much more explicit coherency handling yet.
The specification for this coherency mode would be a flat out "who knows, just don't break existing use-cases with actual users". Essentially the only reason we'd have this would be to make sure we can avoid regressions of these existing use-cases, by keeping whatever horrible heuristics we have in current exporters.
It would also allow us to convert exporters and importers on a case by case basis.
Note that all these coherency modes are defined in terms of bus-specific device access and in terms of dma_buf apis the importer must call or can skip. This way we'd avoid having to change the dma-api in a first step, and if this all works out properly we could then use the resulting dma-api as a baseline to propose dma-api extensions.
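Spelled out as code, the modes above might look something like this (a sketch only; nothing like it exists in the tree, and the names just follow the discussion):

/* Hypothetical coherency mode the exporter would set at export time. */
enum dma_buf_coherency_mode {
	DMA_BUF_DEFAULT_COHERENT,	/* legacy yolo mode, per-exporter heuristics */
	DMA_BUF_DMA_API_COHERENT,	/* dma-api managed; importers must always bracket */
	DMA_BUF_MEMORY_COHERENT,	/* transactions flushed to system memory */
	DMA_BUF_CPU_COHERENT,		/* device transactions snoop cpu caches */
	DMA_BUF_ATOMIC_COHERENT,	/* fully coherent incl. cpu/device atomics (CXL) */
};

Ordering the values from weakest to strongest guarantees would let an importer test something like mode >= DMA_BUF_CPU_COHERENT to decide whether the explicit begin/end calls can be skipped.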
I think starting right out with designing dma-api extension is a few bridges too far. Both from a "how do we convince upstream" pov, but maybe even more from a "how do we figure out what we even need" pov.
-Sima
Am 31.01.24 um 10:07 schrieb Daniel Vetter:
On Tue, Jan 30, 2024 at 02:09:45PM +0100, Christian König wrote:
Am 30.01.24 um 11:40 schrieb Daniel Vetter:
On Tue, Jan 30, 2024 at 10:48:23AM +0100, Paul Cercueil wrote:
Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :
I would say we start with the DMA-API by getting away from sg_tables to something cleaner and state oriented.
FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset, which is just a dead simple
struct dma_vec {
	dma_addr_t addr;
	size_t len;
};
(The rationale for introducing it in the IIO DMABUF patchset was that the "scatterlist" wouldn't allow me to change the transfer size.)
So I believe a new "sg_table"-like could just be an array of struct dma_vec + flags.
Yeah that's pretty much the proposal I've seen, split the sg table into input data (struct page + len) and output data (which is the dma_addr_t + len you have above).
I would extend that a bit and say we have an array with dma_addr+power_of_two_order and a header structure with lower bit offset and some DMA transaction flags.
But this is something which can be worked on as an optimization later. For a start, this proposal here looks good to me as well.
The part I don't expect to ever happen, because it hasn't happened in the past 20 or so years, is that the dma-api will give us information about what is needed to keep the buffers coherent between various devices and the cpu.
Well maybe that's what we are doing wrong.
Instead of asking the dma-api about the necessary information we should give the API the opportunity to work for us.
In other words, we don't need the information about buffer coherency; what we need is that the API works for us and fulfills the requirements we have.
So the question is really what should we propose to change on the DMA-api side to get this working as expected?
So one thing I've been pondering, kinda picking up your point about CXL, is that we do make the coherency protocol more explicit by adding a coherency mode to dma_buf that the exporter sets. Some ideas for values this could have:
ATOMIC_COHERENT: Fully cache coherent, including device/cpu atomics. This would be for CXL. Non-CXL devices could still participate with the old model using explicit device flushes, but must at least comply with CPU_COHERENT.
There's also the power9-only nvlink that would fit here, but I guess going forward CXL (and cache-coherent integrated gpu) would really be the only users of this flag.
Peer2peer would have the same rules, otherwise doesn't really make sense. Also we might even want to forbid non-CXL imports for these buffers? Not sure on that.
CPU_COHERENT: device transactions do snoop cpu caches, but devices might do their own caching which isn't snooped by the cpu and needs explicit device-side invalidate/flushing. This means pcie importers are not allowed to use pcie no-snoop transactions, intel igpu wouldn't be allowed to use MOCS that do the same, similar for arm integrated devices.
Importers can skip all explicit cache management apis like dma_buf_begin/end_cpu_access, or the newly proposed dma_buf_begin/end_device_access here.
We'd need to figure out what exactly this means for peer2peer transactions, I have no idea whether the no-snoop flag even does anything for those.
We might also want to split up CPU_COHERENT into CPU_COHERENT_WB and CPU_COHERENT_WC, so that importers know whether cpu reads are going to be crawling or not.
MEMORY_COHERENT: device transactions do not snoop any caches, but promise that all transactions are fully flushed to system memory. Any device transactions which do fill cpu caches must call the dma_buf_begin/end_device_access functions proposed here. Any cpu access must be bracketed by calls to dma_buf_begin/end_cpu_access.
If your device does fill cpu caches, then essentially you'd not be able to import such buffers. Not sure whether we need to put the responsibility of checking that onto importers or exporters. Ideally core dma-buf.c code would check this.
Also maybe the cpu WC mapping mode would actually need to be a sub-mode for MEMORY_COHERENT, because all cpu wc achieves is to avoid the need to call dma_buf_begin/end_cpu_access, you would still need your devices to be memory coherent. And if they're not, then you cannot use that dma-buf.
Or maybe alternatively we need to guarantee that exporters which set MEMORY_COHERENT implement dma_buf_begin/end_device_access to make things work for these cpu-coherent but not memory-coherent devices. This becomes very tricky with device/arch/bus specific details I think.
DMA_API_COHERENT: The memory is allocated or mapped by the dma-api, and the exact coherency mode is not known. Importers _must_ bracket all cpu and device access with the respective dma_buf functions. This is essentially the "we have no idea" default.
Note that exporters might export memory allocated with dma_map_alloc with MEMORY_COHERENT or CPU_COHERENT if they know how the memory exactly works. E.g. for most arm soc gpu/display drivers we can assume that the dma-api gives us MEMORY_COHERENT or CPU_COHERENT_WC, and just use that. Essentially this would make the current implicit assumptions explicit.
udmabuf would need to set this, definitely if Paul's patches to add the explicit device flushes land.
DEFAULT_COHERENT: This would be the backwards-compat legacy yolo behavior. I'm not sure whether we should alias that with DMA_API_COHERENT or leave it as a special value to mark exporters which haven't been updated for the much more explicit coherency handling yet.
The specification for this coherency mode would be a flat out "who knows, just don't break existing use-cases with actual users". Essentially the only reason we'd have this would be to make sure we can avoid regressions of these existing use-cases, by keeping whatever horrible heuristics we have in current exporters.
It would also allow us to convert exporters and importers on a case by case basis.
Note that all these coherency modes are defined in terms of bus-specific device access and in terms of dma_buf apis the importer must call or can skip. This way we'd avoid having to change the dma-api in a first step, and if this all works out properly we could then use the resulting dma-api as a baseline to propose dma-api extensions.
When I read this for the first time my initial impression was that the idea mostly looked good, but while thinking about it more and more I came to the conclusion that this would go in the wrong direction.
Maybe I'm repeating myself, but I think we first of all have to talk a bit about some aspects of coherency:
1. Intra-device coherency. This means that intra-device caches are invalidated before beginning an operation and flushed before signaling that an operation finished.
2. Inter-device and device-to-CPU coherency. This means that caches which sit between devices, and between devices and the CPU, need to be invalidated and flushed appropriately when buffers are accessed by different parties.
Number 1 is device specific, part of the DMA-buf framework and handled by dma_fences. As far as I can see that part is actually quite well designed and I don't see any obvious need for change.
Number 2 is platform specific and I completely agree with the DMA-api folks that this doesn't belong into DMA-buf in the first place. That's why I think the begin_cpu_access()/end_cpu_access() callbacks are actually a bit misplaced. We can still use those in the exporter, but only to make better buffer placement decisions; they should not invalidate any caches when they are called.
The flushing and invalidation for platform caches should really be in the DMA-buf framework and not the exporter.
So in my thinking the enumeration you outlined above should really go into struct device and explaining to everybody what the coherency properties of DMA operations of this device is.
I think starting right out with designing dma-api extension is a few bridges too far. Both from a "how do we convince upstream" pov, but maybe even more from a "how do we figure out what we even need" pov.
Well I totally agree on the "how do we figure out what we even need", but I disagree a bit on that we don't know what DMA-api extension we need.
We don't have the full picture yet, but as I already outlined from the DMA-api pov we have two major things on our TODO list:
1. Somehow remove the struct pages from the DMA-buf *importer* API.
My best suggestion at the moment for this is to split sg_tables into two data structures, one for the struct pages and one for the DMA addresses.
Mangling the addresses to ensure that no importer messes with the struct pages was a good step, but it also creates problems when dma_sync_sg_for_cpu() / dma_sync_sg_for_device() are supposed to be called.
2. Add some dma_sync_sg_between_devices(A, B....).
And on this I think we are on the same page that we are going to need this, but we are just not clear on who is going to use it.
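For illustration, the helper could eventually look something like this; the signature is hypothetical, and the naive fallback simply routes coherency through system memory using the existing dma-api sync calls:

/* Hypothetical sketch of dma_sync_sg_between_devices(). Assumes, as a
 * simplification, that both devices use the same sg mapping; a real
 * version would have to handle per-device mappings. */
#include <linux/dma-mapping.h>

static void dma_sync_sg_between_devices(struct device *from, struct device *to,
					struct scatterlist *sgl, int nents,
					enum dma_data_direction dir)
{
	/* make the producer's writes visible to system memory... */
	dma_sync_sg_for_cpu(from, sgl, nents, dir);
	/* ...then hand the buffer over to the consumer device */
	dma_sync_sg_for_device(to, sgl, nents, dir);
}

A platform where device-to-device transfers don't need the CPU-side flush could then override this with a cheaper implementation.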
Regards, Christian.
On Tue, Feb 06, 2024 at 02:28:35PM +0100, Christian König wrote:
Am 31.01.24 um 10:07 schrieb Daniel Vetter:
On Tue, Jan 30, 2024 at 02:09:45PM +0100, Christian König wrote:
Am 30.01.24 um 11:40 schrieb Daniel Vetter:
On Tue, Jan 30, 2024 at 10:48:23AM +0100, Paul Cercueil wrote:
Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :
I would say we start with the DMA-API by getting away from sg_tables to something cleaner and state oriented.
FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset, which is just a dead simple
struct dma_vec {
	dma_addr_t addr;
	size_t len;
};
(The rationale for introducing it in the IIO DMABUF patchset was that the "scatterlist" wouldn't allow me to change the transfer size.)
So I believe a new "sg_table"-like could just be an array of struct dma_vec + flags.
Yeah that's pretty much the proposal I've seen, split the sg table into input data (struct page + len) and output data (which is the dma_addr_t + len you have above).
I would extend that a bit and say we have an array with dma_addr+power_of_two_order and a header structure with lower bit offset and some DMA transaction flags.
But this is something which can be worked on as an optimization later. For a start, this proposal here looks good to me as well.
The part I don't expect to ever happen, because it hasn't happened in the past 20 or so years, is that the dma-api will give us information about what is needed to keep the buffers coherent between various devices and the cpu.
Well maybe that's what we are doing wrong.
Instead of asking the dma-api about the necessary information we should give the API the opportunity to work for us.
In other words, we don't need the information about buffer coherency; what we need is that the API works for us and fulfills the requirements we have.
So the question is really what should we propose to change on the DMA-api side to get this working as expected?
So one thing I've been pondering, kinda picking up your point about CXL, is that we do make the coherency protocol more explicit by adding a coherency mode to dma_buf that the exporter sets. Some ideas for values this could have:
ATOMIC_COHERENT: Fully cache coherent, including device/cpu atomics. This would be for CXL. Non-CXL devices could still participate with the old model using explicit device flushes, but must at least comply with CPU_COHERENT.
There's also the power9-only nvlink that would fit here, but I guess going forward CXL (and cache-coherent integrated gpu) would really be the only users of this flag.
Peer2peer would have the same rules, otherwise doesn't really make sense. Also we might even want to forbid non-CXL imports for these buffers? Not sure on that.
CPU_COHERENT: device transactions do snoop cpu caches, but devices might do their own caching which isn't snooped by the cpu and needs explicit device-side invalidate/flushing. This means pcie importers are not allowed to use pcie no-snoop transactions, intel igpu wouldn't be allowed to use MOCS that do the same, similar for arm integrated devices.
Importers can skip all explicit cache management apis like dma_buf_begin/end_cpu_access, or the newly proposed dma_buf_begin/end_device_access here.
We'd need to figure out what exactly this means for peer2peer transactions, I have no idea whether the no-snoop flag even does anything for those.
We might also want to split up CPU_COHERENT into CPU_COHERENT_WB and CPU_COHERENT_WC, so that importers know whether cpu reads are going to be crawling or not.
MEMORY_COHERENT: device transactions do not snoop any caches, but promise that all transactions are fully flushed to system memory. Any device transactions which do fill cpu caches must call the dma_buf_begin/end_device_access functions proposed here. Any cpu access must be bracketed by calls to dma_buf_begin/end_cpu_access.
If your device does fill cpu caches, then essentially you'd not be able to import such buffers. Not sure whether we need to put the responsibility of checking that onto importers or exporters. Ideally core dma-buf.c code would check this.
Also maybe the cpu WC mapping mode would actually need to be a sub-mode for MEMORY_COHERENT, because all cpu wc achieves is to avoid the need to call dma_buf_begin/end_cpu_access, you would still need your devices to be memory coherent. And if they're not, then you cannot use that dma-buf.
Or maybe alternatively we need to guarantee that exporters which set MEMORY_COHERENT implement dma_buf_begin/end_device_access to make things work for these cpu-coherent but not memory-coherent devices. This becomes very tricky with device/arch/bus specific details I think.
DMA_API_COHERENT: The memory is allocated or mapped by the dma-api, and the exact coherency mode is not known. Importers _must_ bracket all cpu and device access with the respective dma_buf functions. This is essentially the "we have no idea" default.
Note that exporters might export memory allocated with dma_map_alloc with MEMORY_COHERENT or CPU_COHERENT if they know how the memory exactly works. E.g. for most arm soc gpu/display drivers we can assume that the dma-api gives us MEMORY_COHERENT or CPU_COHERENT_WC, and just use that. Essentially this would make the current implicit assumptions explicit.
udmabuf would need to set this, definitely if Paul's patches to add the explicit device flushes land.
DEFAULT_COHERENT: This would be the backwards-compat legacy yolo behavior. I'm not sure whether we should alias that with DMA_API_COHERENT or leave it as a special value to mark exporters which haven't been updated for the much more explicit coherency handling yet.
The specification for this coherency mode would be a flat out "who knows, just don't break existing use-cases with actual users". Essentially the only reason we'd have this would be to make sure we can avoid regressions of these existing use-cases, by keeping whatever horrible heuristics we have in current exporters.
It would also allow us to convert exporters and importers on a case by case basis.
Note that all these coherency modes are defined in terms of bus-specific device access and in terms of dma_buf apis the importer must call or can skip. This way we'd avoid having to change the dma-api in a first step, and if this all works out properly we could then use the resulting dma-api as a baseline to propose dma-api extensions.
When I read this for the first time my initial impression was that the idea mostly looked good, but while thinking about it more and more I came to the conclusion that this would go in the wrong direction.
Maybe I'm repeating myself, but I think we first of all have to talk a bit about some aspects of coherency:
- Intra-device coherency. This means that intra-device caches are invalidated before beginning an operation and flushed before signaling that an operation finished.
- Inter-device and device-to-CPU coherency. This means that caches which sit between devices, and between devices and the CPU, need to be invalidated and flushed appropriately when buffers are accessed by different parties.
Number 1 is device specific, part of the DMA-buf framework and handled by dma_fences. As far as I can see that part is actually quite well designed and I don't see any obvious need for change.
Number 2 is platform specific and I completely agree with the DMA-api folks that this doesn't belong into DMA-buf in the first place. That's why I think the begin_cpu_access()/end_cpu_access() callbacks are actually a bit misplaced. We can still use those in the exporter, but only to make better buffer placement decisions; they should not invalidate any caches when they are called.
I don't think there are cases where we can avoid the cache management in begin/end_cpu_access, because there are SoCs out there with the following constraints:
- Device access is not coherent with cpu caches, no snooping going on at all. Device2device dma is fully coherent though (since there's no caches to take care of at all).
- Mapping as write-combined on the cpu is not possible. Not every platform is reasonable and has something like PAT with cache control in each PTE. Or they have it, but in practice it's not usable.
Which means your options are only a) no cpu access b) bracket cpu access with cache management. So I'm not sure what exactly you have in mind here, since people really don't like a) that's why we added that cpu access bracketing stuff?
Also what exactly would you do in begin/end_cpu_access instead of cache management? Note that you kinda need to call dma_buf_vmap (for kernel access) or dma_buf_mmap (for userspace) before you can call these, and any placement changes should be done in those functions and not in begin/end_cpu_access. Especially for dma_buf_vmap the buffer must be pinned, so you have no other choice. And for userspace you'd need fault handlers, you cannot rely on the begin/end ioctl calls, because that would de facto make those into a pin/unpin ioctl, which defeats the point of having more dynamic memory management for these buffers.
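For reference, the bracketing described here looks like this on the kernel side, using the existing dma-buf API (a minimal sketch, error handling trimmed):

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>

static int read_from_imported_buf(struct dma_buf *dmabuf)
{
	struct iosys_map map;
	int ret;

	/* vmap pins the buffer and must come first */
	ret = dma_buf_vmap_unlocked(dmabuf, &map);
	if (ret)
		return ret;

	ret = dma_buf_begin_cpu_access(dmabuf, DMA_FROM_DEVICE);
	if (!ret) {
		/* ... CPU reads through the mapping (see iosys_map helpers) ... */
		ret = dma_buf_end_cpu_access(dmabuf, DMA_FROM_DEVICE);
	}

	dma_buf_vunmap_unlocked(dmabuf, &map);
	return ret;
}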
The flushing and invalidation for platform caches should really be in the DMA-buf framework and not the exporter.
So in my thinking the enumeration you outlined above should really go into struct device and explaining to everybody what the coherency properties of DMA operations of this device is.
So that's the part that I think dma-api folks really don't like. They don't want higher levels to know about cache management at all, so I don't see how we can make this happen.
The other issue is that we have a ton of exporters which flout the dma-api rules for their platform/device, e.g. x86 is officially fully cache coherent. Except integrated gpu/camera isp are not, and for rendering you can select the coherency mode on a per-transaction level in the cs packets.
So putting this into struct device flags is not going to work I think for these two cases: for dma-api allocated/managed memory I don't think we'll get it, and for stuff like i915-gem it's too strict, we need at least a per-buffer flag for this.
I think starting right out with designing dma-api extension is a few bridges too far. Both from a "how do we convince upstream" pov, but maybe even more from a "how do we figure out what we even need" pov.
Well I totally agree on the "how do we figure out what we even need", but I disagree a bit on that we don't know what DMA-api extension we need.
We don't have the full picture yet, but as I already outlined from the DMA-api pov we have two major things on our TODO list:
- Somehow remove the struct pages from the DMA-buf *importer* API.
My best suggestion at the moment for this is to split sg_tables into two data structures, one for the struct pages and one for the DMA addresses.
Mangling the addresses to ensure that no importer messes with the struct pages was a good step, but it also creates problems when dma_sync_sg_for_cpu() / dma_sync_sg_for_device() are supposed to be called.
Hm yeah we need to temporarily unmangle those around those calls. Since it's a debug-only option this shouldn't be a big deal.
I agree that eventually we should aim towards splitting this properly, but I think as long as the dma-api itself isn't there yet, it doesn't make too much sense to charge ahead in dma-buf code.
- Add some dma_sync_sg_between_devices(A, B....).
And on this I think we are on the same page that we are going to need this, but we are just not clear on who is going to use it.
Yeah, I think this is something we might be able to eventually get added to the dma-api. But I think that's only on the table once:
- we have this in dma-buf code (can be in dma-buf.c or in exporters, I'm not extremely opinionated about this).
- we have real-world use-cases where fusing superfluous cache management operations with the existing dma_sync_sg_for_device/cpu actually matters.
- someone's willing to roll out the infrastructure work - although a default implementation that just calls dma_sync_sg_for_device/cpu in the right order is probably good enough as fallback for most platforms.
I think the other issue is that all this is multi-year projects with a lot of effort, and I think we need something that will work a lot sooner for Paul's use-case here.
Cheers, Sima
On Tue, Jan 30, 2024 at 10:23:03AM +0100, Christian König wrote:
Am 30.01.24 um 10:01 schrieb Daniel Vetter:
On Fri, Jan 26, 2024 at 05:42:50PM +0100, Christian König wrote:
[SNIP] Well I think we should have some solution, but I'm not sure if it should be part of DMA-buf.
Essentially a DMA-buf exporter exports the buffers as it uses them, and the importer (or the DMA-buf subsystem) is then the one who says: ok, I can use this, or I can't use this, or I need to call extra functions to use this, or whatever.
It's not the job of the exporter to provide the coherency for the importer, because otherwise we would have a lot of code in the exporter which can only be tested when you have the right importer around. And I strongly think that this is a no-go for having a reliable solution.
The trouble is, that if you have other memory than stuff allocated by the dma-api or mapped using the dma-api, then by necessity the exporter has to deal with this.
Yes, I was thinking about that as well.
Which is the exact same reason we also force the exporters to deal with the cpu cache flushing - your argument that it's not great to replicate this everywhere holds there equally.
And I'm not really happy with that either.
The other thing is that right now the exporter is the only one who actually knows what kind of dma coherency rules apply for a certain piece of memory. E.g. on i915-gem even if it's dma_map_sg mapped the underlying i915-gem buffer might be non-coherent, and i915-gem makes it all work by doing the appropriate amount of clflush.
Yeah, exactly that's the reason why I think that this stuff doesn't belong into exporters/drivers.
Looking at what kind of hacks and workarounds we have in both amdgpu as well as i915 it's pretty clear that we need to improve this design somehow.
Yeah it's been a well-known issue, and we've very slowly improved things.
Similar funky things happen in other cases.
So unless we add an interface which allows importers to figure out how much flushing is needed, currently the exporter is the only one who knows (because it can inspect the struct device at dma_buf_attach time).
We could flip this around, but it would be a rather serious departure from the dma-buf design approach thus far.
Well clients already give the DMA-direction to exporters when creating the mapping and get an appropriate sg_table in return.
All we need to do is get the information about what flushing is needed into the object returned here, so that the DMA API can work with it.
So the problem is that we can provide this information from exporters that do device specific stuff. But we cannot get this information from exporters which just use the dma-api, whether it's dma_alloc or dma_map_sg, because the core design principle of the dma-api is to hide the coherency rules for device dma.
The idea is that you have the same ip on different socs, where on one the soc needs cache flushing and on the other it doesn't (because of a different architecture, or just the ip being connected to different interconnects), and you can use the exact same driver since the dma-api hides all this.
And at least every time it was discussed in the past, dma-api maintainers insisted that we don't break this abstraction rule. Which means for most exporters, we simply do not have this information available. This is also why after epic long discussions it was decided that cache coherency was the exporter's problem, so that from an importer pov there's no difference between an sg list obtained through dma_buf_map and an sg list obtained from dma_map_sg or memory allocated with dma_alloc - in none of these cases does the driver have to do its own cache management.
Christoph Hellwig pretty much nailed it when he said that the problem with the sg_table is that it mixes input and output parameters of the DMA-API.
Hm my take away from these discussions was that sg as a data structure is not a clean design, but I haven't ever seen Christoph (or anyone else from the dma-api side) say that they're ok with leaking cache coherency management to clients.
We couldn't even get the core arch primitives exported to drivers so that dma-buf exporters could do the right cache management for their driver specific allocators that entirely bypass the dma-api. I think what you're suggesting would go way beyond that.
I would extend that and say that we need a mapping object the DMA-API can work with so that it can know what needs to be done when devices request that data is made coherent between them or the CPU.
Personally I do think it makes sense as a design and iirc we discussed it plenty in the early dma-buf discussions. I just don't think it's a realistic design approach to upstream.
I think the best we can hope for is a new set of device2device sync functions in the dma_sync_sg_for_* family of functions, so that on platforms where syncing for cpu access requires cache flushes, but going from one device to the next doesn't, we could avoid some unnecessary flushing. Currently there's no way to do that and we have to pessimistically flush for cpu coherency with the dma-api. Or suffer from device2device coherency issues on funky platforms.
That's why I think the approach of having DMA-buf callbacks is most likely the wrong thing to do.
What should happen instead is that the DMA subsystem provides functionality to devices which don't support coherency through their connection, so they can say: I want to access this data, please make sure to flush the appropriate caches. But that's just a very, very rough design idea.
This will become more pressing with CXL on the horizon, I think.
Yeah CXL will make this all even more fun, but we are firmly there already with devices deciding per-buffer (or sometimes even per-access with intel's MOCS stuff) what coherency mode to use for a buffer.
Also arm soc generally have both coherent and non-coherent device interconnects, and I think some devices can switch with runtime flags too which mode they use for a specific transition.
CXL just extends this to pcie devices.
So the mess is here, how do we deal with it?
I would say we start with the DMA-API by getting away from sg_tables to something cleaner and state oriented.
Imo that's a tangential distraction. Definitely would be great to untangle that data structure, but I don't think that gets us any closer to getting the coherency information out of the dma-api abstraction that we'd like to have.
That part has been an extremely firm "no" every time we asked.
My take is that the opt-in callback addition is far from great, but it's in line with how we extended dma-buf the past decade plus too. So unless someone's volunteering to pour some serious time into re-engineering this all (including testing all the different device/driver<->device/driver interactions) I think there's only really one other option: To not support these cases at all. And I don't really like that, because it means people will hack together something even worse in their drivers.
By adding it to dma-buf it'll stare us in our faces at least :-/
Yeah, it's the path of least resistance. But with CXL on the horizon and more and more drivers using it, I think it's predictable that this will sooner or later blow up.
I know, it's kinda been blowing up already.
My prediction is that the best we can get out of the dma-api is a new device2device sync, while all the coherency details are still 100% hidden behind the dma-api. And even that is probably going to take years.
Cheers, Sima
On Fri, Jan 19, 2024 at 03:13:57PM +0100, Paul Cercueil wrote:
These functions should be used by device drivers when they start and stop accessing the data of DMABUF. It allows DMABUF importers to cache the dma_buf_attachment while ensuring that the data they want to access is available for their device when the DMA transfers take place.
Signed-off-by: Paul Cercueil paul@crapouillou.net
Putting my detailed review comments here just so I don't have to remember them any longer. We need to reach consensus on the big picture direction first.
v5: New patch
 drivers/dma-buf/dma-buf.c | 66 +++++++++++++++++++++++++++++++++++++++
 include/linux/dma-buf.h   | 37 ++++++++++++++++++++++
 2 files changed, 103 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..a8bab6c18fcd 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -830,6 +830,8 @@ static struct sg_table * __map_dma_buf(struct dma_buf_attachment *attach,
  *     - dma_buf_mmap()
  *     - dma_buf_begin_cpu_access()
  *     - dma_buf_end_cpu_access()
+ *     - dma_buf_begin_access()
+ *     - dma_buf_end_access()
  *     - dma_buf_map_attachment_unlocked()
  *     - dma_buf_unmap_attachment_unlocked()
  *     - dma_buf_vmap_unlocked()
@@ -1602,6 +1604,70 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map)
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
 
+/**
+ * dma_buf_begin_access - Call before any hardware access from/to the DMABUF
+ * @attach:	[in]	attachment used for hardware access
+ * @sg_table:	[in]	scatterlist used for the DMA transfer
+ * @direction:	[in]	direction of DMA transfer
I think for the kerneldoc it would be good to point at the other function here, explain why this might be needed and that for most reasonable devices it's probably not, and link between the function pairs.
Also we need to document that dma_buf_map_attachment does an implied dma_buf_begin_access (because dma_map_sg does an implied dma_sync_sg_for_device) and vice versa for dma_buf_end_access. Which also means that dma_buf_map/unmap_attachment should link to these functions in their kerneldoc too.
Finally I think we should document here that it's ok to call these from the dma_fence signalling critical section and link to the relevant discussion in the dma_fence docs for that.
+ */
+int dma_buf_begin_access(struct dma_buf_attachment *attach,
+			 struct sg_table *sgt, enum dma_data_direction dir)
+{
+	struct dma_buf *dmabuf;
+	bool cookie;
+	int ret;
+
+	if (WARN_ON(!attach))
+		return -EINVAL;
+
+	dmabuf = attach->dmabuf;
+
+	if (!dmabuf->ops->begin_access)
+		return 0;
+
+	cookie = dma_fence_begin_signalling();
+	ret = dmabuf->ops->begin_access(attach, sgt, dir);
+	dma_fence_end_signalling(cookie);
+
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
So explicit device side coherency management is not going to be very compatible with dynamic buffer management where the exporter can move the buffer around. The reason for that is that for a dynamic exporter we cache the sg mapping, which means any device-side coherency management which dma_buf_map/unmap would do will not happen (since it's cached), potentially breaking things for importers that rely on the assumption that dma_buf_map/unmap already implies dma_buf_begin/end_device_access.
I think for now it's sufficient to put a WARN_ON(dma_buf_is_dynamic() && ops->begin|end_access) or similar into dma_buf_export and bail out with an error to catch that.
Aside from the nits I do think this is roughly what we briefly discussed well over a decade ago in the original dma-buf kickoff meeting at a linaro connect in Budapest :-)
Cheers, Sima
+/**
+ * dma_buf_end_access - Call after any hardware access from/to the DMABUF
+ * @attach:	[in]	attachment used for hardware access
+ * @sg_table:	[in]	scatterlist used for the DMA transfer
+ * @direction:	[in]	direction of DMA transfer
+ */
+int dma_buf_end_access(struct dma_buf_attachment *attach,
+		       struct sg_table *sgt, enum dma_data_direction dir)
+{
+	struct dma_buf *dmabuf;
+	bool cookie;
+	int ret;
+
+	if (WARN_ON(!attach))
+		return -EINVAL;
+
+	dmabuf = attach->dmabuf;
+
+	if (!dmabuf->ops->end_access)
+		return 0;
+
+	cookie = dma_fence_begin_signalling();
+	ret = dmabuf->ops->end_access(attach, sgt, dir);
+	dma_fence_end_signalling(cookie);
+
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_end_access, DMA_BUF);
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 8ff4add71f88..8ba612c7cc16 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -246,6 +246,38 @@ struct dma_buf_ops {
 	 */
 	int (*end_cpu_access)(struct dma_buf *, enum dma_data_direction);
+	/**
+	 * @begin_access:
+	 *
+	 * This is called from dma_buf_begin_access() when a device driver
+	 * wants to access the data of the DMABUF. The exporter can use this
+	 * to flush/sync the caches if needed.
+	 *
+	 * This callback is optional.
+	 *
+	 * Returns:
+	 *
+	 * 0 on success or a negative error code on failure.
+	 */
+	int (*begin_access)(struct dma_buf_attachment *, struct sg_table *,
+			    enum dma_data_direction);
+
+	/**
+	 * @end_access:
+	 *
+	 * This is called from dma_buf_end_access() when a device driver is
+	 * done accessing the data of the DMABUF. The exporter can use this
+	 * to flush/sync the caches if needed.
+	 *
+	 * This callback is optional.
+	 *
+	 * Returns:
+	 *
+	 * 0 on success or a negative error code on failure.
+	 */
+	int (*end_access)(struct dma_buf_attachment *, struct sg_table *,
+			  enum dma_data_direction);
+
 	/**
 	 * @mmap:
@@ -606,6 +638,11 @@ void dma_buf_detach(struct dma_buf *dmabuf,
 int dma_buf_pin(struct dma_buf_attachment *attach);
 void dma_buf_unpin(struct dma_buf_attachment *attach);
 
+int dma_buf_begin_access(struct dma_buf_attachment *attach,
+			 struct sg_table *sgt, enum dma_data_direction dir);
+int dma_buf_end_access(struct dma_buf_attachment *attach,
+		       struct sg_table *sgt, enum dma_data_direction dir);
+
 struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info);
 int dma_buf_fd(struct dma_buf *dmabuf, int flags);
--
2.43.0
Implement .begin_access() and .end_access() callbacks.
For now these functions will simply sync/flush the CPU cache when needed.
Signed-off-by: Paul Cercueil paul@crapouillou.net
---
v5: New patch
---
 drivers/dma-buf/udmabuf.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index c40645999648..a87d89b58816 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -179,6 +179,31 @@ static int end_cpu_udmabuf(struct dma_buf *buf,
 	return 0;
 }
 
+static int begin_udmabuf(struct dma_buf_attachment *attach,
+			 struct sg_table *sgt,
+			 enum dma_data_direction dir)
+{
+	struct dma_buf *buf = attach->dmabuf;
+	struct udmabuf *ubuf = buf->priv;
+	struct device *dev = ubuf->device->this_device;
+
+	dma_sync_sg_for_device(dev, sgt->sgl, sg_nents(sgt->sgl), dir);
+	return 0;
+}
+
+static int end_udmabuf(struct dma_buf_attachment *attach,
+		       struct sg_table *sgt,
+		       enum dma_data_direction dir)
+{
+	struct dma_buf *buf = attach->dmabuf;
+	struct udmabuf *ubuf = buf->priv;
+	struct device *dev = ubuf->device->this_device;
+
+	if (dir != DMA_TO_DEVICE)
+		dma_sync_sg_for_cpu(dev, sgt->sgl, sg_nents(sgt->sgl), dir);
+	return 0;
+}
+
 static const struct dma_buf_ops udmabuf_ops = {
 	.cache_sgt_mapping = true,
 	.map_dma_buf = map_udmabuf,
@@ -189,6 +214,8 @@ static const struct dma_buf_ops udmabuf_ops = {
 	.vunmap = vunmap_udmabuf,
 	.begin_cpu_access = begin_cpu_udmabuf,
 	.end_cpu_access = end_cpu_udmabuf,
+	.begin_access = begin_udmabuf,
+	.end_access = end_udmabuf,
 };
#define SEALS_WANTED (F_SEAL_SHRINK)
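On the importer side (as patch 5/6 does for FunctionFS), with the attachment and sg_table cached, each DMA transfer is then bracketed by the entry points added in patch 1/6, giving exporters like udmabuf a chance to do the syncs above. A minimal sketch:

static int do_dma_transfer(struct dma_buf_attachment *attach,
			   struct sg_table *sgt, enum dma_data_direction dir)
{
	int ret;

	ret = dma_buf_begin_access(attach, sgt, dir);
	if (ret)
		return ret;

	/* ... program the device and run the transfer against sgt ... */

	return dma_buf_end_access(attach, sgt, dir);
}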
On Fri, Jan 19, 2024 at 03:13:58PM +0100, Paul Cercueil wrote:
Implement .begin_access() and .end_access() callbacks.
For now these functions will simply sync/flush the CPU cache when needed.
Signed-off-by: Paul Cercueil paul@crapouillou.net
v5: New patch
 drivers/dma-buf/udmabuf.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index c40645999648..a87d89b58816 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -179,6 +179,31 @@ static int end_cpu_udmabuf(struct dma_buf *buf,
 	return 0;
 }
 
+static int begin_udmabuf(struct dma_buf_attachment *attach,
+			 struct sg_table *sgt,
+			 enum dma_data_direction dir)
+{
+	struct dma_buf *buf = attach->dmabuf;
+	struct udmabuf *ubuf = buf->priv;
+	struct device *dev = ubuf->device->this_device;
+
+	dma_sync_sg_for_device(dev, sgt->sgl, sg_nents(sgt->sgl), dir);
So one thing I've just wondered is whether we've made sure that this is only doing cache coherency maintenance, and not swiotlb bounce buffer copying. The latter would really not be suitable for dma-buf anymore I think.
Not sure how to best check for that since it's all in the depths of the dma-api code, but I guess the best way to really make sure is to disable CONFIG_SWIOTLB. Otherwise I guess the way to absolutely make sure is to trace swiotlb_sync_single_for_device/cpu.
It would be kinda neat if dma-buf.c code could make sure you never ever get an swiotlb entry from a dma_buf_map_attachment call, but I don't think we can enforce that. There's sg_dma_is_swiotlb, but that won't catch all implementations, only the generic dma-iommu.c one.
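If one only cared about the generic dma-iommu path, a debug check along these lines would catch it (a sketch; as noted above, sg_dma_is_swiotlb() does not cover other dma-ops implementations):

#include <linux/scatterlist.h>

static void warn_on_swiotlb_mapping(struct sg_table *sgt)
{
	struct scatterlist *sg;
	int i;

	/* only meaningful for mappings done by dma-iommu.c */
	for_each_sgtable_dma_sg(sgt, sg, i)
		WARN_ON_ONCE(sg_dma_is_swiotlb(sg));
}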
Cheers, Sima
+	return 0;
+}
+static int end_udmabuf(struct dma_buf_attachment *attach,
+		       struct sg_table *sgt,
+		       enum dma_data_direction dir)
+{
+	struct dma_buf *buf = attach->dmabuf;
+	struct udmabuf *ubuf = buf->priv;
+	struct device *dev = ubuf->device->this_device;
+
+	if (dir != DMA_TO_DEVICE)
+		dma_sync_sg_for_cpu(dev, sgt->sgl, sg_nents(sgt->sgl), dir);
+	return 0;
+}
 static const struct dma_buf_ops udmabuf_ops = {
 	.cache_sgt_mapping = true,
 	.map_dma_buf = map_udmabuf,
@@ -189,6 +214,8 @@ static const struct dma_buf_ops udmabuf_ops = {
 	.vunmap = vunmap_udmabuf,
 	.begin_cpu_access = begin_cpu_udmabuf,
 	.end_cpu_access = end_cpu_udmabuf,
+	.begin_access = begin_udmabuf,
+	.end_access = end_udmabuf,
 };
2.43.0
Add a new 'sg_was_mapped' field to the struct usb_request. This field can be used to indicate that the scatterlist associated with the USB transfer has already been mapped for DMA, so it does not have to be done internally.
Signed-off-by: Paul Cercueil paul@crapouillou.net
---
 drivers/usb/gadget/udc/core.c | 7 ++++++-
 include/linux/usb/gadget.h    | 2 ++
 2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/gadget/udc/core.c b/drivers/usb/gadget/udc/core.c
index d59f94464b87..9d4150124fdb 100644
--- a/drivers/usb/gadget/udc/core.c
+++ b/drivers/usb/gadget/udc/core.c
@@ -903,6 +903,11 @@ int usb_gadget_map_request_by_dev(struct device *dev,
 	if (req->length == 0)
 		return 0;
 
+	if (req->sg_was_mapped) {
+		req->num_mapped_sgs = req->num_sgs;
+		return 0;
+	}
+
 	if (req->num_sgs) {
 		int mapped;
 
@@ -948,7 +953,7 @@ EXPORT_SYMBOL_GPL(usb_gadget_map_request);
 void usb_gadget_unmap_request_by_dev(struct device *dev,
 		struct usb_request *req, int is_in)
 {
-	if (req->length == 0)
+	if (req->length == 0 || req->sg_was_mapped)
 		return;
 
 	if (req->num_mapped_sgs) {
diff --git a/include/linux/usb/gadget.h b/include/linux/usb/gadget.h
index a771ccc038ac..c529e4e06997 100644
--- a/include/linux/usb/gadget.h
+++ b/include/linux/usb/gadget.h
@@ -52,6 +52,7 @@ struct usb_ep;
  * @short_not_ok: When reading data, makes short packets be
  *     treated as errors (queue stops advancing till cleanup).
  * @dma_mapped: Indicates if request has been mapped to DMA (internal)
+ * @sg_was_mapped: Set if the scatterlist has been mapped before the request
  * @complete: Function called when request completes, so this request and
  *     its buffer may be re-used. The function will always be called with
  *     interrupts disabled, and it must not sleep.
@@ -111,6 +112,7 @@ struct usb_request {
 	unsigned		zero:1;
 	unsigned		short_not_ok:1;
 	unsigned		dma_mapped:1;
+	unsigned		sg_was_mapped:1;
 
 	void			(*complete)(struct usb_ep *ep,
 					struct usb_request *req);
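A minimal sketch of the intended use from a function driver that has already mapped its scatterlist (for example via dma_buf_map_attachment()); the helper name here is illustrative:

static int queue_premapped_request(struct usb_ep *ep, struct usb_request *req,
				   struct sg_table *sgt, unsigned int len)
{
	req->sg = sgt->sgl;
	req->num_sgs = sgt->nents;
	req->length = len;
	req->sg_was_mapped = true;	/* skip the UDC core's DMA mapping */

	return usb_ep_queue(ep, req, GFP_ATOMIC);
}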
This exact same code was duplicated in two different places.
Signed-off-by: Paul Cercueil paul@crapouillou.net
---
 drivers/usb/gadget/function/f_fs.c | 48 +++++++++++++++++-------------
 1 file changed, 27 insertions(+), 21 deletions(-)
diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index 6bff6cb93789..ed2a6d5fcef7 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -934,31 +934,44 @@ static ssize_t __ffs_epfile_read_data(struct ffs_epfile *epfile,
 	return ret;
 }
 
-static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data)
+static struct ffs_ep *ffs_epfile_wait_ep(struct file *file)
 {
 	struct ffs_epfile *epfile = file->private_data;
-	struct usb_request *req;
 	struct ffs_ep *ep;
-	char *data = NULL;
-	ssize_t ret, data_len = -EINVAL;
-	int halt;
-
-	/* Are we still active? */
-	if (WARN_ON(epfile->ffs->state != FFS_ACTIVE))
-		return -ENODEV;
+	int ret;
 
 	/* Wait for endpoint to be enabled */
 	ep = epfile->ep;
 	if (!ep) {
 		if (file->f_flags & O_NONBLOCK)
-			return -EAGAIN;
+			return ERR_PTR(-EAGAIN);
 
 		ret = wait_event_interruptible(
 				epfile->ffs->wait, (ep = epfile->ep));
 		if (ret)
-			return -EINTR;
+			return ERR_PTR(-EINTR);
 	}
 
+	return ep;
+}
+
+static ssize_t ffs_epfile_io(struct file *file, struct ffs_io_data *io_data)
+{
+	struct ffs_epfile *epfile = file->private_data;
+	struct usb_request *req;
+	struct ffs_ep *ep;
+	char *data = NULL;
+	ssize_t ret, data_len = -EINVAL;
+	int halt;
+
+	/* Are we still active? */
+	if (WARN_ON(epfile->ffs->state != FFS_ACTIVE))
+		return -ENODEV;
+
+	ep = ffs_epfile_wait_ep(file);
+	if (IS_ERR(ep))
+		return PTR_ERR(ep);
+
 	/* Do we halt? */
 	halt = (!io_data->read == !epfile->in);
 	if (halt && epfile->isoc)
@@ -1280,16 +1293,9 @@ static long ffs_epfile_ioctl(struct file *file, unsigned code,
 		return -ENODEV;
 
 	/* Wait for endpoint to be enabled */
-	ep = epfile->ep;
-	if (!ep) {
-		if (file->f_flags & O_NONBLOCK)
-			return -EAGAIN;
-
-		ret = wait_event_interruptible(
-			epfile->ffs->wait, (ep = epfile->ep));
-		if (ret)
-			return -EINTR;
-	}
+	ep = ffs_epfile_wait_ep(file);
+	if (IS_ERR(ep))
+		return PTR_ERR(ep);
 
 	spin_lock_irq(&epfile->ffs->eps_lock);
This patch introduces three new ioctls. They all should be called on a data endpoint (i.e. not ep0). They are:
- FUNCTIONFS_DMABUF_ATTACH, which takes the file descriptor of a DMABUF object to attach to the endpoint.
- FUNCTIONFS_DMABUF_DETACH, which takes the file descriptor of the DMABUF to detach from the endpoint. Note that closing the endpoint's file descriptor will automatically detach all attached DMABUFs.
- FUNCTIONFS_DMABUF_TRANSFER, which requests a data transfer from / to the given DMABUF. Its argument is a structure that packs the DMABUF's file descriptor, the size in bytes to transfer (which should generally be set to the size of the DMABUF), and a 'flags' field which is unused for now. Before this ioctl can be used, the related DMABUF must be attached with FUNCTIONFS_DMABUF_ATTACH.
These three ioctls enable the FunctionFS code to transfer data between the USB stack and a DMABUF object, which can be provided by a driver from a completely different subsystem, in a zero-copy fashion.
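From userspace the flow would look roughly like this (a sketch; it assumes the uapi additions from this patch are installed, and the struct layout follows the description above - check the uapi header for the authoritative definition):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/usb/functionfs.h>

/* ep_fd: open FunctionFS data endpoint; buf_fd: DMABUF file descriptor */
static int transfer_dmabuf(int ep_fd, int buf_fd, uint64_t size)
{
	struct usb_ffs_dmabuf_transfer_req req = {
		.fd = buf_fd,
		.flags = 0,		/* no flags defined yet */
		.length = size,		/* generally the DMABUF's full size */
	};

	if (ioctl(ep_fd, FUNCTIONFS_DMABUF_ATTACH, &buf_fd) < 0)
		return -1;

	return ioctl(ep_fd, FUNCTIONFS_DMABUF_TRANSFER, &req);
}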
Signed-off-by: Paul Cercueil paul@crapouillou.net
Acked-by: Daniel Vetter daniel.vetter@ffwll.ch
Acked-by: Christian König christian.koenig@amd.com
---
v2:
- Make ffs_dma_resv_lock() static
- Add MODULE_IMPORT_NS(DMA_BUF);
- The attach/detach functions are now performed without locking the
  eps_lock spinlock. The transfer function starts with the spinlock
  unlocked, then locks it before allocating and queueing the USB transfer.

v3:
- Inline to_ffs_dma_fence() which was called only once.
- Simplify ffs_dma_resv_lock()
- Add comment explaining why we unref twice in ffs_dmabuf_detach()
- Document uapi struct usb_ffs_dmabuf_transfer_req and IOCTLs

v4:
- Protect the DMABUF list with a mutex
- Use incremental sequence number for the dma_fences
- Unref attachments and DMABUFs in workers
- Remove dead code in ffs_dma_resv_lock()
- Fix non-block actually blocking
- Use dma_fence_begin/end_signalling()
- Add comment about cache-management and dma_buf_unmap_attachment()
- Make sure dma_buf_map_attachment() is called with the dma-resv locked

v5:
- Cache the dma_buf_attachment while the DMABUF is attached.
- Use dma_buf_begin/end_access() to ensure that the DMABUF data will be
  coherent to the hardware.
- Remove comment about cache-management and dma_buf_unmap_attachment(),
  since we now use dma_buf_begin/end_access().
- Added Christian's ACK
- Select DMA_SHARED_BUFFER in Kconfig entry
---
 drivers/usb/gadget/Kconfig          |   1 +
 drivers/usb/gadget/function/f_fs.c  | 456 ++++++++++++++++++++++++++++
 include/uapi/linux/usb/functionfs.h |  41 +++
 3 files changed, 498 insertions(+)
diff --git a/drivers/usb/gadget/Kconfig b/drivers/usb/gadget/Kconfig
index b3592bcb0f96..566ff0b1282a 100644
--- a/drivers/usb/gadget/Kconfig
+++ b/drivers/usb/gadget/Kconfig
@@ -190,6 +190,7 @@ config USB_F_MASS_STORAGE
 	tristate

 config USB_F_FS
+	select DMA_SHARED_BUFFER
 	tristate

 config USB_F_UAC1
diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index ed2a6d5fcef7..82cc449a4d7e 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -15,6 +15,9 @@
 /* #define VERBOSE_DEBUG */

 #include <linux/blkdev.h>
+#include <linux/dma-buf.h>
+#include <linux/dma-fence.h>
+#include <linux/dma-resv.h>
 #include <linux/pagemap.h>
 #include <linux/export.h>
 #include <linux/fs_parser.h>
@@ -43,6 +46,8 @@

 #define FUNCTIONFS_MAGIC	0xa647361 /* Chosen by a honest dice roll ;) */

+MODULE_IMPORT_NS(DMA_BUF);
+
 /* Reference counter handling */
 static void ffs_data_get(struct ffs_data *ffs);
 static void ffs_data_put(struct ffs_data *ffs);
@@ -124,6 +129,23 @@ struct ffs_ep {
 	u8				num;
 };

+struct ffs_dmabuf_priv {
+	struct list_head entry;
+	struct kref ref;
+	struct ffs_data *ffs;
+	struct dma_buf_attachment *attach;
+	struct sg_table *sgt;
+	enum dma_data_direction dir;
+	spinlock_t lock;
+	u64 context;
+};
+
+struct ffs_dma_fence {
+	struct dma_fence base;
+	struct ffs_dmabuf_priv *priv;
+	struct work_struct work;
+};
+
 struct ffs_epfile {
 	/* Protects ep->ep and ep->req. */
 	struct mutex			mutex;
@@ -197,6 +219,11 @@ struct ffs_epfile {
 	unsigned char			isoc;	/* P: ffs->eps_lock */

 	unsigned char			_pad;
+
+	/* Protects dmabufs */
+	struct mutex			dmabufs_mutex;
+	struct list_head		dmabufs; /* P: dmabufs_mutex */
+	atomic_t			seqno;
 };

 struct ffs_buffer {
@@ -1271,10 +1298,51 @@ static ssize_t ffs_epfile_read_iter(struct kiocb *kiocb, struct iov_iter *to)
 	return res;
 }

+static void ffs_dmabuf_release(struct kref *ref)
+{
+	struct ffs_dmabuf_priv *priv = container_of(ref, struct ffs_dmabuf_priv, ref);
+	struct dma_buf_attachment *attach = priv->attach;
+	struct dma_buf *dmabuf = attach->dmabuf;
+
+	pr_debug("FFS DMABUF release\n");
+	dma_resv_lock(dmabuf->resv, NULL);
+	dma_buf_unmap_attachment(attach, priv->sgt, priv->dir);
+	dma_resv_unlock(dmabuf->resv);
+
+	dma_buf_detach(attach->dmabuf, attach);
+	dma_buf_put(dmabuf);
+	kfree(priv);
+}
+
+static void ffs_dmabuf_get(struct dma_buf_attachment *attach)
+{
+	struct ffs_dmabuf_priv *priv = attach->importer_priv;
+
+	kref_get(&priv->ref);
+}
+
+static void ffs_dmabuf_put(struct dma_buf_attachment *attach)
+{
+	struct ffs_dmabuf_priv *priv = attach->importer_priv;
+
+	kref_put(&priv->ref, ffs_dmabuf_release);
+}
+
 static int
 ffs_epfile_release(struct inode *inode, struct file *file)
 {
 	struct ffs_epfile *epfile = inode->i_private;
+	struct ffs_dmabuf_priv *priv, *tmp;
+
+	mutex_lock(&epfile->dmabufs_mutex);
+
+	/* Close all attached DMABUFs */
+	list_for_each_entry_safe(priv, tmp, &epfile->dmabufs, entry) {
+		list_del(&priv->entry);
+		ffs_dmabuf_put(priv->attach);
+	}
+
+	mutex_unlock(&epfile->dmabufs_mutex);

 	__ffs_epfile_read_buffer_free(epfile);
 	ffs_data_closed(epfile->ffs);
@@ -1282,6 +1350,354 @@ ffs_epfile_release(struct inode *inode, struct file *file)
 	return 0;
 }
+static void ffs_dmabuf_unmap_work(struct work_struct *work)
+{
+	struct ffs_dma_fence *dma_fence =
+		container_of(work, struct ffs_dma_fence, work);
+	struct ffs_dmabuf_priv *priv = dma_fence->priv;
+	struct dma_buf_attachment *attach = priv->attach;
+	struct dma_fence *fence = &dma_fence->base;
+
+	ffs_dmabuf_put(attach);
+	dma_fence_put(fence);
+}
+
+static void ffs_dmabuf_signal_done(struct ffs_dma_fence *dma_fence, int ret)
+{
+	struct ffs_dmabuf_priv *priv = dma_fence->priv;
+	struct dma_fence *fence = &dma_fence->base;
+	bool cookie = dma_fence_begin_signalling();
+
+	dma_fence_get(fence);
+	dma_buf_end_access(priv->attach, priv->sgt, priv->dir);
+
+	fence->error = ret;
+	dma_fence_signal(fence);
+	dma_fence_end_signalling(cookie);
+
+	/*
+	 * The fence will be unref'd in ffs_dmabuf_unmap_work.
+	 * It can't be done here, as the unref functions might try to lock
+	 * the resv object, which would deadlock.
+	 */
+	INIT_WORK(&dma_fence->work, ffs_dmabuf_unmap_work);
+	queue_work(priv->ffs->io_completion_wq, &dma_fence->work);
+}
+
+static void ffs_epfile_dmabuf_io_complete(struct usb_ep *ep,
+					  struct usb_request *req)
+{
+	pr_debug("FFS: DMABUF transfer complete, status=%d\n", req->status);
+	ffs_dmabuf_signal_done(req->context, req->status);
+	usb_ep_free_request(ep, req);
+}
+
+static const char *ffs_dmabuf_get_driver_name(struct dma_fence *fence)
+{
+	return "functionfs";
+}
+
+static const char *ffs_dmabuf_get_timeline_name(struct dma_fence *fence)
+{
+	return "";
+}
+
+static void ffs_dmabuf_fence_release(struct dma_fence *fence)
+{
+	struct ffs_dma_fence *dma_fence =
+		container_of(fence, struct ffs_dma_fence, base);
+
+	kfree(dma_fence);
+}
+
+static const struct dma_fence_ops ffs_dmabuf_fence_ops = {
+	.get_driver_name	= ffs_dmabuf_get_driver_name,
+	.get_timeline_name	= ffs_dmabuf_get_timeline_name,
+	.release		= ffs_dmabuf_fence_release,
+};
+
+static int ffs_dma_resv_lock(struct dma_buf *dmabuf, bool nonblock)
+{
+	if (!nonblock)
+		return dma_resv_lock_interruptible(dmabuf->resv, NULL);
+
+	if (!dma_resv_trylock(dmabuf->resv))
+		return -EBUSY;
+
+	return 0;
+}
+
+static struct dma_buf_attachment *
+ffs_dmabuf_find_attachment(struct ffs_epfile *epfile, struct dma_buf *dmabuf)
+{
+	struct device *dev = epfile->ffs->gadget->dev.parent;
+	struct dma_buf_attachment *attach = NULL;
+	struct ffs_dmabuf_priv *priv;
+
+	mutex_lock(&epfile->dmabufs_mutex);
+
+	list_for_each_entry(priv, &epfile->dmabufs, entry) {
+		if (priv->attach->dev == dev
+		    && priv->attach->dmabuf == dmabuf) {
+			attach = priv->attach;
+			break;
+		}
+	}
+
+	if (attach)
+		ffs_dmabuf_get(attach);
+
+	mutex_unlock(&epfile->dmabufs_mutex);
+
+	return attach ?: ERR_PTR(-EPERM);
+}
+
+static int ffs_dmabuf_attach(struct file *file, int fd)
+{
+	bool nonblock = file->f_flags & O_NONBLOCK;
+	struct ffs_epfile *epfile = file->private_data;
+	struct usb_gadget *gadget = epfile->ffs->gadget;
+	struct dma_buf_attachment *attach;
+	struct ffs_dmabuf_priv *priv;
+	enum dma_data_direction dir;
+	struct sg_table *sg_table;
+	struct dma_buf *dmabuf;
+	int err;
+
+	if (!gadget || !gadget->sg_supported)
+		return -EPERM;
+
+	dmabuf = dma_buf_get(fd);
+	if (IS_ERR(dmabuf))
+		return PTR_ERR(dmabuf);
+
+	attach = dma_buf_attach(dmabuf, gadget->dev.parent);
+	if (IS_ERR(attach)) {
+		err = PTR_ERR(attach);
+		goto err_dmabuf_put;
+	}
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv) {
+		err = -ENOMEM;
+		goto err_dmabuf_detach;
+	}
+
+	dir = epfile->in ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
+
+	err = ffs_dma_resv_lock(dmabuf, nonblock);
+	if (err)
+		goto err_free_priv;
+
+	sg_table = dma_buf_map_attachment(attach, dir);
+	dma_resv_unlock(dmabuf->resv);
+
+	if (IS_ERR(sg_table)) {
+		err = PTR_ERR(sg_table);
+		goto err_free_priv;
+	}
+
+	attach->importer_priv = priv;
+
+	priv->sgt = sg_table;
+	priv->dir = dir;
+	priv->ffs = epfile->ffs;
+	priv->attach = attach;
+	spin_lock_init(&priv->lock);
+	kref_init(&priv->ref);
+	priv->context = dma_fence_context_alloc(1);
+
+	mutex_lock(&epfile->dmabufs_mutex);
+	list_add(&priv->entry, &epfile->dmabufs);
+	mutex_unlock(&epfile->dmabufs_mutex);
+
+	return 0;
+
+err_free_priv:
+	kfree(priv);
+err_dmabuf_detach:
+	dma_buf_detach(dmabuf, attach);
+err_dmabuf_put:
+	dma_buf_put(dmabuf);
+
+	return err;
+}
+
+static int ffs_dmabuf_detach(struct file *file, int fd)
+{
+	struct ffs_epfile *epfile = file->private_data;
+	struct device *dev = epfile->ffs->gadget->dev.parent;
+	struct ffs_dmabuf_priv *priv;
+	struct dma_buf *dmabuf;
+	int ret = -EPERM;
+
+	dmabuf = dma_buf_get(fd);
+	if (IS_ERR(dmabuf))
+		return PTR_ERR(dmabuf);
+
+	mutex_lock(&epfile->dmabufs_mutex);
+
+	list_for_each_entry(priv, &epfile->dmabufs, entry) {
+		if (priv->attach->dev == dev
+		    && priv->attach->dmabuf == dmabuf) {
+			list_del(&priv->entry);
+
+			/* Unref the reference from ffs_dmabuf_attach() */
+			ffs_dmabuf_put(priv->attach);
+			ret = 0;
+			break;
+		}
+	}
+
+	mutex_unlock(&epfile->dmabufs_mutex);
+	dma_buf_put(dmabuf);
+
+	return ret;
+}
+
+static int ffs_dmabuf_transfer(struct file *file,
+			       const struct usb_ffs_dmabuf_transfer_req *req)
+{
+	bool nonblock = file->f_flags & O_NONBLOCK;
+	struct ffs_epfile *epfile = file->private_data;
+	struct dma_buf_attachment *attach;
+	struct ffs_dmabuf_priv *priv;
+	struct ffs_dma_fence *fence;
+	struct usb_request *usb_req;
+	struct dma_buf *dmabuf;
+	struct ffs_ep *ep;
+	bool cookie;
+	u32 seqno;
+	int ret;
+
+	if (req->flags & ~USB_FFS_DMABUF_TRANSFER_MASK)
+		return -EINVAL;
+
+	dmabuf = dma_buf_get(req->fd);
+	if (IS_ERR(dmabuf))
+		return PTR_ERR(dmabuf);
+
+	if (req->length > dmabuf->size || req->length == 0) {
+		ret = -EINVAL;
+		goto err_dmabuf_put;
+	}
+
+	attach = ffs_dmabuf_find_attachment(epfile, dmabuf);
+	if (IS_ERR(attach)) {
+		ret = PTR_ERR(attach);
+		goto err_dmabuf_put;
+	}
+
+	priv = attach->importer_priv;
+
+	ep = ffs_epfile_wait_ep(file);
+	if (IS_ERR(ep)) {
+		ret = PTR_ERR(ep);
+		goto err_attachment_put;
+	}
+
+	ret = dma_buf_begin_access(attach, priv->sgt, priv->dir);
+	if (ret)
+		goto err_attachment_put;
+
+	ret = ffs_dma_resv_lock(dmabuf, nonblock);
+	if (ret)
+		goto err_end_access;
+
+	/* Make sure we don't have writers */
+	if (!dma_resv_test_signaled(dmabuf->resv, DMA_RESV_USAGE_WRITE)) {
+		pr_debug("FFS WRITE fence is not signaled\n");
+		ret = -EBUSY;
+		goto err_resv_unlock;
+	}
+
+	/* If we're writing to the DMABUF, make sure we don't have readers */
+	if (epfile->in &&
+	    !dma_resv_test_signaled(dmabuf->resv, DMA_RESV_USAGE_READ)) {
+		pr_debug("FFS READ fence is not signaled\n");
+		ret = -EBUSY;
+		goto err_resv_unlock;
+	}
+
+	ret = dma_resv_reserve_fences(dmabuf->resv, 1);
+	if (ret)
+		goto err_resv_unlock;
+
+	fence = kmalloc(sizeof(*fence), GFP_KERNEL);
+	if (!fence) {
+		ret = -ENOMEM;
+		goto err_resv_unlock;
+	}
+
+	fence->priv = priv;
+
+	spin_lock_irq(&epfile->ffs->eps_lock);
+
+	/* In the meantime, endpoint got disabled or changed. */
+	if (epfile->ep != ep) {
+		ret = -ESHUTDOWN;
+		goto err_fence_put;
+	}
+
+	usb_req = usb_ep_alloc_request(ep->ep, GFP_ATOMIC);
+	if (!usb_req) {
+		ret = -ENOMEM;
+		goto err_fence_put;
+	}
+
+	/*
+	 * usb_ep_queue() guarantees that all transfers are processed in the
+	 * order they are enqueued, so we can use a simple incrementing
+	 * sequence number for the dma_fence.
+	 */
+	seqno = atomic_add_return(1, &epfile->seqno);
+
+	dma_fence_init(&fence->base, &ffs_dmabuf_fence_ops,
+		       &priv->lock, priv->context, seqno);
+
+	dma_resv_add_fence(dmabuf->resv, &fence->base,
+			   dma_resv_usage_rw(epfile->in));
+	dma_resv_unlock(dmabuf->resv);
+
+	/* Now that the dma_fence is in place, queue the transfer. */
+
+	usb_req->length = req->length;
+	usb_req->buf = NULL;
+	usb_req->sg = priv->sgt->sgl;
+	usb_req->num_sgs = sg_nents_for_len(priv->sgt->sgl, req->length);
+	usb_req->sg_was_mapped = true;
+	usb_req->context  = fence;
+	usb_req->complete = ffs_epfile_dmabuf_io_complete;
+
+	cookie = dma_fence_begin_signalling();
+	ret = usb_ep_queue(ep->ep, usb_req, GFP_ATOMIC);
+	dma_fence_end_signalling(cookie);
+	if (ret) {
+		pr_warn("FFS: Failed to queue DMABUF: %d\n", ret);
+		ffs_dmabuf_signal_done(fence, ret);
+		usb_ep_free_request(ep->ep, usb_req);
+	}
+
+	spin_unlock_irq(&epfile->ffs->eps_lock);
+	dma_buf_put(dmabuf);
+
+	return ret;
+
+err_fence_put:
+	spin_unlock_irq(&epfile->ffs->eps_lock);
+	dma_fence_put(&fence->base);
+err_resv_unlock:
+	dma_resv_unlock(dmabuf->resv);
+err_end_access:
+	dma_buf_end_access(attach, priv->sgt, priv->dir);
+err_attachment_put:
+	ffs_dmabuf_put(attach);
+err_dmabuf_put:
+	dma_buf_put(dmabuf);
+
+	return ret;
+}
+
 static long ffs_epfile_ioctl(struct file *file, unsigned code,
			     unsigned long value)
 {
@@ -1292,6 +1708,44 @@ static long ffs_epfile_ioctl(struct file *file, unsigned code,
 	if (WARN_ON(epfile->ffs->state != FFS_ACTIVE))
 		return -ENODEV;
+	switch (code) {
+	case FUNCTIONFS_DMABUF_ATTACH:
+	{
+		int fd;
+
+		if (copy_from_user(&fd, (void __user *)value, sizeof(fd))) {
+			ret = -EFAULT;
+			break;
+		}
+
+		return ffs_dmabuf_attach(file, fd);
+	}
+	case FUNCTIONFS_DMABUF_DETACH:
+	{
+		int fd;
+
+		if (copy_from_user(&fd, (void __user *)value, sizeof(fd))) {
+			ret = -EFAULT;
+			break;
+		}
+
+		return ffs_dmabuf_detach(file, fd);
+	}
+	case FUNCTIONFS_DMABUF_TRANSFER:
+	{
+		struct usb_ffs_dmabuf_transfer_req req;
+
+		if (copy_from_user(&req, (void __user *)value, sizeof(req))) {
+			ret = -EFAULT;
+			break;
+		}
+
+		return ffs_dmabuf_transfer(file, &req);
+	}
+	default:
+		break;
+	}
+
 	/* Wait for endpoint to be enabled */
 	ep = ffs_epfile_wait_ep(file);
 	if (IS_ERR(ep))
@@ -1869,6 +2323,8 @@ static int ffs_epfiles_create(struct ffs_data *ffs)
 	for (i = 1; i <= count; ++i, ++epfile) {
 		epfile->ffs = ffs;
 		mutex_init(&epfile->mutex);
+		mutex_init(&epfile->dmabufs_mutex);
+		INIT_LIST_HEAD(&epfile->dmabufs);
 		if (ffs->user_flags & FUNCTIONFS_VIRTUAL_ADDR)
 			sprintf(epfile->name, "ep%02x", ffs->eps_addrmap[i]);
 		else
diff --git a/include/uapi/linux/usb/functionfs.h b/include/uapi/linux/usb/functionfs.h
index 078098e73fd3..9f88de9c3d66 100644
--- a/include/uapi/linux/usb/functionfs.h
+++ b/include/uapi/linux/usb/functionfs.h
@@ -86,6 +86,22 @@ struct usb_ext_prop_desc {
 	__le16	wPropertyNameLength;
 } __attribute__((packed));

+/* Flags for usb_ffs_dmabuf_transfer_req->flags (none for now) */
+#define USB_FFS_DMABUF_TRANSFER_MASK	0x0
+
+/**
+ * struct usb_ffs_dmabuf_transfer_req - Transfer request for a DMABUF object
+ * @fd:		file descriptor of the DMABUF object
+ * @flags:	one or more USB_FFS_DMABUF_TRANSFER_* flags
+ * @length:	number of bytes used in this DMABUF for the data transfer.
+ *		Should generally be set to the DMABUF's size.
+ */
+struct usb_ffs_dmabuf_transfer_req {
+	int fd;
+	__u32 flags;
+	__u64 length;
+} __attribute__((packed));
+
 #ifndef __KERNEL__

 /*
@@ -290,6 +306,31 @@ struct usb_functionfs_event {
 #define FUNCTIONFS_ENDPOINT_DESC	_IOR('g', 130, \
					     struct usb_endpoint_descriptor)

+/*
+ * Attach the DMABUF object, identified by its file descriptor, to the
+ * data endpoint. Returns zero on success, and a negative errno value
+ * on error.
+ */
+#define FUNCTIONFS_DMABUF_ATTACH	_IOW('g', 131, int)
+
+/*
+ * Detach the given DMABUF object, identified by its file descriptor,
+ * from the data endpoint. Returns zero on success, and a negative
+ * errno value on error. Note that closing the endpoint's file
+ * descriptor will automatically detach all attached DMABUFs.
+ */
+#define FUNCTIONFS_DMABUF_DETACH	_IOW('g', 132, int)
+
+/*
+ * Enqueue the previously attached DMABUF to the transfer queue.
+ * The argument is a structure that packs the DMABUF's file descriptor,
+ * the size in bytes to transfer (which should generally correspond to
+ * the size of the DMABUF), and a 'flags' field which is unused
+ * for now. Returns zero on success, and a negative errno value on
+ * error.
+ */
+#define FUNCTIONFS_DMABUF_TRANSFER	_IOW('g', 133, \
+					     struct usb_ffs_dmabuf_transfer_req)
+
#endif /* _UAPI__LINUX_FUNCTIONFS_H__ */
Add documentation for the three ioctls used to attach or detach externally-created DMABUFs, and to request transfers from/to previously attached DMABUFs.
Signed-off-by: Paul Cercueil <paul@crapouillou.net>
---
v3: New patch
---
 Documentation/usb/functionfs.rst | 36 ++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)
diff --git a/Documentation/usb/functionfs.rst b/Documentation/usb/functionfs.rst
index a3054bea38f3..d05a775bc45b 100644
--- a/Documentation/usb/functionfs.rst
+++ b/Documentation/usb/functionfs.rst
@@ -2,6 +2,9 @@
 How FunctionFS works
 ====================

+Overview
+========
+
 From kernel point of view it is just a composite function with some
 unique behaviour. It may be added to an USB configuration only after
 the user space driver has registered by writing descriptors and
@@ -66,3 +69,36 @@ have been written to their ep0's.

 Conversely, the gadget is unregistered after the first USB function
 closes its endpoints.
+
+DMABUF interface
+================
+
+FunctionFS additionally supports a DMABUF-based interface, where
+userspace can attach DMABUF objects (externally created) to an endpoint,
+and subsequently use them for data transfers.
+
+A userspace application can then use this interface to share DMABUF
+objects between several interfaces, allowing it to transfer data in a
+zero-copy fashion, for instance between IIO and the USB stack.
+
+As part of this interface, three new IOCTLs have been added. These three
+IOCTLs have to be performed on a data endpoint (i.e. not ep0). They are:
+
+  ``FUNCTIONFS_DMABUF_ATTACH(int)``
+    Attach the DMABUF object, identified by its file descriptor, to the
+    data endpoint. Returns zero on success, and a negative errno value
+    on error.
+
+  ``FUNCTIONFS_DMABUF_DETACH(int)``
+    Detach the given DMABUF object, identified by its file descriptor,
+    from the data endpoint. Returns zero on success, and a negative
+    errno value on error. Note that closing the endpoint's file
+    descriptor will automatically detach all attached DMABUFs.
+
+  ``FUNCTIONFS_DMABUF_TRANSFER(struct usb_ffs_dmabuf_transfer_req *)``
+    Enqueue the previously attached DMABUF to the transfer queue.
+    The argument is a structure that packs the DMABUF's file descriptor,
+    the size in bytes to transfer (which should generally correspond to
+    the size of the DMABUF), and a 'flags' field which is unused
+    for now. Returns zero on success, and a negative errno value on
+    error.
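A note on completion, for userspace authors: the transfer IOCTL only enqueues the request, and completion is signalled through the DMA fence that the driver installs on the buffer's reservation object. One hypothetical way to wait for that fence from userspace (not part of this patchset; it relies on the pre-existing DMA_BUF_IOCTL_EXPORT_SYNC_FILE uapi) is sketched below:

    #include <poll.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/dma-buf.h>

    /* Block until all fences currently attached to the DMABUF have signalled. */
    static int dmabuf_wait_idle(int dmabuf_fd)
    {
            struct dma_buf_export_sync_file sync = { .flags = DMA_BUF_SYNC_RW };
            struct pollfd pfd;
            int ret;

            /* Export the current fences as a sync_file fd... */
            if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &sync) < 0)
                    return -1;

            /* ...and poll it; it becomes readable once the fences signal. */
            pfd.fd = sync.fd;
            pfd.events = POLLIN;
            ret = poll(&pfd, 1, -1);
            close(sync.fd);

            return ret < 0 ? -1 : 0;
    }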