Condsider the following call sequence:
/* Upper layer */
dma_fence_begin_signalling();
lock(tainted_shared_lock);
/* Driver callback */
dma_fence_begin_signalling();
...
The driver might here use a utility that is annotated as intended for the
dma-fence signalling critical path. Now if the upper layer isn't correctly
annotated yet for whatever reason, resulting in
/* Upper layer */
lock(tainted_shared_lock);
/* Driver callback */
dma_fence_begin_signalling();
We will receive a false lockdep locking order violation notification from
dma_fence_begin_signalling(). However entering a dma-fence signalling
critical section itself doesn't block and could not cause a deadlock.
So use a successful read_trylock() annotation instead for
dma_fence_begin_signalling(). That will make sure that the locking order
is correctly registered in the first case, and doesn't register any
locking order in the second case.
The alternative is of course to make sure that the "Upper layer" is always
correctly annotated. But experience shows that's not easily achievable
in all cases.
Signed-off-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
---
drivers/dma-buf/dma-fence.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index f177c56269bb..17f632768ef9 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -308,8 +308,8 @@ bool dma_fence_begin_signalling(void)
if (in_atomic())
return true;
- /* ... and non-recursive readlock */
- lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
+ /* ... and non-recursive successful read_trylock */
+ lock_acquire(&dma_fence_lockdep_map, 0, 1, 1, 1, NULL, _RET_IP_);
return false;
}
@@ -340,7 +340,7 @@ void __dma_fence_might_wait(void)
lock_map_acquire(&dma_fence_lockdep_map);
lock_map_release(&dma_fence_lockdep_map);
if (tmp)
- lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
+ lock_acquire(&dma_fence_lockdep_map, 0, 1, 1, 1, NULL, _THIS_IP_);
}
#endif
--
2.39.2
The purpose of this patchset is for MediaTek secure video playback, and
also to enable other potential uses of this in the future. The 'restricted
dma-heap' will be used to allocate dma_buf objects that reference memory
in the secure world that is inaccessible/unmappable by the non-secure
(i.e. kernel/userspace) world. That memory will be used by the secure/
trusted world to store secure information (i.e. decrypted media content).
The dma_bufs allocated from the kernel will be passed to V4L2 for video
decoding (as input and output). They will also be used by the drm
system for rendering of the content.
This patchset adds two MediaTek restricted heaps and they will be used in
v4l2[1] and drm[2].
1) restricted_mtk_cm: secure chunk memory for MediaTek SVP (Secure Video
Path). The buffer is reserved for the secure world after bootup and it
is used for vcodec's ES/working buffer;
2) restricted_mtk_cma: secure CMA memory for MediaTek SVP. This buffer is
dynamically reserved for the secure world and will be got when we start
playing secure videos. Once the security video playing is complete, the
CMA will be released. This heap is used for the vcodec's frame buffer.
[1] https://lore.kernel.org/linux-mediatek/20240412090851.24999-1-yunfei.dong@m…
[2] https://lore.kernel.org/linux-mediatek/20240403102701.369-1-shawn.sung@medi…
Change note:
v5: 1) Reconstruct TEE commands to allow the kernel to obtain the PA of the
TEE buffer to initialize a valid sg table.
2) Previously, PA was hidden from the kernel. Then the kernel checks if
this is restricted buffer by "if (sg_page(sg) == NULL)".
In this version, we will add a new explicit interface
(sg_dma_is_restricted) for users to determine whether this is a
restricted buffer.
3) some words improve, like using "rheap".
Rebase on v6.9-rc7.
v4: https://lore.kernel.org/linux-mediatek/20240112092014.23999-1-yong.wu@media…
1) Rename the heap name from "secure" to "restricted". suggested from
Simon/Pekka. There are still several "secure" string in MTK file
since we use ARM platform in which we call this "secure world"/
"secure command".
v3: https://lore.kernel.org/linux-mediatek/20231212024607.3681-1-yong.wu@mediat…
1) Separate the secure heap to a common file(secure_heap.c) and mtk
special file (secure_heap_mtk.c), and put all the tee related code
into our special file.
2) About dt-binding, Add "mediatek," prefix since this is Mediatek TEE
firmware definition.
3) Remove the normal CMA heap which is a draft for qcom.
Rebase on v6.7-rc1.
v2: https://lore.kernel.org/linux-mediatek/20231111111559.8218-1-yong.wu@mediat…
1) Move John's patches into the vcodec patchset since they use the new
dma heap interface directly.
https://lore.kernel.org/linux-mediatek/20231106120423.23364-1-yunfei.dong@m…
2) Reword the dt-binding description.
3) Rename the heap name from mtk_svp to secure_mtk_cm.
This means the current vcodec/DRM upstream code doesn't match this.
4) Add a normal CMA heap. currently it should be a draft version.
5) Regarding the UUID, I still use hard code, but put it in a private
data which allow the others could set their own UUID. What's more, UUID
is necessary for the session with TEE. If we don't have it, we can't
communicate with the TEE, including the get_uuid interface, which tries
to make uuid more generic, not working. If there is other way to make
UUID more general, please free to tell me.
v1: https://lore.kernel.org/linux-mediatek/20230911023038.30649-1-yong.wu@media…
Base on v6.6-rc1.
Yong Wu (9):
dt-bindings: reserved-memory: Add mediatek,dynamic-restricted-region
scatterlist: Add a flag for the restricted memory
lib/scatterlist: Add sg_dup_table
dma-buf: heaps: Initialize a restricted heap
dma-buf: heaps: restricted_heap: Add private heap ops
dma-buf: heaps: restricted_heap: Add dma_ops
dma-buf: heaps: restricted_heap: Add MediaTek restricted heap and
heap_init
dma-buf: heaps: restricted_heap_mtk: Add TEE memory service call
dma_buf: heaps: restricted_heap_mtk: Add a new CMA heap
.../mediatek,dynamic-restricted-region.yaml | 43 ++
drivers/dma-buf/heaps/Kconfig | 16 +
drivers/dma-buf/heaps/Makefile | 4 +-
drivers/dma-buf/heaps/restricted_heap.c | 219 +++++++++
drivers/dma-buf/heaps/restricted_heap.h | 45 ++
drivers/dma-buf/heaps/restricted_heap_mtk.c | 423 ++++++++++++++++++
drivers/dma-buf/heaps/system_heap.c | 27 +-
include/linux/scatterlist.h | 36 ++
lib/scatterlist.c | 26 ++
9 files changed, 812 insertions(+), 27 deletions(-)
create mode 100644 Documentation/devicetree/bindings/reserved-memory/mediatek,dynamic-restricted-region.yaml
create mode 100644 drivers/dma-buf/heaps/restricted_heap.c
create mode 100644 drivers/dma-buf/heaps/restricted_heap.h
create mode 100644 drivers/dma-buf/heaps/restricted_heap_mtk.c
--
2.18.0
On Wed, Jul 31, 2024 at 11:34:49AM +0800, Huan Yang wrote:
> The current udmabuf_folio contains a list_head and the corresponding
> folio pointer, with a size of 24 bytes. udmabuf_folio uses kmalloc to
> allocate memory.
>
> However, kmalloc is a public pool, starting from 64 bytes. This means
> that each udmabuf_folio allocation will waste 40 bytes.
>
> Considering that each udmabuf creates a folio corresponding to a
> udmabuf_folio, the wasted memory can be significant in the case of
> memory fragmentation.
>
> Furthermore, if udmabuf is frequently used, the allocation and
> deallocation of udmabuf_folio will also be frequent.
>
> Therefore, this patch adds a kmem_cache dedicated to the allocation and
> deallocation of udmabuf_folio.This is expected to improve the
> performance of allocation and deallocation within the expected range,
> while also avoiding memory waste.
>
> Signed-off-by: Huan Yang <link(a)vivo.com>
> ---
> drivers/dma-buf/udmabuf.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index 047c3cd2ceff..db4de8c745ce 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -24,6 +24,8 @@ static int size_limit_mb = 64;
> module_param(size_limit_mb, int, 0644);
> MODULE_PARM_DESC(size_limit_mb, "Max size of a dmabuf, in megabytes. Default is 64.");
>
> +static struct kmem_cache *udmabuf_folio_cachep;
> +
> struct udmabuf {
> pgoff_t pagecount;
> struct folio **folios;
> @@ -169,7 +171,7 @@ static void unpin_all_folios(struct list_head *unpin_list)
> unpin_folio(ubuf_folio->folio);
>
> list_del(&ubuf_folio->list);
> - kfree(ubuf_folio);
> + kmem_cache_free(udmabuf_folio_cachep, ubuf_folio);
> }
> }
>
> @@ -178,7 +180,7 @@ static int add_to_unpin_list(struct list_head *unpin_list,
> {
> struct udmabuf_folio *ubuf_folio;
>
> - ubuf_folio = kzalloc(sizeof(*ubuf_folio), GFP_KERNEL);
> + ubuf_folio = kmem_cache_alloc(udmabuf_folio_cachep, GFP_KERNEL);
> if (!ubuf_folio)
> return -ENOMEM;
>
> @@ -492,10 +494,20 @@ static int __init udmabuf_dev_init(void)
> if (ret < 0) {
> pr_err("Could not setup DMA mask for udmabuf device\n");
> misc_deregister(&udmabuf_misc);
misc_deregister() is now called twice in this error path, I think you've
forgotten to delete this line too?
Otherwise lgtm.
-Sima
> - return ret;
> + goto err;
> + }
> +
> + udmabuf_folio_cachep = KMEM_CACHE(udmabuf_folio, 0);
> + if (unlikely(!udmabuf_folio_cachep)) {
> + ret = -ENOMEM;
> + goto err;
> }
>
> return 0;
> +
> +err:
> + misc_deregister(&udmabuf_misc);
> + return ret;
> }
>
> static void __exit udmabuf_dev_exit(void)
>
> base-commit: cd19ac2f903276b820f5d0d89de0c896c27036ed
> --
> 2.45.2
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Tue, Jul 30, 2024 at 08:04:04PM +0800, Huan Yang wrote:
>
> 在 2024/7/30 17:05, Huan Yang 写道:
> >
> > 在 2024/7/30 16:56, Daniel Vetter 写道:
> > > [????????? daniel.vetter(a)ffwll.ch ?????????
> > > https://aka.ms/LearnAboutSenderIdentification?????????????]
> > >
> > > On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
> > > > UDMA-BUF step:
> > > > 1. memfd_create
> > > > 2. open file(buffer/direct)
> > > > 3. udmabuf create
> > > > 4. mmap memfd
> > > > 5. read file into memfd vaddr
> > > Yeah this is really slow and the worst way to do it. You absolutely want
> > > to start _all_ the io before you start creating the dma-buf, ideally
> > > with
> > > everything running in parallel. But just starting the direct I/O with
> > > async and then creating the umdabuf should be a lot faster and avoid
> > That's greate, Let me rephrase that, and please correct me if I'm wrong.
> >
> > UDMA-BUF step:
> > 1. memfd_create
> > 2. mmap memfd
> > 3. open file(buffer/direct)
> > 4. start thread to async read
> > 3. udmabuf create
> >
> > With this, can improve
>
> I just test with it. Step is:
>
> UDMA-BUF step:
> 1. memfd_create
> 2. mmap memfd
> 3. open file(buffer/direct)
> 4. start thread to async read
> 5. udmabuf create
>
> 6 . join wait
>
> 3G file read all step cost 1,527,103,431ns, it's greate.
Ok that's almost the throughput of your patch set, which I think is close
enough. The remaining difference is probably just the mmap overhead, not
sure whether/how we can do direct i/o to an fd directly ... in principle
it's possible for any file that uses the standard pagecache.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Le 31/07/2024 à 09:37, Huan Yang a écrit :
> The current udmabuf_folio contains a list_head and the corresponding
> folio pointer, with a size of 24 bytes. udmabuf_folio uses kmalloc to
> allocate memory.
>
> However, kmalloc is a public pool, starting from 8,16,32 bytes.
> Additionally, if the size is not aligned with the kmalloc size, it will
> be rounded up to the corresponding size.
> This means that each udmabuf_folio allocation will get 32 bytes, and
> waste 8 bytes.
>
> Considering that each udmabuf creates a folio corresponding to a
> udmabuf_folio, the wasted memory can be significant in the case of
> memory fragmentation.
>
> Furthermore, if udmabuf is frequently used, the allocation and
> deallocation of udmabuf_folio will also be frequent.
>
> Therefore, this patch adds a kmem_cache dedicated to the allocation and
> deallocation of udmabuf_folio.This is expected to improve the
> performance of allocation and deallocation within the expected range,
> while also avoiding memory waste.
>
> Signed-off-by: Huan Yang <link(a)vivo.com>
> ---
> v3 -> v2: fix error description.
> v2 -> v1: fix double unregister, remove unlikely.
> drivers/dma-buf/udmabuf.c | 19 +++++++++++++++----
> 1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index 047c3cd2ceff..c112c58ef09a 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -24,6 +24,8 @@ static int size_limit_mb = 64;
> module_param(size_limit_mb, int, 0644);
> MODULE_PARM_DESC(size_limit_mb, "Max size of a dmabuf, in megabytes. Default is 64.");
>
> +static struct kmem_cache *udmabuf_folio_cachep;
> +
> struct udmabuf {
> pgoff_t pagecount;
> struct folio **folios;
> @@ -169,7 +171,7 @@ static void unpin_all_folios(struct list_head *unpin_list)
> unpin_folio(ubuf_folio->folio);
>
> list_del(&ubuf_folio->list);
> - kfree(ubuf_folio);
> + kmem_cache_free(udmabuf_folio_cachep, ubuf_folio);
> }
> }
>
> @@ -178,7 +180,7 @@ static int add_to_unpin_list(struct list_head *unpin_list,
> {
> struct udmabuf_folio *ubuf_folio;
>
> - ubuf_folio = kzalloc(sizeof(*ubuf_folio), GFP_KERNEL);
> + ubuf_folio = kmem_cache_alloc(udmabuf_folio_cachep, GFP_KERNEL);
> if (!ubuf_folio)
> return -ENOMEM;
>
> @@ -491,11 +493,20 @@ static int __init udmabuf_dev_init(void)
> DMA_BIT_MASK(64));
> if (ret < 0) {
> pr_err("Could not setup DMA mask for udmabuf device\n");
> - misc_deregister(&udmabuf_misc);
> - return ret;
> + goto err;
> + }
> +
> + udmabuf_folio_cachep = KMEM_CACHE(udmabuf_folio, 0);
> + if (!udmabuf_folio_cachep) {
> + ret = -ENOMEM;
> + goto err;
> }
>
> return 0;
> +
> +err:
> + misc_deregister(&udmabuf_misc);
> + return ret;
> }
>
> static void __exit udmabuf_dev_exit(void)
Hi,
should a kmem_cache_destroy() be also added in udmabuf_dev_exit()?
CJ
>
> base-commit: cd19ac2f903276b820f5d0d89de0c896c27036ed
Hi Huan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 931a3b3bccc96e7708c82b30b2b5fa82dfd04890]
url: https://github.com/intel-lab-lkp/linux/commits/Huan-Yang/dma-buf-heaps-Intr…
base: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
patch link: https://lore.kernel.org/r/20240730075755.10941-4-link%40vivo.com
patch subject: [PATCH v2 3/5] dma-buf: heaps: support alloc async read file
config: xtensa-allyesconfig (https://download.01.org/0day-ci/archive/20240731/202407312202.LhLTLEhX-lkp@…)
compiler: xtensa-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240731/202407312202.LhLTLEhX-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407312202.LhLTLEhX-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'priv' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'heap_devt' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'list' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'heap_cdev' not described in 'dma_heap'
>> drivers/dma-buf/dma-heap.c:158: warning: Function parameter or struct member 'lock' not described in 'dma_heap_file_control'
drivers/dma-buf/dma-heap.c:482: warning: expecting prototype for Trigger sync file read, read into dma(). Prototype was for dma_heap_read_file_sync() instead
vim +158 drivers/dma-buf/dma-heap.c
133
134 /**
135 * struct dma_heap_file_control - global control of dma_heap file read.
136 * @works: @dma_heap_file_work's list head.
137 *
138 * @threadwq: wait queue for @work_thread, if commit work, @work_thread
139 * wakeup and read this work's file contains.
140 *
141 * @workwq: used for main thread wait for file read end, if allocation
142 * end before file read. @dma_heap_file_task ref effect this.
143 *
144 * @work_thread: file read kthread. the dma_heap_file_task work's consumer.
145 *
146 * @heap_fwork_cachep: @dma_heap_file_work's cachep, it's alloc/free frequently.
147 *
148 * @nr_work: global number of how many work committed.
149 */
150 struct dma_heap_file_control {
151 struct list_head works;
152 spinlock_t lock; // only lock for @works.
153 wait_queue_head_t threadwq;
154 wait_queue_head_t workwq;
155 struct task_struct *work_thread;
156 struct kmem_cache *heap_fwork_cachep;
157 atomic_t nr_work;
> 158 };
159
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Huan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 931a3b3bccc96e7708c82b30b2b5fa82dfd04890]
url: https://github.com/intel-lab-lkp/linux/commits/Huan-Yang/dma-buf-heaps-Intr…
base: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
patch link: https://lore.kernel.org/r/20240730075755.10941-2-link%40vivo.com
patch subject: [PATCH v2 1/5] dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
config: xtensa-allyesconfig (https://download.01.org/0day-ci/archive/20240731/202407311822.ZneNMq5I-lkp@…)
compiler: xtensa-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240731/202407311822.ZneNMq5I-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407311822.ZneNMq5I-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/dma-buf/dma-heap.c:44: warning: Function parameter or struct member 'priv' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:44: warning: Function parameter or struct member 'heap_devt' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:44: warning: Function parameter or struct member 'list' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:44: warning: Function parameter or struct member 'heap_cdev' not described in 'dma_heap'
>> drivers/dma-buf/dma-heap.c:104: warning: expecting prototype for Trigger sync file read, read into dma(). Prototype was for dma_heap_read_file_sync() instead
vim +104 drivers/dma-buf/dma-heap.c
86
87 /**
88 * Trigger sync file read, read into dma-buf.
89 *
90 * @dmabuf: which we done alloced and export.
91 * @heap_file: file info wrapper to read from.
92 *
93 * Whether to use buffer I/O or direct I/O depends on the mode when the
94 * file is opened.
95 * Remember, if use direct I/O, file must be page aligned.
96 * Since the buffer used for file reading is provided by dma-buf, when
97 * using direct I/O, the file content will be directly filled into
98 * dma-buf without the need for additional CPU copying.
99 *
100 * 0 on success, negative if anything wrong.
101 */
102 static int dma_heap_read_file_sync(struct dma_buf *dmabuf,
103 struct dma_heap_file *heap_file)
> 104 {
105 struct iosys_map map;
106 ssize_t bytes;
107 int ret;
108
109 ret = dma_buf_vmap(dmabuf, &map);
110 if (ret)
111 return ret;
112
113 /**
114 * The kernel_read_file function can handle file reading effectively,
115 * and if the return value does not match the file size,
116 * then it indicates an error.
117 */
118 bytes = kernel_read_file(heap_file->file, 0, &map.vaddr, dmabuf->size,
119 &heap_file->fsize, READING_POLICY);
120 if (bytes != heap_file->fsize)
121 ret = -EIO;
122
123 dma_buf_vunmap(dmabuf, &map);
124
125 return ret;
126 }
127
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Tue, Jul 30, 2024 at 1:14 AM Huan Yang <link(a)vivo.com> wrote:
>
>
> 在 2024/7/30 16:03, Christian König 写道:
> > Am 30.07.24 um 09:57 schrieb Huan Yang:
> >> Background
> >> ====
> >> Some user may need load file into dma-buf, current way is:
> >> 1. allocate a dma-buf, get dma-buf fd
> >> 2. mmap dma-buf fd into user vaddr
> >> 3. read(file_fd, vaddr, fsz)
> >> Due to dma-buf user map can't support direct I/O[1], the file read
> >> must be buffer I/O.
> >>
> >> This means that during the process of reading the file into dma-buf,
> >> page cache needs to be generated, and the corresponding content needs to
> >> be first copied to the page cache before being copied to the dma-buf.
> >>
> >> This way worked well when reading relatively small files before, as
> >> the page cache can cache the file content, thus improving performance.
> >>
> >> However, there are new challenges currently, especially as AI models are
> >> becoming larger and need to be shared between DMA devices and the CPU
> >> via dma-buf.
> >>
> >> For example, our 7B model file size is around 3.4GB. Using the
> >> previous would mean generating a total of 3.4GB of page cache
> >> (even if it will be reclaimed), and also requiring the copying of 3.4GB
> >> of content between page cache and dma-buf.
> >>
> >> Due to the limited resources of system memory, files in the gigabyte
> >> range
> >> cannot persist in memory indefinitely, so this portion of page cache may
> >> not provide much assistance for subsequent reads. Additionally, the
> >> existence of page cache will consume additional system resources due to
> >> the extra copying required by the CPU.
> >>
> >> Therefore, I think it is necessary for dma-buf to support direct I/O.
> >>
> >> However, direct I/O file reads cannot be performed using the buffer
> >> mmaped by the user space for the dma-buf.[1]
> >>
> >> Here are some discussions on implementing direct I/O using dma-buf:
> >>
> >> mmap[1]
> >> ---
> >> dma-buf never support user map vaddr use of direct I/O.
> >>
> >> udmabuf[2]
> >> ---
> >> Currently, udmabuf can use the memfd method to read files into
> >> dma-buf in direct I/O mode.
> >>
> >> However, if the size is large, the current udmabuf needs to adjust the
> >> corresponding size_limit(default 64MB).
> >> But using udmabuf for files at the 3GB level is not a very good
> >> approach.
> >> It needs to make some adjustments internally to handle this.[3] Or else,
> >> fail create.
> >>
> >> But, it is indeed a viable way to enable dma-buf to support direct I/O.
> >> However, it is necessary to initiate the file read after the memory
> >> allocation
> >> is completed, and handle race conditions carefully.
> >>
> >> sendfile/splice[4]
> >> ---
> >> Another way to enable dma-buf to support direct I/O is by implementing
> >> splice_write/write_iter in the dma-buf file operations (fops) to adapt
> >> to the sendfile method.
> >> However, the current sendfile/splice calls are based on pipe. When using
> >> direct I/O to read a file, the content needs to be copied to the buffer
> >> allocated by the pipe (default 64KB), and then the dma-buf fops'
> >> splice_write needs to be called to write the content into the dma-buf.
> >> This approach requires serially reading the content of file pipe size
> >> into the pipe buffer and then waiting for the dma-buf to be written
> >> before reading the next one.(The I/O performance is relatively weak
> >> under direct I/O.)
> >> Moreover, due to the existence of the pipe buffer, even when using
> >> direct I/O and not needing to generate additional page cache,
> >> there still needs to be a CPU copy.
> >>
> >> copy_file_range[5]
> >> ---
> >> Consider of copy_file_range, It only supports copying files within the
> >> same file system. Similarly, it is not very practical.
> >>
> >>
> >> So, currently, there is no particularly suitable solution on VFS to
> >> allow dma-buf to support direct I/O for large file reads.
> >>
> >> This patchset provides an idea to complete file reads when requesting a
> >> dma-buf.
> >>
> >> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
> >> ===
> >> This patch provides a method to immediately read the file content after
> >> the dma-buf is allocated, and only returns the dma-buf file descriptor
> >> after the file is fully read.
> >>
> >> Since the dma-buf file descriptor is not returned, no other thread can
> >> access it except for the current thread, so we don't need to worry about
> >> race conditions.
> >
> > That is a completely false assumption.
> Can you provide a detailed explanation as to why this assumption is
> incorrect? thanks.
> >
> >>
> >> Map the dma-buf to the vmalloc area and initiate file reads in kernel
> >> space, supporting both buffer I/O and direct I/O.
> >>
> >> This patch adds the DMA_HEAP_ALLOC_AND_READ heap_flag for user.
> >> When a user needs to allocate a dma-buf and read a file, they should
> >> pass this heap flag. As the size of the file being read is fixed,
> >> there is no
> >> need to pass the 'len' parameter. Instead, The file_fd needs to be
> >> passed to
> >> indicate to the kernel the file that needs to be read.
> >>
> >> The file open flag determines the mode of file reading.
> >> But, please note that if direct I/O(O_DIRECT) is needed to read the
> >> file,
> >> the file size must be page aligned. (with patch 2-5, no need)
> >>
> >> Therefore, for the user, len and file_fd are mutually exclusive,
> >> and they are combined using a union.
> >>
> >> Once the user obtains the dma-buf fd, the dma-buf directly contains the
> >> file content.
> >
> > And I'm repeating myself, but this is a complete NAK from my side to
> > this approach.
> >
> > We pointed out multiple ways of how to implement this cleanly and not
> > by hacking functionality into the kernel which absolutely doesn't
> > belong there.
> In this patchset, I have provided performance comparisons of each of
> these methods. Can you please provide more opinions?
> >
> > Regards,
> > Christian.
> >
> >>
> >> Patch 1 implement it.
> >>
> >> Patch 2-5 provides an approach for performance improvement.
> >>
> >> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
> >> synchronously read files using direct I/O.
> >>
> >> This approach helps to save CPU copying and avoid a certain degree of
> >> memory thrashing (page cache generation and reclamation)
> >>
> >> When dealing with large file sizes, the benefits of this approach become
> >> particularly significant.
> >>
> >> However, there are currently some methods that can improve performance,
> >> not just save system resources:
> >>
> >> Due to the large file size, for example, a AI 7B model of around
> >> 3.4GB, the
> >> time taken to allocate DMA-BUF memory will be relatively long. Waiting
> >> for the allocation to complete before reading the file will add to the
> >> overall time consumption. Therefore, the total time for DMA-BUF
> >> allocation and file read can be calculated using the formula
> >> T(total) = T(alloc) + T(I/O)
> >>
> >> However, if we change our approach, we don't necessarily need to wait
> >> for the DMA-BUF allocation to complete before initiating I/O. In fact,
> >> during the allocation process, we already hold a portion of the page,
> >> which means that waiting for subsequent page allocations to complete
> >> before carrying out file reads is actually unfair to the pages that have
> >> already been allocated.
> >>
> >> The allocation of pages is sequential, and the reading of the file is
> >> also sequential, with the content and size corresponding to the file.
> >> This means that the memory location for each page, which holds the
> >> content of a specific position in the file, can be determined at the
> >> time of allocation.
> >>
> >> However, to fully leverage I/O performance, it is best to wait and
> >> gather a certain number of pages before initiating batch processing.
> >>
> >> The default gather size is 128MB. So, ever gathered can see as a file
> >> read
> >> work, it maps the gather page to the vmalloc area to obtain a continuous
> >> virtual address, which is used as a buffer to store the contents of the
> >> corresponding file. So, if using direct I/O to read a file, the file
> >> content will be written directly to the corresponding dma-buf buffer
> >> memory
> >> without any additional copying.(compare to pipe buffer.)
> >>
> >> Consider other ways to read into dma-buf. If we assume reading after
> >> mmap
> >> dma-buf, we need to map the pages of the dma-buf to the user virtual
> >> address space. Also, udmabuf memfd need do this operations too.
> >> Even if we support sendfile, the file copy also need buffer, you must
> >> setup it.
> >> So, mapping pages to the vmalloc area does not incur any additional
> >> performance overhead compared to other methods.[6]
> >>
> >> Certainly, the administrator can also modify the gather size through
> >> patch5.
> >>
> >> The formula for the time taken for system_heap buffer allocation and
> >> file reading through async_read is as follows:
> >>
> >> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
> >>
> >> Compared to the synchronous read:
> >> T(total) = T(alloc) + T(I/O)
> >>
> >> If the allocation time or I/O time is long, the time difference will be
> >> covered by the maximum value between the allocation and I/O. The other
> >> party will be concealed.
> >>
> >> Therefore, the larger the size of the file that needs to be read, the
> >> greater the corresponding benefits will be.
> >>
> >> How to use
> >> ===
> >> Consider the current pathway for loading model files into DMA-BUF:
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(can't use O_DIRECT)
> >> 3. use file len to allocate dma-buf, get dma-buf fd
> >> 4. mmap dma-buf fd, get vaddr
> >> 5. read(file_fd, vaddr, file_size) into dma-buf pages
> >> 6. share, attach, whatever you want
> >>
> >> Use DMA_HEAP_ALLOC_AND_READ_FILE JUST a little change:
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(buffer/direct)
> >> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
> >> set file_fd
> >> instead of len. get dma-buf fd(contains file content)
> >> 4. share, attach, whatever you want
> >>
> >> So, test it is easy.
> >>
> >> How to test
> >> ===
> >> The performance comparison will be conducted for the following
> >> scenarios:
> >> 1. normal
> >> 2. udmabuf with [3] patch
> >> 3. sendfile
> >> 4. only patch 1
> >> 5. patch1 - patch4.
> >>
> >> normal:
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(can't use O_DIRECT)
> >> 3. use file len to allocate dma-buf, get dma-buf fd
> >> 4. mmap dma-buf fd, get vaddr
> >> 5. read(file_fd, vaddr, file_size) into dma-buf pages
> >> 6. share, attach, whatever you want
> >>
> >> UDMA-BUF step:
> >> 1. memfd_create
> >> 2. open file(buffer/direct)
> >> 3. udmabuf create
> >> 4. mmap memfd
> >> 5. read file into memfd vaddr
> >>
> >> Sendfile step(need suit splice_write/write_iter, just use to compare):
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(buffer/direct)
> >> 3. use file len to allocate dma-buf, get dma-buf fd
> >> 4. sendfile file_fd to dma-buf fd
> >> 6. share, attach, whatever you want
> >>
> >> patch1/patch1-4:
> >> 1. open dma-heap, get heap fd
> >> 2. open file, get file_fd(buffer/direct)
> >> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
> >> set file_fd
> >> instead of len. get dma-buf fd(contains file content)
> >> 4. share, attach, whatever you want
> >>
> >> You can create a file to test it. Compare the performance gap between
> >> the two.
> >> It is best to compare the differences in file size from KB to MB to GB.
> >>
> >> The following test data will compare the performance differences
> >> between 512KB,
> >> 8MB, 1GB, and 3GB under various scenarios.
> >>
> >> Performance Test
> >> ===
> >> 12G RAM phone
> >> UFS4.0(the maximum speed is 4GB/s. ),
> >> f2fs
> >> kernel 6.1 with patch[7] (or else, can't support kvec direct I/O
> >> read.)
> >> no memory pressure.
> >> drop_cache is used for each test.
> >>
> >> The average of 5 test results:
> >> | scheme-size | 512KB(ns) | 8MB(ns) | 1GB(ns) |
> >> 3GB(ns) |
> >> | ------------------- | ---------- | ---------- | ------------- |
> >> ------------- |
> >> | normal | 2,790,861 | 14,535,784 | 1,520,790,492 |
> >> 3,332,438,754 |
> >> | udmabuf buffer I/O | 1,704,046 | 11,313,476 | 821,348,000 |
> >> 2,108,419,923 |
> >> | sendfile buffer I/O | 3,261,261 | 12,112,292 | 1,565,939,938 |
> >> 3,062,052,984 |
> >> | patch1-4 buffer I/O | 2,064,538 | 10,771,474 | 986,338,800 |
> >> 2,187,570,861 |
> >> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 |
> >> 9,777,661,077 |
> >> | patch1 direct I/O | 813,215 | 6,962,092 | 2,364,211,877 |
> >> 5,648,897,554 |
> >> | udmabuf direct I/O | 1,289,554 | 8,968,138 | 921,480,784 |
> >> 2,158,305,738 |
> >> | patch1-4 direct I/O | 1,957,661 | 6,581,999 | 520,003,538 |
> >> 1,400,006,107 |
>
> With this test, sendfile can't give a good help base on pipe buffer.
>
> udmabuf is good, but I think our oem driver can't suit it. (And, AOSP do
> not open this feature)
Hi Huan,
We should be able to turn on udmabuf for the Android kernels. We don't
have CONFIG_UDMABUF because nobody has wanted it so far. It's
encouraging to see your latest results!
-T.J.
>
> Anyway, I am sending this patchset in the hope of further discussion.
>
> Thanks.
>
> >>
> >> So, based on the test results:
> >>
> >> When the file is large, the patchset has the highest performance.
> >> Compared to normal, patchset is a 50% improvement;
> >> Compared to normal, patch1 only showed a degradation of 41%.
> >> patch1 typical performance breakdown is as follows:
> >> 1. alloc cost 188,802,693 ns
> >> 2. vmap cost 42,491,385 ns
> >> 3. file read cost 4,180,876,702 ns
> >> Therefore, directly performing a single direct I/O read on a large file
> >> may not be the most optimal way for performance.
> >>
> >> The performance of direct I/O implemented by the sendfile method is
> >> the worst.
> >>
> >> When file size is small, The difference in performance is not
> >> significant. This is consistent with expectations.
> >>
> >>
> >>
> >> Suggested use cases
> >> ===
> >> 1. When there is a need to read large files and system resources
> >> are scarce,
> >> especially when the size of memory is limited.(GB level) In this
> >> scenario, using direct I/O for file reading can even bring
> >> performance
> >> improvements.(may need patch2-3)
> >> 2. For embedded devices with limited RAM, using direct I/O can
> >> save system
> >> resources and avoid unnecessary data copying. Therefore, even
> >> if the
> >> performance is lower when read small file, it can still be used
> >> effectively.
> >> 3. If there is sufficient memory, pinning the page cache of the
> >> model files
> >> in memory and placing file in the EROFS file system for
> >> read-only access
> >> maybe better.(EROFS do not support direct I/O)
> >>
> >>
> >> Changlog
> >> ===
> >> v1 [8]
> >> v1->v2:
> >> Uses the heap flag method for alloc and read instead of adding a new
> >> DMA-buf ioctl command. [9]
> >> Split the patchset to facilitate review and test.
> >> patch 1 implement alloc and read, offer heap flag into it.
> >> patch 2-4 offer async read
> >> patch 5 can change gather limit.
> >>
> >> Reference
> >> ===
> >> [1]
> >> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
> >> [2]
> >> https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
> >> [3]
> >> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
> >> [4]
> >> https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
> >> [5]
> >> https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
> >> [6]
> >> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
> >> [7]
> >> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230…
> >> [8]
> >> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
> >> [9]
> >> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
> >>
> >> Huan Yang (5):
> >> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
> >> dma-buf: heaps: Introduce async alloc read ops
> >> dma-buf: heaps: support alloc async read file
> >> dma-buf: heaps: system_heap alloc support async read
> >> dma-buf: heaps: configurable async read gather limit
> >>
> >> drivers/dma-buf/dma-heap.c | 552 +++++++++++++++++++++++++++-
> >> drivers/dma-buf/heaps/system_heap.c | 70 +++-
> >> include/linux/dma-heap.h | 53 ++-
> >> include/uapi/linux/dma-heap.h | 11 +-
> >> 4 files changed, 673 insertions(+), 13 deletions(-)
> >>
> >>
> >> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
> >
Am 30.07.24 um 13:36 schrieb Huan Yang:
>>>> Either drop the whole approach or change udmabuf to do what you
>>>> want to do.
>>> OK, if so, do I need to send a patch to make dma-buf support sendfile?
>>
>> Well the udmabuf approach doesn't need to use sendfile, so no.
>
> Get it, I'll not send again.
>
> About udmabuf, I test find it can't support larget find read due to
> page array alloc.
>
> I already upload this patch, but do not recive answer.
>
> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>
> Is there anything wrong with my understanding of it?
No, that patch was totally fine. Not getting a response is usually
something good.
In other words when maintainer see something which won't work at all
they immediately react, but when nobody complains it usually means you
are on the right track.
As long as nobody has any good arguments against it I'm happy to take
that one upstream through drm-misc-next immediately since it's clearly a
stand a lone improvement on it's own.
Regards,
Christian.
>
>>
>> Regards,
>> Christian.
>>
>>>
>>>>
>>>> Apart from that I don't see a doable way which can be accepted into
>>>> the kernel.
>>> Thanks for your suggestion.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Patch 1 implement it.
>>>>>>>
>>>>>>> Patch 2-5 provides an approach for performance improvement.
>>>>>>>
>>>>>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>>>>>> synchronously read files using direct I/O.
>>>>>>>
>>>>>>> This approach helps to save CPU copying and avoid a certain
>>>>>>> degree of
>>>>>>> memory thrashing (page cache generation and reclamation)
>>>>>>>
>>>>>>> When dealing with large file sizes, the benefits of this
>>>>>>> approach become
>>>>>>> particularly significant.
>>>>>>>
>>>>>>> However, there are currently some methods that can improve
>>>>>>> performance,
>>>>>>> not just save system resources:
>>>>>>>
>>>>>>> Due to the large file size, for example, a AI 7B model of around
>>>>>>> 3.4GB, the
>>>>>>> time taken to allocate DMA-BUF memory will be relatively long.
>>>>>>> Waiting
>>>>>>> for the allocation to complete before reading the file will add
>>>>>>> to the
>>>>>>> overall time consumption. Therefore, the total time for DMA-BUF
>>>>>>> allocation and file read can be calculated using the formula
>>>>>>> T(total) = T(alloc) + T(I/O)
>>>>>>>
>>>>>>> However, if we change our approach, we don't necessarily need to
>>>>>>> wait
>>>>>>> for the DMA-BUF allocation to complete before initiating I/O. In
>>>>>>> fact,
>>>>>>> during the allocation process, we already hold a portion of the
>>>>>>> page,
>>>>>>> which means that waiting for subsequent page allocations to
>>>>>>> complete
>>>>>>> before carrying out file reads is actually unfair to the pages
>>>>>>> that have
>>>>>>> already been allocated.
>>>>>>>
>>>>>>> The allocation of pages is sequential, and the reading of the
>>>>>>> file is
>>>>>>> also sequential, with the content and size corresponding to the
>>>>>>> file.
>>>>>>> This means that the memory location for each page, which holds the
>>>>>>> content of a specific position in the file, can be determined at
>>>>>>> the
>>>>>>> time of allocation.
>>>>>>>
>>>>>>> However, to fully leverage I/O performance, it is best to wait and
>>>>>>> gather a certain number of pages before initiating batch
>>>>>>> processing.
>>>>>>>
>>>>>>> The default gather size is 128MB. So, ever gathered can see as a
>>>>>>> file read
>>>>>>> work, it maps the gather page to the vmalloc area to obtain a
>>>>>>> continuous
>>>>>>> virtual address, which is used as a buffer to store the contents
>>>>>>> of the
>>>>>>> corresponding file. So, if using direct I/O to read a file, the
>>>>>>> file
>>>>>>> content will be written directly to the corresponding dma-buf
>>>>>>> buffer memory
>>>>>>> without any additional copying.(compare to pipe buffer.)
>>>>>>>
>>>>>>> Consider other ways to read into dma-buf. If we assume reading
>>>>>>> after mmap
>>>>>>> dma-buf, we need to map the pages of the dma-buf to the user
>>>>>>> virtual
>>>>>>> address space. Also, udmabuf memfd need do this operations too.
>>>>>>> Even if we support sendfile, the file copy also need buffer, you
>>>>>>> must
>>>>>>> setup it.
>>>>>>> So, mapping pages to the vmalloc area does not incur any additional
>>>>>>> performance overhead compared to other methods.[6]
>>>>>>>
>>>>>>> Certainly, the administrator can also modify the gather size
>>>>>>> through patch5.
>>>>>>>
>>>>>>> The formula for the time taken for system_heap buffer allocation
>>>>>>> and
>>>>>>> file reading through async_read is as follows:
>>>>>>>
>>>>>>> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>>>>>
>>>>>>> Compared to the synchronous read:
>>>>>>> T(total) = T(alloc) + T(I/O)
>>>>>>>
>>>>>>> If the allocation time or I/O time is long, the time difference
>>>>>>> will be
>>>>>>> covered by the maximum value between the allocation and I/O. The
>>>>>>> other
>>>>>>> party will be concealed.
>>>>>>>
>>>>>>> Therefore, the larger the size of the file that needs to be
>>>>>>> read, the
>>>>>>> greater the corresponding benefits will be.
>>>>>>>
>>>>>>> How to use
>>>>>>> ===
>>>>>>> Consider the current pathway for loading model files into DMA-BUF:
>>>>>>> 1. open dma-heap, get heap fd
>>>>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>>> 4. mmap dma-buf fd, get vaddr
>>>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>>>> 6. share, attach, whatever you want
>>>>>>>
>>>>>>> Use DMA_HEAP_ALLOC_AND_READ_FILE JUST a little change:
>>>>>>> 1. open dma-heap, get heap fd
>>>>>>> 2. open file, get file_fd(buffer/direct)
>>>>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>>> flag, set file_fd
>>>>>>> instead of len. get dma-buf fd(contains file content)
>>>>>>> 4. share, attach, whatever you want
>>>>>>>
>>>>>>> So, test it is easy.
>>>>>>>
>>>>>>> How to test
>>>>>>> ===
>>>>>>> The performance comparison will be conducted for the following
>>>>>>> scenarios:
>>>>>>> 1. normal
>>>>>>> 2. udmabuf with [3] patch
>>>>>>> 3. sendfile
>>>>>>> 4. only patch 1
>>>>>>> 5. patch1 - patch4.
>>>>>>>
>>>>>>> normal:
>>>>>>> 1. open dma-heap, get heap fd
>>>>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>>> 4. mmap dma-buf fd, get vaddr
>>>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>>>> 6. share, attach, whatever you want
>>>>>>>
>>>>>>> UDMA-BUF step:
>>>>>>> 1. memfd_create
>>>>>>> 2. open file(buffer/direct)
>>>>>>> 3. udmabuf create
>>>>>>> 4. mmap memfd
>>>>>>> 5. read file into memfd vaddr
>>>>>>>
>>>>>>> Sendfile step(need suit splice_write/write_iter, just use to
>>>>>>> compare):
>>>>>>> 1. open dma-heap, get heap fd
>>>>>>> 2. open file, get file_fd(buffer/direct)
>>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>>> 4. sendfile file_fd to dma-buf fd
>>>>>>> 6. share, attach, whatever you want
>>>>>>>
>>>>>>> patch1/patch1-4:
>>>>>>> 1. open dma-heap, get heap fd
>>>>>>> 2. open file, get file_fd(buffer/direct)
>>>>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>>> flag, set file_fd
>>>>>>> instead of len. get dma-buf fd(contains file content)
>>>>>>> 4. share, attach, whatever you want
>>>>>>>
>>>>>>> You can create a file to test it. Compare the performance gap
>>>>>>> between the two.
>>>>>>> It is best to compare the differences in file size from KB to MB
>>>>>>> to GB.
>>>>>>>
>>>>>>> The following test data will compare the performance differences
>>>>>>> between 512KB,
>>>>>>> 8MB, 1GB, and 3GB under various scenarios.
>>>>>>>
>>>>>>> Performance Test
>>>>>>> ===
>>>>>>> 12G RAM phone
>>>>>>> UFS4.0(the maximum speed is 4GB/s. ),
>>>>>>> f2fs
>>>>>>> kernel 6.1 with patch[7] (or else, can't support kvec direct
>>>>>>> I/O read.)
>>>>>>> no memory pressure.
>>>>>>> drop_cache is used for each test.
>>>>>>>
>>>>>>> The average of 5 test results:
>>>>>>> | scheme-size | 512KB(ns) | 8MB(ns) | 1GB(ns) |
>>>>>>> 3GB(ns) |
>>>>>>> | ------------------- | ---------- | ---------- | -------------
>>>>>>> | ------------- |
>>>>>>> | normal | 2,790,861 | 14,535,784 | 1,520,790,492
>>>>>>> | 3,332,438,754 |
>>>>>>> | udmabuf buffer I/O | 1,704,046 | 11,313,476 | 821,348,000 |
>>>>>>> 2,108,419,923 |
>>>>>>> | sendfile buffer I/O | 3,261,261 | 12,112,292 | 1,565,939,938
>>>>>>> | 3,062,052,984 |
>>>>>>> | patch1-4 buffer I/O | 2,064,538 | 10,771,474 | 986,338,800 |
>>>>>>> 2,187,570,861 |
>>>>>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184
>>>>>>> | 9,777,661,077 |
>>>>>>> | patch1 direct I/O | 813,215 | 6,962,092 | 2,364,211,877
>>>>>>> | 5,648,897,554 |
>>>>>>> | udmabuf direct I/O | 1,289,554 | 8,968,138 | 921,480,784 |
>>>>>>> 2,158,305,738 |
>>>>>>> | patch1-4 direct I/O | 1,957,661 | 6,581,999 | 520,003,538 |
>>>>>>> 1,400,006,107 |
>>>>>
>>>>> With this test, sendfile can't give a good help base on pipe buffer.
>>>>>
>>>>> udmabuf is good, but I think our oem driver can't suit it. (And,
>>>>> AOSP do not open this feature)
>>>>>
>>>>>
>>>>> Anyway, I am sending this patchset in the hope of further discussion.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>>>
>>>>>>> So, based on the test results:
>>>>>>>
>>>>>>> When the file is large, the patchset has the highest performance.
>>>>>>> Compared to normal, patchset is a 50% improvement;
>>>>>>> Compared to normal, patch1 only showed a degradation of 41%.
>>>>>>> patch1 typical performance breakdown is as follows:
>>>>>>> 1. alloc cost 188,802,693 ns
>>>>>>> 2. vmap cost 42,491,385 ns
>>>>>>> 3. file read cost 4,180,876,702 ns
>>>>>>> Therefore, directly performing a single direct I/O read on a
>>>>>>> large file
>>>>>>> may not be the most optimal way for performance.
>>>>>>>
>>>>>>> The performance of direct I/O implemented by the sendfile method
>>>>>>> is the worst.
>>>>>>>
>>>>>>> When file size is small, The difference in performance is not
>>>>>>> significant. This is consistent with expectations.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Suggested use cases
>>>>>>> ===
>>>>>>> 1. When there is a need to read large files and system
>>>>>>> resources are scarce,
>>>>>>> especially when the size of memory is limited.(GB level)
>>>>>>> In this
>>>>>>> scenario, using direct I/O for file reading can even bring
>>>>>>> performance
>>>>>>> improvements.(may need patch2-3)
>>>>>>> 2. For embedded devices with limited RAM, using direct I/O
>>>>>>> can save system
>>>>>>> resources and avoid unnecessary data copying. Therefore,
>>>>>>> even if the
>>>>>>> performance is lower when read small file, it can still be
>>>>>>> used
>>>>>>> effectively.
>>>>>>> 3. If there is sufficient memory, pinning the page cache of
>>>>>>> the model files
>>>>>>> in memory and placing file in the EROFS file system for
>>>>>>> read-only access
>>>>>>> maybe better.(EROFS do not support direct I/O)
>>>>>>>
>>>>>>>
>>>>>>> Changlog
>>>>>>> ===
>>>>>>> v1 [8]
>>>>>>> v1->v2:
>>>>>>> Uses the heap flag method for alloc and read instead of
>>>>>>> adding a new
>>>>>>> DMA-buf ioctl command. [9]
>>>>>>> Split the patchset to facilitate review and test.
>>>>>>> patch 1 implement alloc and read, offer heap flag into it.
>>>>>>> patch 2-4 offer async read
>>>>>>> patch 5 can change gather limit.
>>>>>>>
>>>>>>> Reference
>>>>>>> ===
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>>>>>> [2]
>>>>>>> https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>>>>>> [3]
>>>>>>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>>>>>> [4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>>>>>> [5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>>>>>> [6]
>>>>>>> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>>>>>> [7]
>>>>>>> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230…
>>>>>>> [8]
>>>>>>> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>>>>>> [9]
>>>>>>> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>>>>>
>>>>>>> Huan Yang (5):
>>>>>>> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>>>> dma-buf: heaps: Introduce async alloc read ops
>>>>>>> dma-buf: heaps: support alloc async read file
>>>>>>> dma-buf: heaps: system_heap alloc support async read
>>>>>>> dma-buf: heaps: configurable async read gather limit
>>>>>>>
>>>>>>> drivers/dma-buf/dma-heap.c | 552
>>>>>>> +++++++++++++++++++++++++++-
>>>>>>> drivers/dma-buf/heaps/system_heap.c | 70 +++-
>>>>>>> include/linux/dma-heap.h | 53 ++-
>>>>>>> include/uapi/linux/dma-heap.h | 11 +-
>>>>>>> 4 files changed, 673 insertions(+), 13 deletions(-)
>>>>>>>
>>>>>>>
>>>>>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
>>>>>>
>>>>
>>
Am 30.07.24 um 10:46 schrieb Huan Yang:
>
> 在 2024/7/30 16:37, Christian König 写道:
>> Am 30.07.24 um 10:14 schrieb Huan Yang:
>>> 在 2024/7/30 16:03, Christian König 写道:
>>>> Am 30.07.24 um 09:57 schrieb Huan Yang:
>>>>> Background
>>>>> ====
>>>>> Some user may need load file into dma-buf, current way is:
>>>>> 1. allocate a dma-buf, get dma-buf fd
>>>>> 2. mmap dma-buf fd into user vaddr
>>>>> 3. read(file_fd, vaddr, fsz)
>>>>> Due to dma-buf user map can't support direct I/O[1], the file read
>>>>> must be buffer I/O.
>>>>>
>>>>> This means that during the process of reading the file into dma-buf,
>>>>> page cache needs to be generated, and the corresponding content
>>>>> needs to
>>>>> be first copied to the page cache before being copied to the dma-buf.
>>>>>
>>>>> This way worked well when reading relatively small files before, as
>>>>> the page cache can cache the file content, thus improving
>>>>> performance.
>>>>>
>>>>> However, there are new challenges currently, especially as AI
>>>>> models are
>>>>> becoming larger and need to be shared between DMA devices and the CPU
>>>>> via dma-buf.
>>>>>
>>>>> For example, our 7B model file size is around 3.4GB. Using the
>>>>> previous would mean generating a total of 3.4GB of page cache
>>>>> (even if it will be reclaimed), and also requiring the copying of
>>>>> 3.4GB
>>>>> of content between page cache and dma-buf.
>>>>>
>>>>> Due to the limited resources of system memory, files in the
>>>>> gigabyte range
>>>>> cannot persist in memory indefinitely, so this portion of page
>>>>> cache may
>>>>> not provide much assistance for subsequent reads. Additionally, the
>>>>> existence of page cache will consume additional system resources
>>>>> due to
>>>>> the extra copying required by the CPU.
>>>>>
>>>>> Therefore, I think it is necessary for dma-buf to support direct I/O.
>>>>>
>>>>> However, direct I/O file reads cannot be performed using the buffer
>>>>> mmaped by the user space for the dma-buf.[1]
>>>>>
>>>>> Here are some discussions on implementing direct I/O using dma-buf:
>>>>>
>>>>> mmap[1]
>>>>> ---
>>>>> dma-buf never support user map vaddr use of direct I/O.
>>>>>
>>>>> udmabuf[2]
>>>>> ---
>>>>> Currently, udmabuf can use the memfd method to read files into
>>>>> dma-buf in direct I/O mode.
>>>>>
>>>>> However, if the size is large, the current udmabuf needs to adjust
>>>>> the
>>>>> corresponding size_limit(default 64MB).
>>>>> But using udmabuf for files at the 3GB level is not a very good
>>>>> approach.
>>>>> It needs to make some adjustments internally to handle this.[3] Or
>>>>> else,
>>>>> fail create.
>>>>>
>>>>> But, it is indeed a viable way to enable dma-buf to support direct
>>>>> I/O.
>>>>> However, it is necessary to initiate the file read after the
>>>>> memory allocation
>>>>> is completed, and handle race conditions carefully.
>>>>>
>>>>> sendfile/splice[4]
>>>>> ---
>>>>> Another way to enable dma-buf to support direct I/O is by
>>>>> implementing
>>>>> splice_write/write_iter in the dma-buf file operations (fops) to
>>>>> adapt
>>>>> to the sendfile method.
>>>>> However, the current sendfile/splice calls are based on pipe. When
>>>>> using
>>>>> direct I/O to read a file, the content needs to be copied to the
>>>>> buffer
>>>>> allocated by the pipe (default 64KB), and then the dma-buf fops'
>>>>> splice_write needs to be called to write the content into the
>>>>> dma-buf.
>>>>> This approach requires serially reading the content of file pipe size
>>>>> into the pipe buffer and then waiting for the dma-buf to be written
>>>>> before reading the next one.(The I/O performance is relatively weak
>>>>> under direct I/O.)
>>>>> Moreover, due to the existence of the pipe buffer, even when using
>>>>> direct I/O and not needing to generate additional page cache,
>>>>> there still needs to be a CPU copy.
>>>>>
>>>>> copy_file_range[5]
>>>>> ---
>>>>> Consider of copy_file_range, It only supports copying files within
>>>>> the
>>>>> same file system. Similarly, it is not very practical.
>>>>>
>>>>>
>>>>> So, currently, there is no particularly suitable solution on VFS to
>>>>> allow dma-buf to support direct I/O for large file reads.
>>>>>
>>>>> This patchset provides an idea to complete file reads when
>>>>> requesting a
>>>>> dma-buf.
>>>>>
>>>>> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>> ===
>>>>> This patch provides a method to immediately read the file content
>>>>> after
>>>>> the dma-buf is allocated, and only returns the dma-buf file
>>>>> descriptor
>>>>> after the file is fully read.
>>>>>
>>>>> Since the dma-buf file descriptor is not returned, no other thread
>>>>> can
>>>>> access it except for the current thread, so we don't need to worry
>>>>> about
>>>>> race conditions.
>>>>
>>>> That is a completely false assumption.
>>> Can you provide a detailed explanation as to why this assumption is
>>> incorrect? thanks.
>>
>> File descriptors can be guessed and is available to userspace as soon
>> as dma_buf_fd() is called.
>>
>> What could potentially work is to call system_heap_allocate() without
>> calling dma_buf_fd(), but I'm not sure if you can then make I/O to
>> the underlying pages.
>
> Actually, the dma-buf file descriptor is obtained only after a
> successful file read in the code, so there is no issue.
>
> If you are interested, you can take a look at the
> dma_heap_buffer_alloc_and_read function in patch1.
>
>>
>>>>
>>>>>
>>>>> Map the dma-buf to the vmalloc area and initiate file reads in kernel
>>>>> space, supporting both buffer I/O and direct I/O.
>>>>>
>>>>> This patch adds the DMA_HEAP_ALLOC_AND_READ heap_flag for user.
>>>>> When a user needs to allocate a dma-buf and read a file, they should
>>>>> pass this heap flag. As the size of the file being read is fixed,
>>>>> there is no
>>>>> need to pass the 'len' parameter. Instead, The file_fd needs to be
>>>>> passed to
>>>>> indicate to the kernel the file that needs to be read.
>>>>>
>>>>> The file open flag determines the mode of file reading.
>>>>> But, please note that if direct I/O(O_DIRECT) is needed to read
>>>>> the file,
>>>>> the file size must be page aligned. (with patch 2-5, no need)
>>>>>
>>>>> Therefore, for the user, len and file_fd are mutually exclusive,
>>>>> and they are combined using a union.
>>>>>
>>>>> Once the user obtains the dma-buf fd, the dma-buf directly
>>>>> contains the
>>>>> file content.
>>>>
>>>> And I'm repeating myself, but this is a complete NAK from my side
>>>> to this approach.
>>>>
>>>> We pointed out multiple ways of how to implement this cleanly and
>>>> not by hacking functionality into the kernel which absolutely
>>>> doesn't belong there.
>>> In this patchset, I have provided performance comparisons of each of
>>> these methods. Can you please provide more opinions?
>>
>> Either drop the whole approach or change udmabuf to do what you want
>> to do.
> OK, if so, do I need to send a patch to make dma-buf support sendfile?
Well the udmabuf approach doesn't need to use sendfile, so no.
Regards,
Christian.
>
>>
>> Apart from that I don't see a doable way which can be accepted into
>> the kernel.
> Thanks for your suggestion.
>>
>> Regards,
>> Christian.
>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Patch 1 implement it.
>>>>>
>>>>> Patch 2-5 provides an approach for performance improvement.
>>>>>
>>>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>>>> synchronously read files using direct I/O.
>>>>>
>>>>> This approach helps to save CPU copying and avoid a certain degree of
>>>>> memory thrashing (page cache generation and reclamation)
>>>>>
>>>>> When dealing with large file sizes, the benefits of this approach
>>>>> become
>>>>> particularly significant.
>>>>>
>>>>> However, there are currently some methods that can improve
>>>>> performance,
>>>>> not just save system resources:
>>>>>
>>>>> Due to the large file size, for example, a AI 7B model of around
>>>>> 3.4GB, the
>>>>> time taken to allocate DMA-BUF memory will be relatively long.
>>>>> Waiting
>>>>> for the allocation to complete before reading the file will add to
>>>>> the
>>>>> overall time consumption. Therefore, the total time for DMA-BUF
>>>>> allocation and file read can be calculated using the formula
>>>>> T(total) = T(alloc) + T(I/O)
>>>>>
>>>>> However, if we change our approach, we don't necessarily need to wait
>>>>> for the DMA-BUF allocation to complete before initiating I/O. In
>>>>> fact,
>>>>> during the allocation process, we already hold a portion of the page,
>>>>> which means that waiting for subsequent page allocations to complete
>>>>> before carrying out file reads is actually unfair to the pages
>>>>> that have
>>>>> already been allocated.
>>>>>
>>>>> The allocation of pages is sequential, and the reading of the file is
>>>>> also sequential, with the content and size corresponding to the file.
>>>>> This means that the memory location for each page, which holds the
>>>>> content of a specific position in the file, can be determined at the
>>>>> time of allocation.
>>>>>
>>>>> However, to fully leverage I/O performance, it is best to wait and
>>>>> gather a certain number of pages before initiating batch processing.
>>>>>
>>>>> The default gather size is 128MB. So, ever gathered can see as a
>>>>> file read
>>>>> work, it maps the gather page to the vmalloc area to obtain a
>>>>> continuous
>>>>> virtual address, which is used as a buffer to store the contents
>>>>> of the
>>>>> corresponding file. So, if using direct I/O to read a file, the file
>>>>> content will be written directly to the corresponding dma-buf
>>>>> buffer memory
>>>>> without any additional copying.(compare to pipe buffer.)
>>>>>
>>>>> Consider other ways to read into dma-buf. If we assume reading
>>>>> after mmap
>>>>> dma-buf, we need to map the pages of the dma-buf to the user virtual
>>>>> address space. Also, udmabuf memfd need do this operations too.
>>>>> Even if we support sendfile, the file copy also need buffer, you must
>>>>> setup it.
>>>>> So, mapping pages to the vmalloc area does not incur any additional
>>>>> performance overhead compared to other methods.[6]
>>>>>
>>>>> Certainly, the administrator can also modify the gather size
>>>>> through patch5.
>>>>>
>>>>> The formula for the time taken for system_heap buffer allocation and
>>>>> file reading through async_read is as follows:
>>>>>
>>>>> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>>>
>>>>> Compared to the synchronous read:
>>>>> T(total) = T(alloc) + T(I/O)
>>>>>
>>>>> If the allocation time or I/O time is long, the time difference
>>>>> will be
>>>>> covered by the maximum value between the allocation and I/O. The
>>>>> other
>>>>> party will be concealed.
>>>>>
>>>>> Therefore, the larger the size of the file that needs to be read, the
>>>>> greater the corresponding benefits will be.
>>>>>
>>>>> How to use
>>>>> ===
>>>>> Consider the current pathway for loading model files into DMA-BUF:
>>>>> 1. open dma-heap, get heap fd
>>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>> 4. mmap dma-buf fd, get vaddr
>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>> 6. share, attach, whatever you want
>>>>>
>>>>> Use DMA_HEAP_ALLOC_AND_READ_FILE JUST a little change:
>>>>> 1. open dma-heap, get heap fd
>>>>> 2. open file, get file_fd(buffer/direct)
>>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>> flag, set file_fd
>>>>> instead of len. get dma-buf fd(contains file content)
>>>>> 4. share, attach, whatever you want
>>>>>
>>>>> So, test it is easy.
>>>>>
>>>>> How to test
>>>>> ===
>>>>> The performance comparison will be conducted for the following
>>>>> scenarios:
>>>>> 1. normal
>>>>> 2. udmabuf with [3] patch
>>>>> 3. sendfile
>>>>> 4. only patch 1
>>>>> 5. patch1 - patch4.
>>>>>
>>>>> normal:
>>>>> 1. open dma-heap, get heap fd
>>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>> 4. mmap dma-buf fd, get vaddr
>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>> 6. share, attach, whatever you want
>>>>>
>>>>> UDMA-BUF step:
>>>>> 1. memfd_create
>>>>> 2. open file(buffer/direct)
>>>>> 3. udmabuf create
>>>>> 4. mmap memfd
>>>>> 5. read file into memfd vaddr
>>>>>
>>>>> Sendfile step(need suit splice_write/write_iter, just use to
>>>>> compare):
>>>>> 1. open dma-heap, get heap fd
>>>>> 2. open file, get file_fd(buffer/direct)
>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>> 4. sendfile file_fd to dma-buf fd
>>>>> 6. share, attach, whatever you want
>>>>>
>>>>> patch1/patch1-4:
>>>>> 1. open dma-heap, get heap fd
>>>>> 2. open file, get file_fd(buffer/direct)
>>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>> flag, set file_fd
>>>>> instead of len. get dma-buf fd(contains file content)
>>>>> 4. share, attach, whatever you want
>>>>>
>>>>> You can create a file to test it. Compare the performance gap
>>>>> between the two.
>>>>> It is best to compare the differences in file size from KB to MB
>>>>> to GB.
>>>>>
>>>>> The following test data will compare the performance differences
>>>>> between 512KB,
>>>>> 8MB, 1GB, and 3GB under various scenarios.
>>>>>
>>>>> Performance Test
>>>>> ===
>>>>> 12G RAM phone
>>>>> UFS4.0(the maximum speed is 4GB/s. ),
>>>>> f2fs
>>>>> kernel 6.1 with patch[7] (or else, can't support kvec direct
>>>>> I/O read.)
>>>>> no memory pressure.
>>>>> drop_cache is used for each test.
>>>>>
>>>>> The average of 5 test results:
>>>>> | scheme-size | 512KB(ns) | 8MB(ns) | 1GB(ns) |
>>>>> 3GB(ns) |
>>>>> | ------------------- | ---------- | ---------- | ------------- |
>>>>> ------------- |
>>>>> | normal | 2,790,861 | 14,535,784 | 1,520,790,492 |
>>>>> 3,332,438,754 |
>>>>> | udmabuf buffer I/O | 1,704,046 | 11,313,476 | 821,348,000 |
>>>>> 2,108,419,923 |
>>>>> | sendfile buffer I/O | 3,261,261 | 12,112,292 | 1,565,939,938 |
>>>>> 3,062,052,984 |
>>>>> | patch1-4 buffer I/O | 2,064,538 | 10,771,474 | 986,338,800 |
>>>>> 2,187,570,861 |
>>>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 |
>>>>> 9,777,661,077 |
>>>>> | patch1 direct I/O | 813,215 | 6,962,092 | 2,364,211,877 |
>>>>> 5,648,897,554 |
>>>>> | udmabuf direct I/O | 1,289,554 | 8,968,138 | 921,480,784 |
>>>>> 2,158,305,738 |
>>>>> | patch1-4 direct I/O | 1,957,661 | 6,581,999 | 520,003,538 |
>>>>> 1,400,006,107 |
>>>
>>> With this test, sendfile can't give a good help base on pipe buffer.
>>>
>>> udmabuf is good, but I think our oem driver can't suit it. (And,
>>> AOSP do not open this feature)
>>>
>>>
>>> Anyway, I am sending this patchset in the hope of further discussion.
>>>
>>> Thanks.
>>>
>>>>>
>>>>> So, based on the test results:
>>>>>
>>>>> When the file is large, the patchset has the highest performance.
>>>>> Compared to normal, patchset is a 50% improvement;
>>>>> Compared to normal, patch1 only showed a degradation of 41%.
>>>>> patch1 typical performance breakdown is as follows:
>>>>> 1. alloc cost 188,802,693 ns
>>>>> 2. vmap cost 42,491,385 ns
>>>>> 3. file read cost 4,180,876,702 ns
>>>>> Therefore, directly performing a single direct I/O read on a large
>>>>> file
>>>>> may not be the most optimal way for performance.
>>>>>
>>>>> The performance of direct I/O implemented by the sendfile method
>>>>> is the worst.
>>>>>
>>>>> When file size is small, The difference in performance is not
>>>>> significant. This is consistent with expectations.
>>>>>
>>>>>
>>>>>
>>>>> Suggested use cases
>>>>> ===
>>>>> 1. When there is a need to read large files and system
>>>>> resources are scarce,
>>>>> especially when the size of memory is limited.(GB level) In
>>>>> this
>>>>> scenario, using direct I/O for file reading can even bring
>>>>> performance
>>>>> improvements.(may need patch2-3)
>>>>> 2. For embedded devices with limited RAM, using direct I/O can
>>>>> save system
>>>>> resources and avoid unnecessary data copying. Therefore,
>>>>> even if the
>>>>> performance is lower when read small file, it can still be used
>>>>> effectively.
>>>>> 3. If there is sufficient memory, pinning the page cache of the
>>>>> model files
>>>>> in memory and placing file in the EROFS file system for
>>>>> read-only access
>>>>> maybe better.(EROFS do not support direct I/O)
>>>>>
>>>>>
>>>>> Changlog
>>>>> ===
>>>>> v1 [8]
>>>>> v1->v2:
>>>>> Uses the heap flag method for alloc and read instead of adding
>>>>> a new
>>>>> DMA-buf ioctl command. [9]
>>>>> Split the patchset to facilitate review and test.
>>>>> patch 1 implement alloc and read, offer heap flag into it.
>>>>> patch 2-4 offer async read
>>>>> patch 5 can change gather limit.
>>>>>
>>>>> Reference
>>>>> ===
>>>>> [1]
>>>>> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>>>> [2] https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>>>> [3]
>>>>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>>>> [4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>>>> [5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>>>> [6]
>>>>> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>>>> [7]
>>>>> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230…
>>>>> [8]
>>>>> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>>>> [9]
>>>>> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>>>
>>>>> Huan Yang (5):
>>>>> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>> dma-buf: heaps: Introduce async alloc read ops
>>>>> dma-buf: heaps: support alloc async read file
>>>>> dma-buf: heaps: system_heap alloc support async read
>>>>> dma-buf: heaps: configurable async read gather limit
>>>>>
>>>>> drivers/dma-buf/dma-heap.c | 552
>>>>> +++++++++++++++++++++++++++-
>>>>> drivers/dma-buf/heaps/system_heap.c | 70 +++-
>>>>> include/linux/dma-heap.h | 53 ++-
>>>>> include/uapi/linux/dma-heap.h | 11 +-
>>>>> 4 files changed, 673 insertions(+), 13 deletions(-)
>>>>>
>>>>>
>>>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
>>>>
>>