On Thu, Aug 01, 2024 at 10:53:45AM +0800, Huan Yang wrote:
>
> > On 2024/8/1 4:46, Daniel Vetter wrote:
> > On Tue, Jul 30, 2024 at 08:04:04PM +0800, Huan Yang wrote:
> > > > On 2024/7/30 17:05, Huan Yang wrote:
> > > > > On 2024/7/30 16:56, Daniel Vetter wrote:
> > > > >
> > > > > On Tue, Jul 30, 2024 at 03:57:44PM +0800, Huan Yang wrote:
> > > > > > UDMA-BUF step:
> > > > > > 1. memfd_create
> > > > > > 2. open file(buffer/direct)
> > > > > > 3. udmabuf create
> > > > > > 4. mmap memfd
> > > > > > 5. read file into memfd vaddr
> > > > > Yeah this is really slow and the worst way to do it. You absolutely want
> > > > > to start _all_ the io before you start creating the dma-buf, ideally
> > > > > with
> > > > > everything running in parallel. But just starting the direct I/O with
> > > > > async and then creating the udmabuf should be a lot faster and avoid
> > > > That's great. Let me rephrase that, and please correct me if I'm wrong.
> > > >
> > > > UDMA-BUF step:
> > > > 1. memfd_create
> > > > 2. mmap memfd
> > > > 3. open file(buffer/direct)
> > > > 4. start thread to async read
> > > > 5. udmabuf create
> > > >
> > > > With this, performance can improve.
> > > I just tested it. The steps are:
> > >
> > > UDMA-BUF step:
> > > 1. memfd_create
> > > 2. mmap memfd
> > > 3. open file(buffer/direct)
> > > 4. start thread to async read
> > > 5. udmabuf create
> > >
> > > 6. join (wait for the read thread to finish)
> > >
> > > Reading a 3G file, all steps cost 1,527,103,431 ns. That's great.
> > Ok that's almost the throughput of your patch set, which I think is close
> > enough. The remaining difference is probably just the mmap overhead, not
> > sure whether/how we can do direct i/o to an fd directly ... in principle
> > it's possible for any file that uses the standard pagecache.
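To make the reordered flow above concrete, here is a minimal userspace sketch (error handling mostly omitted; it assumes the UDMABUF_CREATE uAPI from <linux/udmabuf.h>, the /dev/udmabuf node, and that the file size is suitably aligned for O_DIRECT):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

struct read_job { int fd; void *buf; size_t len; };

/* Reader thread: pull the file contents into the mmap'd memfd region. */
static void *read_worker(void *arg)
{
        struct read_job *job = arg;
        size_t done = 0;
        ssize_t n;

        /* O_DIRECT needs block-aligned offsets/lengths; assumed here. */
        while (done < job->len &&
               (n = pread(job->fd, (char *)job->buf + done,
                          job->len - done, done)) > 0)
                done += n;
        return NULL;
}

/* Returns the dma-buf fd on success (error handling mostly omitted). */
int create_udmabuf_while_reading(const char *path, size_t size)
{
        int memfd, filefd, devfd, dmabuf_fd;
        pthread_t tid;
        void *vaddr;

        /* 1. memfd_create, size it, seal it (udmabuf requires F_SEAL_SHRINK) */
        memfd = memfd_create("payload", MFD_ALLOW_SEALING);
        ftruncate(memfd, size);
        fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);

        /* 2. mmap the memfd so the reader thread has a destination */
        vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);

        /* 3. open the source file for direct I/O */
        filefd = open(path, O_RDONLY | O_DIRECT);

        /* 4. start the async read before touching udmabuf */
        struct read_job job = { .fd = filefd, .buf = vaddr, .len = size };
        pthread_create(&tid, NULL, read_worker, &job);

        /* 5. create the udmabuf while the read is still in flight */
        struct udmabuf_create create = {
                .memfd  = memfd,
                .flags  = UDMABUF_FLAGS_CLOEXEC,
                .offset = 0,
                .size   = size,
        };
        devfd = open("/dev/udmabuf", O_RDWR);
        dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);

        /* 6. join: make sure the data landed before handing the buffer off */
        pthread_join(tid, NULL);
        return dmabuf_fd;
}

The point is only the ordering: the read is in flight while the UDMABUF_CREATE ioctl runs, and the join at the end guarantees the data is present before the dma-buf is handed to a consumer.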
>
> Yes, for mmap, IMO: we already get and pin all the folios, which means all
> the PFNs are known once the udmabuf is created.
>
> So I think a page-fault-based mmap does not help save memory, it only adds
> cost to mmap access (at most it saves a little page table memory).
>
> I want to offer a patchset that removes it and makes the code better suited
> to folio operations (and removes the unpin list). It also contains some fix
> patches.
>
> I'll send it once my testing shows it works well.
>
>
> As for doing direct I/O through an fd, maybe use sendfile or copy_file_range?
>
> sendfile is based on a pipe buffer, and its performance was low when I tested it.
>
> copy_file_range can't work because the files are not on the same filesystem.
>
> So I can't find another way to do it. Can someone give some suggestions?
Yeah direct I/O to pagecache without an mmap might be too niche to be
supported. Maybe io_uring has something, but I guess as unlikely as
anything else.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
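For reference, the async reads above could also be driven with io_uring instead of a dedicated reader thread; a minimal liburing sketch follows. It still reads into the mmap'd memfd region, so it does not avoid the mmap question discussed above; chunk size and queue depth are arbitrary, and short reads are not handled.

#include <liburing.h>
#include <stddef.h>

/* Read 'len' bytes from 'filefd' into the mmap'd memfd region at 'vaddr'. */
static int read_file_with_io_uring(int filefd, void *vaddr, size_t len)
{
        const size_t chunk = 1 << 20;   /* 1 MiB per request, arbitrary */
        const unsigned int depth = 32;  /* queue depth, arbitrary */
        unsigned int inflight = 0;
        struct io_uring_cqe *cqe;
        struct io_uring ring;
        size_t off;

        if (io_uring_queue_init(depth, &ring, 0) < 0)
                return -1;

        for (off = 0; off < len; off += chunk) {
                size_t this = len - off < chunk ? len - off : chunk;
                struct io_uring_sqe *sqe;

                /* Keep at most 'depth' reads in flight; reap one if full. */
                if (inflight == depth) {
                        io_uring_wait_cqe(&ring, &cqe);
                        if (cqe->res < 0)
                                goto fail;
                        io_uring_cqe_seen(&ring, cqe);
                        inflight--;
                }

                sqe = io_uring_get_sqe(&ring);
                io_uring_prep_read(sqe, filefd, (char *)vaddr + off, this, off);
                io_uring_submit(&ring);
                inflight++;
        }

        /* Drain the remaining completions before the buffer is handed off. */
        while (inflight--) {
                io_uring_wait_cqe(&ring, &cqe);
                if (cqe->res < 0)
                        goto fail;
                io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return 0;

fail:
        io_uring_queue_exit(&ring);
        return -1;
}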
On 2024-08-01 20:32, Kasireddy, Vivek wrote:
> Hi Huan,
>
>> This patchset attempts to fix some errors in udmabuf and remove the
>> unpin_list structure.
>>
>> Some of these fixes just gather patches which I uploaded before.
>>
>> Patch1
>> ===
>> Try to remove the page-fault-based mmap and map the memory directly.
>> The current udmabuf has already obtained and pinned the folios by the time
>> creation completes. This means the physical memory has already been
>> acquired, rather than being accessed on demand. The current page fault
>> method only saves some page table memory.
>>
>> As a result, the page fault mechanism has lost its purpose as demand
>> paging. Because a page fault has to trap into kernel mode and fill in the
>> mapping when the corresponding virtual address in the mmap is accessed,
>> user-mode access to those virtual addresses always traps into kernel mode
>> first.
>>
>> Therefore, with a large udmabuf this represents considerable overhead.
> Just want to mention that for the main use case the udmabuf driver is designed
> for (sharing a Qemu guest FB with the host for GPU DMA), udmabufs are not created
> very frequently. And I think providing CPU access via mmap is just a backup,
> mainly intended for debugging purposes.
FYI, Mesa now uses udmabuf for supporting dma-bufs with software rendering.
--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer
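As an illustration of the "map everything up front instead of faulting" idea in patch 1 above, the mmap handler could look roughly like the sketch below. This is not the posted patch: the dma_buf_ops mmap signature and vm_insert_page() are real kernel interfaces, but the offsets[] field and the exact insertion strategy are assumptions here.

/*
 * Illustrative sketch only, not the posted patch: insert every pinned page
 * at mmap() time instead of installing a vm_ops fault handler. Assumes the
 * existing udmabuf fields (pagecount, folios[]) plus a per-page offsets[]
 * array holding each page's byte offset inside its folio.
 */
static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
{
        struct udmabuf *ubuf = buf->priv;
        unsigned long addr = vma->vm_start;
        pgoff_t pgoff = vma->vm_pgoff;
        pgoff_t i;
        int ret;

        if (pgoff + vma_pages(vma) > ubuf->pagecount)
                return -EINVAL;

        for (i = 0; i < vma_pages(vma); i++, addr += PAGE_SIZE) {
                struct folio *folio = ubuf->folios[pgoff + i];
                struct page *page = folio_page(folio,
                                ubuf->offsets[pgoff + i] >> PAGE_SHIFT);

                /* populate the PTE now; no fault path is needed later */
                ret = vm_insert_page(vma, addr, page);
                if (ret)
                        return ret;
        }

        return 0;
}

Whether the up-front cost at mmap() time is worth it depends on how much of the buffer is actually touched through the CPU mapping.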
On Wed, Jul 31, 2024 at 11:34:49AM +0800, Huan Yang wrote:
> The current udmabuf_folio contains a list_head and the corresponding
> folio pointer, with a size of 24 bytes. udmabuf_folio uses kmalloc to
> allocate memory.
>
> However, kmalloc allocates from shared pools, starting from 64 bytes. This
> means that each udmabuf_folio allocation will waste 40 bytes.
>
> Considering that each udmabuf creates one udmabuf_folio per folio, the
> wasted memory can be significant when memory is fragmented (many small
> folios).
>
> Furthermore, if udmabuf is frequently used, the allocation and
> deallocation of udmabuf_folio will also be frequent.
>
> Therefore, this patch adds a kmem_cache dedicated to the allocation and
> deallocation of udmabuf_folio. This is expected to improve allocation and
> deallocation performance, while also avoiding memory waste.
>
> Signed-off-by: Huan Yang <link(a)vivo.com>
> ---
> drivers/dma-buf/udmabuf.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index 047c3cd2ceff..db4de8c745ce 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -24,6 +24,8 @@ static int size_limit_mb = 64;
> module_param(size_limit_mb, int, 0644);
> MODULE_PARM_DESC(size_limit_mb, "Max size of a dmabuf, in megabytes. Default is 64.");
>
> +static struct kmem_cache *udmabuf_folio_cachep;
> +
> struct udmabuf {
> pgoff_t pagecount;
> struct folio **folios;
> @@ -169,7 +171,7 @@ static void unpin_all_folios(struct list_head *unpin_list)
> unpin_folio(ubuf_folio->folio);
>
> list_del(&ubuf_folio->list);
> - kfree(ubuf_folio);
> + kmem_cache_free(udmabuf_folio_cachep, ubuf_folio);
> }
> }
>
> @@ -178,7 +180,7 @@ static int add_to_unpin_list(struct list_head *unpin_list,
> {
> struct udmabuf_folio *ubuf_folio;
>
> - ubuf_folio = kzalloc(sizeof(*ubuf_folio), GFP_KERNEL);
> + ubuf_folio = kmem_cache_alloc(udmabuf_folio_cachep, GFP_KERNEL);
> if (!ubuf_folio)
> return -ENOMEM;
>
> @@ -492,10 +494,20 @@ static int __init udmabuf_dev_init(void)
> if (ret < 0) {
> pr_err("Could not setup DMA mask for udmabuf device\n");
> misc_deregister(&udmabuf_misc);
misc_deregister() is now called twice in this error path, I think you've
forgotten to delete this line too?
Otherwise lgtm.
-Sima
> - return ret;
> + goto err;
> + }
> +
> + udmabuf_folio_cachep = KMEM_CACHE(udmabuf_folio, 0);
> + if (unlikely(!udmabuf_folio_cachep)) {
> + ret = -ENOMEM;
> + goto err;
> }
>
> return 0;
> +
> +err:
> + misc_deregister(&udmabuf_misc);
> + return ret;
> }
>
> static void __exit udmabuf_dev_exit(void)
>
> base-commit: cd19ac2f903276b820f5d0d89de0c896c27036ed
> --
> 2.45.2
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On 31/07/2024 09:37, Huan Yang wrote:
> The current udmabuf_folio contains a list_head and the corresponding
> folio pointer, with a size of 24 bytes. udmabuf_folio uses kmalloc to
> allocate memory.
>
> However, kmalloc allocates from shared size-class pools, starting at 8, 16,
> 32 bytes and so on. If the requested size does not match a kmalloc size
> class, it is rounded up to the next one.
> This means that each udmabuf_folio allocation gets 32 bytes and wastes
> 8 bytes.
>
> Considering that each udmabuf creates one udmabuf_folio per folio, the
> wasted memory can be significant when memory is fragmented (many small
> folios).
>
> Furthermore, if udmabuf is frequently used, the allocation and
> deallocation of udmabuf_folio will also be frequent.
>
> Therefore, this patch adds a kmem_cache dedicated to the allocation and
> deallocation of udmabuf_folio. This is expected to improve allocation and
> deallocation performance, while also avoiding memory waste.
>
> Signed-off-by: Huan Yang <link(a)vivo.com>
> ---
> v3 -> v2: fix error description.
> v2 -> v1: fix double unregister, remove unlikely.
> drivers/dma-buf/udmabuf.c | 19 +++++++++++++++----
> 1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index 047c3cd2ceff..c112c58ef09a 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -24,6 +24,8 @@ static int size_limit_mb = 64;
> module_param(size_limit_mb, int, 0644);
> MODULE_PARM_DESC(size_limit_mb, "Max size of a dmabuf, in megabytes. Default is 64.");
>
> +static struct kmem_cache *udmabuf_folio_cachep;
> +
> struct udmabuf {
> pgoff_t pagecount;
> struct folio **folios;
> @@ -169,7 +171,7 @@ static void unpin_all_folios(struct list_head *unpin_list)
> unpin_folio(ubuf_folio->folio);
>
> list_del(&ubuf_folio->list);
> - kfree(ubuf_folio);
> + kmem_cache_free(udmabuf_folio_cachep, ubuf_folio);
> }
> }
>
> @@ -178,7 +180,7 @@ static int add_to_unpin_list(struct list_head *unpin_list,
> {
> struct udmabuf_folio *ubuf_folio;
>
> - ubuf_folio = kzalloc(sizeof(*ubuf_folio), GFP_KERNEL);
> + ubuf_folio = kmem_cache_alloc(udmabuf_folio_cachep, GFP_KERNEL);
> if (!ubuf_folio)
> return -ENOMEM;
>
> @@ -491,11 +493,20 @@ static int __init udmabuf_dev_init(void)
> DMA_BIT_MASK(64));
> if (ret < 0) {
> pr_err("Could not setup DMA mask for udmabuf device\n");
> - misc_deregister(&udmabuf_misc);
> - return ret;
> + goto err;
> + }
> +
> + udmabuf_folio_cachep = KMEM_CACHE(udmabuf_folio, 0);
> + if (!udmabuf_folio_cachep) {
> + ret = -ENOMEM;
> + goto err;
> }
>
> return 0;
> +
> +err:
> + misc_deregister(&udmabuf_misc);
> + return ret;
> }
>
> static void __exit udmabuf_dev_exit(void)
Hi,
should a kmem_cache_destroy() also be added in udmabuf_dev_exit()?
CJ
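A minimal sketch of the suggested cleanup, assuming the current udmabuf_dev_exit() only deregisters the misc device:

static void __exit udmabuf_dev_exit(void)
{
        misc_deregister(&udmabuf_misc);
        /* free the cache created in udmabuf_dev_init(), as suggested above */
        kmem_cache_destroy(udmabuf_folio_cachep);
}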
>
> base-commit: cd19ac2f903276b820f5d0d89de0c896c27036ed
Hi Huan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 931a3b3bccc96e7708c82b30b2b5fa82dfd04890]
url: https://github.com/intel-lab-lkp/linux/commits/Huan-Yang/dma-buf-heaps-Intr…
base: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
patch link: https://lore.kernel.org/r/20240730075755.10941-4-link%40vivo.com
patch subject: [PATCH v2 3/5] dma-buf: heaps: support alloc async read file
config: xtensa-allyesconfig (https://download.01.org/0day-ci/archive/20240731/202407312202.LhLTLEhX-lkp@…)
compiler: xtensa-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240731/202407312202.LhLTLEhX-lkp@…)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp(a)intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407312202.LhLTLEhX-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'priv' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'heap_devt' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'list' not described in 'dma_heap'
drivers/dma-buf/dma-heap.c:45: warning: Function parameter or struct member 'heap_cdev' not described in 'dma_heap'
>> drivers/dma-buf/dma-heap.c:158: warning: Function parameter or struct member 'lock' not described in 'dma_heap_file_control'
drivers/dma-buf/dma-heap.c:482: warning: expecting prototype for Trigger sync file read, read into dma(). Prototype was for dma_heap_read_file_sync() instead
vim +158 drivers/dma-buf/dma-heap.c
133
134 /**
135 * struct dma_heap_file_control - global control of dma_heap file read.
136 * @works: @dma_heap_file_work's list head.
137 *
138 * @threadwq: wait queue for @work_thread, if commit work, @work_thread
139 * wakeup and read this work's file contains.
140 *
141 * @workwq: used for main thread wait for file read end, if allocation
142 * end before file read. @dma_heap_file_task ref effect this.
143 *
144 * @work_thread: file read kthread. the dma_heap_file_task work's consumer.
145 *
146 * @heap_fwork_cachep: @dma_heap_file_work's cachep, it's alloc/free frequently.
147 *
148 * @nr_work: global number of how many work committed.
149 */
150 struct dma_heap_file_control {
151 struct list_head works;
152 spinlock_t lock; // only lock for @works.
153 wait_queue_head_t threadwq;
154 wait_queue_head_t workwq;
155 struct task_struct *work_thread;
156 struct kmem_cache *heap_fwork_cachep;
157 atomic_t nr_work;
> 158 };
159
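The new warning flags the undocumented @lock member; one possible fix is simply adding a line for it to the kerneldoc block (a sketch, with the other member descriptions left as they are):

/**
 * struct dma_heap_file_control - global control of dma_heap file read.
 * @works: @dma_heap_file_work's list head.
 * @lock: protects @works.
 * ...remaining member descriptions unchanged...
 */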
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki