On 6/9/25 11:32, wangtao wrote:
-----Original Message-----
From: Christoph Hellwig <hch@infradead.org>
Sent: Monday, June 9, 2025 12:35 PM
To: Christian König <christian.koenig@amd.com>
Cc: wangtao <tao.wangtao@honor.com>; Christoph Hellwig <hch@infradead.org>; sumit.semwal@linaro.org; kraxel@redhat.com; vivek.kasireddy@intel.com; viro@zeniv.linux.org.uk; brauner@kernel.org; hughd@google.com; akpm@linux-foundation.org; amir73il@gmail.com; benjamin.gaignard@collabora.com; Brian.Starkey@arm.com; jstultz@google.com; tjmercier@google.com; jack@suse.cz; baolin.wang@linux.alibaba.com; linux-media@vger.kernel.org; dri-devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-mm@kvack.org; wangbintian(BintianWang) <bintian.wang@honor.com>; yipengxiang <yipengxiang@honor.com>; liulu 00013167 <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
Subject: Re: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
On Fri, Jun 06, 2025 at 01:20:48PM +0200, Christian König wrote:
dmabuf acts as a driver and shouldn't be handled by the VFS, so I had dmabuf implement copy_file_range callbacks to support zero-copy direct I/O. I'm open to both approaches. What's the preference of the VFS experts?
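For readers following the thread: the series essentially adds a copy_file_range handler on the dma-buf file. The sketch below only shows that rough shape under that assumption; dmabuf_direct_copy() is an invented placeholder for the exporter-specific copy path, not code from the actual patches.

#include <linux/dma-buf.h>
#include <linux/fs.h>

/*
 * Sketch only: where a copy_file_range hook would sit on the dma-buf
 * file.  dmabuf_direct_copy() is a hypothetical placeholder.
 */
static ssize_t dma_buf_copy_file_range(struct file *file_in, loff_t pos_in,
				       struct file *file_out, loff_t pos_out,
				       size_t len, unsigned int flags)
{
	struct dma_buf *dmabuf = file_out->private_data;

	/* Issue direct I/O from file_in straight into the exporter's memory. */
	return dmabuf_direct_copy(dmabuf, file_in, pos_in, pos_out, len);
}

static const struct file_operations dma_buf_fops_sketch = {
	/* ... existing dma-buf fops (mmap, ioctl, poll, ...) ... */
	.copy_file_range = dma_buf_copy_file_range,
};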
That would probably be illegal. Using the sg_table in the DMA-buf implementation turned out to be a mistake.
Two things here should not be directly conflated. Using the sg_table was a huge mistake, and we should try to switch dmabuf to a pure
I'm a bit confused: don't dmabuf importers need to traverse sg_table to access folios or dma_addr/len? Do you mean restricting sg_table access (e.g., only via iov_iter) or proposing alternative approaches?
No, accessing pages/folios inside the sg_table of a DMA-buf is strictly forbidden.
We have removed most use cases of that over the years and push back on generating new ones.
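To make the distinction concrete, here is a minimal sketch of the pattern importers are expected to follow: consume only the DMA addresses/lengths of a mapped attachment and never dereference the backing pages. hw_queue_segment() is an invented placeholder for whatever the importer programs its hardware with.

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

static int importer_program_dma(struct dma_buf_attachment *attach)
{
	struct sg_table *sgt;
	struct scatterlist *sg;
	int i;

	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt))
		return PTR_ERR(sgt);

	/* Walk only the DMA-mapped entries; never call sg_page() here. */
	for_each_sgtable_dma_sg(sgt, sg, i)
		hw_queue_segment(sg_dma_address(sg), sg_dma_len(sg)); /* placeholder */

	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
	return 0;
}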
dma_addr_t/len array now that the new DMA API supporting that has been merged. Is there any chance the dma-buf maintainers could start to kick this off? I'm of course happy to assist.
Work on that has already been underway for some time.
Most GPU drivers already do the sg_table -> DMA array conversion; I need to push on the remaining ones to clean this up.
But there are also tons of other users of dma_buf_map_attachment() which need to be converted.
But that notwithstanding, dma-buf is THE buffer sharing mechanism in the kernel, and we should promote it instead of reinventing it badly. And there is a use case for having a fully DMA mapped buffer in the block layer and I/O path, especially on systems with an IOMMU. So having an iov_iter backed by a dma-buf would be extremely helpful. That's mostly lib/iov_iter.c code, not VFS, though.
Are you suggesting adding an ITER_DMABUF type to iov_iter, or implementing dmabuf-to-iov_bvec conversion within iov_iter?
That would be rather nice to have, yeah.
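For reference, the second option wangtao names ("dmabuf-to-iov_bvec conversion") would amount to building a bio_vec array and handing it to iov_iter_bvec(). The sketch below assumes the exporter already owns its pages (as udmabuf does); note that it still needs struct page pointers, which is exactly what importers are told not to pull out of the sg_table, and a hypothetical ITER_DMABUF type carrying dma_addr/len pairs would avoid that. No such iterator type exists upstream today.

#include <linux/bvec.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/uio.h>

/*
 * Sketch: wrap an exporter's own pages in a bvec-backed iov_iter.
 * Purely illustrative; the caller frees @bvec after the I/O completes.
 */
static int exporter_build_bvec_iter(struct iov_iter *iter, struct page **pages,
				    unsigned long nr_pages, size_t total)
{
	struct bio_vec *bvec;
	unsigned long i;

	bvec = kcalloc(nr_pages, sizeof(*bvec), GFP_KERNEL);
	if (!bvec)
		return -ENOMEM;

	for (i = 0; i < nr_pages; i++)
		bvec_set_page(&bvec[i], pages[i], PAGE_SIZE, 0);

	/* ITER_DEST: the iterator is the destination of a read. */
	iov_iter_bvec(iter, ITER_DEST, bvec, nr_pages, total);
	return 0;
}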
The question Christoph raised was rather why is your CPU so slow that walking the page tables has a significant overhead compared to the actual I/O?
Yes, that's really puzzling and should be addressed first.
With a fast CPU (e.g., 3 GHz), GUP (get_user_pages) overhead is relatively low, as the 3 GHz tests show.
Even on a low end CPU walking the page tables and grabbing references shouldn't be that much of an overhead.
There must be some reason why you see so much CPU overhead, e.g. compound pages being broken up or similar, which should not happen in the first place.
Regards, Christian.
3 GHz results:

| 32x32MB Read 1024MB        |Creat-ms|Close-ms| I/O-ms |I/O-MB/s| I/O% |
|----------------------------|--------|--------|--------|--------|------|
| 1) memfd direct R/W        |      1 |    118 |    312 |   3448 | 100% |
| 2) u+memfd direct R/W      |    196 |    123 |    295 |   3651 | 105% |
| 3) u+memfd direct sendfile |    175 |    102 |    976 |   1100 |  31% |
| 4) u+memfd direct splice   |    173 |    103 |    443 |   2428 |  70% |
| 5) udmabuf buffer R/W      |    183 |    100 |    453 |   2375 |  68% |
| 6) dmabuf buffer R/W       |     34 |      4 |    427 |   2519 |  73% |
| 7) udmabuf direct c_f_r    |    200 |    102 |    278 |   3874 | 112% |
| 8) dmabuf direct c_f_r     |     36 |      5 |    269 |   4002 | 116% |
With a slower CPU (e.g., 1 GHz), GUP overhead becomes much more significant, as the 1 GHz tests show:

| 32x32MB Read 1024MB        |Creat-ms|Close-ms| I/O-ms |I/O-MB/s| I/O% |
|----------------------------|--------|--------|--------|--------|------|
| 1) memfd direct R/W        |      2 |    393 |    969 |   1109 | 100% |
| 2) u+memfd direct R/W      |    592 |    424 |    570 |   1884 | 169% |
| 3) u+memfd direct sendfile |    587 |    356 |   2229 |    481 |  43% |
| 4) u+memfd direct splice   |    568 |    352 |    795 |   1350 | 121% |
| 5) udmabuf buffer R/W      |    597 |    343 |   1238 |    867 |  78% |
| 6) dmabuf buffer R/W       |     69 |     13 |   1128 |    952 |  85% |
| 7) udmabuf direct c_f_r    |    595 |    345 |    372 |   2889 | 260% |
| 8) dmabuf direct c_f_r     |     80 |     13 |    274 |   3929 | 354% |
Regards, Wangtao.
On Tue, Jun 10, 2025 at 12:52:18PM +0200, Christian König wrote:
dma_addr_t/len array now that the new DMA API supporting that has been merged. Is there any chance the dma-buf maintainers could start to kick this off? I'm of course happy to assist.
Work on that has already been underway for some time.
Most GPU drivers already do the sg_table -> DMA array conversion; I need to push on the remaining ones to clean this up.
Do you have a pointer?
Yes, that's really puzzling and should be addressed first.
With a fast CPU (e.g., 3 GHz), GUP (get_user_pages) overhead is relatively low, as the 3 GHz tests show.
Even on a low end CPU walking the page tables and grabbing references shouldn't be that much of an overhead.
Yes.
There must be some reason why you see so much CPU overhead, e.g. compound pages being broken up or similar, which should not happen in the first place.
pin_user_pages unfortunately outputs an array of PAGE_SIZE-sized (modulo the offset and a shorter last entry) struct page pointers. The block direct I/O code has fairly recently grown code to reassemble folios from them, which did speed up some workloads.
Is this test using the block device or iomap direct I/O code? What kernel version is it run on?
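To illustrate the folio reassembly mentioned above: pin_user_pages() hands back one struct page pointer per PAGE_SIZE chunk, and the block direct I/O path coalesces consecutive entries that belong to the same folio back into larger segments. The following is a simplified sketch of that idea only, not the actual bio_iov_iter_get_pages() code.

#include <linux/mm.h>

/*
 * Simplified sketch: merge the per-PAGE_SIZE entries that GUP returns
 * back into folio-sized runs, roughly what the block direct I/O path
 * does when building bios.  Returns the number of merged segments.
 */
static unsigned long coalesce_into_folios(struct page **pages,
					  unsigned long nr_pages)
{
	unsigned long i = 0, segs = 0;

	while (i < nr_pages) {
		struct folio *folio = page_folio(pages[i]);
		unsigned long run = 1;

		/* Count how many consecutive entries stay in this folio. */
		while (i + run < nr_pages &&
		       page_folio(pages[i + run]) == folio &&
		       pages[i + run] == pages[i] + run)
			run++;

		/* One segment covers 'run' pages of the same folio. */
		segs++;
		i += run;
	}
	return segs;
}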