On Mon, May 19, 2025 at 9:06 PM wangtao tao.wangtao@honor.com wrote:
-----Original Message-----
From: wangtao
Sent: Monday, May 19, 2025 8:04 PM
To: 'T.J. Mercier' <tjmercier@google.com>; Christian König <christian.koenig@amd.com>
Cc: sumit.semwal@linaro.org; benjamin.gaignard@collabora.com; Brian.Starkey@arm.com; jstultz@google.com; linux-media@vger.kernel.org; dri-devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-kernel@vger.kernel.org; wangbintian(BintianWang) <bintian.wang@honor.com>; yipengxiang <yipengxiang@honor.com>; liulu 00013167 <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
Subject: RE: [PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE for system_heap
-----Original Message-----
From: T.J. Mercier <tjmercier@google.com>
Sent: Saturday, May 17, 2025 2:37 AM
To: Christian König <christian.koenig@amd.com>
Cc: wangtao <tao.wangtao@honor.com>; sumit.semwal@linaro.org; benjamin.gaignard@collabora.com; Brian.Starkey@arm.com; jstultz@google.com; linux-media@vger.kernel.org; dri-devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-kernel@vger.kernel.org; wangbintian(BintianWang) <bintian.wang@honor.com>; yipengxiang <yipengxiang@honor.com>; liulu 00013167 <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
Subject: Re: [PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE for system_heap
On Fri, May 16, 2025 at 1:36 AM Christian König christian.koenig@amd.com wrote:
On 5/16/25 09:40, wangtao wrote:
-----Original Message-----
From: Christian König <christian.koenig@amd.com>
Sent: Thursday, May 15, 2025 10:26 PM
To: wangtao <tao.wangtao@honor.com>; sumit.semwal@linaro.org; benjamin.gaignard@collabora.com; Brian.Starkey@arm.com; jstultz@google.com; tjmercier@google.com
Cc: linux-media@vger.kernel.org; dri-devel@lists.freedesktop.org; linaro-mm-sig@lists.linaro.org; linux-kernel@vger.kernel.org; wangbintian(BintianWang) <bintian.wang@honor.com>; yipengxiang <yipengxiang@honor.com>; liulu 00013167 <liulu.liu@honor.com>; hanfeng 00012985 <feng.han@honor.com>
Subject: Re: [PATCH 2/2] dmabuf/heaps: implement DMA_BUF_IOCTL_RW_FILE for system_heap
On 5/15/25 16:03, wangtao wrote:
> [wangtao] My Test Configuration (CPU 1GHz, 5-test average):
> Allocation: 32x32MB buffer creation
> - dmabuf 53ms vs. udmabuf 694ms (10X slower)
> - Note: shmem shows excessive allocation time
Yeah, that is something already noted by others as well. But that is orthogonal.
>
> Read 1024MB File:
> - dmabuf direct 326ms vs. udmabuf direct 461ms (40% slower)
> - Note: pin_user_pages_fast consumes majority CPU cycles
>
> Key function call timing: See details below.
Those aren't valid, you are comparing different functionalities here.
Please try using udmabuf with sendfile(), as confirmed to be working by T.J.
[wangtao] Using buffered I/O for dmabuf file read/write requires one memory copy. Direct I/O removes that copy and enables zero-copy. The sendfile() system call reduces the number of copies from two (read + write) to one; with udmabuf, however, sendfile() still keeps at least one copy, so it is not zero-copy.
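For clarity, the difference between the two read paths boils down to roughly the following sketch (an illustration, not code from the patch; 'buf' is assumed to be a page-aligned mapping of the destination buffer, e.g. an mmap() of the memfd backing a udmabuf, and O_DIRECT additionally requires block-aligned length and file offset):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Buffered: the file is DMA'd into the page cache, then the CPU copies
 * it into 'buf'.  O_DIRECT: the pages of 'buf' are pinned and the file
 * is DMA'd into them directly, with no CPU copy.
 */
static ssize_t read_into_buffer(const char *path, void *buf, size_t len,
                                int use_direct)
{
        int fd = open(path, O_RDONLY | (use_direct ? O_DIRECT : 0));
        ssize_t ret;

        if (fd < 0)
                return -1;
        ret = read(fd, buf, len);
        close(fd);
        return ret;
}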
Then please work on fixing this.
Regards, Christian.
If udmabuf sendfile uses buffered I/O (the file page cache), read latency matches dmabuf buffered read, but allocation takes much longer. With direct I/O, the default 16-page pipe size makes sendfile slower than buffered I/O.
Test data shows: udmabuf direct read is much faster than udmabuf sendfile. dmabuf direct read outperforms udmabuf direct read by a large margin.
Issue: After udmabuf is mapped via map_dma_buf, apps using memfd or udmabuf for Direct IO might cause errors, but there are no safeguards to prevent this.
Allocate 32x32MB buffer and read 1024MB file test:

Metric                  | alloc (ms) | read (ms) | total (ms)
------------------------|------------|-----------|-----------
udmabuf buffer read     |        539 |      2017 |       2555
udmabuf direct read     |        522 |       658 |       1179
I can't reproduce the part where udmabuf direct reads are faster than buffered reads. That's the opposite of what I'd expect. Something seems wrong with those buffered reads.
Metric                  | alloc (ms) | read (ms) | total (ms)
------------------------|------------|-----------|-----------
udmabuf buffer sendfile |        505 |      1040 |       1546
udmabuf direct sendfile |        510 |      2269 |       2780
I can reproduce the 3.5x slower udambuf direct sendfile compared to udmabuf direct read. It's a pretty disappointing result, so it seems like something could be improved there.
1G from ext4 on 6.12.17 | read/sendfile (ms)
------------------------|-------------------
udmabuf buffer read     |                351
udmabuf direct read     |                540
udmabuf buffer sendfile |                255
udmabuf direct sendfile |               1990
[wangtao] By the way, did you clear the file cache during testing? Looking at your data again, the buffered read and buffered sendfile are faster than direct I/O, which suggests the file cache was not cleared. Without clearing the cache, the results are not a fair or reliable reference.

On embedded devices it is nearly impossible to keep multi-GB files stably cached. If such files could be cached, we might as well cache the dmabufs directly and save both the dmabuf creation time and the file read.

You can call posix_fadvise(file_fd, 0, len, POSIX_FADV_DONTNEED) after opening the file, or before closing it, to drop the file cache so that actual file I/O is exercised; a sketch follows below.
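For reference, a minimal userspace sketch of that cache-drop step (drop_file_cache() is an illustrative helper, not part of the patch; the fsync() is there so a dirty file's pages can actually be evicted):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Drop any cached pages for 'path' so the next read hits the disk. */
static int drop_file_cache(const char *path)
{
        struct stat st;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;

        if (fstat(fd, &st) == 0) {
                fsync(fd);      /* flush dirty pages, if any */
                int err = posix_fadvise(fd, 0, st.st_size,
                                        POSIX_FADV_DONTNEED);
                if (err)
                        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
        }

        close(fd);
        return 0;
}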
[wangtao] Please confirm whether cache clearing was performed during testing. I reduced the test scope from 3GB to 1GB. While my results without cache clearing broadly align with yours, udmabuf buffered read is still slower than direct read in my tests. Comparative data:
Your test reading 1GB (ext4 on 6.12.17):

Method                  | read/sendfile (ms) | read vs. buffer sendfile (%)
------------------------|--------------------|-----------------------------
udmabuf buffer read     |                351 |                         138%
udmabuf direct read     |                540 |                         212%
udmabuf buffer sendfile |                255 |                         100%
udmabuf direct sendfile |               1990 |                         780%
My 3.5GHz tests (f2fs), without cache clearing:

Method                  | alloc (ms) | read (ms) | read vs. buffer sendfile (%)
------------------------|------------|-----------|-----------------------------
udmabuf buffer read     |        140 |       386 |                         310%
udmabuf direct read     |        151 |       326 |                         262%
udmabuf buffer sendfile |        136 |       124 |                         100%
udmabuf direct sendfile |        132 |       892 |                         717%
dmabuf buffer read      |         23 |       154 |                         124%
patch direct read       |         29 |       271 |                         218%
With cache clearing:

Method                  | alloc (ms) | read (ms) | read vs. buffer sendfile (%)
------------------------|------------|-----------|-----------------------------
udmabuf buffer read     |        135 |       546 |                         180%
udmabuf direct read     |        159 |       300 |                          99%
udmabuf buffer sendfile |        134 |       303 |                         100%
udmabuf direct sendfile |        141 |       912 |                         301%
dmabuf buffer read      |         22 |       362 |                         119%
patch direct read       |         29 |       265 |                          87%
Results without cache clearing aren't representative of embedded mobile devices. Notably, on a low-power 1GHz CPU, sendfile latency without cache clearing already exceeds the dmabuf direct I/O read time.
Without cache clearing:

Method                  | alloc (ms) | read (ms) | read vs. buffer sendfile (%)
------------------------|------------|-----------|-----------------------------
udmabuf buffer read     |        546 |      1745 |                         442%
udmabuf direct read     |        511 |       704 |                         178%
udmabuf buffer sendfile |        496 |       395 |                         100%
udmabuf direct sendfile |        498 |      2332 |                         591%
dmabuf buffer read      |         43 |       453 |                         115%
my patch direct read    |         49 |       310 |                          79%
With cache clearing:

Method                  | alloc (ms) | read (ms) | read vs. buffer sendfile (%)
------------------------|------------|-----------|-----------------------------
udmabuf buffer read     |        552 |      2067 |                         198%
udmabuf direct read     |        540 |       627 |                          60%
udmabuf buffer sendfile |        497 |      1045 |                         100%
udmabuf direct sendfile |        527 |      2330 |                         223%
dmabuf buffer read      |         40 |      1111 |                         106%
my patch direct read    |         44 |       310 |                          30%
Reducing CPU overhead/power consumption is critical for mobile devices. We need simpler and more efficient dmabuf direct I/O support.
As Christian evaluated sendfile performance based on your data, could you confirm whether the cache was cleared? If not, please share the post-cache-clearing test data. Thank you for your support.
Yes sorry, I was out yesterday riding motorcycles. I did not clear the cache for the buffered reads; I didn't realize you had. The IO plus the copy certainly explains the difference.
Your point about the unlikelihood of any of that data being in the cache also makes sense.
I'm not sure it changes anything about the ioctl approach though. Another way to do this would be to move the (optional) support for direct IO into the exporter via dma_buf_fops and dma_buf_ops. Then normal read() syscalls would just work for buffers that support them. I know that's more complicated, but at least it doesn't require inventing new uapi to do it.
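If I'm reading that suggestion right, the wiring might look roughly like the sketch below. Note this is only one possible shape of the idea, not the actual patch, and read_iter on dma_buf_ops is a hypothetical hook that does not exist today:

#include <linux/dma-buf.h>
#include <linux/fs.h>
#include <linux/uio.h>

/*
 * Hypothetical: let an exporter optionally serve file I/O on the
 * dma-buf fd.  'read_iter' below is an assumed new dma_buf_ops hook.
 */
static ssize_t dma_buf_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
        struct dma_buf *dmabuf = iocb->ki_filp->private_data;

        if (!dmabuf->ops->read_iter)    /* exporter opted out */
                return -EINVAL;

        return dmabuf->ops->read_iter(dmabuf, iocb, to);
}

/*
 * dma_buf_fops would then gain .read_iter = dma_buf_file_read_iter,
 * with its existing entries (release, mmap, poll, ioctl, ...) unchanged.
 */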
1G from ext4 on 6.12.20 | read/sendfile (ms) w/ 3 > drop_caches
------------------------|--------------------------------------
udmabuf buffer read     |               1210
udmabuf direct read     |                671
udmabuf buffer sendfile |               1096
udmabuf direct sendfile |               2340
Metric                  | alloc (ms) | read (ms) | total (ms)
------------------------|------------|-----------|-----------
dmabuf buffer read      |         51 |      1068 |       1118
dmabuf direct read      |         52 |       297 |        349
udmabuf sendfile test steps:
1. Open data file (1024MB), get back_fd
2. Create memfd (32MB)              # Loop steps 2-6
3. Allocate udmabuf with memfd
4. Call sendfile(memfd, back_fd)
5. Close memfd after sendfile
6. Close udmabuf
7. Close back_fd
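A rough C sketch of that loop, for reference (timing and error handling trimmed; it assumes the standard udmabuf uAPI from <linux/udmabuf.h>, and the file path and size constants are illustrative):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/udmabuf.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <unistd.h>

#define CHUNK_SIZE (32UL << 20)    /* 32MB per memfd */
#define FILE_SIZE  (1024UL << 20)  /* 1024MB data file */

int main(void)
{
        int back_fd = open("/data/test.bin", O_RDONLY);  /* step 1, path illustrative */
        int dev_fd = open("/dev/udmabuf", O_RDWR);
        off_t off = 0;

        while (off < FILE_SIZE) {                        /* loop steps 2-6 */
                struct udmabuf_create create = { 0 };
                int memfd, ubuf_fd;

                memfd = memfd_create("chunk", MFD_ALLOW_SEALING);  /* step 2 */
                ftruncate(memfd, CHUNK_SIZE);
                fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);  /* udmabuf requires the seal */

                create.memfd = memfd;                    /* step 3 */
                create.offset = 0;
                create.size = CHUNK_SIZE;
                ubuf_fd = ioctl(dev_fd, UDMABUF_CREATE, &create);

                sendfile(memfd, back_fd, &off, CHUNK_SIZE);  /* step 4: file -> memfd pages */

                close(memfd);                            /* step 5 */
                close(ubuf_fd);                          /* step 6 */
        }

        close(back_fd);                                  /* step 7 */
        close(dev_fd);
        return 0;
}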
Regards, Christian.