On Thu, Jul 10, 2025 at 10:49:19AM +0200, Pavel Machek wrote:
Hi!
memcpy() from normal memory takes about 2msec/1MB. Unfortunately, for DMA-BUFs it takes 20msec/1MB, and that basically means I can't easily do 760p video recording. Plus, copying a full-resolution photo buffer takes more than 200msec!
There's a possibility to do some processing on the GPU, and it's implemented here:
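To put the quoted figures side by side, here is a minimal arithmetic sketch. The 2 msec/MB and 20 msec/MB rates and the 200 msec photo copy come from the message above; the helper names are hypothetical, and treating 1 MB as 2^20 bytes is an assumption.

```c
#include <stddef.h>

/* Assumption: "1MB" in the message means 2^20 bytes. */
#define BYTES_PER_MB (1024.0 * 1024.0)

/* Time in milliseconds needed to copy `bytes` at a cost of `ms_per_mb`. */
double copy_time_ms(size_t bytes, double ms_per_mb)
{
    return (double)bytes / BYTES_PER_MB * ms_per_mb;
}

/* Largest buffer, in bytes, that can be copied within `budget_ms`. */
double max_bytes_in_budget(double budget_ms, double ms_per_mb)
{
    return budget_ms / ms_per_mb * BYTES_PER_MB;
}
```

With these, a 200 msec copy at 20 msec/MB corresponds to a buffer of about 10 MB, which would take about 20 msec from normal memory. For video, a hypothetical 30 fps target leaves roughly 33 msec per frame, so at 20 msec/MB only about 1.6 MB per frame could be copied before the whole frame time is spent.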
https://gitlab.com/tui/tui/-/tree/master/icam?ref_type=heads
but that hits the same problem in the end -- data is in DMA-BUF, uncached, and takes way too long to copy out.
And that's ... wrong. DMA ended seconds ago, complete cache flush would be way cheaper than copying single frame out, and I still have to deal with uncached frames.
So I have two questions:
- Is my analysis correct that, no matter how I get a frame from v4l and
process it on the GPU, I'll have to copy it from uncached memory in the end?
If you need to touch the buffers using the CPU then you are either stuck with uncached memory or you need to implement bracketed access to do the necessary cache maintenance. Be aware that completely flushing the cache is not really an option, as that would impact other workloads, so you have to flush the cache by walking the virtual address space of the buffer, which may take a significant amount of CPU time.
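As a rough illustration of what bracketed access can look like from userspace, here is a minimal sketch using the DMA_BUF_IOCTL_SYNC ioctl from <linux/dma-buf.h>. The helper names (dmabuf_sync, dmabuf_begin_cpu_read, dmabuf_end_cpu_read, dmabuf_copy_out) are hypothetical, and it assumes the buffer was already mmap()ed from the dma-buf fd. Whether the exporter actually hands out a cached CPU mapping, which is what would make the copy fast, is driver-dependent and not settled in this thread.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Issue one DMA_BUF_IOCTL_SYNC call, retrying if it is interrupted. */
static int dmabuf_sync(int fd, uint64_t flags)
{
    struct dma_buf_sync sync = { .flags = flags };
    int ret;

    do {
        ret = ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));

    return ret;
}

/* Open the bracket: the exporter does whatever cache maintenance is
 * needed so that CPU reads see the data written by the device. */
int dmabuf_begin_cpu_read(int fd)
{
    return dmabuf_sync(fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ);
}

/* Close the bracket: CPU access is finished. */
int dmabuf_end_cpu_read(int fd)
{
    return dmabuf_sync(fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ);
}

/* Copy `len` bytes out of an mmap()ed dma-buf (`map`) into `dst`,
 * with the copy bracketed by the sync calls. Returns 0 on success,
 * -1 with errno set on failure. */
int dmabuf_copy_out(int fd, const void *map, void *dst, size_t len)
{
    if (dmabuf_begin_cpu_read(fd) < 0)
        return -1;

    memcpy(dst, map, len);

    return dmabuf_end_cpu_read(fd);
}
```

The cache maintenance cost discussed above is paid inside the sync calls, so timing dmabuf_begin_cpu_read() separately from the memcpy() would show how the two compare on a given platform.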
What kind of "significant amount of CPU time" are we talking about here? A millisecond?
It really depends on the platform, the type of cache, and the size of the buffer. I remember that back in the N900 days a selective cache clean of a large buffer for full-resolution images took several dozen milliseconds, possibly close to 100ms. We had to clean the whole D-cache to make it fast enough, but you can't always do that, as Lucas mentioned.
Bracketed access is fine with me.
Flushing the cache should be an option. I'm root, there's no other significant workload, and copying out the buffer takes 200msec+. There are a lot of cache flushes that can be done in a quarter of a second!
However, if you are only going to use the buffer with the GPU, I see no reason to touch it from the CPU side. Why would you even need to copy the content? After all, dma-bufs are meant to enable zero-copy between DMA-capable accelerators. You can simply import the V4L2 buffer into a GL texture using EGL_EXT_image_dma_buf_import. Using this path you don't need to bother with the cache at all, as the GPU will directly read the video buffers from RAM.
Yes, so the GPU will read the video buffer from RAM, then debayer it, and then what? Then I need to store the data in a raw file, or use the CPU to turn it into a JPEG file, or maybe run a video encoder on it. Those are all tasks that are done on the CPU...
linaro-mm-sig@lists.linaro.org