On Tue, May 07, 2024 at 04:15:05PM +0100, Bryan O'Donoghue wrote:
On 07/05/2024 16:09, Dmitry Baryshkov wrote:
Ah, I see. Then why do you require the DMA-ble buffer at all? If you are providing data to VPU or DRM, then you should be able to get the buffer from the data-consuming device.
Because we don't necessarily know what the consuming device is, if any.
Well ... that's an entirely different issue. And it's unsolved.
Currently the approach is to allocate where the constraints are usually most severe (like the display, if you need that, or the camera module for input) and then just pray the stack works out without too much copying. All userspace (whether that's the generic glue or the userspace driver depends a bit upon the exact api) does need a copy fallback for these sharing cases, ideally with the copying accelerated by hw.
If you try to solve this by just preemptively allocating everything as cma buffers, then you'll make the situation substantially worse, because now you're wasting tons of cma memory where you might not even need it. And it doesn't really solve the problem either, since for some gpus that memory might be unusable (you cannot scan it out on any discrete gpu, and sometimes not even on an integrated one).

-Sima
Could be the VPU, could be Zoom/Hangouts via PipeWire, could for argument's sake be the GPU or a DSP.
Also, if we introduce a dependency on another device to allocate the output buffers (say, always taking the output buffer from the GPU), then we've added another dependency which is more difficult to guarantee across different arches.
bod
On Tue, May 07, 2024 at 07:36:59PM +0200, Daniel Vetter wrote:
I think we have a general agreement that the proposed solution is a stop-gap measure for an unsolved issue.
Note that libcamera is already designed that way. The API is designed to import buffers as dma-buf file descriptors. If an application has a way to allocate dma-buf instances through other means (from the display or from a video encoder, for instance), it should do so and use those buffers with libcamera.
For applications that don't have an easy way to get hold of dma-buf instances, we have a buffer allocator helper as a side component. That allocator uses the underlying camera capture device and allocates buffers from the V4L2 video device. It's only on platforms where we have no hardware camera processing (or, rather, platforms where the hardware vendors don't give us access to the camera hardware, such as recent Intel SoCs, or Qualcomm SoCs used in ARM laptops) that we need to allocate memory elsewhere.
In the long run, I want a centralized memory allocator accessible by userspace applications (something similar in purpose to gralloc on Android), and I want to get rid of buffer allocation in libcamera (and even in V4L2, in the even longer term). That's the long run.
Shorter term, we have a problem to solve, and the best option we have found so far is to rely on dma-buf heaps as a backend for the frame buffer allocator helper in libcamera for the use case described above. This won't work in 100% of the cases, clearly. It's a stop-gap measure until we can do better.
linaro-mm-sig@lists.linaro.org