On Wednesday, 2 November 2022 at 12:18 +0100, Christian König wrote:
On 01.11.22 at 22:09, Nicolas Dufresne wrote:
[SNIP]
But the client is just a video player. It doesn't understand how to allocate BOs for Panfrost or AMD or etnaviv. So without a universal allocator (again ...), 'just allocate on the GPU' isn't a useful response to the client.
Well exactly that's the point I'm raising: The client *must* understand that!
See, we need to be able to handle all restrictions here; coherency of the data is just one of them.
For example, the much more important question is the location of the data, and for this, allocating from the V4L2 device is in most cases just not going to fly.
It feels like this is a generic statement and there is no reason it could not be the other way around.
And exactly that's my point. You always need to look at both ways to share the buffer and can't assume that one will always work.
As far as I can see it you guys just allocate a buffer from a V4L2 device, fill it with data and send it to Wayland for displaying.
That paragraph is a bit sloppy. By "you guys", whom do you mean exactly? Normal users let the V4L2 device allocate and write into its own memory (the device fills it, not "you guys"). It is done this way simply because it is guaranteed to work with the V4L2 device. Most V4L2 devices produce pixel formats and layouts that are known to userspace, for which userspace knows it can implement a GPU shader or a software fallback. I have yet to see one of these formats that cannot be efficiently imported into a modern GPU and converted using shaders. I'm also not entirely sure what distinguishes a dGPU from a GPU in this context, by the way.
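For reference, here is roughly what that "guaranteed to work" path looks like from userspace: the V4L2 driver owns the allocation (V4L2_MEMORY_MMAP) and the buffers are then exported as dmabuf fds with VIDIOC_EXPBUF. A minimal sketch, no error handling; the device node and buffer count are just placeholders:

/* Let the V4L2 driver allocate the capture buffers and export them
 * as dmabuf fds (error handling omitted). */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

int export_v4l2_buffers(int dmabuf_fds[4])
{
    int fd = open("/dev/video0", O_RDWR);        /* placeholder device node */

    struct v4l2_requestbuffers req = {
        .count  = 4,
        .type   = V4L2_BUF_TYPE_VIDEO_CAPTURE,
        .memory = V4L2_MEMORY_MMAP,              /* driver-owned allocation */
    };
    ioctl(fd, VIDIOC_REQBUFS, &req);

    for (unsigned i = 0; i < req.count; i++) {
        struct v4l2_exportbuffer exp = {
            .type  = V4L2_BUF_TYPE_VIDEO_CAPTURE,
            .index = i,
            .flags = O_RDWR,
        };
        ioctl(fd, VIDIOC_EXPBUF, &exp);          /* dmabuf fd for buffer i */
        dmabuf_fds[i] = exp.fd;
    }
    return fd;
}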
In many cases, camera-style V4L2 devices have one producer for many consumers. Consider your photo application: the stream will likely be captured and displayed while being encoded by one or more CODECs and streamed to a machine-learning model for analysis. The software complexity of communicating the list of receiver devices back and implementing all of their non-standard ways of allocating memory, plus all the combinations of trial and error, is just ridiculously high. Remember that each GPU has its own allocation methods and corner cases; this is simply not manageable by "you guys", which I pretty much assume means everyone writing software for generic Linux these days (non-Android/ChromeOS).
To be honest I'm really surprised that the Wayland guys haven't pushed back on this practice already.
This only works because the Wayland as well as the X display pipeline is smart enough to insert an extra copy when it finds that an imported buffer can't be used as a framebuffer directly.
This is a bit inaccurate. The compositors I've worked with (GNOME and Weston) will only memcpy SHM. For DMABuf, they will fail the import if it is not usable either by the display or by the GPU. Especially on the GPU side (which is the ultimate compositor fallback), efficient HW copy mechanisms exist that may be used, and this is fine: unlike your scanout example, it won't upload over and over, but will later re-display from a remote copy (or a transformed copy). Or, if you prefer, it is cached at the cost of higher memory usage.
I think it would be preferable to speak about device-to-device sharing, since V4L2 vs. GPU is not really representative of the problem. "V4L2 vs. GPU" and "you guys" simply contribute to the never-ending and needless friction around the difficulty that exists with the current support for memory sharing in Linux.
I have a colleague who integrated a PCIe CODEC (Blaize Xplorer X1600P PCIe Accelerator) hosting its own RAM. There were many ways to use it. Of course, in the current state of DMABuf, you have to be the exporter to do anything fancy, but it did not have to be like this; it is a design choice. I'm not sure what method was used in the end; the driver isn't upstream yet, so maybe that is not even final. What I do know is that there are various conditions under which you may use the CODEC, and the optimal location of the memory varies accordingly, for example depending on whether the post-processor is used or not; see my next comment for more details.
Yeah, and stuff like this was already discussed multiple times. Local memory of devices can only be made available by the exporter, not the importer.
So in the case of a separate camera and encoder you run into exactly the same limitation: some devices need the allocation to happen on the camera while others need it on the encoder.
The more common case is that you need to allocate from the GPU and then import that into the V4L2 device. The background is that all dGPUs I know of need the data inside local memory (VRAM) to be able to scan out from it.
The reality is that what is common to you might not be common to others. In my work, most ARM SoCs have displays that can simply scan out directly from cameras and codecs.
The only case that commonly fails is when we try to display UVC-created dmabufs,
Well, exactly that's not correct! The whole x86 use case of direct display for dGPUs is broken because media players think they can do the simple thing and offload all the problematic cases to the display server.
This is absolutely *not* the common use case you describe here, but rather something completely special to ARM.
Sigh... The UVC failure was first discovered on my Intel PC and later reproduced on ARM. Userspace expected the drivers (V4L2 exporting, DRM importing) to have rejected the DMABuf import (I kind of know, I wrote this part). From a userspace point of view, unlike what you stipulate here, there was no fault. You already said that the importer/exporter roles are to be tried, and the order in which you try them should not matter. So yes, today's userspace may lack the ability to flip the roles, but at least it tries, and if the driver does not fail, you can't blame userspace for trying to achieve decent performance.
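To make "flipping the roles" concrete on the V4L2 side: the exact same capture queue can instead import fds that were allocated by another device (GPU, display, etc.) using V4L2_MEMORY_DMABUF. A minimal single-planar sketch, no error handling; v4l2_fd and dmabuf_fds[] are assumed to come from elsewhere:

/* V4L2 as the importer: queue externally allocated dmabuf fds. */
struct v4l2_requestbuffers req = {
    .count  = 4,
    .type   = V4L2_BUF_TYPE_VIDEO_CAPTURE,
    .memory = V4L2_MEMORY_DMABUF,
};
ioctl(v4l2_fd, VIDIOC_REQBUFS, &req);

for (unsigned i = 0; i < req.count; i++) {
    struct v4l2_buffer buf = {
        .type   = V4L2_BUF_TYPE_VIDEO_CAPTURE,
        .memory = V4L2_MEMORY_DMABUF,
        .index  = i,
        .m.fd   = dmabuf_fds[i],     /* fd exported by the other device */
    };
    ioctl(v4l2_fd, VIDIOC_QBUF, &buf);
}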
I'd like to point out that these are clearly all kernel bugs, and we cannot state that kernel drivers "are broken because of media players". Just the fact that this thread starts from a kernel change kind of proves it. It would also be nice for you to understand that I'm not against the method used in this patchset, but I'm not against a bracketing mechanism either, as I think the latter can improve things, where the former only gives more "correct" results.
which have dirty CPU write caches, and this is the type of thing we'd like to see solved. I think this series was addressing it in principle by failing the import, and the point raised is that this wasn't the optimal way.
In case you aren't aware, there is a community project called LibreELEC; they run Kodi with direct scanout of the video stream on a wide variety of SoCs, and they use the CODEC as the exporter all the time. They simply don't have cases where the opposite is needed (or any kind of remote RAM to deal with). In fact, FFmpeg does not really offer any API to reverse the allocation.
Ok, let me try to explain it once more. It sounds like I wasn't able to get my point through.
The only reason we haven't heard anybody screaming that x86 doesn't work is that we handle the case where a buffer isn't directly displayable in X/Wayland anyway, but this is absolutely not the optimal solution.
Basically, you are complaining that the compositor will use GPU shaders to adapt the buffers for the display. Most displays have no or only limited YUV support; flipping the roles or bracketing won't help with that. Using a GPU shader to adapt the buffer, as compositors and userspace do, seems all right. And yes, sometimes the memory will get imported into the GPU very efficiently, sometimes something in the mid range, and other times some GPU stack (which is userspace) will memcpy. But remember that the GPU stack is programmed to work with a specific GPU, unlike the higher-level userland.
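For what it's worth, this is roughly how such a buffer becomes usable by GPU shaders in the first place: the dmabuf is imported as an EGLImage (EGL_EXT_image_dma_buf_import) and bound to an external texture that a conversion shader can then sample. A minimal single-plane sketch, with no extension or error checking; fd, width, height, stride and fourcc are assumed to come from the producer:

/* Import a single-plane dmabuf as an EGLImage and bind it to a
 * GL_TEXTURE_EXTERNAL_OES texture (no error/extension checks). */
#include <stdint.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

GLuint import_dmabuf_as_texture(EGLDisplay dpy, int fd, int width, int height,
                                int stride, uint32_t fourcc)
{
    EGLint attribs[] = {
        EGL_WIDTH, width,
        EGL_HEIGHT, height,
        EGL_LINUX_DRM_FOURCC_EXT, (EGLint) fourcc,
        EGL_DMA_BUF_PLANE0_FD_EXT, fd,
        EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
        EGL_DMA_BUF_PLANE0_PITCH_EXT, stride,
        EGL_NONE
    };

    PFNEGLCREATEIMAGEKHRPROC create_image =
        (PFNEGLCREATEIMAGEKHRPROC) eglGetProcAddress("eglCreateImageKHR");
    PFNGLEGLIMAGETARGETTEXTURE2DOESPROC image_target_texture =
        (PFNGLEGLIMAGETARGETTEXTURE2DOESPROC) eglGetProcAddress("glEGLImageTargetTexture2DOES");

    EGLImageKHR image = create_image(dpy, EGL_NO_CONTEXT,
                                     EGL_LINUX_DMA_BUF_EXT, NULL, attribs);

    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
    image_target_texture(GL_TEXTURE_EXTERNAL_OES, image);  /* texture now samples the dmabuf */
    return tex;
}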
The argument that you want to keep the allocation on the codec side is completely false as far as I can see.
I haven't made this argument, and I don't intend to. Nothing in this thread should be interpreted as what I want or don't want. I want the same thing as everyone on this list: both performance and correct results.
We already had numerous projects where we reported this practice as bugs to the GStreamer and FFmpeg projects because it won't work on x86 with dGPUs.
Links? Remember that I do read every single bug and email around the GStreamer project. I maintain both the older and the newer V4L2 support in there. I also contributed a lot to the mechanism GStreamer has in place to reverse the allocation. In fact, it is implemented; the problem is that on generic Linux, the receiving elements, like the GL elements and the display sinks, don't have any API they can rely on to allocate memory. Thus, they don't implement what we call the allocation offer in GStreamer terms. Very often though, on other modern OSes, or with APIs like VA, the memory offer is replaced by a context, so the allocation is done from a "context" which is neither an importer nor an exporter. This is mostly found on macOS and Windows.
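To illustrate what an "allocation offer" means in GStreamer terms: the receiving element answers the ALLOCATION query with a buffer pool and/or an allocator it would like upstream to use. A rough sketch of such a propose_allocation() handler for a GstBaseSink subclass; gst_my_device_allocator_new() is hypothetical and stands in for exactly the kind of device-specific allocation API that is missing on generic Linux:

#include <gst/gst.h>
#include <gst/base/gstbasesink.h>
#include <gst/video/video.h>

/* Rough sketch of a GStreamer "allocation offer": the receiving element
 * proposes a pool and an allocator so that upstream can allocate memory
 * this device can actually use. gst_my_device_allocator_new() is
 * hypothetical. */
static gboolean
gst_my_sink_propose_allocation (GstBaseSink * sink, GstQuery * query)
{
  GstCaps *caps;
  gboolean need_pool;
  GstVideoInfo info;
  GstAllocator *alloc;

  gst_query_parse_allocation (query, &caps, &need_pool);
  if (!caps || !gst_video_info_from_caps (&info, caps))
    return FALSE;

  if (need_pool) {
    GstBufferPool *pool = gst_video_buffer_pool_new ();
    GstStructure *config = gst_buffer_pool_get_config (pool);

    gst_buffer_pool_config_set_params (config, caps, info.size, 2, 0);
    gst_buffer_pool_set_config (pool, config);
    gst_query_add_allocation_pool (query, pool, info.size, 2, 0);
    gst_object_unref (pool);
  }

  /* Offer the device-specific allocator (hypothetical). */
  alloc = gst_my_device_allocator_new ();
  gst_query_add_allocation_param (query, alloc, NULL);
  gst_object_unref (alloc);

  return TRUE;
}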
Were there APIs suggested to actually make it manageable for userland to allocate from the GPU? Yes, that is what the Linux Device Allocator idea is for. Is that API ready? No.
Can we at least implement some DRM memory allocation? Yes, but remember that, until very recently, the DRM driver used by the display path was not exposed through Wayland. This has only been resolved recently, and it will take some time before it propagates through compositor code. And you need the compositor implementation before you can do the GL and multimedia stack implementations. Please keep in mind, before calling out bad practice by GStreamer and FFmpeg developers, that getting all the bits and pieces in place requires back and forth, and there have been huge gaps that these developers have not been able to overcome yet. Also, remember that these stacks don't have any contract to support Linux. They support it to the best of their knowledge and capabilities, along with Windows, macOS, iOS, Android and more. And in my experience, memory sharing has different challenges on all of these OSes.
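For completeness, one thing generic userspace can already do today is allocate on the GPU/DRM side through GBM and hand the result around as a dmabuf; this is roughly the shape such a DRM allocation path takes. A minimal sketch, with no error handling; the render node path, size, format and usage flags are placeholders:

/* Allocate a buffer from the GPU side via GBM and export it as a dmabuf. */
#include <fcntl.h>
#include <gbm.h>

int allocate_from_gpu(int *out_dmabuf_fd)
{
    int drm_fd = open("/dev/dri/renderD128", O_RDWR);      /* placeholder node */
    struct gbm_device *gbm = gbm_create_device(drm_fd);

    struct gbm_bo *bo = gbm_bo_create(gbm, 1920, 1080,
                                      GBM_FORMAT_XRGB8888,
                                      GBM_BO_USE_LINEAR);  /* keep it importable */
    *out_dmabuf_fd = gbm_bo_get_fd(bo);                    /* dmabuf fd */
    return drm_fd;
}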
This is just a software solution which works because of coincidence and not because of engineering.
This is another argument I can't really agree with. There is a lot of effort put into fallbacks (mostly GPU fallbacks) in the various software stacks. These fallbacks are engineered to guarantee that you can display your frames. The case I've raised should have ended well with a GPU/CPU fallback, but a kernel bug broke the ability to fall back. If the kernel had rejected the import (your series?), or offered a bracketing mechanism (for the UVC case, both methods would have worked), the end result would have just worked.
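As a side note, a bracketing notion already exists at the uAPI level for CPU access through mmap: DMA_BUF_IOCTL_SYNC, which lets the exporter manage caches around the access. It is not the kernel-internal mechanism being discussed for the UVC case, but it shows the shape of the idea. A minimal sketch, no error handling; dmabuf_fd and len are assumed to come from elsewhere:

/* Bracketed CPU access to a mmap()ed dmabuf using DMA_BUF_IOCTL_SYNC. */
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/dma-buf.h>

static void cpu_write_bracketed(int dmabuf_fd, size_t len)
{
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, dmabuf_fd, 0);

    struct dma_buf_sync sync = { .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE };
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);   /* begin CPU access */

    memset(map, 0, len);                           /* ... CPU writes ... */

    sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);   /* end CPU access */

    munmap(map, len);
}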
I would not disagree if someone stated that DMABuf support in the UVC driver is an abuse. The driver simply memcpys chunks of variable-size data streamed by the USB camera into a normal memory buffer. So why is that exported as a dmabuf? I don't have a strong opinion on that, but if you think this is wrong, then it proves my point that this is a kernel bug. The challenge here is to come up with how we will fix this, and sharing a good understanding of what today's userspace does, and why it does so, is key to making proper designs. As I stated, writing code for the DMABuf subsystem is out of reach for me; I can only share what existing software does, and why it does it like this.
Nicolas