On Fri, 14 Feb 2025 at 21:19, Boris Brezillon boris.brezillon@collabora.com wrote:
On Fri, 14 Feb 2025 18:37:14 +0530 Sumit Garg sumit.garg@linaro.org wrote:
On Fri, 14 Feb 2025 at 15:37, Jens Wiklander jens.wiklander@linaro.org wrote:
Hi,
On Thu, Feb 13, 2025 at 6:39 PM Daniel Stone daniel@fooishbar.org wrote:
Hi,
On Thu, 13 Feb 2025 at 15:57, Jens Wiklander jens.wiklander@linaro.org wrote:
On Thu, Feb 13, 2025 at 3:05 PM Daniel Stone daniel@fooishbar.org wrote:
But just because TEE is one good backend implementation, doesn't mean it should be the userspace ABI. Why should userspace care that TEE has mediated the allocation instead of it being a predefined range within DT?
The TEE may very well use a predefined range; that part is abstracted by the interface.
Of course. But you can also (and this has been shipped on real devices) handle this without any per-allocation TEE needs by simply allocating from a memory range which is predefined within DT.
From the userspace point of view, why should there be one ABI to allocate memory from a predefined range which is delivered by DT to the kernel, and one ABI to allocate memory from a predefined range which is mediated by TEE?
We need some way to specify the protection profile (or use case as I've called it in the ABI) required for the buffer. Whether it's defined in DT seems irrelevant.
What advantage does userspace get from having to have a different codepath to get a different handle to memory? What about x86?
I think this proposal is looking at it from the wrong direction. Instead of working upwards from the implementation to userspace, start with userspace and work downwards. The interesting property to focus on is allocating memory, not that EL1 is involved behind the scenes.
From what I've gathered from earlier discussions, it wasn't much of a problem for userspace to handle this. If the kernel were to provide it via a different ABI, how would it be easier to implement in the kernel? I think we need an example to understand your suggestion.
It is a problem for userspace, because we need to expose acceptable parameters for allocation through the entire stack. If you look at the dmabuf documentation in the kernel for how buffers should be allocated and exchanged, you can see the negotiation flow for modifiers. This permeates through KMS, EGL, Vulkan, Wayland, GStreamer, and more.
What dma-buf properties are you referring to? dma_heap_ioctl_allocate() accepts a few flags for the resulting file descriptor and no flags for the heap itself.
Standardising on heaps allows us to add those in a similar way.
How would you solve this with heaps? Would you use one heap for each protection profile (use case), add heap_flags, or do a bit of both?
I would say one heap per-profile.
And then it would have a per-vendor multiplication factor, as each vendor enforces memory restrictions in a platform-specific manner, which won't scale.
Christian gave an historical background here [1] as to why that hasn't worked in the past with DMA heaps given the scalability issues.
[1] https://lore.kernel.org/dri-devel/e967e382-6cca-4dee-8333-39892d532f71@gmail...
Hm, I fail to see where Christian dismisses the dma-heaps solution in this email. He even says:
If the memory is not physically attached to any device, but rather just memory attached to the CPU or a system wide memory controller then expose the memory as DMA-heap with specific requirements (e.g. certain sized pages, contiguous, restricted, encrypted, ...).
I am not saying Christian dismissed DMA heaps, but rather that scalability is an issue. What we are proposing here is a generic interface via TEE to the firmware/Trusted OS, which can perform all the platform-specific memory restrictions. This solution will scale across vendors.
If we have to add different allocation mechanisms, then the complexity increases, permeating not only into all the different userspace APIs, but also into the drivers which need to support every different allocation mechanism even if they have no opinion on it - e.g. Mali doesn't care in any way whether the allocation comes from a heap or TEE or ACPI or whatever, it cares only that the memory is protected.
Does that help?
I think you're missing the stage where an unprotected buffer is received and decrypted into a protected buffer. If you use the TEE for decryption, or to configure the involved devices for the use case, it makes sense to let the TEE allocate the buffers too. A TEE doesn't have to be an OS in the secure world; it can be an abstraction to support the use case, depending on the design. So the restricted buffer is already allocated before we reach Mali in your example.
Allocating restricted buffers from the TEE subsystem saves us from maintaining proxy dma-buf heaps.
Honestly, when I look at dma-heap implementations, they seem to be trivial shells around existing (more complex) allocators, and the boilerplate [1] to expose a dma-heap is relatively small. The dma-buf implementation you already have, so we're talking about a hundred lines of code to maintain, which shouldn't be significantly more than what you have for the new ioctl(), to be honest.
It will rather be redundant vendor-specific code under DMA heaps calling into the firmware/Trusted OS to enforce memory restrictions, as you can see in the Mediatek example [1]. With the TEE subsystem managing that, it won't be the case, as we will provide a common abstraction for the communication with the underlying firmware/Trusted OS.
[1] https://lore.kernel.org/linux-arm-kernel/20240515112308.10171-1-yong.wu@medi...
And I'll insist on what Daniel said: it's a small price to pay to have a standard interface to expose to userspace. If dma-heaps are not used for this kind of thing, I honestly wonder what they will be used for...
Let's try not to forcefully find a use-case for DMA heaps when there is a better alternative available. I am still failing to see why you don't consider the following a standardised user-space interface:
"When user-space has to work with restricted memory, ask the TEE device to allocate it"
-Sumit
Hi Sumit,
On Mon, 17 Feb 2025 at 06:13, Sumit Garg sumit.garg@linaro.org wrote:
On Fri, 14 Feb 2025 at 21:19, Boris Brezillon boris.brezillon@collabora.com wrote:
I would say one heap per-profile.
And then it would have a per-vendor multiplication factor, as each vendor enforces memory restrictions in a platform-specific manner, which won't scale.
Yes, they do enforce it in a platform-specific manner, but so does TEE. There is no one golden set of semantics which is globally applicable between all hardware and all products in a useful manner.
So, if we define protected,secure-video + protected,secure-video-record + protected,trusted-ui heap names, we have exactly the same number of axes. The only change is from uint32_t to string.
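Under that scheme, userspace would discover a profile's heap purely by name. A sketch of such a probe follows; the "protected,<profile>" names are the proposal in this thread, not an upstream convention:

```c
/*
 * Sketch: probing for a per-profile dma-heap by name. The heap names
 * assumed here ("protected,secure-video" etc.) follow the naming
 * proposed in this thread and are not an upstream convention.
 */
#include <dirent.h>
#include <string.h>

/* Return 1 if a heap node with the given name exists under dirpath. */
static int heap_exists(const char *dirpath, const char *name)
{
	DIR *d = opendir(dirpath);
	struct dirent *e;
	int found = 0;

	if (!d)
		return 0;
	while ((e = readdir(d)))
		if (strcmp(e->d_name, name) == 0)
			found = 1;
	closedir(d);
	return found;
}
```

A media stack would then call, e.g., heap_exists("/dev/dma_heap", "protected,secure-video") and fall back through its preference list, much like modifier negotiation.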
Christian gave an historical background here [1] as to why that hasn't worked in the past with DMA heaps given the scalability issues.
[1] https://lore.kernel.org/dri-devel/e967e382-6cca-4dee-8333-39892d532f71@gmail...
Hm, I fail to see where Christian dismisses the dma-heaps solution in this email. He even says:
If the memory is not physically attached to any device, but rather just memory attached to the CPU or a system wide memory controller then expose the memory as DMA-heap with specific requirements (e.g. certain sized pages, contiguous, restricted, encrypted, ...).
I am not saying Christian dismissed DMA heaps, but rather that scalability is an issue. What we are proposing here is a generic interface via TEE to the firmware/Trusted OS, which can perform all the platform-specific memory restrictions. This solution will scale across vendors.
I read something completely different into Christian's mail.
What Christian is saying is that injecting generic constraint solving into the kernel doesn't scale. It's not OK to build out generic infrastructure in the kernel which queries a bunch of leaf drivers and attempts to somehow come up with something which satisfies userspace-provided constraints.
But this isn't the same thing as saying 'dma-heaps is wrong'! Again, there is no additional complexity in the kernel between a dma-heap which bridges over to TEE, and a TEE userspace interface which also bridges over to TEE. Both of them are completely fine according to what he's said.
Honestly, when I look at dma-heap implementations, they seem to be trivial shells around existing (more complex) allocators, and the boilerplate [1] to expose a dma-heap is relatively small. The dma-buf implementation you already have, so we're talking about a hundred lines of code to maintain, which shouldn't be significantly more than what you have for the new ioctl(), to be honest.
It will rather be redundant vendor-specific code under DMA heaps calling into the firmware/Trusted OS to enforce memory restrictions, as you can see in the Mediatek example [1]. With the TEE subsystem managing that, it won't be the case, as we will provide a common abstraction for the communication with the underlying firmware/Trusted OS.
Yes, it's common for everyone who uses TEE to implement SVP. It's not common for the people who do _not_ use TEE to implement SVP. Which means that userspace has to type out both, and what we're asking in this thread is: why?
Why should userspace have to support dma-heap allocation for platforms supporting SVP via a static DT-defined carveout as well as supporting TEE API allocation for platforms supporting SVP via a dynamic carveout? What benefit does it bring to have this surfaced as a completely separate uAPI?
And I'll insist on what Daniel said: it's a small price to pay to have a standard interface to expose to userspace. If dma-heaps are not used for this kind of thing, I honestly wonder what they will be used for...
Let's try not to forcefully find a use-case for DMA heaps when there is a better alternative available.
What makes it better? If you could explain very clearly the benefit userspace will gain from asking TEE to allocate $n bytes for TEE_IOC_UC_SECURE_VIDEO_PLAY, compared to asking dma-heap to allocate $n bytes for protected,secure-video, I think that would really help. Right now, I don't understand how it would be better in any way whatsoever for userspace. And I think your decision to implement it as a separate API is based on a misunderstanding of Christian's position.
I am still failing to see why you don't consider the following a standardised user-space interface:
"When user-space has to work with restricted memory, ask the TEE device to allocate it"
As far as I can tell, having userspace work with the TEE interface brings zero benefit (again, please correct me if I'm wrong and explain how it's better). The direct cost - call it a disbenefit - it brings is that we have to spend a pile of time typing out support for TEE allocation in every media/GPU/display driver/application, and when we do any kind of negotiation, we have to have one protocol definition for TEE and one for non-TEE.
dma-heaps was created to solve the problem of having too many 'allocate $n bytes from $specialplace' uAPIs. The proliferation was painful and made it difficult for userspace to do what it needed to do. Userspace doesn't _yet_ make full use of it, but the solution is to make userspace make full use of it, not to go create entirely separate allocation paths for unclear reasons.
Besides, I'm writing this from a platform that implements SVP not via TEE. I've worked on platforms which implement SVP without any TEE, where the TEE implementation would be at best a no-op stub, and at worst flat-out impossible.
So that's 'why not TEE as the single uAPI for SVP'. So, again, let's please turn this around: _why_ TEE? Who benefits from exposing this as completely separate to the more generic uAPI that we specifically designed to handle things like this?
Cheers, Daniel
On Tue, 18 Feb 2025 at 21:52, Daniel Stone daniel@fooishbar.org wrote:
Hi Sumit,
On Mon, 17 Feb 2025 at 06:13, Sumit Garg sumit.garg@linaro.org wrote:
On Fri, 14 Feb 2025 at 21:19, Boris Brezillon boris.brezillon@collabora.com wrote:
I would say one heap per-profile.
And then it would have a per-vendor multiplication factor, as each vendor enforces memory restrictions in a platform-specific manner, which won't scale.
Yes, they do enforce it in a platform-specific manner, but so does TEE. There is no one golden set of semantics which is globally applicable between all hardware and all products in a useful manner.
So, if we define protected,secure-video + protected,secure-video-record + protected,trusted-ui heap names, we have exactly the same number of axes. The only change is from uint32_t to string.
Christian gave an historical background here [1] as to why that hasn't worked in the past with DMA heaps given the scalability issues.
[1] https://lore.kernel.org/dri-devel/e967e382-6cca-4dee-8333-39892d532f71@gmail...
Hm, I fail to see where Christian dismisses the dma-heaps solution in this email. He even says:
If the memory is not physically attached to any device, but rather just memory attached to the CPU or a system wide memory controller then expose the memory as DMA-heap with specific requirements (e.g. certain sized pages, contiguous, restricted, encrypted, ...).
I am not saying Christian dismissed DMA heaps, but rather that scalability is an issue. What we are proposing here is a generic interface via TEE to the firmware/Trusted OS, which can perform all the platform-specific memory restrictions. This solution will scale across vendors.
I read something completely different into Christian's mail.
What Christian is saying is that injecting generic constraint solving into the kernel doesn't scale. It's not OK to build out generic infrastructure in the kernel which queries a bunch of leaf drivers and attempts to somehow come up with something which satisfies userspace-provided constraints.
But this isn't the same thing as saying 'dma-heaps is wrong'! Again, there is no additional complexity in the kernel between a dma-heap which bridges over to TEE, and a TEE userspace interface which also bridges over to TEE. Both of them are completely fine according to what he's said.
Honestly, when I look at dma-heap implementations, they seem to be trivial shells around existing (more complex) allocators, and the boilerplate [1] to expose a dma-heap is relatively small. The dma-buf implementation you already have, so we're talking about a hundred lines of code to maintain, which shouldn't be significantly more than what you have for the new ioctl(), to be honest.
It will rather be redundant vendor-specific code under DMA heaps calling into the firmware/Trusted OS to enforce memory restrictions, as you can see in the Mediatek example [1]. With the TEE subsystem managing that, it won't be the case, as we will provide a common abstraction for the communication with the underlying firmware/Trusted OS.
Yes, it's common for everyone who uses TEE to implement SVP. It's not common for the people who do _not_ use TEE to implement SVP. Which means that userspace has to type out both, and what we're asking in this thread is: why?
Why should userspace have to support dma-heap allocation for platforms supporting SVP via a static DT-defined carveout as well as supporting TEE API allocation for platforms supporting SVP via a dynamic carveout? What benefit does it bring to have this surfaced as a completely separate uAPI?
And I'll insist on what Daniel said: it's a small price to pay to have a standard interface to expose to userspace. If dma-heaps are not used for this kind of thing, I honestly wonder what they will be used for...
Let's try not to forcefully find a use-case for DMA heaps when there is a better alternative available.
What makes it better? If you could explain very clearly the benefit userspace will gain from asking TEE to allocate $n bytes for TEE_IOC_UC_SECURE_VIDEO_PLAY, compared to asking dma-heap to allocate $n bytes for protected,secure-video, I think that would really help. Right now, I don't understand how it would be better in any way whatsoever for userspace. And I think your decision to implement it as a separate API is based on a misunderstanding of Christian's position.
I am still failing to see why you don't consider the following a standardised user-space interface:
"When user-space has to work with restricted memory, ask the TEE device to allocate it"
As far as I can tell, having userspace work with the TEE interface brings zero benefit (again, please correct me if I'm wrong and explain how it's better). The direct cost - call it a disbenefit - it brings is that we have to spend a pile of time typing out support for TEE allocation in every media/GPU/display driver/application, and when we do any kind of negotiation, we have to have one protocol definition for TEE and one for non-TEE.
dma-heaps was created to solve the problem of having too many 'allocate $n bytes from $specialplace' uAPIs. The proliferation was painful and made it difficult for userspace to do what it needed to do. Userspace doesn't _yet_ make full use of it, but the solution is to make userspace make full use of it, not to go create entirely separate allocation paths for unclear reasons.
Besides, I'm writing this from a platform that implements SVP not via TEE. I've worked on platforms which implement SVP without any TEE, where the TEE implementation would be at best a no-op stub, and at worst flat-out impossible.
Can you elaborate on the non-TEE use-case for Secure Video Path (SVP) a bit more? How does the protected/encrypted media content pipeline work? Which architecture support does your use-case require? Is there any higher-privileged firmware interaction required to perform media content decryption into restricted memory? Do you plan to upstream the corresponding support in the near future?
Let me try to elaborate on the Secure Video Path (SVP) flow requiring a TEE implementation (in general terms, a higher-privileged firmware managing the pipeline, as the kernel/user-space has no access permissions to the plain-text media content):
- Firstly, a content decryption key is securely provisioned into the TEE implementation.
- Interaction with the TEE to set up access permissions of the different peripherals in the media pipeline so that they can access restricted memory.
- Interaction with the TEE to allocate restricted memory buffers.
- Interaction with the TEE to decrypt downloaded encrypted media content from normal memory buffers into restricted memory buffers.
- Then the rest of the media pipeline is able to process the plain media content in restricted buffers and display it.
So that's 'why not TEE as the single uAPI for SVP'.
Let's try to see if your SVP use-case really converges with TEE-based SVP such that we really need a single uAPI.
So, again, let's please turn this around: _why_ TEE? Who benefits from exposing this as completely separate to the more generic uAPI that we specifically designed to handle things like this?
The bridging between DMA heaps and TEE would still require user-space to perform an ioctl() into the TEE to register the DMA-bufs, as you can see here [1]. Then there will rather be two handles for user-space to manage. Similarly, during restricted memory allocation/free, we need another glue layer under DMA heaps to the TEE subsystem.
The reason, which has been iterated over many times in past threads, is simply that:
"If user-space has to interact with a TEE device for the SVP use-case, then why isn't it better to ask the TEE to allocate restricted DMA-bufs too?"
[1] https://lkml.indiana.edu/hypermail/linux/kernel/2408.3/08296.html
-Sumit
Hi Sumit,
On Fri, 21 Feb 2025 at 11:24, Sumit Garg sumit.garg@linaro.org wrote:
On Tue, 18 Feb 2025 at 21:52, Daniel Stone daniel@fooishbar.org wrote:
dma-heaps was created to solve the problem of having too many 'allocate $n bytes from $specialplace' uAPIs. The proliferation was painful and made it difficult for userspace to do what it needed to do. Userspace doesn't _yet_ make full use of it, but the solution is to make userspace make full use of it, not to go create entirely separate allocation paths for unclear reasons.
Besides, I'm writing this from a platform that implements SVP not via TEE. I've worked on platforms which implement SVP without any TEE, where the TEE implementation would be at best a no-op stub, and at worst flat-out impossible.
Can you elaborate on the non-TEE use-case for Secure Video Path (SVP) a bit more? How does the protected/encrypted media content pipeline work? Which architecture support does your use-case require? Is there any higher-privileged firmware interaction required to perform media content decryption into restricted memory? Do you plan to upstream the corresponding support in the near future?
You can see the MTK SVP patches on list which use the MTK SMC to mediate it.
There are TI Jacinto platforms which implement a 'secure' area configured statically by (IIRC) BL2, with static permissions defined for each AXI endpoint, e.g. CPU write + codec RW + dispc read. I've heard of another SoC vendor doing the same, but I don't think I can share those details. There is no TEE interaction.
I'm writing this message from an AMD laptop which implements restricted content paths outside of TEE. I don't have the full picture of how SVP is implemented on AMD systems, but I do know that I don't have any TEE devices exposed.
Let me try to elaborate on the Secure Video Path (SVP) flow requiring a TEE implementation (in general terms, a higher-privileged firmware managing the pipeline, as the kernel/user-space has no access permissions to the plain-text media content):
- [...]
Yeah, I totally understand the TEE usecase. I think that TEE is a good design to implement this. I think that TEE should be used for SVP where it makes sense.
Please understand that I am _not_ arguing that no-one should use TEE for SVP!
So, again, let's please turn this around: _why_ TEE? Who benefits from exposing this as completely separate to the more generic uAPI that we specifically designed to handle things like this?
The bridging between DMA heaps and TEE would still require user-space to perform an ioctl() into the TEE to register the DMA-bufs, as you can see here [1]. Then there will rather be two handles for user-space to manage.
Yes, the decoder would need to do this. That's common though: if you want to share a buffer between V4L2 and DRM, you have three handles: the V4L2 buffer handle, the DRM GEM handle, and the dmabuf you use to bridge the two.
Similarly, during restricted memory allocation/free, we need another glue layer under DMA heaps to the TEE subsystem.
Yep.
The reason, which has been iterated over many times in past threads, is simply that:
"If user-space has to interact with a TEE device for the SVP use-case, then why isn't it better to ask the TEE to allocate restricted DMA-bufs too?"
The first word in your proposition is load-bearing.
Build out the usecase a little more here. You have a DRMed video stream coming in, which you need to decode (involving TEE for this usecase). You get a dmabuf handle to the decoded frame. You need to pass the dmabuf across to the Wayland compositor. The compositor needs to pass it to EGL/Vulkan to import and do composition, which in turn passes it to the GPU DRM driver. The output of the composition is in turn shared between the GPU DRM driver and the separate KMS DRM driver, with the involvement of GBM.
For the platforms I'm interested in, the GPU DRM driver needs to switch into protected mode, which has no involvement at all with TEE - it's architecturally impossible to have TEE involved without moving most of the GPU driver into TEE and destroying performance. The display hardware also needs to engage protected mode, which again has no involvement with TEE and again would need to have half the driver moved into TEE for no benefit in order to do so. The Wayland compositor also has no interest in TEE: it tells the GPU DRM driver about the protected status of its buffers, and that's it.
What these components _are_ opinionated about, is the way buffers are allocated and managed. We built out dmabuf modifiers for this usecase, and we have a good negotiation protocol around that. We also really care about buffer placement in some usecases - e.g. some display/codec hardware requires buffers to be sourced from contiguous memory, other hardware needs to know that when it shares buffers with another device, it needs to place the buffers outside of inaccessible/slow local RAM. So we built out dma-heaps, so every part of the component in the stack can communicate their buffer-placement needs in the same way as we do modifiers, and negotiate an acceptable allocation.
That's my starting point for this discussion. We have a mechanism to deal with the fact that buffers need to be shared between different IP blocks which have their own constraints on buffer placement, avoiding the current problem of having every subsystem reinvent its own allocation uAPI, which was burying us in impedance mismatch and confusion. That mechanism is dma-heaps. It seems like your starting point for this discussion is that you've implemented a TEE-centric design for SVP, and so all of userspace should bypass our existing cross-subsystem special-purpose allocation mechanism and write specifically to one implementation. I believe that is a massive step backwards and an immediate introduction of technical debt.
Again, having an implementation of SVP via TEE makes a huge amount of sense. Having _most_ SVP implementations via TEE still makes a lot of sense. Having _all_ SVP implementations eventually be via TEE would still make sense. But even if we were at that point - which we aren't - it still doesn't justify telling userspace 'use the generic dma-heap uAPI for every device-specific allocation constraint, apart from SVP which has a completely different way to allocate some bytes'.
Cheers, Daniel