tldr; DMA buffers aren't normal memory: expecting that you can use them like that (like calling get_user_pages works, or that they're accounted like any other normal memory) cannot be guaranteed.
Since some userspace only runs on integrated devices, where all buffers are actually resident system memory, there's a huge temptation to assume that a struct page is always present and usable like for any other pagecache backed mmap. This has the potential to result in a uapi nightmare.
To close this gap, require that DMA buffer mmaps are VM_PFNMAP, which blocks get_user_pages and all the other struct page based infrastructure for everyone. In spirit this is the uapi counterpart to the kernel-internal CONFIG_DMABUF_DEBUG.
Motivated by a recent patch which wanted to switch the system dma-buf heap to vm_insert_page instead of vm_insert_pfn.
v2:
Jason brought up that we also want to guarantee that all ptes have the pte_special flag set, to catch fast get_user_pages (on architectures that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
From auditing the various functions to insert pfn pte entries (vm_insert_pfn_prot, remap_pfn_range and all its callers like dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so this should be the correct flag to check for.
v3: Change to WARN_ON_ONCE (Thomas Zimmermann)
References: https://lore.kernel.org/lkml/CAKMK7uHi+mG0z0HUmNt13QCCvutuRVjpcR0NjRL12k-WbW...
Acked-by: Christian König christian.koenig@amd.com
Acked-by: Thomas Zimmermann tzimmermann@suse.de
Cc: Thomas Zimmermann tzimmermann@suse.de
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Suren Baghdasaryan surenb@google.com
Cc: Matthew Wilcox willy@infradead.org
Cc: John Stultz john.stultz@linaro.org
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Cc: Sumit Semwal sumit.semwal@linaro.org
Cc: "Christian König" christian.koenig@amd.com
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
--
Ok I entirely forgot about this patch but stumbled over it and checked what's up with it now. I think it's ready now for merging:
- shmem helper patches to fix up vgem landed
- ttm has been fixed since a while
- I don't think we've had any other open issues
Time to lock down this uapi contract for real?
-Daniel
---
 drivers/dma-buf/dma-buf.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index b6c36914e7c6..88718665c3c3 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -150,6 +150,8 @@ static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma)
 	ret = dmabuf->ops->mmap(dmabuf, vma);
 	dma_resv_unlock(dmabuf->resv);
+	WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));
+
 	return ret;
 }
@@ -1495,6 +1497,8 @@ int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma,
 	ret = dmabuf->ops->mmap(dmabuf, vma);
 	dma_resv_unlock(dmabuf->resv);
+	WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));
+
 	return ret;
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_mmap, DMA_BUF);
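For illustration, a minimal exporter-side mmap sketch (exporter names are hypothetical, and it assumes physically contiguous backing memory) of the kind of mapping that satisfies this check: remap_pfn_range() marks the vma VM_PFNMAP, so the WARN_ON_ONCE() above stays silent.

static int my_heap_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
{
	struct my_heap_buffer *buf = dmabuf->priv;	/* hypothetical exporter state */
	unsigned long size = vma->vm_end - vma->vm_start;

	/* reject mmaps that reach past the end of the buffer */
	if (vma->vm_pgoff + vma_pages(vma) > buf->pagecount)
		return -EINVAL;

	/*
	 * remap_pfn_range() sets VM_PFNMAP (plus VM_IO, VM_DONTEXPAND and
	 * VM_DONTDUMP), so gup and the rest of the struct page based
	 * machinery is blocked for this mapping.
	 */
	return remap_pfn_range(vma, vma->vm_start,
			       buf->start_pfn + vma->vm_pgoff,
			       size, vma->vm_page_prot);
}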
On Tue, Nov 22, 2022 at 06:08:00PM +0100, Daniel Vetter wrote:
tldr; DMA buffers aren't normal memory: expecting that you can use them like that (like calling get_user_pages works, or that they're accounted like any other normal memory) cannot be guaranteed.
Since some userspace only runs on integrated devices, where all buffers are actually resident system memory, there's a huge temptation to assume that a struct page is always present and usable like for any other pagecache backed mmap. This has the potential to result in a uapi nightmare.
To close this gap, require that DMA buffer mmaps are VM_PFNMAP, which blocks get_user_pages and all the other struct page based infrastructure for everyone. In spirit this is the uapi counterpart to the kernel-internal CONFIG_DMABUF_DEBUG.
Motivated by a recent patch which wanted to switch the system dma-buf heap to vm_insert_page instead of vm_insert_pfn.
v2:
Jason brought up that we also want to guarantee that all ptes have the pte_special flag set, to catch fast get_user_pages (on architectures that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
From auditing the various functions to insert pfn pte entries (vm_insert_pfn_prot, remap_pfn_range and all its callers like dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so this should be the correct flag to check for.
I didn't look at how this actually gets used, but it is a bit of a pain to insert a lifetime controlled object like a struct page as a special PTE/VM_PFNMAP
How is the lifetime model implemented here? How do you know when userspace has finally unmapped the page?
Jason
On Tue, 22 Nov 2022 at 19:04, Jason Gunthorpe jgg@ziepe.ca wrote:
On Tue, Nov 22, 2022 at 06:08:00PM +0100, Daniel Vetter wrote:
tldr; DMA buffers aren't normal memory: expecting that you can use them like that (like calling get_user_pages works, or that they're accounted like any other normal memory) cannot be guaranteed.
Since some userspace only runs on integrated devices, where all buffers are actually resident system memory, there's a huge temptation to assume that a struct page is always present and usable like for any other pagecache backed mmap. This has the potential to result in a uapi nightmare.
To close this gap, require that DMA buffer mmaps are VM_PFNMAP, which blocks get_user_pages and all the other struct page based infrastructure for everyone. In spirit this is the uapi counterpart to the kernel-internal CONFIG_DMABUF_DEBUG.
Motivated by a recent patch which wanted to switch the system dma-buf heap to vm_insert_page instead of vm_insert_pfn.
v2:
Jason brought up that we also want to guarantee that all ptes have the pte_special flag set, to catch fast get_user_pages (on architectures that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
From auditing the various functions to insert pfn pte entries (vm_insert_pfn_prot, remap_pfn_range and all its callers like dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so this should be the correct flag to check for.
I didn't look at how this actually gets used, but it is a bit of a pain to insert a lifetime controlled object like a struct page as a special PTE/VM_PFNMAP
How is the lifetime model implemented here? How do you know when userspace has finally unmapped the page?
The vma has a filp which is the refcounted dma_buf. With dma_buf you never get an individual page, it's always the entire object. And it's up to the allocator how exactly it wants to use or not use the page's refcount. So if gup goes in and elevates the refcount, you can break stuff, which is why I'm doing this. -Daniel
On Tue, Nov 22, 2022 at 07:08:25PM +0100, Daniel Vetter wrote:
On Tue, 22 Nov 2022 at 19:04, Jason Gunthorpe jgg@ziepe.ca wrote:
On Tue, Nov 22, 2022 at 06:08:00PM +0100, Daniel Vetter wrote:
tldr; DMA buffers aren't normal memory: expecting that you can use them like that (like calling get_user_pages works, or that they're accounted like any other normal memory) cannot be guaranteed.
Since some userspace only runs on integrated devices, where all buffers are actually resident system memory, there's a huge temptation to assume that a struct page is always present and usable like for any other pagecache backed mmap. This has the potential to result in a uapi nightmare.
To close this gap, require that DMA buffer mmaps are VM_PFNMAP, which blocks get_user_pages and all the other struct page based infrastructure for everyone. In spirit this is the uapi counterpart to the kernel-internal CONFIG_DMABUF_DEBUG.
Motivated by a recent patch which wanted to switch the system dma-buf heap to vm_insert_page instead of vm_insert_pfn.
v2:
Jason brought up that we also want to guarantee that all ptes have the pte_special flag set, to catch fast get_user_pages (on architectures that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
From auditing the various functions to insert pfn pte entries (vm_insert_pfn_prot, remap_pfn_range and all its callers like dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so this should be the correct flag to check for.
I didn't look at how this actually gets used, but it is a bit of a pain to insert a lifetime controlled object like a struct page as a special PTE/VM_PFNMAP
How is the lifetime model implemented here? How do you know when userspace has finally unmapped the page?
The vma has a filp which is the refcounted dma_buf. With dma_buf you never get an individual page, it's always the entire object. And it's up to the allocator how exactly it wants to use or not use the page's refcount. So if gup goes in and elevates the refcount, you can break stuff, which is why I'm doing this.
But how does move work?
Jason
On Tue, 22 Nov 2022 at 19:50, Jason Gunthorpe jgg@ziepe.ca wrote:
On Tue, Nov 22, 2022 at 07:08:25PM +0100, Daniel Vetter wrote:
On Tue, 22 Nov 2022 at 19:04, Jason Gunthorpe jgg@ziepe.ca wrote:
On Tue, Nov 22, 2022 at 06:08:00PM +0100, Daniel Vetter wrote:
tldr; DMA buffers aren't normal memory: expecting that you can use them like that (like calling get_user_pages works, or that they're accounted like any other normal memory) cannot be guaranteed.
Since some userspace only runs on integrated devices, where all buffers are actually resident system memory, there's a huge temptation to assume that a struct page is always present and usable like for any other pagecache backed mmap. This has the potential to result in a uapi nightmare.
To close this gap, require that DMA buffer mmaps are VM_PFNMAP, which blocks get_user_pages and all the other struct page based infrastructure for everyone. In spirit this is the uapi counterpart to the kernel-internal CONFIG_DMABUF_DEBUG.
Motivated by a recent patch which wanted to switch the system dma-buf heap to vm_insert_page instead of vm_insert_pfn.
v2:
Jason brought up that we also want to guarantee that all ptes have the pte_special flag set, to catch fast get_user_pages (on architectures that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
From auditing the various functions to insert pfn pte entries (vm_insert_pfn_prot, remap_pfn_range and all its callers like dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so this should be the correct flag to check for.
I didn't look at how this actually gets used, but it is a bit of a pain to insert a lifetime controlled object like a struct page as a special PTE/VM_PFNMAP
How is the lifetime model implemented here? How do you know when userspace has finally unmapped the page?
The vma has a filp which is the refcounted dma_buf. With dma_buf you never get an individual page, it's always the entire object. And it's up to the allocator how exactly it wants to use or not use the page's refcount. So if gup goes in and elevates the refcount, you can break stuff, which is why I'm doing this.
But how does move work?
You nuke all the ptes. Drivers that move have slightly more than a bare struct file, they also have a struct address_space so that invalidate_mapping_range() works. Refaulting and any coherency issues when a refault races against a dma-buf migration are up to the driver/exporter to handle correctly. None of them rely on struct page the way mm/ does when moving stuff around for compaction/ksm/numa-balancing/whatever. -Daniel
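As a rough sketch of that flow (all names hypothetical, loosely modelled on what ttm-based drivers do): before migrating, the exporter zaps every userspace pte through the address_space it set up for its mmaps, and the fault handler later re-inserts pfns for the new placement.

static void my_exporter_move_notify(struct my_buffer *buf)
{
	/* hypothetical fields: mmap offset and size of this buffer */
	loff_t holebegin = (loff_t)buf->mmap_offset << PAGE_SHIFT;
	loff_t holelen = (loff_t)buf->num_pages << PAGE_SHIFT;

	/*
	 * Zap every pte of every userspace mapping of this buffer; the
	 * next CPU access takes a fresh fault against the new placement.
	 */
	unmap_mapping_range(buf->dev_mapping, holebegin, holelen, 1);

	/* ... the exporter is now free to migrate the backing storage */
}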
On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
You nuke all the ptes. Drivers that move have slightly more than a bare struct file, they also have a struct address_space so that invalidate_mapping_range() works.
Okay, this is one of the ways that this can be made to work correctly, as long as you never allow GUP/GUP_fast to succeed on the PTEs. (this was the DAX mistake)
Jason
On Tue, 22 Nov 2022 at 20:34, Jason Gunthorpe jgg@ziepe.ca wrote:
On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
You nuke all the ptes. Drivers that move have slightly more than a bare struct file, they also have a struct address_space so that invalidate_mapping_range() works.
Okay, this is one of the ways that this can be made to work correctly, as long as you never allow GUP/GUP_fast to succeed on the PTEs. (this was the DAX mistake)
Hence this patch, to enforce that no dma-buf exporter gets this wrong. Which some did, and then blamed bug reporters for the resulting splats :-) One of the things we've reverted was the ttm huge pte support, since that doesn't have the pmd_special flag (yet) and so would let gup_fast through. -Daniel
On 22.11.22 at 20:50, Daniel Vetter wrote:
On Tue, 22 Nov 2022 at 20:34, Jason Gunthorpe jgg@ziepe.ca wrote:
On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
You nuke all the ptes. Drivers that move have slightly more than a bare struct file, they also have a struct address_space so that invalidate_mapping_range() works.
Okay, this is one of the ways that this can be made to work correctly, as long as you never allow GUP/GUP_fast to succeed on the PTEs. (this was the DAX mistake)
Hence this patch, to enforce that no dma-buf exporter gets this wrong. Which some did, and then blamed bug reporters for the resulting splats :-) One of the things we've reverted was the ttm huge pte support, since that doesn't have the pmd_special flag (yet) and so would let gup_fast through.
The problem is not only gup, a lot of people seem to assume that when you are able to grab a reference to a page that the ptes pointing to that page can't change any more. And that's obviously incorrect.
I witnessed tons of discussions about that already. Some customers even modified our code assuming that and then wondered why the heck they ran into data corruption.
It's gotten so bad that I've even proposed intentionally mangling the page reference count on TTM allocated pages: https://patchwork.kernel.org/project/dri-devel/patch/20220927143529.135689-1...
I think it would be better if, instead of having special flags in the ptes and vmas saying that you can't follow them to a page structure, we would add something to the page indicating that you can't grab a reference to it. But this might break some use cases as well.
Regards, Christian.
-Daniel
On Wed, 23 Nov 2022 at 10:06, Christian König christian.koenig@amd.com wrote:
On 22.11.22 at 20:50, Daniel Vetter wrote:
On Tue, 22 Nov 2022 at 20:34, Jason Gunthorpe jgg@ziepe.ca wrote:
On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
You nuke all the ptes. Drivers that move have slightly more than a bare struct file, they also have a struct address_space so that invalidate_mapping_range() works.
Okay, this is one of the ways that this can be made to work correctly, as long as you never allow GUP/GUP_fast to succeed on the PTEs. (this was the DAX mistake)
Hence this patch, to enforce that no dma-buf exporter gets this wrong. Which some did, and then blamed bug reporters for the resulting splats :-) One of the things we've reverted was the ttm huge pte support, since that doesn't have the pmd_special flag (yet) and so would let gup_fast through.
The problem is not only gup, a lot of people seem to assume that when you are able to grab a reference to a page that the ptes pointing to that page can't change any more. And that's obviously incorrect.
I witnessed tons of discussions about that already. Some customers even modified our code assuming that and then wondered why the heck they ran into data corruption.
It's gotten so bad that I've even proposed intentionally mangling the page reference count on TTM allocated pages: https://patchwork.kernel.org/project/dri-devel/patch/20220927143529.135689-1...
Yeah maybe something like this could be applied after we land this patch here. Well maybe should have the same check in gem mmap code to make sure no driver
I think it would be better if, instead of having special flags in the ptes and vmas saying that you can't follow them to a page structure, we would add something to the page indicating that you can't grab a reference to it. But this might break some use cases as well.
Afaik the problem with that is that there's no free page bits left for these debug checks. Plus the pte+vma flags are the flags to make this clear already. -Daniel
On 23.11.22 at 10:30, Daniel Vetter wrote:
On Wed, 23 Nov 2022 at 10:06, Christian König christian.koenig@amd.com wrote:
On 22.11.22 at 20:50, Daniel Vetter wrote:
On Tue, 22 Nov 2022 at 20:34, Jason Gunthorpe jgg@ziepe.ca wrote:
On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
You nuke all the ptes. Drivers that move have slightly more than a bare struct file, they also have a struct address_space so that invalidate_mapping_range() works.
Okay, this is one of the ways that this can be made to work correctly, as long as you never allow GUP/GUP_fast to succeed on the PTEs. (this was the DAX mistake)
Hence this patch, to enforce that no dma-buf exporter gets this wrong. Which some did, and then blamed bug reporters for the resulting splats :-) One of the things we've reverted was the ttm huge pte support, since that doesn't have the pmd_special flag (yet) and so would let gup_fast through.
The problem is not only gup, a lot of people seem to assume that when you are able to grab a reference to a page that the ptes pointing to that page can't change any more. And that's obviously incorrect.
I witnessed tons of discussions about that already. Some customers even modified our code assuming that and then wondered why the heck they ran into data corruption.
It's gotten so bad that I've even proposed intentionally mangling the page reference count on TTM allocated pages: https://patchwork.kernel.org/project/dri-devel/patch/20220927143529.135689-1...
Yeah maybe something like this could be applied after we land this patch here.
I think both should land ASAP. We don't have any other way than to clearly point out incorrect approaches if we want to prevent the resulting data corruption.
Well maybe should have the same check in gem mmap code to make sure no driver
Reads like the sentence is somehow cut off?
I think it would be better if, instead of having special flags in the ptes and vmas saying that you can't follow them to a page structure, we would add something to the page indicating that you can't grab a reference to it. But this might break some use cases as well.
Afaik the problem with that is that there's no free page bits left for these debug checks. Plus the pte+vma flags are the flags to make this clear already.
Maybe a GFP flag to set the page reference count to zero or something like this?
Christian.
-Daniel
On Wed, 23 Nov 2022 at 10:39, Christian König ckoenig.leichtzumerken@gmail.com wrote:
On 23.11.22 at 10:30, Daniel Vetter wrote:
On Wed, 23 Nov 2022 at 10:06, Christian König christian.koenig@amd.com wrote:
On 22.11.22 at 20:50, Daniel Vetter wrote:
On Tue, 22 Nov 2022 at 20:34, Jason Gunthorpe jgg@ziepe.ca wrote:
On Tue, Nov 22, 2022 at 08:29:05PM +0100, Daniel Vetter wrote:
You nuke all the ptes. Drivers that move have slightly more than a bare struct file, they also have a struct address_space so that invalidate_mapping_range() works.
Okay, this is one of the ways that this can be made to work correctly, as long as you never allow GUP/GUP_fast to succeed on the PTEs. (this was the DAX mistake)
Hence this patch, to enforce that no dma-buf exporter gets this wrong. Which some did, and then blamed bug reporters for the resulting splats :-) One of the things we've reverted was the ttm huge pte support, since that doesn't have the pmd_special flag (yet) and so would let gup_fast through.
The problem is not only gup, a lot of people seem to assume that when you are able to grab a reference to a page that the ptes pointing to that page can't change any more. And that's obviously incorrect.
I witnessed tons of discussions about that already. Some customers even modified our code assuming that and then wondered why the heck they ran into data corruption.
It's gotten so bad that I've even proposed intentionally mangling the page reference count on TTM allocated pages: https://patchwork.kernel.org/project/dri-devel/patch/20220927143529.135689-1...
Yeah maybe something like this could be applied after we land this patch here.
I think both should land ASAP. We don't have any other way than to clearly point out incorrect approaches if we want to prevent the resulting data corruption.
Well maybe should have the same check in gem mmap code to make sure no driver
Reads like the sentence is somehow cut off?
Yeah, just wanted to say that we need to make sure in as many places as possible that VM_PFNMAP is set for bo mmaps.
I think it would be better if, instead of having special flags in the ptes and vmas saying that you can't follow them to a page structure, we would add something to the page indicating that you can't grab a reference to it. But this might break some use cases as well.
Afaik the problem with that is that there's no free page bits left for these debug checks. Plus the pte+vma flags are the flags to make this clear already.
Maybe a GFP flag to set the page reference count to zero or something like this?
Hm yeah that might work. I'm not sure what it will all break though? And we'd need to make sure that underflowing the page refcount dies in a backtrace. -Daniel
On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
Maybe a GFP flag to set the page reference count to zero or something like this?
Hm yeah that might work. I'm not sure what it will all break though? And we'd need to make sure that underflowing the page refcount dies in a backtrace.
Mucking with the refcount like this to protect against crazy out of tree drivers seems horrible..
The WARN_ON(page_count(p) != 1) seems like a reasonable thing to do, though you must combine this with the special PTE flag..
Jason
On 23.11.22 at 13:46, Jason Gunthorpe wrote:
On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
Maybe a GFP flag to set the page reference count to zero or something like this?
Hm yeah that might work. I'm not sure what it will all break though? And we'd need to make sure that underflowing the page refcount dies in a backtrace.
Mucking with the refcount like this to protect against crazy out of tree drivers seems horrible..
Well not only out of tree drivers. The in-tree KVM got that horribly wrong as well; they were the latest guys complaining about it.
The WARN_ON(page_count(p) != 1) seems like a reasonable thing to do, though you must combine this with the special PTE flag..
That's not sufficient. The pages are released much later than things actually go wrong. In most cases this WARN_ON here won't hit.
Christian.
Jason
On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
On 23.11.22 at 13:46, Jason Gunthorpe wrote:
On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
Maybe a GFP flag to set the page reference count to zero or something like this?
Hm yeah that might work. I'm not sure what it will all break though? And we'd need to make sure that underflowing the page refcount dies in a backtrace.
Mucking with the refcount like this to protect against crazy out of tree drivers seems horrible..
Well not only out of tree drivers. The in-tree KVM got that horribly wrong as well; they were the latest guys complaining about it.
kvm was taking refs on special PTEs? That seems really unlikely?
The WARN_ON(page_count(p) != 1) seems like a reasonable thing to do, though you must combine this with the special PTE flag..
That's not sufficient. The pages are released much later than things actually go wrong. In most cases this WARN_ON here won't hit.
How so? As long as the page is mapped into the PTE there is no issue with corruption. If dmabuf checks the refcount after it does the unmap mapping range it should catch any bogus pin that might be confused about address coherency.
Jason
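A sketch of the check being suggested here (helper and field names are hypothetical): once all userspace ptes are zapped, any remaining elevated page refcount indicates a bogus gup-style pin.

static void my_exporter_check_pins(struct my_buffer *buf)
{
	unsigned long i;

	/* drop all userspace ptes first, nobody should hold the pages now */
	unmap_mapping_range(buf->dev_mapping,
			    (loff_t)buf->mmap_offset << PAGE_SHIFT,
			    (loff_t)buf->num_pages << PAGE_SHIFT, 1);

	/* an elevated refcount here means someone took an illegal pin */
	for (i = 0; i < buf->num_pages; i++)
		WARN_ON(page_count(buf->pages[i]) != 1);
}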
On 23.11.22 at 13:53, Jason Gunthorpe wrote:
On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
On 23.11.22 at 13:46, Jason Gunthorpe wrote:
On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
Maybe a GFP flag to set the page reference count to zero or something like this?
Hm yeah that might work. I'm not sure what it will all break though? And we'd need to make sure that underflowing the page refcount dies in a backtrace.
Mucking with the refcount like this to protect against crazy out of tree drivers seems horrible..
Well not only out of tree drivers. The in-tree KVM got that horribly wrong as well; they were the latest guys complaining about it.
kvm was taking refs on special PTEs? That seems really unlikely?
Well then look at this code here:
commit add6a0cd1c5ba51b201e1361b05a5df817083618
Author: Paolo Bonzini pbonzini@redhat.com
Date:   Tue Jun 7 17:51:18 2016 +0200
KVM: MMU: try to fix up page faults before giving up
The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler then can use remap_pfn_range to place some non-reserved pages in the VMA.
This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn and fixup_user_fault together help supporting it. The patch also supports VM_MIXEDMAP vmas where the pfns are not reserved and thus subject to reference counting.
Cc: Xiao Guangrong guangrong.xiao@linux.intel.com
Cc: Andrea Arcangeli aarcange@redhat.com
Cc: Radim Krčmář rkrcmar@redhat.com
Tested-by: Neo Jia cjia@nvidia.com
Reported-by: Kirti Wankhede kwankhede@nvidia.com
Signed-off-by: Paolo Bonzini pbonzini@redhat.com
And see also the discussion here: https://patchwork.freedesktop.org/patch/414123/
as well as here: https://patchwork.freedesktop.org/patch/499190/
I can't count how often I have pointed out that this is absolutely illegal and KVM can't touch pages in VMAs with VM_PFNMAP.
The WARN_ON(page_count(p) != 1) seems like a reasonable thing to do, though you must combine this with the special PTE flag..
That's not sufficient. The pages are released much later than things actually go wrong. In most cases this WARN_ON here won't hit.
How so? As long as the page is mapped into the PTE there is no issue with corruption. If dmabuf checks the refcount after it does the unmap mapping range it should catch any bogus pin that might be confused about address coherency.
Yeah, that would work. The problem is this WARN_ON() comes much later.
The device drivers usually keep the page around for a while even after it is unmapped. IIRC the cleanup worker only runs every 10ms or so.
Christian.
Jason
On Wed, Nov 23, 2022 at 02:12:25PM +0100, Christian König wrote:
On 23.11.22 at 13:53, Jason Gunthorpe wrote:
On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
On 23.11.22 at 13:46, Jason Gunthorpe wrote:
On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
Maybe a GFP flag to set the page reference count to zero or something like this?
Hm yeah that might work. I'm not sure what it will all break though? And we'd need to make sure that underflowing the page refcount dies in a backtrace.
Mucking with the refcount like this to protect against crazy out of tree drivers seems horrible..
Well not only out of tree drivers. The in-tree KVM got that horribly wrong as well; they were the latest guys complaining about it.
kvm was taking refs on special PTEs? That seems really unlikely?
Well then look at this code here:
commit add6a0cd1c5ba51b201e1361b05a5df817083618
Author: Paolo Bonzini pbonzini@redhat.com
Date:   Tue Jun 7 17:51:18 2016 +0200
KVM: MMU: try to fix up page faults before giving up
The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler then can use remap_pfn_range to place some non-reserved pages in the VMA.
This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn and fixup_user_fault together help supporting it. The patch also supports VM_MIXEDMAP vmas where the pfns are not reserved and thus subject to reference counting.
Cc: Xiao Guangrong guangrong.xiao@linux.intel.com
Cc: Andrea Arcangeli aarcange@redhat.com
Cc: Radim Krčmář rkrcmar@redhat.com
Tested-by: Neo Jia cjia@nvidia.com
Reported-by: Kirti Wankhede kwankhede@nvidia.com
Signed-off-by: Paolo Bonzini pbonzini@redhat.com
This patch is known to be broken in so many ways. It also has a major security hole that it ignores the PTE flags making the page RO. Ignoring the special bit is somehow not surprising :(
This probably doesn't work, but is the general idea of what KVM needs to do:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1376a47fedeedb..4161241fc3228c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		return r;
 	}
+	/*
+	 * Special PTEs are never convertible into a struct page, even if the
+	 * driver that owns them might have put a PFN with a struct page into
+	 * the PFNMAP. If the arch doesn't support special then we cannot
+	 * safely process these pages.
+	 */
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
+	if (pte_special(*ptep))
+		return -EINVAL;
+#else
+	return -EINVAL;
+#endif
+
 	if (write_fault && !pte_write(*ptep)) {
 		pfn = KVM_PFN_ERR_RO_FAULT;
 		goto out;
Jason
On Wed, 23 Nov 2022 at 14:28, Jason Gunthorpe jgg@ziepe.ca wrote:
On Wed, Nov 23, 2022 at 02:12:25PM +0100, Christian König wrote:
On 23.11.22 at 13:53, Jason Gunthorpe wrote:
On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
On 23.11.22 at 13:46, Jason Gunthorpe wrote:
On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
Maybe a GFP flag to set the page reference count to zero or something like this?
Hm yeah that might work. I'm not sure what it will all break though? And we'd need to make sure that underflowing the page refcount dies in a backtrace.
Mucking with the refcount like this to protect against crazy out of tree drivers seems horrible..
Well not only out of tree drivers. The in-tree KVM got that horribly wrong as well; they were the latest guys complaining about it.
kvm was taking refs on special PTEs? That seems really unlikely?
Well then look at this code here:
commit add6a0cd1c5ba51b201e1361b05a5df817083618 Author: Paolo Bonzini pbonzini@redhat.com Date: Tue Jun 7 17:51:18 2016 +0200
KVM: MMU: try to fix up page faults before giving up The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault
handler then can use remap_pfn_range to place some non-reserved pages in the VMA.
This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn and fixup_user_fault together help supporting it. The patch also
supports VM_MIXEDMAP vmas where the pfns are not reserved and thus subject to reference counting.
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Tested-by: Neo Jia <cjia@nvidia.com> Reported-by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is known to be broken in so many ways. It also has a major security hole that it ignores the PTE flags making the page RO. Ignoring the special bit is somehow not surprising :(
This probably doesn't work, but is the general idea of what KVM needs to do:
Oh dear, when I dug around in there I entirely missed that kvm_try_get_pfn exists, and it's very broken indeed. kvm really needs to grow a proper mmu notifier.
Another thing I'm wondering right now, the follow_pte(); fixup_user_fault(); follow_pte(); approach does not make any guarantees of actually being right. If you're sufficiently unlucky you might race against an immediate pte invalidate between the fixup and the 2nd follow_pte(). But you can also not loop, because that would fail to catch permanent faults.
I think the iommu fault drivers have a similar pattern.
What am I missing here? Or is that also just broken? gup works around this with the slow path that takes the mmap sem and walks the vma tree; follow_pte/fixup_user_fault users don't. Maybe mmu notifier based restarting would help with this too, if done properly. -Daniel
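A simplified paraphrase of the pattern being questioned here (not the actual kvm code): nothing prevents an invalidation from landing between fixup_user_fault() and the second follow_pte(), and there is no sequence count to detect that it did.

static int lookup_pfn_racy(struct vm_area_struct *vma, unsigned long addr,
			   unsigned long *pfn)
{
	spinlock_t *ptl;
	pte_t *ptep;
	int r;

	r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
	if (r) {
		/* not mapped yet: fault it in, then look the pte up again */
		r = fixup_user_fault(vma->vm_mm, addr, FAULT_FLAG_REMOTE, NULL);
		if (r)
			return r;
		/* an invalidation can land exactly here, unnoticed */
		r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
		if (r)
			return r;
	}

	*pfn = pte_pfn(*ptep);
	pte_unmap_unlock(ptep, ptl);
	return 0;
}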
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1376a47fedeedb..4161241fc3228c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, return r; }
/*
* Special PTEs are never convertible into a struct page, even if the
* driver that owns them might have put a PFN with a struct page into
* the PFNMAP. If the arch doesn't support special then we cannot
* safely process these pages.
*/
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
if (pte_special(*ptep))
return -EINVAL;
+#else
return -EINVAL;
+#endif
if (write_fault && !pte_write(*ptep)) { pfn = KVM_PFN_ERR_RO_FAULT; goto out;
Jason
On Wed, Nov 23, 2022 at 03:28:27PM +0100, Daniel Vetter wrote:
This patch is known to be broken in so many ways. It also has a major security hole that it ignores the PTE flags making the page RO. Ignoring the special bit is somehow not surprising :(
This probably doesn't work, but is the general idea of what KVM needs to do:
Oh dear, when I dug around in there I entirely missed that kvm_try_get_pfn exists, and it's very broken indeed. kvm really needs to grow a proper mmu notifier.
Another thing I'm wondering right now, the follow_pte(); fixup_user_fault(); follow_pte(); approach does not make any guarantees of actually being right. If you're sufficiently unlucky you might race against an immediate pte invalidate between the fixup and the 2nd follow_pte(). But you can also not loop, because that would fail to catch permanent faults.
Yes, it is pretty broken.
kvm already has support for mmu notifiers and uses it for other stuff. I can't remember what exactly this code path was for; IIRC Paolo talked about having a big rework/fix for it when we last talked about the missing write protect. I also vaguely recall he had some explanation why this might be safe.
I think the iommu fault drivers have a similar pattern.
Where? It shouldn't
The common code for SVA just calls handle_mm_fault() and restarts the PRI. Since the page table is physically shared there is no issue with a stale copy.
What am I missing here? Or is that also just broken? gup works around this with the slow path that takes the mmap sem and walks the vma tree; follow_pte/fixup_user_fault users don't.
follow_pte() is just fundamentally broken, things must not use it.
Maybe mmu notifier based restarting would help with this too, if done properly.
That is called hmm_range_fault()
Jason
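For reference, the hmm_range_fault() plus notifier sequence count pattern looks roughly like this. This is a simplified sketch following Documentation/mm/hmm.rst; the interval notifier is assumed to be registered elsewhere, and in real code the retry check has to be done under the same driver lock that the invalidate callback takes.

static int sample_fault_range(struct mmu_interval_notifier *notifier,
			      struct mm_struct *mm, unsigned long start,
			      unsigned long npages, unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier	= notifier,
		.start		= start,
		.end		= start + npages * PAGE_SIZE,
		.hmm_pfns	= pfns,
		.default_flags	= HMM_PFN_REQ_FAULT,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		return ret;
	}

	/*
	 * If an invalidation raced with the fault, the snapshot in pfns[]
	 * is stale; throw it away and start over instead of using it.
	 */
	if (mmu_interval_read_retry(notifier, range.notifier_seq))
		goto again;

	/* pfns[] is now a coherent snapshot, consume it under the driver lock */
	return 0;
}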
On Wed, 23 Nov 2022 at 16:04, Jason Gunthorpe jgg@ziepe.ca wrote:
On Wed, Nov 23, 2022 at 03:28:27PM +0100, Daniel Vetter wrote:
This patch is known to be broken in so many ways. It also has a major security hole that it ignores the PTE flags making the page RO. Ignoring the special bit is somehow not surprising :(
This probably doesn't work, but is the general idea of what KVM needs to do:
Oh dear, when I dug around in there I entirely missed that kvm_try_get_pfn exists, and it's very broken indeed. kvm really needs to grow a proper mmu notifier.
Another thing I'm wondering right now, the follow_pte(); fixup_user_fault(); follow_pte(); approach does not make any guarantees of actually being right. If you're sufficiently unlucky you might race against an immediate pte invalidate between the fixup and the 2nd follow_pte(). But you can also not loop, because that would fail to catch permanent faults.
Yes, it is pretty broken.
kvm already has support for mmu notifiers and uses it for other stuff. I can't remember what exactly this code path was for; IIRC Paolo talked about having a big rework/fix for it when we last talked about the missing write protect. I also vaguely recall he had some explanation why this might be safe.
I think the iommu fault drivers have a similar pattern.
Where? It shouldn't
The common code for SVA just calls handle_mm_fault() and restarts the PRI. Since the page table is physically shared there is no issue with a stale copy.
What am I missing here? Or is that also just broken? gup works around this with the slow path that takes the mmap sem and walks the vma tree; follow_pte/fixup_user_fault users don't.
follow_pte() is just fundamentally broken, things must not use it.
Maybe mmu notifier based restarting would help with this too, if done properly.
That is called hmm_range_fault()
Ah right I mixed that up on a quick grep, thanks for pointing me in the right direction. Worries appeased. -Daniel
On Wed, 23 Nov 2022 at 14:28, Jason Gunthorpe jgg@ziepe.ca wrote:
On Wed, Nov 23, 2022 at 02:12:25PM +0100, Christian König wrote:
On 23.11.22 at 13:53, Jason Gunthorpe wrote:
On Wed, Nov 23, 2022 at 01:49:41PM +0100, Christian König wrote:
On 23.11.22 at 13:46, Jason Gunthorpe wrote:
On Wed, Nov 23, 2022 at 11:06:55AM +0100, Daniel Vetter wrote:
Maybe a GFP flag to set the page reference count to zero or something like this?
Hm yeah that might work. I'm not sure what it will all break though? And we'd need to make sure that underflowing the page refcount dies in a backtrace.
Mucking with the refcount like this to protect against crazy out of tree drivers seems horrible..
Well not only out of tree drivers. The in-tree KVM got that horribly wrong as well; they were the latest guys complaining about it.
kvm was taking refs on special PTEs? That seems really unlikely?
Well then look at this code here:
commit add6a0cd1c5ba51b201e1361b05a5df817083618 Author: Paolo Bonzini pbonzini@redhat.com Date: Tue Jun 7 17:51:18 2016 +0200
KVM: MMU: try to fix up page faults before giving up The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault
handler then can use remap_pfn_range to place some non-reserved pages in the VMA.
This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn and fixup_user_fault together help supporting it. The patch also
supports VM_MIXEDMAP vmas where the pfns are not reserved and thus subject to reference counting.
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Tested-by: Neo Jia <cjia@nvidia.com> Reported-by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is known to be broken in so many ways. It also has a major security hole that it ignores the PTE flags making the page RO. Ignoring the special bit is somehow not surprising :(
This probably doesn't work, but is the general idea of what KVM needs to do:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1376a47fedeedb..4161241fc3228c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, return r; }
/*
* Special PTEs are never convertible into a struct page, even if the
* driver that owns them might have put a PFN with a struct page into
* the PFNMAP. If the arch doesn't support special then we cannot
* safely process these pages.
*/
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
if (pte_special(*ptep))
return -EINVAL;
On second thought this won't work, because it completely defeats the point of why this code here exists. remap_pfn_range() (which is what the various dma_mmap functions and the ioremap functions are built on top of too) sets VM_PFNMAP too, so this check would even catch the static mappings.
Plus these static mappings aren't all that static either, e.g. pci access also can revoke bar mappings nowadays.
I think nothing except full mmu_notifier will actually fix this. -Daniel
+#else
return -EINVAL;
+#endif
if (write_fault && !pte_write(*ptep)) { pfn = KVM_PFN_ERR_RO_FAULT; goto out;
Jason
On Wed, Nov 23, 2022 at 03:34:54PM +0100, Daniel Vetter wrote:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1376a47fedeedb..4161241fc3228c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, return r; }
/*
* Special PTEs are never convertible into a struct page, even if the
* driver that owns them might have put a PFN with a struct page into
* the PFNMAP. If the arch doesn't support special then we cannot
* safely process these pages.
*/
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
if (pte_special(*ptep))
return -EINVAL;
On second thought this won't work, because it completely defeats the point of why this code here exists. remap_pfn_range() (which is what the various dma_mmap functions and the ioremap functions are built on top of too) sets VM_PFNMAP too, so this check would even catch the static mappings.
The problem with the way this code is designed is how it allows returning the pfn without taking any reference based on things like !pfn_valid or page_reserved. This allows it to then conditionally put back the reference based on the same reasoning. It is impossible to thread pte special into that since it is a PTE flag, not a property of the PFN.
I don't entirely understand why it needs the page reference at all, even if it is available - so I can't guess why it is OK to ignore the page reference in other cases, or why it is OK to be racy..
Eg hmm_range_fault() does not obtain page references and implements a very similar algorithm to kvm.
Plus these static mappings aren't all that static either, e.g. pci access also can revoke bar mappings nowadays.
And there are already mmu notifiers to handle that, AFAIK.
Jason
On 23.11.22 at 16:08, Jason Gunthorpe wrote:
On Wed, Nov 23, 2022 at 03:34:54PM +0100, Daniel Vetter wrote:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1376a47fedeedb..4161241fc3228c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, return r; }
/*
* Special PTEs are never convertible into a struct page, even if the
* driver that owns them might have put a PFN with a struct page into
* the PFNMAP. If the arch doesn't support special then we cannot
* safely process these pages.
*/
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
if (pte_special(*ptep))
return -EINVAL;
On second thought this won't work, because it completely defeats the point of why this code here exists. remap_pfn_range() (which is what the various dma_mmap functions and the ioremap functions are built on top of too) sets VM_PFNMAP too, so this check would even catch the static mappings.
The problem with the way this code is designed is how it allows returning the pfn without taking any reference based on things like !pfn_valid or page_reserved. This allows it to then conditionally put back the reference based on the same reasoning. It is impossible to thread pte special into that since it is a PTE flag, not a property of the PFN.
I don't entirely understand why it needs the page reference at all,
That's exactly what I've pointed out in the previous discussion about that code as well.
As far as I can see, this is just another case where people assumed that grabbing a page reference somehow magically prevents the pte from changing.
I have not the slightest idea how people got this impression, but I have heard it so many times from so many different sources that there must be some common cause. Is there maybe some book or tutorial on how to break the kernel in sophisticated ways or something like this?
Anyway, as far as I can see the only correct approach would be to use an MMU notifier or the more high-level hmm_range_fault()+seq number.
Regards, Christian.
even if it is available - so I can't guess why it is OK to ignore the page reference in other cases, or why it is OK to be racy..
Eg hmm_range_fault() does not obtain page references and implements a very similar algorithm to kvm.
Plus these static mappings aren't all that static either, e.g. pci access also can revoke bar mappings nowadays.
And there are already mmu notifiers to handle that, AFAIK.
Jason
On Wed, 23 Nov 2022 at 16:15, Christian König christian.koenig@amd.com wrote:
On 23.11.22 at 16:08, Jason Gunthorpe wrote:
On Wed, Nov 23, 2022 at 03:34:54PM +0100, Daniel Vetter wrote:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1376a47fedeedb..4161241fc3228c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2598,6 +2598,19 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, return r; }
/*
* Special PTEs are never convertible into a struct page, even if the
* driver that owns them might have put a PFN with a struct page into
* the PFNMAP. If the arch doesn't support special then we cannot
* safely process these pages.
*/
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
if (pte_special(*ptep))
return -EINVAL;
On second thought this won't work, because it completely defeats the point of why this code here exists. remap_pfn_range() (which is what the various dma_mmap functions and the ioremap functions are built on top of too) sets VM_PFNMAP too, so this check would even catch the static mappings.
The problem with the way this code is designed is how it allows returning the pfn without taking any reference based on things like !pfn_valid or page_reserved. This allows it to then conditionally put back the reference based on the same reasoning. It is impossible to thread pte special into that since it is a PTE flag, not a property of the PFN.
I don't entirely understand why it needs the page reference at all,
That's exactly what I've pointed out in the previous discussion about that code as well.
As far as I can see, this is just another case where people assumed that grabbing a page reference somehow magically prevents the pte from changing.
I have not the slightest idea how people got this impression, but I have heard it so many times from so many different sources that there must be some common cause. Is there maybe some book or tutorial on how to break the kernel in sophisticated ways or something like this?
It's what get_user_pages does, so it does "work". Except this path here is the fallback for when get_user_pages does not work (because of the pte_special/VM_SPECIAL case). So essentially it's just a rather broken get_user_pages that handrolls a bunch of things with bugs&races.
I have no idea why people don't realize they're just reinventing gup without using gup, but that's essentially what's going on.
Anyway, as far as I can see the only correct approach would be to use an MMU notifier or the more high-level hmm_range_fault()+seq number.
Yeah, plus if you go through ptes you really have to obey all the flags or things will break. Especially the RO pte flag. -Daniel
Regards, Christian.
even if it is available - so I can't guess why it is OK to ignore the page reference in other cases, or why it is OK to be racy..
Eg hmm_range_fault() does not obtain page references and implements a very similar algorithm to kvm.
Plus these static mappings aren't all that static either, e.g. pci access also can revoke bar mappings nowadays.
And there are already mmu notifiers to handle that, AFAIK.
Jason
On Wed, Nov 23, 2022 at 04:15:05PM +0100, Christian König wrote:
I have not the slightest idea how people got this impression, but I have heard it so many times from so many different sources that there must be some common cause. Is there maybe some book or tutorial on how to break the kernel in sophisticated ways or something like this?
Anyway, as far as I can see the only correct approach would be to use an MMU notifier or the more high-level hmm_range_fault()+seq number.
It already uses an mmu notifier; this is why I have no idea what the page ref is doing..
Jason
Hi
On 22.11.22 at 18:08, Daniel Vetter wrote:
tldr; DMA buffers aren't normal memory: expecting that you can use them like that (like calling get_user_pages works, or that they're accounted like any other normal memory) cannot be guaranteed.
Since some userspace only runs on integrated devices, where all buffers are actually resident system memory, there's a huge temptation to assume that a struct page is always present and usable like for any other pagecache backed mmap. This has the potential to result in a uapi nightmare.
To close this gap, require that DMA buffer mmaps are VM_PFNMAP, which blocks get_user_pages and all the other struct page based infrastructure for everyone. In spirit this is the uapi counterpart to the kernel-internal CONFIG_DMABUF_DEBUG.
Motivated by a recent patch which wanted to switch the system dma-buf heap to vm_insert_page instead of vm_insert_pfn.
v2:
Jason brought up that we also want to guarantee that all ptes have the pte_special flag set, to catch fast get_user_pages (on architectures that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
From auditing the various functions to insert pfn pte entries (vm_insert_pfn_prot, remap_pfn_range and all its callers like dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so this should be the correct flag to check for.
v3: Change to WARN_ON_ONCE (Thomas Zimmermann)
References: https://lore.kernel.org/lkml/CAKMK7uHi+mG0z0HUmNt13QCCvutuRVjpcR0NjRL12k-WbW...
Acked-by: Christian König christian.koenig@amd.com
Acked-by: Thomas Zimmermann tzimmermann@suse.de
Cc: Thomas Zimmermann tzimmermann@suse.de
Cc: Jason Gunthorpe jgg@ziepe.ca
Cc: Suren Baghdasaryan surenb@google.com
Cc: Matthew Wilcox willy@infradead.org
Cc: John Stultz john.stultz@linaro.org
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Cc: Sumit Semwal sumit.semwal@linaro.org
Cc: "Christian König" christian.koenig@amd.com
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
--
Ok I entirely forgot about this patch but stumbled over it and checked what's up with it now. I think it's ready now for merging:
- shmem helper patches to fix up vgem landed
- ttm has been fixed since a while
- I don't think we've had any other open issues
Time to lock down this uapi contract for real?
-Daniel
 drivers/dma-buf/dma-buf.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index b6c36914e7c6..88718665c3c3 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -150,6 +150,8 @@ static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma)
 	ret = dmabuf->ops->mmap(dmabuf, vma);
 	dma_resv_unlock(dmabuf->resv);
+	WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));
Well, I already a-b'ed this, but does it work with the DMA helpers? I'm asking because of [1].
Best regards
Thomas
[1] https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/drm_gem_dma...
+
 	return ret;
 }
@@ -1495,6 +1497,8 @@ int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma,
 	ret = dmabuf->ops->mmap(dmabuf, vma);
 	dma_resv_unlock(dmabuf->resv);
+	WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));
+
 	return ret;
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_mmap, DMA_BUF);
On Wed, 23 Nov 2022 at 09:07, Thomas Zimmermann tzimmermann@suse.de wrote:
Hi
On 22.11.22 at 18:08, Daniel Vetter wrote:
tldr; DMA buffers aren't normal memory: expecting that you can use them like that (like calling get_user_pages works, or that they're accounted like any other normal memory) cannot be guaranteed.
Since some userspace only runs on integrated devices, where all buffers are actually resident system memory, there's a huge temptation to assume that a struct page is always present and usable like for any other pagecache backed mmap. This has the potential to result in a uapi nightmare.
To close this gap, require that DMA buffer mmaps are VM_PFNMAP, which blocks get_user_pages and all the other struct page based infrastructure for everyone. In spirit this is the uapi counterpart to the kernel-internal CONFIG_DMABUF_DEBUG.
Motivated by a recent patch which wanted to switch the system dma-buf heap to vm_insert_page instead of vm_insert_pfn.
v2:
Jason brought up that we also want to guarantee that all ptes have the pte_special flag set, to catch fast get_user_pages (on architectures that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
From auditing the various functions to insert pfn pte entries (vm_insert_pfn_prot, remap_pfn_range and all its callers like dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so this should be the correct flag to check for.
v3: Change to WARN_ON_ONCE (Thomas Zimmermann)
References: https://lore.kernel.org/lkml/CAKMK7uHi+mG0z0HUmNt13QCCvutuRVjpcR0NjRL12k-WbW... Acked-by: Christian König christian.koenig@amd.com Acked-by: Thomas Zimmermann tzimmermann@suse.de Cc: Thomas Zimmermann tzimmermann@suse.de Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Suren Baghdasaryan surenb@google.com Cc: Matthew Wilcox willy@infradead.org Cc: John Stultz john.stultz@linaro.org Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: Sumit Semwal sumit.semwal@linaro.org Cc: "Christian König" christian.koenig@amd.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org -- Ok I entirely forgot about this patch but stumbled over it and checked what's up with it no. I think it's ready now for merging:
- shmem helper patches to fix up vgem landed
- ttm has been fixed since a while
- I don't think we've had any other open issues
Time to lock down this uapi contract for real?
-Daniel
 drivers/dma-buf/dma-buf.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index b6c36914e7c6..88718665c3c3 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -150,6 +150,8 @@ static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma)
 	ret = dmabuf->ops->mmap(dmabuf, vma);
 	dma_resv_unlock(dmabuf->resv);
+	WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));
Well, I already a-b'ed this, but does it work with the DMA helpers? I'm asking because of [1].
Best regards
Thomas
[1] https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/drm_gem_dma...
This one is entertaining, but also doesn't matter, because the remap_pfn_range that the various dma_mmap functions boil down to sets the VM_PFNMAP and a pile of other flags. See
https://elixir.bootlin.com/linux/v6.1-rc6/source/mm/memory.c#L2518
I have no idea why Laurent cleared this flag here just so it gets reset again a bit later when he added that code. Laurent? -Daniel
+
 	return ret;
 }
@@ -1495,6 +1497,8 @@ int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma,
 	ret = dmabuf->ops->mmap(dmabuf, vma);
 	dma_resv_unlock(dmabuf->resv);
+	WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));
+
 	return ret;
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_mmap, DMA_BUF);
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev