On Mon, Jun 15, 2026 at 03:27:01PM +0100, Matt Evans wrote:
Hi Matt,
On 11/06/2026 21:30, Pranjal Shrivastava wrote:
On Wed, Jun 10, 2026 at 04:43:16PM +0100, Matt Evans wrote:
Add vfio_pci_dma_buf_find_pfn(), which a VMA fault handler can use to find a PFN.
This supports multi-range DMABUFs, which typically would be used to represent scattered spans but might even represent overlapping or aliasing spans of PFNs.
Because this is intended to be used in vfio_pci_core.c, we also need to expose the struct vfio_pci_dma_buf in the vfio_pci_priv.h header.
Signed-off-by: Matt Evans matt@ozlabs.org
[...]
- const unsigned long pagesize = PAGE_SIZE << order;
- unsigned long vma_off = ((vma->vm_pgoff - priv->vma_pgoff_adjust) <<
PAGE_SHIFT) & VFIO_PCI_OFFSET_MASK;- unsigned long rounded_page_addr = ALIGN_DOWN(address, pagesize);
- unsigned long rounded_page_end = rounded_page_addr + pagesize;
- unsigned long page_buf_offset;
- unsigned long page_buf_offset_end;
- unsigned long range_buf_offset = 0;
- unsigned int i;
- if (rounded_page_addr < vma->vm_start || rounded_page_end > vma->vm_end) {
if (order > 0)return -EAGAIN;/* A fault address outside of the VMA is absurd. */WARN(1, "Fault addr 0x%lx outside VMA 0x%lx-0x%lx\n",address, vma->vm_start, vma->vm_end);This could flood dmesg if triggered repeatedly by userspace :( Since a fault outside the VMA is an invalid access that already results in a SIGBUS, we could probably avoid the WARN here? Perhaps pr_warn_ratelimited() should suffice?
I'm OK moving to a pr_warn_ratelimited(). Note though that this case is "genuinely impossible" currently and the check exists in case something changes elsewhere. (Re your flood comment, am I missing a way for userspace to trigger this? The scenario is a faulthandler for a VMA getting a VA outside the bounds of that VMA; such a fault address wouldn't match that VMA.)
I should've been explicit, I guess I worded it wrong, my bad. I didn't mean that a user-space could hit this on it's own. However, I meant to say if there's a bug in core mm during some future work, dmesg being flooded by stackdumps gets messy (especially during dev) as the underlying reason might get missed in the flood. Hence, I prefer moving to pr_warn_ratelimited.
return -EFAULT;- }
- /*
* page_buff_offset[_end] is the span of DMABUF offsets* corresponding to the faulting page:*/- if (unlikely(check_add_overflow(rounded_page_addr - vma->vm_start,
vma_off, &page_buf_offset) ||check_add_overflow(page_buf_offset, pagesize,&page_buf_offset_end)))return -EFAULT;- for (i = 0; i < priv->nr_ranges; i++) {
size_t range_len = priv->phys_vec[i].len;phys_addr_t range_start = priv->phys_vec[i].paddr;/** If the current range starts after the page's span,* this and any future range won't match. Bail early.*/if (page_buf_offset_end <= range_buf_offset)break;if (page_buf_offset >= range_buf_offset &&page_buf_offset_end <= range_buf_offset + range_len) {/** The faulting page is wholly contained* within the span represented by the range.* Validate PFN alignment for the order:*/unsigned long pfn = (range_start + page_buf_offset -range_buf_offset) / PAGE_SIZE;Minor nit: I'm aware that decent compilers convert pow(2) divides to >> However, we seem to be using `>> PAGE_SHIFT` across vfio-pci. E.g.:
return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff; unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
Let's consider using the same pattern?
(Do you know of a compiler that both builds the kernel and does NOT perform this transformation? I am confident that resulting object code will be OK here.)
I guess most of the modern compiler do, I was just referring to the style across the file. I don't have any strong opinion.
In an earlier revision I was using shifts but they were fairly messy compared to this expression, which arises from a request by Jason.
Yea, looking back at Jason's comment [1], I think he was mainly pointing out that the common factor (PAGE_SIZE) could be grouped together. But again, no strong feeling about this, just picked up a pattern across the file. If it breaks on some compiler we can fix it later..
if (IS_ALIGNED(pfn, 1 << order)) {*out_pfn = pfn;return 0;}/* Retry with smaller order */return -EAGAIN;}range_buf_offset += range_len;- }
- /*
* A hugepage straddling a range boundary will fail to match a* range, but the address will (eventually) match when retried* with a smaller page.*/- if (order > 0)
return -EAGAIN;- /*
* If we get here, the address fell outside of the span* represented by the (concatenated) ranges. Setup of aNit: double space before "Setup" and "But" below.
I liked Alex's response :-) This is common practice for monospaced text since increasing inter-sentence spacing helps readability in paragraph blocks (see Documentation/ for many examples ...).
Ack. :)
* mapping must ensure that the VMA is <= the total size of* the ranges, so this should never happen. But, if it does,* force SIGBUS for the access and warn.*/- WARN_ONCE(1, "No range for addr 0x%lx, order %d: VMA 0x%lx-0x%lx pgoff 0x%lx, %u ranges, size 0x%zx\n",
address, order, vma->vm_start, vma->vm_end, vma->vm_pgoff,priv->nr_ranges, priv->size);- return -EFAULT;
The fall-through logic at the end feels a bit redundant.
If we've exhausted the phys_vec list without finding a match, returning -EAGAIN for order > 0 seems like the correct fallback behavior.
This path can happen (for order > 0) e.g. mis-alignment of VA versus the PFN, i.e. is likely...
However, the subsequent WARN_ONCE for the order == 0 seems unnecessary? An out-of-bounds access is an error that should simply return -EFAULT (converting to SIGBUS) without polluting the kernel log with stackdumps?
...but the only way this can happen, for order == 0, is if the VMA extends beyond the underlying resource. For example, if the VMA is larger than the DMABUF size (the total length of phys ranges set up inside the DMABUF). Both VFIO BAR mmap() and a DMABUF mmap() disallow mapping off the end of the underlying resource. That is, this also "cannot happen" but if logic changes elsewhere then we will really want to know about hitting this case -- the check is not redundant.
I didn't mean to imply that the path itself is impossible or won't happen.. I just meant that the logic / structure felt a bit redundant at the end of the function..
Instead of having the separate `if (order > 0)` block falling through to the base case, I suggest it could be cleaner as:
ret = order ? -EAGAIN : -EFAULT;
if (ret == -EFAULT) pr_warn_ratelimited(...);
return ret;
But again, that's a preference. I'd leave that to your judgement.
Still, it doesn't need a regdump/backtrace (at least while this is only called from one spot), so a pr_warn_* is better.
Ack.
Thanks, Praan
linaro-mm-sig@lists.linaro.org