On Tue, 4 Nov 2025 14:23:24 +0530 Meghana Malladi wrote:
> > I tried honoring Jakub's comment to avoid freeing the rx memory wherever
> > necessary.
> >
> > "In case of icssg driver, freeing the rx memory is necessary as the
> > rx descriptor memory is owned by the cppi dma controller and can be
> > mapped to a single memory model (pages/xdp buffers) at a given time.
> > In order to remap it, the memory needs to be freed and reallocated."
>
> Just to make sure we are on the same page, does the above explanation
> make sense to you or do you want me to make any changes in this series
> for v5 ?
No. Based on your reply below you seem to understand what is being
asked, so you're expected to do it.
> >> I think you should:
> >> - stop the H/W from processing incoming packets,
> >> - spool all the pending packets
> >> - attach/detach the xsk_pool
> >> - refill the ring
> >> - re-enable the H/W
> >
> > Current implementation follows the same sequence:
> > 1. Does a channel teardown -> stop incoming traffic
> > 2. free the rx descriptors from free queue and completion queue -> spool
> > all pending packets/descriptors
> > 3. attach/detach the xsk pool
> > 4. allocate rx descriptors and fill the freeq after mapping them to the
> > correct memory buffers -> refill the ring
> > 5. restart the NAPI - re-enable the H/W to recv the traffic
> >
> > I am still working on skipping steps 2 and 4 but this will be a long
> > shot. Need to make sure all corner cases are covered. If this
> > approach looks doable without causing any regressions I might post it
> > as a followup patch later.
On Tue, Nov 04, 2025 at 11:19:43AM -0800, Nicolin Chen wrote:
> On Sun, Nov 02, 2025 at 10:00:48AM +0200, Leon Romanovsky wrote:
> > Changelog:
> > v6:
> > * Fixed wrong error check from pcim_p2pdma_init().
> > * Documented pcim_p2pdma_provider() function.
> > * Improved commit messages.
> > * Added VFIO DMA-BUF selftest.
> > * Added __counted_by(nr_ranges) annotation to struct vfio_device_feature_dma_buf.
> > * Fixed error unwind when dma_buf_fd() fails.
> > * Document latest changes to p2pmem.
> > * Removed EXPORT_SYMBOL_GPL from pci_p2pdma_map_type.
> > * Moved DMA mapping logic to DMA-BUF.
> > * Removed types patch to avoid dependencies between subsystems.
> > * Moved vfio_pci_dma_buf_move() in err_undo block.
> > * Added nvgrace patch.
>
> I have verified this v6 using Jason's iommufd dmabuf branch:
> https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf/
>
> by drafting a QEMU patch on top of Shameer's vSMMU v5 series:
> https://github.com/nicolinc/qemu/commits/wip/iommufd_dmabuf/
>
> with that, I see the GPU BAR memory being correctly fetched in QEMU:
> vfio_region_dmabuf Device 0009:01:00.0, region "0009:01:00.0 BAR 0", offset: 0x0, size: 0x1000000
> vfio_region_dmabuf Device 0009:01:00.0, region "0009:01:00.0 BAR 2", offset: 0x0, size: 0x44f00000
> vfio_region_dmabuf Device 0009:01:00.0, region "0009:01:00.0 BAR 4", offset: 0x0, size: 0x17a0000000
Great, thanks! This means we finally have a solution to that follow_pfn
lifetime problem in type 1! What a long journey :)
For those following along, this same flow will be used with KVM to allow
it to map VFIO as well. Confidential Compute will require this because
some arches can't put confidential MMIO (or RAM) into a VMA.
Jason
Changelog:
v6:
* Fixed wrong error check from pcim_p2pdma_init().
* Documented pcim_p2pdma_provider() function.
* Improved commit messages.
* Added VFIO DMA-BUF selftest.
* Added __counted_by(nr_ranges) annotation to struct vfio_device_feature_dma_buf.
* Fixed error unwind when dma_buf_fd() fails.
* Document latest changes to p2pmem.
* Removed EXPORT_SYMBOL_GPL from pci_p2pdma_map_type.
* Moved DMA mapping logic to DMA-BUF.
* Removed types patch to avoid dependencies between subsystems.
* Moved vfio_pci_dma_buf_move() in err_undo block.
* Added nvgrace patch.
v5: https://lore.kernel.org/all/cover.1760368250.git.leon@kernel.org
* Rebased on top of v6.18-rc1.
* Added more validation logic to make sure that DMA-BUF length doesn't
overflow in various scenarios.
* Hide kernel config from the users.
* Fixed type conversion issue. DMA ranges are exposed with u64 length,
but DMA-BUF uses "unsigned int" as a length for SG entries.
* Added a check so that VFIO drivers which report a BAR size
  different from PCI do not use the DMA-BUF functionality.
v4: https://lore.kernel.org/all/cover.1759070796.git.leon@kernel.org
* Split pcim_p2pdma_provider() into two functions: one that initializes
  the array of providers and another that returns the right provider pointer.
v3: https://lore.kernel.org/all/cover.1758804980.git.leon@kernel.org
* Changed pcim_p2pdma_enable() to be pcim_p2pdma_provider().
* Cache provider in vfio_pci_dma_buf struct instead of BAR index.
* Removed misleading comment from pcim_p2pdma_provider().
* Moved MMIO check to be in pcim_p2pdma_provider().
v2: https://lore.kernel.org/all/cover.1757589589.git.leon@kernel.org/
* Added extra patch which adds new CONFIG, so next patches can reuse
  it.
* Squashed "PCI/P2PDMA: Remove redundant bus_offset from map state"
into the other patch.
* Fixed revoke calls to be aligned with true->false semantics.
* Extended p2pdma_providers to be per-BAR and not global to whole
  device.
* Fixed possible race between dmabuf states and revoke.
* Moved revoke to PCI BAR zap block.
v1: https://lore.kernel.org/all/cover.1754311439.git.leon@kernel.org
* Changed commit messages.
* Reused DMA_ATTR_MMIO attribute.
* Returned support for multiple DMA ranges per-DMABUF.
v0: https://lore.kernel.org/all/cover.1753274085.git.leonro@nvidia.com
---------------------------------------------------------------------------
Based on "[PATCH v6 00/16] dma-mapping: migrate to physical address-based API"
https://lore.kernel.org/all/cover.1757423202.git.leonro@nvidia.com/ series.
---------------------------------------------------------------------------
This series extends the VFIO PCI subsystem to support exporting MMIO
regions from PCI device BARs as dma-buf objects, enabling safe sharing of
non-struct page memory with controlled lifetime management. This allows RDMA
and other subsystems to import dma-buf FDs and build them into memory regions
for PCI P2P operations.
The series supports a use case for SPDK where an NVMe device will be
owned by SPDK through VFIO while interacting with an RDMA device. The
RDMA device may directly access the NVMe CMB or directly manipulate the
NVMe device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with
VFIO. This dmabuf approach can also be used by iommufd for generic
and safe P2P mappings.
In addition to the SPDK use-case mentioned above, the capability added
in this patch series can also be useful when a buffer (located in device
memory such as VRAM) needs to be shared between any two dGPU devices or
instances (assuming one of them is bound to VFIO PCI) as long as they
are P2P DMA compatible.
The implementation provides a revocable attachment mechanism using dma-buf
move operations. MMIO regions are normally pinned as BARs don't change
physical addresses, but access is revoked when the VFIO device is closed
or a PCI reset is issued. This ensures kernel self-defense against
potentially hostile userspace.
The series includes significant refactoring of the PCI P2PDMA subsystem
to separate core P2P functionality from memory allocation features,
making it more modular and suitable for VFIO use cases that don't need
struct page support.
-----------------------------------------------------------------------
The series is based originally on
https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.c…
but heavily rewritten to be based on DMA physical API.
-----------------------------------------------------------------------
The WIP branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=…
Thanks
---
Jason Gunthorpe (2):
PCI/P2PDMA: Document DMABUF model
vfio/nvgrace: Support get_dmabuf_phys
Leon Romanovsky (7):
PCI/P2PDMA: Separate the mmap() support from the core logic
PCI/P2PDMA: Simplify bus address mapping API
PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
dma-buf: provide phys_vec to scatter-gather mapping routine
vfio/pci: Enable peer-to-peer DMA transactions by default
vfio/pci: Add dma-buf export support for MMIO regions
Vivek Kasireddy (2):
vfio: Export vfio device get and put registration helpers
vfio/pci: Share the core device pointer while invoking feature functions
Documentation/driver-api/pci/p2pdma.rst | 95 +++++++---
block/blk-mq-dma.c | 2 +-
drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++
drivers/iommu/dma-iommu.c | 4 +-
drivers/pci/p2pdma.c | 182 +++++++++++++-----
drivers/vfio/pci/Kconfig | 3 +
drivers/vfio/pci/Makefile | 1 +
drivers/vfio/pci/nvgrace-gpu/main.c | 56 ++++++
drivers/vfio/pci/vfio_pci.c | 5 +
drivers/vfio/pci/vfio_pci_config.c | 22 ++-
drivers/vfio/pci/vfio_pci_core.c | 56 ++++--
drivers/vfio/pci/vfio_pci_dmabuf.c | 315 ++++++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_priv.h | 23 +++
drivers/vfio/vfio_main.c | 2 +
include/linux/dma-buf.h | 18 ++
include/linux/pci-p2pdma.h | 120 +++++++-----
include/linux/vfio.h | 2 +
include/linux/vfio_pci_core.h | 42 +++++
include/uapi/linux/vfio.h | 27 +++
kernel/dma/direct.c | 4 +-
mm/hmm.c | 2 +-
21 files changed, 1077 insertions(+), 139 deletions(-)
---
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
change-id: 20251016-dmabuf-vfio-6cef732adf5a
Best regards,
--
Leon Romanovsky <leonro@nvidia.com>
On Sun, Nov 2, 2025, at 19:11, Alex Williamson wrote:
> On Sun, 2 Nov 2025 17:12:53 +0200
> Leon Romanovsky <leon@kernel.org> wrote:
>> On Sun, Nov 02, 2025 at 08:01:37AM -0700, Alex Williamson wrote:
>> > We don't need the separate loop or flag, and adding it breaks the
>> > existing reverse list walk. Thanks,
>>
>> Do you want me to send v7? I have a feeling that v6 is good to be merged.
>
> Let's hold off, if this ends up being the only fixup I can roll it in.
> Thanks,
Thanks
>
> Alex
>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 24204893e221..51a3bcc26f8b 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -2403,7 +2403,6 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>> struct iommufd_ctx *iommufd_ctx)
>> {
>> struct vfio_pci_core_device *vdev;
>> - bool restore_revoke = false;
>> struct pci_dev *pdev;
>> int ret;
>>
>> @@ -2473,7 +2472,6 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>> }
>>
>> vfio_pci_dma_buf_move(vdev, true);
>> - restore_revoke = true;
>> vfio_pci_zap_bars(vdev);
>> }
>>
>> @@ -2501,15 +2499,12 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>> struct vfio_pci_core_device, vdev.dev_set_list);
>>
>> err_undo:
>> - if (restore_revoke) {
>> - list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
>> - if (__vfio_pci_memory_enabled(vdev))
>> - vfio_pci_dma_buf_move(vdev, false);
>> - }
>> -
>> list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
>> - vdev.dev_set_list)
>> + vdev.dev_set_list) {
>> + if (__vfio_pci_memory_enabled(vdev))
>> + vfio_pci_dma_buf_move(vdev, false);
>> up_write(&vdev->memory_lock);
>> + }
>>
>> list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
>> pm_runtime_put(&vdev->pdev->dev);
>>
>>
>> >
>> > Alex
>> >
>>
On 10/31/25 06:15, Kasireddy, Vivek wrote:
> Hi Jason,
>
>> Subject: Re: [RFC v2 0/8] dma-buf: Add support for mapping dmabufs via
>> interconnects
>>
>> On Thu, Oct 30, 2025 at 06:17:11AM +0000, Kasireddy, Vivek wrote:
>>> It mostly looks OK to me but there are a few things that I want to discuss,
>>> after briefly looking at the patches in your branch:
>>> - I am wondering what is the benefit of the SGT compatibility stuff especially
>>> when Christian suggested that he'd like to see SGT usage gone from
>>> dma-buf
>>
>> I think to get rid of SGT we do need to put it in a little well
>> defined box and then create alternatives and remove things using
>> SGT. This is a long journey, and I think this is the first step.
>>
>> If SGT is some special case it will be harder to excise.
>>
>> So the next steps would be to make all the exporters directly declare
>> a SGT and then remove the SGT related ops from dma_ops itself and
>> remove the compat sgt in the attach logic. This is not hard, it is all
>> simple mechanical work.
> IMO, this SGT compatibility stuff should ideally be a separate follow-on
> effort (and patch series) that would also probably include updates to
> various drivers to add the SGT mapping type.
Nope, just the other way around. In other words, the SGT compatibility is a prerequisite.
We should first demonstrate with existing drivers that the new interface works and does what it promised to do, and then extend it with new functionality.
Regards,
Christian.
>
>>
>> This way the only compat requirement is to automatically give an
>> import match list for a SGT only importer which is very little code in
>> the core.
>>
>> The point is we make the SGT stuff nonspecial and fully aligned with
>> the mapping type in small steps. This way neither importer nor
>> exporter should have any special code to deal with interworking.
>>
>> To remove SGT we'd want to teach the core code how to create some kind
>> of conversion mapping type, e.g. the exporter uses SGT and the importer
>> uses NEW, so the magic conversion mapping type does the adaptation.
>>
>> In this way we can convert importers and exporters to use NEW in any
>> order and they still interwork with each other.
>>
>>> eventually. Also, if matching fails, IMO, indicating that to the
>>> importer (allow_ic) and having both exporter/importer fallback to
>>> the current legacy mechanism would be simpler than the SGT
>>> compatibility stuff.
>>
>> I don't want to have three paths in importers.
>>
>> If the importer supports SGT it should declare it in a match and the
>> core code should always return a SGT match for the importer to use
>>
>> The importer should not have to code 'oh it is SGT but it is somehow a
>> little different' via an allow_ic type idea.
>>
>>> - Also, I thought PCIe P2P (along with SGT) use-cases are already well
>>> handled by the existing map_dma_buf() and other interfaces. So, it
>>> might be confusing if the newer interfaces also provide a mechanism to
>>> handle P2P although a bit differently. I might be missing something
>>> here but shouldn't the existing allow_peer2peer and other related
>>> stuff be left alone?
>>
>> P2P is part of SGT, it gets pulled into the SGT stuff as steps toward
>> isolating SGT properly. Again as we move things to use native SGT
>> exporters we would remove the exporter related allow_peer2peer items
>> when they become unused.
>>
>>> - You are also adding custom attach/detach ops for each mapping_type.
>>> I think it makes sense to reuse existing attach/detach ops if possible
>>> and initiate the matching process from there, at least initially.
>>
>> I started there, but as soon as I went to adding PAL I realized the
>> attach/detach logic was completely different for each of the mapping
>> types. So this is looking a lot simpler.
>>
>> If the driver wants to share the same attach/detach ops for some of
>> its mapping types then it can just set the same function pointer to
>> all of them and pick up the mapping type from the attach->map_type.
>>
>>> - Looks like your design doesn't call for a dma_buf_map_interconnect()
>>> or other similar helpers provided by dma-buf core that the importers
>>> can use. Is that because the return type would not be known to the
>>> core?
>>
>> I don't want to have a single shared 'map' operation, that is the
>> whole point of this design. Each mapping type has its own ops, own
>> types, own function signatures that the client calls directly.
>>
>> No more type confusion or trying to abuse phys_addr_t, dma_addr_t, or
>> scatterlist for inappropriate things. If your driver wants something
>> special, like IOV, then give it proper clear types so it is
>> understandable.
>>
>>> - And, just to confirm, with your design if I want to add a new
>>> interconnect/mapping_type (not just IOV but in general), all that is
>>> needed is to provide custom attach/detach, match ops and one or more
>>> ops to map/unmap the address list, right? Does this mean that the role
>>> of dma-buf core would be limited to just match and the exporters are
>>> expected to do most of the heavy lifting and checking for stuff like
>>> dynamic importers, resv lock held, etc?
>>
>> I expect the core code would continue to provide wrappers and helpers
>> to call the ops that can do any required common stuff.
>>
>> However, keep in mind, when the importer moves to use mapping type it
>> also must be upgraded to use the dynamic importer flow as this API
>> doesn't support non-dynamic importers using mapping type.
>>
>> I will add some of these remarks to the commit messages..
> Sounds good. I'll start testing/working on IOV interconnect patches based on
> your design.
>
> Thanks,
> Vivek
>>
>> Thanks!
>> Jason