Changelog:
v7:
* Dropped restore_revoke flag and added vfio_pci_dma_buf_move
to reverse loop.
* Fixed spelling errors in documentation patch.
* Rebased on top of v6.18-rc3.
* Added an include of stddef.h to vfio.h to keep the uapi header file self-contained.
v6: https://patch.msgid.link/20251102-dmabuf-vfio-v6-0-d773cff0db9f@nvidia.com
* Fixed wrong error check from pcim_p2pdma_init().
* Documented pcim_p2pdma_provider() function.
* Improved commit messages.
* Added VFIO DMA-BUF selftest, not sent yet.
* Added __counted_by(nr_ranges) annotation to struct vfio_device_feature_dma_buf.
* Fixed error unwind when dma_buf_fd() fails.
* Documented the latest changes to p2pmem.
* Removed EXPORT_SYMBOL_GPL from pci_p2pdma_map_type.
* Moved DMA mapping logic to DMA-BUF.
* Removed types patch to avoid dependencies between subsystems.
* Moved vfio_pci_dma_buf_move() in err_undo block.
* Added nvgrace patch.
v5: https://lore.kernel.org/all/cover.1760368250.git.leon@kernel.org
* Rebased on top of v6.18-rc1.
* Added more validation logic to make sure that DMA-BUF length doesn't
overflow in various scenarios.
* Hide kernel config from the users.
* Fixed a type conversion issue: DMA ranges are exposed with a u64 length,
  but DMA-BUF uses "unsigned int" as the length of SG entries.
* Added a check so that VFIO drivers which report a BAR size different
  from the PCI BAR size cannot use the DMA-BUF functionality.
v4: https://lore.kernel.org/all/cover.1759070796.git.leon@kernel.org
* Split pcim_p2pdma_provider() into two functions: one that initializes
  the array of providers and another that returns the right provider pointer.
v3: https://lore.kernel.org/all/cover.1758804980.git.leon@kernel.org
* Changed pcim_p2pdma_enable() to be pcim_p2pdma_provider().
* Cache provider in vfio_pci_dma_buf struct instead of BAR index.
* Removed misleading comment from pcim_p2pdma_provider().
* Moved MMIO check to be in pcim_p2pdma_provider().
v2: https://lore.kernel.org/all/cover.1757589589.git.leon@kernel.org/
* Added an extra patch which adds a new CONFIG, so the next patches can
  reuse it.
* Squashed "PCI/P2PDMA: Remove redundant bus_offset from map state"
into the other patch.
* Fixed revoke calls to be aligned with true->false semantics.
* Extended p2pdma_providers to be per-BAR instead of global to the whole
  device.
* Fixed possible race between dmabuf states and revoke.
* Moved revoke to PCI BAR zap block.
v1: https://lore.kernel.org/all/cover.1754311439.git.leon@kernel.org
* Changed commit messages.
* Reused DMA_ATTR_MMIO attribute.
* Restored support for multiple DMA ranges per DMA-BUF.
v0: https://lore.kernel.org/all/cover.1753274085.git.leonro@nvidia.com
---------------------------------------------------------------------------
Based on "[PATCH v6 00/16] dma-mapping: migrate to physical address-based API"
https://lore.kernel.org/all/cover.1757423202.git.leonro@nvidia.com/ series.
---------------------------------------------------------------------------
This series extends the VFIO PCI subsystem to support exporting MMIO
regions from PCI device BARs as dma-buf objects, enabling safe sharing of
non-struct page memory with controlled lifetime management. This allows RDMA
and other subsystems to import dma-buf FDs and build them into memory regions
for PCI P2P operations.
The series supports a use case for SPDK where an NVMe device is owned by
SPDK through VFIO but interacts with an RDMA device. The RDMA device may
directly access the NVMe CMB or directly manipulate the NVMe device's
doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with
VFIO. This dma-buf approach is also usable by iommufd for generic and
safe P2P mappings.
In addition to the SPDK use-case mentioned above, the capability added
in this patch series can also be useful when a buffer (located in device
memory such as VRAM) needs to be shared between any two dGPU devices or
instances (assuming one of them is bound to VFIO PCI) as long as they
are P2P DMA compatible.
The implementation provides a revocable attachment mechanism using dma-buf
move operations. MMIO regions are normally pinned as BARs don't change
physical addresses, but access is revoked when the VFIO device is closed
or a PCI reset is issued. This ensures kernel self-defense against
potentially hostile userspace.
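As a rough illustration of how such a revoke is driven through the standard
dynamic DMA-BUF machinery (a sketch only, not the code in this series; the
example_* names are made up):

  #include <linux/dma-buf.h>
  #include <linux/dma-resv.h>

  /* Importer side: attach dynamically and honor move/revoke events. */
  static void example_move_notify(struct dma_buf_attachment *attach)
  {
          /* Called with the dma-buf's reservation lock held. The exporter
           * has invalidated the mapping; stop using it and re-map later
           * (or never, if this was a revoke).
           */
  }

  static const struct dma_buf_attach_ops example_attach_ops = {
          .allow_peer2peer = true,
          .move_notify = example_move_notify,
  };
  /* attach = dma_buf_dynamic_attach(dmabuf, dev, &example_attach_ops, priv); */

  /* Exporter side: revoke access, e.g. on device close or PCI reset. */
  static void example_revoke(struct dma_buf *dmabuf)
  {
          dma_resv_lock(dmabuf->resv, NULL);
          dma_buf_move_notify(dmabuf);    /* every importer gets move_notify() */
          dma_resv_unlock(dmabuf->resv);
  }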
The series includes significant refactoring of the PCI P2PDMA subsystem
to separate core P2P functionality from memory allocation features,
making it more modular and suitable for VFIO use cases that don't need
struct page support.
-----------------------------------------------------------------------
The series was originally based on
https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.c…
but heavily rewritten to be based on the physical-address DMA API.
-----------------------------------------------------------------------
The WIP branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=…
Thanks
---
Jason Gunthorpe (2):
PCI/P2PDMA: Document DMABUF model
vfio/nvgrace: Support get_dmabuf_phys
Leon Romanovsky (7):
PCI/P2PDMA: Separate the mmap() support from the core logic
PCI/P2PDMA: Simplify bus address mapping API
PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
dma-buf: provide phys_vec to scatter-gather mapping routine
vfio/pci: Enable peer-to-peer DMA transactions by default
vfio/pci: Add dma-buf export support for MMIO regions
Vivek Kasireddy (2):
vfio: Export vfio device get and put registration helpers
vfio/pci: Share the core device pointer while invoking feature functions
Documentation/driver-api/pci/p2pdma.rst | 95 +++++++---
block/blk-mq-dma.c | 2 +-
drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++
drivers/iommu/dma-iommu.c | 4 +-
drivers/pci/p2pdma.c | 182 +++++++++++++-----
drivers/vfio/pci/Kconfig | 3 +
drivers/vfio/pci/Makefile | 1 +
drivers/vfio/pci/nvgrace-gpu/main.c | 56 ++++++
drivers/vfio/pci/vfio_pci.c | 5 +
drivers/vfio/pci/vfio_pci_config.c | 22 ++-
drivers/vfio/pci/vfio_pci_core.c | 53 ++++--
drivers/vfio/pci/vfio_pci_dmabuf.c | 315 ++++++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_priv.h | 23 +++
drivers/vfio/vfio_main.c | 2 +
include/linux/dma-buf.h | 18 ++
include/linux/pci-p2pdma.h | 120 +++++++-----
include/linux/vfio.h | 2 +
include/linux/vfio_pci_core.h | 42 +++++
include/uapi/linux/vfio.h | 28 +++
kernel/dma/direct.c | 4 +-
mm/hmm.c | 2 +-
21 files changed, 1074 insertions(+), 140 deletions(-)
---
base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa
change-id: 20251016-dmabuf-vfio-6cef732adf5a
Best regards,
--
Leon Romanovsky <leonro@nvidia.com>
This series is the start of adding full DMABUF support to
iommufd. Currently it is limited to working only with VFIO's DMABUF exporter.
It sits on top of Leon's series to add a DMABUF exporter to VFIO:
https://lore.kernel.org/r/20251106-dmabuf-vfio-v7-0-2503bf390699@nvidia.com
The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF FDs, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the ioas and, if the underlying alignment
requirements are met, it will be placed in the iommu_domain.
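A rough userspace sketch of that mapping call (field names follow the
existing memfd IOMMU_IOAS_MAP_FILE uapi; the dmabuf fd, offsets and flag
choices here are assumptions for illustration, not a definitive description
of the new behavior):

  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  /* Map a slice of a DMABUF fd into an IOAS at a fixed IOVA. */
  static int map_dmabuf_slice(int iommufd, __u32 ioas_id, int dmabuf_fd,
                              __u64 offset, __u64 length, __u64 iova)
  {
          struct iommu_ioas_map_file map = {
                  .size = sizeof(map),
                  .flags = IOMMU_IOAS_MAP_FIXED_IOVA |
                           IOMMU_IOAS_MAP_READABLE |
                           IOMMU_IOAS_MAP_WRITEABLE,
                  .ioas_id = ioas_id,
                  .fd = dmabuf_fd,
                  .start = offset,        /* byte offset into the dmabuf */
                  .length = length,
                  .iova = iova,
          };

          return ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map);
  }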
Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI Peer to Peer support in VMs, and is the last feature that the VFIO
type 1 container has that iommufd couldn't do.
The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.
Instead, iommufd relies on revocable DMABUFs. Whenever VFIO thinks there
should be no access to the MMIO it can shoot down the mapping in iommufd
which will unmap it from the iommu_domain. There is no automatic remap;
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know it is doing something that will revoke the dmabuf and
unmap/remap it around the activity. E.g. when QEMU goes to issue an FLR it
should do the unmap and remap through iommufd around the reset.
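The QEMU-style sequence around an FLR would then look roughly like this
(a sketch under the same assumptions as above; map_dmabuf_slice() is the
hypothetical helper from the previous snippet, and vfio_device_fd is the
VFIO device file descriptor from <linux/vfio.h>):

  /* 1. Drop the P2P mapping before triggering the revoke. */
  struct iommu_ioas_unmap unmap = {
          .size = sizeof(unmap),
          .ioas_id = ioas_id,
          .iova = iova,
          .length = length,
  };
  ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);

  /* 2. Issue the reset; VFIO revokes the DMABUF while it is in flight. */
  ioctl(vfio_device_fd, VFIO_DEVICE_RESET);

  /* 3. Re-establish the mapping once the device is usable again. */
  map_dmabuf_slice(iommufd, ioas_id, dmabuf_fd, offset, length, iova);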
Since DMABUF is missing some key general features for this use case it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.
The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().
Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers to support DPDK/SPDK-type use cases, so future series will work
to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().
I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.
The latest series for interconnect negotiation to exchange a phys_addr is:
https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com
And the discussion for design of revoke is here:
https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf
v2:
- Rebase on Leon's v7
- Fix mislocking in an iopt_fill_domain() error path
v1: https://patch.msgid.link/r/0-v1-64bed2430cdb+31b-iommufd_dmabuf_jgg@nvidia.…
Jason Gunthorpe (9):
vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
iommufd: Add DMABUF to iopt_pages
iommufd: Do not map/unmap revoked DMABUFs
iommufd: Allow a DMABUF to be revoked
iommufd: Allow MMIO pages in a batch
iommufd: Have pfn_reader process DMABUF iopt_pages
iommufd: Have iopt_map_file_pages convert the fd to a file
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
iommufd/selftest: Add some tests for the dmabuf flow
drivers/iommu/iommufd/io_pagetable.c | 78 +++-
drivers/iommu/iommufd/io_pagetable.h | 53 ++-
drivers/iommu/iommufd/ioas.c | 8 +-
drivers/iommu/iommufd/iommufd_private.h | 14 +-
drivers/iommu/iommufd/iommufd_test.h | 10 +
drivers/iommu/iommufd/main.c | 10 +
drivers/iommu/iommufd/pages.c | 407 ++++++++++++++++--
drivers/iommu/iommufd/selftest.c | 142 ++++++
drivers/vfio/pci/vfio_pci_dmabuf.c | 34 ++
include/linux/vfio_pci_core.h | 4 +
tools/testing/selftests/iommu/iommufd.c | 43 ++
tools/testing/selftests/iommu/iommufd_utils.h | 44 ++
12 files changed, 781 insertions(+), 66 deletions(-)
base-commit: bb04e92c86b44b3e36532099b68de1e889acfee7
--
2.43.0
On 11/5/25 11:39, Pierre-Eric Pelloux-Prayer wrote:
> Le 04/11/2025 à 17:40, Christian König a écrit :
>> On 11/4/25 09:35, Pierre-Eric Pelloux-Prayer wrote:
>>> The benefits of using multiple entities is that multiple fill jobs
>>> can run in parallel. Otherwise, even if the entity has access
>>> to multiple engines, a burst of N independent jobs will all
>>> run on the same engine because an entity guarantees the ordering
>>> of execution matches the ordering of the submission.
>>>
>>> Callers can opt-out of this behavior by passing the entity they
>>> want to use (see amdgpu_move_blit).
>>
>> That still sounds like a really bad idea to me.
>>
>> First of all, we can't reserve so many fence slots in the release handler; previously we basically just relied on the fact that the BO will most likely be mostly idle.
>>
>> I think we should just use a single SDMA engine for each clear and distribute clearing different BOs over multiple engines.
>
> So N clear entities, each one having access to a single engine. And all jobs to clear a single BO go to the same entity?
>
> Is that what you mean?
More or less.
N clear entities, each one has access to all engines. When a BO needs to be cleared it picks the next best entity and submits the jobs.
This way clear entities still load balance with moves and page table updates but we can keep the clearing logic simple.
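A minimal sketch of that scheme, reusing the atomic counter and field names
from the patch above (an illustration of the suggestion, not Christian's
actual code):

  /* Pick one clear entity per BO, round-robin, and submit every partial
   * clear of this BO through it so they stay ordered on that entity.
   */
  e = atomic_inc_return(&adev->mman.next_clear_entity) %
      adev->mman.num_clear_entities;
  entity = &adev->mman.clear_entities[e];

  mutex_lock(&entity->gart_window_lock);
  /* ... all amdgpu_ttm_map_buffer()/amdgpu_ttm_fill_mem() calls for this
   * BO go through 'entity' ...
   */
  mutex_unlock(&entity->gart_window_lock);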
Christian.
>
> Pierre-Eric
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 84 ++++++++++++++++++-------
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 1 +
>>> 2 files changed, 64 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> index c357a6d9763a..839ea8c7f6be 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> @@ -2224,6 +2224,7 @@ u32 amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable)
>>> adev->mman.clear_entities = kcalloc(num_clear_entities,
>>> sizeof(struct amdgpu_ttm_entity),
>>> GFP_KERNEL);
>>> + atomic_set(&adev->mman.next_clear_entity, 0);
>>> if (!adev->mman.clear_entities)
>>> goto error_free_entity;
>>> @@ -2498,10 +2499,12 @@ int amdgpu_fill_buffer(struct amdgpu_ttm_entity *entity,
>>> {
>>> struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>>> struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
>>> + struct dma_fence *fences[TTM_FENCES_MAX_SLOT_COUNT] = {};
>>> struct dma_fence *fence = NULL;
>>> struct dma_resv *resv = NULL;
>>> struct amdgpu_res_cursor dst;
>>> - int r;
>>> + uint64_t cur_size, to;
>>> + int r, e, n_fences;
>>> /* The fences will be either added to the resv object or the last fence
>>> * will be returned to the caller. In the latter case, all fill jobs will
>>> @@ -2515,53 +2518,92 @@ int amdgpu_fill_buffer(struct amdgpu_ttm_entity *entity,
>>> }
>>> if (!entity) {
>>> - entity = &adev->mman.clear_entities[0];
>>> resv = &bo->tbo.base._resv;
>>> - r = dma_resv_reserve_fences(resv, 1);
>>> +
>>> + /* Determine how much fences we're going to add to the
>>> + * resv object.
>>> + */
>>> + n_fences = 0;
>>> + amdgpu_res_first(bo->tbo.resource, 0, amdgpu_bo_size(bo), &dst);
>>> + while (dst.remaining) {
>>> + cur_size = min(dst.size, 256ULL << 20);
>>> +
>>> + n_fences += 1;
>>> + amdgpu_res_next(&dst, cur_size);
>>> + }
>>> + if (n_fences == 0)
>>> + return 0;
>>> +
>>> + /* One slot per entity at most. */
>>> + n_fences = MIN(n_fences, adev->mman.num_clear_entities);
>>> +
>>> + r = dma_resv_reserve_fences(resv, n_fences);
>>> if (r)
>>> return r;
>>> + } else {
>>> + mutex_lock(&entity->gart_window_lock);
>>> }
>>> amdgpu_res_first(bo->tbo.resource, 0, amdgpu_bo_size(bo), &dst);
>>> - mutex_lock(&entity->gart_window_lock);
>>> while (dst.remaining) {
>>> - struct dma_fence *next;
>>> - uint64_t cur_size, to;
>>> -
>>> /* Never fill more than 256MiB at once to avoid timeouts */
>>> cur_size = min(dst.size, 256ULL << 20);
>>> + if (resv) {
>>> + /* Pick a new entity for each partial clear so they can
>>> + * execute in parallel.
>>> + */
>>> + e = atomic_inc_return(&adev->mman.next_clear_entity) %
>>> + adev->mman.num_clear_entities;
>>> + entity = &adev->mman.clear_entities[e];
>>> + mutex_lock(&entity->gart_window_lock);
>>> + }
>>> +
>>> r = amdgpu_ttm_map_buffer(&entity->base,
>>> &bo->tbo, bo->tbo.resource, &dst,
>>> entity->gart_window_id1, ring, false,
>>> &cur_size, &to,
>>> dependency,
>>> resv);
>>> - if (r)
>>> + if (r) {
>>> + mutex_unlock(&entity->gart_window_lock);
>>> goto error;
>>> + }
>>> r = amdgpu_ttm_fill_mem(ring, &entity->base,
>>> src_data, to, cur_size, resv,
>>> - &next, true, k_job_id);
>>> - if (r)
>>> + &fence, true, k_job_id);
>>> + if (r) {
>>> + mutex_unlock(&entity->gart_window_lock);
>>> goto error;
>>> -
>>> - if (resv) {
>>> - dma_resv_add_fence(resv, next, DMA_RESV_USAGE_KERNEL);
>>> - dma_fence_put(next);
>>> - } else {
>>> - dma_fence_put(fence);
>>> - fence = next;
>>> }
>>> amdgpu_res_next(&dst, cur_size);
>>> +
>>> + if (resv) {
>>> + /* Delay the addition of the fences to resv, otherwise the next partial
>>> + * clears will depend on this one.
>>> + */
>>> + fences[e] = fence;
>>> + mutex_unlock(&entity->gart_window_lock);
>>> + } else {
>>> + dma_fence_put(*f);
>>> + *f = fence;
>>> + }
>>> }
>>> error:
>>> - mutex_unlock(&entity->gart_window_lock);
>>> - if (f)
>>> - *f = dma_fence_get(fence);
>>> - dma_fence_put(fence);
>>> + if (resv) {
>>> + for (e = 0; e < adev->mman.num_clear_entities; e++) {
>>> + if (fences[e]) {
>>> + dma_resv_add_fence(resv, fences[e], DMA_RESV_USAGE_KERNEL);
>>> + dma_fence_put(fences[e]);
>>> + }
>>> + }
>>> + } else {
>>> + mutex_unlock(&entity->gart_window_lock);
>>> + }
>>> +
>>> return r;
>>> }
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>>> index 38df2b5b4bc7..3fc31c7c6bfe 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
>>> @@ -73,6 +73,7 @@ struct amdgpu_mman {
>>> struct amdgpu_ttm_entity default_entity; /* has no gart windows */
>>> struct amdgpu_ttm_entity *clear_entities;
>>> + atomic_t next_clear_entity;
>>> u32 num_clear_entities;
>>> struct amdgpu_ttm_entity move_entities[TTM_FENCES_MAX_SLOT_COUNT];
>>> u32 num_move_entities;
On 11/5/25 11:34, Pierre-Eric Pelloux-Prayer wrote:
>>
>>> + } else {
>>> + all_signaled = false;
>>> + if (no_wait_gpu) {
>>> + spin_unlock(&man->pipelined_eviction.lock);
>>> + return -EBUSY;
>>> + }
>>> + fences_to_add[n++] = dma_fence_get(fence);
>>> + }
>>> + }
>>> + spin_unlock(&man->pipelined_eviction.lock);
>>> +
>>> + if (all_signaled)
>>> return 0;
>>> - if (no_wait_gpu) {
>>> - ret = dma_fence_is_signaled(fence) ? 0 : -EBUSY;
>>> - dma_fence_put(fence);
>>> - return ret;
>>> + for (i = 0; i < n; i++) {
>>> + dma_resv_add_fence(bo->base.resv, fences_to_add[i], DMA_RESV_USAGE_KERNEL);
>>> + dma_fence_put(fences_to_add[i]);
>>> }
>>> - dma_resv_add_fence(bo->base.resv, fence, DMA_RESV_USAGE_KERNEL);
>>> -
>>> - ret = dma_resv_reserve_fences(bo->base.resv, 1);
>>> - dma_fence_put(fence);
>>> - return ret;
>>> + return dma_resv_reserve_fences(bo->base.resv, TTM_FENCES_MAX_SLOT_COUNT);
>>
>> Please separate out a patch where the call to dma_resv_reserve_fences() is removed here.
>
> Can you remind me why it's not needed?
Some years ago the dma_resv object had a fixed field for an exclusive fence.
When we removed that we sprinkled calls like "dma_resv_reserve_fences(bo->base.resv, 1)" all over the place where we previously used this exclusive fence slot to prevent things from going boom!
It could be that some old drivers like radeon or qxl still rely on that somewhere, but that would then clearly be a driver bug.
What we could do is to either leave it alone or remove it, but changing it to reserving TTM_FENCES_MAX_SLOT_COUNT is clearly not correct.
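For context, the general dma_resv contract being discussed is that slots are
reserved under the reservation lock right before adding fences, roughly like
this (a generic sketch, not code from this patch):

  dma_resv_lock(bo->base.resv, NULL);

  /* Reserve exactly as many slots as fences we are about to add. */
  if (!dma_resv_reserve_fences(bo->base.resv, 1))
          dma_resv_add_fence(bo->base.resv, fence,
                             DMA_RESV_USAGE_KERNEL);

  dma_resv_unlock(bo->base.resv);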
>>
>>> }
>>> /**
>>> @@ -718,7 +732,7 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo,
>>> int i, ret;
>>> ticket = dma_resv_locking_ctx(bo->base.resv);
>>> - ret = dma_resv_reserve_fences(bo->base.resv, 1);
>>> + ret = dma_resv_reserve_fences(bo->base.resv, TTM_FENCES_MAX_SLOT_COUNT);
>>> if (unlikely(ret))
>>> return ret;
>>> @@ -757,7 +771,7 @@ static int ttm_bo_alloc_resource(struct ttm_buffer_object *bo,
>>> return ret;
>>> }
>>> - ret = ttm_bo_add_move_fence(bo, man, ctx->no_wait_gpu);
>>> + ret = ttm_bo_add_pipelined_eviction_fences(bo, man, ctx->no_wait_gpu);
>>> if (unlikely(ret)) {
>>> ttm_resource_free(bo, res);
>>> if (ret == -EBUSY)
>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> index acbbca9d5c92..ada8af965acf 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> @@ -258,7 +258,7 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo,
>>> ret = dma_resv_trylock(&fbo->base.base._resv);
>>> WARN_ON(!ret);
>>> - ret = dma_resv_reserve_fences(&fbo->base.base._resv, 1);
>>> + ret = dma_resv_reserve_fences(&fbo->base.base._resv, TTM_FENCES_MAX_SLOT_COUNT);
>>> if (ret) {
>>> dma_resv_unlock(&fbo->base.base._resv);
>>> kfree(fbo);
>>> @@ -646,6 +646,8 @@ static void ttm_bo_move_pipeline_evict(struct ttm_buffer_object *bo,
>>> {
>>> struct ttm_device *bdev = bo->bdev;
>>> struct ttm_resource_manager *from;
>>> + struct dma_fence *tmp;
>>> + int i, free_slot = -1;
>>> from = ttm_manager_type(bdev, bo->resource->mem_type);
>>> @@ -653,13 +655,35 @@ static void ttm_bo_move_pipeline_evict(struct ttm_buffer_object *bo,
>>> * BO doesn't have a TTM we need to bind/unbind. Just remember
>>> * this eviction and free up the allocation
>>> */
>>> - spin_lock(&from->move_lock);
>>> - if (!from->move || dma_fence_is_later(fence, from->move)) {
>>> - dma_fence_put(from->move);
>>> - from->move = dma_fence_get(fence);
>>> + spin_lock(&from->pipelined_eviction.lock);
>>> + for (i = 0; i < from->pipelined_eviction.n_fences; i++) {
>>> + tmp = from->pipelined_eviction.fences[i];
>>> + if (!tmp) {
>>> + if (free_slot < 0)
>>> + free_slot = i;
>>> + continue;
>>
>> Just break here.
>
> The logic here is to reuse context slots. Even if slot 0 is empty, I need to use slot 1 if slot 1's context is the same as fence->context.
Good point, but slot 0 should never be empty. Since we fill the slots starting from 0 and incrementing, while scanning you either see a slot which doesn't match or a NULL slot, and once you hit a NULL slot no later slot can match either.
> This way, we're guaranteed to find a slot for all contexts used by the driver.
>
>>
>>> + }
>>> + if (fence->context != tmp->context)
>>> + continue;
>>> + if (dma_fence_is_later(fence, tmp)) {
>>> + dma_fence_put(tmp);
>>> + free_slot = i;
>>> + break;
>>> + }
>>> + goto unlock;
>>> + }
>>> + if (free_slot >= 0) {
>>
>> Drop free_slot and check i here.
>>
>>> + from->pipelined_eviction.fences[free_slot] = dma_fence_get(fence);
>>> + } else {
>>> + WARN(1, "not enough fence slots for all fence contexts");
>>> + spin_unlock(&from->pipelined_eviction.lock);
>>> + dma_fence_wait(fence, false);
>>> + goto end;
>>> }
>>> - spin_unlock(&from->move_lock);
>>> +unlock:
>>> + spin_unlock(&from->pipelined_eviction.lock);
>>> +end:
>>> ttm_resource_free(bo, &bo->resource);
>>> }
>>> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c b/drivers/gpu/drm/ttm/ttm_resource.c
>>> index e2c82ad07eb4..ae0d4621cc55 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_resource.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_resource.c
>>> @@ -523,14 +523,19 @@ void ttm_resource_manager_init(struct ttm_resource_manager *man,
>>> {
>>> unsigned i;
>>> - spin_lock_init(&man->move_lock);
>>> man->bdev = bdev;
>>> man->size = size;
>>> man->usage = 0;
>>> for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i)
>>> INIT_LIST_HEAD(&man->lru[i]);
>>> - man->move = NULL;
>>> + spin_lock_init(&man->pipelined_eviction.lock);
>>> + for (i = 0; i < TTM_FENCES_MAX_SLOT_COUNT; i++)
>>> + man->pipelined_eviction.fences[i] = NULL;
>>> + /* Can be overridden by drivers that wants to use more than 1 entity
>>> + * for moves and evictions (limited to TTM_FENCES_MAX_SLOT_COUNT).
>>> + */
>>> + man->pipelined_eviction.n_fences = 1;
>>> }
>>> EXPORT_SYMBOL(ttm_resource_manager_init);
>>> @@ -551,7 +556,7 @@ int ttm_resource_manager_evict_all(struct ttm_device *bdev,
>>> .no_wait_gpu = false,
>>> };
>>> struct dma_fence *fence;
>>> - int ret;
>>> + int ret, i;
>>> do {
>>> ret = ttm_bo_evict_first(bdev, man, &ctx);
>>> @@ -561,18 +566,32 @@ int ttm_resource_manager_evict_all(struct ttm_device *bdev,
>>> if (ret && ret != -ENOENT)
>>> return ret;
>>> - spin_lock(&man->move_lock);
>>> - fence = dma_fence_get(man->move);
>>> - spin_unlock(&man->move_lock);
>>> + ret = 0;
>>> - if (fence) {
>>> - ret = dma_fence_wait(fence, false);
>>> - dma_fence_put(fence);
>>> - if (ret)
>>> - return ret;
>>> - }
>>> + do {
>>> + fence = NULL;
>>> - return 0;
>>> + spin_lock(&man->pipelined_eviction.lock);
>>> + for (i = 0; i < man->pipelined_eviction.n_fences; i++) {
>>> + fence = man->pipelined_eviction.fences[i];
>>
>>> + man->pipelined_eviction.fences[i] = NULL;
>>
>> Drop that. We should never set man->pipelined_eviction.fences to NULL.
>
> Why?
To simplify the logic while filling the slots.
dma_fences are made to be kept around for a long time.
>>
>>> +
>>> /**
>>> * enum ttm_lru_item_type - enumerate ttm_lru_item subclasses
>>> */
>>> @@ -180,8 +189,10 @@ struct ttm_resource_manager_func {
>>> * @size: Size of the managed region.
>>> * @bdev: ttm device this manager belongs to
>>> * @func: structure pointer implementing the range manager. See above
>>> - * @move_lock: lock for move fence
>>> - * @move: The fence of the last pipelined move operation.
>>> + * @pipelined_eviction.lock: lock for eviction fences
>>> + * @pipelined_eviction.n_fences: The number of fences allowed in the array. If
>>> + * 0, pipelined evictions aren't used.
>>> + * @pipelined_eviction.fences: The fences of the last pipelined move operation.
>>> * @lru: The lru list for this memory type.
>>> *
>>> * This structure is used to identify and manage memory types for a device.
>>> @@ -195,12 +206,15 @@ struct ttm_resource_manager {
>>> struct ttm_device *bdev;
>>> uint64_t size;
>>> const struct ttm_resource_manager_func *func;
>>> - spinlock_t move_lock;
>>> - /*
>>> - * Protected by @move_lock.
>>> + /* This is very similar to a dma_resv object, but locking rules make
>>> + * it difficult to use a it in this context.
>>> */
>>> - struct dma_fence *move;
>>> + struct {
>>> + spinlock_t lock;
>>> + int n_fences;
>>> + struct dma_fence *fences[TTM_FENCES_MAX_SLOT_COUNT];
>>> + } pipelined_eviction;
>>
>> Drop the separate structure, just make move an array instead.
>
> IMO pipelined_eviction.fences and pipelined_eviction.lock is clearer when reading the code than moves and move_lock but if you prefer I'll switch back to the old names.
The name "pipelined_eviction" is just a bit to long. Maybe just eviction_fences and eviction_fences_lock?
Regards,
Christian.
>>
>> And also drop n_fences. Just always take a look at all fences.
>
> OK.
>
> Thanks,
> Pierre-Eric
>
>>
>> Regards,
>> Christian.
>>
>>> /*
>>> * Protected by the bdev->lru_lock.
>>> @@ -421,8 +435,12 @@ static inline bool ttm_resource_manager_used(struct ttm_resource_manager *man)
>>> static inline void
>>> ttm_resource_manager_cleanup(struct ttm_resource_manager *man)
>>> {
>>> - dma_fence_put(man->move);
>>> - man->move = NULL;
>>> + int i;
>>> +
>>> + for (i = 0; i < TTM_FENCES_MAX_SLOT_COUNT; i++) {
>>> + dma_fence_put(man->pipelined_eviction.fences[i]);
>>> + man->pipelined_eviction.fences[i] = NULL;
>>> + }
>>> }
>>> void ttm_lru_bulk_move_init(struct ttm_lru_bulk_move *bulk);