On Sun, Oct 26, 2025 at 09:44:14PM -0700, Vivek Kasireddy wrote:
> +/**
> + * dma_buf_match_interconnects - determine if there is a specific interconnect
> + * that is supported by both exporter and importer.
> + * @attach: [in] attachment to populate ic_match field
> + * @exp: [in] array of interconnects supported by exporter
> + * @exp_ics: [in] number of interconnects supported by exporter
> + * @imp: [in] array of interconnects supported by importer
> + * @imp_ics: [in] number of interconnects supported by importer
> + *
> + * This helper function iterates through the list of interconnects supported by
> + * both exporter and importer to find a match. A successful match means that
> + * a common interconnect type is supported by both parties and the exporter's
> + * match_interconnect() callback also confirms that the importer is compatible
> + * with the exporter for that interconnect type.
Document which of the exporter/importer is supposed to call this
> + *
> + * If a match is found, the attach->ic_match field is populated with a copy
> + * of the exporter's match data.
> + * Return: true if a match is found, false otherwise.
> + */
> +bool dma_buf_match_interconnects(struct dma_buf_attachment *attach,
> +				 const struct dma_buf_interconnect_match *exp,
> +				 unsigned int exp_ics,
> +				 const struct dma_buf_interconnect_match *imp,
> +				 unsigned int imp_ics)
> +{
> +	const struct dma_buf_interconnect_ops *ic_ops;
> +	struct dma_buf_interconnect_match *ic_match;
> +	struct dma_buf *dmabuf = attach->dmabuf;
> +	unsigned int i, j;
> +
> +	if (!exp || !imp)
> +		return false;
> +
> +	if (!attach->allow_ic)
> +		return false;
Seems redundant with this check for ic_ops == NULL:
> +	ic_ops = dmabuf->ops->interconnect_ops;
> +	if (!ic_ops || !ic_ops->match_interconnect)
> +		return false;
This seems like too much of a maze to me..
I think you should structure it like this. First declare an interconnect:
struct dma_buf_interconnect iov_interconnect = {
	.name = "IOV interconnect",
	.match = ...,
};
Then the exporters "subclass"
struct dma_buf_interconnect_ops vfio_iov_interconnect = {
	.interconnect = &iov_interconnect,
	.map = vfio_map,
};
I guess no container_of technique..
Then in VFIO's attach trigger the new code:
const struct dma_buf_interconnect_match vfio_exp_ics[] = {
	{ &vfio_iov_interconnect },
};

dma_buf_match_interconnects(attach, vfio_exp_ics, ARRAY_SIZE(vfio_exp_ics));
Which will call back to the importer:
static const struct dma_buf_attach_ops xe_dma_buf_attach_ops = {
	.get_importer_interconnects = ...,
};
dma_buf_match_interconnects() would call
aops->get_importer_interconnects
and match first on .interconnect, then call the interconnect->match
function with the exporter/importer match structs if not NULL.
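Roughly, the core helper would then look something like this (sketch of
the flow only, all of these names are made up to match the shape above,
nothing here exists yet):

bool dma_buf_match_interconnects(struct dma_buf_attachment *attach,
				 const struct dma_buf_interconnect_match *exp,
				 unsigned int num_exp)
{
	const struct dma_buf_attach_ops *aops = attach->importer_ops;
	const struct dma_buf_interconnect_match *imp;
	unsigned int num_imp, i, j;

	if (!aops || !aops->get_importer_interconnects)
		return false;

	imp = aops->get_importer_interconnects(attach, &num_imp);
	if (!imp)
		return false;

	for (i = 0; i < num_exp; i++) {
		for (j = 0; j < num_imp; j++) {
			const struct dma_buf_interconnect *ic = exp[i].interconnect;

			if (ic != imp[j].interconnect)
				continue;
			if (ic->match && !ic->match(attach, &exp[i], &imp[j]))
				continue;

			/* copy the exporter's match data for later map/unmap */
			attach->ic_match = exp[i];
			return true;
		}
	}
	return false;
}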
> +struct dma_buf_interconnect_match {
> +	const struct dma_buf_interconnect *type;
> +	struct device *dev;
> +	unsigned int bar;
> +};
This should be more general, dev and bar are unique to the iov
importer. Maybe just something simple:
struct dma_buf_interconnect_match {
	struct dma_buf_interconnect *ic; // no need for type
	const struct dma_buf_interconnect_ops *exporter_ic_ops;
	u64 match_data[2]; // dev and bar are IOV specific, generalize
};
Then some helper:
const struct dma_buf_interconnect_match supports_ics[] = {
	IOV_INTERCONNECT(&vfio_iov_interconnect, dev, bar),
};
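where IOV_INTERCONNECT() would be a trivial wrapper, e.g. something like
this (sketch only, the packing of dev/bar into match_data is just an
assumption):

#define IOV_INTERCONNECT(ops, dev, bar)					\
	{								\
		.ic = (ops)->interconnect,				\
		.exporter_ic_ops = (ops),				\
		.match_data = { (u64)(unsigned long)(dev), (bar) },	\
	}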
And it would be nice if interconnect-aware drivers could more easily
interwork with non-interconnect importers.
So I'd add an exporter type of 'p2p dma mapped scatterlist' that just
matches the legacy importer.
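i.e. roughly (sketch, name made up):

static const struct dma_buf_interconnect dma_buf_sgt_interconnect = {
	.name = "p2p dma mapped scatterlist",
	/* no ->match needed, any legacy sg_table importer qualifies */
};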
Jason
On Sun, Oct 26, 2025 at 09:44:13PM -0700, Vivek Kasireddy wrote:
> For the map operation, the dma-buf core will create an xarray but
> the exporter needs to populate it with the interconnect specific
> addresses. And, similarly for unmap, the exporter is expected to
> cleanup the individual entries of the xarray.
I don't think we should limit this to xarrays, nor do I think it is a
great data structure for what is usually needed here..
I just posted the patches showing what iommufd needs, and it wants
something like
struct mapping {
	struct p2p_provider *provider;
	size_t nelms;
	struct phys_vec *phys;
};
Which is not something that makes sense as an xarray.
I think the interconnect should have its own functions for map/unmap,
i.e. instead of trying to have them as a common
dma_buf_interconnect_ops do something like
struct dma_buf_interconnect_ops {
	const char *name;
	bool (*supports_interconnects)(struct dma_buf_attachment *attach,
				       const struct dma_buf_interconnect_match *,
				       unsigned int num_ics);
};

struct dma_buf_iov_interconnect_ops {
	struct dma_buf_interconnect_ops ic_ops;
	struct xx *(*map)(struct dma_buf_attachment *attach,
			  unsigned int *bar_number,
			  size_t *nelms);
	// No unmap for iov
};
static inline struct xx *dma_buf_iov_map(struct dma_buf_attachment *attach,
					 unsigned int *bar_number,
					 size_t *nelms)
{
	return container_of(attach->ic_ops, struct dma_buf_iov_interconnect_ops,
			    ic_ops)->map(attach, bar_number, nelms);
}
> +/**
> + * dma_buf_attachment_is_dynamic - check if the importer can handle move_notify.
> + * @attach: the attachment to check
> + *
> + * Returns true if a DMA-buf importer has indicated that it can handle dmabuf
> + * location changes through the move_notify callback.
> + */
> +static inline bool
> +dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
> +{
> +	return !!attach->importer_ops;
> +}
Why is this in this patch?
I also think this patch should be second in the series; it makes more
sense to figure out how to attach with an interconnect and then show how
to map/unmap with that interconnect.
Like I'm not sure why this introduces allow_ic?
Jason
On Sun, Oct 26, 2025 at 03:55:04PM +0800, Shuai Xue wrote:
>
>
> > On 2025/10/22 20:50, Jason Gunthorpe wrote:
> > On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro(a)nvidia.com>
> > >
> > > Add support for exporting PCI device MMIO regions through dma-buf,
> > > enabling safe sharing of non-struct page memory with controlled
> > > lifetime management. This allows RDMA and other subsystems to import
> > > dma-buf FDs and build them into memory regions for PCI P2P operations.
> > >
> > > The implementation provides a revocable attachment mechanism using
> > > dma-buf move operations. MMIO regions are normally pinned as BARs
> > > don't change physical addresses, but access is revoked when the VFIO
> > > device is closed or a PCI reset is issued. This ensures kernel
> > > self-defense against potentially hostile userspace.
> >
> > Let's enhance this:
> >
> > Currently VFIO can take MMIO regions from the device's BAR and map
> > them into a PFNMAP VMA with special PTEs. This mapping type ensures
> > the memory cannot be used with things like pin_user_pages(), hmm, and
> > so on. In practice only the user process CPU and KVM can safely make
> > use of these VMA. When VFIO shuts down these VMAs are cleaned by
> > unmap_mapping_range() to prevent any UAF of the MMIO beyond driver
> > unbind.
> >
> > However, VFIO type 1 has an insecure behavior where it uses
> > follow_pfnmap_*() to fish a MMIO PFN out of a VMA and program it back
> > into the IOMMU. This has a long history of enabling P2P DMA inside
> > VMs, but has serious lifetime problems by allowing a UAF of the MMIO
> > after the VFIO driver has been unbound.
>
> Hi, Jason,
>
> Can you elaborate on this more?
>
> From my understanding of the VFIO type 1 implementation:
>
> - When a device is opened through VFIO type 1, it increments the
> device->refcount
> - During unbind, the driver waits for this refcount to drop to zero via
> wait_for_completion(&device->comp)
> - This should prevent the unbind() from completing while the device is
> still in use
>
> Given this refcount mechanism, I cannot figure out how the UAF can
> occur.
A second vfio device can be opened and then use follow_pfnmap_*() to
read the first vfio device's PTEs. There is no relationship between
the first and second VFIO devices, so once the first is unbound it
sails through the device->comp while the second device retains the PFN
in its type1 iommu_domain.
Jason
On 10/20/25 13:18, Matthew Brost wrote:
> On Mon, Oct 20, 2025 at 10:16:23AM +0200, Philipp Stanner wrote:
>> On Fri, 2025-10-17 at 14:28 -0700, Matthew Brost wrote:
>>> On Fri, Oct 17, 2025 at 11:31:47AM +0200, Philipp Stanner wrote:
>>>> It seems that DMA_FENCE_FLAG_SEQNO64_BIT has no real effects anymore,
>>>> since seqno is a u64 everywhere.
>>>>
>>>> Remove the unneeded flag.
>>>>
>>>> Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
>>>> ---
>>>> Seems to me that this flag doesn't really do anything anymore?
>>>>
>>>> I *suspect* that it could be that some drivers pass a u32 to
>>>> dma_fence_init()? I guess they could be ported, couldn't they.
>>>>
>>>
>>> Xe uses 32-bit hardware fence sequence numbers—see [1] and [2]. We could
>>> switch to 64-bit hardware fence sequence numbers, but that would require
>>> changes on the driver side. If you sent this to our CI, I’m fairly
>>> certain we’d see a bunch of failures. I suspect this would also break
>>> several other drivers.
>>
>> What exactly breaks? Help me out here; if you pass a u32 for a u64,
>
> Seqno wraps.
>
>> doesn't the C standard guarantee that the higher, unused 32 bits will
>> be 0?
>
> return (int)(lower_32_bits(f1) - lower_32_bits(f2)) > 0;
>
> Look at the above logic.
>
> f1 = 0x0;
> f2 = 0xffffffff; /* -1 */
>
> The above statement will correctly return true.
>
> Compared to the below statement which returns false.
>
> return f1 > f2;
>
> We test seqno wraps in Xe by setting our initial seqno to -127, so again, if
> you send this patch to our CI, any test which sends more than 127 jobs on a
> queue will likely fail.
Yeah, exactly, that's why this flag is needed for quite a lot of things.
The question is what is missing in the documentation to make that clear?
Regards,
Christian.
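(The wrap case is easy to see standalone; illustrative userspace snippet,
equivalent to the lower_32_bits() comparison quoted above:)

#include <stdint.h>
#include <stdio.h>

static int is_later32(uint64_t f1, uint64_t f2)
{
	/* same idea as (int)(lower_32_bits(f1) - lower_32_bits(f2)) > 0 */
	return (int32_t)((uint32_t)f1 - (uint32_t)f2) > 0;
}

int main(void)
{
	uint64_t f1 = 0x0, f2 = 0xffffffff;

	printf("wrap-safe: %d\n", is_later32(f1, f2));	/* 1: f1 is later */
	printf("plain:     %d\n", f1 > f2);		/* 0: wrap handled wrong */
	return 0;
}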
>
> Matt
>
>>
>> Because the only thing the flag still does is do this lower_32 check in
>> fence_is_later.
>>
>> P.
>>
>>>
>>> As I mentioned, all Xe-supported platforms could be updated since their
>>> rings support 64-bit store instructions. However, I suspect that very
>>> old i915 platforms don’t support such instructions in the ring. I agree
>>> this is a legacy issue, and we should probably use 64-bit sequence
>>> numbers in Xe. But again, platforms and drivers that are decades old
>>> might break as a result.
>>>
>>> Matt
>>>
>>> [1] https://elixir.bootlin.com/linux/v6.17.1/source/drivers/gpu/drm/xe/xe_hw_fe…
>>> [2] https://elixir.bootlin.com/linux/v6.17.1/source/drivers/gpu/drm/xe/xe_hw_fe…
>>>
>>>> P.
>>>> ---
>>>> drivers/dma-buf/dma-fence.c | 3 +--
>>>> include/linux/dma-fence.h | 10 +---------
>>>> 2 files changed, 2 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
>>>> index 3f78c56b58dc..24794c027813 100644
>>>> --- a/drivers/dma-buf/dma-fence.c
>>>> +++ b/drivers/dma-buf/dma-fence.c
>>>> @@ -1078,8 +1078,7 @@ void
>>>> dma_fence_init64(struct dma_fence *fence, const struct dma_fence_ops *ops,
>>>> spinlock_t *lock, u64 context, u64 seqno)
>>>> {
>>>> - __dma_fence_init(fence, ops, lock, context, seqno,
>>>> - BIT(DMA_FENCE_FLAG_SEQNO64_BIT));
>>>> + __dma_fence_init(fence, ops, lock, context, seqno, 0);
>>>> }
>>>> EXPORT_SYMBOL(dma_fence_init64);
>>>>
>>>> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
>>>> index 64639e104110..4eca2db28625 100644
>>>> --- a/include/linux/dma-fence.h
>>>> +++ b/include/linux/dma-fence.h
>>>> @@ -98,7 +98,6 @@ struct dma_fence {
>>>> };
>>>>
>>>> enum dma_fence_flag_bits {
>>>> - DMA_FENCE_FLAG_SEQNO64_BIT,
>>>> DMA_FENCE_FLAG_SIGNALED_BIT,
>>>> DMA_FENCE_FLAG_TIMESTAMP_BIT,
>>>> DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT,
>>>> @@ -470,14 +469,7 @@ dma_fence_is_signaled(struct dma_fence *fence)
>>>> */
>>>> static inline bool __dma_fence_is_later(struct dma_fence *fence, u64 f1, u64 f2)
>>>> {
>>>> - /* This is for backward compatibility with drivers which can only handle
>>>> - * 32bit sequence numbers. Use a 64bit compare when the driver says to
>>>> - * do so.
>>>> - */
>>>> - if (test_bit(DMA_FENCE_FLAG_SEQNO64_BIT, &fence->flags))
>>>> - return f1 > f2;
>>>> -
>>>> - return (int)(lower_32_bits(f1) - lower_32_bits(f2)) > 0;
>>>> + return f1 > f2;
>>>> }
>>>>
>>>> /**
>>>> --
>>>> 2.49.0
>>>>
>>
From: Matthew Auld <matthew.auld(a)intel.com>
[ Upstream commit edb1745fc618ba8ef63a45ce3ae60de1bdf29231 ]
Since the dma-resv is shared we don't need to reserve and add a fence
slot fence twice, plus no need to loop through the dependencies.
Signed-off-by: Matthew Auld <matthew.auld(a)intel.com>
Cc: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Cc: Matthew Brost <matthew.brost(a)intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt(a)intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Link: https://lore.kernel.org/r/20250829164715.720735-2-matthew.auld@intel.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
LLM Generated explanations, may be completely bogus:
YES
Explanation
- What it fixes
- Removes redundant dma-resv operations when a backup BO shares the
same reservation object as the original BO, preventing the same
fence from being reserved/added twice to the same `dma_resv`.
- Avoids scanning the same dependency set twice when source and
destination BOs share the same `dma_resv`.
- Why the change is correct
- The backup object is created to share the parent’s reservation
object, so a single reserve/add is sufficient:
- The backup BO is initialized with the parent’s resv:
`drivers/gpu/drm/xe/xe_bo.c:1309` (`xe_bo_init_locked(...,
bo->ttm.base.resv, ...)`), ensuring `bo->ttm.base.resv ==
backup->ttm.base.resv`.
- The patch adds an explicit invariant check to document and enforce
this: `drivers/gpu/drm/xe/xe_bo.c:1225` (`xe_assert(xe,
bo->ttm.base.resv == backup->ttm.base.resv)`).
- With shared `dma_resv`, adding the same fence twice is at best
redundant (wasting fence slots and memory) and at worst error-prone.
Reserving fence slots only once and adding the fence once is the
correct behavior.
- Specific code changes and effects
- Evict path (GPU migration copy case):
- Before: reserves and adds fence on both `bo->ttm.base.resv` and
`backup->ttm.base.resv`.
- After: reserves and adds exactly once, guarded by the shared-resv
assertion.
- See single reserve and add: `drivers/gpu/drm/xe/xe_bo.c:1226`
(reserve) and `drivers/gpu/drm/xe/xe_bo.c:1237` (add fence). This
is the core fix; the removed second reserve/add on the backup is
the redundant part eliminated.
- Restore path (migration copy back):
- Same simplification: reserve once, add once on the shared
`dma_resv`.
- See single reserve and add: `drivers/gpu/drm/xe/xe_bo.c:1375`
(reserve) and `drivers/gpu/drm/xe/xe_bo.c:1387` (add fence).
- Dependency handling in migrate:
- Before: added deps for both src and dst based only on `src_bo !=
dst_bo`.
- After: only add dst deps if the resv objects differ, avoiding
double-walking the same `dma_resv`.
- See updated condition: `drivers/gpu/drm/xe/xe_migrate.c:932`
(`src_bo->ttm.base.resv != dst_bo->ttm.base.resv`).
- User-visible impact without the patch
- Duplicate `dma_resv_add_fence()` calls on the same reservation
object can:
- Consume extra shared-fence slots and memory.
- Inflate dependency lists, causing unnecessary scheduler waits and
overhead.
- Increase failure likelihood of `dma_resv_reserve_fences()` under
memory pressure.
- These paths are exercised during suspend/resume flows of pinned VRAM
BOs (evict/restore), so reliability and performance in power
transitions can be affected.
- Scope and risk
- Small, focused changes localized to the Intel Xe driver
migration/evict/restore paths:
- Files: `drivers/gpu/drm/xe/xe_bo.c`,
`drivers/gpu/drm/xe/xe_migrate.c`.
- No API changes or architectural refactors; logic strictly reduces
redundant operations.
- The `xe_assert` acts as a safety net to catch unexpected non-shared
`resv` usage; normal runtime behavior is unchanged when the
invariant holds.
- The CPU copy fallback paths are untouched.
- Stable backport considerations
- This is a clear correctness and robustness fix, not a feature.
- Low regression risk if the stable branch also creates the backup BO
with the parent’s `dma_resv` (as shown by the use of
`xe_bo_init_locked(..., bo->ttm.base.resv, ...)` in
`drivers/gpu/drm/xe/xe_bo.c:1309`).
- If a stable branch diverges and the backup BO does not share the
resv, this patch would need adjustment (i.e., keep dual reserve/add
in that case). The added `xe_assert` helps surface such mismatches
during testing.
Conclusion: This commit fixes a real bug (duplicate fence reserve/add
and duplicate dependency scanning on a shared `dma_resv`) with a
minimal, well-scoped change. It aligns with stable rules (important
bugfix, low risk, contained), so it should be backported.
drivers/gpu/drm/xe/xe_bo.c | 13 +------------
drivers/gpu/drm/xe/xe_migrate.c | 2 +-
2 files changed, 2 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index d07e23eb1a54d..5a61441d68af5 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -1242,14 +1242,11 @@ int xe_bo_evict_pinned(struct xe_bo *bo)
else
migrate = mem_type_to_migrate(xe, bo->ttm.resource->mem_type);
+ xe_assert(xe, bo->ttm.base.resv == backup->ttm.base.resv);
ret = dma_resv_reserve_fences(bo->ttm.base.resv, 1);
if (ret)
goto out_backup;
- ret = dma_resv_reserve_fences(backup->ttm.base.resv, 1);
- if (ret)
- goto out_backup;
-
fence = xe_migrate_copy(migrate, bo, backup, bo->ttm.resource,
backup->ttm.resource, false);
if (IS_ERR(fence)) {
@@ -1259,8 +1256,6 @@ int xe_bo_evict_pinned(struct xe_bo *bo)
dma_resv_add_fence(bo->ttm.base.resv, fence,
DMA_RESV_USAGE_KERNEL);
- dma_resv_add_fence(backup->ttm.base.resv, fence,
- DMA_RESV_USAGE_KERNEL);
dma_fence_put(fence);
} else {
ret = xe_bo_vmap(backup);
@@ -1338,10 +1333,6 @@ int xe_bo_restore_pinned(struct xe_bo *bo)
if (ret)
goto out_unlock_bo;
- ret = dma_resv_reserve_fences(backup->ttm.base.resv, 1);
- if (ret)
- goto out_unlock_bo;
-
fence = xe_migrate_copy(migrate, backup, bo,
backup->ttm.resource, bo->ttm.resource,
false);
@@ -1352,8 +1343,6 @@ int xe_bo_restore_pinned(struct xe_bo *bo)
dma_resv_add_fence(bo->ttm.base.resv, fence,
DMA_RESV_USAGE_KERNEL);
- dma_resv_add_fence(backup->ttm.base.resv, fence,
- DMA_RESV_USAGE_KERNEL);
dma_fence_put(fence);
} else {
ret = xe_bo_vmap(backup);
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 2a627ed64b8f8..ba9b8590eccb2 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -901,7 +901,7 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
if (!fence) {
err = xe_sched_job_add_deps(job, src_bo->ttm.base.resv,
DMA_RESV_USAGE_BOOKKEEP);
- if (!err && src_bo != dst_bo)
+ if (!err && src_bo->ttm.base.resv != dst_bo->ttm.base.resv)
err = xe_sched_job_add_deps(job, dst_bo->ttm.base.resv,
DMA_RESV_USAGE_BOOKKEEP);
if (err)
--
2.51.0
On Tue, 21 Oct 2025 17:20:22 +1300, Barry Song wrote:
> From: Barry Song <v-songbaohua(a)oppo.com>
>
> We can allocate high-order pages, but mapping them one by
> one is inefficient. This patch changes the code to map
> as large a chunk as possible. The code looks somewhat
>
> [ ... ]
Reviewed-by: Maxime Ripard <mripard(a)kernel.org>
Thanks!
Maxime
Add a helper for retrieving a pointer to the struct dma_resv for a given GEM object. We
also introduce it in a new trait, BaseObjectPrivate, which we automatically
implement for all gem objects and don't expose to users outside of the
crate.
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
---
rust/kernel/drm/gem/mod.rs | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/rust/kernel/drm/gem/mod.rs b/rust/kernel/drm/gem/mod.rs
index 32bff2e8463f4..67813cfb0db42 100644
--- a/rust/kernel/drm/gem/mod.rs
+++ b/rust/kernel/drm/gem/mod.rs
@@ -200,6 +200,18 @@ fn create_mmap_offset(&self) -> Result<u64> {
impl<T: IntoGEMObject> BaseObject for T {}
+/// Crate-private base operations shared by all GEM object classes.
+#[expect(unused)]
+pub(crate) trait BaseObjectPrivate: IntoGEMObject {
+    /// Return a pointer to this object's dma_resv.
+    fn raw_dma_resv(&self) -> *mut bindings::dma_resv {
+        // SAFETY: `as_raw()` always returns a valid pointer to the base DRM GEM object.
+        unsafe { (*self.as_raw()).resv }
+    }
+}
+
+impl<T: IntoGEMObject> BaseObjectPrivate for T {}
+
/// A base GEM object.
///
/// Invariants
--
2.51.0
On Tue, Oct 21, 2025 at 4:43 PM Matthew Brost <matthew.brost(a)intel.com> wrote:
>
> On Sat, Oct 18, 2025 at 12:42:30AM -0700, Matthew Brost wrote:
> > On Fri, Oct 17, 2025 at 11:43:51PM -0700, Matthew Brost wrote:
> > > On Fri, Oct 17, 2025 at 10:37:46AM -0500, Rob Herring wrote:
> > > > On Thu, Oct 16, 2025 at 11:25:34PM -0700, Matthew Brost wrote:
> > > > > On Thu, Oct 16, 2025 at 04:06:05PM -0500, Rob Herring (Arm) wrote:
> > > > > > Add a driver for Arm Ethos-U65/U85 NPUs. The Ethos-U NPU has a
> > > > > > relatively simple interface with a single command stream to describe
> > > > > > buffers, operation settings, and network operations. It supports up to 8
> > > > > > memory regions (though no h/w bounds on a region). The Ethos NPUs
> > > > > > are designed to use an SRAM for scratch memory. Region 2 is reserved
> > > > > > for SRAM (like the downstream driver stack and compiler). Userspace
> > > > > > doesn't need access to the SRAM.
> > > >
> > > > Thanks for the review.
> > > >
> > > > [...]
> > > >
> > > > > > +static struct dma_fence *ethosu_job_run(struct drm_sched_job *sched_job)
> > > > > > +{
> > > > > > + struct ethosu_job *job = to_ethosu_job(sched_job);
> > > > > > + struct ethosu_device *dev = job->dev;
> > > > > > + struct dma_fence *fence = NULL;
> > > > > > + int ret;
> > > > > > +
> > > > > > + if (unlikely(job->base.s_fence->finished.error))
> > > > > > + return NULL;
> > > > > > +
> > > > > > + fence = ethosu_fence_create(dev);
> > > > >
> > > > > Another reclaim issue: ethosu_fence_create allocates memory using
> > > > > GFP_KERNEL. Since we're already in the DMA fence signaling path
> > > > > (reclaim), this can lead to a deadlock.
> > > > >
> > > > > Without too much thought, you likely want to move this allocation to
> > > > > ethosu_job_do_push, but before taking dev->sched_lock or calling
> > > > > drm_sched_job_arm.
> > > > >
> > > > > We really should fix the DRM scheduler work queue to be tainted with
> > > > > reclaim. If I recall correctly, we'd need to update the work queue
> > > > > layer. Let me look into that—I've seen this type of bug several times,
> > > > > and lockdep should be able to catch it.
> > > >
> > > > Likely the rocket driver suffers from the same issues...
> > > >
> > >
> > > I am not surprised by this statement.
> > >
> > > > >
> > > > > > + if (IS_ERR(fence))
> > > > > > + return fence;
> > > > > > +
> > > > > > + if (job->done_fence)
> > > > > > + dma_fence_put(job->done_fence);
> > > > > > + job->done_fence = dma_fence_get(fence);
> > > > > > +
> > > > > > + ret = pm_runtime_get_sync(dev->base.dev);
> > > > >
> > > > > I haven't looked at your PM design, but this generally looks quite
> > > > > dangerous with respect to reclaim. For example, if your PM resume paths
> > > > > allocate memory or take locks that allocate memory underneath, you're
> > > > > likely to run into issues.
> > > > >
> > > > > A better approach would be to attach a PM reference to your job upon
> > > > > creation and release it upon job destruction. That would be safer and
> > > > > save you headaches in the long run.
> > > >
> > > > Our PM is nothing more than clock enable/disable and register init.
> > > >
> > > > If the runtime PM API doesn't work and needs special driver wrappers,
> > > > then I'm inclined to just not use it and manage clocks directly (as
> > > > that's all it is doing).
> > > >
> > >
> > > Yes, then you’re probably fine. More complex drivers can do all sorts of
> > > things during a PM wake, which is why PM wakes should generally be the
> > > outermost layer. I still suggest, to future-proof your code, that you
> > > move the PM reference to an outer layer.
> > >
> >
> > Also, taking a PM reference in a function call — as opposed to tying it
> > to an object's lifetime — is risky. It can quickly lead to imbalances in
> > PM references if things go sideways or function calls become unbalanced.
> > Depending on how your driver uses the DRM scheduler, this seems like a
> > real possibility.
> >
> > Matt
> >
> > > > >
> > > > > This is what we do in Xe [1] [2].
> > > > >
> > > > > Also, in general, this driver has been reviewed (RB’d), but it's not
> > > > > great that I spotted numerous issues within just five minutes. I suggest
> > > > > taking a step back and thoroughly evaluating everything this driver is
> > > > > doing.
> > > >
> > > > Well, if it is hard to get simple drivers right, then it's a problem
> > > > with the subsystem APIs IMO.
> > > >
> > >
> > > Yes, agreed. We should have assertions and lockdep annotations in place
> > > to catch driver-side misuses. This is the second driver I’ve randomly
> > > looked at over the past year that has broken DMA fencing and reclaim
> > > rules. I’ll take an action item to fix this in the DRM scheduler, but
> > > I’m afraid I’ll likely break multiple drivers in the process as misuses
> > > / lockdep will complain.
>
> I've posted a series [1] for the DRM scheduler which will complain about the
> things I've pointed out here.
Thanks. I ran v6 with them and no lockdep splats.
Rob
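(For reference, the ordering Matt suggests, doing every allocation that
can touch reclaim before drm_sched_job_arm(), looks roughly like this;
hypothetical sketch, not the actual ethosu code:)

static int ethosu_job_do_push(struct ethosu_job *job)
{
	struct dma_fence *fence;

	/* allocate while we are still allowed to block on reclaim */
	fence = ethosu_fence_create(job->dev);
	if (IS_ERR(fence))
		return PTR_ERR(fence);
	job->done_fence = fence;

	mutex_lock(&job->dev->sched_lock);
	/* fence signalling critical section starts at arm time */
	drm_sched_job_arm(&job->base);
	drm_sched_entity_push_job(&job->base);
	mutex_unlock(&job->dev->sched_lock);

	return 0;
}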