On 5/19/26 17:31, Xaver Hugl wrote:
> Am Di., 19. Mai 2026 um 15:29 Uhr schrieb Christian König
> <christian.koenig(a)amd.com>:
>>> 1. This series makes the ability to manipulate syncobjs available
>>> independently of attached hardware.
>>> 2. It makes it available under a consistent path /dev/syncobj.
>>
>> Exactly that is a big no-go. This has to be under /dev/dri.
> FWIW udmabuf is also under /dev directly, but I don't think any
> compositor developer would complain about a different path.
> What are the rules for that? Could this simply be put in /dev/dri/syncobj?
The syncobj are actually the DRM specific way of doing things. The general kernel wide way is to use sync files (see drivers/dma-buf/sync_file.c).
But there has already been tons of problems with those sync files. E.g. they doesn't support your use case at all since they don't have wait before submit behavior.
So there are already ways to do this, but the Linux kernel so far told everybody that this is forbidden. The DRM syncobj wait before signal functionality is much better, but then basically the second try to do this.
> The part where we get this independent of attached hardware is quite
> important for us though, since we can't just ignore explicit sync once
> the device we previously imported the syncobj into is disconnected.
Can you elaborate more on this?
> Buffers can be from any device or allocated in system memory and
> access should be synchronized properly in all cases.
>
> How exactly it's made available isn't all that critical.
>
>>> 3. It removes the need to translate between syncobjs fds and handles.
>>
>> That's a pretty big no-go as well. The differentiation between FDs and handles is completely intentional.
> Could you expand on why it's needed? For compositors, the handle is
> just an intermediary thing when translating between file descriptors.
Well what we could do is to add an IOCTL to directly attach an syncobj file descriptor to an eventfd.
> FTR for me at least, this part would be merely nice to have, since it
> slightly reduces the amount of ioctls a compositor needs to call, but
> it's not important.
>
>>>> What about using VGEM for this?
>>>
>>> If the vgem render node were made available unconditionally under,
>>
>> Software rendering is a complete corner case, I don't think that this will be enabled by default.
> That simply makes vgem unsuitable for solving the problems we face in
> compositors.
Thinking more about it vgem also has the same issues as sync file mentioned above. So that is really also not doable.
Maybe Simona or David have another idea.
Regards,
Christian.
>
> - Xaver
virtio_gpu_cursor_plane_update() and virtio_gpu_resource_flush() lock
the framebuffer BO's dma_resv via virtio_gpu_array_lock_resv() and
ignore its return value. The function can fail with -EINTR from
dma_resv_lock_interruptible() (signal during lock wait) or with
-ENOMEM from dma_resv_reserve_fences() (fence slot allocation),
leaving the resv lock not held. The queue path then walks the object
array and calls dma_resv_add_fence(), which requires the lock held;
with lockdep enabled this trips dma_resv_assert_held():
WARNING: drivers/dma-buf/dma-resv.c:296 at dma_resv_add_fence+0x71e/0x840
Call Trace:
virtio_gpu_array_add_fence
virtio_gpu_queue_ctrl_sgs
virtio_gpu_queue_fenced_ctrl_buffer
virtio_gpu_cursor_plane_update
drm_atomic_helper_commit_planes
drm_atomic_helper_commit_tail
commit_tail
drm_atomic_helper_commit
drm_atomic_commit
drm_atomic_helper_update_plane
__setplane_atomic
drm_mode_cursor_universal
drm_mode_cursor_common
drm_mode_cursor_ioctl
drm_ioctl
__x64_sys_ioctl
Beyond the WARN, mutating the dma_resv fence list without the lock
races with concurrent readers/writers and can corrupt the list.
Both call sites run inside the .atomic_update plane callback, which
DRM atomic helpers do not allow to fail (by the time it runs, the
commit has been signed off to userspace and there is no clean
rollback path). Moving the lock acquisition to .prepare_fb was
rejected because the broader lock scope deadlocks against other BO
locking paths in the same atomic commit.
Introduce virtio_gpu_lock_one_resv_uninterruptible() that uses
dma_resv_lock() instead of dma_resv_lock_interruptible(). This
eliminates the -EINTR failure mode -- the realistic syzbot trigger
-- without extending the lock hold across the commit. The helper
locks a single BO and rejects nents > 1 with -EINVAL; both fix
sites lock exactly one BO.
Use it from virtio_gpu_cursor_plane_update() and
virtio_gpu_resource_flush(); check the return value to handle the
remaining -ENOMEM case from dma_resv_reserve_fences() by freeing
the objs and skipping the plane update for that frame. The
framebuffer BOs touched here are not shared with other contexts
and lock contention is expected to be brief, so the loss of
signal-interruptibility is acceptable.
Other callers of virtio_gpu_array_lock_resv() (the ioctl paths)
continue to use the interruptible variant.
The bug was reported by syzbot, triggered via fault injection
(fail_nth) on the DRM_IOCTL_MODE_CURSOR path, which forces the
-ENOMEM branch in dma_resv_reserve_fences().
Reported-by: syzbot+72bd3dd3a5d5f39a0271(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=72bd3dd3a5d5f39a0271
Fixes: 5cfd31c5b3a3 ("drm/virtio: fix virtio_gpu_cursor_plane_update().")
Cc: stable(a)vger.kernel.org
Signed-off-by: Deepanshu Kartikey <kartikey406(a)gmail.com>
---
v4: Rename the helper to virtio_gpu_lock_one_resv_uninterruptible()
and reject objs->nents > 1 with -EINVAL. The v3 helper's
multi-object branch used drm_gem_lock_reservations(), which is
interruptible, contradicting the "uninterruptible" name; both
fix sites lock a single BO so the multi-object path is dropped.
(Dmitry Osipenko)
v3: Drop the prepare_fb/cleanup_fb approach from v2 (it deadlocked
against virtio_gpu_resource_flush(), which also locks the BO in
the same atomic commit). Instead add an uninterruptible variant
of the resv lock helper and use it in both
virtio_gpu_cursor_plane_update() and virtio_gpu_resource_flush().
(Dmitry Osipenko)
v2: Move resv lock acquisition from .atomic_update (which must not
fail) to .prepare_fb (which may), per maintainer review of v1.
The v1 approach of silently skipping the cursor update on lock
failure violated the atomic-commit contract with userspace.
---
drivers/gpu/drm/virtio/virtgpu_drv.h | 1 +
drivers/gpu/drm/virtio/virtgpu_gem.c | 17 +++++++++++++++++
drivers/gpu/drm/virtio/virtgpu_plane.c | 10 ++++++++--
3 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.h b/drivers/gpu/drm/virtio/virtgpu_drv.h
index f17660a71a3e..2f3531950aa4 100644
--- a/drivers/gpu/drm/virtio/virtgpu_drv.h
+++ b/drivers/gpu/drm/virtio/virtgpu_drv.h
@@ -317,6 +317,7 @@ virtio_gpu_array_from_handles(struct drm_file *drm_file, u32 *handles, u32 nents
void virtio_gpu_array_add_obj(struct virtio_gpu_object_array *objs,
struct drm_gem_object *obj);
int virtio_gpu_array_lock_resv(struct virtio_gpu_object_array *objs);
+int virtio_gpu_lock_one_resv_uninterruptible(struct virtio_gpu_object_array *objs);
void virtio_gpu_array_unlock_resv(struct virtio_gpu_object_array *objs);
void virtio_gpu_array_add_fence(struct virtio_gpu_object_array *objs,
struct dma_fence *fence);
diff --git a/drivers/gpu/drm/virtio/virtgpu_gem.c b/drivers/gpu/drm/virtio/virtgpu_gem.c
index f22dc5c21cd4..435d37d36034 100644
--- a/drivers/gpu/drm/virtio/virtgpu_gem.c
+++ b/drivers/gpu/drm/virtio/virtgpu_gem.c
@@ -238,6 +238,23 @@ int virtio_gpu_array_lock_resv(struct virtio_gpu_object_array *objs)
return ret;
}
+int virtio_gpu_lock_one_resv_uninterruptible(struct virtio_gpu_object_array *objs)
+{
+ int ret;
+
+ if (objs->nents != 1)
+ return -EINVAL;
+
+ dma_resv_lock(objs->objs[0]->resv, NULL);
+
+ ret = dma_resv_reserve_fences(objs->objs[0]->resv, 1);
+ if (ret) {
+ virtio_gpu_array_unlock_resv(objs);
+ return ret;
+ }
+ return 0;
+}
+
void virtio_gpu_array_unlock_resv(struct virtio_gpu_object_array *objs)
{
if (objs->nents == 1) {
diff --git a/drivers/gpu/drm/virtio/virtgpu_plane.c b/drivers/gpu/drm/virtio/virtgpu_plane.c
index a126d1b25f46..652352424744 100644
--- a/drivers/gpu/drm/virtio/virtgpu_plane.c
+++ b/drivers/gpu/drm/virtio/virtgpu_plane.c
@@ -215,7 +215,10 @@ static void virtio_gpu_resource_flush(struct drm_plane *plane,
if (!objs)
return;
virtio_gpu_array_add_obj(objs, vgfb->base.obj[0]);
- virtio_gpu_array_lock_resv(objs);
+ if (virtio_gpu_lock_one_resv_uninterruptible(objs)) {
+ virtio_gpu_array_put_free(objs);
+ return;
+ }
virtio_gpu_cmd_resource_flush(vgdev, bo->hw_res_handle, x, y,
width, height, objs,
vgplane_st->fence);
@@ -459,7 +462,10 @@ static void virtio_gpu_cursor_plane_update(struct drm_plane *plane,
if (!objs)
return;
virtio_gpu_array_add_obj(objs, vgfb->base.obj[0]);
- virtio_gpu_array_lock_resv(objs);
+ if (virtio_gpu_lock_one_resv_uninterruptible(objs)) {
+ virtio_gpu_array_put_free(objs);
+ return;
+ }
virtio_gpu_cmd_transfer_to_host_2d
(vgdev, 0,
plane->state->crtc_w,
--
2.43.0
On 5/19/26 19:08, Xaver Hugl wrote:
>>> The part where we get this independent of attached hardware is quite
>>> important for us though, since we can't just ignore explicit sync once
>>> the device we previously imported the syncobj into is disconnected.
>>
>> Can you elaborate more on this?
>
> In Wayland, the client is allowed to attach dmabuf and syncobj
> independently, they don't have to be from the same device (and the
> compositor wouldn't be able to verify the opposite anyways). The
> compositor will usually import both into the same drm device, but
> especially with compositors that render on multiple devices, that's
> not necessarily the case either.
>
> If for example we had a system with one internal GPU and one external
> GPU, the client renders on the internal GPU and the compositor uses
> the external one. Now when the user yanks the USB C cable, afaiu
Well I would say the other way around is a pretty common use case.
In other words the compositors uses the internal GPU for composing and displaying the picture. And the client uses the external GPU for fast rendering.
> - the buffers from the client stay valid
Buffers from the hot plugged GPU don't stay valid. Accessing CPU mappings either result in a SIGBUS or are redirected to a dummy page.
DMA operations to hot plugged buffers from other GPUs (or rather more general other devices) are waited on before the underlying resource is removed (e.g. system memory or PCIe address space or whatever is backing that).
But no new DMA operations are usually permitted to start.
> - the syncobj stays valid on the client side
> - the syncobj becomes invalid on the compositor side
Nope that's not correct. The syncobj itself stays valid even if you completely hot plug the device.
It can just be that the fences inside the syncobj are terminated with an error.
> "invalid" there means either
> - the acquire point of the client is marked as signaled, before
> rendering on the client side is completed
> - the acquire point of the client is never signaled. Since the
> compositor waits for the acquire point, the Wayland surface is stuck
> forever
Both of those would be a *massive* violation of documented kernel rules for hot-plugging which could lead to random data corruption and/or deadlocks.
If you see any HW driver showing behavior like that please open up a bug report and ping the relevant maintainers immediately.
When a hotplug happens all operations of the device should return an -ENODEV error, even when exposed to other devices/application through syncobj or syncfile.
One problem is that only syncfile allows for querying such error codes at the moment, we have patches pending to add that to syncobj as well but we lack a compositor with support for that as userspace client.
> Afaik the latter is currently the case. The former wouldn't be much
> better though, not when it's preventable.
>
> This is admittedly an edge case, but GPU hotunplug is something we try
> to support as well as possible in Plasma, and all the edge cases cause
> a lot of problems in combination and are a lot of headaches to handle
> (or really work around) in the compositor.
Well exactly that design is used in the Tesla 3 infotainment system for example.
So GPU hotplug is actually a pretty common use case.
> Another edge case is when the client asks the compositor to import the
> syncobj, which can fail when a hotunplug is in process, and ends up
> disconnecting the client for no fault of either client or compositor.
Well the question here is if the device the compositor is using or the client is using is gone?
If the client device is hot removed the compositor should be perfectly capable to import the syncobj.
If the compositor device is gone then you don't have a device to display anything any more, so generating the next frame doesn't seem to make sense either.
What could be is that you want the compositor to be kept alive even when the display device is gone to switch over to vkms or whatever so that a VNC session or other remote desktop still works.
>>>>> 3. It removes the need to translate between syncobjs fds and handles.
>>>>
>>>> That's a pretty big no-go as well. The differentiation between FDs and handles is completely intentional.
>>> Could you expand on why it's needed? For compositors, the handle is
>>> just an intermediary thing when translating between file descriptors.
>>
>> Well what we could do is to add an IOCTL to directly attach an syncobj file descriptor to an eventfd.
> That would be nice.
Take a look at drm_syncobj_file_fops and how drm_syncobj_add_eventfd() is used. Adding that functionality shouldn't be more than a typing exercise.
Do I see it right that this would already solve most problems in the compositor side?
Regards,
Christian.
>
> - Xaver
When dumping IB contents from a hung job, amdgpu_devcoredump_format()
acquires the VM root PD's reservation lock via amdgpu_vm_lock_by_pasid()
and then, for each IB referenced by the job, calls amdgpu_bo_reserve()
on the BO that backs the IB. Both reservations are taken on
reservation_ww_class_mutex objects but neither uses a ww_acquire_ctx,
which trips lockdep:
WARNING: possible recursive locking detected
--------------------------------------------
kworker/u128:0 is trying to acquire lock:
ffff88838b16e1f0 (reservation_ww_class_mutex){+.+.}-{4:4},
at: amdgpu_devcoredump_format+0x1594/0x23f0 [amdgpu]
but task is already holding lock:
ffff8882f82681f0 (reservation_ww_class_mutex){+.+.}-{4:4},
at: amdgpu_devcoredump_format+0x1594/0x23f0 [amdgpu]
Possible unsafe locking scenario:
CPU0
----
lock(reservation_ww_class_mutex);
lock(reservation_ww_class_mutex);
*** DEADLOCK ***
May be due to missing lock nesting notation
Workqueue: events_unbound amdgpu_devcoredump_deferred_work [amdgpu]
Call Trace:
__ww_mutex_lock.constprop.0
ww_mutex_lock
amdgpu_bo_reserve
amdgpu_devcoredump_format+0x1594 [amdgpu]
amdgpu_devcoredump_deferred_work+0xea [amdgpu]
process_one_work
worker_thread
kthread
The two reservations are on different BOs in the captured trace, so the
splat is a lockdep-correctness warning, not an observed deadlock. It
becomes a real self-deadlock whenever the IB BO shares its dma_resv
with the root PD (the always-valid case, see
amdgpu_vm_is_bo_always_valid()): amdgpu_bo_reserve(abo) re-acquires the
same ww_mutex without a ticket and blocks forever.
With amdgpu.gpu_recovery=0 the timeout handler refires every ~2 s and
each invocation produces this splat, drowning the kernel ring buffer.
Fix it by collecting the per-IB BO references under the root PD's
reservation, then releasing the root before reserving each IB BO
individually. The walk over the VM mapping tree must remain under the
root lock (mappings can be torn down without it), but the actual
content copies do not need to nest inside it. Each per-IB reservation
is now an independent top-level acquire, eliminating the nested
ww_mutex.
The collect/release logic is factored out into two small helpers
(amdgpu_devcoredump_collect_ib_refs / amdgpu_devcoredump_release_ib_refs)
to keep the main function's indentation reasonable.
This also fixes a BO refcount leak in the original code: when
amdgpu_bo_reserve() failed, control jumped to free_ib_content without
running amdgpu_bo_unref(). In the new structure the per-IB BO refs
are released unconditionally in the cleanup helper.
Reproducer (~150 LoC libdrm_amdgpu): submit a single GFX IB containing
PACKET3_INDIRECT_BUFFER chained at GPU VA 0 and wait for the fence.
The TDR fires within ~10 s and the deferred coredump worker produces
the splat above on every invocation.
Fixes: 7b15fc2d1f1a ("drm/amdgpu: dump job ibs in the devcoredump")
Cc: stable(a)vger.kernel.org # 7.1
Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov(a)gmail.com>
---
.../gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c | 147 +++++++++++++-----
1 file changed, 110 insertions(+), 37 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index d386bc775d03..f6bb968de756 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -207,6 +207,72 @@ static void amdgpu_devcoredump_fw_info(struct amdgpu_device *adev,
}
}
+struct amdgpu_devcoredump_ib_ref {
+ struct amdgpu_bo *bo;
+ u64 offset;
+};
+
+/*
+ * Walk the VM's mapping tree under the root PD's reservation to obtain the BO
+ * that backs each IB and pin it with a refcount. The root PD reservation is
+ * dropped before this function returns; the caller can then reserve each IB
+ * BO individually without nesting ww_mutex acquires on
+ * reservation_ww_class_mutex.
+ *
+ * Returns an array of num_ibs entries (each ib_refs[i].bo may be NULL if its
+ * mapping was not found), or NULL on allocation failure / VM lookup failure.
+ * The caller must release the BO refs and free the array.
+ */
+static struct amdgpu_devcoredump_ib_ref *
+amdgpu_devcoredump_collect_ib_refs(struct amdgpu_device *adev,
+ struct amdgpu_coredump_info *coredump)
+{
+ struct amdgpu_devcoredump_ib_ref *ib_refs;
+ struct amdgpu_bo_va_mapping *mapping;
+ struct amdgpu_bo *root;
+ struct amdgpu_vm *vm;
+ u64 va_start;
+
+ ib_refs = kcalloc(coredump->num_ibs, sizeof(*ib_refs), GFP_KERNEL);
+ if (!ib_refs)
+ return NULL;
+
+ vm = amdgpu_vm_lock_by_pasid(adev, &root, coredump->pasid);
+ if (!vm) {
+ kfree(ib_refs);
+ return NULL;
+ }
+
+ for (int i = 0; i < coredump->num_ibs; i++) {
+ va_start = coredump->ibs[i].gpu_addr & AMDGPU_GMC_HOLE_MASK;
+ mapping = amdgpu_vm_bo_lookup_mapping(vm, va_start / AMDGPU_GPU_PAGE_SIZE);
+ if (!mapping)
+ continue;
+
+ ib_refs[i].bo = amdgpu_bo_ref(mapping->bo_va->base.bo);
+ ib_refs[i].offset = va_start -
+ mapping->start * AMDGPU_GPU_PAGE_SIZE;
+ }
+
+ amdgpu_bo_unreserve(root);
+ amdgpu_bo_unref(&root);
+
+ return ib_refs;
+}
+
+static void
+amdgpu_devcoredump_release_ib_refs(struct amdgpu_devcoredump_ib_ref *ib_refs,
+ int num_ibs)
+{
+ if (!ib_refs)
+ return;
+
+ for (int i = 0; i < num_ibs; i++)
+ if (ib_refs[i].bo)
+ amdgpu_bo_unref(&ib_refs[i].bo);
+ kfree(ib_refs);
+}
+
static ssize_t
amdgpu_devcoredump_format(char *buffer, size_t count, struct amdgpu_coredump_info *coredump)
{
@@ -214,13 +280,11 @@ amdgpu_devcoredump_format(char *buffer, size_t count, struct amdgpu_coredump_inf
struct drm_printer p;
struct drm_print_iterator iter;
struct amdgpu_vm_fault_info *fault_info;
- struct amdgpu_bo_va_mapping *mapping;
struct amdgpu_ip_block *ip_block;
struct amdgpu_res_cursor cursor;
- struct amdgpu_bo *abo, *root;
- uint64_t va_start, offset;
+ struct amdgpu_bo *abo;
+ uint64_t offset;
struct amdgpu_ring *ring;
- struct amdgpu_vm *vm;
u32 *ib_content;
uint8_t *kptr;
int ver, i, j, r;
@@ -343,43 +407,52 @@ amdgpu_devcoredump_format(char *buffer, size_t count, struct amdgpu_coredump_inf
drm_printf(&p, "VRAM is lost due to GPU reset!\n");
if (coredump->num_ibs) {
- /* Don't try to lookup the VM or map the BOs when calculating the
- * size required to store the devcoredump.
+ struct amdgpu_devcoredump_ib_ref *ib_refs = NULL;
+
+ /*
+ * Snapshot per-IB BO references under the root PD's reservation,
+ * then release the root before reserving each IB BO individually
+ * to copy its contents.
+ *
+ * Reserving an IB BO while the root PD is still reserved would
+ * be a nested ww_mutex acquire on reservation_ww_class_mutex
+ * without a ww_acquire_ctx, which trips lockdep's recursive-
+ * locking check and self-deadlocks for IB BOs that share their
+ * dma_resv with the root PD (always-valid BOs).
+ *
+ * Skip lookup/reservation entirely on the sizing pass: it does
+ * not write IB content, and the size estimate doesn't depend on
+ * whether the BOs are reachable.
*/
- if (sizing_pass)
- vm = NULL;
- else
- vm = amdgpu_vm_lock_by_pasid(adev, &root, coredump->pasid);
+ if (!sizing_pass)
+ ib_refs = amdgpu_devcoredump_collect_ib_refs(adev, coredump);
- for (int i = 0; i < coredump->num_ibs && (sizing_pass || vm); i++) {
+ for (int i = 0; i < coredump->num_ibs; i++) {
ib_content = kvmalloc_array(coredump->ibs[i].ib_size_dw, 4,
GFP_KERNEL);
if (!ib_content)
continue;
- /* vm=NULL can only happen when 'sizing_pass' is true. Skip to the
- * drm_printf() calls (ib_content doesn't need to be initialized
- * as its content won't be written anywhere).
- */
- if (!vm)
+ if (sizing_pass)
goto output_ib_content;
- va_start = coredump->ibs[i].gpu_addr & AMDGPU_GMC_HOLE_MASK;
- mapping = amdgpu_vm_bo_lookup_mapping(vm, va_start / AMDGPU_GPU_PAGE_SIZE);
- if (!mapping)
- goto free_ib_content;
+ if (!ib_refs || !ib_refs[i].bo)
+ goto output_ib_content;
+
+ abo = ib_refs[i].bo;
+ offset = ib_refs[i].offset;
- offset = va_start - (mapping->start * AMDGPU_GPU_PAGE_SIZE);
- abo = amdgpu_bo_ref(mapping->bo_va->base.bo);
r = amdgpu_bo_reserve(abo, false);
if (r)
- goto free_ib_content;
+ goto output_ib_content;
if (abo->flags & AMDGPU_GEM_CREATE_NO_CPU_ACCESS) {
off = 0;
- if (abo->tbo.resource->mem_type != TTM_PL_VRAM)
- goto unreserve_abo;
+ if (abo->tbo.resource->mem_type != TTM_PL_VRAM) {
+ amdgpu_bo_unreserve(abo);
+ goto output_ib_content;
+ }
amdgpu_res_first(abo->tbo.resource, offset,
coredump->ibs[i].ib_size_dw * 4,
@@ -395,8 +468,10 @@ amdgpu_devcoredump_format(char *buffer, size_t count, struct amdgpu_coredump_inf
r = ttm_bo_kmap(&abo->tbo, 0,
PFN_UP(abo->tbo.base.size),
&abo->kmap);
- if (r)
- goto unreserve_abo;
+ if (r) {
+ amdgpu_bo_unreserve(abo);
+ goto output_ib_content;
+ }
kptr = amdgpu_bo_kptr(abo);
kptr += offset;
@@ -406,21 +481,19 @@ amdgpu_devcoredump_format(char *buffer, size_t count, struct amdgpu_coredump_inf
amdgpu_bo_kunmap(abo);
}
+ amdgpu_bo_unreserve(abo);
+
output_ib_content:
drm_printf(&p, "\nIB #%d 0x%llx %d dw\n",
i, coredump->ibs[i].gpu_addr, coredump->ibs[i].ib_size_dw);
- for (int j = 0; j < coredump->ibs[i].ib_size_dw; j++)
- drm_printf(&p, "0x%08x\n", ib_content[j]);
-unreserve_abo:
- if (vm)
- amdgpu_bo_unreserve(abo);
-free_ib_content:
+ if (!sizing_pass && ib_refs && ib_refs[i].bo) {
+ for (int j = 0; j < coredump->ibs[i].ib_size_dw; j++)
+ drm_printf(&p, "0x%08x\n", ib_content[j]);
+ }
kvfree(ib_content);
}
- if (vm) {
- amdgpu_bo_unreserve(root);
- amdgpu_bo_unref(&root);
- }
+
+ amdgpu_devcoredump_release_ib_refs(ib_refs, coredump->num_ibs);
}
return count - iter.remain;
--
2.54.0
This RFC builds on T.J. Mercier's earlier series [1] which added
a memory.stat counter for exported dma-bufs and a binder-backed
mechanism to transfer charges between cgroups.
The first commit is taken almost verbatim from TJ's series:
it introduces MEMCG_DMABUF as a dedicated per-cgroup stat, so that
the total exported dma-buf footprint is visible both system-wide
(via the root cgroup) and per-application (via per-process cgroups).
This avoids the overhead of DMABUF_SYSFS_STATS and integrates
naturally into the existing cgroup memory hierarchy.
The rest of the series departs from TJ's approach. While the first
commit introduces the memcg stat infrastructure for dmabufs, the
export-time charging it introduces in dma_buf_export() is then
superseded: we charge at dma_heap_ioctl_allocate() time, using a
new charge_pid_fd field in struct dma_heap_allocation_data. The
allocator opens a pidfd for its client (e.g., from binder's
sender_pid), passes it to the ioctl, and the kernel charges the
buffer directly to the client's cgroup at allocation time, so no
transfer step is needed.
This decouples the accounting path from binder entirely:
any allocator that knows its client's PID can use the pid_fd
mechanism regardless of the IPC transport in use.
The cross-cgroup charging capability requires access control.
Patches #3 and #4 add a generic LSM hook (security_dma_heap_alloc)
and an SELinux implementation based on a new dma_heap object class
with a charge_to permission, so policy authors can express which
domains are allowed to charge memory to another domain's cgroup.
Last patch adds some tests to verify the new charge_pid_fd field.
We are sending it as an RFC to spark broader discussion. It may or
may not be the right path forward, and we welcome feedback on the
trade-offs.
Collision note: Eric Chanudet's series [2] adds __GFP_ACCOUNT to
system_heap page allocations as an opt-in module parameter. That
approach charges pages to the allocator's own kmem, which overlaps with
MEMCG_DMABUF. This series explicitly removes __GFP_ACCOUNT from system
heap allocations and routes all accounting through the MEMCG_DMABUF
path to avoid double-counting.
[1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.co…
[2] https://lore.kernel.org/r/20260113-dmabuf-heap-system-memcg-v2-0-e85722cc2f…
Signed-off-by: Albert Esteve <aesteve(a)redhat.com>
---
Albert Esteve (4):
dma-heap: charge dma-buf memory via explicit memcg
security: dma-heap: Add dma_heap_alloc LSM hook
selinux: Restrict cross-cgroup dma-heap charging
selftests/dmabuf-heaps: Add dma-buf memcg accounting tests
T.J. Mercier (1):
memcg: Track exported dma-buffers
Documentation/admin-guide/cgroup-v2.rst | 5 +
drivers/dma-buf/dma-buf.c | 7 +
drivers/dma-buf/dma-heap.c | 54 +++++-
drivers/dma-buf/heaps/system_heap.c | 2 -
include/linux/dma-buf.h | 4 +
include/linux/lsm_hook_defs.h | 1 +
include/linux/memcontrol.h | 37 ++++
include/linux/security.h | 7 +
include/uapi/linux/dma-heap.h | 6 +
mm/memcontrol.c | 19 ++
security/security.c | 16 ++
security/selinux/hooks.c | 7 +
security/selinux/include/classmap.h | 1 +
tools/testing/selftests/cgroup/Makefile | 2 +-
tools/testing/selftests/cgroup/test_memcontrol.c | 143 +++++++++++++-
tools/testing/selftests/dmabuf-heaps/config | 1 +
tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 126 ++++++++++++-
tools/testing/selftests/dmabuf-heaps/vmtest.sh | 205 +++++++++++++++++++++
18 files changed, 633 insertions(+), 10 deletions(-)
---
base-commit: 74fe02ce122a6103f207d29fafc8b3a53de6abaf
change-id: 20260508-v2_20230123_tjmercier_google_com-f44fcfb16530
Best regards,
--
Albert Esteve <aesteve(a)redhat.com>
On Thu, 7 May 2026 11:02:26 +0200
Marcin Åšlusarz <marcin.slusarz(a)arm.com> wrote:
> On Tue, May 05, 2026 at 06:15:23PM +0200, Boris Brezillon wrote:
> > > @@ -277,9 +286,21 @@ int panthor_device_init(struct panthor_device *ptdev)
> > > return ret;
> > > }
> > >
> > > + /* If a protected heap name is specified but not found, defer the probe until created */
> > > + if (protected_heap_name && strlen(protected_heap_name)) {
> >
> > Do we really need this strlen() > 0? Won't dma_heap_find() fail is the
> > name is "" already?
>
> If dma_heap_find() will fail, then the whole probe with fail too.
> This check prevents that.
Yeah, that's also a questionable design choice. I mean, we can
currently probe and boot the FW even though we never setup the
protected FW sections, so why should we defer the probe here? Can't we
just retry the next time a group with the protected bit is created and
fail if we can find a protected heap?
> I'm not sure why it's needed at all, but if
> it is really needed, then s/strlen(protected_heap_name)/protected_heap_name[0]/
> would simplify this.
It's not so much about how you do the test, and more about the case
you're trying to protect against. I guess here you assume that
panthor.protected_heap_name="" means "I don't have a protected heap for
you". If it's deemed acceptable, this should most certainly be
described somewhere.
>
> > > + ptdev->protm.heap = dma_heap_find(protected_heap_name);
> > > + if (!ptdev->protm.heap) {
> > > + drm_warn(&ptdev->base,
> > > + "Protected heap \'%s\' not (yet) available - deferring probe",
> > > + protected_heap_name);
> > > + ret = -EPROBE_DEFER;
> > > + goto err_rpm_put;
> >
> > If you move the heap retrieval before the rpm enablement, you can get
> > rid of this goto err_rpm_put.
> >
> > > + }
> > > + }
> > > +
> > > ret = panthor_hw_init(ptdev);
> > > if (ret)
> > > - goto err_rpm_put;
> > > + goto err_dma_heap_put;
> > >
> > > ret = panthor_pwr_init(ptdev);
> > > if (ret)
Hi all,
This series is based on previous RFCs/discussions:
Tech topic: https://lore.kernel.org/linux-iommu/20250918214425.2677057-1-amastro@fb.com/
RFCv1: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/
RFCv2: https://lore.kernel.org/kvm/20260312184613.3710705-1-mattev@meta.com/
The background/rationale is covered in more detail in the RFC cover
letters. The TL;DR is:
The goal is to enable userspace driver designs that use VFIO to export
DMABUFs representing subsets of PCI device BARs, and "vend" those
buffers from a primary process to other subordinate processes by fd.
These processes then mmap() the buffers and their access to the device
is isolated to the exported ranges. This is an improvement on sharing
the VFIO device fd to subordinate processes, which would allow
unfettered access .
This is achieved by enabling mmap() of vfio-pci DMABUFs. Second, a
new ioctl()-based revocation mechanism is added to allow the primary
process to forcibly revoke access to previously-shared BAR spans, even
if the subordinate processes haven't cleanly exited.
(The related topic of safe delegation of iommufd control to the
subordinate processes is not addressed here, and is follow-up work.)
As well as isolation and revocation, another advantage to accessing a
BAR through a VMA backed by a DMABUF is that it's straightforward to
create the buffer with access attributes, such as write-combining.
Notes on patches
================
Feedback from the RFCs requested that, instead of creating
DMABUF-specific vm_ops and .fault paths, to go the whole way and
migrate the existing VFIO PCI BAR mmap() to be backed by a DMABUF too,
resulting in a common vm_ops and fault handler for mmap()s of both the
VFIO device and explicitly-exported DMABUFs. This has been done for
vfio-pci, but not sub-drivers (nvgrace-gpu's special-case mappings are
unchanged).
vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put
A bug fix to a related are, whose context is a depdency for later
patches.
vfio/pci: Add a helper to look up PFNs for DMABUFs
vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
The first is for a DMABUF VMA fault handler to determine
arbitrary-sized PFNs from ranges in DMABUF. Secondly, refactor
DMABUF export for use by the existing export feature and a new
helper that creates a DMABUF corresponding to a VFIO BAR mmap()
request.
vfio/pci: Convert BAR mmap() to use a DMABUF
The vfio-pci core mmap() creates a DMABUF with the helper, and the
vm_ops fault handler uses the other helper to resolve the fault.
Because this depends on DMABUF structs/code, CONFIG_VFIO_PCI_CORE
needs to depend on CONFIG_DMA_SHARED_BUFFER. The
CONFIG_VFIO_PCI_DMABUF still conditionally enables the export
support code.
NOTE: The user mmap()s a device fd, but the resulting VMA's vm_file
becomes that of the DMABUF which takes ownership of the device and
puts it on release. This maintains the existing behaviour of a VMA
keeping the VFIO device open.
BAR zapping then happens via the existing vfio_pci_dma_buf_move()
path, which now needs to unmap PTEs in the DMABUF's address_space.
vfio/pci: Provide a user-facing name for BAR mappings
There was a request for decent debug naming in /proc/<pid>/maps
etc. comparable to the existing VFIO names: since the VMAs are
DMABUFs, they have a "dmabuf:" prefix and can't be 100% identical
to before. This is a user-visible change, but this patch at least
now gives us extra info on the BDF & BAR being mapped.
vfio/pci: Clean up BAR zap and revocation
In general (see NOTE!) the vfio_pci_zap_bars() is now obsolete,
since it unmaps PTEs in the VFIO device address_space which is now
unused. This consolidates all calls (e.g. around reset) with the
neighbouring vfio_pci_dma_buf_move()s into new functions, to
revoke-zap/unrevoke.
NOTE: the nvgrace-gpu driver continues to use its own private
vm_ops, fault handler, etc. for its special memregions, and these
DO still add PTEs to the VFIO device address_space. So, a
temporary flag, vdev->bar_needs_zap, maintains the old behaviour
for this use. At least this patch's consolidation makes it easy
to remove the remaining zap when this need goes away.
A FIXME is added: if nvgrace-gpu is converted to DMABUFs, remove
the flag and final zap.
vfio/pci: Support mmap() of a VFIO DMABUF
Adds mmap() for a DMABUF fd exported from vfio-pci.
It was a goal to keep the VFIO device fd lifetime behaviour
unchanged with respect to the DMABUFs. An application can close
all device fds, and this will revoke/clean up all DMABUFs; no
mappings or other access can be performed now. When enabling
mmap() of the DMABUFs, this means access through the VMA is also
revoked. This complicates the fault handler because whilst the
DMABUF exists, it has no guarantee that the corresponding VFIO
device is still alive. Adds synchronisation ensuring the vdev is
available before vdev->memory_lock is touched.
(I decided against the alternative of preventing cleanup by holding
the VFIO device open if any DMABUFs exist, because it's both a
change of behaviour and less clean overall.)
I've added a chonky comment in place, happy to clarify more if you
have ideas.
vfio/pci: Permanently revoke a DMABUF on request
By weight, this is mostly a rename of revoked to an enum, status.
There are now 3 states for a buffer, usable and revoked
temporary/permanent. A new VFIO device ioctl is added,
VFIO_DEVICE_PCI_DMABUF_REVOKE, which passes a DMABUF (exported from
that device) and permanently revokes it. Thus a userspace driver
can guarantee any downstream consumers of a shared fd are prevented
from accessing a BAR range, and that range can be reused.
The code doing revocation in vfio_pci_dma_buf_move() is moved,
unchanged, to a common function for use by _move() and the new
ioctl path.
Q: I can't think of a good reason to temporarily revoke/unrevoke
buffers from userspace, so didn't add a 'flags' field to the ioctl
struct. Easy to add if people think it's worthwhile for future
use.
vfio/pci: Add mmap() attributes to DMABUF feature
Reserves bits [31:28] in vfio_device_feature_dma_buf to allow a
(CPU) mapping attribute to be specified for an exported set of
ranges. The default is the current UC, and a new flag can specify
CPU access as WC.
Q: I've taken 4 bits; the intention is for this field to be a
scalar not a bitmap (i.e. mutually-exclusive access properties).
Perhaps 4 is a bit too many?
Testing
=======
(The [RFC ONLY] userspace test program, for QEMU edu-plus, has been
dropped, but can be found in the GitHub branch below.)
This code has been tested in mapping DMABUFs of single/multiple
ranges, aliasing mmap()s, aliasing ranges across DMABUFs, vm_pgoff >
0, revocation, shutdown/cleanup scenarios, and hugepage mappings seem
to work correctly. I've lightly tested WC mappings also (by observing
resulting PTEs as having the correct attributes...). No regressions
observed on the VFIO selftests, or on our internal vfio-pci
applications.
End
===
This is based on -next (next-20260414 but will merge earlier), as it
depends on Leon's series "vfio: Wait for dma-buf invalidation to
complete":
https://lore.kernel.org/linux-iommu/20260205-nocturnal-poetic-chamois-f566a…
These commits are on GitHub, along with "[RFC ONLY] selftests: vfio: Add
standalone vfio_dmabuf_mmap_test":
https://github.com/metamev/linux/compare/next-20260414...metamev:linux:dev/…
Thanks for reading,
Matt
================================================================================
Change log:
v1:
- Cleanup of the common DMABUF-aware VMA vm_ops fault handler and
export code.
- Fixed a lot of races, particularly faults racing with DMABUF
cleanup (if the VFIO device fds close, for example).
- Added nicer human-readable names for VFIO mmap() VMAs
RFCv2: Respin based on the feedback/suggestions:
https://lore.kernel.org/kvm/20260312184613.3710705-1-mattev@meta.com/
- Transform the existing VFIO BAR mmap path to also use DMABUFs
behind the scenes, and then simply share that code for
explicitly-mapped DMABUFs. Jason wanted to go that direction to
enable iommufd VFIO type 1 emulation to pick up a DMABUF for an IO
mapping.
- Revoke buffers using a VFIO device fd ioctl
RFCv1:
https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/
Matt Evans (9):
vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put
vfio/pci: Add a helper to look up PFNs for DMABUFs
vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
vfio/pci: Convert BAR mmap() to use a DMABUF
vfio/pci: Provide a user-facing name for BAR mappings
vfio/pci: Clean up BAR zap and revocation
vfio/pci: Support mmap() of a VFIO DMABUF
vfio/pci: Permanently revoke a DMABUF on request
vfio/pci: Add mmap() attributes to DMABUF feature
drivers/vfio/pci/Kconfig | 3 +-
drivers/vfio/pci/Makefile | 3 +-
drivers/vfio/pci/nvgrace-gpu/main.c | 5 +
drivers/vfio/pci/vfio_pci_config.c | 30 +-
drivers/vfio/pci/vfio_pci_core.c | 224 ++++++++++---
drivers/vfio/pci/vfio_pci_dmabuf.c | 500 +++++++++++++++++++++++-----
drivers/vfio/pci/vfio_pci_priv.h | 49 ++-
include/linux/vfio_pci_core.h | 1 +
include/uapi/linux/vfio.h | 42 ++-
9 files changed, 690 insertions(+), 167 deletions(-)
--
2.47.3
This series adds a new device /dev/syncobj that can be used to create
and manipulate DRM syncobjs. Previously, these operations required the
use of a DRM device and the device needed to support the DRIVER_SYNCOBJ
and DRIVER_SYNCOBJ_TIMELINE features.
There are several issues with the existing API:
- Syncobjs are the only explicit sync mechanism available on wayland.
Most compositors do not use GPU waits. Instead, they use the
DRM_IOCTL_SYNCOBJ_EVENTFD ioctl to perform a CPU wait. Being tied to
DRM devices means that compositors cannot consistently offer this
feature even though no device-specific logic is involved.
- llvmpipe currently cannot offer syncobj interop because it does not
have access to a DRM device. This means that applications using
llvmpipe cannot present images before they have finished rendering,
despite llvmpipe using threaded rendering.
- Clients that do not use the Vulkan WSI need to manually probe /dev/dri
for devices that support the syncobj ioctls in order to use the
wayland syncobj protocol.
- Similarly, clients that want to use screen capture have no equivalent
to the WSI and are therefore forced into that path.
- Having to keep a DRM device open has potentially negative interactions
with GPU hotplug.
- Having to translate between syncobj FDs and handles is troublesome in
the compositor usecase since syncobjs come and go frequently and need
to be cleaned up when clients disconnect.
/dev/syncobj solves these issues by providing all syncobj ioctls under a
consistent path that is not tied to any DRM device. It also operates
directly on file descriptors instead of syncobj handles.
The series starts with a number of small refactorings in drm_syncobj.c
to make its functionality available outside of the file and without the
need for drm_file/handle pairs.
The last commit adds the /dev/syncobj module. I've added it as a misc
device but maybe this should instead live somewhere under gpu/drm.
An application using the new interface can be found at [1].
[1]: https://github.com/mahkoh/jay/pull/947
---
Julian Orth (12):
drm/syncobj: add drm_syncobj_from_fd
drm/syncobj: add drm_syncobj_fence_lookup
drm/syncobj: make drm_syncobj_array_wait_timeout public
drm/syncobj: add drm_syncobj_register_eventfd
drm/syncobj: have transfer functions accept drm_syncobj directly
drm/syncobj: add drm_syncobj_transfer
drm/syncobj: add drm_syncobj_timeline_signal
drm/syncobj: add drm_syncobj_query
drm/syncobj: fix resource leak in drm_syncobj_import_sync_file_fence
drm/syncobj: add drm_syncobj_import_sync_file
drm/syncobj: add drm_syncobj_export_sync_file
misc/syncobj: add new device
Documentation/userspace-api/ioctl/ioctl-number.rst | 1 +
drivers/gpu/drm/drm_syncobj.c | 374 ++++++++++++++-----
drivers/misc/Kconfig | 10 +
drivers/misc/Makefile | 1 +
drivers/misc/syncobj.c | 404 +++++++++++++++++++++
include/drm/drm_syncobj.h | 21 ++
include/uapi/linux/syncobj.h | 75 ++++
7 files changed, 795 insertions(+), 91 deletions(-)
---
base-commit: 6916d5703ddf9a38f1f6c2cc793381a24ee914c6
change-id: 20260516-jorth-syncobj-d4d374c8c61b
Best regards,
--
Julian Orth <ju.orth(a)gmail.com>