On 5/20/26 14:33, Xaver Hugl wrote:
> Am Mi., 20. Mai 2026 um 10:08 Uhr schrieb Christian König
> <christian.koenig(a)amd.com>:
>> Well I would say the other way around is a pretty common use case.
>>
>> In other words the compositors uses the internal GPU for composing and displaying the picture. And the client uses the external GPU for fast rendering.
> Sure, but that's not what I'm talking about.
Yeah sorry for that, I wasn't sure if I misunderstood your use case because it's usually the other way around.
>>> - the buffers from the client stay valid
>>
>> Buffers from the hot plugged GPU don't stay valid. Accessing CPU mappings either result in a SIGBUS or are redirected to a dummy page.
> Again, not what I wrote about. The buffers are on the integrated GPU.
General rule of thumb is that as long as the exporter stays around the buffers stay around as well.
>>> - the syncobj stays valid on the client side
>>> - the syncobj becomes invalid on the compositor side
>>
>> Nope that's not correct. The syncobj itself stays valid even if you completely hot plug the device.
>>
>> It can just be that the fences inside the syncobj are terminated with an error.
> What about eventfd created for a point on the syncobj?
The eventfd unfortunately doesn't has error handling as far as I know, so when a fence signals with an error condition then the eventfd you only sees that it is signaled.
> Another (future) problem with hotplugs will be if the sync file hasn't
> materialized for the timeline point when the device is hotunplugged,
> since there can't be an error on the fence if there isn't one. Or
> could userspace somehow set an 'artificial' fence with an error in
> that case?
In general the answer is yes, userspace needs to take care of inserting fences when wait before signal is used and the work can not be submitted to the HW for some reason.
Currently we only have an IOCTL to insert the signaled dummy fence at some timeline sequence, but it should be trivial as well to insert a signaled fence with an error code.
But the compositor needs to be able to handle that case anyway, because it can be that a malicious or just buggy client just never inserts the fence.
So that a device is hot plugged is not different to just a client not inserting the fence in the first place.
>>> "invalid" there means either
>>> - the acquire point of the client is marked as signaled, before
>>> rendering on the client side is completed
>>> - the acquire point of the client is never signaled. Since the
>>> compositor waits for the acquire point, the Wayland surface is stuck
>>> forever
>>
>> Both of those would be a *massive* violation of documented kernel rules for hot-plugging which could lead to random data corruption and/or deadlocks.
>>
>> If you see any HW driver showing behavior like that please open up a bug report and ping the relevant maintainers immediately.
> If there are no error codes with syncobj yet, then to userspace, the
> latter behavior is exactly what we get, isn't it?
No, from userspace side you just see a signaled fence. It's just that you need to export the timeline point of the syncobj to a syncfile and then you can call the QUERY IOCTL on the syncfile to see the error code.
>> When a hotplug happens all operations of the device should return an -ENODEV error, even when exposed to other devices/application through syncobj or syncfile.
> Okay, that at least gives us a way to fail imports somewhat
> gracefully. Normally, failing to import a syncobj is a fatal error in
> the Wayland protocol.
So the task at hand would be to avoid importing the syncobj into a driver. That should be relatively trivial.
The only real problem I see is if you want to create a syncobj without having any device whatsoever.
>> One problem is that only syncfile allows for querying such error codes at the moment, we have patches pending to add that to syncobj as well but we lack a compositor with support for that as userspace client.
> As long as the error case can be detected with an eventfd,
Yeah that's the problem. The eventfd only tells you if the operation is completed (or at least has materialized).
To query the error you would need to ask the underlying syncobj or syncfile directly.
> implementing that in KWin shouldn't be a challenge.
>
>> Well the question here is if the device the compositor is using or the client is using is gone?
>>
>> If the client device is hot removed the compositor should be perfectly capable to import the syncobj.
>>
>> If the compositor device is gone then you don't have a device to display anything any more, so generating the next frame doesn't seem to make sense either.
>>
>> What could be is that you want the compositor to be kept alive even when the display device is gone to switch over to vkms or whatever so that a VNC session or other remote desktop still works.
> There are two GPUs in the example I gave. The compositor can use both
> for rendering (in cosmic-comp's case) or switch between them (what I'm
> trying to do with KWin), or use one device for rendering, and another
> for importing the syncobj.
Ah! I think I got the problem now. You basically want to avoid importing the syncobj because when the wrong device goes away you are busted.
The reason we didn't considered having the IOCTLs on the FD is because if you don't import them and instead keep them around you can run out file descriptors quite quickly.
When you have an use case where you receive an FD from the client and do a one shot conversion to an eventfd that will probably work, but for keeping them in the long run you need some kind of container for the syncobjs, don't you?
>>>>>>> 3. It removes the need to translate between syncobjs fds and handles.
>>>>>>
>>>>>> That's a pretty big no-go as well. The differentiation between FDs and handles is completely intentional.
>>>>> Could you expand on why it's needed? For compositors, the handle is
>>>>> just an intermediary thing when translating between file descriptors.
>>>>
>>>> Well what we could do is to add an IOCTL to directly attach an syncobj file descriptor to an eventfd.
>>> That would be nice.
>>
>> Take a look at drm_syncobj_file_fops and how drm_syncobj_add_eventfd() is used. Adding that functionality shouldn't be more than a typing exercise.
> Yeah, this patchset already adds that functionality (on the new device).
>
>> Do I see it right that this would already solve most problems in the compositor side?
> Skipping the syncobj handle step would only reduce the amounts of
> ioctls the compositor does, but afaict it wouldn't solve any
> compositor problems. At least not as long as it's still tied to a drm
> device.
Yeah, you need something like a syncobj container or dummy DRM device.
> For device hotplugs, the only new thing we need for correctly handling
> syncobj is a way to receive errors on the eventfd.
I need to look into the eventfd code, could be that this is somehow possible but it's clearly not something I used before.
> A device-independent way to create and use syncobj would still be
> useful to us though, both to simplify the compositor and to improve
> the software rendering use cases.
Yeah not sure how to cleanly do that. We could have a dummy /dev/dri/rendersync or something like that, but that would be quite a hack.
At least I understand the requirement now.
Thanks,
Christian.
>
> - Xaver
On 5/19/26 17:31, Xaver Hugl wrote:
> Am Di., 19. Mai 2026 um 15:29 Uhr schrieb Christian König
> <christian.koenig(a)amd.com>:
>>> 1. This series makes the ability to manipulate syncobjs available
>>> independently of attached hardware.
>>> 2. It makes it available under a consistent path /dev/syncobj.
>>
>> Exactly that is a big no-go. This has to be under /dev/dri.
> FWIW udmabuf is also under /dev directly, but I don't think any
> compositor developer would complain about a different path.
> What are the rules for that? Could this simply be put in /dev/dri/syncobj?
The syncobj are actually the DRM specific way of doing things. The general kernel wide way is to use sync files (see drivers/dma-buf/sync_file.c).
But there has already been tons of problems with those sync files. E.g. they doesn't support your use case at all since they don't have wait before submit behavior.
So there are already ways to do this, but the Linux kernel so far told everybody that this is forbidden. The DRM syncobj wait before signal functionality is much better, but then basically the second try to do this.
> The part where we get this independent of attached hardware is quite
> important for us though, since we can't just ignore explicit sync once
> the device we previously imported the syncobj into is disconnected.
Can you elaborate more on this?
> Buffers can be from any device or allocated in system memory and
> access should be synchronized properly in all cases.
>
> How exactly it's made available isn't all that critical.
>
>>> 3. It removes the need to translate between syncobjs fds and handles.
>>
>> That's a pretty big no-go as well. The differentiation between FDs and handles is completely intentional.
> Could you expand on why it's needed? For compositors, the handle is
> just an intermediary thing when translating between file descriptors.
Well what we could do is to add an IOCTL to directly attach an syncobj file descriptor to an eventfd.
> FTR for me at least, this part would be merely nice to have, since it
> slightly reduces the amount of ioctls a compositor needs to call, but
> it's not important.
>
>>>> What about using VGEM for this?
>>>
>>> If the vgem render node were made available unconditionally under,
>>
>> Software rendering is a complete corner case, I don't think that this will be enabled by default.
> That simply makes vgem unsuitable for solving the problems we face in
> compositors.
Thinking more about it vgem also has the same issues as sync file mentioned above. So that is really also not doable.
Maybe Simona or David have another idea.
Regards,
Christian.
>
> - Xaver
The patch set allows to register a dmabuf to an io_uring instance for
a specified file and use it with io_uring read / write requests. The
infrastructure is not tied to io_uring and there could be more users
in the future. A similar idea was attempted some years ago by Keith [1],
from where I borrowed a good number of changes, and later was brough up
by Tushar and Vishal from Intel.
It's an opt-in feature for files, and they need to implement a new
file operation to use it. Only NVMe block devices are supported in this
series. The user API is built on top of io_uring's "registered buffers",
where a dmabuf is registered in a special way, but after it can be used
as any other "registered buffer" with IORING_OP_{READ,WRITE}_FIXED
requests. It's created via a new file operation and the resulted map is
then passed through the I/O stack in a new iterator type. There is some
additional infrastructure to bind it all, which also counts requests
using a dmabuf map and managing lifetimes, which is used to implement
map invalidation.
It was tested for GPU <-> NVMe transfers. Also, as it maintains a
long-term dma mapping, it helps with the IOMMU cost. The numbers
below are for udmabuf reads previously run by Anuj for different
IOMMU modes:
- STRICT: before = 570 KIOPS, after = 5.01 MIOPS
- LAZY: before = 1.93 MIOPS, after = 5.01 MIOPS
- PASSTHROUGH: before = 5.01 MIOPS, after = 5.01 MIOPS
There are some liburing tests that can serve as an example:
git: https://github.com/isilence/liburing.git rw-dmabuf-tests-v3
url: https://github.com/isilence/liburing/tree/rw-dmabuf-tests-v3
[1] https://lore.kernel.org/io-uring/20220805162444.3985535-1-kbusch@fb.com/
v3: - Rework io_uring registration
- Move token/map infrastructure code out of blk-mq
- Simplify callbacks: remove a separate blk-mq table, which was
mostly just forwarding calls (to nvme).
- Don't skip dma sync depending on request direction
- Fix a couple of hangs
- Rename s/dma/dmabuf/
- Other small changes
v2: - Don't pass raw dma addresses, wrap it into a driver specific object
- Split into two objects: token and map
- Implement move_notify
Pavel Begunkov (10):
file: add callback for creating long-term dmabuf maps
iov_iter: add iterator type for dmabuf maps
block: move bvec init into __bio_clone
block: introduce dma map backed bio type
lib: add dmabuf token infrastructure
block: forward create_dmabuf_token to drivers
nvme-pci: implement dma_token backed requests
io_uring/rsrc: introduce buf registration structure
io_uring/rsrc: extend buffer update
io_uring/rsrc: add dmabuf backed registered buffers
block/bio.c | 28 +++-
block/blk-merge.c | 14 ++
block/blk.h | 3 +-
block/fops.c | 16 ++
drivers/nvme/host/pci.c | 282 ++++++++++++++++++++++++++++++++
include/linux/bio.h | 19 ++-
include/linux/blk-mq.h | 9 +
include/linux/blk_types.h | 8 +-
include/linux/fs.h | 2 +
include/linux/io_dmabuf_token.h | 92 +++++++++++
include/linux/io_uring_types.h | 5 +
include/linux/uio.h | 11 ++
include/uapi/linux/io_uring.h | 31 +++-
io_uring/io_uring.c | 3 +-
io_uring/rsrc.c | 266 +++++++++++++++++++++++++-----
io_uring/rsrc.h | 30 +++-
io_uring/rw.c | 4 +-
lib/Kconfig | 4 +
lib/Makefile | 2 +
lib/io_dmabuf_token.c | 272 ++++++++++++++++++++++++++++++
lib/iov_iter.c | 29 +++-
21 files changed, 1071 insertions(+), 59 deletions(-)
create mode 100644 include/linux/io_dmabuf_token.h
create mode 100644 lib/io_dmabuf_token.c
--
2.53.0
On 5/19/26 19:08, Xaver Hugl wrote:
>>> The part where we get this independent of attached hardware is quite
>>> important for us though, since we can't just ignore explicit sync once
>>> the device we previously imported the syncobj into is disconnected.
>>
>> Can you elaborate more on this?
>
> In Wayland, the client is allowed to attach dmabuf and syncobj
> independently, they don't have to be from the same device (and the
> compositor wouldn't be able to verify the opposite anyways). The
> compositor will usually import both into the same drm device, but
> especially with compositors that render on multiple devices, that's
> not necessarily the case either.
>
> If for example we had a system with one internal GPU and one external
> GPU, the client renders on the internal GPU and the compositor uses
> the external one. Now when the user yanks the USB C cable, afaiu
Well I would say the other way around is a pretty common use case.
In other words the compositors uses the internal GPU for composing and displaying the picture. And the client uses the external GPU for fast rendering.
> - the buffers from the client stay valid
Buffers from the hot plugged GPU don't stay valid. Accessing CPU mappings either result in a SIGBUS or are redirected to a dummy page.
DMA operations to hot plugged buffers from other GPUs (or rather more general other devices) are waited on before the underlying resource is removed (e.g. system memory or PCIe address space or whatever is backing that).
But no new DMA operations are usually permitted to start.
> - the syncobj stays valid on the client side
> - the syncobj becomes invalid on the compositor side
Nope that's not correct. The syncobj itself stays valid even if you completely hot plug the device.
It can just be that the fences inside the syncobj are terminated with an error.
> "invalid" there means either
> - the acquire point of the client is marked as signaled, before
> rendering on the client side is completed
> - the acquire point of the client is never signaled. Since the
> compositor waits for the acquire point, the Wayland surface is stuck
> forever
Both of those would be a *massive* violation of documented kernel rules for hot-plugging which could lead to random data corruption and/or deadlocks.
If you see any HW driver showing behavior like that please open up a bug report and ping the relevant maintainers immediately.
When a hotplug happens all operations of the device should return an -ENODEV error, even when exposed to other devices/application through syncobj or syncfile.
One problem is that only syncfile allows for querying such error codes at the moment, we have patches pending to add that to syncobj as well but we lack a compositor with support for that as userspace client.
> Afaik the latter is currently the case. The former wouldn't be much
> better though, not when it's preventable.
>
> This is admittedly an edge case, but GPU hotunplug is something we try
> to support as well as possible in Plasma, and all the edge cases cause
> a lot of problems in combination and are a lot of headaches to handle
> (or really work around) in the compositor.
Well exactly that design is used in the Tesla 3 infotainment system for example.
So GPU hotplug is actually a pretty common use case.
> Another edge case is when the client asks the compositor to import the
> syncobj, which can fail when a hotunplug is in process, and ends up
> disconnecting the client for no fault of either client or compositor.
Well the question here is if the device the compositor is using or the client is using is gone?
If the client device is hot removed the compositor should be perfectly capable to import the syncobj.
If the compositor device is gone then you don't have a device to display anything any more, so generating the next frame doesn't seem to make sense either.
What could be is that you want the compositor to be kept alive even when the display device is gone to switch over to vkms or whatever so that a VNC session or other remote desktop still works.
>>>>> 3. It removes the need to translate between syncobjs fds and handles.
>>>>
>>>> That's a pretty big no-go as well. The differentiation between FDs and handles is completely intentional.
>>> Could you expand on why it's needed? For compositors, the handle is
>>> just an intermediary thing when translating between file descriptors.
>>
>> Well what we could do is to add an IOCTL to directly attach an syncobj file descriptor to an eventfd.
> That would be nice.
Take a look at drm_syncobj_file_fops and how drm_syncobj_add_eventfd() is used. Adding that functionality shouldn't be more than a typing exercise.
Do I see it right that this would already solve most problems in the compositor side?
Regards,
Christian.
>
> - Xaver
This RFC builds on T.J. Mercier's earlier series [1] which added
a memory.stat counter for exported dma-bufs and a binder-backed
mechanism to transfer charges between cgroups.
The first commit is taken almost verbatim from TJ's series:
it introduces MEMCG_DMABUF as a dedicated per-cgroup stat, so that
the total exported dma-buf footprint is visible both system-wide
(via the root cgroup) and per-application (via per-process cgroups).
This avoids the overhead of DMABUF_SYSFS_STATS and integrates
naturally into the existing cgroup memory hierarchy.
The rest of the series departs from TJ's approach. While the first
commit introduces the memcg stat infrastructure for dmabufs, the
export-time charging it introduces in dma_buf_export() is then
superseded: we charge at dma_heap_ioctl_allocate() time, using a
new charge_pid_fd field in struct dma_heap_allocation_data. The
allocator opens a pidfd for its client (e.g., from binder's
sender_pid), passes it to the ioctl, and the kernel charges the
buffer directly to the client's cgroup at allocation time, so no
transfer step is needed.
This decouples the accounting path from binder entirely:
any allocator that knows its client's PID can use the pid_fd
mechanism regardless of the IPC transport in use.
The cross-cgroup charging capability requires access control.
Patches #3 and #4 add a generic LSM hook (security_dma_heap_alloc)
and an SELinux implementation based on a new dma_heap object class
with a charge_to permission, so policy authors can express which
domains are allowed to charge memory to another domain's cgroup.
Last patch adds some tests to verify the new charge_pid_fd field.
We are sending it as an RFC to spark broader discussion. It may or
may not be the right path forward, and we welcome feedback on the
trade-offs.
Collision note: Eric Chanudet's series [2] adds __GFP_ACCOUNT to
system_heap page allocations as an opt-in module parameter. That
approach charges pages to the allocator's own kmem, which overlaps with
MEMCG_DMABUF. This series explicitly removes __GFP_ACCOUNT from system
heap allocations and routes all accounting through the MEMCG_DMABUF
path to avoid double-counting.
[1] https://lore.kernel.org/cgroups/20230109213809.418135-1-tjmercier@google.co…
[2] https://lore.kernel.org/r/20260113-dmabuf-heap-system-memcg-v2-0-e85722cc2f…
Signed-off-by: Albert Esteve <aesteve(a)redhat.com>
---
Albert Esteve (4):
dma-heap: charge dma-buf memory via explicit memcg
security: dma-heap: Add dma_heap_alloc LSM hook
selinux: Restrict cross-cgroup dma-heap charging
selftests/dmabuf-heaps: Add dma-buf memcg accounting tests
T.J. Mercier (1):
memcg: Track exported dma-buffers
Documentation/admin-guide/cgroup-v2.rst | 5 +
drivers/dma-buf/dma-buf.c | 7 +
drivers/dma-buf/dma-heap.c | 54 +++++-
drivers/dma-buf/heaps/system_heap.c | 2 -
include/linux/dma-buf.h | 4 +
include/linux/lsm_hook_defs.h | 1 +
include/linux/memcontrol.h | 37 ++++
include/linux/security.h | 7 +
include/uapi/linux/dma-heap.h | 6 +
mm/memcontrol.c | 19 ++
security/security.c | 16 ++
security/selinux/hooks.c | 7 +
security/selinux/include/classmap.h | 1 +
tools/testing/selftests/cgroup/Makefile | 2 +-
tools/testing/selftests/cgroup/test_memcontrol.c | 143 +++++++++++++-
tools/testing/selftests/dmabuf-heaps/config | 1 +
tools/testing/selftests/dmabuf-heaps/dmabuf-heap.c | 126 ++++++++++++-
tools/testing/selftests/dmabuf-heaps/vmtest.sh | 205 +++++++++++++++++++++
18 files changed, 633 insertions(+), 10 deletions(-)
---
base-commit: 74fe02ce122a6103f207d29fafc8b3a53de6abaf
change-id: 20260508-v2_20230123_tjmercier_google_com-f44fcfb16530
Best regards,
--
Albert Esteve <aesteve(a)redhat.com>
On Thu, 7 May 2026 11:02:26 +0200
Marcin Åšlusarz <marcin.slusarz(a)arm.com> wrote:
> On Tue, May 05, 2026 at 06:15:23PM +0200, Boris Brezillon wrote:
> > > @@ -277,9 +286,21 @@ int panthor_device_init(struct panthor_device *ptdev)
> > > return ret;
> > > }
> > >
> > > + /* If a protected heap name is specified but not found, defer the probe until created */
> > > + if (protected_heap_name && strlen(protected_heap_name)) {
> >
> > Do we really need this strlen() > 0? Won't dma_heap_find() fail is the
> > name is "" already?
>
> If dma_heap_find() will fail, then the whole probe with fail too.
> This check prevents that.
Yeah, that's also a questionable design choice. I mean, we can
currently probe and boot the FW even though we never setup the
protected FW sections, so why should we defer the probe here? Can't we
just retry the next time a group with the protected bit is created and
fail if we can find a protected heap?
> I'm not sure why it's needed at all, but if
> it is really needed, then s/strlen(protected_heap_name)/protected_heap_name[0]/
> would simplify this.
It's not so much about how you do the test, and more about the case
you're trying to protect against. I guess here you assume that
panthor.protected_heap_name="" means "I don't have a protected heap for
you". If it's deemed acceptable, this should most certainly be
described somewhere.
>
> > > + ptdev->protm.heap = dma_heap_find(protected_heap_name);
> > > + if (!ptdev->protm.heap) {
> > > + drm_warn(&ptdev->base,
> > > + "Protected heap \'%s\' not (yet) available - deferring probe",
> > > + protected_heap_name);
> > > + ret = -EPROBE_DEFER;
> > > + goto err_rpm_put;
> >
> > If you move the heap retrieval before the rpm enablement, you can get
> > rid of this goto err_rpm_put.
> >
> > > + }
> > > + }
> > > +
> > > ret = panthor_hw_init(ptdev);
> > > if (ret)
> > > - goto err_rpm_put;
> > > + goto err_dma_heap_put;
> > >
> > > ret = panthor_pwr_init(ptdev);
> > > if (ret)
Hi all,
This series is based on previous RFCs/discussions:
Tech topic: https://lore.kernel.org/linux-iommu/20250918214425.2677057-1-amastro@fb.com/
RFCv1: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/
RFCv2: https://lore.kernel.org/kvm/20260312184613.3710705-1-mattev@meta.com/
The background/rationale is covered in more detail in the RFC cover
letters. The TL;DR is:
The goal is to enable userspace driver designs that use VFIO to export
DMABUFs representing subsets of PCI device BARs, and "vend" those
buffers from a primary process to other subordinate processes by fd.
These processes then mmap() the buffers and their access to the device
is isolated to the exported ranges. This is an improvement on sharing
the VFIO device fd to subordinate processes, which would allow
unfettered access .
This is achieved by enabling mmap() of vfio-pci DMABUFs. Second, a
new ioctl()-based revocation mechanism is added to allow the primary
process to forcibly revoke access to previously-shared BAR spans, even
if the subordinate processes haven't cleanly exited.
(The related topic of safe delegation of iommufd control to the
subordinate processes is not addressed here, and is follow-up work.)
As well as isolation and revocation, another advantage to accessing a
BAR through a VMA backed by a DMABUF is that it's straightforward to
create the buffer with access attributes, such as write-combining.
Notes on patches
================
Feedback from the RFCs requested that, instead of creating
DMABUF-specific vm_ops and .fault paths, to go the whole way and
migrate the existing VFIO PCI BAR mmap() to be backed by a DMABUF too,
resulting in a common vm_ops and fault handler for mmap()s of both the
VFIO device and explicitly-exported DMABUFs. This has been done for
vfio-pci, but not sub-drivers (nvgrace-gpu's special-case mappings are
unchanged).
vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put
A bug fix to a related are, whose context is a depdency for later
patches.
vfio/pci: Add a helper to look up PFNs for DMABUFs
vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
The first is for a DMABUF VMA fault handler to determine
arbitrary-sized PFNs from ranges in DMABUF. Secondly, refactor
DMABUF export for use by the existing export feature and a new
helper that creates a DMABUF corresponding to a VFIO BAR mmap()
request.
vfio/pci: Convert BAR mmap() to use a DMABUF
The vfio-pci core mmap() creates a DMABUF with the helper, and the
vm_ops fault handler uses the other helper to resolve the fault.
Because this depends on DMABUF structs/code, CONFIG_VFIO_PCI_CORE
needs to depend on CONFIG_DMA_SHARED_BUFFER. The
CONFIG_VFIO_PCI_DMABUF still conditionally enables the export
support code.
NOTE: The user mmap()s a device fd, but the resulting VMA's vm_file
becomes that of the DMABUF which takes ownership of the device and
puts it on release. This maintains the existing behaviour of a VMA
keeping the VFIO device open.
BAR zapping then happens via the existing vfio_pci_dma_buf_move()
path, which now needs to unmap PTEs in the DMABUF's address_space.
vfio/pci: Provide a user-facing name for BAR mappings
There was a request for decent debug naming in /proc/<pid>/maps
etc. comparable to the existing VFIO names: since the VMAs are
DMABUFs, they have a "dmabuf:" prefix and can't be 100% identical
to before. This is a user-visible change, but this patch at least
now gives us extra info on the BDF & BAR being mapped.
vfio/pci: Clean up BAR zap and revocation
In general (see NOTE!) the vfio_pci_zap_bars() is now obsolete,
since it unmaps PTEs in the VFIO device address_space which is now
unused. This consolidates all calls (e.g. around reset) with the
neighbouring vfio_pci_dma_buf_move()s into new functions, to
revoke-zap/unrevoke.
NOTE: the nvgrace-gpu driver continues to use its own private
vm_ops, fault handler, etc. for its special memregions, and these
DO still add PTEs to the VFIO device address_space. So, a
temporary flag, vdev->bar_needs_zap, maintains the old behaviour
for this use. At least this patch's consolidation makes it easy
to remove the remaining zap when this need goes away.
A FIXME is added: if nvgrace-gpu is converted to DMABUFs, remove
the flag and final zap.
vfio/pci: Support mmap() of a VFIO DMABUF
Adds mmap() for a DMABUF fd exported from vfio-pci.
It was a goal to keep the VFIO device fd lifetime behaviour
unchanged with respect to the DMABUFs. An application can close
all device fds, and this will revoke/clean up all DMABUFs; no
mappings or other access can be performed now. When enabling
mmap() of the DMABUFs, this means access through the VMA is also
revoked. This complicates the fault handler because whilst the
DMABUF exists, it has no guarantee that the corresponding VFIO
device is still alive. Adds synchronisation ensuring the vdev is
available before vdev->memory_lock is touched.
(I decided against the alternative of preventing cleanup by holding
the VFIO device open if any DMABUFs exist, because it's both a
change of behaviour and less clean overall.)
I've added a chonky comment in place, happy to clarify more if you
have ideas.
vfio/pci: Permanently revoke a DMABUF on request
By weight, this is mostly a rename of revoked to an enum, status.
There are now 3 states for a buffer, usable and revoked
temporary/permanent. A new VFIO device ioctl is added,
VFIO_DEVICE_PCI_DMABUF_REVOKE, which passes a DMABUF (exported from
that device) and permanently revokes it. Thus a userspace driver
can guarantee any downstream consumers of a shared fd are prevented
from accessing a BAR range, and that range can be reused.
The code doing revocation in vfio_pci_dma_buf_move() is moved,
unchanged, to a common function for use by _move() and the new
ioctl path.
Q: I can't think of a good reason to temporarily revoke/unrevoke
buffers from userspace, so didn't add a 'flags' field to the ioctl
struct. Easy to add if people think it's worthwhile for future
use.
vfio/pci: Add mmap() attributes to DMABUF feature
Reserves bits [31:28] in vfio_device_feature_dma_buf to allow a
(CPU) mapping attribute to be specified for an exported set of
ranges. The default is the current UC, and a new flag can specify
CPU access as WC.
Q: I've taken 4 bits; the intention is for this field to be a
scalar not a bitmap (i.e. mutually-exclusive access properties).
Perhaps 4 is a bit too many?
Testing
=======
(The [RFC ONLY] userspace test program, for QEMU edu-plus, has been
dropped, but can be found in the GitHub branch below.)
This code has been tested in mapping DMABUFs of single/multiple
ranges, aliasing mmap()s, aliasing ranges across DMABUFs, vm_pgoff >
0, revocation, shutdown/cleanup scenarios, and hugepage mappings seem
to work correctly. I've lightly tested WC mappings also (by observing
resulting PTEs as having the correct attributes...). No regressions
observed on the VFIO selftests, or on our internal vfio-pci
applications.
End
===
This is based on -next (next-20260414 but will merge earlier), as it
depends on Leon's series "vfio: Wait for dma-buf invalidation to
complete":
https://lore.kernel.org/linux-iommu/20260205-nocturnal-poetic-chamois-f566a…
These commits are on GitHub, along with "[RFC ONLY] selftests: vfio: Add
standalone vfio_dmabuf_mmap_test":
https://github.com/metamev/linux/compare/next-20260414...metamev:linux:dev/…
Thanks for reading,
Matt
================================================================================
Change log:
v1:
- Cleanup of the common DMABUF-aware VMA vm_ops fault handler and
export code.
- Fixed a lot of races, particularly faults racing with DMABUF
cleanup (if the VFIO device fds close, for example).
- Added nicer human-readable names for VFIO mmap() VMAs
RFCv2: Respin based on the feedback/suggestions:
https://lore.kernel.org/kvm/20260312184613.3710705-1-mattev@meta.com/
- Transform the existing VFIO BAR mmap path to also use DMABUFs
behind the scenes, and then simply share that code for
explicitly-mapped DMABUFs. Jason wanted to go that direction to
enable iommufd VFIO type 1 emulation to pick up a DMABUF for an IO
mapping.
- Revoke buffers using a VFIO device fd ioctl
RFCv1:
https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/
Matt Evans (9):
vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put
vfio/pci: Add a helper to look up PFNs for DMABUFs
vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
vfio/pci: Convert BAR mmap() to use a DMABUF
vfio/pci: Provide a user-facing name for BAR mappings
vfio/pci: Clean up BAR zap and revocation
vfio/pci: Support mmap() of a VFIO DMABUF
vfio/pci: Permanently revoke a DMABUF on request
vfio/pci: Add mmap() attributes to DMABUF feature
drivers/vfio/pci/Kconfig | 3 +-
drivers/vfio/pci/Makefile | 3 +-
drivers/vfio/pci/nvgrace-gpu/main.c | 5 +
drivers/vfio/pci/vfio_pci_config.c | 30 +-
drivers/vfio/pci/vfio_pci_core.c | 224 ++++++++++---
drivers/vfio/pci/vfio_pci_dmabuf.c | 500 +++++++++++++++++++++++-----
drivers/vfio/pci/vfio_pci_priv.h | 49 ++-
include/linux/vfio_pci_core.h | 1 +
include/uapi/linux/vfio.h | 42 ++-
9 files changed, 690 insertions(+), 167 deletions(-)
--
2.47.3
This series adds a new device /dev/syncobj that can be used to create
and manipulate DRM syncobjs. Previously, these operations required the
use of a DRM device and the device needed to support the DRIVER_SYNCOBJ
and DRIVER_SYNCOBJ_TIMELINE features.
There are several issues with the existing API:
- Syncobjs are the only explicit sync mechanism available on wayland.
Most compositors do not use GPU waits. Instead, they use the
DRM_IOCTL_SYNCOBJ_EVENTFD ioctl to perform a CPU wait. Being tied to
DRM devices means that compositors cannot consistently offer this
feature even though no device-specific logic is involved.
- llvmpipe currently cannot offer syncobj interop because it does not
have access to a DRM device. This means that applications using
llvmpipe cannot present images before they have finished rendering,
despite llvmpipe using threaded rendering.
- Clients that do not use the Vulkan WSI need to manually probe /dev/dri
for devices that support the syncobj ioctls in order to use the
wayland syncobj protocol.
- Similarly, clients that want to use screen capture have no equivalent
to the WSI and are therefore forced into that path.
- Having to keep a DRM device open has potentially negative interactions
with GPU hotplug.
- Having to translate between syncobj FDs and handles is troublesome in
the compositor usecase since syncobjs come and go frequently and need
to be cleaned up when clients disconnect.
/dev/syncobj solves these issues by providing all syncobj ioctls under a
consistent path that is not tied to any DRM device. It also operates
directly on file descriptors instead of syncobj handles.
The series starts with a number of small refactorings in drm_syncobj.c
to make its functionality available outside of the file and without the
need for drm_file/handle pairs.
The last commit adds the /dev/syncobj module. I've added it as a misc
device but maybe this should instead live somewhere under gpu/drm.
An application using the new interface can be found at [1].
[1]: https://github.com/mahkoh/jay/pull/947
---
Julian Orth (12):
drm/syncobj: add drm_syncobj_from_fd
drm/syncobj: add drm_syncobj_fence_lookup
drm/syncobj: make drm_syncobj_array_wait_timeout public
drm/syncobj: add drm_syncobj_register_eventfd
drm/syncobj: have transfer functions accept drm_syncobj directly
drm/syncobj: add drm_syncobj_transfer
drm/syncobj: add drm_syncobj_timeline_signal
drm/syncobj: add drm_syncobj_query
drm/syncobj: fix resource leak in drm_syncobj_import_sync_file_fence
drm/syncobj: add drm_syncobj_import_sync_file
drm/syncobj: add drm_syncobj_export_sync_file
misc/syncobj: add new device
Documentation/userspace-api/ioctl/ioctl-number.rst | 1 +
drivers/gpu/drm/drm_syncobj.c | 374 ++++++++++++++-----
drivers/misc/Kconfig | 10 +
drivers/misc/Makefile | 1 +
drivers/misc/syncobj.c | 404 +++++++++++++++++++++
include/drm/drm_syncobj.h | 21 ++
include/uapi/linux/syncobj.h | 75 ++++
7 files changed, 795 insertions(+), 91 deletions(-)
---
base-commit: 6916d5703ddf9a38f1f6c2cc793381a24ee914c6
change-id: 20260516-jorth-syncobj-d4d374c8c61b
Best regards,
--
Julian Orth <ju.orth(a)gmail.com>