Recently there was a fairly long thread about recoverable hardware page
faults, how they can deadlock, and what to do about that.
While the discussion is still fresh I figured it's a good time to try and
document the conclusions a bit.
References: https://lore.kernel.org/dri-devel/20210107030127.20393-1-Felix.Kuehling@amd…
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Thomas Hellström <thomas.hellstrom(a)intel.com>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: Felix Kuehling <felix.kuehling(a)amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
--
I'll be away next week, but figured I'll type this up quickly for some
comments and to check whether I got this all roughly right.
Critique very much wanted on this, so that we can make sure hw which
can't preempt (with pagefaults pending) like gfx10 has a clear path to
support page faults in upstream. So anything I missed, got wrong or
otherwise messed up would be good to hear about.
-Daniel
---
Documentation/driver-api/dma-buf.rst | 66 ++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index a2133d69872c..e924c1e4f7a3 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -257,3 +257,69 @@ fences in the kernel. This means:
userspace is allowed to use userspace fencing or long running compute
workloads. This also means no implicit fencing for shared buffers in these
cases.
+
+Recoverable Hardware Page Faults Implications
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Modern hardware supports recoverable page faults, which has a lot of
+implications for DMA fences.
+
+First, a pending page fault obviously holds up the work that's running on the
+accelerator and a memory allocation is usually required to resolve the fault.
+But memory allocations are not allowed to gate completion of DMA fences, which
+means any workload using recoverable page faults cannot use DMA fences for
+synchronization. Synchronization fences controlled by userspace must be used
+instead.
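+
+A rough sketch (not taken from any real driver) of how this rule looks in
+code, using the dma_fence signalling annotations; ``fence`` is the job's
+already published dma_fence and ``buf`` is just an illustrative local::
+
+    bool cookie;
+    void *buf;
+
+    /* the fence is visible to other threads from here on, this opens
+     * the fence signalling critical section */
+    cookie = dma_fence_begin_signalling();
+
+    /* BAD: GFP_KERNEL can recurse into reclaim, and reclaim may wait on
+     * DMA fences - lockdep will report this as a potential deadlock */
+    buf = kmalloc(SZ_4K, GFP_KERNEL);
+
+    dma_fence_signal(fence);
+    dma_fence_end_signalling(cookie);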
+
+On GPUs this poses a problem, because current desktop compositor protocols on
+Linux rely on DMA fences, which means without an entirely new userspace stack
+built on top of userspace fences, they cannot benefit from recoverable page
+faults. The exception is when page faults are only used as migration hints and
+never to fill a memory request on demand. For now this means recoverable page
+faults on GPUs are limited to pure compute workloads.
+
+Furthermore GPUs usually have shared resources between the 3D rendering and
+compute side, like compute units or command submission engines. If both a 3D
+job with a DMA fence and a compute workload using recoverable page faults are
+pending they could deadlock:
+
+- The 3D workload might need to wait for the compute job to finish and release
+ hardware resources first.
+
+- The compute workload might be stuck in a page fault, because the memory
+ allocation is waiting for the DMA fence of the 3D workload to complete.
+
+There are a few ways to prevent this problem:
+
+- Compute workloads can always be preempted, even when a page fault is pending
+ and not yet repaired. Not all hardware supports this.
+
+- DMA fence workloads and workloads which need page fault handling have
+ independent hardware resources to guarantee forward progress. This could be
+ achieved e.g. through dedicated engines and minimal compute unit
+ reservations for DMA fence workloads.
+
+- The reservation approach could be further refined by only reserving the
+ hardware resources for DMA fence workloads when they are in-flight. This must
+ cover the time from when the DMA fence is visible to other threads up to the
+ moment when the fence is completed through dma_fence_signal(); see the sketch
+ after this list for an illustration.
+
+- As a last resort, if the hardware provides no useful reservation mechanics,
+ all workloads must be flushed from the GPU when switching between jobs
+ requiring DMA fences or jobs requiring page fault handling: This means all DMA
+ fences must complete before a compute job with page fault handling can be
+ inserted into the scheduler queue. And vice versa, before a DMA fence can be
+ made visible anywhere in the system, all compute workloads must be preempted
+ to guarantee all pending GPU page faults are flushed.
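+
+The third option above (reserving hardware resources only while a DMA fence is
+in flight) could look roughly like the sketch below. struct my_job,
+reserve_compute_units(), release_compute_units() and publish_job_fence() are
+purely made-up driver helpers, only the dma_fence calls are real API::
+
+    static void job_fence_signalled(struct dma_fence *f, struct dma_fence_cb *cb)
+    {
+        struct my_job *job = container_of(cb, struct my_job, cb);
+
+        /* the fence has completed, the reservation is no longer needed */
+        release_compute_units(job->dev);
+    }
+
+    /* before the job's dma_fence becomes visible to any other thread */
+    reserve_compute_units(job->dev);
+    publish_job_fence(job); /* install in dma_resv objects, syncobjs, ... */
+
+    if (dma_fence_add_callback(job->fence, &job->cb, job_fence_signalled))
+        /* the fence already signalled, the callback was not installed */
+        release_compute_units(job->dev);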
+
+Note that workloads that run on independent hardware like copy engines or other
+GPUs do not have any impact. This allows us to keep using DMA fences internally
+in the kernel even for resolving hardware page faults, e.g. by using copy
+engines to clear or copy memory needed to resolve the page fault.
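+
+A minimal sketch of such an internal use in a GPU page fault handler, where
+clear_pages_on_copy_engine() stands in for whatever driver-specific mechanism
+hands back a dma_fence for the copy engine work::
+
+    struct dma_fence *fence;
+    long ret;
+
+    /* the copy engine never depends on the faulting compute job, so
+     * blocking on its fence here cannot deadlock */
+    fence = clear_pages_on_copy_engine(gpu, pages);
+    if (IS_ERR(fence))
+        return PTR_ERR(fence);
+
+    ret = dma_fence_wait(fence, false);
+    dma_fence_put(fence);
+    if (ret)
+        return ret;
+
+    /* now update the GPU page tables and resume the faulting context */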
+
+In some ways this page fault problem is a special case of the `Indefinite DMA
+Fences` discussions: Indefinite fences from compute workloads are allowed to
+depend on DMA fences, but not the other way around. And the page fault problem
+isn't even new, because some other CPU thread in userspace might hit a page
+fault which holds up a userspace fence - supporting page faults on GPUs doesn't
+add anything fundamentally new.
--
2.30.0
On Wed, Jan 27, 2021 at 01:08:05PM +0100, Thomas Zimmermann wrote:
> Hi
>
> Am 11.01.21 um 17:50 schrieb Daniel Vetter:
> > On Fri, Jan 08, 2021 at 10:43:31AM +0100, Thomas Zimmermann wrote:
> > > Implementations of the vmap/vunmap GEM callbacks may perform pinning
> > > of the BO and may acquire the associated reservation object's lock.
> > > Callers that only require a mapping of the contained memory can thus
> > > interfere with other tasks that require exact pinning, such as scanout.
> > > This is less of an issue with private SHMEM buffers, but may happen
> > > with imported ones.
> > >
> > > Therefore provide the new interfaces drm_gem_shmem_vmap_local() and
> > > drm_gem_shmem_vunmap_local(), which only perform the vmap/vunmap
> > > operations. Callers have to hold the reservation lock while the mapping
> > > persists.
> > >
> > > This patch also connects GEM SHMEM helpers to GEM object functions with
> > > equivalent functionality.
> > >
> > > v4:
> > > * call dma_buf_{vmap,vunmap}_local() where necessary (Daniel)
> > > * move driver changes into separate patches (Daniel)
> > >
> > > Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
> > > ---
> > > drivers/gpu/drm/drm_gem_shmem_helper.c | 90 +++++++++++++++++++++++---
> > > include/drm/drm_gem_shmem_helper.h | 2 +
> > > 2 files changed, 84 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > > index 9825c378dfa6..298832b2b43b 100644
> > > --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> > > +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > > @@ -32,6 +32,8 @@ static const struct drm_gem_object_funcs drm_gem_shmem_funcs = {
> > > .get_sg_table = drm_gem_shmem_get_sg_table,
> > > .vmap = drm_gem_shmem_vmap,
> > > .vunmap = drm_gem_shmem_vunmap,
> > > + .vmap_local = drm_gem_shmem_vmap_local,
> > > + .vunmap_local = drm_gem_shmem_vunmap_local,
> > > .mmap = drm_gem_shmem_mmap,
> > > };
> > > @@ -261,7 +263,8 @@ void drm_gem_shmem_unpin(struct drm_gem_object *obj)
> > > }
> > > EXPORT_SYMBOL(drm_gem_shmem_unpin);
> > > -static int drm_gem_shmem_vmap_locked(struct drm_gem_shmem_object *shmem, struct dma_buf_map *map)
> > > +static int drm_gem_shmem_vmap_locked(struct drm_gem_shmem_object *shmem, struct dma_buf_map *map,
> > > + bool local)
> >
> > This is a bit spaghetti and also has the problem that we're not changing
> > shmem->vmap_use_count under different locks, depending upon which path
> > we're taking.
> >
> > I think the cleanest would be if we pull the if (import_attach) case out
> > of the _locked() version completely, for all cases, and also outside of
> > the shmem->vmap_lock. This means no caching of vmaps in the shmem layer
> > anymore for imported buffers, but this is no longer a problem: We cache
> > them in the exporters instead (I think at least, if not maybe need to fix
> > that where it's expensive).
>
> There's no vmap refcounting in amdgpu AFAICT. So importing pages from there
> into an SHMEM object has the potential of breaking. IIRC same for radeon and
> nouveau.
As long as the pinning is refcounted I think it should be fine, it's just
that if you have multiple vmaps (e.g. 2 udl devices plugged in) we'll set
up 2 vmaps. Which is a bit pointless, but not really harmful. At least
on 64bit where there's enough virtual address space.
> So I'm somewhat reluctant to making this change. I guess I'll look elsewhere
> first to fix some of the locking issues (e.g., my recent ast cursor
> patches).
If this breaks for amdgpu/radeon/nouveau then we already have a bug,
since 2 udl devices can provoke this issue already as-is. So I don't think
this should be a blocker.
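
Btw for reference, the locking contract the new _local helpers assume looks
roughly like this sketch (only dma_resv_lock/unlock and the two helpers from
this patch are real, the rest is made up):

	struct dma_buf_map map;
	int ret;

	ret = dma_resv_lock(obj->resv, NULL);
	if (ret)
		return ret;

	ret = drm_gem_shmem_vmap_local(obj, &map);
	if (ret)
		goto out_unlock;

	/* access the buffer through map.vaddr / the dma_buf_map helpers */

	drm_gem_shmem_vunmap_local(obj, &map);

out_unlock:
	dma_resv_unlock(obj->resv);
	return ret;
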
-Daniel
>
> Best regards
> Thomas
>
> >
> > Other option would be to only pull it out for the _vmap_local case, but
> > that's a bit ugly because no longer symmetrical in the various paths.
> >
> > > {
> > > struct drm_gem_object *obj = &shmem->base;
> > > int ret = 0;
> > > @@ -272,7 +275,10 @@ static int drm_gem_shmem_vmap_locked(struct drm_gem_shmem_object *shmem, struct
> > > }
> > > if (obj->import_attach) {
> > > - ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> > > + if (local)
> > > + ret = dma_buf_vmap_local(obj->import_attach->dmabuf, map);
> > > + else
> > > + ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> > > if (!ret) {
> > > if (WARN_ON(map->is_iomem)) {
> > > ret = -EIO;
> > > @@ -313,7 +319,7 @@ static int drm_gem_shmem_vmap_locked(struct drm_gem_shmem_object *shmem, struct
> > > return ret;
> > > }
> > > -/*
> > > +/**
> > > * drm_gem_shmem_vmap - Create a virtual mapping for a shmem GEM object
> > > * @shmem: shmem GEM object
> > > * @map: Returns the kernel virtual address of the SHMEM GEM object's backing
> > > @@ -339,15 +345,53 @@ int drm_gem_shmem_vmap(struct drm_gem_object *obj, struct dma_buf_map *map)
> > > ret = mutex_lock_interruptible(&shmem->vmap_lock);
> > > if (ret)
> > > return ret;
> > > - ret = drm_gem_shmem_vmap_locked(shmem, map);
> > > + ret = drm_gem_shmem_vmap_locked(shmem, map, false);
> > > mutex_unlock(&shmem->vmap_lock);
> > > return ret;
> > > }
> > > EXPORT_SYMBOL(drm_gem_shmem_vmap);
> > > +/**
> > > + * drm_gem_shmem_vmap_local - Create a virtual mapping for a shmem GEM object
> > > + * @shmem: shmem GEM object
> > > + * @map: Returns the kernel virtual address of the SHMEM GEM object's backing
> > > + * store.
> > > + *
> > > + * This function makes sure that a contiguous kernel virtual address mapping
> > > + * exists for the buffer backing the shmem GEM object.
> > > + *
> > > + * The function is called with the BO's reservation object locked. Callers must
> > > + * hold the lock until after unmapping the buffer.
> > > + *
> > > + * This function can be used to implement &drm_gem_object_funcs.vmap_local. But
> > > + * it can also be called by drivers directly, in which case it will hide the
> > > + * differences between dma-buf imported and natively allocated objects.
> >
> > So for the other callbacks I tried to make sure we have different entry
> > points for this, since it's not really the same thing and because of the
> > locking mess we have with dma_resv_lock vs various pre-existing local
> > locking scheme, it's easy to get a mess.
> >
> > I think the super clean version here would be to also export just the
> > internal stuff for the ->v(un)map_local hooks, but that's maybe a bit too
> > much boilerplate for no real gain.
> > -Daniel
> >
> > > + *
> > > + * Acquired mappings should be cleaned up by calling drm_gem_shmem_vunmap_local().
> > > + *
> > > + * Returns:
> > > + * 0 on success or a negative error code on failure.
> > > + */
> > > +int drm_gem_shmem_vmap_local(struct drm_gem_object *obj, struct dma_buf_map *map)
> > > +{
> > > + struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
> > > + int ret;
> > > +
> > > + dma_resv_assert_held(obj->resv);
> > > +
> > > + ret = mutex_lock_interruptible(&shmem->vmap_lock);
> > > + if (ret)
> > > + return ret;
> > > + ret = drm_gem_shmem_vmap_locked(shmem, map, true);
> > > + mutex_unlock(&shmem->vmap_lock);
> > > +
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL(drm_gem_shmem_vmap_local);
> > > +
> > > static void drm_gem_shmem_vunmap_locked(struct drm_gem_shmem_object *shmem,
> > > - struct dma_buf_map *map)
> > > + struct dma_buf_map *map, bool local)
> > > {
> > > struct drm_gem_object *obj = &shmem->base;
> > > @@ -358,7 +402,10 @@ static void drm_gem_shmem_vunmap_locked(struct drm_gem_shmem_object *shmem,
> > > return;
> > > if (obj->import_attach)
> > > - dma_buf_vunmap(obj->import_attach->dmabuf, map);
> > > + if (local)
> > > + dma_buf_vunmap_local(obj->import_attach->dmabuf, map);
> > > + else
> > > + dma_buf_vunmap(obj->import_attach->dmabuf, map);
> > > else
> > > vunmap(shmem->vaddr);
> > > @@ -366,7 +413,7 @@ static void drm_gem_shmem_vunmap_locked(struct drm_gem_shmem_object *shmem,
> > > drm_gem_shmem_put_pages(shmem);
> > > }
> > > -/*
> > > +/**
> > > * drm_gem_shmem_vunmap - Unmap a virtual mapping fo a shmem GEM object
> > > * @shmem: shmem GEM object
> > > * @map: Kernel virtual address where the SHMEM GEM object was mapped
> > > @@ -384,11 +431,38 @@ void drm_gem_shmem_vunmap(struct drm_gem_object *obj, struct dma_buf_map *map)
> > > struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
> > > mutex_lock(&shmem->vmap_lock);
> > > - drm_gem_shmem_vunmap_locked(shmem, map);
> > > + drm_gem_shmem_vunmap_locked(shmem, map, false);
> > > mutex_unlock(&shmem->vmap_lock);
> > > }
> > > EXPORT_SYMBOL(drm_gem_shmem_vunmap);
> > > +/**
> > > + * drm_gem_shmem_vunmap_local - Unmap a virtual mapping fo a shmem GEM object
> > > + * @shmem: shmem GEM object
> > > + * @map: Kernel virtual address where the SHMEM GEM object was mapped
> > > + *
> > > + * This function cleans up a kernel virtual address mapping acquired by
> > > + * drm_gem_shmem_vmap_local(). The mapping is only removed when the use count
> > > + * drops to zero.
> > > + *
> > > + * The function is called with the BO's reservation object locked.
> > > + *
> > > + * This function can be used to implement &drm_gem_object_funcs.vmap_local.
> > > + * But it can also be called by drivers directly, in which case it will hide
> > > + * the differences between dma-buf imported and natively allocated objects.
> > > + */
> > > +void drm_gem_shmem_vunmap_local(struct drm_gem_object *obj, struct dma_buf_map *map)
> > > +{
> > > + struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
> > > +
> > > + dma_resv_assert_held(obj->resv);
> > > +
> > > + mutex_lock(&shmem->vmap_lock);
> > > + drm_gem_shmem_vunmap_locked(shmem, map, true);
> > > + mutex_unlock(&shmem->vmap_lock);
> > > +}
> > > +EXPORT_SYMBOL(drm_gem_shmem_vunmap_local);
> > > +
> > > struct drm_gem_shmem_object *
> > > drm_gem_shmem_create_with_handle(struct drm_file *file_priv,
> > > struct drm_device *dev, size_t size,
> > > diff --git a/include/drm/drm_gem_shmem_helper.h b/include/drm/drm_gem_shmem_helper.h
> > > index 434328d8a0d9..3f59bdf749aa 100644
> > > --- a/include/drm/drm_gem_shmem_helper.h
> > > +++ b/include/drm/drm_gem_shmem_helper.h
> > > @@ -114,7 +114,9 @@ void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem);
> > > int drm_gem_shmem_pin(struct drm_gem_object *obj);
> > > void drm_gem_shmem_unpin(struct drm_gem_object *obj);
> > > int drm_gem_shmem_vmap(struct drm_gem_object *obj, struct dma_buf_map *map);
> > > +int drm_gem_shmem_vmap_local(struct drm_gem_object *obj, struct dma_buf_map *map);
> > > void drm_gem_shmem_vunmap(struct drm_gem_object *obj, struct dma_buf_map *map);
> > > +void drm_gem_shmem_vunmap_local(struct drm_gem_object *obj, struct dma_buf_map *map);
> > > int drm_gem_shmem_madvise(struct drm_gem_object *obj, int madv);
> > > --
> > > 2.29.2
> > >
> >
>
> --
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Software Solutions Germany GmbH
> Maxfeldstr. 5, 90409 Nürnberg, Germany
> (HRB 36809, AG Nürnberg)
> Geschäftsführer: Felix Imendörffer
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Hi Simon,
On Thu, 28 Jan 2021 at 20:01, Simon Ser <contact(a)emersion.fr> wrote:
>
> On Thursday, January 28th, 2021 at 1:03 PM, Sumit Semwal <sumit.semwal(a)linaro.org> wrote:
>
> > Since he didn't comment on Hridya's last clarification about the
> > tracepoints to track total GPU memory allocations being orthogonal to
> > this series, I assumed he agreed with it.
>
> IIRC he's away this week. (I don't remember when he comes back.)
>
> > Daniel, do you still have objections around adding this patch in?
>
> (Adding him explicitly in CC)
Thanks for doing this!
Best,
Sumit.