Recently there was a fairly long thread about recoverable hardware page
faults, how they can deadlock, and what to do about that.
While the discussion is still fresh I figured it's a good time to try and
document the conclusions a bit.
References: https://lore.kernel.org/dri-devel/20210107030127.20393-1-Felix.Kuehling@amd…
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Thomas Hellström <thomas.hellstrom(a)intel.com>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: Jerome Glisse <jglisse(a)redhat.com>
Cc: Felix Kuehling <felix.kuehling(a)amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: linux-media(a)vger.kernel.org
Cc: linaro-mm-sig(a)lists.linaro.org
--
I'll be away next week, but figured I'll type this up quickly for some
comments and to check whether I got this all roughly right.
Critique very much wanted on this, so that we can make sure hw which
can't preempt (with pagefaults pending) like gfx10 has a clear path to
support page faults in upstream. So anything I missed, got wrong or
otherwise messed up would be good to hear about.
-Daniel
---
Documentation/driver-api/dma-buf.rst | 66 ++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index a2133d69872c..e924c1e4f7a3 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -257,3 +257,69 @@ fences in the kernel. This means:
userspace is allowed to use userspace fencing or long running compute
workloads. This also means no implicit fencing for shared buffers in these
cases.
+
+Recoverable Hardware Page Faults Implications
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Modern hardware supports recoverable page faults, which has a lot of
+implications for DMA fences.
+
+First, a pending page fault obviously holds up the work that's running on the
+accelerator and a memory allocation is usually required to resolve the fault.
+But memory allocations are not allowed to gate completion of DMA fences, which
+means any workload using recoverable page faults cannot use DMA fences for
+synchronization. Synchronization fences controlled by userspace must be used
+instead.
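+
+A rough sketch (not taken from any real driver) of how this rule looks in
+code, using the dma_fence signalling annotations; ``fence`` is the job's
+already published dma_fence and ``buf`` is just an illustrative local::
+
+    bool cookie;
+    void *buf;
+
+    /* the fence is visible to other threads from here on, this opens
+     * the fence signalling critical section */
+    cookie = dma_fence_begin_signalling();
+
+    /* BAD: GFP_KERNEL can recurse into reclaim, and reclaim may wait on
+     * DMA fences - lockdep will report this as a potential deadlock */
+    buf = kmalloc(SZ_4K, GFP_KERNEL);
+
+    dma_fence_signal(fence);
+    dma_fence_end_signalling(cookie);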
+
+On GPUs this poses a problem, because current desktop compositor protocols on
+Linux rely on DMA fences, which means without an entirely new userspace stack
+built on top of userspace fences, they cannot benefit from recoverable page
+faults. The exception is when page faults are only used as migration hints and
+never to fill a memory request on demand. For now this means recoverable page
+faults on GPUs are limited to pure compute workloads.
+
+Furthermore GPUs usually have shared resources between the 3D rendering and
+compute side, like compute units or command submission engines. If both a 3D
+job with a DMA fence and a compute workload using recoverable page faults are
+pending they could deadlock:
+
+- The 3D workload might need to wait for the compute job to finish and release
+ hardware resources first.
+
+- The compute workload might be stuck in a page fault, because the memory
+ allocation is waiting for the DMA fence of the 3D workload to complete.
+
+There are a few ways to prevent this problem:
+
+- Compute workloads can always be preempted, even when a page fault is pending
+ and not yet repaired. Not all hardware supports this.
+
+- DMA fence workloads and workloads which need page fault handling have
+ independent hardware resources to guarantee forward progress. This could be
+ achieved e.g. through dedicated engines and minimal compute unit
+ reservations for DMA fence workloads.
+
+- The reservation approach could be further refined by only reserving the
+ hardware resources for DMA fence workloads when they are in-flight. This must
+ cover the time from when the DMA fence is visible to other threads up to the
+ moment when the fence is completed through dma_fence_signal(); see the sketch
+ after this list for an illustration.
+
+- As a last resort, if the hardware provides no useful reservation mechanics,
+ all workloads must be flushed from the GPU when switching between jobs
+ requiring DMA fences or jobs requiring page fault handling: This means all DMA
+ fences must complete before a compute job with page fault handling can be
+ inserted into the scheduler queue. And vice versa, before a DMA fence can be
+ made visible anywhere in the system, all compute workloads must be preempted
+ to guarantee all pending GPU page faults are flushed.
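+
+The third option above (reserving hardware resources only while a DMA fence is
+in flight) could look roughly like the sketch below. struct my_job,
+reserve_compute_units(), release_compute_units() and publish_job_fence() are
+purely made-up driver helpers, only the dma_fence calls are real API::
+
+    static void job_fence_signalled(struct dma_fence *f, struct dma_fence_cb *cb)
+    {
+        struct my_job *job = container_of(cb, struct my_job, cb);
+
+        /* the fence has completed, the reservation is no longer needed */
+        release_compute_units(job->dev);
+    }
+
+    /* before the job's dma_fence becomes visible to any other thread */
+    reserve_compute_units(job->dev);
+    publish_job_fence(job); /* install in dma_resv objects, syncobjs, ... */
+
+    if (dma_fence_add_callback(job->fence, &job->cb, job_fence_signalled))
+        /* the fence already signalled, the callback was not installed */
+        release_compute_units(job->dev);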
+
+Note that workloads that run on independent hardware like copy engines or other
+GPUs do not have any impact. This allows us to keep using DMA fences internally
+in the kernel even for resolving hardware page faults, e.g. by using copy
+engines to clear or copy memory needed to resolve the page fault.
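+
+A minimal sketch of such an internal use in a GPU page fault handler, where
+clear_pages_on_copy_engine() stands in for whatever driver-specific mechanism
+hands back a dma_fence for the copy engine work::
+
+    struct dma_fence *fence;
+    long ret;
+
+    /* the copy engine never depends on the faulting compute job, so
+     * blocking on its fence here cannot deadlock */
+    fence = clear_pages_on_copy_engine(gpu, pages);
+    if (IS_ERR(fence))
+        return PTR_ERR(fence);
+
+    ret = dma_fence_wait(fence, false);
+    dma_fence_put(fence);
+    if (ret)
+        return ret;
+
+    /* now update the GPU page tables and resume the faulting context */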
+
+In some ways this page fault problem is a special case of the `Indefinite DMA
+Fences` discussions: Indefinite fences from compute workloads are allowed to
+depend on DMA fences, but not the other way around. And the page fault problem
+isn't even new, because some other CPU thread in userspace might hit a page
+fault which holds up a userspace fence - supporting page faults on GPUs doesn't
+add anything fundamentally new.
--
2.30.0
On Wed, Jan 27, 2021 at 01:08:05PM +0100, Thomas Zimmermann wrote:
> Hi
>
> Am 11.01.21 um 17:50 schrieb Daniel Vetter:
> > On Fri, Jan 08, 2021 at 10:43:31AM +0100, Thomas Zimmermann wrote:
> > > Implementations of the vmap/vunmap GEM callbacks may perform pinning
> > > of the BO and may acquire the associated reservation object's lock.
> > > Callers that only require a mapping of the contained memory can thus
> > > interfere with other tasks that require exact pinning, such as scanout.
> > > This is less of an issue with private SHMEM buffers, but may happen
> > > with imported ones.
> > >
> > > Therefore provide the new interfaces drm_gem_shmem_vmap_local() and
> > > drm_gem_shmem_vunmap_local(), which only perform the vmap/vunmap
> > > operations. Callers have to hold the reservation lock while the mapping
> > > persists.
> > >
> > > This patch also connects GEM SHMEM helpers to GEM object functions with
> > > equivalent functionality.
> > >
> > > v4:
> > > * call dma_buf_{vmap,vunmap}_local() where necessary (Daniel)
> > > * move driver changes into separate patches (Daniel)
> > >
> > > Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
> > > ---
> > > drivers/gpu/drm/drm_gem_shmem_helper.c | 90 +++++++++++++++++++++++---
> > > include/drm/drm_gem_shmem_helper.h | 2 +
> > > 2 files changed, 84 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > > index 9825c378dfa6..298832b2b43b 100644
> > > --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> > > +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > > @@ -32,6 +32,8 @@ static const struct drm_gem_object_funcs drm_gem_shmem_funcs = {
> > > .get_sg_table = drm_gem_shmem_get_sg_table,
> > > .vmap = drm_gem_shmem_vmap,
> > > .vunmap = drm_gem_shmem_vunmap,
> > > + .vmap_local = drm_gem_shmem_vmap_local,
> > > + .vunmap_local = drm_gem_shmem_vunmap_local,
> > > .mmap = drm_gem_shmem_mmap,
> > > };
> > > @@ -261,7 +263,8 @@ void drm_gem_shmem_unpin(struct drm_gem_object *obj)
> > > }
> > > EXPORT_SYMBOL(drm_gem_shmem_unpin);
> > > -static int drm_gem_shmem_vmap_locked(struct drm_gem_shmem_object *shmem, struct dma_buf_map *map)
> > > +static int drm_gem_shmem_vmap_locked(struct drm_gem_shmem_object *shmem, struct dma_buf_map *map,
> > > + bool local)
> >
> > This is a bit spaghetti and also has the problem that we're not changing
> > shmem->vmap_use_count under different locks, depending upon which path
> > we're taking.
> >
> > I think the cleanest would be if we pull the if (import_attach) case out
> > of the _locked() version completely, for all cases, and also outside of
> > the shmem->vmap_lock. This means no caching of vmaps in the shmem layer
> > anymore for imported buffers, but this is no longer a problem: We cache
> > them in the exporters instead (I think at least, if not maybe need to fix
> > that where it's expensive).
>
> There's no vmap refcounting in amdgpu AFAICT. So importing pages from there
> into an SHMEM object has the potential of breaking. IIRC same for radeon and
> nouveau.
As long as the pinning is refcounted I think it should be fine, it's just
that if you have multiple vmaps (e.g. 2 udl devices plugged in) we'll set
up 2 vmaps. Which is a bit pointless, but not really harmful. At least
on 64bit where there's enough virtual address space.
> So I'm somewhat reluctant to making this change. I guess I'll look elsewhere
> first to fix some of the locking issues (e.g., my recent ast cursor
> patches).
If this breaks for amdgpu/radeon/nouveau then we already have a bug,
since 2 udl devices can provoke this issue already as-is. So I don't think
this should be a blocker.
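
Btw for reference, the locking contract the new _local helpers assume looks
roughly like this sketch (only dma_resv_lock/unlock and the two helpers from
this patch are real, the rest is made up):

	struct dma_buf_map map;
	int ret;

	ret = dma_resv_lock(obj->resv, NULL);
	if (ret)
		return ret;

	ret = drm_gem_shmem_vmap_local(obj, &map);
	if (ret)
		goto out_unlock;

	/* access the buffer through map.vaddr / the dma_buf_map helpers */

	drm_gem_shmem_vunmap_local(obj, &map);

out_unlock:
	dma_resv_unlock(obj->resv);
	return ret;
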
-Daniel
>
> Best regards
> Thomas
>
> >
> > Other option would be to only pull it out for the _vmap_local case, but
> > that's a bit ugly because no longer symmetrical in the various paths.
> >
> > > {
> > > struct drm_gem_object *obj = &shmem->base;
> > > int ret = 0;
> > > @@ -272,7 +275,10 @@ static int drm_gem_shmem_vmap_locked(struct drm_gem_shmem_object *shmem, struct
> > > }
> > > if (obj->import_attach) {
> > > - ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> > > + if (local)
> > > + ret = dma_buf_vmap_local(obj->import_attach->dmabuf, map);
> > > + else
> > > + ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> > > if (!ret) {
> > > if (WARN_ON(map->is_iomem)) {
> > > ret = -EIO;
> > > @@ -313,7 +319,7 @@ static int drm_gem_shmem_vmap_locked(struct drm_gem_shmem_object *shmem, struct
> > > return ret;
> > > }
> > > -/*
> > > +/**
> > > * drm_gem_shmem_vmap - Create a virtual mapping for a shmem GEM object
> > > * @shmem: shmem GEM object
> > > * @map: Returns the kernel virtual address of the SHMEM GEM object's backing
> > > @@ -339,15 +345,53 @@ int drm_gem_shmem_vmap(struct drm_gem_object *obj, struct dma_buf_map *map)
> > > ret = mutex_lock_interruptible(&shmem->vmap_lock);
> > > if (ret)
> > > return ret;
> > > - ret = drm_gem_shmem_vmap_locked(shmem, map);
> > > + ret = drm_gem_shmem_vmap_locked(shmem, map, false);
> > > mutex_unlock(&shmem->vmap_lock);
> > > return ret;
> > > }
> > > EXPORT_SYMBOL(drm_gem_shmem_vmap);
> > > +/**
> > > + * drm_gem_shmem_vmap_local - Create a virtual mapping for a shmem GEM object
> > > + * @shmem: shmem GEM object
> > > + * @map: Returns the kernel virtual address of the SHMEM GEM object's backing
> > > + * store.
> > > + *
> > > + * This function makes sure that a contiguous kernel virtual address mapping
> > > + * exists for the buffer backing the shmem GEM object.
> > > + *
> > > + * The function is called with the BO's reservation object locked. Callers must
> > > + * hold the lock until after unmapping the buffer.
> > > + *
> > > + * This function can be used to implement &drm_gem_object_funcs.vmap_local. But
> > > + * it can also be called by drivers directly, in which case it will hide the
> > > + * differences between dma-buf imported and natively allocated objects.
> >
> > So for the other callbacks I tried to make sure we have different entry
> > points for this, since it's not really the same thing and because of the
> > locking mess we have with dma_resv_lock vs various pre-existing local
> > locking scheme, it's easy to get a mess.
> >
> > I think the super clean version here would be to also export just the
> > internal stuff for the ->v(un)map_local hooks, but that's maybe a bit too
> > much boilerplate for no real gain.
> > -Daniel
> >
> > > + *
> > > + * Acquired mappings should be cleaned up by calling drm_gem_shmem_vunmap_local().
> > > + *
> > > + * Returns:
> > > + * 0 on success or a negative error code on failure.
> > > + */
> > > +int drm_gem_shmem_vmap_local(struct drm_gem_object *obj, struct dma_buf_map *map)
> > > +{
> > > + struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
> > > + int ret;
> > > +
> > > + dma_resv_assert_held(obj->resv);
> > > +
> > > + ret = mutex_lock_interruptible(&shmem->vmap_lock);
> > > + if (ret)
> > > + return ret;
> > > + ret = drm_gem_shmem_vmap_locked(shmem, map, true);
> > > + mutex_unlock(&shmem->vmap_lock);
> > > +
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL(drm_gem_shmem_vmap_local);
> > > +
> > > static void drm_gem_shmem_vunmap_locked(struct drm_gem_shmem_object *shmem,
> > > - struct dma_buf_map *map)
> > > + struct dma_buf_map *map, bool local)
> > > {
> > > struct drm_gem_object *obj = &shmem->base;
> > > @@ -358,7 +402,10 @@ static void drm_gem_shmem_vunmap_locked(struct drm_gem_shmem_object *shmem,
> > > return;
> > > if (obj->import_attach)
> > > - dma_buf_vunmap(obj->import_attach->dmabuf, map);
> > > + if (local)
> > > + dma_buf_vunmap_local(obj->import_attach->dmabuf, map);
> > > + else
> > > + dma_buf_vunmap(obj->import_attach->dmabuf, map);
> > > else
> > > vunmap(shmem->vaddr);
> > > @@ -366,7 +413,7 @@ static void drm_gem_shmem_vunmap_locked(struct drm_gem_shmem_object *shmem,
> > > drm_gem_shmem_put_pages(shmem);
> > > }
> > > -/*
> > > +/**
> > > * drm_gem_shmem_vunmap - Unmap a virtual mapping fo a shmem GEM object
> > > * @shmem: shmem GEM object
> > > * @map: Kernel virtual address where the SHMEM GEM object was mapped
> > > @@ -384,11 +431,38 @@ void drm_gem_shmem_vunmap(struct drm_gem_object *obj, struct dma_buf_map *map)
> > > struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
> > > mutex_lock(&shmem->vmap_lock);
> > > - drm_gem_shmem_vunmap_locked(shmem, map);
> > > + drm_gem_shmem_vunmap_locked(shmem, map, false);
> > > mutex_unlock(&shmem->vmap_lock);
> > > }
> > > EXPORT_SYMBOL(drm_gem_shmem_vunmap);
> > > +/**
> > > + * drm_gem_shmem_vunmap_local - Unmap a virtual mapping fo a shmem GEM object
> > > + * @shmem: shmem GEM object
> > > + * @map: Kernel virtual address where the SHMEM GEM object was mapped
> > > + *
> > > + * This function cleans up a kernel virtual address mapping acquired by
> > > + * drm_gem_shmem_vmap_local(). The mapping is only removed when the use count
> > > + * drops to zero.
> > > + *
> > > + * The function is called with the BO's reservation object locked.
> > > + *
> > > + * This function can be used to implement &drm_gem_object_funcs.vmap_local.
> > > + * But it can also be called by drivers directly, in which case it will hide
> > > + * the differences between dma-buf imported and natively allocated objects.
> > > + */
> > > +void drm_gem_shmem_vunmap_local(struct drm_gem_object *obj, struct dma_buf_map *map)
> > > +{
> > > + struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj);
> > > +
> > > + dma_resv_assert_held(obj->resv);
> > > +
> > > + mutex_lock(&shmem->vmap_lock);
> > > + drm_gem_shmem_vunmap_locked(shmem, map, true);
> > > + mutex_unlock(&shmem->vmap_lock);
> > > +}
> > > +EXPORT_SYMBOL(drm_gem_shmem_vunmap_local);
> > > +
> > > struct drm_gem_shmem_object *
> > > drm_gem_shmem_create_with_handle(struct drm_file *file_priv,
> > > struct drm_device *dev, size_t size,
> > > diff --git a/include/drm/drm_gem_shmem_helper.h b/include/drm/drm_gem_shmem_helper.h
> > > index 434328d8a0d9..3f59bdf749aa 100644
> > > --- a/include/drm/drm_gem_shmem_helper.h
> > > +++ b/include/drm/drm_gem_shmem_helper.h
> > > @@ -114,7 +114,9 @@ void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem);
> > > int drm_gem_shmem_pin(struct drm_gem_object *obj);
> > > void drm_gem_shmem_unpin(struct drm_gem_object *obj);
> > > int drm_gem_shmem_vmap(struct drm_gem_object *obj, struct dma_buf_map *map);
> > > +int drm_gem_shmem_vmap_local(struct drm_gem_object *obj, struct dma_buf_map *map);
> > > void drm_gem_shmem_vunmap(struct drm_gem_object *obj, struct dma_buf_map *map);
> > > +void drm_gem_shmem_vunmap_local(struct drm_gem_object *obj, struct dma_buf_map *map);
> > > int drm_gem_shmem_madvise(struct drm_gem_object *obj, int madv);
> > > --
> > > 2.29.2
> > >
> >
>
> --
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Software Solutions Germany GmbH
> Maxfeldstr. 5, 90409 Nürnberg, Germany
> (HRB 36809, AG Nürnberg)
> Geschäftsführer: Felix Imendörffer
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Hi Simon,
On Thu, 28 Jan 2021 at 20:01, Simon Ser <contact(a)emersion.fr> wrote:
>
> On Thursday, January 28th, 2021 at 1:03 PM, Sumit Semwal <sumit.semwal(a)linaro.org> wrote:
>
> > Since he didn't comment on Hridya's last clarification about the
> > tracepoints to track total GPU memory allocations being orthogonal to
> > this series, I assumed he agreed with it.
>
> IIRC he's away this week. (I don't remember when he comes back.)
>
> > Daniel, do you still have objections around adding this patch in?
>
> (Adding him explicitly in CC)
Thanks for doing this!
Best,
Sumit.