Hi all,
I've dragged my feet for years on this, hoping that cross-release lockdep would do this for us, but well that never really happened unfortunately. So here we are.
Cc'ed quite a pile of people since this is about the cross-driver contract around dma_fences, which is heavily used for dma_buf. I'm also hearing more noises that rdma folks are looking into this, hence they're on cc too.
There's a bunch of different parts to this RFC:
- The annotations themselves, in the 2nd patch, after the prep patch that adds might_sleep annotations. The commit message has all the motivation for what kind of deadlocks I want to catch; best you just read it.
Since lockdep doesn't understand cross-release natively we need to cobble something together using rwlocks and a few more tricks, but from the test rollout in a few places in drm/vkms, amdgpu & i915 I think what I have now seems to actually work. Downside is that we have to explicitly annotate all code involved in eventual dma_fence signalling.
- Second important part is locking down the current dma-fence cross-driver contract, using lockdep priming like we already do for dma_resv_lock. I've just started with my own take on what we probably need to make the current code work (-ish), but both amdgpu and i915 have issues with that. So this needs some careful discussions, and also some thought on how we land it all eventually to not break lockdep completely for everyone.
The important patch for that is "dma-fence: prime lockdep annotations" plus of course the various annotations patches and driver hacks to highlight some of the issues caught.
Note that depending upon what exactly we end up deciding we might need to improve the annotations for fs_reclaim_acquire/release - for dma_fence_wait in mmu notifiers we can only allow GFP_NOWAIT (afaiui), and currently fs_reclaim_acquire/release only has a lockdep class for __GFP_FS; maybe we'd need to add another one for __GFP_DIRECT_RECLAIM in general.
- Finally there are clearly some gaps in the current dma_fence driver interfaces: Amdgpu's hang recovery is essentially impossible to fix as-is - it needs to reset the display state, and you can't get at the modeset locks from tdr without deadlock potential. i915 has an internal trick for this issue (but it stops working once we involve real cross-driver fences), and for i915 a modeset reset is only needed on very ancient gen2/3; modern hw is a lot more reasonable.
I'm kinda hoping that the annotations and priming for basic command submission and atomic modeset paths could be merged soonish, while the tdr side clearly needs a pile more work to get going. But since we have to explicitly annotate all code paths anyway, we can hide bugs in e.g. tdr code by simply not yet annotating those functions.
I'm trying to lay out at least one idea for solving the tdr issue in the patch titled "drm/scheduler: use dma-fence annotations in tdr work".
Finally, once we have some agreement on where we're going with all this, we also need some documentation. Currently that's missing because I don't want to re-edit the text all the time while we still figure out the details of the exact cross-driver semantics.
My goal here is that with this we can lock down the cross-driver contract for the last bit of the dma_buf/resv/fence story and make sure this stops being such a wobbly thing where everyone just does whatever they feel like.
Ideas, thoughts, reviews, and testing on other drivers (with the specific annotations for that driver) very much welcome.
Cheers, Daniel
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Daniel Vetter (17):
  dma-fence: add might_sleep annotation to _wait()
  dma-fence: basic lockdep annotations
  dma-fence: prime lockdep annotations
  drm/vkms: Annotate vblank timer
  drm/vblank: Annotate with dma-fence signalling section
  drm/atomic-helper: Add dma-fence annotations
  drm/amdgpu: add dma-fence annotations to atomic commit path
  drm/scheduler: use dma-fence annotations in main thread
  drm/amdgpu: use dma-fence annotations in cs_submit()
  drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
  drm/amdgpu: DC also loves to allocate stuff where it shouldn't
  drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail
  drm/scheduler: use dma-fence annotations in tdr work
  drm/amdgpu: use dma-fence annotations for gpu reset code
  Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset"
  drm/amdgpu: gpu recovery does full modesets
  drm/i915: Annotate dma_fence_work
 drivers/dma-buf/dma-fence.c                   | 56 +++++++++++++++++++
 drivers/dma-buf/dma-resv.c                    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  5 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 22 ++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c       |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/atom.c             |  2 +-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 18 +++++-
 drivers/gpu/drm/amd/display/dc/core/dc.c      |  4 +-
 drivers/gpu/drm/drm_atomic_helper.c           | 16 ++++++
 drivers/gpu/drm/drm_vblank.c                  |  8 ++-
 drivers/gpu/drm/i915/i915_sw_fence_work.c     |  3 +
 drivers/gpu/drm/scheduler/sched_main.c        | 11 ++++
 drivers/gpu/drm/vkms/vkms_crtc.c              |  8 ++-
 include/linux/dma-fence.h                     | 13 +++++
 16 files changed, 160 insertions(+), 13 deletions(-)
But only for non-zero timeout, to avoid false positives.
One question here is whether the might_sleep should be unconditional, or only for real timeouts. I'm not sure, so went with the more defensive option. But in the interest of locking down the cross-driver dma_fence rules we might want to be more aggressive.
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/dma-buf/dma-fence.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 052a41e2451c..6802125349fb 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -208,6 +208,9 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 	if (WARN_ON(timeout < 0))
 		return -EINVAL;
 
+	if (timeout > 0)
+		might_sleep();
+
 	trace_dma_fence_wait_start(fence);
 	if (fence->ops->wait)
 		ret = fence->ops->wait(fence, intr, timeout);
Quoting Daniel Vetter (2020-05-12 09:59:28)
But only for non-zero timeout, to avoid false positives.
One question here is whether the might_sleep should be unconditional, or only for real timeouts. I'm not sure, so went with the more defensive option. But in the interest of locking down the cross-driver dma_fence rules we might want to be more aggressive.
You can argue for enforcing might_sleep() as internal queries should be using dma_fence_is_signaled() and not dma_fence_wait_timeout(0).
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 052a41e2451c..6802125349fb 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -208,6 +208,9 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 	if (WARN_ON(timeout < 0))
 		return -EINVAL;
 
+	if (timeout > 0)
+		might_sleep();

might_sleep_if(timeout > 0);
-Chris
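To spell out the distinction Chris is drawing, an illustrative fragment (not from the series):

	/* a non-blocking status query should use */
	if (dma_fence_is_signaled(fence)) {
		/* handle completion */
	}

	/* instead of a zero-timeout wait like this */
	dma_fence_wait_timeout(fence, false, 0);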
On 12.05.20 10:59, Daniel Vetter wrote:
But only for non-zero timeout, to avoid false positives.
One question here is whether the might_sleep should be unconditional, or only for real timeouts. I'm not sure, so went with the more defensive option. But in the interest of locking down the cross-driver dma_fence rules we might want to be more aggressive.
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 052a41e2451c..6802125349fb 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -208,6 +208,9 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 	if (WARN_ON(timeout < 0))
 		return -EINVAL;
 
+	if (timeout > 0)
+		might_sleep();
I would rather like to see might_sleep() called here all the time even with timeout==0.
IIRC I removed the code in TTM abusing this in atomic context quite a while ago, but could be that some leaked in again or it is called in atomic context elsewhere as well.
Christian.
On 12-05-2020 11:08, Christian König wrote:
I would rather like to see might_sleep() called here all the time even with timeout==0.
IIRC I removed the code in TTM abusing this in atomic context quite a while ago, but could be that some leaked in again or it is called in atomic context elsewhere as well.
Same, glad I'm not the only one who wants it. :)
~Maarten
Design is similar to the lockdep annotations for workers, but with some twists:
- We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
- We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
- To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
- The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
- To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be irq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
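As a concrete illustration, here is a minimal sketch of such an annotated path (hypothetical driver code; my_job, its work item and done_fence are made up for illustration):

	static void my_job_complete_work(struct work_struct *work)
	{
		struct my_job *job = container_of(work, struct my_job, work);
		bool fence_cookie;

		/* job->done_fence is already published, so everything from
		 * here on is on the critical path towards signalling it */
		fence_cookie = dma_fence_begin_signalling();

		/* hw teardown, bookkeeping, ... - anything that blocks here
		 * blocks the fence, and lockdep now sees those dependencies */

		dma_fence_signal(job->done_fence);

		dma_fence_end_signalling(fence_cookie);
	}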
The main class of deadlocks this is supposed to catch looks like this:

Thread A:

	mutex_lock(A);
	mutex_unlock(A);

	dma_fence_signal();

Thread B:

	mutex_lock(A);
	dma_fence_wait();
	mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly, to avoid false positives.
v2: Handle soft/hardirq ctx better against the write side, and don't forget EXPORT_SYMBOL - drivers can't use this otherwise.
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/dma-buf/dma-fence.c | 53 +++++++++++++++++++++++++++++++++++++
 include/linux/dma-fence.h   | 12 +++++++++
 2 files changed, 65 insertions(+)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 6802125349fb..d5c0fd2efc70 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
 
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
+
+bool dma_fence_begin_signalling(void)
+{
+	/* explicitly nesting ... */
+	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
+		return true;
+
+	/* rely on might_sleep check for soft/hardirq locks */
+	if (in_atomic())
+		return true;
+
+	/* ... and non-recursive readlock */
+	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
+
+	return false;
+}
+EXPORT_SYMBOL(dma_fence_begin_signalling);
+
+void dma_fence_end_signalling(bool cookie)
+{
+	if (cookie)
+		return;
+
+	lock_release(&dma_fence_lockdep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(dma_fence_end_signalling);
+
+void __dma_fence_might_wait(void)
+{
+	bool tmp;
+
+	tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
+	if (tmp)
+		lock_release(&dma_fence_lockdep_map, _THIS_IP_);
+	lock_map_acquire(&dma_fence_lockdep_map);
+	lock_map_release(&dma_fence_lockdep_map);
+	if (tmp)
+		lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
+}
+#endif
+
+
 /**
  * dma_fence_signal_locked - signal completion of a fence
  * @fence: the fence to signal
@@ -170,14 +216,19 @@ int dma_fence_signal(struct dma_fence *fence)
 {
 	unsigned long flags;
 	int ret;
+	bool tmp;
 
 	if (!fence)
 		return -EINVAL;
 
+	tmp = dma_fence_begin_signalling();
+
 	spin_lock_irqsave(fence->lock, flags);
 	ret = dma_fence_signal_locked(fence);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	dma_fence_end_signalling(tmp);
+
 	return ret;
 }
 EXPORT_SYMBOL(dma_fence_signal);
@@ -211,6 +262,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 	if (timeout > 0)
 		might_sleep();
 
+	__dma_fence_might_wait();
+
 	trace_dma_fence_wait_start(fence);
 	if (fence->ops->wait)
 		ret = fence->ops->wait(fence, intr, timeout);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 3347c54f3a87..3f288f7db2ef 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
 	} while (1);
 }
 
+#ifdef CONFIG_LOCKDEP
+bool dma_fence_begin_signalling(void);
+void dma_fence_end_signalling(bool cookie);
+#else
+static inline bool dma_fence_begin_signalling(void)
+{
+	return true;
+}
+static inline void dma_fence_end_signalling(bool cookie) {}
+static inline void __dma_fence_might_wait(void) {}
+#endif
+
 int dma_fence_signal(struct dma_fence *fence);
 int dma_fence_signal_locked(struct dma_fence *fence);
 signed long dma_fence_default_wait(struct dma_fence *fence,
Quoting Daniel Vetter (2020-05-12 09:59:29)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 6802125349fb..d5c0fd2efc70 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
 
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
Not another false global sharing lockmap.
-Chris
On Tue, May 12, 2020 at 10:04:22AM +0100, Chris Wilson wrote:
Quoting Daniel Vetter (2020-05-12 09:59:29)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 6802125349fb..d5c0fd2efc70 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
 
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
Not another false global sharing lockmap.
It's a global contract, it needs a global lockdep map. And yes a big reason for the motivation here is that i915-gem has a tremendous urge to just redefine all these global locks to fit to some local interpretation of what's going on.
That doesn't make the resulting real & existing deadlocks go away.
-Daniel
Quoting Daniel Vetter (2020-05-12 10:08:47)
On Tue, May 12, 2020 at 10:04:22AM +0100, Chris Wilson wrote:
Quoting Daniel Vetter (2020-05-12 09:59:29)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 6802125349fb..d5c0fd2efc70 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
 
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
Not another false global sharing lockmap.
It's a global contract, it needs a global lockdep map. And yes a big reason for the motivation here is that i915-gem has a tremendous urge to just redefine all these global locks to fit to some local interpretation of what's going on.
No, you can build the global contract out of the actual contracts between fence drivers. If you introduce a struct lockdep_map *map into the fence_ops (so the fence_ops can remain const), you gain correctness at the cost of having to run through all possible interactions once. You can also then do

	ops->lockmap ?: &global_fence_lockmap

for piecemeal conversion of drivers that do not already use lockmaps for contract enforcement of their fence waits.
-Chris
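Roughly, the proposal sketched in code (hypothetical; neither the lockmap field nor the helper below exist in this series):

	struct dma_fence_ops {
		/* ... existing ops ... */
		struct lockdep_map *lockmap;
	};

	static struct lockdep_map *dma_fence_map_of(struct dma_fence *fence)
	{
		/* fall back to the global map for unconverted drivers */
		return fence->ops->lockmap ?: &global_fence_lockmap;
	}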
On Tue, May 12, 2020 at 11:19 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Quoting Daniel Vetter (2020-05-12 10:08:47)
On Tue, May 12, 2020 at 10:04:22AM +0100, Chris Wilson wrote:
Quoting Daniel Vetter (2020-05-12 09:59:29)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 6802125349fb..d5c0fd2efc70 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
 
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
Not another false global sharing lockmap.
It's a global contract, it needs a global lockdep map. And yes a big reason for the motivation here is that i915-gem has a tremendous urge to just redefine all these global locks to fit to some local interpretation of what's going on.
No, you can build the global contract out of the actual contracts between fence drivers. If you introduce a struct lockdep_map *map into the fence_ops (so the fence_ops can remain const), you gain correctness at the cost of having to run through all possible interactions once. You can also then do

	ops->lockmap ?: &global_fence_lockmap

for piecemeal conversion of drivers that do not already use lockmaps for contract enforcement of their fence waits.
I'm not quite sure whether you're actually proposing to have locking contracts per driver, since that seems rather out of ... I dunno. But if that's what you want, that just doesn't make any sense at all:
- Locking is rather core to kernel programming, aside from a few other things like hard/softirq/preempt/... disabled sections and how recursion works for these, or where and what you're allowed to allocate memory. Lockdep, might_sleep and a bunch of other such debug checks help us enforce that. If you instead go with every driver does what they please yolo, then you don't have an abstraction, all you have is smashing a rose and rose and Rose into one thing because they have the same 4 letter name. It's just an interface that can be used only when understanding every single implementation in detail - really not something that's an abstraction. Yes I've seen some of these dubious abstractions in i915, merged fairly recently, that doesn't make them a good idea.
- You need to test the full NxN matrix (yes you need to test the driver against itself in this world, since testing against something fake like vgem doesn't cut it). That's nuts. Strike that, that's impossible.
- Review is impossible, because the documentation can be summed up as "yolo". Without clear rules all review can do is check every piece of code against every other piece of code, on every change. That's impossible, because we humans are mere mortals, and we're left with push & pray engineering, which really isn't engineering at all.
The other issue with this approach is that it's the full-on platform problem in extremis. Instead of extending the shared abstraction or adding new useful functionality, i915-gem has resorted to reinterpreting rules to fix local problems. That leads to stuff like, roughly:
	if (mutex_lock_timeout(HZ*10) == -ETIME) {
		/* I guess we deadlocked, try to bail out */
	}
except it's for fences. That's neither solid engineering - we don't generally let the kernel deadlock on itself to test whether maybe it was a deadlock or not - nor is it solid upstreaming in an open source project - we fix the problems where they are, not work around them just in our own driver.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Tue, May 12, 2020 at 11:04 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Quoting Daniel Vetter (2020-05-12 09:59:29)
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 6802125349fb..d5c0fd2efc70 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
 
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
Not another false global sharing lockmap.
So in some meetings you also mentioned nesting is going to be a problem here. I see about three different kinds of nesting here, but none should be a fundamental problem:
- nesting of fence drivers, specifically the syncobj timeline fences, but there's others around dma_fence->lock. This series is about blocking deadlocks; it doesn't care about irqsave spinlocks at all. So all the nesting going on there is entirely unchanged. Validation against atomic sections relies on the might_sleep annotation in the first patch.
- nesting of callers, for better code composability. The annotations are recursive; I've tested it with amdgpu and it works. See the short sketch after this list.
- nesting of timelines, where e.g. you have some scheduler completion events that drive the scheduler logic, which eventually will also result in userspace visible fences on some context getting completed. Works for amdgpu, that's why I annotated the scheduler. Also, not a problem for two reasons:
1. uapi relevant fences are the relevant fences for the cross-driver contract. Building something outside of them with fewer constraints doesn't make sense; that would just mean we make the dma_fence cross-driver contract less strict (but then for everyone, not just for one driver, because that kind of asymmetry doesn't really work).
2. fences entirely hidden in drivers, which drive something underneath the uapi visible fences (like the scheduler or whatever). Those can be more constrained, but as long as they're driving the public fences they can't be less constrained. So the cross-driver annotations don't give you any limitations; you can still do your own driver-internal annotations to track stricter constraints.
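A short sketch of how the recursive annotations from the patch behave when nested (based on the cookie semantics in the patch):

	bool outer, inner;

	outer = dma_fence_begin_signalling();	/* acquires the read lock, cookie is false */
	inner = dma_fence_begin_signalling();	/* already held: cookie is true, no-op */

	/* ... */

	dma_fence_end_signalling(inner);	/* cookie true: releases nothing */
	dma_fence_end_signalling(outer);	/* drops the read lock */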
So I'm really not seeing the fence nesting issue here; either it's a totally different one, or I've misunderstood something.
I guess the other issue is that there's a ton of code that's broken all around in various drivers, but that's why the RFC part. I specifically highlighted that the priming patch needs some serious discussion, but "nope I don't want a cross driver contract" really isn't that.
Cheers, Daniel
On 12-05-2020 10:59, Daniel Vetter wrote:
Design is similar to the lockdep annotations for workers, but with some twists:
This is something we definitely need: all drivers need to follow the same rules, in order to put some light in the darkness. :)

Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Two in one go:

- it is allowed to call dma_fence_wait() while holding a dma_resv_lock(). This is fundamental to how eviction works with ttm, so required.

- it is allowed to call dma_fence_wait() from memory reclaim contexts, specifically from shrinker callbacks (which i915 does), and from mmu notifier callbacks (which amdgpu does, and which i915 sometimes also does, and probably always should, but that's kinda a debate). Also for stuff like HMM we really need to be able to do this, or things get real dicey.
The consequence is that any critical path necessary to get to a dma_fence_signal for a fence must never a) call dma_resv_lock nor b) allocate memory with GFP_KERNEL. Also, by implication of dma_resv_lock(), no userspace faulting is allowed. Those are some supremely obnoxious limitations, which is why we need to sprinkle the right annotations to all relevant paths.
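Spelled out in code, the kind of pattern the priming then catches looks roughly like this (hypothetical fragment inside a fence-signalling critical section, for illustration only):

	bool fence_cookie = dma_fence_begin_signalling();

	/* both of these should now produce lockdep splats, since reclaim
	 * and dma_resv_lock() get primed against the signalling map */
	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
	dma_resv_lock(resv, NULL);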
The one big locking context we're leaving out here is mmu notifiers, added in
commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Aug 26 22:14:21 2019 +0200
mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
that one covers a lot of other callsites, and it's also allowed to wait on dma-fences from mmu notifiers. But there are no ready-made functions exposed to prime this, so I've left it out for now.
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/dma-buf/dma-resv.c | 1 +
 include/linux/dma-fence.h  | 1 +
 2 files changed, 2 insertions(+)
diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 99c0a33c918d..392c336f0732 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -115,6 +115,7 @@ static int __init dma_resv_lockdep(void)
 	if (ret == -EDEADLK)
 		dma_resv_lock_slow(&obj, &ctx);
 	fs_reclaim_acquire(GFP_KERNEL);
+	__dma_fence_might_wait();
 	fs_reclaim_release(GFP_KERNEL);
 	ww_mutex_unlock(&obj.lock);
 	ww_acquire_fini(&ctx);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 3f288f7db2ef..09e23adb351d 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -360,6 +360,7 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
 #ifdef CONFIG_LOCKDEP
 bool dma_fence_begin_signalling(void);
 void dma_fence_end_signalling(bool cookie);
+void __dma_fence_might_wait(void);
 #else
 static inline bool dma_fence_begin_signalling(void)
 {
This is needed to signal the fences from page flips, so annotate it accordingly. We need to annotate the entire timer callback, since if we get stuck anywhere in there the timer stops, and hence the fences stop. Just annotating the top part that does the vblank handling isn't enough.
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
Cc: Haneen Mohammed <hamohammed.sa@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
---
 drivers/gpu/drm/vkms/vkms_crtc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/vkms/vkms_crtc.c b/drivers/gpu/drm/vkms/vkms_crtc.c
index ac85e17428f8..a53a40848a72 100644
--- a/drivers/gpu/drm/vkms/vkms_crtc.c
+++ b/drivers/gpu/drm/vkms/vkms_crtc.c
@@ -1,5 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0+
 
+#include <linux/dma-fence.h>
+
 #include <drm/drm_atomic.h>
 #include <drm/drm_atomic_helper.h>
 #include <drm/drm_probe_helper.h>
@@ -14,7 +16,9 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
 	struct drm_crtc *crtc = &output->crtc;
 	struct vkms_crtc_state *state;
 	u64 ret_overrun;
-	bool ret;
+	bool ret, fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();
 
 	ret_overrun = hrtimer_forward_now(&output->vblank_hrtimer,
 					  output->period_ns);
@@ -49,6 +53,8 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
 		DRM_DEBUG_DRIVER("Composer worker already queued\n");
 	}
 
+	dma_fence_end_signalling(fence_cookie);
+
 	return HRTIMER_RESTART;
 }
This is rather overkill, since currently all drivers call this from hardirq (or at least timers). But maybe in the future we're going to have threaded irq handlers and whatnot; it doesn't hurt to be prepared. Plus this is an easy start for sprinkling these fence annotations into shared code.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/drm_vblank.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c index 758bf74e1cab..125ef0c0f9a1 100644 --- a/drivers/gpu/drm/drm_vblank.c +++ b/drivers/gpu/drm/drm_vblank.c @@ -24,6 +24,7 @@ * OTHER DEALINGS IN THE SOFTWARE. */
+#include <linux/dma-fence.h> #include <linux/export.h> #include <linux/moduleparam.h>
@@ -1895,7 +1896,7 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe) { struct drm_vblank_crtc *vblank = &dev->vblank[pipe]; unsigned long irqflags; - bool disable_irq; + bool disable_irq, fence_cookie;
if (WARN_ON_ONCE(!dev->num_crtcs)) return false; @@ -1903,6 +1904,8 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe) if (WARN_ON(pipe >= dev->num_crtcs)) return false;
+ fence_cookie = dma_fence_begin_signalling(); + spin_lock_irqsave(&dev->event_lock, irqflags);
/* Need timestamp lock to prevent concurrent execution with @@ -1915,6 +1918,7 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe) if (!vblank->enabled) { spin_unlock(&dev->vblank_time_lock); spin_unlock_irqrestore(&dev->event_lock, irqflags); + dma_fence_end_signalling(fence_cookie); return false; }
@@ -1940,6 +1944,8 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe) if (disable_irq) vblank_disable_fn(&vblank->disable_timer);
+ dma_fence_end_signalling(fence_cookie); + return true; } EXPORT_SYMBOL(drm_handle_vblank);
This is a bit disappointing since we need to split the annotations over all the different parts.
I was considering just leaking the critical section into the ->atomic_commit_tail callback of each driver. But that would mean we need to pass the fence_cookie into each driver (there are a total of 13 implementations of this hook right now), which would be a bad flag day. And it's also a bit of a leaky abstraction.
Hence just do it function-by-function.
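The pattern in each of these functions is the same; as a sketch:

	bool fence_cookie = dma_fence_begin_signalling();

	/* everything in here is treated as the fence signalling critical
	 * path: blocking on any lock that is also held around a
	 * dma_fence_wait() gets flagged by lockdep */

	dma_fence_end_signalling(fence_cookie);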
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/drm_atomic_helper.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)
diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c index 8ac3aa068261..0a541368246e 100644 --- a/drivers/gpu/drm/drm_atomic_helper.c +++ b/drivers/gpu/drm/drm_atomic_helper.c @@ -1549,6 +1549,7 @@ EXPORT_SYMBOL(drm_atomic_helper_wait_for_flip_done); void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state) { struct drm_device *dev = old_state->dev; + bool fence_cookie = dma_fence_begin_signalling();
drm_atomic_helper_commit_modeset_disables(dev, old_state);
@@ -1560,6 +1561,8 @@ void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
drm_atomic_helper_commit_hw_done(old_state);
+ dma_fence_end_signalling(fence_cookie); + drm_atomic_helper_wait_for_vblanks(dev, old_state);
drm_atomic_helper_cleanup_planes(dev, old_state); @@ -1579,6 +1582,7 @@ EXPORT_SYMBOL(drm_atomic_helper_commit_tail); void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state) { struct drm_device *dev = old_state->dev; + bool fence_cookie = dma_fence_begin_signalling();
drm_atomic_helper_commit_modeset_disables(dev, old_state);
@@ -1591,6 +1595,8 @@ void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
drm_atomic_helper_commit_hw_done(old_state);
+ dma_fence_end_signalling(fence_cookie); + drm_atomic_helper_wait_for_vblanks(dev, old_state);
drm_atomic_helper_cleanup_planes(dev, old_state); @@ -1606,6 +1612,9 @@ static void commit_tail(struct drm_atomic_state *old_state) ktime_t start; s64 commit_time_ms; unsigned int i, new_self_refresh_mask = 0; + bool fence_cookie; + + fence_cookie = dma_fence_begin_signalling();
funcs = dev->mode_config.helper_private;
@@ -1634,6 +1643,8 @@ static void commit_tail(struct drm_atomic_state *old_state) if (new_crtc_state->self_refresh_active) new_self_refresh_mask |= BIT(i);
+ dma_fence_end_signalling(fence_cookie); + if (funcs && funcs->atomic_commit_tail) funcs->atomic_commit_tail(old_state); else @@ -1789,6 +1800,7 @@ int drm_atomic_helper_commit(struct drm_device *dev, bool nonblock) { int ret; + bool fence_cookie;
if (state->async_update) { ret = drm_atomic_helper_prepare_planes(dev, state); @@ -1811,6 +1823,8 @@ int drm_atomic_helper_commit(struct drm_device *dev, if (ret) return ret;
+ fence_cookie = dma_fence_begin_signalling(); + if (!nonblock) { ret = drm_atomic_helper_wait_for_fences(dev, state, true); if (ret) @@ -1848,6 +1862,7 @@ int drm_atomic_helper_commit(struct drm_device *dev, */
drm_atomic_state_get(state); + dma_fence_end_signalling(fence_cookie); if (nonblock) queue_work(system_unbound_wq, &state->commit_work); else @@ -1856,6 +1871,7 @@ int drm_atomic_helper_commit(struct drm_device *dev, return 0;
err: + dma_fence_end_signalling(fence_cookie); drm_atomic_helper_cleanup_planes(dev, state); return ret; }
I need a canary in a ttm-based atomic driver to make sure the dma_fence_begin/end_signalling annotations actually work.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index ea0e039a667a..4469a8c96b08 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -57,6 +57,7 @@
#include "ivsrcid/ivsrcid_vislands30.h"
+#include <linux/dma-fence.h> #include <linux/module.h> #include <linux/moduleparam.h> #include <linux/version.h> @@ -7109,6 +7110,9 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state) struct drm_connector_state *old_con_state, *new_con_state; struct dm_crtc_state *dm_old_crtc_state, *dm_new_crtc_state; int crtc_disable_count = 0; + bool fence_cookie; + + fence_cookie = dma_fence_begin_signalling();
drm_atomic_helper_update_legacy_modeset_state(dev, state);
@@ -7389,6 +7393,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state) /* Signal HW programming completion */ drm_atomic_helper_commit_hw_done(state);
+ dma_fence_end_signalling(fence_cookie); + if (wait_for_vblank) drm_atomic_helper_wait_for_flip_done(dev, state);
If the scheduler rt thread gets stuck on a mutex that we're holding while waiting for gpu workloads to complete, we have a problem.
Add dma-fence annotations so that lockdep can check this for us.
I've tried to review this quite carefully, and I think it's at the right spot. But I'm obviously no expert on the drm scheduler.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/scheduler/sched_main.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 2f319102ae9f..06a736e506ad 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -763,9 +763,12 @@ static int drm_sched_main(void *param) struct sched_param sparam = {.sched_priority = 1}; struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param; int r; + bool fence_cookie;
sched_setscheduler(current, SCHED_FIFO, &sparam);
+ fence_cookie = dma_fence_begin_signalling(); + while (!kthread_should_stop()) { struct drm_sched_entity *entity = NULL; struct drm_sched_fence *s_fence; @@ -823,6 +826,9 @@ static int drm_sched_main(void *param)
wake_up(&sched->job_scheduled); } + + dma_fence_end_signalling(fence_cookie); + return 0; }
On Tue, May 12, 2020 at 11:00 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
If the scheduler rt thread gets stuck on a mutex that we're holding while waiting for gpu workloads to complete, we have a problem.
Add dma-fence annotations so that lockdep can check this for us.
I've tried to review this quite carefully, and I think it's at the right spot. But I'm obviously no expert on the drm scheduler.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Adding a bunch more people from drivers using the drm/scheduler (so that's maintainers for etnaviv, lima, panfrost, and v3d on top of the amdgpu folks already on cc). Any takes or testing on this, and on the entire series, very much appreciated; there's also another patch to annotate the tdr work in this series. Plus ofc the prep work.
Thanks, Daniel
drivers/gpu/drm/scheduler/sched_main.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 2f319102ae9f..06a736e506ad 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -763,9 +763,12 @@ static int drm_sched_main(void *param) struct sched_param sparam = {.sched_priority = 1}; struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param; int r;
bool fence_cookie; sched_setscheduler(current, SCHED_FIFO, &sparam);
fence_cookie = dma_fence_begin_signalling();
while (!kthread_should_stop()) { struct drm_sched_entity *entity = NULL; struct drm_sched_fence *s_fence;
@@ -823,6 +826,9 @@ static int drm_sched_main(void *param)
wake_up(&sched->job_scheduled); }
dma_fence_end_signalling(fence_cookie);
return 0;
}
-- 2.26.2
This is a bit tricky: since ->notifier_lock is held while calling dma_fence_wait, we must ensure that the read side (i.e. dma_fence_begin_signalling) is also on the same side of the lock. If we mix this up lockdep complains, and that's again why we want to have these annotations.
A nice side effect of this is that, because of the fs_reclaim priming for dma_fence_enable, lockdep now automatically checks for us that nothing in here allocates memory, without even running any userptr workloads.
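E.g. a stray allocation in the annotated section now splats right away, without any reclaim pressure or userptr workload (sketch):

	mutex_lock(&p->adev->notifier_lock);
	fence_cookie = dma_fence_begin_signalling();

	/* lockdep complains here already: reclaim could recurse into our
	 * mmu notifier, which waits on the fences we're about to publish */
	tmp = kmalloc(16, GFP_KERNEL);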
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 7653f62b1b2d..6db3f3c629b0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -1213,6 +1213,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, struct amdgpu_job *job; uint64_t seq; int r; + bool fence_cookie;
job = p->job; p->job = NULL; @@ -1227,6 +1228,8 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, */ mutex_lock(&p->adev->notifier_lock);
+ fence_cookie = dma_fence_begin_signalling(); + /* If userptr are invalidated after amdgpu_cs_parser_bos(), return * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl. */ @@ -1264,12 +1267,14 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence); + dma_fence_end_signalling(fence_cookie); mutex_unlock(&p->adev->notifier_lock);
return 0;
error_abort: drm_sched_job_cleanup(&job->base); + dma_fence_end_signalling(fence_cookie); mutex_unlock(&p->adev->notifier_lock);
error_unlock:
On 12.05.20 at 10:59, Daniel Vetter wrote:
This is a bit tricky: since ->notifier_lock is held while calling dma_fence_wait, we must ensure that the read side (i.e. dma_fence_begin_signalling) is also on the same side of the lock. If we mix this up lockdep complains, and that's again why we want to have these annotations.
A nice side effect of this is that, because of the fs_reclaim priming for dma_fence_enable, lockdep now automatically checks for us that nothing in here allocates memory, without even running any userptr workloads.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 7653f62b1b2d..6db3f3c629b0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -1213,6 +1213,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, struct amdgpu_job *job; uint64_t seq; int r;
+ bool fence_cookie;
job = p->job; p->job = NULL; @@ -1227,6 +1228,8 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, */ mutex_lock(&p->adev->notifier_lock);
+ fence_cookie = dma_fence_begin_signalling();
/* If userptr are invalidated after amdgpu_cs_parser_bos(), return * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl. */
@@ -1264,12 +1267,14 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm); ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
+ dma_fence_end_signalling(fence_cookie);
Mhm, this could come earlier in theory. E.g. after pushing the job to the scheduler.
Christian.
mutex_unlock(&p->adev->notifier_lock); return 0; error_abort: drm_sched_job_cleanup(&job->base);
+ dma_fence_end_signalling(fence_cookie); mutex_unlock(&p->adev->notifier_lock);
error_unlock:
On Wed, May 13, 2020 at 9:02 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
On 12.05.20 at 10:59, Daniel Vetter wrote:
This is a bit tricky: since ->notifier_lock is held while calling dma_fence_wait, we must ensure that the read side (i.e. dma_fence_begin_signalling) is also on the same side of the lock. If we mix this up lockdep complains, and that's again why we want to have these annotations.
A nice side effect of this is that, because of the fs_reclaim priming for dma_fence_enable, lockdep now automatically checks for us that nothing in here allocates memory, without even running any userptr workloads.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 7653f62b1b2d..6db3f3c629b0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -1213,6 +1213,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, struct amdgpu_job *job; uint64_t seq; int r;
bool fence_cookie; job = p->job; p->job = NULL;
@@ -1227,6 +1228,8 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, */ mutex_lock(&p->adev->notifier_lock);
fence_cookie = dma_fence_begin_signalling();
/* If userptr are invalidated after amdgpu_cs_parser_bos(), return * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl. */
@@ -1264,12 +1267,14 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
dma_fence_end_signalling(fence_cookie);
Mhm, this could come earlier in theory. E.g. after pushing the job to the scheduler.
Yeah, I don't have much of a clue about how amdgpu works :-) In practice it doesn't matter much, since the enclosing adev->notifier_lock is a lot stricter about what it allows than the dma_fence signalling fake lock. -Daniel
Christian.
mutex_unlock(&p->adev->notifier_lock); return 0;
error_abort: drm_sched_job_cleanup(&job->base);
dma_fence_end_signalling(fence_cookie); mutex_unlock(&p->adev->notifier_lock);
error_unlock:
My dma-fence lockdep annotations caught an inversion because we allocate memory where we really shouldn't:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_fence_emit+0x30/0x330 [amdgpu] amdgpu_ib_schedule+0x306/0x550 [amdgpu] amdgpu_job_run+0x10f/0x260 [amdgpu] drm_sched_main+0x1b9/0x490 [gpu_sched] kthread+0x12e/0x150
Trouble right now is that lockdep only validates against GFP_FS, which would be good enough for shrinkers. But for mmu_notifiers we actually need to avoid __GFP_DIRECT_RECLAIM entirely, i.e. GFP_ATOMIC only, since they can be called from any page laundering, even if GFP_NOFS or GFP_NOIO are set.
I guess we should improve the lockdep annotations for fs_reclaim_acquire/release.
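Roughly, and only as a sketch (the second lockdep map here is entirely made up), the annotation side could validate against __GFP_DIRECT_RECLAIM instead of just __GFP_FS:

	/* hypothetical companion to __fs_reclaim_map */
	static struct lockdep_map __direct_reclaim_map =
		STATIC_LOCKDEP_MAP_INIT("direct_reclaim", &__direct_reclaim_map);

	if (gfp_mask & __GFP_DIRECT_RECLAIM) {
		lock_map_acquire(&__direct_reclaim_map);
		lock_map_release(&__direct_reclaim_map);
	}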
Ofc the real fix is to properly preallocate this fence and stuff it into the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of the way.
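As a sketch, with a made-up field in the job struct (amdgpu folks will know where this really belongs):

	/* at CS ioctl time, where GFP_KERNEL is perfectly fine: */
	job->hw_fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
	if (!job->hw_fence)
		return -ENOMEM;

	/* later in amdgpu_fence_emit(), on the scheduler thread: no
	 * allocation left in the fence signalling critical path */
	fence = job->hw_fence;
	job->hw_fence = NULL;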
v2: Two more allocations in scheduler paths.
First one:
__kmalloc+0x58/0x720 amdgpu_vmid_grab+0x100/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Second one:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_sync_fence+0x7e/0x110 [amdgpu] amdgpu_vmid_grab+0x86b/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index d878fe7fee51..055b47241bb1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, uint32_t seq; int r;
- fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL); + fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC); if (fence == NULL) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c index fe92dcd94d4a..fdcd6659f5ad 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm, if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait)) return amdgpu_sync_fence(sync, ring->vmid_wait, false);
- fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL); + fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC); if (!fences) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index b87ca171986a..330476cc0c86 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -168,7 +168,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f, if (amdgpu_sync_add_later(sync, f, explicit)) return 0;
- e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL); + e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC); if (!e) return -ENOMEM;
Huh, what? Off hand that doesn't look correct to me.
Why the heck should this be an atomic context? If that's correct, allocating memory is the least of the problems we have.
Regards, Christian.
On 12.05.20 at 10:59, Daniel Vetter wrote:
My dma-fence lockdep annotations caught an inversion because we allocate memory where we really shouldn't:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_fence_emit+0x30/0x330 [amdgpu] amdgpu_ib_schedule+0x306/0x550 [amdgpu] amdgpu_job_run+0x10f/0x260 [amdgpu] drm_sched_main+0x1b9/0x490 [gpu_sched] kthread+0x12e/0x150
Trouble right now is that lockdep only validates against GFP_FS, which would be good enough for shrinkers. But for mmu_notifiers we actually need to avoid __GFP_DIRECT_RECLAIM entirely, i.e. GFP_ATOMIC only, since they can be called from any page laundering, even if GFP_NOFS or GFP_NOIO are set.
I guess we should improve the lockdep annotations for fs_reclaim_acquire/release.
Ofc real fix is to properly preallocate this fence and stuff it into the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of the way.
v2: Two more allocations in scheduler paths.
First one:
__kmalloc+0x58/0x720 amdgpu_vmid_grab+0x100/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Second one:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_sync_fence+0x7e/0x110 [amdgpu] amdgpu_vmid_grab+0x86b/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index d878fe7fee51..055b47241bb1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, uint32_t seq; int r;
- fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
+ fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC); if (fence == NULL) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c index fe92dcd94d4a..fdcd6659f5ad 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm, if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait)) return amdgpu_sync_fence(sync, ring->vmid_wait, false);
- fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
+ fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC); if (!fences) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index b87ca171986a..330476cc0c86 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -168,7 +168,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f, if (amdgpu_sync_add_later(sync, f, explicit)) return 0;
- e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
+ e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC); if (!e) return -ENOMEM;
On Tue, May 12, 2020 at 5:56 PM Christian König christian.koenig@amd.com wrote:
Huh, what? Off hand that doesn't look correct to me.
It's not GFP_ATOMIC, it's just that GFP_ATOMIC is the only shotgun we have to avoid direct reclaim. And direct reclaim might need to call into your mmu notifier, which might need to wait on a fence, which is never going to happen because your scheduler is stuck.
Note that all the explanations for the deadlocks and stuff I'm trying to hunt here are in the other patches, the driver ones are more informational, so I left these here rather bare-bones to shut up lockdep so I can get through the entire driver and all major areas (scheduler, reset, modeset code).
Now you can do something like GFP_NOFS, but the only reason that works is that the direct reclaim annotations (fs_reclaim_acquire/release) only validate against __GFP_FS, and not against any of the other flags. We should probably add some lockdep annotations so that __GFP_RECLAIM is annotated against the __mmu_notifier_invalidate_range_start_map lockdep map I've recently added for mmu notifiers. End result (assuming I'm not mixing anything up here, this is all rather tricky stuff): GFP_ATOMIC is the only kind of memory allocation you can do.
Why the heck should this be an atomic context? If that's correct, allocating memory is the least of the problems we have.
It's not about atomic, it's !__GFP_RECLAIM. Which more or less is GFP_ATOMIC. Correct fix is probably GFP_ATOMIC + a mempool for the scheduler fixes, so that if you can't allocate them for some reason you at least know that your scheduler will eventually retire some of them, which you can then pick up from the mempool to guarantee forward progress.
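Sketch of that mempool idea (min_nr picked out of thin air):

	/* once at init */
	mempool_t *amdgpu_sync_pool =
		mempool_create_slab_pool(64, amdgpu_sync_slab);

	/* in the scheduler path: tries the slab with GFP_ATOMIC first,
	 * then falls back to the reserved elements; those get refilled
	 * via mempool_free() as earlier jobs retire */
	e = mempool_alloc(amdgpu_sync_pool, GFP_ATOMIC);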
But I really didn't dig into details of the code, this was just a quick hack.
So sleeping and taking all kinds of locks (but not all, e.g. dma_resv_lock and drm_modeset_lock are no-go) is still totally ok. Just think
#define GFP_NO_DIRECT_RECLAIM GFP_ATOMIC
Cheers, Daniel
Regards, Christian.
On 12.05.20 at 10:59, Daniel Vetter wrote:
My dma-fence lockdep annotations caught an inversion because we allocate memory where we really shouldn't:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_fence_emit+0x30/0x330 [amdgpu] amdgpu_ib_schedule+0x306/0x550 [amdgpu] amdgpu_job_run+0x10f/0x260 [amdgpu] drm_sched_main+0x1b9/0x490 [gpu_sched] kthread+0x12e/0x150
Trouble right now is that lockdep only validates against GFP_FS, which would be good enough for shrinkers. But for mmu_notifiers we actually need to avoid __GFP_DIRECT_RECLAIM entirely, i.e. GFP_ATOMIC only, since they can be called from any page laundering, even if GFP_NOFS or GFP_NOIO are set.
I guess we should improve the lockdep annotations for fs_reclaim_acquire/release.
Ofc real fix is to properly preallocate this fence and stuff it into the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of the way.
v2: Two more allocations in scheduler paths.
First one:
__kmalloc+0x58/0x720 amdgpu_vmid_grab+0x100/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Second one:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_sync_fence+0x7e/0x110 [amdgpu] amdgpu_vmid_grab+0x86b/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index d878fe7fee51..055b47241bb1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, uint32_t seq; int r;
- fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
+ fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC); if (fence == NULL) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c index fe92dcd94d4a..fdcd6659f5ad 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm, if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait)) return amdgpu_sync_fence(sync, ring->vmid_wait, false);
- fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
+ fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC); if (!fences) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index b87ca171986a..330476cc0c86 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -168,7 +168,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f, if (amdgpu_sync_add_later(sync, f, explicit)) return 0;
- e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
+ e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC); if (!e) return -ENOMEM;
On Tue, May 12, 2020 at 6:20 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Tue, May 12, 2020 at 5:56 PM Christian König christian.koenig@amd.com wrote:
Huh, what? Off hand that doesn't look correct to me.
It's not GFP_ATOMIC, it's just that GFP_ATOMIC is the only shotgun we have to avoid direct reclaim. And direct reclaim might need to call into your mmu notifier, which might need to wait on a fence, which is never going to happen because your scheduler is stuck.
Note that all the explanations for the deadlocks and stuff I'm trying to hunt here are in the other patches, the driver ones are more informational, so I left these here rather bare-bones to shut up lockdep so I can get through the entire driver and all major areas (scheduler, reset, modeset code).
Now you can do something like GFP_NOFS, but the only reason that works is that the direct reclaim annotations (fs_reclaim_acquire/release) only validate against __GFP_FS, and not against any of the other flags. We should probably add some lockdep annotations so that __GFP_RECLAIM is annotated against the __mmu_notifier_invalidate_range_start_map lockdep map I've recently added for mmu notifiers. End result (assuming I'm not mixing anything up here, this is all rather tricky stuff): GFP_ATOMIC is the only kind of memory allocation you can do.
Why the heck should this be an atomic context? If that's correct, allocating memory is the least of the problems we have.
It's not about atomic, it's !__GFP_RECLAIM. Which more or less is GFP_ATOMIC. Correct fix is probably GFP_ATOMIC + a mempool for the scheduler fixes, so that if you can't allocate them for some reason you at least know that your scheduler will eventually retire some of them, which you can then pick up from the mempool to guarantee forward progress.
But I really didn't dig into details of the code, this was just a quick hack.
So sleeping and taking all kinds of locks (but not all, e.g. dma_resv_lock and drm_modeset_lock are no-go) is still totally ok. Just think
#define GFP_NO_DIRECT_RECLAIM GFP_ATOMIC
Maybe slightly different take that's easier to understand: You've already made the observation that anything holding adev->notifier_lock isn't allowed to allocate memory (well GFP_ATOMIC is ok, like here).
Only thing I'm adding is that the situation is a lot worse. Plus the lockdep annotations to help us catch these issues. -Daniel
Cheers, Daniel
Regards, Christian.
On 12.05.20 at 10:59, Daniel Vetter wrote:
My dma-fence lockdep annotations caught an inversion because we allocate memory where we really shouldn't:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_fence_emit+0x30/0x330 [amdgpu] amdgpu_ib_schedule+0x306/0x550 [amdgpu] amdgpu_job_run+0x10f/0x260 [amdgpu] drm_sched_main+0x1b9/0x490 [gpu_sched] kthread+0x12e/0x150
Trouble right now is that lockdep only validates against GFP_FS, which would be good enough for shrinkers. But for mmu_notifiers we actually need to avoid __GFP_DIRECT_RECLAIM entirely, i.e. GFP_ATOMIC only, since they can be called from any page laundering, even if GFP_NOFS or GFP_NOIO are set.
I guess we should improve the lockdep annotations for fs_reclaim_acquire/release.
Ofc real fix is to properly preallocate this fence and stuff it into the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of the way.
v2: Two more allocations in scheduler paths.
First one:
__kmalloc+0x58/0x720 amdgpu_vmid_grab+0x100/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Second one:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_sync_fence+0x7e/0x110 [amdgpu] amdgpu_vmid_grab+0x86b/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index d878fe7fee51..055b47241bb1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, uint32_t seq; int r;
- fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
+ fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC); if (fence == NULL) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c index fe92dcd94d4a..fdcd6659f5ad 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm, if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait)) return amdgpu_sync_fence(sync, ring->vmid_wait, false);
- fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
+ fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC); if (!fences) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index b87ca171986a..330476cc0c86 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -168,7 +168,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f, if (amdgpu_sync_add_later(sync, f, explicit)) return 0;
- e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
+ e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC); if (!e) return -ENOMEM;
-- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Ah!
So we can't allocate memory while scheduling anything because it could be that memory reclaim is waiting for the scheduler to finish pushing things to the hardware, right?
Indeed a nice problem, I hadn't noticed that one.
Christian.
On 12.05.20 at 18:27, Daniel Vetter wrote:
On Tue, May 12, 2020 at 6:20 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Tue, May 12, 2020 at 5:56 PM Christian König christian.koenig@amd.com wrote:
Huh, what? Off hand that doesn't look correct to me.
It's not GFP_ATOMIC, it's just that GFP_ATOMIC is the only shotgun we have to avoid direct reclaim. And direct reclaim might need to call into your mmu notifier, which might need to wait on a fence, which is never going to happen because your scheduler is stuck.
Note that all the explanations for the deadlocks and stuff I'm trying to hunt here are in the other patches, the driver ones are more informational, so I left these here rather bare-bones to shut up lockdep so I can get through the entire driver and all major areas (scheduler, reset, modeset code).
Now you can do something like GFP_NOFS, but the only reason that works is that the direct reclaim annotations (fs_reclaim_acquire/release) only validate against __GFP_FS, and not against any of the other flags. We should probably add some lockdep annotations so that __GFP_RECLAIM is annotated against the __mmu_notifier_invalidate_range_start_map lockdep map I've recently added for mmu notifiers. End result (assuming I'm not mixing anything up here, this is all rather tricky stuff): GFP_ATOMIC is the only kind of memory allocation you can do.
Why the heck should this be an atomic context? If that's correct, allocating memory is the least of the problems we have.
It's not about atomic, it's !__GFP_RECLAIM. Which more or less is GFP_ATOMIC. Correct fix is probably GFP_ATOMIC + a mempool for the scheduler fixes, so that if you can't allocate them for some reason you at least know that your scheduler will eventually retire some of them, which you can then pick up from the mempool to guarantee forward progress.
But I really didn't dig into details of the code, this was just a quick hack.
So sleeping and taking all kinds of locks (but not all, e.g. dma_resv_lock and drm_modeset_lock are no-go) is still totally ok. Just think
#define GFP_NO_DIRECT_RECLAIM GFP_ATOMIC
Maybe slightly different take that's easier to understand: You've already made the observation that anything holding adev->notifier_lock isn't allowed to allocate memory (well GFP_ATOMIC is ok, like here).
Only thing I'm adding is that the situation is a lot worse. Plus the lockdep annotations to help us catch these issues. -Daniel
Cheers, Daniel
Regards, Christian.
On 12.05.20 at 10:59, Daniel Vetter wrote:
My dma-fence lockdep annotations caught an inversion because we allocate memory where we really shouldn't:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_fence_emit+0x30/0x330 [amdgpu] amdgpu_ib_schedule+0x306/0x550 [amdgpu] amdgpu_job_run+0x10f/0x260 [amdgpu] drm_sched_main+0x1b9/0x490 [gpu_sched] kthread+0x12e/0x150
Trouble right now is that lockdep only validates against GFP_FS, which would be good enough for shrinkers. But for mmu_notifiers we actually need to avoid __GFP_DIRECT_RECLAIM entirely, i.e. GFP_ATOMIC only, since they can be called from any page laundering, even if GFP_NOFS or GFP_NOIO are set.
I guess we should improve the lockdep annotations for fs_reclaim_acquire/release.
Ofc real fix is to properly preallocate this fence and stuff it into the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of the way.
v2: Two more allocations in scheduler paths.
First one:
__kmalloc+0x58/0x720 amdgpu_vmid_grab+0x100/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Second one:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_sync_fence+0x7e/0x110 [amdgpu] amdgpu_vmid_grab+0x86b/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index d878fe7fee51..055b47241bb1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, uint32_t seq; int r;
- fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
+ fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC); if (fence == NULL) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c index fe92dcd94d4a..fdcd6659f5ad 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm, if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait)) return amdgpu_sync_fence(sync, ring->vmid_wait, false);
- fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
+ fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC); if (!fences) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index b87ca171986a..330476cc0c86 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -168,7 +168,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f, if (amdgpu_sync_add_later(sync, f, explicit)) return 0;
- e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
+ e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC); if (!e) return -ENOMEM;
-- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Tue, May 12, 2020 at 7:31 PM Christian König christian.koenig@amd.com wrote:
Ah!
So we can't allocate memory while scheduling anything because it could be that memory reclaim is waiting for the scheduler to finish pushing things to the hardware, right?
Yup, that's my understanding. But like with all things kernel, I'm not sure, so I tried to come up with some annotations. One of them is the memory allocation stuff, but it also did find the modeset/dc related issues in tdr/gpu recovery, so I think overall it's fairly sound. But the memory side definitely needs more discussion (like really the entire thing I'm proposing here, hence rfc).
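Spelled out, the cycle the annotations are meant to catch looks like this:

  kmalloc(GFP_KERNEL) in drm_sched_main
    -> direct reclaim
      -> mmu notifier invalidate_range_start
        -> dma_fence_wait() on a scheduler fence
          -> needs drm_sched_main to make progress
            -> which is stuck in the kmalloc above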
My rough hope here is that first we figure out the exact current semantics and nail them down in lockdep annotations and kerneldoc. And then we need to figure out how to step-by-step land this, since lots of drivers will have smaller and bigger issues all over.
I tried to backsearch our CI for the memory allocation issue specifically, but unfortunately we're not retaining a whole lot of the full logs because there's so much of it. But the more general issue of taking locks somewhere in the path towards completing a fence (tail end of CS ioctl, scheduler, tdr, modeset code since that generates fences too for at least android, ...) that are also held while waiting for said fences to complete is fairly common. I've seen those way too often, and up to now lockdep is simply silent about them.
Indeed a nice problem, I hadn't noticed that one.
It's pretty glorious indeed :-)
Cheers, Daniel
Christian.
On 12.05.20 at 18:27, Daniel Vetter wrote:
On Tue, May 12, 2020 at 6:20 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Tue, May 12, 2020 at 5:56 PM Christian König christian.koenig@amd.com wrote:
Huh, what? Off hand that doesn't look correct to me.
It's not GFP_ATOMIC, it's just that GFP_ATOMIC is the only shotgun we have to avoid direct reclaim. And direct reclaim might need to call into your mmu notifier, which might need to wait on a fence, which is never going to happen because your scheduler is stuck.
Note that all the explanations for the deadlocks and stuff I'm trying to hunt here are in the other patches, the driver ones are more informational, so I left these here rather bare-bones to shut up lockdep so I can get through the entire driver and all major areas (scheduler, reset, modeset code).
Now you can do something like GFP_NOFS, but the only reason that works is that the direct reclaim annotations (fs_reclaim_acquire/release) only validate against __GFP_FS, and not against any of the other flags. We should probably add some lockdep annotations so that __GFP_RECLAIM is annotated against the __mmu_notifier_invalidate_range_start_map lockdep map I've recently added for mmu notifiers. End result (assuming I'm not mixing anything up here, this is all rather tricky stuff): GFP_ATOMIC is the only kind of memory allocation you can do.
Why the heck should this be an atomic context? If that's correct, allocating memory is the least of the problems we have.
It's not about atomic, it's !__GFP_RECLAIM. Which more or less is GFP_ATOMIC. Correct fix is probably GFP_ATOMIC + a mempool for the scheduler fixes, so that if you can't allocate them for some reason you at least know that your scheduler will eventually retire some of them, which you can then pick up from the mempool to guarantee forward progress.
But I really didn't dig into details of the code, this was just a quick hack.
So sleeping and taking all kinds of locks (but not all, e.g. dma_resv_lock and drm_modeset_lock are no-go) is still totally ok. Just think
#define GFP_NO_DIRECT_RECLAIM GFP_ATOMIC
Maybe slightly different take that's easier to understand: You've already made the observation that anything holding adev->notifier_lock isn't allowed to allocate memory (well GFP_ATOMIC is ok, like here).
Only thing I'm adding is that the situation is a lot worse. Plus the lockdep annotations to help us catch these issues. -Daniel
Cheers, Daniel
Regards, Christian.
On 12.05.20 at 10:59, Daniel Vetter wrote:
My dma-fence lockdep annotations caught an inversion because we allocate memory where we really shouldn't:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_fence_emit+0x30/0x330 [amdgpu] amdgpu_ib_schedule+0x306/0x550 [amdgpu] amdgpu_job_run+0x10f/0x260 [amdgpu] drm_sched_main+0x1b9/0x490 [gpu_sched] kthread+0x12e/0x150
Trouble right now is that lockdep only validates against GFP_FS, which would be good enough for shrinkers. But for mmu_notifiers we actually need to avoid __GFP_DIRECT_RECLAIM entirely, i.e. GFP_ATOMIC only, since they can be called from any page laundering, even if GFP_NOFS or GFP_NOIO are set.
I guess we should improve the lockdep annotations for fs_reclaim_acquire/release.
Ofc real fix is to properly preallocate this fence and stuff it into the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of the way.
v2: Two more allocations in scheduler paths.
First one:
__kmalloc+0x58/0x720 amdgpu_vmid_grab+0x100/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Second one:
kmem_cache_alloc+0x2b/0x6d0 amdgpu_sync_fence+0x7e/0x110 [amdgpu] amdgpu_vmid_grab+0x86b/0xca0 [amdgpu] amdgpu_job_dependency+0xf9/0x120 [amdgpu] drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched] drm_sched_main+0xf9/0x490 [gpu_sched]
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index d878fe7fee51..055b47241bb1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, uint32_t seq; int r;
- fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
+ fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC); if (fence == NULL) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c index fe92dcd94d4a..fdcd6659f5ad 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm, if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait)) return amdgpu_sync_fence(sync, ring->vmid_wait, false);
- fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
+ fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC); if (!fences) return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index b87ca171986a..330476cc0c86 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -168,7 +168,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f, if (amdgpu_sync_add_later(sync, f, explicit)) return 0;
- e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
+ e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC); if (!e) return -ENOMEM;
-- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Not going to bother with a complete & pretty commit message, just the offending backtrace:
kvmalloc_node+0x47/0x80 dc_create_state+0x1f/0x60 [amdgpu] dc_commit_state+0xcb/0x9b0 [amdgpu] amdgpu_dm_atomic_commit_tail+0xd31/0x2010 [amdgpu] commit_tail+0xa4/0x140 [drm_kms_helper] drm_atomic_helper_commit+0x152/0x180 [drm_kms_helper] drm_client_modeset_commit_atomic+0x1ea/0x250 [drm] drm_client_modeset_commit_locked+0x55/0x190 [drm] drm_client_modeset_commit+0x24/0x40 [drm]
v2: Found more in DC code, I'm just going to pile them all up.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/amd/amdgpu/atom.c | 2 +- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +- drivers/gpu/drm/amd/display/dc/core/dc.c | 4 +++- 3 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c index 4cfc786699c7..1b0c674fab25 100644 --- a/drivers/gpu/drm/amd/amdgpu/atom.c +++ b/drivers/gpu/drm/amd/amdgpu/atom.c @@ -1226,7 +1226,7 @@ static int amdgpu_atom_execute_table_locked(struct atom_context *ctx, int index, ectx.abort = false; ectx.last_jump = 0; if (ws) - ectx.ws = kcalloc(4, ws, GFP_KERNEL); + ectx.ws = kcalloc(4, ws, GFP_ATOMIC); else ectx.ws = NULL;
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 4469a8c96b08..9bfaa4cad483 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -6622,7 +6622,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, struct dc_stream_update stream_update; } *bundle;
- bundle = kzalloc(sizeof(*bundle), GFP_KERNEL); + bundle = kzalloc(sizeof(*bundle), GFP_ATOMIC);
if (!bundle) { dm_error("Failed to allocate update bundle\n"); diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c b/drivers/gpu/drm/amd/display/dc/core/dc.c index 401d1c66a411..a37a32442a5a 100644 --- a/drivers/gpu/drm/amd/display/dc/core/dc.c +++ b/drivers/gpu/drm/amd/display/dc/core/dc.c @@ -1416,8 +1416,10 @@ bool dc_post_update_surfaces_to_stream(struct dc *dc)
struct dc_state *dc_create_state(struct dc *dc) { + /* No you really can't allocate random crap here this late in + * atomic_commit_tail. */ struct dc_state *context = kvzalloc(sizeof(struct dc_state), - GFP_KERNEL); + GFP_ATOMIC);
if (!context) return NULL;
Trying to grab dma_resv_lock while in commit_tail, before we've run through all the code that leads to the eventual signalling of the vblank event (which can be a dma_fence), is deadlock-y. Don't do that.
Here the solution is easy because just grabbing locks to read something races anyway. We don't need to bother; READ_ONCE is equivalent, and avoids the locking issue.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 9bfaa4cad483..28e1af9f823c 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -6699,7 +6699,11 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, * explicitly on fences instead * and in general should be called for * blocking commit to as per framework helpers + * + * Yes, this deadlocks, since you're calling dma_resv_lock in a + * path that leads to a dma_fence_signal(). Don't do that. */ +#if 0 r = amdgpu_bo_reserve(abo, true); if (unlikely(r != 0)) DRM_ERROR("failed to reserve buffer before flip\n"); @@ -6709,6 +6713,12 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state, tmz_surface = amdgpu_bo_encrypted(abo);
amdgpu_bo_unreserve(abo); +#endif + /* + * this races anyway, so READ_ONCE isn't any better or worse + * than the stuff above. Except the stuff above can deadlock. + */ + tiling_flags = READ_ONCE(abo->tiling_flags);
fill_dc_plane_info_and_addr( dm->adev, new_plane_state, tiling_flags,
In the face of unprivileged userspace being able to submit bogus gpu workloads, the kernel needs gpu timeout and reset (tdr) to guarantee that dma_fences actually complete. Annotate this worker to make sure we don't have any accidental locking inversions or other problems lurking.
Originally this was part of the overall scheduler annotation patch. But amdgpu has some glorious inversions here:
- grabs console_lock
- does a full modeset, which grabs all kinds of locks (drm_modeset_lock, dma_resv_lock) that can deadlock with dma_fence_wait held inside them
- almost minor at that point, but the modeset code also allocates memory
These all look like they'll be very hard to fix properly, the hardware seems to require a full display reset with any gpu recovery.
Hence split out as a separate patch.
Since amdgpu isn't the only hardware driver that needs to reset the display (at least gen2/3 on intel have the same problem) we need a generic solution for this. There are two tricks we could steal from drm/i915 and lift to dma-fence:
- The big whack, aka force-complete all fences. i915 does this for all pending jobs if the reset is somehow stuck. Trouble is we'd need to do this for all fences in the entire system, and just the book-keeping for that will be fun. Plus lots of drivers use fences for all kinds of internal stuff like memory management, so unconditionally resetting all of them doesn't work.
I'm also hoping that with these fence annotations we could enlist lockdep in finding the last offenders causing deadlocks, and we could remove this get-out-of-jail trick.
- The more feasible approach (across drivers at least as part of the dma_fence contract) is what drm/i915 does for gen2/3: When we need to reset the display we wake up all dma_fence_wait_interruptible calls, or well at least the equivalent of those in i915 internally.
Relying on ioctl restart we force all other threads to release their locks, which means the tdr thread is guaranteed to be able to get them. I think we could implement this at the dma_fence level, including proper lockdep annotations.
dma_fence_begin_tdr():
- must be nested within a dma_fence_begin/end_signalling section
- will wake up all interruptible (but not the non-interruptible) dma_fence_wait() calls and force them to complete with a -ERESTARTSYS error code. All new interruptible calls to dma_fence_wait() will immediately fail with the same error code.
dma_fence_end_tdr():
- this will convert dma_fence_wait() calls back to normal.
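To make that a bit more concrete, a sketch of what such an interface could look like; none of this exists yet, and dma_fence_wake_all_interruptible_waiters() is entirely hypothetical:

static atomic_t dma_fence_tdr_count = ATOMIC_INIT(0);

void dma_fence_begin_tdr(void)
{
	/* must be nested within a dma_fence_begin/end_signalling section */
	atomic_inc(&dma_fence_tdr_count);
	/* kick interruptible waiters so they bail with -ERESTARTSYS */
	dma_fence_wake_all_interruptible_waiters();
}

void dma_fence_end_tdr(void)
{
	atomic_dec(&dma_fence_tdr_count);
}

/* interruptible waits would check this and fail fast during tdr */
long dma_fence_wait_interruptible(struct dma_fence *fence, long timeout)
{
	if (atomic_read(&dma_fence_tdr_count))
		return -ERESTARTSYS;
	return dma_fence_wait_timeout(fence, true, timeout);
}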
Of course interrupting dma_fence_wait is only ok if the caller specified that, which means we need to split the annotations into an interruptible and a non-interruptible version. If we then make sure that we only use interruptible dma_fence_wait() calls while holding drm_modeset_lock we can grab them in tdr code, and allow display resets. Doing the same for dma_resv_lock might be a lot harder, so buffer updates must be avoided.
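With that in place a tdr handler which needs to reset the display could then look roughly like this (again just a sketch with an imaginary example_* driver; drm_atomic_helper_suspend/resume are the real helpers, error handling elided):

static void example_tdr_work(struct example_device *edev)
{
	struct drm_atomic_state *state;
	bool fence_cookie;

	fence_cookie = dma_fence_begin_signalling();

	dma_fence_begin_tdr();
	/*
	 * Everyone holding modeset locks in an interruptible fence wait
	 * has been kicked out with -ERESTARTSYS, so now we can take the
	 * locks ourselves without deadlocking.
	 */
	state = drm_atomic_helper_suspend(edev->drm);
	example_reset_hw(edev);
	drm_atomic_helper_resume(edev->drm, state);
	dma_fence_end_tdr();

	dma_fence_end_signalling(fence_cookie);
}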
What's worse, we're not going to be able to make the dma_fence_wait calls in mmu-notifiers interruptible; that doesn't work. So allocating memory still won't be allowed, even in tdr sections. Plus obviously we can use this trick only in tdr, it is rather intrusive.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
---
 drivers/gpu/drm/scheduler/sched_main.c | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 06a736e506ad..e34a44376e87 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -279,9 +279,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
 {
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_job *job;
+	bool fence_cookie;

 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);

+	fence_cookie = dma_fence_begin_signalling();
+
 	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
 	spin_lock(&sched->job_list_lock);
 	job = list_first_entry_or_null(&sched->ring_mirror_list,
@@ -313,6 +316,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	spin_lock(&sched->job_list_lock);
 	drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
+
+	dma_fence_end_signalling(fence_cookie);
 }

 /**
To improve coverage also annotate the gpu reset code itself, since that's called from other places than drm/scheduler (which is already annotated). Annotations nest, so this doesn't break anything and allows easier testing.
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com
Signed-off-by: Daniel Vetter daniel.vetter@ffwll.ch
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b038ddbb2ece..5560d045b2e0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4140,6 +4140,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	bool use_baco =
 		(amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) ?
 		true : false;
+	bool fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();

 	/*
 	 * Flush RAM to disk so that after reboot
@@ -4168,6 +4171,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
 			 job ? job->base.id : -1, hive->hive_id);
 		mutex_unlock(&hive->hive_lock);
+		dma_fence_end_signalling(fence_cookie);
 		return 0;
 	}

@@ -4178,8 +4182,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 */
 	INIT_LIST_HEAD(&device_list);
 	if (adev->gmc.xgmi.num_physical_nodes > 1) {
-		if (!hive)
+		if (!hive) {
+			dma_fence_end_signalling(fence_cookie);
 			return -ENODEV;
+		}
 		if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
 			list_rotate_to_front(&adev->gmc.xgmi.head,
 					     &hive->device_list);
 		device_list_handle = &hive->device_list;
@@ -4194,6 +4200,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
 			 job ? job->base.id : -1);
 		mutex_unlock(&hive->hive_lock);
+		dma_fence_end_signalling(fence_cookie);
 		return 0;
 	}

@@ -4319,6 +4326,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

 	if (r)
 		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
+	dma_fence_end_signalling(fence_cookie);
 	return r;
 }
This is one from the department of "maybe play lottery if you hit this, karma compensation might work". Or at least lockdep ftw!
This reverts commit 565d1941557756a584ac357d945bc374d5fcd1d0.
It's not quite as low-risk as the commit message claims, because this grabs console_lock, which might be held by a thread that's allocating memory; that allocation might never finish, because reclaim can end up in a dma_fence_wait() that's stuck waiting on our gpu reset:
[ 136.763714] ======================================================
[ 136.763714] WARNING: possible circular locking dependency detected
[ 136.763715] 5.7.0-rc3+ #346 Tainted: G W
[ 136.763716] ------------------------------------------------------
[ 136.763716] kworker/2:3/682 is trying to acquire lock:
[ 136.763716] ffffffff8226f140 (console_lock){+.+.}-{0:0}, at: drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[ 136.763723] but task is already holding lock:
[ 136.763724] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[ 136.763726] which lock already depends on the new lock.
[ 136.763726] the existing dependency chain (in reverse order) is:
[ 136.763727] -> #2 (dma_fence_map){++++}-{0:0}:
[ 136.763730] __dma_fence_might_wait+0x41/0xb0
[ 136.763732] dma_resv_lockdep+0x171/0x202
[ 136.763734] do_one_initcall+0x5d/0x2f0
[ 136.763736] kernel_init_freeable+0x20d/0x26d
[ 136.763738] kernel_init+0xa/0xfb
[ 136.763740] ret_from_fork+0x27/0x50
[ 136.763740] -> #1 (fs_reclaim){+.+.}-{0:0}:
[ 136.763743] fs_reclaim_acquire.part.0+0x25/0x30
[ 136.763745] kmem_cache_alloc_trace+0x2e/0x6e0
[ 136.763747] device_create_groups_vargs+0x52/0xf0
[ 136.763747] device_create+0x49/0x60
[ 136.763749] fb_console_init+0x25/0x145
[ 136.763750] fbmem_init+0xcc/0xe2
[ 136.763750] do_one_initcall+0x5d/0x2f0
[ 136.763751] kernel_init_freeable+0x20d/0x26d
[ 136.763752] kernel_init+0xa/0xfb
[ 136.763753] ret_from_fork+0x27/0x50
[ 136.763753] -> #0 (console_lock){+.+.}-{0:0}:
[ 136.763755] __lock_acquire+0x1241/0x23f0
[ 136.763756] lock_acquire+0xad/0x370
[ 136.763757] console_lock+0x47/0x70
[ 136.763761] drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[ 136.763809] amdgpu_device_gpu_recover.cold+0x21e/0xe7b [amdgpu]
[ 136.763850] amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[ 136.763851] drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[ 136.763852] process_one_work+0x23c/0x580
[ 136.763853] worker_thread+0x50/0x3b0
[ 136.763854] kthread+0x12e/0x150
[ 136.763855] ret_from_fork+0x27/0x50
[ 136.763855] other info that might help us debug this:
[ 136.763856] Chain exists of: console_lock --> fs_reclaim --> dma_fence_map
[ 136.763857] Possible unsafe locking scenario:
[ 136.763857]        CPU0                    CPU1
[ 136.763857]        ----                    ----
[ 136.763857]   lock(dma_fence_map);
[ 136.763858]                                lock(fs_reclaim);
[ 136.763858]                                lock(dma_fence_map);
[ 136.763858]   lock(console_lock);
[ 136.763859] *** DEADLOCK ***
[ 136.763860] 4 locks held by kworker/2:3/682:
[ 136.763860] #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[ 136.763862] #1: ffffc90000cafe58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[ 136.763863] #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[ 136.763865] #3: ffff8887ab621748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x5ab/0xe7b [amdgpu]
[ 136.763914] stack backtrace:
[ 136.763915] CPU: 2 PID: 682 Comm: kworker/2:3 Tainted: G W 5.7.0-rc3+ #346
[ 136.763916] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
[ 136.763918] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 136.763919] Call Trace:
[ 136.763922] dump_stack+0x8f/0xd0
[ 136.763924] check_noncircular+0x162/0x180
[ 136.763926] __lock_acquire+0x1241/0x23f0
[ 136.763927] lock_acquire+0xad/0x370
[ 136.763932] ? drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[ 136.763933] ? mark_held_locks+0x2d/0x80
[ 136.763934] ? _raw_spin_unlock_irqrestore+0x46/0x60
[ 136.763936] console_lock+0x47/0x70
[ 136.763940] ? drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[ 136.763944] drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[ 136.763993] amdgpu_device_gpu_recover.cold+0x21e/0xe7b [amdgpu]
[ 136.764036] amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[ 136.764038] drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[ 136.764040] process_one_work+0x23c/0x580
[ 136.764041] worker_thread+0x50/0x3b0
[ 136.764042] ? process_one_work+0x580/0x580
[ 136.764044] kthread+0x12e/0x150
[ 136.764045] ? kthread_create_worker_on_cpu+0x70/0x70
[ 136.764046] ret_from_fork+0x27/0x50
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ----
 1 file changed, 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5560d045b2e0..3584e29323c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4047,8 +4047,6 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 				if (r)
 					goto out;

-				amdgpu_fbdev_set_suspend(tmp_adev, 0);
-
 				/* must succeed. */
 				amdgpu_ras_resume(tmp_adev);

@@ -4217,8 +4215,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		 */
 		amdgpu_unregister_gpu_instance(tmp_adev);

-		amdgpu_fbdev_set_suspend(tmp_adev, 1);
-
 		/* disable ras on ALL IPs */
 		if (!(in_ras_intr && !use_baco) &&
 		      amdgpu_device_ip_need_full_reset(tmp_adev))
...
I think it's time to stop this little exercise.
The lockdep splat, for the record:
[ 132.583381] ======================================================
[ 132.584091] WARNING: possible circular locking dependency detected
[ 132.584775] 5.7.0-rc3+ #346 Tainted: G W
[ 132.585461] ------------------------------------------------------
[ 132.586184] kworker/2:3/865 is trying to acquire lock:
[ 132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[ 132.587569] but task is already holding lock:
[ 132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[ 132.589803] which lock already depends on the new lock.
[ 132.592009] the existing dependency chain (in reverse order) is:
[ 132.593507] -> #2 (dma_fence_map){++++}-{0:0}:
[ 132.595019] dma_fence_begin_signalling+0x50/0x60
[ 132.595767] drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
[ 132.596567] drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
[ 132.597420] drm_client_modeset_commit_locked+0x55/0x190 [drm]
[ 132.598178] drm_client_modeset_commit+0x24/0x40 [drm]
[ 132.598948] drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
[ 132.599738] drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
[ 132.600539] fbcon_init+0x2e8/0x660
[ 132.601344] visual_init+0xce/0x130
[ 132.602156] do_bind_con_driver+0x1bc/0x2b0
[ 132.602970] do_take_over_console+0x115/0x180
[ 132.603763] do_fbcon_takeover+0x58/0xb0
[ 132.604564] register_framebuffer+0x1ee/0x300
[ 132.605369] __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
[ 132.606187] amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
[ 132.607032] amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
[ 132.607862] amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
[ 132.608697] amdgpu_pci_probe+0xf7/0x180 [amdgpu]
[ 132.609511] local_pci_probe+0x42/0x80
[ 132.610324] pci_device_probe+0x104/0x1a0
[ 132.611130] really_probe+0x147/0x3c0
[ 132.611939] driver_probe_device+0xb6/0x100
[ 132.612766] device_driver_attach+0x53/0x60
[ 132.613593] __driver_attach+0x8c/0x150
[ 132.614419] bus_for_each_dev+0x7b/0xc0
[ 132.615249] bus_add_driver+0x14c/0x1f0
[ 132.616071] driver_register+0x6c/0xc0
[ 132.616902] do_one_initcall+0x5d/0x2f0
[ 132.617731] do_init_module+0x5c/0x230
[ 132.618560] load_module+0x2981/0x2bc0
[ 132.619391] __do_sys_finit_module+0xaa/0x110
[ 132.620228] do_syscall_64+0x5a/0x250
[ 132.621064] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 132.621903] -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
[ 132.623587] __ww_mutex_lock.constprop.0+0xcc/0x10c0
[ 132.624448] ww_mutex_lock+0x43/0xb0
[ 132.625315] drm_modeset_lock+0x44/0x120 [drm]
[ 132.626184] drmm_mode_config_init+0x2db/0x8b0 [drm]
[ 132.627098] amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
[ 132.628007] amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
[ 132.628920] amdgpu_pci_probe+0xf7/0x180 [amdgpu]
[ 132.629804] local_pci_probe+0x42/0x80
[ 132.630690] pci_device_probe+0x104/0x1a0
[ 132.631583] really_probe+0x147/0x3c0
[ 132.632479] driver_probe_device+0xb6/0x100
[ 132.633379] device_driver_attach+0x53/0x60
[ 132.634275] __driver_attach+0x8c/0x150
[ 132.635170] bus_for_each_dev+0x7b/0xc0
[ 132.636069] bus_add_driver+0x14c/0x1f0
[ 132.636974] driver_register+0x6c/0xc0
[ 132.637870] do_one_initcall+0x5d/0x2f0
[ 132.638765] do_init_module+0x5c/0x230
[ 132.639654] load_module+0x2981/0x2bc0
[ 132.640522] __do_sys_finit_module+0xaa/0x110
[ 132.641372] do_syscall_64+0x5a/0x250
[ 132.642203] entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 132.643022] -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
[ 132.644643] __lock_acquire+0x1241/0x23f0
[ 132.645469] lock_acquire+0xad/0x370
[ 132.646274] drm_modeset_acquire_init+0xd2/0x100 [drm]
[ 132.647071] drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[ 132.647902] dm_suspend+0x1c/0x60 [amdgpu]
[ 132.648698] amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
[ 132.649498] amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[ 132.650300] amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
[ 132.651084] amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[ 132.651825] drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[ 132.652594] process_one_work+0x23c/0x580
[ 132.653402] worker_thread+0x50/0x3b0
[ 132.654139] kthread+0x12e/0x150
[ 132.654868] ret_from_fork+0x27/0x50
[ 132.655598] other info that might help us debug this:
[ 132.657739] Chain exists of: crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
[ 132.659877] Possible unsafe locking scenario:
[ 132.661416]        CPU0                    CPU1
[ 132.662126]        ----                    ----
[ 132.662847]   lock(dma_fence_map);
[ 132.663574]                                lock(crtc_ww_class_mutex);
[ 132.664319]                                lock(dma_fence_map);
[ 132.665063]   lock(crtc_ww_class_acquire);
[ 132.665799] *** DEADLOCK ***
[ 132.667965] 4 locks held by kworker/2:3/865:
[ 132.668701] #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[ 132.669462] #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[ 132.670242] #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[ 132.671039] #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
[ 132.671902] stack backtrace:
[ 132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G W 5.7.0-rc3+ #346
[ 132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
[ 132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 132.676046] Call Trace:
[ 132.676897] dump_stack+0x8f/0xd0
[ 132.677748] check_noncircular+0x162/0x180
[ 132.678604] ? stack_trace_save+0x4b/0x70
[ 132.679459] __lock_acquire+0x1241/0x23f0
[ 132.680311] lock_acquire+0xad/0x370
[ 132.681163] ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[ 132.682021] ? cpumask_next+0x16/0x20
[ 132.682880] ? module_assert_mutex_or_preempt+0x14/0x40
[ 132.683737] ? __module_address+0x28/0xf0
[ 132.684601] drm_modeset_acquire_init+0xd2/0x100 [drm]
[ 132.685466] ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[ 132.686335] drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[ 132.687255] dm_suspend+0x1c/0x60 [amdgpu]
[ 132.688152] amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
[ 132.689057] ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
[ 132.689963] amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[ 132.690893] amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
[ 132.691818] amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[ 132.692707] drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[ 132.693597] process_one_work+0x23c/0x580
[ 132.694487] worker_thread+0x50/0x3b0
[ 132.695373] ? process_one_work+0x580/0x580
[ 132.696264] kthread+0x12e/0x150
[ 132.697154] ? kthread_create_worker_on_cpu+0x70/0x70
[ 132.698057] ret_from_fork+0x27/0x50
Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3584e29323c0..b3b84a0d3baf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
 		/* displays are handled separately */
 		if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
 			/* XXX handle errors */
+
+			/*
+			 * This is dm_suspend, which grabs modeset locks, and
+			 * that's a pretty good inversion against dma_fence_signal,
+			 * which gpu recovery is supposed to guarantee.
+			 *
+			 * Don't ask me how to fix this.
+			 */
 			r = adev->ip_blocks[i].version->funcs->suspend(adev);
 			/* XXX handle errors */
 			if (r) {
On Tue, May 12, 2020 at 5:00 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
...
/*
 * This is dm_suspend, which grabs modeset locks, and
 * that's a pretty good inversion against dma_fence_signal,
 * which gpu recovery is supposed to guarantee.
 *
 * Don't ask me how to fix this.
 */
We actually have a fix for this. Will be out shortly.
Alex
On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
On Tue, May 12, 2020 at 5:00 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
...
We actually have a fix for this. Will be out shortly.
Spoilers? The solid way is to sidestep the entire thing by avoiding resetting the display block entirely. Fixing the locking while still resetting the display is going to be really hard otoh ... -Daniel
On Tue, May 12, 2020 at 8:58 AM Daniel Vetter daniel@ffwll.ch wrote:
On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
On Tue, May 12, 2020 at 5:00 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
...
We actually have a fix for this. Will be out shortly.
Spoilers? The solid way is to sidestep the entire thing by avoiding resetting the display block entirely. Fixing the locking while still resetting the display is going to be really hard otoh ...
There's no way to avoid that. On dGPUs at least, a full asic reset is a full asic reset. The fix mostly just skips the modeset and does the minimum amount necessary to get the display block into a good state for reset.
Alex
On Tue, May 12, 2020 at 3:12 PM Alex Deucher alexdeucher@gmail.com wrote:
On Tue, May 12, 2020 at 8:58 AM Daniel Vetter daniel@ffwll.ch wrote:
On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
On Tue, May 12, 2020 at 5:00 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
...
We actually have a fix for this. Will be out shortly.
Spoilers? The solid way is to sidestep the entire thing by avoiding resetting the display block entirely. Fixing the locking while still resetting the display is going to be really hard otoh ...
There's no way to avoid that. On dGPUs at least, a full asic reset is a full asic reset. The fix mostly just skips the modeset and does the minimum amount necessary to get the display block into a good state for reset.
But how do you restore the display afterwards? "[RFC 13/17] drm/scheduler: use dma-fence annotations in tdr work" earlier in the series has some ideas from me for at least some of the problems for tdr when the display gets reset along with the gpu. Whacking the display while a modeset/flip/whatever is ongoing concurrently doesn't sound like a good idea, so I'm not sure how you can do that without taking the drm_modeset_locks. And once you do that, it's deadlock time. -Daniel
On Tue, May 12, 2020 at 9:17 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
On Tue, May 12, 2020 at 3:12 PM Alex Deucher alexdeucher@gmail.com wrote:
On Tue, May 12, 2020 at 8:58 AM Daniel Vetter daniel@ffwll.ch wrote:
On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
On Tue, May 12, 2020 at 5:00 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
...
I think it's time to stop this little exercise.
The lockdep splat, for the record:
[ 132.583381] ======================================================
[ 132.584091] WARNING: possible circular locking dependency detected
[ 132.584775] 5.7.0-rc3+ #346 Tainted: G        W
[ 132.585461] ------------------------------------------------------
[ 132.586184] kworker/2:3/865 is trying to acquire lock:
[ 132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[ 132.587569] but task is already holding lock:
[ 132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[ 132.589803] which lock already depends on the new lock.
[ 132.592009] the existing dependency chain (in reverse order) is:
[ 132.593507] -> #2 (dma_fence_map){++++}-{0:0}:
[ 132.595019]        dma_fence_begin_signalling+0x50/0x60
[ 132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
[ 132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
[ 132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
[ 132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
[ 132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
[ 132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
[ 132.600539]        fbcon_init+0x2e8/0x660
[ 132.601344]        visual_init+0xce/0x130
[ 132.602156]        do_bind_con_driver+0x1bc/0x2b0
[ 132.602970]        do_take_over_console+0x115/0x180
[ 132.603763]        do_fbcon_takeover+0x58/0xb0
[ 132.604564]        register_framebuffer+0x1ee/0x300
[ 132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
[ 132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
[ 132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
[ 132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
[ 132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
[ 132.609511]        local_pci_probe+0x42/0x80
[ 132.610324]        pci_device_probe+0x104/0x1a0
[ 132.611130]        really_probe+0x147/0x3c0
[ 132.611939]        driver_probe_device+0xb6/0x100
[ 132.612766]        device_driver_attach+0x53/0x60
[ 132.613593]        __driver_attach+0x8c/0x150
[ 132.614419]        bus_for_each_dev+0x7b/0xc0
[ 132.615249]        bus_add_driver+0x14c/0x1f0
[ 132.616071]        driver_register+0x6c/0xc0
[ 132.616902]        do_one_initcall+0x5d/0x2f0
[ 132.617731]        do_init_module+0x5c/0x230
[ 132.618560]        load_module+0x2981/0x2bc0
[ 132.619391]        __do_sys_finit_module+0xaa/0x110
[ 132.620228]        do_syscall_64+0x5a/0x250
[ 132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 132.621903] -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
[ 132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
[ 132.624448]        ww_mutex_lock+0x43/0xb0
[ 132.625315]        drm_modeset_lock+0x44/0x120 [drm]
[ 132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
[ 132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
[ 132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
[ 132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
[ 132.629804]        local_pci_probe+0x42/0x80
[ 132.630690]        pci_device_probe+0x104/0x1a0
[ 132.631583]        really_probe+0x147/0x3c0
[ 132.632479]        driver_probe_device+0xb6/0x100
[ 132.633379]        device_driver_attach+0x53/0x60
[ 132.634275]        __driver_attach+0x8c/0x150
[ 132.635170]        bus_for_each_dev+0x7b/0xc0
[ 132.636069]        bus_add_driver+0x14c/0x1f0
[ 132.636974]        driver_register+0x6c/0xc0
[ 132.637870]        do_one_initcall+0x5d/0x2f0
[ 132.638765]        do_init_module+0x5c/0x230
[ 132.639654]        load_module+0x2981/0x2bc0
[ 132.640522]        __do_sys_finit_module+0xaa/0x110
[ 132.641372]        do_syscall_64+0x5a/0x250
[ 132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 132.643022] -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
[ 132.644643]        __lock_acquire+0x1241/0x23f0
[ 132.645469]        lock_acquire+0xad/0x370
[ 132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
[ 132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[ 132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
[ 132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
[ 132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[ 132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
[ 132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[ 132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[ 132.652594]        process_one_work+0x23c/0x580
[ 132.653402]        worker_thread+0x50/0x3b0
[ 132.654139]        kthread+0x12e/0x150
[ 132.654868]        ret_from_fork+0x27/0x50
[ 132.655598] other info that might help us debug this:
[ 132.657739] Chain exists of: crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
[ 132.659877] Possible unsafe locking scenario:
[ 132.661416]        CPU0                    CPU1
[ 132.662126]        ----                    ----
[ 132.662847]   lock(dma_fence_map);
[ 132.663574]                                lock(crtc_ww_class_mutex);
[ 132.664319]                                lock(dma_fence_map);
[ 132.665063]   lock(crtc_ww_class_acquire);
[ 132.665799]  *** DEADLOCK ***
[ 132.667965] 4 locks held by kworker/2:3/865:
[ 132.668701] #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[ 132.669462] #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[ 132.670242] #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[ 132.671039] #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
[ 132.671902] stack backtrace:
[ 132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W 5.7.0-rc3+ #346
[ 132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
[ 132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 132.676046] Call Trace:
[ 132.676897]  dump_stack+0x8f/0xd0
[ 132.677748]  check_noncircular+0x162/0x180
[ 132.678604]  ? stack_trace_save+0x4b/0x70
[ 132.679459]  __lock_acquire+0x1241/0x23f0
[ 132.680311]  lock_acquire+0xad/0x370
[ 132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[ 132.682021]  ? cpumask_next+0x16/0x20
[ 132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
[ 132.683737]  ? __module_address+0x28/0xf0
[ 132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
[ 132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[ 132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[ 132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
[ 132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
[ 132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
[ 132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[ 132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
[ 132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[ 132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[ 132.693597]  process_one_work+0x23c/0x580
[ 132.694487]  worker_thread+0x50/0x3b0
[ 132.695373]  ? process_one_work+0x580/0x580
[ 132.696264]  kthread+0x12e/0x150
[ 132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 132.698057]  ret_from_fork+0x27/0x50
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson chris@chris-wilson.co.uk
Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com
Cc: Christian König christian.koenig@amd.com
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3584e29323c0..b3b84a0d3baf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
 		/* displays are handled separately */
 		if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
 			/* XXX handle errors */
+
+			/*
+			 * This is dm_suspend, which takes modeset locks, and
+			 * that's a pretty good inversion against dma_fence_signal,
+			 * which gpu recovery is supposed to guarantee.
+			 *
+			 * Don't ask me how to fix this.
+			 */
We actually have a fix for this. Will be out shortly.
Spoilers? The solid way is to sidestep the entire thing by avoiding resetting the display block entirely. Fixing the locking while still resetting the display is going to be really hard otoh ...
There's no way to avoid that. On dGPUs at least, a full asic reset is a full asic reset. The fix mostly just skips the modeset and does the minimum amount necessary to get the display block into a good state for reset.
But how do you restore the display afterwards? "[RFC 13/17] drm/scheduler: use dma-fence annotations in tdr work" earlier in the series has some ideas from me for at least some of the problems for tdr when the display gets reset along with everything else. Whacking the display while a modeset/flip/whatever is ongoing concurrently doesn't sound like a good idea, so I'm not sure how you can do that without taking the drm_modeset_locks. And once you do that, it's deadlock time.
We cache the current display hw state and restore it after the reset without going through the atomic interfaces, so everything is back the way it was before the reset.

IIRC, when we reset the rest of the GPU, we disconnect the fences, and then re-attach them after a reset.
Alex
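As a rough sketch, and purely for illustration, the flow Alex describes might look something like this; dc_cache_hw_state() and dc_restore_hw_state() are made-up names, not the actual amdgpu/DC functions:

/* Hypothetical sketch of the cache-and-restore-around-reset idea. */
static void asic_reset_with_display(struct amdgpu_device *adev)
{
	struct dc_hw_state *snapshot;

	/* Snapshot the display hw state directly, without going through
	 * atomic commits, so no drm_modeset_lock is taken.
	 */
	snapshot = dc_cache_hw_state(adev);

	do_asic_reset(adev); /* the full asic reset, display included */

	/* Re-program the display block from the snapshot, again
	 * bypassing the atomic interfaces.
	 */
	dc_restore_hw_state(adev, snapshot);
}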
On Tue, May 12, 2020 at 3:29 PM Alex Deucher alexdeucher@gmail.com wrote:
[snip]

> We cache the current display hw state and restore it after the reset without going through the atomic interfaces so everything is back the way it was before the reset.
Hm this sounds interesting ... how do you make sure a concurrent atomic update doesn't trample over the same mmio registers while you do that dance?
> IIRC, when we reset the rest of the GPU, we disconnect the fences, and then re-attach them after a reset.
Where is that code? Since I'm not sure how you can make that work without getting stuck in another kind of deadlock in tdr. But maybe the code has some clever trick to pull that off somehow. -Daniel
On Tue, May 12, 2020 at 9:45 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
[snip]

> Hm this sounds interesting ... how do you make sure a concurrent atomic update doesn't trample over the same mmio registers while you do that dance?
We take the dm->dc_lock across the reset.
> Where is that code? Since I'm not sure how you can make that work without getting stuck in another kind of deadlock in tdr. But maybe the code has some clever trick to pull that off somehow.
The GPU scheduler. drm_sched_stop, drm_sched_resubmit_jobs, and drm_sched_start.
Alex
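For reference, the scheduler entry points Alex mentions are used roughly like this in a driver recovery path. This is a simplified single-scheduler sketch with a placeholder hw reset; real drivers iterate over all rings and handle the guilty-job bookkeeping:

void driver_gpu_recover(struct drm_gpu_scheduler *sched,
			struct drm_sched_job *bad_job)
{
	/* Park the scheduler and disconnect the hw fence callbacks from
	 * pending jobs, so the reset can't race with job completion.
	 */
	drm_sched_stop(sched, bad_job);

	driver_hw_reset(); /* placeholder for the actual asic reset */

	/* Recreate hw fences for the still-pending jobs ... */
	drm_sched_resubmit_jobs(sched);

	/* ... and unpark the scheduler, re-arming the timeout. */
	drm_sched_start(sched, true);
}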
On Tue, May 12, 2020 at 4:24 PM Alex Deucher alexdeucher@gmail.com wrote:
[snip]

> We take the dm->dc_lock across the reset.
Ok, if that's an innermost lock and you don't do any dma_fence_wait() while holding it (or anything that somehow depends upon it again), I think that should work. From a quick look at the current code in drm-next that seems to be the case. But it would be good to check with my annotations whether everything is placed correctly (or maybe there's a bug in my annotations).
I still think something like I described in the drm/scheduler patch, which would allow us to take drm_modeset_locks in the tdr path, would be a cleaner and more robust solution longer term. Forcing drivers to do their own modeset state save/restore dance just doesn't feel that awesome ... I guess that also depends upon how many other drivers have this problem.
> The GPU scheduler. drm_sched_stop, drm_sched_resubmit_jobs, and drm_sched_start.
That seems to just be about the scheduler-internal fences. That tracking you kinda have to throw away and restart with a reset. But I don't think these are the fences visible to other places, like in dma_resv or drm_syncobj or wherever. -Daniel
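A minimal sketch of what the suggested annotation check could look like around such a reset path, assuming the dma_fence_begin/end_signalling() helpers from this series; do_reset_and_restore() is a placeholder:

	bool fence_cookie;

	fence_cookie = dma_fence_begin_signalling();

	mutex_lock(&dm->dc_lock);
	/* hw state save, asic reset, hw state restore; anything in here
	 * that waits on a dma_fence, or allocates with GFP_KERNEL, will
	 * now show up in lockdep as a dependency against dma_fence_map
	 */
	do_reset_and_restore();
	mutex_unlock(&dm->dc_lock);

	dma_fence_end_signalling(fence_cookie);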
i915 does tons of allocations from this worker, which lockdep catches.
Also, generic infrastructure like this, with big potential impact on how dma_fence and other cross-driver contracts work, really should be reviewed on dri-devel. Implementing custom wheels for everything within the driver is a classic case of the "platform problem" [1], which in upstream we really shouldn't have.
Since there's no quick way to solve these splats (dma_fence_work is used a bunch in basic buffer management and command submission), I'm giving up at this point here, like for amdgpu. Annotating the i915 scheduler and gpu reset code would be interesting, but since lockdep is one-shot we can't see what surprises would lurk there.
1: https://lwn.net/Articles/443531/

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson chris@chris-wilson.co.uk
Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com
Cc: Christian König christian.koenig@amd.com
Signed-off-by: Daniel Vetter daniel.vetter@intel.com
---
 drivers/gpu/drm/i915/i915_sw_fence_work.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/i915/i915_sw_fence_work.c b/drivers/gpu/drm/i915/i915_sw_fence_work.c
index a3a81bb8f2c3..5b74acadaef5 100644
--- a/drivers/gpu/drm/i915/i915_sw_fence_work.c
+++ b/drivers/gpu/drm/i915/i915_sw_fence_work.c
@@ -17,12 +17,15 @@ static void fence_work(struct work_struct *work)
 {
 	struct dma_fence_work *f = container_of(work, typeof(*f), work);
 	int err;
+	bool fence_cookie;
 
+	fence_cookie = dma_fence_begin_signalling();
 	err = f->ops->work(f);
 	if (err)
 		dma_fence_set_error(&f->dma, err);
 
 	fence_complete(f);
+	dma_fence_end_signalling(fence_cookie);
 	dma_fence_put(&f->dma);
 }
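For illustration, the kind of pattern in such a worker that these annotations catch looks schematically like this; bad_work_example() is not actual i915 code:

static void bad_work_example(struct work_struct *work)
{
	bool fence_cookie;
	void *tmp;

	fence_cookie = dma_fence_begin_signalling();

	/* GFP_KERNEL can recurse into direct reclaim, and reclaim may in
	 * turn wait on dma_fences; that's a potential deadlock against
	 * the fence this worker is supposed to signal, which lockdep now
	 * sees through the dma_fence_map pseudo-lock.
	 */
	tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);
	kfree(tmp);

	dma_fence_end_signalling(fence_cookie);
}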