Quoting Daniel Stone (2020-06-11 10:01:46)
> Hi,
>
> On Thu, 11 Jun 2020 at 09:44, Dave Airlie <airlied@gmail.com> wrote:
> > On Thu, 11 Jun 2020 at 18:01, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > > Introducing a global lockmap that cannot capture the rules correctly,
> > Can you document the rules all drivers should be following then? From here it looks like they get refactored with every version of i915, and it would be nice if we could all aim for roughly the same set of things. We've already had enough problems with amdgpu vs i915 vs everyone else over fences; if this stops that in the future, then I'd rather have it than unwritten, untestable per-driver rules.
> As someone who has sunk a bunch of work into explicit-fencing awareness in my compositor so I can never be blocked, I'd be disappointed if the infrastructure was ultimately pointless because the documented fencing rules were \_o_/ or thereabouts. Lockdep definitely isn't my area of expertise so I can't comment on the patch per se, but having something to ensure we don't hit deadlocks sure seems a lot better than nothing.
This is doing dependency analysis on execution contexts, which is a far cry from doing fence dependency analysis, and so it has to actively ignore the cycles that must exist on the dma side, as well as the cycles that prevent entering execution contexts on the CPU. It has to actively ignore scheduler execution contexts, because lockdep complains otherwise, and so we get no analysis of the locking contexts along that path. This would be solvable along the lines of extending lockdep a la lockdep_dma_enter().
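To sketch what I mean, and to be clear this is entirely hypothetical, lockdep_dma_enter()/lockdep_dma_exit() do not exist and every name below is invented, the scheduler path could open a named DMA execution context so that the locks taken there feed into the analysis instead of being excluded from it:

  /* Hypothetical sketch only: lockdep_dma_enter()/exit() and the
   * helpers around them are invented to show the shape of the
   * extension being suggested. */
  static void sched_run_job(struct sched_job *job)
  {
          /* Enter a named DMA execution context so that lockdep can
           * chain the locks acquired below into the fence dependency
           * graph, rather than being told to ignore this path. */
          lockdep_dma_enter(&job->engine->dma_map);

          wait_for_dependencies(job);     /* may dma_fence_wait() */
          submit_to_hw(job);

          lockdep_dma_exit(&job->engine->dma_map);
  }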
Had i915's execution flow been marked up, it should have found the dubious wait for external fences inside the dead-GPU recovery, and probably found a few more things to complain about in the reset locking. [Note we already do the same annotations for wait-vs-reset, but not for reset-vs-execution.]
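Roughly, with the annotations from this series applied to a recovery path (only dma_fence_begin_signalling()/dma_fence_end_signalling() come from the series; the surrounding code and names are invented for illustration):

  #include <linux/dma-fence.h>

  static void gpu_recover(struct gpu *gpu)
  {
          /* Everything up to end_signalling must lead to our
           * published fences being signalled, so the annotation
           * treats it as a fence-signalling critical section. */
          bool cookie = dma_fence_begin_signalling();

          revoke_inflight_requests(gpu);  /* invented helper */

          /* The dubious wait: blocking on an external fence while
           * we are on the hook to signal our own. Annotated, this
           * becomes a lockdep splat rather than a latent deadlock. */
          dma_fence_wait(gpu->external_fence, false);

          dma_fence_end_signalling(cookie);
  }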
Determination of which waits are legal and which are not is entirely ad hoc, for there is no tracking of status changes in the dependency analysis [that is, of the point at which an execution context becomes linked to a published fence; again, that statelessness is integral to lockdep]. Consider if the completion chain in atomic were swapped out for the morally equivalent fences along intertwined timelines, so that it did a bunch of dma_fence_wait() instead. Why are those waits legal despite coming after we have committed to fulfilling the out-fence? [Why are the waits on and for the GPU legal, since they equally block the execution flow?]
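Schematically, the pattern in question (every name here is invented; this is not the actual atomic commit code):

  static void commit_tail(struct commit_state *state)
  {
          int i;

          /* The out-fence has already been handed to userspace, so
           * we are committed to signalling it from this point on. */
          publish_out_fence(state);

          /* ...yet we still block on the input fences afterwards,
           * the moral equivalent of the completion chain. Why is
           * this wait legal when a wait on any other published
           * fence in the same position would not be? */
          for (i = 0; i < state->num_planes; i++)
                  dma_fence_wait(state->planes[i].in_fence, false);
  }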
Forcing a generic primitive to always be part of the same global map is horrible. You forgo being able to use the primitive for unrelated tasks, and you lose the ability to name particular contexts and thereby get more informative dependency-cycle reports out of the explicit linkage. You can add wait_map tracking without loss of generality [in less than 10 lines, see the sketch below], and you can still enforce that all fences used for a common purpose follow the same rules [the simplest way being to default to the singular wait_map]. But it is the explicitly named execution contexts that are the biggest boon to reading the code and reading the lockdep warns.
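Something like this, where the wait_map pointer on the fence is the proposed addition, not an existing field, while the lockdep primitives themselves are real:

  #include <linux/dma-fence.h>
  #include <linux/lockdep.h>

  /* The singular default map preserves the "one global map" rules
   * for everyone who does not opt out. */
  static struct lockdep_map dma_fence_wait_map =
          STATIC_LOCKDEP_MAP_INIT("dma_fence_wait", &dma_fence_wait_map);

  signed long dma_fence_wait_annotated(struct dma_fence *fence, bool intr)
  {
          /* fence->wait_map is the proposed field: NULL means use the
           * common default, so unrelated users can carry their own
           * named map while fences sharing a purpose share a map and
           * therefore share its rules. */
          struct lockdep_map *map = fence->wait_map ?: &dma_fence_wait_map;

          lock_map_acquire(map);
          lock_map_release(map);

          return dma_fence_wait(fence, intr);
  }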
This is a bunch of ad hoc tracking for a very narrow purpose, applied globally, with a loss of information.
-Chris