Design is similar to the lockdep annotations for workers, but with some twists:
- We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
- We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
- To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
- The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
- To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be irq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(), as sketched below. That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
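As a minimal illustration (the function and lock names are invented, not taken from any driver) of the kind of wrong nesting that might_sleep() already catches, and why the annotation can simply bail out when in_atomic() is true:

  #include <linux/dma-fence.h>
  #include <linux/spinlock.h>

  /* Illustration only: waiting on a fence while holding a spinlock. */
  static void broken_completion_path(struct dma_fence *fence, spinlock_t *lock)
  {
          spin_lock(lock);
          /* the might_sleep() in dma_fence_wait() splats here: atomic context */
          dma_fence_wait(fence, false);
          spin_unlock(lock);
  }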
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
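To make the intended usage concrete, here is a minimal sketch of an annotated job-completion worker. Illustration only: the structure and helper names are made up and not from any real driver. Note that dma_fence_signal() also annotates itself internally, so the explicit begin/end pair is really about covering the code leading up to it.

  #include <linux/dma-fence.h>
  #include <linux/workqueue.h>

  struct my_job {
          struct work_struct work;
          struct dma_fence *done_fence;   /* already published to other threads */
  };

  static void my_job_complete_work(struct work_struct *work)
  {
          struct my_job *job = container_of(work, struct my_job, work);
          bool cookie;

          /* nothing in here may deadlock against someone waiting on done_fence */
          cookie = dma_fence_begin_signalling();

          /* ... read back hardware state, release driver resources, etc. ... */

          dma_fence_signal(job->done_fence);
          dma_fence_end_signalling(cookie);
  }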
The main class of deadlocks this is supposed to catch is:
Thread A:
mutex_lock(A);
mutex_unlock(A);
dma_fence_signal();
Thread B:
mutex_lock(A);
dma_fence_wait();
mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
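Spelled out with the new annotations (a sketch only; the lock, fence and function names are placeholders), lockdep sees mutex A acquired inside the fence signalling critical section on one side, and dma_fence_wait() called with mutex A held on the other, so it can report the inversion without the deadlock ever having to happen:

  #include <linux/dma-fence.h>
  #include <linux/mutex.h>

  static void thread_a(struct dma_fence *fence, struct mutex *a)
  {
          bool cookie = dma_fence_begin_signalling();

          mutex_lock(a);
          mutex_unlock(a);

          dma_fence_signal(fence);
          dma_fence_end_signalling(cookie);
  }

  static void thread_b(struct dma_fence *fence, struct mutex *a)
  {
          mutex_lock(a);
          dma_fence_wait(fence, false);   /* the wait side annotates itself */
          mutex_unlock(a);
  }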
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held at that point, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly, to avoid false positives.
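A short sketch of that rule, again with placeholder names: waiting on a prerequisite fence inside a signalling section is fine as long as nothing else is held at that point.

  cookie = dma_fence_begin_signalling();

  /* ok: no other lock held since dma_fence_begin_signalling() */
  dma_fence_wait(prereq_fence, false);

  mutex_lock(&obj_lock);
  /*
   * calling dma_fence_wait() here instead would splat immediately,
   * since obj_lock is still held when the annotation gets upgraded
   * to the write side
   */
  mutex_unlock(&obj_lock);

  dma_fence_signal(fence);
  dma_fence_end_signalling(cookie);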
v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 Documentation/driver-api/dma-buf.rst |  12 +-
 drivers/dma-buf/dma-fence.c          | 161 +++++++++++++++++++++++++++
 include/linux/dma-fence.h            |  12 ++
 3 files changed, 182 insertions(+), 3 deletions(-)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 63dec76d1d8d..05d856131140 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects .. kernel-doc:: drivers/dma-buf/dma-buf.c :doc: cpu access
-Fence Poll Support -~~~~~~~~~~~~~~~~~~ +Implicit Fence Poll Support +~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. kernel-doc:: drivers/dma-buf/dma-buf.c - :doc: fence polling + :doc: implicit fence polling
Kernel Functions and Structures Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview
+DMA Fence Signalling Annotations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: drivers/dma-buf/dma-fence.c + :doc: fence signalling annotation + DMA Fences Functions Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 656e9ac2d028..0005bc002529 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num) } EXPORT_SYMBOL(dma_fence_context_alloc);
+/** + * DOC: fence signalling annotation + * + * Proving correctness of all the kernel code around &dma_fence through code + * review and testing is tricky for a few reasons: + * + * * It is a cross-driver contract, and therefore all drivers must follow the + * same rules for lock nesting order, calling contexts for various functions + * and anything else significant for in-kernel interfaces. But it is also + * impossible to test all drivers in a single machine, hence brute-force N vs. + * N testing of all combinations is impossible. Even just limiting to the + * possible combinations is infeasible. + * + * * There is an enormous amount of driver code involved. For render drivers + * there's the tail of command submission, after fences are published, + * scheduler code, interrupt and workers to process job completion, + * and timeout, gpu reset and gpu hang recovery code. Plus for integration + * with core mm with have &mmu_notifier, respectively &mmu_interval_notifier, + * and &shrinker. For modesetting drivers there's the commit tail functions + * between when fences for an atomic modeset are published, and when the + * corresponding vblank completes, including any interrupt processing and + * related workers. Auditing all that code, across all drivers, is not + * feasible. + * + * * Due to how many other subsystems are involved and the locking hierarchies + * this pulls in there is extremely thin wiggle-room for driver-specific + * differences. &dma_fence interacts with almost all of the core memory + * handling through page fault handlers via &dma_resv, dma_resv_lock() and + * dma_resv_unlock(). On the other side it also interacts through all + * allocation sites through &mmu_notifier and &shrinker. + * + * Furthermore lockdep does not handle cross-release dependencies, which means + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught + * at runtime with some quick testing. The simplest example is one thread + * waiting on a &dma_fence while holding a lock:: + * + * lock(A); + * dma_fence_wait(B); + * unlock(A); + * + * while the other thread is stuck trying to acquire the same lock, which + * prevents it from signalling the fence the previous thread is stuck waiting + * on:: + * + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * + * By manually annotating all code relevant to signalling a &dma_fence we can + * teach lockdep about these dependencies, which also helps with the validation + * headache since now lockdep can check all the rules for us:: + * + * cookie = dma_fence_begin_signalling(); + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * dma_fence_end_signalling(cookie); + * + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to + * annotate critical sections the following rules need to be observed: + * + * * All code necessary to complete a &dma_fence must be annotated, from the + * point where a fence is accessible to other threads, to the point where + * dma_fence_signal() is called. Un-annotated code can contain deadlock issues, + * and due to the very strict rules and many corner cases it is infeasible to + * catch these just with review or normal stress testing. + * + * * &struct dma_resv deserves a special note, since the readers are only + * protected by rcu. This means the signalling critical section starts as soon + * as the new fences are installed, even before dma_resv_unlock() is called. 
+ * + * * The only exception are fast paths and opportunistic signalling code, which + * calls dma_fence_signal() purely as an optimization, but is not required to + * guarantee completion of a &dma_fence. The usual example is a wait IOCTL + * which calls dma_fence_signal(), while the mandatory completion path goes + * through a hardware interrupt and possible job completion worker. + * + * * To aid composability of code, the annotations can be freely nested, as long + * as the overall locking hierarchy is consistent. The annotations also work + * both in interrupt and process context. Due to implementation details this + * requires that callers pass an opaque cookie from + * dma_fence_begin_signalling() to dma_fence_end_signalling(). + * + * * Validation against the cross driver contract is implemented by priming + * lockdep with the relevant hierarchy at boot-up. This means even just + * testing with a single device is enough to validate a driver, at least as + * far as deadlocks with dma_fence_wait() against dma_fence_signal() are + * concerned. + */ +#ifdef CONFIG_LOCKDEP +struct lockdep_map dma_fence_lockdep_map = { + .name = "dma_fence_map" +}; + +/** + * dma_fence_begin_signalling - begin a critical DMA fence signalling section + * + * Drivers should use this to annotate the beginning of any code section + * required to eventually complete &dma_fence by calling dma_fence_signal(). + * + * The end of these critical sections are annotated with + * dma_fence_end_signalling(). + * + * Returns: + * + * Opaque cookie needed by the implementation, which needs to be passed to + * dma_fence_end_signalling(). + */ +bool dma_fence_begin_signalling(void) +{ + /* explicitly nesting ... */ + if (lock_is_held_type(&dma_fence_lockdep_map, 1)) + return true; + + /* rely on might_sleep check for soft/hardirq locks */ + if (in_atomic()) + return true; + + /* ... and non-recursive readlock */ + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_); + + return false; +} +EXPORT_SYMBOL(dma_fence_begin_signalling); + +/** + * dma_fence_end_signalling - end a critical DMA fence signalling section + * + * Closes a critical section annotation opened by dma_fence_begin_signalling(). + */ +void dma_fence_end_signalling(bool cookie) +{ + if (cookie) + return; + + lock_release(&dma_fence_lockdep_map, _RET_IP_); +} +EXPORT_SYMBOL(dma_fence_end_signalling); + +void __dma_fence_might_wait(void) +{ + bool tmp; + + tmp = lock_is_held_type(&dma_fence_lockdep_map, 1); + if (tmp) + lock_release(&dma_fence_lockdep_map, _THIS_IP_); + lock_map_acquire(&dma_fence_lockdep_map); + lock_map_release(&dma_fence_lockdep_map); + if (tmp) + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_); +} +#endif + + /** * dma_fence_signal_locked - signal completion of a fence * @fence: the fence to signal @@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence) { unsigned long flags; int ret; + bool tmp;
if (!fence) return -EINVAL;
+ tmp = dma_fence_begin_signalling(); + spin_lock_irqsave(fence->lock, flags); ret = dma_fence_signal_locked(fence); spin_unlock_irqrestore(fence->lock, flags);
+ dma_fence_end_signalling(tmp); + return ret; } EXPORT_SYMBOL(dma_fence_signal); @@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
might_sleep();
+ __dma_fence_might_wait(); + trace_dma_fence_wait_start(fence); if (fence->ops->wait) ret = fence->ops->wait(fence, intr, timeout); diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index 3347c54f3a87..3f288f7db2ef 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep) } while (1); }
+#ifdef CONFIG_LOCKDEP +bool dma_fence_begin_signalling(void); +void dma_fence_end_signalling(bool cookie); +#else +static inline bool dma_fence_begin_signalling(void) +{ + return true; +} +static inline void dma_fence_end_signalling(bool cookie) {} +static inline void __dma_fence_might_wait(void) {} +#endif + int dma_fence_signal(struct dma_fence *fence); int dma_fence_signal_locked(struct dma_fence *fence); signed long dma_fence_default_wait(struct dma_fence *fence,
Design is similar to the lockdep annotations for workers, but with some twists:
- We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
- We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
- To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
- The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
- To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be irq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
The main class of deadlocks this is supposed to catch is:
Thread A:
mutex_lock(A);
mutex_unlock(A);
dma_fence_signal();
Thread B:
mutex_lock(A);
dma_fence_wait();
mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held at that point, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly, to avoid false positives.
Originally I hoped that the cross-release lockdep extensions would alleviate the need for explicit annotations:
https://lwn.net/Articles/709849/
But there's a few reasons why that's not an option:
- It's not happening in upstream, since it got reverted due to too many false positives:
commit e966eaeeb623f09975ef362c2866fae6f86844f9
Author: Ingo Molnar <mingo@kernel.org>
Date:   Tue Dec 12 12:31:16 2017 +0100
locking/lockdep: Remove the cross-release locking checks
This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y), while it found a number of old bugs initially, was also causing too many false positives that caused people to disable lockdep - which is arguably a worse overall outcome.
- cross-release uses the complete() call to annotate the end of critical sections, for dma_fence that would be dma_fence_signal(). But we do not want all dma_fence_signal() calls to be treated as critical, since many are opportunistic cleanup of gpu requests. If these get stuck there's still the main completion interrupt and workers who can unblock everyone. Automatically annotating all dma_fence_signal() calls would hence cause false positives.
- cross-release had some educated guesses for when a critical section starts, like fresh syscall or fresh work callback. This would again cause false positives without explicit annotations, since for dma_fence the critical sections only start when we publish a fence.
- Furthermore there can be cases where a thread never does a dma_fence_signal(), but is still critical for reaching completion of fences. One example would be a scheduler kthread which picks up jobs and pushes them into hardware, where the interrupt handler or another completion thread calls dma_fence_signal() (see the sketch after this list). But if the scheduler thread hangs, then all the fences hang, hence we need to manually annotate it. cross-release aimed to solve this by chaining cross-release dependencies, but the dependency from the scheduler thread to the completion interrupt handler goes through hardware, where cross-release code can't observe it.
In short, without manual annotations and careful review of the start and end of critical sections, cross-release dependency tracking doesn't work. We need explicit annotations.
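To illustrate the scheduler-kthread case from the list above, a sketch (the helpers and structures are invented for illustration, not taken from any driver) of annotating a thread that only pushes jobs to the hardware and never calls dma_fence_signal() itself:

  #include <linux/dma-fence.h>
  #include <linux/kthread.h>

  /* hypothetical driver internals, declared here only so the sketch is complete */
  struct my_sched;
  struct my_job;
  struct my_job *my_sched_pick_job(struct my_sched *sched);
  void my_hw_push_job(struct my_sched *sched, struct my_job *job);

  static int my_scheduler_thread(void *data)
  {
          struct my_sched *sched = data;

          while (!kthread_should_stop()) {
                  struct my_job *job = my_sched_pick_job(sched);
                  bool cookie;

                  if (!job)
                          continue;

                  /*
                   * The job's fences are already published and the completion
                   * interrupt will signal them, but if this thread stalls the
                   * job never reaches the hardware, so the push is part of the
                   * signalling critical path and gets annotated too.
                   */
                  cookie = dma_fence_begin_signalling();
                  my_hw_push_job(sched, job);
                  dma_fence_end_signalling(cookie);
          }

          return 0;
  }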
v2: handle soft/hardirq ctx better against write side and don't forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
v5: Amend commit message to explain in detail why cross-release isn't the solution.
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 Documentation/driver-api/dma-buf.rst |  12 +-
 drivers/dma-buf/dma-fence.c          | 161 +++++++++++++++++++++++++++
 include/linux/dma-fence.h            |  12 ++
 3 files changed, 182 insertions(+), 3 deletions(-)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 63dec76d1d8d..05d856131140 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects .. kernel-doc:: drivers/dma-buf/dma-buf.c :doc: cpu access
-Fence Poll Support -~~~~~~~~~~~~~~~~~~ +Implicit Fence Poll Support +~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. kernel-doc:: drivers/dma-buf/dma-buf.c - :doc: fence polling + :doc: implicit fence polling
Kernel Functions and Structures Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview
+DMA Fence Signalling Annotations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: drivers/dma-buf/dma-fence.c + :doc: fence signalling annotation + DMA Fences Functions Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 656e9ac2d028..0005bc002529 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num) } EXPORT_SYMBOL(dma_fence_context_alloc);
+/** + * DOC: fence signalling annotation + * + * Proving correctness of all the kernel code around &dma_fence through code + * review and testing is tricky for a few reasons: + * + * * It is a cross-driver contract, and therefore all drivers must follow the + * same rules for lock nesting order, calling contexts for various functions + * and anything else significant for in-kernel interfaces. But it is also + * impossible to test all drivers in a single machine, hence brute-force N vs. + * N testing of all combinations is impossible. Even just limiting to the + * possible combinations is infeasible. + * + * * There is an enormous amount of driver code involved. For render drivers + * there's the tail of command submission, after fences are published, + * scheduler code, interrupt and workers to process job completion, + * and timeout, gpu reset and gpu hang recovery code. Plus for integration + * with core mm with have &mmu_notifier, respectively &mmu_interval_notifier, + * and &shrinker. For modesetting drivers there's the commit tail functions + * between when fences for an atomic modeset are published, and when the + * corresponding vblank completes, including any interrupt processing and + * related workers. Auditing all that code, across all drivers, is not + * feasible. + * + * * Due to how many other subsystems are involved and the locking hierarchies + * this pulls in there is extremely thin wiggle-room for driver-specific + * differences. &dma_fence interacts with almost all of the core memory + * handling through page fault handlers via &dma_resv, dma_resv_lock() and + * dma_resv_unlock(). On the other side it also interacts through all + * allocation sites through &mmu_notifier and &shrinker. + * + * Furthermore lockdep does not handle cross-release dependencies, which means + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught + * at runtime with some quick testing. The simplest example is one thread + * waiting on a &dma_fence while holding a lock:: + * + * lock(A); + * dma_fence_wait(B); + * unlock(A); + * + * while the other thread is stuck trying to acquire the same lock, which + * prevents it from signalling the fence the previous thread is stuck waiting + * on:: + * + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * + * By manually annotating all code relevant to signalling a &dma_fence we can + * teach lockdep about these dependencies, which also helps with the validation + * headache since now lockdep can check all the rules for us:: + * + * cookie = dma_fence_begin_signalling(); + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * dma_fence_end_signalling(cookie); + * + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to + * annotate critical sections the following rules need to be observed: + * + * * All code necessary to complete a &dma_fence must be annotated, from the + * point where a fence is accessible to other threads, to the point where + * dma_fence_signal() is called. Un-annotated code can contain deadlock issues, + * and due to the very strict rules and many corner cases it is infeasible to + * catch these just with review or normal stress testing. + * + * * &struct dma_resv deserves a special note, since the readers are only + * protected by rcu. This means the signalling critical section starts as soon + * as the new fences are installed, even before dma_resv_unlock() is called. 
+ * + * * The only exception are fast paths and opportunistic signalling code, which + * calls dma_fence_signal() purely as an optimization, but is not required to + * guarantee completion of a &dma_fence. The usual example is a wait IOCTL + * which calls dma_fence_signal(), while the mandatory completion path goes + * through a hardware interrupt and possible job completion worker. + * + * * To aid composability of code, the annotations can be freely nested, as long + * as the overall locking hierarchy is consistent. The annotations also work + * both in interrupt and process context. Due to implementation details this + * requires that callers pass an opaque cookie from + * dma_fence_begin_signalling() to dma_fence_end_signalling(). + * + * * Validation against the cross driver contract is implemented by priming + * lockdep with the relevant hierarchy at boot-up. This means even just + * testing with a single device is enough to validate a driver, at least as + * far as deadlocks with dma_fence_wait() against dma_fence_signal() are + * concerned. + */ +#ifdef CONFIG_LOCKDEP +struct lockdep_map dma_fence_lockdep_map = { + .name = "dma_fence_map" +}; + +/** + * dma_fence_begin_signalling - begin a critical DMA fence signalling section + * + * Drivers should use this to annotate the beginning of any code section + * required to eventually complete &dma_fence by calling dma_fence_signal(). + * + * The end of these critical sections are annotated with + * dma_fence_end_signalling(). + * + * Returns: + * + * Opaque cookie needed by the implementation, which needs to be passed to + * dma_fence_end_signalling(). + */ +bool dma_fence_begin_signalling(void) +{ + /* explicitly nesting ... */ + if (lock_is_held_type(&dma_fence_lockdep_map, 1)) + return true; + + /* rely on might_sleep check for soft/hardirq locks */ + if (in_atomic()) + return true; + + /* ... and non-recursive readlock */ + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_); + + return false; +} +EXPORT_SYMBOL(dma_fence_begin_signalling); + +/** + * dma_fence_end_signalling - end a critical DMA fence signalling section + * + * Closes a critical section annotation opened by dma_fence_begin_signalling(). + */ +void dma_fence_end_signalling(bool cookie) +{ + if (cookie) + return; + + lock_release(&dma_fence_lockdep_map, _RET_IP_); +} +EXPORT_SYMBOL(dma_fence_end_signalling); + +void __dma_fence_might_wait(void) +{ + bool tmp; + + tmp = lock_is_held_type(&dma_fence_lockdep_map, 1); + if (tmp) + lock_release(&dma_fence_lockdep_map, _THIS_IP_); + lock_map_acquire(&dma_fence_lockdep_map); + lock_map_release(&dma_fence_lockdep_map); + if (tmp) + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_); +} +#endif + + /** * dma_fence_signal_locked - signal completion of a fence * @fence: the fence to signal @@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence) { unsigned long flags; int ret; + bool tmp;
if (!fence) return -EINVAL;
+ tmp = dma_fence_begin_signalling(); + spin_lock_irqsave(fence->lock, flags); ret = dma_fence_signal_locked(fence); spin_unlock_irqrestore(fence->lock, flags);
+ dma_fence_end_signalling(tmp); + return ret; } EXPORT_SYMBOL(dma_fence_signal); @@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
might_sleep();
+ __dma_fence_might_wait(); + trace_dma_fence_wait_start(fence); if (fence->ops->wait) ret = fence->ops->wait(fence, intr, timeout); diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index 3347c54f3a87..3f288f7db2ef 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep) } while (1); }
+#ifdef CONFIG_LOCKDEP +bool dma_fence_begin_signalling(void); +void dma_fence_end_signalling(bool cookie); +#else +static inline bool dma_fence_begin_signalling(void) +{ + return true; +} +static inline void dma_fence_end_signalling(bool cookie) {} +static inline void __dma_fence_might_wait(void) {} +#endif + int dma_fence_signal(struct dma_fence *fence); int dma_fence_signal_locked(struct dma_fence *fence); signed long dma_fence_default_wait(struct dma_fence *fence,
On 05-06-2020 at 15:29, Daniel Vetter wrote:
[snip]
As the original author of dma-fence, I enjoy seeing more lockdep annotations. Fence was always meant to be cross-driver, so strict driver annotations that can be verified by lockdep are a good thing. Because drivers have to interact with other drivers that use dma-fence, the rules must be the same for everyone, and the above code makes sense.
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
On 04/06/2020 09:12, Daniel Vetter wrote:
[snip]
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 63dec76d1d8d..05d856131140 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects .. kernel-doc:: drivers/dma-buf/dma-buf.c :doc: cpu access -Fence Poll Support -~~~~~~~~~~~~~~~~~~ +Implicit Fence Poll Support +~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. kernel-doc:: drivers/dma-buf/dma-buf.c
- :doc: fence polling
- :doc: implicit fence polling
Kernel Functions and Structures Reference
@@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview +DMA Fence Signalling Annotations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: drivers/dma-buf/dma-fence.c + :doc: fence signalling annotation + DMA Fences Functions Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 656e9ac2d028..0005bc002529 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num) } EXPORT_SYMBOL(dma_fence_context_alloc); +/** + * DOC: fence signalling annotation + * + * Proving correctness of all the kernel code around &dma_fence through code + * review and testing is tricky for a few reasons: + * + * * It is a cross-driver contract, and therefore all drivers must follow the + * same rules for lock nesting order, calling contexts for various functions + * and anything else significant for in-kernel interfaces. But it is also + * impossible to test all drivers in a single machine, hence brute-force N vs. + * N testing of all combinations is impossible. Even just limiting to the + * possible combinations is infeasible. + * + * * There is an enormous amount of driver code involved. For render drivers + * there's the tail of command submission, after fences are published, + * scheduler code, interrupt and workers to process job completion, + * and timeout, gpu reset and gpu hang recovery code. Plus for integration + * with core mm with have &mmu_notifier, respectively &mmu_interval_notifier, + * and &shrinker. For modesetting drivers there's the commit tail functions + * between when fences for an atomic modeset are published, and when the + * corresponding vblank completes, including any interrupt processing and + * related workers. Auditing all that code, across all drivers, is not + * feasible. + * + * * Due to how many other subsystems are involved and the locking hierarchies + * this pulls in there is extremely thin wiggle-room for driver-specific + * differences. &dma_fence interacts with almost all of the core memory + * handling through page fault handlers via &dma_resv, dma_resv_lock() and + * dma_resv_unlock(). On the other side it also interacts through all + * allocation sites through &mmu_notifier and &shrinker. + * + * Furthermore lockdep does not handle cross-release dependencies, which means + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught + * at runtime with some quick testing. 
The simplest example is one thread + * waiting on a &dma_fence while holding a lock:: + * + * lock(A); + * dma_fence_wait(B); + * unlock(A); + * + * while the other thread is stuck trying to acquire the same lock, which + * prevents it from signalling the fence the previous thread is stuck waiting + * on:: + * + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * + * By manually annotating all code relevant to signalling a &dma_fence we can + * teach lockdep about these dependencies, which also helps with the validation + * headache since now lockdep can check all the rules for us:: + * + * cookie = dma_fence_begin_signalling(); + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * dma_fence_end_signalling(cookie); + * + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to + * annotate critical sections the following rules need to be observed: + * + * * All code necessary to complete a &dma_fence must be annotated, from the + * point where a fence is accessible to other threads, to the point where + * dma_fence_signal() is called. Un-annotated code can contain deadlock issues, + * and due to the very strict rules and many corner cases it is infeasible to + * catch these just with review or normal stress testing. + * + * * &struct dma_resv deserves a special note, since the readers are only + * protected by rcu. This means the signalling critical section starts as soon + * as the new fences are installed, even before dma_resv_unlock() is called. + * + * * The only exception are fast paths and opportunistic signalling code, which + * calls dma_fence_signal() purely as an optimization, but is not required to + * guarantee completion of a &dma_fence. The usual example is a wait IOCTL + * which calls dma_fence_signal(), while the mandatory completion path goes + * through a hardware interrupt and possible job completion worker. + * + * * To aid composability of code, the annotations can be freely nested, as long + * as the overall locking hierarchy is consistent. The annotations also work + * both in interrupt and process context. Due to implementation details this + * requires that callers pass an opaque cookie from + * dma_fence_begin_signalling() to dma_fence_end_signalling(). + * + * * Validation against the cross driver contract is implemented by priming + * lockdep with the relevant hierarchy at boot-up. This means even just + * testing with a single device is enough to validate a driver, at least as + * far as deadlocks with dma_fence_wait() against dma_fence_signal() are + * concerned. + */ +#ifdef CONFIG_LOCKDEP +struct lockdep_map dma_fence_lockdep_map = { + .name = "dma_fence_map" +};
Maybe a stupid question because this is definitely complicated, but: if you have a single/static/global lockdep map, doesn't this mean _all_ locks, from _all_ drivers that happen to use dma-fences, will get recorded in it? Will this work and not cause false positives?
Sounds like it could create a common link between two completely unconnected usages. Because below you do add annotations to generic dma_fence_signal and dma_fence_wait.
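For example (hypothetical, just to illustrate what I mean), two drivers that never interact:

  /* driver X, in its completion worker */
  cookie = dma_fence_begin_signalling();
  mutex_lock(&x_lock);          /* records dma_fence_map -> x_lock */
  mutex_unlock(&x_lock);
  dma_fence_signal(x_fence);
  dma_fence_end_signalling(cookie);

  /* driver Y, somewhere completely unrelated */
  mutex_lock(&y_lock);          /* records y_lock -> dma_fence_map */
  dma_fence_wait(y_fence, false);
  mutex_unlock(&y_lock);

Both dependency chains end up attached to the same dma_fence_map class, even though x_lock and y_lock never meet.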
+/**
+ * dma_fence_begin_signalling - begin a critical DMA fence signalling section
+ *
+ * Drivers should use this to annotate the beginning of any code section
+ * required to eventually complete &dma_fence by calling dma_fence_signal().
+ *
+ * The end of these critical sections are annotated with
+ * dma_fence_end_signalling().
+ *
+ * Returns:
+ *
+ * Opaque cookie needed by the implementation, which needs to be passed to
+ * dma_fence_end_signalling().
+ */
+bool dma_fence_begin_signalling(void)
+{
+	/* explicitly nesting ... */
+	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
+		return true;
+
+	/* rely on might_sleep check for soft/hardirq locks */
+	if (in_atomic())
+		return true;
+
+	/* ... and non-recursive readlock */
+	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
Would it work if the signalling path marked itself as a write lock? I am thinking it would be nice to see in lockdep splats which are signals and which are waits.
The recursive usage wouldn't work then, right? Would a write annotation on the wait path work?
Regards,
Tvrtko
- return false;
+} +EXPORT_SYMBOL(dma_fence_begin_signalling);
+/**
- dma_fence_end_signalling - end a critical DMA fence signalling section
- Closes a critical section annotation opened by dma_fence_begin_signalling().
- */
+void dma_fence_end_signalling(bool cookie) +{
- if (cookie)
return;
- lock_release(&dma_fence_lockdep_map, _RET_IP_);
+} +EXPORT_SYMBOL(dma_fence_end_signalling);
+void __dma_fence_might_wait(void) +{
- bool tmp;
- tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
- if (tmp)
lock_release(&dma_fence_lockdep_map, _THIS_IP_);
- lock_map_acquire(&dma_fence_lockdep_map);
- lock_map_release(&dma_fence_lockdep_map);
- if (tmp)
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
+} +#endif
- /**
- dma_fence_signal_locked - signal completion of a fence
- @fence: the fence to signal
@@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence) { unsigned long flags; int ret;
- bool tmp;
if (!fence) return -EINVAL;
- tmp = dma_fence_begin_signalling();
- spin_lock_irqsave(fence->lock, flags); ret = dma_fence_signal_locked(fence); spin_unlock_irqrestore(fence->lock, flags);
- dma_fence_end_signalling(tmp);
- return ret; } EXPORT_SYMBOL(dma_fence_signal);
@@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout) might_sleep();
- __dma_fence_might_wait();
- trace_dma_fence_wait_start(fence); if (fence->ops->wait) ret = fence->ops->wait(fence, intr, timeout);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 3347c54f3a87..3f288f7db2ef 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
	} while (1);
}

#ifdef CONFIG_LOCKDEP
bool dma_fence_begin_signalling(void);
void dma_fence_end_signalling(bool cookie);
#else
static inline bool dma_fence_begin_signalling(void)
{
	return true;
}
static inline void dma_fence_end_signalling(bool cookie) {}
static inline void __dma_fence_might_wait(void) {}
#endif

int dma_fence_signal(struct dma_fence *fence);
int dma_fence_signal_locked(struct dma_fence *fence);
signed long dma_fence_default_wait(struct dma_fence *fence,
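To illustrate the nesting rule from the commit message with a sketch (the my_* helpers are made up; only dma_fence_begin/end_signalling() and dma_fence_signal() are from the patch): an annotated helper can be called both standalone and from within an already annotated section, and the boolean cookie is what keeps the inner end call from releasing the outer annotation:

	/* Hypothetical helpers demonstrating nested annotations. */
	static void my_complete_job(struct my_job *job)
	{
		/* Returns true when we are already inside a signalling section. */
		bool cookie = dma_fence_begin_signalling();

		dma_fence_signal(job->fence);

		/* No-op when cookie is true, so the outer section stays annotated. */
		dma_fence_end_signalling(cookie);
	}

	static void my_gpu_reset(struct my_device *dev)
	{
		bool cookie = dma_fence_begin_signalling();	/* outermost annotation */

		my_complete_job(dev->stuck_job);		/* nested annotation is fine */

		dma_fence_end_signalling(cookie);
	}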
On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
On 04/06/2020 09:12, Daniel Vetter wrote:
#ifdef CONFIG_LOCKDEP
struct lockdep_map dma_fence_lockdep_map = {
	.name = "dma_fence_map"
};
Maybe a stupid question because this is definitely complicated, but... if you have a single/static/global lockdep map, doesn't this mean _all_ locks, from _all_ drivers happening to use dma-fences, will get recorded in it? Will this work and not cause false positives?
Sounds like it could create a common link between two completely unconnected usages. Because below you do add annotations to generic dma_fence_signal and dma_fence_wait.
This is fully intentional. dma-fence is a cross-driver interface; if every driver invents its own rules about how this should work, we end up with an unmaintainable and unreviewable mess.
I've typed up the full length rant already here:
https://lore.kernel.org/dri-devel/CAKMK7uGnFhbpuurRsnZ4dvRV9gQ_3-rmSJaoqSFY=...
Would it work if the signalling path marked itself as a write lock? I am thinking it would be nice to see in the lockdep splats which are signals and which are waits.
Yeah it'd be nice to have a read vs write name for the lock. But we already have this problem for e.g. flush_work(), from which I've stolen this idea. So it's not really new. Essentially look at the backtraces lockdep gives you, and reconstruct the deadlock. I'm hoping that people will notice the special functions on the backtrace, e.g. dma_fence_begin_signalling will be listed as offending function/lock holder, and then read the kerneldoc.
The recursive usage wouldn't work then, right? Would a write annotation on the wait path work?
Wait path is write annotations already, but yeah, annotating the signalling side as write would cause endless amounts of false positives. The recursive read approach also makes these annotations composable, e.g. what I've done in amdgpu with annotations in the TDR work in drm/scheduler, annotations in the amdgpu gpu reset code and then also annotations in atomic code, which all nest within each other in some call chains, but not in others. Dropping the recursion would break that and make it really awkward to annotate such cases correctly.
And the recursion only works if it's read locks, otherwise lockdep complains if you have inconsistent annotations on the signalling side (which again would make it more or less impossible to annotate the above case fully).
Cheers, Daniel
On 10/06/2020 16:17, Daniel Vetter wrote:
This is fully intentional. dma-fence is a cross-driver interface; if every driver invents its own rules about how this should work, we end up with an unmaintainable and unreviewable mess.
I've typed up the full length rant already here:
https://lore.kernel.org/dri-devel/CAKMK7uGnFhbpuurRsnZ4dvRV9gQ_3-rmSJaoqSFY=...
But a "perfect storm" of:

- global fence lockmap
- mmu notifiers
- fs reclaim
- default annotations in dma_fence_signal / dma_fence_wait

equals anything ever using dma_fence ending up in impossible chains with random other drivers, even if neither driver has code to export/share that fence.
Example from the CI run:
[25.918788] Chain exists of: fs_reclaim --> mmu_notifier_invalidate_range_start --> dma_fence_map
[25.918794]  Possible unsafe locking scenario:
[25.918797]        CPU0                    CPU1
[25.918799]        ----                    ----
[25.918801]   lock(dma_fence_map);
[25.918803]                                lock(mmu_notifier_invalidate_range_start);
[25.918807]                                lock(dma_fence_map);
[25.918809]   lock(fs_reclaim);
What about a dma_fence_export helper which would "arm" the annotations? It would be called as soon as the fence is exported. Maybe when added to dma_resv, or exported via sync_file, etc. Before that point begin/end_signalling and so on would be no-ops.
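A very rough sketch of what such an opt-in "arm" step could look like; this is purely hypothetical (the MY_DMA_FENCE_FLAG_EXPORTED bit and the dma_fence_export() helper do not exist in the patch) and only spells out the suggestion on top of the dma_fence_signal() hunk quoted above:

	/* Hypothetical: mark a fence as visible to other threads/drivers. */
	void dma_fence_export(struct dma_fence *fence)
	{
		/* Would be called when the fence is added to a dma_resv, wrapped in
		 * a sync_file, attached to a drm_syncobj, and so on. */
		set_bit(MY_DMA_FENCE_FLAG_EXPORTED, &fence->flags);
	}

	int dma_fence_signal(struct dma_fence *fence)
	{
		unsigned long flags;
		int ret;
		bool exported, tmp = true;

		if (!fence)
			return -EINVAL;

		/* Only arm the annotation once the fence has actually been exported. */
		exported = test_bit(MY_DMA_FENCE_FLAG_EXPORTED, &fence->flags);
		if (exported)
			tmp = dma_fence_begin_signalling();

		spin_lock_irqsave(fence->lock, flags);
		ret = dma_fence_signal_locked(fence);
		spin_unlock_irqrestore(fence->lock, flags);

		if (exported)
			dma_fence_end_signalling(tmp);

		return ret;
	}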
How do I see in lockdep splats if it was a read or write user? Your patch appears to have:
dma_fence_signal:
	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);

__dma_fence_might_wait:
	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Both of which seem like read locks. I don't fully understand the lockdep API so I might be wrong, not sure. But neither do I see a difference in the splats telling me which path is which.
Regards,
Tvrtko
On Thu, Jun 11, 2020 at 12:36 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
What about a dma_fence_export helper which would "arm" the annotations? It would be called as soon as the fence is exported. Maybe when added to dma_resv, or exported via sync_file, etc. Before that point begin/end_signalling and so on would be no-ops.
Run CI without the i915 annotation patch, nothing breaks.
So we can gradually fix up existing code that doesn't quite get it right and move on.
How do I see in lockdep splats if it was a read or write user? Your patch appears to have:
dma_fence_signal:
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
__dma_fence_might_wait:
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Both of which seem like read locks. I don't fully understand the lockdep API so I might be wrong, not sure. But neither do I see a difference in the splats telling me which path is which.
I think you got tricked by the implementation; this isn't quite what's going on. There are two things which make the annotations special:
- we want a recursive read lock on the signalling critical section. The problem is that lockdep doesn't implement full validation for recursive read locks; only non-recursive read/write locks are fully validated. There are some checks for recursive read locks, but exactly the checks we need to catch common dma_fence_wait deadlocks aren't done. That's why we need to implement manual lock recursion on the reader side.
- now on the write side we additionally need to implement a read2write upgrade and a write2read downgrade. Lockdep doesn't implement that, so again we have to hand-roll this.
Let's go through the code line-by-line:
bool tmp;
tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
We check whether someone is holding the non-recursive read lock already.
if (tmp)
	lock_release(&dma_fence_lockdep_map, _THIS_IP_);
If that's the case, we drop that read lock.
lock_map_acquire(&dma_fence_lockdep_map);
Then we do the actual might_wait annotation; the above takes the full write lock ...
lock_map_release(&dma_fence_lockdep_map);
... and now we release the write lock again.
if (tmp)
	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Finally we need to re-acquire the read lock, if we held it when entering this function. This annotation naturally has to exactly match what begin_signalling would do, otherwise the hand-rolled nesting would fall apart.
I hope that explains what's going on here, and assures you that might_wait() is indeed a write lock annotation, but with a big pile of complications. -Daniel
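As a concrete illustration of what that write-side annotation buys (the driver code below is hypothetical; only the dma-fence calls are from the patch): if any lock is taken after dma_fence_begin_signalling() and is still held when a nested dma_fence_wait() runs, the write acquire inside __dma_fence_might_wait() happens under that lock and lockdep complains immediately, exactly as described in the commit message:

	/* Hypothetical submission tail holding a lock across a nested wait. */
	static void my_submit_tail(struct my_ctx *ctx)
	{
		bool cookie = dma_fence_begin_signalling();	/* read side held */

		mutex_lock(&ctx->lock);
		/* The write-side acquire/release in __dma_fence_might_wait() now runs
		 * while ctx->lock is held, which lockdep flags right away. */
		dma_fence_wait(ctx->prereq_fence, false);
		mutex_unlock(&ctx->lock);

		dma_fence_signal(ctx->out_fence);
		dma_fence_end_signalling(cookie);
	}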
On 11/06/2020 12:29, Daniel Vetter wrote:
Run CI without the i915 annotation patch, nothing breaks.
I think some parts of i915 would still break with my idea to only apply annotations on exported fences. What do you dislike about that idea? I thought the point is to enforce rules for _exported_ fences.
With the way you have annotated dma_fence_work you can't say; maybe the fence is exported, maybe it isn't. I think it is, btw, so the splats would still be there, but I am not sure it is conceptually correct.
At least my understanding is GFP_KERNEL allocations are only disallowed by the virtue of the global dma-fence contract. If you want to enforce they are never used for anything but exporting, then that would be a bit harsh, no?
Another example from the CI run:
[26.585357]        CPU0                    CPU1
[26.585359]        ----                    ----
[26.585360]   lock(dma_fence_map);
[26.585362]                               lock(mmu_notifier_invalidate_range_start);
[26.585365]                               lock(dma_fence_map);
[26.585367]   lock(i915_gem_object_internal/1);
[26.585369]  *** DEADLOCK ***
Lets say someone submitted an execbuf using userptr as a batch and then unmapped it immediately. That would explain CPU1 getting into the mmu notifier and waiting on this batch to unbind the object.
Meanwhile CPU0 is the async command parser for this request trying to lock the shadow batch buffer. Because it uses the dma_fence_work this is between the begin/end signalling markers.
It can be the same dma-fence I think, since we install the async parser fence on the real batch dma-resv, but dma_fence_map is not a real lock, so what is actually preventing progress in this case?
CPU1 is waiting on a fence, but CPU0 can obtain the lock(i915_gem_object_internal/1), proceed to parse the batch, and exit the signalling section. At which point CPU1 is still blocked, waiting until the execbuf finishes and then mmu notifier can finish and invalidate the pages.
Maybe I am missing something but I don't see how this one is real.
So we can gradually fix up existing code that doesn't quite get it right and move on.
+/**
+ * dma_fence_begin_signalling - begin a critical DMA fence signalling section
+ *
+ * Drivers should use this to annotate the beginning of any code section
+ * required to eventually complete &dma_fence by calling dma_fence_signal().
+ *
+ * The end of these critical sections are annotated with
+ * dma_fence_end_signalling().
+ *
+ * Returns:
+ *
+ * Opaque cookie needed by the implementation, which needs to be passed to
+ * dma_fence_end_signalling().
+ */
+bool dma_fence_begin_signalling(void)
+{
+        /* explicitly nesting ... */
+        if (lock_is_held_type(&dma_fence_lockdep_map, 1))
+                return true;
+
+        /* rely on might_sleep check for soft/hardirq locks */
+        if (in_atomic())
+                return true;
+
+        /* ... and non-recursive readlock */
+        lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
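The hunk above is trimmed before the function returns; for context, the matching end-side annotation in the patch presumably mirrors the same checks, roughly:

        void dma_fence_end_signalling(bool cookie)
        {
                /* nested inside an outer annotation, nothing was acquired */
                if (cookie)
                        return;

                /* atomic contexts rely on the might_sleep() check instead */
                if (in_atomic())
                        return;

                lock_release(&dma_fence_lockdep_map, _RET_IP_);
        }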
Would it work if the signalling path marked itself as a write lock? I am thinking it would be nice to see in lockdep splats which paths are signals and which are waits.
Yeah it'd be nice to have a read vs write name for the lock. But we already have this problem for e.g. flush_work(), from which I've stolen this idea. So it's not really new. Essentially look at the backtraces lockdep gives you, and reconstruct the deadlock. I'm hoping that people will notice the special functions on the backtrace, e.g. dma_fence_begin_signalling will be listed as offending function/lock holder, and then read the kerneldoc.
The recursive usage wouldn't work then right? Would write annotation on the wait path work?
Wait path is write annotations already, but yeah annotating the signalling side as write would cause endless amounts of false positives. Also, the recursion is what makes these annotations composable: e.g. what I've done in amdgpu with annotations in the tdr work in drm/scheduler, annotations in the amdgpu gpu reset code and then also annotations in atomic code, which all nest within each other in some call chains, but not others. Dropping the recursion would break that and make it really awkward to annotate such cases correctly.
And the recursion only works if it's read locks, otherwise lockdep complains if you have inconsistent annotations on the signalling side (which again would make it more or less impossible to annotate the above case fully).
How do I see in lockdep splats if it was a read or write user? Your patch appears to have:
dma_fence_signal:
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
__dma_fence_might_wait:
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Which both seem like a read lock. I don't fully understand the lockdep API so I might be wrong, not sure. But neither do I see a difference in the splats telling me which path is which.
I think you got tricked by the implementation, this isn't quite what's going on. There's two things which make the annotations special:
- we want a recursive read lock on the signalling critical section.
The problem is that lockdep doesn't implement full validation for recursive read locks; only non-recursive read/write locks are fully validated. There's some checks for recursive read locks, but exactly the checks we need to catch common dma_fence_wait deadlocks aren't done. That's why we need to implement manual lock recursion on the reader side.
- now on the write side we additionally need to implement a read2write upgrade and a write2read downgrade. Lockdep doesn't implement that, so again we have to hand-roll this.
Let's go through the code line-by-line:
bool tmp; tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
We check whether someone is holding the non-recursive read lock already.
if (tmp) lock_release(&dma_fence_lockdep_map, _THIS_IP_);
If that's the case, we drop that read lock.
lock_map_acquire(&dma_fence_lockdep_map);
Then we do the actual might_wait annotation, the above takes the full write lock ...
lock_map_release(&dma_fence_lockdep_map);
... and now we release the write lock again.
if (tmp) lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Finally we need to re-acquire the read lock, if we've held that when entering this function. This annotation naturally has to exactly match what begin_signalling would do, otherwise the hand-rolled nesting would fall apart.
I hope that explains what's going on here, and assures you that might_wait() is indeed a write lock annotation, but with a big pile of complications.
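Reassembled from the fragments just walked through (so a sketch of the whole function rather than the literal hunk), the wait-side annotation as a whole is roughly:

        void __dma_fence_might_wait(void)
        {
                bool tmp;

                /* are we nested inside a hand-rolled read-side (signalling) section? */
                tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
                if (tmp)
                        lock_release(&dma_fence_lockdep_map, _THIS_IP_);

                /* the actual wait annotation: take and drop the write lock */
                lock_map_acquire(&dma_fence_lockdep_map);
                lock_map_release(&dma_fence_lockdep_map);

                /* re-acquire the read-side annotation we entered with */
                if (tmp)
                        lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
        }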
I am certainly confused by the difference between lock_map_acquire/release and lock_acquire/release. What is the difference between the two?
Regards,
Tvrtko
On Thu, Jun 11, 2020 at 4:29 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
On 11/06/2020 12:29, Daniel Vetter wrote:
On Thu, Jun 11, 2020 at 12:36 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
On 10/06/2020 16:17, Daniel Vetter wrote:
On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
On 04/06/2020 09:12, Daniel Vetter wrote:
Design is similar to the lockdep annotations for workers, but with some twists:
We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f Author: Peter Zijlstra peterz@infradead.org Date: Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be spinlocks. That bollocks.
The approach now is to simple check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
The main class of deadlocks this is supposed to catch are:
Thread A:
mutex_lock(A); mutex_unlock(A); dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindless to avoid false positives.
v2: handle soft/hardirq ctx better against write side and dont forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Documentation/driver-api/dma-buf.rst | 12 +- drivers/dma-buf/dma-fence.c | 161 +++++++++++++++++++++++++++ include/linux/dma-fence.h | 12 ++ 3 files changed, 182 insertions(+), 3 deletions(-)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 63dec76d1d8d..05d856131140 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects
 .. kernel-doc:: drivers/dma-buf/dma-buf.c
    :doc: cpu access

-Fence Poll Support
-~~~~~~~~~~~~~~~~~~
+Implicit Fence Poll Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. kernel-doc:: drivers/dma-buf/dma-buf.c
-   :doc: fence polling
+   :doc: implicit fence polling

 Kernel Functions and Structures Reference

@@ -133,6 +133,12 @@ DMA Fences
 .. kernel-doc:: drivers/dma-buf/dma-fence.c
    :doc: DMA fences overview

+DMA Fence Signalling Annotations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+   :doc: fence signalling annotation
+
 DMA Fences Functions Reference

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 656e9ac2d028..0005bc002529 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
+/**
+ * DOC: fence signalling annotation
+ *
+ * Proving correctness of all the kernel code around &dma_fence through code
+ * review and testing is tricky for a few reasons:
+ *
+ * * It is a cross-driver contract, and therefore all drivers must follow the
+ *   same rules for lock nesting order, calling contexts for various functions
+ *   and anything else significant for in-kernel interfaces. But it is also
+ *   impossible to test all drivers in a single machine, hence brute-force N vs.
+ *   N testing of all combinations is impossible. Even just limiting to the
+ *   possible combinations is infeasible.
+ *
+ * * There is an enormous amount of driver code involved. For render drivers
+ *   there's the tail of command submission, after fences are published,
+ *   scheduler code, interrupt and workers to process job completion,
+ *   and timeout, gpu reset and gpu hang recovery code. Plus for integration
+ *   with core mm we have &mmu_notifier, respectively &mmu_interval_notifier,
+ *   and &shrinker. For modesetting drivers there's the commit tail functions
+ *   between when fences for an atomic modeset are published, and when the
+ *   corresponding vblank completes, including any interrupt processing and
+ *   related workers. Auditing all that code, across all drivers, is not
+ *   feasible.
+ *
+ * * Due to how many other subsystems are involved and the locking hierarchies
+ *   this pulls in there is extremely thin wiggle-room for driver-specific
+ *   differences. &dma_fence interacts with almost all of the core memory
+ *   handling through page fault handlers via &dma_resv, dma_resv_lock() and
+ *   dma_resv_unlock(). On the other side it also interacts through all
+ *   allocation sites through &mmu_notifier and &shrinker.
+ *
+ * Furthermore lockdep does not handle cross-release dependencies, which means
+ * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught
+ * at runtime with some quick testing. The simplest example is one thread
+ * waiting on a &dma_fence while holding a lock::
+ *
+ *     lock(A);
+ *     dma_fence_wait(B);
+ *     unlock(A);
+ *
+ * while the other thread is stuck trying to acquire the same lock, which
+ * prevents it from signalling the fence the previous thread is stuck waiting
+ * on::
+ *
+ *     lock(A);
+ *     unlock(A);
+ *     dma_fence_signal(B);
+ *
+ * By manually annotating all code relevant to signalling a &dma_fence we can
+ * teach lockdep about these dependencies, which also helps with the validation
+ * headache since now lockdep can check all the rules for us::
+ *
+ *    cookie = dma_fence_begin_signalling();
+ *    lock(A);
+ *    unlock(A);
+ *    dma_fence_signal(B);
+ *    dma_fence_end_signalling(cookie);
+ *
+ * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to
+ * annotate critical sections the following rules need to be observed:
+ *
+ * * All code necessary to complete a &dma_fence must be annotated, from the
+ *   point where a fence is accessible to other threads, to the point where
+ *   dma_fence_signal() is called. Un-annotated code can contain deadlock issues,
+ *   and due to the very strict rules and many corner cases it is infeasible to
+ *   catch these just with review or normal stress testing.
+ *
+ * * &struct dma_resv deserves a special note, since the readers are only
+ *   protected by rcu. This means the signalling critical section starts as soon
+ *   as the new fences are installed, even before dma_resv_unlock() is called.
+ *
+ * * The only exception are fast paths and opportunistic signalling code, which
+ *   calls dma_fence_signal() purely as an optimization, but is not required to
+ *   guarantee completion of a &dma_fence. The usual example is a wait IOCTL
+ *   which calls dma_fence_signal(), while the mandatory completion path goes
+ *   through a hardware interrupt and possible job completion worker.
+ *
+ * * To aid composability of code, the annotations can be freely nested, as long
+ *   as the overall locking hierarchy is consistent. The annotations also work
+ *   both in interrupt and process context. Due to implementation details this
+ *   requires that callers pass an opaque cookie from
+ *   dma_fence_begin_signalling() to dma_fence_end_signalling().
+ *
+ * * Validation against the cross driver contract is implemented by priming
+ *   lockdep with the relevant hierarchy at boot-up. This means even just
+ *   testing with a single device is enough to validate a driver, at least as
+ *   far as deadlocks with dma_fence_wait() against dma_fence_signal() are
+ *   concerned.
+ */
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map dma_fence_lockdep_map = {
+        .name = "dma_fence_map"
+};
Maybe a stupid question because this is definitely complicated, but.. If you have a single/static/global lockdep map, doesn't this mean _all_ locks, from _all_ drivers happening to use dma-fences will get recorded in it. Will this work and not cause false positives?
Sounds like it could create a common link between two completely unconnected usages. Because below you do add annotations to generic dma_fence_signal and dma_fence_wait.
This is fully intentional. dma-fence is a cross-driver interface; if every driver invents its own rules about how this should work, we have an unmaintainable and unreviewable mess.
I've typed up the full length rant already here:
https://lore.kernel.org/dri-devel/CAKMK7uGnFhbpuurRsnZ4dvRV9gQ_3-rmSJaoqSFY=...
But "perfect storm" of:
- global fence lockmap
- mmu notifiers
- fs reclaim
- default annotations in dma_fence_signal / dma_fence_wait
Equals: anything ever using dma_fence will end up in impossible chains with random other drivers, even if neither driver has code to export/share that fence.
Example from the CI run:
[25.918788] Chain exists of: fs_reclaim --> mmu_notifier_invalidate_range_start --> dma_fence_map
[25.918794] Possible unsafe locking scenario:
[25.918797]        CPU0                    CPU1
[25.918799]        ----                    ----
[25.918801]   lock(dma_fence_map);
[25.918803]                               lock(mmu_notifier_invalidate_range_start);
[25.918807]                               lock(dma_fence_map);
[25.918809]   lock(fs_reclaim);
What about a dma_fence_export helper which would "arm" the annotations? It would be called as soon as the fence is exported. Maybe when added to dma_resv, or exported via sync_file, etc. Before that point begin/end_signalling and so on would be no-ops.
Run CI without the i915 annotation patch, nothing breaks.
I think some parts of i915 would still break with my idea to only apply annotations on exported fences. What do you dislike about that idea? I thought the point is to enforce rules for _exported_ fences.
dma_fence is a shared concept, this is upstream, drivers are expected to a) use shared concepts and b) use them in a consistent way. If drivers do whatever they feel like then they're not maintainable in the upstream sense of "maintainable even if the vendor walks away". This was the reason why amd had to spend 2 years refactoring from DAL (which used all the helpers they shared with their firmware/windows driver) to DC (which uses all the upstream kms helpers and datastructures directly).
The way you have annotated dma_fence_work, you can't say whether it is exported or not. I think it is, btw, so splats would still be there, but I am not sure it is conceptually correct.
At least my understanding is GFP_KERNEL allocations are only disallowed by the virtue of the global dma-fence contract. If you want to enforce they are never used for anything but exporting, then that would be a bit harsh, no?
Another example from the CI run:
[26.585357]        CPU0                    CPU1
[26.585359]        ----                    ----
[26.585360]   lock(dma_fence_map);
[26.585362]                               lock(mmu_notifier_invalidate_range_start);
[26.585365]                               lock(dma_fence_map);
[26.585367]   lock(i915_gem_object_internal/1);
[26.585369]  *** DEADLOCK ***
So ime the above deadlock summaries tend to be wrong as soon as you have more than 2 locks involved. Which we have here - they only ever show at most 2 threads, with each thread only taking 2 locks in total, which isn't going to deadlock if you have more than 2 locks involved. Which is the case above.
Personally I just ignore the above deadlock scenario and just always look at all the locks and backtraces lockdep gives me, and then reconstruct the dependency graph by hand myself, including deadlock scenario.
Lets say someone submitted an execbuf using userptr as a batch and then unmapped it immediately. That would explain CPU1 getting into the mmu notifier and waiting on this batch to unbind the object.
Meanwhile CPU0 is the async command parser for this request trying to lock the shadow batch buffer. Because it uses the dma_fence_work this is between the begin/end signalling markers.
It can be the same dma-fence I think, since we install the async parser fence on the real batch dma-resv, but dma_fence_map is not a real lock, so what is actually preventing progress in this case?
CPU1 is waiting on a fence, but CPU0 can obtain the lock(i915_gem_object_internal/1), proceed to parse the batch, and exit the signalling section. At which point CPU1 is still blocked, waiting until the execbuf finishes and then mmu notifier can finish and invalidate the pages.
Maybe I am missing something but I don't see how this one is real.
The above doesn't deadlock, and it also shouldn't result in a lockdep splat. The trouble is when the signalling thread also grabs i915_gem_object_internal/1 somewhere. Which if you go through full CI results you see there's more involved (and at least one of the splats is all just lockdep priming and might_lock, so could be an annotation bug on top), and there is indeed a path where we lock the driver private lock in more places, and the wrong way round. That's the thing lockdep is complaining about, it's just not making that clear in the summary because the summary is only ever correct for 2 locks. Not if more is involved.
So we can gradually fix up existing code that doesn't quite get it right and move on.
+/**
+ * dma_fence_begin_signalling - begin a critical DMA fence signalling section
+ *
+ * Drivers should use this to annotate the beginning of any code section
+ * required to eventually complete &dma_fence by calling dma_fence_signal().
+ *
+ * The end of these critical sections are annotated with
+ * dma_fence_end_signalling().
+ *
+ * Returns:
+ *
+ * Opaque cookie needed by the implementation, which needs to be passed to
+ * dma_fence_end_signalling().
+ */
+bool dma_fence_begin_signalling(void)
+{
+        /* explicitly nesting ... */
+        if (lock_is_held_type(&dma_fence_lockdep_map, 1))
+                return true;
+
+        /* rely on might_sleep check for soft/hardirq locks */
+        if (in_atomic())
+                return true;
+
+        /* ... and non-recursive readlock */
+        lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
Would it work if the signalling path marked itself as a write lock? I am thinking it would be nice to see in lockdep splats which paths are signals and which are waits.
Yeah it'd be nice to have a read vs write name for the lock. But we already have this problem for e.g. flush_work(), from which I've stolen this idea. So it's not really new. Essentially look at the backtraces lockdep gives you, and reconstruct the deadlock. I'm hoping that people will notice the special functions on the backtrace, e.g. dma_fence_begin_signalling will be listed as offending function/lock holder, and then read the kerneldoc.
The recursive usage wouldn't work then right? Would write annotation on the wait path work?
Wait path is write annotations already, but yeah annotating the signalling side as write would cause endless amounts of false positives. Also, the recursion is what makes these annotations composable: e.g. what I've done in amdgpu with annotations in the tdr work in drm/scheduler, annotations in the amdgpu gpu reset code and then also annotations in atomic code, which all nest within each other in some call chains, but not others. Dropping the recursion would break that and make it really awkward to annotate such cases correctly.
And the recursion only works if it's read locks, otherwise lockdep complains if you have inconsistent annotations on the signalling side (which again would make it more or less impossible to annotate the above case fully).
How do I see in lockdep splats if it was a read or write user? Your patch appears to have:
dma_fence_signal:
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
__dma_fence_might_wait:
lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Which both seem like a read lock. I don't fully understand the lockdep API so I might be wrong, not sure. But neither do I see a difference in the splats telling me which path is which.
I think you got tricked by the implementation, this isn't quite what's going on. There's two things which make the annotations special:
- we want a recursive read lock on the signalling critical section.
The problem is that lockdep doesn't implement full validation for recursive read locks; only non-recursive read/write locks are fully validated. There's some checks for recursive read locks, but exactly the checks we need to catch common dma_fence_wait deadlocks aren't done. That's why we need to implement manual lock recursion on the reader side.
- now on the write side we additionally need to implement a read2write upgrade and a write2read downgrade. Lockdep doesn't implement that, so again we have to hand-roll this.
Let's go through the code line-by-line:
bool tmp; tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
We check whether someone is holding the non-recursive read lock already.
if (tmp) lock_release(&dma_fence_lockdep_map, _THIS_IP_);
If that's the case, we drop that read lock.
lock_map_acquire(&dma_fence_lockdep_map);
Then we do the actual might_wait annotation, the above takes the full write lock ...
lock_map_release(&dma_fence_lockdep_map);
... and now we release the write lock again.
if (tmp) lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
Finally we need to re-acquire the read lock, if we've held that when entering this function. This annotation naturally has to exactly match what begin_signalling would do, otherwise the hand-rolled nesting would fall apart.
I hope that explains what's going on here, and assures you that might_wait() is indeed a write lock annotation, but with a big pile of complications.
I am certainly confused by the difference between lock_map_acquire/release and lock_acquire/release. What is the difference between the two?
lock_map_acquire/release is a wrapper around lock_acquire/release. This is all lockdep internal, it's a completely undocumented maze, so unfortunately the only option is to really carefully follow all the definitions from the various locking primitives. And then compare with the lockdep self-tests (which use the locking primitives, not the lockdep internals) to see which flag controls which kind of behaviour.
That's at least what I do, and it's horrible. But yeah lockdep doesn't have documentation for this.
If you think it's better to open code the lock_map/acquire, I guess I can do that. But it's a mess, so I need to carefully retest everything and make sure I've set the right flags and bits - for added fun they also change ordering in some of the wrappers! -Daniel
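For the record, in recent kernels the lock_map_* helpers in include/linux/lockdep.h are thin macros over lock_acquire()/lock_release(); from memory (worth double-checking against the actual tree) the relevant definitions are roughly:

        #define lock_acquire_exclusive(l, s, t, n, i)           lock_acquire(l, s, t, 0, 1, n, i)
        #define lock_acquire_shared(l, s, t, n, i)              lock_acquire(l, s, t, 1, 1, n, i)
        #define lock_acquire_shared_recursive(l, s, t, n, i)    lock_acquire(l, s, t, 2, 1, n, i)

        #define lock_map_acquire(l)             lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_)
        #define lock_map_acquire_read(l)        lock_acquire_shared_recursive(l, 0, 0, NULL, _THIS_IP_)
        #define lock_map_release(l)             lock_release(l, _THIS_IP_)

So lock_map_acquire() requests the exclusive (write) flavour, while the open-coded lock_acquire(..., 0, 0, 1, 1, NULL, ...) calls in the patch request the non-recursive shared (read) flavour, which is exactly the read-vs-write distinction the walkthrough above relies on.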
Quoting Daniel Vetter (2020-06-04 09:12:09)
Design is similar to the lockdep annotations for workers, but with some twists:
We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f Author: Peter Zijlstra peterz@infradead.org Date: Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be spinlocks. That bollocks.
The approach now is to simple check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
The main class of deadlocks this is supposed to catch are:
Thread A:
mutex_lock(A); mutex_unlock(A); dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindless to avoid false positives.
v2: handle soft/hardirq ctx better against write side and dont forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Introducing a global lockmap that cannot capture the rules correctly,
Nacked-by: Chris Wilson chris@chris-wilson.co.uk
-Chris
On Thu, 11 Jun 2020 at 18:01, Chris Wilson chris@chris-wilson.co.uk wrote:
Quoting Daniel Vetter (2020-06-04 09:12:09)
Design is similar to the lockdep annotations for workers, but with some twists:
We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f Author: Peter Zijlstra peterz@infradead.org Date: Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be spinlocks. That bollocks.
The approach now is to simple check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
The main class of deadlocks this is supposed to catch are:
Thread A:
mutex_lock(A); mutex_unlock(A); dma_fence_signal();
Thread B:
mutex_lock(A); dma_fence_wait(); mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindless to avoid false positives.
v2: handle soft/hardirq ctx better against write side and dont forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com
Introducing a global lockmap that cannot capture the rules correctly,
Can you document the rules all drivers should be following then, because from here it looks to get refactored every version of i915, and it would be nice if we could all aim for the same set of things roughly. We've already had enough problems with amdgpu vs i915 vs everyone else with fences, if this stops that in the future then I'd rather we have that than just some unwritten rules per driver and untestable.
Dave.
Hi,
On Thu, 11 Jun 2020 at 09:44, Dave Airlie airlied@gmail.com wrote:
On Thu, 11 Jun 2020 at 18:01, Chris Wilson chris@chris-wilson.co.uk wrote:
Introducing a global lockmap that cannot capture the rules correctly,
Can you document the rules all drivers should be following then, because from here it looks to get refactored every version of i915, and it would be nice if we could all aim for the same set of things roughly. We've already had enough problems with amdgpu vs i915 vs everyone else with fences, if this stops that in the future then I'd rather we have that than just some unwritten rules per driver and untestable.
As someone who has sunk a bunch of work into explicit-fencing awareness in my compositor so I can never be blocked, I'd be disappointed if the infrastructure was ultimately pointless because the documented fencing rules were _o_/ or thereabouts. Lockdep definitely isn't my area of expertise so I can't comment on the patch per se, but having something to ensure we don't hit deadlocks sure seems a lot better than nothing.
Cheers, Daniel
Quoting Daniel Stone (2020-06-11 10:01:46)
Hi,
On Thu, 11 Jun 2020 at 09:44, Dave Airlie airlied@gmail.com wrote:
On Thu, 11 Jun 2020 at 18:01, Chris Wilson chris@chris-wilson.co.uk wrote:
Introducing a global lockmap that cannot capture the rules correctly,
Can you document the rules all drivers should be following then, because from here it looks to get refactored every version of i915, and it would be nice if we could all aim for the same set of things roughly. We've already had enough problems with amdgpu vs i915 vs everyone else with fences, if this stops that in the future then I'd rather we have that than just some unwritten rules per driver and untestable.
As someone who has sunk a bunch of work into explicit-fencing awareness in my compositor so I can never be blocked, I'd be disappointed if the infrastructure was ultimately pointless because the documented fencing rules were _o_/ or thereabouts. Lockdep definitely isn't my area of expertise so I can't comment on the patch per se, but having something to ensure we don't hit deadlocks sure seems a lot better than nothing.
This is doing dependency analysis on execution contexts which is a far cry from doing the fence dependency analysis, and so has to actively ignore the cycles that must exist on the dma side, and also the cycles that prevent entering execution contexts on the CPU. It has to actively ignore scheduler execution contexts, for lockdep cries, and so we do not get analysis of the locking contexts along that path. This would be solvable along the lines of extending lockdep ala lockdep_dma_enter().
Had i915's execution flow been marked up, it should have found the dubious wait for external fences inside the dead GPU recovery, and probably found a few more things to complain about with the reset locking. [Note we already do the same annotations for wait-vs-reset, but not reset-vs-execution.]
Determination of which waits are legal and which are not is entirely ad hoc, for there is no status change tracking in the dependency analysis [that is once an execution context is linked to a published fence, again integral to lockdep.] Consider if the completion chain in atomic is swapped out for the morally equivalent fences along intertwined timelines, and so it does a bunch of dma_fence_wait() instead. Why are those waits legal despite them being after we have committed to fulfilling the out fence? [Why are the waits on and for the GPU legal, since they equally block execution flow?]
Forcing a generic primitive to always be part of the same global map is horrible. You forgo being able to use the primitive for unrelated tasks, lose the ability to name particular contexts to gain more informative dependency cycle reports from having the explicit linkage. You can add wait_map tracking without loss of generality [in less than 10 lines], and you can still enforce that all fences used for a common purpose follow the same rules [the simplest way being to default to the singular wait_map]. But it's the explicitly named execution contexts that are the biggest boon to reading the code and reading the lockdep warns.
This is a bunch of ad hoc tracking for a very narrow purpose applied globally, with loss of information. -Chris
On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Quoting Daniel Stone (2020-06-11 10:01:46)
Hi,
On Thu, 11 Jun 2020 at 09:44, Dave Airlie airlied@gmail.com wrote:
On Thu, 11 Jun 2020 at 18:01, Chris Wilson chris@chris-wilson.co.uk wrote:
Introducing a global lockmap that cannot capture the rules correctly,
Can you document the rules all drivers should be following then, because from here it looks to get refactored every version of i915, and it would be nice if we could all aim for the same set of things roughly. We've already had enough problems with amdgpu vs i915 vs everyone else with fences, if this stops that in the future then I'd rather we have that than just some unwritten rules per driver and untestable.
As someone who has sunk a bunch of work into explicit-fencing awareness in my compositor so I can never be blocked, I'd be disappointed if the infrastructure was ultimately pointless because the documented fencing rules were _o_/ or thereabouts. Lockdep definitely isn't my area of expertise so I can't comment on the patch per se, but having something to ensure we don't hit deadlocks sure seems a lot better than nothing.
This is doing dependency analysis on execution contexts which is a far cry from doing the fence dependency analysis, and so has to actively ignore the cycles that must exist on the dma side, and also the cycles that prevent entering execution contexts on the CPU. It has to actively ignore scheduler execution contexts, for lockdep cries, and so we do not get analysis of the locking contexts along that path. This would be solvable along the lines of extending lockdep ala lockdep_dma_enter().
drm/scheduler is annotated, found some rather improbable-to-hit issues in practice. But from the quick chat I've had with König and others I think he agrees that it's real at least in the theoretical sense. Probably should consider playing the lottery if you hit it in practice though :-)
Had i915's execution flow been marked up, it should have found the dubious wait for external fences inside the dead GPU recovery, and probably found a few more things to complain about with the reset locking. [Note we already do the same annotations for wait-vs-reset, but not reset-vs-execution.]
I know it splats, that's why the tdr annotation patch comes with a spec proposal for lifting the wait busting we do in i915 to the dma_fence level. I included that because amdgpu has the same problem on modern hw. Apparently their planned fix (because they've hit this bug in testing) was to push some shared lock down into their atomic_comit_tail function and use that in gpu reset, so don't seem that interested in extending dma_fence.
For i915 it's just gen2/3 display, and cross-driver dma-buf/fence usage for those is nil and won't change. Pragmatic solution imo would be to just not annotate gpu reset on these platforms, and relying on our wait busting plus igt tests to make sure it keeps working as-is. The point of the explicit annotations for the signalling side is very much that it can be rolled out gradually, and entirely left out for old legacy paths that aren't worth fixing.
Determination of which waits are legal and which are not is entirely ad hoc, for there is no status change tracking in the dependency analysis [that is once an execution context is linked to a published fence, again integral to lockdep.] Consider if the completion chain in atomic is swapped out for the morally equivalent fences along intertwined timelines, and so it does a bunch of dma_fence_wait() instead. Why are those waits legal despite them being after we have committed to fulfilling the out fence? [Why are the waits on and for the GPU legal, since they equally block execution flow?]
No need to consider, it's already real and resulted in some pretty splats until I got the recursion handling right.
Forcing a generic primitive to always be part of the same global map is horrible. You forgo being able to use the primitive for unrelated tasks, lose the ability to name particular contexts to gain more informative dependency cycle reports from having the explicit linkage. You can add wait_map tracking without loss of generality [in less than 10 lines], and you can still enforce that all fences used for a common purpose follow the same rules [the simplest way being to default to the singular wait_map]. But it's the explicitly named execution contexts that are the biggest boon to reading the code and reading the lockdep warns.
So one thing that's maybe not clear here: This doesn't track the DAG of dependencies. Doesn't even try, I'm still faithfully assuming drivers get that part right. Which is a gap and maybe we should fix this, but not the goal here.
All this does is validate fences against anything else that might be going on in the system. E.g. your recursion example for atomic is handled by just assuming that any dma_fence_wait within a signalling section is legit and correct. We can add this later on, but not with lockdep, since lockdep works with classes. And proving that dma_fences are acyclic requires you to track them all as individuals. Entirely different things.
That still leaves the below:
Forcing a generic primitive to always be part of the same global map is horrible.
And no concrete example or reason for why that's not possible. Because frankly it's not horrible, this is what upstream is all about: Shared concepts, shared contracts, shared code.
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
This is a bunch of ad hoc tracking for a very narrow purpose applied globally, with loss of information.
It doesn't solve every problem indeed. I'm happy to review patches to check acyclic-ness of dma-fence at the global level from you, I haven't figured out yet how to make that happen. I know i915-gem has that, but this is about the cross-driver contract here. -Daniel
Quoting Daniel Vetter (2020-06-19 09:51:59)
On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Forcing a generic primitive to always be part of the same global map is horrible.
And no concrete example or reason for why that's not possible. Because frankly it's not horrible, this is what upstream is all about: Shared concepts, shared contracts, shared code.
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
Then I have not clearly communicated, as my opinion is not that validation is worthless, but that the implementation is enshrining a global property on a low level primitive that prevents it from being used elsewhere. And I want to replace completion [chains] with fences, and bio with fences, and closures with fences, and what other equivalencies there are in the kernel. The fence is as central a locking construct as struct completion and deserves to be a foundational primitive provided by kernel/ used throughout all drivers for discrete problem domains.
This is narrowing dma_fence, whereas adding struct lockdep_map *dma_fence::wait_map and annotating linkage allows you to continue to specify that all dma_fence used for a particular purpose must follow common rules, without restricting the primitive for uses outside of this scope. -Chris
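A hypothetical sketch of what that per-fence wait_map could look like, purely to illustrate the proposal (none of this exists; the helper name is made up):

        /* hypothetical extension of struct dma_fence, other members elided */
        struct dma_fence {
                /* which contract this fence's waiters are validated against */
                struct lockdep_map *wait_map;
        };

        /*
         * Drivers with their own rules would point wait_map at a private map;
         * everything else defaults to the singular cross-driver dma_fence_map.
         */
        static void dma_fence_might_wait_annotated(struct dma_fence *fence)
        {
                struct lockdep_map *map = fence->wait_map ?: &dma_fence_lockdep_map;

                lock_map_acquire(map);
                lock_map_release(map);
        }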
On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
Quoting Daniel Vetter (2020-06-19 09:51:59)
On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Forcing a generic primitive to always be part of the same global map is horrible.
And no concrete example or reason for why that's not possible. Because frankly it's not horrible, this is what upstream is all about: Shared concepts, shared contracts, shared code.
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
Then I have not clearly communicated, as my opinion is not that validation is worthless, but that the implementation is enshrining a global property on a low level primitive that prevents it from being used elsewhere. And I want to replace completion [chains] with fences, and bio with fences, and closures with fences, and what other equivalencies there are in the kernel. The fence is as central a locking construct as struct completion and deserves to be a foundational primitive provided by kernel/ used throughout all drivers for discrete problem domains.
This is narrowing dma_fence, whereas adding struct lockdep_map *dma_fence::wait_map and annotating linkage allows you to continue to specify that all dma_fence used for a particular purpose must follow common rules, without restricting the primitive for uses outside of this scope.
Somewhere else in this thread I had discussions with Jason Gunthorpe about this topic. It might maybe change somewhat depending upon exact rules, but his take is very much "I don't want dma_fence in rdma". Or pretty close to that at least.
Similar discussions with habanalabs, they're using dma_fence internally without any of the uapi. Discussion there has also now concluded that it's best if they remove them, and simply switch over to a wait_queue or completion like every other driver does.
The next round of the patches already have a paragraph to at least somewhat limit how non-gpu drivers use dma_fence. And I guess actual consensus might be pointing even more strongly at dma_fence being solely something for gpus and closely related subsystem (maybe media) for syncing dma-buf access.
So dma_fence as general replacement for completion chains I think just wont happen.
What might make sense is if e.g. the lockdep annotations could be reused, at least in design, for wait_queue or completion or anything else really. I do think that has a fair chance compared to the automagic cross-release annotations approach, which relied way too heavily on guessing where barriers are. My experience from just a bit of playing around with these patches here and discussing them with other driver maintainers is that accurately deciding where critical sections start and end is a job for humans only. And if you get it wrong, you will have a false positive.
And you're indeed correct that if we'd do annotations for completions and wait queues, then that would need to have a class per semantically equivalent user, like we have lockdep classes for mutexes, not just one overall.
But dma_fence otoh is something very specific, which comes with very specific rules attached - it's not a generic wait_queue at all. Originally it did start out as one even, but it is a very specialized wait_queue.
So there's imo two cases:
- Your completion is entirely orthogonal of dma_fences, and can never ever block a dma_fence. Don't use dma_fence for this, and no problem. It's just another wait_queue somewhere.
- Your completion can eventually, maybe through lots of convolutions and dependencies, block a dma_fence. In that case full dma_fence rules apply, and the only thing you can do with a custom annotation is make the rules even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to take certain scheduler locks. But the userspace visible/published fences do take them, maybe as part of command submission or retirement. Entirely hypothetical, no idea any driver actually needs this.
Cheers, Daniel
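The "even stricter" case is maybe easiest to see with an entirely hypothetical sketch: a driver-private map layered inside the global annotation, so the cross-driver rules still apply and the private map only adds constraints on top (all names here are made up):

        static struct lockdep_map my_subtimeline_map = {
                .name = "my_subtimeline"
        };

        static void my_subtimeline_signal(struct dma_fence *fence)
        {
                bool cookie = dma_fence_begin_signalling();

                /* additionally track what this sub-timeline is allowed to take */
                lock_map_acquire(&my_subtimeline_map);

                dma_fence_signal(fence);

                lock_map_release(&my_subtimeline_map);
                dma_fence_end_signalling(cookie);
        }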
Quoting Daniel Vetter (2020-06-19 10:43:09)
On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
Quoting Daniel Vetter (2020-06-19 09:51:59)
On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Forcing a generic primitive to always be part of the same global map is horrible.
And no concrete example or reason for why that's not possible. Because frankly it's not horrible, this is what upstream is all about: Shared concepts, shared contracts, shared code.
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
Then I have not clearly communicated, as my opinion is not that validation is worthless, but that the implementation is enshrining a global property on a low level primitive that prevents it from being used elsewhere. And I want to replace completion [chains] with fences, and bio with fences, and closures with fences, and what other equivalencies there are in the kernel. The fence is as central a locking construct as struct completion and deserves to be a foundational primitive provided by kernel/ used throughout all drivers for discrete problem domains.
This is narrowing dma_fence, whereas adding struct lockdep_map *dma_fence::wait_map and annotating linkage allows you to continue to specify that all dma_fence used for a particular purpose must follow common rules, without restricting the primitive for uses outside of this scope.
Somewhere else in this thread I had discussions with Jason Gunthorpe about this topic. It might maybe change somewhat depending upon exact rules, but his take is very much "I don't want dma_fence in rdma". Or pretty close to that at least.
Similar discussions with habanalabs, they're using dma_fence internally without any of the uapi. Discussion there has also now concluded that it's best if they remove them, and simply switch over to a wait_queue or completion like every other driver does.
The next round of the patches already have a paragraph to at least somewhat limit how non-gpu drivers use dma_fence. And I guess actual consensus might be pointing even more strongly at dma_fence being solely something for gpus and closely related subsystem (maybe media) for syncing dma-buf access.
So dma_fence as general replacement for completion chains I think just wont happen.
That is sad. I cannot comprehend going back to pure completions after a taste of fence scheduling. And we are not even close to fully utilising them, as not all the async cpu [allocation!] tasks are fully tracked by fences yet and are still stuck in a FIFO workqueue.
What might make sense is if e.g. the lockdep annotations could be reused, at least in design, for wait_queue or completion or anything else really. I do think that has a fair chance compared to the automagic cross-release annotations approach, which relied way too heavily on guessing where barriers are. My experience from just a bit of playing around with these patches here and discussing them with other driver maintainers is that accurately deciding where critical sections start and end is a job for humans only. And if you get it wrong, you will have a false positive.
And you're indeed correct that if we'd do annotations for completions and wait queues, then that would need to have a class per semantically equivalent user, like we have lockdep classes for mutexes, not just one overall.
But dma_fence otoh is something very specific, which comes with very specific rules attached - it's not a generic wait_queue at all. Originally it did start out as one even, but it is a very specialized wait_queue.
So there's imo two cases:
Your completion is entirely orthogonal of dma_fences, and can never ever block a dma_fence. Don't use dma_fence for this, and no problem. It's just another wait_queue somewhere.
Your completion can eventually, maybe through lots of convolutions and dependencies, block a dma_fence. In that case full dma_fence rules apply, and the only thing you can do with a custom annotation is make the rules even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to take certain scheduler locks. But the userspace visible/published fences do take them, maybe as part of command submission or retirement. Entirely hypothetical, no idea any driver actually needs this.
I think we are faced with this very real problem.
The papering we have today over userptr is so very thin, and if you squint you can already see it is coupled into the completion signal. Just it happens to be on the other side of the fence.
The next batch of priority inversions involve integrating the async cpu tasks into the scheduler, and have full dependency tracking over every internal fence. I do not see any way to avoid coupling the completion signal from the GPU to the earliest resource allocation, as it's an unbroken chain of work, at least from the user's perspective. [Next up for annotations is that we need to always assume that userspace has an implicit lock on GPU resources; having to break that lock with a GPU reset should be a breach of our data integrity, and best avoided, for compute does not care one iota about system integrity and insists userspace knows best.]

Such allocations have to be allowed to fail, and for that failure to propagate, cancelling the queued work, such that I'm considering what rules we need for gfp_t. That might allow enough leverage to break any fs_reclaim loops, but userptr is likely forever doomed [aside from its fs_reclaim loop, which is as preventable as the normal shrinker paths], and we still need to suggest to pin_user_pages that failure is better than oom, which is not clear atm.

Plus the usual failure can happen at any time after updating the user facing bookkeeping, but that is just extra layers in the execution monitor ready to step in and replace failing work with the error propagation. Or where the system grinds to a halt, requiring the monitor to patch in a new page / resource.
-Chris
On Fri, Jun 19, 2020 at 3:12 PM Chris Wilson chris@chris-wilson.co.uk wrote:
Quoting Daniel Vetter (2020-06-19 10:43:09)
On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
Quoting Daniel Vetter (2020-06-19 09:51:59)
On Fri, Jun 19, 2020 at 10:25 AM Chris Wilson chris@chris-wilson.co.uk wrote:
Forcing a generic primitive to always be part of the same global map is horrible.
And no concrete example or reason for why that's not possible. Because frankly it's not horrible, this is what upstream is all about: Shared concepts, shared contracts, shared code.
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
Then I have not clearly communicated, as my opinion is not that validation is worthless, but that the implementation is enshrining a global property on a low-level primitive that prevents it from being used elsewhere. And I want to replace completion [chains] with fences, and bio with fences, and closures with fences, and whatever other equivalencies there are in the kernel. The fence is as central a locking construct as struct completion and deserves to be a foundational primitive provided by kernel/, used throughout all drivers for discrete problem domains.
This is a narrowing of dma_fence whereby adding struct lockdep_map *dma_fence::wait_map and annotating the linkage allows you to continue to specify that all dma_fence used for a particular purpose must follow common rules, without restricting the primitive for uses outside of this scope.
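Purely as an illustration of that wait_map idea (nothing here exists in mainline or in the proposed patches; all names are made up), each semantically distinct group of fence users would supply its own lockdep class instead of sharing one global map:

    #include <linux/lockdep.h>

    /* one class per semantically equivalent group of fences (made up names) */
    static struct lockdep_map scheduler_fence_wait_map = {
    	.name = "scheduler_fence_wait",
    };

    static struct lockdep_map flip_fence_wait_map = {
    	.name = "flip_fence_wait",
    };

    /*
     * A hypothetical dma_fence::wait_map pointer would be set to one of these
     * at fence init time, and dma_fence_wait() would then annotate against
     * fence->wait_map instead of a single global dma_fence_map.
     */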
Somewhere else in this thread I had discussions with Jason Gunthorpe about this topic. It might maybe change somewhat depending upon exact rules, but his take is very much "I don't want dma_fence in rdma". Or pretty close to that at least.
Similar discussions with habanalabs, they're using dma_fence internally without any of the uapi. Discussion there has also now concluded that it's best if they remove them, and simply switch over to a wait_queue or completion like every other driver does.
The next round of the patches already has a paragraph to at least somewhat limit how non-gpu drivers use dma_fence. And I guess actual consensus might be pointing even more strongly at dma_fence being solely something for gpus and closely related subsystems (maybe media) for syncing dma-buf access.
So dma_fence as a general replacement for completion chains I think just won't happen.
That is sad. I cannot comprehend going back to pure completions after a taste of fence scheduling. And we are not even close to fully utilising them, as not all the async cpu [allocation!] tasks are fully tracked by fences yet and are still stuck in a FIFO workqueue.
What might make sense is if e.g. the lockdep annotations could be reused, at least in design, for wait_queue or completion or anything else really. I do think that has a fair chance compared to the automagic cross-release annotations approach, which relied way too heavily on guessing where barriers are. My experience from just a bit of playing around with these patches here and discussing them with other driver maintainers is that accurately deciding where critical sections start and end is a job for humans only. And if you get it wrong, you will have a false positive.
And you're indeed correct that if we'd do annotations for completions and wait queues, then that would need to have a class per semantically equivalent user, like we have lockdep classes for mutexes, not just one overall.
But dma_fence otoh is something very specific, which comes with very specific rules attached - it's not a generic wait_queue at all. Originally it did start out as one even, but it is a very specialized wait_queue.
So there's imo two cases:
Your completion is entirely orthogonal to dma_fences, and can never ever block a dma_fence. Don't use dma_fence for this, and no problem. It's just another wait_queue somewhere.
Your completion can eventually, maybe through lots of convolutions and dependencies, block a dma_fence. In that case full dma_fence rules apply, and the only thing you can do with a custom annotation is make the rules even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to take certain scheduler locks. But the userspace visible/published fences do take them, maybe as part of command submission or retirement. Entirely hypothetical, no idea any driver actually needs this.
I think we are faced with this very real problem.
The papering we have today over userptr is so very thin, and if you squint you can already see it is coupled into the completion signal. It just happens to be on the other side of the fence.
The next batch of priority inversions involves integrating the async cpu tasks into the scheduler, and having full dependency tracking over every internal fence. I do not see any way to avoid coupling the completion signal from the GPU to the earliest resource allocation, as it's an unbroken chain of work, at least from the user's perspective. [Next up for annotations is that we need to always assume that userspace has an implicit lock on GPU resources; having to break that lock with a GPU reset should be a breach of our data integrity, and best avoided, for compute does not care one iota about system integrity and insists userspace knows best.] Such allocations have to be allowed to fail, and that failure has to propagate and cancel the queued work, such that I'm considering what rules we need for gfp_t. That might allow enough leverage to break any fs_reclaim loops, but userptr is likely forever doomed [aside from the fact that its fs_reclaim loop is as preventable as the normal shrinker paths], but we still need to suggest to pin_user_pages that failure is better than oom, and that is not clear atm. Plus the usual failure can happen at any time after updating the user facing bookkeeping, but that is just extra layers in the execution monitor ready to step in and replace failing work with the error propagation. Or where the system grinds to a halt, requiring the monitor to patch in a new page / resource.
Zooming out a bunch, since this is a lot about the details of making this happen, and I want to make sure I'm understanding your aim correctly. I think we have 2 big things here interacting:
On one side the "everything async" push, for some value of everything. Once everything is async we let either the linux scheduler (for dma_fence_work) or the gpu scheduler (for i915_request) figure out how to order everything, with all the dependencies. For memory allocations there's likely quite a bit of retrying (on the allocation side) and skipping (on the shrinker/mmu notifier side) involved to make this all pan out. Maybe something like a GFP_NOGPU flag.
On the other side we have opinionated userspace with both very long-running batches (they might as well be infinite, best we can do is check that they still preempt within a reasonable amount of time, lack of hw support for preemption in all cases notwithstanding). And batches which synchronize across engines and whatever else entirely under userspace control, with stuff like gpu semaphore waits entirely in the cmd stream, without any kernel or gpu scheduler involvement. Well maybe a slightly smarter gpu scheduler which converts the semaphore wait from a pure busy loop into a "repoll on each scheduler timeslice". But no actual dependency-tracking awareness in the kernel (or guc/hw fwiw) of what userspace is really trying to do.
The latter is a big motivator for the former, since with arbitrary long batches and arbitrary fences any wait for a batch to complete can take forever, hence anything that might end up doing that needs to be done async and without locks. That way we don't have to shoot anything if a batch takes too long.
Finally if anything goes wrong (on the kernel side at least) we just propagate fence error state through the entire ladder of in-flight things (only if it goes wrong terminally ofc).
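For reference, a minimal sketch of what that error propagation looks like at a single link in the chain, assuming a made-up driver job structure; dma_fence_set_error() and dma_fence_signal() are the existing dma-fence interfaces for this, everything else is illustrative:

    #include <linux/dma-fence.h>

    struct example_job {
    	struct dma_fence *done_fence;
    };

    /* called when the work backing the fence fails terminally */
    static void example_job_abort(struct example_job *job, int err)
    {
    	/* record the error before signalling; consumers of the fence see
    	 * it via dma_fence->error / dma_fence_get_status() */
    	dma_fence_set_error(job->done_fence, err);
    	dma_fence_signal(job->done_fence);
    }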
Roughly correct or did I miss a big (or small but really important) thing?
Thanks, Daniel
Hi, Jumping in after a couple of weeks where I've paged most everything out of my brain ...
On Fri, 19 Jun 2020 at 10:43, Daniel Vetter daniel@ffwll.ch wrote:
On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
Then I have not clearly communicated, as my opinion is not that validation is worthless, but that the implementation is enshrining a global property on a low-level primitive that prevents it from being used elsewhere. And I want to replace completion [chains] with fences, and bio with fences, and closures with fences, and whatever other equivalencies there are in the kernel. The fence is as central a locking construct as struct completion and deserves to be a foundational primitive provided by kernel/, used throughout all drivers for discrete problem domains.
This is a narrowing of dma_fence whereby adding struct lockdep_map *dma_fence::wait_map and annotating the linkage allows you to continue to specify that all dma_fence used for a particular purpose must follow common rules, without restricting the primitive for uses outside of this scope.
Somewhere else in this thread I had discussions with Jason Gunthorpe about this topic. It might maybe change somewhat depending upon exact rules, but his take is very much "I don't want dma_fence in rdma". Or pretty close to that at least.
Similar discussions with habanalabs, they're using dma_fence internally without any of the uapi. Discussion there has also now concluded that it's best if they remove them, and simply switch over to a wait_queue or completion like every other driver does.
The next round of the patches already has a paragraph to at least somewhat limit how non-gpu drivers use dma_fence. And I guess actual consensus might be pointing even more strongly at dma_fence being solely something for gpus and closely related subsystems (maybe media) for syncing dma-buf access.
So dma_fence as a general replacement for completion chains I think just won't happen.
What might make sense is if e.g. the lockdep annotations could be reused, at least in design, for wait_queue or completion or anything else really. I do think that has a fair chance compared to the automagic cross-release annotations approach, which relied way too heavily on guessing where barriers are. My experience from just a bit of playing around with these patches here and discussing them with other driver maintainers is that accurately deciding where critical sections start and end is a job for humans only. And if you get it wrong, you will have a false positive.
And you're indeed correct that if we'd do annotations for completions and wait queues, then that would need to have a class per semantically equivalent user, like we have lockdep classes for mutexes, not just one overall.
But dma_fence otoh is something very specific, which comes with very specific rules attached - it's not a generic wait_queue at all. Originally it did start out as one even, but it is a very specialized wait_queue.
So there's imo two cases:
Your completion is entirely orthogonal to dma_fences, and can never ever block a dma_fence. Don't use dma_fence for this, and no problem. It's just another wait_queue somewhere.
Your completion can eventually, maybe through lots of convolutions and dependencies, block a dma_fence. In that case full dma_fence rules apply, and the only thing you can do with a custom annotation is make the rules even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to take certain scheduler locks. But the userspace visible/published fences do take them, maybe as part of command submission or retirement. Entirely hypothetical, no idea any driver actually needs this.
I don't claim to understand the implementation of i915's scheduler and GEM handling, and it seems like there's some public context missing here. But to me, the above is a good statement of what I (and a lot of other userspace) have been relying on - that dma-fence is a very tightly scoped thing which is very predictable but in extremis.
It would be great to have something like this enshrined in dma-fence documentation, visible to both kernel and external users. The properties we've so far been assuming for the graphics pipeline - covering production & execution of vertex/fragment workloads on the GPU, framebuffer display, and to the extent this is necessary involving compute - are something like this:
A single dma-fence with no dependencies represents (the tail of) a unit of work, which has been all but committed to the hardware. Once committed to the hardware, this work will complete (successfully or in error) in bounded time. The unit of work referred to by a dma-fence may carry dependencies on other dma-fences, which must of course be subject to the same restrictions as above. No action from any userspace component is required to ensure that the completion occurs.
The cases I know of which legitimately blow holes in this are:
- the work is scheduled but GPU execution resource contention prevents it from completion, e.g. something on a higher-priority context repeatedly gets scheduled in front of it - this is OK because by definition it's what should happen
- the work is scheduled but CPU execution resource contention prevents it from completion, e.g. the DRM scheduler does not get to trigger the hardware to execute the work - this is OK because at this point we have a big system-wide problem
- the work is scheduled but non-execution resource contention prevents it from making progress, e.g. VRAM contention and/or a paging storm - this is OK because again we have a larger problem here and we can't reasonably expect the driver to solve this
- the work is executed but execution does not complete due to the nature of the work, e.g. a chain of work contains a hostile compute shader which does not complete in any reasonable time - this is OK because we require TDR; even without a smart compositor detecting based on fence waits that the work is unsuitable and should not hold up other work, the driver will probably ban the context and lock it out anyway
The first three are general system resource-overload cases, no different from the CPU-side equivalent where it's up to the admin to impose ulimits to prevent forkbombs or runaway memory usage, or up to the user to run fewer Electron apps. The last one is more difficult, because we can't solve the halting problem to know ahead of time that the user has submitted an infinite workload, so we have to live with that as a real hazard and mitigate it where we can (by returning -EIO and killing the app from inside Mesa).
If repurposing dma-fence for non-graphics uses (like general-purpose compute or driver-internal tracking for things other than GPU workloads) makes it more difficult to guarantee the above properties, then I don't want to do it. Maybe the answer is that dma-fence gets split into its core infrastructure which can be used for completion chains, with actual dma-fence being layered above generic completion APIs: other-completion-API can consume fences, but fences _cannot_ consume non-fence things.
This does force a split between graphics (GL/Vulkan/display) workloads and compute (CL/oneAPI/HSA/CUDA), which I get is really difficult to resolve in the driver. But the two are hard split anyway: graphics requires upfront and explicit buffer management, in return dangling the carrot that you can pipeline your workloads and expect completion in reasonable time. General-purpose compute lets you go far more YOLO on resource access, including full userptr SVM, but the flipside is that your execution time might be measured in weeks; as a result you don't get to do execution pipelining because even if you could, it's not a big enough win relative to your execution time to be worth the extra driver and system complexity. I don't think there's a reasonable lowest common denominator between the two that we can try to reuse a generic model for both, because you make too many compromises to try to fit conflicting interests.
In the pre-syncobj days, we did look at what we called 'empty fences' or 'future fences' with the ChromeOS team: a synchronisation object which wasn't backed by a promise of completion as dma-fence is, but instead by the meta-promise (from userspace) of a promise of completion. Ultimately it never became a real thing for the same reason that swsync isn't either; it needed so much special-case handling and so many disclaimers and opt-ins everywhere that by the end, we weren't sure why we were trying to shoehorn it into dma-fence apart from dma-fence already existing - but by removing all its guarantees, we also removed all its usefulness as a primitive.
Cheers, Daniel
On Thu, Jul 09, 2020 at 08:29:21AM +0100, Daniel Stone wrote:
Hi, Jumping in after a couple of weeks where I've paged most everything out of my brain ...
On Fri, 19 Jun 2020 at 10:43, Daniel Vetter daniel@ffwll.ch wrote:
On Fri, Jun 19, 2020 at 10:13:35AM +0100, Chris Wilson wrote:
The proposed patches might very well encode the wrong contract, that's all up for discussion. But fundamentally questioning that we need one is missing what upstream is all about.
Then I have not clearly communicated, as my opinion is not that validation is worthless, but that the implementation is enshrining a global property on a low-level primitive that prevents it from being used elsewhere. And I want to replace completion [chains] with fences, and bio with fences, and closures with fences, and whatever other equivalencies there are in the kernel. The fence is as central a locking construct as struct completion and deserves to be a foundational primitive provided by kernel/, used throughout all drivers for discrete problem domains.
This is a narrowing of dma_fence whereby adding struct lockdep_map *dma_fence::wait_map and annotating the linkage allows you to continue to specify that all dma_fence used for a particular purpose must follow common rules, without restricting the primitive for uses outside of this scope.
Somewhere else in this thread I had discussions with Jason Gunthorpe about this topic. It might maybe change somewhat depending upon exact rules, but his take is very much "I don't want dma_fence in rdma". Or pretty close to that at least.
Similar discussions with habanalabs, they're using dma_fence internally without any of the uapi. Discussion there has also now concluded that it's best if they remove them, and simply switch over to a wait_queue or completion like every other driver does.
The next round of the patches already has a paragraph to at least somewhat limit how non-gpu drivers use dma_fence. And I guess actual consensus might be pointing even more strongly at dma_fence being solely something for gpus and closely related subsystems (maybe media) for syncing dma-buf access.
So dma_fence as a general replacement for completion chains I think just won't happen.
What might make sense is if e.g. the lockdep annotations could be reused, at least in design, for wait_queue or completion or anything else really. I do think that has a fair chance compared to the automagic cross-release annotations approach, which relied way too heavily on guessing where barriers are. My experience from just a bit of playing around with these patches here and discussing them with other driver maintainers is that accurately deciding where critical sections start and end is a job for humans only. And if you get it wrong, you will have a false positive.
And you're indeed correct that if we'd do annotations for completions and wait queues, then that would need to have a class per semantically equivalent user, like we have lockdep classes for mutexes, not just one overall.
But dma_fence otoh is something very specific, which comes with very specific rules attached - it's not a generic wait_queue at all. Originally it did start out as one even, but it is a very specialized wait_queue.
So there's imo two cases:
Your completion is entirely orthogonal to dma_fences, and can never ever block a dma_fence. Don't use dma_fence for this, and no problem. It's just another wait_queue somewhere.
Your completion can eventually, maybe through lots of convolutions and dependencies, block a dma_fence. In that case full dma_fence rules apply, and the only thing you can do with a custom annotation is make the rules even stricter. E.g. if a sub-timeline in the scheduler isn't allowed to take certain scheduler locks. But the userspace visible/published fences do take them, maybe as part of command submission or retirement. Entirely hypothetical, no idea any driver actually needs this.
I don't claim to understand the implementation of i915's scheduler and GEM handling, and it seems like there's some public context missing here. But to me, the above is a good statement of what I (and a lot of other userspace) have been relying on - that dma-fence is a very tightly scoped thing which is very predictable but in extremis.
It would be great to have something like this enshrined in dma-fence documentation, visible to both kernel and external users. The properties we've so far been assuming for the graphics pipeline - covering production & execution of vertex/fragment workloads on the GPU, framebuffer display, and to the extent this is necessary involving compute - are something like this:
A single dma-fence with no dependencies represents (the tail of) a unit of work, which has been all but committed to the hardware. Once committed to the hardware, this work will complete (successfully or in error) in bounded time. The unit of work referred to by a dma-fence may carry dependencies on other dma-fences, which must of course be subject to the same restrictions as above. No action from any userspace component is required to ensure that the completion occurs.
The cases I know of which legitimately blow holes in this are:
- the work is scheduled but GPU execution resource contention
prevents it from completion, e.g. something on a higher-priority context repeatedly gets scheduled in front of it - this is OK because by definition it's what should happen
- the work is scheduled but CPU execution resource contention
prevents it from completion, e.g. the DRM scheduler does not get to trigger the hardware to execute the work - this is OK because at this point we have a big system-wide problem
- the work is scheduled but non-execution resource contention
prevents it from making progress, e.g. VRAM contention and/or a paging storm - this is OK because again we have a larger problem here and we can't reasonably expect the driver to solve this
- the work is executed but execution does not complete due to the
nature of the work, e.g. a chain of work contains a hostile compute shader which does not complete in any reasonable time - this is OK because we require TDR; even without a smart compositor detecting based on fence waits that the work is unsuitable and should not hold up other work, the driver will probably ban the context and lock it out anyway
The first three are general system resource-overload cases, no different from the CPU-side equivalent where it's up to the admin to impose ulimits to prevent forkbombs or runaway memory usage, or up to the user to run fewer Electron apps. The last one is more difficult, because we can't solve the halting problem to know ahead of time that the user has submitted an infinite workload, so we have to live with that as a real hazard and mitigate it where we can (by returning -EIO and killing the app from inside Mesa).
If repurposing dma-fence for non-graphics uses (like general-purpose compute or driver-internal tracking for things other than GPU workloads) makes it more difficult to guarantee the above properties, then I don't want to do it. Maybe the answer is that dma-fence gets split into its core infrastructure which can be used for completion chains, with actual dma-fence being layered above generic completion APIs: other-completion-API can consume fences, but fences _cannot_ consume non-fence things.
This does force a split between graphics (GL/Vulkan/display) workloads and compute (CL/oneAPI/HSA/CUDA), which I get is really difficult to resolve in the driver. But the two are hard split anyway: graphics requires upfront and explicit buffer management, in return dangling the carrot that you can pipeline your workloads and expect completion in reasonable time. General-purpose compute lets you go far more YOLO on resource access, including full userptr SVM, but the flipside is that your execution time might be measured in weeks; as a result you don't get to do execution pipelining because even if you could, it's not a big enough win relative to your execution time to be worth the extra driver and system complexity. I don't think there's a reasonable lowest common denominator between the two that we can try to reuse a generic model for both, because you make too many compromises to try to fit conflicting interests.
In the pre-syncobj days, we did look at what we called 'empty fences' or 'future fences' with the ChromeOS team: a synchronisation object which wasn't backed by a promise of completion as dma-fence is, but instead by the meta-promise (from userspace) of a promise of completion. Ultimately it never became a real thing for the same reason that swsync isn't either; it needed so much special-case handling and so many disclaimers and opt-ins everywhere that by the end, we weren't sure why we were trying to shoehorn it into dma-fence apart from dma-fence already existing - but by removing all its guarantees, we also removed all its usefulness as a primitive.
New series has a patch which tries to at least somewhat summarize this entire problem, and why it just doesn't work. Doesn't contain yet the full proposed solution, but maybe that's best for a follow-up patch. Anyway probably best if we poke holes at that text there.
Between the preempt ctx fence in amdgpu and userspace fences or gpu futex or whatever you want to call it, I do think we can make the compute side happy. The sad puppy face comes a bit from vulkan, since vulkan would really like the same execution model, but because it needs to integrate with the overall dma-fence based compositor stack, it can't.
I think even that is solvable, if we have vulkan-based compositors and a completely new set of protocols and uapi from client all the way down to display. That makes it about as bad a flag day as atomic+modifiers.
Also the only reason why the kms driver can then suddenly import a userspace fence, while nothing else in the kernel can allow such dependencies is fairly simple: Framebuffers are pinned, which breaks the dependency loops in the memory manager, and so avoids all the troubles in a slightly different form.
And of course we'd need a timeout in case userspace just screwed up somehow. -Daniel
Cheers, Daniel
Design is similar to the lockdep annotations for workers, but with some twists:
- We use a read-lock for the execution/worker/completion side, so that this explicit annotation can be more liberally sprinkled around. With read locks lockdep isn't going to complain if the read-side isn't nested the same way under all circumstances, so ABBA deadlocks are ok. Which they are, since this is an annotation only.
- We're using non-recursive lockdep read lock mode, since in recursive read lock mode lockdep does not catch read side hazards. And we _very_ much want read side hazards to be caught. For full details of this limitation see
commit e91498589746065e3ae95d9a00b068e525eec34f
Author: Peter Zijlstra peterz@infradead.org
Date:   Wed Aug 23 13:13:11 2017 +0200
locking/lockdep/selftests: Add mixed read-write ABBA tests
- To allow nesting of the read-side explicit annotations we explicitly keep track of the nesting. lock_is_held() allows us to do that.
- The wait-side annotation is a write lock, and entirely done within dma_fence_wait() for everyone by default.
- To be able to freely annotate helper functions I want to make it ok to call dma_fence_begin/end_signalling from soft/hardirq context. First attempt was using the hardirq locking context for the write side in lockdep, but this forces all normal spinlocks nested within dma_fence_begin/end_signalling to be irq-safe spinlocks. That's bollocks.
The approach now is to simply check in_atomic(), and for these cases entirely rely on the might_sleep() check in dma_fence_wait(). That will catch any wrong nesting against spinlocks from soft/hardirq contexts.
The idea here is that every code path that's critical for eventually signalling a dma_fence should be annotated with dma_fence_begin/end_signalling. The annotation ideally starts right after a dma_fence is published (added to a dma_resv, exposed as a sync_file fd, attached to a drm_syncobj fd, or anything else that makes the dma_fence visible to other kernel threads), up to and including the dma_fence_wait(). Examples are irq handlers, the scheduler rt threads, the tail of execbuf (after the corresponding fences are visible), any workers that end up signalling dma_fences and really anything else. Not annotated should be code paths that only complete fences opportunistically as the gpu progresses, like e.g. shrinker/eviction code.
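As an illustration of the intended usage, here is a minimal sketch of an annotated job-completion worker. The job and worker structures are made up; only dma_fence_begin_signalling(), dma_fence_end_signalling() and dma_fence_signal() are the interfaces this patch adds or uses:

    #include <linux/dma-fence.h>
    #include <linux/workqueue.h>

    struct example_job {
    	struct work_struct work;
    	struct dma_fence *done_fence;	/* already published, e.g. in a dma_resv */
    };

    static void example_complete_work(struct work_struct *work)
    {
    	struct example_job *job = container_of(work, struct example_job, work);
    	bool cookie;

    	/* everything between begin/end is checked by lockdep against locks
    	 * held around dma_fence_wait() anywhere else in the kernel */
    	cookie = dma_fence_begin_signalling();

    	dma_fence_signal(job->done_fence);

    	dma_fence_end_signalling(cookie);
    }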
The main class of deadlocks this is supposed to catch are:
Thread A:

    mutex_lock(A);
    mutex_unlock(A);

    dma_fence_signal();

Thread B:

    mutex_lock(A);
    dma_fence_wait();
    mutex_unlock(A);
Thread B is blocked on A signalling the fence, but A never gets around to that because it cannot acquire the lock A.
Note that dma_fence_wait() is allowed to be nested within dma_fence_begin/end_signalling sections. To allow this to happen the read lock needs to be upgraded to a write lock, which means that if any other lock is acquired between the dma_fence_begin_signalling() call and the call to dma_fence_wait(), and is still held, this will result in an immediate lockdep complaint. The only other option would be to not annotate such calls, defeating the point. Therefore these annotations cannot be sprinkled over the code entirely mindlessly; doing so would cause false positives.
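To make that nesting rule concrete, a hedged sketch (the mutex and fences are purely illustrative): the first function is fine, the second triggers an immediate lockdep complaint because a lock is held between dma_fence_begin_signalling() and the nested dma_fence_wait():

    #include <linux/dma-fence.h>
    #include <linux/mutex.h>

    static void nested_wait_ok(struct dma_fence *in, struct dma_fence *out)
    {
    	bool cookie = dma_fence_begin_signalling();

    	dma_fence_wait(in, false);	/* ok: no other lock held across the wait */
    	dma_fence_signal(out);

    	dma_fence_end_signalling(cookie);
    }

    static void nested_wait_bad(struct dma_fence *in, struct dma_fence *out,
    			    struct mutex *a)
    {
    	bool cookie = dma_fence_begin_signalling();

    	mutex_lock(a);
    	dma_fence_wait(in, false);	/* lockdep complaint: a held across the wait */
    	mutex_unlock(a);
    	dma_fence_signal(out);

    	dma_fence_end_signalling(cookie);
    }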
Originally I hoped that the cross-release lockdep extensions would alleviate the need for explicit annotations:
https://lwn.net/Articles/709849/
But there's a few reasons why that's not an option:
- It's not happening in upstream, since it got reverted due to too many false positives:
commit e966eaeeb623f09975ef362c2866fae6f86844f9
Author: Ingo Molnar mingo@kernel.org
Date:   Tue Dec 12 12:31:16 2017 +0100
locking/lockdep: Remove the cross-release locking checks
This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y), while it found a number of old bugs initially, was also causing too many false positives that caused people to disable lockdep - which is arguably a worse overall outcome.
- cross-release uses the complete() call to annotate the end of critical sections, for dma_fence that would be dma_fence_signal(). But we do not want all dma_fence_signal() calls to be treated as critical, since many are opportunistic cleanup of gpu requests. If these get stuck there's still the main completion interrupt and workers that can unblock everyone. Automatically annotating all dma_fence_signal() calls would hence cause false positives.
- cross-release had some educated guesses for when a critical section starts, like fresh syscall or fresh work callback. This would again cause false positives without explicit annotations, since for dma_fence the critical sections only start when we publish a fence.
- Furthermore there can be cases where a thread never does a dma_fence_signal, but is still critical for reaching completion of fences. One example would be a scheduler kthread which picks up jobs and pushes them into hardware (see the sketch after this list), where the interrupt handler or another completion thread calls dma_fence_signal(). But if the scheduler thread hangs, then all the fences hang, hence we need to manually annotate it. cross-release aimed to solve this by chaining cross-release dependencies, but the dependency from scheduler thread to the completion interrupt handler goes through hw where cross-release code can't observe it.
In short, without manual annotations and careful review of the start and end of critical sections, cross-release dependency tracking doesn't work. We need explicit annotations.
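A hedged sketch of that scheduler-kthread case (the ring/job structures and the pick/push helpers are made up; only the signalling annotation is the API added here): the thread never calls dma_fence_signal() itself, yet it sits on the critical path to every fence completion and therefore carries the annotation:

    #include <linux/dma-fence.h>
    #include <linux/kthread.h>

    struct example_ring;				/* driver-private, made up */
    struct example_job { struct dma_fence *hw_fence; };

    static struct example_job *example_pick_next_job(struct example_ring *ring);
    static void example_push_job_to_hw(struct example_ring *ring,
    				   struct example_job *job);

    static int example_scheduler_thread(void *data)
    {
    	struct example_ring *ring = data;

    	while (!kthread_should_stop()) {
    		struct example_job *job = example_pick_next_job(ring);
    		bool cookie;

    		if (!job)
    			continue;	/* real code would sleep here */

    		/* fence is already published; the hw irq handler will
    		 * eventually call dma_fence_signal(job->hw_fence) */
    		cookie = dma_fence_begin_signalling();
    		example_push_job_to_hw(ring, job);
    		dma_fence_end_signalling(cookie);
    	}
    	return 0;
    }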
v2: handle soft/hardirq ctx better against write side and dont forget EXPORT_SYMBOL, drivers can't use this otherwise.
v3: Kerneldoc.
v4: Some spelling fixes from Mika
v5: Amend commit message to explain in detail why cross-release isn't the solution.
v6: Pull out misplaced .rst hunk.
Reviewed-by: Thomas Hellström thomas.hellstrom@intel.com Reviewed-by: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Mika Kuoppala mika.kuoppala@intel.com Cc: Thomas Hellstrom thomas.hellstrom@intel.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-rdma@vger.kernel.org Cc: amd-gfx@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Cc: Chris Wilson chris@chris-wilson.co.uk Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com Cc: Christian König christian.koenig@amd.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com --- Documentation/driver-api/dma-buf.rst | 6 + drivers/dma-buf/dma-fence.c | 161 +++++++++++++++++++++++++++ include/linux/dma-fence.h | 12 ++ 3 files changed, 179 insertions(+)
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index 7fb7b661febd..05d856131140 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -133,6 +133,12 @@ DMA Fences .. kernel-doc:: drivers/dma-buf/dma-fence.c :doc: DMA fences overview
+DMA Fence Signalling Annotations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: drivers/dma-buf/dma-fence.c + :doc: fence signalling annotation + DMA Fences Functions Reference ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c index 656e9ac2d028..0005bc002529 100644 --- a/drivers/dma-buf/dma-fence.c +++ b/drivers/dma-buf/dma-fence.c @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num) } EXPORT_SYMBOL(dma_fence_context_alloc);
+/** + * DOC: fence signalling annotation + * + * Proving correctness of all the kernel code around &dma_fence through code + * review and testing is tricky for a few reasons: + * + * * It is a cross-driver contract, and therefore all drivers must follow the + * same rules for lock nesting order, calling contexts for various functions + * and anything else significant for in-kernel interfaces. But it is also + * impossible to test all drivers in a single machine, hence brute-force N vs. + * N testing of all combinations is impossible. Even just limiting to the + * possible combinations is infeasible. + * + * * There is an enormous amount of driver code involved. For render drivers + * there's the tail of command submission, after fences are published, + * scheduler code, interrupt and workers to process job completion, + * and timeout, gpu reset and gpu hang recovery code. Plus for integration + * with core mm with have &mmu_notifier, respectively &mmu_interval_notifier, + * and &shrinker. For modesetting drivers there's the commit tail functions + * between when fences for an atomic modeset are published, and when the + * corresponding vblank completes, including any interrupt processing and + * related workers. Auditing all that code, across all drivers, is not + * feasible. + * + * * Due to how many other subsystems are involved and the locking hierarchies + * this pulls in there is extremely thin wiggle-room for driver-specific + * differences. &dma_fence interacts with almost all of the core memory + * handling through page fault handlers via &dma_resv, dma_resv_lock() and + * dma_resv_unlock(). On the other side it also interacts through all + * allocation sites through &mmu_notifier and &shrinker. + * + * Furthermore lockdep does not handle cross-release dependencies, which means + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught + * at runtime with some quick testing. The simplest example is one thread + * waiting on a &dma_fence while holding a lock:: + * + * lock(A); + * dma_fence_wait(B); + * unlock(A); + * + * while the other thread is stuck trying to acquire the same lock, which + * prevents it from signalling the fence the previous thread is stuck waiting + * on:: + * + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * + * By manually annotating all code relevant to signalling a &dma_fence we can + * teach lockdep about these dependencies, which also helps with the validation + * headache since now lockdep can check all the rules for us:: + * + * cookie = dma_fence_begin_signalling(); + * lock(A); + * unlock(A); + * dma_fence_signal(B); + * dma_fence_end_signalling(cookie); + * + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to + * annotate critical sections the following rules need to be observed: + * + * * All code necessary to complete a &dma_fence must be annotated, from the + * point where a fence is accessible to other threads, to the point where + * dma_fence_signal() is called. Un-annotated code can contain deadlock issues, + * and due to the very strict rules and many corner cases it is infeasible to + * catch these just with review or normal stress testing. + * + * * &struct dma_resv deserves a special note, since the readers are only + * protected by rcu. This means the signalling critical section starts as soon + * as the new fences are installed, even before dma_resv_unlock() is called. 
+ * + * * The only exception are fast paths and opportunistic signalling code, which + * calls dma_fence_signal() purely as an optimization, but is not required to + * guarantee completion of a &dma_fence. The usual example is a wait IOCTL + * which calls dma_fence_signal(), while the mandatory completion path goes + * through a hardware interrupt and possible job completion worker. + * + * * To aid composability of code, the annotations can be freely nested, as long + * as the overall locking hierarchy is consistent. The annotations also work + * both in interrupt and process context. Due to implementation details this + * requires that callers pass an opaque cookie from + * dma_fence_begin_signalling() to dma_fence_end_signalling(). + * + * * Validation against the cross driver contract is implemented by priming + * lockdep with the relevant hierarchy at boot-up. This means even just + * testing with a single device is enough to validate a driver, at least as + * far as deadlocks with dma_fence_wait() against dma_fence_signal() are + * concerned. + */ +#ifdef CONFIG_LOCKDEP +struct lockdep_map dma_fence_lockdep_map = { + .name = "dma_fence_map" +}; + +/** + * dma_fence_begin_signalling - begin a critical DMA fence signalling section + * + * Drivers should use this to annotate the beginning of any code section + * required to eventually complete &dma_fence by calling dma_fence_signal(). + * + * The end of these critical sections are annotated with + * dma_fence_end_signalling(). + * + * Returns: + * + * Opaque cookie needed by the implementation, which needs to be passed to + * dma_fence_end_signalling(). + */ +bool dma_fence_begin_signalling(void) +{ + /* explicitly nesting ... */ + if (lock_is_held_type(&dma_fence_lockdep_map, 1)) + return true; + + /* rely on might_sleep check for soft/hardirq locks */ + if (in_atomic()) + return true; + + /* ... and non-recursive readlock */ + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_); + + return false; +} +EXPORT_SYMBOL(dma_fence_begin_signalling); + +/** + * dma_fence_end_signalling - end a critical DMA fence signalling section + * + * Closes a critical section annotation opened by dma_fence_begin_signalling(). + */ +void dma_fence_end_signalling(bool cookie) +{ + if (cookie) + return; + + lock_release(&dma_fence_lockdep_map, _RET_IP_); +} +EXPORT_SYMBOL(dma_fence_end_signalling); + +void __dma_fence_might_wait(void) +{ + bool tmp; + + tmp = lock_is_held_type(&dma_fence_lockdep_map, 1); + if (tmp) + lock_release(&dma_fence_lockdep_map, _THIS_IP_); + lock_map_acquire(&dma_fence_lockdep_map); + lock_map_release(&dma_fence_lockdep_map); + if (tmp) + lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_); +} +#endif + + /** * dma_fence_signal_locked - signal completion of a fence * @fence: the fence to signal @@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence) { unsigned long flags; int ret; + bool tmp;
if (!fence) return -EINVAL;
+ tmp = dma_fence_begin_signalling(); + spin_lock_irqsave(fence->lock, flags); ret = dma_fence_signal_locked(fence); spin_unlock_irqrestore(fence->lock, flags);
+ dma_fence_end_signalling(tmp); + return ret; } EXPORT_SYMBOL(dma_fence_signal); @@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
might_sleep();
+ __dma_fence_might_wait(); + trace_dma_fence_wait_start(fence); if (fence->ops->wait) ret = fence->ops->wait(fence, intr, timeout); diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index 3347c54f3a87..3f288f7db2ef 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep) } while (1); }
+#ifdef CONFIG_LOCKDEP +bool dma_fence_begin_signalling(void); +void dma_fence_end_signalling(bool cookie); +#else +static inline bool dma_fence_begin_signalling(void) +{ + return true; +} +static inline void dma_fence_end_signalling(bool cookie) {} +static inline void __dma_fence_might_wait(void) {} +#endif + int dma_fence_signal(struct dma_fence *fence); int dma_fence_signal_locked(struct dma_fence *fence); signed long dma_fence_default_wait(struct dma_fence *fence,