On Thu, Mar 15, 2018 at 10:56 AM, Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 15.03.2018 um 10:20 schrieb Daniel Vetter:
On Tue, Mar 13, 2018 at 06:20:07PM +0100, Christian König wrote: [SNIP] Take a look at the DOT graphs for atomic I've done a while ago. I think we could make a formidable competition for who's doing the worst diagrams :-)
Thanks, going to give that a try.
[SNIP] amdgpu: Expects that you never hold any of the heavywheight locks while waiting for a fence (since gpu resets will need them).
i915: Happily blocks on fences while holding all kinds of locks, expects gpu reset to be able to recover even in this case.
In this case I can comfort you, the looks amdgpu needs to grab during GPU reset are the reservation lock of the VM page tables. I have strong doubt that i915 will ever hold those.
Ah good, means that very likely there's at least no huge fundamental design issue that we run into.
Could be that we run into problems because Thread A hold lock 1 tries to take lock 2, then i915 holds 2 and our reset path needs 1.
Yeah that might happen, but lockdep will catch those, and generally those cases can be fixed with slight reordering or re-annotating of the code to avoid upsetting lockdep. As long as we don't have a full-on functional dependency (which is what I've feared).
[SNIP]
Yes, except for fallback paths and bootup self tests we simply never wait for fences while holding locks.
That's not what I meant with "are you sure". Did you enable the cross-release stuff (after patching the bunch of leftover core kernel issues still present), annotate dma_fence with the cross-release stuff, run a bunch of multi-driver (amdgpu vs i915) dma-buf sharing tests and weep?
Ok, what exactly do you mean with cross-release checking?
Current lockdep doesn't spot deadlocks like the below:
thread A: holds mutex, waiting for completion.
thread B: acquires mutex before it will ever signal the completion A is waiting for
->deadlock
cross-release lockdep support can catch these through new fancy annotations. Similar waiter/signaller annotations exists for waiting on workers and anything else, and it would be a perfect fit for waiter/signaller code around dma_fence.
lwn has you covered a usual: https://lwn.net/Articles/709849/
Cheers, Daniel
I didn't do the full thing yet, but just within i915 we've found tons of small little deadlocks we never really considered thanks to cross release, and that wasn't even including the dma_fence annotation. Luckily nothing that needed a full-on driver redesign.
I guess I need to ping core kernel maintainers about cross-release again. I'd much prefer if we could validate ->invalidate_mapping and the locking/fence dependency issues using that, instead of me having to read and understand all the drivers.
[SNIP]
I fear that with the ->invalidate_mapping callback (which inverts the control flow between importer and exporter) and tying dma_fences into all this it will be a _lot_ worse. And I'm definitely too stupid to understand all the dependency chains without the aid of lockdep and a full test suite (we have a bunch of amdgpu/i915 dma-buf tests in igt btw).
Yes, that is also something I worry about.
Regards, Christian.