On Thu, Jun 11, 2020 at 09:30:12AM +0200, Thomas Hellström (Intel) wrote:
On 6/4/20 10:12 AM, Daniel Vetter wrote:
Two in one go:
- it is allowed to call dma_fence_wait() while holding a dma_resv_lock(). This is fundamental to how eviction works with ttm, so required.
- it is allowed to call dma_fence_wait() from memory reclaim contexts, specifically from shrinker callbacks (which i915 does), and from mmu notifier callbacks (which amdgpu does, and which i915 sometimes also does, and probably always should, but that's kinda a debate). Also for stuff like HMM we really need to be able to do this, or things get real dicey.
Consequence is that any critical path necessary to get to a dma_fence_signal for a fence must never a) call dma_resv_lock nor b) allocate memory with GFP_KERNEL. Also by implication of dma_resv_lock(), no userspace faulting allowed. Those are some supremely obnoxious limitations, which is why we need to sprinkle the right annotations on all relevant paths.
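A minimal userspace sketch of why the two rules above lead to that consequence. This is a toy ordering checker, not the kernel's lockdep: the lock-class names, dep_add() and prime() are all hypothetical. The only input taken from the text is which contexts may wait on a fence. A GFP_KERNEL allocation is modelled as entering the reclaim class, on the assumption that such an allocation may recurse into reclaim.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes for the toy model. */
enum lock_class {
	DMA_FENCE,	/* waiting on, or being needed to signal, a fence */
	DMA_RESV,	/* dma_resv_lock() */
	FS_RECLAIM,	/* memory reclaim, entered by GFP_KERNEL allocations */
	MMU_NOTIFIER,	/* mmu notifier invalidation callbacks */
	DRIVER_LOCK,	/* some driver-private lock, never held around a wait */
	NR_CLASSES
};

struct dep_graph {
	/* edge[a][b]: class b has been taken while class a was held */
	bool edge[NR_CLASSES][NR_CLASSES];
};

/* Depth-first search: can we get from 'from' to 'to' along recorded edges? */
static bool reachable(const struct dep_graph *g, int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_CLASSES; next++)
		if (g->edge[from][next] && !seen[next] &&
		    reachable(g, next, to, seen))
			return true;
	return false;
}

/*
 * Record "taken is acquired while held is held". Returns false, and records
 * nothing, if the new edge would close a cycle (a potential deadlock).
 */
static bool dep_add(struct dep_graph *g, int held, int taken)
{
	bool seen[NR_CLASSES] = { false };

	assert(held >= 0 && held < NR_CLASSES);
	assert(taken >= 0 && taken < NR_CLASSES);

	if (reachable(g, taken, held, seen))
		return false;
	g->edge[held][taken] = true;
	return true;
}

/* Teach the model the allowed orderings: fence waits under each context. */
static bool prime(struct dep_graph *g)
{
	bool ok = true;

	memset(g, 0, sizeof(*g));
	ok &= dep_add(g, DMA_RESV, DMA_FENCE);
	ok &= dep_add(g, FS_RECLAIM, DMA_FENCE);
	ok &= dep_add(g, MMU_NOTIFIER, DMA_FENCE);
	return ok;
}
```

After prime(), dep_add(&g, DMA_FENCE, DMA_RESV) and dep_add(&g, DMA_FENCE, FS_RECLAIM) are both rejected, which is the toy equivalent of the splat one would hope to get from an annotated signalling path, while dep_add(&g, DMA_FENCE, DRIVER_LOCK) is accepted.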
The one big locking context we're leaving out here is mmu notifiers, added in
commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Aug 26 22:14:21 2019 +0200

    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end

that one covers a lot of other callsites, and it's also allowed to wait on dma-fences from mmu notifiers. But there's no ready-made functions exposed to prime this, so I've left it out for now.
v2: Also track against mmu notifier context.
v3: kerneldoc to spec the cross-driver contract. Note that currently i915 throws in a hard-coded 10s timeout on foreign fences (not sure why that was done, but it's there), which is why that rule is worded with SHOULD instead of MUST.
Also some of the mmu_notifier/shrinker rules might surprise SoC drivers, I haven't fully audited them all. Which is infeasible anyway, we'll need to run them with lockdep and dma-fence annotations and see what goes boom.
v4: A spelling fix from Mika
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
 Documentation/driver-api/dma-buf.rst |  6 ++++
 drivers/dma-buf/dma-fence.c          | 41 ++++++++++++++++++++++++++++
 drivers/dma-buf/dma-resv.c           |  4 +++
 include/linux/dma-fence.h            |  1 +
 4 files changed, 52 insertions(+)
I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up with a mempool to make sure it can handle its allocations.
But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
I do wonder whether the mmu notifier constraint should only be set when mmu notifiers are enabled, since on a bunch of arm-soc gpu drivers that stuff just doesn't matter. But I expect that sooner or later these arm gpus will show up in bigger arm cores, where you might want to have kvm and maybe device virtualization and stuff, and then you need mmu notifiers.
Plus having a very clear and consistent cross-driver api contract is imo better than leaving this up to drivers and then having incompatible assumptions.
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there). I think it'll take us a while to really bottom out on this specific question here. -Daniel
Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>
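For reference, the trywait approach Thomas suggests for shrinkers could look roughly like the userspace sketch below. struct toy_bo and shrink_trywait() are hypothetical names, and the fence_signaled flag merely stands in for a non-blocking check (in the kernel that would presumably be something like dma_fence_is_signaled()). The point is only that a busy buffer is skipped instead of waited on, so the shrinker never blocks on a fence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer object for the toy model. */
struct toy_bo {
	bool fence_signaled;	/* stand-in for a non-blocking fence check */
	bool resident;		/* still holds its backing pages */
	size_t nr_pages;
};

/*
 * Try to free up to nr_to_free pages. Buffers whose fence has not signalled
 * yet are skipped rather than waited on. Returns the number of pages freed,
 * which may be less than requested (or zero) when everything is busy.
 */
static size_t shrink_trywait(struct toy_bo *bos, size_t nr_bos,
			     size_t nr_to_free)
{
	size_t freed = 0;

	assert(bos != NULL || nr_bos == 0);

	for (size_t i = 0; i < nr_bos && freed < nr_to_free; i++) {
		struct toy_bo *bo = &bos[i];

		if (!bo->resident)
			continue;	/* nothing left to free */
		if (!bo->fence_signaled)
			continue;	/* would have to block: skip it */

		bo->resident = false;
		freed += bo->nr_pages;
	}
	return freed;
}
```

The trade-off is visible in the return value: under load the shrinker may free nothing at all, which is the price for allowing allocations on the submission path.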
On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
I really question if HW that needs something like DMA fence should even be using mmu notifiers - the best use is HW that can fence the DMA directly without having to get involved with some command stream processing.
Or at the very least it should not be a generic DMA fence but a narrowed completion tied only into the same GPU driver's command completion processing which should be able to progress without blocking.
The intent of notifiers was never to endlessly block while vast amounts of SW does work.
Going around and switching everything in a GPU to GFP_ATOMIC seems like a bad idea.
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Right, nor will RDMA ODP.
Jason
On 2020-06-11 at 10:15 a.m., Jason Gunthorpe wrote:
On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Soon nouveau will get company. We're working on a recoverable page fault implementation for HMM in amdgpu where we'll need to update page tables using the GPU's SDMA engine and wait for corresponding fences in MMU notifiers.
Regards, Felix
Right, nor will RDMA ODP.
Jason
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
On Fri, Jun 12, 2020 at 1:35 AM Felix Kuehling felix.kuehling@amd.com wrote:
Soon nouveau will get company. We're working on a recoverable page fault implementation for HMM in amdgpu where we'll need to update page tables using the GPU's SDMA engine and wait for corresponding fences in MMU notifiers.
Well amdgpu already has dma_fence waits in the hmm callbacks, so nothing new. But since you start using these in amdkfd ... perfect opportunity to annotate the amdkfd paths for fence signalling critical sections? Especially the preempt-ctx fence should be an interesting case to annotate and see whether lockdep finds anything. Not sure what else there is. -Daniel
dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Thu, Jun 11, 2020 at 07:35:35PM -0400, Felix Kuehling wrote:
Soon nouveau will get company. We're working on a recoverable page fault implementation for HMM in amdgpu where we'll need to update page tables using the GPU's SDMA engine and wait for corresponding fences in MMU notifiers.
Note that HMM mandates, and I have stressed this several times in the past, that all GPU page table updates are asynchronous and do not have to wait on _anything_.
I understand that you use a DMA engine for GPU page table updates, but if you want to do so with HMM then you need a DMA context reserved for GPU page table updates, which all GPU page table updates go through and where user space cannot queue up jobs.
It can be for HMM only, but if you want to mix HMM with non-HMM then everything needs to be on that queue, and the other command queues will have to depend on it.
Cheers, Jérôme
On Fri, Jun 12, 2020 at 1:35 AM Felix Kuehling felix.kuehling@amd.com wrote:
Soon nouveau will get company. We're working on a recoverable page fault implementation for HMM in amdgpu where we'll need to update page tables using the GPU's SDMA engine and wait for corresponding fences in MMU notifiers.
Can you pls cc these patches to dri-devel when they show up? Depending upon how your hw works there's an endless amount of bad things that can happen.
Also I think (again depending upon how the hw exactly works) this stuff would be a perfect example for the dma_fence annotations.
The worst case is if your hw cannot preempt while a hw page fault is pending. That means none of the dma_fence will ever signal (the amdkfd preempt ctx fences won't, and the classic fences from amdgpu might also stall). At least when you're unlucky and the fence you're waiting on somehow (anywhere in its dependency chain really) needs the engine that's currently blocked waiting for the hw page fault.
That in turn means anything you do in your hw page fault handler is in the critical section for dma fence signalling, which has far reaching implications. -Daniel
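A small userspace sketch of the "anywhere in its dependency chain" point. struct toy_fence and fence_stuck_behind() are hypothetical and only model the reasoning: each fence names the engine that has to run to signal it plus the fences it depends on, and a wait can never finish if any fence in that chain needs the engine that is stalled on the page fault. The model assumes the dependency graph is acyclic and that an engine makes no progress at all while blocked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 4

/* Hypothetical fence: which engine signals it, and what it waits for. */
struct toy_fence {
	int engine;
	const struct toy_fence *deps[MAX_DEPS];
	size_t nr_deps;
};

/*
 * Returns true if this fence cannot signal while blocked_engine is stalled,
 * either because the fence itself runs there or because something in its
 * dependency chain does.
 */
static bool fence_stuck_behind(const struct toy_fence *f, int blocked_engine)
{
	assert(f != NULL && f->nr_deps <= MAX_DEPS);

	if (f->engine == blocked_engine)
		return true;
	for (size_t i = 0; i < f->nr_deps; i++)
		if (fence_stuck_behind(f->deps[i], blocked_engine))
			return true;
	return false;
}
```

If the page fault handler waits on a fence for which this returns true for the faulting engine, the handler waits on itself; that is why everything the handler does would have to be treated as part of the signalling critical section.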
On 2020-06-23 at 3:39 a.m., Daniel Vetter wrote:
On Fri, Jun 12, 2020 at 1:35 AM Felix Kuehling felix.kuehling@amd.com wrote:
Soon nouveau will get company. We're working on a recoverable page fault implementation for HMM in amdgpu where we'll need to update page tables using the GPU's SDMA engine and wait for corresponding fences in MMU notifiers.
Can you pls cc these patches to dri-devel when they show up? Depending upon how your hw works there's an endless amount of bad things that can happen.
Yes, I'll do that.
Also I think (again depending upon how the hw exactly works) this stuff would be a perfect example for the dma_fence annotations.
We have already applied your patch series to our development branch. I haven't looked into what annotations we'd have to add to our new code yet.
The worst case is if your hw cannot preempt while a hw page fault is pending. That means none of the dma_fence will ever signal (the amdkfd preempt ctx fences won't, and the classic fences from amdgpu might also stall). At least when you're unlucky and the fence you're waiting on somehow (anywhere in its dependency chain really) needs the engine that's currently blocked waiting for the hw page fault.
Our HW can preempt while handling a page fault, at least on the GPU generation we're working on now. On other GPUs we haven't included in our initial effort, we will not be able to preempt while a page fault is in progress. This is problematic, but that's for reasons related to our GPU hardware scheduler and unrelated to fences.
That in turn means anything you do in your hw page fault handler is in the critical section for dma fence signalling, which has far reaching implications.
I'm not sure I agree, at least for KFD. The only place where KFD uses fences that depend on preemptions is eviction fences. And we can get rid of those if we can preempt GPU access to specific BOs by invalidating GPU PTEs. That way we don't need to preempt the GPU queues while a page fault is in progress. Instead we would create more page faults.
That assumes that we can invalidate GPU PTEs without depending on fences. We've discussed possible deadlocks due to memory allocations needed on that code path for IBs or page tables. We've already eliminated page table allocations and reservation locks on the PTE invalidation code path. And we're using a separate scheduler entity so we can't get stuck behind other IBs that depend on fences. IIRC, Christian also implemented a separate memory pool for IBs for this code path.
Regards, Felix
On Tue, Jun 23, 2020 at 02:44:24PM -0400, Felix Kuehling wrote:
On 2020-06-23 at 3:39 a.m., Daniel Vetter wrote:
The worst case is if your hw cannot preempt while a hw page fault is pending. That means none of the dma_fence will ever signal (the amdkfd preempt ctx fences won't, and the classic fences from amdgpu might also stall). At least when you're unlucky and the fence you're waiting on somehow (anywhere in its dependency chain really) needs the engine that's currently blocked waiting for the hw page fault.
Our HW can preempt while handling a page fault, at least on the GPU generation we're working on now. On other GPUs we haven't included in our initial effort, we will not be able to preempt while a page fault is in progress. This is problematic, but that's for reasons related to our GPU hardware scheduler and unrelated to fences.
Well the trouble is if the page fault holds up a preempt, then there's no way for a dma_fence to complete while your hw page fault handler is stuck doing whatever. That means the entire hw page fault becomes a fence signalling critical section, with the consequence that there's almost nothing you can actually do. System memory becomes GFP_ATOMIC only, and for vram you need to make sure that you never evict anything that might be in active use.
So not enabling these platforms sounds like a very good plan to me :-)
That in turn means anything you do in your hw page fault handler is in the critical section for dma fence signalling, which has far reaching implications.
I'm not sure I agree, at least for KFD. The only place where KFD uses fences that depend on preemptions is eviction fences. And we can get rid of those if we can preempt GPU access to specific BOs by invalidating GPU PTEs. That way we don't need to preempt the GPU queues while a page fault is in progress. Instead we would create more page faults.
The big problem isn't pure kfd workloads, all the trouble comes in when you mix kfd and amdgpu workloads. kfd alone is easy, just make sure there's no fences to begin with, and there will be no problems.
That assumes that we can invalidate GPU PTEs without depending on fences. We've discussed possible deadlocks due to memory allocations needed on that code path for IBs or page tables. We've already eliminated page table allocations and reservation locks on the PTE invalidation code path. And we're using a separate scheduler entity so we can't get stuck behind other IBs that depend on fences. IIRC, Christian also implemented a separate memory pool for IBs for this code path.
Yeah it's the memory allocations that kill you. Both system memory, but also vram. Since evicting vram might mean you end up stuck behind a dma_fence of a legacy context hogging that memory, and probably also means doing a few dma_resv_lock. All of these things deadlock if you can't preempt the context with something else. -Daniel
Hi Jason,
Somehow this got stuck somewhere in the mail queues, only popped up just now ...
On Thu, Jun 11, 2020 at 11:15:15AM -0300, Jason Gunthorpe wrote:
I really question if HW that needs something like DMA fence should even be using mmu notifiers - the best use is HW that can fence the DMA directly without having to get involved with some command stream processing.
Or at the very least it should not be a generic DMA fence but a narrowed completion tied only into the same GPU driver's command completion processing which should be able to progress without blocking.
The problem with gpus is that these completions leak across the board like mad. Both internally within memory managers (made a lot worse with p2p direct access to vram), and through uapi.
Many gpus still have a very hard time preempting, so doing an overall switch in drivers/gpu to a memory management model where that is required is not a very realistic option. And minimally you need either preempt (still takes a while, but a lot faster generally than waiting for work to complete) or hw faults (just a bunch of tlb flushes plus virtual indexed caches, so just the caveat of that for a gpu, which has lots and big tlbs and caches). So preventing the completion leaks within the kernel is I think unrealistic, except if we just say "well sorry, run on windows, mkay" for many gpu workloads. Or more realistic "well sorry, run on the nvidia blob with nvidia hw".
The userspace side we can somewhat isolate, at least for pure compute workloads. But the thing is drivers/gpu is a continuum from tiny socs (where dma_fence is a very nice model) to huge compute stuff (where it's maybe not the nicest, but hey hw sucks so still needed). Doing a full-on break in uapi somewhere in there is at least a bit awkward, e.g. some of the media codec code on intel runs all the way from the smallest intel soc to the big transcode servers.
So the current status quo is "total mess, every driver defines their own rules". All I'm trying to do is some common rules here, to make this mess slightly more manageable and overall reviewable and testable.
I have no illusions that this is fundamentally pretty horrible, and the leftover wiggle room for writing a memory manager is barely more than a hairline. Just not seeing how other options are better.
The intent of notifiers was never to endlessly block while vast amounts of SW does work.
Going around and switching everything in a GPU to GFP_ATOMIC seems like a bad idea.
It's not everyone, or at least not everywhere, it's some fairly limited cases. Also, even if we drop the mmu_notifier on the floor, then we're stuck with shrinkers and GFP_NOFS. Still need a mempool of some sort to guarantee you get out of a bind, so not much better.
At least that's my current understanding of where we are across all drivers.
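To make the mempool idea concrete, here is a userspace sketch of a reserve pool. It is a toy, not the kernel's mempool implementation: struct job_pool, POOL_MIN and the allocator_ok flag (which stands in for "the normal non-blocking allocation succeeded") are all hypothetical. What it shows is the guarantee: a fixed number of elements is set aside up front, so a path that must not recurse into reclaim can still make progress when the regular allocator fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_MIN 4	/* hypothetical number of reserved elements */

struct job_pool {
	size_t elem_size;
	void *reserve[POOL_MIN];
	size_t nr_free;		/* elements currently sitting in the reserve */
};

/* Release whatever is still held in the reserve. */
static void job_pool_destroy(struct job_pool *p)
{
	while (p->nr_free > 0)
		free(p->reserve[--p->nr_free]);
}

/* Fill the reserve up front, while blocking allocations are still fine. */
static bool job_pool_init(struct job_pool *p, size_t elem_size)
{
	p->elem_size = elem_size;
	p->nr_free = 0;
	while (p->nr_free < POOL_MIN) {
		void *elem = malloc(elem_size);

		if (!elem) {
			job_pool_destroy(p);
			return false;
		}
		p->reserve[p->nr_free++] = elem;
	}
	return true;
}

/*
 * Try the regular allocator first and fall back to the reserve. Returns
 * NULL only when the reserve is exhausted as well; a real pool would sleep
 * here until job_pool_free() hands an element back.
 */
static void *job_pool_alloc(struct job_pool *p, bool allocator_ok)
{
	assert(p->nr_free <= POOL_MIN);

	if (allocator_ok) {
		void *elem = malloc(p->elem_size);

		if (elem)
			return elem;
	}
	if (p->nr_free > 0)
		return p->reserve[--p->nr_free];
	return NULL;
}

/* Refill the reserve before giving memory back to the allocator. */
static void job_pool_free(struct job_pool *p, void *elem)
{
	if (!elem)
		return;
	if (p->nr_free < POOL_MIN)
		p->reserve[p->nr_free++] = elem;
	else
		free(elem);
}
```

The reserve only helps if every element taken from it is eventually returned without needing a further allocation, which is the same forward-progress argument as for the fences themselves.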
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Right, nor will RDMA ODP.
Hm, what's the context here? I thought RDMA side you really don't want dma_fence in mmu_notifiers, so not clear to me what you're agreeing on here. -Daniel
On Tue, Jun 16, 2020 at 02:07:19PM +0200, Daniel Vetter wrote:
I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
Right, nor will RDMA ODP.
Hm, what's the context here? I thought RDMA side you really don't want dma_fence in mmu_notifiers, so not clear to me what you're agreeing on here.
rdma does not use dma_fence at all, and though it is hard to tell, I didn't notice a dma_fence in the nouveau invalidation call path.
At the very least I think there should be some big warning that dma_fence in notifiers should be avoided.
Ie it is strange that the new totally-not-a-gpu drivers use dma_fence, they surely don't have the same constraints as the existing GPU world, and it would be annoying to see dma_fence notifiers spring up in them
Jason
On Wed, Jun 17, 2020 at 9:27 AM Jason Gunthorpe jgg@ziepe.ca wrote:
rdma does not use dma_fence at all, and though it is hard to tell, I didn't notice a dma_fence in the nouveau invalidation call path.
Nouveau for compute has hw page faults. It doesn't have hw page faults for non-compute fixed function blocks afaik, so there's a hybrid model going on. But nouveau also doesn't support userspace memory (instead of driver-allocated buffer objects) for these fixed function blocks, so no need to have a dma_fence_wait in there.
> At the very least I think there should be some big warning that dma_fence in notifiers should be avoided.
Yeah I'm working on documentation, and also the notifiers here hopefully make it clear it's a massive pain. I think we could even make a hard rule that dma_fence in mmu notifier outside of drivers/gpu is a bug/misfeature.
Might be a good idea to add a MAINTAINERS entry with a K: regex pattern, so that you can catch such modifiers. We do already have such a pattern for dma-fence, to catch abuse. So if you want I could type up a documentation patch for this, get your and others' acks, and the dri-devel folks would enforce that the dma_fence_wait madness doesn't leak beyond drivers/gpu.
> I.e. it is strange that the new totally-not-a-gpu drivers use dma_fence; they surely don't have the same constraints as the existing GPU world, and it would be annoying to see dma_fence notifiers spring up in them.
If you mean drivers/misc/habanalabs, that's going to get taken care of:
commit ed65bfd9fd86dec3772570b0320ca85b9fb69f2e
Author: Daniel Vetter daniel.vetter@ffwll.ch
Date:   Mon May 11 11:11:42 2020 +0200

    habanalabs: don't set default fence_ops->wait

    It's the default.

    Also so much for "we're not going to tell the graphics people how to review their code", dma_fence is a pretty core piece of gpu driver infrastructure. And it's very much uapi relevant, including piles of corresponding userspace protocols and libraries for how to pass these around.

    Would be great if habanalabs would not use this (from a quick look it's not needed at all), since open source the userspace and playing by the usual rules isn't on the table. If that's not possible (because it's actually using the uapi part of dma_fence to interact with gpu drivers) then we have exactly what everyone promised we'd want to avoid.

    Signed-off-by: Daniel Vetter daniel.vetter@intel.com
    Reviewed-by: Oded Gabbay oded.gabbay@gmail.com
    Signed-off-by: Oded Gabbay oded.gabbay@gmail.com
Oded has agreed to remove the dma-fence usage, since they really don't need it (and all the baggage that comes with it), plain old completion is enough for their use. This use is also why I added the regex to MAINTAINERS, so that in the future we can catch people who try to use dma_fence because it looks cute and useful, and are completely oblivious to all the pain and headaches involved.
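For readers unfamiliar with the mechanism: a K: line in MAINTAINERS is a regex matched against patch content, so get_maintainer.pl pulls in the listed people whenever a patch mentions the keyword. A rough sketch of such an entry follows; the section name, list address and exact pattern are illustrative assumptions, not copied from the tree.

```
DMA-FENCE USERS (hypothetical entry)
L:	dri-devel@lists.freedesktop.org
K:	\bdma_fence\b
```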
Cheers, Daniel
On Wed, Jun 17, 2020 at 09:57:54AM +0200, Daniel Vetter wrote:
> > At the very least I think there should be some big warning that dma_fence in notifiers should be avoided.
> Yeah I'm working on documentation, and also the notifiers here hopefully make it clear it's massive pain. I think we could even make a hard rule that dma_fence in mmu notifier outside of drivers/gpu is a bug/misfeature.
Yep!
> Might be a good idea to add a MAINTAINERS entry with a K: regex pattern, so that you can catch such modifiers. We do already have such a pattern for dma-fence, to catch abuse. So if you want I could type up a documentation patch for this, get your and others acks and the dri-devel folks would enforce that the dma_fence_wait madness doesn't leak beyond drivers/gpu
It seems like the best thing
> Oded has agreed to remove the dma-fence usage, since they really don't need it (and all the baggage that comes with it), plain old completion is enough for their use. This use is also why I added the regex to MAINTAINERS, so that in the future we can catch people who try to use dma_fence because it looks cute and useful, and are completely oblivious to all the pain and headaches involved.
This is good!
Thanks, Jason
On Wed, Jun 17, 2020 at 12:29:40PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 17, 2020 at 09:57:54AM +0200, Daniel Vetter wrote:
> > Might be a good idea to add a MAINTAINERS entry with a K: regex pattern, so that you can catch such modifiers. We do already have such a pattern for dma-fence, to catch abuse. So if you want I could type up a documentation patch for this, get your and others acks and the dri-devel folks would enforce that the dma_fence_wait madness doesn't leak beyond drivers/gpu
> It seems like the best thing
Just thought about where to best put this, and I think including it as another paragraph in the next round of this series makes the most sense. You'll get cc'ed for acking when that happens - might take a while since there's a lot of details here all over to sort out. -Daniel
On Tue, Jun 16, 2020 at 2:07 PM Daniel Vetter daniel@ffwll.ch wrote:
> Hi Jason,
> Somehow this got stuck somewhere in the mail queues, only popped up just now ...
> On Thu, Jun 11, 2020 at 11:15:15AM -0300, Jason Gunthorpe wrote:
> > On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
> > > > I still have my doubts about allowing fence waiting from within shrinkers. IMO ideally they should use a trywait approach, in order to allow memory allocation during command submission for drivers that publish fences before command submission. (Since early reservation object release requires that).
> > > Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up with a mempool to make sure it can handle its allocations.
> > > > But since drivers are already waiting from within shrinkers and I take your word for HMM requiring this,
> > > Yeah the big trouble is HMM and mmu notifiers. That's the really awkward one, the shrinker one is a lot less established.
> > I really question if HW that needs something like DMA fence should even be using mmu notifiers - the best use is HW that can fence the DMA directly without having to get involved with some command stream processing.
> > Or at the very least it should not be a generic DMA fence but a narrowed completion tied only into the same GPU driver's command completion processing which should be able to progress without blocking.
> The problem with gpus is that these completions leak across the board like mad. Both internally within memory managers (made a lot worse with p2p direct access to vram), and through uapi.
> Many gpus still have a very hard time preempting, so doing an overall switch in drivers/gpu to a memory management model where that is required is not a very realistic option. And minimally you need either preempt (still takes a while, but a lot faster generally than waiting for work to complete) or hw faults (just a bunch of tlb flushes plus virtual indexed caches, so just the caveat of that for a gpu, which has lots and big tlbs and caches). So preventing the completion leaks within the kernel is I think unrealistic, except if we just say "well sorry, run on windows, mkay" for many gpu workloads. Or more realistic "well sorry, run on the nvidia blob with nvidia hw".
> The userspace side we can somewhat isolate, at least for pure compute workloads. But the thing is drivers/gpu is a continuum from tiny socs (where dma_fence is a very nice model) to huge compute stuff (where it's maybe not the nicest, but hey hw sucks so still needed). Doing a full-on break in uapi somewhere in there is at least a bit awkward, e.g. some of the media codec code on intel runs all the way from the smallest intel soc to the big transcode servers.
> So the current status quo is "total mess, every driver defines their own rules". All I'm trying to do is some common rules here, to make this mess slightly more manageable and overall reviewable and testable.
> I have no illusions that this is fundamentally pretty horrible, and the leftover wiggle room for writing a memory manager is barely more than a hairline. Just not seeing how other options are better.
So bad news is that gpus are horrible, but I think if you don't have to review gpu drivers it's substantially better. If you do have hw with full device page fault support, then there's no need to ever install a dma_fence. Punching out device ptes and flushing caches is all that's needed. That is also the plan we have, for the workloads and devices where that's possible.
Now my understanding for rdma is that if you don't have hw page fault support, then the only other option is to more or less permanently pin the memory. So again, dma_fences are completely useless, since it's entirely up to userspace when a given piece of registered memory isn't needed anymore, and the entire problem boils down to how much do we allow random userspace to just pin (system or device) memory. Or at least I don't really see any other solution.
On the other end we have simpler devices like video input/output. Those always need pinned memory, but through hw design it's limited in how much you can pin (generally max resolution times a limited set of buffers to cycle through). Just including that memory pinning allowance as part of device access makes sense.
It's only gpus (I think) which are in this awkward in-between spot where dynamic memory management really is much wanted, but the hw kinda sucks. Aside, about 10+ years ago we had a similar problem with gpu hw, but for security: Many gpus didn't have any kind of page tables to isolate different clients from one another. drivers/gpu fixed this by parsing & validating what userspace submitted to make sure it's only ever accessing its own buffers. Most gpus have become reasonable nowadays and do have proper per-process pagetables (gpu process, not the pasid stuff), but even today there's still some of the old model left in some of the smallest SoCs.
tldr; of all this: gpus kinda suck sometimes, but that's also not news :-/
Cheers, Daniel
> > The intent of notifiers was never to endlessly block while vast amounts of SW does work.
> > Going around and switching everything in a GPU to GFP_ATOMIC seems like a bad idea.
> It's not everyone, or at least not everywhere, it's some fairly limited cases. Also, even if we drop the mmu_notifier on the floor, then we're stuck with shrinkers and GFP_NOFS. Still need a mempool of some sort to guarantee you get out of a bind, so not much better.
> At least that's my current understanding of where we are across all drivers.
> > > I've pinged a bunch of armsoc gpu driver people and asked them how much this hurts, so that we have a clear answer. On x86 I don't think we have much of a choice on this, with userptr in amd and i915 and hmm work in nouveau (but nouveau I think doesn't use dma_fence in there).
> > Right, nor will RDMA ODP.
> Hm, what's the context here? I thought RDMA side you really don't want dma_fence in mmu_notifiers, so not clear to me what you're agreeing on here.
> -Daniel
Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Jun 17, 2020 at 08:48:50AM +0200, Daniel Vetter wrote:
> Now my understanding for rdma is that if you don't have hw page fault support,
The RDMA ODP feature is restartable HW page faulting just like nouveau has. The classical MR feature doesn't have this. Only mlx5 HW supports ODP today.
> It's only gpus (I think) which are in this awkward in-between spot where dynamic memory management really is much wanted, but the hw kinda sucks. Aside, about 10+ years ago we had a similar problem with gpu hw, but for security: Many gpus didn't have any kind of page tables to isolate different clients from one another. drivers/gpu fixed this by parsing & validating what userspace submitted to make sure it's only ever accessing its own buffers. Most gpus have become reasonable nowadays and do have proper per-process pagetables (gpu process, not the pasid stuff), but even today there's still some of the old model left in some of the smallest SoCs.
But I still don't understand why a dma fence is needed inside the GPU driver itself in the notifier.
Surely the GPU driver can block and release the notifier directly from its own command processing channel?
Why does this fence and all it entails need to leak out across drivers?
Jason
On Wed, Jun 17, 2020 at 12:28:35PM -0300, Jason Gunthorpe wrote:
> But I still don't understand why a dma fence is needed inside the GPU driver itself in the notifier.
> Surely the GPU driver can block and release the notifier directly from its own command processing channel?
> Why does this fence and all it entails need to leak out across drivers?
So 10 years ago we had this world of every gpu driver is its own bucket, nothing leaks out to the world. But the world had a different idea how gpus were supposed to work, with stuff like:
- laptops with a power-efficient but slow gpu integrated on the cpu die, and a 2nd, much faster but also more wasteful gpu separately
- also multi-gpu rendering (but on linux we never really got around to enabling that, at least not for 3d rendering)
- soc just bundle IP blocks together, and very often they feel like they have to do their own display block (it's fairly easy and allows you to keep your hw engineers justified on payroll with some more patents they create), but anything more fancy they buy in. So from a driver architecture pov even a single chip soc looks like a bundle of gpus
And you want to pipeline all this because performance, so waiting in userspace for one block to finish before you hand it over to the other isn't a good idea.
Hence dma_fence as a cross driver leak was created by pulling the gpu completion tracking from the drm/ttm library for managing vram.
Now with glorious hindsight we could have come up with a different approach, where synchronization is managed by userspace, kernel just provides some primitives (kinda like futexes, but for gpu). And the kernel manages residency and gpu pte wrangling entirely separately. But:
- 10 years ago drivers/gpu was a handful of people at best
- we just finished the massive rewrite to get to a kernel memory manager and kernel modesetting (over 5 years after windows/macos), so appetite for massive rewrites was minimal.
Here we are, now with 50 more drivers built on top and an entire userspace ecosystem that relies on all this (because yes we made dma_fence also the building block for all the cross-process uapi, why wouldn't we).
I hope that explains a bit the history of how and why we ended up here.
Maybe I should do a plumbers talk about "How not to memory manage - cautious tales from drivers/gpu". I think there's a lot of areas where the conversation usually goes "wtf" ... long explanation of history and technical reasons leading to a "oh dear". With a lot of other accelerators and things landing it might be good to have a list of things that look tempting (because hey 2% faster) but aren't worth the pain. -Daniel
On Thu, Jun 18, 2020 at 05:00:51PM +0200, Daniel Vetter wrote:
> So 10 years ago we had this world of every gpu driver is its own bucket, nothing leaks out to the world. But the world had a different idea how gpus were supposed to work, with stuff like:
Sure, I understand DMA fence, but why does a *notifier* need it?
The job of the notifier is to guarantee that the device it is connected to is not doing DMA before it returns.
That just means you need to prove that device is done with the buffer.
As I've understood GPU that means you need to show that the commands associated with the buffer have completed. This is all local stuff within the driver, right? Why use fence (other than it already exists)?
Jason
On Fri, Jun 19, 2020 at 8:58 AM Jason Gunthorpe jgg@ziepe.ca wrote:
> Sure, I understand DMA fence, but why does a *notifier* need it?
> The job of the notifier is to guarantee that the device it is connected to is not doing DMA before it returns.
> That just means you need to prove that device is done with the buffer.
> As I've understood GPU that means you need to show that the commands associated with the buffer have completed. This is all local stuff within the driver, right? Why use fence (other than it already exists)?
Because that's the end-of-dma thing. And it's cross-driver for the above reasons, e.g.
- device A renders some stuff. Userspace gets dma_fence A out of that (well sync_file or one of the other uapi interfaces, but you get the idea)
- userspace (across process or just different driver) issues more rendering for device B, which depends upon the rendering done on device A. So dma_fence A is a dependency and will block this dma operation. Userspace (and the kernel) gets dma_fence B out of this
- because unfortunate reasons, the same rendering on device B also needs a userptr buffer, which means that dma_fence B is also the one that the mmu_range_notifier needs to wait on before it can tell core mm that it can go ahead and release those pages
- unhappiness ensues, because now the mmu notifier from device B can get hung up on the dma operation device A is doing
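The chain described above can be modelled in a few lines of userspace C. This is only a toy sketch: struct toy_fence and the toy_* helpers are made-up names for illustration, not the kernel's dma_fence API, and each job is assumed to have at most one dependency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, NOT the kernel's struct dma_fence. */
struct toy_fence {
	const char *name;
	bool done;                    /* this job's own DMA has finished */
	const struct toy_fence *dep;  /* fence this job waits on, or NULL */
};

/* A fence counts as signalled only once it and everything it depends on is done. */
static bool toy_fence_signaled(const struct toy_fence *f)
{
	for (; f; f = f->dep)
		if (!f->done)
			return false;
	return true;
}

/* Hypothetical stand-in for the notifier: may core mm release the pages yet? */
static bool toy_notifier_may_release(const struct toy_fence *userptr_fence)
{
	return toy_fence_signaled(userptr_fence);
}

/* Deepest pending fence in the chain, i.e. what the notifier is really stuck on. */
static const struct toy_fence *toy_blocking_fence(const struct toy_fence *f)
{
	const struct toy_fence *blocker = NULL;

	for (; f; f = f->dep)
		if (!f->done)
			blocker = f;
	return blocker;
}
```

With fence_B depending on fence_A and both pending, toy_blocking_fence() on fence_B returns fence_A: the notifier on device B ends up waiting for device A.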
If you want to avoid this either a) have less shitty hw (not an option, gpus are gpus, it is slowly getting better though) or b) force userspace to stall before handing over to next device (about as uncool) or c) just pin all the memory always, who cares (also rather unpopular, gpus tend to use all the memory they can get).
I guess the thing with gpus is that dma operations aren't like read/writes for pretty much everything else, but essentially compute contexts (usually implemented as ringbuffers where you stream stuff into) with cross everything dependencies. This even holds within a single gpu, since pretty much all modern gpus have multiple different engines specialized on different things. And yup that's directly exposed to userspace, for vulkan and other low-level gpu apis even directly to applications. So dma operation for gpu isn't just "done when the read/write finishes", but pulls in an entire chain of dependencies and ordering that needs to happen before it can even start.
-Daniel
On Fri, Jun 19, 2020 at 09:22:09AM +0200, Daniel Vetter wrote:
> > As I've understood GPU that means you need to show that the commands associated with the buffer have completed. This is all local stuff within the driver, right? Why use fence (other than it already exists)
> Because that's the end-of-dma thing. And it's cross-driver for the above reasons, e.g.
> - device A renders some stuff. Userspace gets dma_fence A out of that (well sync_file or one of the other uapi interfaces, but you get the idea)
> - userspace (across process or just different driver) issues more rendering for device B, which depends upon the rendering done on device A. So dma_fence A is a dependency and will block this dma operation. Userspace (and the kernel) gets dma_fence B out of this
> - because unfortunate reasons, the same rendering on device B also needs a userptr buffer, which means that dma_fence B is also the one that the mmu_range_notifier needs to wait on before it can tell core mm that it can go ahead and release those pages
I was afraid you'd say this - this is complete madness for other DMA devices to borrow the notifier hook of the first device!
What if the first device is a page faulting device and doesn't call dma_fence??
If you are going to treat things this way then the mmu notifier really needs to be part of some core DMA buf, and not randomly sprinkled in drivers.
But really this is what page pinning is supposed to be used for, the MM behavior when it blocks on a pinned page is less invasive than if it stalls inside a mmu notifier.
You can mix it, use mmu notifiers to keep track if the buffer is still live, but when you want to trigger DMA then pin the pages and keep them pinned until DMA is done. The pin protects things (well, fork is still a problem).
Do not need to wait on dma_fence in notifiers.
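A minimal sketch of that mixed scheme, as a toy userspace model in C. The toy_userptr structure and its helpers are hypothetical names; a real driver would pin through the mm's page pinning interfaces and would need locking, both of which are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model of a userptr buffer; all names are hypothetical. */
struct toy_userptr {
	bool valid;     /* cleared by the notifier when the range is invalidated */
	int pin_count;  /* pages stay pinned while DMA is in flight */
};

/* Notifier side: only marks the buffer stale, never waits for DMA. */
static void toy_invalidate(struct toy_userptr *buf)
{
	buf->valid = false;
}

/* Submission side: refuse stale buffers, otherwise pin for the DMA's duration. */
static bool toy_dma_begin(struct toy_userptr *buf)
{
	if (!buf->valid)
		return false;  /* caller has to re-establish the mapping first */
	buf->pin_count++;
	return true;
}

/* Completion side: drop the pin taken in toy_dma_begin(). */
static void toy_dma_end(struct toy_userptr *buf)
{
	assert(buf->pin_count > 0);
	buf->pin_count--;
}
```

The point of the split is that toy_invalidate() returns immediately; in-flight DMA is protected by the pin, and only new submissions see the stale flag.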
Jason
On Fri, Jun 19, 2020 at 1:39 PM Jason Gunthorpe jgg@ziepe.ca wrote:
> I was afraid you'd say this - this is complete madness for other DMA devices to borrow the notifier hook of the first device!
The first device might not even have a notifier. This is the 2nd device, waiting on a dma_fence of its own, but which happens to be queued up as a dma operation behind something else.
> What if the first device is a page faulting device and doesn't call dma_fence??
Not sure what you mean with this ... even if it does page-faulting for some other reasons, it'll emit a dma_fence which the 2nd device can consume as a dependency.
> If you are going to treat things this way then the mmu notifier really needs to be part of some core DMA buf, and not randomly sprinkled in drivers
So maybe again unclear, we don't allow such userptr dma-buf to even be shared. They're just for slurping in stuff in the local device (generally from file io or something the cpu has done or similar). There have been attempts to use it as the general backing storage, but that didn't go down too well because way too many complications.
Generally most memory the gpu operates on isn't stuff that's mmu_notifier'ed. And also, the device with userptr support only waits for its own dma_fence (because well you can't share this stuff, we disallow that).
The problem is that there's piles of other dependencies for a dma job. GPU doesn't just consume a single buffer each time, it consumes entire lists of buffers and mixes them all up in funny ways. Some of these buffers are userptr, entirely local to the device. Other buffers are just normal device driver allocations (and managed with some shrinker to keep them in check). And then there's the actually shared dma-buf with other devices. The trouble is that they're all bundled up together.
Now we probably should have some helper code for userptr so that all drivers do this roughly the same, but that's just not there yet. But it can't be a dma-buf exporter behind the dma-buf interfaces, because even just pinned get_user_pages would be too different semantics compared to normal shared dma-buf objects, that's all very tightly tied into the specific driver.
> But really this is what page pinning is supposed to be used for, the MM behavior when it blocks on a pinned page is less invasive than if it stalls inside a mmu notifier.
> You can mix it, use mmu notifiers to keep track if the buffer is still live, but when you want to trigger DMA then pin the pages and keep them pinned until DMA is done. The pin protects things (well, fork is still a problem)
Hm I thought amdgpu had that (or drm/radeon as the previous incarnation of that stack), and was unhappy about the issues. Would need Christian König to chime in.
> Do not need to wait on dma_fence in notifiers.
Maybe :-) The goal of this series is more to document current rules and make them more consistent. Fixing them if we don't like them might be a follow-up task, but that would likely be a pile more work. First we need to know what the exact shape of the problem even is. -Daniel
On Fri, Jun 19, 2020 at 05:06:04PM +0200, Daniel Vetter wrote:
> On Fri, Jun 19, 2020 at 1:39 PM Jason Gunthorpe jgg@ziepe.ca wrote:
> > I was afraid you'd say this - this is complete madness for other DMA devices to borrow the notifier hook of the first device!
> The first device might not even have a notifier. This is the 2nd device, waiting on a dma_fence of its own, but which happens to be queued up as a dma operation behind something else.
> > What if the first device is a page faulting device and doesn't call dma_fence??
> Not sure what you mean with this ... even if it does page-faulting for some other reasons, it'll emit a dma_fence which the 2nd device can consume as a dependency.
At some point the pages under the buffer have to be either pinned or protected by mmu notifier. So each and every single device doing DMA to these pages must either pin, or use mmu notifier.
Driver A should never 'borrow' a notifier from B
If each driver controls its own lifetime of the buffers, why can't the driver locally wait for its device to finish?
Can't the GPUs cancel work that is waiting on a DMA fence? Ie if Driver A detects that work completed and wants to trigger a DMA fence, but it now knows the buffer is invalidated, can't it tell driver B to give up?
> The problem is that there's piles of other dependencies for a dma job. GPU doesn't just consume a single buffer each time, it consumes entire lists of buffers and mixes them all up in funny ways. Some of these buffers are userptr, entirely local to the device. Other buffers are just normal device driver allocations (and managed with some shrinker to keep them in check). And then there's the actually shared dma-buf with other devices. The trouble is that they're all bundled up together.
But why does this matter? Does the GPU itself consume some work and then stall internally waiting for an external DMA fence?
Otherwise I would expect this dependency chain should be breakable by aborting work waiting on fences upon invalidation (without stalling)
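What such an abort-on-invalidate rule could look like, as a toy state machine in C. All names are hypothetical, and whether real hardware can actually drop a queued job like this is exactly the open question here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy job states; hypothetical, not taken from any driver. */
enum toy_job_state {
	TOY_JOB_WAITING_DEP,  /* queued, still waiting on a foreign fence */
	TOY_JOB_RUNNING,      /* hardware is doing DMA right now */
	TOY_JOB_DONE,
	TOY_JOB_CANCELLED,
};

struct toy_job {
	enum toy_job_state state;
};

/*
 * Called on invalidation. Returns true when the notifier could return
 * without stalling, false when it would still have to wait for the device.
 */
static bool toy_invalidate_job(struct toy_job *job)
{
	switch (job->state) {
	case TOY_JOB_WAITING_DEP:
		/* Nothing has touched the pages yet, so the job can be dropped. */
		job->state = TOY_JOB_CANCELLED;
		return true;
	case TOY_JOB_DONE:
	case TOY_JOB_CANCELLED:
		return true;
	case TOY_JOB_RUNNING:
	default:
		/* DMA in flight: only local completion can release the pages. */
		return false;
	}
}
```

Only the TOY_JOB_RUNNING case leaves the notifier waiting, and that wait is on the driver's own device rather than on a foreign fence.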
> > Do not need to wait on dma_fence in notifiers.
> Maybe :-) The goal of this series is more to document current rules and make them more consistent. Fixing them if we don't like them might be a follow-up task, but that would likely be a pile more work. First we need to know what the exact shape of the problem even is.
Fair enough
Jason
On Fri, Jun 19, 2020 at 5:15 PM Jason Gunthorpe jgg@ziepe.ca wrote:
At some point the pages under the buffer have to be either pinned or protected by mmu notifier. So each and every single device doing DMA to these pages must either pin, or use mmu notifier.
Driver A should never 'borrow' a notifier from B
It doesn't. I guess this would be great topic for lpc with a seriously big white-board, but I guess we don't have that this year again, so let me try again. Simplified example ofc, but should be the gist.
Ingredients:
- Device A and Device B
- A dma-buf, shared between device A and device B, let's call that shared_buf
- A userptr buffer, which userspace created on device B to hopefully somewhat track a virtual memory range, let's call that userptr_buf.
- A pile of other buffers, but we pretend they don't exist (because they kinda don't matter).
Sequence of events as userspace issues them to the kernel:
1. dma operation on device A, which fills some interesting stuff into shared_buf. Userspace gets back a handle to dma_fence fence_A. No mmu notifier anywhere to be seen in the driver for device A.
2. userspace passes fence_A around to some other place
3. that other place takes the handles for shared_buf, fence_A and userptr_buf and starts a dma operation on device B. It's one dma operation, maybe device B takes the data from shared_buf and compresses it into userptr_buf, so that userspace can then send it over the network or to disk or whatever. Device B has an mmu_notifier. Userspace gets back fence_B, which represents this dma operation. The kernel also stuffs this fence_B into the mmu_range_notifier for userptr_buf.
-> at this point device A might still be crunching the numbers
4. device A is finally done doing whatever it was supposed to do, and fence_A completes
5. device B wakes up (this might or might not involve the kernel, usually it does) since fence_A has completed, and now starts doing its own crunching.
6. once device B is also done, it signals fence_B
In all this device A has never borrowed the mmu notifier or even accessed the memory in userptr_buf or had access to that buffer handle.
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
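To make the ordering above concrete, here is a minimal userspace sketch of the six steps. It is a toy model, not kernel code: the `Fence` class, `run_job()` and `invalidate_userptr()` are hypothetical stand-ins for dma_fence, a device job and device B's mmu notifier callback, and threads stand in for the two devices.

```python
import threading


class Fence:
    """Toy stand-in for a dma_fence: signaled once, waiters block until then."""

    def __init__(self, name):
        self.name = name
        self._event = threading.Event()

    def signal(self):
        self._event.set()

    def wait(self, timeout=None):
        # Returns True once signaled, False if the timeout expired first.
        return self._event.wait(timeout)


def run_job(name, deps, out_fence, log):
    """Wait for every dependency fence, do the 'dma', then signal out_fence."""
    for dep in deps:
        dep.wait()
    log.append(f"{name} done")
    out_fence.signal()


def invalidate_userptr(fence, log, timeout=None):
    """Toy mmu notifier for userptr_buf: pages can only go once fence_B signaled."""
    ok = fence.wait(timeout)
    log.append("pages released" if ok else "invalidate still blocked")
    return ok


log = []
fence_a, fence_b = Fence("fence_A"), Fence("fence_B")

# Step 3: device B's job is queued up behind fence_A.
job_b = threading.Thread(target=run_job, args=("device B", [fence_a], fence_b, log))
job_b.start()

# The notifier fires while device A is still crunching: it cannot make progress.
blocked = invalidate_userptr(fence_b, log, timeout=0.05)

# Steps 4 to 6: device A finishes, device B wakes up, fence_B signals.
run_job("device A", [], fence_a, log)
job_b.join()
released = invalidate_userptr(fence_b, log)
```

Device A never touches userptr_buf or any notifier here, yet the notifier on device B only returns after device A has signaled, which is the transitive wait described above.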
If each driver controls its own lifetime of the buffers, why can't the driver locally wait for its device to finish?
Can't the GPUs cancel work that is waiting on a DMA fence? Ie if Driver A detects that work completed and wants to trigger a DMA fence, but it now knows the buffer is invalidated, can't it tell driver B to give up?
We can (usually, the shitty hw where we can't has generally disappeared) with gpu reset. Users make really sad faces when that happens though, and generally they're only ok with that if it's indeed a nasty gpu program that resulted in the crash (there's some webgl shaders that run too long for quick&easy testing of how good the gpu reset is, don't do that if you care about the data in your desktop session ...).
The trouble is that userspace assembles the work that's queued up on the gpu. After submission everyone has forgotten enough that just canceling stuff and re-issuing everything isn't on the table.
Some hw is better, with real hw page faults and stuff, but those also don't need dma_fence to track their memory. But generally just not possible.
The problem is that there's piles of other dependencies for a dma job. GPU doesn't just consume a single buffer each time, it consumes entire lists of buffers and mixes them all up in funny ways. Some of these buffers are userptr, entirely local to the device. Other buffers are just normal device driver allocations (and managed with some shrinker to keep them in check). And then there's the actually shared dma-buf with other devices. The trouble is that they're all bundled up together.
But why does this matter? Does the GPU itself consume some work and then stall internally waiting for an external DMA fence?
Yup, see above, that's what's going on. Userspace queues up distributed work across engines & drivers, and then just waits for the entire thing to cascade and finish.
Otherwise I would expect this dependency chain should be breakable by aborting work waiting on fences upon invalidation (without stalling)
Yup, it would. Now on some hw you have a gpu work scheduler that sits in some kthread, and you could probably unschedule the work if there's some external dependency and you get an mmu notifier callback. Then put it on some queue, re-acquire the user pages and then reschedule it.
It's still as horrible, since you still have the wait for the completion in there, the only benefit is that other device drivers without userptr support don't have to live with that specific constraint. dma_fence rules are still very strict and easy to deadlock, so we'd still want some lockdep checks, but now you'd have to somehow annotate whether you're a driver with userptr or a driver without userptr and make sure everyone gets it right.
Also a scheduler which can unschedule and reschedule is mighty more complex than one which cannot, plus it needs to do that from mmu notifier callback (not the nicest calling context we have in the kernel by far). And if you have a single driver which doesn't unschedule, you're still screwed from an overall subsystem pov.
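A rough sketch of what such an unschedule/reschedule scheduler could look like, purely as a toy model. `ToyScheduler`, `Job` and all method names are made up for illustration, and the sketch only covers jobs that have not started yet; anything already running on the hw would still need the wait (or preemption) discussed above.

```python
from collections import deque


class Job:
    def __init__(self, name, userptr_ranges=()):
        self.name = name
        self.userptr_ranges = set(userptr_ranges)
        self.pages_valid = True


class ToyScheduler:
    def __init__(self):
        self.ready = deque()   # jobs that may be pushed to the hw
        self.parked = []       # jobs pulled back because their pages went away

    def submit(self, job):
        self.ready.append(job)

    def on_invalidate(self, rng):
        """Toy mmu notifier callback: park queued jobs that use this range."""
        keep = deque()
        for job in self.ready:
            if rng in job.userptr_ranges:
                job.pages_valid = False
                self.parked.append(job)
            else:
                keep.append(job)
        self.ready = keep

    def revalidate(self):
        """Pretend to re-acquire the user pages, then put the jobs back."""
        for job in self.parked:
            job.pages_valid = True
            self.ready.append(job)
        self.parked = []

    def run_next(self):
        job = self.ready.popleft()
        assert job.pages_valid, "must never run a job with stale pages"
        return job.name


sched = ToyScheduler()
sched.submit(Job("render", ["userptr_buf"]))
sched.submit(Job("blit"))

sched.on_invalidate("userptr_buf")   # notifier fires before "render" started
first = sched.run_next()             # the unrelated job still runs
sched.revalidate()
second = sched.run_next()            # "render" runs after pages are re-acquired
```

Even in this tiny form the scheduler needs a second queue and a revalidation step, which hints at why rolling this out across every driver is a lot of work.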
So lots of code, lots of work, and not that much motivation to roll it out consistently across the board since there's no incremental payoff. Plus the thing is, the drivers without userptr are generally the really simple ones. Much easier to just fix those than to change the big complex render beasts which want userptr :-)
E.g. the atomic modeset framework we've rolled out in the past few years and that almost all display drivers now use pulls any (sleeping) locks and memory allocations out of the critical async work section by design. Some drivers still managed to butcher it (the annotations caught some locking bugs already, not just memory allocations in the wrong spot), but generally easy to fix those.
Do not need to wait on dma_fence in notifiers.
Maybe :-) The goal of this series is more to document current rules and make them more consistent. Fixing them if we don't like them might be a follow-up task, but that would likely be a pile more work. First we need to know what the exact shape of the problem even is.
Fair enough
Full disclosure: We are aware that we've designed ourselves into an impressive corner here, and there's lots of talks going on about untangling the dma synchronization from the memory management completely. But
- that needs minimally reliable preempt support for gpu work, and hw engineers seem to have a hard time with that (or just don't want to do it). hw page faults would be even better, and even more wishlist than reality if you expect it to work everywhere.
- it'd be a complete break of the established userspace abi, including all the cross driver stuff. Which means it's not just some in-kernel refactoring, we need to rev the entire ecosystem. And that takes a very long time, and needs serious pressure to get people moving.
E.g. the atomic modeset rework is still not yet rolled out to major linux desktop environments, and it's over 5 years old, and it's starting to seriously hurt because lots of performance features require atomic modeset in userspace to be able to use them. I think rev'ing the entire memory management support will take as long. Plus I don't think we can ditch the old ways - even if all the hw currently using this would be dead (and we can delete the drivers) there's still the much smaller gpus in SoC that also need to go through the entire evolution. -Daniel
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sounds like, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is the right answer for this kind of workflow. I think converting pinning to notifiers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Full disclosure: We are aware that we've designed ourselves into an impressive corner here, and there's lots of talks going on about untangling the dma synchronization from the memory management completely. But
I think the documenting is really important: only GPU should be using this stuff and driving notifiers this way. Complete NO for any totally-not-a-GPU things in drivers/accel for sure.
Jason
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
So, it sounds like, fundamentally you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is the right answer for this kind of workflow. I think converting pinning to notifiers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence part is only true for user ptr buffers, which predate any HMM work and thus were using mmu notifiers already. You need the mmu notifier there because of fork and other corner cases.
For nouveau the notifier does not need to wait for anything; it can update the GPU page table right away. Modulo needing to write to GPU memory using the dma engine if the GPU page table is in GPU memory that is not accessible from the CPU, but that's never the case for nouveau so far (I expect it will be at some point).
So I see this as 2 different cases: the user ptr case, which does pin pages by the way, where things are synchronous, versus the HMM case, where everything is asynchronous.
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without fence wait. The issue for AMD is that they already update their GPU page table using DMA engine. I believe this is still doable if they use a kernel only DMA engine context, where only kernel can queue up jobs so that you do not need to wait for unrelated things and you can prioritize GPU page table update which should translate in fast GPU page table update without DMA fence.
Full disclosure: We are aware that we've designed ourselves into an impressive corner here, and there's lots of talks going on about untangling the dma synchronization from the memory management completely. But
I think the documenting is really important: only GPU should be using this stuff and driving notifiers this way. Complete NO for any totally-not-a-GPU things in drivers/accel for sure.
Yes, users that expect HMM need to be asynchronous. But it is hard to revert user ptr, as it was done a long time ago.
Cheers, Jérôme
On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
Note that the dma fence part is only true for user ptr buffers, which predate any HMM work and thus were using mmu notifiers already. You need the mmu notifier there because of fork and other corner cases.
I wonder if we should try to fix the fork case more directly - RDMA has this same problem and added MADV_DONTFORK a long time ago as a hacky way to deal with it.
Some crazy page pin that resolved COW in a way that always kept the physical memory with the mm that initiated the pin?
(isn't this broken for O_DIRECT as well anyhow?)
How does mmu_notifiers help the fork case anyhow? Block fork from progressing?
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without fence wait.
It is kind of unrelated to HMM, it just shouldn't be using mmu notifiers to replace page pinning..
The issue for AMD is that they already update their GPU page table using DMA engine. I believe this is still doable if they use a kernel only DMA engine context, where only kernel can queue up jobs so that you do not need to wait for unrelated things and you can prioritize GPU page table update which should translate in fast GPU page table update without DMA fence.
Makes sense
I'm not sure I saw this in the AMD hmm stuff - it would be good if someone would look at that. Every time I do it looks like the locking is wrong.
Jason
On 2020-06-19 at 2:18 p.m., Jason Gunthorpe wrote:
I wonder if we should try to fix the fork case more directly - RDMA has this same problem and added MADV_DONTFORK a long time ago as a hacky way to deal with it.
Some crazy page pin that resolved COW in a way that always kept the physical memory with the mm that initiated the pin?
(isn't this broken for O_DIRECT as well anyhow?)
How does mmu_notifiers help the fork case anyhow? Block fork from progressing?
How much the mmu_notifier blocks fork progress depends on how quickly we can preempt GPU jobs accessing the affected memory. If we don't have fine-grained preemption capability (graphics), the best we can do is wait for the GPU jobs to complete. We can also delay submission of new GPU jobs to the same memory until the MMU notifier is done. Future jobs would use the new page addresses.
With fine-grained preemption (ROCm compute), we can preempt GPU work on the affected address space to minimize the delay seen by fork.
With recoverable device page faults, we can invalidate GPU page table entries, so device access to the affected pages stops immediately.
In all cases, the end result is that the device page table gets updated with the address of the copied pages before the GPU accesses the COW memory again. Without the MMU notifier, we'd end up with the GPU corrupting memory of the other process.
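The three cases can be summarized as a small decision table. This is only an illustration of the reasoning above; the `Capability` enum and `invalidation_steps()` are hypothetical names, not anything from amdgpu or the kernel.

```python
from enum import Enum, auto


class Capability(Enum):
    NO_FINE_GRAINED_PREEMPTION = auto()   # e.g. graphics
    FINE_GRAINED_PREEMPTION = auto()      # e.g. ROCm compute
    RECOVERABLE_PAGE_FAULTS = auto()


def invalidation_steps(cap):
    """Return what the notifier does for a given hw capability, in order."""
    if cap is Capability.RECOVERABLE_PAGE_FAULTS:
        stop = "invalidate GPU page table entries"
    elif cap is Capability.FINE_GRAINED_PREEMPTION:
        stop = "preempt GPU work on the affected address space"
    else:
        stop = "wait for GPU jobs using the range to complete"
    # The tail is the same everywhere: the notifier returns, and the device
    # page table points at the new pages before the GPU touches them again.
    return [
        stop,
        "return from the MMU notifier",
        "update device page table before the next GPU access",
    ]


graphics = invalidation_steps(Capability.NO_FINE_GRAINED_PREEMPTION)
compute = invalidation_steps(Capability.FINE_GRAINED_PREEMPTION)
faulting = invalidation_steps(Capability.RECOVERABLE_PAGE_FAULTS)
```

Only the first step differs, and it is that first step which decides how long fork is held up.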
Regards, Felix
On Fri, Jun 19, 2020 at 03:48:49PM -0400, Felix Kuehling wrote:
How much the mmu_notifier blocks fork progress depends on how quickly we can preempt GPU jobs accessing the affected memory. If we don't have fine-grained preemption capability (graphics), the best we can do is wait for the GPU jobs to complete. We can also delay submission of new GPU jobs to the same memory until the MMU notifier is done. Future jobs would use the new page addresses.
With fine-grained preemption (ROCm compute), we can preempt GPU work on the affected address space to minimize the delay seen by fork.
With recoverable device page faults, we can invalidate GPU page table entries, so device access to the affected pages stops immediately.
In all cases, the end result is that the device page table gets updated with the address of the copied pages before the GPU accesses the COW memory again. Without the MMU notifier, we'd end up with the GPU corrupting memory of the other process.
The model here in fork has been wrong for a long time, and I do wonder how O_DIRECT manages to not be broken too.. I guess the time windows there are too small to get unlucky.
If you have a write pin on a page then it should not be COW'd into the fork'd process but copied with the originating page remaining with the original mm.
I wonder if there is some easy way to achieve that - if that is the main reason to use notifiers then it would be a better solution.
Jason
On 2020-06-19 at 3:55 p.m., Jason Gunthorpe wrote:
The model here in fork has been wrong for a long time, and I do wonder how O_DIRECT manages to not be broken too.. I guess the time windows there are too small to get unlucky.
If you have a write pin on a page then it should not be COW'd into the fork'd process but copied with the originating page remaining with the original mm.
I wonder if there is some easy way to achieve that - if that is the main reason to use notifiers then it would be a better solution.
Other than the application changing its own virtual address mappings (mprotect, munmap, etc.), triggering MMU notifiers, we also get MMU notifiers from THP worker threads, and NUMA balancing.
When we start doing migration to DEVICE_PRIVATE memory with HMM, we also get MMU notifiers during those driver-initiated migrations.
Regards, Felix
On Fri, Jun 19, 2020 at 04:55:38PM -0300, Jason Gunthorpe wrote:
The model here in fork has been wrong for a long time, and I do wonder how O_DIRECT manages to not be broken too.. I guess the time windows there are too small to get unlucky.
This was discussed extensively in the GUP work John has been doing. Yes, O_DIRECT can potentially break, but only if you are writing to COW pages and you initiated the O_DIRECT right before the fork, so that GUP happened before fork was able to write-protect the pages.
If you use O_DIRECT with memory as input, i.e. you are writing the memory to the file, not reading from the file, then fork is harmless as you are just reading memory. You can still face the COW uncertainty (the process against which you did the O_DIRECT gets "new" pages but your O_DIRECT goes on with the "old" pages), but doing O_DIRECT and fork concurrently is asking for trouble.
If you have a write pin on a page then it should not be COW'd into the fork'd process but copied with the originating page remaining with the original mm.
I wonder if there is some easy way to achieve that - if that is the main reason to use notifiers then it would be a better solution.
Not doable, as the page refcount can change for things unrelated to GUP. With John's changes we can identify GUP and we could potentially copy GUPed pages instead of COW, but this can potentially slow down fork() and I am not sure how acceptable this would be. Also this does not solve GUP against pages that are already in a fork tree, i.e. page P0 is in process A which forks, so we now have page P0 in processes A and B. Now process A forks again and we have page P0 in A, B, and C. Here B and C are two branches with their root in A. B and/or C can keep forking and grow the fork tree.
Now if a read-only GUP on P0 happens in C (or B, everything is symmetrical with respect to root A), then P0 might not be the page that is in C after the GUP, i.e. if something in C writes to the virtual address corresponding to P0, then a new page might get allocated and the virtual address will no longer point to P0 for C.
The semantics were changed with 17839856fd588f4ab6b789f482ed3ffd7c403e1f to somewhat "fix" that, but GUP fast is still susceptible to this.
Note that the above commit only addresses GUP after/while forking. GUP before fork() needs the mmu notifier (or forcing a page copy instead of COW).
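The fork tree scenario can be replayed with a toy model. Nothing below is kernel code: `ToyMM`, `Process` and the sharer counting are simplifications invented for illustration (the real kernel tracks this through mapcounts and refcounts), and `gup()` models a read-only GUP that leaves COW intact.

```python
class Process:
    def __init__(self, name, page_table=None):
        self.name = name
        self.page_table = dict(page_table or {})   # vaddr -> page id


class ToyMM:
    def __init__(self):
        self.next_page = 0
        self.sharers = {}   # page id -> number of processes mapping it

    def map_new(self, proc, vaddr):
        page = f"P{self.next_page}"
        self.next_page += 1
        self.sharers[page] = 1
        proc.page_table[vaddr] = page
        return page

    def fork(self, parent, child_name):
        """Child shares every page with the parent (COW, no copy yet)."""
        child = Process(child_name, parent.page_table)
        for page in child.page_table.values():
            self.sharers[page] += 1
        return child

    def gup(self, proc, vaddr):
        """Read-only GUP: return whatever page backs vaddr right now."""
        return proc.page_table[vaddr]

    def write(self, proc, vaddr):
        """CPU write: a shared page gets copied, the writer gets the copy."""
        page = proc.page_table[vaddr]
        if self.sharers[page] > 1:
            self.sharers[page] -= 1
            return self.map_new(proc, vaddr)
        return page


mm = ToyMM()
a = Process("A")
p0 = mm.map_new(a, 0x1000)        # page P0 lives in A
b = mm.fork(a, "B")               # P0 now in A and B
c = mm.fork(a, "C")               # P0 now in A, B and C

pinned = mm.gup(c, 0x1000)        # device in C gets P0
after_write = mm.write(c, 0x1000) # C writes, COW hands C a fresh page
```

After the write, the device in C still holds P0 while C's virtual address points at the new page, so the two have silently diverged. That is the case where only an mmu notifier (or a forced copy) tells the driver to look again.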
Cheers, Jérôme
On Fri, Jun 19, 2020 at 04:31:47PM -0400, Jerome Glisse wrote:
Not doable, as the page refcount can change for things unrelated to GUP. With John's changes we can identify GUP and we could potentially copy GUPed pages instead of COW, but this can potentially slow down fork() and I am not sure how acceptable this would be. Also this does not solve GUP against pages that are already in a fork tree, i.e. page P0 is in process A which forks, so we now have page P0 in processes A and B. Now process A forks again and we have page P0 in A, B, and C. Here B and C are two branches with their root in A. B and/or C can keep forking and grow the fork tree.
For a long time now RDMA has broken COW pages when creating user DMA regions.
The problem has been that fork re-COW's regions that had their COW broken.
So, if you break the COW upon mapping and prevent fork (and others) from copying DMA pinned then you'd cover the cases.
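As a sketch of that idea (break COW when the DMA mapping is created, and have fork copy pinned pages instead of sharing them), under the assumption that pinned pages can be identified at fork time. `ToyMM`, `pin_for_dma()` and the `shared` set are hypothetical; this illustrates the proposal and does not describe what the kernel does.

```python
class ToyMM:
    def __init__(self):
        self.counter = 0
        self.pinned = set()   # pages with a DMA pin

    def alloc(self):
        page = f"P{self.counter}"
        self.counter += 1
        return page

    def pin_for_dma(self, table, vaddr, shared):
        """Break COW at pin time if the page is shared, then mark it pinned."""
        page = table[vaddr]
        if page in shared:
            page = self.alloc()
            table[vaddr] = page
        self.pinned.add(page)
        return page

    def fork(self, parent_table, shared):
        """Copy pinned pages eagerly; everything else is shared as COW."""
        child = {}
        for vaddr, page in parent_table.items():
            if page in self.pinned:
                child[vaddr] = self.alloc()   # parent keeps the pinned page
            else:
                child[vaddr] = page
                shared.add(page)
        return child


mm = ToyMM()
shared = set()
parent = {0x1000: mm.alloc(), 0x2000: mm.alloc()}   # P0 and P1

dma_page = mm.pin_for_dma(parent, 0x1000, shared)   # P0, not shared yet
child = mm.fork(parent, shared)
```

The pinned page stays with the mm that created the pin and never becomes COW-shared, at the price of an eager copy in fork(), which is the slowdown raised as a concern elsewhere in this thread.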
The semantics were changed with 17839856fd588f4ab6b789f482ed3ffd7c403e1f to somewhat "fix" that, but GUP fast is still susceptible to this.
Ah, so everyone breaks the COW now, not just RDMA..
What do you mean 'GUP fast is still susceptible to this' ?
Jason
On Mon, Jun 22, 2020 at 08:46:17AM -0300, Jason Gunthorpe wrote:
For a long time now RDMA has broken COW pages when creating user DMA regions.
The problem has been that fork re-COW's regions that had their COW broken.
So, if you break the COW upon mapping and prevent fork (and others) from copying DMA pinned then you'd cover the cases.
I am not sure we want to prevent COW for pinned GUP pages; this would change the current semantics and potentially break or slow down existing apps.
Anyway, I think we focus too much on fork/COW. It is just an unfixable, broken corner case, and mmu notifiers allow you to avoid it. Forcing a real copy on fork would likely be seen as a regression by most people.
The semantics were changed with 17839856fd588f4ab6b789f482ed3ffd7c403e1f to somewhat "fix" that, but GUP fast is still susceptible to this.
Ah, so everyone breaks the COW now, not just RDMA..
What do you mean 'GUP fast is still susceptible to this'?
Not all GUP fast paths are updated (intentionally); __get_user_pages_fast(), for instance, still keeps COW intact. People using GUP should really know what they are doing.
Cheers, Jérôme
On Mon, Jun 22, 2020 at 04:15:40PM -0400, Jerome Glisse wrote:
I am not sure we want to prevent COW for pinned GUP pages; this would change the current semantics and potentially break or slow down existing apps.
Isn't that basically exactly what 17839856fd588 does? It looks like it uses the same approach RDMA does by sticking FOLL_WRITE even though it is a read action.
After that change the remaining bug is that fork can re-establish a COW.
Anyway, I think we focus too much on fork/COW. It is just an unfixable, broken corner case, and mmu notifiers allow you to avoid it. Forcing a real copy on fork would likely be seen as a regression by most people.
If you don't copy then there are data corruption bugs though. Real apps probably don't hit a problem here as they are not forking while GUPs are active (RDMA excluded, which does do this).
I don't think that implementing page pinning by blocking mmu notifiers for the duration of the pin is a particularly good idea either; that actually seems a lot worse than just having the pin in the first place.
Particularly if it is only being done to avoid corner case bugs that already afflict other GUP cases :(
What do you mean 'GUP fast is still susceptible to this'?
Not all GUP fast paths are updated (intentionally); __get_user_pages_fast()
Sure, that is the 'raw' accessor
Jason
On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
The madness is only that device B's mmu notifier might need to wait for fence_B so that the dma operation finishes. Which in turn has to wait for device A to finish first.
So, it sounds like, fundamentally, you've got this graph of operations across an unknown set of drivers and the kernel cannot insert itself in dma_fence hand-offs to re-validate any of the buffers involved? Buffers which by definition cannot be touched by the hardware yet.
That really is a pretty horrible place to end up..
Pinning really is the right answer for this kind of workflow. I think converting pinning to notifiers should not be done unless notifier invalidation is relatively bounded.
I know people like notifiers because they give a bit nicer performance in some happy cases, but this cripples all the bad cases..
If pinning doesn't work for some reason maybe we should address that?
Note that the dma fence wait only applies to user ptr buffers, which predate any HMM work and thus were already using mmu notifiers. You need the mmu notifier there because of fork and other corner cases.
I wonder if we should try to fix the fork case more directly - RDMA has this same problem and added MADV_DONTFORK a long time ago as a hacky way to deal with it.
Some crazy page pin that resolved COW in a way that always kept the physical memory with the mm that initiated the pin?
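For reference, the MADV_DONTFORK workaround mentioned above looks roughly like this from userspace. This is only a sketch: alloc_dma_buffer() is a hypothetical helper name, not something from this thread or from any RDMA library. It relies on madvise(2) with MADV_DONTFORK, which keeps the range out of a child created by fork(), so pages pinned by the parent never become COW-shared.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map an anonymous buffer intended for device DMA and
 * exclude it from fork(). Returns NULL on failure.
 */
static void *alloc_dma_buffer(size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* The child of a later fork() will not get this range at all. */
    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}
```

The cost of this approach is that a forked child faults if it touches the range, which is why it is described as hacky.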
There is just no way to deal with it easily. I thought about forcing the anon_vma (page->mapping for an anonymous page) to the anon_vma that belongs to the vma against which the GUP was done, but it would break things if the page is already in another branch of a fork tree. Also this forbids fast GUP.
Quite frankly the fork was not the main motivating factor. GPUs can pin potentially GBytes of memory, thus we wanted to be able to release it, but since Michal's changes to the reclaim code this is no longer effective.
User buffers should never end up in those weird corner cases. IIRC the first usage was for xorg exa texture upload, then generalized to texture upload in mesa and later on to more upload cases (vertices, ...). At least this is what I remember today. So in those cases we do not expect fork, splice, mremap, mprotect, ...
Maybe we can audit how user ptr buffers are used today and see if we can define a usage pattern that would allow us to cut corners in the kernel. For instance we could use the mmu notifier just to block CPU pte updates while we do GUP and thus never wait on a dma fence.
Then the GPU driver just keeps the GUP pin around until it is done with the page. It can also use the mmu notifier to keep a flag so that the driver knows if it needs to redo a GUP, i.e.:
The notifier path:
    GPU_mmu_notifier_start_callback(range)
        gpu_lock_cpu_pagetable(range)
        for_each_bo_in(bo, range) {
            bo->need_gup = true;
        }
        gpu_unlock_cpu_pagetable(range)

    GPU_validate_buffer_pages(bo)
        if (!bo->need_gup)
            return;
        put_pages(bo->pages);
        range = bo_vaddr_range(bo)
        gpu_lock_cpu_pagetable(range)
        GUP(bo->pages, range)
        gpu_unlock_cpu_pagetable(range)
Depending on how user_ptr is used today this could work.
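The flag scheme above can be modelled in plain userspace C to make the control flow concrete. This is a sketch under assumptions, not driver code: struct bo, notifier_invalidate() and validate_buffer_pages() are hypothetical names, a pthread mutex stands in for gpu_lock_cpu_pagetable(), and a counter stands in for put_pages() plus GUP(). Clearing need_gup under the lock after re-pinning is also an assumption; the pseudocode leaves that step implicit.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a GPU buffer object backed by user memory. */
struct bo {
    unsigned long start, end; /* [start, end) virtual address range */
    bool need_gup;            /* set by the notifier, cleared on re-pin */
    int gup_count;            /* how often the pages were (re)pinned */
};

/* Stands in for gpu_lock_cpu_pagetable()/gpu_unlock_cpu_pagetable(). */
static pthread_mutex_t cpu_pagetable_lock = PTHREAD_MUTEX_INITIALIZER;

/* Notifier path: only flags overlapping buffers, never waits on a fence. */
static void notifier_invalidate(struct bo *bos, size_t n,
                                unsigned long start, unsigned long end)
{
    pthread_mutex_lock(&cpu_pagetable_lock);
    for (size_t i = 0; i < n; i++) {
        if (bos[i].start < end && start < bos[i].end)
            bos[i].need_gup = true;
    }
    pthread_mutex_unlock(&cpu_pagetable_lock);
}

/* Validation path: redo the (simulated) GUP only if the buffer was flagged. */
static void validate_buffer_pages(struct bo *bo)
{
    assert(bo != NULL);
    pthread_mutex_lock(&cpu_pagetable_lock);
    if (bo->need_gup) {
        bo->gup_count++;      /* stands in for put_pages() + GUP() */
        bo->need_gup = false; /* assumption: cleared once re-pinned */
    }
    pthread_mutex_unlock(&cpu_pagetable_lock);
}
```

The property this models is that the notifier side does a bounded amount of work under a short lock, while the expensive re-pin happens later on the driver's own submission path.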
(isn't this broken for O_DIRECT as well anyhow?)
Yes it can in theory, if you have an application that does O_DIRECT and fork concurrently (i.e. O_DIRECT in one thread and fork in another). Note that O_DIRECT after fork is fine; it is an issue only if GUP_fast was able to look up a page with write permission before fork had the chance to update it to read-only for COW.
But doing O_DIRECT (or anything that uses GUP fast) in one thread and fork in another is inherently broken, i.e. there is no way to fix it.
See 17839856fd588f4ab6b789f482ed3ffd7c403e1f
How does mmu_notifiers help the fork case anyhow? Block fork from progressing?
It enforces ordering between fork and GUP: if fork is first it blocks GUP, and if fork is last then fork waits on GUP and then the user buffer gets invalidated.
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without fence wait.
It is kind of unrelated to HMM, it just shouldn't be using mmu notifiers to replace page pinning..
Well, my POV is that if you abide by the rules HMM defines then you do not need to pin pages. The rule is asynchronous device page table update.
Pinning pages is problematic: it blocks many core mm features and it is just bad all around. Also it is inherently broken in the face of fork/mremap/splice/...
Cheers, Jérôme
On Fri, Jun 19, 2020 at 10:10 PM Jerome Glisse jglisse@redhat.com wrote:
Quite frankly the fork was not the main motivating factor. GPUs can pin potentially GBytes of memory, thus we wanted to be able to release it, but since Michal's changes to the reclaim code this is no longer effective.
What where how? My patch to annotate reclaim paths with mmu notifier possibility just landed in -mm, so if direct reclaim can't reclaim mmu notifier'ed stuff anymore we need to know.
Also this would resolve the entire pain we're discussing in this thread about dma_fence_wait deadlocking against anything that's not GFP_ATOMIC ... -Daniel
On Fri, Jun 19, 2020 at 10:43:20PM +0200, Daniel Vetter wrote:
What where how? My patch to annotate reclaim paths with mmu notifier possibility just landed in -mm, so if direct reclaim can't reclaim mmu notifier'ed stuff anymore we need to know.
Also this would resolve the entire pain we're discussing in this thread about dma_fence_wait deadlocking against anything that's not GFP_ATOMIC ...
Sorry, my bad: reclaim still works, only oom skips. It was a couple of years ago and I thought that some of the things discussed a while back did make it upstream.
It is probably a good time to also point out that what I wanted to do is have all the mmu notifier callbacks provide some kind of fence (not a dma fence) so that we can split the notification into steps:
  A- Schedule the notification on all devices/systems and get fences. This step should minimize lock dependencies and should not have to wait for anything; it is also best if you can avoid memory allocation, for instance by pre-allocating what you need for the notification.
  B- The mm can do things like unmap but cannot map a new page, so write a special swap pte to the CPU page table.
  C- Wait on each fence from A.
  ... resume old code, i.e. replace pte or finish unmap ...
The idea here is that at step C the core mm can decide to back off if any fence returned from A has to wait. This means that every device is invalidating for nothing, but if we get there then it might still be a good thing, as next time around maybe the kernel would be successful without a wait.
This would allow things like reclaim to make forward progress and skip over or limit wait time to given timeout.
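The A/C split with back-off can be illustrated with a small userspace model. Everything here is hypothetical and only meant to show the control flow: struct inval_fence, schedule_invalidation() and finish_invalidation() are made-up names, no such API exists in the kernel, and the "wait" is simulated by marking the fence as signaled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device invalidation fence (explicitly not a dma_fence). */
struct inval_fence {
    bool signaled; /* true once the device has finished invalidating */
};

enum inval_result { INVAL_DONE, INVAL_BACKED_OFF };

/*
 * Step A: schedule the invalidation on every device and collect fences.
 * Must not wait or allocate; already_done[i] simulates a device that can
 * complete immediately.
 */
static void schedule_invalidation(struct inval_fence *fences,
                                  const bool *already_done, size_t n)
{
    for (size_t i = 0; i < n; i++)
        fences[i].signaled = already_done[i];
}

/*
 * Step C: check each fence from step A. If a fence would require waiting
 * and the caller (e.g. reclaim) may not wait, back off instead.
 */
static enum inval_result finish_invalidation(struct inval_fence *fences,
                                             size_t n, bool may_wait)
{
    for (size_t i = 0; i < n; i++) {
        if (fences[i].signaled)
            continue;
        if (!may_wait)
            return INVAL_BACKED_OFF;
        fences[i].signaled = true; /* simulated wait for completion */
    }
    return INVAL_DONE;
}
```

A caller such as reclaim would pass may_wait = false and simply skip the page on INVAL_BACKED_OFF, whereas a path like munmap would pass may_wait = true.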
Also I thought about extending this even to multi-CPU TLB flush so that devices and CPUs follow the same pattern and we can make parallel progress on each.
Getting to such a scheme is a lot of work. My plan was to first get the fence as part of the notifier user API and hide it from mm inside the notifier common code. Then update each core mm path to the new model and see if there is any benefit from it. Reclaim would be the first candidate.
Cheers, Jérôme
On Fri, Jun 19, 2020 at 04:10:11PM -0400, Jerome Glisse wrote:
Maybe we can audit how user ptr buffers are used today and see if we can define a usage pattern that would allow us to cut corners in the kernel. For instance we could use the mmu notifier just to block CPU pte updates while we do GUP and thus never wait on a dma fence.
The DMA fence is the main problem, if you can think of a way to avoid it then it would be great!
Then the GPU driver just keeps the GUP pin around until it is done with the page. It can also use the mmu notifier to keep a flag so that the driver knows if it needs to redo a GUP, i.e.:
The notifier path:
    GPU_mmu_notifier_start_callback(range)
        gpu_lock_cpu_pagetable(range)
        for_each_bo_in(bo, range) {
            bo->need_gup = true;
        }
        gpu_unlock_cpu_pagetable(range)
So some kind of invalidation tracking? But this doesn't solve the COW and fork problem?
It is kind of unrelated to HMM, it just shouldn't be using mmu notifiers to replace page pinning..
Well, my POV is that if you abide by the rules HMM defines then you do not need to pin pages. The rule is asynchronous device page table update.
I think one of the hmm rules is to not block notifiers for a long time, which these schemes seem to violate already.
Pinning for a long time is less bad than blocking notifiers for a long time, IMHO
Jason
On Fri, Jun 19, 2020 at 2:09 PM Jerome Glisse jglisse@redhat.com wrote:
For nouveau the notifier does not need to wait for anything; it can update the GPU page table right away. Modulo needing to write to GPU memory using the dma engine if the GPU page table is in GPU memory that is not accessible from the CPU, but that's never the case for nouveau so far (though I expect it will be at some point).
So I see this as two different cases: the user ptr case, which does pin pages by the way, where things are synchronous, versus the HMM case where everything is asynchronous.
I probably need to warn AMD folks again that using HMM means that you must be able to update the GPU page table asynchronously without a fence wait. The issue for AMD is that they already update their GPU page table using a DMA engine. I believe this is still doable if they use a kernel-only DMA engine context, where only the kernel can queue up jobs, so that you do not need to wait for unrelated things and you can prioritize GPU page table updates, which should translate into fast GPU page table updates without a DMA fence.
All devices which support recoverable page faults also have a dedicated paging engine for the kernel driver which the driver already makes use of. We can also update the GPU page tables with the CPU.
Alex
Full disclosure: We are aware that we've designed ourselves into an impressive corner here, and there's lots of talks going on about untangling the dma synchronization from the memory management completely. But
I think the documenting is really important: only GPU should be using this stuff and driving notifiers this way. Complete NO for any totally-not-a-GPU things in drivers/accel for sure.
Yes, users that expect HMM need to be asynchronous. But it is hard to revert user ptr as it was done a long time ago.
Cheers, Jérôme
amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Am 2020-06-19 um 3:11 p.m. schrieb Alex Deucher:
All devices which support recoverable page faults also have a dedicated paging engine for the kernel driver which the driver already makes use of. We can also update the GPU page tables with the CPU.
We have a potential problem with CPU updating page tables while the GPU is retrying on page table entries because 64 bit CPU transactions don't arrive in device memory atomically.
We are using SDMA for page table updates. This currently goes through the DRM GPU scheduler to a special SDMA queue that's used by kernel mode only. But since it's based on the DRM GPU scheduler, we do use dma-fence to wait for completion.
Regards, Felix
On Fri, Jun 19, 2020 at 03:30:32PM -0400, Felix Kuehling wrote:
We have a potential problem with CPU updating page tables while the GPU is retrying on page table entries because 64 bit CPU transactions don't arrive in device memory atomically.
We are using SDMA for page table updates. This currently goes through the DRM GPU scheduler to a special SDMA queue that's used by kernel mode only. But since it's based on the DRM GPU scheduler, we do use dma-fence to wait for completion.
Yeah, my worry is mostly that some cross dma fence leaks into it, but it should never really happen; maybe there is a way to catch it if it does and print a warning.
So yes, you can use dma fences, as long as they do not have cross-dependencies. Another expectation is that they complete quickly, and usually page table updates do.
Cheers, Jérôme
On Fri, Jun 19, 2020 at 03:30:32PM -0400, Felix Kuehling wrote:
We have a potential problem with CPU updating page tables while the GPU is retrying on page table entries because 64 bit CPU transactions don't arrive in device memory atomically.
Except for 32 bit platforms, atomicity is guaranteed if you use uncached writeq() to aligned addresses..
The linux driver model breaks if the writeX() stuff is not atomic.
Jason