On Tue, May 25, 2021 at 5:05 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Hi Daniel,
Am 25.05.21 um 15:05 schrieb Daniel Vetter:
Hi Christian,
On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
Am 21.05.21 um 20:31 schrieb Daniel Vetter: This works by adding the fence of the last eviction DMA operation to BOs when their backing store is newly allocated. That's what the ttm_bo_add_move_fence() function you stumbled over is good for: https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo...
Now the problem is it is possible that the application is terminated before it can complete it's command submission. But since resource management only waits for the shared fences when there are some there is a chance that we free up memory while it is still in use.
Hm where is this code? Would help in my audit that I wanted to do this week? If you look at most other places like drm_gem_fence_array_add_implicit() I mentioned earlier, then we don't treat the shared fences special and always also include the exclusive one.
See amdgpu_gem_object_close():
... fence = dma_resv_get_excl(bo->tbo.base.resv); if (fence) { amdgpu_bo_fence(bo, fence, true); fence = NULL; } ...
We explicitly added that because resource management of some other driver was going totally bananas without that.
But I'm not sure which one that was. Maybe dig a bit in the git and mailing history of that.
Hm I looked and it's
commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb Author: Christian König christian.koenig@amd.com Date: Thu Mar 12 12:03:34 2020 +0100
drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4
That sounded more like amdgpu itself needing this, not another driver?
But looking at amdgpu_vm_bo_update_mapping() it seems to pick the right fencing mode for gpu pte clearing, so I'm really not sure what the bug was that you worked around here?The implementation boils down to amdgpu_sync_resv() which syncs for the exclusive fence, always. And there's nothing else that I could find in public history at least, no references to bug reports or anything. I think you need to dig internally, because as-is I'm not seeing the problem here.
Or am I missing something here? -Daniel