On Wed, Mar 21, 2018 at 12:54:20PM +0100, Christian König wrote:
On 21.03.2018 at 09:28, Daniel Vetter wrote:
On Tue, Mar 20, 2018 at 06:47:57PM +0100, Christian König wrote:
On 20.03.2018 at 15:08, Daniel Vetter wrote:
[SNIP] For the in-driver reservation path (CS) having a slow-path that grabs a temporary reference, drops the vram lock and then locks the reservation normally (using the acquire context used already for the entire CS) is a bit tricky, but totally feasible. TTM doesn't do that, though.
That is exactly what we do in amdgpu as well, it's just neither very efficient nor reliable to retry getting the right pages for a submission over and over again.
Out of curiosity, where's that code? I did read the ttm eviction code way back, and that one definitely didn't do that. Would be interesting to update my understanding.
That is in amdgpu_cs.c. amdgpu_cs_parser_bos() does a horrible dance with grabbing, releasing and regrabbing locks in a loop.
Then in amdgpu_cs_submit() we grab a lock preventing page table updates and check whether all pages are still the ones we want to have:
    amdgpu_mn_lock(p->mn);
    if (p->bo_list) {
        for (i = p->bo_list->first_userptr;
             i < p->bo_list->num_entries; ++i) {
            struct amdgpu_bo *bo = p->bo_list->array[i].robj;

            if (amdgpu_ttm_tt_userptr_needs_pages(bo->tbo.ttm)) {
                amdgpu_mn_unlock(p->mn);
                return -ERESTARTSYS;
            }
        }
    }
If anything changed on the page tables we restart the whole IOCTL using -ERESTARTSYS and try again.
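For readers following along without the kernel tree at hand, the check-then-restart pattern described above can be modelled in plain userspace C. This is only a rough sketch, not the amdgpu code: every toy_* name, the pending_invalidations counter that fakes a racing invalidation, and the explicit retry loop standing in for the restarted IOCTL are assumptions made for illustration. ERESTARTSYS is kernel-internal, so it is defined locally here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Kernel-internal error code, not exposed by userspace <errno.h>. */
#ifndef ERESTARTSYS
#define ERESTARTSYS 512
#endif

/* Hypothetical stand-in for a userptr buffer object. */
struct toy_bo {
    bool needs_pages;   /* true once an invalidation dropped its pages */
};

/* Hypothetical stand-in for the CS parser state. */
struct toy_submission {
    struct toy_bo *bos;
    size_t first_userptr;       /* userptr BOs live in [first_userptr, num_entries) */
    size_t num_entries;
    int lock_depth;             /* models amdgpu_mn_lock()/amdgpu_mn_unlock() */
    int pending_invalidations;  /* how many attempts a fake invalidation will race with */
};

/* Models (re)acquiring the backing pages for every userptr BO. */
static void toy_acquire_pages(struct toy_submission *s)
{
    assert(s->first_userptr <= s->num_entries);
    for (size_t i = s->first_userptr; i < s->num_entries; ++i)
        s->bos[i].needs_pages = false;
}

/* Fakes an invalidation that lands between page acquisition and the final check. */
static void toy_racing_invalidate(struct toy_submission *s)
{
    if (s->pending_invalidations > 0 && s->first_userptr < s->num_entries) {
        s->bos[s->first_userptr].needs_pages = true;
        s->pending_invalidations--;
    }
}

/* Final check under the lock: bail out if any BO lost its pages. */
static int toy_submit(struct toy_submission *s)
{
    s->lock_depth++;                    /* "lock" */
    for (size_t i = s->first_userptr; i < s->num_entries; ++i) {
        if (s->bos[i].needs_pages) {
            s->lock_depth--;            /* "unlock" before bailing out */
            return -ERESTARTSYS;
        }
    }
    /* The job would be pushed to the hardware queue here. */
    s->lock_depth--;                    /* "unlock" */
    return 0;
}

/* Models the whole IOCTL being redone until the final check passes. */
static int toy_cs_ioctl(struct toy_submission *s, int *attempts)
{
    int r;

    *attempts = 0;
    do {
        ++*attempts;
        toy_acquire_pages(s);
        toy_racing_invalidate(s);
        r = toy_submit(s);
    } while (r == -ERESTARTSYS);

    return r;
}
```

The sketch also shows why this is not very efficient: each racing invalidation costs one full pass through page acquisition, and nothing bounds the number of passes.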
I'm not talking about userptr here, but general bo eviction. Sorry for the confusion.
The reason I'm dragging all the general bo management into this discussion is because we do seem to have a fairly fundamental difference in how that's done, with resulting consequences for the locking hierarchy.
And if this invalidate_mapping stuff should work, together with userptr and everything else, I think we're required to agree on how this is all supposed to nest, and how exactly we should back off for the other side that needs to break the locking circle.
That aside, I don't entirely understand why you need to restart so much. I figured that get_user_pages is ordered correctly against mmu invalidations, but I get the impression you think that's not the case. How does that happen?
-Daniel