On 1/14/20 5:00 AM, Jason Gunthorpe wrote:
On Mon, Jan 13, 2020 at 02:47:02PM -0800, Ralph Campbell wrote:
 void
 nouveau_svmm_fini(struct nouveau_svmm **psvmm)
 {
 	struct nouveau_svmm *svmm = *psvmm;
+	struct mmu_interval_notifier *mni;
+
 	if (svmm) {
 		mutex_lock(&svmm->mutex);
+		while (true) {
+			mni = mmu_interval_notifier_find(svmm->mm,
+					&nouveau_svm_mni_ops, 0UL, ~0UL);
+			if (!mni)
+				break;
+			mmu_interval_notifier_put(mni);
Oh, now I really don't like the name 'put'. It looks like mni is refcounted here, and it isn't. put should be called 'remove_deferred'
OK.
And then you also need a way to barrier this scheme on driver unload.
Good point. I can add something like

	void mmu_interval_notifier_synchronize(struct mm_struct *mm)

that waits for deferred operations to complete, similar to mmu_interval_read_begin().
+		}
 		svmm->vmm = NULL;
 		mutex_unlock(&svmm->mutex);
 		mmu_notifier_put(&svmm->notifier);
While here it actually is a refcount.
+static void nouveau_svmm_do_unmap(struct mmu_interval_notifier *mni,
+			const struct mmu_notifier_range *range)
+{
+	struct svmm_interval *smi =
+		container_of(mni, struct svmm_interval, notifier);
+	struct nouveau_svmm *svmm = smi->svmm;
+	unsigned long start = mmu_interval_notifier_start(mni);
+	unsigned long last = mmu_interval_notifier_last(mni);
This whole algorithm only works if it is protected by the read side of the interval tree lock. Deserves at least a comment if not an assertion too.
This is called from the invalidate() callback and while holding the driver page table lock so the struct mmu_interval_notifier and the interval tree can't change. I will add comments for v7.
 static int nouveau_range_fault(struct nouveau_svmm *svmm,
 			       struct nouveau_drm *drm, void *data, u32 size,
-			       u64 *pfns, struct svm_notifier *notifier)
+			       u64 *pfns, u64 start, u64 end)
 {
 	unsigned long timeout =
 		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
 	/* Have HMM fault pages within the fault window to the GPU. */
 	struct hmm_range range = {
-		.notifier = &notifier->notifier,
-		.start = notifier->notifier.interval_tree.start,
-		.end = notifier->notifier.interval_tree.last + 1,
+		.start = start,
+		.end = end,
 		.pfns = pfns,
 		.flags = nouveau_svm_pfn_flags,
 		.values = nouveau_svm_pfn_values,
+		.default_flags = 0,
+		.pfn_flags_mask = ~0UL,
 		.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT,
 	};
-	struct mm_struct *mm = notifier->notifier.mm;
+	struct mm_struct *mm = svmm->mm;
 	long ret;

 	while (true) {
 		if (time_after(jiffies, timeout))
 			return -EBUSY;

-		range.notifier_seq = mmu_interval_read_begin(range.notifier);
-		range.default_flags = 0;
-		range.pfn_flags_mask = -1UL;
 		down_read(&mm->mmap_sem);
mmap_sem doesn't have to be held for the interval search, and again we have lifetime issues with the membership here.
I agree mmap_sem isn't needed for the interval search; it is needed if the search doesn't find a registered interval and one needs to be created to cover the underlying VMA. If an arbitrary-size interval were created instead, mmap_sem wouldn't be needed.

I don't understand the lifetime/membership issue. The driver is the only thing that allocates, inserts, or removes struct mmu_interval_notifier, and thus completely controls the lifetime.
+		ret = nouveau_svmm_interval_find(svmm, &range);
+		if (ret) {
+			up_read(&mm->mmap_sem);
+			return ret;
+		}
+		range.notifier_seq = mmu_interval_read_begin(range.notifier);
 		ret = hmm_range_fault(&range, 0);
 		up_read(&mm->mmap_sem);
 		if (ret <= 0) {
I'm still not sure this is a better approach than what ODP does. It looks very expensive on the fault path.
Jason
ODP doesn't have this problem because users have to call ib_reg_mr() before any I/O can happen to the process address space. That is when mmu_interval_notifier_insert() / mmu_interval_notifier_remove() can be called, and the driver doesn't have to worry about the interval changing size or being removed while I/O is happening.

For GPU-like devices, I'm trying to allow hardware access to any user-level address without pre-registering it. That means inserting mmu interval notifiers for the ranges the GPU page faults on and updating the intervals as munmap() calls remove parts of the address space. I don't want to register an interval per page, so the logical range is the underlying VMA.
It isn't that expensive: there is an extra driver lock/unlock as part of the lookup, and possibly a find_vma() and a kmalloc(GFP_ATOMIC) for new intervals, plus the deferred interval updates for munmap(). Compared to the cost of updating PTEs in the device and of GPU fault handling, this is minimal overhead.