This series adds new functions to the mmu interval notifier API to allow device drivers with MMUs to dynamically mirror a process' page tables based on device faults and invalidation callbacks. The Nouveau driver is updated to use the extended API and a set of stand alone self tests is added to help validate and maintain correctness.
The patches are based on linux-5.5.0-rc6 and are for Jason's rdma/hmm tree since I believe he is planning some interval notifier changes.
Changes v5 -> v6: Rebase to linux-5.5.0-rc6 Refactored mmu interval notifier patches Converted nouveau to use the new mmu interval notifier API
Changes v4 -> v5: Added mmu interval notifier insert/remove/update callable from the invalidate() callback Updated HMM tests to use the new core interval notifier API
Changes v1 -> v4: https://lore.kernel.org/linux-mm/20191104222141.5173-1-rcampbell@nvidia.com
Ralph Campbell (6): mm/mmu_notifier: add mmu_interval_notifier_insert_safe() mm/mmu_notifier: add mmu_interval_notifier_put() mm/notifier: add mmu_interval_notifier_update() mm/mmu_notifier: add mmu_interval_notifier_find() nouveau: use new mmu interval notifiers mm/hmm/test: add self tests for HMM
MAINTAINERS | 3 + drivers/gpu/drm/nouveau/nouveau_svm.c | 313 ++++-- include/linux/mmu_notifier.h | 29 + lib/Kconfig.debug | 11 + lib/Makefile | 1 + lib/test_hmm.c | 1368 ++++++++++++++++++++++++ mm/mmu_notifier.c | 223 +++- tools/testing/selftests/vm/.gitignore | 1 + tools/testing/selftests/vm/Makefile | 3 + tools/testing/selftests/vm/config | 2 + tools/testing/selftests/vm/hmm-tests.c | 1354 +++++++++++++++++++++++ tools/testing/selftests/vm/run_vmtests | 16 + tools/testing/selftests/vm/test_hmm.sh | 97 ++ 13 files changed, 3289 insertions(+), 132 deletions(-) create mode 100644 lib/test_hmm.c create mode 100644 tools/testing/selftests/vm/hmm-tests.c create mode 100755 tools/testing/selftests/vm/test_hmm.sh
mmu_interval_notifier_insert() can't be called safely from inside the invalidate() callback because it can acquire the mmap_sem lock which might already be held. Insertion might be needed when the invalidate() callback creates a "hole" in the interval being tracked (i.e., the event type MMU_NOTIFY_UNMAP) and the interval needs to be split in order to continue receiving callbacks for the remaining left and right intervals.
Add a new function mmu_interval_notifier_insert_safe() which can be called from the invalidate() callback.
Signed-off-by: Ralph Campbell rcampbell@nvidia.com --- include/linux/mmu_notifier.h | 4 ++++ mm/mmu_notifier.c | 45 ++++++++++++++++++++++++++++++++---- 2 files changed, 45 insertions(+), 4 deletions(-)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 9e6caa8ecd19..027c9c8f3a69 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -299,6 +299,10 @@ int mmu_interval_notifier_insert_locked( struct mmu_interval_notifier *mni, struct mm_struct *mm, unsigned long start, unsigned long length, const struct mmu_interval_notifier_ops *ops); +int mmu_interval_notifier_insert_safe( + struct mmu_interval_notifier *mni, struct mm_struct *mm, + unsigned long start, unsigned long length, + const struct mmu_interval_notifier_ops *ops); void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni);
/** diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index f76ea05b1cb0..a5ff19cd1bc5 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -913,16 +913,17 @@ static int __mmu_interval_notifier_insert( /** * mmu_interval_notifier_insert - Insert an interval notifier * @mni: Interval notifier to register + * @mm: mm_struct to attach to * @start: Starting virtual address to monitor * @length: Length of the range to monitor - * @mm : mm_struct to attach to + * @ops: Interval notifier callback operations * * This function subscribes the interval notifier for notifications from the - * mm. Upon return the ops related to mmu_interval_notifier will be called + * mm. Upon return, the ops related to mmu_interval_notifier will be called * whenever an event that intersects with the given range occurs. * - * Upon return the range_notifier may not be present in the interval tree yet. - * The caller must use the normal interval notifier read flow via + * Upon return, the mmu_interval_notifier may not be present in the interval + * tree yet. The caller must use the normal interval notifier read flow via * mmu_interval_read_begin() to establish SPTEs for this range. */ int mmu_interval_notifier_insert(struct mmu_interval_notifier *mni, @@ -969,6 +970,42 @@ int mmu_interval_notifier_insert_locked( } EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked);
+/** + * mmu_interval_notifier_insert_safe - Insert an interval notifier + * @mni: Interval notifier to register + * @mm: mm_struct to attach to + * @start: Starting virtual address to monitor + * @length: Length of the range to monitor + * @ops: Interval notifier callback operations + * + * Return: -EINVAL if @mm hasn't been initialized for interval notifiers + * by calling mmu_notifier_register(NULL, mm) or + * __mmu_notifier_register(NULL, mm). + * + * This function subscribes the interval notifier for notifications from the + * mm. Upon return, the ops related to mmu_interval_notifier will be called + * whenever an event that intersects with the given range occurs. + * + * This function is safe to call from the ops->invalidate() function. + * Upon return, the mmu_interval_notifier may not be present in the interval + * tree yet. The caller must use the normal interval notifier read flow via + * mmu_interval_read_begin() to establish SPTEs for this range. + */ +int mmu_interval_notifier_insert_safe( + struct mmu_interval_notifier *mni, struct mm_struct *mm, + unsigned long start, unsigned long length, + const struct mmu_interval_notifier_ops *ops) +{ + struct mmu_notifier_mm *mmn_mm; + + mmn_mm = mm->mmu_notifier_mm; + if (!mmn_mm || !mmn_mm->has_itree) + return -EINVAL; + return __mmu_interval_notifier_insert(mni, mm, mmn_mm, start, length, + ops); +} +EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_safe); + /** * mmu_interval_notifier_remove - Remove a interval notifier * @mni: Interval notifier to unregister
mmu_interval_notifier_remove() can't be called safely from inside the invalidate() callback because it sleeps waiting for invalidate callbacks to finish. Removals might be needed when the invalidate() callback is for munmap() (i.e., the event type MMU_NOTIFY_UNMAP), and the interval being tracked is no longer needed.
Add a new function mmu_interval_notifier_put() which is safe to call from the invalidate() callback. The ops->release() function will be called when all callbacks are finished and no CPUs are accessing the mmu_interval_notifier.
Signed-off-by: Ralph Campbell rcampbell@nvidia.com --- include/linux/mmu_notifier.h | 6 +++ mm/mmu_notifier.c | 86 ++++++++++++++++++++++++++++-------- 2 files changed, 74 insertions(+), 18 deletions(-)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 027c9c8f3a69..6dcaa632eef7 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -233,11 +233,16 @@ struct mmu_notifier { * @invalidate: Upon return the caller must stop using any SPTEs within this * range. This function can sleep. Return false only if sleeping * was required but mmu_notifier_range_blockable(range) is false. + * @release: This function should be defined when using + * mmu_interval_notifier_put(). It will be called when the + * mmu_interval_notifier is removed from the interval tree. + * No other callbacks will be generated after this returns. */ struct mmu_interval_notifier_ops { bool (*invalidate)(struct mmu_interval_notifier *mni, const struct mmu_notifier_range *range, unsigned long cur_seq); + void (*release)(struct mmu_interval_notifier *mni); };
struct mmu_interval_notifier { @@ -304,6 +309,7 @@ int mmu_interval_notifier_insert_safe( unsigned long start, unsigned long length, const struct mmu_interval_notifier_ops *ops); void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni); +void mmu_interval_notifier_put(struct mmu_interval_notifier *mni);
/** * mmu_interval_set_seq - Save the invalidation sequence diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index a5ff19cd1bc5..40c837ae8d90 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -129,6 +129,7 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm) { struct mmu_interval_notifier *mni; struct hlist_node *next; + struct hlist_head removed_list;
spin_lock(&mmn_mm->lock); if (--mmn_mm->active_invalidate_ranges || @@ -144,20 +145,35 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm) * The inv_end incorporates a deferred mechanism like rtnl_unlock(). * Adds and removes are queued until the final inv_end happens then * they are progressed. This arrangement for tree updates is used to - * avoid using a blocking lock during invalidate_range_start. + * avoid using a blocking lock while walking the interval tree. */ + INIT_HLIST_HEAD(&removed_list); hlist_for_each_entry_safe(mni, next, &mmn_mm->deferred_list, deferred_item) { + hlist_del(&mni->deferred_item); if (RB_EMPTY_NODE(&mni->interval_tree.rb)) interval_tree_insert(&mni->interval_tree, &mmn_mm->itree); - else + else { interval_tree_remove(&mni->interval_tree, &mmn_mm->itree); - hlist_del(&mni->deferred_item); + if (mni->ops->release) + hlist_add_head(&mni->deferred_item, + &removed_list); + } } spin_unlock(&mmn_mm->lock);
+ hlist_for_each_entry_safe(mni, next, &removed_list, deferred_item) { + struct mm_struct *mm = mni->mm; + + hlist_del(&mni->deferred_item); + mni->ops->release(mni); + + /* pairs with mmgrab() in __mmu_interval_notifier_insert() */ + mmdrop(mm); + } + wake_up_all(&mmn_mm->wq); }
@@ -1006,24 +1022,13 @@ int mmu_interval_notifier_insert_safe( } EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_safe);
-/** - * mmu_interval_notifier_remove - Remove a interval notifier - * @mni: Interval notifier to unregister - * - * This function must be paired with mmu_interval_notifier_insert(). It cannot - * be called from any ops callback. - * - * Once this returns ops callbacks are no longer running on other CPUs and - * will not be called in future. - */ -void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni) +static unsigned long __mmu_interval_notifier_put( + struct mmu_interval_notifier *mni) { struct mm_struct *mm = mni->mm; struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm; unsigned long seq = 0;
- might_sleep(); - spin_lock(&mmn_mm->lock); if (mn_itree_is_invalidating(mmn_mm)) { /* @@ -1043,6 +1048,28 @@ void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni) } spin_unlock(&mmn_mm->lock);
+ return seq; +} + +/** + * mmu_interval_notifier_remove - Remove an interval notifier + * @mni: Interval notifier to unregister + * + * This function must be paired with one of the mmu_interval_notifier_insert() + * functions. It cannot be called from any ops callback. + * Once this returns, ops callbacks are no longer running on other CPUs and + * will not be called in future. + */ +void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni) +{ + struct mm_struct *mm = mni->mm; + struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm; + unsigned long seq; + + might_sleep(); + + seq = __mmu_interval_notifier_put(mni); + /* * The possible sleep on progress in the invalidation requires the * caller not hold any locks held by invalidation callbacks. @@ -1053,11 +1080,34 @@ void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni) wait_event(mmn_mm->wq, READ_ONCE(mmn_mm->invalidate_seq) != seq);
- /* pairs with mmgrab in mmu_interval_notifier_insert() */ - mmdrop(mm); + /* pairs with mmgrab() in __mmu_interval_notifier_insert() */ + if (!mni->ops->release) + mmdrop(mm); } EXPORT_SYMBOL_GPL(mmu_interval_notifier_remove);
+/** + * mmu_interval_notifier_put - Unregister an interval notifier + * @mni: Interval notifier to unregister + * + * This function must be paired with one of the mmu_interval_notifier_insert() + * functions. It is safe to call from the invalidate() callback. + * Once this returns, ops callbacks may still be running on other CPUs and + * the release() callback will be called when they finish. + */ +void mmu_interval_notifier_put(struct mmu_interval_notifier *mni) +{ + struct mm_struct *mm = mni->mm; + + if (!__mmu_interval_notifier_put(mni)) { + mni->ops->release(mni); + + /* pairs with mmgrab() in __mmu_interval_notifier_insert() */ + mmdrop(mm); + } +} +EXPORT_SYMBOL_GPL(mmu_interval_notifier_put); + /** * mmu_notifier_synchronize - Ensure all mmu_notifiers are freed *
When using mmu interval notifiers, the underlying VMA range that the interval is tracking might be expanded or shrunk. Add a helper function to allow the interval to be updated safely from the ops->invalidate() callback in response to VMA range changes. This is effectively removing and inserting the mmu interval notifier but more efficient.
Signed-off-by: Ralph Campbell rcampbell@nvidia.com --- include/linux/mmu_notifier.h | 4 +++ mm/mmu_notifier.c | 69 ++++++++++++++++++++++++++++++++++-- 2 files changed, 70 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 6dcaa632eef7..0ce59b4f22c2 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -251,6 +251,8 @@ struct mmu_interval_notifier { struct mm_struct *mm; struct hlist_node deferred_item; unsigned long invalidate_seq; + unsigned long updated_start; + unsigned long updated_last; };
#ifdef CONFIG_MMU_NOTIFIER @@ -310,6 +312,8 @@ int mmu_interval_notifier_insert_safe( const struct mmu_interval_notifier_ops *ops); void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni); void mmu_interval_notifier_put(struct mmu_interval_notifier *mni); +void mmu_interval_notifier_update(struct mmu_interval_notifier *mni, + unsigned long start, unsigned long last);
/** * mmu_interval_set_seq - Save the invalidation sequence diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 40c837ae8d90..47ad9cc89aab 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -157,7 +157,14 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm) else { interval_tree_remove(&mni->interval_tree, &mmn_mm->itree); - if (mni->ops->release) + if (mni->updated_last) { + mni->interval_tree.start = mni->updated_start; + mni->interval_tree.last = mni->updated_last; + mni->updated_start = 0; + mni->updated_last = 0; + interval_tree_insert(&mni->interval_tree, + &mmn_mm->itree); + } else if (mni->ops->release) hlist_add_head(&mni->deferred_item, &removed_list); } @@ -872,6 +879,8 @@ static int __mmu_interval_notifier_insert( mni->ops = ops; RB_CLEAR_NODE(&mni->interval_tree.rb); mni->interval_tree.start = start; + mni->updated_start = 0; + mni->updated_last = 0; /* * Note that the representation of the intervals in the interval tree * considers the ending point as contained in the interval. @@ -1038,8 +1047,12 @@ static unsigned long __mmu_interval_notifier_put( if (RB_EMPTY_NODE(&mni->interval_tree.rb)) { hlist_del(&mni->deferred_item); } else { - hlist_add_head(&mni->deferred_item, - &mmn_mm->deferred_list); + if (mni->updated_last) { + mni->updated_start = 0; + mni->updated_last = 0; + } else + hlist_add_head(&mni->deferred_item, + &mmn_mm->deferred_list); seq = mmn_mm->invalidate_seq; } } else { @@ -1108,6 +1121,56 @@ void mmu_interval_notifier_put(struct mmu_interval_notifier *mni) } EXPORT_SYMBOL_GPL(mmu_interval_notifier_put);
+/** + * mmu_interval_notifier_update - Update interval notifier range + * @mni: Interval notifier to update + * @start: New starting virtual address to monitor + * @last: New last virtual address (inclusive) to monitor + * + * This function updates the range being monitored and is safe to call from + * the invalidate() callback function. + * Upon return, the mmu_interval_notifier range may not be updated in the + * interval tree yet. The caller must use the normal interval notifier read + * flow via mmu_interval_read_begin() to establish SPTEs for this range. + */ +void mmu_interval_notifier_update(struct mmu_interval_notifier *mni, + unsigned long start, unsigned long last) +{ + struct mm_struct *mm = mni->mm; + struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm; + unsigned long seq = 0; + + if (WARN_ON(start >= last)) + return; + + spin_lock(&mmn_mm->lock); + if (mn_itree_is_invalidating(mmn_mm)) { + /* + * Update is being called after insert put this on the + * deferred list, but before the deferred list was processed. + */ + if (RB_EMPTY_NODE(&mni->interval_tree.rb)) { + mni->interval_tree.start = start; + mni->interval_tree.last = last; + } else { + if (!mni->updated_last) + hlist_add_head(&mni->deferred_item, + &mmn_mm->deferred_list); + mni->updated_start = start; + mni->updated_last = last; + } + seq = mmn_mm->invalidate_seq; + } else { + WARN_ON(RB_EMPTY_NODE(&mni->interval_tree.rb)); + interval_tree_remove(&mni->interval_tree, &mmn_mm->itree); + mni->interval_tree.start = start; + mni->interval_tree.last = last; + interval_tree_insert(&mni->interval_tree, &mmn_mm->itree); + } + spin_unlock(&mmn_mm->lock); +} +EXPORT_SYMBOL_GPL(mmu_interval_notifier_update); + /** * mmu_notifier_synchronize - Ensure all mmu_notifiers are freed *
Device drivers may or may not have a convenient range based data structure to look up and find intervals that are registered with the mmu interval notifiers. Rather than forcing drivers to duplicate the interval tree, provide an API to look up intervals that are registered and accessor functions to return the start and last address of the interval.
Signed-off-by: Ralph Campbell rcampbell@nvidia.com --- include/linux/mmu_notifier.h | 15 +++++++++++++++ mm/mmu_notifier.c | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 48 insertions(+)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 0ce59b4f22c2..cdbbad13b278 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -314,6 +314,21 @@ void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni); void mmu_interval_notifier_put(struct mmu_interval_notifier *mni); void mmu_interval_notifier_update(struct mmu_interval_notifier *mni, unsigned long start, unsigned long last); +struct mmu_interval_notifier *mmu_interval_notifier_find(struct mm_struct *mm, + const struct mmu_interval_notifier_ops *ops, + unsigned long start, unsigned long last); + +static inline unsigned long mmu_interval_notifier_start( + struct mmu_interval_notifier *mni) +{ + return mni->interval_tree.start; +} + +static inline unsigned long mmu_interval_notifier_last( + struct mmu_interval_notifier *mni) +{ + return mni->interval_tree.last; +}
/** * mmu_interval_set_seq - Save the invalidation sequence diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 47ad9cc89aab..4efecc0f13cb 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -1171,6 +1171,39 @@ void mmu_interval_notifier_update(struct mmu_interval_notifier *mni, } EXPORT_SYMBOL_GPL(mmu_interval_notifier_update);
+struct mmu_interval_notifier *mmu_interval_notifier_find(struct mm_struct *mm, + const struct mmu_interval_notifier_ops *ops, + unsigned long start, unsigned long last) +{ + struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm; + struct interval_tree_node *node; + struct mmu_interval_notifier *mni; + struct mmu_interval_notifier *res = NULL; + + spin_lock(&mmn_mm->lock); + node = interval_tree_iter_first(&mmn_mm->itree, start, last); + if (node) { + mni = container_of(node, struct mmu_interval_notifier, + interval_tree); + while (true) { + if (mni->ops == ops) { + res = mni; + break; + } + node = interval_tree_iter_next(&mni->interval_tree, + start, last); + if (!node) + break; + mni = container_of(node, struct mmu_interval_notifier, + interval_tree); + } + } + spin_unlock(&mmn_mm->lock); + + return res; +} +EXPORT_SYMBOL_GPL(mmu_interval_notifier_find); + /** * mmu_notifier_synchronize - Ensure all mmu_notifiers are freed *
Update nouveau to only use the mmu interval notifiers.
Signed-off-by: Ralph Campbell rcampbell@nvidia.com --- drivers/gpu/drm/nouveau/nouveau_svm.c | 313 +++++++++++++++++--------- 1 file changed, 201 insertions(+), 112 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index df9bf1fd1bc0..0343e48d41d7 100644 --- a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -88,7 +88,7 @@ nouveau_ivmm_find(struct nouveau_svm *svm, u64 inst) }
struct nouveau_svmm { - struct mmu_notifier notifier; + struct mm_struct *mm; struct nouveau_vmm *vmm; struct { unsigned long start; @@ -98,6 +98,13 @@ struct nouveau_svmm { struct mutex mutex; };
+struct svmm_interval { + struct mmu_interval_notifier notifier; + struct nouveau_svmm *svmm; +}; + +static const struct mmu_interval_notifier_ops nouveau_svm_mni_ops; + #define SVMM_DBG(s,f,a...) \ NV_DEBUG((s)->vmm->cli->drm, "svm-%p: "f"\n", (s), ##a) #define SVMM_ERR(s,f,a...) \ @@ -236,6 +243,8 @@ nouveau_svmm_join(struct nouveau_svmm *svmm, u64 inst) static void nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit) { + SVMM_DBG(svmm, "invalidate %016llx-%016llx", start, limit); + if (limit > start) { bool super = svmm->vmm->vmm.object.client->super; svmm->vmm->vmm.object.client->super = true; @@ -248,58 +257,25 @@ nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit) } }
-static int -nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn, - const struct mmu_notifier_range *update) -{ - struct nouveau_svmm *svmm = - container_of(mn, struct nouveau_svmm, notifier); - unsigned long start = update->start; - unsigned long limit = update->end; - - if (!mmu_notifier_range_blockable(update)) - return -EAGAIN; - - SVMM_DBG(svmm, "invalidate %016lx-%016lx", start, limit); - - mutex_lock(&svmm->mutex); - if (unlikely(!svmm->vmm)) - goto out; - - if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) { - if (start < svmm->unmanaged.start) { - nouveau_svmm_invalidate(svmm, start, - svmm->unmanaged.limit); - } - start = svmm->unmanaged.limit; - } - - nouveau_svmm_invalidate(svmm, start, limit); - -out: - mutex_unlock(&svmm->mutex); - return 0; -} - -static void nouveau_svmm_free_notifier(struct mmu_notifier *mn) -{ - kfree(container_of(mn, struct nouveau_svmm, notifier)); -} - -static const struct mmu_notifier_ops nouveau_mn_ops = { - .invalidate_range_start = nouveau_svmm_invalidate_range_start, - .free_notifier = nouveau_svmm_free_notifier, -}; - void nouveau_svmm_fini(struct nouveau_svmm **psvmm) { struct nouveau_svmm *svmm = *psvmm; + struct mmu_interval_notifier *mni; + if (svmm) { mutex_lock(&svmm->mutex); + while (true) { + mni = mmu_interval_notifier_find(svmm->mm, + &nouveau_svm_mni_ops, 0UL, ~0UL); + if (!mni) + break; + mmu_interval_notifier_put(mni); + } svmm->vmm = NULL; mutex_unlock(&svmm->mutex); - mmu_notifier_put(&svmm->notifier); + mmdrop(svmm->mm); + kfree(svmm); *psvmm = NULL; } } @@ -343,11 +319,12 @@ nouveau_svmm_init(struct drm_device *dev, void *data, goto out_free;
down_write(¤t->mm->mmap_sem); - svmm->notifier.ops = &nouveau_mn_ops; - ret = __mmu_notifier_register(&svmm->notifier, current->mm); + ret = __mmu_notifier_register(NULL, current->mm); if (ret) goto out_mm_unlock; - /* Note, ownership of svmm transfers to mmu_notifier */ + + mmgrab(current->mm); + svmm->mm = current->mm;
cli->svm.svmm = svmm; cli->svm.cli = cli; @@ -482,65 +459,212 @@ nouveau_svm_fault_cache(struct nouveau_svm *svm, fault->inst, fault->addr, fault->access); }
-struct svm_notifier { - struct mmu_interval_notifier notifier; - struct nouveau_svmm *svmm; -}; +static struct svmm_interval *nouveau_svmm_new_interval( + struct nouveau_svmm *svmm, + unsigned long start, + unsigned long last) +{ + struct svmm_interval *smi; + int ret; + + smi = kmalloc(sizeof(*smi), GFP_ATOMIC); + if (!smi) + return NULL; + + smi->svmm = svmm; + + ret = mmu_interval_notifier_insert_safe(&smi->notifier, svmm->mm, + start, last - start + 1, &nouveau_svm_mni_ops); + if (ret) { + kfree(smi); + return NULL; + } + + return smi; +} + +static void nouveau_svmm_do_unmap(struct mmu_interval_notifier *mni, + const struct mmu_notifier_range *range) +{ + struct svmm_interval *smi = + container_of(mni, struct svmm_interval, notifier); + struct nouveau_svmm *svmm = smi->svmm; + unsigned long start = mmu_interval_notifier_start(mni); + unsigned long last = mmu_interval_notifier_last(mni); + + if (start >= range->start) { + /* Remove the whole interval or keep the right-hand part. */ + if (last <= range->end) + mmu_interval_notifier_put(mni); + else + mmu_interval_notifier_update(mni, range->end, last); + return; + } + + /* Keep the left-hand part of the interval. */ + mmu_interval_notifier_update(mni, start, range->start - 1); + + /* If a hole is created, create an interval for the right-hand part. */ + if (last >= range->end) { + smi = nouveau_svmm_new_interval(svmm, range->end, last); + /* + * If we can't allocate an interval, we won't get invalidation + * callbacks so clear the mapping and rely on faults to reload + * the mappings if needed. + */ + if (!smi) + nouveau_svmm_invalidate(svmm, range->end, last + 1); + } +}
-static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni, - const struct mmu_notifier_range *range, - unsigned long cur_seq) +static bool nouveau_svmm_interval_invalidate(struct mmu_interval_notifier *mni, + const struct mmu_notifier_range *range, + unsigned long cur_seq) { - struct svm_notifier *sn = - container_of(mni, struct svm_notifier, notifier); + struct svmm_interval *smi = + container_of(mni, struct svmm_interval, notifier); + struct nouveau_svmm *svmm = smi->svmm;
/* - * serializes the update to mni->invalidate_seq done by caller and + * Serializes the update to mni->invalidate_seq done by the caller and * prevents invalidation of the PTE from progressing while HW is being - * programmed. This is very hacky and only works because the normal - * notifier that does invalidation is always called after the range - * notifier. + * programmed. */ if (mmu_notifier_range_blockable(range)) - mutex_lock(&sn->svmm->mutex); - else if (!mutex_trylock(&sn->svmm->mutex)) + mutex_lock(&svmm->mutex); + else if (!mutex_trylock(&svmm->mutex)) return false; + mmu_interval_set_seq(mni, cur_seq); - mutex_unlock(&sn->svmm->mutex); + nouveau_svmm_invalidate(svmm, range->start, range->end); + + /* Stop tracking the range if it is an unmap. */ + if (range->event == MMU_NOTIFY_UNMAP) + nouveau_svmm_do_unmap(mni, range); + + mutex_unlock(&svmm->mutex); return true; }
+static void nouveau_svmm_interval_release(struct mmu_interval_notifier *mni) +{ + struct svmm_interval *smi = + container_of(mni, struct svmm_interval, notifier); + + kfree(smi); +} + static const struct mmu_interval_notifier_ops nouveau_svm_mni_ops = { - .invalidate = nouveau_svm_range_invalidate, + .invalidate = nouveau_svmm_interval_invalidate, + .release = nouveau_svmm_interval_release, };
+/* + * Find or create a mmu_interval_notifier for the given range. + * Although mmu_interval_notifier_insert_safe() can handle overlapping + * intervals, we only create non-overlapping intervals, shrinking the hmm_range + * if it spans more than one svmm_interval. + */ +static int nouveau_svmm_interval_find(struct nouveau_svmm *svmm, + struct hmm_range *range) +{ + struct mmu_interval_notifier *mni; + struct svmm_interval *smi; + struct vm_area_struct *vma; + unsigned long start = range->start; + unsigned long last = range->end - 1; + int ret; + + mutex_lock(&svmm->mutex); + mni = mmu_interval_notifier_find(svmm->mm, &nouveau_svm_mni_ops, start, + last); + if (mni) { + if (start >= mmu_interval_notifier_start(mni)) { + smi = container_of(mni, struct svmm_interval, notifier); + if (last > mmu_interval_notifier_last(mni)) + range->end = + mmu_interval_notifier_last(mni) + 1; + goto found; + } + WARN_ON(last <= mmu_interval_notifier_start(mni)); + range->end = mmu_interval_notifier_start(mni); + last = range->end - 1; + } + /* + * Might as well create an interval covering the underlying VMA to + * avoid having to create a bunch of small intervals. + */ + vma = find_vma(svmm->mm, range->start); + if (!vma || start < vma->vm_start) { + ret = -ENOENT; + goto err; + } + if (range->end > vma->vm_end) { + range->end = vma->vm_end; + last = range->end - 1; + } else if (!mni) { + /* Anything registered on the right part of the vma? */ + mni = mmu_interval_notifier_find(svmm->mm, &nouveau_svm_mni_ops, + range->end, vma->vm_end - 1); + if (mni) + last = mmu_interval_notifier_start(mni) - 1; + else + last = vma->vm_end - 1; + } + /* Anything registered on the left part of the vma? */ + mni = mmu_interval_notifier_find(svmm->mm, &nouveau_svm_mni_ops, + vma->vm_start, start - 1); + if (mni) + start = mmu_interval_notifier_last(mni) + 1; + else + start = vma->vm_start; + smi = nouveau_svmm_new_interval(svmm, start, last); + if (!smi) { + ret = -ENOMEM; + goto err; + } + +found: + range->notifier = &smi->notifier; + mutex_unlock(&svmm->mutex); + return 0; + +err: + mutex_unlock(&svmm->mutex); + return ret; +} + static int nouveau_range_fault(struct nouveau_svmm *svmm, struct nouveau_drm *drm, void *data, u32 size, - u64 *pfns, struct svm_notifier *notifier) + u64 *pfns, u64 start, u64 end) { unsigned long timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); /* Have HMM fault pages within the fault window to the GPU. */ struct hmm_range range = { - .notifier = ¬ifier->notifier, - .start = notifier->notifier.interval_tree.start, - .end = notifier->notifier.interval_tree.last + 1, + .start = start, + .end = end, .pfns = pfns, .flags = nouveau_svm_pfn_flags, .values = nouveau_svm_pfn_values, + .default_flags = 0, + .pfn_flags_mask = ~0UL, .pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT, }; - struct mm_struct *mm = notifier->notifier.mm; + struct mm_struct *mm = svmm->mm; long ret;
while (true) { if (time_after(jiffies, timeout)) return -EBUSY;
- range.notifier_seq = mmu_interval_read_begin(range.notifier); - range.default_flags = 0; - range.pfn_flags_mask = -1UL; down_read(&mm->mmap_sem); + ret = nouveau_svmm_interval_find(svmm, &range); + if (ret) { + up_read(&mm->mmap_sem); + return ret; + } + range.notifier_seq = mmu_interval_read_begin(range.notifier); ret = hmm_range_fault(&range, 0); up_read(&mm->mmap_sem); if (ret <= 0) { @@ -585,7 +709,6 @@ nouveau_svm_fault(struct nvif_notify *notify) } i; u64 phys[16]; } args; - struct vm_area_struct *vma; u64 inst, start, limit; int fi, fn, pi, fill; int replay = 0, ret; @@ -640,7 +763,6 @@ nouveau_svm_fault(struct nvif_notify *notify) args.i.p.version = 0;
for (fi = 0; fn = fi + 1, fi < buffer->fault_nr; fi = fn) { - struct svm_notifier notifier; struct mm_struct *mm;
/* Cancel any faults from non-SVM channels. */ @@ -662,36 +784,12 @@ nouveau_svm_fault(struct nvif_notify *notify) start = max_t(u64, start, svmm->unmanaged.limit); SVMM_DBG(svmm, "wndw %016llx-%016llx", start, limit);
- mm = svmm->notifier.mm; + mm = svmm->mm; if (!mmget_not_zero(mm)) { nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]); continue; }
- /* Intersect fault window with the CPU VMA, cancelling - * the fault if the address is invalid. - */ - down_read(&mm->mmap_sem); - vma = find_vma_intersection(mm, start, limit); - if (!vma) { - SVMM_ERR(svmm, "wndw %016llx-%016llx", start, limit); - up_read(&mm->mmap_sem); - mmput(mm); - nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]); - continue; - } - start = max_t(u64, start, vma->vm_start); - limit = min_t(u64, limit, vma->vm_end); - up_read(&mm->mmap_sem); - SVMM_DBG(svmm, "wndw %016llx-%016llx", start, limit); - - if (buffer->fault[fi]->addr != start) { - SVMM_ERR(svmm, "addr %016llx", buffer->fault[fi]->addr); - mmput(mm); - nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]); - continue; - } - /* Prepare the GPU-side update of all pages within the * fault window, determining required pages and access * permissions based on pending faults. @@ -743,18 +841,9 @@ nouveau_svm_fault(struct nvif_notify *notify) args.i.p.addr, args.i.p.addr + args.i.p.size, fn - fi);
- notifier.svmm = svmm; - ret = mmu_interval_notifier_insert(¬ifier.notifier, - svmm->notifier.mm, - args.i.p.addr, args.i.p.size, - &nouveau_svm_mni_ops); - if (!ret) { - ret = nouveau_range_fault( - svmm, svm->drm, &args, - sizeof(args.i) + pi * sizeof(args.phys[0]), - args.phys, ¬ifier); - mmu_interval_notifier_remove(¬ifier.notifier); - } + ret = nouveau_range_fault(svmm, svm->drm, &args, + sizeof(args.i) + pi * sizeof(args.phys[0]), args.phys, + start, start + args.i.p.size); mmput(mm);
/* Cancel any faults in the window whose pages didn't manage
Add some basic stand alone self tests for HMM.
Signed-off-by: Ralph Campbell rcampbell@nvidia.com Signed-off-by: Jérôme Glisse jglisse@redhat.com --- MAINTAINERS | 3 + lib/Kconfig.debug | 11 + lib/Makefile | 1 + lib/test_hmm.c | 1368 ++++++++++++++++++++++++ tools/testing/selftests/vm/.gitignore | 1 + tools/testing/selftests/vm/Makefile | 3 + tools/testing/selftests/vm/config | 2 + tools/testing/selftests/vm/hmm-tests.c | 1354 +++++++++++++++++++++++ tools/testing/selftests/vm/run_vmtests | 16 + tools/testing/selftests/vm/test_hmm.sh | 97 ++ 10 files changed, 2856 insertions(+) create mode 100644 lib/test_hmm.c create mode 100644 tools/testing/selftests/vm/hmm-tests.c create mode 100755 tools/testing/selftests/vm/test_hmm.sh
diff --git a/MAINTAINERS b/MAINTAINERS index 4017e6b760be..08508a20e2cf 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7514,7 +7514,10 @@ L: linux-mm@kvack.org S: Maintained F: mm/hmm* F: include/linux/hmm* +F: include/uapi/linux/test_hmm* F: Documentation/vm/hmm.rst +F: lib/test_hmm* +F: tools/testing/selftests/vm/*hmm*
HOST AP DRIVER M: Jouni Malinen j@w1.fi diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 5ffe144c9794..d38d1e281dca 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2162,6 +2162,17 @@ config TEST_MEMINIT
If unsure, say N.
+config TEST_HMM + tristate "Test HMM (Heterogeneous Memory Management)" + depends on HMM_MIRROR + depends on DEVICE_PRIVATE + help + This is a pseudo device driver solely for testing HMM. + Say M here if you want to build the HMM test module. + Doing so will allow you to run tools/testing/selftest/vm/hmm-tests. + + If unsure, say N. + endif # RUNTIME_TESTING_MENU
config MEMTEST diff --git a/lib/Makefile b/lib/Makefile index 93217d44237f..348ce83cb21f 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -88,6 +88,7 @@ obj-$(CONFIG_TEST_OBJAGG) += test_objagg.o obj-$(CONFIG_TEST_STACKINIT) += test_stackinit.o obj-$(CONFIG_TEST_BLACKHOLE_DEV) += test_blackhole_dev.o obj-$(CONFIG_TEST_MEMINIT) += test_meminit.o +obj-$(CONFIG_TEST_HMM) += test_hmm.o
obj-$(CONFIG_TEST_LIVEPATCH) += livepatch/
diff --git a/lib/test_hmm.c b/lib/test_hmm.c new file mode 100644 index 000000000000..96ffa988969c --- /dev/null +++ b/lib/test_hmm.c @@ -0,0 +1,1368 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * This is a module to test the HMM (Heterogeneous Memory Management) + * mirror and zone device private memory migration APIs of the kernel. + * Userspace programs can register with the driver to mirror their own address + * space and can use the device to read/write any valid virtual address. + */ +#include <linux/init.h> +#include <linux/fs.h> +#include <linux/mm.h> +#include <linux/module.h> +#include <linux/kernel.h> +#include <linux/cdev.h> +#include <linux/device.h> +#include <linux/mutex.h> +#include <linux/rwsem.h> +#include <linux/sched.h> +#include <linux/slab.h> +#include <linux/highmem.h> +#include <linux/delay.h> +#include <linux/pagemap.h> +#include <linux/hmm.h> +#include <linux/vmalloc.h> +#include <linux/swap.h> +#include <linux/swapops.h> +#include <linux/sched/mm.h> +#include <linux/platform_device.h> + +#include <uapi/linux/test_hmm.h> + +#define DMIRROR_NDEVICES 2 +#define DMIRROR_RANGE_FAULT_TIMEOUT 1000 +#define DEVMEM_CHUNK_SIZE (256 * 1024 * 1024U) +#define DEVMEM_CHUNKS_RESERVE 16 + +static const struct dev_pagemap_ops dmirror_devmem_ops; +static const struct mmu_interval_notifier_ops dmirror_min_ops; +static dev_t dmirror_dev; +static struct page *dmirror_zero_page; + +struct dmirror_device; + +struct dmirror_bounce { + void *ptr; + unsigned long size; + unsigned long addr; + unsigned long cpages; +}; + +#define DPT_SHIFT PAGE_SHIFT +#define DPT_VALID (1UL << 0) +#define DPT_WRITE (1UL << 1) +#define DPT_DPAGE (1UL << 2) + +#define DPT_XA_TAG_WRITE 3UL + +static const uint64_t dmirror_hmm_flags[HMM_PFN_FLAG_MAX] = { + [HMM_PFN_VALID] = DPT_VALID, + [HMM_PFN_WRITE] = DPT_WRITE, + [HMM_PFN_DEVICE_PRIVATE] = DPT_DPAGE, +}; + +static const uint64_t dmirror_hmm_values[HMM_PFN_VALUE_MAX] = { + [HMM_PFN_NONE] = 0, + [HMM_PFN_ERROR] = 0x10, + [HMM_PFN_SPECIAL] = 0x10, +}; + +/* + * Data structure to track address ranges and register for mmu interval + * notifier updates. + */ +struct dmirror_interval { + struct mmu_interval_notifier notifier; + struct dmirror *dmirror; +}; + +/* + * Data attached to the open device file. + * Note that it might be shared after a fork(). + */ +struct dmirror { + struct mm_struct *mm; + struct dmirror_device *mdevice; + struct xarray pt; + struct mutex mutex; +}; + +/* + * ZONE_DEVICE pages for migration and simulating device memory. + */ +struct dmirror_chunk { + struct dev_pagemap pagemap; + struct dmirror_device *mdevice; +}; + +/* + * Per device data. + */ +struct dmirror_device { + struct cdev cdevice; + struct hmm_devmem *devmem; + + unsigned int devmem_capacity; + unsigned int devmem_count; + struct dmirror_chunk **devmem_chunks; + struct mutex devmem_lock; /* protects the above */ + + unsigned long calloc; + unsigned long cfree; + struct page *free_pages; + spinlock_t lock; /* protects the above */ +}; + +static struct dmirror_device dmirror_devices[DMIRROR_NDEVICES]; + +static int dmirror_bounce_init(struct dmirror_bounce *bounce, + unsigned long addr, + unsigned long size) +{ + bounce->addr = addr; + bounce->size = size; + bounce->cpages = 0; + bounce->ptr = vmalloc(size); + if (!bounce->ptr) + return -ENOMEM; + return 0; +} + +static void dmirror_bounce_fini(struct dmirror_bounce *bounce) +{ + vfree(bounce->ptr); +} + +static int dmirror_fops_open(struct inode *inode, struct file *filp) +{ + struct cdev *cdev = inode->i_cdev; + struct dmirror *dmirror; + int ret; + + /* Mirror this process address space */ + dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL); + if (dmirror == NULL) + return -ENOMEM; + + dmirror->mdevice = container_of(cdev, struct dmirror_device, cdevice); + mutex_init(&dmirror->mutex); + xa_init(&dmirror->pt); + + /* + * Pre-register for mmu interval notifiers so + * mmu_interval_notifier_insert_safe() can be called without holding + * mmap_sem for write. + */ + ret = mmu_notifier_register(NULL, current->mm); + if (ret) { + kfree(dmirror); + return ret; + } + + /* Pairs with the mmdrop() in dmirror_fops_release(). */ + mmgrab(current->mm); + dmirror->mm = current->mm; + + /* Only the first open registers the address space. */ + filp->private_data = dmirror; + return ret; +} + +static int dmirror_fops_release(struct inode *inode, struct file *filp) +{ + struct dmirror *dmirror = filp->private_data; + struct mmu_interval_notifier *mni; + + mutex_lock(&dmirror->mutex); + while (true) { + mni = mmu_interval_notifier_find(dmirror->mm, &dmirror_min_ops, + 0UL, ~0UL); + if (!mni) + break; + mmu_interval_notifier_put(mni); + } + mutex_unlock(&dmirror->mutex); + mmdrop(dmirror->mm); + mmu_notifier_synchronize(); + xa_destroy(&dmirror->pt); + kfree(dmirror); + return 0; +} + +static inline struct dmirror_device *dmirror_page_to_device(struct page *page) + +{ + struct dmirror_chunk *devmem; + + devmem = container_of(page->pgmap, struct dmirror_chunk, pagemap); + return devmem->mdevice; +} + +static bool dmirror_device_is_mine(struct dmirror_device *mdevice, + struct page *page) +{ + if (!is_zone_device_page(page)) + return false; + return page->pgmap->ops == &dmirror_devmem_ops && + dmirror_page_to_device(page) == mdevice; +} + +static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range) +{ + uint64_t *pfns = range->pfns; + unsigned long pfn; + + for (pfn = (range->start >> PAGE_SHIFT); + pfn < (range->end >> PAGE_SHIFT); + pfn++, pfns++) { + struct page *page; + void *entry; + + /* + * HMM_PFN_ERROR is returned if it is accessing invalid memory + * either because of memory error (hardware detected memory + * corruption) or more likely because of truncate on mmap + * file. + */ + if (*pfns == range->values[HMM_PFN_ERROR]) + return -EFAULT; + if (!(*pfns & range->flags[HMM_PFN_VALID])) + return -EFAULT; + page = hmm_device_entry_to_page(range, *pfns); + /* We asked for pages to be populated but check anyway. */ + if (!page) + return -EFAULT; + if (is_zone_device_page(page)) { + /* + * TODO: need a way to ask HMM to fault foreign zone + * device private pages. + */ + if (!dmirror_device_is_mine(dmirror->mdevice, page)) + continue; + } + entry = page; + if (*pfns & range->flags[HMM_PFN_WRITE]) + entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE); + else if (range->default_flags & range->flags[HMM_PFN_WRITE]) + return -EFAULT; + entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC); + if (xa_is_err(entry)) + return xa_err(entry); + } + + return 0; +} + +static void dmirror_do_update(struct dmirror *dmirror, unsigned long start, + unsigned long end) +{ + unsigned long pfn; + + /* + * The XArray doesn't hold references to pages since it relies on + * the mmu notifier to clear pointers when they become stale. + * Therefore, it is OK to just clear the entry. + */ + for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) + xa_erase(&dmirror->pt, pfn); +} + +static struct dmirror_interval *dmirror_new_interval(struct dmirror *dmirror, + unsigned long start, + unsigned long last) +{ + struct dmirror_interval *dmi; + int ret; + + dmi = kmalloc(sizeof(*dmi), GFP_ATOMIC); + if (!dmi) + return NULL; + + dmi->dmirror = dmirror; + + ret = mmu_interval_notifier_insert_safe(&dmi->notifier, dmirror->mm, + start, last - start + 1, &dmirror_min_ops); + if (ret) { + kfree(dmi); + return NULL; + } + + return dmi; +} + +static void dmirror_do_unmap(struct mmu_interval_notifier *mni, + const struct mmu_notifier_range *range) +{ + struct dmirror_interval *dmi = + container_of(mni, struct dmirror_interval, notifier); + struct dmirror *dmirror = dmi->dmirror; + unsigned long start = mmu_interval_notifier_start(mni); + unsigned long last = mmu_interval_notifier_last(mni); + + if (start >= range->start) { + /* Remove the whole interval or keep the right-hand part. */ + if (last <= range->end) + mmu_interval_notifier_put(mni); + else + mmu_interval_notifier_update(mni, range->end, last); + return; + } + + /* Keep the left-hand part of the interval. */ + mmu_interval_notifier_update(mni, start, range->start - 1); + + /* If a hole is created, create an interval for the right-hand part. */ + if (last >= range->end) { + dmi = dmirror_new_interval(dmirror, range->end, last); + /* + * If we can't allocate an interval, we won't get invalidation + * callbacks so clear the mapping and rely on faults to reload + * the mappings if needed. + */ + if (!dmi) + dmirror_do_update(dmirror, range->end, last + 1); + } +} + +static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni, + const struct mmu_notifier_range *range, + unsigned long cur_seq) +{ + struct dmirror_interval *dmi = + container_of(mni, struct dmirror_interval, notifier); + struct dmirror *dmirror = dmi->dmirror; + unsigned long start = mmu_interval_notifier_start(mni); + unsigned long last = mmu_interval_notifier_last(mni); + + if (mmu_notifier_range_blockable(range)) + mutex_lock(&dmirror->mutex); + else if (!mutex_trylock(&dmirror->mutex)) + return false; + + mmu_interval_set_seq(mni, cur_seq); + dmirror_do_update(dmirror, max(start, range->start), + min(last + 1, range->end)); + + /* Stop tracking the range if it is an unmap. */ + if (range->event == MMU_NOTIFY_UNMAP) + dmirror_do_unmap(mni, range); + + mutex_unlock(&dmirror->mutex); + return true; +} + +static void dmirror_interval_release(struct mmu_interval_notifier *mni) +{ + struct dmirror_interval *dmi = + container_of(mni, struct dmirror_interval, notifier); + + kfree(dmi); +} + +static const struct mmu_interval_notifier_ops dmirror_min_ops = { + .invalidate = dmirror_interval_invalidate, + .release = dmirror_interval_release, +}; + +/* + * Find or create a mmu_interval_notifier for the given range. + * Although mmu_interval_notifier_insert_safe() can handle overlapping + * intervals, we only create non-overlapping intervals, shrinking the hmm_range + * if it spans more than one dmirror_interval. + */ +static int dmirror_interval_find(struct dmirror *dmirror, + struct hmm_range *range) +{ + struct mmu_interval_notifier *mni; + struct dmirror_interval *dmi; + struct vm_area_struct *vma; + unsigned long start = range->start; + unsigned long last = range->end - 1; + int ret; + + mutex_lock(&dmirror->mutex); + mni = mmu_interval_notifier_find(dmirror->mm, &dmirror_min_ops, start, + last); + if (mni) { + if (start >= mmu_interval_notifier_start(mni)) { + dmi = container_of(mni, struct dmirror_interval, + notifier); + if (last > mmu_interval_notifier_last(mni)) + range->end = + mmu_interval_notifier_last(mni) + 1; + goto found; + } + WARN_ON(last <= mmu_interval_notifier_start(mni)); + range->end = mmu_interval_notifier_start(mni); + last = range->end - 1; + } + /* + * Might as well create an interval covering the underlying VMA to + * avoid having to create a bunch of small intervals. + */ + vma = find_vma(dmirror->mm, start); + if (!vma || start < vma->vm_start) { + ret = -ENOENT; + goto err; + } + if (range->end > vma->vm_end) { + range->end = vma->vm_end; + last = range->end - 1; + } else if (!mni) { + /* Anything registered on the right part of the vma? */ + mni = mmu_interval_notifier_find(dmirror->mm, &dmirror_min_ops, + range->end, vma->vm_end - 1); + if (mni) + last = mmu_interval_notifier_start(mni) - 1; + else + last = vma->vm_end - 1; + } + /* Anything registered on the left part of the vma? */ + mni = mmu_interval_notifier_find(dmirror->mm, &dmirror_min_ops, + vma->vm_start, start - 1); + if (mni) + start = mmu_interval_notifier_last(mni) + 1; + else + start = vma->vm_start; + dmi = dmirror_new_interval(dmirror, start, last); + if (!dmi) { + ret = -ENOMEM; + goto err; + } + +found: + range->notifier = &dmi->notifier; + mutex_unlock(&dmirror->mutex); + return 0; + +err: + mutex_unlock(&dmirror->mutex); + return ret; +} + +static int dmirror_range_fault(struct dmirror *dmirror, + struct hmm_range *range) +{ + struct mm_struct *mm = dmirror->mm; + unsigned long timeout = + jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); + int ret; + + while (true) { + long count; + + if (time_after(jiffies, timeout)) { + ret = -EBUSY; + goto out; + } + + down_read(&mm->mmap_sem); + ret = dmirror_interval_find(dmirror, range); + if (ret) { + up_read(&mm->mmap_sem); + goto out; + } + range->notifier_seq = mmu_interval_read_begin(range->notifier); + count = hmm_range_fault(range, 0); + up_read(&mm->mmap_sem); + if (count <= 0) { + if (count == 0 || count == -EBUSY) + continue; + ret = count; + goto out; + } + + mutex_lock(&dmirror->mutex); + if (mmu_interval_read_retry(range->notifier, + range->notifier_seq)) { + mutex_unlock(&dmirror->mutex); + continue; + } + break; + } + + ret = dmirror_do_fault(dmirror, range); + + mutex_unlock(&dmirror->mutex); +out: + return ret; +} + +static int dmirror_fault(struct dmirror *dmirror, unsigned long start, + unsigned long end, bool write) +{ + struct mm_struct *mm = dmirror->mm; + unsigned long addr; + unsigned long next; + uint64_t pfns[64]; + struct hmm_range range = { + .pfns = pfns, + .flags = dmirror_hmm_flags, + .values = dmirror_hmm_values, + .pfn_shift = DPT_SHIFT, + .pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] | + dmirror_hmm_flags[HMM_PFN_WRITE]), + .default_flags = dmirror_hmm_flags[HMM_PFN_VALID] | + (write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0), + }; + int ret = 0; + + /* Since the mm is for the mirrored process, get a reference first. */ + if (!mmget_not_zero(mm)) + return 0; + + for (addr = start; addr < end; addr = next) { + next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end); + range.start = addr; + range.end = next; + + ret = dmirror_range_fault(dmirror, &range); + if (ret) + break; + } + + mmput(mm); + return ret; +} + +static int dmirror_do_read(struct dmirror *dmirror, unsigned long start, + unsigned long end, struct dmirror_bounce *bounce) +{ + unsigned long pfn; + void *ptr; + + ptr = bounce->ptr + ((start - bounce->addr) & PAGE_MASK); + + for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) { + void *entry; + struct page *page; + void *tmp; + + entry = xa_load(&dmirror->pt, pfn); + page = xa_untag_pointer(entry); + if (!page) + return -ENOENT; + + tmp = kmap(page); + memcpy(ptr, tmp, PAGE_SIZE); + kunmap(page); + + ptr += PAGE_SIZE; + bounce->cpages++; + } + + return 0; +} + +static int dmirror_read(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd) +{ + struct dmirror_bounce bounce; + unsigned long start, end; + unsigned long size = cmd->npages << PAGE_SHIFT; + int ret; + + start = cmd->addr; + end = start + size; + if (end < start) + return -EINVAL; + + ret = dmirror_bounce_init(&bounce, start, size); + if (ret) + return ret; + +again: + mutex_lock(&dmirror->mutex); + ret = dmirror_do_read(dmirror, start, end, &bounce); + mutex_unlock(&dmirror->mutex); + if (ret == 0) + ret = copy_to_user((void __user *)cmd->ptr, bounce.ptr, + bounce.size); + else if (ret == -ENOENT) { + start = cmd->addr + (bounce.cpages << PAGE_SHIFT); + ret = dmirror_fault(dmirror, start, end, false); + if (ret == 0) { + cmd->faults++; + goto again; + } + } + + cmd->cpages = bounce.cpages; + dmirror_bounce_fini(&bounce); + return ret; +} + +static int dmirror_do_write(struct dmirror *dmirror, unsigned long start, + unsigned long end, struct dmirror_bounce *bounce) +{ + unsigned long pfn; + void *ptr; + + ptr = bounce->ptr + ((start - bounce->addr) & PAGE_MASK); + + for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) { + void *entry; + struct page *page; + void *tmp; + + entry = xa_load(&dmirror->pt, pfn); + page = xa_untag_pointer(entry); + if (!page || xa_pointer_tag(entry) != DPT_XA_TAG_WRITE) + return -ENOENT; + + tmp = kmap(page); + memcpy(tmp, ptr, PAGE_SIZE); + kunmap(page); + + ptr += PAGE_SIZE; + bounce->cpages++; + } + + return 0; +} + +static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd) +{ + struct dmirror_bounce bounce; + unsigned long start, end; + unsigned long size = cmd->npages << PAGE_SHIFT; + int ret; + + start = cmd->addr; + end = start + size; + if (end < start) + return -EINVAL; + + ret = dmirror_bounce_init(&bounce, start, size); + if (ret) + return ret; + ret = copy_from_user(bounce.ptr, (void __user *)cmd->ptr, + bounce.size); + if (ret) + return ret; + +again: + mutex_lock(&dmirror->mutex); + ret = dmirror_do_write(dmirror, start, end, &bounce); + mutex_unlock(&dmirror->mutex); + if (ret == -ENOENT) { + start = cmd->addr + (bounce.cpages << PAGE_SHIFT); + ret = dmirror_fault(dmirror, start, end, true); + if (ret == 0) { + cmd->faults++; + goto again; + } + } + + cmd->cpages = bounce.cpages; + dmirror_bounce_fini(&bounce); + return ret; +} + +static bool dmirror_allocate_chunk(struct dmirror_device *mdevice, + struct page **ppage) +{ + struct dmirror_chunk *devmem; + struct resource *res; + unsigned long pfn; + unsigned long pfn_first; + unsigned long pfn_last; + void *ptr; + + mutex_lock(&mdevice->devmem_lock); + + if (mdevice->devmem_count == mdevice->devmem_capacity) { + struct dmirror_chunk **new_chunks; + unsigned int new_capacity; + + new_capacity = mdevice->devmem_capacity + + DEVMEM_CHUNKS_RESERVE; + new_chunks = krealloc(mdevice->devmem_chunks, + sizeof(new_chunks[0]) * new_capacity, + GFP_KERNEL); + if (!new_chunks) + goto err; + mdevice->devmem_capacity = new_capacity; + mdevice->devmem_chunks = new_chunks; + } + + res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE, + "hmm_dmirror"); + if (IS_ERR(res)) + goto err; + + devmem = kzalloc(sizeof(*devmem), GFP_KERNEL); + if (!devmem) + goto err; + + devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; + devmem->pagemap.res = *res; + devmem->pagemap.ops = &dmirror_devmem_ops; + + ptr = memremap_pages(&devmem->pagemap, numa_node_id()); + if (IS_ERR(ptr)) + goto err_free; + + devmem->mdevice = mdevice; + pfn_first = devmem->pagemap.res.start >> PAGE_SHIFT; + pfn_last = pfn_first + + (resource_size(&devmem->pagemap.res) >> PAGE_SHIFT); + mdevice->devmem_chunks[mdevice->devmem_count++] = devmem; + + mutex_unlock(&mdevice->devmem_lock); + + pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 0x%lx)\n", + DEVMEM_CHUNK_SIZE / (1024 * 1024), + mdevice->devmem_count, + mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)), + pfn_first, pfn_last); + + spin_lock(&mdevice->lock); + for (pfn = pfn_first; pfn < pfn_last; pfn++) { + struct page *page = pfn_to_page(pfn); + + page->zone_device_data = mdevice->free_pages; + mdevice->free_pages = page; + } + if (ppage) { + *ppage = mdevice->free_pages; + mdevice->free_pages = (*ppage)->zone_device_data; + mdevice->calloc++; + } + spin_unlock(&mdevice->lock); + + return true; + +err_free: + kfree(devmem); +err: + mutex_unlock(&mdevice->devmem_lock); + return false; +} + +static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice) +{ + struct page *dpage = NULL; + struct page *rpage; + + /* + * This is a fake device so we alloc real system memory to store + * our device memory. + */ + rpage = alloc_page(GFP_HIGHUSER); + if (!rpage) + return NULL; + + spin_lock(&mdevice->lock); + + if (mdevice->free_pages) { + dpage = mdevice->free_pages; + mdevice->free_pages = dpage->zone_device_data; + mdevice->calloc++; + spin_unlock(&mdevice->lock); + } else { + spin_unlock(&mdevice->lock); + if (!dmirror_allocate_chunk(mdevice, &dpage)) + goto error; + } + + dpage->zone_device_data = rpage; + get_page(dpage); + lock_page(dpage); + return dpage; + +error: + __free_page(rpage); + return NULL; +} + +static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args, + struct dmirror *dmirror) +{ + struct dmirror_device *mdevice = dmirror->mdevice; + const unsigned long *src = args->src; + unsigned long *dst = args->dst; + unsigned long addr; + + for (addr = args->start; addr < args->end; addr += PAGE_SIZE, + src++, dst++) { + struct page *spage; + struct page *dpage; + struct page *rpage; + + if (!(*src & MIGRATE_PFN_MIGRATE)) + continue; + + /* + * Note that spage might be NULL which is OK since it is an + * unallocated pte_none() or read-only zero page. + */ + spage = migrate_pfn_to_page(*src); + + /* + * Don't migrate device private pages from our own driver or + * others. For our own we would do a device private memory copy + * not a migration and for others, we would need to fault the + * other device's page into system memory first. + */ + if (spage && is_zone_device_page(spage)) + continue; + + dpage = dmirror_devmem_alloc_page(mdevice); + if (!dpage) + continue; + + rpage = dpage->zone_device_data; + if (spage) + copy_highpage(rpage, spage); + else + clear_highpage(rpage); + + /* + * Normally, a device would use the page->zone_device_data to + * point to the mirror but here we use it to hold the page for + * the simulated device memory and that page holds the pointer + * to the mirror. + */ + rpage->zone_device_data = dmirror; + + *dst = migrate_pfn(page_to_pfn(dpage)) | + MIGRATE_PFN_LOCKED; + if ((*src & MIGRATE_PFN_WRITE) || + (!spage && args->vma->vm_flags & VM_WRITE)) + *dst |= MIGRATE_PFN_WRITE; + } +} + +static int dmirror_migrate_finalize_and_map(struct migrate_vma *args, + struct dmirror *dmirror) +{ + unsigned long start = args->start; + unsigned long end = args->end; + const unsigned long *src = args->src; + const unsigned long *dst = args->dst; + unsigned long pfn; + + /* Map the migrated pages into the device's page tables. */ + mutex_lock(&dmirror->mutex); + + for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++, + src++, dst++) { + struct page *dpage; + void *entry; + + if (!(*src & MIGRATE_PFN_MIGRATE)) + continue; + + dpage = migrate_pfn_to_page(*dst); + if (!dpage) + continue; + + /* + * Store the page that holds the data so the page table + * doesn't have to deal with ZONE_DEVICE private pages. + */ + entry = dpage->zone_device_data; + if (*dst & MIGRATE_PFN_WRITE) + entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE); + entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC); + if (xa_is_err(entry)) + return xa_err(entry); + } + + mutex_unlock(&dmirror->mutex); + return 0; +} + +static int dmirror_migrate(struct dmirror *dmirror, + struct hmm_dmirror_cmd *cmd) +{ + unsigned long start, end, addr; + unsigned long size = cmd->npages << PAGE_SHIFT; + struct mm_struct *mm = dmirror->mm; + struct vm_area_struct *vma; + unsigned long src_pfns[64]; + unsigned long dst_pfns[64]; + struct dmirror_bounce bounce; + struct migrate_vma args; + unsigned long next; + int ret; + + start = cmd->addr; + end = start + size; + if (end < start) + return -EINVAL; + + /* Since the mm is for the mirrored process, get a reference first. */ + if (!mmget_not_zero(mm)) + return -EINVAL; + + down_read(&mm->mmap_sem); + for (addr = start; addr < end; addr = next) { + vma = find_vma(mm, addr); + if (!vma || addr < vma->vm_start) { + ret = -EINVAL; + goto out; + } + next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT)); + if (next > vma->vm_end) + next = vma->vm_end; + + args.vma = vma; + args.src = src_pfns; + args.dst = dst_pfns; + args.start = addr; + args.end = next; + ret = migrate_vma_setup(&args); + if (ret) + goto out; + + dmirror_migrate_alloc_and_copy(&args, dmirror); + migrate_vma_pages(&args); + dmirror_migrate_finalize_and_map(&args, dmirror); + migrate_vma_finalize(&args); + } + up_read(&mm->mmap_sem); + mmput(mm); + + /* Return the migrated data for verification. */ + ret = dmirror_bounce_init(&bounce, start, size); + if (ret) + return ret; + mutex_lock(&dmirror->mutex); + ret = dmirror_do_read(dmirror, start, end, &bounce); + mutex_unlock(&dmirror->mutex); + if (ret == 0) + ret = copy_to_user((void __user *)cmd->ptr, bounce.ptr, + bounce.size); + cmd->cpages = bounce.cpages; + dmirror_bounce_fini(&bounce); + return ret; + +out: + up_read(&mm->mmap_sem); + mmput(mm); + return ret; +} + +static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range, + unsigned char *perm, uint64_t entry) +{ + struct page *page; + + if (entry == range->values[HMM_PFN_ERROR]) { + *perm = HMM_DMIRROR_PROT_ERROR; + return; + } + page = hmm_device_entry_to_page(range, entry); + if (!page) { + *perm = HMM_DMIRROR_PROT_NONE; + return; + } + if (entry & range->flags[HMM_PFN_DEVICE_PRIVATE]) { + /* Is the page migrated to this device or some other? */ + if (dmirror->mdevice == dmirror_page_to_device(page)) + *perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL; + else + *perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE; + } else if (is_zero_pfn(page_to_pfn(page))) + *perm = HMM_DMIRROR_PROT_ZERO; + else + *perm = HMM_DMIRROR_PROT_NONE; + if (entry & range->flags[HMM_PFN_WRITE]) + *perm |= HMM_DMIRROR_PROT_WRITE; + else + *perm |= HMM_DMIRROR_PROT_READ; +} + +static bool dmirror_snapshot_invalidate(struct mmu_interval_notifier *mni, + const struct mmu_notifier_range *range, + unsigned long cur_seq) +{ + struct dmirror_interval *dmi = + container_of(mni, struct dmirror_interval, notifier); + struct dmirror *dmirror = dmi->dmirror; + + if (mmu_notifier_range_blockable(range)) + mutex_lock(&dmirror->mutex); + else if (!mutex_trylock(&dmirror->mutex)) + return false; + + /* + * Snapshots only need to set the sequence number since the + * invalidations are handled by the dmirror_interval ranges. + */ + mmu_interval_set_seq(mni, cur_seq); + + mutex_unlock(&dmirror->mutex); + return true; +} + +static const struct mmu_interval_notifier_ops dmirror_mrn_ops = { + .invalidate = dmirror_snapshot_invalidate, +}; + +static int dmirror_range_snapshot(struct dmirror *dmirror, + struct hmm_range *range, + unsigned char *perm) +{ + struct mm_struct *mm = dmirror->mm; + struct dmirror_interval notifier; + unsigned long timeout = + jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); + unsigned long i; + unsigned long n; + int ret = 0; + + notifier.dmirror = dmirror; + range->notifier = ¬ifier.notifier; + + ret = mmu_interval_notifier_insert_safe(range->notifier, mm, + range->start, range->end - range->start, + &dmirror_mrn_ops); + if (ret) + return ret; + + while (true) { + long count; + + if (time_after(jiffies, timeout)) { + ret = -EBUSY; + goto out; + } + + range->notifier_seq = mmu_interval_read_begin(range->notifier); + + down_read(&mm->mmap_sem); + count = hmm_range_fault(range, HMM_FAULT_SNAPSHOT); + up_read(&mm->mmap_sem); + if (count <= 0) { + if (count == 0 || count == -EBUSY) + continue; + ret = count; + goto out; + } + + mutex_lock(&dmirror->mutex); + if (mmu_interval_read_retry(range->notifier, + range->notifier_seq)) { + mutex_unlock(&dmirror->mutex); + continue; + } + break; + } + + n = (range->end - range->start) >> PAGE_SHIFT; + for (i = 0; i < n; i++) + dmirror_mkentry(dmirror, range, perm + i, range->pfns[i]); + + mutex_unlock(&dmirror->mutex); +out: + mmu_interval_notifier_remove(range->notifier); + return ret; +} + +static int dmirror_snapshot(struct dmirror *dmirror, + struct hmm_dmirror_cmd *cmd) +{ + struct mm_struct *mm = dmirror->mm; + unsigned long start, end; + unsigned long size = cmd->npages << PAGE_SHIFT; + unsigned long addr; + unsigned long next; + uint64_t pfns[64]; + unsigned char perm[64]; + char __user *uptr; + struct hmm_range range = { + .pfns = pfns, + .flags = dmirror_hmm_flags, + .values = dmirror_hmm_values, + .pfn_shift = DPT_SHIFT, + .pfn_flags_mask = ~0ULL, + }; + int ret = 0; + + start = cmd->addr; + end = start + size; + if (end < start) + return -EINVAL; + + /* Since the mm is for the mirrored process, get a reference first. */ + if (!mmget_not_zero(mm)) + return -EINVAL; + + /* + * Register a temporary notifier to detect invalidations even if it + * overlaps with other mmu_interval_notifiers. + */ + uptr = (void __user *)cmd->ptr; + for (addr = start; addr < end; addr = next) { + unsigned long n; + + next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end); + range.start = addr; + range.end = next; + + ret = dmirror_range_snapshot(dmirror, &range, perm); + if (ret) + break; + + n = (range.end - range.start) >> PAGE_SHIFT; + ret = copy_to_user(uptr, perm, n); + if (ret) + break; + + cmd->cpages += n; + uptr += n; + } + mmput(mm); + + return ret; +} + +static long dmirror_fops_unlocked_ioctl(struct file *filp, + unsigned int command, + unsigned long arg) +{ + void __user *uarg = (void __user *)arg; + struct hmm_dmirror_cmd cmd; + struct dmirror *dmirror; + int ret; + + dmirror = filp->private_data; + if (!dmirror) + return -EINVAL; + + ret = copy_from_user(&cmd, uarg, sizeof(cmd)); + if (ret) + return ret; + + if (cmd.addr & ~PAGE_MASK) + return -EINVAL; + if (cmd.addr >= (cmd.addr + (cmd.npages << PAGE_SHIFT))) + return -EINVAL; + + cmd.cpages = 0; + cmd.faults = 0; + + switch (command) { + case HMM_DMIRROR_READ: + ret = dmirror_read(dmirror, &cmd); + break; + + case HMM_DMIRROR_WRITE: + ret = dmirror_write(dmirror, &cmd); + break; + + case HMM_DMIRROR_MIGRATE: + ret = dmirror_migrate(dmirror, &cmd); + break; + + case HMM_DMIRROR_SNAPSHOT: + ret = dmirror_snapshot(dmirror, &cmd); + break; + + default: + return -EINVAL; + } + if (ret) + return ret; + + return copy_to_user(uarg, &cmd, sizeof(cmd)); +} + +static const struct file_operations dmirror_fops = { + .open = dmirror_fops_open, + .release = dmirror_fops_release, + .unlocked_ioctl = dmirror_fops_unlocked_ioctl, + .llseek = default_llseek, + .owner = THIS_MODULE, +}; + +static void dmirror_devmem_free(struct page *page) +{ + struct page *rpage = page->zone_device_data; + struct dmirror_device *mdevice; + + if (rpage) + __free_page(rpage); + + mdevice = dmirror_page_to_device(page); + + spin_lock(&mdevice->lock); + mdevice->cfree++; + page->zone_device_data = mdevice->free_pages; + mdevice->free_pages = page; + spin_unlock(&mdevice->lock); +} + +static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args, + struct dmirror_device *mdevice) +{ + struct vm_area_struct *vma = args->vma; + const unsigned long *src = args->src; + unsigned long *dst = args->dst; + unsigned long start = args->start; + unsigned long end = args->end; + unsigned long addr; + + for (addr = start; addr < end; addr += PAGE_SIZE, + src++, dst++) { + struct page *dpage, *spage; + + spage = migrate_pfn_to_page(*src); + if (!spage || !(*src & MIGRATE_PFN_MIGRATE)) + continue; + if (!dmirror_device_is_mine(mdevice, spage)) + continue; + spage = spage->zone_device_data; + + dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr); + if (!dpage) + continue; + + lock_page(dpage); + copy_highpage(dpage, spage); + *dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED; + if (*src & MIGRATE_PFN_WRITE) + *dst |= MIGRATE_PFN_WRITE; + } + return 0; +} + +static void dmirror_devmem_fault_finalize_and_map(struct migrate_vma *args, + struct dmirror *dmirror) +{ + /* Invalidate the device's page table mapping. */ + mutex_lock(&dmirror->mutex); + dmirror_do_update(dmirror, args->start, args->end); + mutex_unlock(&dmirror->mutex); +} + +static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf) +{ + struct migrate_vma args; + unsigned long src_pfns; + unsigned long dst_pfns; + struct page *rpage; + struct dmirror *dmirror; + vm_fault_t ret; + + /* FIXME demonstrate how we can adjust migrate range */ + args.vma = vmf->vma; + args.start = vmf->address; + args.end = args.start + PAGE_SIZE; + args.src = &src_pfns; + args.dst = &dst_pfns; + + if (migrate_vma_setup(&args)) + return VM_FAULT_SIGBUS; + + /* + * Normally, a device would use the page->zone_device_data to point to + * the mirror but here we use it to hold the page for the simulated + * device memory and that page holds the pointer to the mirror. + */ + rpage = vmf->page->zone_device_data; + dmirror = rpage->zone_device_data; + + ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror->mdevice); + if (ret) + return ret; + migrate_vma_pages(&args); + dmirror_devmem_fault_finalize_and_map(&args, dmirror); + migrate_vma_finalize(&args); + return 0; +} + +static const struct dev_pagemap_ops dmirror_devmem_ops = { + .page_free = dmirror_devmem_free, + .migrate_to_ram = dmirror_devmem_fault, +}; + +static int dmirror_device_init(struct dmirror_device *mdevice, int id) +{ + dev_t dev; + int ret; + + dev = MKDEV(MAJOR(dmirror_dev), id); + mutex_init(&mdevice->devmem_lock); + spin_lock_init(&mdevice->lock); + + cdev_init(&mdevice->cdevice, &dmirror_fops); + ret = cdev_add(&mdevice->cdevice, dev, 1); + if (ret) + return ret; + + /* Build a list of free ZONE_DEVICE private struct pages */ + dmirror_allocate_chunk(mdevice, NULL); + + return 0; +} + +static void dmirror_device_remove(struct dmirror_device *mdevice) +{ + unsigned int i; + + if (mdevice->devmem_chunks) { + for (i = 0; i < mdevice->devmem_count; i++) { + struct dmirror_chunk *devmem = + mdevice->devmem_chunks[i]; + + memunmap_pages(&devmem->pagemap); + kfree(devmem); + } + kfree(mdevice->devmem_chunks); + } + + cdev_del(&mdevice->cdevice); +} + +static int __init hmm_dmirror_init(void) +{ + int ret; + int id; + + ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES, + "HMM_DMIRROR"); + if (ret) + goto err_unreg; + + for (id = 0; id < DMIRROR_NDEVICES; id++) { + ret = dmirror_device_init(dmirror_devices + id, id); + if (ret) + goto err_chrdev; + } + + /* + * Allocate a zero page to simulate a reserved page of device private + * memory which is always zero. The zero_pfn page isn't used just to + * make the code here simpler (i.e., we need a struct page for it). + */ + dmirror_zero_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO); + if (!dmirror_zero_page) + goto err_chrdev; + + pr_info("HMM test module loaded. This is only for testing HMM.\n"); + return 0; + +err_chrdev: + while (--id >= 0) + dmirror_device_remove(dmirror_devices + id); + unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES); +err_unreg: + return ret; +} + +static void __exit hmm_dmirror_exit(void) +{ + int id; + + if (dmirror_zero_page) + __free_page(dmirror_zero_page); + for (id = 0; id < DMIRROR_NDEVICES; id++) + dmirror_device_remove(dmirror_devices + id); + unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES); +} + +module_init(hmm_dmirror_init); +module_exit(hmm_dmirror_exit); +MODULE_LICENSE("GPL"); diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore index 31b3c98b6d34..3054565b3f07 100644 --- a/tools/testing/selftests/vm/.gitignore +++ b/tools/testing/selftests/vm/.gitignore @@ -14,3 +14,4 @@ virtual_address_range gup_benchmark va_128TBswitch map_fixed_noreplace +hmm-tests diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 7f9a8a8c31da..3fadab99d991 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -7,6 +7,7 @@ CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS) LDLIBS = -lrt TEST_GEN_FILES = compaction_test TEST_GEN_FILES += gup_benchmark +TEST_GEN_FILES += hmm-tests TEST_GEN_FILES += hugepage-mmap TEST_GEN_FILES += hugepage-shm TEST_GEN_FILES += map_hugetlb @@ -31,6 +32,8 @@ TEST_FILES := test_vmalloc.sh KSFT_KHDR_INSTALL := 1 include ../lib.mk
+$(OUTPUT)/hmm-tests: LDLIBS += -lhugetlbfs -lpthread + $(OUTPUT)/userfaultfd: LDLIBS += -lpthread
$(OUTPUT)/mlock-random-test: LDLIBS += -lcap diff --git a/tools/testing/selftests/vm/config b/tools/testing/selftests/vm/config index 93b90a9b1eeb..f6d0adad739f 100644 --- a/tools/testing/selftests/vm/config +++ b/tools/testing/selftests/vm/config @@ -1,3 +1,5 @@ CONFIG_SYSVIPC=y CONFIG_USERFAULTFD=y CONFIG_TEST_VMALLOC=m +CONFIG_HMM_MIRROR=y +CONFIG_DEVICE_PRIVATE=y diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c new file mode 100644 index 000000000000..23bf2e956f1d --- /dev/null +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -0,0 +1,1354 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * HMM stands for Heterogeneous Memory Management, it is a helper layer inside + * the linux kernel to help device drivers mirror a process address space in + * the device. This allows the device to use the same address space which + * makes communication and data exchange a lot easier. + * + * This framework's sole purpose is to exercise various code paths inside + * the kernel to make sure that HMM performs as expected and to flush out any + * bugs. + */ + +#include "../kselftest_harness.h" + +#include <errno.h> +#include <fcntl.h> +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <unistd.h> +#include <strings.h> +#include <time.h> +#include <pthread.h> +#include <hugetlbfs.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/mman.h> +#include <sys/ioctl.h> +#include <linux/test_hmm.h> + +struct hmm_buffer { + void *ptr; + void *mirror; + unsigned long size; + int fd; + uint64_t cpages; + uint64_t faults; +}; + +#define TWOMEG (1 << 21) +#define HMM_BUFFER_SIZE (1024 << 12) +#define HMM_PATH_MAX 64 +#define NTIMES 256 + +#define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1))) + +FIXTURE(hmm) +{ + int fd; + unsigned int page_size; + unsigned int page_shift; +}; + +FIXTURE(hmm2) +{ + int fd0; + int fd1; + unsigned int page_size; + unsigned int page_shift; +}; + +static int hmm_open(int unit) +{ + char pathname[HMM_PATH_MAX]; + int fd; + + snprintf(pathname, sizeof(pathname), "/dev/hmm_dmirror%d", unit); + fd = open(pathname, O_RDWR, 0); + if (fd < 0) + fprintf(stderr, "could not open hmm dmirror driver (%s)\n", + pathname); + return fd; +} + +FIXTURE_SETUP(hmm) +{ + self->page_size = sysconf(_SC_PAGE_SIZE); + self->page_shift = ffs(self->page_size) - 1; + + self->fd = hmm_open(0); + ASSERT_GE(self->fd, 0); +} + +FIXTURE_SETUP(hmm2) +{ + self->page_size = sysconf(_SC_PAGE_SIZE); + self->page_shift = ffs(self->page_size) - 1; + + self->fd0 = hmm_open(0); + ASSERT_GE(self->fd0, 0); + self->fd1 = hmm_open(1); + ASSERT_GE(self->fd1, 0); +} + +FIXTURE_TEARDOWN(hmm) +{ + int ret = close(self->fd); + + ASSERT_EQ(ret, 0); + self->fd = -1; +} + +FIXTURE_TEARDOWN(hmm2) +{ + int ret = close(self->fd0); + + ASSERT_EQ(ret, 0); + self->fd0 = -1; + + ret = close(self->fd1); + ASSERT_EQ(ret, 0); + self->fd1 = -1; +} + +static int hmm_dmirror_cmd(int fd, + unsigned long request, + struct hmm_buffer *buffer, + unsigned long npages) +{ + struct hmm_dmirror_cmd cmd; + int ret; + + /* Simulate a device reading system memory. */ + cmd.addr = (__u64)buffer->ptr; + cmd.ptr = (__u64)buffer->mirror; + cmd.npages = npages; + + for (;;) { + ret = ioctl(fd, request, &cmd); + if (ret == 0) + break; + if (errno == EINTR) + continue; + return -errno; + } + buffer->cpages = cmd.cpages; + buffer->faults = cmd.faults; + + return 0; +} + +static void hmm_buffer_free(struct hmm_buffer *buffer) +{ + if (buffer == NULL) + return; + + if (buffer->ptr) + munmap(buffer->ptr, buffer->size); + free(buffer->mirror); + free(buffer); +} + +/* + * Create a temporary file that will be deleted on close. + */ +static int hmm_create_file(unsigned long size) +{ + char path[HMM_PATH_MAX]; + int fd; + + strcpy(path, "/tmp"); + fd = open(path, O_TMPFILE | O_EXCL | O_RDWR, 0600); + if (fd >= 0) { + int r; + + do { + r = ftruncate(fd, size); + } while (r == -1 && errno == EINTR); + if (!r) + return fd; + close(fd); + } + return -1; +} + +/* + * Return a random unsigned number. + */ +static unsigned int hmm_random(void) +{ + static int fd = -1; + unsigned int r; + + if (fd < 0) { + fd = open("/dev/urandom", O_RDONLY); + if (fd < 0) { + fprintf(stderr, "%s:%d failed to open /dev/urandom\n", + __FILE__, __LINE__); + return ~0U; + } + } + read(fd, &r, sizeof(r)); + return r; +} + +static void hmm_nanosleep(unsigned int n) +{ + struct timespec t; + + t.tv_sec = 0; + t.tv_nsec = n; + nanosleep(&t, NULL); +} + +/* + * Simple NULL test of device open/close. + */ +TEST_F(hmm, open_close) +{ +} + +/* + * Read private anonymous memory. + */ +TEST_F(hmm, anon_read) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + int val; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* + * Initialize buffer in system memory but leave the first two pages + * zero (pte_none and pfn_zero). + */ + i = 2 * self->page_size / sizeof(*ptr); + for (ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Set buffer permission to read-only. */ + ret = mprotect(buffer->ptr, size, PROT_READ); + ASSERT_EQ(ret, 0); + + /* Populate the CPU page table with a special zero page. */ + val = *(int *)(buffer->ptr + self->page_size); + ASSERT_EQ(val, 0); + + /* Simulate a device reading system memory. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device read. */ + ptr = buffer->mirror; + for (i = 0; i < 2 * self->page_size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], 0); + for (; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + hmm_buffer_free(buffer); +} + +/* + * Read private anonymous memory which has been protected with + * mprotect() PROT_NONE. + */ +TEST_F(hmm, anon_read_prot) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Initialize mirror buffer so we can verify it isn't written. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ptr[i] = -i; + + /* Protect buffer from reading. */ + ret = mprotect(buffer->ptr, size, PROT_NONE); + ASSERT_EQ(ret, 0); + + /* Simulate a device reading system memory. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages); + ASSERT_EQ(ret, -EFAULT); + + /* Allow CPU to read the buffer so we can check it. */ + ret = mprotect(buffer->ptr, size, PROT_READ); + ASSERT_EQ(ret, 0); + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], -i); + + hmm_buffer_free(buffer); +} + +/* + * Write private anonymous memory. + */ +TEST_F(hmm, anon_write) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize data that the device will write to buffer->ptr. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Simulate a device writing system memory. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device wrote. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + hmm_buffer_free(buffer); +} + +/* + * Write private anonymous memory which has been protected with + * mprotect() PROT_READ. + */ +TEST_F(hmm, anon_write_prot) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Simulate a device reading a zero page of memory. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, 1); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, 1); + ASSERT_EQ(buffer->faults, 1); + + /* Initialize data that the device will write to buffer->ptr. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Simulate a device writing system memory. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages); + ASSERT_EQ(ret, -EPERM); + + /* Check what the device wrote. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], 0); + + /* Now allow writing and see that the zero page is replaced. */ + ret = mprotect(buffer->ptr, size, PROT_WRITE | PROT_READ); + ASSERT_EQ(ret, 0); + + /* Simulate a device writing system memory. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device wrote. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + hmm_buffer_free(buffer); +} + +/* + * Check that a device writing an anonymous private mapping + * will copy-on-write if a child process inherits the mapping. + */ +TEST_F(hmm, anon_write_child) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + pid_t pid; + int child_fd; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer->ptr so we can tell if it is written. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Initialize data that the device will write to buffer->ptr. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ptr[i] = -i; + + pid = fork(); + if (pid == -1) + ASSERT_EQ(pid, 0); + if (pid != 0) { + waitpid(pid, &ret, 0); + ASSERT_EQ(WIFEXITED(ret), 1); + + /* Check that the parent's buffer did not change. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + return; + } + + /* Check that we see the parent's values. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], -i); + + /* The child process needs its own mirror to its own mm. */ + child_fd = hmm_open(0); + ASSERT_GE(child_fd, 0); + + /* Simulate a device writing system memory. */ + ret = hmm_dmirror_cmd(child_fd, HMM_DMIRROR_WRITE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device wrote. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], -i); + + close(child_fd); + exit(0); +} + +/* + * Check that a device writing an anonymous shared mapping + * will not copy-on-write if a child process inherits the mapping. + */ +TEST_F(hmm, anon_write_child_shared) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + pid_t pid; + int child_fd; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer->ptr so we can tell if it is written. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Initialize data that the device will write to buffer->ptr. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ptr[i] = -i; + + pid = fork(); + if (pid == -1) + ASSERT_EQ(pid, 0); + if (pid != 0) { + waitpid(pid, &ret, 0); + ASSERT_EQ(WIFEXITED(ret), 1); + + /* Check that the parent's buffer did change. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], -i); + return; + } + + /* Check that we see the parent's values. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], -i); + + /* The child process needs its own mirror to its own mm. */ + child_fd = hmm_open(0); + ASSERT_GE(child_fd, 0); + + /* Simulate a device writing system memory. */ + ret = hmm_dmirror_cmd(child_fd, HMM_DMIRROR_WRITE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device wrote. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], -i); + + close(child_fd); + exit(0); +} + +/* + * Write private anonymous huge page. + */ +TEST_F(hmm, anon_write_huge) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + void *old_ptr; + void *map; + int *ptr; + int ret; + + size = 2 * TWOMEG; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + size = TWOMEG; + npages = size >> self->page_shift; + map = (void *)ALIGN((uintptr_t)buffer->ptr, size); + ret = madvise(map, size, MADV_HUGEPAGE); + ASSERT_EQ(ret, 0); + old_ptr = buffer->ptr; + buffer->ptr = map; + + /* Initialize data that the device will write to buffer->ptr. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Simulate a device writing system memory. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device wrote. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + buffer->ptr = old_ptr; + hmm_buffer_free(buffer); +} + +/* + * Write huge TLBFS page. + */ +TEST_F(hmm, anon_write_hugetlbfs) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + long pagesizes[4]; + int n, idx; + + /* Skip test if we can't allocate a hugetlbfs page. */ + + n = gethugepagesizes(pagesizes, 4); + if (n <= 0) + return; + for (idx = 0; --n > 0; ) { + if (pagesizes[n] < pagesizes[idx]) + idx = n; + } + size = ALIGN(TWOMEG, pagesizes[idx]); + npages = size >> self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->ptr = get_hugepage_region(size, GHR_STRICT); + if (buffer->ptr == NULL) { + free(buffer); + return; + } + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + /* Initialize data that the device will write to buffer->ptr. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Simulate a device writing system memory. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device wrote. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + free_hugepage_region(buffer->ptr); + buffer->ptr = NULL; + hmm_buffer_free(buffer); +} + +/* + * Read mmap'ed file memory. + */ +TEST_F(hmm, file_read) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + int fd; + ssize_t len; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + fd = hmm_create_file(size); + ASSERT_GE(fd, 0); + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = fd; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + /* Write initial contents of the file. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + len = pwrite(fd, buffer->mirror, size, 0); + ASSERT_EQ(len, size); + memset(buffer->mirror, 0, size); + + buffer->ptr = mmap(NULL, size, + PROT_READ, + MAP_SHARED, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Simulate a device reading system memory. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + hmm_buffer_free(buffer); +} + +/* + * Write mmap'ed file memory. + */ +TEST_F(hmm, file_write) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + int fd; + ssize_t len; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + fd = hmm_create_file(size); + ASSERT_GE(fd, 0); + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = fd; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize data that the device will write to buffer->ptr. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Simulate a device writing system memory. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device wrote. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + /* Check that the device also wrote the file. */ + len = pread(fd, buffer->mirror, size, 0); + ASSERT_EQ(len, size); + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + hmm_buffer_free(buffer); +} + +/* + * Migrate anonymous memory to device private memory. + */ +TEST_F(hmm, migrate) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Migrate memory to device. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + hmm_buffer_free(buffer); +} + +/* + * Migrate anonymous memory to device private memory and fault it back to system + * memory. + */ +TEST_F(hmm, migrate_fault) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Migrate memory to device. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + /* Fault pages back to system memory and check them. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + hmm_buffer_free(buffer); +} + +/* + * Try to migrate various memory types to device private memory. + */ +TEST_F(hmm2, migrate_mixed) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + int *ptr; + unsigned char *p; + int ret; + int val; + + npages = 6; + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + /* Reserve a range of addresses. */ + buffer->ptr = mmap(NULL, size, + PROT_NONE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + p = buffer->ptr; + + /* Now try to migrate everything to device 1. */ + ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, 6); + + /* Punch a hole after the first page address. */ + ret = munmap(buffer->ptr + self->page_size, self->page_size); + ASSERT_EQ(ret, 0); + + /* We expect an error if the vma doesn't cover the range. */ + ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3); + ASSERT_EQ(ret, -EINVAL); + + /* Page 2 will be a read-only zero page. */ + ret = mprotect(buffer->ptr + 2 * self->page_size, self->page_size, + PROT_READ); + ASSERT_EQ(ret, 0); + ptr = (int *)(buffer->ptr + 2 * self->page_size); + val = *ptr + 3; + ASSERT_EQ(val, 3); + + /* Page 3 will be read-only. */ + ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size, + PROT_READ | PROT_WRITE); + ASSERT_EQ(ret, 0); + ptr = (int *)(buffer->ptr + 3 * self->page_size); + *ptr = val; + ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size, + PROT_READ); + ASSERT_EQ(ret, 0); + + /* Page 4 will be read-write. */ + ret = mprotect(buffer->ptr + 4 * self->page_size, self->page_size, + PROT_READ | PROT_WRITE); + ASSERT_EQ(ret, 0); + ptr = (int *)(buffer->ptr + 4 * self->page_size); + *ptr = val; + + /* Page 5 won't be migrated to device 0 because it's on device 1. */ + buffer->ptr = p + 5 * self->page_size; + ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1); + ASSERT_EQ(ret, -ENOENT); + buffer->ptr = p; + + /* Now try to migrate pages 2-3 to device 1. */ + buffer->ptr = p + 2 * self->page_size; + ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 2); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, 2); + buffer->ptr = p; + + hmm_buffer_free(buffer); +} + +/* + * Migrate anonymous memory to device private memory and fault it back to system + * memory multiple times. + */ +TEST_F(hmm, migrate_multiple) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + unsigned long c; + int *ptr; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + for (c = 0; c < NTIMES; c++) { + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Migrate memory to device. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, + npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + /* Fault pages back to system memory and check them. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + hmm_buffer_free(buffer); + } +} + +/* + * Read anonymous memory multiple times. + */ +TEST_F(hmm, anon_read_multiple) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + unsigned long c; + int *ptr; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + for (c = 0; c < NTIMES; c++) { + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i + c; + + /* Simulate a device reading system memory. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, + npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i + c); + + hmm_buffer_free(buffer); + } +} + +void *unmap_buffer(void *p) +{ + struct hmm_buffer *buffer = p; + + /* Delay for a bit and then unmap buffer while it is being read. */ + hmm_nanosleep(hmm_random() % 32000); + munmap(buffer->ptr + buffer->size / 2, buffer->size / 2); + buffer->ptr = NULL; + + return NULL; +} + +/* + * Try reading anonymous memory while it is being unmapped. + */ +TEST_F(hmm, anon_teardown) +{ + unsigned long npages; + unsigned long size; + unsigned long c; + void *ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + for (c = 0; c < NTIMES; ++c) { + pthread_t thread; + struct hmm_buffer *buffer; + unsigned long i; + int *ptr; + int rc; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i + c; + + rc = pthread_create(&thread, NULL, unmap_buffer, buffer); + ASSERT_EQ(rc, 0); + + /* Simulate a device reading system memory. */ + rc = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, + npages); + if (rc == 0) { + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; + i < size / sizeof(*ptr); + ++i) + ASSERT_EQ(ptr[i], i + c); + } + + pthread_join(thread, &ret); + hmm_buffer_free(buffer); + } +} + +/* + * Test memory snapshot without faulting in pages accessed by the device. + */ +TEST_F(hmm2, snapshot) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + int *ptr; + unsigned char *p; + unsigned char *m; + int ret; + int val; + + npages = 7; + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(npages); + ASSERT_NE(buffer->mirror, NULL); + + /* Reserve a range of addresses. */ + buffer->ptr = mmap(NULL, size, + PROT_NONE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + p = buffer->ptr; + + /* Punch a hole after the first page address. */ + ret = munmap(buffer->ptr + self->page_size, self->page_size); + ASSERT_EQ(ret, 0); + + /* Page 2 will be read-only zero page. */ + ret = mprotect(buffer->ptr + 2 * self->page_size, self->page_size, + PROT_READ); + ASSERT_EQ(ret, 0); + ptr = (int *)(buffer->ptr + 2 * self->page_size); + val = *ptr + 3; + ASSERT_EQ(val, 3); + + /* Page 3 will be read-only. */ + ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size, + PROT_READ | PROT_WRITE); + ASSERT_EQ(ret, 0); + ptr = (int *)(buffer->ptr + 3 * self->page_size); + *ptr = val; + ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size, + PROT_READ); + ASSERT_EQ(ret, 0); + + /* Page 4-6 will be read-write. */ + ret = mprotect(buffer->ptr + 4 * self->page_size, 3 * self->page_size, + PROT_READ | PROT_WRITE); + ASSERT_EQ(ret, 0); + ptr = (int *)(buffer->ptr + 4 * self->page_size); + *ptr = val; + + /* Page 5 will be migrated to device 0. */ + buffer->ptr = p + 5 * self->page_size; + ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, 1); + + /* Page 6 will be migrated to device 1. */ + buffer->ptr = p + 6 * self->page_size; + ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 1); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, 1); + + /* Simulate a device snapshotting CPU pagetables. */ + buffer->ptr = p; + ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_SNAPSHOT, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + + /* Check what the device saw. */ + m = buffer->mirror; + ASSERT_EQ(m[0], HMM_DMIRROR_PROT_NONE); + ASSERT_EQ(m[1], HMM_DMIRROR_PROT_NONE); + ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ); + ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ); + ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE); + ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL | + HMM_DMIRROR_PROT_WRITE); + ASSERT_EQ(m[6], HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE | + HMM_DMIRROR_PROT_WRITE); + + hmm_buffer_free(buffer); +} + +/* + * Test two devices reading the same memory (double mapped). + */ +TEST_F(hmm2, double_map) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + + npages = 6; + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(npages); + ASSERT_NE(buffer->mirror, NULL); + + /* Reserve a range of addresses. */ + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Make region read-only. */ + ret = mprotect(buffer->ptr, size, PROT_READ); + ASSERT_EQ(ret, 0); + + /* Simulate device 0 reading system memory. */ + ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_READ, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + /* Simulate device 1 reading system memory. */ + ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_READ, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + ASSERT_EQ(buffer->faults, 1); + + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + /* Punch a hole after the first page address. */ + ret = munmap(buffer->ptr + self->page_size, self->page_size); + ASSERT_EQ(ret, 0); + + hmm_buffer_free(buffer); +} + +TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests index a692ea828317..ea3728570585 100755 --- a/tools/testing/selftests/vm/run_vmtests +++ b/tools/testing/selftests/vm/run_vmtests @@ -237,4 +237,20 @@ else exitcode=1 fi
+echo "------------------------------------" +echo "running HMM smoke test" +echo "------------------------------------" +./test_hmm.sh smoke +ret_val=$? + +if [ $ret_val -eq 0 ]; then + echo "[PASS]" +elif [ $ret_val -eq $ksft_skip ]; then + echo "[SKIP]" + exitcode=$ksft_skip +else + echo "[FAIL]" + exitcode=1 +fi + exit $exitcode diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh new file mode 100755 index 000000000000..461e4a99a362 --- /dev/null +++ b/tools/testing/selftests/vm/test_hmm.sh @@ -0,0 +1,97 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Copyright (C) 2018 Uladzislau Rezki (Sony) urezki@gmail.com +# +# This is a test script for the kernel test driver to analyse vmalloc +# allocator. Therefore it is just a kernel module loader. You can specify +# and pass different parameters in order to: +# a) analyse performance of vmalloc allocations; +# b) stressing and stability check of vmalloc subsystem. + +TEST_NAME="test_hmm" +DRIVER="test_hmm" + +# 1 if fails +exitcode=1 + +# Kselftest framework requirement - SKIP code is 4. +ksft_skip=4 + +check_test_requirements() +{ + uid=$(id -u) + if [ $uid -ne 0 ]; then + echo "$0: Must be run as root" + exit $ksft_skip + fi + + if ! which modprobe > /dev/null 2>&1; then + echo "$0: You need modprobe installed" + exit $ksft_skip + fi + + if ! modinfo $DRIVER > /dev/null 2>&1; then + echo "$0: You must have the following enabled in your kernel:" + echo "CONFIG_TEST_HMM=m" + exit $ksft_skip + fi +} + +load_driver() +{ + modprobe $DRIVER > /dev/null 2>&1 + if [ $? == 0 ]; then + major=$(awk "$2=="HMM_DMIRROR" {print $1}" /proc/devices) + mknod /dev/hmm_dmirror0 c $major 0 + mknod /dev/hmm_dmirror1 c $major 1 + fi +} + +unload_driver() +{ + modprobe -r $DRIVER > /dev/null 2>&1 + rm -f /dev/hmm_dmirror? +} + +run_smoke() +{ + echo "Running smoke test. Note, this test provides basic coverage." + + load_driver + ./hmm-tests + unload_driver +} + +usage() +{ + echo -n "Usage: $0" + echo + echo "Example usage:" + echo + echo "# Shows help message" + echo "./${TEST_NAME}.sh" + echo + echo "# Smoke testing" + echo "./${TEST_NAME}.sh smoke" + echo + exit 0 +} + +function run_test() +{ + if [ $# -eq 0 ]; then + usage + else + if [ "$1" = "smoke" ]; then + run_smoke + else + usage + fi + fi +} + +check_test_requirements +run_test $@ + +exit 0
On Mon, Jan 13, 2020 at 02:47:01PM -0800, Ralph Campbell wrote:
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 47ad9cc89aab..4efecc0f13cb 100644 +++ b/mm/mmu_notifier.c @@ -1171,6 +1171,39 @@ void mmu_interval_notifier_update(struct mmu_interval_notifier *mni, } EXPORT_SYMBOL_GPL(mmu_interval_notifier_update); +struct mmu_interval_notifier *mmu_interval_notifier_find(struct mm_struct *mm,
const struct mmu_interval_notifier_ops *ops,
unsigned long start, unsigned long last)
+{
- struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
- struct interval_tree_node *node;
- struct mmu_interval_notifier *mni;
- struct mmu_interval_notifier *res = NULL;
- spin_lock(&mmn_mm->lock);
- node = interval_tree_iter_first(&mmn_mm->itree, start, last);
- if (node) {
mni = container_of(node, struct mmu_interval_notifier,
interval_tree);
while (true) {
if (mni->ops == ops) {
res = mni;
break;
}
node = interval_tree_iter_next(&mni->interval_tree,
start, last);
if (!node)
break;
mni = container_of(node, struct mmu_interval_notifier,
interval_tree);
}
- }
- spin_unlock(&mmn_mm->lock);
This doesn't seem safe at all, here we are returning a pointer to memory from the interval tree with out any kind of lifetime protection.
If the interval tree is read it must be left in the read lock state until the caller is done with the pointer.
.. and this poses all sorts of questions about consistency with items on the deferred list. Should find return an item undergoing deletion?
Should find return items using the old interval tree values or their new updated values?
Jason
On Mon, Jan 13, 2020 at 02:47:02PM -0800, Ralph Campbell wrote:
void nouveau_svmm_fini(struct nouveau_svmm **psvmm) { struct nouveau_svmm *svmm = *psvmm;
- struct mmu_interval_notifier *mni;
- if (svmm) { mutex_lock(&svmm->mutex);
while (true) {
mni = mmu_interval_notifier_find(svmm->mm,
&nouveau_svm_mni_ops, 0UL, ~0UL);
if (!mni)
break;
mmu_interval_notifier_put(mni);
Oh, now I really don't like the name 'put'. It looks like mni is refcounted here, and it isn't. put should be called 'remove_deferred'
And then you also need a way to barrier this scheme on driver unload.
svmm->vmm = NULL; mutex_unlock(&svmm->mutex);}
mmu_notifier_put(&svmm->notifier);
While here it was actually a refcount.
+static void nouveau_svmm_do_unmap(struct mmu_interval_notifier *mni,
const struct mmu_notifier_range *range)
+{
- struct svmm_interval *smi =
container_of(mni, struct svmm_interval, notifier);
- struct nouveau_svmm *svmm = smi->svmm;
- unsigned long start = mmu_interval_notifier_start(mni);
- unsigned long last = mmu_interval_notifier_last(mni);
This whole algorithm only works if it is protected by the read side of the interval tree lock. Deserves at least a comment if not an assertion too.
static int nouveau_range_fault(struct nouveau_svmm *svmm, struct nouveau_drm *drm, void *data, u32 size,
u64 *pfns, struct svm_notifier *notifier)
u64 *pfns, u64 start, u64 end)
{ unsigned long timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); /* Have HMM fault pages within the fault window to the GPU. */ struct hmm_range range = {
.notifier = ¬ifier->notifier,
.start = notifier->notifier.interval_tree.start,
.end = notifier->notifier.interval_tree.last + 1,
.start = start,
.pfns = pfns, .flags = nouveau_svm_pfn_flags, .values = nouveau_svm_pfn_values,.end = end,
.default_flags = 0,
.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT, };.pfn_flags_mask = ~0UL,
- struct mm_struct *mm = notifier->notifier.mm;
- struct mm_struct *mm = svmm->mm; long ret;
while (true) { if (time_after(jiffies, timeout)) return -EBUSY;
range.notifier_seq = mmu_interval_read_begin(range.notifier);
range.default_flags = 0;
down_read(&mm->mmap_sem);range.pfn_flags_mask = -1UL;
mmap sem doesn't have to be held for the interval search, and again we have lifetime issues with the membership here.
ret = nouveau_svmm_interval_find(svmm, &range);
if (ret) {
up_read(&mm->mmap_sem);
return ret;
}
ret = hmm_range_fault(&range, 0); up_read(&mm->mmap_sem); if (ret <= 0) {range.notifier_seq = mmu_interval_read_begin(range.notifier);
I'm still not sure this is a better approach than what ODP does. It looks very expensive on the fault path..
Jason
On 1/14/20 4:49 AM, Jason Gunthorpe wrote:
On Mon, Jan 13, 2020 at 02:47:01PM -0800, Ralph Campbell wrote:
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 47ad9cc89aab..4efecc0f13cb 100644 +++ b/mm/mmu_notifier.c @@ -1171,6 +1171,39 @@ void mmu_interval_notifier_update(struct mmu_interval_notifier *mni, } EXPORT_SYMBOL_GPL(mmu_interval_notifier_update); +struct mmu_interval_notifier *mmu_interval_notifier_find(struct mm_struct *mm,
const struct mmu_interval_notifier_ops *ops,
unsigned long start, unsigned long last)
+{
- struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
- struct interval_tree_node *node;
- struct mmu_interval_notifier *mni;
- struct mmu_interval_notifier *res = NULL;
- spin_lock(&mmn_mm->lock);
- node = interval_tree_iter_first(&mmn_mm->itree, start, last);
- if (node) {
mni = container_of(node, struct mmu_interval_notifier,
interval_tree);
while (true) {
if (mni->ops == ops) {
res = mni;
break;
}
node = interval_tree_iter_next(&mni->interval_tree,
start, last);
if (!node)
break;
mni = container_of(node, struct mmu_interval_notifier,
interval_tree);
}
- }
- spin_unlock(&mmn_mm->lock);
This doesn't seem safe at all, here we are returning a pointer to memory from the interval tree with out any kind of lifetime protection.
It is memory that the driver has allocated and has full control over the lifetime since the driver does all the insertions and removals. The driver does have to hold the HW page table lock so lookups are synchronized with interval insertions and removals and page table entry insertions and removals.
If the interval tree is read it must be left in the read lock state until the caller is done with the pointer.
.. and this poses all sorts of questions about consistency with items on the deferred list. Should find return an item undergoing deletion?
I don't think so. The deferred operations are all complete when mmu_interval_read_begin() returns, and the sequence number check with mmu_interval_read_retry() guarantees there have been no changes while not holding the driver page table lock and calling hmm_range_fault().
Should find return items using the old interval tree values or their new updated values?
Jason
I think it should only look at the old interval tree and not the deferred insert/remove/updates as explained above.
On 1/14/20 5:00 AM, Jason Gunthorpe wrote:
On Mon, Jan 13, 2020 at 02:47:02PM -0800, Ralph Campbell wrote:
void nouveau_svmm_fini(struct nouveau_svmm **psvmm) { struct nouveau_svmm *svmm = *psvmm;
- struct mmu_interval_notifier *mni;
- if (svmm) { mutex_lock(&svmm->mutex);
while (true) {
mni = mmu_interval_notifier_find(svmm->mm,
&nouveau_svm_mni_ops, 0UL, ~0UL);
if (!mni)
break;
mmu_interval_notifier_put(mni);
Oh, now I really don't like the name 'put'. It looks like mni is refcounted here, and it isn't. put should be called 'remove_deferred'
OK.
And then you also need a way to barrier this scheme on driver unload.
Good point. I can add something like void mmu_interval_notifier_synchronize(struct mm_struct *mm) that waits for deferred operations to complete similar to mmu_interval_read_begin().
svmm->vmm = NULL; mutex_unlock(&svmm->mutex);}
mmu_notifier_put(&svmm->notifier);
While here it was actually a refcount.
+static void nouveau_svmm_do_unmap(struct mmu_interval_notifier *mni,
const struct mmu_notifier_range *range)
+{
- struct svmm_interval *smi =
container_of(mni, struct svmm_interval, notifier);
- struct nouveau_svmm *svmm = smi->svmm;
- unsigned long start = mmu_interval_notifier_start(mni);
- unsigned long last = mmu_interval_notifier_last(mni);
This whole algorithm only works if it is protected by the read side of the interval tree lock. Deserves at least a comment if not an assertion too.
This is called from the invalidate() callback and while holding the driver page table lock so the struct mmu_interval_notifier and the interval tree can't change. I will add comments for v7.
static int nouveau_range_fault(struct nouveau_svmm *svmm, struct nouveau_drm *drm, void *data, u32 size,
u64 *pfns, struct svm_notifier *notifier)
{ unsigned long timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); /* Have HMM fault pages within the fault window to the GPU. */ struct hmm_range range = {u64 *pfns, u64 start, u64 end)
.notifier = ¬ifier->notifier,
.start = notifier->notifier.interval_tree.start,
.end = notifier->notifier.interval_tree.last + 1,
.start = start,
.pfns = pfns, .flags = nouveau_svm_pfn_flags, .values = nouveau_svm_pfn_values,.end = end,
.default_flags = 0,
.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT, };.pfn_flags_mask = ~0UL,
- struct mm_struct *mm = notifier->notifier.mm;
- struct mm_struct *mm = svmm->mm; long ret;
while (true) { if (time_after(jiffies, timeout)) return -EBUSY;
range.notifier_seq = mmu_interval_read_begin(range.notifier);
range.default_flags = 0;
down_read(&mm->mmap_sem);range.pfn_flags_mask = -1UL;
mmap sem doesn't have to be held for the interval search, and again we have lifetime issues with the membership here.
I agree mmap_sem isn't needed for the interval search, it is needed if the search doesn't find a registered interval and one needs to be created to cover the underlying VMA. If an arbitrary size interval was created instead, then mmap_sem wouldn't be needed. I don't understand the lifetime/membership issue. The driver is the only thing that allocates, inserts, or removes struct mmu_interval_notifier and thus completely controls the lifetime.
ret = nouveau_svmm_interval_find(svmm, &range);
if (ret) {
up_read(&mm->mmap_sem);
return ret;
}
ret = hmm_range_fault(&range, 0); up_read(&mm->mmap_sem); if (ret <= 0) {range.notifier_seq = mmu_interval_read_begin(range.notifier);
I'm still not sure this is a better approach than what ODP does. It looks very expensive on the fault path..
Jason
ODP doesn't have this problem because users have to call ib_reg_mr() before any I/O can happen to the process address space. That is when mmu_interval_notifier_insert() / mmu_interval_notifier_remove() can be called and the driver doesn't have to worry about the interval changing sizes or being removed while I/O is happening. For GPU like devices, I'm trying to allow hardware access to any user level address without pre-registering it. That means inserting mmu interval notifiers for the ranges the GPU page faults on and updating the intervals as munmap() calls remove parts of the address space. I don't want to register an interval per page so the logical range is the underlying VMA.
It isn't that expensive, there is an extra driver lock/unlock as part of the lookup and possibly a find_vma() and kmalloc(GFP_ATOMIC) for new intervals. Also, the deferred interval updates for munmap(). Compared to the cost of updating PTEs in the device and GPU fault handling, this is minimal overhead.
The name is weird, normally we call functions like this _locked.
Additionally it will need a lockdep_assert_held to ensure the caller actually holds the lock.
On Wed, Jan 15, 2020 at 02:05:24PM -0800, Ralph Campbell wrote:
On 1/14/20 4:49 AM, Jason Gunthorpe wrote:
On Mon, Jan 13, 2020 at 02:47:01PM -0800, Ralph Campbell wrote:
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 47ad9cc89aab..4efecc0f13cb 100644 +++ b/mm/mmu_notifier.c @@ -1171,6 +1171,39 @@ void mmu_interval_notifier_update(struct mmu_interval_notifier *mni, } EXPORT_SYMBOL_GPL(mmu_interval_notifier_update); +struct mmu_interval_notifier *mmu_interval_notifier_find(struct mm_struct *mm,
const struct mmu_interval_notifier_ops *ops,
unsigned long start, unsigned long last)
+{
- struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
- struct interval_tree_node *node;
- struct mmu_interval_notifier *mni;
- struct mmu_interval_notifier *res = NULL;
- spin_lock(&mmn_mm->lock);
- node = interval_tree_iter_first(&mmn_mm->itree, start, last);
- if (node) {
mni = container_of(node, struct mmu_interval_notifier,
interval_tree);
while (true) {
if (mni->ops == ops) {
res = mni;
break;
}
node = interval_tree_iter_next(&mni->interval_tree,
start, last);
if (!node)
break;
mni = container_of(node, struct mmu_interval_notifier,
interval_tree);
}
- }
- spin_unlock(&mmn_mm->lock);
This doesn't seem safe at all, here we are returning a pointer to memory from the interval tree with out any kind of lifetime protection.
It is memory that the driver has allocated and has full control over the lifetime since the driver does all the insertions and removals. The driver does have to hold the HW page table lock so lookups are synchronized with interval insertions and removals and page table entry insertions and removals.
No.. the ->release is async, so having the driver hold a lock around all the mmu_interval_ APIS still doesn't make it safe. The element could be on the defered list and it could become freed at any moment.
If the interval tree is read it must be left in the read lock state until the caller is done with the pointer.
.. and this poses all sorts of questions about consistency with items on the deferred list. Should find return an item undergoing deletion?
I don't think so. The deferred operations are all complete when mmu_interval_read_begin() returns, and the sequence number check with mmu_interval_read_retry() guarantees there have been no changes while not holding the driver page table lock and calling hmm_range_fault().
It seems very dangerous to say, on one hand, that the driver is serialized because it holds a lock around all mmu_interval_* calls, while on the other saying that on rare edge cases find does not return a result that matches the serial-program-order sequence.
This seems like a way to create bugs.
For instance, if find is consistent with the defered list then it will not return any element that has a pending deletion and the above issue with lifetime wouldn't happen.
However, I'm still not sure that providing an API tha requires the driver to provide tricky locking is the best idea. This basically says that if a driver uses find then every single other call to mmu_interval_* must be serialized with a single lock.
Jason
On Wed, Jan 15, 2020 at 02:09:47PM -0800, Ralph Campbell wrote:
I don't understand the lifetime/membership issue. The driver is the only thing that allocates, inserts, or removes struct mmu_interval_notifier and thus completely controls the lifetime.
If the returned value is on the defered list it could be freed at any moment. The existing locks do not prevent it.
ret = nouveau_svmm_interval_find(svmm, &range);
if (ret) {
up_read(&mm->mmap_sem);
return ret;
}
ret = hmm_range_fault(&range, 0); up_read(&mm->mmap_sem); if (ret <= 0) {range.notifier_seq = mmu_interval_read_begin(range.notifier);
I'm still not sure this is a better approach than what ODP does. It looks very expensive on the fault path..
Jason
ODP doesn't have this problem because users have to call ib_reg_mr() before any I/O can happen to the process address space.
ODP supports a single 'full VA' call at process startup, just like these cases.
That is when mmu_interval_notifier_insert() / mmu_interval_notifier_remove() can be called and the driver doesn't have to worry about the interval changing sizes or being removed while I/O is happening.
No, for the 'ODP full process VA' (aka implicit ODP) mode it dynamically maintains a list of intervals. ODP chooses the align the dynamic intervals to it's HW page table levels, and not to SW VMAs. This is much simpler to manage and faster to fault, at the cost of capturing more VA for invalidations which have to be probed against the HW shadow PTEs.
It isn't that expensive, there is an extra driver lock/unlock as part of the lookup and possibly a find_vma() and kmalloc(GFP_ATOMIC) for new intervals. Also, the deferred interval updates for munmap(). Compared to the cost of updating PTEs in the device and GPU fault handling, this is minimal overhead.
Well, compared to ODP which does a single xa lookup with no lock to find its interval, this looks very expensive and not parallel.
I think if there is merit in having ranges cover the vmas and track changes then there is probably merit in having the core code provide much of that logic, not the driver.
But it would be interesting to see some kind of analysis on the two methods to decide if the complexity is worthwhile.
Jason
On 1/16/20 8:00 AM, Jason Gunthorpe wrote:
On Wed, Jan 15, 2020 at 02:09:47PM -0800, Ralph Campbell wrote:
I don't understand the lifetime/membership issue. The driver is the only thing that allocates, inserts, or removes struct mmu_interval_notifier and thus completely controls the lifetime.
If the returned value is on the defered list it could be freed at any moment. The existing locks do not prevent it.
ret = nouveau_svmm_interval_find(svmm, &range);
if (ret) {
up_read(&mm->mmap_sem);
return ret;
}
range.notifier_seq = mmu_interval_read_begin(range.notifier); ret = hmm_range_fault(&range, 0); up_read(&mm->mmap_sem); if (ret <= 0) {
I'm still not sure this is a better approach than what ODP does. It looks very expensive on the fault path..
Jason
ODP doesn't have this problem because users have to call ib_reg_mr() before any I/O can happen to the process address space.
ODP supports a single 'full VA' call at process startup, just like these cases.
That is when mmu_interval_notifier_insert() / mmu_interval_notifier_remove() can be called and the driver doesn't have to worry about the interval changing sizes or being removed while I/O is happening.
No, for the 'ODP full process VA' (aka implicit ODP) mode it dynamically maintains a list of intervals. ODP chooses the align the dynamic intervals to it's HW page table levels, and not to SW VMAs. This is much simpler to manage and faster to fault, at the cost of capturing more VA for invalidations which have to be probed against the HW shadow PTEs.
It isn't that expensive, there is an extra driver lock/unlock as part of the lookup and possibly a find_vma() and kmalloc(GFP_ATOMIC) for new intervals. Also, the deferred interval updates for munmap(). Compared to the cost of updating PTEs in the device and GPU fault handling, this is minimal overhead.
Well, compared to ODP which does a single xa lookup with no lock to find its interval, this looks very expensive and not parallel.
I think if there is merit in having ranges cover the vmas and track changes then there is probably merit in having the core code provide much of that logic, not the driver.
But it would be interesting to see some kind of analysis on the two methods to decide if the complexity is worthwhile.
Jason
Can you point me to the latest ODP code? Seems like my understanding is quite off.
On Thu, Jan 16, 2020 at 12:16:30PM -0800, Ralph Campbell wrote:
Can you point me to the latest ODP code? Seems like my understanding is quite off.
https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/infiniband/hw/mlx5/...
Look for the word 'implicit'
mlx5_ib_invalidate_range() releases the interval_notifier when there are no populated shadow PTEs in its leaf
pagefault_implicit_mr() creates an interval_notifier that covers the level in the page table that needs population. Notice it just uses an unlocked xa_load to find the page table level.
The locking is pretty tricky as it relies on RCU, but the fault flow is fairly lightweight.
Jason
On 1/16/20 12:21 PM, Jason Gunthorpe wrote:
On Thu, Jan 16, 2020 at 12:16:30PM -0800, Ralph Campbell wrote:
Can you point me to the latest ODP code? Seems like my understanding is quite off.
https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/infiniband/hw/mlx5/...
Look for the word 'implicit'
mlx5_ib_invalidate_range() releases the interval_notifier when there are no populated shadow PTEs in its leaf
pagefault_implicit_mr() creates an interval_notifier that covers the level in the page table that needs population. Notice it just uses an unlocked xa_load to find the page table level.
The locking is pretty tricky as it relies on RCU, but the fault flow is fairly lightweight.
Jason
Thanks for the information, Jason.
I'm still interested in finding a way to support range based hints to device drivers. madvise() looks like it only sets a bit in vma->vm_flags or acts on the advice immediately. mbind() and set_mempolicy() only work with CPUs and memory with NUMA a node number. What I'm looking for is a way for the device to know whether to migrate pages to device private memory on a fault, whether to duplicate read-only pages in device private memory, or remote map/access a page instead of migrating it. For example, there is a working draft extension to OpenCL, https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/cl_intel_uni... that could provide a way to specify this sort of advice. C++ is also looking at extentions for specifying affinity attributes. In any case, these are probably a long ways off before being finalized and implemented.
I also have some changes to support THP migration to device private memory but that will require updating nouveau to use 2MB TLB mappings.
In the mean time, I can update the HMM self tests to do something like ODP without changing mm/mmu_notifier.c but I don't think I can easily change nouveau to that model.
linux-kselftest-mirror@lists.linaro.org