From: "Mike Rapoport (Microsoft)" rppt@kernel.org
Hi,
These patches allow guest_memfd to notify userspace about minor page faults using userfaultfd and let userspace resolve these page faults using UFFDIO_CONTINUE.
To allow UFFDIO_CONTINUE outside of the core mm I added a get_pagecache_folio() callback to vm_ops that allows an address space backing a VMA to return a folio that exists in its page cache (patch 2).
In order for guest_memfd to notify userspace about page faults, it has to call handle_userfault(), and since guest_memfd may be part of the kvm module, handle_userfault() is exported to the kvm module (patch 3).
Note that the patch 3 changelog does not provide motivation for enabling uffd in guest_memfd, mainly because I can't say I understand why that is required :) Would be great to hear from KVM folks about it.
This series is the minimal change I've been able to come up with to allow integration of guest_memfd with uffd. While refactoring uffd and making the mfill_atomic() flow more linear would have been a nice improvement, it's way out of the scope of enabling uffd with guest_memfd.
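For reference, the userspace side of resolving a minor fault then looks roughly like this (a sketch only, error handling omitted; here 'mem' is a mapping of the guest_memfd registered with UFFDIO_REGISTER_MODE_MINOR, 'mem_alias' is a second, unregistered mapping of the same range used to populate the page cache, and 'src' holds the saved page contents; the selftest in patch 4 has the complete version):

	#include <string.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/userfaultfd.h>

	static void resolve_minor_fault(int uffd, char *mem, char *mem_alias,
					const char *src, size_t page_size)
	{
		struct uffd_msg msg;
		struct uffdio_continue cont;
		unsigned long off;

		/* blocks until a minor fault is reported */
		read(uffd, &msg, sizeof(msg));
		off = (msg.arg.pagefault.address & ~(page_size - 1)) -
		      (unsigned long)mem;

		/* populate the page cache through the unregistered alias */
		memcpy(mem_alias + off, src + off, page_size);

		/* map the now-present page at the faulting address */
		cont.range.start = (unsigned long)mem + off;
		cont.range.len = page_size;
		cont.mode = 0;
		ioctl(uffd, UFFDIO_CONTINUE, &cont);
	}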
Mike Rapoport (Microsoft) (3):
  userfaultfd: move vma_can_userfault out of line
  userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
  userfaultfd, guest_memfd: support userfault minor mode in guest_memfd
Nikita Kalyazin (1):
  KVM: selftests: test userfaultfd minor for guest_memfd
 fs/userfaultfd.c                              |   4 +-
 include/linux/mm.h                            |   9 ++
 include/linux/userfaultfd_k.h                 |  36 +-----
 include/uapi/linux/userfaultfd.h              |   8 +-
 mm/shmem.c                                    |  20 ++++
 mm/userfaultfd.c                              |  88 ++++++++++++---
 .../testing/selftests/kvm/guest_memfd_test.c  | 103 ++++++++++++++++++
 virt/kvm/guest_memfd.c                        |  30 +++++
 8 files changed, 245 insertions(+), 53 deletions(-)
base-commit: 6146a0f1dfae5d37442a9ddcba012add260bceb0
From: "Mike Rapoport (Microsoft)" rppt@kernel.org
vma_can_userfault() has grown pretty big and it is not called on a performance-critical path.
Move it out of line.
No functional changes.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 include/linux/userfaultfd_k.h | 36 ++---------------------------------
 mm/userfaultfd.c              | 34 +++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 34 deletions(-)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index c0e716aec26a..e4f43e7b063f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -208,40 +208,8 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return vma->vm_flags & __VM_UFFD_FLAGS;
 }
 
-static inline bool vma_can_userfault(struct vm_area_struct *vma,
-				     vm_flags_t vm_flags,
-				     bool wp_async)
-{
-	vm_flags &= __VM_UFFD_FLAGS;
-
-	if (vma->vm_flags & VM_DROPPABLE)
-		return false;
-
-	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
-		return false;
-
-	/*
-	 * If wp async enabled, and WP is the only mode enabled, allow any
-	 * memory type.
-	 */
-	if (wp_async && (vm_flags == VM_UFFD_WP))
-		return true;
-
-#ifndef CONFIG_PTE_MARKER_UFFD_WP
-	/*
-	 * If user requested uffd-wp but not enabled pte markers for
-	 * uffd-wp, then shmem & hugetlbfs are not supported but only
-	 * anonymous.
-	 */
-	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
-		return false;
-#endif
-
-	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	       vma_is_shmem(vma);
-}
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+		       bool wp_async);
 
 static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
 {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af61b95c89e4..8dc964389b0d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1977,6 +1977,40 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 	return moved ? moved : err;
 }
 
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+		       bool wp_async)
+{
+	vm_flags &= __VM_UFFD_FLAGS;
+
+	if (vma->vm_flags & VM_DROPPABLE)
+		return false;
+
+	if ((vm_flags & VM_UFFD_MINOR) &&
+	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+		return false;
+
+	/*
+	 * If wp async enabled, and WP is the only mode enabled, allow any
+	 * memory type.
+	 */
+	if (wp_async && (vm_flags == VM_UFFD_WP))
+		return true;
+
+#ifndef CONFIG_PTE_MARKER_UFFD_WP
+	/*
+	 * If user requested uffd-wp but not enabled pte markers for
+	 * uffd-wp, then shmem & hugetlbfs are not supported but only
+	 * anonymous.
+	 */
+	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
+		return false;
+#endif
+
+	/* By default, allow any of anon|shmem|hugetlb */
+	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+	       vma_is_shmem(vma);
+}
+
 static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
 				     vm_flags_t vm_flags)
 {
On 17.11.25 12:46, Mike Rapoport wrote:
From: "Mike Rapoport (Microsoft)" rppt@kernel.org
vma_can_userfault() has grown pretty big and it's not called on performance critical path.
Move it out of line.
No functional changes.
Signed-off-by: Mike Rapoport (Microsoft) rppt@kernel.org
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
From: "Mike Rapoport (Microsoft)" rppt@kernel.org
When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE, it needs to get a folio that already exists in the pagecache backing that VMA.

Instead of using shmem_get_folio() for that, add a get_pagecache_folio() method to 'struct vm_operations_struct' that will return a folio if it exists in the VMA's pagecache at the given pgoff.
Implement get_pagecache_folio() method for shmem and slightly refactor userfaultfd's mfill_atomic() and mfill_atomic_pte_continue() to support this new API.
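For illustration, a filesystem whose folios live in the regular page cache could likely implement the callback as a thin wrapper around filemap_lock_folio() (a hypothetical sketch with a made-up 'foo_' prefix, assuming the callback keeps the shmem semantics below: the folio is returned locked and with a reference held, or an ERR_PTR when nothing is cached):

	static struct folio *foo_get_pagecache_folio(struct vm_area_struct *vma,
						     pgoff_t pgoff)
	{
		/*
		 * filemap_lock_folio() returns the folio locked with an
		 * elevated refcount, or ERR_PTR(-ENOENT) if nothing is cached
		 * at pgoff, which mfill_atomic_pte_continue() translates to
		 * -EFAULT.
		 */
		return filemap_lock_folio(vma->vm_file->f_mapping, pgoff);
	}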
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 include/linux/mm.h |  9 +++++++
 mm/shmem.c         | 20 ++++++++++++++++
 mm/userfaultfd.c   | 60 ++++++++++++++++++++++++++++++----------------
 3 files changed, 69 insertions(+), 20 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16b33bacc32..c35c1e1ac4dd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -690,6 +690,15 @@ struct vm_operations_struct {
 	struct page *(*find_normal_page)(struct vm_area_struct *vma,
 					 unsigned long addr);
 #endif /* CONFIG_FIND_NORMAL_PAGE */
+#ifdef CONFIG_USERFAULTFD
+	/*
+	 * Called by userfault to resolve UFFDIO_CONTINUE request.
+	 * Should return the folio found at pgoff in the VMA's pagecache if it
+	 * exists or ERR_PTR otherwise.
+	 */
+	struct folio *(*get_pagecache_folio)(struct vm_area_struct *vma,
+					     pgoff_t pgoff);
+#endif
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/shmem.c b/mm/shmem.c
index b9081b817d28..4ac122284bff 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3260,6 +3260,20 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 	shmem_inode_unacct_blocks(inode, 1);
 	return ret;
 }
+
+static struct folio *shmem_get_pagecache_folio(struct vm_area_struct *vma,
+					       pgoff_t pgoff)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct folio *folio;
+	int err;
+
+	err = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+	if (err)
+		return ERR_PTR(err);
+
+	return folio;
+}
 #endif /* CONFIG_USERFAULTFD */
 
 #ifdef CONFIG_TMPFS
@@ -5292,6 +5306,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.get_pagecache_folio = shmem_get_pagecache_folio,
+#endif
 };
 
 static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5301,6 +5318,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.get_pagecache_folio = shmem_get_pagecache_folio,
+#endif
 };
 
 int shmem_init_fs_context(struct fs_context *fc)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 8dc964389b0d..60b3183a72c0 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -382,21 +382,17 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 				     unsigned long dst_addr,
 				     uffd_flags_t flags)
 {
-	struct inode *inode = file_inode(dst_vma->vm_file);
 	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
 	struct folio *folio;
 	struct page *page;
 	int ret;
 
-	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+	folio = dst_vma->vm_ops->get_pagecache_folio(dst_vma, pgoff);
 	/* Our caller expects us to return -EFAULT if we failed to find folio */
-	if (ret == -ENOENT)
-		ret = -EFAULT;
-	if (ret)
-		goto out;
-	if (!folio) {
-		ret = -EFAULT;
-		goto out;
+	if (IS_ERR_OR_NULL(folio)) {
+		if (PTR_ERR(folio) == -ENOENT || !folio)
+			return -EFAULT;
+		return PTR_ERR(folio);
 	}
 
 	page = folio_file_page(folio, pgoff);
@@ -411,13 +407,12 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 		goto out_release;
 
 	folio_unlock(folio);
-	ret = 0;
-out:
-	return ret;
+	return 0;
+
 out_release:
 	folio_unlock(folio);
 	folio_put(folio);
-	goto out;
+	return ret;
 }
 
 /* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
@@ -694,6 +689,22 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 	return err;
 }
 
+static __always_inline bool vma_can_mfill_atomic(struct vm_area_struct *vma,
+						 uffd_flags_t flags)
+{
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
+		if (vma->vm_ops && vma->vm_ops->get_pagecache_folio)
+			return true;
+		else
+			return false;
+	}
+
+	if (vma_is_anonymous(vma) || vma_is_shmem(vma))
+		return true;
+
+	return false;
+}
+
 static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 					    unsigned long dst_start,
 					    unsigned long src_start,
@@ -766,10 +777,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
 					    src_start, len, flags);
 
-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
-		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+	if (!vma_can_mfill_atomic(dst_vma, flags))
 		goto out_unlock;
 
 	while (src_addr < src_start + len) {
@@ -1985,9 +1993,21 @@ bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
 	if (vma->vm_flags & VM_DROPPABLE)
 		return false;
 
-	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
-		return false;
+	if (vm_flags & VM_UFFD_MINOR) {
+		/*
+		 * If only MINOR mode is requested and we can request an
+		 * existing folio from VMA's page cache, allow it
+		 */
+		if (vm_flags == VM_UFFD_MINOR && vma->vm_ops &&
+		    vma->vm_ops->get_pagecache_folio)
+			return true;
+		/*
+		 * Only hugetlb and shmem can support MINOR mode in combination
+		 * with other modes
+		 */
+		if (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma))
+			return false;
+	}
 
 	/*
 	 * If wp async enabled, and WP is the only mode enabled, allow any
On 17.11.25 12:46, Mike Rapoport wrote:
From: "Mike Rapoport (Microsoft)" rppt@kernel.org
When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE it needs to get a folio that already exists in the pagecache backing that VMA.
Instead of using shmem_get_folio() for that, add a get_pagecache_folio() method to 'struct vm_operations_struct' that will return a folio if it exists in the VMA's pagecache at given pgoff.
Implement get_pagecache_folio() method for shmem and slightly refactor userfaultfd's mfill_atomic() and mfill_atomic_pte_continue() to support this new API.
Signed-off-by: Mike Rapoport (Microsoft) rppt@kernel.org
include/linux/mm.h | 9 +++++++ mm/shmem.c | 20 ++++++++++++++++ mm/userfaultfd.c | 60 ++++++++++++++++++++++++++++++---------------- 3 files changed, 69 insertions(+), 20 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index d16b33bacc32..c35c1e1ac4dd 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -690,6 +690,15 @@ struct vm_operations_struct { struct page *(*find_normal_page)(struct vm_area_struct *vma, unsigned long addr); #endif /* CONFIG_FIND_NORMAL_PAGE */ +#ifdef CONFIG_USERFAULTFD
- /*
* Called by userfault to resolve UFFDIO_CONTINUE request.* Should return the folio found at pgoff in the VMA's pagecache if it* exists or ERR_PTR otherwise.*/
What are the locking + refcount rules? Without looking at the code, I would assume we return with a folio reference held and the folio locked?
> +	struct folio *(*get_pagecache_folio)(struct vm_area_struct *vma,
> +					     pgoff_t pgoff);
The combination of VMA + pgoff looks weird at first. Would vma + addr, or vma + offset into the vma, be better?
But it also makes me wonder if the callback would ever even require the VMA, or actually only vma->vm_file?
Thinking out loud, I wonder if one could just call that "get_folio" or "get_shared_folio" (IOW, never an anon folio in a MAP_PRIVATE mapping).
> +#endif
>  };
>  
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/mm/shmem.c b/mm/shmem.c
> index b9081b817d28..4ac122284bff 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3260,6 +3260,20 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
>  	shmem_inode_unacct_blocks(inode, 1);
>  	return ret;
>  }
> +
> +static struct folio *shmem_get_pagecache_folio(struct vm_area_struct *vma,
> +					       pgoff_t pgoff)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +	struct folio *folio;
> +	int err;
> +
> +	err = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
> +	if (err)
> +		return ERR_PTR(err);
> +
> +	return folio;
> +}
>  #endif /* CONFIG_USERFAULTFD */
>  
>  #ifdef CONFIG_TMPFS
> @@ -5292,6 +5306,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
>  	.set_policy = shmem_set_policy,
>  	.get_policy = shmem_get_policy,
>  #endif
> +#ifdef CONFIG_USERFAULTFD
> +	.get_pagecache_folio = shmem_get_pagecache_folio,
> +#endif
>  };
>  
>  static const struct vm_operations_struct shmem_anon_vm_ops = {
> @@ -5301,6 +5318,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
>  	.set_policy = shmem_set_policy,
>  	.get_policy = shmem_get_policy,
>  #endif
> +#ifdef CONFIG_USERFAULTFD
> +	.get_pagecache_folio = shmem_get_pagecache_folio,
> +#endif
>  };
>  
>  int shmem_init_fs_context(struct fs_context *fc)
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 8dc964389b0d..60b3183a72c0 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -382,21 +382,17 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
>  				     unsigned long dst_addr,
>  				     uffd_flags_t flags)
>  {
> -	struct inode *inode = file_inode(dst_vma->vm_file);
>  	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
>  	struct folio *folio;
>  	struct page *page;
>  	int ret;
>  
> -	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
> +	folio = dst_vma->vm_ops->get_pagecache_folio(dst_vma, pgoff);
>  	/* Our caller expects us to return -EFAULT if we failed to find folio */
> -	if (ret == -ENOENT)
> -		ret = -EFAULT;
> -	if (ret)
> -		goto out;
> -	if (!folio) {
> -		ret = -EFAULT;
> -		goto out;
> +	if (IS_ERR_OR_NULL(folio)) {
> +		if (PTR_ERR(folio) == -ENOENT || !folio)
> +			return -EFAULT;
> +		return PTR_ERR(folio);
>  	}
>  
>  	page = folio_file_page(folio, pgoff);
> @@ -411,13 +407,12 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
>  		goto out_release;
>  
>  	folio_unlock(folio);
> -	ret = 0;
> -out:
> -	return ret;
> +	return 0;
> +
>  out_release:
>  	folio_unlock(folio);
>  	folio_put(folio);
> -	goto out;
> +	return ret;
>  }
>  
>  /* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
> @@ -694,6 +689,22 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
>  	return err;
>  }
>  
> +static __always_inline bool vma_can_mfill_atomic(struct vm_area_struct *vma,
> +						 uffd_flags_t flags)
> +{
> +	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
> +		if (vma->vm_ops && vma->vm_ops->get_pagecache_folio)
> +			return true;
> +		else
> +			return false;
Probably easier to read is:

	return vma->vm_ops && vma->vm_ops->get_pagecache_folio;
> +	}
> +
> +	if (vma_is_anonymous(vma) || vma_is_shmem(vma))
> +		return true;
> +
> +	return false;
Could also be simplified to:

	return vma_is_anonymous(vma) || vma_is_shmem(vma);
From: "Mike Rapoport (Microsoft)" rppt@kernel.org
* Export handle_userfault() to the KVM module so that the fault() handler
  in guest_memfd is able to notify userspace about page faults in its
  address space.

* Implement get_pagecache_folio() for guest_memfd.

* Finally, introduce UFFD_FEATURE_MINOR_GENERIC that will allow using
  userfaultfd minor mode with memory types other than shmem and hugetlb,
  provided they are allowed to call handle_userfault() and implement
  get_pagecache_folio().
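As a usage note, userspace opts in to the new feature via the UFFDIO_API handshake. A minimal sketch (not part of this series; it assumes the uapi header from this patch) that detects missing kernel support, since UFFDIO_API fails with EINVAL when a requested feature bit is unknown:

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/userfaultfd.h>

	/* Returns a uffd with MINOR_GENERIC negotiated, or -1 if unsupported. */
	static int uffd_open_minor_generic(void)
	{
		struct uffdio_api api = {
			.api = UFFD_API,
			.features = UFFD_FEATURE_MINOR_GENERIC,
		};
		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

		if (uffd < 0)
			return -1;
		/* fails on kernels that do not know the feature bit */
		if (ioctl(uffd, UFFDIO_API, &api) == -1) {
			close(uffd);
			return -1;
		}
		return uffd;
	}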
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 fs/userfaultfd.c                 |  4 +++-
 include/uapi/linux/userfaultfd.h |  8 +++++++-
 virt/kvm/guest_memfd.c           | 30 ++++++++++++++++++++++++++++++
 3 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 54c6cc7fe9c6..964fa2662d5c 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -537,6 +537,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 out:
 	return ret;
 }
+EXPORT_SYMBOL_FOR_MODULES(handle_userfault, "kvm");
 
 static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 					      struct userfaultfd_wait_queue *ewq)
@@ -1978,7 +1979,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	uffdio_api.features = UFFD_API_FEATURES;
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 	uffdio_api.features &=
-		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
+		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM |
+		  UFFD_FEATURE_MINOR_GENERIC);
 #endif
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 2841e4ea8f2c..c5cbd4a5a26e 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -42,7 +42,8 @@
 			   UFFD_FEATURE_WP_UNPOPULATED |	\
 			   UFFD_FEATURE_POISON |		\
 			   UFFD_FEATURE_WP_ASYNC |		\
-			   UFFD_FEATURE_MOVE)
+			   UFFD_FEATURE_MOVE |			\
+			   UFFD_FEATURE_MINOR_GENERIC)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -210,6 +211,10 @@ struct uffdio_api {
 	 * UFFD_FEATURE_MINOR_SHMEM indicates the same support as
 	 * UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead.
 	 *
+	 * UFFD_FEATURE_MINOR_GENERIC indicates that minor faults can be
+	 * intercepted for file-backed memory in case subsystem backing this
+	 * memory supports it.
+	 *
 	 * UFFD_FEATURE_EXACT_ADDRESS indicates that the exact address of page
 	 * faults would be provided and the offset within the page would not be
 	 * masked.
@@ -248,6 +253,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_POISON			(1<<14)
 #define UFFD_FEATURE_WP_ASYNC			(1<<15)
 #define UFFD_FEATURE_MOVE			(1<<16)
+#define UFFD_FEATURE_MINOR_GENERIC		(1<<17)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fbca8c0972da..5e3c63307fdf 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#include <linux/userfaultfd_k.h>
#include "kvm_mm.h"
@@ -369,6 +370,12 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 			return vmf_error(err);
 	}
 
+	if (userfaultfd_minor(vmf->vma)) {
+		folio_unlock(folio);
+		folio_put(folio);
+		return handle_userfault(vmf, VM_UFFD_MINOR);
+	}
+
 	if (WARN_ON_ONCE(folio_test_large(folio))) {
 		ret = VM_FAULT_SIGBUS;
 		goto out_folio;
@@ -390,8 +397,31 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	return ret;
 }
 
+#ifdef CONFIG_USERFAULTFD
+static struct folio *kvm_gmem_get_pagecache_folio(struct vm_area_struct *vma,
+						  pgoff_t pgoff)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct folio *folio;
+
+	folio = kvm_gmem_get_folio(inode, pgoff);
+	if (IS_ERR_OR_NULL(folio))
+		return folio;
+
+	if (!folio_test_uptodate(folio)) {
+		clear_highpage(folio_page(folio, 0));
+		kvm_gmem_mark_prepared(folio);
+	}
+
+	return folio;
+}
+#endif
+
 static const struct vm_operations_struct kvm_gmem_vm_ops = {
 	.fault = kvm_gmem_fault_user_mapping,
+#ifdef CONFIG_USERFAULTFD
+	.get_pagecache_folio = kvm_gmem_get_pagecache_folio,
+#endif
 };
static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
From: Nikita Kalyazin <kalyazin@amazon.com>
The test demonstrates that a minor userfaultfd event in guest_memfd can be resolved via a memcpy followed by a UFFDIO_CONTINUE ioctl.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 .../testing/selftests/kvm/guest_memfd_test.c | 103 ++++++++++++++++++
 1 file changed, 103 insertions(+)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index e7d9aeb418d3..a5d3ed21d7bb 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -10,13 +10,17 @@
 #include <errno.h>
 #include <stdio.h>
 #include <fcntl.h>
+#include <pthread.h>
 
 #include <linux/bitmap.h>
 #include <linux/falloc.h>
 #include <linux/sizes.h>
+#include <linux/userfaultfd.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
#include "kvm_util.h" #include "test_util.h" @@ -254,6 +258,104 @@ static void test_guest_memfd_flags(struct kvm_vm *vm) } }
 
+struct fault_args {
+	char *addr;
+	volatile char value;
+};
+
+static void *fault_thread_fn(void *arg)
+{
+	struct fault_args *args = arg;
+
+	/* Trigger page fault */
+	args->value = *args->addr;
+	return NULL;
+}
+
+static void test_uffd_minor(int fd, size_t total_size)
+{
+	struct uffdio_api uffdio_api = {
+		.api = UFFD_API,
+		.features = UFFD_FEATURE_MINOR_GENERIC,
+	};
+	struct uffdio_register uffd_reg;
+	struct uffdio_continue uffd_cont;
+	struct uffd_msg msg;
+	struct fault_args args;
+	pthread_t fault_thread;
+	void *mem, *mem_nofault, *buf = NULL;
+	int uffd, ret;
+	off_t offset = page_size;
+	void *fault_addr;
+
+	ret = posix_memalign(&buf, page_size, total_size);
+	TEST_ASSERT_EQ(ret, 0);
+
+	memset(buf, 0xaa, total_size);
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
+	TEST_ASSERT(uffd != -1, "userfaultfd creation should succeed");
+
+	ret = ioctl(uffd, UFFDIO_API, &uffdio_api);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_API) should succeed");
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should succeed");
+
+	mem_nofault = mmap(NULL, total_size, PROT_READ | PROT_WRITE,
+			   MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem_nofault != MAP_FAILED, "mmap should succeed");
+
+	uffd_reg.range.start = (unsigned long)mem;
+	uffd_reg.range.len = total_size;
+	uffd_reg.mode = UFFDIO_REGISTER_MODE_MINOR;
+	ret = ioctl(uffd, UFFDIO_REGISTER, &uffd_reg);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_REGISTER) should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			offset, page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) should succeed");
+
+	fault_addr = mem + offset;
+	args.addr = fault_addr;
+
+	ret = pthread_create(&fault_thread, NULL, fault_thread_fn, &args);
+	TEST_ASSERT(ret == 0, "pthread_create should succeed");
+
+	ret = read(uffd, &msg, sizeof(msg));
+	TEST_ASSERT(ret != -1, "read from userfaultfd should succeed");
+	TEST_ASSERT(msg.event == UFFD_EVENT_PAGEFAULT, "event type should be pagefault");
+	TEST_ASSERT((void *)(msg.arg.pagefault.address & ~(page_size - 1)) == fault_addr,
+		    "pagefault should occur at expected address");
+
+	memcpy(mem_nofault + offset, buf + offset, page_size);
+
+	uffd_cont.range.start = (unsigned long)fault_addr;
+	uffd_cont.range.len = page_size;
+	uffd_cont.mode = 0;
+	ret = ioctl(uffd, UFFDIO_CONTINUE, &uffd_cont);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_CONTINUE) should succeed");
+
+	/*
+	 * wait for fault_thread to finish to make sure fault happened and was
+	 * resolved before we verify the values
+	 */
+	ret = pthread_join(fault_thread, NULL);
+	TEST_ASSERT(ret == 0, "pthread_join should succeed");
+
+	TEST_ASSERT(args.value == *(char *)(mem_nofault + offset),
+		    "memory should contain the value that was copied");
+	TEST_ASSERT(args.value == *(char *)(mem + offset),
+		    "no further fault is expected");
+
+	ret = munmap(mem_nofault, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+
+	ret = munmap(mem, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+	free(buf);
+	close(uffd);
+}
+
 #define gmem_test(__test, __vm, __flags)				\
 do {									\
 	int fd = vm_create_guest_memfd(__vm, page_size * 4, __flags);	\
@@ -273,6 +375,7 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags)
 	if (flags & GUEST_MEMFD_FLAG_INIT_SHARED) {
 		gmem_test(mmap_supported, vm, flags);
 		gmem_test(fault_overflow, vm, flags);
+		gmem_test(uffd_minor, vm, flags);
 	} else {
 		gmem_test(fault_private, vm, flags);
 	}
On 17/11/2025 11:46, Mike Rapoport wrote:
From: "Mike Rapoport (Microsoft)" rppt@kernel.org
Hi,
These patches allow guest_memfd to notify userspace about minor page faults using userfaultfd and let userspace to resolve these page faults using UFFDIO_CONTINUE.
To allow UFFDIO_CONTINUE outside of the core mm I added a get_pagecache_folio() callback to vm_ops that allows an address space backing a VMA to return a folio that exists in it's page cache (patch 2)
In order for guest_memfd to notify userspace about page faults, it has to call handle_userfault() and since guest_memfd may be a part of kvm module, handle_userfault() is exported for kvm module (patch 3).
Note that patch 3 changelog does not provide motivation for enabling uffd in guest_memfd, mainly because I can't say I understand why is that required :) Would be great to hear from KVM folks about it.
Hi Mike,
Thanks for posting it!
In our use case, Firecracker snapshot-restore using UFFD [1], we will use UFFD minor/continue to respond to guest_memfd faults in user mappings, primarily for VMM accesses required for PV (virtio) device emulation, and also for KVM accesses when decoding MMIO operations on x86.
Nikita
[1] https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotti...
> This series is the minimal change I've been able to come up with to
> allow integration of guest_memfd with uffd. While refactoring uffd and
> making the mfill_atomic() flow more linear would have been a nice
> improvement, it's way out of the scope of enabling uffd with
> guest_memfd.
> 
> Mike Rapoport (Microsoft) (3):
>   userfaultfd: move vma_can_userfault out of line
>   userfaultfd, shmem: use a VMA callback to handle UFFDIO_CONTINUE
>   userfaultfd, guest_memfd: support userfault minor mode in guest_memfd
> 
> Nikita Kalyazin (1):
>   KVM: selftests: test userfaultfd minor for guest_memfd
> 
>  fs/userfaultfd.c                              |   4 +-
>  include/linux/mm.h                            |   9 ++
>  include/linux/userfaultfd_k.h                 |  36 +-----
>  include/uapi/linux/userfaultfd.h              |   8 +-
>  mm/shmem.c                                    |  20 ++++
>  mm/userfaultfd.c                              |  88 ++++++++++++---
>  .../testing/selftests/kvm/guest_memfd_test.c  | 103 ++++++++++++++++++
>  virt/kvm/guest_memfd.c                        |  30 +++++
>  8 files changed, 245 insertions(+), 53 deletions(-)
> 
> base-commit: 6146a0f1dfae5d37442a9ddcba012add260bceb0
> --
> 2.50.1