This series is built on top of Fuad's v7 "mapping guest_memfd backed memory at the host" [1].
With James's KVM userfault [2], it is possible to handle stage-2 faults in guest_memfd in userspace. However, KVM itself also triggers faults in guest_memfd in some cases, for example: PV interfaces like kvmclock, PV EOI, and the page table walking code when fetching the MMIO instruction on x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3] that KVM would be accessing those pages via userspace page tables. In order for such faults to be handled in userspace, guest_memfd needs to support userfaultfd.
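For context, the intended userspace flow looks roughly like the sketch below (modelled on the selftest added in this series; addr, len, fault_addr and page_size are placeholders and error handling is elided):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static void gmem_uffd_minor_sketch(void *addr, unsigned long len,
                                   unsigned long fault_addr,
                                   unsigned long page_size)
{
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_MINOR_GUEST_MEMFD, /* added by this series */
        };
        ioctl(uffd, UFFDIO_API, &api);

        /* Register the guest_memfd-backed mapping for minor faults. */
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode = UFFDIO_REGISTER_MODE_MINOR,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /*
         * ... read a uffd_msg from uffd, populate the faulting page through
         * a second mapping of the guest_memfd, then resolve the fault:
         */
        struct uffdio_continue cont = {
                .range = {
                        .start = fault_addr & ~(page_size - 1),
                        .len = page_size,
                },
        };
        ioctl(uffd, UFFDIO_CONTINUE, &cont);
}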
Changes since v2 [4]:
 - James: Fix sgp type when calling shmem_get_folio_gfp
 - James: Improved vm_ops->fault() error handling
 - James: Add and make use of the can_userfault() VMA operation
 - James: Add UFFD_FEATURE_MINOR_GUEST_MEMFD feature flag
 - James: Fix typos and add more checks in the test
Nikita
[1] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/T...
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAos...
[4] https://lore.kernel.org/kvm/20250402160721.97596-1-kalyazin@amazon.com/T/
Nikita Kalyazin (6):
  mm: userfaultfd: generic continue for non hugetlbfs
  mm: provide can_userfault vma operation
  mm: userfaultfd: use can_userfault vma operation
  KVM: guest_memfd: add support for userfaultfd minor
  mm: userfaultfd: add UFFD_FEATURE_MINOR_GUEST_MEMFD
  KVM: selftests: test userfaultfd minor for guest_memfd
 fs/userfaultfd.c                              |  3 +-
 include/linux/mm.h                            |  5 +
 include/linux/mm_types.h                      |  4 +
 include/linux/userfaultfd_k.h                 | 10 +-
 include/uapi/linux/userfaultfd.h              |  8 +-
 mm/hugetlb.c                                  |  9 +-
 mm/shmem.c                                    | 17 +++-
 mm/userfaultfd.c                              | 47 ++++++---
 .../testing/selftests/kvm/guest_memfd_test.c  | 99 +++++++++++++++++++
 virt/kvm/guest_memfd.c                        | 10 ++
 10 files changed, 188 insertions(+), 24 deletions(-)
base-commit: 3cc51efc17a2c41a480eed36b31c1773936717e0
Remove shmem-specific code from the UFFDIO_CONTINUE implementation for non-huge pages by calling vm_ops->fault(). A new VMF flag, FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid a recursive call to handle_userfault().
Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 include/linux/mm_types.h |  4 ++++
 mm/hugetlb.c             |  2 +-
 mm/shmem.c               |  9 ++++++---
 mm/userfaultfd.c         | 37 +++++++++++++++++++++++++++----------
 4 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..2f26ee9742bf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
  * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
  *                             We should only access orig_pte if this flag set.
  * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
+ * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
+ *                                 minor handler as it is being called by the
+ *                                 userfaultfd code itself.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -1467,6 +1470,7 @@ enum fault_flag {
 	FAULT_FLAG_UNSHARE = 1 << 10,
 	FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
 	FAULT_FLAG_VMA_LOCK = 1 << 12,
+	FAULT_FLAG_USERFAULT_CONTINUE = 1 << 13,
 };

 typedef unsigned int __bitwise zap_flags_t;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 97930d44d460..c004cfdcd4e2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6228,7 +6228,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	}

 	/* Check for page in userfault range. */
-	if (userfaultfd_minor(vma)) {
+	if (userfaultfd_minor(vma) && !(vmf->flags & FAULT_FLAG_USERFAULT_CONTINUE)) {
 		folio_unlock(folio);
 		folio_put(folio);
 		/* See comment in userfaultfd_missing() block above */
diff --git a/mm/shmem.c b/mm/shmem.c
index 1ede0800e846..b4159303fe59 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2467,7 +2467,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	fault_mm = vma ? vma->vm_mm : NULL;

 	folio = filemap_get_entry(inode->i_mapping, index);
-	if (folio && vma && userfaultfd_minor(vma)) {
+	if (folio && vma && userfaultfd_minor(vma) &&
+	    !(vmf->flags & FAULT_FLAG_USERFAULT_CONTINUE)) {
 		if (!xa_is_value(folio))
 			folio_put(folio);
 		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
@@ -2727,6 +2728,8 @@ static vm_fault_t shmem_falloc_wait(struct vm_fault *vmf, struct inode *inode)
 static vm_fault_t shmem_fault(struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vmf->vma->vm_file);
+	enum sgp_type sgp = vmf->flags & FAULT_FLAG_USERFAULT_CONTINUE ?
+			    SGP_NOALLOC : SGP_CACHE;
 	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
 	struct folio *folio = NULL;
 	vm_fault_t ret = 0;
@@ -2743,8 +2746,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	}

 	WARN_ON_ONCE(vmf->page != NULL);
-	err = shmem_get_folio_gfp(inode, vmf->pgoff, 0, &folio, SGP_CACHE,
-				  gfp, vmf, &ret);
+	err = shmem_get_folio_gfp(inode, vmf->pgoff, 0, &folio, sgp, gfp, vmf,
+				  &ret);
 	if (err)
 		return vmf_error(err);
 	if (folio) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d06453fa8aba..4b3dbc7dac64 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -380,30 +380,47 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
 	return ret;
 }

-/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
+/* Handles UFFDIO_CONTINUE for all VMAs */
 static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 				     struct vm_area_struct *dst_vma,
 				     unsigned long dst_addr,
 				     uffd_flags_t flags)
 {
-	struct inode *inode = file_inode(dst_vma->vm_file);
-	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
 	struct folio *folio;
 	struct page *page;
 	int ret;
+	struct vm_fault vmf = {
+		.vma = dst_vma,
+		.address = dst_addr,
+		.flags = FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE |
+			 FAULT_FLAG_USERFAULT_CONTINUE,
+		.pte = NULL,
+		.page = NULL,
+		.pgoff = linear_page_index(dst_vma, dst_addr),
+	};
+
+	if (!dst_vma->vm_ops || !dst_vma->vm_ops->fault)
+		return -EINVAL;

-	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
-	/* Our caller expects us to return -EFAULT if we failed to find folio */
-	if (ret == -ENOENT)
+retry:
+	ret = dst_vma->vm_ops->fault(&vmf);
+	if (ret & VM_FAULT_ERROR) {
 		ret = -EFAULT;
-	if (ret)
 		goto out;
-	if (!folio) {
-		ret = -EFAULT;
+	}
+
+	if (ret & VM_FAULT_NOPAGE) {
+		ret = -EAGAIN;
 		goto out;
 	}

-	page = folio_file_page(folio, pgoff);
+	if (ret & VM_FAULT_RETRY)
+		goto retry;
+
+	page = vmf.page;
+	folio = page_folio(page);
+	BUG_ON(!folio);
+
 	if (PageHWPoison(page)) {
 		ret = -EIO;
 		goto out_release;
On Fri, Apr 04, 2025 at 03:43:47PM +0000, Nikita Kalyazin wrote:
Remove shmem-specific code from the UFFDIO_CONTINUE implementation for non-huge pages by calling vm_ops->fault(). A new VMF flag, FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid a recursive call to handle_userfault().
It's not clear yet why this needs to be generalized out of the blue.
Some mention of the guest_memfd use case might help other reviewers, or some mention of the need to introduce userfaultfd support in kernel modules.
Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
 include/linux/mm_types.h |  4 ++++
 mm/hugetlb.c             |  2 +-
 mm/shmem.c               |  9 ++++++---
 mm/userfaultfd.c         | 37 +++++++++++++++++++++++++++----------
 4 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..2f26ee9742bf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
  * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
  *                             We should only access orig_pte if this flag set.
  * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
+ * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
+ *                                 minor handler as it is being called by the
+ *                                 userfaultfd code itself.
We probably shouldn't leak the "CONTINUE" concept to mm core if possible, as it's not easy to follow without userfault minor context. It might be better to use a generic term like NO_USERFAULT.
That said, since we'll need to add a vm_ops anyway in the latter patch, I wonder whether we can also avoid reusing fault() and instead resolve the page faults using that vm_ops hook too. That might be helpful because then we can avoid this new FAULT_FLAG_* that is totally not useful to non-userfault users, and we also wouldn't need to hand-cook the vm_fault struct below just to suit the current fault() interface.
Thanks,
On 10/06/2025 23:22, Peter Xu wrote:
On Fri, Apr 04, 2025 at 03:43:47PM +0000, Nikita Kalyazin wrote:
Remove shmem-specific code from the UFFDIO_CONTINUE implementation for non-huge pages by calling vm_ops->fault(). A new VMF flag, FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid a recursive call to handle_userfault().
It's not clear yet why this needs to be generalized out of the blue.
Some mention of the guest_memfd use case might help other reviewers, or some mention of the need to introduce userfaultfd support in kernel modules.
Hi Peter,
Sounds fair, thank you.
Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
 include/linux/mm_types.h |  4 ++++
 mm/hugetlb.c             |  2 +-
 mm/shmem.c               |  9 ++++++---
 mm/userfaultfd.c         | 37 +++++++++++++++++++++++++++----------
 4 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..2f26ee9742bf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
  * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
  *                             We should only access orig_pte if this flag set.
  * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
+ * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
+ *                                 minor handler as it is being called by the
+ *                                 userfaultfd code itself.
We probably shouldn't leak the "CONTINUE" concept to mm core if possible, as it's not easy to follow without userfault minor context. It might be better to use a generic term like NO_USERFAULT.
Yes, I agree, can name it more generically.
That said, since we'll need to add a vm_ops anyway in the latter patch, I wonder whether we can also avoid reusing fault() and instead resolve the page faults using that vm_ops hook too. That might be helpful because then we can avoid this new FAULT_FLAG_* that is totally not useful to non-userfault users, and we also wouldn't need to hand-cook the vm_fault struct below just to suit the current fault() interface.
I'm not sure I fully understand that. Calling the fault() op helps us reuse the FS specifics when resolving the fault. I get that the new op can imply the userfault flag, so the flag doesn't need to be exposed to mm, but doing so would duplicate logic within FSes between this new op and fault(), unless we attempt to factor the common parts out. For example, for shmem_get_folio_gfp(), we would still need to find a way to suppress the call to handle_userfault() when shmem_get_folio_gfp() is called from the new op. Is that what you're proposing?
Thanks,
-- Peter Xu
On Wed, Jun 11, 2025 at 01:09:32PM +0100, Nikita Kalyazin wrote:
On 10/06/2025 23:22, Peter Xu wrote:
On Fri, Apr 04, 2025 at 03:43:47PM +0000, Nikita Kalyazin wrote:
Remove shmem-specific code from the UFFDIO_CONTINUE implementation for non-huge pages by calling vm_ops->fault(). A new VMF flag, FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid a recursive call to handle_userfault().
It's not clear yet why this needs to be generalized out of the blue.
Some mention of the guest_memfd use case might help other reviewers, or some mention of the need to introduce userfaultfd support in kernel modules.
Hi Peter,
Sounds fair, thank you.
Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
 include/linux/mm_types.h |  4 ++++
 mm/hugetlb.c             |  2 +-
 mm/shmem.c               |  9 ++++++---
 mm/userfaultfd.c         | 37 +++++++++++++++++++++++++++----------
 4 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..2f26ee9742bf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
  * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
  *                             We should only access orig_pte if this flag set.
  * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
+ * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
+ *                                 minor handler as it is being called by the
+ *                                 userfaultfd code itself.
We probably shouldn't leak the "CONTINUE" concept to mm core if possible, as it's not easy to follow without userfault minor context. It might be better to use a generic term like NO_USERFAULT.
Yes, I agree, can name it more generically.
That said, since we'll need to add a vm_ops anyway in the latter patch, I wonder whether we can also avoid reusing fault() and instead resolve the page faults using that vm_ops hook too. That might be helpful because then we can avoid this new FAULT_FLAG_* that is totally not useful to non-userfault users, and we also wouldn't need to hand-cook the vm_fault struct below just to suit the current fault() interface.
I'm not sure I fully understand that. Calling the fault() op helps us reuse the FS specifics when resolving the fault. I get that the new op can imply the userfault flag, so the flag doesn't need to be exposed to mm, but doing so would duplicate logic within FSes between this new op and fault(), unless we attempt to factor the common parts out. For example, for shmem_get_folio_gfp(), we would still need to find a way to suppress the call to handle_userfault() when shmem_get_folio_gfp() is called from the new op. Is that what you're proposing?
Yes, that is what I was proposing. shmem_get_folio_gfp() always has that handling: when vmf==NULL, vma==NULL and userfault will be skipped.
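For reference, the logic being referred to looks roughly like this (paraphrased from shmem_get_folio_gfp()):

	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
	...
	folio = filemap_get_entry(inode->i_mapping, index);
	/* With vmf == NULL, vma is NULL too, so this block is skipped. */
	if (folio && vma && userfaultfd_minor(vma)) {
		if (!xa_is_value(folio))
			folio_put(folio);
		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
		return 0;
	}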
So what I was thinking is one vm_ops.userfaultfd_request(req), where req can be:
(1) UFFD_REQ_GET_SUPPORTED: for existing RAM-FSes, this should return all of MISSING/WP/MINOR. Here WP means sync wp tracking; async has so far been supported by default almost everywhere except VM_DROPPABLE. For guest-memfd, we can return MINOR only as of now (even if I think it shouldn't be hard to support the other two..).
(2) UFFD_REQ_FAULT_RESOLVE: this should play the fault() role but be well defined to suit userfault's needs on fault resolution. It likely doesn't need vmf as the parameter, but rather (when anon isn't taken into account; after all, anon has vm_ops==NULL..) the inode and offset; perhaps some flag would be needed to distinguish MISSING from MINOR faults, for example.
Maybe some more.
I was even thinking whether we could merge hugetlb into the picture too by generalizing its fault resolution. Hugetlb was always special; maybe this is a chance to generalize it as well, but it doesn't need to happen in one shot even if it could work. We could start with shmem.
So this does sound slightly involved, and I'm not yet 100% sure it will work, but likely. If you want, I can take a stab at it this week or next just to see whether it works in general. I also don't expect this to depend on guest-memfd at all - it can stand alone as a refactoring making userfault module-ready.
Thanks,
On 11/06/2025 13:56, Peter Xu wrote:
On Wed, Jun 11, 2025 at 01:09:32PM +0100, Nikita Kalyazin wrote:
On 10/06/2025 23:22, Peter Xu wrote:
On Fri, Apr 04, 2025 at 03:43:47PM +0000, Nikita Kalyazin wrote:
Remove shmem-specific code from the UFFDIO_CONTINUE implementation for non-huge pages by calling vm_ops->fault(). A new VMF flag, FAULT_FLAG_USERFAULT_CONTINUE, is introduced to avoid a recursive call to handle_userfault().
It's not clear yet why this needs to be generalized out of the blue.
Some mention of the guest_memfd use case might help other reviewers, or some mention of the need to introduce userfaultfd support in kernel modules.
Hi Peter,
Sounds fair, thank you.
Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
 include/linux/mm_types.h |  4 ++++
 mm/hugetlb.c             |  2 +-
 mm/shmem.c               |  9 ++++++---
 mm/userfaultfd.c         | 37 +++++++++++++++++++++++++++----------
 4 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..2f26ee9742bf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1429,6 +1429,9 @@ enum tlb_flush_reason {
  * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
  *                             We should only access orig_pte if this flag set.
  * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
+ * @FAULT_FLAG_USERFAULT_CONTINUE: The fault handler must not call userfaultfd
+ *                                 minor handler as it is being called by the
+ *                                 userfaultfd code itself.
We probably shouldn't leak the "CONTINUE" concept to mm core if possible, as it's not easy to follow without userfault minor context. It might be better to use a generic term like NO_USERFAULT.
Yes, I agree, can name it more generically.
That said, since we'll need to add a vm_ops anyway in the latter patch, I wonder whether we can also avoid reusing fault() and instead resolve the page faults using that vm_ops hook too. That might be helpful because then we can avoid this new FAULT_FLAG_* that is totally not useful to non-userfault users, and we also wouldn't need to hand-cook the vm_fault struct below just to suit the current fault() interface.
I'm not sure I fully understand that. Calling the fault() op helps us reuse the FS specifics when resolving the fault. I get that the new op can imply the userfault flag, so the flag doesn't need to be exposed to mm, but doing so would duplicate logic within FSes between this new op and fault(), unless we attempt to factor the common parts out. For example, for shmem_get_folio_gfp(), we would still need to find a way to suppress the call to handle_userfault() when shmem_get_folio_gfp() is called from the new op. Is that what you're proposing?
Yes, that is what I was proposing. shmem_get_folio_gfp() always has that handling: when vmf==NULL, vma==NULL and userfault will be skipped.
So what I was thinking is one vm_ops.userfaultfd_request(req), where req can be:
(1) UFFD_REQ_GET_SUPPORTED: for existing RAM-FSes, this should return all of MISSING/WP/MINOR. Here WP means sync wp tracking; async has so far been supported by default almost everywhere except VM_DROPPABLE. For guest-memfd, we can return MINOR only as of now (even if I think it shouldn't be hard to support the other two..).
(2) UFFD_REQ_FAULT_RESOLVE: this should play the fault() role but be well defined to suit userfault's needs on fault resolution. It likely doesn't need vmf as the parameter, but rather (when anon isn't taken into account; after all, anon has vm_ops==NULL..) the inode and offset; perhaps some flag would be needed to distinguish MISSING from MINOR faults, for example.
Maybe some more.
I was even thinking whether we could merge hugetlb into the picture too by generalizing its fault resolution. Hugetlb was always special; maybe this is a chance to generalize it as well, but it doesn't need to happen in one shot even if it could work. We could start with shmem.
So this does sound slightly involved, and I'm not yet 100% sure it will work, but likely. If you want, I can take a stab at it this week or next just to see whether it works in general. I also don't expect this to depend on guest-memfd at all - it can stand alone as a refactoring making userfault module-ready.
Thanks for explaining that. I played a bit with it myself and it appears to be working for the MISSING mode for both shmem and guest_memfd. Attaching my sketch below. Please let me know if this is how you see it.
I found that arguments and return values are significantly different between the two request types, which may be a bit confusing, although we do not expect many callers of those.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09aeb2c..eb30b23b24d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -603,6 +603,16 @@ struct vm_fault {
 	 */
 };

+#ifdef CONFIG_USERFAULTFD
+/*
+ * Used by userfaultfd_request().
+ */
+enum uffd_req {
+	UFFD_REQ_GET_SUPPORTED,	/* query supported userfaultfd modes */
+	UFFD_REQ_FAULT_RESOLVE,	/* request fault resolution */
+};
+#endif
+
 /*
  * These are the virtual MM functions - opening of an area, closing and
  * unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -680,6 +690,22 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+
+#ifdef CONFIG_USERFAULTFD
+	/*
+	 * Called by the userfaultfd code to query supported modes or request
+	 * fault resolution.
+	 * If called with req UFFD_REQ_GET_SUPPORTED, it returns a bitmask
+	 * of modes as in struct uffdio_register.  No other arguments are
+	 * used.
+	 * If called with req UFFD_REQ_FAULT_RESOLVE, it resolves the fault
+	 * using the mode specified in the mode argument.  The inode, pgoff and
+	 * foliop arguments must be set accordingly.
+	 */
+	int (*userfaultfd_request)(enum uffd_req req, int mode,
+				   struct inode *inode, pgoff_t pgoff,
+				   struct folio **foliop);
+#endif
 };

 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 75342022d144..1cabb925da0e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -222,7 +222,11 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 		return false;

 	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+	    (!is_vm_hugetlb_page(vma) &&
+	     !vma->vm_ops->userfaultfd_request &&
+	     !(vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+						NULL, 0, NULL) &
+	       UFFDIO_REGISTER_MODE_MINOR)))
 		return false;

 	/*
@@ -243,8 +247,11 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 #endif

 	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	    vma_is_shmem(vma);
+	return vma_is_anonymous(vma) ||
+	       is_vm_hugetlb_page(vma) ||
+	       (vma->vm_ops->userfaultfd_request &&
+		vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0, NULL,
+						 0, NULL));
 }

 static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
diff --git a/mm/shmem.c b/mm/shmem.c
index 1ede0800e846..a5b5c4131dcf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5203,6 +5203,40 @@ static int shmem_error_remove_folio(struct address_space *mapping,
 	return 0;
 }

+#ifdef CONFIG_USERFAULTFD
+static int shmem_userfaultfd_request(enum uffd_req req, int mode,
+				     struct inode *inode, pgoff_t pgoff,
+				     struct folio **foliop)
+{
+	int ret;
+
+	switch (req) {
+	case UFFD_REQ_GET_SUPPORTED:
+		ret = UFFDIO_REGISTER_MODE_MISSING |
+		      UFFDIO_REGISTER_MODE_WP |
+		      UFFDIO_REGISTER_MODE_MINOR;
+		break;
+	case UFFD_REQ_FAULT_RESOLVE:
+		ret = shmem_get_folio(inode, pgoff, 0, foliop, SGP_NOALLOC);
+		if (ret == -ENOENT)
+			ret = -EFAULT;
+		if (ret)
+			break;
+		if (!*foliop) {
+			ret = -EFAULT;
+			break;
+		}
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+#endif
+
 static const struct address_space_operations shmem_aops = {
 	.writepage = shmem_writepage,
 	.dirty_folio = noop_dirty_folio,
@@ -5306,6 +5340,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_request = shmem_userfaultfd_request,
+#endif
 };

 static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5315,6 +5352,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_request = shmem_userfaultfd_request,
+#endif
 };

 int shmem_init_fs_context(struct fs_context *fc)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d06453fa8aba..efc150bf5691 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -392,16 +392,18 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 	struct page *page;
 	int ret;

-	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
-	/* Our caller expects us to return -EFAULT if we failed to find folio */
-	if (ret == -ENOENT)
-		ret = -EFAULT;
+	if (!dst_vma->vm_ops->userfaultfd_request ||
+	    !(dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+						   NULL, 0, NULL) &
+	      UFFDIO_REGISTER_MODE_MINOR)) {
+		return -EFAULT;
+	}
+
+	ret = dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_FAULT_RESOLVE,
+						   UFFDIO_REGISTER_MODE_MINOR,
+						   inode, pgoff, &folio);
 	if (ret)
 		goto out;
-	if (!folio) {
-		ret = -EFAULT;
-		goto out;
-	}

 	page = folio_file_page(folio, pgoff);
 	if (PageHWPoison(page)) {
@@ -770,10 +772,10 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
 					     src_start, len, flags);

-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
-		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+	if (!vma_is_anonymous(dst_vma) &&
+	    (!dst_vma->vm_ops->userfaultfd_request ||
+	     (!dst_vma->vm_ops->userfaultfd_request(UFFD_REQ_GET_SUPPORTED, 0,
+						    NULL, 0, NULL))))
 		goto out_unlock;

 	while (src_addr < src_start + len) {
Thanks,
-- Peter Xu
Hi, Nikita,
On Fri, Jun 20, 2025 at 01:00:24PM +0100, Nikita Kalyazin wrote:
Thanks for explaining that. I played a bit with it myself and it appears to be working for the MISSING mode for both shmem and guest_memfd. Attaching
[1]
my sketch below. Please let me know if this is how you see it.
I found that arguments and return values are significantly different between the two request types, which may be a bit confusing, although we do not expect many callers of those.
Indeed. Actually, since I didn't yet get your reply, early this week I gave it a shot, and I found the same thing - that it'll be nicer to keep the type checks throughout. It'll also be awkward if we want to add MISSING into the picture with the current req() interface (btw, IIUC you meant MINOR above, not MISSING).
Please have a look at what I came up with. I didn't yet get a chance to post it, but it compiles fine and passes the smoke tests here. Feel free to take it over if it makes sense to you, or I can also post it officially after some more tests.
So, ultimately I introduced a vm_uffd_ops to keep all the type checks. I don't like the uffd_copy() interface too much, but it should still be the minimum changeset I can think of to generalize shmem as a userfault user / module, and finally drop the "linux/shmem_fs.h" inclusion in the last patch.
It's also unfortunate that hugetlb won't be able to use the API yet, similar to why hugetlb's fault() BUG()s and is instead hard-coded in handle_mm_fault(). However, it'll at least start to use the rest of the API just fine, so as to generalize some hugetlb checks.
The shmem definition looks like this:
static const vm_uffd_ops shmem_uffd_ops = {
	.uffd_features = __VM_UFFD_FLAGS,
	.uffd_ioctls = BIT(_UFFDIO_COPY) |
		       BIT(_UFFDIO_ZEROPAGE) |
		       BIT(_UFFDIO_WRITEPROTECT) |
		       BIT(_UFFDIO_CONTINUE) |
		       BIT(_UFFDIO_POISON),
	.uffd_get_folio = shmem_uffd_get_folio,
	.uffd_copy = shmem_mfill_atomic_pte,
};
Then guest-memfd can set (1) VM_UFFD_MINOR, (2) _UFFDIO_CONTINUE and provide uffd_get_folio() for supporting MINOR.
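For illustration, a minimal sketch of what that could look like on the guest-memfd side (kvm_gmem_uffd_get_folio() is a hypothetical helper, not part of either series):

#ifdef CONFIG_USERFAULTFD
static int kvm_gmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
                                   struct folio **folio)
{
        /* Only look up an existing folio in the page cache; never allocate. */
        *folio = filemap_lock_folio(inode->i_mapping, pgoff);
        return IS_ERR(*folio) ? PTR_ERR(*folio) : 0;
}

static const vm_uffd_ops kvm_gmem_uffd_ops = {
        .uffd_features  = VM_UFFD_MINOR,
        .uffd_ioctls    = BIT(_UFFDIO_CONTINUE),
        .uffd_get_folio = kvm_gmem_uffd_get_folio,
};
#endif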
Let me know what you think.
Thanks,
===8<===
From ca500177de122d32194f8bf4589faceeaaae2c0c Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 12 Jun 2025 11:51:58 -0400
Subject: [PATCH 1/4] mm: Introduce vm_uffd_ops API
Introduce a generic userfaultfd API for vm_operations_struct, so that one vma, especially one backed by a module, can support userfaults without modifying the core files. More importantly, this works even when the module can be compiled out of the kernel.
So, instead of having core mm referencing modules that may not ever exist, we need to have modules opt-in on core mm hooks instead.
After this API is applied, if a module wants to support userfaultfd, it should only need to touch its own file and properly define vm_uffd_ops, instead of changing anything in core mm.
Note that such API will not work for anonymous. Core mm will process anonymous memory separately for userfault operations like before.
This patch only introduces the API alone so that we can start to move existing users over but without breaking them.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h            | 71 +++++++++++++++++++++++++++++++++++
 include/linux/userfaultfd_k.h | 12 ------
 2 files changed, 71 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98a606908307..8dfd83f01d3d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -576,6 +576,70 @@ struct vm_fault {
 	 */
 };

+#ifdef CONFIG_USERFAULTFD
+/* A combined operation mode + behavior flags. */
+typedef unsigned int __bitwise uffd_flags_t;
+
+enum mfill_atomic_mode {
+	MFILL_ATOMIC_COPY,
+	MFILL_ATOMIC_ZEROPAGE,
+	MFILL_ATOMIC_CONTINUE,
+	MFILL_ATOMIC_POISON,
+	NR_MFILL_ATOMIC_MODES,
+};
+
+/* VMA userfaultfd operations */
+typedef struct {
+	/**
+	 * @uffd_features: features supported in bitmask.
+	 *
+	 * When the ops is defined, the driver must set non-zero features
+	 * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
+	 */
+	unsigned long uffd_features;
+	/**
+	 * @uffd_ioctls: ioctls supported in bitmask.
+	 *
+	 * Userfaultfd ioctls supported by the module.  Below will always
+	 * be supported by default whenever a module provides vm_uffd_ops:
+	 *
+	 * _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
+	 *
+	 * The module needs to provide all the rest optionally supported
+	 * ioctls.  For example, when VM_UFFD_MISSING was supported,
+	 * _UFFDIO_COPY must be supported as ioctl, while _UFFDIO_ZEROPAGE
+	 * is optional.
+	 */
+	unsigned long uffd_ioctls;
+	/**
+	 * uffd_get_folio: Handler to resolve UFFDIO_CONTINUE request.
+	 *
+	 * @inode: the inode for folio lookup
+	 * @pgoff: the pgoff of the folio
+	 * @folio: returned folio pointer
+	 *
+	 * Return: zero if succeeded, negative for errors.
+	 */
+	int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
+			      struct folio **folio);
+	/**
+	 * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
+	 *
+	 * @dst_pmd: target pmd to resolve page fault
+	 * @dst_vma: target vma
+	 * @dst_addr: target virtual address
+	 * @src_addr: source address to copy from
+	 * @flags: userfaultfd request flags
+	 * @foliop: previously allocated folio
+	 *
+	 * Return: zero if succeeded, negative for errors.
+	 */
+	int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
+			 unsigned long dst_addr, unsigned long src_addr,
+			 uffd_flags_t flags, struct folio **foliop);
+} vm_uffd_ops;
+#endif
+
 /*
  * These are the virtual MM functions - opening of an area, closing and
  * unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -653,6 +717,13 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+#ifdef CONFIG_USERFAULTFD
+	/*
+	 * Userfaultfd related ops.  Modules need to define this to support
+	 * userfaultfd.
+	 */
+	const vm_uffd_ops *userfaultfd_ops;
+#endif
 };

 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ccad58602846..e79c724b3b95 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -80,18 +80,6 @@ struct userfaultfd_ctx {

 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);

-/* A combined operation mode + behavior flags. */
-typedef unsigned int __bitwise uffd_flags_t;
-
-/* Mutually exclusive modes of operation. */
-enum mfill_atomic_mode {
-	MFILL_ATOMIC_COPY,
-	MFILL_ATOMIC_ZEROPAGE,
-	MFILL_ATOMIC_CONTINUE,
-	MFILL_ATOMIC_POISON,
-	NR_MFILL_ATOMIC_MODES,
-};
-
 #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
 #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
 #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
On 20/06/2025 16:21, Peter Xu wrote:
Hi, Nikita,
On Fri, Jun 20, 2025 at 01:00:24PM +0100, Nikita Kalyazin wrote:
Thanks for explaining that. I played a bit with it myself and it appears to be working for the MISSING mode for both shmem and guest_memfd. Attaching
[1]
my sketch below. Please let me know if this is how you see it.
I found that arguments and return values are significantly different between the two request types, which may be a bit confusing, although we do not expect many callers of those.
Indeed. Actually, since I didn't yet get your reply, early this week I gave it a shot, and I found the same thing - that it'll be nicer to keep the type checks throughout. It'll also be awkward if we want to add MISSING into the picture with the current req() interface (btw, IIUC you meant MINOR above, not MISSING).
I did indeed, sorry for the typo.
Please have a look at what I came up with. I didn't yet get a chance to post it, but it compiles fine and passes the smoke tests here. Feel free to take it over if it makes sense to you, or I can also post it officially after some more tests.
This looks much cleaner. Please go ahead and post it. I will make use of the interface in guest_memfd and run some tests as well meanwhile.
So, ultimately I introduced a vm_uffd_ops to keep all the type checks. I don't like the uffd_copy() interface too much, but it should still be the minimum changeset I can think of to generalize shmem as a userfault user / module, and finally drop the "linux/shmem_fs.h" inclusion in the last patch.
It's also unfortunate that hugetlb won't be able to use the API yet, similar to why hugetlb's fault() BUG()s and is instead hard-coded in handle_mm_fault(). However, it'll at least start to use the rest of the API just fine, so as to generalize some hugetlb checks.
The shmem definition looks like this:
static const vm_uffd_ops shmem_uffd_ops = {
	.uffd_features = __VM_UFFD_FLAGS,
	.uffd_ioctls = BIT(_UFFDIO_COPY) |
		       BIT(_UFFDIO_ZEROPAGE) |
		       BIT(_UFFDIO_WRITEPROTECT) |
		       BIT(_UFFDIO_CONTINUE) |
		       BIT(_UFFDIO_POISON),
	.uffd_get_folio = shmem_uffd_get_folio,
	.uffd_copy = shmem_mfill_atomic_pte,
};
Then guest-memfd can set (1) VM_UFFD_MINOR, (2) _UFFDIO_CONTINUE and provide uffd_get_folio() for supporting MINOR.
Let me know what you think.
Thanks,
===8<===
From ca500177de122d32194f8bf4589faceeaaae2c0c Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 12 Jun 2025 11:51:58 -0400
Subject: [PATCH 1/4] mm: Introduce vm_uffd_ops API
Introduce a generic userfaultfd API for vm_operations_struct, so that one vma, especially one backed by a module, can support userfaults without modifying the core files. More importantly, this works even when the module can be compiled out of the kernel.
So, instead of having core mm referencing modules that may not ever exist, we need to have modules opt-in on core mm hooks instead.
After this API is applied, if a module wants to support userfaultfd, it should only need to touch its own file and properly define vm_uffd_ops, instead of changing anything in core mm.
Note that such API will not work for anonymous. Core mm will process anonymous memory separately for userfault operations like before.
This patch only introduces the API alone so that we can start to move existing users over but without breaking them.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h            | 71 +++++++++++++++++++++++++++++++++++
 include/linux/userfaultfd_k.h | 12 ------
 2 files changed, 71 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98a606908307..8dfd83f01d3d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -576,6 +576,70 @@ struct vm_fault {
 	 */
 };

+#ifdef CONFIG_USERFAULTFD
+/* A combined operation mode + behavior flags. */
+typedef unsigned int __bitwise uffd_flags_t;
+
+enum mfill_atomic_mode {
+	MFILL_ATOMIC_COPY,
+	MFILL_ATOMIC_ZEROPAGE,
+	MFILL_ATOMIC_CONTINUE,
+	MFILL_ATOMIC_POISON,
+	NR_MFILL_ATOMIC_MODES,
+};
+
+/* VMA userfaultfd operations */
+typedef struct {
+	/**
+	 * @uffd_features: features supported in bitmask.
+	 *
+	 * When the ops is defined, the driver must set non-zero features
+	 * to be a subset (or all) of: VM_UFFD_MISSING|WP|MINOR.
+	 */
+	unsigned long uffd_features;
+	/**
+	 * @uffd_ioctls: ioctls supported in bitmask.
+	 *
+	 * Userfaultfd ioctls supported by the module.  Below will always
+	 * be supported by default whenever a module provides vm_uffd_ops:
+	 *
+	 * _UFFDIO_API, _UFFDIO_REGISTER, _UFFDIO_UNREGISTER, _UFFDIO_WAKE
+	 *
+	 * The module needs to provide all the rest optionally supported
+	 * ioctls.  For example, when VM_UFFD_MISSING was supported,
+	 * _UFFDIO_COPY must be supported as ioctl, while _UFFDIO_ZEROPAGE
+	 * is optional.
+	 */
+	unsigned long uffd_ioctls;
+	/**
+	 * uffd_get_folio: Handler to resolve UFFDIO_CONTINUE request.
+	 *
+	 * @inode: the inode for folio lookup
+	 * @pgoff: the pgoff of the folio
+	 * @folio: returned folio pointer
+	 *
+	 * Return: zero if succeeded, negative for errors.
+	 */
+	int (*uffd_get_folio)(struct inode *inode, pgoff_t pgoff,
+			      struct folio **folio);
+	/**
+	 * uffd_copy: Handler to resolve UFFDIO_COPY|ZEROPAGE request.
+	 *
+	 * @dst_pmd: target pmd to resolve page fault
+	 * @dst_vma: target vma
+	 * @dst_addr: target virtual address
+	 * @src_addr: source address to copy from
+	 * @flags: userfaultfd request flags
+	 * @foliop: previously allocated folio
+	 *
+	 * Return: zero if succeeded, negative for errors.
+	 */
+	int (*uffd_copy)(pmd_t *dst_pmd, struct vm_area_struct *dst_vma,
+			 unsigned long dst_addr, unsigned long src_addr,
+			 uffd_flags_t flags, struct folio **foliop);
+} vm_uffd_ops;
+#endif
+
 /*
  * These are the virtual MM functions - opening of an area, closing and
  * unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -653,6 +717,13 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+#ifdef CONFIG_USERFAULTFD
+	/*
+	 * Userfaultfd related ops.  Modules need to define this to support
+	 * userfaultfd.
+	 */
+	const vm_uffd_ops *userfaultfd_ops;
+#endif
 };

 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ccad58602846..e79c724b3b95 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -80,18 +80,6 @@ struct userfaultfd_ctx {

 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);

-/* A combined operation mode + behavior flags. */
-typedef unsigned int __bitwise uffd_flags_t;
-
-/* Mutually exclusive modes of operation. */
-enum mfill_atomic_mode {
-	MFILL_ATOMIC_COPY,
-	MFILL_ATOMIC_ZEROPAGE,
-	MFILL_ATOMIC_CONTINUE,
-	MFILL_ATOMIC_POISON,
-	NR_MFILL_ATOMIC_MODES,
-};
-
 #define MFILL_ATOMIC_MODE_BITS (const_ilog2(NR_MFILL_ATOMIC_MODES - 1) + 1)
 #define MFILL_ATOMIC_BIT(nr) BIT(MFILL_ATOMIC_MODE_BITS + (nr))
 #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
-- 2.49.0
From a7094b86d3308e91ac7ab785b7d71ae6cc4739f4 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 11 Jun 2025 10:18:23 -0400
Subject: [PATCH 2/4] mm/shmem: Support vm_uffd_ops API
Add support for the new vm_uffd_ops API for shmem. Note that this only introduces the support, the API is not yet used by core mm.
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/shmem.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
diff --git a/mm/shmem.c b/mm/shmem.c
index 0bc30dafad90..bd0a29000318 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3151,6 +3151,13 @@ static inline struct inode *shmem_get_inode(struct mnt_idmap *idmap,
 #endif /* CONFIG_TMPFS_QUOTA */

 #ifdef CONFIG_USERFAULTFD
+static int shmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
+				struct folio **folio)
+{
+	return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC);
+}
+
 int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 			   struct vm_area_struct *dst_vma,
 			   unsigned long dst_addr,
@@ -5194,6 +5201,19 @@ static int shmem_error_remove_folio(struct address_space *mapping,
 	return 0;
 }

+#ifdef CONFIG_USERFAULTFD
+static const vm_uffd_ops shmem_uffd_ops = {
+	.uffd_features = __VM_UFFD_FLAGS,
+	.uffd_ioctls = BIT(_UFFDIO_COPY) |
+		       BIT(_UFFDIO_ZEROPAGE) |
+		       BIT(_UFFDIO_WRITEPROTECT) |
+		       BIT(_UFFDIO_CONTINUE) |
+		       BIT(_UFFDIO_POISON),
+	.uffd_get_folio = shmem_uffd_get_folio,
+	.uffd_copy = shmem_mfill_atomic_pte,
+};
+#endif
+
 static const struct address_space_operations shmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 #ifdef CONFIG_TMPFS
@@ -5296,6 +5316,9 @@ static const struct vm_operations_struct shmem_vm_ops = {
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &shmem_uffd_ops,
+#endif
 };

 static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5305,6 +5328,9 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &shmem_uffd_ops,
+#endif
 };

 int shmem_init_fs_context(struct fs_context *fc)
2.49.0
From fab8f8312982619ca80299d6cf35d5661cf61911 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 11 Jun 2025 10:18:40 -0400
Subject: [PATCH 3/4] mm/hugetlb: Support vm_uffd_ops API
Add support for the new vm_uffd_ops API for hugetlb. Note that this only introduces the support, the API is not yet used by core mm.
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d61ec17c15a..b9e473fab871 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5459,6 +5459,22 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
 	return 0;
 }

+#ifdef CONFIG_USERFAULTFD
+static const vm_uffd_ops hugetlb_uffd_ops = {
+	.uffd_features = __VM_UFFD_FLAGS,
+	/* _UFFDIO_ZEROPAGE not supported */
+	.uffd_ioctls = BIT(_UFFDIO_COPY) |
+		       BIT(_UFFDIO_WRITEPROTECT) |
+		       BIT(_UFFDIO_CONTINUE) |
+		       BIT(_UFFDIO_POISON),
+	/*
+	 * Hugetlbfs still has its own hard-coded handler in userfaultfd,
+	 * due to limitations similar to vm_operations_struct.fault().
+	 * TODO: generalize it to use the API functions.
+	 */
+};
+#endif
+
 /*
  * When a new function is introduced to vm_operations_struct and added
  * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
@@ -5472,6 +5488,9 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+#ifdef CONFIG_USERFAULTFD
+	.userfaultfd_ops = &hugetlb_uffd_ops,
+#endif
 };

 static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio,
2.49.0
From de6ac50b189dc16b2d5759f67b32d528a6c9ccde Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 12 Jun 2025 11:55:08 -0400
Subject: [PATCH 4/4] mm: Apply vm_uffd_ops API to core mm
This patch completely moves the old userfaultfd core to use the new vm_uffd_ops API. After this change, existing file systems will start to use the new API for userfault operations.
While at it, move vma_can_userfault() into mm/userfaultfd.c, because it's getting too big. It's only used in slow paths, so it shouldn't be an issue.
This will also remove quite some hard-coded checks for either shmem or hugetlbfs. Now all the old checks should still work but with vm_uffd_ops.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/shmem_fs.h      |  14 -----
 include/linux/userfaultfd_k.h |  46 ++++++----------
 mm/shmem.c                    |   2 +-
 mm/userfaultfd.c              | 115 +++++++++++++++++++++++++---------
 4 files changed, 101 insertions(+), 76 deletions(-)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 6d0f9c599ff7..2f5b7b295cf6 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -195,20 +195,6 @@ static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
 extern bool shmem_charge(struct inode *inode, long pages);
 extern void shmem_uncharge(struct inode *inode, long pages);

-#ifdef CONFIG_USERFAULTFD
-#ifdef CONFIG_SHMEM
-extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
-				  struct vm_area_struct *dst_vma,
-				  unsigned long dst_addr,
-				  unsigned long src_addr,
-				  uffd_flags_t flags,
-				  struct folio **foliop);
-#else /* !CONFIG_SHMEM */
-#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \
-			       src_addr, flags, foliop) ({ BUG(); 0; })
-#endif /* CONFIG_SHMEM */
-#endif /* CONFIG_USERFAULTFD */
-
 /*
  * Used space is stored as unsigned 64-bit value in bytes but
  * quota core supports only signed 64-bit values so use that
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e79c724b3b95..4e56ad423a4a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -85,9 +85,14 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 #define MFILL_ATOMIC_FLAG(nr) ((__force uffd_flags_t) MFILL_ATOMIC_BIT(nr))
 #define MFILL_ATOMIC_MODE_MASK ((__force uffd_flags_t) (MFILL_ATOMIC_BIT(0) - 1))

+static inline enum mfill_atomic_mode uffd_flags_get_mode(uffd_flags_t flags)
+{
+	return (enum mfill_atomic_mode)(flags & MFILL_ATOMIC_MODE_MASK);
+}
+
 static inline bool uffd_flags_mode_is(uffd_flags_t flags, enum mfill_atomic_mode expected)
 {
-	return (flags & MFILL_ATOMIC_MODE_MASK) == ((__force uffd_flags_t) expected);
+	return uffd_flags_get_mode(flags) == expected;
 }

 static inline uffd_flags_t uffd_flags_set_mode(uffd_flags_t flags, enum mfill_atomic_mode mode)
@@ -196,41 +201,16 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return vma->vm_flags & __VM_UFFD_FLAGS;
 }

-static inline bool vma_can_userfault(struct vm_area_struct *vma,
-				     unsigned long vm_flags,
-				     bool wp_async)
+static inline const vm_uffd_ops *vma_get_uffd_ops(struct vm_area_struct *vma)
 {
-	vm_flags &= __VM_UFFD_FLAGS;
-
-	if (vma->vm_flags & VM_DROPPABLE)
-		return false;
-
-	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
-		return false;
-
-	/*
-	 * If wp async enabled, and WP is the only mode enabled, allow any
-	 * memory type.
-	 */
-	if (wp_async && (vm_flags == VM_UFFD_WP))
-		return true;
-
-#ifndef CONFIG_PTE_MARKER_UFFD_WP
-	/*
-	 * If user requested uffd-wp but not enabled pte markers for
-	 * uffd-wp, then shmem & hugetlbfs are not supported but only
-	 * anonymous.
-	 */
-	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
-		return false;
-#endif
-
-	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	    vma_is_shmem(vma);
+	if (vma->vm_ops && vma->vm_ops->userfaultfd_ops)
+		return vma->vm_ops->userfaultfd_ops;
+	return NULL;
 }

+bool vma_can_userfault(struct vm_area_struct *vma,
+		       unsigned long vm_flags, bool wp_async);
+
 static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
 {
 	struct userfaultfd_ctx *uffd_ctx = vma->vm_userfaultfd_ctx.ctx;
diff --git a/mm/shmem.c b/mm/shmem.c
index bd0a29000318..4d71fc7be358 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3158,7 +3158,7 @@ static int shmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
 	return shmem_get_folio(inode, pgoff, 0, folio, SGP_NOALLOC);
 }

-int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
+static int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 			   struct vm_area_struct *dst_vma,
 			   unsigned long dst_addr,
 			   unsigned long src_addr,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 879505c6996f..61783ff2d335 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -14,12 +14,48 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
-#include <linux/shmem_fs.h>
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
 #include "swap.h"

+bool vma_can_userfault(struct vm_area_struct *vma,
+		       unsigned long vm_flags, bool wp_async)
+{
+	unsigned long supported;
+
+	if (vma->vm_flags & VM_DROPPABLE)
+		return false;
+
+	vm_flags &= __VM_UFFD_FLAGS;
+
+#ifndef CONFIG_PTE_MARKER_UFFD_WP
+	/*
+	 * If user requested uffd-wp but not enabled pte markers for
+	 * uffd-wp, then any file system (like shmem or hugetlbfs) are not
+	 * supported but only anonymous.
+	 */
+	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
+		return false;
+#endif
+
+	/*
+	 * If wp async enabled, and WP is the only mode enabled, allow any
+	 * memory type.
+	 */
+	if (wp_async && (vm_flags == VM_UFFD_WP))
+		return true;
+
+	if (vma_is_anonymous(vma))
+		/* Anonymous has no page cache, MINOR not supported */
+		supported = VM_UFFD_MISSING | VM_UFFD_WP;
+	else if (vma_get_uffd_ops(vma))
+		supported = vma_get_uffd_ops(vma)->uffd_features;
+	else
+		return false;
+
+	return !(vm_flags & (~supported));
+}
+
 static __always_inline bool validate_dst_vma(struct vm_area_struct *dst_vma,
 					     unsigned long dst_end)
 {
@@ -384,11 +420,15 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
 	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+	const vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma);
 	struct folio *folio;
 	struct page *page;
 	int ret;

-	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+	if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->uffd_get_folio))
+		return -EINVAL;
+
+	ret = uffd_ops->uffd_get_folio(inode, pgoff, &folio);
 	/* Our caller expects us to return -EFAULT if we failed to find folio */
 	if (ret == -ENOENT)
 		ret = -EFAULT;
@@ -504,18 +544,6 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	u32 hash;
 	struct address_space *mapping;

-	/*
-	 * There is no default zero huge page for all huge page sizes as
-	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
-	 * by THP.  Since we can not reliably insert a zero page, this
-	 * feature is not supported.
-	 */
-	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
-		up_read(&ctx->map_changing_lock);
-		uffd_mfill_unlock(dst_vma);
-		return -EINVAL;
-	}
-
 	src_addr = src_start;
 	dst_addr = dst_start;
 	copied = 0;
@@ -686,14 +714,55 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 			err = mfill_atomic_pte_zeropage(dst_pmd,
 							dst_vma, dst_addr);
 	} else {
-		err = shmem_mfill_atomic_pte(dst_pmd, dst_vma,
-					     dst_addr, src_addr,
-					     flags, foliop);
+		const vm_uffd_ops *uffd_ops = vma_get_uffd_ops(dst_vma);
+
+		if (WARN_ON_ONCE(!uffd_ops || !uffd_ops->uffd_copy)) {
+			err = -EINVAL;
+		} else {
+			err = uffd_ops->uffd_copy(dst_pmd, dst_vma,
+						  dst_addr, src_addr,
+						  flags, foliop);
+		}
 	}

 	return err;
 }

+static inline bool
+vma_uffd_ops_supported(struct vm_area_struct *vma, uffd_flags_t flags)
+{
+	enum mfill_atomic_mode mode = uffd_flags_get_mode(flags);
+	const vm_uffd_ops *uffd_ops;
+	unsigned long uffd_ioctls;
+
+	if ((flags & MFILL_ATOMIC_WP) && !(vma->vm_flags & VM_UFFD_WP))
+		return false;
+
+	/* Anonymous supports everything except CONTINUE */
+	if (vma_is_anonymous(vma))
+		return mode != MFILL_ATOMIC_CONTINUE;
+
+	uffd_ops = vma_get_uffd_ops(vma);
+	if (!uffd_ops)
+		return false;
+
+	uffd_ioctls = uffd_ops->uffd_ioctls;
+	switch (mode) {
+	case MFILL_ATOMIC_COPY:
+		return uffd_ioctls & BIT(_UFFDIO_COPY);
+	case MFILL_ATOMIC_ZEROPAGE:
+		return uffd_ioctls & BIT(_UFFDIO_ZEROPAGE);
+	case MFILL_ATOMIC_CONTINUE:
+		if (!(vma->vm_flags & VM_SHARED))
+			return false;
+		return uffd_ioctls & BIT(_UFFDIO_CONTINUE);
+	case MFILL_ATOMIC_POISON:
+		return uffd_ioctls & BIT(_UFFDIO_POISON);
+	default:
+		return false;
+	}
+}
+
 static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 					    unsigned long dst_start,
 					    unsigned long src_start,
@@ -752,11 +821,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	      dst_vma->vm_flags & VM_SHARED))
 		goto out_unlock;

-	/*
-	 * validate 'mode' now that we know the dst_vma: don't allow
-	 * a wrprotect copy if the userfaultfd didn't register as WP.
-	 */
-	if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
+	if (!vma_uffd_ops_supported(dst_vma, flags))
 		goto out_unlock;

 	/*
@@ -766,12 +831,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		return  mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
 					     src_start, len, flags);

-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
-		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
-		goto out_unlock;
-
 	while (src_addr < src_start + len) {
 		pmd_t dst_pmdval;
-- 2.49.0
-- Peter Xu
The new operation allows decoupling the userfaultfd code from dependencies on VMA types, specifically shmem and hugetlb. The vm_flags bitmap argument is processed with "any" logic: if the VMA type supports any of the flags set, the operation returns true. This avoids multiple calls when checking for __VM_UFFD_FLAGS. (An illustrative sketch of the "any" semantics follows the diff below.)
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 include/linux/mm.h |  5 +++++
 mm/hugetlb.c       |  7 +++++++
 mm/shmem.c         |  8 ++++++++
 3 files changed, 20 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09aeb2c..488d721d8542 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -680,6 +680,11 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+	/*
+	 * True if the VMA supports userfault at least for one of the vm_flags
+	 */
+	bool (*can_userfault)(struct vm_area_struct *vma,
+			      unsigned long vm_flags);
 };

 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c004cfdcd4e2..f3901c11e1fd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5143,6 +5143,12 @@ static unsigned long hugetlb_vm_op_pagesize(struct vm_area_struct *vma)
 	return huge_page_size(hstate_vma(vma));
 }

+static bool hugetlb_vm_op_can_userfault(struct vm_area_struct *vma,
+					unsigned long vm_flags)
+{
+	return true;
+}
+
 /*
  * We cannot handle pagefaults against hugetlb pages at all.  They cause
  * handle_mm_fault() to try to instantiate regular-sized pages in the
@@ -5168,6 +5174,7 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+	.can_userfault = hugetlb_vm_op_can_userfault,
 };

 static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
diff --git a/mm/shmem.c b/mm/shmem.c
index b4159303fe59..0b9e19abd1e9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2891,6 +2891,12 @@ static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
 	return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index);
 }

+static bool shmem_can_userfault(struct vm_area_struct *vma,
+				unsigned long vm_flags)
+{
+	return true;
+}
+
 static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
 			pgoff_t index, unsigned int order, pgoff_t *ilx)
 {
@@ -5309,6 +5315,7 @@ static const struct vm_operations_struct shmem_vm_ops = {
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,
 #endif
+	.can_userfault = shmem_can_userfault,
 };

 static const struct vm_operations_struct shmem_anon_vm_ops = {
@@ -5318,6 +5325,7 @@ static const struct vm_operations_struct shmem_anon_vm_ops = {
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,
 #endif
+	.can_userfault = shmem_can_userfault,
 };

 int shmem_init_fs_context(struct fs_context *fc)
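To illustrate the "any" semantics from the commit message: a hypothetical VMA type that supported MISSING and MINOR but not WP would return true whenever at least one of the queried flags is supported:

static bool example_can_userfault(struct vm_area_struct *vma,
                                  unsigned long vm_flags)
{
        /* "Any" logic: true if any queried flag is supported. */
        return vm_flags & (VM_UFFD_MISSING | VM_UFFD_MINOR);
}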
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 include/linux/userfaultfd_k.h | 13 ++++++-------
 mm/userfaultfd.c              | 10 +++++++---
 2 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 75342022d144..64551e8a55fb 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -221,8 +221,8 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 	if (vm_flags & VM_DROPPABLE)
 		return false;

-	if ((vm_flags & VM_UFFD_MINOR) &&
-	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+	if (!vma->vm_ops->can_userfault ||
+	    !vma->vm_ops->can_userfault(vma, VM_UFFD_MINOR))
 		return false;

 	/*
@@ -235,16 +235,15 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 #ifndef CONFIG_PTE_MARKER_UFFD_WP
 	/*
 	 * If user requested uffd-wp but not enabled pte markers for
-	 * uffd-wp, then shmem & hugetlbfs are not supported but only
-	 * anonymous.
+	 * uffd-wp, then only anonymous is supported.
 	 */
 	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
 		return false;
 #endif

-	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	       vma_is_shmem(vma);
+	return vma_is_anonymous(vma) ||
+	       (vma->vm_ops->can_userfault &&
+		vma->vm_ops->can_userfault(vma, vm_flags));
 }

 static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct *vma)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 4b3dbc7dac64..0aa82c968e16 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -728,6 +728,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	unsigned long src_addr, dst_addr;
 	long copied;
 	struct folio *folio;
+	bool can_userfault;

 	/*
 	 * Sanitize the command parameters:
@@ -787,10 +788,13 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
 					    src_start, len, flags);

-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
+	can_userfault = dst_vma->vm_ops->can_userfault &&
+			dst_vma->vm_ops->can_userfault(dst_vma, __VM_UFFD_FLAGS);
+
+	if (!vma_is_anonymous(dst_vma) && !can_userfault)
 		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+
+	if (!can_userfault && uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
 		goto out_unlock;
while (src_addr < src_start + len) {
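To illustrate the contract of the new hook: an implementation returns true only for the userfaultfd modes its mappings can actually service. A minimal sketch for a hypothetical out-of-tree user (the names below are illustrative, not part of the series) might look like:

	/* Hypothetical driver opting in to userfaultfd minor faults only. */
	static bool example_can_userfault(struct vm_area_struct *vma,
					  unsigned long vm_flags)
	{
		return vm_flags & VM_UFFD_MINOR;
	}

	static const struct vm_operations_struct example_vm_ops = {
		.fault		= example_fault,	/* hypothetical fault handler */
		.can_userfault	= example_can_userfault,
	};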
Add support for sending a pagefault event if userfaultfd is registered. Only the minor pagefault event is currently supported.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 virt/kvm/guest_memfd.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fbf89e643add..096d89e7282d 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,9 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#ifdef CONFIG_KVM_PRIVATE_MEM
+#include <linux/userfaultfd_k.h>
+#endif /* CONFIG_KVM_PRIVATE_MEM */

 #include "kvm_mm.h"

@@ -380,6 +383,13 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 		kvm_gmem_mark_prepared(folio);
 	}

+	if (userfaultfd_minor(vmf->vma) &&
+	    !(vmf->flags & FAULT_FLAG_USERFAULT_CONTINUE)) {
+		folio_unlock(folio);
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+		return handle_userfault(vmf, VM_UFFD_MINOR);
+	}
+
 	vmf->page = folio_file_page(folio, vmf->pgoff);
out_folio:
On Fri, Apr 04, 2025 at 03:43:50PM +0000, Nikita Kalyazin wrote:
Add support for sending a pagefault event if userfaultfd is registered. Only the minor pagefault event is currently supported.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 virt/kvm/guest_memfd.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fbf89e643add..096d89e7282d 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,9 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#ifdef CONFIG_KVM_PRIVATE_MEM
+#include <linux/userfaultfd_k.h>
+#endif /* CONFIG_KVM_PRIVATE_MEM */

 #include "kvm_mm.h"

@@ -380,6 +383,13 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 		kvm_gmem_mark_prepared(folio);
 	}
+	if (userfaultfd_minor(vmf->vma) &&
+	    !(vmf->flags & FAULT_FLAG_USERFAULT_CONTINUE)) {
+		folio_unlock(folio);
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+		return handle_userfault(vmf, VM_UFFD_MINOR);
+	}
Hmm, with your current approach, doesn't guest-memfd at least need to define the new can_userfault() hook?
Meanwhile, we have some hard-coded lines so far, like:
mfill_atomic():

	if (!vma_is_shmem(dst_vma) &&
	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
		goto out_unlock;
I thought that would already make a CONTINUE request fail on guest-memfd, and it doesn't seem to be touched yet in this series.

I'm not yet sure how the test worked without hitting issues like that. It's highly likely I missed something; some explanation would be welcome.
Thanks,
On 10/06/2025 23:25, Peter Xu wrote:
On Fri, Apr 04, 2025 at 03:43:50PM +0000, Nikita Kalyazin wrote:
Add support for sending a pagefault event if userfaultfd is registered. Only the minor pagefault event is currently supported.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 virt/kvm/guest_memfd.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fbf89e643add..096d89e7282d 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,9 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#ifdef CONFIG_KVM_PRIVATE_MEM
+#include <linux/userfaultfd_k.h>
+#endif /* CONFIG_KVM_PRIVATE_MEM */

 #include "kvm_mm.h"

@@ -380,6 +383,13 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 		kvm_gmem_mark_prepared(folio);
 	}
+	if (userfaultfd_minor(vmf->vma) &&
+	    !(vmf->flags & FAULT_FLAG_USERFAULT_CONTINUE)) {
+		folio_unlock(folio);
+		filemap_invalidate_unlock_shared(inode->i_mapping);
+		return handle_userfault(vmf, VM_UFFD_MINOR);
+	}
Hmm, with your current approach, doesn't guest-memfd at least need to define the new can_userfault() hook?
Meanwhile, we have some hard-coded lines so far, like:
mfill_atomic():

	if (!vma_is_shmem(dst_vma) &&
	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
		goto out_unlock;
I thought that would already make a CONTINUE request fail on guest-memfd, and it doesn't seem to be touched yet in this series.

I'm not yet sure how the test worked without hitting issues like that. It's highly likely I missed something; some explanation would be welcome.
Yes, soon after I sent the series I realised that I'd failed to post this part, but I refrained from sending a new version because the upstream consensus was to review/merge the mmap support in guest_memfd [1] before continuing to build on top of it. This is the missing part I planned to include in the next version. Sorry for the confusion.
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 64551e8a55fb..080437fa7eab 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -221,8 +221,10 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 	if (vm_flags & VM_DROPPABLE)
 		return false;

-	if (!vma->vm_ops->can_userfault ||
-	    !vma->vm_ops->can_userfault(vma, VM_UFFD_MINOR))
+	if ((vm_flags & VM_UFFD_MINOR) &&
+	    (!vma->vm_ops ||
+	     !vma->vm_ops->can_userfault ||
+	     !vma->vm_ops->can_userfault(vma, VM_UFFD_MINOR)))
 		return false;

 	/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0aa82c968e16..638360a78561 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -788,7 +788,9 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
 					    src_start, len, flags);

-	can_userfault = dst_vma->vm_ops->can_userfault &&
+	can_userfault =
+		dst_vma->vm_ops &&
+		dst_vma->vm_ops->can_userfault &&
 		dst_vma->vm_ops->can_userfault(dst_vma, __VM_UFFD_FLAGS);

 	if (!vma_is_anonymous(dst_vma) && !can_userfault)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 91ee5dd91c31..202b12dc4b6f 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -420,8 +420,15 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 	return ret;
 }

+static bool kvm_gmem_can_userfault(struct vm_area_struct *vma,
+				   unsigned long vm_flags)
+{
+	return vm_flags & VM_UFFD_MINOR;
+}
+
 static const struct vm_operations_struct kvm_gmem_vm_ops = {
-	.fault = kvm_gmem_fault,
+	.fault		= kvm_gmem_fault,
+	.can_userfault	= kvm_gmem_can_userfault,
 };
static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
[1] https://lore.kernel.org/kvm/20250605153800.557144-1-tabba@google.com/
Thanks,
-- Peter Xu
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 fs/userfaultfd.c                 | 3 ++-
 include/uapi/linux/userfaultfd.h | 8 +++++++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 97c4d71115d8..32152bfa462a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1954,7 +1954,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	uffdio_api.features = UFFD_API_FEATURES;
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 	uffdio_api.features &=
-		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
+		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM |
+		  UFFD_FEATURE_MINOR_GUEST_MEMFD);
 #endif
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 2841e4ea8f2c..ed688797eba7 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -42,7 +42,8 @@
 			   UFFD_FEATURE_WP_UNPOPULATED |	\
 			   UFFD_FEATURE_POISON |		\
 			   UFFD_FEATURE_WP_ASYNC |		\
-			   UFFD_FEATURE_MOVE)
+			   UFFD_FEATURE_MOVE |			\
+			   UFFD_FEATURE_MINOR_GUEST_MEMFD)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -230,6 +231,10 @@ struct uffdio_api {
 	 *
 	 * UFFD_FEATURE_MOVE indicates that the kernel supports moving an
 	 * existing page contents from userspace.
+	 *
+	 * UFFD_FEATURE_MINOR_GUEST_MEMFD indicates the same support as
+	 * UFFD_FEATURE_MINOR_HUGETLBFS, but for guest_memfd-backed pages
+	 * instead.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -248,6 +253,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_POISON			(1<<14)
 #define UFFD_FEATURE_WP_ASYNC			(1<<15)
 #define UFFD_FEATURE_MOVE			(1<<16)
+#define UFFD_FEATURE_MINOR_GUEST_MEMFD		(1<<17)
 	__u64 features;
__u64 ioctls;
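From the userspace side, the new flag is negotiated through the usual UFFDIO_API handshake, which fails if the kernel lacks a requested feature. A minimal sketch of the probe (this mirrors what the selftest below does; the function name is illustrative):

	/* Sketch: open a userfaultfd with minor-fault support for guest_memfd.
	 * Needs <fcntl.h>, <unistd.h>, <sys/ioctl.h>, <sys/syscall.h> and
	 * <linux/userfaultfd.h>. Returns the uffd on success, -1 otherwise.
	 */
	static int open_uffd_minor_guest_memfd(void)
	{
		struct uffdio_api api = {
			.api = UFFD_API,
			.features = UFFD_FEATURE_MINOR_GUEST_MEMFD,
		};
		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

		if (uffd < 0)
			return -1;
		/* UFFDIO_API rejects the request if the feature is unsupported. */
		if (ioctl(uffd, UFFDIO_API, &api) < 0) {
			close(uffd);
			return -1;
		}
		return uffd;
	}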
The test demonstrates that a minor userfaultfd event in guest_memfd can be resolved via a memcpy followed by a UFFDIO_CONTINUE ioctl.
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 .../testing/selftests/kvm/guest_memfd_test.c | 99 +++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 38c501e49e0e..9a06c2486218 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -10,12 +10,16 @@
 #include <errno.h>
 #include <stdio.h>
 #include <fcntl.h>
+#include <pthread.h>

 #include <linux/bitmap.h>
 #include <linux/falloc.h>
+#include <linux/userfaultfd.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>

 #include "kvm_util.h"
 #include "test_util.h"
@@ -206,6 +210,98 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
 	close(fd1);
 }

+struct fault_args {
+	char *addr;
+	volatile char value;
+};
+
+static void *fault_thread_fn(void *arg)
+{
+	struct fault_args *args = arg;
+
+	/* Trigger page fault */
+	args->value = *args->addr;
+	return NULL;
+}
+
+static void test_uffd_minor(int fd, size_t page_size, size_t total_size)
+{
+	struct uffdio_register uffd_reg;
+	struct uffdio_continue uffd_cont;
+	struct uffd_msg msg;
+	struct fault_args args;
+	pthread_t fault_thread;
+	void *mem, *mem_nofault, *buf = NULL;
+	int uffd, ret;
+	off_t offset = page_size;
+	void *fault_addr;
+
+	ret = posix_memalign(&buf, page_size, total_size);
+	TEST_ASSERT_EQ(ret, 0);
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
+	TEST_ASSERT(uffd != -1, "userfaultfd creation should succeed");
+
+	struct uffdio_api uffdio_api = {
+		.api = UFFD_API,
+		.features = UFFD_FEATURE_MINOR_GUEST_MEMFD,
+	};
+	ret = ioctl(uffd, UFFDIO_API, &uffdio_api);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_API) should succeed");
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should succeed");
+
+	mem_nofault = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem_nofault != MAP_FAILED, "mmap should succeed");
+
+	uffd_reg.range.start = (unsigned long)mem;
+	uffd_reg.range.len = total_size;
+	uffd_reg.mode = UFFDIO_REGISTER_MODE_MINOR;
+	ret = ioctl(uffd, UFFDIO_REGISTER, &uffd_reg);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_REGISTER) should succeed");
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+			offset, page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) should succeed");
+
+	fault_addr = mem + offset;
+	args.addr = fault_addr;
+
+	ret = pthread_create(&fault_thread, NULL, fault_thread_fn, &args);
+	TEST_ASSERT(ret == 0, "pthread_create should succeed");
+
+	ret = read(uffd, &msg, sizeof(msg));
+	TEST_ASSERT(ret != -1, "read from userfaultfd should succeed");
+	TEST_ASSERT(msg.event == UFFD_EVENT_PAGEFAULT, "event type should be pagefault");
+	TEST_ASSERT((void *)(msg.arg.pagefault.address & ~(page_size - 1)) == fault_addr,
+		    "pagefault should occur at expected address");
+
+	memcpy(mem_nofault + offset, buf + offset, page_size);
+
+	uffd_cont.range.start = (unsigned long)fault_addr;
+	uffd_cont.range.len = page_size;
+	uffd_cont.mode = 0;
+	ret = ioctl(uffd, UFFDIO_CONTINUE, &uffd_cont);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_CONTINUE) should succeed");
+
+	TEST_ASSERT(args.value == *(char *)(mem_nofault + offset),
+		    "memory should contain the value that was copied");
+	TEST_ASSERT(args.value == *(char *)(mem + offset),
+		    "no further fault is expected");
+
+	ret = pthread_join(fault_thread, NULL);
+	TEST_ASSERT(ret == 0, "pthread_join should succeed");
+
+	ret = munmap(mem_nofault, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+
+	ret = munmap(mem, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+	free(buf);
+	close(uffd);
+}
+
 unsigned long get_shared_type(void)
 {
 #ifdef __x86_64__
@@ -244,6 +340,9 @@ void test_vm_type(unsigned long type, bool is_shared)
 	test_fallocate(fd, page_size, total_size);
 	test_invalid_punch_hole(fd, page_size, total_size);

+	if (is_shared)
+		test_uffd_minor(fd, page_size, total_size);
+
 	close(fd);
 	kvm_vm_release(vm);
 }
On Fri, Apr 04, 2025 at 03:43:46PM +0000, Nikita Kalyazin wrote:
This series is built on top of Fuad's v7 "mapping guest_memfd backed memory at the host" [1].
Hm, if this is based on an unmerged series, this seems quite speculative and should maybe be an RFC? I mean, that series at least still seems quite under discussion / to be experiencing issues?
Maybe worth RFC'ing until that one settles down first to avoid complexity in review/application to tree?
Thanks!
On 04/04/2025 17:33, Lorenzo Stoakes wrote:
On Fri, Apr 04, 2025 at 03:43:46PM +0000, Nikita Kalyazin wrote:
This series is built on top of Fuad's v7 "mapping guest_memfd backed memory at the host" [1].
Hm, if this is based on an unmerged series, this seems quite speculative and should maybe be an RFC? I mean, that series at least still seems quite under discussion / to be experiencing issues?
Maybe worth RFC'ing until that one settles down first to avoid complexity in review/application to tree?
Hi,
I dropped the RFC tag because I saw similar examples before, but I'm happy to bring it back next time if the dependency is not merged by then.
Thanks!
Thanks!
On Fri, Apr 04, 2025 at 05:56:58PM +0100, Nikita Kalyazin wrote:
On 04/04/2025 17:33, Lorenzo Stoakes wrote:
On Fri, Apr 04, 2025 at 03:43:46PM +0000, Nikita Kalyazin wrote:
This series is built on top of Fuad's v7 "mapping guest_memfd backed memory at the host" [1].
Hm, if this is based on an unmerged series, this seems quite speculative and should maybe be an RFC? I mean, that series at least still seems quite under discussion / to be experiencing issues?
Maybe worth RFC'ing until that one settles down first to avoid complexity in review/application to tree?
Hi,
I dropped the RFC tag because I saw similar examples before, but I'm happy to bring it back next time if the dependency is not merged by then.
Yeah really sorry to be a pain haha, I realise this particular situation is a bit unclear, but I think just for the sake of getting our ducks in a row and ensuring things are settled on the baseline (and it's sort of a fairly big baseline), it'd be best to bring it back!
Thanks!
Thanks!
Cheers!
+To authors of v7 series referenced in [1]
* Nikita Kalyazin kalyazin@amazon.com [250404 11:44]:
This series is built on top of Fuad's v7 "mapping guest_memfd backed memory at the host" [1].
I didn't see their addresses in the to/cc, so I added them to my response as I reference the v7 patch set below.
With James's KVM userfault [2], it is possible to handle stage-2 faults in guest_memfd in userspace. However, KVM itself also triggers faults in guest_memfd in some cases, for example: PV interfaces like kvmclock, PV EOI and page table walking code when fetching the MMIO instruction on x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3] that KVM would be accessing those pages via userspace page tables.
Thanks for being open about the technical call, but it would be better to capture the reasons and not the call date. I explain why in the linking section as well.
In order for such faults to be handled in userspace, guest_memfd needs to support userfaultfd.
Changes since v2 [4]:
- James: Fix sgp type when calling shmem_get_folio_gfp
- James: Improved vm_ops->fault() error handling
- James: Add and make use of the can_userfault() VMA operation
- James: Add UFFD_FEATURE_MINOR_GUEST_MEMFD feature flag
- James: Fix typos and add more checks in the test
Nikita
Please slow down...
This patch is at v3; the v7 patch that you are building off has lockdep issues [1] reported by one of the authors, and (sorry for sounding harsh about the v7 of that patch) the cover letter reads a bit more like an RFC than a set ready to go into linux-mm.
Maybe the lockdep issue is just a patch ordering thing or removed in a later patch set, but that's not mentioned in the discovery email?
What exactly is the goal here and the path forward for the rest of us trying to build on this once it's in mm-new/mm-unstable?
Note that mm-unstable is shared with a lot of other people through linux-next, and we are really trying to stop breaking stuff on them.
Obviously v7 cannot go in until it works with lockdep - otherwise none of us can use lockdep, which is not okay.

Also, I am concerned about the amount of testing of the v7 and v3 patch sets that did not bring up a lockdep issue.
[1] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/ [2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/T... [3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAos...
If there is anything we need to know about the decisions in the call and that document, can you please pull it into this change log?
I don't think anyone can ensure Google will not rename docs to some other office theme tomorrow - as they famously ditch basically every name and application.

Also, most of the community does not want to go to a 17-page (and growing) spreadsheet to hunt down the facts when there is an acceptable and ideal place to document them in git. It's another barrier of entry to reviewing your code as well.
But please, don't take this suggestion as carte blanche for copying a conversation from the doc; just give us the technical reasons for your decisions as briefly as possible.
[4] https://lore.kernel.org/kvm/20250402160721.97596-1-kalyazin@amazon.com/T/
[1]. https://lore.kernel.org/all/diqz1puanquh.fsf@ackerleytng-ctop.c.googlers.com...
Thanks, Liam
On 04/04/2025 18:12, Liam R. Howlett wrote:
+To authors of v7 series referenced in [1]
- Nikita Kalyazin kalyazin@amazon.com [250404 11:44]:
This series is built on top of Fuad's v7 "mapping guest_memfd backed memory at the host" [1].
I didn't see their addresses in the to/cc, so I added them to my response as I reference the v7 patch set below.
Hi Liam,
Thanks for the feedback and for extending the list.
With James's KVM userfault [2], it is possible to handle stage-2 faults in guest_memfd in userspace. However, KVM itself also triggers faults in guest_memfd in some cases, for example: PV interfaces like kvmclock, PV EOI and page table walking code when fetching the MMIO instruction on x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3] that KVM would be accessing those pages via userspace page tables.
Thanks for being open about the technical call, but it would be better to capture the reasons and not the call date. I explain why in the linking section as well.
Thanks for bringing that up. The document mostly contains the decision itself. The main alternative considered previously was a temporary reintroduction of the pages to the direct map whenever a KVM-internal access is required. It came with significant complexity in guaranteeing correctness in all cases [1]. Since the memslot structure already contains a guest memory pointer supplied by userspace, KVM can use it directly when in the VMM or vCPU context. I will add this in the cover for the next version.
[1] https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#...
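To make the trade-off concrete: KVM already reads guest memory through the memslot's userspace address in its accessor helpers, and the idea is for guest_memfd-backed slots to take the same route rather than the direct map. A simplified sketch of that existing pattern (not code from this series; the function name is illustrative):

	static int read_guest_via_user_mapping(struct kvm_memory_slot *slot,
					       gfn_t gfn, void *data,
					       int offset, int len)
	{
		unsigned long hva = gfn_to_hva_memslot(slot, gfn);

		if (kvm_is_error_hva(hva))
			return -EFAULT;
		/* A fault taken here can be routed to userspace via userfaultfd. */
		if (__copy_from_user(data, (void __user *)hva + offset, len))
			return -EFAULT;
		return 0;
	}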
...
Please slow down...
This patch is at v3; the v7 patch that you are building off has lockdep issues [1] reported by one of the authors, and (sorry for sounding harsh about the v7 of that patch) the cover letter reads a bit more like an RFC than a set ready to go into linux-mm.
AFAIK the lockdep issue was reported on a v7 of a different change. I'm basing my series on [2] ("KVM: Mapping guest_memfd backed memory at the host for software protected VMs"), while the issue was reported on [2] ("KVM: Restricted mapping of guest_memfd at the host and arm64 support"), which is also built on top of [2]. Please correct me if I'm missing something.
The key feature that is required by my series is the ability to mmap guest_memfd when the VM type allows it. My understanding is that no-one is opposed to that as of now, which is why I assumed it was safe to build on top of it.
[2] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/ [3] https://lore.kernel.org/all/diqz1puanquh.fsf@ackerleytng-ctop.c.googlers.com...
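For reference, the dependency boils down to making the following sequence work: creating a guest_memfd with the existing KVM_CREATE_GUEST_MEMFD uAPI and then mapping it at the host, which is the capability Fuad's series adds. A hedged sketch (vm_fd and guest_mem_size are illustrative placeholders):

	struct kvm_create_guest_memfd gmem = {
		.size = guest_mem_size,
		.flags = 0,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	/* Host mapping of guest_memfd: the part this series builds on. */
	void *host_mem = mmap(NULL, guest_mem_size, PROT_READ | PROT_WRITE,
			      MAP_SHARED, gmem_fd, 0);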
* Nikita Kalyazin kalyazin@amazon.com [250407 07:04]:
On 04/04/2025 18:12, Liam R. Howlett wrote:
+To authors of v7 series referenced in [1]
- Nikita Kalyazin kalyazin@amazon.com [250404 11:44]:
This series is built on top of Fuad's v7 "mapping guest_memfd backed memory at the host" [1].
I didn't see their addresses in the to/cc, so I added them to my response as I reference the v7 patch set below.
Hi Liam,
Thanks for the feedback and for extending the list.
With James's KVM userfault [2], it is possible to handle stage-2 faults in guest_memfd in userspace. However, KVM itself also triggers faults in guest_memfd in some cases, for example: PV interfaces like kvmclock, PV EOI and page table walking code when fetching the MMIO instruction on x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3] that KVM would be accessing those pages via userspace page tables.
Thanks for being open about the technical call, but it would be better to capture the reasons and not the call date. I explain why in the linking section as well.
Thanks for bringing that up. The document mostly contains the decision itself. The main alternative considered previously was a temporary reintroduction of the pages to the direct map whenever a KVM-internal access is required. It came with significant complexity in guaranteeing correctness in all cases [1]. Since the memslot structure already contains a guest memory pointer supplied by userspace, KVM can use it directly when in the VMM or vCPU context. I will add this in the cover for the next version.
Thank you.
...
Please slow down...
This patch is at v3; the v7 patch that you are building off has lockdep issues [1] reported by one of the authors, and (sorry for sounding harsh about the v7 of that patch) the cover letter reads a bit more like an RFC than a set ready to go into linux-mm.
AFAIK the lockdep issue was reported on a v7 of a different change. I'm basing my series on [2] ("KVM: Mapping guest_memfd backed memory at the host for software protected VMs"), while the issue was reported on [2] ("KVM: Restricted mapping of guest_memfd at the host and arm64 support"), which is also built on top of [2]. Please correct me if I'm missing something.
I think you messed up the numbering in your statement above.
I believe you are making the point that I messed up which patches depend on what and your code does not depend on faulty locking, which appears to be the case.
There are a few issues with the required patch set?
The key feature that is required by my series is the ability to mmap guest_memfd when the VM type allows it. My understanding is that no-one is opposed to that as of now, which is why I assumed it was safe to build on top of it.
[2] https://lore.kernel.org/kvm/20250318161823.4005529-1-tabba@google.com/T/ [3] https://lore.kernel.org/all/diqz1puanquh.fsf@ackerleytng-ctop.c.googlers.com...
All of this is extremely confusing because the onus of figuring out what the final code will look like is put on the reviewer. As it is, we have issues with people not doing enough review of the code (due to limited time). One way to get reviews is to make the barrier of entry as low as possible.
I spent Friday going down a rabbit hole of patches referring to each other as dependencies and I gave up. It looks like I mistook one set of patches as required vs them requiring the same in-flight ones as your patches.
I am struggling to see how we can adequately support all of you given the way the patches are sent out in batches with dependencies - it is just too time consuming to sort out.
Thank you, Liam
On 07/04/2025 14:40, Liam R. Howlett wrote:
- Nikita Kalyazin kalyazin@amazon.com [250407 07:04]:
On 04/04/2025 18:12, Liam R. Howlett wrote:
+To authors of v7 series referenced in [1]
- Nikita Kalyazin kalyazin@amazon.com [250404 11:44]:
This series is built on top of Fuad's v7 "mapping guest_memfd backed memory at the host" [1].
I didn't see their addresses in the to/cc, so I added them to my response as I reference the v7 patch set below.
Hi Liam,
Thanks for the feedback and for extending the list.
...
Please slow down...
This patch is at v3; the v7 patch that you are building off has lockdep issues [1] reported by one of the authors, and (sorry for sounding harsh about the v7 of that patch) the cover letter reads a bit more like an RFC than a set ready to go into linux-mm.
AFAIK the lockdep issue was reported on a v7 of a different change. I'm basing my series on [2] ("KVM: Mapping guest_memfd backed memory at the host for software protected VMs"), while the issue was reported on [2] ("KVM: Restricted mapping of guest_memfd at the host and arm64 support"), which is also built on top of [2]. Please correct me if I'm missing something.
I think you messed up the numbering in your statement above.
I did, in an attempt to make it "even more clear" :) Sorry about that, glad you got the intention.
I believe you are making the point that I messed up which patches depend on what and your code does not depend on faulty locking, which appears to be the case.
There are a few issues with the required patch set?
There are indeed, but not in the part this series depends on, as far as I can see.
...
All of this is extremely confusing because the onus of figuring out what the final code will look like is put on the reviewer. As it is, we have issues with people not doing enough review of the code (due to limited time). One way to get reviews is to make the barrier of entry as low as possible.
I spent Friday going down a rabbit hole of patches referring to each other as dependencies and I gave up. It looks like I mistook one set of patches as required vs them requiring the same in-flight ones as your patches.
I am struggling to see how we can adequately support all of you given the way the patches are sent out in batches with dependencies - it is just too time consuming to sort out.
I'm happy to do whatever I can to make the review easier. I suppose the extreme case is to wait for the dependencies to get accepted, effectively serialising submissions, but that slows the process down significantly. For example, I received very good feedback on v1 and v2 of this series and was able to address it instead of waiting for the dependency. Would including the required patches directly in the series help? My only concern is that in that case the same patch will be submitted multiple times (as part of every depending series), but if that's better, I'll do that instead.
Thank you, Liam
* Nikita Kalyazin kalyazin@amazon.com [250407 10:05]:
...
All of this is extremely confusing because the onus of figuring out what the final code will look like is put on the reviewer. As it is, we have issues with people not doing enough review of the code (due to limited time). One way to get reviews is to make the barrier of entry as low as possible.
I spent Friday going down a rabbit hole of patches referring to each other as dependencies and I gave up. It looks like I mistook one set of patches as required vs them requiring the same in-flight ones as your patches.
I am struggling to see how we can adequately support all of you given the way the patches are sent out in batches with dependencies - it is just too time consuming to sort out.
I'm happy to do whatever I can to make the review easier. I suppose the extreme case is to wait for the dependencies to get accepted, effectively serialising submissions, but that slows the process down significantly. For example, I received very good feedback on v1 and v2 of this series and was able to address it instead of waiting for the dependency. Would including the required patches directly in the series help? My only concern is that in that case the same patch will be submitted multiple times (as part of every depending series), but if that's better, I'll do that instead.
Don't resend patches that someone else is upstreaming, that'll cause other problems.
Three methods come to mind:
1. As you stated, wait for the dependencies to land. This will mean that what you are working against is well tested and won't change (and you won't have to re-spin due to an unstable base).
2. Combine them into a bigger patch set. I can then pull one patch set and look at the parts of interest to the mm side.
3. Provide a git repo with the necessary changes together.
I think 2 and 3 together should be used for the guest_memfd patches. Someone needs to be managing these to send upstream. See the discussion in another patch set on guest_memfd here [1].
As this is not based on fully upstream patches, this should be marked as RFC, imo.
Thanks, Liam
[1]. https://lore.kernel.org/all/aizia2elwspxcmfrjote5h7k5wdw2stp42slytkl5visrjvz...
On 07.04.25 16:24, Liam R. Howlett wrote:
- Nikita Kalyazin kalyazin@amazon.com [250407 10:05]:
...
All of this is extremely confusing because the onus of figuring out what the final code will look like is put on the reviewer. As it is, we have issues with people not doing enough review of the code (due to limited time). One way to get reviews is to make the barrier of entry as low as possible.
I spent Friday going down a rabbit hole of patches referring to each other as dependencies and I gave up. It looks like I mistook one set of patches as required vs them requiring the same in-flight ones as your patches.
I am struggling to see how we can adequately support all of you given the way the patches are sent out in batches with dependencies - it is just too time consuming to sort out.
I'm happy to do whatever I can to make the review easier. I suppose the extreme case is to wait for the dependencies to get accepted, effectively serialising submissions, but that slows the process down significantly. For example, I received very good feedback on v1 and v2 of this series and was able to address it instead of waiting for the dependency. Would including the required patches directly in the series help? My only concern is that in that case the same patch will be submitted multiple times (as part of every depending series), but if that's better, I'll do that instead.
Don't resend patches that someone else is upstreaming, that'll cause other problems.
Three methods come to mind:
- As you stated, wait for the dependencies to land. This will mean that what you are working against is well tested and won't change (and you won't have to re-spin due to an unstable base).
- Combine them into a bigger patch set. I can then pull one patch set and look at the parts of interest to the mm side.
- Provide a git repo with the necessary changes together.
I think 2 and 3 together should be used for the guest_memfd patches. Someone needs to be managing these to send upstream. See the discussion in another patch set on guest_memfd here [1].
The issue is that most extensions are fairly independent from each other, except that they all build on Fuad's mmap support.

Sending all together as one thing might not be the best option.

Once basic mmap support is upstream, some of the extensions (e.g., directmap removal) can go in next.

So until that is upstream, I agree that tagging the stuff that builds on that is the right thing to do, and providing git trees is another very good idea.
I'll prioritize getting Fuad's mmap stuff reviewed. (I keep saying that, I know)
On Mon, Apr 07, 2025 at 04:46:48PM +0200, David Hildenbrand wrote:
...
The issue is that most extensions are fairly independent from each other, except that they all build on Fuad's mmap support.

Sending all together as one thing might not be the best option.

Once basic mmap support is upstream, some of the extensions (e.g., directmap removal) can go in next.

So until that is upstream, I agree that tagging the stuff that builds on that is the right thing to do, and providing git trees is another very good idea.
I'll prioritize getting Fuad's mmap stuff reviewed. (I keep saying that, I know)
Which series is this? Sorry, maybe I've lost track of this one.
-- Cheers,
David / dhildenb
On 07.04.25 17:14, Lorenzo Stoakes wrote:
On Mon, Apr 07, 2025 at 04:46:48PM +0200, David Hildenbrand wrote:
...
The issue is that most extensions are fairly independent from each other, except that they all build on Fuad's mmap support.

Sending all together as one thing might not be the best option.

Once basic mmap support is upstream, some of the extensions (e.g., directmap removal) can go in next.

So until that is upstream, I agree that tagging the stuff that builds on that is the right thing to do, and providing git trees is another very good idea.
I'll prioritize getting Fuad's mmap stuff reviewed. (I keep saying that, I know)
Which series is this? Sorry, maybe I've lost track of this one.
Heh, not your fault :)
The most important one for basic mmap support is "KVM: Mapping guest_memfd backed memory at the host for software protected VMs" [1]. Some stuff (e.g., direct map removal) should be able to make progress once that has landed.
I do expect the MM-specific patch in there ("mm: Consolidate freeing of typed folios on final folio_put()") to not be included as part of that work.
[I shared the feedback from the LSF/MM session in the upstream guest_memfd call, and we decided to minimize the usage of the folio_put() callback to where absolutely required; that will simplify things and avoid issues as pointed out by Willy, which is great]
The next important one will be "[PATCH v7 0/7] KVM: Restricted mapping of guest_memfd at the host and arm64 support" [2], but I similarly expect a simplification as we try moving away from folio_put() for the "shared <-> private" page conversion case.
So I expect a v8 of [1] (and also that [2] will need to be updated).
@Fuad, please let me know if I am wrong.
[1] https://lore.kernel.org/all/20250318161823.4005529-1-tabba@google.com/T/#u [2] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/
On Mon, Apr 07, 2025 at 04:46:48PM +0200, David Hildenbrand wrote:
...
The issue is that most extensions are fairly independent from each other, except that they all build on Fuad's mmap support.

Sending all together as one thing might not be the best option.

Once basic mmap support is upstream, some of the extensions (e.g., directmap removal) can go in next.

So until that is upstream, I agree that tagging the stuff that builds on that is the right thing to do, and providing git trees is another very good idea.
I'll prioritize getting Fuad's mmap stuff reviewed. (I keep saying that, I know)
Fwiw, b4 allows you to specify dependencies so you can b4 shazam/am and it will pull in all prerequisite patches:
b4 prep --edit-deps
    Edit the series dependencies in your defined $EDITOR (or core.editor)
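For example, assuming current b4 behaviour, the dependency on Fuad's series could be recorded as prerequisite trailers; the message-id and base-commit below are taken from this thread, and the exact trailer names are b4's, not this series':

	$ b4 prep --edit-deps
	# opens $EDITOR on entries along the lines of:
	prerequisite-message-id: <20250318161823.4005529-1-tabba@google.com>
	prerequisite-base-commit: 3cc51efc17a2c41a480eed36b31c1773936717e0

A subsequent b4 shazam/am of the depending series can then pull the prerequisites in first.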
Christian Brauner brauner@kernel.org writes:
On Mon, Apr 07, 2025 at 04:46:48PM +0200, David Hildenbrand wrote:
<snip>
Fwiw, b4 allows to specify dependencies so you can b4 shazam/am and it will pull in all prerequisite patches:
b4 prep --edit-deps Edit the series dependencies in your defined $EDITOR (or core.editor)
Thank you for this tip! On this note, what are some good CONFIGs people always enable during development?
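Not an authoritative list, but given the lockdep discussion earlier in this thread, a commonly suggested baseline for mm/KVM development is the lock and memory debugging options, e.g.:

	# lock correctness (lockdep)
	CONFIG_PROVE_LOCKING=y
	CONFIG_DEBUG_ATOMIC_SLEEP=y
	# memory safety and VM sanity checks
	CONFIG_KASAN=y
	CONFIG_DEBUG_VM=y
	CONFIG_DEBUG_LIST=y
	# undefined behaviour
	CONFIG_UBSAN=y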