Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

7 May 2025

On Tue, May 06, 2025 at 12:22:47PM -0700, Ackerley Tng wrote:
...
Yan Zhao yan.y.zhao@intel.com writes:
...
...
...
<snip>
What options does userspace have in this scenario?
It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
isn't ideal either.
What about something similar as below?

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index d2feacd14786..87c33704a748 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
        }
    *pfn = folio_file_pfn(folio, index);


  if (max_order)


          *max_order = folio_order(folio);




  if (max_order) {


          int order;



          order = folio_order(folio);



          while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))


                  order--;



          *max_order = order;


  }

  *is_prepared = folio_test_uptodate(folio);
  return folio;




Vishal was wondering how this is working before guest_memfd was
introduced, for other backing memory like HugeTLB.
I then poked around and found this [1]. I will be adding a similar check
for any slot where kvm_slot_can_be_private(slot).
Yan, that should work, right?
No, I don't think the checking of ugfn [1] should work.

Even for slots bound to in-place-conversion guest_memfd (i.e. shared memory

are allocated from guest_memfd), the slot->userspace_addr does not necessarily
have the same offset as slot->gmem.pgoff. Even if we audit the offset in
kvm_gmem_bind(), userspace could invoke munmap() and mmap() afterwards, causing
slot->userspace_addr to point to a different offset.

for slots bound to guest_memfd that do not support in-place-conversion,

shared memory is allocated from a different backend. Therefore, checking
"slot->base_gfn ^ slot->gmem.pgoff" is required for private memory. The check is
currently absent because guest_memfd supports 4K only.
Let me clarify, I meant these changes:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b64ab3..d0dccf1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12938,6 +12938,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
        return 0;
 }
 
+static inline bool kvm_is_level_aligned(u64 value, int level)
+{

  return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));



+}



static int kvm_alloc_memslot_metadata(struct kvm *kvm,
                                      struct kvm_memory_slot *slot)
 {
@@ -12971,16 +12976,20 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 
                slot->arch.lpage_info[i - 1] = linfo;

          if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))




          if (!kvm_is_level_aligned(slot->base_gfn, level))
                  linfo[0].disallow_lpage = 1;




          if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))




          if (!kvm_is_level_aligned(slot->base_gfn + npages, level))
                  linfo[lpages - 1].disallow_lpage = 1;
          ugfn = slot->userspace_addr >> PAGE_SHIFT;
          /*




           * If the gfn and userspace address are not aligned wrt each


           * other, disable large page support for this slot.




           * If the gfn and userspace address are not aligned or if gfn


           * and guest_memfd offset are not aligned wrt each other,


           * disable large page support for this slot.
           */




          if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {




          if (!kvm_is_level_aligned(slot->base_gfn ^ ugfn, level) ||


              (kvm_slot_can_be_private(slot) &&


               !kvm_is_level_aligned(slot->base_gfn ^ slot->gmem.pgoff,


                                     level))) {
                  unsigned long j;



for (j = 0; j < lpages; ++j)
This does not rely on the ugfn check, but adds a similar check for gmem.pgoff.
In the case of shared memory is not allocated from guest_memfd, (e.g. with the
current upstream code), the checking of gmem.pgoff here will disallow huge page
of shared memory even if "slot->base_gfn ^ ugfn" is aligned.
...
I think this should take care of case (1.), for guest_memfds going to be
used for both shared and private memory. Userspace can't update
slot->userspace_addr, since guest_memfd memslots cannot be updated and
can only be deleted.
If userspace re-uses slot->userspace_addr for some other memory address
without deleting and re-adding a memslot,

KVM's access to memory should still be fine, since after the recent
discussion at guest_memfd upstream call, KVM's guest faults will
always go via fd+offset and KVM's access won't be disrupted
there. Whatever checking done at memslot binding time will still be
valid.

Could the offset of shared memory and offset of private memory be different if
userspace re-uses slot->userspace_addr without deleting and re-adding a memslot?
Then though the two offsets are validated as equal in kvm_gmem_bind(), they may
differ later on.
...

Host's access and other accesses (e.g. instruction emulation, which
uses slot->userspace_addr) to guest memory will be broken, but I think
there's nothing protecting against that. The same breakage would
happen for non-guest_memfd memslot.

Why is host access broken in non-guest_memfd case?
The HVA is still a valid one in QEMU's mmap-ed address space.
...
p.s. I will be adding the validation as you suggested [1], though that
shouldn't make a difference here, since the above check directly
validates against gmem.pgoff.
Regarding 2., checking this checks against gmem.pgoff and should handle
that as well.
[1] https://lore.kernel.org/all/aBnMp26iWWhUrsVf@yzhao56-desk.sh.intel.com/
I prefer checking at binding time because it aligns with the ugfn check
that is already there, and avoids having to check at every fault.
...
...
[1] https://github.com/torvalds/linux/blob/b6ea1680d0ac0e45157a819c41b46565f4616...
...
...
...
> Adding checks at binding time will allow hugepage-unaligned offsets (to
> be at parity with non-guest_memfd backing memory) but still fix this
> issue.
> 
> lpage_info will make sure that ranges near the bounds will be
> fragmented, but the hugepages in the middle will still be mappable as
> hugepages.
> 
> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding...

    

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page