Yan Zhao yan.y.zhao@intel.com writes:
On Tue, May 06, 2025 at 12:22:47PM -0700, Ackerley Tng wrote:
Yan Zhao yan.y.zhao@intel.com writes:
<snip>
What options does userspace have in this scenario? It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff isn't ideal either.
What about something similar to the below?
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index d2feacd14786..87c33704a748 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
 	}
 
 	*pfn = folio_file_pfn(folio, index);
-	if (max_order)
-		*max_order = folio_order(folio);
+	if (max_order) {
+		int order;
+
+		order = folio_order(folio);
+
+		while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))
+			order--;
+
+		*max_order = order;
+	}
 
 	*is_prepared = folio_test_uptodate(folio);
 	return folio;
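For a quick feel of what that loop does, here's a tiny standalone illustration with made-up values (base_gfn 2M-aligned, gmem.pgoff only 1M-aligned); it is not kernel code:

#include <stdio.h>

int main(void)
{
	unsigned long base_gfn = 0x100000;	/* slot start gfn, 2M-aligned */
	unsigned long pgoff    = 0x100;		/* gmem.pgoff, only 1M-aligned */
	int order = 9;				/* folio_order() of a 2M folio */

	/* Same loop as in the hunk above: cap the order at the largest
	 * alignment shared by base_gfn and gmem.pgoff. */
	while (order > 0 && ((base_gfn ^ pgoff) & ((1UL << order) - 1)))
		order--;

	printf("max_order capped to %d\n", order);	/* prints 8 here */
	return 0;
}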
Vishal was wondering how this worked before guest_memfd was introduced, for other backing memory like HugeTLB.
I then poked around and found this [1]. I will be adding a similar check for any slot where kvm_slot_can_be_private(slot).
Yan, that should work, right?
No, I don't think the ugfn check [1] would work.
- Even for slots bound to in-place-conversion guest_memfd (i.e. shared memory
  is allocated from guest_memfd), slot->userspace_addr does not necessarily have
  the same offset as slot->gmem.pgoff. Even if we audit the offset in
  kvm_gmem_bind(), userspace could invoke munmap() and mmap() afterwards,
  causing slot->userspace_addr to point to a different offset (see the sketch
  after this list).
- For slots bound to guest_memfd that does not support in-place conversion,
  shared memory is allocated from a different backend. Therefore, checking
  "slot->base_gfn ^ slot->gmem.pgoff" is required for private memory. This
  check is currently absent because guest_memfd supports 4K only.
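To make the first point concrete, here's a rough userspace sketch. It is hypothetical: it assumes in-place-conversion guest_memfd supports mmap() (not upstream today) and it skips all error handling; the memslot setup itself uses the existing KVM_SET_USER_MEMORY_REGION2 API:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Hypothetical sketch: bind a gmem-backed slot, then remap its HVA. */
static void remap_after_bind(int vm_fd, int gmem_fd, uint64_t size)
{
	/* mmap() of guest_memfd is assumed here (in-place conversion). */
	void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, gmem_fd, 0);

	struct kvm_userspace_memory_region2 region = {
		.slot = 0,
		.flags = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr = 0,
		.memory_size = size,
		.userspace_addr = (unsigned long)hva,
		.guest_memfd = gmem_fd,
		.guest_memfd_offset = 0,
	};
	/* Any offset audit in kvm_gmem_bind() would happen at this point. */
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);

	/* ...but userspace can still change what the HVA points at: */
	munmap(hva, size);
	mmap(hva, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
	     gmem_fd, 2 * 1024 * 1024);
	/* slot->userspace_addr is unchanged, yet it now maps a different
	 * offset into the guest_memfd than slot->gmem.pgoff. */
}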
Let me clarify, I meant these changes:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b64ab3..d0dccf1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12938,6 +12938,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
 	return 0;
 }
 
+static inline bool kvm_is_level_aligned(u64 value, int level)
+{
+	return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));
+}
+
 static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 				      struct kvm_memory_slot *slot)
 {
@@ -12971,16 +12976,20 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 
 		slot->arch.lpage_info[i - 1] = linfo;
 
-		if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
+		if (!kvm_is_level_aligned(slot->base_gfn, level))
 			linfo[0].disallow_lpage = 1;
-		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
+		if (!kvm_is_level_aligned(slot->base_gfn + npages, level))
 			linfo[lpages - 1].disallow_lpage = 1;
 		ugfn = slot->userspace_addr >> PAGE_SHIFT;
 		/*
-		 * If the gfn and userspace address are not aligned wrt each
-		 * other, disable large page support for this slot.
+		 * If the gfn and userspace address are not aligned or if gfn
+		 * and guest_memfd offset are not aligned wrt each other,
+		 * disable large page support for this slot.
 		 */
-		if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
+		if (!kvm_is_level_aligned(slot->base_gfn ^ ugfn, level) ||
+		    (kvm_slot_can_be_private(slot) &&
+		     !kvm_is_level_aligned(slot->base_gfn ^ slot->gmem.pgoff,
+					   level))) {
 			unsigned long j;
 
 			for (j = 0; j < lpages; ++j)
This does not rely on the ugfn check, but adds a similar check for gmem.pgoff.
In the case where shared memory is not allocated from guest_memfd (e.g. with the current upstream code), the gmem.pgoff check here will disallow huge pages for shared memory even if "slot->base_gfn ^ ugfn" is aligned.
Thanks, I get it now. What you mean is that the memslot could have been set up such that
+ slot->userspace_addr is aligned with slot->base_gfn, to be used for shared memory, and
+ slot->gmem.pgoff is not aligned with slot->base_gfn, to be used for private memory
and this check would disallow huge page mappings even though this memslot was going to only be used for shared memory.
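As a concrete example with made-up numbers, for a 2M huge page (KVM_PAGES_PER_HPAGE() == 512 at that level on x86):

#include <stdio.h>
#include <stdint.h>

#define PAGES_PER_2M	512ULL	/* KVM_PAGES_PER_HPAGE() at the 2M level */

int main(void)
{
	uint64_t base_gfn   = 0x100000;			/* 2M-aligned GPA */
	uint64_t ugfn       = 0x7f0000200000ULL >> 12;	/* 2M-aligned HVA */
	uint64_t gmem_pgoff = 0x100;			/* 1M into the file */

	/* shared side: aligned, huge pages would have been fine */
	printf("shared aligned:  %d\n",
	       ((base_gfn ^ ugfn) & (PAGES_PER_2M - 1)) == 0);
	/* private side: misaligned, so the binding-time check trips */
	printf("private aligned: %d\n",
	       ((base_gfn ^ gmem_pgoff) & (PAGES_PER_2M - 1)) == 0);
	return 0;
}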
The only way to fix this would indeed be a runtime check, since the shared/private status can change at runtime.
I think it is okay that this check is stricter than necessary, since it just results in mapping without huge pages.
What do you think?
I think this should take care of case (1.), for guest_memfds going to be used for both shared and private memory. Userspace can't update slot->userspace_addr, since guest_memfd memslots cannot be updated and can only be deleted.
If userspace re-uses slot->userspace_addr for some other memory address without deleting and re-adding a memslot,
- KVM's access to memory should still be fine, since after the recent discussion at guest_memfd upstream call, KVM's guest faults will always go via fd+offset and KVM's access won't be disrupted there. Whatever checking done at memslot binding time will still be valid.
Could the offset of shared memory and offset of private memory be different if userspace re-uses slot->userspace_addr without deleting and re-adding a memslot?
They could be different, yes. I think what you mean is if userspace does something like
    addr = mmap(guest_memfd);
    ioctl(KVM_SET_USER_MEMORY_REGION, addr, guest_memfd);
    munmap(addr);
    addr = mmap(addr, other_fd);

(with no second call to KVM_SET_USER_MEMORY_REGION)
Without guest_memfd, when munmap() happens, KVM should get a notification via mmu_notifiers. That will unmap the pages from guest page tables. At the next fault, host page tables will be consulted to determine max_mapping_level, and at that time the mapping level would be the new mapping level in host page tables.
Then even though the two offsets are validated as equal in kvm_gmem_bind(), they may differ later on.
This is true.
Continuing from above, with guest_memfd there are no issues if guest_memfd is only used for private memory, since shared memory uses the same mechanism as before guest_memfd.
If guest_memfd is used for both private and shared memory, on unmapping, KVM will also get notified via mmu_notifiers. On the next fault, the mapping level is determined as follows (I have a patch coming up that will illustrate this better)
1. guest_memfd will return 4K since this is a shared folio and shared folios are always split to 4K. But suppose in future guest_memfd supports shared folios at higher levels, say 1G, we continue...
2. lpage_info (not updated since userspace swapped out addr) will say map at 1G.
3. Since this is a shared fault, we check host page tables, which would say 4K since there was a munmap() and mmap().
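Roughly, and just as a simplified sketch (not the actual KVM fault-path functions), the resulting level is the minimum of the three inputs above:

#include <stdbool.h>

/* Simplified sketch of combining the three inputs above (not actual KVM code). */
static int effective_mapping_level(int gmem_level, int lpage_info_level,
				   int host_pt_level, bool shared_fault)
{
	int level = gmem_level;			/* 1. what guest_memfd allows */

	if (lpage_info_level < level)		/* 2. slot's lpage_info */
		level = lpage_info_level;

	if (shared_fault && host_pt_level < level)	/* 3. host page tables */
		level = host_pt_level;

	return level;
}

So even if lpage_info is stale and says 1G in step 2, the host page table walk in step 3 still caps a shared fault at 4K after the munmap() and mmap().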
I think it should still work as expected.
- Host's access and other accesses (e.g. instruction emulation, which uses slot->userspace_addr) to guest memory will be broken, but I think there's nothing protecting against that. The same breakage would happen for a non-guest_memfd memslot.
Why is host access broken in non-guest_memfd case? The HVA is still a valid one in QEMU's mmap-ed address space.
I was thinking that if a guest was executing code and the code gets swapped out from under its feet by replacing the memory pointed to by addr, the guest would be broken.
Now that I think about it again, it could be a valid use case. You're right, thanks for pointing this out!
p.s. I will be adding the validation as you suggested [1], though that shouldn't make a difference here, since the above check directly validates against gmem.pgoff.
Regarding 2., this check validates against gmem.pgoff and should handle that case as well.
[1] https://lore.kernel.org/all/aBnMp26iWWhUrsVf@yzhao56-desk.sh.intel.com/
I prefer checking at binding time because it aligns with the ugfn check that is already there, and avoids having to check at every fault.
[1] https://github.com/torvalds/linux/blob/b6ea1680d0ac0e45157a819c41b46565f4616...
>> Adding checks at binding time will allow hugepage-unaligned offsets (to
>> be at parity with non-guest_memfd backing memory) but still fix this
>> issue.
>>
>> lpage_info will make sure that ranges near the bounds will be
>> fragmented, but the hugepages in the middle will still be mappable as
>> hugepages.
>>
>> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding...