On Tue, May 06, 2025 at 12:22:47PM -0700, Ackerley Tng wrote:
Yan Zhao yan.y.zhao@intel.com writes:
<snip>
What options does userspace have in this scenario? It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff isn't ideal either.
What about something similar as below?
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index d2feacd14786..87c33704a748 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot, }
*pfn = folio_file_pfn(folio, index);
if (max_order)
*max_order = folio_order(folio);
if (max_order) {
int order;
order = folio_order(folio);
while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))
order--;
*max_order = order;
} *is_prepared = folio_test_uptodate(folio); return folio;
Vishal was wondering how this is working before guest_memfd was introduced, for other backing memory like HugeTLB.
I then poked around and found this [1]. I will be adding a similar check for any slot where kvm_slot_can_be_private(slot).
Yan, that should work, right?
No, I don't think the checking of ugfn [1] should work.
- Even for slots bound to in-place-conversion guest_memfd (i.e. shared memory
are allocated from guest_memfd), the slot->userspace_addr does not necessarily have the same offset as slot->gmem.pgoff. Even if we audit the offset in kvm_gmem_bind(), userspace could invoke munmap() and mmap() afterwards, causing slot->userspace_addr to point to a different offset.
- for slots bound to guest_memfd that do not support in-place-conversion,
shared memory is allocated from a different backend. Therefore, checking "slot->base_gfn ^ slot->gmem.pgoff" is required for private memory. The check is currently absent because guest_memfd supports 4K only.
Let me clarify, I meant these changes:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4b64ab3..d0dccf1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12938,6 +12938,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages) return 0; } +static inline bool kvm_is_level_aligned(u64 value, int level) +{
return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));
+}
static int kvm_alloc_memslot_metadata(struct kvm *kvm, struct kvm_memory_slot *slot) { @@ -12971,16 +12976,20 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm, slot->arch.lpage_info[i - 1] = linfo;
if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
if (!kvm_is_level_aligned(slot->base_gfn, level)) linfo[0].disallow_lpage = 1;
if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
if (!kvm_is_level_aligned(slot->base_gfn + npages, level)) linfo[lpages - 1].disallow_lpage = 1; ugfn = slot->userspace_addr >> PAGE_SHIFT; /*
* If the gfn and userspace address are not aligned wrt each
* other, disable large page support for this slot.
* If the gfn and userspace address are not aligned or if gfn
* and guest_memfd offset are not aligned wrt each other,
* disable large page support for this slot. */
if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
if (!kvm_is_level_aligned(slot->base_gfn ^ ugfn, level) ||
(kvm_slot_can_be_private(slot) &&
!kvm_is_level_aligned(slot->base_gfn ^ slot->gmem.pgoff,
level))) { unsigned long j;
for (j = 0; j < lpages; ++j)
This does not rely on the ugfn check, but adds a similar check for gmem.pgoff.
In the case of shared memory is not allocated from guest_memfd, (e.g. with the current upstream code), the checking of gmem.pgoff here will disallow huge page of shared memory even if "slot->base_gfn ^ ugfn" is aligned.
I think this should take care of case (1.), for guest_memfds going to be used for both shared and private memory. Userspace can't update slot->userspace_addr, since guest_memfd memslots cannot be updated and can only be deleted.
If userspace re-uses slot->userspace_addr for some other memory address without deleting and re-adding a memslot,
- KVM's access to memory should still be fine, since after the recent discussion at guest_memfd upstream call, KVM's guest faults will always go via fd+offset and KVM's access won't be disrupted there. Whatever checking done at memslot binding time will still be valid.
Could the offset of shared memory and offset of private memory be different if userspace re-uses slot->userspace_addr without deleting and re-adding a memslot?
Then though the two offsets are validated as equal in kvm_gmem_bind(), they may differ later on.
- Host's access and other accesses (e.g. instruction emulation, which uses slot->userspace_addr) to guest memory will be broken, but I think there's nothing protecting against that. The same breakage would happen for non-guest_memfd memslot.
Why is host access broken in non-guest_memfd case? The HVA is still a valid one in QEMU's mmap-ed address space.
p.s. I will be adding the validation as you suggested [1], though that shouldn't make a difference here, since the above check directly validates against gmem.pgoff.
Regarding 2., checking this checks against gmem.pgoff and should handle that as well.
[1] https://lore.kernel.org/all/aBnMp26iWWhUrsVf@yzhao56-desk.sh.intel.com/
I prefer checking at binding time because it aligns with the ugfn check that is already there, and avoids having to check at every fault.
[1] https://github.com/torvalds/linux/blob/b6ea1680d0ac0e45157a819c41b46565f4616...
> Adding checks at binding time will allow hugepage-unaligned offsets (to > be at parity with non-guest_memfd backing memory) but still fix this > issue. > > lpage_info will make sure that ranges near the bounds will be > fragmented, but the hugepages in the middle will still be mappable as > hugepages. > > [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding...