Re: [RFC PATCH 00/39] 1G page support for guest_memfd

11 Sep 2024

Cc Oscar for awareness
On Tue 10-09-24 23:43:31, Ackerley Tng wrote:
...
Hello,
This patchset is our exploration of how to support 1G pages in guest_memfd, and
how the pages will be used in Confidential VMs.
The patchset covers:

+ How to get 1G pages
+ Allowing mmap() of guest_memfd to userspace so that both private and shared
  memory can use the same physical pages
+ Splitting and reconstructing pages to support conversions and mmap()
+ How the VM, userspace and guest_memfd interact to support conversions
+ Selftests to test all the above
  + Selftests also demonstrate the conversion flow between VM, userspace and
    guest_memfd.

Why 1G pages in guest_memfd?

Bring guest_memfd to performance and memory savings parity with VMs that are
backed by HugeTLBfs.

+ Performance is improved with 1G pages by more TLB hits and faster page walks
  on TLB misses.
+ Memory savings from 1G pages come from HugeTLB Vmemmap Optimization (HVO).
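
As a rough illustration of where the HVO savings come from, the sketch below
counts the vmemmap pages backing one huge page. It assumes 4 KiB base pages, a
64-byte struct page and that HVO keeps a single vmemmap page per huge page.
These are typical x86-64 values and are not stated in this cover letter; the
helper names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values (typical x86-64); not taken from the patchset. */
#define BASE_PAGE_SIZE     4096ULL
#define STRUCT_PAGE_SIZE   64ULL
#define HVO_RETAINED_PAGES 1ULL

/* Base pages needed to hold the struct pages describing one huge page. */
uint64_t vmemmap_pages(uint64_t huge_page_size)
{
	assert(huge_page_size % BASE_PAGE_SIZE == 0);

	uint64_t nr_struct_pages = huge_page_size / BASE_PAGE_SIZE;

	return nr_struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}

/* Bytes of vmemmap that can be freed per huge page under the assumptions. */
uint64_t hvo_saved_bytes(uint64_t huge_page_size)
{
	uint64_t total = vmemmap_pages(huge_page_size);

	if (total <= HVO_RETAINED_PAGES)
		return 0;
	return (total - HVO_RETAINED_PAGES) * BASE_PAGE_SIZE;
}
```

Under these assumptions a 1G page is described by 4096 vmemmap pages (16 MiB),
of which 4095 could be freed, while a 2M page is described by only 8.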

Options for 1G page support:

+ HugeTLB
+ Contiguous Memory Allocator (CMA)
+ Other suggestions are welcome!

Comparison between options:

+ HugeTLB
  + Refactor HugeTLB to separate allocator from the rest of HugeTLB
  + Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
    + Near term: Allows co-tenancy of HugeTLB and guest_memfd backed VMs
  + Pro: Can provide iterative steps toward new future allocator
    + Unexplored: Managing userspace-visible changes
      + e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
        but not when future allocator is used
+ CMA
  + Port some HugeTLB features to be applied on CMA
  + Pro: Clean slate

What would refactoring HugeTLB involve?

(Some refactoring was done in this RFC, more can be done.)

+ Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
  + Brings more modularity to HugeTLB
  + No functionality change intended
  + Likely step towards HugeTLB's integration into core-mm
+ guest_memfd will use just the allocator component of HugeTLB, not including
  the complex parts of HugeTLB like
  + Userspace reservations (resv_map)
  + Shared PMD mappings
  + Special page walkers

What features would need to be ported to CMA?

+ Improved allocation guarantees
  + Per NUMA node pool of huge pages
  + Subpools per guest_memfd
+ Memory savings
  + Something like HugeTLB Vmemmap Optimization
+ Configuration/reporting features
  + Configuration of number of pages available (and per NUMA node) at and
    after host boot
  + Reporting of memory usage/availability statistics at runtime

HugeTLB was picked as the source of 1G pages for this RFC because it allows a
graceful transition, and retains memory savings from HVO.

To illustrate this, if a host machine uses HugeTLBfs to back VMs, and a
confidential VM were to be scheduled on that host, some HugeTLBfs pages would
have to be given up and returned to CMA for guest_memfd pages to be rebuilt from
that memory. This requires memory to be reserved for HVO to be removed and
reapplied on the new guest_memfd memory. This not only slows down memory
allocation but also trims the benefits of HVO. Memory would have to be reserved
on the host to facilitate these transitions.

Improving how guest_memfd uses the allocator in a future revision of this RFC:

To provide an easier transition away from HugeTLB, guest_memfd's use of HugeTLB
should be limited to these allocator functions:

+ reserve(node, page_size, num_pages) => opaque handle
  + Used when a guest_memfd inode is created to reserve memory from backend
    allocator
+ allocate(handle, mempolicy, page_size) => folio
  + To allocate a folio from guest_memfd's reservation
+ split(handle, folio, target_page_size) => void
  + To take a huge folio, and split it to smaller folios, restore to filemap
+ reconstruct(handle, first_folio, nr_pages) => void
  + To take a folio, and reconstruct a huge folio out of nr_pages from the
    first_folio
+ free(handle, folio) => void
  + To return folio to guest_memfd's reservation
+ error(handle, folio) => void
  + To handle memory errors
+ unreserve(handle) => void
  + To return guest_memfd's reservation to allocator backend
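
One way to picture the shape of such an interface is an ops table behind an
opaque handle. The sketch below is a userspace toy, not kernel code: every
name in it (toy_resv, toy_folio, toy_allocator_ops, toy_ops) is hypothetical,
the mempolicy argument is dropped, and split(), reconstruct() and error() are
left out to keep it short. The "backend" only counts pages against the
reservation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a folio and for the opaque reservation handle. */
struct toy_folio {
	size_t page_size;
};

struct toy_resv {
	int node;
	size_t page_size;
	size_t nr_reserved;
	size_t nr_in_use;
};

/* Hypothetical ops table mirroring part of the function list above. */
struct toy_allocator_ops {
	struct toy_resv *(*reserve)(int node, size_t page_size, size_t num_pages);
	struct toy_folio *(*allocate)(struct toy_resv *h, size_t page_size);
	void (*free_folio)(struct toy_resv *h, struct toy_folio *folio);
	void (*unreserve)(struct toy_resv *h);
};

static struct toy_resv *toy_reserve(int node, size_t page_size, size_t num_pages)
{
	struct toy_resv *h = calloc(1, sizeof(*h));

	if (!h)
		return NULL;
	h->node = node;
	h->page_size = page_size;
	h->nr_reserved = num_pages;
	return h;
}

/* Hand out a folio only while the reservation still has pages left. */
static struct toy_folio *toy_allocate(struct toy_resv *h, size_t page_size)
{
	struct toy_folio *folio;

	if (page_size != h->page_size || h->nr_in_use == h->nr_reserved)
		return NULL;
	folio = malloc(sizeof(*folio));
	if (!folio)
		return NULL;
	folio->page_size = page_size;
	h->nr_in_use++;
	return folio;
}

/* Return a folio to the reservation it came from. */
static void toy_free_folio(struct toy_resv *h, struct toy_folio *folio)
{
	assert(h->nr_in_use > 0);
	h->nr_in_use--;
	free(folio);
}

/* Give the whole reservation back; all folios must have been freed first. */
static void toy_unreserve(struct toy_resv *h)
{
	assert(h->nr_in_use == 0);
	free(h);
}

const struct toy_allocator_ops toy_ops = {
	.reserve = toy_reserve,
	.allocate = toy_allocate,
	.free_folio = toy_free_folio,
	.unreserve = toy_unreserve,
};
```

The point of the indirection is that a caller holding only the ops table and
the handle never names the backend, so a different allocator could be swapped
in behind the same calls.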

Userspace should only provide a page size when creating a guest_memfd and should
not have to specify HugeTLB.

Overview of patches:

+ Patches 01-12
  + Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts from
    HugeTLB, and to expose HugeTLB functions.
+ Patches 13-16
  + Letting guest_memfd use HugeTLB
  + Creation of each guest_memfd reserves pages from HugeTLB's global hstate
    and puts them into the guest_memfd inode's subpool
  + Each folio allocation takes a page from the guest_memfd inode's subpool
+ Patches 17-21
  + Selftests for new HugeTLB features in guest_memfd
+ Patches 22-24
  + More small changes on the HugeTLB side to expose functions needed by
    guest_memfd
+ Patch 25
  + Uses the newly available functions from patches 22-24 to split HugeTLB
    pages. In this patch, HugeTLB folios are always split to 4K before any
    usage, private or shared.
+ Patches 26-28
  + Allow mmap() in guest_memfd and faulting in shared pages
+ Patch 29
  + Enables conversion between private/shared pages
+ Patch 30
  + Required to zero folios after conversions to avoid leaking initialized
    kernel memory
+ Patches 31-38
  + Add selftests to test mapping pages to userspace, guest/host memory
    sharing and update conversions tests
  + Patch 33 illustrates the conversion flow between VM/userspace/guest_memfd
+ Patch 39
  + Dynamically split and reconstruct HugeTLB pages instead of always
    splitting before use. All earlier selftests are expected to still pass.

TODOs:

+ Add logic to wait for safe_refcount [1]
+ Look into lazy splitting/reconstruction of pages
  + Currently, when the KVM_SET_MEMORY_ATTRIBUTES ioctl is invoked, not only
    are the mem_attr_array and faultability updated, the pages in the
    requested range are also split/reconstructed as necessary. We want to
    look into delaying splitting/reconstruction to fault time.
+ Solve race between folios being faulted in and being truncated
  + When running private_mem_conversions_test with more than 1 vCPU, a folio
    getting truncated may get faulted in by another process, causing elevated
    mapcounts when the folio is freed (VM_BUG_ON_FOLIO).
+ Add intermediate splits (1G should first split to 2M and not split directly to
  4K)
+ Use guest's lock instead of hugetlb_lock
+ Use multi-index xarray/replace xarray with some other data struct for
  faultability flag
+ Refactor HugeTLB better, present generic allocator interface
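
To give a feel for what the intermediate-split TODO could buy, the sketch
below compares folio counts for one 1G page. It is one possible reading of
that item, not the patchset's algorithm: it assumes a 1G folio is first split
into 2M folios and that only the 2M folios overlapping the requested 4K range
are split further to 4K. The function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGES_4K_PER_2M 512ULL /* 2M / 4K */
#define PAGES_2M_PER_1G 512ULL /* 1G / 2M */
#define PAGES_4K_PER_1G (PAGES_4K_PER_2M * PAGES_2M_PER_1G)

/* Direct split: every 4K page of the 1G folio becomes its own folio. */
uint64_t folios_after_direct_split(void)
{
	return PAGES_4K_PER_1G;
}

/*
 * Staged split (assumed policy): split 1G into 2M folios, then split only
 * the 2M folios that overlap [first_4k, first_4k + nr_4k) down to 4K.
 */
uint64_t folios_after_staged_split(uint64_t first_4k, uint64_t nr_4k)
{
	assert(nr_4k > 0 && first_4k + nr_4k <= PAGES_4K_PER_1G);

	uint64_t first_chunk = first_4k / PAGES_4K_PER_2M;
	uint64_t last_chunk = (first_4k + nr_4k - 1) / PAGES_4K_PER_2M;
	uint64_t touched = last_chunk - first_chunk + 1;

	/* Untouched chunks stay as 2M folios; touched ones become 512 x 4K. */
	return (PAGES_2M_PER_1G - touched) + touched * PAGES_4K_PER_2M;
}
```

Under this assumed policy, touching a single 4K page leaves 511 folios of 2M
plus 512 folios of 4K, 1023 in total, instead of the 262144 produced by
splitting the whole 1G page to 4K.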

Please let us know your thoughts on:

+ HugeTLB as the choice of transitional allocator backend
+ Refactoring HugeTLB to provide generic allocator interface
+ Shared/private conversion flow
  + Requiring user to request kernel to unmap pages from userspace using
    madvise(MADV_DONTNEED)
  + Failing conversion on elevated mapcounts/pincounts/refcounts
+ Process of splitting/reconstructing page
+ Anything else!
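
For readers unfamiliar with the userspace side of the madvise(MADV_DONTNEED)
question, the call itself looks like the sketch below. An anonymous private
mapping stands in for an mmap() of a guest_memfd file, since the latter needs
a KVM VM and this series applied. How guest_memfd reacts to the call is
defined by the RFC and is not shown here; the helper names are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Ask the kernel to drop the userspace mapping of [addr, addr + len). */
int drop_userspace_mapping(void *addr, size_t len)
{
	return madvise(addr, len, MADV_DONTNEED);
}

/*
 * Stand-in demo on anonymous memory: returns the first byte read back after
 * MADV_DONTNEED, or -1 on error. For a private anonymous mapping the next
 * access is served by a fresh zero-filled page.
 */
int dontneed_demo(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, len);
	assert(p[0] == 0xaa);

	if (drop_userspace_mapping(p, len) != 0) {
		munmap(p, len);
		return -1;
	}

	int byte = p[0];

	munmap(p, len);
	return byte;
}
```

In the proposed flow, userspace would issue the same call on the range it has
mapped from guest_memfd before requesting a conversion to private.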

[1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quici...

Ackerley Tng (37):
  mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
  mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
  mm: hugetlb: Remove unnecessary check for avoid_reserve
  mm: mempolicy: Refactor out policy_node_nodemask()
  mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
    interpret mempolicy instead of vma
  mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
  mm: hugetlb: Refactor out hugetlb_alloc_folio
  mm: truncate: Expose preparation steps for truncate_inode_pages_final
  mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
  mm: hugetlb: Add option to create new subpool without using surplus
  mm: hugetlb: Expose hugetlb_acct_memory()
  mm: hugetlb: Move and expose hugetlb_zero_partial_page()
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of
    anonymous inodes
  KVM: guest_memfd: hugetlb: initialization and cleanup
  KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
  KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
  KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
  KVM: selftests: Support various types of backing sources for private
    memory
  KVM: selftests: Update test for various private memory backing source
    types
  KVM: selftests: Add private_mem_conversions_test.sh
  KVM: selftests: Test that guest_memfd usage is reported via hugetlb
  mm: hugetlb: Expose vmemmap optimization functions
  mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
  mm: hugetlb: Add functions to add/move/remove from hugetlb lists
  KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  KVM: guest_memfd: Allow mmapping guest_memfd files
  KVM: guest_memfd: Use vm_type to determine default faultability
  KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
  KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  KVM: selftests: Allow vm_set_memory_attributes to be used without
    asserting return value of 0
  KVM: selftests: Test using guest_memfd memory from userspace
  KVM: selftests: Test guest_memfd memory sharing between guest and host
  KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
    guest_memfd
  KVM: selftests: Test that pinned pages block KVM from setting memory
    attributes to PRIVATE
  KVM: selftests: Refactor vm_mem_add to be more flexible
  KVM: selftests: Add helper to perform madvise by memslots
  KVM: selftests: Update private_mem_conversions_test for mmap()able
    guest_memfd

Vishal Annapurve (2):
  KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
  KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

 fs/hugetlbfs/inode.c                          |   35 +-
 include/linux/hugetlb.h                       |   54 +-
 include/linux/kvm_host.h                      |    1 +
 include/linux/mempolicy.h                     |    2 +
 include/linux/mm.h                            |    1 +
 include/uapi/linux/kvm.h                      |   26 +
 include/uapi/linux/magic.h                    |    1 +
 mm/hugetlb.c                                  |  346 ++--
 mm/hugetlb_vmemmap.h                          |   11 -
 mm/mempolicy.c                                |   36 +-
 mm/truncate.c                                 |   26 +-
 tools/include/linux/kernel.h                  |    4 +-
 tools/testing/selftests/kvm/Makefile          |    3 +
 .../kvm/guest_memfd_hugetlb_reporting_test.c  |  222 +++
 .../selftests/kvm/guest_memfd_pin_test.c      |  104 ++
 .../selftests/kvm/guest_memfd_sharing_test.c  |  160 ++
 .../testing/selftests/kvm/guest_memfd_test.c  |  238 ++-
 .../testing/selftests/kvm/include/kvm_util.h  |   45 +-
 .../testing/selftests/kvm/include/test_util.h |   18 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  443 +++--
 tools/testing/selftests/kvm/lib/test_util.c   |   99 ++
 .../kvm/x86_64/private_mem_conversions_test.c |  158 +-
 .../x86_64/private_mem_conversions_test.sh    |   91 +
 .../kvm/x86_64/private_mem_kvm_exits_test.c   |   11 +-
 virt/kvm/guest_memfd.c                        | 1563 ++++++++++++++++-
 virt/kvm/kvm_main.c                           |   17 +
 virt/kvm/kvm_mm.h                             |   16 +
 27 files changed, 3288 insertions(+), 443 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
 create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
--
2.46.0.598.g6f2099f65c-goog
-- 
Michal Hocko
SUSE Labs
