This series introduces NUMA-aware memory placement support for KVM guests
with guest_memfd memory backends. It builds upon Fuad Tabba's work (V17)
that enabled host-mapping for guest_memfd memory [1] and can be applied
directly applied on KVM tree [2] (branch kvm-next, base commit: a6ad5413,
Merge branch 'guest-memfd-mmap' into HEAD)
== Background ==
KVM's guest-memfd memory backend currently lacks support for NUMA policy
enforcement, causing guest memory allocations to be distributed across host
nodes according to kernel's default behavior, irrespective of any policy
specified by the VMM. This limitation arises because conventional userspace
NUMA control mechanisms like mbind(2) don't work since the memory isn't
directly mapped to userspace when allocations occur.
Fuad's work [1] provides the necessary mmap capability, and this series
leverages it to enable mbind(2).
== Implementation ==
This series implements proper NUMA policy support for guest-memfd by:
1. Adding mempolicy-aware allocation APIs to the filemap layer.
2. Introducing custom inodes (via a dedicated slab-allocated inode cache,
kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory.
3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA
policy.
With these changes, VMMs can now control guest memory placement by mapping
guest_memfd file descriptor and using mbind(2) to specify:
- Policy modes: default, bind, interleave, or preferred
- Host NUMA nodes: List of target nodes for memory allocation
These Policies affect only future allocations and do not migrate existing
memory. This matches mbind(2)'s default behavior which affects only new
allocations unless overridden with MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags (Not
supported for guest_memfd as it is unmovable by design).
== Upstream Plan ==
Phased approach as per David's guest_memfd extension overview [3] and
community calls [4]:
Phase 1 (this series):
1. Focuses on shared guest_memfd support (non-CoCo VMs).
2. Builds on Fuad's host-mapping work [1].
Phase2 (future work):
1. NUMA support for private guest_memfd (CoCo VMs).
2. Depends on SNP in-place conversion support [5].
This series provides a clean integration path for NUMA-aware memory
management for guest_memfd and lays the groundwork for future confidential
computing NUMA capabilities.
Thanks,
Shivank
== Changelog ==
- v1,v2: Extended the KVM_CREATE_GUEST_MEMFD IOCTL to pass mempolicy.
- v3: Introduced fbind() syscall for VMM memory-placement configuration.
- v4-v6: Current approach using shared_policy support and vm_ops (based on
suggestions from David [6] and guest_memfd bi-weekly upstream
call discussion [7]).
- v7: Use inodes to store NUMA policy instead of file [8].
- v8: Rebase on top of Fuad's V12: Host mmaping for guest_memfd memory.
- v9: Rebase on top of Fuad's V13 and incorporate review comments
- V10: Rebase on top of Fuad's V17. Use latest guest_memfd inode patch
from Ackerley (with David's review comments). Use newer kmem_cache_create()
API variant with arg parameter (Vlastimil)
- V11: Rebase on kvm-next, remove RFC tag, use Ackerley's latest patch
and fix a rcu race bug during kvm module unload.
[1] https://lore.kernel.org/all/20250729225455.670324-1-seanjc@google.com
[2] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next
[3] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com
[4] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[5] https://lore.kernel.org/all/20250613005400.3694904-1-michael.roth@amd.com
[6] https://lore.kernel.org/all/6fbef654-36e2-4be5-906e-2a648a845278@redhat.com
[7] https://lore.kernel.org/all/2b77e055-98ac-43a1-a7ad-9f9065d7f38f@amd.com
[8] https://lore.kernel.org/all/diqzbjumm167.fsf@ackerleytng-ctop.c.googlers.com
Ackerley Tng (1):
KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes
Matthew Wilcox (Oracle) (2):
mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()
mm/filemap: Extend __filemap_get_folio() to support NUMA memory
policies
Shivank Garg (4):
mm/mempolicy: Export memory policy symbols
KVM: guest_memfd: Add slab-allocated inode cache
KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy
support
fs/bcachefs/fs-io-buffered.c | 2 +-
fs/btrfs/compression.c | 4 +-
fs/btrfs/verity.c | 2 +-
fs/erofs/zdata.c | 2 +-
fs/f2fs/compress.c | 2 +-
include/linux/pagemap.h | 18 +-
include/uapi/linux/magic.h | 1 +
mm/filemap.c | 23 +-
mm/mempolicy.c | 6 +
mm/readahead.c | 2 +-
tools/testing/selftests/kvm/Makefile.kvm | 1 +
.../testing/selftests/kvm/guest_memfd_test.c | 121 ++++++++
virt/kvm/guest_memfd.c | 262 ++++++++++++++++--
virt/kvm/kvm_main.c | 7 +-
virt/kvm/kvm_mm.h | 9 +-
15 files changed, 412 insertions(+), 50 deletions(-)
--
2.43.0
---
== Earlier Postings ==
v10: https://lore.kernel.org/all/20250811090605.16057-2-shivankg@amd.com
v9: https://lore.kernel.org/all/20250713174339.13981-2-shivankg@amd.com
v8: https://lore.kernel.org/all/20250618112935.7629-1-shivankg@amd.com
v7: https://lore.kernel.org/all/20250408112402.181574-1-shivankg@amd.com
v6: https://lore.kernel.org/all/20250226082549.6034-1-shivankg@amd.com
v5: https://lore.kernel.org/all/20250219101559.414878-1-shivankg@amd.com
v4: https://lore.kernel.org/all/20250210063227.41125-1-shivankg@amd.com
v3: https://lore.kernel.org/all/20241105164549.154700-1-shivankg@amd.com
v2: https://lore.kernel.org/all/20240919094438.10987-1-shivankg@amd.com
v1: https://lore.kernel.org/all/20240916165743.201087-1-shivankg@amd.com
Hi all,
This patch series introduces improvements to the cgroup selftests by adding
a helper function to better handle asynchronous updates in cgroup statistics.
These changes are especially useful for managing cgroup stats like
memory.stat and cgroup.stat, which can be affected by delays (e.g., RCU
callbacks and asynchronous rstat flushing).
Patch 1/3 adds cg_read_key_long_poll(), a generic helper to poll a numeric
key in a cgroup file until it reaches an expected value or a retry limit is
hit. Patches 2/3 and 3/3 convert existing tests to use this helper, making
them more robust on busy systems.
v5:
- Drop the "/* 3s total */" comment from MEMCG_SOCKSTAT_WAIT_RETRIES as
suggested by Shakeel, so it does not become stale if the wait interval
changes.
- Elaborate in the commit message of patch 3/3 on the rationale behind the
8s timeout in test_kmem_dead_cgroups(), and add a comment next to
KMEM_DEAD_WAIT_RETRIES explaining that it is a generous upper bound
derived from stress testing and not tied to a specific kernel constant.
- Add Reviewed-by: Shakeel Butt <shakeel.butt(a)linux.dev> to this series.
- Link to v4: https://lore.kernel.org/all/20251124123816.486164-1-zhangguopeng@kylinos.cn/
v4:
- Patch 1/3: Add the cg_read_key_long_poll() helper to poll cgroup keys
with retries and configurable intervals.
- Patch 2/3: Update test_memcg_sock() to use cg_read_key_long_poll() for
handling delayed "sock " counter updates in memory.stat.
- Patch 3/3: Replace the sleep-and-retry logic in test_kmem_dead_cgroups()
with cg_read_key_long_poll() for waiting on nr_dying_descendants.
- Link to v3: https://lore.kernel.org/all/p655qedqjaakrnqpytc6dltejfluxo6jrffcltfz2ivonmk…
v3:
- Move MEMCG_SOCKSTAT_WAIT_* defines after the #include block as suggested.
- Link to v2: https://lore.kernel.org/all/5ad2b75f-748a-4e93-8d11-63295bda0cbf@linux.dev/
v2:
- Clarify the rationale for the 3s timeout and mention the periodic rstat
flush interval (FLUSH_TIME = 2 * HZ) in the comment.
- Replace hardcoded retry count and wait interval with macros to avoid
magic numbers and make the timeout calculation explicit.
- Link to v1: https://lore.kernel.org/all/20251119122758.85610-1-ioworker0@gmail.com/
Thanks to Michal Koutný for the suggestion, and to Lance Yang and Shakeel Butt for their reviews and feedback.
Guopeng Zhang (3):
selftests: cgroup: Add cg_read_key_long_poll() to poll a cgroup key
with retries
selftests: cgroup: make test_memcg_sock robust against delayed sock
stats
selftests: cgroup: Replace sleep with cg_read_key_long_poll() for
waiting on nr_dying_descendants
.../selftests/cgroup/lib/cgroup_util.c | 21 ++++++++++++
.../cgroup/lib/include/cgroup_util.h | 5 +++
tools/testing/selftests/cgroup/test_kmem.c | 33 +++++++++----------
.../selftests/cgroup/test_memcontrol.c | 20 ++++++++++-
4 files changed, 60 insertions(+), 19 deletions(-)
--
2.25.1
[ based on kvm/next ]
Unmapping virtual machine guest memory from the host kernel's direct map
is a successful mitigation against Spectre-style transient execution
issues: if the kernel page tables do not contain entries pointing to
guest memory, then any attempted speculative read through the direct map
will necessarily be blocked by the MMU before any observable
microarchitectural side-effects happen. This means that Spectre-gadgets
and similar cannot be used to target virtual machine memory. Roughly
60% of speculative execution issues fall into this category [1, Table
1].
This patch series extends guest_memfd with the ability to remove its
memory from the host kernel's direct map, to be able to attain the above
protection for KVM guests running inside guest_memfd.
Additionally, a Firecracker branch with support for these VMs can be
found on GitHub [2].
For more details, please refer to the v5 cover letter. No substantial
changes in design have taken place since.
See also related write() syscall support in guest_memfd [3] where
the interoperation between the two features is described.
Changes since v7:
- David: separate patches for adding x86 and ARM support
- Dave/Will: drop support for disabling TLB flushes
v7: https://lore.kernel.org/kvm/20250924151101.2225820-1-patrick.roy@campus.lmu…
v6: https://lore.kernel.org/kvm/20250912091708.17502-1-roypat@amazon.co.uk
v5: https://lore.kernel.org/kvm/20250828093902.2719-1-roypat@amazon.co.uk
v4: https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk
RFCv3: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk
RFCv2: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk
RFCv1: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk
[1] https://download.vusec.net/papers/quarantine_raid23.pdf
[2] https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hidi…
[3] https://lore.kernel.org/kvm/20251114151828.98165-1-kalyazin@amazon.com
Patrick Roy (13):
x86: export set_direct_map_valid_noflush to KVM module
x86/tlb: export flush_tlb_kernel_range to KVM module
mm: introduce AS_NO_DIRECT_MAP
KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate
KVM: guest_memfd: Add flag to remove from direct map
KVM: x86: define kvm_arch_gmem_supports_no_direct_map()
KVM: arm64: define kvm_arch_gmem_supports_no_direct_map()
KVM: selftests: load elf via bounce buffer
KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd
!= -1
KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in existing
selftests
KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
KVM: selftests: Test guest execution from direct map removed gmem
Documentation/virt/kvm/api.rst | 22 ++++---
arch/arm64/include/asm/kvm_host.h | 13 ++++
arch/x86/include/asm/kvm_host.h | 9 +++
arch/x86/include/asm/tlbflush.h | 3 +-
arch/x86/mm/pat/set_memory.c | 1 +
arch/x86/mm/tlb.c | 1 +
include/linux/kvm_host.h | 14 ++++
include/linux/pagemap.h | 16 +++++
include/linux/secretmem.h | 18 ------
include/uapi/linux/kvm.h | 1 +
lib/buildid.c | 4 +-
mm/gup.c | 19 ++----
mm/mlock.c | 2 +-
mm/secretmem.c | 8 +--
.../testing/selftests/kvm/guest_memfd_test.c | 17 ++++-
.../testing/selftests/kvm/include/kvm_util.h | 37 ++++++++---
.../testing/selftests/kvm/include/test_util.h | 8 +++
tools/testing/selftests/kvm/lib/elf.c | 8 +--
tools/testing/selftests/kvm/lib/io.c | 23 +++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 59 +++++++++--------
tools/testing/selftests/kvm/lib/test_util.c | 8 +++
tools/testing/selftests/kvm/lib/x86/sev.c | 1 +
.../selftests/kvm/pre_fault_memory_test.c | 1 +
.../selftests/kvm/set_memory_region_test.c | 52 +++++++++++++--
.../kvm/x86/private_mem_conversions_test.c | 7 +-
virt/kvm/guest_memfd.c | 64 +++++++++++++++++--
26 files changed, 314 insertions(+), 102 deletions(-)
base-commit: e0c26d47def7382d7dbd9cad58bc653aed75737a
--
2.50.1
Hi,
the kernel selftests do not currently support glibc v2.35 and older.
This is primarily because glibc v2.36 added support for syscall API
functions such as fsopen, move_mount, fsmount, open_tree, and fsconfig,
and those API functions are now used in several tests.
As result, glibc v2.35 and older are no longer supported, at least not
for affected tests.
Historically the selftest framework implemented such functions with
wrappers named sys_<syscall>. This means that sys_<syscall> and <syscall>
functions are now used in parallel.
I see a number of possibilities to solve the problem.
1) Do nothing. Document somewhere that glibc v2.35 and older is not
supported by the kernel selftests.
2) Document that glibc v2.35 and older is not supported, drop
all wrappers implementing the sys_<syscall> functions, and rename
the calling code to <syscall>.
3) Rename all calls to sys_<syscall> to <syscall> and modify the
wrapper code to only implement those functions for glibc v2.35 and
older.
4) Add defines to the existing wrapper code to also define <syscall>
in addition to sys_<syscall>, and leave the code otherwise alone.
What would be the preferred solution ? My personal preference would be 3),
followed by 4).
Thanks,
Guenter
Hi,
This series fixes a verifier issue with bpf_d_path() and adds a
regression test to cover its use within a hook function.
Patch 1 updates the bpf_d_path() helper prototype so that the second
argument is marked as MEM_WRITE. This makes it explicit to the verifier
that the helper writes into the provided buffer.
Patch 2 extends the existing d_path selftest to cover incorrect verifier
assumptions caused by an incorrect function prototype. The test program calls
bpf_d_path() and checks if the first character of the path is '/'.
It ensures the verifier does not assume the buffer remains unwritten.
Changelog
=========
v4:
- Use the fallocate hook instead of an LSM hook to simplify the selftest,
as suggested by Matt and Alexei.
- Add a utility function in test_d_path.c to load the BPF program,
improving code reuse.
v3:
- Switch the pathname prefix loop to use bpf_for() instead of
#pragma unroll, as suggested by Matt.
- Remove /tmp/bpf_d_path_test in the test cleanup path.
- Add the missing Reviewed-by tags.
v2:
- Merge the new test into the existing d_path selftest rather than
creating new files.
- Add PID filtering in the LSM program to avoid nondeterministic failures
due to unrelated processes triggering bprm_check_security.
- Synchronize child execution using a pipe to ensure deterministic
updates to the PID.
Thanks for your time and reviews.
Shuran Liu (2):
bpf: mark bpf_d_path() buffer as writeable
selftests/bpf: add regression test for bpf_d_path()
kernel/trace/bpf_trace.c | 2 +-
.../testing/selftests/bpf/prog_tests/d_path.c | 90 +++++++++++++++----
.../testing/selftests/bpf/progs/test_d_path.c | 23 +++++
3 files changed, 96 insertions(+), 19 deletions(-)
--
2.52.0