This series introduces NUMA-aware memory placement support for KVM guests with guest_memfd memory backends. It builds upon Fuad Tabba's work (V17) that enabled host-mapping for guest_memfd memory [1].
== Background ==

KVM's guest-memfd memory backend currently lacks support for NUMA policy enforcement, so guest memory allocations are distributed across host nodes according to the kernel's default behavior, irrespective of any policy specified by the VMM. This limitation arises because conventional userspace NUMA control mechanisms such as mbind(2) do not work: the memory is not directly mapped to userspace when allocations occur. Fuad's work [1] provides the necessary mmap capability, and this series leverages it to enable mbind(2).
== Implementation ==

This series implements proper NUMA policy support for guest-memfd by:

1. Adding mempolicy-aware allocation APIs to the filemap layer.
2. Introducing custom inodes (via a dedicated slab-allocated inode cache, kvm_gmem_inode_info) to store NUMA policy and metadata for guest memory.
3. Implementing get/set_policy vm_ops in guest_memfd to support NUMA policy.
With these changes, VMMs can now control guest memory placement by mapping the guest_memfd file descriptor and using mbind(2) to specify:

- Policy modes: default, bind, interleave, or preferred
- Host NUMA nodes: list of target nodes for memory allocation
These policies affect only future allocations and do not migrate existing memory. This matches mbind(2)'s default behavior, which affects only new allocations unless overridden with the MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags (not supported for guest_memfd, as it is unmovable by design).
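To make the flow above concrete, here is a minimal userspace sketch, assuming the VMM has already created the guest_memfd and mapped it with MAP_SHARED. It issues the raw mbind(2) syscall with MPOL_BIND and a one-word node mask. The names nodemask_single(), bind_range_to_node() and SKETCH_MAXNODE are illustrative only and are not part of this series.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>	/* MPOL_BIND */

/* Number of node bits in the one-word mask used by this sketch. */
#define SKETCH_MAXNODE	(sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: return a mask with only @node set.
 * The top bit is left unused because mbind(2) is assumed to treat
 * maxnode as one past the highest usable bit.
 */
unsigned long nodemask_single(unsigned int node)
{
	assert(node + 1 < SKETCH_MAXNODE);
	return 1UL << node;
}

/*
 * Hypothetical helper: ask that future allocations in [addr, addr + len)
 * come from @node. @addr is assumed to be a page-aligned MAP_SHARED
 * mapping of a guest_memfd. No MPOL_MF_MOVE* flag is passed, so pages
 * that are already allocated stay where they are.
 */
int bind_range_to_node(void *addr, size_t len, unsigned int node)
{
	unsigned long mask = nodemask_single(node);

	if (syscall(SYS_mbind, addr, len, (unsigned long)MPOL_BIND,
		    &mask, (unsigned long)SKETCH_MAXNODE, 0UL))
		return -errno;
	return 0;
}
```

A VMM would call something like bind_range_to_node(mem, size, 1) before the guest first touches the range. Errors such as -ENOSYS or -EINVAL are plausible on hosts built without NUMA support or lacking the requested node.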
== Upstream Plan ==

Phased approach as per David's guest_memfd extension overview [2] and community calls [3]:

Phase 1 (this series):
1. Focuses on shared guest_memfd support (non-CoCo VMs).
2. Builds on Fuad's host-mapping work.

Phase 2 (future work):
1. NUMA support for private guest_memfd (CoCo VMs).
2. Depends on SNP in-place conversion support [4].
This series provides a clean integration path for NUMA-aware memory management for guest_memfd and lays the groundwork for future confidential computing NUMA capabilities.
Please review and provide feedback!
Thanks, Shivank
== Changelog ==
- v1,v2: Extended the KVM_CREATE_GUEST_MEMFD IOCTL to pass mempolicy.
- v3: Introduced fbind() syscall for VMM memory-placement configuration.
- v4-v6: Current approach using shared_policy support and vm_ops (based on suggestions from David [5] and guest_memfd bi-weekly upstream call discussion [6]).
- v7: Use inodes to store NUMA policy instead of file [7].
- v8: Rebase on top of Fuad's V12: Host mmapping for guest_memfd memory.
- v9: Rebase on top of Fuad's V13 and incorporate review comments.
- v10: Rebase on top of Fuad's V17. Use latest guest_memfd inode patch from Ackerley (with David's review comments). Use newer kmem_cache_create() API variant with arg parameter (Vlastimil).
[1] https://lore.kernel.org/all/20250729225455.670324-1-seanjc@google.com
[2] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAos...
[4] https://lore.kernel.org/all/20250613005400.3694904-1-michael.roth@amd.com
[5] https://lore.kernel.org/all/6fbef654-36e2-4be5-906e-2a648a845278@redhat.com
[6] https://lore.kernel.org/all/2b77e055-98ac-43a1-a7ad-9f9065d7f38f@amd.com
[7] https://lore.kernel.org/all/diqzbjumm167.fsf@ackerleytng-ctop.c.googlers.com
Ackerley Tng (1):
  KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes

Matthew Wilcox (Oracle) (2):
  mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()
  mm/filemap: Extend __filemap_get_folio() to support NUMA memory policies

Shivank Garg (4):
  mm/mempolicy: Export memory policy symbols
  KVM: guest_memfd: Add slab-allocated inode cache
  KVM: guest_memfd: Enforce NUMA mempolicy using shared policy
  KVM: guest_memfd: selftests: Add tests for mmap and NUMA policy support
 fs/bcachefs/fs-io-buffered.c                  |   2 +-
 fs/btrfs/compression.c                        |   4 +-
 fs/btrfs/verity.c                             |   2 +-
 fs/erofs/zdata.c                              |   2 +-
 fs/f2fs/compress.c                            |   2 +-
 include/linux/pagemap.h                       |  18 +-
 include/uapi/linux/magic.h                    |   1 +
 mm/filemap.c                                  |  23 +-
 mm/mempolicy.c                                |   6 +
 mm/readahead.c                                |   2 +-
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 121 ++++++++
 virt/kvm/guest_memfd.c                        | 260 ++++++++++++++++--
 virt/kvm/kvm_main.c                           |   7 +-
 virt/kvm/kvm_mm.h                             |   9 +-
 15 files changed, 410 insertions(+), 50 deletions(-)
From: "Matthew Wilcox (Oracle)" willy@infradead.org
Add a mempolicy parameter to filemap_alloc_folio() to enable NUMA-aware page cache allocations. This will be used by upcoming changes to support NUMA policies in guest-memfd, where guest memory needs to be allocated according to the NUMA policy specified by the VMM.
All existing users pass NULL, maintaining the current behavior.
Reviewed-by: Pankaj Gupta pankaj.gupta@amd.com
Reviewed-by: Vlastimil Babka vbabka@suse.cz
Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org
Reviewed-by: David Hildenbrand david@redhat.com
Signed-off-by: Shivank Garg shivankg@amd.com
---
 fs/bcachefs/fs-io-buffered.c |  2 +-
 fs/btrfs/compression.c       |  4 ++--
 fs/btrfs/verity.c            |  2 +-
 fs/erofs/zdata.c             |  2 +-
 fs/f2fs/compress.c           |  2 +-
 include/linux/pagemap.h      |  8 +++++---
 mm/filemap.c                 | 14 +++++++++-----
 mm/readahead.c               |  2 +-
 8 files changed, 21 insertions(+), 15 deletions(-)
diff --git a/fs/bcachefs/fs-io-buffered.c b/fs/bcachefs/fs-io-buffered.c index 1c54b9b5bd69..3af2eabb7ed3 100644 --- a/fs/bcachefs/fs-io-buffered.c +++ b/fs/bcachefs/fs-io-buffered.c @@ -124,7 +124,7 @@ static int readpage_bio_extend(struct btree_trans *trans, if (folio && !xa_is_value(folio)) break;
- folio = filemap_alloc_folio(readahead_gfp_mask(iter->mapping), order); + folio = filemap_alloc_folio(readahead_gfp_mask(iter->mapping), order, NULL); if (!folio) break;
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index d09d622016ef..139f9609e737 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -474,8 +474,8 @@ static noinline int add_ra_bio_pages(struct inode *inode, continue; }
- folio = filemap_alloc_folio(mapping_gfp_constraint(mapping, - ~__GFP_FS), 0); + folio = filemap_alloc_folio(mapping_gfp_constraint(mapping, ~__GFP_FS), + 0, NULL); if (!folio) break;
diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c index b7a96a005487..c43a789ba6d2 100644 --- a/fs/btrfs/verity.c +++ b/fs/btrfs/verity.c @@ -742,7 +742,7 @@ static struct page *btrfs_read_merkle_tree_page(struct inode *inode, }
folio = filemap_alloc_folio(mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS), - 0); + 0, NULL); if (!folio) return ERR_PTR(-ENOMEM);
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c index 792f20888a8f..09e2ed2ae0d1 100644 --- a/fs/erofs/zdata.c +++ b/fs/erofs/zdata.c @@ -562,7 +562,7 @@ static void z_erofs_bind_cache(struct z_erofs_frontend *fe) * Allocate a managed folio for cached I/O, or it may be * then filled with a file-backed folio for in-place I/O */ - newfolio = filemap_alloc_folio(gfp, 0); + newfolio = filemap_alloc_folio(gfp, 0, NULL); if (!newfolio) continue; newfolio->private = Z_EROFS_PREALLOCATED_FOLIO; diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c index 5c1f47e45dab..56a51c9ba4f1 100644 --- a/fs/f2fs/compress.c +++ b/fs/f2fs/compress.c @@ -1942,7 +1942,7 @@ static void f2fs_cache_compressed_page(struct f2fs_sb_info *sbi, return; }
- cfolio = filemap_alloc_folio(__GFP_NOWARN | __GFP_IO, 0); + cfolio = filemap_alloc_folio(__GFP_NOWARN | __GFP_IO, 0, NULL); if (!cfolio) return;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 12a12dae727d..ce617a35dc35 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -646,9 +646,11 @@ static inline void *detach_page_private(struct page *page) }
#ifdef CONFIG_NUMA -struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order); +struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order, + struct mempolicy *policy); #else -static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order) +static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order, + struct mempolicy *policy) { return folio_alloc_noprof(gfp, order); } @@ -659,7 +661,7 @@ static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int o
static inline struct page *__page_cache_alloc(gfp_t gfp) { - return &filemap_alloc_folio(gfp, 0)->page; + return &filemap_alloc_folio(gfp, 0, NULL)->page; }
static inline gfp_t readahead_gfp_mask(struct address_space *x) diff --git a/mm/filemap.c b/mm/filemap.c index 751838ef05e5..495f7f5c3d2e 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -989,11 +989,16 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio, EXPORT_SYMBOL_GPL(filemap_add_folio);
#ifdef CONFIG_NUMA -struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order) +struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order, + struct mempolicy *policy) { int n; struct folio *folio;
+ if (policy) + return folio_alloc_mpol_noprof(gfp, order, policy, + NO_INTERLEAVE_INDEX, numa_node_id()); + if (cpuset_do_page_mem_spread()) { unsigned int cpuset_mems_cookie; do { @@ -1978,7 +1983,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, err = -ENOMEM; if (order > min_order) alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; - folio = filemap_alloc_folio(alloc_gfp, order); + folio = filemap_alloc_folio(alloc_gfp, order, NULL); if (!folio) continue;
@@ -2517,7 +2522,7 @@ static int filemap_create_folio(struct kiocb *iocb, struct folio_batch *fbatch) if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ)) return -EAGAIN;
- folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order); + folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order, NULL); if (!folio) return -ENOMEM; if (iocb->ki_flags & IOCB_DONTCACHE) @@ -3916,8 +3921,7 @@ static struct folio *do_read_cache_folio(struct address_space *mapping, repeat: folio = filemap_get_folio(mapping, index); if (IS_ERR(folio)) { - folio = filemap_alloc_folio(gfp, - mapping_min_folio_order(mapping)); + folio = filemap_alloc_folio(gfp, mapping_min_folio_order(mapping), NULL); if (!folio) return ERR_PTR(-ENOMEM); index = mapping_align_index(mapping, index); diff --git a/mm/readahead.c b/mm/readahead.c index 406756d34309..a4dfa837dfbd 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -183,7 +183,7 @@ static struct folio *ractl_alloc_folio(struct readahead_control *ractl, { struct folio *folio;
- folio = filemap_alloc_folio(gfp_mask, order); + folio = filemap_alloc_folio(gfp_mask, order, NULL); if (folio && ractl->dropbehind) __folio_set_dropbehind(folio);
From: "Matthew Wilcox (Oracle)" willy@infradead.org
Extend __filemap_get_folio() to support NUMA memory policies by renaming the implementation to __filemap_get_folio_mpol() and adding a mempolicy parameter. The original function becomes a static inline wrapper that passes NULL for the mempolicy.
This infrastructure will enable future support for NUMA-aware page cache allocations for KVM guests that use the guest_memfd memory backend.
Reviewed-by: Pankaj Gupta pankaj.gupta@amd.com
Reviewed-by: Vlastimil Babka vbabka@suse.cz
Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org
Reviewed-by: David Hildenbrand david@redhat.com
Signed-off-by: Shivank Garg shivankg@amd.com
---
 include/linux/pagemap.h | 10 ++++++++--
 mm/filemap.c            | 11 ++++++-----
 2 files changed, 14 insertions(+), 7 deletions(-)
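The rename-plus-wrapper pattern described in the commit message can be modelled in plain userspace C as below. This is only a sketch under stated assumptions: mock_policy, mock_folio, mock_get_folio_mpol() and mock_get_folio() are made-up stand-ins, and the "placement" is reduced to a node number so the behavior is easy to check.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for kernel types; every name here is illustrative only. */
struct mock_policy { int preferred_node; };
struct mock_folio  { int node; };

#define MOCK_LOCAL_NODE 0

/* Extended entry point: takes an optional policy, like the *_mpol variant. */
struct mock_folio mock_get_folio_mpol(unsigned long index,
				      const struct mock_policy *policy)
{
	struct mock_folio folio;

	(void)index;
	/* A NULL policy keeps the pre-existing default placement. */
	folio.node = policy ? policy->preferred_node : MOCK_LOCAL_NODE;
	return folio;
}

/* The original name survives as a thin wrapper, so old callers are unchanged. */
static inline struct mock_folio mock_get_folio(unsigned long index)
{
	return mock_get_folio_mpol(index, NULL);
}
```

Existing callers keep using the short name and see no behavior change, while a policy-aware caller switches to the extended variant and passes its policy explicitly.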
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index ce617a35dc35..94d65ced0a1d 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -747,11 +747,17 @@ static inline fgf_t fgf_set_order(size_t size) }
void *filemap_get_entry(struct address_space *mapping, pgoff_t index); -struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, - fgf_t fgp_flags, gfp_t gfp); +struct folio *__filemap_get_folio_mpol(struct address_space *mapping, + pgoff_t index, fgf_t fgf_flags, gfp_t gfp, struct mempolicy *policy); struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index, fgf_t fgp_flags, gfp_t gfp);
+static inline struct folio *__filemap_get_folio(struct address_space *mapping, + pgoff_t index, fgf_t fgf_flags, gfp_t gfp) +{ + return __filemap_get_folio_mpol(mapping, index, fgf_flags, gfp, NULL); +} + /** * write_begin_get_folio - Get folio for write_begin with flags. * @iocb: The kiocb passed from write_begin (may be NULL). diff --git a/mm/filemap.c b/mm/filemap.c index 495f7f5c3d2e..03f223be575c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1897,11 +1897,12 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index) }
/** - * __filemap_get_folio - Find and get a reference to a folio. + * __filemap_get_folio_mpol - Find and get a reference to a folio. * @mapping: The address_space to search. * @index: The page index. * @fgp_flags: %FGP flags modify how the folio is returned. * @gfp: Memory allocation flags to use if %FGP_CREAT is specified. + * @policy: NUMA memory allocation policy to follow. * * Looks up the page cache entry at @mapping & @index. * @@ -1912,8 +1913,8 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index) * * Return: The found folio or an ERR_PTR() otherwise. */ -struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, - fgf_t fgp_flags, gfp_t gfp) +struct folio *__filemap_get_folio_mpol(struct address_space *mapping, + pgoff_t index, fgf_t fgp_flags, gfp_t gfp, struct mempolicy *policy) { struct folio *folio;
@@ -1983,7 +1984,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, err = -ENOMEM; if (order > min_order) alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; - folio = filemap_alloc_folio(alloc_gfp, order, NULL); + folio = filemap_alloc_folio(alloc_gfp, order, policy); if (!folio) continue;
@@ -2030,7 +2031,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, folio_clear_dropbehind(folio); return folio; } -EXPORT_SYMBOL(__filemap_get_folio); +EXPORT_SYMBOL(__filemap_get_folio_mpol);
static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max, xa_mark_t mark)
KVM guest_memfd wants to implement support for NUMA policies just like shmem already does using the shared policy infrastructure. As guest_memfd currently resides in KVM module code, we have to export the relevant symbols.
In the future, guest_memfd might be moved to core-mm, at which point the symbols no longer would have to be exported. When/if that happens is still unclear.
Acked-by: David Hildenbrand david@redhat.com
Acked-by: Vlastimil Babka vbabka@suse.cz
Signed-off-by: Shivank Garg shivankg@amd.com
---
 mm/mempolicy.c | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index eb83cff7db8c..d385202306db 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -354,6 +354,7 @@ struct mempolicy *get_task_policy(struct task_struct *p)
return &default_policy; } +EXPORT_SYMBOL_GPL_FOR_MODULES(get_task_policy, "kvm");
static const struct mempolicy_operations { int (*create)(struct mempolicy *pol, const nodemask_t *nodes); @@ -487,6 +488,7 @@ void __mpol_put(struct mempolicy *pol) return; kmem_cache_free(policy_cache, pol); } +EXPORT_SYMBOL_GPL_FOR_MODULES(__mpol_put, "kvm");
static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes) { @@ -2885,6 +2887,7 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp, read_unlock(&sp->lock); return pol; } +EXPORT_SYMBOL_GPL_FOR_MODULES(mpol_shared_policy_lookup, "kvm");
static void sp_free(struct sp_node *n) { @@ -3170,6 +3173,7 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol) mpol_put(mpol); /* drop our incoming ref on sb mpol */ } } +EXPORT_SYMBOL_GPL_FOR_MODULES(mpol_shared_policy_init, "kvm");
int mpol_set_shared_policy(struct shared_policy *sp, struct vm_area_struct *vma, struct mempolicy *pol) @@ -3188,6 +3192,7 @@ int mpol_set_shared_policy(struct shared_policy *sp, sp_free(new); return err; } +EXPORT_SYMBOL_GPL_FOR_MODULES(mpol_set_shared_policy, "kvm");
/* Free a backing policy store on inode delete. */ void mpol_free_shared_policy(struct shared_policy *sp) @@ -3206,6 +3211,7 @@ void mpol_free_shared_policy(struct shared_policy *sp) } write_unlock(&sp->lock); } +EXPORT_SYMBOL_GPL_FOR_MODULES(mpol_free_shared_policy, "kvm");
#ifdef CONFIG_NUMA_BALANCING static int __initdata numabalancing_override;
From: Ackerley Tng ackerleytng@google.com
guest_memfd's inode represents memory the guest_memfd is providing. guest_memfd's file represents a struct kvm's view of that memory.
Using a custom inode allows customization of the inode teardown process via callbacks. For example, ->evict_inode() allows customization of the truncation process on file close, and ->destroy_inode() and ->free_inode() allow customization of the inode freeing process.
Customizing the truncation process allows flexibility in management of guest_memfd memory and customization of the inode freeing process allows proper cleanup of memory metadata stored on the inode.
Memory metadata is more appropriately stored on the inode (as opposed to the file), since the metadata is for the memory and is not unique to a specific binding and struct kvm.
Co-developed-by: Fuad Tabba tabba@google.com
Signed-off-by: Fuad Tabba tabba@google.com
Signed-off-by: Ackerley Tng ackerleytng@google.com
Signed-off-by: Shivank Garg shivankg@amd.com
---
 include/uapi/linux/magic.h |   1 +
 virt/kvm/guest_memfd.c     | 128 ++++++++++++++++++++++++++++++-------
 virt/kvm/kvm_main.c        |   7 +-
 virt/kvm/kvm_mm.h          |   9 +--
 4 files changed, 118 insertions(+), 27 deletions(-)
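The new superblock magic added to magic.h in this patch follows the usual convention of packing a four-character ASCII tag into a 32-bit value. As a small illustration (magic_to_tag() is a hypothetical helper, not something this patch adds), the value can be decoded like this:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GUEST_MEMFD_MAGIC 0x474d454d	/* value taken from this patch */

/* Hypothetical helper: write the four magic bytes, MSB first, as a string. */
void magic_to_tag(uint32_t magic, char tag[5])
{
	for (int i = 0; i < 4; i++)
		tag[i] = (char)((magic >> (24 - 8 * i)) & 0xff);
	tag[4] = '\0';
}
```

Decoding GUEST_MEMFD_MAGIC this way yields the tag given in the comment next to the define.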
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h index bb575f3ab45e..638ca21b7a90 100644 --- a/include/uapi/linux/magic.h +++ b/include/uapi/linux/magic.h @@ -103,5 +103,6 @@ #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */ #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */ #define PID_FS_MAGIC 0x50494446 /* "PIDF" */ +#define GUEST_MEMFD_MAGIC 0x474d454d /* "GMEM" */
#endif /* __LINUX_MAGIC_H__ */ diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 08a6bc7d25b6..0e93323fc839 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -1,12 +1,16 @@ // SPDX-License-Identifier: GPL-2.0 +#include <linux/anon_inodes.h> #include <linux/backing-dev.h> #include <linux/falloc.h> +#include <linux/fs.h> #include <linux/kvm_host.h> +#include <linux/pseudo_fs.h> #include <linux/pagemap.h> -#include <linux/anon_inodes.h>
#include "kvm_mm.h"
+static struct vfsmount *kvm_gmem_mnt; + struct kvm_gmem { struct kvm *kvm; struct xarray bindings; @@ -385,9 +389,45 @@ static struct file_operations kvm_gmem_fops = { .fallocate = kvm_gmem_fallocate, };
-void kvm_gmem_init(struct module *module) +static int kvm_gmem_init_fs_context(struct fs_context *fc) +{ + if (!init_pseudo(fc, GUEST_MEMFD_MAGIC)) + return -ENOMEM; + + fc->s_iflags |= SB_I_NOEXEC; + fc->s_iflags |= SB_I_NODEV; + + return 0; +} + +static struct file_system_type kvm_gmem_fs = { + .name = "guest_memfd", + .init_fs_context = kvm_gmem_init_fs_context, + .kill_sb = kill_anon_super, +}; + +static int kvm_gmem_init_mount(void) +{ + kvm_gmem_mnt = kern_mount(&kvm_gmem_fs); + + if (IS_ERR(kvm_gmem_mnt)) + return PTR_ERR(kvm_gmem_mnt); + + kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC; + return 0; +} + +int kvm_gmem_init(struct module *module) { kvm_gmem_fops.owner = module; + + return kvm_gmem_init_mount(); +} + +void kvm_gmem_exit(void) +{ + kern_unmount(kvm_gmem_mnt); + kvm_gmem_mnt = NULL; }
static int kvm_gmem_migrate_folio(struct address_space *mapping, @@ -463,11 +503,71 @@ bool __weak kvm_arch_supports_gmem_mmap(struct kvm *kvm) return true; }
+static struct inode *kvm_gmem_inode_make_secure_inode(const char *name, + loff_t size, u64 flags) +{ + struct inode *inode; + + inode = anon_inode_make_secure_inode(kvm_gmem_mnt->mnt_sb, name, NULL); + if (IS_ERR(inode)) + return inode; + + inode->i_private = (void *)(unsigned long)flags; + inode->i_op = &kvm_gmem_iops; + inode->i_mapping->a_ops = &kvm_gmem_aops; + inode->i_mode |= S_IFREG; + inode->i_size = size; + mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); + mapping_set_inaccessible(inode->i_mapping); + /* Unmovable mappings are supposed to be marked unevictable as well. */ + WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping)); + + return inode; +} + +static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size, + u64 flags) +{ + static const char *name = "[kvm-gmem]"; + struct inode *inode; + struct file *file; + int err; + + err = -ENOENT; + if (!try_module_get(kvm_gmem_fops.owner)) + goto err; + + inode = kvm_gmem_inode_make_secure_inode(name, size, flags); + if (IS_ERR(inode)) { + err = PTR_ERR(inode); + goto err_put_module; + } + + file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, + &kvm_gmem_fops); + if (IS_ERR(file)) { + err = PTR_ERR(file); + goto err_put_inode; + } + + file->f_flags |= O_LARGEFILE; + file->private_data = priv; + +out: + return file; + +err_put_inode: + iput(inode); +err_put_module: + module_put(kvm_gmem_fops.owner); +err: + file = ERR_PTR(err); + goto out; +} + static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) { - const char *anon_name = "[kvm-gmem]"; struct kvm_gmem *gmem; - struct inode *inode; struct file *file; int fd, err;
@@ -481,32 +581,16 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) goto err_fd; }
- file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem, - O_RDWR, NULL); + file = kvm_gmem_inode_create_getfile(gmem, size, flags); if (IS_ERR(file)) { err = PTR_ERR(file); goto err_gmem; }
- file->f_flags |= O_LARGEFILE; - - inode = file->f_inode; - WARN_ON(file->f_mapping != inode->i_mapping); - - inode->i_private = (void *)(unsigned long)flags; - inode->i_op = &kvm_gmem_iops; - inode->i_mapping->a_ops = &kvm_gmem_aops; - inode->i_mode |= S_IFREG; - inode->i_size = size; - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); - mapping_set_inaccessible(inode->i_mapping); - /* Unmovable mappings are supposed to be marked unevictable as well. */ - WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping)); - kvm_get_kvm(kvm); gmem->kvm = kvm; xa_init(&gmem->bindings); - list_add(&gmem->entry, &inode->i_mapping->i_private_list); + list_add(&gmem->entry, &file_inode(file)->i_mapping->i_private_list);
fd_install(fd, file); return fd; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 18f29ef93543..301d48d6e00d 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -6489,7 +6489,9 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module) if (WARN_ON_ONCE(r)) goto err_vfio;
- kvm_gmem_init(module); + r = kvm_gmem_init(module); + if (r) + goto err_gmem;
r = kvm_init_virtualization(); if (r) @@ -6510,6 +6512,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module) err_register: kvm_uninit_virtualization(); err_virt: + kvm_gmem_exit(); +err_gmem: kvm_vfio_ops_exit(); err_vfio: kvm_async_pf_deinit(); @@ -6541,6 +6545,7 @@ void kvm_exit(void) for_each_possible_cpu(cpu) free_cpumask_var(per_cpu(cpu_kick_mask, cpu)); kmem_cache_destroy(kvm_vcpu_cache); + kvm_gmem_exit(); kvm_vfio_ops_exit(); kvm_async_pf_deinit(); kvm_irqfd_exit(); diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h index 31defb08ccba..9fcc5d5b7f8d 100644 --- a/virt/kvm/kvm_mm.h +++ b/virt/kvm/kvm_mm.h @@ -68,17 +68,18 @@ static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, #endif /* HAVE_KVM_PFNCACHE */
#ifdef CONFIG_KVM_GUEST_MEMFD -void kvm_gmem_init(struct module *module); +int kvm_gmem_init(struct module *module); +void kvm_gmem_exit(void); int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args); int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot, unsigned int fd, loff_t offset); void kvm_gmem_unbind(struct kvm_memory_slot *slot); #else -static inline void kvm_gmem_init(struct module *module) +static inline int kvm_gmem_init(struct module *module) { - + return 0; } - +static inline void kvm_gmem_exit(void) {}; static inline int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot, unsigned int fd, loff_t offset)
On 11.08.25 11:06, Shivank Garg wrote:
> From: Ackerley Tng ackerleytng@google.com
>
> guest_memfd's inode represents memory the guest_memfd is providing. guest_memfd's file represents a struct kvm's view of that memory.
>
> Using a custom inode allows customization of the inode teardown process via callbacks. For example, ->evict_inode() allows customization of the truncation process on file close, and ->destroy_inode() and ->free_inode() allow customization of the inode freeing process.
>
> Customizing the truncation process allows flexibility in management of guest_memfd memory and customization of the inode freeing process allows proper cleanup of memory metadata stored on the inode.
>
> Memory metadata is more appropriately stored on the inode (as opposed to the file), since the metadata is for the memory and is not unique to a specific binding and struct kvm.
>
> Co-developed-by: Fuad Tabba tabba@google.com
> Signed-off-by: Fuad Tabba tabba@google.com
> Signed-off-by: Ackerley Tng ackerleytng@google.com
> Signed-off-by: Shivank Garg shivankg@amd.com

[...]

>  static int kvm_gmem_migrate_folio(struct address_space *mapping,
> @@ -463,11 +503,71 @@ bool __weak kvm_arch_supports_gmem_mmap(struct kvm *kvm)
>  	return true;
>  }
> +static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
> +						      loff_t size, u64 flags)
> +{
> +	struct inode *inode;
> +	inode = anon_inode_make_secure_inode(kvm_gmem_mnt->mnt_sb, name, NULL);
> +	if (IS_ERR(inode))
> +		return inode;
> +	inode->i_private = (void *)(unsigned long)flags;
> +	inode->i_op = &kvm_gmem_iops;
> +	inode->i_mapping->a_ops = &kvm_gmem_aops;
> +	inode->i_mode |= S_IFREG;
> +	inode->i_size = size;
> +	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> +	mapping_set_inaccessible(inode->i_mapping);
> +	/* Unmovable mappings are supposed to be marked unevictable as well. */
> +	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> +	return inode;
> +}
> +static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> +						  u64 flags)
> +{
> +	static const char *name = "[kvm-gmem]";
> +	struct inode *inode;
> +	struct file *file;
> +	int err;
> +	err = -ENOENT;

Maybe add a comment here when the module reference will get dropped. And maybe we should just switch to fops_get() + fops_put?

/* __fput() will take care of fops_put(). */
if (!fops_get(&kvm_gmem_fops))
	goto err;

> +	inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
> +	if (IS_ERR(inode)) {
> +		err = PTR_ERR(inode);
> +		goto err_put_module;
> +	}
> +	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
> +				 &kvm_gmem_fops);
> +	if (IS_ERR(file)) {
> +		err = PTR_ERR(file);
> +		goto err_put_inode;
> +	}
> +	file->f_flags |= O_LARGEFILE;
> +	file->private_data = priv;
> +out:
> +	return file;
> +err_put_inode:
> +	iput(inode);
> +err_put_module:
> +	module_put(kvm_gmem_fops.owner);

fops_put(&kvm_gmem_fops);

?
Acked-by: David Hildenbrand david@redhat.com
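The ownership rule under discussion (the reference is taken up front, dropped by the error path on failure, and dropped by the final fput on success) can be modelled in userspace roughly as follows. All names here are made up for the sketch; the counters merely stand in for the module reference and the inode, and fail_at is a test knob to force each error branch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; none of these names exist in the kernel. */
enum fail_at { FAIL_NONE, FAIL_GET, FAIL_INODE, FAIL_FILE };

struct mock_file { int open; };

static int owner_refs;		/* models the module reference count */
static int live_inodes;		/* models inodes not yet released */
static struct mock_file the_file;

struct mock_file *mock_create_getfile(enum fail_at fail)
{
	/* Step 1: pin the owner, as fops_get() would. */
	if (fail == FAIL_GET)
		goto err;
	owner_refs++;

	/* Step 2: allocate the inode. */
	if (fail == FAIL_INODE)
		goto err_fops_put;
	live_inodes++;

	/* Step 3: allocate the file; on success it owns both resources. */
	if (fail == FAIL_FILE)
		goto err_put_inode;
	the_file.open = 1;
	return &the_file;

	/* Unwind in reverse order of acquisition. */
err_put_inode:
	live_inodes--;
err_fops_put:
	owner_refs--;
err:
	return NULL;
}

/* Models the final fput(): releases everything the file owned. */
void mock_fput(struct mock_file *file)
{
	file->open = 0;
	live_inodes--;
	owner_refs--;
}
```

Whichever step fails, the counters return to zero, and on success they return to zero only once the file itself is released, which is the property the suggested comment is meant to document.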
David Hildenbrand david@redhat.com writes:
> On 11.08.25 11:06, Shivank Garg wrote:
>> From: Ackerley Tng ackerleytng@google.com

[...snip...]

>> +static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
>> +						  u64 flags)
>> +{
>> +	static const char *name = "[kvm-gmem]";
>> +	struct inode *inode;
>> +	struct file *file;
>> +	int err;
>> +	err = -ENOENT;
>
> Maybe add a comment here when the module reference will get dropped. And maybe we should just switch to fops_get() + fops_put?
>
> /* __fput() will take care of fops_put(). */
> if (!fops_get(&kvm_gmem_fops))
> 	goto err;

Sounds good! Please see attached patch. It's exactly what you suggested except I renamed the goto target to err_fops_put:

>> +	inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
>> +	if (IS_ERR(inode)) {
>> +		err = PTR_ERR(inode);
>> +		goto err_put_module;
>> +	}
>> +	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
>> +				 &kvm_gmem_fops);
>> +	if (IS_ERR(file)) {
>> +		err = PTR_ERR(file);
>> +		goto err_put_inode;
>> +	}
>> +	file->f_flags |= O_LARGEFILE;
>> +	file->private_data = priv;
>> +out:
>> +	return file;
>> +err_put_inode:
>> +	iput(inode);
>> +err_put_module:
>> +	module_put(kvm_gmem_fops.owner);
>
> fops_put(&kvm_gmem_fops);
>
> ?
>
> Acked-by: David Hildenbrand david@redhat.com
>
> --
> Cheers,
>
> David / dhildenb
From f2bd4499bce4db69bf34be75e009579db4329b7c Mon Sep 17 00:00:00 2001
From: Ackerley Tng ackerleytng@google.com
Date: Sun, 13 Jul 2025 17:43:35 +0000
Subject: [PATCH] KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes
guest_memfd's inode represents memory the guest_memfd is providing. guest_memfd's file represents a struct kvm's view of that memory.
Using a custom inode allows customization of the inode teardown process via callbacks. For example, ->evict_inode() allows customization of the truncation process on file close, and ->destroy_inode() and ->free_inode() allow customization of the inode freeing process.
Customizing the truncation process allows flexibility in management of guest_memfd memory and customization of the inode freeing process allows proper cleanup of memory metadata stored on the inode.
Memory metadata is more appropriately stored on the inode (as opposed to the file), since the metadata is for the memory and is not unique to a specific binding and struct kvm.
Co-developed-by: Fuad Tabba tabba@google.com
Signed-off-by: Fuad Tabba tabba@google.com
Signed-off-by: Shivank Garg shivankg@amd.com
Signed-off-by: Ackerley Tng ackerleytng@google.com
---
 include/uapi/linux/magic.h |   1 +
 virt/kvm/guest_memfd.c     | 129 ++++++++++++++++++++++++++++++-------
 virt/kvm/kvm_main.c        |   7 +-
 virt/kvm/kvm_mm.h          |   9 +--
 4 files changed, 119 insertions(+), 27 deletions(-)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h index bb575f3ab45e5..638ca21b7a909 100644 --- a/include/uapi/linux/magic.h +++ b/include/uapi/linux/magic.h @@ -103,5 +103,6 @@ #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */ #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */ #define PID_FS_MAGIC 0x50494446 /* "PIDF" */ +#define GUEST_MEMFD_MAGIC 0x474d454d /* "GMEM" */
#endif /* __LINUX_MAGIC_H__ */ diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 08a6bc7d25b60..6c66a09740550 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -1,12 +1,16 @@ // SPDX-License-Identifier: GPL-2.0 +#include <linux/anon_inodes.h> #include <linux/backing-dev.h> #include <linux/falloc.h> +#include <linux/fs.h> #include <linux/kvm_host.h> +#include <linux/pseudo_fs.h> #include <linux/pagemap.h> -#include <linux/anon_inodes.h>
#include "kvm_mm.h"
+static struct vfsmount *kvm_gmem_mnt; + struct kvm_gmem { struct kvm *kvm; struct xarray bindings; @@ -385,9 +389,45 @@ static struct file_operations kvm_gmem_fops = { .fallocate = kvm_gmem_fallocate, };
-void kvm_gmem_init(struct module *module)
+static int kvm_gmem_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, GUEST_MEMFD_MAGIC))
+		return -ENOMEM;
+
+	fc->s_iflags |= SB_I_NOEXEC;
+	fc->s_iflags |= SB_I_NODEV;
+
+	return 0;
+}
+
+static struct file_system_type kvm_gmem_fs = {
+	.name = "guest_memfd",
+	.init_fs_context = kvm_gmem_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static int kvm_gmem_init_mount(void)
+{
+	kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
+
+	if (IS_ERR(kvm_gmem_mnt))
+		return PTR_ERR(kvm_gmem_mnt);
+
+	kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
+	return 0;
+}
+
+int kvm_gmem_init(struct module *module)
 {
 	kvm_gmem_fops.owner = module;
+
+	return kvm_gmem_init_mount();
+}
+
+void kvm_gmem_exit(void)
+{
+	kern_unmount(kvm_gmem_mnt);
+	kvm_gmem_mnt = NULL;
 }

 static int kvm_gmem_migrate_folio(struct address_space *mapping,
@@ -463,11 +503,72 @@ bool __weak kvm_arch_supports_gmem_mmap(struct kvm *kvm)
 	return true;
 }

+static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
+						      loff_t size, u64 flags)
+{
+	struct inode *inode;
+
+	inode = anon_inode_make_secure_inode(kvm_gmem_mnt->mnt_sb, name, NULL);
+	if (IS_ERR(inode))
+		return inode;
+
+	inode->i_private = (void *)(unsigned long)flags;
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_inaccessible(inode->i_mapping);
+	/* Unmovable mappings are supposed to be marked unevictable as well. */
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	return inode;
+}
+
+static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
+						  u64 flags)
+{
+	static const char *name = "[kvm-gmem]";
+	struct inode *inode;
+	struct file *file;
+	int err;
+
+	err = -ENOENT;
+	/* __fput() will take care of fops_put(). */
+	if (!fops_get(&kvm_gmem_fops))
+		goto err;
+
+	inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto err_fops_put;
+	}
+
+	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
+				 &kvm_gmem_fops);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_inode;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+	file->private_data = priv;
+
+out:
+	return file;
+
+err_put_inode:
+	iput(inode);
+err_fops_put:
+	fops_put(&kvm_gmem_fops);
+err:
+	file = ERR_PTR(err);
+	goto out;
+}
+
 static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
-	const char *anon_name = "[kvm-gmem]";
 	struct kvm_gmem *gmem;
-	struct inode *inode;
 	struct file *file;
 	int fd, err;

@@ -481,32 +582,16 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_fd;
 	}

-	file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
-					 O_RDWR, NULL);
+	file = kvm_gmem_inode_create_getfile(gmem, size, flags);
 	if (IS_ERR(file)) {
 		err = PTR_ERR(file);
 		goto err_gmem;
 	}

-	file->f_flags |= O_LARGEFILE;
-
-	inode = file->f_inode;
-	WARN_ON(file->f_mapping != inode->i_mapping);
-
-	inode->i_private = (void *)(unsigned long)flags;
-	inode->i_op = &kvm_gmem_iops;
-	inode->i_mapping->a_ops = &kvm_gmem_aops;
-	inode->i_mode |= S_IFREG;
-	inode->i_size = size;
-	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
-	mapping_set_inaccessible(inode->i_mapping);
-	/* Unmovable mappings are supposed to be marked unevictable as well. */
-	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
-
 	kvm_get_kvm(kvm);
 	gmem->kvm = kvm;
 	xa_init(&gmem->bindings);
-	list_add(&gmem->entry, &inode->i_mapping->i_private_list);
+	list_add(&gmem->entry, &file_inode(file)->i_mapping->i_private_list);

 	fd_install(fd, file);
 	return fd;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 18f29ef935437..301d48d6e00d0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6489,7 +6489,9 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	if (WARN_ON_ONCE(r))
 		goto err_vfio;

-	kvm_gmem_init(module);
+	r = kvm_gmem_init(module);
+	if (r)
+		goto err_gmem;

 	r = kvm_init_virtualization();
 	if (r)
@@ -6510,6 +6512,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 err_register:
 	kvm_uninit_virtualization();
 err_virt:
+	kvm_gmem_exit();
+err_gmem:
 	kvm_vfio_ops_exit();
 err_vfio:
 	kvm_async_pf_deinit();
@@ -6541,6 +6545,7 @@ void kvm_exit(void)
 	for_each_possible_cpu(cpu)
 		free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
 	kmem_cache_destroy(kvm_vcpu_cache);
+	kvm_gmem_exit();
 	kvm_vfio_ops_exit();
 	kvm_async_pf_deinit();
 	kvm_irqfd_exit();
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 31defb08ccbab..9fcc5d5b7f8d0 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -68,17 +68,18 @@ static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
 #endif /* HAVE_KVM_PFNCACHE */

 #ifdef CONFIG_KVM_GUEST_MEMFD
-void kvm_gmem_init(struct module *module);
+int kvm_gmem_init(struct module *module);
+void kvm_gmem_exit(void);
 int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
 int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 		  unsigned int fd, loff_t offset);
 void kvm_gmem_unbind(struct kvm_memory_slot *slot);
 #else
-static inline void kvm_gmem_init(struct module *module)
+static inline int kvm_gmem_init(struct module *module)
 {
-
+	return 0;
 }
-
+static inline void kvm_gmem_exit(void) {};
 static inline int kvm_gmem_bind(struct kvm *kvm,
 					 struct kvm_memory_slot *slot,
 					 unsigned int fd, loff_t offset)
--
2.51.0.rc0.155.g4a0f42376b-goog
On 8/12/2025 2:53 AM, Ackerley Tng wrote:
David Hildenbrand david@redhat.com writes:
On 11.08.25 11:06, Shivank Garg wrote:
From: Ackerley Tng ackerleytng@google.com
[...snip...]
+static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
+						  u64 flags)
+{
+	static const char *name = "[kvm-gmem]";
+	struct inode *inode;
+	struct file *file;
+	int err;
+
+	err = -ENOENT;
Maybe add a comment here when the module reference will get dropped. And maybe we should just switch to fops_get() + fops_put?
	/* __fput() will take care of fops_put(). */
	if (!fops_get(&kvm_gmem_fops))
		goto err;
Sounds good! Please see attached patch. It's exactly what you suggested except I renamed the goto target to err_fops_put:
+	inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto err_put_module;
+	}
+
+	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
+				 &kvm_gmem_fops);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_inode;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+	file->private_data = priv;
+
+out:
+	return file;
+
+err_put_inode:
+	iput(inode);
+err_put_module:
+	module_put(kvm_gmem_fops.owner);
fops_put(&kvm_gmem_fops);
?
Acked-by: David Hildenbrand david@redhat.com
-- Cheers,
David / dhildenb
From f2bd4499bce4db69bf34be75e009579db4329b7c Mon Sep 17 00:00:00 2001
From: Ackerley Tng ackerleytng@google.com
Date: Sun, 13 Jul 2025 17:43:35 +0000
Subject: [PATCH] KVM: guest_memfd: Use guest mem inodes instead of anonymous inodes
guest_memfd's inode represents memory the guest_memfd is providing. guest_memfd's file represents a struct kvm's view of that memory.
Using a custom inode allows customization of the inode teardown process via callbacks. For example, ->evict_inode() allows customization of the truncation process on file close, and ->destroy_inode() and ->free_inode() allow customization of the inode freeing process.
Customizing the truncation process allows flexibility in management of guest_memfd memory and customization of the inode freeing process allows proper cleanup of memory metadata stored on the inode.
Memory metadata is more appropriately stored on the inode (as opposed to the file), since the metadata is for the memory and is not unique to a specific binding and struct kvm.
Co-developed-by: Fuad Tabba tabba@google.com
Signed-off-by: Fuad Tabba tabba@google.com
Signed-off-by: Shivank Garg shivankg@amd.com
Signed-off-by: Ackerley Tng ackerleytng@google.com
---
 include/uapi/linux/magic.h |   1 +
 virt/kvm/guest_memfd.c     | 129 ++++++++++++++++++++++++++++++-------
 virt/kvm/kvm_main.c        |   7 +-
 virt/kvm/kvm_mm.h          |   9 +--
 4 files changed, 119 insertions(+), 27 deletions(-)

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e5..638ca21b7a909 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -103,5 +103,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 #define PID_FS_MAGIC		0x50494446	/* "PIDF" */
+#define GUEST_MEMFD_MAGIC	0x474d454d	/* "GMEM" */

 #endif /* __LINUX_MAGIC_H__ */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 08a6bc7d25b60..6c66a09740550 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1,12 +1,16 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/anon_inodes.h>
 #include <linux/backing-dev.h>
 #include <linux/falloc.h>
+#include <linux/fs.h>
 #include <linux/kvm_host.h>
+#include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
-#include <linux/anon_inodes.h>

 #include "kvm_mm.h"

+static struct vfsmount *kvm_gmem_mnt;
+
 struct kvm_gmem {
 	struct kvm *kvm;
 	struct xarray bindings;
@@ -385,9 +389,45 @@ static struct file_operations kvm_gmem_fops = {
 	.fallocate = kvm_gmem_fallocate,
 };

-void kvm_gmem_init(struct module *module)
+static int kvm_gmem_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, GUEST_MEMFD_MAGIC))
+		return -ENOMEM;
+
+	fc->s_iflags |= SB_I_NOEXEC;
+	fc->s_iflags |= SB_I_NODEV;
+
+	return 0;
+}
+
+static struct file_system_type kvm_gmem_fs = {
+	.name = "guest_memfd",
+	.init_fs_context = kvm_gmem_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static int kvm_gmem_init_mount(void)
+{
+	kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
+
+	if (IS_ERR(kvm_gmem_mnt))
+		return PTR_ERR(kvm_gmem_mnt);
+
+	kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
+	return 0;
+}
+
+int kvm_gmem_init(struct module *module)
 {
 	kvm_gmem_fops.owner = module;
+
+	return kvm_gmem_init_mount();
+}
+
+void kvm_gmem_exit(void)
+{
+	kern_unmount(kvm_gmem_mnt);
+	kvm_gmem_mnt = NULL;
 }

 static int kvm_gmem_migrate_folio(struct address_space *mapping,
@@ -463,11 +503,72 @@ bool __weak kvm_arch_supports_gmem_mmap(struct kvm *kvm)
 	return true;
 }

+static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
+						      loff_t size, u64 flags)
+{
+	struct inode *inode;
+
+	inode = anon_inode_make_secure_inode(kvm_gmem_mnt->mnt_sb, name, NULL);
+	if (IS_ERR(inode))
+		return inode;
+
+	inode->i_private = (void *)(unsigned long)flags;
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_inaccessible(inode->i_mapping);
+	/* Unmovable mappings are supposed to be marked unevictable as well. */
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	return inode;
+}
+
+static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
+						  u64 flags)
+{
+	static const char *name = "[kvm-gmem]";
+	struct inode *inode;
+	struct file *file;
+	int err;
+
+	err = -ENOENT;
+	/* __fput() will take care of fops_put(). */
+	if (!fops_get(&kvm_gmem_fops))
+		goto err;
+
+	inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto err_fops_put;
+	}
+
+	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
+				 &kvm_gmem_fops);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_inode;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+	file->private_data = priv;
+
+out:
+	return file;
+
+err_put_inode:
+	iput(inode);
+err_fops_put:
+	fops_put(&kvm_gmem_fops);
Thanks Ackerley. LGTM
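The unwinding order discussed above can be modelled in userspace. The following is only a minimal sketch, not kernel code: `toy_fops_get()`, `toy_create_getfile()`, `toy_fput()` and the `fail_at` parameter are hypothetical stand-ins. It assumes, as the `__fput()` comment in the patch states, that on success the module reference is handed to the file and dropped when the file is released, while each failure path drops exactly the references taken so far.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the module and inode reference counts. */
static int toy_module_refs;
static int toy_inode_refs;

struct toy_file { int unused; };

enum toy_fail { TOY_FAIL_NONE, TOY_FAIL_INODE, TOY_FAIL_FILE };

static int toy_fops_get(void) { toy_module_refs++; return 1; }
static void toy_fops_put(void) { toy_module_refs--; }

/* Returns a file on success, NULL on failure (injected via fail_at). */
static struct toy_file *toy_create_getfile(enum toy_fail fail_at)
{
	struct toy_file *file;

	if (!toy_fops_get())
		goto err;

	/* Step 1: create the inode. */
	if (fail_at == TOY_FAIL_INODE)
		goto err_fops_put;
	toy_inode_refs++;

	/* Step 2: create the file that will own both references. */
	file = (fail_at == TOY_FAIL_FILE) ? NULL : malloc(sizeof(*file));
	if (!file)
		goto err_put_inode;

	return file;

err_put_inode:
	toy_inode_refs--;
err_fops_put:
	toy_fops_put();
err:
	return NULL;
}

/* Models the final fput(): releases the inode and the module reference. */
static void toy_fput(struct toy_file *file)
{
	free(file);
	toy_inode_refs--;
	toy_fops_put();
}
```

Calling `toy_create_getfile()` with each failure point and checking that both counters return to zero is a quick way to see that the labels fall through in reverse order of acquisition.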
Add dedicated inode structure (kvm_gmem_inode_info) and slab-allocated inode cache for guest memory backing, similar to how shmem handles inodes.
This adds the necessary allocation/destruction functions and prepares for upcoming guest_memfd NUMA policy support changes.
Signed-off-by: Shivank Garg shivankg@amd.com
---
 virt/kvm/guest_memfd.c | 69 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 67 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 0e93323fc839..d9c23401e770 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -17,6 +17,15 @@ struct kvm_gmem {
 	struct list_head entry;
 };

+struct kvm_gmem_inode_info {
+	struct inode vfs_inode;
+};
+
+static inline struct kvm_gmem_inode_info *KVM_GMEM_I(struct inode *inode)
+{
+	return container_of(inode, struct kvm_gmem_inode_info, vfs_inode);
+}
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -389,13 +398,46 @@ static struct file_operations kvm_gmem_fops = {
 	.fallocate = kvm_gmem_fallocate,
 };

+static struct kmem_cache *kvm_gmem_inode_cachep;
+
+static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
+{
+	struct kvm_gmem_inode_info *info;
+
+	info = alloc_inode_sb(sb, kvm_gmem_inode_cachep, GFP_KERNEL);
+	if (!info)
+		return NULL;
+
+	return &info->vfs_inode;
+}
+
+static void kvm_gmem_destroy_inode(struct inode *inode)
+{
+}
+
+static void kvm_gmem_free_inode(struct inode *inode)
+{
+	kmem_cache_free(kvm_gmem_inode_cachep, KVM_GMEM_I(inode));
+}
+
+static const struct super_operations kvm_gmem_super_operations = {
+	.statfs = simple_statfs,
+	.alloc_inode = kvm_gmem_alloc_inode,
+	.destroy_inode = kvm_gmem_destroy_inode,
+	.free_inode = kvm_gmem_free_inode,
+};
+
 static int kvm_gmem_init_fs_context(struct fs_context *fc)
 {
+	struct pseudo_fs_context *ctx;
+
 	if (!init_pseudo(fc, GUEST_MEMFD_MAGIC))
 		return -ENOMEM;

 	fc->s_iflags |= SB_I_NOEXEC;
 	fc->s_iflags |= SB_I_NODEV;
+	ctx = fc->fs_private;
+	ctx->ops = &kvm_gmem_super_operations;

 	return 0;
 }
@@ -417,17 +459,40 @@ static int kvm_gmem_init_mount(void)
 	return 0;
 }

+static void kvm_gmem_init_inode(void *foo)
+{
+	struct kvm_gmem_inode_info *info = foo;
+
+	inode_init_once(&info->vfs_inode);
+}
+
 int kvm_gmem_init(struct module *module)
 {
-	kvm_gmem_fops.owner = module;
+	int ret;
+	struct kmem_cache_args args = {
+		.align = 0,
+		.ctor = kvm_gmem_init_inode,
+	};

-	return kvm_gmem_init_mount();
+	kvm_gmem_fops.owner = module;
+	kvm_gmem_inode_cachep = kmem_cache_create("kvm_gmem_inode_cache",
+						  sizeof(struct kvm_gmem_inode_info),
+						  &args, SLAB_ACCOUNT);
+	if (!kvm_gmem_inode_cachep)
+		return -ENOMEM;
+	ret = kvm_gmem_init_mount();
+	if (ret) {
+		kmem_cache_destroy(kvm_gmem_inode_cachep);
+		return ret;
+	}
+	return 0;
 }

 void kvm_gmem_exit(void)
 {
 	kern_unmount(kvm_gmem_mnt);
 	kvm_gmem_mnt = NULL;
+	kmem_cache_destroy(kvm_gmem_inode_cachep);
 }

 static int kvm_gmem_migrate_folio(struct address_space *mapping,
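
For readers unfamiliar with the embedded-inode pattern used by KVM_GMEM_I(), here is a minimal userspace sketch of the same idea. It is not kernel code: `struct toy_inode`, `struct toy_gmem_inode_info`, `toy_container_of()` and the alloc/free helpers are hypothetical stand-ins, and plain `calloc()`/`free()` replace the slab cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct inode; only the embedding matters. */
struct toy_inode {
	unsigned long i_ino;
};

/* Filesystem-private wrapper, mirroring the shape of kvm_gmem_inode_info. */
struct toy_gmem_inode_info {
	int private_data;		/* placeholder for per-inode metadata */
	struct toy_inode vfs_inode;	/* embedded generic inode */
};

/* Userspace equivalent of container_of() (without the type checking). */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct toy_gmem_inode_info *TOY_GMEM_I(struct toy_inode *inode)
{
	return toy_container_of(inode, struct toy_gmem_inode_info, vfs_inode);
}

/* Allocate the wrapper but hand out only the embedded inode. */
static struct toy_inode *toy_alloc_inode(void)
{
	struct toy_gmem_inode_info *info = calloc(1, sizeof(*info));

	return info ? &info->vfs_inode : NULL;
}

/* Free must go through the wrapper, since that is what was allocated. */
static void toy_free_inode(struct toy_inode *inode)
{
	free(TOY_GMEM_I(inode));
}
```

Generic code only ever sees the `struct toy_inode` pointer, while filesystem code recovers its private fields through `TOY_GMEM_I()`. This is why the embedded inode does not need to be the first member of the wrapper.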
On 8/11/2025 2:36 PM, Shivank Garg wrote:
[...snip...]
While testing my code, I discovered a bug that occurs when unloading the kvm_amd module after a guest_memfd-backed VM has run.
dmesg logs:
[ 610.075763] =============================================================================
[ 610.083933] BUG kvm_gmem_inode_cache (Not tainted): Objects remaining on __kmem_cache_shutdown()
[ 610.092711] -----------------------------------------------------------------------------
[ 610.102368] Object 0x000000008ee52a58 @offset=19200
[ 610.107247] Slab 0x000000004b1b088c objects=51 used=1 fp=0x000000007c55fc00 flags=0x57ffffc0000240(workingset|head|node=1|zone=2|lastcpupid=0x1fffff)
[ 610.120733] Disabling lock debugging due to kernel taint
[ 610.120741] ------------[ cut here ]------------
[ 610.120742] WARNING: CPU: 7 PID: 7554 at mm/slub.c:1171 __kmem_cache_shutdown+0x264/0x370
[ 610.120751] Modules linked in: xt_set ip_set xt_addrtype xfrm_user xfrm_algo xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nff_defrag_ipv4 nf_tables overlay bridge stp llc cfg80211 rfkill binfmt_misc ipmi_ssif amd_atl intel_rapl_msr wmi_bmof intel_rapl_common amd64_edac edac_mce_amdmem_helper drm_kms_helper i2c_piix4 ptdma i2c_smbus k10temp wmi acpi_power_meter ipmi_si acpi_ipmi ipmi_devintf ipmi_msghandler sg dm_multipath fuse drm dm_mo56 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 sd_mod kvm_amd(-) ahci libahci kvm nvme tg3 libata ccp irqbypass nvme_c
[ 610.120831] CPU: 7 UID: 0 PID: 7554 Comm: rmmod Kdump: loaded Tainted: G B 6.16.0+ #10 PREEMPT(none)
[ 610.120835] Tainted: [B]=BAD_PAGE
[ 610.120836] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.16.2 07/09/2024
[ 610.120838] RIP: 0010:__kmem_cache_shutdown+0x264/0x370
[ 610.120841] Code: 89 f1 4c 89 f6 4d 8b 46 20 48 c7 c7 08 08 ec 87 81 e2 ff 7f 00 00 e8 fb a7 d7 ff be 01 00 00 00 bf 05 00 00 00 e8 dc e9 cd ff <0f> 0b 48 fe ff ff
[ 610.120843] RSP: 0018:ffffcd6962963cb8 EFLAGS: 00010046
[ 610.120846] RAX: 0000000000000000 RBX: ffff89fde07d21c0 RCX: 0000000000000027
[ 610.120848] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff89fcbe5dbe80
[ 610.120850] RBP: ffff89fde07d21c0 R08: 0000000000000000 R09: 0000000000000003
[ 610.120851] R10: ffffcd6962963b58 R11: ffffffff889db908 R12: ffff89fdcccd7f80
[ 610.120852] R13: ffff89fdcccd0000 R14: fffff96802333400 R15: ffff89fdd6ab6c00
[ 610.120854] FS: 00007f066eaab080(0000) GS:ffff89fd3516f000(0000) knlGS:0000000000000000
[ 610.120856] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 610.120857] CR2: 00007ffefd577828 CR3: 0000000220406004 CR4: 0000000000770ef0
[ 610.120859] PKRU: 55555554
[ 610.120860] Call Trace:
[ 610.120862] <TASK>
[ 610.120866] kmem_cache_destroy+0x3a/0x150
[ 610.120872] kvm_exit+0x7b/0xa0 [kvm]
[ 610.120919] svm_exit+0x5/0x10 [kvm_amd]
[ 610.120926] __do_sys_delete_module.isra.0+0x18b/0x2e0
[ 610.120933] ? srso_alias_return_thunk+0x5/0xfbef5
[ 610.120937] ? syscall_trace_enter+0xfa/0x1a0
[ 610.120941] do_syscall_64+0x7b/0x2c0
[ 610.120947] ? srso_alias_return_thunk+0x5/0xfbef5
[ 610.120950] ? __handle_mm_fault+0x2aa/0x670
[ 610.120954] ? iterate_dir+0x11e/0x230
[ 610.120960] ? srso_alias_return_thunk+0x5/0xfbef5
[ 610.120963] ? count_memcg_events+0xb2/0x160
[ 610.120967] ? srso_alias_return_thunk+0x5/0xfbef5
[ 610.120969] ? handle_mm_fault+0xb2/0x2f0
[ 610.120972] ? srso_alias_return_thunk+0x5/0xfbef5
[ 610.120975] ? do_user_addr_fault+0x16f/0x6f0
[ 610.120981] ? srso_alias_return_thunk+0x5/0xfbef5
[ 610.120984] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 610.120986] RIP: 0033:0x7f066e12ac9b
[ 610.120989] Code: 73 01 c3 48 8b 0d 7d 81 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 89 01 48
[ 610.120990] RSP: 002b:00007ffc629f1878 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 610.120993] RAX: ffffffffffffffda RBX: 00005630e80256f0 RCX: 00007f066e12ac9b
[ 610.120994] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005630e8025758
[ 610.120996] RBP: 00007ffc629f18a0 R08: 1999999999999999 R09: 0000000000000000
[ 610.120997] R10: 00007f066e1b1fc0 R11: 0000000000000206 R12: 0000000000000000
[ 610.120999] R13: 00007ffc629f1af0 R14: 00005630e80256f0 R15: 0000000000000000
[ 610.121003] </TASK>
[ 610.121004] ---[ end trace 0000000000000000 ]---
[ 610.121017] ------------[ cut here ]------------
There is a race condition here. The inode free at the end of this call chain is deferred via call_rcu(), so it may still be pending when kern_unmount() returns:

kern_unmount() -> mntput() -> cleanup_mnt() -> deactivate_super() -> deactivate_locked_super() -> fs->kill_sb() (guest_memfd's kill_sb) -> generic_shutdown_super() -> evict_inodes() -> destroy_inode() -> call_rcu()

I should wait for the pending RCU callbacks to finish before calling kmem_cache_destroy().

To fix this, I added rcu_barrier(), as dax_fs_exit() does.
@@ -561,6 +566,7 @@ void kvm_gmem_exit(void)
 {
 	kern_unmount(kvm_gmem_mnt);
 	kvm_gmem_mnt = NULL;
+	rcu_barrier();
 	kmem_cache_destroy(kvm_gmem_inode_cachep);
 }
I'll incorporate this fix into the next version.
Thanks, Shivank
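
The ordering problem in this report can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: it is single-threaded, and `toy_call_rcu()`, `toy_rcu_barrier()` and `toy_cache_destroy()` are hypothetical stand-ins that mimic the roles of call_rcu(), rcu_barrier() and the leftover-object check in kmem_cache_destroy(). It does not model real RCU grace periods.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_cache {
	size_t live;			/* objects not yet returned to the cache */
};

struct toy_obj {
	struct toy_cache *cache;
	struct toy_obj *next;		/* link in the deferred-free list */
};

static struct toy_obj *toy_pending;	/* callbacks queued but not yet run */

static struct toy_obj *toy_cache_alloc(struct toy_cache *cache)
{
	struct toy_obj *obj = malloc(sizeof(*obj));

	if (!obj)
		return NULL;
	obj->cache = cache;
	obj->next = NULL;
	cache->live++;
	return obj;
}

/* Models call_rcu(): the free is queued, not performed. */
static void toy_call_rcu(struct toy_obj *obj)
{
	obj->next = toy_pending;
	toy_pending = obj;
}

/* Models rcu_barrier(): returns only after every queued callback has run. */
static void toy_rcu_barrier(void)
{
	while (toy_pending) {
		struct toy_obj *obj = toy_pending;

		toy_pending = obj->next;
		obj->cache->live--;
		free(obj);
	}
}

/* Models the check in kmem_cache_destroy(): false if objects remain. */
static bool toy_cache_destroy(const struct toy_cache *cache)
{
	return cache->live == 0;
}
```

In this model, calling `toy_cache_destroy()` right after `toy_call_rcu()` reports a leftover object, which corresponds to the "Objects remaining" splat above. Calling `toy_rcu_barrier()` first makes the destroy succeed.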
Previously, guest-memfd allocations followed the local NUMA node ID in the absence of a process mempolicy, resulting in arbitrary memory placement. Moreover, mbind() couldn't be used by the VMM, as guest memory wasn't mapped into userspace when allocation occurred.
Enable NUMA policy support by implementing vm_ops for the guest-memfd mmap operation. This allows the VMM to map the memory and use mbind() to set the desired NUMA policy. The policy is stored in the inode structure via kvm_gmem_inode_info, as memory policy is a property of the memory (struct inode) itself. The policy is then retrieved via mpol_shared_policy_lookup() and passed to __filemap_get_folio_mpol() to ensure that allocations follow the specified memory policy.
This enables the VMM to control guest memory NUMA placement by calling mbind() on the mapped memory regions, providing fine-grained control over guest memory allocation across NUMA nodes.
The policy change only affects future allocations and does not migrate existing memory. This matches mbind(2)'s default behavior, which affects only new allocations unless overridden with the MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags, which are not supported for guest_memfd as it is unmovable.
Suggested-by: David Hildenbrand david@redhat.com
Acked-by: David Hildenbrand david@redhat.com
Acked-by: Vlastimil Babka vbabka@suse.cz
Signed-off-by: Shivank Garg shivankg@amd.com
---
 virt/kvm/guest_memfd.c | 67 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index d9c23401e770..7821c1036e49 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/kvm_host.h>
+#include <linux/mempolicy.h>
 #include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>

@@ -18,6 +19,7 @@ struct kvm_gmem {
 };

 struct kvm_gmem_inode_info {
+	struct shared_policy policy;
 	struct inode vfs_inode;
 };

@@ -26,6 +28,9 @@ static inline struct kvm_gmem_inode_info *KVM_GMEM_I(struct inode *inode)
 	return container_of(inode, struct kvm_gmem_inode_info, vfs_inode);
 }

+static struct mempolicy *kvm_gmem_get_pgoff_policy(struct kvm_gmem_inode_info *info,
+						   pgoff_t index);
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -112,7 +117,25 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 {
 	/* TODO: Support huge pages. */
-	return filemap_grab_folio(inode->i_mapping, index);
+	struct mempolicy *policy;
+	struct folio *folio;
+
+	/*
+	 * Fast-path: See if folio is already present in mapping to avoid
+	 * policy_lookup.
+	 */
+	folio = __filemap_get_folio(inode->i_mapping, index,
+				    FGP_LOCK | FGP_ACCESSED, 0);
+	if (!IS_ERR(folio))
+		return folio;
+
+	policy = kvm_gmem_get_pgoff_policy(KVM_GMEM_I(inode), index);
+	folio = __filemap_get_folio_mpol(inode->i_mapping, index,
+					 FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
+					 mapping_gfp_mask(inode->i_mapping), policy);
+	mpol_cond_put(policy);
+
+	return folio;
 }

 static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
@@ -372,8 +395,45 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	return ret;
 }

+#ifdef CONFIG_NUMA
+static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+
+	return mpol_set_shared_policy(&KVM_GMEM_I(inode)->policy, vma, mpol);
+}
+
+static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
+					     unsigned long addr, pgoff_t *pgoff)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+
+	*pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
+	return mpol_shared_policy_lookup(&KVM_GMEM_I(inode)->policy, *pgoff);
+}
+
+static struct mempolicy *kvm_gmem_get_pgoff_policy(struct kvm_gmem_inode_info *info,
+						   pgoff_t index)
+{
+	struct mempolicy *mpol;
+
+	mpol = mpol_shared_policy_lookup(&info->policy, index);
+	return mpol ? mpol : get_task_policy(current);
+}
+#else
+static struct mempolicy *kvm_gmem_get_pgoff_policy(struct kvm_gmem_inode_info *info,
+						   pgoff_t index)
+{
+	return NULL;
+}
+#endif /* CONFIG_NUMA */
+
 static const struct vm_operations_struct kvm_gmem_vm_ops = {
-	.fault = kvm_gmem_fault_user_mapping,
+	.fault		= kvm_gmem_fault_user_mapping,
+#ifdef CONFIG_NUMA
+	.get_policy	= kvm_gmem_get_policy,
+	.set_policy	= kvm_gmem_set_policy,
+#endif
 };

 static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
@@ -408,11 +468,14 @@ static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
 	if (!info)
 		return NULL;

+	mpol_shared_policy_init(&info->policy, NULL);
+
 	return &info->vfs_inode;
 }

 static void kvm_gmem_destroy_inode(struct inode *inode)
 {
+	mpol_free_shared_policy(&KVM_GMEM_I(inode)->policy);
 }

 static void kvm_gmem_free_inode(struct inode *inode)
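
The lookup logic in this patch (a per-range policy keyed by page offset, with a fallback to the task policy) can be sketched in userspace as follows. This is a simplified model, not the kernel implementation: all `toy_*` names are hypothetical, ranges live in a plain array rather than the kernel's shared_policy structure, reference counting is omitted, and `TOY_PAGE_SHIFT` assumes 4 KiB pages.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

struct toy_policy {
	int mode;
	unsigned long nodemask;
};

/* One policy covering page offsets [start, end). */
struct toy_range {
	unsigned long start, end;
	struct toy_policy policy;
};

struct toy_shared_policy {
	const struct toy_range *ranges;
	size_t nr_ranges;
};

/* Mirrors the pgoff computation in kvm_gmem_get_policy(). */
static unsigned long toy_addr_to_pgoff(unsigned long vm_start,
				       unsigned long vm_pgoff,
				       unsigned long addr)
{
	return vm_pgoff + ((addr - vm_start) >> TOY_PAGE_SHIFT);
}

/* Returns the range policy covering index, or NULL if none was set. */
static const struct toy_policy *
toy_shared_policy_lookup(const struct toy_shared_policy *sp, unsigned long index)
{
	for (size_t i = 0; i < sp->nr_ranges; i++) {
		if (index >= sp->ranges[i].start && index < sp->ranges[i].end)
			return &sp->ranges[i].policy;
	}
	return NULL;
}

/* Mirrors kvm_gmem_get_pgoff_policy(): fall back to the task's policy. */
static const struct toy_policy *
toy_get_pgoff_policy(const struct toy_shared_policy *sp, unsigned long index,
		     const struct toy_policy *task_policy)
{
	const struct toy_policy *pol = toy_shared_policy_lookup(sp, index);

	return pol ? pol : task_policy;
}
```

Because the policy is keyed by file offset rather than by virtual address, the same lookup serves both the userspace fault path and allocations triggered by the guest, which is the reason the patch stores it on the inode.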
Add tests for NUMA memory policy binding and NUMA aware allocation in guest_memfd. This extends the existing selftests by adding proper validation for:
- KVM GMEM set_policy() and get_policy() vm_ops functionality using mbind() and get_mempolicy()
- NUMA policy application before and after memory allocation
These tests help ensure NUMA support for guest_memfd works correctly.
Signed-off-by: Shivank Garg shivankg@amd.com
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 121 ++++++++++++++++++
 2 files changed, 122 insertions(+)

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 90f03f00cb04..c46cef2a7cd7 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -275,6 +275,7 @@ pgste-option = $(call try-run, echo 'int main(void) { return 0; }' | \
 	$(CC) -Werror -Wl$(comma)--s390-pgste -x c - -o "$$TMP",-Wl$(comma)--s390-pgste)

 LDLIBS += -ldl
+LDLIBS += -lnuma
 LDFLAGS += -pthread $(no-pie-option) $(pgste-option)

 LIBKVM_C := $(filter %.c,$(LIBKVM))
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index b86bf89a71e0..4d33c225d9f7 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -7,6 +7,8 @@
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <numa.h>
+#include <numaif.h>
 #include <errno.h>
 #include <stdio.h>
 #include <fcntl.h>
@@ -19,6 +21,7 @@
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/syscall.h>

 #include "kvm_util.h"
 #include "test_util.h"
@@ -72,6 +75,122 @@ static void test_mmap_supported(int fd, size_t page_size, size_t total_size)
 	TEST_ASSERT(!ret, "munmap() should succeed.");
 }

+#define TEST_REQUIRE_NUMA_MULTIPLE_NODES()	\
+	TEST_REQUIRE(numa_available() != -1 && numa_max_node() >= 1)
+
+static void test_mbind(int fd, size_t page_size, size_t total_size)
+{
+	unsigned long nodemask = 1; /* nid: 0 */
+	unsigned long maxnode = 8;
+	unsigned long get_nodemask;
+	int get_policy;
+	char *mem;
+	int ret;
+
+	TEST_REQUIRE_NUMA_MULTIPLE_NODES();
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap for mbind test should succeed");
+
+	/* Test MPOL_INTERLEAVE policy */
+	ret = syscall(__NR_mbind, mem, page_size * 2, MPOL_INTERLEAVE,
+		      &nodemask, maxnode, 0);
+	TEST_ASSERT(!ret, "mbind with INTERLEAVE to node 0 should succeed");
+	ret = syscall(__NR_get_mempolicy, &get_policy, &get_nodemask,
+		      maxnode, mem, MPOL_F_ADDR);
+	TEST_ASSERT(!ret && get_policy == MPOL_INTERLEAVE && get_nodemask == nodemask,
+		    "Policy should be MPOL_INTERLEAVE and nodes match");
+
+	/* Test basic MPOL_BIND policy */
+	ret = syscall(__NR_mbind, mem + page_size * 2, page_size * 2, MPOL_BIND,
+		      &nodemask, maxnode, 0);
+	TEST_ASSERT(!ret, "mbind with MPOL_BIND to node 0 should succeed");
+	ret = syscall(__NR_get_mempolicy, &get_policy, &get_nodemask,
+		      maxnode, mem + page_size * 2, MPOL_F_ADDR);
+	TEST_ASSERT(!ret && get_policy == MPOL_BIND && get_nodemask == nodemask,
+		    "Policy should be MPOL_BIND and nodes match");
+
+	/* Test MPOL_DEFAULT policy */
+	ret = syscall(__NR_mbind, mem, total_size, MPOL_DEFAULT, NULL, 0, 0);
+	TEST_ASSERT(!ret, "mbind with MPOL_DEFAULT should succeed");
+	ret = syscall(__NR_get_mempolicy, &get_policy, &get_nodemask,
+		      maxnode, mem, MPOL_F_ADDR);
+	TEST_ASSERT(!ret && get_policy == MPOL_DEFAULT && get_nodemask == 0,
+		    "Policy should be MPOL_DEFAULT and nodes zero");
+
+	/* Test with invalid policy */
+	ret = syscall(__NR_mbind, mem, page_size, 999, &nodemask, maxnode, 0);
+	TEST_ASSERT(ret == -1 && errno == EINVAL,
+		    "mbind with invalid policy should fail with EINVAL");
+
+	TEST_ASSERT(munmap(mem, total_size) == 0, "munmap should succeed");
+}
+
+static void test_numa_allocation(int fd, size_t page_size, size_t total_size)
+{
+	unsigned long node0_mask = 1; /* Node 0 */
+	unsigned long node1_mask = 2; /* Node 1 */
+	unsigned long maxnode = 8;
+	void *pages[4];
+	int status[4];
+	char *mem;
+	int ret, i;
+
+	TEST_REQUIRE_NUMA_MULTIPLE_NODES();
+
+	/* Clean slate: deallocate all file space, if any */
+	ret = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, total_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) should succeed");
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should succeed");
+
+	for (i = 0; i < 4; i++)
+		pages[i] = (char *)mem + page_size * i;
+
+	/* Set NUMA policy after allocation */
+	memset(mem, 0xaa, page_size);
+	ret = syscall(__NR_mbind, pages[0], page_size, MPOL_BIND, &node0_mask, maxnode, 0);
+	TEST_ASSERT(!ret, "mbind after allocation page 0 to node 0 should succeed");
+	ret = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, page_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) should succeed");
+
+	/* Set NUMA policy before allocation */
+	ret = syscall(__NR_mbind, pages[0], page_size * 2, MPOL_BIND, &node1_mask, maxnode, 0);
+	TEST_ASSERT(!ret, "mbind page 0, 1 to node 1 should succeed");
+	ret = syscall(__NR_mbind, pages[2], page_size * 2, MPOL_BIND, &node0_mask, maxnode, 0);
+	TEST_ASSERT(!ret, "mbind page 2, 3 to node 0 should succeed");
+	memset(mem, 0xaa, total_size);
+
+	/* Validate if pages are allocated on specified NUMA nodes */
+	ret = syscall(__NR_move_pages, 0, 4, pages, NULL, status, 0);
+	TEST_ASSERT(ret >= 0, "move_pages should succeed for status check");
+	TEST_ASSERT(status[0] == 1, "Page 0 should be allocated on node 1");
+	TEST_ASSERT(status[1] == 1, "Page 1 should be allocated on node 1");
+	TEST_ASSERT(status[2] == 0, "Page 2 should be allocated on node 0");
+	TEST_ASSERT(status[3] == 0, "Page 3 should be allocated on node 0");
+
+	/* Punch hole for all pages */
+	ret = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, total_size);
+	TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) should succeed");
+
+	/* Change NUMA policy nodes and reallocate */
+	ret = syscall(__NR_mbind, pages[0], page_size * 2, MPOL_BIND, &node0_mask, maxnode, 0);
+	TEST_ASSERT(!ret, "mbind page 0, 1 to node 0 should succeed");
+	ret = syscall(__NR_mbind, pages[2], page_size * 2, MPOL_BIND, &node1_mask, maxnode, 0);
+	TEST_ASSERT(!ret, "mbind page 2, 3 to node 1 should succeed");
+	memset(mem, 0xaa, total_size);
+
+	ret = syscall(__NR_move_pages, 0, 4, pages, NULL, status, 0);
+	TEST_ASSERT(ret >= 0, "move_pages should succeed after reallocation");
+	TEST_ASSERT(status[0] == 0, "Page 0 should be allocated on node 0");
+	TEST_ASSERT(status[1] == 0, "Page 1 should be allocated on node 0");
+	TEST_ASSERT(status[2] == 1, "Page 2 should be allocated on node 1");
+	TEST_ASSERT(status[3] == 1, "Page 3 should be allocated on node 1");
+
+	TEST_ASSERT(munmap(mem, total_size) == 0, "munmap should succeed");
+}
+
 static sigjmp_buf jmpbuf;
 void fault_sigbus_handler(int signum)
 {
@@ -286,6 +405,8 @@ static void test_guest_memfd(unsigned long vm_type)
 	if (flags & GUEST_MEMFD_FLAG_MMAP) {
 		test_mmap_supported(fd, page_size, total_size);
 		test_fault_overflow(fd, page_size, total_size);
+		test_mbind(fd, page_size, total_size);
+		test_numa_allocation(fd, page_size, total_size);
 	} else {
 		test_mmap_not_supported(fd, page_size, total_size);
 	}
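
The tests above hard-code single-word node masks (`nodemask = 1`, `maxnode = 8`). As a rough sketch of how a VMM might build the mask for an arbitrary node before calling mbind(2) on a mapped guest_memfd range, consider the helpers below. They are hypothetical and not part of this series: `struct toy_nodemask`, `toy_nodemask_set()`, `toy_mbind_to_node()` and the `TOY_MAX_NODES` limit are assumptions for illustration, and the sketch conservatively leaves the top bit of the mask unused.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>

#define TOY_MAX_NODES		1024	/* assumed upper bound on node IDs */
#define TOY_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define TOY_MASK_WORDS		(TOY_MAX_NODES / TOY_BITS_PER_LONG)

struct toy_nodemask {
	unsigned long bits[TOY_MASK_WORDS];
};

/* Set the bit for @node; returns -1 if the node does not fit in the mask. */
static int toy_nodemask_set(struct toy_nodemask *mask, unsigned int node)
{
	if (node >= TOY_MAX_NODES - 1)
		return -1;
	mask->bits[node / TOY_BITS_PER_LONG] |= 1UL << (node % TOY_BITS_PER_LONG);
	return 0;
}

/*
 * Bind [addr, addr + len) to a single node with MPOL_BIND. No MPOL_MF_*
 * flags are passed, so only future allocations are affected.
 */
static long toy_mbind_to_node(void *addr, size_t len, unsigned int node)
{
	struct toy_nodemask mask;

	memset(&mask, 0, sizeof(mask));
	if (toy_nodemask_set(&mask, node))
		return -1;
	return syscall(SYS_mbind, addr, len, (unsigned long)MPOL_BIND,
		       mask.bits, (unsigned long)TOY_MAX_NODES, 0UL);
}
```

A caller would mmap() the guest_memfd with MAP_SHARED, as the tests do, and then call `toy_mbind_to_node()` on the sub-range it wants placed on a given node before the memory is first touched.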
On Mon, Aug 11, 2025, Shivank Garg wrote:
This series introduces NUMA-aware memory placement support for KVM guests with guest_memfd memory backends. It builds upon Fuad Tabba's work (V17) that enabled host-mapping for guest_memfd memory [1].
Is this still actually an RFC? If so, why? If not, drop tag on the next version (if one is needed/sent).
On 11.08.25 16:34, Sean Christopherson wrote:
On Mon, Aug 11, 2025, Shivank Garg wrote:
This series introduces NUMA-aware memory placement support for KVM guests with guest_memfd memory backends. It builds upon Fuad Tabba's work (V17) that enabled host-mapping for guest_memfd memory [1].
Is this still actually an RFC? If so, why? If not, drop tag on the next version (if one is needed/sent).
There was the complaint that !RFC meant that it would be based on a consumable upstream branch.
I think once this series is rebased on top of kvm-next, we can finally drop the tag.