Hello,
This patchset builds upon Sean's soon-to-be-posted WIP guest_mem patches, available at https://github.com/sean-jc/linux/tree/x86/kvm_gmem_solo and mentioned at [1].
The tree can be found at: https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1
In this patchset, hugetlb support for KVM's guest_mem (aka gmem) is introduced, allowing VM private memory (for confidential computing) to be backed by hugetlb pages.
guest_mem provides userspace with a handle through which memory for confidential VMs can be allocated and deallocated without ever mapping that memory into userspace.
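As a rough sketch of the intended usage (assuming gmem keeps the fallocate()-based allocate/deallocate interface of the base guest_mem series; how the fd itself is created is defined by the KVM UAPI patches and is omitted here):

  #define _GNU_SOURCE             /* for fallocate() and FALLOC_FL_* */
  #include <err.h>
  #include <fcntl.h>

  /* gmem_fd was returned by KVM; size is the size of the gmem file. */
  static void manage_gmem_backing(int gmem_fd, off_t size)
  {
          /* Allocate backing memory without mapping it into userspace. */
          if (fallocate(gmem_fd, 0, 0, size))
                  err(1, "fallocate (allocate)");

          /* Deallocate by punching a hole once the memory is no longer needed. */
          if (fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, size))
                  err(1, "fallocate (punch hole)");
  }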
Why use hugetlb instead of introducing a new allocator, like gmem does for 4K and transparent hugepages?
+ hugetlb provides the following useful functionality, which would otherwise
  have to be reimplemented:
  + Allocation of hugetlb pages at boot time, including
    + Parsing of kernel boot parameters to configure hugetlb
    + Tracking of usage in hstate
    + gmem will share the same system-wide pool of hugetlb pages, so users
      don't have to have separate pools for hugetlb and gmem
  + Page accounting with subpools
    + hugetlb pages are tracked in subpools, which gmem uses to reserve pages
      from the global hstate
  + Memory charging
    + hugetlb provides code that charges memory to cgroups
  + Reporting: hugetlb usage and availability are available at /proc/meminfo,
    etc
The first 11 patches in this patchset are a series of refactorings to decouple hugetlb and hugetlbfs.
The central thread binding the refactoring is that some functions (like inode_resv_map(), inode_subpool(), inode_hstate(), etc.) rely on a hugetlbfs convention: that the resv_map, subpool and hstate are stored in specific fields of a hugetlbfs inode.
Refactoring to parametrize these functions by hstate, subpool and resv_map will allow hugetlb to be used by gmem and in other places where these data structures aren't necessarily stored in the same positions in the inode.
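For example, by the end of the series hugetlb_unreserve_pages() goes from deriving everything from the inode to taking these structures explicitly (abbreviated here; the diffs below contain the full changes):

  /* Before: hugetlbfs-only, everything derived from the inode. */
  long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
                               long freed);

  /* After: hstate, subpool and resv_map are passed in explicitly, so callers
   * like gmem can supply their own without needing a hugetlbfs inode.
   */
  long hugetlb_unreserve_pages(struct hstate *h, struct hugepage_subpool *spool,
                               struct resv_map *resv_map, struct inode *inode,
                               long start, long end, long freed);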
The refactoring proposed here is just the minimum required to get a proof-of-concept working with gmem. I would like to get opinions on this approach before doing further refactoring. (See TODOs)
TODOs:
+ hugetlb/hugetlbfs refactoring
  + remove_inode_hugepages() no longer needs to be exposed, it is hugetlbfs
    specific and used only in inode.c
  + remove_mapping_hugepages(), remove_inode_single_folio(),
    hugetlb_unreserve_pages() shouldn't need to take inode as a parameter
    + Updating inode->i_blocks can be refactored to a separate function and
      called from hugetlbfs and gmem
  + alloc_hugetlb_folio_from_subpool() shouldn't need to be parametrized by vma
  + hugetlb_reserve_pages() should be refactored to be symmetric with
    hugetlb_unreserve_pages()
    + It should be parametrized by resv_map (see the sketch after this list)
    + alloc_hugetlb_folio_from_subpool() could perhaps use
      hugetlb_reserve_pages()?
+ gmem
  + Figure out if resv_map should be used by gmem at all
  + Probably needs more refactoring to decouple resv_map from hugetlb functions
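A sketch of the symmetric hugetlb_reserve_pages() suggested in the TODO above (hypothetical, not part of this series):

  /* Hypothetical future signature, parametrized by resv_map in the same way
   * hugetlb_unreserve_pages() already is by the end of this series.
   */
  bool hugetlb_reserve_pages(struct hstate *h, struct hugepage_subpool *spool,
                             struct resv_map *resv_map, struct inode *inode,
                             long from, long to, struct vm_area_struct *vma,
                             vm_flags_t vm_flags);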
Questions for the community:
1. In this patchset, every gmem file backed with hugetlb is given a new
   subpool. Is that desirable?
   + In hugetlbfs, a subpool always belongs to a mount, and hugetlbfs has one
     mount per hugetlb size (2M, 1G, etc)
   + memfd_create(MFD_HUGETLB) effectively returns a full hugetlbfs file, so
     it (rightfully) uses the hugetlbfs kernel mounts and their subpools
   + I gave each file a subpool mostly to speed up implementation and still
     be able to reserve hugetlb pages from the global hstate based on the
     gmem file size (roughly sketched below).
   + gmem, unlike hugetlbfs, isn't meant to be a full filesystem, so
     + Should there be multiple mounts, one for each hugetlb size?
     + Will the mounts be initialized on boot or on first gmem file creation?
     + Or is one subpool per gmem file fine?

2. Should resv_map be used for gmem at all, since gmem doesn't allow
   userspace reservations?
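For reference, what question 1 describes is roughly the following (a sketch only; the variables and exact call are approximations, not the actual patch code): each hugetlb-backed gmem file creates its own subpool, sized by the gmem file size, which reserves pages from the global hstate up front.

  /* page_size_log and size come from the gmem file creation parameters. */
  struct hstate *h = hstate_sizelog(page_size_log);
  long nr_pages = size >> huge_page_shift(h);

  /* One subpool per gmem file; using min == max here is an assumption. */
  struct hugepage_subpool *spool = hugepage_new_subpool(h, nr_pages, nr_pages);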
[1] https://lore.kernel.org/lkml/ZEM5Zq8oo+xnApW9@google.com/
---
Ackerley Tng (19):
  mm: hugetlb: Expose get_hstate_idx()
  mm: hugetlb: Move and expose hugetlbfs_zero_partial_page
  mm: hugetlb: Expose remove_inode_hugepages
  mm: hugetlb: Decouple hstate, subpool from inode
  mm: hugetlb: Allow alloc_hugetlb_folio() to be parametrized by subpool and hstate
  mm: hugetlb: Provide hugetlb_filemap_add_folio()
  mm: hugetlb: Refactor vma_*_reservation functions
  mm: hugetlb: Refactor restore_reserve_on_error
  mm: hugetlb: Use restore_reserve_on_error directly in filesystems
  mm: hugetlb: Parametrize alloc_hugetlb_folio_from_subpool() by resv_map
  mm: hugetlb: Parametrize hugetlb functions by resv_map
  mm: truncate: Expose preparation steps for truncate_inode_pages_final
  KVM: guest_mem: Refactor kvm_gmem fd creation to be in layers
  KVM: guest_mem: Refactor cleanup to separate inode and file cleanup
  KVM: guest_mem: hugetlb: initialization and cleanup
  KVM: guest_mem: hugetlb: allocate and truncate from hugetlb
  KVM: selftests: Add basic selftests for hugetlbfs-backed guest_mem
  KVM: selftests: Support various types of backing sources for private memory
  KVM: selftests: Update test for various private memory backing source types
 fs/hugetlbfs/inode.c                          | 102 ++--
 include/linux/hugetlb.h                       |  86 ++-
 include/linux/mm.h                            |   1 +
 include/uapi/linux/kvm.h                      |  25 +
 mm/hugetlb.c                                  | 324 +++++++-----
 mm/truncate.c                                 |  24 +-
 .../testing/selftests/kvm/guest_memfd_test.c  |  33 +-
 .../testing/selftests/kvm/include/test_util.h |  14 +
 tools/testing/selftests/kvm/lib/test_util.c   |  74 +++
 .../kvm/x86_64/private_mem_conversions_test.c |  38 +-
 virt/kvm/guest_mem.c                          | 488 ++++++++++++++----
 11 files changed, 882 insertions(+), 327 deletions(-)
--
2.41.0.rc0.172.g3f132b7071-goog
Expose get_hstate_idx() so it can be used from KVM's guest_mem code
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 fs/hugetlbfs/inode.c    |  9 ---------
 include/linux/hugetlb.h | 14 ++++++++++++++
 2 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 9062da6da567..406d7366cf3e 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -1560,15 +1560,6 @@ static int can_do_hugetlb_shm(void) return capable(CAP_IPC_LOCK) || in_group_p(shm_group); }
-static int get_hstate_idx(int page_size_log) -{ - struct hstate *h = hstate_sizelog(page_size_log); - - if (!h) - return -1; - return hstate_index(h); -} - /* * Note that size should be aligned to proper hugepage size in caller side, * otherwise hugetlb_reserve_pages reserves one less hugepages than intended. diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 7c977d234aba..37c2edf7beea 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -876,6 +876,15 @@ static inline int hstate_index(struct hstate *h) return h - hstates; }
+static inline int get_hstate_idx(int page_size_log) +{ + struct hstate *h = hstate_sizelog(page_size_log); + + if (!h) + return -1; + return hstate_index(h); +} + extern int dissolve_free_huge_page(struct page *page); extern int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn); @@ -1142,6 +1151,11 @@ static inline int hstate_index(struct hstate *h) return 0; }
+static inline int get_hstate_idx(int page_size_log) +{ + return 0; +} + static inline int dissolve_free_huge_page(struct page *page) { return 0;
Zeroing of pages is generalizable to hugetlb and is not specific to hugetlbfs.
Rename hugetlbfs_zero_partial_page => hugetlb_zero_partial_page, move it to mm/hugetlb.c and expose it in linux/hugetlb.h.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 fs/hugetlbfs/inode.c    | 27 ++-------------------------
 include/linux/hugetlb.h |  6 ++++++
 mm/hugetlb.c            | 22 ++++++++++++++++++++++
 3 files changed, 30 insertions(+), 25 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 406d7366cf3e..3dab50d3ed88 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -688,29 +688,6 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset) remove_inode_hugepages(inode, offset, LLONG_MAX); }
-static void hugetlbfs_zero_partial_page(struct hstate *h, - struct address_space *mapping, - loff_t start, - loff_t end) -{ - pgoff_t idx = start >> huge_page_shift(h); - struct folio *folio; - - folio = filemap_lock_folio(mapping, idx); - if (!folio) - return; - - start = start & ~huge_page_mask(h); - end = end & ~huge_page_mask(h); - if (!end) - end = huge_page_size(h); - - folio_zero_segment(folio, (size_t)start, (size_t)end); - - folio_unlock(folio); - folio_put(folio); -} - static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) { struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode); @@ -737,7 +714,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
/* If range starts before first full page, zero partial page. */ if (offset < hole_start) - hugetlbfs_zero_partial_page(h, mapping, + hugetlb_zero_partial_page(h, mapping, offset, min(offset + len, hole_start));
/* Unmap users of full pages in the hole. */ @@ -750,7 +727,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
/* If range extends beyond last full page, zero partial page. */ if ((offset + len) > hole_end && (offset + len) > hole_start) - hugetlbfs_zero_partial_page(h, mapping, + hugetlb_zero_partial_page(h, mapping, hole_end, offset + len);
i_mmap_unlock_write(mapping); diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 37c2edf7beea..023293ceec25 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -256,6 +256,9 @@ long hugetlb_change_protection(struct vm_area_struct *vma, bool is_hugetlb_entry_migration(pte_t pte); void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
+void hugetlb_zero_partial_page(struct hstate *h, struct address_space *mapping, + loff_t start, loff_t end); + #else /* !CONFIG_HUGETLB_PAGE */
static inline void hugetlb_dup_vma_private(struct vm_area_struct *vma) @@ -464,6 +467,9 @@ static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }
+static inline void hugetlb_zero_partial_page( + struct hstate *h, struct address_space *mapping, loff_t start, loff_t end) {} + #endif /* !CONFIG_HUGETLB_PAGE */ /* * hugepages at page global directory. If arch support diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 07abcb6eb203..9c9262833b4f 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -7407,6 +7407,28 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) ALIGN_DOWN(vma->vm_end, PUD_SIZE)); }
+void hugetlb_zero_partial_page(struct hstate *h, + struct address_space *mapping, + loff_t start, loff_t end) +{ + pgoff_t idx = start >> huge_page_shift(h); + struct folio *folio; + + folio = filemap_lock_folio(mapping, idx); + if (!folio) + return; + + start = start & ~huge_page_mask(h); + end = end & ~huge_page_mask(h); + if (!end) + end = huge_page_size(h); + + folio_zero_segment(folio, (size_t)start, (size_t)end); + + folio_unlock(folio); + folio_put(folio); +} + #ifdef CONFIG_CMA static bool cma_reserve_called __initdata;
Expose remove_inode_hugepages() so that it can be called from outside hugetlbfs.

TODO: may want to move this to hugetlb.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 fs/hugetlbfs/inode.c    | 3 +--
 include/linux/hugetlb.h | 4 ++++
 2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 3dab50d3ed88..4f25df31ae80 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -611,8 +611,7 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode, * Note: If the passed end of range value is beyond the end of file, but * not LLONG_MAX this routine still performs a hole punch operation. */ -static void remove_inode_hugepages(struct inode *inode, loff_t lstart, - loff_t lend) +void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend) { struct hstate *h = hstate_inode(inode); struct address_space *mapping = &inode->i_data; diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 023293ceec25..1483020b412b 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -259,6 +259,8 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma); void hugetlb_zero_partial_page(struct hstate *h, struct address_space *mapping, loff_t start, loff_t end);
+void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend); + #else /* !CONFIG_HUGETLB_PAGE */
static inline void hugetlb_dup_vma_private(struct vm_area_struct *vma) @@ -470,6 +472,8 @@ static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { } static inline void hugetlb_zero_partial_page( struct hstate *h, struct address_space *mapping, loff_t start, loff_t end) {}
+static inline void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend) {} + #endif /* !CONFIG_HUGETLB_PAGE */ /* * hugepages at page global directory. If arch support
hstate and subpool being retrievable from inode via hstate_inode() and subpool_inode() respectively is a hugetlbfs concept.
hugetlb should be agnostic of hugetlbfs and hugetlb accounting functions should accept hstate (required) and subpool (can be NULL) independently of inode.
inode is still a parameter for these accounting functions since the inode's block counts need to be updated during accounting.
The inode's resv_map will also still need to be updated if not NULL.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 fs/hugetlbfs/inode.c    | 59 ++++++++++++++++++++++++++++-------------
 include/linux/hugetlb.h | 32 +++++++++++++++++-----
 mm/hugetlb.c            | 49 ++++++++++++++++++++--------------
 3 files changed, 95 insertions(+), 45 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 4f25df31ae80..0fc49b6252e4 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -164,7 +164,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma) file_accessed(file);
ret = -ENOMEM; - if (!hugetlb_reserve_pages(inode, + if (!hugetlb_reserve_pages(h, subpool_inode(inode), inode, vma->vm_pgoff >> huge_page_order(h), len >> huge_page_shift(h), vma, vma->vm_flags)) @@ -550,14 +550,18 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end, } }
-/* +/** + * Remove folio from page_cache and userspace mappings. Also unreserves pages, + * updating hstate @h, subpool @spool (if not NULL), @inode block info and + * @inode's resv_map (if not NULL). + * * Called with hugetlb fault mutex held. * Returns true if page was actually removed, false otherwise. */ -static bool remove_inode_single_folio(struct hstate *h, struct inode *inode, - struct address_space *mapping, - struct folio *folio, pgoff_t index, - bool truncate_op) +static bool remove_mapping_single_folio( + struct address_space *mapping, struct folio *folio, pgoff_t index, + struct hstate *h, struct hugepage_subpool *spool, struct inode *inode, + bool truncate_op) { bool ret = false;
@@ -582,9 +586,8 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode, hugetlb_delete_from_page_cache(folio); ret = true; if (!truncate_op) { - if (unlikely(hugetlb_unreserve_pages(inode, index, - index + 1, 1))) - hugetlb_fix_reserve_counts(inode); + if (unlikely(hugetlb_unreserve_pages(h, spool, inode, index, index + 1, 1))) + hugetlb_fix_reserve_counts(h, spool); }
folio_unlock(folio); @@ -592,7 +595,14 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode, }
/* - * remove_inode_hugepages handles two distinct cases: truncation and hole + * Remove hugetlb page mappings from @mapping between offsets [@lstart, @lend). + * Also updates reservations in: + * + hstate @h (required) + * + subpool @spool (can be NULL) + * + resv_map in @inode (can be NULL) + * and updates blocks in @inode (required) + * + * remove_mapping_hugepages handles two distinct cases: truncation and hole * punch. There are subtle differences in operation for each case. * * truncation is indicated by end of range being LLONG_MAX @@ -611,10 +621,10 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode, * Note: If the passed end of range value is beyond the end of file, but * not LLONG_MAX this routine still performs a hole punch operation. */ -void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend) +void remove_mapping_hugepages(struct address_space *mapping, + struct hstate *h, struct hugepage_subpool *spool, + struct inode *inode, loff_t lstart, loff_t lend) { - struct hstate *h = hstate_inode(inode); - struct address_space *mapping = &inode->i_data; const pgoff_t start = lstart >> huge_page_shift(h); const pgoff_t end = lend >> huge_page_shift(h); struct folio_batch fbatch; @@ -636,8 +646,8 @@ void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend) /* * Remove folio that was part of folio_batch. */ - if (remove_inode_single_folio(h, inode, mapping, folio, - index, truncate_op)) + if (remove_mapping_single_folio(mapping, folio, index, + h, spool, inode, truncate_op)) freed++;
mutex_unlock(&hugetlb_fault_mutex_table[hash]); @@ -647,7 +657,16 @@ void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend) }
if (truncate_op) - (void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed); + (void)hugetlb_unreserve_pages(h, spool, inode, start, LONG_MAX, freed); +} + +void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend) +{ + struct address_space *mapping = &inode->i_data; + struct hstate *h = hstate_inode(inode); + struct hugepage_subpool *spool = subpool_inode(inode); + + return remove_mapping_hugepages(mapping, h, spool, inode, lstart, lend); }
static void hugetlbfs_evict_inode(struct inode *inode) @@ -1548,6 +1567,7 @@ struct file *hugetlb_file_setup(const char *name, size_t size, struct vfsmount *mnt; int hstate_idx; struct file *file; + struct hstate *h;
hstate_idx = get_hstate_idx(page_size_log); if (hstate_idx < 0) @@ -1578,9 +1598,10 @@ struct file *hugetlb_file_setup(const char *name, size_t size, inode->i_size = size; clear_nlink(inode);
- if (!hugetlb_reserve_pages(inode, 0, - size >> huge_page_shift(hstate_inode(inode)), NULL, - acctflag)) + h = hstate_inode(inode); + if (!hugetlb_reserve_pages(h, subpool_inode(inode), inode, 0, + size >> huge_page_shift(h), NULL, + acctflag)) file = ERR_PTR(-ENOMEM); else file = alloc_file_pseudo(inode, mnt, name, O_RDWR, diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 1483020b412b..2457d7a21974 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -166,11 +166,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte, struct page **pagep, bool wp_copy); #endif /* CONFIG_USERFAULTFD */ -bool hugetlb_reserve_pages(struct inode *inode, long from, long to, - struct vm_area_struct *vma, - vm_flags_t vm_flags); -long hugetlb_unreserve_pages(struct inode *inode, long start, long end, - long freed); +bool hugetlb_reserve_pages(struct hstate *h, struct hugepage_subpool *spool, + struct inode *inode, + long from, long to, + struct vm_area_struct *vma, + vm_flags_t vm_flags); +long hugetlb_unreserve_pages(struct hstate *h, struct hugepage_subpool *spool, + struct inode *inode, long start, long end, long freed); bool isolate_hugetlb(struct folio *folio, struct list_head *list); int get_hwpoison_hugetlb_folio(struct folio *folio, bool *hugetlb, bool unpoison); int get_huge_page_for_hwpoison(unsigned long pfn, int flags, @@ -178,7 +180,7 @@ int get_huge_page_for_hwpoison(unsigned long pfn, int flags, void folio_putback_active_hugetlb(struct folio *folio); void move_hugetlb_state(struct folio *old_folio, struct folio *new_folio, int reason); void free_huge_page(struct page *page); -void hugetlb_fix_reserve_counts(struct inode *inode); +void hugetlb_fix_reserve_counts(struct hstate *h, struct hugepage_subpool *spool); extern struct mutex *hugetlb_fault_mutex_table; u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
@@ -259,6 +261,9 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma); void hugetlb_zero_partial_page(struct hstate *h, struct address_space *mapping, loff_t start, loff_t end);
+void remove_mapping_hugepages(struct address_space *mapping, + struct hstate *h, struct hugepage_subpool *spool, + struct inode *inode, loff_t lstart, loff_t lend); void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend);
#else /* !CONFIG_HUGETLB_PAGE */ @@ -472,6 +477,9 @@ static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { } static inline void hugetlb_zero_partial_page( struct hstate *h, struct address_space *mapping, loff_t start, loff_t end) {}
+static inline void remove_mapping_hugepages( + struct address_space *mapping, struct hstate *h, struct hugepage_subpool *spool, + struct inode *inode, loff_t lstart, loff_t lend) {} static inline void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend) {}
#endif /* !CONFIG_HUGETLB_PAGE */ @@ -554,6 +562,12 @@ static inline struct hstate *hstate_inode(struct inode *i) { return HUGETLBFS_SB(i->i_sb)->hstate; } + +static inline struct hugepage_subpool *subpool_inode(struct inode *inode) +{ + return HUGETLBFS_SB(inode->i_sb)->spool; +} + #else /* !CONFIG_HUGETLBFS */
#define is_file_hugepages(file) false @@ -568,6 +582,12 @@ static inline struct hstate *hstate_inode(struct inode *i) { return NULL; } + +static inline struct hugepage_subpool *subpool_inode(struct inode *inode) +{ + return NULL; +} + #endif /* !CONFIG_HUGETLBFS */
#ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 9c9262833b4f..9da419b930df 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -247,11 +247,6 @@ static long hugepage_subpool_put_pages(struct hugepage_subpool *spool, return ret; }
-static inline struct hugepage_subpool *subpool_inode(struct inode *inode) -{ - return HUGETLBFS_SB(inode->i_sb)->spool; -} - static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma) { return subpool_inode(file_inode(vma->vm_file)); @@ -898,16 +893,13 @@ static long region_del(struct resv_map *resv, long f, long t) * appear as a "reserved" entry instead of simply dangling with incorrect * counts. */ -void hugetlb_fix_reserve_counts(struct inode *inode) +void hugetlb_fix_reserve_counts(struct hstate *h, struct hugepage_subpool *spool) { - struct hugepage_subpool *spool = subpool_inode(inode); long rsv_adjust; bool reserved = false;
rsv_adjust = hugepage_subpool_get_pages(spool, 1); if (rsv_adjust > 0) { - struct hstate *h = hstate_inode(inode); - if (!hugetlb_acct_memory(h, 1)) reserved = true; } else if (!rsv_adjust) { @@ -6762,15 +6754,22 @@ long hugetlb_change_protection(struct vm_area_struct *vma, return pages > 0 ? (pages << h->order) : pages; }
-/* Return true if reservation was successful, false otherwise. */ -bool hugetlb_reserve_pages(struct inode *inode, - long from, long to, - struct vm_area_struct *vma, - vm_flags_t vm_flags) +/** + * Reserves pages between vma indices @from and @to by handling accounting in: + * + hstate @h (required) + * + subpool @spool (can be NULL) + * + @inode (required if @vma is NULL) + * + * Will setup resv_map in @vma if necessary. + * Return true if reservation was successful, false otherwise. + */ +bool hugetlb_reserve_pages(struct hstate *h, struct hugepage_subpool *spool, + struct inode *inode, + long from, long to, + struct vm_area_struct *vma, + vm_flags_t vm_flags) { long chg = -1, add = -1; - struct hstate *h = hstate_inode(inode); - struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; struct hugetlb_cgroup *h_cg = NULL; long gbl_reserve, regions_needed = 0; @@ -6921,13 +6920,23 @@ bool hugetlb_reserve_pages(struct inode *inode, return false; }
-long hugetlb_unreserve_pages(struct inode *inode, long start, long end, - long freed) +/** + * Unreserves pages between vma indices @start and @end by handling accounting + * in: + * + hstate @h (required) + * + subpool @spool (can be NULL) + * + @inode (required) + * + resv_map in @inode (can be NULL) + * + * @freed is the number of pages freed, for updating inode->i_blocks. + * + * Returns 0 on success. + */ +long hugetlb_unreserve_pages(struct hstate *h, struct hugepage_subpool *spool, + struct inode *inode, long start, long end, long freed) { - struct hstate *h = hstate_inode(inode); struct resv_map *resv_map = inode_resv_map(inode); long chg = 0; - struct hugepage_subpool *spool = subpool_inode(inode); long gbl_reserve;
/*
subpool_inode() and hstate_inode() are hugetlbfs-specific.
Allowing the subpool and hstate to be specified further modularizes hugetlb from hugetlbfs.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/hugetlb.h |  3 +++
 mm/hugetlb.c            | 16 ++++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 2457d7a21974..14df89d1642c 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -747,6 +747,9 @@ struct huge_bootmem_page { };
int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list); +struct folio *alloc_hugetlb_folio_from_subpool( + struct hugepage_subpool *spool, struct hstate *h, + struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 9da419b930df..99ab4bbdb2ce 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3008,11 +3008,10 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list) return ret; }
-struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, - unsigned long addr, int avoid_reserve) +struct folio *alloc_hugetlb_folio_from_subpool( + struct hugepage_subpool *spool, struct hstate *h, + struct vm_area_struct *vma, unsigned long addr, int avoid_reserve) { - struct hugepage_subpool *spool = subpool_vma(vma); - struct hstate *h = hstate_vma(vma); struct folio *folio; long map_chg, map_commit; long gbl_chg; @@ -3139,6 +3138,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, return ERR_PTR(-ENOSPC); }
+struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, + unsigned long addr, int avoid_reserve) +{ + struct hugepage_subpool *spool = subpool_vma(vma); + struct hstate *h = hstate_vma(vma); + + return alloc_hugetlb_folio_from_subpool(spool, h, vma, addr, avoid_reserve); +} + int alloc_bootmem_huge_page(struct hstate *h, int nid) __attribute__ ((weak, alias("__alloc_bootmem_huge_page"))); int __alloc_bootmem_huge_page(struct hstate *h, int nid)
hstate_inode() is hugetlbfs-specific, limiting hugetlb_add_to_page_cache() to hugetlbfs.
hugetlb_filemap_add_folio() allows hstate to be specified and further separates hugetlb from hugetlbfs.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c            | 13 ++++++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 14df89d1642c..7d49048c5a2a 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -756,6 +756,8 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid, nodemask_t *nmask, gfp_t gfp_mask); struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *vma, unsigned long address); +int hugetlb_filemap_add_folio(struct address_space *mapping, struct hstate *h, + struct folio *folio, pgoff_t idx); int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping, pgoff_t idx); void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 99ab4bbdb2ce..d16c6417b90f 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5665,11 +5665,10 @@ static bool hugetlbfs_pagecache_present(struct hstate *h, return present; }
-int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping, - pgoff_t idx) +int hugetlb_filemap_add_folio(struct address_space *mapping, struct hstate *h, + struct folio *folio, pgoff_t idx) { struct inode *inode = mapping->host; - struct hstate *h = hstate_inode(inode); int err;
__folio_set_locked(folio); @@ -5693,6 +5692,14 @@ int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping return 0; }
+int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping, + pgoff_t idx) +{ + struct hstate *h = hstate_inode(mapping->host); + + return hugetlb_filemap_add_folio(mapping, h, folio, idx); +} + static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma, struct address_space *mapping, pgoff_t idx,
The vma_*_reservation functions rely on vma_resv_map(), which assumes the hugetlbfs convention of storing the resv_map in a specific field of the inode.
This refactor enables vma_*_reservation functions, now renamed resv_map_*_reservation, to be used with non-hugetlbfs filesystems, further decoupling hugetlb from hugetlbfs.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 mm/hugetlb.c | 184 +++++++++++++++++++++++++++------------------
 1 file changed, 99 insertions(+), 85 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d16c6417b90f..d943f83d15a9 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2643,89 +2643,81 @@ static void return_unused_surplus_pages(struct hstate *h,
/* - * vma_needs_reservation, vma_commit_reservation and vma_end_reservation - * are used by the huge page allocation routines to manage reservations. + * resv_map_needs_reservation, resv_map_commit_reservation and + * resv_map_end_reservation are used by the huge page allocation routines to + * manage reservations. * - * vma_needs_reservation is called to determine if the huge page at addr - * within the vma has an associated reservation. If a reservation is - * needed, the value 1 is returned. The caller is then responsible for - * managing the global reservation and subpool usage counts. After - * the huge page has been allocated, vma_commit_reservation is called - * to add the page to the reservation map. If the page allocation fails, - * the reservation must be ended instead of committed. vma_end_reservation - * is called in such cases. + * resv_map_needs_reservation is called to determine if the huge page at addr + * within the vma has an associated reservation. If a reservation is needed, + * the value 1 is returned. The caller is then responsible for managing the + * global reservation and subpool usage counts. After the huge page has been + * allocated, resv_map_commit_reservation is called to add the page to the + * reservation map. If the page allocation fails, the reservation must be ended + * instead of committed. resv_map_end_reservation is called in such cases. * - * In the normal case, vma_commit_reservation returns the same value - * as the preceding vma_needs_reservation call. The only time this - * is not the case is if a reserve map was changed between calls. It - * is the responsibility of the caller to notice the difference and - * take appropriate action. + * In the normal case, resv_map_commit_reservation returns the same value as the + * preceding resv_map_needs_reservation call. The only time this is not the + * case is if a reserve map was changed between calls. It is the responsibility + * of the caller to notice the difference and take appropriate action. * - * vma_add_reservation is used in error paths where a reservation must - * be restored when a newly allocated huge page must be freed. It is - * to be called after calling vma_needs_reservation to determine if a - * reservation exists. + * resv_map_add_reservation is used in error paths where a reservation must be + * restored when a newly allocated huge page must be freed. It is to be called + * after calling resv_map_needs_reservation to determine if a reservation + * exists. * - * vma_del_reservation is used in error paths where an entry in the reserve - * map was created during huge page allocation and must be removed. It is to - * be called after calling vma_needs_reservation to determine if a reservation + * resv_map_del_reservation is used in error paths where an entry in the reserve + * map was created during huge page allocation and must be removed. It is to be + * called after calling resv_map_needs_reservation to determine if a reservation * exists. 
*/ -enum vma_resv_mode { - VMA_NEEDS_RESV, - VMA_COMMIT_RESV, - VMA_END_RESV, - VMA_ADD_RESV, - VMA_DEL_RESV, +enum resv_map_resv_mode { + RESV_MAP_NEEDS_RESV, + RESV_MAP_COMMIT_RESV, + RESV_MAP_END_RESV, + RESV_MAP_ADD_RESV, + RESV_MAP_DEL_RESV, }; -static long __vma_reservation_common(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr, - enum vma_resv_mode mode) +static long __resv_map_reservation_common(struct resv_map *resv, pgoff_t resv_index, + bool may_be_shared_mapping, + enum resv_map_resv_mode mode) { - struct resv_map *resv; - pgoff_t idx; long ret; long dummy_out_regions_needed;
- resv = vma_resv_map(vma); - if (!resv) - return 1; - - idx = vma_hugecache_offset(h, vma, addr); switch (mode) { - case VMA_NEEDS_RESV: - ret = region_chg(resv, idx, idx + 1, &dummy_out_regions_needed); + case RESV_MAP_NEEDS_RESV: + ret = region_chg(resv, resv_index, resv_index + 1, &dummy_out_regions_needed); /* We assume that vma_reservation_* routines always operate on * 1 page, and that adding to resv map a 1 page entry can only * ever require 1 region. */ VM_BUG_ON(dummy_out_regions_needed != 1); break; - case VMA_COMMIT_RESV: - ret = region_add(resv, idx, idx + 1, 1, NULL, NULL); + case RESV_MAP_COMMIT_RESV: + ret = region_add(resv, resv_index, resv_index + 1, 1, NULL, NULL); /* region_add calls of range 1 should never fail. */ VM_BUG_ON(ret < 0); break; - case VMA_END_RESV: - region_abort(resv, idx, idx + 1, 1); + case RESV_MAP_END_RESV: + region_abort(resv, resv_index, resv_index + 1, 1); ret = 0; break; - case VMA_ADD_RESV: - if (vma->vm_flags & VM_MAYSHARE) { - ret = region_add(resv, idx, idx + 1, 1, NULL, NULL); + case RESV_MAP_ADD_RESV: + if (may_be_shared_mapping) { + ret = region_add(resv, resv_index, resv_index + 1, 1, NULL, NULL); /* region_add calls of range 1 should never fail. */ VM_BUG_ON(ret < 0); } else { - region_abort(resv, idx, idx + 1, 1); - ret = region_del(resv, idx, idx + 1); + region_abort(resv, resv_index, resv_index + 1, 1); + ret = region_del(resv, resv_index, resv_index + 1); } break; - case VMA_DEL_RESV: - if (vma->vm_flags & VM_MAYSHARE) { - region_abort(resv, idx, idx + 1, 1); - ret = region_del(resv, idx, idx + 1); + case RESV_MAP_DEL_RESV: + if (may_be_shared_mapping) { + region_abort(resv, resv_index, resv_index + 1, 1); + ret = region_del(resv, resv_index, resv_index + 1); } else { - ret = region_add(resv, idx, idx + 1, 1, NULL, NULL); + ret = region_add(resv, resv_index, resv_index + 1, 1, NULL, NULL); /* region_add calls of range 1 should never fail. */ VM_BUG_ON(ret < 0); } @@ -2734,7 +2726,7 @@ static long __vma_reservation_common(struct hstate *h, BUG(); }
- if (vma->vm_flags & VM_MAYSHARE || mode == VMA_DEL_RESV) + if (may_be_shared_mapping || mode == RESV_MAP_DEL_RESV) return ret; /* * We know private mapping must have HPAGE_RESV_OWNER set. @@ -2758,34 +2750,39 @@ static long __vma_reservation_common(struct hstate *h, return ret; }
-static long vma_needs_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static long resv_map_needs_reservation(struct resv_map *resv, pgoff_t resv_index, + bool may_be_shared_mapping) { - return __vma_reservation_common(h, vma, addr, VMA_NEEDS_RESV); + return __resv_map_reservation_common( + resv, resv_index, may_be_shared_mapping, RESV_MAP_NEEDS_RESV); }
-static long vma_commit_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static long resv_map_commit_reservation(struct resv_map *resv, pgoff_t resv_index, + bool may_be_shared_mapping) { - return __vma_reservation_common(h, vma, addr, VMA_COMMIT_RESV); + return __resv_map_reservation_common( + resv, resv_index, may_be_shared_mapping, RESV_MAP_COMMIT_RESV); }
-static void vma_end_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static void resv_map_end_reservation(struct resv_map *resv, pgoff_t resv_index, + bool may_be_shared_mapping) { - (void)__vma_reservation_common(h, vma, addr, VMA_END_RESV); + (void)__resv_map_reservation_common( + resv, resv_index, may_be_shared_mapping, RESV_MAP_END_RESV); }
-static long vma_add_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static long resv_map_add_reservation(struct resv_map *resv, pgoff_t resv_index, + bool may_be_shared_mapping) { - return __vma_reservation_common(h, vma, addr, VMA_ADD_RESV); + return __resv_map_reservation_common( + resv, resv_index, may_be_shared_mapping, RESV_MAP_ADD_RESV); }
-static long vma_del_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static long resv_map_del_reservation(struct resv_map *resv, pgoff_t resv_index, + bool may_be_shared_mapping) { - return __vma_reservation_common(h, vma, addr, VMA_DEL_RESV); + return __resv_map_reservation_common( + resv, resv_index, may_be_shared_mapping, RESV_MAP_DEL_RESV); }
/* @@ -2811,7 +2808,12 @@ static long vma_del_reservation(struct hstate *h, void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, unsigned long address, struct folio *folio) { - long rc = vma_needs_reservation(h, vma, address); + long rc; + struct resv_map *resv = vma_resv_map(vma); + pgoff_t resv_index = vma_hugecache_offset(h, vma, address); + bool may_share = vma->vm_flags & VM_MAYSHARE; + + rc = resv_map_needs_reservation(resv, resv_index, may_share);
if (folio_test_hugetlb_restore_reserve(folio)) { if (unlikely(rc < 0)) @@ -2828,9 +2830,9 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, */ folio_clear_hugetlb_restore_reserve(folio); else if (rc) - (void)vma_add_reservation(h, vma, address); + (void)resv_map_add_reservation(resv, resv_index, may_share); else - vma_end_reservation(h, vma, address); + resv_map_end_reservation(resv, resv_index, may_share); } else { if (!rc) { /* @@ -2841,7 +2843,7 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, * Remove the entry so that a subsequent allocation * does not consume a reservation. */ - rc = vma_del_reservation(h, vma, address); + rc = resv_map_del_reservation(resv, resv_index, may_share); if (rc < 0) /* * VERY rare out of memory condition. Since @@ -2855,7 +2857,7 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, } else if (rc < 0) { /* * Rare out of memory condition from - * vma_needs_reservation call. Memory allocation is + * resv_map_needs_reservation call. Memory allocation is * only attempted if a new entry is needed. Therefore, * this implies there is not an entry in the * reserve map. @@ -2877,7 +2879,7 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, /* * No reservation present, do nothing */ - vma_end_reservation(h, vma, address); + resv_map_end_reservation(resv, resv_index, may_share); } }
@@ -3019,13 +3021,17 @@ struct folio *alloc_hugetlb_folio_from_subpool( struct hugetlb_cgroup *h_cg = NULL; bool deferred_reserve;
+ struct resv_map *resv = vma_resv_map(vma); + pgoff_t resv_index = vma_hugecache_offset(h, vma, addr); + bool may_share = vma->vm_flags & VM_MAYSHARE; + idx = hstate_index(h); /* * Examine the region/reserve map to determine if the process * has a reservation for the page to be allocated. A return * code of zero indicates a reservation exists (no change). */ - map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); + map_chg = gbl_chg = resv_map_needs_reservation(resv, resv_index, may_share); if (map_chg < 0) return ERR_PTR(-ENOMEM);
@@ -3039,7 +3045,7 @@ struct folio *alloc_hugetlb_folio_from_subpool( if (map_chg || avoid_reserve) { gbl_chg = hugepage_subpool_get_pages(spool, 1); if (gbl_chg < 0) { - vma_end_reservation(h, vma, addr); + resv_map_end_reservation(resv, resv_index, may_share); return ERR_PTR(-ENOSPC); }
@@ -3104,11 +3110,11 @@ struct folio *alloc_hugetlb_folio_from_subpool(
hugetlb_set_folio_subpool(folio, spool);
- map_commit = vma_commit_reservation(h, vma, addr); + map_commit = resv_map_commit_reservation(resv, resv_index, may_share); if (unlikely(map_chg > map_commit)) { /* * The page was added to the reservation map between - * vma_needs_reservation and vma_commit_reservation. + * resv_map_needs_reservation and resv_map_commit_reservation. * This indicates a race with hugetlb_reserve_pages. * Adjust for the subpool count incremented above AND * in hugetlb_reserve_pages for the same page. Also, @@ -3134,7 +3140,7 @@ struct folio *alloc_hugetlb_folio_from_subpool( out_subpool_put: if (map_chg || avoid_reserve) hugepage_subpool_put_pages(spool, 1); - vma_end_reservation(h, vma, addr); + resv_map_end_reservation(resv, resv_index, may_share); return ERR_PTR(-ENOSPC); }
@@ -5901,12 +5907,16 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, * the spinlock. */ if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { - if (vma_needs_reservation(h, vma, haddr) < 0) { + struct resv_map *resv = vma_resv_map(vma); + pgoff_t resv_index = vma_hugecache_offset(h, vma, address); + bool may_share = vma->vm_flags & VM_MAYSHARE; + + if (resv_map_needs_reservation(resv, resv_index, may_share) < 0) { ret = VM_FAULT_OOM; goto backout_unlocked; } /* Just decrements count, does not deallocate */ - vma_end_reservation(h, vma, haddr); + resv_map_end_reservation(resv, resv_index, may_share); }
ptl = huge_pte_lock(h, mm, ptep); @@ -6070,12 +6080,16 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, */ if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) && !(vma->vm_flags & VM_MAYSHARE) && !huge_pte_write(entry)) { - if (vma_needs_reservation(h, vma, haddr) < 0) { + struct resv_map *resv = vma_resv_map(vma); + pgoff_t resv_index = vma_hugecache_offset(h, vma, address); + bool may_share = vma->vm_flags & VM_MAYSHARE; + + if (resv_map_needs_reservation(resv, resv_index, may_share) < 0) { ret = VM_FAULT_OOM; goto out_mutex; } /* Just decrements count, does not deallocate */ - vma_end_reservation(h, vma, haddr); + resv_map_end_reservation(resv, resv_index, may_share);
pagecache_folio = filemap_lock_folio(mapping, idx); }
Refactor restore_reserve_on_error to allow resv_map to be passed in. vma_resv_map() assumes the use of hugetlbfs in the way it retrieves the resv_map from the vma and inode.
Introduce restore_reserve_on_error_vma(), which retains the original functionality, to simplify the refactoring for now.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 fs/hugetlbfs/inode.c    |  2 +-
 include/linux/hugetlb.h |  6 ++++--
 mm/hugetlb.c            | 37 +++++++++++++++++++++----------------
 3 files changed, 26 insertions(+), 19 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 0fc49b6252e4..44e6ee9a856d 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -868,7 +868,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, __folio_mark_uptodate(folio); error = hugetlb_add_to_page_cache(folio, mapping, index); if (unlikely(error)) { - restore_reserve_on_error(h, &pseudo_vma, addr, folio); + restore_reserve_on_error_vma(h, &pseudo_vma, addr, folio); folio_put(folio); mutex_unlock(&hugetlb_fault_mutex_table[hash]); goto out; diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 7d49048c5a2a..02a2766d89a4 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -760,8 +760,10 @@ int hugetlb_filemap_add_folio(struct address_space *mapping, struct hstate *h, struct folio *folio, pgoff_t idx); int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping, pgoff_t idx); -void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, - unsigned long address, struct folio *folio); +void restore_reserve_on_error(struct resv_map *resv, pgoff_t resv_index, + bool may_share, struct folio *folio); +void restore_reserve_on_error_vma(struct hstate *h, struct vm_area_struct *vma, + unsigned long address, struct folio *folio);
/* arch callback */ int __init __alloc_bootmem_huge_page(struct hstate *h, int nid); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d943f83d15a9..4675f9efeba4 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2805,15 +2805,10 @@ static long resv_map_del_reservation(struct resv_map *resv, pgoff_t resv_index, * * In case 2, simply undo reserve map modifications done by alloc_hugetlb_folio. */ -void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, - unsigned long address, struct folio *folio) +void restore_reserve_on_error(struct resv_map *resv, pgoff_t resv_index, + bool may_share, struct folio *folio) { - long rc; - struct resv_map *resv = vma_resv_map(vma); - pgoff_t resv_index = vma_hugecache_offset(h, vma, address); - bool may_share = vma->vm_flags & VM_MAYSHARE; - - rc = resv_map_needs_reservation(resv, resv_index, may_share); + long rc = resv_map_needs_reservation(resv, resv_index, may_share);
if (folio_test_hugetlb_restore_reserve(folio)) { if (unlikely(rc < 0)) @@ -2865,7 +2860,7 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, * For shared mappings, no entry in the map indicates * no reservation. We are done. */ - if (!(vma->vm_flags & VM_MAYSHARE)) + if (!may_share) /* * For private mappings, no entry indicates * a reservation is present. Since we can @@ -2883,6 +2878,16 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, } }
+void restore_reserve_on_error_vma(struct hstate *h, struct vm_area_struct *vma, + unsigned long address, struct folio *folio) +{ + struct resv_map *resv = vma_resv_map(vma); + pgoff_t resv_index = vma_hugecache_offset(h, vma, address); + bool may_share = vma->vm_flags & VM_MAYSHARE; + + restore_reserve_on_error(resv, resv_index, may_share, folio); +} + /* * alloc_and_dissolve_hugetlb_folio - Allocate a new folio and dissolve * the old one @@ -5109,8 +5114,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); entry = huge_ptep_get(src_pte); if (!pte_same(src_pte_old, entry)) { - restore_reserve_on_error(h, dst_vma, addr, - new_folio); + restore_reserve_on_error_vma(h, dst_vma, addr, + new_folio); folio_put(new_folio); /* huge_ptep of dst_pte won't change as in child */ goto again; @@ -5642,7 +5647,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma, * unshare) */ if (new_folio != page_folio(old_page)) - restore_reserve_on_error(h, vma, haddr, new_folio); + restore_reserve_on_error_vma(h, vma, haddr, new_folio); folio_put(new_folio); out_release_old: put_page(old_page); @@ -5860,7 +5865,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, * to the page cache. So it's safe to call * restore_reserve_on_error() here. */ - restore_reserve_on_error(h, vma, haddr, folio); + restore_reserve_on_error_vma(h, vma, haddr, folio); folio_put(folio); goto out; } @@ -5965,7 +5970,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, spin_unlock(ptl); backout_unlocked: if (new_folio && !new_pagecache_folio) - restore_reserve_on_error(h, vma, haddr, folio); + restore_reserve_on_error_vma(h, vma, haddr, folio);
folio_unlock(folio); folio_put(folio); @@ -6232,7 +6237,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, /* Free the allocated folio which may have * consumed a reservation. */ - restore_reserve_on_error(h, dst_vma, dst_addr, folio); + restore_reserve_on_error_vma(h, dst_vma, dst_addr, folio); folio_put(folio);
/* Allocate a temporary folio to hold the copied @@ -6361,7 +6366,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, folio_unlock(folio); out_release_nounlock: if (!folio_in_pagecache) - restore_reserve_on_error(h, dst_vma, dst_addr, folio); + restore_reserve_on_error_vma(h, dst_vma, dst_addr, folio); folio_put(folio); goto out; }
Expose inode_resv_map() so that hugetlbfs can access its own resv_map.
Hide restore_reserve_on_error_vma(), since that function is now only used within mm/hugetlb.c.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 fs/hugetlbfs/inode.c    |  2 +-
 include/linux/hugetlb.h | 21 +++++++++++++++++++--
 mm/hugetlb.c            | 13 -------------
 3 files changed, 20 insertions(+), 16 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 44e6ee9a856d..53f6a421499d 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -868,7 +868,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, __folio_mark_uptodate(folio); error = hugetlb_add_to_page_cache(folio, mapping, index); if (unlikely(error)) { - restore_reserve_on_error_vma(h, &pseudo_vma, addr, folio); + restore_reserve_on_error(inode_resv_map(inode), index, true, folio); folio_put(folio); mutex_unlock(&hugetlb_fault_mutex_table[hash]); goto out; diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 02a2766d89a4..5fe9643826d7 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -568,6 +568,20 @@ static inline struct hugepage_subpool *subpool_inode(struct inode *inode) return HUGETLBFS_SB(inode->i_sb)->spool; }
+static inline struct resv_map *inode_resv_map(struct inode *inode) +{ + /* + * At inode evict time, i_mapping may not point to the original + * address space within the inode. This original address space + * contains the pointer to the resv_map. So, always use the + * address space embedded within the inode. + * The VERY common case is inode->mapping == &inode->i_data but, + * this may not be true for device special inodes. + */ + return (struct resv_map *)(&inode->i_data)->private_data; +} + + #else /* !CONFIG_HUGETLBFS */
#define is_file_hugepages(file) false @@ -588,6 +602,11 @@ static inline struct hugepage_subpool *subpool_inode(struct inode *inode) return NULL; }
+static inline struct resv_map *inode_resv_map(struct inode *inode) +{ + return NULL; +} + #endif /* !CONFIG_HUGETLBFS */
#ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA @@ -762,8 +781,6 @@ int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping pgoff_t idx); void restore_reserve_on_error(struct resv_map *resv, pgoff_t resv_index, bool may_share, struct folio *folio); -void restore_reserve_on_error_vma(struct hstate *h, struct vm_area_struct *vma, - unsigned long address, struct folio *folio);
/* arch callback */ int __init __alloc_bootmem_huge_page(struct hstate *h, int nid); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4675f9efeba4..540634aec181 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1091,19 +1091,6 @@ void resv_map_release(struct kref *ref) kfree(resv_map); }
-static inline struct resv_map *inode_resv_map(struct inode *inode) -{ - /* - * At inode evict time, i_mapping may not point to the original - * address space within the inode. This original address space - * contains the pointer to the resv_map. So, always use the - * address space embedded within the inode. - * The VERY common case is inode->mapping == &inode->i_data but, - * this may not be true for device special inodes. - */ - return (struct resv_map *)(&inode->i_data)->private_data; -} - static struct resv_map *vma_resv_map(struct vm_area_struct *vma) { VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma);
Parametrize alloc_hugetlb_folio_from_subpool() by resv_map to remove the use of vma_resv_map() and further decouple hugetlb from hugetlbfs.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/hugetlb.h | 2 +-
 mm/hugetlb.c            | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 5fe9643826d7..d564802ace4b 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -767,7 +767,7 @@ struct huge_bootmem_page {
int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list); struct folio *alloc_hugetlb_folio_from_subpool( - struct hugepage_subpool *spool, struct hstate *h, + struct hugepage_subpool *spool, struct hstate *h, struct resv_map *resv, struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 540634aec181..aebdd8c63439 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3003,7 +3003,7 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list) }
struct folio *alloc_hugetlb_folio_from_subpool( - struct hugepage_subpool *spool, struct hstate *h, + struct hugepage_subpool *spool, struct hstate *h, struct resv_map *resv, struct vm_area_struct *vma, unsigned long addr, int avoid_reserve) { struct folio *folio; @@ -3013,7 +3013,6 @@ struct folio *alloc_hugetlb_folio_from_subpool( struct hugetlb_cgroup *h_cg = NULL; bool deferred_reserve;
- struct resv_map *resv = vma_resv_map(vma); pgoff_t resv_index = vma_hugecache_offset(h, vma, addr); bool may_share = vma->vm_flags & VM_MAYSHARE;
@@ -3141,8 +3140,9 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, { struct hugepage_subpool *spool = subpool_vma(vma); struct hstate *h = hstate_vma(vma); + struct resv_map *resv = vma_resv_map(vma);
- return alloc_hugetlb_folio_from_subpool(spool, h, vma, addr, avoid_reserve); + return alloc_hugetlb_folio_from_subpool(spool, h, resv, vma, addr, avoid_reserve); }
int alloc_bootmem_huge_page(struct hstate *h, int nid)
Parametrize remove_mapping_hugepages() and hugetlb_unreserve_pages() by resv_map to remove the use of inode_resv_map() and further decouple hugetlb from hugetlbfs.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 fs/hugetlbfs/inode.c    | 16 ++++++++++------
 include/linux/hugetlb.h |  6 ++++--
 mm/hugetlb.c            |  4 ++--
 3 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 53f6a421499d..a7791b1390a6 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -560,8 +560,8 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end, */ static bool remove_mapping_single_folio( struct address_space *mapping, struct folio *folio, pgoff_t index, - struct hstate *h, struct hugepage_subpool *spool, struct inode *inode, - bool truncate_op) + struct hstate *h, struct hugepage_subpool *spool, struct resv_map *resv_map, + struct inode *inode, bool truncate_op) { bool ret = false;
@@ -586,7 +586,8 @@ static bool remove_mapping_single_folio( hugetlb_delete_from_page_cache(folio); ret = true; if (!truncate_op) { - if (unlikely(hugetlb_unreserve_pages(h, spool, inode, index, index + 1, 1))) + if (unlikely(hugetlb_unreserve_pages(h, spool, resv_map, + inode, index, index + 1, 1))) hugetlb_fix_reserve_counts(h, spool); }
@@ -623,6 +624,7 @@ static bool remove_mapping_single_folio( */ void remove_mapping_hugepages(struct address_space *mapping, struct hstate *h, struct hugepage_subpool *spool, + struct resv_map *resv_map, struct inode *inode, loff_t lstart, loff_t lend) { const pgoff_t start = lstart >> huge_page_shift(h); @@ -647,7 +649,7 @@ void remove_mapping_hugepages(struct address_space *mapping, * Remove folio that was part of folio_batch. */ if (remove_mapping_single_folio(mapping, folio, index, - h, spool, inode, truncate_op)) + h, spool, resv_map, inode, truncate_op)) freed++;
mutex_unlock(&hugetlb_fault_mutex_table[hash]); @@ -657,7 +659,8 @@ void remove_mapping_hugepages(struct address_space *mapping, }
if (truncate_op) - (void)hugetlb_unreserve_pages(h, spool, inode, start, LONG_MAX, freed); + (void)hugetlb_unreserve_pages(h, spool, resv_map, inode, + start, LONG_MAX, freed); }
void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend) @@ -665,8 +668,9 @@ void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend) struct address_space *mapping = &inode->i_data; struct hstate *h = hstate_inode(inode); struct hugepage_subpool *spool = subpool_inode(inode); + struct resv_map *resv_map = inode_resv_map(inode);
- return remove_mapping_hugepages(mapping, h, spool, inode, lstart, lend); + return remove_mapping_hugepages(mapping, h, spool, resv_map, inode, lstart, lend); }
static void hugetlbfs_evict_inode(struct inode *inode) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index d564802ace4b..af04588a5afe 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -172,7 +172,8 @@ bool hugetlb_reserve_pages(struct hstate *h, struct hugepage_subpool *spool, struct vm_area_struct *vma, vm_flags_t vm_flags); long hugetlb_unreserve_pages(struct hstate *h, struct hugepage_subpool *spool, - struct inode *inode, long start, long end, long freed); + struct resv_map *resv_map, struct inode *inode, + long start, long end, long freed); bool isolate_hugetlb(struct folio *folio, struct list_head *list); int get_hwpoison_hugetlb_folio(struct folio *folio, bool *hugetlb, bool unpoison); int get_huge_page_for_hwpoison(unsigned long pfn, int flags, @@ -263,6 +264,7 @@ void hugetlb_zero_partial_page(struct hstate *h, struct address_space *mapping,
void remove_mapping_hugepages(struct address_space *mapping, struct hstate *h, struct hugepage_subpool *spool, + struct resv_map *resv_map, struct inode *inode, loff_t lstart, loff_t lend); void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend);
@@ -479,7 +481,7 @@ static inline void hugetlb_zero_partial_page(
static inline void remove_mapping_hugepages( struct address_space *mapping, struct hstate *h, struct hugepage_subpool *spool, - struct inode *inode, loff_t lstart, loff_t lend) {} + struct resv_map *resv_map, struct inode *inode, loff_t lstart, loff_t lend) {} static inline void remove_inode_hugepages(struct inode *inode, loff_t lstart, loff_t lend) {}
#endif /* !CONFIG_HUGETLB_PAGE */ diff --git a/mm/hugetlb.c b/mm/hugetlb.c index aebdd8c63439..a1cbda457aa7 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6954,9 +6954,9 @@ bool hugetlb_reserve_pages(struct hstate *h, struct hugepage_subpool *spool, * Returns 0 on success. */ long hugetlb_unreserve_pages(struct hstate *h, struct hugepage_subpool *spool, - struct inode *inode, long start, long end, long freed) + struct resv_map *resv_map, struct inode *inode, + long start, long end, long freed) { - struct resv_map *resv_map = inode_resv_map(inode); long chg = 0; long gbl_reserve;
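For illustration, a minimal sketch (not part of this patch) of how a user other than hugetlbfs could now drive the removal/unreserve path with its own hstate, subpool and resv_map; the struct my_gmem_ctx container and my_ctx_truncate() are hypothetical names, not code from this series:

/* Hypothetical caller of the re-parametrized hugetlb API. */
#include <linux/fs.h>
#include <linux/hugetlb.h>

struct my_gmem_ctx {
	struct hstate *h;
	struct hugepage_subpool *spool;
	struct resv_map *resv_map;
};

/* Remove all hugetlb pages backing @inode and release the reservations. */
static void my_ctx_truncate(struct my_gmem_ctx *ctx, struct inode *inode)
{
	/*
	 * hstate, subpool and resv_map are passed explicitly; they no longer
	 * need to be stored at hugetlbfs-specific locations in the inode.
	 */
	remove_mapping_hugepages(inode->i_mapping, ctx->h, ctx->spool,
				 ctx->resv_map, inode, 0, LLONG_MAX);
}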
This will allow the preparation steps of truncate_inode_pages_final() to be shared with callers that need to remove the pages by other means.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/mm.h |  1 +
 mm/truncate.c      | 24 ++++++++++++++----------
 2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 1f79667824eb..7a8f6b810de0 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3053,6 +3053,7 @@ extern unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info); extern void truncate_inode_pages(struct address_space *, loff_t); extern void truncate_inode_pages_range(struct address_space *, loff_t lstart, loff_t lend); +extern void truncate_inode_pages_final_prepare(struct address_space *mapping); extern void truncate_inode_pages_final(struct address_space *);
/* generic vm_area_ops exported for stackable file systems */ diff --git a/mm/truncate.c b/mm/truncate.c index 7b4ea4c4a46b..4a7ae87e03b5 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -449,16 +449,7 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart) } EXPORT_SYMBOL(truncate_inode_pages);
-/** - * truncate_inode_pages_final - truncate *all* pages before inode dies - * @mapping: mapping to truncate - * - * Called under (and serialized by) inode->i_rwsem. - * - * Filesystems have to use this in the .evict_inode path to inform the - * VM that this is the final truncate and the inode is going away. - */ -void truncate_inode_pages_final(struct address_space *mapping) +void truncate_inode_pages_final_prepare(struct address_space *mapping) { /* * Page reclaim can not participate in regular inode lifetime @@ -479,7 +470,20 @@ void truncate_inode_pages_final(struct address_space *mapping) xa_lock_irq(&mapping->i_pages); xa_unlock_irq(&mapping->i_pages); } +}
+/** + * truncate_inode_pages_final - truncate *all* pages before inode dies + * @mapping: mapping to truncate + * + * Called under (and serialized by) inode->i_rwsem. + * + * Filesystems have to use this in the .evict_inode path to inform the + * VM that this is the final truncate and the inode is going away. + */ +void truncate_inode_pages_final(struct address_space *mapping) +{ + truncate_inode_pages_final_prepare(mapping); truncate_inode_pages(mapping, 0); } EXPORT_SYMBOL(truncate_inode_pages_final);
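To illustrate the intended use of the split, a minimal sketch of an evict_inode handler that needs the final-truncate bookkeeping but removes pages by its own means (the hugetlb-backed gmem eviction path later in this series is the real user); my_evict_inode() and my_remove_pages() are placeholder names:

/* Placeholder page-removal routine specific to the filesystem. */
static void my_remove_pages(struct address_space *mapping);

static void my_evict_inode(struct inode *inode)
{
	/* Final-truncate bookkeeping, without removing any pages yet. */
	truncate_inode_pages_final_prepare(inode->i_mapping);

	/* Remove pages in a way the generic truncation code doesn't know about. */
	my_remove_pages(inode->i_mapping);

	clear_inode(inode);
}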
First create a gmem inode, then create a gmem file using the inode, then install the file into an fd.
Creating the file in layers separates inode concepts (struct kvm_gmem) from file concepts and makes cleaning up in stages neater.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 virt/kvm/guest_mem.c | 86 +++++++++++++++++++++++++-------------------
 1 file changed, 50 insertions(+), 36 deletions(-)
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c index 8708139822d3..2f69ef666871 100644 --- a/virt/kvm/guest_mem.c +++ b/virt/kvm/guest_mem.c @@ -375,41 +375,27 @@ static const struct inode_operations kvm_gmem_iops = { .setattr = kvm_gmem_setattr, };
-static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags, - struct vfsmount *mnt) +static struct inode *kvm_gmem_create_inode(struct kvm *kvm, loff_t size, u64 flags, + struct vfsmount *mnt) { + int err; + struct inode *inode; + struct kvm_gmem *gmem; const char *anon_name = "[kvm-gmem]"; const struct qstr qname = QSTR_INIT(anon_name, strlen(anon_name)); - struct kvm_gmem *gmem; - struct inode *inode; - struct file *file; - int fd, err; - - fd = get_unused_fd_flags(0); - if (fd < 0) - return fd;
inode = alloc_anon_inode(mnt->mnt_sb); - if (IS_ERR(inode)) { - err = PTR_ERR(inode); - goto err_fd; - } + if (IS_ERR(inode)) + return inode;
err = security_inode_init_security_anon(inode, &qname, NULL); if (err) goto err_inode;
- file = alloc_file_pseudo(inode, mnt, "kvm-gmem", O_RDWR, &kvm_gmem_fops); - if (IS_ERR(file)) { - err = PTR_ERR(file); - goto err_inode; - } - + err = -ENOMEM; gmem = kzalloc(sizeof(*gmem), GFP_KERNEL); - if (!gmem) { - err = -ENOMEM; - goto err_file; - } + if (!gmem) + goto err_inode;
xa_init(&gmem->bindings);
@@ -426,24 +412,41 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags, mapping_set_large_folios(inode->i_mapping); mapping_set_unevictable(inode->i_mapping);
- file->f_flags |= O_LARGEFILE; - file->f_mapping = inode->i_mapping; - file->private_data = gmem; - - fd_install(fd, file); - return fd; + return inode;
-err_file: - fput(file); err_inode: iput(inode); -err_fd: - put_unused_fd(fd); - return err; + return ERR_PTR(err); +} + + +static struct file *kvm_gmem_create_file(struct kvm *kvm, loff_t size, u64 flags, + struct vfsmount *mnt) +{ + struct file *file; + struct inode *inode; + + inode = kvm_gmem_create_inode(kvm, size, flags, mnt); + if (IS_ERR(inode)) + return ERR_CAST(inode); + + file = alloc_file_pseudo(inode, mnt, "kvm-gmem", O_RDWR, &kvm_gmem_fops); + if (IS_ERR(file)) { + iput(inode); + return file; + } + + file->f_flags |= O_LARGEFILE; + file->f_mapping = inode->i_mapping; + file->private_data = inode->i_mapping->private_data; + + return file; }
int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *gmem) { + int fd; + struct file *file; loff_t size = gmem->size; u64 flags = gmem->flags;
@@ -462,7 +465,18 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *gmem) #endif }
- return __kvm_gmem_create(kvm, size, flags, kvm_gmem_mnt); + fd = get_unused_fd_flags(0); + if (fd < 0) + return fd; + + file = kvm_gmem_create_file(kvm, size, flags, kvm_gmem_mnt); + if (IS_ERR(file)) { + put_unused_fd(fd); + return PTR_ERR(file); + } + + fd_install(fd, file); + return fd; }
int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
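Condensed from the diff above (and omitting the size/flags validation), the resulting creation flow, where each layer unwinds only what it created on error:

int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *gmem)
{
	struct file *file;
	int fd;

	fd = get_unused_fd_flags(0);
	if (fd < 0)
		return fd;

	/* Creates the inode internally via kvm_gmem_create_inode(). */
	file = kvm_gmem_create_file(kvm, gmem->size, gmem->flags, kvm_gmem_mnt);
	if (IS_ERR(file)) {
		put_unused_fd(fd);	/* undo only the fd layer */
		return PTR_ERR(file);
	}

	fd_install(fd, file);
	return fd;
}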
Cleanup in kvm_gmem_release() should be the reverse of kvm_gmem_create_file().
Cleanup in kvm_gmem_evict_inode() should be the reverse of kvm_gmem_create_inode().
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 virt/kvm/guest_mem.c | 105 +++++++++++++++++++++++++++++--------------
 1 file changed, 71 insertions(+), 34 deletions(-)
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c index 2f69ef666871..13253af40be6 100644 --- a/virt/kvm/guest_mem.c +++ b/virt/kvm/guest_mem.c @@ -247,42 +247,13 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
static int kvm_gmem_release(struct inode *inode, struct file *file) { - struct kvm_gmem *gmem = inode->i_mapping->private_data; - struct kvm_memory_slot *slot; - struct kvm *kvm = gmem->kvm; - unsigned long index; - /* - * Prevent concurrent attempts to *unbind* a memslot. This is the last - * reference to the file and thus no new bindings can be created, but - * deferencing the slot for existing bindings needs to be protected - * against memslot updates, specifically so that unbind doesn't race - * and free the memslot (kvm_gmem_get_file() will return NULL). + * This is called when the last reference to the file is released. Only + * clean up file-related stuff. struct kvm_gmem is also referred to in + * the inode, so clean that up in kvm_gmem_evict_inode(). */ - mutex_lock(&kvm->slots_lock); - - xa_for_each(&gmem->bindings, index, slot) - rcu_assign_pointer(slot->gmem.file, NULL); - - synchronize_rcu(); - - /* - * All in-flight operations are gone and new bindings can be created. - * Free the backing memory, and more importantly, zap all SPTEs that - * pointed at this file. - */ - kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul); - truncate_inode_pages_final(file->f_mapping); - kvm_gmem_invalidate_end(kvm, gmem, 0, -1ul); - - mutex_unlock(&kvm->slots_lock); - - WARN_ON_ONCE(!(mapping_empty(file->f_mapping))); - - xa_destroy(&gmem->bindings); - kfree(gmem); - - kvm_put_kvm(kvm); + file->f_mapping = NULL; + file->private_data = NULL;
return 0; } @@ -603,11 +574,77 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, } EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
+static void kvm_gmem_evict_inode(struct inode *inode) +{ + struct kvm_gmem *gmem = inode->i_mapping->private_data; + struct kvm_memory_slot *slot; + struct kvm *kvm; + unsigned long index; + + /* + * If iput() was called before inode is completely set up due to some + * error in kvm_gmem_create_inode(), gmem will be NULL. + */ + if (!gmem) + goto basic_cleanup; + + kvm = gmem->kvm; + + /* + * Prevent concurrent attempts to *unbind* a memslot. This is the last + * reference to the file and thus no new bindings can be created, but + * deferencing the slot for existing bindings needs to be protected + * against memslot updates, specifically so that unbind doesn't race + * and free the memslot (kvm_gmem_get_file() will return NULL). + */ + mutex_lock(&kvm->slots_lock); + + xa_for_each(&gmem->bindings, index, slot) + rcu_assign_pointer(slot->gmem.file, NULL); + + synchronize_rcu(); + + /* + * All in-flight operations are gone and new bindings can be created. + * Free the backing memory, and more importantly, zap all SPTEs that + * pointed at this file. + */ + kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul); + truncate_inode_pages_final(inode->i_mapping); + kvm_gmem_invalidate_end(kvm, gmem, 0, -1ul); + + mutex_unlock(&kvm->slots_lock); + + WARN_ON_ONCE(!(mapping_empty(inode->i_mapping))); + + xa_destroy(&gmem->bindings); + kfree(gmem); + + kvm_put_kvm(kvm); + +basic_cleanup: + clear_inode(inode); +} + +static const struct super_operations kvm_gmem_super_operations = { + /* + * TODO update statfs handler for kvm_gmem. What should the statfs + * handler return? + */ + .statfs = simple_statfs, + .evict_inode = kvm_gmem_evict_inode, +}; + static int kvm_gmem_init_fs_context(struct fs_context *fc) { + struct pseudo_fs_context *ctx; + if (!init_pseudo(fc, GUEST_MEMORY_MAGIC)) return -ENOMEM;
+ ctx = fc->fs_private; + ctx->ops = &kvm_gmem_super_operations; + return 0; }
First stage of hugetlb support: add initialization and cleanup routines
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/uapi/linux/kvm.h | 25 ++++++++++++
 virt/kvm/guest_mem.c     | 88 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 108 insertions(+), 5 deletions(-)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 0fa665e8862a..1df0c802c29f 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -13,6 +13,7 @@ #include <linux/compiler.h> #include <linux/ioctl.h> #include <asm/kvm.h> +#include <asm-generic/hugetlb_encode.h>
#define KVM_API_VERSION 12
@@ -2280,6 +2281,30 @@ struct kvm_memory_attributes { #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
#define KVM_GUEST_MEMFD_HUGE_PMD (1ULL << 0) +#define KVM_GUEST_MEMFD_HUGETLB (1ULL << 1) + +/* + * Huge page size encoding when KVM_GUEST_MEMFD_HUGETLB is specified, and a huge + * page size other than the default is desired. See hugetlb_encode.h. All + * known huge page size encodings are provided here. It is the responsibility + * of the application to know which sizes are supported on the running system. + * See mmap(2) man page for details. + */ +#define KVM_GUEST_MEMFD_HUGE_SHIFT HUGETLB_FLAG_ENCODE_SHIFT +#define KVM_GUEST_MEMFD_HUGE_MASK HUGETLB_FLAG_ENCODE_MASK + +#define KVM_GUEST_MEMFD_HUGE_64KB HUGETLB_FLAG_ENCODE_64KB +#define KVM_GUEST_MEMFD_HUGE_512KB HUGETLB_FLAG_ENCODE_512KB +#define KVM_GUEST_MEMFD_HUGE_1MB HUGETLB_FLAG_ENCODE_1MB +#define KVM_GUEST_MEMFD_HUGE_2MB HUGETLB_FLAG_ENCODE_2MB +#define KVM_GUEST_MEMFD_HUGE_8MB HUGETLB_FLAG_ENCODE_8MB +#define KVM_GUEST_MEMFD_HUGE_16MB HUGETLB_FLAG_ENCODE_16MB +#define KVM_GUEST_MEMFD_HUGE_32MB HUGETLB_FLAG_ENCODE_32MB +#define KVM_GUEST_MEMFD_HUGE_256MB HUGETLB_FLAG_ENCODE_256MB +#define KVM_GUEST_MEMFD_HUGE_512MB HUGETLB_FLAG_ENCODE_512MB +#define KVM_GUEST_MEMFD_HUGE_1GB HUGETLB_FLAG_ENCODE_1GB +#define KVM_GUEST_MEMFD_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB +#define KVM_GUEST_MEMFD_HUGE_16GB HUGETLB_FLAG_ENCODE_16GB
struct kvm_create_guest_memfd { __u64 size; diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c index 13253af40be6..b533143e2878 100644 --- a/virt/kvm/guest_mem.c +++ b/virt/kvm/guest_mem.c @@ -19,6 +19,7 @@ #include <linux/secretmem.h> #include <linux/set_memory.h> #include <linux/sched/signal.h> +#include <linux/hugetlb.h>
#include <uapi/linux/magic.h>
@@ -30,6 +31,11 @@ struct kvm_gmem { struct kvm *kvm; u64 flags; struct xarray bindings; + struct { + struct hstate *h; + struct hugepage_subpool *spool; + struct resv_map *resv_map; + } hugetlb; };
static loff_t kvm_gmem_get_size(struct file *file) @@ -346,6 +352,46 @@ static const struct inode_operations kvm_gmem_iops = { .setattr = kvm_gmem_setattr, };
+static int kvm_gmem_hugetlb_setup(struct inode *inode, struct kvm_gmem *gmem, + loff_t size, u64 flags) +{ + int page_size_log; + int hstate_idx; + long hpages; + struct resv_map *resv_map; + struct hugepage_subpool *spool; + struct hstate *h; + + page_size_log = (flags >> KVM_GUEST_MEMFD_HUGE_SHIFT) & KVM_GUEST_MEMFD_HUGE_MASK; + hstate_idx = get_hstate_idx(page_size_log); + if (hstate_idx < 0) + return -ENOENT; + + h = &hstates[hstate_idx]; + /* Round up to accommodate size requests that don't align with huge pages */ + hpages = round_up(size, huge_page_size(h)) >> huge_page_shift(h); + spool = hugepage_new_subpool(h, hpages, hpages); + if (!spool) + goto out; + + resv_map = resv_map_alloc(); + if (!resv_map) + goto out_subpool; + + inode->i_blkbits = huge_page_shift(h); + + gmem->hugetlb.h = h; + gmem->hugetlb.spool = spool; + gmem->hugetlb.resv_map = resv_map; + + return 0; + +out_subpool: + kfree(spool); +out: + return -ENOMEM; +} + static struct inode *kvm_gmem_create_inode(struct kvm *kvm, loff_t size, u64 flags, struct vfsmount *mnt) { @@ -368,6 +414,12 @@ static struct inode *kvm_gmem_create_inode(struct kvm *kvm, loff_t size, u64 fla if (!gmem) goto err_inode;
+ if (flags & KVM_GUEST_MEMFD_HUGETLB) { + err = kvm_gmem_hugetlb_setup(inode, gmem, size, flags); + if (err) + goto err_gmem; + } + xa_init(&gmem->bindings);
kvm_get_kvm(kvm); @@ -385,6 +437,8 @@ static struct inode *kvm_gmem_create_inode(struct kvm *kvm, loff_t size, u64 fla
return inode;
+err_gmem: + kfree(gmem); err_inode: iput(inode); return ERR_PTR(err); @@ -414,6 +468,8 @@ static struct file *kvm_gmem_create_file(struct kvm *kvm, loff_t size, u64 flags return file; }
+#define KVM_GUEST_MEMFD_ALL_FLAGS (KVM_GUEST_MEMFD_HUGE_PMD | KVM_GUEST_MEMFD_HUGETLB) + int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *gmem) { int fd; @@ -424,8 +480,15 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *gmem) if (size < 0 || !PAGE_ALIGNED(size)) return -EINVAL;
- if (flags & ~KVM_GUEST_MEMFD_HUGE_PMD) - return -EINVAL; + if (!(flags & KVM_GUEST_MEMFD_HUGETLB)) { + if (flags & ~(unsigned int)KVM_GUEST_MEMFD_ALL_FLAGS) + return -EINVAL; + } else { + /* Allow huge page size encoding in flags. */ + if (flags & ~(unsigned int)(KVM_GUEST_MEMFD_ALL_FLAGS | + (KVM_GUEST_MEMFD_HUGE_MASK << KVM_GUEST_MEMFD_HUGE_SHIFT))) + return -EINVAL; + }
if (flags & KVM_GUEST_MEMFD_HUGE_PMD) { #ifdef CONFIG_TRANSPARENT_HUGEPAGE @@ -610,7 +673,17 @@ static void kvm_gmem_evict_inode(struct inode *inode) * pointed at this file. */ kvm_gmem_invalidate_begin(kvm, gmem, 0, -1ul); - truncate_inode_pages_final(inode->i_mapping); + if (gmem->flags & KVM_GUEST_MEMFD_HUGETLB) { + truncate_inode_pages_final_prepare(inode->i_mapping); + remove_mapping_hugepages( + inode->i_mapping, gmem->hugetlb.h, gmem->hugetlb.spool, + gmem->hugetlb.resv_map, inode, 0, LLONG_MAX); + + resv_map_release(&gmem->hugetlb.resv_map->refs); + hugepage_put_subpool(gmem->hugetlb.spool); + } else { + truncate_inode_pages_final(inode->i_mapping); + } kvm_gmem_invalidate_end(kvm, gmem, 0, -1ul);
mutex_unlock(&kvm->slots_lock); @@ -688,10 +761,15 @@ bool kvm_gmem_check_alignment(const struct kvm_userspace_memory_region2 *mem) { size_t page_size;
- if (mem->flags & KVM_GUEST_MEMFD_HUGE_PMD) + if (mem->flags & KVM_GUEST_MEMFD_HUGETLB) { + size_t page_size_log = ((mem->flags >> KVM_GUEST_MEMFD_HUGE_SHIFT) + & KVM_GUEST_MEMFD_HUGE_MASK); + page_size = 1UL << page_size_log; + } else if (mem->flags & KVM_GUEST_MEMFD_HUGE_PMD) { page_size = HPAGE_PMD_SIZE; - else + } else { page_size = PAGE_SIZE; + }
return (IS_ALIGNED(mem->gmem_offset, page_size) && IS_ALIGNED(mem->memory_size, page_size));
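For completeness, a hedged userspace sketch of creating a guest_mem file backed by 2 MB hugetlb pages with the new flags; it assumes struct kvm_create_guest_memfd exposes the size and flags fields used above, and that vm_fd is an existing KVM VM file descriptor:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int create_hugetlb_gmem(int vm_fd, uint64_t size)
{
	struct kvm_create_guest_memfd args = {
		/* size must be page-aligned; it is rounded up to huge pages internally. */
		.size = size,
		.flags = KVM_GUEST_MEMFD_HUGETLB | KVM_GUEST_MEMFD_HUGE_2MB,
	};

	/* Returns a guest_mem fd on success, or a negative errno. */
	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
}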
Introduce kvm_gmem_hugetlb_get_folio(), then update kvm_gmem_allocate() and kvm_gmem_truncate() to use hugetlb functions.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 virt/kvm/guest_mem.c | 215 +++++++++++++++++++++++++++++++++++++------
 1 file changed, 188 insertions(+), 27 deletions(-)
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c index b533143e2878..6271621f6b73 100644 --- a/virt/kvm/guest_mem.c +++ b/virt/kvm/guest_mem.c @@ -43,6 +43,95 @@ static loff_t kvm_gmem_get_size(struct file *file) return i_size_read(file_inode(file)); }
+static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio( + struct file *file, pgoff_t hindex) +{ + int err; + struct folio *folio; + struct kvm_gmem *gmem; + struct hstate *h; + struct resv_map *resv_map; + unsigned long offset; + struct vm_area_struct pseudo_vma; + + gmem = file->private_data; + h = gmem->hugetlb.h; + resv_map = gmem->hugetlb.resv_map; + offset = hindex << huge_page_shift(h); + + vma_init(&pseudo_vma, NULL); + vm_flags_init(&pseudo_vma, VM_HUGETLB | VM_MAYSHARE | VM_SHARED); + /* vma infrastructure is dependent on vm_file being set */ + pseudo_vma.vm_file = file; + + /* TODO setup NUMA policy. Meanwhile, fallback to get_task_policy(). */ + pseudo_vma.vm_policy = NULL; + folio = alloc_hugetlb_folio_from_subpool( + gmem->hugetlb.spool, h, resv_map, &pseudo_vma, offset, 0); + /* Remember to take and drop refcount from vm_policy */ + if (IS_ERR(folio)) + return folio; + + /* + * FIXME: Skip clearing pages when trusted firmware will do it when + * assigning memory to the guest. + */ + clear_huge_page(&folio->page, offset, pages_per_huge_page(h)); + __folio_mark_uptodate(folio); + err = hugetlb_filemap_add_folio(file->f_mapping, h, folio, hindex); + if (unlikely(err)) { + restore_reserve_on_error(resv_map, hindex, true, folio); + folio_put(folio); + folio = ERR_PTR(err); + } + + return folio; +} + +/** + * Gets a hugetlb folio, from @file, at @index (in terms of PAGE_SIZE) within + * the file. + * + * The returned folio will be in @file's page cache, and locked. + */ +static struct folio *kvm_gmem_hugetlb_get_folio(struct file *file, pgoff_t index) +{ + struct folio *folio; + u32 hash; + /* hindex is in terms of huge_page_size(h) and not PAGE_SIZE */ + pgoff_t hindex; + struct kvm_gmem *gmem; + struct hstate *h; + struct address_space *mapping; + + gmem = file->private_data; + h = gmem->hugetlb.h; + hindex = index >> huge_page_order(h); + + mapping = file->f_mapping; + hash = hugetlb_fault_mutex_hash(mapping, hindex); + mutex_lock(&hugetlb_fault_mutex_table[hash]); + + rcu_read_lock(); + folio = filemap_lock_folio(mapping, hindex); + rcu_read_unlock(); + if (folio) + goto folio_valid; + + folio = kvm_gmem_hugetlb_alloc_and_cache_folio(file, hindex); + /* + * TODO Perhaps the interface of kvm_gmem_get_folio should change to better + * report errors + */ + if (IS_ERR(folio)) + folio = NULL; + +folio_valid: + mutex_unlock(&hugetlb_fault_mutex_table[hash]); + + return folio; +} + static struct folio *kvm_gmem_get_huge_folio(struct file *file, pgoff_t index) { #ifdef CONFIG_TRANSPARENT_HUGEPAGE @@ -74,36 +163,56 @@ static struct folio *kvm_gmem_get_huge_folio(struct file *file, pgoff_t index) #endif }
+/** + * Gets a folio, from @file, at @index (in terms of PAGE_SIZE) within the file. + * + * The returned folio will be in @file's page cache and locked. + */ static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index) { struct folio *folio; + struct kvm_gmem *gmem = file->private_data;
- folio = kvm_gmem_get_huge_folio(file, index); - if (!folio) { - folio = filemap_grab_folio(file->f_mapping, index); + if (gmem->flags & KVM_GUEST_MEMFD_HUGETLB) { + folio = kvm_gmem_hugetlb_get_folio(file, index); + + /* hugetlb gmem does not fall back to non-hugetlb pages */ if (!folio) return NULL; - }
- /* - * TODO: Confirm this won't zero in-use pages, and skip clearing pages - * when trusted firmware will do it when assigning memory to the guest. - */ - if (!folio_test_uptodate(folio)) { - unsigned long nr_pages = folio_nr_pages(folio); - unsigned long i; + /* + * Don't need to clear pages because + * kvm_gmem_hugetlb_alloc_and_cache_folio() already clears pages + * when allocating + */ + } else { + folio = kvm_gmem_get_huge_folio(file, index); + if (!folio) { + folio = filemap_grab_folio(file->f_mapping, index); + if (!folio) + return NULL; + }
- for (i = 0; i < nr_pages; i++) - clear_highpage(folio_page(folio, i)); - } + /* + * TODO: Confirm this won't zero in-use pages, and skip clearing pages + * when trusted firmware will do it when assigning memory to the guest. + */ + if (!folio_test_uptodate(folio)) { + unsigned long nr_pages = folio_nr_pages(folio); + unsigned long i;
- /* - * filemap_grab_folio() uses FGP_ACCESSED, which already called - * folio_mark_accessed(), so we clear it. - * TODO: Should we instead be clearing this when truncating? - * TODO: maybe don't use FGP_ACCESSED at all and call __filemap_get_folio directly. - */ - folio_clear_referenced(folio); + for (i = 0; i < nr_pages; i++) + clear_highpage(folio_page(folio, i)); + } + + /* + * filemap_grab_folio() uses FGP_ACCESSED, which already called + * folio_mark_accessed(), so we clear it. + * TODO: Should we instead be clearing this when truncating? + * TODO: maybe don't use FGP_ACCESSED at all and call __filemap_get_folio directly. + */ + folio_clear_referenced(folio); + }
/* * Indicate that this folio matches the backing store (in this case, has @@ -156,6 +265,44 @@ static void kvm_gmem_invalidate_end(struct kvm *kvm, struct kvm_gmem *gmem, KVM_MMU_UNLOCK(kvm); }
+static void kvm_gmem_hugetlb_truncate_range(struct inode *inode, + loff_t offset, loff_t len) +{ + loff_t hsize; + loff_t full_hpage_start; + loff_t full_hpage_end; + struct kvm_gmem *gmem; + struct hstate *h; + struct address_space *mapping; + + mapping = inode->i_mapping; + gmem = mapping->private_data; + h = gmem->hugetlb.h; + hsize = huge_page_size(h); + full_hpage_start = round_up(offset, hsize); + full_hpage_end = round_down(offset + len, hsize); + + /* If range starts before first full page, zero partial page. */ + if (offset < full_hpage_start) { + hugetlb_zero_partial_page( + h, mapping, offset, min(offset + len, full_hpage_start)); + } + + /* Remove full pages from the file. */ + if (full_hpage_end > full_hpage_start) { + remove_mapping_hugepages(mapping, h, gmem->hugetlb.spool, + gmem->hugetlb.resv_map, inode, + full_hpage_start, full_hpage_end); + } + + + /* If range extends beyond last full page, zero partial page. */ + if ((offset + len) > full_hpage_end && (offset + len) > full_hpage_start) { + hugetlb_zero_partial_page( + h, mapping, full_hpage_end, offset + len); + } +} + static long kvm_gmem_punch_hole(struct file *file, loff_t offset, loff_t len) { struct kvm_gmem *gmem = file->private_data; @@ -171,7 +318,10 @@ static long kvm_gmem_punch_hole(struct file *file, loff_t offset, loff_t len)
kvm_gmem_invalidate_begin(kvm, gmem, start, end);
- truncate_inode_pages_range(file->f_mapping, offset, offset + len - 1); + if (gmem->flags & KVM_GUEST_MEMFD_HUGETLB) + kvm_gmem_hugetlb_truncate_range(file_inode(file), offset, len); + else + truncate_inode_pages_range(file->f_mapping, offset, offset + len - 1);
kvm_gmem_invalidate_end(kvm, gmem, start, end);
@@ -183,6 +333,7 @@ static long kvm_gmem_punch_hole(struct file *file, loff_t offset, loff_t len) static long kvm_gmem_allocate(struct file *file, loff_t offset, loff_t len) { struct address_space *mapping = file->f_mapping; + struct kvm_gmem *gmem = file->private_data; pgoff_t start, index, end; int r;
@@ -192,9 +343,14 @@ static long kvm_gmem_allocate(struct file *file, loff_t offset, loff_t len)
filemap_invalidate_lock_shared(mapping);
- start = offset >> PAGE_SHIFT; - /* Align so that at least 1 page is allocated */ - end = ALIGN(offset + len, PAGE_SIZE) >> PAGE_SHIFT; + if (gmem->flags & KVM_GUEST_MEMFD_HUGETLB) { + start = offset >> huge_page_shift(gmem->hugetlb.h); + end = ALIGN(offset + len, huge_page_size(gmem->hugetlb.h)) >> PAGE_SHIFT; + } else { + start = offset >> PAGE_SHIFT; + /* Align so that at least 1 page is allocated */ + end = ALIGN(offset + len, PAGE_SIZE) >> PAGE_SHIFT; + }
r = 0; for (index = start; index < end; ) { @@ -211,7 +367,7 @@ static long kvm_gmem_allocate(struct file *file, loff_t offset, loff_t len) break; }
- index = folio_next_index(folio); + index += folio_nr_pages(folio);
folio_unlock(folio); folio_put(folio); @@ -625,7 +781,12 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, return -ENOMEM; }
- page = folio_file_page(folio, index); + /* + * folio_file_page() always returns the head page for hugetlb + * folios. Reimplement to get the page within this folio, even for + * hugetlb pages. + */ + page = folio_page(folio, index & (folio_nr_pages(folio) - 1));
*pfn = page_to_pfn(page); *order = thp_order(compound_head(page));
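As a sanity check of the hole-punch boundary math in kvm_gmem_hugetlb_truncate_range() above, a small standalone example with illustrative numbers only (2 MB huge pages, a punch starting at 1 MB and spanning 4 MB):

#include <stdio.h>

#define ROUND_UP(x, a)   ((((x) + (a) - 1) / (a)) * (a))
#define ROUND_DOWN(x, a) (((x) / (a)) * (a))

int main(void)
{
	unsigned long long hsize  = 2ULL << 20;	/* 2 MB huge pages */
	unsigned long long offset = 1ULL << 20;	/* punch starts at 1 MB */
	unsigned long long len    = 4ULL << 20;	/* punch 4 MB */

	unsigned long long full_start = ROUND_UP(offset, hsize);         /* 2 MB */
	unsigned long long full_end   = ROUND_DOWN(offset + len, hsize); /* 4 MB */

	/* Head [1 MB, 2 MB) is zeroed, [2 MB, 4 MB) is removed, tail [4 MB, 5 MB) is zeroed. */
	printf("zero head [%llu, %llu), remove [%llu, %llu), zero tail [%llu, %llu)\n",
	       offset, full_start, full_start, full_end, full_end, offset + len);
	return 0;
}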
Add tests for 2MB and 1GB page sizes.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/guest_memfd_test.c | 33 ++++++++++++++-----
 1 file changed, 24 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c index 059b33cdecec..6e24631119c6 100644 --- a/tools/testing/selftests/kvm/guest_memfd_test.c +++ b/tools/testing/selftests/kvm/guest_memfd_test.c @@ -90,20 +90,14 @@ static void test_fallocate(int fd, size_t page_size, size_t total_size) TEST_ASSERT(!ret, "fallocate to restore punched hole should succeed"); }
- -int main(int argc, char *argv[]) +void test_guest_mem(struct kvm_vm *vm, uint32_t flags, size_t page_size) { - size_t page_size; - size_t total_size; int fd; - struct kvm_vm *vm; + size_t total_size;
- page_size = getpagesize(); total_size = page_size * 4;
- vm = vm_create_barebones(); - - fd = vm_create_guest_memfd(vm, total_size, 0); + fd = vm_create_guest_memfd(vm, total_size, flags);
test_file_read_write(fd); test_mmap(fd, page_size); @@ -112,3 +106,24 @@ int main(int argc, char *argv[])
close(fd); } + +int main(int argc, char *argv[]) +{ + struct kvm_vm *vm = vm_create_barebones(); + + printf("Test guest mem 4K\n"); + test_guest_mem(vm, 0, getpagesize()); + printf(" PASSED\n"); + + printf("Test guest mem hugetlb 2M\n"); + test_guest_mem( + vm, KVM_GUEST_MEMFD_HUGETLB | KVM_GUEST_MEMFD_HUGE_2MB, 2UL << 20); + printf(" PASSED\n"); + + printf("Test guest mem hugetlb 1G\n"); + test_guest_mem( + vm, KVM_GUEST_MEMFD_HUGETLB | KVM_GUEST_MEMFD_HUGE_1GB, 1UL << 30); + printf(" PASSED\n"); + + return 0; +}
Add support for various types of backing sources for private memory (in the sense of confidential computing), similar to the backing sources available for shared memory.
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/include/test_util.h | 14 ++++
 tools/testing/selftests/kvm/lib/test_util.c   | 74 +++++++++++++++++++
 2 files changed, 88 insertions(+)
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h index a6e9f215ce70..899ea15ca8a9 100644 --- a/tools/testing/selftests/kvm/include/test_util.h +++ b/tools/testing/selftests/kvm/include/test_util.h @@ -122,6 +122,16 @@ struct vm_mem_backing_src_alias { uint32_t flag; };
+enum vm_pmem_backing_src_type { + VM_PMEM_SRC_GMEM, + VM_PMEM_SRC_HUGETLB, /* Use kernel default page size for hugetlb pages */ + VM_PMEM_SRC_HUGETLB_2MB, + VM_PMEM_SRC_HUGETLB_1GB, + NUM_PMEM_SRC_TYPES, +}; + +#define DEFAULT_VM_PMEM_SRC VM_PMEM_SRC_GMEM + #define MIN_RUN_DELAY_NS 200000UL
bool thp_configured(void); @@ -132,6 +142,10 @@ size_t get_backing_src_pagesz(uint32_t i); bool is_backing_src_hugetlb(uint32_t i); void backing_src_help(const char *flag); enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name); +void pmem_backing_src_help(const char *flag); +enum vm_pmem_backing_src_type parse_pmem_backing_src_type(const char *type_name); +const struct vm_mem_backing_src_alias *vm_pmem_backing_src_alias(uint32_t i); +size_t get_pmem_backing_src_pagesz(uint32_t i); long get_run_delay(void);
/* diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c index b772193f6c18..62efb7b8ba51 100644 --- a/tools/testing/selftests/kvm/lib/test_util.c +++ b/tools/testing/selftests/kvm/lib/test_util.c @@ -8,6 +8,7 @@ #include <assert.h> #include <ctype.h> #include <limits.h> +#include <linux/kvm.h> #include <stdlib.h> #include <time.h> #include <sys/stat.h> @@ -287,6 +288,34 @@ const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i) return &aliases[i]; }
+const struct vm_mem_backing_src_alias *vm_pmem_backing_src_alias(uint32_t i) +{ + static const struct vm_mem_backing_src_alias aliases[] = { + [VM_PMEM_SRC_GMEM] = { + .name = "pmem_gmem", + .flag = 0, + }, + [VM_PMEM_SRC_HUGETLB] = { + .name = "pmem_hugetlb", + .flag = KVM_GUEST_MEMFD_HUGETLB, + }, + [VM_PMEM_SRC_HUGETLB_2MB] = { + .name = "pmem_hugetlb_2mb", + .flag = KVM_GUEST_MEMFD_HUGETLB | KVM_GUEST_MEMFD_HUGE_2MB, + }, + [VM_PMEM_SRC_HUGETLB_1GB] = { + .name = "pmem_hugetlb_1gb", + .flag = KVM_GUEST_MEMFD_HUGETLB | KVM_GUEST_MEMFD_HUGE_1GB, + }, + }; + _Static_assert(ARRAY_SIZE(aliases) == NUM_PMEM_SRC_TYPES, + "Missing new backing private mem src types?"); + + TEST_ASSERT(i < NUM_PMEM_SRC_TYPES, "Private mem backing src type ID %d too big", i); + + return &aliases[i]; +} + #define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
size_t get_backing_src_pagesz(uint32_t i) @@ -307,6 +336,20 @@ size_t get_backing_src_pagesz(uint32_t i) } }
+size_t get_pmem_backing_src_pagesz(uint32_t i) +{ + uint32_t flag = vm_pmem_backing_src_alias(i)->flag; + + switch (i) { + case VM_PMEM_SRC_GMEM: + return getpagesize(); + case VM_PMEM_SRC_HUGETLB: + return get_def_hugetlb_pagesz(); + default: + return MAP_HUGE_PAGE_SIZE(flag); + } +} + bool is_backing_src_hugetlb(uint32_t i) { return !!(vm_mem_backing_src_alias(i)->flag & MAP_HUGETLB); @@ -343,6 +386,37 @@ enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name) return -1; }
+static void print_available_pmem_backing_src_types(const char *prefix) +{ + int i; + + printf("%sAvailable private mem backing src types:\n", prefix); + + for (i = 0; i < NUM_PMEM_SRC_TYPES; i++) + printf("%s %s\n", prefix, vm_pmem_backing_src_alias(i)->name); +} + +void pmem_backing_src_help(const char *flag) +{ + printf(" %s: specify the type of memory that should be used to\n" + " back guest private memory. (default: %s)\n", + flag, vm_pmem_backing_src_alias(DEFAULT_VM_MEM_SRC)->name); + print_available_pmem_backing_src_types(" "); +} + +enum vm_pmem_backing_src_type parse_pmem_backing_src_type(const char *type_name) +{ + int i; + + for (i = 0; i < NUM_SRC_TYPES; i++) + if (!strcmp(type_name, vm_pmem_backing_src_alias(i)->name)) + return i; + + print_available_pmem_backing_src_types(""); + TEST_FAIL("Unknown private mem backing src type: %s", type_name); + return -1; +} + long get_run_delay(void) { char path[64];
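A short usage sketch of the new helpers, mirroring how the next patch wires them into a test; create_pmem_memfd() is a hypothetical wrapper, while the helpers it calls come from this patch and the existing selftest library:

#include "test_util.h"
#include "kvm_util.h"

/* Create a guest memfd sized and flagged for the chosen private-mem backing source. */
static int create_pmem_memfd(struct kvm_vm *vm, size_t per_cpu_size,
			     enum vm_pmem_backing_src_type pmem_src)
{
	/* Align the file size to the backing source's page size. */
	size_t size = align_up(per_cpu_size, get_pmem_backing_src_pagesz(pmem_src));

	return vm_create_guest_memfd(vm, size,
				     vm_pmem_backing_src_alias(pmem_src)->flag);
}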
Update private_mem_conversions_test for various private memory backing source types
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../kvm/x86_64/private_mem_conversions_test.c | 38 ++++++++++++++-----
 1 file changed, 28 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c index 6a353cf64f52..27a7e5099b7b 100644 --- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c +++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c @@ -240,14 +240,15 @@ static void *__test_mem_conversions(void *__vcpu) } }
-static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t nr_vcpus, - uint32_t nr_memslots) +static void test_mem_conversions(enum vm_mem_backing_src_type src_type, + enum vm_pmem_backing_src_type pmem_src_type, + uint32_t nr_vcpus, uint32_t nr_memslots) { - const size_t memfd_size = PER_CPU_DATA_SIZE * nr_vcpus; struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; pthread_t threads[KVM_MAX_VCPUS]; struct kvm_vm *vm; int memfd, i, r; + size_t pmem_aligned_size, memfd_size; size_t test_unit_size;
const struct vm_shape shape = { @@ -270,21 +271,32 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t * Allocate enough memory so that each vCPU's chunk of memory can be * naturally aligned with respect to the size of the backing store. */ - test_unit_size = align_up(PER_CPU_DATA_SIZE, get_backing_src_pagesz(src_type)); + test_unit_size = align_up(PER_CPU_DATA_SIZE, + max(get_backing_src_pagesz(src_type), + get_pmem_backing_src_pagesz(pmem_src_type))); }
- memfd = vm_create_guest_memfd(vm, memfd_size, 0); + pmem_aligned_size = PER_CPU_DATA_SIZE; + if (nr_memslots > 1) { + pmem_aligned_size = align_up(PER_CPU_DATA_SIZE, + get_pmem_backing_src_pagesz(pmem_src_type)); + } + + memfd_size = pmem_aligned_size * nr_vcpus; + memfd = vm_create_guest_memfd(vm, memfd_size, + vm_pmem_backing_src_alias(pmem_src_type)->flag); for (i = 0; i < nr_memslots; i++) { uint64_t gpa = BASE_DATA_GPA + i * test_unit_size; - uint64_t npages = PER_CPU_DATA_SIZE / vm->page_size; + uint64_t npages = pmem_aligned_size / vm->page_size;
/* Make sure the memslot is large enough for all the test units */ if (nr_memslots == 1) npages *= nr_vcpus;
+ /* Offsets must be aligned to private mem's page size */ vm_mem_add(vm, src_type, gpa, BASE_DATA_SLOT + i, npages, - KVM_MEM_PRIVATE, memfd, PER_CPU_DATA_SIZE * i); + KVM_MEM_PRIVATE, memfd, pmem_aligned_size * i); }
for (i = 0; i < nr_vcpus; i++) { @@ -324,10 +336,12 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t static void usage(const char *cmd) { puts(""); - printf("usage: %s [-h] [-m] [-s mem_type] [-n nr_vcpus]\n", cmd); + printf("usage: %s [-h] [-m] [-s mem_type] [-p pmem_type] [-n nr_vcpus]\n", cmd); puts(""); backing_src_help("-s"); puts(""); + pmem_backing_src_help("-p"); + puts(""); puts(" -n: specify the number of vcpus (default: 1)"); puts(""); puts(" -m: use multiple memslots (default: 1)"); @@ -337,6 +351,7 @@ static void usage(const char *cmd) int main(int argc, char *argv[]) { enum vm_mem_backing_src_type src_type = DEFAULT_VM_MEM_SRC; + enum vm_pmem_backing_src_type pmem_src_type = DEFAULT_VM_PMEM_SRC; bool use_multiple_memslots = false; uint32_t nr_vcpus = 1; uint32_t nr_memslots; @@ -345,11 +360,14 @@ int main(int argc, char *argv[]) TEST_REQUIRE(kvm_has_cap(KVM_CAP_EXIT_HYPERCALL)); TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_PROTECTED_VM));
- while ((opt = getopt(argc, argv, "hms:n:")) != -1) { + while ((opt = getopt(argc, argv, "hms:p:n:")) != -1) { switch (opt) { case 's': src_type = parse_backing_src_type(optarg); break; + case 'p': + pmem_src_type = parse_pmem_backing_src_type(optarg); + break; case 'n': nr_vcpus = atoi_positive("nr_vcpus", optarg); break; @@ -365,7 +383,7 @@ int main(int argc, char *argv[])
nr_memslots = use_multiple_memslots ? nr_vcpus : 1;
- test_mem_conversions(src_type, nr_vcpus, nr_memslots); + test_mem_conversions(src_type, pmem_src_type, nr_vcpus, nr_memslots);
return 0; }
On Tue, Jun 06, 2023 at 07:03:45PM +0000, Ackerley Tng ackerleytng@google.com wrote:
Hi. If kvm gmem is compiled as a kernel module, many symbols fail to link. You need to add EXPORT_SYMBOL{,_GPL} for the exported symbols. Or should it be compiled into the kernel instead of as a module?
Thanks,
[1] https://lore.kernel.org/lkml/ZEM5Zq8oo+xnApW9@google.com/
Ackerley Tng (19):
  mm: hugetlb: Expose get_hstate_idx()
  mm: hugetlb: Move and expose hugetlbfs_zero_partial_page
  mm: hugetlb: Expose remove_inode_hugepages
  mm: hugetlb: Decouple hstate, subpool from inode
  mm: hugetlb: Allow alloc_hugetlb_folio() to be parametrized by subpool and hstate
  mm: hugetlb: Provide hugetlb_filemap_add_folio()
  mm: hugetlb: Refactor vma_*_reservation functions
  mm: hugetlb: Refactor restore_reserve_on_error
  mm: hugetlb: Use restore_reserve_on_error directly in filesystems
  mm: hugetlb: Parametrize alloc_hugetlb_folio_from_subpool() by resv_map
  mm: hugetlb: Parametrize hugetlb functions by resv_map
  mm: truncate: Expose preparation steps for truncate_inode_pages_final
  KVM: guest_mem: Refactor kvm_gmem fd creation to be in layers
  KVM: guest_mem: Refactor cleanup to separate inode and file cleanup
  KVM: guest_mem: hugetlb: initialization and cleanup
  KVM: guest_mem: hugetlb: allocate and truncate from hugetlb
  KVM: selftests: Add basic selftests for hugetlbfs-backed guest_mem
  KVM: selftests: Support various types of backing sources for private memory
  KVM: selftests: Update test for various private memory backing source types
 fs/hugetlbfs/inode.c                          | 102 ++--
 include/linux/hugetlb.h                       |  86 ++-
 include/linux/mm.h                            |   1 +
 include/uapi/linux/kvm.h                      |  25 +
 mm/hugetlb.c                                  | 324 +++++++-----
 mm/truncate.c                                 |  24 +-
 .../testing/selftests/kvm/guest_memfd_test.c  |  33 +-
 .../testing/selftests/kvm/include/test_util.h |  14 +
 tools/testing/selftests/kvm/lib/test_util.c   |  74 +++
 .../kvm/x86_64/private_mem_conversions_test.c |  38 +-
 virt/kvm/guest_mem.c                          | 488 ++++++++++++++----
 11 files changed, 882 insertions(+), 327 deletions(-)
-- 2.41.0.rc0.172.g3f132b7071-goog
On 06/06/23 19:03, Ackerley Tng wrote:
Hello Ackerley,
I am not sure if you are aware of, or have been following, the hugetlb HGM discussion in this thread: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey/
There we are trying to decide if HGM should be added to hugetlb, or if perhaps a new filesystem/driver/allocator should be created. The concern is added complexity to hugetlb as well as core mm special casing. Note that HGM is addressing issues faced by existing hugetlb users.
Your proposal here suggests modifying hugetlb so that it can be used in a new way (use case) by KVM's guest_mem. As such it really seems like something that should be done in a separate filesystem/driver/allocator. You will likely not get much support for modifying hugetlb.
On Fri, Jun 16, 2023 at 11:28 AM Mike Kravetz mike.kravetz@oracle.com wrote:
IIUC mm/hugetlb.c implements the memory manager for hugetlb pages and fs/hugetlbfs/inode.c implements the filesystem logic for hugetlbfs.
This series implements a new filesystem with limited operations, parallel to the hugetlbfs filesystem, but tries to reuse the hugetlb memory manager. The effort here is to not add any new features to the hugetlb memory manager, but to clean it up so that it can be used by a new filesystem.
guest_mem warrants a new filesystem since it supports only limited operations on the underlying files, but there is no additional restriction on the underlying memory management. Though one could argue that memory management for guest_mem files could be a very simple one that goes in line with the limited operations on the files.
If this series were to go the separate way of implementing a new memory manager, one immediate requirement that might spring up would be converting memory from hugetlb-managed memory to memory managed by this newly introduced manager, and vice versa, at runtime, since there could be a mix of VMs on the same platform using guest_mem and hugetlb. Maybe this can be satisfied by having a separate global reservation pool that is consumed by both, which would need more changes, in my understanding.
Using guest_mem for all VMs by default would be future work, contingent on all existing use cases/requirements being satisfied.
Regards, Vishal