Unmapping virtual machine guest memory from the host kernel's direct map is an effective mitigation against Spectre-style transient execution issues: if the kernel page tables do not contain entries pointing to guest memory, then any attempted speculative read through the direct map will necessarily be blocked by the MMU before any observable microarchitectural side effects happen. This means that Spectre gadgets and similar cannot be used to target virtual machine memory. Roughly 60% of speculative execution issues fall into this category [1, Table 1].
This patch series extends guest_memfd with the ability to remove its memory from the host kernel's direct map, to be able to attain the above protection for KVM guests running inside guest_memfd.
=== Changes since v2 ===
- Handle direct map removal for physically contiguous pages in arch code (Mike R.)
- Track the direct map state in guest_memfd itself instead of at the folio level, to prepare for huge pages support (Sean C.)
- Allow configuring direct map state of not-yet faulted in memory (Vishal A.)
- Pay attention to alignment in ftrace structs (Steven R.)
Most significantly, I've reduced the patch series to focus only on direct map removal for guest_memfd for now, leaving the whole "how to do non-CoCo VMs in guest_memfd" for later. If this separation is acceptable, then I think I can drop the RFC tag in the next revision (I've mainly kept it here because I'm not entirely sure what to do with patches 3 and 4).
=== Implementation ===
This patch series introduces a new flag to the KVM_CREATE_GUEST_MEMFD ioctl that causes guest_memfd to remove its pages from the host kernel's direct map immediately after population/preparation. It also adds infrastructure for tracking the direct map state of all gmem folios inside the guest_memfd inode. Storing this information in the inode has the advantage that the code is ready for future hugepages extensions, where only removing/reinserting direct map entries for sub-ranges of a huge folio is a valid usecase, and it allows pre-configuring the direct map state of not-yet-faulted-in parts of memory (for example, when the VMM is receiving an RX virtio buffer from the guest).
=== Summary ===
Patch 1 (from Mike Rapoport) adds arch APIs for manipulating the direct map for ranges of physically contiguous pages, which are used by guest_memfd in follow up patches. Patch 2 adds the KVM_GMEM_NO_DIRECT_MAP flag and the logic for configuring direct map state of freshly prepared folios. Patches 3 and 4 mainly serve an illustrative purpose, to show how the framework from patch 2 can be extended with routines for runtime direct map manipulation. Patches 5 and 6 deal with documentation and self-tests respectively.
[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[RFC v1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFC v2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
Mike Rapoport (Microsoft) (1):
  arch: introduce set_direct_map_valid_noflush()

Patrick Roy (5):
  kvm: gmem: add flag to remove memory from kernel direct map
  kvm: gmem: implement direct map manipulation routines
  kvm: gmem: add trace point for direct map state changes
  kvm: document KVM_GMEM_NO_DIRECT_MAP flag
  kvm: selftests: run gmem tests with KVM_GMEM_NO_DIRECT_MAP set
 Documentation/virt/kvm/api.rst                |  14 +
 arch/arm64/include/asm/set_memory.h           |   1 +
 arch/arm64/mm/pageattr.c                      |  10 +
 arch/loongarch/include/asm/set_memory.h       |   1 +
 arch/loongarch/mm/pageattr.c                  |  21 ++
 arch/riscv/include/asm/set_memory.h           |   1 +
 arch/riscv/mm/pageattr.c                      |  15 +
 arch/s390/include/asm/set_memory.h            |   1 +
 arch/s390/mm/pageattr.c                       |  11 +
 arch/x86/include/asm/set_memory.h             |   1 +
 arch/x86/mm/pat/set_memory.c                  |   8 +
 include/linux/set_memory.h                    |   6 +
 include/trace/events/kvm.h                    |  22 ++
 include/uapi/linux/kvm.h                      |   2 +
 .../testing/selftests/kvm/guest_memfd_test.c  |   2 +-
 .../kvm/x86_64/private_mem_conversions_test.c |   7 +-
 virt/kvm/guest_memfd.c                        | 280 +++++++++++++++++-
 17 files changed, 384 insertions(+), 19 deletions(-)
base-commit: 5cb1659f412041e4780f2e8ee49b2e03728a2ba6
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Add an API that will allow updates of the direct/linear map for a set of physically contiguous pages.
It will be used in the following patches.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 arch/arm64/include/asm/set_memory.h     |  1 +
 arch/arm64/mm/pageattr.c                | 10 ++++++++++
 arch/loongarch/include/asm/set_memory.h |  1 +
 arch/loongarch/mm/pageattr.c            | 21 +++++++++++++++++++++
 arch/riscv/include/asm/set_memory.h     |  1 +
 arch/riscv/mm/pageattr.c                | 15 +++++++++++++++
 arch/s390/include/asm/set_memory.h      |  1 +
 arch/s390/mm/pageattr.c                 | 11 +++++++++++
 arch/x86/include/asm/set_memory.h       |  1 +
 arch/x86/mm/pat/set_memory.c            |  8 ++++++++
 include/linux/set_memory.h              |  6 ++++++
 11 files changed, 76 insertions(+)
diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h
index 917761feeffdd..98088c043606a 100644
--- a/arch/arm64/include/asm/set_memory.h
+++ b/arch/arm64/include/asm/set_memory.h
@@ -13,6 +13,7 @@ int set_memory_valid(unsigned long addr, int numpages, int enable);

 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
 bool kernel_page_present(struct page *page);

 #endif /* _ASM_ARM64_SET_MEMORY_H */
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 0e270a1c51e64..01225900293ac 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -192,6 +192,16 @@ int set_direct_map_default_noflush(struct page *page)
 				   PAGE_SIZE, change_page_range, &data);
 }

+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+{
+	unsigned long addr = (unsigned long)page_address(page);
+
+	if (!can_set_direct_map())
+		return 0;
+
+	return set_memory_valid(addr, nr, valid);
+}
+
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
diff --git a/arch/loongarch/include/asm/set_memory.h b/arch/loongarch/include/asm/set_memory.h
index d70505b6676cb..55dfaefd02c8a 100644
--- a/arch/loongarch/include/asm/set_memory.h
+++ b/arch/loongarch/include/asm/set_memory.h
@@ -17,5 +17,6 @@ int set_memory_rw(unsigned long addr, int numpages);
 bool kernel_page_present(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
 int set_direct_map_invalid_noflush(struct page *page);
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);

 #endif /* _ASM_LOONGARCH_SET_MEMORY_H */
diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
index ffd8d76021d47..f14b40c968b48 100644
--- a/arch/loongarch/mm/pageattr.c
+++ b/arch/loongarch/mm/pageattr.c
@@ -216,3 +216,24 @@ int set_direct_map_invalid_noflush(struct page *page)
 	return __set_memory(addr, 1, __pgprot(0), __pgprot(_PAGE_PRESENT | _PAGE_VALID));
 }
+
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+{
+	unsigned long addr = (unsigned long)page_address(page);
+	pgprot_t set, clear;
+
+	if (addr < vm_map_base)
+		return 0;
+
+	if (valid) {
+		set = PAGE_KERNEL;
+		clear = __pgprot(0);
+	} else {
+		set = __pgprot(0);
+		clear = __pgprot(_PAGE_PRESENT | _PAGE_VALID);
+	}
+
+	return __set_memory(addr, nr, set, clear);
+}
diff --git a/arch/riscv/include/asm/set_memory.h b/arch/riscv/include/asm/set_memory.h
index ab92fc84e1fc9..ea263d3683ef6 100644
--- a/arch/riscv/include/asm/set_memory.h
+++ b/arch/riscv/include/asm/set_memory.h
@@ -42,6 +42,7 @@ static inline int set_kernel_memory(char *startp, char *endp,
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
 bool kernel_page_present(struct page *page);

 #endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index 271d01a5ba4da..d815448758a19 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -386,6 +386,21 @@ int set_direct_map_default_noflush(struct page *page)
 				  PAGE_KERNEL, __pgprot(_PAGE_EXEC));
 }

+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+{
+	pgprot_t set, clear;
+
+	if (valid) {
+		set = PAGE_KERNEL;
+		clear = __pgprot(_PAGE_EXEC);
+	} else {
+		set = __pgprot(0);
+		clear = __pgprot(_PAGE_PRESENT);
+	}
+
+	return __set_memory((unsigned long)page_address(page), nr, set, clear);
+}
+
 #ifdef CONFIG_DEBUG_PAGEALLOC
 static int debug_pagealloc_set_page(pte_t *pte, unsigned long addr, void *data)
 {
diff --git a/arch/s390/include/asm/set_memory.h b/arch/s390/include/asm/set_memory.h
index 06fbabe2f66c9..240bcfbdcdcec 100644
--- a/arch/s390/include/asm/set_memory.h
+++ b/arch/s390/include/asm/set_memory.h
@@ -62,5 +62,6 @@ __SET_MEMORY_FUNC(set_memory_4k, SET_MEMORY_4K)

 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);

 #endif
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 5f805ad42d4c3..4c7ee74aa130d 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -406,6 +406,17 @@ int set_direct_map_default_noflush(struct page *page)
 	return __set_memory((unsigned long)page_to_virt(page), 1, SET_MEMORY_DEF);
 }

+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+{
+	unsigned long flags;
+
+	if (valid)
+		flags = SET_MEMORY_DEF;
+	else
+		flags = SET_MEMORY_INV;
+
+	return __set_memory((unsigned long)page_to_virt(page), nr, flags);
+}
+
 #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE)

 static void ipte_range(pte_t *pte, unsigned long address, int nr)
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 4b2abce2e3e7d..cc62ef70ccc0a 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -89,6 +89,7 @@ int set_pages_rw(struct page *page, int numpages);

 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
 bool kernel_page_present(struct page *page);

 extern int kernel_set_to_readonly;
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 44f7b2ea6a073..069e421c22474 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2444,6 +2444,14 @@ int set_direct_map_default_noflush(struct page *page)
 	return __set_pages_p(page, 1);
 }

+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
+{
+	if (valid)
+		return __set_pages_p(page, nr);
+
+	return __set_pages_np(page, nr);
+}
+
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index e7aec20fb44f1..3030d9245f5ac 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -34,6 +34,12 @@ static inline int set_direct_map_default_noflush(struct page *page)
 	return 0;
 }

+static inline int set_direct_map_valid_noflush(struct page *page,
+					       unsigned nr, bool valid)
+{
+	return 0;
+}
+
 static inline bool kernel_page_present(struct page *page)
 {
 	return true;
base-commit: 5cb1659f412041e4780f2e8ee49b2e03728a2ba6
On 30.10.24 14:49, Patrick Roy wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Add an API that will allow updates of the direct/linear map for a set
> of physically contiguous pages.
>
> It will be used in the following patches.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
[...]
> diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
> index e7aec20fb44f1..3030d9245f5ac 100644
> --- a/include/linux/set_memory.h
> +++ b/include/linux/set_memory.h
> @@ -34,6 +34,12 @@ static inline int set_direct_map_default_noflush(struct page *page)
>  	return 0;
>  }
>
> +static inline int set_direct_map_valid_noflush(struct page *page,
> +					       unsigned nr, bool valid)
I recall that "unsigned" is frowned upon; "unsigned int".
> +{
> +	return 0;
> +}
Can we add some kernel doc for this?
In particular
(a) What does it mean when we return 0? That it worked? Then, this dummy function looks wrong. Or does it return the number of processed entries? Then we'd have a possible "int" vs. "unsigned int" inconsistency.
(b) What are the semantics when we fail halfway through the operation when processing nr > 1? Is it "all or nothing"?
Add a new flag, KVM_GMEM_NO_DIRECT_MAP, to KVM_CREATE_GUEST_MEMFD, which causes KVM to remove the folios backing this guest_memfd from the direct map after preparation/population. This flag is only exposed on architectures that can set the direct map (the notable exception here being ARM64 if the direct map is not set up at 4K granularity), otherwise EOPNOTSUPP is returned.
This patch also implements infrastructure for tracking (temporary) reinsertion of memory ranges into the direct map (more accurately: it allows recording that specific memory ranges deviate from the default direct map setup. Currently the default setup is always "direct map entries removed", but it is trivial to extend this with some "default_state_for_vm_type" mechanism to cover the pKVM usecase of memory starting off with direct map entries present). An xarray tracks this at page granularity, to be compatible with future hugepages usecases that might require subranges of hugetlb folios to have direct map entries restored. This xarray holds entries for each page that has a direct map state deviating from the default, and holes for all pages whose direct map state matches the default, the idea being that these "deviations" will be rare. kvm_gmem_folio_configure_direct_map applies the configuration stored in the xarray to a given folio, and is called for each new gmem folio after preparation/population.
Storing direct map state in the gmem inode has two advantages:

1) We can track direct map state at page granularity even for huge folios (see also Ackerley's series on hugetlbfs support in guest_memfd [1]).
2) We can pre-configure the direct map state of not-yet-faulted-in folios. This would for example be needed if a VMM is receiving a virtio buffer that the guest has requested it to fill. In this case, the pages backing the guest physical address range of the buffer might not be faulted in yet; they would only be faulted in when the VMM tries to write to them, and at that point we would need to ensure direct map entries are present.
Note that this patch does not include operations for manipulating the direct map state xarray, or for changing direct map state of already existing folios. These routines are sketched out in the following patch, but are not needed in this initial patch series.
When a gmem folio is freed, it is reinserted into the direct map (and failing this, marked as HWPOISON to avoid any other part of the kernel accidentally touching folios without complete direct map entries). The direct map configuration stored in the xarray is _not_ reset when the folio is freed (although this could be implemented by storing the reference to the xarray in the folio's private data instead of only the inode).
[1]: https://lore.kernel.org/kvm/cover.1726009989.git.ackerleytng@google.com/
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/uapi/linux/kvm.h |   2 +
 virt/kvm/guest_memfd.c   | 150 +++++++++++++++++++++++++++++++++++----
 2 files changed, 137 insertions(+), 15 deletions(-)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc0551453..81b0f4a236b8c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1564,6 +1564,8 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };

+#define KVM_GMEM_NO_DIRECT_MAP (1ULL << 0)
+
 #define KVM_PRE_FAULT_MEMORY	_IOWR(KVMIO, 0xd5, struct kvm_pre_fault_memory)

 struct kvm_pre_fault_memory {
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 47a9f68f7b247..50ffc2ad73eda 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#include <linux/set_memory.h>

 #include "kvm_mm.h"

@@ -13,6 +14,88 @@ struct kvm_gmem {
 	struct list_head entry;
 };

+struct kvm_gmem_inode_private {
+	unsigned long long flags;
+
+	/*
+	 * direct map configuration of the gmem instance this private data
+	 * is associated with. present indices indicate a desired direct map
+	 * configuration deviating from default_direct_map_state (e.g. if
+	 * default_direct_map_state is false/not present, then the xarray
+	 * contains all indices for which direct map entries are restored).
+	 */
+	struct xarray direct_map_state;
+	bool default_direct_map_state;
+};
+
+static bool kvm_gmem_test_no_direct_map(struct kvm_gmem_inode_private *gmem_priv)
+{
+	return ((unsigned long)gmem_priv->flags & KVM_GMEM_NO_DIRECT_MAP) != 0;
+}
+
+/*
+ * Configure the direct map present/not present state of @folio based on
+ * the xarray stored in the associated inode's private data.
+ *
+ * Assumes the folio lock is held.
+ */
+static int kvm_gmem_folio_configure_direct_map(struct folio *folio)
+{
+	struct inode *inode = folio_inode(folio);
+	struct kvm_gmem_inode_private *gmem_priv = inode->i_private;
+	bool default_state = gmem_priv->default_direct_map_state;
+
+	pgoff_t start = folio_index(folio);
+	pgoff_t last = start + folio_nr_pages(folio) - 1;
+
+	struct xarray *xa = &gmem_priv->direct_map_state;
+	unsigned long index;
+	void *entry;
+
+	pgoff_t range_start = start;
+	unsigned long npages = 1;
+	int r = 0;
+
+	if (!kvm_gmem_test_no_direct_map(gmem_priv))
+		goto out;
+
+	r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
+					 default_state);
+	if (r)
+		goto out;
+
+	if (!xa_find_after(xa, &range_start, last, XA_PRESENT))
+		goto out_flush;
+
+	xa_for_each_range(xa, index, entry, range_start, last) {
+		++npages;
+
+		if (index == range_start + npages)
+			continue;
+
+		r = set_direct_map_valid_noflush(folio_file_page(folio, range_start),
+						 npages - 1, !default_state);
+		if (r)
+			goto out_flush;
+
+		range_start = index;
+		npages = 1;
+	}
+
+	r = set_direct_map_valid_noflush(folio_file_page(folio, range_start), npages,
+					 !default_state);
+
+out_flush:
+	/*
+	 * Use PG_private to track that this folio has had potentially some of
+	 * its direct map entries modified, so that we can restore them in
+	 * free_folio.
+	 */
+	folio_set_private(folio);
+	flush_tlb_kernel_range(start, start + folio_size(folio));
+out:
+	return r;
+}
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -42,9 +125,19 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
 	return 0;
 }
-static inline void kvm_gmem_mark_prepared(struct folio *folio)
+static inline int kvm_gmem_finalize_folio(struct folio *folio)
 {
+	int r = kvm_gmem_folio_configure_direct_map(folio);
+
+	/*
+	 * Parts of the direct map might have been punched out, mark this folio
+	 * as prepared even in the error case to avoid touching parts without
+	 * direct map entries in a potential re-preparation.
+	 */
 	folio_mark_uptodate(folio);
+
+	return r;
 }

 /*
@@ -82,11 +175,10 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 	index = ALIGN_DOWN(index, 1 << folio_order(folio));
 	r = __kvm_gmem_prepare_folio(kvm, slot, index, folio);
 	if (!r)
-		kvm_gmem_mark_prepared(folio);
+		r = kvm_gmem_finalize_folio(folio);

 	return r;
 }
-
 /*
  * Returns a locked folio on success. The caller is responsible for
  * setting the up-to-date flag before the memory is mapped into the guest.
@@ -249,6 +341,7 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
 static int kvm_gmem_release(struct inode *inode, struct file *file)
 {
 	struct kvm_gmem *gmem = file->private_data;
+	struct kvm_gmem_inode_private *gmem_priv;
 	struct kvm_memory_slot *slot;
 	struct kvm *kvm = gmem->kvm;
 	unsigned long index;
@@ -279,13 +372,17 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
list_del(&gmem->entry);
+	gmem_priv = inode->i_private;
+
 	filemap_invalidate_unlock(inode->i_mapping);

 	mutex_unlock(&kvm->slots_lock);
-
 	xa_destroy(&gmem->bindings);
 	kfree(gmem);

+	xa_destroy(&gmem_priv->direct_map_state);
+	kfree(gmem_priv);
+
 	kvm_put_kvm(kvm);

 	return 0;
@@ -357,24 +454,37 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 	return MF_DELAYED;
 }

-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
 static void kvm_gmem_free_folio(struct folio *folio)
 {
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
 	struct page *page = folio_page(folio, 0);
 	kvm_pfn_t pfn = page_to_pfn(page);
 	int order = folio_order(folio);

 	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
-}
 #endif

+	if (folio_test_private(folio)) {
+		unsigned long start = (unsigned long)folio_address(folio);
+
+		int r = set_direct_map_valid_noflush(folio_page(folio, 0),
+						     folio_nr_pages(folio), true);
+		/*
+		 * There might be holes left in the folio, better make sure
+		 * nothing tries to touch it again.
+		 */
+		if (r)
+			folio_set_hwpoison(folio);
+
+		flush_tlb_kernel_range(start, start + folio_size(folio));
+	}
+}
+
 static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 	.migrate_folio	= kvm_gmem_migrate_folio,
 	.error_remove_folio = kvm_gmem_error_folio,
-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
 	.free_folio = kvm_gmem_free_folio,
-#endif
 };
 static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
@@ -401,6 +511,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
 	const char *anon_name = "[kvm-gmem]";
 	struct kvm_gmem *gmem;
+	struct kvm_gmem_inode_private *gmem_priv;
 	struct inode *inode;
 	struct file *file;
 	int fd, err;
@@ -409,11 +520,14 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	if (fd < 0)
 		return fd;

+	err = -ENOMEM;
 	gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
-	if (!gmem) {
-		err = -ENOMEM;
+	if (!gmem)
+		goto err_fd;
+
+	gmem_priv = kzalloc(sizeof(*gmem_priv), GFP_KERNEL);
+	if (!gmem_priv)
 		goto err_fd;
-	}

 	file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
 					 O_RDWR, NULL);
@@ -427,7 +541,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	inode = file->f_inode;
 	WARN_ON(file->f_mapping != inode->i_mapping);

-	inode->i_private = (void *)(unsigned long)flags;
+	inode->i_private = gmem_priv;
 	inode->i_op = &kvm_gmem_iops;
 	inode->i_mapping->a_ops = &kvm_gmem_aops;
 	inode->i_mode |= S_IFREG;
@@ -442,6 +556,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	xa_init(&gmem->bindings);
 	list_add(&gmem->entry, &inode->i_mapping->i_private_list);

+	xa_init(&gmem_priv->direct_map_state);
+	gmem_priv->flags = flags;
+
 	fd_install(fd, file);
 	return fd;

@@ -456,11 +573,14 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
 	loff_t size = args->size;
 	u64 flags = args->flags;
-	u64 valid_flags = 0;
+	u64 valid_flags = KVM_GMEM_NO_DIRECT_MAP;

 	if (flags & ~valid_flags)
 		return -EINVAL;

+	if ((flags & KVM_GMEM_NO_DIRECT_MAP) && !can_set_direct_map())
+		return -EOPNOTSUPP;
+
 	if (size <= 0 || !PAGE_ALIGNED(size))
 		return -EINVAL;
@@ -679,7 +799,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 			break;
 		}

-		folio_unlock(folio);
 		WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) ||
 			(npages - i) < (1 << max_order));

@@ -695,7 +814,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 		p = src ? src + i * PAGE_SIZE : NULL;
 		ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
 		if (!ret)
-			kvm_gmem_mark_prepared(folio);
+			ret = kvm_gmem_finalize_folio(folio);
+		folio_unlock(folio);

 put_folio_and_exit:
 		folio_put(folio);
On 10/30/24 08:49, Patrick Roy wrote:
> [...]
> +static int kvm_gmem_folio_configure_direct_map(struct folio *folio)
> +{
> +	struct inode *inode = folio_inode(folio);
> +	struct kvm_gmem_inode_private *gmem_priv = inode->i_private;
> +	bool default_state = gmem_priv->default_direct_map_state;
> +
> +	pgoff_t start = folio_index(folio);
> +	pgoff_t last = start + folio_nr_pages(folio) - 1;
pgoff_t last = folio_next_index(folio) - 1;
thanks, Mike
- struct xarray *xa = &gmem_priv->direct_map_state;
- unsigned long index;
- void *entry;
- pgoff_t range_start = start;
- unsigned long npages = 1;
- int r = 0;
- if (!kvm_gmem_test_no_direct_map(gmem_priv))
goto out;
- r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
default_state);
- if (r)
goto out;
- if (!xa_find_after(xa, &range_start, last, XA_PRESENT))
goto out_flush;
- xa_for_each_range(xa, index, entry, range_start, last) {
++npages;
if (index == range_start + npages)
continue;
r = set_direct_map_valid_noflush(folio_file_page(folio, range_start), npages - 1,
!default_state);
if (r)
goto out_flush;
range_start = index;
npages = 1;
- }
- r = set_direct_map_valid_noflush(folio_file_page(folio, range_start), npages,
!default_state);
+out_flush:
- /*
* Use PG_private to track that this folio has had potentially some of
* its direct map entries modified, so that we can restore them in free_folio.
*/
- folio_set_private(folio);
- flush_tlb_kernel_range(start, start + folio_size(folio));
+out:
- return r;
+}
- /**
- folio_file_pfn - like folio_file_page, but return a pfn.
- @folio: The folio which contains this index.
@@ -42,9 +125,19 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo return 0; } -static inline void kvm_gmem_mark_prepared(struct folio *folio)
+static inline int kvm_gmem_finalize_folio(struct folio *folio) {
- int r = kvm_gmem_folio_configure_direct_map(folio);
- /*
* Parts of the direct map might have been punched out, mark this folio
* as prepared even in the error case to avoid touching parts without
* direct map entries in a potential re-preparation.
folio_mark_uptodate(folio);*/
- return r; }
/* @@ -82,11 +175,10 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot, index = ALIGN_DOWN(index, 1 << folio_order(folio)); r = __kvm_gmem_prepare_folio(kvm, slot, index, folio); if (!r)
kvm_gmem_mark_prepared(folio);
r = kvm_gmem_finalize_folio(folio);
return r; }
- /*
- Returns a locked folio on success. The caller is responsible for
- setting the up-to-date flag before the memory is mapped into the guest.
@@ -249,6 +341,7 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
 static int kvm_gmem_release(struct inode *inode, struct file *file)
 {
 	struct kvm_gmem *gmem = file->private_data;
+	struct kvm_gmem_inode_private *gmem_priv;
 	struct kvm_memory_slot *slot;
 	struct kvm *kvm = gmem->kvm;
 	unsigned long index;
@@ -279,13 +372,17 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
 	list_del(&gmem->entry);
 
+	gmem_priv = inode->i_private;
+
 	filemap_invalidate_unlock(inode->i_mapping);
 
 	mutex_unlock(&kvm->slots_lock);
 
 	xa_destroy(&gmem->bindings);
 	kfree(gmem);
 
+	xa_destroy(&gmem_priv->direct_map_state);
+	kfree(gmem_priv);
+
 	kvm_put_kvm(kvm);
 
 	return 0;
@@ -357,24 +454,37 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 	return MF_DELAYED;
 }
 
-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
 static void kvm_gmem_free_folio(struct folio *folio)
 {
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
 	struct page *page = folio_page(folio, 0);
 	kvm_pfn_t pfn = page_to_pfn(page);
 	int order = folio_order(folio);
 
 	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
-}
 #endif
+
+	if (folio_test_private(folio)) {
+		unsigned long start = (unsigned long)folio_address(folio);
+		int r = set_direct_map_valid_noflush(folio_page(folio, 0),
+						     folio_nr_pages(folio), true);
+
+		/*
+		 * There might be holes left in the folio, better make sure
+		 * nothing tries to touch it again.
+		 */
+		if (r)
+			folio_set_hwpoison(folio);
+
+		flush_tlb_kernel_range(start, start + folio_size(folio));
+	}
+}
 static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 	.migrate_folio = kvm_gmem_migrate_folio,
 	.error_remove_folio = kvm_gmem_error_folio,
-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
 	.free_folio = kvm_gmem_free_folio,
-#endif
 };
 
 static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
@@ -401,6 +511,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
 	const char *anon_name = "[kvm-gmem]";
 	struct kvm_gmem *gmem;
+	struct kvm_gmem_inode_private *gmem_priv;
 	struct inode *inode;
 	struct file *file;
 	int fd, err;
@@ -409,11 +520,14 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	if (fd < 0)
 		return fd;
 
+	err = -ENOMEM;
 	gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
-	if (!gmem) {
-		err = -ENOMEM;
+	if (!gmem)
 		goto err_fd;
-	}
+
+	gmem_priv = kzalloc(sizeof(*gmem_priv), GFP_KERNEL);
+	if (!gmem_priv)
+		goto err_fd;
 
 	file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
 					 O_RDWR, NULL);
@@ -427,7 +541,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	inode = file->f_inode;
 	WARN_ON(file->f_mapping != inode->i_mapping);
 
-	inode->i_private = (void *)(unsigned long)flags;
+	inode->i_private = gmem_priv;
 	inode->i_op = &kvm_gmem_iops;
 	inode->i_mapping->a_ops = &kvm_gmem_aops;
 	inode->i_mode |= S_IFREG;
@@ -442,6 +556,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	xa_init(&gmem->bindings);
 	list_add(&gmem->entry, &inode->i_mapping->i_private_list);
 
+	xa_init(&gmem_priv->direct_map_state);
+	gmem_priv->flags = flags;
+
 	fd_install(fd, file);
 	return fd;
@@ -456,11 +573,14 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
 	loff_t size = args->size;
 	u64 flags = args->flags;
-	u64 valid_flags = 0;
+	u64 valid_flags = KVM_GMEM_NO_DIRECT_MAP;
 
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
+	if ((flags & KVM_GMEM_NO_DIRECT_MAP) && !can_set_direct_map())
+		return -EOPNOTSUPP;
+
 	if (size <= 0 || !PAGE_ALIGNED(size))
 		return -EINVAL;
@@ -679,7 +799,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 		break;
 	}
 
 	WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) ||
 		(npages - i) < (1 << max_order));
-	folio_unlock(folio);
@@ -695,7 +814,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 		p = src ? src + i * PAGE_SIZE : NULL;
 		ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
 		if (!ret)
-			kvm_gmem_mark_prepared(folio);
+			ret = kvm_gmem_finalize_folio(folio);
+		folio_unlock(folio);
 
 put_folio_and_exit:
 	folio_put(folio);
Implement (yet unused) routines for manipulating guest_memfd direct map state. This is largely for illustration purposes.
kvm_gmem_set_direct_map allows manipulating arbitrary pgoff_t ranges, even if the covered memory has not yet been faulted in (in which case the requested direct map state is recorded in the xarray and will be applied by kvm_gmem_folio_configure_direct_map after the folio is faulted in and prepared/populated). This can be used to realize private/shared conversions on not-yet-faulted in memory, as discussed in the guest_memfd upstream call [1].
kvm_gmem_folio_set_direct_map allows manipulating the direct map entries for a gmem folio that the caller already holds a reference for (whereas kvm_gmem_set_direct_map needs to look up all folios intersecting the given pgoff range in the filemap first).
The xa lock serializes calls to kvm_gmem_folio_set_direct_map and kvm_gmem_set_direct_map, while the read side (kvm_gmem_folio_configure_direct_map) is protected by RCU. This is sufficient to ensure consistency between the xarray and the folio's actual direct map state: kvm_gmem_folio_configure_direct_map is called only for freshly allocated folios, and before the folio lock is dropped for the first time, so it always does its set_direct_map calls before either of kvm_gmem_[folio_]set_direct_map gets a chance. Even if a concurrent call to kvm_gmem_[folio_]set_direct_map happens, this ensures a sort of "eventual consistency" between the xarray and the actual direct map configuration by the time kvm_gmem_[folio_]set_direct_map exits.
[1]: https://lore.kernel.org/kvm/4b49248b-1cf1-44dc-9b50-ee551e1671ac@redhat.com/
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 virt/kvm/guest_memfd.c | 125 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 125 insertions(+)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 50ffc2ad73eda..54387828dcc6a 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -96,6 +96,131 @@ static int kvm_gmem_folio_configure_direct_map(struct folio *folio)
 	return r;
 }
 
+/*
+ * Updates the range [@start, @end] in @gmem_priv's direct map state xarray to be @state,
+ * e.g. erasing entries in this range if @state is the default state, and creating
+ * entries otherwise.
+ *
+ * Assumes the xa_lock is held.
+ */
+static int __kvm_gmem_update_xarray(struct kvm_gmem_inode_private *gmem_priv, pgoff_t start,
+				    pgoff_t end, bool state)
+{
+	struct xarray *xa = &gmem_priv->direct_map_state;
+	int r = 0;
+
+	/*
+	 * Cannot use xa_store_range, as multi-indexes cannot easily
+	 * be partially updated.
+	 */
+	for (pgoff_t index = start; index < end; ++index) {
+		if (state == gmem_priv->default_direct_map_state)
+			__xa_erase(xa, index);
+		else
+			/* don't care _what_ we store in the xarray, only care about presence */
+			__xa_store(xa, index, gmem_priv, GFP_KERNEL);
+
+		r = xa_err(xa);
+		if (r)
+			goto out;
+	}
+
+out:
+	return r;
+}
+
+static int __kvm_gmem_folio_set_direct_map(struct folio *folio, pgoff_t start, pgoff_t end,
+					   bool state)
+{
+	unsigned long npages = end - start + 1;
+	struct page *first_page = folio_file_page(folio, start);
+
+	int r = set_direct_map_valid_noflush(first_page, npages, state);
+
+	flush_tlb_kernel_range((unsigned long)page_address(first_page),
+			       (unsigned long)page_address(first_page) +
+			       npages * PAGE_SIZE);
+	return r;
+}
+
+/*
+ * Updates the direct map status for the given range from @start to @end (inclusive), returning
+ * -EINVAL if this range is not completely contained within @folio. Also updates the
+ * xarray stored in the private data of the inode @folio is attached to.
+ *
+ * Takes and drops the folio lock.
+ */
+static __always_unused int kvm_gmem_folio_set_direct_map(struct folio *folio, pgoff_t start,
+							 pgoff_t end, bool state)
+{
+	struct inode *inode = folio_inode(folio);
+	struct kvm_gmem_inode_private *gmem_priv = inode->i_private;
+	int r = -EINVAL;
+
+	if (!folio_contains(folio, start) || !folio_contains(folio, end))
+		goto out;
+
+	xa_lock(&gmem_priv->direct_map_state);
+	r = __kvm_gmem_update_xarray(gmem_priv, start, end, state);
+	if (r)
+		goto unlock_xa;
+
+	folio_lock(folio);
+	r = __kvm_gmem_folio_set_direct_map(folio, start, end, state);
+	folio_unlock(folio);
+
+unlock_xa:
+	xa_unlock(&gmem_priv->direct_map_state);
+out:
+	return r;
+}
+
+/*
+ * Updates the direct map status for the given range from @start to @end (inclusive)
+ * of @inode. Folios in this range have their direct map entries reconfigured,
+ * and the xarray in the @inode's private data is updated.
+ */
+static __always_unused int kvm_gmem_set_direct_map(struct inode *inode, pgoff_t start,
+						   pgoff_t end, bool state)
+{
+	struct kvm_gmem_inode_private *gmem_priv = inode->i_private;
+	struct folio_batch fbatch;
+	pgoff_t index = start;
+	unsigned int count, i;
+	int r = 0;
+
+	xa_lock(&gmem_priv->direct_map_state);
+
+	r = __kvm_gmem_update_xarray(gmem_priv, start, end, state);
+	if (r)
+		goto out;
+
+	folio_batch_init(&fbatch);
+	while (!filemap_get_folios(inode->i_mapping, &index, end, &fbatch) && !r) {
+		count = folio_batch_count(&fbatch);
+		for (i = 0; i < count; i++) {
+			struct folio *folio = fbatch.folios[i];
+			pgoff_t folio_start = max(folio_index(folio), start);
+			pgoff_t folio_end =
+				min(folio_index(folio) + folio_nr_pages(folio),
+				    end);
+
+			folio_lock(folio);
+			r = __kvm_gmem_folio_set_direct_map(folio, folio_start,
+							    folio_end, state);
+			folio_unlock(folio);
+
+			if (r)
+				break;
+		}
+		folio_batch_release(&fbatch);
+	}
+
+	xa_unlock(&gmem_priv->direct_map_state);
+out:
+	return r;
+}
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
On 10/30/24 08:49, Patrick Roy wrote:
> Implement (yet unused) routines for manipulating guest_memfd direct map
> state. This is largely for illustration purposes.
>
> kvm_gmem_set_direct_map allows manipulating arbitrary pgoff_t ranges, even
> if the covered memory has not yet been faulted in (in which case the
> requested direct map state is recorded in the xarray and will be applied
> by kvm_gmem_folio_configure_direct_map after the folio is faulted in and
> prepared/populated).
>
> [...]
>
> +static __always_unused int kvm_gmem_set_direct_map(struct inode *inode, pgoff_t start,
> +						   pgoff_t end, bool state)
> +{
> +	struct kvm_gmem_inode_private *gmem_priv = inode->i_private;
> +	struct folio_batch fbatch;
> +	pgoff_t index = start;
> +	unsigned int count, i;
> +	int r = 0;
> +
> +	xa_lock(&gmem_priv->direct_map_state);
> +
> +	r = __kvm_gmem_update_xarray(gmem_priv, start, end, state);
> +	if (r)
> +		goto out;

This error path jumps to the "out" label with the xa_lock still held. It
needs to drop the lock first:

	if (r) {
		xa_unlock(&gmem_priv->direct_map_state);
		goto out;
	}

thanks,
Mike
Add tracepoints to kvm_gmem_set_direct_map and kvm_gmem_folio_set_direct_map.
The above operations can cause folios to be inserted into or removed from the direct map. We want to be able to verify that only those gmem folios that we expect KVM to access are ever reinserted into the direct map, and that all folios that are temporarily reinserted are also removed again at a later point. Processing ftrace output is one way to verify this.
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/trace/events/kvm.h | 22 ++++++++++++++++++++++
 virt/kvm/guest_memfd.c     |  5 +++++
 2 files changed, 27 insertions(+)
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 74e40d5d4af42..f3d852c18fa08 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -489,6 +489,28 @@ TRACE_EVENT(kvm_test_age_hva,
 	TP_printk("mmu notifier test age hva: %#016lx", __entry->hva)
 );
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+TRACE_EVENT(kvm_gmem_direct_map_state_change,
+	TP_PROTO(pgoff_t start, pgoff_t end, bool state),
+	TP_ARGS(start, end, state),
+
+	TP_STRUCT__entry(
+		__field(pgoff_t, start)
+		__field(pgoff_t, end)
+		__field(bool, state)
+	),
+
+	TP_fast_assign(
+		__entry->start = start;
+		__entry->end = end;
+		__entry->state = state;
+	),
+
+	TP_printk("changed direct map state of guest_memfd range %lu to %lu to %s",
+		  __entry->start, __entry->end,
+		  __entry->state ? "present" : "not present")
+);
+#endif
+
 #endif /* _TRACE_KVM_MAIN_H */
 
 /* This part must be outside protection */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 54387828dcc6a..a0b3b9cacd361 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -7,6 +7,7 @@
 #include <linux/set_memory.h>
 
 #include "kvm_mm.h"
+#include "trace/events/kvm.h"
 
 struct kvm_gmem {
 	struct kvm *kvm;
@@ -169,6 +170,8 @@ static __always_unused int kvm_gmem_folio_set_direct_map(struct folio *folio, pg
 	r = __kvm_gmem_folio_set_direct_map(folio, start, end, state);
 	folio_unlock(folio);
 
+	trace_kvm_gmem_direct_map_state_change(start, end, state);
+
 unlock_xa:
 	xa_unlock(&gmem_priv->direct_map_state);
 out:
@@ -216,6 +219,8 @@ static __always_unused int kvm_gmem_set_direct_map(struct inode *inode, pgoff_t
 		folio_batch_release(&fbatch);
 	}
 
+	trace_kvm_gmem_direct_map_state_change(start, end, state);
+
 	xa_unlock(&gmem_priv->direct_map_state);
 out:
 	return r;
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 Documentation/virt/kvm/api.rst | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index edc070c6e19b2..c8e21c523411c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6382,6 +6382,20 @@ a single guest_memfd file, but the bound ranges must not overlap).
 
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
+The following flags are defined:
+
+KVM_GMEM_NO_DIRECT_MAP
+  Ensure memory backing this guest_memfd inode is unmapped from the kernel's
+  address space.
+
+Errors:
+
+  ========== ===============================================================
+  EOPNOTSUPP `KVM_GMEM_NO_DIRECT_MAP` was set in `flags`, but the host does
+             not support direct map manipulations.
+  ========== ===============================================================
+
+
 4.143 KVM_PRE_FAULT_MEMORY
 --------------------------
Also adjust test_create_guest_memfd_invalid, as BIT(0) is now a valid value for flags. (Note that this also fixes an issue where the loop in test_create_guest_memfd_invalid was a no-op; I posted that fix as a separate patch last week [1].)
[1]: https://lore.kernel.org/kvm/20241024095956.3668818-1-roypat@amazon.co.uk/
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 tools/testing/selftests/kvm/guest_memfd_test.c                  | 2 +-
 .../selftests/kvm/x86_64/private_mem_conversions_test.c         | 7 ++++---
 2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ba0c8e9960358..d04f7ff3dfb15 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -134,7 +134,7 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
 			    size);
 	}
 
-	for (flag = 0; flag; flag <<= 1) {
+	for (flag = BIT(1); flag; flag <<= 1) {
 		fd = __vm_create_guest_memfd(vm, page_size, flag);
 		TEST_ASSERT(fd == -1 && errno == EINVAL,
 			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
index 82a8d88b5338e..dfc78781e93b8 100644
--- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -367,7 +367,7 @@ static void *__test_mem_conversions(void *__vcpu)
 }
 
 static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t nr_vcpus,
-				 uint32_t nr_memslots)
+				 uint32_t nr_memslots, uint64_t gmem_flags)
 {
 	/*
 	 * Allocate enough memory so that each vCPU's chunk of memory can be
@@ -394,7 +394,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 
 	vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
 
-	memfd = vm_create_guest_memfd(vm, memfd_size, 0);
+	memfd = vm_create_guest_memfd(vm, memfd_size, gmem_flags);
 
 	for (i = 0; i < nr_memslots; i++)
 		vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
@@ -477,7 +477,8 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	test_mem_conversions(src_type, nr_vcpus, nr_memslots);
+	test_mem_conversions(src_type, nr_vcpus, nr_memslots, 0);
+	test_mem_conversions(src_type, nr_vcpus, nr_memslots, KVM_GMEM_NO_DIRECT_MAP);
 
 	return 0;
 }
On 30.10.24 14:49, Patrick Roy wrote:
> Unmapping virtual machine guest memory from the host kernel's direct map
> is a successful mitigation against Spectre-style transient execution
> issues [...]
>
> Most significantly, I've reduced the patch series to focus only on direct
> map removal for guest_memfd for now, leaving the whole "how to do non-CoCo
> VMs in guest_memfd" for later.

Hi,

keeping upcoming "shared and private memory in guest_memfd" in mind, I
assume the focus would be to only remove the direct map for private memory?

So in the current upstream state, you would only be removing the direct map
for private memory, currently translating to "encrypted"/"protected" memory
that is inaccessible either way already.

Correct?
On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
> keeping upcoming "shared and private memory in guest_memfd" in mind, I
> assume the focus would be to only remove the direct map for private memory?
>
> So in the current upstream state, you would only be removing the direct map
> for private memory, currently translating to "encrypted"/"protected" memory
> that is inaccessible either way already.
>
> Correct?

Yea, with the upcoming "shared and private" stuff, I would expect the
shared<->private conversions to call the routines from patch 3 to restore
direct map entries on private->shared, and zap them on shared->private.

But as you said, the current upstream state has no notion of "shared"
memory in guest_memfd, so everything is private and thus everything is
direct map removed (although it is indeed already inaccessible anyway for
TDX and friends. That's what makes this patch series a bit awkward :( )

Best,
Patrick
On 2024-10-31 at 10:42+0000, Patrick Roy wrote:
> But as you said, the current upstream state has no notion of "shared"
> memory in guest_memfd, so everything is private and thus everything is
> direct map removed (although it is indeed already inaccessible anyway for
> TDX and friends. That's what makes this patch series a bit awkward :( )

TDX and SEV encryption happens between the core and main memory, so the
cached guest data we're most concerned about for transient execution
attacks isn't necessarily inaccessible.

I'd be interested in what Intel, AMD, and other folks think on this, but I
think direct map removal is worthwhile for CoCo cases as well.

Derek