Unmapping virtual machine guest memory from the host kernel's direct map is an effective mitigation against Spectre-style transient execution issues: if the kernel page tables do not contain entries pointing to guest memory, then any attempted speculative read through the direct map will necessarily be blocked by the MMU before any observable microarchitectural side-effects happen. This means that Spectre gadgets and similar primitives cannot be used to target virtual machine memory. Roughly 60% of speculative execution issues fall into this category [1, Table 1].
This patch series extends guest_memfd with the ability to remove its memory from the host kernel's direct map, so that the above protection can be attained for KVM guests whose memory is backed by guest_memfd.
=== Changes to v2 ===
- Handle direct map removal for physically contiguous pages in arch code (Mike R.)
- Track the direct map state in guest_memfd itself instead of at the folio level, to prepare for huge pages support (Sean C.)
- Allow configuring direct map state of not-yet faulted in memory (Vishal A.)
- Pay attention to alignment in ftrace structs (Steven R.)
Most significantly, I've reduced the patch series to focus only on direct map removal for guest_memfd for now, leaving the whole "how to do non-CoCo VMs in guest_memfd" for later. If this separation is acceptable, then I think I can drop the RFC tag in the next revision (I've mainly kept it here because I'm not entirely sure what to do with patches 3 and 4).
=== Implementation ===
This patch series introduces a new flag for the KVM_CREATE_GUEST_MEMFD ioctl that causes guest_memfd to remove its pages from the host kernel's direct map immediately after population/preparation. It also adds infrastructure for tracking the direct map state of all gmem folios inside the guest_memfd inode. Storing this information in the inode has the advantage that the code is ready for future huge page extensions, where only removing/reinserting direct map entries for sub-ranges of a huge folio is a valid usecase, and it allows pre-configuring the direct map state of not-yet-faulted-in parts of memory (for example, when the VMM receives an RX virtio buffer from the guest).
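For illustration, opting in from userspace would look roughly like the following sketch (the helper name is made up and error handling is simplified; vm_fd is assumed to be an existing KVM VM file descriptor):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Create a guest_memfd whose backing folios are removed from the
 * host kernel's direct map after preparation/population. */
static int create_unmapped_gmem(int vm_fd, __u64 size)
{
	struct kvm_create_guest_memfd gmem = {
		.size  = size,                   /* must be page-aligned */
		.flags = KVM_GMEM_NO_DIRECT_MAP, /* new flag from this series */
	};

	/*
	 * Returns a guest_memfd file descriptor on success; fails with
	 * EOPNOTSUPP if the host cannot manipulate the direct map.
	 */
	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
}

The returned file descriptor is then bound to memslots via KVM_SET_USER_MEMORY_REGION2 as usual.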
=== Summary ===
Patch 1 (from Mike Rapoport) adds arch APIs for manipulating the direct map for ranges of physically contiguous pages, which are used by guest_memfd in follow-up patches. Patch 2 adds the KVM_GMEM_NO_DIRECT_MAP flag and the logic for configuring the direct map state of freshly prepared folios. Patches 3 and 4 mainly serve an illustrative purpose, showing how the framework from patch 2 can be extended with routines for runtime direct map manipulation. Patches 5 and 6 deal with documentation and selftests, respectively.
[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[RFC v1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFC v2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
Mike Rapoport (Microsoft) (1):
  arch: introduce set_direct_map_valid_noflush()

Patrick Roy (5):
  kvm: gmem: add flag to remove memory from kernel direct map
  kvm: gmem: implement direct map manipulation routines
  kvm: gmem: add trace point for direct map state changes
  kvm: document KVM_GMEM_NO_DIRECT_MAP flag
  kvm: selftests: run gmem tests with KVM_GMEM_NO_DIRECT_MAP set
 Documentation/virt/kvm/api.rst                 |  14 +
 arch/arm64/include/asm/set_memory.h            |   1 +
 arch/arm64/mm/pageattr.c                       |  10 +
 arch/loongarch/include/asm/set_memory.h        |   1 +
 arch/loongarch/mm/pageattr.c                   |  21 ++
 arch/riscv/include/asm/set_memory.h            |   1 +
 arch/riscv/mm/pageattr.c                       |  15 +
 arch/s390/include/asm/set_memory.h             |   1 +
 arch/s390/mm/pageattr.c                        |  11 +
 arch/x86/include/asm/set_memory.h              |   1 +
 arch/x86/mm/pat/set_memory.c                   |   8 +
 include/linux/set_memory.h                     |   6 +
 include/trace/events/kvm.h                     |  22 ++
 include/uapi/linux/kvm.h                       |   2 +
 .../testing/selftests/kvm/guest_memfd_test.c   |   2 +-
 .../kvm/x86_64/private_mem_conversions_test.c  |   7 +-
 virt/kvm/guest_memfd.c                         | 280 +++++++++++++++++-
 17 files changed, 384 insertions(+), 19 deletions(-)
base-commit: 5cb1659f412041e4780f2e8ee49b2e03728a2ba6
From: "Mike Rapoport (Microsoft)" rppt@kernel.org
Add an API that will allow updates of the direct/linear map for a set of physically contiguous pages.
It will be used in the following patches.
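As a rough usage sketch (the helper below is hypothetical; it assumes the caller owns a physically contiguous range of nr pages that is mapped in the direct map, and it does its own TLB maintenance, since the _noflush variants leave that to the caller):

#include <linux/mm.h>
#include <linux/set_memory.h>
#include <asm/tlbflush.h>

/* Sketch: temporarily drop nr contiguous pages from the direct map,
 * then restore them. */
static int example_unmap_remap(struct page *page, unsigned int nr)
{
	unsigned long addr = (unsigned long)page_address(page);
	int r;

	r = set_direct_map_valid_noflush(page, nr, false);  /* unmap */
	if (r)
		return r;
	flush_tlb_kernel_range(addr, addr + nr * PAGE_SIZE);

	/* ... the range is now not present in the direct map ... */

	r = set_direct_map_valid_noflush(page, nr, true);   /* remap */
	flush_tlb_kernel_range(addr, addr + nr * PAGE_SIZE);
	return r;
}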
Signed-off-by: Mike Rapoport (Microsoft) rppt@kernel.org
Signed-off-by: Patrick Roy roypat@amazon.co.uk
---
 arch/arm64/include/asm/set_memory.h     |  1 +
 arch/arm64/mm/pageattr.c                | 10 ++++++++++
 arch/loongarch/include/asm/set_memory.h |  1 +
 arch/loongarch/mm/pageattr.c            | 21 +++++++++++++++++++++
 arch/riscv/include/asm/set_memory.h     |  1 +
 arch/riscv/mm/pageattr.c                | 15 +++++++++++++++
 arch/s390/include/asm/set_memory.h      |  1 +
 arch/s390/mm/pageattr.c                 | 11 +++++++++++
 arch/x86/include/asm/set_memory.h       |  1 +
 arch/x86/mm/pat/set_memory.c            |  8 ++++++++
 include/linux/set_memory.h              |  6 ++++++
 11 files changed, 76 insertions(+)
diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h index 917761feeffdd..98088c043606a 100644 --- a/arch/arm64/include/asm/set_memory.h +++ b/arch/arm64/include/asm/set_memory.h @@ -13,6 +13,7 @@ int set_memory_valid(unsigned long addr, int numpages, int enable);
int set_direct_map_invalid_noflush(struct page *page); int set_direct_map_default_noflush(struct page *page); +int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid); bool kernel_page_present(struct page *page);
#endif /* _ASM_ARM64_SET_MEMORY_H */ diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c index 0e270a1c51e64..01225900293ac 100644 --- a/arch/arm64/mm/pageattr.c +++ b/arch/arm64/mm/pageattr.c @@ -192,6 +192,16 @@ int set_direct_map_default_noflush(struct page *page) PAGE_SIZE, change_page_range, &data); }
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid) +{ + unsigned long addr = (unsigned long)page_address(page); + + if (!can_set_direct_map()) + return 0; + + return set_memory_valid(addr, nr, valid); +} + #ifdef CONFIG_DEBUG_PAGEALLOC void __kernel_map_pages(struct page *page, int numpages, int enable) { diff --git a/arch/loongarch/include/asm/set_memory.h b/arch/loongarch/include/asm/set_memory.h index d70505b6676cb..55dfaefd02c8a 100644 --- a/arch/loongarch/include/asm/set_memory.h +++ b/arch/loongarch/include/asm/set_memory.h @@ -17,5 +17,6 @@ int set_memory_rw(unsigned long addr, int numpages); bool kernel_page_present(struct page *page); int set_direct_map_default_noflush(struct page *page); int set_direct_map_invalid_noflush(struct page *page); +int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
#endif /* _ASM_LOONGARCH_SET_MEMORY_H */ diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c index ffd8d76021d47..f14b40c968b48 100644 --- a/arch/loongarch/mm/pageattr.c +++ b/arch/loongarch/mm/pageattr.c @@ -216,3 +216,24 @@ int set_direct_map_invalid_noflush(struct page *page)
return __set_memory(addr, 1, __pgprot(0), __pgprot(_PAGE_PRESENT | _PAGE_VALID)); } + +int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid) +{ + unsigned long addr = (unsigned long)page_address(page); + pgprot_t set, clear; + + if (addr < vm_map_base) + return 0; + + if (valid) { + set = PAGE_KERNEL; + clear = __pgprot(0); + } else { + set = __pgprot(0); + clear = __pgprot(_PAGE_PRESENT | _PAGE_VALID); + } + + return __set_memory(addr, nr, set, clear); +} diff --git a/arch/riscv/include/asm/set_memory.h b/arch/riscv/include/asm/set_memory.h index ab92fc84e1fc9..ea263d3683ef6 100644 --- a/arch/riscv/include/asm/set_memory.h +++ b/arch/riscv/include/asm/set_memory.h @@ -42,6 +42,7 @@ static inline int set_kernel_memory(char *startp, char *endp,
int set_direct_map_invalid_noflush(struct page *page); int set_direct_map_default_noflush(struct page *page); +int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid); bool kernel_page_present(struct page *page);
#endif /* __ASSEMBLY__ */ diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c index 271d01a5ba4da..d815448758a19 100644 --- a/arch/riscv/mm/pageattr.c +++ b/arch/riscv/mm/pageattr.c @@ -386,6 +386,21 @@ int set_direct_map_default_noflush(struct page *page) PAGE_KERNEL, __pgprot(_PAGE_EXEC)); }
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid) +{ + pgprot_t set, clear; + + if (valid) { + set = PAGE_KERNEL; + clear = __pgprot(_PAGE_EXEC); + } else { + set = __pgprot(0); + clear = __pgprot(_PAGE_PRESENT); + } + + return __set_memory((unsigned long)page_address(page), nr, set, clear); +} + #ifdef CONFIG_DEBUG_PAGEALLOC static int debug_pagealloc_set_page(pte_t *pte, unsigned long addr, void *data) { diff --git a/arch/s390/include/asm/set_memory.h b/arch/s390/include/asm/set_memory.h index 06fbabe2f66c9..240bcfbdcdcec 100644 --- a/arch/s390/include/asm/set_memory.h +++ b/arch/s390/include/asm/set_memory.h @@ -62,5 +62,6 @@ __SET_MEMORY_FUNC(set_memory_4k, SET_MEMORY_4K)
int set_direct_map_invalid_noflush(struct page *page); int set_direct_map_default_noflush(struct page *page); +int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid);
#endif diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c index 5f805ad42d4c3..4c7ee74aa130d 100644 --- a/arch/s390/mm/pageattr.c +++ b/arch/s390/mm/pageattr.c @@ -406,6 +406,17 @@ int set_direct_map_default_noflush(struct page *page) return __set_memory((unsigned long)page_to_virt(page), 1, SET_MEMORY_DEF); }
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid) +{ + unsigned long flags; + + if (valid) + flags = SET_MEMORY_DEF; + else + flags = SET_MEMORY_INV; + + return __set_memory((unsigned long)page_to_virt(page), nr, flags); +} #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE)
static void ipte_range(pte_t *pte, unsigned long address, int nr) diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h index 4b2abce2e3e7d..cc62ef70ccc0a 100644 --- a/arch/x86/include/asm/set_memory.h +++ b/arch/x86/include/asm/set_memory.h @@ -89,6 +89,7 @@ int set_pages_rw(struct page *page, int numpages);
int set_direct_map_invalid_noflush(struct page *page); int set_direct_map_default_noflush(struct page *page); +int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid); bool kernel_page_present(struct page *page);
extern int kernel_set_to_readonly; diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 44f7b2ea6a073..069e421c22474 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -2444,6 +2444,14 @@ int set_direct_map_default_noflush(struct page *page) return __set_pages_p(page, 1); }
+int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid) +{ + if (valid) + return __set_pages_p(page, nr); + + return __set_pages_np(page, nr); +} + #ifdef CONFIG_DEBUG_PAGEALLOC void __kernel_map_pages(struct page *page, int numpages, int enable) { diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h index e7aec20fb44f1..3030d9245f5ac 100644 --- a/include/linux/set_memory.h +++ b/include/linux/set_memory.h @@ -34,6 +34,12 @@ static inline int set_direct_map_default_noflush(struct page *page) return 0; }
+static inline int set_direct_map_valid_noflush(struct page *page, + unsigned nr, bool valid) +{ + return 0; +} + static inline bool kernel_page_present(struct page *page) { return true;
base-commit: 5cb1659f412041e4780f2e8ee49b2e03728a2ba6
On 30.10.24 14:49, Patrick Roy wrote:
From: "Mike Rapoport (Microsoft)" rppt@kernel.org
Add an API that will allow updates of the direct/linear map for a set of physically contiguous pages.
It will be used in the following patches.
Signed-off-by: Mike Rapoport (Microsoft) rppt@kernel.org Signed-off-by: Patrick Roy roypat@amazon.co.uk
[...]
#ifdef CONFIG_DEBUG_PAGEALLOC void __kernel_map_pages(struct page *page, int numpages, int enable) { diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h index e7aec20fb44f1..3030d9245f5ac 100644 --- a/include/linux/set_memory.h +++ b/include/linux/set_memory.h @@ -34,6 +34,12 @@ static inline int set_direct_map_default_noflush(struct page *page) return 0; } +static inline int set_direct_map_valid_noflush(struct page *page,
unsigned nr, bool valid)
I recall that "unsigned" is frowned upon; "unsigned int".
+{
+	return 0;
+}
Can we add some kernel doc for this?
In particular
(a) What does it mean when we return 0? That it worked? Then, this dummy function looks wrong. Or does it return the number of processed entries? Then we'd have a possible "int" vs. "unsigned int" inconsistency.
(b) What are the semantics when we fail halfway through the operation when processing nr > 1? Is it "all or nothing"?
On 10/31/24 10:57, David Hildenbrand wrote:
On 30.10.24 14:49, Patrick Roy wrote:
From: "Mike Rapoport (Microsoft)" rppt@kernel.org
Add an API that will allow updates of the direct/linear map for a set of physically contiguous pages.
It will be used in the following patches.
Signed-off-by: Mike Rapoport (Microsoft) rppt@kernel.org Signed-off-by: Patrick Roy roypat@amazon.co.uk
[...]
#ifdef CONFIG_DEBUG_PAGEALLOC void __kernel_map_pages(struct page *page, int numpages, int enable) { diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h index e7aec20fb44f1..3030d9245f5ac 100644 --- a/include/linux/set_memory.h +++ b/include/linux/set_memory.h @@ -34,6 +34,12 @@ static inline int set_direct_map_default_noflush(struct page *page) return 0; } +static inline int set_direct_map_valid_noflush(struct page *page,
unsigned nr, bool valid)
I recall that "unsigned" is frowned upon; "unsigned int".
+{
+	return 0;
+}
Can we add some kernel doc for this?
In particular
(a) What does it mean when we return 0? That it worked? Then, this
Seems so.
dummy function looks wrong. Or this it return the
That's !CONFIG_ARCH_HAS_SET_DIRECT_MAP and other functions around do it the same way. Looks like the current callers can only exist with the CONFIG_ enabled in the first place.
number of processed entries? Then we'd have a possible "int" vs. "unsigned int" inconsistency.
(b) What are the semantics when we fail halfway through the operation when processing nr > 1? Is it "all or nothing"?
Looking at x86 implementation it seems like it can just bail out in the middle, but then I'm not sure if it can really fail in the middle, hmm...
On Mon, 2024-11-11 at 12:12 +0000, Vlastimil Babka wrote:
On 10/31/24 10:57, David Hildenbrand wrote:
On 30.10.24 14:49, Patrick Roy wrote:
From: "Mike Rapoport (Microsoft)" rppt@kernel.org
Add an API that will allow updates of the direct/linear map for a set of physically contiguous pages.
It will be used in the following patches.
Signed-off-by: Mike Rapoport (Microsoft) rppt@kernel.org Signed-off-by: Patrick Roy roypat@amazon.co.uk
[...]
#ifdef CONFIG_DEBUG_PAGEALLOC void __kernel_map_pages(struct page *page, int numpages, int enable) { diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h index e7aec20fb44f1..3030d9245f5ac 100644 --- a/include/linux/set_memory.h +++ b/include/linux/set_memory.h @@ -34,6 +34,12 @@ static inline int set_direct_map_default_noflush(struct page *page) return 0; }
+static inline int set_direct_map_valid_noflush(struct page *page,
unsigned nr, bool valid)
I recall that "unsigned" is frowned upon; "unsigned int".
+{
+	return 0;
+}
Can we add some kernel doc for this?
In particular
(a) What does it mean when we return 0? That it worked? Then, this
Seems so.
dummy function looks wrong. Or this it return the
That's !CONFIG_ARCH_HAS_SET_DIRECT_MAP and other functions around do it the same way. Looks like the current callers can only exist with the CONFIG_ enabled in the first place.
Yeah, it looks a bit weird, but these functions seem to generally return 0 if the operation is not supported. ARM specifically has
if (!can_set_direct_map()) return 0;
inside `set_direct_map_{invalid,default}_noflush`. Documenting this definitely cannot hurt, I'll keep it on my todo list for the next iteration :)
number of processed entries? Then we'd have a possible "int" vs. "unsigned int" inconsistency.
(b) What are the semantics when we fail halfway through the operation when processing nr > 1? Is it "all or nothing"?
Looking at x86 implementation it seems like it can just bail out in the middle, but then I'm not sure if it can really fail in the middle, hmm...
If I understood Mike correctly when talking about this at LPC, then it can only fail if during break-up of huge mappings, it fails to allocate page tables to hold the lower-granularity mappings (which happens before any present bits are modified).
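Concretely, the kerneldoc could then look roughly like this (just a sketch, assuming the semantics discussed above: 0 on success, 0 when the architecture cannot manipulate the direct map, and no all-or-nothing guarantee for nr > 1):

/**
 * set_direct_map_valid_noflush - make a range of the direct map present
 *                                or not present
 * @page:  first page of a physically contiguous range
 * @nr:    number of pages to update
 * @valid: if true, make the range present; if false, make it not present
 *
 * Does not flush the TLB; the caller is responsible for TLB maintenance.
 *
 * Returns: 0 on success (including when the architecture does not support
 * direct map manipulation), negative error code otherwise. The operation
 * is not atomic; it may fail after updating only part of the range.
 */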
Best, Patrick
Add a new flag, KVM_GMEM_NO_DIRECT_MAP, to KVM_CREATE_GUEST_MEMFD, which causes KVM to remove the folios backing this guest_memfd from the direct map after preparation/population. This flag is only exposed on architectures that can manipulate the direct map (the notable exception being ARM64 when the direct map is not set up at 4K granularity); otherwise EOPNOTSUPP is returned.
This patch also implements infrastructure for tracking (temporary) reinsertion of memory ranges into the direct map (more accurately: it allows recording that specific memory ranges deviate from the default direct map setup. Currently the default setup is always "direct map entries removed", but it is trivial to extend this with some "default_state_for_vm_type" mechanism to cover the pKVM usecase of memory starting off with direct map entries present). An xarray tracks this at page granularity, to be compatible with future hugepage usecases that might require subranges of hugetlb folios to have direct map entries restored. This xarray holds entries for each page whose direct map state deviates from the default, and holes for all pages whose direct map state matches the default, the idea being that these "deviations" will be rare. kvm_gmem_folio_configure_direct_map applies the configuration stored in the xarray to a given folio, and is called for each new gmem folio after preparation/population.
Storing direct map state in the gmem inode has two advantages:
1) We can track direct map state at page granularity even for huge folios (see also Ackerley's series on hugetlbfs support in guest_memfd [1])
2) We can pre-configure the direct map state of not-yet-faulted-in folios. This would for example be needed if a VMM is receiving a virtio buffer that the guest has requested it to fill. In this case, the pages backing the guest physical address range of the buffer might not be faulted in yet, and would thus be faulted in when the VMM tries to write to them, at which point we would need to ensure direct map entries are present.
Note that this patch does not include operations for manipulating the direct map state xarray, or for changing the direct map state of already existing folios. These routines are sketched out in the following patch, although they are not needed in this initial patch series.
When a gmem folio is freed, it is reinserted into the direct map (and failing this, marked as HWPOISON to avoid any other part of the kernel accidentally touching folios without complete direct map entries). The direct map configuration stored in the xarray is _not_ reset when the folio is freed (although this could be implemented by storing the reference to the xarray in the folio's private data instead of only the inode).
[1]: https://lore.kernel.org/kvm/cover.1726009989.git.ackerleytng@google.com/
Signed-off-by: Patrick Roy roypat@amazon.co.uk
---
 include/uapi/linux/kvm.h |   2 +
 virt/kvm/guest_memfd.c   | 150 +++++++++++++++++++++++++++++++++++----
 2 files changed, 137 insertions(+), 15 deletions(-)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 637efc0551453..81b0f4a236b8c 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1564,6 +1564,8 @@ struct kvm_create_guest_memfd { __u64 reserved[6]; };
+#define KVM_GMEM_NO_DIRECT_MAP (1ULL << 0) + #define KVM_PRE_FAULT_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_pre_fault_memory)
struct kvm_pre_fault_memory { diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 47a9f68f7b247..50ffc2ad73eda 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -4,6 +4,7 @@ #include <linux/kvm_host.h> #include <linux/pagemap.h> #include <linux/anon_inodes.h> +#include <linux/set_memory.h>
#include "kvm_mm.h"
@@ -13,6 +14,88 @@ struct kvm_gmem { struct list_head entry; };
+struct kvm_gmem_inode_private { + unsigned long long flags; + + /* + * direct map configuration of the gmem instance this private data + * is associated with. present indices indicate a desired direct map + * configuration deviating from default_direct_map_state (e.g. if + * default_direct_map_state is false/not present, then the xarray + * contains all indices for which direct map entries are restored). + */ + struct xarray direct_map_state; + bool default_direct_map_state; +}; + +static bool kvm_gmem_test_no_direct_map(struct kvm_gmem_inode_private *gmem_priv) +{ + return ((unsigned long)gmem_priv->flags & KVM_GMEM_NO_DIRECT_MAP) != 0; +} + +/* + * Configure the direct map present/not present state of @folio based on + * the xarray stored in the associated inode's private data. + * + * Assumes the folio lock is held. + */ +static int kvm_gmem_folio_configure_direct_map(struct folio *folio) +{ + struct inode *inode = folio_inode(folio); + struct kvm_gmem_inode_private *gmem_priv = inode->i_private; + bool default_state = gmem_priv->default_direct_map_state; + + pgoff_t start = folio_index(folio); + pgoff_t last = start + folio_nr_pages(folio) - 1; + + struct xarray *xa = &gmem_priv->direct_map_state; + unsigned long index; + void *entry; + + pgoff_t range_start = start; + unsigned long npages = 1; + int r = 0; + + if (!kvm_gmem_test_no_direct_map(gmem_priv)) + goto out; + + r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio), + default_state); + if (r) + goto out; + + if (!xa_find_after(xa, &range_start, last, XA_PRESENT)) + goto out_flush; + + xa_for_each_range(xa, index, entry, range_start, last) { + ++npages; + + if (index == range_start + npages) + continue; + + r = set_direct_map_valid_noflush(folio_file_page(folio, range_start), npages - 1, + !default_state); + if (r) + goto out_flush; + + range_start = index; + npages = 1; + } + + r = set_direct_map_valid_noflush(folio_file_page(folio, range_start), npages, + !default_state); + +out_flush: + /* + * Use PG_private to track that this folio has had potentially some of + * its direct map entries modified, so that we can restore them in free_folio. + */ + folio_set_private(folio); + flush_tlb_kernel_range(start, start + folio_size(folio)); +out: + return r; +} + /** * folio_file_pfn - like folio_file_page, but return a pfn. * @folio: The folio which contains this index. @@ -42,9 +125,19 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo return 0; }
-static inline void kvm_gmem_mark_prepared(struct folio *folio) + +static inline int kvm_gmem_finalize_folio(struct folio *folio) { + int r = kvm_gmem_folio_configure_direct_map(folio); + + /* + * Parts of the direct map might have been punched out, mark this folio + * as prepared even in the error case to avoid touching parts without + * direct map entries in a potential re-preparation. + */ folio_mark_uptodate(folio); + + return r; }
/* @@ -82,11 +175,10 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot, index = ALIGN_DOWN(index, 1 << folio_order(folio)); r = __kvm_gmem_prepare_folio(kvm, slot, index, folio); if (!r) - kvm_gmem_mark_prepared(folio); + r = kvm_gmem_finalize_folio(folio);
return r; } - /* * Returns a locked folio on success. The caller is responsible for * setting the up-to-date flag before the memory is mapped into the guest. @@ -249,6 +341,7 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset, static int kvm_gmem_release(struct inode *inode, struct file *file) { struct kvm_gmem *gmem = file->private_data; + struct kvm_gmem_inode_private *gmem_priv; struct kvm_memory_slot *slot; struct kvm *kvm = gmem->kvm; unsigned long index; @@ -279,13 +372,17 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
list_del(&gmem->entry);
+ gmem_priv = inode->i_private; + filemap_invalidate_unlock(inode->i_mapping);
mutex_unlock(&kvm->slots_lock); - xa_destroy(&gmem->bindings); kfree(gmem);
+ xa_destroy(&gmem_priv->direct_map_state); + kfree(gmem_priv); + kvm_put_kvm(kvm);
return 0; @@ -357,24 +454,37 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol return MF_DELAYED; }
-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE static void kvm_gmem_free_folio(struct folio *folio) { +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE struct page *page = folio_page(folio, 0); kvm_pfn_t pfn = page_to_pfn(page); int order = folio_order(folio);
kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order)); -} #endif
+ if (folio_test_private(folio)) { + unsigned long start = (unsigned long)folio_address(folio); + + int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio), + true); + /* + * There might be holes left in the folio, better make sure + * nothing tries to touch it again. + */ + if (r) + folio_set_hwpoison(folio); + + flush_tlb_kernel_range(start, start + folio_size(folio)); + } +} + static const struct address_space_operations kvm_gmem_aops = { .dirty_folio = noop_dirty_folio, .migrate_folio = kvm_gmem_migrate_folio, .error_remove_folio = kvm_gmem_error_folio, -#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE .free_folio = kvm_gmem_free_folio, -#endif };
static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path, @@ -401,6 +511,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) { const char *anon_name = "[kvm-gmem]"; struct kvm_gmem *gmem; + struct kvm_gmem_inode_private *gmem_priv; struct inode *inode; struct file *file; int fd, err; @@ -409,11 +520,14 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) if (fd < 0) return fd;
+ err = -ENOMEM; gmem = kzalloc(sizeof(*gmem), GFP_KERNEL); - if (!gmem) { - err = -ENOMEM; + if (!gmem) + goto err_fd; + + gmem_priv = kzalloc(sizeof(*gmem_priv), GFP_KERNEL); + if (!gmem_priv) goto err_fd; - }
file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem, O_RDWR, NULL); @@ -427,7 +541,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) inode = file->f_inode; WARN_ON(file->f_mapping != inode->i_mapping);
- inode->i_private = (void *)(unsigned long)flags; + inode->i_private = gmem_priv; inode->i_op = &kvm_gmem_iops; inode->i_mapping->a_ops = &kvm_gmem_aops; inode->i_mode |= S_IFREG; @@ -442,6 +556,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) xa_init(&gmem->bindings); list_add(&gmem->entry, &inode->i_mapping->i_private_list);
+ xa_init(&gmem_priv->direct_map_state); + gmem_priv->flags = flags; + fd_install(fd, file); return fd;
@@ -456,11 +573,14 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args) { loff_t size = args->size; u64 flags = args->flags; - u64 valid_flags = 0; + u64 valid_flags = KVM_GMEM_NO_DIRECT_MAP;
if (flags & ~valid_flags) return -EINVAL;
+ if ((flags & KVM_GMEM_NO_DIRECT_MAP) && !can_set_direct_map()) + return -EOPNOTSUPP; + if (size <= 0 || !PAGE_ALIGNED(size)) return -EINVAL;
@@ -679,7 +799,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long break; }
- folio_unlock(folio); WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) || (npages - i) < (1 << max_order));
@@ -695,7 +814,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long p = src ? src + i * PAGE_SIZE : NULL; ret = post_populate(kvm, gfn, pfn, p, max_order, opaque); if (!ret) - kvm_gmem_mark_prepared(folio); + ret = kvm_gmem_finalize_folio(folio); + folio_unlock(folio);
put_folio_and_exit: folio_put(folio);
On 10/30/24 08:49, Patrick Roy wrote:
Add a new flag, KVM_GMEM_NO_DIRECT_MAP, to KVM_CREATE_GUEST_MEMFD, which causes KVM to remove the folios backing this guest_memfd from the direct map after preparation/population. This flag is only exposed on architectures that can set the direct map (the notable exception here being ARM64 if the direct map is not set up at 4K granularity), otherwise EOPNOTSUPP is returned.
This patch also implements infrastructure for tracking (temporary) reinsertation of memory ranges into the direct map (more accurately: It allows recording that specific memory ranges deviate from the default direct map setup. Currently the default setup is always "direct map entries removed", but it is trivial to extend this with some "default_state_for_vm_type" mechanism to cover the pKVM usecase of memory starting off with directe map entries present). An xarray tracks this at page granularity, to be compatible with future hugepages usecases that might require subranges of hugetlb folios to have direct map entries restored. This xarray holds entries for each page that has a direct map state deviating from the default, and holes for all pages whose direct map state matches the default, the idea being that these "deviations" will be rare. kvm_gmem_folio_configure_direct_map applies the configuration stored in the xarray to a given folio, and is called for each new gmem folio after preparation/population.
Storing direct map state in the gmem inode has two advantages:
- We can track direct map state at page granularity even for huge folios (see also Ackerley's series on hugetlbfs support in guest_memfd [1])
- We can pre-configure the direct map state of not-yet-faulted in folios. This would for example be needed if a VMM is receiving a virtio buffer that the guest is requested it to fill. In this case, the pages backing the guest physical address range of the buffer might not be faulted in yet, and thus would be faulted when the VMM tries to write to them, and at this point we would need to ensure direct map entries are present)
Note that this patch does not include operations for manipulating the direct map state xarray, or for changing direct map state of already existing folios. These routines are sketched out in the following patch, although are not needed in this initial patch series.
When a gmem folio is freed, it is reinserted into the direct map (and failing this, marked as HWPOISON to avoid any other part of the kernel accidentally touching folios without complete direct map entries). The direct map configuration stored in the xarray is _not_ reset when the folio is freed (although this could be implemented by storing the reference to the xarray in the folio's private data instead of only the inode).
Signed-off-by: Patrick Roy roypat@amazon.co.uk
[...]
+static int kvm_gmem_folio_configure_direct_map(struct folio *folio)
+{
+	struct inode *inode = folio_inode(folio);
+	struct kvm_gmem_inode_private *gmem_priv = inode->i_private;
+	bool default_state = gmem_priv->default_direct_map_state;
+
+	pgoff_t start = folio_index(folio);
+	pgoff_t last = start + folio_nr_pages(folio) - 1;
pgoff_t last = folio_next_index(folio) - 1;
thanks, Mike
Implement (yet unused) routines for manipulating guest_memfd direct map state. This is largely for illustration purposes.
kvm_gmem_set_direct_map allows manipulating arbitrary pgoff_t ranges, even if the covered memory has not yet been faulted in (in which case the requested direct map state is recorded in the xarray and will be applied by kvm_gmem_folio_configure_direct_map after the folio is faulted in and prepared/populated). This can be used to realize private/shared conversions on not-yet-faulted in memory, as discussed in the guest_memfd upstream call [1].
kvm_gmem_folio_set_direct_map allows manipulating the direct map entries for a gmem folio that the caller already holds a reference for (whereas kvm_gmem_set_direct_map needs to look up all folios intersecting the given pgoff range in the filemap first).
The xa lock serializes calls to kvm_gmem_folio_set_direct_map and kvm_gmem_set_direct_map, while the read side (kvm_gmem_folio_configure_direct_map) is protected by RCU. This is sufficient to ensure consistency between the xarray and the folio's actual direct map state, as kvm_gmem_folio_configure_direct_map is called only for freshly allocated folios, and before the folio lock is dropped for the first time, meaning kvm_gmem_folio_configure_direct_map always does its set_direct_map calls before either of kvm_gmem_[folio_]set_direct_map gets a chance. Even if a concurrent call to kvm_gmem_[folio_]set_direct_map happens, this ensures a sort of "eventual consistency" between xarray and actual direct map configuration by the time kvm_gmem_[folio_]set_direct_map exits.
[1]: https://lore.kernel.org/kvm/4b49248b-1cf1-44dc-9b50-ee551e1671ac@redhat.com/
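To make the intended use concrete, a future conversion path inside virt/kvm/guest_memfd.c could drive the routine roughly as follows (kvm_gmem_convert_range is hypothetical and not part of this series; it is only a sketch):

/*
 * Hypothetical sketch, not part of this series: a shared<->private
 * conversion path could drive kvm_gmem_set_direct_map() like this.
 * "shared" means the host may access the range, so direct map entries
 * must be present; "private" means they must be absent.
 */
static int kvm_gmem_convert_range(struct inode *inode, pgoff_t start,
				  pgoff_t end, bool to_shared)
{
	return kvm_gmem_set_direct_map(inode, start, end, to_shared);
}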
Signed-off-by: Patrick Roy roypat@amazon.co.uk
---
 virt/kvm/guest_memfd.c | 125 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 125 insertions(+)
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 50ffc2ad73eda..54387828dcc6a 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -96,6 +96,131 @@ static int kvm_gmem_folio_configure_direct_map(struct folio *folio) return r; }
+/* + * Updates the range [@start, @end] in @gmem_priv's direct map state xarray to be @state, + * e.g. erasing entries in this range if @state is the default state, and creating + * entries otherwise. + * + * Assumes the xa_lock is held. + */ +static int __kvm_gmem_update_xarray(struct kvm_gmem_inode_private *gmem_priv, pgoff_t start, + pgoff_t end, bool state) +{ + struct xarray *xa = &gmem_priv->direct_map_state; + int r = 0; + + /* + * Cannot use xa_store_range, as multi-indexes cannot easily + * be partially updated. + */ + for (pgoff_t index = start; index < end; ++index) { + if (state == gmem_priv->default_direct_map_state) + __xa_erase(xa, index); + else + /* don't care _what_ we store in the xarray, only care about presence */ + __xa_store(xa, index, gmem_priv, GFP_KERNEL); + + r = xa_err(xa); + if (r) + goto out; + } + +out: + return r; +} + +static int __kvm_gmem_folio_set_direct_map(struct folio *folio, pgoff_t start, pgoff_t end, + bool state) +{ + unsigned long npages = end - start + 1; + struct page *first_page = folio_file_page(folio, start); + + int r = set_direct_map_valid_noflush(first_page, npages, state); + + flush_tlb_kernel_range((unsigned long)page_address(first_page), + (unsigned long)page_address(first_page) + + npages * PAGE_SIZE); + return r; +} + +/* + * Updates the direct map status for the given range from @start to @end (inclusive), returning + * -EINVAL if this range is not completely contained within @folio. Also updates the + * xarray stored in the private data of the inode @folio is attached to. + * + * Takes and drops the folio lock. + */ +static __always_unused int kvm_gmem_folio_set_direct_map(struct folio *folio, pgoff_t start, + pgoff_t end, bool state) +{ + struct inode *inode = folio_inode(folio); + struct kvm_gmem_inode_private *gmem_priv = inode->i_private; + int r = -EINVAL; + + if (!folio_contains(folio, start) || !folio_contains(folio, end)) + goto out; + + xa_lock(&gmem_priv->direct_map_state); + r = __kvm_gmem_update_xarray(gmem_priv, start, end, state); + if (r) + goto unlock_xa; + + folio_lock(folio); + r = __kvm_gmem_folio_set_direct_map(folio, start, end, state); + folio_unlock(folio); + +unlock_xa: + xa_unlock(&gmem_priv->direct_map_state); +out: + return r; +} + +/* + * Updates the direct map status for the given range from @start to @end (inclusive) + * of @inode. Folios in this range have their direct map entries reconfigured, + * and the xarray in the @inode's private data is updated. 
+ */ +static __always_unused int kvm_gmem_set_direct_map(struct inode *inode, pgoff_t start, + pgoff_t end, bool state) +{ + struct kvm_gmem_inode_private *gmem_priv = inode->i_private; + struct folio_batch fbatch; + pgoff_t index = start; + unsigned int count, i; + int r = 0; + + xa_lock(&gmem_priv->direct_map_state); + + r = __kvm_gmem_update_xarray(gmem_priv, start, end, state); + if (r) + goto out; + + folio_batch_init(&fbatch); + while (!filemap_get_folios(inode->i_mapping, &index, end, &fbatch) && !r) { + count = folio_batch_count(&fbatch); + for (i = 0; i < count; i++) { + struct folio *folio = fbatch.folios[i]; + pgoff_t folio_start = max(folio_index(folio), start); + pgoff_t folio_end = + min(folio_index(folio) + folio_nr_pages(folio), + end); + + folio_lock(folio); + r = __kvm_gmem_folio_set_direct_map(folio, folio_start, + folio_end, state); + folio_unlock(folio); + + if (r) + break; + } + folio_batch_release(&fbatch); + } + + xa_unlock(&gmem_priv->direct_map_state); +out: + return r; +} + /** * folio_file_pfn - like folio_file_page, but return a pfn. * @folio: The folio which contains this index.
On 10/30/24 08:49, Patrick Roy wrote:
Implement (yet unused) routines for manipulating guest_memfd direct map state. This is largely for illustration purposes.
kvm_gmem_set_direct_map allows manipulating arbitrary pgoff_t ranges, even if the covered memory has not yet been faulted in (in which case the requested direct map state is recorded in the xarray and will be applied by kvm_gmem_folio_configure_direct_map after the folio is faulted in and prepared/populated). This can be used to realize private/shared conversions on not-yet-faulted in memory, as discussed in the guest_memfd upstream call [1].
kvm_gmem_folio_set_direct_map allows manipulating the direct map entries for a gmem folio that the caller already holds a reference for (whereas kvm_gmem_set_direct_map needs to look up all folios intersecting the given pgoff range in the filemap first).
The xa lock serializes calls to kvm_gmem_folio_set_direct_map and kvm_gmem_set_direct_map, while the read side (kvm_gmem_folio_configure_direct_map) is protected by RCU. This is sufficient to ensure consistency between the xarray and the folio's actual direct map state, as kvm_gmem_folio_configure_direct_map is called only for freshly allocated folios, and before the folio lock is dropped for the first time, meaning kvm_gmem_folio_configure_direct_map always does it's set_direct_map calls before either of kvm_gmem_[folio_]set_direct_map get a chance. Even if a concurrent call to kvm_gmem_[folio_]set_direct_map happens, this ensures a sort of "eventual consistency" between xarray and actual direct map configuration by the time kvm_gmem_[folio_]set_direct_map exits.
Signed-off-by: Patrick Roy roypat@amazon.co.uk
virt/kvm/guest_memfd.c | 125 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 125 insertions(+)
[...]
+static __always_unused int kvm_gmem_set_direct_map(struct inode *inode, pgoff_t start,
+						    pgoff_t end, bool state)
+{
+	struct kvm_gmem_inode_private *gmem_priv = inode->i_private;
+	struct folio_batch fbatch;
+	pgoff_t index = start;
+	unsigned int count, i;
+	int r = 0;
+
+	xa_lock(&gmem_priv->direct_map_state);
+
+	r = __kvm_gmem_update_xarray(gmem_priv, start, end, state);
+	if (r)
+		goto out;
	if (r) {
		xa_unlock(&gmem_priv->direct_map_state);
		goto out;
	}
thanks,
Mike
Add tracepoints to kvm_gmem_set_direct_map and kvm_gmem_folio_set_direct_map.
The above operations can cause folios to be inserted into/removed from the direct map. We want to be able to make sure that only those gmem folios that we expect KVM to access are ever reinserted into the direct map, and that all folios that are temporarily reinserted are also removed again at a later point. Processing ftrace output is one way to verify this.
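For instance, a test harness could enable the new tracepoint through tracefs before exercising conversions (a sketch; the tracefs path is an assumption based on the event being declared under the kvm trace system):

#include <fcntl.h>
#include <unistd.h>

/* Sketch: enable the tracepoint added by this patch via tracefs. */
static int enable_gmem_direct_map_tracing(void)
{
	int fd = open("/sys/kernel/tracing/events/kvm/"
		      "kvm_gmem_direct_map_state_change/enable", O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, "1", 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}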
Signed-off-by: Patrick Roy roypat@amazon.co.uk
---
 include/trace/events/kvm.h | 22 ++++++++++++++++++++++
 virt/kvm/guest_memfd.c     |  5 +++++
 2 files changed, 27 insertions(+)
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h index 74e40d5d4af42..f3d852c18fa08 100644 --- a/include/trace/events/kvm.h +++ b/include/trace/events/kvm.h @@ -489,6 +489,28 @@ TRACE_EVENT(kvm_test_age_hva, TP_printk("mmu notifier test age hva: %#016lx", __entry->hva) );
+#ifdef CONFIG_KVM_PRIVATE_MEM +TRACE_EVENT(kvm_gmem_direct_map_state_change, + TP_PROTO(pgoff_t start, pgoff_t end, bool state), + TP_ARGS(start, end, state), + + TP_STRUCT__entry( + __field(pgoff_t, start) + __field(pgoff_t, end) + __field(bool, state) + ), + + TP_fast_assign( + __entry->start = start; + __entry->end = end; + __entry->state = state; + ), + + TP_printk("changed direct map state of guest_memfd range %lu to %lu to %s", + __entry->start, __entry->end, __entry->state ? "present" : "not present") +); +#endif + #endif /* _TRACE_KVM_MAIN_H */
/* This part must be outside protection */ diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 54387828dcc6a..a0b3b9cacd361 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -7,6 +7,7 @@ #include <linux/set_memory.h>
#include "kvm_mm.h" +#include "trace/events/kvm.h"
struct kvm_gmem { struct kvm *kvm; @@ -169,6 +170,8 @@ static __always_unused int kvm_gmem_folio_set_direct_map(struct folio *folio, pg r = __kvm_gmem_folio_set_direct_map(folio, start, end, state); folio_unlock(folio);
+ trace_kvm_gmem_direct_map_state_change(start, end, state); + unlock_xa: xa_unlock(&gmem_priv->direct_map_state); out: @@ -216,6 +219,8 @@ static __always_unused int kvm_gmem_set_direct_map(struct inode *inode, pgoff_t folio_batch_release(&fbatch); }
+ trace_kvm_gmem_direct_map_state_change(start, end, state); + xa_unlock(&gmem_priv->direct_map_state); out: return r;
Signed-off-by: Patrick Roy roypat@amazon.co.uk
---
 Documentation/virt/kvm/api.rst | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index edc070c6e19b2..c8e21c523411c 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6382,6 +6382,20 @@ a single guest_memfd file, but the bound ranges must not overlap).
See KVM_SET_USER_MEMORY_REGION2 for additional details.
+The following flags are defined: + +KVM_GMEM_NO_DIRECT_MAP + Ensure memory backing this guest_memfd inode is unmapped from the kernel's + address space. + +Errors: + + ========== =============================================================== + EOPNOTSUPP `KVM_GMEM_NO_DIRECT_MAP` was set in `flags`, but the host does + not support direct map manipulations. + ========== =============================================================== + + 4.143 KVM_PRE_FAULT_MEMORY ---------------------------
Also adjust test_create_guest_memfd_invalid, as BIT(0) is now a valid value for flags (note that this also fixes an issue where the loop in test_create_guest_memfd_invalid was a no-op; I posted that fix as a separate patch last week [1]).
[1]: https://lore.kernel.org/kvm/20241024095956.3668818-1-roypat@amazon.co.uk/
Signed-off-by: Patrick Roy roypat@amazon.co.uk
---
 tools/testing/selftests/kvm/guest_memfd_test.c           | 2 +-
 .../selftests/kvm/x86_64/private_mem_conversions_test.c  | 7 ++++---
 2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c index ba0c8e9960358..d04f7ff3dfb15 100644 --- a/tools/testing/selftests/kvm/guest_memfd_test.c +++ b/tools/testing/selftests/kvm/guest_memfd_test.c @@ -134,7 +134,7 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm) size); }
- for (flag = 0; flag; flag <<= 1) { + for (flag = BIT(1); flag; flag <<= 1) { fd = __vm_create_guest_memfd(vm, page_size, flag); TEST_ASSERT(fd == -1 && errno == EINVAL, "guest_memfd() with flag '0x%lx' should fail with EINVAL", diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c index 82a8d88b5338e..dfc78781e93b8 100644 --- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c +++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c @@ -367,7 +367,7 @@ static void *__test_mem_conversions(void *__vcpu) }
static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t nr_vcpus, - uint32_t nr_memslots) + uint32_t nr_memslots, uint64_t gmem_flags) { /* * Allocate enough memory so that each vCPU's chunk of memory can be @@ -394,7 +394,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
- memfd = vm_create_guest_memfd(vm, memfd_size, 0); + memfd = vm_create_guest_memfd(vm, memfd_size, gmem_flags);
for (i = 0; i < nr_memslots; i++) vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i, @@ -477,7 +477,8 @@ int main(int argc, char *argv[]) } }
- test_mem_conversions(src_type, nr_vcpus, nr_memslots); + test_mem_conversions(src_type, nr_vcpus, nr_memslots, 0); + test_mem_conversions(src_type, nr_vcpus, nr_memslots, KVM_GMEM_NO_DIRECT_MAP);
return 0; }
On 30.10.24 14:49, Patrick Roy wrote:
Unmapping virtual machine guest memory from the host kernel's direct map is a successful mitigation against Spectre-style transient execution issues: If the kernel page tables do not contain entries pointing to guest memory, then any attempted speculative read through the direct map will necessarily be blocked by the MMU before any observable microarchitectural side-effects happen. This means that Spectre-gadgets and similar cannot be used to target virtual machine memory. Roughly 60% of speculative execution issues fall into this category [1, Table 1].
This patch series extends guest_memfd with the ability to remove its memory from the host kernel's direct map, to be able to attain the above protection for KVM guests running inside guest_memfd.
=== Changes to v2 ===
- Handle direct map removal for physically contiguous pages in arch code (Mike R.)
- Track the direct map state in guest_memfd itself instead of at the folio level, to prepare for huge pages support (Sean C.)
- Allow configuring direct map state of not-yet faulted in memory (Vishal A.)
- Pay attention to alignment in ftrace structs (Steven R.)
Most significantly, I've reduced the patch series to focus only on direct map removal for guest_memfd for now, leaving the whole "how to do non-CoCo VMs in guest_memfd" for later. If this separation is acceptable, then I think I can drop the RFC tag in the next revision (I've mainly kept it here because I'm not entirely sure what to do with patches 3 and 4).
Hi,
keeping upcoming "shared and private memory in guest_memfd" in mind, I assume the focus would be to only remove the direct map for private memory?
So in the current upstream state, you would only be removing the direct map for private memory, currently translating to "encrypted"/"protected" memory that is inaccessible either way already.
Correct?
On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
On 30.10.24 14:49, Patrick Roy wrote:
Unmapping virtual machine guest memory from the host kernel's direct map is a successful mitigation against Spectre-style transient execution issues: If the kernel page tables do not contain entries pointing to guest memory, then any attempted speculative read through the direct map will necessarily be blocked by the MMU before any observable microarchitectural side-effects happen. This means that Spectre-gadgets and similar cannot be used to target virtual machine memory. Roughly 60% of speculative execution issues fall into this category [1, Table 1].
This patch series extends guest_memfd with the ability to remove its memory from the host kernel's direct map, to be able to attain the above protection for KVM guests running inside guest_memfd.
=== Changes to v2 ===
- Handle direct map removal for physically contiguous pages in arch code (Mike R.)
- Track the direct map state in guest_memfd itself instead of at the folio level, to prepare for huge pages support (Sean C.)
- Allow configuring direct map state of not-yet faulted in memory (Vishal A.)
- Pay attention to alignment in ftrace structs (Steven R.)
Most significantly, I've reduced the patch series to focus only on direct map removal for guest_memfd for now, leaving the whole "how to do non-CoCo VMs in guest_memfd" for later. If this separation is acceptable, then I think I can drop the RFC tag in the next revision (I've mainly kept it here because I'm not entirely sure what to do with patches 3 and 4).
Hi,
keeping upcoming "shared and private memory in guest_memfd" in mind, I assume the focus would be to only remove the direct map for private memory?
So in the current upstream state, you would only be removing the direct map for private memory, currently translating to "encrypted"/"protected" memory that is inaccessible either way already.
Correct?
Yeah, with the upcoming "shared and private" stuff, I would expect that the shared<->private conversions would call the routines from patch 3 to restore direct map entries on private->shared, and zap them on shared->private.
But as you said, the current upstream state has no notion of "shared" memory in guest_memfd, so everything is private and thus everything is direct map removed (although it is indeed already inaccessible anyway for TDX and friends. That's what makes this patch series a bit awkward :( )
-- Cheers,
David / dhildenb
Best, Patrick
On 2024-10-31 at 10:42+0000 Patrick Roy wrote:
On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
On 30.10.24 14:49, Patrick Roy wrote:
Most significantly, I've reduced the patch series to focus only on direct map removal for guest_memfd for now, leaving the whole "how to do non-CoCo VMs in guest_memfd" for later. If this separation is acceptable, then I think I can drop the RFC tag in the next revision (I've mainly kept it here because I'm not entirely sure what to do with patches 3 and 4).
Hi,
keeping upcoming "shared and private memory in guest_memfd" in mind, I assume the focus would be to only remove the direct map for private memory?
So in the current upstream state, you would only be removing the direct map for private memory, currently translating to "encrypted"/"protected" memory that is inaccessible either way already.
Correct?
Yeah, with the upcoming "shared and private" stuff, I would expect that the shared<->private conversions would call the routines from patch 3 to restore direct map entries on private->shared, and zap them on shared->private.
But as you said, the current upstream state has no notion of "shared" memory in guest_memfd, so everything is private and thus everything is direct map removed (although it is indeed already inaccessible anyway for TDX and friends. That's what makes this patch series a bit awkward :( )
TDX and SEV encryption happens between the core and main memory, so cached guest data we're most concerned about for transient execution attacks isn't necessarily inaccessible.
I'd be interested what Intel, AMD, and other folks think on this, but I think direct map removal is worthwhile for CoCo cases as well.
Derek
+David Kaplan
On Thu, Oct 31, 2024, Derek Manwaring wrote:
On 2024-10-31 at 10:42+0000 Patrick Roy wrote:
On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
On 30.10.24 14:49, Patrick Roy wrote:
Most significantly, I've reduced the patch series to focus only on direct map removal for guest_memfd for now, leaving the whole "how to do non-CoCo VMs in guest_memfd" for later. If this separation is acceptable, then I think I can drop the RFC tag in the next revision (I've mainly kept it here because I'm not entirely sure what to do with patches 3 and 4).
Hi,
keeping upcoming "shared and private memory in guest_memfd" in mind, I assume the focus would be to only remove the direct map for private memory?
So in the current upstream state, you would only be removing the direct map for private memory, currently translating to "encrypted"/"protected" memory that is inaccessible either way already.
Correct?
Yeah, with the upcoming "shared and private" stuff, I would expect that the shared<->private conversions would call the routines from patch 3 to restore direct map entries on private->shared, and zap them on shared->private.
But as you said, the current upstream state has no notion of "shared" memory in guest_memfd, so everything is private and thus everything is direct map removed (although it is indeed already inaccessible anyway for TDX and friends. That's what makes this patch series a bit awkward :( )
TDX and SEV encryption happens between the core and main memory, so cached guest data we're most concerned about for transient execution attacks isn't necessarily inaccessible.
I'd be interested what Intel, AMD, and other folks think on this, but I think direct map removal is worthwhile for CoCo cases as well.
Removal of the direct map entries for guest private PFNs likely won't affect the ability of an attacker to glean information from the unencrypted data that's in the CPU caches, at least not on x86. Both TDX and SEV steal physical address bit(s) for tagging encrypted memory, and unless things have changed on recent AMD microarchitectures (I'm 99.9% certain Intel CPUs haven't changed), those stolen address bits are propagated into the caches. I.e. the encrypted and unencrypted forms of a given PFN are actually two different physical addresses under the hood.
I don't actually know how SEV uses the stolen PA bits though. I don't see how it could simply be the ASID, because IIUC, AMD CPUs allow for more unique SEV-capable ASIDs than uniquely addressable PAs by the number of stolen bits. But I would be very surprised if the tag for the cache isn't guaranteed to be unique per encryption key.
David?
On Fri, Nov 1, 2024, Sean Christopherson wrote:
+David Kaplan
On Thu, Oct 31, 2024, Derek Manwaring wrote:
On 2024-10-31 at 10:42+0000 Patrick Roy wrote:
On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
On 30.10.24 14:49, Patrick Roy wrote:
Most significantly, I've reduced the patch series to focus only on direct map removal for guest_memfd for now, leaving the whole "how to do non-CoCo VMs in guest_memfd" for later. If this separation is acceptable, then I think I can drop the RFC tag in the next revision (I've mainly kept it here because I'm not entirely sure what to do with patches 3 and 4).
Hi,
keeping upcoming "shared and private memory in guest_memfd" in mind, I assume the focus would be to only remove the direct map for
private memory?
So in the current upstream state, you would only be removing the direct map for private memory, currently translating to
"encrypted"/"protected"
memory that is inaccessible either way already.
Correct?
Yeah, with the upcoming "shared and private" stuff, I would expect that the shared<->private conversions would call the routines from patch 3 to restore direct map entries on private->shared, and zap them on shared->private.
But as you said, the current upstream state has no notion of "shared" memory in guest_memfd, so everything is private and thus everything is direct map removed (although it is indeed already inaccessible anyway for TDX and friends. That's what makes this patch series a bit awkward :( )
TDX and SEV encryption happens between the core and main memory, so cached guest data we're most concerned about for transient execution attacks isn't necessarily inaccessible.
I'd be interested what Intel, AMD, and other folks think on this, but I think direct map removal is worthwhile for CoCo cases as well.
Removal of the direct map entries for guest private PFNs likely won't affect the ability of an attacker to glean information from the unencrypted data that's in the CPU caches, at least not on x86. Both TDX and SEV steal physical address bit(s) for tagging encrypted memory, and unless things have changed on recent AMD microarchitectures (I'm 99.9% certain Intel CPUs haven't changed), those stolen address bits are propagated into the caches. I.e. the encrypted and unencrypted forms of a given PFN are actually two different physical addresses under the hood.
I don't actually know how SEV uses the stolen PA bits though. I don't see how it could simply be the ASID, because IIUC, AMD CPUs allow for more unique SEV-capable ASIDs than uniquely addressable PAs by the number of stolen bits. But I would be very surprised if the tag for the cache isn't guaranteed to be unique per encryption key.
David?
How the stolen PA bits are used is a microarchitectural implementation detail. It is true that the tag will be unique per encryption key. Beyond that, I'm not sure what other details are relevant to SW.
--David Kaplan
On 10/31/24 17:10, Manwaring, Derek wrote:
TDX and SEV encryption happens between the core and main memory, so cached guest data we're most concerned about for transient execution attacks isn't necessarily inaccessible.
I'd be interested what Intel, AMD, and other folks think on this, but I think direct map removal is worthwhile for CoCo cases as well.
I'm not sure specifically which attacks you have in mind. Also, don't forget that TDX doesn't get any intra-system security from encryption itself. All that the encryption does is prevent some physical attacks.
The main thing I think you want to keep in mind is mentioned in the "TDX Module v1.5 Base Architecture Specification"[1]:
Any software except guest TD or TDX module must not be able to speculatively or non-speculatively access TD private memory,
That's a pretty broad claim and it involves mitigations in hardware and the TDX module.
I _think_ you might be thinking of attacks like MDS where some random microarchitectural buffer contains guest data after a VM exit and then an attacker extracts it. Direct map removal doesn't affect these buffers and doesn't mitigate an attacker getting the data out. TDX relies on other defenses.
As for the CPU caches, direct map removal doesn't help or hurt anything. A malicious VMM would just map the guest data if it could and try to extract it if that were possible. If the attack is mitigated when the data is _mapped_, then it's certainly not possible _unmapped_.
So why bother with direct map removal for TDX? A VMM write to TD private data causes machine checks. So any kernel bug that even accidentally writes to kernel memory can bring the whole system down. Not nice.
+Elena
On 2024-11-01 at 16:06+0000, Dave Hansen wrote:
On 10/31/24 17:10, Manwaring, Derek wrote:
TDX and SEV encryption happens between the core and main memory, so cached guest data we're most concerned about for transient execution attacks isn't necessarily inaccessible.
I'd be interested what Intel, AMD, and other folks think on this, but I think direct map removal is worthwhile for CoCo cases as well.
I'm not sure specifically which attacks you have in mind. [...]
I _think_ you might be thinking of attacks like MDS where some random microarchitectural buffer contains guest data after a VM exit and then an attacker extracts it. Direct map removal doesn't affect these buffers and doesn't mitigate an attacker getting the data out.
Right, the only attacks we can thwart with direct map removal are transient execution attacks on the host kernel whose leak origin is "Mapped memory" in Table 1 of the Quarantine paper [2]. Maybe the simplest hypothetical to consider here is a new spectre v1 gadget in the host kernel.
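For concreteness, the gadget shape in question is the classic bounds-check-bypass pattern; something like the sketch below (purely illustrative names, not an actual kernel code path). The out-of-bounds load only leaves a cache footprint if the speculatively computed address has a valid direct map PTE, which is exactly what this series takes away for guest memory:

#include <stddef.h>
#include <stdint.h>

/* Illustrative globals; in reality these would be attacker-reachable kernel
 * data plus a user-observable probe array. */
static size_t array1_size = 16;
static uint8_t array1[16];
static uint8_t probe[256 * 4096];

uint8_t victim_gadget(size_t untrusted_idx)
{
	if (untrusted_idx < array1_size) {
		/* Under branch misprediction this load runs speculatively with
		 * an out-of-bounds index that can point at the direct map alias
		 * of guest memory; the dependent probe[] access then leaks the
		 * value through the cache. Without a direct map PTE, the load
		 * is blocked by the MMU before any side effect is observable. */
		uint8_t secret = array1[untrusted_idx];
		return probe[(size_t)secret * 4096];
	}
	return 0;
}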
The main thing I think you want to keep in mind is mentioned in the "TDX Module v1.5 Base Architecture Specification"[1]:
Any software except guest TD or TDX module must not be able to speculatively or non-speculatively access TD private memory,
That's a pretty broad claim and it involves mitigations in hardware and the TDX module.
Thank you, I hadn't seen that. That is a very strong claim as far as preventing speculative access; I didn't realize Intel claimed that about TDX. The comma followed by "to detect if a prior corruption attempt was successful" makes me wonder a bit if the statement is not quite as broad as it sounds, but maybe that's just meant to relate it to the integrity section?
If the attack is mitigated when the data is _mapped_, then it's certainly not possible _unmapped_.
So why bother with direct map removal for TDX? A VMM write to TD private data causes machine checks. So any kernel bug that even accidentally writes to kernel memory can bring the whole system down. Not nice.
Fair enough. It hasn't been clear to me if there is a machine check when the host kernel accesses guest memory only transiently. I was assuming there is not. But if other mitigations completely prevent even speculative access of TD private memory like you're saying, then I agree there is nothing to gain from direct map removal in the TDX case.
Derek
On 11/1/24 09:56, Manwaring, Derek wrote: ...
Any software except guest TD or TDX module must not be able to speculatively or non-speculatively access TD private memory,
That's a pretty broad claim and it involves mitigations in hardware and the TDX module.
Thank you, I hadn't seen that. That is a very strong claim as far as preventing speculative access; I didn't realize Intel claimed that about TDX. The comma followed by "to detect if a prior corruption attempt was successful" makes me wonder a bit if the statement is not quite as broad as it sounds, but maybe that's just meant to relate it to the integrity section?
I think it's just relating it to the integrity section.
If the attack is mitigated when the data is _mapped_, then it's certainly not possible _unmapped_.
So why bother with direct map removal for TDX? A VMM write to TD private data causes machine checks. So any kernel bug that even accidentally writes to kernel memory can bring the whole system down. Not nice.
Fair enough. It hasn't been clear to me if there is a machine check when the host kernel accesses guest memory only transiently. I was assuming there is not.
Previous generations of hardware have had some nastiness in this area. Speculative accesses were (I think) logged in the machine check banks, but wouldn't raise an #MC. I believe TDX-capable hardware won't even log speculative accesses.
But if other mitigations completely prevent even speculative access of TD private memory like you're saying, then I agree there is nothing to gain from direct map removal in the TDX case.
Remember, guest unmapping is done in the VMM. The VMM is not trusted in the TDX (or SEV-SNP) model. If any VMM can harm the protections on guest memory, then we have a big problem.
That isn't to say a big problem can't happen. Say some crazy attack comes to light where the VMM can attack TDX if the VMM has a mapping for guest (or TDX module) memory. Crazier things have happened, and guest unmapping _would_ help there, if you trusted the VMM.
Basically, I think guest unmapping only helps system security as a whole if you must _already_ trust the VMM.
On 2024-11-01 at 17:20+0000, Dave Hansen wrote:
On 11/1/24 09:56, Manwaring, Derek wrote:
But if other mitigations completely prevent even speculative access of TD private memory like you're saying, then agree nothing to gain from direct map removal in the TDX case.
Remember, guest unmapping is done in the VMM. The VMM is not trusted in the TDX (or SEV-SNP) model. If any VMM can harm the protections on guest memory, then we have a big problem.
That isn't to say a big problem can't happen. Say some crazy attack comes to light where the VMM can attack TDX if the VMM has a mapping for guest (or TDX module) memory. Crazier things have happened, and guest unmapping _would_ help there, if you trusted the VMM.
Basically, I think guest unmapping only helps system security as a whole if you must _already_ trust the VMM.
Yeah that makes a lot of sense. I just view the ideal outcome as a composition of strong, independent defenses. So as a guest you have the confidentiality and integrity guarantees of the hardware, *and* you have an up-to-date, good-hygiene (albeit not attested) host kernel just in case some crazy attack/gap comes up.
From that standpoint I'm still tempted to turn the question around a bit for the host kernel's perspective. Like if the host kernel should not (and indeed cannot with TDX controls in place) access guest private memory, why not remove it from the direct map?
Derek
On 11/1/24 11:31, Manwaring, Derek wrote:
From that standpoint I'm still tempted to turn the question around a bit
for the host kernel's perspective. Like if the host kernel should not (and indeed cannot with TDX controls in place) access guest private memory, why not remove it from the direct map?
Pretend that the machine check warts aren't there.
It costs performance and complexity, for an only theoretical gain. This is especially true for a VMM that's not just doing confidential guests. You fracture the direct map to pieces forever (for now).
On 2024-11-01 at 18:43+0000, Dave Hansen wrote:
On 11/1/24 11:31, Manwaring, Derek wrote:
From that standpoint I'm still tempted to turn the question around a bit for the host kernel's perspective. Like if the host kernel should not (and indeed cannot with TDX controls in place) access guest private memory, why not remove it from the direct map?
Pretend that the machine check warts aren't there.
It costs performance and complexity, for an only theoretical gain. This is especially true for a VMM that's not just doing confidential guests. You fracture the direct map to pieces forever (for now).
I'm hopeful we'll navigate the complexity in a worthwhile way for the non-CoCo case. Assuming we get there and have the option to remove memory from the direct map, users with CoCo hardware could choose if they want to do both on their host. For me that's a sensible choice, but maybe that's just me.
As far as performance, are you talking about just the fracturing or something beyond that? The data Mike brought to LSFMMBPF 2023 showed the perf impact from direct map fragmentation for memfd_secret isn't "that bad" [1].
Derek
On 11/1/24 12:29, Manwaring, Derek wrote:
As far as performance, are you talking about just the fracturing or something beyond that? The data Mike brought to LSFMMBPF 2023 showed the perf impact from direct map fragmentation for memfd_secret isn't "that bad" [1].
Just the fracturing.
Mike's data are interesting, but I still think we're pretty far away from saying that the kernel doesn't need large mappings.
+Elena
On 2024-11-01 at 16:06+0000, Dave Hansen wrote:
On 10/31/24 17:10, Manwaring, Derek wrote:
TDX and SEV encryption happens between the core and main memory, so cached guest data we're most concerned about for transient execution attacks isn't necessarily inaccessible.
I'd be interested what Intel, AMD, and other folks think on this, but I think direct map removal is worthwhile for CoCo cases as well.
I'm not sure specifically which attacks you have in mind. [...]
I _think_ you might be thinking of attacks like MDS where some random microarchitectural buffer contains guest data after a VM exit and then an attacker extracts it. Direct map removal doesn't affect these buffers and doesn't mitigate an attacker getting the data out.
Right, the only attacks we can thwart with direct map removal are transient execution attacks on the host kernel whose leak origin is "Mapped memory" in Table 1 of the Quarantine paper [2]. Maybe the simplest hypothetical to consider here is a new spectre v1 gadget in the host kernel.
The main thing I think you want to keep in mind is mentioned in the "TDX Module v1.5 Base Architecture Specification"[1]:
Any software except guest TD or TDX module must not be able to speculatively or non-speculatively access TD private memory,
That's a pretty broad claim and it involves mitigations in hardware and the TDX module.
Thank you, I hadn't seen that. That is a very strong claim as far as preventing speculative access; I didn't realize Intel claimed that about TDX. The comma followed by "to detect if a prior corruption attempt was successful" makes me wonder a bit if the statement is not quite as broad as it sounds, but maybe that's just meant to relate it to the integrity section?
This statement *is* for the integrity section. We have separate TDX guidance on side-channels (including speculative) [3], and some speculative attacks that affect confidentiality (for example spectre v1) are listed as not covered by TDX but remaining SW responsibility (as they are now).
[3] https://www.intel.com/content/www/us/en/developer/articles/technical/softwar...
Best Regards, Elena.
On 2024-11-04 at 08:33+0000, Elena Reshetova wrote:
This statement *is* for the integrity section. We have separate TDX guidance on side-channels (including speculative) [3], and some speculative attacks that affect confidentiality (for example spectre v1) are listed as not covered by TDX but remaining SW responsibility (as they are now).
Thanks for the additional info, Elena. Given that clarification, I definitely see direct map removal and TDX as complementary.
Derek
On 2024-11-04 at 08:33+0000, Elena Reshetova wrote:
This statement *is* for the integrity section. We have separate TDX guidance on side-channels (including speculative) [3], and some speculative attacks that affect confidentiality (for example spectre v1) are listed as not covered by TDX but remaining SW responsibility (as they are now).
Thanks for the additional info, Elena. Given that clarification, I definitely see direct map removal and TDX as complementary.
Just to clarify, to make sure my comment is not misunderstood. What I meant is that we cannot generally assume that confidentiality leaks from CoCo guests to host/VMM via speculative channels are completely impossible. Spectre V1 is a prime example of such a possible leak. Dave also elaborated on other potential vectors earlier.
The usefulness of direct map removal for CoCo guests as a concrete mitigation for certain types of memory attacks must be precisely evaluated per each attack vector, attack vector direction (host -> guest, guest -> host, etc.) and per each countermeasure that CoCo vendors provide for each such case. I don't know if there is any existing study that examines this for major CoCo vendors. I think this is what must be done for this work in order to have a strong reasoning for its usefulness.
Best Regards, Elena.
On 2024-11-08 at 10:36, Elena Reshetova wrote:
On 2024-11-06 at 17:04, Derek Manwaring wrote:
On 2024-11-04 at 08:33+0000, Elena Reshetova wrote:
This statement *is* for the integrity section. We have separate TDX guidance on side-channels (including speculative) [3], and some speculative attacks that affect confidentiality (for example spectre v1) are listed as not covered by TDX but remaining SW responsibility (as they are now).
Thanks for the additional info, Elena. Given that clarification, I definitely see direct map removal and TDX as complementary.
Just to clarify, to make sure my comment is not misunderstood. What I meant is that we cannot generally assume that confidentiality leaks from CoCo guests to host/VMM via speculative channels are completely impossible. Spectre V1 is a prime example of such a possible leak. Dave also elaborated on other potential vectors earlier.
The usefulness of direct map removal for CoCo guests as a concrete mitigation for certain types of memory attacks must be precisely evaluated per each attack vector, attack vector direction (host -> guest, guest -> host, etc.) and per each countermeasure that CoCo vendors provide for each such case. I don't know if there is any existing study that examines this for major CoCo vendors. I think this is what must be done for this work in order to have a strong reasoning for its usefulness.
Thanks, that makes sense. I'm a little hyperfocused on guest->host which is the opposite direction of what is generally used for the CoCo threat model. I think what both cases care about though is guest->guest. For me, guest->host matters because it's a route for guest->guest (at least in the non-CoCo world). There's some good discussion on this in David's series on attack vector controls [1].
Like you mention, beyond direction it matters which CoCo countermeasures are at play and how far they go during transient execution. That part is not clear to me for the host->guest direction involving the direct map, but agree any countermeasures like direct map removal should be evaluated based on a better understanding of what those existing countermeasures already cover and what attack is intended to be mitigated.
Derek
[1] https://lore.kernel.org/lkml/LV3PR12MB92658EA6CCF18F90DAAA168394532@LV3PR12M...
On 31.10.24 11:42, Patrick Roy wrote:
On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
On 30.10.24 14:49, Patrick Roy wrote:
Unmapping virtual machine guest memory from the host kernel's direct map is a successful mitigation against Spectre-style transient execution issues: If the kernel page tables do not contain entries pointing to guest memory, then any attempted speculative read through the direct map will necessarily be blocked by the MMU before any observable microarchitectural side-effects happen. This means that Spectre-gadgets and similar cannot be used to target virtual machine memory. Roughly 60% of speculative execution issues fall into this category [1, Table 1].
This patch series extends guest_memfd with the ability to remove its memory from the host kernel's direct map, to be able to attain the above protection for KVM guests running inside guest_memfd.
=== Changes to v2 ===
- Handle direct map removal for physically contiguous pages in arch code (Mike R.)
- Track the direct map state in guest_memfd itself instead of at the folio level, to prepare for huge pages support (Sean C.)
- Allow configuring direct map state of not-yet faulted in memory (Vishal A.)
- Pay attention to alignment in ftrace structs (Steven R.)
Most significantly, I've reduced the patch series to focus only on direct map removal for guest_memfd for now, leaving the whole "how to do non-CoCo VMs in guest_memfd" for later. If this separation is acceptable, then I think I can drop the RFC tag in the next revision (I've mainly kept it here because I'm not entirely sure what to do with patches 3 and 4).
Hi,
keeping upcoming "shared and private memory in guest_memfd" in mind, I assume the focus would be to only remove the direct map for private memory?
So in the current upstream state, you would only be removing the direct map for private memory, currently translating to "encrypted"/"protected" memory that is inaccessible either way already.
Correct?
Yeah, with the upcoming "shared and private" stuff, I would expect that the shared<->private conversions would call the routines from patch 3 to restore direct map entries on private->shared, and zap them on shared->private.
I wanted to follow up on the discussion we had in the bi-weekly call.
We talked about shared (faultable) vs. private (unfaultable), and how it would interact with the directmap patches here.
As discussed, having private (unfaultable) memory with the direct-map removed and shared (faultable) memory with the direct-mapping can make sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo, the discussion here seems to indicate that it might currently not be required.
So one thing we could do is that shared (faultable) will have a direct mapping and be gup-able and private (unfaultable) memory will not have a direct mapping and is, by design, not gup-able.
Maybe it could make sense to not have a direct map for all guest_memfd memory, making it behave like secretmem (and it would be easy to implement)? But I'm not sure if that is really desirable in VM context.
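(For reference, the secretmem behavior alluded to here is the memfd_secret() syscall: pages backing the fd are dropped from the kernel's direct map when faulted in, and are not GUP-able. A minimal userspace sketch, assuming a kernel with CONFIG_SECRETMEM and headers that define the syscall number:)

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* No glibc wrapper; flags may be 0 or FD_CLOEXEC. */
	int fd = syscall(SYS_memfd_secret, 0);
	if (fd < 0) {
		perror("memfd_secret");
		return 1;
	}
	if (ftruncate(fd, 4096) < 0) {
		perror("ftruncate");
		return 1;
	}
	/* Pages faulted into this mapping are removed from the direct map. */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	strcpy(p, "not in the direct map");
	puts(p);
	return 0;
}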
Having a mixture of "has directmap" and "has no directmap" for shared (faultable) memory should not be done. Similarly, private memory really should stay "unfaultable".
I think one of the points raised during the bi-weekly call was that using a viommu/swiotlb might be the right call, such that all memory can be considered private (unfaultable) that is not explicitly shared/expected to be modified by the hypervisor (-> faultable, -> GUP-able).
Further, I think Sean had some good points why we should explore that direction, but I recall that there were some issues to be sorted out (interpreted instructions requiring the direct map when accessing "private" memory?), not sure if that is already working/can be made working in KVM.
What's your opinion after the call and the next step for use cases like you have in mind (IIRC firecracker, which wants to not have the direct-map for guest memory where it can be avoided)?
Hi David,
On 11/4/24 12:18, David Hildenbrand wrote:
On 31.10.24 11:42, Patrick Roy wrote:
On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
On 30.10.24 14:49, Patrick Roy wrote:
Unmapping virtual machine guest memory from the host kernel's direct map is a successful mitigation against Spectre-style transient execution issues: If the kernel page tables do not contain entries pointing to guest memory, then any attempted speculative read through the direct map will necessarily be blocked by the MMU before any observable microarchitectural side-effects happen. This means that Spectre-gadgets and similar cannot be used to target virtual machine memory. Roughly 60% of speculative execution issues fall into this category [1, Table 1].
This patch series extends guest_memfd with the ability to remove its memory from the host kernel's direct map, to be able to attain the above protection for KVM guests running inside guest_memfd.
=== Changes to v2 ===
- Handle direct map removal for physically contiguous pages in arch code (Mike R.)
- Track the direct map state in guest_memfd itself instead of at the folio level, to prepare for huge pages support (Sean C.)
- Allow configuring direct map state of not-yet faulted in memory (Vishal A.)
- Pay attention to alignment in ftrace structs (Steven R.)
Most significantly, I've reduced the patch series to focus only on direct map removal for guest_memfd for now, leaving the whole "how to do non-CoCo VMs in guest_memfd" for later. If this separation is acceptable, then I think I can drop the RFC tag in the next revision (I've mainly kept it here because I'm not entirely sure what to do with patches 3 and 4).
Hi,
keeping upcoming "shared and private memory in guest_memfd" in mind, I assume the focus would be to only remove the direct map for private memory?
So in the current upstream state, you would only be removing the direct map for private memory, currently translating to "encrypted"/"protected" memory that is inaccessible either way already.
Correct?
Yeah, with the upcoming "shared and private" stuff, I would expect that the shared<->private conversions would call the routines from patch 3 to restore direct map entries on private->shared, and zap them on shared->private.
I wanted to follow up on the discussion we had in the bi-weekly call.
Thanks for summarizing!
We talked about shared (faultable) vs. private (unfaultable), and how it would interact with the directmap patches here.
As discussed, having private (unfaultable) memory with the direct-map removed and shared (faultable) memory with the direct-mapping can make sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo, the discussion here seems to indicate that it might currently not be required.
So one thing we could do is that shared (faultable) will have a direct mapping and be gup-able and private (unfaultable) memory will not have a direct mapping and is, by design, not gup-able. Maybe it could make sense to not have a direct map for all guest_memfd memory, making it behave like secretmem (and it would be easy to implement)? But I'm not sure if that is really desirable in VM context.
This would work for us (in this scenario, the swiotlb areas would be "traditional" memory, e.g. set to shared via mem attributes instead of "shared" inside KVM), it's kinda what I had prototyped in my v1 of this series (well, we'd need to figure out how to get the mappings of gmem back into KVM, since in this setup, short-circuiting it into userspace_addr wouldn't work, unless we banish swiotlb into a different memslot altogether somehow). But I don't think it'd work for pKVM, iirc they need GUP on gmem, and also want direct map removal (... but maybe, the gmem VMA for the non-CoCo usecase and the gmem VMA for pKVM could behave differently? non-CoCo gets essentially memfd_secret, pKVM gets GUP+no faults of private mem).
Having a mixture of "has directmap" and "has no directmap" for shared (faultable) memory should not be done. Similarly, private memory really should stay "unfaultable".
You've convinced me that having both GUP-able and non GUP-able memory in the same VMA will be tricky. However, I'm less convinced on why private memory should stay unfaultable; only that it shouldn't be faultable into a VMA that also allows GUP. Can we have two VMAs? One that disallows GUP, but allows userspace access to shared and private, and one that allows GUP, but disallows accessing private memory? Maybe via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly different spin of the above idea.
I think one of the points raised during the bi-weekly call was that using a viommu/swiotlb might be the right call, such that all memory can be considered private (unfaultable) that is not explicitly shared/expected to be modified by the hypervisor (-> faultable, -> GUP-able).
Further, I think Sean had some good points why we should explore that direction, but I recall that there were some issues to be sorted out (interpreted instructions requiring the direct map when accessing "private" memory?), not sure if that is already working/can be made working in KVM.
Yeah, the big one is MMIO instruction emulation on x86, which does guest page table walks and instruction fetches (and particularly the latter cannot be known ahead-of-time by the guest, aka cannot be explicitly "shared"). That's what the majority of my v2 series was about. For traditional memslots, KVM handles these via get_user and friends, but if we don't have a VMA that allows faulting all of gmem, then that's impossible, and we're in "temporarily restore direct map" land. Which comes with significant performance penalties due to TLB flushes.
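(For context, the "get_user and friends" path is roughly the below; a simplified sketch of how KVM reads guest memory through the memslot's userspace mapping today, not the literal upstream code. With fd-backed gmem that has neither a userspace mapping nor a direct map entry, there is simply no address to read through:)

#include <linux/kvm_host.h>
#include <linux/uaccess.h>

/* Simplified sketch: translate gfn -> host virtual address via the memslot,
 * then read the data through the VMM's userspace mapping. */
static int read_guest_via_hva(struct kvm *kvm, gfn_t gfn, void *data,
			      int offset, int len)
{
	unsigned long hva = gfn_to_hva(kvm, gfn);

	if (kvm_is_error_hva(hva))
		return -EFAULT;
	if (copy_from_user(data, (void __user *)(hva + offset), len))
		return -EFAULT;
	return 0;
}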
What's your opinion after the call and the next step for use cases like you have in mind (IIRC firecracker, which wants to not have the direct-map for guest memory where it can be avoided)?
Yea, the usecase is for Firecracker to not have direct map entries for guest memory, unless needed for I/O (-> swiotlb).
As for next steps, let's determine once and for all if we can do the KVM-internal guest memory accesses for MMIO emulation through userspace mappings (although if we can't I'll have some serious soul-searching to do, because all other solutions we talked about so far also have fairly big drawbacks; on-demand direct map reinsertion has terrible performance, protection keys would limit us to 15 VMs on the host, and the page table swapping runs into problems with NMIs if I understood Sean correctly last Thursday :( ).
-- Cheers,
David / dhildenb
Best, Patrick
We talked about shared (faultable) vs. private (unfaultable), and how it would interact with the directmap patches here.
As discussed, having private (unfaultable) memory with the direct-map removed and shared (faultable) memory with the direct-mapping can make sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo, the discussion here seems to indicate that it might currently not be required.
So one thing we could do is that shared (faultable) will have a direct mapping and be gup-able and private (unfaultable) memory will not have a direct mapping and is, by design, not gup-able. Maybe it could make sense to not have a direct map for all guest_memfd memory, making it behave like secretmem (and it would be easy to implement)? But I'm not sure if that is really desirable in VM context.
This would work for us (in this scenario, the swiotlb areas would be "traditional" memory, e.g. set to shared via mem attributes instead of "shared" inside KVM), it's kinda what I had prototyped in my v1 of this series (well, we'd need to figure out how to get the mappings of gmem back into KVM, since in this setup, short-circuiting it into userspace_addr wouldn't work, unless we banish swiotlb into a different memslot altogether somehow).
Right.
But I don't think it'd work for pKVM, iirc they need GUP on gmem, and also want direct map removal (... but maybe, the gmem VMA for the non-CoCo usecase and the gmem VMA for pKVM could behave differently? non-CoCo gets essentially memfd_secret, pKVM gets GUP+no faults of private mem).
Good question. So far my perception was that the directmap removal on "private/unfaultable" would be sufficient.
Having a mixture of "has directmap" and "has no directmap" for shared (faultable) memory should not be done. Similarly, private memory really should stay "unfaultable".
You've convinced me that having both GUP-able and non GUP-able memory in the same VMA will be tricky. However, I'm less convinced on why private memory should stay unfaultable; only that it shouldn't be faultable into a VMA that also allows GUP. Can we have two VMAs? One that disallows GUP, but allows userspace access to shared and private, and one that allows GUP, but disallows accessing private memory? Maybe via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly different spin of the above idea.
What we are trying to achieve is making guest_memfd not behave completely differently on that level for different "types" of VMs. So one of the goals should be to try to unify it as much as possible.
shared -> faultable: GUP-able
private -> unfaultable: unGUP-able
And it makes sense, because a lot of future work will rely on some important properties: for example, if private memory cannot be faulted in + GUPed, core-MM will never have obtained valid references to such a page. There is no need to split large folios into smaller ones for tracking purposes; there is no need to maintain per-page refcounts and pincounts ...
It doesn't mean that we cannot consider it if really required, but there really has to be a strong case for it, because it will all get really messy.
For example, one issue is that a folio only has a single mapping (folio->mapping), and that is used in the GUP-fast path (no VMA) to determine whether GUP-fast is allowed or not.
So you'd have to force everything through GUP-slow, where you could consider VMA properties :( It sounds quite suboptimal.
I don't think multiple VMAs are what we really want. See below.
I think one of the points raised during the bi-weekly call was that using a viommu/swiotlb might be the right call, such that all memory can be considered private (unfaultable) that is not explicitly shared/expected to be modified by the hypervisor (-> faultable, -> GUP-able).
Further, I think Sean had some good points why we should explore that direction, but I recall that there were some issues to be sorted out (interpreted instructions requiring the direct map when accessing "private" memory?), not sure if that is already working/can be made working in KVM.
Yeah, the big one is MMIO instruction emulation on x86, which does guest page table walks and instruction fetches (and particularly the latter cannot be known ahead-of-time by the guest, aka cannot be explicitly "shared"). That's what the majority of my v2 series was about. For traditional memslots, KVM handles these via get_user and friends, but if we don't have a VMA that allows faulting all of gmem, then that's impossible, and we're in "temporarily restore direct map" land. Which comes with significant performance penalties due to TLB flushes.
Agreed.
What's your opinion after the call and the next step for use cases
like
you have in mind (IIRC firecracker, which wants to not have the direct-map for guest memory where it can be avoided)?
Yea, the usecase is for Firecracker to not have direct map entries for guest memory, unless needed for I/O (-> swiotlb).
As for next steps, let's determine once and for all if we can do the KVM-internal guest memory accesses for MMIO emulation through userspace mappings (although if we can't I'll have some serious soul-searching to do, because all other solutions we talked about so far also have fairly big drawbacks; on-demand direct map reinsertion has terrible performance
So IIUC, KVM would have to access "unfaultable" guest_memfd memory using fd+offset, and that's problematic because "no-directmap".
So you'd have to map+unmap the directmap repeatedly, and still expose it temporarily in the direct map to others. I see how that is undesirable, even when trying to cache hotspots (partly destroying the purpose of the directmap removal).
Would a per-MM kernel mapping of these pages work, so KVM can access them?
It sounds a bit like what is required for clean per-MM allocations [1]: establish a per-MM kernel mapping of (selected?) pages. Not necessarily all of them.
Yes, we'd be avoiding VMAs, GUP, mapcounts, pincounts and everything involved with ordinary user mappings for these private/unfaultable thingies. Just like as discussed in, and similar to [1].
Just throwing it out there, maybe we really want to avoid the directmap (keep it unmapped) and maintain a per-mm mapping for a bunch of folios that can be easily removed when required by guest_memfd (ftruncate, conversion private->shared) on request.
[1] https://lore.kernel.org/all/20240911143421.85612-1-faresx@amazon.de/T/#u
Hi David,
sorry for the late response, I ended up catching the flu last week and was out of commission for a while :(
On Mon, 2024-11-04 at 21:30 +0000, David Hildenbrand wrote:
We talked about shared (faultable) vs. private (unfaultable), and how it would interact with the directmap patches here.
As discussed, having private (unfaultable) memory with the direct-map removed and shared (faultable) memory with the direct-mapping can make sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo, the discussion here seems to indicate that it might currently not be required.
So one thing we could do is that shared (faultable) will have a direct mapping and be gup-able and private (unfaultable) memory will not have a direct mapping and is, by design, not gup-able. Maybe it could make sense to not have a direct map for all guest_memfd memory, making it behave like secretmem (and it would be easy to implement)? But I'm not sure if that is really desirable in VM context.
This would work for us (in this scenario, the swiotlb areas would be "traditional" memory, e.g. set to shared via mem attributes instead of "shared" inside KVM), it's kinda what I had prototyped in my v1 of this series (well, we'd need to figure out how to get the mappings of gmem back into KVM, since in this setup, short-circuiting it into userspace_addr wouldn't work, unless we banish swiotlb into a different memslot altogether somehow).
Right.
"right" as in, "yes we could do that"? :p
But I don't think it'd work for pKVM, iirc they need GUP on gmem, and also want direct map removal (... but maybe, the gmem VMA for the non-CoCo usecase and the gmem VMA for pKVM could behave differently? non-CoCo gets essentially memfd_secret, pKVM gets GUP+no faults of private mem).
Good question. So far my perception was that the directmap removal on "private/unfaultable" would be sufficient.
Having a mixture of "has directmap" and "has no directmap" for shared (faultable) memory should not be done. Similarly, private memory really should stay "unfaultable".
You've convinced me that having both GUP-able and non GUP-able memory in the same VMA will be tricky. However, I'm less convinced on why private memory should stay unfaultable; only that it shouldn't be faultable into a VMA that also allows GUP. Can we have two VMAs? One that disallows GUP, but allows userspace access to shared and private, and one that allows GUP, but disallows accessing private memory? Maybe via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly different spin of the above idea.
What we are trying to achieve is making guest_memfd not behave completely differently on that level for different "types" of VMs. So one of the goals should be to try to unify it as much as possible.
shared -> faultable: GUP-able
private -> unfaultable: unGUP-able
And it makes sense, because a lot of future work will rely on some important properties: for example, if private memory cannot be faulted in + GUPed, core-MM will never have obtained valid references to such a page. There is no need to split large folios into smaller ones for tracking purposes; there is no need to maintain per-page refcounts and pincounts ...
It doesn't mean that we cannot consider it if really required, but there really has to be a strong case for it, because it will all get really messy.
For example, one issue is that a folio only has a single mapping (folio->mapping), and that is used in the GUP-fast path (no VMA) to determine whether GUP-fast is allowed or not.
So you'd have to force everything through GUP-slow, where you could consider VMA properties :( It sounds quite suboptimal.
I don't think multiple VMAs are what we really want. See below.
Ah, okay, I see. Thanks for explaining, this all makes a lot of sense to me now!
I think one of the points raised during the bi-weekly call was that using a viommu/swiotlb might be the right call, such that all memory can be considered private (unfaultable) that is not explicitly shared/expected to be modified by the hypervisor (-> faultable, -> GUP-able).
Further, I think Sean had some good points why we should explore that direction, but I recall that there were some issues to be sorted out (interpreted instructions requiring the direct map when accessing "private" memory?), not sure if that is already working/can be made working in KVM.
Yeah, the big one is MMIO instruction emulation on x86, which does guest page table walks and instruction fetches (and particularly the latter cannot be known ahead-of-time by the guest, aka cannot be explicitly "shared"). That's what the majority of my v2 series was about. For traditional memslots, KVM handles these via get_user and friends, but if we don't have a VMA that allows faulting all of gmem, then that's impossible, and we're in "temporarily restore direct map" land. Which comes with significant performance penalties due to TLB flushes.
Agreed.
What's your opinion after the call and the next step for use cases
like
you have in mind (IIRC firecracker, which wants to not have the direct-map for guest memory where it can be avoided)?
Yea, the usecase is for Firecracker to not have direct map entries for guest memory, unless needed for I/O (-> swiotlb).
As for next steps, let's determine once and for all if we can do the KVM-internal guest memory accesses for MMIO emulation through userspace mappings (although if we can't I'll have some serious soul-searching to do, because all other solutions we talked about so far also have fairly big drawbacks; on-demand direct map reinsertion has terrible performance
So IIUC, KVM would have to access "unfaultable" guest_memfd memory using fd+offset, and that's problematic because "no-directmap".
So you'd have to map+unmap the directmap repeatedly, and still expose it temporarily in the direct map to others. I see how that is undesirable, even when trying to cache hotspots (partly destroying the purpose of the directmap removal).
Would a per-MM kernel mapping of these pages work, so KVM can access them?
It sounds a bit like what is required for clean per-MM allocations [1]: establish a per-MM kernel mapping of (selected?) pages. Not necessarily all of them.
Yes, we'd be avoiding VMAs, GUP, mapcounts, pincounts and everything involved with ordinary user mappings for these private/unfaultable thingies. Just like as discussed in, and similar to [1].
Just throwing it out there, maybe we really want to avoid the directmap (keep it unmapped) and maintain a per-mm mapping for a bunch of folios that can be easily removed when required by guest_memfd (ftruncate, conversion private->shared) on request.
I remember talking to someone at some point about whether we could reuse the proc-local stuff for guest memory, but I cannot remember the outcome of that discussion... (or maybe I just wanted to have a discussion about it, but forgot to follow up on that thought?). I guess we wouldn't use proc-local _allocations_, but rather just set up proc-local mappings of the gmem allocations that have been removed from the direct map.
I'm wondering, where exactly would be the differences to Sean's idea about messing with the CR3 register inside KVM to temporarily install page tables that contain all the gmem stuff, conceptually? Wouldn't we run into the same interrupt problems that Sean foresaw for the CR3 stuff? (which, admittedly, I still don't quite follow what these are :( ).
(I've cc'd Fares Mehanna as well)
[1] https://lore.kernel.org/all/20240911143421.85612-1-faresx@amazon.de/T/#u
-- Cheers,
David / dhildenb
Best, Patrick
On 12.11.24 15:40, Patrick Roy wrote:
Hi David,
sorry for the late response, I ended up catching the flu last week and was out of commission for a while :(
On Mon, 2024-11-04 at 21:30 +0000, David Hildenbrand wrote:
We talked about shared (faultable) vs. private (unfaultable), and how it would interact with the directmap patches here.
As discussed, having private (unfaultable) memory with the direct-map removed and shared (faultable) memory with the direct-mapping can make sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo, the discussion here seems to indicate that it might currently not be required.
So one thing we could do is that shared (faultable) will have a direct mapping and be gup-able and private (unfaultable) memory will not have a direct mapping and is, by design, not gup-able. Maybe it could make sense to not have a direct map for all guest_memfd memory, making it behave like secretmem (and it would be easy to implement)? But I'm not sure if that is really desirable in VM context.
This would work for us (in this scenario, the swiotlb areas would be "traditional" memory, e.g. set to shared via mem attributes instead of "shared" inside KVM), it's kinda what I had prototyped in my v1 of this series (well, we'd need to figure out how to get the mappings of gmem back into KVM, since in this setup, short-circuiting it into userspace_addr wouldn't work, unless we banish swiotlb into a different memslot altogether somehow).
Right.
"right" as in, "yes we could do that"? :p
"right" as in "I see how that could work" :)
[...]
I remember talking to someone at some point about whether we could reuse the proc-local stuff for guest memory, but I cannot remember the outcome of that discussion... (or maybe I just wanted to have a discussion about it, but forgot to follow up on that thought?). I guess we wouldn't use proc-local _allocations_, but rather just set up proc-local mappings of the gmem allocations that have been removed from the direct map.
Yes. And likely only for memory we really access / try to access, if possible.
I'm wondering, where exactly would be the differences to Sean's idea about messing with the CR3 register inside KVM to temporarily install page tables that contain all the gmem stuff, conceptually? Wouldn't we run into the same interrupt problems that Sean foresaw for the CR3 stuff? (which, admittedly, I still don't quite follow what these are :( ).
I'd need some more details on that. If anything would rely on the direct mapping (from IRQ context?) than ... we obviously cannot remove the direct mapping :)
On Tue, 2024-11-12 at 14:52 +0000, David Hildenbrand wrote:
On 12.11.24 15:40, Patrick Roy wrote:
I remember talking to someone at some point about whether we could reuse the proc-local stuff for guest memory, but I cannot remember the outcome of that discussion... (or maybe I just wanted to have a discussion about it, but forgot to follow up on that thought?). I guess we wouldn't use proc-local _allocations_, but rather just set up proc-local mappings of the gmem allocations that have been removed from the direct map.
Yes. And likely only for memory we really access / try to access, if possible.
Well, if we start on-demand mm-local mapping the things we want to access, we're back in TLB flush hell, no? And we can't know ahead-of-time what needs to be mapped, so everything would need to be mapped (unless we do something like mm-local mapping a page on first access, and then just never unmapping it again, under the assumption that establishing the mapping won't be expensive)
I'm wondering, where exactly would be the differences to Sean's idea about messing with the CR3 register inside KVM to temporarily install page tables that contain all the gmem stuff, conceptually? Wouldn't we run into the same interrupt problems that Sean foresaw for the CR3 stuff? (which, admittedly, I still don't quite follow what these are :( ).
I'd need some more details on that. If anything would rely on the direct mapping (from IRQ context?) than ... we obviously cannot remove the direct mapping :)
I've talked to Fares internally, and it seems that generally doing mm-local mappings of guest memory would work for us. We also figured out what the "interrupt problem" is, namely that if we receive an interrupt while executing in a context that has mm-local mappings available, those mappings will continue to be available while the interrupt is being handled. I'm talking to my security folks to see how much of a concern this is for the speculation hardening we're trying to achieve. Will keep you in the loop there :)
-- Cheers,
David / dhildenb
Best, Patrick
On 15.11.24 17:59, Patrick Roy wrote:
On Tue, 2024-11-12 at 14:52 +0000, David Hildenbrand wrote:
On 12.11.24 15:40, Patrick Roy wrote:
I remember talking to someone at some point about whether we could reuse the proc-local stuff for guest memory, but I cannot remember the outcome of that discussion... (or maybe I just wanted to have a discussion about it, but forgot to follow up on that thought?). I guess we wouldn't use proc-local _allocations_, but rather just set up proc-local mappings of the gmem allocations that have been removed from the direct map.
Yes. And likely only for memory we really access / try to access, if possible.
Well, if we start on-demand mm-local mapping the things we want to access, we're back in TLB flush hell, no?
At least the on-demand mapping shouldn't require a TLB flush? Only the "unmapping" would, if we want to keep a "mapped pool" of restricted size.
Anyhow, this would be a pure optimization, to avoid the expense of mapping everything when in practice you'd likely not access most of it.
(my theory, happy to be told I'm wrong :) )
And we can't know ahead-of-time what needs to be mapped, so everything would need to be mapped (unless we do something like mm-local mapping a page on first access, and then just never unmapping it again, under the assumption that establishing the mapping won't be expensive)
Right, the whole problem is that we don't know that upfront.
I'm wondering where, conceptually, the differences would be compared to Sean's idea of messing with the CR3 register inside KVM to temporarily install page tables that contain all the gmem stuff. Wouldn't we run into the same interrupt problems that Sean foresaw for the CR3 approach? (Which, admittedly, I still don't quite follow :( ).
I'd need some more details on that. If anything would rely on the direct mapping (from IRQ context?) then ... we obviously cannot remove the direct mapping :)
I've talked to Fares internally, and it seems that generally doing mm-local mappings of guest memory would work for us. We also figured out what the "interrupt problem" is, namely that if we receive an interrupt while executing in a context that has mm-local mappings available, those mappings will continue to be available while the interrupt is being handled.
Isn't that likely also the case with secretmem, where we removed the direct map but have an effective per-mm mapping in the (user-space portion of the) page tables?
I'm talking to my security folks to see how much of a concern this is for the speculation hardening we're trying to achieve. Will keep you in the loop there :)
Thanks!
On Fri, 2024-11-15 at 17:10 +0000, David Hildenbrand wrote:
[...]
I've talked to Fares internally, and it seems that generally doing mm-local mappings of guest memory would work for us. We also figured out what the "interrupt problem" is, namely that if we receive an interrupt while executing in a context that has mm-local mappings available, those mappings will continue to be available while the interrupt is being handled.
Isn't that likely also the case with secretmem, where we removed the direct map but have an effective per-mm mapping in the (user-space portion of the) page tables?
Mh, that's an excellent point, I never thought of that. But with secretmem, the memory would still be protected by SMAP (admittedly, I have no idea how much this is worth in the face of all these speculative issues), right?
I'm talking to my security folks to see how much of a concern this is for the speculation hardening we're trying to achieve. Will keep you in the loop there :)
Thanks!
-- Cheers,
David / dhildenb
Best, Patrick
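As a point of reference for the secretmem comparison above: memfd_secret(2) already provides exactly that combination today, i.e. pages removed from the kernel direct map but still reachable through the owning process's (SMAP-protected) user mappings. A small illustrative example, assuming a kernel with CONFIG_SECRETMEM enabled and headers that define SYS_memfd_secret:

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

int main(void)
{
	/* Secretmem pages are dropped from the direct map when faulted in. */
	int fd = syscall(SYS_memfd_secret, 0);
	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return 1;

	/* The only way to reach the memory is via this per-mm user mapping. */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	strcpy(p, "not reachable through the kernel direct map");
	munmap(p, 4096);
	close(fd);
	return 0;
}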