In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support grabbing shmem user pages instead of using KVM's guest_memfd. These hypervisors provide a different isolation model than the CoCo implementations from x86. KVM's guest_memfd is focused on providing memory that is more isolated than AVF requires. Some specific examples include the ability to pre-load data onto guest-private pages, dynamically share/isolate guest pages without copy, and (in the future) migrate guest-private pages.

Because of those differences, and after a discussion in [1] and at PUCK, we want to try to stick with existing shmem and extend GUP to support the isolation needs of arm64 pKVM and Gunyah. To that end, we introduce the concept of "exclusive GUP pinning", which enforces that only one pin of any kind is allowed when the FOLL_EXCLUSIVE flag is set. This behavior doesn't affect FOLL_GET or any other folio refcount operations that don't go through the FOLL_PIN path.
[1]: https://lore.kernel.org/all/20240319143119.GA2736@willie-the-truck/
Tree with patches at: https://git.codelinaro.org/clo/linux-kernel/gunyah-linux/-/tree/sent/exclusi...
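For illustration, here is a minimal usage sketch of the proposed interface, written as a hypothetical hypervisor-driver caller. FOLL_EXCLUSIVE and unpin_exc_pages() are the additions from this series; the function name, flag choice, and error-handling policy are assumptions made for the example only.

/*
 * Illustrative sketch, not part of the patches: take exclusive pins on
 * a range of guest pages before making them private.
 */
#include <linux/mm.h>

static int make_guest_range_private(unsigned long uaddr, int nr_pages,
				    struct page **pages)
{
	int pinned;

	/* FOLL_EXCLUSIVE requires FOLL_PIN (implied here) and FOLL_LONGTERM. */
	pinned = pin_user_pages_fast(uaddr, nr_pages,
				     FOLL_WRITE | FOLL_LONGTERM | FOLL_EXCLUSIVE,
				     pages);
	if (pinned < 0)
		return pinned;

	if (pinned != nr_pages) {
		/* e.g. some page already had a pin of some kind: back out */
		unpin_exc_pages(pages, pinned);
		return -EBUSY;
	}

	/* ... donate the pinned range to the protected guest ... */
	return 0;
}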
anup@brainfault.org, paul.walmsley@sifive.com, palmer@dabbelt.com, aou@eecs.berkeley.edu, seanjc@google.com, viro@zeniv.linux.org.uk, brauner@kernel.org, willy@infradead.org, akpm@linux-foundation.org, xiaoyao.li@intel.com, yilun.xu@intel.com, chao.p.peng@linux.intel.com, jarkko@kernel.org, amoorthy@google.com, dmatlack@google.com, yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com, mic@digikod.net, vbabka@suse.cz, vannapurve@google.com, ackerleytng@google.com, mail@maciej.szmigiero.name, david@redhat.com, michael.roth@amd.com, wei.w.wang@intel.com, liam.merwick@oracle.com, isaku.yamahata@gmail.com, kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com, steven.price@arm.com, quic_eberman@quicinc.com, quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com, quic_pderrin@quicinc.com, quic_pheragu@quicinc.com, catalin.marinas@arm.com, james.morse@arm.com, yuzenghui@huawei.com, oliver.upton@linux.dev, maz@kernel.org, will@kernel.org, qperret@google.com, keirf@google.com, tabba@google.com
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
---
Elliot Berman (2):
      mm/gup-test: Verify exclusive pinned
      mm/gup_test: Verify GUP grabs same pages twice
Fuad Tabba (3):
      mm/gup: Move GUP_PIN_COUNTING_BIAS to page_ref.h
      mm/gup: Add an option for obtaining an exclusive pin
      mm/gup: Add support for re-pinning a normal pinned page as exclusive
 include/linux/mm.h                    |  57 ++++----
 include/linux/mm_types.h              |   2 +
 include/linux/page_ref.h              |  74 ++++++++++
 mm/Kconfig                            |   5 +
 mm/gup.c                              | 265 ++++++++++++++++++++++++++++++----
 mm/gup_test.c                         | 108 ++++++++++++++
 mm/gup_test.h                         |   1 +
 tools/testing/selftests/mm/gup_test.c |   5 +-
 8 files changed, 457 insertions(+), 60 deletions(-)
---
base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f
change-id: 20240509-exclusive-gup-66259138bbff
Best regards,
From: Fuad Tabba <tabba@google.com>
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
---
 include/linux/mm.h       | 32 --------------------------------
 include/linux/page_ref.h | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 32 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 9849dfda44d43..fd0d10b08e7ac 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1580,38 +1580,6 @@ static inline void put_page(struct page *page) folio_put(folio); }
-/* - * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload - * the page's refcount so that two separate items are tracked: the original page - * reference count, and also a new count of how many pin_user_pages() calls were - * made against the page. ("gup-pinned" is another term for the latter). - * - * With this scheme, pin_user_pages() becomes special: such pages are marked as - * distinct from normal pages. As such, the unpin_user_page() call (and its - * variants) must be used in order to release gup-pinned pages. - * - * Choice of value: - * - * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference - * counts with respect to pin_user_pages() and unpin_user_page() becomes - * simpler, due to the fact that adding an even power of two to the page - * refcount has the effect of using only the upper N bits, for the code that - * counts up using the bias value. This means that the lower bits are left for - * the exclusive use of the original code that increments and decrements by one - * (or at least, by much smaller values than the bias value). - * - * Of course, once the lower bits overflow into the upper bits (and this is - * OK, because subtraction recovers the original values), then visual inspection - * no longer suffices to directly view the separate counts. However, for normal - * applications that don't have huge page reference counts, this won't be an - * issue. - * - * Locking: the lockless algorithm described in folio_try_get_rcu() - * provides safe operation for get_user_pages(), page_mkclean() and - * other calls that race to set up page table entries. - */ -#define GUP_PIN_COUNTING_BIAS (1U << 10) - void unpin_user_page(struct page *page); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty); diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h index 1acf5bac7f503..e6aeaafb143ca 100644 --- a/include/linux/page_ref.h +++ b/include/linux/page_ref.h @@ -62,6 +62,38 @@ static inline void __page_ref_unfreeze(struct page *page, int v)
#endif
+/* + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload + * the page's refcount so that two separate items are tracked: the original page + * reference count, and also a new count of how many pin_user_pages() calls were + * made against the page. ("gup-pinned" is another term for the latter). + * + * With this scheme, pin_user_pages() becomes special: such pages are marked as + * distinct from normal pages. As such, the unpin_user_page() call (and its + * variants) must be used in order to release gup-pinned pages. + * + * Choice of value: + * + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference + * counts with respect to pin_user_pages() and unpin_user_page() becomes + * simpler, due to the fact that adding an even power of two to the page + * refcount has the effect of using only the upper N bits, for the code that + * counts up using the bias value. This means that the lower bits are left for + * the exclusive use of the original code that increments and decrements by one + * (or at least, by much smaller values than the bias value). + * + * Of course, once the lower bits overflow into the upper bits (and this is + * OK, because subtraction recovers the original values), then visual inspection + * no longer suffices to directly view the separate counts. However, for normal + * applications that don't have huge page reference counts, this won't be an + * issue. + * + * Locking: the lockless algorithm described in folio_try_get_rcu() + * provides safe operation for get_user_pages(), page_mkclean() and + * other calls that race to set up page table entries. + */ +#define GUP_PIN_COUNTING_BIAS (1U << 10) + static inline int page_ref_count(const struct page *page) { return atomic_read(&page->_refcount);
From: Fuad Tabba <tabba@google.com>
Introduce the ability to obtain an exclusive long-term pin on a page. This exclusive pin can only be held if there are no other pins on the page, regular or exclusive. Moreover, once this pin is held, no other pins can be grabbed until the exclusive pin is released.
This pin is grabbed using the (new) FOLL_EXCLUSIVE flag, and is gated by the EXCLUSIVE_PIN configuration option.
Similar to how the normal GUP pin is obtained, the exclusive pin overloads the _refcount field for normal pages, or the _pincount field for large pages. It appropriates bit 30 of these two fields, which still allows the detection of overflows into bit 31. It does, however, halve the number of potential normal pins for a page.
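As a worked example of the bias arithmetic (illustrative only; the two bias values are the ones defined by this series, while the helper and the concrete refcount numbers are not part of the patch):

/*
 * Illustration only: how the two biases share the counter.  Small
 * folios keep everything in _refcount; large folios keep the pin
 * counts in _pincount instead.
 */
#include <linux/types.h>

#define GUP_PIN_COUNTING_BIAS	(1U << 10)	/* added per normal FOLL_PIN */
#define GUP_PIN_EXCLUSIVE_BIAS	(1U << 30)	/* added once for the exclusive pin */

/*
 * Page mapped once, one normal pin:
 *	_refcount = 1 + GUP_PIN_COUNTING_BIAS                          = 0x00000401
 *
 * Page mapped once, one exclusive pin (which also counts as a pin):
 *	_refcount = 1 + GUP_PIN_COUNTING_BIAS + GUP_PIN_EXCLUSIVE_BIAS = 0x40000401
 *
 * Exclusivity is then a simple magnitude test, which still leaves
 * bit 31 free for detecting overflow:
 */
static inline bool counter_maybe_exclusive_pinned(unsigned int count)
{
	return count >= GUP_PIN_EXCLUSIVE_BIAS;
}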
In order to avoid the possibility of COWing such a page, once an exclusive pin has been obtained, it's marked as AnonExclusive.
Co-Developed-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
---
 include/linux/mm.h       |  24 +++++
 include/linux/mm_types.h |   2 +
 include/linux/page_ref.h |  36 +++++++
 mm/Kconfig               |   5 +
 mm/gup.c                 | 239 +++++++++++++++++++++++++++++++++++++++++------
 5 files changed, 279 insertions(+), 27 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index fd0d10b08e7ac..d03d62bceba08 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1583,9 +1583,13 @@ static inline void put_page(struct page *page) void unpin_user_page(struct page *page); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty); +void unpin_exc_pages_dirty_lock(struct page **pages, unsigned long npages, + bool make_dirty); void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, bool make_dirty); void unpin_user_pages(struct page **pages, unsigned long npages); +void unpin_exc_pages(struct page **pages, unsigned long npages); +void unexc_user_page(struct page *page);
static inline bool is_cow_mapping(vm_flags_t flags) { @@ -1958,6 +1962,26 @@ static inline bool folio_needs_cow_for_dma(struct vm_area_struct *vma, return folio_maybe_dma_pinned(folio); }
+static inline bool folio_maybe_exclusive_pinned(const struct folio *folio) +{ + unsigned int count; + + if (!IS_ENABLED(CONFIG_EXCLUSIVE_PIN)) + return false; + + if (folio_test_large(folio)) + count = atomic_read(&folio->_pincount); + else + count = folio_ref_count(folio); + + return count >= GUP_PIN_EXCLUSIVE_BIAS; +} + +static inline bool page_maybe_exclusive_pinned(const struct page *page) +{ + return folio_maybe_exclusive_pinned(page_folio(page)); +} + /** * is_zero_page - Query if a page is a zero page * @page: The page to query diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index af3a0256fa93b..dc397e3465c23 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1465,6 +1465,8 @@ enum { * hinting faults. */ FOLL_HONOR_NUMA_FAULT = 1 << 12, + /* exclusive PIN only if there aren't other pins (including this) */ + FOLL_EXCLUSIVE = 1 << 13,
/* See also internal only FOLL flags in mm/internal.h */ }; diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h index e6aeaafb143ca..9d16e1f4db094 100644 --- a/include/linux/page_ref.h +++ b/include/linux/page_ref.h @@ -94,6 +94,14 @@ static inline void __page_ref_unfreeze(struct page *page, int v) */ #define GUP_PIN_COUNTING_BIAS (1U << 10)
+/* + * GUP_PIN_EXCLUSIVE_BIAS is used to grab an exclusive pin over a page. + * This exclusive pin can only be taken once, and only if no other GUP pins + * exist for the page. + * After it's taken, no other gup pins can be taken. + */ +#define GUP_PIN_EXCLUSIVE_BIAS (1U << 30) + static inline int page_ref_count(const struct page *page) { return atomic_read(&page->_refcount); @@ -147,6 +155,34 @@ static inline void init_page_count(struct page *page) set_page_count(page, 1); }
+static __must_check inline bool page_ref_setexc(struct page *page, unsigned int refs) +{ + unsigned int old_count, new_count; + + if (WARN_ON_ONCE(refs >= GUP_PIN_EXCLUSIVE_BIAS)) + return false; + + do { + old_count = atomic_read(&page->_refcount); + + if (old_count >= GUP_PIN_COUNTING_BIAS) + return false; + + if (check_add_overflow(old_count, refs + GUP_PIN_EXCLUSIVE_BIAS, &new_count)) + return false; + } while (atomic_cmpxchg(&page->_refcount, old_count, new_count) != old_count); + + if (page_ref_tracepoint_active(page_ref_mod)) + __page_ref_mod(page, refs); + + return true; +} + +static __must_check inline bool folio_ref_setexc(struct folio *folio, unsigned int refs) +{ + return page_ref_setexc(&folio->page, refs); +} + static inline void page_ref_add(struct page *page, int nr) { atomic_add(nr, &page->_refcount); diff --git a/mm/Kconfig b/mm/Kconfig index b4cb45255a541..56f8c80b996f5 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1249,6 +1249,11 @@ config IOMMU_MM_DATA config EXECMEM bool
+config EXCLUSIVE_PIN + def_bool y + help + Add support for exclusive pins of pages. + source "mm/damon/Kconfig"
endmenu diff --git a/mm/gup.c b/mm/gup.c index ca0f5cedce9b2..7f20de33221da 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -97,6 +97,65 @@ static inline struct folio *try_get_folio(struct page *page, int refs) return folio; }
+static bool large_folio_pin_setexc(struct folio *folio, unsigned int pins) +{ + unsigned int old_pincount, new_pincount; + + if (WARN_ON_ONCE(pins >= GUP_PIN_EXCLUSIVE_BIAS)) + return false; + + do { + old_pincount = atomic_read(&folio->_pincount); + + if (old_pincount > 0) + return false; + + if (check_add_overflow(old_pincount, pins + GUP_PIN_EXCLUSIVE_BIAS, &new_pincount)) + return false; + } while (atomic_cmpxchg(&folio->_pincount, old_pincount, pins) != old_pincount); + + return true; +} + +static bool __try_grab_folio_excl(struct folio *folio, int pincount, int refcount) +{ + if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_EXCLUSIVE_PIN))) + return false; + + if (folio_test_large(folio)) { + if (!large_folio_pin_setexc(folio, pincount)) + return false; + } else if (!folio_ref_setexc(folio, refcount)) { + return false; + } + + if (!PageAnonExclusive(&folio->page)) + SetPageAnonExclusive(&folio->page); + + return true; +} + +static bool try_grab_folio_excl(struct folio *folio, int refs) +{ + /* + * When pinning a large folio, use an exact count to track it. + * + * However, be sure to *also* increment the normal folio + * refcount field at least once, so that the folio really + * is pinned. That's why the refcount from the earlier + * try_get_folio() is left intact. + */ + return __try_grab_folio_excl(folio, refs, + refs * (GUP_PIN_COUNTING_BIAS - 1)); +} + +static bool try_grab_page_excl(struct page *page) +{ + struct folio *folio = page_folio(page); + + return __try_grab_folio_excl(folio, 1, GUP_PIN_COUNTING_BIAS); +} + /** * try_grab_folio() - Attempt to get or pin a folio. * @page: pointer to page to be grabbed @@ -161,19 +220,41 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) return NULL; }
- /* - * When pinning a large folio, use an exact count to track it. - * - * However, be sure to *also* increment the normal folio - * refcount field at least once, so that the folio really - * is pinned. That's why the refcount from the earlier - * try_get_folio() is left intact. - */ - if (folio_test_large(folio)) - atomic_add(refs, &folio->_pincount); - else - folio_ref_add(folio, - refs * (GUP_PIN_COUNTING_BIAS - 1)); + if (unlikely(folio_maybe_exclusive_pinned(folio))) { + if (!put_devmap_managed_folio_refs(folio, refs)) + folio_put_refs(folio, refs); + return NULL; + } + + if (unlikely(flags & FOLL_EXCLUSIVE)) { + if (!try_grab_folio_excl(folio, refs)) + return NULL; + } else { + /* + * When pinning a large folio, use an exact count to track it. + * + * However, be sure to *also* increment the normal folio + * refcount field at least once, so that the folio really + * is pinned. That's why the refcount from the earlier + * try_get_folio() is left intact. + */ + if (folio_test_large(folio)) + atomic_add(refs, &folio->_pincount); + else + folio_ref_add(folio, + refs * (GUP_PIN_COUNTING_BIAS - 1)); + + if (unlikely(folio_maybe_exclusive_pinned(folio))) { + if (folio_test_large(folio)) + atomic_sub(refs, &folio->_pincount); + else + folio_put_refs(folio, + refs * (GUP_PIN_COUNTING_BIAS - 1)); + + return NULL; + } + } + /* * Adjust the pincount before re-checking the PTE for changes. * This is essentially a smp_mb() and is paired with a memory @@ -198,6 +279,26 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags) refs *= GUP_PIN_COUNTING_BIAS; }
+ if (unlikely(flags & FOLL_EXCLUSIVE)) { + if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_EXCLUSIVE_PIN))) + goto out; + if (is_zero_folio(folio)) + return; + if (folio_test_large(folio)) { + if (WARN_ON_ONCE((atomic_read(&folio->_pincount) < GUP_PIN_EXCLUSIVE_BIAS))) + goto out; + atomic_sub(GUP_PIN_EXCLUSIVE_BIAS, &folio->_pincount); + } else { + if (WARN_ON_ONCE((unsigned int)refs >= GUP_PIN_EXCLUSIVE_BIAS)) + goto out; + if (WARN_ON_ONCE(folio_ref_count(folio) < GUP_PIN_EXCLUSIVE_BIAS)) + goto out; + + refs += GUP_PIN_EXCLUSIVE_BIAS; + } + } + +out: if (!put_devmap_managed_folio_refs(folio, refs)) folio_put_refs(folio, refs); } @@ -242,16 +343,35 @@ int __must_check try_grab_page(struct page *page, unsigned int flags) if (is_zero_page(page)) return 0;
- /* - * Similar to try_grab_folio(): be sure to *also* - * increment the normal page refcount field at least once, - * so that the page really is pinned. - */ - if (folio_test_large(folio)) { - folio_ref_add(folio, 1); - atomic_add(1, &folio->_pincount); + if (unlikely(folio_maybe_exclusive_pinned(folio))) + return -EBUSY; + + if (unlikely(flags & FOLL_EXCLUSIVE)) { + if (!try_grab_page_excl(page)) + return -EBUSY; } else { - folio_ref_add(folio, GUP_PIN_COUNTING_BIAS); + /* + * Similar to try_grab_folio(): be sure to *also* + * increment the normal page refcount field at least once, + * so that the page really is pinned. + */ + if (folio_test_large(folio)) { + folio_ref_add(folio, 1); + atomic_add(1, &folio->_pincount); + } else { + folio_ref_add(folio, GUP_PIN_COUNTING_BIAS); + } + + if (unlikely(folio_maybe_exclusive_pinned(folio))) { + if (folio_test_large(folio)) { + folio_put_refs(folio, 1); + atomic_sub(1, &folio->_pincount); + } else { + folio_put_refs(folio, GUP_PIN_COUNTING_BIAS); + } + + return -EBUSY; + } }
node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1); @@ -288,6 +408,9 @@ void folio_add_pin(struct folio *folio) if (is_zero_folio(folio)) return;
+ if (unlikely(folio_maybe_exclusive_pinned(folio))) + return; + /* * Similar to try_grab_folio(): be sure to *also* increment the normal * page refcount field at least once, so that the page really is @@ -301,6 +424,15 @@ void folio_add_pin(struct folio *folio) WARN_ON_ONCE(folio_ref_count(folio) < GUP_PIN_COUNTING_BIAS); folio_ref_add(folio, GUP_PIN_COUNTING_BIAS); } + + if (unlikely(folio_maybe_exclusive_pinned(folio))) { + if (folio_test_large(folio)) { + folio_put_refs(folio, 1); + atomic_sub(1, &folio->_pincount); + } else { + folio_put_refs(folio, GUP_PIN_COUNTING_BIAS); + } + } }
static inline struct folio *gup_folio_range_next(struct page *start, @@ -355,8 +487,8 @@ static inline struct folio *gup_folio_next(struct page **list, * set_page_dirty_lock(), unpin_user_page(). * */ -void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, - bool make_dirty) +static void __unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, + bool make_dirty, unsigned int flags) { unsigned long i; struct folio *folio; @@ -395,11 +527,28 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, folio_mark_dirty(folio); folio_unlock(folio); } - gup_put_folio(folio, nr, FOLL_PIN); + gup_put_folio(folio, nr, flags); } } + +void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, + bool make_dirty) +{ + __unpin_user_pages_dirty_lock(pages, npages, make_dirty, FOLL_PIN); +} EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
+void unpin_exc_pages_dirty_lock(struct page **pages, unsigned long npages, + bool make_dirty) +{ + if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_EXCLUSIVE_PIN))) + return; + + __unpin_user_pages_dirty_lock(pages, npages, make_dirty, + FOLL_PIN | FOLL_EXCLUSIVE); +} +EXPORT_SYMBOL(unpin_exc_pages_dirty_lock); + /** * unpin_user_page_range_dirty_lock() - release and optionally dirty * gup-pinned page range @@ -466,7 +615,7 @@ static void gup_fast_unpin_user_pages(struct page **pages, unsigned long npages) * * Please see the unpin_user_page() documentation for details. */ -void unpin_user_pages(struct page **pages, unsigned long npages) +static void __unpin_user_pages(struct page **pages, unsigned long npages, unsigned int flags) { unsigned long i; struct folio *folio; @@ -483,11 +632,35 @@ void unpin_user_pages(struct page **pages, unsigned long npages) sanity_check_pinned_pages(pages, npages); for (i = 0; i < npages; i += nr) { folio = gup_folio_next(pages, npages, i, &nr); - gup_put_folio(folio, nr, FOLL_PIN); + gup_put_folio(folio, nr, flags); } } + +void unpin_user_pages(struct page **pages, unsigned long npages) +{ + __unpin_user_pages(pages, npages, FOLL_PIN); +} EXPORT_SYMBOL(unpin_user_pages);
+void unpin_exc_pages(struct page **pages, unsigned long npages) +{ + if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_EXCLUSIVE_PIN))) + return; + + __unpin_user_pages(pages, npages, FOLL_PIN | FOLL_EXCLUSIVE); +} +EXPORT_SYMBOL(unpin_exc_pages); + +void unexc_user_page(struct page *page) +{ + if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_EXCLUSIVE_PIN))) + return; + + sanity_check_pinned_pages(&page, 1); + gup_put_folio(page_folio(page), 0, FOLL_EXCLUSIVE); +} +EXPORT_SYMBOL(unexc_user_page); + /* * Set the MMF_HAS_PINNED if not set yet; after set it'll be there for the mm's * lifecycle. Avoid setting the bit unless necessary, or it might cause write @@ -2610,6 +2783,18 @@ static bool is_valid_gup_args(struct page **pages, int *locked, if (WARN_ON_ONCE(!(gup_flags & FOLL_PIN) && (gup_flags & FOLL_LONGTERM))) return false;
+ /* EXCLUSIVE can only be specified when config is enabled */ + if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_EXCLUSIVE_PIN) && (gup_flags & FOLL_EXCLUSIVE))) + return false; + + /* EXCLUSIVE can only be specified when pinning */ + if (WARN_ON_ONCE(!(gup_flags & FOLL_PIN) && (gup_flags & FOLL_EXCLUSIVE))) + return false; + + /* EXCLUSIVE can only be specified when LONGTERM */ + if (WARN_ON_ONCE(!(gup_flags & FOLL_LONGTERM) && (gup_flags & FOLL_EXCLUSIVE))) + return false; + /* Pages input must be given if using GET/PIN */ if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages)) return false;
From: Fuad Tabba <tabba@google.com>
When a page is shared, the exclusive pin is dropped, but one normal pin is maintained. In order to be able to unshare a page, add the ability to reacquire the exclusive pin, but only if there is only one normal pin on the page, and only if the page is marked as AnonExclusive.
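For illustration, a sketch of the share/unshare sequence this enables follows. unexc_user_page() and reexc_user_page() are the helpers from this series; the wrapper functions and their names are assumptions made for the example only.

/*
 * Illustration only: driver-side share/unshare flow built on the
 * helpers from this series.  The page starts out private, i.e. with
 * the exclusive pin held.
 */
#include <linux/mm.h>

/* Guest shares the page: drop the exclusive part, keep one normal pin. */
static void guest_share_page(struct page *page)
{
	unexc_user_page(page);
}

/*
 * Guest unshares the page: re-take the exclusive pin.  Returns -EBUSY
 * if any other pin was taken while the page was shared, or -EINVAL if
 * the page is not marked AnonExclusive.
 */
static int guest_unshare_page(struct page *page)
{
	return reexc_user_page(page);
}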
Co-Developed-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
---
 include/linux/mm.h       |  1 +
 include/linux/page_ref.h | 18 ++++++++++++------
 mm/gup.c                 | 48 +++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 50 insertions(+), 17 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index d03d62bceba0..628ab936dd2b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1590,6 +1590,7 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, void unpin_user_pages(struct page **pages, unsigned long npages); void unpin_exc_pages(struct page **pages, unsigned long npages); void unexc_user_page(struct page *page); +int reexc_user_page(struct page *page);
static inline bool is_cow_mapping(vm_flags_t flags) { diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h index 9d16e1f4db09..e66130fe995d 100644 --- a/include/linux/page_ref.h +++ b/include/linux/page_ref.h @@ -92,7 +92,8 @@ static inline void __page_ref_unfreeze(struct page *page, int v) * provides safe operation for get_user_pages(), page_mkclean() and * other calls that race to set up page table entries. */ -#define GUP_PIN_COUNTING_BIAS (1U << 10) +#define GUP_PIN_COUNTING_SHIFT (10) +#define GUP_PIN_COUNTING_BIAS (1U << GUP_PIN_COUNTING_SHIFT)
/* * GUP_PIN_EXCLUSIVE_BIAS is used to grab an exclusive pin over a page. @@ -100,7 +101,8 @@ static inline void __page_ref_unfreeze(struct page *page, int v) * exist for the page. * After it's taken, no other gup pins can be taken. */ -#define GUP_PIN_EXCLUSIVE_BIAS (1U << 30) +#define GUP_PIN_EXCLUSIVE_SHIFT (30) +#define GUP_PIN_EXCLUSIVE_BIAS (1U << GUP_PIN_EXCLUSIVE_SHIFT)
static inline int page_ref_count(const struct page *page) { @@ -155,7 +157,9 @@ static inline void init_page_count(struct page *page) set_page_count(page, 1); }
-static __must_check inline bool page_ref_setexc(struct page *page, unsigned int refs) +static __must_check inline bool page_ref_setexc(struct page *page, + unsigned int expected_pins, + unsigned int refs) { unsigned int old_count, new_count;
@@ -165,7 +169,7 @@ static __must_check inline bool page_ref_setexc(struct page *page, unsigned int do { old_count = atomic_read(&page->_refcount);
- if (old_count >= GUP_PIN_COUNTING_BIAS) + if ((old_count >> GUP_PIN_COUNTING_SHIFT) != expected_pins) return false;
if (check_add_overflow(old_count, refs + GUP_PIN_EXCLUSIVE_BIAS, &new_count)) @@ -178,9 +182,11 @@ static __must_check inline bool page_ref_setexc(struct page *page, unsigned int return true; }
-static __must_check inline bool folio_ref_setexc(struct folio *folio, unsigned int refs) +static __must_check inline bool folio_ref_setexc(struct folio *folio, + unsigned int expected_pins, + unsigned int refs) { - return page_ref_setexc(&folio->page, refs); + return page_ref_setexc(&folio->page, expected_pins, refs); }
static inline void page_ref_add(struct page *page, int nr) diff --git a/mm/gup.c b/mm/gup.c index 7f20de33221d..663030d03d95 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -97,7 +97,9 @@ static inline struct folio *try_get_folio(struct page *page, int refs) return folio; }
-static bool large_folio_pin_setexc(struct folio *folio, unsigned int pins) +static bool large_folio_pin_setexc(struct folio *folio, + unsigned int expected_pins, + unsigned int pins) { unsigned int old_pincount, new_pincount;
@@ -107,7 +109,7 @@ static bool large_folio_pin_setexc(struct folio *folio, unsigned int pins) do { old_pincount = atomic_read(&folio->_pincount);
- if (old_pincount > 0) + if (old_pincount != expected_pins) return false;
if (check_add_overflow(old_pincount, pins + GUP_PIN_EXCLUSIVE_BIAS, &new_pincount)) @@ -117,15 +119,18 @@ static bool large_folio_pin_setexc(struct folio *folio, unsigned int pins) return true; }
-static bool __try_grab_folio_excl(struct folio *folio, int pincount, int refcount) +static bool __try_grab_folio_excl(struct folio *folio, + unsigned int expected_pins, + int pincount, + int refcount) { if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_EXCLUSIVE_PIN))) return false;
if (folio_test_large(folio)) { - if (!large_folio_pin_setexc(folio, pincount)) + if (!large_folio_pin_setexc(folio, expected_pins, pincount)) return false; - } else if (!folio_ref_setexc(folio, refcount)) { + } else if (!folio_ref_setexc(folio, expected_pins, refcount)) { return false; }
@@ -135,7 +140,9 @@ static bool __try_grab_folio_excl(struct folio *folio, int pincount, int refcoun return true; }
-static bool try_grab_folio_excl(struct folio *folio, int refs) +static bool try_grab_folio_excl(struct folio *folio, + unsigned int expected_pins, + int refs) { /* * When pinning a large folio, use an exact count to track it. @@ -145,15 +152,17 @@ static bool try_grab_folio_excl(struct folio *folio, int refs) * is pinned. That's why the refcount from the earlier * try_get_folio() is left intact. */ - return __try_grab_folio_excl(folio, refs, + return __try_grab_folio_excl(folio, expected_pins, refs, refs * (GUP_PIN_COUNTING_BIAS - 1)); }
-static bool try_grab_page_excl(struct page *page) +static bool try_grab_page_excl(struct page *page, + unsigned int expected_pins) { struct folio *folio = page_folio(page);
- return __try_grab_folio_excl(folio, 1, GUP_PIN_COUNTING_BIAS); + return __try_grab_folio_excl(folio, expected_pins, 1, + GUP_PIN_COUNTING_BIAS); }
/** @@ -227,7 +236,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) }
if (unlikely(flags & FOLL_EXCLUSIVE)) { - if (!try_grab_folio_excl(folio, refs)) + if (!try_grab_folio_excl(folio, 0, refs)) return NULL; } else { /* @@ -347,7 +356,7 @@ int __must_check try_grab_page(struct page *page, unsigned int flags) return -EBUSY;
if (unlikely(flags & FOLL_EXCLUSIVE)) { - if (!try_grab_page_excl(page)) + if (!try_grab_page_excl(page, 0)) return -EBUSY; } else { /* @@ -661,6 +670,23 @@ void unexc_user_page(struct page *page) } EXPORT_SYMBOL(unexc_user_page);
+int reexc_user_page(struct page *page) +{ + if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_EXCLUSIVE_PIN))) + return -EINVAL; + + sanity_check_pinned_pages(&page, 1); + + if (!PageAnonExclusive(page)) + return -EINVAL; + + if (!try_grab_page_excl(page, 1)) + return -EBUSY; + + return 0; +} +EXPORT_SYMBOL(reexc_user_page); + /* * Set the MMF_HAS_PINNED if not set yet; after set it'll be there for the mm's * lifecycle. Avoid setting the bit unless necessary, or it might cause write
Add a test that pages have the exclusive pin bias when FOLL_EXCLUSIVE is provided.
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
---
 mm/gup_test.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
diff --git a/mm/gup_test.c b/mm/gup_test.c index eeb3f4d87c510..9c6b8c93e44a7 100644 --- a/mm/gup_test.c +++ b/mm/gup_test.c @@ -66,6 +66,26 @@ static void verify_dma_pinned(unsigned int cmd, struct page **pages, } }
+static void verify_exclusive_pinned(unsigned int gup_flags, struct page **pages, + unsigned long nr_pages) +{ + unsigned long i; + const struct folio *folio; + + if (!(gup_flags & FOLL_EXCLUSIVE)) + return; + + for (i = 0; i < nr_pages; i++) { + folio = page_folio(pages[i]); + + if (WARN(!folio_maybe_exclusive_pinned(folio), + "pages[%lu] is not exclusive pinned\n", i)) { + dump_page(&folio->page, "gup_test failure"); + break; + } + } +} + static void dump_pages_test(struct gup_test *gup, struct page **pages, unsigned long nr_pages) { @@ -185,6 +205,8 @@ static int __gup_test_ioctl(unsigned int cmd, */ verify_dma_pinned(cmd, pages, nr_pages);
+ verify_exclusive_pinned(gup->gup_flags, pages, nr_pages); + if (cmd == DUMP_USER_PAGES_TEST) dump_pages_test(gup, pages, nr_pages);
GUP'ing the same pages twice should return the same pages; test it. In the case of FOLL_EXCLUSIVE, the second pin should fail to get any pages.
Note: this change ought to be refactored to pull out the GUP'ing bits that are duplicated between the original and the second GUP.
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
---
 mm/gup_test.c                         | 86 +++++++++++++++++++++++++++++++++++
 mm/gup_test.h                         |  1 +
 tools/testing/selftests/mm/gup_test.c |  5 +-
 3 files changed, 91 insertions(+), 1 deletion(-)
diff --git a/mm/gup_test.c b/mm/gup_test.c index 9c6b8c93e44a7..28cc422b60b78 100644 --- a/mm/gup_test.c +++ b/mm/gup_test.c @@ -86,6 +86,89 @@ static void verify_exclusive_pinned(unsigned int gup_flags, struct page **pages, } }
+static int verify_gup_twice(unsigned int cmd, struct gup_test *gup, + struct page **expected_pages, + unsigned long expected_nr_pages) +{ + unsigned long i, nr_pages, addr, next; + long nr; + struct page **pages __free(kfree) = NULL; + int ret = 0; + + nr_pages = gup->size / PAGE_SIZE; + pages = kvcalloc(nr_pages, sizeof(void *), GFP_KERNEL); + if (!pages) + return -ENOMEM; + + i = 0; + nr = gup->nr_pages_per_call; + for (addr = gup->addr; addr < gup->addr + gup->size; addr = next) { + if (nr != gup->nr_pages_per_call) + break; + + next = addr + nr * PAGE_SIZE; + if (next > gup->addr + gup->size) { + next = gup->addr + gup->size; + nr = (next - addr) / PAGE_SIZE; + } + + switch (cmd) { + case GUP_FAST_BENCHMARK: + nr = get_user_pages_fast(addr, nr, gup->gup_flags, + pages + i); + break; + case GUP_BASIC_TEST: + nr = get_user_pages(addr, nr, gup->gup_flags, pages + i); + break; + case PIN_FAST_BENCHMARK: + nr = pin_user_pages_fast(addr, nr, gup->gup_flags, + pages + i); + break; + case PIN_BASIC_TEST: + nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i); + break; + case PIN_LONGTERM_BENCHMARK: + nr = pin_user_pages(addr, nr, + gup->gup_flags | FOLL_LONGTERM, + pages + i); + break; + default: + pr_err("cmd %d not supported for %s\n", cmd, __func__); + return -EINVAL; + } + + if (nr <= 0) + break; + i += nr; + } + + nr_pages = i; + + if (gup->gup_flags & FOLL_EXCLUSIVE) { + if (WARN(nr_pages, + "Able to acquire exclusive pin twice for %ld of %ld pages", + nr_pages, expected_nr_pages)) { + dump_page(pages[0], + "gup_test: verify_gup_twice() test"); + ret = -EIO; + } + } else if (nr_pages != expected_nr_pages) { + pr_err("%s: Expected %ld pages, got %ld\n", __func__, + expected_nr_pages, nr_pages); + ret = -EIO; + } else { + for (i = 0; i < nr_pages; i++) { + if (WARN(pages[i] != expected_pages[i], + "pages[%lu] mismatch\n", i)) + break; + } + } + + put_back_pages(cmd, pages, nr_pages, gup->test_flags); + + return ret; +} + static void dump_pages_test(struct gup_test *gup, struct page **pages, unsigned long nr_pages) { @@ -210,6 +293,9 @@ static int __gup_test_ioctl(unsigned int cmd, if (cmd == DUMP_USER_PAGES_TEST) dump_pages_test(gup, pages, nr_pages);
+ if (gup->test_flags & GUP_TEST_FLAG_GUP_TWICE) + ret = verify_gup_twice(cmd, gup, pages, nr_pages); + start_time = ktime_get();
put_back_pages(cmd, pages, nr_pages, gup->test_flags); diff --git a/mm/gup_test.h b/mm/gup_test.h index 5b37b54e8bea6..fcd41919b0159 100644 --- a/mm/gup_test.h +++ b/mm/gup_test.h @@ -17,6 +17,7 @@ #define GUP_TEST_MAX_PAGES_TO_DUMP 8
#define GUP_TEST_FLAG_DUMP_PAGES_USE_PIN 0x1 +#define GUP_TEST_FLAG_GUP_TWICE 0x2
struct gup_test { __u64 get_delta_usec; diff --git a/tools/testing/selftests/mm/gup_test.c b/tools/testing/selftests/mm/gup_test.c index bdeaac67ff9aa..b4b10c8338f80 100644 --- a/tools/testing/selftests/mm/gup_test.c +++ b/tools/testing/selftests/mm/gup_test.c @@ -98,7 +98,7 @@ int main(int argc, char **argv) pthread_t *tid; char *p;
- while ((opt = getopt(argc, argv, "m:r:n:F:f:abcj:tTLUuwWSHpz")) != -1) { + while ((opt = getopt(argc, argv, "m:r:n:F:f:abcj:dtTLUuwWSHpz")) != -1) { switch (opt) { case 'a': cmd = PIN_FAST_BENCHMARK; @@ -172,6 +172,9 @@ int main(int argc, char **argv) /* fault pages in gup, do not fault in userland */ touch = 1; break; + case 'd': + gup.test_flags |= GUP_TEST_FLAG_GUP_TWICE; + break; default: ksft_exit_fail_msg("Wrong argument\n"); }
b4 wasn't happy with my copy/paste of the CC list from Fuad's series [1]. CC'ing them here.
[1]: https://lore.kernel.org/all/20240222161047.402609-1-tabba@google.com/
On Tue, Jun 18, 2024 at 05:05:06PM -0700, Elliot Berman wrote:
In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support grabbing shmem user pages instead of using KVM's guest_memfd. These hypervisors provide a different isolation model than the CoCo implementations from x86. KVM's guest_memfd is focused on providing memory that is more isolated than AVF requires. Some specific examples include the ability to pre-load data onto guest-private pages, dynamically share/isolate guest pages without copy, and (in the future) migrate guest-private pages. Because of those differences, and after a discussion in [1] and at PUCK, we want to try to stick with existing shmem and extend GUP to support the isolation needs of arm64 pKVM and Gunyah. To that end, we introduce the concept of "exclusive GUP pinning", which enforces that only one pin of any kind is allowed when the FOLL_EXCLUSIVE flag is set. This behavior doesn't affect FOLL_GET or any other folio refcount operations that don't go through the FOLL_PIN path.
Tree with patches at: https://git.codelinaro.org/clo/linux-kernel/gunyah-linux/-/tree/sent/exclusi...
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Elliot Berman (2):
      mm/gup-test: Verify exclusive pinned
      mm/gup_test: Verify GUP grabs same pages twice
Fuad Tabba (3):
      mm/gup: Move GUP_PIN_COUNTING_BIAS to page_ref.h
      mm/gup: Add an option for obtaining an exclusive pin
      mm/gup: Add support for re-pinning a normal pinned page as exclusive
 include/linux/mm.h                    |  57 ++++----
 include/linux/mm_types.h              |   2 +
 include/linux/page_ref.h              |  74 ++++++++++
 mm/Kconfig                            |   5 +
 mm/gup.c                              | 265 ++++++++++++++++++++++++++++++----
 mm/gup_test.c                         | 108 ++++++++++++++
 mm/gup_test.h                         |   1 +
 tools/testing/selftests/mm/gup_test.c |   5 +-
 8 files changed, 457 insertions(+), 60 deletions(-)
base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f
change-id: 20240509-exclusive-gup-66259138bbff
Best regards,
Elliot Berman <quic_eberman@quicinc.com>
On 6/18/24 5:05 PM, Elliot Berman wrote:
In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support grabbing shmem user pages instead of using KVM's guest_memfd. These hypervisors provide a different isolation model than the CoCo implementations from x86. KVM's guest_memfd is focused on providing memory that is more isolated than AVF requires. Some specific examples include the ability to pre-load data onto guest-private pages, dynamically share/isolate guest pages without copy, and (in the future) migrate guest-private pages. Because of those differences, and after a discussion in [1] and at PUCK, we want to try to stick with existing shmem and extend GUP to support the isolation needs of arm64 pKVM and Gunyah. To that end, we introduce the concept of "exclusive GUP pinning", which enforces that only one pin of any kind is allowed when the FOLL_EXCLUSIVE flag is set. This behavior doesn't affect FOLL_GET or any other folio refcount operations that don't go through the FOLL_PIN path.
Hi!
Looking through this, I feel that some intangible threshold of "this is too much overloading of page->_refcount" has been crossed. This is a very specific feature, and it is using approximately one more bit than is really actually "available"...
If we need a bit in struct page/folio, is this really the only way? Willy is working towards getting us an entirely separate folio->pincount, I suppose that might take too long? Or not?
This feels like force-fitting a very specific feature (KVM/CoCo handling of shmem pages) into a more general mechanism that is running low on bits (gup/pup).
Maybe a good topic for LPC!
thanks,
Hi,
On 19.06.24 04:44, John Hubbard wrote:
On 6/18/24 5:05 PM, Elliot Berman wrote:
In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support grabbing shmem user pages instead of using KVM's guest_memfd. These hypervisors provide a different isolation model than the CoCo implementations from x86. KVM's guest_memfd is focused on providing memory that is more isolated than AVF requires. Some specific examples include the ability to pre-load data onto guest-private pages, dynamically share/isolate guest pages without copy, and (in the future) migrate guest-private pages. Because of those differences, and after a discussion in [1] and at PUCK, we want to try to stick with existing shmem and extend GUP to support the isolation needs of arm64 pKVM and Gunyah.
The main question really is in which direction we want, and can, develop guest_memfd. At this point (after talking to Jason at LSF/MM), I wonder if guest_memfd should be our new target for guest memory, both shared and private. There are a bunch of issues to be sorted out though ...
As there is interest from Red Hat into supporting hugetlb-style huge pages in confidential VMs for real-time workloads, and wasting memory is not really desired, I'm going to think some more about some of the challenges (shared+private in guest_memfd, mmap support, migration of !shared folios, hugetlb-like support, in-place shared<->private conversion, interaction with page pinning). Tricky.
Ideally, we'd have one way to back guest memory for confidential VMs in the future.
Can you comment on the bigger design goal here? In particular:
1) Who would get the exclusive PIN and for which reason? When would we pin, when would we unpin?
2) What would happen if there is already another PIN? Can we deal with speculative short-term PINs from GUP-fast that could introduce errors?
3) How can we be sure we don't need other long-term pins (IOMMUs?) in the future?
4) Why are GUP pins special? How would one deal with other folio references (e.g., simply mmapping the shmem file into a different process)?
5) Why do you have to bother about anonymous pages at all (skimming over some patches), when you really only want to handle shmem differently?
To that end, we introduce the concept of "exclusive GUP pinning", which enforces that only one pin of any kind is allowed when the FOLL_EXCLUSIVE flag is set. This behavior doesn't affect FOLL_GET or any other folio refcount operations that don't go through the FOLL_PIN path.
So, FOLL_EXCLUSIVE would fail if there already is a PIN, but !FOLL_EXCLUSIVE would succeed even if there is a single PIN via FOLL_EXCLUSIVE? Or would the single FOLL_EXCLUSIVE pin make other pins that don't have FOLL_EXCLUSIVE set fail as well?
Hi!
Looking through this, I feel that some intangible threshold of "this is too much overloading of page->_refcount" has been crossed. This is a very specific feature, and it is using approximately one more bit than is really actually "available"...
Agreed.
If we need a bit in struct page/folio, is this really the only way? Willy is working towards getting us an entirely separate folio->pincount, I suppose that might take too long? Or not?
Before talking about how to implement it, I think we first have to learn whether that approach is what we want at all, and how it fits into the bigger picture of that use case.
This feels like force-fitting a very specific feature (KVM/CoCo handling of shmem pages) into a more general mechanism that is running low on bits (gup/pup).
Agreed.
Maybe a good topic for LPC!
The KVM track has plenty of guest_memfd topics, might be a good fit there. (or in the MM track, of course)
Hi John and David,
Thank you for your comments.
On Wed, Jun 19, 2024 at 8:38 AM David Hildenbrand <david@redhat.com> wrote:
Hi,
On 19.06.24 04:44, John Hubbard wrote:
On 6/18/24 5:05 PM, Elliot Berman wrote:
In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support grabbing shmem user pages instead of using KVM's guest_memfd. These hypervisors provide a different isolation model than the CoCo implementations from x86. KVM's guest_memfd is focused on providing memory that is more isolated than AVF requires. Some specific examples include the ability to pre-load data onto guest-private pages, dynamically share/isolate guest pages without copy, and (in the future) migrate guest-private pages. Because of those differences, and after a discussion in [1] and at PUCK, we want to try to stick with existing shmem and extend GUP to support the isolation needs of arm64 pKVM and Gunyah.
The main question really is, into which direction we want and can develop guest_memfd. At this point (after talking to Jason at LSF/MM), I wonder if guest_memfd should be our new target for guest memory, both shared and private. There are a bunch of issues to be sorted out though ...
As there is interest from Red Hat into supporting hugetlb-style huge pages in confidential VMs for real-time workloads, and wasting memory is not really desired, I'm going to think some more about some of the challenges (shared+private in guest_memfd, mmap support, migration of !shared folios, hugetlb-like support, in-place shared<->private conversion, interaction with page pinning). Tricky.
Ideally, we'd have one way to back guest memory for confidential VMs in the future.
As you know, initially we went down the route of guest memory and invested a lot of time on it, including presenting our proposal at LPC last year. But there was resistance to expanding it to support more than what was initially envisioned, e.g., sharing guest memory in place, migration, and maybe even huge pages, and the implications of that, such as being able to conditionally mmap guest memory.
To be honest, personally (speaking only for myself, not necessarily for Elliot and not for anyone else in the pKVM team), I still would prefer to use guest_memfd(). I think that having one solution for confidential computing that rules them all would be best. But we do need to be able to share memory in place, have a plan for supporting huge pages in the near future, and migration in the not-too-distant future.
We are currently shipping pKVM in Android as it is, warts and all. We're also working on upstreaming the rest of it. Currently, this is the main blocker for us to be able to upstream the rest (same probably applies to Gunyah).
Can you comment on the bigger design goal here? In particular:
At a high level: We want to prevent a misbehaving host process from crashing the system when attempting to access (deliberately or accidentally) protected guest memory. As it currently stands in pKVM and Gunyah, the hypervisor does prevent the host from accessing (private) guest memory. In certain cases though, if the host attempts to access that memory and is prevented by the hypervisor (either out of ignorance or out of malice), the host kernel wouldn't be able to recover, causing the whole system to crash.
guest_memfd() prevents such accesses by not allowing confidential memory to be mapped at the host to begin with. This works fine for us, but there's the issue of being able to share memory in place, which implies mapping it conditionally (among others that I've mentioned).
The approach we're taking with this proposal is to instead restrict the pinning of protected memory. If the host kernel can't pin the memory, then a misbehaving process can't trick the host into accessing it.
- Who would get the exclusive PIN and for which reason? When would we pin, when would we unpin?
The exclusive pin would be acquired for private guest pages, in addition to a normal pin. It would be released when the private memory is released, or if the guest shares that memory.
- What would happen if there is already another PIN? Can we deal with speculative short-term PINs from GUP-fast that could introduce errors?
The exclusive pin would be rejected if there's any other pin (exclusive or normal). Normal pins would be rejected if there's an exclusive pin.
- How can we be sure we don't need other long-term pins (IOMMUs?) in the future?
I can't :)
- Why are GUP pins special? How would one deal with other folio references (e.g., simply mmapping the shmem file into a different process)?
Other references would crash the userspace process, but the host kernel can handle them, and they shouldn't cause the system to crash. The way things are now in Android/pKVM, a userspace process can crash the system as a whole.
- Why do you have to bother about anonymous pages at all (skimming over some patches), when you really only want to handle shmem differently?
I'm not sure I understand the question. We use anonymous memory for pKVM.
To that end, we introduce the concept of "exclusive GUP pinning", which enforces that only one pin of any kind is allowed when the FOLL_EXCLUSIVE flag is set. This behavior doesn't affect FOLL_GET or any other folio refcount operations that don't go through the FOLL_PIN path.
So, FOLL_EXCLUSIVE would fail if there already is a PIN, but !FOLL_EXCLUSIVE would succeed even if there is a single PIN via FOLL_EXCLUSIVE? Or would the single FOLL_EXCLUSIVE pin make other pins that don't have FOLL_EXCLUSIVE set fail as well?
A FOLL_EXCLUSIVE would fail if there's any other pin. A normal pin (!FOLL_EXCLUSIVE) would fail if there's a FOLL_EXCLUSIVE pin. It's the PIN to end all pins!
Hi!
Looking through this, I feel that some intangible threshold of "this is too much overloading of page->_refcount" has been crossed. This is a very specific feature, and it is using approximately one more bit than is really actually "available"...
Agreed.
We are gating it behind a CONFIG flag :)
Also, since pin is already overloading the refcount, having the exclusive pin there helps in ensuring atomic accesses and avoiding races.
If we need a bit in struct page/folio, is this really the only way? Willy is working towards getting us an entirely separate folio->pincount, I suppose that might take too long? Or not?
Before talking about how to implement it, I think we first have to learn whether that approach is what we want at all, and how it fits into the bigger picture of that use case.
This feels like force-fitting a very specific feature (KVM/CoCo handling of shmem pages) into a more general mechanism that is running low on bits (gup/pup).
Agreed.
Maybe a good topic for LPC!
The KVM track has plenty of guest_memfd topics, might be a good fit there. (or in the MM track, of course)
We are planning on submitting a proposal for LPC (see you in Vienna!) :)
Thanks again! /fuad (and elliot*)
* Mistakes, errors, and unclear statements in this email are mine alone though.
--
Cheers,
David / dhildenb
On Wed, Jun 19, 2024 at 10:11:35AM +0100, Fuad Tabba wrote:
To be honest, personally (speaking only for myself, not necessarily for Elliot and not for anyone else in the pKVM team), I still would prefer to use guest_memfd(). I think that having one solution for confidential computing that rules them all would be best. But we do need to be able to share memory in place, have a plan for supporting huge pages in the near future, and migration in the not-too-distant future.
I think using a FD to control this special lifetime stuff is dramatically better than trying to force the MM to do it with struct page hacks.
If you can't agree with the guest_memfd people on how to get there then maybe you need a guest_memfd2 for this slightly different special stuff instead of intruding on the core mm so much. (though that would be sad)
We really need to be thinking more about containing these special things and not just sprinkling them everywhere.
The approach we're taking with this proposal is to instead restrict the pinning of protected memory. If the host kernel can't pin the memory, then a misbehaving process can't trick the host into accessing it.
If the memory can't be accessed by the CPU then it shouldn't be mapped into a PTE in the first place. The fact you made userspace faults (only) work is nifty but still an ugly hack to get around the fact you shouldn't be mapping in the first place.
We already have ZONE_DEVICE/DEVICE_PRIVATE to handle exactly this scenario. "memory" that cannot be touched by the CPU but can still be specially accessed by enlightened components.
guest_memfd, and more broadly memfd-based instead of VMA-based memory mapping in KVM, is a similar outcome to DEVICE_PRIVATE.
I think you need to stay in the world of not mapping the memory, one way or another.
- How can we be sure we don't need other long-term pins (IOMMUs?) in the future?
I can't :)
AFAICT in the pKVM model the IOMMU has to be managed by the hypervisor..
We are gating it behind a CONFIG flag :)
Also, since pin is already overloading the refcount, having the exclusive pin there helps in ensuring atomic accesses and avoiding races.
Yeah, but every time someone does this and then links it to a uAPI it becomes utterly baked in concrete for the MM forever.
Jason
Hi Jason,
On Wed, Jun 19, 2024 at 12:51 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
On Wed, Jun 19, 2024 at 10:11:35AM +0100, Fuad Tabba wrote:
To be honest, personally (speaking only for myself, not necessarily for Elliot and not for anyone else in the pKVM team), I still would prefer to use guest_memfd(). I think that having one solution for confidential computing that rules them all would be best. But we do need to be able to share memory in place, have a plan for supporting huge pages in the near future, and migration in the not-too-distant future.
I think using a FD to control this special lifetime stuff is dramatically better than trying to force the MM to do it with struct page hacks.
If you can't agree with the guest_memfd people on how to get there then maybe you need a guest_memfd2 for this slightly different special stuff instead of intruding on the core mm so much. (though that would be sad)
We really need to be thinking more about containing these special things and not just sprinkling them everywhere.
I agree that we need to agree :) This discussion has been going on since before LPC last year, and the consensus from the guest_memfd() folks (if I understood it correctly) is that guest_memfd() is what it is: designed for a specific type of confidential computing, in the style of TDX and CCA perhaps, and that it cannot (or will not) perform the role of being a general solution for all confidential computing.
The approach we're taking with this proposal is to instead restrict the pinning of protected memory. If the host kernel can't pin the memory, then a misbehaving process can't trick the host into accessing it.
If the memory can't be accessed by the CPU then it shouldn't be mapped into a PTE in the first place. The fact you made userspace faults (only) work is nifty but still an ugly hack to get around the fact you shouldn't be mapping in the first place.
We already have ZONE_DEVICE/DEVICE_PRIVATE to handle exactly this scenario. "memory" that cannot be touched by the CPU but can still be specially accessed by enlightened components.
guest_memfd, and more broadly memfd based instead of VMA based, memory mapping in KVM is a similar outcome to DEVICE_PRIVATE.
I think you need to stay in the world of not mapping the memory, one way or another.
As I mentioned earlier, that's my personal preferred option.
- How can we be sure we don't need other long-term pins (IOMMUs?) in the future?
I can't :)
AFAICT in the pKVM model the IOMMU has to be managed by the hypervisor..
I realized that I misunderstood this. At least speaking for pKVM, we don't need other long term pins as long as the memory is private. The exclusive pin is dropped when the memory is shared.
We are gating it behind a CONFIG flag :)
Also, since pin is already overloading the refcount, having the exclusive pin there helps in ensuring atomic accesses and avoiding races.
Yeah, but every time someone does this and then links it to a uAPI it becomes utterly baked in concrete for the MM forever.
I agree. But if we can't modify guest_memfd() to fit our needs (pKVM, Gunyah), then we don't really have that many other options.
Thanks! /fuad
Jason
On Wed, Jun 19, 2024 at 01:01:14PM +0100, Fuad Tabba wrote:
Hi Jason,
On Wed, Jun 19, 2024 at 12:51 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
On Wed, Jun 19, 2024 at 10:11:35AM +0100, Fuad Tabba wrote:
To be honest, personally (speaking only for myself, not necessarily for Elliot and not for anyone else in the pKVM team), I still would prefer to use guest_memfd(). I think that having one solution for confidential computing that rules them all would be best. But we do need to be able to share memory in place, have a plan for supporting huge pages in the near future, and migration in the not-too-distant future.
I think using a FD to control this special lifetime stuff is dramatically better than trying to force the MM to do it with struct page hacks.
If you can't agree with the guest_memfd people on how to get there then maybe you need a guest_memfd2 for this slightly different special stuff instead of intruding on the core mm so much. (though that would be sad)
We really need to be thinking more about containing these special things and not just sprinkling them everywhere.
I agree that we need to agree :) This discussion has been going on since before LPC last year, and the consensus from the guest_memfd() folks (if I understood it correctly) is that guest_memfd() is what it is: designed for a specific type of confidential computing, in the style of TDX and CCA perhaps, and that it cannot (or will not) perform the role of being a general solution for all confidential computing.
If you can't agree with guest_memfd, that just says you need Yet Another FD, not mm hacks.
IMHO there is nothing intrinsically wrong with having the various FD types being narrowly tailored to their use case. Not to say sharing wouldn't be nice too.
Jason
On Wed, Jun 19, 2024, Fuad Tabba wrote:
Hi Jason,
On Wed, Jun 19, 2024 at 12:51 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
On Wed, Jun 19, 2024 at 10:11:35AM +0100, Fuad Tabba wrote:
To be honest, personally (speaking only for myself, not necessarily for Elliot and not for anyone else in the pKVM team), I still would prefer to use guest_memfd(). I think that having one solution for confidential computing that rules them all would be best. But we do need to be able to share memory in place, have a plan for supporting huge pages in the near future, and migration in the not-too-distant future.
I think using a FD to control this special lifetime stuff is dramatically better than trying to force the MM to do it with struct page hacks.
If you can't agree with the guest_memfd people on how to get there then maybe you need a guest_memfd2 for this slightly different special stuff instead of intruding on the core mm so much. (though that would be sad)
We really need to be thinking more about containing these special things and not just sprinkling them everywhere.
I agree that we need to agree :) This discussion has been going on since before LPC last year, and the consensus from the guest_memfd() folks (if I understood it correctly) is that guest_memfd() is what it is: designed for a specific type of confidential computing, in the style of TDX and CCA perhaps, and that it cannot (or will not) perform the role of being a general solution for all confidential computing.
That isn't remotely accurate. I have stated multiple times that I want guest_memfd to be a vehicle for all VM types, i.e. not just CoCo VMs, and most definitely not just TDX/SNP/CCA VMs.
What I am staunchly against is piling features onto guest_memfd that will cause it to eventually become virtually indistinguishable from any other file-based backing store. I.e. while I want to make guest_memfd usable for all VM *types*, making guest_memfd the preferred backing store for all *VMs* and use cases is very much a non-goal.
From an earlier conversation[1]:
: In other words, ditch the complexity for features that are well served by existing
: general purpose solutions, so that guest_memfd can take on a bit of complexity to
: serve use cases that are unique to KVM guests, without becoming an unmaintainable
: mess due to cross-products.
Also, since pin is already overloading the refcount, having the exclusive pin there helps in ensuring atomic accesses and avoiding races.
Yeah, but every time someone does this and then links it to a uAPI it becomes utterly baked in concrete for the MM forever.
I agree. But if we can't modify guest_memfd() to fit our needs (pKVM, Gunyah), then we don't really have that many other options.
What _are_ your needs? There are multiple unanswered questions from our last conversation[2]. And by "needs" I don't mean "what changes do you want to make to guest_memfd?", I mean "what are the use cases, patterns, and scenarios that you want to support?".
: What's "hypervisor-assisted page migration"? More specifically, what's the : mechanism that drives it?
: Do you happen to have a list of exactly what you mean by "normal mm stuff"? I : am not at all opposed to supporting .mmap(), because long term I also want to : use guest_memfd for non-CoCo VMs. But I want to be very conservative with respect : to what is allowed for guest_memfd. E.g. host userspace can map guest_memfd, : and do operations that are directly related to its mapping, but that's about it.
That distinction matters, because as I have stated in that thread, I am not opposed to page migration itself:
: I am not opposed to page migration itself, what I am opposed to is adding deep : integration with core MM to do some of the fancy/complex things that lead to page : migration.
I am generally aware of the core pKVM use cases, but AFAIK I haven't seen a complete picture of everything you want to do, and _why_.
E.g. if one of your requirements is that guest memory is managed by core-mm the same as all other memory in the system, then yeah, guest_memfd isn't for you. Integrating guest_memfd deeply into core-mm simply isn't realistic, at least not without *massive* changes to core-mm, as the whole point of guest_memfd is that it is guest-first memory, i.e. it is NOT memory that is managed by core-mm (primary MMU) and optionally mapped into KVM (secondary MMU).
Again from that thread, one of the most important aspects of guest_memfd is that VMAs are not required. Stating the obvious, lack of VMAs makes it really hard to drive swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
: More broadly, no VMAs are required. The lack of stage-1 page tables are nice to : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g. : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
[1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
Hi Sean,
On Thu, Jun 20, 2024 at 4:37 PM Sean Christopherson seanjc@google.com wrote:
On Wed, Jun 19, 2024, Fuad Tabba wrote:
Hi Jason,
On Wed, Jun 19, 2024 at 12:51 PM Jason Gunthorpe jgg@nvidia.com wrote:
On Wed, Jun 19, 2024 at 10:11:35AM +0100, Fuad Tabba wrote:
To be honest, personally (speaking only for myself, not necessarily for Elliot and not for anyone else in the pKVM team), I still would prefer to use guest_memfd(). I think that having one solution for confidential computing that rules them all would be best. But we do need to be able to share memory in place, have a plan for supporting huge pages in the near future, and migration in the not-too-distant future.
I think using a FD to control this special lifetime stuff is dramatically better than trying to force the MM to do it with struct page hacks.
If you can't agree with the guest_memfd people on how to get there then maybe you need a guest_memfd2 for this slightly different special stuff instead of intruding on the core mm so much. (though that would be sad)
We really need to be thinking more about containing these special things and not just sprinkling them everywhere.
I agree that we need to agree :) This discussion has been going on since before LPC last year, and the consensus from the guest_memfd() folks (if I understood it correctly) is that guest_memfd() is what it is: designed for a specific type of confidential computing, in the style of TDX and CCA perhaps, and that it cannot (or will not) perform the role of being a general solution for all confidential computing.
That isn't remotely accurate. I have stated multiple times that I want guest_memfd to be a vehicle for all VM types, i.e. not just CoCo VMs, and most definitely not just TDX/SNP/CCA VMs.
I think that there might have been a slight misunderstanding between us. I just thought that that's what you meant by:
: And I'm saying we should stand firm in what guest_memfd _won't_ support, e.g.
: swap/reclaim and probably page migration should get a hard "no".
https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com/
What I am staunchly against is piling features onto guest_memfd that will cause it to eventually become virtually indistinguishable from any other file-based backing store. I.e. while I want to make guest_memfd usable for all VM *types*, making guest_memfd the preferred backing store for all *VMs* and use cases is very much a non-goal.
From an earlier conversation[1]:
: In other words, ditch the complexity for features that are well served by existing
: general purpose solutions, so that guest_memfd can take on a bit of complexity to
: serve use cases that are unique to KVM guests, without becoming an unmaintainable
: mess due to cross-products.
Also, since pin is already overloading the refcount, having the exclusive pin there helps in ensuring atomic accesses and avoiding races.
Yeah, but every time someone does this and then links it to a uAPI it becomes utterly baked in concrete for the MM forever.
I agree. But if we can't modify guest_memfd() to fit our needs (pKVM, Gunyah), then we don't really have that many other options.
What _are_ your needs? There are multiple unanswered questions from our last conversation[2]. And by "needs" I don't mean "what changes do you want to make to guest_memfd?", I mean "what are the use cases, patterns, and scenarios that you want to support?".
I think Quentin's reply in this thread outlines what it is pKVM would like to do, and why it's different from, e.g., TDX: https://lore.kernel.org/all/ZnUsmFFslBWZxGIq@google.com/
To summarize, our requirements are the same as other CC implementations, except that we don't want to pay a penalty for operations that pKVM (and Gunyah) can do more efficiently than encryption-based CC, e.g., in-place conversion of private -> shared.
Apart from that, we are happy to use an interface that can support our needs, or at least that we can extend in the (near) future to do that. Whether it's guest_memfd() or something else.
: What's "hypervisor-assisted page migration"? More specifically, what's the : mechanism that drives it?
I believe what Will specifically meant by this is that we can add hypervisor support for migration in pKVM for the stage 2 page tables.
We don't have a detailed implementation for this yet, of course, since there's no point yet until we know whether we're going with guest_memfd(), or another alternative.
: Do you happen to have a list of exactly what you mean by "normal mm stuff"? I : am not at all opposed to supporting .mmap(), because long term I also want to : use guest_memfd for non-CoCo VMs. But I want to be very conservative with respect : to what is allowed for guest_memfd. E.g. host userspace can map guest_memfd, : and do operations that are directly related to its mapping, but that's about it.
That distinction matters, because as I have stated in that thread, I am not opposed to page migration itself:
: I am not opposed to page migration itself, what I am opposed to is adding deep : integration with core MM to do some of the fancy/complex things that lead to page : migration.
So it's not a "hard no"? :)
I am generally aware of the core pKVM use cases, but AFAIK I haven't seen a complete picture of everything you want to do, and _why_. E.g. if one of your requirements is that guest memory is managed by core-mm the same as all other memory in the system, then yeah, guest_memfd isn't for you. Integrating guest_memfd deeply into core-mm simply isn't realistic, at least not without *massive* changes to core-mm, as the whole point of guest_memfd is that it is guest-first memory, i.e. it is NOT memory that is managed by core-mm (primary MMU) and optionally mapped into KVM (secondary MMU).
It's not a requirement that guest memory is managed by the core-mm. But, like we mentioned, support for in-place conversion from shared->private, huge pages, and eventually migration are.
Again from that thread, one of the most important aspects of guest_memfd is that VMAs are not required. Stating the obvious, lack of VMAs makes it really hard to drive swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
: More broadly, no VMAs are required. The lack of stage-1 page tables are nice to : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g. : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
[1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
I wonder if it might be more productive to also discuss this in one of the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
Cheers, /fuad
Again from that thread, one of the most important aspects of guest_memfd is that VMAs are not required. Stating the obvious, lack of VMAs makes it really hard to drive swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
: More broadly, no VMAs are required. The lack of stage-1 page tables are nice to : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g. : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
[1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
I wonder if it might be more productive to also discuss this in one of the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
I don't know in which context you usually discuss that, but I could propose that as a topic in the bi-weekly MM meeting.
This would, of course, be focused on the bigger MM picture: how to mmap, how to support huge pages, interaction with page pinning, ... So obviously more MM focused once we are in agreement that we want to support shared memory in guest_memfd and how to make that work with core-mm.
Discussing if we want shared memory in guest_memfd might be better suited for a different, more CC/KVM specific meeting (likely the "PUCKs" mentioned here?).
Hi David,
On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand david@redhat.com wrote:
Again from that thread, one of the most important aspects of guest_memfd is that VMAs are not required. Stating the obvious, lack of VMAs makes it really hard to drive swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
: More broadly, no VMAs are required. The lack of stage-1 page tables are nice to : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g. : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
[1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
I wonder if it might be more productive to also discuss this in one of the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
I don't know in which context you usually discuss that, but I could propose that as a topic in the bi-weekly MM meeting.
This would, of course, be focused on the bigger MM picture: how to mmap, how to support huge pages, interaction with page pinning, ... So obviously more MM focused once we are in agreement that we want to support shared memory in guest_memfd and how to make that work with core-mm.
Discussing if we want shared memory in guest_memfd might be better suited for a different, more CC/KVM specific meeting (likely the "PUCKs" mentioned here?).
Sorry, I should have given more context on what a PUCK* is :) It's a periodic (almost weekly) upstream call for KVM.
[*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
But yes, having a discussion in one of the mm meetings ahead of LPC would also be great. When do these meetings usually take place, to try to coordinate across timezones.
Cheers, /fuad
-- Cheers,
David / dhildenb
On 21.06.24 10:54, Fuad Tabba wrote:
Hi David,
On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand david@redhat.com wrote:
Again from that thread, one of the most important aspects of guest_memfd is that VMAs are not required. Stating the obvious, lack of VMAs makes it really hard to drive swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
: More broadly, no VMAs are required. The lack of stage-1 page tables are nice to : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g. : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
[1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
I wonder if it might be more productive to also discuss this in one of the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
I don't know in which context you usually discuss that, but I could propose that as a topic in the bi-weekly MM meeting.
This would, of course, be focused on the bigger MM picture: how to mmap, how to support huge pages, interaction with page pinning, ... So obviously more MM focused once we are in agreement that we want to support shared memory in guest_memfd and how to make that work with core-mm.
Discussing if we want shared memory in guest_memfd might be better suited for a different, more CC/KVM specific meeting (likely the "PUCKs" mentioned here?).
Sorry, I should have given more context on what a PUCK* is :) It's a periodic (almost weekly) upstream call for KVM.
[*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
But yes, having a discussion in one of the mm meetings ahead of LPC would also be great. When do these meetings usually take place, to try to coordinate across timezones.
It's Wednesday, 9:00 - 10:00am PDT (GMT-7) every second week.
If we're in agreement, we could (assuming there are no other planned topics) either use the slot next week (June 26) or the following one (July 10).
Selfish as I am, I would prefer July 10, because I'll be on vacation next week and there would be little time to prepare.
@David R., heads up that this might become a topic ("shared and private memory in guest_memfd: mmap, pinning and huge pages"), if people here agree that this is a direction worth heading.
Hi David,
On Fri, Jun 21, 2024 at 10:10 AM David Hildenbrand david@redhat.com wrote:
On 21.06.24 10:54, Fuad Tabba wrote:
Hi David,
On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand david@redhat.com wrote:
Again from that thread, one of the most important aspects of guest_memfd is that VMAs are not required. Stating the obvious, lack of VMAs makes it really hard to drive swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
: More broadly, no VMAs are required. The lack of stage-1 page tables are nice to : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g. : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
[1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
I wonder if it might be more productive to also discuss this in one of the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
I don't know in which context you usually discuss that, but I could propose that as a topic in the bi-weekly MM meeting.
This would, of course, be focused on the bigger MM picture: how to mmap, how to support huge pages, interaction with page pinning, ... So obviously more MM focused once we are in agreement that we want to support shared memory in guest_memfd and how to make that work with core-mm.
Discussing if we want shared memory in guest_memfd might be better suited for a different, more CC/KVM specific meeting (likely the "PUCKs" mentioned here?).
Sorry, I should have given more context on what a PUCK* is :) It's a periodic (almost weekly) upstream call for KVM.
[*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
But yes, having a discussion in one of the mm meetings ahead of LPC would also be great. When do these meetings usually take place, to try to coordinate across timezones.
It's Wednesday, 9:00 - 10:00am PDT (GMT-7) every second week.
If we're in agreement, we could (assuming there are no other planned topics) either use the slot next week (June 26) or the following one (July 10).
Selfish as I am, I would prefer July 10, because I'll be on vacation next week and there would be little time to prepare.
@David R., heads up that this might become a topic ("shared and private memory in guest_memfd: mmap, pinning and huge pages"), if people here agree that this is a direction worth heading.
Thanks for the invite! Tentatively July 10th works for me, but I'd like to talk to the others who might be interested (pKVM, Gunyah, and others) to see if that works for them. I'll get back to you shortly.
Cheers, /fuad
-- Cheers,
David / dhildenb
On Fri, Jun 21, 2024 at 11:16:31AM +0100, Fuad Tabba wrote:
Hi David,
On Fri, Jun 21, 2024 at 10:10 AM David Hildenbrand david@redhat.com wrote:
On 21.06.24 10:54, Fuad Tabba wrote:
Hi David,
On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand david@redhat.com wrote:
Again from that thread, one of the most important aspects of guest_memfd is that VMAs are not required. Stating the obvious, lack of VMAs makes it really hard to drive swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
: More broadly, no VMAs are required. The lack of stage-1 page tables are nice to : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g. : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
[1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
I wonder if it might be more productive to also discuss this in one of the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
I don't know in which context you usually discuss that, but I could propose that as a topic in the bi-weekly MM meeting.
This would, of course, be focused on the bigger MM picture: how to mmap, how to support huge pages, interaction with page pinning, ... So obviously more MM focused once we are in agreement that we want to support shared memory in guest_memfd and how to make that work with core-mm.
Discussing if we want shared memory in guest_memfd might be better suited for a different, more CC/KVM specific meeting (likely the "PUCKs" mentioned here?).
Sorry, I should have given more context on what a PUCK* is :) It's a periodic (almost weekly) upstream call for KVM.
[*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
But yes, having a discussion in one of the mm meetings ahead of LPC would also be great. When do these meetings usually take place, to try to coordinate across timezones.
It's Wednesday, 9:00 - 10:00am PDT (GMT-7) every second week.
If we're in agreement, we could (assuming there are no other planned topics) either use the slot next week (June 26) or the following one (July 10).
Selfish as I am, I would prefer July 10, because I'll be on vacation next week and there would be little time to prepare.
@David R., heads up that this might become a topic ("shared and private memory in guest_memfd: mmap, pinning and huge pages"), if people here agree that this is a direction worth heading.
Thanks for the invite! Tentatively July 10th works for me, but I'd like to talk to the others who might be interested (pKVM, Gunyah, and others) to see if that works for them. I'll get back to you shortly.
I'd like to join too, July 10th at that time works for me.
- Elliot
On Fri, Jun 21, 2024, Elliot Berman wrote:
On Fri, Jun 21, 2024 at 11:16:31AM +0100, Fuad Tabba wrote:
On Fri, Jun 21, 2024 at 10:10 AM David Hildenbrand david@redhat.com wrote:
On 21.06.24 10:54, Fuad Tabba wrote:
On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand david@redhat.com wrote:
> Again from that thread, one of the most important aspects of guest_memfd is that VMAs
> are not required. Stating the obvious, lack of VMAs makes it really hard to drive
> swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
>
> : More broadly, no VMAs are required. The lack of stage-1 page tables are nice to
> : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
> : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
>
> [1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
> [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
I wonder if it might be more productive to also discuss this in one of the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
I don't know in which context you usually discuss that, but I could propose that as a topic in the bi-weekly MM meeting.
This would, of course, be focused on the bigger MM picture: how to mmap, how to support huge pages, interaction with page pinning, ... So obviously more MM focused once we are in agreement that we want to support shared memory in guest_memfd and how to make that work with core-mm.
Discussing if we want shared memory in guest_memfd might be better suited for a different, more CC/KVM specific meeting (likely the "PUCKs" mentioned here?).
Sorry, I should have given more context on what a PUCK* is :) It's a periodic (almost weekly) upstream call for KVM.
[*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
But yes, having a discussion in one of the mm meetings ahead of LPC would also be great. When do these meetings usually take place, to try to coordinate across timezones.
Let's do the MM meeting. As evidenced by the responses, it'll be easier to get KVM folks to join the MM meeting as opposed to the other way around.
It's Wednesday, 9:00 - 10:00am PDT (GMT-7) every second week.
If we're in agreement, we could (assuming there are no other planned topics) either use the slot next week (June 26) or the following one (July 10).
Selfish as I am, I would prefer July 10, because I'll be on vacation next week and there would be little time to prepare.
@David R., heads up that this might become a topic ("shared and private memory in guest_memfd: mmap, pinning and huge pages"), if people here agree that this is a direction worth heading.
Thanks for the invite! Tentatively July 10th works for me, but I'd like to talk to the others who might be interested (pKVM, Gunyah, and others) to see if that works for them. I'll get back to you shortly.
I'd like to join too, July 10th at that time works for me.
July 10th works for me too.
On Mon, 24 Jun 2024, Sean Christopherson wrote:
On Fri, Jun 21, 2024, Elliot Berman wrote:
On Fri, Jun 21, 2024 at 11:16:31AM +0100, Fuad Tabba wrote:
On Fri, Jun 21, 2024 at 10:10 AM David Hildenbrand david@redhat.com wrote:
On 21.06.24 10:54, Fuad Tabba wrote:
On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand david@redhat.com wrote:
>> Again from that thread, one of the most important aspects of guest_memfd is that VMAs
>> are not required. Stating the obvious, lack of VMAs makes it really hard to drive
>> swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
>>
>> : More broadly, no VMAs are required. The lack of stage-1 page tables are nice to
>> : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
>> : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
>>
>> [1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
>> [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
>
> I wonder if it might be more productive to also discuss this in one of
> the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
I don't know in which context you usually discuss that, but I could propose that as a topic in the bi-weekly MM meeting.
This would, of course, be focused on the bigger MM picture: how to mmap, how to support huge pages, interaction with page pinning, ... So obviously more MM focused once we are in agreement that we want to support shared memory in guest_memfd and how to make that work with core-mm.
Discussing if we want shared memory in guest_memfd might be better suited for a different, more CC/KVM specific meeting (likely the "PUCKs" mentioned here?).
Sorry, I should have given more context on what a PUCK* is :) It's a periodic (almost weekly) upstream call for KVM.
[*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
But yes, having a discussion in one of the mm meetings ahead of LPC would also be great. When do these meetings usually take place, to try to coordinate across timezones.
Let's do the MM meeting. As evidenced by the responses, it'll be easier to get KVM folks to join the MM meeting as opposed to the other way around.
It's Wednesday, 9:00 - 10:00am PDT (GMT-7) every second week.
If we're in agreement, we could (assuming there are no other planned topics) either use the slot next week (June 26) or the following one (July 10).
Selfish as I am, I would prefer July 10, because I'll be on vacation next week and there would be little time to prepare.
@David R., heads up that this might become a topic ("shared and private memory in guest_memfd: mmap, pinning and huge pages"), if people here agree that this is a direction worth heading.
Thanks for the invite! Tentatively July 10th works for me, but I'd like to talk to the others who might be interested (pKVM, Gunyah, and others) to see if that works for them. I'll get back to you shortly.
I'd like to join too, July 10th at that time works for me.
July 10th works for me too.
Thanks all, and David H for the topic suggestion. Let's tentatively pencil this in for the Wednesday, July 10th instance at 9am PDT, and I'll follow up off-list with those who will be needed to lead the discussion to make sure we're on track.
On Mon, Jun 24, 2024 at 2:50 PM David Rientjes rientjes@google.com wrote:
On Mon, 24 Jun 2024, Sean Christopherson wrote:
On Fri, Jun 21, 2024, Elliot Berman wrote:
On Fri, Jun 21, 2024 at 11:16:31AM +0100, Fuad Tabba wrote:
On Fri, Jun 21, 2024 at 10:10 AM David Hildenbrand david@redhat.com wrote:
On 21.06.24 10:54, Fuad Tabba wrote:
On Fri, Jun 21, 2024 at 9:44 AM David Hildenbrand david@redhat.com wrote:
>
>>> Again from that thread, one of the most important aspects of guest_memfd is that VMAs
>>> are not required. Stating the obvious, lack of VMAs makes it really hard to drive
>>> swap, reclaim, migration, etc. from code that fundamentally operates on VMAs.
>>>
>>> : More broadly, no VMAs are required. The lack of stage-1 page tables are nice to
>>> : have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g.
>>> : it's not subject to VMA protections, isn't restricted to host mapping size, etc.
>>>
>>> [1] https://lore.kernel.org/all/Zfmpby6i3PfBEcCV@google.com
>>> [2] https://lore.kernel.org/all/Zg3xF7dTtx6hbmZj@google.com
>>
>> I wonder if it might be more productive to also discuss this in one of
>> the PUCKs, ahead of LPC, in addition to trying to go over this in LPC.
>
> I don't know in which context you usually discuss that, but I could
> propose that as a topic in the bi-weekly MM meeting.
>
> This would, of course, be focused on the bigger MM picture: how to mmap,
> how to support huge pages, interaction with page pinning, ... So
> obviously more MM focused once we are in agreement that we want to
> support shared memory in guest_memfd and how to make that work with core-mm.
>
> Discussing if we want shared memory in guest_memfd might be better
> suited for a different, more CC/KVM specific meeting (likely the "PUCKs"
> mentioned here?).
Sorry, I should have given more context on what a PUCK* is :) It's a periodic (almost weekly) upstream call for KVM.
[*] https://lore.kernel.org/all/20230512231026.799267-1-seanjc@google.com/
But yes, having a discussion in one of the mm meetings ahead of LPC would also be great. When do these meetings usually take place, to try to coordinate across timezones.
Let's do the MM meeting. As evidenced by the responses, it'll be easier to get KVM folks to join the MM meeting as opposed to the other way around.
It's Wednesday, 9:00 - 10:00am PDT (GMT-7) every second week.
If we're in agreement, we could (assuming there are no other planned topics) either use the slot next week (June 26) or the following one (July 10).
Selfish as I am, I would prefer July 10, because I'll be on vacation next week and there would be little time to prepare.
@David R., heads up that this might become a topic ("shared and private memory in guest_memfd: mmap, pinning and huge pages"), if people here agree that this is a direction worth heading.
Thanks for the invite! Tentatively July 10th works for me, but I'd like to talk to the others who might be interested (pKVM, Gunyah, and others) to see if that works for them. I'll get back to you shortly.
I'd like to join too, July 10th at that time works for me.
July 10th works for me too.
Thanks all, and David H for the topic suggestion. Let's tentatively pencil this in for the Wednesday, July 10th instance at 9am PDT, and I'll follow up off-list with those who will be needed to lead the discussion to make sure we're on track.
I would like to join the call too.
Regards, Vishal
If the memory can't be accessed by the CPU then it shouldn't be mapped into a PTE in the first place. The fact you made userspace faults (only) work is nifty but still an ugly hack to get around the fact you shouldn't be mapping in the first place.
We already have ZONE_DEVICE/DEVICE_PRIVATE to handle exactly this scenario. "memory" that cannot be touched by the CPU but can still be specially accessed by enlightened components.
guest_memfd, and more broadly memfd-based (instead of VMA-based) memory mapping in KVM, is a similar outcome to DEVICE_PRIVATE.
I think you need to stay in the world of not mapping the memory, one way or another.
Fully agreed. Private memory shall not be mapped.
On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
If you can't agree with the guest_memfd people on how to get there then maybe you need a guest_memfd2 for this slightly different special stuff instead of intruding on the core mm so much. (though that would be sad)
Or we're just not going to support it at all. It's not like supporting this weird usage model is a must-have for Linux to start with.
Hi,
On Thu, Jun 20, 2024 at 5:11 AM Christoph Hellwig hch@infradead.org wrote:
On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
If you can't agree with the guest_memfd people on how to get there then maybe you need a guest_memfd2 for this slightly different special stuff instead of intruding on the core mm so much. (though that would be sad)
Or we're just not going to support it at all. It's not like supporting this weird usage model is a must-have for Linux to start with.
Sorry, but could you please clarify to me what usage model you're referring to exactly, and why you think it's weird? It's just that we have covered a few things in this thread, and to me it's not clear if you're referring to protected VMs sharing memory, or being able to (conditionally) map a VM's memory that's backed by guest_memfd(), or if it's the Exclusive pin.
Thank you, /fuad
On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
Hi,
On Thu, Jun 20, 2024 at 5:11 AM Christoph Hellwig hch@infradead.org wrote:
On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
If you can't agree with the guest_memfd people on how to get there then maybe you need a guest_memfd2 for this slightly different special stuff instead of intruding on the core mm so much. (though that would be sad)
Or we're just not going to support it at all. It's not like supporting this weird usage model is a must-have for Linux to start with.
Sorry, but could you please clarify to me what usage model you're referring to exactly, and why you think it's weird? It's just that we have covered a few things in this thread, and to me it's not clear if you're referring to protected VMs sharing memory, or being able to (conditionally) map a VM's memory that's backed by guest_memfd(), or if it's the Exclusive pin.
Personally I think mapping memory under guest_memfd is pretty weird.
I don't really understand why you end up with something different than normal CC. Normal CC has memory that the VMM can access and memory it cannot access. guest_memory is supposed to hold the memory the VMM cannot reach, right?
So how does normal CC handle memory switching between private and shared and why doesn't that work for pKVM? I think the normal CC path effectively discards the memory content on these switches and is slow. Are you trying to make the switch content preserving and faster?
If yes, why? What is wrong with the normal CC model of slow and non-preserving shared memory? Are you trying to speed up IO in these VMs by dynamically sharing pages instead of SWIOTLB?
Maybe this was all explained, but I reviewed your presentation and the cover letter for the guest_memfd patches and I still don't see the why in all of this.
Jason
On 20.06.24 15:55, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
Hi,
On Thu, Jun 20, 2024 at 5:11 AM Christoph Hellwig hch@infradead.org wrote:
On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
If you can't agree with the guest_memfd people on how to get there then maybe you need a guest_memfd2 for this slightly different special stuff instead of intruding on the core mm so much. (though that would be sad)
Or we're just not going to support it at all. It's not like supporting this weird usage model is a must-have for Linux to start with.
Sorry, but could you please clarify to me what usage model you're referring to exactly, and why you think it's weird? It's just that we have covered a few things in this thread, and to me it's not clear if you're referring to protected VMs sharing memory, or being able to (conditionally) map a VM's memory that's backed by guest_memfd(), or if it's the Exclusive pin.
Personally I think mapping memory under guest_memfd is pretty weird.
I don't really understand why you end up with something different than normal CC. Normal CC has memory that the VMM can access and memory it cannot access. guest_memory is supposed to hold the memory the VMM cannot reach, right?
So how does normal CC handle memory switching between private and shared and why doesn't that work for pKVM? I think the normal CC path effectively discards the memory content on these switches and is slow. Are you trying to make the switch content preserving and faster?
If yes, why? What is wrong with the normal CC model of slow and non-preserving shared memory?
I'll leave the !huge page part to Fuad.
Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared, now the VM requests to make one subpage private. How to handle that without eventually running into a double memory-allocation? (in the worst case, allocating a 1GiB huge page for shared and for private memory).
In the world of RT, you want your VM to be consistently backed by huge/gigantic mappings, not some weird mixture -- so I've been told by our RT team.
(there are more issues with huge pages in the style of hugetlb, where we actually want to preallocate all pages and not rely on dynamic allocation at runtime when we convert back and forth between shared and private)
On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
On 20.06.24 15:55, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
Hi,
On Thu, Jun 20, 2024 at 5:11 AM Christoph Hellwig hch@infradead.org wrote:
On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
If you can't agree with the guest_memfd people on how to get there then maybe you need a guest_memfd2 for this slightly different special stuff instead of intruding on the core mm so much. (though that would be sad)
Or we're just not going to support it at all. It's not like supporting this weird usage model is a must-have for Linux to start with.
Sorry, but could you please clarify to me what usage model you're referring to exactly, and why you think it's weird? It's just that we have covered a few things in this thread, and to me it's not clear if you're referring to protected VMs sharing memory, or being able to (conditionally) map a VM's memory that's backed by guest_memfd(), or if it's the Exclusive pin.
Personally I think mapping memory under guest_memfd is pretty weird.
I don't really understand why you end up with something different than normal CC. Normal CC has memory that the VMM can access and memory it cannot access. guest_memory is supposed to hold the memory the VMM cannot reach, right?
So how does normal CC handle memory switching between private and shared and why doesn't that work for pKVM? I think the normal CC path effectively discards the memory content on these switches and is slow. Are you trying to make the switch content preserving and faster?
If yes, why? What is wrong with the normal CC model of slow and non-preserving shared memory?
I'll leave the !huge page part to Fuad.
Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared, now the VM requests to make one subpage private.
I think the general CC model has the shared/private setup earlier in the VM lifecycle with large runs of contiguous pages. It would only become a problem if you intend to do high-rate, fine-granular shared/private switching. Which is why I am asking what the actual "why" is here.
How to handle that without eventually running into a double memory-allocation? (in the worst case, allocating a 1GiB huge page for shared and for private memory).
I expect you'd take the linear range of 1G of PFNs and fragment it into three ranges private/shared/private that span the same 1G.
When you construct a page table (i.e. an S2) that holds these three ranges and has permission to access all the memory, you want the page table to automatically join them back together into a 1GB entry.
When you construct a page table that has only access to the shared, then you'd only install the shared hole at its natural best size.
So, I think there are two challenges - how to build an allocator and uAPI to manage this sort of stuff so you can keep track of any fractured pfns and ensure things remain in physical order.
Then how to re-consolidate this for the KVM side of the world.
guest_memfd, or something like it, is just really a good answer. You have it obtain the huge folio, and keep track on its own which sub pages can be mapped to a VMA because they are shared. KVM will obtain the PFNs directly from the fd and KVM will not see the shared holes. This means your S2's can be trivially constructed correctly.
No need to double allocate..
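As a rough illustration of that tracking idea, a hedged sketch follows; the structure and helpers below are made up for illustration and do not come from any posted guest_memfd patches:

struct gmem_huge_region {
	struct folio	*folio;		/* the preallocated 1GiB huge folio */
	unsigned long	*shared;	/* one bit per 4K unit: set = currently shared */
	unsigned int	nr_pages;	/* SZ_1G / PAGE_SIZE */
};

/* Host fault path: only units currently marked shared may ever reach a VMA. */
static struct page *gmem_fault_page(struct gmem_huge_region *r, pgoff_t idx)
{
	if (idx >= r->nr_pages || !test_bit(idx, r->shared))
		return NULL;			/* private (or out of range): fail the fault */
	return folio_page(r->folio, idx);
}

/*
 * KVM path: the fd hands out the contiguous PFN range directly, so the
 * stage-2 (and IOMMU) mapping can still be a single 1GiB block; the shared
 * holes only exist in the host's view of the region.
 */
static unsigned long gmem_base_pfn(struct gmem_huge_region *r)
{
	return folio_pfn(r->folio);
}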
I'm kind of surprised the CC folks don't want the same thing for exactly the same reason. It is much easier to recover the huge mappings for the S2 in the presence of shared holes if you track it this way. Even CC will have this problem, to some degree, too.
In the world of RT, you want your VM to be consistently backed by huge/gigantic mappings, not some weird mixture -- so I've been told by our RT team.
Yes, even outside RT, if you want good IO performance in DMA you must also have high IOTLB hit rates too, especially with nesting.
Jason
On 20.06.24 16:29, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
On 20.06.24 15:55, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
Hi,
On Thu, Jun 20, 2024 at 5:11 AM Christoph Hellwig hch@infradead.org wrote:
On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
If you can't agree with the guest_memfd people on how to get there then maybe you need a guest_memfd2 for this slightly different special stuff instead of intruding on the core mm so much. (though that would be sad)
Or we're just not going to support it at all. It's not like supporting this weird usage model is a must-have for Linux to start with.
Sorry, but could you please clarify to me what usage model you're referring to exactly, and why you think it's weird? It's just that we have covered a few things in this thread, and to me it's not clear if you're referring to protected VMs sharing memory, or being able to (conditionally) map a VM's memory that's backed by guest_memfd(), or if it's the Exclusive pin.
Personally I think mapping memory under guest_memfd is pretty weird.
I don't really understand why you end up with something different than normal CC. Normal CC has memory that the VMM can access and memory it cannot access. guest_memory is supposed to hold the memory the VMM cannot reach, right?
So how does normal CC handle memory switching between private and shared and why doesn't that work for pKVM? I think the normal CC path effectively discards the memory content on these switches and is slow. Are you trying to make the switch content preserving and faster?
If yes, why? What is wrong with the normal CC model of slow and non-preserving shared memory?
I'll leave the !huge page part to Fuad.
Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared, now the VM requests to make one subpage private.
I think the general CC model has the shared/private setup earlier in the VM lifecycle with large runs of contiguous pages. It would only become a problem if you intend to do high-rate, fine-granular shared/private switching. Which is why I am asking what the actual "why" is here.
I am not an expert on that, but I remember that the way memory shared<->private conversion happens can heavily depend on the VM use case, and that under pKVM we might see more frequent conversion, without even going to user space.
How to handle that without eventually running into a double memory-allocation? (in the worst case, allocating a 1GiB huge page for shared and for private memory).
I expect you'd take the linear range of 1G of PFNs and fragment it into three ranges private/shared/private that span the same 1G.
When you construct a page table (ie a S2) that holds these three ranges and has permission to access all the memory you want the page table to automatically join them back together into 1GB entry.
When you construct a page table that has only access to the shared, then you'd only install the shared hole at its natural best size.
So, I think there are two challenges - how to build an allocator and uAPI to manage this sort of stuff so you can keep track of any fractured pfns and ensure things remain in physical order.
Then how to re-consolidate this for the KVM side of the world.
Exactly!
guest_memfd, or something like it, is just really a good answer. You have it obtain the huge folio, and keep track on its own which sub pages can be mapped to a VMA because they are shared. KVM will obtain the PFNs directly from the fd and KVM will not see the shared holes. This means your S2's can be trivially constructed correctly.
No need to double allocate..
Yes, that's why my thinking so far was:
Let guest_memfd (or something like that) consume huge pages (somehow, let it access the hugetlb reserves). Preallocate that memory once, as the VM starts up: just like we do with hugetlb in VMs.
Let KVM track which parts are shared/private, and if required, let it map only the shared parts to user space. KVM has all information to make these decisions.
If we could disallow pinning any shared pages, that would make life a lot easier, but I think there were reasons for why we might require it. To convert shared->private, simply unmap that folio (only the shared parts could possibly be mapped) from all user page tables.
Of course, there might be alternatives, and I'll be happy to learn about them. The allocator part would be fairly easy, and the uAPI part would be comparably easy. So far the theory :)
I'm kind of surprised the CC folks don't want the same thing for exactly the same reason. It is much easier to recover the huge mappings for the S2 in the presence of shared holes if you track it this way. Even CC will have this problem, to some degree, too.
Precisely! RH (and therefore, me) is primarily interested in existing guest_memfd users at this point ("CC"), and I don't see an easy way to get that running with huge pages in the existing model reasonably well ...
On Thu, Jun 20, 2024, David Hildenbrand wrote:
On 20.06.24 16:29, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
On 20.06.24 15:55, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared, now the VM requests to make one subpage private.
I think the general CC model has the shared/private setup earlier in the VM lifecycle with large runs of contiguous pages. It would only become a problem if you intend to do high-rate, fine-granular shared/private switching. Which is why I am asking what the actual "why" is here.
I am not an expert on that, but I remember that the way memory shared<->private conversion happens can heavily depend on the VM use case,
Yeah, I forget the details, but there are scenarios where the guest will share (and unshare) memory at 4KiB (give or take) granularity, at runtime. There's an RFC[*] for making SWIOTLB operate at 2MiB that is driven by the same underlying problems.
But even if Linux-as-a-guest were better behaved, we (the host) can't prevent the guest from doing suboptimal conversions. In practice, killing the guest or refusing to convert memory isn't an option, i.e. we can't completely push the problem into the guest
https://lore.kernel.org/all/20240112055251.36101-1-vannapurve@google.com
and that under pKVM we might see more frequent conversion, without even going to user space.
How to handle that without eventually running into a double memory-allocation? (in the worst case, allocating a 1GiB huge page for shared and for private memory).
I expect you'd take the linear range of 1G of PFNs and fragment it into three ranges private/shared/private that span the same 1G.
When you construct a page table (ie a S2) that holds these three ranges and has permission to access all the memory you want the page table to automatically join them back together into 1GB entry.
When you construct a page table that has only access to the shared, then you'd only install the shared hole at its natural best size.
So, I think there are two challenges - how to build an allocator and uAPI to manage this sort of stuff so you can keep track of any fractured pfns and ensure things remain in physical order.
Then how to re-consolidate this for the KVM side of the world.
Exactly!
guest_memfd, or something like it, is just really a good answer. You have it obtain the huge folio, and keep track on its own which sub pages can be mapped to a VMA because they are shared. KVM will obtain the PFNs directly from the fd and KVM will not see the shared holes. This means your S2's can be trivially constructed correctly.
No need to double allocate..
Yes, that's why my thinking so far was:
Let guest_memfd (or something like that) consume huge pages (somehow, let it access the hugetlb reserves). Preallocate that memory once, as the VM starts up: just like we do with hugetlb in VMs.
Let KVM track which parts are shared/private, and if required, let it map only the shared parts to user space. KVM has all information to make these decisions.
If we could disallow pinning any shared pages, that would make life a lot easier, but I think there were reasons for why we might require it. To convert shared->private, simply unmap that folio (only the shared parts could possibly be mapped) from all user page tables.
Of course, there might be alternatives, and I'll be happy to learn about them. The allocator part would be fairly easy, and the uAPI part would be comparably easy. So far the theory :)
I'm kind of surprised the CC folks don't want the same thing for exactly the same reason. It is much easier to recover the huge mappings for the S2 in the presence of shared holes if you track it this way. Even CC will have this problem, to some degree, too.
Precisely! RH (and therefore, me) is primarily interested in existing guest_memfd users at this point ("CC"), and I don't see an easy way to get that running with huge pages in the existing model reasonably well ...
This is the general direction guest_memfd is headed, but getting there is easier said than done. E.g. as alluded to above, "simply unmap that folio" is quite difficult, bordering on infeasible if the kernel is allowed to gup() shared guest_memfd memory.
On 20.06.24 18:04, Sean Christopherson wrote:
On Thu, Jun 20, 2024, David Hildenbrand wrote:
On 20.06.24 16:29, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
On 20.06.24 15:55, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared, now the VM requests to make one subpage private.
I think the general CC model has the shared/private setup earlier in the VM lifecycle with large runs of contiguous pages. It would only become a problem if you intend to do high-rate, fine-granular shared/private switching. Which is why I am asking what the actual "why" is here.
I am not an expert on that, but I remember that the way memory shared<->private conversion happens can heavily depend on the VM use case,
Yeah, I forget the details, but there are scenarios where the guest will share (and unshare) memory at 4KiB (give or take) granularity, at runtime. There's an RFC[*] for making SWIOTLB operate at 2MiB that is driven by the same underlying problems.
But even if Linux-as-a-guest were better behaved, we (the host) can't prevent the guest from doing suboptimal conversions. In practice, killing the guest or refusing to convert memory isn't an option, i.e. we can't completely push the problem into the guest
Agreed!
https://lore.kernel.org/all/20240112055251.36101-1-vannapurve@google.com
and that under pKVM we might see more frequent conversion, without even going to user space.
How to handle that without eventually running into a double memory-allocation? (in the worst case, allocating a 1GiB huge page for shared and for private memory).
I expect you'd take the linear range of 1G of PFNs and fragment it into three ranges private/shared/private that span the same 1G.
When you construct a page table (ie a S2) that holds these three ranges and has permission to access all the memory you want the page table to automatically join them back together into 1GB entry.
When you construct a page table that has only access to the shared, then you'd only install the shared hole at its natural best size.
So, I think there are two challenges - how to build an allocator and uAPI to manage this sort of stuff so you can keep track of any fractured pfns and ensure things remain in physical order.
Then how to re-consolidate this for the KVM side of the world.
Exactly!
guest_memfd, or something like it, is just really a good answer. You have it obtain the huge folio, and keep track on its own which sub pages can be mapped to a VMA because they are shared. KVM will obtain the PFNs directly from the fd and KVM will not see the shared holes. This means your S2's can be trivially constructed correctly.
No need to double allocate..
Yes, that's why my thinking so far was:
Let guest_memfd (or something like that) consume huge pages (somehow, let it access the hugetlb reserves). Preallocate that memory once, as the VM starts up: just like we do with hugetlb in VMs.
Let KVM track which parts are shared/private, and if required, let it map only the shared parts to user space. KVM has all information to make these decisions.
If we could disallow pinning any shared pages, that would make life a lot easier, but I think there were reasons for why we might require it. To convert shared->private, simply unmap that folio (only the shared parts could possibly be mapped) from all user page tables.
Of course, there might be alternatives, and I'll be happy to learn about them. The allocator part would be fairly easy, and the uAPI part would be comparably easy. So far the theory :)
I'm kind of surprised the CC folks don't want the same thing for exactly the same reason. It is much easier to recover the huge mappings for the S2 in the presence of shared holes if you track it this way. Even CC will have this problem, to some degree, too.
Precisely! RH (and therefore, me) is primarily interested in existing guest_memfd users at this point ("CC"), and I don't see an easy way to get that running with huge pages in the existing model reasonably well ...
This is the general direction guest_memfd is headed, but getting there is easier said than done. E.g. as alluded to above, "simply unmap that folio" is quite difficult, bordering on infeasible if the kernel is allowed to gup() shared guest_memfd memory.
Right. I think ways forward are the ones stated in my mail to Jason: disallow long-term GUP or expose the huge page as unmovable small folios to core-mm.
Maybe there are other alternatives, but it all feels like we want the MM to track at the granularity of small pages, but map it into the KVM/IOMMU page tables in large pages.
On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
If we could disallow pinning any shared pages, that would make life a lot easier, but I think there were reasons for why we might require it. To convert shared->private, simply unmap that folio (only the shared parts could possibly be mapped) from all user page tables.
IMHO it should be reasonable to make it work like ZONE_MOVABLE and FOLL_LONGTERM. Making a shared page private is really no different from moving it.
And if you have built a VMM that uses VMA mapped shared pages and short-term pinning then you should really also ensure that the VM is aware when the pins go away. For instance if you are doing some virtio thing with O_DIRECT pinning then the guest will know the pins are gone when it observes virtio completions.
In this way making private is just like moving, we unmap the page and then drive the refcount to zero, then move it.
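A minimal sketch of that flow, assuming FOLL_LONGTERM is refused for these shared pages (as it is for ZONE_MOVABLE) so only transient pins can exist; the gmem_* helpers are hypothetical placeholders:

static int gmem_make_private(struct folio *folio)
{
	/* 1) Tear down all user mappings, exactly as migration's unmap step would. */
	gmem_unmap_user_mappings(folio);		/* hypothetical rmap-walk helper */

	/*
	 * 2) Wait for transient pins (e.g. O_DIRECT) to drain; with
	 *    FOLL_LONGTERM disallowed on these pages, this terminates.
	 */
	while (folio_ref_count(folio) > gmem_expected_refs(folio))	/* hypothetical */
		schedule_timeout_uninterruptible(msecs_to_jiffies(1));

	/* 3) Only the fd's own references remain; hand the page to the guest. */
	return gmem_hypervisor_donate(folio);		/* hypothetical hypercall wrapper */
}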
I'm kind of surprised the CC folks don't want the same thing for exactly the same reason. It is much easier to recover the huge mappings for the S2 in the presence of shared holes if you track it this way. Even CC will have this problem, to some degree, too.
Precisely! RH (and therefore, me) is primarily interested in existing guest_memfd users at this point ("CC"), and I don't see an easy way to get that running with huge pages in the existing model reasonably well ...
IMHO it is an important topic so I'm glad you are thinking about it.
There is definitely some overlap here where if you do teach guest_memfd about huge pages then you must also provide a way to map the fragments of them that have become shared. I think there is little option here unless you double allocate and/or destroy the performance properties of the huge pages.
It is just the nature of our system that shared pages must be in VMAs and must be copy_to/from_user/GUP'able/etc.
Jason
On 20.06.24 18:36, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
If we could disallow pinning any shared pages, that would make life a lot easier, but I think there were reasons for why we might require it. To convert shared->private, simply unmap that folio (only the shared parts could possibly be mapped) from all user page tables.
IMHO it should be reasonable to make it work like ZONE_MOVABLE and FOLL_LONGTERM. Making a shared page private is really no different from moving it.
And if you have built a VMM that uses VMA mapped shared pages and short-term pinning then you should really also ensure that the VM is aware when the pins go away. For instance if you are doing some virtio thing with O_DIRECT pinning then the guest will know the pins are gone when it observes virtio completions.
In this way making private is just like moving, we unmap the page and then drive the refcount to zero, then move it.
Yes, but here is the catch: what if a single shared subpage of a large folio is (validly) longterm pinned and you want to convert another shared subpage to private?
Sure, we can unmap the whole large folio (including all shared parts) before the conversion, just like we would do for migration. But we cannot detect that nobody pinned that subpage that we want to convert to private.
Core-mm is not, and will not, track pins per subpage.
So I only see two options:
a) Disallow long-term pinning. That means, we can, with a bit of wait, always convert subpages shared->private after unmapping them and waiting for the short-term pin to go away. Not too bad, and we already have other mechanisms to disallow long-term pinnings (especially writable fs ones!).
b) Expose the large folio as multiple 4k folios to the core-mm.
b) would look as follows: we allocate a gigantic page from the (hugetlb) reserve into guest_memfd. Then, we break it down into individual 4k folios by splitting/demoting the folio. We make sure that all 4k folios are unmovable (raised refcount). We keep tracking internally that these 4k folios comprise a single large gigantic page.
Core-mm can then track GUP pins and page table mappings per (previously subpage, now) small folio for us, without any modifications.
Once we unmap the gigantic page from guest_memfd, we reconstruct the gigantic page and hand it back to the reserve (only possible once all pins are gone).
We can still map the whole thing into the KVM guest+iommu using a single large unit, because guest_memfd knows the origin/relationship of these pages. But we would only map individual pages into user page tables (unless we use large VM_PFNMAP mappings, but then also pinning would not work, so that's likely also not what we want).
The downside is that we won't benefit from vmemmap optimizations for large folios from hugetlb, and have more tracking overhead when mapping individual pages into user page tables.
OTOH, maybe we really *need* per-page tracking and this might be the simplest way forward, making GUP and friends just work naturally with it.
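To make the bookkeeping implied by b) a bit more concrete, here is a minimal sketch of the kind of internal tracking guest_memfd could keep. Everything here is illustrative; none of these names exist today.

/*
 * Illustrative only: per-region state guest_memfd might keep so the
 * original 1 GiB run can still be mapped as a single unit into KVM/IOMMU
 * page tables while core-mm only ever sees the individual small folios.
 */
struct gmem_huge_region {
	unsigned long base_pfn;        /* first PFN of the gigantic page */
	unsigned int nr_pages;         /* 1 GiB / 4 KiB = 262144 small folios */
	unsigned long *private_bitmap; /* per-4K shared/private state */
	/*
	 * Once every small folio's refcount has dropped back to the reference
	 * guest_memfd itself holds, the gigantic page can be reconstructed
	 * and handed back to the hugetlb reserve.
	 */
};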
I'm kind of surprised the CC folks don't want the same thing for exactly the same reason. It is much easier to recover the huge mappings for the S2 in the presence of shared holes if you track it this way. Even CC will have this problem, to some degree, too.
Precisely! RH (and therefore, me) is primarily interested in existing guest_memfd users at this point ("CC"), and I don't see an easy way to get that running with huge pages in the existing model reasonably well ...
IMHO it is an important topic so I'm glad you are thinking about it.
Thank my manager ;)
There is definitely some overlap here where if you do teach guest_memfd about huge pages then you must also provide a way to map the fragments of them that have become shared. I think there is little option here unless you double allocate and/or destroy the performance properties of the huge pages.
Right, and that's not what we want.
It is just the nature of our system that shared pages must be in VMAs and must be copy_to/from_user/GUP'able/etc.
Right. Longterm GUP is not a real requirement.
On Thu, Jun 20, 2024, David Hildenbrand wrote:
On 20.06.24 18:36, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
If we could disallow pinning any shared pages, that would make life a lot easier, but I think there were reasons for why we might require it. To convert shared->private, simply unmap that folio (only the shared parts could possibly be mapped) from all user page tables.
IMHO it should be reasonable to make it work like ZONE_MOVABLE and FOLL_LONGTERM. Making a shared page private is really no different from moving it.
And if you have built a VMM that uses VMA mapped shared pages and short-term pinning then you should really also ensure that the VM is aware when the pins go away. For instance if you are doing some virtio thing with O_DIRECT pinning then the guest will know the pins are gone when it observes virtio completions.
In this way making private is just like moving, we unmap the page and then drive the refcount to zero, then move it.
Yes, but here is the catch: what if a single shared subpage of a large folio is (validly) longterm pinned and you want to convert another shared subpage to private?
Sure, we can unmap the whole large folio (including all shared parts) before the conversion, just like we would do for migration. But we cannot detect that nobody pinned that subpage that we want to convert to private.
Core-mm is not, and will not, track pins per subpage.
So I only see two options:
a) Disallow long-term pinning. That means, we can, with a bit of wait, always convert subpages shared->private after unmapping them and waiting for the short-term pin to go away. Not too bad, and we already have other mechanisms to disallow long-term pinnings (especially writable fs ones!).
I don't think disallowing _just_ long-term GUP will suffice; if we go the "disallow GUP" route then I think it needs to disallow GUP, period. Like the whole "GUP writes to file-backed memory" issue[*], which I think you're alluding to, short-term GUP is also problematic. But unlike file-backed memory, for TDX and SNP (and I think pKVM), a single rogue access has a high probability of being fatal to the entire system.
I.e. except for blatant bugs, e.g. use-after-free, we need to be able to guarantee with 100% accuracy that there are no outstanding mappings when converting a page from shared=>private. Crossing our fingers and hoping that short-term GUP will have gone away isn't enough.
[*] https://lore.kernel.org/all/cover.1683235180.git.lstoakes@gmail.com
b) Expose the large folio as multiple 4k folios to the core-mm.
b) would look as follows: we allocate a gigantic page from the (hugetlb) reserve into guest_memfd. Then, we break it down into individual 4k folios by splitting/demoting the folio. We make sure that all 4k folios are unmovable (raised refcount). We keep tracking internally that these 4k folios comprise a single large gigantic page.
Core-mm can then track GUP pins and page table mappings per (previously subpage, now) small folio for us, without any modifications.
Once we unmap the gigantic page from guest_memfd, we reconstruct the gigantic page and hand it back to the reserve (only possible once all pins are gone).
We can still map the whole thing into the KVM guest+iommu using a single large unit, because guest_memfd knows the origin/relationship of these pages. But we would only map individual pages into user page tables (unless we use large VM_PFNMAP mappings, but then also pinning would not work, so that's likely also not what we want).
Not being able to map guest_memfd into userspace with 1GiB mappings should be ok, at least for CoCo VMs. If the guest shares an entire 1GiB chunk, e.g. for DMA or whatever, then userspace can simply punch a hole in guest_memfd and allocate 1GiB of memory from regular memory. Even losing 2MiB mappings should be ok.
For non-CoCo VMs, I expect we'll want to be much more permissive, but I think they'll be a complete non-issue because there is no shared vs. private to worry about. We can simply allow any and all userspace mappings for guest_memfd that is attached to a "regular" VM, because a misbehaving userspace only loses whatever hardening (or other benefits) was being provided by using guest_memfd. I.e. the kernel and system at-large isn't at risk.
The downside is that we won't benefit from vmemmap optimizations for large folios from hugetlb, and have more tracking overhead when mapping individual pages into user page tables.
Hmm, I suspect losing the vmemmap optimizations would be acceptable, especially if we could defer the shattering until the guest actually tried to partially convert a 1GiB/2MiB region, and restore the optimizations when the memory is converted back.
OTOH, maybe we really *need* per-page tracking and this might be the simplest way forward, making GUP and friends just work naturally with it.
On 20.06.24 22:30, Sean Christopherson wrote:
On Thu, Jun 20, 2024, David Hildenbrand wrote:
On 20.06.24 18:36, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
If we could disallow pinning any shared pages, that would make life a lot easier, but I think there were reasons for why we might require it. To convert shared->private, simply unmap that folio (only the shared parts could possibly be mapped) from all user page tables.
IMHO it should be reasonable to make it work like ZONE_MOVABLE and FOLL_LONGTERM. Making a shared page private is really no different from moving it.
And if you have built a VMM that uses VMA mapped shared pages and short-term pinning then you should really also ensure that the VM is aware when the pins go away. For instance if you are doing some virtio thing with O_DIRECT pinning then the guest will know the pins are gone when it observes virtio completions.
In this way making private is just like moving, we unmap the page and then drive the refcount to zero, then move it.
Yes, but here is the catch: what if a single shared subpage of a large folio is (validly) longterm pinned and you want to convert another shared subpage to private?
Sure, we can unmap the whole large folio (including all shared parts) before the conversion, just like we would do for migration. But we cannot detect that nobody pinned that subpage that we want to convert to private.
Core-mm is not, and will not, track pins per subpage.
So I only see two options:
a) Disallow long-term pinning. That means, we can, with a bit of wait, always convert subpages shared->private after unmapping them and waiting for the short-term pin to go away. Not too bad, and we already have other mechanisms to disallow long-term pinnings (especially writable fs ones!).
I don't think disallowing _just_ long-term GUP will suffice; if we go the "disallow GUP" route then I think it needs to disallow GUP, period. Like the whole "GUP writes to file-backed memory" issue[*], which I think you're alluding to, short-term GUP is also problematic. But unlike file-backed memory, for TDX and SNP (and I think pKVM), a single rogue access has a high probability of being fatal to the entire system.
Disallowing short-term should work, in theory, because the writes-to-file-backed-memory case has different issues (the PIN is not the problem but the dirtying).
It's more related to us not allowing long-term pins for FSDAX pages, because the lifetime of these pages is determined by the FS.
What we would do is
1) Unmap the large folio completely and make any refaults block. -> No new pins can pop up
2) If the folio is pinned, busy-wait until all the short-term pins are gone.
3) Safely convert the relevant subpage from shared -> private
Not saying it's the best approach, but it should be doable.
I.e. except for blatant bugs, e.g. use-after-free, we need to be able to guarantee with 100% accuracy that there are no outstanding mappings when converting a page from shared=>private. Crossing our fingers and hoping that short-term GUP will have gone away isn't enough.
We do have the mapcount and the refcount that will be completely reliable for our cases.
folio_mapcount()==0 not mapped
folio_ref_count()==1 we hold the single folio reference. (-> no mapping, no GUP, no unexpected references)
(folio_maybe_dma_pinned() could be used as well, but things like vmsplice() and some O_DIRECT might still take references. folio_ref_count() is more reliable in that regard)
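As a rough sketch of that sequence (unmap, busy-wait for transient references to drain, convert), using the predicates just mentioned; gmem_make_private() and the "expected" reference count that guest_memfd itself legitimately holds are assumptions, the rest are existing mm helpers.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>

int gmem_make_private(struct inode *inode, struct folio *folio); /* hypothetical */

/* Sketch only: convert a shared folio to private once it is provably unreferenced. */
static int gmem_convert_to_private(struct inode *inode, struct folio *folio,
				   int expected)
{
	/* 1) Unmap from all user page tables; refaults must block meanwhile. */
	unmap_mapping_pages(inode->i_mapping, folio->index,
			    folio_nr_pages(folio), false);

	/* 2) Busy-wait until short-term pins and other references drain away. */
	while (folio_mapped(folio) || folio_ref_count(folio) > expected)
		cond_resched();

	/* 3) Nothing else can read or write the memory now. */
	return gmem_make_private(inode, folio);
}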
[*] https://lore.kernel.org/all/cover.1683235180.git.lstoakes@gmail.com
b) Expose the large folio as multiple 4k folios to the core-mm.
b) would look as follows: we allocate a gigantic page from the (hugetlb) reserve into guest_memfd. Then, we break it down into individual 4k folios by splitting/demoting the folio. We make sure that all 4k folios are unmovable (raised refcount). We keep tracking internally that these 4k folios comprise a single large gigantic page.
Core-mm can then track GUP pins and page table mappings per (previously subpage, now) small folio for us, without any modifications.
Once we unmap the gigantic page from guest_memfd, we reconstruct the gigantic page and hand it back to the reserve (only possible once all pins are gone).
We can still map the whole thing into the KVM guest+iommu using a single large unit, because guest_memfd knows the origin/relationship of these pages. But we would only map individual pages into user page tables (unless we use large VM_PFNMAP mappings, but then also pinning would not work, so that's likely also not what we want).
Not being able to map guest_memfd into userspace with 1GiB mappings should be ok, at least for CoCo VMs. If the guest shares an entire 1GiB chunk, e.g. for DMA or whatever, then userspace can simply punch a hole in guest_memfd and allocate 1GiB of memory from regular memory. Even losing 2MiB mappings should be ok.
For non-CoCo VMs, I expect we'll want to be much more permissive, but I think they'll be a complete non-issue because there is no shared vs. private to worry about. We can simply allow any and all userspace mappings for guest_memfd that is attached to a "regular" VM, because a misbehaving userspace only loses whatever hardening (or other benefits) was being provided by using guest_memfd. I.e. the kernel and system at-large isn't at risk.
The downside is that we won't benefit from vmemmap optimizations for large folios from hugetlb, and have more tracking overhead when mapping individual pages into user page tables.
Hmm, I suspect losing the vmemmap optimizations would be acceptable, especially if we could defer the shattering until the guest actually tried to partially convert a 1GiB/2MiB region, and restore the optimizations when the memory is converted back.
We can only shatter/collapse if there are no unexpected folio references. So GUP would have to be handled as well ... so that is certainly problematic.
On Thu, Jun 20, 2024, David Hildenbrand wrote:
On 20.06.24 22:30, Sean Christopherson wrote:
On Thu, Jun 20, 2024, David Hildenbrand wrote:
On 20.06.24 18:36, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
If we could disallow pinning any shared pages, that would make life a lot easier, but I think there were reasons for why we might require it. To convert shared->private, simply unmap that folio (only the shared parts could possibly be mapped) from all user page tables.
IMHO it should be reasonable to make it work like ZONE_MOVABLE and FOLL_LONGTERM. Making a shared page private is really no different from moving it.
And if you have built a VMM that uses VMA mapped shared pages and short-term pinning then you should really also ensure that the VM is aware when the pins go away. For instance if you are doing some virtio thing with O_DIRECT pinning then the guest will know the pins are gone when it observes virtio completions.
In this way making private is just like moving, we unmap the page and then drive the refcount to zero, then move it.
Yes, but here is the catch: what if a single shared subpage of a large folio is (validly) longterm pinned and you want to convert another shared subpage to private?
Sure, we can unmap the whole large folio (including all shared parts) before the conversion, just like we would do for migration. But we cannot detect that nobody pinned that subpage that we want to convert to private.
Core-mm is not, and will not, track pins per subpage.
So I only see two options:
a) Disallow long-term pinning. That means, we can, with a bit of wait, always convert subpages shared->private after unmapping them and waiting for the short-term pin to go away. Not too bad, and we already have other mechanisms to disallow long-term pinnings (especially writable fs ones!).
I don't think disallowing _just_ long-term GUP will suffice; if we go the "disallow GUP" route then I think it needs to disallow GUP, period. Like the whole "GUP writes to file-backed memory" issue[*], which I think you're alluding to, short-term GUP is also problematic. But unlike file-backed memory, for TDX and SNP (and I think pKVM), a single rogue access has a high probability of being fatal to the entire system.
Disallowing short-term should work, in theory, because the
By "short-term", I assume you mean "long-term"? Or am I more lost than I realize?
writes-to-file-backed-memory case has different issues (the PIN is not the problem but the dirtying).
It's more related to us not allowing long-term pins for FSDAX pages, because the lifetime of these pages is determined by the FS.
What we would do is
- Unmap the large folio completely and make any refaults block.
-> No new pins can pop up
- If the folio is pinned, busy-wait until all the short-term pins are gone.
- Safely convert the relevant subpage from shared -> private
Not saying it's the best approach, but it should be doable.
This is the step that concerns me. "Relatively short time" is, well, relative. Hmm, though I suppose if userspace managed to map a shared page into something that pins the page, and can't force an unpin, e.g. by stopping I/O?, then either there's a host userspace bug or a guest bug, and so effectively hanging the vCPU that is waiting for the conversion to complete is ok.
The whole entire point of FOLL_LONGTERM is to interact with ZONE_MOVABLE stuff such that only FOLL_LONGTERM users will cause unlimited refcount elevation.
Blocking FOLL_LONGTERM is supposed to result in pins that go to zero on their own in some entirely kernel controlled time frame. Userspace is not supposed to be able to do anything to prevent this.
This is not necessarily guaranteed "fast", but it is certainly largely under the control of the hypervisor kernel and VMM. ie if you do O_DIRECT to the shared memory then the memory will remain pinned until the storage completes. Which might be ms or it might be a xx second storage timeout.
But putting it in the full context, if the guest tries to make a page private that is actively undergoing IO while shared, then I think it is misbehaving and it is quite reasonable to stall its call for private until the page refs drop to zero. If guests want shared to private to be fast then guests need to ensure there is no outstanding IO.
In other words the page ref scheme would only be protective against hostile guests and in real workloads we'd never expect to have to wait. The same as ZONE_MOVABLE.
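For illustration, such a rule could reuse the predicate GUP already applies on the FOLL_LONGTERM path (folio_is_longterm_pinnable(), consulted when deciding whether pages must be migrated away before a long-term pin is allowed); folio_is_guest_shared() is a made-up marker for shared guest memory, so this is a sketch, not the series' implementation.

#include <linux/mm.h>

bool folio_is_guest_shared(struct folio *folio); /* hypothetical marker */

/*
 * Sketch: treat shared guest memory like ZONE_MOVABLE/CMA for FOLL_LONGTERM,
 * so long-term pins are refused while short-term pins (O_DIRECT etc.) still
 * work and drain on their own.
 */
static bool gmem_folio_longterm_pinnable(struct folio *folio)
{
	if (folio_is_guest_shared(folio))
		return false;

	return folio_is_longterm_pinnable(folio);
}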
Jason
On Thu, Jun 20, 2024 at 01:30:29PM -0700, Sean Christopherson wrote:
I.e. except for blatant bugs, e.g. use-after-free, we need to be able to guarantee with 100% accuracy that there are no outstanding mappings when converting a page from shared=>private. Crossing our fingers and hoping that short-term GUP will have gone away isn't enough.
To be clear it is not crossing fingers. If the page refcount is 0 then there are no references to that memory anywhere at all. It is 100% certain.
It may take time to reach zero, but when it does it is safe.
Many things rely on this property, including FSDAX.
For non-CoCo VMs, I expect we'll want to be much more permissive, but I think they'll be a complete non-issue because there is no shared vs. private to worry about. We can simply allow any and all userspace mappings for guest_memfd that is attached to a "regular" VM, because a misbehaving userspace only loses whatever hardening (or other benefits) was being provided by using guest_memfd. I.e. the kernel and system at-large isn't at risk.
It does seem to me like guest_memfd should really focus on the private aspect.
If we need normal memfd enhancements of some kind to work better with KVM then that may be a better option than turning guest_memfd into memfd.
Jason
On Thu, Jun 20, 2024, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 01:30:29PM -0700, Sean Christopherson wrote:
I.e. except for blatant bugs, e.g. use-after-free, we need to be able to guarantee with 100% accuracy that there are no outstanding mappings when converting a page from shared=>private. Crossing our fingers and hoping that short-term GUP will have gone away isn't enough.
To be clear it is not crossing fingers. If the page refcount is 0 then there are no references to that memory anywhere at all. It is 100% certain.
It may take time to reach zero, but when it does it is safe.
Yeah, we're on the same page, I just didn't catch the implicit (or maybe it was explicitly stated earlier) "wait for the refcount to hit zero" part that David already clarified.
Many things rely on this property, including FSDAX.
For non-CoCo VMs, I expect we'll want to be much more permissive, but I think they'll be a complete non-issue because there is no shared vs. private to worry about. We can simply allow any and all userspace mappings for guest_memfd that is attached to a "regular" VM, because a misbehaving userspace only loses whatever hardening (or other benefits) was being provided by using guest_memfd. I.e. the kernel and system at-large isn't at risk.
It does seem to me like guest_memfd should really focus on the private aspect.
If we need normal memfd enhancements of some kind to work better with KVM then that may be a better option than turning guest_memfd into memfd.
Heh, and then we'd end up turning memfd into guest_memfd. As I see it, being able to safely map TDX/SNP/pKVM private memory is a happy side effect that is possible because guest_memfd isn't subordinate to the primary MMU, but private memory isn't the core identity of guest_memfd.
The thing that makes guest_memfd tick is that it's guest-first, i.e. allows mapping memory into the guest with more permissions/capabilities than the host. E.g. access to private memory, hugepage mappings when the host is forced to use small pages, RWX mappings when the host is limited to RO, etc.
We could do a subset of those for memfd, but I don't see the point, assuming we allow mmap() on shared guest_memfd memory. Solving mmap() for VMs that do private<=>shared conversions is the hard problem to solve. Once that's done, we'll get support for regular VMs along with the other benefits of guest_memfd for free (or very close to free).
On 21.06.24 01:54, Sean Christopherson wrote:
On Thu, Jun 20, 2024, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 01:30:29PM -0700, Sean Christopherson wrote:
I.e. except for blatant bugs, e.g. use-after-free, we need to be able to guarantee with 100% accuracy that there are no outstanding mappings when converting a page from shared=>private. Crossing our fingers and hoping that short-term GUP will have gone away isn't enough.
To be clear it is not crossing fingers. If the page refcount is 0 then there are no references to that memory anywhere at all. It is 100% certain.
It may take time to reach zero, but when it does it is safe.
Yeah, we're on the same page, I just didn't catch the implicit (or maybe it was explicitly stated earlier) "wait for the refcount to hit zero" part that David already clarified.
Many things rely on this property, including FSDAX.
For non-CoCo VMs, I expect we'll want to be much more permissive, but I think they'll be a complete non-issue because there is no shared vs. private to worry about. We can simply allow any and all userspace mappings for guest_memfd that is attached to a "regular" VM, because a misbehaving userspace only loses whatever hardening (or other benefits) was being provided by using guest_memfd. I.e. the kernel and system at-large isn't at risk.
It does seem to me like guest_memfd should really focus on the private aspect.
We'll likely have to enter that domain for clean huge page support and/or pKVM here either way.
Likely the future will see a mixture of things: some will use guest_memfd only for the "private" parts and anon/shmem for the "shared" parts, others will use guest_memfd for both.
If we need normal memfd enhancements of some kind to work better with KVM then that may be a better option than turning guest_memfd into memfd.
Heh, and then we'd end up turning memfd into guest_memfd. As I see it, being able to safely map TDX/SNP/pKVM private memory is a happy side effect that is possible because guest_memfd isn't subordinate to the primary MMU, but private memory isn't the core identity of guest_memfd.
Right.
The thing that makes guest_memfd tick is that it's guest-first, i.e. allows mapping memory into the guest with more permissions/capabilities than the host. E.g. access to private memory, hugepage mappings when the host is forced to use small pages, RWX mappings when the host is limited to RO, etc.
We could do a subset of those for memfd, but I don't see the point, assuming we allow mmap() on shared guest_memfd memory. Solving mmap() for VMs that do private<=>shared conversions is the hard problem to solve. Once that's done, we'll get support for regular VMs along with the other benefits of guest_memfd for free (or very close to free).
I suspect there would be pushback from Hugh trying to teach memfd things it really shouldn't be doing.
I once shared the idea of having a guest_memfd+memfd pair (managed by KVM or whatever more genric virt infrastructure), whereby we could move folios back and forth and only the memfd pages can be mapped and consequently pinned. Of course, we could only move full folios, which implies some kind of option b) for handling larger memory chunks (gigantic pages).
But I'm not sure if that is really required and it wouldn't be just easier to let the guest_memfd be mapped but only shared pages are handed out.
On Thu, Jun 20, 2024 at 04:54:00PM -0700, Sean Christopherson wrote:
Heh, and then we'd end up turning memfd into guest_memfd. As I see it, being able to safely map TDX/SNP/pKVM private memory is a happy side effect that is possible because guest_memfd isn't subordinate to the primary MMU, but private memory isn't the core identity of guest_memfd.
IMHO guest memfd still has a very bright line between it and normal memfd.
guest memfd is holding all the memory and making it unmovable because it has donated it to some secure world. Unmovable means the mm can't do anything with it in normal ways. For things like David's 'b' where we fragment the pages it also requires guest memfd to act as an allocator and completely own the PFNs, including handling free callbacks like ZONE_DEVICE does.
memfd on the other hand should always be normal movable allocated kernel memory with full normal folios and it shouldn't act as an allocator.
Teaching memfd to hold a huge folio is probably going to be a different approach than teaching guest memfd, I suspect "a" would be a more suitable choice there. You give up the kvm side contiguity, but get full mm integration of the memory.
User gets to choose which is more important..
It is not that different than today where VMMs are using hugetlbfs to get unmovable memory.
We could do a subset of those for memfd, but I don't see the point, assuming we allow mmap() on shared guest_memfd memory. Solving mmap() for VMs that do private<=>shared conversions is the hard problem to solve. Once that's done, we'll get support for regular VMs along with the other benefits of guest_memfd for free (or very close to free).
Yes, but I get the feeling that even in the best case for guest memfd you still end up with non-movable memory and fewer mm features available.
Like if we do movability in a guest memfd space it would have to be with some op callback to move the memory via the secure world, and guest memfd would still be pinning all the memory. Quite a different flow than what memfd should do.
There may still be merit in teaching memfd how to do huge pages too, though I don't really know.
Jason
On Thu, Jun 20, 2024 at 08:53:07PM +0200, David Hildenbrand wrote:
On 20.06.24 18:36, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:
If we could disallow pinning any shared pages, that would make life a lot easier, but I think there were reasons for why we might require it. To convert shared->private, simply unmap that folio (only the shared parts could possibly be mapped) from all user page tables.
IMHO it should be reasonable to make it work like ZONE_MOVABLE and FOLL_LONGTERM. Making a shared page private is really no different from moving it.
And if you have built a VMM that uses VMA mapped shared pages and short-term pinning then you should really also ensure that the VM is aware when the pins go away. For instance if you are doing some virtio thing with O_DIRECT pinning then the guest will know the pins are gone when it observes virtio completions.
In this way making private is just like moving, we unmap the page and then drive the refcount to zero, then move it.
Yes, but here is the catch: what if a single shared subpage of a large folio is (validly) longterm pinned and you want to convert another shared subpage to private?
When I wrote the above I was assuming option b was the choice.
a) Disallow long-term pinning. That means, we can, with a bit of wait, always convert subpages shared->private after unmapping them and waiting for the short-term pin to go away. Not too bad, and we already have other mechanisms to disallow long-term pinnings (especially writable fs ones!).
This seems reasonable, but you are trading off a big hit to IO performance while doing shared/private operations
b) Expose the large folio as multiple 4k folios to the core-mm.
And this trades off more VMM memory usage and micro-slower copy_to/from_user. I think this is probably the better choice
IMHO the VMA does not need to map at a high granularity for these cases. The IO path on these VM types is already disastrously slow, optimizing with 1GB huge pages in the VMM to make copy_to/from_user very slightly faster doesn't seem worthwhile.
b) would look as follows: we allocate a gigantic page from the (hugetlb) reserve into guest_memfd. Then, we break it down into individual 4k folios by splitting/demoting the folio. We make sure that all 4k folios are unmovable (raised refcount). We keep tracking internally that these 4k folios comprise a single large gigantic page.
Yes, something like this. Or maybe they get converted to ZONE_DEVICE pages so that freeing them goes back to a pgmap callback in the guest_memfd or something simple like that.
The downside is that we won't benefit from vmemmap optimizations for large folios from hugetlb, and have more tracking overhead when mapping individual pages into user page tables.
Yes, that too, but you are going to have some kind of per 4k tracking overhead anyhow in guest_memfd no matter what you do. It would probably be less than the struct pages though.
There is also the interesting option to use a PFNMAP VMA so there is no refcounting and we don't need to mess with the struct pages. The downside is that you totally lose GUP. So no O_DIRECT..
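A minimal sketch of that VM_PFNMAP idea, assuming a hypothetical gmem_offset_to_pfn() lookup; vmf_insert_pfn() and the vm_ops shape are the standard kernel interfaces. Because no struct page is handed out, GUP (and therefore O_DIRECT) cannot take references on this memory.

#include <linux/fs.h>
#include <linux/mm.h>

unsigned long gmem_offset_to_pfn(struct inode *inode, pgoff_t pgoff); /* hypothetical */

static vm_fault_t gmem_pfnmap_fault(struct vm_fault *vmf)
{
	unsigned long pfn = gmem_offset_to_pfn(file_inode(vmf->vma->vm_file),
					       vmf->pgoff);

	return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
}

static const struct vm_operations_struct gmem_pfnmap_vm_ops = {
	.fault = gmem_pfnmap_fault,
};

static int gmem_pfnmap_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* Raw PFN mapping: no struct pages, no refcounting, no GUP. */
	vm_flags_set(vma, VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP);
	vma->vm_ops = &gmem_pfnmap_vm_ops;
	return 0;
}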
Jason
On Thu, Jun 20, 2024 at 11:29:56AM -0300, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared, now the VM requests to make one subpage private.
I think the general CC model has the shared/private setup earlier on the VM lifecycle with large runs of contiguous pages. It would only become a problem if you intend to do high-rate, fine-granular shared/private switching. Which is why I am asking what the actual "why" is here.
I'd let Fuad comment if he's aware of any specific/concrete Android usecases about converting between shared and private. One usecase I can think about is the host providing large multimedia blobs (e.g. video) to the guest. Rather than using swiotlb, the CC guest can share pages back with the host so the host can copy the blob in, possibly using H/W accel. I mention this example because we may not need to support shared/private conversions at granularity finer than huge pages. The host and guest can negotiate the minimum size that can be converted and you never run into the issue where subpages of a folio are differently shared. I can't think of a usecase where we need such granularity for converting private/shared.
Jason, do you have scenario in mind? I couldn't tell if we now had a usecase or are brainstorming a solution to have a solution.
Thanks, Elliot
On Thu, Jun 20, 2024 at 03:47:23PM -0700, Elliot Berman wrote:
On Thu, Jun 20, 2024 at 11:29:56AM -0300, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared, now the VM requests to make one subpage private.
I think the general CC model has the shared/private setup earlier on the VM lifecycle with large runs of contiguous pages. It would only become a problem if you intend to do high-rate, fine-granular shared/private switching. Which is why I am asking what the actual "why" is here.
I'd let Fuad comment if he's aware of any specific/concrete Android usecases about converting between shared and private. One usecase I can think about is the host providing large multimedia blobs (e.g. video) to the guest. Rather than using swiotlb, the CC guest can share pages back with the host so the host can copy the blob in, possibly using H/W accel. I mention this example because we may not need to support shared/private conversions at granularity finer than huge pages.
I suspect the more useful thing would be to be able to allocate actual shared memory and use that to shuffle data without a copy, setup much less frequently. Ie you could allocate a large shared buffer for video sharing and stream the video frames through that memory without copy.
This is slightly different from converting arbitrary memory in-place into shared memory. The VM may be able to do a better job at clustering the shared memory allocation requests, ie locate them all within a 1GB region to further optimize the host side.
Jason, do you have scenario in mind? I couldn't tell if we now had a usecase or are brainstorming a solution to have a solution.
No, I'm interested in what pKVM is doing that needs this to be so much different than the CC case..
Jason
On Thursday 20 Jun 2024 at 20:18:14 (-0300), Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 03:47:23PM -0700, Elliot Berman wrote:
On Thu, Jun 20, 2024 at 11:29:56AM -0300, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared, now the VM requests to make one subpage private.
I think the general CC model has the shared/private setup earlier on the VM lifecycle with large runs of contiguous pages. It would only become a problem if you intend to do high-rate, fine-granular shared/private switching. Which is why I am asking what the actual "why" is here.
I'd let Fuad comment if he's aware of any specific/concrete Android usecases about converting between shared and private. One usecase I can think about is the host providing large multimedia blobs (e.g. video) to the guest. Rather than using swiotlb, the CC guest can share pages back with the host so the host can copy the blob in, possibly using H/W accel. I mention this example because we may not need to support shared/private conversions at granularity finer than huge pages.
I suspect the more useful thing would be to be able to allocate actual shared memory and use that to shuffle data without a copy, setup much less frequently. Ie you could allocate a large shared buffer for video sharing and stream the video frames through that memory without copy.
This is slightly different from converting arbitrary memory in-place into shared memory. The VM may be able to do a better job at clustering the shared memory allocation requests, ie locate them all within a 1GB region to further optimize the host side.
Jason, do you have scenario in mind? I couldn't tell if we now had a usecase or are brainstorming a solution to have a solution.
No, I'm interested in what pKVM is doing that needs this to be so much different than the CC case..
The underlying technology for implementing CC is obviously very different (MMU-based for pKVM, encryption-based for the others + some extra bits but let's keep it simple). In-place conversion is inherently painful with encryption-based schemes, so it's not a surprise the approach taken in these cases is built around destructive conversions as a core construct. But as Elliot highlighted, the MMU-based approach allows for pretty flexible and efficient zero-copy, which we're not ready to sacrifice purely to shoehorn pKVM into a model that was designed for a technology that has very different set of constraints. A private->shared conversion in the pKVM case is nothing more than setting a PTE in the recipient's stage-2 page-table.
I'm not at all against starting with something simple and bouncing via swiotlb, that is totally fine. What is _not_ fine however would be to bake into the userspace API that conversions are not in-place and destructive (which in my mind equates to 'you can't mmap guest_memfd pages'). But I think that isn't really a point of disagreement these days, so hopefully we're aligned.
And to clarify some things I've also read in the thread, pKVM can handle the vast majority of faults caused by accesses to protected memory just fine. Userspace accesses protected guest memory? Fine, we'll SEGV the userspace process. The kernel accesses via uaccess macros? Also fine, we'll fail the syscall (or whatever it is we're doing) cleanly -- the whole extable machinery works OK, which also means that things like load_unaligned_zeropad() keep working as-is. The only thing pKVM does is re-inject the fault back into the kernel with some extra syndrome information so it can figure out what to do by itself.
It's really only accesses via e.g. the linear map that are problematic, hence the exclusive GUP approach proposed in the series that tries to avoid that by construction. That has the benefit of leaving guest_memfd to other CC solutions that have more things in common. I think it's good for that discussion to happen, no matter what we end up doing in the end.
I hope that helps!
Thanks, Quentin
On 21.06.24 09:32, Quentin Perret wrote:
On Thursday 20 Jun 2024 at 20:18:14 (-0300), Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 03:47:23PM -0700, Elliot Berman wrote:
On Thu, Jun 20, 2024 at 11:29:56AM -0300, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared, now the VM requests to make one subpage private.
I think the general CC model has the shared/private setup earlier on the VM lifecycle with large runs of contiguous pages. It would only become a problem if you intend to do high-rate, fine-granular shared/private switching. Which is why I am asking what the actual "why" is here.
I'd let Fuad comment if he's aware of any specific/concrete Android usecases about converting between shared and private. One usecase I can think about is the host providing large multimedia blobs (e.g. video) to the guest. Rather than using swiotlb, the CC guest can share pages back with the host so the host can copy the blob in, possibly using H/W accel. I mention this example because we may not need to support shared/private conversions at granularity finer than huge pages.
I suspect the more useful thing would be to be able to allocate actual shared memory and use that to shuffle data without a copy, setup much less frequently. Ie you could allocate a large shared buffer for video sharing and stream the video frames through that memory without copy.
This is slightly different from converting arbitrary memory in-place into shared memory. The VM may be able to do a better job at clustering the shared memory allocation requests, ie locate them all within a 1GB region to further optimize the host side.
Jason, do you have scenario in mind? I couldn't tell if we now had a usecase or are brainstorming a solution to have a solution.
No, I'm interested in what pKVM is doing that needs this to be so much different than the CC case..
The underlying technology for implementing CC is obviously very different (MMU-based for pKVM, encryption-based for the others + some extra bits but let's keep it simple). In-place conversion is inherently painful with encryption-based schemes, so it's not a surprise the approach taken in these cases is built around destructive conversions as a core construct. But as Elliot highlighted, the MMU-based approach allows for pretty flexible and efficient zero-copy, which we're not ready to sacrifice purely to shoehorn pKVM into a model that was designed for a technology that has very different set of constraints. A private->shared conversion in the pKVM case is nothing more than setting a PTE in the recipient's stage-2 page-table.
I'm not at all against starting with something simple and bouncing via swiotlb, that is totally fine. What is _not_ fine however would be to bake into the userspace API that conversions are not in-place and destructive (which in my mind equates to 'you can't mmap guest_memfd pages'). But I think that isn't really a point of disagreement these days, so hopefully we're aligned.
And to clarify some things I've also read in the thread, pKVM can handle the vast majority of faults caused by accesses to protected memory just fine. Userspace accesses protected guest memory? Fine, we'll SEGV the userspace process. The kernel accesses via uaccess macros? Also fine, we'll fail the syscall (or whatever it is we're doing) cleanly -- the whole extable machinery works OK, which also means that things like load_unaligned_zeropad() keep working as-is. The only thing pKVM does is re-inject the fault back into the kernel with some extra syndrome information so it can figure out what to do by itself.
It's really only accesses via e.g. the linear map that are problematic, hence the exclusive GUP approach proposed in the series that tries to avoid that by construction. That has the benefit of leaving guest_memfd to other CC solutions that have more things in common. I think it's good for that discussion to happen, no matter what we end up doing in the end.
Thanks for the information. IMHO we really should try to find a common ground here, and FOLL_EXCLUSIVE is likely not it :)
Thanks for reviving this discussion with your patch set!
pKVM is interested in in-place conversion, I believe there are valid use cases for in-place conversion for TDX and friends as well (as discussed, I think that might be a clean way to get huge/gigantic page support in).
This implies the option to:
1) Have shared+private memory in guest_memfd
2) Be able to mmap shared parts
3) Be able to convert shared<->private in place
and later in my interest
4) Have huge/gigantic page support in guest_memfd with the option of converting individual subpages
We might not want to make use of that model for all of CC -- as you state, sometimes the destructive approach might be better performance wise -- but having that option doesn't sound crazy to me (and maybe would solve real issues as well).
After all, the common requirement here is that "private" pages are not mapped/pinned/accessible.
Sure, there might be cases like "pKVM can handle access to private pages in user page mappings", "AMD-SNP will not crash the host if writing to private pages" but those are not factors that really make a difference for a common solution.
private memory: not mapped, not pinned
shared memory: maybe mapped, maybe pinned
granularity of conversion: single pages
Anything I am missing?
On Friday 21 Jun 2024 at 10:02:08 (+0200), David Hildenbrand wrote:
Thanks for the information. IMHO we really should try to find a common ground here, and FOLL_EXCLUSIVE is likely not it :)
That's OK, IMO at least :-).
Thanks for reviving this discussion with your patch set!
pKVM is interested in in-place conversion, I believe there are valid use cases for in-place conversion for TDX and friends as well (as discussed, I think that might be a clean way to get huge/gigantic page support in).
This implies the option to:
- Have shared+private memory in guest_memfd
- Be able to mmap shared parts
- Be able to convert shared<->private in place
and later in my interest
- Have huge/gigantic page support in guest_memfd with the option of converting individual subpages
We might not want to make use of that model for all of CC -- as you state, sometimes the destructive approach might be better performance wise -- but having that option doesn't sound crazy to me (and maybe would solve real issues as well).
Cool.
After all, the common requirement here is that "private" pages are not mapped/pinned/accessible.
Sure, there might be cases like "pKVM can handle access to private pages in user page mappings", "AMD-SNP will not crash the host if writing to private pages" but those are not factors that really make a difference for a common solution.
Sure, there isn't much value in differentiating on these things. One might argue that we could save one mmap() on the private->shared conversion path by keeping all of guest_memfd mapped in userspace including private memory, but that's most probably not worth the effort of re-designing the whole thing just for that, so let's forget that.
The ability to handle stage-2 faults in the kernel has implications in other places however. It means we don't need to punch holes in the kernel linear map when donating memory to a guest for example, even with 'crazy' access patterns like load_unaligned_zeropad(). So that's good.
private memory: not mapped, not pinned
shared memory: maybe mapped, maybe pinned
granularity of conversion: single pages
Anything I am missing?
That looks good to me. And as discussed in previous threads, we have the ambition of getting page-migration to work, including for private memory, mostly to get kcompactd to work better when pVMs are running. Android makes extensive use of compaction, and pVMs currently stick out like a sore thumb.
We can trivially implement a hypercall to have pKVM swap a private page with another without the guest having to know. The difficulty is obviously to hook that in Linux, and I've personally not looked into it properly, so that is clearly longer term. We don't want to take anybody by surprise if there is a need for some added complexity in guest_memfd to support this use-case though. I don't expect folks on the receiving end of that to agree to it blindly without knowing _what_ this complexity is FWIW. But at least our intentions are clear :-)
Thanks, Quentin
On 21.06.24 11:25, Quentin Perret wrote:
On Friday 21 Jun 2024 at 10:02:08 (+0200), David Hildenbrand wrote:
Thanks for the information. IMHO we really should try to find a common ground here, and FOLL_EXCLUSIVE is likely not it :)
That's OK, IMO at least :-).
Thanks for reviving this discussion with your patch set!
pKVM is interested in in-place conversion, I believe there are valid use cases for in-place conversion for TDX and friends as well (as discussed, I think that might be a clean way to get huge/gigantic page support in).
This implies the option to:
- Have shared+private memory in guest_memfd
- Be able to mmap shared parts
- Be able to convert shared<->private in place
and later in my interest
- Have huge/gigantic page support in guest_memfd with the option of converting individual subpages
We might not want to make use of that model for all of CC -- as you state, sometimes the destructive approach might be better performance wise -- but having that option doesn't sound crazy to me (and maybe would solve real issues as well).
Cool.
After all, the common requirement here is that "private" pages are not mapped/pinned/accessible.
Sure, there might be cases like "pKVM can handle access to private pages in user page mappings", "AMD-SNP will not crash the host if writing to private pages" but those are not factors that really make a difference for a common solution.
Sure, there isn't much value in differentiating on these things. One might argue that we could save one mmap() on the private->shared conversion path by keeping all of guest_memfd mapped in userspace including private memory, but that's most probably not worth the effort of re-designing the whole thing just for that, so let's forget that.
In a world where we can mmap() the whole (sparse "shared") thing and dynamically map/unmap only the shared parts, it would save a page fault on private->shared conversion, correct.
But that sounds more like a CC-specific optimization for frequent conversions, which we should just ignore initially.
The ability to handle stage-2 faults in the kernel has implications in other places however. It means we don't need to punch holes in the kernel linear map when donating memory to a guest for example, even with 'crazy' access patterns like load_unaligned_zeropad(). So that's good.
private memory: not mapped, not pinned
shared memory: maybe mapped, maybe pinned
granularity of conversion: single pages
Anything I am missing?
That looks good to me. And as discussed in previous threads, we have the ambition of getting page-migration to work, including for private memory, mostly to get kcompactd to work better when pVMs are running. Android makes extensive use of compaction, and pVMs currently stick out like a sore thumb.
Yes, I think migration for compaction has to be supported at some point (at least for small pages that can be either private or shared, not a mixture), and I suspect we should be able to integrate it with core-mm in a not-too-horrible fashion. For example, we do have a non-lru page migration infrastructure in place already if the LRU-based one is not a good fit.
Memory swapping and all other currently-strictly LRU-based mechanisms should be out of scope for now: as Sean says, we don't want to go down that path.
We can trivially implement a hypercall to have pKVM swap a private page with another without the guest having to know. The difficulty is obviously to hook that in Linux, and I've personally not looked into it properly, so that is clearly longer term. We don't want to take anybody by surprise if there is a need for some added complexity in guest_memfd to support this use-case though. I don't expect folks on the receiving end of that to agree to it blindly without knowing _what_ this complexity is FWIW. But at least our intentions are clear :-)
Agreed.
On Fri, Jun 21, 2024 at 09:25:10AM +0000, Quentin Perret wrote:
On Friday 21 Jun 2024 at 10:02:08 (+0200), David Hildenbrand wrote:
Sure, there might be cases like "pKVM can handle access to private pages in user page mappings", "AMD-SNP will not crash the host if writing to private pages" but those are not factors that really make a difference for a common solution.
Sure, there isn't much value in differentiating on these things. One might argue that we could save one mmap() on the private->shared conversion path by keeping all of guest_memfd mapped in userspace including private memory, but that's most probably not worth the effort of re-designing the whole thing just for that, so let's forget that.
The ability to handle stage-2 faults in the kernel has implications in other places however. It means we don't need to punch holes in the kernel linear map when donating memory to a guest for example, even with 'crazy' access patterns like load_unaligned_zeropad(). So that's good.
The ability to handle stage-2 faults in the kernel is something that's specific to arm64 pKVM though. We do want to punch holes in the linear map for the Gunyah case. I don't think this is a blocking issue; I only want to point out we can't totally ignore the linear map.
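For the Gunyah case, punching those linear-map holes at donation time could look roughly like what mm/secretmem.c already does; this is only a sketch, with the unwinding of partially processed pages omitted.

#include <linux/mm.h>
#include <linux/set_memory.h>
#include <asm/tlbflush.h>

/* Sketch: drop 'nr' donated pages from the kernel linear map. */
static int gmem_remove_from_linear_map(struct page *page, unsigned int nr)
{
	unsigned long start = (unsigned long)page_address(page);
	unsigned int i;
	int err;

	for (i = 0; i < nr; i++) {
		err = set_direct_map_invalid_noflush(page + i);
		if (err)
			return err;
	}

	flush_tlb_kernel_range(start, start + nr * PAGE_SIZE);
	return 0;
}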
Thanks, Elliot
On Fri, Jun 21, 2024 at 07:32:40AM +0000, Quentin Perret wrote:
No, I'm interested in what pKVM is doing that needs this to be so much different than the CC case..
The underlying technology for implementing CC is obviously very different (MMU-based for pKVM, encryption-based for the others + some extra bits but let's keep it simple). In-place conversion is inherently painful with encryption-based schemes, so it's not a surprise the approach taken in these cases is built around destructive conversions as a core construct.
I'm not sure I fully agree with this. CC can do non-destructive too (though the proprietary secure worlds may choose not to implement it). Even implementations like ARM's CC are much closer to how pKVM works: no encryption, just page table updates.
The only question that matters at all is how fast is the private->shared conversion. Is it fast enough that it can be used on the IO path instead of swiotlb?
TBH I'm willing to believe number's showing that pKVM is fast enough, but would like to see them before we consider major changes to the kernel :)
I'm not at all against starting with something simple and bouncing via swiotlb, that is totally fine. What is _not_ fine however would be to bake into the userspace API that conversions are not in-place and destructive (which in my mind equates to 'you can't mmap guest_memfd pages'). But I think that isn't really a point of disagreement these days, so hopefully we're aligned.
IMHO CC and pKVM should align here and provide a way for optional non-destructive private->shared conversion.
It's really only accesses via e.g. the linear map that are problematic, hence the exclusive GUP approach proposed in the series that tries to avoid that by construction.
I think as others have said, this is just too weird. Memory that is inaccessible and always faults the kernel doesn't make any sense. It shouldn't be mapped into VMAs.
If you really, really, want to do this then use your own FD and a PFN map. Copy to user will still work fine and you don't need to disrupt the mm.
Jason
On 19.06.24 11:11, Fuad Tabba wrote:
Hi John and David,
Thank you for your comments.
On Wed, Jun 19, 2024 at 8:38 AM David Hildenbrand david@redhat.com wrote:
Hi,
On 19.06.24 04:44, John Hubbard wrote:
On 6/18/24 5:05 PM, Elliot Berman wrote:
In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support grabbing shmem user pages instead of using KVM's guestmemfd. These hypervisors provide a different isolation model than the CoCo implementations from x86. KVM's guest_memfd is focused on providing memory that is more isolated than AVF requires. Some specific examples include ability to pre-load data onto guest-private pages, dynamically sharing/isolating guest pages without copy, and (future) migrating guest-private pages. In sum of those differences after a discussion in [1] and at PUCK, we want to try to stick with existing shmem and extend GUP to support the isolation needs for arm64 pKVM and Gunyah.
The main question really is, into which direction we want and can develop guest_memfd. At this point (after talking to Jason at LSF/MM), I wonder if guest_memfd should be our new target for guest memory, both shared and private. There are a bunch of issues to be sorted out though ...
As there is interest from Red Hat into supporting hugetlb-style huge pages in confidential VMs for real-time workloads, and wasting memory is not really desired, I'm going to think some more about some of the challenges (shared+private in guest_memfd, mmap support, migration of !shared folios, hugetlb-like support, in-place shared<->private conversion, interaction with page pinning). Tricky.
Ideally, we'd have one way to back guest memory for confidential VMs in the future.
As you know, initially we went down the route of guest memory and invested a lot of time on it, including presenting our proposal at LPC last year. But there was resistance to expanding it to support more than what was initially envisioned, e.g., sharing guest memory in place, migration, and maybe even huge pages, and its implications such as being able to conditionally mmap guest memory.
Yes, and I think we might have to revive that discussion, unfortunately. I started thinking about this, but did not reach a conclusion. Sharing my thoughts.
The minimum we might need to make use of guest_memfd (v1 or v2 ;) ) not just for private memory should be:
(1) Have private + shared parts backed by guest_memfd. Either the same, or a fd pair.
(2) Allow to mmap only the "shared" parts.
(3) Allow in-place conversion between "shared" and "private" parts.
(4) Allow migration of the "shared" parts.
A) Convert shared -> private?
 - Must not be GUP-pinned
 - Must not be mapped
 - Must not reside on ZONE_MOVABLE/MIGRATE_CMA
 - (must rule out any other problematic folio references that could read/write memory, might be feasible for guest_memfd)

B) Convert private -> shared?
 - Nothing to consider

C) Map something?
 - Must not be private
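To make A) concrete, here is a rough sketch of what that check could look like (illustrative only -- the function name and the exact set of checks are assumptions, not actual guest_memfd code):

    /* Sketch only: guest_memfd would hold its own reference to the folio. */
    static int gmem_check_convert_to_private(struct folio *folio)
    {
            /* Must not be GUP-pinned */
            if (folio_maybe_dma_pinned(folio))
                    return -EBUSY;
            /* Must not be mapped anywhere */
            if (folio_mapped(folio))
                    return -EBUSY;
            /* Must not reside on ZONE_MOVABLE/MIGRATE_CMA */
            if (folio_zonenum(folio) == ZONE_MOVABLE ||
                is_migrate_cma_page(&folio->page))
                    return -EINVAL;
            /* Other problematic folio references still need to be ruled out */
            return 0;
    }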
For ordinary (small) pages, that might be feasible. (ZONE_MOVABLE/MIGRATE_CMA might be feasible, but maybe we could just not support them initially)
The real fun begins once we want to support huge pages/large folios and can end up having a mixture of "private" and "shared" per huge page. But really, that's what we want in the end I think.
Unless we can teach the VM to not convert arbitrary physical memory ranges on a 4k basis to a mixture of private/shared ... but I've been told we don't want that. Hm.
There are two big problems with that that I can see:
1) References/GUP-pins are per folio
What if some shared part of the folio is pinned but another shared part that we want to convert to private is not? Core-mm will not provide the answer to that: the folio may be pinned, that's it. *Disallowing* at least long-term GUP-pins might be an option.
To get stuff into an IOMMU, maybe a per-fd interface could work, and guest_memfd would track itself which parts are currently "handed out", and with which "semantics" (shared vs. private).
[IOMMU + private parts might require that either way? Because, if we disallow mmap, how should that ever work with an IOMMU otherwise].
2) Tracking of mappings will likely soon be per folio.
page_mapped() / folio_mapped() only tell us if any part of the folio is mapped. Of course, what always works is unmapping the whole thing, or walking the rmap to detect if a specific part is currently mapped.
Then, there is the problem of getting huge pages into guest_memfd (using hugetlb reserves, but not using hugetlb), but that should be solvable.
As raised in previous discussions, I think we should then allow the whole guest_memfd to be mapped, but simply SIGBUS/... when trying to access a private part. We would track private/shared internally, and track "handed out" pages to IOMMUs internally. FOLL_LONGTERM would be disallowed.
But that's only the high level idea I had so far ... likely ignore way too many details.
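A purely illustrative sketch of that idea (the private/shared tracking helpers are made up, and locking/refcounting is omitted):

    static vm_fault_t gmem_fault(struct vm_fault *vmf)
    {
            struct inode *inode = file_inode(vmf->vma->vm_file);
            struct folio *folio;

            /* Private parts are never faulted in; userspace gets SIGBUS. */
            if (gmem_offset_is_private(inode, vmf->pgoff))      /* assumed helper */
                    return VM_FAULT_SIGBUS;

            folio = gmem_get_folio(inode, vmf->pgoff);          /* assumed lookup helper */
            if (IS_ERR(folio))
                    return VM_FAULT_SIGBUS;

            vmf->page = folio_file_page(folio, vmf->pgoff);
            return 0;
    }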
Is there broader interest to discuss that, and would there be value in setting up a meeting to finally make progress with that?
I recall quite some details with memory renting or so on pKVM ... and I have to refresh my memory on that.
To be honest, personally (speaking only for myself, not necessarily for Elliot and not for anyone else in the pKVM team), I still would prefer to use guest_memfd(). I think that having one solution for confidential computing that rules them all would be best. But we do need to be able to share memory in place, have a plan for supporting huge pages in the near future, and migration in the not-too-distant future.
Yes, huge pages are also of interest for RH. And memory-overconsumption due to having partially used huge pages in private/shared memory is not desired.
We are currently shipping pKVM in Android as it is, warts and all. We're also working on upstreaming the rest of it. Currently, this is the main blocker for us to be able to upstream the rest (same probably applies to Gunyah).
Can you comment on the bigger design goal here? In particular:
At a high level: We want to prevent a misbehaving host process from crashing the system when attempting to access (deliberately or accidentally) protected guest memory. As it currently stands in pKVM and Gunyah, the hypervisor does prevent the host from accessing (private) guest memory. In certain cases though, if the host attempts to access that memory and is prevented by the hypervisor (either out of ignorance or out of malice), the host kernel wouldn't be able to recover, causing the whole system to crash.
guest_memfd() prevents such accesses by not allowing confidential memory to be mapped at the host to begin with. This works fine for us, but there's the issue of being able to share memory in place, which implies mapping it conditionally (among others that I've mentioned).
The approach we're taking with this proposal is to instead restrict the pinning of protected memory. If the host kernel can't pin the memory, then a misbehaving process can't trick the host into accessing it.
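For illustration, taking the exclusive pin when a page is donated to the guest could look roughly like this (FOLL_EXCLUSIVE is the flag added by this series; uaddr and the error handling are assumed/trimmed):

    struct page *page;
    long ret;

    ret = pin_user_pages(uaddr, 1,
                         FOLL_WRITE | FOLL_LONGTERM | FOLL_EXCLUSIVE,
                         &page);
    if (ret != 1)
            return -EBUSY;  /* some other pin already exists */
    /* ...donate the pfn to the hypervisor and mark the page private... */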
Got it, thanks. So once we pinned it, nobody else can pin it. But we can still map it?
- Who would get the exclusive PIN and for which reason? When would we pin, when would we unpin?
The exclusive pin would be acquired for private guest pages, in addition to a normal pin. It would be released when the private memory is released, or if the guest shares that memory.
Understood.
- What would happen if there is already another PIN? Can we deal with speculative short-term PINs from GUP-fast that could introduce errors?
The exclusive pin would be rejected if there's any other pin (exclusive or normal). Normal pins would be rejected if there's an exclusive pin.
Makes sense, thanks.
- How can we be sure we don't need other long-term pins (IOMMUs?) in the future?
I can't :)
:)
- Why are GUP pins special? How would one deal with other folio references (e.g., simply mmap the shmem file into a different process)?
Other references would crash the userspace process, but the host kernel can handle them, and shouldn't cause the system to crash. The way things are now in Android/pKVM, a userspace process can crash the system as a whole.
Okay, so very Android/pKVM specific :/
- Why do you have to bother about anonymous pages at all (skimming over some patches), when you really want to handle shmem differently only?
I'm not sure I understand the question. We use anonymous memory for pKVM.
"we want to support grabbing shmem user pages instead of using KVM's guestmemfd" indicated to me that you primarily care about shmem with FOLL_EXCLUSIVE?
To that end, we introduce the concept of "exclusive GUP pinning", which enforces that only one pin of any kind is allowed when using the FOLL_EXCLUSIVE flag is set. This behavior doesn't affect FOLL_GET or any other folio refcount operations that don't go through the FOLL_PIN path.
So, FOLL_EXCLUSIVE would fail if there already is a PIN, but !FOLL_EXCLUSIVE would succeed even if there is a single PIN via FOLL_EXCLUSIVE? Or would the single FOLL_EXCLUSIVE pin make other pins that don't have FOLL_EXCLUSIVE set fail as well?
A FOLL_EXCLUSIVE would fail if there's any other pin. A normal pin (!FOLL_EXCLUSIVE) would fail if there's a FOLL_EXCLUSIVE pin. It's the PIN to end all pins!
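As a sketch of those rules only (folio_test_exclusive_pinned() is an assumed helper, not something that exists in the tree):

    static bool gup_pin_allowed(struct folio *folio, unsigned int gup_flags)
    {
            if (gup_flags & FOLL_EXCLUSIVE)
                    /* exclusive pin: refuse if *any* pin already exists */
                    return !folio_maybe_dma_pinned(folio);

            /* normal pin: refuse only if an exclusive pin is held */
            return !folio_test_exclusive_pinned(folio);
    }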
Hi!
Looking through this, I feel that some intangible threshold of "this is too much overloading of page->_refcount" has been crossed. This is a very specific feature, and it is using approximately one more bit than is really actually "available"...
Agreed.
We are gating it behind a CONFIG flag :)
;)
Also, since pin is already overloading the refcount, having the exclusive pin there helps in ensuring atomic accesses and avoiding races.
If we need a bit in struct page/folio, is this really the only way? Willy is working towards getting us an entirely separate folio->pincount, I suppose that might take too long? Or not?
Before talking about how to implement it, I think we first have to learn whether that approach is what we want at all, and how it fits into the bigger picture of that use case.
This feels like force-fitting a very specific feature (KVM/CoCo handling of shmem pages) into a more general mechanism that is running low on bits (gup/pup).
Agreed.
Maybe a good topic for LPC!
The KVM track has plenty of guest_memfd topics, might be a good fit there. (or in the MM track, of course)
We are planning on submitting a proposal for LPC (see you in Vienna!) :)
Great!
Hi David,
On Wed, Jun 19, 2024 at 1:16 PM David Hildenbrand david@redhat.com wrote:
On 19.06.24 11:11, Fuad Tabba wrote:
Hi John and David,
Thank you for your comments.
On Wed, Jun 19, 2024 at 8:38 AM David Hildenbrand david@redhat.com wrote:
Hi,
On 19.06.24 04:44, John Hubbard wrote:
On 6/18/24 5:05 PM, Elliot Berman wrote:
In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support grabbing shmem user pages instead of using KVM's guestmemfd. These hypervisors provide a different isolation model than the CoCo implementations from x86. KVM's guest_memfd is focused on providing memory that is more isolated than AVF requires. Some specific examples include ability to pre-load data onto guest-private pages, dynamically sharing/isolating guest pages without copy, and (future) migrating guest-private pages. In sum of those differences after a discussion in [1] and at PUCK, we want to try to stick with existing shmem and extend GUP to support the isolation needs for arm64 pKVM and Gunyah.
The main question really is, into which direction we want and can develop guest_memfd. At this point (after talking to Jason at LSF/MM), I wonder if guest_memfd should be our new target for guest memory, both shared and private. There are a bunch of issues to be sorted out though ...
As there is interest from Red Hat into supporting hugetlb-style huge pages in confidential VMs for real-time workloads, and wasting memory is not really desired, I'm going to think some more about some of the challenges (shared+private in guest_memfd, mmap support, migration of !shared folios, hugetlb-like support, in-place shared<->private conversion, interaction with page pinning). Tricky.
Ideally, we'd have one way to back guest memory for confidential VMs in the future.
As you know, initially we went down the route of guest memory and invested a lot of time on it, including presenting our proposal at LPC last year. But there was resistance to expanding it to support more than what was initially envisioned, e.g., sharing guest memory in place, migration, and maybe even huge pages, and its implications such as being able to conditionally mmap guest memory.
Yes, and I think we might have to revive that discussion, unfortunately. I started thinking about this, but did not reach a conclusion. Sharing my thoughts.
The minimum we might need to make use of guest_memfd (v1 or v2 ;) ) not just for private memory should be:
(1) Have private + shared parts backed by guest_memfd. Either the same, or a fd pair. (2) Allow to mmap only the "shared" parts. (3) Allow in-place conversion between "shared" and "private" parts.
These three were covered (modulo bugs) in the guest_memfd() RFC I'd sent a while back:
https://lore.kernel.org/all/20240222161047.402609-1-tabba@google.com/
(4) Allow migration of the "shared" parts.
We would really like that too, if they allow us :)
A) Convert shared -> private?
- Must not be GUP-pinned
- Must not be mapped
- Must not reside on ZONE_MOVABLE/MIGRATE_CMA
- (must rule out any other problematic folio references that could read/write memory, might be feasible for guest_memfd)
B) Convert private -> shared?
- Nothing to consider
C) Map something?
- Must not be private
A,B and C were covered (again, modulo bugs) in the RFC.
For ordinary (small) pages, that might be feasible. (ZONE_MOVABLE/MIGRATE_CMA might be feasible, but maybe we could just not support them initially)
The real fun begins once we want to support huge pages/large folios and can end up having a mixture of "private" and "shared" per huge page. But really, that's what we want in the end I think.
I agree.
Unless we can teach the VM to not convert arbitrary physical memory ranges on a 4k basis to a mixture of private/shared ... but I've been told we don't want that. Hm.
There are two big problems with that that I can see:
- References/GUP-pins are per folio
What if some shared part of the folio is pinned but another shared part that we want to convert to private is not? Core-mm will not provide the answer to that: the folio may be pinned, that's it. *Disallowing* at least long-term GUP-pins might be an option.
Right.
To get stuff into an IOMMU, maybe a per-fd interface could work, and guest_memfd would track itself which parts are currently "handed out", and with which "semantics" (shared vs. private).
[IOMMU + private parts might require that either way? Because, if we disallow mmap, how should that ever work with an IOMMU otherwise].
Not sure if IOMMU + private makes that much sense really, but I think I might not really understand what you mean by this.
- Tracking of mappings will likely soon be per folio.
page_mapped() / folio_mapped() only tell us if any part of the folio is mapped. Of course, what always works is unmapping the whole thing, or walking the rmap to detect if a specific part is currently mapped.
This might complicate things a bit, but we could be conservative, at least initially, in what we allow to be mapped.
Then, there is the problem of getting huge pages into guest_memfd (using hugetlb reserves, but not using hugetlb), but that should be solvable.
As raised in previous discussions, I think we should then allow the whole guest_memfd to be mapped, but simply SIGBUS/... when trying to access a private part. We would track private/shared internally, and track "handed out" pages to IOMMUs internally. FOLL_LONGTERM would be disallowed.
But that's only the high level idea I had so far ... likely ignore way too many details.
Is there broader interest to discuss that, and would there be value in setting up a meeting to finally make progress with that?
I recall quite some details with memory renting or so on pKVM ... and I have to refresh my memory on that.
I really would like to get to a place where we could investigate and sort out all of these issues. It would be good to know though, what, in principle (and not due to any technical limitations), we might be allowed to do and expand guest_memfd() to do, and what out of principle is off the table.
To be honest, personally (speaking only for myself, not necessarily for Elliot and not for anyone else in the pKVM team), I still would prefer to use guest_memfd(). I think that having one solution for confidential computing that rules them all would be best. But we do need to be able to share memory in place, have a plan for supporting huge pages in the near future, and migration in the not-too-distant future.
Yes, huge pages are also of interest for RH. And memory-overconsumption due to having partially used huge pages in private/shared memory is not desired.
We are currently shipping pKVM in Android as it is, warts and all. We're also working on upstreaming the rest of it. Currently, this is the main blocker for us to be able to upstream the rest (same probably applies to Gunyah).
Can you comment on the bigger design goal here? In particular:
At a high level: We want to prevent a misbehaving host process from crashing the system when attempting to access (deliberately or accidentally) protected guest memory. As it currently stands in pKVM and Gunyah, the hypervisor does prevent the host from accessing (private) guest memory. In certain cases though, if the host attempts to access that memory and is prevented by the hypervisor (either out of ignorance or out of malice), the host kernel wouldn't be able to recover, causing the whole system to crash.
guest_memfd() prevents such accesses by not allowing confidential memory to be mapped at the host to begin with. This works fine for us, but there's the issue of being able to share memory in place, which implies mapping it conditionally (among others that I've mentioned).
The approach we're taking with this proposal is to instead restrict the pinning of protected memory. If the host kernel can't pin the memory, then a misbehaving process can't trick the host into accessing it.
Got it, thanks. So once we pinned it, nobody else can pin it. But we can still map it?
This proposal (the exclusive gup) places no limitations on mapping, only on pinning. If private memory is mapped and then accessed, then the worst thing that could happen is the userspace process gets killed, potentially taking down the guest with it (if that process happens to be the VMM for example).
The reason why we care about pinning is to ensure that the host kernel doesn't access protected memory, thereby crashing the system.
- Who would get the exclusive PIN and for which reason? When would we pin, when would we unpin?
The exclusive pin would be acquired for private guest pages, in addition to a normal pin. It would be released when the private memory is released, or if the guest shares that memory.
Understood.
- What would happen if there is already another PIN? Can we deal with speculative short-term PINs from GUP-fast that could introduce errors?
The exclusive pin would be rejected if there's any other pin (exclusive or normal). Normal pins would be rejected if there's an exclusive pin.
Makes sense, thanks.
- How can we be sure we don't need other long-term pins (IOMMUs?) in the future?
I can't :)
:)
- Why are GUP pins special? How would one deal with other folio references (e.g., simply mmap the shmem file into a different process)?
Other references would crash the userspace process, but the host kernel can handle them, and shouldn't cause the system to crash. The way things are now in Android/pKVM, a userspace process can crash the system as a whole.
Okay, so very Android/pKVM specific :/
Gunyah too.
- Why do you have to bother about anonymous pages at all (skimming over some patches), when you really want to handle shmem differently only?
I'm not sure I understand the question. We use anonymous memory for pKVM.
"we want to support grabbing shmem user pages instead of using KVM's guestmemfd" indicated to me that you primarily care about shmem with FOLL_EXCLUSIVE?
Right, maybe we should have clarified this better when we sent out this series.
This patch series is meant as an alternative to guest_memfd(), and not as something to be used in conjunction with it. This came about from the discussions we had with you and others back when Elliot and I sent our respective RFCs, and found that there was resistance to adding guest_memfd() support that would make it practical to use with pKVM or Gunyah.
https://lore.kernel.org/all/ZdfoR3nCEP3HTtm1@casper.infradead.org/
Thanks again for your ideas and comments! /fuad
To that end, we introduce the concept of "exclusive GUP pinning", which enforces that only one pin of any kind is allowed when using the FOLL_EXCLUSIVE flag is set. This behavior doesn't affect FOLL_GET or any other folio refcount operations that don't go through the FOLL_PIN path.
So, FOLL_EXCLUSIVE would fail if there already is a PIN, but !FOLL_EXCLUSIVE would succeed even if there is a single PIN via FOLL_EXCLUSIVE? Or would the single FOLL_EXCLUSIVE pin make other pins that don't have FOLL_EXCLUSIVE set fail as well?
A FOLL_EXCLUSIVE would fail if there's any other pin. A normal pin (!FOLL_EXCLUSIVE) would fail if there's a FOLL_EXCLUSIVE pin. It's the PIN to end all pins!
Hi!
Looking through this, I feel that some intangible threshold of "this is too much overloading of page->_refcount" has been crossed. This is a very specific feature, and it is using approximately one more bit than is really actually "available"...
Agreed.
We are gating it behind a CONFIG flag :)
;)
Also, since pin is already overloading the refcount, having the exclusive pin there helps in ensuring atomic accesses and avoiding races.
If we need a bit in struct page/folio, is this really the only way? Willy is working towards getting us an entirely separate folio->pincount, I suppose that might take too long? Or not?
Before talking about how to implement it, I think we first have to learn whether that approach is what we want at all, and how it fits into the bigger picture of that use case.
This feels like force-fitting a very specific feature (KVM/CoCo handling of shmem pages) into a more general mechanism that is running low on bits (gup/pup).
Agreed.
Maybe a good topic for LPC!
The KVM track has plenty of guest_memfd topics, might be a good fit there. (or in the MM track, of course)
We are planning on submitting a proposal for LPC (see you in Vienna!) :)
Great!
-- Cheers,
David / dhildenb
Yes, and I think we might have to revive that discussion, unfortunately. I started thinking about this, but did not reach a conclusion. Sharing my thoughts.
The minimum we might need to make use of guest_memfd (v1 or v2 ;) ) not just for private memory should be:
(1) Have private + shared parts backed by guest_memfd. Either the same, or a fd pair. (2) Allow to mmap only the "shared" parts. (3) Allow in-place conversion between "shared" and "private" parts.
These three were covered (modulo bugs) in the guest_memfd() RFC I'd sent a while back:
https://lore.kernel.org/all/20240222161047.402609-1-tabba@google.com/
I remember there was a catch to it (either around mmap or pinning detection -- or around support for huge pages in the future; maybe these count as BUGs :) ).
I should probably go back and revisit the whole thing, I was only CCed on some part of it back then.
(4) Allow migration of the "shared" parts.
We would really like that too, if they allow us :)
A) Convert shared -> private?
- Must not be GUP-pinned
- Must not be mapped
- Must not reside on ZONE_MOVABLE/MIGRATE_CMA
- (must rule out any other problematic folio references that could read/write memory, might be feasible for guest_memfd)
B) Convert private -> shared?
- Nothing to consider
C) Map something?
- Must not be private
A,B and C were covered (again, modulo bugs) in the RFC.
For ordinary (small) pages, that might be feasible. (ZONE_MOVABLE/MIGRATE_CMA might be feasible, but maybe we could just not support them initially)
The real fun begins once we want to support huge pages/large folios and can end up having a mixture of "private" and "shared" per huge page. But really, that's what we want in the end I think.
I agree.
Unless we can teach the VM to not convert arbitrary physical memory ranges on a 4k basis to a mixture of private/shared ... but I've been told we don't want that. Hm.
There are two big problems with that that I can see:
- References/GUP-pins are per folio
What if some shared part of the folio is pinned but another shared part that we want to convert to private is not? Core-mm will not provide the answer to that: the folio may be pinned, that's it. *Disallowing* at least long-term GUP-pins might be an option.
Right.
To get stuff into an IOMMU, maybe a per-fd interface could work, and guest_memfd would track itself which parts are currently "handed out", and with which "semantics" (shared vs. private).
[IOMMU + private parts might require that either way? Because, if we disallow mmap, how should that ever work with an IOMMU otherwise].
Not sure if IOMMU + private makes that much sense really, but I think I might not really understand what you mean by this.
A device might be able to access private memory. In the TDX world, this would mean that a device "speaks" encrypted memory.
At the same time, a device might be able to access shared memory. Maybe devices can do both?
What to do when converting between private and shared? I think it depends on various factors (e.g., device capabilities).
[...]
I recall quite some details with memory renting or so on pKVM ... and I have to refresh my memory on that.
I really would like to get to a place where we could investigate and sort out all of these issues. It would be good to know though, what, in principle (and not due to any technical limitations), we might be allowed to do and expand guest_memfd() to do, and what out of principle is off the table.
As Jason said, maybe we need a revised model that can handle [...] private+shared properly.
On Thu, Jun 20, 2024 at 11:00:45AM +0200, David Hildenbrand wrote:
Not sure if IOMMU + private makes that much sense really, but I think I might not really understand what you mean by this.
A device might be able to access private memory. In the TDX world, this would mean that a device "speaks" encrypted memory.
At the same time, a device might be able to access shared memory. Maybe devices can do both?
What to do when converting between private and shared? I think it depends on various factors (e.g., device capabilities).
The whole thing is complicated once you put the pages into the VMA. We have hmm_range_fault and IOMMU SVA paths that both obtain the pfns without any of the checks here.
(and I suspect many of the target HW's for pKVM have/will have SVA capable GPUs so SVA is an attack vector worth considering)
What happens if someone does DMA to these PFNs? It seems like nothing good in either scenario..
Really the only way to do it properly is to keep the memory unmapped, that must be the starting point to any solution. Denying GUP is just an ugly hack.
Jason
Hi David,
On Wed, Jun 19, 2024 at 09:37:58AM +0200, David Hildenbrand wrote:
Hi,
On 19.06.24 04:44, John Hubbard wrote:
On 6/18/24 5:05 PM, Elliot Berman wrote:
In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support grabbing shmem user pages instead of using KVM's guestmemfd. These hypervisors provide a different isolation model than the CoCo implementations from x86. KVM's guest_memfd is focused on providing memory that is more isolated than AVF requires. Some specific examples include ability to pre-load data onto guest-private pages, dynamically sharing/isolating guest pages without copy, and (future) migrating guest-private pages. In sum of those differences after a discussion in [1] and at PUCK, we want to try to stick with existing shmem and extend GUP to support the isolation needs for arm64 pKVM and Gunyah.
The main question really is, into which direction we want and can develop guest_memfd. At this point (after talking to Jason at LSF/MM), I wonder if guest_memfd should be our new target for guest memory, both shared and private. There are a bunch of issues to be sorted out though ...
As there is interest from Red Hat into supporting hugetlb-style huge pages in confidential VMs for real-time workloads, and wasting memory is not really desired, I'm going to think some more about some of the challenges (shared+private in guest_memfd, mmap support, migration of !shared folios, hugetlb-like support, in-place shared<->private conversion, interaction with page pinning). Tricky.
Ideally, we'd have one way to back guest memory for confidential VMs in the future.
Can you comment on the bigger design goal here? In particular:
Who would get the exclusive PIN and for which reason? When would we pin, when would we unpin?
What would happen if there is already another PIN? Can we deal with speculative short-term PINs from GUP-fast that could introduce errors?
How can we be sure we don't need other long-term pins (IOMMUs?) in the future?
Can you please clarify more about the IOMMU case?
pKVM has no merged upstream IOMMU support at the moment, although there was an RFC a while ago [1], and a v2 should follow soon.
In the patches KVM (running in EL2) will manage the IOMMUs including the page tables and all pages used in that are allocated from the kernel.
These patches don't support IOMMUs for guests. However, I don't see why that would be different from the CPU: once the page is pinned it can be owned by a guest, and that would be reflected in the hypervisor tracking, the CPU stage-2 and the IOMMU page tables as well.
[1] https://lore.kernel.org/kvmarm/20230201125328.2186498-1-jean-philippe@linaro...
Thanks, Mostafa
Why are GUP pins special? How would one deal with other folio references (e.g., simply mmap the shmem file into a different process)?
Why do you have to bother about anonymous pages at all (skimming over some patches), when you really want to handle shmem differently only?
To that end, we introduce the concept of "exclusive GUP pinning", which enforces that only one pin of any kind is allowed when using the FOLL_EXCLUSIVE flag is set. This behavior doesn't affect FOLL_GET or any other folio refcount operations that don't go through the FOLL_PIN path.
So, FOLL_EXCLUSIVE would fail if there already is a PIN, but !FOLL_EXCLUSIVE would succeed even if there is a single PIN via FOLL_EXCLUSIVE? Or would the single FOLL_EXCLUSIVE pin make other pins that don't have FOLL_EXCLUSIVE set fail as well?
Hi!
Looking through this, I feel that some intangible threshold of "this is too much overloading of page->_refcount" has been crossed. This is a very specific feature, and it is using approximately one more bit than is really actually "available"...
Agreed.
If we need a bit in struct page/folio, is this really the only way? Willy is working towards getting us an entirely separate folio->pincount, I suppose that might take too long? Or not?
Before talking about how to implement it, I think we first have to learn whether that approach is what we want at all, and how it fits into the bigger picture of that use case.
This feels like force-fitting a very specific feature (KVM/CoCo handling of shmem pages) into a more general mechanism that is running low on bits (gup/pup).
Agreed.
Maybe a good topic for LPC!
The KVM track has plenty of guest_memfd topics, might be a good fit there. (or in the MM track, of course)
-- Cheers,
David / dhildenb
On 20.06.24 15:08, Mostafa Saleh wrote:
Hi David,
On Wed, Jun 19, 2024 at 09:37:58AM +0200, David Hildenbrand wrote:
Hi,
On 19.06.24 04:44, John Hubbard wrote:
On 6/18/24 5:05 PM, Elliot Berman wrote:
In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support grabbing shmem user pages instead of using KVM's guestmemfd. These hypervisors provide a different isolation model than the CoCo implementations from x86. KVM's guest_memfd is focused on providing memory that is more isolated than AVF requires. Some specific examples include ability to pre-load data onto guest-private pages, dynamically sharing/isolating guest pages without copy, and (future) migrating guest-private pages. In sum of those differences after a discussion in [1] and at PUCK, we want to try to stick with existing shmem and extend GUP to support the isolation needs for arm64 pKVM and Gunyah.
The main question really is, into which direction we want and can develop guest_memfd. At this point (after talking to Jason at LSF/MM), I wonder if guest_memfd should be our new target for guest memory, both shared and private. There are a bunch of issues to be sorted out though ...
As there is interest from Red Hat into supporting hugetlb-style huge pages in confidential VMs for real-time workloads, and wasting memory is not really desired, I'm going to think some more about some of the challenges (shared+private in guest_memfd, mmap support, migration of !shared folios, hugetlb-like support, in-place shared<->private conversion, interaction with page pinning). Tricky.
Ideally, we'd have one way to back guest memory for confidential VMs in the future.
Can you comment on the bigger design goal here? In particular:
Who would get the exclusive PIN and for which reason? When would we pin, when would we unpin?
What would happen if there is already another PIN? Can we deal with speculative short-term PINs from GUP-fast that could introduce errors?
How can we be sure we don't need other long-term pins (IOMMUs?) in the future?
Can you please clarify more about the IOMMU case?
pKVM has no merged upstream IOMMU support at the moment, although there was an RFC a while ago [1], and a v2 should follow soon.
In the patches KVM (running in EL2) will manage the IOMMUs including the page tables and all pages used in that are allocated from the kernel.
These patches don't support IOMMUs for guests. However, I don't see why that would be different from the CPU: once the page is pinned it can be owned by a guest, and that would be reflected in the hypervisor tracking, the CPU stage-2 and the IOMMU page tables as well.
So this is my thinking, it might be flawed:
In the "normal" world (e.g., vfio), we FOLL_PIN|FOLL_LONGTERM the pages to be accessible by a dedicated device. We look them up in the page tables to pin them, then we can map them into the IOMMU.
Devices that cannot speak "private memory" should only access shared memory. So we must not have "private memory" mapped into their IOMMU.
Devices that can speak "private memory" may either access shared or private memory. So we may have "private memory" mapped into their IOMMU.
What I see (again, I might be just wrong):
1) How would the device be able to grab/access "private memory", if not via the user page tables?
2) How would we be able to convert shared -> private, if there is a longterm pin from that IOMMU? We must dynamically unmap it from the IOMMU.
I assume when you're saying "In the patches KVM (running in EL2) will manage the IOMMUs including the page tables", this is easily solved by not relying on pinning: KVM just knows what to update and where. (which is a very different model than what VFIO does)
Thanks!
On Thu, Jun 20, 2024 at 04:14:23PM +0200, David Hildenbrand wrote:
- How would the device be able to grab/access "private memory", if not via the user page tables?
The approaches I'm aware of require the secure world to own the IOMMU and generate the IOMMU page tables. So we will not use a GUP approach with VFIO today as the kernel will not have any reason to generate a page table in the first place. Instead we will say "this PCI device translates through the secure world" and walk away.
The page table population would have to be done through the KVM path.
I assume when you're saying "In the patches KVM (running in EL2) will manage the IOMMUs including the page tables", this is easily solved by not relying on pinning: KVM just knows what to update and where. (which is a very different model than what VFIO does)
This is my read as well for pKVM.
IMHO pKVM is just a version of CC without requiring some of the HW features to make the isolation stronger and ignoring the attestation/strong confidentiality part.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, June 20, 2024 10:34 PM
On Thu, Jun 20, 2024 at 04:14:23PM +0200, David Hildenbrand wrote:
- How would the device be able to grab/access "private memory", if not via the user page tables?
The approaches I'm aware of require the secure world to own the IOMMU and generate the IOMMU page tables. So we will not use a GUP approach with VFIO today as the kernel will not have any reason to generate a page table in the first place. Instead we will say "this PCI device translates through the secure world" and walk away.
The page table population would have to be done through the KVM path.
Sorry for noting this discussion late. Dave pointed it to me in a related thread [1].
I had the impression that the above approach fits some trusted IO archs (e.g. TDX Connect which has a special secure I/O page table format and requires sharing it between IOMMU/KVM) but not all.
e.g. the SEV-TIO spec [2] (page 8) describes having the IOMMU walk the existing I/O page tables to get the HPA and then verify it through a new permission table (RMP) for access control.
That arch may better fit a scheme in which the I/O page tables are still managed by VFIO/IOMMUFD and the RMP is managed by KVM, with an extension to the MAP_DMA call to accept a [guest_memfd, offset] pair to find out the pfn instead of using a host virtual address.
It looks like the Linux MM alignment session [3] did mention "guest_memfd will take ownership of the hugepages, and provide interested parties (userspace, KVM, iommu) with pages to be used" to support that extension?
[1] https://lore.kernel.org/kvm/272e3dbf-ed4a-43f5-8b5f-56bf6d74930c@redhat.com/ [2] https://www.amd.com/system/files/documents/sev-tio-whitepaper.pdf [3] https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/
Thanks Kevin
On Fri, Aug 02, 2024 at 08:26:48AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, June 20, 2024 10:34 PM
On Thu, Jun 20, 2024 at 04:14:23PM +0200, David Hildenbrand wrote:
- How would the device be able to grab/access "private memory", if not via the user page tables?
The approaches I'm aware of require the secure world to own the IOMMU and generate the IOMMU page tables. So we will not use a GUP approach with VFIO today as the kernel will not have any reason to generate a page table in the first place. Instead we will say "this PCI device translates through the secure world" and walk away.
The page table population would have to be done through the KVM path.
Sorry for noting this discussion late. Dave pointed it to me in a related thread [1].
I had the impression that the above approach fits some trusted IO archs (e.g. TDX Connect which has a special secure I/O page table format and requires sharing it between IOMMU/KVM) but not all.
e.g. the SEV-TIO spec [2] (page 8) describes having the IOMMU walk the existing I/O page tables to get the HPA and then verify it through a new permission table (RMP) for access control.
It is not possible, you cannot have the unsecure world control the IOMMU translation and expect a secure guest.
The unsecure world can attack the guest by scrambling the mappings of its private pages. A RMP does not protect against this.
This is why the secure world controls the CPU's GPA translation exclusively, same reasoning for iommu.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, August 2, 2024 7:22 PM
On Fri, Aug 02, 2024 at 08:26:48AM +0000, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Thursday, June 20, 2024 10:34 PM
On Thu, Jun 20, 2024 at 04:14:23PM +0200, David Hildenbrand wrote:
- How would the device be able to grab/access "private memory", if not via the user page tables?
The approaches I'm aware of require the secure world to own the IOMMU and generate the IOMMU page tables. So we will not use a GUP approach with VFIO today as the kernel will not have any reason to generate a page table in the first place. Instead we will say "this PCI device translates through the secure world" and walk away.
The page table population would have to be done through the KVM path.
Sorry for noting this discussion late. Dave pointed it to me in a related thread [1].
I had the impression that the above approach fits some trusted IO archs (e.g. TDX Connect which has a special secure I/O page table format and requires sharing it between IOMMU/KVM) but not all.
e.g. the SEV-TIO spec [2] (page 8) describes having the IOMMU walk the existing I/O page tables to get the HPA and then verify it through a new permission table (RMP) for access control.
It is not possible, you cannot have the unsecure world control the IOMMU translation and expect a secure guest.
The unsecure world can attack the guest by scrambling the mappings of its private pages. A RMP does not protect against this.
This is why the secure world controls the CPU's GPA translation exclusively, same reasoning for iommu.
According to [3],
" With SNP, when pages are marked as guest-owned in the RMP table, they are assigned to a specific guest/ASID, as well as a specific GFN with in the guest. Any attempts to map it in the RMP table to a different guest/ASID, or a different GFN within a guest/ASID, will result in an RMP nested page fault. "
With that measure in place my impression is that even the CPU's GPA translation can be controlled by the unsecure world in SEV-SNP.
[3] https://lore.kernel.org/all/20240501085210.2213060-1-michael.roth@amd.com/
On Mon, Aug 05, 2024 at 02:24:42AM +0000, Tian, Kevin wrote:
According to [3],
" With SNP, when pages are marked as guest-owned in the RMP table, they are assigned to a specific guest/ASID, as well as a specific GFN with in the guest. Any attempts to map it in the RMP table to a different guest/ASID, or a different GFN within a guest/ASID, will result in an RMP nested page fault. "
With that measure in place my impression is that even the CPU's GPA translation can be controlled by the unsecure world in SEV-SNP.
Sure, but the GPA is the KVM S2, not the IOMMU. If there is some complicated way to lock down the KVM S2 then it doesn't necessarily apply to every IOVA to GPA translation as well.
The guest/hypervisor could have a huge number of iommu domains, where would you even store such granular data?
About the only thing that could possibly work is to set up an S2 IOMMU identity translation reliably and have no support for vIOMMU - which doesn't sound like a sane architecture to me.
It is not insurmountable, but it is going to be annoying if someone needs access to the private pages physical address in the iommufd side.
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Tuesday, August 6, 2024 7:23 AM
On Mon, Aug 05, 2024 at 02:24:42AM +0000, Tian, Kevin wrote:
According to [3],
" With SNP, when pages are marked as guest-owned in the RMP table, they are assigned to a specific guest/ASID, as well as a specific GFN with in the guest. Any attempts to map it in the RMP table to a different guest/ASID, or a different GFN within a guest/ASID, will result in an RMP nested page fault. "
With that measure in place my impression is that even the CPU's GPA translation can be controlled by the unsecure world in SEV-SNP.
Sure, but the GPA is the KVM S2, not the IOMMU. If there is some complicated way to lock down the KVM S2 then it doesn't necessarily apply to every IOVA to GPA translation as well.
The guest/hypervisor could have a huge number of iommu domains, where would you even store such granular data?
About the only thing that could possibly work is to set up an S2 IOMMU identity translation reliably and have no support for vIOMMU - which doesn't sound like a sane architecture to me.
According to the SEV-TIO spec there will be a new structure called the Secure Device Table to track security attributes of a TDI and also the location of guest page tables. It also puts a hardware-assisted vIOMMU in the TCB; then, with nested translation, the IOMMU S2 will always be GPA.
It is not insurmountable, but it is going to be annoying if someone needs access to the private pages physical address in the iommufd side.
Don't know much about SEV, but based on my reading it appears that it is designed with the assumption that GPA page tables (both CPU/IOMMU S2, in nested translation) are managed by the untrusted host, for both shared and private pages.
Probably AMD folks can chime in to help confirm. 😊
Hi David,
On Thu, Jun 20, 2024 at 04:14:23PM +0200, David Hildenbrand wrote:
On 20.06.24 15:08, Mostafa Saleh wrote:
Hi David,
On Wed, Jun 19, 2024 at 09:37:58AM +0200, David Hildenbrand wrote:
Hi,
On 19.06.24 04:44, John Hubbard wrote:
On 6/18/24 5:05 PM, Elliot Berman wrote:
In arm64 pKVM and QuIC's Gunyah protected VM model, we want to support grabbing shmem user pages instead of using KVM's guestmemfd. These hypervisors provide a different isolation model than the CoCo implementations from x86. KVM's guest_memfd is focused on providing memory that is more isolated than AVF requires. Some specific examples include ability to pre-load data onto guest-private pages, dynamically sharing/isolating guest pages without copy, and (future) migrating guest-private pages. In sum of those differences after a discussion in [1] and at PUCK, we want to try to stick with existing shmem and extend GUP to support the isolation needs for arm64 pKVM and Gunyah.
The main question really is, into which direction we want and can develop guest_memfd. At this point (after talking to Jason at LSF/MM), I wonder if guest_memfd should be our new target for guest memory, both shared and private. There are a bunch of issues to be sorted out though ...
As there is interest from Red Hat into supporting hugetlb-style huge pages in confidential VMs for real-time workloads, and wasting memory is not really desired, I'm going to think some more about some of the challenges (shared+private in guest_memfd, mmap support, migration of !shared folios, hugetlb-like support, in-place shared<->private conversion, interaction with page pinning). Tricky.
Ideally, we'd have one way to back guest memory for confidential VMs in the future.
Can you comment on the bigger design goal here? In particular:
Who would get the exclusive PIN and for which reason? When would we pin, when would we unpin?
What would happen if there is already another PIN? Can we deal with speculative short-term PINs from GUP-fast that could introduce errors?
How can we be sure we don't need other long-term pins (IOMMUs?) in the future?
Can you please clarify more about the IOMMU case?
pKVM has no merged upstream IOMMU support at the moment, although there was an RFC a while ago [1], and a v2 should follow soon.
In the patches KVM (running in EL2) will manage the IOMMUs including the page tables and all pages used in that are allocated from the kernel.
These patches don't support IOMMUs for guests. However, I don't see why that would be different from the CPU: once the page is pinned it can be owned by a guest, and that would be reflected in the hypervisor tracking, the CPU stage-2 and the IOMMU page tables as well.
So this is my thinking, it might be flawed:
In the "normal" world (e.g., vfio), we FOLL_PIN|FOLL_LONGTERM the pages to be accessible by a dedicated device. We look them up in the page tables to pin them, then we can map them into the IOMMU.
Devices that cannot speak "private memory" should only access shared memory. So we must not have "private memory" mapped into their IOMMU.
Devices that can speak "private memory" may either access shared or private memory. So we may have "private memory" mapped into their IOMMU.
Private pages must not be accessible to devices owned by the host, and for that we have the same rules as the CPU:
A) The hypervisor doesn’t trust the host, and must enforce that using the CPU stage-2 MMU.
B) It’s preferable that userspace doesn’t, and hence these patches (or guest_memfd...)
We need the same rules for DMA, otherwise it is "simple" to instrument a DMA attack, so we need protection by the IOMMU. pKVM at the moment provides 2 ways of establishing that (each has its own trade-offs, which are not relevant here):
1) pKVM manages the IOMMUs and provides a hypercall interface to map/unmap in the IOMMU. Looking at the rules:
For A), pKVM has its own per-page metadata which tracks page state, which can prevent mapping private pages in the IOMMU and transitioning pages to private if they are mapped in the IOMMU.
For B), userspace won’t be able to map private pages(through VFIO/IOMMUFD), as the hypercall interface would fail if the pages are private.
This proposal is the one on the list.
2) pKVM manages a second stage of the IOMMU (as SMMUv3), and let the kernel map what it wants in stage-1 and pKVM would use a mirrored page table of the CPU MMU stage-2.
For A) Similar to the CPU, stage-2 IOMMU will protect the private pages.
For B) userspace can map private pages in the first-stage IOMMU, and that would result in a stage-2 fault. AFAIK, SMMUv3 is the only Arm implementation that supports nesting in Linux; for that, the driver would only print a page fault, and ideally the kernel wouldn’t crash, although how faults are handled is really hardware dependent, and I guess assigning a device through VFIO to userspace comes with similar risks already (bogus MMIO access can crash the system).
This proposal only exists in Android at the moment (however, I am working on getting an SMMUv3-compliant implementation that can be posted upstream).
What I see (again, I might be just wrong):
- How would the device be able to grab/access "private memory", if not via the user page tables?
I hope the above answers the question, but just to confirm, a device owned by the host shouldn’t access the memory as the host kernel is not trusted and can instrument DMA attacks. Device assignment (passthrough) is another story.
- How would we be able to convert shared -> private, if there is a longterm pin from that IOMMU? We must dynamically unmap it from the IOMMU.
Depending on which solution from the above:
1) The transition from shared -> private would fail
2) The private page would be unmapped from the stage-2 IOMMU (similar to the stage-2 CPU MMU)
I assume when you're saying "In the patches KVM (running in EL2) will manage the IOMMUs including the page tables", this is easily solved by not relying on pinning: KVM just knows what to update and where. (which is a very different model than what VFIO does)
Yes, that is not required to protect private memory.
Thanks, Mostafa
Thanks!
-- Cheers,
David / dhildenb
Here’s an update from the Linux MM Alignment Session on July 10 2024, 9-10am PDT:
The current direction is:
+ Allow mmap() of ranges that cover both shared and private memory, but disallow faulting in of private pages
+ On access to private pages, userspace will get some error, perhaps SIGBUS
+ On shared to private conversions, unmap the page and decrease refcounts
+ To support huge pages, guest_memfd will take ownership of the hugepages, and provide interested parties (userspace, KVM, iommu) with pages to be used.
  + guest_memfd will track usage of (sub)pages, for both private and shared memory
  + Pages will be broken into smaller (probably 4K) chunks at creation time to simplify implementation (as opposed to splitting at runtime when private to shared conversion is requested by the guest)
  + Core MM infrastructure will still be used to track page table mappings in mapcounts and other references (refcounts) per subpage
  + HugeTLB vmemmap Optimization (HVO) is lost when pages are broken up - to be optimized later. Suggestions:
    + Use a tracking data structure other than struct page
    + Remove the memory for struct pages backing private memory from the vmemmap, and re-populate the vmemmap on conversion from private to shared
+ Implementation pointers for huge page support
  + Consensus was that getting core MM to do tracking seems wrong
  + Maintaining special page refcounts for guest_memfd pages is difficult to get working and requires weird special casing in many places. This was tried for FS DAX pages and did not work out: [1]
  + Implementation suggestion: use infrastructure similar to what ZONE_DEVICE uses, to provide the huge page to interested parties
  + TBD: how to actually get huge pages into guest_memfd
  + TBD: how to provide/convert the huge pages to ZONE_DEVICE
    + Perhaps reserve them at boot time like in HugeTLB
+ Line of sight to compaction/migration:
  + Compaction here means making memory contiguous
  + Compaction/migration scope:
    + In scope for 4K pages
    + Out of scope for 1G pages and anything managed through ZONE_DEVICE
    + Out of scope for an initial implementation
  + Ideas for future implementations
    + Reuse the non-LRU page migration framework as used by memory ballooning
    + Have userspace drive compaction/migration via ioctls
+ Having line of sight to optimizing lost HVO means avoiding being locked in to any implementation requiring struct pages
  + Without struct pages, it is hard to reuse core MM’s compaction/migration infrastructure
+ Discuss more details at LPC in Sep 2024, such as how to use huge pages, shared/private conversion, huge page splitting
This addresses the prerequisites set out by Fuad and Elliot at the beginning of the session, which were:
1. Non-destructive shared/private conversion
   + Through having guest_memfd manage and track both shared/private memory
2. Huge page support with the option of converting individual subpages
   + Splitting of pages will be managed by guest_memfd
3. Line of sight to compaction/migration of private memory
   + Possibly driven by userspace using guest_memfd ioctls
4. Loading binaries into guest (private) memory before VM starts
   + This was identified as a special case of (1.) above
5. Non-protected guests in pKVM
   + Not discussed during session, but this is a goal of guest_memfd, for all VM types [2]
David Hildenbrand summarized this during the meeting at t=47m25s [3].
[1]: https://lore.kernel.org/linux-mm/cover.66009f59a7fe77320d413011386c3ae5c2ee8... [2]: https://lore.kernel.org/lkml/ZnRMn1ObU8TFrms3@google.com/ [3]: https://drive.google.com/file/d/17lruFrde2XWs6B1jaTrAy9gjv08FnJ45/view?t=47m...
Thanks for doing the dirty work!
On Fri, Jul 12, 2024, Ackerley Tng wrote:
Here’s an update from the Linux MM Alignment Session on July 10 2024, 9-10am PDT:
The current direction is:
- Allow mmap() of ranges that cover both shared and private memory, but disallow faulting in of private pages
- On access to private pages, userspace will get some error, perhaps SIGBUS
- On shared to private conversions, unmap the page and decrease refcounts
Note, I would strike the "decrease refcounts" part, as putting references is a natural consequence of unmapping memory, not an explicit action guest_memfd will take when converting from shared=>private.
And more importantly, guest_memfd will wait for the refcount to hit zero (or whatever the baseline refcount is).
- To support huge pages, guest_memfd will take ownership of the hugepages, and provide interested parties (userspace, KVM, iommu) with pages to be used.
- guest_memfd will track usage of (sub)pages, for both private and shared memory
- Pages will be broken into smaller (probably 4K) chunks at creation time to simplify implementation (as opposed to splitting at runtime when private to shared conversion is requested by the guest)
FWIW, I doubt we'll ever release a version with mmap()+guest_memfd support that shatters pages at creation. I can see it being an intermediate step, e.g. to prove correctness and provide a bisection point, but shattering hugepages at creation would effectively make hugepage support useless.
I don't think we need to sort this out now though, as when the shattering (and potential reconstitution) occurs doesn't affect the overall direction in any way (AFAIK). I'm chiming in purely to stave off complaints that this would break hugepage support :-)
+ Core MM infrastructure will still be used to track page table mappings in mapcounts and other references (refcounts) per subpage
+ HugeTLB vmemmap Optimization (HVO) is lost when pages are broken up - to be optimized later. Suggestions:
  + Use a tracking data structure other than struct page
  + Remove the memory for struct pages backing private memory from the vmemmap, and re-populate the vmemmap on conversion from private to shared
Implementation pointers for huge page support
- Consensus was that getting core MM to do tracking seems wrong
- Maintaining special page refcounts for guest_memfd pages is difficult to get working and requires weird special casing in many places. This was tried for FS DAX pages and did not work out: [1]
Implementation suggestion: use infrastructure similar to what ZONE_DEVICE uses, to provide the huge page to interested parties
- TBD: how to actually get huge pages into guest_memfd
- TBD: how to provide/convert the huge pages to ZONE_DEVICE
- Perhaps reserve them at boot time like in HugeTLB
Line of sight to compaction/migration:
- Compaction here means making memory contiguous
- Compaction/migration scope:
- In scope for 4K pages
- Out of scope for 1G pages and anything managed through ZONE_DEVICE
- Out of scope for an initial implementation
- Ideas for future implementations
- Reuse the non-LRU page migration framework as used by memory ballooning
- Have userspace drive compaction/migration via ioctls
- Having line of sight to optimizing lost HVO means avoiding being locked in to any implementation requiring struct pages
- Without struct pages, it is hard to reuse core MM’s compaction/migration infrastructure
Discuss more details at LPC in Sep 2024, such as how to use huge pages, shared/private conversion, huge page splitting
This addresses the prerequisites set out by Fuad and Elliott at the beginning of the session, which were:
- Non-destructive shared/private conversion
- Through having guest_memfd manage and track both shared/private memory
- Huge page support with the option of converting individual subpages
- Splitting of pages will be managed by guest_memfd
- Line of sight to compaction/migration of private memory
- Possibly driven by userspace using guest_memfd ioctls
- Loading binaries into guest (private) memory before VM starts
- This was identified as a special case of (1.) above
- Non-protected guests in pKVM
- Not discussed during session, but this is a goal of guest_memfd, for all VM types [2]
David Hildenbrand summarized this during the meeting at t=47m25s [3].
On Tue, Jul 16, 2024 at 09:03:00AM -0700, Sean Christopherson wrote:
- To support huge pages, guest_memfd will take ownership of the hugepages, and provide interested parties (userspace, KVM, iommu) with pages to be used.
- guest_memfd will track usage of (sub)pages, for both private and shared memory
- Pages will be broken into smaller (probably 4K) chunks at creation time to simplify implementation (as opposed to splitting at runtime when private to shared conversion is requested by the guest)
FWIW, I doubt we'll ever release a version with mmap()+guest_memfd support that shatters pages at creation. I can see it being an intermediate step, e.g. to prove correctness and provide a bisection point, but shattering hugepages at creation would effectively make hugepage support useless.
Why? If the private memory retains its contiguity separately but the struct pages are removed from the vmemmap, what is the downside?
As I understand it, the point is to give a large contiguous range to the private world and use only 4k pages to give the hypervisor world access to limited amounts of the memory.
Is there a reason that not having the shared memory elevated to higher contiguity is a deal breaker?
Jason
On Tue, Jul 16, 2024, Jason Gunthorpe wrote:
On Tue, Jul 16, 2024 at 09:03:00AM -0700, Sean Christopherson wrote:
- To support huge pages, guest_memfd will take ownership of the hugepages, and provide interested parties (userspace, KVM, iommu) with pages to be used.
- guest_memfd will track usage of (sub)pages, for both private and shared memory
- Pages will be broken into smaller (probably 4K) chunks at creation time to simplify implementation (as opposed to splitting at runtime when private to shared conversion is requested by the guest)
FWIW, I doubt we'll ever release a version with mmap()+guest_memfd support that shatters pages at creation. I can see it being an intermediate step, e.g. to prove correctness and provide a bisection point, but shattering hugepages at creation would effectively make hugepage support useless.
Why? If the private memory retains its contiguity separately but the struct pages are removed from the vmemmap, what is the downside?
Oooh, you're talking about shattering only the host userspace mappings. Now I understand why there was a bit of a disconnect; I was thinking you (hand-wavy everyone) were saying that KVM would immediately shatter its own mappings too.
As I understand it, the point is to give a large contiguous range to the private world and use only 4k pages to give the hypervisor world access to limited amounts of the memory.
Is there a reason that not having the shared memory elevated to higher contiguity is a deal breaker?
Nope. I'm sure someone will ask for it sooner than later, but definitely not a must have.
On Tue, Jul 16, 2024 at 10:34:55AM -0700, Sean Christopherson wrote:
On Tue, Jul 16, 2024, Jason Gunthorpe wrote:
On Tue, Jul 16, 2024 at 09:03:00AM -0700, Sean Christopherson wrote:
- To support huge pages, guest_memfd will take ownership of the hugepages, and provide interested parties (userspace, KVM, iommu) with pages to be used.
- guest_memfd will track usage of (sub)pages, for both private and shared memory
- Pages will be broken into smaller (probably 4K) chunks at creation time to simplify implementation (as opposed to splitting at runtime when private to shared conversion is requested by the guest)
FWIW, I doubt we'll ever release a version with mmap()+guest_memfd support that shatters pages at creation. I can see it being an intermediate step, e.g. to prove correctness and provide a bisection point, but shattering hugepages at creation would effectively make hugepage support useless.
Why? If the private memory retains its contiguity separately but the struct pages are removed from the vmemmap, what is the downside?
Oooh, you're talking about shattering only the host userspace mappings. Now I understand why there was a bit of a disconnect; I was thinking you (hand-wavy everyone) were saying that KVM would immediately shatter its own mappings too.
Right, I'm imagining that guestmemfd keeps track of the physical ranges in something else, like a maple tree, an xarray, or heck, a SW radix page table perhaps. It does not use struct pages. Then it has, say, a bitmap indicating which 4k granules are shared.
When KVM or the private world needs the physical addresses, it reads them out of that structure and always sees perfectly physically contiguous data regardless of any shared/private stuff.
It is not so much "broken at creation time", but more that guest_memfd does not use struct pages at all for private mappings and thus we can set up the unused struct pages however we like, including removing them from the vmemmap or preconfiguring them for order-0 granules.
There is definitely some detailed data structure work here to allow guestmemfd to manage all of this efficiently and be effective for both the 4k and 1G cases.
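As a back-of-the-envelope sketch of the structure described above, assuming for simplicity a single physically contiguous range per tracker plus a per-4K-granule shared bitmap (the flat layout and all gmem_* names are assumptions):

/*
 * Illustrative only: guest_memfd keeps the physical layout in its own
 * structure (here a single contiguous range) plus a bitmap saying which
 * 4K granules are currently shared with the host. No struct pages are
 * consulted for the private side.
 */
#include <linux/bitmap.h>
#include <linux/slab.h>
#include <linux/types.h>

#define GMEM_GRANULE_SHIFT	12	/* 4K granules */

struct gmem_range {
	phys_addr_t base;		/* physically contiguous backing */
	unsigned long nr_granules;	/* range size in 4K granules */
	unsigned long *shared;		/* bit set => granule is shared */
};

static struct gmem_range *gmem_range_create(phys_addr_t base,
					     unsigned long nr_granules)
{
	struct gmem_range *r = kzalloc(sizeof(*r), GFP_KERNEL);

	if (!r)
		return NULL;
	r->shared = bitmap_zalloc(nr_granules, GFP_KERNEL);
	if (!r->shared) {
		kfree(r);
		return NULL;
	}
	r->base = base;
	r->nr_granules = nr_granules;
	return r;
}

/* The private world always sees the contiguous physical range... */
static phys_addr_t gmem_granule_to_phys(struct gmem_range *r, unsigned long idx)
{
	return r->base + ((phys_addr_t)idx << GMEM_GRANULE_SHIFT);
}

/* ...while shared/private conversion is just a bit flip, no splitting. */
static void gmem_mark_shared(struct gmem_range *r, unsigned long idx)
{
	set_bit(idx, r->shared);
}

static bool gmem_granule_is_shared(struct gmem_range *r, unsigned long idx)
{
	return test_bit(idx, r->shared);
}

The private side always works from gmem_granule_to_phys() and never looks at struct pages, while the shared-bit lookup only matters when the host needs access to a granule.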
Jason