Ackerley Tng ackerleytng@google.com writes:
Elliot Berman quic_eberman@quicinc.com writes:
On Tue, Sep 10, 2024 at 11:44:01PM +0000, Ackerley Tng wrote:
Since guest_memfd now supports mmap(), folios have to be prepared before they are faulted into userspace.
When memory attributes are switched between shared and private, the up-to-date flags will be cleared.
Use the folio's up-to-date flag to indicate being ready for the guest usage and can be used to mark whether the folio is ready for shared OR private use.
Clearing the up-to-date flag also means that the page gets zero'd out whenever it transitions between shared and private (either direction). pKVM (Android) hypervisor policy can allow in-place conversion between shared/private.
I believe the important thing is that sev_gmem_prepare() needs to be called prior to giving page to guest. In my series, I had made a ->prepare_inaccessible() callback where KVM would only do this part. When transitioning to inaccessible, only that callback would be made, besides the bookkeeping. The folio zeroing happens once when allocating the folio if the folio is initially accessible (faultable).
From x86 CoCo perspective, I think it also makes sense to not zero the folio when changing faultiblity from private to shared:
- If guest is sharing some data with host, you've wiped the data and guest has to copy again.
- Or, if SEV/TDX enforces that page is zero'd between transitions, Linux has duplicated the work that trusted entity has already done.
Fuad and I can help add some details for the conversion. Hopefully we can figure out some of the plan at plumbers this week.
Zeroing the page prevents leaking host data (see function docstring for kvm_gmem_prepare_folio() introduced in [1]), so we definitely don't want to introduce a kernel data leak bug here.
Actually it seems like filemap_grab_folio() already gets a zeroed page.
filemap_grab_folio() eventually calls __alloc_pages_noprof() -> get_page_from_freelist() -> prep_new_page() -> post_alloc_hook()
and post_alloc_hook() calls kernel_init_pages(), which zeroes the page, depending on kernel config.
Paolo, was calling clear_highpage() in kvm_gmem_prepare_folio() zeroing an already empty page returned from filemap_grab_folio()?
In-place conversion does require preservation of data, so for conversions, shall we zero depending on VM type?
- Gunyah: don't zero since ->prepare_inaccessible() is a no-op
- pKVM: don't zero
- TDX: don't zero
- SEV: AMD Architecture Programmers Manual 7.10.6 says there is no automatic encryption and implies no zeroing, hence perform zeroing
- KVM_X86_SW_PROTECTED_VM: Doesn't have a formal definition so I guess we could require zeroing on transition?
This way, the uptodate flag means that it has been prepared (as in sev_gmem_prepare()), and zeroed if required by VM type.
Regarding flushing the dcache/tlb in your other question [2], if we don't use folio_zero_user(), can we relying on unmapping within core-mm to flush after shared use, and unmapping within KVM To flush after private use?
Or should flush_dcache_folio() be explicitly called on kvm_gmem_fault()?
clear_highpage(), used in the non-hugetlb (original) path, doesn't flush the dcache. Was that intended?
Thanks, Elliot
<snip>
[1] https://lore.kernel.org/all/20240726185157.72821-8-pbonzini@redhat.com/ [2] https://lore.kernel.org/all/diqz34ldszp3.fsf@ackerleytng-ctop.c.googlers.com...