On Tue, Oct 08, 2024, Ackerley Tng wrote:
Patrick Roy roypat@amazon.co.uk writes:
For the "non-CoCo with direct map entries removed" VMs that we at AWS are going for, we'd like a VM type with host-controlled in-place conversions which doesn't zero on transitions,
Hmm, your use case shouldn't need conversions _for KVM_, as there's no need for KVM to care if userspace or the guest _wants_ a page to be shared vs. private. Userspace is fully trusted to manage things; KVM simply reacts to the current state of things.
And more importantly, whether or not the direct map is zapped needs to be a property of the guest_memfd inode, i.e. can't be associated with a struct kvm. I forget who got volunteered to do the work, but we're going to need similar functionality for tracking the state of individual pages in a huge folio, as folio_mark_uptodate() is too coarse-grained. I.e. at some point, I expect that guest_memfd will make it easy-ish to determine whether or not the direct map has been obliterated.
The shared vs. private attributes tracking in KVM is still needed (I think), as it communicates what userspace _wants_, whereas he guest_memfd machinery will track what the state _is_.
so if KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add another new VM type for that.
Maybe we should sneak in a s/KVM_X86_SW_PROTECTED_VM/KVM_X86_SW_HARDENED_VM rename? The original thought behind "software protected VM" was to do a slow build of something akin to pKVM, but realistically I don't think that idea is going anywhere.
Alternatively, depending on how KVM accesses guest memory that's been removed from the direct map, another solution would be to allow "regular" VMs to bind memslots to guest_memfd, i.e. if the non-CoCo use case needs/wnats to bind all memory to guest_memfd, not just "private" mappings.
That's probably the biggest topic of discussion: how do we want to allow mapping guest_memfd into the guest, without direct map entries, but while still allowing KVM to access guest memory as needed, e.g. for shadow paging. One approach is your RFC, where KVM maps guest_memfd pfns on-demand.
Another (slightly crazy) approach would be use protection keys to provide the security properties that you want, while giving KVM (and userspace) a quick-and-easy override to access guest memory.
1. mmap() guest_memfd into userpace with RW protections 2. Configure PKRU to make guest_memfd memory inaccessible by default 3. Swizzle PKRU on-demand when intentionally accessing guest memory
It's essentially the same idea as SMAP+STAC/CLAC, just applied to guest memory instead of to usersepace memory.
The benefit of the PKRU approach is that there are no PTE modifications, and thus no TLB flushes, and only the CPU that is access guest memory gains temporary access. The big downside is that it would be limited to modern hardware, but that might be acceptable, especially if it simplifies KVM's implementation.
Somewhat related sidenote: For VMs that allow inplace conversions and do not zero, we do not need to zap the stage-2 mappings on memory attribute changes, right?
See above. I don't think conversions by toggling the shared/private flag in KVM's memory attributes is the right fit for your use case.