On Tue, Aug 16, 2022 at 03:38:08PM +0000, Sean Christopherson wrote:
On Tue, Aug 16, 2022, Gupta, Pankaj wrote:
Actually the current version allows you to delay the allocation to a later time (e.g. page fault time) if you don't call fallocate() on the private fd. fallocate() was necessary in previous versions because we treated existence in the fd as 'private', but in this version we track private/shared info in KVM, so we don't rely on that fact from the memory backing stores.
Does this mean that reservation of guest physical memory with the secure processor (for both SEV-SNP and TDX) will also happen at page fault time?
Do we plan to keep it this way?
If you are talking about memory acceptance by the guest, that is initiated by the guest and has nothing to do with page-fault-time vs. fallocate() allocation of host memory. Acceptance happens after host memory allocation, but they are not in lockstep; acceptance can happen much later.
No, I meant reserving a guest physical memory range from the hypervisor, e.g. with RMPUPDATE for SEV-SNP or the equivalent on the TDX side (PAMTs?).
As proposed, RMP/PAMT updates will occur in the fault path, i.e. there is no way for userspace to pre-map guest memory.
Hi Sean,
Currently I have the rmpupdate hook in KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION ioctls, so that when the pages actually get faulted in they are already in the expected state. I have userspace set up to call KVM_MEMORY_ENCRYPT_* in response to explicit page state changes issued by the guest, as well as in response to MEMORY_FAULT exits for implicit page state changes.
Initially the private backing store may or may not be pre-fallocate()'d, depending on how userspace wants to handle it. If it's not pre-fallocate()'d, then the pages don't get faulted in until the guest does explicit page state changes (currently SNP guests do this for all memory at boot time, but with the unaccepted memory patches for the guest/OVMF this will happen during guest run-time, which would still allow us to make efficient use of lazy-pinning support for shorter boot times).
If userspace wants to pre-allocate, it can issue the fallocate() for all the ranges up-front so it doesn't incur the cost during run-time.
Is that compatible with the proposed design?
Of course, for the initial encrypted payload, we would need to issue KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION up-front. I'm doing that in conjunction with the hack to allow pwrite() to the memfd to pre-populate the private pages before the in-place encryption that occurs when SNP_LAUNCH_UPDATE is issued...
In the past you and Vishal suggested doing the copy from within SNP_LAUNCH_UPDATE, which seems like a workable solution and something we've been meaning to implement...
I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic vCPU-scoped ioctl() that allows userspace to pre-map guest memory. Supporting initializing guest private memory with a source page can be implemented via a flag. That also gives KVM line of sight to in-place "conversion", e.g. another flag could be added to say that the dest is also the source.
The TDX and SNP restrictions would then become additional restrictions on when initializing with a source is allowed (and VMs that don't have guest private memory wouldn't allow the flag at all).
So is this proposed ioctl only intended to handle the initial encrypted payload, and the KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION ioctls would still be used for conversions post-boot?
If so, that seems reasonable, but I thought there was some consensus that just handling it per-platform in, e.g., SNP_LAUNCH_UPDATE was sufficient for now, until some additional need arose for a new interface. Has something changed in that regard? Just want to understand the motivations so we can plan accordingly.
Thanks!
-Mike