I think we're lacking background on this usage model and how it works. For instance, typically L2 is created by L1, and L1 is responsible for L2's device I/O emulation. I don't quite understand how L0 could emulate L2's device I/O.
Can you provide more information?
Let's differentiate between fast and slow I/O. The whole point of the paravisor in L1 is to provide device emulation for slow I/O: TPM, RTC, NVRAM, IO-APIC, serial ports.
But fast I/O is designed to bypass it and go straight to L0. Hyper-V uses paravirtual vmbus devices for fast I/O (net/block). The vmbus protocol has page-visibility awareness built in and uses the native mechanisms (GHCI on TDX, GHCB on SNP) for notifications. So once everything is set up (rings/buffers in swiotlb), I/O for the fast devices does not involve L1 at all. This is only possible when the VM manages the C-bit itself.
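To sketch the idea (hypothetical names throughout: struct fast_ring and signal_host() are made-up stand-ins, not the actual vmbus ring-buffer code):

	/*
	 * Conceptual fast-path send, L2 <-> L0. The ring lives in memory the
	 * guest has made host-visible itself, which is why C-bit control in
	 * the L2 is a prerequisite.
	 */
	struct fast_ring {
		void	*buf;	/* shared (decrypted) ring memory */
		size_t	size;
	};

	static int fast_ring_init(struct fast_ring *r, void *buf, size_t size)
	{
		int ret;

		/* Clear the C-bit so L0 can read/write the ring directly. */
		ret = set_memory_decrypted((unsigned long)buf, size >> PAGE_SHIFT);
		if (ret)
			return ret;

		r->buf = buf;
		r->size = size;
		return 0;
	}

	static void fast_ring_send(struct fast_ring *r, const void *data, size_t len)
	{
		memcpy(r->buf, data, len);	/* bounce into the shared ring */
		signal_host();			/* stand-in for the GHCI/GHCB doorbell;
						 * note: no exit to L1 on this path */
	}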
Yeah that makes sense. Thanks for the info.
I think the same thing could work for virtio if someone were to "enlighten" the vring notification calls (replacing the port I/O or MMIO instructions).
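A sketch of what I mean, modeled on the kernel's #VE MMIO path in arch/x86/coco/tdx/tdx.c (hcall_func()/EPT_WRITE are borrowed from there; the notify_gpa field is made up, and this is an illustration rather than a patch):

	/*
	 * Hypothetical "enlightened" vring notify for a TDX L2: instead of a
	 * writel() that traps and gets emulated, issue the MMIO-write
	 * TDVMCALL directly, so the doorbell goes straight to L0.
	 */
	static bool enlightened_vm_notify(struct virtqueue *vq)
	{
		struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vq->vdev);
		u64 gpa = vm_dev->notify_gpa + VIRTIO_MMIO_QUEUE_NOTIFY; /* made up */

		/* TDG.VP.VMCALL<MMIO>: 4-byte write of the queue index to gpa */
		return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION),
				       4, EPT_WRITE, gpa, vq->index);
	}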
What's missing is that the tdx_guest flag is not exposed to userspace in /proc/cpuinfo, and as a result dmesg does not currently display: "Memory Encryption Features active: Intel TDX".
That's what I set out to correct.
So far I see that you are trying to get the kernel to think that it runs as a TDX guest, when it really doesn't. This is not a very convincing model.
No, that's not accurate at all. The kernel is running as a TDX guest, so I want the kernel to know that.
But it isn't. It runs on a hypervisor which is a TDX guest, but that doesn't make the L2 itself a TDX guest.
That depends on your definition of "TDX guest". The TDX 1.5 TD partitioning spec talks of a TDX-enlightened L1 VMM, (optionally) a TDX-enlightened L2 VM, and an unmodified legacy L2 VM. Here we're dealing with a TDX-enlightened L2 VM.
If a guest runs inside an Intel TDX protected TD, is aware of memory encryption, and issues TDVMCALLs, then to me that makes it a TDX guest.
The thing I don't quite understand is which enlightenment(s) require L2 to issue TDVMCALLs and to know the "encryption bit".
The one reason I can think of is this:
If device I/O emulation of L2 is done by L0, then I guess it's reasonable to make L2 aware of the "encryption bit", because L0 can only write emulated data to a shared buffer. The shared buffer must initially be converted by L2 using the MAP_GPA TDVMCALL to L0 (to zap private pages in the S-EPT, etc.), and L2 needs to know the "encryption bit" to set up its own page tables properly. L1 must be aware of such private <-> shared conversions too, so that it can set up its page tables properly, which means L1 must also be notified.
Your description is correct, except that L2 uses a hypercall (hv_mark_gpa_visibility()) to notify L1, and it is L1 that issues the MAP_GPA TDVMCALL to L0.
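Condensed, the L2 side looks roughly like this (after arch/x86/hyperv/ivm.c; batching and error handling simplified):

	/*
	 * set_memory_decrypted()/set_memory_encrypted() in a vTOM guest end
	 * up here via the x86 guest callbacks. L2 only talks to L1; the
	 * paravisor issues the MAP_GPA TDVMCALL to L0 on its behalf and
	 * adjusts the page-table views accordingly.
	 */
	static bool hv_vtom_set_host_visibility(unsigned long kbuffer,
						int pagecount, bool enc)
	{
		enum hv_mem_host_visibility vis = enc ? VMBUS_PAGE_NOT_VISIBLE
						      : VMBUS_PAGE_VISIBLE_READ_WRITE;
		u64 *pfns;
		int i, ret;

		pfns = kcalloc(pagecount, sizeof(*pfns), GFP_KERNEL);
		if (!pfns)
			return false;

		for (i = 0; i < pagecount; i++)
			pfns[i] = virt_to_hvpfn((void *)kbuffer +
						i * HV_HYP_PAGE_SIZE);

		ret = hv_mark_gpa_visibility(pagecount, pfns, vis);
		kfree(pfns);
		return ret == 0;
	}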
In TDX partitioning, IIUC, L1 and L2 use different secure-EPT page tables when mapping L1 and L2 GPAs. Therefore, IIUC, the entries in both secure-EPT tables that map the to-be-converted page need to be zapped.
I am not entirely sure that using hv_mark_gpa_visibility() suffices: if the MAP_GPA comes from L1, I am not sure it is easy for L0 to zap the secure-EPT entries for L2.
But anyway, these are details we probably don't need to consider here.
C-bit awareness is necessary to set up the whole swiotlb pool as host-visible for DMA.
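For reference, this is where that happens; condensed from kernel/dma/swiotlb.c:

	/* Make the whole bounce-buffer pool host-visible, once, at boot. */
	void __init swiotlb_update_mem_attributes(void)
	{
		struct io_tlb_mem *mem = &io_tlb_default_mem;
		unsigned long bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);

		/*
		 * Clears the C-bit in the direct-map PTEs and notifies the
		 * host (MAP_GPA on TDX, page-state change on SNP). Neither
		 * works unless the guest knows which bit to clear.
		 */
		set_memory_decrypted((unsigned long)mem->vaddr,
				     bytes >> PAGE_SHIFT);
	}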
Agreed.
The concern I have is whether there are other usage models that we need to consider. For instance, running both an unmodified L2 and an enlightened L2. Or an L2 that only needs the TDVMCALL enlightenment but not the "encryption bit".
Presumably unmodified L2 and enlightened L2 are already covered by the current code, but they require excessive trapping to L1.
I can't see a use case for TDVMCALLs but no "encryption bit".
In other words, that seems pretty much L1 hypervisor/paravisor implementation-specific. I am wondering whether we can hide the enlightenment logic entirely in hypervisor/paravisor-specific code, rather than generically marking L2 as a TDX guest and then having to disable TDCALL and the like.
That's how it currently works: all the enlightenments live in hypervisor/paravisor-specific code in arch/x86/hyperv and drivers/hv, and the VM is not marked with X86_FEATURE_TDX_GUEST.
And I believe there's a reason that the VM is not marked as a TDX guest.
But without X86_FEATURE_TDX_GUEST, userspace has no unified way to discover that an environment is protected by TDX, and the VM also gets classified as "AMD SEV" in dmesg. This is because CC_ATTR_GUEST_MEM_ENCRYPT is set but X86_FEATURE_TDX_GUEST is not.
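Roughly the logic involved (condensed from arch/x86/mm/mem_encrypt.c; the branch ordering is why a TD-partitioning L2 gets reported as AMD today):

	static void print_mem_encrypt_feature_info(void)
	{
		pr_info("Memory Encryption Features active:");

		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
			pr_cont(" Intel TDX\n");
			return;
		}

		pr_cont(" AMD");
		/* A TDX L2 without X86_FEATURE_TDX_GUEST lands here ... */
		if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
			pr_cont(" SEV");	/* ... and prints "AMD SEV" */
		pr_cont("\n");
	}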
Can you provide more information about what _userspace_ does here?
What's the difference if it sees a TDX guest or a normal non-CoCo guest in /proc/cpuinfo?
It looks like the whole purpose of this series is to make userspace happy by advertising "TDX guest" via /proc/cpuinfo. But if we do that, we get bad side effects in the kernel, which is why you need the changes in your patch 2/3.
That doesn't seem very convincing. Is there any other mechanism that userspace could use, e.g., any Hyper-V hypervisor/paravisor-specific attributes that are exposed to userspace?