On Tue, Sep 10, 2024, Nikolas Wipper wrote:
This series introduces a new ioctl KVM_TRANSLATE2, which expands on KVM_TRANSLATE. It is required to implement Hyper-V's HvTranslateVirtualAddress hyper-call as part of the ongoing effort to emulate HyperV's Virtual Secure Mode (VSM) within KVM and QEMU. The hyper- call requires several new KVM APIs, one of which is KVM_TRANSLATE2, which implements the core functionality of the hyper-call. The rest of the required functionality will be implemented in subsequent series.
Other than translating guest virtual addresses, the ioctl allows the caller to control whether the access and dirty bits are set during the page walk. It also allows specifying an access mode instead of returning viable access modes, which enables setting the bits up to the level that caused a failure. Additionally, the ioctl provides more information about why the page walk failed, and which page table is responsible. This functionality is not available within KVM_TRANSLATE, and can't be added without breaking backwards compatiblity, thus a new ioctl is required.
...
Documentation/virt/kvm/api.rst | 131 ++++++++ arch/x86/include/asm/kvm_host.h | 18 +- arch/x86/kvm/hyperv.c | 3 +- arch/x86/kvm/kvm_emulate.h | 8 + arch/x86/kvm/mmu.h | 10 +- arch/x86/kvm/mmu/mmu.c | 7 +- arch/x86/kvm/mmu/paging_tmpl.h | 80 +++-- arch/x86/kvm/x86.c | 123 ++++++- include/linux/kvm_host.h | 6 + include/uapi/linux/kvm.h | 33 ++ tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/x86_64/kvm_translate2.c | 310 ++++++++++++++++++ virt/kvm/kvm_main.c | 41 +++ 13 files changed, 724 insertions(+), 47 deletions(-) create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_translate2.c
...
The simple reason for keeping this functionality in KVM, is that it already has a mature, production-level page walker (which is already exposed) and creating something similar QEMU would take a lot longer and would be much harder to maintain than just creating an API that leverages the existing walker.
I'm not convinced that implementing targeted support in QEMU (or any other VMM) would be at all challenging or a burden to maintain. I do think duplicating functionality across multiple VMMs is undesirable, but that's an argument for creating modular userspace libraries for such functionality. E.g. I/O APIC emulation is another one I'd love to move to a common library.
Traversing page tables isn't difficult. Checking permission bits isn't complex. Tedious, perhaps. But not complex. KVM's rather insane code comes from KVM's desire to make the checks as performant as possible, because eking out every little bit of performance matters for legacy shadow paging. I doubt VSM needs _that_ level of performance.
I say "targeted", because I assume the only use case for VSM is 64-bit non-nested guests. QEMU already has a rudimentary supporting for walking guest page tables, and that code is all of 40 LoC. Granted, it's heinous and lacks permission checks and A/D updates, but I would expect a clean implementation with permission checks and A/D support would clock in around 200 LoC. Maybe 300.
And ignoring docs and selftests, that's roughly what's being added in this series. Much of the code being added is quite simple, but there are non-trivial changes here as well. E.g. the different ways of setting A/D bits.
My biggest concern is taking on ABI that restricts what KVM can do in its walker. E.g. I *really* don't like the PKU change. Yeah, Intel doesn't explicitly define architectural behavior, but diverging from hardware behavior is rarely a good idea.
Similarly, the behavior of FNAME(protect_clean_gpte)() probably isn't desirable for the VSM use case.