This series introduces a new ioctl, KVM_TRANSLATE2, which expands on KVM_TRANSLATE. It is required to implement Hyper-V's HvTranslateVirtualAddress hypercall as part of the ongoing effort to emulate Hyper-V's Virtual Secure Mode (VSM) within KVM and QEMU. The hypercall requires several new KVM APIs, one of which is KVM_TRANSLATE2, which implements its core functionality. The rest of the required functionality will be implemented in subsequent series.
Beyond translating guest virtual addresses, the ioctl allows the caller to control whether the accessed and dirty bits are set during the page walk. It also lets the caller specify the access mode to check for, instead of merely reporting the viable access modes, which makes it possible to set those bits up to the level that caused a failure. Additionally, the ioctl reports why the page walk failed and which page table is responsible. None of this is available through KVM_TRANSLATE, and it can't be added without breaking backwards compatibility, so a new ioctl is required.
The ioctl was designed to accommodate as many use cases beyond VSM as possible. The error codes were intentionally chosen to be broad enough to avoid exposing architecture-specific details. Even though HvTranslateVirtualAddress only really needs a single flag to set the accessed and dirty bits whenever possible, that was split into several flags so that future users can choose more granularly when these bits should be set. Furthermore, as much information as possible is returned to the caller.
The patch series includes selftests for the ioctl, as well as fuzz testing on random garbage guest page table entries. All previously passing KVM selftests and KVM unit tests still pass.
Series overview:
 - 1: Document the new ioctl
 - 2-11: Update the page walker in preparation
 - 12-14: Implement the ioctl
 - 15: Implement testing
This series, alongside the series by Nicolas Saenz Julienne [1] introducing the core building blocks for VSM and the accompanying QEMU implementation [2], is capable of booting Windows Server 2019.
Both series are also available on GitHub [3].
[1] https://lore.kernel.org/linux-hyperv/20240609154945.55332-1-nsaenz@amazon.co...
[2] https://github.com/vianpl/qemu/tree/vsm/next
[3] https://github.com/vianpl/linux/tree/vsm/next
Best,
Nikolas
Nikolas Wipper (15):
  KVM: Add API documentation for KVM_TRANSLATE2
  KVM: x86/mmu: Abort page walk if permission checks fail
  KVM: x86/mmu: Introduce exception flag for unmapped GPAs
  KVM: x86/mmu: Store GPA in exception if applicable
  KVM: x86/mmu: Introduce flags parameter to page walker
  KVM: x86/mmu: Implement PWALK_SET_ACCESSED in page walker
  KVM: x86/mmu: Implement PWALK_SET_DIRTY in page walker
  KVM: x86/mmu: Implement PWALK_FORCE_SET_ACCESSED in page walker
  KVM: x86/mmu: Introduce status parameter to page walker
  KVM: x86/mmu: Implement PWALK_STATUS_READ_ONLY_PTE_GPA in page walker
  KVM: x86: Introduce generic gva to gpa translation function
  KVM: Introduce KVM_TRANSLATE2
  KVM: Add KVM_TRANSLATE2 stub
  KVM: x86: Implement KVM_TRANSLATE2
  KVM: selftests: Add test for KVM_TRANSLATE2
 Documentation/virt/kvm/api.rst                | 131 ++++++++
 arch/x86/include/asm/kvm_host.h               |  18 +-
 arch/x86/kvm/hyperv.c                         |   3 +-
 arch/x86/kvm/kvm_emulate.h                    |   8 +
 arch/x86/kvm/mmu.h                            |  10 +-
 arch/x86/kvm/mmu/mmu.c                        |   7 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |  80 +++--
 arch/x86/kvm/x86.c                            | 123 ++++++-
 include/linux/kvm_host.h                      |   6 +
 include/uapi/linux/kvm.h                      |  33 ++
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/x86_64/kvm_translate2.c     | 310 ++++++++++++++++++
 virt/kvm/kvm_main.c                           |  41 +++
 13 files changed, 724 insertions(+), 47 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_translate2.c
Add API documentation for the new KVM_TRANSLATE2 ioctl.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 Documentation/virt/kvm/api.rst | 131 +++++++++++++++++++++++++++++++++
 1 file changed, 131 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index a4b7dc4a9dda..632dc591badf 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6442,6 +6442,137 @@ the capability to be present.
`flags` must currently be zero.
+4.144 KVM_TRANSLATE2
+--------------------
+
+:Capability: KVM_CAP_TRANSLATE2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_translation2 (in/out)
+:Returns: 0 on success, <0 on error
+
+KVM_TRANSLATE2 translates a guest virtual address into a guest physical one
+while probing for the requested access permissions and allowing control over
+whether the accessed and dirty bits are set in each of the page map level
+structures. If the page walk fails, it provides detailed information
+explaining the reason for the failure.
+
+::
+
+  /* for KVM_TRANSLATE2 */
+  struct kvm_translation2 {
+          /* in */
+          __u64 linear_address;
+  #define KVM_TRANSLATE_FLAGS_SET_ACCESSED       (1 << 0)
+  #define KVM_TRANSLATE_FLAGS_SET_DIRTY          (1 << 1)
+  #define KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED (1 << 2)
+          __u16 flags;
+  #define KVM_TRANSLATE_ACCESS_WRITE (1 << 0)
+  #define KVM_TRANSLATE_ACCESS_USER  (1 << 1)
+  #define KVM_TRANSLATE_ACCESS_EXEC  (1 << 2)
+  #define KVM_TRANSLATE_ACCESS_ALL          \
+          (KVM_TRANSLATE_ACCESS_WRITE |     \
+           KVM_TRANSLATE_ACCESS_USER  |     \
+           KVM_TRANSLATE_ACCESS_EXEC)
+          __u16 access;
+          __u8  padding[4];
+
+          /* out */
+          __u64 physical_address;
+          __u8  valid;
+  #define KVM_TRANSLATE_FAULT_NOT_PRESENT         1
+  #define KVM_TRANSLATE_FAULT_PRIVILEGE_VIOLATION 2
+  #define KVM_TRANSLATE_FAULT_RESERVED_BITS       3
+  #define KVM_TRANSLATE_FAULT_INVALID_GVA         4
+  #define KVM_TRANSLATE_FAULT_INVALID_GPA         5
+          __u16 error_code;
+          __u8  set_bits_succeeded;
+          __u8  padding2[4];
+  };
+
+If the page walk succeeds, `physical_address` will contain the result of the
+page walk, `valid` will be set to 1 and `error_code` will not contain any
+meaningful value.
+
+If the page walk fails, `valid` will be set to 0 and `error_code` will contain
+the reason for the walk failure. `physical_address` may contain the physical
+address of the page table where the page walk was aborted, depending on the
+returned error code:
+
+.. csv-table::
+  :header: "`error_code`", "`physical_address`"
+
+  "KVM_TRANSLATE_FAULT_NOT_PRESENT", "Physical address of the page table entry without the present bit"
+  "KVM_TRANSLATE_FAULT_PRIVILEGE_VIOLATION", "Physical address of the page table entry where access checks failed"
+  "KVM_TRANSLATE_FAULT_RESERVED_BITS", "Physical address of the page table entry with reserved bits set"
+  "KVM_TRANSLATE_FAULT_INVALID_GPA", "Physical address that wasn't backed by host memory"
+  "KVM_TRANSLATE_FAULT_INVALID_GVA", "empty"
+
+The `flags` field can take each of these flags:
+
+KVM_TRANSLATE_FLAGS_SET_ACCESSED
+  Sets the accessed bit on each page table level on a successful page walk.
+
+KVM_TRANSLATE_FLAGS_SET_DIRTY
+  Sets the dirty bit on each page table level on a successful page walk.
+
+KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED
+  Forces setting the accessed bit on every page table level that was walked
+  successfully, even on failed page walks.
+
+.. warning::
+
+  Setting these flags and then using the translated address may lead to a
+  race if another vCPU remotely flushes the local vCPU's TLB while the
+  address is still in use. This can be mitigated by stalling such TLB
+  flushes until the memory operation is finished.
+
+The `access` field can take each of these flags:
+
+KVM_TRANSLATE_ACCESS_WRITE
+  The page walker will check for write access on every page table.
+
+KVM_TRANSLATE_ACCESS_USER
+  The page walker will check for user mode access on every page table.
+
+KVM_TRANSLATE_ACCESS_EXEC
+  The page walker will check for executable/fetch access on every page table.
+
+If none of these flags are set, read access and kernel mode permissions are
+implied.
+
+The `error_code` field can take one of these values:
+
+KVM_TRANSLATE_FAULT_NOT_PRESENT
+  The virtual address is not mapped to any physical address.
+
+KVM_TRANSLATE_FAULT_PRIVILEGE_VIOLATION
+  One of the access checks failed during the page walk.
+
+KVM_TRANSLATE_FAULT_RESERVED_BITS
+  Reserved bits were set in a page table.
+
+KVM_TRANSLATE_FAULT_INVALID_GPA
+  One of the guest page table entries' addresses along the page walk was not
+  backed by host memory.
+
+KVM_TRANSLATE_FAULT_INVALID_GVA
+  The GVA provided is not valid in the current vCPU state. For example, on
+  32-bit systems the virtual address provided was larger than 32 bits, or on
+  64-bit x86 systems the virtual address was non-canonical.
+
+Regardless of the success of the page walk, `set_bits_succeeded` will contain
+a boolean value indicating whether the accessed/dirty bits were set. It may be
+false if the bits were not set because the page walk failed and
+KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED was not passed, or if there was an
+error setting the bits, for example because the host memory backing the page
+table entry was marked read-only.
+
+KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED and KVM_TRANSLATE_FLAGS_SET_DIRTY must
+never be passed without KVM_TRANSLATE_FLAGS_SET_ACCESSED.
+KVM_TRANSLATE_FLAGS_SET_DIRTY must never be passed without
+KVM_TRANSLATE_ACCESS_WRITE. Doing either will cause the ioctl to fail with
+-EINVAL.
 5. The kvm_run structure
 ========================
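For illustration only (not part of the posted series), a minimal userspace sketch of calling the new ioctl could look as follows. It assumes vcpu_fd is an already-created vCPU file descriptor and that <linux/kvm.h> carries the definitions added later in this series:

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int translate_gva(int vcpu_fd, unsigned long long gva)
    {
            struct kvm_translation2 tr = {
                    .linear_address = gva,
                    .flags  = KVM_TRANSLATE_FLAGS_SET_ACCESSED,
                    .access = KVM_TRANSLATE_ACCESS_WRITE,
            };

            /* The ioctl itself fails (e.g. -EINVAL) only for bad arguments. */
            if (ioctl(vcpu_fd, KVM_TRANSLATE2, &tr) < 0)
                    return -1;

            if (tr.valid)
                    printf("gva 0x%llx -> gpa 0x%llx\n",
                           gva, (unsigned long long)tr.physical_address);
            else
                    printf("walk failed: error_code=%u, pte gpa=0x%llx\n",
                           tr.error_code,
                           (unsigned long long)tr.physical_address);

            return 0;
    }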
Abort the page walk if permission checks fail on any page table level by moving the check into the page walker loop. Currently, the page walker only checks the access flags after successfully walking the entire paging structure. This change is needed later to enable setting accessed bits in each page table that was successfully visited during a page walk that ultimately failed.
As a result, the error codes returned by the page walker may change behaviour: the error code is now built as soon as an access violation is found. For example, if an access violation is detected at page level 4, the page walker aborts the walk without looking at level 3 and below. However, since the returned error code is built from the requested access permissions regardless of the actual cause of the failure, it only differs when there is an access violation in one level and a PKRU violation in a lower one.

Previously the error code would include this PKRU violation, whereas now it does not; this is still in line with the behaviour specified in Intel's SDM. The exact procedure for testing violations is currently not specified in the SDM, and aborting the page walk early seems to be a reasonable implementation detail. As KVM does not read the PK bit anywhere, this only results in different page-fault error codes for guests.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 arch/x86/kvm/mmu/paging_tmpl.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index ae7d39ff2d07..d9c3c78b3c14 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -422,6 +422,12 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, goto error; }
+ /* Convert to ACC_*_MASK flags for struct guest_walker. */ + walker->pte_access = FNAME(gpte_access)(pte_access ^ walk_nx_mask); + errcode = permission_fault(vcpu, mmu, walker->pte_access, pte_pkey, access); + if (unlikely(errcode)) + goto error; + walker->ptes[walker->level - 1] = pte;
/* Convert to ACC_*_MASK flags for struct guest_walker. */ @@ -431,12 +437,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, pte_pkey = FNAME(gpte_pkeys)(vcpu, pte); accessed_dirty = have_ad ? pte_access & PT_GUEST_ACCESSED_MASK : 0;
- /* Convert to ACC_*_MASK flags for struct guest_walker. */ - walker->pte_access = FNAME(gpte_access)(pte_access ^ walk_nx_mask); - errcode = permission_fault(vcpu, mmu, walker->pte_access, pte_pkey, access); - if (unlikely(errcode)) - goto error; - gfn = gpte_to_gfn_lvl(pte, walker->level); gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;
Introduce a flag in x86_exception which signals that a page walk failed because a page table GPA wasn't backed by a memslot. This only applies to page tables; the final physical address is not checked.
This extra flag is needed because the normal page fault error code does not contain a bit to signal this kind of fault.
Used in subsequent patches to give userspace information about translation failure.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 arch/x86/kvm/kvm_emulate.h     | 2 ++
 arch/x86/kvm/mmu/paging_tmpl.h | 6 +++++-
 2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h index 55a18e2f2dcd..afd8e86bc6af 100644 --- a/arch/x86/kvm/kvm_emulate.h +++ b/arch/x86/kvm/kvm_emulate.h @@ -27,6 +27,8 @@ struct x86_exception { u64 address; /* cr2 or nested page fault gpa */ u8 async_page_fault; unsigned long exit_qualification; +#define KVM_X86_UNMAPPED_PTE_GPA BIT(0) + u16 flags; };
/* diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index d9c3c78b3c14..f6a78b7cfca1 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -339,6 +339,8 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, #endif walker->max_level = walker->level;
+ walker->fault.flags = 0; + /* * FIXME: on Intel processors, loads of the PDPTE registers for PAE paging * by the MOV to CR instruction are treated as reads and do not cause the @@ -393,8 +395,10 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, return 0;
slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(real_gpa)); - if (!kvm_is_visible_memslot(slot)) + if (!kvm_is_visible_memslot(slot)) { + walker->fault.flags = KVM_X86_UNMAPPED_PTE_GPA; goto error; + }
host_addr = gfn_to_hva_memslot_prot(slot, gpa_to_gfn(real_gpa), &walker->pte_writable[walker->level - 1]);
Store the GPA at which the page walk failed in the walker's exception. Specifically, this is the PTE's GPA if it couldn't be resolved or caused an access violation, or the fully translated GPA if the final page caused an access violation.
Returning the GPA from the page walker directly is not possible, because other code within KVM relies on INVALID_GPA being returned on failure.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 arch/x86/kvm/kvm_emulate.h     | 6 ++++++
 arch/x86/kvm/mmu/paging_tmpl.h | 1 +
 2 files changed, 7 insertions(+)
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h index afd8e86bc6af..6501ce1c76fd 100644 --- a/arch/x86/kvm/kvm_emulate.h +++ b/arch/x86/kvm/kvm_emulate.h @@ -25,6 +25,12 @@ struct x86_exception { u16 error_code; bool nested_page_fault; u64 address; /* cr2 or nested page fault gpa */ + /* + * If error_code is a page fault, this will be the address of the last + * visited page table, or the fully translated address if it caused the + * failure. Otherwise, it will not hold a meaningful value. + */ + u64 gpa_page_fault; u8 async_page_fault; unsigned long exit_qualification; #define KVM_X86_UNMAPPED_PTE_GPA BIT(0) diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index f6a78b7cfca1..74651b097fa0 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -485,6 +485,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, walker->fault.vector = PF_VECTOR; walker->fault.error_code_valid = true; walker->fault.error_code = errcode; + walker->fault.gpa_page_fault = real_gpa;
#if PTTYPE == PTTYPE_EPT /*
Introduce a flags parameter to walk_addr_generic(), which is needed for fine-grained control over the accessed/dirty bits. Also forward the parameter to several of the page walker's helper functions, so it can be used from an ioctl.

Setting both PWALK_SET_ACCESSED and PWALK_SET_DIRTY maintains the previous behaviour, that is, both bits are only set after a successful walk and the dirty bit is only set when write access is requested.
No functional change intended.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 arch/x86/include/asm/kvm_host.h | 10 +++++++++-
 arch/x86/kvm/hyperv.c           |  3 ++-
 arch/x86/kvm/mmu.h              |  6 +++---
 arch/x86/kvm/mmu/mmu.c          |  4 ++--
 arch/x86/kvm/mmu/paging_tmpl.h  | 25 ++++++++++++++-----------
 arch/x86/kvm/x86.c              | 33 ++++++++++++++++++++-------------
 6 files changed, 50 insertions(+), 31 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 46e0a466d7fb..3acf0b069693 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -281,6 +281,14 @@ enum x86_intercept_stage; #define PFERR_PRIVATE_ACCESS BIT_ULL(49) #define PFERR_SYNTHETIC_MASK (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
+#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \ + PFERR_WRITE_MASK | \ + PFERR_PRESENT_MASK) + +#define PWALK_SET_ACCESSED BIT(0) +#define PWALK_SET_DIRTY BIT(1) +#define PWALK_SET_ALL (PWALK_SET_ACCESSED | PWALK_SET_DIRTY) + /* apic attention bits */ #define KVM_APIC_CHECK_VAPIC 0 /* @@ -450,7 +458,7 @@ struct kvm_mmu { void (*inject_page_fault)(struct kvm_vcpu *vcpu, struct x86_exception *fault); gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, - gpa_t gva_or_gpa, u64 access, + gpa_t gva_or_gpa, u64 access, u64 flags, struct x86_exception *exception); int (*sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i); diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 4f0a94346d00..b237231ace61 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -2036,7 +2036,8 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc) * read with kvm_read_guest(). */ if (!hc->fast && is_guest_mode(vcpu)) { - hc->ingpa = translate_nested_gpa(vcpu, hc->ingpa, 0, NULL); + hc->ingpa = translate_nested_gpa(vcpu, hc->ingpa, 0, + PWALK_SET_ALL, NULL); if (unlikely(hc->ingpa == INVALID_GPA)) return HV_STATUS_INVALID_HYPERCALL_INPUT; } diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 9dc5dd43ae7f..35030f6466b5 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -275,15 +275,15 @@ static inline void kvm_update_page_stats(struct kvm *kvm, int level, int count) }
gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u64 access, - struct x86_exception *exception); + u64 flags, struct x86_exception *exception);
static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, - gpa_t gpa, u64 access, + gpa_t gpa, u64 access, u64 flags, struct x86_exception *exception) { if (mmu != &vcpu->arch.nested_mmu) return gpa; - return translate_nested_gpa(vcpu, gpa, access, exception); + return translate_nested_gpa(vcpu, gpa, access, flags, exception); } #endif diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 0d94354bb2f8..50c635142bf7 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4102,12 +4102,12 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu) }
static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, - gpa_t vaddr, u64 access, + gpa_t vaddr, u64 access, u64 flags, struct x86_exception *exception) { if (exception) exception->error_code = 0; - return kvm_translate_gpa(vcpu, mmu, vaddr, access, exception); + return kvm_translate_gpa(vcpu, mmu, vaddr, access, flags, exception); }
static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct) diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 74651b097fa0..c278b83b023f 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -301,7 +301,7 @@ static inline bool FNAME(is_last_gpte)(struct kvm_mmu *mmu, */ static int FNAME(walk_addr_generic)(struct guest_walker *walker, struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, - gpa_t addr, u64 access) + gpa_t addr, u64 access, u64 flags) { int ret; pt_element_t pte; @@ -379,7 +379,8 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, walker->pte_gpa[walker->level - 1] = pte_gpa;
real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(table_gfn), - nested_access, &walker->fault); + nested_access, flags, + &walker->fault);
/* * FIXME: This can happen if emulation (for of an INS/OUTS @@ -449,7 +450,8 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, gfn += pse36_gfn_delta(pte); #endif
- real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(gfn), access, &walker->fault); + real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(gfn), access, + flags, &walker->fault); if (real_gpa == INVALID_GPA) return 0;
@@ -467,8 +469,8 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, (PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
if (unlikely(!accessed_dirty)) { - ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, - addr, write_fault); + ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, addr, + write_fault); if (unlikely(ret < 0)) goto error; else if (ret) @@ -527,11 +529,11 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, return 0; }
-static int FNAME(walk_addr)(struct guest_walker *walker, - struct kvm_vcpu *vcpu, gpa_t addr, u64 access) +static int FNAME(walk_addr)(struct guest_walker *walker, struct kvm_vcpu *vcpu, + gpa_t addr, u64 access, u64 flags) { return FNAME(walk_addr_generic)(walker, vcpu, vcpu->arch.mmu, addr, - access); + access, flags); }
static bool @@ -793,7 +795,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault * The bit needs to be cleared before walking guest page tables. */ r = FNAME(walk_addr)(&walker, vcpu, fault->addr, - fault->error_code & ~PFERR_RSVD_MASK); + fault->error_code & ~PFERR_RSVD_MASK, + PWALK_SET_ALL);
/* * The page is not mapped by the guest. Let the guest handle it. @@ -872,7 +875,7 @@ static gpa_t FNAME(get_level1_sp_gpa)(struct kvm_mmu_page *sp)
/* Note, @addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, - gpa_t addr, u64 access, + gpa_t addr, u64 access, u64 flags, struct x86_exception *exception) { struct guest_walker walker; @@ -884,7 +887,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, WARN_ON_ONCE((addr >> 32) && mmu == vcpu->arch.walk_mmu); #endif
- r = FNAME(walk_addr_generic)(&walker, vcpu, mmu, addr, access); + r = FNAME(walk_addr_generic)(&walker, vcpu, mmu, addr, access, flags);
if (r) { gpa = gfn_to_gpa(walker.gfn); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 15080385b8fe..32e81cd502ee 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1067,7 +1067,8 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3) * to an L1 GPA. */ real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(pdpt_gfn), - PFERR_USER_MASK | PFERR_WRITE_MASK, NULL); + PFERR_USER_MASK | PFERR_WRITE_MASK, + PWALK_SET_ALL, NULL); if (real_gpa == INVALID_GPA) return 0;
@@ -7560,7 +7561,7 @@ void kvm_get_segment(struct kvm_vcpu *vcpu, }
gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u64 access, - struct x86_exception *exception) + u64 flags, struct x86_exception *exception) { struct kvm_mmu *mmu = vcpu->arch.mmu; gpa_t t_gpa; @@ -7569,7 +7570,7 @@ gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u64 access,
/* NPT walks are always user-walks */ access |= PFERR_USER_MASK; - t_gpa = mmu->gva_to_gpa(vcpu, mmu, gpa, access, exception); + t_gpa = mmu->gva_to_gpa(vcpu, mmu, gpa, access, flags, exception);
return t_gpa; } @@ -7580,7 +7581,8 @@ gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva, struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
u64 access = (kvm_x86_call(get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0; - return mmu->gva_to_gpa(vcpu, mmu, gva, access, exception); + return mmu->gva_to_gpa(vcpu, mmu, gva, access, PWALK_SET_ALL, + exception); } EXPORT_SYMBOL_GPL(kvm_mmu_gva_to_gpa_read);
@@ -7591,7 +7593,8 @@ gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, gva_t gva,
u64 access = (kvm_x86_call(get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0; access |= PFERR_WRITE_MASK; - return mmu->gva_to_gpa(vcpu, mmu, gva, access, exception); + return mmu->gva_to_gpa(vcpu, mmu, gva, access, PWALK_SET_ALL, + exception); } EXPORT_SYMBOL_GPL(kvm_mmu_gva_to_gpa_write);
@@ -7601,7 +7604,7 @@ gpa_t kvm_mmu_gva_to_gpa_system(struct kvm_vcpu *vcpu, gva_t gva, { struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
- return mmu->gva_to_gpa(vcpu, mmu, gva, 0, exception); + return mmu->gva_to_gpa(vcpu, mmu, gva, 0, PWALK_SET_ALL, exception); }
static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes, @@ -7613,7 +7616,8 @@ static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes, int r = X86EMUL_CONTINUE;
while (bytes) { - gpa_t gpa = mmu->gva_to_gpa(vcpu, mmu, addr, access, exception); + gpa_t gpa = mmu->gva_to_gpa(vcpu, mmu, addr, access, + PWALK_SET_ALL, exception); unsigned offset = addr & (PAGE_SIZE-1); unsigned toread = min(bytes, (unsigned)PAGE_SIZE - offset); int ret; @@ -7647,8 +7651,8 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt, int ret;
/* Inline kvm_read_guest_virt_helper for speed. */ - gpa_t gpa = mmu->gva_to_gpa(vcpu, mmu, addr, access|PFERR_FETCH_MASK, - exception); + gpa_t gpa = mmu->gva_to_gpa(vcpu, mmu, addr, access | PFERR_FETCH_MASK, + PWALK_SET_ALL, exception); if (unlikely(gpa == INVALID_GPA)) return X86EMUL_PROPAGATE_FAULT;
@@ -7705,7 +7709,8 @@ static int kvm_write_guest_virt_helper(gva_t addr, void *val, unsigned int bytes int r = X86EMUL_CONTINUE;
while (bytes) { - gpa_t gpa = mmu->gva_to_gpa(vcpu, mmu, addr, access, exception); + gpa_t gpa = mmu->gva_to_gpa(vcpu, mmu, addr, access, + PWALK_SET_ALL, exception); unsigned offset = addr & (PAGE_SIZE-1); unsigned towrite = min(bytes, (unsigned)PAGE_SIZE - offset); int ret; @@ -7817,14 +7822,15 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, */ if (vcpu_match_mmio_gva(vcpu, gva) && (!is_paging(vcpu) || !permission_fault(vcpu, vcpu->arch.walk_mmu, - vcpu->arch.mmio_access, 0, access))) { + vcpu->arch.mmio_access, + PWALK_SET_ALL, access))) { *gpa = vcpu->arch.mmio_gfn << PAGE_SHIFT | (gva & (PAGE_SIZE - 1)); trace_vcpu_match_mmio(gva, *gpa, write, false); return 1; }
- *gpa = mmu->gva_to_gpa(vcpu, mmu, gva, access, exception); + *gpa = mmu->gva_to_gpa(vcpu, mmu, gva, access, PWALK_SET_ALL, exception);
if (*gpa == INVALID_GPA) return -1; @@ -13644,7 +13650,8 @@ void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_c (PFERR_WRITE_MASK | PFERR_FETCH_MASK | PFERR_USER_MASK);
if (!(error_code & PFERR_PRESENT_MASK) || - mmu->gva_to_gpa(vcpu, mmu, gva, access, &fault) != INVALID_GPA) { + mmu->gva_to_gpa(vcpu, mmu, gva, access, PWALK_SET_ALL, + &fault) != INVALID_GPA) { /* * If vcpu->arch.walk_mmu->gva_to_gpa succeeded, the page * tables probably do not match the TLB. Just proceed
Implement PWALK_SET_ACCESSED in the page walker. This flag allows controlling whether the page walker will set the accessed bits in all page table levels after a successful page walk. If the page walk is aborted for any reason, none of the accessed bits are set.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 arch/x86/kvm/mmu/paging_tmpl.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index c278b83b023f..eed6e2c653ba 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -317,6 +317,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, const int write_fault = access & PFERR_WRITE_MASK; const int user_fault = access & PFERR_USER_MASK; const int fetch_fault = access & PFERR_FETCH_MASK; + const int set_accessed = flags & PWALK_SET_ACCESSED; u16 errcode = 0; gpa_t real_gpa; gfn_t gfn; @@ -468,7 +469,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, accessed_dirty &= pte >> (PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
- if (unlikely(!accessed_dirty)) { + if (unlikely(set_accessed && !accessed_dirty)) { ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, addr, write_fault); if (unlikely(ret < 0))
Implement PWALK_SET_DIRTY in the page walker. This flag allows controlling whether the page walker will set the dirty bit after a successful page walk. If the page walk fails for any reason, the dirty bit is not set.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 arch/x86/kvm/mmu/paging_tmpl.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index eed6e2c653ba..b6897f7fbf52 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -318,6 +318,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, const int user_fault = access & PFERR_USER_MASK; const int fetch_fault = access & PFERR_FETCH_MASK; const int set_accessed = flags & PWALK_SET_ACCESSED; + const int set_dirty = flags & PWALK_SET_DIRTY; u16 errcode = 0; gpa_t real_gpa; gfn_t gfn; @@ -471,7 +472,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
if (unlikely(set_accessed && !accessed_dirty)) { ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, addr, - write_fault); + write_fault && set_dirty); if (unlikely(ret < 0)) goto error; else if (ret)
Implement PWALK_FORCE_SET_ACCESSED in the page walker. This flag forces the page walker to set the accessed flag in all successfully visited page table levels, regardless of the outcome of the page walk.
For example, if the page walk fails on level 2, the accessed bit will still be set on levels 3 and up.
If the nested translation of a GPA fails, the bits will still be set.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu/paging_tmpl.h  | 17 +++++++++++++++--
 2 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3acf0b069693..cd2c391d6a24 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -287,6 +287,7 @@ enum x86_intercept_stage;
#define PWALK_SET_ACCESSED BIT(0) #define PWALK_SET_DIRTY BIT(1) +#define PWALK_FORCE_SET_ACCESSED BIT(2) #define PWALK_SET_ALL (PWALK_SET_ACCESSED | PWALK_SET_DIRTY)
/* apic attention bits */ diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index b6897f7fbf52..2cc40fd17f53 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -319,6 +319,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, const int fetch_fault = access & PFERR_FETCH_MASK; const int set_accessed = flags & PWALK_SET_ACCESSED; const int set_dirty = flags & PWALK_SET_DIRTY; + const int force_set = flags & PWALK_FORCE_SET_ACCESSED; u16 errcode = 0; gpa_t real_gpa; gfn_t gfn; @@ -395,7 +396,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, * fields. */ if (unlikely(real_gpa == INVALID_GPA)) - return 0; + goto late_exit;
slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(real_gpa)); if (!kvm_is_visible_memslot(slot)) { @@ -455,7 +456,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(gfn), access, flags, &walker->fault); if (real_gpa == INVALID_GPA) - return 0; + goto late_exit;
walker->gfn = real_gpa >> PAGE_SHIFT;
@@ -528,6 +529,18 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, walker->fault.async_page_fault = false;
trace_kvm_mmu_walker_error(walker->fault.error_code); + +late_exit: + if (force_set) { + /* + * Don't set the accessed bit for the page table that caused the + * walk to fail. + */ + ++walker->level; + FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, addr, + false); + --walker->level; + } return 0; }
Introduce the status parameter to walk_addr_generic() which is used in later patches to provide the caller with information on whether setting the accessed/dirty bits succeeded.
No functional change intended.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/hyperv.c           |  2 +-
 arch/x86/kvm/mmu.h              |  8 +++++---
 arch/x86/kvm/mmu/mmu.c          |  5 +++--
 arch/x86/kvm/mmu/paging_tmpl.h  | 26 ++++++++++++++++----------
 arch/x86/kvm/x86.c              | 25 ++++++++++++++-----------
 6 files changed, 40 insertions(+), 28 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index cd2c391d6a24..1c5aaf55c683 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -460,7 +460,7 @@ struct kvm_mmu { struct x86_exception *fault); gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, gpa_t gva_or_gpa, u64 access, u64 flags, - struct x86_exception *exception); + struct x86_exception *exception, u16 *status); int (*sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i); struct kvm_mmu_root_info root; diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index b237231ace61..30d5b86bc306 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -2037,7 +2037,7 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc) */ if (!hc->fast && is_guest_mode(vcpu)) { hc->ingpa = translate_nested_gpa(vcpu, hc->ingpa, 0, - PWALK_SET_ALL, NULL); + PWALK_SET_ALL, NULL, NULL); if (unlikely(hc->ingpa == INVALID_GPA)) return HV_STATUS_INVALID_HYPERCALL_INPUT; } diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 35030f6466b5..272ce93f855f 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -275,15 +275,17 @@ static inline void kvm_update_page_stats(struct kvm *kvm, int level, int count) }
gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u64 access, - u64 flags, struct x86_exception *exception); + u64 flags, struct x86_exception *exception, + u16 *status);
static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, gpa_t gpa, u64 access, u64 flags, - struct x86_exception *exception) + struct x86_exception *exception, + u16 *status) { if (mmu != &vcpu->arch.nested_mmu) return gpa; - return translate_nested_gpa(vcpu, gpa, access, flags, exception); + return translate_nested_gpa(vcpu, gpa, access, flags, exception, status); } #endif diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 50c635142bf7..2ab0437edf54 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4103,11 +4103,12 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu)
static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, gpa_t vaddr, u64 access, u64 flags, - struct x86_exception *exception) + struct x86_exception *exception, u16 *status) { if (exception) exception->error_code = 0; - return kvm_translate_gpa(vcpu, mmu, vaddr, access, flags, exception); + return kvm_translate_gpa(vcpu, mmu, vaddr, access, flags, exception, + status); }
static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct) diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 2cc40fd17f53..985a19dda603 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -197,7 +197,8 @@ static inline unsigned FNAME(gpte_access)(u64 gpte) static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, struct guest_walker *walker, - gpa_t addr, int write_fault) + gpa_t addr, int write_fault, + u16 *status) { unsigned level, index; pt_element_t pte, orig_pte; @@ -301,7 +302,8 @@ static inline bool FNAME(is_last_gpte)(struct kvm_mmu *mmu, */ static int FNAME(walk_addr_generic)(struct guest_walker *walker, struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, - gpa_t addr, u64 access, u64 flags) + gpa_t addr, u64 access, u64 flags, + u16 *status) { int ret; pt_element_t pte; @@ -344,6 +346,9 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
walker->fault.flags = 0;
+ if (status) + *status = 0; + /* * FIXME: on Intel processors, loads of the PDPTE registers for PAE paging * by the MOV to CR instruction are treated as reads and do not cause the @@ -383,7 +388,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(table_gfn), nested_access, flags, - &walker->fault); + &walker->fault, status);
/* * FIXME: This can happen if emulation (for of an INS/OUTS @@ -453,8 +458,8 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, gfn += pse36_gfn_delta(pte); #endif
- real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(gfn), access, - flags, &walker->fault); + real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(gfn), access, flags, + &walker->fault, status); if (real_gpa == INVALID_GPA) goto late_exit;
@@ -473,7 +478,8 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
if (unlikely(set_accessed && !accessed_dirty)) { ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, addr, - write_fault && set_dirty); + write_fault && set_dirty, + status); if (unlikely(ret < 0)) goto error; else if (ret) @@ -538,7 +544,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, */ ++walker->level; FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, addr, - false); + false, status); --walker->level; } return 0; @@ -548,7 +554,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker, struct kvm_vcpu *vcpu, gpa_t addr, u64 access, u64 flags) { return FNAME(walk_addr_generic)(walker, vcpu, vcpu->arch.mmu, addr, - access, flags); + access, flags, NULL); }
static bool @@ -891,7 +897,7 @@ static gpa_t FNAME(get_level1_sp_gpa)(struct kvm_mmu_page *sp) /* Note, @addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, gpa_t addr, u64 access, u64 flags, - struct x86_exception *exception) + struct x86_exception *exception, u16 *status) { struct guest_walker walker; gpa_t gpa = INVALID_GPA; @@ -902,7 +908,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, WARN_ON_ONCE((addr >> 32) && mmu == vcpu->arch.walk_mmu); #endif
- r = FNAME(walk_addr_generic)(&walker, vcpu, mmu, addr, access, flags); + r = FNAME(walk_addr_generic)(&walker, vcpu, mmu, addr, access, flags, status);
if (r) { gpa = gfn_to_gpa(walker.gfn); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 32e81cd502ee..be696b60aba6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1068,7 +1068,7 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3) */ real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(pdpt_gfn), PFERR_USER_MASK | PFERR_WRITE_MASK, - PWALK_SET_ALL, NULL); + PWALK_SET_ALL, NULL, NULL); if (real_gpa == INVALID_GPA) return 0;
@@ -7561,7 +7561,8 @@ void kvm_get_segment(struct kvm_vcpu *vcpu, }
gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u64 access, - u64 flags, struct x86_exception *exception) + u64 flags, struct x86_exception *exception, + u16 *status) { struct kvm_mmu *mmu = vcpu->arch.mmu; gpa_t t_gpa; @@ -7570,7 +7571,8 @@ gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u64 access,
/* NPT walks are always user-walks */ access |= PFERR_USER_MASK; - t_gpa = mmu->gva_to_gpa(vcpu, mmu, gpa, access, flags, exception); + t_gpa = mmu->gva_to_gpa(vcpu, mmu, gpa, access, flags, exception, + status);
return t_gpa; } @@ -7582,7 +7584,7 @@ gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva,
u64 access = (kvm_x86_call(get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0; return mmu->gva_to_gpa(vcpu, mmu, gva, access, PWALK_SET_ALL, - exception); + exception, NULL); } EXPORT_SYMBOL_GPL(kvm_mmu_gva_to_gpa_read);
@@ -7594,7 +7596,7 @@ gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, gva_t gva, u64 access = (kvm_x86_call(get_cpl)(vcpu) == 3) ? PFERR_USER_MASK : 0; access |= PFERR_WRITE_MASK; return mmu->gva_to_gpa(vcpu, mmu, gva, access, PWALK_SET_ALL, - exception); + exception, NULL); } EXPORT_SYMBOL_GPL(kvm_mmu_gva_to_gpa_write);
@@ -7604,7 +7606,7 @@ gpa_t kvm_mmu_gva_to_gpa_system(struct kvm_vcpu *vcpu, gva_t gva, { struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
- return mmu->gva_to_gpa(vcpu, mmu, gva, 0, PWALK_SET_ALL, exception); + return mmu->gva_to_gpa(vcpu, mmu, gva, 0, PWALK_SET_ALL, exception, NULL); }
static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes, @@ -7617,7 +7619,7 @@ static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes,
while (bytes) { gpa_t gpa = mmu->gva_to_gpa(vcpu, mmu, addr, access, - PWALK_SET_ALL, exception); + PWALK_SET_ALL, exception, NULL); unsigned offset = addr & (PAGE_SIZE-1); unsigned toread = min(bytes, (unsigned)PAGE_SIZE - offset); int ret; @@ -7652,7 +7654,8 @@ static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
/* Inline kvm_read_guest_virt_helper for speed. */ gpa_t gpa = mmu->gva_to_gpa(vcpu, mmu, addr, access | PFERR_FETCH_MASK, - PWALK_SET_ALL, exception); + PWALK_SET_ALL, + exception, NULL); if (unlikely(gpa == INVALID_GPA)) return X86EMUL_PROPAGATE_FAULT;
@@ -7710,7 +7713,7 @@ static int kvm_write_guest_virt_helper(gva_t addr, void *val, unsigned int bytes
while (bytes) { gpa_t gpa = mmu->gva_to_gpa(vcpu, mmu, addr, access, - PWALK_SET_ALL, exception); + PWALK_SET_ALL, exception, NULL); unsigned offset = addr & (PAGE_SIZE-1); unsigned towrite = min(bytes, (unsigned)PAGE_SIZE - offset); int ret; @@ -7830,7 +7833,7 @@ static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva, return 1; }
- *gpa = mmu->gva_to_gpa(vcpu, mmu, gva, access, PWALK_SET_ALL, exception); + *gpa = mmu->gva_to_gpa(vcpu, mmu, gva, access, PWALK_SET_ALL, exception, NULL);
if (*gpa == INVALID_GPA) return -1; @@ -13651,7 +13654,7 @@ void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_c
if (!(error_code & PFERR_PRESENT_MASK) || mmu->gva_to_gpa(vcpu, mmu, gva, access, PWALK_SET_ALL, - &fault) != INVALID_GPA) { + &fault, NULL) != INVALID_GPA) { /* * If vcpu->arch.walk_mmu->gva_to_gpa succeeded, the page * tables probably do not match the TLB. Just proceed
Implement PWALK_STATUS_READ_ONLY_PTE_GPA in the page walker. This status flag is set when setting an accessed or dirty bit fails because the memory backing the page table entry is marked read-only.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/mmu/paging_tmpl.h  | 5 ++++-
 2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1c5aaf55c683..7ac1956f6f9b 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -290,6 +290,8 @@ enum x86_intercept_stage; #define PWALK_FORCE_SET_ACCESSED BIT(2) #define PWALK_SET_ALL (PWALK_SET_ACCESSED | PWALK_SET_DIRTY)
+#define PWALK_STATUS_READ_ONLY_PTE_GPA BIT(0) + /* apic attention bits */ #define KVM_APIC_CHECK_VAPIC 0 /* diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 985a19dda603..0eefa48e0e7f 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -244,8 +244,11 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu, * overwrite the read-only memory to set the accessed and dirty * bits. */ - if (unlikely(!walker->pte_writable[level - 1])) + if (unlikely(!walker->pte_writable[level - 1])) { + if (status) + *status |= PWALK_STATUS_READ_ONLY_PTE_GPA; continue; + }
ret = __try_cmpxchg_user(ptep_user, &orig_pte, pte, fault); if (ret)
Introduce a function to translate GVAs to GPAs which gives the caller control over the access mode and the page walker's set-bit flags, and lets it receive status codes.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/x86.c              | 11 +++++++++++
 2 files changed, 14 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 7ac1956f6f9b..ae05e917d7ea 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -2160,6 +2160,9 @@ static inline bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu, ulong roots_to_free); void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu); +gpa_t kvm_mmu_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t gva, u64 access, + u64 flags, struct x86_exception *exception, + u16 *status); gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva, struct x86_exception *exception); gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, gva_t gva, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index be696b60aba6..27fc71aaa1e4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7577,6 +7577,17 @@ gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u64 access, return t_gpa; }
+gpa_t kvm_mmu_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t gva, u64 access, + u64 flags, struct x86_exception *exception, + u16 *status) +{ + struct kvm_mmu *mmu = vcpu->arch.walk_mmu; + + return mmu->gva_to_gpa(vcpu, mmu, gva, access, flags, exception, + status); +} +EXPORT_SYMBOL_GPL(kvm_mmu_gva_to_gpa); + gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva, struct x86_exception *exception) {
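As a rough usage sketch (illustrative only, not part of the series), an in-kernel caller that wants to probe a user-mode write and force the accessed bits on all successfully walked levels could use the new helper like this; the wrapper name is made up for the example:

    static gpa_t probe_user_write(struct kvm_vcpu *vcpu, gva_t gva, u16 *status)
    {
            struct x86_exception ex = { };

            /*
             * Returns INVALID_GPA on failure; ex.error_code and
             * ex.gpa_page_fault then describe the failing level, and
             * *status may carry PWALK_STATUS_READ_ONLY_PTE_GPA if the
             * accessed/dirty bits could not be written back.
             */
            return kvm_mmu_gva_to_gpa(vcpu, gva,
                                      PFERR_WRITE_MASK | PFERR_USER_MASK,
                                      PWALK_SET_ALL | PWALK_FORCE_SET_ACCESSED,
                                      &ex, status);
    }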
Introduce a new ioctl that extends the functionality of KVM_TRANSLATE. It allows the caller to specify an access mode that must be upheld throughout the entire page walk. Additionally, it provides control over whether the accessed/dirty bits in the page table should be set at all, and whether they should be set if the walk fails. Lastly, if the page walk fails, it returns the exact error code which caused the failure.
KVM_TRANSLATE lacks information about the executability of the translated page and doesn't provide control over the accessed/dirty page table bits at all. Because it lacks any sort of input flags, it cannot simply be extended without breaking backwards compatibility. Additionally, in the x86 implementation, the 'writable' and 'usermode' fields are currently hardcoded to 1 and 0 respectively, which is behaviour that userspace might rely on.
The ioctl will be implemented for x86 in the following commits.
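For reference, the existing struct kvm_translation in include/uapi/linux/kvm.h has no input flags field at all, and 'writable'/'usermode' are pure outputs, which illustrates why it cannot be extended compatibly:

    struct kvm_translation {
            /* in */
            __u64 linear_address;

            /* out */
            __u64 physical_address;
            __u8  valid;
            __u8  writable;
            __u8  usermode;
            __u8  pad[5];
    };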
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 include/linux/kvm_host.h |  4 ++++
 include/uapi/linux/kvm.h | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index b23c6d48392f..c78017fd2907 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -84,6 +84,10 @@ #define KVM_MAX_NR_ADDRESS_SPACES 1 #endif
+#define KVM_TRANSLATE_FLAGS_ALL \ + (KVM_TRANSLATE_FLAGS_SET_ACCESSED | \ + KVM_TRANSLATE_FLAGS_SET_DIRTY | \ + KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED) /* * For the normal pfn, the highest 12 bits should be zero, * so we can mask bit 62 ~ bit 52 to indicate the error pfn, diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 637efc055145..602323e734cc 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -512,6 +512,37 @@ struct kvm_translation { __u8 pad[5]; };
+/* for KVM_TRANSLATE2 */ +struct kvm_translation2 { + /* in */ + __u64 linear_address; +#define KVM_TRANSLATE_FLAGS_SET_ACCESSED (1 << 0) +#define KVM_TRANSLATE_FLAGS_SET_DIRTY (1 << 1) +#define KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED (1 << 2) + __u16 flags; +#define KVM_TRANSLATE_ACCESS_WRITE (1 << 0) +#define KVM_TRANSLATE_ACCESS_USER (1 << 1) +#define KVM_TRANSLATE_ACCESS_EXEC (1 << 2) +#define KVM_TRANSLATE_ACCESS_ALL \ + (KVM_TRANSLATE_ACCESS_WRITE | \ + KVM_TRANSLATE_ACCESS_USER | \ + KVM_TRANSLATE_ACCESS_EXEC) + __u16 access; + __u8 padding[4]; + + /* out */ + __u64 physical_address; + __u8 valid; +#define KVM_TRANSLATE_FAULT_NOT_PRESENT 1 +#define KVM_TRANSLATE_FAULT_PRIVILEGE_VIOLATION 2 +#define KVM_TRANSLATE_FAULT_RESERVED_BITS 3 +#define KVM_TRANSLATE_FAULT_INVALID_GVA 4 +#define KVM_TRANSLATE_FAULT_INVALID_GPA 5 + __u16 error_code; + __u8 set_bits_succeeded; + __u8 padding2[4]; +}; + /* for KVM_INTERRUPT */ struct kvm_interrupt { /* in */ @@ -933,6 +964,7 @@ struct kvm_enable_cap { #define KVM_CAP_PRE_FAULT_MEMORY 236 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237 #define KVM_CAP_X86_GUEST_MODE 238 +#define KVM_CAP_TRANSLATE2 239
struct kvm_irq_routing_irqchip { __u32 irqchip; @@ -1269,6 +1301,7 @@ struct kvm_vfio_spapr_tce { #define KVM_SET_SREGS _IOW(KVMIO, 0x84, struct kvm_sregs) #define KVM_TRANSLATE _IOWR(KVMIO, 0x85, struct kvm_translation) #define KVM_INTERRUPT _IOW(KVMIO, 0x86, struct kvm_interrupt) +#define KVM_TRANSLATE2 _IOWR(KVMIO, 0x87, struct kvm_translation2) #define KVM_GET_MSRS _IOWR(KVMIO, 0x88, struct kvm_msrs) #define KVM_SET_MSRS _IOW(KVMIO, 0x89, struct kvm_msrs) #define KVM_SET_CPUID _IOW(KVMIO, 0x8a, struct kvm_cpuid)
Add a stub function for the KVM_TRANSLATE2 ioctl, as well as generic parameter verification. In a later commit, the ioctl will be properly implemented for x86.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 41 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c78017fd2907..de6557794735 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1492,6 +1492,8 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu);
int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, struct kvm_translation *tr); +int kvm_arch_vcpu_ioctl_translate2(struct kvm_vcpu *vcpu, + struct kvm_translation2 *tr);
int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs); int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index d51357fd28d7..c129dc0b0485 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -4442,6 +4442,32 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu, } #endif
+int __weak kvm_arch_vcpu_ioctl_translate2(struct kvm_vcpu *vcpu, + struct kvm_translation2 *tr) +{ + return -EINVAL; +} + +static int kvm_vcpu_ioctl_translate2(struct kvm_vcpu *vcpu, + struct kvm_translation2 *tr) +{ + /* Don't allow FORCE_SET_ACCESSED and SET_BITS without SET_ACCESSED */ + if (!(tr->flags & KVM_TRANSLATE_FLAGS_SET_ACCESSED) && + (tr->flags & KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED || + tr->flags & KVM_TRANSLATE_FLAGS_SET_DIRTY)) + return -EINVAL; + + if (tr->flags & KVM_TRANSLATE_FLAGS_SET_DIRTY && + !(tr->access & KVM_TRANSLATE_ACCESS_WRITE)) + return -EINVAL; + + if (tr->flags & ~KVM_TRANSLATE_FLAGS_ALL || + tr->access & ~KVM_TRANSLATE_ACCESS_ALL) + return -EINVAL; + + return kvm_arch_vcpu_ioctl_translate2(vcpu, tr); +} + static long kvm_vcpu_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -4585,6 +4611,21 @@ static long kvm_vcpu_ioctl(struct file *filp, r = 0; break; } + case KVM_TRANSLATE2: { + struct kvm_translation2 tr; + + r = -EFAULT; + if (copy_from_user(&tr, argp, sizeof(tr))) + goto out; + r = kvm_vcpu_ioctl_translate2(vcpu, &tr); + if (r) + goto out; + r = -EFAULT; + if (copy_to_user(argp, &tr, sizeof(tr))) + goto out; + r = 0; + break; + } case KVM_SET_GUEST_DEBUG: { struct kvm_guest_debug dbg;
Implement KVM_TRANSLATE2 for x86 using the default KVM page walker.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 arch/x86/kvm/x86.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 27fc71aaa1e4..3bcbad958324 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4683,6 +4683,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_IRQFD_RESAMPLE: case KVM_CAP_MEMORY_FAULT_INFO: case KVM_CAP_X86_GUEST_MODE: + case KVM_CAP_TRANSLATE2: r = 1; break; case KVM_CAP_PRE_FAULT_MEMORY: @@ -12156,6 +12157,81 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, return 0; }
+/* + * Translate a guest virtual address to a guest physical address. + */ +int kvm_arch_vcpu_ioctl_translate2(struct kvm_vcpu *vcpu, + struct kvm_translation2 *tr) +{ + int idx, set_bit_mode = 0, access = 0; + struct x86_exception exception = { }; + gva_t vaddr = tr->linear_address; + u16 status = 0; + gpa_t gpa; + + if (tr->flags & KVM_TRANSLATE_FLAGS_SET_ACCESSED) + set_bit_mode |= PWALK_SET_ACCESSED; + if (tr->flags & KVM_TRANSLATE_FLAGS_SET_DIRTY) + set_bit_mode |= PWALK_SET_DIRTY; + if (tr->flags & KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED) + set_bit_mode |= PWALK_FORCE_SET_ACCESSED; + + if (tr->access & KVM_TRANSLATE_ACCESS_WRITE) + access |= PFERR_WRITE_MASK; + if (tr->access & KVM_TRANSLATE_ACCESS_USER) + access |= PFERR_USER_MASK; + if (tr->access & KVM_TRANSLATE_ACCESS_EXEC) + access |= PFERR_FETCH_MASK; + + vcpu_load(vcpu); + + idx = srcu_read_lock(&vcpu->kvm->srcu); + + /* Even with PAE virtual addresses are still 32-bit */ + if (is_64_bit_mode(vcpu) ? is_noncanonical_address(vaddr, vcpu) : + tr->linear_address >> 32) { + tr->valid = false; + tr->error_code = KVM_TRANSLATE_FAULT_INVALID_GVA; + goto exit; + } + + gpa = kvm_mmu_gva_to_gpa(vcpu, vaddr, access, set_bit_mode, &exception, + &status); + + tr->physical_address = exception.error_code_valid ? exception.gpa_page_fault : gpa; + tr->valid = !exception.error_code_valid; + + /* + * Order is important here: + * - If there are access restrictions those will always be set in the + * error_code + * - If a PTE GPA is unmapped, the present bit in error_code may not + * have been set already + */ + if (exception.flags & KVM_X86_UNMAPPED_PTE_GPA) + tr->error_code = KVM_TRANSLATE_FAULT_INVALID_GPA; + else if (!(exception.error_code & PFERR_PRESENT_MASK)) + tr->error_code = KVM_TRANSLATE_FAULT_NOT_PRESENT; + else if (exception.error_code & PFERR_RSVD_MASK) + tr->error_code = KVM_TRANSLATE_FAULT_RESERVED_BITS; + else if (exception.error_code & (PFERR_USER_MASK | PFERR_WRITE_MASK | + PFERR_FETCH_MASK)) + tr->error_code = KVM_TRANSLATE_FAULT_PRIVILEGE_VIOLATION; + + /* + * exceptions.flags and thus tr->set_bits_succeeded have meaning + * regardless of the success of the page walk. + */ + tr->set_bits_succeeded = tr->flags && + !(status & PWALK_STATUS_READ_ONLY_PTE_GPA); + +exit: + srcu_read_unlock(&vcpu->kvm->srcu, idx); + + vcpu_put(vcpu); + return 0; +} + int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) { struct fxregs_state *fxsave;
On Tue, Sep 10, 2024, Nikolas Wipper wrote:
> +int kvm_arch_vcpu_ioctl_translate2(struct kvm_vcpu *vcpu,
> +				   struct kvm_translation2 *tr)
> +{
> +	int idx, set_bit_mode = 0, access = 0;
> +	struct x86_exception exception = { };
> +	gva_t vaddr = tr->linear_address;
> +	u16 status = 0;
> +	gpa_t gpa;
> +
> +	if (tr->flags & KVM_TRANSLATE_FLAGS_SET_ACCESSED)
> +		set_bit_mode |= PWALK_SET_ACCESSED;
> +	if (tr->flags & KVM_TRANSLATE_FLAGS_SET_DIRTY)
> +		set_bit_mode |= PWALK_SET_DIRTY;
> +	if (tr->flags & KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED)
> +		set_bit_mode |= PWALK_FORCE_SET_ACCESSED;
> +
> +	if (tr->access & KVM_TRANSLATE_ACCESS_WRITE)
> +		access |= PFERR_WRITE_MASK;
> +	if (tr->access & KVM_TRANSLATE_ACCESS_USER)
> +		access |= PFERR_USER_MASK;
> +	if (tr->access & KVM_TRANSLATE_ACCESS_EXEC)
> +		access |= PFERR_FETCH_MASK;
WRITE and FETCH accesses need to be mutually exclusive.
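One possible way to address this, assuming the check belongs in the generic validation in kvm_vcpu_ioctl_translate2() (a sketch only, not a posted fix), would be:

    /* Reject impossible combinations: a fetch cannot also be a write. */
    if ((tr->access & KVM_TRANSLATE_ACCESS_WRITE) &&
        (tr->access & KVM_TRANSLATE_ACCESS_EXEC))
            return -EINVAL;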
Add a selftest for KVM_TRANSLATE2. There are four subtests: a basic translation test that checks whether access permissions are handled correctly, a set-bits test that checks whether the accessed and dirty bits are set correctly, an errors test that covers the negative cases of the flags, and a fuzz test on random garbage guest page tables.
The tests currently use x86-specific paging code, so generalising them for other platforms is hard. Once other architectures implement KVM_TRANSLATE2, they will need to be split into arch-specific and arch-agnostic parts.
Signed-off-by: Nikolas Wipper <nikwip@amazon.de>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/x86_64/kvm_translate2.c     | 310 ++++++++++++++++++
 2 files changed, 311 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_translate2.c
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 45cb70c048bb..5bb2db679658 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -81,6 +81,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/hyperv_svm_test
 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_tlb_flush
 TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_test
 TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_test
+TEST_GEN_PROGS_x86_64 += x86_64/kvm_translate2
 TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test
 TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
diff --git a/tools/testing/selftests/kvm/x86_64/kvm_translate2.c b/tools/testing/selftests/kvm/x86_64/kvm_translate2.c
new file mode 100644
index 000000000000..607af6376243
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/kvm_translate2.c
@@ -0,0 +1,310 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test for x86 KVM_TRANSLATE2
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <linux/bitmap.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+
+#define CHECK_ACCESSED_BIT(pte, set, start)                             \
+       ({                                                               \
+               for (int _i = start; _i <= PG_LEVEL_512G; _i++) {        \
+                       if (set)                                         \
+                               TEST_ASSERT(                             \
+                                       (*pte[_i] & PTE_ACCESSED_MASK) != 0, \
+                                       "Page not marked accessed on level %i", \
+                                       _i);                             \
+                       else                                             \
+                               TEST_ASSERT(                             \
+                                       (*pte[_i] & PTE_ACCESSED_MASK) == 0, \
+                                       "Page marked accessed on level %i", \
+                                       _i);                             \
+               }                                                        \
+       })
+
+#define CHECK_DIRTY_BIT(pte, set)                                       \
+       ({                                                               \
+               if (set)                                                 \
+                       TEST_ASSERT((*pte[PG_LEVEL_4K] & PTE_DIRTY_MASK) != 0, \
+                                   "Page not marked dirty");            \
+               else                                                     \
+                       TEST_ASSERT((*pte[PG_LEVEL_4K] & PTE_DIRTY_MASK) == 0, \
+                                   "Page marked dirty");                \
+       })
+
+enum point_of_failure {
+       pof_none,
+       pof_ioctl,
+       pof_page_walk,
+       pof_no_failure,
+};
+
+struct kvm_translation2 kvm_translate2(struct kvm_vcpu *vcpu, uint64_t vaddr,
+                                      int flags, int access,
+                                      enum point_of_failure pof)
+{
+       struct kvm_translation2 tr = { .linear_address = vaddr,
+                                      .flags = flags,
+                                      .access = access };
+
+       int res = ioctl(vcpu->fd, KVM_TRANSLATE2, &tr);
+
+       if (pof == pof_none)
+               return tr;
+
+       if (pof == pof_ioctl) {
+               TEST_ASSERT(res == -1, "ioctl didn't fail");
+               return tr;
+       }
+
+       TEST_ASSERT(res != -1, "ioctl failed");
+       TEST_ASSERT((pof != pof_page_walk) == tr.valid,
+                   "Page walk fail with code %u", tr.error_code);
+
+       return tr;
+}
+
+void test_translate(struct kvm_vm *vm, struct kvm_vcpu *vcpu, int index,
+                   uint64_t *pte[PG_LEVEL_NUM], vm_vaddr_t vaddr)
+{
+       struct kvm_translation2 translation;
+       int access = index;
+
+       printf("%s - write: %u, user: %u, exec: %u ...\t",
+              __func__,
+              (access & KVM_TRANSLATE_ACCESS_WRITE) >> 0,
+              (access & KVM_TRANSLATE_ACCESS_USER) >> 1,
+              (access & KVM_TRANSLATE_ACCESS_EXEC) >> 2);
+
+       uint64_t mask = PTE_WRITABLE_MASK | PTE_USER_MASK | PTE_NX_MASK;
+       uint64_t new_value = 0;
+
+       if (access & KVM_TRANSLATE_ACCESS_WRITE)
+               new_value |= PTE_WRITABLE_MASK;
+       if (access & KVM_TRANSLATE_ACCESS_USER)
+               new_value |= PTE_USER_MASK;
+       if (!(access & KVM_TRANSLATE_ACCESS_EXEC))
+               new_value |= PTE_NX_MASK;
+
+       for (int i = PG_LEVEL_4K; i <= PG_LEVEL_512G; i++)
+               *pte[i] = (*pte[i] & ~mask) | new_value;
+
+       translation = kvm_translate2(vcpu, vaddr, 0, access, pof_no_failure);
+
+       TEST_ASSERT_EQ(*pte[PG_LEVEL_4K] & GENMASK(51, 12),
+                      translation.physical_address);
+
+       /* Check configurations that have extra access requirements */
+       for (int i = 0; i < 8; i++) {
+               int case_access = i;
+
+               if ((case_access | access) <= access)
+                       continue;
+
+               translation = kvm_translate2(vcpu, vaddr, 0, case_access,
+                                            pof_page_walk);
+               TEST_ASSERT_EQ(translation.error_code,
+                              KVM_TRANSLATE_FAULT_PRIVILEGE_VIOLATION);
+       }
+
+       /* Clear accessed bits */
+       for (int i = PG_LEVEL_4K; i <= PG_LEVEL_512G; i++)
+               *pte[i] &= ~PTE_ACCESSED_MASK;
+
+       printf("[ok]\n");
+}
+
+void test_set_bits(struct kvm_vm *vm, struct kvm_vcpu *vcpu,
+                  uint64_t *pte[PG_LEVEL_NUM], vm_vaddr_t vaddr)
+{
+       printf("%s ...\t", __func__);
+
+       /* Sanity checks */
+       CHECK_ACCESSED_BIT(pte, false, PG_LEVEL_4K);
+       CHECK_DIRTY_BIT(pte, false);
+
+       kvm_translate2(vcpu, vaddr, 0, 0, pof_no_failure);
+
+       CHECK_ACCESSED_BIT(pte, false, PG_LEVEL_4K);
+       CHECK_DIRTY_BIT(pte, false);
+
+       kvm_translate2(vcpu, vaddr, KVM_TRANSLATE_FLAGS_SET_ACCESSED, 0,
+                      pof_no_failure);
+
+       CHECK_ACCESSED_BIT(pte, true, PG_LEVEL_4K);
+       CHECK_DIRTY_BIT(pte, false);
+
+       kvm_translate2(vcpu, vaddr,
+                      KVM_TRANSLATE_FLAGS_SET_ACCESSED | KVM_TRANSLATE_FLAGS_SET_DIRTY,
+                      KVM_TRANSLATE_ACCESS_WRITE, pof_no_failure);
+
+       CHECK_ACCESSED_BIT(pte, true, PG_LEVEL_4K);
+       CHECK_DIRTY_BIT(pte, true);
+
+       printf("[ok]\n");
+}
+
+void test_errors(struct kvm_vm *vm, struct kvm_vcpu *vcpu,
+                uint64_t *pte[PG_LEVEL_NUM], vm_vaddr_t vaddr)
+{
+       struct kvm_translation2 tr;
+
+       printf("%s ...\t", __func__);
+
+       /* Set an unsupported access bit */
+       kvm_translate2(vcpu, vaddr, 0, (1 << 3), pof_ioctl);
+       kvm_translate2(vcpu, vaddr, KVM_TRANSLATE_FLAGS_SET_DIRTY, 0, pof_ioctl);
+       kvm_translate2(vcpu, vaddr, KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED, 0,
+                      pof_ioctl);
+
+       /* Try to translate a non-canonical address */
+       tr = kvm_translate2(vcpu, 0b101ull << 60, 0, 0, pof_page_walk);
+       TEST_ASSERT_EQ(tr.error_code, KVM_TRANSLATE_FAULT_INVALID_GVA);
+
+       uint64_t old_pte = *pte[PG_LEVEL_2M];
+
+       *pte[PG_LEVEL_2M] |= (1ull << 51); /* Set a reserved bit */
+
+       tr = kvm_translate2(vcpu, vaddr, 0, 0, pof_page_walk);
+       TEST_ASSERT_EQ(tr.error_code, KVM_TRANSLATE_FAULT_RESERVED_BITS);
+
+       *pte[PG_LEVEL_2M] &= ~(1ull << 51);
+
+       /* Create a GPA that's definitely not mapped */
+       *pte[PG_LEVEL_2M] |= GENMASK(35, 13);
+
+       tr = kvm_translate2(vcpu, vaddr, 0, 0, pof_page_walk);
+       TEST_ASSERT_EQ(tr.error_code, KVM_TRANSLATE_FAULT_INVALID_GPA);
+
+       *pte[PG_LEVEL_2M] = old_pte;
+
+       /* Clear accessed bits */
+       for (int i = PG_LEVEL_4K; i <= PG_LEVEL_512G; i++)
+               *pte[i] &= ~PTE_ACCESSED_MASK;
+
+       /* Try translating a non-present page */
+       *pte[PG_LEVEL_4K] &= ~PTE_PRESENT_MASK;
+
+       tr = kvm_translate2(
+               vcpu, vaddr,
+               KVM_TRANSLATE_FLAGS_SET_ACCESSED |
+                       KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED, 0,
+               pof_page_walk);
+       TEST_ASSERT_EQ(tr.error_code, KVM_TRANSLATE_FAULT_NOT_PRESENT);
+       CHECK_ACCESSED_BIT(pte, true, PG_LEVEL_2M);
+
+       *pte[PG_LEVEL_4K] |= PTE_PRESENT_MASK;
+
+       /*
+        * Try setting accessed/dirty bits on a PTE that is in read-only memory
+        */
+       vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0x80000000, 1, 4,
+                                   KVM_MEM_READONLY);
+
+       uint64_t *addr = addr_gpa2hva(vm, 0x80000000);
+       uint64_t *base = addr_gpa2hva(vm, *pte[PG_LEVEL_2M] & GENMASK(51, 12));
+
+       /* Copy the entire page table */
+       for (int i = 0; i < 0x200; i += 1)
+               addr[i] = (base[i] & ~PTE_ACCESSED_MASK) | PTE_PRESENT_MASK;
+
+       uint64_t old_2m = *pte[PG_LEVEL_2M];
+       *pte[PG_LEVEL_2M] &= ~GENMASK(51, 12);
+       *pte[PG_LEVEL_2M] |= 0x80000000;
+
+       tr = kvm_translate2(vcpu, vaddr,
+                           KVM_TRANSLATE_FLAGS_SET_ACCESSED |
+                           KVM_TRANSLATE_FLAGS_SET_DIRTY |
+                           KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED,
+                           KVM_TRANSLATE_ACCESS_WRITE, pof_no_failure);
+
+       TEST_ASSERT(!tr.set_bits_succeeded, "Page not read-only");
+
+       *pte[PG_LEVEL_2M] = old_2m;
+
+       printf("[ok]\n");
+}
+
+/* Test page walker stability, by trying to translate with garbage PTEs */
+void test_fuzz(struct kvm_vm *vm, struct kvm_vcpu *vcpu,
+              uint64_t *pte[PG_LEVEL_NUM], vm_vaddr_t vaddr)
+{
+       printf("%s ...\t", __func__);
+
+       /* Test gPTEs that point to random addresses */
+       for (int level = PG_LEVEL_4K; level < PG_LEVEL_NUM; level++) {
+               for (int i = 0; i < 10000; i++) {
+                       uint64_t random_address = random() % GENMASK(29, 0) << 12;
+                       *pte[level] = (*pte[level] & ~GENMASK(51, 12)) | random_address;
+
+                       kvm_translate2(vcpu, vaddr,
+                                      KVM_TRANSLATE_FLAGS_SET_ACCESSED |
+                                      KVM_TRANSLATE_FLAGS_SET_DIRTY |
+                                      KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED,
+                                      0, pof_none);
+               }
+       }
+
+       /* Test gPTEs with completely random values */
+       for (int level = PG_LEVEL_4K; level < PG_LEVEL_NUM; level++) {
+               for (int i = 0; i < 10000; i++) {
+                       *pte[level] = random();
+
+                       kvm_translate2(vcpu, vaddr,
+                                      KVM_TRANSLATE_FLAGS_SET_ACCESSED |
+                                      KVM_TRANSLATE_FLAGS_SET_DIRTY |
+                                      KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED,
+                                      0, pof_none);
+               }
+       }
+
+       printf("[ok]\n");
+}
+
+int main(int argc, char *argv[])
+{
+       uint64_t *pte[PG_LEVEL_NUM];
+       struct kvm_vcpu *vcpu;
+       struct kvm_sregs regs;
+       struct kvm_vm *vm;
+       vm_vaddr_t vaddr;
+       int page_level;
+
+       TEST_REQUIRE(kvm_has_cap(KVM_CAP_TRANSLATE2));
+
+       vm = vm_create_with_one_vcpu(&vcpu, NULL);
+
+       vaddr = __vm_vaddr_alloc_page(vm, MEM_REGION_TEST_DATA);
+
+       for (page_level = PG_LEVEL_512G; page_level > PG_LEVEL_NONE;
+            page_level--) {
+               pte[page_level] = __vm_get_page_table_entry(vm, vaddr, &page_level);
+       }
+
+       /* Enable WP bit in cr0, so kernel accesses uphold write protection */
+       vcpu_ioctl(vcpu, KVM_GET_SREGS, &regs);
+       regs.cr0 |= 1 << 16;
+       vcpu_ioctl(vcpu, KVM_SET_SREGS, &regs);
+
+       for (int index = 0; index < 8; index++)
+               test_translate(vm, vcpu, index, pte, vaddr);
+
+       test_set_bits(vm, vcpu, pte, vaddr);
+       test_errors(vm, vcpu, pte, vaddr);
+       test_fuzz(vm, vcpu, pte, vaddr);
+
+       kvm_vm_free(vm);
+
+       return 0;
+}
I saw this on another series[*]:
if KVM_TRANSLATE2 lands (though I'm somewhat curious as to why QEMU doesn't do the page walks itself).
The simple reason for keeping this functionality in KVM is that it already has a mature, production-level page walker (which is already exposed through KVM_TRANSLATE); building something comparable in QEMU would take considerably longer and would be harder to maintain than adding an API that leverages the existing walker.
[*] https://lore.kernel.org/lkml/ZvJseVoT7gN_GBG3@google.com/T/#mb0b23a1f5023192...
PS: this is also a gentle ping for review, in case the series got lost in between conferences.
On Tue, Sep 10, 2024, Nikolas Wipper wrote:
This series introduces a new ioctl KVM_TRANSLATE2, which expands on KVM_TRANSLATE. It is required to implement Hyper-V's HvTranslateVirtualAddress hyper-call as part of the ongoing effort to emulate HyperV's Virtual Secure Mode (VSM) within KVM and QEMU. The hyper- call requires several new KVM APIs, one of which is KVM_TRANSLATE2, which implements the core functionality of the hyper-call. The rest of the required functionality will be implemented in subsequent series.
Other than translating guest virtual addresses, the ioctl allows the caller to control whether the access and dirty bits are set during the page walk. It also allows specifying an access mode instead of returning viable access modes, which enables setting the bits up to the level that caused a failure. Additionally, the ioctl provides more information about why the page walk failed, and which page table is responsible. This functionality is not available within KVM_TRANSLATE, and can't be added without breaking backwards compatiblity, thus a new ioctl is required.
...
Documentation/virt/kvm/api.rst | 131 ++++++++ arch/x86/include/asm/kvm_host.h | 18 +- arch/x86/kvm/hyperv.c | 3 +- arch/x86/kvm/kvm_emulate.h | 8 + arch/x86/kvm/mmu.h | 10 +- arch/x86/kvm/mmu/mmu.c | 7 +- arch/x86/kvm/mmu/paging_tmpl.h | 80 +++-- arch/x86/kvm/x86.c | 123 ++++++- include/linux/kvm_host.h | 6 + include/uapi/linux/kvm.h | 33 ++ tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/x86_64/kvm_translate2.c | 310 ++++++++++++++++++ virt/kvm/kvm_main.c | 41 +++ 13 files changed, 724 insertions(+), 47 deletions(-) create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_translate2.c
...
The simple reason for keeping this functionality in KVM is that it already has a mature, production-level page walker (which is already exposed through KVM_TRANSLATE); building something comparable in QEMU would take considerably longer and would be harder to maintain than adding an API that leverages the existing walker.
I'm not convinced that implementing targeted support in QEMU (or any other VMM) would be at all challenging or a burden to maintain. I do think duplicating functionality across multiple VMMs is undesirable, but that's an argument for creating modular userspace libraries for such functionality. E.g. I/O APIC emulation is another one I'd love to move to a common library.
Traversing page tables isn't difficult. Checking permission bits isn't complex. Tedious, perhaps. But not complex. KVM's rather insane code comes from KVM's desire to make the checks as performant as possible, because eking out every little bit of performance matters for legacy shadow paging. I doubt VSM needs _that_ level of performance.
I say "targeted", because I assume the only use case for VSM is 64-bit non-nested guests. QEMU already has a rudimentary supporting for walking guest page tables, and that code is all of 40 LoC. Granted, it's heinous and lacks permission checks and A/D updates, but I would expect a clean implementation with permission checks and A/D support would clock in around 200 LoC. Maybe 300.
And ignoring docs and selftests, that's roughly what's being added in this series. Much of the code being added is quite simple, but there are non-trivial changes here as well. E.g. the different ways of setting A/D bits.
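[Editor's note: for reference, a minimal sketch of how a caller combines those A/D flags, based purely on the UAPI as exercised by the selftest above. translate_for_write() and vcpu_fd are illustrative names, and the linux/kvm.h definitions only exist once this series is applied.]

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>	/* assumes the KVM_TRANSLATE2 uapi from this series */

/* Translate gva for a write access, updating A/D bits; returns 0 on success. */
static int translate_for_write(int vcpu_fd, uint64_t gva, uint64_t *gpa)
{
	struct kvm_translation2 tr = {
		.linear_address = gva,
		/* Set accessed bits on every level the walk touches ... */
		.flags = KVM_TRANSLATE_FLAGS_SET_ACCESSED |
			 /* ... the dirty bit on the leaf, since this is a write ... */
			 KVM_TRANSLATE_FLAGS_SET_DIRTY |
			 /* ... and accessed bits up to the failing level even on failure. */
			 KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED,
		.access = KVM_TRANSLATE_ACCESS_WRITE,
	};

	if (ioctl(vcpu_fd, KVM_TRANSLATE2, &tr) < 0)
		return -1;	/* unsupported flag/access combination */
	if (!tr.valid)
		return -1;	/* walk failed, reason in tr.error_code */
	if (!tr.set_bits_succeeded)
		return -1;	/* the A/D bits live in read-only guest memory */

	*gpa = tr.physical_address;
	return 0;
}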
My biggest concern is taking on ABI that restricts what KVM can do in its walker. E.g. I *really* don't like the PKU change. Yeah, Intel doesn't explicitly define architectural behavior, but diverging from hardware behavior is rarely a good idea.
Similarly, the behavior of FNAME(protect_clean_gpte)() probably isn't desirable for the VSM use case.