From: Jack Thomson <jackabt@amazon.com>
Overview:
This patch series adds ARM64 support for the KVM_PRE_FAULT_MEMORY feature, which was previously only available on x86 [1]. This reduces the number of stage-2 faults taken during execution, which is beneficial in post-copy migration scenarios, particularly for memory-intensive applications, where stage-2 faults otherwise cause high latencies while memory is pre-populated via UFFD / memcpy.
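For reference, the ioctl is issued on a vCPU fd over a guest-physical range. A minimal userspace sketch (illustrative fragment only: arbitrary example GPA/size, a pre-existing vcpu_fd, includes and error handling elided):

	struct kvm_pre_fault_memory range = {
		.gpa  = 0x80000000,		/* example guest-physical start, page aligned */
		.size = 64 * 1024 * 1024,	/* example length in bytes, page aligned */
	};

	/* The kernel copies the struct back with gpa/size advanced past the
	 * populated region, so an interrupted call can simply be retried. */
	while (range.size) {
		if (ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range) < 0 &&
		    errno != EINTR && errno != EAGAIN)
			break;
	}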
Patch Overview:
- The first patch is a preparatory refactor.
- The second patch adds a page-table walk flag for pre-faulting.
- The third patch adds support for the KVM_PRE_FAULT_MEMORY ioctl on arm64.
- The fourth patch fixes an issue with unaligned mmap allocations in the selftests.
- The fifth patch updates the pre_fault_memory_test to support arm64.
- The last patch extends the pre_fault_memory_test to cover different VM memory backings.
[1]: https://lore.kernel.org/kvm/20240710174031.312055-1-pbonzini@redhat.com
Jack Thomson (6):
  KVM: arm64: Add __gmem_abort and __user_mem_abort
  KVM: arm64: Add KVM_PGTABLE_WALK_PRE_FAULT walk flag
  KVM: arm64: Add pre_fault_memory implementation
  KVM: selftests: Fix unaligned mmap allocations
  KVM: selftests: Enable pre_fault_memory_test for arm64
  KVM: selftests: Add option for different backing in pre-fault tests
 arch/arm64/include/asm/kvm_pgtable.h       |   3 +
 arch/arm64/kvm/Kconfig                     |   1 +
 arch/arm64/kvm/arm.c                       |   1 +
 arch/arm64/kvm/hyp/pgtable.c               |   6 +-
 arch/arm64/kvm/mmu.c                       |  97 +++++++++++++--
 tools/testing/selftests/kvm/Makefile.kvm   |   1 +
 tools/testing/selftests/kvm/lib/kvm_util.c |  12 +-
 .../selftests/kvm/pre_fault_memory_test.c  | 110 +++++++++++++-----
 8 files changed, 186 insertions(+), 45 deletions(-)
base-commit: 42188667be387867d2bf763d028654cbad046f7b
From: Jack Thomson <jackabt@amazon.com>
Add __gmem_abort() and __user_mem_abort() helpers that preserve -EAGAIN results. These will be used by the pre-fault implementation, which needs to retry on -EAGAIN.
Also add an optional page_size output parameter to __user_mem_abort to return the VMA page size, which will be needed for pre-faulting.
No functional changes are intended.
Signed-off-by: Jack Thomson <jackabt@amazon.com>
---
 arch/arm64/kvm/mmu.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a36426ccd9b5..082e7d8ae655 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1521,9 +1521,9 @@ static void adjust_nested_fault_perms(struct kvm_s2_trans *nested,
 
 #define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
 
-static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
-		      struct kvm_s2_trans *nested,
-		      struct kvm_memory_slot *memslot, bool is_perm)
+static int __gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+			struct kvm_s2_trans *nested,
+			struct kvm_memory_slot *memslot, bool is_perm)
 {
 	bool write_fault, exec_fault, writable;
 	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
@@ -1592,13 +1592,22 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (writable && !ret)
 		mark_page_dirty_in_slot(kvm, memslot, gfn);
 
+	return ret;
+}
+
+static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+		      struct kvm_s2_trans *nested,
+		      struct kvm_memory_slot *memslot, bool is_perm)
+{
+	int ret = __gmem_abort(vcpu, fault_ipa, nested, memslot, is_perm);
 	return ret != -EAGAIN ? ret : 0;
 }
 
-static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
-			  struct kvm_s2_trans *nested,
-			  struct kvm_memory_slot *memslot, unsigned long hva,
-			  bool fault_is_perm)
+static int __user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+			    struct kvm_s2_trans *nested,
+			    struct kvm_memory_slot *memslot,
+			    long *page_size, unsigned long hva,
+			    bool fault_is_perm)
 {
 	int ret = 0;
 	bool topup_memcache;
@@ -1871,10 +1880,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	kvm_release_faultin_page(kvm, page, !!ret, writable);
 	kvm_fault_unlock(kvm);
 
+	if (page_size)
+		*page_size = vma_pagesize;
+
 	/* Mark the page dirty only if the fault is handled successfully */
 	if (writable && !ret)
 		mark_page_dirty_in_slot(kvm, memslot, gfn);
 
+	return ret;
+}
+
+static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+			  struct kvm_s2_trans *nested,
+			  struct kvm_memory_slot *memslot, unsigned long hva,
+			  bool fault_is_perm)
+{
+	int ret = __user_mem_abort(vcpu, fault_ipa, nested, memslot, NULL,
+				   hva, fault_is_perm);
 	return ret != -EAGAIN ? ret : 0;
 }
Hi Jack,
On Thu, Sep 11, 2025 at 02:46:43PM +0100, Jack Thomson wrote:
From: Jack Thomson <jackabt@amazon.com>
Add __gmem_abort() and __user_mem_abort() helpers that preserve -EAGAIN results. These will be used by the pre-fault implementation, which needs to retry on -EAGAIN.
-EAGAIN is a pretty clear signal that another vCPU has faulted on this memory and is in the middle of installing a mapping. Why bother with retrying?
If we conceptually treat this thing as a synthetic stage-2 abort then it should use the same EAGAIN handling as a literal stage-2 abort.
Thanks, Oliver
From: Jack Thomson <jackabt@amazon.com>
Don't return -EAGAIN from stage2_map_walker_try_leaf during KVM_PRE_FAULT_MEMORY.
During pre-faults, the abort handlers are retried upon returning -EAGAIN, meaning the ioctl would get stuck in an infinite loop if userspace tried to pre-fault already-existing mappings.
Signed-off-by: Jack Thomson <jackabt@amazon.com>
---
 arch/arm64/include/asm/kvm_pgtable.h | 3 +++
 arch/arm64/kvm/hyp/pgtable.c         | 6 +++++-
 2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 2888b5d03757..0789671d1c4f 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -296,6 +296,8 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
  * @KVM_PGTABLE_WALK_SKIP_CMO:		Visit and update table entries
  *					without Cache maintenance
  *					operations required.
+ * @KVM_PGTABLE_WALK_PRE_FAULT:		Indicates the page-table walk was
+ *					invoked from a pre-fault request.
  */
 enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_LEAF			= BIT(0),
@@ -305,6 +307,7 @@ enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_HANDLE_FAULT		= BIT(4),
 	KVM_PGTABLE_WALK_SKIP_BBM_TLBI		= BIT(5),
 	KVM_PGTABLE_WALK_SKIP_CMO		= BIT(6),
+	KVM_PGTABLE_WALK_PRE_FAULT		= BIT(7),
 };
 
 struct kvm_pgtable_visit_ctx {
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index c351b4abd5db..140dccec2c5b 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -914,9 +914,13 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
 	 * same mapping or only change the access permissions. Instead,
 	 * the vCPU will exit one more time from guest if still needed
 	 * and then go through the path of relaxing permissions.
+	 *
+	 * When walking in the context of a pre-fault request, if the
+	 * mapping already exists we can return 0, as there's nothing
+	 * to do.
 	 */
 	if (!stage2_pte_needs_update(ctx->old, new))
-		return -EAGAIN;
+		return (ctx->flags & KVM_PGTABLE_WALK_PRE_FAULT) ? 0 : -EAGAIN;
 
 	/* If we're only changing software bits, then store them and go! */
 	if (!kvm_pgtable_walk_shared(ctx) &&
From: Jack Thomson <jackabt@amazon.com>
Add kvm_arch_vcpu_pre_fault_memory() for arm64. The implementation hands off the stage-2 faulting logic to either __gmem_abort() or __user_mem_abort().
Update __gmem_abort() and __user_mem_abort() to take a pre_fault parameter. When it is set, the write- and exec-fault checks are short-circuited to false, as a pre-fault should be treated as a read fault.
This closely follows the implementation on x86.
Signed-off-by: Jack Thomson <jackabt@amazon.com>
---
 arch/arm64/kvm/Kconfig |  1 +
 arch/arm64/kvm/arm.c   |  1 +
 arch/arm64/kvm/mmu.c   | 71 ++++++++++++++++++++++++++++++++++++------
 3 files changed, 64 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index bff62e75d681..1ac0605f86cb 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -25,6 +25,7 @@ menuconfig KVM
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select KVM_MMIO
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
+	select KVM_GENERIC_PRE_FAULT_MEMORY
 	select KVM_XFER_TO_GUEST_WORK
 	select KVM_VFIO
 	select HAVE_KVM_DIRTY_RING_ACQ_REL
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 888f7c7abf54..65654a742864 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -322,6 +322,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_COUNTER_OFFSET:
 	case KVM_CAP_ARM_WRITABLE_IMP_ID_REGS:
+	case KVM_CAP_PRE_FAULT_MEMORY:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 082e7d8ae655..002f564c6ac7 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1523,7 +1523,8 @@ static void adjust_nested_fault_perms(struct kvm_s2_trans *nested,
 
 static int __gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			struct kvm_s2_trans *nested,
-			struct kvm_memory_slot *memslot, bool is_perm)
+			struct kvm_memory_slot *memslot, bool is_perm,
+			bool pre_fault)
 {
 	bool write_fault, exec_fault, writable;
 	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
@@ -1537,6 +1538,9 @@ static int __gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	gfn_t gfn;
 	int ret;
 
+	if (pre_fault)
+		flags |= KVM_PGTABLE_WALK_PRE_FAULT;
+
 	ret = prepare_mmu_memcache(vcpu, true, &memcache);
 	if (ret)
 		return ret;
@@ -1546,8 +1550,8 @@ static int __gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	else
 		gfn = fault_ipa >> PAGE_SHIFT;
 
-	write_fault = kvm_is_write_fault(vcpu);
-	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
+	write_fault = !pre_fault && kvm_is_write_fault(vcpu);
+	exec_fault = !pre_fault && kvm_vcpu_trap_is_exec_fault(vcpu);
 
 	VM_WARN_ON_ONCE(write_fault && exec_fault);
 
@@ -1599,7 +1603,7 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		      struct kvm_s2_trans *nested,
 		      struct kvm_memory_slot *memslot, bool is_perm)
 {
-	int ret = __gmem_abort(vcpu, fault_ipa, nested, memslot, is_perm);
+	int ret = __gmem_abort(vcpu, fault_ipa, nested, memslot, is_perm, false);
 	return ret != -EAGAIN ? ret : 0;
 }
 
@@ -1607,7 +1611,7 @@ static int __user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			    struct kvm_s2_trans *nested,
 			    struct kvm_memory_slot *memslot,
 			    long *page_size, unsigned long hva,
-			    bool fault_is_perm)
+			    bool fault_is_perm, bool pre_fault)
 {
 	int ret = 0;
 	bool topup_memcache;
@@ -1631,10 +1635,13 @@ static int __user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	vm_flags_t vm_flags;
 	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
 
+	if (pre_fault)
+		flags |= KVM_PGTABLE_WALK_PRE_FAULT;
+
 	if (fault_is_perm)
 		fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
-	write_fault = kvm_is_write_fault(vcpu);
-	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
+	write_fault = !pre_fault && kvm_is_write_fault(vcpu);
+	exec_fault = !pre_fault && kvm_vcpu_trap_is_exec_fault(vcpu);
 	VM_WARN_ON_ONCE(write_fault && exec_fault);
 
 	/*
@@ -1895,8 +1902,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  bool fault_is_perm)
 {
-	int ret = __user_mem_abort(vcpu, fault_ipa, nested, memslot, NULL,
-				   hva, fault_is_perm);
+	int ret = __user_mem_abort(vcpu, fault_ipa, nested, memslot, NULL, hva,
+				   fault_is_perm, false);
 	return ret != -EAGAIN ? ret : 0;
 }
 
@@ -2468,3 +2475,49 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
 
 	trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
 }
+
+long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
+				    struct kvm_pre_fault_memory *range)
+{
+	int r;
+	hva_t hva;
+	phys_addr_t end;
+	long page_size;
+	struct kvm_memory_slot *memslot;
+	phys_addr_t ipa = range->gpa;
+	gfn_t gfn = gpa_to_gfn(range->gpa);
+
+	while (true) {
+		page_size = PAGE_SIZE;
+		memslot = gfn_to_memslot(vcpu->kvm, gfn);
+		if (!memslot)
+			return -ENOENT;
+
+		if (kvm_slot_has_gmem(memslot)) {
+			r = __gmem_abort(vcpu, ipa, NULL, memslot, false, true);
+		} else {
+			hva = gfn_to_hva_memslot_prot(memslot, gfn, NULL);
+			if (kvm_is_error_hva(hva))
+				return -EFAULT;
+			r = __user_mem_abort(vcpu, ipa, NULL, memslot, &page_size, hva, false,
+					     true);
+		}
+
+		if (r != -EAGAIN)
+			break;
+
+		if (signal_pending(current))
+			return -EINTR;
+
+		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+			return -EIO;
+
+		cond_resched();
+	};
+
+	if (r < 0)
+		return r;
+
+	end = (range->gpa & ~(page_size - 1)) + page_size;
+	return min(range->size, end - range->gpa);
+}
On Thu, Sep 11, 2025 at 02:46:45PM +0100, Jack Thomson wrote:
@@ -1607,7 +1611,7 @@ static int __user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			    struct kvm_s2_trans *nested,
 			    struct kvm_memory_slot *memslot,
 			    long *page_size, unsigned long hva,
-			    bool fault_is_perm)
+			    bool fault_is_perm, bool pre_fault)
 {
 	int ret = 0;
 	bool topup_memcache;
@@ -1631,10 +1635,13 @@ static int __user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	vm_flags_t vm_flags;
 	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
 
+	if (pre_fault)
+		flags |= KVM_PGTABLE_WALK_PRE_FAULT;
+
 	if (fault_is_perm)
 		fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
-	write_fault = kvm_is_write_fault(vcpu);
-	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
+	write_fault = !pre_fault && kvm_is_write_fault(vcpu);
+	exec_fault = !pre_fault && kvm_vcpu_trap_is_exec_fault(vcpu);
I'm not a fan of this. While user_mem_abort() is already a sloppy mess, one thing we could reliably assume is the presence of a valid fault context. Now we need to remember to special-case our interpretation of a fault on whether or not we're getting invoked for a pre-fault.
I'd rather see the pre-fault infrastructure compose a synthetic fault context (HPFAR_EL2, ESR_EL2, etc.). It places the complexity where it belongs and the rest of the abort handling code should 'just work'.
+long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
+				    struct kvm_pre_fault_memory *range)
+{
+	int r;
+	hva_t hva;
+	phys_addr_t end;
+	long page_size;
+	struct kvm_memory_slot *memslot;
+	phys_addr_t ipa = range->gpa;
+	gfn_t gfn = gpa_to_gfn(range->gpa);
+
+	while (true) {
+		page_size = PAGE_SIZE;
+		memslot = gfn_to_memslot(vcpu->kvm, gfn);
+		if (!memslot)
+			return -ENOENT;
+
+		if (kvm_slot_has_gmem(memslot)) {
+			r = __gmem_abort(vcpu, ipa, NULL, memslot, false, true);
+		} else {
+			hva = gfn_to_hva_memslot_prot(memslot, gfn, NULL);
+			if (kvm_is_error_hva(hva))
+				return -EFAULT;
+			r = __user_mem_abort(vcpu, ipa, NULL, memslot, &page_size, hva, false,
+					     true);
+		}
+
+		if (r != -EAGAIN)
+			break;
+
+		if (signal_pending(current))
+			return -EINTR;
+
+		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+			return -EIO;
+
+		cond_resched();
+	};
Why do we need another retry loop? Looks like we've already got one in the arch-generic code.
Thanks, Oliver
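(For reference, the arch-generic wrapper being referred to, kvm_vcpu_pre_fault_memory() in virt/kvm/kvm_main.c, already does roughly the following around the arch hook; paraphrased from memory, details may differ:)

	full_size = range->size;
	do {
		if (signal_pending(current)) {
			r = -EINTR;
			break;
		}

		r = kvm_arch_vcpu_pre_fault_memory(vcpu, range);
		if (WARN_ON_ONCE(r == 0 || r > range->size))
			r = -EIO;
		if (r < 0)
			break;

		range->size -= r;
		range->gpa += r;
		cond_resched();
	} while (range->size);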
Hi Oliver,
Thanks for reviewing!
On 11/09/2025 7:42 pm, Oliver Upton wrote:
On Thu, Sep 11, 2025 at 02:46:45PM +0100, Jack Thomson wrote:
@@ -1607,7 +1611,7 @@ static int __user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			    struct kvm_s2_trans *nested,
 			    struct kvm_memory_slot *memslot,
 			    long *page_size, unsigned long hva,
-			    bool fault_is_perm)
+			    bool fault_is_perm, bool pre_fault)
 {
 	int ret = 0;
 	bool topup_memcache;
@@ -1631,10 +1635,13 @@ static int __user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	vm_flags_t vm_flags;
 	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
 
+	if (pre_fault)
+		flags |= KVM_PGTABLE_WALK_PRE_FAULT;
+
 	if (fault_is_perm)
 		fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
-	write_fault = kvm_is_write_fault(vcpu);
-	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
+	write_fault = !pre_fault && kvm_is_write_fault(vcpu);
+	exec_fault = !pre_fault && kvm_vcpu_trap_is_exec_fault(vcpu);
I'm not a fan of this. While user_mem_abort() is already a sloppy mess, one thing we could reliably assume is the presence of a valid fault context. Now we need to remember to special-case our interpretation of a fault on whether or not we're getting invoked for a pre-fault.
I'd rather see the pre-fault infrastructure compose a synthetic fault context (HPFAR_EL2, ESR_EL2, etc.). It places the complexity where it belongs and the rest of the abort handling code should 'just work'.
Agreed, it looks much better with the synthetic abort. Is this the approach you had in mind?
+long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
+				    struct kvm_pre_fault_memory *range)
+{
+	int ret, idx;
+	hva_t hva;
+	phys_addr_t end;
+	u64 esr, hpfar;
+	struct kvm_memory_slot *memslot;
+	struct kvm_vcpu_fault_info *fault_info;
+
+	long page_size = PAGE_SIZE;
+	phys_addr_t ipa = range->gpa;
+	gfn_t gfn = gpa_to_gfn(range->gpa);
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+	if (ipa >= kvm_phys_size(vcpu->arch.hw_mmu)) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
+
+	memslot = gfn_to_memslot(vcpu->kvm, gfn);
+	if (!memslot) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
+
+	fault_info = &vcpu->arch.fault;
+
+	esr = fault_info->esr_el2;
+	hpfar = fault_info->hpfar_el2;
+
+	fault_info->esr_el2 = ESR_ELx_FSC_ACCESS_L(KVM_PGTABLE_LAST_LEVEL);
+	fault_info->hpfar_el2 = HPFAR_EL2_NS |
+				((ipa >> (12 - HPFAR_EL2_FIPA_SHIFT)) & HPFAR_EL2_FIPA_MASK);
+
+	if (kvm_slot_has_gmem(memslot)) {
+		ret = gmem_abort(vcpu, ipa, NULL, memslot, false);
+	} else {
+		hva = gfn_to_hva_memslot_prot(memslot, gfn, NULL);
+		if (kvm_is_error_hva(hva)) {
+			ret = -EFAULT;
+			goto out;
+		}
+		ret = user_mem_abort(vcpu, ipa, NULL, memslot, &page_size, hva,
+				     false);
+	}
+
+	if (ret < 0)
+		goto out;
+
+	end = (range->gpa & ~(page_size - 1)) + page_size;
+	ret = min(range->size, end - range->gpa);
+
+out:
+	fault_info->esr_el2 = esr;
+	fault_info->hpfar_el2 = hpfar;
+out_unlock:
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+	return ret;
+}
+long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
+				    struct kvm_pre_fault_memory *range)
+{
+	int r;
+	hva_t hva;
+	phys_addr_t end;
+	long page_size;
+	struct kvm_memory_slot *memslot;
+	phys_addr_t ipa = range->gpa;
+	gfn_t gfn = gpa_to_gfn(range->gpa);
+
+	while (true) {
+		page_size = PAGE_SIZE;
+		memslot = gfn_to_memslot(vcpu->kvm, gfn);
+		if (!memslot)
+			return -ENOENT;
+
+		if (kvm_slot_has_gmem(memslot)) {
+			r = __gmem_abort(vcpu, ipa, NULL, memslot, false, true);
+		} else {
+			hva = gfn_to_hva_memslot_prot(memslot, gfn, NULL);
+			if (kvm_is_error_hva(hva))
+				return -EFAULT;
+			r = __user_mem_abort(vcpu, ipa, NULL, memslot, &page_size, hva, false,
+					     true);
+		}
+
+		if (r != -EAGAIN)
+			break;
+
+		if (signal_pending(current))
+			return -EINTR;
+
+		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+			return -EIO;
+
+		cond_resched();
+	};
Why do we need another retry loop? Looks like we've already got one in the arch-generic code.
Good point, thanks. I've removed that now.
Thanks, Oliver
Thanks, Jack
On Mon, Sep 29, 2025 at 02:59:35PM +0100, Thomson, Jack wrote:
Hi Oliver,
Thanks for reviewing!
On 11/09/2025 7:42 pm, Oliver Upton wrote:
On Thu, Sep 11, 2025 at 02:46:45PM +0100, Jack Thomson wrote:
@@ -1607,7 +1611,7 @@ static int __user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			    struct kvm_s2_trans *nested,
 			    struct kvm_memory_slot *memslot,
 			    long *page_size, unsigned long hva,
-			    bool fault_is_perm)
+			    bool fault_is_perm, bool pre_fault)
 {
 	int ret = 0;
 	bool topup_memcache;
@@ -1631,10 +1635,13 @@ static int __user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	vm_flags_t vm_flags;
 	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
 
+	if (pre_fault)
+		flags |= KVM_PGTABLE_WALK_PRE_FAULT;
+
 	if (fault_is_perm)
 		fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
-	write_fault = kvm_is_write_fault(vcpu);
-	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
+	write_fault = !pre_fault && kvm_is_write_fault(vcpu);
+	exec_fault = !pre_fault && kvm_vcpu_trap_is_exec_fault(vcpu);
I'm not a fan of this. While user_mem_abort() is already a sloppy mess, one thing we could reliably assume is the presence of a valid fault context. Now we need to remember to special-case our interpretation of a fault on whether or not we're getting invoked for a pre-fault.
I'd rather see the pre-fault infrastructure compose a synthetic fault context (HPFAR_EL2, ESR_EL2, etc.). It places the complexity where it belongs and the rest of the abort handling code should 'just work'.
Agreed, it looks much better with the synthetic abort. Is this the approach you had in mind?
Pretty much. Thanks for taking a moment to fiddle with it.
+long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
+				    struct kvm_pre_fault_memory *range)
+{
+	int ret, idx;
+	hva_t hva;
+	phys_addr_t end;
+	u64 esr, hpfar;
+	struct kvm_memory_slot *memslot;
+	struct kvm_vcpu_fault_info *fault_info;
+
+	long page_size = PAGE_SIZE;
+	phys_addr_t ipa = range->gpa;
+	gfn_t gfn = gpa_to_gfn(range->gpa);
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+	if (ipa >= kvm_phys_size(vcpu->arch.hw_mmu)) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
+
+	memslot = gfn_to_memslot(vcpu->kvm, gfn);
+	if (!memslot) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
+
+	fault_info = &vcpu->arch.fault;
+
+	esr = fault_info->esr_el2;
+	hpfar = fault_info->hpfar_el2;
nit: Just snapshot the entire struct, makes this forward-compatible with new fields showing up.
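For illustration, the snapshot/restore could look like this (sketch only, not the posted code):

	struct kvm_vcpu_fault_info stash = vcpu->arch.fault;

	/* ... install the synthetic fault and call the abort handler ... */

	vcpu->arch.fault = stash;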
+	fault_info->esr_el2 = ESR_ELx_FSC_ACCESS_L(KVM_PGTABLE_LAST_LEVEL);
A translation fault would be a more accurate representation of what you're trying to do. Access flag faults aren't expected in user_mem_abort() and are instead handled in handle_access_fault().
You're also missing the rest of the ESR fields that are relevant here, such as ESR_ELx.EC which would actually indicate a data abort. I think you'd also want to communicate this as a nISV fault (i.e. ESR_ELx.ISV=0).
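A sketch of what that synthetic ESR could look like (assuming an ESR_ELx_FSC_FAULT_L() helper exists alongside ESR_ELx_FSC_ACCESS_L(); exact macro spellings may differ):

	/* Synthetic stage-2 data abort: translation fault at the final
	 * level, with ISV left clear (nISV). */
	fault_info->esr_el2 = FIELD_PREP(ESR_ELx_EC_MASK, ESR_ELx_EC_DABT_LOW) |
			      ESR_ELx_FSC_FAULT_L(KVM_PGTABLE_LAST_LEVEL);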
+	fault_info->hpfar_el2 = HPFAR_EL2_NS |
+				((ipa >> (12 - HPFAR_EL2_FIPA_SHIFT)) & HPFAR_EL2_FIPA_MASK);
FIELD_PREP()?
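i.e. something like this (sketch, reusing the field names quoted above):

	fault_info->hpfar_el2 = HPFAR_EL2_NS |
				FIELD_PREP(HPFAR_EL2_FIPA_MASK, ipa >> 12);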
+	if (kvm_slot_has_gmem(memslot)) {
+		ret = gmem_abort(vcpu, ipa, NULL, memslot, false);
+	} else {
+		hva = gfn_to_hva_memslot_prot(memslot, gfn, NULL);
+		if (kvm_is_error_hva(hva)) {
+			ret = -EFAULT;
+			goto out;
+		}
+		ret = user_mem_abort(vcpu, ipa, NULL, memslot, &page_size, hva,
+				     false);
+	}
+
+	if (ret < 0)
+		goto out;
+
+	end = (range->gpa & ~(page_size - 1)) + page_size;
+	ret = min(range->size, end - range->gpa);
+
+out:
+	fault_info->esr_el2 = esr;
+	fault_info->hpfar_el2 = hpfar;
+out_unlock:
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+	return ret;
+}
Thanks, Oliver
From: Jack Thomson <jackabt@amazon.com>
When creating a VM using mmap() with huge pages, and the memory size does not align with the underlying page size, the stored mmap_size value does not account for the fact that mmap() automatically rounds the length up to a multiple of the underlying page size. During test teardown, munmap() is called with this stored length; however, munmap() requires the length to be a multiple of the underlying page size.
Update the vm_mem_add method to ensure the mmap_size is aligned to the underlying page size.
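To illustrate the failure mode outside of the selftest (hypothetical standalone example, assuming 2M huge pages are available on the system):

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t huge = 2UL << 20;
		size_t len  = huge + 4096;	/* not a multiple of the huge page size */
		void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		if (mem == MAP_FAILED)
			return 1;

		/* The mapping is really rounded up to 4M, so tearing it down with
		 * the unaligned length fails, which is what the test was hitting. */
		if (munmap(mem, len))
			perror("munmap");

		/* Rounding the stored size up front, as this patch does, avoids it. */
		return munmap(mem, (len + huge - 1) & ~(huge - 1));
	}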
Signed-off-by: Jack Thomson <jackabt@amazon.com>
---
 tools/testing/selftests/kvm/lib/kvm_util.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index c3f5142b0a54..b106fbed999c 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1051,7 +1051,6 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 	/* Allocate and initialize new mem region structure. */
 	region = calloc(1, sizeof(*region));
 	TEST_ASSERT(region != NULL, "Insufficient Memory");
-	region->mmap_size = mem_size;
 
 #ifdef __s390x__
 	/* On s390x, the host address must be aligned to 1M (due to PGSTEs) */
@@ -1060,6 +1059,11 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 	alignment = 1;
 #endif
 
+	alignment = max(backing_src_pagesz, alignment);
+	region->mmap_size = align_up(mem_size, alignment);
+
+	TEST_ASSERT_EQ(guest_paddr, align_up(guest_paddr, backing_src_pagesz));
+
 	/*
 	 * When using THP mmap is not guaranteed to returned a hugepage aligned
 	 * address so we have to pad the mmap. Padding is not needed for HugeTLB
 	 * page size.
 	 */
 	if (src_type == VM_MEM_SRC_ANONYMOUS_THP)
-		alignment = max(backing_src_pagesz, alignment);
-
-	TEST_ASSERT_EQ(guest_paddr, align_up(guest_paddr, backing_src_pagesz));
-
-	/* Add enough memory to align up if necessary */
-	if (alignment > 1)
 		region->mmap_size += alignment;
 
 	region->fd = -1;
From: Jack Thomson <jackabt@amazon.com>
Enable the pre_fault_memory_test to run on arm64 by making it work with different guest page sizes and testing multiple guest configurations.
Update the TEST_ASSERT() on the exit reason to compare against UCALL_EXIT_REASON for portability, as arm64 exits with KVM_EXIT_MMIO while x86 uses KVM_EXIT_IO.
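For reference, UCALL_EXIT_REASON is already defined per architecture in the selftest headers, roughly as follows (paths approximate):

	/* include/x86_64/ucall.h */
	#define UCALL_EXIT_REASON	KVM_EXIT_IO

	/* include/aarch64/ucall.h */
	#define UCALL_EXIT_REASON	KVM_EXIT_MMIO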
Signed-off-by: Jack Thomson <jackabt@amazon.com>
---
 tools/testing/selftests/kvm/Makefile.kvm  |  1 +
 .../selftests/kvm/pre_fault_memory_test.c | 79 ++++++++++++++-----
 2 files changed, 59 insertions(+), 21 deletions(-)
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 90f03f00cb04..4db1737fad04 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -180,6 +180,7 @@ TEST_GEN_PROGS_arm64 += memslot_perf_test
 TEST_GEN_PROGS_arm64 += mmu_stress_test
 TEST_GEN_PROGS_arm64 += rseq_test
 TEST_GEN_PROGS_arm64 += steal_time
+TEST_GEN_PROGS_arm64 += pre_fault_memory_test
 
 TEST_GEN_PROGS_s390 = $(TEST_GEN_PROGS_COMMON)
 TEST_GEN_PROGS_s390 += s390/memop
diff --git a/tools/testing/selftests/kvm/pre_fault_memory_test.c b/tools/testing/selftests/kvm/pre_fault_memory_test.c
index 0350a8896a2f..ed9848a8af60 100644
--- a/tools/testing/selftests/kvm/pre_fault_memory_test.c
+++ b/tools/testing/selftests/kvm/pre_fault_memory_test.c
@@ -10,19 +10,29 @@
 #include <test_util.h>
 #include <kvm_util.h>
 #include <processor.h>
+#include <guest_modes.h>
 
 /* Arbitrarily chosen values */
-#define TEST_SIZE	(SZ_2M + PAGE_SIZE)
-#define TEST_NPAGES	(TEST_SIZE / PAGE_SIZE)
+#define TEST_BASE_SIZE	SZ_2M
 #define TEST_SLOT	10
 
+/* Storage of test info to share with guest code */
+struct test_config {
+	int page_size;
+	uint64_t test_size;
+	uint64_t test_num_pages;
+};
+
+struct test_config test_config;
+
 static void guest_code(uint64_t base_gpa)
 {
 	volatile uint64_t val __used;
+	struct test_config *config = &test_config;
 	int i;
 
-	for (i = 0; i < TEST_NPAGES; i++) {
-		uint64_t *src = (uint64_t *)(base_gpa + i * PAGE_SIZE);
+	for (i = 0; i < config->test_num_pages; i++) {
+		uint64_t *src = (uint64_t *)(base_gpa + i * config->page_size);
 
 		val = *src;
 	}
@@ -63,11 +73,17 @@ static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 gpa, u64 size,
 		    "KVM_PRE_FAULT_MEMORY", ret, vcpu->vm);
 }
 
-static void __test_pre_fault_memory(unsigned long vm_type, bool private)
+struct test_params {
+	unsigned long vm_type;
+	bool private;
+};
+
+static void __test_pre_fault_memory(enum vm_guest_mode guest_mode, void *arg)
 {
+	struct test_params *p = arg;
 	const struct vm_shape shape = {
-		.mode = VM_MODE_DEFAULT,
-		.type = vm_type,
+		.mode = guest_mode,
+		.type = p->vm_type,
 	};
 	struct kvm_vcpu *vcpu;
 	struct kvm_run *run;
@@ -78,10 +94,17 @@ static void __test_pre_fault_memory(unsigned long vm_type, bool private)
 	uint64_t guest_test_virt_mem;
 	uint64_t alignment, guest_page_size;
 
+	pr_info("Testing guest mode: %s\n", vm_guest_mode_string(guest_mode));
+
 	vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);
 
-	alignment = guest_page_size = vm_guest_mode_params[VM_MODE_DEFAULT].page_size;
-	guest_test_phys_mem = (vm->max_gfn - TEST_NPAGES) * guest_page_size;
+	guest_page_size = vm_guest_mode_params[guest_mode].page_size;
+
+	test_config.page_size = guest_page_size;
+	test_config.test_size = TEST_BASE_SIZE + test_config.page_size;
+	test_config.test_num_pages = vm_calc_num_guest_pages(vm->mode, test_config.test_size);
+
+	guest_test_phys_mem = (vm->max_gfn - test_config.test_num_pages) * test_config.page_size;
 #ifdef __s390x__
 	alignment = max(0x100000UL, guest_page_size);
 #else
@@ -91,22 +114,31 @@ static void __test_pre_fault_memory(unsigned long vm_type, bool private)
 	guest_test_virt_mem = guest_test_phys_mem & ((1ULL << (vm->va_bits - 1)) - 1);
 
 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
-				    guest_test_phys_mem, TEST_SLOT, TEST_NPAGES,
-				    private ? KVM_MEM_GUEST_MEMFD : 0);
-	virt_map(vm, guest_test_virt_mem, guest_test_phys_mem, TEST_NPAGES);
-
-	if (private)
-		vm_mem_set_private(vm, guest_test_phys_mem, TEST_SIZE);
-	pre_fault_memory(vcpu, guest_test_phys_mem, SZ_2M, 0);
-	pre_fault_memory(vcpu, guest_test_phys_mem + SZ_2M, PAGE_SIZE * 2, PAGE_SIZE);
-	pre_fault_memory(vcpu, guest_test_phys_mem + TEST_SIZE, PAGE_SIZE, PAGE_SIZE);
+				    guest_test_phys_mem, TEST_SLOT, test_config.test_num_pages,
+				    p->private ? KVM_MEM_GUEST_MEMFD : 0);
+	virt_map(vm, guest_test_virt_mem, guest_test_phys_mem, test_config.test_num_pages);
+
+	if (p->private)
+		vm_mem_set_private(vm, guest_test_phys_mem, test_config.test_size);
+	pre_fault_memory(vcpu, guest_test_phys_mem, TEST_BASE_SIZE, 0);
+	/* Test pre-faulting over an already faulted range */
+	pre_fault_memory(vcpu, guest_test_phys_mem, TEST_BASE_SIZE, 0);
+	pre_fault_memory(vcpu, guest_test_phys_mem + TEST_BASE_SIZE,
+			 test_config.page_size * 2, test_config.page_size);
+	pre_fault_memory(vcpu, guest_test_phys_mem + test_config.test_size,
+			 test_config.page_size, test_config.page_size);
 
 	vcpu_args_set(vcpu, 1, guest_test_virt_mem);
+
+	/* Export the shared variables to the guest. */
+	sync_global_to_guest(vm, test_config);
+
 	vcpu_run(vcpu);
 
 	run = vcpu->run;
-	TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
-		    "Wanted KVM_EXIT_IO, got exit reason: %u (%s)",
+	TEST_ASSERT(run->exit_reason == UCALL_EXIT_REASON,
+		    "Wanted %s, got exit reason: %u (%s)",
+		    exit_reason_str(UCALL_EXIT_REASON),
 		    run->exit_reason, exit_reason_str(run->exit_reason));
 
 	switch (get_ucall(vcpu, &uc)) {
@@ -130,7 +162,12 @@ static void test_pre_fault_memory(unsigned long vm_type, bool private)
 		return;
 	}
 
-	__test_pre_fault_memory(vm_type, private);
+	struct test_params p = {
+		.vm_type = vm_type,
+		.private = private,
+	};
+
+	for_each_guest_mode(__test_pre_fault_memory, &p);
 }
 
 int main(int argc, char *argv[])
From: Jack Thomson <jackabt@amazon.com>
Add a -m option to specify different memory backing types for the pre-fault tests (e.g., anonymous, hugetlb), allowing testing of the pre-fault functionality across different memory configurations.
Signed-off-by: Jack Thomson <jackabt@amazon.com>
---
 .../selftests/kvm/pre_fault_memory_test.c | 31 ++++++++++++++++---
 1 file changed, 26 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/kvm/pre_fault_memory_test.c b/tools/testing/selftests/kvm/pre_fault_memory_test.c
index ed9848a8af60..22e2e53945d9 100644
--- a/tools/testing/selftests/kvm/pre_fault_memory_test.c
+++ b/tools/testing/selftests/kvm/pre_fault_memory_test.c
@@ -76,6 +76,7 @@ static void pre_fault_memory(struct kvm_vcpu *vcpu, u64 gpa, u64 size,
 struct test_params {
 	unsigned long vm_type;
 	bool private;
+	enum vm_mem_backing_src_type mem_backing_src;
 };
 
 static void __test_pre_fault_memory(enum vm_guest_mode guest_mode, void *arg)
@@ -94,7 +95,11 @@ static void __test_pre_fault_memory(enum vm_guest_mode guest_mode, void *arg)
 	uint64_t guest_test_virt_mem;
 	uint64_t alignment, guest_page_size;
 
+	size_t backing_src_pagesz = get_backing_src_pagesz(p->mem_backing_src);
+
 	pr_info("Testing guest mode: %s\n", vm_guest_mode_string(guest_mode));
+	pr_info("Testing memory backing src type: %s\n",
+		vm_mem_backing_src_alias(p->mem_backing_src)->name);
 
 	vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);
 
@@ -110,10 +115,11 @@ static void __test_pre_fault_memory(enum vm_guest_mode guest_mode, void *arg)
 #else
 	alignment = SZ_2M;
 #endif
+	alignment = max(alignment, backing_src_pagesz);
 	guest_test_phys_mem = align_down(guest_test_phys_mem, alignment);
 	guest_test_virt_mem = guest_test_phys_mem & ((1ULL << (vm->va_bits - 1)) - 1);
 
-	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+	vm_userspace_mem_region_add(vm, p->mem_backing_src,
 				    guest_test_phys_mem, TEST_SLOT, test_config.test_num_pages,
 				    p->private ? KVM_MEM_GUEST_MEMFD : 0);
 	virt_map(vm, guest_test_virt_mem, guest_test_phys_mem, test_config.test_num_pages);
@@ -155,7 +161,8 @@ static void __test_pre_fault_memory(enum vm_guest_mode guest_mode, void *arg)
 	kvm_vm_free(vm);
 }
 
-static void test_pre_fault_memory(unsigned long vm_type, bool private)
+static void test_pre_fault_memory(unsigned long vm_type, enum vm_mem_backing_src_type backing_src,
+				  bool private)
 {
 	if (vm_type && !(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type))) {
 		pr_info("Skipping tests for vm_type 0x%lx\n", vm_type);
@@ -165,6 +172,7 @@ static void test_pre_fault_memory(unsigned long vm_type, bool private)
 	struct test_params p = {
 		.vm_type = vm_type,
 		.private = private,
+		.mem_backing_src = backing_src,
 	};
 
 	for_each_guest_mode(__test_pre_fault_memory, &p);
@@ -174,10 +182,23 @@ int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_PRE_FAULT_MEMORY));
 
-	test_pre_fault_memory(0, false);
+	int opt;
+	enum vm_mem_backing_src_type backing = VM_MEM_SRC_ANONYMOUS;
+
+	while ((opt = getopt(argc, argv, "m:")) != -1) {
+		switch (opt) {
+		case 'm':
+			backing = parse_backing_src_type(optarg);
+			break;
+		default:
+			break;
+		}
+	}
+
+	test_pre_fault_memory(0, backing, false);
 #ifdef __x86_64__
-	test_pre_fault_memory(KVM_X86_SW_PROTECTED_VM, false);
-	test_pre_fault_memory(KVM_X86_SW_PROTECTED_VM, true);
+	test_pre_fault_memory(KVM_X86_SW_PROTECTED_VM, backing, false);
+	test_pre_fault_memory(KVM_X86_SW_PROTECTED_VM, backing, true);
 #endif
 	return 0;
 }
Hi Jack,
On Thu, Sep 11, 2025 at 02:46:42PM +0100, Jack Thomson wrote:
From: Jack Thomson <jackabt@amazon.com>
Overview:
This patch series adds ARM64 support for the KVM_PRE_FAULT_MEMORY feature, which was previously only available on x86 [1]. This reduces the number of stage-2 faults taken during execution, which is beneficial in post-copy migration scenarios, particularly for memory-intensive applications, where stage-2 faults otherwise cause high latencies while memory is pre-populated via UFFD / memcpy.
Thanks for posting the series. More of a general comment on the UAPI documentation:
"... However, KVM does not mark any newly created stage-2 PTE as Accessed."
This behavior is x86-specific since kvm_pgtable_stage2_map() will lay down PTEs with the AF set. Probably shouldn't have documented the internal state of the stage-2 in the first place but oh well, please just update the UAPI description to make it clear this is x86-specific.
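(For reference, the arm64 stage-2 code sets the access flag when it builds the leaf attributes, roughly the following in stage2_set_prot_attr() in arch/arm64/kvm/hyp/pgtable.c; paraphrased, surrounding context omitted:)

	attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF;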
Thanks, Oliver