When handling page faults for many vCPUs during demand paging, KVM's MMU lock becomes highly contended. This series creates a test with a naive userfaultfd-based demand paging implementation to demonstrate that contention. The test serves both as a functional test of userfaultfd and as a microbenchmark of demand paging performance with a variable number of vCPUs and amount of memory per vCPU.
The test creates N userfaultfd threads, N vCPUs, and a region of memory with M pages per vCPU. Each userfaultfd polling thread is set up to serve faults on the region of memory belonging to one of the vCPUs. Each vCPU is then started and sequentially touches every page of its disjoint memory region. In response to each fault, the corresponding userfaultfd thread copies a static buffer into the guest's memory. This approximates a worst case for MMU lock contention: most contention between the userfaultfd threads themselves has been eliminated, and no time is spent fetching the contents of guest memory.
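For illustration, the core of one such userfaultfd polling thread might look like the minimal sketch below. The names (uffd_handler_loop, copy_buf) are invented for this sketch and are not the actual code in demand_paging_test.c, which among other things also polls a pipe fd so the threads can exit promptly:

#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Minimal sketch: serve faults on one vCPU's memory region by copying a
 * static buffer into the faulting page with UFFDIO_COPY.
 */
static void uffd_handler_loop(int uffd, const void *copy_buf,
			      uint64_t page_size)
{
	struct pollfd pollfd = { .fd = uffd, .events = POLLIN };

	for (;;) {
		struct uffd_msg msg;
		struct uffdio_copy copy;

		/* Wait for the kernel to report a fault on our region. */
		if (poll(&pollfd, 1, -1) <= 0)
			break;
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		/* Resolve the fault: install a copy of the static buffer. */
		copy.src = (uint64_t)(unsigned long)copy_buf;
		copy.dst = msg.arg.pagefault.address & ~(page_size - 1);
		copy.len = page_size;
		copy.mode = 0;
		if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
			perror("UFFDIO_COPY");
	}
}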
This test was run successfully on Intel Haswell, Broadwell, and Cascade Lake hosts with a variety of vCPU counts and memory sizes.
This test was adapted from the dirty_log_test.
The series can also be viewed in Gerrit here: https://linux-review.googlesource.com/c/virt/kvm/kvm/+/1464 (Thanks to Dmitry Vyukov <dvyukov@google.com> for setting up the Gerrit instance)
v4 (Responding to feedback from Andrew Jones, Peter Xu, and Peter Shier):
- Tested this revision by running demand_paging_test at each commit in the
  series on an Intel Haswell machine. Ran demand_paging_test -u -v 8 -b 8M
  -d 10 on the same machine at the last commit in the series.
- Re-added partial aarch64 support, though aarch64 and s390 remain untested
- Implemented pipefd polling to reduce UFFD thread exit latency
- Added variable unit input for memory size so users can pass command line
  arguments of the form -b 24M instead of the raw number of bytes (a rough
  sketch of this kind of parsing follows after this changelog)
- Moved a missing break from a patch later in the series to an earlier one
- Moved to syncing per-vCPU global variables to the guest and looking up
  per-vCPU arguments based on a single CPU ID passed to each guest vCPU.
  This allows future patches to pass more than the supported number of
  arguments for each arch to the vCPUs.
- Implemented vcpu_args_set for s390 and aarch64 [UNTESTED]
- Changed vm_create to always allocate memslot 0 at 4G instead of only when
  the number of pages required is large.
- Changed vcpu_wss to vcpu_memory_size for clarity.
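As a rough illustration of the variable unit size parsing mentioned above, a helper along these lines could translate arguments like "24M" into bytes. parse_size_arg() is a name made up for this sketch, not necessarily the helper the series adds to test_util.c:

#include <stdint.h>
#include <stdlib.h>

/* Sketch: parse a size argument with an optional K/M/G suffix into bytes. */
static uint64_t parse_size_arg(const char *arg)
{
	char *end;
	uint64_t size = strtoull(arg, &end, 0);

	switch (*end) {
	case 'G':
	case 'g':
		size <<= 10;
		/* fall through */
	case 'M':
	case 'm':
		size <<= 10;
		/* fall through */
	case 'K':
	case 'k':
		size <<= 10;
	}
	return size;
}

For example, parse_size_arg("24M") would return 24 << 20 bytes, while a plain "24" would be taken as 24 bytes.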
Ben Gardon (10):
  KVM: selftests: Create a demand paging test
  KVM: selftests: Add demand paging content to the demand paging test
  KVM: selftests: Add configurable demand paging delay
  KVM: selftests: Add memory size parameter to the demand paging test
  KVM: selftests: Pass args to vCPU in global vCPU args struct
  KVM: selftests: Add support for vcpu_args_set to aarch64 and s390x
  KVM: selftests: Support multiple vCPUs in demand paging test
  KVM: selftests: Time guest demand paging
  KVM: selftests: Stop memslot creation in KVM internal memslot region
  KVM: selftests: Move memslot 0 above KVM internal memslots
 tools/testing/selftests/kvm/.gitignore          |   1 +
 tools/testing/selftests/kvm/Makefile            |   5 +-
 .../selftests/kvm/demand_paging_test.c          | 680 ++++++++++++++++++
 .../testing/selftests/kvm/include/test_util.h   |   2 +
 .../selftests/kvm/lib/aarch64/processor.c       |  33 +
 tools/testing/selftests/kvm/lib/kvm_util.c      |  27 +-
 .../selftests/kvm/lib/s390x/processor.c         |  35 +
 tools/testing/selftests/kvm/lib/test_util.c     |  61 ++
 8 files changed, 839 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/demand_paging_test.c
 create mode 100644 tools/testing/selftests/kvm/lib/test_util.c