On systems that support shared guest memory, write() is useful, for example, for population of the initial image. Even though the same can also be achieved via userspace mapping and memcpying from userspace, write() provides a more performant option because it does not need to set user page tables and it does not cause a page fault for every page like memcpy would. Note that memcpy cannot be accelerated via MADV_POPULATE_WRITE as it is not supported by guest_memfd and relies on GUP.
Populating 512MiB of guest_memfd on a x86 machine: - via memcpy: 436 ms - via write: 202 ms (-54%)
Only PAGE_ALIGNED offset and len are allowed. Even though non-aligned writes are technically possible, when in-place conversion support is implemented [1], the restriction makes handling of mixed shared/private huge pages simpler. write() will only be allowed to populate shared pages.
When direct map removal is implemented [2] - write() will not be allowed to access pages that have already been removed from direct map - on completion, write() will remove the populated pages from direct map
While it is technically possible to implement read() syscall on systems with shared guest memory, it is not supported as there is currently no use case for it.
[1] https://lore.kernel.org/kvm/cover.1760731772.git.ackerleytng@google.com [2] https://lore.kernel.org/kvm/20250924151101.2225820-1-patrick.roy@campus.lmu....
Nikita Kalyazin (2): KVM: guest_memfd: add generic population via write KVM: selftests: update guest_memfd write tests
Documentation/virt/kvm/api.rst | 2 + include/linux/kvm_host.h | 2 +- include/uapi/linux/kvm.h | 1 + .../testing/selftests/kvm/guest_memfd_test.c | 58 +++++++++++++++++-- virt/kvm/guest_memfd.c | 52 +++++++++++++++++ 5 files changed, 108 insertions(+), 7 deletions(-)
base-commit: 8a4821412cf2c1429fffa07c012dd150f2edf78c -- 2.50.1
From: Nikita Kalyazin kalyazin@amazon.com
On systems that support shared guest memory, write() is useful, for example, for population of the initial image. Even though the same can also be achieved via userspace mapping and memcpying from userspace, write() provides a more performant option because it does not need to set user page tables and it does not cause a page fault for every page like memcpy would. Note that memcpy cannot be accelerated via MADV_POPULATE_WRITE as it is not supported by guest_memfd and relies on GUP.
Populating 512MiB of guest_memfd on a x86 machine: - via memcpy: 436 ms - via write: 202 ms (-54%)
Only PAGE_ALIGNED offset and len are allowed. Even though non-aligned writes are technically possible, when in-place conversion support is implemented [1], the restriction makes handling of mixed shared/private huge pages simpler. write() will only be allowed to populate shared pages.
When direct map removal is implemented [2] - write() will not be allowed to access pages that have already been removed from direct map - on completion, write() will remove the populated pages from direct map
While it is technically possible to implement read() syscall on systems with shared guest memory, it is not supported as there is currently no use case for it.
[1] https://lore.kernel.org/kvm/cover.1760731772.git.ackerleytng@google.com [2] https://lore.kernel.org/kvm/20250924151101.2225820-1-patrick.roy@campus.lmu....
Signed-off-by: Nikita Kalyazin kalyazin@amazon.com --- Documentation/virt/kvm/api.rst | 2 ++ include/linux/kvm_host.h | 2 +- include/uapi/linux/kvm.h | 1 + virt/kvm/guest_memfd.c | 52 ++++++++++++++++++++++++++++++++++ 4 files changed, 56 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 57061fa29e6a..9541e95fc2ed 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6448,6 +6448,8 @@ specified via KVM_CREATE_GUEST_MEMFD. Currently defined flags: without INIT_SHARED will be marked private). Shared memory can be faulted into host userspace page tables. Private memory cannot. + GUEST_MEMFD_FLAG_WRITE Enable using write() on the guest_memfd file + descriptor. ============================ ================================================
When the KVM MMU performs a PFN lookup to service a guest fault and the backing diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 5bd76cf394fa..5fbf65f49586 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -736,7 +736,7 @@ static inline u64 kvm_gmem_get_supported_flags(struct kvm *kvm) u64 flags = GUEST_MEMFD_FLAG_MMAP;
if (!kvm || kvm_arch_supports_gmem_init_shared(kvm)) - flags |= GUEST_MEMFD_FLAG_INIT_SHARED; + flags |= GUEST_MEMFD_FLAG_INIT_SHARED | GUEST_MEMFD_FLAG_WRITE;
return flags; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 52f6000ab020..5b73d6528f1c 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1601,6 +1601,7 @@ struct kvm_memory_attributes { #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd) #define GUEST_MEMFD_FLAG_MMAP (1ULL << 0) #define GUEST_MEMFD_FLAG_INIT_SHARED (1ULL << 1) +#define GUEST_MEMFD_FLAG_WRITE (1ULL << 2)
struct kvm_create_guest_memfd { __u64 size; diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index ffadc5ee8e04..2c71c21b9189 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -411,6 +411,8 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
static struct file_operations kvm_gmem_fops = { .mmap = kvm_gmem_mmap, + .llseek = default_llseek, + .write_iter = generic_perform_write, .open = generic_file_open, .release = kvm_gmem_release, .fallocate = kvm_gmem_fallocate, @@ -421,6 +423,53 @@ void kvm_gmem_init(struct module *module) kvm_gmem_fops.owner = module; }
+static bool kvm_gmem_supports_write(struct inode *inode) +{ + const u64 flags = (u64)inode->i_private; + + return flags & GUEST_MEMFD_FLAG_WRITE; +} + +static int kvm_gmem_write_begin(const struct kiocb *kiocb, + struct address_space *mapping, + loff_t pos, unsigned int len, + struct folio **folio, void **fsdata) +{ + struct inode *inode = file_inode(kiocb->ki_filp); + + if (!kvm_gmem_supports_write(inode)) + return -ENODEV; + + if (pos + len > i_size_read(inode)) + return -EINVAL; + + if (!IS_ALIGNED(pos, PAGE_SIZE) || !IS_ALIGNED(len, PAGE_SIZE)) + return -EINVAL; + + *folio = kvm_gmem_get_folio(inode, pos >> PAGE_SHIFT); + if (IS_ERR(*folio)) + return PTR_ERR(*folio); + + return 0; +} + +static int kvm_gmem_write_end(const struct kiocb *kiocb, + struct address_space *mapping, + loff_t pos, unsigned int len, + unsigned int copied, + struct folio *folio, void *fsdata) +{ + if (!folio_test_uptodate(folio)) { + folio_zero_range(folio, copied, len - copied); + folio_mark_uptodate(folio); + } + + folio_unlock(folio); + folio_put(folio); + + return copied; +} + static int kvm_gmem_migrate_folio(struct address_space *mapping, struct folio *dst, struct folio *src, enum migrate_mode mode) @@ -469,6 +518,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
static const struct address_space_operations kvm_gmem_aops = { .dirty_folio = noop_dirty_folio, + .write_begin = kvm_gmem_write_begin, + .write_end = kvm_gmem_write_end, .migrate_folio = kvm_gmem_migrate_folio, .error_remove_folio = kvm_gmem_error_folio, #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE @@ -516,6 +567,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) }
file->f_flags |= O_LARGEFILE; + file->f_mode |= FMODE_LSEEK | FMODE_PWRITE;
inode = file->f_inode; WARN_ON(file->f_mapping != inode->i_mapping);
From: Nikita Kalyazin kalyazin@amazon.com
This is to reflect that the write syscall is now implemented for guest_memfd.
Signed-off-by: Nikita Kalyazin kalyazin@amazon.com --- .../testing/selftests/kvm/guest_memfd_test.c | 58 +++++++++++++++++-- 1 file changed, 52 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c index e7d9aeb418d3..ef3e92e18ee6 100644 --- a/tools/testing/selftests/kvm/guest_memfd_test.c +++ b/tools/testing/selftests/kvm/guest_memfd_test.c @@ -24,18 +24,57 @@
static size_t page_size;
-static void test_file_read_write(int fd, size_t total_size) +static void test_file_read(int fd, size_t total_size) { char buf[64];
TEST_ASSERT(read(fd, buf, sizeof(buf)) < 0, "read on a guest_mem fd should fail"); - TEST_ASSERT(write(fd, buf, sizeof(buf)) < 0, - "write on a guest_mem fd should fail"); TEST_ASSERT(pread(fd, buf, sizeof(buf), 0) < 0, "pread on a guest_mem fd should fail"); - TEST_ASSERT(pwrite(fd, buf, sizeof(buf), 0) < 0, - "pwrite on a guest_mem fd should fail"); +} + +static void test_write_supported(int fd, size_t total_size) +{ + size_t page_size = getpagesize(); + void *buf = NULL; + int ret; + + ret = posix_memalign(&buf, page_size, total_size); + TEST_ASSERT_EQ(ret, 0); + + ret = pwrite(fd, buf, page_size, total_size); + TEST_ASSERT(ret == -1, "writing past the file size on a guest_mem fd should fail"); + TEST_ASSERT_EQ(errno, EINVAL); + + ret = pwrite(fd, buf, 1, 0); + TEST_ASSERT(ret == -1, "writing an unaligned count a guest_mem fd should fail"); + TEST_ASSERT_EQ(errno, EINVAL); + + ret = pwrite(fd, buf, page_size, 1); + TEST_ASSERT(ret == -1, "writing to an unaligned offset a guest_mem fd should fail"); + TEST_ASSERT_EQ(errno, EINVAL); + + ret = pwrite(fd, buf, page_size, 0); + TEST_ASSERT(ret == page_size, "write on a guest_mem fd should succeed"); + + free(buf); +} + +static void test_write_not_supported(int fd, size_t total_size) +{ + size_t page_size = getpagesize(); + void *buf = NULL; + int ret; + + ret = posix_memalign(&buf, page_size, total_size); + TEST_ASSERT_EQ(ret, 0); + + ret = pwrite(fd, buf, page_size, 0); + TEST_ASSERT(ret == -1, "write on guest_mem fd should fail"); + TEST_ASSERT_EQ(errno, ENODEV); + + free(buf); }
static void test_mmap_cow(int fd, size_t size) @@ -267,7 +306,7 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags) test_create_guest_memfd_multiple(vm); test_create_guest_memfd_invalid_sizes(vm, flags);
- gmem_test(file_read_write, vm, flags); + gmem_test(file_read, vm, flags);
if (flags & GUEST_MEMFD_FLAG_MMAP) { if (flags & GUEST_MEMFD_FLAG_INIT_SHARED) { @@ -282,6 +321,11 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags) gmem_test(mmap_not_supported, vm, flags); }
+ if (flags & GUEST_MEMFD_FLAG_WRITE) + gmem_test(write_supported, vm, flags); + else + gmem_test(write_not_supported, vm, flags); + gmem_test(file_size, vm, flags); gmem_test(fallocate, vm, flags); gmem_test(invalid_punch_hole, vm, flags); @@ -299,6 +343,8 @@ static void test_guest_memfd(unsigned long vm_type) flags = vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_FLAGS); if (flags & GUEST_MEMFD_FLAG_MMAP) __test_guest_memfd(vm, GUEST_MEMFD_FLAG_MMAP); + if (flags & GUEST_MEMFD_FLAG_WRITE) + __test_guest_memfd(vm, GUEST_MEMFD_FLAG_WRITE);
/* MMAP should always be supported if INIT_SHARED is supported. */ if (flags & GUEST_MEMFD_FLAG_INIT_SHARED)
On 14/11/2025 15:18, Kalyazin, Nikita wrote:
On systems that support shared guest memory, write() is useful, for example, for population of the initial image. Even though the same can also be achieved via userspace mapping and memcpying from userspace, write() provides a more performant option because it does not need to set user page tables and it does not cause a page fault for every page like memcpy would. Note that memcpy cannot be accelerated via MADV_POPULATE_WRITE as it is not supported by guest_memfd and relies on GUP.
Populating 512MiB of guest_memfd on a x86 machine:
- via memcpy: 436 ms
- via write: 202 ms (-54%)
Only PAGE_ALIGNED offset and len are allowed. Even though non-aligned writes are technically possible, when in-place conversion support is implemented [1], the restriction makes handling of mixed shared/private huge pages simpler. write() will only be allowed to populate shared pages.
When direct map removal is implemented [2]
- write() will not be allowed to access pages that have already been removed from direct map
- on completion, write() will remove the populated pages from direct map
While it is technically possible to implement read() syscall on systems with shared guest memory, it is not supported as there is currently no use case for it.
[1] https://lore.kernel.org/kvm/cover.1760731772.git.ackerleytng@google.com [2] https://lore.kernel.org/kvm/20250924151101.2225820-1-patrick.roy@campus.lmu....
I failed to include links to previous versions:
v7: - Sean: add GUEST_MEMFD_FLAG_WRITE and documentation for it - Ackerley: only allow PAGE_ALIGNED offset and len - Sean/Ackerley: formatting fixes
v6: - https://lore.kernel.org/kvm/20251020161352.69257-1-kalyazin@amazon.com - Make write support conditional on mmap support instead of relying on the up-to-date flag to decide whether writing to a page is allowed - James: Remove dependencies on folio_test_large - James: Remove page alignment restriction - James: Formatting fixes
v5: - https://lore.kernel.org/kvm/20250902111951.58315-1-kalyazin@amazon.com - Replace the call to the unexported filemap_remove_folio with zeroing the bytes that could not be copied - Fix checkpatch findings
v4: - https://lore.kernel.org/kvm/20250828153049.3922-1-kalyazin@amazon.com - Switch from implementing the write callback to write_iter - Remove conditional compilation
v3: - https://lore.kernel.org/kvm/20250303130838.28812-1-kalyazin@amazon.com - David/Mike D: Only compile support for the write syscall if CONFIG_KVM_GMEM_SHARED_MEM (now gone) is enabled. v2: - https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com - Switch from an ioctl to the write syscall to implement population
v1: - https://lore.kernel.org/kvm/20241024095429.54052-1-kalyazin@amazon.com
Nikita Kalyazin (2): KVM: guest_memfd: add generic population via write KVM: selftests: update guest_memfd write tests
Documentation/virt/kvm/api.rst | 2 + include/linux/kvm_host.h | 2 +- include/uapi/linux/kvm.h | 1 + .../testing/selftests/kvm/guest_memfd_test.c | 58 +++++++++++++++++-- virt/kvm/guest_memfd.c | 52 +++++++++++++++++ 5 files changed, 108 insertions(+), 7 deletions(-)
base-commit: 8a4821412cf2c1429fffa07c012dd150f2edf78c
2.50.1
linux-kselftest-mirror@lists.linaro.org