Since commit e87412e621f1 ("integrate Zaamo and Zalrsc text (#1304)"),
the A extension has been described as a set of instructions provided by
Zaamo and Zalrsc. Add these two extensions.
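Once parsed and exported through hwprobe, user space can probe for the two
extensions with the riscv_hwprobe syscall. A minimal sketch follows, assuming
the bit names added by this series are RISCV_HWPROBE_EXT_ZAAMO and
RISCV_HWPROBE_EXT_ZALRSC under RISCV_HWPROBE_KEY_IMA_EXT_0:

/*
 * Sketch only: probe for Zaamo/Zalrsc from user space.  The bit names are
 * assumed to match what this series adds to the hwprobe uapi header.
 */
#include <asm/hwprobe.h>
#include <asm/unistd.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_IMA_EXT_0 };

	if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0))
		return 1;

	printf("Zaamo:  %s\n", pair.value & RISCV_HWPROBE_EXT_ZAAMO ? "yes" : "no");
	printf("Zalrsc: %s\n", pair.value & RISCV_HWPROBE_EXT_ZALRSC ? "yes" : "no");
	return 0;
}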
This series is based on the Zc one [1].
Link: https://lore.kernel.org/linux-riscv/20240619113529.676940-1-cleger@rivosinc…
---
Clément Léger (5):
dt-bindings: riscv: add Zaamo and Zalrsc ISA extension description
riscv: add parsing for Zaamo and Zalrsc extensions
riscv: hwprobe: export Zaamo and Zalrsc extensions
RISC-V: KVM: Allow Zaamo/Zalrsc extensions for Guest/VM
KVM: riscv: selftests: Add Zaamo/Zalrsc extensions to get-reg-list test
Documentation/arch/riscv/hwprobe.rst | 8 ++++++++
.../devicetree/bindings/riscv/extensions.yaml | 19 +++++++++++++++++++
arch/riscv/include/asm/hwcap.h | 2 ++
arch/riscv/include/uapi/asm/hwprobe.h | 2 ++
arch/riscv/include/uapi/asm/kvm.h | 2 ++
arch/riscv/kernel/cpufeature.c | 9 ++++++++-
arch/riscv/kernel/sys_hwprobe.c | 2 ++
arch/riscv/kvm/vcpu_onereg.c | 4 ++++
.../selftests/kvm/riscv/get-reg-list.c | 8 ++++++++
9 files changed, 55 insertions(+), 1 deletion(-)
--
2.45.2
This series introduces a new ioctl KVM_TRANSLATE2, which expands on
KVM_TRANSLATE. It is required to implement Hyper-V's
HvTranslateVirtualAddress hyper-call as part of the ongoing effort to
emulate Hyper-V's Virtual Secure Mode (VSM) within KVM and QEMU. The hyper-
call requires several new KVM APIs, one of which is KVM_TRANSLATE2, which
implements the core functionality of the hyper-call. The rest of the
required functionality will be implemented in subsequent series.
Other than translating guest virtual addresses, the ioctl allows the
caller to control whether the accessed and dirty bits are set during the
page walk. It also allows specifying an access mode instead of returning
viable access modes, which enables setting the bits up to the level that
caused a failure. Additionally, the ioctl provides more information about
why the page walk failed, and which page table is responsible. This
functionality is not available within KVM_TRANSLATE, and can't be added
without breaking backwards compatibility, thus a new ioctl is required.
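To make the intended call pattern concrete, here is a rough sketch of how a VMM
could drive the new ioctl. The struct layout, ioctl number and flag names below
are placeholders (loosely modelled on the PWALK_* names in the patch titles);
the authoritative definitions live in the uapi header and documentation added
by this series.

/* Hypothetical sketch only -- not the uapi defined by this series. */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

struct kvm_translation2_sketch {        /* placeholder layout */
	uint64_t linear_address;        /* in:  GVA to translate */
	uint32_t flags;                 /* in:  page-walk behaviour */
	uint32_t access;                /* in:  requested access mode */
	uint64_t physical_address;      /* out: GPA on success */
	uint32_t error_code;            /* out: why the walk failed */
	uint8_t  valid;                 /* out: non-zero on success */
};

/* Placeholder values; cf. PWALK_SET_ACCESSED / PWALK_SET_DIRTY in the series. */
#define SKETCH_SET_ACCESSED	(1u << 0)
#define SKETCH_SET_DIRTY	(1u << 1)
#define SKETCH_KVM_TRANSLATE2	_IOWR('k', 0xff, struct kvm_translation2_sketch)

static int translate_gva(int vcpu_fd, uint64_t gva, uint64_t *gpa)
{
	struct kvm_translation2_sketch tr = {
		.linear_address = gva,
		.flags = SKETCH_SET_ACCESSED | SKETCH_SET_DIRTY,
	};

	if (ioctl(vcpu_fd, SKETCH_KVM_TRANSLATE2, &tr) < 0 || !tr.valid) {
		fprintf(stderr, "translation failed, error_code=%u\n",
			tr.error_code);
		return -1;
	}

	*gpa = tr.physical_address;
	return 0;
}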
The ioctl was designed to facilitate as many other use cases as possible
apart from VSM. The error codes were intentionally chosen to be broad
enough to avoid exposing architecture specific details. Even though
HvTranslateVirtualAddress only really needs one flag to set the accessed
and dirty bits whenever possible, that was split into several flags so
that future users can choose more granularly when these bits should be set.
Furthermore, as much information as possible is provided to the caller.
The patch series includes selftests for the ioctl, as well as fuzz
testing on random garbage guest page table entries. All previously passing
KVM selftests and KVM unit tests still pass.
Series overview:
- 1: Document the new ioctl
- 2-11: Update the page walker in preparation
- 12-14: Implement the ioctl
- 15: Implement testing
This series, alongside the series by Nicolas Saenz Julienne [1]
introducing the core building blocks for VSM and the accompanying QEMU
implementation [2], is capable of booting Windows Server 2019.
Both series are also available on GitHub [3].
[1] https://lore.kernel.org/linux-hyperv/20240609154945.55332-1-nsaenz@amazon.c…
[2] https://github.com/vianpl/qemu/tree/vsm/next
[3] https://github.com/vianpl/linux/tree/vsm/next
Best,
Nikolas
Nikolas Wipper (15):
KVM: Add API documentation for KVM_TRANSLATE2
KVM: x86/mmu: Abort page walk if permission checks fail
KVM: x86/mmu: Introduce exception flag for unmapped GPAs
KVM: x86/mmu: Store GPA in exception if applicable
KVM: x86/mmu: Introduce flags parameter to page walker
KVM: x86/mmu: Implement PWALK_SET_ACCESSED in page walker
KVM: x86/mmu: Implement PWALK_SET_DIRTY in page walker
KVM: x86/mmu: Implement PWALK_FORCE_SET_ACCESSED in page walker
KVM: x86/mmu: Introduce status parameter to page walker
KVM: x86/mmu: Implement PWALK_STATUS_READ_ONLY_PTE_GPA in page walker
KVM: x86: Introduce generic gva to gpa translation function
KVM: Introduce KVM_TRANSLATE2
KVM: Add KVM_TRANSLATE2 stub
KVM: x86: Implement KVM_TRANSLATE2
KVM: selftests: Add test for KVM_TRANSLATE2
Documentation/virt/kvm/api.rst | 131 ++++++++
arch/x86/include/asm/kvm_host.h | 18 +-
arch/x86/kvm/hyperv.c | 3 +-
arch/x86/kvm/kvm_emulate.h | 8 +
arch/x86/kvm/mmu.h | 10 +-
arch/x86/kvm/mmu/mmu.c | 7 +-
arch/x86/kvm/mmu/paging_tmpl.h | 80 +++--
arch/x86/kvm/x86.c | 123 ++++++-
include/linux/kvm_host.h | 6 +
include/uapi/linux/kvm.h | 33 ++
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/kvm_translate2.c | 310 ++++++++++++++++++
virt/kvm/kvm_main.c | 41 +++
13 files changed, 724 insertions(+), 47 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_translate2.c
--
2.40.1
`MFD_NOEXEC_SEAL` should remove the executable bits and set `F_SEAL_EXEC`
to prevent further modifications to the executable bits as per the comment
in the uapi header file:
not executable and sealed to prevent changing to executable
However, commit 105ff5339f498a ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
that introduced this feature made it so that `MFD_NOEXEC_SEAL` unsets
`F_SEAL_SEAL`, essentially acting as a superset of `MFD_ALLOW_SEALING`.
Nothing implies that it should be so, and indeed up until the second version
of the patchset[0] that introduced `MFD_EXEC` and `MFD_NOEXEC_SEAL`,
`F_SEAL_SEAL` was not removed; that changed in the third revision of the
patchset[1] without a clear explanation.
This behaviour is surprising for application developers: there is no
documentation that would reveal that `MFD_NOEXEC_SEAL` has the additional
effect of `MFD_ALLOW_SEALING`. Additionally, combined with `vm.memfd_noexec=2`,
it has the effect of making all memfds initially sealable.
So do not remove `F_SEAL_SEAL` when `MFD_NOEXEC_SEAL` is requested,
thereby returning to the pre-Linux 6.3 behaviour of only allowing
sealing when `MFD_ALLOW_SEALING` is specified.
Now, this is technically a uapi break. However, the damage is expected
to be minimal. To trigger user visible change, a program has to do the
following steps:
- create memfd:
- with `MFD_NOEXEC_SEAL`,
- without `MFD_ALLOW_SEALING`;
- try to add seals / check the seals.
But that seems unlikely to happen intentionally since this change
essentially reverts the kernel's behaviour to that of Linux <6.3,
so if a program worked correctly on those older kernels, it will
likely work correctly after this change.
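For illustration, here is a minimal reproducer of the user-visible difference;
memfd_create(), F_GET_SEALS/F_ADD_SEALS and the MFD_*/F_SEAL_* constants are
the existing uapi, and only the F_ADD_SEALS outcome changes with this patch:

/*
 * With the current behaviour, MFD_NOEXEC_SEAL implicitly cleared F_SEAL_SEAL,
 * so F_ADD_SEALS below succeeds.  With this patch (as on pre-6.3 kernels),
 * F_SEAL_SEAL stays set because MFD_ALLOW_SEALING was not requested, and
 * F_ADD_SEALS fails with EPERM.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif

int main(void)
{
	int fd = memfd_create("demo", MFD_CLOEXEC | MFD_NOEXEC_SEAL);

	if (fd < 0) {
		perror("memfd_create");
		return 1;
	}

	printf("initial seals: %#x\n", fcntl(fd, F_GET_SEALS));

	if (fcntl(fd, F_ADD_SEALS, F_SEAL_GROW) < 0)
		perror("F_ADD_SEALS");		/* expected with this patch */
	else
		printf("F_SEAL_GROW added\n");

	close(fd);
	return 0;
}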
I have used Debian Code Search and GitHub to try to find potential
breakages, and I could only find a single one. dbus-broker's
memfd_create() wrapper is aware of this implicit `MFD_ALLOW_SEALING`
behaviour, and tries to work around it[2]. This workaround will
break. Luckily, this only affects the test suite, it does not affect
the normal operations of dbus-broker. There is a PR with a fix[3].
I also carried out a smoke test by building a kernel with this change
and booting an Arch Linux system into GNOME and Plasma sessions.
There was also a previous attempt to address this peculiarity by
introducing a new flag[4].
[0]: https://lore.kernel.org/lkml/20220805222126.142525-3-jeffxu@google.com/
[1]: https://lore.kernel.org/lkml/20221202013404.163143-3-jeffxu@google.com/
[2]: https://github.com/bus1/dbus-broker/blob/9eb0b7e5826fc76cad7b025bc46f267d4a…
[3]: https://github.com/bus1/dbus-broker/pull/366
[4]: https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/
Cc: stable@vger.kernel.org
Signed-off-by: Barnabás Pőcze <pobrn@protonmail.com>
---
* v3: https://lore.kernel.org/linux-mm/20240611231409.3899809-1-jeffxu@chromium.o…
* v2: https://lore.kernel.org/linux-mm/20240524033933.135049-1-jeffxu@google.com/
* v1: https://lore.kernel.org/linux-mm/20240513191544.94754-1-pobrn@protonmail.co…
This fourth version returns to removing the inconsistency, as opposed to documenting
its existence, with the same code change as v1 but with a somewhat extended commit
message. It is sent because I believe it is worth at least a try; it can easily be
reverted if more application breakage is discovered than initially anticipated.
---
mm/memfd.c | 9 ++++-----
tools/testing/selftests/memfd/memfd_test.c | 2 +-
2 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/mm/memfd.c b/mm/memfd.c
index 7d8d3ab3fa37..8b7f6afee21d 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -356,12 +356,11 @@ SYSCALL_DEFINE2(memfd_create,
inode->i_mode &= ~0111;
file_seals = memfd_file_seals_ptr(file);
- if (file_seals) {
- *file_seals &= ~F_SEAL_SEAL;
+ if (file_seals)
*file_seals |= F_SEAL_EXEC;
- }
- } else if (flags & MFD_ALLOW_SEALING) {
- /* MFD_EXEC and MFD_ALLOW_SEALING are set */
+ }
+
+ if (flags & MFD_ALLOW_SEALING) {
file_seals = memfd_file_seals_ptr(file);
if (file_seals)
*file_seals &= ~F_SEAL_SEAL;
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 95af2d78fd31..7b78329f65b6 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -1151,7 +1151,7 @@ static void test_noexec_seal(void)
mfd_def_size,
MFD_CLOEXEC | MFD_NOEXEC_SEAL);
mfd_assert_mode(fd, 0666);
- mfd_assert_has_seals(fd, F_SEAL_EXEC);
+ mfd_assert_has_seals(fd, F_SEAL_SEAL | F_SEAL_EXEC);
mfd_fail_chmod(fd, 0777);
close(fd);
}
--
2.45.2
Hello,
This patchset is our exploration of how to support 1G pages in guest_memfd, and
how the pages will be used in Confidential VMs.
The patchset covers:
+ How to get 1G pages
+ Allowing mmap() of guest_memfd to userspace so that both private and shared
memory can use the same physical pages
+ Splitting and reconstructing pages to support conversions and mmap()
+ How the VM, userspace and guest_memfd interact to support conversions
+ Selftests to test all the above
+ Selftests also demonstrate the conversion flow between VM, userspace and
guest_memfd.
Why 1G pages in guest_memfd?
To bring guest_memfd to performance and memory-savings parity with VMs that are
backed by HugeTLBfs.
+ Performance is improved with 1G pages through more TLB hits and faster page walks
on TLB misses.
+ Memory savings from 1G pages come from HugeTLB Vmemmap Optimization (HVO).
Options for 1G page support:
1. HugeTLB
2. Contiguous Memory Allocator (CMA)
3. Other suggestions are welcome!
Comparison between options:
1. HugeTLB
+ Refactor HugeTLB to separate allocator from the rest of HugeTLB
+ Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
+ Near term: Allows co-tenancy of HugeTLB and guest_memfd backed VMs
+ Pro: Can provide iterative steps toward new future allocator
+ Unexplored: Managing userspace-visible changes
+ e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
but not when future allocator is used
2. CMA
+ Port some HugeTLB features to be applied on CMA
+ Pro: Clean slate
What would refactoring HugeTLB involve?
(Some refactoring was done in this RFC, more can be done.)
1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
+ Brings more modularity to HugeTLB
+ No functionality change intended
+ Likely step towards HugeTLB's integration into core-mm
2. guest_memfd will use just the allocator component of HugeTLB, not including
the complex parts of HugeTLB like
+ Userspace reservations (resv_map)
+ Shared PMD mappings
+ Special page walkers
What features would need to be ported to CMA?
+ Improved allocation guarantees
+ Per NUMA node pool of huge pages
+ Subpools per guest_memfd
+ Memory savings
+ Something like HugeTLB Vmemmap Optimization
+ Configuration/reporting features
+ Configuration of number of pages available (and per NUMA node) at and
after host boot
+ Reporting of memory usage/availability statistics at runtime
HugeTLB was picked as the source of 1G pages for this RFC because it allows a
graceful transition, and retains memory savings from HVO.
To illustrate this, if a host machine uses HugeTLBfs to back VMs, and a
confidential VM were to be scheduled on that host, some HugeTLBfs pages would
have to be given up and returned to CMA so that guest_memfd pages could be
rebuilt from that memory. Removing HVO and reapplying it on the new guest_memfd
memory requires memory to be held in reserve, which not only slows down memory
allocation but also trims the benefits of HVO. Memory would have to be reserved
on the host to facilitate these transitions.
Improving how guest_memfd uses the allocator in a future revision of this RFC:
To provide an easier transition away from HugeTLB, guest_memfd's use of HugeTLB
should be limited to these allocator functions:
+ reserve(node, page_size, num_pages) => opaque handle
+ Used when a guest_memfd inode is created to reserve memory from backend
allocator
+ allocate(handle, mempolicy, page_size) => folio
+ To allocate a folio from guest_memfd's reservation
+ split(handle, folio, target_page_size) => void
+ To take a huge folio, and split it to smaller folios, restore to filemap
+ reconstruct(handle, first_folio, nr_pages) => void
+ To take a folio, and reconstruct a huge folio out of nr_pages from the
first_folio
+ free(handle, folio) => void
+ To return folio to guest_memfd's reservation
+ error(handle, folio) => void
+ To handle memory errors
+ unreserve(handle) => void
+ To return guest_memfd's reservation to allocator backend
Userspace should only provide a page size when creating a guest_memfd and should
not have to specify HugeTLB.
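As a rough illustration of the shape such an interface could take, the list
above might be rendered as an ops table that guest_memfd calls into. The names
and signatures below are only a paraphrase of the list, not code from this RFC:

/* Sketch only: a possible C rendering of the allocator interface above. */
#include <linux/types.h>

struct folio;
struct mempolicy;

struct guestmem_allocator_ops {
	/* Reserve num_pages pages of page_size on node; returns an opaque handle. */
	void *(*reserve)(int node, size_t page_size, unsigned long num_pages);

	/* Allocate one folio of page_size from the reservation. */
	struct folio *(*allocate)(void *handle, struct mempolicy *mpol,
				  size_t page_size);

	/* Split a huge folio into target_page_size folios, restore to filemap. */
	void (*split)(void *handle, struct folio *folio, size_t target_page_size);

	/* Rebuild a huge folio from nr_pages starting at first_folio. */
	void (*reconstruct)(void *handle, struct folio *first_folio,
			    unsigned long nr_pages);

	/* Return a folio to the reservation. */
	void (*free)(void *handle, struct folio *folio);

	/* Handle a memory error on a folio. */
	void (*error)(void *handle, struct folio *folio);

	/* Return the whole reservation to the backend allocator. */
	void (*unreserve)(void *handle);
};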
Overview of patches:
+ Patches 01-12
+ Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts from
HugeTLB, and to expose HugeTLB functions.
+ Patches 13-16
+ Letting guest_memfd use HugeTLB
+ Creation of each guest_memfd reserves pages from HugeTLB's global hstate
and puts them into the guest_memfd inode's subpool
+ Each folio allocation takes a page from the guest_memfd inode's subpool
+ Patches 17-21
+ Selftests for new HugeTLB features in guest_memfd
+ Patches 22-24
+ More small changes on the HugeTLB side to expose functions needed by
guest_memfd
+ Patch 25:
+ Uses the newly available functions from patches 22-24 to split HugeTLB
pages. In this patch, HugeTLB folios are always split to 4K before any
usage, private or shared.
+ Patches 26-28
+ Allow mmap() in guest_memfd and faulting in shared pages
+ Patch 29
+ Enables conversion between private/shared pages
+ Patch 30
+ Required to zero folios after conversions to avoid leaking initialized
kernel memory
+ Patches 31-38
+ Add selftests to test mapping pages to userspace, guest/host memory
sharing, and update conversion tests
+ Patch 33 illustrates the conversion flow between VM/userspace/guest_memfd
+ Patch 39
+ Dynamically split and reconstruct HugeTLB pages instead of always
splitting before use. All earlier selftests are expected to still pass.
TODOs:
+ Add logic to wait for safe_refcount [1]
+ Look into lazy splitting/reconstruction of pages
+ Currently, when the KVM_SET_MEMORY_ATTRIBUTES ioctl is invoked, not only are
the mem_attr_array and faultability updated, the pages in the requested range
are also split/reconstructed as necessary. We want to look into delaying
splitting/reconstruction to fault time.
+ Solve race between folios being faulted in and being truncated
+ When running private_mem_conversions_test with more than 1 vCPU, a folio
getting truncated may get faulted in by another process, causing elevated
mapcounts when the folio is freed (VM_BUG_ON_FOLIO).
+ Add intermediate splits (1G should first split to 2M and not split directly to
4K)
+ Use guest's lock instead of hugetlb_lock
+ Use multi-index xarray/replace xarray with some other data struct for
faultability flag
+ Refactor HugeTLB better, present generic allocator interface
Please let us know your thoughts on:
+ HugeTLB as the choice of transitional allocator backend
+ Refactoring HugeTLB to provide generic allocator interface
+ Shared/private conversion flow
+ Requiring user to request kernel to unmap pages from userspace using
madvise(MADV_DONTNEED)
+ Failing conversion on elevated mapcounts/pincounts/refcounts
+ Process of splitting/reconstructing page
+ Anything else!
[1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quic…
Ackerley Tng (37):
mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
mm: hugetlb: Remove unnecessary check for avoid_reserve
mm: mempolicy: Refactor out policy_node_nodemask()
mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
interpret mempolicy instead of vma
mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
mm: hugetlb: Refactor out hugetlb_alloc_folio
mm: truncate: Expose preparation steps for truncate_inode_pages_final
mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
mm: hugetlb: Add option to create new subpool without using surplus
mm: hugetlb: Expose hugetlb_acct_memory()
mm: hugetlb: Move and expose hugetlb_zero_partial_page()
KVM: guest_memfd: Make guest mem use guest mem inodes instead of
anonymous inodes
KVM: guest_memfd: hugetlb: initialization and cleanup
KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
KVM: selftests: Support various types of backing sources for private
memory
KVM: selftests: Update test for various private memory backing source
types
KVM: selftests: Add private_mem_conversions_test.sh
KVM: selftests: Test that guest_memfd usage is reported via hugetlb
mm: hugetlb: Expose vmemmap optimization functions
mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
mm: hugetlb: Add functions to add/move/remove from hugetlb lists
KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
KVM: guest_memfd: Allow mmapping guest_memfd files
KVM: guest_memfd: Use vm_type to determine default faultability
KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
KVM: selftests: Allow vm_set_memory_attributes to be used without
asserting return value of 0
KVM: selftests: Test using guest_memfd memory from userspace
KVM: selftests: Test guest_memfd memory sharing between guest and host
KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
guest_memfd
KVM: selftests: Test that pinned pages block KVM from setting memory
attributes to PRIVATE
KVM: selftests: Refactor vm_mem_add to be more flexible
KVM: selftests: Add helper to perform madvise by memslots
KVM: selftests: Update private_mem_conversions_test for mmap()able
guest_memfd
Vishal Annapurve (2):
KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
fs/hugetlbfs/inode.c | 35 +-
include/linux/hugetlb.h | 54 +-
include/linux/kvm_host.h | 1 +
include/linux/mempolicy.h | 2 +
include/linux/mm.h | 1 +
include/uapi/linux/kvm.h | 26 +
include/uapi/linux/magic.h | 1 +
mm/hugetlb.c | 346 ++--
mm/hugetlb_vmemmap.h | 11 -
mm/mempolicy.c | 36 +-
mm/truncate.c | 26 +-
tools/include/linux/kernel.h | 4 +-
tools/testing/selftests/kvm/Makefile | 3 +
.../kvm/guest_memfd_hugetlb_reporting_test.c | 222 +++
.../selftests/kvm/guest_memfd_pin_test.c | 104 ++
.../selftests/kvm/guest_memfd_sharing_test.c | 160 ++
.../testing/selftests/kvm/guest_memfd_test.c | 238 ++-
.../testing/selftests/kvm/include/kvm_util.h | 45 +-
.../testing/selftests/kvm/include/test_util.h | 18 +
tools/testing/selftests/kvm/lib/kvm_util.c | 443 +++--
tools/testing/selftests/kvm/lib/test_util.c | 99 ++
.../kvm/x86_64/private_mem_conversions_test.c | 158 +-
.../x86_64/private_mem_conversions_test.sh | 91 +
.../kvm/x86_64/private_mem_kvm_exits_test.c | 11 +-
virt/kvm/guest_memfd.c | 1563 ++++++++++++++++-
virt/kvm/kvm_main.c | 17 +
virt/kvm/kvm_mm.h | 16 +
27 files changed, 3288 insertions(+), 443 deletions(-)
create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
--
2.46.0.598.g6f2099f65c-goog
The include.sh file is generated for inclusion by other scripts and should not be
an executable test; otherwise, it will be added to kselftest-list.txt. Additionally,
add the executable bit to test.py at the same time to ensure it runs properly.
Fixes: 3ade6ce1255e ("selftests: rds: add testing infrastructure")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
tools/testing/selftests/net/rds/Makefile | 3 ++-
tools/testing/selftests/net/rds/test.py | 0
2 files changed, 2 insertions(+), 1 deletion(-)
mode change 100644 => 100755 tools/testing/selftests/net/rds/test.py
diff --git a/tools/testing/selftests/net/rds/Makefile b/tools/testing/selftests/net/rds/Makefile
index da9714bc7aad..cf30307a829b 100644
--- a/tools/testing/selftests/net/rds/Makefile
+++ b/tools/testing/selftests/net/rds/Makefile
@@ -4,9 +4,10 @@ all:
@echo mk_build_dir="$(shell pwd)" > include.sh
TEST_PROGS := run.sh \
- include.sh \
test.py
+TEST_FILES := include.sh
+
EXTRA_CLEAN := /tmp/rds_logs
include ../../lib.mk
diff --git a/tools/testing/selftests/net/rds/test.py b/tools/testing/selftests/net/rds/test.py
old mode 100644
new mode 100755
--
2.39.3 (Apple Git-146)