Currently there is no means of determining whether a given page in a mapping
range is designated a guard region (as installed via madvise() using the
MADV_GUARD_INSTALL flag).
This is generally not an issue, but in some instances users may wish to
determine whether this is the case.
This series adds this ability via /proc/$pid/pagemap, updates the
documentation and adds a self test to assert that this functions correctly.
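For illustration, a userspace check could look roughly as follows. This is a
hedged sketch: the exact bit position is whatever the updated pagemap.rst
documents, and the PM_GUARD_REGION value below is an assumption made for the
example.

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Assumed bit position for the example; consult the updated
     * Documentation/admin-guide/mm/pagemap.rst for the real value. */
    #define PM_GUARD_REGION (1ULL << 58)

    /* Returns 1 if the page containing 'addr' is a guard region, 0 if not,
     * -1 on error. */
    static int page_is_guard_region(void *addr)
    {
            long pagesize = sysconf(_SC_PAGESIZE);
            off_t offset = ((uintptr_t)addr / pagesize) * sizeof(uint64_t);
            uint64_t entry;
            int fd = open("/proc/self/pagemap", O_RDONLY);

            if (fd < 0)
                    return -1;
            if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
                    close(fd);
                    return -1;
            }
            close(fd);
            return !!(entry & PM_GUARD_REGION);
    }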
Lorenzo Stoakes (2):
fs/proc/task_mmu: add guard region bit to pagemap
tools/selftests: add guard region test for /proc/$pid/pagemap
Documentation/admin-guide/mm/pagemap.rst | 3 +-
fs/proc/task_mmu.c | 6 ++-
tools/testing/selftests/mm/guard-regions.c | 47 ++++++++++++++++++++++
tools/testing/selftests/mm/vm_util.h | 1 +
4 files changed, 55 insertions(+), 2 deletions(-)
--
2.48.1
The first patch makes use of the correct terminology for synchronous and
asynchronous errors. The second patch checks whether PROT_MTE is
supported on hugetlb mappings before continuing with the tests. Such
support was added in 6.13 but people tend to use current kselftests on
older kernels. To avoid reporting failures on such kernels, simply skip
the tests.
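A probe for this can be as simple as attempting to apply PROT_MTE to a
hugetlb mapping up front; the sketch below illustrates the idea and is not
the exact code in the patch.

    #include <stddef.h>
    #include <sys/mman.h>

    #ifndef PROT_MTE
    #define PROT_MTE 0x20
    #endif

    /* Returns non-zero if PROT_MTE can be applied to a hugetlb mapping.
     * Pre-6.13 kernels reject the mprotect() call, in which case the
     * hugetlb MTE tests should be skipped rather than reported as failed. */
    static int hugetlb_mte_supported(size_t hugepage_size)
    {
            void *p = mmap(NULL, hugepage_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            int ret;

            if (p == MAP_FAILED)
                    return 0;       /* no hugetlb pages available: skip too */

            ret = mprotect(p, hugepage_size, PROT_READ | PROT_WRITE | PROT_MTE);
            munmap(p, hugepage_size);
            return ret == 0;
    }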
Catalin Marinas (2):
kselftest/arm64: mte: Use the correct naming for tag check modes in
check_hugetlb_options.c
kselftest/arm64: mte: Skip the hugetlb tests if MTE not supported on
such mappings
.../arm64/mte/check_hugetlb_options.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
The guard regions feature was initially implemented to support anonymous
mappings only, excluding shmem.
This was done so as to introduce the feature carefully and incrementally
and to be conservative when considering the various caveats and corner
cases that are applicable to file-backed mappings but not to anonymous
ones.
Now that this feature has landed in 6.13, it is time to revisit this and to
extend this functionality to file-backed and shmem mappings.
In order to make this maximally useful, and since one may map file-backed
mappings read-only (for instance ELF images), we also remove the
restriction on read-only mappings and permit the establishment of guard
regions in any non-hugetlb, non-mlock()'d mapping.
It is safe to permit the establishment of guard regions in read-only
mappings because the guard regions only reduce access to the mapping, and
when removed simply reinstate the existing attributes of the underlying
VMA, meaning no access violations can occur.
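For instance, a read-only, file-backed mapping can now carry guard regions;
a minimal sketch (error handling omitted, and the fallback #defines only
matter where libc headers predate the flags):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef MADV_GUARD_INSTALL
    #define MADV_GUARD_INSTALL 102
    #define MADV_GUARD_REMOVE  103
    #endif

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);
            int fd = open("/bin/true", O_RDONLY);   /* any read-only file */
            char *p = mmap(NULL, 4 * psz, PROT_READ, MAP_PRIVATE, fd, 0);

            /* Guard the second page: touching p[psz] now raises SIGSEGV. */
            madvise(p + psz, psz, MADV_GUARD_INSTALL);

            /* Removing the guard simply reinstates the VMA's attributes,
             * so subsequent reads of p[psz] fault the file page in again. */
            madvise(p + psz, psz, MADV_GUARD_REMOVE);
            return 0;
    }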
While the change in kernel code introduced in this series is small, the
majority of the effort here is spent in extending the testing to assert
that the feature works correctly across numerous file-backed mapping
scenarios.
Every guard region self-test previously performed against anonymous memory
(where relevant, i.e. not anonymous-only) has now been updated to also be
performed against shmem and a mapping of a file in the working directory.
This confirms that all cases also function correctly for file-backed guard
regions.
In addition, a number of other tests are added for specific file-backed
mapping scenarios.
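The mechanism used to run each test against multiple backings is, broadly,
the kselftest harness fixture-variant facility; the identifiers in the
sketch below are illustrative and not necessarily those used in
guard-regions.c.

    #include "../kselftest_harness.h"

    /* Illustrative shape only: one test body, run once per backing variant. */
    enum backing_type { ANON, SHMEM, LOCAL_FILE };

    FIXTURE(guard_backing)
    {
            char *ptr;
    };

    FIXTURE_VARIANT(guard_backing)
    {
            enum backing_type backing;
    };

    FIXTURE_VARIANT_ADD(guard_backing, anon)  { .backing = ANON };
    FIXTURE_VARIANT_ADD(guard_backing, shmem) { .backing = SHMEM };
    FIXTURE_VARIANT_ADD(guard_backing, file)  { .backing = LOCAL_FILE };

    FIXTURE_SETUP(guard_backing)
    {
            /* mmap anonymous memory, a memfd, or a file in the working
             * directory, depending on variant->backing */
    }

    FIXTURE_TEARDOWN(guard_backing)
    {
    }

    TEST_F(guard_backing, install_remove)
    {
            /* identical assertions, exercised for every backing type */
    }

    TEST_HARNESS_MAIN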
There are a number of other concerns that one might have with regard to
guard regions, addressed below:
Readahead
~~~~~~~~~
Readahead is a process through which the page cache is populated on the
assumption that sequential reads will occur, thus amortising I/O and,
through a clever use of the PG_readahead folio flag established during
major fault and checked upon minor fault, provides for asynchronous I/O to
occur as data is processed, reducing I/O stalls as data is faulted in.
Guard regions do not alter this mechanism, which operates at the folio and
fault level, but do of course prevent the faulting of folios that would
otherwise be mapped.
In the instance of a major fault prior to a guard region, synchronous
readahead will occur, including populating folios in the page cache which
the guard regions will, in the case of the mapping in question, prevent
access to.
In addition, if PG_readahead is placed in a folio that is now inaccessible,
this will prevent asynchronous readahead from occurring as it would
otherwise do.
However, there are mechanisms for heuristically resetting this within
readahead regardless, which will 'recover' correct readahead behaviour.
Readahead presumes sequential data access; the presence of a guard region
clearly indicates that, at least in the guard region, no such sequential
access will occur, as it cannot occur there.
So this should have very little impact on any real workload. The far more
important point is whether readahead causes incorrect or
inappropriate mapping of ranges disallowed by the presence of guard
regions - this is not the case, as readahead does not 'pre-fault' memory in
this fashion.
At any rate, any mechanism which would attempt to do so would hit the usual
page fault paths, which correctly handle PTE markers as with anonymous
mappings.
Fault-Around
~~~~~~~~~~~~
The fault-around logic, in a similar vein to readahead, attempts to improve
efficiency with regard to file-backed memory mappings, however it differs
in that it does not try to fetch folios into the page cache that are about
to be accessed, but rather pre-maps a range of folios around the faulting
address.
Guard regions' use of PTE markers makes this relatively trivial, as
this case is already handled - see filemap_map_folio_range() and
filemap_map_order0_folio() - in both instances, the solution is to simply
keep the established page table mappings and let the fault handler take
care of PTE markers, as per the comment:
/*
* NOTE: If there're PTE markers, we'll leave them to be
* handled in the specific fault path, and it'll prohibit
* the fault-around logic.
*/
This works, as establishing guard regions results in page table mappings
with PTE markers, and removing guard regions clears those markers.
Truncation
~~~~~~~~~~
File truncation will not eliminate existing guard regions, as the
truncation operation will ultimately zap the range via
unmap_mapping_range(), which specifically excludes PTE markers.
Zapping
~~~~~~~
Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(),
which specifically deals with guard entries, leaving them intact except in
instances such as process teardown or munmap() where they need to be
removed.
Reclaim
~~~~~~~
When reclaim is performed on file-backed folios, it ultimately invokes
try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte()
will ultimately abort the operation for the guard region mapping. If large,
then check_pte() will determine that this is a non-device private
entry/device-exclusive entry 'swap' PTE and thus abort the operation in
that instance.
Therefore, no odd things happen in the instance of reclaim being attempted
upon a file-backed guard region.
Hole Punching
~~~~~~~~~~~~~
This updates the page cache and ultimately invokes unmap_mapping_range(),
which explicitly leaves PTE markers in place.
Because the establishment of guard regions zapped any existing mappings to
file-backed folios, once the guard regions are removed the
hole-punched region will be faulted in as usual and everything will behave
as expected.
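As a concrete illustration (a sketch for exposition, not one of the series'
tests): punch a hole under a guard region, then remove the guard, and the
page comes back as a fresh, zero-filled one.

    #define _GNU_SOURCE             /* memfd_create(), fallocate() */
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef MADV_GUARD_INSTALL
    #define MADV_GUARD_INSTALL 102
    #define MADV_GUARD_REMOVE  103
    #endif

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);
            int fd = memfd_create("holepunch", 0);
            char *p;

            ftruncate(fd, 4 * psz);
            p = mmap(NULL, 4 * psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

            p[psz] = 'x';                                   /* populate page 1 */
            madvise(p + psz, psz, MADV_GUARD_INSTALL);      /* guard page 1 */

            /* Punch a hole under the guard; the guard marker itself survives. */
            fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, psz, psz);

            /* Once the guard is gone, the hole-punched page faults in again
             * as a zero-filled page, as expected. */
            madvise(p + psz, psz, MADV_GUARD_REMOVE);
            return p[psz] == '\0' ? 0 : 1;
    }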
Lorenzo Stoakes (4):
mm: allow guard regions in file-backed and read-only mappings
selftests/mm: rename guard-pages to guard-regions
tools/selftests: expand all guard region tests to file-backed
tools/selftests: add file/shmem-backed mapping guard region tests
mm/madvise.c | 8 +-
tools/testing/selftests/mm/.gitignore | 2 +-
tools/testing/selftests/mm/Makefile | 2 +-
.../mm/{guard-pages.c => guard-regions.c} | 921 ++++++++++++++++--
4 files changed, 821 insertions(+), 112 deletions(-)
rename tools/testing/selftests/mm/{guard-pages.c => guard-regions.c} (58%)
--
2.48.1
[ Background ]
On ARM GIC systems and others, the target address of the MSI is translated
by the IOMMU. For GIC, the MSI address page is called "ITS" page. When the
IOMMU is disabled, the MSI address is programmed to the physical location
of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS
page is behind the IOMMU, so the MSI address is programmed to an allocated
IO virtual address (a.k.a IOVA), e.g. 0xFFFF0000, which must be mapped to
the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
When a 2-stage translation is enabled, the IOVA is still used to program
the MSI address, though the mappings will be in two stages:
IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
(IPA stands for Intermediate Physical Address).
If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the
IOVA is dynamically allocated from the top of the IOVA space. If attached
to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is
fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI,
which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
So far, this IOMMU_RESV_SW_MSI works well, as the kernel is entirely in charge
of the IOMMU translation (1-stage translation), since the IOVA for the ITS
page is fixed and known by the kernel. However, when a virtual machine
enables nested IOMMU translation (2-stage), the guest kernel directly controls the
stage-1 translation with an IOMMU_DOMAIN_DMA, mapping a vITS page (at an
IPA 0x80900000) onto its own IOVA space (e.g. 0xEEEE0000). Then, the host
kernel cannot know the guest-level IOVA to use when programming the MSI address.
There have been two approaches to solve this problem:
1. Create an identity mapping in the stage-1. VMM could insert a few RMRs
(Reserved Memory Regions) in guest's IORT. Then the guest kernel would
fetch these RMR entries from the IORT and create an IOMMU_RESV_DIRECT
region per iommu group for a direct mapping. Eventually, the mappings
would look like: IOVA (0x8000000) ===> IPA (0x8000000) ===> PA (0x20200000)
This requires an IOMMUFD ioctl for kernel and VMM to agree on the IPA.
2. Forward the guest-level MSI IOVA captured by VMM to the host-level GIC
driver, to program the correct MSI IOVA. Forward the VMM-defined vITS
page location (IPA) to the kernel for the stage-2 mapping. Eventually:
IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
This requires a VFIO ioctl (for IOVA) and an IOMMUFD ioctl (for IPA).
It is worth mentioning that when Eric Auger was working on the same topic
with the VFIO iommu uAPI, he had a solution for approach (2) first, and then
switched to approach (1), as suggested by Jean-Philippe, to reduce
complexity.
Approach (1) basically resembles the existing VFIO passthrough, which has
a 1-stage mapping for the unmanaged domain, merely shifting the MSI
mapping from stage 1 (no-viommu case) to stage 2 (has-viommu case). So,
it could reuse the existing IOMMU_RESV_SW_MSI piece, by sharing the same
idea of "VMM leaving everything to the kernel".
Approach (2) is an ideal solution, yet it requires additional effort for
the kernel to be aware of the stage-1 gIOVAs and the stage-2 IPAs for vITS
page(s), which demands close cooperation from the VMM.
* It also brings some complicated use cases to the table, where the host
and/or guest system(s) may have multiple ITS pages.
[ Execution ]
Though these two approaches feel very different on the surface, they can
share some underlying common infrastructure. Currently, only one pair of
sw_msi functions (prepare/compose) is provided by dma-iommu for irqchip
drivers to directly use. There could be different versions of functions
from different domain owners: for existing VFIO passthrough cases and in-
kernel DMA domain cases, reuse the existing dma-iommu's version of sw_msi
functions; for nested translation use cases, there can be another version
of sw_msi functions to handle mapping and msi_msg(s) differently.
As a part-1 series, this refactors the core infrastructure (an illustrative
sketch follows the list):
- Get rid of the duplication in the "compose" function
- Introduce a function pointer for the previously "prepare" function
- Allow different domain owners to set their own "sw_msi" implementations
- Implement an iommufd_sw_msi function to additionally support non-nested
use cases and also prepare for a nested translation use case using
approach (1)
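Purely as an illustration of the direction (the names and signature below
are assumptions made for the sketch, not necessarily the exact ones in the
patches), the per-domain hook boils down to something like:

    /* Illustrative sketch only; names/signatures are assumptions. The point
     * is that the domain owner, rather than dma-iommu alone, decides how a
     * software MSI address is prepared and composed. */
    struct iommu_domain {
            /* ... existing fields ... */
            int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
                          phys_addr_t msi_addr);
    };

    /* dma-iommu keeps its current behaviour as one implementation ... */
    int iommu_dma_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
                         phys_addr_t msi_addr);

    /* ... while iommufd installs its own, covering non-nested use today and
     * the RMR-based nested case (approach 1) later on. */
    int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
                       phys_addr_t msi_addr);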
[ Future Plan ]
Part-2 will add support for approach (1), i.e. the RMR solution:
- Add a pair of IOMMUFD options for a SW_MSI window for kernel and VMM to
agree on (for approach 1)
Part-3 and beyond will continue the effort of supporting approach (2), i.e.
a complete vITS-to-pITS mapping:
- Map the physical ITS page (potentially via IOMMUFD_CMD_IOAS_MAP_MSI)
- Convey the IOVAs per-irq (potentially via VFIO_IRQ_SET_ACTION_PREPARE)
---
This is a joint effort that includes Jason's rework at the irq/iommu/iommufd
base level and my additional patches on top of that for new uAPIs.
This series is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_msi_p1-v2
For testing with nested SMMU (approach 1):
https://github.com/nicolinc/iommufd/commits/wip/iommufd_msi_p2-v2
Pairing QEMU branch for testing (approach 1):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi_p2-v2-rmr
Changelog
v2
* Split the iommufd ioctl for approach (1) out of this part-1
* Rebase on Jason's for-next tree (6.14-rc2) for two iommufd patches
* Update commit logs in two irqchip patches to make narrative clearer
* Keep iommu_dma_compose_msi_msg() in PATCH-1 as a small cleaner step
* Improve with some coding style changes: kdoc and 100-char wrapping
v1
https://lore.kernel.org/kvm/cover.1739005085.git.nicolinc@nvidia.com/
* Rebase on v6.14-rc1 and iommufd_attach_handle-v1 series
https://lore.kernel.org/all/cover.1738645017.git.nicolinc@nvidia.com/
* Correct typos
* Replace set_bit with __set_bit
* Use a common helper to get iommufd_handle
* Add kdoc for iommu_msi_iova/iommu_msi_page_shift
* Rename msi_msg_set_msi_addr() to msi_msg_set_addr()
* Update selftest for a better coverage for the new options
* Change IOMMU_OPTION_SW_MSI_START/SIZE to be per-idev and properly
check against device's reserved region list
RFCv2
https://lore.kernel.org/kvm/cover.1736550979.git.nicolinc@nvidia.com/
* Rebase on v6.13-rc6
* Drop all the irq/pci patches and rework the compose function instead
* Add a new sw_msi op to iommu_domain for a per-type implementation and
let the iommufd core have its own implementation to support both approaches
* Add RMR-solution (approach 1) support since it is straightforward and
has been used widely in some out-of-tree projects
RFCv1
https://lore.kernel.org/kvm/cover.1731130093.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Jason Gunthorpe (5):
genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of
iommu_cookie
genirq/msi: Refactor iommu_dma_compose_msi_msg()
iommu: Make iommu_dma_prepare_msi() into a generic operation
irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by irqchips that need
it
iommufd: Implement sw_msi support natively
Nicolin Chen (2):
iommu: Turn fault_data to iommufd private pointer
iommu: Turn iova_cookie to dma-iommu private pointer
drivers/iommu/Kconfig | 1 -
drivers/irqchip/Kconfig | 4 +
kernel/irq/Kconfig | 1 +
drivers/iommu/iommufd/iommufd_private.h | 23 +++-
include/linux/iommu.h | 58 +++++----
include/linux/msi.h | 55 +++++---
drivers/iommu/dma-iommu.c | 63 +++-------
drivers/iommu/iommu.c | 29 +++++
drivers/iommu/iommufd/device.c | 160 ++++++++++++++++++++----
drivers/iommu/iommufd/fault.c | 2 +-
drivers/iommu/iommufd/hw_pagetable.c | 5 +-
drivers/iommu/iommufd/main.c | 9 ++
drivers/irqchip/irq-gic-v2m.c | 5 +-
drivers/irqchip/irq-gic-v3-its.c | 13 +-
drivers/irqchip/irq-gic-v3-mbi.c | 12 +-
drivers/irqchip/irq-ls-scfg-msi.c | 5 +-
16 files changed, 309 insertions(+), 136 deletions(-)
base-commit: dc10ba25d43f433ad5d9e8e6be4f4d2bb3cd9ddb
prerequisite-patch-id: 0000000000000000000000000000000000000000
--
2.43.0
Hi all,
This patch series continues the work to migrate the script tests into
prog_tests.
test_xdp_vlan.sh tests the ability of an XDP program to modify the VLAN
ids on the fly. This isn't currently covered by any other test in the
test_progs framework, so I add a new file prog_tests/xdp_vlan.c that does
the exact same tests (same network topology, same BPF programs) and
remove the script.
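The migrated test follows the usual test_progs shape; the skeleton below is
only a hedged illustration of that shape, not the actual contents of
prog_tests/xdp_vlan.c.

    #include <linux/if_link.h>
    #include <test_progs.h>
    #include "test_xdp_vlan.skel.h"

    /* Hedged illustration: the real test builds the same veth topology and
     * attaches the same BPF programs as the removed shell script. */
    static void run_xdp_vlan(__u32 xdp_flags)
    {
            struct test_xdp_vlan *skel;

            skel = test_xdp_vlan__open_and_load();
            if (!ASSERT_OK_PTR(skel, "open_and_load"))
                    return;

            /* set up netns + veth pair, attach the XDP program with xdp_flags,
             * send traffic and assert that the VLAN id was rewritten */

            test_xdp_vlan__destroy(skel);
    }

    void test_xdp_vlan(void)
    {
            if (test__start_subtest("generic"))
                    run_xdp_vlan(XDP_FLAGS_SKB_MODE);
            if (test__start_subtest("native"))
                    run_xdp_vlan(XDP_FLAGS_DRV_MODE);
    }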
Signed-off-by: Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
---
Bastien Curutchet (eBPF Foundation) (2):
selftests/bpf: test_xdp_vlan: Rename BPF sections
selftests/bpf: Migrate test_xdp_vlan.sh into test_progs
tools/testing/selftests/bpf/Makefile | 4 +-
tools/testing/selftests/bpf/prog_tests/xdp_vlan.c | 175 ++++++++++++++++
tools/testing/selftests/bpf/progs/test_xdp_vlan.c | 20 +-
tools/testing/selftests/bpf/test_xdp_vlan.sh | 233 ---------------------
.../selftests/bpf/test_xdp_vlan_mode_generic.sh | 9 -
.../selftests/bpf/test_xdp_vlan_mode_native.sh | 9 -
6 files changed, 186 insertions(+), 264 deletions(-)
---
base-commit: a814b9be27fb3c3f49343aee4b015b76f5875558
change-id: 20250130-xdp_vlan-e825cc4df14a
Best regards,
--
Bastien Curutchet (eBPF Foundation) <bastien.curutchet(a)bootlin.com>
Unmapping virtual machine guest memory from the host kernel's direct map
is a successful mitigation against Spectre-style transient execution
issues: If the kernel page tables do not contain entries pointing to
guest memory, then any attempted speculative read through the direct map
will necessarily be blocked by the MMU before any observable
microarchitectural side-effects happen. This means that Spectre-gadgets
and similar cannot be used to target virtual machine memory. Roughly 60%
of speculative execution issues fall into this category [1, Table 1].
This patch series extends guest_memfd with the ability to remove its
memory from the host kernel's direct map, to be able to attain the above
protection for KVM guests running inside guest_memfd.
=== Changes to RFC v3 ===
- Settle relationship between direct map removal and shared/private
memory in guest_memfd (David H.)
- Omit TLB flushes upon direct map removal again
- Settle uABI for how KVM accesses guest memory in non-CoCo guest_memfd
VMs (upstream guest_memfd calls)
- Add selftests that exercise the codepaths of non-CoCo guest_memfd VMs
Lastly, this series is rebased on top of Fuad's v4 for shared mapping of
guest_memfd [2]. The KVM parts should also apply on top of 0ad2507d5d93
("Linux 6.14-rc3"), but the selftest patches need Fuad's series as base.
=== Overview ===
guest_memfd should be usable for "non-CoCo" VMs - virtual machines where
host userspace is trusted (i.e. it can have access to all of guest memory),
but which should still be hardened against speculative execution attacks
(Spectre, etc.) staged through potentially existing gadgets in the host
kernel.
To attain this hardening, unmap guest memory from the host kernel's
address space (e.g. zap direct map entries), while allowing KVM to
continue accessing guest memory through userspace mappings. This works
because KVM already almost always uses userspace mappings whenever KVM
needs to access guest memory - the only parts that require direct map
entries (because they use GUP) are KVM's MMU, and kvm-clock on x86.
Building on top of guest_memfd sidesteps the MMU problem, as for
memslots with KVM_MEM_GUEST_MEMFD set, the MMU consumes fd + offset
directly without going through any VMAs. kvm-clock on the other hand is
not strictly needed (guests boot fine without it), so ignore it for
now.
=== Implementation ===
Make KVM_CREATE_GUEST_MEMFD accept a flag (KVM_GMEM_NO_DIRECT_MAP) that
instructs it to remove newly allocated folios from the host kernel's
direct map immediately after preparation.
Nothing further is needed to make non-CoCo VMs work - particularly, KVM
does not need to be taught any special ways of accessing guest memory if
it is in guest_memfd. Userspace can simply mmap guest_memfd (via
KVM_GMEM_SHARED_MEM added in Fuad's series), and set the memslot's
userspace_addr to this userspace mapping of guest_memfd.
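In userspace terms the flow looks roughly like the sketch below.
KVM_GMEM_NO_DIRECT_MAP is the flag proposed by this series and
KVM_GMEM_SHARED_MEM comes from Fuad's series, so neither is in released
uapi headers yet.

    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    /* Sketch only: the two KVM_GMEM_* flags come from this series and from
     * Fuad's series respectively, and are not yet in released uapi headers. */
    static int setup_gmem_slot(int vm_fd, __u64 gpa, __u64 size)
    {
            struct kvm_create_guest_memfd gmem = {
                    .size  = size,
                    .flags = KVM_GMEM_NO_DIRECT_MAP | KVM_GMEM_SHARED_MEM,
            };
            int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

            /* KVM accesses guest memory through this userspace mapping ... */
            void *uaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                               gmem_fd, 0);

            /* ... while stage-2 faults consume fd + offset directly. */
            struct kvm_userspace_memory_region2 region = {
                    .slot            = 0,
                    .flags           = KVM_MEM_GUEST_MEMFD,
                    .guest_phys_addr = gpa,
                    .memory_size     = size,
                    .userspace_addr  = (__u64)(unsigned long)uaddr,
                    .guest_memfd     = gmem_fd,
            };
            return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
    }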
=== Open Questions ===
In this patch series, stale TLB entries do not get flushed after direct
map entries are marked as not present. This is fine from a functional
point of view (as the mapping is still valid, it's just temporarily not
supposed to be used), but pokes a theoretical hole into the speculation
protection: Something could try to keep alive stale TLB entries for
specific pages until the guest starts using them for sensitive
information, and then stage a Spectre attack on that memory, despite it
being unmapped. In practice, this would require knowing in advance, at
gmem fault-time, which pages will eventually contain information of
interest, and then preventing these specific TLB entries from getting
naturally evicted (where the number of pages that can be targeted like
this is limited by the size of the TLB). These seem to be fairly
difficult requisites to fulfill, but we were wondering what the
community thinks.
=== Summary ===
Patch 1 adds a struct address_space flag that indicates that folios in a
mapping are direct map removed, and threads it through mm code to ensure
direct map removed folios don't end up in places where they can cause
mayhem (particularly, we reject them in get_user_pages). Since these
checks end up being duplicates of already existing checks for secretmem
folios, patch 2 unifies the two by using the new address_space flag for
secretmem mappings. Patches 3 through 5 are about support for direct map
removal in guest_memfd, while patches 6 through 12 are about testing the
non-CoCo setup in KVM selftests, with patches 6 through 9 being
preparatory, and patches 10 through 12 adding the actual test cases.
[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[2]: https://lore.kernel.org/kvm/20250218172500.807733-1-tabba@google.com/
[RFC v1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFC v2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
[RFC v3]: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk/
Patrick Roy (12):
mm: introduce AS_NO_DIRECT_MAP
mm/secretmem: set AS_NO_DIRECT_MAP instead of special-casing
KVM: guest_memfd: Add flag to remove from direct map
KVM: Add capability to discover KVM_GMEM_NO_DIRECT_MAP support
KVM: Documentation: document KVM_GMEM_NO_DIRECT_MAP flag
KVM: selftests: load elf via bounce buffer
KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd
!= -1
KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
KVM: selftests: adjust test_create_guest_memfd_invalid
KVM: selftests: set KVM_GMEM_NO_DIRECT_MAP in mem conversion tests
KVM: selftests: Test guest execution from direct map removed gmem
Documentation/virt/kvm/api.rst | 13 ++++
include/linux/pagemap.h | 16 +++++
include/linux/secretmem.h | 18 ------
include/uapi/linux/kvm.h | 3 +
lib/buildid.c | 4 +-
mm/gup.c | 14 +---
mm/mlock.c | 2 +-
mm/secretmem.c | 6 +-
.../testing/selftests/kvm/guest_memfd_test.c | 2 +-
.../testing/selftests/kvm/include/kvm_util.h | 29 ++++++---
.../testing/selftests/kvm/include/test_util.h | 8 +++
tools/testing/selftests/kvm/lib/elf.c | 8 +--
tools/testing/selftests/kvm/lib/io.c | 23 +++++++
tools/testing/selftests/kvm/lib/kvm_util.c | 64 +++++++++++--------
tools/testing/selftests/kvm/lib/test_util.c | 8 +++
tools/testing/selftests/kvm/lib/x86/sev.c | 1 +
.../selftests/kvm/pre_fault_memory_test.c | 1 +
.../selftests/kvm/set_memory_region_test.c | 40 ++++++++++++
.../kvm/x86/private_mem_conversions_test.c | 7 +-
virt/kvm/guest_memfd.c | 24 ++++++-
virt/kvm/kvm_main.c | 5 ++
21 files changed, 214 insertions(+), 82 deletions(-)
base-commit: da40655874b54a2b563f8ceb3ed839c6cd38e0b4
--
2.48.1