This adds the pasid attach/detach uAPIs for userspace to attach/detach
a PASID of a device to/from a given ioas/hwpt. Only vfio-pci driver is
enabled in this series. After this series, PASID-capable devices bound
with vfio-pci can report PASID capability to userspace and VM to enable
PASID usages like Shared Virtual Addressing (SVA).
Based on the discussion about reporting the vPASID to VM [1], it's agreed
that we will let the userspace VMM to synthesize the vPASID capability.
The VMM needs to figure out a hole to put the vPASID cap. This includes
the hidden bits handling for some devices. While, it's up to the userspace,
it's not the focus of this series.
This series first adds the helpers for pasid attach in vfio core and then
adds the device cdev ioctls for pasid attach/detach. In the end of this
series, the IOMMU_GET_HW_INFO ioctl is extended to report the PCI PASID
capability to the userspace. Userspace should check this before using any
PASID related uAPIs provided by VFIO, which is the agreement in [2]. This
series depends on the iommufd pasid attach/detach series [3].
The completed code can be found at [4], tested with a hacky Qemu branch [5].
[1] https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11M…
[2] https://lore.kernel.org/kvm/4f2daf50-a5ad-4599-ab59-bcfc008688d8@intel.com/
[3] https://lore.kernel.org/linux-iommu/20240912131255.13305-1-yi.l.liu@intel.c…
[4] https://github.com/yiliu1765/iommufd/tree/iommufd_pasid
[5] https://github.com/yiliu1765/qemu/tree/wip/zhenzhong/iommufd_nesting_rfcv2-…
Change log:
v3:
- Misc enhancement on patch 01 of v2 (Alex, Jason)
- Add Jason's r-b to patch 03 of v2
- Drop the logic that report PASID via VFIO_DEVICE_FEATURE ioctl
- Extend IOMMU_GET_HW_INFO to report PASID support (Kevin, Jason, Alex)
v2: https://lore.kernel.org/kvm/20240412082121.33382-1-yi.l.liu@intel.com/
- Use IDA to track if PASID is attached or not in VFIO. (Jason)
- Fix the issue of calling pasid_at[de]tach_ioas callback unconditionally (Alex)
- Fix the wrong data copy in vfio_df_ioctl_pasid_detach_pt() (Zhenzhong)
- Minor tweaks in comments (Kevin)
v1: https://lore.kernel.org/kvm/20231127063909.129153-1-yi.l.liu@intel.com/
- Report PASID capability via VFIO_DEVICE_FEATURE (Alex)
rfc: https://lore.kernel.org/linux-iommu/20230926093121.18676-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Yi Liu (4):
ida: Add ida_find_first_range()
vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability
drivers/iommu/iommufd/device.c | 27 +++++++++++++-
drivers/pci/ats.c | 32 ++++++++++++++++
drivers/vfio/device_cdev.c | 51 ++++++++++++++++++++++++++
drivers/vfio/iommufd.c | 50 +++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci.c | 2 +
drivers/vfio/vfio.h | 4 ++
drivers/vfio/vfio_main.c | 8 ++++
include/linux/idr.h | 11 ++++++
include/linux/pci-ats.h | 3 ++
include/linux/vfio.h | 11 ++++++
include/uapi/linux/iommufd.h | 14 ++++++-
include/uapi/linux/vfio.h | 55 ++++++++++++++++++++++++++++
lib/idr.c | 67 ++++++++++++++++++++++++++++++++++
13 files changed, 333 insertions(+), 2 deletions(-)
--
2.34.1
Recently we committed a fix to allow processes to receive notifications for
non-zero exits via the process connector module. Commit is a4c9a56e6a2c.
However, for threads, when it does a pthread_exit(&exit_status) call, the
kernel is not aware of the exit status with which pthread_exit is called.
It is sent by child thread to the parent process, if it is waiting in
pthread_join(). Hence, for a thread exiting abnormally, kernel cannot
send notifications to any listening processes.
The exception to this is if the thread is sent a signal which it has not
handled, and dies along with it's process as a result; for eg. SIGSEGV or
SIGKILL. In this case, kernel is aware of the non-zero exit and sends a
notification for it.
For our use case, we cannot have parent wait in pthread_join, one of the
main reasons for this being that we do not want to track normal
pthread_exit(), which could be a very large number. We only want to be
notified of any abnormal exits. Hence, threads are created with
pthread_attr_t set to PTHREAD_CREATE_DETACHED.
To fix this problem, we add a new type PROC_CN_MCAST_NOTIFY to proc connector
API, which allows a thread to send it's exit status to kernel either when
it needs to call pthread_exit() with non-zero value to indicate some
error or from signal handler before pthread_exit().
v5->v6 changes:
- As suggested by Simon Horman, fixed style issues (some old) by running
./scripts/checkpatch.pl --strict --max-line-length=80
- Removed inline functions as suggested by Simon Horman
- Added "depends on" in Kconfig.debug as suggested by Stanislav Fomichev
- Removed warning while compilation with kernel space headers as
suggested by Simon Horman
- Removed "comm" field, will send separate patch for this.
- Added kunit configs in tools/testing/kunit/configs/all_tests.config
with it's dependencies.
v4->v5 changes:
- Handled comment by Stanislav Fomichev to fix a print format error.
- Made thread.c completely automated by starting proc_filter program
from within threads.c.
- Changed name CONFIG_CN_HASH_KUNIT_TEST to CN_HASH_KUNIT_TEST in
Kconfig.debug and changed display text.
v3->v4 changes:
- Reduce size of exit.log by removing unnecessary text.
v2->v3 changes:
- Handled comment by Liam Howlett to set hdev to NULL and add comment on
it.
- Handled comment by Liam Howlett to combine functions for deleting+get
and deleting into one in cn_hash.c
- Handled comment by Liam Howlett to remove extern in the functions
defined in cn_hash_test.h
- Some nits by Liam Howlett fixed.
- Handled comment by Liam Howlett to make threads test automated.
proc_filter.c creates exit.log, which is read by thread.c and checks
the values reported.
- Added "comm" field to struct proc_event, to copy the task's name to
the packet to allow further filtering by packets.
v1->v2 changes:
- Handled comment by Peter Zijlstra to remove locking for PF_EXIT_NOTIFY
task->flags.
- Added error handling in thread.c
v->v1 changes:
- Handled comment by Simon Horman to remove unused err in cn_proc.c
- Handled comment by Simon Horman to make adata and key_display static
in cn_hash_test.c
Anjali Kulkarni (3):
connector/cn_proc: Add hash table for threads
connector/cn_proc: Kunit tests for threads hash table
connector/cn_proc: Selftest for threads
drivers/connector/Makefile | 2 +-
drivers/connector/cn_hash.c | 216 ++++++++++++++++
drivers/connector/cn_proc.c | 81 ++++--
drivers/connector/connector.c | 88 ++++++-
include/linux/connector.h | 62 ++++-
include/linux/sched.h | 2 +-
include/uapi/linux/cn_proc.h | 4 +-
lib/Kconfig.debug | 19 ++
lib/Makefile | 1 +
lib/cn_hash_test.c | 169 +++++++++++++
lib/cn_hash_test.h | 10 +
tools/testing/kunit/configs/all_tests.config | 5 +
tools/testing/selftests/connector/Makefile | 25 ++
.../testing/selftests/connector/proc_filter.c | 63 +++--
tools/testing/selftests/connector/thread.c | 238 ++++++++++++++++++
.../selftests/connector/thread_filter.c | 96 +++++++
16 files changed, 1027 insertions(+), 54 deletions(-)
create mode 100644 drivers/connector/cn_hash.c
create mode 100644 lib/cn_hash_test.c
create mode 100644 lib/cn_hash_test.h
create mode 100644 tools/testing/selftests/connector/thread.c
create mode 100644 tools/testing/selftests/connector/thread_filter.c
--
2.46.0
The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1]. With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses. This provides some
protection against ROP attacks and making it easier to collect call
stacks. These shadow stacks are allocated in the address space of the
userspace process.
Our API for shadow stacks does not currently offer userspace any
flexiblity for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled. The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread. This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces. As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.
Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process, keeping the current
implicit allocation behaviour if one is not specified either with
clone3() or through the use of clone(). The user must provide a shadow
stack pointer, this must point to memory mapped for use as a shadow
stackby map_shadow_stack() with an architecture specified shadow stack
token at the top of the stack.
Please note that the x86 portions of this code are build tested only, I
don't appear to have a system that can run CET available to me.
[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87…
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and
integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a
base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@k…
Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick
Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@ke…
Changes in v9:
- Pull token validation earlier and report problems with an error return
to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@ke…
Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@ke…
Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@ke…
Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@ke…
Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@ke…
Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@ke…
Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@ke…
Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@ke…
---
Mark Brown (8):
arm64/gcs: Return a success value from gcs_alloc_thread_stack()
Documentation: userspace-api: Add shadow stack API documentation
selftests: Provide helper header for shadow stack testing
fork: Add shadow stack support to clone3()
selftests/clone3: Remove redundant flushes of output streams
selftests/clone3: Factor more of main loop into test_clone3()
selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
selftests/clone3: Test shadow stack support
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/shadow_stack.rst | 41 ++++
arch/arm64/include/asm/gcs.h | 8 +-
arch/arm64/kernel/process.c | 8 +-
arch/arm64/mm/gcs.c | 62 +++++-
arch/x86/include/asm/shstk.h | 11 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 57 +++++-
include/asm-generic/cacheflush.h | 11 ++
include/linux/sched/task.h | 17 ++
include/uapi/linux/sched.h | 10 +-
kernel/fork.c | 96 +++++++--
tools/testing/selftests/clone3/clone3.c | 226 ++++++++++++++++++----
tools/testing/selftests/clone3/clone3_selftests.h | 65 ++++++-
tools/testing/selftests/ksft_shstk.h | 98 ++++++++++
15 files changed, 632 insertions(+), 81 deletions(-)
---
base-commit: d17cd7b7cc92d37ee8b2df8f975fc859a261f4dc
change-id: 20231019-clone3-shadow-stack-15d40d2bf536
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hi all,
This series is based on v6.12-rc4. It fixes an issue with the tracefs
gid mount option. Adds test case to prevent future breakages and updates
the tracefs readme to document the expected behavior of this option.
Thanks,
Kalesh
Kalesh Singh (3):
tracing: Document tracefs gid mount option
tracing/selftests: Add tracefs mount options test
tracing: Fix tracefs gid mount option
fs/tracefs/inode.c | 12 ++-
kernel/trace/trace.c | 4 +
.../ftrace/test.d/00basic/mount_options.tc | 101 ++++++++++++++++++
.../ftrace/test.d/00basic/test_ownership.tc | 16 +--
.../testing/selftests/ftrace/test.d/functions | 25 +++++
5 files changed, 142 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/00basic/mount_options.tc
base-commit: 42f7652d3eb527d03665b09edac47f85fb600924
--
2.47.0.163.g1226f6d8fa-goog
Hi all,
This is v2 of the series fixing the tracefs mount options.
Changes in v2:
- Update the commit descriptions, per Eric and Steve
- Fix ordering of the patches, per Steve
v1 can be found at:
- https://lore.kernel.org/r/20241028214550.2099923-1-kaleshsingh@google.com/
Thanks,
Kalesh
Kalesh Singh (3):
tracing: Fix tracefs mount options
tracing: Document tracefs gid mount option
tracing/selftests: Add tracefs mount options test
fs/tracefs/inode.c | 12 ++-
kernel/trace/trace.c | 4 +
.../ftrace/test.d/00basic/mount_options.tc | 101 ++++++++++++++++++
.../ftrace/test.d/00basic/test_ownership.tc | 16 +--
.../testing/selftests/ftrace/test.d/functions | 25 +++++
5 files changed, 142 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/00basic/mount_options.tc
base-commit: 42f7652d3eb527d03665b09edac47f85fb600924
--
2.47.0.163.g1226f6d8fa-goog
This series introduces a new vIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a
nested IO page table support. Yet, there're limitations for an HWPT-based
structure to support some advanced HW-accelerated features, such as CMDQV
on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
environment, it is not straightforward for nested HWPTs to share the same
parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a
parent HWPT typically hold one stage-2 IO pagetable and tag it with only
one ID in the cache entries. When sharing one large stage-2 IO pagetable
across physical IOMMU instances, that one ID may not always be available
across all the IOMMU instances. In other word, it's ideal for SW to have
a different container for the stage-2 IO pagetable so it can hold another
ID that's available. And this container will be able to hold some advanced
feature too.
For this "different container", add vIOMMU, an additional layer to hold
extra virtualization information:
_______________________________________________________________________
| iommufd (with vIOMMU) |
| _____________ |
| | | |
| |----------------| vIOMMU | |
| | ______ | | _____________ ________ |
| | | | | | | | | | |
| | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
| | |______| |_____________| |_____________| |________| |
| | | | | | |
|______|________|______________|__________________|_______________|_____|
| | | | |
______v_____ | ______v_____ ______v_____ ___v__
| struct | | PFN | (paging) | | (nested) | |struct|
|iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________| storage|____________| |____________| |______|
The vIOMMU object should be seen as a slice of a physical IOMMU instance
that is passed to or shared with a VM. That can be some HW/SW resources:
- Security namespace for guest owned ID, e.g. guest-controlled cache tags
- Access to a sharable nesting parent pagetable across physical IOMMUs
- Non-affiliated event reporting (e.g. an invalidation queue error)
- Virtualization of various platforms IDs, e.g. RIDs and others
- Delivery of paravirtualized invalidation
- Direct assigned invalidation queues
- Direct assigned interrupts
On a multi-IOMMU system, the vIOMMU object must be instanced to the number
of the physical IOMMUs that have a slice passed to (via device) a guest VM,
while being able to hold the shareable parent HWPT. Each vIOMMU then just
needs to allocate its own individual ID to tag its own cache:
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested0 |--->| viommu0 ------------------
---------------- | | IDx |
----------------------------
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested1 |--->| viommu1 ------------------
---------------- | | IDy |
----------------------------
As an initial part-1, add IOMMUFD_CMD_VIOMMU_ALLOC ioctl for an allocation
only. And implement it in arm-smmu-v3 driver as a real world use case.
More vIOMMU-based structs and ioctls will be introduced in the follow-up
series to support vDEVICE, vIRQ (vEVENT) and vQUEUE objects. Although we
repurposed the vIOMMU object from an earlier RFC, just for a referece:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v5
(paring QEMU branch for testing will be provided with the part2 series)
Changelog
v5
* Added "Reviewed-by" from Kevin
* Reworked iommufd_viommu_alloc helper
* Revised the uAPI kdoc for vIOMMU object
* Revised comments for pluggable iommu_dev
* Added a couple of cleanup patches for selftest
* Renamed domain_alloc_nested op to alloc_domain_nested
* Updated a few commit messages to reflect the latest series
* Renamed iommufd_hwpt_nested_alloc_for_viommu to
iommufd_viommu_alloc_hwpt_nested, and added flag validation
v4
https://lore.kernel.org/all/cover.1729553811.git.nicolinc@nvidia.com/
* Added "Reviewed-by" from Jason
* Dropped IOMMU_VIOMMU_TYPE_DEFAULT support
* Dropped iommufd_object_alloc_elm renamings
* Renamed iommufd's viommu_api.c to driver.c
* Reworked iommufd_viommu_alloc helper
* Added a separate iommufd_hwpt_nested_alloc_for_viommu function for
hwpt_nested allocations on a vIOMMU, and added comparison between
viommu->iommu_dev->ops and dev_iommu_ops(idev->dev)
* Replaced s2_parent with vsmmu in arm_smmu_nested_domain
* Replaced domain_alloc_user in iommu_ops with domain_alloc_nested in
viommu_ops
* Replaced wait_queue_head_t with a completion, to delay the unplug of
mock_iommu_dev
* Corrected documentation graph that was missing struct iommu_device
* Added an iommufd_verify_unfinalized_object helper to verify driver-
allocated vIOMMU/vDEVICE objects
* Added missing test cases for TEST_LENGTH and fail_nth
v3
https://lore.kernel.org/all/cover.1728491453.git.nicolinc@nvidia.com/
* Rebased on top of Jason's nesting v3 series
https://lore.kernel.org/all/0-v3-e2e16cd7467f+2a6a1-smmuv3_nesting_jgg@nvid…
* Split the series into smaller parts
* Added Jason's Reviewed-by
* Added back viommu->iommu_dev
* Added support for driver-allocated vIOMMU v.s. core-allocated
* Dropped arm_smmu_cache_invalidate_user
* Added an iommufd_test_wait_for_users() in selftest
* Reworked test code to make viommu an individual FIXTURE
* Added missing TEST_LENGTH case for the new ioctl command
v2
https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numnbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Nicolin Chen (13):
iommufd: Move struct iommufd_object to public iommufd header
iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
iommufd: Add iommufd_verify_unfinalized_object
iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
iommufd: Add alloc_domain_nested op to iommufd_viommu_ops
iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
iommufd/selftest: Add container_of helpers
iommufd/selftest: Prepare for mock_viommu_alloc_domain_nested()
iommufd/selftest: Add refcount to mock_iommu_device
iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
Documentation: userspace-api: iommufd: Update vIOMMU
iommu/arm-smmu-v3: Add IOMMU_VIOMMU_TYPE_ARM_SMMUV3 support
drivers/iommu/iommufd/Makefile | 5 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 26 +-
drivers/iommu/iommufd/iommufd_private.h | 36 +--
drivers/iommu/iommufd/iommufd_test.h | 2 +
include/linux/iommu.h | 14 +
include/linux/iommufd.h | 86 ++++++
include/uapi/linux/iommufd.h | 56 +++-
tools/testing/selftests/iommu/iommufd_utils.h | 28 ++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 80 ++++--
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 +-
drivers/iommu/iommufd/driver.c | 38 +++
drivers/iommu/iommufd/hw_pagetable.c | 71 ++++-
drivers/iommu/iommufd/main.c | 58 ++--
drivers/iommu/iommufd/selftest.c | 256 ++++++++++++------
drivers/iommu/iommufd/viommu.c | 89 ++++++
tools/testing/selftests/iommu/iommufd.c | 87 ++++++
.../selftests/iommu/iommufd_fail_nth.c | 11 +
Documentation/userspace-api/iommufd.rst | 69 ++++-
18 files changed, 827 insertions(+), 194 deletions(-)
create mode 100644 drivers/iommu/iommufd/driver.c
create mode 100644 drivers/iommu/iommufd/viommu.c
--
2.43.0
There is a spelling mistake in a ksft_test_result_skip message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/kvm/s390x/cpumodel_subfuncs_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/s390x/cpumodel_subfuncs_test.c b/tools/testing/selftests/kvm/s390x/cpumodel_subfuncs_test.c
index 222ba1cc3cac..34e0ceceaa74 100644
--- a/tools/testing/selftests/kvm/s390x/cpumodel_subfuncs_test.c
+++ b/tools/testing/selftests/kvm/s390x/cpumodel_subfuncs_test.c
@@ -276,7 +276,7 @@ int main(int argc, char *argv[])
ksft_test_result_pass("%s\n", testlist[idx].subfunc_name);
free(array);
} else {
- ksft_test_result_skip("%s feature is not avaialable\n",
+ ksft_test_result_skip("%s feature is not available\n",
testlist[idx].subfunc_name);
}
}
--
2.39.5
This series was originally written by José Expósito, and has been
modified and updated by Matt Gilbride and myself. The original version
can be found here:
https://github.com/Rust-for-Linux/linux/pull/950
Add support for writing KUnit tests in Rust. While Rust doctests are
already converted to KUnit tests and run, they're really better suited
for examples, rather than as first-class unit tests.
This series implements a series of direct Rust bindings for KUnit tests,
as well as a new macro which allows KUnit tests to be written using a
close variant of normal Rust unit test syntax. The only change required
is replacing '#[cfg(test)]' with '#[kunit_tests(kunit_test_suite_name)]'
An example test would look like:
#[kunit_tests(rust_kernel_hid_driver)]
mod tests {
use super::*;
use crate::{c_str, driver, hid, prelude::*};
use core::ptr;
struct SimpleTestDriver;
impl Driver for SimpleTestDriver {
type Data = ();
}
#[test]
fn rust_test_hid_driver_adapter() {
let mut hid = bindings::hid_driver::default();
let name = c_str!("SimpleTestDriver");
static MODULE: ThisModule = unsafe { ThisModule::from_ptr(ptr::null_mut()) };
let res = unsafe {
<hid::Adapter<SimpleTestDriver> as driver::DriverOps>::register(&mut hid, name, &MODULE)
};
assert_eq!(res, Err(ENODEV)); // The mock returns -19
}
}
Please give this a go, and make sure I haven't broken it! There's almost
certainly a lot of improvements which can be made -- and there's a fair
case to be made for replacing some of this with generated C code which
can use the C macros -- but this is hopefully an adequate implementation
for now, and the interface can (with luck) remain the same even if the
implementation changes.
A few small notable missing features:
- Attributes (like the speed of a test) are hardcoded to the default
value.
- Similarly, the module name attribute is hardcoded to NULL. In C, we
use the KBUILD_MODNAME macro, but I couldn't find a way to use this
from Rust which wasn't more ugly than just disabling it.
- Assertions are not automatically rewritten to use KUnit assertions.
---
Changes since v1:
https://lore.kernel.org/lkml/20230720-rustbind-v1-0-c80db349e3b5@google.com…
- Rebase on top of the latest rust-next (commit 718c4069896c)
- Make kunit_case a const fn, rather than a macro (Thanks Boqun)
- As a result, the null terminator is now created with
kernel::kunit::kunit_case_null()
- Use the C kunit_get_current_test() function to implement
in_kunit_test(), rather than re-implementing it (less efficiently)
ourselves.
Changes since the GitHub PR:
- Rebased on top of kselftest/kunit
- Add const_mut_refs feature
This may conflict with https://lore.kernel.org/lkml/20230503090708.2524310-6-nmi@metaspace.dk/
- Add rust/macros/kunit.rs to the KUnit MAINTAINERS entry
---
José Expósito (3):
rust: kunit: add KUnit case and suite macros
rust: macros: add macro to easily run KUnit tests
rust: kunit: allow to know if we are in a test
MAINTAINERS | 1 +
rust/kernel/kunit.rs | 191 +++++++++++++++++++++++++++++++++++++++++++
rust/kernel/lib.rs | 1 +
rust/macros/lib.rs | 29 +++++++
4 files changed, 222 insertions(+)
--
2.47.0.163.g1226f6d8fa-goog
If running hid testcases with command "./run_kselftest.sh -c hid",
the following tests will take longer than the kselftest framework
timeout (now 200 seconds) to run and thus got terminated with TIMEOUT
error:
hid-multitouch.sh - took about 6min41s
hid-tablet.sh - took about 6min30s
Increase the timeout setting to 10 minutes to allow them have a chance
to finish.
Cc: stable(a)vger.kernel.org
Signed-off-by: Yun Lu <luyun(a)kylinos.cn>
---
tools/testing/selftests/hid/settings | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/hid/settings b/tools/testing/selftests/hid/settings
index b3cbfc521b10..dff0d947f9c2 100644
--- a/tools/testing/selftests/hid/settings
+++ b/tools/testing/selftests/hid/settings
@@ -1,3 +1,3 @@
# HID tests can be long, so give a little bit more time
# to them
-timeout=200
+timeout=600
--
2.27.0
Following the previous vIOMMU series, this adds another vDEVICE structure,
representing the association from an iommufd_device to an iommufd_viommu.
This gives the whole architecture a new "v" layer:
_______________________________________________________________________
| iommufd (with vIOMMU/vDEVICE) |
| _____________ _____________ |
| | | | | |
| |----------------| vIOMMU |<---| vDEVICE |<------| |
| | | | |_____________| | |
| | ______ | | _____________ ___|____ |
| | | | | | | | | | |
| | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
| | |______| |_____________| |_____________| |________| |
|______|________|______________|__________________|_______________|_____|
| | | | |
______v_____ | ______v_____ ______v_____ ___v__
| struct | | PFN | (paging) | | (nested) | |struct|
|iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________| storage|____________| |____________| |______|
This vDEVICE object is used to collect and store all vIOMMU-related device
information/attributes in a VM. As an initial series for vDEVICE, add only
the virt_id to the vDEVICE, which is a vIOMMU specific device ID in a VM:
e.g. vSID of ARM SMMUv3, vDeviceID of AMD IOMMU, and vID of Intel VT-d to
a Context Table. This virt_id helps IOMMU drivers to link the vID to a pID
of the device against the physical IOMMU instance. This is essential for a
vIOMMU-based invalidation, where the request contains a device's vID for a
device cache flush, e.g. ATC invalidation.
Therefore, with this vDEVICE object, support a vIOMMU-based invalidation,
by reusing IOMMUFD_CMD_HWPT_INVALIDATE for a vIOMMU object to flush cache
with a given driver data.
As for the implementation of the series, add driver support in ARM SMMUv3
for a real world use case.
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p2-v5
For testing, try this "with-rmr" branch:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p2-v5-with-rmr
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_viommu_p2-v5
Changelog
v5
* Dropped driver-allocated vDEVICE support
* Changed vdev_to_dev helper to iommufd_viommu_find_dev
v4
https://lore.kernel.org/all/cover.1729555967.git.nicolinc@nvidia.com/
* Added missing brackets in switch-case
* Fixed the unreleased idev refcount issue
* Reworked the iommufd_vdevice_alloc allocator
* Dropped support for IOMMU_VIOMMU_TYPE_DEFAULT
* Added missing TEST_LENGTH and fail_nth coverages
* Added a verification to the driver-allocated vDEVICE object
* Added an iommufd_vdevice_abort for a missing mutex protection
* Added a u64 structure arm_vsmmu_invalidation_cmd for user command
conversion
v3
https://lore.kernel.org/all/cover.1728491532.git.nicolinc@nvidia.com/
* Added Jason's Reviewed-by
* Split this invalidation part out of the part-1 series
* Repurposed VDEV_ID ioctl to a wider vDEVICE structure and ioctl
* Reduced viommu_api functions by allowing drivers to access viommu
and vdevice structure directly
* Dropped vdevs_rwsem by using xa_lock instead
* Dropped arm_smmu_cache_invalidate_user
v2
https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numnbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Jason Gunthorpe (2):
iommu: Add iommu_copy_struct_from_full_user_array helper
iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
Nicolin Chen (11):
iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl
iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
iommufd/hw_pagetable: Enforce invalidation op on vIOMMU-based
hwpt_nested
iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
iommufd/viommu: Add iommufd_viommu_find_dev helper
iommufd/selftest: Add mock_viommu_cache_invalidate
iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
Documentation: userspace-api: iommufd: Update vDEVICE
iommu/arm-smmu-v3: Add arm_vsmmu_cache_invalidate
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 9 +-
drivers/iommu/iommufd/iommufd_private.h | 20 ++
drivers/iommu/iommufd/iommufd_test.h | 30 +++
include/linux/iommu.h | 48 ++++-
include/linux/iommufd.h | 21 ++
include/uapi/linux/iommufd.h | 61 +++++-
tools/testing/selftests/iommu/iommufd_utils.h | 83 +++++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 161 +++++++++++++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 32 ++-
drivers/iommu/iommufd/device.c | 11 +
drivers/iommu/iommufd/driver.c | 13 ++
drivers/iommu/iommufd/hw_pagetable.c | 36 +++-
drivers/iommu/iommufd/main.c | 7 +
drivers/iommu/iommufd/selftest.c | 98 ++++++++-
drivers/iommu/iommufd/viommu.c | 101 +++++++++
tools/testing/selftests/iommu/iommufd.c | 204 +++++++++++++++++-
.../selftests/iommu/iommufd_fail_nth.c | 4 +
Documentation/userspace-api/iommufd.rst | 41 +++-
18 files changed, 942 insertions(+), 38 deletions(-)
--
2.43.0
Compiled binary files should be added to .gitignore
'git status' complains:
Untracked files:
(use "git add <file>..." to include in what will be committed)
net/netfilter/conntrack_reverse_clash
Cc: Pablo Neira Ayuso <pablo(a)netfilter.org>
Cc: Jozsef Kadlecsik <kadlec(a)netfilter.org>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: Eric Dumazet <edumazet(a)google.com>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
Cc: netfilter-devel(a)vger.kernel.org
Cc: coreteam(a)netfilter.org
Cc: netdev(a)vger.kernel.org
---
Hello,
Cover letter is here.
This patch set aims to make 'git status' clear after 'make' and 'make
run_tests' for kselftests.
---
V3:
sort the files
V2:
split as a separate patch from a small one [0]
[0] https://lore.kernel.org/linux-kselftest/20241015010817.453539-1-lizhijian@f…
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
tools/testing/selftests/net/netfilter/.gitignore | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/net/netfilter/.gitignore b/tools/testing/selftests/net/netfilter/.gitignore
index 0a64d6d0e29a..64c4f8d9aa6c 100644
--- a/tools/testing/selftests/net/netfilter/.gitignore
+++ b/tools/testing/selftests/net/netfilter/.gitignore
@@ -2,5 +2,6 @@
audit_logread
connect_close
conntrack_dump_flush
+conntrack_reverse_clash
sctp_collision
nf_queue
--
2.44.0
Compiled binary files should be added to .gitignore
'git status' complains:
Untracked files:
(use "git add <file>..." to include in what will be committed)
alsa/global-timer
alsa/utimer-test
Cc: Mark Brown <broonie(a)kernel.org>
Cc: Jaroslav Kysela <perex(a)perex.cz>
Cc: Takashi Iwai <tiwai(a)suse.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
Cc: linux-sound(a)vger.kernel.org
---
Hello,
Cover letter is here.
This patch set aims to make 'git status' clear after 'make' and 'make
run_tests' for kselftests.
---
V2:
split as a separate patch from a small one [0]
[0] https://lore.kernel.org/linux-kselftest/20241015010817.453539-1-lizhijian@f…
---
tools/testing/selftests/alsa/.gitignore | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/alsa/.gitignore b/tools/testing/selftests/alsa/.gitignore
index 12dc3fcd3456..1407fd24a97b 100644
--- a/tools/testing/selftests/alsa/.gitignore
+++ b/tools/testing/selftests/alsa/.gitignore
@@ -1,3 +1,5 @@
mixer-test
pcm-test
test-pcmtest-driver
+global-timer
+utimer-test
--
2.44.0
The goal of the series is to simplify and make it possible to use
ncdevmem in an automated way from the ksft python wrapper.
ncdevmem is slowly mutated into a state where it uses stdout
to print the payload and the python wrapper is added to
make sure the arrived payload matches the expected one.
v5:
- properly handle errors from inet_pton() and socket() (Paolo)
- remove unneeded import from python selftest (Paolo)
v4:
- keep usage example with validation (Mina)
- fix compilation issue in one patch (s/start_queues/start_queue/)
v3:
- keep and refine the comment about ncdevmem invocation (Mina)
- add the comment about not enforcing exit status for ntuple reset (Mina)
- make configure_headersplit more robust (Mina)
- use num_queues/2 in selftest and let the users override it (Mina)
- remove memory_provider.memcpy_to_device (Mina)
- keep ksft as is (don't use -v validate flags): we are gonna
need a --debug-disable flag to make it less chatty; otherwise
it times out when sending too much data; so leaving it as
a separate follow up
v2:
- don't remove validation (Mina)
- keep 5-tuple flow steering but use it only when -c is provided (Mina)
- remove separate flag for probing (Mina)
- move ncdevmem under drivers/net/hw, not drivers/net (Jakub)
Cc: Mina Almasry <almasrymina(a)google.com>
Stanislav Fomichev (12):
selftests: ncdevmem: Redirect all non-payload output to stderr
selftests: ncdevmem: Separate out dmabuf provider
selftests: ncdevmem: Unify error handling
selftests: ncdevmem: Make client_ip optional
selftests: ncdevmem: Remove default arguments
selftests: ncdevmem: Switch to AF_INET6
selftests: ncdevmem: Properly reset flow steering
selftests: ncdevmem: Use YNL to enable TCP header split
selftests: ncdevmem: Remove hard-coded queue numbers
selftests: ncdevmem: Run selftest when none of the -s or -c has been
provided
selftests: ncdevmem: Move ncdevmem under drivers/net/hw
selftests: ncdevmem: Add automated test
.../selftests/drivers/net/hw/.gitignore | 1 +
.../testing/selftests/drivers/net/hw/Makefile | 9 +
.../selftests/drivers/net/hw/devmem.py | 45 +
.../selftests/drivers/net/hw/ncdevmem.c | 773 ++++++++++++++++++
tools/testing/selftests/net/.gitignore | 1 -
tools/testing/selftests/net/Makefile | 8 -
tools/testing/selftests/net/ncdevmem.c | 570 -------------
7 files changed, 828 insertions(+), 579 deletions(-)
create mode 100644 tools/testing/selftests/drivers/net/hw/.gitignore
create mode 100755 tools/testing/selftests/drivers/net/hw/devmem.py
create mode 100644 tools/testing/selftests/drivers/net/hw/ncdevmem.c
delete mode 100644 tools/testing/selftests/net/ncdevmem.c
--
2.47.0
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffer from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad` else cpu will raise software
check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparision above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
More information an details can be found at extensions github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 already supports shadow stack for user mode and arm64 support for GCS in
usermode [7] is ongoing.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead exposes a prctl interface to enable, disable and lock the shadow stack
or landing pad feature for a task. This allows userspace (loader) to enumerate
if all objects in its address space are compiled with shadow stack and landing
pad support and accordingly enable the feature. Additionally if a subsequent
`dlopen` happens on a library, user mode can take a decision again to disable
the feature (if incoming library is not compiled with support) OR terminate the
task (if user mode policy is strict to have all objects in address space to be
compiled with control flow integirty cpu feature). prctl to enable shadow stack
results in allocating shadow stack from virtual memory and activating for user
address space. x86 and arm64 are also following same direction due to similar
reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions)
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
for different contexts managed by a single thread (green threads or contexts)
risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption and corruption bugs can be utilized by attacker in this race window
to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacting, kernel prepares a token and place
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext struture, reads token from shadow stack and validates it and only
then allow sigreturn to succeed. Compatiblity issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
re-factor the code little bit to allow future sigcontext management easy (as
proposed by Andy Chiu from SiFive)
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
optin is presented only if toolchain has shadow stack and landing pad support.
And is on purpose guarded by toolchain support. Reason being that eventually
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
$ git clone git@github.com:deepak0414/qemu.git -b zicfilp_zicfiss_ratified_master_july11
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
Branch where user cfi enabling patches are maintained
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straight forward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may be operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
[7] - https://lore.kernel.org/all/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.or…
To: Thomas Gleixner <tglx(a)linutronix.de>
To: Ingo Molnar <mingo(a)redhat.com>
To: Borislav Petkov <bp(a)alien8.de>
To: Dave Hansen <dave.hansen(a)linux.intel.com>
To: x86(a)kernel.org
To: H. Peter Anvin <hpa(a)zytor.com>
To: Andrew Morton <akpm(a)linux-foundation.org>
To: Liam R. Howlett <Liam.Howlett(a)oracle.com>
To: Vlastimil Babka <vbabka(a)suse.cz>
To: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
To: Paul Walmsley <paul.walmsley(a)sifive.com>
To: Palmer Dabbelt <palmer(a)dabbelt.com>
To: Albert Ou <aou(a)eecs.berkeley.edu>
To: Conor Dooley <conor(a)kernel.org>
To: Rob Herring <robh(a)kernel.org>
To: Krzysztof Kozlowski <krzk+dt(a)kernel.org>
To: Arnd Bergmann <arnd(a)arndb.de>
To: Christian Brauner <brauner(a)kernel.org>
To: Peter Zijlstra <peterz(a)infradead.org>
To: Oleg Nesterov <oleg(a)redhat.com>
To: Eric Biederman <ebiederm(a)xmission.com>
To: Kees Cook <kees(a)kernel.org>
To: Jonathan Corbet <corbet(a)lwn.net>
To: Shuah Khan <shuah(a)kernel.org>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-fsdevel(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Cc: linux-riscv(a)lists.infradead.org
Cc: devicetree(a)vger.kernel.org
Cc: linux-arch(a)vger.kernel.org
Cc: linux-doc(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: alistair.francis(a)wdc.com
Cc: richard.henderson(a)linaro.org
Cc: jim.shu(a)sifive.com
Cc: andybnac(a)gmail.com
Cc: kito.cheng(a)sifive.com
Cc: charlie(a)rivosinc.com
Cc: atishp(a)rivosinc.com
Cc: evan(a)rivosinc.com
Cc: cleger(a)rivosinc.com
Cc: alexghiti(a)rivosinc.com
Cc: samitolvanen(a)google.com
Cc: broonie(a)kernel.org
Cc: rick.p.edgecombe(a)intel.com
Signed-off-by: Deepak Gupta <debug(a)rivosinc.com>
---
changelog
---------
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Clément Léger (1):
riscv: Add Firmware Feature SBI extensions definitions
Deepak Gupta (26):
mm: helper `is_shadow_stack_vma` to check shadow stack vma
riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv mm: manufacture shadow stack pte
riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs
riscv mmu: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
prctl: arch-agnostic prctl for indirect branch tracking
riscv: Implements arch agnostic shadow stack prctls
riscv: Implements arch agnostic indirect branch tracking prctls
riscv/traps: Introduce software check exception
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: kernel command line option to opt out of user cfi
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Mark Brown (2):
mm: Introduce ARCH_HAS_USER_SHADOW_STACK
prctl: arch-agnostic prctl for shadow stack
Samuel Holland (3):
riscv: Enable cbo.zero only when all harts support Zicboz
riscv: Add support for per-thread envcfg CSR values
riscv: Call riscv_user_isa_enable() only on the boot hart
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 176 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 21 +
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/cpufeature.h | 15 +-
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 24 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 3 +
arch/riscv/include/asm/sbi.h | 27 ++
arch/riscv/include/asm/switch_to.h | 8 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 89 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 22 +
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 2 +
arch/riscv/kernel/asm-offsets.c | 8 +
arch/riscv/kernel/cpufeature.c | 13 +-
arch/riscv/kernel/entry.S | 31 +-
arch/riscv/kernel/head.S | 12 +
arch/riscv/kernel/process.c | 31 +-
arch/riscv/kernel/ptrace.c | 83 ++++
arch/riscv/kernel/signal.c | 140 +++++-
arch/riscv/kernel/smpboot.c | 2 -
arch/riscv/kernel/suspend.c | 4 +-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 42 ++
arch/riscv/kernel/usercfi.c | 526 +++++++++++++++++++++
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 17 +
arch/x86/Kconfig | 1 +
fs/proc/task_mmu.c | 2 +-
include/linux/cpu.h | 4 +
include/linux/mm.h | 5 +-
include/uapi/asm-generic/mman.h | 4 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 48 ++
kernel/sys.c | 60 +++
mm/Kconfig | 6 +
mm/gup.c | 2 +-
mm/mmap.c | 1 +
mm/vma.h | 10 +-
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 10 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 84 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 78 +++
tools/testing/selftests/riscv/cfi/shadowstack.c | 373 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 37 ++
56 files changed, 2190 insertions(+), 42 deletions(-)
---
base-commit: 7d9923ee3960bdbfaa7f3a4e0ac2364e770c46ff
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
The 2024 architecture release includes a number of data processing
extensions, mostly SVE and SME additions with a few others. These are
all very straightforward extensions which add instructions but no
architectural state so only need hwcaps and exposing of the ID registers
to KVM guests and userspace.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (9):
arm64/sysreg: Update ID_AA64PFR2_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ISAR3_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64FPFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ZFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64SMFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ISAR2_EL1 to DDI0601 2024-09
arm64/hwcap: Describe 2024 dpISA extensions to userspace
KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
kselftest/arm64: Add 2024 dpISA extensions to hwcap test
Documentation/arch/arm64/elf_hwcaps.rst | 51 ++++++
arch/arm64/include/asm/hwcap.h | 17 ++
arch/arm64/include/uapi/asm/hwcap.h | 17 ++
arch/arm64/kernel/cpufeature.c | 35 ++++
arch/arm64/kernel/cpuinfo.c | 17 ++
arch/arm64/kvm/sys_regs.c | 3 +-
arch/arm64/tools/sysreg | 87 +++++++++-
tools/testing/selftests/arm64/abi/hwcap.c | 273 +++++++++++++++++++++++++++++-
8 files changed, 490 insertions(+), 10 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241008-arm64-2024-dpisa-8091074a7f48
Best regards,
--
Mark Brown <broonie(a)kernel.org>
One of the things that fp-stress does to stress the floating point
context switching is send signals to the test threads it spawns.
Currently we do this once per second but as suggested by Mark Rutland if
we increase this we can improve the chances of triggering any issues
with context switching the signal handling code. Do a quick change to
increase the rate of signal delivery, trying to avoid excessive impact
on emulated platforms, and a further change to mitigate the impact that
this creates during startup.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
kselftest/arm64: Increase frequency of signal delivery in fp-stress
kselftest/arm64: Lower poll interval while waiting for fp-stress children
tools/testing/selftests/arm64/fp/fp-stress.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
---
base-commit: 8e929cb546ee42c9a61d24fae60605e9e3192354
change-id: 20241028-arm64-fp-stress-interval-8f5e62c06e12
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Userland library functions such as allocators and threading implementations
often require regions of memory to act as 'guard pages' - mappings which,
when accessed, result in a fatal signal being sent to the accessing
process.
The current means by which these are implemented is via a PROT_NONE mmap()
mapping, which provides the required semantics however incur an overhead of
a VMA for each such region.
With a great many processes and threads, this can rapidly add up and incur
a significant memory penalty. It also has the added problem of preventing
merges that might otherwise be permitted.
This series takes a different approach - an idea suggested by Vlasimil
Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
provenance becomes a little tricky to ascertain after this - please forgive
any omissions!) - rather than locating the guard pages at the VMA layer,
instead placing them in page tables mapping the required ranges.
Early testing of the prototype version of this code suggests a 5 times
speed up in memory mapping invocations (in conjunction with use of
process_madvise()) and a 13% reduction in VMAs on an entirely idle android
system and unoptimised code.
We expect with optimisation and a loaded system with a larger number of
guard pages this could significantly increase, but in any case these
numbers are encouraging.
This way, rather than having separate VMAs specifying which parts of a
range are guard pages, instead we have a VMA spanning the entire range of
memory a user is permitted to access and including ranges which are to be
'guarded'.
After mapping this, a user can specify which parts of the range should
result in a fatal signal when accessed.
By restricting the ability to specify guard pages to memory mapped by
existing VMAs, we can rely on the mappings being torn down when the
mappings are ultimately unmapped and everything works simply as if the
memory were not faulted in, from the point of view of the containing VMAs.
This mechanism in effect poisons memory ranges similar to hardware memory
poisoning, only it is an entirely software-controlled form of poisoning.
The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL
which installs page table-level guard page markers - and
MADV_GUARD_REMOVE - which clears them.
Guard markers can be installed across multiple VMAs and any existing
mappings will be cleared, that is zapped, before installing the guard page
markers in the page tables.
There is no concept of 'nested' guard markers, multiple attempts to install
guard markers in a range will, after the first attempt, have no effect.
Importantly, removing guard markers over a range that contains both guard
markers and ordinary backed memory has no effect on anything but the guard
markers (including leaving huge pages un-split), so a user can safely
remove guard markers over a range of memory leaving the rest intact.
The actual mechanism by which the page table entries are specified makes
use of existing logic - PTE markers, which are used for the userfaultfd
UFFDIO_POISON mechanism.
Unfortunately PTE_MARKER_POISONED is not suited for the guard page
mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
logic to handle it.
We also extend the generic page walk mechanism to allow for installation of
PTEs (carefully restricted to memory management logic only to prevent
unwanted abuse).
We ensure that zapping performed by MADV_DONTNEED and MADV_FREE do not
remove guard markers, nor does forking (except when VM_WIPEONFORK is
specified for a VMA which implies a total removal of memory
characteristics).
It's important to note that the guard page implementation is emphatically
NOT a security feature, so a user can remove the markers if they wish. We
simply implement it in such a way as to provide the least surprising
behaviour.
An extensive set of self-tests are provided which ensure behaviour is as
expected and additionally self-documents expected behaviour of guard
ranges.
Suggested-by: Vlastimil Babka <vbabka(a)suse.cz>
Suggested-by: Jann Horn <jannh(a)google.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
v4
* Use restart_syscall() to implement -ERESTARTNOINTR to ensure correctly
handled by kernel - tested this code path and confirmed it works
correctly. Thanks to Vlastimil for pointing this issue out!
* Updated the vector_madvise() handler to not unnecessarily invoke
cond_resched() as suggested by Vlastimil.
* Updated guard page tests to add a test for a vector operation which
overwrites existing mappings. Tested this against the -ERESTARTNOINTR
case and confirmed working.
* Improved page walk logic further, refactoring handling logic as suggested
by Vlastimil.
* Moved MAX_MADVISE_GUARD_RETRIES to mm/madvise.c as suggested by Vlastimil.
v3
* Cleaned up mm/pagewalk.c logic a bit to make things clearer, as suggested
by Vlastiml.
* Explicitly avoid splitting THP on PTE installation, as suggested by
Vlastimil. Note this has no impact on the guard pages logic, which has
page table entry handlers at PUD, PMD and PTE level.
* Added WARN_ON_ONCE() to mm/hugetlb.c path where we don't expect a guard
marker, as suggested by Vlastimil.
* Reverted change to is_poisoned_swp_entry() to exclude guard pages which
has the effect of MADV_FREE _not_ clearing guard pages. After discussion
with Vlastimil, it became apparent that the ability to 'cancel' the
freeing operation by writing to the mapping after having issued an
MADV_FREE would mean that we would risk unexpected behaviour should the
guard pages be removed, so we now do not remove markers here at all.
* Added comment to PTE_MARKER_GUARD to highlight that memory tagged with
the marker behaves as if it were a region mapped PROT_NONE, as
highlighted by David.
* Rename poison -> install, unpoison -> remove (i.e. MADV_GUARD_INSTALL /
MADV_GUARD_REMOVE over MADV_GUARD_POISON / MADV_GUARD_REMOVE) at the
request of David and John who both find the poison analogy
confusing/overloaded.
* After a lot of discussion, replace the looping behaviour should page
faults race with guard page installation with a modest reattempt followed
by returning -ERESTARTNOINTR to have the operation abort and re-enter,
relieving lock contention and avoiding the possibility of allowing a
malicious sandboxed process to impact the mmap lock or stall the overall
process more than necessary, as suggested by Jann and Vlastimil having
raised the issue.
* Adjusted the page table walker so a populated huge PUD or PMD is
correctly treated as being populated, necessitating a zap. In v2 we
incorrectly skipped over these, which would cause the logic to wrongly
proceed as if nothing were populated and the install succeeded.
Instead, explicitly check to see if a huge page - if so, do not split but
rather abort the operation and let zap take care of things.
* Updated the guard remove logic to not unnecessarily split huge pages
either.
* Added a debug check to assert that the number of installed PTEs matches
expectation, accounting for any existing guard pages.
* Adapted vector_madvise() used by the process_madvise() system call to
handle -ERESTARTNOINTR correctly.
https://lore.kernel.org/all/cover.1729699916.git.lorenzo.stoakes@oracle.com/
v2
* The macros in kselftest_harness.h seem to be broken - __EXPECT() is
terminated by '} while (0); OPTIONAL_HANDLER(_assert)' meaning it is not
safe in single line if / else or for /which blocks, however working
around this results in checkpatch producing invalid warnings, as reported
by Shuah.
* Fixing these macros is out of scope for this series, so compromise and
instead rewrite test blocks so as to use multiple lines by separating out
a decl in most cases. This has the side effect of, for the most part,
making things more readable.
* Heavily document the use of the volatile keyword - we can't avoid
checkpatch complaining about this, so we explain it, as reported by
Shuah.
* Updated commit message to highlight that we skip tests we lack
permissions for, as reported by Shuah.
* Replaced a perror() with ksft_exit_fail_perror(), as reported by Shuah.
* Added user friendly messages to cases where tests are skipped due to lack
of permissions, as reported by Shuah.
* Update the tool header to include the new MADV_GUARD_POISON/UNPOISON
defines and directly include asm-generic/mman.h to get the
platform-neutral versions to ensure we import them.
* Finally fixed Vlastimil's email address in Suggested-by tags from suze to
suse, as reported by Vlastimil.
* Added linux-api to cc list, as reported by Vlastimil.
https://lore.kernel.org/all/cover.1729440856.git.lorenzo.stoakes@oracle.com/
v1
* Un-RFC'd as appears no major objections to approach but rather debate on
implementation.
* Fixed issue with arches which need mmu_context.h and
tlbfush.h. header imports in pagewalker logic to be able to use
update_mmu_cache() as reported by the kernel test bot.
* Added comments in page walker logic to clarify who can use
ops->install_pte and why as well as adding a check_ops_valid() helper
function, as suggested by Christoph.
* Pass false in full parameter in pte_clear_not_present_full() as suggested
by Jann.
* Stopped erroneously requiring a write lock for the poison operation as
suggested by Jann and Suren.
* Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
consistent with how this is used elsewhere in the kernel as suggested by
Jann.
* Avoid returning -EAGAIN if we are raced on page faults, just keep looping
and duck out if a fatal signal is pending or a conditional reschedule is
needed, as suggested by Jann.
* Avoid needlessly splitting huge PUDs and PMDs by specifying
ACTION_CONTINUE, as suggested by Jann.
https://lore.kernel.org/all/cover.1729196871.git.lorenzo.stoakes@oracle.com/
RFC
https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (5):
mm: pagewalk: add the ability to install PTEs
mm: add PTE_MARKER_GUARD PTE marker
mm: madvise: implement lightweight guard page mechanism
tools: testing: update tools UAPI header for mman-common.h
selftests/mm: add self tests for guard page feature
arch/alpha/include/uapi/asm/mman.h | 3 +
arch/mips/include/uapi/asm/mman.h | 3 +
arch/parisc/include/uapi/asm/mman.h | 3 +
arch/xtensa/include/uapi/asm/mman.h | 3 +
include/linux/mm_inline.h | 2 +-
include/linux/pagewalk.h | 18 +-
include/linux/swapops.h | 24 +-
include/uapi/asm-generic/mman-common.h | 3 +
mm/hugetlb.c | 4 +
mm/internal.h | 6 +
mm/madvise.c | 239 ++++
mm/memory.c | 18 +-
mm/mprotect.c | 6 +-
mm/mseal.c | 1 +
mm/pagewalk.c | 246 +++-
tools/include/uapi/asm-generic/mman-common.h | 3 +
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/guard-pages.c | 1243 ++++++++++++++++++
19 files changed, 1751 insertions(+), 76 deletions(-)
create mode 100644 tools/testing/selftests/mm/guard-pages.c
--
2.47.0
Use a less populated IP range to run the tests, as suggested by Petr in
Link: https://lore.kernel.org/netdev/87ikvukv3s.fsf@nvidia.com/.
Suggested-by: Petr Machata <petrm(a)nvidia.com>
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/drivers/net/netcons_basic.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/netcons_basic.sh b/tools/testing/selftests/drivers/net/netcons_basic.sh
index 06021b2059b7..4ad1e216c6b0 100755
--- a/tools/testing/selftests/drivers/net/netcons_basic.sh
+++ b/tools/testing/selftests/drivers/net/netcons_basic.sh
@@ -20,9 +20,9 @@ SCRIPTDIR=$(dirname "$(readlink -e "${BASH_SOURCE[0]}")")
# Simple script to test dynamic targets in netconsole
SRCIF="" # to be populated later
-SRCIP=192.168.1.1
+SRCIP=192.168.2.1
DSTIF="" # to be populated later
-DSTIP=192.168.1.2
+DSTIP=192.168.2.2
PORT="6666"
MSG="netconsole selftest"
--
2.43.5
Extend netcons_basic selftest to verify the userdata functionality by:
1. Creating a test key in the userdata configfs directory
2. Writing a known value to the key
3. Validating the key-value pair appears in the captured network output
This ensures the userdata feature is properly tested during selftests.
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
.../selftests/drivers/net/netcons_basic.sh | 29 +++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/tools/testing/selftests/drivers/net/netcons_basic.sh b/tools/testing/selftests/drivers/net/netcons_basic.sh
index 8d28e5189e91..182eb1a97e59 100755
--- a/tools/testing/selftests/drivers/net/netcons_basic.sh
+++ b/tools/testing/selftests/drivers/net/netcons_basic.sh
@@ -26,10 +26,13 @@ DSTIP=192.0.2.2
PORT="6666"
MSG="netconsole selftest"
+USERDATA_KEY="key"
+USERDATA_VALUE="value"
TARGET=$(mktemp -u netcons_XXXXX)
DEFAULT_PRINTK_VALUES=$(cat /proc/sys/kernel/printk)
NETCONS_CONFIGFS="/sys/kernel/config/netconsole"
NETCONS_PATH="${NETCONS_CONFIGFS}"/"${TARGET}"
+KEY_PATH="${NETCONS_PATH}/userdata/${USERDATA_KEY}"
# NAMESPACE will be populated by setup_ns with a random value
NAMESPACE=""
@@ -122,6 +125,8 @@ function cleanup() {
# delete netconsole dynamic reconfiguration
echo 0 > "${NETCONS_PATH}"/enabled
+ # Remove key
+ rmdir "${KEY_PATH}"
# Remove the configfs entry
rmdir "${NETCONS_PATH}"
@@ -136,6 +141,18 @@ function cleanup() {
echo "${DEFAULT_PRINTK_VALUES}" > /proc/sys/kernel/printk
}
+function set_user_data() {
+ if [[ ! -d "${NETCONS_PATH}""/userdata" ]]
+ then
+ echo "Userdata path not available in ${NETCONS_PATH}/userdata"
+ exit "${ksft_skip}"
+ fi
+
+ mkdir -p "${KEY_PATH}"
+ VALUE_PATH="${KEY_PATH}""/value"
+ echo "${USERDATA_VALUE}" > "${VALUE_PATH}"
+}
+
function listen_port_and_save_to() {
local OUTPUT=${1}
# Just wait for 2 seconds
@@ -146,6 +163,10 @@ function listen_port_and_save_to() {
function validate_result() {
local TMPFILENAME="$1"
+ # TMPFILENAME will contain something like:
+ # 6.11.1-0_fbk0_rc13_509_g30d75cea12f7,13,1822,115075213798,-;netconsole selftest: netcons_gtJHM
+ # key=value
+
# Check if the file exists
if [ ! -f "$TMPFILENAME" ]; then
echo "FAIL: File was not generated." >&2
@@ -158,6 +179,12 @@ function validate_result() {
exit "${ksft_fail}"
fi
+ if ! grep -q "${USERDATA_KEY}=${USERDATA_VALUE}" "${TMPFILENAME}"; then
+ echo "FAIL: ${USERDATA_KEY}=${USERDATA_VALUE} not found in ${TMPFILENAME}" >&2
+ cat "${TMPFILENAME}" >&2
+ exit "${ksft_fail}"
+ fi
+
# Delete the file once it is validated, otherwise keep it
# for debugging purposes
rm "${TMPFILENAME}"
@@ -220,6 +247,8 @@ trap cleanup EXIT
set_network
# Create a dynamic target for netconsole
create_dynamic_target
+# Set userdata "key" with the "value" value
+set_user_data
# Listed for netconsole port inside the namespace and destination interface
listen_port_and_save_to "${OUTPUT_FILE}" &
# Wait for socat to start and listen to the port.
--
2.43.5
Use a less populated IP range to run the tests, as suggested by Petr in
Link: https://lore.kernel.org/netdev/87ikvukv3s.fsf@nvidia.com/.
Suggested-by: Petr Machata <petrm(a)nvidia.com>
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
tools/testing/selftests/drivers/net/netcons_basic.sh | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/netcons_basic.sh b/tools/testing/selftests/drivers/net/netcons_basic.sh
index 06021b2059b7..8d28e5189e91 100755
--- a/tools/testing/selftests/drivers/net/netcons_basic.sh
+++ b/tools/testing/selftests/drivers/net/netcons_basic.sh
@@ -20,9 +20,9 @@ SCRIPTDIR=$(dirname "$(readlink -e "${BASH_SOURCE[0]}")")
# Simple script to test dynamic targets in netconsole
SRCIF="" # to be populated later
-SRCIP=192.168.1.1
+SRCIP=192.0.2.1
DSTIF="" # to be populated later
-DSTIP=192.168.1.2
+DSTIP=192.0.2.2
PORT="6666"
MSG="netconsole selftest"
--
2.43.5
This patch series is motivated by the following observation:
Raise a signal, jump to signal handler. The ucontext_t structure dumped
by kernel to userspace has a uc_sigmask field having the mask of blocked
signals. If you run a fresh minimalistic program doing this, this field
is empty, even if you block some signals while registering the handler
with sigaction().
Here is what the man-pages have to say:
sigaction(2): "sa_mask specifies a mask of signals which should be blocked
(i.e., added to the signal mask of the thread in which the signal handler
is invoked) during execution of the signal handler. In addition, the
signal which triggered the handler will be blocked, unless the SA_NODEFER
flag is used."
signal(7): Under "Execution of signal handlers", (1.3) implies:
"The thread's current signal mask is accessible via the ucontext_t
object that is pointed to by the third argument of the signal handler."
But, (1.4) states:
"Any signals specified in act->sa_mask when registering the handler with
sigprocmask(2) are added to the thread's signal mask. The signal being
delivered is also added to the signal mask, unless SA_NODEFER was
specified when registering the handler. These signals are thus blocked
while the handler executes."
There clearly is no distinction being made in the man pages between
"Thread's signal mask" and ucontext_t; this logically should imply
that a signal blocked by populating struct sigaction should be visible
in ucontext_t.
Here is what the kernel code does (for Aarch64):
do_signal() -> handle_signal() -> sigmask_to_save(), which returns
¤t->blocked, is passed to setup_rt_frame() -> setup_sigframe() ->
__copy_to_user(). Hence, ¤t->blocked is copied to ucontext_t
exposed to userspace. Returning back to handle_signal(),
signal_setup_done() -> signal_delivered() -> sigorsets() and
set_current_blocked() are responsible for using information from
struct ksignal ksig, which was populated through the sigaction()
system call in kernel/signal.c:
copy_from_user(&new_sa.sa, act, sizeof(new_sa.sa)),
to update ¤t->blocked; hence, the set of blocked signals for the
current thread is updated AFTER the kernel dumps ucontext_t to
userspace.
Assuming that the above is indeed the intended behaviour, because it
semantically makes sense, since the signals blocked using sigaction()
remain blocked only till the execution of the handler, and not in the
context present before jumping to the handler (but nothing can be
confirmed from the man-pages), the series introduces a test for
mangling with uc_sigmask. I will send a separate series to fix the
man-pages.
The proposed selftest has been tested out on Aarch32, Aarch64 and x86_64.
v5->v6:
- Drop renaming of sas.c
- Include the explanation from the cover letter in the changelog
for the second patch
v4->v5:
- Remove a redundant print statement
v3->v4:
- Allocate sigsets as automatic variables to avoid malloc()
v2->v3:
- ucontext describes current state -> ucontext describes interrupted context
- Add a comment for blockage of USR2 even after return from handler
- Describe blockage of signals in a better way
v1->v2:
- Replace all occurrences of SIGPIPE with SIGSEGV
- Fixed a mismatch between code comment and ksft log
- Add a testcase: Raise the same signal again; it must not be queued
- Remove unneeded <assert.h>, <unistd.h>
- Give a detailed test description in the comments; also describe the
exact meaning of delivered and blocked
- Handle errors for all libc functions/syscalls
- Mention tests in Makefile and .gitignore in alphabetical order
v1:
- https://lore.kernel.org/all/20240607122319.768640-1-dev.jain@arm.com/
Dev Jain (2):
selftests: Rename sigaltstack to generic signal
selftests: Add a test mangling with uc_sigmask
tools/testing/selftests/Makefile | 2 +-
.../{sigaltstack => signal}/.gitignore | 1 +
.../{sigaltstack => signal}/Makefile | 3 +-
.../current_stack_pointer.h | 0
.../selftests/signal/mangle_uc_sigmask.c | 184 ++++++++++++++++++
.../selftests/{sigaltstack => signal}/sas.c | 0
6 files changed, 188 insertions(+), 2 deletions(-)
rename tools/testing/selftests/{sigaltstack => signal}/.gitignore (70%)
rename tools/testing/selftests/{sigaltstack => signal}/Makefile (56%)
rename tools/testing/selftests/{sigaltstack => signal}/current_stack_pointer.h (100%)
create mode 100644 tools/testing/selftests/signal/mangle_uc_sigmask.c
rename tools/testing/selftests/{sigaltstack => signal}/sas.c (100%)
--
2.30.2
When running watchdog-test with 'make run_tests', the watchdog-test will
be terminated by a timeout signal(SIGTERM) due to the test timemout.
And then, a system reboot would happen due to watchdog not stop. see
the dmesg as below:
```
[ 1367.185172] watchdog: watchdog0: watchdog did not stop!
```
Fix it by registering more signals(including SIGTERM) in watchdog-test,
where its signal handler will stop the watchdog.
After that
# timeout 1 ./watchdog-test
Watchdog Ticking Away!
.
Stopping watchdog ticks...
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
tools/testing/selftests/watchdog/watchdog-test.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index bc71cbca0dde..a1f506ba5578 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -334,7 +334,13 @@ int main(int argc, char *argv[])
printf("Watchdog Ticking Away!\n");
+ /*
+ * Register the signals
+ */
signal(SIGINT, term);
+ signal(SIGTERM, term);
+ signal(SIGKILL, term);
+ signal(SIGQUIT, term);
while (1) {
keep_alive();
--
2.44.0
Currently, watchdog-test keep running until it gets a SIGINT. However,
when watchdog-test is executed from the kselftests framework, where it
launches test via timeout which will send SIGTERM in time up. This could
lead to
1. watchdog haven't stop, a watchdog reset is triggered to reboot the OS
in silent.
2. kselftests gets an timeout exit code, and judge watchdog-test as
'not ok'
This patch is prepare to fix above 2 issues
Signed-off-by: Li Zhijian <lizhijian(a)fujitsu.com>
---
Hey,
Cover letter is here.
It's notice that a OS reboot was triggerred after ran the watchdog-test
in kselftests framwork 'make run_tests', that's because watchdog-test
didn't stop feeding the watchdog after enable it.
In addition, current watchdog-test didn't adapt to the kselftests
framework which launchs the test with /usr/bin/timeout and no timeout
is expected.
---
tools/testing/selftests/watchdog/watchdog-test.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/watchdog/watchdog-test.c b/tools/testing/selftests/watchdog/watchdog-test.c
index bc71cbca0dde..2f8fd2670897 100644
--- a/tools/testing/selftests/watchdog/watchdog-test.c
+++ b/tools/testing/selftests/watchdog/watchdog-test.c
@@ -27,7 +27,7 @@
int fd;
const char v = 'V';
-static const char sopts[] = "bdehp:st:Tn:NLf:i";
+static const char sopts[] = "bdehp:st:Tn:NLf:c:i";
static const struct option lopts[] = {
{"bootstatus", no_argument, NULL, 'b'},
{"disable", no_argument, NULL, 'd'},
@@ -42,6 +42,7 @@ static const struct option lopts[] = {
{"gettimeleft", no_argument, NULL, 'L'},
{"file", required_argument, NULL, 'f'},
{"info", no_argument, NULL, 'i'},
+ {"count", required_argument, NULL, 'c'},
{NULL, no_argument, NULL, 0x0}
};
@@ -95,6 +96,7 @@ static void usage(char *progname)
printf(" -n, --pretimeout=T\tSet the pretimeout to T seconds\n");
printf(" -N, --getpretimeout\tGet the pretimeout\n");
printf(" -L, --gettimeleft\tGet the time left until timer expires\n");
+ printf(" -c, --count\tStop after feeding the watchdog count times\n");
printf("\n");
printf("Parameters are parsed left-to-right in real-time.\n");
printf("Example: %s -d -t 10 -p 5 -e\n", progname);
@@ -174,7 +176,7 @@ int main(int argc, char *argv[])
unsigned int ping_rate = DEFAULT_PING_RATE;
int ret;
int c;
- int oneshot = 0;
+ int oneshot = 0, stop = 1, count = 0;
char *file = "/dev/watchdog";
struct watchdog_info info;
int temperature;
@@ -307,6 +309,9 @@ int main(int argc, char *argv[])
else
printf("WDIOC_GETTIMELEFT error '%s'\n", strerror(errno));
break;
+ case 'c':
+ stop = 0;
+ count = strtoul(optarg, NULL, 0);
case 'f':
/* Handled above */
break;
@@ -336,8 +341,8 @@ int main(int argc, char *argv[])
signal(SIGINT, term);
- while (1) {
- keep_alive();
+ while (stop || count--) {
+ exit_code = keep_alive();
sleep(ping_rate);
}
end:
--
2.44.0
From: zhouyuhang <zhouyuhang(a)kylinos.cn>
Test case idmap_mount_tree_invalid failed to run on the newer kernel
with the following output:
# RUN mount_setattr_idmapped.idmap_mount_tree_invalid ...
# mount_setattr_test.c:1428:idmap_mount_tree_invalid:Expected sys_mount_setattr(open_tree_fd, "", AT_EMPTY_PATH, &attr, sizeof(attr)) (0) ! = 0 (0)
# idmap_mount_tree_invalid: Test terminated by assertion
This is because tmpfs is mounted at "/mnt/A", and tmpfs already
contains the flag FS_ALLOW_IDMAP after the commit 7a80e5b8c6fa ("shmem:
support idmapped mounts for tmpfs"). So calling sys_mount_setattr here
returns 0 instead of -EINVAL as expected.
Ramfs does not support idmap mounts, so we can use it here to test invalid mounts,
which allows the test case to pass with the following output:
# Starting 1 tests from 1 test cases.
# RUN mount_setattr_idmapped.idmap_mount_tree_invalid ...
# OK mount_setattr_idmapped.idmap_mount_tree_invalid
ok 1 mount_setattr_idmapped.idmap_mount_tree_invalid
# PASSED: 1 / 1 tests passed.
Signed-off-by: zhouyuhang <zhouyuhang(a)kylinos.cn>
---
.../testing/selftests/mount_setattr/mount_setattr_test.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/tools/testing/selftests/mount_setattr/mount_setattr_test.c b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
index c6a8c732b802..68801e1a9ec2 100644
--- a/tools/testing/selftests/mount_setattr/mount_setattr_test.c
+++ b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
@@ -1414,6 +1414,13 @@ TEST_F(mount_setattr_idmapped, idmap_mount_tree_invalid)
ASSERT_EQ(expected_uid_gid(-EBADF, "/tmp/B/b", 0, 0, 0), 0);
ASSERT_EQ(expected_uid_gid(-EBADF, "/tmp/B/BB/b", 0, 0, 0), 0);
+ ASSERT_EQ(mount("testing", "/mnt/A", "ramfs", MS_NOATIME | MS_NODEV,
+ "size=100000,mode=700"), 0);
+
+ ASSERT_EQ(mkdir("/mnt/A/AA", 0777), 0);
+
+ ASSERT_EQ(mount("/tmp", "/mnt/A/AA", NULL, MS_BIND | MS_REC, NULL), 0);
+
open_tree_fd = sys_open_tree(-EBADF, "/mnt/A",
AT_RECURSIVE |
AT_EMPTY_PATH |
@@ -1433,6 +1440,8 @@ TEST_F(mount_setattr_idmapped, idmap_mount_tree_invalid)
ASSERT_EQ(expected_uid_gid(-EBADF, "/tmp/B/BB/b", 0, 0, 0), 0);
ASSERT_EQ(expected_uid_gid(open_tree_fd, "B/b", 0, 0, 0), 0);
ASSERT_EQ(expected_uid_gid(open_tree_fd, "B/BB/b", 0, 0, 0), 0);
+
+ (void)umount2("/mnt/A", MNT_DETACH);
}
TEST_F(mount_setattr, mount_attr_nosymfollow)
--
2.27.0
Good day,
My name is Evan from Ukraine.I have tried to reach you but find it difficult due to internet scarcity here in this town.
I am contacting you because I want to come over to your country,I have some huge funds I plan to invest in your country because I want to relocate my Late Father business due to ongoing war in my country Ukraine and visit your country to set up new investment business you may advice profitable in your area.
I also have some millions of dollars belonging to my late father that i want to invest and 995kg of 24+carat 99.9% purity gold I want to ship to your country and establish gold jewelry manufacturing company.
I will offer you good monetary rewards for your help,due to how russian drone bombards this town frequently,internet network is disrupted because of attacks on network installations,please for fast discussion, you can add me on whatsapp and let's talk faster as i want to leave this war zone as soon as possible,this is my whatsapp number below.
+380-672575220
Email:evanbohdan@gmail.com
I wait for your earliest response.
Regards,
Evan
Whatsapp/Tel:+380-672575220
Email:evanbohdan@gmail.com
From: Jason Xing <kernelxing(a)tencent.com>
As we can see from the title, when I compiled the selftests/bpf, I
saw the error:
implicit declaration of function ‘gettid’ ; did you mean ‘getgid’? [-Werror=implicit-function-declaration]
skel->bss->tid = gettid();
^~~~~~
getgid
Adding a define to fix it (referring to
tools/perf/tests/shell/coresight/thread_loop/thread_loop.c file.
Signed-off-by: Jason Xing <kernelxing(a)tencent.com>
---
tools/testing/selftests/bpf/prog_tests/bpf_iter.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index f0a3a9c18e9e..a105759f3dcf 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -34,6 +34,8 @@
#include "bpf_iter_ksym.skel.h"
#include "bpf_iter_sockmap.skel.h"
+#define gettid() syscall(SYS_gettid)
+
static void test_btf_id_or_null(void)
{
struct bpf_iter_test_kern3 *skel;
--
2.37.3
Fixup small broken window panes in DAMON selftests and kunit tests.
First four patches clean up DAMON debugfs interface selftests output, by
fixing segmentation fault of a test program (patch 1), removing
unnecessary debugging messages (patch 2), and hiding error messages from
expected failures (patches 3 and 4).
Following two patches fix copy-paste mistakes in DAMON Kconfig help
message that copied from debugfs kunit test (patch 5) and a comment on
the debugfs kunit test code (patch 6).
Signed-off-by: SeongJae Park <sj(a)kernel.org>
Andrew Paniakin (1):
selftests/damon/huge_count_read_write: provide sufficiently large
buffer for DEPRECATED file read
SeongJae Park (5):
selftests/damon/huge_count_read_write: remove unnecessary debugging
message
selftests/damon/_debugfs_common: hide expected error message from
test_write_result()
selftests/damon/debugfs_duplicate_context_creation: hide errors from
expected file write failures
mm/damon/Kconfig: update DBGFS_KUNIT prompt copy for SYSFS_KUNIT
mm/damon/tests/dbgfs-kunit: fix the header double inclusion guarding
ifdef comment
mm/damon/Kconfig | 2 +-
mm/damon/tests/dbgfs-kunit.h | 2 +-
tools/testing/selftests/damon/_debugfs_common.sh | 7 ++++++-
.../selftests/damon/debugfs_duplicate_context_creation.sh | 2 +-
tools/testing/selftests/damon/huge_count_read_write.c | 4 +---
5 files changed, 10 insertions(+), 7 deletions(-)
base-commit: 13583c750117b4e10cdaf5578dcc7723b305ce4e
--
2.39.5
Two small fixes related to the MPTCP packets scheduler:
- Patch 1: add missing rcu_read_(un)lock(). A fix for >= 6.6.
- Patch 2: remove unneeded lock when listing packets schedulers. A fix
for >= 6.10.
And some modifications in the MPTCP selftests:
- Patch 3: a small addition to the MPTCP selftests to cover more code.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (3):
mptcp: init: protect sched with rcu_read_lock
mptcp: remove unneeded lock when listing scheds
selftests: mptcp: list sysctl data
net/mptcp/protocol.c | 2 ++
net/mptcp/sched.c | 2 --
tools/testing/selftests/net/mptcp/mptcp_connect.sh | 9 +++++++++
3 files changed, 11 insertions(+), 2 deletions(-)
---
base-commit: 3b05b9c36ddd01338e1352588f2ec1ea23f97d43
change-id: 20241021-net-mptcp-sched-lock-10dfc75d1e00
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Userland library functions such as allocators and threading implementations
often require regions of memory to act as 'guard pages' - mappings which,
when accessed, result in a fatal signal being sent to the accessing
process.
The current means by which these are implemented is via a PROT_NONE mmap()
mapping, which provides the required semantics however incur an overhead of
a VMA for each such region.
With a great many processes and threads, this can rapidly add up and incur
a significant memory penalty. It also has the added problem of preventing
merges that might otherwise be permitted.
This series takes a different approach - an idea suggested by Vlasimil
Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
provenance becomes a little tricky to ascertain after this - please forgive
any omissions!) - rather than locating the guard pages at the VMA layer,
instead placing them in page tables mapping the required ranges.
Early testing of the prototype version of this code suggests a 5 times
speed up in memory mapping invocations (in conjunction with use of
process_madvise()) and a 13% reduction in VMAs on an entirely idle android
system and unoptimised code.
We expect with optimisation and a loaded system with a larger number of
guard pages this could significantly increase, but in any case these
numbers are encouraging.
This way, rather than having separate VMAs specifying which parts of a
range are guard pages, instead we have a VMA spanning the entire range of
memory a user is permitted to access and including ranges which are to be
'guarded'.
After mapping this, a user can specify which parts of the range should
result in a fatal signal when accessed.
By restricting the ability to specify guard pages to memory mapped by
existing VMAs, we can rely on the mappings being torn down when the
mappings are ultimately unmapped and everything works simply as if the
memory were not faulted in, from the point of view of the containing VMAs.
This mechanism in effect poisons memory ranges similar to hardware memory
poisoning, only it is an entirely software-controlled form of poisoning.
The mechanism is implemented via madvise() behaviour - MADV_GUARD_INSTALL
which installs page table-level guard page markers - and
MADV_GUARD_REMOVE - which clears them.
Guard markers can be installed across multiple VMAs and any existing
mappings will be cleared, that is zapped, before installing the guard page
markers in the page tables.
There is no concept of 'nested' guard markers, multiple attempts to install
guard markers in a range will, after the first attempt, have no effect.
Importantly, removing guard markers over a range that contains both guard
markers and ordinary backed memory has no effect on anything but the guard
markers (including leaving huge pages un-split), so a user can safely
remove guard markers over a range of memory leaving the rest intact.
The actual mechanism by which the page table entries are specified makes
use of existing logic - PTE markers, which are used for the userfaultfd
UFFDIO_POISON mechanism.
Unfortunately PTE_MARKER_POISONED is not suited for the guard page
mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
logic to handle it.
We also extend the generic page walk mechanism to allow for installation of
PTEs (carefully restricted to memory management logic only to prevent
unwanted abuse).
We ensure that zapping performed by MADV_DONTNEED and MADV_FREE do not
remove guard markers, nor does forking (except when VM_WIPEONFORK is
specified for a VMA which implies a total removal of memory
characteristics).
It's important to note that the guard page implementation is emphatically
NOT a security feature, so a user can remove the markers if they wish. We
simply implement it in such a way as to provide the least surprising
behaviour.
An extensive set of self-tests are provided which ensure behaviour is as
expected and additionally self-documents expected behaviour of guard
ranges.
Suggested-by: Vlastimil Babka <vbabka(a)suse.cz>
Suggested-by: Jann Horn <jannh(a)google.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
v3
* Cleaned up mm/pagewalk.c logic a bit to make things clearer, as suggested
by Vlastiml.
* Explicitly avoid splitting THP on PTE installation, as suggested by
Vlastimil. Note this has no impact on the guard pages logic, which has
page table entry handlers at PUD, PMD and PTE level.
* Added WARN_ON_ONCE() to mm/hugetlb.c path where we don't expect a guard
marker, as suggested by Vlastimil.
* Reverted change to is_poisoned_swp_entry() to exclude guard pages which
has the effect of MADV_FREE _not_ clearing guard pages. After discussion
with Vlastimil, it became apparent that the ability to 'cancel' the
freeing operation by writing to the mapping after having issued an
MADV_FREE would mean that we would risk unexpected behaviour should the
guard pages be removed, so we now do not remove markers here at all.
* Added comment to PTE_MARKER_GUARD to highlight that memory tagged with
the marker behaves as if it were a region mapped PROT_NONE, as
highlighted by David.
* Rename poison -> install, unpoison -> remove (i.e. MADV_GUARD_INSTALL /
MADV_GUARD_REMOVE over MADV_GUARD_POISON / MADV_GUARD_REMOVE) at the
request of David and John who both find the poison analogy
confusing/overloaded.
* After a lot of discussion, replace the looping behaviour should page
faults race with guard page installation with a modest reattempt followed
by returning -ERESTARTNOINTR to have the operation abort and re-enter,
relieving lock contention and avoiding the possibility of allowing a
malicious sandboxed process to impact the mmap lock or stall the overall
process more than necessary, as suggested by Jann and Vlastimil having
raised the issue.
* Adjusted the page table walker so a populated huge PUD or PMD is
correctly treated as being populated, necessitating a zap. In v2 we
incorrectly skipped over these, which would cause the logic to wrongly
proceed as if nothing were populated and the install succeeded.
Instead, explicitly check to see if a huge page - if so, do not split but
rather abort the operation and let zap take care of things.
* Updated the guard remove logic to not unnecessarily split huge pages
either.
* Added a debug check to assert that the number of installed PTEs matches
expectation, accounting for any existing guard pages.
* Adapted vector_madvise() used by the process_madvise() system call to
handle -ERESTARTNOINTR correctly.
v2
* The macros in kselftest_harness.h seem to be broken - __EXPECT() is
terminated by '} while (0); OPTIONAL_HANDLER(_assert)' meaning it is not
safe in single line if / else or for /which blocks, however working
around this results in checkpatch producing invalid warnings, as reported
by Shuah.
* Fixing these macros is out of scope for this series, so compromise and
instead rewrite test blocks so as to use multiple lines by separating out
a decl in most cases. This has the side effect of, for the most part,
making things more readable.
* Heavily document the use of the volatile keyword - we can't avoid
checkpatch complaining about this, so we explain it, as reported by
Shuah.
* Updated commit message to highlight that we skip tests we lack
permissions for, as reported by Shuah.
* Replaced a perror() with ksft_exit_fail_perror(), as reported by Shuah.
* Added user friendly messages to cases where tests are skipped due to lack
of permissions, as reported by Shuah.
* Update the tool header to include the new MADV_GUARD_POISON/UNPOISON
defines and directly include asm-generic/mman.h to get the
platform-neutral versions to ensure we import them.
* Finally fixed Vlastimil's email address in Suggested-by tags from suze to
suse, as reported by Vlastimil.
* Added linux-api to cc list, as reported by Vlastimil.
https://lore.kernel.org/all/cover.1729440856.git.lorenzo.stoakes@oracle.com/
v1
* Un-RFC'd as appears no major objections to approach but rather debate on
implementation.
* Fixed issue with arches which need mmu_context.h and
tlbfush.h. header imports in pagewalker logic to be able to use
update_mmu_cache() as reported by the kernel test bot.
* Added comments in page walker logic to clarify who can use
ops->install_pte and why as well as adding a check_ops_valid() helper
function, as suggested by Christoph.
* Pass false in full parameter in pte_clear_not_present_full() as suggested
by Jann.
* Stopped erroneously requiring a write lock for the poison operation as
suggested by Jann and Suren.
* Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
consistent with how this is used elsewhere in the kernel as suggested by
Jann.
* Avoid returning -EAGAIN if we are raced on page faults, just keep looping
and duck out if a fatal signal is pending or a conditional reschedule is
needed, as suggested by Jann.
* Avoid needlessly splitting huge PUDs and PMDs by specifying
ACTION_CONTINUE, as suggested by Jann.
https://lore.kernel.org/all/cover.1729196871.git.lorenzo.stoakes@oracle.com/
RFC
https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (5):
mm: pagewalk: add the ability to install PTEs
mm: add PTE_MARKER_GUARD PTE marker
mm: madvise: implement lightweight guard page mechanism
tools: testing: update tools UAPI header for mman-common.h
selftests/mm: add self tests for guard page feature
arch/alpha/include/uapi/asm/mman.h | 3 +
arch/mips/include/uapi/asm/mman.h | 3 +
arch/parisc/include/uapi/asm/mman.h | 3 +
arch/xtensa/include/uapi/asm/mman.h | 3 +
include/linux/mm_inline.h | 2 +-
include/linux/pagewalk.h | 18 +-
include/linux/swapops.h | 24 +-
include/uapi/asm-generic/mman-common.h | 3 +
mm/hugetlb.c | 4 +
mm/internal.h | 12 +
mm/madvise.c | 225 ++++
mm/memory.c | 18 +-
mm/mprotect.c | 6 +-
mm/mseal.c | 1 +
mm/pagewalk.c | 227 +++-
tools/include/uapi/asm-generic/mman-common.h | 3 +
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/guard-pages.c | 1239 ++++++++++++++++++
19 files changed, 1720 insertions(+), 76 deletions(-)
create mode 100644 tools/testing/selftests/mm/guard-pages.c
--
2.47.0
The following kselftest arm64 and FVP failed with Linux next-20241025 on
- Qemu-arm64
- FVP
running Linux next-20241025 kernel.
First seen on next-20241025
Good: next-20241024
BAD: next-20241025
kselftest-arm64, FVP
* arm64_check_buffer_fill
* arm64_check_mmap_options
* arm64_check_child_memory
Anyone have seen these failures ?
Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
Test log:
----------
# selftests: arm64: check_buffer_fill
# 1..20
# ok 1 Check buffer correctness by byte with sync err mode and mmap memory
# ok 2 Check buffer correctness by byte with async err mode and mmap memory
# ok 3 Check buffer correctness by byte with sync err mode and
mmap/mprotect memory
# ok 4 Check buffer correctness by byte with async err mode and
mmap/mprotect memory
# ok 5 Check buffer write underflow by byte with sync mode and mmap memory
# ok 6 Check buffer write underflow by byte with async mode and mmap memory
# ok 7 Check buffer write underflow by byte with tag check fault
ignore and mmap memory
# ok 8 Check buffer write underflow by byte with sync mode and mmap memory
# ok 9 Check buffer write underflow by byte with async mode and mmap memory
# ok 10 Check buffer write underflow by byte with tag check fault
ignore and mmap memory
# ok 11 Check buffer write overflow by byte with sync mode and mmap memory
# ok 12 Check buffer write overflow by byte with async mode and mmap memory
# ok 13 Check buffer write overflow by byte with tag fault ignore mode
and mmap memory
# ok 14 Check buffer write correctness by block with sync mode and mmap memory
# ok 15 Check buffer write correctness by block with async mode and mmap memory
# ok 16 Check buffer write correctness by block with tag fault ignore
and mmap memory
# # FAIL: mmap allocation
# # FAIL: memory allocation
# not ok 17 Check initial tags with private mapping, sync error mode
and mmap memory
# ok 18 Check initial tags with private mapping, sync error mode and
mmap/mprotect memory
# # FAIL: mmap allocation
# # FAIL: memory allocation
# not ok 19 Check initial tags with shared mapping, sync error mode
and mmap memory
# ok 20 Check initial tags with shared mapping, sync error mode and
mmap/mprotect memory
# # Totals: pass:18 fail:2 xfail:0 xpass:0 skip:0 error:0
not ok 21 selftests: arm64: check_buffer_fill # exit=1
# timeout set to 45
# selftests: arm64: check_child_memory
# 1..12
# ok 1 Check child anonymous memory with private mapping, precise mode
and mmap memory
# ok 2 Check child anonymous memory with shared mapping, precise mode
and mmap memory
# ok 3 Check child anonymous memory with private mapping, imprecise
mode and mmap memory
# ok 4 Check child anonymous memory with shared mapping, imprecise
mode and mmap memory
# ok 5 Check child anonymous memory with private mapping, precise mode
and mmap/mprotect memory
# ok 6 Check child anonymous memory with shared mapping, precise mode
and mmap/mprotect memory
# # FAIL: mmap allocation
# # FAIL: memory allocation
# not ok 7 Check child file memory with private mapping, precise mode
and mmap memory
# # FAIL: mmap allocation
# # FAIL: memory allocation
# not ok 8 Check child file memory with shared mapping, precise mode
and mmap memory
# ok 9 Check child file memory with private mapping, imprecise mode
and mmap memory
# ok 10 Check child file memory with shared mapping, imprecise mode
and mmap memory
# ok 11 Check child file memory with private mapping, precise mode and
mmap/mprotect memory
# ok 12 Check child file memory with shared mapping, precise mode and
mmap/mprotect memory
# # Totals: pass:10 fail:2 xfail:0 xpass:0 skip:0 error:0
not ok 22 selftests: arm64: check_child_memory # exit=1
boot Log links,
--------
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241028/te…
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241028/te…
Test results history:
----------
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241028/te…
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20241028/te…
metadata:
----
git describe: next-20241028
git repo: https://gitlab.com/Linaro/lkft/mirrors/next/linux-next
git sha: dec9255a128e19c5fcc3bdb18175d78094cc624d
kernel config:
https://storage.tuxsuite.com/public/linaro/lkft/builds/2o3tMqzOtHXYQjlvfR5t…
build url: https://storage.tuxsuite.com/public/linaro/lkft/builds/2o3tMqzOtHXYQjlvfR5t…
toolchain: gcc-13
Steps to reproduce:
---------
- https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2o3tON5kNi…
- https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2o3tON5kNi…
--
Linaro LKFT
https://lkft.linaro.org
As the part-2 of the VIOMMU infrastructure, this series introduces a VIRQ
object after repurposing the existing FAULT object, which provides a nice
notification pathway to the user space already. So, the first thing to do
is reworking the FAULT object.
Mimicing the HWPT structures, add a common EVENT structure to support its
derivatives: EVENT_IOPF (the prior FAULT object) and EVENT_VIRQ (new one).
IOMMUFD_CMD_VIRQ_ALLOC is introduced to allocate EVENT_VIRQ for a VIOMMU.
One VIOMMU can have multiple VIRQs in different types but can not support
multiple VIRQs with the same types.
Drivers might need the VIOMMU's vdev_id list or the exact vdev_id link of
the passthrough device's to forward IRQs/events via the VIOMMU framework.
Thus, extend the set/unset_vdev_id ioctls down to the driver using VIOMMU
ops. This allows drivers to take the control of a vdev_id's lifecycle.
The forwarding part is fairly simple but might need to replace a physical
device ID with a virtual device ID. So, there comes with some helpers for
drivers to use.
As usual, this series comes with the selftest coverage for this new VIRQ,
and with a real world use case in the ARM SMMUv3 driver.
This must be based on the VIOMMU Part-1 series. It's on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_virq-v1
Paring QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_virq-v1
Thanks!
Nicolin
Nicolin Chen (10):
iommufd: Rename IOMMUFD_OBJ_FAULT to IOMMUFD_OBJ_EVENT_IOPF
iommufd: Rename fault.c to event.c
iommufd: Add IOMMUFD_OBJ_EVENT_VIRQ and IOMMUFD_CMD_VIRQ_ALLOC
iommufd/viommu: Allow drivers to control vdev_id lifecycle
iommufd/viommu: Add iommufd_vdev_id_to_dev helper
iommufd/viommu: Add iommufd_viommu_report_irq helper
iommufd/selftest: Implement mock_viommu_set/unset_vdev_id
iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_VIRQ for VIRQ coverage
iommufd/selftest: Add EVENT_VIRQ test coverage
iommu/arm-smmu-v3: Report virtual IRQ for device in user space
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 109 +++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 +
drivers/iommu/iommufd/Makefile | 2 +-
drivers/iommu/iommufd/device.c | 2 +
drivers/iommu/iommufd/event.c | 613 ++++++++++++++++++
drivers/iommu/iommufd/fault.c | 443 -------------
drivers/iommu/iommufd/hw_pagetable.c | 12 +-
drivers/iommu/iommufd/iommufd_private.h | 147 ++++-
drivers/iommu/iommufd/iommufd_test.h | 10 +
drivers/iommu/iommufd/main.c | 13 +-
drivers/iommu/iommufd/selftest.c | 66 ++
drivers/iommu/iommufd/viommu.c | 25 +-
drivers/iommu/iommufd/viommu_api.c | 54 ++
include/linux/iommufd.h | 28 +
include/uapi/linux/iommufd.h | 46 ++
tools/testing/selftests/iommu/iommufd.c | 11 +
tools/testing/selftests/iommu/iommufd_utils.h | 64 ++
17 files changed, 1130 insertions(+), 517 deletions(-)
create mode 100644 drivers/iommu/iommufd/event.c
delete mode 100644 drivers/iommu/iommufd/fault.c
--
2.43.0
This series introduces a new vIOMMU infrastructure and related ioctls.
IOMMUFD has been using the HWPT infrastructure for all cases, including a
nested IO page table support. Yet, there're limitations for an HWPT-based
structure to support some advanced HW-accelerated features, such as CMDQV
on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
environment, it is not straightforward for nested HWPTs to share the same
parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone: a
parent HWPT typically hold one stage-2 IO pagetable and tag it with only
one ID in the cache entries. When sharing one large stage-2 IO pagetable
across physical IOMMU instances, that one ID may not always be available
across all the IOMMU instances. In other word, it's ideal for SW to have
a different container for the stage-2 IO pagetable so it can hold another
ID that's available.
For this "different container", add vIOMMU, an additional layer to hold
extra virtualization information:
_______________________________________________________________________
| iommufd (with vIOMMU) |
| |
| [5] |
| _____________ |
| | | |
| |----------------| vIOMMU | |
| | | | |
| | | | |
| | [1] | | [4] [2] |
| | ______ | | _____________ ________ |
| | | | | [3] | | | | | |
| | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
| | |______| |_____________| |_____________| |________| |
| | | | | | |
|______|________|______________|__________________|_______________|_____|
| | | | |
______v_____ | ______v_____ ______v_____ ___v__
| struct | | PFN | (paging) | | (nested) | |struct|
|iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________| storage|____________| |____________| |______|
The vIOMMU object should be seen as a slice of a physical IOMMU instance
that is passed to or shared with a VM. That can be some HW/SW resources:
- Security namespace for guest owned ID, e.g. guest-controlled cache tags
- Access to a sharable nesting parent pagetable across physical IOMMUs
- Virtualization of various platforms IDs, e.g. RIDs and others
- Delivery of paravirtualized invalidation
- Direct assigned invalidation queues
- Direct assigned interrupts
- Non-affiliated event reporting
On a multi-IOMMU system, the vIOMMU object must be instanced to the number
of the physical IOMMUs that are passed to (via devices) a guest VM, while
being able to hold the shareable parent HWPT. Each vIOMMU then just needs
to allocate its own individual ID to tag its own cache:
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested0 |--->| viommu0 ------------------
---------------- | | IDx |
----------------------------
----------------------------
---------------- | | paging_hwpt0 |
| hwpt_nested1 |--->| viommu1 ------------------
---------------- | | IDy |
----------------------------
As an initial part-1, add IOMMUFD_CMD_VIOMMU_ALLOC ioctl for an allocation
only. And implement it in arm-smmu-v3 driver as a real world use case.
More vIOMMU-based structs and ioctls will be introduced in the follow-up
series to support vDEVICE, vIRQ (vEVENT) and vQUEUE objects. Although we
repurposed the vIOMMU object from an earlier RFC, just for a referece:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v4
(paring QEMU branch for testing will be provided with the part2 series)
Changelog
v4
* Added "Reviewed-by" from Jason
* Dropped IOMMU_VIOMMU_TYPE_DEFAULT support
* Dropped iommufd_object_alloc_elm renamings
* Renamed iommufd's viommu_api.c to driver.c
* Reworked iommufd_viommu_alloc helper
* Added a separate iommufd_hwpt_nested_alloc_for_viommu function for
hwpt_nested allocations on a vIOMMU, and added comparison between
viommu->iommu_dev->ops and dev_iommu_ops(idev->dev)
* Replaced s2_parent with vsmmu in arm_smmu_nested_domain
* Replaced domain_alloc_user in iommu_ops with domain_alloc_nested in
viommu_ops
* Replaced wait_queue_head_t with a completion, to delay the unplug of
mock_iommu_dev
* Corrected documentation graph that was missing struct iommu_device
* Added an iommufd_verify_unfinalized_object helper to verify driver-
allocated vIOMMU/vDEVICE objects
* Added missing test cases for TEST_LENGTH and fail_nth
v3
https://lore.kernel.org/all/cover.1728491453.git.nicolinc@nvidia.com/
* Rebased on top of Jason's nesting v3 series
https://lore.kernel.org/all/0-v3-e2e16cd7467f+2a6a1-smmuv3_nesting_jgg@nvid…
* Split the series into smaller parts
* Added Jason's Reviewed-by
* Added back viommu->iommu_dev
* Added support for driver-allocated vIOMMU v.s. core-allocated
* Dropped arm_smmu_cache_invalidate_user
* Added an iommufd_test_wait_for_users() in selftest
* Reworked test code to make viommu an individual FIXTURE
* Added missing TEST_LENGTH case for the new ioctl command
v2
https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numnbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Nicolin Chen (11):
iommufd: Move struct iommufd_object to public iommufd header
iommufd: Introduce IOMMUFD_OBJ_VIOMMU and its related struct
iommufd: Add iommufd_verify_unfinalized_object
iommufd/viommu: Add IOMMU_VIOMMU_ALLOC ioctl
iommufd: Add domain_alloc_nested op to iommufd_viommu_ops
iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
iommufd/selftest: Add refcount to mock_iommu_device
iommufd/selftest: Add IOMMU_VIOMMU_TYPE_SELFTEST
iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
Documentation: userspace-api: iommufd: Update vIOMMU
iommu/arm-smmu-v3: Add IOMMU_VIOMMU_TYPE_ARM_SMMUV3 support
drivers/iommu/iommufd/Makefile | 5 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 26 +++---
drivers/iommu/iommufd/iommufd_private.h | 36 ++------
drivers/iommu/iommufd/iommufd_test.h | 2 +
include/linux/iommu.h | 14 +++
include/linux/iommufd.h | 89 +++++++++++++++++++
include/uapi/linux/iommufd.h | 56 ++++++++++--
tools/testing/selftests/iommu/iommufd_utils.h | 28 ++++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 79 ++++++++++------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 +-
drivers/iommu/iommufd/driver.c | 38 ++++++++
drivers/iommu/iommufd/hw_pagetable.c | 69 +++++++++++++-
drivers/iommu/iommufd/main.c | 58 ++++++------
drivers/iommu/iommufd/selftest.c | 73 +++++++++++++--
drivers/iommu/iommufd/viommu.c | 85 ++++++++++++++++++
tools/testing/selftests/iommu/iommufd.c | 78 ++++++++++++++++
.../selftests/iommu/iommufd_fail_nth.c | 11 +++
Documentation/userspace-api/iommufd.rst | 69 +++++++++++++-
18 files changed, 701 insertions(+), 124 deletions(-)
create mode 100644 drivers/iommu/iommufd/driver.c
create mode 100644 drivers/iommu/iommufd/viommu.c
--
2.43.0