Hi, all
Thanks very much for your review suggestions of the v1 series [1], we
just sent out the generic part1 [2], and here is the part2 of the whole
v2 revision.
Changes from v1 -> v2:
* Don't emulate the return values in the new syscalls path, fix up or
support the new syscalls in the side of the related test cases (1-3)
selftests/nolibc: remove gettimeofday_bad1/2 completely
selftests/nolibc: support two errnos with EXPECT_SYSER2()
selftests/nolibc: waitpid_min: add waitid syscall support
(Review suggestions from Willy and Thomas)
* Fix up new failure of the state_timestamps test case (4, new)
tools/nolibc: add missing nanoseconds support for __NR_statx
(Fixes for the commit a89c937d781a ("tools/nolibc: support nanoseconds in stat()")
* Add new waitstatus macros as a standalone patch for the waitid support (5)
tools/nolibc: add more wait status related types
(Split and Cleanup for the waitid syscall based sys_wait4)
* Pure 64bit lseek and time64 select/poll/gettimeofday support (6-11)
tools/nolibc: add pure 64bit off_t, time_t and blkcnt_t
tools/nolibc: sys_lseek: add pure 64bit lseek
tools/nolibc: add pure 64bit time structs
tools/nolibc: sys_select: add pure 64bit select
tools/nolibc: sys_poll: add pure 64bit poll
tools/nolibc: sys_gettimeofday: add pure 64bit gettimeofday
(Review suggestions from Arnd, Thomas and Willy, time32 variants have
been removed completely and some fixups)
* waitid syscall support cleanup (12)
tools/nolibc: sys_wait4: add waitid syscall support
(Sync with the waitstatus macros update and Removal of emulated code)
* rv32 nolibc-test support, commit message update (13)
selftests/nolibc: riscv: customize makefile for rv32
(Review suggestions from Thomas, explain more about the change logic in commit message)
Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/linux-riscv/20230529113143.GB2762@1wt.eu/T/#t
[2]: https://lore.kernel.org/linux-riscv/cover.1685362482.git.falcon@tinylab.org/
Zhangjin Wu (13):
selftests/nolibc: remove gettimeofday_bad1/2 completely
selftests/nolibc: support two errnos with EXPECT_SYSER2()
selftests/nolibc: waitpid_min: add waitid syscall support
tools/nolibc: add missing nanoseconds support for __NR_statx
tools/nolibc: add more wait status related types
tools/nolibc: add pure 64bit off_t, time_t and blkcnt_t
tools/nolibc: sys_lseek: add pure 64bit lseek
tools/nolibc: add pure 64bit time structs
tools/nolibc: sys_select: add pure 64bit select
tools/nolibc: sys_poll: add pure 64bit poll
tools/nolibc: sys_gettimeofday: add pure 64bit gettimeofday
tools/nolibc: sys_wait4: add waitid syscall support
selftests/nolibc: riscv: customize makefile for rv32
tools/include/nolibc/arch-aarch64.h | 3 -
tools/include/nolibc/arch-loongarch.h | 3 -
tools/include/nolibc/arch-riscv.h | 3 -
tools/include/nolibc/std.h | 28 ++--
tools/include/nolibc/sys.h | 134 +++++++++++++++----
tools/include/nolibc/types.h | 58 +++++++-
tools/testing/selftests/nolibc/Makefile | 11 +-
tools/testing/selftests/nolibc/nolibc-test.c | 20 +--
8 files changed, 202 insertions(+), 58 deletions(-)
--
2.25.1
From: Jeff Xu <jeffxu(a)google.com>
Since Linux introduced the memfd feature, memfd have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting
it differently.
However, in a secure by default system, such as ChromeOS, (where all
executables should come from the rootfs, which is protected by Verified
boot), this executable nature of memfd opens a door for NoExec bypass
and enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm
process created a memfd to share the content with an external process,
however the memfd is overwritten and used for executing arbitrary code
and root escalation. [2] lists more VRP in this kind.
On the other hand, executable memfd has its legit use, runc uses memfd’s
seal and executable feature to copy the contents of the binary then
execute them, for such system, we need a solution to differentiate runc's
use of executable memfds and an attacker's [3].
To address those above, this set of patches add following:
1> Let memfd_create() set X bit at creation time.
2> Let memfd to be sealed for modifying X bit.
3> A new pid namespace sysctl: vm.memfd_noexec to control the behavior of
X bit.For example, if a container has vm.memfd_noexec=2, then
memfd_create() without MFD_NOEXEC_SEAL will be rejected.
4> A new security hook in memfd_create(). This make it possible to a new
LSM, which rejects or allows executable memfd based on its security policy.
Change history:
v8:
- Update ref bug in cover letter.
- Add Reviewed-by field.
- Remove security hook (security_memfd_create) patch, which will have
its own patch set in future.
v7:
- patch 2/6: remove #ifdef and MAX_PATH (memfd_test.c).
- patch 3/6: check capability (CAP_SYS_ADMIN) from userns instead of
global ns (pid_sysctl.h). Add a tab (pid_namespace.h).
- patch 5/6: remove #ifdef (memfd_test.c)
- patch 6/6: remove unneeded security_move_mount(security.c).
v6:https://lore.kernel.org/lkml/20221206150233.1963717-1-jeffxu@google.com/
- Address comment and move "#ifdef CONFIG_" from .c file to pid_sysctl.h
v5:https://lore.kernel.org/lkml/20221206152358.1966099-1-jeffxu@google.com/
- Pass vm.memfd_noexec from current ns to child ns.
- Fix build issue detected by kernel test robot.
- Add missing security.c
v3:https://lore.kernel.org/lkml/20221202013404.163143-1-jeffxu@google.com/
- Address API design comments in v2.
- Let memfd_create() to set X bit at creation time.
- A new pid namespace sysctl: vm.memfd_noexec to control behavior of X bit.
- A new security hook in memfd_create().
v2:https://lore.kernel.org/lkml/20220805222126.142525-1-jeffxu@google.com/
- address comments in V1.
- add sysctl (vm.mfd_noexec) to set the default file permissions of
memfd_create to be non-executable.
v1:https://lwn.net/Articles/890096/
[1] https://crbug.com/1305267
[2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20me…
[3] https://lwn.net/Articles/781013/
Daniel Verkamp (2):
mm/memfd: add F_SEAL_EXEC
selftests/memfd: add tests for F_SEAL_EXEC
Jeff Xu (3):
mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
mm/memfd: Add write seals when apply SEAL_EXEC to executable memfd
selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC
include/linux/pid_namespace.h | 19 ++
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/memfd.h | 4 +
kernel/pid_namespace.c | 5 +
kernel/pid_sysctl.h | 59 ++++
mm/memfd.c | 56 +++-
mm/shmem.c | 6 +
tools/testing/selftests/memfd/fuse_test.c | 1 +
tools/testing/selftests/memfd/memfd_test.c | 341 ++++++++++++++++++++-
9 files changed, 489 insertions(+), 3 deletions(-)
create mode 100644 kernel/pid_sysctl.h
base-commit: eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
--
2.39.0.rc1.256.g54fd8350bd-goog
Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages (stage-1, stage-2) address translation
to get access to the physical address. stage-1 translation table is owned
by userspace (e.g. by a guest OS), while stage-2 is owned by kernel. Changes
to stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example, the stage-1 translation table is I/O page
table. As the below diagram shows, guest I/O page table pointer in GPA
(guest physical address) is passed to host and be used to perform the stage-1
address translation. Along with it, modifications to present mappings in the
guest I/O page table should be followed with an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |--------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | FS for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.----------------------------------.
| | | SS for GPA->HPA, unmanaged domain|
| | '----------------------------------'
'-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt)
and each has an iommu_domain allocated from iommu driver. So in this series
hw_pagetable and iommu_domain means the same thing if no special note.
IOMMUFD has already supported allocating hw_pagetable that is linked with
an IOAS. However, nesting requires IOMMUFD to allow allocating hw_pagetable
with driver specific parameters and interface to sync stage-1 IOTLB as user
owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1]. It first
introduces new iommu op for allocating domains with user data and the op
for syncing stage-1 IOTLB, and then extend the IOMMUFD internal infrastructure
to accept user_data and parent hwpt, then relay the data to iommu core to
allocate iommu_domain. After it, extend the ioctl IOMMU_HWPT_ALLOC to accept
user data and stage-2 hwpt ID to allocate hwpt. Along with it, ioctl
IOMMU_HWPT_INVALIDATE is added to invalidate stage-1 IOTLB. This is needed
for user-managed hwpts. Selftest is added as well to cover the new ioctls.
Complete code can be found in [2], QEMU could can be found in [3].
At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.
base-commit: cf905391237ded2331388e75adb5afbabeddc852
[1] https://lore.kernel.org/linux-iommu/20230511143024.19542-1-yi.l.liu@intel.c…
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv4.mig.reset.v4_var3%…
Change log:
v2:
- Add union iommu_domain_user_data to include all user data structures to avoid
passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Lu Baolu (2):
iommu: Add new iommu op to create domains owned by userspace
iommu: Add nested domain support
Nicolin Chen (5):
iommufd/hw_pagetable: Do not populate user-managed hw_pagetables
iommufd/selftest: Add domain_alloc_user() support in iommu mock
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with user data
iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
Yi Liu (4):
iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
iommufd: Pass parent hwpt and user_data to
iommufd_hw_pagetable_alloc()
iommufd: IOMMU_HWPT_ALLOC allocation with user data
iommufd: Add IOMMU_HWPT_INVALIDATE
drivers/iommu/iommufd/device.c | 2 +-
drivers/iommu/iommufd/hw_pagetable.c | 191 +++++++++++++++++-
drivers/iommu/iommufd/iommufd_private.h | 16 +-
drivers/iommu/iommufd/iommufd_test.h | 30 +++
drivers/iommu/iommufd/main.c | 5 +-
drivers/iommu/iommufd/selftest.c | 119 ++++++++++-
include/linux/iommu.h | 36 ++++
include/uapi/linux/iommufd.h | 58 +++++-
tools/testing/selftests/iommu/iommufd.c | 126 +++++++++++-
tools/testing/selftests/iommu/iommufd_utils.h | 70 +++++++
10 files changed, 629 insertions(+), 24 deletions(-)
--
2.34.1
Hi folks,
This series implements the functionality of delivering IO page faults to
user space through the IOMMUFD framework. The use case is nested
translation, where modern IOMMU hardware supports two-stage translation
tables. The second-stage translation table is managed by the host VMM
while the first-stage translation table is owned by the user space.
Hence, any IO page fault that occurs on the first-stage page table
should be delivered to the user space and handled there. The user space
should respond the page fault handling result to the device top-down
through the IOMMUFD response uAPI.
User space indicates its capablity of handling IO page faults by setting
a user HWPT allocation flag IOMMU_HWPT_ALLOC_FLAGS_IOPF_CAPABLE. IOMMUFD
will then setup its infrastructure for page fault delivery. Together
with the iopf-capable flag, user space should also provide an eventfd
where it will listen on any down-top page fault messages.
On a successful return of the allocation of iopf-capable HWPT, a fault
fd will be returned. User space can open and read fault messages from it
once the eventfd is signaled.
Besides the overall design, I'd like to hear comments about below
designs:
- The IOMMUFD fault message format. It is very similar to that in
uapi/linux/iommu which has been discussed before and partially used by
the IOMMU SVA implementation. I'd like to get more comments on the
format when it comes to IOMMUFD.
- The timeout value for the pending page fault messages. Ideally we
should determine the timeout value from the device configuration, but
I failed to find any statement in the PCI specification (version 6.x).
A default 100 milliseconds is selected in the implementation, but it
leave the room for grow the code for per-device setting.
This series is only for review comment purpose. I used IOMMUFD selftest
to verify the hwpt allocation, attach/detach and replace. But I didn't
get a chance to run it with real hardware yet. I will do more test in
the subsequent versions when I am confident that I am heading on the
right way.
This series is based on the latest implementation of the nested
translation under discussion. The whole series and related patches are
available on gitbub:
https://github.com/LuBaolu/intel-iommu/commits/iommufd-io-pgfault-delivery-…
Best regards,
baolu
Lu Baolu (17):
iommu: Move iommu fault data to linux/iommu.h
iommu: Support asynchronous I/O page fault response
iommu: Add helper to set iopf handler for domain
iommu: Pass device parameter to iopf handler
iommu: Split IO page fault handling from SVA
iommu: Add iommu page fault cookie helpers
iommufd: Add iommu page fault data
iommufd: IO page fault delivery initialization and release
iommufd: Add iommufd hwpt iopf handler
iommufd: Add IOMMU_HWPT_ALLOC_FLAGS_USER_PASID_TABLE for hwpt_alloc
iommufd: Deliver fault messages to user space
iommufd: Add io page fault response support
iommufd: Add a timer for each iommufd fault data
iommufd: Drain all pending faults when destroying hwpt
iommufd: Allow new hwpt_alloc flags
iommufd/selftest: Add IOPF feature for mock devices
iommufd/selftest: Cover iopf-capable nested hwpt
include/linux/iommu.h | 175 +++++++++-
drivers/iommu/{iommu-sva.h => io-pgfault.h} | 25 +-
drivers/iommu/iommu-priv.h | 3 +
drivers/iommu/iommufd/iommufd_private.h | 32 ++
include/uapi/linux/iommu.h | 161 ---------
include/uapi/linux/iommufd.h | 73 +++-
tools/testing/selftests/iommu/iommufd_utils.h | 20 +-
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 2 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +-
drivers/iommu/intel/iommu.c | 2 +-
drivers/iommu/intel/svm.c | 2 +-
drivers/iommu/io-pgfault.c | 7 +-
drivers/iommu/iommu-sva.c | 4 +-
drivers/iommu/iommu.c | 50 ++-
drivers/iommu/iommufd/device.c | 64 +++-
drivers/iommu/iommufd/hw_pagetable.c | 318 +++++++++++++++++-
drivers/iommu/iommufd/main.c | 3 +
drivers/iommu/iommufd/selftest.c | 71 ++++
tools/testing/selftests/iommu/iommufd.c | 17 +-
MAINTAINERS | 1 -
drivers/iommu/Kconfig | 4 +
drivers/iommu/Makefile | 3 +-
drivers/iommu/intel/Kconfig | 1 +
23 files changed, 837 insertions(+), 203 deletions(-)
rename drivers/iommu/{iommu-sva.h => io-pgfault.h} (71%)
delete mode 100644 include/uapi/linux/iommu.h
--
2.34.1
Hello!
Here is v4 of the mremap start address optimization / fix for exec warning. It
took me a while to write a test that catches the issue me/Linus discussed in
the last version. And I verified kernel crashes without the check. See below.
The main changes in this series is:
Care to be taken to move purely within a VMA, in other words this check
in call_align_down():
if (vma->vm_start != addr_masked)
return false;
As an example of why this is needed:
Consider the following range which is 2MB aligned and is
a part of a larger 10MB range which is not shown. Each
character is 256KB below making the source and destination
2MB each. The lower case letters are moved (s to d) and the
upper case letters are not moved.
|DDDDddddSSSSssss|
If we align down 'ssss' to start from the 'SSSS', we will end up destroying
SSSS. The above if statement prevents that and I verified it.
I also added a test for this in the last patch.
History of patches
==================
v3->v4:
1. Make sure to check address to align is beginning of VMA
2. Add test to check this (test fails with a kernel crash if we don't do this).
v2->v3:
1. Masked address was stored in int, fixed it to unsigned long to avoid truncation.
2. We now handle moves happening purely within a VMA, a new test is added to handle this.
3. More code comments.
v1->v2:
1. Trigger the optimization for mremaps smaller than a PMD. I tested by tracing
that it works correctly.
2. Fix issue with bogus return value found by Linus if we broke out of the
above loop for the first PMD itself.
v1: Initial RFC.
Description of patches
======================
These patches optimizes the start addresses in move_page_tables() and tests the
changes. It addresses a warning [1] that occurs due to a downward, overlapping
move on a mutually-aligned offset within a PMD during exec. By initiating the
copy process at the PMD level when such alignment is present, we can prevent
this warning and speed up the copying process at the same time. Linus Torvalds
suggested this idea.
Please check the individual patches for more details.
thanks,
- Joel
[1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
Joel Fernandes (Google) (7):
mm/mremap: Optimize the start addresses in move_page_tables()
mm/mremap: Allow moves within the same VMA for stack
selftests: mm: Fix failure case when new remap region was not found
selftests: mm: Add a test for mutually aligned moves > PMD size
selftests: mm: Add a test for remapping to area immediately after
existing mapping
selftests: mm: Add a test for remapping within a range
selftests: mm: Add a test for moving from an offset from start of
mapping
fs/exec.c | 2 +-
include/linux/mm.h | 2 +-
mm/mremap.c | 63 ++++-
tools/testing/selftests/mm/mremap_test.c | 301 +++++++++++++++++++----
4 files changed, 319 insertions(+), 49 deletions(-)
--
2.41.0.rc2.161.g9c6817b8e7-goog
This is to add Intel VT-d nested translation based on IOMMUFD nesting
infrastructure. As the iommufd nesting infrastructure series[1], iommu
core supports new ops to report iommu hardware information, allocate
domains with user data and sync stage-1 IOTLB. The data required in
the three paths are vendor-specific, so
1) IOMMU_HW_INFO_TYPE_INTEL_VTD and struct iommu_device_info_vtd are
defined to report iommu hardware information for Intel VT-d .
2) IOMMU_HWPT_DATA_VTD_S1 is defined for the Intel VT-d stage-1 page
table, it will be used in the stage-1 domain allocation and IOTLB
syncing path. struct iommu_hwpt_intel_vtd is defined to pass user_data
for the Intel VT-d stage-1 domain allocation.
struct iommu_hwpt_invalidate_intel_vtd is defined to pass the data for
the Intel VT-d stage-1 IOTLB invalidation.
With above IOMMUFD extensions, the intel iommu driver implements the three
paths to support nested translation.
The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware errata [2] requiring special
treatment. This errata happens when a stage-1 page table page (either
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during page table walking.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
this errata to userspace. With that restriction the user should either
disable nested translation to favor RO stage-2 mappings or ensure no
RO stage-2 mapping to enable nested translation.
Intel-iommu driver is armed with necessary checks to prevent such mix
in patch10 of this series.
Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space which certainly
includes RO regions e.g. vbios.
In reality we don't know a usage relying on DMA reads from the BIOS
region. Hence finding a way to allow user opt-out RO mappings in
Qemu might be an acceptable tradeoff. But how to achieve it cleanly
needs more discussion in Qemu community. For now we just hacked Qemu
to test.
Complete code can be found in [3], QEMU could can be found in [4].
base-commit: ce9b593b1f74ccd090edc5d2ad397da84baa9946
[1] https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta…
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv4.mig.reset.v4_var3%…
Change log:
v3:
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exists in stage-2 domain
due to errata on readonly mappings on Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Lu Baolu (5):
iommu/vt-d: Extend dmar_domain to support nested domain
iommu/vt-d: Add helper for nested domain allocation
iommu/vt-d: Add helper to setup pasid nested translation
iommu/vt-d: Add nested domain allocation
iommu/vt-d: Disallow nesting on domains with read-only mappings
Yi Liu (5):
iommufd: Add data structure for Intel VT-d stage-1 domain allocation
iommu/vt-d: Make domain attach helpers to be extern
iommu/vt-d: Set the nested domain to a device
iommu/vt-d: Add iotlb flush for nested domain
iommu/vt-d: Implement hw_info for iommu capability query
drivers/iommu/intel/Makefile | 2 +-
drivers/iommu/intel/iommu.c | 78 ++++++++++++---
drivers/iommu/intel/iommu.h | 55 +++++++++--
drivers/iommu/intel/nested.c | 181 +++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.c | 151 +++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
drivers/iommu/iommufd/main.c | 6 ++
include/linux/iommu.h | 1 +
include/uapi/linux/iommufd.h | 149 ++++++++++++++++++++++++++++
9 files changed, 603 insertions(+), 22 deletions(-)
create mode 100644 drivers/iommu/intel/nested.c
--
2.34.1