From: Jeff Xu jeffxu@chromium.org
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type.
Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
Two system calls are involved in sealing the map: mmap() and mseal().
The new mseal() is a syscall on 64-bit CPUs, with the following signature:
int mseal(void *addr, size_t len, unsigned long flags)

addr/len: memory range.
flags: reserved.
mseal() blocks the following operations for the given memory range:
1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(); these can leave an empty space that could then be filled by a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location, via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory.
In addition, mmap() has two related changes.
The PROT_SEAL bit in the prot field of mmap(). When present, it marks the map as sealed from creation.
The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks the map as sealable. A map created without MAP_SEALABLE will not support sealing, i.e. mseal() will fail.
Applications that don't care about sealing will see no change in behavior. Those that need sealing support opt in by passing MAP_SEALABLE to mmap().
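To make the opt-in flow concrete, here is a minimal userspace sketch: it maps a sealable region, seals it, and shows that later modifications are refused. The MAP_SEALABLE value and the mseal syscall number (462) are the ones proposed in this series and may not be in installed headers yet, hence the fallback defines.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

#ifndef MAP_SEALABLE
#define MAP_SEALABLE 0x8000000	/* value proposed by this series */
#endif
#ifndef __NR_mseal
#define __NR_mseal 462		/* syscall number wired up in patch 1/4 */
#endif

int main(void)
{
	size_t len = 4 * getpagesize();

	/* Opt in to sealing at map time with MAP_SEALABLE. */
	void *p = mmap(NULL, len, PROT_READ,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_SEALABLE, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Seal the range; later changes to the VMAs are rejected. */
	if (syscall(__NR_mseal, p, len, 0))
		return 1;

	/* Both of these now fail with EPERM on the sealed range. */
	if (mprotect(p, len, PROT_READ | PROT_WRITE))
		printf("mprotect: %m\n");
	if (munmap(p, len))
		printf("munmap: %m\n");

	return 0;
}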
The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API.
Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable; the lifetime of those mappings is tied to the lifetime of the process.
Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively, but write access to it is restricted using pkeys (or, in the future, ARM permission overlay extensions). The lifetime of those mappings is not tied to the lifetime of the process; therefore, while the memory is sealed, the allocators still need to free or discard the unused memory, for example with madvise(DONTNEED).
However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity.
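A minimal sketch of that allocator pattern on x86 with protection keys, assuming pkey support and the MAP_SEALABLE/__NR_mseal values proposed here: the sealed reservation stays intact, and madvise(MADV_DONTNEED) only succeeds while the calling thread holds PKEY write access.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

#ifndef MAP_SEALABLE
#define MAP_SEALABLE 0x8000000	/* value proposed by this series */
#endif
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

int main(void)
{
	size_t len = 16 * getpagesize();
	int pkey = pkey_alloc(0, 0);

	if (pkey < 0)
		return 1;	/* no pkey support on this CPU/kernel */

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_SEALABLE, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Tag the reservation with the pkey, then seal it. */
	if (pkey_mprotect(p, len, PROT_READ | PROT_WRITE, pkey))
		return 1;
	if (syscall(__NR_mseal, p, len, 0))
		return 1;

	/* With write access revoked via PKRU, discarding is refused (EPERM)... */
	pkey_set(pkey, PKEY_DISABLE_WRITE);
	if (madvise(p, getpagesize(), MADV_DONTNEED))
		printf("madvise, write disabled: %m\n");

	/* ...and allowed again once the thread re-enables write access. */
	pkey_set(pkey, 0);
	if (!madvise(p, getpagesize(), MADV_DONTNEED))
		printf("madvise, write enabled: ok\n");

	return 0;
}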
Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections.
In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch:
Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Pedro Falcato: suggesting sealing in the mmap().
Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD.
Change history:
===============
V8:
- perf optimization in mmap. (Liam R. Howlett)
- add one testcase (test_seal_zero_address)
- Update mseal.rst to add note for MAP_SEALABLE.
V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.or...
V6:
- Drop RFC from subject, given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024)
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/
V5:
- fix build issue in mseal-Wire-up-mseal-syscall (Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r
V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/
V3:
- Abandon per-syscall approach. (Suggested by Linus Torvalds)
- Organize sealing types around their functionality, such as MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent destructive operations of madvise. (Suggested by Jann Horn and Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.or...
v2:
- Use _BITUL to define MM_SEAL_XX type.
- Use unsigned long for seal type in sys_mseal() and other functions.
- Remove internal VM_SEAL_XX type and convert_user_seal_type().
- Remove MM_ACTION_XX type.
- Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
- Add more comments in code.
- Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/
v1: https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9...
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgea...
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfU...
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
Jeff Xu (4):
  mseal: Wire up mseal syscall
  mseal: add mseal syscall
  selftest mm/mseal memory sealing
  mseal:add documentation
 Documentation/userspace-api/index.rst       |    1 +
 Documentation/userspace-api/mseal.rst       |  215 ++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |    2 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 include/linux/syscalls.h                    |    1 +
 include/uapi/asm-generic/mman-common.h      |    8 +
 include/uapi/asm-generic/unistd.h           |    5 +-
 kernel/sys_ni.c                             |    1 +
 mm/Makefile                                 |    4 +
 mm/internal.h                               |   48 +
 mm/madvise.c                                |   12 +
 mm/mmap.c                                   |   35 +-
 mm/mprotect.c                               |   10 +
 mm/mremap.c                                 |   31 +
 mm/mseal.c                                  |  343 ++
 tools/testing/selftests/mm/.gitignore       |    1 +
 tools/testing/selftests/mm/Makefile         |    1 +
 tools/testing/selftests/mm/mseal_test.c     | 2024 +++++++++++++++++++
 33 files changed, 2756 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/userspace-api/mseal.rst
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c
From: Jeff Xu jeffxu@chromium.org
Wire up mseal syscall for all architectures.
Signed-off-by: Jeff Xu jeffxu@chromium.org --- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 ++ arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/uapi/asm-generic/unistd.h | 5 ++++- kernel/sys_ni.c | 1 + 19 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 8ff110826ce2..d8f96362e9f8 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -501,3 +501,4 @@ 569 common lsm_get_self_attr sys_lsm_get_self_attr 570 common lsm_set_self_attr sys_lsm_set_self_attr 571 common lsm_list_modules sys_lsm_list_modules +572 common mseal sys_mseal diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index b6c9e01e14f5..2ed7d229c8f9 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -475,3 +475,4 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr 461 common lsm_list_modules sys_lsm_list_modules +462 common mseal sys_mseal diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index 491b2b9bd553..1346579f802f 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -39,7 +39,7 @@ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 462 +#define __NR_compat_syscalls 463 #endif
#define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index 7118282d1c79..266b96acc014 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -929,6 +929,8 @@ __SYSCALL(__NR_lsm_get_self_attr, sys_lsm_get_self_attr) __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr) #define __NR_lsm_list_modules 461 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules) +#define __NR_mseal 462 +__SYSCALL(__NR_mseal, sys_mseal)
/* * Please add new compat syscalls above this comment and update diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 7fd43fd4c9f2..22a3cbd4c602 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -461,3 +461,4 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr 461 common lsm_list_modules sys_lsm_list_modules +462 common mseal sys_mseal diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index b00ab2cabab9..2b81a6bd78b2 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -467,3 +467,4 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr 461 common lsm_list_modules sys_lsm_list_modules +462 common mseal sys_mseal diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 83cfc9eb6b88..cc869f5d5693 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -400,3 +400,4 @@ 459 n32 lsm_get_self_attr sys_lsm_get_self_attr 460 n32 lsm_set_self_attr sys_lsm_set_self_attr 461 n32 lsm_list_modules sys_lsm_list_modules +462 n32 mseal sys_mseal diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index 532b855df589..1464c6be6eb3 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -376,3 +376,4 @@ 459 n64 lsm_get_self_attr sys_lsm_get_self_attr 460 n64 lsm_set_self_attr sys_lsm_set_self_attr 461 n64 lsm_list_modules sys_lsm_list_modules +462 n64 mseal sys_mseal diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index f45c9530ea93..008ebe60263e 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -449,3 +449,4 @@ 459 o32 lsm_get_self_attr sys_lsm_get_self_attr 460 o32 lsm_set_self_attr sys_lsm_set_self_attr 461 o32 lsm_list_modules sys_lsm_list_modules +462 o32 mseal sys_mseal diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index b236a84c4e12..b13c21373974 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -460,3 +460,4 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr 461 common lsm_list_modules sys_lsm_list_modules +462 common mseal sys_mseal diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 17173b82ca21..3656f1ca7a21 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -548,3 +548,4 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr 461 common lsm_list_modules sys_lsm_list_modules +462 common mseal sys_mseal diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 095bb86339a7..bd0fee24ad10 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -464,3 +464,4 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr sys_lsm_set_self_attr 461 common lsm_list_modules sys_lsm_list_modules sys_lsm_list_modules +462 common mseal sys_mseal sys_mseal diff --git 
a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index 86fe269f0220..bbf83a2db986 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -464,3 +464,4 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr 461 common lsm_list_modules sys_lsm_list_modules +462 common mseal sys_mseal diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index b23d59313589..ac6c281ccfe0 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -507,3 +507,4 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr 461 common lsm_list_modules sys_lsm_list_modules +462 common mseal sys_mseal diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 5f8591ce7f25..7fd1f57ad3d3 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -466,3 +466,4 @@ 459 i386 lsm_get_self_attr sys_lsm_get_self_attr 460 i386 lsm_set_self_attr sys_lsm_set_self_attr 461 i386 lsm_list_modules sys_lsm_list_modules +462 i386 mseal sys_mseal diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 7e8d46f4147f..52df0dec70da 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -383,6 +383,7 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr 461 common lsm_list_modules sys_lsm_list_modules +462 common mseal sys_mseal
# # Due to a historical design error, certain syscalls are numbered differently diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index dd116598fb25..67083fc1b2f5 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -432,3 +432,4 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr 461 common lsm_list_modules sys_lsm_list_modules +462 common mseal sys_mseal diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 75f00965ab15..d983c48a3b6a 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr) #define __NR_lsm_list_modules 461 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
+#define __NR_mseal 462 +__SYSCALL(__NR_mseal, sys_mseal) + #undef __NR_syscalls -#define __NR_syscalls 462 +#define __NR_syscalls 463
/* * 32 bit systems traditionally used different diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index faad00cce269..d7eee421d4bc 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -196,6 +196,7 @@ COND_SYSCALL(migrate_pages); COND_SYSCALL(move_pages); COND_SYSCALL(set_mempolicy_home_node); COND_SYSCALL(cachestat); +COND_SYSCALL(mseal);
COND_SYSCALL(perf_event_open); COND_SYSCALL(accept4);
From: Jeff Xu jeffxu@chromium.org
The new mseal() is a syscall on 64-bit CPUs, with the following signature:
int mseal(void *addr, size_t len, unsigned long flags)

addr/len: memory range.
flags: reserved.
mseal() blocks the following operations for the given memory range:
1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(); these can leave an empty space that could then be filled by a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location, via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory.
In addition, mmap() has two related changes.
The PROT_SEAL bit in the prot field of mmap(). When present, it marks the map as sealed from creation.
The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks the map as sealable. A map created without MAP_SEALABLE will not support sealing, i.e. mseal() will fail.
Applications that don't care about sealing will see no change in behavior. Those that need sealing support opt in by passing MAP_SEALABLE to mmap().
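As a minimal sketch of sealing at creation time with PROT_SEAL (so no separate mseal() call is needed): the flag value below is the one proposed in this patch and is only a fallback define if it is absent from installed headers.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

#ifndef PROT_SEAL
#define PROT_SEAL 0x04000000	/* value proposed by this patch */
#endif

int main(void)
{
	size_t len = getpagesize();

	/* Sealed from creation: PROT_SEAL implies the map is also sealable. */
	void *p = mmap(NULL, len, PROT_READ | PROT_SEAL,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* The VMA is already sealed; unmapping it fails with EPERM. */
	if (munmap(p, len))
		printf("munmap: %m\n");

	return 0;
}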
The following input from the RFC process has been incorporated into this patch:
Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations.
Linus Torvalds: assisting in defining system call signature and scope.
Pedro Falcato: suggesting sealing in the mmap().
Liam R. Howlett: perf optimization.
Finally, the idea that inspired this patch comes from Stephen Röttger’s work in Chrome V8 CFI.
Signed-off-by: Jeff Xu jeffxu@chromium.org --- include/linux/syscalls.h | 1 + include/uapi/asm-generic/mman-common.h | 8 + mm/Makefile | 4 + mm/internal.h | 48 ++++ mm/madvise.c | 12 + mm/mmap.c | 35 ++- mm/mprotect.c | 10 + mm/mremap.c | 31 +++ mm/mseal.c | 343 +++++++++++++++++++++++++ 9 files changed, 491 insertions(+), 1 deletion(-) create mode 100644 mm/mseal.c
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index cdba4d0c6d4a..2d44e0d99e37 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -820,6 +820,7 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags); +asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned long flags); asmlinkage long sys_mbind(unsigned long start, unsigned long len, unsigned long mode, const unsigned long __user *nmask, diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index 6ce1f1ceb432..3ca4d694a621 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -17,6 +17,11 @@ #define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ #define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
+/* + * The PROT_SEAL defines memory sealing in the prot argument of mmap(). + */ +#define PROT_SEAL 0x04000000 /* _BITUL(26) */ + /* 0x01 - 0x03 are defined in linux/mman.h */ #define MAP_TYPE 0x0f /* Mask for type of mapping */ #define MAP_FIXED 0x10 /* Interpret addr exactly */ @@ -33,6 +38,9 @@ #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */
+/* map is sealable */ +#define MAP_SEALABLE 0x8000000 /* _BITUL(27) */ + /* * Flags for mlock */ diff --git a/mm/Makefile b/mm/Makefile index e4b5b75aaec9..cbae83f74642 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -43,6 +43,10 @@ ifdef CONFIG_CROSS_MEMORY_ATTACH mmu-$(CONFIG_MMU) += process_vm_access.o endif
+ifdef CONFIG_64BIT +mmu-$(CONFIG_MMU) += mseal.o +endif + obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ maccess.o page-writeback.o folio-compat.o \ readahead.o swap.o truncate.o vmscan.o shrinker.o \ diff --git a/mm/internal.h b/mm/internal.h index f309a010d50f..00b45c8550c4 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1221,6 +1221,54 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn, unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, int priority);
+#ifdef CONFIG_64BIT +/* VM is sealable, in vm_flags */ +#define VM_SEALABLE _BITUL(63) + +/* VM is sealed, in vm_flags */ +#define VM_SEALED _BITUL(62) +#endif + +#ifdef CONFIG_64BIT +static inline int can_do_mseal(unsigned long flags) +{ + if (flags) + return -EINVAL; + + return 0; +} + +bool can_modify_mm(struct mm_struct *mm, unsigned long start, + unsigned long end); +bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, + unsigned long end, int behavior); +unsigned long get_mmap_seals(unsigned long prot, + unsigned long flags); +#else +static inline int can_do_mseal(unsigned long flags) +{ + return -EPERM; +} + +static inline bool can_modify_mm(struct mm_struct *mm, unsigned long start, + unsigned long end) +{ + return true; +} + +static inline bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, + unsigned long end, int behavior) +{ + return true; +} + +static inline unsigned long get_mmap_seals(unsigned long prot, + unsigned long flags) +{ + return 0; +} +#endif + #ifdef CONFIG_SHRINKER_DEBUG static inline __printf(2, 0) int shrinker_debugfs_name_alloc( struct shrinker *shrinker, const char *fmt, va_list ap) diff --git a/mm/madvise.c b/mm/madvise.c index 912155a94ed5..9c0761c68111 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1393,6 +1393,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start, * -EIO - an I/O error occurred while paging in data. * -EBADF - map exists, but area maps something that isn't a file. * -EAGAIN - a kernel resource was temporarily unavailable. + * -EPERM - memory is sealed. */ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior) { @@ -1436,10 +1437,21 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh start = untagged_addr_remote(mm, start); end = start + len;
+ /* + * Check if the address range is sealed for do_madvise(). + * can_modify_mm_madv assumes we have acquired the lock on MM. + */ + if (!can_modify_mm_madv(mm, start, end, behavior)) { + error = -EPERM; + goto out; + } + blk_start_plug(&plug); error = madvise_walk_vmas(mm, start, end, behavior, madvise_vma_behavior); blk_finish_plug(&plug); + +out: if (write) mmap_write_unlock(mm); else diff --git a/mm/mmap.c b/mm/mmap.c index b78e83d351d2..4b3143044db4 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1213,6 +1213,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, { struct mm_struct *mm = current->mm; int pkey = 0; + unsigned long vm_seals;
*populate = 0;
@@ -1233,6 +1234,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr, if (flags & MAP_FIXED_NOREPLACE) flags |= MAP_FIXED;
+ vm_seals = get_mmap_seals(prot, flags); + if (!(flags & MAP_FIXED)) addr = round_hint_to_min(addr);
@@ -1261,6 +1264,16 @@ unsigned long do_mmap(struct file *file, unsigned long addr, return -EEXIST; }
+ /* + * addr is returned from get_unmapped_area, + * There are two cases: + * 1> MAP_FIXED == false + * unallocated memory, no need to check sealing. + * 1> MAP_FIXED == true + * sealing is checked inside mmap_region when + * do_vmi_munmap is called. + */ + if (prot == PROT_EXEC) { pkey = execute_only_pkey(mm); if (pkey < 0) @@ -1376,6 +1389,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= VM_NORESERVE; }
+ vm_flags |= vm_seals; addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); if (!IS_ERR_VALUE(addr) && ((vm_flags & VM_LOCKED) || @@ -2679,6 +2693,14 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, if (end == start) return -EINVAL;
+ /* + * Check if memory is sealed before arch_unmap. + * Prevent unmapping a sealed VMA. + * can_modify_mm assumes we have acquired the lock on MM. + */ + if (!can_modify_mm(mm, start, end)) + return -EPERM; + /* arch_unmap() might do unmaps itself. */ arch_unmap(mm, start, end);
@@ -2741,7 +2763,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr, }
/* Unmap any existing mapping in the area */ - if (do_vmi_munmap(&vmi, mm, addr, len, uf, false)) + error = do_vmi_munmap(&vmi, mm, addr, len, uf, false); + if (error == -EPERM) + return error; + else if (error) return -ENOMEM;
/* @@ -3102,6 +3127,14 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, { struct mm_struct *mm = vma->vm_mm;
+ /* + * Check if memory is sealed before arch_unmap. + * Prevent unmapping a sealed VMA. + * can_modify_mm assumes we have acquired the lock on MM. + */ + if (!can_modify_mm(mm, start, end)) + return -EPERM; + arch_unmap(mm, start, end); return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock); } diff --git a/mm/mprotect.c b/mm/mprotect.c index 81991102f785..5f0f716bf4ae 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -32,6 +32,7 @@ #include <linux/sched/sysctl.h> #include <linux/userfaultfd_k.h> #include <linux/memory-tiers.h> +#include <uapi/linux/mman.h> #include <asm/cacheflush.h> #include <asm/mmu_context.h> #include <asm/tlbflush.h> @@ -743,6 +744,15 @@ static int do_mprotect_pkey(unsigned long start, size_t len, } }
+ /* + * checking if memory is sealed. + * can_modify_mm assumes we have acquired the lock on MM. + */ + if (!can_modify_mm(current->mm, start, end)) { + error = -EPERM; + goto out; + } + prev = vma_prev(&vmi); if (start > vma->vm_start) prev = vma; diff --git a/mm/mremap.c b/mm/mremap.c index 38d98465f3d8..d69b438dcf83 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -902,7 +902,25 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len, if ((mm->map_count + 2) >= sysctl_max_map_count - 3) return -ENOMEM;
+ /* + * In mremap_to(). + * Move a VMA to another location, check if src addr is sealed. + * + * Place can_modify_mm here because mremap_to() + * does its own checking for address range, and we only + * check the sealing after passing those checks. + * + * can_modify_mm assumes we have acquired the lock on MM. + */ + if (!can_modify_mm(mm, addr, addr + old_len)) + return -EPERM; + if (flags & MREMAP_FIXED) { + /* + * In mremap_to(). + * VMA is moved to dst address, and munmap dst first. + * do_munmap will check if dst is sealed. + */ ret = do_munmap(mm, new_addr, new_len, uf_unmap_early); if (ret) goto out; @@ -1061,6 +1079,19 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, goto out; }
+ /* + * Below is shrink/expand case (not mremap_to()) + * Check if src address is sealed, if so, reject. + * In other words, prevent shrinking or expanding a sealed VMA. + * + * Place can_modify_mm here so we can keep the logic related to + * shrink/expand together. + */ + if (!can_modify_mm(mm, addr, addr + old_len)) { + ret = -EPERM; + goto out; + } + /* * Always allow a shrinking remap: that just unmaps * the unnecessary pages.. diff --git a/mm/mseal.c b/mm/mseal.c new file mode 100644 index 000000000000..abc00c0b9895 --- /dev/null +++ b/mm/mseal.c @@ -0,0 +1,343 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Implement mseal() syscall. + * + * Copyright (c) 2023,2024 Google, Inc. + * + * Author: Jeff Xu jeffxu@chromium.org + */ + +#include <linux/mempolicy.h> +#include <linux/mman.h> +#include <linux/mm.h> +#include <linux/mm_inline.h> +#include <linux/mmu_context.h> +#include <linux/syscalls.h> +#include <linux/sched.h> +#include "internal.h" + +static inline bool vma_is_sealed(struct vm_area_struct *vma) +{ + return (vma->vm_flags & VM_SEALED); +} + +static inline bool vma_is_sealable(struct vm_area_struct *vma) +{ + return vma->vm_flags & VM_SEALABLE; +} + +static inline void set_vma_sealed(struct vm_area_struct *vma) +{ + vm_flags_set(vma, VM_SEALED); +} + +/* + * check if a vma is sealed for modification. + * return true, if modification is allowed. + */ +static bool can_modify_vma(struct vm_area_struct *vma) +{ + if (vma_is_sealed(vma)) + return false; + + return true; +} + +static bool is_madv_discard(int behavior) +{ + return behavior & + (MADV_FREE | MADV_DONTNEED | MADV_DONTNEED_LOCKED | + MADV_REMOVE | MADV_DONTFORK | MADV_WIPEONFORK); +} + +static bool is_ro_anon(struct vm_area_struct *vma) +{ + /* check anonymous mapping. */ + if (vma->vm_file || vma->vm_flags & VM_SHARED) + return false; + + /* + * check for non-writable: + * PROT=RO or PKRU is not writeable. + */ + if (!(vma->vm_flags & VM_WRITE) || + !arch_vma_access_permitted(vma, true, false, false)) + return true; + + return false; +} + +/* + * Check if the vmas of a memory range are allowed to be modified. + * the memory ranger can have a gap (unallocated memory). + * return true, if it is allowed. + */ +bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end) +{ + struct vm_area_struct *vma; + + VMA_ITERATOR(vmi, mm, start); + + /* going through each vma to check. */ + for_each_vma_range(vmi, vma, end) { + if (!can_modify_vma(vma)) + return false; + } + + /* Allow by default. */ + return true; +} + +/* + * Check if the vmas of a memory range are allowed to be modified by madvise. + * the memory ranger can have a gap (unallocated memory). + * return true, if it is allowed. + */ +bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, unsigned long end, + int behavior) +{ + struct vm_area_struct *vma; + + VMA_ITERATOR(vmi, mm, start); + + if (!is_madv_discard(behavior)) + return true; + + /* going through each vma to check. */ + for_each_vma_range(vmi, vma, end) + if (is_ro_anon(vma) && !can_modify_vma(vma)) + return false; + + /* Allow by default. */ + return true; +} + +unsigned long get_mmap_seals(unsigned long prot, + unsigned long flags) +{ + unsigned long vm_seals; + + if (prot & PROT_SEAL) + vm_seals = VM_SEALED | VM_SEALABLE; + else + vm_seals = (flags & MAP_SEALABLE) ? VM_SEALABLE : 0; + + return vm_seals; +} + +/* + * Check if a seal type can be added to VMA. + */ +static bool can_add_vma_seal(struct vm_area_struct *vma) +{ + /* if map is not sealable, reject. 
*/ + if (!vma_is_sealable(vma)) + return false; + + return true; +} + +static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, + struct vm_area_struct **prev, unsigned long start, + unsigned long end, vm_flags_t newflags) +{ + int ret = 0; + vm_flags_t oldflags = vma->vm_flags; + + if (newflags == oldflags) + goto out; + + vma = vma_modify_flags(vmi, *prev, vma, start, end, newflags); + if (IS_ERR(vma)) { + ret = PTR_ERR(vma); + goto out; + } + + set_vma_sealed(vma); +out: + *prev = vma; + return ret; +} + +/* + * Check for do_mseal: + * 1> start is part of a valid vma. + * 2> end is part of a valid vma. + * 3> No gap (unallocated address) between start and end. + * 4> map is sealable. + */ +static int check_mm_seal(unsigned long start, unsigned long end) +{ + struct vm_area_struct *vma; + unsigned long nstart = start; + + VMA_ITERATOR(vmi, current->mm, start); + + /* going through each vma to check. */ + for_each_vma_range(vmi, vma, end) { + if (vma->vm_start > nstart) + /* unallocated memory found. */ + return -ENOMEM; + + if (!can_add_vma_seal(vma)) + return -EACCES; + + if (vma->vm_end >= end) + return 0; + + nstart = vma->vm_end; + } + + return -ENOMEM; +} + +/* + * Apply sealing. + */ +static int apply_mm_seal(unsigned long start, unsigned long end) +{ + unsigned long nstart; + struct vm_area_struct *vma, *prev; + + VMA_ITERATOR(vmi, current->mm, start); + + vma = vma_iter_load(&vmi); + /* + * Note: check_mm_seal should already checked ENOMEM case. + * so vma should not be null, same for the other ENOMEM cases. + */ + prev = vma_prev(&vmi); + if (start > vma->vm_start) + prev = vma; + + nstart = start; + for_each_vma_range(vmi, vma, end) { + int error; + unsigned long tmp; + vm_flags_t newflags; + + newflags = vma->vm_flags | VM_SEALED; + tmp = vma->vm_end; + if (tmp > end) + tmp = end; + error = mseal_fixup(&vmi, vma, &prev, nstart, tmp, newflags); + if (error) + return error; + tmp = vma_iter_end(&vmi); + nstart = tmp; + } + + return 0; +} + +/* + * mseal(2) seals the VM's meta data from + * selected syscalls. + * + * addr/len: VM address range. + * + * The address range by addr/len must meet: + * start (addr) must be in a valid VMA. + * end (addr + len) must be in a valid VMA. + * no gap (unallocated memory) between start and end. + * start (addr) must be page aligned. + * + * len: len will be page aligned implicitly. + * + * Below VMA operations are blocked after sealing. + * 1> Unmapping, moving to another location, and shrinking + * the size, via munmap() and mremap(), can leave an empty + * space, therefore can be replaced with a VMA with a new + * set of attributes. + * 2> Moving or expanding a different vma into the current location, + * via mremap(). + * 3> Modifying a VMA via mmap(MAP_FIXED). + * 4> Size expansion, via mremap(), does not appear to pose any + * specific risks to sealed VMAs. It is included anyway because + * the use case is unclear. In any case, users can rely on + * merging to expand a sealed VMA. + * 5> mprotect and pkey_mprotect. + * 6> Some destructive madvice() behavior (e.g. MADV_DONTNEED) + * for anonymous memory, when users don't have write permission to the + * memory. Those behaviors can alter region contents by discarding pages, + * effectively a memset(0) for anonymous memory. + * + * flags: reserved. + * + * return values: + * zero: success. + * -EINVAL: + * invalid input flags. + * start address is not page aligned. + * Address arange (start + len) overflow. + * -ENOMEM: + * addr is not a valid address (not allocated). 
+ * end (start + len) is not a valid address. + * a gap (unallocated memory) between start and end. + * -EACCES: + * MAP_SEALABLE is not set. + * -EPERM: + * - In 32 bit architecture, sealing is not supported. + * Note: + * user can call mseal(2) multiple times, adding a seal on an + * already sealed memory is a no-action (no error). + * + * unseal() is not supported. + */ +static int do_mseal(unsigned long start, size_t len_in, unsigned long flags) +{ + size_t len; + int ret = 0; + unsigned long end; + struct mm_struct *mm = current->mm; + + ret = can_do_mseal(flags); + if (ret) + return ret; + + start = untagged_addr(start); + if (!PAGE_ALIGNED(start)) + return -EINVAL; + + len = PAGE_ALIGN(len_in); + /* Check to see whether len was rounded up from small -ve to zero. */ + if (len_in && !len) + return -EINVAL; + + end = start + len; + if (end < start) + return -EINVAL; + + if (end == start) + return 0; + + if (mmap_write_lock_killable(mm)) + return -EINTR; + + /* + * First pass, this helps to avoid + * partial sealing in case of error in input address range, + * e.g. ENOMEM and EACCESS error. + */ + ret = check_mm_seal(start, end); + if (ret) + goto out; + + /* + * Second pass, this should success, unless there are errors + * from vma_modify_flags, e.g. merge/split error, or process + * reaching the max supported VMAs, however, those cases shall + * be rare. + */ + ret = apply_mm_seal(start, end); + +out: + mmap_write_unlock(current->mm); + return ret; +} + +SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long, + flags) +{ + return do_mseal(start, len, flags); +}
On Wed, Jan 31, 2024 at 05:50:24PM +0000, jeffxu@chromium.org wrote:
[PATCH v8 2/4] mseal: add mseal syscall
[...]
+/*
+ * The PROT_SEAL defines memory sealing in the prot argument of mmap().
+ */
+#define PROT_SEAL 0x04000000 /* _BITUL(26) */
/* 0x01 - 0x03 are defined in linux/mman.h */ #define MAP_TYPE 0x0f /* Mask for type of mapping */ #define MAP_FIXED 0x10 /* Interpret addr exactly */ @@ -33,6 +38,9 @@ #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */ +/* map is sealable */ +#define MAP_SEALABLE 0x8000000 /* _BITUL(27) */
IMO this patch is misleading, as it claims to just be adding a new syscall, but it actually adds three new UAPIs, only one of which is the new syscall. The other two new UAPIs are new flags to the mmap syscall.
Based on recent discussions, it seems the usefulness of the new mmap flags has not yet been established. Note also that there are only a limited number of mmap flags remaining, so we should be careful about allocating them.
Therefore, why not start by just adding the mseal syscall, without the new mmap flags alongside it?
I'll also note that the existing PROT_* flags seem to be conventionally used for the CPU page protections, as opposed to kernel-specific properties of the VMA object. As such, PROT_SEAL feels a bit out of place anyway. If it's added at all it perhaps should be a MAP_* flag, not PROT_*. I'm not sure this aspect has been properly discussed yet, seeing as the patchset is presented as just adding sys_mseal(). Some reviewers may not have noticed or considered the new flags.
- Eric
On Thu, Feb 1, 2024 at 3:11 PM Eric Biggers ebiggers@kernel.org wrote:
On Wed, Jan 31, 2024 at 05:50:24PM +0000, jeffxu@chromium.org wrote:
[PATCH v8 2/4] mseal: add mseal syscall
[...]
+/*
+ * The PROT_SEAL defines memory sealing in the prot argument of mmap().
+ */
+#define PROT_SEAL 0x04000000 /* _BITUL(26) */
/* 0x01 - 0x03 are defined in linux/mman.h */ #define MAP_TYPE 0x0f /* Mask for type of mapping */ #define MAP_FIXED 0x10 /* Interpret addr exactly */ @@ -33,6 +38,9 @@ #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */
+/* map is sealable */ +#define MAP_SEALABLE 0x8000000 /* _BITUL(27) */
IMO this patch is misleading, as it claims to just be adding a new syscall, but it actually adds three new UAPIs, only one of which is the new syscall. The other two new UAPIs are new flags to the mmap syscall.
The description does include all three. I could update the patch title.
Based on recent discussions, it seems the usefulness of the new mmap flags has not yet been established. Note also that there are only a limited number of mmap flags remaining, so we should be careful about allocating them.
Therefore, why not start by just adding the mseal syscall, without the new mmap flags alongside it?
I'll also note that the existing PROT_* flags seem to be conventionally used for the CPU page protections, as opposed to kernel-specific properties of the VMA object. As such, PROT_SEAL feels a bit out of place anyway. If it's added at all it perhaps should be a MAP_* flag, not PROT_*. I'm not sure this aspect has been properly discussed yet, seeing as the patchset is presented as just adding sys_mseal(). Some reviewers may not have noticed or considered the new flags.
MAP_ flags are more commonly used for the type of mapping, such as MAP_FIXED_NOREPLACE.
PROT_SEAL might make more sense because sealing the protection bits is the main functionality of sealing at this moment.
Thanks -Jeff
- Eric
Jeff Xu jeffxu@chromium.org wrote:
On Thu, Feb 1, 2024 at 3:11 PM Eric Biggers ebiggers@kernel.org wrote:
On Wed, Jan 31, 2024 at 05:50:24PM +0000, jeffxu@chromium.org wrote:
[PATCH v8 2/4] mseal: add mseal syscall
[...]
+/*
+ * The PROT_SEAL defines memory sealing in the prot argument of mmap().
+ */
+#define PROT_SEAL 0x04000000 /* _BITUL(26) */
/* 0x01 - 0x03 are defined in linux/mman.h */ #define MAP_TYPE 0x0f /* Mask for type of mapping */ #define MAP_FIXED 0x10 /* Interpret addr exactly */ @@ -33,6 +38,9 @@ #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */
+/* map is sealable */ +#define MAP_SEALABLE 0x8000000 /* _BITUL(27) */
IMO this patch is misleading, as it claims to just be adding a new syscall, but it actually adds three new UAPIs, only one of which is the new syscall. The other two new UAPIs are new flags to the mmap syscall.
The description does include all three. I could update the patch title.
Based on recent discussions, it seems the usefulness of the new mmap flags has not yet been established. Note also that there are only a limited number of mmap flags remaining, so we should be careful about allocating them.
Therefore, why not start by just adding the mseal syscall, without the new mmap flags alongside it?
I'll also note that the existing PROT_* flags seem to be conventionally used for the CPU page protections, as opposed to kernel-specific properties of the VMA object. As such, PROT_SEAL feels a bit out of place anyway. If it's added at all it perhaps should be a MAP_* flag, not PROT_*. I'm not sure this aspect has been properly discussed yet, seeing as the patchset is presented as just adding sys_mseal(). Some reviewers may not have noticed or considered the new flags.
MAP_ flags are more commonly used for the type of mapping, such as MAP_FIXED_NOREPLACE.
PROT_SEAL might make more sense because sealing the protection bits is the main functionality of sealing at this moment.
Jeff, please show a piece of software that needs to do PROT_SEAL as mprotect() or mmap() argument.
Please don't write it as a vague essay.
Instead, take a piece of existing code, write a diff, and show your work.
Then explain that diff, justify why doing the PROT_SEAL as an argument of mprotect() or mmap() is a required improvement, and show your Linux developer peers that you can do computer science.
I did the same work in OpenBSD, at least 25% time over 2 years, and I had to prove my work inside my development community. I had to prove that it worked system wide, not in 1 program, with hand-waving for the rest. If I had said "Looks, it works in ssh, trust me it works in other programs", it would not have gone further.
glibc is the best example to demonstrate, but smaller examples might convince.
On Thu, Feb 1, 2024 at 7:54 PM Theo de Raadt deraadt@openbsd.org wrote:
Jeff Xu jeffxu@chromium.org wrote:
On Thu, Feb 1, 2024 at 3:11 PM Eric Biggers ebiggers@kernel.org wrote:
On Wed, Jan 31, 2024 at 05:50:24PM +0000, jeffxu@chromium.org wrote:
[PATCH v8 2/4] mseal: add mseal syscall
[...]
+/*
+ * The PROT_SEAL defines memory sealing in the prot argument of mmap().
+ */
+#define PROT_SEAL 0x04000000 /* _BITUL(26) */
/* 0x01 - 0x03 are defined in linux/mman.h */ #define MAP_TYPE 0x0f /* Mask for type of mapping */ #define MAP_FIXED 0x10 /* Interpret addr exactly */ @@ -33,6 +38,9 @@ #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */
+/* map is sealable */ +#define MAP_SEALABLE 0x8000000 /* _BITUL(27) */
IMO this patch is misleading, as it claims to just be adding a new syscall, but it actually adds three new UAPIs, only one of which is the new syscall. The other two new UAPIs are new flags to the mmap syscall.
The description does include all three. I could update the patch title.
Based on recent discussions, it seems the usefulness of the new mmap flags has not yet been established. Note also that there are only a limited number of mmap flags remaining, so we should be careful about allocating them.
Therefore, why not start by just adding the mseal syscall, without the new mmap flags alongside it?
I'll also note that the existing PROT_* flags seem to be conventionally used for the CPU page protections, as opposed to kernel-specific properties of the VMA object. As such, PROT_SEAL feels a bit out of place anyway. If it's added at all it perhaps should be a MAP_* flag, not PROT_*. I'm not sure this aspect has been properly discussed yet, seeing as the patchset is presented as just adding sys_mseal(). Some reviewers may not have noticed or considered the new flags.
MAP_ flags are more commonly used for the type of mapping, such as MAP_FIXED_NOREPLACE.
PROT_SEAL might make more sense because sealing the protection bits is the main functionality of sealing at this moment.
Jeff, please show a piece of software that needs to do PROT_SEAL as mprotect() or mmap() argument.
I didn't propose mprotect().
For mmap(), here is a potential use case:
fs/binfmt_elf.c

	if (current->personality & MMAP_PAGE_ZERO) {
		/* Why this, you ask??? Well SVr4 maps page 0 as read-only,
		   and some applications "depend" upon this behavior.
		   Since we do not have the power to recompile these, we
		   emulate the SVr4 behavior. Sigh. */
		error = vm_mmap(NULL, 0, PAGE_SIZE,
				PROT_READ | PROT_EXEC,  <-- add PROT_SEAL
				MAP_FIXED | MAP_PRIVATE, 0);
	}
I don't see the benefit of an RWX page 0, which might make a null pointer error become executable for some code.
Please don't write it as a vague essay.
Instead, take a piece of existing code, write a diff, and show your work.
Then explain that diff, justify why doing the PROT_SEAL as an argument of mprotect() or mmap() is a required improvement, and show your Linux developer peers that you can do computer science.
I did the same work in OpenBSD, at least 25% time over 2 years, and I had to prove my work inside my development community. I had to prove that it worked system wide, not in 1 program, with hand-waving for the rest. If I had said "Looks, it works in ssh, trust me it works in other programs", it would not have gone further.
glibc is the best example to demonstrate, but smaller examples might convince.
Jeff Xu jeffxu@chromium.org wrote:
On Thu, Feb 1, 2024 at 7:54 PM Theo de Raadt deraadt@openbsd.org wrote:
Jeff Xu jeffxu@chromium.org wrote:
On Thu, Feb 1, 2024 at 3:11 PM Eric Biggers ebiggers@kernel.org wrote:
On Wed, Jan 31, 2024 at 05:50:24PM +0000, jeffxu@chromium.org wrote:
[PATCH v8 2/4] mseal: add mseal syscall
[...]
+/*
+ * The PROT_SEAL defines memory sealing in the prot argument of mmap().
+ */
+#define PROT_SEAL 0x04000000 /* _BITUL(26) */
/* 0x01 - 0x03 are defined in linux/mman.h */ #define MAP_TYPE 0x0f /* Mask for type of mapping */ #define MAP_FIXED 0x10 /* Interpret addr exactly */ @@ -33,6 +38,9 @@ #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */
+/* map is sealable */ +#define MAP_SEALABLE 0x8000000 /* _BITUL(27) */
IMO this patch is misleading, as it claims to just be adding a new syscall, but it actually adds three new UAPIs, only one of which is the new syscall. The other two new UAPIs are new flags to the mmap syscall.
The description does include all three. I could update the patch title.
Based on recent discussions, it seems the usefulness of the new mmap flags has not yet been established. Note also that there are only a limited number of mmap flags remaining, so we should be careful about allocating them.
Therefore, why not start by just adding the mseal syscall, without the new mmap flags alongside it?
I'll also note that the existing PROT_* flags seem to be conventionally used for the CPU page protections, as opposed to kernel-specific properties of the VMA object. As such, PROT_SEAL feels a bit out of place anyway. If it's added at all it perhaps should be a MAP_* flag, not PROT_*. I'm not sure this aspect has been properly discussed yet, seeing as the patchset is presented as just adding sys_mseal(). Some reviewers may not have noticed or considered the new flags.
MAP_ flags are more commonly used for the type of mapping, such as MAP_FIXED_NOREPLACE.
PROT_SEAL might make more sense because sealing the protection bits is the main functionality of sealing at this moment.
Jeff, please show a piece of software that needs to do PROT_SEAL as mprotect() or mmap() argument.
I didn't propose mprotect().
for mmap() here is a potential use case:
fs/binfmt_elf.c

	if (current->personality & MMAP_PAGE_ZERO) {
		/* Why this, you ask??? Well SVr4 maps page 0 as read-only,
		   and some applications "depend" upon this behavior.
		   Since we do not have the power to recompile these, we
		   emulate the SVr4 behavior. Sigh. */
		error = vm_mmap(NULL, 0, PAGE_SIZE,
				PROT_READ | PROT_EXEC,  <-- add PROT_SEAL
				MAP_FIXED | MAP_PRIVATE, 0);
	}
I don't see the benefit of an RWX page 0, which might make a null pointer error become executable for some code.
And this is a lot faster than doing the operation as a second step?
But anyways, that's kernel code. It is not userland exposed API used by programs.
The question is the damage you create by adding API exposed to userland (since this is Linux: forever).
I should be the first person thrilled to see Linux make API/ABI mistakes they have to support forever, but I can't be that person.
On Thu, Feb 1, 2024 at 8:10 PM Theo de Raadt deraadt@openbsd.org wrote:
Jeff Xu jeffxu@chromium.org wrote:
On Thu, Feb 1, 2024 at 7:54 PM Theo de Raadt deraadt@openbsd.org wrote:
Jeff Xu jeffxu@chromium.org wrote:
On Thu, Feb 1, 2024 at 3:11 PM Eric Biggers ebiggers@kernel.org wrote:
On Wed, Jan 31, 2024 at 05:50:24PM +0000, jeffxu@chromium.org wrote:
[PATCH v8 2/4] mseal: add mseal syscall
[...]
+/*
+ * The PROT_SEAL defines memory sealing in the prot argument of mmap().
+ */
+#define PROT_SEAL 0x04000000 /* _BITUL(26) */
/* 0x01 - 0x03 are defined in linux/mman.h */ #define MAP_TYPE 0x0f /* Mask for type of mapping */ #define MAP_FIXED 0x10 /* Interpret addr exactly */ @@ -33,6 +38,9 @@ #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */
+/* map is sealable */ +#define MAP_SEALABLE 0x8000000 /* _BITUL(27) */
IMO this patch is misleading, as it claims to just be adding a new syscall, but it actually adds three new UAPIs, only one of which is the new syscall. The other two new UAPIs are new flags to the mmap syscall.
The description does include all three. I could update the patch title.
Based on recent discussions, it seems the usefulness of the new mmap flags has not yet been established. Note also that there are only a limited number of mmap flags remaining, so we should be careful about allocating them.
Therefore, why not start by just adding the mseal syscall, without the new mmap flags alongside it?
I'll also note that the existing PROT_* flags seem to be conventionally used for the CPU page protections, as opposed to kernel-specific properties of the VMA object. As such, PROT_SEAL feels a bit out of place anyway. If it's added at all it perhaps should be a MAP_* flag, not PROT_*. I'm not sure this aspect has been properly discussed yet, seeing as the patchset is presented as just adding sys_mseal(). Some reviewers may not have noticed or considered the new flags.
MAP_ flags are more commonly used for the type of mapping, such as MAP_FIXED_NOREPLACE.
PROT_SEAL might make more sense because sealing the protection bits is the main functionality of sealing at this moment.
Jeff, please show a piece of software that needs to do PROT_SEAL as mprotect() or mmap() argument.
I didn't propose mprotect().
for mmap() here is a potential use case:
fs/binfmt_elf.c

	if (current->personality & MMAP_PAGE_ZERO) {
		/* Why this, you ask??? Well SVr4 maps page 0 as read-only,
		   and some applications "depend" upon this behavior.
		   Since we do not have the power to recompile these, we
		   emulate the SVr4 behavior. Sigh. */
		error = vm_mmap(NULL, 0, PAGE_SIZE,
				PROT_READ | PROT_EXEC,  <-- add PROT_SEAL
				MAP_FIXED | MAP_PRIVATE, 0);
	}
I don't see the benefit of an RWX page 0, which might make a null pointer error become executable for some code.
And this is a lot faster than doing the operation as a second step?
But anyways, that's kernel code. It is not userland exposed API used by programs.
The question is the damage you create by adding API exposed to userland (since this is Linux: forever).
I should be the first person thrilled to see Linux make API/ABI mistakes they have to support forever, but I can't be that person.
Point taken. I can remove PROT_SEAL.
From: Jeff Xu jeffxu@chromium.org
Selftest for the memory sealing changes in mmap() and mseal().
Signed-off-by: Jeff Xu jeffxu@chromium.org --- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/mseal_test.c | 2024 +++++++++++++++++++++++ 3 files changed, 2026 insertions(+) create mode 100644 tools/testing/selftests/mm/mseal_test.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 4ff10ea61461..76474c51c786 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -46,3 +46,4 @@ gup_longterm mkdirty va_high_addr_switch hugetlb_fault_after_madv +mseal_test diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 2453add65d12..ba36a5c2b1fc 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -59,6 +59,7 @@ TEST_GEN_FILES += mlock2-tests TEST_GEN_FILES += mrelease_test TEST_GEN_FILES += mremap_dontunmap TEST_GEN_FILES += mremap_test +TEST_GEN_FILES += mseal_test TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += pagemap_ioctl TEST_GEN_FILES += thuge-gen diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c new file mode 100644 index 000000000000..746bb0f96fe4 --- /dev/null +++ b/tools/testing/selftests/mm/mseal_test.c @@ -0,0 +1,2024 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#include <sys/mman.h> +#include <stdint.h> +#include <unistd.h> +#include <string.h> +#include <sys/time.h> +#include <sys/resource.h> +#include <stdbool.h> +#include "../kselftest.h" +#include <syscall.h> +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <assert.h> +#include <fcntl.h> +#include <assert.h> +#include <sys/ioctl.h> +#include <sys/vfs.h> +#include <sys/stat.h> + +/* + * need those definition for manually build using gcc. + * gcc -I ../../../../usr/include -DDEBUG -O3 -DDEBUG -O3 mseal_test.c -o mseal_test + */ +#ifndef MAP_SEALABLE +#define MAP_SEALABLE 0x8000000 +#endif + +#ifndef PROT_SEAL +#define PROT_SEAL 0x04000000 +#endif + +#ifndef PKEY_DISABLE_ACCESS +# define PKEY_DISABLE_ACCESS 0x1 +#endif + +#ifndef PKEY_DISABLE_WRITE +# define PKEY_DISABLE_WRITE 0x2 +#endif + +#ifndef PKEY_BITS_PER_KEY +#define PKEY_BITS_PER_PKEY 2 +#endif + +#ifndef PKEY_MASK +#define PKEY_MASK (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE) +#endif + +#define FAIL_TEST_IF_FALSE(c) do {\ + if (!(c)) {\ + ksft_test_result_fail("%s, line:%d\n", __func__, __LINE__);\ + goto test_end;\ + } \ + } \ + while (0) + +#define SKIP_TEST_IF_FALSE(c) do {\ + if (!(c)) {\ + ksft_test_result_skip("%s, line:%d\n", __func__, __LINE__);\ + goto test_end;\ + } \ + } \ + while (0) + + +#define TEST_END_CHECK() {\ + ksft_test_result_pass("%s\n", __func__);\ + return;\ +test_end:\ + return;\ +} + +#ifndef u64 +#define u64 unsigned long long +#endif + +static unsigned long get_vma_size(void *addr) +{ + FILE *maps; + char line[256]; + int size = 0; + uintptr_t addr_start, addr_end; + + maps = fopen("/proc/self/maps", "r"); + if (!maps) + return 0; + + while (fgets(line, sizeof(line), maps)) { + if (sscanf(line, "%lx-%lx", &addr_start, &addr_end) == 2) { + if (addr_start == (uintptr_t) addr) { + size = addr_end - addr_start; + break; + } + } + } + fclose(maps); + return size; +} + +/* + * define sys_xyx to call syscall directly. 
+ */ +static int sys_mseal(void *start, size_t len) +{ + int sret; + + errno = 0; + sret = syscall(__NR_mseal, start, len, 0); + return sret; +} + +static int sys_mprotect(void *ptr, size_t size, unsigned long prot) +{ + int sret; + + errno = 0; + sret = syscall(__NR_mprotect, ptr, size, prot); + return sret; +} + +static int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot, + unsigned long pkey) +{ + int sret; + + errno = 0; + sret = syscall(__NR_pkey_mprotect, ptr, size, orig_prot, pkey); + return sret; +} + +static void *sys_mmap(void *addr, unsigned long len, unsigned long prot, + unsigned long flags, unsigned long fd, unsigned long offset) +{ + void *sret; + + errno = 0; + sret = (void *) syscall(__NR_mmap, addr, len, prot, + flags, fd, offset); + return sret; +} + +static int sys_munmap(void *ptr, size_t size) +{ + int sret; + + errno = 0; + sret = syscall(__NR_munmap, ptr, size); + return sret; +} + +static int sys_madvise(void *start, size_t len, int types) +{ + int sret; + + errno = 0; + sret = syscall(__NR_madvise, start, len, types); + return sret; +} + +static int sys_pkey_alloc(unsigned long flags, unsigned long init_val) +{ + int ret = syscall(__NR_pkey_alloc, flags, init_val); + + return ret; +} + +static unsigned int __read_pkey_reg(void) +{ + unsigned int eax, edx; + unsigned int ecx = 0; + unsigned int pkey_reg; + + asm volatile(".byte 0x0f,0x01,0xee\n\t" + : "=a" (eax), "=d" (edx) + : "c" (ecx)); + pkey_reg = eax; + return pkey_reg; +} + +static void __write_pkey_reg(u64 pkey_reg) +{ + unsigned int eax = pkey_reg; + unsigned int ecx = 0; + unsigned int edx = 0; + + asm volatile(".byte 0x0f,0x01,0xef\n\t" + : : "a" (eax), "c" (ecx), "d" (edx)); + assert(pkey_reg == __read_pkey_reg()); +} + +static unsigned long pkey_bit_position(int pkey) +{ + return pkey * PKEY_BITS_PER_PKEY; +} + +static u64 set_pkey_bits(u64 reg, int pkey, u64 flags) +{ + unsigned long shift = pkey_bit_position(pkey); + + /* mask out bits from pkey in old value */ + reg &= ~((u64)PKEY_MASK << shift); + /* OR in new bits for pkey */ + reg |= (flags & PKEY_MASK) << shift; + return reg; +} + +static void set_pkey(int pkey, unsigned long pkey_value) +{ + unsigned long mask = (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE); + u64 new_pkey_reg; + + assert(!(pkey_value & ~mask)); + new_pkey_reg = set_pkey_bits(__read_pkey_reg(), pkey, pkey_value); + __write_pkey_reg(new_pkey_reg); +} + +static void setup_single_address(int size, void **ptrOut) +{ + void *ptr; + + ptr = sys_mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0); + assert(ptr != (void *)-1); + *ptrOut = ptr; +} + +static void setup_single_address_rw_sealable(int size, void **ptrOut, bool sealable) +{ + void *ptr; + unsigned long mapflags = MAP_ANONYMOUS | MAP_PRIVATE; + + if (sealable) + mapflags |= MAP_SEALABLE; + + ptr = sys_mmap(NULL, size, PROT_READ | PROT_WRITE, mapflags, -1, 0); + assert(ptr != (void *)-1); + *ptrOut = ptr; +} + +static void clean_single_address(void *ptr, int size) +{ + int ret; + + ret = munmap(ptr, size); + assert(!ret); +} + +static void seal_single_address(void *ptr, int size) +{ + int ret; + + ret = sys_mseal(ptr, size); + assert(!ret); +} + +bool seal_support(void) +{ + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + + ptr = sys_mmap(NULL, page_size, PROT_READ | PROT_SEAL, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + if (ptr == (void *) -1) + return false; + + ret = sys_mseal(ptr, page_size); + if (ret < 0) + return false; + + return true; +} + +bool 
pkey_supported(void) +{ + int pkey = sys_pkey_alloc(0, 0); + + if (pkey > 0) + return true; + return false; +} + +static void test_seal_addseal(void) +{ + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_unmapped_start(void) +{ + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + /* munmap 2 pages from ptr. */ + ret = sys_munmap(ptr, 2 * page_size); + FAIL_TEST_IF_FALSE(!ret); + + /* mprotect will fail because 2 pages from ptr are unmapped. */ + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(ret < 0); + + /* mseal will fail because 2 pages from ptr are unmapped. */ + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(ret < 0); + + ret = sys_mseal(ptr + 2 * page_size, 2 * page_size); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_unmapped_middle(void) +{ + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + /* munmap 2 pages from ptr + page. */ + ret = sys_munmap(ptr + page_size, 2 * page_size); + FAIL_TEST_IF_FALSE(!ret); + + /* mprotect will fail, since middle 2 pages are unmapped. */ + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(ret < 0); + + /* mseal will fail as well. */ + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(ret < 0); + + /* we still can add seal to the first page and last page*/ + ret = sys_mseal(ptr, page_size); + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_mseal(ptr + 3 * page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_unmapped_end(void) +{ + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + /* unmap last 2 pages. */ + ret = sys_munmap(ptr + 2 * page_size, 2 * page_size); + FAIL_TEST_IF_FALSE(!ret); + + /* mprotect will fail since last 2 pages are unmapped. */ + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(ret < 0); + + /* mseal will fail as well. */ + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(ret < 0); + + /* The first 2 pages is not sealed, and can add seals */ + ret = sys_mseal(ptr, 2 * page_size); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_multiple_vmas(void) +{ + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + /* use mprotect to split the vma into 3. */ + ret = sys_mprotect(ptr + page_size, 2 * page_size, + PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + /* mprotect will get applied to all 4 pages - 3 VMAs. */ + ret = sys_mprotect(ptr, size, PROT_READ); + FAIL_TEST_IF_FALSE(!ret); + + /* use mprotect to split the vma into 3. */ + ret = sys_mprotect(ptr + page_size, 2 * page_size, + PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + /* mseal get applied to all 4 pages - 3 VMAs. 
*/ + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_split_start(void) +{ + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + /* use mprotect to split at middle */ + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + /* seal the first page, this will split the VMA */ + ret = sys_mseal(ptr, page_size); + FAIL_TEST_IF_FALSE(!ret); + + /* add seal to the remain 3 pages */ + ret = sys_mseal(ptr + page_size, 3 * page_size); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_split_end(void) +{ + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + /* use mprotect to split at middle */ + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + /* seal the last page */ + ret = sys_mseal(ptr + 3 * page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + + /* Adding seals to the first 3 pages */ + ret = sys_mseal(ptr, 3 * page_size); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_invalid_input(void) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(8 * page_size, &ptr); + clean_single_address(ptr + 4 * page_size, 4 * page_size); + + /* invalid flag */ + ret = syscall(__NR_mseal, ptr, size, 0x20); + FAIL_TEST_IF_FALSE(ret < 0); + + /* unaligned address */ + ret = sys_mseal(ptr + 1, 2 * page_size); + FAIL_TEST_IF_FALSE(ret < 0); + + /* length too big */ + ret = sys_mseal(ptr, 5 * page_size); + FAIL_TEST_IF_FALSE(ret < 0); + + /* length overflow */ + ret = sys_mseal(ptr, UINT64_MAX/page_size); + FAIL_TEST_IF_FALSE(ret < 0); + + /* start is not in a valid VMA */ + ret = sys_mseal(ptr - page_size, 5 * page_size); + FAIL_TEST_IF_FALSE(ret < 0); + + TEST_END_CHECK(); +} + +static void test_seal_zero_length(void) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + ret = sys_mprotect(ptr, 0, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + /* seal 0 length will be OK, same as mprotect */ + ret = sys_mseal(ptr, 0); + FAIL_TEST_IF_FALSE(!ret); + + /* verify the 4 pages are not sealed by previous call. */ + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_zero_address(void) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + /* use mmap to change protection. */ + ptr = sys_mmap(0, size, PROT_NONE | PROT_SEAL, + MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0); + FAIL_TEST_IF_FALSE(ptr == 0); + + size = get_vma_size(ptr); + FAIL_TEST_IF_FALSE(size == 4 * page_size); + + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + + /* verify the 4 pages are sealed by previous call. */ + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(ret); + + TEST_END_CHECK(); +} + +static void test_seal_twice(void) +{ + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + + /* apply the same seal will be OK. idempotent. 
*/ + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_mprotect(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_single_address(ptr, size); + + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_start_mprotect(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_single_address(ptr, page_size); + + /* the first page is sealed. */ + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + /* pages after the first page is not sealed. */ + ret = sys_mprotect(ptr + page_size, page_size * 3, + PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_end_mprotect(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_single_address(ptr + page_size, 3 * page_size); + + /* first page is not sealed */ + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + /* last 3 page are sealed */ + ret = sys_mprotect(ptr + page_size, page_size * 3, + PROT_READ | PROT_WRITE); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_mprotect_unalign_len(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_single_address(ptr, page_size * 2 - 1); + + /* 2 pages are sealed. */ + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_mprotect(ptr + page_size * 2, page_size, + PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_mprotect_unalign_len_variant_2(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + if (seal) + seal_single_address(ptr, page_size * 2 + 1); + + /* 3 pages are sealed. 
*/ + ret = sys_mprotect(ptr, page_size * 3, PROT_READ | PROT_WRITE); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_mprotect(ptr + page_size * 3, page_size, + PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_mprotect_two_vma(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + /* use mprotect to split */ + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + if (seal) + seal_single_address(ptr, page_size * 4); + + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_mprotect(ptr + page_size * 2, page_size * 2, + PROT_READ | PROT_WRITE); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_mprotect_two_vma_with_split(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + /* use mprotect to split as two vma. */ + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + /* mseal can apply across 2 vma, also split them. */ + if (seal) + seal_single_address(ptr + page_size, page_size * 2); + + /* the first page is not sealed. */ + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + /* the second page is sealed. */ + ret = sys_mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + /* the third page is sealed. */ + ret = sys_mprotect(ptr + 2 * page_size, page_size, + PROT_READ | PROT_WRITE); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + /* the fouth page is not sealed. */ + ret = sys_mprotect(ptr + 3 * page_size, page_size, + PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_mprotect_partial_mprotect(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + /* seal one page. */ + if (seal) + seal_single_address(ptr, page_size); + + /* mprotect first 2 page will fail, since the first page are sealed. */ + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_mprotect_two_vma_with_gap(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + /* use mprotect to split. */ + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + /* use mprotect to split. */ + ret = sys_mprotect(ptr + 3 * page_size, page_size, + PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + /* use munmap to free two pages in the middle */ + ret = sys_munmap(ptr + page_size, 2 * page_size); + FAIL_TEST_IF_FALSE(!ret); + + /* mprotect will fail, because there is a gap in the address. */ + /* notes, internally mprotect still updated the first page. */ + ret = sys_mprotect(ptr, 4 * page_size, PROT_READ); + FAIL_TEST_IF_FALSE(ret < 0); + + /* mseal will fail as well. 
*/ + ret = sys_mseal(ptr, 4 * page_size); + FAIL_TEST_IF_FALSE(ret < 0); + + /* the first page is not sealed. */ + ret = sys_mprotect(ptr, page_size, PROT_READ); + FAIL_TEST_IF_FALSE(ret == 0); + + /* the last page is not sealed. */ + ret = sys_mprotect(ptr + 3 * page_size, page_size, PROT_READ); + FAIL_TEST_IF_FALSE(ret == 0); + + TEST_END_CHECK(); +} + +static void test_seal_mprotect_split(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + /* use mprotect to split. */ + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + /* seal all 4 pages. */ + if (seal) { + ret = sys_mseal(ptr, 4 * page_size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* mprotect is sealed. */ + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + + ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_mprotect_merge(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + /* use mprotect to split one page. */ + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + /* seal first two pages. */ + if (seal) { + ret = sys_mseal(ptr, 2 * page_size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* 2 pages are sealed. */ + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + /* last 2 pages are not sealed. */ + ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ); + FAIL_TEST_IF_FALSE(ret == 0); + + TEST_END_CHECK(); +} + +static void test_seal_munmap(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* 4 pages are sealed. */ + ret = sys_munmap(ptr, size); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +/* + * allocate 4 pages, + * use mprotect to split it as two VMAs + * seal the whole range + * munmap will fail on both + */ +static void test_seal_munmap_two_vma(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + /* use mprotect to split */ + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(!ret); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + ret = sys_munmap(ptr, page_size * 2); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_munmap(ptr + page_size, page_size * 2); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +/* + * allocate a VMA with 4 pages. + * munmap the middle 2 pages. + * seal the whole 4 pages, will fail. + * note: one of the pages are sealed + * munmap the first page will be OK. + * munmap the last page will be OK. 
+ */ +static void test_seal_munmap_vma_with_gap(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + ret = sys_munmap(ptr + page_size, page_size * 2); + FAIL_TEST_IF_FALSE(!ret); + + if (seal) { + /* can't have gap in the middle. */ + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(ret < 0); + } + + ret = sys_munmap(ptr, page_size); + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_munmap(ptr + page_size * 2, page_size); + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_munmap(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_munmap_start_freed(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + /* unmap the first page. */ + ret = sys_munmap(ptr, page_size); + FAIL_TEST_IF_FALSE(!ret); + + /* seal the last 3 pages. */ + if (seal) { + ret = sys_mseal(ptr + page_size, 3 * page_size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* unmap from the first page. */ + ret = sys_munmap(ptr, size); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + /* note: this will be OK, even the first page is */ + /* already unmapped. */ + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_munmap_end_freed(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + /* unmap last page. */ + ret = sys_munmap(ptr + page_size * 3, page_size); + FAIL_TEST_IF_FALSE(!ret); + + /* seal the first 3 pages. */ + if (seal) { + ret = sys_mseal(ptr, 3 * page_size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* unmap all pages. */ + ret = sys_munmap(ptr, size); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_munmap_middle_freed(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + /* unmap 2 pages in the middle. */ + ret = sys_munmap(ptr + page_size, page_size * 2); + FAIL_TEST_IF_FALSE(!ret); + + /* seal the first page. */ + if (seal) { + ret = sys_mseal(ptr, page_size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* munmap all 4 pages. */ + ret = sys_munmap(ptr, size); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_mremap_shrink(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* shrink from 4 pages to 2 pages. */ + ret2 = mremap(ptr, size, 2 * page_size, 0, 0); + if (seal) { + FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(errno == EPERM); + } else { + FAIL_TEST_IF_FALSE(ret2 != MAP_FAILED); + + } + + TEST_END_CHECK(); +} + +static void test_seal_mremap_expand(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + /* ummap last 2 pages. */ + ret = sys_munmap(ptr + 2 * page_size, 2 * page_size); + FAIL_TEST_IF_FALSE(!ret); + + if (seal) { + ret = sys_mseal(ptr, 2 * page_size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* expand from 2 page to 4 pages. 
*/ + ret2 = mremap(ptr, 2 * page_size, 4 * page_size, 0, 0); + if (seal) { + FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(errno == EPERM); + } else { + FAIL_TEST_IF_FALSE(ret2 == ptr); + + } + + TEST_END_CHECK(); +} + +static void test_seal_mremap_move(bool seal) +{ + void *ptr, *newPtr; + unsigned long page_size = getpagesize(); + unsigned long size = page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + setup_single_address(size, &newPtr); + clean_single_address(newPtr, size); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* move from ptr to fixed address. */ + ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newPtr); + if (seal) { + FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(errno == EPERM); + } else { + FAIL_TEST_IF_FALSE(ret2 != MAP_FAILED); + + } + + TEST_END_CHECK(); +} + +static void test_seal_mmap_overwrite_prot(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* use mmap to change protection. */ + ret2 = sys_mmap(ptr, size, PROT_NONE, + MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0); + if (seal) { + FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(errno == EPERM); + } else + FAIL_TEST_IF_FALSE(ret2 == ptr); + + TEST_END_CHECK(); +} + +static void test_seal_mmap_expand(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 12 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + /* ummap last 4 pages. */ + ret = sys_munmap(ptr + 8 * page_size, 4 * page_size); + FAIL_TEST_IF_FALSE(!ret); + + if (seal) { + ret = sys_mseal(ptr, 8 * page_size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* use mmap to expand. */ + ret2 = sys_mmap(ptr, size, PROT_READ, + MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0); + if (seal) { + FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(errno == EPERM); + } else + FAIL_TEST_IF_FALSE(ret2 == ptr); + + TEST_END_CHECK(); +} + +static void test_seal_mmap_shrink(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 12 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* use mmap to shrink. 
*/ + ret2 = sys_mmap(ptr, 8 * page_size, PROT_READ, + MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0); + if (seal) { + FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(errno == EPERM); + } else + FAIL_TEST_IF_FALSE(ret2 == ptr); + + TEST_END_CHECK(); +} + +static void test_seal_mremap_shrink_fixed(bool seal) +{ + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + setup_single_address(size, &newAddr); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* mremap to move and shrink to fixed address */ + ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED, + newAddr); + if (seal) { + FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(errno == EPERM); + } else + FAIL_TEST_IF_FALSE(ret2 == newAddr); + + TEST_END_CHECK(); +} + +static void test_seal_mremap_expand_fixed(bool seal) +{ + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(page_size, &ptr); + setup_single_address(size, &newAddr); + + if (seal) { + ret = sys_mseal(newAddr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* mremap to move and expand to fixed address */ + ret2 = mremap(ptr, page_size, size, MREMAP_MAYMOVE | MREMAP_FIXED, + newAddr); + if (seal) { + FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(errno == EPERM); + } else + FAIL_TEST_IF_FALSE(ret2 == newAddr); + + TEST_END_CHECK(); +} + +static void test_seal_mremap_move_fixed(bool seal) +{ + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + setup_single_address(size, &newAddr); + + if (seal) { + ret = sys_mseal(newAddr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* mremap to move to fixed address */ + ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newAddr); + if (seal) { + FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(errno == EPERM); + } else + FAIL_TEST_IF_FALSE(ret2 == newAddr); + + TEST_END_CHECK(); +} + +static void test_seal_mremap_move_fixed_zero(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* + * MREMAP_FIXED can move the mapping to zero address + */ + ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED, + 0); + if (seal) { + FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(errno == EPERM); + } else { + FAIL_TEST_IF_FALSE(ret2 == 0); + + } + + TEST_END_CHECK(); +} + +static void test_seal_mremap_move_dontunmap(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* mremap to move, and don't unmap src addr. 
*/ + ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, 0); + if (seal) { + FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(errno == EPERM); + } else { + FAIL_TEST_IF_FALSE(ret2 != MAP_FAILED); + + } + + TEST_END_CHECK(); +} + +static void test_seal_mremap_move_dontunmap_anyaddr(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* + * The 0xdeaddead should not have effect on dest addr + * when MREMAP_DONTUNMAP is set. + */ + ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, + 0xdeaddead); + if (seal) { + FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED); + FAIL_TEST_IF_FALSE(errno == EPERM); + } else { + FAIL_TEST_IF_FALSE(ret2 != MAP_FAILED); + FAIL_TEST_IF_FALSE((long)ret2 != 0xdeaddead); + + } + + TEST_END_CHECK(); +} + + +static void test_seal_mmap_seal(void) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + ptr = sys_mmap(NULL, size, PROT_READ | PROT_SEAL, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + FAIL_TEST_IF_FALSE(ptr != (void *)-1); + + ret = sys_munmap(ptr, size); + FAIL_TEST_IF_FALSE(ret < 0); + + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + FAIL_TEST_IF_FALSE(ret < 0); + + ret = sys_madvise(ptr, size, MADV_DONTNEED); + FAIL_TEST_IF_FALSE(ret < 0); + + TEST_END_CHECK(); +} + +static void test_seal_merge_and_split(void) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size; + int ret; + + /* (24 RO) */ + setup_single_address(24 * page_size, &ptr); + + /* use mprotect(NONE) to set out boundary */ + /* (1 NONE) (22 RO) (1 NONE) */ + ret = sys_mprotect(ptr, page_size, PROT_NONE); + FAIL_TEST_IF_FALSE(!ret); + ret = sys_mprotect(ptr + 23 * page_size, page_size, PROT_NONE); + FAIL_TEST_IF_FALSE(!ret); + size = get_vma_size(ptr + page_size); + FAIL_TEST_IF_FALSE(size == 22 * page_size); + + /* use mseal to split from beginning */ + /* (1 NONE) (1 RO_SEAL) (21 RO) (1 NONE) */ + ret = sys_mseal(ptr + page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + size = get_vma_size(ptr + page_size); + FAIL_TEST_IF_FALSE(size == page_size); + size = get_vma_size(ptr + 2 * page_size); + FAIL_TEST_IF_FALSE(size == 21 * page_size); + + /* use mseal to split from the end. */ + /* (1 NONE) (1 RO_SEAL) (20 RO) (1 RO_SEAL) (1 NONE) */ + ret = sys_mseal(ptr + 22 * page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + size = get_vma_size(ptr + 22 * page_size); + FAIL_TEST_IF_FALSE(size == page_size); + size = get_vma_size(ptr + 2 * page_size); + FAIL_TEST_IF_FALSE(size == 20 * page_size); + + /* merge with prev. */ + /* (1 NONE) (2 RO_SEAL) (19 RO) (1 RO_SEAL) (1 NONE) */ + ret = sys_mseal(ptr + 2 * page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + size = get_vma_size(ptr + page_size); + FAIL_TEST_IF_FALSE(size == 2 * page_size); + + /* merge with after. 
*/ + /* (1 NONE) (2 RO_SEAL) (18 RO) (2 RO_SEALS) (1 NONE) */ + ret = sys_mseal(ptr + 21 * page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + size = get_vma_size(ptr + 21 * page_size); + FAIL_TEST_IF_FALSE(size == 2 * page_size); + + /* split and merge from prev */ + /* (1 NONE) (3 RO_SEAL) (17 RO) (2 RO_SEALS) (1 NONE) */ + ret = sys_mseal(ptr + 2 * page_size, 2 * page_size); + FAIL_TEST_IF_FALSE(!ret); + size = get_vma_size(ptr + 1 * page_size); + FAIL_TEST_IF_FALSE(size == 3 * page_size); + ret = sys_munmap(ptr + page_size, page_size); + FAIL_TEST_IF_FALSE(ret < 0); + ret = sys_mprotect(ptr + 2 * page_size, page_size, PROT_NONE); + FAIL_TEST_IF_FALSE(ret < 0); + + /* split and merge from next */ + /* (1 NONE) (3 RO_SEAL) (16 RO) (3 RO_SEALS) (1 NONE) */ + ret = sys_mseal(ptr + 20 * page_size, 2 * page_size); + FAIL_TEST_IF_FALSE(!ret); + size = get_vma_size(ptr + 20 * page_size); + FAIL_TEST_IF_FALSE(size == 3 * page_size); + + /* merge from middle of prev and middle of next. */ + /* (1 NONE) (22 RO_SEAL) (1 NONE) */ + ret = sys_mseal(ptr + 2 * page_size, 20 * page_size); + FAIL_TEST_IF_FALSE(!ret); + size = get_vma_size(ptr + page_size); + FAIL_TEST_IF_FALSE(size == 22 * page_size); + + TEST_END_CHECK(); +} + +static void test_seal_mmap_merge(void) +{ + + void *ptr, *ptr2; + unsigned long page_size = getpagesize(); + unsigned long size; + int ret; + + /* (24 RO) */ + setup_single_address(24 * page_size, &ptr); + + /* use mprotect(NONE) to set out boundary */ + /* (1 NONE) (22 RO) (1 NONE) */ + ret = sys_mprotect(ptr, page_size, PROT_NONE); + FAIL_TEST_IF_FALSE(!ret); + ret = sys_mprotect(ptr + 23 * page_size, page_size, PROT_NONE); + FAIL_TEST_IF_FALSE(!ret); + size = get_vma_size(ptr + page_size); + FAIL_TEST_IF_FALSE(size == 22 * page_size); + + /* use munmap to free 2 segment of memory. */ + /* (1 NONE) (1 free) (20 RO) (1 free) (1 NONE) */ + ret = sys_munmap(ptr + page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_munmap(ptr + 22 * page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + + /* apply seal to the middle */ + /* (1 NONE) (1 free) (20 RO_SEAL) (1 free) (1 NONE) */ + ret = sys_mseal(ptr + 2 * page_size, 20 * page_size); + FAIL_TEST_IF_FALSE(!ret); + size = get_vma_size(ptr + 2 * page_size); + FAIL_TEST_IF_FALSE(size == 20 * page_size); + + /* allocate a mapping at beginning, and make sure it merges. */ + /* (1 NONE) (21 RO_SEAL) (1 free) (1 NONE) */ + ptr2 = sys_mmap(ptr + page_size, page_size, PROT_READ | PROT_SEAL, + MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + FAIL_TEST_IF_FALSE(ptr2 != (void *)-1); + size = get_vma_size(ptr + page_size); + FAIL_TEST_IF_FALSE(size == 21 * page_size); + + /* allocate a mapping at end, and make sure it merges. 
*/ + /* (1 NONE) (22 RO_SEAL) (1 NONE) */ + ptr2 = sys_mmap(ptr + 22 * page_size, page_size, PROT_READ | PROT_SEAL, + MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + FAIL_TEST_IF_FALSE(ptr != (void *)-1); + size = get_vma_size(ptr + page_size); + FAIL_TEST_IF_FALSE(size == 22 * page_size); + + TEST_END_CHECK(); +} + +static void test_not_sealable(void) +{ + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + ptr = sys_mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + FAIL_TEST_IF_FALSE(ptr != (void *)-1); + + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(ret < 0); + + TEST_END_CHECK(); +} + +static void test_mmap_fixed_change_to_sealable(void) +{ + int ret; + void *ptr, *ptr2; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + ptr = sys_mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + FAIL_TEST_IF_FALSE(ptr != (void *)-1); + + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(ret < 0); + + ptr2 = sys_mmap(ptr, size, PROT_READ, + MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0); + FAIL_TEST_IF_FALSE(ptr2 == ptr); + + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_mmap_fixed_change_to_not_sealable(void) +{ + int ret; + void *ptr, *ptr2; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + ptr = sys_mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0); + FAIL_TEST_IF_FALSE(ptr != (void *)-1); + + ptr2 = sys_mmap(ptr, size, PROT_READ, + MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + FAIL_TEST_IF_FALSE(ptr2 == ptr); + + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(ret < 0); + + TEST_END_CHECK(); +} + +static void test_merge_sealable(void) +{ + int ret; + void *ptr, *ptr2; + unsigned long page_size = getpagesize(); + unsigned long size; + + /* (24 RO) */ + setup_single_address(24 * page_size, &ptr); + + /* use mprotect(NONE) to set out boundary */ + /* (1 NONE) (22 RO) (1 NONE) */ + ret = sys_mprotect(ptr, page_size, PROT_NONE); + FAIL_TEST_IF_FALSE(!ret); + ret = sys_mprotect(ptr + 23 * page_size, page_size, PROT_NONE); + FAIL_TEST_IF_FALSE(!ret); + size = get_vma_size(ptr + page_size); + FAIL_TEST_IF_FALSE(size == 22 * page_size); + + /* (1 NONE) (RO) (4 free) (17 RO) (1 NONE) */ + ret = sys_munmap(ptr + 2 * page_size, 4 * page_size); + FAIL_TEST_IF_FALSE(!ret); + size = get_vma_size(ptr + page_size); + FAIL_TEST_IF_FALSE(size == 1 * page_size); + size = get_vma_size(ptr + 6 * page_size); + FAIL_TEST_IF_FALSE(size == 17 * page_size); + + /* (1 NONE) (RO) (1 free) (2 RO) (1 free) (17 RO) (1 NONE) */ + ptr2 = sys_mmap(ptr + 3 * page_size, 2 * page_size, PROT_READ, + MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0); + size = get_vma_size(ptr + 3 * page_size); + FAIL_TEST_IF_FALSE(size == 2 * page_size); + + /* (1 NONE) (RO) (1 free) (20 RO) (1 NONE) */ + ptr2 = sys_mmap(ptr + 5 * page_size, 1 * page_size, PROT_READ, + MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0); + FAIL_TEST_IF_FALSE(ptr2 != (void *)-1); + size = get_vma_size(ptr + 3 * page_size); + FAIL_TEST_IF_FALSE(size == 20 * page_size); + + /* (1 NONE) (RO) (1 free) (19 RO) (1 RO_SEAL) (1 NONE) */ + ret = sys_mseal(ptr + 22 * page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + + /* (1 NONE) (RO) (not sealable) (19 RO) (1 RO_SEAL) (1 NONE) */ + ptr2 = sys_mmap(ptr + 2 * page_size, page_size, PROT_READ, + MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 
0); + FAIL_TEST_IF_FALSE(ptr2 != (void *)-1); + size = get_vma_size(ptr + page_size); + FAIL_TEST_IF_FALSE(size == page_size); + size = get_vma_size(ptr + 2 * page_size); + FAIL_TEST_IF_FALSE(size == page_size); + + /* (1 NONE) (1 free) (1 NOT_SEALABLE) (19 free) (1 RO_SEAL) (1 NONE) */ + ret = sys_munmap(ptr + page_size, page_size); + FAIL_TEST_IF_FALSE(!ret); + ret = sys_munmap(ptr + 3 * page_size, 19 * page_size); + FAIL_TEST_IF_FALSE(!ret); + + /* (1 NONE) (2 NOT_SEALABLE) (19 free) (1 RO_SEAL) (1 NONE) */ + ptr2 = sys_mmap(ptr + page_size, page_size, PROT_READ, + MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + FAIL_TEST_IF_FALSE(ptr2 != (void *)-1); + size = get_vma_size(ptr + page_size); + FAIL_TEST_IF_FALSE(size == 2 * page_size); + + /* (1 NONE) (21 NOT_SEALABLE)(1 RO_SEAL) (1 NONE) */ + ptr2 = sys_mmap(ptr + 3 * page_size, 19 * page_size, PROT_READ, + MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + FAIL_TEST_IF_FALSE(ptr2 != (void *)-1); + size = get_vma_size(ptr + page_size); + FAIL_TEST_IF_FALSE(size == 21 * page_size); + + TEST_END_CHECK(); +} + +static void test_seal_discard_ro_anon_on_rw(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address_rw_sealable(size, &ptr, seal); + FAIL_TEST_IF_FALSE(ptr != (void *)-1); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* sealing doesn't take effect on RW memory. */ + ret = sys_madvise(ptr, size, MADV_DONTNEED); + FAIL_TEST_IF_FALSE(!ret); + + /* base seal still apply. */ + ret = sys_munmap(ptr, size); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_discard_ro_anon_on_pkey(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + int pkey; + + SKIP_TEST_IF_FALSE(pkey_supported()); + + setup_single_address_rw_sealable(size, &ptr, seal); + FAIL_TEST_IF_FALSE(ptr != (void *)-1); + + pkey = sys_pkey_alloc(0, 0); + FAIL_TEST_IF_FALSE(pkey > 0); + + ret = sys_mprotect_pkey((void *)ptr, size, PROT_READ | PROT_WRITE, pkey); + FAIL_TEST_IF_FALSE(!ret); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* sealing doesn't take effect if PKRU allow write. */ + set_pkey(pkey, 0); + ret = sys_madvise(ptr, size, MADV_DONTNEED); + FAIL_TEST_IF_FALSE(!ret); + + /* sealing will take effect if PKRU deny write. */ + set_pkey(pkey, PKEY_DISABLE_WRITE); + ret = sys_madvise(ptr, size, MADV_DONTNEED); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + /* base seal still apply. */ + ret = sys_munmap(ptr, size); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_discard_ro_anon_on_filebacked(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + int fd; + unsigned long mapflags = MAP_PRIVATE; + + if (seal) + mapflags |= MAP_SEALABLE; + + fd = memfd_create("test", 0); + FAIL_TEST_IF_FALSE(fd > 0); + + ret = fallocate(fd, 0, 0, size); + FAIL_TEST_IF_FALSE(!ret); + + ptr = sys_mmap(NULL, size, PROT_READ, mapflags, fd, 0); + FAIL_TEST_IF_FALSE(ptr != MAP_FAILED); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* sealing doesn't apply for file backed mapping. 
*/ + ret = sys_madvise(ptr, size, MADV_DONTNEED); + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_munmap(ptr, size); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + close(fd); + + TEST_END_CHECK(); +} + +static void test_seal_discard_ro_anon_on_shared(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + unsigned long mapflags = MAP_ANONYMOUS | MAP_SHARED; + + if (seal) + mapflags |= MAP_SEALABLE; + + ptr = sys_mmap(NULL, size, PROT_READ, mapflags, -1, 0); + FAIL_TEST_IF_FALSE(ptr != (void *)-1); + + if (seal) { + ret = sys_mseal(ptr, size); + FAIL_TEST_IF_FALSE(!ret); + } + + /* sealing doesn't apply for shared mapping. */ + ret = sys_madvise(ptr, size, MADV_DONTNEED); + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_munmap(ptr, size); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +static void test_seal_discard_ro_anon(bool seal) +{ + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_single_address(ptr, size); + + ret = sys_madvise(ptr, size, MADV_DONTNEED); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + ret = sys_munmap(ptr, size); + if (seal) + FAIL_TEST_IF_FALSE(ret < 0); + else + FAIL_TEST_IF_FALSE(!ret); + + TEST_END_CHECK(); +} + +int main(int argc, char **argv) +{ + bool test_seal = seal_support(); + + ksft_print_header(); + + if (!test_seal) + ksft_exit_skip("sealing not supported, check CONFIG_64BIT\n"); + + if (!pkey_supported()) + ksft_print_msg("PKEY not supported\n"); + + ksft_set_plan(86); + + test_seal_addseal(); + test_seal_unmapped_start(); + test_seal_unmapped_middle(); + test_seal_unmapped_end(); + test_seal_multiple_vmas(); + test_seal_split_start(); + test_seal_split_end(); + test_seal_invalid_input(); + test_seal_zero_length(); + test_seal_twice(); + + test_seal_mprotect(false); + test_seal_mprotect(true); + + test_seal_start_mprotect(false); + test_seal_start_mprotect(true); + + test_seal_end_mprotect(false); + test_seal_end_mprotect(true); + + test_seal_mprotect_unalign_len(false); + test_seal_mprotect_unalign_len(true); + + test_seal_mprotect_unalign_len_variant_2(false); + test_seal_mprotect_unalign_len_variant_2(true); + + test_seal_mprotect_two_vma(false); + test_seal_mprotect_two_vma(true); + + test_seal_mprotect_two_vma_with_split(false); + test_seal_mprotect_two_vma_with_split(true); + + test_seal_mprotect_partial_mprotect(false); + test_seal_mprotect_partial_mprotect(true); + + test_seal_mprotect_two_vma_with_gap(false); + test_seal_mprotect_two_vma_with_gap(true); + + test_seal_mprotect_merge(false); + test_seal_mprotect_merge(true); + + test_seal_mprotect_split(false); + test_seal_mprotect_split(true); + + test_seal_munmap(false); + test_seal_munmap(true); + test_seal_munmap_two_vma(false); + test_seal_munmap_two_vma(true); + test_seal_munmap_vma_with_gap(false); + test_seal_munmap_vma_with_gap(true); + + test_munmap_start_freed(false); + test_munmap_start_freed(true); + test_munmap_middle_freed(false); + test_munmap_middle_freed(true); + test_munmap_end_freed(false); + test_munmap_end_freed(true); + + test_seal_mremap_shrink(false); + test_seal_mremap_shrink(true); + test_seal_mremap_expand(false); + test_seal_mremap_expand(true); + test_seal_mremap_move(false); + test_seal_mremap_move(true); + + test_seal_mremap_shrink_fixed(false); + 
test_seal_mremap_shrink_fixed(true); + test_seal_mremap_expand_fixed(false); + test_seal_mremap_expand_fixed(true); + test_seal_mremap_move_fixed(false); + test_seal_mremap_move_fixed(true); + test_seal_mremap_move_dontunmap(false); + test_seal_mremap_move_dontunmap(true); + test_seal_mremap_move_fixed_zero(false); + test_seal_mremap_move_fixed_zero(true); + test_seal_mremap_move_dontunmap_anyaddr(false); + test_seal_mremap_move_dontunmap_anyaddr(true); + test_seal_discard_ro_anon(false); + test_seal_discard_ro_anon(true); + test_seal_discard_ro_anon_on_rw(false); + test_seal_discard_ro_anon_on_rw(true); + test_seal_discard_ro_anon_on_shared(false); + test_seal_discard_ro_anon_on_shared(true); + test_seal_discard_ro_anon_on_filebacked(false); + test_seal_discard_ro_anon_on_filebacked(true); + test_seal_mmap_overwrite_prot(false); + test_seal_mmap_overwrite_prot(true); + test_seal_mmap_expand(false); + test_seal_mmap_expand(true); + test_seal_mmap_shrink(false); + test_seal_mmap_shrink(true); + + test_seal_mmap_seal(); + test_seal_merge_and_split(); + test_seal_mmap_merge(); + + test_not_sealable(); + test_merge_sealable(); + test_mmap_fixed_change_to_sealable(); + test_mmap_fixed_change_to_not_sealable(); + + test_seal_zero_address(); + + test_seal_discard_ro_anon_on_pkey(false); + test_seal_discard_ro_anon_on_pkey(true); + + ksft_finished(); + return 0; +}
From: Jeff Xu jeffxu@chromium.org
Add documentation for mseal().
Signed-off-by: Jeff Xu jeffxu@chromium.org --- Documentation/userspace-api/index.rst | 1 + Documentation/userspace-api/mseal.rst | 215 ++++++++++++++++++++++++++ 2 files changed, 216 insertions(+) create mode 100644 Documentation/userspace-api/mseal.rst
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst index 09f61bd2ac2e..178f6a1d79cb 100644 --- a/Documentation/userspace-api/index.rst +++ b/Documentation/userspace-api/index.rst @@ -26,6 +26,7 @@ place where this information is gathered. iommu iommufd media/index + mseal netlink/index sysfs-platform_profile vduse diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst new file mode 100644 index 000000000000..6bfac0622178 --- /dev/null +++ b/Documentation/userspace-api/mseal.rst @@ -0,0 +1,215 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Introduction of mseal +===================== + +:Author: Jeff Xu jeffxu@chromium.org + +Modern CPUs support memory permissions such as RW and NX bits. The memory +permission feature improves security stance on memory corruption bugs, i.e. +the attacker can’t just write to arbitrary memory and point the code to it, +the memory has to be marked with X bit, or else an exception will happen. + +Memory sealing additionally protects the mapping itself against +modifications. This is useful to mitigate memory corruption issues where a +corrupted pointer is passed to a memory management system. For example, +such an attacker primitive can break control-flow integrity guarantees +since read-only memory that is supposed to be trusted can become writable +or .text pages can get remapped. Memory sealing can automatically be +applied by the runtime loader to seal .text and .rodata pages and +applications can additionally seal security critical data at runtime. + +A similar feature already exists in the XNU kernel with the +VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2]. + +User API +======== +Two system calls are involved in virtual memory sealing, mseal() and mmap(). + +mseal() +----------- +The mseal() syscall has the following signature: + +``int mseal(void addr, size_t len, unsigned long flags)`` + +**addr/len**: virtual memory address range. + +The address range set by ``addr``/``len`` must meet: + - The start address must be in an allocated VMA. + - The start address must be page aligned. + - The end address (``addr`` + ``len``) must be in an allocated VMA. + - no gap (unallocated memory) between start and end address. + +The ``len`` will be paged aligned implicitly by the kernel. + +**flags**: reserved for future use. + +**return values**: + +- ``0``: Success. + +- ``-EINVAL``: + - Invalid input ``flags``. + - The start address (``addr``) is not page aligned. + - Address range (``addr`` + ``len``) overflow. + +- ``-ENOMEM``: + - The start address (``addr``) is not allocated. + - The end address (``addr`` + ``len``) is not allocated. + - A gap (unallocated memory) between start and end address. + +- ``-EACCES``: + - ``MAP_SEALABLE`` is not set during mmap(). + +- ``-EPERM``: + - sealing is supported only on 64-bit CPUs, 32-bit is not supported. + +- For above error cases, users can expect the given memory range is + unmodified, i.e. no partial update. + +- There might be other internal errors/cases not listed here, e.g. + error during merging/splitting VMAs, or the process reaching the max + number of supported VMAs. In those cases, partial updates to the given + memory range could happen. However, those cases should be rare. 
+ +**Blocked operations after sealing**: + Unmapping, moving to another location, and shrinking the size, + via munmap() and mremap(), can leave an empty space, therefore + can be replaced with a VMA with a new set of attributes. + + Moving or expanding a different VMA into the current location, + via mremap(). + + Modifying a VMA via mmap(MAP_FIXED). + + Size expansion, via mremap(), does not appear to pose any + specific risks to sealed VMAs. It is included anyway because + the use case is unclear. In any case, users can rely on + merging to expand a sealed VMA. + + mprotect() and pkey_mprotect(). + + Some destructive madvice() behaviors (e.g. MADV_DONTNEED) + for anonymous memory, when users don't have write permission to the + memory. Those behaviors can alter region contents by discarding pages, + effectively a memset(0) for anonymous memory. + + Kernel will return -EPERM for blocked operations. + +**Note**: + +- mseal() only works on 64-bit CPUs, not 32-bit CPU. + +- users can call mseal() multiple times, mseal() on an already sealed memory + is a no-action (not error). + +- munseal() is not supported. + +mmap() +---------- +``void *mmap(void* addr, size_t length, int prot, int flags, int fd, +off_t offset);`` + +We add two changes in ``prot`` and ``flags`` of mmap() related to +memory sealing. + +**prot** + +The ``PROT_SEAL`` bit in ``prot`` field of mmap(). + +When present, it marks the memory is sealed since creation. + +This is useful as optimization because it avoids having to make two +system calls: one for mmap() and one for mseal(). + +It's worth noting that even though the sealing is set via the +``prot`` field in mmap(), it can't be set in the ``prot`` +field in later mprotect(). This is unlike the ``PROT_READ``, +``PROT_WRITE``, ``PROT_EXEC`` bits, e.g. if ``PROT_WRITE`` is not set in +mprotect(), it means that the region is not writable. + +Setting ``PROT_SEAL`` implies setting ``MAP_SEALABLE`` below. + +**flags** + +The ``MAP_SEALABLE`` bit in the ``flags`` field of mmap(). + +When present, it marks the map as sealable. A map created +without ``MAP_SEALABLE`` will not support sealing. In other words, +mseal() will fail for such a map. + +Applications that don't care about sealing will expect their +behavior unchanged. For those that need sealing support, opt in +by adding ``MAP_SEALABLE`` in mmap(). + +Use Case: +========= +- glibc: + The dynamic linker, during loading ELF executables, can apply sealing to + non-writable memory segments. + +- Chrome browser: protect some security sensitive data-structures. + +Notes On MAP_SEALABLE +===================== +With the MAP_SEALABLE flag in mmap(), the memory must be mmap() with +MAP_SEALABLE, otherwise mseal() will fail. This raises the bar of +which memory can be sealed. + +Today, in linux, sealing have known side effects if applied in below +two cases: + +- aio/shm + + aio/shm can mmap/munmap on behalf of userspace, e.g. ksys_shmdt() in shm.c. The lifetime of those mapping are not tied to the lifetime of the process. If those memories are sealed from userspace, then unmap will fail, causing leaks in VMA address space during the lifetime of the process. + +- Brk (heap/stack) + + Currently, userspace applications can seal parts of the heap by calling malloc() and mseal(). + let's assume following calls from user space: + + - ptr = malloc(size); + - mprotect(ptr, size, RO); + - mseal(ptr, size); + - free(ptr); + + Technically, before mseal() is added, the user can change the protection of the heap by calling mprotect(RO). 
As long as the user changes the protection back to RW before free(), the memory can be reused. + + Adding mseal() into the picture, however, the heap is then sealed partially, the user can still free it, but the memory remains to be RO. In addition, the result of brk-shrink is nondeterministic, depending on if munmap() will try to free the sealed memory.(brk uses munmap to shrink the heap). + + Given the heap is not marked with MAP_SEALABLE (at the time of this document's writing), this might discourage the inadvertent sealing on the heap. + + It is noteworthy, nonetheless, for mappings that were created without the MAP_SEALABLE flag, a knowledgeable developer who wants to assume ownership of the memory range still has the option of mmap(MAP_FIXED|MAP_SEALABLE), which is equivalent to invoking munmap() and then mmap(MAP_FIXED). Indeed, a "not-allow-sealing" feature is not possible without some level of baseline sealing support and is out-of-scope currently. + + In summary, the considerations for having MAP_SEALABLE are as follows: + +- Grants software owners the ability to incrementally incorporate sealing support for their designated memory ranges, such as brk. +- Raises the bar for which memory can be sealed, and discourages inadvertent sealing. +- Such a decision is reversible. In other words, a sysctl could be implemented to render all memory sealable in the future. However, if all memory were allowed to be sealable from the beginning, reversing that decision would be problematic. + +Additional notes: +================= +As Jann Horn pointed out in [3], there are still a few ways to write +to RO memory, which is, in a way, by design. Those cases are not covered +by mseal(). If applications want to block such cases, sandbox tools (such as +seccomp, LSM, etc) might be considered. + +Those cases are: + +- Write to read-only memory through /proc/self/mem interface. +- Write to read-only memory through ptrace (such as PTRACE_POKETEXT). +- userfaultfd. + +The idea that inspired this patch comes from Stephen Röttger’s work in V8 +CFI [4]. Chrome browser in ChromeOS will be the first user of this API. + +Reference: +========== +[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9... + +[2] https://man.openbsd.org/mimmutable.2 + +[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfU... + +[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgea...
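To make the userspace view above concrete, here is a minimal usage sketch. The PROT_SEAL value mirrors this series and is not in released uapi headers, so treat the constant as a placeholder; the EPERM expectations follow the "Blocked operations after sealing" list above.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef PROT_SEAL
#define PROT_SEAL 0x04000000		/* value used by this series */
#endif

int main(void)
{
	size_t len = getpagesize();

	/* PROT_SEAL seals the mapping at creation time (and implies
	 * MAP_SEALABLE), so no separate mseal() call is needed here.
	 */
	void *p = mmap(NULL, len, PROT_READ | PROT_SEAL,
		       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Both operations are blocked on a sealed VMA and are expected
	 * to fail with EPERM.
	 */
	if (mprotect(p, len, PROT_READ | PROT_WRITE))
		printf("mprotect: %s\n", strerror(errno));
	if (munmap(p, len))
		printf("munmap: %s\n", strerror(errno));

	return 0;
}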
Please add me to the Cc list of these patches.
* jeffxu@chromium.org jeffxu@chromium.org [240131 12:50]:
From: Jeff Xu jeffxu@chromium.org
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type.
... The v8 cut Jonathan's email discussion [1] off and instead of replying there, I'm going to add my question here.
The best plan to ensure it is a general safety measure for all of linux is to work with the community before it lands upstream. It's much harder to change functionality provided to users after it is upstream. I'm happy to hear google is super excited about sharing this, but so far, the community isn't as excited.
It seems Theo has a lot of experience trying to add a feature very close to what you are doing and has real data on how this went [2]. Can we see if there is a solution that is, at least, different enough from what he tried to do for a shot of success? Do we have anyone in the toolchain groups that sees this working well? If this means Stephen needs to do something, can we get that to happen please?
I mean, you specifically state that this is a 'very specific requirement' in your cover letter. Does this mean even other browsers have no use for it?
I am very concerned this feature will land and have to be maintained by the core mm people for the one user it was specifically targeting.
Can we also get some benchmarking on the impact of this feature? I believe my answer in v7 removed the worst offender, but since there is no benchmarking we really are guessing (educated or not, hard data would help). We still have an extra loop in madvise, mprotect_pkey, mremap_to (and the mremap syscall?).
You also did not clean up the loop you copied from mlock, which I pointed out [3]. Stating that your copy/paste is easier to review is not sufficient to keep unneeded assignments around.
[1]. https://lore.kernel.org/linux-mm/87a5ong41h.fsf@meer.lwn.net/ [2]. https://lore.kernel.org/linux-mm/86181.1705962897@cvs.openbsd.org/ [3]. https://lore.kernel.org/linux-mm/20240124200628.ti327diy7arb7byb@revolver/
On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett Liam.Howlett@oracle.com wrote:
Please add me to the Cc list of these patches.
Ok.
- jeffxu@chromium.org jeffxu@chromium.org [240131 12:50]:
From: Jeff Xu jeffxu@chromium.org
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself is against modifications of the selected seal type.
... The v8 cut Jonathan's email discussion [1] off and instead of replying there, I'm going to add my question here.
The best plan to ensure it is a general safety measure for all of linux is to work with the community before it lands upstream. It's much harder to change functionality provided to users after it is upstream. I'm happy to hear google is super excited about sharing this, but so far, the community isn't as excited.
It seems Theo has a lot of experience trying to add a feature very close to what you are doing and has real data on how this went [2]. Can we see if there is a solution that is, at least, different enough from what he tried to do for a shot of success? Do we have anyone in the toolchain groups that sees this working well? If this means Stephen needs to do something, can we get that to happen please?
Regarding Theo's input from OpenBSD's perspective; IIUC, as of today, mseal (Linux) and mimmutable (OpenBSD) have the same scope on what operations to seal, considering the progress made on both sides since the beginning of the RFC:
- mseal (Linux): dropped the "multiple-bit" approach.
- mimmutable (OpenBSD): dropped "downgradable"; added madvise(DONOTNEED).
The differences are in mmap(), i.e.:
- mseal (Linux): support for PROT_SEAL in mmap().
- mseal (Linux): use of MAP_SEALABLE in mmap().
I considered Theo's input from OpenBSD's perspective regarding these differences, and I wasn't convinced that Linux should remove them. In my view, these are two different kernels' codebases, and the differences in Linux were not added without reason (for MAP_SEALABLE, there is a note in the documentation section with details).
I would love to hear more from Linux developers on this.
I mean, you specifically state that this is a 'very specific requirement' in your cover letter. Does this mean even other browsers have no use for it?
No, I don’t mean “other browsers have no use for it”.
About the specific requirements from Chrome, that refers to "The lifetime of those mappings are not tied to the lifetime of the process, which is not the case of libc" in the cover letter. This addition to the cover letter was made in V3; thus, it might be beneficial to provide additional context to help answer the question.
This patch series began with a multiple-bit approach (v1, v2, v3); the rationale was that I was uncertain whether Chrome's specific needs were common enough for other use cases. Consequently, I was unable to make this decision myself without input from the community. To accommodate this, multiple bits were selected initially due to their adaptability.
Since V1, after hearing from the community, Chrome has changed its design (no longer relying on separating out mprotect), and Linus acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs, today mseal() has a simple design that: - meets Chrome's specific needs; - meets libc's needs; - ensures Chrome's specific needs don't interfere with libc's.
[1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0...
I am very concerned this feature will land and have to be maintained by the core mm people for the one user it was specifically targeting.
See above. This feature is not specifically targeting Chrome.
Can we also get some benchmarking on the impact of this feature? I believe my answer in v7 removed the worst offender, but since there is no benchmarking we really are guessing (educated or not, hard data would help). We still have an extra loop in madvise, mprotect_pkey, mremap_to (and the mremap syscall?).
Yes. There is an extra loop in mmap(FIXED), munmap(), madvise(DONOTNEED), and mremap(), to enumerate the VMAs for the given address range. I suspect the impact would be low, but having some hard data would be good. I will see what I can find to assist the perf testing. If you have a specific test suite in mind, I can also try it.
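A rough sketch of what such a pre-check loop could look like on the kernel side (VM_SEALED and can_modify_mm() are illustrative names and assumptions here, not necessarily the identifiers used in this series):

	/*
	 * Walk every VMA overlapping [start, end) and refuse the whole
	 * operation if any of them is sealed. Callers such as munmap()
	 * would return -EPERM when this returns false.
	 */
	static bool can_modify_mm(struct mm_struct *mm, unsigned long start,
				  unsigned long end)
	{
		struct vm_area_struct *vma;
		VMA_ITERATOR(vmi, mm, start);

		for_each_vma_range(vmi, vma, end) {
			if (vma->vm_flags & VM_SEALED)
				return false;
		}
		return true;
	}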
You also did not clean up the loop you copied from mlock, which I pointed out [3]. Stating that your copy/paste is easier to review is not sufficient to keep unneeded assignments around.
OK.
[1]. https://lore.kernel.org/linux-mm/87a5ong41h.fsf@meer.lwn.net/ [2]. https://lore.kernel.org/linux-mm/86181.1705962897@cvs.openbsd.org/ [3]. https://lore.kernel.org/linux-mm/20240124200628.ti327diy7arb7byb@revolver/
Jeff Xu jeffxu@chromium.org wrote:
I considered Theo's inputs from OpenBSD's perspective regarding the difference, and I wasn't convinced that Linux should remove these. In my view, those are two different kernels code, and the difference in Linux is not added without reasons (for MAP_SEALABLE, there is a note in the documentation section with details).
That note is describing a fiction.
I would love to hear more from Linux developers on this.
I'm not sure you are capable of listening.
But I'll repeat for others to stop this train wreck:
1. When execve() maps a program's .data section, does the kernel set MAP_SEALABLE on that region? Or does it not set MAP_SEALABLE?
Does the kernel seal the .data section? It cannot, because of RELRO and IFUNCS. Do you know what those are? Just like in OpenBSD, the kernel cannot and will *not* seal the .data section; it lets later code do that.
2. When execve() maps a program's .bss section, does the kernel set MAP_SEALABLE on that region? Or does it not set MAP_SEALABLE?
Does the kernel seal the .bss section? It cannot, because of RELRO and IFUNCS. Do you know what those are? Just like in OpenBSD, the kernel cannot and will *not* seal the .bss section; it lets later code do that.
In the proposed diff, the kernel does not set MAP_SEALABLE on those regions.
How does a userland program seal the .data and .bss regions?
It cannot. It is too late to set MAP_SEALABLE, because the kernel already decided not to do it.
So those regions cannot be sealed.
3. When execve() maps a program's stack, does the kernel set MAP_SEALABLE on that region? Or does it not set MAP_SEALABLE?
In the proposed diff, the kernel does not set MAP_SEALABLE.
You think you can seal the stack in the kernel?? Sorry to be the bearer of bad news, but glibc has code which on occasion will mprotect the stack executable.
But if userland decides that mprotect case won't occur -- how does a userland program seal its stack? It is now too late to set MAP_SEALABLE.
So the stack must remain unsealed.
4. What about the text segment?
5. Do you know what a text-relocation is? They are now rare, but there are still compile/linker stages which will produce them, and there is software which requires that to work. It means userland fixes its own .text, then calls mprotect. The kernel does not know if this will happen.
6. When execve() maps the .text segment, will it set MAP_SEALABLE?
If it doesn't set it, userland cannot seal its text after it makes the decision to do so.
You can continue to extrapolate those same points for all other segments of a static binary, all segments of a dynamic binary, all segments of the shared library linker.
And then you can go further, and recognize the logic that will be needed in the shared library linker to *make the same decisions*.
In each case, the *decision* to make a mapping happens in one piece of code, and the decision to use and NOW SEAL THAT MAPPING, happens in a different piece of code.
The only answer to these problems will be to always set MAP_SEALABLE. To go through the entire Linux ecosystem, and change every call to mmap() to use this new MAP_SEALABLE flag, and it will look something like this:
+#ifndef MAP_SEALABLE
+#define MAP_SEALABLE 0
+#endif

-	ptr = mmap(...., MAP...
+	ptr = mmap(...., MAP_SEALABLE | MAP...
Every single one of them, and you'll need to do it in the kernel.
If you had spent a second trying to make this work in a second piece of software, you would have realized that the ONLY way this could work is by adding a flag with the opposite meaning:
MAP_NOTSEALABLE
But nothing will use that. I promise you.
I would love to hear more from Linux developers on this.
I'm not sure you are capable of listening.
-----Original Message----- From: Theo de Raadt deraadt@openbsd.org
I would love to hear more from Linux developers on this.
I'm not sure you are capable of listening.
Theo,
It is possible to make your technical points, and even to express frustration that it has been difficult to get them across, without resorting to personal attacks.
-- Tim
I'd like to propose a new flag to the Linux open() system call.
It is
O_DUPABLE
You mix it with other O_* flags to the open call, everyone is familiar with this, it is very easy to use.
If the O_DUPABLE flag is set, the file descriptor may be cloned with dup(), dup2() or similar call. If not set, those calls will return with -1 EPERM.
I know it goes strongly against the grain of ancient assumptions that file descriptors (just like memory) are fully mutable, and therefore managed with care. But in these trying times, we need protection against file descriptor desecration.
It protects programmers from accidentally making clones of file descriptors and leaking them out of programs, like I dunno, runc. OK, besides this one very specific place that could (maybe) use it today, there is other code which can use this but the margin is too narrow to contain.
The documentation can describe the behaviour as similar to MAP_SEALABLE, so that no one is shocked.
/sarc
* Jeff Xu jeffxu@chromium.org [240131 20:27]:
On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett Liam.Howlett@oracle.com wrote:
Please add me to the Cc list of these patches.
Ok.
- jeffxu@chromium.org jeffxu@chromium.org [240131 12:50]:
From: Jeff Xu jeffxu@chromium.org
This patchset proposes a new mseal() syscall for the Linux kernel.
In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits.
Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself is against modifications of the selected seal type.
... The v8 cut Jonathan's email discussion [1] off and instead of replying there, I'm going to add my question here.
The best plan to ensure it is a general safety measure for all of linux is to work with the community before it lands upstream. It's much harder to change functionality provided to users after it is upstream. I'm happy to hear google is super excited about sharing this, but so far, the community isn't as excited.
It seems Theo has a lot of experience trying to add a feature very close to what you are doing and has real data on how this went [2]. Can we see if there is a solution that is, at least, different enough from what he tried to do for a shot of success? Do we have anyone in the toolchain groups that sees this working well? If this means Stephen needs to do something, can we get that to happen please?
For Theo's input from OpenBSD's perspective; IIUC: as today, the mseal-Linux and mimmutable-OpenBSD has the same scope on what operations to seal, e.g. considering the progress made on both sides since the beginning of the RFC:
- mseal(Linux): dropped "multiple-bit" approach.
- mimmutable(OpenBSD): Dropped "downgradable"; Added madvise(DONOTNEED).
The difference is in mmap(), i.e.
- mseal(Linux): support of PROT_SEAL in mmap().
- mseal(Linux): use of MAP_SEALABLE in mmap().
I considered Theo's inputs from OpenBSD's perspective regarding the difference, and I wasn't convinced that Linux should remove these. In my view, those are two different kernels code, and the difference in Linux is not added without reasons (for MAP_SEALABLE, there is a note in the documentation section with details).
I would love to hear more from Linux developers on this.
Linus said it was really important to get the semantics correct, but you took his (unfinished) list and kept going. I think there are some unanswered questions and that's frustrating some people as you may not be valuing the experience they have in this area.
You dropped the RFC from the topic and incremented the version numbering on the patch set. I thought it was customary to restart counting after the RFC was complete? Maybe I'm wrong, but it seemed a bit odd to see that happen. The documentation also implies there are still questions to be answered, so it seems this is still an RFC in some ways?
I'd like to talk about the design some more.
Having to opt-in to allowing mseal will probably not work well.
Initial library mappings happen in one huge chunk, which is then cut up into smaller VMAs; at least that's what I see with my maple tree tracing. If you opt in, then the entire library will have to opt in, and so the 'discourage inadvertent sealing' argument is not very strong.
It also makes tracking the inheritance of the attribute across splitting, MAP_FIXED replacement, vma_move, and vma_copy somewhat messy. I think most of this is forced on the user?
It makes your call less flexible; it means you have to hope that the VMA's origin was blessed before you decide you want to mseal it.
What if you want to ensure the library mapped by a parent or on launch is mseal'ed?
What about the initial relocated VMA (expand/shrink of VMA)?
Creating something as "non-sealable" is pointless. If you don't want it sealed, then don't mseal() that region.
If your use case doesn't need it, then can we please drop the opt-in behaviour and just have all VMAs treated the same?
If it does need it, can you explain why?
The glibc relocation/fixup will then work. glibc could mseal once it is complete - or an application could bypass glibc support and use the feature itself.
If we proceed to remove the MAP_SEALABLE flag from mmap, then we have the heap/stack concerns. We can either let people shoot their own feet off or try to protect them.
Right now, you seem to be trying to protect them. Keeping with that, I guess we could either get the kernel to mark those VMAs or tell some other way? I'd suggest a range, but people do very strange things with these special VMAs [1]. I don't think you can predict enough crazy actions to make a difference in trying to protect people.
There are far fewer VMAs that should not be allowed to be mseal'ed than should be, and the kernel creates those so it seems logical to only let the kernel opt-out on those ones.
I'd rather just let people shoot themselves and return an error.
I also hope it reduces the complexity of this code while increasing the flexibility of the feature. As stated before, we remove the dependency of needing support from the initial loader.
Merging VMAs: I can see this going Very Bad with brk + mseal. But, again, if someone decides to mseal these VMAs then they should expect Bad Things to happen (or maybe they know what they are doing even in some complex situation?)
vma_merge() can also expand a VMA. I think this is okay as it checks for the same flags, so you will allow VMA expansion of two (or three) vma areas to become one. Is this okay in your model?
I mean, you specifically state that this is a 'very specific requirement' in your cover letter. Does this mean even other browsers have no use for it?
No, I don’t mean “other browsers have no use for it”.
About specific requirements from Chrome, that refers to "The lifetime of those mappings are not tied to the lifetime of the process, which is not the case of libc" as in the cover letter. This addition to the cover letter was made in V3, thus, it might be beneficial to provide additional context to help answer the question.
This patch series begins with multiple-bit approaches (v1,v2,v3), the rationale for this is that I am uncertain if Chrome's specific needs are common enough for other use cases. Consequently, I am unable to make this decision myself without input from the community. To accommodate this, multiple bits are selected initially due to their adaptability.
Since V1, after hearing from the community, Chrome has changed its design (no longer relying on separating out mprotect), and Linus acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs, today mseal() has a simple design that:
- meet Chrome's specific needs.
How many VMAs will chrome have that are mseal'ed? Is this a common operation?
PROT_SEAL seems like an extra flag we could drop. I don't expect we'll be sealing enough VMAs that a handful of extra syscalls would make a difference?
- meet Libc's needs.
What needs of libc are you referring to? I'm looking through the version changelog and I guess you mean return EPERM?
- Chrome's specific need doesn't interfere with Libc's.
[1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0...
Linus said he'd be happier if we made the change in general.
I am very concerned this feature will land and have to be maintained by the core mm people for the one user it was specifically targeting.
See above. This feature is not specifically targeting Chrome.
Can we also get some benchmarking on the impact of this feature? I believe my answer in v7 removed the worst offender, but since there is no benchmarking we really are guessing (educated or not, hard data would help). We still have an extra loop in madvise, mprotect_pkey, mremap_to (and the mremap syscall?).
Yes. There is an extra loop in mmap(FIXED), munmap(), madvise(DONOTNEED), and mremap(), to enumerate the VMAs for the given address range. I suspect the impact would be low, but having some hard data would be good. I will see what I can find to assist the perf testing. If you have a specific test suite in mind, I can also try it.
You should look at mmtests [2]. But since you are adding loops across VMA ranges, you need to test loops across several ranges of VMAs. That is, it would be good to see what happens on 1, 3, 6, 12, 24 VMAs, or some subset of small and large numbers to get an idea of complexity we are adding. My hope is that the looping will be cache-hot in the maple tree and have minimum effect.
In my personal testing, I've seen munmap often do a single VMA, or 3, or more rarely 7 on x86_64. There should be some good starting points in mmtests for the common operations.
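In the meantime, a throwaway micro-benchmark along these lines (not a substitute for mmtests, just a sketch) can expose the per-VMA loop cost being discussed: build N unmergeable adjacent VMAs by alternating protections, then time one munmap() across all of them.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            int nvma = argc > 1 ? atoi(argv[1]) : 24;
            long page = sysconf(_SC_PAGESIZE);
            size_t len = (size_t)nvma * page;

            char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (base == MAP_FAILED)
                    return 1;

            /* Alternate protections so neighbouring pages cannot merge. */
            for (int i = 0; i < nvma; i++)
                    mprotect(base + (size_t)i * page, page,
                             i % 2 ? PROT_READ : PROT_READ | PROT_WRITE);

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            munmap(base, len);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            printf("%d VMAs: %ld ns\n", nvma,
                   (t1.tv_sec - t0.tv_sec) * 1000000000L +
                   (t1.tv_nsec - t0.tv_nsec));
            return 0;
    }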
[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/m... [2] https://github.com/gormanm/mmtests
Thanks, Liam
There is another problem with adding PROT_SEAL to the mprotect() call.
What are the precise semantics?
If one reviews how mprotect() behaves, it is quickly clear that it has a very sloppy specification. We spent quite a bit of effort making our manual page as clear as possible about the most it guarantees, in the standard and across the various Unixes:
Not all implementations will guarantee protection on a page basis; the granularity of protection changes may be as large as an entire region. Nor will all implementations guarantee to give exactly the requested permissions; more permissions may be granted than requested by prot. However, if PROT_WRITE was not specified then the page will not be writable.
Anything else is different.
That is the specification in case of PROT_READ, PROT_WRITE, and PROT_EXEC.
What happens if you add additional PROT_* flags?
Does mprotect still behave just as sloppy (as specified)?
Or it now return an error partway through an operation?
When it returns the error, does it skip doing the work on the remaining region?
Or does it skip doing any protection operation at all? (That means the code has to do two passes over the region: the first one checks if it may proceed, the second pass performs the change.) I think I've read that PROT_SEAL was supposed to try to do things in one pass; is that actually possible without requiring a second pass in the kernel?
To wit, do these two sequences have _exactly_ the same behaviour in all cases that we can think of:

- unmapped sub-regions
- sealed sub-regions
- and who knows what else mprotect() may encounter
a)
mprotect(addr, len, PROT_READ); mseal(addr, len, 0);
b)
mprotect(addr, len, PROT_READ | PROT_SEAL);
Are they the same, or are they different?
Here's what I think: mprotect() behaves quite differently if you add the PROT_SEAL flag, but I can't quite tell precisely what happens because I don't understand the Linux VM system well enough.
(As an outsider, I have glanced at the new PROT_MTE flag changes; that one seems to just "set a flag where possible", rather than performing an action which could result in an error, and seems to not have this problem).
As an outsider, Linux development is really strange:
Two sub-features are being pushed very hard, and the primary developer doesn't have code which uses either of them. And once it goes in, it cannot be changed.
It's very different from my world, where the absolutely minimal interface was written to apply to a whole operating system plus 10,000+ applications, and then took months of testing before it was approved for inclusion. And if it was subtly wrong, we would be able to change it.
On Thu, Feb 01, 2024 at 03:24:40PM -0700, Theo de Raadt wrote:
As an outsider, Linux development is really strange:
Two sub-features are being pushed very hard, and the primary developer doesn't have code which uses either of them. And once it goes in, it cannot be changed.
It's very different from my world, where the absolutely minimal interface was written to apply to a whole operating system plus 10,000+ applications, and then took months of testing before it was approved for inclusion. And if it was subtly wrong, we would be able to change it.
No, it's this "feature" submission that is strange to think that we don't need that. We do need, and will require, an actual working userspace something to use it, otherwise as you say, there's no way to actually know if it works properly or not and we can't change it once we accept it.
So along those lines, Jeff, do you have a pointer to the Chrome patches, or glibc patches, that use this new interface that proves that it actually works? Those would be great to see to at least verify it's been tested in a real-world situation and actually works for your use case.
thanks,
greg k-h
On Thu, Feb 1, 2024 at 5:06 PM Greg KH gregkh@linuxfoundation.org wrote:
On Thu, Feb 01, 2024 at 03:24:40PM -0700, Theo de Raadt wrote:
As an outsider, Linux development is really strange:
Two sub-features are being pushed very hard, and the primary developer doesn't have code which uses either of them. And once it goes in, it cannot be changed.
It's very different from my world, where the absolutely minimal interface was written to apply to a whole operating system plus 10,000+ applications, and then took months of testing before it was approved for inclusion. And if it was subtly wrong, we would be able to change it.
No, it's this "feature" submission that is strange to think that we don't need that. We do need, and will require, an actual working userspace something to use it, otherwise as you say, there's no way to actually know if it works properly or not and we can't change it once we accept it.
So along those lines, Jeff, do you have a pointer to the Chrome patches, or glibc patches, that use this new interface that proves that it actually works? Those would be great to see to at least verify it's been tested in a real-world situation and actually works for your use case.
MAP_SEALABLE was raised because of other concerns, not related to libc.
The patch Stephan developed was based on V1 of the patch, IIRC, which is really ancient, and it is not based on MAP_SEALABLE, which is a more recent development entirely from me.
I don't see unresolvable problems with glibc though. E.g. for the ELF case (binfmt_elf.c), there are two places where I need to add MAP_SEALABLE, and then the memory given to user space is marked as sealable. There might be cases where glibc needs to add MAP_SEALABLE when it uses mmap(FIXED) to split the memory.
If the decision on MAP_SEALABLE depends on the glibc case being able to use it, we can develop such a patch, but it will take a while, say a few weeks to months, due to vacation, workload, etc.
Best Regards, -Jeff
thanks,
greg k-h
On Thu, 1 Feb 2024 at 19:24, Jeff Xu jeffxu@chromium.org wrote:
The patch Stephan developed was based on V1 of the patch, IIRC, which is really ancient, and it is not based on MAP_SEALABLE, which is a more recent development entirely from me.
So the problem with this whole patch series from the very beginning was that it was very specialized, and COMPLETELY OVER-ENGINEERED.
It got simpler at one point. And then you started adding these features that have absolutely no reason for them. Again.
It's frustrating. And it's not making it more likely to be ever merged.
Linus
On Thu, Feb 1, 2024 at 7:29 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, 1 Feb 2024 at 19:24, Jeff Xu jeffxu@chromium.org wrote:
The patch Stephan developed was based on V1 of the patch, IIRC, which is really ancient, and it is not based on MAP_SEALABLE, which is a more recent development entirely from me.
So the problem with this whole patch series from the very beginning was that it was very specialized, and COMPLETELY OVER-ENGINEERED.
It got simpler at one point. And then you started adding these features that have absolutely no reason for them. Again.
It's frustrating. And it's not making it more likely to be ever merged.
I'm sorry for over-thinking. Removing MAP_SEALABLE it is, then.
Keep just mseal(addr, len, 0)?
-Jeff
On Thu, Feb 01, 2024 at 07:24:02PM -0800, Jeff Xu wrote:
On Thu, Feb 1, 2024 at 5:06 PM Greg KH gregkh@linuxfoundation.org wrote:
On Thu, Feb 01, 2024 at 03:24:40PM -0700, Theo de Raadt wrote:
As an outsider, Linux development is really strange:
Two sub-features are being pushed very hard, and the primary developer doesn't have code which uses either of them. And once it goes in, it cannot be changed.
It's very different from my world, where the absolutely minimal interface was written to apply to a whole operating system plus 10,000+ applications, and then took months of testing before it was approved for inclusion. And if it was subtly wrong, we would be able to change it.
No, it's this "feature" submission that is strange to think that we don't need that. We do need, and will require, an actual working userspace something to use it, otherwise as you say, there's no way to actually know if it works properly or not and we can't change it once we accept it.
So along those lines, Jeff, do you have a pointer to the Chrome patches, or glibc patches, that use this new interface that proves that it actually works? Those would be great to see to at least verify it's been tested in a real-world situation and actually works for your use case.
The MAP_SEALABLE is raised because of other concerns not related to libc.
The patch Stephan developed was based on V1 of the patch, IIRC, which is really ancient, and it is not based on MAP_SEALABLE, which is a more recent development entirely from me.
I don't see unresolvable problems with glibc though, E.g. For the elf case (binfmt_elf.c), there are two places I need to add MAP_SEALABLE, then the memory to user space is marked with sealable. There might be cases where glibc needs to add MAP_SEALABLE it uses mmap(FIXED) to split the memory.
If the decision on MAP_SEALABLE depends on the glibc case being able to use it, we can develop such a patch, but it will take a while, say a few weeks to months, due to vacation, workload, etc.
There's no rush here, and no deadlines in kernel development. If you don't have a working userspace user for your new feature(s), there is no way we can accept the changes to the kernel (and hint, you don't want us to either...)
good luck!
greg k-h
On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett Liam.Howlett@oracle.com wrote:
I would love to hear more from Linux developers on this.
Linus said it was really important to get the semantics correct, but you took his (unfinished) list and kept going. I think there are some unanswered questions and that's frustrating some people as you may not be valuing the experience they have in this area.
Perhaps you didn't follow the discussions closely during the RFCs, so I'd like to clarify the timeline:
- Dec.12: RFC V3 was out for comments [1]. This version added MAP_SEALABLE and a sealing type in mmap(). The sealing type in mmap() was suggested by Pedro Falcato during V1 [2], and MAP_SEALABLE is new to V3; I added it as an open discussion item in the cover letter.
- Dec.14: Linus made a set of recommendations based on V3 [3]; this is where Linus mentioned the semantics.
Quoted below: "Particularly for new system calls with fairly specialized use, I think it's very important that the semantics are sensible on a conceptual level, and that we do not add system calls that are based on "random implementation issue of the day".
- Jan.4: I sent out V4 of the patch for comments [5]. This version implements all of Linus's recommendations made on V3.
In V3, I didn't receive comments about MAP_SEALABLE, so I kept that as an open discussion item in V4 and specifically mentioned it in the first sentence of the V4 cover letter.
"This is V4 of the patch, the patch has improved significantly since V1, thanks to diverse inputs, a few discussions remain, please read those in the open discussion section of v4 of change history."
- Jan.4: Linus gave a comment on V4: [6]
Quoted below: "Other than that, this seems all reasonable to me now."
To me, this means Linus is OK with the general signatures of the APIs.
- Jan.9: During comments on V5 [7], Kees suggested dropping RFC from subsequent versions, given Linus's general approval of V4.
[1] https://lore.kernel.org/all/80897.1705769947@cvs.openbsd.org/T/#mbf4749d465b...
[2] https://lore.kernel.org/lkml/CAKbZUD2A+=bp_sd+Q0Yif7NJqMu8p__eb4yguq0agEcmLH...
[3] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0...
[4] https://lore.kernel.org/all/CABi2SkUTdF6PHrudHTZZ0oWK-oU+T-5+7Eqnei4yCj2fsW2...
[5] https://lore.kernel.org/lkml/796b6877-0548-4d2a-a484-ba4156104a20@infradead....
[6] https://lore.kernel.org/lkml/CAHk-=wiy0nHG9+3rXzQa=W8gM8F6-MhsHrs_ZqWaHtjmPK...
[7] https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/T/...
You dropped the RFC from the topic and incremented the version numbering on the patch set. I thought it was customary to restart counting after the RFC was complete? Maybe I'm wrong, but it seemed a bit odd to see that happen. The documentation also implies there are still questions to be answered, so it seems this is still an RFC in some ways?
The RFC has been dropped since V6. That said, I'm open to feedback from Linux developers. I will respond to the rest of your email in separate emails.
Best Regards. -Jeff
On Thu, 1 Feb 2024 at 14:54, Theo de Raadt deraadt@openbsd.org wrote:
Linus, you are in for a shock when the proposal doesn't work for glibc and all the applications!
Heh. I've enjoyed seeing your argumentative style that made you so famous back in the days. Maybe it's always been there, but I haven't seen the BSD people in so long that I'd forgotten all about it.
That said, famously argumentative or not, I think Theo is right, and I do think the MAP_SEALABLE bit is nonsensical.
If somebody wants to mseal() a memory region, why would they need to express that ahead of time?
So the part I think is sane is the mseal() system call itself, in that it allows *potential* future expansion of the semantics.
But hopefully said future expansion isn't even needed, and all users want the base experience, which is why I think PROT_SEAL (both to mmap and to mprotect) makes sense as an alternative form.
So yes, to my mind
mprotect(addr, len, PROT_READ); mseal(addr, len, 0);
should basically give identical results to
mprotect(addr, len, PROT_READ | PROT_SEAL);
and using PROT_SEAL at mmap() time is similarly the same obvious notion of "map this, and then seal that mapping".
The reason for having "mseal()" as a separate call at all from the PROT_SEAL bit is that it does allow possible future expansion (while PROT_SEAL is just a single bit, and it won't change semantics) but also so that you can do whatever prep-work in stages if you want to, and then just go "now we seal it all".
Linus
Linus Torvalds torvalds@linux-foundation.org wrote:
So yes, to my mind
mprotect(addr, len, PROT_READ); mseal(addr, len, 0);
should basically give identical results to
mprotect(addr, len, PROT_READ | PROT_SEAL);
and using PROT_SEAL at mmap() time is similarly the same obvious notion of "map this, and then seal that mapping".
I think that isn't easy to do. Let's expand it to show error checking.
	if (mprotect(addr, len, PROT_READ) == -1)
		react to the errno value
	if (mseal(addr, len, 0) == -1)
		react to the errno value
and
	if (mprotect(addr, len, PROT_READ | PROT_SEAL) == -1)
		react to the errno value
For current mprotect(), the errno values are mostly related to range issues with the parameters.
After sealing a region, mprotect() also has the new errno EPERM.
But what is the return value supposed to be from "PROT_READ | PROT_SEAL" over various sub-region types?
Say I have a region 3 pages long. One page is unmapped, one page is regular, and one page is sealed. Re-arrange those 3 pages in all 6 permutations. Try them all.
Does the returned errno change, based upon the order? Does it do part of the operation, or all of the operation?
If the sealed page is first, the regular page is second, and the unmapped page is 3rd, does it return an error or return 0? Does it change the permission on the 3rd page? If it returns an error, has it changed any permissions?
I don't think the diff follows the principle of
if an error is returned --> we know nothing was changed. if success is returned --> we know all the requests were satisfied
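To make the three-page experiment above concrete, here is one sketch of a single permutation (sealed, regular, unmapped); the MAP_SEALABLE and __NR_mseal constants are placeholders as in the earlier sketch, and the other orderings follow the same pattern:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef MAP_SEALABLE
    #define MAP_SEALABLE 0x8000000UL   /* placeholder value */
    #endif
    #ifndef __NR_mseal
    #define __NR_mseal 462             /* placeholder syscall number */
    #endif

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);
            char *base = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_SEALABLE,
                              -1, 0);
            if (base == MAP_FAILED)
                    return 1;

            /* Page 0: sealed.  Page 1: regular.  Page 2: unmapped hole. */
            syscall(__NR_mseal, base, page, 0UL);
            munmap(base + 2 * page, page);

            errno = 0;
            int rc = mprotect(base, 3 * page, PROT_READ);
            printf("mprotect over the mixed range: rc=%d (%s)\n",
                   rc, rc ? strerror(errno) : "ok");

            /* Dump the mappings to see whether the middle (regular) page
             * was downgraded to read-only despite the error. */
            char buf[4096];
            ssize_t n;
            int fd = open("/proc/self/maps", O_RDONLY);
            while ((n = read(fd, buf, sizeof(buf))) > 0)
                    write(1, buf, n);
            close(fd);
            return 0;
    }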
The reason for having "mseal()" as a separate call at all from the PROT_SEAL bit is that it does allow possible future expansion (while PROT_SEAL is just a single bit, and it won't change semantics) but also so that you can do whatever prep-work in stages if you want to, and then just go "now we seal it all".
How about you add a basic mseal() that is maximally compatible with mimmutable(), and then we can all talk about whether PROT_SEAL makes sense once there are applications that demand it, and can prove they need it?
Linus Torvalds torvalds@linux-foundation.org wrote:
and using PROT_SEAL at mmap() time is similarly the same obvious notion of "map this, and then seal that mapping".
The usual way is:
ptr = mmap(NULL, len, PROT_READ|PROT_WRITE, ...)
initialize region between ptr, ptr+len
mprotect(ptr, len, PROT_READ);
mseal(ptr, len, 0);
Our source tree contains one place where locking happens very close to an mmap().
It is the shared-library linker's 'hints file': a file that gets mapped PROT_READ and then locked.
It feels like that could be one operation? It can't be.
	addr = (void *)mmap(0, hsize, PROT_READ, MAP_PRIVATE, hfd, 0);
	if (_dl_mmap_error(addr))
		goto bad_hints;

	hheader = (struct hints_header *)addr;
	if (HH_BADMAG(*hheader) || hheader->hh_ehints > hsize)
		goto bad_hints;

	/* couple more error checks */

	mimmutable(addr, hsize);
	close(hfd);
	return (0);
bad_hints:
	munmap(addr, hsize);
	...
See the problem? It unmaps it if the contents are broken. So even that case cannot use something like "PROT_SEAL".
These are not hypotheticals. I'm grepping an entire Unix kernel and userland source tree, and I know what 100,000+ applications do. I found a piece of code that could almost use it, but upon inspection it can't, and it is obvious why: it is the best idiom to allow a programmer to insert an inspection operation between two distinct operations, and that is especially critical if the 2nd operation cannot be reversed.
No one needs PROT_SEAL as a shortcut operation in mmap() or mprotect().
Throwing around ideas without proving their use in practice is very unscientific.
On Thu, Feb 1, 2024 at 3:15 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, 1 Feb 2024 at 14:54, Theo de Raadt deraadt@openbsd.org wrote:
Linus, you are in for a shock when the proposal doesn't work for glibc and all the applications!
Heh. I've enjoyed seeing your argumentative style that made you so famous back in the days. Maybe it's always been there, but I haven't seen the BSD people in so long that I'd forgotten all about it.
That said, famously argumentative or not, I think Theo is right, and I do think the MAP_SEALABLE bit is nonsensical.
If somebody wants to mseal() a memory region, why would they need to express that ahead of time?
I like to look at things from the point of view of average Linux userspace developers; they might not have the same level of expertise as the other folks on this email list, or they might not have the time and mileage for those details.
To me, the most important thing is to deliver a feature that's easy to use and works well. I don't want users to mess things up, so if I'm the one giving them the tools, I'm going to make sure they have all the information they need and that there are safeguards in place.
E.g. consider the following use case:
1> security-sensitive data is allocated from the heap, using malloc, by software component A, and filled with information.
2> software component B then uses mprotect to change it to RO, and seals it using mseal().
Yes. we could choose to allow it. But there are complications:
1> Is this the right pattern? Why doesn't component A already seal it, if it thinks the data is important?
2> Why the heap? Why not mmap() a new memory mapping for that security-sensitive data?
3> free() will not respect whether the memory is sealed or not. How would a new developer know they should probably never free the sealed memory?
4> brk-shrink will never be able to get past the VMA that gets split out by mseal(); there are memory-footprint implications for the process.
5> What if the security-sensitive data happens to be the first or last VMA of the heap? Will sealing the first/last VMA cause any issue there, since they might carry important VMA flags? (I don't know enough about brk.)
6> If we ever support sealing the heap in its entirety (making it not executable), and still want to support other brk behaviors, such as shrink/grow, would that conflict with the current mseal(), if we allow it on the heap from the beginning?
Questions like that don't have clear answers yet; to me it is premature to already let developers start using mseal() for the heap.
And even if we have all the answers for the heap, how about the stack, or other types of virtual memory?
Again, I don't have enough knowledge to produce a complete list of what shouldn't be sealed; Theo's input is that there is none I should worry about. However, it is clearly not none to me: besides the heap mentioned above, there is also aio/shm.
So MAP_SEALABLE is a conservative approach to limit the scope to the *** two known use cases *** that I want to work on (libc and Chrome), and to give the time needed to answer those questions. It is like a claim: only mappings marked with MAP_SEALABLE support sealing at this point in time.
And MAP_SEALABLE is reversible: e.g. a sysctl could be added to make all memory sealable in the future, or we could obsolete it entirely when the time comes, and an application that already passes MAP_SEALABLE could be treated as a no-op. However, if all memory were allowed to be sealable from the beginning, reversing that decision would be hard.
After those considerations, if MAP_SEALABLE is still not preferred by you, then I have the following options for you to choose from:
1. MAP_NOT_SEALABLE in mmap(), and I will use it for the heap/aio/shm cases. This basically says Linux does not officially support sealing on those; until we support them, we discourage sealing on those mappings.
2. Make MAP_NOT_SEALABLE a kernel-visible-only flag, so application space won't be able to use it.
3. Open it up for all, and list as many details as possible in the documentation. If we choose this route, I would like to have more discussion on the heap/stack; at least the Linux developers will learn from those discussions.
So the part I think is sane is the mseal() system call itself, in that it allows *potential* future expansion of the semantics.
But hopefully said future expansion isn't even needed, and all users want the base experience, which is why I think PROT_SEAL (both to mmap and to mprotect) makes sense as an alternative form.
So yes, to my mind
mprotect(addr, len, PROT_READ); mseal(addr, len, 0);
should basically give identical results to
mprotect(addr, len, PROT_READ | PROT_SEAL);
and using PROT_SEAL at mmap() time is similarly the same obvious notion of "map this, and then seal that mapping".
The reason for having "mseal()" as a separate call at all from the PROT_SEAL bit is that it does allow possible future expansion (while PROT_SEAL is just a single bit, and it won't change semantics) but also so that you can do whatever prep-work in stages if you want to, and then just go "now we seal it all".
To clarify: do you mean to have the following?

	mmap(PROT_READ|PROT_SEAL)
	mseal(addr, len, 0)
	mprotect(addr, len, PROT_READ|PROT_SEAL)
I have to think about the mprotect() case.
For mmap(PROT_READ|PROT_SEAL), I might have a use case already:
fs/binfmt_elf.c:

	if (current->personality & MMAP_PAGE_ZERO) {
		/* Why this, you ask???  Well SVr4 maps page 0 as read-only,
		   and some applications "depend" upon this behavior.
		   Since we do not have the power to recompile these, we
		   emulate the SVr4 behavior. Sigh. */
		error = vm_mmap(NULL, 0, PAGE_SIZE,
				PROT_READ | PROT_EXEC,	<-- add PROT_SEAL
				MAP_FIXED | MAP_PRIVATE, 0);
	}
I don't see the benefit of an RWX page 0, which might make a null-pointer error become executable for some code.
Best Regards, -Jeff
Linus
Jeff Xu jeffxu@google.com wrote:
To me, the most important thing is to deliver a feature that's easy to use and works well. I don't want users to mess things up, so if I'm the one giving them the tools, I'm going to make sure they have all the information they need and that there are safeguards in place.
e.g. considering the following user case: 1> a security sensitive data is allocated from heap, using malloc, from the software component A, and filled with information. 2> software component B then uses mprotect to change it to RO, and seal it using mseal().
	p = malloc(80);
	mprotect(p & ~4095, 4096, PROT_NONE);
	free(p);
Will you save such a developer also? No.
Since the same problem you describe already exists with mprotect() what does mseal() even have to do with your proposal?
What about this?
	p = malloc(80);
	munmap(p & ~4095, 4096);
	free(p);
And since it is not sealed, how about madvise operations on a proper non-malloc memory allocation? Well, the process smashes its own memory. And why is it not sealed? You make it harder to seal memory!
How about this?
	p = malloc(80);
	bzero(p, 100000);
Yes it is a buffer overflow. But this is all the same class of software problem:
Memory belongs to processes, which belongs to the program, which is coded by the programmer, who has to learn to be careful and handle the memory correctly.
mseal() / mimmutable() add *no new expectation* for a careful programmer, because they are expected to only use it on memory that they *promise will never be de-allocated or re-permissioned*.
What you are proposing is not a "mitigation", it entirely cripples the proposed subsystem because you are afraid of it; because you have cloned a memory subsystem primitive you don't fully understand; and this is because you've not seen a complete operating system using it.
When was the last time you developed outside of Chrome?
This is systems programming. The kernel supports all the programs, not just the one holy program from god.
On Thu, Feb 1, 2024 at 8:05 PM Theo de Raadt deraadt@openbsd.org wrote:
Jeff Xu jeffxu@google.com wrote:
To me, the most important thing is to deliver a feature that's easy to use and works well. I don't want users to mess things up, so if I'm the one giving them the tools, I'm going to make sure they have all the information they need and that there are safeguards in place.
e.g. considering the following user case: 1> a security sensitive data is allocated from heap, using malloc, from the software component A, and filled with information. 2> software component B then uses mprotect to change it to RO, and seal it using mseal().
p = malloc(80); mprotect(p & ~4095, 4096, PROT_NONE); free(p);
Will you save such a developer also? No.
Since the same problem you describe already exists with mprotect() what does mseal() even have to do with your proposal?
What about this?
p = malloc(80); munmap(p & ~4095, 4096); free(p);
And since it is not sealed, how about madvise operations on a proper non-malloc memory allocation? Well, the process smashes it's own memory. And why is it not sealed? You make it harder to seal memory!
How about this?
p = malloc(80); bzero(p, 100000;
Yes it is a buffer overflow. But this is all the same class of software problem:
Memory belongs to processes, which belongs to the program, which is coded by the programmer, who has to learn to be careful and handle the memory correctly.
mseal() / mimmutable() add *no new expectation* to a careful programmer, because they expected to only use it on memory that they *promise will never be de-allocated or re-permissioned*.
What you are proposing is not a "mitigation", it entirely cripples the proposed subsystem because you are afraid of it; because you have cloned a memory subsystem primitive you don't fully understand; and this is because you've not seen a complete operating system using it.
When was the last time you developed outside of Chrome?
This is systems programming. The kernel supports all the programs, not just the one holy program from god.
Even without free(): I personally do not like the heap getting sealed like that.
Component A:
	p = malloc(4096);
	/* write something to p */

Component B:
	mprotect(p, 4096, RO)
	mseal(p, 4096)
This will split the heap VMA and prevent the heap from shrinking; if this is in a frequent code path, it might hurt the process's memory usage.
Existing code is more likely to use malloc than mmap(), so it is easy for a dev to seal a piece of data belonging to another component. I hope this pattern does not spread widely.
The ideal way would be to just change library A to use mmap.
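For what it's worth, a small throwaway sketch can make the VMA-split effect visible by dumping /proc/self/maps around the seal. Whether the allocation really comes from the brk heap depends on the allocator, and under this series the mseal() call would fail anyway because the heap is not MAP_SEALABLE, so treat it purely as an illustration (placeholder syscall number as before):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef __NR_mseal
    #define __NR_mseal 462             /* placeholder syscall number */
    #endif

    static void dump_maps(const char *tag)
    {
            char buf[8192];
            ssize_t n;
            int fd = open("/proc/self/maps", O_RDONLY);

            write(1, tag, strlen(tag));
            while ((n = read(fd, buf, sizeof(buf))) > 0)
                    write(1, buf, n);
            close(fd);
    }

    int main(void)
    {
            void *p;
            long page = sysconf(_SC_PAGESIZE);

            /* A page-aligned allocation somewhere in the allocator's arena. */
            if (posix_memalign(&p, page, page))
                    return 1;

            dump_maps("--- before sealing ---\n");
            mprotect(p, page, PROT_READ);
            syscall(__NR_mseal, p, page, 0UL);
            dump_maps("--- after sealing ---\n");

            /* free(p) here would hand a sealed, read-only page back to the
             * allocator -- exactly the disastrous reuse discussed below. */
            return 0;
    }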
Jeff Xu jeffxu@chromium.org wrote:
Even without free. I personally do not like the heap getting sealed like that.
Component A. p=malloc(4096); writing something to p.
Component B: mprotect(p,4096, RO) mseal(p,4096)
This will split the heap VMA, and prevent the heap from shrinking, if this is in a frequent code path, then it might hurt the process's memory usage.
The existing code is more likely to use malloc than mmap(), so it is easier for dev to seal a piece of data belonging to another component. I hope this pattern is not wide-spreading.
The ideal way will be just changing the library A to use mmap.
I think you are lacking some test programs to see how it actually behaves; the effect is worse than you think, and the impact is immediately visible to the programmer, and the lesson is clear:
you can only seal objects which you guarantee never get recycled.
Pushing a sealed object back into reuse is a disastrous bug.
No one should call this interface unless they understand that.
I'll say it again: you don't have a test program for various allocators to understand how it behaves. The failure modes described in your documents are not correct.
On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt deraadt@openbsd.org wrote:
Jeff Xu jeffxu@chromium.org wrote:
Even without free. I personally do not like the heap getting sealed like that.
Component A. p=malloc(4096); writing something to p.
Component B: mprotect(p,4096, RO) mseal(p,4096)
This will split the heap VMA, and prevent the heap from shrinking, if this is in a frequent code path, then it might hurt the process's memory usage.
The existing code is more likely to use malloc than mmap(), so it is easier for dev to seal a piece of data belonging to another component. I hope this pattern is not wide-spreading.
The ideal way will be just changing the library A to use mmap.
I think you are lacking some test programs to see how it actually behaves; the effect is worse than you think, and the impact is immediately visible to the programmer, and the lesson is clear:
you can only seal objects which you guarantee never get recycled. Pushing a sealed object back into reuse is a disastrous bug. No one should call this interface unless they understand that.
I'll say it again: you don't have a test program for various allocators to understand how it behaves. The failure modes described in your documents are not correct.
I understand what you mean: I will add that part to the document. Trying to recycle sealed memory is disastrous, e.g.:

	p = malloc(4096);
	mprotect(p, 4096, RO)
	mseal(p, 4096)
	free(p);
My point is: I think sealing an object from the heap is a bad pattern in general, even if the dev doesn't free it. That was one of the reasons for the sealable flag; I hope saying this isn't perceived as looking for excuses.
On Fri, Feb 2, 2024 at 5:59 PM Jeff Xu jeffxu@chromium.org wrote:
On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt deraadt@openbsd.org wrote:
Jeff Xu jeffxu@chromium.org wrote:
Even without free. I personally do not like the heap getting sealed like that.
Component A. p=malloc(4096); writing something to p.
Component B: mprotect(p,4096, RO) mseal(p,4096)
This will split the heap VMA, and prevent the heap from shrinking, if this is in a frequent code path, then it might hurt the process's memory usage.
The existing code is more likely to use malloc than mmap(), so it is easier for dev to seal a piece of data belonging to another component. I hope this pattern is not wide-spreading.
The ideal way will be just changing the library A to use mmap.
I think you are lacking some test programs to see how it actually behaves; the effect is worse than you think, and the impact is immediately visible to the programmer, and the lesson is clear:
you can only seal objects which you gaurantee never get recycled. Pushing a sealed object back into reuse is a disasterous bug. Noone should call this interface, unless they understand that.
I'll say again, you don't have a test program for various allocators to understand how it behaves. The failure modes described in your docuemnts are not correct.
I understand what you mean: I will add that part to the document: Try to recycle a sealed memory is disastrous, e.g. p=malloc(4096); mprotect(p,4096,RO) mseal(p,4096) free(p);
My point is: I think sealing an object from the heap is a bad pattern in general, even dev doesn't free it. That was one of the reasons for the sealable flag, I hope saying this doesn't be perceived as looking for excuses.
The point you're missing is that adding MAP_SEALABLE reduces composability. With MAP_SEALABLE, everything that mmaps some part of the address space that may ever be sealed will need to be modified to know about MAP_SEALABLE.
Say you did the same thing for mprotect. MAP_PROTECT would control the mprotectability of the map. You'd stop:
	p = malloc(4096);
	mprotect(p, 4096, PROT_READ);
	free(p);
! But you'd need to change every spot that mmap()'s something to know about and use MAP_PROTECT: all "producers" of mmap memory would need to know about the consumers doing mprotect(). So now either all mmap() callers mindlessly add MAP_PROTECT out of fear the consumers do mprotect (and you gain nothing from MAP_PROTECT), or the mmap() callers need to know the consumers call mprotect(), and thus you introduce a huge layering violation (and you actually lose from having MAP_PROTECT).
Hopefully you can map the above to MAP_SEALABLE. Or to any other m*() operation. For example, if chrome runs on an older glibc that does not know about MAP_SEALABLE, it will not be able to mseal() its own shared libraries' .text (even if, yes, that should ideally be left to ld.so).
IMO, UNIX API design has historically mostly been "play stupid games, win stupid prizes", which is e.g: why things like close(STDOUT_FILENO) work. If you close stdout (and don't dup/reopen something to stdout) and printf(), things will break, and you get to keep both pieces. There's no O_CLOSEABLE, just as there's no O_DUPABLE.
On Fri, Feb 2, 2024 at 10:52 AM Pedro Falcato pedro.falcato@gmail.com wrote:
On Fri, Feb 2, 2024 at 5:59 PM Jeff Xu jeffxu@chromium.org wrote:
On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt deraadt@openbsd.org wrote:
Jeff Xu jeffxu@chromium.org wrote:
Even without free. I personally do not like the heap getting sealed like that.
Component A. p=malloc(4096); writing something to p.
Component B: mprotect(p,4096, RO) mseal(p,4096)
This will split the heap VMA, and prevent the heap from shrinking, if this is in a frequent code path, then it might hurt the process's memory usage.
The existing code is more likely to use malloc than mmap(), so it is easier for dev to seal a piece of data belonging to another component. I hope this pattern is not wide-spreading.
The ideal way will be just changing the library A to use mmap.
I think you are lacking some test programs to see how it actually behaves; the effect is worse than you think, and the impact is immediately visible to the programmer, and the lesson is clear:
you can only seal objects which you gaurantee never get recycled. Pushing a sealed object back into reuse is a disasterous bug. Noone should call this interface, unless they understand that.
I'll say again, you don't have a test program for various allocators to understand how it behaves. The failure modes described in your docuemnts are not correct.
I understand what you mean: I will add that part to the document: Try to recycle a sealed memory is disastrous, e.g. p=malloc(4096); mprotect(p,4096,RO) mseal(p,4096) free(p);
My point is: I think sealing an object from the heap is a bad pattern in general, even dev doesn't free it. That was one of the reasons for the sealable flag, I hope saying this doesn't be perceived as looking for excuses.
The point you're missing is that adding MAP_SEALABLE reduces composability. With MAP_SEALABLE, everything that mmaps some part of the address space that may ever be sealed will need to be modified to know about MAP_SEALABLE.
Say you did the same thing for mprotect. MAP_PROTECT would control the mprotectability of the map. You'd stop:
p = malloc(4096); mprotect(p, 4096, PROT_READ); free(p);
! But you'd need to change every spot that mmap()'s something to know about and use MAP_PROTECT: all "producers" of mmap memory would need to know about the consumers doing mprotect(). So now either all mmap() callers mindlessly add MAP_PROTECT out of fear the consumers do mprotect (and you gain nothing from MAP_PROTECT), or the mmap() callers need to know the consumers call mprotect(), and thus you introduce a huge layering violation (and you actually lose from having MAP_PROTECT).
Hopefully you can map the above to MAP_SEALABLE. Or to any other m*() operation. For example, if chrome runs on an older glibc that does not know about MAP_SEALABLE, it will not be able to mseal() its own shared libraries' .text (even if, yes, that should ideally be left to ld.so).
I think I have heard enough complaints about MAP_SEALABLE from Linux developers and Linus in the last two days to convince myself that it is a bad idea :)
For the last time: I was trying to limit the scope of mseal() to two known cases. And MAP_SEALABLE is a reversible decision: a sysctl can turn it off, or we can obsolete it in the future (this was mentioned in the documentation of V8).
I will rest my case. Obviously from the feedback, it is loud and clear that we want to be able to seal all the memory.
IMO, UNIX API design has historically mostly been "play stupid games, win stupid prizes", which is e.g: why things like close(STDOUT_FILENO) work. If you close stdout (and don't dup/reopen something to stdout) and printf(), things will break, and you get to keep both pieces. There's no O_CLOSEABLE, just as there's no O_DUPABLE.
-- Pedro
...
IMO, UNIX API design has historically mostly been "play stupid games, win stupid prizes", which is e.g: why things like close(STDOUT_FILENO) work. If you close stdout (and don't dup/reopen something to stdout) and printf(), things will break, and you get to keep both pieces.
That is pretty much why libraries must never use printf(). (Try telling that to people at work!)
In the days when processes could only have 20 files open it was a much bigger problem. You couldn't afford to not use 0, 1 and 2. A certain daemon ended up using fd 1 as a pipe to another daemon. Someone accidentally used printf() instead of fprintf() for a trace. When the 10k stdio buffer filled, the text got written to the pipe. The expected fixed-size message had a 32-bit 'trailer' size. Although no defined messages supported trailers, the second daemon synchronously discarded the trailer - with the expected side effect.
Wasn't my bug, and someone else found it, but I'd read the broken code a few times without seeing the fubar.
Trouble is it all worked for quite a long time...
David
Another interaction to consider is sigaltstack().
In OpenBSD, sigaltstack() forces MAP_STACK onto the specified (pre-allocated) region, because on kernel-entry we require the "sp" register to point to a MAP_STACK region (this severely damages ROP pivot methods). Linux does not have MAP_STACK enforcement (yet), but one day someone may try to do that work.
This interacted poorly with mimmutable() because some applications allocate the memory being provided poorly. I won't get into the details unless pushed, because what we found makes me upset. Over the years, we've upstreamed diffs to applications to resolve all the nasty allocation patterns. I think the software ecosystem is now mostly clean.
I suggest someone in Linux look into whether sigaltstack() is a mseal() bypass, perhaps somewhat similar to madvise MADV_FREE, and consider the correct strategy.
This is our documented strategy:
On OpenBSD some additional restrictions prevent dangerous address space modifications. The proposed space at ss_sp is verified to be contiguously mapped for read-write permissions (no execute) and incapable of syscall entry (see msyscall(2)). If those conditions are met, a page-aligned inner region will be freshly mapped (all zero) with MAP_STACK (see mmap(2)), destroying the pre-existing data in the region. Once the sigaltstack is disabled, the MAP_STACK attribute remains on the memory, so it is best to deallocate the memory via a method that results in munmap(2).
OK, I better provide the details of what people were doing. sigaltstacks() in .data, in .bss, using malloc(), on a buffer on the stack, we even found one creating a sigaltstack inside a buffer on a pthread stack. We told everyone to use mmap() and munmap(), with MAP_STACK if #ifdef MAP_STACK finds a definition.
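For reference, a minimal sketch of the allocation pattern being recommended instead (not from the patchset; MAP_STACK is enforced on OpenBSD and accepted as a hint on Linux, hence the #ifdef):

#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void setup_altstack(void)
{
        int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_STACK
        flags |= MAP_STACK;          /* tag the region as stack where supported */
#endif
        void *sp = mmap(NULL, SIGSTKSZ, PROT_READ | PROT_WRITE, flags, -1, 0);
        if (sp == MAP_FAILED)
                abort();

        stack_t ss = { .ss_sp = sp, .ss_size = SIGSTKSZ, .ss_flags = 0 };
        if (sigaltstack(&ss, NULL) == -1)
                abort();
        /* Tear down later with munmap(), not free(); the region is never
         * part of .data, .bss, the heap, or another thread's stack. */
}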
On Fri, Feb 2, 2024 at 9:05 AM Theo de Raadt deraadt@openbsd.org wrote:
Another interaction to consider is sigaltstack().
In OpenBSD, sigaltstack() forces MAP_STACK onto the specified (pre-allocated) region, because on kernel-entry we require the "sp" register to point to a MAP_STACK region (this severely damages ROP pivot methods). Linux does not have MAP_STACK enforcement (yet), but one day someone may try to do that work.
This interacted poorly with mimmutable() because some applications allocate the memory being provided poorly. I won't get into the details unless pushed, because what we found makes me upset. Over the years, we've upstreamed diffs to applications to resolve all the nasty allocation patterns. I think the software ecosystem is now mostly clean.
I suggest someone in Linux look into whether sigaltstack() is a mseal() bypass, perhaps somewhat similar to madvise MADV_FREE, and consider the correct strategy.
Thanks for bringing this up. I will follow up on sigaltstack() in Linux.
This is our documented strategy:
On OpenBSD some additional restrictions prevent dangerous address space modifications. The proposed space at ss_sp is verified to be contiguously mapped for read-write permissions (no execute) and incapable of syscall entry (see msyscall(2)). If those conditions are met, a page-aligned inner region will be freshly mapped (all zero) with MAP_STACK (see mmap(2)), destroying the pre-existing data in the region. Once the sigaltstack is disabled, the MAP_STACK attribute remains on the memory, so it is best to deallocate the memory via a method that results in munmap(2).
OK, I better provide the details of what people were doing. sigaltstacks() in .data, in .bss, using malloc(), on a buffer on the stack, we even found one creating a sigaltstack inside a buffer on a pthread stack. We told everyone to use mmap() and munmap(), with MAP_STACK if #ifdef MAP_STACK finds a definition.
On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett Liam.Howlett@oracle.com wrote:
- Jeff Xu jeffxu@chromium.org [240131 20:27]:
On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett Liam.Howlett@oracle.com wrote:
Having to opt-in to allowing mseal will probably not work well.
I'm leaving the opt-in discussion in Linus's thread.
Initial library mappings happen in one huge chunk then it's cut up into smaller VMAs, at least that's what I see with my maple tree tracing. If you opt-in, then the entire library will have to opt-in and so the 'discourage inadvertent sealing' argument is not very strong.
Regarding "The initial library mappings happen in one huge chunk then it is cut up into smaller VMAS", this is not a problem.
As example of elf loading (fs/binfmt_elf.c), there is just a few places to pass in what type of memory to be allocated, e.g. MAP_PRIVATE, MAP_FIXED_NOREPLACE, we can add MAP_SEALABLE at those places. If glic does additional splitting on the memory range, by using mprotect(), then the MAP_SEALABLE is automatically applied after splitting. If glic uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).
It also makes a somewhat messy tracking of inheritance of the attribute across splitting, MAP_FIXED replacement, vma_move, vma_copy. I think most of this is forced on the user?
The inheritance is the same as other VMA flags.
It makes your call less flexible, it means you have to hope that the VMA origin was blessed before you decide you want to mseal it.
What if you want to ensure the library mapped by a parent or on launch is mseal'ed?
What about the initial relocated VMA (expand/shrink of VMA)?
Creating something as "non-sealable" is pointless. If you don't want it sealed, then don't mseal() that region.
If your use case doesn't need it, then can we please drop the opt-in behaviour and just have all VMAs treated the same?
If it does need it, can you explain why?
The glibc relocation/fixup will then work. glibc could mseal once it is complete - or an application could bypass glibc support and use the feature itself.
Yes. That is the idea.
If we proceed to remove the MAP_SEALABLE flag from mmap(), then we have the heap/stack concerns. We can either let people shoot their own feet off or try to protect them.
Right now, you seem to be trying to protect them. Keeping with that, I guess we could either get the kernel to mark those VMAs or tell some other way? I'd suggest a range, but people do very strange things with these special VMAs [1]. I don't think you can predict enough crazy actions to make a difference in trying to protect people.
There are far fewer VMAs that should not be allowed to be mseal'ed than should be, and the kernel creates those so it seems logical to only let the kernel opt-out on those ones.
I'd rather just let people shoot themselves and return an error.
I also hope it reduces the complexity of this code while increasing the flexibility of the feature. As stated before, we remove the dependency of needing support from the initial loader.
Merging VMAs I can see this going Very Bad with brk + mseal. But, again, if someone decides to mseal these VMAs then they should expect Bad Things to happen (or maybe they know what they are doing even in some complex situation?)
vma_merge() can also expand a VMA. I think this is okay as it checks for the same flags, so you will allow VMA expansion of two (or three) vma areas to become one. Is this okay in your model?
I mean, you specifically state that this is a 'very specific requirement' in your cover letter. Does this mean even other browsers have no use for it?
No, I don’t mean “other browsers have no use for it”.
Regarding Chrome's specific requirements, that refers to "the lifetime of those mappings is not tied to the lifetime of the process, which is not the case for libc" in the cover letter. That addition to the cover letter was made in V3, so some additional context might help answer the question.
This patch series began with a multiple-bit approach (v1, v2, v3); the rationale is that I was uncertain whether Chrome's specific needs are common enough for other use cases. Consequently, I could not make this decision myself without input from the community. To accommodate this, multiple bits were selected initially because of their adaptability.
Since V1, after hearing from the community, Chrome has changed its design (no longer relying on separating out mprotect), and Linus acknowledged the defect of madvise(MADV_DONTNEED) [1]. With those inputs, today mseal() has a simple design that:
- meet Chrome's specific needs.
How many VMAs will chrome have that are mseal'ed? Is this a common operation?
PROT_SEAL seems like an extra flag we could drop. I don't expect we'll be sealing enough VMAs that a handful of extra syscalls would make a difference?
- meet Libc's needs.
What needs of libc are you referring to? I'm looking through the version changelog and I guess you mean return EPERM?
I meant libc sealing the RO parts of the ELF binary; that memory's lifetime is tied to the lifetime of the process.
- Chrome's specific need doesn't interfere with Libc's.
[1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0...
Linus said he'd be happier if we made the change in general.
I am very concerned this feature will land and have to be maintained by the core mm people for the one user it was specifically targeting.
See above. This feature is not specifically targeting Chrome.
Can we also get some benchmarking on the impact of this feature? I believe my answer in v7 removed the worst offender, but since there is no benchmarking we really are guessing (educated or not, hard data would help). We still have an extra loop in madvise, mprotect_pkey, mremap_to (and the mremap syscall?).
Yes. There is an extra loop in mmap(FIXED), munmap(), madvise(MADV_DONTNEED) and mremap() to walk the VMAs for the given address range. I suspect the impact would be low, but having some hard data would be good. I will see what I can find to assist the perf testing. If you have a specific test suite in mind, I can also try it.
You should look at mmtests [2]. But since you are adding loops across VMA ranges, you need to test loops across several ranges of VMAs. That is, it would be good to see what happens on 1, 3, 6, 12, 24 VMAs, or some subset of small and large numbers to get an idea of complexity we are adding. My hope is that the looping will be cache-hot in the maple tree and have minimum effect.
In my personal testing, I've seen munmap often do a single VMA, or 3, or more rare 7 on x86_64. There should be some good starting points in mmtests for the common operations.
Thanks. Will do.
[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/m... [2] https://github.com/gormanm/mmtests
Thanks, Liam
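A sketch of the kind of micro-benchmark being discussed (hedged: the shape and the numbers are illustrative, not from the patchset). It builds N adjacent VMAs by alternating protections so they cannot merge, then times a call that has to walk all of them; madvise(MADV_DONTNEED) is used because it does not change protections, so the VMAs stay split between iterations.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        long page = sysconf(_SC_PAGESIZE);
        int nvma = argc > 1 ? atoi(argv[1]) : 8;      /* e.g. 1, 3, 6, 12, 24 */
        int iters = 100000;
        size_t len = (size_t)nvma * page;

        char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
                return 1;

        /* Alternate protections page by page so every page is its own VMA. */
        for (int i = 0; i < nvma; i++)
                mprotect(base + (size_t)i * page, page,
                         i % 2 ? PROT_READ : PROT_READ | PROT_WRITE);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
                madvise(base, len, MADV_DONTNEED);    /* walks all nvma VMAs */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%d VMAs: %.0f ns per madvise(MADV_DONTNEED)\n", nvma, ns / iters);
        return 0;
}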
* Jeff Xu jeffxu@google.com [240201 22:15]:
On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett Liam.Howlett@oracle.com wrote:
- Jeff Xu jeffxu@chromium.org [240131 20:27]:
On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett Liam.Howlett@oracle.com wrote:
Having to opt-in to allowing mseal will probably not work well.
I'm leaving the opt-in discussion in Linus's thread.
Initial library mappings happen in one huge chunk then it's cut up into smaller VMAs, at least that's what I see with my maple tree tracing. If you opt-in, then the entire library will have to opt-in and so the 'discourage inadvertent sealing' argument is not very strong.
Regarding "The initial library mappings happen in one huge chunk then it is cut up into smaller VMAS", this is not a problem.
As example of elf loading (fs/binfmt_elf.c), there is just a few places to pass in what type of memory to be allocated, e.g. MAP_PRIVATE, MAP_FIXED_NOREPLACE, we can add MAP_SEALABLE at those places. If glic does additional splitting on the memory range, by using mprotect(), then the MAP_SEALABLE is automatically applied after splitting. If glic uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).
You are adding a flag that requires a new glibc. When I try to point out how this is unnecessary and excessive, you tell me it's fine and probably not a whole lot of work.
This isn't working with developers, you are dismissing the developers who are trying to help you.
Can you please:
Provide code that uses this feature.
Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and 32 VMAs.
Provide code that tests and checks the failure paths. Failures at the start, middle, and end of the modifications.
Document what happens in those failure paths.
And, most importantly: keep an open mind and allow your opinion to change when presented with new information.
All of these things are to help you. We need to know what needs fixing so you can be successful.
Thanks, Liam
On Fri, Feb 2, 2024 at 7:13 AM Liam R. Howlett Liam.Howlett@oracle.com wrote:
- Jeff Xu jeffxu@google.com [240201 22:15]:
On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett Liam.Howlett@oracle.com wrote:
- Jeff Xu jeffxu@chromium.org [240131 20:27]:
On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett Liam.Howlett@oracle.com wrote:
Having to opt-in to allowing mseal will probably not work well.
I'm leaving the opt-in discussion in Linus's thread.
Initial library mappings happen in one huge chunk then it's cut up into smaller VMAs, at least that's what I see with my maple tree tracing. If you opt-in, then the entire library will have to opt-in and so the 'discourage inadvertent sealing' argument is not very strong.
Regarding "The initial library mappings happen in one huge chunk then it is cut up into smaller VMAS", this is not a problem.
As example of elf loading (fs/binfmt_elf.c), there is just a few places to pass in what type of memory to be allocated, e.g. MAP_PRIVATE, MAP_FIXED_NOREPLACE, we can add MAP_SEALABLE at those places. If glic does additional splitting on the memory range, by using mprotect(), then the MAP_SEALABLE is automatically applied after splitting. If glic uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).
You are adding a flag that requires a new glibc. When I try to point out how this is unnecessary and excessive, you tell me it's fine and probably not a whole lot of work.
This isn't working with developers, you are dismissing the developers who are trying to help you.
Can you please:
Provide code that uses this feature.
Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and 32 VMAs.
I will prepare for the benchmark tests.
Provide code that tests and checks the failure paths. Failures at the start, middle, and end of the modifications.
Regarding, "Failures at the start, middle, and end of the modifications."
With the current implementation, e.g. it checks if the sealing is applied before actual modification of VMAs, so partial modifications are avoided in mprotect, mremap, munmap.
There are test cases in the selftests to cover the failure path, including the beginning, middle and end of VMAs. test_seal_unmapped_start test_seal_unmapped_middle test_seal_unmapped_end test_seal_invalid_input test_seal_start_mprotect test_seal_end_mprotect etc.
Are those what you are looking for ?
Document what happens in those failure paths.
And, most importantly: keep an open mind and allow your opinion to change when presented with new information.
All of these things are to help you. We need to know what needs fixing so you can be successful.
Thanks for the feedback.
I sincerely hope for more of this kind of help so that this syscall can be useful.
Thanks. Best Regards, -Jeff
Thanks, Liam
* Jeff Xu jeffxu@chromium.org [240202 12:24]:
...
Provide code that uses this feature.
Please do this too :)
Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and 32 VMAs.
I will prepare for the benchmark tests.
Thank you, please also include runs of calls that you are modifying for checking for mseal() as we are adding loops there.
Provide code that tests and checks the failure paths. Failures at the start, middle, and end of the modifications.
Regarding, "Failures at the start, middle, and end of the modifications."
With the current implementation, e.g. it checks if the sealing is applied before actual modification of VMAs, so partial modifications are avoided in mprotect, mremap, munmap.
There are test cases in the selftests to cover the failure path, including the beginning, middle and end of VMAs. test_seal_unmapped_start test_seal_unmapped_middle test_seal_unmapped_end test_seal_invalid_input test_seal_start_mprotect test_seal_end_mprotect etc.
Are those what you are looking for ?
Those are certainly good, but we need more checking in there. You have a seal_split test that splits the vma by mseal but you don't check the flags on the VMAs.
What I'm more concerned about is what happens if you call mseal() on a range and it can mseal a portion. Like, what happens to the first vma in your test_seal_unmapped_middle case? I see it returns an error, but is the first VMA mseal()'ed? (no it's not, but test that)
What about the other system calls that will be denied on an mseal() VMA? Do they still behave the same? do_mprotect_pkey() will break out of the loop on the first error it sees - but it has modified some VMAs up to that point, I believe? You have changed this to abort before anything is modified. This is probably acceptable because it won't affect existing applications unless they start using mseal(), but that's just my opinion.
It would be good to state the change in behaviour because it is changing the fundamental model of changing mprotect/madvise until an issue is hit. I think you are covering this by "it blocks X" but it's doing more than, say, a flag verification. One could reasonably assume this is just another flag verification.
Document what happens in those failure paths.
I'd like to know how this affects other system calls in the partial success cases/return error cases. Some will now return new error codes and some may change the behaviour.
It may even be okay to allow munmap() to split VMAs at the start/end of the region and fail to munmap because some VMA in the middle is mseal()'ed - but maybe not? I haven't put a whole lot of thought into it.
Thanks, Liam
What I'm more concerned about is what happens if you call mseal() on a range and it can mseal a portion. Like, what happens to the first vma in your test_seal_unmapped_middle case? I see it returns an error, but is the first VMA mseal()'ed? (no it's not, but test that)
That is correct, Liam.
Unix system calls must be atomic.
They either return an error, and that is a promise they made no changes.
Or they do the work required, and then return success.
In OpenBSD, all mimmutable() aspects were carefully studied to guarantee this behaviour.
I am not expert enough in the Linux kernel to make that assessment; someone who is qualified must make it. Fuzzing with tests is a good, simple way to judge it.
On Fri, 2 Feb 2024 at 11:32, Theo de Raadt deraadt@openbsd.org wrote:
Unix system calls must be atomic.
They either return an error, and that is a promise they made no changes.
That's actually not true, and never has been.
It's a good thing to aim for, but several errors means "some or all may have been done".
EFAULT (for various system calls), ENOMEM and other errors are all things that can happen after some of the system call has already been done, and the rest failed.
There are lots of examples, but to pick one obvious VM example, something like mlock() may well return an error after the area has been successfully locked, but then the population of said pages failed for some reason.
Of course, implementations can differ, and POSIX sometimes has insane language that is actively incorrect.
Furthermore, the definition of "atomic" is unclear. For example, POSIX claims that a "write()" system call is one atomic thing for regular files, and some people think that means that you see all or nothing. That's simply not true, and you'll see the write progress in various indirect ways (look at intermediate file size with 'stat', look at intermediate contents with 'mmap' etc etc).
So I agree that atomicity is something that people should always *strive* for, but it's not some kind of final truth or absolute requirement.
In the specific case of mseal(), I suspect there are very few reasons ever *not* to be atomic, so in this particular context atomicity is likely always something that should be guaranteed. But I just wanted to point out that it's most definitely not a black-and-white issue in the general case.
Linus
On Fri, Feb 2, 2024 at 12:37 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Fri, 2 Feb 2024 at 11:32, Theo de Raadt deraadt@openbsd.org wrote:
Unix system calls must be atomic.
They either return an error, and that is a promise they made no changes.
That's actually not true, and never has been.
It's a good thing to aim for, but several errors means "some or all may have been done".
EFAULT (for various system calls), ENOMEM and other errors are all things that can happen after some of the system call has already been done, and the rest failed.
There are lots of examples, but to pick one obvious VM example, something like mlock() may well return an error after the area has been successfully locked, but then the population of said pages failed for some reason.
Of course, implementations can differ, and POSIX sometimes has insane language that is actively incorrect.
Furthermore, the definition of "atomic" is unclear. For example, POSIX claims that a "write()" system call is one atomic thing for regular files, and some people think that means that you see all or nothing. That's simply not true, and you'll see the write progress in various indirect ways (look at intermediate file size with 'stat', look at intermediate contents with 'mmap' etc etc).
So I agree that atomicity is something that people should always *strive* for, but it's not some kind of final truth or absolute requirement.
In the specific case of mseal(), I suspect there are very few reasons ever *not* to be atomic, so in this particular context atomicity is likely always something that should be guaranteed. But I just wanted to point out that it's most definitely not a black-and-white issue in the general case.
Thanks. At least I got this part done right for mseal() :-)
-Jeff
Linus
* Linus Torvalds torvalds@linux-foundation.org [240202 15:37]:
On Fri, 2 Feb 2024 at 11:32, Theo de Raadt deraadt@openbsd.org wrote:
Unix system calls must be atomic.
They either return an error, and that is a promise they made no changes.
That's actually not true, and never has been.
...
In the specific case of mseal(), I suspect there are very few reasons ever *not* to be atomic, so in this particular context atomicity is likely always something that should be guaranteed. But I just wanted to point out that it's most definitely not a black-and-white issue in the general case.
There will be a larger performance cost to checking up front without allowing the partial completion. I don't expect these to be high, but it's something to keep in mind if we are okay with the flexibility and less atomic operation.
Thanks, Liam
On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett Liam.Howlett@oracle.com wrote:
There will be a larger performance cost to checking up front without allowing the partial completion.
I suspect that for mseal(), the only half-way common case will be sealing an area that is entirely contained within one vma.
So the cost will be the vma splitting (if it's not the whole vma), and very unlikely to be any kind of "walk the vma's to check that they can all be sealed" loop up-front.
We'll see, but that's my gut feel, at least.
Linus
* Linus Torvalds torvalds@linux-foundation.org [240202 18:36]:
On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett Liam.Howlett@oracle.com wrote:
There will be a larger performance cost to checking up front without allowing the partial completion.
I suspect that for mseal(), the only half-way common case will be sealing an area that is entirely contained within one vma.
Agreed.
So the cost will be the vma splitting (if it's not the whole vma), and very unlikely to be any kind of "walk the vma's to check that they can all be sealed" loop up-front.
That's the cost of calling mseal(), and I think that will be totally reasonable.
I'm more concerned with the other calls that do affect more than one vma that will now have to ensure there is not an mseal'ed vma among the affected area.
As you pointed out, we don't do atomic updates and so we have to add a loop at the beginning to check this new special case, which is what this patch set does today. That means we're going to be looping through twice for any call that could fail if one is mseal'ed. This includes munmap() and mprotect().
The impact will vary based on how many vma's are handled. I'd like some numbers on this so we can see if it is a concern, which Jeff has agreed to provide in the future - Thank you, Jeff.
It also means we're modifying the behaviour of those calls so they could fail before anything changes (regardless of where the failure would occur), and we could still fail later when another aspect of a vma would cause a failure as we do today. We are paying the price for a more atomic update, but we aren't trying very hard to be atomic with our updates - we don't have many (virtually no) vma checks before modifications start.
For instance, we could move the mprotect check for map_deny_write_exec() to the pre-update loop to make it more atomic in nature. This one seems somewhat related to mseal, so it would be better if they were both checked atomic(ish) together. Although, I wonder if the user visible changes would be acceptable and worth the risk.
We will have two classes of updates to vma's: the more atomic view and the legacy view. The question of what happens when the two mix, or where a specific check should go will get (more) confusing.
Thanks, Liam
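For illustration only, a heavily simplified sketch of the "check up front, then modify" shape being described; this is not the patch's actual code, the VM_SEALED flag name and the helper are assumptions, and only the maple-tree iterator macros are existing kernel interfaces:

#include <linux/mm.h>

/* Illustrative only: reject the whole operation before any VMA is touched
 * if the affected range contains a sealed VMA; callers then fall through
 * to the existing (non-atomic) modification loop. */
static bool range_is_modifiable(struct mm_struct *mm,
                                unsigned long start, unsigned long end)
{
        struct vm_area_struct *vma;
        VMA_ITERATOR(vmi, mm, start);

        for_each_vma_range(vmi, vma, end)
                if (vma->vm_flags & VM_SEALED)   /* assumed flag name */
                        return false;
        return true;
}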
On Fri, Feb 2, 2024 at 8:46 PM Liam R. Howlett Liam.Howlett@oracle.com wrote:
- Linus Torvalds torvalds@linux-foundation.org [240202 18:36]:
On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett Liam.Howlett@oracle.com wrote:
There will be a larger performance cost to checking up front without allowing the partial completion.
I suspect that for mseal(), the only half-way common case will be sealing an area that is entirely contained within one vma.
Agreed.
So the cost will be the vma splitting (if it's not the whole vma), and very unlikely to be any kind of "walk the vma's to check that they can all be sealed" loop up-front.
That's the cost of calling mseal(), and I think that will be totally reasonable.
I'm more concerned with the other calls that do affect more than one vma that will now have to ensure there is not an mseal'ed vma among the affected area.
As you pointed out, we don't do atomic updates and so we have to add a loop at the beginning to check this new special case, which is what this patch set does today. That means we're going to be looping through twice for any call that could fail if one is mseal'ed. This includes munmap() and mprotect().
The impact will vary based on how many vma's are handled. I'd like some numbers on this so we can see if it is a concern, which Jeff has agreed to provide in the future - Thank you, Jeff.
Yes please. The additional walk Liam points to seems to be happening even if we don't use mseal at all. Android apps often create thousands of VMAs, so a small regression to a syscall like mprotect might cause a very visible regression to app launch times (one of the key metrics for Android). Having performance impact numbers here would be very helpful.
It also means we're modifying the behaviour of those calls so they could fail before anything changes (regardless of where the failure would occur), and we could still fail later when another aspect of a vma would cause a failure as we do today. We are paying the price for a more atomic update, but we aren't trying very hard to be atomic with our updates - we don't have many (virtually no) vma checks before modifications start.
For instance, we could move the mprotect check for map_deny_write_exec() to the pre-update loop to make it more atomic in nature. This one seems somewhat related to mseal, so it would be better if they were both checked atomic(ish) together. Although, I wonder if the user visible changes would be acceptable and worth the risk.
We will have two classes of updates to vma's: the more atomic view and the legacy view. The question of what happens when the two mix, or where a specific check should go will get (more) confusing.
Thanks, Liam
On Fri, Feb 2, 2024 at 11:21 AM Liam R. Howlett Liam.Howlett@oracle.com wrote:
- Jeff Xu jeffxu@chromium.org [240202 12:24]:
...
Provide code that uses this feature.
Please do this too :)
Yes. Will do.
Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and 32 VMAs.
I will prepare for the benchmark tests.
Thank you, please also include runs of calls that you are modifying for checking for mseal() as we are adding loops there.
It will include mmap/mremap/mprotect/munmap.
Provide code that tests and checks the failure paths. Failures at the start, middle, and end of the modifications.
Regarding, "Failures at the start, middle, and end of the modifications."
With the current implementation, e.g. it checks if the sealing is applied before actual modification of VMAs, so partial modifications are avoided in mprotect, mremap, munmap.
There are test cases in the selftests to cover the failure path, including the beginning, middle and end of VMAs. test_seal_unmapped_start test_seal_unmapped_middle test_seal_unmapped_end test_seal_invalid_input test_seal_start_mprotect test_seal_end_mprotect etc.
Are those what you are looking for ?
Those are certainly good, but we need more checking in there. You have a seal_split test that splits the vma by mseal but you don't check the flags on the VMAs.
I can add the flag check.
What I'm more concerned about is what happens if you call mseal() on a range and it can mseal a portion. Like, what happens to the first vma in your test_seal_unmapped_middle case? I see it returns an error, but is the first VMA mseal()'ed? (no it's not, but test that)
The first VMA is not sealed. That was covered by test_seal_mprotect_two_vma_with_gap.
What about the other system calls that will be denied on an mseal() VMA?
The other system calls' behavior is kept as-is if the memory is not sealed.
Do they still behave the same? do_mprotect_pkey() will break out of the loop on the first error it sees - but it has modified some VMAs up to that point, I believe?
Yes. The description about do_mprotect_pkey() is correct.
You have changed this to abort before anything is modified. This is probably acceptable because it won't affect existing applications unless they start using mseal(), but that's just my opinion.
To verify this, the tests were also written with sealing=false; those tests pass on mainline (before applying my patch), which confirms the tests themselves are correct.
It would be good to state the change in behaviour because it is changing the fundamental model of changing mprotect/madvise until an issue is hit. I think you are covering this by "it blocks X" but it's doing more than, say, a flag verification. One could reasonably assume this is just another flag verification.
Will add more in documentation.
Document what happens in those failure paths.
I'd like to know how this affects other system calls in the partial success cases/return error cases. Some will now return new error codes and some may change the behaviour.
For a mapping that is not sealed, everything remains unchanged, including the error handling path. For a mapping that is sealed, EPERM is returned if the sealing check fails, and all of the VMAs remain unchanged.
It may even be okay to allow munmap() to split VMAs at the start/end of the region and fail to munmap because some VMA in the middle is mseal()'ed - but maybe not? I haven't put a whole lot of thought into it.
If you are referring to something like the following: [unmapped][map1][unmapped][map2][unmapped][map3][unmapped], where map2 is sealed:
munmap(start of map1, end of map3) will fail. mmap/mremap/munmap/mprotect on an address range that includes map2 will fail with EPERM, with map1/map2/map3 unchanged. A sketch of a test for exactly this case follows below.
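Hedged sketch of such a self-test (it assumes __NR_mseal from this patchset and that the mappings are sealable): three mappings with gaps, the middle one sealed; munmap() across the whole span must fail with EPERM and leave all three untouched.

#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int my_mseal(void *addr, size_t len, unsigned long flags)
{
        return (int)syscall(__NR_mseal, addr, len, flags);
}

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        size_t len = 7 * page;

        /* Reserve a 7-page span, then punch holes so only pages 1, 3 and 5
         * stay mapped: [gap][map1][gap][map2][gap][map3][gap]. */
        char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert(base != MAP_FAILED);
        for (int i = 0; i < 7; i += 2)
                assert(munmap(base + i * page, page) == 0);

        char *map1 = base + 1 * page;
        char *map2 = base + 3 * page;
        char *map3 = base + 5 * page;
        memset(map1, 1, page);
        memset(map2, 2, page);
        memset(map3, 3, page);

        assert(my_mseal(map2, page, 0) == 0);     /* seal only the middle one */

        /* munmap() across the whole span must fail up front ... */
        errno = 0;
        assert(munmap(map1, (size_t)(map3 + page - map1)) == -1 && errno == EPERM);

        /* ... and map1/map2/map3 must still be mapped and untouched. */
        assert(map1[0] == 1 && map2[0] == 2 && map3[0] == 3);
        return 0;
}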
Thanks -Jeff
Thanks, Liam