From: Jeff Xu jeffxu@google.com
This patchset proposes a new mseal() syscall for the Linux kernel.
Modern CPUs support memory permissions such as RW and NX bits. Linux has supported NX since kernel version 2.6.8, released in August 2004 [1]. Memory permissions improve the security posture against memory corruption bugs: an attacker can't simply write code to arbitrary memory and redirect execution to it; the memory has to be marked with the X bit, or an exception will be raised.
Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management syscall. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
The new mseal() is an architecture-independent syscall with the following signature:
mseal(void *addr, size_t len, unsigned int types, unsigned int flags)
addr/len: the memory range. It must be contiguous, allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to the comments in mseal.c; they are also fully covered by the selftest.
types: bit mask specifying which syscalls to seal; currently:

MM_SEAL_MSEAL     0x1
MM_SEAL_MPROTECT  0x2
MM_SEAL_MUNMAP    0x4
MM_SEAL_MMAP      0x8
MM_SEAL_MREMAP    0x10
Each bit represents sealing for one specific syscall type, e.g. MM_SEAL_MPROTECT will deny the mprotect syscall. A bitmask was chosen so that the API is extensible, i.e. when needed, sealing can be extended to madvise, mlock, etc., and backward compatibility is easy.
The kernel remembers which seal types have been applied, so the application doesn't need to repeat all existing seal types in the next mseal() call. Once a seal type is applied, it can't be unsealed. Calling mseal() with an already-applied seal type is a no-op, not a failure.
MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
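To illustrate the intended usage, here is a minimal userspace sketch (not part of the patchset; it assumes the MM_SEAL_* values above and the __NR_mseal number wired up in patch 2):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

#ifndef __NR_mseal
#define __NR_mseal 453			/* number used by this series */
#endif
#define MM_SEAL_MPROTECT	0x2
#define MM_SEAL_MUNMAP		0x4

int main(void)
{
	size_t len = 4 * getpagesize();
	/* error handling omitted for brevity */
	void *p = mmap(NULL, len, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	/* Seal the mapping against mprotect(2) and munmap(2). */
	if (syscall(__NR_mseal, p, len, MM_SEAL_MPROTECT | MM_SEAL_MUNMAP, 0))
		perror("mseal");

	/* Repeating an existing seal type is a no-op. */
	syscall(__NR_mseal, p, len, MM_SEAL_MPROTECT, 0);

	/* The sealed operations now fail with EACCES. */
	if (mprotect(p, len, PROT_READ | PROT_WRITE))
		printf("mprotect denied, errno=%d\n", errno);
	if (munmap(p, len))
		printf("munmap denied, errno=%d\n", errno);
	return 0;
}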
Internally, a new field, vm_seals, is added to vm_area_struct to store the seal bit mask.
For the affected syscalls, such as mprotect, a sealing check (can_modify_mm) is added; this happens at an early point of the syscall, before any update is made to VMAs. The effect is: if any of the VMAs in the given address range fails the sealing check, none of the VMAs is updated. It might be worth noting that this is different from the rest of mprotect(), where some updates can happen even when mprotect() returns failure. Considering that can_modify_mm only checks vm_seals in vm_area_struct, and does not go deeper into the page table or update any hardware state, success-or-no-effect behavior might fit better here. I would like to hear the community's feedback on this.
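To make the success-or-no-effect behavior concrete, here is a small sketch (again not part of the patchset; it assumes the constants above):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>
#include <assert.h>

#define __NR_mseal		453	/* number used by this series */
#define MM_SEAL_MPROTECT	0x2

int main(void)
{
	size_t pg = getpagesize();
	char *p = mmap(NULL, 2 * pg, PROT_READ,
		       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	/* Seal only the second page; this splits the VMA into two. */
	assert(!syscall(__NR_mseal, p + pg, pg, MM_SEAL_MPROTECT, 0));

	/*
	 * The range spans an unsealed VMA followed by a sealed one.
	 * can_modify_mm() rejects the whole request before any VMA is
	 * touched, so the first (unsealed) page keeps PROT_READ too.
	 */
	assert(mprotect(p, 2 * pg, PROT_READ | PROT_WRITE) == -1);
	assert(errno == EACCES);
	return 0;
}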
The idea that inspired this patch comes from Stephen Röttger's work on V8 CFI [5]; the Chrome browser on ChromeOS will be the first user of this API.
In addition, Stephen is working on a glibc change to add sealing support to the dynamic linker, so that all non-writable segments are sealed at startup. When that work is completed, all applications can automatically benefit from these new protections.
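As a rough illustration of that idea (hypothetical; this is not the actual glibc change), a loader-style pass over the program headers could look like this:

#define _GNU_SOURCE
#include <link.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 453			/* number used by this series */
#endif
#define MM_SEAL_MPROTECT	0x2
#define MM_SEAL_MUNMAP		0x4
#define MM_SEAL_MMAP		0x8
#define MM_SEAL_MREMAP		0x10

static int seal_ro_segments(struct dl_phdr_info *info, size_t size, void *data)
{
	unsigned long pgsz = getpagesize();

	for (int i = 0; i < info->dlpi_phnum; i++) {
		const ElfW(Phdr) *ph = &info->dlpi_phdr[i];

		/* Only non-writable PT_LOAD segments (.text, .rodata). */
		if (ph->p_type != PT_LOAD || (ph->p_flags & PF_W))
			continue;

		unsigned long start = (info->dlpi_addr + ph->p_vaddr) & ~(pgsz - 1);
		unsigned long len = info->dlpi_addr + ph->p_vaddr + ph->p_memsz - start;

		/* len is page-aligned implicitly by the kernel. */
		syscall(__NR_mseal, (void *)start, len,
			MM_SEAL_MPROTECT | MM_SEAL_MUNMAP |
			MM_SEAL_MMAP | MM_SEAL_MREMAP, 0);
	}
	return 0;
}

int main(void)
{
	dl_iterate_phdr(seal_ro_segments, NULL);
	return 0;
}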
[1] https://kernelnewbies.org/Linux_2_6_8 [2] https://v8.dev/blog/control-flow-integrity [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9... [4] https://man.openbsd.org/mimmutable.2 [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgea...
Jeff Xu (8):
  Add mseal syscall
  Wire up mseal syscall
  mseal: add can_modify_mm and can_modify_vma
  mseal: seal mprotect
  mseal munmap
  mseal mremap
  mseal mmap
  selftest mm/mseal mprotect/munmap/mremap/mmap
 arch/alpha/kernel/syscalls/syscall.tbl      | 1 +
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd.h             | 2 +-
 arch/arm64/include/asm/unistd32.h           | 2 +
 arch/ia64/kernel/syscalls/syscall.tbl       | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl       | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl     | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    | 1 +
 arch/s390/kernel/syscalls/syscall.tbl       | 1 +
 arch/sh/kernel/syscalls/syscall.tbl         | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
 fs/aio.c                                    | 5 +-
 include/linux/mm.h                          | 55 +-
 include/linux/mm_types.h                    | 7 +
 include/linux/syscalls.h                    | 2 +
 include/uapi/asm-generic/unistd.h           | 5 +-
 include/uapi/linux/mman.h                   | 6 +
 ipc/shm.c                                   | 3 +-
 kernel/sys_ni.c                             | 1 +
 mm/Kconfig                                  | 8 +
 mm/Makefile                                 | 1 +
 mm/internal.h                               | 4 +-
 mm/mmap.c                                   | 49 +-
 mm/mprotect.c                               | 6 +
 mm/mremap.c                                 | 19 +-
 mm/mseal.c                                  | 328 +++++
 mm/nommu.c                                  | 6 +-
 mm/util.c                                   | 8 +-
 tools/testing/selftests/mm/Makefile         | 1 +
 tools/testing/selftests/mm/mseal_test.c     | 1428 +++++++++++++++++++
 37 files changed, 1934 insertions(+), 28 deletions(-)
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c
From: Jeff Xu jeffxu@google.com
mseal() prevents selected system calls from modifying the metadata of virtual address ranges.
Five syscalls can be sealed, as specified by bitmasks:

MM_SEAL_MPROTECT: Deny mprotect(2)/pkey_mprotect(2).
MM_SEAL_MUNMAP: Deny munmap(2).
MM_SEAL_MMAP: Deny mmap(2).
MM_SEAL_MREMAP: Deny mremap(2).
MM_SEAL_MSEAL: Deny adding a new seal type.
Signed-off-by: Jeff Xu jeffxu@google.com
---
 include/linux/mm.h        |  14 ++
 include/linux/mm_types.h  |   7 +
 include/linux/syscalls.h  |   2 +
 include/uapi/linux/mman.h |   6 +
 kernel/sys_ni.c           |   1 +
 mm/Kconfig                |   8 ++
 mm/Makefile               |   1 +
 mm/mmap.c                 |  14 ++
 mm/mseal.c                | 268 ++++++++++++++++++++++++++++++++++++++
 9 files changed, 321 insertions(+)
 create mode 100644 mm/mseal.c
diff --git a/include/linux/mm.h b/include/linux/mm.h index 53efddc4d178..e790b91a0cd4 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -257,6 +257,20 @@ extern struct rw_semaphore nommu_region_sem; extern unsigned int kobjsize(const void *objp); #endif
+/* + * vm_seals in vm_area_struct, see mm_types.h. + */ +#define VM_SEAL_NONE 0x00000000 +#define VM_SEAL_MSEAL 0x00000001 +#define VM_SEAL_MPROTECT 0x00000002 +#define VM_SEAL_MUNMAP 0x00000004 +#define VM_SEAL_MREMAP 0x00000008 +#define VM_SEAL_MMAP 0x00000010 + +#define VM_SEAL_ALL \ + (VM_SEAL_MSEAL | VM_SEAL_MPROTECT | VM_SEAL_MUNMAP | VM_SEAL_MMAP | \ + VM_SEAL_MREMAP) + /* * vm_flags in vm_area_struct, see mm_types.h. * When changing, update also include/trace/events/mmflags.h diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 36c5b43999e6..17d80f5a73dc 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -660,6 +660,13 @@ struct vm_area_struct { struct vma_numab_state *numab_state; /* NUMA Balancing state */ #endif struct vm_userfaultfd_ctx vm_userfaultfd_ctx; +#ifdef CONFIG_MSEAL + /* + * bit masks for seal. + * need this since vm_flags is full. + */ + unsigned long vm_seals; /* seal flags, see mm.h. */ +#endif } __randomize_layout;
#ifdef CONFIG_SCHED_MM_CID diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index c0cb22cd607d..f574c7dbee76 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -802,6 +802,8 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags); +asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned int types, + unsigned int flags); asmlinkage long sys_mbind(unsigned long start, unsigned long len, unsigned long mode, const unsigned long __user *nmask, diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h index a246e11988d5..d7882b5984ce 100644 --- a/include/uapi/linux/mman.h +++ b/include/uapi/linux/mman.h @@ -55,4 +55,10 @@ struct cachestat { __u64 nr_recently_evicted; };
+#define MM_SEAL_MSEAL 0x1 +#define MM_SEAL_MPROTECT 0x2 +#define MM_SEAL_MUNMAP 0x4 +#define MM_SEAL_MMAP 0x8 +#define MM_SEAL_MREMAP 0x10 + #endif /* _UAPI_LINUX_MMAN_H */ diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 781de7cc6a4e..06fabf379e33 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -192,6 +192,7 @@ COND_SYSCALL(migrate_pages); COND_SYSCALL(move_pages); COND_SYSCALL(set_mempolicy_home_node); COND_SYSCALL(cachestat); +COND_SYSCALL(mseal);
COND_SYSCALL(perf_event_open); COND_SYSCALL(accept4); diff --git a/mm/Kconfig b/mm/Kconfig index 264a2df5ecf5..db8a567cb4d3 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1258,6 +1258,14 @@ config LOCK_MM_AND_FIND_VMA bool depends on !STACK_GROWSUP
+config MSEAL + default n + bool "Enable mseal() system call" + depends on MMU + help + Enable the mseal() system call. Make memory areas's metadata immutable + by selected system calls, i.e. mprotect(), munmap(), mremap(), mmap(). + source "mm/damon/Kconfig"
endmenu diff --git a/mm/Makefile b/mm/Makefile index ec65984e2ade..643d8518dac0 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -120,6 +120,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o obj-$(CONFIG_SECRETMEM) += secretmem.o +obj-$(CONFIG_MSEAL) += mseal.o obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o diff --git a/mm/mmap.c b/mm/mmap.c index 514ced13c65c..9b6c477e713e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -730,6 +730,20 @@ static inline bool is_mergeable_vma(struct vm_area_struct *vma, return false; if (!anon_vma_name_eq(anon_vma_name(vma), anon_name)) return false; +#ifdef CONFIG_MSEAL + /* + * If a VMA is sealed, it won't be merged with another VMA. + * This might be useful for diagnosis, i.e. the boundary used + * in the mseal() call will be preserved. + * There are chances of too many mseal() calls can create + * many segmentations. Considering mseal() usually comes + * with a careful memory layout design by the application, + * this might not be an issue in real world. + * Though, we could add merging support later if needed. + */ + if (vma->vm_seals & VM_SEAL_ALL) + return 0; +#endif return true; }
diff --git a/mm/mseal.c b/mm/mseal.c new file mode 100644 index 000000000000..615b6e06ab44 --- /dev/null +++ b/mm/mseal.c @@ -0,0 +1,268 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Implement mseal() syscall. + * + * Copyright (c) 2023 Google, Inc. + * + * Author: Jeff Xu jeffxu@google.com + */ + +#include <linux/mman.h> +#include <linux/mm.h> +#include <linux/syscalls.h> +#include <linux/sched.h> +#include "internal.h" + +/* + * MM_SEAL_ALL is all supported flags in mseal(). + */ +#define MM_SEAL_ALL ( \ + MM_SEAL_MSEAL | \ + MM_SEAL_MPROTECT | \ + MM_SEAL_MUNMAP | \ + MM_SEAL_MMAP | \ + MM_SEAL_MREMAP) + +static bool can_do_mseal(unsigned int types, unsigned int flags) +{ + /* check types is a valid bitmap */ + if (types & ~MM_SEAL_ALL) + return false; + + /* flags isn't used for now */ + if (flags) + return false; + + return true; +} + +/* + * Check if a seal type can be added to VMA. + */ +static bool can_add_vma_seals(struct vm_area_struct *vma, unsigned int newSeals) +{ + /* When SEAL_MSEAL is set, reject if a new type of seal is added */ + if ((vma->vm_seals & VM_SEAL_MSEAL) && + (newSeals & ~(vma->vm_seals & VM_SEAL_ALL))) + return false; + + return true; +} + +static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma, + struct vm_area_struct **prev, unsigned long start, + unsigned long end, unsigned int addtypes) +{ + int ret = 0; + + if (addtypes & ~(vma->vm_seals & VM_SEAL_ALL)) { + /* + * Handle split at start and end. + * Note: sealed VMA doesn't merge with other VMAs. + */ + if (start != vma->vm_start) { + ret = split_vma(vmi, vma, start, 1); + if (ret) + goto out; + } + + if (end != vma->vm_end) { + ret = split_vma(vmi, vma, end, 0); + if (ret) + goto out; + } + + vma->vm_seals |= addtypes; + } + +out: + *prev = vma; + return ret; +} + +/* + * convert user input to internal type for seal type. + */ +static unsigned int convert_user_seal_type(unsigned int types) +{ + unsigned int newtypes = VM_SEAL_NONE; + + if (types & MM_SEAL_MSEAL) + newtypes |= VM_SEAL_MSEAL; + + if (types & MM_SEAL_MPROTECT) + newtypes |= VM_SEAL_MPROTECT; + + if (types & MM_SEAL_MUNMAP) + newtypes |= VM_SEAL_MUNMAP; + + if (types & MM_SEAL_MMAP) + newtypes |= VM_SEAL_MMAP; + + if (types & MM_SEAL_MREMAP) + newtypes |= VM_SEAL_MREMAP; + + return newtypes; +} + +/* + * Check for do_mseal: + * 1> start is part of a valid vma. + * 2> end is part of a valid vma. + * 3> No gap (unallocated address) between start and end. + * 4> requested seal type can be added in given address range. + */ +static int check_mm_seal(unsigned long start, unsigned long end, + unsigned int newtypes) +{ + struct vm_area_struct *vma; + unsigned long nstart = start; + + VMA_ITERATOR(vmi, current->mm, start); + + /* going through each vma to check */ + for_each_vma_range(vmi, vma, end) { + if (vma->vm_start > nstart) + /* unallocated memory found */ + return -ENOMEM; + + if (!can_add_vma_seals(vma, newtypes)) + return -EACCES; + + if (vma->vm_end >= end) + return 0; + + nstart = vma->vm_end; + } + + return -ENOMEM; +} + +/* + * Apply sealing. 
+ */ +static int apply_mm_seal(unsigned long start, unsigned long end, + unsigned int newtypes) +{ + unsigned long nstart, nend; + struct vm_area_struct *vma, *prev = NULL; + struct vma_iterator vmi; + int error = 0; + + vma_iter_init(&vmi, current->mm, start); + vma = vma_find(&vmi, end); + + prev = vma_prev(&vmi); + if (start > vma->vm_start) + prev = vma; + + nstart = start; + + /* going through each vma to update */ + for_each_vma_range(vmi, vma, end) { + nend = vma->vm_end; + if (nend > end) + nend = end; + + error = mseal_fixup(&vmi, vma, &prev, nstart, nend, newtypes); + if (error) + break; + + nstart = vma->vm_end; + } + + return error; +} + +/* + * mseal(2) seals the VM's meta data from + * selected syscalls. + * + * addr/len: VM address range. + * + * The address range by addr/len must meet: + * start (addr) must be in a valid VMA. + * end (addr + len) must be in a valid VMA. + * no gap (unallocated memory) between start and end. + * start (addr) must be page aligned. + * + * len: len will be page aligned implicitly. + * + * types: bit mask for sealed syscalls. + * MM_SEAL_MPROTECT: seal mprotect(2)/pkey_mprotect(2). + * MM_SEAL_MUNMAP: seal munmap(2). + * MM_SEAL_MMAP: seal mmap(2). + * MM_SEAL_MREMAP: seal mremap(2). + * MM_SEAL_MSEAL: adding new seal type will be rejected. + * + * flags: reserved. + * + * return values: + * zero: success + * -EINVAL: + * invalid seal type. + * invalid input flags. + * addr is not page aligned. + * addr + len overflow. + * -ENOMEM: + * addr is not a valid address (not allocated). + * end (addr + len) is not a valid address. + * a gap (unallocated memory) between start and end. + * -EACCES: + * MM_SEAL_MSEAL is set, adding a new seal is rejected. + * + * Note: + * user can call mseal(2) multiple times to add new seal types. + * adding an already added seal type is a no-action (no error). + * adding a new seal type after MM_SEAL_MSEAL will be rejected. + * unseal() or removing a seal type is not supported. + */ +static int do_mseal(unsigned long start, size_t len_in, unsigned int types, + unsigned int flags) +{ + int ret = 0; + unsigned long end; + struct mm_struct *mm = current->mm; + unsigned int newtypes; + size_t len; + + if (!can_do_mseal(types, flags)) + return -EINVAL; + + newtypes = convert_user_seal_type(types); + + start = untagged_addr(start); + if (!PAGE_ALIGNED(start)) + return -EINVAL; + + len = PAGE_ALIGN(len_in); + /* Check to see whether len was rounded up from small -ve to zero */ + if (len_in && !len) + return -EINVAL; + + end = start + len; + if (end < start) + return -EINVAL; + + if (end == start) + return 0; + + if (mmap_write_lock_killable(mm)) + return -EINTR; + + ret = check_mm_seal(start, end, newtypes); + if (ret) + goto out; + + ret = apply_mm_seal(start, end, newtypes); + +out: + mmap_write_unlock(current->mm); + return ret; +} + +SYSCALL_DEFINE4(mseal, unsigned long, start, size_t, len, unsigned int, types, unsigned int, + flags) +{ + return do_mseal(start, len, types, flags); +}
On Mon, Oct 16, 2023 at 02:38:20PM +0000, jeffxu@chromium.org wrote:
+#ifdef CONFIG_MSEAL
+	/*
+	 * bit masks for seal.
+	 * need this since vm_flags is full.
+	 */
+	unsigned long vm_seals;		/* seal flags, see mm.h. */
"unsigned long" and yet:
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index c0cb22cd607d..f574c7dbee76 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -802,6 +802,8 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags); +asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned int types,
unsigned int flags);
"unsigned int"?
Why the mis-match?
--- a/include/uapi/linux/mman.h +++ b/include/uapi/linux/mman.h @@ -55,4 +55,10 @@ struct cachestat { __u64 nr_recently_evicted; }; +#define MM_SEAL_MSEAL 0x1 +#define MM_SEAL_MPROTECT 0x2 +#define MM_SEAL_MUNMAP 0x4 +#define MM_SEAL_MMAP 0x8 +#define MM_SEAL_MREMAP 0x10
I think we can use the BIT() macro in uapi .h files now; there it is _BITUL(). Might want to use it here too to make it obvious what is happening.
thanks,
greg k-h
On Mon, Oct 16, 2023 at 8:07 AM Greg KH gregkh@linuxfoundation.org wrote:
On Mon, Oct 16, 2023 at 02:38:20PM +0000, jeffxu@chromium.org wrote:
+#ifdef CONFIG_MSEAL
+	/*
+	 * bit masks for seal.
+	 * need this since vm_flags is full.
+	 */
+	unsigned long vm_seals;		/* seal flags, see mm.h. */
"unsigned long" and yet:
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index c0cb22cd607d..f574c7dbee76 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -802,6 +802,8 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags); +asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned int types,
unsigned int flags);
"unsigned int"?
Why the mis-match?
Thanks. Fixed in V2.
--- a/include/uapi/linux/mman.h +++ b/include/uapi/linux/mman.h @@ -55,4 +55,10 @@ struct cachestat { __u64 nr_recently_evicted; };
+#define MM_SEAL_MSEAL 0x1 +#define MM_SEAL_MPROTECT 0x2 +#define MM_SEAL_MUNMAP 0x4 +#define MM_SEAL_MMAP 0x8 +#define MM_SEAL_MREMAP 0x10
I think we can use the BIT() macro in uapi .h files now; there it is _BITUL(). Might want to use it here too to make it obvious what is happening.
Sure. Will update in V2.
thanks,
greg k-h
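For reference, the _BITUL() form suggested above could look roughly like this in include/uapi/linux/mman.h (a sketch; the actual V2 may differ):

#include <linux/const.h>

#define MM_SEAL_MSEAL		_BITUL(0)	/* 0x1  */
#define MM_SEAL_MPROTECT	_BITUL(1)	/* 0x2  */
#define MM_SEAL_MUNMAP		_BITUL(2)	/* 0x4  */
#define MM_SEAL_MMAP		_BITUL(3)	/* 0x8  */
#define MM_SEAL_MREMAP		_BITUL(4)	/* 0x10 */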
From: Jeff Xu jeffxu@google.com
Wire up mseal syscall for all architectures.
Signed-off-by: Jeff Xu jeffxu@google.com
---
 arch/alpha/kernel/syscalls/syscall.tbl      | 1 +
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd.h             | 2 +-
 arch/arm64/include/asm/unistd32.h           | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl       | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl       | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl     | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    | 1 +
 arch/s390/kernel/syscalls/syscall.tbl       | 1 +
 arch/sh/kernel/syscalls/syscall.tbl         | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
 include/uapi/asm-generic/unistd.h           | 5 ++++-
 19 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index ad37569d0507..b5847d53102a 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -492,3 +492,4 @@ 560 common set_mempolicy_home_node sys_ni_syscall 561 common cachestat sys_cachestat 562 common fchmodat2 sys_fchmodat2 +563 common mseal sys_mseal diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index c572d6c3dee0..b50c5ca5047d 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -466,3 +466,4 @@ 450 common set_mempolicy_home_node sys_set_mempolicy_home_node 451 common cachestat sys_cachestat 452 common fchmodat2 sys_fchmodat2 +453 common mseal sys_mseal diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index bd77253b62e0..6a28fb91b85d 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -39,7 +39,7 @@ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 453 +#define __NR_compat_syscalls 454 #endif
#define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index 78b68311ec81..1e9b3c098a8e 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -911,6 +911,8 @@ __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) __SYSCALL(__NR_cachestat, sys_cachestat) #define __NR_fchmodat2 452 __SYSCALL(__NR_fchmodat2, sys_fchmodat2) +#define __NR_mseal 453 +__SYSCALL(__NR_mseal, sys_mseal)
/* * Please add new compat syscalls above this comment and update diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index 83d8609aec03..babe34d221ee 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -373,3 +373,4 @@ 450 common set_mempolicy_home_node sys_set_mempolicy_home_node 451 common cachestat sys_cachestat 452 common fchmodat2 sys_fchmodat2 +453 common mseal sys_mseal diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 259ceb125367..27cd3f7dbd5e 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -452,3 +452,4 @@ 450 common set_mempolicy_home_node sys_set_mempolicy_home_node 451 common cachestat sys_cachestat 452 common fchmodat2 sys_fchmodat2 +453 common mseal sys_mseal diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index a3798c2637fd..e49861f7c61f 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -458,3 +458,4 @@ 450 common set_mempolicy_home_node sys_set_mempolicy_home_node 451 common cachestat sys_cachestat 452 common fchmodat2 sys_fchmodat2 +453 common mseal sys_mseal diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 152034b8e0a0..78d15010cd77 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -391,3 +391,4 @@ 450 n32 set_mempolicy_home_node sys_set_mempolicy_home_node 451 n32 cachestat sys_cachestat 452 n32 fchmodat2 sys_fchmodat2 +453 n32 mseal sys_mseal diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index cb5e757f6621..813614fedb72 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -367,3 +367,4 @@ 450 common set_mempolicy_home_node sys_set_mempolicy_home_node 451 n64 cachestat sys_cachestat 452 n64 fchmodat2 sys_fchmodat2 +453 n64 mseal sys_mseal diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index 1a646813afdc..01d88d3a6f3e 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -440,3 +440,4 @@ 450 o32 set_mempolicy_home_node sys_set_mempolicy_home_node 451 o32 cachestat sys_cachestat 452 o32 fchmodat2 sys_fchmodat2 +453 o32 mseal sys_mseal diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index e97c175b56f9..d52d08f0a1ea 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -451,3 +451,4 @@ 450 common set_mempolicy_home_node sys_set_mempolicy_home_node 451 common cachestat sys_cachestat 452 common fchmodat2 sys_fchmodat2 +453 common mseal sys_mseal diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 20e50586e8a2..d38deba73a7b 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -539,3 +539,4 @@ 450 nospu set_mempolicy_home_node sys_set_mempolicy_home_node 451 common cachestat sys_cachestat 452 common fchmodat2 sys_fchmodat2 +453 common mseal sys_mseal diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 0122cc156952..cf3243c2978b 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -455,3 
+455,4 @@ 450 common set_mempolicy_home_node sys_set_mempolicy_home_node sys_set_mempolicy_home_node 451 common cachestat sys_cachestat sys_cachestat 452 common fchmodat2 sys_fchmodat2 sys_fchmodat2 +453 common mseal sys_mseal sys_mseal diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index e90d585c4d3e..76f1cd33adaa 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -455,3 +455,4 @@ 450 common set_mempolicy_home_node sys_set_mempolicy_home_node 451 common cachestat sys_cachestat 452 common fchmodat2 sys_fchmodat2 +453 common mseal sys_mseal diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index 4ed06c71c43f..d7728695d780 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -498,3 +498,4 @@ 450 common set_mempolicy_home_node sys_set_mempolicy_home_node 451 common cachestat sys_cachestat 452 common fchmodat2 sys_fchmodat2 +453 common mseal sys_mseal diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 2d0b1bd866ea..6d4cc386df22 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -457,3 +457,4 @@ 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node 451 i386 cachestat sys_cachestat 452 i386 fchmodat2 sys_fchmodat2 +453 i386 mseal sys_mseal diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 814768249eae..73dcfc43d921 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -374,6 +374,7 @@ 450 common set_mempolicy_home_node sys_set_mempolicy_home_node 451 common cachestat sys_cachestat 452 common fchmodat2 sys_fchmodat2 +453 common mseal sys_mseal
# # Due to a historical design error, certain syscalls are numbered differently diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index fc1a4f3c81d9..e8fd3bf35d73 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -423,3 +423,4 @@ 450 common set_mempolicy_home_node sys_set_mempolicy_home_node 451 common cachestat sys_cachestat 452 common fchmodat2 sys_fchmodat2 +453 common mseal sys_mseal diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index abe087c53b4b..0c945a798208 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -823,8 +823,11 @@ __SYSCALL(__NR_cachestat, sys_cachestat) #define __NR_fchmodat2 452 __SYSCALL(__NR_fchmodat2, sys_fchmodat2)
+#define __NR_mseal 453 +__SYSCALL(__NR_mseal, sys_mseal) + #undef __NR_syscalls -#define __NR_syscalls 453 +#define __NR_syscalls 454
/* * 32 bit systems traditionally used different
From: Jeff Xu jeffxu@google.com
can_modify_mm: checks sealing flags for a given memory range.

can_modify_vma: checks sealing flags for a given vma.
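Callers are expected to use these helpers like this (a sketch; the actual call sites are added in the following patches):

	/* e.g. early in an affected syscall, before any VMA is updated */
	if (!can_modify_mm(current->mm, start, end, MM_ACTION_MPROTECT,
			   ON_BEHALF_OF_USERSPACE)) {
		error = -EACCES;
		goto out;
	}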
Signed-off-by: Jeff Xu jeffxu@google.com
---
 include/linux/mm.h | 34 ++++++++++++++++++++++++++
 mm/mseal.c         | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h index e790b91a0cd4..aafdb68950f8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -257,6 +257,18 @@ extern struct rw_semaphore nommu_region_sem; extern unsigned int kobjsize(const void *objp); #endif
+enum caller_origin { + ON_BEHALF_OF_KERNEL = 0, + ON_BEHALF_OF_USERSPACE, +}; + +enum mm_action { + MM_ACTION_MPROTECT, + MM_ACTION_MUNMAP, + MM_ACTION_MREMAP, + MM_ACTION_MMAP, +}; + /* * vm_seals in vm_area_struct, see mm_types.h. */ @@ -3302,6 +3314,28 @@ static inline void mm_populate(unsigned long addr, unsigned long len) static inline void mm_populate(unsigned long addr, unsigned long len) {} #endif
+#ifdef CONFIG_MSEAL +extern bool can_modify_mm(struct mm_struct *mm, unsigned long start, + unsigned long end, enum mm_action action, + enum caller_origin called); + +extern bool can_modify_vma(struct vm_area_struct *vma, enum mm_action action, + enum caller_origin called); +#else +static inline bool can_modify_mm(struct mm_struct *mm, unsigned long start, + unsigned long end, enum mm_action action, + enum caller_origin called) +{ + return true; +} + +static inline bool can_modify_vma(struct vm_area_struct *vma, enum mm_action action, + enum caller_origin called) +{ + return true; +} +#endif + /* These take the mm semaphore themselves */ extern int __must_check vm_brk(unsigned long, unsigned long); extern int __must_check vm_brk_flags(unsigned long, unsigned long, unsigned long); diff --git a/mm/mseal.c b/mm/mseal.c index 615b6e06ab44..3285ef6b95a6 100644 --- a/mm/mseal.c +++ b/mm/mseal.c @@ -36,6 +36,66 @@ static bool can_do_mseal(unsigned int types, unsigned int flags) return true; }
+/* + * check if a vma is sealed for modification. + * return true, if modification is allowed. + */ +bool can_modify_vma(struct vm_area_struct *vma, enum mm_action action, + enum caller_origin called) +{ + if (called == ON_BEHALF_OF_KERNEL) + return true; + + switch (action) { + case MM_ACTION_MPROTECT: + if (vma->vm_seals & VM_SEAL_MPROTECT) + return false; + break; + + case MM_ACTION_MUNMAP: + if (vma->vm_seals & VM_SEAL_MUNMAP) + return false; + break; + + case MM_ACTION_MREMAP: + if (vma->vm_seals & VM_SEAL_MREMAP) + return false; + break; + + case MM_ACTION_MMAP: + if (vma->vm_seals & VM_SEAL_MMAP) + return false; + break; + } + + return true; +} + +/* + * Check if the vmas of a memory range are allowed to be modified. + * the memory ranger can have a gap (unallocated memory). + * return true, if it is allowed. + */ +bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end, + enum mm_action action, enum caller_origin called) +{ + struct vm_area_struct *vma; + + VMA_ITERATOR(vmi, mm, start); + + if (called == ON_BEHALF_OF_KERNEL) + return true; + + /* going through each vma to check */ + for_each_vma_range(vmi, vma, end) { + if (!can_modify_vma(vma, action, called)) + return false; + } + + /* Allow by default. */ + return true; +} + /* * Check if a seal type can be added to VMA. */
From: Jeff Xu jeffxu@google.com
Check sealing for mprotect(2).
Signed-off-by: Jeff Xu jeffxu@google.com
---
 mm/mprotect.c | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/mm/mprotect.c b/mm/mprotect.c index 130db91d3a8c..5b67c66d55f7 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -753,6 +753,12 @@ static int do_mprotect_pkey(unsigned long start, size_t len, } }
+ if (!can_modify_mm(current->mm, start, end, MM_ACTION_MPROTECT, + ON_BEHALF_OF_USERSPACE)) { + error = -EACCES; + goto out; + } + prev = vma_prev(&vmi); if (start > vma->vm_start) prev = vma;
From: Jeff Xu jeffxu@google.com
Check sealing for munmap(2).
Signed-off-by: Jeff Xu jeffxu@google.com
---
 include/linux/mm.h |  2 +-
 mm/mmap.c          | 22 ++++++++++++++--------
 mm/mremap.c        |  5 +++--
 3 files changed, 18 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index aafdb68950f8..95b793eb3a80 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3294,7 +3294,7 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long pgoff, unsigned long *populate, struct list_head *uf); extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf, - bool unlock); + bool unlock, enum caller_origin called); extern int do_munmap(struct mm_struct *, unsigned long, size_t, struct list_head *uf); extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior); diff --git a/mm/mmap.c b/mm/mmap.c index 9b6c477e713e..f4bfcc5d2c10 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2601,6 +2601,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, * @len: The length of the range to munmap * @uf: The userfaultfd list_head * @unlock: set to true if the user wants to drop the mmap_lock on success + * @called: caller origin * * This function takes a @mas that is either pointing to the previous VMA or set * to MA_START and sets it up to remove the mapping(s). The @len will be @@ -2611,7 +2612,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, */ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf, - bool unlock) + bool unlock, enum caller_origin called) { unsigned long end; struct vm_area_struct *vma; @@ -2623,6 +2624,9 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, if (end == start) return -EINVAL;
+ if (!can_modify_mm(mm, start, end, MM_ACTION_MUNMAP, called)) + return -EACCES; + /* arch_unmap() might do unmaps itself. */ arch_unmap(mm, start, end);
@@ -2650,7 +2654,8 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, { VMA_ITERATOR(vmi, mm, start);
- return do_vmi_munmap(&vmi, mm, start, len, uf, false); + return do_vmi_munmap(&vmi, mm, start, len, uf, false, + ON_BEHALF_OF_KERNEL); }
unsigned long mmap_region(struct file *file, unsigned long addr, @@ -2684,7 +2689,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, }
/* Unmap any existing mapping in the area */ - if (do_vmi_munmap(&vmi, mm, addr, len, uf, false)) + if (do_vmi_munmap(&vmi, mm, addr, len, uf, false, ON_BEHALF_OF_KERNEL)) return -ENOMEM;
/* @@ -2909,7 +2914,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, return error; }
-static int __vm_munmap(unsigned long start, size_t len, bool unlock) +static int __vm_munmap(unsigned long start, size_t len, bool unlock, + enum caller_origin called) { int ret; struct mm_struct *mm = current->mm; @@ -2919,7 +2925,7 @@ static int __vm_munmap(unsigned long start, size_t len, bool unlock) if (mmap_write_lock_killable(mm)) return -EINTR;
- ret = do_vmi_munmap(&vmi, mm, start, len, &uf, unlock); + ret = do_vmi_munmap(&vmi, mm, start, len, &uf, unlock, called); if (ret || !unlock) mmap_write_unlock(mm);
@@ -2929,14 +2935,14 @@ static int __vm_munmap(unsigned long start, size_t len, bool unlock)
int vm_munmap(unsigned long start, size_t len) { - return __vm_munmap(start, len, false); + return __vm_munmap(start, len, false, ON_BEHALF_OF_KERNEL); } EXPORT_SYMBOL(vm_munmap);
SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len) { addr = untagged_addr(addr); - return __vm_munmap(addr, len, true); + return __vm_munmap(addr, len, true, ON_BEHALF_OF_USERSPACE); }
@@ -3168,7 +3174,7 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags) if (ret) goto limits_failed;
- ret = do_vmi_munmap(&vmi, mm, addr, len, &uf, 0); + ret = do_vmi_munmap(&vmi, mm, addr, len, &uf, 0, ON_BEHALF_OF_KERNEL); if (ret) goto munmap_failed;
diff --git a/mm/mremap.c b/mm/mremap.c index 056478c106ee..e43f9ceaa29d 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -715,7 +715,8 @@ static unsigned long move_vma(struct vm_area_struct *vma, }
vma_iter_init(&vmi, mm, old_addr); - if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false)) { + if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false, + ON_BEHALF_OF_KERNEL)) { /* OOM: unable to split vma, just get accounts right */ if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) vm_acct_memory(old_len >> PAGE_SHIFT); @@ -1009,7 +1010,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, }
ret = do_vmi_munmap(&vmi, mm, addr + new_len, old_len - new_len, - &uf_unmap, true); + &uf_unmap, true, ON_BEHALF_OF_KERNEL); if (ret) goto out;
From: Jeff Xu jeffxu@google.com
Check sealing for mremap(2).
Signed-off-by: Jeff Xu jeffxu@google.com
---
 mm/mremap.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
diff --git a/mm/mremap.c b/mm/mremap.c index e43f9ceaa29d..2288f9d0b126 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -836,7 +836,15 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len, if ((mm->map_count + 2) >= sysctl_max_map_count - 3) return -ENOMEM;
+ if (!can_modify_mm(mm, addr, addr + old_len, MM_ACTION_MREMAP, + ON_BEHALF_OF_USERSPACE)) + return -EACCES; + if (flags & MREMAP_FIXED) { + if (!can_modify_mm(mm, new_addr, new_addr + new_len, + MM_ACTION_MREMAP, ON_BEHALF_OF_USERSPACE)) + return -EACCES; + ret = do_munmap(mm, new_addr, new_len, uf_unmap_early); if (ret) goto out; @@ -995,6 +1003,12 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len, goto out; }
+ if (!can_modify_mm(mm, addr, addr + old_len, MM_ACTION_MREMAP, + ON_BEHALF_OF_USERSPACE)) { + ret = -EACCES; + goto out; + } + /* * Always allow a shrinking remap: that just unmaps * the unnecessary pages..
From: Jeff Xu jeffxu@google.com
Check sealing for mmap(2).
Signed-off-by: Jeff Xu jeffxu@google.com
---
 fs/aio.c           |  5 +++--
 include/linux/mm.h |  5 ++++-
 ipc/shm.c          |  3 ++-
 mm/internal.h      |  4 ++--
 mm/mmap.c          | 13 +++++++++----
 mm/nommu.c         |  6 ++++--
 mm/util.c          |  8 +++++---
 7 files changed, 29 insertions(+), 15 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index b3174da80ff6..81040126dd45 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -557,8 +557,9 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) }
ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size, - PROT_READ | PROT_WRITE, - MAP_SHARED, 0, &unused, NULL); + PROT_READ | PROT_WRITE, MAP_SHARED, 0, &unused, + NULL, ON_BEHALF_OF_KERNEL); + mmap_write_unlock(mm); if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index 95b793eb3a80..ffa2eb9bd475 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3289,9 +3289,12 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, struct list_head *uf); + extern unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, - unsigned long pgoff, unsigned long *populate, struct list_head *uf); + unsigned long pgoff, unsigned long *populate, struct list_head *uf, + enum caller_origin called); + extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf, bool unlock, enum caller_origin called); diff --git a/ipc/shm.c b/ipc/shm.c index 60e45e7045d4..14aebeefe155 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -1662,7 +1662,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, goto invalid; }
- addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL); + addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL, + ON_BEHALF_OF_KERNEL); *raddr = addr; err = 0; if (IS_ERR_VALUE(addr)) diff --git a/mm/internal.h b/mm/internal.h index d1d4bf4e63c0..4361eaf3d1c6 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -800,8 +800,8 @@ extern u64 hwpoison_filter_memcg; extern u32 hwpoison_filter_enable;
extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long, - unsigned long, unsigned long, - unsigned long, unsigned long); + unsigned long, unsigned long, unsigned long, unsigned long, + enum caller_origin called);
extern void set_pageblock_order(void); unsigned long reclaim_pages(struct list_head *folio_list); diff --git a/mm/mmap.c b/mm/mmap.c index f4bfcc5d2c10..a42fe27a7d04 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1197,7 +1197,8 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode, unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long pgoff, - unsigned long *populate, struct list_head *uf) + unsigned long *populate, struct list_head *uf, + enum caller_origin called) { struct mm_struct *mm = current->mm; vm_flags_t vm_flags; @@ -1365,6 +1366,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= VM_NORESERVE; }
+ if (!can_modify_mm(mm, addr, addr + len, MM_ACTION_MMAP, called)) + return -EACCES; + addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); if (!IS_ERR_VALUE(addr) && ((vm_flags & VM_LOCKED) || @@ -1411,7 +1415,8 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len, return PTR_ERR(file); }
- retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff); + retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff, + ON_BEHALF_OF_USERSPACE); out_fput: if (file) fput(file); @@ -3017,8 +3022,8 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, flags |= MAP_LOCKED;
file = get_file(vma->vm_file); - ret = do_mmap(vma->vm_file, start, size, - prot, flags, pgoff, &populate, NULL); + ret = do_mmap(vma->vm_file, start, size, prot, flags, pgoff, + &populate, NULL, ON_BEHALF_OF_KERNEL); fput(file); out: mmap_write_unlock(mm); diff --git a/mm/nommu.c b/mm/nommu.c index 8dba41cfc44d..8c11de5dd8e6 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1018,7 +1018,8 @@ unsigned long do_mmap(struct file *file, unsigned long flags, unsigned long pgoff, unsigned long *populate, - struct list_head *uf) + struct list_head *uf, + enum caller_origin called) { struct vm_area_struct *vma; struct vm_region *region; @@ -1262,7 +1263,8 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len, goto out; }
- retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff); + retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff, + ON_BEHALF_OF_USERSPACE);
if (file) fput(file); diff --git a/mm/util.c b/mm/util.c index 4ed8b9b5273c..aaf37c3af517 100644 --- a/mm/util.c +++ b/mm/util.c @@ -532,7 +532,8 @@ EXPORT_SYMBOL_GPL(account_locked_vm);
unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, - unsigned long flag, unsigned long pgoff) + unsigned long flag, unsigned long pgoff, + enum caller_origin called) { unsigned long ret; struct mm_struct *mm = current->mm; @@ -544,7 +545,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr, if (mmap_write_lock_killable(mm)) return -EINTR; ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate, - &uf); + &uf, called); mmap_write_unlock(mm); userfaultfd_unmap_complete(mm, &uf); if (populate) @@ -562,7 +563,8 @@ unsigned long vm_mmap(struct file *file, unsigned long addr, if (unlikely(offset_in_page(offset))) return -EINVAL;
- return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT); + return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT, + ON_BEHALF_OF_KERNEL); } EXPORT_SYMBOL(vm_mmap);
From: Jeff Xu jeffxu@google.com
Selftest for sealing mprotect/munmap/mremap/mmap.
Signed-off-by: Jeff Xu jeffxu@google.com
---
 tools/testing/selftests/mm/Makefile     |    1 +
 tools/testing/selftests/mm/mseal_test.c | 1428 +++++++++++++++++++++++
 2 files changed, 1429 insertions(+)
 create mode 100644 tools/testing/selftests/mm/mseal_test.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 6a9fc5693145..0c086cecc093 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -59,6 +59,7 @@ TEST_GEN_FILES += mlock2-tests TEST_GEN_FILES += mrelease_test TEST_GEN_FILES += mremap_dontunmap TEST_GEN_FILES += mremap_test +TEST_GEN_FILES += mseal_test TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c new file mode 100644 index 000000000000..d6ae09729394 --- /dev/null +++ b/tools/testing/selftests/mm/mseal_test.c @@ -0,0 +1,1428 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#include <sys/mman.h> +#include <stdint.h> +#include <unistd.h> +#include <string.h> +#include <sys/time.h> +#include <sys/resource.h> +#include <stdbool.h> +#include "../kselftest.h" +#include <syscall.h> +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <assert.h> + +#ifndef MM_SEAL_MSEAL +#define MM_SEAL_MSEAL 0x1 +#endif + +#ifndef MM_SEAL_MPROTECT +#define MM_SEAL_MPROTECT 0x2 +#endif + +#ifndef MM_SEAL_MUNMAP +#define MM_SEAL_MUNMAP 0x4 +#endif + +#ifndef MM_SEAL_MMAP +#define MM_SEAL_MMAP 0x8 +#endif + +#ifndef MM_SEAL_MREMAP +#define MM_SEAL_MREMAP 0x10 +#endif + +#ifndef DEBUG +#define LOG_TEST_ENTER() {} +#else +#define LOG_TEST_ENTER() { printf("%s\n", __func__); } +#endif + +static int sys_mseal(void *start, size_t len, int types) +{ + int sret; + + errno = 0; + sret = syscall(__NR_mseal, start, len, types, 0); + return sret; +} + +int sys_mprotect(void *ptr, size_t size, unsigned long prot) +{ + int sret; + + errno = 0; + sret = syscall(SYS_mprotect, ptr, size, prot); + return sret; +} + +int sys_munmap(void *ptr, size_t size) +{ + int sret; + + errno = 0; + sret = syscall(SYS_munmap, ptr, size); + return sret; +} + +static int sys_madvise(void *start, size_t len, int types) +{ + int sret; + + errno = 0; + sret = syscall(__NR_madvise, start, len, types); + return sret; +} + +void *addr1 = (void *)0x50000000; +void *addr2 = (void *)0x50004000; +void *addr3 = (void *)0x50008000; +void setup_single_address(int size, void **ptrOut) +{ + void *ptr; + + ptr = mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + assert(ptr != (void *)-1); + *ptrOut = ptr; +} + +void setup_single_fixed_address(int size, void **ptrOut) +{ + void *ptr; + + ptr = mmap(addr1, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + assert(ptr == (void *)addr1); + + *ptrOut = ptr; +} + +void clean_single_address(void *ptr, int size) +{ + int ret; + + ret = munmap(ptr, size); + assert(!ret); +} + +void seal_mprotect_single_address(void *ptr, int size) +{ + int ret; + + ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT); + assert(!ret); +} + +static void test_seal_addseals(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + /* adding seal one by one */ + ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT); + assert(!ret); + + ret = sys_mseal(ptr, size, MM_SEAL_MMAP); + assert(!ret); + + ret = sys_mseal(ptr, size, MM_SEAL_MREMAP); + assert(!ret); + + ret = sys_mseal(ptr, size, MM_SEAL_MSEAL); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_addseals_combined(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * 
page_size; + + setup_single_address(size, &ptr); + + ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT); + assert(!ret); + + /* adding multiple seals */ + ret = sys_mseal(ptr, size, + MM_SEAL_MPROTECT | MM_SEAL_MMAP | MM_SEAL_MREMAP | + MM_SEAL_MSEAL); + assert(!ret); + + /* not adding more seal type, so ok. */ + ret = sys_mseal(ptr, size, + MM_SEAL_MMAP | MM_SEAL_MREMAP | MM_SEAL_MSEAL); + assert(!ret); + + /* not adding more seal type, so ok. */ + ret = sys_mseal(ptr, size, MM_SEAL_MSEAL); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_addseals_reject(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT | MM_SEAL_MSEAL); + assert(!ret); + + ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT); + assert(!ret); + + /* MM_SEAL_MSEAL is set, so not allow new seal type . */ + ret = sys_mseal(ptr, size, + MM_SEAL_MPROTECT | MM_SEAL_MMAP | MM_SEAL_MSEAL); + assert(ret < 0); + + ret = sys_mseal(ptr, size, MM_SEAL_MMAP); + assert(ret < 0); + + ret = sys_mseal(ptr, size, MM_SEAL_MMAP | MM_SEAL_MSEAL); + assert(ret < 0); + + clean_single_address(ptr, size); +} + +static void test_seal_unmapped_start(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + // munmap 2 pages from ptr. + ret = sys_munmap(ptr, 2 * page_size); + assert(!ret); + + // mprotect will fail because 2 pages from ptr are unmapped. + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + assert(ret < 0); + + // mseal will fail because 2 pages from ptr are unmapped. + ret = sys_mseal(ptr, size, MM_SEAL_MSEAL); + assert(ret < 0); + + ret = sys_mseal(ptr + 2 * page_size, 2 * page_size, MM_SEAL_MSEAL); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_unmapped_middle(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + // munmap 2 pages from ptr + page. + ret = sys_munmap(ptr + page_size, 2 * page_size); + assert(!ret); + + // mprotect will fail, since size is 4 pages. + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + assert(ret < 0); + + // mseal will fail as well. + ret = sys_mseal(ptr, size, MM_SEAL_MSEAL); + assert(ret < 0); + + /* we still can add seal to the first page and last page*/ + ret = sys_mseal(ptr, page_size, MM_SEAL_MSEAL | MM_SEAL_MPROTECT); + assert(!ret); + + ret = sys_mseal(ptr + 3 * page_size, page_size, + MM_SEAL_MSEAL | MM_SEAL_MPROTECT); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_unmapped_end(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + // unmap last 2 pages. + ret = sys_munmap(ptr + 2 * page_size, 2 * page_size); + assert(!ret); + + //mprotect will fail since last 2 pages are unmapped. + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + assert(ret < 0); + + //mseal will fail as well. 
+ ret = sys_mseal(ptr, size, MM_SEAL_MSEAL); + assert(ret < 0); + + /* The first 2 pages is not sealed, and can add seals */ + ret = sys_mseal(ptr, 2 * page_size, MM_SEAL_MSEAL | MM_SEAL_MPROTECT); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_multiple_vmas(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + // use mprotect to split the vma into 3. + ret = sys_mprotect(ptr + page_size, 2 * page_size, + PROT_READ | PROT_WRITE); + assert(!ret); + + // mprotect will get applied to all 4 pages - 3 VMAs. + ret = sys_mprotect(ptr, size, PROT_READ); + assert(!ret); + + // use mprotect to split the vma into 3. + ret = sys_mprotect(ptr + page_size, 2 * page_size, + PROT_READ | PROT_WRITE); + assert(!ret); + + // mseal get applied to all 4 pages - 3 VMAs. + ret = sys_mseal(ptr, size, MM_SEAL_MSEAL); + assert(!ret); + + // verify additional seal type will fail after MM_SEAL_MSEAL set. + ret = sys_mseal(ptr, page_size, MM_SEAL_MSEAL | MM_SEAL_MPROTECT); + assert(ret < 0); + + ret = sys_mseal(ptr + page_size, 2 * page_size, + MM_SEAL_MSEAL | MM_SEAL_MPROTECT); + assert(ret < 0); + + ret = sys_mseal(ptr + 3 * page_size, page_size, + MM_SEAL_MSEAL | MM_SEAL_MPROTECT); + assert(ret < 0); + + clean_single_address(ptr, size); +} + +static void test_seal_split_start(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + /* use mprotect to split at middle */ + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + /* seal the first page, this will split the VMA */ + ret = sys_mseal(ptr, page_size, MM_SEAL_MSEAL); + assert(!ret); + + /* can't add seal to the first page */ + ret = sys_mseal(ptr, page_size, MM_SEAL_MSEAL | MM_SEAL_MPROTECT); + assert(ret < 0); + + /* add seal to the remain 3 pages */ + ret = sys_mseal(ptr + page_size, 3 * page_size, + MM_SEAL_MSEAL | MM_SEAL_MPROTECT); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_split_end(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_fixed_address(size, &ptr); + + /* use mprotect to split at middle */ + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + /* seal the last page */ + ret = sys_mseal(ptr + 3 * page_size, page_size, MM_SEAL_MSEAL); + assert(!ret); + + /* adding seal to the last page is rejected. 
*/ + ret = sys_mseal(ptr + 3 * page_size, page_size, + MM_SEAL_MSEAL | MM_SEAL_MPROTECT); + assert(ret < 0); + + /* Adding seals to the first 3 pages */ + ret = sys_mseal(ptr, 3 * page_size, MM_SEAL_MSEAL | MM_SEAL_MPROTECT); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_invalid_input(void) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_fixed_address(size, &ptr); + + /* invalid flag */ + ret = sys_mseal(ptr, size, 0x20); + assert(ret < 0); + + ret = sys_mseal(ptr, size, 0x31); + assert(ret < 0); + + ret = sys_mseal(ptr, size, 0x3F); + assert(ret < 0); + + /* unaligned address */ + ret = sys_mseal(ptr + 1, 2 * page_size, MM_SEAL_MSEAL); + assert(ret < 0); + + /* length too big */ + ret = sys_mseal(ptr, 5 * page_size, MM_SEAL_MSEAL); + assert(ret < 0); + + /* start is not in a valid VMA */ + ret = sys_mseal(ptr - page_size, 5 * page_size, MM_SEAL_MSEAL); + assert(ret < 0); + + clean_single_address(ptr, size); +} + +static void test_seal_zero_length(void) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + ret = sys_mprotect(ptr, 0, PROT_READ | PROT_WRITE); + assert(!ret); + + /* seal 0 length will be OK, same as mprotect */ + ret = sys_mseal(ptr, 0, MM_SEAL_MPROTECT); + assert(!ret); + + // verify the 4 pages are not sealed by previous call. + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_twice(void) +{ + LOG_TEST_ENTER(); + int ret; + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + + setup_single_address(size, &ptr); + + ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT); + assert(!ret); + + // apply the same seal will be OK. idempotent. + ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT); + assert(!ret); + + ret = sys_mseal(ptr, size, + MM_SEAL_MPROTECT | MM_SEAL_MMAP | MM_SEAL_MREMAP | + MM_SEAL_MSEAL); + assert(!ret); + + ret = sys_mseal(ptr, size, + MM_SEAL_MPROTECT | MM_SEAL_MMAP | MM_SEAL_MREMAP | + MM_SEAL_MSEAL); + assert(!ret); + + ret = sys_mseal(ptr, size, MM_SEAL_MSEAL); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_mprotect(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_mprotect_single_address(ptr, size); + + ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_start_mprotect(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_mprotect_single_address(ptr, page_size); + + // the first page is sealed. + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + // pages after the first page is not sealed. 
+ ret = sys_mprotect(ptr + page_size, page_size * 3, + PROT_READ | PROT_WRITE); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_end_mprotect(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_mprotect_single_address(ptr + page_size, 3 * page_size); + + /* first page is not sealed */ + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + /* last 3 page are sealed */ + ret = sys_mprotect(ptr + page_size, page_size * 3, + PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_mprotect_unalign_len(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) + seal_mprotect_single_address(ptr, page_size * 2 - 1); + + // 2 pages are sealed. + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + ret = sys_mprotect(ptr + page_size * 2, page_size, + PROT_READ | PROT_WRITE); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_mprotect_unalign_len_variant_2(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + if (seal) + seal_mprotect_single_address(ptr, page_size * 2 + 1); + + // 3 pages are sealed. + ret = sys_mprotect(ptr, page_size * 3, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + ret = sys_mprotect(ptr + page_size * 3, page_size, + PROT_READ | PROT_WRITE); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_mprotect_two_vma(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + /* use mprotect to split */ + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + assert(!ret); + + if (seal) + seal_mprotect_single_address(ptr, page_size * 4); + + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + ret = sys_mprotect(ptr + page_size * 2, page_size * 2, + PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_mprotect_two_vma_with_split(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + // use mprotect to split as two vma. + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + assert(!ret); + + // mseal can apply across 2 vma, also split them. + if (seal) + seal_mprotect_single_address(ptr + page_size, page_size * 2); + + // the first page is not sealed. + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + // the second page is sealed. + ret = sys_mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + // the third page is sealed. 
+ ret = sys_mprotect(ptr + 2 * page_size, page_size, + PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + // the fouth page is not sealed. + ret = sys_mprotect(ptr + 3 * page_size, page_size, + PROT_READ | PROT_WRITE); + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_mprotect_partial_mprotect(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + // seal one page. + if (seal) + seal_mprotect_single_address(ptr, page_size); + + // mprotect first 2 page will fail, since the first page are sealed. + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE); + if (seal) + assert(ret < 0); + else + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_mprotect_two_vma_with_gap(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + // use mprotect to split. + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + // use mprotect to split. + ret = sys_mprotect(ptr + 3 * page_size, page_size, + PROT_READ | PROT_WRITE); + assert(!ret); + + // use munmap to free two pages in the middle + ret = sys_munmap(ptr + page_size, 2 * page_size); + assert(!ret); + + // mprotect will fail, because there is a gap in the address. + // notes, internally mprotect still updated the first page. + ret = sys_mprotect(ptr, 4 * page_size, PROT_READ); + assert(ret < 0); + + // mseal will fail as well. + ret = sys_mseal(ptr, 4 * page_size, MM_SEAL_MPROTECT); + assert(ret < 0); + + // unlike mprotect, the first page is not sealed. + ret = sys_mprotect(ptr, page_size, PROT_READ); + assert(ret == 0); + + // the last page is not sealed. + ret = sys_mprotect(ptr + 3 * page_size, page_size, PROT_READ); + assert(ret == 0); + + clean_single_address(ptr, size); +} + +static void test_seal_mprotect_split(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + //use mprotect to split. + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + //seal all 4 pages. + if (seal) { + ret = sys_mseal(ptr, 4 * page_size, MM_SEAL_MPROTECT); + assert(!ret); + } + + //madvice is OK. + ret = sys_madvise(ptr, page_size * 2, MADV_WILLNEED); + assert(!ret); + + //mprotect is sealed. + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ); + if (seal) + assert(ret < 0); + else + assert(!ret); + + + ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ); + if (seal) + assert(ret < 0); + else + assert(!ret); + + clean_single_address(ptr, size); +} + +static void test_seal_mprotect_merge(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + // use mprotect to split one page. + ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE); + assert(!ret); + + // seal first two pages. + if (seal) { + ret = sys_mseal(ptr, 2 * page_size, MM_SEAL_MPROTECT); + assert(!ret); + } + + ret = sys_madvise(ptr, page_size, MADV_WILLNEED); + assert(!ret); + + // 2 pages are sealed. + ret = sys_mprotect(ptr, 2 * page_size, PROT_READ); + if (seal) + assert(ret < 0); + else + assert(!ret); + + // last 2 pages are not sealed. 
+ ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ); + assert(ret == 0); + + clean_single_address(ptr, size); +} + +static void test_seal_munmap(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_MUNMAP); + assert(!ret); + } + + // 4 pages are sealed. + ret = sys_munmap(ptr, size); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +/* + * allocate 4 pages, + * use mprotect to split it as two VMAs + * seal the whole range + * munmap will fail on both + */ +static void test_seal_munmap_two_vma(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + /* use mprotect to split */ + ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE); + assert(!ret); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_MUNMAP); + assert(!ret); + } + + ret = sys_munmap(ptr, page_size * 2); + if (seal) + assert(ret < 0); + else + assert(!ret); + + ret = sys_munmap(ptr + page_size, page_size * 2); + if (seal) + assert(ret < 0); + else + assert(!ret); +} + +/* + * allocate a VMA with 4 pages. + * munmap the middle 2 pages. + * seal the whole 4 pages, will fail. + * note: one of the pages are sealed + * munmap the first page will be OK. + * munmap the last page will be OK. + */ +static void test_seal_munmap_vma_with_gap(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + ret = sys_munmap(ptr + page_size, page_size * 2); + assert(!ret); + + if (seal) { + // can't have gap in the middle. + ret = sys_mseal(ptr, size, MM_SEAL_MUNMAP); + assert(ret < 0); + } + + ret = sys_munmap(ptr, page_size); + assert(!ret); + + ret = sys_munmap(ptr + page_size * 2, page_size); + assert(!ret); + + ret = sys_munmap(ptr, size); + assert(!ret); +} + +static void test_munmap_start_freed(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + + // unmap the first page. + ret = sys_munmap(ptr, page_size); + assert(!ret); + + // seal the last 3 pages. + if (seal) { + ret = sys_mseal(ptr + page_size, 3 * page_size, MM_SEAL_MUNMAP); + assert(!ret); + } + + // unmap from the first page. + ret = sys_munmap(ptr, size); + if (seal) { + assert(ret < 0); + + // use mprotect to verify page is not unmapped. + ret = sys_mprotect(ptr + page_size, 3 * page_size, PROT_READ); + assert(!ret); + } else + // note: this will be OK, even the first page is + // already unmapped. + assert(!ret); +} + +static void test_munmap_end_freed(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + // unmap last page. + ret = sys_munmap(ptr + page_size * 3, page_size); + assert(!ret); + + // seal the first 3 pages. + if (seal) { + ret = sys_mseal(ptr, 3 * page_size, MM_SEAL_MUNMAP); + assert(!ret); + } + + // unmap all pages. + ret = sys_munmap(ptr, size); + if (seal) { + assert(ret < 0); + + // use mprotect to verify page is not unmapped. 
+ ret = sys_mprotect(ptr, 3 * page_size, PROT_READ); + assert(!ret); + } else + assert(!ret); +} + +static void test_munmap_middle_freed(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + + setup_single_address(size, &ptr); + // unmap 2 pages in the middle. + ret = sys_munmap(ptr + page_size, page_size * 2); + assert(!ret); + + // seal the first page. + if (seal) { + ret = sys_mseal(ptr, page_size, MM_SEAL_MUNMAP); + assert(!ret); + } + + // munmap all 4 pages. + ret = sys_munmap(ptr, size); + if (seal) { + assert(ret < 0); + + // use mprotect to verify page is not unmapped. + ret = sys_mprotect(ptr, page_size, PROT_READ); + assert(!ret); + } else + assert(!ret); +} + +void test_seal_mremap_shrink(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_MREMAP); + assert(!ret); + } + + // shrink from 4 pages to 2 pages. + ret2 = mremap(ptr, size, 2 * page_size, 0, 0); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else { + assert(ret2 != MAP_FAILED); + clean_single_address(ret2, 2 * page_size); + } + clean_single_address(ptr, size); +} + +void test_seal_mremap_expand(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + // ummap last 2 pages. + ret = sys_munmap(ptr + 2 * page_size, 2 * page_size); + assert(!ret); + + if (seal) { + ret = sys_mseal(ptr, 2 * page_size, MM_SEAL_MREMAP); + assert(!ret); + } + + // expand from 2 page to 4 pages. + ret2 = mremap(ptr, 2 * page_size, 4 * page_size, 0, 0); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else { + assert(ret2 == ptr); + clean_single_address(ret2, 4 * page_size); + } + clean_single_address(ptr, size); +} + +void test_seal_mremap_move(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_MREMAP); + assert(!ret); + } + + // move from ptr to fixed address. + ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, addr1); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else { + assert(ret2 != MAP_FAILED); + clean_single_address(ret2, size); + } + clean_single_address(ptr, size); +} + +void test_seal_mmap_overwrite_prot(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + unsigned long page_size = getpagesize(); + unsigned long size = page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_MMAP); + assert(!ret); + } + + // use mmap to change protection. 
+ ret2 = mmap(ptr, size, PROT_NONE, + MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else + assert(ret2 == ptr); + + clean_single_address(ptr, size); +} + +void test_seal_mremap_shrink_fixed(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + setup_single_fixed_address(size, &newAddr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_MREMAP); + assert(!ret); + } + + // mremap to move and shrink to fixed address + ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED, + newAddr); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else + assert(ret2 == newAddr); + + clean_single_address(ptr, size); + clean_single_address(newAddr, size); +} + +void test_seal_mremap_expand_fixed(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(page_size, &ptr); + setup_single_fixed_address(size, &newAddr); + + if (seal) { + ret = sys_mseal(newAddr, size, MM_SEAL_MREMAP); + assert(!ret); + } + + // mremap to move and expand to fixed address + ret2 = mremap(ptr, page_size, size, MREMAP_MAYMOVE | MREMAP_FIXED, + newAddr); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else + assert(ret2 == newAddr); + + clean_single_address(ptr, page_size); + clean_single_address(newAddr, size); +} + +void test_seal_mremap_move_fixed(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + setup_single_fixed_address(size, &newAddr); + + if (seal) { + ret = sys_mseal(newAddr, size, MM_SEAL_MREMAP); + assert(!ret); + } + + // mremap to move to fixed address + ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newAddr); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else + assert(ret2 == newAddr); + + clean_single_address(ptr, page_size); + clean_single_address(newAddr, size); +} + +void test_seal_mremap_move_fixed_zero(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_MREMAP); + assert(!ret); + } + + /* + * MREMAP_FIXED can move the mapping to zero address + */ + ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED, + 0); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else { + assert(ret2 == 0); + clean_single_address(ret2, 2 * page_size); + } + clean_single_address(ptr, size); +} + +void test_seal_mremap_move_dontunmap(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_MREMAP); + assert(!ret); + } + + // mremap to move, and don't unmap src addr. 
+ ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, 0); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else { + assert(ret2 != MAP_FAILED); + clean_single_address(ret2, size); + } + + clean_single_address(ptr, page_size); +} + +void test_seal_mremap_move_dontunmap_anyaddr(bool seal) +{ + LOG_TEST_ENTER(); + void *ptr; + void *newAddr; + unsigned long page_size = getpagesize(); + unsigned long size = 4 * page_size; + int ret; + void *ret2; + + setup_single_address(size, &ptr); + + if (seal) { + ret = sys_mseal(ptr, size, MM_SEAL_MREMAP); + assert(!ret); + } + + /* + * The 0xdeaddead should not have effect on dest addr + * when MREMAP_DONTUNMAP is set. + */ + ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, + 0xdeaddead); + if (seal) { + assert(ret2 == MAP_FAILED); + assert(errno == EACCES); + } else { + assert(ret2 != MAP_FAILED); + assert((long)ret2 != 0xdeaddead); + clean_single_address(ret2, size); + } + + clean_single_address(ptr, page_size); +} + +int main(int argc, char **argv) +{ + test_seal_invalid_input(); + test_seal_addseals(); + test_seal_addseals_combined(); + test_seal_addseals_reject(); + test_seal_unmapped_start(); + test_seal_unmapped_middle(); + test_seal_unmapped_end(); + test_seal_multiple_vmas(); + test_seal_split_start(); + test_seal_split_end(); + + test_seal_zero_length(); + test_seal_twice(); + + test_seal_mprotect(false); + test_seal_mprotect(true); + + test_seal_start_mprotect(false); + test_seal_start_mprotect(true); + + test_seal_end_mprotect(false); + test_seal_end_mprotect(true); + + test_seal_mprotect_unalign_len(false); + test_seal_mprotect_unalign_len(true); + + test_seal_mprotect_unalign_len_variant_2(false); + test_seal_mprotect_unalign_len_variant_2(true); + + test_seal_mprotect_two_vma(false); + test_seal_mprotect_two_vma(true); + + test_seal_mprotect_two_vma_with_split(false); + test_seal_mprotect_two_vma_with_split(true); + + test_seal_mprotect_partial_mprotect(false); + test_seal_mprotect_partial_mprotect(true); + + test_seal_mprotect_two_vma_with_gap(false); + test_seal_mprotect_two_vma_with_gap(true); + + test_seal_mprotect_merge(false); + test_seal_mprotect_merge(true); + + test_seal_mprotect_split(false); + test_seal_mprotect_split(true); + + test_seal_munmap(false); + test_seal_munmap(true); + test_seal_munmap_two_vma(false); + test_seal_munmap_two_vma(true); + test_seal_munmap_vma_with_gap(false); + test_seal_munmap_vma_with_gap(true); + + test_munmap_start_freed(false); + test_munmap_start_freed(true); + test_munmap_middle_freed(false); + test_munmap_middle_freed(true); + test_munmap_end_freed(false); + test_munmap_end_freed(true); + + test_seal_mremap_shrink(false); + test_seal_mremap_shrink(true); + test_seal_mremap_expand(false); + test_seal_mremap_expand(true); + test_seal_mremap_move(false); + test_seal_mremap_move(true); + + test_seal_mremap_shrink_fixed(false); + test_seal_mremap_shrink_fixed(true); + test_seal_mremap_expand_fixed(false); + test_seal_mremap_expand_fixed(true); + test_seal_mremap_move_fixed(false); + test_seal_mremap_move_fixed(true); + test_seal_mremap_move_dontunmap(false); + test_seal_mremap_move_dontunmap(true); + test_seal_mremap_move_fixed_zero(false); + test_seal_mremap_move_fixed_zero(true); + test_seal_mremap_move_dontunmap_anyaddr(false); + test_seal_mremap_move_dontunmap_anyaddr(true); + + test_seal_mmap_overwrite_prot(false); + test_seal_mmap_overwrite_prot(true); + + printf("OK\n"); + return 0; +}
On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
Modern CPUs support memory permissions such as RW and NX bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1].
This seems like a confusing way to introduce the subject. Here, you're talking about page permissions, whereas (as far as I can tell), mseal() is about making _virtual_ addresses immutable, for some value of immutable.
Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management syscall. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
This [2] seems very generic and wide-ranging, not helpful. [5] was more useful to understand what you're trying to do.
The new mseal() is an architecture independent syscall, and with following signature:
mseal(void addr, size_t len, unsigned int types, unsigned int flags)
addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to comments in mseal.c. Those are also fully covered by the selftest.
Mmm. So when you say "continuous/allocated" what you really mean is "Must have contiguous VMAs" rather than "All pages in this range must be populated", yes?
types: bit mask to specify which syscall to seal, currently they are: MM_SEAL_MSEAL 0x1 MM_SEAL_MPROTECT 0x2 MM_SEAL_MUNMAP 0x4 MM_SEAL_MMAP 0x8 MM_SEAL_MREMAP 0x10
I don't understand why we want this level of granularity. The OpenBSD and XNU examples just say "This must be immutable*". For values of immutable that allow downgrading access (eg RW to RO or RX to RO), but not upgrading access (RW->RX, RO->*, RX->RW).
Each bit represents sealing for one specific syscall type, e.g. MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask is that the API is extendable, i.e. when needed, the sealing can be extended to madvise, mlock, etc. Backward compatibility is also easy.
Honestly, it feels too flexible. Why not just two flags to mprotect() -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- maybe for some things we want to be able to downgrade and for other things, we don't.
I'd like to see some discussion of how this interacts with mprotect(). As far as I can tell, the intent is to lock the protections/existance of the mapping, and not to force memory to stay in core. So it's fine for the kernel to swap out the page and set up a PTE as a swap entry. It's also fine for the kernel to mark PTEs as RO to catch page faults; we're concerned with the LOGICAL permissions, and not the page tables.
Hi Matthew.
Thanks for your comments and time to review the patchset.
On Mon, Oct 16, 2023 at 8:18 AM Matthew Wilcox willy@infradead.org wrote:
On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
Modern CPUs support memory permissions such as RW and NX bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1].
This seems like a confusing way to introduce the subject. Here, you're talking about page permissions, whereas (as far as I can tell), mseal() is about making _virtual_ addresses immutable, for some value of immutable.
Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management syscall. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
This [2] seems very generic and wide-ranging, not helpful. [5] was more useful to understand what you're trying to do.
The new mseal() is an architecture independent syscall, and with following signature:
mseal(void addr, size_t len, unsigned int types, unsigned int flags)
addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to comments in mseal.c. Those are also fully covered by the selftest.
Mmm. So when you say "continuous/allocated" what you really mean is "Must have contiguous VMAs" rather than "All pages in this range must be populated", yes?
There can't be a gap (unallocated memory) in the given range. Those are covered in selftest: test_seal_unmapped_start() test_seal_unmapped_middle() test_seal_unmapped_end() The comments in check_mm_seal() also mentioned that.
types: bit mask to specify which syscall to seal, currently they are: MM_SEAL_MSEAL 0x1 MM_SEAL_MPROTECT 0x2 MM_SEAL_MUNMAP 0x4 MM_SEAL_MMAP 0x8 MM_SEAL_MREMAP 0x10
I don't understand why we want this level of granularity. The OpenBSD and XNU examples just say "This must be immutable*". For values of immutable that allow downgrading access (eg RW to RO or RX to RO), but not upgrading access (RW->RX, RO->*, RX->RW).
Each bit represents sealing for one specific syscall type, e.g. MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask is that the API is extendable, i.e. when needed, the sealing can be extended to madvise, mlock, etc. Backward compatibility is also easy.
Honestly, it feels too flexible. Why not just two flags to mprotect() -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- maybe for some things we want to be able to downgrade and for other things, we don't.
Having a seal type per syscall type helps to add the feature incrementally. Applications also know exactly what is sealed.
I'm not against types such as IMMUTABLE and DOWNGRADEABLE, if we can define what it seals precisely. As Jann pointed out, there have other scenarios that potentially affect IMMUTABLE. Implementing all thoses will take time. And if we missed a case, we could introduce backward compatibility issues to the application. Bitmask will solve this nicely, i.e. application will need to apply the newly added sealing type explicitly.
I'd like to see some discussion of how this interacts with mprotect(). As far as I can tell, the intent is to lock the protections/existance of the mapping, and not to force memory to stay in core. So it's fine for the kernel to swap out the page and set up a PTE as a swap entry. It's also fine for the kernel to mark PTEs as RO to catch page faults; we're concerned with the LOGICAL permissions, and not the page tables.
Yes. That is correct.
-Jeff Xu
On Tue, Oct 17, 2023 at 01:34:35AM -0700, Jeff Xu wrote:
types: bit mask to specify which syscall to seal, currently they are: MM_SEAL_MSEAL 0x1 MM_SEAL_MPROTECT 0x2 MM_SEAL_MUNMAP 0x4 MM_SEAL_MMAP 0x8 MM_SEAL_MREMAP 0x10
I don't understand why we want this level of granularity. The OpenBSD and XNU examples just say "This must be immutable*". For values of immutable that allow downgrading access (eg RW to RO or RX to RO), but not upgrading access (RW->RX, RO->*, RX->RW).
Each bit represents sealing for one specific syscall type, e.g. MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask is that the API is extendable, i.e. when needed, the sealing can be extended to madvise, mlock, etc. Backward compatibility is also easy.
Honestly, it feels too flexible. Why not just two flags to mprotect() -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- maybe for some things we want to be able to downgrade and for other things, we don't.
Having a seal type per syscall type helps to add the feature incrementally. Applications also know exactly what is sealed.
You're trying to portray a disadvantage as an advantage, This is the seccomp disease, only worse because you're trying to deny individual syscalls instead of building up a list of permitted syscalls. If we introduce a new syscall tomorrow which can affect VMAs, we'll then make it the application's fault for not disabling that new syscall. That's terrible design!
I'm not against types such as IMMUTABLE and DOWNGRADEABLE, if we can define what it seals precisely. As Jann pointed out, there have other scenarios that potentially affect IMMUTABLE. Implementing all thoses will take time. And if we missed a case, we could introduce backward compatibility issues to the application. Bitmask will solve this nicely, i.e. application will need to apply the newly added sealing type explicitly.
It is your job to do this. You seem to have taken the attitude that you will give the Chrome team exactly what they asked for instead of trying to understand their requirements and give them what they need.
And don't send a v2 before discussion of v1 has come to an end. It uselessly fragments the discussion.
On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox willy@infradead.org wrote:
On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
Modern CPUs support memory permissions such as RW and NX bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1].
This seems like a confusing way to introduce the subject. Here, you're talking about page permissions, whereas (as far as I can tell), mseal() is about making _virtual_ addresses immutable, for some value of immutable.
Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management syscall. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
This [2] seems very generic and wide-ranging, not helpful. [5] was more useful to understand what you're trying to do.
The new mseal() is an architecture independent syscall, and with following signature:
mseal(void addr, size_t len, unsigned int types, unsigned int flags)
addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to comments in mseal.c. Those are also fully covered by the selftest.
Mmm. So when you say "continuous/allocated" what you really mean is "Must have contiguous VMAs" rather than "All pages in this range must be populated", yes?
types: bit mask to specify which syscall to seal, currently they are: MM_SEAL_MSEAL 0x1 MM_SEAL_MPROTECT 0x2 MM_SEAL_MUNMAP 0x4 MM_SEAL_MMAP 0x8 MM_SEAL_MREMAP 0x10
I don't understand why we want this level of granularity. The OpenBSD and XNU examples just say "This must be immutable*". For values of immutable that allow downgrading access (eg RW to RO or RX to RO), but not upgrading access (RW->RX, RO->*, RX->RW).
Each bit represents sealing for one specific syscall type, e.g. MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask is that the API is extendable, i.e. when needed, the sealing can be extended to madvise, mlock, etc. Backward compatibility is also easy.
Honestly, it feels too flexible. Why not just two flags to mprotect() -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- maybe for some things we want to be able to downgrade and for other things, we don't.
I think it's worth pointing out that this suggestion (with PROT_*) could easily integrate with mmap() and as such allow for one-shot mmap() + mseal(). If we consider the common case as 'addr = mmap(...); mseal(addr);', it definitely sounds like a performance win as we halve the number of syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem like common patterns.
On Tue, Oct 17, 2023 at 8:30 AM Pedro Falcato pedro.falcato@gmail.com wrote:
On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox willy@infradead.org wrote:
On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
Modern CPUs support memory permissions such as RW and NX bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1].
This seems like a confusing way to introduce the subject. Here, you're talking about page permissions, whereas (as far as I can tell), mseal() is about making _virtual_ addresses immutable, for some value of immutable.
Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management syscall. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
This [2] seems very generic and wide-ranging, not helpful. [5] was more useful to understand what you're trying to do.
The new mseal() is an architecture independent syscall, and with following signature:
mseal(void addr, size_t len, unsigned int types, unsigned int flags)
addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to comments in mseal.c. Those are also fully covered by the selftest.
Mmm. So when you say "continuous/allocated" what you really mean is "Must have contiguous VMAs" rather than "All pages in this range must be populated", yes?
types: bit mask to specify which syscall to seal, currently they are: MM_SEAL_MSEAL 0x1 MM_SEAL_MPROTECT 0x2 MM_SEAL_MUNMAP 0x4 MM_SEAL_MMAP 0x8 MM_SEAL_MREMAP 0x10
I don't understand why we want this level of granularity. The OpenBSD and XNU examples just say "This must be immutable*". For values of immutable that allow downgrading access (eg RW to RO or RX to RO), but not upgrading access (RW->RX, RO->*, RX->RW).
Each bit represents sealing for one specific syscall type, e.g. MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask is that the API is extendable, i.e. when needed, the sealing can be extended to madvise, mlock, etc. Backward compatibility is also easy.
Honestly, it feels too flexible. Why not just two flags to mprotect() -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- maybe for some things we want to be able to downgrade and for other things, we don't.
I think it's worth pointing out that this suggestion (with PROT_*) could easily integrate with mmap() and as such allow for one-shot mmap() + mseal(). If we consider the common case as 'addr = mmap(...); mseal(addr);', it definitely sounds like a performance win as we halve the number of syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem like common patterns.
Yes. mmap() can support sealing as well, and memory is allocated as immutable from begining. This is orthogonal to mseal() though. In case of ld.so, iiuc, memory can be first allocated as W, then later changed to RO, for example, during symbol resolution. The important point is that the application can decide what type of sealing it wants, and when to apply it. There needs to be an api(), that can be mseal() or mprotect2() or mimmutable(), the naming is not important to me.
mprotect() in linux have the following signature: int mprotect(void addr[.len], size_t len, int prot); the prot bitmasks are all taken here. I have not checked the prot field in mmap(), there might be bits left, even not, we could have mmap2(), so that is not an issue.
Thanks -Jeff
-- Pedro
On Tue, Oct 17, 2023 at 10:34 PM Jeff Xu jeffxu@google.com wrote:
On Tue, Oct 17, 2023 at 8:30 AM Pedro Falcato pedro.falcato@gmail.com wrote:
On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox willy@infradead.org wrote:
On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
Modern CPUs support memory permissions such as RW and NX bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1].
This seems like a confusing way to introduce the subject. Here, you're talking about page permissions, whereas (as far as I can tell), mseal() is about making _virtual_ addresses immutable, for some value of immutable.
Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management syscall. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
This [2] seems very generic and wide-ranging, not helpful. [5] was more useful to understand what you're trying to do.
The new mseal() is an architecture independent syscall, and with following signature:
mseal(void addr, size_t len, unsigned int types, unsigned int flags)
addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to comments in mseal.c. Those are also fully covered by the selftest.
Mmm. So when you say "continuous/allocated" what you really mean is "Must have contiguous VMAs" rather than "All pages in this range must be populated", yes?
types: bit mask to specify which syscall to seal, currently they are: MM_SEAL_MSEAL 0x1 MM_SEAL_MPROTECT 0x2 MM_SEAL_MUNMAP 0x4 MM_SEAL_MMAP 0x8 MM_SEAL_MREMAP 0x10
I don't understand why we want this level of granularity. The OpenBSD and XNU examples just say "This must be immutable*". For values of immutable that allow downgrading access (eg RW to RO or RX to RO), but not upgrading access (RW->RX, RO->*, RX->RW).
Each bit represents sealing for one specific syscall type, e.g. MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask is that the API is extendable, i.e. when needed, the sealing can be extended to madvise, mlock, etc. Backward compatibility is also easy.
Honestly, it feels too flexible. Why not just two flags to mprotect() -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- maybe for some things we want to be able to downgrade and for other things, we don't.
I think it's worth pointing out that this suggestion (with PROT_*) could easily integrate with mmap() and as such allow for one-shot mmap() + mseal(). If we consider the common case as 'addr = mmap(...); mseal(addr);', it definitely sounds like a performance win as we halve the number of syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem like common patterns.
Yes. mmap() can support sealing as well, and memory is allocated as immutable from begining. This is orthogonal to mseal() though.
I don't see how this can be orthogonal to mseal(). In the case we opt for adding PROT_ bits, we should more or less only need to adapt calc_vm_prot_bits(), and the rest should work without issues. vma merging won't merge vmas with different prots. The current interfaces (mmap and mprotect) would work just fine. In this case, mseal() or mimmutable() would only be needed if you need to set immutability over a range of VMAs with different permissions.
Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe The only annoying wrench in my plans here is that we have effectively run out of vm_flags bits in 32-bit architectures, so this approach as I described is not compatible with 32-bit.
In case of ld.so, iiuc, memory can be first allocated as W, then later changed to RO, for example, during symbol resolution. The important point is that the application can decide what type of sealing it wants, and when to apply it. There needs to be an api(), that can be mseal() or mprotect2() or mimmutable(), the naming is not important to me.
mprotect() in linux have the following signature: int mprotect(void addr[.len], size_t len, int prot); the prot bitmasks are all taken here. I have not checked the prot field in mmap(), there might be bits left, even not, we could have mmap2(), so that is not an issue.
I don't see what you mean. We have plenty of prot bits left (32-bits, and we seem to have around 8 different bits used). And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
The only issue seems to be that 32-bit ran out of vm_flags, but that can probably be worked around if need be.
On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato pedro.falcato@gmail.com wrote:
I think it's worth pointing out that this suggestion (with PROT_*) could easily integrate with mmap() and as such allow for one-shot mmap() + mseal(). If we consider the common case as 'addr = mmap(...); mseal(addr);', it definitely sounds like a performance win as we halve the number of syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem like common patterns.
Yes. mmap() can support sealing as well, and memory is allocated as immutable from begining. This is orthogonal to mseal() though.
I don't see how this can be orthogonal to mseal(). In the case we opt for adding PROT_ bits, we should more or less only need to adapt calc_vm_prot_bits(), and the rest should work without issues. vma merging won't merge vmas with different prots. The current interfaces (mmap and mprotect) would work just fine. In this case, mseal() or mimmutable() would only be needed if you need to set immutability over a range of VMAs with different permissions.
Agreed. By orthogonal, I meant we can have two APIs: mmap() and mseal()/mprotect() i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable(). Sealing can be applied after initial memory creation.
Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe The only annoying wrench in my plans here is that we have effectively run out of vm_flags bits in 32-bit architectures, so this approach as I described is not compatible with 32-bit.
In case of ld.so, iiuc, memory can be first allocated as W, then later changed to RO, for example, during symbol resolution. The important point is that the application can decide what type of sealing it wants, and when to apply it. There needs to be an api(), that can be mseal() or mprotect2() or mimmutable(), the naming is not important to me.
mprotect() in linux have the following signature: int mprotect(void addr[.len], size_t len, int prot); the prot bitmasks are all taken here. I have not checked the prot field in mmap(), there might be bits left, even not, we could have mmap2(), so that is not an issue.
I don't see what you mean. We have plenty of prot bits left (32-bits, and we seem to have around 8 different bits used). And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
The only issue seems to be that 32-bit ran out of vm_flags, but that can probably be worked around if need be.
Ah, you are right about this. vm_flags is full, and prot in mprotect() is not. Apology that I was wrong previously and caused confusion.
There is a slight difference in the syntax of mprotect and mseal. Each time when mprotect() is called, the kernel takes all of RWX bits and updates vm_flags, In other words, the application sets/unset each RWX, and kernel takes it.
In the mseal() case, the kernel will remember which seal types were applied previously, and the application doesn’t need to repeat all existing seal types in the next mseal(). Once a seal type is applied, it can’t be unsealed.
So if we want to use mprotect() for sealing, developers need to think of sealing bits differently than the rest of prot bits. It is a different programming model, might or might not be an obvious concept to developers.
There is a difference in input check and error handling as well. for mseal(), if a given address range has a gap (unallocated memory), or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is updated. For mprotect(), some VMAs can be updated, till an error happens to a VMA.
IMO: I think it makes sense for mseal() and mprotect() to be different in this. mseal() only checks vma struct so it is fast, and mprotect() goes deeper into HW.
Because of those two differences, personally I feel a new syscall might be worth the effort.
That said, mmap() + mprotect() is workable to me if that is what the community wants.
Thanks -Jeff
-Jeff
-- Pedro
Hi Pedro
Some followup on mmap() + mprotect():
On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu jeffxu@google.com wrote:
On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato pedro.falcato@gmail.com wrote:
I think it's worth pointing out that this suggestion (with PROT_*) could easily integrate with mmap() and as such allow for one-shot mmap() + mseal(). If we consider the common case as 'addr = mmap(...); mseal(addr);', it definitely sounds like a performance win as we halve the number of syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem like common patterns.
Yes. mmap() can support sealing as well, and memory is allocated as immutable from begining. This is orthogonal to mseal() though.
I don't see how this can be orthogonal to mseal(). In the case we opt for adding PROT_ bits, we should more or less only need to adapt calc_vm_prot_bits(), and the rest should work without issues. vma merging won't merge vmas with different prots. The current interfaces (mmap and mprotect) would work just fine. In this case, mseal() or mimmutable() would only be needed if you need to set immutability over a range of VMAs with different permissions.
Agreed. By orthogonal, I meant we can have two APIs: mmap() and mseal()/mprotect() i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable(). Sealing can be applied after initial memory creation.
Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe The only annoying wrench in my plans here is that we have effectively run out of vm_flags bits in 32-bit architectures, so this approach as I described is not compatible with 32-bit.
In case of ld.so, iiuc, memory can be first allocated as W, then later changed to RO, for example, during symbol resolution. The important point is that the application can decide what type of sealing it wants, and when to apply it. There needs to be an api(), that can be mseal() or mprotect2() or mimmutable(), the naming is not important to me.
mprotect() in linux have the following signature: int mprotect(void addr[.len], size_t len, int prot); the prot bitmasks are all taken here. I have not checked the prot field in mmap(), there might be bits left, even not, we could have mmap2(), so that is not an issue.
I don't see what you mean. We have plenty of prot bits left (32-bits, and we seem to have around 8 different bits used). And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
The only issue seems to be that 32-bit ran out of vm_flags, but that can probably be worked around if need be.
Ah, you are right about this. vm_flags is full, and prot in mprotect() is not. Apology that I was wrong previously and caused confusion.
There is a slight difference in the syntax of mprotect and mseal. Each time when mprotect() is called, the kernel takes all of RWX bits and updates vm_flags, In other words, the application sets/unset each RWX, and kernel takes it.
In the mseal() case, the kernel will remember which seal types were applied previously, and the application doesn’t need to repeat all existing seal types in the next mseal(). Once a seal type is applied, it can’t be unsealed.
So if we want to use mprotect() for sealing, developers need to think of sealing bits differently than the rest of prot bits. It is a different programming model, might or might not be an obvious concept to developers.
This probably doesn't matter much to developers. We can enforce the sealing bit to be the same as the rest of PROT bits. If mprotect() tries to unset sealing, it will fail.
There is a difference in input check and error handling as well. for mseal(), if a given address range has a gap (unallocated memory), or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is updated. For mprotect(), some VMAs can be updated, till an error happens to a VMA.
This difference doesn't matter much.
For mprotect()/mmap(), is Linux implementation limited by POSIX ? This can be made backward compatible. If there is no objection to adding linux specific values in mmap() and mprotect(), This works for me.
Thanks for your input. -Jeff
On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu jeffxu@google.com wrote:
Hi Pedro
Some followup on mmap() + mprotect():
On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu jeffxu@google.com wrote:
On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato pedro.falcato@gmail.com wrote:
I think it's worth pointing out that this suggestion (with PROT_*) could easily integrate with mmap() and as such allow for one-shot mmap() + mseal(). If we consider the common case as 'addr = mmap(...); mseal(addr);', it definitely sounds like a performance win as we halve the number of syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem like common patterns.
Yes. mmap() can support sealing as well, and memory is allocated as immutable from begining. This is orthogonal to mseal() though.
I don't see how this can be orthogonal to mseal(). In the case we opt for adding PROT_ bits, we should more or less only need to adapt calc_vm_prot_bits(), and the rest should work without issues. vma merging won't merge vmas with different prots. The current interfaces (mmap and mprotect) would work just fine. In this case, mseal() or mimmutable() would only be needed if you need to set immutability over a range of VMAs with different permissions.
Agreed. By orthogonal, I meant we can have two APIs: mmap() and mseal()/mprotect() i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable(). Sealing can be applied after initial memory creation.
Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe The only annoying wrench in my plans here is that we have effectively run out of vm_flags bits in 32-bit architectures, so this approach as I described is not compatible with 32-bit.
In case of ld.so, iiuc, memory can be first allocated as W, then later changed to RO, for example, during symbol resolution. The important point is that the application can decide what type of sealing it wants, and when to apply it. There needs to be an api(), that can be mseal() or mprotect2() or mimmutable(), the naming is not important to me.
mprotect() in linux have the following signature: int mprotect(void addr[.len], size_t len, int prot); the prot bitmasks are all taken here. I have not checked the prot field in mmap(), there might be bits left, even not, we could have mmap2(), so that is not an issue.
I don't see what you mean. We have plenty of prot bits left (32-bits, and we seem to have around 8 different bits used). And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
The only issue seems to be that 32-bit ran out of vm_flags, but that can probably be worked around if need be.
Ah, you are right about this. vm_flags is full, and prot in mprotect() is not. Apology that I was wrong previously and caused confusion.
There is a slight difference in the syntax of mprotect and mseal. Each time when mprotect() is called, the kernel takes all of RWX bits and updates vm_flags, In other words, the application sets/unset each RWX, and kernel takes it.
In the mseal() case, the kernel will remember which seal types were applied previously, and the application doesn’t need to repeat all existing seal types in the next mseal(). Once a seal type is applied, it can’t be unsealed.
So if we want to use mprotect() for sealing, developers need to think of sealing bits differently than the rest of prot bits. It is a different programming model, might or might not be an obvious concept to developers.
This probably doesn't matter much to developers. We can enforce the sealing bit to be the same as the rest of PROT bits. If mprotect() tries to unset sealing, it will fail.
Yep. Erroneous or malicious mprotects would all be caught. However, if we add a PROT_DOWNGRADEABLE (that could let you, lets say, mprotect() to less permissions or even downright munmap()) you'd want some care to preserve that bit when setting permissions.
There is a difference in input check and error handling as well. for mseal(), if a given address range has a gap (unallocated memory), or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is updated. For mprotect(), some VMAs can be updated, till an error happens to a VMA.
This difference doesn't matter much.
For mprotect()/mmap(), is Linux implementation limited by POSIX ?
No. POSIX works merely as a baseline that UNIX systems aim towards. You can (and very frequently do) extend POSIX interfaces (in fact, it's how most of POSIX was written, through sheer "design-by-committee" on a bunch of UNIX systems' extensions).
This can be made backward compatible. If there is no objection to adding linux specific values in mmap() and mprotect(), This works for me.
Linux already has system-specific values for PROT_ (PROT_BTI, PROT_MTE, PROT_GROWSUP, PROT_GROWSDOWN, etc). Whether this is the right interface is another question. I do like it a lot, but there's of course value in being compatible with existing solutions (like mimmutable()).
On Thu, 19 Oct 2023 at 15:47, Pedro Falcato pedro.falcato@gmail.com wrote:
For mprotect()/mmap(), is Linux implementation limited by POSIX ?
No. POSIX works merely as a baseline that UNIX systems aim towards. You can (and very frequently do) extend POSIX interfaces (in fact, it's how most of POSIX was written, through sheer "design-by-committee" on a bunch of UNIX systems' extensions).
We can in extreme circumstances actually go further than that, and not only extend on POSIX requirements, but actively even violate them.
It does need a very good reason, though, but it has happened when POSIX requirements were simply actively wrong.
For example, at one point POSIX required
int accept(int s, struct sockaddr *addr, size_t *addrlen);
which was simply completely wrong. It's utter shite, and didn't actually match any reality.
The 'addrlen' parameter is 'int *', and POSIX suddenly trying to make it "size_t" was completely unacceptable.
So we ignored it, told POSIX people that they were full of sh*t, and they eventually fixed it in the next version (by introducing a "socklen_t" that had better be the same as "int").
So POSIX can simply be wrong.
Also, if it turns out that we had some ABI that wasn't POSIX-compatible, the whole "don't break user space" will take precedence over any POSIX concerns, and we will not "fix" our system call if people already use our old semantics.
So in that case, we generally end up with a new system call (or new flags) instead.
Or sometimes it just is such a small detail that nobody cares - POSIX also has a notion of documenting areas of non-conformance, and people who really care end up having notions like "conformance vs _strict_ conformance".
Linus
On Thu, Oct 19, 2023 at 4:06 PM Linus Torvalds torvalds@linux-foundation.org wrote:
Thanks Linus for clarifying the guidelines on POSIX in Linux.
-Jeff
On Thu, Oct 19, 2023 at 3:47 PM Pedro Falcato pedro.falcato@gmail.com wrote:
On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu jeffxu@google.com wrote:
Hi Pedro
Some followup on mmap() + mprotect():
On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu jeffxu@google.com wrote:
On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato pedro.falcato@gmail.com wrote:
I think it's worth pointing out that this suggestion (with PROT_*) could easily integrate with mmap() and as such allow for one-shot mmap() + mseal(). If we consider the common case as 'addr = mmap(...); mseal(addr);', it definitely sounds like a performance win as we halve the number of syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem like common patterns.
Yes. mmap() can support sealing as well, and memory is allocated as immutable from the beginning. This is orthogonal to mseal() though.
I don't see how this can be orthogonal to mseal(). In case we opt for adding PROT_ bits, we should more or less only need to adapt calc_vm_prot_bits(), and the rest should work without issues. VMA merging won't merge VMAs with different prots. The current interfaces (mmap and mprotect) would work just fine. In this case, mseal() or mimmutable() would only be needed if you need to set immutability over a range of VMAs with different permissions.
Agreed. By orthogonal, I meant we can have two APIs: mmap() and mseal()/mprotect(), i.e. we can't rely on mmap() alone without mseal()/mprotect()/mimmutable(). Sealing can be applied after the initial memory creation.
Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe The only annoying wrench in my plans here is that we have effectively run out of vm_flags bits in 32-bit architectures, so this approach as I described is not compatible with 32-bit.
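For readers who haven't looked at that helper, a rough sketch of the kind of change being alluded to is below. PROT_SEAL and VM_SEALED are made-up names (the same hypothetical bit as in the earlier sketch); the three existing translations and the arch hook are what the real calc_vm_prot_bits() in include/linux/mman.h does today:

/* Hypothetical sketch only: PROT_SEAL and VM_SEALED do not exist. */
#define PROT_SEAL	0x04000000		/* made-up uapi bit */
#define VM_SEALED	VM_HIGH_ARCH_5		/* made-up vm_flags bit, 64-bit only */

static inline unsigned long
calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
{
	return _calc_vm_trans(prot, PROT_READ,  VM_READ)  |
	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC)  |
	       _calc_vm_trans(prot, PROT_SEAL,  VM_SEALED) |	/* new */
	       arch_calc_vm_prot_bits(prot, pkey);
}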
In the case of ld.so, IIUC, memory can first be allocated as W, then later changed to RO, for example during symbol resolution. The important point is that the application can decide what type of sealing it wants and when to apply it. There needs to be an API; that can be mseal() or mprotect2() or mimmutable(), the naming is not important to me.
Thanks, Pedro, for providing examples of mm extensions to POSIX. This opens up more design options for solving the sealing problem. I will take a few days to research them.
-Jeff
On Mon, 16 Oct 2023 at 07:38, jeffxu@chromium.org wrote:
This patchset proposes a new mseal() syscall for the Linux kernel.
So I have no objections to adding some kind of "lock down memory mappings" model, but this isn't it.
First off, the simple stuff: the commit messages are worthless. Having
check seal for mmap(2)
as the commit message is not even remotely acceptable, to pick one random example from the series (7/8).
But that doesn't matter much, since I think the more fundamental problems are much worse:
- the whole "ON_BEHALF_OF_KERNEL" and "ON_BEHALF_OF_USERSPACE" is just complete noise and totally illogical. The whole concept needs to be redone.
Example of complete nonsense (again, picking 7/8, but that's again just a random example):
@@ -3017,8 +3022,8 @@ SYSCALL_DEFINE5(remap_file_pages,
 		flags |= MAP_LOCKED;
 	file = get_file(vma->vm_file);
-	ret = do_mmap(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate, NULL);
+	ret = do_mmap(vma->vm_file, start, size, prot, flags, pgoff,
+		      &populate, NULL, ON_BEHALF_OF_KERNEL);
 	fput(file);
 out:
 	mmap_write_unlock(mm);
Christ. That's *literally* the remap_file_pages() system call definition. No way in hell does "ON_BEHALF_OF_KERNEL" make any sense in this context.
It's not the only situation. mremap() has the same thing. vm_munmap() has the same thing. vm_brk_flags() has the same thing. None of them make any sense.
But even if those obvious kinds of complete mis-designs were to be individually fixed, the whole naming and concept is bogus. The "ON_BEHALF_OF_KERNEL" thing seems to actually just mean "don't check sealing". Naming should describe what the thing *means*, not some random policy thing that may or may not be at all relevant.
- the whole MM_SEAL_xx vs VM_SEAL_xx artificial distinction needs to go away.
Not only is it unacceptable to pointlessly create two different name spaces for no obvious reason, code like this (from 1/8) should not exist:
if (types & MM_SEAL_MSEAL)
newtypes |= VM_SEAL_MSEAL;
if (types & MM_SEAL_MPROTECT)
newtypes |= VM_SEAL_MPROTECT;
if (types & MM_SEAL_MUNMAP)
newtypes |= VM_SEAL_MUNMAP;
if (types & MM_SEAL_MMAP)
newtypes |= VM_SEAL_MMAP;
if (types & MM_SEAL_MREMAP)
newtypes |= VM_SEAL_MREMAP;
Sure, we have that in some cases when there was a *reason* for namespacing imposed on us from an API standpoint (ie the "open()" flags that get turned into FMODE_xyz flags, or the MS_xyz mount flags being turned into MNT_xyz flags), but those tend to be signs of problems in the system call API where some mixup made it impossible to use the flags directly (most commonly because there is some extended internal representation of said things).
For example, the MS_xyz namespace is a combination of "these are flags for the new mount" (like MS_RDONLY) and "this is how you should mount it" (like MS_REMOUNT), and so the user interface has that odd mixing of different things, and the MNT_xyz namespace is a distillation of the former.
But we certainly should not strive to introduce *new* interfaces that start out with this kind of mixup and pointless "translate from one bit to another" code.
- And finally (for now), I hate that MM_ACTION_xyz thing too!
Why do we have MM_ACTION_MREMAP, and pointless code like this (from 3/8):
switch (action) {
case MM_ACTION_MPROTECT:
if (vma->vm_seals & VM_SEAL_MPROTECT)
return false;
break;
when the sane thing to do is to use the *same* MM_SEAL_xyz bitmask and sealing bitmask and just say
if (vma->vm_seal & vm_action) return -EPERM;
IOW, you have pointlessly introduced not *two* different namespaces, but *three*. All doing the exact same thing, and all just causing pointless and ugly code to "translate" the actions of one into the model of another.
So get rid of the "MM_ACTION_xyz" thing. Get rid of the "VM_SEAL_xyz" thing. Use *one* single "these are the sealing bits".
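As a rough illustration of that single-bitmask model (can_modify_mm() and vm_seals follow the patch's naming, but the body here is only a sketch):

static bool can_modify_mm(struct mm_struct *mm, unsigned long start,
			  unsigned long end, unsigned long actions)
{
	struct vm_area_struct *vma;
	VMA_ITERATOR(vmi, mm, start);

	/*
	 * One namespace: the caller passes the same MM_SEAL_* bits that
	 * mseal() stored in vma->vm_seals.  If any VMA in the range carries
	 * a matching seal, the whole operation is refused before any VMA
	 * is modified.
	 */
	for_each_vma_range(vmi, vma, end)
		if (vma->vm_seals & actions)
			return false;

	return true;
}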
And get rid of "enum caller_origin" entirely. I don't know what the right model for that thing is, but that isn't it.
*Maybe* the right model is some MM_SEAL_OVERRIDE bit that user space cannot set, but that the kernel can use internally - and if that is the right model, then dammit, the *uses* should be very very obvious why the override is fine, because that remap_file_pages() use sure as hell was *not* obvious.
In fact, it's not at all obvious why anything should override the sealing bits - EVER. So I'm not convinced that the right model is "replace ON_BEHALF_OF_KERNEL with MM_SEAL_OVERRIDE". Why would we *ever* want to override sealing? It sounds like complete garbage. Most of the users seem to be things like "execve()", which is nonsensical, since the VM wouldn't have been sealed at that point _anyway_, so having a "don't bother checking sealing bits" flag seems entirely useless.
Anyway, this is all a resounding NAK on this series in this form. My complaints are not some kind of small "fix this up". These are fundamental issues.
Linus
Hello Linus,
Thank you for taking the time to review this patch set.
On Mon, Oct 16, 2023 at 10:23 AM Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, 16 Oct 2023 at 07:38, jeffxu@chromium.org wrote:
This patchset proposes a new mseal() syscall for the Linux kernel.
So I have no objections to adding some kind of "lock down memory mappings" model, but this isn't it.
First off, the simple stuff: the commit messages are worthless. Having
check seal for mmap(2)
as the commit message is not even remotely acceptable, to pick one random example from the series (7/8).
But that doesn't matter much, since I think the more fundamental problems are much worse:
- the whole "ON_BEHALF_OF_KERNEL" and "ON_BEHALF_OF_USERSPACE" is
just complete noise and totally illogical. The whole concept needs to be redone.
Example of complete nonsense (again, picking 7/8, but that's again just a random example):
@@ -3017,8 +3022,8 @@ SYSCALL_DEFINE5(remap_file_pages,
 		flags |= MAP_LOCKED;
 	file = get_file(vma->vm_file);
-	ret = do_mmap(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate, NULL);
+	ret = do_mmap(vma->vm_file, start, size, prot, flags, pgoff,
+		      &populate, NULL, ON_BEHALF_OF_KERNEL);
 	fput(file);
 out:
 	mmap_write_unlock(mm);
Christ. That's *literally* the remap_file_pages() system call definition. No way in hell does "ON_BEHALF_OF_KERNEL" make any sense in this context.
It's not the only situation. mremap() has the same thing. vm_munmap() has the same thing. vm_brk_flags() has the same thing. None of them make any sense.
But even if those obvious kinds of complete mis-designs were to be individually fixed, the whole naming and concept is bogus. The "ON_BEHALF_OF_KERNEL" thing seems to actually just mean "don't check sealing". Naming should describe what the thing *means*, not some random policy thing that may or may not be at all relevant.
I apologize that I didn't think of a better name for ON_BEHALF_OF_XX, and I should have written a clearer commit message.
I prepared a V2 patchset with a more detailed commit message, hopefully that will help to explain the design.
Indeed, the ON_BEHALF_OF_XX is confusing, especially with remap_file_pages(). remap_file_pages(2) is not supported in this patch set; it covers mprotect(2), mmap(2), munmap(2), and mremap(2) as the first feature set. I could extend the sealing to more syscalls if that is determined necessary from the outcome of this discussion. The initial set of 4 syscalls was chosen based on Chrome's initial wish list.
Regarding the ON_BEHALF_OF flag, my intention is to have a flag set at the syscall entry points of mprotect(2), munmap(2), mremap(2), and mmap(2), and pass that flag along the call stack until it reaches can_modify_mm(), which does the actual check for the sealing type.
It is probably worth noting that I chose to check one and only one sealing type per syscall, i.e. munmap(2) checks MM_SEAL_MUNMAP only. With this approach, sealing can be implemented incrementally. For example, when implementing sealing for munmap(2), I don't need to care that mremap(2) can also call internal functions to unmap the VMAs; mremap(2) will be sealed by MM_SEAL_MREMAP and dealt with separately.
This approach also allows developers to expand sealing to madvise(), mlock(), or whatever other syscalls or cases modify a VMA's metadata. As Jann points out, there is a list of cases that we might care about. Having all of those implemented will take time; using bitmasks helps to add them incrementally. An application remains backward compatible when a new sealing type is added, i.e. it has to set the new sealing type explicitly.
- the whole MM_SEAL_xx vs VM_SEAL_xx artificial distinction needs to go away.
Not only is it unacceptable to pointlessly create two different name spaces for no obvious reason, code like this (from 1/8) should not exist:
if (types & MM_SEAL_MSEAL)
newtypes |= VM_SEAL_MSEAL;
if (types & MM_SEAL_MPROTECT)
newtypes |= VM_SEAL_MPROTECT;
if (types & MM_SEAL_MUNMAP)
newtypes |= VM_SEAL_MUNMAP;
if (types & MM_SEAL_MMAP)
newtypes |= VM_SEAL_MMAP;
if (types & MM_SEAL_MREMAP)
newtypes |= VM_SEAL_MREMAP;
Sure, we have that in some cases when there was a *reason* for namespacing imposed on us from an API standpoint (ie the "open()" flags that get turned into FMODE_xyz flags, or the MS_xyz mount flags being turned into MNT_xyz flags), but those tend to be signs of problems in the system call API where some mixup made it impossible to use the flags directly (most commonly because there is some extended internal representation of said things).
For example, the MS_xyz namespace is a combination of "these are flags for the new mount" (like MS_RDONLY) and "this is how you should mount it" (like MS_REMOUNT), and so the user interface has that odd mixing of different things, and the MNT_xyz namespace is a distillation of the former.
But we certainly should not strive to introduce *new* interfaces that start out with this kind of mixup and pointless "translate from one bit to another" code.
The two namespaces can go away; that means the bitmask will be stored as is in vm_seals in the VMA struct. (1/8) copied below.
+++ b/include/linux/mm_types.h
@@ -660,6 +660,13 @@ struct vm_area_struct {
+	unsigned long vm_seals;		/* seal flags, see mm.h. */
My original considerations were:
1. vm_seals is a new field, and mseal() currently uses 5 bits. vm_seals can be repurposed to store other VMA flags in the future, so having two namespaces (the API one and an internal one) can be useful.
2. vm_flags is full, and it seems to me there is pending work on expanding vm_flags [1]. If that happens, we will need translation logic.
[1] https://lore.kernel.org/linux-mm/4F6CA298.4000301@jp.fujitsu.com/
I might have over-engineered this, so I removed the VM_SEAL_XX in V2.
- And finally (for now), I hate that MM_ACTION_xyz thing too!
Why do we have MM_ACTION_MREMAP, and pointless like this (from 3/8):
switch (action) {
case MM_ACTION_MPROTECT:
if (vma->vm_seals & VM_SEAL_MPROTECT)
return false;
break;
when the sane thing to do is to use the *same* MM_SEAL_xyz bitmask and sealing bitmask and just say
if (vma->vm_seal & vm_action) return -EPERM;
Makes sense. My original thought was that can_modify_vma() would check one and only one seal type, and having an enum type would enforce that. That restriction feels unnecessary, so I removed the action type in V2.
IOW, you have pointlessly introduced not *two* different namespaces, but *three*. All doing the exact same thing, and all just causing pointless and ugly code to "translate" the actions of one into the model of another.
So get rid of the "MM_ACTION_xyz" thing. Get rid of the "VM_SEAL_xyz" thing. Use *one* single "these are the sealing bits".
And get rid of "enum caller_origin" entirely. I don't know what the right model for that thing is, but that isn't it.
*Maybe* the right model is some MM_SEAL_OVERRIDE bit that user space cannot set, but that the kernel can use internally - and if that is the right model, then dammit, the *uses* should be very very obvious why the override is fine, because that remap_file_pages() use sure as hell was *not* obvious.
In fact, it's not at all obvious why anything should override the sealing bits - EVER. So I'm not convinced that the right model is "replace ON_BEHALF_OF_KERNEL with MM_SEAL_OVERRIDE". Why would we *ever* want to override sealing? It sounds like complete garbage. Most of the users seem to be things like "execve()", which is nonsensical, since the VM wouldn't have been sealed at that point _anyway_, so having a "don't bother checking sealing bits" flag seems entirely useless.
Would the new commit message and comments in V2 help explain the design better? (I will send it shortly.)
Another code change I can make to help readability (not in V2) is to set and pass a checkSeals flag from the syscall entry point all the way to can_modify_vma(). Currently, I didn't do that where I had checked that the internal function is only used by a syscall entry point, e.g. in the do_mprotect_pkey(), mremap_to(), and ksys_mmap_pgoff() cases. Doing that does increase the size of the patch set, though.
Thanks. -Jeff
On Tue, 17 Oct 2023 at 02:08, Jeff Xu jeffxu@google.com wrote:
It is probably worth noting that I chose to check one and only one sealing type per syscall, i.e. munmap(2) checks MM_SEAL_MUNMAP only.
Yeah, this is wrong.
It's wrong exactly because other system calls will unmap things too.
Using mmap() to over-map something will unmap the old one.
Same goes for mremap() to move over an existing mapping.
So the whole "do things by the name of the system call" is not workable.
All that matters is what the system calls *do*, not what their name is.
And mmap() will fundamentally munmap() as part of the action.
This is why I absolutely hated the old "ON_BEHALF_OF_xyz" flag, and why I still absolutely hate the "randomly pass different sealing flags to do_munmap()".
You should *not* be passing any flags at all to do_munmap(). Because *regardless* of who calls it, and regardless of which system call started the action, do_munmap() unmaps a virtual memory area.
See what I'm saying?
Linus
Linus Torvalds torvalds@linux-foundation.org wrote:
I agree completely...
mseal() is a clone of mimmutable(2), but with an extremely over-complicated API based upon dubious arguments.
I designed mimmutable(2) [1] in OpenBSD; it took about a year to get all the components working correctly. There were many intermediate APIs during development, but in the end the API is simply:
int mimmutable(void *addr, size_t len);
The kernel code for mimmutable() traverses the specified VA range. In that range, it will find unmapped sub-regions (which are ignored) and mapped sub-regions. For these mapped regions, it does not care what the permissions are; it just marks each sub-region as immutable.
Later on, when any VM operation request upon a VA range attempts to (1) change the permissions, (2) re-map on top, or (3) dispose of the mapping, that operation is refused with errno EPERM. We don't care where the request comes from (i.e. what system call). It is a behaviour of the VM system when asked to act upon a VA sub-range mapping.
Very simple semantics.
The only case where the immutable marker is ignored is during address space teardown as a result of process termination.
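For reference, a minimal usage sketch on OpenBSD (error handling omitted):

#include <sys/mman.h>

int main(void)
{
	/* A page of data that is initialized once, then frozen: first the
	 * contents become read-only, then the mapping itself is locked. */
	char *cfg = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANON, -1, 0);
	cfg[0] = 1;			/* ... initialize the data ... */

	mprotect(cfg, 4096, PROT_READ);
	mimmutable(cfg, 4096);

	/* From here on, mprotect/mmap/munmap over this range fail with EPERM. */
	return 0;
}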
In his submission of this API, Jeff Xu makes three claims I find dubious:
Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
I specifically designed mimmutable(2) with chrome in mind, and the chrome binary running on OpenBSD is full of immutable mappings. All the library regions automatically become immutable because ld.so can infer it and do the mimmutable calls for the right subregions.
So this chrome work has already been done by OpenBSD, and it is dead simple. During early development I thought mimmutable(2) would be called by applications or libraries, but I was dead wrong: 99.9% of calls are from ld.so, and no applications need to call it; these are the two exceptions:
In OpenBSD, mimmutable() is used in libc malloc() to lock down some data structures at initialization time, so they cannot be attacked to create an invariant for use in ROP return-to-libc style methods.
In Chrome, there is a v8_flags variable rounded out to a full page and placed in .data. Chrome initializes this variable and wants to mprotect it PROT_READ, but .data has been made immutable by ld.so. So we force this page into a new ELF section called "openbsd.mutable", which also behaves RW like .data. Where Chrome does the mprotect PROT_READ, it now also performs mimmutable() on that page.
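A sketch of that pattern, as I understand it (the section name comes from the mimmutable(2) caveat quoted further down in this thread; the variable and page size are simplified):

#include <sys/mman.h>

/* Keep the flags out of .data (which ld.so marks immutable) and place them
 * in the explicitly mutable section instead. */
static char v8_flags[4096]
	__attribute__((aligned(4096), section(".openbsd.mutable")));

static void freeze_v8_flags(void)
{
	/* Once initialization is done: drop write, then seal the page
	 * ourselves, since ld.so deliberately left this section alone. */
	mprotect(v8_flags, sizeof(v8_flags), PROT_READ);
	mimmutable(v8_flags, sizeof(v8_flags));
}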
Having a seal type per syscall type helps to add the feature incrementally.
Yet, somehow OpenBSD didn't do it per syscall, and we managed to make our entire base operating system and 10,000+ applications automatically receive the benefits. In one year's effort. The only application which cared about it was chrome, described in the previous paragraph.
I think Jeff's idea here is super dangerous. What will actually happen is people will add a few mseal() sub-operations and think the job is done. It isn't done. They need all the mseal() requests, or the mappings are not safe.
It is very counterproductive to provide developers a complex API that has insecure suboperations.
Applications also know exactly what is sealed.
Actually, applications won't know, because there is no tooling to inspect this -- but I will argue further that applications don't need to know. Immutable marking is a system decision, not a program decision.
I'll close by asking for a new look at the mimmutable(2) API we settled on for OpenBSD. I think there is nothing wrong with it. I'm willing to help guide glibc / ld.so / musl teams through the problems they may find along the way, I know where the skeletons are buried. Two in particular: -znow RELRO already today, and xonly-text in the future.
On Tue, 17 Oct 2023 at 11:20, Theo de Raadt deraadt@openbsd.org wrote:
The only case where the immutable marker is ignored is during address space teardown as a result of process termination.
.. and presumably also execve()?
I do like us starting with just "mimmutable()", since it already exists. Particularly if chrome already knows how to use it.
Maybe add a flag field (require it to be zero initially) just to allow any future expansion. Maybe the chrome team has *wanted* to have some finer granularity thing and currently doesn't use mimmutable() in some case?
Linus
Linus Torvalds torvalds@linux-foundation.org wrote:
On Tue, 17 Oct 2023 at 11:20, Theo de Raadt deraadt@openbsd.org wrote:
The only case where the immutable marker is ignored is during address space teardown as a result of process termination.
.. and presumably also execve()?
Ah yes of course that also.
I do like us starting with just "mimmutable()", since it already exists. Particularly if chrome already knows how to use it.
Well, our chrome fork knows how to use it. Robert Nagy in our group maintains 1280 patches to make chrome work on OpenBSD. Google ignores them and will not upstream them. Some of these changes are security related, and they ignore them. Other changes are to cope with security work we've done on our own, for example: JIT changes from Stephen@google for mandatory IBT which google hasn't upstreamed yet, impacts due to PROT_EXEC-only mappings, etc.
But the only chrome diff required for mimmutable is for that v8_flags thing I described. And the same issue would need handling for mseal(). Introducing the new "mutable" ELF section is probably going to be a bigger fuss than the system call after mprotect(PROT_READ)....
Maybe add a flag field (require it to be zero initially) just to allow any future expansion. Maybe the chrome team has *wanted* to have some finer granularity thing and currently doesn't use mimmutable() in some case?
There's only one feature I can think of, and we already do it right now, documented in our manual page:
CAVEATS
     At present, mprotect(2) may reduce permissions on immutable pages marked PROT_READ | PROT_WRITE to the less permissive PROT_READ. This one-way operation is permitted for an introductory period to observe how software uses this mechanism. It may change to require explicit mutable region annotation with __attribute__((section(".openbsd.mutable"))) and explicit calls to mimmutable().
We had something which needed this behaviour during the development transition. It is exclusively mprotect RW -> R; no other transitions are allowed.
But once the transition was done, we don't need it anymore. I want to delete it, because it is a bit of a trap. It still fails closed from an attack perspective, but...
What worries me is that a piece of code reached by mistake can do an mprotect (lowering), not receive -1 EPERM, and carry on running... I'd prefer that the first time you touch a mapping in the wrong way, you receive an indication of error. This only applies to mprotect() acting upon a region, so the argument is a bit weak, due to mprotect() return value checking being about as rare as unicorns.
Also, it would be a pain for OpenBSD to transition to adding a 0 flag. I would need to see real cause, not theory.
I do like us starting with just "mimmutable()", since it already exists. Particularly if chrome already knows how to use it.
Maybe add a flag field (require it to be zero initially) just to allow any future expansion. Maybe the chrome team has *wanted* to have some finer granularity thing and currently doesn't use mimmutable() in some case?
Yes, we do have a use case in Chrome to split the sealing into unmap and mprotect, which will allow us to seal additional pages that we can't seal with pure mimmutable(). For example, we have pkey-tagged RWX memory that we want to seal. Since the memory is already RWX and the pkey controls write access, we don't care about permission changes, but sometimes we do need to mprotect data-only pages. However, the munmap sealing will provide protection against implicit changes of the pkey in this case, which would happen if a page gets unmapped and another mapped in its place.
Stephen Röttger sroettger@google.com wrote:
I do like us starting with just "mimmutable()", since it already exists. Particularly if chrome already knows how to use it.
Maybe add a flag field (require it to be zero initially) just to allow any future expansion. Maybe the chrome team has *wanted* to have some finer granularity thing and currently doesn't use mimmutable() in some case?
Yes, we do have a use case in Chrome to split the sealing into unmap and mprotect which will allow us to seal additional pages that we can't seal with pure mimmutable().
For example, we have pkey-tagged RWX memory that we want to seal. Since the memory is already RWX and the pkey controls write access, we don't care about permission changes but sometimes we do need to mprotect data only pages.
Let me try to decompose this statement.
This is clearly for the JIT. You can pivot between a JIT-generated code mapping being RW and RX (or X-only); the object will pivot between W and X to satisfy the W^X policy for safety.
I think you are talking about a RWX MAP_ANON object.
Then you use pkey_alloc() to get a PKEY. pkey_mprotect() sets the PKEY on the region. I argue you can then make it entirely immutable / sealed.
Let's say it is fully immutable / sealed.
After which, you can change the in-processor PKU register (using pkey_set) to toggle the Write-Inhibit and eXecute-Inhibit bits on that key.
The immutable object has a dangerous RWX permission. But the specific PKEY makes it either RX (or X-only) or RW depending upon your context. The mapping is never exposed as RWX. The PKU model reduces the permission access of the object below the immutable permission level.
The security depends on the PKEY WI/XI bits being difficult to control.
SADLY on x86, this is managed with a PKRU userland register which is changeable without any supervisor control -- yes, it seems quite dangerous. Changing it requires a complicated MSR dance. It is unfortunate that the pkey_set() library function is easily reachable in the PLT via ROP methods. On non-x86 CPUs that have similar functionality, the register is privileged, but operating systems supporting it generally change it and return immediately.
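To make the pivot concrete, here is a rough Linux-flavoured sketch using the glibc pkey wrappers. The size is arbitrary, error handling is omitted, and the point where unmap-sealing (mseal()/mimmutable()-style) would be applied is only marked with a comment:

#define _GNU_SOURCE
#include <sys/mman.h>

int main(void)
{
	/* JIT region: RWX at the VMA level, with access mediated by a pkey. */
	int pkey = pkey_alloc(0, 0);
	char *jit = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE | PROT_EXEC,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pkey_mprotect(jit, 1 << 20, PROT_READ | PROT_WRITE | PROT_EXEC, pkey);

	/* Here is where unmap-sealing would go, so the pkey cannot be shed
	 * by unmapping the region and mapping something else in its place. */

	/* Writer context: allow writes through the pkey and emit code... */
	pkey_set(pkey, 0);

	/* ...executor context: deny writes again before running the code. */
	pkey_set(pkey, PKEY_DISABLE_WRITE);

	return 0;
}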
The problem you seem to have with fully locked mseal() in chrome seems to be here:
about permission changes but sometimes we do need to mprotect data only pages.
Does that data have to be in the same region? Can your allocator not put the non-code pieces of the JIT elsewhere, with a different permission, fully immutable / msealed -- and perhaps even managed with a different PKEY if necessary?
Maybe that requires a huge rearchitecture. But isn't the root problem here that data and code are being handled in the same object with a shared permission model?
But the munmap sealing will provide protection against implicit changes of the pkey in this case which would happen if a page gets unmapped and another mapped in its place.
That primitive feels so weird, I have a difficult time believing it will remain unattackable in the long term.
But what if you could replace mprotect() with pkey_mprotect() upon a different key.. ?
--
A few more notes comparing what OpenBSD has done compared to Linux:
In OpenBSD, we do not have the pkey library. We have stolen one of the PKEYs and use it for kernel support of xonly for kernel code and userland code. On x86 we recognize that userland can flip the permission by whacking the PKRU register -- which would make xonly code readable. (The chrome data you are trying to guard faces the same problem.)
To prevent that, a majority of traps in the kernel (page faults, interrupts, etc) check if the PKRU register has been modified, and kill the process. It is statistically strong.
We are not making pkeys available as a userland feature, but if we later do so, we would still have 15 pkeys to play with. We would probably make the pkey_set() operation a system call, so the trap handler can also observe PKRU register modifications by the instruction.
Above, I mentioned pivoting between "RW or RX (or X-only)". On OpenBSD, chrome would be able to pivot between RW and X-only.
When it comes to Pkey utilization, we've ended up in a very different place than Linux.
The problem you seem to have with fully locked mseal() in chrome seems to be here:
about permission changes but sometimes we do need to mprotect data only pages.
Does that data have to be in the same region? Can your allocator not put the non-code pieces of the JIT elsewhere, with a different permission, fully immutable / msealed -- and perhaps even managed with a different PKEY if necessary?
No we can't. We investigated this extensively since this also poses some difficulties on MacOS. We implemented different approaches but any such change to the allocator introduces too much of a performance impact.
On Tue, Oct 17, 2023 at 11:20 AM Theo de Raadt deraadt@openbsd.org wrote:
May I ask: for BSD's implementation of mimmutable(), do you cover things such as mlock() and madvise()? Or just the protection bits (RWX) + mremap() + munmap()?
In other words: is BSD's definition of immutable equivalent to MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP of this patch set?
I hesitate to introduce the concept of immutable into Linux because I don't know all the scenarios present in Linux where a VMA's metadata can be modified. As Jann's email pointed out, there could be quite a few things we still need to deal with to completely block the possibility, e.g. malicious code attempting to write to RO memory or change RW memory to RWX.
If, as part of immutable, I also block madvise() and mlock(), which also update a VMA's metadata, then by definition I could. But what if the user wants the features of madvise() and, at the same time, also wants their .text protected?
Also, if Linux introduces a new syscall that depends on new VMA metadata, say msecret() (for discussion purposes), should immutable automatically cover that?
Without those questions answered, I couldn't choose the route of immutable() yet.
-Jeff
Jeff Xu jeffxu@google.com wrote:
May I ask: for BSD's implementation of mimmutable(), do you cover things such as mlock() and madvise()? Or just the protection bits (RWX) + mremap() + munmap()?
It only prevents removal of the mapping, placement of a replacement mapping, or changing the existing permissions. If one page in the existing sub-region is marked immutable, the whole operation fails with EPERM.
Those are the only user-visible aspects that an attacker cares about to utilize in this area.
mlock() and madvise() deal with the physical memory handling underneath the VA. They have nothing to do with how attack code might manipulate the VA address space inside a program to convert a series of dead-end approaches into a successful escalation strategy.
[It would be a very long conversation to explain where and how this has been utilized to make an attack successful.]
In other words: Is BSD's definition of immutable equivalent to MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP, of this patch set ?
I can't compare it to your subsystem, because I completely fail to understand the cause or benefit of all the complexity.
I think I've explained what mimmutable() is in extremely simple terms.
And I don't understand why you are trying to do anything beyond what mimmutable() offers. It seems like this is inventing additional solutions without proof that any of them are necessary to solve the specific problem that is known.
I hesitate to introduce the concept of immutable into Linux because I don't know all the scenarios present in Linux where a VMA's metadata can be modified.
Good grief. It seems obvious that if you want to lock the change-behaviour of an object (the object in this case being a VA sub-region; there is a data structure for that, in OpenBSD it is called an "entry"), then you put a flag in that object's data structure and you simply check the flag every time a change-operation is attempted. It is a flag which gets set, and checked. Nothing ever clears it (except address space teardown).
This flag must be put on the data structure that manages VA sub-ranges.
In our case when a prot/mapping operation reaches low-level code that will want to change an "entry", we notice it is not allowed and simply percolate EPERM up through the layers.
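A tiny sketch of that shape, with deliberately generic names (none of these are real identifiers in either kernel):

#include <errno.h>

struct va_entry {			/* the VA sub-range object */
	unsigned int flags;
#define VE_IMMUTABLE	0x1		/* set once, never cleared */
	/* ... start, end, permissions ... */
};

static int change_entry(struct va_entry *e)
{
	if (e->flags & VE_IMMUTABLE)
		return -EPERM;		/* percolates up to the caller as-is */
	/* ... apply the permission / mapping change ... */
	return 0;
}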
There could be quite a few things we still need to deal with, to completely block the possibility, e.g. malicious code attempting to write to a RO memory
What?! Writes to RO memory are blocked by the permission bits.
or change RW memory to RWX.
In our case that is blocked by W^X policy.
But if the region is marked mimmutable, then that's another reason you cannot change RW to RWX. It seems so off-topic, to talk about writes to RO memory. I get a feeling you are a bit lost.
mimmutable() is not about permissions, but about locking permissions.
- You can't change the permissions of the address space region.
- You cannot map a replacement object at the location instead (especially with different permission).
- You cannot unmap at that location (which you would do if you wanted to map a new object, with a different permission).
All 3 of these scenarios are identical. No regular code performs these 3 operations on regions of the address space which we mark immutable.
There is nothing more to mimmutable in the VM layer. The hard work is writing code in execve() and ld.so which will decide which objects can be marked immutable automatically, so that programs don't do this to themselves.
I'm aware of where this simple piece fits in. It does not solve all problems, it is a very narrow change to impact a problem which only high-value targets will ever face (like chrome).
But I think you don't understand the purpose of this mechanism.
If, as part of immutable, I also block madvise() and mlock(), which also update a VMA's metadata, then by definition I could. But what if the user wants the features of madvise() and, at the same time, also wants their .text protected?
I have no idea what you are talking about. None of those things relate to the access permission of the memory the user sees, and therefore none of them are in the attack surface profile which is being prevented.
Meaning, we allow madvise() and mlock() and mphysicalquantummemory() because those relate to the physical storage and not the VA permission model.
Also, if linux introduces a new syscall that depends on a new metadata of VMA, say msecret(), (for discussion purpose), should immutable automatically support that ?
How about the future makingexcuses() system call?
I don't think you understand the problem space well enough to come up with your own solution for it. I spent a year on this, and ship a complete system using it. You are asking such simplistic questions above that it shocks me.
Maybe read the LWN article;
https://lwn.net/Articles/915640/
Without those questions answered, I couldn't choose the route of immutable() yet.
"... so I can clearly not choose the wine in front of you."
If you don't understand what this thing is for, and cannot minimize the complexity of this thing, then Linux doesn't need it at all.
I should warn everyone the hard work is not in the VM layer, but in ld.so -- deciding which parts of the image to make immutable, and when. It is also possible to make some segments immutable directly in execve() -- but in both cases you had better have a really good grasp of RELRO executable layout or you will make too many pieces immutable...
I am pretty sure Linux will never get as far as we got. Even our main stacks are marked immutable, but in Linux that would conflict with glibc ld.so mprotecting RWX the stack if you dlopen() a shared library with GNUSTACK, a very bad idea which needs a different fight...
On Tue, Oct 17, 2023 at 4:57 PM Theo de Raadt deraadt@openbsd.org wrote:
Jeff Xu jeffxu@google.com wrote:
May I ask: for BSD's implementation of mimmutable(), do you cover things such as mlock() and madvise()? Or just the protection bits (RWX) + mremap() + munmap()?
It only prevents removal of the mapping, placement of a replacement mapping, or changing the existing permissions. If one page in the existing sub-region is marked immutable, the whole operation fails with EPERM.
Those are the only user-visible aspects that an attacker cares about to utilize in this area.
mlock() and madvise() deal with the physical memory handling underneath the VA. They have nothing to do with how attack code might manipulate the VA address space inside a program to convert a series of dead-end approaches into a successful escalation strategy.
[It would be a very long conversation to explain where and how this has been utilized to make an attack successful.]
In other words: Is BSD's definition of immutable equivalent to MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP, of this patch set ?
I can't compare it to your subsystem, because I completely fail to understand the cause or benefit of all the complexity.
I think I've explained what mimmutable() is in extremely simple terms.
Thanks for the explanation. Based on that, this is exactly what the current patch set does. In practice, libc could do the following:
#define MM_IMMUTABLE (MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP)
mseal(addr, len, MM_IMMUTABLE, 0);
That would be equivalent to BSD's mimmutable().
And I don't understand why you are trying to do anything beyond what mimmutable() offers. It seems like this is inventing additional solutions without proof that any of them are necessary to solve the specific problem that is known.
I hesitate to introduce the concept of immutable into Linux because I don't know all the scenarios present in Linux where a VMA's metadata can be modified.
Good grief. It seems obvious that if you want to lock the change-behaviour of an object (the object in this case being a VA sub-region; there is a data structure for that, in OpenBSD it is called an "entry"), then you put a flag in that object's data structure and you simply check the flag every time a change-operation is attempted. It is a flag which gets set, and checked. Nothing ever clears it (except address space teardown).
This flag must be put on the data structure that manages VA sub-ranges.
In our case when a prot/mapping operation reaches low-level code that will want to change an "entry", we notice it is not allowed and simply percolate EPERM up through the layers.
There could be quite a few things we still need to deal with, to completely block the possibility, e.g. malicious code attempting to write to a RO memory
What?! writes to RO memory are blocked by the permission bits.
or change RW memory to RWX.
In our case that is blocked by W^X policy.
But if the region is marked mimmutable, then that's another reason you cannot change RW to RWX. It seems so off-topic, to talk about writes to RO memory. I get a feeling you are a bit lost.
mimmutable() is not about permissions, but about locking permissions.
- You can't change the permissions of the address space region.
- You cannot map a replacement object at the location instead (especially with different permission).
- You cannot unmap at that location (which you would do if you wanted to map a new object, with a different permission).
All 3 of these scenarios are identical. No regular code performs these 3 operations on regions of the address space which we mark immutable.
There is nothing more to mimmutable in the VM layer. The hard work is writing code in execve() and ld.so which will decide which objects can be marked immutable automatically, so that programs don't do this to themselves.
I'm aware of where this simple piece fits in. It does not solve all problems, it is a very narrow change to impact a problem which only high-value targets will ever face (like chrome).
But I think you don't understand the purpose of this mechanism.
In the Linux case, I think, eventually, mseal() will have a bigger scope than BSD's mimmutable(). A VMA's metadata (vm_area_struct) contains a lot of control info; depending on an application's needs, mseal() can be expanded to seal individual pieces of that control info.
For example, in the madvise(2) case: as Jann points out in [1], and I quote: "you'd probably also want to block destructive madvise() operations that can effectively alter region contents by discarding pages and such, ..."
Another example: if an application wants to keep memory always present in RAM, for whatever reason, it can seal mlock().
To handle those two new cases, mseal() could add two more bits: MM_SEAL_MADVISE and MM_SEAL_MLOCK.
It is practical to keep the syscall extendable when the business logic is the same.
I think I explained the logic of using bitmasks in the mseal() interface clearly with the examples of madvise() and mlock().
-Jeff
[1] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfU...
Jeff Xu jeffxu@google.com wrote:
In the Linux case, I think, eventually, mseal() will have a bigger scope than BSD's mimmutable().
I don't believe that, considering no one needed this behaviour from the VM system in the last 4 decades.
A VMA's metadata (vm_area_struct) contains a lot of control info; depending on an application's needs, mseal() can be expanded to seal individual pieces of that control info.
For example, in the madvise(2) case: as Jann points out in [1], and I quote: "you'd probably also want to block destructive madvise() operations that can effectively alter region contents by discarding pages and such, ..."
Then prohibit madvise(MADV_FREE) on all non-writeable mappings that are immutable. Just include this in the set of behaviours. Or make it the default.
Don't make it an option that a program needs to set on pages! No one is going to call it. Most programs don't know the addresses of the *REGIONS* they would want to do this for.
Does your program know where libc's text segment starts and ends? No, your program does not know these addresses, so the parts of the 'system' which do know this need to do it (which would be ld.so or the libc init constructors).
If madvise(MADV_FREE) is so dangerous... say you have a program that would call through abort(), but you know a zero there can make the abort not abort but return; then is it bad to let the attacker do:
madvise(&abort, pagesize, MADV_FREE)
If that is bad, then block it in a smart way! Don't make a programmer of an application figure out how to do this. That results in a defense methodology where a handful of programs self-protect, but everything else is unprotected or unprotectable. That is shortsighted.
Another example: if an application wants to keep a memory always present in RAM, for whatever the reason, it can call seal the mlock().
Please explain the attack surface.
I think I explained the logic of using bitmasks in the mseal() interface clearly with the example of madvice() and mlock().
It is clear as mud.
On Tue, Oct 17, 2023 at 08:18:47PM -0700, Jeff Xu wrote:
In practice, libc could do the following:
#define MM_IMMUTABLE (MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP)
mseal(addr, len, MM_IMMUTABLE, 0);
That would be equivalent to BSD's mimmutable().
No, it wouldn't, because you've carefully listed the syscalls you're blocking instead of understanding the _concept_ of what you need to block.
In the Linux case, I think, eventually, mseal() will have a bigger scope than BSD's mimmutable(). A VMA's metadata (vm_area_struct) contains a lot of control info; depending on an application's needs, mseal() can be expanded to seal individual pieces of that control info.
For example, in the madvise(2) case: as Jann points out in [1], and I quote: "you'd probably also want to block destructive madvise() operations that can effectively alter region contents by discarding pages and such, ..."
Another example: if an application wants to keep memory always present in RAM, for whatever reason, it can seal mlock().
To handle those two new cases, mseal() could add two more bits: MM_SEAL_MADVISE and MM_SEAL_MLOCK.
Yes, thank you for demonstrating that you have no idea what you need to block.
It is practical to keep the syscall extendable when the business logic is the same.
I concur with Theo & Linus. You don't know what you're doing. I think the underlying idea of mimmutable() is good, but how you've split it up and how you've implemented it is terrible.
Let's start with the purpose. The point of mimmutable/mseal/whatever is to fix the mapping of an address range to its underlying object, be it a particular file mapping or anonymous memory. After the call succeeds, it must not be possible to make any address in that virtual range point into any other object.
The secondary purpose is to lock down permissions on that range. Possibly to fix them where they are, possibly to allow RW->RO transitions.
With those purposes in mind, you should be able to deduce for any syscall or any madvise(), ... whether it should be allowed.
Look, I appreciate this is only your second set of patches to Linux and you've taken on a big job. But that's all the more reason you should listen to people who are trying to help you.
On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox willy@infradead.org wrote:
Let's start with the purpose. The point of mimmutable/mseal/whatever is to fix the mapping of an address range to its underlying object, be it a particular file mapping or anonymous memory. After the call succeeds, it must not be possible to make any address in that virtual range point into any other object.
The secondary purpose is to lock down permissions on that range. Possibly to fix them where they are, possibly to allow RW->RO transitions.
With those purposes in mind, you should be able to deduce for any syscall or any madvise(), ... whether it should be allowed.
I got it.
IMO: the approaches mimmutable() and mseal() took are different, but we all want to seal memory against attackers and make Linux applications safer.
mimmutable() started with "no updates to the sealed address range are allowed once it is marked immutable"; this includes updates from within kernel space (drivers included) and any new syscalls. That is reasonable to me.
mseal() starts with 4 syscalls from userspace, which is just one way (among many) to demonstrate which memory operations can be sealed, and which happens to meet what Chrome wants. This is an RFC, and I appreciate your input.
Best regards, -Jeff
Jeff Xu jeffxu@google.com wrote:
On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox willy@infradead.org wrote:
Let's start with the purpose. The point of mimmutable/mseal/whatever is to fix the mapping of an address range to its underlying object, be it a particular file mapping or anonymous memory. After the call succeeds, it must not be possible to make any address in that virtual range point into any other object.
The secondary purpose is to lock down permissions on that range. Possibly to fix them where they are, possibly to allow RW->RO transitions.
With those purposes in mind, you should be able to deduce for any syscall or any madvise(), ... whether it should be allowed.
I got it.
IMO: the approaches mimmutable() and mseal() took are different, but we all want to seal memory against attackers and make Linux applications safer.
I think you are building mseal for Chrome, and Chrome alone.
I do not think this will work out for the rest of the application space because
- it is too complicated
- experience with mimmutable() says that applications don't do any of it themselves, it is all in execve(), libc initialization, and ld.so. You don't strike me as an execve, libc, or ld.so developer.
IMO: the approaches mimmutable() and mseal() took are different, but we all want to seal memory against attackers and make Linux applications safer.
I think you are building mseal for Chrome, and Chrome alone.
I do not think this will work out for the rest of the application space because
- it is too complicated
- experience with mimmutable() says that applications don't do any of it themselves, it is all in execve(), libc initialization, and ld.so. You don't strike me as an execve, libc, or ld.so developer.
We do want to build this in a way that it can be applied automatically by ld.so and we appreciate all your feedback on this. The intention of splitting the sealing by syscall was to provide flexibility while still allowing ld.so to seal all operations. But it's clear from the feedback that both the fine-grained split and the per-syscall approach are not the right way to go. Does Linus' proposal to just split munmap / mprotect sealing address your complexity concerns? ld.so would always use both flags, which should then behave similarly to mimmutable().
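As a rough sketch of that usage (using the MM_SEAL_* values proposed in this RFC; the mseal() prototype is assumed to come from a future libc wrapper), the loader side would simply request both seals for every non-writable load segment once it has finished setting it up:

#include <stddef.h>

#define MM_SEAL_MPROTECT 0x2    /* values as proposed in this RFC */
#define MM_SEAL_MUNMAP   0x4

/* RFC signature; assumed to be provided by a libc wrapper eventually */
int mseal(void *addr, size_t len, unsigned int types, unsigned int flags);

/* ld.so would call this once relocations for a segment are done */
static int seal_segment(void *start, size_t len)
{
        return mseal(start, len, MM_SEAL_MPROTECT | MM_SEAL_MUNMAP, 0);
}

Always passing both bits is what would make the behaviour comparable to mimmutable().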
Stephen Röttger sroettger@google.com wrote:
IMO: the approaches mimmutable() and mseal() took are different, but we all want to seal memory against attackers and make Linux applications safer.
I think you are building mseal for Chrome, and Chrome alone.
I do not think this will work out for the rest of the application space because
- it is too complicated
- experience with mimmutable() says that applications don't do any of it themselves, it is all in execve(), libc initialization, and ld.so. You don't strike me as an execve, libc, or ld.so developer.
We do want to build this in a way that it can be applied automatically by ld.so and we appreciate all your feedback on this.
Hi Stephen,
I am pretty sure your mechanism will be useable by ld.so.
What bothers me is the complex many-bits approach may encourage people to set only a subset of the bits, and then believe they have a security primitive.
Partial sealing is not safe. I define partial sealing as "blocking munmap, but not mprotect". Or "blocking mprotect, but not madvise or mmap".
In Message-ID ZS/3GCKvNn5qzhC4@casper.infradead.org, Matthew stated that there are two aspects being locked: which object is mapped, and the permission of that mapping. When the additional system calls msync() and madvise() are included in the picture, there are 3 actions being prevented:
- Can someone replace the object
- Can someone change the permission
- Can someone throw away the cached pages, reverting to original content of the object (that is the madvise / msync)
In Message-ID CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com, Jann reminded us of this piece. I'm taking this as a long-standing security hole in some sub-operations of msync/madvise which can write to data regions that aren't actually writeable. Sub-operations with this problem are MADV_FREE, MADV_DONTNEED, POSIX_MADV_DONTNEED, MS_INVALIDATE... on Linux also MADV_WIPEONFORK, and probably a whole bunch of others. I am testing OpenBSD changes which demand PROT_WRITE permission for these sub-operations. Perhaps some systems are already careful.
If you leave any of these operators available, the object is not actually sealed against abuse. I believe an attacker will simply switch to a different operator (mmap, munmap, mprotect, madvise, msync) to achieve a similar objective of damaging the permission or contents.
Since mseal() is designed to create partial sealings, the name of the proposed system call really smells.
The intention of splitting the sealing by syscall was to provide flexibility while still allowing ld.so to seal all operations.
Yes, you will have ld.so set all the bits, and the same in C runtime initialization. If you convince glibc to stop making the stack executable in dlopen(), the kernel could automatically do it.. With Linux backwards-compat management, getting there would be an extremely long, long roadmap. But anyways, the idea would be "set all the bits", because otherwise the object or data isn't safe.
Does Linus' proposal to just split munmap / mprotect sealing address your complexity concerns? ld.so would always use both flags, which should then behave similarly to mimmutable().
No, I think it is weak, because it isn't sealed.
A separate mail in the thread from you says this is about Chrome wanting to use PKU on RWX objects. I think that's the reason for wanting to separate the sealing (I haven't heard of other applications wanting that). How about we explore that in the other subthread..
On Mon, Oct 16, 2023 at 4:38 PM jeffxu@chromium.org wrote:
From: Jeff Xu jeffxu@google.com
This patchset proposes a new mseal() syscall for the Linux kernel.
Modern CPUs support memory permissions such as RW and NX bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves security stance on memory corruption bugs, i.e. the attacker can’t just write to arbitrary memory and point the code to it, the memory has to be marked with X bit, or else an exception will happen.
Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management syscall. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
The new mseal() is an architecture independent syscall, and with following signature:
mseal(void addr, size_t len, unsigned int types, unsigned int flags)
Is the plan that the VMAs you need to protect would be created and mseal()'ed while you expect that attacker code can not (yet) be running concurrently?
Do you expect to be using sealed memory for shadow stacks (in x86 CET / arm64 GCS) to prevent an attacker from mixing those up by moving pages inside a shadow stack or between different shadow stacks or such? (If that's even possible, I think it is but I haven't tried.)
addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to comments in mseal.c. Those are also fully covered by the selftest. types: bit mask to specify which syscall to seal, currently they are: MM_SEAL_MSEAL 0x1 MM_SEAL_MPROTECT 0x2 MM_SEAL_MUNMAP 0x4 MM_SEAL_MMAP 0x8 MM_SEAL_MREMAP 0x10
You'd probably also want to block destructive madvise() operations that can effectively alter region contents by discarding pages and such, in particular MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED; probably also MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK. Maybe you'd want to just block all madvise() for sealed VMAs? Or rename process_madvise_behavior_valid() to something like "madvise_is_nondestructive()" and use that.
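A rough sketch of the kind of classification helper being suggested (illustration only, not part of the patchset; the list covers only the behaviours named above and is not exhaustive):

/* kernel-side sketch: refuse destructive madvise() behaviours on
 * sealed VMAs */
static bool madvise_is_nondestructive(int behavior)
{
        switch (behavior) {
        case MADV_FREE:
        case MADV_DONTNEED:
        case MADV_DONTNEED_LOCKED:
        case MADV_REMOVE:
        case MADV_DONTFORK:
        case MADV_WIPEONFORK:
                return false;
        default:
                return true;
        }
}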
The following comments probably mostly don't matter in practice if this feature is used in a context that is heavily seccomp-sandboxed (like Desktop Linux Chrome), but should maybe be addressed to make this feature more usable for other users. (Including Android Chrome, which has a weaker sandbox...)
FWIW, it is also possible to write to read-only memory through the /proc/self/mem interface (or through ptrace commands like PTRACE_POKETEXT) because of FOLL_FORCE, depending on kernel configuration, seccomp policy, and what the LSMs do. (I think Android Chrome would allow /proc/self/mem writes, but would block PTRACE_POKETEXT with RestrictPtrace() in the sandbox code?)
I had a related ancient patch series in 2016 with an attempt to allow SELinux to prevent bypassing W^X protections with this, but I never followed through with getting that landed... (https://lore.kernel.org/linux-mm/1475103281-7989-1-git-send-email-jann@thejh...)
I guess the question there is what the right semantics for this kind of protected memory are when a debugger is active. The simple solution that might break some debugging would be "just deny all FOLL_FORCE write access for this memory" (which would prevent debuggers from being able to place breakpoints, which would maybe not be great). But maybe it makes more sense to consider this to be an independent concern and solve it with a new SELinux feature or something like that instead, and then document that mseal() requires some complement to prevent forced writes to read-only private memory? (For which the simplest solution would be "don't grant filesystem access or ptrace() access to the sandboxed code".)
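For reference, a minimal sketch of the FOLL_FORCE path mentioned above (illustration only; error checking omitted, and whether the write succeeds depends on kernel configuration, seccomp, LSMs and ptrace policy as noted): writing through /proc/self/mem can modify a mapping that is read-only from the process's own point of view.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = sysconf(_SC_PAGESIZE);
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        strcpy(p, "original");
        mprotect(p, len, PROT_READ);          /* direct writes now fault */

        int fd = open("/proc/self/mem", O_WRONLY);
        /* the offset into /proc/self/mem is the virtual address itself */
        pwrite(fd, "patched", sizeof("patched"), (off_t)(uintptr_t)p);
        printf("%s\n", p);                    /* typically prints "patched" */
        return 0;
}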
What is the intended interaction with userfaultfd, which I believe by design permits arbitrary data into unpopulated areas of anonymous VMAs? If the intention is that the process should otherwise be sandboxed to not have access to userfaultfd, that should maybe be documented. (Alternatively I guess you could theoretically remove the VM_MAYWRITE bit from marked VMAs, but that might be more strict than we want, since it'd also block all FOLL_FORCE writes.)
There are also some interfaces like AIO or the X86 Shadow Stacks interface that indirectly unmap memory through the kernel and look like they could perhaps be tricked into racily unmapping a just-created sealed VMA. You'd probably have to make sure that they can't do that and essentially treat their unmap operations as if they came from userspace, I guess? What Linus just wrote.
I think either way this feature needs some documentation on what kind of context it's supposed to run in.
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9...
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgea...
Jeff Xu (8):
  Add mseal syscall
  Wire up mseal syscall
  mseal: add can_modify_mm and can_modify_vma
  mseal: seal mprotect
  mseal munmap
  mseal mremap
  mseal mmap
  selftest mm/mseal mprotect/munmap/mremap/mmap
 arch/alpha/kernel/syscalls/syscall.tbl      | 1 +
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd.h             | 2 +-
 arch/arm64/include/asm/unistd32.h           | 2 +
 arch/ia64/kernel/syscalls/syscall.tbl       | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl       | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl     | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    | 1 +
 arch/s390/kernel/syscalls/syscall.tbl       | 1 +
 arch/sh/kernel/syscalls/syscall.tbl         | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
 fs/aio.c                                    | 5 +-
 include/linux/mm.h                          | 55 +-
 include/linux/mm_types.h                    | 7 +
 include/linux/syscalls.h                    | 2 +
 include/uapi/asm-generic/unistd.h           | 5 +-
 include/uapi/linux/mman.h                   | 6 +
 ipc/shm.c                                   | 3 +-
 kernel/sys_ni.c                             | 1 +
 mm/Kconfig                                  | 8 +
 mm/Makefile                                 | 1 +
 mm/internal.h                               | 4 +-
 mm/mmap.c                                   | 49 +-
 mm/mprotect.c                               | 6 +
 mm/mremap.c                                 | 19 +-
 mm/mseal.c                                  | 328 +++++
 mm/nommu.c                                  | 6 +-
 mm/util.c                                   | 8 +-
 tools/testing/selftests/mm/Makefile         | 1 +
 tools/testing/selftests/mm/mseal_test.c     | 1428 +++++++++++++++++++
 37 files changed, 1934 insertions(+), 28 deletions(-)
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c
Hi Jann,
Thank you for reviewing the patch and comments.
On Mon, Oct 16, 2023 at 10:35 AM Jann Horn jannh@google.com wrote:
On Mon, Oct 16, 2023 at 4:38 PM jeffxu@chromium.org wrote:
From: Jeff Xu jeffxu@google.com
This patchset proposes a new mseal() syscall for the Linux kernel.
Modern CPUs support memory permissions such as RW and NX bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves security stance on memory corruption bugs, i.e. the attacker can’t just write to arbitrary memory and point the code to it, the memory has to be marked with X bit, or else an exception will happen.
Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management syscall. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
The new mseal() is an architecture independent syscall, and with following signature:
mseal(void addr, size_t len, unsigned int types, unsigned int flags)
Is the plan that the VMAs you need to protect would be created and mseal()'ed while you expect that attacker code can not (yet) be running concurrently?
Yes.
Do you expect to be using sealed memory for shadow stacks (in x86 CET / arm64 GCS) to prevent an attacker from mixing those up by moving pages inside a shadow stack or between different shadow stacks or such? (If that's even possible, I think it is but I haven't tried.)
I will defer the question to Stephen. (I also think that is possible, if the application knows where the shadow stack is at runtime.)
addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to comments in mseal.c. Those are also fully covered by the selftest. types: bit mask to specify which syscall to seal, currently they are: MM_SEAL_MSEAL 0x1 MM_SEAL_MPROTECT 0x2 MM_SEAL_MUNMAP 0x4 MM_SEAL_MMAP 0x8 MM_SEAL_MREMAP 0x10
You'd probably also want to block destructive madvise() operations that can effectively alter region contents by discarding pages and such, in particular MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED; probably also MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK. Maybe you'd want to just block all madvise() for sealed VMAs? Or rename process_madvise_behavior_valid() to something like "madvise_is_nondestructive()" and use that.
Thanks for the suggestion. I can add madvise() later, maybe in another series. For now, the goal of this patchset is to cover 4 syscalls, as initially requested by Chrome.
The following comments probably mostly don't matter in practice if this feature is used in a context that is heavily seccomp-sandboxed (like Desktop Linux Chrome), but should maybe be addressed to make this feature more usable for other users. (Including Android Chrome, which has a weaker sandbox...)
FWIW, it is also possible to write to read-only memory through the /proc/self/mem interface (or through ptrace commands like PTRACE_POKETEXT) because of FOLL_FORCE, depending on kernel configuration, seccomp policy, and what the LSMs do. (I think Android Chrome would allow /proc/self/mem writes, but would block PTRACE_POKETEXT with RestrictPtrace() in the sandbox code?)
I had a related ancient patch series in 2016 with an attempt to allow SELinux to prevent bypassing W^X protections with this, but I never followed through with getting that landed... (https://lore.kernel.org/linux-mm/1475103281-7989-1-git-send-email-jann@thejh...)
I guess the question there is what the right semantics for this kind of protected memory are when a debugger is active. The simple solution that might break some debugging would be "just deny all FOLL_FORCE write access for this memory" (which would prevent debuggers from being able to place breakpoints, which would maybe not be great). But maybe it makes more sense to consider this to be an independent concern and solve it with a new SELinux feature or something like that instead, and then document that mseal() requires some complement to prevent forced writes to read-only private memory? (For which the simplest solution would be "don't grant filesystem access or ptrace() access to the sandboxed code".)
What is the intended interaction with userfaultfd, which I believe by design permits arbitrary data into unpopulated areas of anonymous VMAs? If the intention is that the process should otherwise be sandboxed to not have access to userfaultfd, that should maybe be documented. (Alternatively I guess you could theoretically remove the VM_MAYWRITE bit from marked VMAs, but that might be more strict than we want, since it'd also block all FOLL_FORCE writes.)
There are also some interfaces like AIO or the X86 Shadow Stacks interface that indirectly unmap memory through the kernel and look like they could perhaps be tricked into racily unmapping a just-created sealed VMA. You'd probably have to make sure that they can't do that and essentially treat their unmap operations as if they came from userspace, I guess? What Linus just wrote.
I think either way this feature needs some documentation on what kind of context it's supposed to run in.
Thanks for listing all the cases. As you pointed out, Chrome is heavily sandboxed, both for syscalls and for file access. I will work with Stephen to check whether some of these apply to Chrome.
It is probably worth mentioning that, with mseal()'s current bitmask approach, it is possible to implement the cases above incrementally.
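For illustration, a sketch of that incremental use, relying on the RFC's rule that already-applied seal types are remembered (MM_SEAL_MADVISE and its value are hypothetical, not part of this series; the mseal() prototype is assumed to come from a libc wrapper):

#include <stddef.h>

#define MM_SEAL_MPROTECT 0x2     /* from this RFC */
#define MM_SEAL_MADVISE  0x20    /* hypothetical future extension */

/* RFC signature; assumed libc wrapper */
int mseal(void *addr, size_t len, unsigned int types, unsigned int flags);

void seal_incrementally(void *addr, size_t len)
{
        mseal(addr, len, MM_SEAL_MPROTECT, 0);   /* seal mprotect today */
        /* later: add another seal type without repeating the earlier
         * ones -- the kernel keeps the bits already applied */
        mseal(addr, len, MM_SEAL_MADVISE, 0);
}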
Thanks -Jeff