On Mon, Oct 16, 2023 at 4:38 PM jeffxu@chromium.org wrote:
From: Jeff Xu <jeffxu@google.com>
This patchset proposes a new mseal() syscall for the Linux kernel.
Modern CPUs support memory permissions such as RW and NX bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security posture against memory corruption bugs: an attacker can’t simply write to arbitrary memory and point the code to it, because the memory has to be marked with the X bit, or else an exception will be raised.
Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management syscall. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.
The new mseal() is an architecture-independent syscall with the following signature:
mseal(void *addr, size_t len, unsigned int types, unsigned int flags)
Is the plan that the VMAs you need to protect would be created and mseal()'ed while you expect that attacker code can not (yet) be running concurrently?
Do you expect to be using sealed memory for shadow stacks (in x86 CET / arm64 GCS) to prevent an attacker from mixing those up by moving pages inside a shadow stack or between different shadow stacks or such? (If that's even possible, I think it is but I haven't tried.)
addr/len: memory range. Must be contiguous, allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to the comments in mseal.c. Those are also fully covered by the selftest.

types: bit mask specifying which syscalls to seal. Currently they are:
MM_SEAL_MSEAL 0x1
MM_SEAL_MPROTECT 0x2
MM_SEAL_MUNMAP 0x4
MM_SEAL_MMAP 0x8
MM_SEAL_MREMAP 0x10
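To make the addr/len requirement concrete, here is a minimal userspace sketch of a "no holes" check over a sorted list of mappings. It is only a model: struct region and range_is_contiguous() are hypothetical names, not taken from mseal.c, and treating an empty range as invalid is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a VMA: a half-open range [start, end). */
struct region {
    uintptr_t start;
    uintptr_t end;
};

/*
 * Return true if [addr, addr + len) is fully covered by the regions in
 * `map` with no holes. `map` must be sorted by start address and must not
 * contain overlapping entries.
 */
bool range_is_contiguous(const struct region *map, size_t n,
                         uintptr_t addr, size_t len)
{
    uintptr_t end = addr + len;
    uintptr_t next = addr;      /* first byte not yet known to be mapped */

    /* Assumption: an empty or wrapping range is rejected. */
    if (len == 0 || end < addr)
        return false;

    for (size_t i = 0; i < n && next < end; i++) {
        assert(map[i].start < map[i].end);
        if (map[i].end <= next)
            continue;           /* region lies entirely before the range */
        if (map[i].start > next)
            return false;       /* hole before this region */
        next = map[i].end;      /* covered up to the end of this region */
    }
    return next >= end;
}
```

With mappings at [0x1000, 0x3000), [0x3000, 0x5000) and [0x6000, 0x7000), a request for 0x1000..0x5000 passes, while 0x4000..0x7000 fails because of the hole at 0x5000.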
You'd probably also want to block destructive madvise() operations that can effectively alter region contents by discarding pages and such, in particular MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED; probably also MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK. Maybe you'd want to just block all madvise() for sealed VMAs? Or rename process_madvise_behavior_valid() to something like "madvise_is_nondestructive()" and use that.
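As a rough illustration of the "madvise_is_nondestructive()" idea, here is a userspace sketch that classifies the behaviors named above. The function is hypothetical and is not the kernel's process_madvise_behavior_valid(); the list only reflects the operations mentioned in this mail, and a real implementation would need to audit every madvise() behavior.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: returns false for the madvise() behaviors named in
 * the review as able to discard or alter the contents of a region (or to
 * change what a forked child sees), true for everything else.
 */
bool madvise_is_nondestructive(int behavior)
{
    switch (behavior) {
    case MADV_DONTNEED:
    case MADV_FREE:
    case MADV_REMOVE:
    case MADV_DONTFORK:
#ifdef MADV_WIPEONFORK          /* may be missing from older libc headers */
    case MADV_WIPEONFORK:
#endif
#ifdef MADV_DONTNEED_LOCKED     /* may be missing from older libc headers */
    case MADV_DONTNEED_LOCKED:
#endif
        return false;
    default:
        return true;
    }
}
```

A sealed VMA would then only accept madvise() calls for which this returns true.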
The following comments probably mostly don't matter in practice if this feature is used in a context that is heavily seccomp-sandboxed (like Desktop Linux Chrome), but should maybe be addressed to make this feature more usable for other users. (Including Android Chrome, which has a weaker sandbox...)
FWIW, it is also possible to write to read-only memory through the /proc/self/mem interface (or through ptrace commands like PTRACE_POKETEXT) because of FOLL_FORCE, depending on kernel configuration, seccomp policy, and what the LSMs do. (I think Android Chrome would allow /proc/self/mem writes, but would block PTRACE_POKETEXT with RestrictPtrace() in the sandbox code?)
I had a related ancient patch series in 2016 with an attempt to allow SELinux to prevent bypassing W^X protections with this, but I never followed through with getting that landed... (https://lore.kernel.org/linux-mm/1475103281-7989-1-git-send-email-jann@thejh...)
I guess the question there is what the right semantics for this kind of protected memory are when a debugger is active. The simple solution that might break some debugging would be "just deny all FOLL_FORCE write access for this memory" (which would prevent debuggers from being able to place breakpoints, which would maybe not be great). But maybe it makes more sense to consider this to be an independent concern and solve it with a new SELinux feature or something like that instead, and then document that mseal() requires some complement to prevent forced writes to read-only private memory? (For which the simplest solution would be "don't grant filesystem access or ptrace() access to the sandboxed code".)
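For reference, the /proc/self/mem path can be probed from userspace with something like the sketch below. It maps a private anonymous page, makes it read-only, and then tries to overwrite one byte through /proc/self/mem. Whether the write lands depends on kernel configuration, seccomp policy and LSMs, as noted above, so the function only reports what happened. poke_readonly_page() is a hypothetical name, and the sketch assumes a 64-bit off_t.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a byte in a read-only page was changed via /proc/self/mem,
 * 0 if opening or writing /proc/self/mem was refused, and -1 if the test
 * page could not be set up.
 */
int poke_readonly_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    volatile unsigned char *p = mmap(NULL, (size_t)page,
                                     PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 'A';                         /* populate the page while writable */
    if (mprotect((void *)p, (size_t)page, PROT_READ) != 0) {
        munmap((void *)p, (size_t)page);
        return -1;
    }

    int changed = 0;
    int fd = open("/proc/self/mem", O_RDWR);
    if (fd >= 0) {
        unsigned char b = 'B';
        /* The file offset is the virtual address to write to. */
        if (pwrite(fd, &b, 1, (off_t)(uintptr_t)p) == 1 && p[0] == 'B')
            changed = 1;
        close(fd);
    }

    assert(changed == 0 || p[0] == 'B');
    munmap((void *)p, (size_t)page);
    return changed;
}
```

A return value of 1 means PROT_READ alone did not stop the write, which is the gap that would need a complement to mseal().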
What is the intended interaction with userfaultfd, which I believe by design permits inserting arbitrary data into unpopulated areas of anonymous VMAs? If the intention is that the process should otherwise be sandboxed to not have access to userfaultfd, that should maybe be documented. (Alternatively I guess you could theoretically remove the VM_MAYWRITE bit from marked VMAs, but that might be more strict than we want, since it'd also block all FOLL_FORCE writes.)
There are also some interfaces like AIO or the X86 Shadow Stacks interface that indirectly unmap memory through the kernel and look like they could perhaps be tricked into racily unmapping a just-created sealed VMA. You'd probably have to make sure that they can't do that and essentially treat their unmap operations as if they came from userspace, I guess? What Linus just wrote.
I think either way this feature needs some documentation on what kind of context it's supposed to run in.
Each bit represents sealing for one specific syscall type, e.g. MM_SEAL_MPROTECT will deny the mprotect syscall. A bitmask was chosen so that the API is extensible, i.e. when needed, sealing can be extended to madvise, mlock, etc. Backward compatibility is also easy.
The kernel will remember which seal types are applied, and the application doesn’t need to repeat all existing seal types in the next mseal(). Once a seal type is applied, it can’t be unsealed. Calling mseal() with a seal type that is already applied is a no-op, not a failure.
MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
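The accumulation rules above can be modeled in a few lines of userspace C. This is only a sketch of the described semantics, not code from the patchset: model_mseal() and MM_SEAL_ALL are hypothetical, the bit values are the ones listed in the cover letter, and the error codes (-EPERM, -EINVAL) are assumptions.

```c
#include <assert.h>
#include <errno.h>

/* Seal type bits as listed in the cover letter. */
#define MM_SEAL_MSEAL    0x1u
#define MM_SEAL_MPROTECT 0x2u
#define MM_SEAL_MUNMAP   0x4u
#define MM_SEAL_MMAP     0x8u
#define MM_SEAL_MREMAP   0x10u
#define MM_SEAL_ALL      0x1fu  /* hypothetical convenience mask */

/*
 * Hypothetical model of adding seal types to one VMA. `vm_seals` is the
 * current mask and is updated in place. Returns 0 on success or a negative
 * errno value (the specific codes are assumptions of this sketch).
 */
int model_mseal(unsigned int *vm_seals, unsigned int types)
{
    unsigned int added = types & ~*vm_seals;    /* bits not yet applied */

    if (types & ~MM_SEAL_ALL)
        return -EINVAL;         /* unknown seal type */
    if (added == 0)
        return 0;               /* everything already applied: no-op */
    if (*vm_seals & MM_SEAL_MSEAL)
        return -EPERM;          /* adding new seal types is sealed */

    *vm_seals |= added;         /* seals only accumulate, never clear */
    assert((*vm_seals & ~MM_SEAL_ALL) == 0);
    return 0;
}
```

Starting from an empty mask, sealing MM_SEAL_MPROTECT and then MM_SEAL_MSEAL succeeds; a later attempt to add MM_SEAL_MREMAP is denied, while repeating MM_SEAL_MPROTECT still returns success.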
Internally, vm_area_struct adds a new field vm_seals, to store the bit masks.
For the affected syscalls, such as mprotect, a sealing check (can_modify_mm) is added. It usually happens early in the syscall, before any update is made to VMAs. The effect is that if any of the VMAs in the given address range fails the sealing check, none of the VMAs will be updated. It might be worth noting that this is different from the rest of mprotect(), where some updates can happen even when mprotect returns failure. Considering that can_modify_mm only checks vm_seals in vm_area_struct, and does not go deeper into the page tables or update any HW, all-or-nothing behavior might fit better here. I would like to hear the community's feedback on this.
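The check-first behavior described above can be sketched as a userspace model. struct model_vma, model_can_modify() and model_mprotect() are hypothetical names that only mirror the idea of can_modify_mm; VMA splitting is ignored and the -EPERM return value is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MM_SEAL_MPROTECT 0x2u   /* value from the cover letter */

/* Hypothetical, simplified VMA: range, protection bits, seal mask. */
struct model_vma {
    uintptr_t start, end;       /* half-open [start, end) */
    int prot;
    unsigned int vm_seals;
};

/* True if no VMA overlapping [start, end) carries the given seal. */
bool model_can_modify(const struct model_vma *vmas, size_t n,
                      uintptr_t start, uintptr_t end, unsigned int seal)
{
    for (size_t i = 0; i < n; i++) {
        bool overlaps = vmas[i].start < end && start < vmas[i].end;
        if (overlaps && (vmas[i].vm_seals & seal))
            return false;
    }
    return true;
}

/*
 * Model of mprotect() with the sealing check done up front: either every
 * overlapping VMA is updated, or none is. Overlapping VMAs are changed
 * whole because this model does not split them.
 */
int model_mprotect(struct model_vma *vmas, size_t n,
                   uintptr_t start, uintptr_t end, int prot)
{
    assert(start < end);

    /* Pass 1: check every VMA before touching anything. */
    if (!model_can_modify(vmas, n, start, end, MM_SEAL_MPROTECT))
        return -EPERM;

    /* Pass 2: apply the change. */
    for (size_t i = 0; i < n; i++) {
        if (vmas[i].start < end && start < vmas[i].end)
            vmas[i].prot = prot;
    }
    return 0;
}
```

If only the middle of three adjacent VMAs is sealed, a request spanning all three fails and leaves the two unsealed neighbors untouched as well.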
The idea that inspired this patch comes from Stephen Röttger’s work on V8 CFI [5]. The Chrome browser on ChromeOS will be the first user of this API.
In addition, Stephen is working on a glibc change to add sealing support to the dynamic linker, so that it seals all non-writable segments at startup. When that work is completed, all applications can automatically benefit from these new protections.
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9...
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgea...
Jeff Xu (8):
  Add mseal syscall
  Wire up mseal syscall
  mseal: add can_modify_mm and can_modify_vma
  mseal: seal mprotect
  mseal munmap
  mseal mremap
  mseal mmap
  selftest mm/mseal mprotect/munmap/mremap/mmap
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |    2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 fs/aio.c                                    |    5 +-
 include/linux/mm.h                          |   55 +-
 include/linux/mm_types.h                    |    7 +
 include/linux/syscalls.h                    |    2 +
 include/uapi/asm-generic/unistd.h           |    5 +-
 include/uapi/linux/mman.h                   |    6 +
 ipc/shm.c                                   |    3 +-
 kernel/sys_ni.c                             |    1 +
 mm/Kconfig                                  |    8 +
 mm/Makefile                                 |    1 +
 mm/internal.h                               |    4 +-
 mm/mmap.c                                   |   49 +-
 mm/mprotect.c                               |    6 +
 mm/mremap.c                                 |   19 +-
 mm/mseal.c                                  |  328 +++++
 mm/nommu.c                                  |    6 +-
 mm/util.c                                   |    8 +-
 tools/testing/selftests/mm/Makefile         |    1 +
 tools/testing/selftests/mm/mseal_test.c     | 1428 +++++++++++++++++++
 37 files changed, 1934 insertions(+), 28 deletions(-)
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c
-- 2.42.0.609.gbb76f46606-goog