Re: [RFC PATCH v1 0/8] Introduce mseal() syscall

17 Oct 2023


On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox <willy@infradead.org> wrote:
...
On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
...
Modern CPUs support memory permissions such as RW and NX bits. Linux has
supported NX since the release of kernel version 2.6.8 in August 2004 [1].
This seems like a confusing way to introduce the subject.  Here, you're
talking about page permissions, whereas (as far as I can tell), mseal() is
about making _virtual_ addresses immutable, for some value of immutable.
...
Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where
a corrupted pointer is passed to a memory management syscall. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime.
A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
Also, Chrome wants to adopt this feature for their CFI work [2] and this
patchset has been designed to be compatible with the Chrome use case.
This [2] seems very generic and wide-ranging, not helpful.  [5] was more
useful to understand what you're trying to do.
...
The new mseal() is an architecture-independent syscall with the
following signature:
mseal(void *addr, size_t len, unsigned int types, unsigned int flags)
addr/len: memory range.  Must be continuous/allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to comments in mseal.c. Those are also fully
covered by the selftest.
Mmm.  So when you say "continuous/allocated" what you really mean is
"Must have contiguous VMAs" rather than "All pages in this range must
be populated", yes?
...
types: bit mask specifying which syscalls to seal. Currently defined:
MM_SEAL_MSEAL     0x1
MM_SEAL_MPROTECT  0x2
MM_SEAL_MUNMAP    0x4
MM_SEAL_MMAP      0x8
MM_SEAL_MREMAP    0x10
I don't understand why we want this level of granularity.  The OpenBSD
and XNU examples just say "This must be immutable*".
* For values of immutable that allow downgrading access (e.g. RW to RO
or RX to RO), but not upgrading access (RW->RX, RO->*, RX->RW).
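To make the downgrade-versus-upgrade distinction concrete, here is a minimal sketch. It assumes one possible reading of the footnote: a change counts as a downgrade when the new protection bits are a subset of the old ones. The helper names is_downgrade() and downgrade_examples() are hypothetical and appear in neither the patchset nor this thread. Under this reading RO->PROT_NONE also counts as a downgrade, which the message does not address.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the patchset): returns true when going
 * from old_prot to new_prot only removes access bits, never adds any.
 */
static bool is_downgrade(int old_prot, int new_prot)
{
    const int mask = PROT_READ | PROT_WRITE | PROT_EXEC;

    /* Any bit set in new_prot but clear in old_prot is an upgrade. */
    return ((new_prot & mask) & ~(old_prot & mask)) == 0;
}

/* The transitions named in the message, checked against the helper. */
static void downgrade_examples(void)
{
    /* Allowed: RW -> RO, RX -> RO */
    assert(is_downgrade(PROT_READ | PROT_WRITE, PROT_READ));
    assert(is_downgrade(PROT_READ | PROT_EXEC, PROT_READ));

    /* Denied: RW -> RX, RO -> RW, RX -> RW */
    assert(!is_downgrade(PROT_READ | PROT_WRITE, PROT_READ | PROT_EXEC));
    assert(!is_downgrade(PROT_READ, PROT_READ | PROT_WRITE));
    assert(!is_downgrade(PROT_READ | PROT_EXEC, PROT_READ | PROT_WRITE));
}
```

A subset test keeps the rule to a single bitwise expression. A real implementation might well want a different policy for PROT_NONE or for architecture-specific protection bits.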
...
Each bit represents sealing for one specific syscall type, e.g.
MM_SEAL_MPROTECT will deny the mprotect syscall. The reason for using a
bitmask is that the API is extensible, i.e. when needed, the sealing can
be extended to madvise, mlock, etc. Backward compatibility is also easy.
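As a rough userspace model of the per-syscall bitmask described above, the sketch below reuses the MM_SEAL_* values from the cover letter. Everything else is hypothetical and is not the kernel implementation: struct sealed_range, op_is_sealed() and add_seals(). It also assumes that MM_SEAL_MSEAL blocks any further sealing, by analogy with the stated MM_SEAL_MPROTECT behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Seal type bits as listed in the cover letter. */
#define MM_SEAL_MSEAL    0x1u
#define MM_SEAL_MPROTECT 0x2u
#define MM_SEAL_MUNMAP   0x4u
#define MM_SEAL_MMAP     0x8u
#define MM_SEAL_MREMAP   0x10u

/* Hypothetical per-mapping state: the seal bits applied so far. */
struct sealed_range {
    unsigned int seals;
};

/* Hypothetical check: is the operation named by the single bit `op` denied? */
static bool op_is_sealed(const struct sealed_range *r, unsigned int op)
{
    assert(op != 0 && (op & (op - 1)) == 0); /* exactly one bit set */
    return (r->seals & op) != 0;
}

/*
 * Hypothetical model of adding seals. Assumption: once MM_SEAL_MSEAL is
 * set, no further seal types can be added. Returns 0 on success, -1 if
 * the request is refused.
 */
static int add_seals(struct sealed_range *r, unsigned int types)
{
    if (op_is_sealed(r, MM_SEAL_MSEAL))
        return -1;
    r->seals |= types;
    return 0;
}
```

In this model a caller can seal mprotect first and munmap later, and then close the set with MM_SEAL_MSEAL. Extending it to madvise or mlock would only need a new bit.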
Honestly, it feels too flexible.  Why not just two flags to mprotect()
-- PROT_IMMUTABLE and PROT_DOWNGRADABLE?  I can see a use for that --
maybe for some things we want to be able to downgrade and for other
things, we don't.
I think it's worth pointing out that this suggestion (with PROT_*)
could easily integrate with mmap() and as such allow for one-shot
mmap() + mseal().
If we consider the common case as 'addr = mmap(...); mseal(addr);', it
definitely sounds like a performance win as we halve the number of
syscalls for a sealed mapping. And if we trivially look at e.g. OpenBSD
ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
like common patterns.
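For illustration only, the two patterns mentioned above might look roughly like this in userspace C. seal_stub() is a hypothetical no-op standing in for the sealing step, since the final interface is exactly what is under discussion. The function names map_then_seal() and map_fill_protect_seal() are likewise made up for this sketch. With a PROT_*-style flag, the separate sealing call in each function would fold into the preceding mmap() or mprotect().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical placeholder for the sealing syscall; does nothing here. */
static int seal_stub(void *addr, size_t len)
{
    (void)addr;
    (void)len;
    return 0;
}

/* Pattern 1: mmap() followed by a separate sealing call (two syscalls). */
static void *map_then_seal(size_t len, int prot)
{
    assert(len > 0);
    void *p = mmap(NULL, len, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (seal_stub(p, len) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}

/*
 * Pattern 2: map writable, fill, mprotect() down to read-only, then
 * seal. The mprotect() and the seal are again two separate calls.
 */
static void *map_fill_protect_seal(size_t len, unsigned char fill)
{
    assert(len > 0);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memset(p, fill, len);
    if (mprotect(p, len, PROT_READ) != 0 || seal_stub(p, len) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```

Both helpers return NULL on failure, and the caller releases the mapping with munmap(). That release works only because the stub seals nothing.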
-- 
Pedro
