Hi Pedro,
On Fri, Sep 27, 2024 at 3:59 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
On Fri, Sep 27, 2024 at 06:52:09PM GMT, jeffxu@chromium.org wrote:
From: Jeff Xu <jeffxu@chromium.org>
Update doc after in-loop change: mprotect/madvise can leave the range partially updated, and munmap is atomic.
Fix indentation and clarify some sections to improve readability.
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
 Documentation/userspace-api/mseal.rst | 290 ++++++++++++--------------
 1 file changed, 136 insertions(+), 154 deletions(-)
diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
index 4132eec995a3..68986084e191 100644
--- a/Documentation/userspace-api/mseal.rst
+++ b/Documentation/userspace-api/mseal.rst
@@ -23,177 +23,159 @@ applications can additionally seal security critical data at runtime.
 A similar feature already exists in the XNU kernel with the
 VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
-User API
-mseal()
-The mseal() syscall has the following signature:
-``int mseal(void *addr, size_t len, unsigned long flags)``
-**addr/len**: virtual memory address range.
-The address range set by ``addr``/``len`` must meet:
- The start address must be in an allocated VMA.
- The start address must be page aligned.
- The end address (``addr`` + ``len``) must be in an allocated VMA.
- no gap (unallocated memory) between start and end address.
-The ``len`` will be paged aligned implicitly by the kernel.
-**flags**: reserved for future use.
-**return values**:
-- ``0``: Success.
-- ``-EINVAL``:
- Invalid input ``flags``.
- The start address (``addr``) is not page aligned.
- Address range (``addr`` + ``len``) overflow.
-- ``-ENOMEM``:
- The start address (``addr``) is not allocated.
- The end address (``addr`` + ``len``) is not allocated.
- A gap (unallocated memory) between start and end address.
-- ``-EPERM``:
- sealing is supported only on 64-bit CPUs, 32-bit is not supported.
-- For above error cases, users can expect the given memory range is
- unmodified, i.e. no partial update.
-- There might be other internal errors/cases not listed here, e.g.
- error during merging/splitting VMAs, or the process reaching the max
- number of supported VMAs. In those cases, partial updates to the given
- memory range could happen. However, those cases should be rare.
-**Blocked operations after sealing**:
- Unmapping, moving to another location, and shrinking the size,
- via munmap() and mremap(), can leave an empty space, therefore
- can be replaced with a VMA with a new set of attributes.
- Moving or expanding a different VMA into the current location,
- via mremap().
- Modifying a VMA via mmap(MAP_FIXED).
- Size expansion, via mremap(), does not appear to pose any
- specific risks to sealed VMAs. It is included anyway because
- the use case is unclear. In any case, users can rely on
- merging to expand a sealed VMA.
- mprotect() and pkey_mprotect().
- Some destructive madvice() behaviors (e.g. MADV_DONTNEED)
- for anonymous memory, when users don't have write permission to the
- memory. Those behaviors can alter region contents by discarding pages,
- effectively a memset(0) for anonymous memory.
- Kernel will return -EPERM for blocked operations.
- For blocked operations, one can expect the given address is unmodified,
- i.e. no partial update. Note, this is different from existing mm
- system call behaviors, where partial updates are made till an error is
- found and returned to userspace. To give an example:
- Assume following code sequence:
- ptr = mmap(null, 8192, PROT_NONE);
- munmap(ptr + 4096, 4096);
- ret1 = mprotect(ptr, 8192, PROT_READ);
- mseal(ptr, 4096);
- ret2 = mprotect(ptr, 8192, PROT_NONE);
- ret1 will be -ENOMEM, the page from ptr is updated to PROT_READ.
- ret2 will be -EPERM, the page remains to be PROT_READ.
-**Note**:
-- mseal() only works on 64-bit CPUs, not 32-bit CPU.
-- users can call mseal() multiple times, mseal() on an already sealed memory
- is a no-action (not error).
-- munseal() is not supported.
+SYSCALL
+=======
+mseal syscall signature
+-----------------------
- **int** mseal(**void *** addr, **size_t** len, **unsigned long** flags)
- **addr**/**len**: virtual memory address range.
The address range set by **addr**/**len** must meet:
- The start address must be in an allocated VMA.
- The start address must be page aligned.
- The end address (**addr** + **len**) must be in an allocated VMA.
- no gap (unallocated memory) between start and end address.
The ``len`` will be page aligned implicitly by the kernel.
- **flags**: reserved for future use.
- **Return values**:
- **0**: Success.
- **-EINVAL**:
* Invalid input ``flags``.
* The start address (``addr``) is not page aligned.
* Address range (``addr`` + ``len``) overflow.
- **-ENOMEM**:
* The start address (``addr``) is not allocated.
* The end address (``addr`` + ``len``) is not allocated.
* A gap (unallocated memory) between start and end address.
- **-EPERM**:
* sealing is supported only on 64-bit CPUs, 32-bit is not supported.
- **Note about error return**:
- For above error cases, users can expect the given memory range is
unmodified, i.e. no partial update.
- There might be other internal errors/cases not listed here, e.g.
error during merging/splitting VMAs, or the process reaching the max
number of supported VMAs. In those cases, partial updates to the given
memory range could happen. However, those cases should be rare.
How about turning the above into a man page?
Yes, I have a TODO to add a man page :-)
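Until a man page exists, a minimal sketch of calling mseal from userspace could look like the following. It assumes a 64-bit Linux kernel that implements mseal, and it uses the raw syscall because the installed libc may not provide a wrapper. The fallback syscall number 462 is an assumption to check against your kernel headers, and the names mseal_raw and seal_then_try_unmap are made up for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 462 is the mseal syscall number. Used only when the
 * installed kernel headers do not define __NR_mseal yet. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/* Hypothetical helper: thin wrapper around the raw syscall. */
static int mseal_raw(void *addr, size_t len, unsigned long flags)
{
	return (int)syscall(__NR_mseal, addr, len, flags);
}

/*
 * Returns 1 if munmap on the sealed page was refused with EPERM,
 * 0 if mseal is not usable here (old kernel, 32-bit, filtered),
 * -1 on any unexpected result.
 */
static int seal_then_try_unmap(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	void *ptr = mmap(NULL, page, PROT_READ,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (ptr == MAP_FAILED)
		return -1;

	if (mseal_raw(ptr, page, 0) != 0) {
		munmap(ptr, page);	/* not sealed, so cleanup works */
		return 0;
	}

	/* Sealing again is documented as a no-op, not an error. */
	if (mseal_raw(ptr, page, 0) != 0)
		return -1;

	/* The mapping now stays until exit; munmap should be refused. */
	if (munmap(ptr, page) == 0)
		return -1;
	return errno == EPERM ? 1 : -1;
}
```

Called from main(), seal_then_try_unmap() should report 1 on a kernel with mseal and 0 where the syscall is unavailable. Note that a successfully sealed page is deliberately leaked, since it cannot be unmapped.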
- **Architecture support**:
mseal only works on 64-bit CPUs, not 32-bit CPU.
- **Idempotent**:
users can call mseal multiple times, mseal on an already sealed memory
is a no-action (not error).
- **no munseal**
Once mapping is sealed, it can't be unsealed. kernel should never
have munseal, this is consistent with other sealing feature, e.g.
F_SEAL_SEAL for file.
+Blocked mm syscall for sealed mapping
+-------------------------------------
- It might be important to note: **once the mapping is sealed, it will
- stay in the process's memory till the process terminates**.
- Example::
*ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
rc = mseal(ptr, 4096, 0);
/* munmap will fail */
rc = munmap(ptr, 4096);
assert(rc < 0);
- Blocked mm syscall:
- munmap
- mmap
- mremap
- mprotect and pkey_mprotect
- some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
- The first set of syscall to block is munmap, mremap, mmap. They can
- either leave an empty space in the address space, therefore allow
- replacement with a new mapping with new set of attributes, or can
- overwrite the existing mapping with another mapping.
- mprotect and pkey_mprotect are blocked because they changes the
s/changes/change/
- protection bits (rwx) of the mapping.
- Some destructive madvise behaviors (MADV_DONTNEED, MADV_FREE,
- MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK)
- for anonymous memory, when users don't have write permission to the
- memory. Those behaviors can alter region contents by discarding pages,
- effectively a memset(0) for anonymous memory.
What's the difference between anonymous memory and MAP_PRIVATE | MAP_FILE?
MAP_FILE does not seem to be used? An anonymous mapping is a mapping that is not backed by a file.
The feature now, as is (as far as I understand!) will allow you to do things like MADV_DONTNEED on a read-only file mapping, e.g. .text. This is obviously wrong?
When MADV_DONTNEED is called, the pages are freed. On a file-backed mapping, if the process reads from the mapping again, the content is retrieved from the file.
For an anonymous mapping, there is no backing file, so a later read from the mapping is filled with zeros, which makes the call equivalent to memset(0).
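The anonymous case can be checked with a small sketch. It deliberately uses a writable, unsealed mapping so that it runs on any Linux kernel; the function name byte_after_dontneed is hypothetical.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns the first byte of an anonymous private page after it was
 * filled with 0xAA and then discarded with MADV_DONTNEED, or -1 on
 * setup failure.
 */
static int byte_after_dontneed(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
				MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAA, page);
	if (madvise(p, page, MADV_DONTNEED) != 0) {
		munmap(p, page);
		return -1;
	}

	/* No file backs the page, so the next read faults in a zero page. */
	int value = p[0];
	munmap(p, page);
	return value;
}
```

The 0xAA pattern is gone after the madvise call, which is why the operation matters for memory the process could not otherwise write to.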
- Kernel will return -EPERM for blocked syscalls.
- When a blocked syscall returns -EPERM due to sealing, the memory regions may or may not be changed, depending on the syscall being blocked:
- munmap: munmap is atomic. If one of VMAs in the given range is
sealed, none of VMAs are updated.
- mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
when mprotect over multiple VMAs, mprotect might update the beginning
VMAs before reaching the sealed VMA and return -EPERM.
- mmap and mremap: undefined behavior.
mmap and mremap are actually not undefined, as they use munmap semantics for their unmapping. Whether this is something we'd want to document, I don't know honestly (nor do I think it is ever written down in POSIX?)
I'm not sure if I can declare mmap/mremap as atomic.
It might be achievable because munmap is atomic, but I'm not sure, as I didn't test this. Would you like to find out?
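For comparison, the partial-update behavior of plain mprotect can be reproduced without sealing, using the gap example from the old text (two pages, the second one unmapped). This is only a sketch: the helper names are hypothetical, and readability is probed by writing one byte from the page into a pipe, which fails with EFAULT on an unreadable page.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if addr is readable, 0 if not, -1 if the probe failed.
 * write(2) to a pipe copies from the buffer, so an unreadable page
 * makes it fail instead of crashing the process. */
static int is_readable(const void *addr)
{
	int fds[2];
	if (pipe(fds) != 0)
		return -1;
	ssize_t n = write(fds[1], addr, 1);
	close(fds[0]);
	close(fds[1]);
	return n == 1;
}

/*
 * Returns 1 if mprotect failed with ENOMEM yet still changed the first
 * page to PROT_READ, 0 if it failed without changing it, -1 otherwise.
 */
static int mprotect_partial_update(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *ptr = mmap(NULL, 2 * page, PROT_NONE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (ptr == MAP_FAILED)
		return -1;

	/* Punch a hole: the second page is now unallocated. */
	if (munmap(ptr + page, page) != 0)
		return -1;

	int ret = mprotect(ptr, 2 * page, PROT_READ);
	int err = errno;
	int readable = is_readable(ptr);
	munmap(ptr, page);

	if (ret == 0 || err != ENOMEM || readable < 0)
		return -1;
	return readable;
}
```

A similar probe around a sealed VMA, with mmap(MAP_FIXED) or mremap in place of mprotect, would be one way to test whether those calls leave the unsealed part of the range untouched.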
Use cases:
- glibc: The dynamic linker, during loading ELF executables, can apply sealing to
- non-writable memory segments.
- mapping segments.
- Chrome browser: protect some security sensitive data-structures.
-Notes on which memory to seal:
-It might be important to note that sealing changes the lifetime of a mapping,
-i.e. the sealed mapping won’t be unmapped till the process terminates or the
-exec system call is invoked. Applications can apply sealing to any virtual
-memory region from userspace, but it is crucial to thoroughly analyze the
-mapping's lifetime prior to apply the sealing.
+Don't use mseal on:
+===================
+Applications can apply sealing to any virtual memory region from userspace,
+but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
+apply the sealing. This is because the sealed mapping *won’t be unmapped*
+till the process terminates or the exec system call is invoked.
There should probably be a nice disclaimer as to how most people don't need this or shouldn't use this. At least in its current form.
Yes, mseal is not for most apps. I mention the malloc example to stress that.
<snip>
> -
> -
> -Additional notes:
> -=================
>  As Jann Horn pointed out in [3], there are still a few ways to write
> -to RO memory, which is, in a way, by design. Those cases are not covered
> -by mseal(). If applications want to block such cases, sandbox tools (such as
> -seccomp, LSM, etc) might be considered.
> +to RO memory, which is, in a way, by design. And those could be blocked
> +by different security measures.
>
>  Those cases are:
> -
> -- Write to read-only memory through /proc/self/mem interface.
> -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> -- userfaultfd.
> +   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
> +   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> +   - userfaultfd.
I don't understand how this is not a problem, but MADV_DONTNEED is. To me it seems that what we have now is completely useless, because you can trivially bypass it using /proc/self/mem, which is enabled on most Linux systems.
Before you mention ChromeOS or Chrome, I don't care. Kernel features aren't designed for Chrome. They need to work with every other distro and application as well.
It seems to me that the most sensible change is blocking/somehow distinguishing between /proc/self/mem and /proc/<pid>/mem (some other process) and ptrace. As in blocking /proc/self/mem but allowing the other FOLL_FORCE's as the traditional UNIX permission model allows.
IMO, it is a matter of divide and conquer. In a nutshell, mseal only prevents certain attributes of a VMA (such as the protection bits) from changing. It does not mean that sealed RO memory is immutable. To achieve that, the system needs to apply multiple security measures.
Writing to /proc/pid/mem can be disabled via [1]. SELinux and Landlock can achieve the same protection too.
[1] https://lore.kernel.org/lkml/20240802080225.89408-1-adrian.ratiu@collabora.c...
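To make the /proc/self/mem case concrete, here is a minimal sketch of the write path under discussion. It assumes procfs is mounted and that the kernel permits FOLL_FORCE writes through /proc/self/mem. Where that is restricted or the file is unavailable, the function simply reports that nothing changed. The function name write_ro_via_proc_mem is hypothetical, and the page is not sealed, since sealing does not cover this path anyway.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a byte in a PROT_READ page was changed through
 * /proc/self/mem, 0 if the write was refused or the file is
 * unavailable, -1 on setup failure.
 */
static int write_ro_via_proc_mem(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
				MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	p[0] = 'A';
	if (mprotect(p, page, PROT_READ) != 0) {
		munmap(p, page);
		return -1;
	}

	int changed = 0;
	int fd = open("/proc/self/mem", O_RDWR);
	if (fd >= 0) {
		unsigned char b = 'B';
		/* The file offset is the virtual address to access. */
		if (pwrite(fd, &b, 1, (off_t)(uintptr_t)p) == 1)
			changed = (*(volatile unsigned char *)p == 'B');
		close(fd);
	}
	munmap(p, page);
	return changed;
}
```

A result of 1 means the read-only protection was bypassed without any mprotect call, which is the situation the extra measures mentioned above are meant to close.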
-Jeff
-- Pedro