Hi Pedro
On Sun, Sep 8, 2024 at 2:35 PM Pedro Falcato pedro.falcato@gmail.com wrote:
I agree with most of the points. Sitting down here to write unofficial guidelines for mseal behavior.
mseal should seal regions and mark them immutable, which means their protection and contents (ish?) (not _only_ backing mapping, but also contents in general (see madvise below)) should not be changed throughout the lifetime of the address space.
For the general syscall interface, this means:
- mprotect and munmap need to be blocked on mseal regions.
1a) munmap _cannot_ tolerate partial failure, per POSIX. 2b) mmap MAP_FIXED counts as an unmap operation and also needs to be blocked and return -EPERM.
- Most madvise calls are allowed, except for destructive operations on
read-only anonymous _pages_ (MADV_DONTNEED is destructive for anon, but we also don't care about blocking these ops if we can do it manually with e.g memset) 2a) The current implementation only blocks discard on anonymous _regions_, which is slightly different. We probably do want to block these on MAP_PRIVATE file mappings, as to block stuff like madvise MADV_DONTNEED on program rodata. 2b) We take into account pkeys when doing the permission checks.
- mremap is not allowed as we'd change the "contents" of the old region.
3a) Should mremap expansion be allowed? aka only block moving and shrinking, but allow expansion. We already informally allow expansion if e.g mmapping after it + mseal.
mlock and msync are allowed.
mseal is blocked.
mseal is not blocked, i.e. seal on an already sealed memory is no-op. This is described in mseal.rst [1]
[1] https://github.com/torvalds/linux/blob/master/Documentation/userspace-api/ms...
- Other miscellaneous syscalls (mbind, etc) that do not change contents in any way, are allowed.
6a) This obviously means PTEs can change as long as the contents don't. Swapping is also ok.
- FOLL_FORCE (kernel-internal speak, more commonly seen as ptrace and /proc/self/mem from userspace) should be disallowed (?)
7a) This currently does not happen, and seems like a large hole? But disallowing this would probably severely break ptrace if the ELF sealing plans come to fruition.
Jann Horn pointed out FOLL_FORCE during RFC [2], and this is in mseal.rst too.
In short, FOLL_FORCE is not covered by mseal. On ChromeOS, FOLL_FORCE is disabled. Recently, Adrian Ratiu upstreamed that [3]
[2] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfU... [3] https://lore.kernel.org/lkml/20240802080225.89408-1-adrian.ratiu@collabora.c...
-Jeff
When we say "disallowed", we usually (apart from munmap) allow for partial failure. This means getting an -EPERM while part of the call succeeded, if we e.g mprotect a region consisting of [NORMAL VMA][SEALED VMA]. We do not want to test for this, because we do not want to paint ourselves into a corner - this is strictly "undefined behavior". The msealed regions themselves will never be touched in such cases. (we do however want to test munmap operation atomicity, but this is also kind of a munmap-related test, and might not really be something we really want in the mseal tests)
Kernel-internal wise: The VMA and PTE modifications resulting from the above operations are blocked. Sealed VMAs allow splitting and merging; there was contention about the splitting issue, but it truly does not make sense to block operations unless they affect a VMA entirely, and that would also force VMA merging to be ABI ("vma_merge isn't merging these two regions and now my madvise works/doesn't work :(").
Do I have everything right? Am I missing anything?
-- Pedro