* jeffxu@chromium.org <jeffxu@chromium.org> [240415 12:35]:
From: Jeff Xu <jeffxu@chromium.org>
The new mseal() is a syscall on 64-bit CPUs, with the following signature:
int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks the following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(). These can leave an empty space, which can then be replaced with a VMA with a new set of attributes.
2> Moving or expanding a different VMA into the current location, via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory.
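To make the list above concrete, here is a minimal userspace sketch, not taken from the patch. It assumes a Linux system, that the C library may lack an mseal() wrapper (hence the hypothetical mseal_raw() helper), and that the syscall number is 462 where the headers do not define __NR_mseal. The first function shows the memset(0) effect of MADV_DONTNEED on ordinary, unsealed anonymous memory; the second seals a page and checks whether a later mprotect() is refused with EPERM.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 462 is the mseal syscall number when the headers predate it. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/* Hypothetical helper: invoke mseal directly, since libc may have no wrapper. */
static int mseal_raw(void *addr, size_t len, unsigned long flags)
{
    return (int)syscall(__NR_mseal, addr, len, flags);
}

/*
 * Returns 1 if a private anonymous page reads back as zero after
 * MADV_DONTNEED, 0 if the old contents survived, -1 on setup failure.
 */
static int dontneed_discards_contents(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);                 /* dirty the page */
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }
    int zeroed = (p[0] == 0 && p[len - 1] == 0);
    munmap(p, len);
    return zeroed;
}

/*
 * Returns 1 if mprotect() on a sealed page fails with EPERM,
 * 0 if it succeeds or fails with another error,
 * -1 if the page could not be mapped or sealed (e.g. kernel without mseal).
 */
static int sealed_mprotect_is_rejected(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (mseal_raw(p, len, 0) != 0) {      /* flags are reserved: pass 0 */
        munmap(p, len);
        return -1;
    }

    errno = 0;
    int rc = mprotect(p, len, PROT_READ);
    /* The sealed mapping is left in place; munmap() would be refused too. */
    return (rc == -1 && errno == EPERM) ? 1 : 0;
}
```

On a kernel without mseal(), sealed_mprotect_is_rejected() returns -1 rather than claiming a result, so the sketch can be tried on older systems as well.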
The following input from the RFC has been incorporated into this patch:
Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations.
Linus Torvalds: assisting in defining system call signature and scope.
Liam R. Howlett: perf optimization.
Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD.
Finally, the idea that inspired this patch comes from Stephen Röttger’s work in Chrome V8 CFI.
Today, no per-vma check is made before entering a per-vma modification loop. This means that mseal() differs in behaviour: it brings "up-front failure", whereas every other function has "partial change failure".
I'm not saying it's wrong or that it's right - I'm just wondering what the direction is here. Either we should do as much up-front as possible or keep with tradition and have (partial) success where possible.
If you look at do_mprotect_pkey(), you can even see map_deny_write_exec() being checked in a loop during modifications.
I think we can all agree that having some checks up-front and some later, without any reason, will lead to a higher probability of things getting missed.
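To illustrate the two styles being contrasted, here is a toy userspace model. It is only a sketch: struct toy_vma and both functions are hypothetical and are not the kernel's data structures or code paths. The first function checks inside the modification loop, so earlier entries are already changed when it fails; the second validates the whole range before touching anything.

```c
#include <stddef.h>

/* Toy stand-in for a VMA; NOT the kernel's struct vm_area_struct. */
struct toy_vma {
    int prot;    /* current protection bits */
    int sealed;  /* non-zero if the entry must not be modified */
};

/*
 * "Partial change failure": the check happens inside the modification
 * loop, so entries before the first sealed one keep the new value even
 * though the call reports an error.
 */
static int change_prot_partial(struct toy_vma *v, size_t n, int prot)
{
    for (size_t i = 0; i < n; i++) {
        if (v[i].sealed)
            return -1;
        v[i].prot = prot;
    }
    return 0;
}

/*
 * "Up-front failure": the whole range is validated first, so a failure
 * leaves every entry untouched.
 */
static int change_prot_upfront(struct toy_vma *v, size_t n, int prot)
{
    for (size_t i = 0; i < n; i++) {
        if (v[i].sealed)
            return -1;
    }
    for (size_t i = 0; i < n; i++)
        v[i].prot = prot;
    return 0;
}
```

With a range whose last entry is sealed, both calls fail, but only the partial variant leaves the earlier entries modified. That difference is what a caller sees when the two styles are mixed.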
Thanks,
Liam