On Thu, Jul 21, 2022 at 03:05:09PM +0000, Sean Christopherson wrote:
On Thu, Jul 21, 2022, David Hildenbrand wrote:
On 21.07.22 11:44, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
Normally, a write to unallocated space of a file, or to a hole in a sparse file, automatically causes space allocation; for a memfd, this means memory allocation. This new seal prevents such automatic allocation, whether it comes from a direct write() or from a write to a previously mmap-ed area. The seal does not prevent fallocate(), so an explicit fallocate() can still allocate and can be used to reserve memory.
This is used to prevent unintentional allocation from userspace on a stray or careless write; any intentional allocation should use an explicit fallocate(). One of the main use cases is avoiding double allocation of memory in confidential computing, where two memfds back guest memory but only one is alive at any given time, and we want to prevent memory allocation for the other memfd, which may have been mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
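The behaviour the seal targets can be observed on an unpatched kernel. The following is only a minimal sketch, assuming Linux with glibc 2.27 or later; the helper names (make_sparse_memfd, stray_write, reserve, backing_blocks) are made up for illustration and do not come from the patch. It deliberately does not apply F_SEAL_AUTO_ALLOCATE, since that flag exists only in this series.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Number of 512-byte blocks currently backing fd, or -1 on error. */
static long backing_blocks(int fd)
{
	struct stat st;

	return fstat(fd, &st) == 0 ? (long)st.st_blocks : -1;
}

/* Create a sealable memfd and size it with ftruncate(), leaving it all hole. */
static int make_sparse_memfd(off_t size)
{
	int fd = memfd_create("auto-alloc-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* A "stray" write through a shared mapping: memory is allocated implicitly. */
static int stray_write(int fd, size_t size)
{
	char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED)
		return -1;
	p[0] = 1;	/* write fault on a hole */
	return munmap(p, size);
}

/* The explicit path the commit message recommends for intentional allocation. */
static int reserve(int fd, off_t size)
{
	return fallocate(fd, 0, 0, size);
}
```

Calling backing_blocks() after make_sparse_memfd(), again after stray_write(), and once more after reserve() shows the block count growing at each step. With the proposed seal applied, the intent is that only the reserve() step would still allocate.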
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
 include/uapi/linux/fcntl.h |  1 +
 mm/memfd.c                 |  3 ++-
 mm/shmem.c                 | 16 ++++++++++++++--
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..98bdabc8e309 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
 #define F_SEAL_GROW	0x0004	/* prevent file from growing */
 #define F_SEAL_WRITE	0x0008	/* prevent writes */
 #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
+#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
Why only "on writes" and not "on reads"? IIRC, shmem doesn't support the shared zeropage, so you'll simply allocate a new page via read() or on read faults.
Correction: on read() we don't allocate a fresh page. But on read faults we would. So this comment here needs clarification.
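The distinction between read() and read faults can be checked from userspace. This is a rough sketch under the same assumptions as any memfd test program (Linux, glibc 2.27 or later); the helper names are hypothetical, and the result depends on the running kernel's shmem behaviour, so treat it as a way to verify the claim rather than as proof of it.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Number of 512-byte blocks currently backing fd, or -1 on error. */
static long blocks_backing(int fd)
{
	struct stat st;

	return fstat(fd, &st) == 0 ? (long)st.st_blocks : -1;
}

/* Create a memfd of the given size that consists entirely of a hole. */
static int make_hole_memfd(off_t size)
{
	int fd = memfd_create("read-fault-demo", MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Read one byte from the hole with pread(); returns bytes read or -1. */
static ssize_t read_by_syscall(int fd)
{
	char c;

	return pread(fd, &c, 1, 0);
}

/* Read one byte from the hole through a read-only shared mapping. */
static int read_by_fault(int fd, size_t size)
{
	volatile const char *p = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED)
		return -1;
	(void)p[0];	/* volatile access forces the read fault */
	return munmap((void *)p, size);
}
```

Comparing blocks_backing() after read_by_syscall() with the value after read_by_fault() on a fresh memfd should show whether only the fault path allocates, as described above.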
Not just the comment, the code too. The intent of F_SEAL_AUTO_ALLOCATE is very much to block _all_ implicit allocations (or maybe just fault-based allocations if "implicit" is too broad of a description).
So maybe still go with your initial suggestion, F_SEAL_FAULT_ALLOCATIONS? One reason I don't like it is that the write() syscall also causes allocation, and we want to prevent that too.
Chao