This is v7 of this series, which implements fd-based KVM guest private memory. The patches are based on the latest kvm/queue branch commit:
b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity
Introduction
------------
In general, this patch series introduces a fd-based memslot which provides guest memory through a memory file descriptor, fd[offset, size], instead of hva/size. The fd can be created from a supported memory filesystem like tmpfs/hugetlbfs, which we refer to as the memory backing store. KVM and the memory backing store exchange callbacks when such a memslot gets created. At runtime, KVM calls into the callbacks provided by the backing store to get the pfn for a given fd+offset. The memory backing store also calls into KVM callbacks when userspace punches a hole in the fd, to notify KVM to unmap the secondary MMU page table entries.
Compared to the existing hva-based memslot, this new type of memslot allows guest memory to be unmapped from host userspace (e.g. QEMU) and even from the kernel itself, thereby reducing the attack surface and preventing bugs.
Based on this fd-based memslot, we can build guest private memory for use in confidential computing environments such as Intel TDX and AMD SEV. When supported, the memory backing store can provide more enforcement on the fd, and KVM can use a single memslot to hold both the private and shared parts of the guest memory.
mm extension
---------------------
Introduces a new MFD_INACCESSIBLE flag for memfd_create(); a file created with this flag cannot be accessed via read(), write() or mmap() etc. through normal MMU operations. The file content can only be used with the newly introduced memfile_notifier extension.
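For illustration, a minimal userspace sketch of how such a fd could be created. MFD_INACCESSIBLE and its value are assumptions taken from this series; on an unpatched kernel memfd_create() would reject the unknown flag:

/* Sketch only: MFD_INACCESSIBLE comes from this patchset, not current uapi. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MFD_INACCESSIBLE
#define MFD_INACCESSIBLE 0x0008U	/* assumed value from this series */
#endif

int main(void)
{
	char buf[16];
	int fd = memfd_create("guest-private-mem", MFD_CLOEXEC | MFD_INACCESSIBLE);

	if (fd < 0) {
		perror("memfd_create");
		return 1;
	}

	/* Ordinary userspace access paths are expected to fail... */
	if (read(fd, buf, sizeof(buf)) < 0)
		perror("read (expected to fail)");
	if (mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) == MAP_FAILED)
		perror("mmap (expected to fail)");

	/* ...the contents are reachable only through the in-kernel
	 * memfile_notifier path described below. */
	close(fd);
	return 0;
}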
The memfile_notifier extension provides two sets of callbacks for KVM to interact with the memory backing store:
- memfile_notifier_ops: callbacks for the memory backing store to notify KVM when memory gets invalidated.
- backing store callbacks: callbacks for KVM to call into the memory backing store to request memory pages for guest private memory.
The memfile_notifier extension also provides APIs for the memory backing store to register/unregister itself and to trigger the notifier when the bookmarked memory gets invalidated.
The patchset also introduces a new memfd seal, F_SEAL_AUTO_ALLOCATE, to prevent double allocation caused by unintentional guest accesses when only one side of the shared/private memfd pair is effective.
memslot extension
-----------------
Add the private fd and the fd offset to the existing 'shared' memslot so that both private and shared guest memory can live in a single memslot. A page in the memslot is either private or shared. Whether a guest page is private or shared is maintained by reusing the existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
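As a rough sketch only: the struct and field names below (kvm_userspace_memory_region_ext, KVM_MEM_PRIVATE, private_fd, private_offset) are assumptions based on the description above, not necessarily the final uAPI; it only illustrates how userspace would attach the private fd next to the shared hva:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Hypothetical sketch; assumes the extended memslot struct from this series
 * is available in the patched <linux/kvm.h>. */
static int set_private_slot(int vm_fd, int private_memfd, void *shared_hva,
			    __u64 gpa, __u64 size)
{
	struct kvm_userspace_memory_region_ext ext = {
		.region = {
			.slot            = 0,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = gpa,
			.memory_size     = size,
			/* shared part stays hva-based */
			.userspace_addr  = (__u64)(unsigned long)shared_hva,
		},
		/* private part is fd-based: fd + offset instead of hva */
		.private_fd     = private_memfd,
		.private_offset = 0,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
}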
Test
----
To test the new functionalities of this patchset, the TDX patchset is needed. Since the TDX patchset has not been merged yet, I did two kinds of tests:

- Regression test on kvm/queue (this patchset). Most of the new code is not covered. The code is also available in the repo below:
  https://github.com/chao-p/linux/tree/privmem-v7

- New functional test on the latest TDX code. The patchset is rebased onto the latest TDX code and the new functionalities are tested. See the repos below:
  Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx
  QEMU:  https://github.com/chao-p/qemu/tree/privmem-v7
An example QEMU command line for TDX test:

  -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
  -machine confidential-guest-support=tdx \
  -object memory-backend-memfd-private,id=ram1,size=${mem} \
  -machine memory-backend=ram1
Changelog
----------
v7:
- Move the private/shared info from the backing store to KVM.
- Introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
- Rework the sync mechanism between the zap/page fault paths.
- Addressed other comments in v6.

v6:
- Re-organized patches for both the mm/KVM parts.
- Added flags for memfile_notifier so its consumers can state their features and the memory backing store can check against these flags.
- Put a backing store reference in the memfile_notifier and moved pfn_ops into the backing store.
- Only support boot-time backing store registration.
- Overall KVM part improvements suggested by Sean and some others.

v5:
- Removed the userspace-visible F_SEAL_INACCESSIBLE, instead using an in-kernel flag (SHM_F_INACCESSIBLE for shmem). A private fd can only be created with MFD_INACCESSIBLE.
- Introduced new APIs for a backing store to register itself with memfile_notifier instead of a direct function call.
- Added accounting and restrictions for MFD_INACCESSIBLE memory.
- Added KVM API documentation for the new memslot extensions and a man page for the new MFD_INACCESSIBLE flag.
- Removed the overlap check for mapping the same file+offset into multiple gfns due to performance considerations; this is warned about in the documentation.
- Addressed other comments in v4.

v4:
- Decoupled the callbacks between KVM/mm from memfd and used the new name 'memfile_notifier'.
- Supported registering multiple memslots to the same backing store.
- Added per-memslot pfn_ops instead of per-system.
- Reworked the invalidation part.
- Improved the new KVM uAPIs (private memslot extension and memory error) per Sean's suggestions.
- Addressed many other minor fixes for comments from v3.

v3:
- Added locking protection when calling invalidate_page_range/fallocate callbacks.
- Changed the memslot structure to keep using useraddr for shared memory.
- Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
- Added the MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
- Commit message improvements.
- Many small fixes for comments from the last version.
Links to previous discussions
-----------------------------
[1] Original design proposal:
    https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/
[2] Updated proposal and RFC patch v1:
    https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@lin...
[3] Patch v5:
    https://lkml.org/lkml/2022/5/19/861
Chao Peng (12):
  mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd
  selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE
  mm: Introduce memfile_notifier
  mm/memfd: Introduce MFD_INACCESSIBLE flag
  KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Rename mmu_notifier_*
  KVM: Extend the memslot to support fd-based private memory
  KVM: Add KVM_EXIT_MEMORY_FAULT exit
  KVM: Register/unregister the guest private memory regions
  KVM: Handle page fault for private memory
  KVM: Enable and expose KVM_MEM_PRIVATE

Kirill A. Shutemov (1):
  mm/shmem: Support memfile_notifier
 Documentation/virt/kvm/api.rst             |  77 +++++-
 arch/arm64/kvm/mmu.c                       |   8 +-
 arch/mips/include/asm/kvm_host.h           |   2 +-
 arch/mips/kvm/mmu.c                        |  10 +-
 arch/powerpc/include/asm/kvm_book3s_64.h   |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_host.c      |   4 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c        |   4 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c     |   6 +-
 arch/powerpc/kvm/book3s_hv_nested.c        |   2 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c        |   8 +-
 arch/powerpc/kvm/e500_mmu_host.c           |   4 +-
 arch/riscv/kvm/mmu.c                       |   4 +-
 arch/x86/include/asm/kvm_host.h            |   3 +-
 arch/x86/kvm/Kconfig                       |   3 +
 arch/x86/kvm/mmu.h                         |   2 -
 arch/x86/kvm/mmu/mmu.c                     |  74 +++++-
 arch/x86/kvm/mmu/mmu_internal.h            |  18 ++
 arch/x86/kvm/mmu/mmutrace.h                |   1 +
 arch/x86/kvm/mmu/paging_tmpl.h             |   4 +-
 arch/x86/kvm/x86.c                         |   2 +-
 include/linux/kvm_host.h                   | 105 +++++---
 include/linux/memfile_notifier.h           |  91 +++++++
 include/linux/shmem_fs.h                   |   2 +
 include/uapi/linux/fcntl.h                 |   1 +
 include/uapi/linux/kvm.h                   |  37 +++
 include/uapi/linux/memfd.h                 |   1 +
 mm/Kconfig                                 |   4 +
 mm/Makefile                                |   1 +
 mm/memfd.c                                 |  18 +-
 mm/memfile_notifier.c                      | 123 ++++++++++
 mm/shmem.c                                 | 125 +++++++++-
 tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++
 virt/kvm/Kconfig                           |   3 +
 virt/kvm/kvm_main.c                        | 272 ++++++++++++++++++---
 virt/kvm/pfncache.c                        |  14 +-
 35 files changed, 1074 insertions(+), 127 deletions(-)
 create mode 100644 include/linux/memfile_notifier.h
 create mode 100644 mm/memfile_notifier.c
Normally, a write to unallocated space of a file or to a hole in a sparse file automatically causes space allocation; for a memfd, this equals memory allocation. This new seal prevents such automatic allocation, whether it comes from a direct write() or a write to a previously mmap-ed area. The seal does not prevent fallocate(), so an explicit fallocate() can still cause allocation and can be used to reserve memory.

This is used to prevent unintentional allocation from userspace on a stray or careless write; any intentional allocation should use an explicit fallocate(). One of the main use cases is to avoid double memory allocation for confidential computing, where we use two memfds to back guest memory but only one memfd is live at any point in time, and we want to prevent memory allocation for the other memfd, which may have been mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/uapi/linux/fcntl.h |  1 +
 mm/memfd.c                 |  3 ++-
 mm/shmem.c                 | 16 ++++++++++++++--
 3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..98bdabc8e309 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
 #define F_SEAL_GROW	0x0004	/* prevent file from growing */
 #define F_SEAL_WRITE	0x0008	/* prevent writes */
 #define F_SEAL_FUTURE_WRITE	0x0010  /* prevent future writes while mapped */
+#define F_SEAL_AUTO_ALLOCATE	0x0020  /* prevent allocation for writes */
 /* (1U << 31) is reserved for signed error codes */
 
 /*
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..2afd898798e4 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
 		     F_SEAL_SHRINK | \
 		     F_SEAL_GROW | \
 		     F_SEAL_WRITE | \
-		     F_SEAL_FUTURE_WRITE)
+		     F_SEAL_FUTURE_WRITE | \
+		     F_SEAL_AUTO_ALLOCATE)
 
 static int memfd_add_seals(struct file *file, unsigned int seals)
 {
diff --git a/mm/shmem.c b/mm/shmem.c
index a6f565308133..6c8aef15a17d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2051,6 +2051,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = file_inode(vma->vm_file);
 	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	enum sgp_type sgp;
 	int err;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
@@ -2113,7 +2115,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 		spin_unlock(&inode->i_lock);
 	}
 
-	err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
+	if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
+		sgp = SGP_NOALLOC;
+	else
+		sgp = SGP_CACHE;
+
+	err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
 				  gfp, vma, vmf, &ret);
 	if (err)
 		return vmf_error(err);
@@ -2459,6 +2466,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t index = pos >> PAGE_SHIFT;
+	enum sgp_type sgp;
 	int ret = 0;
 
 	/* i_rwsem is held by caller */
@@ -2470,7 +2478,11 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 		return -EPERM;
 	}
 
-	ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
+	if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
+		sgp = SGP_NOALLOC;
+	else
+		sgp = SGP_WRITE;
+	ret = shmem_getpage(inode, index, pagep, sgp);
 	if (ret)
 		return ret;
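For illustration, a minimal userspace sketch of the intended usage; only the new F_SEAL_AUTO_ALLOCATE value comes from the patch above, everything else is existing memfd/fallocate API:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef F_SEAL_AUTO_ALLOCATE
#define F_SEAL_AUTO_ALLOCATE 0x0020	/* value from the hunk above */
#endif

#define MEM_SIZE (2UL * 1024 * 1024)

int main(void)
{
	int fd = memfd_create("guest-shared-mem", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	/* Reserve the memory explicitly up front. */
	fallocate(fd, 0, 0, MEM_SIZE);

	/* From now on, stray write()s or writes through a mapping that hit a
	 * hole fail instead of silently allocating memory. */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_AUTO_ALLOCATE) < 0)
		perror("F_ADD_SEALS");

	/* Punching a hole frees the range; with the seal applied, only an
	 * explicit fallocate() can make it allocatable again. */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, MEM_SIZE);

	if (write(fd, "x", 1) < 0)
		perror("write into hole (expected to fail)");

	close(fd);
	return 0;
}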
On 06.07.22 10:20, Chao Peng wrote:
Normally, a write to unallocated space of a file or to a hole in a sparse file automatically causes space allocation; for a memfd, this equals memory allocation. This new seal prevents such automatic allocation, whether it comes from a direct write() or a write to a previously mmap-ed area. The seal does not prevent fallocate(), so an explicit fallocate() can still cause allocation and can be used to reserve memory.

This is used to prevent unintentional allocation from userspace on a stray or careless write; any intentional allocation should use an explicit fallocate(). One of the main use cases is to avoid double memory allocation for confidential computing, where we use two memfds to back guest memory but only one memfd is live at any point in time, and we want to prevent memory allocation for the other memfd, which may have been mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
Suggested-by: Sean Christopherson seanjc@google.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 3 ++- mm/shmem.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..98bdabc8e309 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */
Why only "on writes" and not "on reads". IIRC, shmem doesn't support the shared zeropage, so you'll simply allocate a new page via read() or on read faults.
Also, I *think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
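For reference, the userfaultfd path David mentions looks roughly like the sketch below (error handling trimmed; may require /proc/sys/vm/unprivileged_userfaultfd to be enabled). UFFDIO_COPY resolves a missing page by allocating it in the shmem file, bypassing both write() and fallocate():

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	/* shmem-backed mapping that is still a hole */
	int memfd = memfd_create("shmem-uffd", MFD_CLOEXEC);
	ftruncate(memfd, page);
	char *addr = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED,
			  memfd, 0);

	/* register the range for missing-page faults */
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = page },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* Place a page into the file. Normally this runs in a fault-handling
	 * thread after reading a uffd_msg; done inline here only to show the
	 * allocation path. */
	char *src = aligned_alloc(page, page);
	memset(src, 0x42, page);

	struct uffdio_copy copy = {
		.dst = (unsigned long)addr,
		.src = (unsigned long)src,
		.len = page,
	};
	ioctl(uffd, UFFDIO_COPY, &copy);	/* allocates the shmem page */

	return 0;
}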
On 21.07.22 11:44, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
Normally, a write to unallocated space of a file or to a hole in a sparse file automatically causes space allocation; for a memfd, this equals memory allocation. This new seal prevents such automatic allocation, whether it comes from a direct write() or a write to a previously mmap-ed area. The seal does not prevent fallocate(), so an explicit fallocate() can still cause allocation and can be used to reserve memory.

This is used to prevent unintentional allocation from userspace on a stray or careless write; any intentional allocation should use an explicit fallocate(). One of the main use cases is to avoid double memory allocation for confidential computing, where we use two memfds to back guest memory but only one memfd is live at any point in time, and we want to prevent memory allocation for the other memfd, which may have been mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
Suggested-by: Sean Christopherson seanjc@google.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 3 ++- mm/shmem.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..98bdabc8e309 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */
Why only "on writes" and not "on reads". IIRC, shmem doesn't support the shared zeropage, so you'll simply allocate a new page via read() or on read faults.
Correction: on read() we don't allocate a fresh page. But on read faults we would. So this comment here needs clarification.
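A small sketch that makes the distinction visible via fstat(), assuming the behavior described above (read() from a hole is served from the zero page; a read fault allocates a shmem page):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static long blocks(int fd)
{
	struct stat st;

	fstat(fd, &st);
	return (long)st.st_blocks;	/* 512-byte units */
}

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char buf[64];

	int fd = memfd_create("hole-read", MFD_CLOEXEC);
	ftruncate(fd, page);
	printf("blocks after create:     %ld\n", blocks(fd));	/* 0: all hole */

	read(fd, buf, sizeof(buf));				/* read() from the hole */
	printf("blocks after read():     %ld\n", blocks(fd));	/* still 0 */

	char *p = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
	printf("first byte: %d\n", p[0]);			/* read fault on the hole */
	printf("blocks after read fault: %ld\n", blocks(fd));	/* now non-zero */

	return 0;
}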
Also, I *think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
On Thu, Jul 21, 2022, David Hildenbrand wrote:
On 21.07.22 11:44, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
Normally, a write to unallocated space of a file or to a hole in a sparse file automatically causes space allocation; for a memfd, this equals memory allocation. This new seal prevents such automatic allocation, whether it comes from a direct write() or a write to a previously mmap-ed area. The seal does not prevent fallocate(), so an explicit fallocate() can still cause allocation and can be used to reserve memory.

This is used to prevent unintentional allocation from userspace on a stray or careless write; any intentional allocation should use an explicit fallocate(). One of the main use cases is to avoid double memory allocation for confidential computing, where we use two memfds to back guest memory but only one memfd is live at any point in time, and we want to prevent memory allocation for the other memfd, which may have been mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
Suggested-by: Sean Christopherson seanjc@google.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 3 ++- mm/shmem.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..98bdabc8e309 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */
Why only "on writes" and not "on reads". IIRC, shmem doesn't support the shared zeropage, so you'll simply allocate a new page via read() or on read faults.
Correction: on read() we don't allocate a fresh page. But on read faults we would. So this comment here needs clarification.
Not just the comment, the code too. The intent of F_SEAL_AUTO_ALLOCATE is very much to block _all_ implicit allocations (or maybe just fault-based allocations if "implicit" is too broad of a description).
On Thu, Jul 21, 2022 at 03:05:09PM +0000, Sean Christopherson wrote:
On Thu, Jul 21, 2022, David Hildenbrand wrote:
On 21.07.22 11:44, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
Normally, a write to unallocated space of a file or to a hole in a sparse file automatically causes space allocation; for a memfd, this equals memory allocation. This new seal prevents such automatic allocation, whether it comes from a direct write() or a write to a previously mmap-ed area. The seal does not prevent fallocate(), so an explicit fallocate() can still cause allocation and can be used to reserve memory.

This is used to prevent unintentional allocation from userspace on a stray or careless write; any intentional allocation should use an explicit fallocate(). One of the main use cases is to avoid double memory allocation for confidential computing, where we use two memfds to back guest memory but only one memfd is live at any point in time, and we want to prevent memory allocation for the other memfd, which may have been mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
Suggested-by: Sean Christopherson seanjc@google.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 3 ++- mm/shmem.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..98bdabc8e309 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */
Why only "on writes" and not "on reads". IIRC, shmem doesn't support the shared zeropage, so you'll simply allocate a new page via read() or on read faults.
Correction: on read() we don't allocate a fresh page. But on read faults we would. So this comment here needs clarification.
Not just the comment, the code too. The intent of F_SEAL_AUTO_ALLOCATE is very much to block _all_ implicit allocations (or maybe just fault-based allocations if "implicit" is too broad of a description).
So maybe go back to your initial suggestion, F_SEAL_FAULT_ALLOCATIONS? One reason I don't like that name is that write() also causes allocation, and we want to prevent that as well.
Chao
Normally, a write to unallocated space of a file or to a hole in a sparse file automatically causes space allocation; for a memfd, this equals memory allocation. This new seal prevents such automatic allocation, whether it comes from a direct write() or a write to a previously mmap-ed area. The seal does not prevent fallocate(), so an explicit fallocate() can still cause allocation and can be used to reserve memory.

This is used to prevent unintentional allocation from userspace on a stray or careless write; any intentional allocation should use an explicit fallocate(). One of the main use cases is to avoid double memory allocation for confidential computing, where we use two memfds to back guest memory but only one memfd is live at any point in time, and we want to prevent memory allocation for the other memfd, which may have been mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
Suggested-by: Sean Christopherson seanjc@google.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 3 ++- mm/shmem.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..98bdabc8e309 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */
Why only "on writes" and not "on reads". IIRC, shmem doesn't support the shared zeropage, so you'll simply allocate a new page via read() or on read faults.
Also, I *think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
I was also thinking about this at the same time, but for a different reason:

"Want to populate private preboot memory with firmware payload", so I was thinking userfaultfd could be an option, since direct writes are restricted?
Thanks, Pankaj
On Thu, Jul 21, 2022 at 12:27:03PM +0200, Gupta, Pankaj wrote:
Normally, a write to unallocated space of a file or to a hole in a sparse file automatically causes space allocation; for a memfd, this equals memory allocation. This new seal prevents such automatic allocation, whether it comes from a direct write() or a write to a previously mmap-ed area. The seal does not prevent fallocate(), so an explicit fallocate() can still cause allocation and can be used to reserve memory.

This is used to prevent unintentional allocation from userspace on a stray or careless write; any intentional allocation should use an explicit fallocate(). One of the main use cases is to avoid double memory allocation for confidential computing, where we use two memfds to back guest memory but only one memfd is live at any point in time, and we want to prevent memory allocation for the other memfd, which may have been mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
Suggested-by: Sean Christopherson seanjc@google.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 3 ++- mm/shmem.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..98bdabc8e309 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */
Why only "on writes" and not "on reads". IIRC, shmem doesn't support the shared zeropage, so you'll simply allocate a new page via read() or on read faults.
Also, I *think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
I was also thinking this at the same time, but for different reason:
"Want to populate private preboot memory with firmware payload", so was thinking userfaulftd could be an option as direct writes are restricted?
If that can be a side effect, I'd definitely be glad to see it, though I'm still not clear on how userfaultfd can be particularly helpful for that.
Chao
Thanks, Pankaj
Normally, a write to unallocated space of a file or to a hole in a sparse file automatically causes space allocation; for a memfd, this equals memory allocation. This new seal prevents such automatic allocation, whether it comes from a direct write() or a write to a previously mmap-ed area. The seal does not prevent fallocate(), so an explicit fallocate() can still cause allocation and can be used to reserve memory.

This is used to prevent unintentional allocation from userspace on a stray or careless write; any intentional allocation should use an explicit fallocate(). One of the main use cases is to avoid double memory allocation for confidential computing, where we use two memfds to back guest memory but only one memfd is live at any point in time, and we want to prevent memory allocation for the other memfd, which may have been mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
Suggested-by: Sean Christopherson seanjc@google.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 3 ++- mm/shmem.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..98bdabc8e309 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */
Why only "on writes" and not "on reads". IIRC, shmem doesn't support the shared zeropage, so you'll simply allocate a new page via read() or on read faults.
Also, I *think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
I was also thinking this at the same time, but for different reason:
"Want to populate private preboot memory with firmware payload", so was thinking userfaulftd could be an option as direct writes are restricted?
If that can be a side effect, I definitely glad to see it, though I'm still not clear how userfaultfd can be particularly helpful for that.
I was thinking we could use userfaultfd to monitor page faults on the virtual firmware memory range and use that to populate the private memory.

Not sure if it is a side effect; I was just thinking theoretically (for now I have set the idea aside, as these enhancements can be worked on later).
Thanks, Pankaj
On Thu, Jul 21, 2022 at 11:44:11AM +0200, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
Normally, a write to unallocated space of a file or to a hole in a sparse file automatically causes space allocation; for a memfd, this equals memory allocation. This new seal prevents such automatic allocation, whether it comes from a direct write() or a write to a previously mmap-ed area. The seal does not prevent fallocate(), so an explicit fallocate() can still cause allocation and can be used to reserve memory.

This is used to prevent unintentional allocation from userspace on a stray or careless write; any intentional allocation should use an explicit fallocate(). One of the main use cases is to avoid double memory allocation for confidential computing, where we use two memfds to back guest memory but only one memfd is live at any point in time, and we want to prevent memory allocation for the other memfd, which may have been mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
Suggested-by: Sean Christopherson seanjc@google.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 3 ++- mm/shmem.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..98bdabc8e309 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */
Why only "on writes" and not "on reads". IIRC, shmem doesn't support the shared zeropage, so you'll simply allocate a new page via read() or on read faults.
Right, it also prevents read faults.
Also, I *think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
Userfaultfd sounds interesting; I will investigate it further. But from a rough look it seems to only fault to userspace for write/read faults, not write()? It also seems to operate on VMAs, and userfaultfd_register() takes mmap_lock, which is what we want to avoid for the frequent register/unregister during private/shared memory conversion.
Chao
-- Thanks,
David / dhildenb
On 7/21/22 11:44, David Hildenbrand wrote:
Also, I*think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
Yeah, userfaultfd_register would probably have to forbid this for F_SEAL_AUTO_ALLOCATE vmas. Maybe the memfile_node can be reused for this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flag? Then userfaultfd_register would do something like memfile_node_get_flags(vma->vm_file) and check the result.
This means moving this patch later, after "mm: Introduce memfile_notifier".
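A very rough sketch of that check; MEMFILE_F_NO_AUTO_ALLOCATE, memfile_node_get_flags() and the function name below are hypothetical and only illustrate the suggestion:

#include <linux/mm.h>

/*
 * Hypothetical sketch only: neither MEMFILE_F_NO_AUTO_ALLOCATE nor
 * memfile_node_get_flags() exist; they illustrate rejecting userfaultfd
 * registration on mappings whose backing file forbids auto allocation.
 */
static int deny_uffd_on_no_auto_allocate(struct vm_area_struct *vma)
{
	unsigned long flags;

	if (!vma->vm_file)
		return 0;

	flags = memfile_node_get_flags(vma->vm_file);	/* hypothetical helper */
	if (flags & MEMFILE_F_NO_AUTO_ALLOCATE)		/* hypothetical flag */
		return -EPERM;

	return 0;
}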
Thanks,
Paolo
On 05.08.22 19:55, Paolo Bonzini wrote:
On 7/21/22 11:44, David Hildenbrand wrote:
Also, I*think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
Yeah, userfaultfd_register would probably have to forbid this for F_SEAL_AUTO_ALLOCATE vmas. Maybe the memfile_node can be reused for this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags? Then userfault_register would do something like memfile_node_get_flags(vma->vm_file) and check the result.
An alternative is to simply have the shmem allocation fail in a similar way. Maybe it does already, I haven't checked (don't think so).
On Fri, Aug 05, 2022 at 08:06:03PM +0200, David Hildenbrand wrote:
On 05.08.22 19:55, Paolo Bonzini wrote:
On 7/21/22 11:44, David Hildenbrand wrote:
Also, I*think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
Yeah, userfaultfd_register would probably have to forbid this for F_SEAL_AUTO_ALLOCATE vmas. Maybe the memfile_node can be reused for this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags? Then userfault_register would do something like memfile_node_get_flags(vma->vm_file) and check the result.
An alternative is to simply have the shmem allocation fail in a similar way. Maybe it does already, I haven't checked (don't think so).
This sounds like a better option. We don't need changes to the userfaultfd_register() uAPI, but I guess we will still need a KVM uAPI, either per memslot or for the whole VM, since Roth said this feature should be optional because some usages may want to disable it for performance reasons. For details please see the discussion: https://lkml.org/lkml/2022/6/23/1905
Chao
-- Thanks,
David / dhildenb
On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
On 7/21/22 11:44, David Hildenbrand wrote:
Also, I*think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
Yeah, userfaultfd_register would probably have to forbid this for F_SEAL_AUTO_ALLOCATE vmas. Maybe the memfile_node can be reused for this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags? Then userfault_register would do something like memfile_node_get_flags(vma->vm_file) and check the result.
Then we would need to change the userfaultfd_register() uAPI to take a new property flag. Userspace should still be the decision-maker for this flag.
This means moving this patch later, after "mm: Introduce memfile_notifier".
Yes, it makes sense now.
Chao
Thanks,
Paolo
On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
On 7/21/22 11:44, David Hildenbrand wrote:
Also, I*think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
Yeah, userfaultfd_register would probably have to forbid this for F_SEAL_AUTO_ALLOCATE vmas. Maybe the memfile_node can be reused for this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags? Then userfault_register would do something like memfile_node_get_flags(vma->vm_file) and check the result.
I donno, memory allocation with userfaultfd looks pretty intentional to me. Why would F_SEAL_AUTO_ALLOCATE prevent it?
Maybe we would need it in the future for post-copy migration or something?
Or do existing practices around userfaultfd touch memory randomly and are therefore incompatible with the F_SEAL_AUTO_ALLOCATE intent?
Note that userfaultfd is only relevant for shared memory, as it requires a VMA, which we don't have for MFD_INACCESSIBLE.
On 8/18/22 01:41, Kirill A. Shutemov wrote:
Note, that userfaultfd is only relevant for shared memory as it requires VMA which we don't have for MFD_INACCESSIBLE.
Oh, you're right! So yeah, looks like userfaultfd is not a problem.
Paolo
On 18.08.22 01:41, Kirill A. Shutemov wrote:
On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
On 7/21/22 11:44, David Hildenbrand wrote:
Also, I*think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
Yeah, userfaultfd_register would probably have to forbid this for F_SEAL_AUTO_ALLOCATE vmas. Maybe the memfile_node can be reused for this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags? Then userfault_register would do something like memfile_node_get_flags(vma->vm_file) and check the result.
I donno, memory allocation with userfaultfd looks pretty intentional to me. Why would F_SEAL_AUTO_ALLOCATE prevent it?
Can't we say the same about a write()?
Maybe we would need it in the future for post-copy migration or something?
Or existing practises around userfaultfd touch memory randomly and therefore incompatible with F_SEAL_AUTO_ALLOCATE intent?
Note, that userfaultfd is only relevant for shared memory as it requires VMA which we don't have for MFD_INACCESSIBLE.
This feature (F_SEAL_AUTO_ALLOCATE) is independent of all the lovely encrypted VM stuff, so it doesn't matter how it relates to MFD_INACCESSIBLE.
On Tue, Aug 23, 2022 at 09:36:57AM +0200, David Hildenbrand wrote:
On 18.08.22 01:41, Kirill A. Shutemov wrote:
On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote:
On 7/21/22 11:44, David Hildenbrand wrote:
Also, I*think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
Yeah, userfaultfd_register would probably have to forbid this for F_SEAL_AUTO_ALLOCATE vmas. Maybe the memfile_node can be reused for this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags? Then userfault_register would do something like memfile_node_get_flags(vma->vm_file) and check the result.
I donno, memory allocation with userfaultfd looks pretty intentional to me. Why would F_SEAL_AUTO_ALLOCATE prevent it?
Can't we say the same about a write()?
Maybe we would need it in the future for post-copy migration or something?
Or existing practises around userfaultfd touch memory randomly and therefore incompatible with F_SEAL_AUTO_ALLOCATE intent?
Note, that userfaultfd is only relevant for shared memory as it requires VMA which we don't have for MFD_INACCESSIBLE.
This feature (F_SEAL_AUTO_ALLOCATE) is independent of all the lovely encrypted VM stuff, so it doesn't matter how it relates to MFD_INACCESSIBLE.
Right, this patch is for a normal user-accessible fd. In KVM this flag is expected to be set on the shared part of the memslot, while all the other patches in this series are for the private part of the memslot.

Private memory doesn't have this need because it's totally inaccessible from userspace, so there is no chance for userspace to write to the fd and cause allocation by accident. For shared memory, however, a malicious/buggy guest OS may cause userspace to write to any range of the shared fd and cause memory allocation, even for a range where the private memory, not the shared memory, should be visible to the guest OS.
Chao
-- Thanks,
David / dhildenb
Hi Chao,
On Wed, Jul 6, 2022 at 9:25 AM Chao Peng chao.p.peng@linux.intel.com wrote:
Normally, a write to unallocated space of a file or to a hole in a sparse file automatically causes space allocation; for a memfd, this equals memory allocation. This new seal prevents such automatic allocation, whether it comes from a direct write() or a write to a previously mmap-ed area. The seal does not prevent fallocate(), so an explicit fallocate() can still cause allocation and can be used to reserve memory.

This is used to prevent unintentional allocation from userspace on a stray or careless write; any intentional allocation should use an explicit fallocate(). One of the main use cases is to avoid double memory allocation for confidential computing, where we use two memfds to back guest memory but only one memfd is live at any point in time, and we want to prevent memory allocation for the other memfd, which may have been mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
Suggested-by: Sean Christopherson seanjc@google.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 3 ++- mm/shmem.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..98bdabc8e309 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */
I think this should also be added to tools/include/uapi/linux/fcntl.h
Cheers, /fuad
/* (1U << 31) is reserved for signed error codes */
/* diff --git a/mm/memfd.c b/mm/memfd.c index 08f5f8304746..2afd898798e4 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) F_SEAL_SHRINK | \ F_SEAL_GROW | \ F_SEAL_WRITE | \
F_SEAL_FUTURE_WRITE)
F_SEAL_FUTURE_WRITE | \
F_SEAL_AUTO_ALLOCATE)
static int memfd_add_seals(struct file *file, unsigned int seals) { diff --git a/mm/shmem.c b/mm/shmem.c index a6f565308133..6c8aef15a17d 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2051,6 +2051,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct inode *inode = file_inode(vma->vm_file); gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
struct shmem_inode_info *info = SHMEM_I(inode);
enum sgp_type sgp; int err; vm_fault_t ret = VM_FAULT_LOCKED;
@@ -2113,7 +2115,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) spin_unlock(&inode->i_lock); }
err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
sgp = SGP_NOALLOC;
else
sgp = SGP_CACHE;
err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp, gfp, vma, vmf, &ret); if (err) return vmf_error(err);
@@ -2459,6 +2466,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping, struct inode *inode = mapping->host; struct shmem_inode_info *info = SHMEM_I(inode); pgoff_t index = pos >> PAGE_SHIFT;
enum sgp_type sgp; int ret = 0; /* i_rwsem is held by caller */
@@ -2470,7 +2478,11 @@ shmem_write_begin(struct file *file, struct address_space *mapping, return -EPERM; }
ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
sgp = SGP_NOALLOC;
else
sgp = SGP_WRITE;
ret = shmem_getpage(inode, index, pagep, sgp); if (ret) return ret;
-- 2.25.1
On Fri, Aug 26, 2022 at 04:19:32PM +0100, Fuad Tabba wrote:
Hi Chao,
On Wed, Jul 6, 2022 at 9:25 AM Chao Peng chao.p.peng@linux.intel.com wrote:
Normally, a write to unallocated space of a file or to a hole in a sparse file automatically causes space allocation; for a memfd, this equals memory allocation. This new seal prevents such automatic allocation, whether it comes from a direct write() or a write to a previously mmap-ed area. The seal does not prevent fallocate(), so an explicit fallocate() can still cause allocation and can be used to reserve memory.

This is used to prevent unintentional allocation from userspace on a stray or careless write; any intentional allocation should use an explicit fallocate(). One of the main use cases is to avoid double memory allocation for confidential computing, where we use two memfds to back guest memory but only one memfd is live at any point in time, and we want to prevent memory allocation for the other memfd, which may have been mmap-ed previously. More discussion can be found at:
https://lkml.org/lkml/2022/6/14/1255
Suggested-by: Sean Christopherson seanjc@google.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 3 ++- mm/shmem.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..98bdabc8e309 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */
I think this should also be added to tools/include/uapi/linux/fcntl.h
Yes, thanks.
Chao
Cheers, /fuad
/* (1U << 31) is reserved for signed error codes */
/* diff --git a/mm/memfd.c b/mm/memfd.c index 08f5f8304746..2afd898798e4 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) F_SEAL_SHRINK | \ F_SEAL_GROW | \ F_SEAL_WRITE | \
F_SEAL_FUTURE_WRITE)
F_SEAL_FUTURE_WRITE | \
F_SEAL_AUTO_ALLOCATE)
static int memfd_add_seals(struct file *file, unsigned int seals) { diff --git a/mm/shmem.c b/mm/shmem.c index a6f565308133..6c8aef15a17d 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2051,6 +2051,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct inode *inode = file_inode(vma->vm_file); gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
struct shmem_inode_info *info = SHMEM_I(inode);
enum sgp_type sgp; int err; vm_fault_t ret = VM_FAULT_LOCKED;
@@ -2113,7 +2115,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) spin_unlock(&inode->i_lock); }
err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
sgp = SGP_NOALLOC;
else
sgp = SGP_CACHE;
err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp, gfp, vma, vmf, &ret); if (err) return vmf_error(err);
@@ -2459,6 +2466,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping, struct inode *inode = mapping->host; struct shmem_inode_info *info = SHMEM_I(inode); pgoff_t index = pos >> PAGE_SHIFT;
enum sgp_type sgp; int ret = 0; /* i_rwsem is held by caller */
@@ -2470,7 +2478,11 @@ shmem_write_begin(struct file *file, struct address_space *mapping, return -EPERM; }
ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE))
sgp = SGP_NOALLOC;
else
sgp = SGP_WRITE;
ret = shmem_getpage(inode, index, pagep, sgp); if (ret) return ret;
-- 2.25.1
Add tests to verify that sealing memfds with F_SEAL_AUTO_ALLOCATE works as expected.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++++
 1 file changed, 166 insertions(+)
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 94df2692e6e4..b849ece295fd 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -9,6 +9,7 @@ #include <fcntl.h> #include <linux/memfd.h> #include <sched.h> +#include <setjmp.h> #include <stdio.h> #include <stdlib.h> #include <signal.h> @@ -232,6 +233,31 @@ static void mfd_fail_open(int fd, int flags, mode_t mode) } }
+static void mfd_assert_fallocate(int fd) +{ + int r; + + r = fallocate(fd, 0, 0, mfd_def_size); + if (r < 0) { + printf("fallocate(ALLOC) failed: %m\n"); + abort(); + } +} + +static void mfd_assert_punch_hole(int fd) +{ + int r; + + r = fallocate(fd, + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + 0, + mfd_def_size); + if (r < 0) { + printf("fallocate(PUNCH_HOLE) failed: %m\n"); + abort(); + } +} + static void mfd_assert_read(int fd) { char buf[16]; @@ -594,6 +620,94 @@ static void mfd_fail_grow_write(int fd) } }
+static void mfd_assert_hole_write(int fd) +{ + ssize_t l; + void *p; + char *p1; + + /* + * huegtlbfs does not support write, but we want to + * verify everything else here. + */ + if (!hugetlbfs_test) { + /* verify direct write() succeeds */ + l = write(fd, "\0\0\0\0", 4); + if (l != 4) { + printf("write() failed: %m\n"); + abort(); + } + } + + /* verify mmaped write succeeds */ + p = mmap(NULL, + mfd_def_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + fd, + 0); + if (p == MAP_FAILED) { + printf("mmap() failed: %m\n"); + abort(); + } + p1 = (char *)p + mfd_def_size - 1; + *p1 = 'H'; + if (*p1 != 'H') { + printf("mmaped write failed: %m\n"); + abort(); + + } + munmap(p, mfd_def_size); +} + +sigjmp_buf jbuf, *sigbuf; +static void sig_handler(int sig, siginfo_t *siginfo, void *ptr) +{ + if (sig == SIGBUS) { + if (sigbuf) + siglongjmp(*sigbuf, 1); + abort(); + } +} + +static void mfd_fail_hole_write(int fd) +{ + ssize_t l; + void *p; + char *p1; + + /* verify direct write() fails */ + l = write(fd, "data", 4); + if (l > 0) { + printf("expected failure on write(), but got %d: %m\n", (int)l); + abort(); + } + + /* verify mmaped write fails */ + p = mmap(NULL, + mfd_def_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + fd, + 0); + if (p == MAP_FAILED) { + printf("mmap() failed: %m\n"); + abort(); + } + + sigbuf = &jbuf; + if (sigsetjmp(*sigbuf, 1)) + goto out; + + /* Below write should trigger SIGBUS signal */ + p1 = (char *)p + mfd_def_size - 1; + *p1 = 'H'; + printf("failed to receive SIGBUS for mmaped write: %m\n"); + abort(); +out: + munmap(p, mfd_def_size); +} + static int idle_thread_fn(void *arg) { sigset_t set; @@ -880,6 +994,57 @@ static void test_seal_resize(void) close(fd); }
+/* + * Test F_SEAL_AUTO_ALLOCATE + * Test whether F_SEAL_AUTO_ALLOCATE actually prevents allocation. + */ +static void test_seal_auto_allocate(void) +{ + struct sigaction act; + int fd; + + printf("%s SEAL-AUTO-ALLOCATE\n", memfd_str); + + memset(&act, 0, sizeof(act)); + act.sa_sigaction = sig_handler; + act.sa_flags = SA_SIGINFO; + if (sigaction(SIGBUS, &act, 0)) { + printf("sigaction() failed: %m\n"); + abort(); + } + + fd = mfd_assert_new("kern_memfd_seal_auto_allocate", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + /* read/write should pass if F_SEAL_AUTO_ALLOCATE not set */ + mfd_assert_read(fd); + mfd_assert_hole_write(fd); + + mfd_assert_has_seals(fd, 0); + mfd_assert_add_seals(fd, F_SEAL_AUTO_ALLOCATE); + mfd_assert_has_seals(fd, F_SEAL_AUTO_ALLOCATE); + + /* read/write should pass for pre-allocated area */ + mfd_assert_read(fd); + mfd_assert_hole_write(fd); + + mfd_assert_punch_hole(fd); + + /* read should pass, write should fail in hole */ + mfd_assert_read(fd); + mfd_fail_hole_write(fd); + + mfd_assert_fallocate(fd); + + /* read/write should pass after fallocate */ + mfd_assert_read(fd); + mfd_assert_hole_write(fd); + + close(fd); +} + + /* * Test sharing via dup() * Test that seals are shared between dupped FDs and they're all equal. @@ -1059,6 +1224,7 @@ int main(int argc, char **argv) test_seal_shrink(); test_seal_grow(); test_seal_resize(); + test_seal_auto_allocate();
test_share_dup("SHARE-DUP", ""); test_share_mmap("SHARE-MMAP", "");
On 06.07.22 10:20, Chao Peng wrote:
Add tests to verify sealing memfds with the F_SEAL_AUTO_ALLOCATE works as expected.
Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++++++++++ 1 file changed, 166 insertions(+)
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 94df2692e6e4..b849ece295fd 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -9,6 +9,7 @@ #include <fcntl.h> #include <linux/memfd.h> #include <sched.h> +#include <setjmp.h> #include <stdio.h> #include <stdlib.h> #include <signal.h> @@ -232,6 +233,31 @@ static void mfd_fail_open(int fd, int flags, mode_t mode) } } +static void mfd_assert_fallocate(int fd) +{
- int r;
- r = fallocate(fd, 0, 0, mfd_def_size);
- if (r < 0) {
printf("fallocate(ALLOC) failed: %m\n");
abort();
- }
+}
+static void mfd_assert_punch_hole(int fd) +{
- int r;
- r = fallocate(fd,
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
0,
mfd_def_size);
- if (r < 0) {
printf("fallocate(PUNCH_HOLE) failed: %m\n");
abort();
- }
+}
static void mfd_assert_read(int fd) { char buf[16]; @@ -594,6 +620,94 @@ static void mfd_fail_grow_write(int fd) } } +static void mfd_assert_hole_write(int fd) +{
- ssize_t l;
- void *p;
- char *p1;
- /*
* huegtlbfs does not support write, but we want to
* verify everything else here.
*/
- if (!hugetlbfs_test) {
/* verify direct write() succeeds */
l = write(fd, "\0\0\0\0", 4);
if (l != 4) {
printf("write() failed: %m\n");
abort();
}
- }
- /* verify mmaped write succeeds */
- p = mmap(NULL,
mfd_def_size,
PROT_READ | PROT_WRITE,
MAP_SHARED,
fd,
0);
- if (p == MAP_FAILED) {
printf("mmap() failed: %m\n");
abort();
- }
- p1 = (char *)p + mfd_def_size - 1;
- *p1 = 'H';
- if (*p1 != 'H') {
printf("mmaped write failed: %m\n");
abort();
- }
- munmap(p, mfd_def_size);
+}
+sigjmp_buf jbuf, *sigbuf; +static void sig_handler(int sig, siginfo_t *siginfo, void *ptr) +{
- if (sig == SIGBUS) {
if (sigbuf)
siglongjmp(*sigbuf, 1);
abort();
- }
+}
+static void mfd_fail_hole_write(int fd) +{
- ssize_t l;
- void *p;
- char *p1;
- /* verify direct write() fails */
- l = write(fd, "data", 4);
- if (l > 0) {
printf("expected failure on write(), but got %d: %m\n", (int)l);
abort();
- }
- /* verify mmaped write fails */
- p = mmap(NULL,
mfd_def_size,
PROT_READ | PROT_WRITE,
MAP_SHARED,
fd,
0);
- if (p == MAP_FAILED) {
printf("mmap() failed: %m\n");
abort();
- }
- sigbuf = &jbuf;
- if (sigsetjmp(*sigbuf, 1))
goto out;
- /* Below write should trigger SIGBUS signal */
- p1 = (char *)p + mfd_def_size - 1;
- *p1 = 'H';
Maybe you want to verify separately that both the direct write() and the mmap-ed write fail here.
- printf("failed to receive SIGBUS for mmaped write: %m\n");
- abort();
+out:
- munmap(p, mfd_def_size);
+}
static int idle_thread_fn(void *arg) { sigset_t set; @@ -880,6 +994,57 @@ static void test_seal_resize(void) close(fd); } +/*
- Test F_SEAL_AUTO_ALLOCATE
- Test whether F_SEAL_AUTO_ALLOCATE actually prevents allocation.
- */
+static void test_seal_auto_allocate(void) +{
- struct sigaction act;
- int fd;
- printf("%s SEAL-AUTO-ALLOCATE\n", memfd_str);
- memset(&act, 0, sizeof(act));
- act.sa_sigaction = sig_handler;
- act.sa_flags = SA_SIGINFO;
- if (sigaction(SIGBUS, &act, 0)) {
printf("sigaction() failed: %m\n");
abort();
- }
- fd = mfd_assert_new("kern_memfd_seal_auto_allocate",
mfd_def_size,
MFD_CLOEXEC | MFD_ALLOW_SEALING);
- /* read/write should pass if F_SEAL_AUTO_ALLOCATE not set */
- mfd_assert_read(fd);
- mfd_assert_hole_write(fd);
- mfd_assert_has_seals(fd, 0);
- mfd_assert_add_seals(fd, F_SEAL_AUTO_ALLOCATE);
- mfd_assert_has_seals(fd, F_SEAL_AUTO_ALLOCATE);
- /* read/write should pass for pre-allocated area */
- mfd_assert_read(fd);
- mfd_assert_hole_write(fd);
- mfd_assert_punch_hole(fd);
- /* read should pass, write should fail in hole */
- mfd_assert_read(fd);
- mfd_fail_hole_write(fd);
- mfd_assert_fallocate(fd);
- /* read/write should pass after fallocate */
- mfd_assert_read(fd);
- mfd_assert_hole_write(fd);
- close(fd);
+}
What might make sense is to verify for the following operations:
* read()
* write()
* read via mmap
* write via mmap
After sealing, verify on a hole that there is *still* a hole afterwards and that only the read() might succeed, with a comment stating that shmem optimizes reads of holes by reading from the shared zeropage.
I'd suggest decoupling hole_write from hole_mmap_write and similarly have hole_read and hole_mmap_read.
You should be able to use fstat() to obtain the number of allocated blocks to check that fairly easily.
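Something along these lines could be added to memfd_test.c for that (a sketch; the helper name is made up and <sys/stat.h> would need to be included if it isn't already):

#include <sys/stat.h>

/* Sketch of a helper for memfd_test.c: check how many bytes are currently
 * allocated in the memfd, so a test can assert that a hole is still a hole
 * after a read()/sealing, and that it is filled after a fault or fallocate(). */
static void mfd_assert_allocated_bytes(int fd, unsigned long expected)
{
	struct stat st;

	if (fstat(fd, &st) < 0) {
		printf("fstat() failed: %m\n");
		abort();
	}

	/* st_blocks is reported in 512-byte units */
	if ((unsigned long)st.st_blocks * 512 != expected) {
		printf("allocation check failed: %llu bytes, expected %lu\n",
		       (unsigned long long)st.st_blocks * 512, expected);
		abort();
	}
}

For example, mfd_assert_allocated_bytes(fd, 0) right after punching a hole, and mfd_assert_allocated_bytes(fd, mfd_def_size) after the explicit fallocate().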
This patch introduces the memfile_notifier facility so that existing memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to a third kernel component, allowing that component to make use of memory bookmarked in the memory file and to get notified when pages in the memory file become invalidated.
It will be used by KVM to consume a file descriptor as the guest memory backing store: KVM will use this memfile_notifier interface to interact with the memory file subsystems. In the future there might be other consumers (e.g. VFIO with encrypted device memory).
It consists of the following components:
- memfile_backing_store: Each supported memory file subsystem can be implemented as a memory backing store, which bookmarks memory and provides callbacks for other kernel systems (memfile_notifier consumers) to interact with.
- memfile_notifier: memfile_notifier consumers define callbacks and associate them with a file using memfile_register_notifier().
- memfile_node: A memfile_node is associated with the file (inode) from the backing store and includes feature flags and a list of registered memfile_notifiers for notifying.
In KVM usage, userspace is in charge of the guest memory lifecycle: it first allocates pages in the memory backing store and then passes the fd to KVM, letting KVM register the memory slot with the memory backing store via memfile_register_notifier().
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/memfile_notifier.h |  93 ++++++++++++++++++++++
 mm/Kconfig                       |   4 +
 mm/Makefile                      |   1 +
 mm/memfile_notifier.c            | 121 +++++++++++++++++++++++++++
 4 files changed, 219 insertions(+)
 create mode 100644 include/linux/memfile_notifier.h
 create mode 100644 mm/memfile_notifier.c
diff --git a/include/linux/memfile_notifier.h b/include/linux/memfile_notifier.h new file mode 100644 index 000000000000..c5d66fd8ba53 --- /dev/null +++ b/include/linux/memfile_notifier.h @@ -0,0 +1,93 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_MEMFILE_NOTIFIER_H +#define _LINUX_MEMFILE_NOTIFIER_H + +#include <linux/pfn_t.h> +#include <linux/rculist.h> +#include <linux/spinlock.h> +#include <linux/srcu.h> +#include <linux/fs.h> + +/* memory in the file is inaccessible from userspace (e.g. read/write/mmap) */ +#define MEMFILE_F_USER_INACCESSIBLE BIT(0) +/* memory in the file is unmovable (e.g. via pagemigration)*/ +#define MEMFILE_F_UNMOVABLE BIT(1) +/* memory in the file is unreclaimable (e.g. via kswapd) */ +#define MEMFILE_F_UNRECLAIMABLE BIT(2) + +#define MEMFILE_F_ALLOWED_MASK (MEMFILE_F_USER_INACCESSIBLE | \ + MEMFILE_F_UNMOVABLE | \ + MEMFILE_F_UNRECLAIMABLE) + +struct memfile_node { + struct list_head notifiers; /* registered notifiers */ + unsigned long flags; /* MEMFILE_F_* flags */ +}; + +struct memfile_backing_store { + struct list_head list; + spinlock_t lock; + struct memfile_node* (*lookup_memfile_node)(struct file *file); + int (*get_pfn)(struct file *file, pgoff_t offset, pfn_t *pfn, + int *order); + void (*put_pfn)(pfn_t pfn); +}; + +struct memfile_notifier; +struct memfile_notifier_ops { + void (*invalidate)(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end); +}; + +struct memfile_notifier { + struct list_head list; + struct memfile_notifier_ops *ops; + struct memfile_backing_store *bs; +}; + +static inline void memfile_node_init(struct memfile_node *node) +{ + INIT_LIST_HEAD(&node->notifiers); + node->flags = 0; +} + +#ifdef CONFIG_MEMFILE_NOTIFIER +/* APIs for backing stores */ +extern void memfile_register_backing_store(struct memfile_backing_store *bs); +extern int memfile_node_set_flags(struct file *file, unsigned long flags); +extern void memfile_notifier_invalidate(struct memfile_node *node, + pgoff_t start, pgoff_t end); +/*APIs for notifier consumers */ +extern int memfile_register_notifier(struct file *file, unsigned long flags, + struct memfile_notifier *notifier); +extern void memfile_unregister_notifier(struct memfile_notifier *notifier); + +#else /* !CONFIG_MEMFILE_NOTIFIER */ +static inline void memfile_register_backing_store(struct memfile_backing_store *bs) +{ +} + +static inline int memfile_node_set_flags(struct file *file, unsigned long flags) +{ + return -EOPNOTSUPP; +} + +static inline void memfile_notifier_invalidate(struct memfile_node *node, + pgoff_t start, pgoff_t end) +{ +} + +static inline int memfile_register_notifier(struct file *file, + unsigned long flags, + struct memfile_notifier *notifier) +{ + return -EOPNOTSUPP; +} + +static inline void memfile_unregister_notifier(struct memfile_notifier *notifier) +{ +} + +#endif /* CONFIG_MEMFILE_NOTIFIER */ + +#endif /* _LINUX_MEMFILE_NOTIFIER_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 169e64192e48..19ab9350f5cb 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1130,6 +1130,10 @@ config PTE_MARKER_UFFD_WP purposes. It is required to enable userfaultfd write protection on file-backed memory types like shmem and hugetlbfs.
+config MEMFILE_NOTIFIER + bool + select SRCU + source "mm/damon/Kconfig"
endmenu diff --git a/mm/Makefile b/mm/Makefile index 6f9ffa968a1a..b7e3fb5fa85b 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -133,3 +133,4 @@ obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o obj-$(CONFIG_IO_MAPPING) += io-mapping.o obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o +obj-$(CONFIG_MEMFILE_NOTIFIER) += memfile_notifier.o diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c new file mode 100644 index 000000000000..799d3197903e --- /dev/null +++ b/mm/memfile_notifier.c @@ -0,0 +1,121 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2022 Intel Corporation. + * Chao Peng chao.p.peng@linux.intel.com + */ + +#include <linux/memfile_notifier.h> +#include <linux/pagemap.h> +#include <linux/srcu.h> + +DEFINE_STATIC_SRCU(memfile_srcu); +static __ro_after_init LIST_HEAD(backing_store_list); + + +void memfile_notifier_invalidate(struct memfile_node *node, + pgoff_t start, pgoff_t end) +{ + struct memfile_notifier *notifier; + int id; + + id = srcu_read_lock(&memfile_srcu); + list_for_each_entry_srcu(notifier, &node->notifiers, list, + srcu_read_lock_held(&memfile_srcu)) { + if (notifier->ops->invalidate) + notifier->ops->invalidate(notifier, start, end); + } + srcu_read_unlock(&memfile_srcu, id); +} + +void __init memfile_register_backing_store(struct memfile_backing_store *bs) +{ + spin_lock_init(&bs->lock); + list_add_tail(&bs->list, &backing_store_list); +} + +static void memfile_node_update_flags(struct file *file, unsigned long flags) +{ + struct address_space *mapping = file_inode(file)->i_mapping; + gfp_t gfp; + + gfp = mapping_gfp_mask(mapping); + if (flags & MEMFILE_F_UNMOVABLE) + gfp &= ~__GFP_MOVABLE; + else + gfp |= __GFP_MOVABLE; + mapping_set_gfp_mask(mapping, gfp); + + if (flags & MEMFILE_F_UNRECLAIMABLE) + mapping_set_unevictable(mapping); + else + mapping_clear_unevictable(mapping); +} + +int memfile_node_set_flags(struct file *file, unsigned long flags) +{ + struct memfile_backing_store *bs; + struct memfile_node *node; + + if (flags & ~MEMFILE_F_ALLOWED_MASK) + return -EINVAL; + + list_for_each_entry(bs, &backing_store_list, list) { + node = bs->lookup_memfile_node(file); + if (node) { + spin_lock(&bs->lock); + node->flags = flags; + spin_unlock(&bs->lock); + memfile_node_update_flags(file, flags); + return 0; + } + } + + return -EOPNOTSUPP; +} + +int memfile_register_notifier(struct file *file, unsigned long flags, + struct memfile_notifier *notifier) +{ + struct memfile_backing_store *bs; + struct memfile_node *node; + struct list_head *list; + + if (!file || !notifier || !notifier->ops) + return -EINVAL; + if (flags & ~MEMFILE_F_ALLOWED_MASK) + return -EINVAL; + + list_for_each_entry(bs, &backing_store_list, list) { + node = bs->lookup_memfile_node(file); + if (node) { + list = &node->notifiers; + notifier->bs = bs; + + spin_lock(&bs->lock); + if (list_empty(list)) + node->flags = flags; + else if (node->flags ^ flags) { + spin_unlock(&bs->lock); + return -EINVAL; + } + + list_add_rcu(¬ifier->list, list); + spin_unlock(&bs->lock); + memfile_node_update_flags(file, flags); + return 0; + } + } + + return -EOPNOTSUPP; +} +EXPORT_SYMBOL_GPL(memfile_register_notifier); + +void memfile_unregister_notifier(struct memfile_notifier *notifier) +{ + spin_lock(¬ifier->bs->lock); + list_del_rcu(¬ifier->list); + spin_unlock(¬ifier->bs->lock); + + synchronize_srcu(&memfile_srcu); +} +EXPORT_SYMBOL_GPL(memfile_unregister_notifier);
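To make the intended calling convention concrete, below is a minimal consumer-side sketch built only from the API declared in the header above; the "example_" names, the empty invalidate handler and the way the struct file is obtained are illustrative assumptions, not part of the patch.

#include <linux/memfile_notifier.h>
#include <linux/fs.h>

/* Hypothetical consumer handler: unmap secondary-MMU mappings in [start, end). */
static void example_invalidate(struct memfile_notifier *notifier,
			       pgoff_t start, pgoff_t end)
{
	/* e.g. KVM would zap its secondary MMU page table entries here. */
}

static struct memfile_notifier_ops example_notifier_ops = {
	.invalidate = example_invalidate,
};

static struct memfile_notifier example_notifier = {
	.ops = &example_notifier_ops,
};

/* 'file' is assumed to come from the fd userspace handed to the consumer. */
static int example_attach(struct file *file)
{
	/* Request userspace-inaccessible, unmovable, unreclaimable memory. */
	return memfile_register_notifier(file,
					 MEMFILE_F_USER_INACCESSIBLE |
					 MEMFILE_F_UNMOVABLE |
					 MEMFILE_F_UNRECLAIMABLE,
					 &example_notifier);
}

static void example_detach(void)
{
	memfile_unregister_notifier(&example_notifier);
}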
On 06.07.22 10:20, Chao Peng wrote:
This patch introduces memfile_notifier facility so existing memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to allow a third kernel component to make use of memory bookmarked in the memory file and gets notified when the pages in the memory file become invalidated.
Stupid question, but why is this called "memfile_notifier" and not "memfd_notifier". We're only dealing with memfd's after all ... which are anonymous files essentially. Or what am I missing? Are there any other plans for fs than plain memfd support that I am not aware of?
It will be used for KVM to use a file descriptor as the guest memory backing store and KVM will use this memfile_notifier interface to interact with memory file subsystems. In the future there might be other consumers (e.g. VFIO with encrypted device memory).
It consists of the below components:
- memfile_backing_store: Each supported memory file subsystem can be implemented as a memory backing store which bookmarks memory and provides callbacks for other kernel systems (memfile_notifier consumers) to interact with.
- memfile_notifier: memfile_notifier consumers define callbacks and associate them with a file using memfile_register_notifier().
- memfile_node: A memfile_node is associated with the file (inode) from the backing store and includes feature flags and a list of registered memfile_notifier for notifying.
In KVM usages, userspace is in charge of guest memory lifecycle: it first allocates pages in memory backing store and then passes the fd to KVM and lets KVM register memory slot to memory backing store via memfile_register_notifier.
Can we add documentation/description in any form how the different functions exposed in linux/memfile_notifier.h are supposed to be used?
Staring at memfile_node_set_flags() and memfile_notifier_invalidate() it's not immediately clear to me who's supposed to call that and under which conditions.
On Fri, Aug 05, 2022 at 03:22:58PM +0200, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
This patch introduces memfile_notifier facility so existing memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to allow a third kernel component to make use of memory bookmarked in the memory file and gets notified when the pages in the memory file become invalidated.
Stupid question, but why is this called "memfile_notifier" and not "memfd_notifier". We're only dealing with memfd's after all ... which are anonymous files essentially. Or what am I missing? Are there any other plans for fs than plain memfd support that I am not aware of?
There were some discussions on this in v3: https://lkml.org/lkml/2021/12/28/484. Sean commented that it's OK to abstract it from memfd, but he also wants the kAPI (name) not to be bound to memfd, to make room for future non-memfd usages.
It will be used for KVM to use a file descriptor as the guest memory backing store and KVM will use this memfile_notifier interface to interact with memory file subsystems. In the future there might be other consumers (e.g. VFIO with encrypted device memory).
It consists below components:
- memfile_backing_store: Each supported memory file subsystem can be implemented as a memory backing store which bookmarks memory and provides callbacks for other kernel systems (memfile_notifier consumers) to interact with.
- memfile_notifier: memfile_notifier consumers defines callbacks and associate them to a file using memfile_register_notifier().
- memfile_node: A memfile_node is associated with the file (inode) from the backing store and includes feature flags and a list of registered memfile_notifier for notifying.
In KVM usages, userspace is in charge of guest memory lifecycle: it first allocates pages in memory backing store and then passes the fd to KVM and lets KVM register memory slot to memory backing store via memfile_register_notifier.
Can we add documentation/description in any form how the different functions exposed in linux/memfile_notifier.h are supposed to be used?
Yeah, code comments can be added.
Staring at memfile_node_set_flags() and memfile_notifier_invalidate() it's not immediately clear to me who's supposed to call that and under which conditions.
I will also amend the commit message.
Chao
-- Thanks,
David / dhildenb
On 10.08.22 11:22, Chao Peng wrote:
On Fri, Aug 05, 2022 at 03:22:58PM +0200, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
This patch introduces memfile_notifier facility so existing memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to allow a third kernel component to make use of memory bookmarked in the memory file and gets notified when the pages in the memory file become invalidated.
Stupid question, but why is this called "memfile_notifier" and not "memfd_notifier". We're only dealing with memfd's after all ... which are anonymous files essentially. Or what am I missing? Are there any other plans for fs than plain memfd support that I am not aware of?
There were some discussions on this in v3. https://lkml.org/lkml/2021/12/28/484 Sean commented it's OK to abstract it from memfd but he also wants the kAPI (name) should not bind to memfd to make room for future non-memfd usages.
Sorry, but how is "memfile" any better? memfd abstracted to memfile?! :)
I understand Sean's suggestion about abstracting, but if the new name makes it harder to grasp and there isn't really an alternative to memfd in sight, I'm not so sure I enjoy the tried abstraction here.
Otherwise we'd have to get creative now and discuss something like "file_population_notifier" or "mapping_population_notifier" and I am not sure that our time is well spent doing so right now.
... as this is kernel-internal, we can always adjust the name as we please later, once we *actually* know what the abstraction should be. Until then I'd suggest to KIS and soft-glue this to memfd.
Or am I missing something important?
+Will
On Wed, Aug 10, 2022, David Hildenbrand wrote:
On 10.08.22 11:22, Chao Peng wrote:
On Fri, Aug 05, 2022 at 03:22:58PM +0200, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
This patch introduces memfile_notifier facility so existing memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to allow a third kernel component to make use of memory bookmarked in the memory file and gets notified when the pages in the memory file become invalidated.
Stupid question, but why is this called "memfile_notifier" and not "memfd_notifier". We're only dealing with memfd's after all ... which are anonymous files essentially. Or what am I missing? Are there any other plans for fs than plain memfd support that I am not aware of?
There were some discussions on this in v3. https://lkml.org/lkml/2021/12/28/484 Sean commented it's OK to abstract it from memfd but he also wants the kAPI (name) should not bind to memfd to make room for future non-memfd usages.
Sorry, but how is "memfile" any better? memfd abstracted to memfile?! :)
FWIW, I don't really like the memfile name either.
I understand Sean's suggestion about abstracting, but if the new name makes it harder to grasp and there isn't really an alternative to memfd in sight, I'm not so sure I enjoy the tried abstraction here.
ARM's pKVM implementation is potentially (hopefully) going to switch to this API (as a consumer) sooner than later. If they anticipate being able to use memfd, then there's unlikely to be a second backing type any time soon.
Quentin, Will?
Otherwise we'd have to get creative now and discuss something like "file_population_notifer" or "mapping_population_notifer" and I am not sure that our time is well spent doing so right now.
... as this is kernel-internal, we can always adjust the name as we please later, once we *actually* know what the abstraction should be. Until then I'd suggest to KIS and soft-glue this to memfd.
Or am I missing something important?
I don't think you're missing anything. I'd still prefer a name that doesn't couple KVM to memfd, but it's not a sticking point, and I've never been able to come up with a better name...
With a little bit of cleverness I think we can keep the coupling in KVM to a minimum, which is what I really care about.
+CC Fuad
On Wednesday 10 Aug 2022 at 14:38:43 (+0000), Sean Christopherson wrote:
I understand Sean's suggestion about abstracting, but if the new name makes it harder to grasp and there isn't really an alternative to memfd in sight, I'm not so sure I enjoy the tried abstraction here.
ARM's pKVM implementation is potentially (hopefully) going to switch to this API (as a consumer) sooner than later. If they anticipate being able to use memfd, then there's unlikely to be a second backing type any time soon.
Quentin, Will?
Yep, Fuad is currently trying to port the pKVM mm stuff on top of this series to see how well it fits, so stay tuned. I think there is still some room for discussion around page conversions (private->shared etc), and we'll need a clearer idea of what the code might look like to have a constructive discussion, but so far it does seem like using a memfd (the new private one or perhaps just memfd_secret, to be discussed) + memfd notifiers is a promising option.
On Thu, Aug 11, 2022 at 12:27:56PM +0000, Quentin Perret wrote:
+CC Fuad
On Wednesday 10 Aug 2022 at 14:38:43 (+0000), Sean Christopherson wrote:
I understand Sean's suggestion about abstracting, but if the new name makes it harder to grasp and there isn't really an alternative to memfd in sight, I'm not so sure I enjoy the tried abstraction here.
ARM's pKVM implementation is potentially (hopefully) going to switch to this API (as a consumer) sooner than later. If they anticipate being able to use memfd, then there's unlikely to be a second backing type any time soon.
Quentin, Will?
Yep, Fuad is currently trying to port the pKVM mm stuff on top of this series to see how well it fits, so stay tuned.
Good to hear that.
I think there is still some room for discussion around page conversions (private->shared etc), and we'll need a clearer idea of what the code might look like to have a constructive discussion,
That's fine. Looking forward to your feedbacks.
but so far it does seem like using a memfd (the new private one or perhaps just memfd_secret, to be discussed) + memfd notifiers is a promising option.
If it is still a memfd (even memfd_secret), maybe we can use the name memfd_notifier?
Chao
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Implement shmem as a memfile_notifier backing store. Essentially it interacts with the memfile_notifier feature flags for userspace access/page migration/page reclaiming and implements the necessary memfile_backing_store callbacks.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com --- include/linux/shmem_fs.h | 2 + mm/shmem.c | 109 ++++++++++++++++++++++++++++++++++++++- 2 files changed, 110 insertions(+), 1 deletion(-)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index a68f982f22d1..6031c0b08d26 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -9,6 +9,7 @@ #include <linux/percpu_counter.h> #include <linux/xattr.h> #include <linux/fs_parser.h> +#include <linux/memfile_notifier.h>
/* inode in-kernel data */
@@ -25,6 +26,7 @@ struct shmem_inode_info { struct simple_xattrs xattrs; /* list of xattrs */ atomic_t stop_eviction; /* hold when working on inode */ struct timespec64 i_crtime; /* file creation time */ + struct memfile_node memfile_node; /* memfile node */ struct inode vfs_inode; };
diff --git a/mm/shmem.c b/mm/shmem.c index 6c8aef15a17d..627e315c3b4d 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -905,6 +905,17 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index) return page ? page_folio(page) : NULL; }
+static void notify_invalidate(struct inode *inode, struct folio *folio, + pgoff_t start, pgoff_t end) +{ + struct shmem_inode_info *info = SHMEM_I(inode); + + start = max(start, folio->index); + end = min(end, folio->index + folio_nr_pages(folio)); + + memfile_notifier_invalidate(&info->memfile_node, start, end); +} + /* * Remove range of pages and swap entries from page cache, and free them. * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. @@ -948,6 +959,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, } index += folio_nr_pages(folio) - 1;
+ notify_invalidate(inode, folio, start, end); + if (!unfalloc || !folio_test_uptodate(folio)) truncate_inode_folio(mapping, folio); folio_unlock(folio); @@ -1021,6 +1034,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, index--; break; } + + notify_invalidate(inode, folio, start, end); + VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio); truncate_inode_folio(mapping, folio); @@ -1092,6 +1108,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns, (newsize > oldsize && (info->seals & F_SEAL_GROW))) return -EPERM;
+ if (info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE) { + if (oldsize) + return -EPERM; + if (!PAGE_ALIGNED(newsize)) + return -EINVAL; + } + if (newsize != oldsize) { error = shmem_reacct_size(SHMEM_I(inode)->flags, oldsize, newsize); @@ -1336,6 +1359,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) goto redirty; if (!total_swap_pages) goto redirty; + if (info->memfile_node.flags & MEMFILE_F_UNRECLAIMABLE) + goto redirty;
/* * Our capabilities prevent regular writeback or sync from ever calling @@ -2271,6 +2296,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) if (ret) return ret;
+ if (info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE) + return -EPERM; + /* arm64 - allow memory tagging on RAM-based files */ vma->vm_flags |= VM_MTE_ALLOWED;
@@ -2306,6 +2334,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode info->i_crtime = inode->i_mtime; INIT_LIST_HEAD(&info->shrinklist); INIT_LIST_HEAD(&info->swaplist); + memfile_node_init(&info->memfile_node); simple_xattrs_init(&info->xattrs); cache_no_acl(inode); mapping_set_large_folios(inode->i_mapping); @@ -2477,6 +2506,8 @@ shmem_write_begin(struct file *file, struct address_space *mapping, if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size) return -EPERM; } + if (unlikely(info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE)) + return -EPERM;
if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE)) sgp = SGP_NOALLOC; @@ -2556,6 +2587,13 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) end_index = i_size >> PAGE_SHIFT; if (index > end_index) break; + + if (SHMEM_I(inode)->memfile_node.flags & + MEMFILE_F_USER_INACCESSIBLE) { + error = -EPERM; + break; + } + if (index == end_index) { nr = i_size & ~PAGE_MASK; if (nr <= offset) @@ -2697,6 +2735,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, goto out; }
+ if ((info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE) && + (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))) { + error = -EINVAL; + goto out; + } + shmem_falloc.waitq = &shmem_falloc_waitq; shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT; shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; @@ -3806,6 +3850,20 @@ static int shmem_error_remove_page(struct address_space *mapping, return 0; }
+#ifdef CONFIG_MIGRATION +static int shmem_migrate_page(struct address_space *mapping, + struct page *newpage, struct page *page, + enum migrate_mode mode) +{ + struct inode *inode = mapping->host; + struct shmem_inode_info *info = SHMEM_I(inode); + + if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE) + return -EOPNOTSUPP; + return migrate_page(mapping, newpage, page, mode); +} +#endif + const struct address_space_operations shmem_aops = { .writepage = shmem_writepage, .dirty_folio = noop_dirty_folio, @@ -3814,7 +3872,7 @@ const struct address_space_operations shmem_aops = { .write_end = shmem_write_end, #endif #ifdef CONFIG_MIGRATION - .migratepage = migrate_page, + .migratepage = shmem_migrate_page, #endif .error_remove_page = shmem_error_remove_page, }; @@ -3931,6 +3989,51 @@ static struct file_system_type shmem_fs_type = { .fs_flags = FS_USERNS_MOUNT, };
+#ifdef CONFIG_MEMFILE_NOTIFIER +static struct memfile_node *shmem_lookup_memfile_node(struct file *file) +{ + struct inode *inode = file_inode(file); + + if (!shmem_mapping(inode->i_mapping)) + return NULL; + + return &SHMEM_I(inode)->memfile_node; +} + + +static int shmem_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn, + int *order) +{ + struct page *page; + int ret; + + ret = shmem_getpage(file_inode(file), offset, &page, SGP_WRITE); + if (ret) + return ret; + + unlock_page(page); + *pfn = page_to_pfn_t(page); + *order = thp_order(compound_head(page)); + return 0; +} + +static void shmem_put_pfn(pfn_t pfn) +{ + struct page *page = pfn_t_to_page(pfn); + + if (!page) + return; + + put_page(page); +} + +static struct memfile_backing_store shmem_backing_store = { + .lookup_memfile_node = shmem_lookup_memfile_node, + .get_pfn = shmem_get_pfn, + .put_pfn = shmem_put_pfn, +}; +#endif /* CONFIG_MEMFILE_NOTIFIER */ + void __init shmem_init(void) { int error; @@ -3956,6 +4059,10 @@ void __init shmem_init(void) else shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */ #endif + +#ifdef CONFIG_MEMFILE_NOTIFIER + memfile_register_backing_store(&shmem_backing_store); +#endif return;
out1:
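As a reference for how the callbacks above are meant to be consumed, here is a hedged sketch of a consumer (e.g. KVM) resolving a file offset to a pfn through the backing store it registered against; the function name and the omitted secondary-MMU mapping step are assumptions, not code from this series.

/* Illustrative only: pin a page via the backing store, then release it. */
static int example_fetch_pfn(struct memfile_notifier *notifier,
			     struct file *file, pgoff_t index)
{
	struct memfile_backing_store *bs = notifier->bs;
	pfn_t pfn;
	int order;
	int ret;

	ret = bs->get_pfn(file, index, &pfn, &order);
	if (ret)
		return ret;

	/*
	 * Install the pfn into the secondary MMU here; 'order' indicates
	 * the largest mapping size backed by this page (e.g. a THP).
	 */

	bs->put_pfn(pfn);
	return 0;
}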
On 7/6/2022 10:20 AM, Chao Peng wrote:
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Implement shmem as a memfile_notifier backing store. Essentially it interacts with the memfile_notifier feature flags for userspace access/page migration/page reclaiming and implements the necessary memfile_backing_store callbacks.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/linux/shmem_fs.h | 2 + mm/shmem.c | 109 ++++++++++++++++++++++++++++++++++++++- 2 files changed, 110 insertions(+), 1 deletion(-)
[...]
@@ -3806,6 +3850,20 @@ static int shmem_error_remove_page(struct address_space *mapping,
 	return 0;
 }
 
+#ifdef CONFIG_MIGRATION
+static int shmem_migrate_page(struct address_space *mapping,
+			      struct page *newpage, struct page *page,
+			      enum migrate_mode mode)
+{
+	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
+		return -EOPNOTSUPP;
+	return migrate_page(mapping, newpage, page, mode);
Wondering how well page migrate would work for private pages on shmem memfd based backend?
[...]
On Tue, Jul 12, 2022 at 08:02:34PM +0200, Gupta, Pankaj wrote:
On 7/6/2022 10:20 AM, Chao Peng wrote:
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Implement shmem as a memfile_notifier backing store. Essentially it interacts with the memfile_notifier feature flags for userspace access/page migration/page reclaiming and implements the necessary memfile_backing_store callbacks.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/linux/shmem_fs.h | 2 + mm/shmem.c | 109 ++++++++++++++++++++++++++++++++++++++- 2 files changed, 110 insertions(+), 1 deletion(-)
...
+#ifdef CONFIG_MIGRATION
+static int shmem_migrate_page(struct address_space *mapping,
+			      struct page *newpage, struct page *page,
+			      enum migrate_mode mode)
+{
+	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
+		return -EOPNOTSUPP;
+	return migrate_page(mapping, newpage, page, mode);
Wondering how well page migrate would work for private pages on shmem memfd based backend?
From a high level:
- KVM unsets the MEMFILE_F_UNMOVABLE bit to indicate it is capable of migrating a page.
- Introduce new 'migrate' callback(s) to memfile_notifier_ops for KVM to register.
- The callback is hooked into migrate_page() here.
- Once page migration is requested, shmem calls into the 'migrate' callback(s) to perform the additional steps needed for encrypted memory (for TDX we will call TDH.MEM.PAGE.RELOCATE).
(A hypothetical sketch of this flow is shown below.)
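The 'migrate' callback does not exist in this series, so its name and signature, and the way shmem would walk the notifier list, are assumptions layered on top of the posted shmem_migrate_page(); treat this as a sketch only.

/* Hypothetical op, not in this series: let consumers relocate their mapping. */
struct memfile_notifier_ops {
	void (*invalidate)(struct memfile_notifier *notifier,
			   pgoff_t start, pgoff_t end);
	int (*migrate)(struct memfile_notifier *notifier,
		       struct page *newpage, struct page *page,
		       enum migrate_mode mode);
};

#ifdef CONFIG_MIGRATION
static int shmem_migrate_page(struct address_space *mapping,
			      struct page *newpage, struct page *page,
			      enum migrate_mode mode)
{
	struct memfile_node *node = &SHMEM_I(mapping->host)->memfile_node;
	struct memfile_notifier *notifier;
	int ret;

	if (node->flags & MEMFILE_F_UNMOVABLE)
		return -EOPNOTSUPP;

	/*
	 * Hypothetical: give every registered consumer (e.g. KVM issuing
	 * TDH.MEM.PAGE.RELOCATE for TDX) a chance to move its private
	 * mapping before the page is migrated. SRCU protection of the
	 * notifier list is omitted here for brevity.
	 */
	list_for_each_entry(notifier, &node->notifiers, list) {
		if (notifier->ops->migrate) {
			ret = notifier->ops->migrate(notifier, newpage,
						     page, mode);
			if (ret)
				return ret;
		}
	}

	return migrate_page(mapping, newpage, page, mode);
}
#endif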
Chao
[...]
+#ifdef CONFIG_MIGRATION
+static int shmem_migrate_page(struct address_space *mapping,
+			      struct page *newpage, struct page *page,
+			      enum migrate_mode mode)
+{
+	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
+		return -EOPNOTSUPP;
+	return migrate_page(mapping, newpage, page, mode);
Wondering how well page migrate would work for private pages on shmem memfd based backend?
From high level:
- KVM unset MEMFILE_F_UNMOVABLE bit to indicate it capable of migrating a page.
- Introduce new 'migrate' callback(s) to memfile_notifier_ops for KVM to register.
- The callback is hooked to migrate_page() here.
- Once page migration requested, shmem calls into the 'migrate' callback(s) to perform additional steps for encrypted memory (For TDX we will call TDH.MEM.PAGE.RELOCATE).
Yes, that would require additional (protocol specific) handling for private pages. Was trying to find where "MEMFILE_F_UNMOVABLE" flag is set currently?
Thanks, Pankaj
On Wed, Jul 13, 2022 at 12:01:13PM +0200, Gupta, Pankaj wrote:
+#ifdef CONFIG_MIGRATION
+static int shmem_migrate_page(struct address_space *mapping,
+			      struct page *newpage, struct page *page,
+			      enum migrate_mode mode)
+{
+	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
+		return -EOPNOTSUPP;
+	return migrate_page(mapping, newpage, page, mode);
Wondering how well page migrate would work for private pages on shmem memfd based backend?
From high level:
- KVM unset MEMFILE_F_UNMOVABLE bit to indicate it capable of migrating a page.
- Introduce new 'migrate' callback(s) to memfile_notifier_ops for KVM to register.
- The callback is hooked to migrate_page() here.
- Once page migration requested, shmem calls into the 'migrate' callback(s) to perform additional steps for encrypted memory (For TDX we will call TDH.MEM.PAGE.RELOCATE).
Yes, that would require additional (protocol specific) handling for private pages. Was trying to find where "MEMFILE_F_UNMOVABLE" flag is set currently?
It's set with memfile_register_notifier() in patch 13.
Thanks, Pankaj
+#ifdef CONFIG_MIGRATION
+static int shmem_migrate_page(struct address_space *mapping,
+			      struct page *newpage, struct page *page,
+			      enum migrate_mode mode)
+{
+	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE)
+		return -EOPNOTSUPP;
+	return migrate_page(mapping, newpage, page, mode);
Wondering how well page migrate would work for private pages on shmem memfd based backend?
From high level: - KVM unset MEMFILE_F_UNMOVABLE bit to indicate it capable of migrating a page. - Introduce new 'migrate' callback(s) to memfile_notifier_ops for KVM to register. - The callback is hooked to migrate_page() here. - Once page migration requested, shmem calls into the 'migrate' callback(s) to perform additional steps for encrypted memory (For TDX we will call TDH.MEM.PAGE.RELOCATE).
Yes, that would require additional (protocol specific) handling for private pages. Was trying to find where "MEMFILE_F_UNMOVABLE" flag is set currently?
It's set with memfile_register_notifier() in patch 13.
o.k.
Thanks,
Pankaj
On 06.07.22 10:20, Chao Peng wrote:
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Implement shmem as a memfile_notifier backing store. Essentially it interacts with the memfile_notifier feature flags for userspace access/page migration/page reclaiming and implements the necessary memfile_backing_store callbacks.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
[...]
+#ifdef CONFIG_MEMFILE_NOTIFIER
+static struct memfile_node *shmem_lookup_memfile_node(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+
+	if (!shmem_mapping(inode->i_mapping))
+		return NULL;
+
+	return &SHMEM_I(inode)->memfile_node;
+}
+
+static int shmem_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order)
+{
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(file_inode(file), offset, &page, SGP_WRITE);
+	if (ret)
+		return ret;
+
+	unlock_page(page);
+	*pfn = page_to_pfn_t(page);
+	*order = thp_order(compound_head(page));
+	return 0;
+}
+
+static void shmem_put_pfn(pfn_t pfn)
+{
+	struct page *page = pfn_t_to_page(pfn);
+
+	if (!page)
+		return;
+
+	put_page(page);
Why do we export shmem_get_pfn/shmem_put_pfn and not simply a get_folio() and let the caller deal with putting the folio? What's the reason to a) operate on PFNs and not folios and b) have these get/put semantics?
+}
+
+static struct memfile_backing_store shmem_backing_store = {
+	.lookup_memfile_node = shmem_lookup_memfile_node,
+	.get_pfn = shmem_get_pfn,
+	.put_pfn = shmem_put_pfn,
+};
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
 void __init shmem_init(void)
 {
 	int error;
@@ -3956,6 +4059,10 @@ void __init shmem_init(void)
 	else
 		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
 #endif
+
+#ifdef CONFIG_MEMFILE_NOTIFIER
+	memfile_register_backing_store(&shmem_backing_store);
Can we instead provide a dummy function that does nothing without CONFIG_MEMFILE_NOTIFIER?
+#endif
 	return;
 out1:
On Fri, Aug 05, 2022 at 03:26:02PM +0200, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Implement shmem as a memfile_notifier backing store. Essentially it interacts with the memfile_notifier feature flags for userspace access/page migration/page reclaiming and implements the necessary memfile_backing_store callbacks.
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
[...]
Why do we export shmem_get_pfn/shmem_put_pfn and not simply a get_folio() and let the caller deal with putting the folio? What's the reason to a) operate on PFNs and not folios and b) have these get/put semantics?
We have a design assumption that someday this can even support non-page-based backing stores. There are some discussions here: https://lkml.org/lkml/2022/3/28/1440. I should add documentation for these two callbacks.
+}
+
+static struct memfile_backing_store shmem_backing_store = {
+	.lookup_memfile_node = shmem_lookup_memfile_node,
+	.get_pfn = shmem_get_pfn,
+	.put_pfn = shmem_put_pfn,
+};
+#endif /* CONFIG_MEMFILE_NOTIFIER */
+
 void __init shmem_init(void)
 {
 	int error;
@@ -3956,6 +4059,10 @@ void __init shmem_init(void)
 	else
 		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
 #endif
+
+#ifdef CONFIG_MEMFILE_NOTIFIER
+	memfile_register_backing_store(&shmem_backing_store);
Can we instead provide a dummy function that does nothing without CONFIG_MEMFILE_NOTIFIER?
Sounds good.
Chao
+#endif
 	return;
 out1:
-- Thanks,
David / dhildenb
Introduce a new memfd_create() flag indicating the content of the created memfd is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly.
It provides semantics required for KVM guest private memory support that a file descriptor with this flag set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace.
The flag cannot coexist with MFD_ALLOW_SEALING; future sealing is also impossible for a memfd created with this flag.
Signed-off-by: Chao Peng chao.p.peng@linux.intel.com --- include/uapi/linux/memfd.h | 1 + mm/memfd.c | 15 ++++++++++++++- 2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h index 7a8a26751c23..48750474b904 100644 --- a/include/uapi/linux/memfd.h +++ b/include/uapi/linux/memfd.h @@ -8,6 +8,7 @@ #define MFD_CLOEXEC 0x0001U #define MFD_ALLOW_SEALING 0x0002U #define MFD_HUGETLB 0x0004U +#define MFD_INACCESSIBLE 0x0008U
/* * Huge page size encoding when MFD_HUGETLB is specified, and a huge page diff --git a/mm/memfd.c b/mm/memfd.c index 2afd898798e4..72d7139ccced 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -18,6 +18,7 @@ #include <linux/hugetlb.h> #include <linux/shmem_fs.h> #include <linux/memfd.h> +#include <linux/memfile_notifier.h> #include <uapi/linux/memfd.h>
/* @@ -262,7 +263,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \ + MFD_INACCESSIBLE)
SYSCALL_DEFINE2(memfd_create, const char __user *, uname, @@ -284,6 +286,10 @@ SYSCALL_DEFINE2(memfd_create, return -EINVAL; }
+ /* Disallow sealing when MFD_INACCESSIBLE is set. */ + if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING) + return -EINVAL; + /* length includes terminating zero */ len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); if (len <= 0) @@ -330,12 +336,19 @@ SYSCALL_DEFINE2(memfd_create, if (flags & MFD_ALLOW_SEALING) { file_seals = memfd_file_seals_ptr(file); *file_seals &= ~F_SEAL_SEAL; + } else if (flags & MFD_INACCESSIBLE) { + error = memfile_node_set_flags(file, + MEMFILE_F_USER_INACCESSIBLE); + if (error) + goto err_file; }
fd_install(fd, file); kfree(name); return fd;
+err_file: + fput(file); err_fd: put_unused_fd(fd); err_name:
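For orientation, a hedged userspace sketch of the intended semantics follows. The MFD_INACCESSIBLE value is taken from the uapi hunk above, and the expectation that mmap() fails follows the shmem changes earlier in the series rather than anything guaranteed by this patch alone.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

#ifndef MFD_INACCESSIBLE
#define MFD_INACCESSIBLE 0x0008U	/* from the uapi change above */
#endif

int main(void)
{
	/* MFD_ALLOW_SEALING must not be combined with MFD_INACCESSIBLE. */
	int fd = syscall(SYS_memfd_create, "guest-private", MFD_INACCESSIBLE);

	if (fd < 0) {
		perror("memfd_create");	/* e.g. EINVAL on kernels without the flag */
		return 1;
	}

	if (ftruncate(fd, 2UL << 20))	/* size must stay page-aligned */
		perror("ftruncate");

	/* Ordinary MMU access is rejected; mmap() is expected to fail. */
	if (mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0) == MAP_FAILED)
		printf("mmap failed as expected (errno %d)\n", errno);

	/* The fd is only useful when handed to KVM as guest private memory. */
	close(fd);
	return 0;
}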
On 06.07.22 10:20, Chao Peng wrote:
Introduce a new memfd_create() flag indicating the content of the created memfd is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly.
It provides semantics required for KVM guest private memory support that a file descriptor with this flag set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace.
The flag can not coexist with MFD_ALLOW_SEALING, future sealing is also impossible for a memfd created with this flag.
It's kind of weird to have it that way. Why should the user have to care? It's the notifier requirement to have that, no?
Why can't we handle that when registering a notifier? If anything is already mapped, fail registering the notifier if the notifier has these demands. If registering succeeds, block it internally.
Or what am I missing? We might not need the memfile set flag semantics eventually and would not have to expose such a flag to user space.
On Fri, Aug 05, 2022 at 03:28:50PM +0200, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
Introduce a new memfd_create() flag indicating the content of the created memfd is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly.
It provides semantics required for KVM guest private memory support that a file descriptor with this flag set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace.
The flag can not coexist with MFD_ALLOW_SEALING, future sealing is also impossible for a memfd created with this flag.
It's kind of weird to have it that way. Why should the user have to care? It's the notifier requirement to have that, no?
Why can't we handle that when register a notifier? If anything is already mapped, fail registering the notifier if the notifier has these demands. If registering succeeds, block it internally.
Or what am I missing? We might not need the memfile set flag semantics eventually and would not have to expose such a flag to user space.
This makes sense if it is doable. The major concern was: is there a reliable way to detect this (already mapped) at the time of memslot registration?
Chao
-- Thanks,
David / dhildenb
On 10.08.22 11:37, Chao Peng wrote:
On Fri, Aug 05, 2022 at 03:28:50PM +0200, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
Introduce a new memfd_create() flag indicating the content of the created memfd is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly.
It provides semantics required for KVM guest private memory support that a file descriptor with this flag set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace.
The flag can not coexist with MFD_ALLOW_SEALING, future sealing is also impossible for a memfd created with this flag.
It's kind of weird to have it that way. Why should the user have to care? It's the notifier requirement to have that, no?
Why can't we handle that when register a notifier? If anything is already mapped, fail registering the notifier if the notifier has these demands. If registering succeeds, block it internally.
Or what am I missing? We might not need the memfile set flag semantics eventually and would not have to expose such a flag to user space.
This makes sense if doable. The major concern was: is there a reliable way to detect this (already mapped) at the time of memslot registering.
If too complicated, we could simplify to "was this ever mapped" and fail for now. Hooking into shmem_mmap() might be sufficient for that to get notified about the first mmap.
As an alternative, mapping_mapped() or similar *might* do what we want.
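If the check did move into notifier registration, a minimal sketch (an assumption, not something in this series) could reuse mapping_mapped() roughly like this:

/*
 * Hypothetical helper for memfile_register_notifier(): refuse to attach a
 * notifier that demands MEMFILE_F_USER_INACCESSIBLE to a file that still
 * has (or, in the stricter variant, ever had) userspace mappings.
 */
static int memfile_check_user_mappings(struct file *file, unsigned long flags)
{
	struct address_space *mapping = file_inode(file)->i_mapping;

	if ((flags & MEMFILE_F_USER_INACCESSIBLE) && mapping_mapped(mapping))
		return -EBUSY;

	return 0;
}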
On Wed, Aug 10, 2022 at 11:55:19AM +0200, David Hildenbrand wrote:
On 10.08.22 11:37, Chao Peng wrote:
On Fri, Aug 05, 2022 at 03:28:50PM +0200, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
Introduce a new memfd_create() flag indicating the content of the created memfd is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly.
It provides semantics required for KVM guest private memory support that a file descriptor with this flag set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace.
The flag can not coexist with MFD_ALLOW_SEALING, future sealing is also impossible for a memfd created with this flag.
It's kind of weird to have it that way. Why should the user have to care? It's the notifier requirement to have that, no?
Why can't we handle that when register a notifier? If anything is already mapped, fail registering the notifier if the notifier has these demands. If registering succeeds, block it internally.
Or what am I missing? We might not need the memfile set flag semantics eventually and would not have to expose such a flag to user space.
This makes sense if doable. The major concern was: is there a reliable way to detect this (already mapped) at the time of memslot registering.
If too complicated, we could simplify to "was this ever mapped" and fail for now. Hooking into shmem_mmap() might be sufficient for that to get notified about the first mmap.
As an alternative, mapping_mapped() or similar *might* do what we want.
mapping_mapped() sounds like the right one. I remember the SEV people want to first map and then unmap, so "was this ever mapped" may not work for them.
Thanks, Chao
-- Thanks,
David / dhildenb
On Fri, Aug 05, 2022 at 03:28:50PM +0200, David Hildenbrand wrote:
On 06.07.22 10:20, Chao Peng wrote:
Introduce a new memfd_create() flag indicating the content of the created memfd is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly.
It provides semantics required for KVM guest private memory support that a file descriptor with this flag set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace.
The flag can not coexist with MFD_ALLOW_SEALING, future sealing is also impossible for a memfd created with this flag.
It's kind of weird to have it that way. Why should the user have to care? It's the notifier requirement to have that, no?
Why can't we handle that when register a notifier? If anything is already mapped, fail registering the notifier if the notifier has these demands. If registering succeeds, block it internally.
Or what am I missing? We might not need the memfile set flag semantics eventually and would not have to expose such a flag to user space.
Well, with the new shim-based[1] implementation, the approach without a uAPI does not work.

We now have two struct files: one is a normal accessible memfd, and the other is a wrapper around it that hides the memfd from userspace and filters the allowed operations. If we first created an accessible memfd that userspace can see, it would be hard to hide it afterwards, since by that time userspace may have multiple fds in different processes that point to the same struct file.
[1] https://lore.kernel.org/all/20220831142439.65q2gi4g2d2z4ofh@box.shutemov.nam...
KVM_INTERNAL_MEM_SLOTS better reflects the fact that those slots are not exposed to userspace and avoids confusion with the real private slots that are going to be added.
Signed-off-by: Chao Peng chao.p.peng@linux.intel.com --- arch/mips/include/asm/kvm_host.h | 2 +- arch/x86/include/asm/kvm_host.h | 2 +- include/linux/kvm_host.h | 6 +++--- 3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index 717716cc51c5..45a978c805bc 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -85,7 +85,7 @@
#define KVM_MAX_VCPUS 16 /* memory slots that does not exposed to userspace */ -#define KVM_PRIVATE_MEM_SLOTS 0 +#define KVM_INTERNAL_MEM_SLOTS 0
#define KVM_HALT_POLL_NS_DEFAULT 500000
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index de5a149d0971..dae190e19fce 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -53,7 +53,7 @@ #define KVM_MAX_VCPU_IDS (KVM_MAX_VCPUS * KVM_VCPU_ID_RATIO)
/* memory slots that are not exposed to userspace */ -#define KVM_PRIVATE_MEM_SLOTS 3 +#define KVM_INTERNAL_MEM_SLOTS 3
#define KVM_HALT_POLL_NS_DEFAULT 200000
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 3b40f8d68fbb..0bdb6044e316 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -656,12 +656,12 @@ struct kvm_irq_routing_table { }; #endif
-#ifndef KVM_PRIVATE_MEM_SLOTS -#define KVM_PRIVATE_MEM_SLOTS 0 +#ifndef KVM_INTERNAL_MEM_SLOTS +#define KVM_INTERNAL_MEM_SLOTS 0 #endif
#define KVM_MEM_SLOTS_NUM SHRT_MAX -#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS) +#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
Currently, in the mmu_notifier invalidate path, the hva range is recorded and then checked in mmu_notifier_retry_hva() from the page fault path. However, for the to-be-introduced private memory, a page fault may not have an hva associated; checking the gfn (gpa) makes more sense. For the existing non-private memory case, gfn is expected to continue to work.
The patch also fixes a potential bug in kvm_zap_gfn_range() which has already been using gfn when calling kvm_inc/dec_notifier_count() in current code.
Signed-off-by: Chao Peng chao.p.peng@linux.intel.com --- arch/x86/kvm/mmu/mmu.c | 2 +- include/linux/kvm_host.h | 18 ++++++++---------- virt/kvm/kvm_main.c | 6 +++--- 3 files changed, 12 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index f7fa4c31b7c5..0d882fad4bc1 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4182,7 +4182,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu, return true;
return fault->slot && - mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva); + mmu_notifier_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn); }
static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0bdb6044e316..e9153b54e2a4 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -767,8 +767,8 @@ struct kvm { struct mmu_notifier mmu_notifier; unsigned long mmu_notifier_seq; long mmu_notifier_count; - unsigned long mmu_notifier_range_start; - unsigned long mmu_notifier_range_end; + gfn_t mmu_notifier_range_start; + gfn_t mmu_notifier_range_end; #endif struct list_head devices; u64 manual_dirty_log_protect; @@ -1362,10 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc); void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); #endif
-void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start, - unsigned long end); -void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start, - unsigned long end); +void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end); +void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
long kvm_arch_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg); @@ -1923,9 +1921,9 @@ static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq) return 0; }
-static inline int mmu_notifier_retry_hva(struct kvm *kvm, +static inline int mmu_notifier_retry_gfn(struct kvm *kvm, unsigned long mmu_seq, - unsigned long hva) + gfn_t gfn) { lockdep_assert_held(&kvm->mmu_lock); /* @@ -1935,8 +1933,8 @@ static inline int mmu_notifier_retry_hva(struct kvm *kvm, * positives, due to shortcuts when handing concurrent invalidations. */ if (unlikely(kvm->mmu_notifier_count) && - hva >= kvm->mmu_notifier_range_start && - hva < kvm->mmu_notifier_range_end) + gfn >= kvm->mmu_notifier_range_start && + gfn < kvm->mmu_notifier_range_end) return 1; if (kvm->mmu_notifier_seq != mmu_seq) return 1; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index da263c370d00..4d7f0e72366f 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -536,8 +536,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start, - unsigned long end); +typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end);
typedef void (*on_unlock_fn_t)(struct kvm *kvm);
@@ -624,7 +623,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm, locked = true; KVM_MMU_LOCK(kvm); if (!IS_KVM_NULL_FN(range->on_lock)) - range->on_lock(kvm, range->start, range->end); + range->on_lock(kvm, gfn_range.start, + gfn_range.end); if (IS_KVM_NULL_FN(range->handler)) break; }
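For orientation, a condensed sketch of the retry pattern this patch converts to gfn: mmu_seq is sampled from kvm->mmu_notifier_seq before the pfn is resolved and re-checked under mmu_lock, mirroring is_page_fault_stale() above; the helper name and the reduced context are assumptions, not code from the series.

/*
 * Illustrative only: the real check lives in is_page_fault_stale() and is
 * called with mmu_lock held, after the pfn has been resolved.
 */
static bool example_fault_is_stale(struct kvm_vcpu *vcpu, gfn_t gfn,
				   unsigned long mmu_seq)
{
	lockdep_assert_held(&vcpu->kvm->mmu_lock);

	/*
	 * Retry if an in-progress invalidation overlaps this gfn, or if
	 * any invalidation completed since mmu_seq was sampled.
	 */
	return mmu_notifier_retry_gfn(vcpu->kvm, mmu_seq, gfn);
}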
Currently in mmu_notifier validate path, hva range is recorded and then checked in the mmu_notifier_retry_hva() from page fault path. However for the to be introduced private memory, a page fault may not have a hva
As this patch appeared in v7, just wondering: did you see an actual bug because of it? And does not having a corresponding 'hva' occur only with private memory, because it's not mapped to host userspace?
Thanks, Pankaj
[...]
On Fri, Jul 15, 2022 at 01:36:15PM +0200, Gupta, Pankaj wrote:
Currently in the mmu_notifier validation path, an hva range is recorded and then checked in mmu_notifier_retry_hva() from the page fault path. However, for the to-be-introduced private memory, a page fault may not have an hva
As this patch appeared in v7, just wondering: did you see an actual bug because of it? And does not having a corresponding 'hva' occur only with private memory, because it's not mapped to host userspace?
The problem being addressed is not new in this version; previous versions also had code to handle it (just in a different way). The problem is: mmu_notifier/memfile_notifier may be in the progress of invalidating a pfn that was obtained earlier in the page fault handler, and when that happens we should retry the fault. In v6 I used the global mmu_notifier_retry() for memfile_notifier, but that can block unrelated mmu_notifier invalidations which have an hva range specified.

Sean gave a comment at https://lkml.org/lkml/2022/6/17/1001 to separate memfile_notifier from mmu_notifier, but during the implementation I realized we can actually reuse the same code for shared and private memory if both use a gpa range, and that simplifies the handling in kvm_zap_gfn_range() and some other code (e.g. we don't need two versions for memfile_notifier/mmu_notifier).

Adding a gpa range for private memory invalidation also relieves the blocking issue described above between private memory page faults and the mmu_notifier.
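For reference, here is a minimal sketch of the retry pattern being described (simplified; sketch_map_private_gfn() and sketch_get_private_pfn() are hypothetical names, and the mapping-install step is elided):

/*
 * Minimal sketch of the retry pattern described above; simplified and
 * not taken verbatim from the series.  sketch_get_private_pfn() is a
 * hypothetical stand-in for the backing store (memfile_notifier) pfn
 * lookup.
 */
static int sketch_map_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	unsigned long mmu_seq;
	kvm_pfn_t pfn;

again:
	/* Snapshot the invalidation sequence before resolving the pfn. */
	mmu_seq = vcpu->kvm->mmu_notifier_seq;
	smp_rmb();

	pfn = sketch_get_private_pfn(vcpu->kvm, gfn);

	write_lock(&vcpu->kvm->mmu_lock);
	/*
	 * If an invalidation covering this gfn is in progress, or the
	 * sequence changed since the snapshot, the pfn may be stale:
	 * drop the lock and retry the fault.  A full implementation
	 * would also release the pfn reference before retrying.
	 */
	if (mmu_notifier_retry_gfn(vcpu->kvm, mmu_seq, gfn)) {
		write_unlock(&vcpu->kvm->mmu_lock);
		goto again;
	}
	/* ... install the mapping for gfn -> pfn under mmu_lock ... */
	write_unlock(&vcpu->kvm->mmu_lock);
	return 0;
}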
Chao
On Mon, Jul 18, 2022, Chao Peng wrote:
On Fri, Jul 15, 2022 at 01:36:15PM +0200, Gupta, Pankaj wrote:
Currently in the mmu_notifier validation path, an hva range is recorded and then checked in mmu_notifier_retry_hva() from the page fault path. However, for the to-be-introduced private memory, a page fault may not have an hva
As this patch appeared in v7, just wondering: did you see an actual bug because of it? And does not having a corresponding 'hva' occur only with private memory, because it's not mapped to host userspace?
The problem being addressed is not new in this version; previous versions also had code to handle it (just in a different way). The problem is: mmu_notifier/memfile_notifier may be in the progress of invalidating a pfn that was obtained earlier in the page fault handler, and when that happens we should retry the fault. In v6 I used the global mmu_notifier_retry() for memfile_notifier, but that can block unrelated mmu_notifier invalidations which have an hva range specified.

Sean gave a comment at https://lkml.org/lkml/2022/6/17/1001 to separate memfile_notifier from mmu_notifier, but during the implementation I realized we can actually reuse the same code for shared and private memory if both use a gpa range, and that simplifies the handling in kvm_zap_gfn_range() and some other code (e.g. we don't need two versions for memfile_notifier/mmu_notifier).
This should work, though I'm undecided as to whether or not it's a good idea. KVM allows aliasing multiple gfns to a single hva, and so using the gfn could result in a much larger range being rejected given the simplistic algorithm for handling multiple ranges in kvm_inc_notifier_count(). But I assume such aliasing is uncommon, so I'm not sure it's worth optimizing for.
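As a concrete illustration of the aliasing concern (the gfn values are made up and purely hypothetical): if one hva is mapped at two distant gfns, a single hva invalidation turns into two gfn ranges, and the single start/end pair tracked by kvm_inc_notifier_count() collapses them into one huge range:

/*
 * Illustrative only, with made-up gfn values: two gfn aliases of the
 * same hva feed the single min/max pair in kvm_inc_notifier_count().
 */
kvm_inc_notifier_count(kvm, 0x100, 0x101);	/* alias #1 */
kvm_inc_notifier_count(kvm, 0x80000, 0x80001);	/* alias #2 */
/*
 * mmu_notifier_range_start = 0x100, mmu_notifier_range_end = 0x80001:
 * until both invalidations complete, a page fault on any gfn in
 * between is treated as potentially stale and retried.
 */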
Adding a gpa range for private memory invalidation also relieves the blocking issue described above between private memory page faults and the mmu_notifier.
On Mon, Jul 18, 2022 at 03:26:34PM +0000, Sean Christopherson wrote:
On Mon, Jul 18, 2022, Chao Peng wrote:
On Fri, Jul 15, 2022 at 01:36:15PM +0200, Gupta, Pankaj wrote:
Currently in the mmu_notifier validation path, an hva range is recorded and then checked in mmu_notifier_retry_hva() from the page fault path. However, for the to-be-introduced private memory, a page fault may not have an hva
As this patch appeared in v7, just wondering: did you see an actual bug because of it? And does not having a corresponding 'hva' occur only with private memory, because it's not mapped to host userspace?
The problem being addressed is not new in this version; previous versions also had code to handle it (just in a different way). The problem is: mmu_notifier/memfile_notifier may be in the progress of invalidating a pfn that was obtained earlier in the page fault handler, and when that happens we should retry the fault. In v6 I used the global mmu_notifier_retry() for memfile_notifier, but that can block unrelated mmu_notifier invalidations which have an hva range specified.

Sean gave a comment at https://lkml.org/lkml/2022/6/17/1001 to separate memfile_notifier from mmu_notifier, but during the implementation I realized we can actually reuse the same code for shared and private memory if both use a gpa range, and that simplifies the handling in kvm_zap_gfn_range() and some other code (e.g. we don't need two versions for memfile_notifier/mmu_notifier).
This should work, though I'm undecided as to whether or not it's a good idea. KVM allows aliasing multiple gfns to a single hva, and so using the gfn could result in a much larger range being rejected given the simplistic algorithm for handling multiple ranges in kvm_inc_notifier_count(). But I assume such aliasing is uncommon, so I'm not sure it's worth optimizing for.
That can be a real problem for the current v7 code: __kvm_handle_hva_range() loops over all possible gfn_ranges for a given hva_range, but on_lock/on_unlock is invoked only once. That works for an hva_range, but not for gfn_ranges, since we can have multiple of them.
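A rough sketch of one possible adjustment (not necessarily what a later revision does): invoke on_lock once per derived gfn range inside the memslot walk, so every range's start/end is recorded; the matching on_unlock/decrement side would need the same per-range treatment.

/*
 * Hedged sketch only: call on_lock for every gfn range derived from
 * the hva range, so each range's start/end is accounted, instead of
 * calling it a single time per hva range.
 */
kvm_for_each_memslot_in_hva_range(node, slots, range->start, range->end - 1) {
	/* ... set up gfn_range.start / gfn_range.end for this slot ... */

	if (!locked) {
		locked = true;
		KVM_MMU_LOCK(kvm);
	}
	if (!IS_KVM_NULL_FN(range->on_lock))
		range->on_lock(kvm, gfn_range.start, gfn_range.end);
	if (!IS_KVM_NULL_FN(range->handler))
		ret |= range->handler(kvm, &gfn_range);
}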
Adding a gpa range for private memory invalidation also relieves the blocking issue described above between private memory page faults and the mmu_notifier.
On Wed, Jul 06, 2022 at 04:20:09PM +0800, Chao Peng chao.p.peng@linux.intel.com wrote:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0bdb6044e316..e9153b54e2a4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1362,10 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
-			    unsigned long end);
-void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
-			    unsigned long end);
+void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
The corresponding changes in kvm_main.c are missing.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b2c79bef61bd..0184e327f6f5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -711,8 +711,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
 }
 
-void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
-			    unsigned long end)
+void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end)
 {
 	/*
 	 * The count increase must become visible at unlock time as no
@@ -786,8 +785,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	return 0;
 }
 
-void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
-			    unsigned long end)
+void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end)
 {
 	/*
 	 * This sequence increase will notify the kvm page fault that
On Thu, Aug 04, 2022 at 12:10:44AM -0700, Isaku Yamahata wrote:
On Wed, Jul 06, 2022 at 04:20:09PM +0800, Chao Peng chao.p.peng@linux.intel.com wrote:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0bdb6044e316..e9153b54e2a4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1362,10 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
-			    unsigned long end);
-void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
-			    unsigned long end);
+void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
The corresponding changes in kvm_main.c are missing.
Exactly! Actually it's in the next patch, while it should indeed be in this patch.
Chao
The sync mechanism between the mmu_notifier and the page fault handler employs the fields mmu_notifier_seq/count and mmu_notifier_range_start/end. The to-be-added private memory needs the same mechanism, but it does not rely on mmu_notifier (it uses the newly introduced memfile_notifier). This patch renames the existing fields and related helper functions to the neutral name mmu_updating_* so that private memory can reuse them.
No functional change intended.
Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
---
 arch/arm64/kvm/mmu.c                     |  8 ++---
 arch/mips/kvm/mmu.c                      | 10 +++---
 arch/powerpc/include/asm/kvm_book3s_64.h |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_host.c    |  4 +--
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |  4 +--
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  6 ++--
 arch/powerpc/kvm/book3s_hv_nested.c      |  2 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |  8 ++---
 arch/powerpc/kvm/e500_mmu_host.c         |  4 +--
 arch/riscv/kvm/mmu.c                     |  4 +--
 arch/x86/kvm/mmu/mmu.c                   | 14 ++++----
 arch/x86/kvm/mmu/paging_tmpl.h           |  4 +--
 include/linux/kvm_host.h                 | 38 ++++++++++-----------
 virt/kvm/kvm_main.c                      | 42 +++++++++++-------------
 virt/kvm/pfncache.c                      | 14 ++++----
 15 files changed, 81 insertions(+), 83 deletions(-)
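For context, a hedged sketch (not part of this patch) of how a memfile_notifier-driven invalidation could reuse the renamed helpers; the callback name kvm_private_mem_invalidate() and the zap step are illustrative placeholders only:

/*
 * Hypothetical memfile_notifier invalidation callback reusing the
 * renamed helpers; names and the zap step are illustrative only.
 */
static void kvm_private_mem_invalidate(struct kvm *kvm, gfn_t start, gfn_t end)
{
	write_lock(&kvm->mmu_lock);
	/* Record the range and mark an update in progress. */
	kvm_mmu_updating_begin(kvm, start, end);

	/* ... zap secondary MMU entries for [start, end) ... */

	/* Bump the sequence and drop the in-progress count. */
	kvm_mmu_updating_end(kvm, start, end);
	write_unlock(&kvm->mmu_lock);
}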
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 87f1cd0df36e..7ee6fafc24ee 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -993,7 +993,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot, * THP doesn't start to split while we are adjusting the * refcounts. * - * We are sure this doesn't happen, because mmu_notifier_retry + * We are sure this doesn't happen, because mmu_updating_retry * was successful and we are holding the mmu_lock, so if this * THP is trying to split, it will be blocked in the mmu * notifier before touching any of the pages, specifically @@ -1188,9 +1188,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, return ret; }
- mmu_seq = vcpu->kvm->mmu_notifier_seq; + mmu_seq = vcpu->kvm->mmu_updating_seq; /* - * Ensure the read of mmu_notifier_seq happens before we call + * Ensure the read of mmu_updating_seq happens before we call * gfn_to_pfn_prot (which calls get_user_pages), so that we don't risk * the page we just got a reference to gets unmapped before we have a * chance to grab the mmu_lock, which ensure that if the page gets @@ -1246,7 +1246,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, else write_lock(&kvm->mmu_lock); pgt = vcpu->arch.hw_mmu->pgt; - if (mmu_notifier_retry(kvm, mmu_seq)) + if (mmu_updating_retry(kvm, mmu_seq)) goto out_unlock;
/* diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c index 1bfd1b501d82..abd468c6a749 100644 --- a/arch/mips/kvm/mmu.c +++ b/arch/mips/kvm/mmu.c @@ -615,17 +615,17 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa, * Used to check for invalidations in progress, of the pfn that is * returned by pfn_to_pfn_prot below. */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; /* - * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in + * Ensure the read of mmu_updating_seq isn't reordered with PTE reads in * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't * risk the page we get a reference to getting unmapped before we have a - * chance to grab the mmu_lock without mmu_notifier_retry() noticing. + * chance to grab the mmu_lock without mmu_updating_retry () noticing. * * This smp_rmb() pairs with the effective smp_wmb() of the combination * of the pte_unmap_unlock() after the PTE is zapped, and the * spin_lock() in kvm_mmu_notifier_invalidate_<page|range_end>() before - * mmu_notifier_seq is incremented. + * mmu_updating_seq is incremented. */ smp_rmb();
@@ -638,7 +638,7 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa,
spin_lock(&kvm->mmu_lock); /* Check if an invalidation has taken place since we got pfn */ - if (mmu_notifier_retry(kvm, mmu_seq)) { + if (mmu_updating_retry(kvm, mmu_seq)) { /* * This can happen when mappings are changed asynchronously, but * also synchronously if a COW is triggered by diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index 4def2bd17b9b..4d35fb913de5 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -666,7 +666,7 @@ static inline pte_t *find_kvm_host_pte(struct kvm *kvm, unsigned long mmu_seq, VM_WARN(!spin_is_locked(&kvm->mmu_lock), "%s called with kvm mmu_lock not held \n", __func__);
- if (mmu_notifier_retry(kvm, mmu_seq)) + if (mmu_updating_retry(kvm, mmu_seq)) return NULL;
pte = __find_linux_pte(kvm->mm->pgd, ea, NULL, hshift); diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c index 1ae09992c9ea..78f1aae8cb60 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_host.c +++ b/arch/powerpc/kvm/book3s_64_mmu_host.c @@ -90,7 +90,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte, unsigned long pfn;
/* used to check for invalidations in progress */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb();
/* Get host physical address for gpa */ @@ -151,7 +151,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte, cpte = kvmppc_mmu_hpte_cache_next(vcpu);
spin_lock(&kvm->mmu_lock); - if (!cpte || mmu_notifier_retry(kvm, mmu_seq)) { + if (!cpte || mmu_updating_retry(kvm, mmu_seq)) { r = -EAGAIN; goto out_unlock; } diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 514fd45c1994..bcdec6a6f2a7 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -578,7 +578,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu, return -EFAULT;
/* used to check for invalidations in progress */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb();
ret = -EFAULT; @@ -693,7 +693,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
/* Check if we might have been invalidated; let the guest retry if so */ ret = RESUME_GUEST; - if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) { + if (mmu_updating_retry(vcpu->kvm, mmu_seq)) { unlock_rmap(rmap); goto out_unlock; } diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index 42851c32ff3b..c8890ccc3f40 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -639,7 +639,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte, /* Check if we might have been invalidated; let the guest retry if so */ spin_lock(&kvm->mmu_lock); ret = -EAGAIN; - if (mmu_notifier_retry(kvm, mmu_seq)) + if (mmu_updating_retry(kvm, mmu_seq)) goto out_unlock;
/* Now traverse again under the lock and change the tree */ @@ -829,7 +829,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu, bool large_enable;
/* used to check for invalidations in progress */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb();
/* @@ -1190,7 +1190,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm, * Increase the mmu notifier sequence number to prevent any page * fault that read the memslot earlier from writing a PTE. */ - kvm->mmu_notifier_seq++; + kvm->mmu_updating_seq++; spin_unlock(&kvm->mmu_lock); }
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c index 0644732d1a25..09f841f730da 100644 --- a/arch/powerpc/kvm/book3s_hv_nested.c +++ b/arch/powerpc/kvm/book3s_hv_nested.c @@ -1579,7 +1579,7 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu, /* 2. Find the host pte for this L1 guest real address */
/* Used to check for invalidations in progress */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb();
/* See if can find translation in our partition scoped tables for L1 */ diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 2257fb18cb72..952b504dc98a 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -219,7 +219,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, g_ptel = ptel;
/* used later to detect if we might have been invalidated */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb();
/* Find the memslot (if any) for this address */ @@ -366,7 +366,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, rmap = real_vmalloc_addr(rmap); lock_rmap(rmap); /* Check for pending invalidations under the rmap chain lock */ - if (mmu_notifier_retry(kvm, mmu_seq)) { + if (mmu_updating_retry(kvm, mmu_seq)) { /* inval in progress, write a non-present HPTE */ pteh |= HPTE_V_ABSENT; pteh &= ~HPTE_V_VALID; @@ -932,7 +932,7 @@ static long kvmppc_do_h_page_init_zero(struct kvm_vcpu *vcpu, int i;
/* Used later to detect if we might have been invalidated */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb();
arch_spin_lock(&kvm->mmu_lock.rlock.raw_lock); @@ -960,7 +960,7 @@ static long kvmppc_do_h_page_init_copy(struct kvm_vcpu *vcpu, long ret = H_SUCCESS;
/* Used later to detect if we might have been invalidated */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb();
arch_spin_lock(&kvm->mmu_lock.rlock.raw_lock); diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c index 7f16afc331ef..d7636b926f25 100644 --- a/arch/powerpc/kvm/e500_mmu_host.c +++ b/arch/powerpc/kvm/e500_mmu_host.c @@ -339,7 +339,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500, unsigned long flags;
/* used to check for invalidations in progress */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb();
/* @@ -460,7 +460,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500, }
spin_lock(&kvm->mmu_lock); - if (mmu_notifier_retry(kvm, mmu_seq)) { + if (mmu_updating_retry(kvm, mmu_seq)) { ret = -EAGAIN; goto out; } diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c index 081f8d2b9cf3..a7db374d3861 100644 --- a/arch/riscv/kvm/mmu.c +++ b/arch/riscv/kvm/mmu.c @@ -654,7 +654,7 @@ int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu, return ret; }
- mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq;
hfn = gfn_to_pfn_prot(kvm, gfn, is_write, &writeable); if (hfn == KVM_PFN_ERR_HWPOISON) { @@ -674,7 +674,7 @@ int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu,
spin_lock(&kvm->mmu_lock);
- if (mmu_notifier_retry(kvm, mmu_seq)) + if (mmu_updating_retry(kvm, mmu_seq)) goto out_unlock;
if (writeable) { diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 0d882fad4bc1..545eb74305fe 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2908,7 +2908,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep) * If addresses are being invalidated, skip prefetching to avoid * accidentally prefetching those addresses. */ - if (unlikely(vcpu->kvm->mmu_notifier_count)) + if (unlikely(vcpu->kvm->mmu_updating_count)) return;
__direct_pte_prefetch(vcpu, sp, sptep); @@ -2950,7 +2950,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, /* * Lookup the mapping level in the current mm. The information * may become stale soon, but it is safe to use as long as - * 1) mmu_notifier_retry was checked after taking mmu_lock, and + * 1) mmu_updating_retry was checked after taking mmu_lock, and * 2) mmu_lock is taken now. * * We still need to disable IRQs to prevent concurrent tear down @@ -3035,7 +3035,7 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault return;
/* - * mmu_notifier_retry() was successful and mmu_lock is held, so + * mmu_updating_retry was successful and mmu_lock is held, so * the pmd can't be split from under us. */ fault->goal_level = fault->req_level; @@ -4182,7 +4182,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu, return true;
return fault->slot && - mmu_notifier_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn); + mmu_updating_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn); }
static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) @@ -4206,7 +4206,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault if (r) return r;
- mmu_seq = vcpu->kvm->mmu_notifier_seq; + mmu_seq = vcpu->kvm->mmu_updating_seq; smp_rmb();
r = kvm_faultin_pfn(vcpu, fault); @@ -6023,7 +6023,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
write_lock(&kvm->mmu_lock);
- kvm_inc_notifier_count(kvm, gfn_start, gfn_end); + kvm_mmu_updating_begin(kvm, gfn_start, gfn_end);
flush = __kvm_zap_rmaps(kvm, gfn_start, gfn_end);
@@ -6037,7 +6037,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end - gfn_start);
- kvm_dec_notifier_count(kvm, gfn_start, gfn_end); + kvm_mmu_updating_end(kvm, gfn_start, gfn_end);
write_unlock(&kvm->mmu_lock); } diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 2448fa8d8438..acf7e41aa02b 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -589,7 +589,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, * If addresses are being invalidated, skip prefetching to avoid * accidentally prefetching those addresses. */ - if (unlikely(vcpu->kvm->mmu_notifier_count)) + if (unlikely(vcpu->kvm->mmu_updating_count)) return;
if (sp->role.direct) @@ -838,7 +838,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault else fault->max_level = walker.level;
- mmu_seq = vcpu->kvm->mmu_notifier_seq; + mmu_seq = vcpu->kvm->mmu_updating_seq; smp_rmb();
r = kvm_faultin_pfn(vcpu, fault); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index e9153b54e2a4..c262ebb168a7 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -765,10 +765,10 @@ struct kvm {
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) struct mmu_notifier mmu_notifier; - unsigned long mmu_notifier_seq; - long mmu_notifier_count; - gfn_t mmu_notifier_range_start; - gfn_t mmu_notifier_range_end; + unsigned long mmu_updating_seq; + long mmu_updating_count; + gfn_t mmu_updating_range_start; + gfn_t mmu_updating_range_end; #endif struct list_head devices; u64 manual_dirty_log_protect; @@ -1362,8 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc); void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); #endif
-void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end); -void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end); +void kvm_mmu_updating_begin(struct kvm *kvm, gfn_t start, gfn_t end); +void kvm_mmu_updating_end(struct kvm *kvm, gfn_t start, gfn_t end);
long kvm_arch_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg); @@ -1901,42 +1901,42 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header; extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) -static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq) +static inline int mmu_updating_retry(struct kvm *kvm, unsigned long mmu_seq) { - if (unlikely(kvm->mmu_notifier_count)) + if (unlikely(kvm->mmu_updating_count)) return 1; /* - * Ensure the read of mmu_notifier_count happens before the read - * of mmu_notifier_seq. This interacts with the smp_wmb() in + * Ensure the read of mmu_updating_count happens before the read + * of mmu_updating_seq. This interacts with the smp_wmb() in * mmu_notifier_invalidate_range_end to make sure that the caller - * either sees the old (non-zero) value of mmu_notifier_count or - * the new (incremented) value of mmu_notifier_seq. + * either sees the old (non-zero) value of mmu_updating_count or + * the new (incremented) value of mmu_updating_seq. * PowerPC Book3s HV KVM calls this under a per-page lock * rather than under kvm->mmu_lock, for scalability, so * can't rely on kvm->mmu_lock to keep things ordered. */ smp_rmb(); - if (kvm->mmu_notifier_seq != mmu_seq) + if (kvm->mmu_updating_seq != mmu_seq) return 1; return 0; }
-static inline int mmu_notifier_retry_gfn(struct kvm *kvm, +static inline int mmu_updating_retry_gfn(struct kvm *kvm, unsigned long mmu_seq, gfn_t gfn) { lockdep_assert_held(&kvm->mmu_lock); /* - * If mmu_notifier_count is non-zero, then the range maintained by + * If mmu_updating_count is non-zero, then the range maintained by * kvm_mmu_notifier_invalidate_range_start contains all addresses that * might be being invalidated. Note that it may include some false * positives, due to shortcuts when handing concurrent invalidations. */ - if (unlikely(kvm->mmu_notifier_count) && - gfn >= kvm->mmu_notifier_range_start && - gfn < kvm->mmu_notifier_range_end) + if (unlikely(kvm->mmu_updating_count) && + gfn >= kvm->mmu_updating_range_start && + gfn < kvm->mmu_updating_range_end) return 1; - if (kvm->mmu_notifier_seq != mmu_seq) + if (kvm->mmu_updating_seq != mmu_seq) return 1; return 0; } diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4d7f0e72366f..3ae4944b9f15 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -698,30 +698,29 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
/* * .change_pte() must be surrounded by .invalidate_range_{start,end}(). - * If mmu_notifier_count is zero, then no in-progress invalidations, + * If mmu_updating_count is zero, then no in-progress invalidations, * including this one, found a relevant memslot at start(); rechecking * memslots here is unnecessary. Note, a false positive (count elevated * by a different invalidation) is sub-optimal but functionally ok. */ WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count)); - if (!READ_ONCE(kvm->mmu_notifier_count)) + if (!READ_ONCE(kvm->mmu_updating_count)) return;
kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn); }
-void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start, - unsigned long end) +void kvm_mmu_updating_begin(struct kvm *kvm, gfn_t start, gfn_t end) { /* * The count increase must become visible at unlock time as no * spte can be established without taking the mmu_lock and * count is also read inside the mmu_lock critical section. */ - kvm->mmu_notifier_count++; - if (likely(kvm->mmu_notifier_count == 1)) { - kvm->mmu_notifier_range_start = start; - kvm->mmu_notifier_range_end = end; + kvm->mmu_updating_count++; + if (likely(kvm->mmu_updating_count == 1)) { + kvm->mmu_updating_range_start = start; + kvm->mmu_updating_range_end = end; } else { /* * Fully tracking multiple concurrent ranges has diminishing @@ -732,10 +731,10 @@ void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start, * accumulate and persist until all outstanding invalidates * complete. */ - kvm->mmu_notifier_range_start = - min(kvm->mmu_notifier_range_start, start); - kvm->mmu_notifier_range_end = - max(kvm->mmu_notifier_range_end, end); + kvm->mmu_updating_range_start = + min(kvm->mmu_updating_range_start, start); + kvm->mmu_updating_range_end = + max(kvm->mmu_updating_range_end, end); } }
@@ -748,7 +747,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, .end = range->end, .pte = __pte(0), .handler = kvm_unmap_gfn_range, - .on_lock = kvm_inc_notifier_count, + .on_lock = kvm_mmu_updating_begin, .on_unlock = kvm_arch_guest_memory_reclaimed, .flush_on_ret = true, .may_block = mmu_notifier_range_blockable(range), @@ -759,7 +758,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, /* * Prevent memslot modification between range_start() and range_end() * so that conditionally locking provides the same result in both - * functions. Without that guarantee, the mmu_notifier_count + * functions. Without that guarantee, the mmu_updating_count * adjustments will be imbalanced. * * Pairs with the decrement in range_end(). @@ -775,7 +774,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, * any given time, and the caches themselves can check for hva overlap, * i.e. don't need to rely on memslot overlap checks for performance. * Because this runs without holding mmu_lock, the pfn caches must use - * mn_active_invalidate_count (see above) instead of mmu_notifier_count. + * mn_active_invalidate_count (see above) instead of mmu_updating_count. */ gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end, hva_range.may_block); @@ -785,22 +784,21 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, return 0; }
-void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start, - unsigned long end) +void kvm_mmu_updating_end(struct kvm *kvm, gfn_t start, gfn_t end) { /* * This sequence increase will notify the kvm page fault that * the page that is going to be mapped in the spte could have * been freed. */ - kvm->mmu_notifier_seq++; + kvm->mmu_updating_seq++; smp_wmb(); /* * The above sequence increase must be visible before the * below count decrease, which is ensured by the smp_wmb above - * in conjunction with the smp_rmb in mmu_notifier_retry(). + * in conjunction with the smp_rmb in mmu_updating_retry(). */ - kvm->mmu_notifier_count--; + kvm->mmu_updating_count--; }
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, @@ -812,7 +810,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, .end = range->end, .pte = __pte(0), .handler = (void *)kvm_null_fn, - .on_lock = kvm_dec_notifier_count, + .on_lock = kvm_mmu_updating_end, .on_unlock = (void *)kvm_null_fn, .flush_on_ret = false, .may_block = mmu_notifier_range_blockable(range), @@ -833,7 +831,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, if (wake) rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
- BUG_ON(kvm->mmu_notifier_count < 0); + BUG_ON(kvm->mmu_updating_count < 0); }
static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn, diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index ab519f72f2cd..aa6d24966a76 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -112,27 +112,27 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s { /* * mn_active_invalidate_count acts for all intents and purposes - * like mmu_notifier_count here; but the latter cannot be used + * like mmu_updating_count here; but the latter cannot be used * here because the invalidation of caches in the mmu_notifier - * event occurs _before_ mmu_notifier_count is elevated. + * event occurs _before_ mmu_updating_count is elevated. * * Note, it does not matter that mn_active_invalidate_count * is not protected by gpc->lock. It is guaranteed to * be elevated before the mmu_notifier acquires gpc->lock, and - * isn't dropped until after mmu_notifier_seq is updated. + * isn't dropped until after mmu_updating_seq is updated. */ if (kvm->mn_active_invalidate_count) return true;
/* * Ensure mn_active_invalidate_count is read before - * mmu_notifier_seq. This pairs with the smp_wmb() in + * mmu_updating_seq. This pairs with the smp_wmb() in * mmu_notifier_invalidate_range_end() to guarantee either the * old (non-zero) value of mn_active_invalidate_count or the - * new (incremented) value of mmu_notifier_seq is observed. + * new (incremented) value of mmu_updating_seq is observed. */ smp_rmb(); - return kvm->mmu_notifier_seq != mmu_seq; + return kvm->mmu_updating_seq != mmu_seq; }
static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc) @@ -155,7 +155,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc) gpc->valid = false;
do { - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb();
write_unlock_irq(&gpc->lock);
On Wed, Jul 06, 2022, Chao Peng wrote:
The sync mechanism between the mmu_notifier and the page fault handler employs the fields mmu_notifier_seq/count and mmu_notifier_range_start/end. The to-be-added private memory needs the same mechanism, but it does not rely on mmu_notifier (it uses the newly introduced memfile_notifier). This patch renames the existing fields and related helper functions to the neutral name mmu_updating_* so that private memory can reuse them.
mmu_updating_* is too broad of a term, e.g. page faults and many other operations also update the mmu. Although the name most definitely came from the mmu_notifier, it's not completely inaccurate for other sources, e.g. KVM's MMU is still being notified of something, even if the source is not the actual mmu_notifier.
If we really want a different name, I'd vote for nomenclature that captures the invalidation aspect, which is really what the variables are all trackng, e.g.
  mmu_invalidate_seq
  mmu_invalidate_in_progress
  mmu_invalidate_range_start
  mmu_invalidate_range_end
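Applied to the struct kvm fields touched by this patch, the suggested naming would look roughly like this (sketch only):

#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
	struct mmu_notifier mmu_notifier;
	unsigned long mmu_invalidate_seq;
	long mmu_invalidate_in_progress;
	gfn_t mmu_invalidate_range_start;
	gfn_t mmu_invalidate_range_end;
#endif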
On Fri, Jul 29, 2022 at 07:02:12PM +0000, Sean Christopherson wrote:
On Wed, Jul 06, 2022, Chao Peng wrote:
The sync mechanism between the mmu_notifier and the page fault handler employs the fields mmu_notifier_seq/count and mmu_notifier_range_start/end. The to-be-added private memory needs the same mechanism, but it does not rely on mmu_notifier (it uses the newly introduced memfile_notifier). This patch renames the existing fields and related helper functions to the neutral name mmu_updating_* so that private memory can reuse them.
mmu_updating_* is too broad of a term, e.g. page faults and many other operations also update the mmu. Although the name most definitely came from the mmu_notifier, it's not completely inaccurate for other sources, e.g. KVM's MMU is still being notified of something, even if the source is not the actual mmu_notifier.
If we really want a different name, I'd vote for nomenclature that captures the invalidation aspect, which is really what the variables are all trackng, e.g.
  mmu_invalidate_seq
  mmu_invalidate_in_progress
  mmu_invalidate_range_start
  mmu_invalidate_range_end
Looks good to me. Thanks.
Chao
On 7/29/22 21:02, Sean Christopherson wrote:
If we really want a different name, I'd vote for nomenclature that captures the invalidation aspect, which is really what the variables are all trackng, e.g.
  mmu_invalidate_seq
  mmu_invalidate_in_progress
  mmu_invalidate_range_start
  mmu_invalidate_range_end
Agreed, and this can of course be committed separately if Chao Peng sends it outside this series.
Paolo
On Fri, Aug 05, 2022 at 09:54:35PM +0200, Paolo Bonzini wrote:
On 7/29/22 21:02, Sean Christopherson wrote:
If we really want a different name, I'd vote for nomenclature that captures the invalidation aspect, which is really what the variables are all trackng, e.g.
  mmu_invalidate_seq
  mmu_invalidate_in_progress
  mmu_invalidate_range_start
  mmu_invalidate_range_end
Agreed, and this can of course be committed separately if Chao Peng sends it outside this series.
I will do that, probably also including: 06/14 KVM: Rename KVM_PRIVATE_MEM_SLOT
Chao
Paolo
On 2022-07-06 16:20:10, Chao Peng wrote:
The sync mechanism between the mmu_notifier and the page fault handler employs the fields mmu_notifier_seq/count and mmu_notifier_range_start/end. The to-be-added private memory needs the same mechanism, but it does not rely on mmu_notifier (it uses the newly introduced memfile_notifier). This patch renames the existing fields and related helper functions to the neutral name mmu_updating_* so that private memory can reuse them.
No functional change intended.
Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e9153b54e2a4..c262ebb168a7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -765,10 +765,10 @@ struct kvm {
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 	struct mmu_notifier mmu_notifier;
-	unsigned long mmu_notifier_seq;
-	long mmu_notifier_count;
-	gfn_t mmu_notifier_range_start;
-	gfn_t mmu_notifier_range_end;
+	unsigned long mmu_updating_seq;
+	long mmu_updating_count;
Can we convert mmu_updating_seq and mmu_updating_count to atomic_t? I see that not all accesses to these are under the kvm->mmu_lock spinlock. This would also remove the need for the separate smp_wmb() and smp_rmb() memory barriers when accessing these structure members (a rough sketch follows below).
+	gfn_t mmu_updating_range_start;
+	gfn_t mmu_updating_range_end;
 #endif
 	struct list_head devices;
 	u64 manual_dirty_log_protect;
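A rough sketch of what the suggested conversion could look like on the read side (illustrative only; whether the explicit smp_wmb()/smp_rmb() pairs could actually be dropped, given the ordering needed between the count and seq accesses, is the open question):

/*
 * Illustrative sketch of the suggested conversion: mmu_updating_count
 * as atomic_t and mmu_updating_seq as atomic_long_t, read with the
 * atomic accessors.  The ordering between the two reads would still
 * need care (the existing smp_rmb() is kept here).
 */
static inline int mmu_updating_retry(struct kvm *kvm, unsigned long mmu_seq)
{
	if (unlikely(atomic_read(&kvm->mmu_updating_count)))
		return 1;
	smp_rmb();
	if ((unsigned long)atomic_long_read(&kvm->mmu_updating_seq) != mmu_seq)
		return 1;
	return 0;
}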
- if (unlikely(kvm->mmu_notifier_count))
- if (unlikely(kvm->mmu_updating_count)) return 1; /*
* Ensure the read of mmu_notifier_count happens before the read
* of mmu_notifier_seq. This interacts with the smp_wmb() in
* Ensure the read of mmu_updating_count happens before the read
* of mmu_updating_seq. This interacts with the smp_wmb() in
- mmu_notifier_invalidate_range_end to make sure that the caller
* either sees the old (non-zero) value of mmu_notifier_count or
* the new (incremented) value of mmu_notifier_seq.
* either sees the old (non-zero) value of mmu_updating_count or
* the new (incremented) value of mmu_updating_seq.
*/ smp_rmb();
- PowerPC Book3s HV KVM calls this under a per-page lock
- rather than under kvm->mmu_lock, for scalability, so
- can't rely on kvm->mmu_lock to keep things ordered.
- if (kvm->mmu_notifier_seq != mmu_seq)
- if (kvm->mmu_updating_seq != mmu_seq) return 1; return 0;
} -static inline int mmu_notifier_retry_gfn(struct kvm *kvm, +static inline int mmu_updating_retry_gfn(struct kvm *kvm, unsigned long mmu_seq, gfn_t gfn) { lockdep_assert_held(&kvm->mmu_lock); /*
- * If mmu_notifier_count is non-zero, then the range maintained by
+ * If mmu_updating_count is non-zero, then the range maintained by
* kvm_mmu_notifier_invalidate_range_start contains all addresses that
* might be being invalidated. Note that it may include some false
* positives, due to shortcuts when handing concurrent invalidations.
*/
- if (unlikely(kvm->mmu_notifier_count) &&
- gfn >= kvm->mmu_notifier_range_start &&
- gfn < kvm->mmu_notifier_range_end)
+ if (unlikely(kvm->mmu_updating_count) &&
+ gfn >= kvm->mmu_updating_range_start &&
+ gfn < kvm->mmu_updating_range_end)
return 1;
- if (kvm->mmu_notifier_seq != mmu_seq)
+ if (kvm->mmu_updating_seq != mmu_seq)
return 1;
return 0;
} diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4d7f0e72366f..3ae4944b9f15 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -698,30 +698,29 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, /* * .change_pte() must be surrounded by .invalidate_range_{start,end}().
- * If mmu_notifier_count is zero, then no in-progress invalidations,
+ * If mmu_updating_count is zero, then no in-progress invalidations,
* including this one, found a relevant memslot at start(); rechecking
* memslots here is unnecessary. Note, a false positive (count elevated
* by a different invalidation) is sub-optimal but functionally ok.
*/
WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
- if (!READ_ONCE(kvm->mmu_notifier_count))
+ if (!READ_ONCE(kvm->mmu_updating_count))
return;
kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn); } -void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start,
unsigned long end)
+void kvm_mmu_updating_begin(struct kvm *kvm, gfn_t start, gfn_t end) { /* * The count increase must become visible at unlock time as no * spte can be established without taking the mmu_lock and * count is also read inside the mmu_lock critical section. */
- kvm->mmu_notifier_count++;
- if (likely(kvm->mmu_notifier_count == 1)) {
- kvm->mmu_notifier_range_start = start;
- kvm->mmu_notifier_range_end = end;
+ kvm->mmu_updating_count++;
+ if (likely(kvm->mmu_updating_count == 1)) {
+ kvm->mmu_updating_range_start = start;
+ kvm->mmu_updating_range_end = end;
} else { /*
* Fully tracking multiple concurrent ranges has diminishing
@@ -732,10 +731,10 @@ void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start, * accumulate and persist until all outstanding invalidates * complete. */
- kvm->mmu_notifier_range_start =
- min(kvm->mmu_notifier_range_start, start);
- kvm->mmu_notifier_range_end =
- max(kvm->mmu_notifier_range_end, end);
+ kvm->mmu_updating_range_start =
+ min(kvm->mmu_updating_range_start, start);
+ kvm->mmu_updating_range_end =
+ max(kvm->mmu_updating_range_end, end);
}
} @@ -748,7 +747,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, .end = range->end, .pte = __pte(0), .handler = kvm_unmap_gfn_range,
- .on_lock = kvm_inc_notifier_count,
+ .on_lock = kvm_mmu_updating_begin,
.on_unlock = kvm_arch_guest_memory_reclaimed, .flush_on_ret = true, .may_block = mmu_notifier_range_blockable(range),
@@ -759,7 +758,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, /* * Prevent memslot modification between range_start() and range_end() * so that conditionally locking provides the same result in both
- * functions. Without that guarantee, the mmu_notifier_count
+ * functions. Without that guarantee, the mmu_updating_count
* adjustments will be imbalanced.
* Pairs with the decrement in range_end().
@@ -775,7 +774,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, * any given time, and the caches themselves can check for hva overlap, * i.e. don't need to rely on memslot overlap checks for performance. * Because this runs without holding mmu_lock, the pfn caches must use
- * mn_active_invalidate_count (see above) instead of mmu_notifier_count.
+ * mn_active_invalidate_count (see above) instead of mmu_updating_count.
*/ gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end, hva_range.may_block);
@@ -785,22 +784,21 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, return 0; } -void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start,
unsigned long end)
+void kvm_mmu_updating_end(struct kvm *kvm, gfn_t start, gfn_t end) { /* * This sequence increase will notify the kvm page fault that * the page that is going to be mapped in the spte could have * been freed. */
- kvm->mmu_notifier_seq++;
+ kvm->mmu_updating_seq++;
smp_wmb();
/*
* The above sequence increase must be visible before the
* below count decrease, which is ensured by the smp_wmb above
- * in conjunction with the smp_rmb in mmu_notifier_retry().
+ * in conjunction with the smp_rmb in mmu_updating_retry().
*/
- kvm->mmu_notifier_count--;
+ kvm->mmu_updating_count--;
} static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, @@ -812,7 +810,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, .end = range->end, .pte = __pte(0), .handler = (void *)kvm_null_fn,
- .on_lock = kvm_dec_notifier_count,
+ .on_lock = kvm_mmu_updating_end,
.on_unlock = (void *)kvm_null_fn, .flush_on_ret = false, .may_block = mmu_notifier_range_blockable(range),
@@ -833,7 +831,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, if (wake) rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
- BUG_ON(kvm->mmu_notifier_count < 0);
+ BUG_ON(kvm->mmu_updating_count < 0);
} static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn, diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index ab519f72f2cd..aa6d24966a76 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -112,27 +112,27 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s { /* * mn_active_invalidate_count acts for all intents and purposes
- * like mmu_notifier_count here; but the latter cannot be used
+ * like mmu_updating_count here; but the latter cannot be used
* here because the invalidation of caches in the mmu_notifier
- * event occurs _before_ mmu_notifier_count is elevated.
+ * event occurs _before_ mmu_updating_count is elevated.
* Note, it does not matter that mn_active_invalidate_count
* is not protected by gpc->lock. It is guaranteed to
* be elevated before the mmu_notifier acquires gpc->lock, and
- * isn't dropped until after mmu_notifier_seq is updated.
+ * isn't dropped until after mmu_updating_seq is updated.
*/
if (kvm->mn_active_invalidate_count) return true;
/*
* Ensure mn_active_invalidate_count is read before
- * mmu_notifier_seq. This pairs with the smp_wmb() in
+ * mmu_updating_seq. This pairs with the smp_wmb() in
* mmu_notifier_invalidate_range_end() to guarantee either the
* old (non-zero) value of mn_active_invalidate_count or the
- * new (incremented) value of mmu_notifier_seq is observed.
+ * new (incremented) value of mmu_updating_seq is observed.
*/
smp_rmb();
- return kvm->mmu_notifier_seq != mmu_seq;
+ return kvm->mmu_updating_seq != mmu_seq;
} static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc) @@ -155,7 +155,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc) gpc->valid = false; do {
- mmu_seq = kvm->mmu_notifier_seq;
+ mmu_seq = kvm->mmu_updating_seq;
smp_rmb();
write_unlock_irq(&gpc->lock); -- 2.25.1
On Tue, May 23, 2023, Kautuk Consul wrote:
On 2022-07-06 16:20:10, Chao Peng wrote:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index e9153b54e2a4..c262ebb168a7 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -765,10 +765,10 @@ struct kvm { #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) struct mmu_notifier mmu_notifier;
- unsigned long mmu_notifier_seq;
- long mmu_notifier_count;
- gfn_t mmu_notifier_range_start;
- gfn_t mmu_notifier_range_end;
- unsigned long mmu_updating_seq;
- long mmu_updating_count;
Can we convert mmu_updating_seq and mmu_updating_count to atomic_t ?
Heh, can we? Yes. Should we? No.
I see that not all accesses to these are under the kvm->mmu_lock spinlock.
Ya, working as intended. Ignoring gfn_to_pfn_cache for the moment, all accesses to mmu_invalidate_in_progress (was mmu_notifier_count / mmu_updating_count above) are done under mmu_lock. And for mmu_notifier_seq (mmu_updating_seq above), all writes and some reads are done under mmu_lock. The only reads that are done outside of mmu_lock are the initial snapshots of the sequence number.
gfn_to_pfn_cache uses a different locking scheme, the comments in mmu_notifier_retry_cache() do a good job explaining the ordering.
This will also remove the need for putting separate smp_wmb() and smp_rmb() memory barriers while accessing these structure members.
No, the memory barriers aren't there to provide any kind of atomicity. The barriers exist to ensure that stores and loads to/from the sequence and invalidate in-progress counts are ordered relative to the invalidation (stores to counts) and creation (loads) of SPTEs. Making the counts atomic changes nothing because atomic operations don't guarantee the necessary ordering.
E.g. when handling a page fault, KVM snapshots the sequence outside of mmu_lock _before_ touching any state that is involved in resolving the host pfn, e.g. primary MMU state (VMAs, host page tables, etc.). After the page fault task acquires mmu_lock, KVM checks that there are no in-progress invalidations and that the sequence count is the same. This ensures that if there is a concurrent page fault and invalidation event, the page fault task will either acquire mmu_lock and create SPTEs _before_ the invalidation is processed, or the page fault task will observe either an elevated mmu_invalidate_in_progress or a different sequence count, and thus retry the page fault, if the page fault task acquires mmu_lock after the invalidation event.
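To make the ordering concrete, here is a heavily simplified sketch (not the actual KVM code) of the two sides of that protocol, using the pre-rename mmu_notifier_* names; the locking details, arch hooks and gfn range checks are all elided:

/* Invalidation side, driven by the mmu_notifier start/end callbacks. */
static void invalidate_begin(struct kvm *kvm)
{
	write_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_count++;	/* page faults must retry from now on */
	/* ... zap SPTEs covering the invalidated range ... */
	write_unlock(&kvm->mmu_lock);
}

static void invalidate_end(struct kvm *kvm)
{
	write_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_seq++;	/* bump the sequence first */
	smp_wmb();			/* pairs with the fault side's smp_rmb() */
	kvm->mmu_notifier_count--;
	write_unlock(&kvm->mmu_lock);
}

/* Page-fault side. */
static int fault_in_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	unsigned long mmu_seq;

	mmu_seq = vcpu->kvm->mmu_notifier_seq;	/* snapshot outside mmu_lock */
	smp_rmb();				/* before touching primary MMU state */

	/* ... resolve the host pfn, may sleep ... */

	write_lock(&vcpu->kvm->mmu_lock);
	if (vcpu->kvm->mmu_notifier_count ||
	    vcpu->kvm->mmu_notifier_seq != mmu_seq) {
		write_unlock(&vcpu->kvm->mmu_lock);
		return -EAGAIN;			/* concurrent invalidation, retry */
	}
	/* ... safe to install the SPTE ... */
	write_unlock(&vcpu->kvm->mmu_lock);
	return 0;
}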
On 2023-05-23 07:19:43, Sean Christopherson wrote:
On Tue, May 23, 2023, Kautuk Consul wrote:
On 2022-07-06 16:20:10, Chao Peng wrote:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index e9153b54e2a4..c262ebb168a7 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -765,10 +765,10 @@ struct kvm { #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) struct mmu_notifier mmu_notifier;
- unsigned long mmu_notifier_seq;
- long mmu_notifier_count;
- gfn_t mmu_notifier_range_start;
- gfn_t mmu_notifier_range_end;
- unsigned long mmu_updating_seq;
- long mmu_updating_count;
Can we convert mmu_updating_seq and mmu_updating_count to atomic_t ?
Heh, can we? Yes. Should we? No.
I see that not all accesses to these are under the kvm->mmu_lock spinlock.
Ya, working as intended. Ignoring gfn_to_pfn_cache for the moment, all accesses to mmu_invalidate_in_progress (was mmu_notifier_count / mmu_updating_count above) are done under mmu_lock. And for mmu_notifier_seq (mmu_updating_seq above), all writes and some reads are done under mmu_lock. The only reads that are done outside of mmu_lock are the initial snapshots of the sequence number.
gfn_to_pfn_cache uses a different locking scheme, the comments in mmu_notifier_retry_cache() do a good job explaining the ordering.
This will also remove the need for putting separate smp_wmb() and smp_rmb() memory barriers while accessing these structure members.
No, the memory barriers aren't there to provide any kind of atomicity. The barriers exist to ensure that stores and loads to/from the sequence and invalidate in-progress counts are ordered relative to the invalidation (stores to counts) and creation (loads) of SPTEs. Making the counts atomic changes nothing because atomic operations don't guarantee the necessary ordering.
I'm not saying that the memory barriers provide atomicity. My comment was based on the assumption that "all atomic operations are implicit memory barriers". If that assumption is true then we won't need the memory barriers here if we use atomic operations for protecting these 2 structure members.
E.g. when handling a page fault, KVM snapshots the sequence outside of mmu_lock _before_ touching any state that is involved in resolving the host pfn, e.g. primary MMU state (VMAs, host page tables, etc.). After the page fault task acquires mmu_lock, KVM checks that there are no in-progress invalidations and that the sequence count is the same. This ensures that if there is a concurrent page fault and invalidation event, the page fault task will either acquire mmu_lock and create SPTEs _before_ the invalidation is processed, or the page fault task will observe either an elevated mmu_invalidate_in_progress or a different sequence count, and thus retry the page fault, if the page fault task acquires mmu_lock after the invalidation event.
On Wed, May 24, 2023, Kautuk Consul wrote:
On 2023-05-23 07:19:43, Sean Christopherson wrote:
On Tue, May 23, 2023, Kautuk Consul wrote:
On 2022-07-06 16:20:10, Chao Peng wrote:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index e9153b54e2a4..c262ebb168a7 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -765,10 +765,10 @@ struct kvm { #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) struct mmu_notifier mmu_notifier;
- unsigned long mmu_notifier_seq;
- long mmu_notifier_count;
- gfn_t mmu_notifier_range_start;
- gfn_t mmu_notifier_range_end;
- unsigned long mmu_updating_seq;
- long mmu_updating_count;
Can we convert mmu_updating_seq and mmu_updating_count to atomic_t ?
Heh, can we? Yes. Should we? No.
I see that not all accesses to these are under the kvm->mmu_lock spinlock.
Ya, working as intended. Ignoring gfn_to_pfn_cache for the moment, all accesses to mmu_invalidate_in_progress (was mmu_notifier_count / mmu_updating_count above) are done under mmu_lock. And for mmu_notifier_seq (mmu_updating_seq above), all writes and some reads are done under mmu_lock. The only reads that are done outside of mmu_lock are the initial snapshots of the sequence number.
gfn_to_pfn_cache uses a different locking scheme, the comments in mmu_notifier_retry_cache() do a good job explaining the ordering.
This will also remove the need for putting separate smp_wmb() and smp_rmb() memory barriers while accessing these structure members.
No, the memory barriers aren't there to provide any kind of atomicity. The barriers exist to ensure that stores and loads to/from the sequence and invalidate in-progress counts are ordered relative to the invalidation (stores to counts) and creation (loads) of SPTEs. Making the counts atomic changes nothing because atomic operations don't guarantee the necessary ordering.
I'm not saying that the memory barriers provide atomicity. My comment was based on the assumption that "all atomic operations are implicit memory barriers". If that assumption is true then we won't need the memory barriers here if we use atomic operations for protecting these 2 structure members.
Atomics aren't memory barriers on all architectures, e.g. see the various definitions of smp_mb__after_atomic().
Even if atomic operations did provide barriers, using an atomic would be overkill and a net negative. On strongly ordered architectures like x86, memory barriers are just compiler barriers, whereas atomics may be more expensive. Of course, the only accesses outside of mmu_lock are reads, so on x86 that "atomic" access is just a READ_ONCE() load, but that's not the case for all architectures.
Anyways, the point is that atomics and memory barriers are different things that serve different purposes.
On Wed, May 24, 2023 at 01:16:03PM -0700, Sean Christopherson wrote:
Atomics aren't memory barriers on all architectures, e.g. see the various definitions of smp_mb__after_atomic().
Even if atomic operations did provide barriers, using an atomic would be overkill and a net negative. On strongly ordered architectures like x86, memory barriers are just compiler barriers, whereas atomics may be more expensive.
Not quite, smp_{r,w}mb() and smp_mb__{before,after}_atomic() are compiler barriers on the TSO archs, but smp_mb() very much isn't. TSO still allows stores to be delayed vs later loads (iow it doesn't pretend to hide the store buffer).
Of course, the only accesses outside of mmu_lock are reads, so on x86 that "atomic" access is just a READ_ONCE() load, but that's not the case for all architectures.
This is true on *all* archs. atomic_set() and atomic_read() are no more and no less than WRITE_ONCE() / READ_ONCE().
Anyways, the point is that atomics and memory barriers are different things that serve different purposes.
This is true; esp. on the weakly ordered architectures where atomics do not naturally imply any ordering.
On Wed, May 24, 2023, Peter Zijlstra wrote:
On Wed, May 24, 2023 at 01:16:03PM -0700, Sean Christopherson wrote:
Of course, the only accesses outside of mmu_lock are reads, so on x86 that "atomic" access is just a READ_ONCE() load, but that's not the case for all architectures.
This is true on *all* archs. atomic_set() and atomic_read() are no more and no less than WRITE_ONCE() / READ_ONCE().
Ah, I take it s390's handcoded assembly routines are just paranoid equivalents and not truly special? "l" and "st" do sound quite generic...
commit 7657e41a0bd16c9d8b3cefe8fd5d6ac3c25ae4bf Author: Heiko Carstens hca@linux.ibm.com Date: Thu Feb 17 13:13:58 2011 +0100
[S390] atomic: use inline asm
Use inline assemblies for atomic_read/set(). This way there shouldn't be any questions or subtle volatile semantics left.
static inline int __atomic_read(const atomic_t *v) { int c;
asm volatile( " l %0,%1\n" : "=d" (c) : "R" (v->counter)); return c; }
static inline void __atomic_set(atomic_t *v, int i) { asm volatile( " st %1,%0\n" : "=R" (v->counter) : "d" (i)); }
On Wed, May 24, 2023 at 02:39:50PM -0700, Sean Christopherson wrote:
On Wed, May 24, 2023, Peter Zijlstra wrote:
On Wed, May 24, 2023 at 01:16:03PM -0700, Sean Christopherson wrote:
Of course, the only accesses outside of mmu_lock are reads, so on x86 that "atomic" access is just a READ_ONCE() load, but that's not the case for all architectures.
This is true on *all* archs. atomic_set() and atomic_read() are no more and no less than WRITE_ONCE() / READ_ONCE().
Ah, I take it s390's handcoded assembly routines are just paranoid equivalents and not truly special? "l" and "st" do sound quite generic...
Yep, compiler *should* generate the same with READ_ONCE/WRITE_ONCE.
On 2023-05-24 22:33:36, Peter Zijlstra wrote:
On Wed, May 24, 2023 at 01:16:03PM -0700, Sean Christopherson wrote:
Atomics aren't memory barriers on all architectures, e.g. see the various definitions of smp_mb__after_atomic().
Even if atomic operations did provide barriers, using an atomic would be overkill and a net negative. On strongly ordered architectures like x86, memory barriers are just compiler barriers, whereas atomics may be more expensive.
Not quite, smp_{r,w}mb() and smp_mb__{before,after}_atomic() are compiler barriers on the TSO archs, but smp_mb() very much isn't. TSO still allows stores to be delayed vs later loads (iow it doesn't pretend to hide the store buffer).
Of course, the only accesses outside of mmu_lock are reads, so on x86 that "atomic" access is just a READ_ONCE() load, but that's not the case for all architectures.
This is true on *all* archs. atomic_set() and atomic_read() are no more and no less than WRITE_ONCE() / READ_ONCE().
Anyways, the point is that atomics and memory barriers are different things that serve different purposes.
This is true; esp. on the weakly ordered architectures where atomics do not naturally imply any ordering.
Thanks for the information, everyone.
On Wed, May 24, 2023 at 11:42:15AM +0530, Kautuk Consul wrote:
My comment was based on the assumption that "all atomic operations are implicit memory barriers". If that assumption is true then we won't need
It is not -- also see Documentation/atomic_t.txt.
Specifically atomic_read() doesn't imply any ordering on any architecture including the strongly ordered TSO-archs (like x86).
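For completeness, a tiny illustration (not from the series) of why converting the fields would not help: even with atomic_long_t, the explicit barriers would still be required, since atomic_read()/atomic_inc() by themselves imply no ordering:

static atomic_long_t example_seq;	/* stand-in for mmu_updating_seq */
static atomic_long_t example_count;	/* stand-in for mmu_updating_count */

static void example_invalidate_end(void)
{
	atomic_long_inc(&example_seq);		/* no implicit barrier */
	smp_wmb();				/* still required */
	atomic_long_dec(&example_count);
}

static bool example_retry(long snapshot)
{
	if (atomic_long_read(&example_count))	/* just a READ_ONCE() underneath */
		return true;
	smp_rmb();				/* still required */
	return atomic_long_read(&example_seq) != snapshot;
}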
Extend the memslot definition to provide guest private memory through a file descriptor (fd) instead of userspace_addr (hva). Such guest private memory (fd) may never be mapped into userspace, so no userspace_addr (hva) can be used. Instead, add two new fields (private_fd/private_offset), which together with the existing memory_size represent the private memory range. Such a memslot can still have the existing userspace_addr (hva). When in use, a single memslot can maintain both private memory through the private fd (private_fd/private_offset) and shared memory through the hva (userspace_addr). Whether the private or the shared part is effective for a guest GPA is maintained by other KVM code.
Since there is no userspace mapping for the private fd, KVM cannot rely on get_user_pages() to get the pfn. Instead, add a new memfile_notifier to the memslot and rely on it to get the pfn by calling the memory backing store's callbacks with the fd/offset.
This new extension is indicated by a new flag KVM_MEM_PRIVATE. At compile time, a new config HAVE_KVM_PRIVATE_MEM is added and right now it is selected on X86_64 for Intel TDX usage.
To keep the KVM code simple, internally we use a binary-compatible alias, struct kvm_user_mem_region, to handle both the normal and the '_ext' variants.
Co-developed-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com --- Documentation/virt/kvm/api.rst | 38 ++++++++++++++++---- arch/x86/kvm/Kconfig | 2 ++ arch/x86/kvm/x86.c | 2 +- include/linux/kvm_host.h | 13 +++++-- include/uapi/linux/kvm.h | 28 +++++++++++++++ virt/kvm/Kconfig | 3 ++ virt/kvm/kvm_main.c | 64 +++++++++++++++++++++++++++++----- 7 files changed, 132 insertions(+), 18 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index bafaeedd455c..4f27c973a952 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -1319,7 +1319,7 @@ yet and must be cleared on entry. :Capability: KVM_CAP_USER_MEMORY :Architectures: all :Type: vm ioctl -:Parameters: struct kvm_userspace_memory_region (in) +:Parameters: struct kvm_userspace_memory_region(_ext) (in) :Returns: 0 on success, -1 on error
:: @@ -1332,9 +1332,18 @@ yet and must be cleared on entry. __u64 userspace_addr; /* start of the userspace allocated memory */ };
+ struct kvm_userspace_memory_region_ext { + struct kvm_userspace_memory_region region; + __u64 private_offset; + __u32 private_fd; + __u32 pad1; + __u64 pad2[14]; +}; + /* for kvm_memory_region::flags */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) + #define KVM_MEM_PRIVATE (1UL << 2)
This ioctl allows the user to create, modify or delete a guest physical memory slot. Bits 0-15 of "slot" specify the slot id and this value @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr be identical. This allows large pages in the guest to be backed by large pages in the host.
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, -to make a new slot read-only. In this case, writes to this memory will be -posted to userspace as KVM_EXIT_MMIO exits. +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region +fields. It also includes additional fields for some specific features. See +below description of flags field for more information. It's recommended to use +kvm_userspace_memory_region_ext in new userspace code. + +The flags field supports below flags: + +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to + memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to use it. + +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to + make a new slot read-only. In this case, writes to this memory will be posted + to userspace as KVM_EXIT_MMIO exits. + +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by + a file descirptor(fd) and the content of the private memory is invisible to + userspace. In this case, userspace should use private_fd/private_offset in + kvm_userspace_memory_region_ext to instruct KVM to provide private memory to + guest. Userspace should guarantee not to map the same pfn indicated by + private_fd/private_offset to different gfns with multiple memslots. Failed to + do this may result undefined behavior.
When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory region are automatically reflected into the guest. For example, an diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index e3cbd7706136..1f160801e2a7 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -48,6 +48,8 @@ config KVM select SRCU select INTERVAL_TREE select HAVE_KVM_PM_NOTIFIER if PM + select HAVE_KVM_PRIVATE_MEM if X86_64 + select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM help Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 567d13405445..77d16b90045c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12154,7 +12154,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, }
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { - struct kvm_userspace_memory_region m; + struct kvm_user_mem_region m;
m.slot = id | (i << 16); m.flags = 0; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c262ebb168a7..1b203c8aa696 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -44,6 +44,7 @@
#include <asm/kvm_host.h> #include <linux/kvm_dirty_ring.h> +#include <linux/memfile_notifier.h>
#ifndef KVM_MAX_VCPU_IDS #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS @@ -576,8 +577,16 @@ struct kvm_memory_slot { u32 flags; short id; u16 as_id; + struct file *private_file; + loff_t private_offset; + struct memfile_notifier notifier; };
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot) +{ + return slot && (slot->flags & KVM_MEM_PRIVATE); +} + static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot) { return slot->flags & KVM_MEM_LOG_DIRTY_PAGES; @@ -1109,9 +1118,9 @@ enum kvm_mr_change { };
int kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem); + const struct kvm_user_mem_region *mem); int __kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem); + const struct kvm_user_mem_region *mem); void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen); int kvm_arch_prepare_memory_region(struct kvm *kvm, diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index a36e78710382..c467c69b7ad7 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region { __u64 userspace_addr; /* start of the userspace allocated memory */ };
+struct kvm_userspace_memory_region_ext { + struct kvm_userspace_memory_region region; + __u64 private_offset; + __u32 private_fd; + __u32 pad1; + __u64 pad2[14]; +}; + +#ifdef __KERNEL__ +/* + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access + * all fields from the top-level "extended" region. + */ +struct kvm_user_mem_region { + __u32 slot; + __u32 flags; + __u64 guest_phys_addr; + __u64 memory_size; + __u64 userspace_addr; + __u64 private_offset; + __u32 private_fd; + __u32 pad1; + __u64 pad2[14]; +}; +#endif + /* * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace, * other bits are reserved for kvm internal use which are defined in @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region { */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) +#define KVM_MEM_PRIVATE (1UL << 2)
/* for KVM_IRQ_LINE */ struct kvm_irq_level { diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index a8c5c9f06b3c..ccaff13cc5b8 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -72,3 +72,6 @@ config KVM_XFER_TO_GUEST_WORK
config HAVE_KVM_PM_NOTIFIER bool + +config HAVE_KVM_PRIVATE_MEM + bool diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 3ae4944b9f15..230c8ff9659c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1508,7 +1508,7 @@ static void kvm_replace_memslot(struct kvm *kvm, } }
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem) +static int check_memory_region_flags(const struct kvm_user_mem_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -1902,7 +1902,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id, * Must be called holding kvm->slots_lock for write. */ int __kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem) + const struct kvm_user_mem_region *mem) { struct kvm_memory_slot *old, *new; struct kvm_memslots *slots; @@ -2006,7 +2006,7 @@ int __kvm_set_memory_region(struct kvm *kvm, EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
int kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem) + const struct kvm_user_mem_region *mem) { int r;
@@ -2018,7 +2018,7 @@ int kvm_set_memory_region(struct kvm *kvm, EXPORT_SYMBOL_GPL(kvm_set_memory_region);
static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm, - struct kvm_userspace_memory_region *mem) + struct kvm_user_mem_region *mem) { if ((u16)mem->slot >= KVM_USER_MEM_SLOTS) return -EINVAL; @@ -4608,6 +4608,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm) return fd; }
+#define SANITY_CHECK_MEM_REGION_FIELD(field) \ +do { \ + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ + offsetof(struct kvm_userspace_memory_region, field)); \ + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ + sizeof_field(struct kvm_userspace_memory_region, field)); \ +} while (0) + +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \ +do { \ + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ + offsetof(struct kvm_userspace_memory_region_ext, field)); \ + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \ +} while (0) + +static void kvm_sanity_check_user_mem_region_alias(void) +{ + SANITY_CHECK_MEM_REGION_FIELD(slot); + SANITY_CHECK_MEM_REGION_FIELD(flags); + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr); + SANITY_CHECK_MEM_REGION_FIELD(memory_size); + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr); + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset); + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd); +} + static long kvm_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -4631,14 +4658,35 @@ static long kvm_vm_ioctl(struct file *filp, break; } case KVM_SET_USER_MEMORY_REGION: { - struct kvm_userspace_memory_region kvm_userspace_mem; + struct kvm_user_mem_region mem; + unsigned long size; + u32 flags; + + kvm_sanity_check_user_mem_region_alias(); + + memset(&mem, 0, sizeof(mem));
r = -EFAULT; - if (copy_from_user(&kvm_userspace_mem, argp, - sizeof(kvm_userspace_mem))) + + if (get_user(flags, + (u32 __user *)(argp + offsetof(typeof(mem), flags)))) + goto out; + + if (flags & KVM_MEM_PRIVATE) { + r = -EINVAL; + goto out; + } + + size = sizeof(struct kvm_userspace_memory_region); + + if (copy_from_user(&mem, argp, size)) + goto out; + + r = -EINVAL; + if ((flags ^ mem.flags) & KVM_MEM_PRIVATE) goto out;
- r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem); + r = kvm_vm_ioctl_set_memory_region(kvm, &mem); break; } case KVM_GET_DIRTY_LOG: {
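For reference, a hypothetical userspace usage of the extended memslot under this uAPI might look like the sketch below. It assumes headers with this series applied (struct kvm_userspace_memory_region_ext, KVM_MEM_PRIVATE), a private memfd created elsewhere (e.g. with the MFD_INACCESSIBLE flag introduced earlier in the series), and a later patch actually accepting KVM_MEM_PRIVATE in KVM_SET_USER_MEMORY_REGION (this patch still rejects it):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vm_fd comes from KVM_CREATE_VM, private_fd from the private memfd. */
static int set_private_slot(int vm_fd, int private_fd, __u64 gpa, __u64 size,
			    void *shared_hva)
{
	struct kvm_userspace_memory_region_ext ext = {
		.region = {
			.slot = 0,
			.flags = KVM_MEM_PRIVATE,
			.guest_phys_addr = gpa,
			.memory_size = size,
			/* shared part of the slot still comes via an hva */
			.userspace_addr = (__u64)(unsigned long)shared_hva,
		},
		.private_fd = private_fd,
		.private_offset = 0,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
}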
On Wed, Jul 06, 2022, Chao Peng wrote:
@@ -1332,9 +1332,18 @@ yet and must be cleared on entry. __u64 userspace_addr; /* start of the userspace allocated memory */ };
+ struct kvm_userspace_memory_region_ext {
+ struct kvm_userspace_memory_region region;
+ __u64 private_offset;
+ __u32 private_fd;
+ __u32 pad1;
+ __u64 pad2[14];
+};
/* for kvm_memory_region::flags */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1)
+ #define KVM_MEM_PRIVATE (1UL << 2)
Very belatedly following up on prior feedback...
| I think a flag is still needed, the problem is private_fd can be safely | accessed only when this flag is set, e.g. without this flag, we can't | copy_from_user these new fields since they don't exist for previous | kvm_userspace_memory_region callers.
I forgot about that aspect of things. We don't technically need a dedicated PRIVATE flag to handle that, but it does seem to be the least awful solution. We could either add a generic KVM_MEM_EXTENDED_REGION or an entirely new ioctl(), e.g. KVM_SET_USER_MEMORY_REGION2, but in both approaches there's a decent chance that we'll end up needing individual "this field is valid" flags anyway.
E.g. if KVM requires pad1 and pad2 to be zero to carve out future extensions, then we're right back here if some future extension needs to treat '0' as a legal input.
TL;DR: adding KVM_MEM_PRIVATE still seems like the best approach.
@@ -4631,14 +4658,35 @@ static long kvm_vm_ioctl(struct file *filp, break; } case KVM_SET_USER_MEMORY_REGION: {
struct kvm_userspace_memory_region kvm_userspace_mem;
struct kvm_user_mem_region mem;
unsigned long size;
u32 flags;
kvm_sanity_check_user_mem_region_alias();
memset(&mem, 0, sizeof(mem));
r = -EFAULT;
if (copy_from_user(&kvm_userspace_mem, argp,
sizeof(kvm_userspace_mem)))
if (get_user(flags,
(u32 __user *)(argp + offsetof(typeof(mem), flags))))
goto out;
Indentation is funky. It's hard to massage this into something short and readable. What about capturing the offset separately? E.g.
struct kvm_user_mem_region mem; unsigned int flags_offset = offsetof(typeof(mem), flags); unsigned long size; u32 flags;
kvm_sanity_check_user_mem_region_alias();
memset(&mem, 0, sizeof(mem));
r = -EFAULT; if (get_user(flags, (u32 __user *)(argp + flags_offset))) goto out;
But this can actually be punted until KVM_MEM_PRIVATE is fully supported. As of this patch, KVM doesn't read the extended size, so I believe the diff for this patch can simply be:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index da263c370d00..5194beb7b52f 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -4640,6 +4640,10 @@ static long kvm_vm_ioctl(struct file *filp, sizeof(kvm_userspace_mem))) goto out;
+ r = -EINVAL; + if (mem.flags & KVM_MEM_PRIVATE) + goto out; + r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem); break; }
On Fri, Jul 29, 2022 at 07:51:29PM +0000, Sean Christopherson wrote:
On Wed, Jul 06, 2022, Chao Peng wrote:
@@ -1332,9 +1332,18 @@ yet and must be cleared on entry. __u64 userspace_addr; /* start of the userspace allocated memory */ };
+ struct kvm_userspace_memory_region_ext {
+ struct kvm_userspace_memory_region region;
+ __u64 private_offset;
+ __u32 private_fd;
+ __u32 pad1;
+ __u64 pad2[14];
+};
/* for kvm_memory_region::flags */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1)
+ #define KVM_MEM_PRIVATE (1UL << 2)
Very belatedly following up on prior feedback...
| I think a flag is still needed, the problem is private_fd can be safely | accessed only when this flag is set, e.g. without this flag, we can't | copy_from_user these new fields since they don't exist for previous | kvm_userspace_memory_region callers.
I forgot about that aspect of things. We don't technically need a dedicated PRIVATE flag to handle that, but it does seem to be the least awful solution. We could either add a generic KVM_MEM_EXTENDED_REGION or an entirely new ioctl(), e.g. KVM_SET_USER_MEMORY_REGION2, but in both approaches there's a decent chance that we'll end up needing individual "this field is valid" flags anyway.
E.g. if KVM requires pad1 and pad2 to be zero to carve out future extensions, then we're right back here if some future extension needs to treat '0' as a legal input.
I have followed that practice (always rejecting non-zero 'pad' values when introducing new user APIs) in other projects previously, but I rarely see that in KVM.
TL;DR: adding KVM_MEM_PRIVATE still seems like the best approach.
@@ -4631,14 +4658,35 @@ static long kvm_vm_ioctl(struct file *filp, break; } case KVM_SET_USER_MEMORY_REGION: {
struct kvm_userspace_memory_region kvm_userspace_mem;
struct kvm_user_mem_region mem;
unsigned long size;
u32 flags;
kvm_sanity_check_user_mem_region_alias();
memset(&mem, 0, sizeof(mem));
r = -EFAULT;
if (copy_from_user(&kvm_userspace_mem, argp,
sizeof(kvm_userspace_mem)))
if (get_user(flags,
(u32 __user *)(argp + offsetof(typeof(mem), flags))))
goto out;
Indentation is funky. It's hard to massage this into something short and readable. What about capturing the offset separately? E.g.
struct kvm_user_mem_region mem; unsigned int flags_offset = offsetof(typeof(mem), flags); unsigned long size; u32 flags; kvm_sanity_check_user_mem_region_alias(); memset(&mem, 0, sizeof(mem)); r = -EFAULT; if (get_user(flags, (u32 __user *)(argp + flags_offset))) goto out;
But this can actually be punted until KVM_MEM_PRIVATE is fully supported. As of this patch, KVM doesn't read the extended size, so I believe the diff for this patch can simply be:
Looks good to me, Thanks.
Chao
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index da263c370d00..5194beb7b52f 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -4640,6 +4640,10 @@ static long kvm_vm_ioctl(struct file *filp, sizeof(kvm_userspace_mem))) goto out;
r = -EINVAL;
if (mem.flags & KVM_MEM_PRIVATE)
goto out;
r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem); break; }
On Wed, Aug 03, 2022, Chao Peng wrote:
On Fri, Jul 29, 2022 at 07:51:29PM +0000, Sean Christopherson wrote:
On Wed, Jul 06, 2022, Chao Peng wrote:
@@ -1332,9 +1332,18 @@ yet and must be cleared on entry. __u64 userspace_addr; /* start of the userspace allocated memory */ };
+ struct kvm_userspace_memory_region_ext {
+ struct kvm_userspace_memory_region region;
+ __u64 private_offset;
+ __u32 private_fd;
+ __u32 pad1;
+ __u64 pad2[14];
+};
/* for kvm_memory_region::flags */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1)
+ #define KVM_MEM_PRIVATE (1UL << 2)
Very belatedly following up on prior feedback...
| I think a flag is still needed, the problem is private_fd can be safely | accessed only when this flag is set, e.g. without this flag, we can't | copy_from_user these new fields since they don't exist for previous | kvm_userspace_memory_region callers.
I forgot about that aspect of things. We don't technically need a dedicated PRIVATE flag to handle that, but it does seem to be the least awful solution. We could either add a generic KVM_MEM_EXTENDED_REGION or an entirely new ioctl(), e.g. KVM_SET_USER_MEMORY_REGION2, but in both approaches there's a decent chance that we'll end up needing individual "this field is valid" flags anyway.
E.g. if KVM requires pad1 and pad2 to be zero to carve out future extensions, then we're right back here if some future extension needs to treat '0' as a legal input.
I have followed that practice (always rejecting non-zero 'pad' values when introducing new user APIs) in other projects previously, but I rarely see that in KVM.
Ya, KVM often uses flags to indicate the validity of a field specifically so that KVM doesn't misinterpret a '0' from an older userspace as an intended value.
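As an illustration of that convention, once KVM_MEM_PRIVATE is fully supported the flag can gate how much of the user struct is copied. The helper below is only a sketch (kvm_user_mem_region and KVM_MEM_PRIVATE are from this series, the function itself is made up):

static int copy_mem_region_from_user(struct kvm_user_mem_region *mem,
				     void __user *argp)
{
	unsigned long size;
	u32 flags;

	memset(mem, 0, sizeof(*mem));

	/* Read the flags first; they tell us how large the user struct is. */
	if (get_user(flags, (u32 __user *)(argp +
				offsetof(struct kvm_user_mem_region, flags))))
		return -EFAULT;

	if (flags & KVM_MEM_PRIVATE)
		size = sizeof(struct kvm_userspace_memory_region_ext);
	else
		size = sizeof(struct kvm_userspace_memory_region);

	if (copy_from_user(mem, argp, size))
		return -EFAULT;

	/* The copied flags must agree with the value used to pick the size. */
	if ((flags ^ mem->flags) & KVM_MEM_PRIVATE)
		return -EINVAL;

	return 0;
}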
This new KVM exit allows userspace to handle memory-related errors. It indicates that an error happened in KVM at guest memory range [gpa, gpa+size). The flags field includes additional information for userspace to handle the error. Currently bit 0 is defined as 'private memory', where '1' indicates the error happened due to a private memory access and '0' indicates it happened due to a shared memory access.
After private memory is enabled, this new exit will be used for KVM to exit to userspace for shared memory <-> private memory conversion in memory encryption usage.
In such usage, typically there are two kinds of memory conversion:
- explicit conversion: happens when the guest explicitly calls into KVM to map a range (as private or shared); KVM then exits to userspace to do the map/unmap operations.
- implicit conversion: happens in the KVM page fault handler.
  * If the fault is due to a private memory access, it causes a userspace exit for a shared->private conversion when the page is recognized as shared by KVM.
  * If the fault is due to a shared memory access, it causes a userspace exit for a private->shared conversion when the page is recognized as private by KVM.
Suggested-by: Sean Christopherson seanjc@google.com Co-developed-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com --- Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++ include/uapi/linux/kvm.h | 9 +++++++++ 2 files changed, 31 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 4f27c973a952..5ecfc7fbe0ee 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6583,6 +6583,28 @@ array field represents return values. The userspace should update the return values of SBI call before resuming the VCPU. For more details on RISC-V SBI spec refer, https://github.com/riscv/riscv-sbi-doc.
+:: + + /* KVM_EXIT_MEMORY_FAULT */ + struct { + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0) + __u32 flags; + __u32 padding; + __u64 gpa; + __u64 size; + } memory; +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has +encountered a memory error which is not handled by KVM kernel module and +userspace may choose to handle it. The 'flags' field indicates the memory +properties of the exit. + + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by + private memory access when the bit is set otherwise the memory error is + caused by shared memory access when the bit is clear. + +'gpa' and 'size' indicate the memory range the error occurs at. The userspace +may handle the error and return to KVM to retry the previous memory access. + ::
/* KVM_EXIT_NOTIFY */ diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index c467c69b7ad7..83c278f284dd 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -299,6 +299,7 @@ struct kvm_xen_exit { #define KVM_EXIT_XEN 34 #define KVM_EXIT_RISCV_SBI 35 #define KVM_EXIT_NOTIFY 36 +#define KVM_EXIT_MEMORY_FAULT 37
/* For KVM_EXIT_INTERNAL_ERROR */ /* Emulate instruction failed. */ @@ -530,6 +531,14 @@ struct kvm_run { #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0) __u32 flags; } notify; + /* KVM_EXIT_MEMORY_FAULT */ + struct { +#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0) + __u32 flags; + __u32 padding; + __u64 gpa; + __u64 size; + } memory; /* Fix the size of the union. */ char padding[256]; };
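A hypothetical VMM run-loop fragment consuming this exit could look like the sketch below; convert_to_private()/convert_to_shared() are made-up placeholders for whatever conversion mechanism the VMM implements (e.g. punching a hole in one backend and populating the other) and are not part of this series:

#include <linux/kvm.h>

/* Hypothetical helpers provided by the VMM, not by this series. */
static int convert_to_private(__u64 gpa, __u64 size);
static int convert_to_shared(__u64 gpa, __u64 size);

static int handle_memory_fault_exit(struct kvm_run *run)
{
	if (run->exit_reason != KVM_EXIT_MEMORY_FAULT)
		return -1;	/* not handled here */

	if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
		/* guest accessed the range as private: shared -> private */
		return convert_to_private(run->memory.gpa, run->memory.size);

	/* guest accessed the range as shared: private -> shared */
	return convert_to_shared(run->memory.gpa, run->memory.size);
}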
If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the guest private memory regions through the KVM_MEMORY_ENCRYPT_{UN,}REG_REGION ioctls. The patch reuses the existing SEV ioctls but differs in that the address in the region is a gpa for private memory, whereas in the SEV case it is an hva.
The private memory regions are stored in an xarray in KVM for memory efficiency in normal usage, and zapping existing memory mappings is also a side effect of these two ioctls.
Signed-off-by: Chao Peng chao.p.peng@linux.intel.com --- Documentation/virt/kvm/api.rst | 17 +++++++--- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/Kconfig | 1 + arch/x86/kvm/mmu.h | 2 -- include/linux/kvm_host.h | 8 +++++ virt/kvm/kvm_main.c | 57 +++++++++++++++++++++++++++++++++ 6 files changed, 80 insertions(+), 6 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 5ecfc7fbe0ee..dfb4caecab73 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -4715,10 +4715,19 @@ Documentation/virt/kvm/amd-memory-encryption.rst. This ioctl can be used to register a guest memory region which may contain encrypted data (e.g. guest RAM, SMRAM etc).
-It is used in the SEV-enabled guest. When encryption is enabled, a guest -memory region may contain encrypted data. The SEV memory encryption -engine uses a tweak such that two identical plaintext pages, each at -different locations will have differing ciphertexts. So swapping or +Currently this ioctl supports registering memory regions for two usages: +private memory and SEV-encrypted memory. + +When private memory is enabled, this ioctl is used to register guest private +memory region and the addr/size of kvm_enc_region represents guest physical +address (GPA). In this usage, this ioctl zaps the existing guest memory +mappings in KVM that fallen into the region. + +When SEV-encrypted memory is enabled, this ioctl is used to register guest +memory region which may contain encrypted data for a SEV-enabled guest. The +addr/size of kvm_enc_region represents userspace address (HVA). The SEV +memory encryption engine uses a tweak such that two identical plaintext pages, +each at different locations will have differing ciphertexts. So swapping or moving ciphertext of those pages will not result in plaintext being swapped. So relocating (or migrating) physical backing pages for the SEV guest will require some additional steps. diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index dae190e19fce..92120e3a224e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -37,6 +37,7 @@ #include <asm/hyperv-tlfs.h>
#define __KVM_HAVE_ARCH_VCPU_DEBUGFS +#define __KVM_HAVE_ZAP_GFN_RANGE
#define KVM_MAX_VCPUS 1024
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 1f160801e2a7..05861b9656a4 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -50,6 +50,7 @@ config KVM select HAVE_KVM_PM_NOTIFIER if PM select HAVE_KVM_PRIVATE_MEM if X86_64 select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM + select XARRAY_MULTI if HAVE_KVM_PRIVATE_MEM help Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index a99acec925eb..428cd2e88cbd 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -209,8 +209,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, return -(u32)fault & errcode; }
-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end); - int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
int kvm_mmu_post_init_vm(struct kvm *kvm); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 1b203c8aa696..da33f8828456 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -260,6 +260,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); #endif
+#ifdef __KVM_HAVE_ZAP_GFN_RANGE +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end); +#endif + enum { OUTSIDE_GUEST_MODE, IN_GUEST_MODE, @@ -795,6 +799,9 @@ struct kvm { struct notifier_block pm_notifier; #endif char stats_id[KVM_STATS_NAME_SIZE]; +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM + struct xarray mem_attr_array; +#endif };
#define kvm_err(fmt, ...) \ @@ -1459,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); int kvm_arch_post_init_vm(struct kvm *kvm); void kvm_arch_pre_destroy_vm(struct kvm *kvm); int kvm_arch_create_vm_debugfs(struct kvm *kvm); +bool kvm_arch_private_mem_supported(struct kvm *kvm);
#ifndef __KVM_HAVE_ARCH_VM_ALLOC /* diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 230c8ff9659c..bb714c2a4b06 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM +#define KVM_MEM_ATTR_PRIVATE 0x0001 +static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl, + struct kvm_enc_region *region) +{ + unsigned long start, end; + void *entry; + int r; + + if (region->size == 0 || region->addr + region->size < region->addr) + return -EINVAL; + if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1)) + return -EINVAL; + + start = region->addr >> PAGE_SHIFT; + end = (region->addr + region->size - 1) >> PAGE_SHIFT; + + entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ? + xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL; + + r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end, + entry, GFP_KERNEL_ACCOUNT)); + + kvm_zap_gfn_range(kvm, start, end + 1); + + return r; +} +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */ + #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER static int kvm_pm_notifier_call(struct notifier_block *bl, unsigned long state, @@ -1138,6 +1167,9 @@ static struct kvm *kvm_create_vm(unsigned long type) spin_lock_init(&kvm->mn_invalidate_lock); rcuwait_init(&kvm->mn_memslots_update_rcuwait); xa_init(&kvm->vcpu_array); +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM + xa_init(&kvm->mem_attr_array); +#endif
INIT_LIST_HEAD(&kvm->gpc_list); spin_lock_init(&kvm->gpc_lock); @@ -1305,6 +1337,9 @@ static void kvm_destroy_vm(struct kvm *kvm) kvm_free_memslots(kvm, &kvm->__memslots[i][0]); kvm_free_memslots(kvm, &kvm->__memslots[i][1]); } +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM + xa_destroy(&kvm->mem_attr_array); +#endif cleanup_srcu_struct(&kvm->irq_srcu); cleanup_srcu_struct(&kvm->srcu); kvm_arch_free_vm(kvm); @@ -1508,6 +1543,11 @@ static void kvm_replace_memslot(struct kvm *kvm, } }
+bool __weak kvm_arch_private_mem_supported(struct kvm *kvm) +{ + return false; +} + static int check_memory_region_flags(const struct kvm_user_mem_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; @@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_set_memory_region(kvm, &mem); break; } +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM + case KVM_MEMORY_ENCRYPT_REG_REGION: + case KVM_MEMORY_ENCRYPT_UNREG_REGION: { + struct kvm_enc_region region; + + if (!kvm_arch_private_mem_supported(kvm)) + goto arch_vm_ioctl; + + r = -EFAULT; + if (copy_from_user(®ion, argp, sizeof(region))) + goto out; + + r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, ®ion); + break; + } +#endif case KVM_GET_DIRTY_LOG: { struct kvm_dirty_log log;
@@ -4842,6 +4898,7 @@ static long kvm_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_get_stats_fd(kvm); break; default: +arch_vm_ioctl: r = kvm_arch_vm_ioctl(filp, ioctl, arg); } out:
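Not part of this patch, but for illustration: later code could query the xarray populated by kvm_vm_ioctl_set_encrypted_region() roughly as below to decide whether a given gfn is currently private (the helper name is made up):

static bool kvm_gfn_is_private(struct kvm *kvm, gfn_t gfn)
{
	void *entry = xa_load(&kvm->mem_attr_array, gfn);

	return xa_is_value(entry) &&
	       xa_to_value(entry) == KVM_MEM_ATTR_PRIVATE;
}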
Hi Chao,
Some comments below:
If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the guest private memory regions through the KVM_MEMORY_ENCRYPT_{UN,}REG_REGION ioctls. The patch reuses the existing SEV ioctls but differs in that the address in the region is a gpa for private memory, whereas in the SEV case it is an hva.
The private memory regions are stored in an xarray in KVM for memory efficiency in normal usage, and zapping existing memory mappings is also a side effect of these two ioctls.
Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
Documentation/virt/kvm/api.rst | 17 +++++++--- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/Kconfig | 1 + arch/x86/kvm/mmu.h | 2 -- include/linux/kvm_host.h | 8 +++++ virt/kvm/kvm_main.c | 57 +++++++++++++++++++++++++++++++++ 6 files changed, 80 insertions(+), 6 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 5ecfc7fbe0ee..dfb4caecab73 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -4715,10 +4715,19 @@ Documentation/virt/kvm/amd-memory-encryption.rst. This ioctl can be used to register a guest memory region which may contain encrypted data (e.g. guest RAM, SMRAM etc). -It is used in the SEV-enabled guest. When encryption is enabled, a guest -memory region may contain encrypted data. The SEV memory encryption -engine uses a tweak such that two identical plaintext pages, each at -different locations will have differing ciphertexts. So swapping or +Currently this ioctl supports registering memory regions for two usages: +private memory and SEV-encrypted memory.
+When private memory is enabled, this ioctl is used to register guest private +memory region and the addr/size of kvm_enc_region represents guest physical +address (GPA). In this usage, this ioctl zaps the existing guest memory +mappings in KVM that fallen into the region.
+When SEV-encrypted memory is enabled, this ioctl is used to register guest +memory region which may contain encrypted data for a SEV-enabled guest. The +addr/size of kvm_enc_region represents userspace address (HVA). The SEV +memory encryption engine uses a tweak such that two identical plaintext pages, +each at different locations will have differing ciphertexts. So swapping or moving ciphertext of those pages will not result in plaintext being swapped. So relocating (or migrating) physical backing pages for the SEV guest will require some additional steps. diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index dae190e19fce..92120e3a224e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -37,6 +37,7 @@ #include <asm/hyperv-tlfs.h> #define __KVM_HAVE_ARCH_VCPU_DEBUGFS +#define __KVM_HAVE_ZAP_GFN_RANGE #define KVM_MAX_VCPUS 1024 diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 1f160801e2a7..05861b9656a4 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -50,6 +50,7 @@ config KVM select HAVE_KVM_PM_NOTIFIER if PM select HAVE_KVM_PRIVATE_MEM if X86_64 select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM
+ select XARRAY_MULTI if HAVE_KVM_PRIVATE_MEM
help Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index a99acec925eb..428cd2e88cbd 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -209,8 +209,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, return -(u32)fault & errcode; } -void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
- int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
int kvm_mmu_post_init_vm(struct kvm *kvm); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 1b203c8aa696..da33f8828456 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -260,6 +260,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); #endif +#ifdef __KVM_HAVE_ZAP_GFN_RANGE +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end); +#endif
- enum { OUTSIDE_GUEST_MODE, IN_GUEST_MODE,
@@ -795,6 +799,9 @@ struct kvm { struct notifier_block pm_notifier; #endif char stats_id[KVM_STATS_NAME_SIZE]; +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+ struct xarray mem_attr_array;
+#endif }; #define kvm_err(fmt, ...) \ @@ -1459,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); int kvm_arch_post_init_vm(struct kvm *kvm); void kvm_arch_pre_destroy_vm(struct kvm *kvm); int kvm_arch_create_vm_debugfs(struct kvm *kvm); +bool kvm_arch_private_mem_supported(struct kvm *kvm); #ifndef __KVM_HAVE_ARCH_VM_ALLOC /* diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 230c8ff9659c..bb714c2a4b06 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM +#define KVM_MEM_ATTR_PRIVATE 0x0001 +static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
struct kvm_enc_region *region)
+{
+ unsigned long start, end;
+ void *entry;
+ int r;
+ if (region->size == 0 || region->addr + region->size < region->addr)
+ return -EINVAL;
+ if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
+ return -EINVAL;
+ start = region->addr >> PAGE_SHIFT;
+ end = (region->addr + region->size - 1) >> PAGE_SHIFT;
+ entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
+ xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
+ r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
+ entry, GFP_KERNEL_ACCOUNT));
+ kvm_zap_gfn_range(kvm, start, end + 1);
+ return r;
+} +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
- #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER static int kvm_pm_notifier_call(struct notifier_block *bl, unsigned long state,
@@ -1138,6 +1167,9 @@ static struct kvm *kvm_create_vm(unsigned long type) spin_lock_init(&kvm->mn_invalidate_lock); rcuwait_init(&kvm->mn_memslots_update_rcuwait); xa_init(&kvm->vcpu_array); +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+ xa_init(&kvm->mem_attr_array);
+#endif INIT_LIST_HEAD(&kvm->gpc_list); spin_lock_init(&kvm->gpc_lock); @@ -1305,6 +1337,9 @@ static void kvm_destroy_vm(struct kvm *kvm) kvm_free_memslots(kvm, &kvm->__memslots[i][0]); kvm_free_memslots(kvm, &kvm->__memslots[i][1]); } +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
- xa_destroy(&kvm->mem_attr_array);
+#endif cleanup_srcu_struct(&kvm->irq_srcu); cleanup_srcu_struct(&kvm->srcu); kvm_arch_free_vm(kvm); @@ -1508,6 +1543,11 @@ static void kvm_replace_memslot(struct kvm *kvm, } } +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm) +{
- return false;
+}
Does this function have to be overridden by SEV and TDX to support the private regions?
+
 static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	case KVM_MEMORY_ENCRYPT_REG_REGION:
+	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
+		struct kvm_enc_region region;
+
+		if (!kvm_arch_private_mem_supported(kvm))
+			goto arch_vm_ioctl;
+
+		r = -EFAULT;
+		if (copy_from_user(&region, argp, sizeof(region)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
This is to store private region metadata, not only the encrypted region?
Also, it seems the same ioctl can be used to put other regions (e.g. firmware, later maybe a DAX backend etc.) into private memory?
+		break;
+	}
+#endif
 	case KVM_GET_DIRTY_LOG: {
 		struct kvm_dirty_log log;

@@ -4842,6 +4898,7 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_get_stats_fd(kvm);
 		break;
 	default:
+arch_vm_ioctl:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
 out:
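To make the GPA-based usage concrete, a minimal userspace sketch of marking a range private through this reused ioctl could look like the following (vm_fd and the example values are illustrative assumptions, not something taken from the patch or QEMU):

#include <err.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Mark [gpa, gpa + size) as guest-private on an existing VM fd. */
static void set_range_private(int vm_fd, __u64 gpa, __u64 size)
{
	struct kvm_enc_region region = {
		.addr = gpa,	/* a GPA in this usage, not an HVA as for SEV */
		.size = size,	/* must be page aligned, like the address */
	};

	if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region))
		err(1, "KVM_MEMORY_ENCRYPT_REG_REGION");
}

Unregistering the same range with KVM_MEMORY_ENCRYPT_UNREG_REGION flips it back to shared.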
On Tue, Jul 19, 2022 at 10:00:23AM +0200, Gupta, Pankaj wrote:
...
+bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
+{
+	return false;
+}
Does this function have to be overridden by SEV and TDX to support the private regions?
Yes it should be overridden by architectures which want to support it.
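For illustration, a minimal sketch of such an override; the has_private_mem field is a hypothetical per-VM flag, and the real check depends on whatever the TDX/SEV enabling series provides:

bool kvm_arch_private_mem_supported(struct kvm *kvm)
{
	/* Hypothetical flag: only VM types backed by private memory opt in. */
	return kvm->arch.has_private_mem;
}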
+
 static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	case KVM_MEMORY_ENCRYPT_REG_REGION:
+	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
+		struct kvm_enc_region region;
+
+		if (!kvm_arch_private_mem_supported(kvm))
+			goto arch_vm_ioctl;
+
+		r = -EFAULT;
+		if (copy_from_user(&region, argp, sizeof(region)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
This is to store private region metadata, not only the encrypted region?
Correct.
Also, it seems the same ioctl can be used to put other regions (e.g. firmware, later maybe a DAX backend etc.) into private memory?
Possibly. It depends on what exactly the semantics is. If we just want to set those regions as private, the current code already supports that.
Chao
+		break;
+	}
+#endif
 	case KVM_GET_DIRTY_LOG: {
 		struct kvm_dirty_log log;

@@ -4842,6 +4898,7 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_get_stats_fd(kvm);
 		break;
 	default:
+arch_vm_ioctl:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
 out:
+bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
+{
+	return false;
+}
Does this function have to be overridden by SEV and TDX to support the private regions?
Yes it should be overridden by architectures which want to support it.
o.k
+
 static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	case KVM_MEMORY_ENCRYPT_REG_REGION:
+	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
+		struct kvm_enc_region region;
+
+		if (!kvm_arch_private_mem_supported(kvm))
+			goto arch_vm_ioctl;
+
+		r = -EFAULT;
+		if (copy_from_user(&region, argp, sizeof(region)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
This is to store private region metadata, not only the encrypted region?
Correct.
Sorry for not being clear; I was suggesting a name change of this function from "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region".
Also, it seems the same ioctl can be used to put other regions (e.g. firmware, later maybe a DAX backend etc.) into private memory?
Possibly. It depends on what exactly the semantics is. If we just want to set those regions as private, the current code already supports that.
Agree. Sure!
Thanks, Pankaj
On Tue, Jul 19, 2022 at 04:23:52PM +0200, Gupta, Pankaj wrote:
+bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
+{
+	return false;
+}
Does this function have to be overridden by SEV and TDX to support the private regions?
Yes it should be overridden by architectures which want to support it.
o.k
+
 static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	case KVM_MEMORY_ENCRYPT_REG_REGION:
+	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
+		struct kvm_enc_region region;
+
+		if (!kvm_arch_private_mem_supported(kvm))
+			goto arch_vm_ioctl;
+
+		r = -EFAULT;
+		if (copy_from_user(&region, argp, sizeof(region)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
This is to store private region metadata, not only the encrypted region?
Correct.
Sorry for not being clear; I was suggesting a name change of this function from "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region".
Though I don't have a strong reason to change it, I'm fine with this, and this name matches the above kvm_arch_private_mem_supported perfectly.
Thanks, Chao
Also, it seems the same ioctl can be used to put other regions (e.g. firmware, later maybe a DAX backend etc.) into private memory?
Possibly. It depends on what exactly the semantics is. If we just want to set those regions as private, the current code already supports that.
Agree. Sure!
Thanks, Pankaj
+bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
+{
+	return false;
+}
Does this function have to be overridden by SEV and TDX to support the private regions?
Yes it should be overridden by architectures which want to support it.
o.k
+
 static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
@@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
 		break;
 	}
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	case KVM_MEMORY_ENCRYPT_REG_REGION:
+	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
+		struct kvm_enc_region region;
+
+		if (!kvm_arch_private_mem_supported(kvm))
+			goto arch_vm_ioctl;
+
+		r = -EFAULT;
+		if (copy_from_user(&region, argp, sizeof(region)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
This is to store private region metadata, not only the encrypted region?
Correct.
Sorry for not being clear; I was suggesting a name change of this function from "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region".
Though I don't have a strong reason to change it, I'm fine with this and
Yes, no strong reason, just thought "kvm_vm_ioctl_set_private_region" would depict the actual functionality :)
this name matches the above kvm_arch_private_mem_supported perfectly.
BTW, I could not understand this: how does "kvm_vm_ioctl_set_encrypted_region" match "kvm_arch_private_mem_supported"?
Thanks, Pankaj
On Wed, Jul 20, 2022, Gupta, Pankaj wrote:
+bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
Use kvm_arch_has_private_mem(), both because "has" makes it obvious this is checking a flag of sorts, and to align with other helpers of this nature (and with CONFIG_HAVE_KVM_PRIVATE_MEM).
$ git grep kvm_arch | grep supported | wc -l
0
$ git grep kvm_arch | grep has | wc -l
26
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	case KVM_MEMORY_ENCRYPT_REG_REGION:
+	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
+		struct kvm_enc_region region;
+
+		if (!kvm_arch_private_mem_supported(kvm))
+			goto arch_vm_ioctl;
+
+		r = -EFAULT;
+		if (copy_from_user(&region, argp, sizeof(region)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
This is to store private region metadata, not only the encrypted region?
Correct.
Sorry for not being clear; I was suggesting a name change of this function from "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region".
Though I don't have a strong reason to change it, I'm fine with this and
Yes, no strong reason, just thought "kvm_vm_ioctl_set_private_region" would depict the actual functionality :)
this name matches the above kvm_arch_private_mem_supported perfectly.
BTW, I could not understand this: how does "kvm_vm_ioctl_set_encrypted_region" match "kvm_arch_private_mem_supported"?
Chao is saying that kvm_vm_ioctl_set_private_region() pairs nicely with kvm_arch_private_mem_supported(), not that the "encrypted" variant pairs nicely.
I also like using "private" instead of "encrypted", though we should probably find a different verb than "set", because calling "set_private" when making the region shared is confusing. I'm struggling to come up with a good alternative though.
kvm_vm_ioctl_set_memory_region() is already taken by KVM_SET_USER_MEMORY_REGION, and that also means that anything with "memory_region" in the name is bound to be confusing.
Hmm, and if we move away from "encrypted", it probably makes sense to pass in addr+size instead of a kvm_enc_region.
Maybe this?
static int kvm_vm_ioctl_set_or_clear_mem_private(struct kvm *kvm, gpa_t gpa, gpa_t size, bool set_private)
and then:
#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
	case KVM_MEMORY_ENCRYPT_REG_REGION:
	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
		struct kvm_enc_region region;

		if (!kvm_arch_private_mem_supported(kvm))
			goto arch_vm_ioctl;

		r = -EFAULT;
		if (copy_from_user(&region, argp, sizeof(region)))
			goto out;

		r = kvm_vm_ioctl_set_or_clear_mem_private(kvm, region.addr, region.size, set);
		break;
	}
#endif
I don't love it, so if someone has a better idea...
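For illustration only, the body of the proposed helper could reuse the xarray logic from the posted patch, keyed off the bool (a sketch under that assumption, not code posted in this thread):

static int kvm_vm_ioctl_set_or_clear_mem_private(struct kvm *kvm, gpa_t gpa,
						 gpa_t size, bool set_private)
{
	gfn_t start, end;
	void *entry;
	int r;

	if (size == 0 || gpa + size < gpa)
		return -EINVAL;
	if ((gpa | size) & (PAGE_SIZE - 1))
		return -EINVAL;

	start = gpa >> PAGE_SHIFT;
	end = (gpa + size - 1) >> PAGE_SHIFT;
	entry = set_private ? xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;

	/* Same store + zap sequence as the posted kvm_vm_ioctl_set_encrypted_region(). */
	r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
				  entry, GFP_KERNEL_ACCOUNT));
	kvm_zap_gfn_range(kvm, start, end + 1);
	return r;
}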
Use kvm_arch_has_private_mem(), both because "has" makes it obvious this is checking a flag of sorts, and to align with other helpers of this nature (and with CONFIG_HAVE_KVM_PRIVATE_MEM).
$ git grep kvm_arch | grep supported | wc -l
0
$ git grep kvm_arch | grep has | wc -l
26
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> +		struct kvm_enc_region region;
> +
> +		if (!kvm_arch_private_mem_supported(kvm))
> +			goto arch_vm_ioctl;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&region, argp, sizeof(region)))
> +			goto out;
> +
> +		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
This is to store private region metadata, not only the encrypted region?
Correct.
Sorry for not being clear; I was suggesting a name change of this function from "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region".
Though I don't have a strong reason to change it, I'm fine with this and
Yes, no strong reason, just thought "kvm_vm_ioctl_set_private_region" would depict the actual functionality :)
this name matches the above kvm_arch_private_mem_supported perfectly.
BTW, I could not understand this: how does "kvm_vm_ioctl_set_encrypted_region" match "kvm_arch_private_mem_supported"?
Chao is saying that kvm_vm_ioctl_set_private_region() pairs nicely with kvm_arch_private_mem_supported(), not that the "encrypted" variant pairs nicely.
I also like using "private" instead of "encrypted", though we should probably find a different verb than "set", because calling "set_private" when making the region shared is confusing. I'm struggling to come up with a good alternative though.
kvm_vm_ioctl_set_memory_region() is already taken by KVM_SET_USER_MEMORY_REGION, and that also means that anything with "memory_region" in the name is bound to be confusing.
Hmm, and if we move away from "encrypted", it probably makes sense to pass in addr+size instead of a kvm_enc_region.
Maybe this?
static int kvm_vm_ioctl_set_or_clear_mem_private(struct kvm *kvm, gpa_t gpa, gpa_t size, bool set_private)
and then:
#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
	case KVM_MEMORY_ENCRYPT_REG_REGION:
	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
		struct kvm_enc_region region;

		if (!kvm_arch_private_mem_supported(kvm))
			goto arch_vm_ioctl;

		r = -EFAULT;
		if (copy_from_user(&region, argp, sizeof(region)))
			goto out;

		r = kvm_vm_ioctl_set_or_clear_mem_private(kvm, region.addr, region.size, set);
		break;
	}
#endif
I don't love it, so if someone has a better idea...
Both the suggestions look good to me. They bring more clarity.
Thanks, Pankaj
On 7/21/22 00:21, Sean Christopherson wrote:
On Wed, Jul 20, 2022, Gupta, Pankaj wrote:
> +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
Use kvm_arch_has_private_mem(), both because "has" makes it obvious this is checking a flag of sorts, and to align with other helpers of this nature (and with CONFIG_HAVE_KVM_PRIVATE_MEM).
$ git grep kvm_arch | grep supported | wc -l
0
$ git grep kvm_arch | grep has | wc -l
26
> +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> +		struct kvm_enc_region region;
> +
> +		if (!kvm_arch_private_mem_supported(kvm))
> +			goto arch_vm_ioctl;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&region, argp, sizeof(region)))
> +			goto out;
> +
> +		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
This is to store private region metadata, not only the encrypted region?
Correct.
Sorry for not being clear; I was suggesting a name change of this function from "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region".
Though I don't have a strong reason to change it, I'm fine with this and
Yes, no strong reason, just thought "kvm_vm_ioctl_set_private_region" would depict the actual functionality :)
this name matches the above kvm_arch_private_mem_supported perfectly.
BTW, I could not understand this: how does "kvm_vm_ioctl_set_encrypted_region" match "kvm_arch_private_mem_supported"?
Chao is saying that kvm_vm_ioctl_set_private_region() pairs nicely with kvm_arch_private_mem_supported(), not that the "encrypted" variant pairs nicely.
I also like using "private" instead of "encrypted", though we should probably find a different verb than "set", because calling "set_private" when making the region shared is confusing. I'm struggling to come up with a good alternative though.
kvm_vm_ioctl_set_memory_region() is already taken by KVM_SET_USER_MEMORY_REGION, and that also means that anything with "memory_region" in the name is bound to be confusing.
Hmm, and if we move away from "encrypted", it probably makes sense to pass in addr+size instead of a kvm_enc_region.
Maybe this?
static int kvm_vm_ioctl_set_or_clear_mem_private(struct kvm *kvm, gpa_t gpa, gpa_t size, bool set_private)
and then:
#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
	case KVM_MEMORY_ENCRYPT_REG_REGION:
	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
		struct kvm_enc_region region;

		if (!kvm_arch_private_mem_supported(kvm))
			goto arch_vm_ioctl;

		r = -EFAULT;
		if (copy_from_user(&region, argp, sizeof(region)))
			goto out;

		r = kvm_vm_ioctl_set_or_clear_mem_private(kvm, region.addr, region.size, set);
		break;
	}
#endif
I don't love it, so if someone has a better idea...
Maybe you could tag it with cgs for all the confidential guest support related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; ... kvm_vm_ioctl_set_cgs_mem(, is_private)
On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
On 7/21/22 00:21, Sean Christopherson wrote:
On Wed, Jul 20, 2022, Gupta, Pankaj wrote:
> > +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm)
Use kvm_arch_has_private_mem(), both because "has" makes it obvious this is checking a flag of sorts, and to align with other helpers of this nature (and with CONFIG_HAVE_KVM_PRIVATE_MEM).
$ git grep kvm_arch | grep supported | wc -l
0
$ git grep kvm_arch | grep has | wc -l
26
Makes sense. kvm_arch_has_private_mem is actually better.
> > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > +	case KVM_MEMORY_ENCRYPT_REG_REGION:
> > +	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > +		struct kvm_enc_region region;
> > +
> > +		if (!kvm_arch_private_mem_supported(kvm))
> > +			goto arch_vm_ioctl;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&region, argp, sizeof(region)))
> > +			goto out;
> > +
> > +		r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
> This is to store private region metadata, not only the encrypted region?
Correct.
Sorry for not being clear; I was suggesting a name change of this function from "kvm_vm_ioctl_set_encrypted_region" to "kvm_vm_ioctl_set_private_region".
Though I don't have a strong reason to change it, I'm fine with this and
Yes, no strong reason, just thought "kvm_vm_ioctl_set_private_region" would depict the actual functionality :)
this name matches the above kvm_arch_private_mem_supported perfectly.
BTW, I could not understand this: how does "kvm_vm_ioctl_set_encrypted_region" match "kvm_arch_private_mem_supported"?
Chao is saying that kvm_vm_ioctl_set_private_region() pairs nicely with kvm_arch_private_mem_supported(), not that the "encrypted" variant pairs nicely.
I also like using "private" instead of "encrypted", though we should probably find a different verb than "set", because calling "set_private" when making the region shared is confusing. I'm struggling to come up with a good alternative though.
kvm_vm_ioctl_set_memory_region() is already taken by KVM_SET_USER_MEMORY_REGION, and that also means that anything with "memory_region" in the name is bound to be confusing.
Hmm, and if we move away from "encrypted", it probably makes sense to pass in addr+size instead of a kvm_enc_region.
This makes sense.
Maybe this?
static int kvm_vm_ioctl_set_or_clear_mem_private(struct kvm *kvm, gpa_t gpa, gpa_t size, bool set_private)
Currently this should work.
and then:
#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
	case KVM_MEMORY_ENCRYPT_REG_REGION:
	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
		bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
		struct kvm_enc_region region;

		if (!kvm_arch_private_mem_supported(kvm))
			goto arch_vm_ioctl;

		r = -EFAULT;
		if (copy_from_user(&region, argp, sizeof(region)))
			goto out;

		r = kvm_vm_ioctl_set_or_clear_mem_private(kvm, region.addr, region.size, set);
		break;
	}
#endif
I don't love it, so if someone has a better idea...
Maybe you could tag it with cgs for all the confidential guest support related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; ... kvm_vm_ioctl_set_cgs_mem(, is_private)
If we plan to widely use such abbr. through KVM (e.g. it's well known), I'm fine.
I actually use mem_attr in this patch: https://lkml.org/lkml/2022/7/20/610 But I also don't quite like it; it's so generic and says nothing.
But I do want a name that can cover future usages other than just private/shared (pKVM, for example, may have a third state).
Thanks, Chao
On Thu, Jul 21, 2022, Chao Peng wrote:
On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
On 7/21/22 00:21, Sean Christopherson wrote: Maybe you could tag it with cgs for all the confidential guest support related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; ... kvm_vm_ioctl_set_cgs_mem(, is_private)
If we plan to widely use such abbr. through KVM (e.g. it's well known), I'm fine.
I'd prefer to stay away from "confidential guest", and away from any VM-scoped name for that matter. User-unmappable memory has use cases beyond hiding guest state from the host, e.g. userspace could use inaccessible/unmappable memory to harden itself against unintentional access to guest memory.
I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610 But I also don't quite like it, it's so generic and sounds say nothing.
But I do want a name can cover future usages other than just private/shared (pKVM for example may have a third state).
I don't think there can be a third top-level state. Memory is either private to the guest or it's not. There can be sub-states, e.g. memory could be selectively shared or encrypted with a different key, in which case we'd need metadata to track that state.
Though that begs the question of whether or not private_fd is the correct terminology. E.g. if guest memory is backed by a memfd that can't be mapped by userspace (currently F_SEAL_INACCESSIBLE), but something else in the kernel plugs that memory into a device or another VM, then arguably that memory is shared, especially the multi-VM scenario.
For TDX and SNP "private vs. shared" is likely the correct terminology given the current specs, but for generic KVM it's probably better to align with whatever terminology is used for memfd. "inaccessible_fd" and "user_inaccessible_fd" are a bit odd since the fd itself is accessible.
What about "user_unmappable"? E.g.
F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE, KVM_HAS_USER_UNMAPPABLE_MEMORY, MEMFILE_F_USER_INACCESSIBLE, user_unmappable_fd, etc...
that gives us flexibility to map the memory from within the kernel, e.g. into other VMs or devices.
Hmm, and then keep your original "mem_attr_array" name? And probably
int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size, bool is_user_mappable)
Then the x86/mmu code for TDX/SNP private faults could be:
is_private = !kvm_is_gpa_user_mappable();
if (fault->is_private != is_private) {
or if we want to avoid mixing up "user_mappable" and "user_unmappable":
is_private = kvm_is_gpa_user_unmappable();
if (fault->is_private != is_private) {
though a helper that returns a negative (not mappable) feels kludgy. And I like kvm_is_gpa_user_mappable() because then when there's not "special" memory, it defaults to true, which is more intuitive IMO.
And then if the future needs more precision, e.g. user-unmappable memory isn't necessarily guest-exclusive, the uAPI names still work even though KVM internals will need to be reworked, but that's unavoidable. E.g. piggybacking KVM_MEMORY_ENCRYPT_(UN)REG_REGION doesn't allow for further differentiation, so we'd need to _extend_ the uAPI, but the _existing_ uAPI would still be sane.
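A rough sketch of what such a query helper could look like on top of the per-VM xarray; the name is the hypothetical one floated above, not an existing KVM API:

/* True unless the gfn is tracked as user-unmappable (i.e. private) memory. */
static inline bool kvm_is_gpa_user_mappable(struct kvm *kvm, gpa_t gpa)
{
	return !xa_load(&kvm->mem_attr_array, gpa_to_gfn(gpa));
}

With something along those lines, the helper naturally defaults to true when no "special" memory is tracked, matching the intuition above.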
On Thu, Jul 21, 2022 at 05:58:50PM +0000, Sean Christopherson wrote:
On Thu, Jul 21, 2022, Chao Peng wrote:
On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
On 7/21/22 00:21, Sean Christopherson wrote: Maybe you could tag it with cgs for all the confidential guest support related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; ... kvm_vm_ioctl_set_cgs_mem(, is_private)
If we plan to widely use such abbr. through KVM (e.g. it's well known), I'm fine.
I'd prefer to stay away from "confidential guest", and away from any VM-scoped name for that matter. User-unmappable memmory has use cases beyond hiding guest state from the host, e.g. userspace could use inaccessible/unmappable memory to harden itself against unintentional access to guest memory.
I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610 But I also don't quite like it, it's so generic and sounds say nothing.
But I do want a name can cover future usages other than just private/shared (pKVM for example may have a third state).
I don't think there can be a third top-level state. Memory is either private to the guest or it's not. There can be sub-states, e.g. memory could be selectively shared or encrypted with a different key, in which case we'd need metadata to track that state.
Though that begs the question of whether or not private_fd is the correct terminology. E.g. if guest memory is backed by a memfd that can't be mapped by userspace (currently F_SEAL_INACCESSIBLE), but something else in the kernel plugs that memory into a device or another VM, then arguably that memory is shared, especially the multi-VM scenario.
For TDX and SNP "private vs. shared" is likely the correct terminology given the current specs, but for generic KVM it's probably better to align with whatever terminology is used for memfd. "inaccessible_fd" and "user_inaccessible_fd" are a bit odd since the fd itself is accesible.
What about "user_unmappable"? E.g.
F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE, KVM_HAS_USER_UNMAPPABLE_MEMORY, MEMFILE_F_USER_INACCESSIBLE, user_unmappable_fd, etc...
For KVM I also think user_unmappable looks better than 'private', e.g. user_unmappable_fd/KVM_HAS_USER_UNMAPPABLE_MEMORY sound like more appropriate names. For memfd, however, I don't feel that strongly about changing it from the current 'inaccessible' to 'user_unmappable'; one of the reasons is that it's not just about being unmappable, but actually also inaccessible through direct syscalls like read()/write().
that gives us flexibility to map the memory from within the kernel, e.g. into other VMs or devices.
Hmm, and then keep your original "mem_attr_array" name? And probably
int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size, bool is_user_mappable)
Then the x86/mmu code for TDX/SNP private faults could be:
is_private = !kvm_is_gpa_user_mappable();
if (fault->is_private != is_private) {
or if we want to avoid mixing up "user_mappable" and "user_unmappable":
is_private = kvm_is_gpa_user_unmappable();
if (fault->is_private != is_private) {
though a helper that returns a negative (not mappable) feels kludgy. And I like kvm_is_gpa_user_mappable() because then when there's not "special" memory, it defaults to true, which is more intuitive IMO.
yes.
And then if the future needs more precision, e.g. user-unmappable memory isn't necessarily guest-exclusive, the uAPI names still work even though KVM internals will need to be reworked, but that's unavoidable. E.g. piggybacking KVM_MEMORY_ENCRYPT_(UN)REG_REGION doesn't allow for further differentiation, so we'd need to _extend_ the uAPI, but the _existing_ uAPI would still be sane.
Right, that has to be extended.
Chao
On Mon, Jul 25, 2022, Chao Peng wrote:
On Thu, Jul 21, 2022 at 05:58:50PM +0000, Sean Christopherson wrote:
On Thu, Jul 21, 2022, Chao Peng wrote:
On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
On 7/21/22 00:21, Sean Christopherson wrote: Maybe you could tag it with cgs for all the confidential guest support related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; ... kvm_vm_ioctl_set_cgs_mem(, is_private)
If we plan to widely use such abbr. through KVM (e.g. it's well known), I'm fine.
I'd prefer to stay away from "confidential guest", and away from any VM-scoped name for that matter. User-unmappable memmory has use cases beyond hiding guest state from the host, e.g. userspace could use inaccessible/unmappable memory to harden itself against unintentional access to guest memory.
I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610 But I also don't quite like it, it's so generic and sounds say nothing.
But I do want a name can cover future usages other than just private/shared (pKVM for example may have a third state).
I don't think there can be a third top-level state. Memory is either private to the guest or it's not. There can be sub-states, e.g. memory could be selectively shared or encrypted with a different key, in which case we'd need metadata to track that state.
Though that begs the question of whether or not private_fd is the correct terminology. E.g. if guest memory is backed by a memfd that can't be mapped by userspace (currently F_SEAL_INACCESSIBLE), but something else in the kernel plugs that memory into a device or another VM, then arguably that memory is shared, especially the multi-VM scenario.
For TDX and SNP "private vs. shared" is likely the correct terminology given the current specs, but for generic KVM it's probably better to align with whatever terminology is used for memfd. "inaccessible_fd" and "user_inaccessible_fd" are a bit odd since the fd itself is accesible.
What about "user_unmappable"? E.g.
F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE, KVM_HAS_USER_UNMAPPABLE_MEMORY, MEMFILE_F_USER_INACCESSIBLE, user_unmappable_fd, etc...
For KVM I also think user_unmappable looks better than 'private', e.g. user_unmappable_fd/KVM_HAS_USER_UNMAPPABLE_MEMORY sounds more appropriate names. For memfd however, I don't feel that strong to change it from current 'inaccessible' to 'user_unmappable', one of the reason is it's not just about unmappable, but actually also inaccessible through direct ioctls like read()/write().
Heh, I _knew_ there had to be a catch. I agree that INACCESSIBLE is better for memfd.
On Fri, Jul 29, 2022, Sean Christopherson wrote:
On Mon, Jul 25, 2022, Chao Peng wrote:
On Thu, Jul 21, 2022 at 05:58:50PM +0000, Sean Christopherson wrote:
On Thu, Jul 21, 2022, Chao Peng wrote:
On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
On 7/21/22 00:21, Sean Christopherson wrote: Maybe you could tag it with cgs for all the confidential guest support related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION; ... kvm_vm_ioctl_set_cgs_mem(, is_private)
If we plan to widely use such abbr. through KVM (e.g. it's well known), I'm fine.
I'd prefer to stay away from "confidential guest", and away from any VM-scoped name for that matter. User-unmappable memmory has use cases beyond hiding guest state from the host, e.g. userspace could use inaccessible/unmappable memory to harden itself against unintentional access to guest memory.
I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610 But I also don't quite like it, it's so generic and sounds say nothing.
But I do want a name can cover future usages other than just private/shared (pKVM for example may have a third state).
I don't think there can be a third top-level state. Memory is either private to the guest or it's not. There can be sub-states, e.g. memory could be selectively shared or encrypted with a different key, in which case we'd need metadata to track that state.
Though that begs the question of whether or not private_fd is the correct terminology. E.g. if guest memory is backed by a memfd that can't be mapped by userspace (currently F_SEAL_INACCESSIBLE), but something else in the kernel plugs that memory into a device or another VM, then arguably that memory is shared, especially the multi-VM scenario.
For TDX and SNP "private vs. shared" is likely the correct terminology given the current specs, but for generic KVM it's probably better to align with whatever terminology is used for memfd. "inaccessible_fd" and "user_inaccessible_fd" are a bit odd since the fd itself is accesible.
What about "user_unmappable"? E.g.
F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE, KVM_HAS_USER_UNMAPPABLE_MEMORY, MEMFILE_F_USER_INACCESSIBLE, user_unmappable_fd, etc...
For KVM I also think user_unmappable looks better than 'private', e.g. user_unmappable_fd/KVM_HAS_USER_UNMAPPABLE_MEMORY sounds more appropriate names. For memfd however, I don't feel that strong to change it from current 'inaccessible' to 'user_unmappable', one of the reason is it's not just about unmappable, but actually also inaccessible through direct ioctls like read()/write().
Heh, I _knew_ there had to be a catch. I agree that INACCESSIBLE is better for memfd.
Thought about this some more...
I think we should avoid UNMAPPABLE even on the KVM side of things for the core memslots functionality and instead be very literal, e.g.
KVM_HAS_FD_BASED_MEMSLOTS KVM_MEM_FD_VALID
We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied directly to the memslot. Decoupling the two things will require a bit of extra work, but the code impact should be quite small, e.g. explicitly query and propagate MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track if a memslot can be private. And unless I'm missing something, it won't require an additional memslot flag. The biggest oddity (if we don't also add KVM_MEM_PRIVATE) is that KVM would effectively ignore the hva for fd-based memslots for VM types that don't support private memory, i.e. userspace can't opt out of using the fd-based backing, but that doesn't seem like a deal breaker.
Decoupling private memory from fd-based memslots will allow using fd-based memslots for backing VMs even if the memory is user mappable, which opens up potentially interesting use cases. It would also allow testing some parts of fd-based memslots with existing VMs.
The big advantage of KVM's hva-based memslots is that KVM doesn't care what's backing a memslot, and so (in theory) enabling new backing stores for KVM is free. It's not always free, but at this point I think we've eliminated most of the hiccups, e.g. x86's MMU should no longer require additional enlightenment to support huge pages for new backing types.
On the flip-side, a big disadvantage of hva-based memslots is that KVM doesn't _know_ what's backing a memslot. This is one of the major reasons, if not _the_ main reason at this point, why KVM binds a VM to a single virtual address space. Running with different hva=>pfn mappings would either be completely unsafe or prohibitively expensive (nearly impossible?) to ensure.
With fd-based memslots, KVM essentially binds a memslot directly to the backing store. This allows KVM to do a "deep" comparison of a memslot between two address spaces simply by checking that the backing store is the same. For intra-host/copyless migration (to upgrade the userspace VMM), being able to do a deep comparison would theoretically allow transferring KVM's page tables between VMs instead of forcing the target VM to rebuild the page tables. There are memcg complications (and probably many others) for transferring page tables, but I'm pretty sure it could work.
I don't have a concrete use case (this is a recent idea on my end), but since we're already adding fd-based memory, I can't think of a good reason not make it more generic for not much extra cost. And there are definitely classes of VMs for which fd-based memory would Just Work, e.g. large VMs that are never oversubscribed on memory don't need to support reclaim, so the fact that fd-based memslots won't support page aging (among other things) right away is a non-issue.
On Tue, Aug 02, 2022, Sean Christopherson wrote:
I think we should avoid UNMAPPABLE even on the KVM side of things for the core memslots functionality and instead be very literal, e.g.
KVM_HAS_FD_BASED_MEMSLOTS KVM_MEM_FD_VALID
We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied directly to the memslot. Decoupling the two thingis will require a bit of extra work, but the code impact should be quite small, e.g. explicitly query and propagate MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track if a memslot can be private. And unless I'm missing something, it won't require an additional memslot flag. The biggest oddity (if we don't also add KVM_MEM_PRIVATE) is that KVM would effectively ignore the hva for fd-based memslots for VM types that don't support private memory, i.e. userspace can't opt out of using the fd-based backing, but that doesn't seem like a deal breaker.
Hrm, but basing private memory on top of a generic FD_VALID would effectively require shared memory to use hva-based memslots for confidential VMs. That'd yield a very weird API, e.g. non-confidential VMs could be backed entirely by fd-based memslots, but confidential VMs would be forced to use hva-based memslots.
Ignore this idea for now. If there's an actual use case for generic fd-based memory then we'll want a separate flag, fd, and offset, i.e. that support could be added independent of KVM_MEM_PRIVATE.
On Tue, Aug 02, 2022 at 04:38:55PM +0000, Sean Christopherson wrote:
On Tue, Aug 02, 2022, Sean Christopherson wrote:
I think we should avoid UNMAPPABLE even on the KVM side of things for the core memslots functionality and instead be very literal, e.g.
KVM_HAS_FD_BASED_MEMSLOTS KVM_MEM_FD_VALID
We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied directly to the memslot. Decoupling the two thingis will require a bit of extra work, but the code impact should be quite small, e.g. explicitly query and propagate MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track if a memslot can be private. And unless I'm missing something, it won't require an additional memslot flag. The biggest oddity (if we don't also add KVM_MEM_PRIVATE) is that KVM would effectively ignore the hva for fd-based memslots for VM types that don't support private memory, i.e. userspace can't opt out of using the fd-based backing, but that doesn't seem like a deal breaker.
I actually love this idea. I don't mind adding extra code for potential usage other than confidential VMs if we can have a workable solution for it.
Hrm, but basing private memory on top of a generic FD_VALID would effectively require shared memory to use hva-based memslots for confidential VMs. That'd yield a very weird API, e.g. non-confidential VMs could be backed entirely by fd-based memslots, but confidential VMs would be forced to use hva-based memslots.
It would work if we can treat userspace_addr as optional for KVM_MEM_FD_VALID, e.g. userspace can opt in to decide whether it needs the mappable part or not for a regular VM, and we can enforce it in KVM for confidential VMs. But the u64 type of userspace_addr doesn't allow us to express a 'null' value, so it sounds like we will end up needing another flag anyway.
In concept, we could have three configurations here:
1. hva-only: without any flag and use userspace_addr;
2. fd-only: another new flag is needed and use fd/offset;
3. hva/fd mixed: both userspace_addr and fd/offset are effective. KVM_MEM_PRIVATE is a subset of it for confidential VMs. Not sure a regular VM also wants this.
There is no direct relationship between unmappable and fd-based, since even fd-based memory can also be mappable for a regular VM?
Ignore this idea for now. If there's an actual use case for generic fd-based memory then we'll want a separate flag, fd, and offset, i.e. that support could be added independent of KVM_MEM_PRIVATE.
If we ignore this idea for now (which I'm also fine with), do you still think we need to change KVM_MEM_PRIVATE to KVM_MEM_USER_UNMAPPABLE?
Thanks, Chao
On Wed, Aug 03, 2022, Chao Peng wrote:
On Tue, Aug 02, 2022 at 04:38:55PM +0000, Sean Christopherson wrote:
On Tue, Aug 02, 2022, Sean Christopherson wrote:
I think we should avoid UNMAPPABLE even on the KVM side of things for the core memslots functionality and instead be very literal, e.g.
KVM_HAS_FD_BASED_MEMSLOTS KVM_MEM_FD_VALID
We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied directly to the memslot. Decoupling the two thingis will require a bit of extra work, but the code impact should be quite small, e.g. explicitly query and propagate MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track if a memslot can be private. And unless I'm missing something, it won't require an additional memslot flag. The biggest oddity (if we don't also add KVM_MEM_PRIVATE) is that KVM would effectively ignore the hva for fd-based memslots for VM types that don't support private memory, i.e. userspace can't opt out of using the fd-based backing, but that doesn't seem like a deal breaker.
I actually love this idea. I don't mind adding extra code for potential usage other than confidential VMs if we can have a workable solution for it.
Hrm, but basing private memory on top of a generic FD_VALID would effectively require shared memory to use hva-based memslots for confidential VMs. That'd yield a very weird API, e.g. non-confidential VMs could be backed entirely by fd-based memslots, but confidential VMs would be forced to use hva-based memslots.
It would work if we can treat userspace_addr as optional for KVM_MEM_FD_VALID, e.g. userspace can opt in to decide whether it needs the mappable part or not for a regular VM, and we can enforce it in KVM for confidential VMs. But the u64 type of userspace_addr doesn't allow us to express a 'null' value, so it sounds like we will end up needing another flag anyway.
In concept, we could have three configurations here:
1. hva-only: without any flag and use userspace_addr;
2. fd-only: another new flag is needed and use fd/offset;
3. hva/fd mixed: both userspace_addr and fd/offset are effective. KVM_MEM_PRIVATE is a subset of it for confidential VMs. Not sure a regular VM also wants this.
My mental model breaks things down slightly differently, though the end result is more or less the same.
After this series, there will be two types of memory: private and "regular" (I'm trying to avoid "shared"). "Regular" memory is always hva-based (userspace_addr), and private always fd-based (fd+offset).
In the future, if we want to support fd-based memory for "regular" memory, then as you said we'd need to add a new flag, and a new fd+offset pair.
At that point, we'd have two new (relatively to current) flags:
KVM_MEM_PRIVATE_FD_VALID KVM_MEM_FD_VALID
along with two new pairs of fd+offset (private_* and "regular"). Mapping those to your above list:
1. Neither *_FD_VALID flag set.
2a. Both PRIVATE_FD_VALID and FD_VALID are set.
2b. FD_VALID is set and the VM doesn't support private memory.
3. Only PRIVATE_FD_VALID is set (with private memory support in the VM).
Thus, "regular" VMs can't have a mix in a single memslot because they can't use private memory.
There is no direct relationship between unmappable and fd-based since even fd-based can also be mappable for regular VM?
Yep.
Ignore this idea for now. If there's an actual use case for generic fd-based memory then we'll want a separate flag, fd, and offset, i.e. that support could be added independent of KVM_MEM_PRIVATE.
If we ignore this idea now (which I'm also fine), do you still think we need change KVM_MEM_PRIVATE to KVM_MEM_USER_UNMAPPBLE?
Hmm, no. After working through this, I think it's safe to say KVM_MEM_USER_UNMAPPABLE is a bad name because we could end up with "regular" memory that's backed by an inaccessible (unmappable) file.
One alternative would be to call it KVM_MEM_PROTECTED. That shouldn't cause problems for the known use of "private" (TDX and SNP), and it gives us a little wiggle room, e.g. if we ever get a use case where VMs can share memory that is otherwise protected.
That's a pretty big "if" though, and odds are good we'd need more memslot flags and fd+offset pairs to allow differentiating "private" vs. "protected-shared" without forcing userspace to punch holes in memslots, so I don't know that hedging now will buy us anything.
So I'd say that if people think KVM_MEM_PRIVATE brings additional and meaningful clarity over KVM_MEM_PROTECTED, then let's go with PRIVATE. But if PROTECTED is just as good, go with PROTECTED, as it gives us a wee bit of wiggle room for the future.
Note, regardless of what name we settle on, I think it makes sense to do the KVM_PRIVATE_MEM_SLOTS => KVM_INTERNAL_MEM_SLOTS rename.
On Wed, Aug 03, 2022 at 03:51:24PM +0000, Sean Christopherson wrote:
On Wed, Aug 03, 2022, Chao Peng wrote:
On Tue, Aug 02, 2022 at 04:38:55PM +0000, Sean Christopherson wrote:
On Tue, Aug 02, 2022, Sean Christopherson wrote:
I think we should avoid UNMAPPABLE even on the KVM side of things for the core memslots functionality and instead be very literal, e.g.
KVM_HAS_FD_BASED_MEMSLOTS KVM_MEM_FD_VALID
We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied directly to the memslot. Decoupling the two thingis will require a bit of extra work, but the code impact should be quite small, e.g. explicitly query and propagate MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track if a memslot can be private. And unless I'm missing something, it won't require an additional memslot flag. The biggest oddity (if we don't also add KVM_MEM_PRIVATE) is that KVM would effectively ignore the hva for fd-based memslots for VM types that don't support private memory, i.e. userspace can't opt out of using the fd-based backing, but that doesn't seem like a deal breaker.
I actually love this idea. I don't mind adding extra code for potential usage other than confidential VMs if we can have a workable solution for it.
Hrm, but basing private memory on top of a generic FD_VALID would effectively require shared memory to use hva-based memslots for confidential VMs. That'd yield a very weird API, e.g. non-confidential VMs could be backed entirely by fd-based memslots, but confidential VMs would be forced to use hva-based memslots.
It would work if we can treat userspace_addr as optional for KVM_MEM_FD_VALID, e.g. userspace can opt in to decide whether it needs the mappable part or not for a regular VM, and we can enforce it in KVM for confidential VMs. But the u64 type of userspace_addr doesn't allow us to express a 'null' value, so it sounds like we will end up needing another flag anyway.
In concept, we could have three configurations here:
1. hva-only: without any flag and use userspace_addr;
2. fd-only: another new flag is needed and use fd/offset;
3. hva/fd mixed: both userspace_addr and fd/offset are effective. KVM_MEM_PRIVATE is a subset of it for confidential VMs. Not sure a regular VM also wants this.
My mental model breaks things down slightly differently, though the end result is more or less the same.
After this series, there will be two types of memory: private and "regular" (I'm trying to avoid "shared"). "Regular" memory is always hva-based (userspace_addr), and private always fd-based (fd+offset).
In the future, if we want to support fd-based memory for "regular" memory, then as you said we'd need to add a new flag, and a new fd+offset pair.
At that point, we'd have two new (relatively to current) flags:
KVM_MEM_PRIVATE_FD_VALID KVM_MEM_FD_VALID
along with two new pairs of fd+offset (private_* and "regular"). Mapping those to your above list:
I previously thought we could reuse the private_fd (the name should be changed) for a regular VM as well, so we'd only need one pair of fd+offset, and the meaning of the fd could be decided by the flag. But introducing two pairs of them may support extra usages, like one fd for regular memory and another private_fd for private memory, though I'm unsure this is a useful configuration.
1. Neither *_FD_VALID flag set.
2a. Both PRIVATE_FD_VALID and FD_VALID are set.
2b. FD_VALID is set and the VM doesn't support private memory.
3. Only PRIVATE_FD_VALID is set (with private memory support in the VM).
Thus, "regular" VMs can't have a mix in a single memslot because they can't use private memory.
There is no direct relationship between unmappable and fd-based since even fd-based can also be mappable for regular VM?
Hmm, yes, for private memory we have special treatment in the page fault handler, and that is not applied to a regular VM.
Yep.
Ignore this idea for now. If there's an actual use case for generic fd-based memory then we'll want a separate flag, fd, and offset, i.e. that support could be added independent of KVM_MEM_PRIVATE.
If we ignore this idea now (which I'm also fine), do you still think we need change KVM_MEM_PRIVATE to KVM_MEM_USER_UNMAPPBLE?
Hmm, no. After working through this, I think it's safe to say KVM_MEM_USER_UNMAPPABLE is bad name because we could end up with "regular" memory that's backed by an inaccessible (unmappable) file.
One alternative would be to call it KVM_MEM_PROTECTED. That shouldn't cause problems for the known use of "private" (TDX and SNP), and it gives us a little wiggle room, e.g. if we ever get a use case where VMs can share memory that is otherwise protected.
That's a pretty big "if" though, and odds are good we'd need more memslot flags and fd+offset pairs to allow differentiating "private" vs. "protected-shared" without forcing userspace to punch holes in memslots, so I don't know that hedging now will buy us anything.
So I'd say that if people think KVM_MEM_PRIVATE brings additional and meaningful clarity over KVM_MEM_PROTECTECD, then lets go with PRIVATE. But if PROTECTED is just as good, go with PROTECTED as it gives us a wee bit of wiggle room for the future.
Then I'd stay with PRIVATE.
Note, regardless of what name we settle on, I think it makes to do the KVM_PRIVATE_MEM_SLOTS => KVM_INTERNAL_MEM_SLOTS rename.
Agreed.
Chao
On Wed, Jul 06, 2022, Chao Peng wrote:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 230c8ff9659c..bb714c2a4b06 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)

 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */

+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+#define KVM_MEM_ATTR_PRIVATE	0x0001
+static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
+					     struct kvm_enc_region *region)
+{
+	unsigned long start, end;
As alluded to in a different reply, because this will track GPAs instead of HVAs, the type needs to be "gpa_t", not "unsigned long". Oh, actually, they need to be gfn_t, since those are what gets shoved into the xarray.
+	void *entry;
+	int r;
+
+	if (region->size == 0 || region->addr + region->size < region->addr)
+		return -EINVAL;
+	if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
+		return -EINVAL;
+
+	start = region->addr >> PAGE_SHIFT;
+	end = (region->addr + region->size - 1) >> PAGE_SHIFT;
+
+	entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
+				xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
+
+	r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
+					entry, GFP_KERNEL_ACCOUNT));
IIUC, this series treats memory as shared by default. I think we should invert that and have KVM's ABI be that all guest memory as private by default, i.e. require the guest to opt into sharing memory instead of opt out of sharing memory.
And then the xarray would track which regions are shared.
Regarding mem_attr_array, it probably makes sense to explicitly include what it's tracking in the name, i.e. name it {private,shared}_mem_array depending on whether it's used to track private vs. shared memory. If we ever need to track metadata beyond shared/private then we can tweak the name as needed, e.g. if hardware ever supports secondary non-ephemeral encryption keys.
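In code, inverting the default would be a one-line polarity change in the posted helper; KVM_MEM_ATTR_SHARED is a hypothetical name used only for this sketch:

	/* Track *shared* ranges instead, so untracked gfns default to private. */
	entry = ioctl == KVM_MEMORY_ENCRYPT_UNREG_REGION ?
				xa_mk_value(KVM_MEM_ATTR_SHARED) : NULL;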
On Wed, Jul 20, 2022 at 04:44:32PM +0000, Sean Christopherson wrote:
On Wed, Jul 06, 2022, Chao Peng wrote:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 230c8ff9659c..bb714c2a4b06 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)

 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */

+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+#define KVM_MEM_ATTR_PRIVATE	0x0001
+static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
+					     struct kvm_enc_region *region)
+{
+	unsigned long start, end;
As alluded to in a different reply, because this will track GPAs instead of HVAs, the type needs to be "gpa_t", not "unsigned long". Oh, actually, they need to be gfn_t, since those are what gets shoved into the xarray.
It's gfn_t actually. My original purpose for this was that 32-bit architectures (if any) can also work with it, since the index of the xarray is 32-bit on those architectures. But kvm_enc_region is u64 so it's not even possible.
+	void *entry;
+	int r;
+
+	if (region->size == 0 || region->addr + region->size < region->addr)
+		return -EINVAL;
+	if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
+		return -EINVAL;
+
+	start = region->addr >> PAGE_SHIFT;
+	end = (region->addr + region->size - 1) >> PAGE_SHIFT;
+
+	entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
+				xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
+
+	r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
+					entry, GFP_KERNEL_ACCOUNT));
IIUC, this series treats memory as shared by default. I think we should invert that and have KVM's ABI be that all guest memory as private by default, i.e. require the guest to opt into sharing memory instead of opt out of sharing memory.
And then the xarray would track which regions are shared.
Maybe I missed some information discussed elsewhere? I followed https://lkml.org/lkml/2022/5/23/772. KVM is shared by default, but userspace should set all guest memory to private before the guest launches; the guest then sees all memory as private. While defaulting to private also sounds good, if we only talk about private/shared in the private memory context (I think so), then there is no ambiguity.
Regarding mem_attr_array, it probably makes sense to explicitly include what it's tracking in the name, i.e. name it {private,shared}_mem_array depending on whether it's used to track private vs. shared memory. If we ever need to track metadata beyond shared/private then we can tweak the name as needed, e.g. if hardware ever supports secondary non-ephemeral encryption keys.
As I said, I think there may be other states beyond that. I'm fine with only taking private/shared into consideration, and it also sounds reasonable for people who want to support more states to change it then.
Chao
...
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 230c8ff9659c..bb714c2a4b06 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+#define KVM_MEM_ATTR_PRIVATE	0x0001
+static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
struct kvm_enc_region *region)
+{
unsigned long start, end;
void *entry;
int r;
if (region->size == 0 || region->addr + region->size < region->addr)
return -EINVAL;
if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
return -EINVAL;
start = region->addr >> PAGE_SHIFT;
end = (region->addr + region->size - 1) >> PAGE_SHIFT;
entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
entry, GFP_KERNEL_ACCOUNT));
xa_store_range seems to create multi-index entries by default. A subsequent xa_store_range call changes all the entries stored previously. xa_store needs to be used here instead of xa_store_range to achieve the intended behavior.
kvm_zap_gfn_range(kvm, start, end + 1);
return r;
+}
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
...
On Fri, Aug 19, 2022 at 12:37:42PM -0700, Vishal Annapurve wrote:
...
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 230c8ff9659c..bb714c2a4b06 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+#define KVM_MEM_ATTR_PRIVATE	0x0001
+static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
struct kvm_enc_region *region)
+{
unsigned long start, end;
void *entry;
int r;
if (region->size == 0 || region->addr + region->size < region->addr)
return -EINVAL;
if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
return -EINVAL;
start = region->addr >> PAGE_SHIFT;
end = (region->addr + region->size - 1) >> PAGE_SHIFT;
entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
entry, GFP_KERNEL_ACCOUNT));
xa_store_range seems to create multi-index entries by default. Subsequent xa_store_range call changes all the entries stored previously.
By using xa_store_range and storing them as multi-index entries, I originally expected to save some memory for contiguous pages.
But it sounds like the current multi-index store behaviour isn't quite ready for our usage.
Chao
xa_store needs to be used here instead of xa_store_range to achieve the intended behavior.
kvm_zap_gfn_range(kvm, start, end + 1);
return r;
+}
+#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
...
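A sketch of the per-gfn xa_store() alternative Vishal is suggesting, which sidesteps the multi-index behaviour of xa_store_range(); error handling is kept minimal and this is not the code that was posted:

	gfn_t i;

	/* One entry per gfn, so a later update only affects the gfns it covers. */
	for (i = start; i <= end; i++) {
		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
				    GFP_KERNEL_ACCOUNT));
		if (r)
			break;
	}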
Hi Chao,
On Wed, Jul 6, 2022 at 9:27 AM Chao Peng chao.p.peng@linux.intel.com wrote:
If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the guest private memory regions through the KVM_MEMORY_ENCRYPT_{UN,}REG_REGION ioctls. The patch reuses the existing SEV ioctls but differs in that the address in the region for private memory is a GPA, while in the SEV case it is an HVA.
The private memory regions are stored in an xarray in KVM for memory efficiency in normal usages, and zapping the existing memory mappings is also a side effect of these two ioctls.
Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
 Documentation/virt/kvm/api.rst  | 17 +++++++---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu.h              |  2 --
 include/linux/kvm_host.h        |  8 +++++
 virt/kvm/kvm_main.c             | 57 +++++++++++++++++++++++++++++++++
 6 files changed, 80 insertions(+), 6 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 5ecfc7fbe0ee..dfb4caecab73 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4715,10 +4715,19 @@ Documentation/virt/kvm/amd-memory-encryption.rst.
 This ioctl can be used to register a guest memory region which may
 contain encrypted data (e.g. guest RAM, SMRAM etc).

-It is used in the SEV-enabled guest. When encryption is enabled, a guest
-memory region may contain encrypted data. The SEV memory encryption
-engine uses a tweak such that two identical plaintext pages, each at
-different locations will have differing ciphertexts. So swapping or
+Currently this ioctl supports registering memory regions for two usages:
+private memory and SEV-encrypted memory.
+
+When private memory is enabled, this ioctl is used to register guest private
+memory region and the addr/size of kvm_enc_region represents guest physical
+address (GPA). In this usage, this ioctl zaps the existing guest memory
+mappings in KVM that fallen into the region.
+
+When SEV-encrypted memory is enabled, this ioctl is used to register guest
+memory region which may contain encrypted data for a SEV-enabled guest. The
+addr/size of kvm_enc_region represents userspace address (HVA). The SEV
+memory encryption engine uses a tweak such that two identical plaintext pages,
+each at different locations will have differing ciphertexts. So swapping or
 moving ciphertext of those pages will not result in plaintext being
 swapped. So relocating (or migrating) physical backing pages for the SEV
 guest will require some additional steps.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dae190e19fce..92120e3a224e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -37,6 +37,7 @@
 #include <asm/hyperv-tlfs.h>

 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_HAVE_ZAP_GFN_RANGE

 #define KVM_MAX_VCPUS 1024

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 1f160801e2a7..05861b9656a4 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -50,6 +50,7 @@ config KVM
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select HAVE_KVM_PRIVATE_MEM if X86_64
 	select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM
+	select XARRAY_MULTI if HAVE_KVM_PRIVATE_MEM
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index a99acec925eb..428cd2e88cbd 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -209,8 +209,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return -(u32)fault & errcode;
 }

-void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
-
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);

 int kvm_mmu_post_init_vm(struct kvm *kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1b203c8aa696..da33f8828456 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,6 +260,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 #endif

+#ifdef __KVM_HAVE_ZAP_GFN_RANGE
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+#endif
+
 enum {
 	OUTSIDE_GUEST_MODE,
 	IN_GUEST_MODE,
@@ -795,6 +799,9 @@ struct kvm {
 	struct notifier_block pm_notifier;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+	struct xarray mem_attr_array;
+#endif
 };

 #define kvm_err(fmt, ...) \
@@ -1459,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_private_mem_supported(struct kvm *kvm);

 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 230c8ff9659c..bb714c2a4b06 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)

 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */

+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
+#define KVM_MEM_ATTR_PRIVATE	0x0001
+static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl,
+					     struct kvm_enc_region *region)
+{
unsigned long start, end;
void *entry;
int r;
if (region->size == 0 || region->addr + region->size < region->addr)
return -EINVAL;
if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1))
return -EINVAL;
start = region->addr >> PAGE_SHIFT;
end = (region->addr + region->size - 1) >> PAGE_SHIFT;
entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ?
xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL;
r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end,
entry, GFP_KERNEL_ACCOUNT));
kvm_zap_gfn_range(kvm, start, end + 1);
return r;
+} +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER static int kvm_pm_notifier_call(struct notifier_block *bl, unsigned long state, @@ -1138,6 +1167,9 @@ static struct kvm *kvm_create_vm(unsigned long type) spin_lock_init(&kvm->mn_invalidate_lock); rcuwait_init(&kvm->mn_memslots_update_rcuwait); xa_init(&kvm->vcpu_array); +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
xa_init(&kvm->mem_attr_array);
+#endif
INIT_LIST_HEAD(&kvm->gpc_list); spin_lock_init(&kvm->gpc_lock);
@@ -1305,6 +1337,9 @@ static void kvm_destroy_vm(struct kvm *kvm) kvm_free_memslots(kvm, &kvm->__memslots[i][0]); kvm_free_memslots(kvm, &kvm->__memslots[i][1]); } +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
xa_destroy(&kvm->mem_attr_array);
+#endif cleanup_srcu_struct(&kvm->irq_srcu); cleanup_srcu_struct(&kvm->srcu); kvm_arch_free_vm(kvm); @@ -1508,6 +1543,11 @@ static void kvm_replace_memslot(struct kvm *kvm, } }
+bool __weak kvm_arch_private_mem_supported(struct kvm *kvm) +{
return false;
+}
static int check_memory_region_flags(const struct kvm_user_mem_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; @@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_set_memory_region(kvm, &mem); break; } +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
case KVM_MEMORY_ENCRYPT_REG_REGION:
case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
struct kvm_enc_region region;
if (!kvm_arch_private_mem_supported(kvm))
goto arch_vm_ioctl;
r = -EFAULT;
if (copy_from_user(&region, argp, sizeof(region)))
goto out;
r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
break;
}
+#endif case KVM_GET_DIRTY_LOG: { struct kvm_dirty_log log;
@@ -4842,6 +4898,7 @@ static long kvm_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_get_stats_fd(kvm); break; default: +arch_vm_ioctl:
It might be good to make this label conditional on CONFIG_HAVE_KVM_PRIVATE_MEM, otherwise you get a warning if CONFIG_HAVE_KVM_PRIVATE_MEM isn't defined.
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM arch_vm_ioctl: +#endif
Cheers, /fuad
r = kvm_arch_vm_ioctl(filp, ioctl, arg); }
out:
2.25.1
On Fri, Aug 26, 2022 at 04:19:43PM +0100, Fuad Tabba wrote:
+bool __weak kvm_arch_private_mem_supported(struct kvm *kvm) +{
return false;
+}
static int check_memory_region_flags(const struct kvm_user_mem_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; @@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_set_memory_region(kvm, &mem); break; } +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
case KVM_MEMORY_ENCRYPT_REG_REGION:
case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
struct kvm_enc_region region;
if (!kvm_arch_private_mem_supported(kvm))
goto arch_vm_ioctl;
r = -EFAULT;
if (copy_from_user(&region, argp, sizeof(region)))
goto out;
r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, &region);
break;
}
+#endif case KVM_GET_DIRTY_LOG: { struct kvm_dirty_log log;
@@ -4842,6 +4898,7 @@ static long kvm_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_get_stats_fd(kvm); break; default: +arch_vm_ioctl:
It might be good to make this label conditional on CONFIG_HAVE_KVM_PRIVATE_MEM, otherwise you get a warning if CONFIG_HAVE_KVM_PRIVATE_MEM isn't defined.
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM arch_vm_ioctl: +#endif
Right, as the bot already complains.
Chao
Cheers, /fuad
r = kvm_arch_vm_ioctl(filp, ioctl, arg); }
out:
2.25.1
A page fault can carry the private/shared information for a KVM_MEM_PRIVATE memslot; this can be filled in by architecture code (like TDX code). To handle a page fault for such an access, KVM maps the page only when this private property matches the host's view of the page.
For a successful match, the private pfn is obtained with memfile_notifier callbacks from the private fd, and the shared pfn is obtained with the existing get_user_pages().
For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to userspace. Userspace can then convert the memory between private/shared from the host's view and retry the access.
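For completeness, a rough sketch of how a VMM might react to this exit; the run->memory fields mirror the kernel-side code below, while convert_memory() is a hypothetical helper that flips the range between shared and private via KVM_MEMORY_ENCRYPT_{UN,}REG_REGION before KVM_RUN is retried:

	#include <stdbool.h>
	#include <linux/kvm.h>

	/* Sketch only: convert_memory() is a made-up VMM helper. */
	static void handle_memory_fault(int vm_fd, struct kvm_run *run)
	{
		bool to_private = run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

		/* Convert the faulting range in the host's view, then re-enter
		 * the guest so the access is retried. */
		convert_memory(vm_fd, run->memory.gpa, run->memory.size, to_private);
	}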
Co-developed-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com --- arch/x86/kvm/mmu/mmu.c | 60 ++++++++++++++++++++++++++++++++- arch/x86/kvm/mmu/mmu_internal.h | 18 ++++++++++ arch/x86/kvm/mmu/mmutrace.h | 1 + include/linux/kvm_host.h | 35 ++++++++++++++++++- 4 files changed, 112 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 545eb74305fe..27dbdd4fe8d1 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3004,6 +3004,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, if (max_level == PG_LEVEL_4K) return PG_LEVEL_4K;
+ if (kvm_mem_is_private(kvm, gfn)) + return max_level; + host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot); return min(host_level, max_level); } @@ -4101,10 +4104,52 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); }
+static inline u8 order_to_level(int order) +{ + enum pg_level level; + + for (level = KVM_MAX_HUGEPAGE_LEVEL; level > PG_LEVEL_4K; level--) + if (order >= page_level_shift(level) - PAGE_SHIFT) + return level; + return level; +} + +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu, + struct kvm_page_fault *fault) +{ + int order; + struct kvm_memory_slot *slot = fault->slot; + bool private_exist = kvm_mem_is_private(vcpu->kvm, fault->gfn); + + if (fault->is_private != private_exist) { + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT; + if (fault->is_private) + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE; + else + vcpu->run->memory.flags = 0; + vcpu->run->memory.padding = 0; + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT; + vcpu->run->memory.size = PAGE_SIZE; + return RET_PF_USER; + } + + if (fault->is_private) { + if (kvm_private_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order)) + return RET_PF_RETRY; + fault->max_level = min(order_to_level(order), fault->max_level); + fault->map_writable = !(slot->flags & KVM_MEM_READONLY); + return RET_PF_FIXED; + } + + /* Fault is shared, fallthrough. */ + return RET_PF_CONTINUE; +} + static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_memory_slot *slot = fault->slot; bool async; + int r;
/* * Retry the page fault if the gfn hit a memslot that is being deleted @@ -4133,6 +4178,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) return RET_PF_EMULATE; }
+ if (kvm_slot_can_be_private(slot)) { + r = kvm_faultin_pfn_private(vcpu, fault); + if (r != RET_PF_CONTINUE) + return r == RET_PF_FIXED ? RET_PF_CONTINUE : r; + } + async = false; fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async, fault->write, &fault->map_writable, @@ -4241,7 +4292,11 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault read_unlock(&vcpu->kvm->mmu_lock); else write_unlock(&vcpu->kvm->mmu_lock); - kvm_release_pfn_clean(fault->pfn); + + if (fault->is_private) + kvm_private_mem_put_pfn(fault->slot, fault->pfn); + else + kvm_release_pfn_clean(fault->pfn); return r; }
@@ -5518,6 +5573,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err return -EIO; }
+ if (r == RET_PF_USER) + return 0; + if (r < 0) return r; if (r != RET_PF_EMULATE) diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index ae2d660e2dab..fb9c298abcf0 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -188,6 +188,7 @@ struct kvm_page_fault {
/* Derived from mmu and global state. */ const bool is_tdp; + const bool is_private; const bool nx_huge_page_workaround_enabled;
/* @@ -236,6 +237,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); * RET_PF_RETRY: let CPU fault again on the address. * RET_PF_EMULATE: mmio page fault, emulate the instruction directly. * RET_PF_INVALID: the spte is invalid, let the real page fault path update it. + * RET_PF_USER: need to exit to userspace to handle this fault. * RET_PF_FIXED: The faulting entry has been fixed. * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU. * @@ -252,6 +254,7 @@ enum { RET_PF_RETRY, RET_PF_EMULATE, RET_PF_INVALID, + RET_PF_USER, RET_PF_FIXED, RET_PF_SPURIOUS, }; @@ -318,4 +321,19 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp); void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+#ifndef CONFIG_HAVE_KVM_PRIVATE_MEM +static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot, + gfn_t gfn, kvm_pfn_t *pfn, int *order) +{ + WARN_ON_ONCE(1); + return -EOPNOTSUPP; +} + +static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot, + kvm_pfn_t pfn) +{ + WARN_ON_ONCE(1); +} +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */ + #endif /* __KVM_X86_MMU_INTERNAL_H */ diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h index ae86820cef69..2d7555381955 100644 --- a/arch/x86/kvm/mmu/mmutrace.h +++ b/arch/x86/kvm/mmu/mmutrace.h @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE); TRACE_DEFINE_ENUM(RET_PF_RETRY); TRACE_DEFINE_ENUM(RET_PF_EMULATE); TRACE_DEFINE_ENUM(RET_PF_INVALID); +TRACE_DEFINE_ENUM(RET_PF_USER); TRACE_DEFINE_ENUM(RET_PF_FIXED); TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index da33f8828456..8f56426aa1e3 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -778,6 +778,10 @@ struct kvm {
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) struct mmu_notifier mmu_notifier; +#endif + +#if (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)) || \ + defined(CONFIG_MEMFILE_NOTIFIER) unsigned long mmu_updating_seq; long mmu_updating_count; gfn_t mmu_updating_range_start; @@ -1917,7 +1921,8 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[]; extern const struct kvm_stats_header kvm_vcpu_stats_header; extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) +#if (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)) || \ + defined(CONFIG_MEMFILE_NOTIFIER) static inline int mmu_updating_retry(struct kvm *kvm, unsigned long mmu_seq) { if (unlikely(kvm->mmu_updating_count)) @@ -2266,4 +2271,32 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu) /* Max number of entries allowed for each kvm dirty ring */ #define KVM_DIRTY_RING_MAX_ENTRIES 65536
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM +static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot, + gfn_t gfn, kvm_pfn_t *pfn, int *order) +{ + int ret; + pfn_t pfnt; + pgoff_t index = gfn - slot->base_gfn + + (slot->private_offset >> PAGE_SHIFT); + + ret = slot->notifier.bs->get_pfn(slot->private_file, index, &pfnt, + order); + *pfn = pfn_t_to_pfn(pfnt); + return ret; +} + +static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot, + kvm_pfn_t pfn) +{ + slot->notifier.bs->put_pfn(pfn_to_pfn_t(pfn)); +} + +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn) +{ + return !!xa_load(&kvm->mem_attr_array, gfn); +} + +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */ + #endif
On Wed, Jul 06, 2022, Chao Peng wrote:
A page fault can carry the private/shared information for a KVM_MEM_PRIVATE memslot; this can be filled in by architecture code (like TDX code). To handle a page fault for such an access, KVM maps the page only when this private property matches the host's view of the page.
For a successful match, the private pfn is obtained with memfile_notifier callbacks from the private fd, and the shared pfn is obtained with the existing get_user_pages().
For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to userspace. Userspace can then convert the memory between private/shared from the host's view and retry the access.
Co-developed-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
arch/x86/kvm/mmu/mmu.c | 60 ++++++++++++++++++++++++++++++++- arch/x86/kvm/mmu/mmu_internal.h | 18 ++++++++++ arch/x86/kvm/mmu/mmutrace.h | 1 + include/linux/kvm_host.h | 35 ++++++++++++++++++- 4 files changed, 112 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 545eb74305fe..27dbdd4fe8d1 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3004,6 +3004,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, if (max_level == PG_LEVEL_4K) return PG_LEVEL_4K;
- if (kvm_mem_is_private(kvm, gfn))
return max_level;
- host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot); return min(host_level, max_level);
} @@ -4101,10 +4104,52 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); } +static inline u8 order_to_level(int order) +{
- enum pg_level level;
- for (level = KVM_MAX_HUGEPAGE_LEVEL; level > PG_LEVEL_4K; level--)
Curly braces needed for the for-loop.
And I think it makes sense to take in the fault->max_level, that way this is slightly more performant when the guest mapping is smaller than the host, e.g.
for (level = max_level; level > PG_LEVEL_4K; level--) ...
return level;
Though I think I'd vote to avoid a loop entirely and do:
BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
if (order > ???) return PG_LEVEL_1G; if (order > ???) return PG_LEVEL_2M;
return PG_LEVEL_4K;
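Purely for illustration, one way the '???' thresholds could be filled in (a rough sketch; the PUD/PMD-derived constants are my assumption, not something settled in this thread):

	static inline u8 order_to_level(int order)
	{
		BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);

		/* Assumed thresholds: a 1G mapping needs order >= PUD_SHIFT - PAGE_SHIFT,
		 * a 2M mapping needs order >= PMD_SHIFT - PAGE_SHIFT. */
		if (order >= PUD_SHIFT - PAGE_SHIFT)
			return PG_LEVEL_1G;
		if (order >= PMD_SHIFT - PAGE_SHIFT)
			return PG_LEVEL_2M;
		return PG_LEVEL_4K;
	}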
if (order >= page_level_shift(level) - PAGE_SHIFT)
return level;
- return level;
+}
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
+{
- int order;
- struct kvm_memory_slot *slot = fault->slot;
- bool private_exist = kvm_mem_is_private(vcpu->kvm, fault->gfn);
- if (fault->is_private != private_exist) {
vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
if (fault->is_private)
vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
else
vcpu->run->memory.flags = 0;
vcpu->run->memory.padding = 0;
vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
vcpu->run->memory.size = PAGE_SIZE;
return RET_PF_USER;
- }
- if (fault->is_private) {
if (kvm_private_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
return RET_PF_RETRY;
fault->max_level = min(order_to_level(order), fault->max_level);
fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
return RET_PF_FIXED;
- }
- /* Fault is shared, fallthrough. */
- return RET_PF_CONTINUE;
+}
static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_memory_slot *slot = fault->slot; bool async;
- int r;
/* * Retry the page fault if the gfn hit a memslot that is being deleted @@ -4133,6 +4178,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) return RET_PF_EMULATE; }
- if (kvm_slot_can_be_private(slot)) {
r = kvm_faultin_pfn_private(vcpu, fault);
if (r != RET_PF_CONTINUE)
return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;
I apologize if I've given you conflicting feedback in the past. Now that this returns RET_PF_* directly, I definitely think it makes sense to do:
if (kvm_slot_can_be_private(slot) && fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) { vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT; if (fault->is_private) vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE; else vcpu->run->memory.flags = 0; vcpu->run->memory.padding = 0; vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT; vcpu->run->memory.size = PAGE_SIZE; return RET_PF_USER; }
if (fault->is_private) return kvm_faultin_pfn_private(vcpu, fault);
That way kvm_faultin_pfn_private() only handles private faults, and this doesn't need to play games with RET_PF_FIXED.
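A rough sketch of what the private-only helper could look like under this suggestion (the helpers and fields are the ones used in the quoted patch; this is not the posted code):

	static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
					   struct kvm_page_fault *fault)
	{
		int order;
		struct kvm_memory_slot *slot = fault->slot;

		if (kvm_private_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
			return RET_PF_RETRY;

		fault->max_level = min(order_to_level(order), fault->max_level);
		fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
		/* Success: let the normal fault path continue with fault->pfn. */
		return RET_PF_CONTINUE;
	}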
- }
- async = false; fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async, fault->write, &fault->map_writable,
@@ -4241,7 +4292,11 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault read_unlock(&vcpu->kvm->mmu_lock); else write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
- if (fault->is_private)
kvm_private_mem_put_pfn(fault->slot, fault->pfn);
- else
kvm_release_pfn_clean(fault->pfn);
AFAIK, we never bottomed out on whether or not this is needed[*]. Can you follow up with Kirill to get an answer before posting v8?
[*] https://lore.kernel.org/all/20220620141647.GC2016793@chaop.bj.intel.com
On Fri, Jul 29, 2022 at 08:58:41PM +0000, Sean Christopherson wrote:
On Wed, Jul 06, 2022, Chao Peng wrote:
A page fault can carry the private/shared information for a KVM_MEM_PRIVATE memslot; this can be filled in by architecture code (like TDX code). To handle a page fault for such an access, KVM maps the page only when this private property matches the host's view of the page.
For a successful match, the private pfn is obtained with memfile_notifier callbacks from the private fd, and the shared pfn is obtained with the existing get_user_pages().
For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to userspace. Userspace can then convert the memory between private/shared from the host's view and retry the access.
Co-developed-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
arch/x86/kvm/mmu/mmu.c | 60 ++++++++++++++++++++++++++++++++- arch/x86/kvm/mmu/mmu_internal.h | 18 ++++++++++ arch/x86/kvm/mmu/mmutrace.h | 1 + include/linux/kvm_host.h | 35 ++++++++++++++++++- 4 files changed, 112 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 545eb74305fe..27dbdd4fe8d1 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3004,6 +3004,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, if (max_level == PG_LEVEL_4K) return PG_LEVEL_4K;
- if (kvm_mem_is_private(kvm, gfn))
return max_level;
- host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot); return min(host_level, max_level);
} @@ -4101,10 +4104,52 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); } +static inline u8 order_to_level(int order) +{
- enum pg_level level;
- for (level = KVM_MAX_HUGEPAGE_LEVEL; level > PG_LEVEL_4K; level--)
Curly braces needed for the for-loop.
And I think it makes sense to take in the fault->max_level, that way this is slightly more performant when the guest mapping is smaller than the host, e.g.
for (level = max_level; level > PG_LEVEL_4K; level--) ...
return level;
Though I think I'd vote to avoid a loop entirely and do:
BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
if (order > ???) return PG_LEVEL_1G; if (order > ???) return PG_LEVEL_2M;
return PG_LEVEL_4K;
Sounds good.
if (order >= page_level_shift(level) - PAGE_SHIFT)
return level;
- return level;
+}
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault)
+{
- int order;
- struct kvm_memory_slot *slot = fault->slot;
- bool private_exist = kvm_mem_is_private(vcpu->kvm, fault->gfn);
- if (fault->is_private != private_exist) {
vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
if (fault->is_private)
vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
else
vcpu->run->memory.flags = 0;
vcpu->run->memory.padding = 0;
vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
vcpu->run->memory.size = PAGE_SIZE;
return RET_PF_USER;
- }
- if (fault->is_private) {
if (kvm_private_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
return RET_PF_RETRY;
fault->max_level = min(order_to_level(order), fault->max_level);
fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
return RET_PF_FIXED;
- }
- /* Fault is shared, fallthrough. */
- return RET_PF_CONTINUE;
+}
static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_memory_slot *slot = fault->slot; bool async;
- int r;
/* * Retry the page fault if the gfn hit a memslot that is being deleted @@ -4133,6 +4178,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) return RET_PF_EMULATE; }
- if (kvm_slot_can_be_private(slot)) {
r = kvm_faultin_pfn_private(vcpu, fault);
if (r != RET_PF_CONTINUE)
return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;
I apologize if I've given you conflicting feedback in the past. Now that this returns RET_PF_* directly, I definitely think it makes sense to do:
if (kvm_slot_can_be_private(slot) && fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) { vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT; if (fault->is_private) vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE; else vcpu->run->memory.flags = 0; vcpu->run->memory.padding = 0; vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT; vcpu->run->memory.size = PAGE_SIZE; return RET_PF_USER; }
if (fault->is_private) return kvm_faultin_pfn_private(vcpu, fault);
That way kvm_faultin_pfn_private() only handles private faults, and this doesn't need to play games with RET_PF_FIXED.
Agreed, this looks much simpler.
- }
- async = false; fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async, fault->write, &fault->map_writable,
@@ -4241,7 +4292,11 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault read_unlock(&vcpu->kvm->mmu_lock); else write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
- if (fault->is_private)
kvm_private_mem_put_pfn(fault->slot, fault->pfn);
- else
kvm_release_pfn_clean(fault->pfn);
AFAIK, we never bottomed out on whether or not this is needed[*]. Can you follow up with Kirill to get an answer before posting v8?
Sure.
Chao
[*] https://lore.kernel.org/all/20220620141647.GC2016793@chaop.bj.intel.com
Register private memslot to fd-based memory backing store and handle the memfile notifiers to zap the existing mappings.
Currently the registration happens at memslot creation time and the initial support does not include page migration/swap.
KVM_MEM_PRIVATE is not exposed by default; architecture code can turn it on by implementing kvm_arch_private_mem_supported().
A 'kvm' reference is added in the memslot structure since in the memfile_notifier callbacks we can only obtain a memslot reference, while kvm is needed to do the zapping.
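For illustration, a rough sketch of the VMM-side memslot setup this enables (the struct, flag and field names come from this series' uAPI; vm_fd, mem_size, shared_hva and private_memfd are assumed to be set up elsewhere, with private_memfd coming from memfd_create(..., MFD_INACCESSIBLE)):

	struct kvm_userspace_memory_region_ext region = {
		.region = {
			.slot            = 0,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = 0,
			.memory_size     = mem_size,
			.userspace_addr  = (__u64)shared_hva,	/* shared part */
		},
		.private_fd     = private_memfd,		/* private part */
		.private_offset = 0,
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);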
Co-developed-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com --- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 117 ++++++++++++++++++++++++++++++++++++--- 2 files changed, 109 insertions(+), 9 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 8f56426aa1e3..4e5a0db68799 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -584,6 +584,7 @@ struct kvm_memory_slot { struct file *private_file; loff_t private_offset; struct memfile_notifier notifier; + struct kvm *kvm; };
static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index bb714c2a4b06..d6f7e074cab2 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -941,6 +941,63 @@ static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl
return r; } + +static void kvm_memfile_notifier_invalidate(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end) +{ + struct kvm_memory_slot *slot = container_of(notifier, + struct kvm_memory_slot, + notifier); + unsigned long base_pgoff = slot->private_offset >> PAGE_SHIFT; + gfn_t start_gfn = slot->base_gfn; + gfn_t end_gfn = slot->base_gfn + slot->npages; + + + if (start > base_pgoff) + start_gfn = slot->base_gfn + start - base_pgoff; + + if (end < base_pgoff + slot->npages) + end_gfn = slot->base_gfn + end - base_pgoff; + + if (start_gfn >= end_gfn) + return; + + kvm_zap_gfn_range(slot->kvm, start_gfn, end_gfn); +} + +static struct memfile_notifier_ops kvm_memfile_notifier_ops = { + .invalidate = kvm_memfile_notifier_invalidate, +}; + +#define KVM_MEMFILE_FLAGS (MEMFILE_F_USER_INACCESSIBLE | \ + MEMFILE_F_UNMOVABLE | \ + MEMFILE_F_UNRECLAIMABLE) + +static inline int kvm_private_mem_register(struct kvm_memory_slot *slot) +{ + slot->notifier.ops = &kvm_memfile_notifier_ops; + return memfile_register_notifier(slot->private_file, KVM_MEMFILE_FLAGS, + &slot->notifier); +} + +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot) +{ + memfile_unregister_notifier(&slot->notifier); +} + +#else /* !CONFIG_HAVE_KVM_PRIVATE_MEM */ + +static inline int kvm_private_mem_register(struct kvm_memory_slot *slot) +{ + WARN_ON_ONCE(1); + return -EOPNOTSUPP; +} + +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot) +{ + WARN_ON_ONCE(1); +} + #endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
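To make the offset arithmetic in kvm_memfile_notifier_invalidate() above concrete, a small worked example (numbers chosen arbitrarily): with base_gfn = 0x100000, npages = 1024 and private_offset = 2 MiB (so base_pgoff = 512), a hole punched over the pgoff range [768, 1024) yields start_gfn = 0x100000 + (768 - 512) = 0x100100 and end_gfn = 0x100000 + (1024 - 512) = 0x100200, i.e. 256 guest pages are zapped.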
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER @@ -987,6 +1044,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot) /* This does not remove the slot from struct kvm_memslots data structures */ static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) { + if (slot->flags & KVM_MEM_PRIVATE) { + kvm_private_mem_unregister(slot); + fput(slot->private_file); + } + kvm_destroy_dirty_bitmap(slot);
kvm_arch_free_memslot(kvm, slot); @@ -1548,10 +1610,16 @@ bool __weak kvm_arch_private_mem_supported(struct kvm *kvm) return false; }
-static int check_memory_region_flags(const struct kvm_user_mem_region *mem) +static int check_memory_region_flags(struct kvm *kvm, + const struct kvm_user_mem_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM + if (kvm_arch_private_mem_supported(kvm)) + valid_flags |= KVM_MEM_PRIVATE; +#endif + #ifdef __KVM_HAVE_READONLY_MEM valid_flags |= KVM_MEM_READONLY; #endif @@ -1627,6 +1695,12 @@ static int kvm_prepare_memory_region(struct kvm *kvm, { int r;
+ if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE) { + r = kvm_private_mem_register(new); + if (r) + return r; + } + /* * If dirty logging is disabled, nullify the bitmap; the old bitmap * will be freed on "commit". If logging is enabled in both old and @@ -1655,6 +1729,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm, if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap)) kvm_destroy_dirty_bitmap(new);
+ if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE) + kvm_private_mem_unregister(new); + return r; }
@@ -1952,7 +2029,7 @@ int __kvm_set_memory_region(struct kvm *kvm, int as_id, id; int r;
- r = check_memory_region_flags(mem); + r = check_memory_region_flags(kvm, mem); if (r) return r;
@@ -1971,6 +2048,10 @@ int __kvm_set_memory_region(struct kvm *kvm, !access_ok((void __user *)(unsigned long)mem->userspace_addr, mem->memory_size)) return -EINVAL; + if (mem->flags & KVM_MEM_PRIVATE && + (mem->private_offset & (PAGE_SIZE - 1) || + mem->private_offset > U64_MAX - mem->memory_size)) + return -EINVAL; if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM) return -EINVAL; if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr) @@ -2009,6 +2090,9 @@ int __kvm_set_memory_region(struct kvm *kvm, if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages) return -EINVAL; } else { /* Modify an existing slot. */ + /* Private memslots are immutable, they can only be deleted. */ + if (mem->flags & KVM_MEM_PRIVATE) + return -EINVAL; if ((mem->userspace_addr != old->userspace_addr) || (npages != old->npages) || ((mem->flags ^ old->flags) & KVM_MEM_READONLY)) @@ -2037,10 +2121,27 @@ int __kvm_set_memory_region(struct kvm *kvm, new->npages = npages; new->flags = mem->flags; new->userspace_addr = mem->userspace_addr; + if (mem->flags & KVM_MEM_PRIVATE) { + new->private_file = fget(mem->private_fd); + if (!new->private_file) { + r = -EINVAL; + goto out; + } + new->private_offset = mem->private_offset; + } + + new->kvm = kvm;
r = kvm_set_memslot(kvm, old, new, change); if (r) - kfree(new); + goto out; + + return 0; + +out: + if (new->private_file) + fput(new->private_file); + kfree(new); return r; } EXPORT_SYMBOL_GPL(__kvm_set_memory_region); @@ -4712,12 +4813,10 @@ static long kvm_vm_ioctl(struct file *filp, (u32 __user *)(argp + offsetof(typeof(mem), flags)))) goto out;
- if (flags & KVM_MEM_PRIVATE) { - r = -EINVAL; - goto out; - } - - size = sizeof(struct kvm_userspace_memory_region); + if (flags & KVM_MEM_PRIVATE) + size = sizeof(struct kvm_userspace_memory_region_ext); + else + size = sizeof(struct kvm_userspace_memory_region);
if (copy_from_user(&mem, argp, size)) goto out;
Register private memslot to fd-based memory backing store and handle the memfile notifiers to zap the existing mappings.
Currently the registration happens at memslot creation time and the initial support does not include page migration/swap.
KVM_MEM_PRIVATE is not exposed by default; architecture code can turn it on by implementing kvm_arch_private_mem_supported().
A 'kvm' reference is added in the memslot structure since in the memfile_notifier callbacks we can only obtain a memslot reference, while kvm is needed to do the zapping.
Co-developed-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Yu Zhang yu.c.zhang@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 117 ++++++++++++++++++++++++++++++++++++--- 2 files changed, 109 insertions(+), 9 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 8f56426aa1e3..4e5a0db68799 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -584,6 +584,7 @@ struct kvm_memory_slot { struct file *private_file; loff_t private_offset; struct memfile_notifier notifier;
- struct kvm *kvm; };
static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index bb714c2a4b06..d6f7e074cab2 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -941,6 +941,63 @@ static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl return r; }
+static void kvm_memfile_notifier_invalidate(struct memfile_notifier *notifier,
pgoff_t start, pgoff_t end)
+{
- struct kvm_memory_slot *slot = container_of(notifier,
struct kvm_memory_slot,
notifier);
- unsigned long base_pgoff = slot->private_offset >> PAGE_SHIFT;
- gfn_t start_gfn = slot->base_gfn;
- gfn_t end_gfn = slot->base_gfn + slot->npages;
- if (start > base_pgoff)
start_gfn = slot->base_gfn + start - base_pgoff;
- if (end < base_pgoff + slot->npages)
end_gfn = slot->base_gfn + end - base_pgoff;
- if (start_gfn >= end_gfn)
return;
- kvm_zap_gfn_range(slot->kvm, start_gfn, end_gfn);
+}
+static struct memfile_notifier_ops kvm_memfile_notifier_ops = {
- .invalidate = kvm_memfile_notifier_invalidate,
+};
+#define KVM_MEMFILE_FLAGS (MEMFILE_F_USER_INACCESSIBLE | \
MEMFILE_F_UNMOVABLE | \
MEMFILE_F_UNRECLAIMABLE)
+static inline int kvm_private_mem_register(struct kvm_memory_slot *slot) +{
- slot->notifier.ops = &kvm_memfile_notifier_ops;
- return memfile_register_notifier(slot->private_file, KVM_MEMFILE_FLAGS,
&slot->notifier);
+}
+static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot) +{
- memfile_unregister_notifier(&slot->notifier);
+}
+#else /* !CONFIG_HAVE_KVM_PRIVATE_MEM */
+static inline int kvm_private_mem_register(struct kvm_memory_slot *slot) +{
- WARN_ON_ONCE(1);
- return -EOPNOTSUPP;
+}
+static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot) +{
- WARN_ON_ONCE(1);
+}
- #endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER @@ -987,6 +1044,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot) /* This does not remove the slot from struct kvm_memslots data structures */ static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) {
- if (slot->flags & KVM_MEM_PRIVATE) {
kvm_private_mem_unregister(slot);
fput(slot->private_file);
- }
- kvm_destroy_dirty_bitmap(slot);
kvm_arch_free_memslot(kvm, slot); @@ -1548,10 +1610,16 @@ bool __weak kvm_arch_private_mem_supported(struct kvm *kvm) return false; } -static int check_memory_region_flags(const struct kvm_user_mem_region *mem) +static int check_memory_region_flags(struct kvm *kvm,
const struct kvm_user_mem_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
- if (kvm_arch_private_mem_supported(kvm))
valid_flags |= KVM_MEM_PRIVATE;
+#endif
- #ifdef __KVM_HAVE_READONLY_MEM valid_flags |= KVM_MEM_READONLY; #endif
@@ -1627,6 +1695,12 @@ static int kvm_prepare_memory_region(struct kvm *kvm, { int r;
- if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE) {
r = kvm_private_mem_register(new);
if (r)
return r;
- }
- /*
- If dirty logging is disabled, nullify the bitmap; the old bitmap
- will be freed on "commit". If logging is enabled in both old and
@@ -1655,6 +1729,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm, if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap)) kvm_destroy_dirty_bitmap(new);
- if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE)
kvm_private_mem_unregister(new);
- return r; }
@@ -1952,7 +2029,7 @@ int __kvm_set_memory_region(struct kvm *kvm, int as_id, id; int r;
- r = check_memory_region_flags(mem);
- r = check_memory_region_flags(kvm, mem); if (r) return r;
@@ -1971,6 +2048,10 @@ int __kvm_set_memory_region(struct kvm *kvm, !access_ok((void __user *)(unsigned long)mem->userspace_addr, mem->memory_size)) return -EINVAL;
- if (mem->flags & KVM_MEM_PRIVATE &&
(mem->private_offset & (PAGE_SIZE - 1) ||
mem->private_offset > U64_MAX - mem->memory_size))
return -EINVAL; if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM) return -EINVAL; if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
@@ -2009,6 +2090,9 @@ int __kvm_set_memory_region(struct kvm *kvm, if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages) return -EINVAL; } else { /* Modify an existing slot. */
/* Private memslots are immutable, they can only be deleted. */
if (mem->flags & KVM_MEM_PRIVATE)
return -EINVAL; if ((mem->userspace_addr != old->userspace_addr) || (npages != old->npages) || ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
@@ -2037,10 +2121,27 @@ int __kvm_set_memory_region(struct kvm *kvm, new->npages = npages; new->flags = mem->flags; new->userspace_addr = mem->userspace_addr;
- if (mem->flags & KVM_MEM_PRIVATE) {
new->private_file = fget(mem->private_fd);
if (!new->private_file) {
r = -EINVAL;
goto out;
}
new->private_offset = mem->private_offset;
- }
- new->kvm = kvm;
r = kvm_set_memslot(kvm, old, new, change); if (r)
kfree(new);
goto out;
- return 0;
+out:
- if (new->private_file)
fput(new->private_file);
- kfree(new); return r; } EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
@@ -4712,12 +4813,10 @@ static long kvm_vm_ioctl(struct file *filp, (u32 __user *)(argp + offsetof(typeof(mem), flags)))) goto out;
if (flags & KVM_MEM_PRIVATE) {
r = -EINVAL;
goto out;
}
size = sizeof(struct kvm_userspace_memory_region);
if (flags & KVM_MEM_PRIVATE)
size = sizeof(struct kvm_userspace_memory_region_ext);
Not sure if we use kvm_userspace_memory_region_ext or kvm_user_mem_region, just for readability.
else
size = sizeof(struct kvm_userspace_memory_region);
if (copy_from_user(&mem, argp, size)) goto out;
On Tue, Jul 19, 2022 at 11:55:24AM +0200, Gupta, Pankaj wrote:
...
@@ -4712,12 +4813,10 @@ static long kvm_vm_ioctl(struct file *filp, (u32 __user *)(argp + offsetof(typeof(mem), flags)))) goto out;
if (flags & KVM_MEM_PRIVATE) {
r = -EINVAL;
goto out;
}
size = sizeof(struct kvm_userspace_memory_region);
if (flags & KVM_MEM_PRIVATE)
size = sizeof(struct kvm_userspace_memory_region_ext);
Not sure if we use kvm_userspace_memory_region_ext or kvm_user_mem_region, just for readability.
Somewhat, but mainly for code maintainability: kvm_user_mem_region is designed to be an alias of kvm_userspace_memory_region_ext so that in the code we can access the 'unpacked' fields using something like 'mem.userspace_addr' instead of 'mem.region.userspace_addr'.
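For illustration, a rough sketch of that idea: a kernel-internal struct whose layout matches kvm_userspace_memory_region_ext field-for-field, so the code can write mem.userspace_addr directly (trailing padding to match the uapi layout is omitted here and assumed to exist):

	/* Kernel-internal alias; layout must match struct kvm_userspace_memory_region_ext. */
	struct kvm_user_mem_region {
		__u32 slot;
		__u32 flags;
		__u64 guest_phys_addr;
		__u64 memory_size;
		__u64 userspace_addr;
		__u64 private_offset;
		__u32 private_fd;
		/* padding to mirror the uapi struct omitted in this sketch */
	};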
Chao
else
size = sizeof(struct kvm_userspace_memory_region); if (copy_from_user(&mem, argp, size)) goto out;
Signed-off-by: Chao Peng chao.p.peng@linux.intel.com --- man2/memfd_create.2 | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/man2/memfd_create.2 b/man2/memfd_create.2 index 89e9c4136..2698222ae 100644 --- a/man2/memfd_create.2 +++ b/man2/memfd_create.2 @@ -101,6 +101,19 @@ meaning that no other seals can be set on the file. ." FIXME Why is the MFD_ALLOW_SEALING behavior not simply the default? ." Is it worth adding some text explaining this? .TP +.BR MFD_INACCESSIBLE +Disallow userspace access through ordinary MMU accesses via +.BR read (2), +.BR write (2) +and +.BR mmap (2). +The file size cannot be changed once initialized. +This flag cannot coexist with +.B MFD_ALLOW_SEALING +and when this flag is set, the initial set of seals will be +.B F_SEAL_SEAL, +meaning that no other seals can be set on the file. +.TP .BR MFD_HUGETLB " (since Linux 4.14)" ." commit 749df87bd7bee5a79cef073f5d032ddb2b211de8 The anonymous file will be created in the hugetlbfs filesystem using
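For illustration, a rough userspace sketch of the flag being documented here (MFD_INACCESSIBLE itself comes from this series; the helper name is made up and error handling is minimal):

	#define _GNU_SOURCE
	#include <sys/mman.h>	/* memfd_create() */
	#include <unistd.h>	/* ftruncate(), close() */

	static int create_private_memfd(size_t size)
	{
		int fd = memfd_create("guest-private", MFD_INACCESSIBLE);

		if (fd < 0)
			return -1;
		/* Size it once; per the man page text above, the size cannot
		 * be changed after initialization. */
		if (ftruncate(fd, size) < 0) {
			close(fd);
			return -1;
		}
		return fd;	/* to be passed as private_fd of a KVM_MEM_PRIVATE slot */
	}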
This patch does not belong in this series. It's not a patch to the kernel. This is a kernel series.
It would be much more appropriate to put a link to a separately posted manpage patch in the cover letter.
On Mon, Aug 01, 2022 at 07:40:32AM -0700, Dave Hansen wrote:
This patch does not belong in this series. It's not a patch to the kernel. This is a kernel series.
You are right.
It would be much more appropriate to put a link to a separately posted manpage patch in the cover letter.
Thanks for the suggestion.
Chao
This is the v7 of this series which tries to implement the fd-based KVM guest private memory. The patches are based on latest kvm/queue branch commit:
b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity
Introduction
In general this patch series introduce fd-based memslot which provides guest memory through memory file descriptor fd[offset,size] instead of hva/size. The fd can be created from a supported memory filesystem like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
Thinking a bit: as the host-side fd on tmpfs or shmem will store memory in the host page cache instead of mapping pages into the userspace address space, can we hit a double (un-coordinated) page cache problem with this when the guest page cache is also used?
Thanks, Pankaj
and the the memory backing store exchange callbacks when such memslot gets created. At runtime KVM will call into callbacks provided by the backing store to get the pfn with the fd+offset. Memory backing store will also call into KVM callbacks when userspace punch hole on the fd to notify KVM to unmap secondary MMU page table entries.
Comparing to existing hva-based memslot, this new type of memslot allows guest memory unmapped from host userspace like QEMU and even the kernel itself, therefore reduce attack surface and prevent bugs.
Based on this fd-based memslot, we can build guest private memory that is going to be used in confidential computing environments such as Intel TDX and AMD SEV. When supported, the memory backing store can provide more enforcement on the fd and KVM can use a single memslot to hold both the private and shared part of the guest memory.
mm extension
Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file created with these flags cannot read(), write() or mmap() etc via normal MMU operations. The file content can only be used with the newly introduced memfile_notifier extension.
The memfile_notifier extension provides two sets of callbacks for KVM to interact with the memory backing store:
- memfile_notifier_ops: callbacks for memory backing store to notify KVM when memory gets invalidated.
- backing store callbacks: callbacks for KVM to call into memory backing store to request memory pages for guest private memory.
The memfile_notifier extension also provides APIs for memory backing store to register/unregister itself and to trigger the notifier when the bookmarked memory gets invalidated.
The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to prevent double allocation caused by unintentional guest when we only have a single side of the shared/private memfds effective.
memslot extension
Add the private fd and the fd offset to existing 'shared' memslot so that both private/shared guest memory can live in one single memslot. A page in the memslot is either private or shared. Whether a guest page is private or shared is maintained through reusing existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
Test
To test the new functionalities of this patch TDX patchset is needed. Since TDX patchset has not been merged so I did two kinds of test:
Regresion test on kvm/queue (this patchset) Most new code are not covered. Code also in below repo: https://github.com/chao-p/linux/tree/privmem-v7
New Funational test on latest TDX code The patch is rebased to latest TDX code and tested the new funcationalities. See below repos: Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
An example QEMU command line for TDX test: -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \ -machine confidential-guest-support=tdx \ -object memory-backend-memfd-private,id=ram1,size=${mem} \ -machine memory-backend=ram1
Changelog
v7:
- Move the private/shared info from backing store to KVM.
- Introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
- Rework on the sync mechanism between zap/page fault paths.
- Addressed other comments in v6.
v6:
- Re-organized patches for both mm/KVM parts.
- Added flags for memfile_notifier so its consumers can state their features and memory backing store can check against these flags.
- Put a backing store reference in the memfile_notifier and move pfn_ops into backing store.
- Only support boot time backing store register.
- Overall KVM part improvement suggested by Sean and some others.
v5:
- Removed userspace visible F_SEAL_INACCESSIBLE, instead using an in-kernel flag (SHM_F_INACCESSIBLE for shmem). Private fd can only be created by MFD_INACCESSIBLE.
- Introduced new APIs for backing store to register itself to memfile_notifier instead of direct function call.
- Added the accounting and restriction for MFD_INACCESSIBLE memory.
- Added KVM API doc for new memslot extensions and man page for the new MFD_INACCESSIBLE flag.
- Removed the overlap check for mapping the same file+offset into multiple gfns due to perf consideration, warned in document.
- Addressed other comments in v4.
v4:
- Decoupled the callbacks between KVM/mm from memfd and use new name 'memfile_notifier'.
- Supported registering multiple memslots to the same backing store.
- Added per-memslot pfn_ops instead of per-system.
- Reworked the invalidation part.
- Improved new KVM uAPIs (private memslot extension and memory error) per Sean's suggestions.
- Addressed many other minor fixes for comments from v3.
v3:
- Added locking protection when calling invalidate_page_range/fallocate callbacks.
- Changed the memslot structure to keep using useraddr for shared memory.
- Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
- Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
- Commit message improvement.
- Many small fixes for comments from the last version.
Links to previous discussions
[1] Original design proposal: https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/ [2] Updated proposal and RFC patch v1: https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@lin... [3] Patch v5: https://lkml.org/lkml/2022/5/19/861
Chao Peng (12): mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE mm: Introduce memfile_notifier mm/memfd: Introduce MFD_INACCESSIBLE flag KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS KVM: Use gfn instead of hva for mmu_notifier_retry KVM: Rename mmu_notifier_* KVM: Extend the memslot to support fd-based private memory KVM: Add KVM_EXIT_MEMORY_FAULT exit KVM: Register/unregister the guest private memory regions KVM: Handle page fault for private memory KVM: Enable and expose KVM_MEM_PRIVATE
Kirill A. Shutemov (1): mm/shmem: Support memfile_notifier
Documentation/virt/kvm/api.rst | 77 +++++- arch/arm64/kvm/mmu.c | 8 +- arch/mips/include/asm/kvm_host.h | 2 +- arch/mips/kvm/mmu.c | 10 +- arch/powerpc/include/asm/kvm_book3s_64.h | 2 +- arch/powerpc/kvm/book3s_64_mmu_host.c | 4 +- arch/powerpc/kvm/book3s_64_mmu_hv.c | 4 +- arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 +- arch/powerpc/kvm/book3s_hv_nested.c | 2 +- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 8 +- arch/powerpc/kvm/e500_mmu_host.c | 4 +- arch/riscv/kvm/mmu.c | 4 +- arch/x86/include/asm/kvm_host.h | 3 +- arch/x86/kvm/Kconfig | 3 + arch/x86/kvm/mmu.h | 2 - arch/x86/kvm/mmu/mmu.c | 74 +++++- arch/x86/kvm/mmu/mmu_internal.h | 18 ++ arch/x86/kvm/mmu/mmutrace.h | 1 + arch/x86/kvm/mmu/paging_tmpl.h | 4 +- arch/x86/kvm/x86.c | 2 +- include/linux/kvm_host.h | 105 +++++--- include/linux/memfile_notifier.h | 91 +++++++ include/linux/shmem_fs.h | 2 + include/uapi/linux/fcntl.h | 1 + include/uapi/linux/kvm.h | 37 +++ include/uapi/linux/memfd.h | 1 + mm/Kconfig | 4 + mm/Makefile | 1 + mm/memfd.c | 18 +- mm/memfile_notifier.c | 123 ++++++++++ mm/shmem.c | 125 +++++++++- tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++ virt/kvm/Kconfig | 3 + virt/kvm/kvm_main.c | 272 ++++++++++++++++++--- virt/kvm/pfncache.c | 14 +- 35 files changed, 1074 insertions(+), 127 deletions(-) create mode 100644 include/linux/memfile_notifier.h create mode 100644 mm/memfile_notifier.c
On Wed, Jul 13, 2022 at 05:58:32AM +0200, Gupta, Pankaj wrote:
This is the v7 of this series which tries to implement the fd-based KVM guest private memory. The patches are based on latest kvm/queue branch commit:
b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity
Introduction
In general this patch series introduce fd-based memslot which provides guest memory through memory file descriptor fd[offset,size] instead of hva/size. The fd can be created from a supported memory filesystem like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
Thinking a bit: as the host-side fd on tmpfs or shmem will store memory in the host page cache instead of mapping pages into the userspace address space, can we hit a double (un-coordinated) page cache problem with this when the guest page cache is also used?
This is my understanding: in host it will be indeed in page cache (in current shmem implementation) but that's just the way it allocates and provides the physical memory for the guest. In guest, guest OS will not see this fd (absolutely), it only sees guest memory, on top of which it can build its own page cache system for its own file-mapped content but that is unrelated to host page cache.
Chao
Thanks, Pankaj
This is the v7 of this series which tries to implement the fd-based KVM guest private memory. The patches are based on latest kvm/queue branch commit:
b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
split_desc_cache only by default capacity
Introduction
In general this patch series introduce fd-based memslot which provides guest memory through memory file descriptor fd[offset,size] instead of hva/size. The fd can be created from a supported memory filesystem like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
Thinking a bit: as the host-side fd on tmpfs or shmem will store memory in the host page cache instead of mapping pages into the userspace address space, can we hit a double (un-coordinated) page cache problem with this when the guest page cache is also used?
This is my understanding: in host it will be indeed in page cache (in current shmem implementation) but that's just the way it allocates and provides the physical memory for the guest. In guest, guest OS will not see this fd (absolutely), it only sees guest memory, on top of which it can build its own page cache system for its own file-mapped content but that is unrelated to host page cache.
Yes. If the guest fills its page cache with file-backed memory, this at the host side (on the shmem fd backend) will also fill the host page cache fast. This can have an impact on the performance of guest VMs if the host hits memory pressure sooner. Or else we end up utilizing way less system RAM.
Thanks, Pankaj
On Wed, Jul 13, 2022 at 12:35:56PM +0200, Gupta, Pankaj wrote:
This is the v7 of this series which tries to implement the fd-based KVM guest private memory. The patches are based on latest kvm/queue branch commit:
b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
split_desc_cache only by default capacity
Introduction
In general this patch series introduce fd-based memslot which provides guest memory through memory file descriptor fd[offset,size] instead of hva/size. The fd can be created from a supported memory filesystem like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
Thinking a bit: as the host-side fd on tmpfs or shmem will store memory in the host page cache instead of mapping pages into the userspace address space, can we hit a double (un-coordinated) page cache problem with this when the guest page cache is also used?
This is my understanding: in host it will be indeed in page cache (in current shmem implementation) but that's just the way it allocates and provides the physical memory for the guest. In guest, guest OS will not see this fd (absolutely), it only sees guest memory, on top of which it can build its own page cache system for its own file-mapped content but that is unrelated to host page cache.
Yes. If the guest fills its page cache with file-backed memory, this at the host side (on the shmem fd backend) will also fill the host page cache fast. This can have an impact on the performance of guest VMs if the host hits memory pressure sooner. Or else we end up utilizing way less system RAM.
Currently, the file-backed guest private memory is long-term pinned and not reclaimable; it's in the page cache anyway once we allocate it for the guest. This does not depend on how the guest uses it (e.g. for guest page cache or not).
Chao
Thanks, Pankaj
This is the v7 of this series which tries to implement the fd-based KVM guest private memory. The patches are based on latest kvm/queue branch commit:
b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU
split_desc_cache only by default capacity
Introduction
In general this patch series introduce fd-based memslot which provides guest memory through memory file descriptor fd[offset,size] instead of hva/size. The fd can be created from a supported memory filesystem like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
Thinking a bit: as the host-side fd on tmpfs or shmem will store memory in the host page cache instead of mapping pages into the userspace address space, can we hit a double (un-coordinated) page cache problem with this when the guest page cache is also used?
This is my understanding: in host it will be indeed in page cache (in current shmem implementation) but that's just the way it allocates and provides the physical memory for the guest. In guest, guest OS will not see this fd (absolutely), it only sees guest memory, on top of which it can build its own page cache system for its own file-mapped content but that is unrelated to host page cache.
Yes. If the guest fills its page cache with file-backed memory, this at the host side (on the shmem fd backend) will also fill the host page cache fast. This can have an impact on the performance of guest VMs if the host hits memory pressure sooner. Or else we end up utilizing way less system RAM.
Currently, the file-backed guest private memory is long-term pinned and not reclaimable; it's in the page cache anyway once we allocate it for the guest. This does not depend on how the guest uses it (e.g. for guest page cache or not).
Even if host shmem-backed memory is always un-reclaimable, do we end up utilizing double the RAM (both in the guest and the host page cache) for guest disk accesses?
I am considering this a serious design decision before we commit to this approach.
Happy to be enlightened on this and know the thoughts from others as well.
Thanks, Pankaj
This is the v7 of this series which tries to implement the fd-based KVM guest private memory. The patches are based on latest kvm/queue branch commit:
b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity
Introduction
In general this patch series introduce fd-based memslot which provides guest memory through memory file descriptor fd[offset,size] instead of hva/size. The fd can be created from a supported memory filesystem like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
Thinking a bit: as the host-side fd on tmpfs or shmem will store memory in the host page cache instead of mapping pages into the userspace address space, can we hit a double (un-coordinated) page cache problem with this when the guest page cache is also used?
This is my understanding: in host it will be indeed in page cache (in current shmem implementation) but that's just the way it allocates and provides the physical memory for the guest. In guest, guest OS will not see this fd (absolutely), it only sees guest memory, on top of which it can build its own page cache system for its own file-mapped content but that is unrelated to host page cache.
yes. If guest fills its page cache with file backed memory, this at host side(on shmem fd backend) will also fill the host page cache fast. This can have an impact on performance of guest VM's if host goes to memory pressure situation sooner. Or else we end up utilizing way less System RAM.
(Currently), the file backed guest private memory is long-term pinned and not reclaimable, it's in page cache anyway once we allocated it for guest. This does not depend on how guest use it (e.g. use it for guest page cache or not).
Even if host shmem backed memory always be always un-reclaimable, we end up utilizing double RAM (both in guest & host page cache) for guest disk accesses?
Answering my own question:
We won't use double the RAM; only the view in the guest and host structures changes depending on the code path taken. If we don't care about reclaim situations we should be good; otherwise we have to think of something to coordinate the page cache between guest and host (that could be an optimization for later).
Thanks, Pankaj
On Wed, Jul 13, 2022, at 3:35 AM, Gupta, Pankaj wrote:
Is this in any meaningful way different from a regular VM?
--Andy
After thinking about it a bit, it seems 'No', except for the reclaim decisions the system would take under memory pressure; we will also have to see how well this gets stitched together with memory tiers in the future. But all of these are future topics.
Sorry for the noise!
Thanks, Pankaj
On 06/07/22 13:50, Chao Peng wrote:
While debugging an issue with SEV+UPM, I found that fallocate() returns an error (EINTR) in QEMU which is not handled. With the below handling of EINTR, a subsequent fallocate() succeeds:
diff --git a/backends/hostmem-memfd-private.c b/backends/hostmem-memfd-private.c
index af8fb0c957..e8597ed28d 100644
--- a/backends/hostmem-memfd-private.c
+++ b/backends/hostmem-memfd-private.c
@@ -39,7 +39,7 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     MachineState *machine = MACHINE(qdev_get_machine());
     uint32_t ram_flags;
     char *name;
-    int fd, priv_fd;
+    int fd, priv_fd, ret;
 
     if (!backend->size) {
         error_setg(errp, "can't create backend with size 0");
@@ -65,7 +65,15 @@ priv_memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
                                      backend->size, ram_flags, fd, 0, errp);
     g_free(name);
 
-    fallocate(priv_fd, 0, 0, backend->size);
+again:
+    ret = fallocate(priv_fd, 0, 0, backend->size);
+    if (ret) {
+        perror("Fallocate failed: \n");
+        if (errno == EINTR)
+            goto again;
+        else
+            exit(1);
+    }
However, fallocate() preallocates full guest memory before starting the guest. With this behaviour guest memory is *not* demand pinned. Is there a way to prevent fallocate() from reserving full guest memory?
Regards, Nikunj
Isn't the pinning handled by the corresponding host memory backend, with mmu notifier and architecture support, while doing memory operations such as page migration and swapping/reclaim (not supported currently, AFAIU)? But yes, we need to allocate the entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE, etc.}.
Thanks, Pankaj
On Thu, Aug 11, 2022 at 01:30:06PM +0200, Gupta, Pankaj wrote:
The QEMU code has not been well tested, so it's not surprising you hit a problem. But per the man page, EINTR means a signal was caught; do you know which signal number it was?
Thanks for your patch, but before we change it in QEMU I want to make sure it's indeed a QEMU issue (e.g. not a kernel issue).
Isn't the pinning handled by the corresponding host memory backend, with mmu notifier and architecture support, while doing memory operations such as page migration and swapping/reclaim (not supported currently, AFAIU)? But yes, we need to allocate the entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE, etc.}.
Right.
Thanks, Pankaj
On 11/08/22 19:02, Chao Peng wrote:
The QEMU code has not been well tested, so it's not surprising you hit a problem. But per the man page, EINTR means a signal was caught; do you know which signal number it was?
I haven't checked that, but it should be fairly straightforward to get. I presume you are referring to signal_pending() in shmem_fallocate().
Thanks for your patch, but before we change it in QEMU I want to make sure it's indeed a QEMU issue (e.g. not a kernel issue).
As per the manual, fallocate() can return EINTR, and this should be handled by userspace.
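In isolation, the retry being discussed is just the usual EINTR loop around fallocate(); a minimal sketch (purely illustrative, not the proposed QEMU change) looks like this:

/*
 * Minimal sketch of retrying fallocate() on EINTR, as discussed above.
 * This is not the proposed QEMU change, just the retry pattern on its own.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

static int fallocate_retry(int fd, int mode, off_t offset, off_t len)
{
    int ret;

    do {
        ret = fallocate(fd, mode, offset, len);
    } while (ret < 0 && errno == EINTR);   /* retry if interrupted by a signal */

    return ret;   /* 0 on success, -1 with errno set on any other error */
}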
Regards Nikunj
On 11/08/22 19:02, Chao Peng wrote:
However, fallocate() preallocates full guest memory before starting the guest. With this behaviour guest memory is *not* demand pinned.
This is with reference to the SEV demand-pinning patches that I was working on. The understanding was that UPM would not reserve memory for an SEV/TDX guest up front, similar to a normal guest. Here is the relevant quote from the discussion with Sean [1]:
"I think we should abandon this approach in favor of committing all our resources to fd-based private memory[*], which (if done right) will provide on-demand pinning for "free". "
Is there a way to prevent fallocate() from reserving full guest memory?
Regards Nikunj [1] https://lore.kernel.org/kvm/YkIh8zM7XfhsFN8L@google.com/
On 11/08/22 17:00, Gupta, Pankaj wrote:
Isn't the pinning handled by the corresponding host memory backend, with mmu notifier and architecture support, while doing memory operations such as page migration and swapping/reclaim (not supported currently, AFAIU)? But yes, we need to allocate the entire guest memory with the new flags MEMFILE_F_{UNMOVABLE, UNRECLAIMABLE, etc.}.
That is correct, but the question is when the memory gets allocated; once these flags are set, memory is neither moved nor reclaimed. In the current scenario, if I start a 32GB guest, all 32GB is allocated.
Regards Nikunj
On 8/11/2022 7:18 PM, Nikunj A. Dadhania wrote:
I guess so, if guest memory is private by default.
The other option would be to allocate memory as shared by default and handle on-demand allocation and RMPUPDATE with the page-state-change event. But that would still be done at guest boot time, IIUC.
I might be missing some details on this, so better to wait for someone more familiar to answer.
Thanks, Pankaj
Sorry! I don't want to hijack the other thread, so replying here.
I thought the question was about SEV-SNP. For SEV, maybe the hypercall with the page state information can be used to allocate memory as it is used, or something like quota-based memory allocation (just thinking).
I might be missing some details on this, so better to wait for someone more familiar to answer.
Same applies here :)
Thanks, Pankaj
But all of this would have considerable performance overhead (if memory is shared by default) and would mostly matter at boot time. So preallocating memory (memory private by default) seems the better approach for both SEV & SEV-SNP, with later page management (pinning, reclaim) taken care of by the host memory backend and architecture together.
Or maybe later we can think of something like allowing direct page faults on host memory access for *SEV* guests, as there is no strict memory integrity guarantee requirement there, which would avoid the performance overhead.
I don't know if it is feasible; just sharing my thoughts.
Thanks, Pankaj
On 12/08/22 12:48, Gupta, Pankaj wrote:
I am not sure how pre-allocating memory will help; even if the guest would not use the full memory it will be pre-allocated, which, if I understand correctly, is not expected.
Regards Nikunj
For SEV I am also not very sure what the best way would be. There could be a tradeoff between memory pinning and performance, as I was also thinking about the "async page fault" aspect of the guest in my previous reply. Details need to be figured out.
Quoting my previous reply here: "Or maybe later we can think of something like allowing direct page faults on host memory access for *SEV* guests, as there is no strict memory integrity guarantee requirement there, which would avoid the performance overhead."
Thanks, Pankaj
On Fri, Aug 12, 2022 at 02:18:43PM +0530, Nikunj A. Dadhania wrote:
I am not sure how pre-allocating memory will help; even if the guest would not use the full memory it will be pre-allocated, which, if I understand correctly, is not expected.
Actually the current version allows you to delay the allocation to a later time (e.g. page fault time) if you don't call fallocate() on the private fd. fallocate() was necessary in previous versions because we treated existence in the fd as 'private', but in this version we track the private/shared info in KVM, so we don't rely on that fact from the memory backing stores.
Definitely the page will still be pinned once it's allocated; there is no way to swap it out, for example, with just the current code. That kind of support, if desirable, can be added through a MOVABLE flag and some other callbacks to let feature-specific code get involved.
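If so, the QEMU side could in principle make the upfront fallocate() optional rather than unconditional. A rough sketch under that assumption (the prealloc knob and helper name are hypothetical, not actual QEMU code):

/*
 * Hypothetical sketch: with private/shared tracking done in KVM, the upfront
 * fallocate() on the private fd becomes optional.  'prealloc' stands in for
 * whatever knob userspace would use to choose between preallocating and
 * letting pages be allocated at fault time; none of this is actual QEMU code.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

static int maybe_preallocate_private_fd(int priv_fd, off_t size, bool prealloc)
{
    int ret = 0;

    if (prealloc) {
        do {
            ret = fallocate(priv_fd, 0, 0, size);
        } while (ret < 0 && errno == EINTR);
    }
    /* else: leave the fd sparse; pages get allocated on first guest access */

    return ret;
}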
Chao
On 15/08/22 18:34, Chao Peng wrote:
Thanks for confirming, Chao. In that case we can drop fallocate() from QEMU in both cases:
* once while creating the private memfd object, and
* during ram_block_convert_range() for shared->private and vice versa.
Definitely the page will still be pinned once it's allocated; there is no way to swap it out, for example, with just the current code.
Agreed; at present, once the page is brought in, it will remain until VM shutdown.
That kind of support, if desirable, can be added through a MOVABLE flag and some other callbacks to let feature-specific code get involved.
Sure, that could be future work.
Regards Nikunj
Hi Chao,
Actually the current version allows you to delay the allocation to a later time (e.g. page fault time) if you don't call fallocate() on the private fd. fallocate() was necessary in previous versions because we treated existence in the fd as 'private', but in this version we track the private/shared info in KVM, so we don't rely on that fact from the memory backing stores.
Does this also mean reservation of guest physical memory with secure processor (both for SEV-SNP & TDX) will also happen at page fault time?
Do we plan to keep it this way?
Thanks, Pankaj
On Tue, Aug 16, 2022 at 01:33:00PM +0200, Gupta, Pankaj wrote:
If you are talking about accepting memory by the guest, it is initiated by the guest and has nothing to do with page fault time vs fallocate() allocation of host memory. I mean acceptance happens after host memory allocation but they are not in lockstep, acceptance can happen much later.
Actually the current version allows you to delay the allocation to a later time (e.g. page fault time) if you don't call fallocate() on the private fd. fallocate() was necessary in previous versions because we treated the existence in the fd as 'private', but in this version we track private/shared info in KVM so we don't rely on that fact from the memory backing stores.
Does this also mean reservation of guest physical memory with secure processor (both for SEV-SNP & TDX) will also happen at page fault time?
Do we plan to keep it this way?
If you are talking about accepting memory by the guest, it is initiated by the guest and has nothing to do with page fault time vs fallocate() allocation of host memory. I mean acceptance happens after host memory allocation but they are not in lockstep, acceptance can happen much later.
No, I meant reserving guest physical memory range from hypervisor e.g with RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).
Thanks, Pankaj
On Tue, Aug 16, 2022, Gupta, Pankaj wrote:
Actually the current version allows you to delay the allocation to a later time (e.g. page fault time) if you don't call fallocate() on the private fd. fallocate() was necessary in previous versions because we treated the existence in the fd as 'private', but in this version we track private/shared info in KVM so we don't rely on that fact from the memory backing stores.
Does this also mean reservation of guest physical memory with secure processor (both for SEV-SNP & TDX) will also happen at page fault time?
Do we plan to keep it this way?
If you are talking about accepting memory by the guest, it is initiated by the guest and has nothing to do with page fault time vs fallocate() allocation of host memory. I mean acceptance happens after host memory allocation but they are not in lockstep, acceptance can happen much later.
No, I meant reserving guest physical memory range from hypervisor e.g with RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).
As proposed, RMP/PAMT updates will occur in the fault path, i.e. there is no way for userspace to pre-map guest memory.
I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic vCPU-scoped ioctl() that allows userspace to pre-map guest memory. Supporting initializing guest private memory with a source page can be implemented via a flag. That also gives KVM line of sight to in-place "conversion", e.g. another flag could be added to say that the dest is also the source.
The TDX and SNP restrictions would then become additional restrictions on when initializing with a source is allowed (and VMs that don't have guest private memory wouldn't allow the flag at all).
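For illustration only, here is a hypothetical shape such a generic vCPU-scoped pre-map ioctl could take; none of the names or flags below exist in KVM, they simply sketch the "optional source" and "in-place conversion" ideas.

#include <linux/types.h>

/* Hypothetical flags, not part of any real uapi. */
#define KVM_PRE_MAP_WITH_SOURCE	(1ULL << 0)	/* copy initial payload from source_addr */
#define KVM_PRE_MAP_IN_PLACE	(1ULL << 1)	/* dest is also the source ("conversion") */

/* Hypothetical argument structure for a vCPU-scoped pre-map ioctl. */
struct kvm_pre_map_memory {
	__u64 gpa;		/* guest physical address to populate */
	__u64 size;		/* bytes to pre-map */
	__u64 source_addr;	/* userspace source, if WITH_SOURCE is set */
	__u64 flags;		/* KVM_PRE_MAP_* */
};

/*
 * Usage would be per vCPU, e.g.:
 *	ioctl(vcpu_fd, KVM_PRE_MAP_MEMORY, &region);
 * TDX/SNP would then add their own restrictions on when WITH_SOURCE is
 * allowed, and non-confidential VMs would reject the flag entirely.
 */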
On Tue, Aug 16, 2022 at 03:38:08PM +0000, Sean Christopherson wrote:
On Tue, Aug 16, 2022, Gupta, Pankaj wrote:
Actually the current version allows you to delay the allocation to a later time (e.g. page fault time) if you don't call fallocate() on the private fd. fallocate() was necessary in previous versions because we treated the existence in the fd as 'private', but in this version we track private/shared info in KVM so we don't rely on that fact from the memory backing stores.
Does this also mean reservation of guest physical memory with secure processor (both for SEV-SNP & TDX) will also happen at page fault time?
Do we plan to keep it this way?
If you are talking about accepting memory by the guest, it is initiated by the guest and has nothing to do with page fault time vs fallocate() allocation of host memory. I mean acceptance happens after host memory allocation but they are not in lockstep, acceptance can happen much later.
No, I meant reserving guest physical memory range from hypervisor e.g with RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).
As proposed, RMP/PAMT updates will occur in the fault path, i.e. there is no way for userspace to pre-map guest memory.
Hi Sean,
Currently I have the rmpupdate hook in KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION ioctls, so that when the pages actually get faulted in they are already in the expected state. I have userspace set up to call KVM_MEMORY_ENCRYPT_* in response to explicit page state changes issued by the guest, as well as in response to MEMORY_FAULT exits for implicit page state changes.
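A sketch of that userspace side, using the existing struct kvm_enc_region from <linux/kvm.h>; it assumes, as in this series, that the region is expressed in guest physical addresses (legacy SEV interprets the same ioctl in host virtual addresses), and it omits error handling.

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Flip a guest range between shared and private in response to an
 * explicit page-state-change request or a MEMORY_FAULT exit. */
static int set_range_private(int vm_fd, __u64 gpa, __u64 size, int to_private)
{
	struct kvm_enc_region region = {
		.addr = gpa,
		.size = size,
	};

	return ioctl(vm_fd,
		     to_private ? KVM_MEMORY_ENCRYPT_REG_REGION
				: KVM_MEMORY_ENCRYPT_UNREG_REGION,
		     &region);
}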
Initially the private backing store may or may not be pre-fallocate()'d depending on how userspace wants to handle it. If it's not pre-fallocate()'d, then the pages don't get faulted in until the guest does explicit page state changes (currently SNP guests will do this for all memory at boot time, but with unaccepted memory patches for guest/ovmf this will happen during guest run-time, which would still allow us to make efficient use of lazy-pinning support for shorter boot times).
If userspace wants to pre-allocate, it can issue the fallocate() for all the ranges up-front so it doesn't incur the cost during run-time.
Is that compatible with the proposed design?
Of course, for the initial encrypted payload, we would need to issue the KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION up-front. I'm doing that in conjunction with the hack to allow pwrite() to the memfd to pre-populate the private pages before the in-place encryption that occurs when SNP_LAUNCH_UPDATE is issued...
In the past you and Vishal suggested doing the copy from within SNP_LAUNCH_UPDATE, which seems like a workable solution and something we've been meaning to implement...
I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic vCPU-scoped ioctl() that allows userspace to pre-map guest memory. Supporting initializing guest private memory with a source page can be implemented via a flag. That also gives KVM line of sight to in-place "conversion", e.g. another flag could be added to say that the dest is also the source.
So is this proposed ioctl only intended to handle the initial encrypted payload, and the KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION ioctls would still be used for conversions post-boot?
If so, that seems reasonable, but I thought there was some consensus that just handling it per-platform in, e.g., SNP_LAUNCH_UPDATE, was sufficient for now until some additional need arose for a new interface. Has something changed in that regard? Just want to understand the motivations so we can plan accordingly.
Thanks!
-Mike
The TDX and SNP restrictions would then become additional restrictions on when initializing with a source is allowed (and VMs that don't have guest private memory wouldn't allow the flag at all).
On Wed, Aug 17, 2022 at 10:27:19AM -0500, Michael Roth <michael.roth@amd.com> wrote:
I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic vCPU-scoped ioctl() that allows userspace to pre-map guest memory. Supporting initializing guest private memory with a source page can be implemented via a flag. That also gives KVM line of sight to in-place "conversion", e.g. another flag could be added to say that the dest is also the source.
So is this proposed ioctl only intended to handle the initial encrypted payload, and the KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION ioctls would still be used for conversions post-boot?
Yes. It is called before running any vcpu. At run time (after running vcpus), KVM_MEMORY_ENCRYPT_{REG,UNREG}_REGION is used.
Actually the current version allows you to delay the allocation to a later time (e.g. page fault time) if you don't call fallocate() on the private fd. fallocate() was necessary in previous versions because we treated the existence in the fd as 'private', but in this version we track private/shared info in KVM so we don't rely on that fact from the memory backing stores.
Does this also mean reservation of guest physical memory with secure processor (both for SEV-SNP & TDX) will also happen at page fault time?
Do we plan to keep it this way?
If you are talking about accepting memory by the guest, it is initiated by the guest and has nothing to do with page fault time vs fallocate() allocation of host memory. I mean acceptance happens after host memory allocation but they are not in lockstep, acceptance can happen much later.
No, I meant reserving guest physical memory range from hypervisor e.g with RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).
As proposed, RMP/PAMT updates will occur in the fault path, i.e. there is no way for userspace to pre-map guest memory.
I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic vCPU-scoped ioctl() that allows userspace to pre-map guest memory. Supporting initializing guest private memory with a source page can be implemented via a flag. That also gives KVM line of sight to in-place "conversion", e.g. another flag could be added to say that the dest is also the source.
Questions to clarify *my* understanding here:
- Do you suggest turning KVM_TDX_INIT_MEM_REGION into a generic ioctl to pre-map guest private memory, in addition to initializing the payload (in-place encryption or just copying a page to guest private memory)?
- Want to clarify "pre-map": Are you suggesting using the ioctl to avoid the RMP/PAMT registration at guest page fault time, and instead pre-map guest private memory, i.e. allocate and do RMP/PAMT registration before running the actual guest vCPUs?
Thanks, Pankaj
The TDX and SNP restrictions would then become additional restrictions on when initializing with a source is allowed (and VMs that don't have guest private memory wouldn't allow the flag at all).
On Wed, 6 Jul 2022, Chao Peng wrote:
This is the v7 of this series which tries to implement the fd-based KVM guest private memory.
Here at last are my reluctant thoughts on this patchset.
fd-based approach for supporting KVM guest private memory: fine.
Use or abuse of memfd and shmem.c: mistaken.
memfd_create() was an excellent way to put together the initial prototype.
But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
What use do you have for a filesystem here? Almost none. IIUC, what you want is an fd through which QEMU can allocate kernel memory, selectively free that memory, and communicate fd+offset+length to KVM. And perhaps an interface to initialize a little of that memory from a template (presumably copied from a real file on disk somewhere).
You don't need shmem.c or a filesystem for that!
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Some of these impressions may come from earlier iterations of the patchset (v7 looks better in several ways than v5). I am probably underestimating the extent to which you have taken on board other usages beyond TDX and SEV private memory, and rightly want to serve them all with similar interfaces: perhaps there is enough justification for shmem there, but I don't see it. There was mention of userfaultfd in one link: does that provide the justification for using shmem?
I'm afraid of the special demands you may make of memory allocation later on - surprised that huge pages are not mentioned already; gigantic contiguous extents? secretmem removed from direct map?
Here's what I would prefer, and imagine much easier for you to maintain; but I'm no system designer, and may be misunderstanding throughout.
QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps the fallocate syscall interface itself) to allocate and free the memory, ioctl for initializing some of it too. KVM in control of whether that fd can be read or written or mmap'ed or whatever, no need to prevent it in shmem.c, no need for flags, seals, notifications to and fro because KVM is already in control and knows the history. If shmem actually has value, call into it underneath - somewhat like SysV SHM, and /dev/zero mmap, and i915/gem make use of it underneath. If shmem has nothing to add, just allocate and free kernel memory directly, recorded in your own xarray.
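A purely hypothetical sketch of that flow; the device path, ioctl numbers and structure below do not exist anywhere and are invented only to show the shape of the proposal.

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct gmem_range {			/* hypothetical */
	uint64_t offset;
	uint64_t size;
};

#define GMEM_ALLOCATE	_IOW('g', 0x01, struct gmem_range)	/* hypothetical */
#define GMEM_FREE	_IOW('g', 0x02, struct gmem_range)	/* hypothetical */

int guest_mem_setup(uint64_t size)
{
	struct gmem_range range = { .offset = 0, .size = size };
	int fd = open("/dev/kvm_gmem", O_RDWR);	/* hypothetical device */

	if (fd < 0)
		return -1;

	/* KVM allocates (and could initialize) the memory itself ... */
	if (ioctl(fd, GMEM_ALLOCATE, &range) < 0) {
		close(fd);
		return -1;
	}

	/* ... and fd + offset + length is what gets handed to the memslot. */
	return fd;
}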
With that /dev/kvm_something subject to access controls and LSMs - which I cannot find for memfd_create(). Full marks for including the MFD_INACCESSIBLE manpage update, and for Cc'ing linux-api: but I'd have expected some doubts from that direction already.
Hugh
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
On Wed, 6 Jul 2022, Chao Peng wrote:
This is the v7 of this series which tries to implement the fd-based KVM guest private memory.
Here at last are my reluctant thoughts on this patchset.
fd-based approach for supporting KVM guest private memory: fine.
Use or abuse of memfd and shmem.c: mistaken.
memfd_create() was an excellent way to put together the initial prototype.
But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
What use do you have for a filesystem here? Almost none. IIUC, what you want is an fd through which QEMU can allocate kernel memory, selectively free that memory, and communicate fd+offset+length to KVM. And perhaps an interface to initialize a little of that memory from a template (presumably copied from a real file on disk somewhere).
You don't need shmem.c or a filesystem for that!
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1]. And swapping is theoretically possible, but I'm not aware of any plans as of now.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-t...
Some of these impressions may come from earlier iterations of the patchset (v7 looks better in several ways than v5). I am probably underestimating the extent to which you have taken on board other usages beyond TDX and SEV private memory, and rightly want to serve them all with similar interfaces: perhaps there is enough justification for shmem there, but I don't see it. There was mention of userfaultfd in one link: does that provide the justification for using shmem?
I'm afraid of the special demands you may make of memory allocation later on - surprised that huge pages are not mentioned already; gigantic contiguous extents? secretmem removed from direct map?
The design allows for extension to hugetlbfs if needed. Combination of MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero implications for shmem. It is going to be separate struct memfile_backing_store.
I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE to be movable if platform supports it and secretmem is not migratable by design (without direct mapping fragmentations).
Here's what I would prefer, and imagine much easier for you to maintain; but I'm no system designer, and may be misunderstanding throughout.
QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps the fallocate syscall interface itself) to allocate and free the memory, ioctl for initializing some of it too. KVM in control of whether that fd can be read or written or mmap'ed or whatever, no need to prevent it in shmem.c, no need for flags, seals, notifications to and fro because KVM is already in control and knows the history. If shmem actually has value, call into it underneath - somewhat like SysV SHM, and /dev/zero mmap, and i915/gem make use of it underneath. If shmem has nothing to add, just allocate and free kernel memory directly, recorded in your own xarray.
I guess a shim layer on top of shmem *can* work. I don't see immediately why it would not. But I'm not sure it is the right direction. We risk creating yet another parallel VM with its own rules/locking/accounting that is opaque to core-mm.
Note that on machines that run TDX guests such memory would likely be the bulk of memory use. Treating it as a fringe case may bite us one day.
On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
On Wed, 6 Jul 2022, Chao Peng wrote: But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
But QEMU and other VMMs are users of shmem and memfd. The new features certainly aren't useful for _all_ existing users, but I don't think it's fair to say that they're not useful for _any_ existing users.
What use do you have for a filesystem here? Almost none. IIUC, what you want is an fd through which QEMU can allocate kernel memory, selectively free that memory, and communicate fd+offset+length to KVM. And perhaps an interface to initialize a little of that memory from a template (presumably copied from a real file on disk somewhere).
You don't need shmem.c or a filesystem for that!
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1].
And this isn't intended for just TDX (or SNP, or pKVM). We're not _that_ far off from being able to use UPM for "regular" VMs as a way to provide defense-in-depth without having to take on the overhead of confidential VMs. At that point, migration and probably even swap are on the table.
And swapping is theoretically possible, but I'm not aware of any plans as of now.
Ya, I highly doubt confidential VMs will ever bother with swap.
I'm afraid of the special demands you may make of memory allocation later on - surprised that huge pages are not mentioned already; gigantic contiguous extents? secretmem removed from direct map?
The design allows for extension to hugetlbfs if needed. Combination of MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero implications for shmem. It is going to be separate struct memfile_backing_store.
I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE to be movable if platform supports it and secretmem is not migratable by design (without direct mapping fragmentations).
But secretmem _could_ be a fit. If a use case wants to unmap guest private memory from both userspace and the kernel then KVM should absolutely be able to support that, but at the same time I don't want to have to update KVM to enable secretmem (and I definitely don't want KVM poking into the directmap itself).
MFD_INACCESSIBLE should only say "this memory can't be mapped into userspace"; any other properties should be completely separate, e.g. the inability to migrate pages is effectively a restriction from KVM (acting on behalf of TDX/SNP), not a fundamental property of MFD_INACCESSIBLE.
On Fri, 19 Aug 2022, Sean Christopherson wrote:
On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
On Wed, 6 Jul 2022, Chao Peng wrote: But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
But QEMU and other VMMs are users of shmem and memfd. The new features certainly aren't useful for _all_ existing users, but I don't think it's fair to say that they're not useful for _any_ existing users.
Okay, I stand corrected: there exist some users of memfd_create() who will also have use for "INACCESSIBLE" memory.
What use do you have for a filesystem here? Almost none. IIUC, what you want is an fd through which QEMU can allocate kernel memory, selectively free that memory, and communicate fd+offset+length to KVM. And perhaps an interface to initialize a little of that memory from a template (presumably copied from a real file on disk somewhere).
You don't need shmem.c or a filesystem for that!
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1].
And this isn't intended for just TDX (or SNP, or pKVM). We're not _that_ far off from being able to use UPM for "regular" VMs as a way to provide defense-in-depth
UPM? That's an acronym from your side of the fence, I spy references to it in the mail threads, but haven't tracked down a definition. I'll just take it to mean the fd-based memory we're discussing.
without having to take on the overhead of confidential VMs. At that point, migration and probably even swap are on the table.
Good, the more "flexible" that memory is, the better for competing users of memory. But an fd supplied by KVM gives you freedom to change to a better implementation of allocation underneath, whenever it suits you. Maybe shmem beneath is good from the start, maybe not.
Hugh
On Thu, Aug 18, 2022, Hugh Dickins wrote:
On Fri, 19 Aug 2022, Sean Christopherson wrote:
On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1].
And this isn't intended for just TDX (or SNP, or pKVM). We're not _that_ far off from being able to use UPM for "regular" VMs as a way to provide defense-in-depth
UPM? That's an acronym from your side of the fence, I spy references to it in the mail threads, but haven't tracked down a definition. I'll just take it to mean the fd-based memory we're discussing.
Ya, sorry, UPM is what we came up with as shorthand for "Unmapping guest Private Memory". Your assumption is spot on, it's just a fancy way of saying "guest is backed with inaccessible fd-based memory".
without having to take on the overhead of confidential VMs. At that point, migration and probably even swap are on the table.
Good, the more "flexible" that memory is, the better for competing users of memory. But an fd supplied by KVM gives you freedom to change to a better implementation of allocation underneath, whenever it suits you. Maybe shmem beneath is good from the start, maybe not.
The main flaw with KVM providing the fd is that it forces KVM to get into the memory management business, which us KVM folks really, really do not want to do. And based on the types of bugs KVM has had in the past related to memory management, it's a safe bet to say the mm folks don't want us getting involved either :-)
The combination of gup()/follow_pte() and mmu_notifiers has worked very well. KVM gets a set of (relatively) simple rules to follow and doesn't have to be taught new things every time a new backing type comes along. And from the other side, KVM has very rarely had to go poke into other subsystems' code to support exposing a new type of memory to guests.
What we're trying to do with UPM/fd-based memory is establish a similar contract between mm and KVM, but without requiring mm to also map memory into host userspace.
The only way having KVM provide the fd works out in the long run is if KVM is the only subsystem that ever wants to make use of memory that isn't accessible from userspace and isn't tied to a specific backing type, _and_ if the set of backing types that KVM ever supports is kept to an absolute minimum.
On 19.08.22 05:38, Hugh Dickins wrote:
On Fri, 19 Aug 2022, Sean Christopherson wrote:
On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
On Wed, 6 Jul 2022, Chao Peng wrote: But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
But QEMU and other VMMs are users of shmem and memfd. The new features certainly aren't useful for _all_ existing users, but I don't think it's fair to say that they're not useful for _any_ existing users.
Okay, I stand corrected: there exist some users of memfd_create() who will also have use for "INACCESSIBLE" memory.
As raised in reply to the relevant patch, I'm not sure if we really have to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a requirement of specific memfd_notifier (memfile_notifier) implementations -- such as TDX that will convert the memory and MCE-kill the machine on ordinary write access. We might be able to set/enforce this when registering a notifier internally instead, and fail notifier registration if a condition isn't met (e.g., existing mmap).
So I'd be curious, which other users of shmem/memfd would benefit from (MMU)-"INACCESSIBLE" memory obtained via memfd_create()?
On Tue, Aug 23, 2022, David Hildenbrand wrote:
On 19.08.22 05:38, Hugh Dickins wrote:
On Fri, 19 Aug 2022, Sean Christopherson wrote:
On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
On Wed, 6 Jul 2022, Chao Peng wrote: But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
But QEMU and other VMMs are users of shmem and memfd. The new features certainly aren't useful for _all_ existing users, but I don't think it's fair to say that they're not useful for _any_ existing users.
Okay, I stand corrected: there exist some users of memfd_create() who will also have use for "INACCESSIBLE" memory.
As raised in reply to the relevant patch, I'm not sure if we really have to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a requirement of specific memfd_notifier (memfile_notifier) implementations -- such as TDX that will convert the memory and MCE-kill the machine on ordinary write access. We might be able to set/enforce this when registering a notifier internally instead, and fail notifier registration if a condition isn't met (e.g., existing mmap).
So I'd be curious, which other users of shmem/memfd would benefit from (MMU)-"INACCESSIBLE" memory obtained via memfd_create()?
I agree that there's no need to expose the inaccessible behavior via uAPI. Making it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other flags that get added in the future).
AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide any unique functionality.
If we go that route, we might want to have shmem/memfd require INACCESSIBLE to be set for the initial implementation. I.e. disallow binding without INACCESSIBLE until there's a use case.
On Tue, Aug 23, 2022 at 04:05:27PM +0000, Sean Christopherson wrote:
On Tue, Aug 23, 2022, David Hildenbrand wrote:
On 19.08.22 05:38, Hugh Dickins wrote:
On Fri, 19 Aug 2022, Sean Christopherson wrote:
On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
On Wed, 6 Jul 2022, Chao Peng wrote: But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
But QEMU and other VMMs are users of shmem and memfd. The new features certainly aren't useful for _all_ existing users, but I don't think it's fair to say that they're not useful for _any_ existing users.
Okay, I stand corrected: there exist some users of memfd_create() who will also have use for "INACCESSIBLE" memory.
As raised in reply to the relevant patch, I'm not sure if we really have to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a requirement of specific memfd_notifier (memfile_notifier) implementations -- such as TDX that will convert the memory and MCE-kill the machine on ordinary write access. We might be able to set/enforce this when registering a notifier internally instead, and fail notifier registration if a condition isn't met (e.g., existing mmap).
So I'd be curious, which other users of shmem/memfd would benefit from (MMU)-"INACCESSIBLE" memory obtained via memfd_create()?
I agree that there's no need to expose the inaccessible behavior via uAPI. Making it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other flags that get added in the future).
AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide any unique functionality.
That's also what I'm thinking. And I don't see a problem immediately if the user has populated the fd at binding time. Actually that looks like an advantage for the previously discussed guest payload pre-loading.
If we go that route, we might want to have shmem/memfd require INACCESSIBLE to be set for the initial implementation. I.e. disallow binding without INACCESSIBLE until there's a use case.
I can do that.
Chao
On 8/24/22 02:41, Chao Peng wrote:
On Tue, Aug 23, 2022 at 04:05:27PM +0000, Sean Christopherson wrote:
On Tue, Aug 23, 2022, David Hildenbrand wrote:
On 19.08.22 05:38, Hugh Dickins wrote:
On Fri, 19 Aug 2022, Sean Christopherson wrote:
On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
On Wed, 6 Jul 2022, Chao Peng wrote: But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
But QEMU and other VMMs are users of shmem and memfd. The new features certainly aren't useful for _all_ existing users, but I don't think it's fair to say that they're not useful for _any_ existing users.
Okay, I stand corrected: there exist some users of memfd_create() who will also have use for "INACCESSIBLE" memory.
As raised in reply to the relevant patch, I'm not sure if we really have to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a requirement of specific memfd_notifier (memfile_notifier) implementations -- such as TDX that will convert the memory and MCE-kill the machine on ordinary write access. We might be able to set/enforce this when registering a notifier internally instead, and fail notifier registration if a condition isn't met (e.g., existing mmap).
So I'd be curious, which other users of shmem/memfd would benefit from (MMU)-"INACCESSIBLE" memory obtained via memfd_create()?
I agree that there's no need to expose the inaccessible behavior via uAPI. Making it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other flags that get added in the future).
AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide any unique functionality.
That's also what I'm thinking. And I don't see a problem immediately if the user has populated the fd at binding time. Actually that looks like an advantage for the previously discussed guest payload pre-loading.
I think this gets awkward. Trying to define sensible semantics for what happens if a shmem or similar fd gets used as secret guest memory and that fd isn't utterly and completely empty can get quite nasty. For example:
If there are already mmaps, then TDX (much more so than SEV) really doesn't want to also use it as guest memory.
If there is already data in the fd, then maybe some technologies can use this for pre-population, but TDX needs explicit instructions in order to get the guest's hash right.
In general, it seems like it will be much more likely to actually work well if the user (uAPI) is required to declare to the kernel exactly what the fd is for (e.g. TDX secret memory, software-only secret memory, etc) before doing anything at all with it other than binding it to KVM.
INACCESSIBLE is a way to achieve this. Maybe it's not the prettiest in the world -- I personally would rather see an explicit request for, say, TDX or SEV memory or maybe the memory that works for a particular KVM instance instead of something generic like INACCESSIBLE, but this is a pretty weak preference. But I think that just starting with a plain memfd is a can of worms.
On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1]. And swapping is theoretically possible, but I'm not aware of any plans as of now.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-t...
I always forget, migration means different things to different audiences. As an mm person, I was meaning page migration, whereas a virtualization person thinks VM live migration (which that reference appears to be about), a scheduler person task migration, an ornithologist bird migration, etc.
But you're an mm person too: you may have cited that reference in the knowledge that TDX 1.5 Live Migration will entail page migration of the kind I'm thinking of. (Anyway, it's not important to clarify that here.)
Some of these impressions may come from earlier iterations of the patchset (v7 looks better in several ways than v5). I am probably underestimating the extent to which you have taken on board other usages beyond TDX and SEV private memory, and rightly want to serve them all with similar interfaces: perhaps there is enough justification for shmem there, but I don't see it. There was mention of userfaultfd in one link: does that provide the justification for using shmem?
I'm afraid of the special demands you may make of memory allocation later on - surprised that huge pages are not mentioned already; gigantic contiguous extents? secretmem removed from direct map?
The design allows for extension to hugetlbfs if needed. Combination of MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero implications for shmem. It is going to be separate struct memfile_backing_store.
Last year's MFD_HUGEPAGE proposal would have allowed you to do it with memfd via tmpfs without needing to involve hugetlbfs; but you may prefer the determinism of hugetlbfs, relying on /proc/sys/vm/nr_hugepages etc.
But I've yet to see why you want to involve this or that filesystem (with all its filesystem-icity suppressed) at all. The backing store is host memory, and tmpfs and hugetlbfs just impose their own idiosyncrasies on how that memory is allocated; but I think you would do better to choose your own idiosyncrasies in allocation directly - you don't need a different "backing store" to choose between 4k or 2M or 1G or whatever allocations.
tmpfs and hugetlbfs and page cache are designed around sharing memory: TDX is designed around absolutely not sharing memory; and the further uses which Sean foresees appear not to need it as page cache either.
Except perhaps for page migration reasons. It's somewhat incidental, but of course page migration knows how to migrate page cache, so masquerading as page cache will give a short cut to page migration, when page migration becomes at all possible.
I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE to be movable if platform supports it and secretmem is not migratable by design (without direct mapping fragmentations).
Here's what I would prefer, and imagine much easier for you to maintain; but I'm no system designer, and may be misunderstanding throughout.
QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps the fallocate syscall interface itself) to allocate and free the memory, ioctl for initializing some of it too. KVM in control of whether that fd can be read or written or mmap'ed or whatever, no need to prevent it in shmem.c, no need for flags, seals, notifications to and fro because KVM is already in control and knows the history. If shmem actually has value, call into it underneath - somewhat like SysV SHM, and /dev/zero mmap, and i915/gem make use of it underneath. If shmem has nothing to add, just allocate and free kernel memory directly, recorded in your own xarray.
I guess a shim layer on top of shmem *can* work. I don't see immediately why it would not. But I'm not sure it is the right direction. We risk creating yet another parallel VM with its own rules/locking/accounting that is opaque to core-mm.
You are already proposing a new set of rules, foreign to how tmpfs works for others. You're right that KVM allocating large amounts of memory, opaque to core-mm, carries risk: and you'd be right to say that shmem.c provides some clues (security_vm_enough_memory checks, memcg charging, user_shm_lock accounting) on what to remember.
But I'm not up to the job of being the one to police you there, and you don't want to be waiting on me either.
To take a rather silly example: Ted just added chattr support to tmpfs, and it fits in well. But I don't now want to have to decide whether "chattr +i" FS_IMMUTABLE_FL is or is not compatible with MEMFILE_F_USER_INACCESSIBLE. They are from different worlds, and I'd prefer KVM to carry the weight of imposing INACCESSIBLE: which seems easily done if it manages the fd, without making the memory allocated to that fd accessible to those who hold the fd.
Note that on machines that run TDX guests such memory would likely be the bulk of memory use. Treating it as a fringe case may bite us one day.
Yes, I suspected that machines running TDX guests might well consume most of the memory that way, but glad(?) to hear it confirmed.
I am not suggesting that this memory be treated as a fringe case, rather the reverse: a different case, not something to hide away inside shmem.c.
Is there a notion that /proc/meminfo "Shmem:" is going to be a good hint of this usage? Whether or not it's also included in "Shmem:", I expect that its different characteristics will deserve its own display.
Hugh
On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1]. And swapping is theoretically possible, but I'm not aware of any plans as of now.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-t...
I always forget, migration means different things to different audiences. As an mm person, I was meaning page migration, whereas a virtualization person thinks VM live migration (which that reference appears to be about), a scheduler person task migration, an ornithologist bird migration, etc.
But you're an mm person too: you may have cited that reference in the knowledge that TDX 1.5 Live Migration will entail page migration of the kind I'm thinking of. (Anyway, it's not important to clarify that here.)
TDX 1.5 brings both.
In TDX speak, mm migration is called relocation. See TDH.MEM.PAGE.RELOCATE.
Some of these impressions may come from earlier iterations of the patchset (v7 looks better in several ways than v5). I am probably underestimating the extent to which you have taken on board other usages beyond TDX and SEV private memory, and rightly want to serve them all with similar interfaces: perhaps there is enough justification for shmem there, but I don't see it. There was mention of userfaultfd in one link: does that provide the justification for using shmem?
I'm afraid of the special demands you may make of memory allocation later on - surprised that huge pages are not mentioned already; gigantic contiguous extents? secretmem removed from direct map?
The design allows for extension to hugetlbfs if needed. Combination of MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero implications for shmem. It is going to be separate struct memfile_backing_store.
Last year's MFD_HUGEPAGE proposal would have allowed you to do it with memfd via tmpfs without needing to involve hugetlbfs; but you may prefer the determinism of hugetlbfs, relying on /proc/sys/vm/nr_hugepages etc.
But I've yet to see why you want to involve this or that filesystem (with all its filesystem-icity suppressed) at all. The backing store is host memory, and tmpfs and hugetlbfs just impose their own idiosyncrasies on how that memory is allocated; but I think you would do better to choose your own idiosyncrasies in allocation directly - you don't need a different "backing store" to choose between 4k or 2M or 1G or whatever allocations.
These idiosyncrasies are well known: a user who used hugetlbfs may want to get a direct replacement that would tap into the same hugetlb reserves and get the same allocation guarantees. Admins know where to look if ENOMEM happens.
For THP, an admin may know how to tweak the allocation/defrag policy to their liking and how to track whether they are allocated.
tmpfs and hugetlbfs and page cache are designed around sharing memory: TDX is designed around absolutely not sharing memory; and the further uses which Sean foresees appear not to need it as page cache either.
Except perhaps for page migration reasons. It's somewhat incidental, but of course page migration knows how to migrate page cache, so masquerading as page cache will give a short cut to page migration, when page migration becomes at all possible.
I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE to be movable if platform supports it and secretmem is not migratable by design (without direct mapping fragmentations).
Here's what I would prefer, and imagine much easier for you to maintain; but I'm no system designer, and may be misunderstanding throughout.
QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps the fallocate syscall interface itself) to allocate and free the memory, ioctl for initializing some of it too. KVM in control of whether that fd can be read or written or mmap'ed or whatever, no need to prevent it in shmem.c, no need for flags, seals, notifications to and fro because KVM is already in control and knows the history. If shmem actually has value, call into it underneath - somewhat like SysV SHM, and /dev/zero mmap, and i915/gem make use of it underneath. If shmem has nothing to add, just allocate and free kernel memory directly, recorded in your own xarray.
I guess a shim layer on top of shmem *can* work. I don't see immediately why it would not. But I'm not sure it is the right direction. We risk creating yet another parallel VM with its own rules/locking/accounting that is opaque to core-mm.
You are already proposing a new set of rules, foreign to how tmpfs works for others. You're right that KVM allocating large amounts of memory, opaque to core-mm, carries risk: and you'd be right to say that shmem.c provides some clues (security_vm_enough_memory checks, memcg charging, user_shm_lock accounting) on what to remember.
That's a nice list of clues that would need to be re-implemented somewhere else to get a competent solution.
But I'm not up to the job of being the one to police you there, and you don't want to be waiting on me either.
To take a rather silly example: Ted just added chattr support to tmpfs, and it fits in well. But I don't now want to have to decide whether "chattr +i" FS_IMMUTABLE_FL is or is not compatible with MEMFILE_F_USER_INACCESSIBLE. They are from different worlds, and I'd prefer KVM to carry the weight of imposing INACCESSIBLE: which seems easily done if it manages the fd, without making the memory allocated to that fd accessible to those who hold the fd.
From a quick look, these are orthogonal. But that is not your point.
Yes, INACCESSIBLE is an increase of complexity which you do not want to deal with in shmem.c. I get it.
I will try next week to rework it as a shim on top of shmem. Does it work for you?
But I think it is wrong to throw it over the fence to KVM folks and say it is your problem. Core MM has to manage it.
Note that on machines that run TDX guests such memory would likely be the bulk of memory use. Treating it as a fringe case may bite us one day.
Yes, I suspected that machines running TDX guests might well consume most of the memory that way, but glad(?) to hear it confirmed.
I am not suggesting that this memory be treated as a fringe case, rather the reverse: a different case, not something to hide away inside shmem.c.
Is there a notion that /proc/meminfo "Shmem:" is going to be a good hint of this usage? Whether or not it's also included in "Shmem:", I expect that its different characteristics will deserve its own display.
That's the hint users know about from previous experience.
On Sat, 20 Aug 2022, Kirill A. Shutemov wrote:
Yes, INACCESSIBLE is an increase of complexity which you do not want to deal with in shmem.c. I get it.
It's not so much that INACCESSIBLE increases the complexity of memfd/shmem/tmpfs, as that it is completely foreign to it.
And by handling all those foreign needs at the KVM end (where you can be sure that the mem attached to the fd is INACCESSIBLE because you have given nobody access to it - no handshaking with 3rd party required).
I will try next week to rework it as a shim on top of shmem. Does it work for you?
Yes, please do, thanks. It's a compromise between us: the initial TDX case has no justification to use shmem at all, but doing it that way will help you with some of the infrastructure, and will probably be easiest for KVM to extend to other more relaxed fd cases later.
But I think it is wrong to throw it over the fence to KVM folks and say it is your problem. Core MM has to manage it.
We disagree on who is throwing over the fence to whom :)
Core MM should manage the core MM parts and KVM should manage the KVM parts. What makes this rather different from most driver usage of MM, is that KVM seems likely to use a great proportion of memory this way. With great memory usage comes great responsibility: I don't think all those flags and seals and notifiers let KVM escape from that.
Hugh
On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
I will try next week to rework it as a shim on top of shmem. Does it work for you?
Yes, please do, thanks. It's a compromise between us: the initial TDX case has no justification to use shmem at all, but doing it that way will help you with some of the infrastructure, and will probably be easiest for KVM to extend to other more relaxed fd cases later.
Okay, below is my take on the shim approach.
I don't hate how it turned out. It is easier to understand without the callback exchange thing.
The only caveat is that I had to introduce an external lock to protect against a race between lookup and truncate. Otherwise, it looks pretty reasonable to me.
I did very limited testing. And it lacks integration with KVM, but the API has not changed substantially, so it should be easy to adopt.
Any comments?
diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..aec04a0f8b7b 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -3,6 +3,7 @@
 #define __LINUX_MEMFD_H

 #include <linux/file.h>
+#include <linux/pfn_t.h>

 #ifdef CONFIG_MEMFD_CREATE
 extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
@@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
 }
 #endif

+struct inaccessible_notifier;
+
+struct inaccessible_notifier_ops {
+	void (*invalidate)(struct inaccessible_notifier *notifier,
+			   pgoff_t start, pgoff_t end);
+};
+
+struct inaccessible_notifier {
+	struct list_head list;
+	const struct inaccessible_notifier_ops *ops;
+};
+
+int inaccessible_register_notifier(struct file *file,
+				   struct inaccessible_notifier *notifier);
+void inaccessible_unregister_notifier(struct file *file,
+				      struct inaccessible_notifier *notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order);
+void inaccessible_put_pfn(struct file *file, pfn_t pfn);
+
+struct file *memfd_mkinaccessible(struct file *memfd);
+
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..9d066be3d7e8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */

 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U
+#define MFD_INACCESSIBLE	0x0008U

 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..f82e5d4b4388 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
-obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..1853a90f49ff 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)

-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_INACCESSIBLE)

 SYSCALL_DEFINE2(memfd_create,
		const char __user *, uname,
@@ -283,6 +284,14 @@
			return -EINVAL;
	}

+	/* Disallow sealing when MFD_INACCESSIBLE is set. */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
+		return -EINVAL;
+
+	/* TODO: add hugetlb support */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
+		return -EINVAL;
+
	/* length includes terminating zero */
	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
	if (len <= 0)
@@ -331,10 +340,24 @@
		*file_seals &= ~F_SEAL_SEAL;
	}

+	if (flags & MFD_INACCESSIBLE) {
+		struct file *inaccessible_file;
+
+		inaccessible_file = memfd_mkinaccessible(file);
+		if (IS_ERR(inaccessible_file)) {
+			error = PTR_ERR(inaccessible_file);
+			goto err_file;
+		}
+
+		file = inaccessible_file;
+	}
+
	fd_install(fd, file);
	kfree(name);
	return fd;

+err_file:
+	fput(file);
 err_fd:
	put_unused_fd(fd);
 err_name:
diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
new file mode 100644
index 000000000000..89194438af9c
--- /dev/null
+++ b/mm/memfd_inaccessible.c
@@ -0,0 +1,234 @@
+#include <linux/memfd.h>
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/shmem_fs.h>
+#include <uapi/linux/falloc.h>
+#include <uapi/linux/magic.h>
+
+struct inaccessible_data {
+	struct rw_semaphore lock;
+	struct file *memfd;
+	struct list_head notifiers;
+};
+
+static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
+					     pgoff_t start, pgoff_t end)
+{
+	struct inaccessible_notifier *notifier;
+
+	lockdep_assert_held(&data->lock);
+	VM_BUG_ON(!rwsem_is_locked(&data->lock));
+
+	list_for_each_entry(notifier, &data->notifiers, list) {
+		notifier->ops->invalidate(notifier, start, end);
+	}
+}
+
+static int inaccessible_release(struct inode *inode, struct file *file)
+{
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+
+	fput(data->memfd);
+	kfree(data);
+	return 0;
+}
+
+static long inaccessible_fallocate(struct file *file, int mode,
+				   loff_t offset, loff_t len)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	/* The lock prevents parallel inaccessible_get/put_pfn() */
+	down_write(&data->lock);
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+	inaccessible_notifier_invalidate(data, offset, offset + len);
+out:
+	up_write(&data->lock);
+	return ret;
+}
+
+static const struct file_operations inaccessible_fops = {
+	.release = inaccessible_release,
+	.fallocate = inaccessible_fallocate,
+};
+
+static int inaccessible_getattr(struct user_namespace *mnt_userns,
+				const struct path *path, struct kstat *stat,
+				u32 request_mask, unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+
+	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
+					     request_mask, query_flags);
+}
+
+static int inaccessible_setattr(struct user_namespace *mnt_userns,
+				struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = d_inode(dentry);
+	struct inaccessible_data *data = inode->i_mapping->private_data;
+	struct file *memfd = data->memfd;
+	int ret;
+
+	if (attr->ia_valid & ATTR_SIZE) {
+		if (memfd->f_inode->i_size) {
+			ret = -EPERM;
+			goto out;
+		}
+
+		if (!PAGE_ALIGNED(attr->ia_size)) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	ret = memfd->f_inode->i_op->setattr(mnt_userns,
+					    file_dentry(memfd), attr);
+out:
+	return ret;
+}
+
+static const struct inode_operations inaccessible_iops = {
+	.getattr = inaccessible_getattr,
+	.setattr = inaccessible_setattr,
+};
+
+static int inaccessible_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, INACCESSIBLE_MAGIC))
+		return -ENOMEM;
+
+	fc->s_iflags |= SB_I_NOEXEC;
+	return 0;
+}
+
+static struct file_system_type inaccessible_fs = {
+	.owner		= THIS_MODULE,
+	.name		= "[inaccessible]",
+	.init_fs_context = inaccessible_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *inaccessible_mnt;
+
+static __init int inaccessible_init(void)
+{
+	inaccessible_mnt = kern_mount(&inaccessible_fs);
+	if (IS_ERR(inaccessible_mnt))
+		return PTR_ERR(inaccessible_mnt);
+	return 0;
+}
+fs_initcall(inaccessible_init);
+
+struct file *memfd_mkinaccessible(struct file *memfd)
+{
+	struct inaccessible_data *data;
+	struct address_space *mapping;
+	struct inode *inode;
+	struct file *file;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+
+	data->memfd = memfd;
+	init_rwsem(&data->lock);
+	INIT_LIST_HEAD(&data->notifiers);
+
+	inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
+	if (IS_ERR(inode)) {
+		kfree(data);
+		return ERR_CAST(inode);
+	}
+
+	inode->i_mode |= S_IFREG;
+	inode->i_op = &inaccessible_iops;
+	inode->i_mapping->private_data = data;
+
+	file = alloc_file_pseudo(inode, inaccessible_mnt,
+				 "[memfd:inaccessible]", O_RDWR,
+				 &inaccessible_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		kfree(data);
+	}
+
+	mapping = memfd->f_mapping;
+	mapping_set_unevictable(mapping);
+	mapping_set_gfp_mask(mapping,
+			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
+
+	return file;
+}
+
+int inaccessible_register_notifier(struct file *file,
+				   struct inaccessible_notifier *notifier)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	down_write(&data->lock);
+	list_add(&notifier->list, &data->notifiers);
+	up_write(&data->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
+
+void inaccessible_unregister_notifier(struct file *file,
+				      struct inaccessible_notifier *notifier)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	down_write(&data->lock);
+	list_del_rcu(&notifier->list);
+	up_write(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+			 int *order)
+{
+	struct inaccessible_data *data = file->f_mapping->private_data;
+	struct file *memfd = data->memfd;
+	struct page *page;
+	int ret;
+
+	down_read(&data->lock);
+
+	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
+	if (ret) {
+		up_read(&data->lock);
+		return ret;
+	}
+
+	*pfn = page_to_pfn_t(page);
+	*order = thp_order(compound_head(page));
+	return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
+
+void inaccessible_put_pfn(struct file *file, pfn_t pfn)
+{
+	struct page *page = pfn_t_to_page(pfn);
+	struct inaccessible_data *data = file->f_mapping->private_data;
+
+	if (WARN_ON_ONCE(!page))
+		return;
+
+	SetPageUptodate(page);
+	unlock_page(page);
+	put_page(page);
+	up_read(&data->lock);
+}
+EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
I will try next week to rework it as a shim on top of shmem. Does that work for you?
Yes, please do, thanks. It's a compromise between us: the initial TDX case has no justification to use shmem at all, but doing it that way will help you with some of the infrastructure, and will probably be easiest for KVM to extend to other more relaxed fd cases later.
Okay, below is my take on the shim approach.
I don't hate how it turned out. It is easier to understand without the callback exchange thing.
The only caveat is that I had to introduce an external lock to protect against races between lookup and truncate. Otherwise, it looks pretty reasonable to me.
I did very limited testing. And it lacks integration with KVM, but the API has not changed substantially, so it should be easy to adopt.
I have integrated this patch with the other KVM patches and verified that the functionality works well in a TDX environment, with a minor fix below.
Any comments?
...
diff --git a/mm/memfd.c b/mm/memfd.c index 08f5f8304746..1853a90f49ff 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
MFD_INACCESSIBLE)
SYSCALL_DEFINE2(memfd_create, const char __user *, uname, @@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create, return -EINVAL; }
+	/* Disallow sealing when MFD_INACCESSIBLE is set. */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
+		return -EINVAL;
+
+	/* TODO: add hugetlb support */
+	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
+		return -EINVAL;
+
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
 	if (len <= 0)
@@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create, *file_seals &= ~F_SEAL_SEAL; }
+	if (flags & MFD_INACCESSIBLE) {
+		struct file *inaccessible_file;
+
+		inaccessible_file = memfd_mkinaccessible(file);
+		if (IS_ERR(inaccessible_file)) {
+			error = PTR_ERR(inaccessible_file);
+			goto err_file;
+		}
The new file should also be marked as O_LARGEFILE, otherwise setting an initial size greater than 2^31 on the fd will be refused by ftruncate().
+		inaccessible_file->f_flags |= O_LARGEFILE;
+
+		file = inaccessible_file;
+	}
+
 	fd_install(fd, file);
 	kfree(name);
 	return fd;

+err_file:
+	fput(file);
 err_fd:
 	put_unused_fd(fd);
 err_name:
On Fri, Sep 02, 2022 at 06:27:57PM +0800, Chao Peng wrote:
+	if (flags & MFD_INACCESSIBLE) {
+		struct file *inaccessible_file;
+
+		inaccessible_file = memfd_mkinaccessible(file);
+		if (IS_ERR(inaccessible_file)) {
+			error = PTR_ERR(inaccessible_file);
+			goto err_file;
+		}
The new file should also be marked as O_LARGEFILE, otherwise setting an initial size greater than 2^31 on the fd will be refused by ftruncate().
inaccessible_file->f_flags |= O_LARGEFILE;
Good catch. Thanks.
I will modify memfd_mkinaccessible() to do this.
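For reference, the change boils down to one line in memfd_mkinaccessible(), right after the pseudo file is allocated; the updated version below does exactly this:

	file = alloc_file_pseudo(inode, inaccessible_mnt,
				 "[memfd:inaccessible]", O_RDWR,
				 &inaccessible_fops);
	/* (error handling elided) */

	/* Allow ftruncate() beyond 2^31 (MAX_NON_LFS) on the returned fd. */
	file->f_flags |= O_LARGEFILE;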
On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
I will try next week to rework it as a shim on top of shmem. Does that work for you?
Yes, please do, thanks. It's a compromise between us: the initial TDX case has no justification to use shmem at all, but doing it that way will help you with some of the infrastructure, and will probably be easiest for KVM to extend to other more relaxed fd cases later.
Okay, below is my take on the shim approach.
I don't hate how it turned out. It is easier to understand without the callback exchange thing.
The only caveat is that I had to introduce an external lock to protect against races between lookup and truncate. Otherwise, it looks pretty reasonable to me.
I did very limited testing. And it lacks integration with KVM, but the API has not changed substantially, so it should be easy to adopt.
Any comments?
Updated version below. Nothing major. Some simplification and cleanups.
diff --git a/include/linux/memfd.h b/include/linux/memfd.h index 4f1600413f91..334ddff08377 100644 --- a/include/linux/memfd.h +++ b/include/linux/memfd.h @@ -3,6 +3,7 @@ #define __LINUX_MEMFD_H
#include <linux/file.h> +#include <linux/pfn_t.h>
#ifdef CONFIG_MEMFD_CREATE extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg); @@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a) } #endif
+struct inaccessible_notifier; + +struct inaccessible_notifier_ops { + void (*invalidate)(struct inaccessible_notifier *notifier, + pgoff_t start, pgoff_t end); +}; + +struct inaccessible_notifier { + struct list_head list; + const struct inaccessible_notifier_ops *ops; +}; + +void inaccessible_register_notifier(struct file *file, + struct inaccessible_notifier *notifier); +void inaccessible_unregister_notifier(struct file *file, + struct inaccessible_notifier *notifier); + +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn, + int *order); +void inaccessible_put_pfn(struct file *file, pfn_t pfn); + +struct file *memfd_mkinaccessible(struct file *memfd); + #endif /* __LINUX_MEMFD_H */ diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h index 6325d1d0e90f..9d066be3d7e8 100644 --- a/include/uapi/linux/magic.h +++ b/include/uapi/linux/magic.h @@ -101,5 +101,6 @@ #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */ #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */ #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */ +#define INACCESSIBLE_MAGIC 0x494e4143 /* "INAC" */
#endif /* __LINUX_MAGIC_H__ */ diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h index 7a8a26751c23..48750474b904 100644 --- a/include/uapi/linux/memfd.h +++ b/include/uapi/linux/memfd.h @@ -8,6 +8,7 @@ #define MFD_CLOEXEC 0x0001U #define MFD_ALLOW_SEALING 0x0002U #define MFD_HUGETLB 0x0004U +#define MFD_INACCESSIBLE 0x0008U
/* * Huge page size encoding when MFD_HUGETLB is specified, and a huge page diff --git a/mm/Makefile b/mm/Makefile index 9a564f836403..f82e5d4b4388 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o obj-$(CONFIG_ZONE_DEVICE) += memremap.o obj-$(CONFIG_HMM_MIRROR) += hmm.o -obj-$(CONFIG_MEMFD_CREATE) += memfd.o +obj-$(CONFIG_MEMFD_CREATE) += memfd.o memfd_inaccessible.o obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o obj-$(CONFIG_PTDUMP_CORE) += ptdump.o obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o diff --git a/mm/memfd.c b/mm/memfd.c index 08f5f8304746..1853a90f49ff 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \ + MFD_INACCESSIBLE)
SYSCALL_DEFINE2(memfd_create, const char __user *, uname, @@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create, return -EINVAL; }
+ /* Disallow sealing when MFD_INACCESSIBLE is set. */ + if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING)) + return -EINVAL; + + /* TODO: add hugetlb support */ + if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB)) + return -EINVAL; + /* length includes terminating zero */ len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); if (len <= 0) @@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create, *file_seals &= ~F_SEAL_SEAL; }
+ if (flags & MFD_INACCESSIBLE) { + struct file *inaccessible_file; + + inaccessible_file = memfd_mkinaccessible(file); + if (IS_ERR(inaccessible_file)) { + error = PTR_ERR(inaccessible_file); + goto err_file; + } + + file = inaccessible_file; + } + fd_install(fd, file); kfree(name); return fd;
+err_file: + fput(file); err_fd: put_unused_fd(fd); err_name: diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c new file mode 100644 index 000000000000..dc79988a49d0 --- /dev/null +++ b/mm/memfd_inaccessible.c @@ -0,0 +1,219 @@ +#include "linux/sbitmap.h" +#include <linux/memfd.h> +#include <linux/pagemap.h> +#include <linux/pseudo_fs.h> +#include <linux/shmem_fs.h> +#include <uapi/linux/falloc.h> +#include <uapi/linux/magic.h> + +struct inaccessible_data { + struct mutex lock; + struct file *memfd; + struct list_head notifiers; +}; + +static void inaccessible_notifier_invalidate(struct inaccessible_data *data, + pgoff_t start, pgoff_t end) +{ + struct inaccessible_notifier *notifier; + + mutex_lock(&data->lock); + list_for_each_entry(notifier, &data->notifiers, list) { + notifier->ops->invalidate(notifier, start, end); + } + mutex_unlock(&data->lock); +} + +static int inaccessible_release(struct inode *inode, struct file *file) +{ + struct inaccessible_data *data = inode->i_mapping->private_data; + + fput(data->memfd); + kfree(data); + return 0; +} + +static long inaccessible_fallocate(struct file *file, int mode, + loff_t offset, loff_t len) +{ + struct inaccessible_data *data = file->f_mapping->private_data; + struct file *memfd = data->memfd; + int ret; + + if (mode & FALLOC_FL_PUNCH_HOLE) { + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) { + return -EINVAL; + } + } + + ret = memfd->f_op->fallocate(memfd, mode, offset, len); + inaccessible_notifier_invalidate(data, offset, offset + len); + return ret; +} + +static const struct file_operations inaccessible_fops = { + .release = inaccessible_release, + .fallocate = inaccessible_fallocate, +}; + +static int inaccessible_getattr(struct user_namespace *mnt_userns, + const struct path *path, struct kstat *stat, + u32 request_mask, unsigned int query_flags) +{ + struct inode *inode = d_inode(path->dentry); + struct inaccessible_data *data = inode->i_mapping->private_data; + struct file *memfd = data->memfd; + + return memfd->f_inode->i_op->getattr(mnt_userns, path, stat, + request_mask, query_flags); +} + +static int inaccessible_setattr(struct user_namespace *mnt_userns, + struct dentry *dentry, struct iattr *attr) +{ + struct inode *inode = d_inode(dentry); + struct inaccessible_data *data = inode->i_mapping->private_data; + struct file *memfd = data->memfd; + int ret; + + if (attr->ia_valid & ATTR_SIZE) { + if (memfd->f_inode->i_size) + return -EPERM; + + if (!PAGE_ALIGNED(attr->ia_size)) + return -EINVAL; + } + + ret = memfd->f_inode->i_op->setattr(mnt_userns, + file_dentry(memfd), attr); + return ret; +} + +static const struct inode_operations inaccessible_iops = { + .getattr = inaccessible_getattr, + .setattr = inaccessible_setattr, +}; + +static int inaccessible_init_fs_context(struct fs_context *fc) +{ + if (!init_pseudo(fc, INACCESSIBLE_MAGIC)) + return -ENOMEM; + + fc->s_iflags |= SB_I_NOEXEC; + return 0; +} + +static struct file_system_type inaccessible_fs = { + .owner = THIS_MODULE, + .name = "[inaccessible]", + .init_fs_context = inaccessible_init_fs_context, + .kill_sb = kill_anon_super, +}; + +static struct vfsmount *inaccessible_mnt; + +static __init int inaccessible_init(void) +{ + inaccessible_mnt = kern_mount(&inaccessible_fs); + if (IS_ERR(inaccessible_mnt)) + return PTR_ERR(inaccessible_mnt); + return 0; +} +fs_initcall(inaccessible_init); + +struct file *memfd_mkinaccessible(struct file *memfd) +{ + struct inaccessible_data *data; + struct address_space *mapping; + struct inode *inode; + struct file 
*file; + + data = kzalloc(sizeof(*data), GFP_KERNEL); + if (!data) + return ERR_PTR(-ENOMEM); + + data->memfd = memfd; + mutex_init(&data->lock); + INIT_LIST_HEAD(&data->notifiers); + + inode = alloc_anon_inode(inaccessible_mnt->mnt_sb); + if (IS_ERR(inode)) { + kfree(data); + return ERR_CAST(inode); + } + + inode->i_mode |= S_IFREG; + inode->i_op = &inaccessible_iops; + inode->i_mapping->private_data = data; + + file = alloc_file_pseudo(inode, inaccessible_mnt, + "[memfd:inaccessible]", O_RDWR, + &inaccessible_fops); + if (IS_ERR(file)) { + iput(inode); + kfree(data); + } + + file->f_flags |= O_LARGEFILE; + + mapping = memfd->f_mapping; + mapping_set_unevictable(mapping); + mapping_set_gfp_mask(mapping, + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); + + return file; +} + +void inaccessible_register_notifier(struct file *file, + struct inaccessible_notifier *notifier) +{ + struct inaccessible_data *data = file->f_mapping->private_data; + + mutex_lock(&data->lock); + list_add(¬ifier->list, &data->notifiers); + mutex_unlock(&data->lock); +} +EXPORT_SYMBOL_GPL(inaccessible_register_notifier); + +void inaccessible_unregister_notifier(struct file *file, + struct inaccessible_notifier *notifier) +{ + struct inaccessible_data *data = file->f_mapping->private_data; + + mutex_lock(&data->lock); + list_del(¬ifier->list); + mutex_unlock(&data->lock); +} +EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier); + +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn, + int *order) +{ + struct inaccessible_data *data = file->f_mapping->private_data; + struct file *memfd = data->memfd; + struct page *page; + int ret; + + ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE); + if (ret) + return ret; + + *pfn = page_to_pfn_t(page); + *order = thp_order(compound_head(page)); + SetPageUptodate(page); + unlock_page(page); + + return 0; +} +EXPORT_SYMBOL_GPL(inaccessible_get_pfn); + +void inaccessible_put_pfn(struct file *file, pfn_t pfn) +{ + struct page *page = pfn_t_to_page(pfn); + + if (WARN_ON_ONCE(!page)) + return; + + put_page(page); +} +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
On Thu, Sep 08, 2022, Kirill A. Shutemov wrote:
On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
I will try next week to rework it as a shim on top of shmem. Does that work for you?
Yes, please do, thanks. It's a compromise between us: the initial TDX case has no justification to use shmem at all, but doing it that way will help you with some of the infrastructure, and will probably be easiest for KVM to extend to other more relaxed fd cases later.
Okay, below is my take on the shim approach.
I don't hate how it turned out. It is easier to understand without the callback exchange thing.
The only caveat is that I had to introduce an external lock to protect against races between lookup and truncate.
As before, I think this lock is unnecessary. Or at least it's unnecessary to hold the lock across get/put. The ->invalidate() call will ensure that the pfn is never actually used if get() races with truncation.
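As an illustration of that argument (the names demo_* and invalidate_seq are made up for the example; the real consumer would be KVM with its own locking and memory ordering), the consumer can sample a counter that the ->invalidate() callback bumps, and retry instead of consuming a pfn that a racing truncation may have freed:

/* Illustrative only: counter bumped from the ->invalidate() callback. */
static atomic64_t invalidate_seq;

static void demo_invalidate(struct inaccessible_notifier *notifier,
			    pgoff_t start, pgoff_t end)
{
	atomic64_inc(&invalidate_seq);
	/* ...zap secondary-MMU mappings covering [start, end)... */
}

static int demo_map_pfn(struct file *file, pgoff_t offset)
{
	u64 seq = atomic64_read(&invalidate_seq);
	pfn_t pfn;
	int order, ret;

	ret = inaccessible_get_pfn(file, offset, &pfn, &order);
	if (ret)
		return ret;

	/* If truncation raced with us, retry instead of using a stale pfn. */
	if (atomic64_read(&invalidate_seq) != seq) {
		inaccessible_put_pfn(file, pfn);
		return -EAGAIN;
	}

	/* ...install the pfn into the secondary page table... */
	inaccessible_put_pfn(file, pfn);
	return 0;
}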
Switching topics, what actually prevents mmap() on the shim? I tried to follow, but I don't know these areas well enough.
On Tue, Sep 13, 2022 at 09:44:27AM +0000, Sean Christopherson wrote:
On Thu, Sep 08, 2022, Kirill A. Shutemov wrote:
On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
I will try next week to rework it as a shim on top of shmem. Does that work for you?
Yes, please do, thanks. It's a compromise between us: the initial TDX case has no justification to use shmem at all, but doing it that way will help you with some of the infrastructure, and will probably be easiest for KVM to extend to other more relaxed fd cases later.
Okay, below is my take on the shim approach.
I don't hate how it turned out. It is easier to understand without the callback exchange thing.
The only caveat is that I had to introduce an external lock to protect against races between lookup and truncate.
As before, I think this lock is unnecessary. Or at least it's unnecessary to hold the lock across get/put. The ->invalidate() call will ensure that the pfn is never actually used if get() races with truncation.
The updated version you are replying to does not use the lock to protect against truncation anymore. The lock protects the notifier list.
Switching topics, what actually prevents mmap() on the shim? I tried to follow, but I don't know these areas well enough.
It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap(). (I did not read the switch statement correctly at first. Note there are two 'fallthrough' there.)
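Concretely, the check Kirill means is in the file-backed branch of do_mmap(); abridged and from memory, it looks roughly like this (both MAP_SHARED and MAP_SHARED_VALIDATE fall through to the MAP_PRIVATE case, which is where the two 'fallthrough' statements matter):

	switch (flags & MAP_TYPE) {
	case MAP_SHARED:
		/* ... */
		fallthrough;
	case MAP_SHARED_VALIDATE:
		/* ... */
		fallthrough;
	case MAP_PRIVATE:
		/* ... */
		if (!file->f_op->mmap)
			return -ENODEV;	/* what the shim relies on */
		/* ... */
		break;
	default:
		return -EINVAL;
	}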
On Tue, Sep 13, 2022, Kirill A. Shutemov wrote:
On Tue, Sep 13, 2022 at 09:44:27AM +0000, Sean Christopherson wrote:
On Thu, Sep 08, 2022, Kirill A. Shutemov wrote:
On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
I will try next week to rework it as a shim on top of shmem. Does that work for you?
Yes, please do, thanks. It's a compromise between us: the initial TDX case has no justification to use shmem at all, but doing it that way will help you with some of the infrastructure, and will probably be easiest for KVM to extend to other more relaxed fd cases later.
Okay, below is my take on the shim approach.
I don't hate how it turned out. It is easier to understand without the callback exchange thing.
The only caveat is that I had to introduce an external lock to protect against races between lookup and truncate.
As before, I think this lock is unnecessary. Or at least it's unnecessary to hold the lock across get/put. The ->invalidate() call will ensure that the pfn is never actually used if get() races with truncation.
The updated version you are replying to does not use the lock to protect against truncation anymore. The lock protects the notifier list.
Gah, grabbed the patch when applying.
Switching topics, what actually prevents mmap() on the shim? I tried to follow, but I don't know these areas well enough.
It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap(). (I did not read the switch statement correctly at first. Note there are two 'fallthrough' there.)
Ah, validate_mmap_request(). Thought not implementing ->mmap() was the key, but couldn't find the actual check.
Thanks much!
On Tue, Sep 13, 2022 at 02:53:25PM +0000, Sean Christopherson wrote:
Switching topics, what actually prevents mmap() on the shim? I tried to follow, but I don't know these areas well enough.
It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap(). (I did not read the switch statement correctly at first. Note there are two 'fallthrough' there.)
Ah, validate_mmap_request(). Thought not implementing ->mmap() was the key, but couldn't find the actual check.
validate_mmap_request() is in mm/nommu.c which is not relevant for real computers.
I was talking about this check:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/m...
On Tue, Sep 13, 2022, Kirill A. Shutemov wrote:
On Tue, Sep 13, 2022 at 02:53:25PM +0000, Sean Christopherson wrote:
Switching topics, what actually prevents mmap() on the shim? I tried to follow, but I don't know these areas well enough.
It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap(). (I did not read the switch statement correctly at first. Note there are two 'fallthrough' there.)
Ah, validate_mmap_request(). Thought not implementing ->mmap() was the key, but couldn't find the actual check.
validate_mmap_request() is in mm/nommu.c which is not relevant for real computers.
I was talking about this check:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/m...
Hence the comment about 'fallthrough'. Thanks again!
On 8/19/22 17:27, Kirill A. Shutemov wrote:
On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1]. And swapping is theoretically possible, but I'm not aware of any plans as of now.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-t...
I always forget, migration means different things to different audiences. As an mm person, I was meaning page migration, whereas a virtualization person thinks VM live migration (which that reference appears to be about), a scheduler person task migration, an ornithologist bird migration, etc.
But you're an mm person too: you may have cited that reference in the knowledge that TDX 1.5 Live Migration will entail page migration of the kind I'm thinking of. (Anyway, it's not important to clarify that here.)
TDX 1.5 brings both.
In TDX speak, mm migration is called relocation. See TDH.MEM.PAGE.RELOCATE.
This seems to be a pretty bad fit for the way that the core mm migrates pages. The core mm unmaps the page, then moves (in software) the contents to a new address, then faults it in. TDH.MEM.PAGE.RELOCATE doesn't fit into that workflow very well. I'm not saying it can't be done, but it won't just work.
On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
On 8/19/22 17:27, Kirill A. Shutemov wrote:
On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1]. And swapping is theoretically possible, but I'm not aware of any plans as of now.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-t...
I always forget, migration means different things to different audiences. As an mm person, I was meaning page migration, whereas a virtualization person thinks VM live migration (which that reference appears to be about), a scheduler person task migration, an ornithologist bird migration, etc.
But you're an mm person too: you may have cited that reference in the knowledge that TDX 1.5 Live Migration will entail page migration of the kind I'm thinking of. (Anyway, it's not important to clarify that here.)
TDX 1.5 brings both.
In TDX speak, mm migration is called relocation. See TDH.MEM.PAGE.RELOCATE.
This seems to be a pretty bad fit for the way that the core mm migrates pages. The core mm unmaps the page, then moves (in software) the contents to a new address, then faults it in. TDH.MEM.PAGE.RELOCATE doesn't fit into that workflow very well. I'm not saying it can't be done, but it won't just work.
Hm. From what I see we have all necessary infrastructure in place.
Unmapping is a NOP for inaccessible pages, as they are never mapped, and we have the mapping->a_ops->migrate_folio() callback that allows replacing the software copy with whatever is needed, like TDH.MEM.PAGE.RELOCATE.
What do I miss?
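To make that concrete, here is a sketch of what such a callback could look like for the inaccessible case. tdx_relocate_private_page() is a hypothetical wrapper around TDH.MEM.PAGE.RELOCATE, and whether the callback would live in shmem's aops or a dedicated aops for this mapping is an open question; the ->migrate_folio() signature is the folio-based one:

static int inaccessible_migrate_folio(struct address_space *mapping,
				      struct folio *dst, struct folio *src,
				      enum migrate_mode mode)
{
	int ret;

	/* Move the page cache bookkeeping; no software copy of the data. */
	ret = folio_migrate_mapping(mapping, dst, src, 0);
	if (ret != MIGRATEPAGE_SUCCESS)
		return ret;

	/* Hypothetical helper that issues TDH.MEM.PAGE.RELOCATE. */
	return tdx_relocate_private_page(folio_page(src, 0),
					 folio_page(dst, 0));
}

static const struct address_space_operations inaccessible_aops = {
	.migrate_folio	= inaccessible_migrate_folio,
};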
On Fri, Sep 9, 2022, at 7:32 AM, Kirill A . Shutemov wrote:
On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
On 8/19/22 17:27, Kirill A. Shutemov wrote:
On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1]. And swapping is theoretically possible, but I'm not aware of any plans as of now.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-t...
I always forget, migration means different things to different audiences. As an mm person, I was meaning page migration, whereas a virtualization person thinks VM live migration (which that reference appears to be about), a scheduler person task migration, an ornithologist bird migration, etc.
But you're an mm person too: you may have cited that reference in the knowledge that TDX 1.5 Live Migration will entail page migration of the kind I'm thinking of. (Anyway, it's not important to clarify that here.)
TDX 1.5 brings both.
In TDX speak, mm migration is called relocation. See TDH.MEM.PAGE.RELOCATE.
This seems to be a pretty bad fit for the way that the core mm migrates pages. The core mm unmaps the page, then moves (in software) the contents to a new address, then faults it in. TDH.MEM.PAGE.RELOCATE doesn't fit into that workflow very well. I'm not saying it can't be done, but it won't just work.
Hm. From what I see we have all necessary infrastructure in place.
Unmapping is a NOP for inaccessible pages, as they are never mapped, and we have the mapping->a_ops->migrate_folio() callback that allows replacing the software copy with whatever is needed, like TDH.MEM.PAGE.RELOCATE.
What do I miss?
Hmm, maybe this isn't as bad as I thought.
Right now, unless I've missed something, the migration workflow is to unmap (via try_to_migrate) all mappings, then migrate the backing store (with ->migrate_folio(), although it seems like most callers expect the actual copy to happen outside of ->migrate_folio()), and then make new mappings. With the *current* (vma-based, not fd-based) model for KVM memory, this won't work -- we can't unmap before calling TDH.MEM.PAGE.RELOCATE.
But maybe it's actually okay with some care or maybe mild modifications with the fd-based model. We don't have any mmaps, per se, to unmap for secret / INACCESSIBLE memory. So maybe we can get all the way to ->migrate_folio() without zapping anything in the secure EPT and just call TDH.MEM.PAGE.RELOCATE from inside migrate_folio(). And there will be nothing to fault back in. From the core code's perspective, it's like migrating a memfd that doesn't happen to have any mappings at the time.
--Andy
On Fri, Sep 09, 2022 at 12:11:05PM -0700, Andy Lutomirski wrote:
On Fri, Sep 9, 2022, at 7:32 AM, Kirill A . Shutemov wrote:
On Thu, Sep 08, 2022 at 09:48:35PM -0700, Andy Lutomirski wrote:
On 8/19/22 17:27, Kirill A. Shutemov wrote:
On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote: > > If your memory could be swapped, that would be enough of a good reason > to make use of shmem.c: but it cannot be swapped; and although there > are some references in the mailthreads to it perhaps being swappable > in future, I get the impression that will not happen soon if ever. > > If your memory could be migrated, that would be some reason to use > filesystem page cache (because page migration happens to understand > that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1]. And swapping is theoretically possible, but I'm not aware of any plans as of now.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-t...
I always forget, migration means different things to different audiences. As an mm person, I was meaning page migration, whereas a virtualization person thinks VM live migration (which that reference appears to be about), a scheduler person task migration, an ornithologist bird migration, etc.
But you're an mm person too: you may have cited that reference in the knowledge that TDX 1.5 Live Migration will entail page migration of the kind I'm thinking of. (Anyway, it's not important to clarify that here.)
TDX 1.5 brings both.
In TDX speak, mm migration is called relocation. See TDH.MEM.PAGE.RELOCATE.
This seems to be a pretty bad fit for the way that the core mm migrates pages. The core mm unmaps the page, then moves (in software) the contents to a new address, then faults it in. TDH.MEM.PAGE.RELOCATE doesn't fit into that workflow very well. I'm not saying it can't be done, but it won't just work.
Hm. From what I see we have all necessary infrastructure in place.
Unmapping is a NOP for inaccessible pages, as they are never mapped, and we have the mapping->a_ops->migrate_folio() callback that allows replacing the software copy with whatever is needed, like TDH.MEM.PAGE.RELOCATE.
What do I miss?
Hmm, maybe this isn't as bad as I thought.
Right now, unless I've missed something, the migration workflow is to unmap (via try_to_migrate) all mappings, then migrate the backing store (with ->migrate_folio(), although it seems like most callers expect the actual copy to happen outside of ->migrate_folio()),
Most? I guess you are talking about MIGRATE_SYNC_NO_COPY, right? AFAICS, it is an HMM thing and not a common thing.
and then make new mappings. With the *current* (vma-based, not fd-based) model for KVM memory, this won't work -- we can't unmap before calling TDH.MEM.PAGE.RELOCATE.
We don't need to unmap. The page is not mapped from the core-mm PoV.
But maybe it's actually okay with some care or maybe mild modifications with the fd-based model. We don't have any mmaps, per se, to unmap for secret / INACCESSIBLE memory. So maybe we can get all the way to ->migrate_folio() without zapping anything in the secure EPT and just call TDH.MEM.PAGE.RELOCATE from inside migrate_folio(). And there will be nothing to fault back in. From the core code's perspective, it's like migrating a memfd that doesn't happen to have any mappings at the time.
Modifications are needed if we want to initiate migration from userspace. IIRC, we don't have any API that can initiate page migration for file ranges without mapping the file.
But the kernel can do it fine for its own housekeeping; compaction, for example, doesn't need any VMA. And we need compaction working for the long-term stability of the system.
On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
tmpfs and hugetlbfs and page cache are designed around sharing memory: TDX is designed around absolutely not sharing memory; and the further uses which Sean foresees appear not to need it as page cache either.
Except perhaps for page migration reasons. It's somewhat incidental, but of course page migration knows how to migrate page cache, so masquerading as page cache will give a short cut to page migration, when page migration becomes at all possible.
I haven't read the patch series, and I'm not taking a position one way or the other on whether this is better implemented as a shmem addition or a shim that asks shmem for memory. Page migration can be done for driver memory by using PageMovable. I just rewrote how it works, so the details are top of my mind at the moment if anyone wants something explained. Commit 68f2736a8583 is the key one to look at.
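For anyone following along, the rough shape after that rewrite is that a driver supplies a struct movable_operations and tags its pages as movable, instead of abusing address_space_operations. Treat the exact signatures below as an approximation from memory; commit 68f2736a8583 has the authoritative definitions, and the demo_* names are made up:

/* Sketch only; see commit 68f2736a8583 for the real definitions. */
static bool demo_isolate_page(struct page *page, isolate_mode_t mode)
{
	/* Pin/lock driver state so the page can be migrated. */
	return true;
}

static int demo_migrate_page(struct page *dst, struct page *src,
			     enum migrate_mode mode)
{
	/* Move the contents and driver metadata from src to dst. */
	return 0;
}

static void demo_putback_page(struct page *page)
{
	/* Migration failed or was aborted; undo the isolation. */
}

static const struct movable_operations demo_mops = {
	.isolate_page	= demo_isolate_page,
	.migrate_page	= demo_migrate_page,
	.putback_page	= demo_putback_page,
};

/* Pages are then tagged with __SetPageMovable(page, &demo_mops). */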
On Sun, Aug 21, 2022 at 11:27:44AM +0100, Matthew Wilcox wrote:
On Thu, Aug 18, 2022 at 08:00:41PM -0700, Hugh Dickins wrote:
tmpfs and hugetlbfs and page cache are designed around sharing memory: TDX is designed around absolutely not sharing memory; and the further uses which Sean foresees appear not to need it as page cache either.
Except perhaps for page migration reasons. It's somewhat incidental, but of course page migration knows how to migrate page cache, so masquerading as page cache will give a short cut to page migration, when page migration becomes at all possible.
I haven't read the patch series, and I'm not taking a position one way or the other on whether this is better implemented as a shmem addition or a shim that asks shmem for memory. Page migration can be done for driver memory by using PageMovable. I just rewrote how it works, so the details are top of my mind at the moment if anyone wants something explained. Commit 68f2736a8583 is the key one to look at.
Thanks Matthew. That is helpful to understand the current code.
Chao
On 8/18/22 06:24, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
On Wed, 6 Jul 2022, Chao Peng wrote:
This is the v7 of this series which tries to implement the fd-based KVM guest private memory.
Here at last are my reluctant thoughts on this patchset.
fd-based approach for supporting KVM guest private memory: fine.
Use or abuse of memfd and shmem.c: mistaken.
memfd_create() was an excellent way to put together the initial prototype.
But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
What use do you have for a filesystem here? Almost none. IIUC, what you want is an fd through which QEMU can allocate kernel memory, selectively free that memory, and communicate fd+offset+length to KVM. And perhaps an interface to initialize a little of that memory from a template (presumably copied from a real file on disk somewhere).
You don't need shmem.c or a filesystem for that!
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1]. And swapping is theoretically possible, but I'm not aware of any plans as of now.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-t...
This thing?
https://cdrdv2.intel.com/v1/dl/getContent/733578
That looks like migration between computers, not between NUMA nodes. Or am I missing something?
On Thu, Aug 18, 2022 at 04:24:21PM +0300, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
On Wed, 6 Jul 2022, Chao Peng wrote:
This is the v7 of this series which tries to implement the fd-based KVM guest private memory.
Here at last are my reluctant thoughts on this patchset.
fd-based approach for supporting KVM guest private memory: fine.
Use or abuse of memfd and shmem.c: mistaken.
memfd_create() was an excellent way to put together the initial prototype.
But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
What use do you have for a filesystem here? Almost none. IIUC, what you want is an fd through which QEMU can allocate kernel memory, selectively free that memory, and communicate fd+offset+length to KVM. And perhaps an interface to initialize a little of that memory from a template (presumably copied from a real file on disk somewhere).
You don't need shmem.c or a filesystem for that!
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1]. And swapping is theoretically possible, but I'm not aware of any plans as of now.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-t...
Some of these impressions may come from earlier iterations of the patchset (v7 looks better in several ways than v5). I am probably underestimating the extent to which you have taken on board other usages beyond TDX and SEV private memory, and rightly want to serve them all with similar interfaces: perhaps there is enough justification for shmem there, but I don't see it. There was mention of userfaultfd in one link: does that provide the justification for using shmem?
I'm afraid of the special demands you may make of memory allocation later on - surprised that huge pages are not mentioned already; gigantic contiguous extents? secretmem removed from direct map?
The design allows for extension to hugetlbfs if needed. The combination MFD_INACCESSIBLE | MFD_HUGETLB should be routed this way. There should be zero implications for shmem. It is going to be a separate struct memfile_backing_store.
I'm not sure secretmem is a fit here, as we want to extend MFD_INACCESSIBLE to be movable if the platform supports it, and secretmem is not migratable by design (without direct mapping fragmentation).
Here's what I would prefer, and imagine much easier for you to maintain; but I'm no system designer, and may be misunderstanding throughout.
QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps the fallocate syscall interface itself) to allocate and free the memory, ioctl for initializing some of it too. KVM in control of whether that fd can be read or written or mmap'ed or whatever, no need to prevent it in shmem.c, no need for flags, seals, notifications to and fro because KVM is already in control and knows the history. If shmem actually has value, call into it underneath - somewhat like SysV SHM, and /dev/zero mmap, and i915/gem make use of it underneath. If shmem has nothing to add, just allocate and free kernel memory directly, recorded in your own xarray.
I guess a shim layer on top of shmem *can* work. I don't immediately see why it would not. But I'm not sure it is the right direction. We risk creating yet another parallel VM with its own rules/locking/accounting that is opaque to core-mm.
Sorry for necrobumping this thread but I've been reviewing the memfd_restricted() extension that Ackerley is currently working on. I was pointed to this thread as this is what the extension is building on but I'll reply to both threads here.
From a glance at v10, memfd_restricted() is currently implemented as an in-kernel stacking filesystem. A call to memfd_restricted() creates a new restricted memfd file and a new unlinked tmpfs file and stashes the tmpfs file into the memfd file's private data member. It then uses the tmpfs file's f_ops and i_ops to perform the relevant file and inode operations. So it has the same callstack as a general stacking filesystem like overlayfs in some cases:
memfd_restricted->getattr() -> tmpfs->getattr()
The extension that Ackerley is now proposing is to allow passing in a tmpfs file descriptor explicitly to identify the tmpfs instance in which to allocate the tmpfs file which is stashed in the memfd secret file.
So in the ->getattr() callstack I mentioned above this patchset currently does:
static int restrictedmem_getattr(struct user_namespace *mnt_userns,
				 const struct path *path, struct kstat *stat,
				 u32 request_mask, unsigned int query_flags)
{
	struct inode *inode = d_inode(path->dentry);
	struct restrictedmem_data *data = inode->i_mapping->private_data;
	struct file *memfd = data->memfd;

	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
					     request_mask, query_flags);
}
There's a bug in here that I mentioned in another thread and I see that Ackerley has mentioned as well in https://lore.kernel.org/lkml/diqzzga0fv96.fsf@ackerleytng-cloudtop-sg.c.goog... namely that this is passing a restricted memfd struct path to a tmpfs inode operation which is very wrong.
But also in the current implementation - I mentioned this in the other thread as well - when you call stat() on a restricted memfd file descriptor you get all the information about the underlying tmpfs inode. Specifically this includes the device number and inode number.
But when you call statfs() then you get a report that this is a memfd restricted filesystem which somehow shares the device number with a tmpfs instance. That's messy.
Since you're effectively acting like a stacking filesystem you should really use the device number of your memfd restricted filesystem. IOW, something like:
stat->dev = memfd_restricted_dentry->d_sb->s_dev;
But then you run into trouble if you want to go forward with Ackerley's extension that allows to explicitly pass in tmpfs fds to memfd_restricted(). Afaict, two tmpfs instances might allocate the same inode number. So now the inode and device number pair isn't unique anymore.
So you might best be served by allocating and reporting your own inode numbers as well.
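A sketch of what that could look like in the ->getattr() path; this is an illustration of the suggestion, not tested code. It assumes the shim allocates its own inode numbers at file-creation time (e.g. with get_next_ino()) and keeps the &memfd->f_path fix mentioned above:

static int restrictedmem_getattr(struct user_namespace *mnt_userns,
				 const struct path *path, struct kstat *stat,
				 u32 request_mask, unsigned int query_flags)
{
	struct inode *inode = d_inode(path->dentry);
	struct restrictedmem *rm = inode->i_mapping->private_data;
	struct file *memfd = rm->memfd;
	int ret;

	/* Let tmpfs fill in size, blocks, timestamps, ... */
	ret = memfd->f_inode->i_op->getattr(mnt_userns, &memfd->f_path, stat,
					    request_mask, query_flags);
	if (ret)
		return ret;

	/* ...but report the restricted memfd's own identity. */
	stat->dev = inode->i_sb->s_dev;
	stat->ino = inode->i_ino;
	return 0;
}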
But if you want to preserve the inode number and device number of the relevant tmpfs instance but still report memfd restricted as your filesystem type then I think it's reasonable to ask whether a stacking implementation really makes sense here.
If you extend memfd_restricted() or even consider extending it in the future to take tmpfs file descriptors as arguments to identify the tmpfs instance in which to allocate the underlying tmpfs file for the new restricted memfd file you should really consider a tmpfs based implementation.
Because at that point it just feels like a pointless wrapper to get custom f_ops and i_ops. Plus it's wasteful because you allocate dentries and inodes that you don't really care about at all.
Just off the top of my head, you might be better served:
* by a new ioctl() on tmpfs instances that yields regular tmpfs file descriptors with restricted f_ops and i_ops. That's not that different from btrfs subvolumes, which effectively are directories but are created through an ioctl().
* by a mount option to tmpfs that makes it act in this restricted manner; then you don't need an ioctl() and can get away with regular open calls. Such a tmpfs instance would only create regular, restricted memfds.
I think especially with the possibility of an extension that allows you to inherit tmpfs properties by allocating the memfd restriced file in a specific tmpfs instance the argument that you're not really making use of tmpfs things has gone out of the window.
Note that on machines that run TDX guests such memory would likely be the bulk of memory use. Treating it as a fringe case may bite us one day.
-- Kiryl Shutsemau / Kirill A. Shutemov
On Wed, Apr 05, 2023 at 09:58:44PM +0000, Ackerley Tng wrote:
Thanks again for your review!
Christian Brauner brauner@kernel.org writes:
On Tue, Apr 04, 2023 at 03:53:13PM +0200, Christian Brauner wrote:
On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote:
...
-SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) +static int restrictedmem_create(struct vfsmount *mount) { struct file *file, *restricted_file; int fd, err;
- if (flags)
return -EINVAL;
- fd = get_unused_fd_flags(0);
Any reasons the file descriptors aren't O_CLOEXEC by default? I don't see any reasons why we should introduce new fd types that aren't O_CLOEXEC by default. The "don't mix-and-match" train has already left the station anyway, as we do have seccomp notifier fds and pidfds, both of which are O_CLOEXEC by default.
Thanks for pointing this out. I agree with using O_CLOEXEC, but didn’t notice this before. Let us discuss this under the original series at [1].
if (fd < 0) return fd;
- file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
- if (mount)
file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem",
0, VM_NORESERVE);
- else
file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
- if (IS_ERR(file)) { err = PTR_ERR(file); goto err_fd;
@@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned
int, flags)
return err; }
+static bool is_shmem_mount(struct vfsmount *mnt) +{
- return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;
This can just be if (mnt->mnt_sb->s_magic == TMPFS_MAGIC).
Will simplify this in the next revision.
+}
+static bool is_mount_root(struct file *file) +{
- return file->f_path.dentry == file->f_path.mnt->mnt_root;
mount -t tmpfs tmpfs /mnt
touch /mnt/bla
touch /mnt/ble
mount --bind /mnt/bla /mnt/ble

fd = open("/mnt/ble")
fd_restricted = memfd_restricted(fd)
IOW, this doesn't restrict it to the tmpfs root. It only restricts it to paths that refer to the root of any tmpfs mount. To exclude bind-mounts that aren't bind-mounts of the whole filesystem you want:
path->dentry == path->mnt->mnt_root && path->mnt->mnt_root == path->mnt->mnt_sb->s_root
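Put together with the TMPFS_MAGIC check from is_shmem_mount() above, the helper might end up looking something like this (untested sketch, taking a struct path instead of a struct file):

static bool is_shmem_mount_root(const struct path *path)
{
	/* Reject bind mounts of anything but the whole tmpfs instance. */
	if (path->dentry != path->mnt->mnt_root ||
	    path->mnt->mnt_root != path->mnt->mnt_sb->s_root)
		return false;

	return path->mnt->mnt_sb->s_magic == TMPFS_MAGIC;
}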
Will adopt this in the next revision and add a selftest to check this. Thanks for pointing this out!
+}
+static int restrictedmem_create_on_user_mount(int mount_fd) +{
- int ret;
- struct fd f;
- struct vfsmount *mnt;
- f = fdget_raw(mount_fd);
- if (!f.file)
return -EBADF;
- ret = -EINVAL;
- if (!is_mount_root(f.file))
goto out;
- mnt = f.file->f_path.mnt;
- if (!is_shmem_mount(mnt))
goto out;
- ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);
With the current semantics you're asking whether you have write permissions on the /mnt/ble file in order to get an answer to the question of whether you're allowed to create an unlinked restricted memory file. That doesn't make much sense afaict.
That's true. Since mnt_want_write() already checks for write permissions and this syscall creates an unlinked file on the mount, we don't have to check permissions on the file then. Will remove this in the next revision!
- if (ret)
goto out;
- ret = mnt_want_write(mnt);
- if (unlikely(ret))
goto out;
- ret = restrictedmem_create(mnt);
- mnt_drop_write(mnt);
+out:
- fdput(f);
- return ret;
+}
+SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) +{
- if (flags & ~RMFD_USERMNT)
return -EINVAL;
- if (flags == RMFD_USERMNT) {
Why do you even need this flag? It seems that @mount_fd being < 0 is sufficient to indicate that a new restricted memory fd is supposed to be created in the system instance.
I'm hoping to have this patch series merged after Chao's patch series introduces the memfd_restricted() syscall [1].
This flag is necessary to indicate the validity of the second argument.
With this flag, we can definitively return an error if the fd is invalid, which I think is a better experience for the userspace programmer than if we just silently default to the kernel mount when the fd provided is invalid.
if (mount_fd < 0)
return -EINVAL;
return restrictedmem_create_on_user_mount(mount_fd);
- } else {
return restrictedmem_create(NULL);
- }
+}
I have to say that I'm very confused by all of this the more I look at it.
Effectively memfd restricted functions as a wrapper filesystem around the tmpfs filesystem. This is basically a weird overlay filesystem. You're allocating tmpfs files that you stash in restrictedmem files. I have to say that this seems very hacky. I didn't get this at all at first.
So what does the caller get if they call statx() on a restricted memfd? Do they get the device number of the tmpfs mount and the inode numbers of the tmpfs mount? Because it looks like they would:
static int restrictedmem_getattr(struct user_namespace *mnt_userns,
				 const struct path *path, struct kstat *stat,
				 u32 request_mask, unsigned int query_flags)
{
	struct inode *inode = d_inode(path->dentry);
	struct restrictedmem *rm = inode->i_mapping->private_data;
	struct file *memfd = rm->memfd;
return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
This is pretty broken btw, because @path refers to a restrictedmem path which you're passing to a tmpfs iop...
I see that in
return memfd->f_inode->i_op->getattr(mnt_userns, &memfd->f_path, stat, request_mask, query_flags);
this is fixed, but still, this is... not great.
Thanks, this will be fixed in the next revision by rebasing on Chao's latest code.
request_mask, query_flags);
That @memfd would be a struct file allocated in a tmpfs instance, no? So you'd be calling the inode operation of the tmpfs file meaning that struct kstat will be filled up with the info from the tmpfs instance.
But then if I call statfs() and check the fstype I would get RESTRICTEDMEM_MAGIC, no? This is... unorthodox?
I'm honestly puzzled and this sounds really strange. There must be a better way to implement all of this.
Shouldn't you try and make this a part of tmpfs proper? Make a really separate filesystem and add a memfs library that both tmpfs and restrictedmemfs can use? Add a mount option to tmpfs that makes it a restricted tmpfs?
This was discussed earlier in the patch series introducing memfd_restricted and this approach was taken to better manage ownership of required functionalities between two subsystems. Please see discussion beginning [2]
[1] -> https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.inte.... [2] -> https://lore.kernel.org/lkml/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com...
On Fri, Dec 02, 2022 at 02:13:39PM +0800, Chao Peng wrote:
From: "Kirill A. Shutemov" kirill.shutemov@linux.intel.com
Introduce 'memfd_restricted' system call with the ability to create memory areas that are restricted from userspace access through ordinary MMU operations (e.g. read/write/mmap). The memory content is expected to be used through the new in-kernel interface by a third kernel module.
memfd_restricted() is useful for scenarios where a file descriptor (fd) can be used as an interface into mm but we want to restrict userspace's ability on the fd. Initially it is designed to provide protections for KVM encrypted guest memory.
Normally KVM uses memfd memory via mmapping the memfd into KVM userspace (e.g. QEMU) and then using the mmapped virtual address to set up the mapping in the KVM secondary page table (e.g. EPT). With confidential computing technologies like Intel TDX, the memfd memory may be encrypted with a special key for a special software domain (e.g. a KVM guest) and is not expected to be directly accessed by userspace. Precisely, userspace access to such encrypted memory may lead to a host crash, so it should be prevented.
memfd_restricted() provides the semantics required for KVM guest encrypted memory support: an fd created with memfd_restricted() is going to be used as the source of guest memory in a confidential computing environment, and KVM can directly interact with core-mm without the need to expose the memory content to KVM userspace.
KVM userspace is still in charge of the lifecycle of the fd. It should pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to obtain the physical memory page and then uses it to populate the KVM secondary page table entries.
The userspace restricted memfd can be fallocate-ed or hole-punched from userspace. When hole-punched, KVM gets notified through the invalidate_start/invalidate_end() callbacks and then gets a chance to remove any mapped entries of the range in the secondary page tables.
A machine check can happen for memory pages in the restricted memfd; instead of routing this directly to userspace, we call the error() callback that KVM registered. KVM then gets a chance to handle it correctly.
memfd_restricted() itself is implemented as a shim layer on top of real memory file systems (currently tmpfs). Pages in restrictedmem are marked as unmovable and unevictable; this is required for the current confidential usage. But in the future this might change.
By default memfd_restricted() prevents userspace read, write and mmap. By defining a new bit in the 'flags', it can be extended to support other restricted semantics in the future.
The system call is currently wired up for the x86 arch.
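To illustrate the intended usage, here is a minimal consumer sketch of the API declared in the new header below; everything other than the restrictedmem_* calls (the demo_* names, and the assumption that the returned page reference is dropped with put_page()) is made up for the example:

static void demo_invalidate_start(struct restrictedmem_notifier *notifier,
				  pgoff_t start, pgoff_t end)
{
	/* Zap secondary page table entries covering [start, end) here. */
}

static void demo_invalidate_end(struct restrictedmem_notifier *notifier,
				pgoff_t start, pgoff_t end)
{
}

static void demo_error(struct restrictedmem_notifier *notifier,
		       pgoff_t start, pgoff_t end)
{
	/* React to poisoned pages, e.g. by stopping the affected guest. */
}

static const struct restrictedmem_notifier_ops demo_notifier_ops = {
	.invalidate_start	= demo_invalidate_start,
	.invalidate_end		= demo_invalidate_end,
	.error			= demo_error,
};

static struct restrictedmem_notifier demo_notifier = {
	.ops = &demo_notifier_ops,
};

static int demo_map_offset(struct file *restricted_file, pgoff_t offset)
{
	struct page *page;
	int order, ret;

	/* Registration would normally happen once, at memslot creation. */
	restrictedmem_register_notifier(restricted_file, &demo_notifier);

	ret = restrictedmem_get_page(restricted_file, offset, &page, &order);
	if (ret)
		return ret;

	/* page_to_pfn(page) goes into the secondary page table here. */
	put_page(page);
	return 0;
}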
Signed-off-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Signed-off-by: Chao Peng chao.p.peng@linux.intel.com
arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + include/linux/restrictedmem.h | 71 ++++++ include/linux/syscalls.h | 1 + include/uapi/asm-generic/unistd.h | 5 +- include/uapi/linux/magic.h | 1 + kernel/sys_ni.c | 3 + mm/Kconfig | 4 + mm/Makefile | 1 + mm/memory-failure.c | 3 + mm/restrictedmem.c | 318 +++++++++++++++++++++++++ 11 files changed, 408 insertions(+), 1 deletion(-) create mode 100644 include/linux/restrictedmem.h create mode 100644 mm/restrictedmem.c
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 320480a8db4f..dc70ba90247e 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -455,3 +455,4 @@ 448 i386 process_mrelease sys_process_mrelease 449 i386 futex_waitv sys_futex_waitv 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node +451 i386 memfd_restricted sys_memfd_restricted diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index c84d12608cd2..06516abc8318 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -372,6 +372,7 @@ 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv 450 common set_mempolicy_home_node sys_set_mempolicy_home_node +451 common memfd_restricted sys_memfd_restricted # # Due to a historical design error, certain syscalls are numbered differently diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h new file mode 100644 index 000000000000..c2700c5daa43 --- /dev/null +++ b/include/linux/restrictedmem.h @@ -0,0 +1,71 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _LINUX_RESTRICTEDMEM_H
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/pfn_t.h>
+
+struct restrictedmem_notifier;
+
+struct restrictedmem_notifier_ops {
+	void (*invalidate_start)(struct restrictedmem_notifier *notifier,
+				 pgoff_t start, pgoff_t end);
+	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
+			       pgoff_t start, pgoff_t end);
+	void (*error)(struct restrictedmem_notifier *notifier,
+		      pgoff_t start, pgoff_t end);
+};
+
+struct restrictedmem_notifier {
+	struct list_head list;
+	const struct restrictedmem_notifier_ops *ops;
+};
+
+#ifdef CONFIG_RESTRICTEDMEM
+
+void restrictedmem_register_notifier(struct file *file,
+				     struct restrictedmem_notifier *notifier);
+void restrictedmem_unregister_notifier(struct file *file,
+				       struct restrictedmem_notifier *notifier);
+int restrictedmem_get_page(struct file *file, pgoff_t offset,
+			   struct page **pagep, int *order);
+
+static inline bool file_is_restrictedmem(struct file *file)
+{
+	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
+}
+
+void restrictedmem_error_page(struct page *page, struct address_space *mapping);
+
+#else
+
+static inline void restrictedmem_register_notifier(struct file *file,
+						    struct restrictedmem_notifier *notifier)
+{
+}
+
+static inline void restrictedmem_unregister_notifier(struct file *file,
+						      struct restrictedmem_notifier *notifier)
+{
+}
+
+static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
+					 struct page **pagep, int *order)
+{
+	return -1;
+}
+
+static inline bool file_is_restrictedmem(struct file *file)
+{
+	return false;
+}
+
+static inline void restrictedmem_error_page(struct page *page,
+					    struct address_space *mapping)
+{
+}
+
+#endif /* CONFIG_RESTRICTEDMEM */
+
+#endif /* _LINUX_RESTRICTEDMEM_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a34b0f9a9972..f9e9e0c820c5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1056,6 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_memfd_restricted(unsigned int flags);
 
 /*
  * Architecture-specific system calls
  */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 45fa180cc56a..e93cd35e46d0 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv) #define __NR_set_mempolicy_home_node 450 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node) +#define __NR_memfd_restricted 451 +__SYSCALL(__NR_memfd_restricted, sys_memfd_restricted)
#undef __NR_syscalls -#define __NR_syscalls 451 +#define __NR_syscalls 452 /*
- 32 bit systems traditionally used different
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h index 6325d1d0e90f..8aa38324b90a 100644 --- a/include/uapi/linux/magic.h +++ b/include/uapi/linux/magic.h @@ -101,5 +101,6 @@ #define DMA_BUF_MAGIC 0x444d4142 /* "DMAB" */ #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */ #define SECRETMEM_MAGIC 0x5345434d /* "SECM" */ +#define RESTRICTEDMEM_MAGIC 0x5245534d /* "RESM" */ #endif /* __LINUX_MAGIC_H__ */ diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 860b2dcf3ac4..7c4a32cbd2e7 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -360,6 +360,9 @@ COND_SYSCALL(pkey_free); /* memfd_secret */ COND_SYSCALL(memfd_secret); +/* memfd_restricted */ +COND_SYSCALL(memfd_restricted);
/*
- Architecture specific weak syscall entries.
*/ diff --git a/mm/Kconfig b/mm/Kconfig index 57e1d8c5b505..06b0e1d6b8c1 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1076,6 +1076,10 @@ config IO_MAPPING config SECRETMEM def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED +config RESTRICTEDMEM
- bool
- depends on TMPFS
config ANON_VMA_NAME bool "Anonymous VMA name support" depends on PROC_FS && ADVISE_SYSCALLS && MMU diff --git a/mm/Makefile b/mm/Makefile index 8e105e5b3e29..bcbb0edf9ba1 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -121,6 +121,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o obj-$(CONFIG_SECRETMEM) += secretmem.o +obj-$(CONFIG_RESTRICTEDMEM) += restrictedmem.o obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 145bb561ddb3..f91b444e471e 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -62,6 +62,7 @@ #include <linux/page-isolation.h> #include <linux/pagewalk.h> #include <linux/shmem_fs.h> +#include <linux/restrictedmem.h> #include "swap.h" #include "internal.h" #include "ras/ras_event.h" @@ -940,6 +941,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p) goto out; }
- restrictedmem_error_page(p, mapping);
- /*
- The shmem page is kept in page cache instead of truncating
- so is expected to have an extra refcount after error-handling.
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c new file mode 100644 index 000000000000..56953c204e5c --- /dev/null +++ b/mm/restrictedmem.c @@ -0,0 +1,318 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "linux/sbitmap.h" +#include <linux/pagemap.h> +#include <linux/pseudo_fs.h> +#include <linux/shmem_fs.h> +#include <linux/syscalls.h> +#include <uapi/linux/falloc.h> +#include <uapi/linux/magic.h> +#include <linux/restrictedmem.h>
+struct restrictedmem_data {
- struct mutex lock;
- struct file *memfd;
- struct list_head notifiers;
+};
+static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
pgoff_t start, pgoff_t end)
+{
- struct restrictedmem_notifier *notifier;
- mutex_lock(&data->lock);
- list_for_each_entry(notifier, &data->notifiers, list) {
notifier->ops->invalidate_start(notifier, start, end);
- }
- mutex_unlock(&data->lock);
+}
+static void restrictedmem_invalidate_end(struct restrictedmem_data *data,
pgoff_t start, pgoff_t end)
+{
- struct restrictedmem_notifier *notifier;
- mutex_lock(&data->lock);
- list_for_each_entry(notifier, &data->notifiers, list) {
notifier->ops->invalidate_end(notifier, start, end);
- }
- mutex_unlock(&data->lock);
+}
+static void restrictedmem_notifier_error(struct restrictedmem_data *data,
pgoff_t start, pgoff_t end)
+{
- struct restrictedmem_notifier *notifier;
- mutex_lock(&data->lock);
- list_for_each_entry(notifier, &data->notifiers, list) {
notifier->ops->error(notifier, start, end);
- }
- mutex_unlock(&data->lock);
+}
+static int restrictedmem_release(struct inode *inode, struct file *file) +{
- struct restrictedmem_data *data = inode->i_mapping->private_data;
- fput(data->memfd);
- kfree(data);
- return 0;
+}
+static long restrictedmem_punch_hole(struct restrictedmem_data *data, int mode,
loff_t offset, loff_t len)
+{
- int ret;
- pgoff_t start, end;
- struct file *memfd = data->memfd;
- if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
return -EINVAL;
- start = offset >> PAGE_SHIFT;
- end = (offset + len) >> PAGE_SHIFT;
- restrictedmem_invalidate_start(data, start, end);
- ret = memfd->f_op->fallocate(memfd, mode, offset, len);
- restrictedmem_invalidate_end(data, start, end);
- return ret;
+}
+static long restrictedmem_fallocate(struct file *file, int mode,
loff_t offset, loff_t len)
+{
- struct restrictedmem_data *data = file->f_mapping->private_data;
- struct file *memfd = data->memfd;
- if (mode & FALLOC_FL_PUNCH_HOLE)
return restrictedmem_punch_hole(data, mode, offset, len);
- return memfd->f_op->fallocate(memfd, mode, offset, len);
+}
+static const struct file_operations restrictedmem_fops = {
- .release = restrictedmem_release,
- .fallocate = restrictedmem_fallocate,
+};
+static int restrictedmem_getattr(struct user_namespace *mnt_userns,
const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags)
+{
- struct inode *inode = d_inode(path->dentry);
- struct restrictedmem_data *data = inode->i_mapping->private_data;
- struct file *memfd = data->memfd;
- return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
request_mask, query_flags);
+}
+static int restrictedmem_setattr(struct user_namespace *mnt_userns,
struct dentry *dentry, struct iattr *attr)
+{
- struct inode *inode = d_inode(dentry);
- struct restrictedmem_data *data = inode->i_mapping->private_data;
- struct file *memfd = data->memfd;
- int ret;
- if (attr->ia_valid & ATTR_SIZE) {
if (memfd->f_inode->i_size)
return -EPERM;
if (!PAGE_ALIGNED(attr->ia_size))
return -EINVAL;
- }
- ret = memfd->f_inode->i_op->setattr(mnt_userns,
file_dentry(memfd), attr);
- return ret;
+}
+static const struct inode_operations restrictedmem_iops = {
- .getattr = restrictedmem_getattr,
- .setattr = restrictedmem_setattr,
+};
+static int restrictedmem_init_fs_context(struct fs_context *fc) +{
- if (!init_pseudo(fc, RESTRICTEDMEM_MAGIC))
return -ENOMEM;
- fc->s_iflags |= SB_I_NOEXEC;
- return 0;
+}
+static struct file_system_type restrictedmem_fs = {
- .owner = THIS_MODULE,
- .name = "memfd:restrictedmem",
- .init_fs_context = restrictedmem_init_fs_context,
- .kill_sb = kill_anon_super,
+};
+static struct vfsmount *restrictedmem_mnt;
+static __init int restrictedmem_init(void) +{
- restrictedmem_mnt = kern_mount(&restrictedmem_fs);
- if (IS_ERR(restrictedmem_mnt))
return PTR_ERR(restrictedmem_mnt);
- return 0;
+} +fs_initcall(restrictedmem_init);
+static struct file *restrictedmem_file_create(struct file *memfd) +{
- struct restrictedmem_data *data;
- struct address_space *mapping;
- struct inode *inode;
- struct file *file;
- data = kzalloc(sizeof(*data), GFP_KERNEL);
- if (!data)
return ERR_PTR(-ENOMEM);
- data->memfd = memfd;
- mutex_init(&data->lock);
- INIT_LIST_HEAD(&data->notifiers);
- inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
- if (IS_ERR(inode)) {
kfree(data);
return ERR_CAST(inode);
- }
- inode->i_mode |= S_IFREG;
- inode->i_op = &restrictedmem_iops;
- inode->i_mapping->private_data = data;
- file = alloc_file_pseudo(inode, restrictedmem_mnt,
"restrictedmem", O_RDWR,
&restrictedmem_fops);
- if (IS_ERR(file)) {
iput(inode);
kfree(data);
return ERR_CAST(file);
- }
- file->f_flags |= O_LARGEFILE;
- /*
* These pages are currently unmovable so don't place them into movable
* pageblocks (e.g. CMA and ZONE_MOVABLE).
*/
- mapping = memfd->f_mapping;
- mapping_set_unevictable(mapping);
- mapping_set_gfp_mask(mapping,
mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
- return file;
+}
+SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags) +{
- struct file *file, *restricted_file;
- int fd, err;
- if (flags)
return -EINVAL;
- fd = get_unused_fd_flags(0);
- if (fd < 0)
return fd;
- file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
- if (IS_ERR(file)) {
err = PTR_ERR(file);
goto err_fd;
- }
- file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
- file->f_flags |= O_LARGEFILE;
- restricted_file = restrictedmem_file_create(file);
- if (IS_ERR(restricted_file)) {
err = PTR_ERR(restricted_file);
fput(file);
goto err_fd;
- }
- fd_install(fd, restricted_file);
- return fd;
+err_fd:
- put_unused_fd(fd);
- return err;
+}
+void restrictedmem_register_notifier(struct file *file,
struct restrictedmem_notifier *notifier)
+{
- struct restrictedmem_data *data = file->f_mapping->private_data;
- mutex_lock(&data->lock);
- list_add(¬ifier->list, &data->notifiers);
- mutex_unlock(&data->lock);
+} +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
+void restrictedmem_unregister_notifier(struct file *file,
struct restrictedmem_notifier *notifier)
+{
- struct restrictedmem_data *data = file->f_mapping->private_data;
- mutex_lock(&data->lock);
- list_del(¬ifier->list);
- mutex_unlock(&data->lock);
+} +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
+int restrictedmem_get_page(struct file *file, pgoff_t offset,
struct page **pagep, int *order)
+{
- struct restrictedmem_data *data = file->f_mapping->private_data;
- struct file *memfd = data->memfd;
- struct folio *folio;
- struct page *page;
- int ret;
- ret = shmem_get_folio(file_inode(memfd), offset, &folio, SGP_WRITE);
- if (ret)
return ret;
- page = folio_file_page(folio, offset);
- *pagep = page;
- if (order)
*order = thp_order(compound_head(page));
- SetPageUptodate(page);
- unlock_page(page);
- return 0;
+} +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
+void restrictedmem_error_page(struct page *page, struct address_space *mapping) +{
- struct super_block *sb = restrictedmem_mnt->mnt_sb;
- struct inode *inode, *next;
- if (!shmem_mapping(mapping))
return;
- spin_lock(&sb->s_inode_list_lock);
- list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
struct restrictedmem_data *data = inode->i_mapping->private_data;
struct file *memfd = data->memfd;
if (memfd->f_mapping == mapping) {
pgoff_t start, end;
spin_unlock(&sb->s_inode_list_lock);
start = page->index;
end = start + thp_nr_pages(page);
restrictedmem_notifier_error(data, start, end);
return;
}
- }
- spin_unlock(&sb->s_inode_list_lock);
+}
2.25.1
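For illustration only (not part of the patch): below is a hedged sketch of how an in-kernel consumer such as KVM might plug into the notifier API declared in restrictedmem.h above. The restrictedmem_* names come from this patch; everything prefixed example_ is invented for illustration.

/* Hypothetical consumer; only the restrictedmem_* API is from the patch. */
#include <linux/mm.h>
#include <linux/restrictedmem.h>

static void example_invalidate_start(struct restrictedmem_notifier *notifier,
				     pgoff_t start, pgoff_t end)
{
	/* e.g. zap secondary-MMU mappings covering [start, end) */
}

static void example_invalidate_end(struct restrictedmem_notifier *notifier,
				   pgoff_t start, pgoff_t end)
{
	/* e.g. allow faulting the range back in */
}

static void example_error(struct restrictedmem_notifier *notifier,
			  pgoff_t start, pgoff_t end)
{
	/* e.g. mark the affected guest pages as poisoned */
}

static const struct restrictedmem_notifier_ops example_notifier_ops = {
	.invalidate_start	= example_invalidate_start,
	.invalidate_end		= example_invalidate_end,
	.error			= example_error,
};

static struct restrictedmem_notifier example_notifier = {
	.ops = &example_notifier_ops,
};

static int example_bind(struct file *restricted_file, pgoff_t index)
{
	struct page *page;
	int order, ret;

	restrictedmem_register_notifier(restricted_file, &example_notifier);

	/* Resolve a page for fd+offset; the caller must drop the reference. */
	ret = restrictedmem_get_page(restricted_file, index, &page, &order);
	if (!ret)
		put_page(page);

	return ret;
}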
On Thu, Apr 13, 2023, Christian Brauner wrote:
On Thu, Aug 18, 2022 at 04:24:21PM +0300, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
Here's what I would prefer, and imagine much easier for you to maintain; but I'm no system designer, and may be misunderstanding throughout.
QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps the fallocate syscall interface itself) to allocate and free the memory, ioctl for initializing some of it too. KVM in control of whether that fd can be read or written or mmap'ed or whatever, no need to prevent it in shmem.c, no need for flags, seals, notifications to and fro because KVM is already in control and knows the history. If shmem actually has value, call into it underneath - somewhat like SysV SHM, and /dev/zero mmap, and i915/gem make use of it underneath. If shmem has nothing to add, just allocate and free kernel memory directly, recorded in your own xarray.
I guess shim layer on top of shmem *can* work. I don't see immediately why it would not. But I'm not sure it is right direction. We risk creating yet another parallel VM with own rules/locking/accounting that opaque to core-mm.
Sorry for necrobumping this thread but I've been reviewing the
No worries, I'm just stoked someone who actually knows what they're doing is chiming in :-)
memfd_restricted() extension that Ackerley is currently working on. I was pointed to this thread as this is what the extension is building on but I'll reply to both threads here.
From a glance at v10, memfd_restricted() is currently implemented as an in-kernel stacking filesystem. A call to memfd_restricted() creates a new restricted memfd file and a new unlinked tmpfs file and stashes the tmpfs file into the memfd file's private data member. It then uses the tmpfs file's f_ops and i_ops to perform the relevant file and inode operations. So it has the same callstack as a general stacking filesystem like overlayfs in some cases:
memfd_restricted->getattr() -> tmpfs->getattr()
...
Since you're effectively acting like a stacking filesystem you should really use the device number of your memfd restricted filesystem. IOW, sm like:
stat->dev = memfd_restricted_dentry->d_sb->s_dev;
But then you run into trouble if you want to go forward with Ackerley's extension that allows to explicitly pass in tmpfs fds to memfd_restricted(). Afaict, two tmpfs instances might allocate the same inode number. So now the inode and device number pair isn't unique anymore.
So you might best be served by allocating and reporting your own inode numbers as well.
But if you want to preserve the inode number and device number of the relevant tmpfs instance but still report memfd restricted as your filesystem type
Unless I missed something along the way, reporting memfd_restricted as a distinct filesystem is very much a non-goal. AFAIK it's purely a side effect of the proposed implementation.
then I think it's reasonable to ask whether a stacking implementation really makes sense here.
If you extend memfd_restricted() or even consider extending it in the future to take tmpfs file descriptors as arguments to identify the tmpfs instance in which to allocate the underlying tmpfs file for the new restricted memfd file you should really consider a tmpfs based implementation.
Because at that point it just feels like a pointless wrapper to get custom f_ops and i_ops. Plus it's wasteful because you allocate dentries and inodes that you don't really care about at all.
Just off the top of my hat you might be better served:
- by a new ioctl() on tmpfs instances that yield regular tmpfs file descriptors with restricted f_ops and i_ops. That's not that different from btrfs subvolumes which effectively are directories but are created through an ioctl().
I think this is more or less what we want to do, except via a dedicated syscall instead of an ioctl() so that the primary interface isn't strictly tied to tmpfs, e.g. so that it can be extended to other backing types in the future.
- by a mount option to tmpfs that makes it act in this restricted manner then you don't need an ioctl() and can get away with regular open calls. Such a tmpfs instance would only create regular, restricted memfds.
I'd prefer to not go this route, because IIUC it would require relatively invasive changes to shmem code, and IIUC would require similar changes to other supported backings in the future, e.g. hugetlbfs? And as above, I don't think any of the potential use cases need restrictedmem to be a uniquely identifiable mount.
One of the goals (hopefully not a pipe dream) is to design restrictedmem in such a way that extending it to support other backing types isn't terribly difficult. In case it's not obvious, most of us working on this stuff aren't filesystem experts, and many of us aren't mm experts either. The more we (KVM folks for the most part) can leverage existing code to do the heavy lifting, the better.
After giving myself a bit of a crash course in file systems, would something like the below have any chance of (a) working, (b) getting merged, and (c) being maintainable?
The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. There are undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might be doable" or a "no, that's absolutely bonkers, don't try it".
Thanks!
struct restrictedmem { struct rw_semaphore lock; struct file *file; const struct file_operations *backing_f_ops; const struct address_space_operations *backing_a_ops; struct xarray bindings; bool exclusive; };
static int restrictedmem_release(struct inode *inode, struct file *file) { struct restrictedmem *rm = inode->i_mapping->private_data;
xa_destroy(&rm->bindings); kfree(rm);
WARN_ON_ONCE(rm->backing_f_ops->release); return 0; }
static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode, loff_t offset, loff_t len) { struct restrictedmem_notifier *notifier; unsigned long index; pgoff_t start, end; int ret;
if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) return -EINVAL;
start = offset >> PAGE_SHIFT; end = (offset + len) >> PAGE_SHIFT;
/* * Bindings must be stable across invalidation to ensure the start+end * are balanced. */ down_read(&rm->lock);
xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) notifier->ops->invalidate_start(notifier, start, end);
ret = rm->backing_f_ops->fallocate(rm->file, mode, offset, len);
xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) notifier->ops->invalidate_end(notifier, start, end);
up_read(&rm->lock);
return ret; }
static long restrictedmem_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { struct restrictedmem *rm = file->f_mapping->private_data;
if (mode & FALLOC_FL_PUNCH_HOLE) return restrictedmem_punch_hole(rm, mode, offset, len);
return rm->backing_f_ops->fallocate(file, mode, offset, len); }
static int restrictedmem_migrate_folio(struct address_space *mapping, struct folio *dst, struct folio *src, enum migrate_mode) { WARN_ON_ONCE(1); return -EINVAL; }
static int restrictedmem_error_page(struct address_space *mapping, struct page *page) { struct restrictedmem *rm = mapping->private_data; struct restrictedmem_notifier *notifier; unsigned long index; pgoff_t start, end;
start = page->index; end = start + thp_nr_pages(page);
down_read(&rm->lock);
xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) notifier->ops->error(notifier, start, end);
up_read(&rm->lock);
return rm->backing_a_ops->error_remove_page(mapping, page); }
static const struct file_operations restrictedmem_fops = { .release = restrictedmem_release, .fallocate = restrictedmem_fallocate, };
static const struct address_space_operations restrictedmem_aops = { .dirty_folio = noop_dirty_folio, #ifdef CONFIG_MIGRATION .migrate_folio = restrictedmem_migrate_folio, #endif .error_remove_page = restrictedmem_error_page, };
static int restrictedmem_file_create(struct file *file) { struct address_space *mapping = file->f_mapping; struct restrictedmem *rm;
rm = kzalloc(sizeof(*rm), GFP_KERNEL); if (!rm) return -ENOMEM;
rm->backing_f_ops = file->f_op; rm->backing_a_ops = mapping->a_ops; rm->file = file; init_rwsem(&rm->lock); xa_init(&rm->bindings);
file->f_flags |= O_LARGEFILE;
file->f_op = &restrictedmem_fops; mapping->a_ops = &restrictedmem_aops;
mapping_set_unevictable(mapping); mapping_set_unmovable(mapping); mapping_set_gfp_mask(mapping, mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); return 0; }
static int restrictedmem_create(struct vfsmount *mount) { struct file *file; int fd, err;
fd = get_unused_fd_flags(0); if (fd < 0) return fd;
file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE); if (IS_ERR(file)) { err = PTR_ERR(file); goto err_fd; } if (WARN_ON_ONCE(file->private_data)) { err = -EEXIST; goto err_fd; }
file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; file->f_flags |= O_LARGEFILE;
err = restrictedmem_file_create(file); if (err) { fput(file); goto err_fd; }
fd_install(fd, file); return fd; err_fd: put_unused_fd(fd); return err; }
SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) { struct vfsmount *mnt; struct path *path; struct fd f; int ret;
if (flags) return -EINVAL;
f = fdget_raw(mount_fd); if (!f.file) return -EBADF;
ret = -EINVAL;
path = &f.file->f_path; if (path->dentry != path->mnt->mnt_root) goto out;
/* Disallow bind-mounts that aren't bind-mounts of the whole filesystem. */ mnt = path->mnt; if (mnt->mnt_root != mnt->mnt_sb->s_root) goto out;
/* * The filesystem must be mounted no-execute, executing from guest * private memory in the host is nonsensical and unsafe. */ if (!(mnt->mnt_sb->s_iflags & SB_I_NOEXEC)) goto out;
/* Currently only TMPFS is supported as underlying storage. */ if (mnt->mnt_sb->s_magic != TMPFS_MAGIC) goto out;
ret = mnt_want_write(mnt); if (ret) goto out;
ret = restrictedmem_create(mnt);
if (mnt) mnt_drop_write(mnt); out: if (f.file) fdput(f);
return ret; }
Sean Christopherson seanjc@google.com writes:
On Thu, Apr 13, 2023, Christian Brauner wrote:
On Thu, Aug 18, 2022 at 04:24:21PM +0300, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
Here's what I would prefer, and imagine much easier for you to maintain; but I'm no system designer, and may be misunderstanding throughout.

QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps the fallocate syscall interface itself) to allocate and free the memory, ioctl for initializing some of it too. KVM in control of whether that fd can be read or written or mmap'ed or whatever, no need to prevent it in shmem.c, no need for flags, seals, notifications to and fro because KVM is already in control and knows the history. If shmem actually has value, call into it underneath - somewhat like SysV SHM, and /dev/zero mmap, and i915/gem make use of it underneath. If shmem has nothing to add, just allocate and free kernel memory directly, recorded in your own xarray.

I guess shim layer on top of shmem *can* work. I don't see immediately why it would not. But I'm not sure it is right direction. We risk creating yet another parallel VM with own rules/locking/accounting that opaque to core-mm.
Sorry for necrobumping this thread but I've been reviewing the
No worries, I'm just stoked someone who actually knows what they're doing is chiming in :-)
+1, thanks Christian!
memfd_restricted() extension that Ackerley is currently working on. I was pointed to this thread as this is what the extension is building on but I'll reply to both threads here.
From a glance at v10, memfd_restricted() is currently implemented as an in-kernel stacking filesystem. A call to memfd_restricted() creates a new restricted memfd file and a new unlinked tmpfs file and stashes the tmpfs file into the memfd file's private data member. It then uses the tmpfs file's f_ops and i_ops to perform the relevant file and inode operations. So it has the same callstack as a general stacking filesystem like overlayfs in some cases:
memfd_restricted->getattr() -> tmpfs->getattr()
...
Since you're effectively acting like a stacking filesystem you should really use the device number of your memfd restricted filesystem. IOW, sm like:
stat->dev = memfd_restricted_dentry->d_sb->s_dev;
But then you run into trouble if you want to go forward with Ackerley's extension that allows to explicitly pass in tmpfs fds to memfd_restricted(). Afaict, two tmpfs instances might allocate the same inode number. So now the inode and device number pair isn't unique anymore.
So you might best be served by allocating and reporting your own inode numbers as well.
But if you want to preserve the inode number and device number of the relevant tmpfs instance but still report memfd restricted as your filesystem type
Unless I missed something along the way, reporting memfd_restricted as a distinct filesystem is very much a non-goal. AFAIK it's purely a side effect of the proposed implementation.
then I think it's reasonable to ask whether a stacking implementation really makes sense here.
If you extend memfd_restricted() or even consider extending it in the future to take tmpfs file descriptors as arguments to identify the tmpfs instance in which to allocate the underlying tmpfs file for the new restricted memfd file you should really consider a tmpfs based implementation.
Because at that point it just feels like a pointless wrapper to get custom f_ops and i_ops. Plus it's wasteful because you allocate dentries and inodes that you don't really care about at all.
Just off the top of my hat you might be better served:
- by a new ioctl() on tmpfs instances that yield regular tmpfs file descriptors with restricted f_ops and i_ops. That's not that different from btrfs subvolumes which effectively are directories but are created through an ioctl().
I think this is more or less what we want to do, except via a dedicated syscall instead of an ioctl() so that the primary interface isn't strictly tied to tmpfs, e.g. so that it can be extended to other backing types in the future.
- by a mount option to tmpfs that makes it act in this restricted manner then you don't need an ioctl() and can get away with regular open calls. Such a tmpfs instance would only create regular, restricted memfds.
I'd prefer to not go this route, because IIUC it would require relatively invasive changes to shmem code, and IIUC would require similar changes to other supported backings in the future, e.g. hugetlbfs? And as above, I don't think any of the potential use cases need restrictedmem to be a uniquely identifiable mount.
FWIW, I'm starting to look at extending restrictedmem to hugetlbfs and the separation that the current implementation has is very helpful. Also helps that hugetlbfs and tmpfs are structured similarly, I guess.
One of the goals (hopefully not a pipe dream) is to design restrictedmem in such a way that extending it to support other backing types isn't terribly difficult. In case it's not obvious, most of us working on this stuff aren't filesystem experts, and many of us aren't mm experts either. The more we (KVM folks for the most part) can leverage existing code to do the heavy lifting, the better.
After giving myself a bit of a crash course in file systems, would something like the below have any chance of (a) working, (b) getting merged, and (c) being maintainable?
The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. There are undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might be doable" or a "no, that's absolutely bonkers, don't try it".
Not an FS expert by any means, but I did think of approaching it this way as well!
"Hijacking" perhaps gives this approach a bit of a negative connotation. I thought this is pretty close to subclassing (as in Object Oriented Programming). When some methods (e.g. fallocate) are called, restrictedmem does some work, and calls the same method in the superclass.
The existing restrictedmem code is more like instantiating a shmem object and keeping that object as a field within the restrictedmem object.
Some (maybe small) issues I can think of now:
(1)
One difficulty with this approach is that other functions may make assumptions about private_data being of a certain type, or functions may use private_data.
I checked and IIUC neither shmem nor hugetlbfs use the private_data field in the inode's i_mapping (also file's f_mapping).
But there's fs/buffer.c which uses private_data, although those functions seem to be used by FSes like ext4 and fat, not memory-backed FSes.
We can probably fix this if any backing filesystems of restrictedmem, like tmpfs and future ones use private_data.
Could the solution here be to store private_data of the superclass instance in restrictedmem, and then override every method in the superclass that uses private_data to first restore private_data before making the superclass call? Perhaps we can take private_lock to change private_data.
(2)
Perhaps there are other slightly hidden cases that might need cleaning up.
For example, one of the patches in this series amends the shmem_mapping() function from
return mapping->a_ops == &shmem_aops;
to
return mapping->host->i_sb->s_magic == TMPFS_MAGIC;
The former/original is more accurate since it checks a property of the mapping itself instead of checking a property of the mapping's host's superblock.
The impact of changing this guard is more obvious if we now override a_ops but keep the mapping's host's superblock's s_magic.
Specifically for this example, maybe we should handle restrictedmem in the caller (me_pagecache_clean()) specially, in addition to shmem.
Thanks!
struct restrictedmem { struct rw_semaphore lock; struct file *file; const struct file_operations *backing_f_ops; const struct address_space_operations *backing_a_ops; struct xarray bindings; bool exclusive; };
static int restrictedmem_release(struct inode *inode, struct file *file) { struct restrictedmem *rm = inode->i_mapping->private_data;
xa_destroy(&rm->bindings); kfree(rm);
WARN_ON_ONCE(rm->backing_f_ops->release); return 0; }
static long restrictedmem_punch_hole(struct restrictedmem *rm, int mode, loff_t offset, loff_t len) { struct restrictedmem_notifier *notifier; unsigned long index; pgoff_t start, end; int ret;
if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) return -EINVAL;
start = offset >> PAGE_SHIFT; end = (offset + len) >> PAGE_SHIFT;
/* * Bindings must be stable across invalidation to ensure the start+end * are balanced. */ down_read(&rm->lock);
xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) notifier->ops->invalidate_start(notifier, start, end);
ret = rm->backing_f_ops->fallocate(rm->file, mode, offset, len);
xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) notifier->ops->invalidate_end(notifier, start, end);
up_read(&rm->lock);
return ret; }
static long restrictedmem_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { struct restrictedmem *rm = file->f_mapping->private_data;
if (mode & FALLOC_FL_PUNCH_HOLE) return restrictedmem_punch_hole(rm, mode, offset, len);
return rm->backing_f_ops->fallocate(file, mode, offset, len); }
static int restrictedmem_migrate_folio(struct address_space *mapping, struct folio *dst, struct folio *src, enum migrate_mode) { WARN_ON_ONCE(1); return -EINVAL; }
static int restrictedmem_error_page(struct address_space *mapping, struct page *page) { struct restrictedmem *rm = mapping->private_data; struct restrictedmem_notifier *notifier; unsigned long index; pgoff_t start, end;
start = page->index; end = start + thp_nr_pages(page);
down_read(&rm->lock);
xa_for_each_range(&rm->bindings, index, notifier, start, end - 1) notifier->ops->error(notifier, start, end);
up_read(&rm->lock);
return rm->backing_a_ops->error_remove_page(mapping, page); }
When I was thinking of this I was stuck on handling error_remove_page, because it was looking up the superblock to iterate over the inodes to find the right mapping. Glad to see that the solution is simply to use the given mapping from the arguments!
static const struct file_operations restrictedmem_fops = { .release = restrictedmem_release, .fallocate = restrictedmem_fallocate, };
static const struct address_space_operations restrictedmem_aops = { .dirty_folio = noop_dirty_folio, #ifdef CONFIG_MIGRATION .migrate_folio = restrictedmem_migrate_folio, #endif .error_remove_page = restrictedmem_error_page, };
static int restrictedmem_file_create(struct file *file) { struct address_space *mapping = file->f_mapping; struct restrictedmem *rm;
rm = kzalloc(sizeof(*rm), GFP_KERNEL); if (!rm) return -ENOMEM;
rm->backing_f_ops = file->f_op; rm->backing_a_ops = mapping->a_ops; rm->file = file;
We don't really need to do this, since rm->file is already the same as file; we could just pass the file itself when it's needed.
init_rwsem(&rm->lock); xa_init(&rm->bindings);
file->f_flags |= O_LARGEFILE;
file->f_op = &restrictedmem_fops; mapping->a_ops = &restrictedmem_aops;
I think we probably have to override inode_operations as well, because otherwise other methods would become available to a restrictedmem file (like link, unlink, mkdir, tmpfile). Or maybe that's a feature instead of a bug.
mapping_set_unevictable(mapping); mapping_set_unmovable(mapping); mapping_set_gfp_mask(mapping, mapping_gfp_mask(mapping) & ~__GFP_MOVABLE); return 0; }
static int restrictedmem_create(struct vfsmount *mount) { struct file *file; int fd, err;
fd = get_unused_fd_flags(0); if (fd < 0) return fd;
file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE); if (IS_ERR(file)) { err = PTR_ERR(file); goto err_fd; } if (WARN_ON_ONCE(file->private_data)) { err = -EEXIST; goto err_fd; }
Did you intend this as a check that the backing filesystem isn't using the private_data field in the mapping?
I think you meant file->f_mapping->private_data.
On this note, we will probably have to fix things whenever any backing filesystems need the private_data field.
file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; file->f_flags |= O_LARGEFILE;
err = restrictedmem_file_create(file); if (err) { fput(file); goto err_fd; }
fd_install(fd, file); return fd; err_fd: put_unused_fd(fd); return err; }
SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd) { struct vfsmount *mnt; struct path *path; struct fd f; int ret;
if (flags) return -EINVAL;
f = fdget_raw(mount_fd); if (!f.file) return -EBADF;
ret = -EINVAL;
path = &f.file->f_path; if (path->dentry != path->mnt->mnt_root) goto out;
/* Disallow bind-mounts that aren't bind-mounts of the whole filesystem. */ mnt = path->mnt; if (mnt->mnt_root != mnt->mnt_sb->s_root) goto out;
/* * The filesystem must be mounted no-execute, executing from guest * private memory in the host is nonsensical and unsafe. */ if (!(mnt->mnt_sb->s_iflags & SB_I_NOEXEC)) goto out;
/* Currently only TMPFS is supported as underlying storage. */ if (mnt->mnt_sb->s_magic != TMPFS_MAGIC) goto out;
ret = mnt_want_write(mnt); if (ret) goto out;
ret = restrictedmem_create(mnt);
if (mnt) mnt_drop_write(mnt); out: if (f.file) fdput(f);
return ret; }
On Fri, Apr 14, 2023, Ackerley Tng wrote:
Sean Christopherson seanjc@google.com writes:
On Thu, Apr 13, 2023, Christian Brauner wrote:
- by a mount option to tmpfs that makes it act in this restricted manner then you don't need an ioctl() and can get away with regular open calls. Such a tmpfs instance would only create regular, restricted memfds.
I'd prefer to not go this route, because IIUC it would require relatively invasive changes to shmem code, and IIUC would require similar changes to other supported backings in the future, e.g. hugetlbfs? And as above, I don't think any of the potential use cases need restrictedmem to be a uniquely identifiable mount.
FWIW, I'm starting to look at extending restrictedmem to hugetlbfs and the separation that the current implementation has is very helpful. Also helps that hugetlbfs and tmpfs are structured similarly, I guess.
One of the goals (hopefully not a pipe dream) is to design restrictedmem in such a way that extending it to support other backing types isn't terribly difficult. In case it's not obvious, most of us working on this stuff aren't filesystem experts, and many of us aren't mm experts either. The more we (KVM folks for the most part) can leverage existing code to do the heavy lifting, the better.
After giving myself a bit of a crash course in file systems, would something like the below have any chance of (a) working, (b) getting merged, and (c) being maintainable?
The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. There are undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might be doable" or a "no, that's absolutely bonkers, don't try it".
Not an FS expert by any means, but I did think of approaching it this way as well!
"Hijacking" perhaps gives this approach a bit of a negative connotation.
Heh, commandeer then.
I thought this is pretty close to subclassing (as in Object Oriented Programming). When some methods (e.g. fallocate) are called, restrictedmem does some work, and calls the same method in the superclass.
The existing restrictedmem code is more like instantiating a shmem object and keeping that object as a field within the restrictedmem object.
Some (maybe small) issues I can think of now:
(1)
One difficulty with this approach is that other functions may make assumptions about private_data being of a certain type, or functions may use private_data.
I checked and IIUC neither shmem nor hugetlbfs use the private_data field in the inode's i_mapping (also file's f_mapping).
But there's fs/buffer.c which uses private_data, although those functions seem to be used by FSes like ext4 and fat, not memory-backed FSes.
We can probably fix this if any backing filesystems of restrictedmem, like tmpfs and future ones use private_data.
Ya, if we go the route of poking into f_ops and stuff, I would want to add WARN_ON_ONCE() hardening of everything that restrictedmem wants to "commandeer" ;-)
static int restrictedmem_file_create(struct file *file) { struct address_space *mapping = file->f_mapping; struct restrictedmem *rm;
rm = kzalloc(sizeof(*rm), GFP_KERNEL); if (!rm) return -ENOMEM;
rm->backing_f_ops = file->f_op; rm->backing_a_ops = mapping->a_ops; rm->file = file;
We don't really need to do this, since rm->file is already the same as file; we could just pass the file itself when it's needed.
Aha! I was working on getting rid of it, but forgot to go back and do another pass.
init_rwsem(&rm->lock); xa_init(&rm->bindings);
file->f_flags |= O_LARGEFILE;
file->f_op = &restrictedmem_fops; mapping->a_ops = &restrictedmem_aops;
I think we probably have to override inode_operations as well, because otherwise other methods would become available to a restrictedmem file (like link, unlink, mkdir, tmpfile). Or maybe that's a feature instead of a bug.
I think we want those? What we want to restrict are operations that require read/write/execute access to the file; everything else should be ok. fallocate() is a special case because restrictedmem needs to tell KVM to unmap the memory when a hole is punched. I assume ->setattr() needs similar treatment to handle ftruncate()?
I'd love to hear Christian's input on this aspect of things.
if (WARN_ON_ONCE(file->private_data)) { err = -EEXIST; goto err_fd; }
Did you intend this as a check that the backing filesystem isn't using the private_data field in the mapping?
I think you meant file->f_mapping->private_data.
Ya, sounds right. I should have added disclaimers that (a) I wrote this quite quickly and (b) it's compile tested only at this point.
On this note, we will probably have to fix things whenever any backing filesystems need the private_data field.
Yep.
f = fdget_raw(mount_fd); if (!f.file) return -EBADF;
...
/* * The filesystem must be mounted no-execute, executing from guest * private memory in the host is nonsensical and unsafe. */ if (!(mnt->mnt_sb->s_iflags & SB_I_NOEXEC)) goto out;
Looking at this more closely, I don't think we need to require NOEXEC; things like execve() should get squashed by virtue of not providing any read/write implementations. And dropping my misguided NOEXEC requirement means there's no reason to disallow using the kernel-internal mount.
On Fri, Apr 14, 2023, Sean Christopherson wrote:
On Fri, Apr 14, 2023, Ackerley Tng wrote:
Sean Christopherson seanjc@google.com writes:
if (WARN_ON_ONCE(file->private_data)) { err = -EEXIST; goto err_fd; }
Did you intend this as a check that the backing filesystem isn't using the private_data field in the mapping?
I think you meant file->f_mapping->private_data.
Ya, sounds right. I should have added disclaimers that (a) I wrote this quite quickly and (b) it's compile tested only at this point.
FWIW, here's a very lightly tested version that doesn't explode on a basic selftest.
On Thu, Apr 13, 2023 at 03:28:43PM -0700, Sean Christopherson wrote:
On Thu, Apr 13, 2023, Christian Brauner wrote:
On Thu, Aug 18, 2022 at 04:24:21PM +0300, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
Here's what I would prefer, and imagine much easier for you to maintain; but I'm no system designer, and may be misunderstanding throughout.
QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps the fallocate syscall interface itself) to allocate and free the memory, ioctl for initializing some of it too. KVM in control of whether that fd can be read or written or mmap'ed or whatever, no need to prevent it in shmem.c, no need for flags, seals, notifications to and fro because KVM is already in control and knows the history. If shmem actually has value, call into it underneath - somewhat like SysV SHM, and /dev/zero mmap, and i915/gem make use of it underneath. If shmem has nothing to add, just allocate and free kernel memory directly, recorded in your own xarray.
I guess shim layer on top of shmem *can* work. I don't see immediately why it would not. But I'm not sure it is right direction. We risk creating yet another parallel VM with own rules/locking/accounting that opaque to core-mm.
Sorry for necrobumping this thread but I've been reviewing the
No worries, I'm just stoked someone who actually knows what they're doing is chiming in :-)
It's a dangerous business, going out of your subsystem. You step into code, and if you don't watch your hands, there is no knowing where you might be swept off to.
That saying goes for me here specifically...
memfd_restricted() extension that Ackerley is currently working on. I was pointed to this thread as this is what the extension is building on but I'll reply to both threads here.
From a glance at v10, memfd_restricted() is currently implemented as an in-kernel stacking filesystem. A call to memfd_restricted() creates a new restricted memfd file and a new unlinked tmpfs file and stashes the tmpfs file into the memfd file's private data member. It then uses the tmpfs file's f_ops and i_ops to perform the relevant file and inode operations. So it has the same callstack as a general stacking filesystem like overlayfs in some cases:
memfd_restricted->getattr() -> tmpfs->getattr()
...
Since you're effectively acting like a stacking filesystem you should really use the device number of your memfd restricted filesystem. IOW, sm like:
stat->dev = memfd_restricted_dentry->d_sb->s_dev;
But then you run into trouble if you want to go forward with Ackerley's extension that allows to explicitly pass in tmpfs fds to memfd_restricted(). Afaict, two tmpfs instances might allocate the same inode number. So now the inode and device number pair isn't unique anymore.
So you might best be served by allocating and reporting your own inode numbers as well.
But if you want to preserve the inode number and device number of the relevant tmpfs instance but still report memfd restricted as your filesystem type
Unless I missed something along the way, reporting memfd_restricted as a distinct filesystem is very much a non-goal. AFAIK it's purely a side effect of the proposed implementation.
In the current implementation you would have to put in effort to fake this. For example, you would need to also implement ->statfs super_operation where you'd need to fill in the details of the tmpfs instance. At that point all that memfd_restricted fs code that you've written is nothing but deadweight, I would reckon.
then I think it's reasonable to ask whether a stacking implementation really makes sense here.
If you extend memfd_restricted() or even consider extending it in the future to take tmpfs file descriptors as arguments to identify the tmpfs instance in which to allocate the underlying tmpfs file for the new restricted memfd file you should really consider a tmpfs based implementation.
Because at that point it just feels like a pointless wrapper to get custom f_ops and i_ops. Plus it's wasteful because you allocate dentries and inodes that you don't really care about at all.
Just off the top of my hat you might be better served:
- by a new ioctl() on tmpfs instances that yield regular tmpfs file descriptors with restricted f_ops and i_ops. That's not that different from btrfs subvolumes which effectively are directories but are created through an ioctl().
I think this is more or less what we want to do, except via a dedicated syscall instead of an ioctl() so that the primary interface isn't strictly tied to tmpfs, e.g. so that it can be extended to other backing types in the future.
Ok. But just to point this out, this would make memfd_restricted() a multiplexer on types of memory. And my wild guess is that not all memory types you might reasonably want to use will have such a filesystem-like interface. So in the future you might end up with multiple ways of specifying the type of memory:
// use tmpfs backing
memfd_restricted(fd_tmpfs, 0);

// use hugetlbfs backing
memfd_restricted(fd_hugetlbfs, 0);

// use non-fs type memory backing
memfd_restricted(-EBADF, MEMFD_SUPER_FANCY_MEMORY_TYPE);
Interface-wise I find that an unpleasant design. But that multi-memory-type goal also makes it a bit hard to come up with a clean design. (One possibility would be to use an extensible struct - versioned by size - similar to openat2() and clone3(), such that you can specify all types of options on the memory in the future.)
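To make the extensible-struct idea concrete, here is a purely hypothetical sketch (all names invented, nothing from the patch) of what a size-versioned argument struct in the openat2()/clone3() style could look like; the kernel side would copy it in with copy_struct_from_user() so older and newer userspace layouts stay compatible.

#include <linux/types.h>

struct memfd_restricted_args {
	__u64 flags;		/* selects/modifies the backing memory type */
	__s32 backing_fd;	/* e.g. tmpfs or hugetlbfs mount fd, or -1 */
	__u32 __reserved;	/* padding, must be zero */
	__u64 size;		/* possible future field: immutable size */
};

/* Hypothetical prototype: fd = memfd_restricted2(&args, sizeof(args)); */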
- by a mount option to tmpfs that makes it act in this restricted manner then you don't need an ioctl() and can get away with regular open calls. Such a tmpfs instance would only create regular, restricted memfds.
I'd prefer to not go this route, becuase IIUC, it would require relatively invasive changes to shmem code, and IIUC would require similar changes to other support backings in the future, e.g. hugetlbfs? And as above, I don't think any of the potential use cases need restrictedmem to be a uniquely identifiable mount.
Ok, see my comment above then.
One of the goals (hopefully not a pipe dream) is to design restrictedmem in such a way that extending it to support other backing types isn't terribly difficult.
Not necessarily difficult, just difficult to do tastefully imho. But it's not like that has traditionally held people back. ;)
In case it's not obvious, most of us working on this stuff aren't filesystems experts, and many of us aren't mm experts either. The more we (KVM folks for the most part) can leverage existing code to do the heavy lifting, the better.
Well, hopefully we can complement each other's knowledge here.
After giving myself a bit of a crash course in file systems, would something like the below have any chance of (a) working, (b) getting merged, and (c) being maintainable?
The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. There are undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might be doable" or a "no, that's absolutely bonkers, don't try it".
Maybe, but I think it's weird. _Replacing_ f_ops isn't something that's unprecedented. It happens every time a character device is opened (see fs/char_dev.c:chrdev_open()). And debugfs does a similar (much more involved) thing where it replaces its proxy f_ops with the relevant subsystem's f_ops. The difference is that in both cases the replacement happens at ->open() time, and the replacement is done once. Afterwards only the newly added f_ops are relevant.
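For readers unfamiliar with that pattern, here is a simplified paraphrase of the ->open()-time f_op replacement being described (not the literal fs/char_dev.c code; lookup_device_fops() is a stand-in for the real cdev lookup):

/*
 * Simplified paraphrase of the chrdev-style pattern: the replacement happens
 * once, on a freshly allocated struct file, before anyone else can use it.
 */
static int chrdev_like_open(struct inode *inode, struct file *filp)
{
	const struct file_operations *fops;
	int ret = 0;

	fops = lookup_device_fops(inode);	/* hypothetical helper */
	if (!fops)
		return -ENXIO;

	/* Swap out the default ops that were taken from inode->i_fop. */
	filp->f_op = fops;
	if (filp->f_op->open)
		ret = filp->f_op->open(inode, filp);

	return ret;
}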
In your case you'd be keeping two sets of {f,a}_ops; one usable by userspace and another only usable by in-kernel consumers. And there are some concerns (non-exhaustive list), I think:
* {f,a}_ops weren't designed for this. IOW, one set of {f,a}_ops is authoritative per @file and it is left to the individual subsystems to maintain driver-specific ops (see the sunrpc stuff or sockets).
* lifetime management for the two sets of {f,a}_ops: if the ops belong to a module then you need to make sure that the module can't get unloaded while you're using the fops. Might not be a concern in this case.
* brittleness: not all f_ops deal with userspace functionality; some deal with cleanup when the file is closed, like ->release(). So it's delicate to override that functionality with custom f_ops. Restricted memfds could easily forget to clean up resources.
* potential for confusion about why there are two sets of {f,a}_ops.
* f_ops specifically are generic across a vast number of consumers and are subject to change. If memfd_restricted() has specific requirements because of this weird double-use they won't be taken into account.
I find this hard to navigate tbh and it feels like taking a shortcut to avoid building a proper api. If you only care about a specific set of operations specific to memfd_restricted that needs to be available to in-kernel consumers, I wonder if you shouldn't just go one step further than your proposal below and build a dedicated minimal ops api. Idk, sketching like a madman on a drawing board here with no claim to feasibility from a mm perspective whatsoever:
struct restrictedmem_ops {
	// only contains very limited stuff you need or special stuff
	// you need, similar to struct proto_ops (sockets) and so on
};

struct restrictedmem {
	const struct restrictedmem_ops ops;
};
This would avoid fuzzing with two different set of {f,a}_ops in this brittle way. It would force you to clarify the semantics that you need and the operations that you need or don't need implemented. And it would get rid of the ambiguity inherent to using two sets of {f,a}_ops.
On Wed, Apr 19, 2023, Christian Brauner wrote:
On Thu, Apr 13, 2023 at 03:28:43PM -0700, Sean Christopherson wrote:
But if you want to preserve the inode number and device number of the relevant tmpfs instance but still report memfd restricted as your filesystem type
Unless I missed something along the way, reporting memfd_restricted as a distinct filesystem is very much a non-goal. AFAIK it's purely a side effect of the proposed implementation.
In the current implementation you would have to put in effort to fake this. For example, you would need to also implement ->statfs super_operation where you'd need to fill in the details of the tmpfs instance. At that point all that memfd_restricted fs code that you've written is nothing but deadweight, I would reckon.
After digging a bit, I suspect the main reason Kirill implemented an overlay to inode_operations was to prevent modifying the file size via ->setattr(). Relying on shmem_setattr() to unmap entries in KVM's MMU wouldn't work because, by design, the memory can't be mmap()'d into host userspace.
	if (attr->ia_valid & ATTR_SIZE) {
		if (memfd->f_inode->i_size)
			return -EPERM;

		if (!PAGE_ALIGNED(attr->ia_size))
			return -EINVAL;
	}
But I think we can solve this particular problem by using F_SEAL_{GROW,SHRINK} or SHMEM_LONGPIN. For a variety of reasons, I'm leaning more and more toward making this a KVM ioctl() instead of a dedicated syscall, at which point we can be both more flexible and more draconian, e.g. let userspace provide the file size at the time of creation, but make the size immutable, at least by default.
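For reference, the existing memfd sealing interface mentioned here looks roughly like the following from userspace (shown on a plain memfd_create() fd; whether a restricted memfd would honor the same seals, or instead get its size fixed at creation, is exactly the open design question):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static int create_fixed_size_memfd(off_t size)
{
	int fd = memfd_create("guest-mem", MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;

	/* Size it once, then forbid any further grow/shrink. */
	if (ftruncate(fd, size) < 0 ||
	    fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}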
After giving myself a bit of a crash course in file systems, would something like the below have any chance of (a) working, (b) getting merged, and (c) being maintainable?
The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. There are undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might be doable" or a "no, that's absolutely bonkers, don't try it".
Maybe, but I think it's weird.
Yeah, agreed.
_Replacing_ f_ops isn't something that's unprecedented. It happens every time a character device is opened (see fs/char_dev.c:chrdev_open()). And debugfs does a similar (much more involved) thing where it replaces its proxy f_ops with the relevant subsystem's f_ops. The difference is that in both cases the replacement happens at ->open() time, and the replacement is done once. Afterwards only the newly added f_ops are relevant.
In your case you'd be keeping two sets of {f,a}_ops; one usable by userspace and another only usable by in-kernel consumers. And there are some concerns (non-exhaustive list), I think:
- {f,a}_ops weren't designed for this. IOW, one set of {f,a}_ops is authoritative per @file and it is left to the individual subsystems to maintain driver specific ops (see the sunrpc stuff or sockets).
- lifetime management for the two sets of {f,a}_ops: If the ops belong to a module then you need to make sure that the module can't get unloaded while you're using the fops. Might not be a concern in this case.
Ah, whereas I assume the owner of inode_operations is pinned by ??? (dentry?) holding a reference to the inode?
- brittleness: Not all f_ops deal with userspace functionality; some deal with cleanup when the file is closed, like ->release(). So it's delicate to override that functionality with custom f_ops. Restricted memfds could easily forget to clean up resources.
- Potential for confusion why there's two sets of {f,a}_ops.
- f_ops specifically are generic across a vast amount of consumers and are subject to change. If memfd_restricted() has specific requirements because of this weird double-use they won't be taken into account.
I find this hard to navigate tbh and it feels like taking a shortcut to avoid building a proper api.
Agreed. At the very least, it would be better to take an explicit dependency on whatever APIs are being used instead of somewhat blindly bouncing through ->fallocate(). I think that gives us a clearer path to getting something merged too, as we'll need Acks on making specific functions visible, i.e. will give MM maintainers something concrete to react to.
If you only care about a specific set of operations specific to memfd_restricted that needs to be available to in-kernel consumers, I wonder if you shouldn't just go one step further than your proposal below and build a dedicated minimal ops api.
This is actually very doable for shmem. Unless I'm missing something, because our use case doesn't allow mmap(), swap, or migration, a good chunk of shmem_fallocate() is simply irrelevant. The result is only ~100 lines of code, and quite straightforward.
My biggest concern, outside of missing a detail in shmem, is adding support for HugeTLBFS, which is likely going to be requested/needed sooner than later. At a glance, hugetlbfs_fallocate() is quite a bit more complex, i.e. not something I'm keen to duplicate. But that's also a future problem to some extent, as it's purely kernel internals; the uAPI side of things doesn't seem like it'll be messy at all.
Thanks again!
On Wed, Apr 19, 2023 at 05:49:55PM -0700, Sean Christopherson wrote:
On Wed, Apr 19, 2023, Christian Brauner wrote:
On Thu, Apr 13, 2023 at 03:28:43PM -0700, Sean Christopherson wrote:
But if you want to preserve the inode number and device number of the relevant tmpfs instance but still report memfd restricted as your filesystem type
Unless I missed something along the way, reporting memfd_restricted as a distinct filesystem is very much a non-goal. AFAIK it's purely a side effect of the proposed implementation.
In the current implementation you would have to put in effort to fake this. For example, you would need to also implement ->statfs super_operation where you'd need to fill in the details of the tmpfs instance. At that point all that memfd_restricted fs code that you've written is nothing but deadweight, I would reckon.
After digging a bit, I suspect the main reason Kirill implemented an overlay to inode_operations was to prevent modifying the file size via ->setattr(). Relying on shmem_setattr() to unmap entries in KVM's MMU wouldn't work because, by design, the memory can't be mmap()'d into host userspace.
	if (attr->ia_valid & ATTR_SIZE) {
		if (memfd->f_inode->i_size)
			return -EPERM;

		if (!PAGE_ALIGNED(attr->ia_size))
			return -EINVAL;
	}
But I think we can solve this particular problem by using F_SEAL_{GROW,SHRINK} or SHMEM_LONGPIN. For a variety of reasons, I'm leaning more and more toward making this a KVM ioctl() instead of a dedicated syscall, at which point we can be both more flexible and more draconian, e.g. let userspace provide the file size at the time of creation, but make the size immutable, at least by default.
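A minimal userspace sketch of the seal-based alternative, using only existing memfd APIs (the eventual KVM ioctl() flavor mentioned above would of course look different):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Fix the memfd's size up front and make it immutable via existing
	 * seals, instead of overriding ->setattr(). */
	static int create_fixed_size_memfd(size_t size)
	{
		int fd = memfd_create("guest-mem", MFD_ALLOW_SEALING);

		if (fd < 0)
			return -1;
		if (ftruncate(fd, size) < 0 ||
		    fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK) < 0) {
			close(fd);
			return -1;
		}
		return fd;
	}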
After giving myself a bit of a crash course in file systems, would something like the below have any chance of (a) working, (b) getting merged, and (c) being maintainable?
The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. There are undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might be doable" or a "no, that's absolutely bonkers, don't try it".
Maybe, but I think it's weird.
Yeah, agreed.
_Replacing_ f_ops isn't something that's unprecedented. It happens every time a character device is opened (see fs/char_dev.c:chrdev_open()). And debugfs does a similar (much more involved) thing where it replaces its proxy f_ops with the relevant subsystem's f_ops. The difference is that in both cases the replacement happens at ->open() time; and the replacement is done once. Afterwards only the newly added f_ops are relevant.
In your case you'd be keeping two sets of {f,a}_ops; one usable by userspace and another only usable by in-kernel consumers. And there are some concerns (non-exhaustive list), I think:
- {f,a}_ops weren't designed for this. IOW, one set of {f,a}_ops is authoritative per @file and it is left to the individual subsystems to maintain driver specific ops (see the sunrpc stuff or sockets).
- lifetime management for the two sets of {f,a}_ops: If the ops belong to a module then you need to make sure that the module can't get unloaded while you're using the fops. Might not be a concern in this case.
Ah, whereas I assume the owner of inode_operations is pinned by ??? (dentry?) holding a reference to the inode?
I don't think it would be possible to safely replace inode_operations after the inode's been made visible in caches.
It works with file_operations because when a file is opened a new struct file is allocated which isn't reachable anywhere before fd_install() is called. So it is possible to replace f_ops in the default f->f_op->open() method (which is what devices do, as the inode is located on e.g. ext4/xfs/tmpfs but the functionality of the device is usually provided by some driver/module through its file_operations). The default f_ops are taken from i_fop of the inode.
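A condensed sketch of that ->open()-time pattern, loosely modeled on fs/char_dev.c:chrdev_open() (the cdev lookup and most error paths are elided; this is not the actual kernel source):

	#include <linux/cdev.h>
	#include <linux/fs.h>

	static int chrdev_open_sketch(struct inode *inode, struct file *filp)
	{
		const struct file_operations *fops;

		/* Pins the driver module through its f_ops; returns NULL if
		 * the owning module is already gone. */
		fops = fops_get(inode->i_cdev->ops);
		if (!fops)
			return -ENXIO;

		/* The new struct file isn't visible to anyone else yet, so
		 * the f_ops can safely be swapped exactly once, right here. */
		replace_fops(filp, fops);

		return filp->f_op->open ? filp->f_op->open(inode, filp) : 0;
	}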
The lifetime of the file_/inode_operations will be aligned with the lifetime of the module they're originating from. If only file_/inode_operations are used from within the same module then there should never be any lifetime concerns.
So an inode doesn't explicitly pin file_/inode_operations because there's usually no need to do that and it would be weird if each new inode would take a reference on the f_ops/i_ops on the off-chance that someone _might_ open the file. Let alone the overhead of calling try_module_get() every time a new inode is added to the cache. There are various fs objects - the superblock which is pinning the filesystem/module - that exceed the lifetime of inodes and dentries. Both also may be dropped from their respective caches and readded later.
Pinning of the module for f_ops is done because it is possible that some filesystem/driver might want to use the file_operations of some other filesystem/driver by default and they are in separate modules. So the fops_get() in do_dentry_open() is there because it's not guaranteed that file_/inode_operations originate from the same module as the inode that's opened. If the module is still alive during the open then a reference to its f_ops is taken; if not, the open will fail with ENODEV.
That's to the best of my knowledge.
- brittleness: Not all f_ops deal with userspace functionality; some, like ->release(), deal with cleanup when the file is closed. So it's delicate to override that functionality with custom f_ops. Restricted memfds could easily forget to clean up resources.
- Potential for confusion why there's two sets of {f,a}_ops.
- f_ops specifically are generic across a vast number of consumers and are subject to change. If memfd_restricted() has specific requirements because of this weird double-use, they won't be taken into account.
I find this hard to navigate tbh and it feels like taking a shortcut to avoid building a proper api.
Agreed. At the very least, it would be better to take an explicit dependency on whatever APIs are being used instead of somewhat blindly bouncing through ->fallocate(). I think that gives us a clearer path to getting something merged too, as we'll need Acks on making specific functions visible, i.e. will give MM maintainers something concrete to react to.
If you only care about a specific set of operations for memfd_restricted() that needs to be available to in-kernel consumers, I wonder if you shouldn't just go one step further than your proposal below and build a dedicated minimal ops API.
This is actually very doable for shmem. Unless I'm missing something, because our use case doesn't allow mmap(), swap, or migration, a good chunk of shmem_fallocate() is simply irrelevant. The result is only ~100 lines of code, and quite straightforward.
My biggest concern, outside of missing a detail in shmem, is adding support for HugeTLBFS, which is likely going to be requested/needed sooner than later. At a glance, hugetlbfs_fallocate() is quite a bit more complex, i.e. not something I'm keen to duplicate. But that's also a future problem to some extent, as it's purely kernel internals; the uAPI side of things doesn't seem like it'll be messy at all.
Thanks again!
Sure thing.
Hi,
On Wed, Jul 6, 2022 at 9:24 AM Chao Peng chao.p.peng@linux.intel.com wrote:
This is the v7 of this series which tries to implement the fd-based KVM guest private memory. The patches are based on latest kvm/queue branch commit:
b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity
Introduction
In general this patch series introduce fd-based memslot which provides guest memory through memory file descriptor fd[offset,size] instead of hva/size. The fd can be created from a supported memory filesystem like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM and the the memory backing store exchange callbacks when such memslot gets created. At runtime KVM will call into callbacks provided by the backing store to get the pfn with the fd+offset. Memory backing store will also call into KVM callbacks when userspace punch hole on the fd to notify KVM to unmap secondary MMU page table entries.
Comparing to existing hva-based memslot, this new type of memslot allows guest memory unmapped from host userspace like QEMU and even the kernel itself, therefore reduce attack surface and prevent bugs.
Based on this fd-based memslot, we can build guest private memory that is going to be used in confidential computing environments such as Intel TDX and AMD SEV. When supported, the memory backing store can provide more enforcement on the fd and KVM can use a single memslot to hold both the private and shared part of the guest memory.
mm extension
Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file created with these flags cannot read(), write() or mmap() etc via normal MMU operations. The file content can only be used with the newly introduced memfile_notifier extension.
The memfile_notifier extension provides two sets of callbacks for KVM to interact with the memory backing store:
- memfile_notifier_ops: callbacks for memory backing store to notify KVM when memory gets invalidated.
- backing store callbacks: callbacks for KVM to call into memory backing store to request memory pages for guest private memory.
The memfile_notifier extension also provides APIs for memory backing store to register/unregister itself and to trigger the notifier when the bookmarked memory gets invalidated.
The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to prevent double allocation caused by unintentional guest when we only have a single side of the shared/private memfds effective.
memslot extension
Add the private fd and the fd offset to existing 'shared' memslot so that both private/shared guest memory can live in one single memslot. A page in the memslot is either private or shared. Whether a guest page is private or shared is maintained through reusing existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
I'm on the Android pKVM team at Google, and we've been looking into how this approach fits with what we've been doing with pkvm/arm64. I've had a go at porting your patches, along with some fixes and additions so it would go on top of our latest pkvm patch series [1] to see how well this proposal fits with what we’re doing. You can find the ported code at this link [2].
In general, an fd-based approach fits very well with pKVM for the reasons you mention. It means that we don't necessarily need to map the guest memory, and with the new extensions it allows the host kernel to control whether to restrict migration and swapping.
For pKVM, we would also need the guest private memory not to be GUP’able by the kernel so that userspace can’t trick the kernel into accessing guest private memory in a context where it isn’t prepared to handle the fault injected by the hypervisor. We’re looking at whether we could use memfd_secret to achieve this, or maybe whether extending your work might solve the problem.
However, during the porting effort, the main issue we've encountered is that many of the details of this approach seem to be targeted at TDX/SEV and don’t readily align with the design of pKVM. My knowledge on TDX is very rudimentary, so please bear with me if I get things wrong.
The memslot now has two references to the backing memory: the (new) private_fd (a file descriptor) as well as the userspace_addr (a memory address), with the meaning changing depending on whether the memory is private or shared. Both can potentially be live at the same time, but only one is used by the guest depending on whether the memory is shared or private. For pKVM, the memory region is the same, and whether the underlying physical page is shared or private is determined by the hypervisor based on the initial configuration of the VM and also in response to hypercalls from the guest. So at least from our side, having a private_fd isn't the best fit; what would fit better is just having an fd instead of a userspace_addr.
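For reference, the uAPI side of the memslot extension being discussed looks roughly like the structure below; this is my reading of the v7 patches and the exact field layout/padding may differ:

	/* Approximate layout from the series' uapi changes; needs
	 * <linux/types.h> and the base struct kvm_userspace_memory_region. */
	struct kvm_userspace_memory_region_ext {
		struct kvm_userspace_memory_region region;
		__u64 private_offset;
		__u32 private_fd;
		__u32 pad1;
		__u64 pad2[14];
	};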
Moreover, something which was discussed here before [3] is the ability to share in-place. For pKVM/arm64, the conversion between shared and private involves only changes to the stage-2 page tables, which are controlled by the hypervisor. Android supports this in-place conversion already, and I think that the cost of copying would be significant for many use-cases that involve large amounts of data. We will measure the relative costs in due course, but in the meantime we’re nervous about adopting a new user ABI which doesn’t appear to cater for in-place conversion; having just the fd would simplify that somewhat.
In the memfd approach, what is the plan for being able to initialize guest private memory from the host? In my port of this patch series, I've added an fcntl() command that allows setting INACCESSIBLE after the memfd has been created. So the memory can be mapped, initialized, then unmapped. Of course there is no way to enforce that the memory is unmapped from userspace before being used as private memory, but the hypervisor will take care of the stage-2 mapping and so a user access to the private memory would result in a SEGV regardless of the flag.
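A sketch of that initialize-then-restrict flow; F_SET_INACCESSIBLE below is a purely hypothetical fcntl() command standing in for the one added in the port, and its name and number are assumptions:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Hypothetical command; the port's actual name/value may differ. */
	#define F_SET_INACCESSIBLE	(1024 + 100)	/* F_LINUX_SPECIFIC_BASE + 100 */

	static int make_private_memfd(const void *image, size_t image_size,
				      size_t size)
	{
		int fd = memfd_create("guest-private", MFD_CLOEXEC);
		void *p;

		if (fd < 0 || ftruncate(fd, size) < 0)
			goto err;

		/* Populate the payload while the memory is still mappable. */
		p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			goto err;
		memcpy(p, image, image_size);
		munmap(p, size);

		/* Flip to inaccessible before handing the fd to KVM as the
		 * memslot's private_fd. */
		if (fcntl(fd, F_SET_INACCESSIBLE, 0) < 0)
			goto err;
		return fd;
	err:
		if (fd >= 0)
			close(fd);
		return -1;
	}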
Now, moving on to implementation-specific issues in this patch series that I have encountered:
- There are a couple of small issues in porting the patches, some of which have been mentioned already by others. I will point out the rest in direct replies to these patches.
- MEMFILE_F_UNRECLAIMABLE and MEMFILE_F_UNMOVABLE are never set in this patch series. MFD_INACCESSIBLE only sets MEMFILE_F_USER_INACCESSIBLE. Is this intentional?
- Nothing in this patch series enforces that MFD_INACCESSIBLE or that any of the MEMFILE_F_* flags are set for the file descriptor to be used as a private_fd. Is this also intentional?
Most of us working on pKVM will be at KVM Forum in Dublin in September, so it would be great if we could have a chat (and/or beer!) face to face sometime during the conference to help us figure out an upstreamable solution for Android.
Cheers, /fuad
[1] https://lore.kernel.org/all/20220630135747.26983-1-will@kernel.org/ [2] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/fdmem [3] https://lore.kernel.org/all/YkcTTY4YjQs5BRhE@google.com/
Test
To test the new functionalities of this patch TDX patchset is needed. Since TDX patchset has not been merged so I did two kinds of test:
Regresion test on kvm/queue (this patchset) Most new code are not covered. Code also in below repo: https://github.com/chao-p/linux/tree/privmem-v7
New Funational test on latest TDX code The patch is rebased to latest TDX code and tested the new funcationalities. See below repos: Linux: https://github.com/chao-p/linux/tree/privmem-v7-tdx QEMU: https://github.com/chao-p/qemu/tree/privmem-v7
An example QEMU command line for TDX test: -object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \ -machine confidential-guest-support=tdx \ -object memory-backend-memfd-private,id=ram1,size=${mem} \ -machine memory-backend=ram1
Changelog
v7:
- Move the private/shared info from backing store to KVM.
- Introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
- Rework on the sync mechanism between zap/page fault paths.
- Addressed other comments in v6.
v6:
- Re-organized patch for both mm/KVM parts.
- Added flags for memfile_notifier so its consumers can state their features and memory backing store can check against these flags.
- Put a backing store reference in the memfile_notifier and move pfn_ops into backing store.
- Only support boot time backing store register.
- Overall KVM part improvement suggested by Sean and some others.
v5:
- Removed userspace visible F_SEAL_INACCESSIBLE, instead using an in-kernel flag (SHM_F_INACCESSIBLE for shmem). Private fd can only be created by MFD_INACCESSIBLE.
- Introduced new APIs for backing store to register itself to memfile_notifier instead of direct function call.
- Added the accounting and restriction for MFD_INACCESSIBLE memory.
- Added KVM API doc for new memslot extensions and man page for the new MFD_INACCESSIBLE flag.
- Removed the overlap check for mapping the same file+offset into multiple gfns due to perf consideration, warned in document.
- Addressed other comments in v4.
v4:
- Decoupled the callbacks between KVM/mm from memfd and use new name 'memfile_notifier'.
- Supported register multiple memslots to the same backing store.
- Added per-memslot pfn_ops instead of per-system.
- Reworked the invalidation part.
- Improved new KVM uAPIs (private memslot extension and memory error) per Sean's suggestions.
- Addressed many other minor fixes for comments from v3.
v3:
- Added locking protection when calling invalidate_page_range/fallocate callbacks.
- Changed memslot structure to keep using useraddr for shared memory.
- Re-organized F_SEAL_INACCESSIBLE and MEMFD_OPS.
- Added MFD_INACCESSIBLE flag to force F_SEAL_INACCESSIBLE.
- Commit message improvement.
- Many small fixes for comments from the last version.
Links to previous discussions
[1] Original design proposal: https://lkml.kernel.org/kvm/20210824005248.200037-1-seanjc@google.com/ [2] Updated proposal and RFC patch v1: https://lkml.kernel.org/linux-fsdevel/20211111141352.26311-1-chao.p.peng@lin... [3] Patch v5: https://lkml.org/lkml/2022/5/19/861
Chao Peng (12): mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE mm: Introduce memfile_notifier mm/memfd: Introduce MFD_INACCESSIBLE flag KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS KVM: Use gfn instead of hva for mmu_notifier_retry KVM: Rename mmu_notifier_* KVM: Extend the memslot to support fd-based private memory KVM: Add KVM_EXIT_MEMORY_FAULT exit KVM: Register/unregister the guest private memory regions KVM: Handle page fault for private memory KVM: Enable and expose KVM_MEM_PRIVATE
Kirill A. Shutemov (1): mm/shmem: Support memfile_notifier
Documentation/virt/kvm/api.rst | 77 +++++- arch/arm64/kvm/mmu.c | 8 +- arch/mips/include/asm/kvm_host.h | 2 +- arch/mips/kvm/mmu.c | 10 +- arch/powerpc/include/asm/kvm_book3s_64.h | 2 +- arch/powerpc/kvm/book3s_64_mmu_host.c | 4 +- arch/powerpc/kvm/book3s_64_mmu_hv.c | 4 +- arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 +- arch/powerpc/kvm/book3s_hv_nested.c | 2 +- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 8 +- arch/powerpc/kvm/e500_mmu_host.c | 4 +- arch/riscv/kvm/mmu.c | 4 +- arch/x86/include/asm/kvm_host.h | 3 +- arch/x86/kvm/Kconfig | 3 + arch/x86/kvm/mmu.h | 2 - arch/x86/kvm/mmu/mmu.c | 74 +++++- arch/x86/kvm/mmu/mmu_internal.h | 18 ++ arch/x86/kvm/mmu/mmutrace.h | 1 + arch/x86/kvm/mmu/paging_tmpl.h | 4 +- arch/x86/kvm/x86.c | 2 +- include/linux/kvm_host.h | 105 +++++--- include/linux/memfile_notifier.h | 91 +++++++ include/linux/shmem_fs.h | 2 + include/uapi/linux/fcntl.h | 1 + include/uapi/linux/kvm.h | 37 +++ include/uapi/linux/memfd.h | 1 + mm/Kconfig | 4 + mm/Makefile | 1 + mm/memfd.c | 18 +- mm/memfile_notifier.c | 123 ++++++++++ mm/shmem.c | 125 +++++++++- tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++ virt/kvm/Kconfig | 3 + virt/kvm/kvm_main.c | 272 ++++++++++++++++++--- virt/kvm/pfncache.c | 14 +- 35 files changed, 1074 insertions(+), 127 deletions(-) create mode 100644 include/linux/memfile_notifier.h create mode 100644 mm/memfile_notifier.c
-- 2.25.1
On Fri, Aug 26, 2022 at 04:19:25PM +0100, Fuad Tabba wrote:
Hi,
On Wed, Jul 6, 2022 at 9:24 AM Chao Peng chao.p.peng@linux.intel.com wrote:
This is the v7 of this series which tries to implement the fd-based KVM guest private memory. The patches are based on latest kvm/queue branch commit:
b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity
Introduction
In general this patch series introduce fd-based memslot which provides guest memory through memory file descriptor fd[offset,size] instead of hva/size. The fd can be created from a supported memory filesystem like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM and the the memory backing store exchange callbacks when such memslot gets created. At runtime KVM will call into callbacks provided by the backing store to get the pfn with the fd+offset. Memory backing store will also call into KVM callbacks when userspace punch hole on the fd to notify KVM to unmap secondary MMU page table entries.
Comparing to existing hva-based memslot, this new type of memslot allows guest memory unmapped from host userspace like QEMU and even the kernel itself, therefore reduce attack surface and prevent bugs.
Based on this fd-based memslot, we can build guest private memory that is going to be used in confidential computing environments such as Intel TDX and AMD SEV. When supported, the memory backing store can provide more enforcement on the fd and KVM can use a single memslot to hold both the private and shared part of the guest memory.
mm extension
Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file created with these flags cannot read(), write() or mmap() etc via normal MMU operations. The file content can only be used with the newly introduced memfile_notifier extension.
The memfile_notifier extension provides two sets of callbacks for KVM to interact with the memory backing store:
- memfile_notifier_ops: callbacks for memory backing store to notify KVM when memory gets invalidated.
- backing store callbacks: callbacks for KVM to call into memory backing store to request memory pages for guest private memory.
The memfile_notifier extension also provides APIs for memory backing store to register/unregister itself and to trigger the notifier when the bookmarked memory gets invalidated.
The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to prevent double allocation caused by unintentional guest when we only have a single side of the shared/private memfds effective.
memslot extension
Add the private fd and the fd offset to existing 'shared' memslot so that both private/shared guest memory can live in one single memslot. A page in the memslot is either private or shared. Whether a guest page is private or shared is maintained through reusing existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
I'm on the Android pKVM team at Google, and we've been looking into how this approach fits with what we've been doing with pkvm/arm64. I've had a go at porting your patches, along with some fixes and additions so it would go on top of our latest pkvm patch series [1] to see how well this proposal fits with what we’re doing. You can find the ported code at this link [2].
In general, an fd-based approach fits very well with pKVM for the reasons you mention. It means that we don't necessarily need to map the guest memory, and with the new extensions it allows the host kernel to control whether to restrict migration and swapping.
Good to hear that.
For pKVM, we would also need the guest private memory not to be GUP’able by the kernel so that userspace can’t trick the kernel into accessing guest private memory in a context where it isn’t prepared to handle the fault injected by the hypervisor. We’re looking at whether we could use memfd_secret to achieve this, or maybe whether extending your work might solve the problem.
This is interesting and can be a valuable addition to this series.
However, during the porting effort, the main issue we've encountered is that many of the details of this approach seem to be targeted at TDX/SEV and don’t readily align with the design of pKVM. My knowledge on TDX is very rudimentary, so please bear with me if I get things wrong.
No doubt this series is initially designed for confidential computing usages, but pKVM can definitely extend it if it finds it useful.
The idea of the memslot having two references to the backing memory, the (new) private_fd (a file descriptor) as well as the userspace_addr (a memory address), with the meaning changing depending on whether the memory is private or shared. Both can potentially be live at the same time, but only one is used by the guest depending on whether the memory is shared or private. For pKVM, the memory region is the same, and whether the underlying physical page is shared or private is determined by the hypervisor based on the initial configuration of the VM and also in response to hypercalls from the guest.
For confidential computing usages, this is actually the same. Whether memory is shared or private is determined by the initial configuration or by guest hypercalls.
So at least from our side, having a private_fd isn't the best fit, but rather just having an fd instead of a userspace_addr.
Let me understand this a bit: pKVM basically wants to maintain the shared and private memory in only one fd, and not use userspace_addr at all, right? Is there anything blocking pKVM from using private_fd + userspace_addr instead?
Moreover, something which was discussed here before [3], is the ability to share in-place. For pKVM/arm64, the conversion between shared and private involves only changes to the stage-2 page tables, which are controlled by the hypervisor. Android supports this in-place conversion already, and I think that the cost of copying for many use-cases that would involve large amounts of data would be big. We will measure the relative costs in due course, but in the meantime we’re nervous about adopting a new user ABI which doesn’t appear to cater for in-place conversion; having just the fd would simplify that somewhat
I understand it is difficult to achieve that with the current private_fd + userspace_addr (they are basically in two separate fds), but is it possible for pKVM to extend this? Brainstorming, for example: pKVM could ignore userspace_addr and only use private_fd to cover both shared and private memory, or pKVM could introduce a new KVM memslot flag?
In the memfd approach, what is the plan for being able to initialize guest private memory from the host? In my port of this patch series, I've added an fcntl() command that allows setting INACCESSIBLE after the memfd has been created. So the memory can be mapped, initialized, then unmapped. Of course there is no way to enforce that the memory is unmapped from userspace before being used as private memory, but the hypervisor will take care of the stage-2 mapping and so a user access to the private memory would result in a SEGV regardless of the flag
There is a discussion on removing MFD_INACCESSIBLE and delaying the setting of the flag to KVM/backing store binding time (https://lkml.kernel.org/lkml/20220824094149.GA1383966@chaop.bj.intel.com/).
Creating a new API like what you are doing with fcntl() also works, if it turns out MFD_INACCESSIBLE has to be set at memfd_create() time.
Now, moving on to implementation-specific issues in this patch series that I have encountered:
- There are a couple of small issues in porting the patches, some of
which have been mentioned already by others. I will point out the rest in direct replies to these patches.
Thanks.
- MEMFILE_F_UNRECLAIMABLE and MEMFILE_F_UNMOVABLE are never set in
this patch series. MFD_INACCESSIBLE only sets MEMFILE_F_USER_INACCESSIBLE. Is this intentional?
It gets set in kvm_private_mem_register() of patch 13; basically, those flags are expected to be set by architecture code.
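As a rough illustration of what is being described here; the registration helper, its signature, and the slot fields used below are assumptions loosely based on the series, not verified against patch 13:

	/* Illustrative only: names and signatures are assumptions. */
	static int kvm_private_mem_register_sketch(struct kvm_memory_slot *slot)
	{
		unsigned long flags = MEMFILE_F_USER_INACCESSIBLE |
				      MEMFILE_F_UNMOVABLE |
				      MEMFILE_F_UNRECLAIMABLE;

		/* The backing store is expected to check these flags against
		 * what it can guarantee and reject the binding otherwise. */
		return memfile_register_notifier(file_inode(slot->private_file),
						 flags, &slot->notifier);
	}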
- Nothing in this patch series enforces that MFD_INACCESSIBLE or that
any of the MEMFILE_F_* flags are set for the file descriptor to be used as a private_fd. Is this also intentional?
With the KVM_MEM_PRIVATE memslot flag, the MEMFILE_F_* flags are enforced by the architecture code.
Most of us working on pKVM will be at KVM forum Dublin in September, so it would be great if we could have a chat (and/or beer!) face to face sometime during the conference to help us figure out an upstreamable solution for Android
I would like to, but currently I have no travel plan due to COVID-19 :( We can have more online discussions anyway.
Thanks, Chao
Cheers, /fuad
[1] https://lore.kernel.org/all/20220630135747.26983-1-will@kernel.org/ [2] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/fdmem [3] https://lore.kernel.org/all/YkcTTY4YjQs5BRhE@google.com/
Hi Chao,
Thank you for your reply.
On Mon, Aug 29, 2022 at 4:23 PM Chao Peng chao.p.peng@linux.intel.com wrote:
On Fri, Aug 26, 2022 at 04:19:25PM +0100, Fuad Tabba wrote:
Hi,
On Wed, Jul 6, 2022 at 9:24 AM Chao Peng chao.p.peng@linux.intel.com wrote:
This is the v7 of this series which tries to implement the fd-based KVM guest private memory. The patches are based on latest kvm/queue branch commit:
b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity
Introduction
In general this patch series introduce fd-based memslot which provides guest memory through memory file descriptor fd[offset,size] instead of hva/size. The fd can be created from a supported memory filesystem like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM and the the memory backing store exchange callbacks when such memslot gets created. At runtime KVM will call into callbacks provided by the backing store to get the pfn with the fd+offset. Memory backing store will also call into KVM callbacks when userspace punch hole on the fd to notify KVM to unmap secondary MMU page table entries.
Comparing to existing hva-based memslot, this new type of memslot allows guest memory unmapped from host userspace like QEMU and even the kernel itself, therefore reduce attack surface and prevent bugs.
Based on this fd-based memslot, we can build guest private memory that is going to be used in confidential computing environments such as Intel TDX and AMD SEV. When supported, the memory backing store can provide more enforcement on the fd and KVM can use a single memslot to hold both the private and shared part of the guest memory.
mm extension
Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file created with these flags cannot read(), write() or mmap() etc via normal MMU operations. The file content can only be used with the newly introduced memfile_notifier extension.
The memfile_notifier extension provides two sets of callbacks for KVM to interact with the memory backing store:
- memfile_notifier_ops: callbacks for memory backing store to notify KVM when memory gets invalidated.
- backing store callbacks: callbacks for KVM to call into memory backing store to request memory pages for guest private memory.
The memfile_notifier extension also provides APIs for memory backing store to register/unregister itself and to trigger the notifier when the bookmarked memory gets invalidated.
The patchset also introduces a new memfd seal F_SEAL_AUTO_ALLOCATE to prevent double allocation caused by unintentional guest when we only have a single side of the shared/private memfds effective.
memslot extension
Add the private fd and the fd offset to existing 'shared' memslot so that both private/shared guest memory can live in one single memslot. A page in the memslot is either private or shared. Whether a guest page is private or shared is maintained through reusing existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION.
I'm on the Android pKVM team at Google, and we've been looking into how this approach fits with what we've been doing with pkvm/arm64. I've had a go at porting your patches, along with some fixes and additions so it would go on top of our latest pkvm patch series [1] to see how well this proposal fits with what we’re doing. You can find the ported code at this link [2].
In general, an fd-based approach fits very well with pKVM for the reasons you mention. It means that we don't necessarily need to map the guest memory, and with the new extensions it allows the host kernel to control whether to restrict migration and swapping.
Good to hear that.
For pKVM, we would also need the guest private memory not to be GUP’able by the kernel so that userspace can’t trick the kernel into accessing guest private memory in a context where it isn’t prepared to handle the fault injected by the hypervisor. We’re looking at whether we could use memfd_secret to achieve this, or maybe whether extending your work might solve the problem.
This is interesting and can be a valuable addition to this series.
I'll keep you posted as it goes. I think with the work that you've already put in, it wouldn't require that much more.
However, during the porting effort, the main issue we've encountered is that many of the details of this approach seem to be targeted at TDX/SEV and don’t readily align with the design of pKVM. My knowledge on TDX is very rudimentary, so please bear with me if I get things wrong.
No doubt this series is initially designed for confidential computing usages, but pKVM can definitely extend it if it finds useful.
The idea of the memslot having two references to the backing memory, the (new) private_fd (a file descriptor) as well as the userspace_addr (a memory address), with the meaning changing depending on whether the memory is private or shared. Both can potentially be live at the same time, but only one is used by the guest depending on whether the memory is shared or private. For pKVM, the memory region is the same, and whether the underlying physical page is shared or private is determined by the hypervisor based on the initial configuration of the VM and also in response to hypercalls from the guest.
For confidential computing usages, this is actually the same. The shared or private is determined by initial configuration or guest hypercalls.
So at least from our side, having a private_fd isn't the best fit, but rather just having an fd instead of a userspace_addr.
Let me understand this a bit: pKVM basically wants to maintain the shared and private memory in only one fd, and not use userspace_addr at all, right? Any blocking for pKVM to use private_fd + userspace_addr instead?
Moreover, something which was discussed here before [3], is the ability to share in-place. For pKVM/arm64, the conversion between shared and private involves only changes to the stage-2 page tables, which are controlled by the hypervisor. Android supports this in-place conversion already, and I think that the cost of copying for many use-cases that would involve large amounts of data would be big. We will measure the relative costs in due course, but in the meantime we’re nervous about adopting a new user ABI which doesn’t appear to cater for in-place conversion; having just the fd would simplify that somewhat
I understand there is difficulty to achieve that with the current private_fd + userspace_addr (they basically in two separate fds), but is it possible for pKVM to extend this? Brainstorming for example, pKVM can ignore userspace_addr and only use private_fd to cover both shared and private memory, or pKVM introduce new KVM memslot flag?
It's not that there's anything blocking pKVM from doing that. It's that the disconnect of using a memory address for the shared memory and a file descriptor for the private memory doesn't really make sense for pKVM. I see how it makes sense for TDX and the Intel-specific implementation. It just seems that this is baking an implementation-specific aspect into the general KVM API, and the worry is that this might have some unintended consequences in the future.
In the memfd approach, what is the plan for being able to initialize guest private memory from the host? In my port of this patch series, I've added an fcntl() command that allows setting INACCESSIBLE after the memfd has been created. So the memory can be mapped, initialized, then unmapped. Of course there is no way to enforce that the memory is unmapped from userspace before being used as private memory, but the hypervisor will take care of the stage-2 mapping and so a user access to the private memory would result in a SEGV regardless of the flag
There is discussion on removing MFD_INACCESSIBLE and delaying the alignment of the flag to the KVM/backing store binding time (https://lkml.kernel.org/lkml/20220824094149.GA1383966@chaop.bj.intel.com/).
Creating new API like what you are playing with fcntl() also works if it turns out the MFD_INACCESSIBLE has to be set at the memfd_create time.
That makes sense.
Now, moving on to implementation-specific issues in this patch series that I have encountered:
- There are a couple of small issues in porting the patches, some of
which have been mentioned already by others. I will point out the rest in direct replies to these patches.
Thanks.
- MEMFILE_F_UNRECLAIMABLE and MEMFILE_F_UNMOVABLE are never set in
this patch series. MFD_INACCESSIBLE only sets MEMFILE_F_USER_INACCESSIBLE. Is this intentional?
It gets set in kvm_private_mem_register() of patch 13, basically those flags are expected to be set by architecture code.
- Nothing in this patch series enforces that MFD_INACCESSIBLE or that
any of the MEMFILE_F_* flags are set for the file descriptor to be used as a private_fd. Is this also intentional?
With KVM_MEM_PRIVATE memslot flag, the MEMFILE_F_* are enforced by the architecture code.
Right. I was expecting them to be in the mem_fd, but I see now how they are being set and enforced in patch 13. This makes a lot of sense now. Thanks!
Most of us working on pKVM will be at KVM forum Dublin in September, so it would be great if we could have a chat (and/or beer!) face to face sometime during the conference to help us figure out an upstreamable solution for Android
I would like to, but currently I have no travel plan due to COVID-19 :( We can have more online discussions anyway.
Of course! We'll continue this online, and hopefully we will get a chance to meet in person soon.
Cheers, /fuad
Thanks, Chao
Cheers, /fuad
[1] https://lore.kernel.org/all/20220630135747.26983-1-will@kernel.org/ [2] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/fdmem [3] https://lore.kernel.org/all/YkcTTY4YjQs5BRhE@google.com/
On Wed, Aug 31, 2022 at 10:12:12AM +0100, Fuad Tabba wrote:
Moreover, something which was discussed here before [3], is the ability to share in-place. For pKVM/arm64, the conversion between shared and private involves only changes to the stage-2 page tables, which are controlled by the hypervisor. Android supports this in-place conversion already, and I think that the cost of copying for many use-cases that would involve large amounts of data would be big. We will measure the relative costs in due course, but in the meantime we’re nervous about adopting a new user ABI which doesn’t appear to cater for in-place conversion; having just the fd would simplify that somewhat
I understand there is difficulty to achieve that with the current private_fd + userspace_addr (they basically in two separate fds), but is it possible for pKVM to extend this? Brainstorming for example, pKVM can ignore userspace_addr and only use private_fd to cover both shared and private memory, or pKVM introduce new KVM memslot flag?
It's not that there's anything blocking pKVM from doing that. It's that the disconnect of using a memory address for the shared memory, and a file descriptor for the private memory doesn't really make sense for pKVM. I see how it makes sense for TDX and the Intel-specific implementation. It just seems that this is baking in an implementation-specific aspect as a part of the KVM general api, and the worry is that this might have some unintended consequences in the future.
It's true this API originates from supporting TDX and probably other similar confidential computing (CC) technologies. But if we ever get the chance to make it more common to cover more usages like pKVM, I would also like to. The challenge is that pKVM diverges a lot from the CC usages; putting both shared and private memory in the same fd complicates the CC usages. If two things are different enough, I'm also thinking that being implementation-specific may not be that bad.
Chao
On Wed, Jul 06, 2022 at 04:20:02PM +0800, Chao Peng wrote:
This is the v7 of this series which tries to implement the fd-based KVM guest private memory. The patches are based on latest kvm/queue branch commit:
b9b71f43683a (kvm/queue) KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity
Introduction
In general this patch series introduce fd-based memslot which provides guest memory through memory file descriptor fd[offset,size] instead of hva/size. The fd can be created from a supported memory filesystem like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM and the the memory backing store exchange callbacks when such memslot gets created. At runtime KVM will call into callbacks provided by the backing store to get the pfn with the fd+offset. Memory backing store will also call into KVM callbacks when userspace punch hole on the fd to notify KVM to unmap secondary MMU page table entries.
Comparing to existing hva-based memslot, this new type of memslot allows guest memory unmapped from host userspace like QEMU and even the kernel itself, therefore reduce attack surface and prevent bugs.
Based on this fd-based memslot, we can build guest private memory that is going to be used in confidential computing environments such as Intel TDX and AMD SEV. When supported, the memory backing store can provide more enforcement on the fd and KVM can use a single memslot to hold both the private and shared part of the guest memory.
Hi everyone,
Just wanted to let you all know that I reserved a slot at the LPC Confidential Computing Microconference to discuss some topics related to unmapped/inaccessible private memory support:
"Unmapped Private Memory for Confidential Guests" Tuesday, Sep 13th, 10:00am (Dublin time) https://lpc.events/event/16/sessions/133/#20220913
The discussion agenda is still a bit in flux, but one topic I really wanted to cover is how we intend to deal with the kernel directmap for TDX/SNP, where there is a need to either remove or split mappings so that KVM or other kernel threads writing to non-private pages don't run into issues due to mappings overlapping with private pages.[1]
Other possible discussion topics:
- guarding against shared->private conversions while KVM is attempting to access a shared page (separate PFN pools for shared/private seem to resolve this nicely, but may not be compatible with things like pKVM where the underlying PFN is the same for shared/private)[2]
- extending KVM_EXIT_MEMORY_FAULT to handle batched requests to better handle things like explicit batched conversions initiated by the guest
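For context on that second topic, the per-fault exit payload in this series looks roughly like the structure below (field names are from my reading of the patches and may differ); a batched variant would presumably report an array of such ranges instead of a single one:

	/* Roughly the KVM_EXIT_MEMORY_FAULT payload added by this series,
	 * inside the kvm_run exit union; treat the details as approximate. */
	struct {
	#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1 << 0)
		__u32 flags;
		__u32 padding;
		__u64 gpa;
		__u64 size;
	} memory;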
It's a short session so not sure how much time we'll actually have to discuss things in detail, but maybe this can at least be a good jumping off point for other discussions.
Thanks, and hope to see you there!
[1] https://lore.kernel.org/all/YWb8WG6Ravbs1nbx@google.com/ [2] https://lore.kernel.org/lkml/CA+EHjTy6NF=BkCqK0vhXLdtKZMahp55JUMSfxN96-NT3Yi...