On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
On Wed, 6 Jul 2022, Chao Peng wrote:
But since then, TDX in particular has forced an effort into preventing (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
Are any of the shmem.c mods useful to existing users of shmem.c? No. Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
But QEMU and other VMMs are users of shmem and memfd. The new features certainly aren't useful for _all_ existing users, but I don't think it's fair to say that they're not useful for _any_ existing users.
What use do you have for a filesystem here? Almost none. IIUC, what you want is an fd through which QEMU can allocate kernel memory, selectively free that memory, and communicate fd+offset+length to KVM. And perhaps an interface to initialize a little of that memory from a template (presumably copied from a real file on disk somewhere).
You don't need shmem.c or a filesystem for that!
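For concreteness, the fd usage described above (allocate, selectively free, hand fd+offset+length to KVM) can be sketched from userspace roughly as follows. This is only an illustration under stated assumptions: an ordinary memfd stands in for the proposed MFD_INACCESSIBLE fd, and `struct guest_mem_range` and the `guest_mem_*()` helpers are hypothetical names that do not come from the patch series. The KVM side is omitted entirely.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/mman.h>   /* memfd_create(), MFD_CLOEXEC */
#include <sys/stat.h>   /* fstat() */
#include <sys/types.h>  /* off_t */
#include <unistd.h>     /* close() */

/* Hypothetical: the fd+offset+length triple a VMM would communicate to KVM. */
struct guest_mem_range {
	int   fd;
	off_t offset;
	off_t length;
};

/*
 * Create the backing fd. Assumption: a plain memfd is used as a stand-in,
 * since MFD_INACCESSIBLE is only a proposed flag.
 */
static int guest_mem_create(void)
{
	return memfd_create("guest-private", MFD_CLOEXEC);
}

/* Allocate backing memory for [offset, offset + length). */
static int guest_mem_alloc(const struct guest_mem_range *r)
{
	return fallocate(r->fd, 0, r->offset, r->length);
}

/* Selectively free a range: punch a hole, keeping the file size unchanged. */
static int guest_mem_free(const struct guest_mem_range *r)
{
	return fallocate(r->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 r->offset, r->length);
}
```

A VMM would call guest_mem_create() once, guest_mem_alloc() on the ranges it wants populated, and guest_mem_free() when the guest gives a range back. Nothing here depends on tmpfs semantics beyond fallocate() being supported on the fd.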
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in the pipeline. It is part of TDX 1.5 [1].
And this isn't intended for just TDX (or SNP, or pKVM). We're not _that_ far off from being able to use UPM for "regular" VMs as a way to provide defense-in-depth without having to take on the overhead of confidential VMs. At that point, migration and probably even swap are on the table.
And swapping is theoretically possible, but I'm not aware of any plans as of now.
Ya, I highly doubt confidential VMs will ever bother with swap.
I'm afraid of the special demands you may make of memory allocation later on - surprised that huge pages are not mentioned already; gigantic contiguous extents? secretmem removed from direct map?
The design allows for extension to hugetlbfs if needed. A combination of MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero implications for shmem. It is going to be a separate struct memfile_backing_store.
I'm not sure secretmem is a fit here, as we want to extend MFD_INACCESSIBLE to be movable if the platform supports it, and secretmem is not migratable by design (without direct mapping fragmentation).
But secretmem _could_ be a fit. If a use case wants to unmap guest private memory from both userspace and the kernel then KVM should absolutely be able to support that, but at the same time I don't want to have to update KVM to enable secretmem (and I definitely don't want KVM poking into the directmap itself).
MFD_INACCESSIBLE should only say "this memory can't be mapped into userspace"; any other properties should be completely separate. E.g. the inability to migrate pages is effectively a restriction from KVM (acting on behalf of TDX/SNP); it's not a fundamental property of MFD_INACCESSIBLE.
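
To illustrate the separation argued for above, here is a minimal sketch in which the "inaccessible" bit decides only whether userspace may map the memory, while the choice of backing store comes from a different flag. Everything in it is an assumption made for illustration: the `SKETCH_*` constants use placeholder bit values rather than the real uapi ones, and `sketch_route()` is a hypothetical helper, not code from the series.

```c
/* Placeholder bit values for illustration only; not taken from any kernel header. */
#define SKETCH_MFD_HUGETLB      0x1u
#define SKETCH_MFD_INACCESSIBLE 0x2u

/* Hypothetical: which backing store a given flag combination selects. */
enum sketch_backing {
	SKETCH_BACKING_SHMEM,
	SKETCH_BACKING_HUGETLBFS,
};

struct sketch_backing_choice {
	enum sketch_backing store;  /* chosen by the HUGETLB bit alone */
	int user_mappable;          /* chosen by the INACCESSIBLE bit alone */
};

/*
 * Route a flag combination. The two properties are derived independently,
 * so "cannot be mapped into userspace" implies nothing about the backing
 * store, and says nothing about migration or swap.
 */
static struct sketch_backing_choice sketch_route(unsigned int flags)
{
	struct sketch_backing_choice c;

	c.store = (flags & SKETCH_MFD_HUGETLB) ? SKETCH_BACKING_HUGETLBFS
					       : SKETCH_BACKING_SHMEM;
	c.user_mappable = !(flags & SKETCH_MFD_INACCESSIBLE);
	return c;
}
```

With this shape, a restriction such as "pages cannot be migrated" would be imposed by whoever consumes the fd (KVM, on behalf of TDX/SNP) rather than being encoded in the flag itself.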