On Thu, 18 Aug 2022, Kirill A . Shutemov wrote:
On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
If your memory could be swapped, that would be enough of a good reason to make use of shmem.c: but it cannot be swapped; and although there are some references in the mailthreads to it perhaps being swappable in future, I get the impression that will not happen soon if ever.
If your memory could be migrated, that would be some reason to use filesystem page cache (because page migration happens to understand that type of memory): but it cannot be migrated.
Migration support is in pipeline. It is part of TDX 1.5 [1]. And swapping theoretically possible, but I'm not aware of any plans as of now.
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/intel-t...
I always forget, migration means different things to different audiences. As an mm person, I was meaning page migration, whereas a virtualization person thinks VM live migration (which that reference appears to be about), a scheduler person task migration, an ornithologist bird migration, etc.
But you're an mm person too: you may have cited that reference in the knowledge that TDX 1.5 Live Migration will entail page migration of the kind I'm thinking of. (Anyway, it's not important to clarify that here.)
Some of these impressions may come from earlier iterations of the patchset (v7 looks better in several ways than v5). I am probably underestimating the extent to which you have taken on board other usages beyond TDX and SEV private memory, and rightly want to serve them all with similar interfaces: perhaps there is enough justification for shmem there, but I don't see it. There was mention of userfaultfd in one link: does that provide the justification for using shmem?
I'm afraid of the special demands you may make of memory allocation later on - surprised that huge pages are not mentioned already; gigantic contiguous extents? secretmem removed from direct map?
The design allows for extension to hugetlbfs if needed. Combination of MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero implications for shmem. It is going to be separate struct memfile_backing_store.
Last year's MFD_HUGEPAGE proposal would have allowed you to do it with memfd via tmpfs without needing to involve hugetlbfs; but you may prefer the determinism of hugetlbfs, relying on /proc/sys/vm/nr_hugepages etc.
But I've yet to see why you want to involve this or that filesystem (with all its filesystem-icity suppressed) at all. The backing store is host memory, and tmpfs and hugetlbfs just impose their own idiosyncrasies on how that memory is allocated; but I think you would do better to choose your own idiosyncrasies in allocation directly - you don't need a different "backing store" to choose between 4k or 2M or 1G or whatever allocations.
tmpfs and hugetlbfs and page cache are designed around sharing memory: TDX is designed around absolutely not sharing memory; and the further uses which Sean foresees appear not to need it as page cache either.
Except perhaps for page migration reasons. It's somewhat incidental, but of course page migration knows how to migrate page cache, so masquerading as page cache will give a short cut to page migration, when page migration becomes at all possible.
I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE to be movable if platform supports it and secretmem is not migratable by design (without direct mapping fragmentations).
Here's what I would prefer, and imagine much easier for you to maintain; but I'm no system designer, and may be misunderstanding throughout.
QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps the fallocate syscall interface itself) to allocate and free the memory, ioctl for initializing some of it too. KVM in control of whether that fd can be read or written or mmap'ed or whatever, no need to prevent it in shmem.c, no need for flags, seals, notifications to and fro because KVM is already in control and knows the history. If shmem actually has value, call into it underneath - somewhat like SysV SHM, and /dev/zero mmap, and i915/gem make use of it underneath. If shmem has nothing to add, just allocate and free kernel memory directly, recorded in your own xarray.
I guess shim layer on top of shmem *can* work. I don't see immediately why it would not. But I'm not sure it is right direction. We risk creating yet another parallel VM with own rules/locking/accounting that opaque to core-mm.
You are already proposing a new set of rules, foreign to how tmpfs works for others. You're right that KVM allocating large amounts of memory, opaque to core-mm, carries risk: and you'd be right to say that shmem.c provides some clues (security_vm_enough_memory checks, memcg charging, user_shm_lock accounting) on what to remember.
But I'm not up to the job of being the one to police you there, and you don't want to be waiting on me either.
To take a rather silly example: Ted just added chattr support to tmpfs, and it fits in well. But I don't now want to have to decide whether "chattr +i" FS_IMMUTABLE_FL is or is not compatible with MEMFILE_F_USER_INACCESSIBLE. They are from different worlds, and I'd prefer KVM to carry the weight of imposing INACCESSIBLE: which seems easily done if it manages the fd, without making the memory allocated to that fd accessible to those who hold the fd.
Note that on machines that run TDX guests such memory would likely be the bulk of memory use. Treating it as a fringe case may bite us one day.
Yes, I suspected that machines running TDX guests might well consume most of the memory that way, but glad(?) to hear it confirmed.
I am not suggesting that this memory be treated as a fringe case, rather the reverse: a different case, not something to hide away inside shmem.c.
Is there a notion that /proc/meminfo "Shmem:" is going to be a good hint of this usage? Whether or not it's also included in "Shmem:", I expect that its different characteristics will deserve its own display.
Hugh