Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

20 Apr 2023

      On Wed, Apr 19, 2023 at 05:49:55PM -0700, Sean Christopherson wrote:
...
On Wed, Apr 19, 2023, Christian Brauner wrote:
...
On Thu, Apr 13, 2023 at 03:28:43PM -0700, Sean Christopherson wrote:
...
...
But if you want to preserve the inode number and device number of the
relevant tmpfs instance but still report memfd restricted as your
filesystem type
Unless I missed something along the way, reporting memfd_restricted as a distinct
filesystem is very much a non-goal.  AFAIK it's purely a side effect of the
proposed implementation.
In the current implementation you would have to put in effort to fake
this. For example, you would need to also implement ->statfs
super_operation where you'd need to fill in the details of the tmpfs
instance. At that point all that memfd_restricted fs code that you've
written is nothing but deadweight, I would reckon.
After digging a bit, I suspect the main reason Kirill implemented an overlay to
inode_operations was to prevent modifying the file size via ->setattr().  Relying
on shmem_setattr() to unmap entries in KVM's MMU wouldn't work because, by design,
the memory can't be mmap()'d into host userspace.
if (attr->ia_valid & ATTR_SIZE) {
   	if (memfd->f_inode->i_size)
   		return -EPERM;
if (!PAGE_ALIGNED(attr->ia_size))
	return -EINVAL;	

}
But I think we can solve this particular problem by using F_SEAL_{GROW,SHRINK} or
SHMEM_LONGPIN.  For a variety of reasons, I'm leaning more and more toward making
this a KVM ioctl() instead of a dedicated syscall, at which point we can be both
more flexible and more draconian, e.g. let userspace provide the file size at the
time of creation, but make the size immutable, at least by default.
...
...
After giving myself a bit of a crash course in file systems, would something like
the below have any chance of (a) working, (b) getting merged, and (c) being
maintainable?
The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem
hijacks a f_ops and a_ops to create a lightweight shim around tmpfs.  There are
undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might
be doable" or a "no, that's absolutely bonkers, don't try it".
Maybe, but I think it's weird.
Yeah, agreed.
...
_Replacing_ f_ops isn't something that's unprecedented. It happens everytime
a character device is opened (see fs/char_dev.c:chrdev_open()). And debugfs
does a similar (much more involved) thing where it replaces it's proxy f_ops
with the relevant subsystem's f_ops. The difference is that in both cases the
replace happens at ->open() time; and the replace is done once. Afterwards
only the newly added f_ops are relevant.
In your case you'd be keeping two sets of {f,a}_ops; one usable by
userspace and another only usable by in-kernel consumers. And there are
some concerns (non-exhaustive list), I think:

{f,a}_ops weren't designed for this. IOW, one set of {f,a}_ops is
authoritative per @file and it is left to the individual subsystems to
maintain driver specific ops (see the sunrpc stuff or sockets).
lifetime management for the two sets of {f,a}_ops: If the ops belong
to a module then you need to make sure that the module can't get
unloaded while you're using the fops. Might not be a concern in this
case.

Ah, whereas I assume the owner of inode_operations is pinned by ??? (dentry?)
holding a reference to the inode?
I don't think it would be possible to safely replace inode_operations
after the inode's been made visible in caches.
It works with file_operations because when a file is opened a new struct
file is allocated which isn't reachable anywhere before fd_install() is
called. So it is possible to replace f_ops in the default
f->f_op->open() method (which is what devices do as the inode is located
on e.g., ext4/xfs/tmpfs but the functionality of the device usually
provided by some driver/module through its file_operations). The default
f_ops are taken from i_fop of the inode.
The lifetime of the file_/inode_operations will be aligned with the
lifetime of the module they're originating from. If only
file_/inode_operations are used from within the same module then there
should never be any lifetime concerns.
So an inode doesn't explictly pin file_/inode_operations because there's
usually no need to do that and it be weird if each new inode would take
a reference on the f_ops/i_ops on the off-chance that someone _might_
open the file. Let alone the overhead of calling try_module_get()
everytime a new inode is added to the cache. There are various fs
objects - the superblock which is pinning the filesystem/module - that
exceed the lifetime of inodes and dentries. Both also may be dropped
from their respective caches and readded later.
Pinning of the module for f_ops is done because it is possible that some
filesystem/driver might want to use the file_operations of some other
filesystem/driver by default and they are in separate modules. So the
fops_get() in do_dentry_open is there because it's not guaranteed that
file_/inode_operations originate from the same module as the inode
that's opened. If the module is still alive during the open then a
reference to its f_ops is taken if not then the open will fail with
ENODEV.
That's to the best of my knowledge.
...
...

brittleness: Not all f_ops for example deal with userspace
functionality some deal with cleanup when the file is closed like
->release(). So it's delicate to override that functionality with
custom f_ops. Restricted memfds could easily forget to cleanup
resources.
Potential for confusion why there's two sets of {f,a}_ops.
f_ops specifically are generic across a vast amount of consumers and
are subject to change. If memfd_restricted() has specific requirements
because of this weird double-use they won't be taken into account.

I find this hard to navigate tbh and it feels like taking a shortcut to
avoid building a proper api.
Agreed.  At the very least, it would be better to take an explicit dependency on
whatever APIs are being used instead of somewhat blindly bouncing through ->fallocate().
I think that gives us a clearer path to getting something merged too, as we'll
need Acks on making specific functions visible, i.e. will give MM maintainers
something concrete to react too.
...
If you only care about a specific set of operations specific to memfd
restricte that needs to be available to in-kernel consumers, I wonder if you
shouldn't just go one step further then your proposal below and build a
dedicated minimal ops api.
This is actually very doable for shmem.  Unless I'm missing something, because
our use case doesn't allow mmap(), swap, or migration, a good chunk of
shmem_fallocate() is simply irrelevant.  The result is only ~100 lines of code,
and quite straightforward.
My biggest concern, outside of missing a detail in shmem, is adding support for
HugeTLBFS, which is likely going to be requested/needed sooner than later.  At a
glance, hugetlbfs_fallocate() is quite a bit more complex, i.e. not something I'm
keen to duplicate.  But that's also a future problem to some extent, as it's
purely kernel internals; the uAPI side of things doesn't seem like it'll be messy
at all.
Thanks again!
Sure thing.

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory