Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

14 Apr 2023

      On Fri, Apr 14, 2023, Ackerley Tng wrote:
...
Sean Christopherson seanjc@google.com writes:
...
On Thu, Apr 13, 2023, Christian Brauner wrote:
...

by a mount option to tmpfs that makes it act
 in this restricted manner then you don't need an ioctl() and can get
 away with regular open calls. Such a tmpfs instance would only create
 regular, restricted memfds.

...
I'd prefer to not go this route, becuase IIUC, it would require relatively
invasive changes to shmem code, and IIUC would require similar changes to
other support backings in the future, e.g. hugetlbfs?  And as above, I
don't think any of the potential use cases need restrictedmem to be a
uniquely identifiable mount.
FWIW, I'm starting to look at extending restrictedmem to hugetlbfs and
the separation that the current implementation has is very helpful. Also
helps that hugetlbfs and tmpfs are structured similarly, I guess.
...
One of the goals (hopefully not a pipe dream) is to design restrictmem in
such a way that extending it to support other backing types isn't terribly
difficult.  In case it's not obvious, most of us working on this stuff
aren't filesystems experts, and many of us aren't mm experts either.  The
more we (KVM folks for the most part) can leverage existing code to do the
heavy lifting, the better.
...
After giving myself a bit of a crash course in file systems, would
something like the below have any chance of (a) working, (b) getting
merged, and (c) being maintainable?
...
The idea is similar to a stacking filesystem, but instead of stacking,
restrictedmem hijacks a f_ops and a_ops to create a lightweight shim around
tmpfs.  There are undoubtedly issues and edge cases, I'm just looking for a
quick "yes, this might be doable" or a "no, that's absolutely bonkers,
don't try it".
Not an FS expert by any means, but I did think of approaching it this
way as well!
"Hijacking" perhaps gives this approach a bit of a negative connotation.
Heh, commandeer then.
...
I thought this is pretty close to subclassing (as in Object
Oriented Programming). When some methods (e.g. fallocate) are called,
restrictedmem does some work, and calls the same method in the
superclass.
The existing restrictedmem code is a more like instantiating an shmem
object and keeping that object as a field within the restrictedmem
object.
Some (maybe small) issues I can think of now:
(1)
One difficulty with this approach is that other functions may make
assumptions about private_data being of a certain type, or functions may
use private_data.
I checked and IIUC neither shmem nor hugetlbfs use the private_data
field in the inode's i_mapping (also file's f_mapping).
But there's fs/buffer.c which uses private_data, although those
functions seem to be used by FSes like ext4 and fat, not memory-backed
FSes.
We can probably fix this if any backing filesystems of restrictedmem,
like tmpfs and future ones use private_data.
Ya, if we go the route of poking into f_ops and stuff, I would want to add
WARN_ON_ONCE() hardening of everything that restrictemem wants to "commandeer" ;-)
...
...
static int restrictedmem_file_create(struct file *file)
{
   struct address_space *mapping = file->f_mapping;
   struct restrictedmem *rm;
...
rm = kzalloc(sizeof(*rm), GFP_KERNEL);
   if (!rm)
   	return -ENOMEM;
...
rm->backing_f_ops = file->f_op;
   rm->backing_a_ops = mapping->a_ops;
   rm->file = file;
We don't really need to do this, since rm->file is already the same as
file, we could just pass the file itself when it's needed
Aha!  I was working on getting rid of it, but forgot to go back and do another
pass.
...
...
init_rwsem(&rm->lock);
   xa_init(&rm->bindings);
...
file->f_flags |= O_LARGEFILE;
...
file->f_op = &restrictedmem_fops;
   mapping->a_ops = &restrictedmem_aops;
I think we probably have to override inode_operations as well, because
otherwise other methods would become available to a restrictedmem file
(like link, unlink, mkdir, tmpfile). Or maybe that's a feature instead
of a bug.
I think we want those?  What we want to restrict are operations that require
read/write/execute access to the file, everything else should be ok. fallocate()
is a special case because restrictmem needs to tell KVM to unmap the memory when
a hole is punched.  I assume ->setattr() needs similar treatment to handle
ftruncate()?
I'd love to hear Christian's input on this aspect of things.
...
...
if (WARN_ON_ONCE(file->private_data)) {
   	err = -EEXIST;
   	goto err_fd;
   }
Did you intend this as a check that the backing filesystem isn't using
the private_data field in the mapping?
I think you meant file->f_mapping->private_data.
Ya, sounds right.  I should have added disclaimers that (a) I wrote this quite
quickly and (b) it's compile tested only at this point.
...
On this note, we will probably have to fix things whenever any backing
filesystems need the private_data field.
Yep.
...
...
f = fdget_raw(mount_fd);
   if (!f.file)
   	return -EBADF;
...
...
...
/*
    * The filesystem must be mounted no-execute, executing from guest
    * private memory in the host is nonsensical and unsafe.
    */
   if (!(mnt->mnt_sb->s_iflags & SB_I_NOEXEC))
   	goto out;
Looking at this more closely, I don't think we need to require NOEXEC, things like
like execve() should get squashed by virtue of not providing any read/write
implementations.  And dropping my misguided NOEXEC requirement means there's no
reason to disallow using the kernel internal mount.

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory