On Fri, Apr 14, 2023, Ackerley Tng wrote:
Sean Christopherson seanjc@google.com writes:
On Thu, Apr 13, 2023, Christian Brauner wrote:
- by a mount option to tmpfs that makes it act in this restricted manner then you don't need an ioctl() and can get away with regular open calls. Such a tmpfs instance would only create regular, restricted memfds.
I'd prefer to not go this route, becuase IIUC, it would require relatively invasive changes to shmem code, and IIUC would require similar changes to other support backings in the future, e.g. hugetlbfs? And as above, I don't think any of the potential use cases need restrictedmem to be a uniquely identifiable mount.
FWIW, I'm starting to look at extending restrictedmem to hugetlbfs and the separation that the current implementation has is very helpful. Also helps that hugetlbfs and tmpfs are structured similarly, I guess.
One of the goals (hopefully not a pipe dream) is to design restrictmem in such a way that extending it to support other backing types isn't terribly difficult. In case it's not obvious, most of us working on this stuff aren't filesystems experts, and many of us aren't mm experts either. The more we (KVM folks for the most part) can leverage existing code to do the heavy lifting, the better.
After giving myself a bit of a crash course in file systems, would something like the below have any chance of (a) working, (b) getting merged, and (c) being maintainable?
The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem hijacks a f_ops and a_ops to create a lightweight shim around tmpfs. There are undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might be doable" or a "no, that's absolutely bonkers, don't try it".
Not an FS expert by any means, but I did think of approaching it this way as well!
"Hijacking" perhaps gives this approach a bit of a negative connotation.
Heh, commandeer then.
I thought this is pretty close to subclassing (as in Object Oriented Programming). When some methods (e.g. fallocate) are called, restrictedmem does some work, and calls the same method in the superclass.
The existing restrictedmem code is a more like instantiating an shmem object and keeping that object as a field within the restrictedmem object.
Some (maybe small) issues I can think of now:
(1)
One difficulty with this approach is that other functions may make assumptions about private_data being of a certain type, or functions may use private_data.
I checked and IIUC neither shmem nor hugetlbfs use the private_data field in the inode's i_mapping (also file's f_mapping).
But there's fs/buffer.c which uses private_data, although those functions seem to be used by FSes like ext4 and fat, not memory-backed FSes.
We can probably fix this if any backing filesystems of restrictedmem, like tmpfs and future ones use private_data.
Ya, if we go the route of poking into f_ops and stuff, I would want to add WARN_ON_ONCE() hardening of everything that restrictemem wants to "commandeer" ;-)
static int restrictedmem_file_create(struct file *file) { struct address_space *mapping = file->f_mapping; struct restrictedmem *rm;
rm = kzalloc(sizeof(*rm), GFP_KERNEL); if (!rm) return -ENOMEM;
rm->backing_f_ops = file->f_op; rm->backing_a_ops = mapping->a_ops; rm->file = file;
We don't really need to do this, since rm->file is already the same as file, we could just pass the file itself when it's needed
Aha! I was working on getting rid of it, but forgot to go back and do another pass.
init_rwsem(&rm->lock); xa_init(&rm->bindings);
file->f_flags |= O_LARGEFILE;
file->f_op = &restrictedmem_fops; mapping->a_ops = &restrictedmem_aops;
I think we probably have to override inode_operations as well, because otherwise other methods would become available to a restrictedmem file (like link, unlink, mkdir, tmpfile). Or maybe that's a feature instead of a bug.
I think we want those? What we want to restrict are operations that require read/write/execute access to the file, everything else should be ok. fallocate() is a special case because restrictmem needs to tell KVM to unmap the memory when a hole is punched. I assume ->setattr() needs similar treatment to handle ftruncate()?
I'd love to hear Christian's input on this aspect of things.
if (WARN_ON_ONCE(file->private_data)) { err = -EEXIST; goto err_fd; }
Did you intend this as a check that the backing filesystem isn't using the private_data field in the mapping?
I think you meant file->f_mapping->private_data.
Ya, sounds right. I should have added disclaimers that (a) I wrote this quite quickly and (b) it's compile tested only at this point.
On this note, we will probably have to fix things whenever any backing filesystems need the private_data field.
Yep.
f = fdget_raw(mount_fd); if (!f.file) return -EBADF;
...
/* * The filesystem must be mounted no-execute, executing from guest * private memory in the host is nonsensical and unsafe. */ if (!(mnt->mnt_sb->s_iflags & SB_I_NOEXEC)) goto out;
Looking at this more closely, I don't think we need to require NOEXEC, things like like execve() should get squashed by virtue of not providing any read/write implementations. And dropping my misguided NOEXEC requirement means there's no reason to disallow using the kernel internal mount.