On Mon, Jun 13, 2022 at 5:10 PM Nadav Amit <namit@vmware.com> wrote:
On Jun 13, 2022, at 3:38 PM, Axel Rasmussen <axelrasmussen@google.com> wrote:
On Mon, Jun 13, 2022 at 3:29 PM Peter Xu <peterx@redhat.com> wrote:
On Mon, Jun 13, 2022 at 02:55:40PM -0700, Andrew Morton wrote:
On Wed, 1 Jun 2022 14:09:47 -0700 Axel Rasmussen <axelrasmussen@google.com> wrote:
To achieve this, add a /dev/userfaultfd misc device. This device provides an alternative to the userfaultfd(2) syscall for the creation of new userfaultfds. The idea is, any userfaultfds created this way will be able to handle kernel faults, without the caller having any special capabilities. Access to this mechanism is instead restricted using e.g. standard filesystem permissions.
The use of a /dev node isn't pretty. Why can't this be done by tweaking sys_userfaultfd() or by adding a sys_userfaultfd2()?
I think for any approach involving syscalls, we need to be able to control who can call the syscall. Maybe there's another way I'm not aware of, but I think today the only mechanism to do this is capabilities. I proposed adding a CAP_USERFAULTFD for this purpose, but that approach was rejected [1]. So, I'm not sure of another way besides using a device node.
One thing that could potentially make this cleaner is, as one LWN commenter pointed out, we could have open() on /dev/userfaultfd just return a new userfaultfd directly, instead of the current multi-step process: open /dev/userfaultfd, issue the NEW ioctl, and only then get a userfaultfd. When I wrote this originally it wasn't clear to me how to get that to happen - open() doesn't directly return the result of our custom open function pointer, as far as I can tell - but it could be investigated.
If this direction is pursued, I think it would be better to expose it as /proc/[pid]/userfaultfd, which would allow remote monitors (processes) to hook into the userfaultfd of other processes. I have a patch for that which extends the userfaultfd syscall, but /proc/[pid]/userfaultfd may be cleaner.
Hmm, one thing I'm unsure about -
If a process is able to control another process's memory like this, then this seems like exactly what CAP_SYS_PTRACE is intended to deal with, right? So I'm not sure this case is directly related to the one I'm trying to address.
This also seems distinct to me from the existing way you'd do this, which is to open a userfaultfd, register a shared memory region, and then fork(). Now you can control your child's memory with userfaultfd. But attaching to some other, previously-unrelated process with /proc/[pid]/userfaultfd seems like a clear case for CAP_SYS_PTRACE.