On Wed, Jun 01, 2022 at 02:09:49PM -0700, Axel Rasmussen wrote:
Explain the different ways to create a new userfaultfd, and how access control works for each way.
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
 Documentation/admin-guide/mm/userfaultfd.rst | 40 ++++++++++++++++++--
 Documentation/admin-guide/sysctl/vm.rst      |  3 ++
 2 files changed, 40 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 6528036093e1..9bae1acd431f 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
 Design
 ======
 
-Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
+Userspace creates a new userfaultfd, initializes it, and registers one or more
+regions of virtual memory with it. Then, any page faults which occur within the
+region(s) result in a message being delivered to the userfaultfd, notifying
+userspace of the fault.
 
 The ``userfaultfd`` (aside from registering and unregistering virtual
 memory ranges) provides two primary functionalities:
@@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory
 management of mremap/mprotect is that the userfaults in all their
 operations never involve heavyweight structures like vmas (in fact the
 ``userfaultfd`` runtime load never takes the mmap_lock for writing).
-
 Vmas are not suitable for page- (or hugepage) granular fault tracking
 when dealing with virtual address spaces that could span
 Terabytes. Too many vmas would be needed for that.
 
-The ``userfaultfd`` once opened by invoking the syscall, can also be
+The ``userfaultfd``, once created, can also be
 passed using unix domain sockets to a manager process, so the same
 manager process could handle the userfaults of a multitude of
 different processes without them being aware about what is going on
@@ -50,6 +52,38 @@ is a corner case that would currently return ``-EBUSY``).
 API
 ===
 
+Creating a userfaultfd
+----------------------
+There are two ways to create a new userfaultfd, each of which provide ways to
+restrict access to this functionality (since historically userfaultfds which
+handle kernel page faults have been a useful tool for exploiting the kernel).
+The first way, supported by older kernels, is the userfaultfd(2) syscall.
How about "supported since userfaultfd was introduced"? Otherwise the reader may get the impression that the syscall won't work on new kernels, but it will.
+Access to this is controlled in several ways:
+- By default, the userfaultfd will be able to handle kernel page faults. This
s/kernel/both user and kernel/?
+  can be disabled by passing in UFFD_USER_MODE_ONLY.
+- If vm.unprivileged_userfaultfd is 0, then the caller must *either* have
+  CAP_SYS_PTRACE, or pass in UFFD_USER_MODE_ONLY.
+- If vm.unprivileged_userfaultfd is 1, then no particular privilege is needed to
+  use this syscall, even if UFFD_USER_MODE_ONLY is *not* set.
The separation of the above three paragraphs does not feel very clear to me for understanding these flags. Entry 1) was trying to define UFFD_USER_MODE_ONLY, but entry 2) was also referring to it in another context.
How about using two paragraphs to explain these two flags one by one? My try..
The user can always create a userfaultfd that traps only userspace page faults. To achieve this, one can create the userfaultfd object using the userfaultfd() syscall with the UFFD_USER_MODE_ONLY flag passed in.
If the user would like to also trap kernel page faults for the address space, then either the process needs to have the CAP_SYS_PTRACE capability, or the system must have vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd is set to 0.
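To make the two cases above concrete, here is a rough userspace sketch of the syscall path. It is only an illustration, not something for the doc as-is: uffd_create() is a hypothetical helper name, and the fallback #define is only there for older UAPI headers that lack the flag.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback for older UAPI headers; 1 is the value in linux/userfaultfd.h. */
#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1
#endif

/*
 * Hypothetical helper (not part of the patch): create a userfaultfd via
 * the userfaultfd(2) syscall. Returns the new fd, or -errno on failure.
 */
static int uffd_create(int user_mode_only)
{
	int flags = O_CLOEXEC | O_NONBLOCK;
	long fd;

	/* Restrict the userfaultfd to userspace page faults if requested. */
	if (user_mode_only)
		flags |= UFFD_USER_MODE_ONLY;

	fd = syscall(SYS_userfaultfd, flags);
	return fd < 0 ? -errno : (int)fd;
}
```

Going by the description above, with vm.unprivileged_userfaultfd set to 0 and no CAP_SYS_PTRACE I would expect uffd_create(1) to succeed and uffd_create(0) to fail with -EPERM.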
+The second way, added to the kernel more recently, is by opening and issuing a
+USERFAULTFD_IOC_NEW ioctl to /dev/userfaultfd. This method yields equivalent
+userfaultfds to the userfaultfd(2) syscall; its benefit is in how access to
+creating userfaultfds is controlled.
Since the benefit is immediately mentioned next, how about dropping "its benefit is in how ... is controlled" and just connect these two paragraphs?
Again, please take my English-related comments with a grain of salt (that means all the comments above :).
Thanks,
+Access to /dev/userfaultfd is controlled via normal filesystem permissions
+(user/group/mode for example), which gives fine grained access to userfaultfd
+specifically, without also granting other unrelated privileges at the same time
+(as e.g. granting CAP_SYS_PTRACE would do).
+Initializing up a userfaultfd
+-----------------------------
 When first opened the ``userfaultfd`` must be enabled invoking the
 ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
 a later API version) which will specify the ``read/POLLIN`` protocol
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index d7374a1e8ac9..e3a952d1fd35 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -927,6 +927,9 @@ calls without any restrictions.
 
 The default value is 0.
 
+An alternative to this sysctl / the userfaultfd(2) syscall is to create
+userfaultfds via /dev/userfaultfd. See
+Documentation/admin-guide/mm/userfaultfd.rst.
 
 user_reserve_kbytes
 ===================
-- 
2.36.1.255.ge46751e96f-goog