By abusing new methods of file descriptor transfer it is possible to break the assumption that f_pos_lock doesn't need to be acquired when the caller is single-threaded and file_count(file) is one, i.e., when the caller holds the only reference to the file.
The explanation has two parts. The first part is about a new, not-yet-merged io_uring interface where I realized that acquiring the lock only when multiple references to a file are held would be broken in some circumstances. The second part illustrates that this assumption is already broken by abusing another interface.
During review of the io_uring patchset for adding getdents support I pointed out various problems in the locking assumptions that were made. Effectively, the guts of __fdget_pos() were copied out and a trylock scheme for f_pos_lock was introduced.
The approach overall is workable but the locking was wrong because it copied the file_count(file) greater than one optimization from __fdget_pos() into io_uring. But that assumption is broken when fixed files are used.
Short reminder of fixed files here via some C pseudo code:
T1 (still single threaded)

fd_register = open("/some/file")                          -> f_count == 1

fd_fixed = io_uring_register_file(fd_register)
-> io_sqe_files_register(fd_register)
   -> fget(fd_register)                                   -> f_count == 2
   -> io_fixed_file_set(fd_register)

close(fd_register)
-> close_fd(fd_register)
   -> file = pick_file()
   -> filp_close(file)
      -> fput(file)                                       -> f_count == 1
The caller has now traded a regular file descriptor reference for a fixed file reference. Such fixed files never use fget() again and thus fully eliminate any fget()/fput() overhead.
However, for getdents f_pos_lock needs to be acquired as state may often be kept that requires synchronization with seek and other concurrent getdents requests.
Since io_uring is an asynchronous interface it is of course possible to register multiple concurrent getdents calls that aren't synchronized by io_uring, as that's never done unless the user requests specific chaining. And since the reference count for fixed files is always one, it's possible to create multiple racing getdents requests if locking is conditional on file_count(file) being greater than one.
That by itself wouldn't be a problem since io_uring can just take f_pos_lock unconditionally for fixed files, which aren't usable with the regular system call api anyway.
However, while thinking about this I realized that a while ago the file_count(file) greater than one optimization was already broken and that concurrent read/write/getdents/seek calls are possible in the regular system call api.
The pidfd_getfd() system call allows a caller with ptrace_may_access() abilities on another process to steal a file descriptor from this process. This system call is used by debuggers, container runtimes, system call supervisors, networking proxies etc (cf. [1]-[3]). So while it is a special interest system call it is used in common tools.
Via pidfd_getfd() it's possible to intercept syscalls such as connect(2). For example, a container manager might want to rewrite the connect(2) request to something other than the task intended for security reasons or because the task lacks the necessary information about the networking layout to connect to the right endpoint. In these cases pidfd_getfd(2) can be used to retrieve a copy of the file descriptor of the task and perform the connect(2) for it.
When used in combination with the seccomp notifier things are pretty simple. The intercepted task is blocked in the seccomp code, which runs before the requested system call is actually performed. The seccomp notifier then notifies the supervising process which holds the seccomp notifier fd.
The supervisor can now call pidfd_getfd() while the target process is blocked and steal the file descriptor. It can then perform whatever operation it wants on that file descriptor and then tell seccomp to either return a specific return value to the task once it is unblocked or even continue the task using the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag.
One of the most useful properties of pidfd_getfd() is that the target task from which the caller is about to steal a file descriptor doesn't need to be blocked or stopped. But that makes it possible for the target process to be in the middle of a read/write/getdents/seek system call while the caller steals the file on which that system call was issued and issues a concurrent read/write/getdents/seek of its own:
P1                                              P2

getdents(fd)
-> fdget_pos()
   {
        /* @file->f_count == 1 */
        /* @current->files->count == 1 */
        => not taking @file->f_pos_lock
   }
                                                fd = pidfd_getfd(fd)
                                                -> __pidfd_fget()
                                                   -> fget_task()         -> f_count == 2
                                                -> receive_fd()
                                                   -> get_file()          -> f_count == 3
                                                   -> fd_install()
                                                -> fput()                 -> f_count == 2

                                                getdents(fd)
                                                -> fdget_pos()
                                                   {
                                                        /* @file->f_count == 2 */
                                                        => taking @file->f_pos_lock,
                                                           bumping @file->f_count == 3
                                                   }
-> vfs_readdir()                                -> vfs_readdir()
-> fdput()                                      -> fdput()                -> f_count == 2
Although I'm not happy about it, I'm somewhat confident that this analysis is correct. The race is hard to hit in real life, but with some ingenuity it might be possible to make it more reliable. However, I was lazy and introduced a luscious 10 second delay after fdget_pos() succeeded in vfs_read(), triggerable by passing -1 as the buffer size to the read() system call. I then used the very very ugly program in [4] to enforce the order illustrated above. Lastly, I wrote a bpftrace program which attached a kretfunc to __fdget_pos() and a kfunc and a kretfunc to vfs_read(). The bpftrace program would only trigger if the target file with a specific inode number was read:
Attaching 3 probes...
fdget_pos() on file pointer 0xffff8b3500d6a300: name(stealme) | i_ino(1591) | pid(9690)
        FDPUT_POS_UNLOCK(0), count(1) => mutex not acquired
vfs_read() on file pointer 0xffff8b3500d6a300: pid 9690 started reading from name(stealme) | i_ino(1591)
fdget_pos() on file pointer 0xffff8b3500d6a300: name(stealme) | i_ino(1591) | pid(9691)
        FDPUT_POS_UNLOCK(2), count(2) => mutex acquired
vfs_read() on file pointer 0xffff8b3500d6a300: pid 9691 started reading from name(stealme) | i_ino(1591)
vfs_read() on file pointer 0xffff8b3500d6a300: pid 9690 finished reading from name(stealme) | i_ino(1591)
vfs_read() on file pointer 0xffff8b3500d6a300: pid 9691 finished reading from name(stealme) | i_ino(1591)
I thought about various ways to fix this, from introducing some mechanism to refuse pidfd_getfd() if the target process is in the middle of an fdget_pos()-protected system call with file count one, up to just not caring about it at all. But it is pretty nasty to leave this race open and it feels like something that'll bite us later anyway.
Plus, with io_uring having to acquire f_pos_lock unconditionally anyway for getdents support I'm not sure that deviation in locking semantics here is worth keeping. So remove the locking optimization for file count being one. We'll see if this has a significant performance impact.
This survives xfstests and LTP.
Link: https://github.com/rr-debugger/rr [1]
Link: https://github.com/opencontainers/runc [2]
Link: https://brauner.io/2020/07/23/seccomp-notify.html [3]
Link: https://gist.github.com/brauner/599d31048c8adcb3aa0c537b76c89e12 [4]
Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/file.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 7893ea161d77..35c62b54c9d6 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1042,10 +1042,8 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);
 
 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		if (file_count(file) > 1) {
-			v |= FDPUT_POS_UNLOCK;
-			mutex_lock(&file->f_pos_lock);
-		}
+		v |= FDPUT_POS_UNLOCK;
+		mutex_lock(&file->f_pos_lock);
 	}
 	return v;
 }

---
base-commit: 6eaae198076080886b9e7d57f4ae06fa782f90ef
change-id: 20230724-vfs-fdget_pos-c6540e3ed7d9
So this was a case of "too much explanations make the explanation much harder to follow".
I tend to enjoy your pull request explanations, but this one was just *way* too much.
Please try to make the point you are making a bit more salient, so that it's a lot easier to follow.
On Mon, 24 Jul 2023 at 08:01, Christian Brauner <brauner@kernel.org> wrote:
[..] the
file_count(file) greater than one optimization was already broken and that concurrent read/write/getdents/seek calls are possible in the regular system call api.
The pidfd_getfd() system call allows a caller with ptrace_may_access() abilities on another process to steal a file descriptor from this process.
I think the above is all you need to actually explain the problem and boil down the cause of the bug, and it means that the reader doesn't have to wade through a lot of other verbiage to figure it out.
        if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-               if (file_count(file) > 1) {
-                       v |= FDPUT_POS_UNLOCK;
-                       mutex_lock(&file->f_pos_lock);
-               }
+               v |= FDPUT_POS_UNLOCK;
+               mutex_lock(&file->f_pos_lock);
        }
Ho humm. The patch is obviously correct.
At the same time this is actually very annoying, because I played this very issue with the plain /proc/<pid>/fd/<xyz> interface long long ago, where it would just re-use the 'struct file' directly, and it was such a sh*t-show that I know it's much better to actually open a new file descriptor.
I'm not sure that "share the actual 'struct file'" ever was part of a mainline kernel. I remember having it, but it was a "last century" kind of thing.
The /proc interface hack was actually somewhat useful exactly because you'd see the file position change, but it really caused problems.
The fact that pidfd_getfd() re-introduced that garbage and I never realized this just annoys me no end.
And sadly, the man-page makes it very explicit that it's this broken kind of "share the whole file, file offset and all". Damn damn damn.
Is it too late to just fix pidfd_getfd() to duplicate the 'struct file', and act like a new open, and act like /proc/<pid>/fd/<xyz>?
Because honestly, having been there before, I'm pretty convinced that the real bug here is pidfd_getfd.
I wonder if we could make pidfd_getfd() at least duplicate the struct file for directories. Those are the things that absolutely *require* atomic file positions.
Argh.
I wonder if this also screws up our garbage collection logic. It too ends up having some requirements for a reliable file_count().
Linus
On Mon, Jul 24, 2023 at 08:53:32AM -0700, Linus Torvalds wrote:
So this was a case of "too much explanations make the explanation much harder to follow".
I tend to enjoy your pull request explanations, but this one was just *way* too much.
Please try to make the point you are making a bit more salient, so that it's a lot easier to follow.
Sure. The thing is though that I'm not doing this because it's fun to write a lot of text. It's simply because I want to remember how I arrived at that conclusion. I can keep this shorter for sure. The problem is often just what can I assume the reader knows and what they don't.
On Mon, 24 Jul 2023 at 08:01, Christian Brauner <brauner@kernel.org> wrote:
[..] the
file_count(file) greater than one optimization was already broken and that concurrent read/write/getdents/seek calls are possible in the regular system call api.
The pidfd_getfd() system call allows a caller with ptrace_may_access() abilities on another process to steal a file descriptor from this process.
I think the above is all you need to actually explain the problem and boil down the cause of the bug, and it means that the reader doesn't have to wade through a lot of other verbiage to figure it out.
        if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-               if (file_count(file) > 1) {
-                       v |= FDPUT_POS_UNLOCK;
-                       mutex_lock(&file->f_pos_lock);
-               }
+               v |= FDPUT_POS_UNLOCK;
+               mutex_lock(&file->f_pos_lock);
        }
Ho humm. The patch is obviously correct.
At the same time this is actually very annoying, because I played this very issue with the plain /proc/<pid>/fd/<xyz> interface long long ago, where it would just re-use the 'struct file' directly, and it was such a sh*t-show that I know it's much better to actually open a new file descriptor.
I'm not sure that "share actual 'struct file' ever was part of a mainline kernel". I remember having it, but it was a "last century" kind of thing.
The /proc interface hack was actually somewhat useful exactly because you'd see the file position change, but it really caused problems.
The fact that pidfd_getfd() re-introduced that garbage and I never realized this just annoys me no end.
SCM_RIGHTS, which has existed since 2.1 or so, allows you to do the same thing, just cooperatively. If you receive a bunch of fds from another task, including sockets and so on, they refer to the same struct file.
In recent kernels we also have the seccomp notifier addfd ioctl which lets you add a file descriptor into another process and can also be used to create shared struct files.
On Mon, 24 Jul 2023 at 09:19, Christian Brauner <brauner@kernel.org> wrote:
SCM_RIGHTS, which has existed since 2.1 or so, allows you to do the same thing, just cooperatively. If you receive a bunch of fds from another task, including sockets and so on, they refer to the same struct file.
Yes, but it has special synchronization rules, eg big comment in commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK").
There are magic rules with "total_refs == inflight_refs", and that total_refs thing is very much the file count, ie
total_refs = file_count(u->sk.sk_socket->file);
where we had some nasty bugs with files coming back to life.
In recent kernels we also have the seccomp notifier addfd ioctl which lets you add a file descriptor into another process and can also be used to create shared struct files.
Note that shared files aren't a problem per se. The most obvious case is just 'dup()'. It's been around forever.
What's a problem is a file that *isn't* shared magically becoming shared without the process being aware of it.
Both SCM file passing and the seccomp addfd cases are synchronous wrt the source file descriptor and don't add any references from "outside".
Linus
On Mon, 24 Jul 2023 at 09:36, Linus Torvalds <torvalds@linux-foundation.org> wrote:
There are magic rules with "total_refs == inflight_refs", and that total_refs thing is very much the file count, ie
total_refs = file_count(u->sk.sk_socket->file);
where we had some nasty bugs with files coming back to life.
Ok, I don't think this is an issue here. It really is that "only in-flight refs remaining" that is a special case, and even pidfd_getfd() shouldn't be able to change that.
But the magic code is all in fget_task(), and those need to be checked.
You can see how proc does things properly: it does do "fget_task()", but then it only uses it to copy the path part, and just does fput() afterwards.
The bpf code does something like that too, and seems ok (ie it gets the file in order to copy data from it, not to install it).
kcmp_epoll_target() -> get_epoll_tfile_raw_ptr() looks a bit scary, but seems to use the thing only for polling, so I guess any f_pos is irrelevant.
Linus
On Mon, Jul 24, 2023 at 09:51:05AM -0700, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 09:36, Linus Torvalds <torvalds@linux-foundation.org> wrote:
There are magic rules with "total_refs == inflight_refs", and that total_refs thing is very much the file count, ie
total_refs = file_count(u->sk.sk_socket->file);
where we had some nasty bugs with files coming back to life.
Ok, I don't think this is an issue here. It really is that "only in-flight refs remaining" that is a special case, and even pidfd_getfd() shouldn't be able to change that.
But the magic code is all in fget_task(), and those need to be checked.
You can see how proc does things properly: it does do "fget_task()", but then it only uses it to copy the path part, and just does fput() afterwards.
The bpf code does something like that too, and seems ok (ie it gets the file in order to copy data from it, not to install it).
Aside of fget_task() use, it has this:

	rcu_read_lock();
	for (;; curr_fd++) {
		struct file *f;

		f = task_lookup_next_fd_rcu(curr_task, &curr_fd);
		if (!f)
			break;
		if (!get_file_rcu(f))
			continue;

		/* set info->fd */
		info->fd = curr_fd;
		rcu_read_unlock();
		return f;
	}
curr_task is not cached current here - it can be an arbitrary thread. And what we do to the file reference we get here is
	ctx.meta = &meta;
	ctx.task = info->task;
	ctx.fd = info->fd;
	ctx.file = file;
	return bpf_iter_run_prog(prog, &ctx);
I think it can't be used to shove it into any descriptor table, but then there's forming an SCM_RIGHTS datagram and sending it, etc. - it might be worth looking into.
On Mon, Jul 24, 2023 at 09:36:28AM -0700, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 09:19, Christian Brauner <brauner@kernel.org> wrote:
SCM_RIGHTS which have existed since 2.1 or sm allow you to do the same thing just cooperatively. If you receive a bunch of fds from another task including sockets and so on they refer to the same struct file.
Yes, but it has special synchronization rules, eg big comment in commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK").
There are magic rules with "total_refs == inflight_refs", and that total_refs thing is very much the file count, ie
total_refs = file_count(u->sk.sk_socket->file);
where we had some nasty bugs with files coming back to life.
It's a bit of a shame that this isn't documented anywhere as that's pretty sensitive stuff and it seems a high bar to assume that everyone has that historical knowledge. :/ So we should definitely document this in Documentation/ going forward.
This means pidfd_getfd() needs the same treatment as MSG_PEEK for sockets.
On Mon, 24 Jul 2023 at 10:23, Christian Brauner <brauner@kernel.org> wrote:
This means pidfd_getfd() needs the same treatment as MSG_PEEK for sockets.
So the reason I think pidfd_getfd() is ok is that it has serialized with the source of the file descriptor and uses fget_task() -> __fget_files.
And that code is nasty and complicated, but it does get_file_rcu() to increment the file count, and then *after* that it still checks that yes, the file pointer is still there.
And that means that anybody who uses fget_task() will only ever get a ref to a file if that file still existed in the source, and you can never have a situation where a file comes back to life.
The reason MSG_PEEK is special is exactly because it can "resurrect" a file that was closed, and added to the unix SCM garbage collection list as "only has a ref in the SCM thing", so when we then make it live again, it needs that very very subtle thing.
So pidfd_getfd() is ok in this regard.
But it *is* an example of how subtle it is to just get a new ref to an existing file.
That whole
if (atomic_read_acquire(&files->count) == 1) {
in __fget_light() is also worth being aware of. It isn't about the file count, but it is about "I have exclusive access to the file table". So you can *not* close a file, or open a file, for another process from outside. The only thread that is allowed to access or change the file table (including resizing it), is the thread itself.
I really hope we don't have cases of that.
Linus
On Mon, Jul 24, 2023 at 10:34:31AM -0700, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 10:23, Christian Brauner <brauner@kernel.org> wrote:
This means pidfd_getfd() needs the same treatment as MSG_PEEK for sockets.
So the reason I think pidfd_getfd() is ok is that it has serialized with the source of the file descriptor and uses fget_task() -> __fget_files.
And that code is nasty and complicated, but it does get_file_rcu() to increment the file count, and then *after* that it still checks that yes, the file pointer is still there.
And that means that anybody who uses fget_task() will only ever get a ref to a file if that file still existed in the source, and you can never have a situation where a file comes back to life.
The reason MSG_PEEK is special is exactly because it can "resurrect" a file that was closed, and added to the unix SCM garbage collection list as "only has a ref in the SCM thing", so when we then make it live again, it needs that very very subtle thing.
Oh, eew.
So pidfd_getfd() is ok in this regard.
But it *is* an example of how subtle it is to just get a new ref to an existing file.
That whole
if (atomic_read_acquire(&files->count) == 1) {
in __fget_light() is also worth being aware of. It isn't about the file count, but it is about "I have exclusive access to the file table". So you can *not* close a file, or open a file, for another process from outside. The only thread that is allowed to access or change the file table (including resizing it), is the thread itself.
Yeah, I'm well aware of that, because an earlier version of the getdents patchset would've run into exactly that problem: it did async offload on the file while the process that registered that async offload would've been free to create another thread and issue additional requests.
I really hope we don't have cases of that.
I don't think we do but it's something to keep in mind with async io interfaces where the caller is free to create other threads after having registered a request. Depending on how file references are done things can get tricky easily.
I'm not complaining here but it is worth noting that additions to e.g., io_uring now have to reason through two reference counting mechanisms: the regular system call rules and the fixed file io_uring rules. If conditional locking is involved in one part it might have effects/consequences in the other part as one can mix regular system call requests and fixed file requests on the same struct file.
On Mon, 24 Jul 2023 at 10:46, Christian Brauner <brauner@kernel.org> wrote:
I don't think we do but it's something to keep in mind with async io interfaces where the caller is free to create other threads after having registered a request. Depending on how file references are done things can get tricky easily.
Honestly, by now, the io_uring code had *better* understand that it needs to act exactly like a user thread.
Anything else is simply not acceptable. io_uring has been a huge pain, and the only thing that salvaged the horror was that the io_uring async code should now *always* be done as a real thread.
If io_uring does something from truly async context (ie interrupts, not io_uring threads), then io_uring had better be very *very* careful.
And any kind of "from kthread context" is not acceptable. We've been there, done that, and have the battle scars. Never again.
So the absolutely *only* acceptable context is "I'm a real io_uring thread that looks exactly like a user thread in every which way, except I never return to user space".
And if io_uring does absolutely _anything_ to file descriptors from any other context, it needs to be fixed *NOW*.
No more games.
And absolutely *nothing* like that disgusting pidfd_getfd().
So the only reason io_uring can do open/close/etc is because from a file descriptor standpoint it looks just like any other threaded application would look, and all those optimizations like
if (atomic_read_acquire(&files->count) == 1) {
JustWork(tm).
I think that's the case right now, and it's all good.
But if I'm unaware of some situation where io_uring does things like this in some workqueue or other context, people had better holler - and fix it.
Linus
On 7/24/23 12:01 PM, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 10:46, Christian Brauner <brauner@kernel.org> wrote:
I don't think we do but it's something to keep in mind with async io interfaces where the caller is free to create other threads after having registered a request. Depending on how file references are done things can get tricky easily.
Honestly, by now, the io_uring code had *better* understand that it needs to act exactly like a user thread.
Anything else is simply not acceptable. io_uring has been a huge pain, and the only thing that salvaged the horror was that the io_uring async code should now *always* be done as a real thread.
If io_uring does something from truly async context (ie interrupts, not io_uring threads), then io_uring had better be very *very* careful.
And any kind of "from kthread context" is not acceptable. We've been there, done that, and have the battle scars. Never again.
So the absolutely *only* acceptable context is "I'm a real io_uring thread that looks exactly like a user thread in every which way, except I never return to user space".
And if io_uring does absolutely _anything_ to file descriptors from any other context, it needs to be fixed *NOW*.
io_uring never does that from anything that isn't the original user space creator task, or from the io-wq workers that it may create. Those are _always_ normal threads. There's no workqueue/kthread usage for IO or file getting/putting/installing/removing etc.
On Mon, 24 Jul 2023 at 11:05, Jens Axboe <axboe@kernel.dk> wrote:
io_uring never does that from anything that isn't the original user space creator task, or from the io-wq workers that it may create. Those are _always_ normal threads. There's no workqueue/kthread usage for IO or file getting/putting/installing/removing etc.
That's what I thought. But Christian's comments about the io_uring getdents work made me worry.
If io_uring does everything right, then the old "file_count(file) > 1" test in __fdget_pos() - now sadly removed - should work just fine for any io_uring directory handling.
It may not be obvious when you look at just that test in a vacuum, but it happens after __fdget_pos() has done the
	unsigned long v = __fdget(fd);
	struct file *file = (struct file *)(v & ~3);
and if it's a threaded app - where io_uring threads count the same - then the __fdget() in that sequence will have incremented the file count.
So when it then used to do that
if (file_count(file) > 1) {
that "> 1" included not just all the references from dup() etc, but for any threaded use where we have multiple references to the file table it also included the reference we get from __fdget() -> __fget() -> __fget_files() -> __fget_files_rcu() -> get_file_rcu() -> atomic_long_inc_not_zero(&(x)->f_count).
And yes, that's a fairly long chain, through some subtle code. Doing "fdget()" (and variations) is not trivial.
Of course, with directory handling, one of the things people have wanted is to have a "pread()" like model, where you have a position that is given externally and not tied to the 'struct file'.
We don't have such a thing, and I don't think we can have one, because I think filesystems also end up having possibly other private data in the 'struct file' (trivially: the directory cursor for dcache_dir_open()).
And other filesystems have it per-inode, which is why we have that "iterate_shared" vs "iterate" horror.
So any dentry thing that wants to do things in parallel needs to open their own private 'file' just for *that* reason. And even then they will fail if the filesystem doesn't have that shared directory entry iterator.
Linus
On Mon, Jul 24, 2023 at 11:27:29AM -0700, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 11:05, Jens Axboe <axboe@kernel.dk> wrote:
io_uring never does that isn't the original user space creator task, or from the io-wq workers that it may create. Those are _always_ normal threads. There's no workqueue/kthread usage for IO or file getting/putting/installing/removing etc.
That's what I thought. But Christian's comments about the io_uring getdents work made me worry.
Sorry, my point was that an earlier version of the patchset _would_ have introduced a locking scheme that would've violated locking assumptions because it didn't get the subtle refcount difference right. But that was caught during review.
It's really just keeping in mind that refcount rules change depending on whether fds or fixed files are used.
If io_uring does everything right, then the old "file_count(file) > 1" test in __fdget_pos() - now sadly removed - should work just fine for any io_uring directory handling.
Yes, but only if you disallow getdents with fixed files when the reference count rules are different.
It may not be obvious when you look at just that test in a vacuum, but it happens after __fdget_pos() has done the
	unsigned long v = __fdget(fd);
	struct file *file = (struct file *)(v & ~3);
and if it's a threaded app - where io_uring threads count the same - then the __fdget() in that sequence will have incremented the file count.
So when it then used to do that
if (file_count(file) > 1) {
that "> 1" included not just all the references from dup() etc, but for any threaded use where we have multiple references to the file table it also included the reference we get from __fdget() -> __fget() -> __fget_files() -> __fget_files_rcu() -> get_file_rcu() -> atomic_long_inc_not_zero(&(x)->f_count).
Yes, for the regular system call path it's all well and dandy and I said as much on the getdents thread as well.
On Mon, 24 Jul 2023 at 11:48, Christian Brauner <brauner@kernel.org> wrote:
It's really just keeping in mind that refcount rules change depending on whether fds or fixed files are used.
This sentence still worries me.
Those fixed files had better have their own refcounts from being fixed. So the rules really shouldn't change in any way what-so-ever. So what exactly are you alluding to?
Linus
On 7/24/23 4:25 PM, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 11:48, Christian Brauner <brauner@kernel.org> wrote:
It's really just keeping in mind that refcount rules change depending on whether fds or fixed files are used.
This sentence still worries me.
Those fixed files had better have their own refcounts from being fixed. So the rules really shouldn't change in any way what-so-ever. So what exactly are you alluding to?
They do, but they only have a single reference, which is what fixes them into the io_uring file table as fixed files. With the patch from the top of this thread, that should then be fine as we don't need to artificially elevate the ref count more than that.
On Mon, 24 Jul 2023 at 15:57, Jens Axboe <axboe@kernel.dk> wrote:
On 7/24/23 4:25 PM, Linus Torvalds wrote:
This sentence still worries me.
Those fixed files had better have their own refcounts from being fixed. So the rules really shouldn't change in any way what-so-ever. So what exactly are you alluding to?
They do, but they only have a single reference, which is what fixes them into the io_uring file table as fixed files. With the patch from the top of this thread, that should then be fine as we don't need to artificially elevate the ref count more than that.
No.
The patch from the top of this thread cannot *possibly* matter for a io_uring fixed file.
The fdget_pos() always gets the file pointer from the file table. But that means that it is guaranteed to have a refcount of at least one.
If io_uring fixed file holds a reference (and not holding a reference would be a huge bug), that in turn means that the minimum refcount is now two.
So the code in fdget_pos() is correct, with or without the patch.
The *only* problem is when something actually violates the refcounting rules. Sadly, that's exactly what pidfd_getfd() does, and can basically make a private file pointer be non-private without synchronizing with the original owner of the fd.
Now, io_uring may have had its own problems, if it tried to re-implement some io_uring-specific version of fdget_pos() for the fixed file case, and thought that it could use the file_count() == 1 trick when it *wasn't* also a file table entry.
But that would be an independent bug from copy-and-pasting code without taking the surrounding rules into account.
Linus
On 7/25/23 12:30 PM, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 15:57, Jens Axboe <axboe@kernel.dk> wrote:
On 7/24/23 4:25 PM, Linus Torvalds wrote:
This sentence still worries me.
Those fixed files had better have their own refcounts from being fixed. So the rules really shouldn't change in any way what-so-ever. So what exactly are you alluding to?
They do, but they only have a single reference, which is what fixes them into the io_uring file table for fixed files. With the patch from the top of this thread, that should then be fine as we don't need to artificially elevate the ref count more than that.
No.
The patch from the top of this thread cannot *possibly* matter for an io_uring fixed file.
The fdget_pos() always gets the file pointer from the file table. But that means that it is guaranteed to have a refcount of at least one.
If an io_uring fixed file holds a reference (and not holding a reference would be a huge bug), that in turn means that the minimum refcount is now two.
Right, but what if the original app closes the file descriptor? Now you have the io_uring file table still holding a reference to it, but it'd just be 1. Which is enough to keep it alive, but you can still have multiple IOs inflight against this file.
Obviously using the file position is wonky with async IO to begin with, exactly because you can have multiple IOs in flight to it at the same time. You can make it work by specifying ordering constraints, but that is obviously also totally messy and not really a valid use case. Just don't use the file position at that point.
Some libraries are limited though and want to use file positions with async IO, and they generally get to keep both pieces if they do that without treating it as sync IO (or at least serialized IO) at that point.
So the code in fdget_pos() is correct, with or without the patch.
The *only* problem is when something actually violates the refcounting rules. Sadly, that's exactly what pidfd_getfd() does, and can basically make a private file pointer be non-private without synchronizing with the original owner of the fd.
Now, io_uring may have had its own problems, if it tried to re-implement some io_uring-specific version of fdget_pos() for the fixed file case, and thought that it could use the file_count() == 1 trick when it *wasn't* also a file table entry.
But that would be an independent bug from copy-and-pasting code without taking the surrounding rules into account.
We never made any assumptions about the file_count() of the file, exactly because that count means nothing to io_uring in terms of whether we can have concurrent IO against it or not.
On Tue, 25 Jul 2023 at 13:41, Jens Axboe axboe@kernel.dk wrote:
Right, but what if the original app closes the file descriptor? Now you have the io_uring file table still holding a reference to it, but it'd just be 1. Which is enough to keep it alive, but you can still have multiple IOs inflight against this file.
Note that fdget_pos() fundamentally only works on file descriptors that are there - it's literally looking them up in the file table as it goes along. And it looks at the count of the file description as it is looked up. So that refcount is guaranteed to exist.
If the file has been closed, fdget_pos() will just fail because it doesn't find it.
And if it's then closed *afterwards*, that's fine and doesn't affect anything, because the locking has been done and we saved away the status bit as FDPUT_POS_UNLOCK, so the code knows to unlock even if the file descriptor in the meantime has turned back to having just a single refcount.
Linus
On 7/25/23 2:51 PM, Linus Torvalds wrote:
On Tue, 25 Jul 2023 at 13:41, Jens Axboe axboe@kernel.dk wrote:
Right, but what if the original app closes the file descriptor? Now you have the io_uring file table still holding a reference to it, but it'd just be 1. Which is enough to keep it alive, but you can still have multiple IOs inflight against this file.
Note that fdget_pos() fundamentally only works on file descriptors that are there - it's literally looking them up in the file table as it goes along. And it looks at the count of the file description as it is looked up. So that refcount is guaranteed to exist.
If the file has been closed, fdget_pos() will just fail because it doesn't find it.
And if it's then closed *afterwards*, that's fine and doesn't affect anything, because the locking has been done and we saved away the status bit as FDPUT_POS_UNLOCK, so the code knows to unlock even if the file descriptor in the meantime has turned back to having just a single refcount.
Ah yes good point. If registered with io_uring and closed, then it won't matter how many extra references we hold or grab in the future, it won't be found in the file table. I guess that's obvious...
On Tue, Jul 25, 2023 at 01:51:37PM -0700, Linus Torvalds wrote:
On Tue, 25 Jul 2023 at 13:41, Jens Axboe axboe@kernel.dk wrote:
Right, but what if the original app closes the file descriptor? Now you have the io_uring file table still holding a reference to it, but it'd just be 1. Which is enough to keep it alive, but you can still have multiple IOs inflight against this file.
Note that fdget_pos() fundamentally only works on file descriptors that are there - it's literally looking them up in the file table as it goes along. And it looks at the count of the file description as it is looked up. So that refcount is guaranteed to exist.
If the file has been closed, fdget_pos() will just fail because it doesn't find it.
And if it's then closed *afterwards*, that's fine and doesn't affect anything, because the locking has been done and we saved away the status bit as FDPUT_POS_UNLOCK, so the code knows to unlock even if the file descriptor in the meantime has turned back to having just a single refcount.
Yes, and to summarize what I tried to convey in my commit description: the getdents support patchset would have introduced a bug because it copied the fdget_pos() file_count(file) > 1 optimization into io_uring.
That works fine as long as the original file descriptor used to register the fixed file is kept open. The locking will work correctly as file_count(file) > 1, and no races are possible, whether via getdents calls using the original file descriptor, via io_uring using the fixed file, or even mixing both.
But as soon as the original file descriptor is closed, the f_count of the file drops back to 1, while the file remains usable from io_uring via the fixed file. Now the optimization the patchset wanted to copy over would cause bugs, as multiple racing getdents requests would be possible through the fixed file.
The simple fix, of course, is that io_uring just always grabs the position lock and never does this optimization. The patch authors were simply unaware that fixed files would not work with it.
I caught that during review but then immediately realized that the file_count(file) > 1 optimization was unfortunately broken in another way, which ultimately led to this commit.
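The straightforward shape of that fix can be sketched as follows (io_getdents() and vfs_getdents() are hypothetical names; the getdents patchset is unmerged, so this is purely illustrative):

```c
/* Purely illustrative sketch -- the point is that io_uring must take
 * f_pos_lock unconditionally, because a fixed file can sit at
 * file_count() == 1 while multiple getdents requests race against it. */
static int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
{
	struct file *file = req->file;
	int ret;

	/* No file_count(file) > 1 shortcut here. */
	mutex_lock(&file->f_pos_lock);
	ret = vfs_getdents(file, req->dirent_buf, req->dirent_len);
	mutex_unlock(&file->f_pos_lock);
	return ret;
}
```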
From: Christian Brauner
Sent: 26 July 2023 09:37
...
Yes, and to summarize what I tried to convey in my commit description: the getdents support patchset would have introduced a bug because it copied the fdget_pos() file_count(file) > 1 optimization into io_uring.
That works fine as long as the original file descriptor used to register the fixed file is kept open. The locking will work correctly as file_count(file) > 1, and no races are possible, whether via getdents calls using the original file descriptor, via io_uring using the fixed file, or even mixing both.
But as soon as the original file descriptor is closed, the f_count of the file drops back to 1, while the file remains usable from io_uring via the fixed file. Now the optimization the patchset wanted to copy over would cause bugs, as multiple racing getdents requests would be possible through the fixed file.
Could the io_uring code grab two references? That would stop the optimisation without affecting any normal code paths?
David
On Wed, Jul 26, 2023 at 10:31:07AM +0000, David Laight wrote:
From: Christian Brauner
Sent: 26 July 2023 09:37
...
Yes, and to summarize what I tried to convey in my commit description: the getdents support patchset would have introduced a bug because it copied the fdget_pos() file_count(file) > 1 optimization into io_uring.
That works fine as long as the original file descriptor used to register the fixed file is kept open. The locking will work correctly as file_count(file) > 1, and no races are possible, whether via getdents calls using the original file descriptor, via io_uring using the fixed file, or even mixing both.
But as soon as the original file descriptor is closed, the f_count of the file drops back to 1, while the file remains usable from io_uring via the fixed file. Now the optimization the patchset wanted to copy over would cause bugs, as multiple racing getdents requests would be possible through the fixed file.
Could the io_uring code grab two references? That would stop the optimisation without affecting any normal code paths?
io_uring doesn't use the fdget variants and can't for its purposes, as fdget is for short-term references while io_uring holds on to the file. This whole thing was about the logic that was copied in a patchset, not the actual helper itself. I thought that was clear from "copied the [...] optimization into io_uring". It should just not do that.
On Tue, Jul 25, 2023 at 11:30:32AM -0700, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 15:57, Jens Axboe axboe@kernel.dk wrote:
On 7/24/23 4:25 PM, Linus Torvalds wrote:
This sentence still worries me.
Those fixed files had better have their own refcounts from being fixed. So the rules really shouldn't change in any way what-so-ever. So what exactly are you alluding to?
They do, but they only have a single reference, which is what fixes them into the io_uring file table for fixed files. With the patch from the top of this thread, that should then be fine as we don't need to artificially elevate the ref count more than that.
No.
The patch from the top of this thread cannot *possibly* matter for an io_uring fixed file.
Yeah, the patch doesn't matter for fixed files. But they copied the logic from fdget_pos(), which was the problem.
The fdget_pos() always gets the file pointer from the file table. But that means that it is guaranteed to have a refcount of at least one.
If an io_uring fixed file holds a reference (and not holding a reference would be a huge bug), that in turn means that the minimum refcount is now two.
So the code in fdget_pos() is correct, with or without the patch.
Yes.
The *only* problem is when something actually violates the refcounting rules. Sadly, that's exactly what pidfd_getfd() does, and can basically make a private file pointer be non-private without synchronizing with the original owner of the fd.
Now, io_uring may have had its own problems, if it tried to re-implement some io_uring-specific version of fdget_pos() for the fixed file case, and thought that it could use the file_count() == 1 trick when it *wasn't* also a file table entry.
Yes, that's what one of the patch versions did, what I pointed out won't work, and what ultimately led me to discover the pidfd_getfd() bug. (That's also, btw, why my explanation was so long.)
But that would be an independent bug from copy-and-pasting code without taking the surrounding rules into account.
Yes, exactly what I said in the review.
On Mon, Jul 24, 2023 at 08:53:32AM -0700, Linus Torvalds wrote:
Is it too late to just fix pidfd_getfd() to duplicate the 'struct file', and act like a new open, and act like /proc/<pid>/fd/<xyz>?
So thinking a little about it, I think that doesn't work. /proc/<pid>/fd/<xyz> does a reopen, and for good reasons. The original open will have gone through the module's/subsystem's ->open() method, which might stash additional refcounted data in, e.g., file->private_data. If we simply copy that file or something, then we risk UAFs. If we don't skip ->open() though and effectively emulate /proc/<pid>/fd/<xyz> completely, then we break any such use-cases where a socket or something else is shared, as they cannot be reopened. So the module/subsystem really needs to be informed that a new struct file is created and not simply refd.
On Mon, 24 Jul 2023 at 09:46, Christian Brauner brauner@kernel.org wrote:
So thinking a little about it, I think that doesn't work. /proc/<pid>/fd/<xyz> does a reopen, and for good reasons. The original open will have gone through the module's/subsystem's ->open() method, which might stash additional refcounted data in, e.g., file->private_data. If we simply copy that file or something, then we risk UAFs.
Oh, absolutely, we'd absolutely need to do all the re-open things.
That said, we could limit it to FMODE_ATOMIC_POS - so just regular files and directories. The only thing that sets that bit is do_dentry_open().
And honestly, the only thing that *really* cares is directories, because they generally have special rules for pos changing.
The regular files have the "POSIX rules" reason, but hey, if you use pidfd_getfd() and mess with the pos behind the back of the process, it's no different from using a debugger to change it, so the POSIX rules issue just isn't relevant.
I really hate making the traditional unix single-threaded file descriptor case take that lock.
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
Linus
On Mon, 24 Jul 2023 at 09:59, Linus Torvalds torvalds@linux-foundation.org wrote:
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
I've applied the patch, with a much pruned commit message, and just a link to your original email instead.
Linus
On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
I really hate making the traditional unix single-threaded file descriptor case take that lock.
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
So I got curious what the impact is and checked on quite a modern CPU (Sapphire Rapids), so nobody can claim it's some old yeller and atomics are nowhere near as expensive on modern uarchs.
I used read1_processes from will-it-scale -- it is doing 4KB reads at a time over a 1MB file and dodges refing the file, but it does not dodge the lock with the patch at hand.
In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
The kernel was patched with a toggle to force or elide the proposed mandatory locking, like so:

@@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);
 
 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		v |= FDPUT_POS_UNLOCK;
-		mutex_lock(&file->f_pos_lock);
+		if (file_count(file) > 1 || fdget_pos_mutex) {
+			v |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
 	}
 	return v;
 }
I got rather unstable single-threaded perf, going up and down several % between runs, I don't know yet what's that about. But toggling back and forth while the bench was running consistently gave aforementioned ~5% difference.
perf top claims:

[snip]
32.64% [kernel]         [k] copyout
10.83% [kernel]         [k] entry_SYSCALL_64
 6.69% libc.so.6        [.] read
 6.16% [kernel]         [k] filemap_get_read_batch
 5.62% [kernel]         [k] filemap_read
 3.39% [kernel]         [k] __fget_light
 2.92% [kernel]         [k] mutex_lock    <-- oh
 2.70% [kernel]         [k] mutex_unlock  <-- no
 2.33% [kernel]         [k] vfs_read
 2.18% [kernel]         [k] _copy_to_iter
 1.93% [kernel]         [k] atime_needs_update
 1.74% [kernel]         [k] __fsnotify_parent
 1.29% read1_processes  [.] testcase
[/snip]
[note running perf along with the bench changes throughput to some extent]
So yes, atomics remain expensive on x86-64 even on a very modern uarch and their impact is measurable in a syscall like read.
Consequently eliding this mutex trip would be most welcome.
Happy to rant,
On Thu, Aug 03, 2023 at 11:53:11AM +0200, Mateusz Guzik wrote:
On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
I really hate making the traditional unix single-threaded file descriptor case take that lock.
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
So I got curious what the impact is and checked on quite a modern CPU (Sapphire Rapids), so nobody can claim it's some old yeller and atomics are nowhere near as expensive on modern uarchs.
I used read1_processes from will-it-scale -- it is doing 4KB reads at a time over a 1MB file and dodges refing the file, but it does not dodge the lock with the patch at hand.
In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
The kernel was patched with a toggle to force or elide the proposed mandatory locking, like so:

@@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);
 
 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		v |= FDPUT_POS_UNLOCK;
-		mutex_lock(&file->f_pos_lock);
+		if (file_count(file) > 1 || fdget_pos_mutex) {
+			v |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
 	}
 	return v;
 }
I got rather unstable single-threaded perf, going up and down several % between runs, I don't know yet what's that about. But toggling back and forth while the bench was running consistently gave aforementioned ~5% difference.
We've had the whole lkp intel performance testsuite run on this for a long time to see whether there are any performance regressions that aren't in the noise. This includes all of will-it-scale. Also with a focus on single-thread performance. So I'm a little skeptical about the reliability of manual performance runs in comparison.
If Linus thinks it matters then I think we should only take the f_pos_lock unconditionally on directory fds as this is where it really matters. For read/write we're only doing it because of posix anyway and using pidfd_getfd() is squarely out of posix anyway. So we can document that using pidfd_getfd() on a regular file descriptor outside of a seccomp stopped task is not recommended.
On 8/3/23, Christian Brauner brauner@kernel.org wrote:
On Thu, Aug 03, 2023 at 11:53:11AM +0200, Mateusz Guzik wrote:
On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
I really hate making the traditional unix single-threaded file descriptor case take that lock.
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
So I got curious what the impact is and checked on quite a modern CPU (Sapphire Rapids), so nobody can claim it's some old yeller and atomics are nowhere near as expensive on modern uarchs.
I used read1_processes from will-it-scale -- it is doing 4KB reads at a time over a 1MB file and dodges refing the file, but it does not dodge the lock with the patch at hand.
In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
The kernel was patched with a toggle to force or elide the proposed mandatory locking, like so:

@@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);
 
 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		v |= FDPUT_POS_UNLOCK;
-		mutex_lock(&file->f_pos_lock);
+		if (file_count(file) > 1 || fdget_pos_mutex) {
+			v |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
 	}
 	return v;
 }
I got rather unstable single-threaded perf, going up and down several % between runs, I don't know yet what's that about. But toggling back and forth while the bench was running consistently gave aforementioned ~5% difference.
We've had the whole lkp intel performance testsuite run on this for a long time to see whether there are any performance regressions that aren't in the noise. This includes all of will-it-scale. Also with a focus on single-thread performance. So I'm a little skeptical about the reliability of manual performance runs in comparison.
I did not merely post numbers claiming a difference, I justified it with profile output, prominently showing the mutex lock/unlock pair. I genuinely think this puts the ball in your court.
One potential factor is mere CPU(s) -- maybe whatever was used in your bench is older than the CPU I used and for example is more negatively affected by mitigations like retpoline et al. I intentionally used something very high end, specifically: Intel(R) Xeon(R) Platinum 8470N
Another potential factor is discrepancies in kernel config.
I found one here: https://lore.kernel.org/oe-lkp/20230802103152.ae3s43z6yjkhnkee@quack3/T/#t , I'm guessing it is the same as the one used in Intel tests. In there I see: CONFIG_RANDOMIZE_KSTACK_OFFSET=y
This is implemented with rdtsc, which comes with a massive premium to syscall cost. Also note it does not *fix* CPU bugs or whatever, it merely is a hardening measure to shuffle the stack pointer (way overpriced btw). Which is a long way of saying I sanity-checked syscall perf on my box with getppid and had to disable this option (AFAIR it bumped the rate from about 11 mln to 15 mln ops/s) and that it is fine to do so.
I could easily see how more things like this combined with older CPUs would make the mutex trip small enough to not worry about, in that setting. I don't think this is the same as the cost of the change being negligible in general. I can't stress enough my config retains retpoline and other mitigations for actual bugs, so it's not like I cheated the system.
I'm attaching my config for reference. It is what's present in Debian + a bunch of stuff disabled.
If Linus thinks it matters then I think we should only take the f_pos_lock unconditionally on directory fds as this is where it really matters. For read/write we're only doing it because of posix anyway and using pidfd_getfd() is squarely out of posix anyway. So we can document that using pidfd_getfd() on a regular file descriptor outside of a seccomp stopped task is not recommended.
In stock kernel the code is:

	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
		if (file_count(file) > 1) {
			v |= FDPUT_POS_UNLOCK;
			mutex_lock(&file->f_pos_lock);
		}
	}
after pidfd_getfd succeeds the count is > 1, so the thing to worry about is racing with the thread calling fdget_pos or making sure it stopped using the fd, but only the first time around. Perhaps there is a way to wait it out (as in guarantee it is not using the fd *or* already had seen the bumped refcount) which would be perfectly acceptable for pidfd consumers? Note the thread may be stuck in an uninterruptible, indefinite sleep somewhere while using it without the lock. On the other hand if such a thing can happen with f_pos_lock mutex held, the pidfd consumer running into the same codepath with the lock would be in the same spot, so perhaps this is not so bad?
Note this is definitely hackable just for fun without pessimizing fdget_pos much, but off hand I don't have anything reasonable for real-world use. I'm going to sleep on it.
of course failed to attach, remedied in this email :)
On 8/3/23, Mateusz Guzik mjguzik@gmail.com wrote:
On 8/3/23, Christian Brauner brauner@kernel.org wrote:
On Thu, Aug 03, 2023 at 11:53:11AM +0200, Mateusz Guzik wrote:
On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
I really hate making the traditional unix single-threaded file descriptor case take that lock.
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
So I got curious what the impact is and checked on quite a modern CPU (Sapphire Rapids), so nobody can claim it's some old yeller and atomics are nowhere near as expensive on modern uarchs.
I used read1_processes from will-it-scale -- it is doing 4KB reads at a time over a 1MB file and dodges refing the file, but it does not dodge the lock with the patch at hand.
In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
The kernel was patched with a toggle to force or elide the proposed mandatory locking, like so:

@@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);
 
 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		v |= FDPUT_POS_UNLOCK;
-		mutex_lock(&file->f_pos_lock);
+		if (file_count(file) > 1 || fdget_pos_mutex) {
+			v |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
 	}
 	return v;
 }
I got rather unstable single-threaded perf, going up and down several % between runs, I don't know yet what's that about. But toggling back and forth while the bench was running consistently gave aforementioned ~5% difference.
We've had the whole lkp intel performance testsuite run on this for a long time to see whether there are any performance regressions that aren't in the noise. This includes all of will-it-scale. Also with a focus on single-thread performance. So I'm a little skeptical about the reliability of manual performance runs in comparison.
I did not merely post numbers claiming a difference, I justified it with profile output, prominently showing the mutex lock/unlock pair. I genuinely think this puts the ball in your court.
One potential factor is mere CPU(s) -- maybe whatever was used in your bench is older than the CPU I used and for example is more negatively affected by mitigations like retpoline et al. I intentionally used something very high end, specifically: Intel(R) Xeon(R) Platinum 8470N
Another potential factor is discrepancies in kernel config.
I found one here: https://lore.kernel.org/oe-lkp/20230802103152.ae3s43z6yjkhnkee@quack3/T/#t , I'm guessing it is the same as the one used in Intel tests. In there I see: CONFIG_RANDOMIZE_KSTACK_OFFSET=y
This is implemented with rdtsc, which comes with a massive premium to syscall cost. Also note it does not *fix* CPU bugs or whatever, it merely is a hardening measure to shuffle the stack pointer (way overpriced btw). Which is a long way of saying I sanity-checked syscall perf on my box with getppid and had to disable this option (AFAIR it bumped the rate from about 11 mln to 15 mln ops/s) and that it is fine to do so.
I could easily see how more things like this combined with older CPUs would make the mutex trip small enough to not worry about, in that setting. I don't think this is the same as the cost of the change being negligible in general. I can't stress enough my config retains retpoline and other mitigations for actual bugs, so it's not like I cheated the system.
I'm attaching my config for reference. It is what's present in Debian + a bunch of stuff disabled.
If Linus thinks it matters then I think we should only take the f_pos_lock unconditionally on directory fds as this is where it really matters. For read/write we're only doing it because of posix anyway and using pidfd_getfd() is squarely out of posix anyway. So we can document that using pidfd_getfd() on a regular file descriptor outside of a seccomp stopped task is not recommended.
In stock kernel the code is:

	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
		if (file_count(file) > 1) {
			v |= FDPUT_POS_UNLOCK;
			mutex_lock(&file->f_pos_lock);
		}
	}
after pidfd_getfd succeeds the count is > 1, so the thing to worry about is racing with the thread calling fdget_pos or making sure it stopped using the fd, but only the first time around. Perhaps there is a way to wait it out (as in guarantee it is not using the fd *or* already had seen the bumped refcount) which would be perfectly acceptable for pidfd consumers? Note the thread may be stuck in an uninterruptible, indefinite sleep somewhere while using it without the lock. On the other hand if such a thing can happen with f_pos_lock mutex held, the pidfd consumer running into the same codepath with the lock would be in the same spot, so perhaps this is not so bad?
Note this is definitely hackable just for fun without pessimizing fdget_pos much, but off hand I don't have anything reasonable for real-world use. I'm going to sleep on it.
-- Mateusz Guzik <mjguzik gmail.com>
On Thu, 3 Aug 2023 at 02:53, Mateusz Guzik mjguzik@gmail.com wrote:
So yes, atomics remain expensive on x86-64 even on a very modern uarch and their impact is measurable in a syscall like read.
Well, a patch like this should fix it.
I intentionally didn't bother with the alpha osf version of readdir, because nobody cares, but I guess we could do this in the header too.
Or we could have split the FMODE_ATOMIC_POS bit into two, and had a "ALWAYS" version and a regular version, but just having a "fdget_dir()" made it simpler.
So this - together with just reverting commit 20ea1e7d13c1 ("file: always lock position for FMODE_ATOMIC_POS") - *should* fix any performance regression.
But I have not tested it at all. So....
Linus
On 8/3/23, Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, 3 Aug 2023 at 02:53, Mateusz Guzik mjguzik@gmail.com wrote:
So yes, atomics remain expensive on x86-64 even on a very modern uarch and their impact is measurable in a syscall like read.
Well, a patch like this should fix it.
I intentionally didn't bother with the alpha osf version of readdir, because nobody cares, but I guess we could do this in the header too.
Or we could have split the FMODE_ATOMIC_POS bit into two, and had a "ALWAYS" version and a regular version, but just having a "fdget_dir()" made it simpler.
So this - together with just reverting commit 20ea1e7d13c1 ("file: always lock position for FMODE_ATOMIC_POS") - *should* fix any performance regression.
That would do it, but I'm uneasy about the partially "gimped" status of the fd that the caller can't do anything about.
I read the thread and still don't get what is the real-world use case for the thing, a real-world consumer to take a look at would be nice.
As noted in another e-mail, the entire lack of pos locking in the stock kernel is a transient ordeal -- there is a point after which the "donor" is guaranteed either not to use the fd, or to spot the refcount > 1 on any future use and take the pos lock from then on.
Perhaps, and I'm really just spitballing here, almost all expected use cases would be 100% fine without pos lock so it all works out as is. Similarly, if the target is multithreaded it already works as well.
Those who expect "fully qualified" semantics would pass a flag argument denoting it but would possibly wait indefinitely while the target is blocked somewhere in the kernel and it is not known if it is even using the fd. This very well may be a deal breaker though and may be too cumbersome to implement for real use.
All that said, personally I would either go forward with a "full fix" (probably not worth it) or take the L and lock for all fds.
On Thu, Aug 03, 2023 at 08:45:54AM -0700, Linus Torvalds wrote:
On Thu, 3 Aug 2023 at 02:53, Mateusz Guzik mjguzik@gmail.com wrote:
So yes, atomics remain expensive on x86-64 even on a very modern uarch and their impact is measurable in a syscall like read.
Well, a patch like this should fix it.
I intentionally didn't bother with the alpha osf version of readdir, because nobody cares, but I guess we could do this in the header too.
Or we could have split the FMODE_ATOMIC_POS bit into two, and had a "ALWAYS" version and a regular version, but just having a "fdget_dir()" made it simpler.
So this - together with just reverting commit 20ea1e7d13c1 ("file: always lock position for FMODE_ATOMIC_POS") - *should* fix any performance regression.
But I have not tested it at all. So....
Yeah, this is my suggestion - and your earlier suggestion - in the thread. Only thing that's missing is exclusion with seek on directories as that's the heinous part.
On Thu, 3 Aug 2023 at 11:03, Christian Brauner brauner@kernel.org wrote:
Only thing that's missing is exclusion with seek on directories as that's the heinous part.
Bah. I forgot about lseek entirely, because for some completely stupid reason I just thought "Oh, that will always get the lock".
So I guess we'd just have to do that "unconditional fdget_dir()" thing in the header file after all, and make llseek() and ksys_lseek() use it.
Bah. And then we'd still have to worry about any filesystem that allows 'read()' and 'write()' on the directory - which can also update f_pos.
And yes, those exist. See at least 'adfs_dir_ops', and 'ceph_dir_fops'. They may be broken, but people actually did do things like that historically, maybe there's a reason adfs and ceph allow it.
End result: we can forget about fdget_dir(). We'd need to split FMODE_ATOMIC_POS into two instead.
I don't think we have any free flags, but I didn't check. The ugly thing to do is to just special-case S_ISDIR. Not lovely, but whatever.
So something like this instead? It's a smaller diff anyway, and it gets the crazy adfs/ceph cases right too.
And by "gets them right" I obviously mean "I didn't actually *TEST* any of this, so who knows..."
Linus
On Thu, Aug 03, 2023 at 11:35:53AM -0700, Linus Torvalds wrote:
On Thu, 3 Aug 2023 at 11:03, Christian Brauner brauner@kernel.org wrote:
Only thing that's missing is exclusion with seek on directories as that's the heinous part.
Bah. I forgot about lseek entirely, because for some completely stupid reason I just thought "Oh, that will always get the lock".
So I guess we'd just have to do that "unconditional fdget_dir()" thing in the header file after all, and make llseek() and ksys_lseek() use it.
Bah. And then we'd still have to worry about any filesystem that allows 'read()' and 'write()' on the directory - which can also update f_pos.
And yes, those exist. See at least 'adfs_dir_ops', and 'ceph_dir_fops'. They may be broken, but people actually did do things like that historically, maybe there's a reason adfs and ceph allow it.
End result: we can forget about fdget_dir(). We'd need to split FMODE_ATOMIC_POS into two instead.
I don't think we have any free flags, but I didn't check. The ugly
We do. Christoph freed up 3 last cycle. I think I mentioned it in one of my PRs. (And btw, we should probably try to get rid of a few more.)
thing to do is to just special-case S_ISDIR. Not lovely, but whatever.
So something like this instead? It's a smaller diff anyway, and it gets the crazy adfs/ceph cases right too.
If you really care about this we can do it. But if we can live with just unconditionally locking then I think we're better off. As I said, I had explicitly requested that the lkp performance CI (that I think Intel? runs for us) be run on this with a focus on single-threaded performance, and so far there was nothing that wasn't noise.
On Fri, Aug 04, 2023 at 03:43:20PM +0200, Christian Brauner wrote:
I don't think we have any free flags, but I didn't check. The ugly
We do. Christoph freed up 3 last cycle. I think I mentioned it in one of my PRs. (And btw, we should probably try to get rid of a few more.)
I have a half-baked series to move all the static capabilities into a new field in struct file_operations. That gives us almost half a dozen back. I wanted to finish it for this merge window, but there are too many conflicting other things going on, so it will have to wait a bit longer.
On Thu, Aug 03, 2023 at 11:35:53AM -0700, Linus Torvalds wrote:
And yes, those exist. See at least 'adfs_dir_ops', and 'ceph_dir_fops'. They may be broken, but people actually did do things like that historically, maybe there's a reason adfs and ceph allow it.
Huh?
adfs is a red herring - there is a method called 'read' in struct adfs_dir_ops, but it has nothing to do with read(2) or anything of that sort; it is *used* by adfs_iterate() (via adfs_read_inode()), but the only file_operations that sucker has is

	const struct file_operations adfs_dir_operations = {
		.read		= generic_read_dir,
		.llseek		= generic_file_llseek,
		.iterate_shared	= adfs_iterate,
		.fsync		= generic_file_fsync,
	};

and generic_read_dir() is "fail with -EISDIR".
ceph one is real, and it's... not nice. They do have a genuine ->read() on directories, which acts as if it had been a regular file with contents generated by sprintf() (at the first read(2)). Format:

	"entries: %20lld\n"
	" files: %20lld\n"
	" subdirs: %20lld\n"
	"rentries: %20lld\n"
	" rfiles: %20lld\n"
	" rsubdirs: %20lld\n"
	"rbytes: %20lld\n"
	"rctime: %10lld.%09ld\n"

and yes, it does go stale. Just close and reopen...
File position is an offset in that string. Shared with position used by getdents()...
It gets better - if you open a directory, then fork and have both child and parent read from it, well...

	if (!dfi->dir_info) {
		dfi->dir_info = kmalloc(bufsize, GFP_KERNEL);
		if (!dfi->dir_info)
			return -ENOMEM;

No locking whatsoever. No privileges needed either...
Frankly, my preference would be to remove that kludge, but short of that we would need at least some exclusion for those allocations and if we do that, we might just as well have ceph_readdir() fail if we'd ever done ceph_read_dir() on the same struct file and vice versa.
They are really not mixable.
As far as I can tell, ceph is the only place in mainline where we have ->iterate_shared along with non-failing ->read. Nothing mixes it with ->read_iter, ->write or ->write_iter.
lseek() is there, of course, and that's enough to kill the "special fdget variant for directory syscalls" kind of approach. But read() and write() on directories are very much not something people legitimately do.