By abusing new methods of file descriptor transfer it is possible to break the assumption that f_pos_lock doesn't need to be acquired when the caller is single-threaded and file_count(file) is one, i.e., when the caller holds the only reference to the file.
The explanation has two parts. The first part is about a new, not-yet-merged io_uring interface where I realized that acquiring the lock only when multiple references to a file are held would be broken in some circumstances. The second part illustrates that this assumption is already broken by abusing another interface.
During review of the io_uring patchset for adding getdents support I pointed out various problems in the locking assumptions that were made. Effectively, the guts of __fdget_pos() were copied out and a trylock scheme for f_pos_lock was introduced.
The approach overall is workable but the locking was wrong because it copied the file_count(file) greater than one optimization from __fdget_pos() into io_uring. But that assumption is broken when fixed files are used.
Short reminder of fixed files here via some C pseudo code:
T1 (still single threaded)

fd_register = open("/some/file")                          -> f_count == 1

fd_fixed = io_uring_register_file(fd_register)
-> io_sqe_files_register(fd_register)
   -> fget(fd_register)                                   -> f_count == 2
   -> io_fixed_file_set(fd_register)

close(fd_register)
-> close_fd(fd_register)
   -> file = pick_file()
   -> filp_close(file)
      -> fput(file)                                       -> f_count == 1
The caller has now traded a regular file descriptor reference for a fixed file reference. Such fixed files never use fget() again and thus fully eliminate any fget()/fput() overhead.
However, for getdents f_pos_lock needs to be acquired as state may often be kept that requires synchronization with seek and other concurrent getdents requests.
Since io_uring is an asynchronous interface it is of course possible to register multiple concurrent getdents calls that aren't synchronized by io_uring, as that's never done unless the user requests specific chaining. And since the reference count for fixed files is always one, it's possible to create multiple racing getdents requests if locking is conditional on file_count(file) being greater than one.
That by itself wouldn't be a problem since io_uring can just take f_pos_lock unconditionally for fixed files, which aren't usable with the regular system call api anyway.
However, while thinking about this I realized that a while ago the file_count(file) greater than one optimization was already broken and that concurrent read/write/getdents/seek calls are possible in the regular system call api.
The pidfd_getfd() system call allows a caller with ptrace_may_access() abilities on another process to steal a file descriptor from this process. This system call is used by debuggers, container runtimes, system call supervisors, networking proxies etc (cf. [1]-[3]). So while it is a special interest system call it is used in common tools.
Via pidfd_getfd() it's possible to intercept syscalls such as connect(2). For example, a container manager might want to rewrite the connect(2) request to something other than the task intended for security reasons or because the task lacks the necessary information about the networking layout to connect to the right endpoint. In these cases pidfd_getfd(2) can be used to retrieve a copy of the file descriptor of the task and perform the connect(2) for it.
When used in combination with the seccomp notifier things are pretty simple. The intercepted task is blocked in the seccomp code, which runs before the requested system call is actually performed. The seccomp notifier then notifies the supervising process which holds the seccomp notifier fd.
The supervisor can now call pidfd_getfd() while the target process is blocked and steal the file descriptor. It can then perform whatever operation it wants on that file descriptor and then tell seccomp to either return a specific return value to the task once it is unblocked or even continue the task using the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag.
One of the most useful properties of pidfd_getfd() is that the target task from which the caller is about to steal a file descriptor doesn't need to be blocked or stopped. But that makes it possible for the target process to be in the middle of a read/write/getdents/seek system call while the caller steals the file on which that system call was issued and issues a concurrent read/write/getdents/seek of its own:
P1                                              P2

getdents(fd)
-> fdget_pos()
   {
        /* @file->f_count == 1 */
        /* @current->files->count == 1 */
        => not taking @file->f_pos_lock
   }
                                                fd = pidfd_getfd(fd)
                                                -> __pidfd_fget()
                                                   -> fget_task()         -> f_count == 2
                                                -> receive_fd()
                                                   -> get_file()          -> f_count == 3
                                                   -> fd_install()
                                                -> fput()                 -> f_count == 2

                                                getdents(fd)
                                                -> fdget_pos()
                                                   {
                                                        /* @file->f_count == 2 */
                                                        => taking @file->f_pos_lock,
                                                           bumping @file->f_count == 3
                                                   }
-> vfs_readdir()                                -> vfs_readdir()
-> fdput()                                      -> fdput()                -> f_count == 2
Although I'm not happy about it, I'm somewhat confident that this analysis is correct. The race is hard to hit in real life, but with some ingenuity it might be possible to make it more reliable. However, I was lazy and introduced a luscious 10 second delay after fdget_pos() succeeded in vfs_read(), triggerable by passing -1 as the buffer size to the read() system call. I then used the very very ugly program in [4] to enforce the order illustrated above. Lastly, I wrote a bpftrace program which attached a kretfunc to __fdget_pos() and a kfunc and a kretfunc to vfs_read(). The bpftrace program would only trigger if the target file with a specific inode number was read:
Attaching 3 probes...
fdget_pos() on file pointer 0xffff8b3500d6a300: name(stealme) | i_ino(1591) | pid(9690)
        FDPUT_POS_UNLOCK(0), count(1) => mutex not acquired
vfs_read() on file pointer 0xffff8b3500d6a300: pid 9690 started reading from name(stealme) | i_ino(1591)
fdget_pos() on file pointer 0xffff8b3500d6a300: name(stealme) | i_ino(1591) | pid(9691)
        FDPUT_POS_UNLOCK(2), count(2) => mutex acquired
vfs_read() on file pointer 0xffff8b3500d6a300: pid 9691 started reading from name(stealme) | i_ino(1591)
vfs_read() on file pointer 0xffff8b3500d6a300: pid 9690 finished reading from name(stealme) | i_ino(1591)
vfs_read() on file pointer 0xffff8b3500d6a300: pid 9691 finished reading from name(stealme) | i_ino(1591)
I thought about various ways to fix this, from introducing some mechanism to refuse pidfd_getfd() if the target process is in the middle of an fdget_pos()-protected system call with file count one, up to just not caring about it at all. But it is pretty nasty to leave this race open and it feels like something that'll bite us later anyway.
Plus, with io_uring having to acquire f_pos_lock unconditionally anyway for getdents support I'm not sure that deviation in locking semantics here is worth keeping. So remove the locking optimization for file count being one. We'll see if this has a significant performance impact.
This survives xfstests and LTP.
Link: https://github.com/rr-debugger/rr [1]
Link: https://github.com/opencontainers/runc [2]
Link: https://brauner.io/2020/07/23/seccomp-notify.html [3]
Link: https://gist.github.com/brauner/599d31048c8adcb3aa0c537b76c89e12 [4]
Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/file.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 7893ea161d77..35c62b54c9d6 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1042,10 +1042,8 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);
 
 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		if (file_count(file) > 1) {
-			v |= FDPUT_POS_UNLOCK;
-			mutex_lock(&file->f_pos_lock);
-		}
+		v |= FDPUT_POS_UNLOCK;
+		mutex_lock(&file->f_pos_lock);
 	}
 	return v;
 }

---
base-commit: 6eaae198076080886b9e7d57f4ae06fa782f90ef
change-id: 20230724-vfs-fdget_pos-c6540e3ed7d9
So this was a case of "too much explanations make the explanation much harder to follow".
I tend to enjoy your pull request explanations, but this one was just *way* too much.
Please try to make the point you are making a bit more salient, so that it's a lot easier to follow.
On Mon, 24 Jul 2023 at 08:01, Christian Brauner <brauner@kernel.org> wrote:
[..] the
file_count(file) greater than one optimization was already broken and that concurrent read/write/getdents/seek calls are possible in the regular system call api.
The pidfd_getfd() system call allows a caller with ptrace_may_access() abilities on another process to steal a file descriptor from this process.
I think the above is all you need to actually explain the problem and boil down the cause of the bug, and it means that the reader doesn't have to wade through a lot of other verbiage to figure it out.
        if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-               if (file_count(file) > 1) {
-                       v |= FDPUT_POS_UNLOCK;
-                       mutex_lock(&file->f_pos_lock);
-               }
+               v |= FDPUT_POS_UNLOCK;
+               mutex_lock(&file->f_pos_lock);
        }
Ho humm. The patch is obviously correct.
At the same time this is actually very annoying, because I played this very issue with the plain /proc/<pid>/fd/<xyz> interface long long ago, where it would just re-use the 'struct file' directly, and it was such a sh*t-show that I know it's much better to actually open a new file descriptor.
I'm not sure that "share the actual 'struct file'" ever was part of a mainline kernel. I remember having it, but it was a "last century" kind of thing.
The /proc interface hack was actually somewhat useful exactly because you'd see the file position change, but it really caused problems.
The fact that pidfd_getfd() re-introduced that garbage and I never realized this just annoys me no end.
And sadly, the man-page makes it very explicit that it's this broken kind of "share the whole file, file offset and all". Damn damn damn.
Is it too late to just fix pidfd_getfd() to duplicate the 'struct file', and act like a new open, and act like /proc/<pid>/fd/<xyz>?
Because honestly, having been there before, I'm pretty convinced that the real bug here is pidfd_getfd.
I wonder if we could make pidfd_getfd() at least duplicate the struct file for directories. Those are the things that absolutely *require* atomic file positions.
Argh.
I wonder if this also screws up our garbage collection logic. It too ends up having some requirements for a reliable file_count().
Linus
On Mon, Jul 24, 2023 at 08:53:32AM -0700, Linus Torvalds wrote:
So this was a case of "too much explanations make the explanation much harder to follow".
I tend to enjoy your pull request explanations, but this one was just *way* too much.
Please try to make the point you are making a bit more salient, so that it's a lot easier to follow.
Sure. The thing is though that I'm not doing this because it's fun to write a lot of text. It's simply because I want to remember how I arrived at that conclusion. I can keep this shorter for sure. The problem is often just what can I assume the reader knows and what they don't.
On Mon, 24 Jul 2023 at 08:01, Christian Brauner <brauner@kernel.org> wrote:
[..] the
file_count(file) greater than one optimization was already broken and that concurrent read/write/getdents/seek calls are possible in the regular system call api.
The pidfd_getfd() system call allows a caller with ptrace_may_access() abilities on another process to steal a file descriptor from this process.
I think the above is all you need to actually explain the problem and boil down the cause of the bug, and it means that the reader doesn't have to wade through a lot of other verbiage to figure it out.
        if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-               if (file_count(file) > 1) {
-                       v |= FDPUT_POS_UNLOCK;
-                       mutex_lock(&file->f_pos_lock);
-               }
+               v |= FDPUT_POS_UNLOCK;
+               mutex_lock(&file->f_pos_lock);
        }
Ho humm. The patch is obviously correct.
At the same time this is actually very annoying, because I played this very issue with the plain /proc/<pid>/fd/<xyz> interface long long ago, where it would just re-use the 'struct file' directly, and it was such a sh*t-show that I know it's much better to actually open a new file descriptor.
I'm not sure that "share actual 'struct file' ever was part of a mainline kernel". I remember having it, but it was a "last century" kind of thing.
The /proc interface hack was actually somewhat useful exactly because you'd see the file position change, but it really caused problems.
The fact that pidfd_getfd() re-introduced that garbage and I never realized this just annoys me no end.
SCM_RIGHTS, which has existed since 2.1 or so, allows you to do the same thing, just cooperatively. If you receive a bunch of fds from another task, including sockets and so on, they refer to the same struct file.
In recent kernels we also have the seccomp notifier addfd ioctl which lets you add a file descriptor into another process and can also be used to create shared struct files.
On Mon, 24 Jul 2023 at 09:19, Christian Brauner <brauner@kernel.org> wrote:
SCM_RIGHTS, which has existed since 2.1 or so, allows you to do the same thing, just cooperatively. If you receive a bunch of fds from another task, including sockets and so on, they refer to the same struct file.
Yes, but it has special synchronization rules, eg big comment in commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK").
There are magic rules with "total_refs == inflight_refs", and that total_refs thing is very much the file count, ie
total_refs = file_count(u->sk.sk_socket->file);
where we had some nasty bugs with files coming back to life.
In recent kernels we also have the seccomp notifier addfd ioctl which lets you add a file descriptor into another process and can also be used to create shared struct files.
Note that shared files aren't a problem per se. The most obvious case is just 'dup()'. It's been around forever.
What's a problem is a file that *isn't* shared magically becoming shared without the process being aware of it.
Both SCM file passing and the seccomp addfd cases are synchronous wrt the source file descriptor and don't add any references from "outside".
Linus
On Mon, 24 Jul 2023 at 09:36, Linus Torvalds <torvalds@linux-foundation.org> wrote:
There are magic rules with "total_refs == inflight_refs", and that total_refs thing is very much the file count, ie
total_refs = file_count(u->sk.sk_socket->file);
where we had some nasty bugs with files coming back to life.
Ok, I don't think this is an issue here. It really is that "only in-flight refs remaining" that is a special case, and even pidfd_getfd() shouldn't be able to change that.
But the magic code is all in fget_task(), and those need to be checked.
You can see how proc does things properly: it does do "fget_task()", but then it only uses it to copy the path part, and just does fput() afterwards.
The bpf code does something like that too, and seems ok (ie it gets the file in order to copy data from it, not to install it).
kcmp_epoll_target() -> get_epoll_tfile_raw_ptr() looks a bit scary, but seems to use the thing only for polling, so I guess any f_pos is irrelevant.
Linus
On Mon, Jul 24, 2023 at 09:51:05AM -0700, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 09:36, Linus Torvalds <torvalds@linux-foundation.org> wrote:
There are magic rules with "total_refs == inflight_refs", and that total_refs thing is very much the file count, ie
total_refs = file_count(u->sk.sk_socket->file);
where we had some nasty bugs with files coming back to life.
Ok, I don't think this is an issue here. It really is that "only in-flight refs remaining" that is a special case, and even pidfd_getfd() shouldn't be able to change that.
But the magic code is all in fget_task(), and those need to be checked.
You can see how proc does things properly: it does do "fget_task()", but then it only uses it to copy the path part, and just does fput() afterwards.
The bpf code does something like that too, and seems ok (ie it gets the file in order to copy data from it, not to install it).
Aside of fget_task() use, it has this:

	rcu_read_lock();
	for (;; curr_fd++) {
		struct file *f;

		f = task_lookup_next_fd_rcu(curr_task, &curr_fd);
		if (!f)
			break;
		if (!get_file_rcu(f))
			continue;

		/* set info->fd */
		info->fd = curr_fd;
		rcu_read_unlock();
		return f;
	}
curr_task is not cached current here - it can be an arbitrary thread. And what we do to the file reference we get here is
	ctx.meta = &meta;
	ctx.task = info->task;
	ctx.fd = info->fd;
	ctx.file = file;
	return bpf_iter_run_prog(prog, &ctx);
I think it can't be used to shove it into any descriptor table, but then there's forming an SCM_RIGHTS datagram and sending it, etc. - it might be worth looking into.
On Mon, Jul 24, 2023 at 09:36:28AM -0700, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 09:19, Christian Brauner <brauner@kernel.org> wrote:
SCM_RIGHTS which have existed since 2.1 or sm allow you to do the same thing just cooperatively. If you receive a bunch of fds from another task including sockets and so on they refer to the same struct file.
Yes, but it has special synchronization rules, eg big comment in commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK").
There are magic rules with "total_refs == inflight_refs", and that total_refs thing is very much the file count, ie
total_refs = file_count(u->sk.sk_socket->file);
where we had some nasty bugs with files coming back to life.
It's a bit of a shame that this isn't documented anywhere as that's pretty sensitive stuff and it seems a high bar to assume that everyone has that historical knowledge. :/ So we should definitely document this in Documentation/ going forward.
This means pidfd_getfd() needs the same treatment as MSG_PEEK for sockets.
On Mon, 24 Jul 2023 at 10:23, Christian Brauner <brauner@kernel.org> wrote:
This means pidfd_getfd() needs the same treatment as MSG_PEEK for sockets.
So the reason I think pidfd_getfd() is ok is that it has serialized with the source of the file descriptor and uses fget_task() -> __fget_files.
And that code is nasty and complicated, but it does get_file_rcu() to increment the file count, and then *after* that it still checks that yes, the file pointer is still there.
And that means that anybody who uses fget_task() will only ever get a ref to a file if that file still existed in the source, and you can never have a situation where a file comes back to life.
The reason MSG_PEEK is special is exactly because it can "resurrect" a file that was closed, and added to the unix SCM garbage collection list as "only has a ref in the SCM thing", so when we then make it live again, it needs that very very subtle thing.
So pidfd_getfd() is ok in this regard.
But it *is* an example of how subtle it is to just get a new ref to an existing file.
That whole
if (atomic_read_acquire(&files->count) == 1) {
in __fget_light() is also worth being aware of. It isn't about the file count, but it is about "I have exclusive access to the file table". So you can *not* close a file, or open a file, for another process from outside. The only thread that is allowed to access or change the file table (including resizing it), is the thread itself.
I really hope we don't have cases of that.
Linus
On Mon, Jul 24, 2023 at 10:34:31AM -0700, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 10:23, Christian Brauner <brauner@kernel.org> wrote:
This means pidfd_getfd() needs the same treatment as MSG_PEEK for sockets.
So the reason I think pidfd_getfd() is ok is that it has serialized with the source of the file descriptor and uses fget_task() -> __fget_files.
And that code is nasty and complicated, but it does get_file_rcu() to increment the file count, and then *after* that it still checks that yes, the file pointer is still there.
And that means that anybody who uses fget_task() will only ever get a ref to a file if that file still existed in the source, and you can never have a situation where a file comes back to life.
The reason MSG_PEEK is special is exactly because it can "resurrect" a file that was closed, and added to the unix SCM garbage collection list as "only has a ref in the SCM thing", so when we then make it live again, it needs that very very subtle thing.
Oh, eew.
So pidfd_getfd() is ok in this regard.
But it *is* an example of how subtle it is to just get a new ref to an existing file.
That whole
if (atomic_read_acquire(&files->count) == 1) {
in __fget_light() is also worth being aware of. It isn't about the file count, but it is about "I have exclusive access to the file table". So you can *not* close a file, or open a file, for another process from outside. The only thread that is allowed to access or change the file table (including resizing it), is the thread itself.
Yeah, I'm well aware of that, because an earlier version of the getdents patchset would've run into exactly that problem: it did async offload on the file while the process that registered that async offload would've been free to create another thread and issue additional requests.
I really hope we don't have cases of that.
I don't think we do but it's something to keep in mind with async io interfaces where the caller is free to create other threads after having registered a request. Depending on how file references are done things can get tricky easily.
I'm not complaining here but it is worth noting that additions to e.g., io_uring now have to reason through two reference counting mechanisms: the regular system call rules and the fixed file io_uring rules. If conditional locking is involved in one part it might have effects/consequences in the other part as one can mix regular system call requests and fixed file requests on the same struct file.
On Mon, 24 Jul 2023 at 10:46, Christian Brauner <brauner@kernel.org> wrote:
I don't think we do but it's something to keep in mind with async io interfaces where the caller is free to create other threads after having registered a request. Depending on how file references are done things can get tricky easily.
Honestly, by now, the io_uring code had *better* understand that it needs to act exactly like a user thread.
Anything else is simply not acceptable. io_uring has been a huge pain, and the only thing that salvaged the horror was that the io_uring async code should now *always* be done as a real thread.
If io_uring does something from truly async context (ie interrupts, not io_uring threads), then io_uring had better be very *very* careful.
And any kind of "from kthread context" is not acceptable. We've been there, done that, and have the battle scars. Never again.
So the absolutely *only* acceptable context is "I'm a real io_uring thread that looks exactly like a user thread in every which way, except I never return to user space".
And if io_uring does absolutely _anything_ to file descriptors from any other context, it needs to be fixed *NOW*.
No more games.
And absolutely *nothing* like that disgusting pidfd_getfd().
So the only reason io_uring can do open/close/etc is because from a file descriptor standpoint it looks just like any other threaded application would look, and all those optimizations like
if (atomic_read_acquire(&files->count) == 1) {
JustWork(tm).
I think that's the case right now, and it's all good.
But if I'm unaware of some situation where io_uring does things like this in some workqueue or other context, people had better holler - and fix it.
Linus
On 7/24/23 12:01 PM, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 10:46, Christian Brauner <brauner@kernel.org> wrote:
I don't think we do but it's something to keep in mind with async io interfaces where the caller is free to create other threads after having registered a request. Depending on how file references are done things can get tricky easily.
Honestly, by now, the io_uring code had *better* understand that it needs to act exactly like a user thread.
Anything else is simply not acceptable. io_uring has been a huge pain, and the only thing that salvaged the horror was that the io_uring async code should now *always* be done as a real thread.
If io_uring does something from truly async context (ie interrupts, not io_uring threads), then io_uring had better be very *very* careful.
And any kind of "from kthread context" is not acceptable. We've been there, done that, and have the battle scars. Never again.
So the absolutely *only* acceptable context is "I'm a real io_uring thread that looks exactly like a user thread in every which way, except I never return to user space".
And if io_uring does absolutely _anything_ to file descriptors from any other context, it needs to be fixed *NOW*.
io_uring never does that from anything that isn't the original user space creator task, or from the io-wq workers that it may create. Those are _always_ normal threads. There's no workqueue/kthread usage for IO or file getting/putting/installing/removing etc.
On Mon, 24 Jul 2023 at 11:05, Jens Axboe <axboe@kernel.dk> wrote:
io_uring never does that from anything that isn't the original user space creator task, or from the io-wq workers that it may create. Those are _always_ normal threads. There's no workqueue/kthread usage for IO or file getting/putting/installing/removing etc.
That's what I thought. But Christian's comments about the io_uring getdents work made me worry.
If io_uring does everything right, then the old "file_count(file) > 1" test in __fdget_pos() - now sadly removed - should work just fine for any io_uring directory handling.
It may not be obvious when you look at just that test in a vacuum, but it happens after __fdget_pos() has done the
	unsigned long v = __fdget(fd);
	struct file *file = (struct file *)(v & ~3);
and if it's a threaded app - where io_uring threads count the same - then the __fdget() in that sequence will have incremented the file count.
So when it then used to do that
if (file_count(file) > 1) {
that "> 1" included not just all the references from dup() etc, but for any threaded use where we have multiple references to the file table it also included the reference we get from __fdget() -> __fget() -> __fget_files() -> __fget_files_rcu() -> get_file_rcu() -> atomic_long_inc_not_zero(&(x)->f_count).
And yes, that's a fairly long chain, through some subtle code. Doing "fdget()" (and variations) is not trivial.
Of course, with directory handling, one of the things people have wanted is to have a "pread()" like model, where you have a position that is given externally and not tied to the 'struct file'.
We don't have such a thing, and I don't think we can have one, because I think filesystems also end up having possibly other private data in the 'struct file' (trivially: the directory cursor for dcache_dir_open()).
And other filesystems have it per-inode, which is why we have that "iterate_shared" vs "iterate" horror.
So any dentry thing that wants to do things in parallel needs to open their own private 'file' just for *that* reason. And even then they will fail if the filesystem doesn't have that shared directory entry iterator.
Linus
On Mon, Jul 24, 2023 at 11:27:29AM -0700, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 11:05, Jens Axboe <axboe@kernel.dk> wrote:
io_uring never does that isn't the original user space creator task, or from the io-wq workers that it may create. Those are _always_ normal threads. There's no workqueue/kthread usage for IO or file getting/putting/installing/removing etc.
That's what I thought. But Christian's comments about the io_uring getdents work made me worry.
Sorry, my point was that an earlier version of the patchset _would_ have introduced a locking scheme that would've violated locking assumptions because it didn't get the subtle refcount difference right. But that was caught during review.
It's really just keeping in mind that refcount rules change depending on whether fds or fixed files are used.
If io_uring does everything right, then the old "file_count(file) > 1" test in __fdget_pos() - now sadly removed - should work just fine for any io_uring directory handling.
Yes, but only if you disallow getdents with fixed files when the reference count rules are different.
It may not be obvious when you look at just that test in a vacuum, but it happens after __fdget_pos() has done the
	unsigned long v = __fdget(fd);
	struct file *file = (struct file *)(v & ~3);
and if it's a threaded app - where io_uring threads count the same - then the __fdget() in that sequence will have incremented the file count.
So when it then used to do that
if (file_count(file) > 1) {
that "> 1" included not just all the references from dup() etc, but for any threaded use where we have multiple references to the file table it also included the reference we get from __fdget() -> __fget() -> __fget_files() -> __fget_files_rcu() -> get_file_rcu() -> atomic_long_inc_not_zero(&(x)->f_count).
Yes, for the regular system call path it's all well and dandy and I said as much on the getdents thread as well.
On Mon, 24 Jul 2023 at 11:48, Christian Brauner <brauner@kernel.org> wrote:
It's really just keeping in mind that refcount rules change depending on whether fds or fixed files are used.
This sentence still worries me.
Those fixed files had better have their own refcounts from being fixed. So the rules really shouldn't change in any way what-so-ever. So what exactly are you alluding to?
Linus
On 7/24/23 4:25 PM, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 11:48, Christian Brauner <brauner@kernel.org> wrote:
It's really just keeping in mind that refcount rules change depending on whether fds or fixed files are used.
This sentence still worries me.
Those fixed files had better have their own refcounts from being fixed. So the rules really shouldn't change in any way what-so-ever. So what exactly are you alluding to?
They do, but they only have a single reference, which is what fixes them into the io_uring file table as fixed files. With the patch from the top of this thread, that should then be fine as we don't need to artificially elevate the ref count more than that.
On Mon, 24 Jul 2023 at 15:57, Jens Axboe <axboe@kernel.dk> wrote:
On 7/24/23 4:25 PM, Linus Torvalds wrote:
This sentence still worries me.
Those fixed files had better have their own refcounts from being fixed. So the rules really shouldn't change in any way what-so-ever. So what exactly are you alluding to?
They do, but they only have a single reference, which is what fixes them into the io_uring file table as fixed files. With the patch from the top of this thread, that should then be fine as we don't need to artificially elevate the ref count more than that.
No.
The patch from the top of this thread cannot *possibly* matter for a io_uring fixed file.
The fdget_pos() always gets the file pointer from the file table. But that means that it is guaranteed to have a refcount of at least one.
If io_uring fixed file holds a reference (and not holding a reference would be a huge bug), that in turn means that the minimum refcount is now two.
So the code in fdget_pos() is correct, with or without the patch.
The *only* problem is when something actually violates the refcounting rules. Sadly, that's exactly what pidfd_getfd() does, and can basically make a private file pointer be non-private without synchronizing with the original owner of the fd.
Now, io_uring may have had its own problems, if it tried to re-implement some io_uring-specific version of fdget_pos() for the fixed file case, and thought that it could use the file_count() == 1 trick when it *wasn't* also a file table entry.
But that would be an independent bug from copy-and-pasting code without taking the surrounding rules into account.
Linus
On 7/25/23 12:30 PM, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 15:57, Jens Axboe <axboe@kernel.dk> wrote:
On 7/24/23 4:25 PM, Linus Torvalds wrote:
This sentence still worries me.
Those fixed files had better have their own refcounts from being fixed. So the rules really shouldn't change in any way what-so-ever. So what exactly are you alluding to?
They do, but they only have a single reference, which is what fixes them into the io_uring file table for fixed files. With the patch from the top of this thread, that should then be fine as we don't need to artificially elevate the ref count more than that.
No.
The patch from the top of this thread cannot *possibly* matter for an io_uring fixed file.
The fdget_pos() always gets the file pointer from the file table. But that means that it is guaranteed to have a refcount of at least one.
If an io_uring fixed file holds a reference (and not holding a reference would be a huge bug), that in turn means that the minimum refcount is now two.
Right, but what if the original app closes the file descriptor? Now you have the io_uring file table still holding a reference to it, but it'd just be 1. Which is enough to keep it alive, but you can still have multiple IOs inflight against this file.
Obviously using the file position is wonky with async IO to begin with, exactly because you can have multiple IOs in flight to it at the same time. You can make it work by specifying ordering constraints, but that is obviously also totally messy and not really a valid use case. Just don't use the file position at that point.
Some libraries are limited though and want to use file positions with async IO, and they generally get to keep both pieces if they do that without treating it as sync IO (or at least serialized IO) at that point.
So the code in fdget_pos() is correct, with or without the patch.
The *only* problem is when something actually violates the refcounting rules. Sadly, that's exactly what pidfd_getfd() does, and can basically make a private file pointer be non-private without synchronizing with the original owner of the fd.
Now, io_uring may have had its own problems, if it tried to re-implement some io_uring-specific version of fdget_pos() for the fixed file case, and thought that it could use the file_count() == 1 trick when it *wasn't* also a file table entry.
But that would be an independent bug from copy-and-pasting code without taking the surrounding rules into account.
We never made any assumptions about the file_count() of the file, exactly because that count means nothing to io_uring in terms of whether we can have concurrent IO against it or not.
On Tue, 25 Jul 2023 at 13:41, Jens Axboe axboe@kernel.dk wrote:
Right, but what if the original app closes the file descriptor? Now you have the io_uring file table still holding a reference to it, but it'd just be 1. Which is enough to keep it alive, but you can still have multiple IOs inflight against this file.
Note that fdget_pos() fundamentally only works on file descriptors that are there - it's literally looking them up in the file table as it goes along. And it looks at the count of the file description as it is looked up. So that refcount is guaranteed to exist.
If the file has been closed, fdget_pos() will just fail because it doesn't find it.
And if it's then closed *afterwards*, that's fine and doesn't affect anything, because the locking has been done and we saved away the status bit as FDPUT_POS_UNLOCK, so the code knows to unlock even if the file descriptor in the meantime has turned back to having just a single refcount.
Linus
On 7/25/23 2:51 PM, Linus Torvalds wrote:
On Tue, 25 Jul 2023 at 13:41, Jens Axboe axboe@kernel.dk wrote:
Right, but what if the original app closes the file descriptor? Now you have the io_uring file table still holding a reference to it, but it'd just be 1. Which is enough to keep it alive, but you can still have multiple IOs inflight against this file.
Note that fdget_pos() fundamentally only works on file descriptors that are there - it's literally looking them up in the file table as it goes along. And it looks at the count of the file description as it is looked up. So that refcount is guaranteed to exist.
If the file has been closed, fdget_pos() will just fail because it doesn't find it.
And if it's then closed *afterwards*, that's fine and doesn't affect anything, because the locking has been done and we saved away the status bit as FDPUT_POS_UNLOCK, so the code knows to unlock even if the file descriptor in the meantime has turned back to having just a single refcount.
Ah yes good point. If registered with io_uring and closed, then it won't matter how many extra references we hold or grab in the future, it won't be found in the file table. I guess that's obvious...
On Tue, Jul 25, 2023 at 01:51:37PM -0700, Linus Torvalds wrote:
On Tue, 25 Jul 2023 at 13:41, Jens Axboe axboe@kernel.dk wrote:
Right, but what if the original app closes the file descriptor? Now you have the io_uring file table still holding a reference to it, but it'd just be 1. Which is enough to keep it alive, but you can still have multiple IOs inflight against this file.
Note that fdget_pos() fundamentally only works on file descriptors that are there - it's literally looking them up in the file table as it goes along. And it looks at the count of the file description as it is looked up. So that refcount is guaranteed to exist.
If the file has been closed, fdget_pos() will just fail because it doesn't find it.
And if it's then closed *afterwards*, that's fine and doesn't affect anything, because the locking has been done and we saved away the status bit as FDPUT_POS_UNLOCK, so the code knows to unlock even if the file descriptor in the meantime has turned back to having just a single refcount.
Yes, and to summarize what I tried to convey in my commit description: the getdents support patchset would have introduced a bug because it copied the fdget_pos() file_count(file) > 1 optimization into io_uring.
That works fine as long as the original file descriptor used to register the fixed file is kept open. The locking will work correctly as file_count(file) > 1, and no races are possible, whether via getdents calls using the original file descriptor, via io_uring using the fixed file, or even mixing both.
But as soon as the original file descriptor is closed, the f_count of the file drops back to 1, while the file remains usable from io_uring via the fixed file. Now the optimization the patchset wanted to copy over would cause bugs, as multiple racing getdents requests would be possible through the fixed file.
The simple fix, of course, is that io_uring just always grabs the position lock and never does this optimization. The patch authors were simply unaware that fixed files would not work with it.
I caught that during review but then immediately realized that the file_count(file) > 1 optimization was unfortunately broken in another way, which ultimately led to this commit.
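The straightforward shape of that fix can be sketched as follows (io_getdents() and vfs_getdents() are hypothetical names; the getdents patchset is unmerged, so this is purely illustrative):

```c
/* Purely illustrative sketch -- the point is that io_uring must take
 * f_pos_lock unconditionally, because a fixed file can sit at
 * file_count() == 1 while multiple getdents requests race against it. */
static int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
{
	struct file *file = req->file;
	int ret;

	/* No file_count(file) > 1 shortcut here. */
	mutex_lock(&file->f_pos_lock);
	ret = vfs_getdents(file, req->dirent_buf, req->dirent_len);
	mutex_unlock(&file->f_pos_lock);
	return ret;
}
```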
From: Christian Brauner
Sent: 26 July 2023 09:37
...
Yes, and to summarize what I tried to convey in my commit description: the getdents support patchset would have introduced a bug because it copied the fdget_pos() file_count(file) > 1 optimization into io_uring.
That works fine as long as the original file descriptor used to register the fixed file is kept open. The locking will work correctly as file_count(file) > 1, and no races are possible, whether via getdents calls using the original file descriptor, via io_uring using the fixed file, or even mixing both.
But as soon as the original file descriptor is closed, the f_count of the file drops back to 1, while the file remains usable from io_uring via the fixed file. Now the optimization the patchset wanted to copy over would cause bugs, as multiple racing getdents requests would be possible through the fixed file.
Could the io_uring code grab two references? That would stop the optimisation without affecting any normal code paths?
David
On Wed, Jul 26, 2023 at 10:31:07AM +0000, David Laight wrote:
From: Christian Brauner
Sent: 26 July 2023 09:37
...
Yes, and to summarize what I tried to convey in my commit description: the getdents support patchset would have introduced a bug because it copied the fdget_pos() file_count(file) > 1 optimization into io_uring.
That works fine as long as the original file descriptor used to register the fixed file is kept open. The locking will work correctly as file_count(file) > 1, and no races are possible, whether via getdents calls using the original file descriptor, via io_uring using the fixed file, or even mixing both.
But as soon as the original file descriptor is closed, the f_count of the file drops back to 1, while the file remains usable from io_uring via the fixed file. Now the optimization the patchset wanted to copy over would cause bugs, as multiple racing getdents requests would be possible through the fixed file.
Could the io_uring code grab two references? That would stop the optimisation without affecting any normal code paths?
io_uring doesn't use the fdget variants and can't for its purposes, as fdget is for short-term references while io_uring holds on to the file. This whole thing was about the logic that was copied in a patchset, not the actual helper itself. I thought that was clear from "copied the [...] optimization into io_uring". It should just not do that.
On Tue, Jul 25, 2023 at 11:30:32AM -0700, Linus Torvalds wrote:
On Mon, 24 Jul 2023 at 15:57, Jens Axboe axboe@kernel.dk wrote:
On 7/24/23 4:25 PM, Linus Torvalds wrote:
This sentence still worries me.
Those fixed files had better have their own refcounts from being fixed. So the rules really shouldn't change in any way what-so-ever. So what exactly are you alluding to?
They do, but they only have a single reference, which is what fixes them into the io_uring file table for fixed files. With the patch from the top of this thread, that should then be fine as we don't need to artificially elevate the ref count more than that.
No.
The patch from the top of this thread cannot *possibly* matter for an io_uring fixed file.
Yeah, the patch doesn't matter for fixed files. But they copied the logic from fdget_pos(), which was the problem.
The fdget_pos() always gets the file pointer from the file table. But that means that it is guaranteed to have a refcount of at least one.
If an io_uring fixed file holds a reference (and not holding a reference would be a huge bug), that in turn means that the minimum refcount is now two.
So the code in fdget_pos() is correct, with or without the patch.
Yes.
The *only* problem is when something actually violates the refcounting rules. Sadly, that's exactly what pidfd_getfd() does, and can basically make a private file pointer be non-private without synchronizing with the original owner of the fd.
Now, io_uring may have had its own problems, if it tried to re-implement some io_uring-specific version of fdget_pos() for the fixed file case, and thought that it could use the file_count() == 1 trick when it *wasn't* also a file table entry.
Yes, that's what one of the patch versions did, what I pointed out won't work, and what ultimately led me to discover the pidfd_getfd() bug. (That's also, btw, why my explanation was so long.)
But that would be an independent bug from copy-and-pasting code without taking the surrounding rules into account.
Yes, exactly what I said in the review.
On Mon, Jul 24, 2023 at 08:53:32AM -0700, Linus Torvalds wrote:
Is it too late to just fix pidfd_getfd() to duplicate the 'struct file', and act like a new open, and act like /proc/<pid>/fd/<xyz>?
So thinking a little about it, I think that doesn't work. /proc/<pid>/fd/<xyz> does a reopen, and for good reasons. The original open will have gone through the module's/subsystem's ->open() method, which might stash additional refcounted data in, e.g., file->private_data. If we simply copy that file or something, then we risk UAFs. If we don't skip ->open() though and effectively emulate /proc/<pid>/fd/<xyz> completely, then we break any such use-cases where a socket or something else is shared, as they cannot be reopened. So the module/subsystem really needs to be informed that a new struct file is created and not simply refd.
On Mon, 24 Jul 2023 at 09:46, Christian Brauner brauner@kernel.org wrote:
So thinking a little about it, I think that doesn't work. /proc/<pid>/fd/<xyz> does a reopen, and for good reasons. The original open will have gone through the module's/subsystem's ->open() method, which might stash additional refcounted data in, e.g., file->private_data. If we simply copy that file or something, then we risk UAFs.
Oh, absolutely, we'd absolutely need to do all the re-open things.
That said, we could limit it to FMODE_ATOMIC_POS - so just regular files and directories. The only thing that sets that bit is do_dentry_open().
And honestly, the only thing that *really* cares is directories, because they generally have special rules for pos changing.
The regular files have the "POSIX rules" reason, but hey, if you use pidfd_getfd() and mess with the pos behind the back of the process, it's no different from using a debugger to change it, so the POSIX rules issue just isn't relevant.
I really hate making the traditional unix single-threaded file descriptor case take that lock.
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
Linus
On Mon, 24 Jul 2023 at 09:59, Linus Torvalds torvalds@linux-foundation.org wrote:
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
I've applied the patch, with a much pruned commit message, and just a link to your original email instead.
Linus
On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
I really hate making the traditional unix single-threaded file descriptor case take that lock.
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
So I got curious what the impact is and checked on quite a modern CPU (Sapphire Rapids), so nobody can claim it's some old yeller and atomics are nowhere near as expensive on modern uarchs.
I used read1_processes from will-it-scale -- it is doing 4KB reads at a time over a 1MB file and dodges refing the file, but it does not dodge the lock with the patch at hand.
In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
The kernel was patched with a toggle to force or elide the proposed mandatory locking, like so:

@@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);
 
 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		v |= FDPUT_POS_UNLOCK;
-		mutex_lock(&file->f_pos_lock);
+		if (file_count(file) > 1 || fdget_pos_mutex) {
+			v |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
 	}
 	return v;
 }
I got rather unstable single-threaded perf, going up and down several % between runs, I don't know yet what's that about. But toggling back and forth while the bench was running consistently gave aforementioned ~5% difference.
perf top claims:

[snip]
32.64% [kernel]         [k] copyout
10.83% [kernel]         [k] entry_SYSCALL_64
 6.69% libc.so.6        [.] read
 6.16% [kernel]         [k] filemap_get_read_batch
 5.62% [kernel]         [k] filemap_read
 3.39% [kernel]         [k] __fget_light
 2.92% [kernel]         [k] mutex_lock    <-- oh
 2.70% [kernel]         [k] mutex_unlock  <-- no
 2.33% [kernel]         [k] vfs_read
 2.18% [kernel]         [k] _copy_to_iter
 1.93% [kernel]         [k] atime_needs_update
 1.74% [kernel]         [k] __fsnotify_parent
 1.29% read1_processes  [.] testcase
[/snip]
[note running perf along with the bench changes throughput to some extent]
So yes, atomics remain expensive on x86-64 even on a very modern uarch and their impact is measurable in a syscall like read.
Consequently eliding this mutex trip would be most welcome.
Happy to rant,
On Thu, Aug 03, 2023 at 11:53:11AM +0200, Mateusz Guzik wrote:
On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
I really hate making the traditional unix single-threaded file descriptor case take that lock.
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
So I got curious what the impact is and checked on quite a modern CPU (Sapphire Rapids), so nobody can claim it's some old yeller and atomics are nowhere near as expensive on modern uarchs.
I used read1_processes from will-it-scale -- it is doing 4KB reads at a time over a 1MB file and dodges refing the file, but it does not dodge the lock with the patch at hand.
In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
The kernel was patched with a toggle to force or elide the proposed mandatory locking, like so:

@@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);
 
 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		v |= FDPUT_POS_UNLOCK;
-		mutex_lock(&file->f_pos_lock);
+		if (file_count(file) > 1 || fdget_pos_mutex) {
+			v |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
 	}
 	return v;
 }
I got rather unstable single-threaded perf, going up and down several % between runs, I don't know yet what's that about. But toggling back and forth while the bench was running consistently gave aforementioned ~5% difference.
We've had the whole lkp intel performance testsuite run on this for a long time to see whether there are any performance regressions that aren't in the noise. This includes all of will-it-scale. Also with a focus on single-thread performance. So I'm a little skeptical about the reliability of manual performance runs in comparison.
If Linus thinks it matters then I think we should only take the f_pos_lock unconditionally on directory fds as this is where it really matters. For read/write we're only doing it because of posix anyway and using pidfd_getfd() is squarely out of posix anyway. So we can document that using pidfd_getfd() on a regular file descriptor outside of a seccomp stopped task is not recommended.
On 8/3/23, Christian Brauner brauner@kernel.org wrote:
On Thu, Aug 03, 2023 at 11:53:11AM +0200, Mateusz Guzik wrote:
On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
I really hate making the traditional unix single-threaded file descriptor case take that lock.
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
So I got curious what the impact is and checked on quite a modern CPU (Sapphire Rapids), so nobody can claim it's some old yeller and atomics are nowhere near as expensive on modern uarchs.
I used read1_processes from will-it-scale -- it is doing 4KB reads at a time over a 1MB file and dodges refing the file, but it does not dodge the lock with the patch at hand.
In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
The kernel was patched with a toggle to force or elide the proposed mandatory locking, like so:

@@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);
 
 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		v |= FDPUT_POS_UNLOCK;
-		mutex_lock(&file->f_pos_lock);
+		if (file_count(file) > 1 || fdget_pos_mutex) {
+			v |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
 	}
 	return v;
 }
I got rather unstable single-threaded perf, going up and down several % between runs, I don't know yet what's that about. But toggling back and forth while the bench was running consistently gave aforementioned ~5% difference.
We've had the whole lkp intel performance testsuite run on this for a long time to see whether there are any performance regressions that aren't in the noise. This includes all of will-it-scale. Also with a focus on single-thread performance. So I'm a little skeptical about the reliability of manual performance runs in comparison.
I did not merely post numbers claiming a difference, I justified it with profile output, prominently showing the mutex lock/unlock pair. I genuinely think this puts the ball in your court.
One potential factor is mere CPU(s) -- maybe whatever was used in your bench is older than the CPU I used and for example is more negatively affected by mitigations like retpoline et al. I intentionally used something very high end, specifically: Intel(R) Xeon(R) Platinum 8470N
Another potential factor is discrepancies in kernel config.
I found one here: https://lore.kernel.org/oe-lkp/20230802103152.ae3s43z6yjkhnkee@quack3/T/#t , I'm guessing it is the same as the one used in Intel tests. In there I see: CONFIG_RANDOMIZE_KSTACK_OFFSET=y
This is implemented with rdtsc, which comes with a massive premium to syscall cost. Also note it does not *fix* CPU bugs or whatever, it merely is a hardening measure to shuffle the stack pointer (way overpriced btw). Which is a long way of saying I sanity-checked syscall perf on my box with getppid and had to disable this option (AFAIR it bumped the rate from about 11 mln to 15 mln ops/s) and that it is fine to do so.
I could easily see how more things like this combined with older CPUs would make the mutex trip small enough to not worry about, in that setting. I don't think this is the same as the cost of the change being negligible in general. I can't stress enough my config retains retpoline and other mitigations for actual bugs, so it's not like I cheated the system.
I'm attaching my config for reference. It is what's present in Debian + a bunch of stuff disabled.
If Linus thinks it matters then I think we should only take the f_pos_lock unconditionally on directory fds as this is where it really matters. For read/write we're only doing it because of posix anyway and using pidfd_getfd() is squarely out of posix anyway. So we can document that using pidfd_getfd() on a regular file descriptor outside of a seccomp stopped task is not recommended.
In stock kernel the code is:

	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
		if (file_count(file) > 1) {
			v |= FDPUT_POS_UNLOCK;
			mutex_lock(&file->f_pos_lock);
		}
	}
after pidfd_getfd succeeds the count is > 1, so the thing to worry about is racing with the thread calling fdget_pos or making sure it stopped using the fd, but only the first time around. Perhaps there is a way to wait it out (as in guarantee it is not using the fd *or* already had seen the bumped refcount) which would be perfectly acceptable for pidfd consumers? Note the thread may be stuck in an uninterruptible, indefinite sleep somewhere while using it without the lock. On the other hand if such a thing can happen with f_pos_lock mutex held, the pidfd consumer running into the same codepath with the lock would be in the same spot, so perhaps this is not so bad?
Note this is definitely hackable just for fun without pessimizing fdget_pos much, but off hand I don't have anything reasonable for real-world use. I'm going to sleep on it.
of course failed to attach, remedied in this email :)
On 8/3/23, Mateusz Guzik mjguzik@gmail.com wrote:
On 8/3/23, Christian Brauner brauner@kernel.org wrote:
On Thu, Aug 03, 2023 at 11:53:11AM +0200, Mateusz Guzik wrote:
On Mon, Jul 24, 2023 at 09:59:15AM -0700, Linus Torvalds wrote:
I really hate making the traditional unix single-threaded file descriptor case take that lock.
Maybe it doesn't matter. Obviously it can't have contention, and your patch in that sense is pretty benign.
But locking is just fundamentally expensive in the first place, and it annoys me that I never realized that pidfd_getfd() did that thing that I knew was broken for /proc.
So I got curious what the impact is and checked on quite a modern CPU (Sapphire Rapids), so nobody can claim it's some old yeller and atomics are nowhere near as expensive on modern uarchs.
I used read1_processes from will-it-scale -- it is doing 4KB reads at a time over a 1MB file and dodges refing the file, but it does not dodge the lock with the patch at hand.
In short, I got a drop of about 5% (~5778843 -> ~5500871 ops/s).
The kernel was patched with a toggle to force or elide the proposed mandatory locking, like so:

@@ -1042,8 +1044,10 @@ unsigned long __fdget_pos(unsigned int fd)
 	struct file *file = (struct file *)(v & ~3);
 
 	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
-		v |= FDPUT_POS_UNLOCK;
-		mutex_lock(&file->f_pos_lock);
+		if (file_count(file) > 1 || fdget_pos_mutex) {
+			v |= FDPUT_POS_UNLOCK;
+			mutex_lock(&file->f_pos_lock);
+		}
 	}
 	return v;
 }
I got rather unstable single-threaded perf, going up and down several % between runs, I don't know yet what's that about. But toggling back and forth while the bench was running consistently gave aforementioned ~5% difference.
We've had the whole lkp intel performance testsuite run on this for a long time to see whether there are any performance regressions that aren't in the noise. This includes all of will-it-scale. Also with a focus on single-thread performance. So I'm a little skeptical about the reliability of manual performance runs in comparison.
I did not merely post numbers claiming a difference, I justified it with profile output, prominently showing the mutex lock/unlock pair. I genuinely think this puts the ball in your court.
One potential factor is mere CPU(s) -- maybe whatever was used in your bench is older than the CPU I used and for example is more negatively affected by mitigations like retpoline et al. I intentionally used something very high end, specifically: Intel(R) Xeon(R) Platinum 8470N
Another potential factor is discrepancies in kernel config.
I found one here: https://lore.kernel.org/oe-lkp/20230802103152.ae3s43z6yjkhnkee@quack3/T/#t , I'm guessing it is the same as the one used in Intel tests. In there I see: CONFIG_RANDOMIZE_KSTACK_OFFSET=y
This is implemented with rdtsc, which comes with a massive premium to syscall cost. Also note it does not *fix* CPU bugs or whatever, it merely is a hardening measure to shuffle the stack pointer (way overpriced btw). Which is a long way of saying I sanity-checked syscall perf on my box with getppid and had to disable this option (AFAIR it bumped the rate from about 11 mln to 15 mln ops/s) and that it is fine to do so.
I could easily see how more things like this combined with older CPUs would make the mutex trip small enough to not worry about, in that setting. I don't think this is the same as the cost of the change being negligible in general. I can't stress enough my config retains retpoline and other mitigations for actual bugs, so it's not like I cheated the system.
I'm attaching my config for reference. It is what's present in Debian + a bunch of stuff disabled.
If Linus thinks it matters then I think we should only take the f_pos_lock unconditionally on directory fds as this is where it really matters. For read/write we're only doing it because of posix anyway and using pidfd_getfd() is squarely out of posix anyway. So we can document that using pidfd_getfd() on a regular file descriptor outside of a seccomp stopped task is not recommended.
In stock kernel the code is:

	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
		if (file_count(file) > 1) {
			v |= FDPUT_POS_UNLOCK;
			mutex_lock(&file->f_pos_lock);
		}
	}
after pidfd_getfd succeeds the count is > 1, so the thing to worry about is racing with the thread calling fdget_pos or making sure it stopped using the fd, but only the first time around. Perhaps there is a way to wait it out (as in guarantee it is not using the fd *or* already had seen the bumped refcount) which would be perfectly acceptable for pidfd consumers? Note the thread may be stuck in an uninterruptible, indefinite sleep somewhere while using it without the lock. On the other hand if such a thing can happen with f_pos_lock mutex held, the pidfd consumer running into the same codepath with the lock would be in the same spot, so perhaps this is not so bad?
Note this is definitely hackable just for fun without pessimizing fdget_pos much, but off hand I don't have anything reasonable for real-world use. I'm going to sleep on it.
-- Mateusz Guzik <mjguzik gmail.com>
On Thu, 3 Aug 2023 at 02:53, Mateusz Guzik mjguzik@gmail.com wrote:
So yes, atomics remain expensive on x86-64 even on a very modern uarch and their impact is measurable in a syscall like read.
Well, a patch like this should fix it.
I intentionally didn't bother with the alpha osf version of readdir, because nobody cares, but I guess we could do this in the header too.
Or we could have split the FMODE_ATOMIC_POS bit into two, and had a "ALWAYS" version and a regular version, but just having a "fdget_dir()" made it simpler.
So this - together with just reverting commit 20ea1e7d13c1 ("file: always lock position for FMODE_ATOMIC_POS") - *should* fix any performance regression.
But I have not tested it at all. So....
Linus
On 8/3/23, Linus Torvalds torvalds@linux-foundation.org wrote:
On Thu, 3 Aug 2023 at 02:53, Mateusz Guzik mjguzik@gmail.com wrote:
So yes, atomics remain expensive on x86-64 even on a very modern uarch and their impact is measurable in a syscall like read.
Well, a patch like this should fix it.
I intentionally didn't bother with the alpha osf version of readdir, because nobody cares, but I guess we could do this in the header too.
Or we could have split the FMODE_ATOMIC_POS bit into two, and had a "ALWAYS" version and a regular version, but just having a "fdget_dir()" made it simpler.
So this - together with just reverting commit 20ea1e7d13c1 ("file: always lock position for FMODE_ATOMIC_POS") - *should* fix any performance regression.
That would do it, but I'm uneasy about the partially "gimped" status of the fd that the caller can't do anything about.
I read the thread and still don't get what is the real-world use case for the thing, a real-world consumer to take a look at would be nice.
As noted in another e-mail, the entire lack of pos locking in the stock kernel is a transient ordeal -- there is a point after which the "donor" is guaranteed either not to use the fd, or to spot the refcount > 1 on any future use and take the pos lock from then on.
Perhaps, and I'm really just spitballing here, almost all expected use cases would be 100% fine without pos lock so it all works out as is. Similarly, if the target is multithreaded it already works as well.
Those who expect "fully qualified" semantics would pass a flag argument denoting it but would possibly wait indefinitely while the target is blocked somewhere in the kernel and it is not known if it is even using the fd. This very well may be a deal breaker though and may be too cumbersome to implement for real use.
All that said, personally I would either go forward with a "full fix" (probably not worth it) or take the L and lock for all fds.
On Thu, Aug 03, 2023 at 08:45:54AM -0700, Linus Torvalds wrote:
On Thu, 3 Aug 2023 at 02:53, Mateusz Guzik mjguzik@gmail.com wrote:
So yes, atomics remain expensive on x86-64 even on a very modern uarch and their impact is measurable in a syscall like read.
Well, a patch like this should fix it.
I intentionally didn't bother with the alpha osf version of readdir, because nobody cares, but I guess we could do this in the header too.
Or we could have split the FMODE_ATOMIC_POS bit into two, and had a "ALWAYS" version and a regular version, but just having a "fdget_dir()" made it simpler.
So this - together with just reverting commit 20ea1e7d13c1 ("file: always lock position for FMODE_ATOMIC_POS") - *should* fix any performance regression.
But I have not tested it at all. So....
Yeah, this is my suggestion - and your earlier suggestion - in the thread. Only thing that's missing is exclusion with seek on directories as that's the heinous part.
On Thu, 3 Aug 2023 at 11:03, Christian Brauner brauner@kernel.org wrote:
Only thing that's missing is exclusion with seek on directories as that's the heinous part.
Bah. I forgot about lseek entirely, because for some completely stupid reason I just thought "Oh, that will always get the lock".
So I guess we'd just have to do that "unconditional fdget_dir()" thing in the header file after all, and make llseek() and ksys_lseek() use it.
Bah. And then we'd still have to worry about any filesystem that allows 'read()' and 'write()' on the directory - which can also update f_pos.
And yes, those exist. See at least 'adfs_dir_ops', and 'ceph_dir_fops'. They may be broken, but people actually did do things like that historically, maybe there's a reason adfs and ceph allow it.
End result: we can forget about fdget_dir(). We'd need to split FMODE_ATOMIC_POS into two instead.
I don't think we have any free flags, but I didn't check. The ugly thing to do is to just special-case S_ISDIR. Not lovely, but whatever.
So something like this instead? It's a smaller diff anyway, and it gets the crazy adfs/ceph cases right too.
And by "gets them right" I obviously mean "I didn't actually *TEST* any of this, so who knows..."
Linus
On Thu, Aug 03, 2023 at 11:35:53AM -0700, Linus Torvalds wrote:
On Thu, 3 Aug 2023 at 11:03, Christian Brauner brauner@kernel.org wrote:
Only thing that's missing is exclusion with seek on directories as that's the heinous part.
Bah. I forgot about lseek entirely, because for some completely stupid reason I just thought "Oh, that will always get the lock".
So I guess we'd just have to do that "unconditional fdget_dir()" thing in the header file after all, and make llseek() and ksys_lseek() use it.
Bah. And then we'd still have to worry about any filesystem that allows 'read()' and 'write()' on the directory - which can also update f_pos.
And yes, those exist. See at least 'adfs_dir_ops', and 'ceph_dir_fops'. They may be broken, but people actually did do things like that historically, maybe there's a reason adfs and ceph allow it.
End result: we can forget about fdget_dir(). We'd need to split FMODE_ATOMIC_POS into two instead.
I don't think we have any free flags, but I didn't check. The ugly
We do. Christoph freed up 3 last cycle. I think I mentioned it in one of my PRs. (And btw, we should probably try to get rid of a few more.)
thing to do is to just special-case S_ISDIR. Not lovely, but whatever.
So something like this instead? It's a smaller diff anyway, and it gets the crazy adfs/ceph cases right too.
If you really care about this we can do it. But if we can live with just unconditionally locking then I think we're better off. As I said, I had explicitly requested that the lkp performance CI (that I think Intel? runs for us) be run on this with a focus on single-threaded performance, and so far there was nothing that wasn't noise.
On Fri, Aug 04, 2023 at 03:43:20PM +0200, Christian Brauner wrote:
I don't think we have any free flags, but I didn't check. The ugly
We do. Christoph freed up 3 last cycle. I think I mentioned it in one of my PRs. (And btw, we should probably try to get rid of a few more.)
I have a half-baked series to move all the static capabilities into a new field in struct file_operations. That gives us almost half a dozen back. I wanted to finish it for this merge window, but there are too many conflicting other things going on, so it will have to wait a bit longer.
On Thu, Aug 03, 2023 at 11:35:53AM -0700, Linus Torvalds wrote:
And yes, those exist. See at least 'adfs_dir_ops', and 'ceph_dir_fops'. They may be broken, but people actually did do things like that historically, maybe there's a reason adfs and ceph allow it.
Huh?
adfs is a red herring - there is a method called 'read' in struct adfs_dir_ops, but it has nothing to do with read(2) or anything of that sort; it is *used* by adfs_iterate() (via adfs_read_inode()), but the only file_operations that sucker has is

	const struct file_operations adfs_dir_operations = {
		.read		= generic_read_dir,
		.llseek		= generic_file_llseek,
		.iterate_shared	= adfs_iterate,
		.fsync		= generic_file_fsync,
	};

and generic_read_dir() is "fail with -EISDIR".
ceph one is real, and it's... not nice. They do have a genuine ->read() on directories, which acts as if it had been a regular file with contents generated by sprintf() (at the first read(2)). Format:

	"entries: %20lld\n"
	" files: %20lld\n"
	" subdirs: %20lld\n"
	"rentries: %20lld\n"
	" rfiles: %20lld\n"
	" rsubdirs: %20lld\n"
	"rbytes: %20lld\n"
	"rctime: %10lld.%09ld\n"

and yes, it does go stale. Just close and reopen...
File position is an offset in that string. Shared with position used by getdents()...
It gets better - if you open a directory, then fork and have both child and parent read from it, well...

	if (!dfi->dir_info) {
		dfi->dir_info = kmalloc(bufsize, GFP_KERNEL);
		if (!dfi->dir_info)
			return -ENOMEM;

No locking whatsoever. No privileges needed either...
Frankly, my preference would be to remove that kludge, but short of that we would need at least some exclusion for those allocations and if we do that, we might just as well have ceph_readdir() fail if we'd ever done ceph_read_dir() on the same struct file and vice versa.
They are really not mixable.
As far as I can tell, ceph is the only place in mainline where we have ->iterate_shared along with non-failing ->read. Nothing mixes it with ->read_iter, ->write or ->write_iter.
lseek() is there, of course, and that's enough to kill the "special fdget variant for directory syscalls" kind of approach. But read() and write() on directories are very much not something people legitimately do.