On Sun, 5 May 2024 at 03:50, Christian Brauner brauner@kernel.org wrote:
And I agree with you that for some instances it's valid to take another reference to a file from f_op->poll() but then they need to use get_file_active() imho and simply handle the case where f_count is zero.
I think this is
(a) practically impossible to find (since most f_count updates are in various random helpers)
(b) not tenable in the first place, since *EVERYBODY* does a f_count update as part of the bog-standard pollwait
So (b) means that the notion of "warn if somebody increments f_count from zero" is broken to begin with - but it's doubly broken because it wouldn't find anything *anyway*, since this never happens in any normal situation.
And (a) means that any non-automatic finding of this is practically impossible.
And we need to document that in Documentation/filesystems/file.rst or locking.rst.
WHY?
Why cannot you and Al just admit that the problem is in epoll. Always has been, always will be.
The fact is, it's not dma-buf that is violating any rules. It's epoll. It's calling out to random driver functions with a file pointer that is no longer valid.
It really is that simple.
I don't see why you are arguing for "unknown number of drivers - we know at least *one* - have to be fixed for a bug that is in epoll".
If it was *easy* to fix, and if it was *easy* to validate, then sure. But that just isn't the case.
In contrast, in epoll it's *trivial* to fix the one case where it does a VFS call-out, and just say "you have to follow the rules".
So explain to me again why you want to mess up the driver interface and everybody who has a '.poll()' function, and not just fix the ONE clearly buggy piece of code.
Because dammit,. epoll is clearly buggy. It's not enough to say "the file allocation isn't going away", and claim that that means that it's not buggy - when the file IS NO LONGER VALID!
Linus
epoll can call out to vfs_poll() with a file pointer that may race with the last 'fput()'. That would make f_count go down to zero, and while the ep->mtx locking means that the resulting file pointer tear-down will be blocked until the poll returns, it means that f_count is already dead, and any use of it won't actually get a reference to the file any more: it's dead regardless.
Make sure we have a valid ref on the file pointer before we call down to vfs_poll() from the epoll routines.
Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/ Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com Reviewed-by: Jens Axboe axboe@kernel.dk Signed-off-by: Linus Torvalds torvalds@linux-foundation.org ---
Changes since v1:
- add Link, Reported-by, and Jens' reviewed-by. And sign off on it because it looks fine to me and we have some testing now.
- move epi_fget() closer to the user
- more comments about the background
- remove the rcu_read_lock(), with the comment explaining why it's not needed
- note about returning zero rather than something like EPOLLERR|POLLHUP for a file that is going away
fs/eventpoll.c | 42 +++++++++++++++++++++++++++++++++++++++++- 1 file changed, 41 insertions(+), 1 deletion(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 882b89edc52a..a3f0f868adc4 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -979,6 +979,37 @@ static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int dep return res; }
+/* + * The ffd.file pointer may be in the process of + * being torn down due to being closed, but we + * may not have finished eventpoll_release() yet. + * + * Normally, even with the atomic_long_inc_not_zero, + * the file may have been free'd and then gotten + * re-allocated to something else (since files are + * not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU). + * + * But for epoll, users hold the ep->mtx mutex, and + * as such any file in the process of being free'd + * will block in eventpoll_release_file() and thus + * the underlying file allocation will not be free'd, + * and the file re-use cannot happen. + * + * For the same reason we can avoid a rcu_read_lock() + * around the operation - 'ffd.file' cannot go away + * even if the refcount has reached zero (but we must + * still not call out to ->poll() functions etc). + */ +static struct file *epi_fget(const struct epitem *epi) +{ + struct file *file; + + file = epi->ffd.file; + if (!atomic_long_inc_not_zero(&file->f_count)) + file = NULL; + return file; +} + /* * Differs from ep_eventpoll_poll() in that internal callers already have * the ep->mtx so we need to start from depth=1, such that mutex_lock_nested() @@ -987,14 +1018,23 @@ static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int dep static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, int depth) { - struct file *file = epi->ffd.file; + struct file *file = epi_fget(epi); __poll_t res;
+ /* + * We could return EPOLLERR | EPOLLHUP or something, + * but let's treat this more as "file doesn't exist, + * poll didn't happen". + */ + if (!file) + return 0; + pt->_key = epi->event.events; if (!is_file_epoll(file)) res = vfs_poll(file, pt); else res = __ep_eventpoll_poll(file, pt, depth); + fput(file); return res & epi->event.events; }
From: Linus Torvalds
Sent: 05 May 2024 18:56
epoll can call out to vfs_poll() with a file pointer that may race with the last 'fput()'. That would make f_count go down to zero, and while the ep->mtx locking means that the resulting file pointer tear-down will be blocked until the poll returns, it means that f_count is already dead, and any use of it won't actually get a reference to the file any more: it's dead regardless.
Make sure we have a valid ref on the file pointer before we call down to vfs_poll() from the epoll routines.
How much is the extra pair of atomics going to hurt performance? IIRC a lot of work was done to (try to) make epoll lockless.
Perhaps the 'hook' into epoll (usually) from sys_close should be done before any of the references are removed? (Which is different from Q6/A6 in man epoll - but that seems to be documenting a bug!) Then the ->poll() callback can't happen (assuming it is properly locked) after the ->release() one.
It seems better to add extra atomics to the close/final-fput path rather than ever ->poll() call epoll makes.
I can get extra references to a driver by dup() open("/dev/fd/n") and mmap() - but epoll is definitely fd based. (Which may be why it has the fd number in its data.)
Is there another race between EPOLL_CTL_ADD and close() (from a different thread)?
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sun, 5 May 2024 at 13:02, David Laight David.Laight@aculab.com wrote:
How much is the extra pair of atomics going to hurt performance? IIRC a lot of work was done to (try to) make epoll lockless.
If this makes people walk away from epoll, that would be absolutely *lovely*. Maybe they'd start using io_uring instead, which has had its problems, but is a lot more capable in the end.
Yes, doing things right is likely more expensive than doing things wrong. Bugs are cheap. That doesn't make buggy code better.
Epoll really isn't important enough to screw over the VFS subsystem over.
I did point out elsewhere how this could be fixed by epoll() removing the ep items at a different point:
https://lore.kernel.org/all/CAHk-=wj6XL9MGCd_nUzRj6SaKeN0TsyTTZDFpGdW34R+zMZ...
so if somebody actually wants to fix up epoll properly, that would probably be great.
In fact, that model would allow epoll() to just keep a proper refcount as an fd is added to the poll events, and would probably fix a lot of nastiness. Right now those ep items stay around for basically random amounts of time.
But maybe there are other ways to fix it. I don't think we have an actual eventpoll maintainer any more, but what I'm *not* willing to happen is eventpoll messing up other parts of the kernel. It was always a ugly performance hack, and was only acceptable as such. "ugly hack" is ok. "buggy ugly hack" is not.
Linus
On Sun, May 05, 2024 at 09:46:05AM -0700, Linus Torvalds wrote:
WHY?
Why cannot you and Al just admit that the problem is in epoll. Always has been, always will be.
Nobody (well, nobody who'd ever read epoll) argues that epoll is not a problem.
The fact is, it's not dma-buf that is violating any rules.
Now, that is something I've a trouble with. Use of get_file() in there actually looks rather fishy, regardless of epoll.
At the very least it needs a comment discouraging other instances from blindly copying this. A reference to struct file pins down more than driver-internal objects; if nothing else, it pins down a mount and if you don't have SB_NOUSER on file_inode(file)->i_sb->s_flags, it's really not a good idea.
What's more, the reason for that get_file() is, AFAICS, that nodes we put into callback queue for fence(s) in question[*] are embedded into dmabuf and we don't want them gone before the callbacks have happened. Which looks fishy - it would make more sense to cancel these callbacks and drop the fence(s) in question from ->release().
I've no problem whatsoever with fs/eventpoll.c grabbing/dropping file reference around vfs_poll() calls. And I don't believe that "try to grab" has any place in dma_buf_poll(); it's just that I'm not happy about get_file() call being there in the first place.
Sure, the call of ->poll() can bloody well lead to references being grabbed - by the pollwait callback, which the caller of ->poll() is aware of. It's ->poll() instance *itself* grabbing such references with vfs_poll() caller having no idea what's going on that has potential for being unpleasant. And we can't constify 'file' argument of ->poll() because of poll_wait(), so it's hard to catch those who do that kind of thing; I've explicitly said so upthread, I believe.
But similar calls of get_file() in ->poll() instances (again, not the ones that are made by pollwait callback) are something to watch out for and having the caller pin struct file does not solve the problem.
[*] at most one per direction, and I've no idea whether there can be more than one signalling fence for given dmabuf)
linaro-mm-sig@lists.linaro.org