On Fri, May 03, 2024 at 06:54:22PM +0700, Bui Quang Minh wrote:
[...] Root cause: AFAIK, eventpoll (epoll) does not increase the registered file's reference. To ensure the safety, when the registered file is deallocated in __fput, eventpoll_release is called to unregister the file from epoll. When calling poll on epoll, epoll will loop through registered files and call vfs_poll on these files. In the file's poll, file is guaranteed to be alive, however, as epoll does not increase the registered file's reference, the file may be dying and it's not safe the get the file for later use. And dma_buf_poll violates this. In the dma_buf_poll, it tries to get_file to use in the callback. This leads to a race where the dmabuf->file can be fput twice.
Here is the race occurs in the above proof-of-concept
close(dmabuf->file) __fput_sync (f_count == 1, last ref) f_count-- (f_count == 0 now) __fput epoll_wait vfs_poll(dmabuf->file) get_file(dmabuf->file)(f_count == 1) eventpoll_release dmabuf->file deallocation fput(dmabuf->file) (f_count == 1) f_count-- dmabuf->file deallocation
I am not familiar with the dma_buf so I don't know the proper fix for the issue. About the rule that don't get the file for later use in poll callback of file, I wonder if it is there when only select/poll exist or just after epoll appears.
I hope the analysis helps us to fix the issue.
Thanks for doing this analysis! I suspect at least a start of a fix would be this:
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 8fe5aa67b167..15e8f74ee0f2 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
if (events & EPOLLOUT) { /* Paired with fput in dma_buf_poll_cb */ - get_file(dmabuf->file); - - if (!dma_buf_poll_add_cb(resv, true, dcb)) + if (!atomic_long_inc_not_zero(&dmabuf->file) && + !dma_buf_poll_add_cb(resv, true, dcb)) /* No callback queued, wake up any other waiters */ dma_buf_poll_cb(NULL, &dcb->cb); else @@ -290,9 +289,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
if (events & EPOLLIN) { /* Paired with fput in dma_buf_poll_cb */ - get_file(dmabuf->file); - - if (!dma_buf_poll_add_cb(resv, false, dcb)) + if (!atomic_long_inc_not_zero(&dmabuf->file) && + !dma_buf_poll_add_cb(resv, false, dcb)) /* No callback queued, wake up any other waiters */ dma_buf_poll_cb(NULL, &dcb->cb); else
But this ends up leaving "active" non-zero, and at close time it runs into:
BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
But the bottom line is that get_file() is unsafe to use in some places, one of which appears to be in the poll handler. There are maybe some other fragile places too, like in drivers/gpu/drm/vmwgfx/ttm_object.c:
static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf) { return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L; }
Which I also note involves a dmabuf...
Due to this issue I've proposed fixing get_file() to detect pathological states: https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/
But that has run into some push-back. I'm hoping that seeing this epoll example will help illustrate what needs fixing a little better.
I think the best current proposal is to just WARN sooner instead of a full refcount_t implementation:
diff --git a/include/linux/fs.h b/include/linux/fs.h index 8dfd53b52744..e09107d0a3d6 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1040,7 +1040,8 @@ struct file_handle {
static inline struct file *get_file(struct file *f) { - atomic_long_inc(&f->f_count); + long prior = atomic_long_fetch_inc_relaxed(&f->f_count); + WARN_ONCE(!prior, "struct file::f_count incremented from zero; use-after-free condition present!\n"); return f; }
What's the right way to deal with the dmabuf situation? (And I suspect it applies to get_dma_buf_unless_doomed() as well...)
-Kees
epoll is a mess, and does various invalid things in the name of performance.
Let's try to rein it in a bit. Something like this, perhaps?
Not-yet-signed-off-by: Linus Torvalds torvalds@linux-foundation.org ---
This is entirely untested, thus the "Not-yet-signed-off-by". But I think this may be kind of the right path forward.
I suspect the ->poll() call is the main case that matters, but there are other places where eventpoll just looks up the file pointer without then being very careful about it. The sock_from_file(epi->ffd.file) uses in particular should probably also use this to look up the file.
Comments?
fs/eventpoll.c | 30 +++++++++++++++++++++++++++++- 1 file changed, 29 insertions(+), 1 deletion(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 882b89edc52a..bffa8083ff36 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -285,6 +285,30 @@ static inline void free_ephead(struct epitems_head *head) kmem_cache_free(ephead_cache, head); }
+/* + * The ffd.file pointer may be in the process of + * being torn down due to being closed, but we + * may not have finished eventpoll_release() yet. + * + * Technically, even with the atomic_long_inc_not_zero, + * the file may have been free'd and then gotten + * re-allocated to something else (since files are + * not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU). + * + * But for epoll, we don't much care. + */ +static struct file *epi_fget(const struct epitem *epi) +{ + struct file *file; + + rcu_read_lock(); + file = epi->ffd.file; + if (!atomic_long_inc_not_zero(&file->f_count)) + file = NULL; + rcu_read_unlock(); + return file; +} + static void list_file(struct file *file) { struct epitems_head *head; @@ -987,14 +1011,18 @@ static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int dep static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, int depth) { - struct file *file = epi->ffd.file; + struct file *file = epi_fget(epi); __poll_t res;
+ if (!file) + return 0; + pt->_key = epi->event.events; if (!is_file_epoll(file)) res = vfs_poll(file, pt); else res = __ep_eventpoll_poll(file, pt, depth); + fput(file); return res & epi->event.events; }
On Fri, May 03, 2024 at 02:11:30PM -0700, Linus Torvalds wrote:
epoll is a mess, and does various invalid things in the name of performance.
Let's try to rein it in a bit. Something like this, perhaps?
+/*
- The ffd.file pointer may be in the process of
- being torn down due to being closed, but we
- may not have finished eventpoll_release() yet.
- Technically, even with the atomic_long_inc_not_zero,
- the file may have been free'd and then gotten
- re-allocated to something else (since files are
- not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU).
Can we get to ep_item_poll(epi, ...) after eventpoll_release_file() got past __ep_remove()? Because if we can, we have a worse problem - epi freed under us.
If not, we couldn't possibly have reached ->release() yet, let alone freeing anything.
On Fri, 3 May 2024 at 14:24, Al Viro viro@zeniv.linux.org.uk wrote:
Can we get to ep_item_poll(epi, ...) after eventpoll_release_file() got past __ep_remove()? Because if we can, we have a worse problem - epi freed under us.
Look at the hack in __ep_remove(): if it is concurrent with eventpoll_release_file(), it will hit this code
spin_lock(&file->f_lock); if (epi->dying && !force) { spin_unlock(&file->f_lock); return false; }
and not free the epi.
But as far as I can tell, almost nothing else cares about the f_lock and dying logic.
And in fact, I don't think doing
spin_lock(&file->f_lock);
is even valid in the places that look up file through "epi->ffd.file", because the lock itself is inside the thing that you can't trust until you've taken the lock...
So I agree with Kees about the use of "atomic_dec_not_zero()" kind of logic - but it also needs to be in an RCU-readlocked region, I think.
I wish epoll() just took the damn file ref itself. But since it relies on the file refcount to release the data structure, that obviously can't work.
Linus
On Fri, May 03, 2024 at 02:33:37PM -0700, Linus Torvalds wrote:
Look at the hack in __ep_remove(): if it is concurrent with eventpoll_release_file(), it will hit this code
spin_lock(&file->f_lock); if (epi->dying && !force) { spin_unlock(&file->f_lock); return false; }
and not free the epi.
What does that have to do with ep_item_poll()?
eventpoll_release_file() itself calls __ep_remove(). Have that happen while ep_item_poll() is running in another thread and you've got a problem.
AFAICS, exclusion is on ep->mtx. Callers of ep_item_poll() are * __ep_eventpoll_poll() - grabs ->mtx * ep_insert() - called under ->mtx * ep_modify() - calls are under ->mtx * ep_send_events() - grabs ->mtx
and eventpoll_release_file() grabs ->mtx around __ep_remove().
How do you get through eventpoll_release_file() while someone has entered ep_item_poll()?
On Fri, 3 May 2024 at 14:45, Al Viro viro@zeniv.linux.org.uk wrote:
How do you get through eventpoll_release_file() while someone has entered ep_item_poll()?
Doesn't matter.
Look, it's enough that the file count has gone down to zero. You may not even have gotten to eventpoll_release_file() yet - the important part is that you're on your *way* to it.
That means that the file will be released - and it means that you have violated all the refcounting rules for poll().
So I think you're barking up the wrong tree.
Linus
On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote:
On Fri, 3 May 2024 at 14:45, Al Viro viro@zeniv.linux.org.uk wrote:
How do you get through eventpoll_release_file() while someone has entered ep_item_poll()?
Doesn't matter.
Look, it's enough that the file count has gone down to zero. You may not even have gotten to eventpoll_release_file() yet - the important part is that you're on your *way* to it.
That means that the file will be released - and it means that you have violated all the refcounting rules for poll().
So I think you're barking up the wrong tree.
IMO there are several things in that mess (aside of epoll being what it is).
Trying to grab refcount as you do is fine; the comment is seriously misleading, though - we *are* guaranteed that struct file hasn't hit ->release(), let alone getting freed and reused.
Having pollwait callback grab references is fine - and the callback belongs to whoever's calling ->poll().
Having ->poll() instance itself grab reference is really asking for problem, even on the boxen that have CONFIG_EPOLL turned off. select(2) is enough; it will take care of references grabbed by __pollwait(), but it doesn't know anything about the references driver has stashed hell knows where for hell knows how long.
On Fri, May 03, 2024 at 11:01:45PM +0100, Al Viro wrote:
Having ->poll() instance itself grab reference is really asking for problem, even on the boxen that have CONFIG_EPOLL turned off. select(2) is enough; it will take care of references grabbed by __pollwait(), but it doesn't know anything about the references driver has stashed hell knows where for hell knows how long.
Suppose your program calls select() on a pipe and dmabuf, sees data to be read from pipe, reads it, closes both pipe and dmabuf and exits.
Would you expect that dmabuf file would stick around for hell knows how long after that? I would certainly be very surprised by running into that...
On Fri, 3 May 2024 at 15:07, Al Viro viro@zeniv.linux.org.uk wrote:
Suppose your program calls select() on a pipe and dmabuf, sees data to be read from pipe, reads it, closes both pipe and dmabuf and exits.
Would you expect that dmabuf file would stick around for hell knows how long after that? I would certainly be very surprised by running into that...
Why?
That's the _point_ of refcounts. They make the thing they refcount stay around until it's no longer referenced.
Now, I agree that dmabuf's are a bit odd in how they use a 'struct file' *as* their refcount, but hey, it's a specialty use. Unusual perhaps, but not exactly wrong.
I suspect that if you saw a dmabuf just have its own 'refcount_t' and stay around until it was done, you wouldn't bat an eye at it, and it's really just the "it uses a struct file for counting" that you are reacting to.
Linus
On Fri, May 03, 2024 at 04:16:15PM -0700, Linus Torvalds wrote:
On Fri, 3 May 2024 at 15:07, Al Viro viro@zeniv.linux.org.uk wrote:
Suppose your program calls select() on a pipe and dmabuf, sees data to be read from pipe, reads it, closes both pipe and dmabuf and exits.
Would you expect that dmabuf file would stick around for hell knows how long after that? I would certainly be very surprised by running into that...
Why?
That's the _point_ of refcounts. They make the thing they refcount stay around until it's no longer referenced.
Now, I agree that dmabuf's are a bit odd in how they use a 'struct file' *as* their refcount, but hey, it's a specialty use. Unusual perhaps, but not exactly wrong.
I suspect that if you saw a dmabuf just have its own 'refcount_t' and stay around until it was done, you wouldn't bat an eye at it, and it's really just the "it uses a struct file for counting" that you are reacting to.
*IF* those files are on purely internal filesystem, that's probably OK; do that with something on something mountable (char device, sysfs file, etc.) and you have a problem with filesystem staying busy.
I'm really unfamiliar with the subsystem; it might be OK with all objects that use that for ->poll(), but that's definitely not a good thing to see in ->poll() instance in general. And code gets copied, so there really should be a big fat comment about the reasons why it's OK in this particular case.
Said that, it seems that a better approach might be to have their ->release() cancel callbacks and drop fence references. Note that they *do* have refcounts - on fences. The file (well, dmabuf, really) is pinned only to protect against the situation when pending callback is still around. And Kees' observation about multiple fences is also interesting - we don't get extra fput(), but only because we get events only from one fence, which does look fishy...
On Fri, 3 May 2024 at 16:39, Al Viro viro@zeniv.linux.org.uk wrote:
*IF* those files are on purely internal filesystem, that's probably OK; do that with something on something mountable (char device, sysfs file, etc.) and you have a problem with filesystem staying busy.
Yeah, I agree, it's a bit annoying in general. That said, it's easy to do: stash a file descriptor in a unix domain socket, and that's basically exactly what you have: a random reference to a 'struct file' that will stay around for as long as you just keep that socket around, long after the "real" file descriptor has been closed, and entirely separately from it.
And yes, that's exactly why unix domain socket transfers have caused so many problems over the years, with both refcount overflows and nasty garbage collection issues.
So randomly taking references to file descriptors certainly isn't new.
In fact, it's so common that I find the epoll pattern annoying, in that it does something special and *not* taking a ref - and it does that special thing to *other* ("innocent") file descriptors. Yes, dma-buf is a bit like those unix domain sockets in that it can keep random references alive for random times, but at least it does it just to its own file descriptors, not random other targets.
So the dmabuf thing is very much a "I'm a special file that describes a dma buffer", and shouldn't really affect anything outside of active dmabuf uses (which admittedly is a large portion of the GPU drivers, and has been expanding from there...). I
So the reason I'm annoyed at epoll in this case is that I think epoll triggered the bug in some entirely innocent subsystem. dma-buf is doing something differently odd, yes, but at least it's odd in a "I'm a specialized thing" sense, not in some "I screw over others" sense.
Linus
On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote:
That means that the file will be released - and it means that you have violated all the refcounting rules for poll().
I feel like I've been looking at this too long. I think I see another problem here, but with dmabuf even when epoll is fixed:
dma_buf_poll() get_file(dmabuf->file) /* f_count + 1 */ dma_buf_poll_add_cb() dma_resv_for_each_fence ... dma_fence_add_callback(fence, ..., dma_buf_poll_cb)
dma_buf_poll_cb() ... fput(dmabuf->file); /* f_count - 1 ... for each fence */
Isn't it possible to call dma_buf_poll_cb() (and therefore fput()) multiple times if there is more than 1 fence? Perhaps I've missed a place where a single struct dma_resv will only ever signal 1 fence? But looking through dma_fence_signal_timestamp_locked(), I don't see anything about resv nor somehow looking into other fence cb_list contents...
On Fri, May 03, 2024 at 03:46:25PM -0700, Kees Cook wrote:
On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote:
That means that the file will be released - and it means that you have violated all the refcounting rules for poll().
I feel like I've been looking at this too long. I think I see another problem here, but with dmabuf even when epoll is fixed:
dma_buf_poll() get_file(dmabuf->file) /* f_count + 1 */ dma_buf_poll_add_cb() dma_resv_for_each_fence ... dma_fence_add_callback(fence, ..., dma_buf_poll_cb)
dma_buf_poll_cb() ... fput(dmabuf->file); /* f_count - 1 ... for each fence */
Isn't it possible to call dma_buf_poll_cb() (and therefore fput()) multiple times if there is more than 1 fence? Perhaps I've missed a place where a single struct dma_resv will only ever signal 1 fence? But looking through dma_fence_signal_timestamp_locked(), I don't see anything about resv nor somehow looking into other fence cb_list contents...
At a guess, r = dma_fence_add_callback(fence, &dcb->cb, dma_buf_poll_cb); if (!r) return true;
prevents that - it returns 0 on success and -E... on error; insertion into the list happens only when it's returning 0, so...
On Sat, May 04, 2024 at 12:03:18AM +0100, Al Viro wrote:
On Fri, May 03, 2024 at 03:46:25PM -0700, Kees Cook wrote:
On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote:
That means that the file will be released - and it means that you have violated all the refcounting rules for poll().
I feel like I've been looking at this too long. I think I see another problem here, but with dmabuf even when epoll is fixed:
dma_buf_poll() get_file(dmabuf->file) /* f_count + 1 */ dma_buf_poll_add_cb() dma_resv_for_each_fence ... dma_fence_add_callback(fence, ..., dma_buf_poll_cb)
dma_buf_poll_cb() ... fput(dmabuf->file); /* f_count - 1 ... for each fence */
Isn't it possible to call dma_buf_poll_cb() (and therefore fput()) multiple times if there is more than 1 fence? Perhaps I've missed a place where a single struct dma_resv will only ever signal 1 fence? But looking through dma_fence_signal_timestamp_locked(), I don't see anything about resv nor somehow looking into other fence cb_list contents...
At a guess, r = dma_fence_add_callback(fence, &dcb->cb, dma_buf_poll_cb); if (!r) return true;
prevents that - it returns 0 on success and -E... on error; insertion into the list happens only when it's returning 0, so...
Yes; thank you. I *have* been looking at it all too long. :)
The last related thing is the drivers/gpu/drm/vmwgfx/ttm_object.c case:
/** * get_dma_buf_unless_doomed - get a dma_buf reference if possible. * * @dmabuf: Non-refcounted pointer to a struct dma-buf. * * Obtain a file reference from a lookup structure that doesn't refcount * the file, but synchronizes with its release method to make sure it * has * not been freed yet. See for example kref_get_unless_zero * documentation. * Returns true if refcounting succeeds, false otherwise. * * Nobody really wants this as a public API yet, so let it mature here * for some time... */ static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf) { return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L; }
If we end up adding epi_fget(), we'll have 2 cases of using "atomic_long_inc_not_zero" for f_count. Do we need some kind of blessed helper to live in file.h or something, with appropriate comments?
On Fri, 3 May 2024 at 16:23, Kees Cook keescook@chromium.org wrote:
static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf) { return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L; }
If we end up adding epi_fget(), we'll have 2 cases of using "atomic_long_inc_not_zero" for f_count. Do we need some kind of blessed helper to live in file.h or something, with appropriate comments?
I wonder if we could try to abstract this out a bit more.
These games with non-ref-counted file structures *feel* a bit like the games we play with non-ref-counted (aka "stashed") 'struct dentry' that got fairly recently cleaned up with path_from_stashed() when both nsfs and pidfs started doing the same thing.
I'm not loving the TTM use of this thing, but at least the locking and logic feels a lot more straightforward (ie the atomic_long_inc_not_zero() here is clealy under the 'prime->mutex' lock
IOW, the tty use looks correct to me, and it has fairly simple locking and is just catching the the race between 'fput()' decrementing the refcount and and 'file->f_op->release()' doing the actual release.
You are right that it's similar to the epoll thing in that sense, it just looks a _lot_ more straightforward to me (and, unlike epoll, doesn't look actively buggy right now).
Could we abstract out this kind of "stashed file pointer" so that we'd have a *common* form for this? Not just the inc_not_zero part, but the locking rule too?
Linus
On Fri, May 03, 2024 at 04:41:19PM -0700, Linus Torvalds wrote:
On Fri, 3 May 2024 at 16:23, Kees Cook keescook@chromium.org wrote:
static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf) { return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L; }
If we end up adding epi_fget(), we'll have 2 cases of using "atomic_long_inc_not_zero" for f_count. Do we need some kind of blessed helper to live in file.h or something, with appropriate comments?
I wonder if we could try to abstract this out a bit more.
These games with non-ref-counted file structures *feel* a bit like the games we play with non-ref-counted (aka "stashed") 'struct dentry' that got fairly recently cleaned up with path_from_stashed() when both nsfs and pidfs started doing the same thing.
I'm not loving the TTM use of this thing, but at least the locking and logic feels a lot more straightforward (ie the atomic_long_inc_not_zero() here is clealy under the 'prime->mutex' lock
The one the vmgfx isn't really needed (I think at least), because all other drivers that use gem or ttm use the dma_buf export cache in drm/drm_prime.c, which is protected by a bog standard mutex.
vmwgfx is unfortunately special in a lot of ways due to somewhat parallel dev history. So there might be an uapi reason why the weak reference is required. I suspect because vmwgfx is reinventing a lot of its own wheels it can't play the same tricks as gem_prime.c, which hooks into a few core drm cleanup/release functions.
tldr; drm really has no architectural need for a get_file_unless_doomed, and I certainly don't want to spread it it further than the vmwgfx historical special case that was added in 2013. -Sima
IOW, the tty use looks correct to me, and it has fairly simple locking and is just catching the the race between 'fput()' decrementing the refcount and and 'file->f_op->release()' doing the actual release.
You are right that it's similar to the epoll thing in that sense, it just looks a _lot_ more straightforward to me (and, unlike epoll, doesn't look actively buggy right now).
Could we abstract out this kind of "stashed file pointer" so that we'd have a *common* form for this? Not just the inc_not_zero part, but the locking rule too?
Linus
linaro-mm-sig@lists.linaro.org