On Fri, May 03, 2024 at 01:35:09PM -0600, Jens Axboe wrote:
On 5/3/24 1:22 PM, Kees Cook wrote:
On Fri, May 03, 2024 at 12:49:11PM -0600, Jens Axboe wrote:
On 5/3/24 12:26 PM, Kees Cook wrote:
Thanks for doing this analysis! I suspect at least a start of a fix would be this:
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 8fe5aa67b167..15e8f74ee0f2 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll) if (events & EPOLLOUT) { /* Paired with fput in dma_buf_poll_cb */
get_file(dmabuf->file);
if (!dma_buf_poll_add_cb(resv, true, dcb))
if (!atomic_long_inc_not_zero(&dmabuf->file) &&
!dma_buf_poll_add_cb(resv, true, dcb)) /* No callback queued, wake up any other waiters */
Don't think this is sane at all. I'm assuming you meant:
atomic_long_inc_not_zero(&dmabuf->file->f_count);
Oops, yes, sorry. I was typed from memory instead of copy/paste.
Figured :-)
but won't fly as you're not under RCU in the first place. And what protects it from being long gone before you attempt this anyway? This is sane way to attempt to fix it, it's completely opposite of what sane ref handling should look like.
Not sure what the best fix is here, seems like dma-buf should hold an actual reference to the file upfront rather than just stash a pointer and then later _hope_ that it can just grab a reference. That seems pretty horrible, and the real source of the issue.
AFAICT, epoll just doesn't hold any references at all. It depends, I think, on eventpoll_release() (really eventpoll_release_file()) synchronizing with epoll_wait() (but I don't see how this happens, and the race seems to be against ep_item_poll() ...?)
I'm really confused about how eventpoll manages the lifetime of polled fds.
epoll doesn't hold any references, and it's got some ugly callback to deal with that. It's not ideal, nor pretty, but that's how it currently works. See eventpoll_release() and how it's called. This means that epoll itself is supposedly safe from the file going away, even though it doesn't hold a reference to it.
Right -- what remains unclear to me is how struct file lifetime is expected to work in the struct file_operations::poll callbacks. Because using get_file() there looks clearly unsafe...
Except that in this case, the file is already gone by the time eventpoll_release() is called. Which presumably is some interaction with the somewhat suspicious file reference management that dma-buf is doing. But I didn't look into that much, outside of noting it looks a bit suspect.
Not yet, though. Here's (one) race state from the analysis. I added lines for the dma_fence_add_callback()/dma_buf_poll_cb() case, since that's the case that would escape any eventpoll_release/epoll_wait synchronization (if it exists?):
close(dmabuf->file) __fput_sync (f_count == 1, last ref) f_count-- (f_count == 0 now) __fput epoll_wait vfs_poll(dmabuf->file) get_file(dmabuf->file)(f_count == 1) dma_fence_add_callback() eventpoll_release dmabuf->file deallocation dma_buf_poll_cb() fput(dmabuf->file) (f_count == 1) f_count-- dmabuf->file deallocation
Without fences to create a background callback, we just do a double-free:
close(dmabuf->file) __fput_sync (f_count == 1, last ref) f_count-- (f_count == 0 now) __fput epoll_wait vfs_poll(dmabuf->file) get_file(dmabuf->file)(f_count == 1) dma_buf_poll_cb() fput(dmabuf->file) (f_count == 1) f_count-- eventpoll_release dmabuf->file deallocation eventpoll_release dmabuf->file deallocation
get_file(), via epoll_wait()->vfs_poll()->dma_buf_poll(), has raised f_count again. Then eventpoll_release() is doing things to remove dmabuf->file from the eventpoll lists, but I *think* this is synchronized so that an epoll_wait() will only call .poll handlers with a valid (though possibly f_count==0) file, but I can't figure out where that happens. (If it's not happening, we have a much bigger problem, but I imagine we'd see massive corruption all the time, which we don't.)
So, yeah, I can't figure out how eventpoll_release() and epoll_wait() are expected to behave safely for .poll handlers.
Regardless, for the simple case: it seems like it's just totally illegal to use get_file() in a poll handler. Is this known/expected? And if so, how can dmabuf possibly deal with that?
On Fri, May 03, 2024 at 12:59:52PM -0700, Kees Cook wrote:
So, yeah, I can't figure out how eventpoll_release() and epoll_wait() are expected to behave safely for .poll handlers.
Regardless, for the simple case: it seems like it's just totally illegal to use get_file() in a poll handler. Is this known/expected? And if so, how can dmabuf possibly deal with that?
Is this the right approach? It still feels to me like get_file() needs to happen much earlier...
diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 882b89edc52a..c6c29facf228 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -991,9 +991,13 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, __poll_t res;
pt->_key = epi->event.events; - if (!is_file_epoll(file)) - res = vfs_poll(file, pt); - else + if (!is_file_epoll(file)) { + if (atomic_long_inc_not_zero(&file->f_count)) { + res = vfs_poll(file, pt); + fput(file); + } else + res = EPOLLERR; + } else res = __ep_eventpoll_poll(file, pt, depth); return res & epi->event.events; }
On Fri, May 03, 2024 at 01:28:37PM -0700, Kees Cook wrote:
Is this the right approach? It still feels to me like get_file() needs to happen much earlier...
I don't believe it needs to happen at all. The problem is not that ->release() can be called during ->poll() - it can't and it doesn't. It's that this instance of ->poll() is trying to extend the lifetime of that struct file, when it might very well be past the point of no return.
What we need is * promise that ep_item_poll() won't happen after eventpoll_release_file(). AFAICS, we do have that. * ->poll() not playing silly buggers.
As it is, dma_buf ->poll() is very suspicious regardless of that mess - it can grab reference to file for unspecified interval. Have that happen shortly before reboot and you are asking for failing umount.
->poll() must be refcount-neutral wrt file passed to it. I'm seriously tempted to make ->poll() take const struct file * and see if there's anything else that would fall out.
On Fri, 3 May 2024 at 14:11, Al Viro viro@zeniv.linux.org.uk wrote:
What we need is * promise that ep_item_poll() won't happen after eventpoll_release_file(). AFAICS, we do have that. * ->poll() not playing silly buggers.
No. That is not enough at all.
Because even with perfectly normal "->poll()", and even with the ep_item_poll() happening *before* eventpoll_release_file(), you have this trivial race:
ep_item_poll() ->poll()
and *between* those two operations, another CPU does "close()", and that causes eventpoll_release_file() to be called, and now f_count goes down to zero while ->poll() is running.
So you do need to increment the file count around the ->poll() call, I feel.
Or, alternatively, you'd need to serialize with eventpoll_release_file(), but that would need to be some sleeping lock held over the ->poll() call.
As it is, dma_buf ->poll() is very suspicious regardless of that mess - it can grab reference to file for unspecified interval.
I think that's actually much preferable to what epoll does, which is to keep using files without having reference counts to them (and then relying on magically not racing with eventpoll_release_file().
Linus
On Fri, May 03, 2024 at 02:24:45PM -0700, Linus Torvalds wrote:
Because even with perfectly normal "->poll()", and even with the ep_item_poll() happening *before* eventpoll_release_file(), you have this trivial race:
ep_item_poll() ->poll()
and *between* those two operations, another CPU does "close()", and that causes eventpoll_release_file() to be called, and now f_count goes down to zero while ->poll() is running.
So you do need to increment the file count around the ->poll() call, I feel.
Or, alternatively, you'd need to serialize with eventpoll_release_file(), but that would need to be some sleeping lock held over the ->poll() call.
As it is, dma_buf ->poll() is very suspicious regardless of that mess - it can grab reference to file for unspecified interval.
I think that's actually much preferable to what epoll does, which is to keep using files without having reference counts to them (and then relying on magically not racing with eventpoll_release_file().
eventpoll_release_file() calling __ep_remove() while ep_item_poll() is something we need to avoid anyway - having epi freed under ep_item_poll() would be a problem regardless of struct file lifetime issues.
On Fri, May 03, 2024 at 10:11:09PM +0100, Al Viro wrote:
On Fri, May 03, 2024 at 01:28:37PM -0700, Kees Cook wrote:
Is this the right approach? It still feels to me like get_file() needs to happen much earlier...
I don't believe it needs to happen at all. The problem is not that ->release() can be called during ->poll() - it can't and it doesn't. It's that this instance of ->poll() is trying to extend the lifetime of that struct file, when it might very well be past the point of no return.
What we need is
- promise that ep_item_poll() won't happen after eventpoll_release_file().
AFAICS, we do have that.
- ->poll() not playing silly buggers.
As it is, dma_buf ->poll() is very suspicious regardless of that mess - it can grab reference to file for unspecified interval. Have that happen shortly before reboot and you are asking for failing umount.
->poll() must be refcount-neutral wrt file passed to it. I'm seriously tempted to make ->poll() take const struct file * and see if there's anything else that would fall out.
... the last part is no-go - poll_wait() must be able to grab a reference (well, the callback in it must)
On Fri, 3 May 2024 at 14:36, Al Viro viro@zeniv.linux.org.uk wrote:
... the last part is no-go - poll_wait() must be able to grab a reference (well, the callback in it must)
Yeah. I really think that *poll* itself is doing everything right. It knows that it's called with a file pointer with a reference, and it adds its own references as needed.
And I think that's all fine - both for dmabuf in particular, but for poll in general. That's how things are *supposed* to work. You can keep references to other things in your 'struct file *', knowing that files are properly refcounted, and won't go away while you are dealing with them.
The problem, of course, is that then epoll violates that "called with reference" part. epoll very much by design does *not* take references to the files it keeps track of, and then tears them down at close() time.
Now, epoll has its reasons for doing that. They are even good reasons. But that does mean that when epoll needs to deal with that hackery.
I wish we could remove epoll entirely, but that isn't an option, so we need to just make sure that when it accesses the ffd.file pointer, it does so more carefully.
Linus
On Fri, May 03, 2024 at 02:42:22PM -0700, Linus Torvalds wrote:
On Fri, 3 May 2024 at 14:36, Al Viro viro@zeniv.linux.org.uk wrote:
... the last part is no-go - poll_wait() must be able to grab a reference (well, the callback in it must)
Yeah. I really think that *poll* itself is doing everything right. It knows that it's called with a file pointer with a reference, and it adds its own references as needed.
Not really. Note that select's __pollwait() does *NOT* leave a reference at the mercy of driver - it's stuck into poll_table_entry->filp and the poll_freewait() knows how to take those out.
dmabuf does something very different - it grabs the damn thing into its private data structures and for all we know it could keep it for a few hours, until some even materializes.
On Fri, May 03, 2024 at 10:53:03PM +0100, Al Viro wrote:
On Fri, May 03, 2024 at 02:42:22PM -0700, Linus Torvalds wrote:
On Fri, 3 May 2024 at 14:36, Al Viro viro@zeniv.linux.org.uk wrote:
... the last part is no-go - poll_wait() must be able to grab a reference (well, the callback in it must)
Yeah. I really think that *poll* itself is doing everything right. It knows that it's called with a file pointer with a reference, and it adds its own references as needed.
Not really. Note that select's __pollwait() does *NOT* leave a reference at the mercy of driver - it's stuck into poll_table_entry->filp and the poll_freewait() knows how to take those out.
dmabuf does something very different - it grabs the damn thing into its private data structures and for all we know it could keep it for a few hours, until some even materializes.
dma_fence must complete in reasonable amount of time, where "reasonable" is roughly in line with other i/o (including the option that there's timeouts if the hw's gone busted).
So definitely not hours (aside from driver bugs when things go really wrong ofc), but more like a few seconds in a worst case scenario. -Sima
linaro-mm-sig@lists.linaro.org