Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

List overview All Threads
Download

newer

older

Re: Safety of opening up...

Re: [PATCH] epoll: try to be a...

Linus Torvalds

4 May 2024 4 May '24

3:32 p.m.

On Sat, 4 May 2024 at 02:37, Christian Brauner brauner@kernel.org wrote:

...

--- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -244,13 +244,18 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll) if (!dmabuf || !dmabuf->resv) return EPOLLERR;
  if (!get_file_active(&dmabuf->file))
          return EPOLLERR;

[...]

I *really* don't think anything that touches dma-buf.c can possibly be right.

This is not a dma-buf.c bug.

This is purely an epoll bug.

Lookie here, the fundamental issue is that epoll can call '->poll()' on a file descriptor that is being closed concurrently.

That means that *ANY* driver that relies on *any* data structure that is managed by the lifetime of the 'struct file' will have problems.

Look, here's sock_poll():

static __poll_t sock_poll(struct file *file, poll_table *wait) { struct socket *sock = file->private_data;

and that first line looks about as innocent as it possibly can, right?

Now, imagine that this is called from 'epoll' concurrently with the file being closed for the last time (but it just hasn't _quite_ reached eventpoll_release() yet).

Now, imagine that the kernel is built with preemption, and the epoll thread gets preempted _just_ before it loads 'file->private_data'.

Furthermore, the machine is under heavy load, and it just stays off its CPU a long time.

Now, during this TOTALLY INNOCENT sock_poll(), in another thread, the file closing completes, eventpoll_release() finishes, and the preemption of the poll() thing just takes so long that you go through an RCU period too, so that the actual file has been released too.

So now that totally innoced file->private_data load in the poll() is probably going to get random data.

Yes, the file is allocated as SLAB_TYPESAFE_BY_RCU, so it's probably still a file. Not guaranteed, even the slab will get fully free'd in some situations. And yes, the above case is impossible to hit in practice. You have to hit quite the small race window with an operation that practically never happens in the first place.

But my point is that the fact that the problem with file->f_count lifetimes happens for that dmabuf driver is not the fault of the dmabuf code. Not at all.

It is *ENTIRELY* a bug in epoll, and the dmabuf code is probably just easier to hit because it has a poll() function that does things that have longer lifetimes than most things, and interacts more directly with that f_count.

So I really don't understand why Al thinks this is "dmabuf does bad things with f_count". It damn well does not. dma-buf is the GOOD GUY here. It's doing things *PROPERLY*. It's taking refcounts like it damn well should.

The fact that it takes ref-counts on something that the epoll code has messed up is *NOT* its fault.

Linus

Show replies by date

Linus Torvalds

4 May 4 May

3:40 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Sat, 4 May 2024 at 08:32, Linus Torvalds torvalds@linux-foundation.org wrote:

...

Now, during this TOTALLY INNOCENT sock_poll(), in another thread, the file closing completes, eventpoll_release() finishes [..]

Actually, Al is right that ep_item_poll() should be holding the ep->mtx, so eventpoll_release() -> eventpoll_release_file_file() -> mutex_lock(&ep->mtx) should block and the file doesn't actually get released.

So I guess the sock_poll() issue cannot happen. It does need some poll() function that does 'fget()', and believes that it works.

But because the f_count has already gone down to zero, fget() doesn't work, and doesn't keep the file around, and you have the bug.

The cases that do fget() in poll() are probably race, but they aren't buggy. epoll is buggy.

So my example wasn't going to work, but the argument isn't really any different, it's just a much more limited case that breaks.

And maybe it's even *only* dma-buf that does that fget() in its ->poll() function. Even *then* it's not a dma-buf.c bug.

Linus

Linus Torvalds

3:53 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Sat, 4 May 2024 at 08:40, Linus Torvalds torvalds@linux-foundation.org wrote:

...

And maybe it's even *only* dma-buf that does that fget() in its ->poll() function. Even *then* it's not a dma-buf.c bug.

They all do in the sense that they do

poll_wait -> __pollwait -> get_file (*boom*)

but the boom is very small because the poll_wait() will be undone by poll_freewait(), and normally poll/select has held the file count elevated.

Except for epoll. Which leaves those pollwait entries around until it's done - but again will be held up on the ep->mtx before it does so.

So everybody does some f_count games, but possibly dma-buf is the only one that ends up expecting to hold on to the f_count for longer periods.

Linus

Al Viro

5 May 5 May

7:46 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Sat, May 04, 2024 at 08:53:47AM -0700, Linus Torvalds wrote:

...

poll_wait -> __pollwait -> get_file (*boom*)

but the boom is very small because the poll_wait() will be undone by poll_freewait(), and normally poll/select has held the file count elevated.

Not quite. It's not that poll_wait() calls __pollwait(); it calls whatever callback that caller of ->poll() has set for it.

__pollwait users (select(2) and poll(2), currently) must (and do) make sure that refcount is elevated; others (and epoll is not the only one) need to do whatever's right for their callbacks.

I've no problem with having epoll grab a reference, but if we make that a universal requirement ->poll() instances can rely upon, we'd better verify that *all* vfs_poll() are OK. And that ought to go into Documentation/filesystems/porting.rst ("callers of vfs_poll() must make sure that file is pinned; ->poll() instances may rely upon that, but they'd better be very careful about grabbing extra references themselves - it's acceptable for files on internal mounts, but *NOT* for anything on mountable filesystems. Any instance that does it needs an explicit comment telling not to reuse that blindly." or something along those lines).

Excluding epoll, select/poll and callers that have just done fdget() and will do fdput() after vfs_poll(), we have this:

drivers/vhost/vhost.c:213: mask = vfs_poll(file, &poll->table); vhost_poll_start(). Might get interesting... Calls working with vq->kick as file seem to rely upon vq->mutex, but I'll need to refresh my memories of that code to check if that's all we need - and then there's vhost_net_enable_vq(), which also needs an audit.

fs/aio.c:1738: mask = vfs_poll(req->file, &pt) & req->events; fs/aio.c:1932: mask = vfs_poll(req->file, &apt.pt) & req->events; aio_poll() and aio_poll_wake() resp. req->file here is actually ->ki_filp of iocb that contains work as iocb->req.work; it get dropped only in iocb_destroy(), which also frees iocb. Any call that might've run into req->file not pinned is already in UAF land.

io_uring/poll.c:303: req->cqe.res = vfs_poll(req->file, &pt) & req->apoll_events; io_uring/poll.c:622: mask = vfs_poll(req->file, &ipt->pt) & poll->events; Should have req->file pinned, but I'll need to RTFS a bit for details. That, or ask Jens to confirm...

net/9p/trans_fd.c:236: ret = vfs_poll(ts->rd, pt); net/9p/trans_fd.c:238: ret = (ret & ~EPOLLOUT) | (vfs_poll(ts->wr, pt) & ~EPOLLIN); p9_fd_poll(); ->rd and ->wr are pinned and won't get dropped until p9_fd_close(), which frees ts immediately afterwards. IOW, if we risk being called with ->rd or ->wr not pinned, we are in UAF land already. Incidentally, what the hell is this in p9_fd_open()? * It's technically possible for userspace or concurrent mounts to * modify this flag concurrently, which will likely result in a * broken filesystem. However, just having bad flags here should * not crash the kernel or cause any other sort of bug, so mark this * particular data race as intentional so that tooling (like KCSAN) * can allow it and detect further problems. */ Why not simply fix the race instead? As in spin_lock(&ts->rd->f_lock); ts->rd->f_flags |= O_NONBLOCK; spin_unlock(&ts->rd->f_lock); and similar for ts->wr? Sigh...

Linus Torvalds

8:03 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Sun, 5 May 2024 at 12:46, Al Viro viro@zeniv.linux.org.uk wrote:

...

I've no problem with having epoll grab a reference, but if we make that a universal requirement ->poll() instances can rely upon,

Al, we're note "making that a requirement".

It always has been.

Otgherwise, the docs should have shouted out DAMN LOUDLY that you can't rely on all the normal refcounting of 'struct file' THAT EVERY SINGLE NORMAL VFS FUNCTION CAN.

Lookie herte: epoll is unimportant and irrelevant garbage compared to something fundamental like "read()", and what does read() do?

It does this:

struct fd f = fdget_pos(fd);

if (f.file) { ...

which is being DAMN CAREFUL to make sure that the file has the proper refcounts before it then calls "vfs_read()". There's a lot of very careful and subtle code in fdget_pos() to make this all proper, and that even if the file is closed by another thread concurrently, we *always* have a refcount to it, and it's always live over the whole 'vfs_read()' sequence.

'vfs_poll()' is NOT DIFFERENT in this regard. Not at all.

Now, you have two choices that are intellectually honest:

- admit that epoll() - which is a hell of a lot less important - should spend a small fraction of that effort on making its vfs_poll() use sane

- say that all this fdget_pos() care is uncalled for in the read() path, and we should make all the filesystem .read() functions be aware that the file pointer they get may be garbage, and they should use get_file_active() to make sure every 'struct file *' use they have is safe?

because if your choice is that "epoll can do whatever the f*&k it wants", then it's in clear violation of all the effort we go to in a MUCH MORE IMPORTANT code path, and is clearly not consistent or logical.

Neither you nor Christian have explained why you think it's ok for that epoll() garbage to magically violate all our regular rules.

Your claim that those regular rules are some new conditional requirement that we'd be imposing. NO. They are the rules that *anybody* who gets a 'struct file *' pointer should always be able to rely on by default: it's damn well a ref-counted thing, and the caller holds the refcount.

The exceptional case is exactly the other way around: if you do random crap with unrefcounted poitners, it's *your* problem, and *you* are the one who has to be careful. Not some unrelated poor driver that didn't know about your f*&k-up.

Dammit, epoll is CLEARLY BUGGY. It's passing off random kernel pointers without holding a refcount to them. THAT'S A BUG.

And fixing that bug is *not* somehow changing existing rules as you are trying to claim. No. It's just fixing a bug.

So stop claiming that this is some "new requirement". It is absolutely nothing of the sort. epoll() actively MISUSED file pointer, because file pointers are fundamentally refcounted (as are pretty much all sane kernel interfaces).

Linus

Al Viro

8:30 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Sun, May 05, 2024 at 01:03:07PM -0700, Linus Torvalds wrote:

...

On Sun, 5 May 2024 at 12:46, Al Viro viro@zeniv.linux.org.uk wrote:

...
I've no problem with having epoll grab a reference, but if we make that a universal requirement ->poll() instances can rely upon,

Al, we're note "making that a requirement".

It always has been.

Argh. We keep talking past each other.

0. special-cased ->f_count rule for ->poll() is a wart and it's better to get rid of it.

1. fs/eventpoll.c is a steaming pile of shit and I'd be glad to see git rm taken to it. Short of that, by all means, let's grab reference in there around the call of vfs_poll() (see (0)).

2. having ->poll() instances grab extra references to file passed to them is not something that should be encouraged; there's a plenty of potential problems, and "caller has it pinned, so we are fine with grabbing extra refs" is nowhere near enough to eliminate those.

3. dma-buf uses of get_file() are probably safe (epoll shite aside), but they do look fishy. That has nothing to do with epoll.

Linus Torvalds

8:53 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Sun, 5 May 2024 at 13:30, Al Viro viro@zeniv.linux.org.uk wrote:

...

 special-cased ->f_count rule for ->poll() is a wart and it's
better to get rid of it.
 fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
git rm taken to it. Short of that, by all means, let's grab reference in there around the call of vfs_poll() (see (0)).

Agreed on 0/1.

...

 having ->poll() instances grab extra references to file passed
to them is not something that should be encouraged; there's a plenty of potential problems, and "caller has it pinned, so we are fine with grabbing extra refs" is nowhere near enough to eliminate those.

So it's not clear why you hate it so much, since those extra references are totally normal in all the other VFS paths.

I mean, they are perhaps not the *common* case, but we have a lot of random get_file() calls sprinkled around in various places when you end up passing a file descriptor off to some asynchronous operation thing.

Yeah, I think most of them tend to be special operations (eg the tty TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl() is *that* different from vfs_poll. Different operation, not somehow "one is more special than the other".

cachefiles and backing-file does it for regular IO, and drop it at IO completion - not that different from what dma-buf does. It's in ->read_iter() rather than ->poll(), but again: different operations, but not "one of them is somehow fundamentally different".

...

 dma-buf uses of get_file() are probably safe (epoll shite aside),
but they do look fishy. That has nothing to do with epoll.

Now, what dma-buf basically seems to do is to avoid ref-counting its own fundamental data structure, and replaces that by refcounting the 'struct file' that *points* to it instead.

And it is a bit odd, but it actually makes some amount of sense, because then what it passes around is that file pointer (and it allows passing it around from user space *as* that file).

And honestly, if you look at why it then needs to add its refcount to it all, it actually makes sense. dma-bufs have this notion of "fences" that are basically completion points for the asynchronous DMA. Doing a "poll()" operation will add a note to the fence to get that wakeup when it's done.

And yes, logically it takes a ref to the "struct dma_buf", but because of how the lifetime of the dma_buf is associated with the lifetime of the 'struct file', that then turns into taking a ref on the file.

Unusual? Yes. But not illogical. Not obviously broken. Tying the lifetime of the dma_buf to the lifetime of a file that is passed along makes _sense_ for that use.

I'm sure dma-bufs could add another level of refcounting on the 'struct dma_buf' itself, and not make it be 1:1 with the file, but it's not clear to me what the advantage would really be, or why it would be wrong to re-use a refcount that is already there.

Linus

Daniel Vetter

6 May 6 May

12:47 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Sun, May 05, 2024 at 01:53:48PM -0700, Linus Torvalds wrote:

...

On Sun, 5 May 2024 at 13:30, Al Viro viro@zeniv.linux.org.uk wrote:

...
 special-cased ->f_count rule for ->poll() is a wart and it's
better to get rid of it.
 fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
git rm taken to it. Short of that, by all means, let's grab reference in there around the call of vfs_poll() (see (0)).
Agreed on 0/1.

...
 having ->poll() instances grab extra references to file passed
to them is not something that should be encouraged; there's a plenty of potential problems, and "caller has it pinned, so we are fine with grabbing extra refs" is nowhere near enough to eliminate those.
So it's not clear why you hate it so much, since those extra references are totally normal in all the other VFS paths.

I mean, they are perhaps not the *common* case, but we have a lot of random get_file() calls sprinkled around in various places when you end up passing a file descriptor off to some asynchronous operation thing.

Yeah, I think most of them tend to be special operations (eg the tty TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl() is *that* different from vfs_poll. Different operation, not somehow "one is more special than the other".

cachefiles and backing-file does it for regular IO, and drop it at IO completion - not that different from what dma-buf does. It's in ->read_iter() rather than ->poll(), but again: different operations, but not "one of them is somehow fundamentally different".

...
 dma-buf uses of get_file() are probably safe (epoll shite aside),
but they do look fishy. That has nothing to do with epoll.
Now, what dma-buf basically seems to do is to avoid ref-counting its own fundamental data structure, and replaces that by refcounting the 'struct file' that *points* to it instead.

And it is a bit odd, but it actually makes some amount of sense, because then what it passes around is that file pointer (and it allows passing it around from user space *as* that file).

And honestly, if you look at why it then needs to add its refcount to it all, it actually makes sense. dma-bufs have this notion of "fences" that are basically completion points for the asynchronous DMA. Doing a "poll()" operation will add a note to the fence to get that wakeup when it's done.

And yes, logically it takes a ref to the "struct dma_buf", but because of how the lifetime of the dma_buf is associated with the lifetime of the 'struct file', that then turns into taking a ref on the file.

Unusual? Yes. But not illogical. Not obviously broken. Tying the lifetime of the dma_buf to the lifetime of a file that is passed along makes _sense_ for that use.

I'm sure dma-bufs could add another level of refcounting on the 'struct dma_buf' itself, and not make it be 1:1 with the file, but it's not clear to me what the advantage would really be, or why it would be wrong to re-use a refcount that is already there.

So there is generally another refcount, because dma_buf is just the cross-driver interface to some kind of real underlying buffer object from the various graphics related subsystems we have.

And since it's a pure file based api thing that ceases to serve any function once the fd/file is gone we tied all the dma_buf refcounting to the refcount struct file already maintains. But the underlying buffer object can easily outlive the dma_buf, and over the lifetime of an underlying buffer object you might actually end up creating different dma_buf api wrappers for it (but at least in drm we guarantee there's at most one, hence why vmwgfx does the atomic_inc_unless_zero trick, which I don't particularly like and isn't really needed).

But we could add another refcount, it just means we have 3 of those then when only really 2 are needed.

Also maybe here two: dma_fence are bounded like other disk i/o (including the option of timeouts if things go very wrong), so it's very much not forever but at most a few seconds worst case (shit hw/driver excluded, as usual). -Sima

-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch

Christian König

4:15 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

Am 05.05.24 um 22:53 schrieb Linus Torvalds:

...

On Sun, 5 May 2024 at 13:30, Al Viro viro@zeniv.linux.org.uk wrote:

...
 special-cased ->f_count rule for ->poll() is a wart and it's
better to get rid of it.
 fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
git rm taken to it. Short of that, by all means, let's grab reference in there around the call of vfs_poll() (see (0)).
Agreed on 0/1.

...
 having ->poll() instances grab extra references to file passed
to them is not something that should be encouraged; there's a plenty of potential problems, and "caller has it pinned, so we are fine with grabbing extra refs" is nowhere near enough to eliminate those.
So it's not clear why you hate it so much, since those extra references are totally normal in all the other VFS paths.

Sorry to maybe jumping into the middle of the discussion, but for DMA-buf the behavior Al doesn't want is actually desired.

And I totally understand why Al is against it for file system based files, but for this case it's completely intentional.

Removing the callback on close is what we used to do a long time ago, but that turned out into a locking nightmare because it meant that we need to be able to wait for driver specific locks from whatever non interrupt context fput() is called from.

Regards, Christian.

...

I mean, they are perhaps not the *common* case, but we have a lot of random get_file() calls sprinkled around in various places when you end up passing a file descriptor off to some asynchronous operation thing.

Yeah, I think most of them tend to be special operations (eg the tty TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl() is *that* different from vfs_poll. Different operation, not somehow "one is more special than the other".

cachefiles and backing-file does it for regular IO, and drop it at IO completion - not that different from what dma-buf does. It's in ->read_iter() rather than ->poll(), but again: different operations, but not "one of them is somehow fundamentally different".

...
 dma-buf uses of get_file() are probably safe (epoll shite aside),
but they do look fishy. That has nothing to do with epoll.
Now, what dma-buf basically seems to do is to avoid ref-counting its own fundamental data structure, and replaces that by refcounting the 'struct file' that *points* to it instead.

And it is a bit odd, but it actually makes some amount of sense, because then what it passes around is that file pointer (and it allows passing it around from user space *as* that file).

And honestly, if you look at why it then needs to add its refcount to it all, it actually makes sense. dma-bufs have this notion of "fences" that are basically completion points for the asynchronous DMA. Doing a "poll()" operation will add a note to the fence to get that wakeup when it's done.

And yes, logically it takes a ref to the "struct dma_buf", but because of how the lifetime of the dma_buf is associated with the lifetime of the 'struct file', that then turns into taking a ref on the file.

Unusual? Yes. But not illogical. Not obviously broken. Tying the lifetime of the dma_buf to the lifetime of a file that is passed along makes _sense_ for that use.

I'm sure dma-bufs could add another level of refcounting on the 'struct dma_buf' itself, and not make it be 1:1 with the file, but it's not clear to me what the advantage would really be, or why it would be wrong to re-use a refcount that is already there.
              Linus

Linus Torvalds

4 May 4 May

6:20 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Sat, 4 May 2024 at 08:32, Linus Torvalds torvalds@linux-foundation.org wrote:

...

Lookie here, the fundamental issue is that epoll can call '->poll()' on a file descriptor that is being closed concurrently.

Thinking some more about this, and replying to myself...

Actually, I wonder if we could *really* fix this by simply moving the eventpoll_release() to where it really belongs.

If we did it in file_close_fd_locked(), it would actually make a *lot* more sense. Particularly since eventpoll actually uses this:

struct epoll_filefd { struct file *file; int fd; } __packed;

ie it doesn't just use the 'struct file *', it uses the 'fd' itself (for ep_find()).

(Strictly speaking, it should also have a pointer to the 'struct files_struct' to make the 'int fd' be meaningful).

IOW, eventpoll already considers the file _descriptor_ relevant, not just the file pointer, and that's destroyed at *close* time, not at 'fput()' time.

Yeah, yeah, the locking situation in file_close_fd_locked() is a bit inconvenient, but if we can solve that, it would solve the problem in a fundamentally different way: remove the ep iterm before the file->f_count has actually been decremented, so the whole "race with fput()" would just go away entirely.

I dunno. I think that would be the right thing to do, but I wouldn't be surprised if some disgusting eventpoll user then might depend on the current situation where the eventpoll thing stays around even after the close() if you have another copy of the file open.

Linus

Christian König

6 May 6 May

2:29 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

Am 04.05.24 um 20:20 schrieb Linus Torvalds:

...

On Sat, 4 May 2024 at 08:32, Linus Torvalds torvalds@linux-foundation.org wrote:

...
Lookie here, the fundamental issue is that epoll can call '->poll()' on a file descriptor that is being closed concurrently.

Thinking some more about this, and replying to myself...

Actually, I wonder if we could *really* fix this by simply moving the eventpoll_release() to where it really belongs.

If we did it in file_close_fd_locked(), it would actually make a *lot* more sense. Particularly since eventpoll actually uses this:
 struct epoll_filefd {
     struct file *file;
     int fd;
 } __packed;
ie it doesn't just use the 'struct file *', it uses the 'fd' itself (for ep_find()).

(Strictly speaking, it should also have a pointer to the 'struct files_struct' to make the 'int fd' be meaningful).

While I completely agree on this I unfortunately have to ruin the idea.

Before we had KCMP some people relied on the strange behavior of eventpoll to compare struct files when the fd is the same.

I just recently suggested that solution to somebody at AMD as a workaround when KCMP is disabled because of security hardening and I'm pretty sure I've seen it somewhere else as well.

So when we change that it would break (undocumented?) UAPI behavior.

Regards, Christian.

...

IOW, eventpoll already considers the file _descriptor_ relevant, not just the file pointer, and that's destroyed at *close* time, not at 'fput()' time.

Yeah, yeah, the locking situation in file_close_fd_locked() is a bit inconvenient, but if we can solve that, it would solve the problem in a fundamentally different way: remove the ep iterm before the file->f_count has actually been decremented, so the whole "race with fput()" would just go away entirely.

I dunno. I think that would be the right thing to do, but I wouldn't be surprised if some disgusting eventpoll user then might depend on the current situation where the eventpoll thing stays around even after the close() if you have another copy of the file open.
          Linus
Linaro-mm-sig mailing list -- linaro-mm-sig@lists.linaro.org To unsubscribe send an email to linaro-mm-sig-leave@lists.linaro.org

Daniel Vetter

7 May 7 May

11:02 a.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Mon, May 06, 2024 at 04:29:44PM +0200, Christian König wrote:

...

Am 04.05.24 um 20:20 schrieb Linus Torvalds:

...
On Sat, 4 May 2024 at 08:32, Linus Torvalds torvalds@linux-foundation.org wrote:

...
Lookie here, the fundamental issue is that epoll can call '->poll()' on a file descriptor that is being closed concurrently.

Thinking some more about this, and replying to myself...

Actually, I wonder if we could *really* fix this by simply moving the eventpoll_release() to where it really belongs.

If we did it in file_close_fd_locked(), it would actually make a *lot* more sense. Particularly since eventpoll actually uses this:
 struct epoll_filefd {
     struct file *file;
     int fd;
 } __packed;
ie it doesn't just use the 'struct file *', it uses the 'fd' itself (for ep_find()).

(Strictly speaking, it should also have a pointer to the 'struct files_struct' to make the 'int fd' be meaningful).
While I completely agree on this I unfortunately have to ruin the idea.

Before we had KCMP some people relied on the strange behavior of eventpoll to compare struct files when the fd is the same.

I just recently suggested that solution to somebody at AMD as a workaround when KCMP is disabled because of security hardening and I'm pretty sure I've seen it somewhere else as well.

So when we change that it would break (undocumented?) UAPI behavior.

Uh extremely aside, but doesn't this mean we should just enable kcmp on files unconditionally, since there's an alternative? Or a least everywhere CONFIG_EPOLL is enabled?

It's really annoying that on some distros/builds we don't have that, and for gpu driver stack reasons we _really_ need to know whether a fd is the same as another, due to some messy uniqueness requirements on buffer objects various drivers have. -Sima

...

Regards, Christian.

...
IOW, eventpoll already considers the file _descriptor_ relevant, not just the file pointer, and that's destroyed at *close* time, not at 'fput()' time.

Yeah, yeah, the locking situation in file_close_fd_locked() is a bit inconvenient, but if we can solve that, it would solve the problem in a fundamentally different way: remove the ep iterm before the file->f_count has actually been decremented, so the whole "race with fput()" would just go away entirely.

I dunno. I think that would be the right thing to do, but I wouldn't be surprised if some disgusting eventpoll user then might depend on the current situation where the eventpoll thing stays around even after the close() if you have another copy of the file open.
          Linus
Linaro-mm-sig mailing list -- linaro-mm-sig@lists.linaro.org To unsubscribe send an email to linaro-mm-sig-leave@lists.linaro.org

-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch

Linus Torvalds

4:46 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Tue, 7 May 2024 at 04:03, Daniel Vetter daniel@ffwll.ch wrote:

...

It's really annoying that on some distros/builds we don't have that, and for gpu driver stack reasons we _really_ need to know whether a fd is the same as another, due to some messy uniqueness requirements on buffer objects various drivers have.

It's sad that such a simple thing would require two other horrid models (EPOLL or KCMP).

There'[s a reason that KCMP is a config option - *some* of that is horrible code - but the "compare file descriptors for equality" is not that reason.

Note that KCMP really is a broken mess. It's also a potential security hole, even for the simple things, because of how it ends up comparing kernel pointers (ie it doesn't just say "same file descriptor", it gives an ordering of them, so you can use KCMP to sort things in kernel space).

And yes, it orders them after obfuscating the pointer, but it's still not something I would consider sane as a baseline interface. It was designed for checkpoint-restore, it's the wrong thing to use for some "are these file descriptors the same".

The same argument goes for using EPOLL for that. Disgusting hack.

Just what are the requirements for the GPU stack? Is one of the file descriptors "trusted", IOW, you know what kind it is?

Because dammit, it's *so* easy to do. You could just add a core DRM ioctl for it. Literally just

struct fd f1 = fdget(fd1); struct fd f2 = fdget(fd2); int same;

same = f1.file && f1.file == f2.file; fdput(fd1); fdput(fd2); return same;

where the only question is if you also woudl want to deal with O_PATH fd's, in which case the "fdget()" would be "fdget_raw()".

Honestly, adding some DRM ioctl for this sounds hacky, but it sounds less hacky than relying on EPOLL or KCMP.

I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it.

Would something like that work for you?

Linus

Christian König

5:45 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

Am 07.05.24 um 18:46 schrieb Linus Torvalds:

...

On Tue, 7 May 2024 at 04:03, Daniel Vetter daniel@ffwll.ch wrote:

...
It's really annoying that on some distros/builds we don't have that, and for gpu driver stack reasons we _really_ need to know whether a fd is the same as another, due to some messy uniqueness requirements on buffer objects various drivers have.

It's sad that such a simple thing would require two other horrid models (EPOLL or KCMP).

There'[s a reason that KCMP is a config option - *some* of that is horrible code - but the "compare file descriptors for equality" is not that reason.

Note that KCMP really is a broken mess. It's also a potential security hole, even for the simple things, because of how it ends up comparing kernel pointers (ie it doesn't just say "same file descriptor", it gives an ordering of them, so you can use KCMP to sort things in kernel space).

And yes, it orders them after obfuscating the pointer, but it's still not something I would consider sane as a baseline interface. It was designed for checkpoint-restore, it's the wrong thing to use for some "are these file descriptors the same".

The same argument goes for using EPOLL for that. Disgusting hack.

Just what are the requirements for the GPU stack? Is one of the file descriptors "trusted", IOW, you know what kind it is?

Because dammit, it's *so* easy to do. You could just add a core DRM ioctl for it. Literally just
     struct fd f1 = fdget(fd1);
     struct fd f2 = fdget(fd2);
     int same;

     same = f1.file && f1.file == f2.file;
     fdput(fd1);
     fdput(fd2);
     return same;
where the only question is if you also woudl want to deal with O_PATH fd's, in which case the "fdget()" would be "fdget_raw()".

Honestly, adding some DRM ioctl for this sounds hacky, but it sounds less hacky than relying on EPOLL or KCMP.

I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it.

Would something like that work for you?

Well the generic approach yes, the DRM specific one maybe. IIRC we need to be able to compare both DRM as well as DMA-buf file descriptors.

The basic problem userspace tries to solve is that drivers might get the same fd through two different code paths.

For example application using OpenGL/Vulkan for rendering and VA-API for video decoding/encoding at the same time.

Both APIs get a fd which identifies the device to use. It can be the same, but it doesn't have to.

If it's the same device driver connection (or in kernel speak underlying struct file) then you can optimize away importing and exporting of buffers for example.

Additional to that it makes cgroup accounting much easier because you don't count things twice because they are shared etc...

Regards, Christian.

...

              Linus

Michel Dänzer

8 May 8 May

7:51 a.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On 2024-05-07 19:45, Christian König wrote:

...

Am 07.05.24 um 18:46 schrieb Linus Torvalds:

...
Just what are the requirements for the GPU stack? Is one of the file descriptors "trusted", IOW, you know what kind it is?

Because dammit, it's *so* easy to do. You could just add a core DRM ioctl for it. Literally just

struct fd f1 = fdget(fd1); struct fd f2 = fdget(fd2); int same;

same = f1.file && f1.file == f2.file; fdput(fd1); fdput(fd2); return same;

where the only question is if you also woudl want to deal with O_PATH fd's, in which case the "fdget()" would be "fdget_raw()".

Honestly, adding some DRM ioctl for this sounds hacky, but it sounds less hacky than relying on EPOLL or KCMP.

I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it.

Would something like that work for you?

Well the generic approach yes, the DRM specific one maybe. IIRC we need to be able to compare both DRM as well as DMA-buf file descriptors.

The basic problem userspace tries to solve is that drivers might get the same fd through two different code paths.

For example application using OpenGL/Vulkan for rendering and VA-API for video decoding/encoding at the same time.

Both APIs get a fd which identifies the device to use. It can be the same, but it doesn't have to.

If it's the same device driver connection (or in kernel speak underlying struct file) then you can optimize away importing and exporting of buffers for example.

It's not just about optimization. Mesa needs to know this for correct tracking of GEM handles. If it guesses incorrectly, there can be misbehaviour.

-- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and Xwayland developer

Christian König

7:59 a.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

Am 08.05.24 um 09:51 schrieb Michel Dänzer:

...

On 2024-05-07 19:45, Christian König wrote:

...
Am 07.05.24 um 18:46 schrieb Linus Torvalds:

...
Just what are the requirements for the GPU stack? Is one of the file descriptors "trusted", IOW, you know what kind it is?

Because dammit, it's *so* easy to do. You could just add a core DRM ioctl for it. Literally just

struct fd f1 = fdget(fd1); struct fd f2 = fdget(fd2); int same;

same = f1.file && f1.file == f2.file; fdput(fd1); fdput(fd2); return same;

where the only question is if you also woudl want to deal with O_PATH fd's, in which case the "fdget()" would be "fdget_raw()".

Honestly, adding some DRM ioctl for this sounds hacky, but it sounds less hacky than relying on EPOLL or KCMP.

I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it.

Would something like that work for you?

Well the generic approach yes, the DRM specific one maybe. IIRC we need to be able to compare both DRM as well as DMA-buf file descriptors.

The basic problem userspace tries to solve is that drivers might get the same fd through two different code paths.

For example application using OpenGL/Vulkan for rendering and VA-API for video decoding/encoding at the same time.

Both APIs get a fd which identifies the device to use. It can be the same, but it doesn't have to.

If it's the same device driver connection (or in kernel speak underlying struct file) then you can optimize away importing and exporting of buffers for example.

It's not just about optimization. Mesa needs to know this for correct tracking of GEM handles. If it guesses incorrectly, there can be misbehaviour.

Oh, yeah good point as well.

I think we can say in general that if two userspace driver libraries would mess with the state of an fd at the same time without knowing of each other bad things would happen.

Regards, Christian.

Daniel Vetter

7 May 7 May

6:04 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote:

...

On Tue, 7 May 2024 at 04:03, Daniel Vetter daniel@ffwll.ch wrote:

...
It's really annoying that on some distros/builds we don't have that, and for gpu driver stack reasons we _really_ need to know whether a fd is the same as another, due to some messy uniqueness requirements on buffer objects various drivers have.

It's sad that such a simple thing would require two other horrid models (EPOLL or KCMP).

There'[s a reason that KCMP is a config option - *some* of that is horrible code - but the "compare file descriptors for equality" is not that reason.

Note that KCMP really is a broken mess. It's also a potential security hole, even for the simple things, because of how it ends up comparing kernel pointers (ie it doesn't just say "same file descriptor", it gives an ordering of them, so you can use KCMP to sort things in kernel space).

And yes, it orders them after obfuscating the pointer, but it's still not something I would consider sane as a baseline interface. It was designed for checkpoint-restore, it's the wrong thing to use for some "are these file descriptors the same".

The same argument goes for using EPOLL for that. Disgusting hack.

Just what are the requirements for the GPU stack? Is one of the file descriptors "trusted", IOW, you know what kind it is?

Because dammit, it's *so* easy to do. You could just add a core DRM ioctl for it. Literally just
    struct fd f1 = fdget(fd1);
    struct fd f2 = fdget(fd2);
    int same;

    same = f1.file && f1.file == f2.file;
    fdput(fd1);
    fdput(fd2);
    return same;
where the only question is if you also woudl want to deal with O_PATH fd's, in which case the "fdget()" would be "fdget_raw()".

Honestly, adding some DRM ioctl for this sounds hacky, but it sounds less hacky than relying on EPOLL or KCMP.

Well, in slightly more code (because it's part of the "import this dma-buf/dma-fence/whatever fd into a driver object" ioctl) this is what we do.

The issue is that there's generic userspace (like compositors) that sees these things fly by and would also like to know whether the other side they receive them from is doing nasty stuff/buggy/evil. And they don't have access to the device drm fd (since those are a handful of layers away behind the opengl/vulkan userspace drivers even if the compositor could get at them, and in some cases not even that).

So if we do this in drm we'd essentially have to create a special drm_compare_files chardev, put the ioctl there and then tell everyone to make that thing world-accessible.

Which is just too close to a real syscall that it's offensive, and hey kcmp does what we want already (but unfortunately also way more). So we rejected adding that to drm. But we did think about it.

...

I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it.

Would something like that work for you?

Yes.

Adding Simon and Pekka as two of the usual suspects for this kind of stuff. Also example code (the int return value is just so that callers know when kcmp isn't available, they all only care about equality):

https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239 -Sima

-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch

Linus Torvalds

7:07 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Tue, 7 May 2024 at 11:04, Daniel Vetter daniel@ffwll.ch wrote:

...

On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote:

...
I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it.

Would something like that work for you?

Yes.

Adding Simon and Pekka as two of the usual suspects for this kind of stuff. Also example code (the int return value is just so that callers know when kcmp isn't available, they all only care about equality):

https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239

That example thing shows that we shouldn't make it a FISAME ioctl - we should make it a fcntl() instead, and it would just be a companion to F_DUPFD.

Doesn't that strike everybody as a *much* cleaner interface? I think F_ISDUP would work very naturally indeed with F_DUPFD.

Yes? No?

Linus

Christian König

8 May 8 May

5:55 a.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

Am 07.05.24 um 21:07 schrieb Linus Torvalds:

...

On Tue, 7 May 2024 at 11:04, Daniel Vetter daniel@ffwll.ch wrote:

...
On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote:

...
I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it.

Would something like that work for you?

Yes.

Adding Simon and Pekka as two of the usual suspects for this kind of stuff. Also example code (the int return value is just so that callers know when kcmp isn't available, they all only care about equality):

https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239

That example thing shows that we shouldn't make it a FISAME ioctl - we should make it a fcntl() instead, and it would just be a companion to F_DUPFD.

Doesn't that strike everybody as a *much* cleaner interface? I think F_ISDUP would work very naturally indeed with F_DUPFD.

Yes? No?

Sounds absolutely sane to me.

Christian.

...

                    Linus

Daniel Vetter

8:32 a.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Wed, May 08, 2024 at 07:55:08AM +0200, Christian König wrote:

...

Am 07.05.24 um 21:07 schrieb Linus Torvalds:

...
On Tue, 7 May 2024 at 11:04, Daniel Vetter daniel@ffwll.ch wrote:

...
On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote:

...
I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it.

Would something like that work for you?

Yes.

Adding Simon and Pekka as two of the usual suspects for this kind of stuff. Also example code (the int return value is just so that callers know when kcmp isn't available, they all only care about equality):

https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239

That example thing shows that we shouldn't make it a FISAME ioctl - we should make it a fcntl() instead, and it would just be a companion to F_DUPFD.

Doesn't that strike everybody as a *much* cleaner interface? I think F_ISDUP would work very naturally indeed with F_DUPFD.

Yes? No?

Sounds absolutely sane to me.

Yeah fcntl(fd1, F_ISDUP, fd2); sounds extremely reasonable to me too.

Aside, after some irc discussions I paged a few more of the relevant info back in, and at least for dma-buf we kinda sorted this out by going away from the singleton inode in this patch: ed63bb1d1f84 ("dma-buf: give each buffer a full-fledged inode")

It's uapi now so we can't ever undo that, but with hindsight just the F_ISDUP is really what we wanted. Because we have no need for that inode aside from the unique inode number that's only used to compare dma-buf fd for sameness, e.g.

https://gitlab.freedesktop.org/wlroots/wlroots/-/blob/master/render/vulkan/t...

The one question I have is whether this could lead to some exploit tools, because at least the android conformance test suite verifies that kcmp isn't available to apps (which is where we need it, because even with all the binder-based isolation gpu userspace still all run in the application process due to performance reasons, any ipc at all is just too much).

Otoh if we just add this to drm fd as an ioctl somewhere, then it will also be available to every android app because they all do need the gpu for rendering. So going with the full generic fcntl is probably best. -Sima

-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch

Linus Torvalds

4:19 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Tue, 7 May 2024 at 12:07, Linus Torvalds torvalds@linux-foundation.org wrote:

...

That example thing shows that we shouldn't make it a FISAME ioctl - we should make it a fcntl() instead, and it would just be a companion to F_DUPFD.

Doesn't that strike everybody as a *much* cleaner interface? I think F_ISDUP would work very naturally indeed with F_DUPFD.

So since we already have two versions of F_DUPFD (the other being F_DUPFD_CLOEXEC) I decided that the best thing to do is to just extend on that existing naming pattern, and called it F_DUPFD_QUERY instead.

I'm not married to the name, so if somebody hates it, feel free to argue otherwise.

But with that, the suggested patch would end up looking something like the attached (I also re-ordered the existing "F_LINUX_SPECIFIC_BASE" users, since one of them was out of numerical order).

This really feels like a very natural thing, and yes, the 'same_fd()' function in systemd that Christian also pointed at could use this very naturally.

Also note that I obviously haven't tested this. Because obviously this is trivially correct and cannot possibly have any bugs. Right? RIGHT?

And yes, I did check - despite the odd jump in numbers, we've never had anything between F_NOTIFY (+2) and F_CANCELLK (+5).

We added F_SETLEASE (+0) , F_GETLEASE (+1) and F_NOTIFY (+2) in 2.4.0-test9 (roughly October 2000, I didn't dig deeper).

And then back in 2007 we suddenly jumped to F_CANCELLK (+5) in commit 9b9d2ab4154a ("locks: add lock cancel command"). I don't know why 3/4 were shunned.

After that we had 22d2b35b200f ("F_DUPFD_CLOEXEC implementation") add F_DUPFD_CLOEXEC (+6).

I'd have loved to put F_DUPFD_QUERY next to it, but +5 and +7 are both used.

Linus

Linus Torvalds

5:14 p.m.

New subject: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Wed, 8 May 2024 at 09:19, Linus Torvalds torvalds@linux-foundation.org wrote:

...

So since we already have two versions of F_DUPFD (the other being F_DUPFD_CLOEXEC) I decided that the best thing to do is to just extend on that existing naming pattern, and called it F_DUPFD_QUERY instead.

I'm not married to the name, so if somebody hates it, feel free to argue otherwise.

Side note: with this patch, doing

ret = fcntl(fd1, F_DUPFD_QUERY, fd2);

will result in:

-1 (EBADF): 'fd1' is not a valid file descriptor -1 (EINVAL): old kernel that doesn't support F_DUPFD_QUERY 0: fd2 does not refer to the same file as fd1 1: fd2 is the same 'struct file' as fd1

and it might be worth noting a couple of things here:

(a) fd2 being an invalid file descriptor does not cause EBADF, it just causes "does not match".

(b) we *could* use more bits for more equality

IOW, it would possibly make sense to extend the 0/1 result to be

- bit #0: same file pointer - bit #1: same path - bit #2: same dentry - bit #3: same inode

which are all different levels of "sameness".

Does anybody care? Do we want to extend on this "sameness"? I'm not convinced, but it might be a good idea to document this as a possibly future extension, ie *if* what you care about is "same file pointer", maybe you should make sure to only look at bit #0.

Linus

424

days inactive

428

days old

linaro-mm-sig@lists.linaro.org

21 comments

participants

tags (0)

participants (6)

Al Viro
Christian König
Christian König
Daniel Vetter
Linus Torvalds
Michel Dänzer