On Mon, May 06, 2024 at 02:05:12PM +0200, Maxime Ripard wrote:
> Hi,
>
> On Mon, May 06, 2024 at 01:49:17PM GMT, Hans de Goede wrote:
> > Hi dma-buf maintainers, et.al.,
> >
> > Various people have been working on making complex/MIPI cameras work OOTB
> > with mainline Linux kernels and an opensource userspace stack.
> >
> > The generic solution adds a software ISP (for Debayering and 3A) to
> > libcamera. Libcamera's API guarantees that buffers handed to applications
> > using it are dma-bufs so that these can be passed to e.g. a video encoder.
> >
> > In order to meet this API guarantee the libcamera software ISP allocates
> > dma-bufs from userspace through one of the /dev/dma_heap/* heaps. For
> > the Fedora COPR repo for the PoC of this:
> > https://hansdegoede.dreamwidth.org/28153.html
>
> For the record, we're also considering using them for ARM KMS devices,
> so it would be better if the solution wasn't only considering v4l2
> devices.
>
> > I have added a simple udev rule to give physically present users access
> > to the dma_heap-s:
> >
> > KERNEL=="system", SUBSYSTEM=="dma_heap", TAG+="uaccess"
> >
> > (and on Rasperry Pi devices any users in the video group get access)
> >
> > This was just a quick fix for the PoC. Now that we are ready to move out
> > of the PoC phase and start actually integrating this into distributions
> > the question becomes if this is an acceptable solution; or if we need some
> > other way to deal with this ?
> >
> > Specifically the question is if this will have any negative security
> > implications? I can certainly see this being used to do some sort of
> > denial of service attack on the system (1). This is especially true for
> > the cma heap which generally speaking is a limited resource.
>
> There's plenty of other ways to exhaust CMA, like allocating too much
> KMS or v4l2 buffers. I'm not sure we should consider dma-heaps
> differently than those if it's part of our threat model.
So generally for an arm soc where your display needs cma, your render node
doesn't. And user applications only have access to the later, while only
the compositor gets a kms fd through logind. At least in drm aside from
vc4 there's really no render driver that just gives you access to cma and
allows you to exhaust that, you need to be a compositor with drm master
access to the display.
Which means we're mostly protected against bad applications, and that's
not a threat the "user physically sits in front of the machine accounts
for", and which giving cma access to everyone would open up. And with
flathub/snaps/... this is very much an issue.
So you need more, either:
- cgroups limits on dma-buf and dma-buf heaps. This has been bikeshedded
for years and is just not really moving.
- An allocator service which checks whether you're allowed to allocate
these special buffers. Android does that through binder.
Probably also some way to nuke applications that refuse to release buffers
when they're no longer the right application. cgroups is a lot more
convenient for that.
-Sima
> > But devices tagged for uaccess are only opened up to users who are
> > physcially present behind the machine and those can just hit
> > the powerbutton, so I don't believe that any *on purpose* DOS is part of
> > the thread model.
>
> How would that work for headless devices?
>
> Maxime
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Tue, May 07, 2024 at 04:15:05PM +0100, Bryan O'Donoghue wrote:
> On 07/05/2024 16:09, Dmitry Baryshkov wrote:
> > Ah, I see. Then why do you require the DMA-ble buffer at all? If you are
> > providing data to VPU or DRM, then you should be able to get the buffer
> > from the data-consuming device.
>
> Because we don't necessarily know what the consuming device is, if any.
Well ... that's an entirely different issue. And it's unsolved.
Currently the approach is to allocate where the constraints are usually
most severe (like display, if you need that, or the camera module for
input) and then just pray the stack works out without too much copying.
All userspace (whether the generic glue or the userspace driver depends a
bit upon the exact api) does need to have a copy fallback for these
sharing cases, ideally with the copying accelerated by hw.
If you try to solve this by just preemptive allocating everything as cma
buffers, then you'll make the situation substantially worse (because now
you're wasting tons of cma memory where you might not even need it).
And without really solving the problem, since for some gpus that memory
might be unusable (because you cannot scan that out on any discrete gpu,
and sometimes not even on an integrated one).
-Sima
> Could be VPU, could be Zoom/Hangouts via pipewire, could for argument sake
> be GPU or DSP.
>
> Also if we introduce a dependency on another device to allocate the output
> buffers - say always taking the output buffer from the GPU, then we've added
> another dependency which is more difficult to guarantee across different
> arches.
>
> ---
> bod
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Tue, May 07, 2024 at 04:34:24PM +0200, Hans de Goede wrote:
> Hi Dmitry,
>
> On 5/7/24 3:32 PM, Dmitry Baryshkov wrote:
> > On Mon, May 06, 2024 at 01:49:17PM +0200, Hans de Goede wrote:
> >> Hi dma-buf maintainers, et.al.,
> >>
> >> Various people have been working on making complex/MIPI cameras work OOTB
> >> with mainline Linux kernels and an opensource userspace stack.
> >>
> >> The generic solution adds a software ISP (for Debayering and 3A) to
> >> libcamera. Libcamera's API guarantees that buffers handed to applications
> >> using it are dma-bufs so that these can be passed to e.g. a video encoder.
> >>
> >> In order to meet this API guarantee the libcamera software ISP allocates
> >> dma-bufs from userspace through one of the /dev/dma_heap/* heaps. For
> >> the Fedora COPR repo for the PoC of this:
> >> https://hansdegoede.dreamwidth.org/28153.html
> >
> > Is there any reason for allocating DMA buffers for libcamera through
> > /dev/dma_heap/ rather than allocating them via corresponding media
> > device and then giving them away to DRM / VPU / etc?
> >
> > At least this should solve the permission usecase: if the app can open
> > camera device, it can allocate a buffer.
>
> This is with a software ISP, the input buffers with raw bayer data
> come from a v4l2 device, but the output buffers with the processed
> data are purely userspace managed in this case.
Ah, I see. Then why do you require the DMA-ble buffer at all? If you are
providing data to VPU or DRM, then you should be able to get the buffer
from the data-consuming device. If the data is further processed by
a userspace app, then it should not require DMA memory at all.
My main concern is that dma-heaps is both too generic and too limiting
for the generic library implementation.
--
With best wishes
Dmitry
On Mon, May 06, 2024 at 04:01:42PM +0200, Hans de Goede wrote:
> Hi Sima,
>
> On 5/6/24 3:38 PM, Daniel Vetter wrote:
> > On Mon, May 06, 2024 at 02:05:12PM +0200, Maxime Ripard wrote:
> >> Hi,
> >>
> >> On Mon, May 06, 2024 at 01:49:17PM GMT, Hans de Goede wrote:
> >>> Hi dma-buf maintainers, et.al.,
> >>>
> >>> Various people have been working on making complex/MIPI cameras work OOTB
> >>> with mainline Linux kernels and an opensource userspace stack.
> >>>
> >>> The generic solution adds a software ISP (for Debayering and 3A) to
> >>> libcamera. Libcamera's API guarantees that buffers handed to applications
> >>> using it are dma-bufs so that these can be passed to e.g. a video encoder.
> >>>
> >>> In order to meet this API guarantee the libcamera software ISP allocates
> >>> dma-bufs from userspace through one of the /dev/dma_heap/* heaps. For
> >>> the Fedora COPR repo for the PoC of this:
> >>> https://hansdegoede.dreamwidth.org/28153.html
> >>
> >> For the record, we're also considering using them for ARM KMS devices,
> >> so it would be better if the solution wasn't only considering v4l2
> >> devices.
> >>
> >>> I have added a simple udev rule to give physically present users access
> >>> to the dma_heap-s:
> >>>
> >>> KERNEL=="system", SUBSYSTEM=="dma_heap", TAG+="uaccess"
> >>>
> >>> (and on Rasperry Pi devices any users in the video group get access)
> >>>
> >>> This was just a quick fix for the PoC. Now that we are ready to move out
> >>> of the PoC phase and start actually integrating this into distributions
> >>> the question becomes if this is an acceptable solution; or if we need some
> >>> other way to deal with this ?
> >>>
> >>> Specifically the question is if this will have any negative security
> >>> implications? I can certainly see this being used to do some sort of
> >>> denial of service attack on the system (1). This is especially true for
> >>> the cma heap which generally speaking is a limited resource.
> >>
> >> There's plenty of other ways to exhaust CMA, like allocating too much
> >> KMS or v4l2 buffers. I'm not sure we should consider dma-heaps
> >> differently than those if it's part of our threat model.
> >
> > So generally for an arm soc where your display needs cma, your render node
> > doesn't. And user applications only have access to the later, while only
> > the compositor gets a kms fd through logind. At least in drm aside from
> > vc4 there's really no render driver that just gives you access to cma and
> > allows you to exhaust that, you need to be a compositor with drm master
> > access to the display.
> >
> > Which means we're mostly protected against bad applications, and that's
> > not a threat the "user physically sits in front of the machine accounts
> > for", and which giving cma access to everyone would open up. And with
> > flathub/snaps/... this is very much an issue.
>
> I agree that bad applications are an issue, but not for the flathub / snaps
> case. Flatpacks / snaps run sandboxed and don't have access to a full /dev
> so those should not be able to open /dev/dma_heap/* independent of
> the ACLs on /dev/dma_heap/*. The plan is for cameras using the
> libcamera software ISP to always be accessed through pipewire and
> the camera portal, so in this case pipewere is taking the place of
> the compositor in your kms vs render node example.
>
> So this reduces the problem to bad apps packaged by regular distributions
> and if any of those misbehave the distros should fix that.
>
> So I think that for the denial of service side allowing physical
> present users (but not sandboxed apps running as those users) to
> access /dev/dma_heap/* should be ok.
What about an app built by the user? The machines can still be
multi-seat or have multiple users.
> My bigger worry is if dma_heap (u)dma-bufs can be abused in other
> ways then causing a denial of service.
>
> I guess that the answer there is that causing other security issues
> should not be possible ?
>
> Regards,
>
> Hans
>
--
With best wishes
Dmitry
On Mon, May 06, 2024 at 01:49:17PM +0200, Hans de Goede wrote:
> Hi dma-buf maintainers, et.al.,
>
> Various people have been working on making complex/MIPI cameras work OOTB
> with mainline Linux kernels and an opensource userspace stack.
>
> The generic solution adds a software ISP (for Debayering and 3A) to
> libcamera. Libcamera's API guarantees that buffers handed to applications
> using it are dma-bufs so that these can be passed to e.g. a video encoder.
>
> In order to meet this API guarantee the libcamera software ISP allocates
> dma-bufs from userspace through one of the /dev/dma_heap/* heaps. For
> the Fedora COPR repo for the PoC of this:
> https://hansdegoede.dreamwidth.org/28153.html
Is there any reason for allocating DMA buffers for libcamera through
/dev/dma_heap/ rather than allocating them via corresponding media
device and then giving them away to DRM / VPU / etc?
At least this should solve the permission usecase: if the app can open
camera device, it can allocate a buffer.
> I have added a simple udev rule to give physically present users access
> to the dma_heap-s:
>
> KERNEL=="system", SUBSYSTEM=="dma_heap", TAG+="uaccess"
>
> (and on Rasperry Pi devices any users in the video group get access)
>
> This was just a quick fix for the PoC. Now that we are ready to move out
> of the PoC phase and start actually integrating this into distributions
> the question becomes if this is an acceptable solution; or if we need some
> other way to deal with this ?
>
> Specifically the question is if this will have any negative security
> implications? I can certainly see this being used to do some sort of
> denial of service attack on the system (1). This is especially true for
> the cma heap which generally speaking is a limited resource.
>
> But devices tagged for uaccess are only opened up to users who are
> physcially present behind the machine and those can just hit
> the powerbutton, so I don't believe that any *on purpose* DOS is part of
> the thread model. Any accidental DOS would be a userspace stack bug.
>
> Do you foresee any other negative security implications from allowing
> physically present non root users to create (u)dma-bufs ?
>
> Regards,
>
> Hans
>
>
> 1) There are some limits in drivers/dma-buf/udmabuf.c and distributions
> could narrow these.
>
>
--
With best wishes
Dmitry
On Mon, May 06, 2024 at 04:46:54PM +0200, Christian Brauner wrote:
> On Mon, May 06, 2024 at 02:47:23PM +0200, Daniel Vetter wrote:
> > On Sun, May 05, 2024 at 01:53:48PM -0700, Linus Torvalds wrote:
> > > On Sun, 5 May 2024 at 13:30, Al Viro <viro(a)zeniv.linux.org.uk> wrote:
> > > >
> > > > 0. special-cased ->f_count rule for ->poll() is a wart and it's
> > > > better to get rid of it.
> > > >
> > > > 1. fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
> > > > git rm taken to it. Short of that, by all means, let's grab reference
> > > > in there around the call of vfs_poll() (see (0)).
> > >
> > > Agreed on 0/1.
> > >
> > > > 2. having ->poll() instances grab extra references to file passed
> > > > to them is not something that should be encouraged; there's a plenty
> > > > of potential problems, and "caller has it pinned, so we are fine with
> > > > grabbing extra refs" is nowhere near enough to eliminate those.
> > >
> > > So it's not clear why you hate it so much, since those extra
> > > references are totally normal in all the other VFS paths.
> > >
> > > I mean, they are perhaps not the *common* case, but we have a lot of
> > > random get_file() calls sprinkled around in various places when you
> > > end up passing a file descriptor off to some asynchronous operation
> > > thing.
> > >
> > > Yeah, I think most of them tend to be special operations (eg the tty
> > > TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl()
> > > is *that* different from vfs_poll. Different operation, not somehow
> > > "one is more special than the other".
> > >
> > > cachefiles and backing-file does it for regular IO, and drop it at IO
> > > completion - not that different from what dma-buf does. It's in
> > > ->read_iter() rather than ->poll(), but again: different operations,
> > > but not "one of them is somehow fundamentally different".
> > >
> > > > 3. dma-buf uses of get_file() are probably safe (epoll shite aside),
> > > > but they do look fishy. That has nothing to do with epoll.
> > >
> > > Now, what dma-buf basically seems to do is to avoid ref-counting its
> > > own fundamental data structure, and replaces that by refcounting the
> > > 'struct file' that *points* to it instead.
> > >
> > > And it is a bit odd, but it actually makes some amount of sense,
> > > because then what it passes around is that file pointer (and it allows
> > > passing it around from user space *as* that file).
> > >
> > > And honestly, if you look at why it then needs to add its refcount to
> > > it all, it actually makes sense. dma-bufs have this notion of
> > > "fences" that are basically completion points for the asynchronous
> > > DMA. Doing a "poll()" operation will add a note to the fence to get
> > > that wakeup when it's done.
> > >
> > > And yes, logically it takes a ref to the "struct dma_buf", but because
> > > of how the lifetime of the dma_buf is associated with the lifetime of
> > > the 'struct file', that then turns into taking a ref on the file.
> > >
> > > Unusual? Yes. But not illogical. Not obviously broken. Tying the
> > > lifetime of the dma_buf to the lifetime of a file that is passed along
> > > makes _sense_ for that use.
> > >
> > > I'm sure dma-bufs could add another level of refcounting on the
> > > 'struct dma_buf' itself, and not make it be 1:1 with the file, but
> > > it's not clear to me what the advantage would really be, or why it
> > > would be wrong to re-use a refcount that is already there.
> >
> > So there is generally another refcount, because dma_buf is just the
> > cross-driver interface to some kind of real underlying buffer object from
> > the various graphics related subsystems we have.
> >
> > And since it's a pure file based api thing that ceases to serve any
> > function once the fd/file is gone we tied all the dma_buf refcounting to
> > the refcount struct file already maintains. But the underlying buffer
> > object can easily outlive the dma_buf, and over the lifetime of an
> > underlying buffer object you might actually end up creating different
> > dma_buf api wrappers for it (but at least in drm we guarantee there's at
> > most one, hence why vmwgfx does the atomic_inc_unless_zero trick, which I
> > don't particularly like and isn't really needed).
> >
> > But we could add another refcount, it just means we have 3 of those then
> > when only really 2 are needed.
>
> Fwiw, the TTM thing described upthread and in the other thread really
> tries hard to work around the dma_buf == file lifetime choice by hooking
> into the dma-buf specific release function so it can access the dmabuf
> and then the file. All that seems like a pretty error prone thing to me.
> So a separate refcount for dma_buf wouldn't be the worst as that would
> allow that TTM thing to benefit and remove that nasty hacking into your
> generic dma_buf ops. But maybe I'm the only one who sees it that way and
> I'm certainly not familiar enough with dma-buf.
So the tricky part is the uniqueness requirement drm has for buffer
objects (and hence dma_buf wrappers), which together with the refcounting
makes dma_buf quite tricky:
- dma_buf needs to hold some reference onto the underlying object, or it
wont work
- but you're not allowed to just create a new dma_buf every time someone
exports an underlying object to a dma_buf, because that would break the
uniqueness requirement. Which means the underlying object must also hold
some kind of reference to its dma_buf, if it exists. So that on buffer
export it can just increment the refcount for that and return it,
instead of creating a new one.
Which would be a reference loop that never gets freed, so you need one of
two tricks:
- Either a weak reference, i.e. just a pointer plus
atomic_inc_unless_zero trickery like ttm does. Splitting that refcount
into more refcounts doesn't fundamentally solve the problem, it just
adds even more refcounts.
- Or you do what all other drm drivers do in drm_prime.c do and careful
clean up the dma_buf re-export cache when the userspace references (but
not all kernel internal ones) disappear, to unbreak that reference loop.
This needs to be done with extreme care and took a lot of screaming to
get right, because if you have a race you might end up breaking the
uniqueness requirement and have two dma_buf floating around.
So neither of these solutions really are simple, but I agree with you that
the atomic_inc_unless_zero trickery is less simple. It's definitely not
cool that it's done by digging around in struct file internals.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Fri, May 03, 2024 at 06:54:22PM +0700, Bui Quang Minh wrote:
> [...]
> Root cause:
> AFAIK, eventpoll (epoll) does not increase the registered file's reference.
> To ensure the safety, when the registered file is deallocated in __fput,
> eventpoll_release is called to unregister the file from epoll. When calling
> poll on epoll, epoll will loop through registered files and call vfs_poll on
> these files. In the file's poll, file is guaranteed to be alive, however, as
> epoll does not increase the registered file's reference, the file may be
> dying
> and it's not safe the get the file for later use. And dma_buf_poll violates
> this. In the dma_buf_poll, it tries to get_file to use in the callback. This
> leads to a race where the dmabuf->file can be fput twice.
>
> Here is the race occurs in the above proof-of-concept
>
> close(dmabuf->file)
> __fput_sync (f_count == 1, last ref)
> f_count-- (f_count == 0 now)
> __fput
> epoll_wait
> vfs_poll(dmabuf->file)
> get_file(dmabuf->file)(f_count == 1)
> eventpoll_release
> dmabuf->file deallocation
> fput(dmabuf->file) (f_count == 1)
> f_count--
> dmabuf->file deallocation
>
> I am not familiar with the dma_buf so I don't know the proper fix for the
> issue. About the rule that don't get the file for later use in poll callback
> of
> file, I wonder if it is there when only select/poll exist or just after
> epoll
> appears.
>
> I hope the analysis helps us to fix the issue.
Thanks for doing this analysis! I suspect at least a start of a fix
would be this:
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..15e8f74ee0f2 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
if (events & EPOLLOUT) {
/* Paired with fput in dma_buf_poll_cb */
- get_file(dmabuf->file);
-
- if (!dma_buf_poll_add_cb(resv, true, dcb))
+ if (!atomic_long_inc_not_zero(&dmabuf->file) &&
+ !dma_buf_poll_add_cb(resv, true, dcb))
/* No callback queued, wake up any other waiters */
dma_buf_poll_cb(NULL, &dcb->cb);
else
@@ -290,9 +289,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
if (events & EPOLLIN) {
/* Paired with fput in dma_buf_poll_cb */
- get_file(dmabuf->file);
-
- if (!dma_buf_poll_add_cb(resv, false, dcb))
+ if (!atomic_long_inc_not_zero(&dmabuf->file) &&
+ !dma_buf_poll_add_cb(resv, false, dcb))
/* No callback queued, wake up any other waiters */
dma_buf_poll_cb(NULL, &dcb->cb);
else
But this ends up leaving "active" non-zero, and at close time it runs
into:
BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
But the bottom line is that get_file() is unsafe to use in some places,
one of which appears to be in the poll handler. There are maybe some
other fragile places too, like in drivers/gpu/drm/vmwgfx/ttm_object.c:
static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
{
return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
}
Which I also note involves a dmabuf...
Due to this issue I've proposed fixing get_file() to detect pathological states:
https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/
But that has run into some push-back. I'm hoping that seeing this epoll
example will help illustrate what needs fixing a little better.
I think the best current proposal is to just WARN sooner instead of a
full refcount_t implementation:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8dfd53b52744..e09107d0a3d6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1040,7 +1040,8 @@ struct file_handle {
static inline struct file *get_file(struct file *f)
{
- atomic_long_inc(&f->f_count);
+ long prior = atomic_long_fetch_inc_relaxed(&f->f_count);
+ WARN_ONCE(!prior, "struct file::f_count incremented from zero; use-after-free condition present!\n");
return f;
}
What's the right way to deal with the dmabuf situation? (And I suspect
it applies to get_dma_buf_unless_doomed() as well...)
-Kees
--
Kees Cook
On Fri, May 03, 2024 at 01:35:09PM -0600, Jens Axboe wrote:
> On 5/3/24 1:22 PM, Kees Cook wrote:
> > On Fri, May 03, 2024 at 12:49:11PM -0600, Jens Axboe wrote:
> >> On 5/3/24 12:26 PM, Kees Cook wrote:
> >>> Thanks for doing this analysis! I suspect at least a start of a fix
> >>> would be this:
> >>>
> >>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> >>> index 8fe5aa67b167..15e8f74ee0f2 100644
> >>> --- a/drivers/dma-buf/dma-buf.c
> >>> +++ b/drivers/dma-buf/dma-buf.c
> >>> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
> >>>
> >>> if (events & EPOLLOUT) {
> >>> /* Paired with fput in dma_buf_poll_cb */
> >>> - get_file(dmabuf->file);
> >>> -
> >>> - if (!dma_buf_poll_add_cb(resv, true, dcb))
> >>> + if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> >>> + !dma_buf_poll_add_cb(resv, true, dcb))
> >>> /* No callback queued, wake up any other waiters */
> >>
> >> Don't think this is sane at all. I'm assuming you meant:
> >>
> >> atomic_long_inc_not_zero(&dmabuf->file->f_count);
> >
> > Oops, yes, sorry. I was typed from memory instead of copy/paste.
>
> Figured :-)
>
> >> but won't fly as you're not under RCU in the first place. And what
> >> protects it from being long gone before you attempt this anyway? This is
> >> sane way to attempt to fix it, it's completely opposite of what sane ref
> >> handling should look like.
> >>
> >> Not sure what the best fix is here, seems like dma-buf should hold an
> >> actual reference to the file upfront rather than just stash a pointer
> >> and then later _hope_ that it can just grab a reference. That seems
> >> pretty horrible, and the real source of the issue.
> >
> > AFAICT, epoll just doesn't hold any references at all. It depends,
> > I think, on eventpoll_release() (really eventpoll_release_file())
> > synchronizing with epoll_wait() (but I don't see how this happens, and
> > the race seems to be against ep_item_poll() ...?)
> >
> > I'm really confused about how eventpoll manages the lifetime of polled
> > fds.
>
> epoll doesn't hold any references, and it's got some ugly callback to
> deal with that. It's not ideal, nor pretty, but that's how it currently
> works. See eventpoll_release() and how it's called. This means that
> epoll itself is supposedly safe from the file going away, even though it
> doesn't hold a reference to it.
Right -- what remains unclear to me is how struct file lifetime is
expected to work in the struct file_operations::poll callbacks. Because
using get_file() there looks clearly unsafe...
> Except that in this case, the file is already gone by the time
> eventpoll_release() is called. Which presumably is some interaction with
> the somewhat suspicious file reference management that dma-buf is doing.
> But I didn't look into that much, outside of noting it looks a bit
> suspect.
Not yet, though. Here's (one) race state from the analysis. I added lines
for the dma_fence_add_callback()/dma_buf_poll_cb() case, since that's
the case that would escape any eventpoll_release/epoll_wait
synchronization (if it exists?):
close(dmabuf->file)
__fput_sync (f_count == 1, last ref)
f_count-- (f_count == 0 now)
__fput
epoll_wait
vfs_poll(dmabuf->file)
get_file(dmabuf->file)(f_count == 1)
dma_fence_add_callback()
eventpoll_release
dmabuf->file deallocation
dma_buf_poll_cb()
fput(dmabuf->file) (f_count == 1)
f_count--
dmabuf->file deallocation
Without fences to create a background callback, we just do a double-free:
close(dmabuf->file)
__fput_sync (f_count == 1, last ref)
f_count-- (f_count == 0 now)
__fput
epoll_wait
vfs_poll(dmabuf->file)
get_file(dmabuf->file)(f_count == 1)
dma_buf_poll_cb()
fput(dmabuf->file) (f_count == 1)
f_count--
eventpoll_release
dmabuf->file deallocation
eventpoll_release
dmabuf->file deallocation
get_file(), via epoll_wait()->vfs_poll()->dma_buf_poll(), has raised
f_count again. Then eventpoll_release() is doing things to remove
dmabuf->file from the eventpoll lists, but I *think* this is synchronized
so that an epoll_wait() will only call .poll handlers with a valid
(though possibly f_count==0) file, but I can't figure out where that
happens. (If it's not happening, we have a much bigger problem, but I
imagine we'd see massive corruption all the time, which we don't.)
So, yeah, I can't figure out how eventpoll_release() and epoll_wait()
are expected to behave safely for .poll handlers.
Regardless, for the simple case: it seems like it's just totally illegal
to use get_file() in a poll handler. Is this known/expected? And if so,
how can dmabuf possibly deal with that?
--
Kees Cook
On Sun, 5 May 2024 at 03:50, Christian Brauner <brauner(a)kernel.org> wrote:
>
> And I agree with you that for some instances it's valid to take another
> reference to a file from f_op->poll() but then they need to use
> get_file_active() imho and simply handle the case where f_count is zero.
I think this is
(a) practically impossible to find (since most f_count updates are in
various random helpers)
(b) not tenable in the first place, since *EVERYBODY* does a f_count
update as part of the bog-standard pollwait
So (b) means that the notion of "warn if somebody increments f_count
from zero" is broken to begin with - but it's doubly broken because it
wouldn't find anything *anyway*, since this never happens in any
normal situation.
And (a) means that any non-automatic finding of this is practically impossible.
> And we need to document that in Documentation/filesystems/file.rst or
> locking.rst.
WHY?
Why cannot you and Al just admit that the problem is in epoll. Always
has been, always will be.
The fact is, it's not dma-buf that is violating any rules. It's epoll.
It's calling out to random driver functions with a file pointer that
is no longer valid.
It really is that simple.
I don't see why you are arguing for "unknown number of drivers - we
know at least *one* - have to be fixed for a bug that is in epoll".
If it was *easy* to fix, and if it was *easy* to validate, then sure.
But that just isn't the case.
In contrast, in epoll it's *trivial* to fix the one case where it does
a VFS call-out, and just say "you have to follow the rules".
So explain to me again why you want to mess up the driver interface
and everybody who has a '.poll()' function, and not just fix the ONE
clearly buggy piece of code.
Because dammit,. epoll is clearly buggy. It's not enough to say "the
file allocation isn't going away", and claim that that means that it's
not buggy - when the file IS NO LONGER VALID!
Linus
On Fri, May 03, 2024 at 12:49:11PM -0600, Jens Axboe wrote:
> On 5/3/24 12:26 PM, Kees Cook wrote:
> > Thanks for doing this analysis! I suspect at least a start of a fix
> > would be this:
> >
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index 8fe5aa67b167..15e8f74ee0f2 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll)
> >
> > if (events & EPOLLOUT) {
> > /* Paired with fput in dma_buf_poll_cb */
> > - get_file(dmabuf->file);
> > -
> > - if (!dma_buf_poll_add_cb(resv, true, dcb))
> > + if (!atomic_long_inc_not_zero(&dmabuf->file) &&
> > + !dma_buf_poll_add_cb(resv, true, dcb))
> > /* No callback queued, wake up any other waiters */
>
> Don't think this is sane at all. I'm assuming you meant:
>
> atomic_long_inc_not_zero(&dmabuf->file->f_count);
Oops, yes, sorry. I was typed from memory instead of copy/paste.
> but won't fly as you're not under RCU in the first place. And what
> protects it from being long gone before you attempt this anyway? This is
> sane way to attempt to fix it, it's completely opposite of what sane ref
> handling should look like.
>
> Not sure what the best fix is here, seems like dma-buf should hold an
> actual reference to the file upfront rather than just stash a pointer
> and then later _hope_ that it can just grab a reference. That seems
> pretty horrible, and the real source of the issue.
AFAICT, epoll just doesn't hold any references at all. It depends,
I think, on eventpoll_release() (really eventpoll_release_file())
synchronizing with epoll_wait() (but I don't see how this happens, and
the race seems to be against ep_item_poll() ...?)
I'm really confused about how eventpoll manages the lifetime of polled
fds.
> > Due to this issue I've proposed fixing get_file() to detect pathological states:
> > https://lore.kernel.org/lkml/20240502222252.work.690-kees@kernel.org/
>
> I don't think this would catch this case, as the memory could just be
> garbage at this point.
It catches it just fine! :) I tested it against the published PoC.
And for cases where further allocations have progressed far enough to
corrupt the freed struct file and render the check pointless, nothing
different has happened than what happens today. At least now we have a
window to catch the situation across the time frame before it is both
reallocated _and_ the contents at the f_count offset gets changed to
non-zero.
--
Kees Cook