On Tue, Apr 22, 2025 at 4:01 PM Alexei Starovoitov
<alexei.starovoitov(a)gmail.com> wrote:
>
> On Tue, Apr 22, 2025 at 12:57 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > On Mon, Apr 21, 2025 at 4:39 PM Alexei Starovoitov
> > <alexei.starovoitov(a)gmail.com> wrote:
> > >
> > > On Mon, Apr 21, 2025 at 1:40 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> > > >
> > > > > > new file mode 100644
> > > > > > index 000000000000..b4b8be1d6aa4
> > > > > > --- /dev/null
> > > > > > +++ b/kernel/bpf/dmabuf_iter.c
> > > > >
> > > > > Maybe we should add this file to drivers/dma-buf. I would like to
> > > > > hear other folks thoughts on this.
> > > >
> > > > This is fine with me, and would save us the extra
> > > > CONFIG_DMA_SHARED_BUFFER check that's currently needed in
> > > > kernel/bpf/Makefile but would require checking CONFIG_BPF instead.
> > > > Sumit / Christian any objections to moving the dmabuf bpf iterator
> > > > implementation into drivers/dma-buf?
> > >
> > > The driver directory would need to 'depends on BPF_SYSCALL'.
> > > Are you sure you want this?
> > > imo kernel/bpf/ is fine for this.
> >
> > I don't have a strong preference so either way is fine with me. The
> > main difference I see is maintainership.
> >
> > > You also probably want
> > > .feature = BPF_ITER_RESCHED
> > > in bpf_dmabuf_reg_info.
> >
> > Thank you, this looks like a good idea.
> >
> > > Also have you considered open coded iterator for dmabufs?
> > > Would it help with the interface to user space?
> >
> > I read through the open coded iterator patches, and it looks like they
> > would be slightly more efficient by avoiding seq_file overhead. As far
> > as the interface to userspace, for the purpose of replacing what's
> > currently exposed by CONFIG_DMABUF_SYSFS_STATS I don't think there is
> > a difference. However it looks like if I were to try to replace all of
> > our userspace analysis of dmabufs with a single bpf program then an
> > open coded iterator would make that much easier. I had not considered
> > attempting that.
> >
> > One problem I see with open coded iterators is that support is much
> > more recent (2023 vs 2020). We support longterm stable kernels (back
> > to 5.4 currently but probably 5.10 by the time this would be used), so
> > it seems like it would be harder to backport the kernel support for an
> > open-coded iterator that far since it only goes back as far as 6.6
> > now. Actually it doesn't look like it is possible while also
> > maintaining the stable ABI we provide to device vendors. Which means
> > we couldn't get rid of the dmabuf sysfs stats userspace dependency
> > until 6.1 EOL in Dec. 2027. :\ So I'm in favor of a traditional bpf
> > iterator here for now.
>
> Fair enough, but please implement both and backport only
> the old style pinned iterator.
Ok, will do.
> The code will be mostly shared between them.
> bpf_iter_dmabuf_new/_next will be more flexible with more
> options to return data to user space. For example, Android can invent
> its own binary format: pack into it in a bpf prog, send it to a
> bpf ringbuf, and unmarshal it efficiently in user space,
> instead of being limited to the text output that pinned iterators
> usually produce.
Also a neat idea!
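To make sure I understand the shape of that, here is a rough sketch of what
such a program could look like, assuming bpf_iter_dmabuf_new/_next/_destroy
kfuncs following the usual open-coded iterator conventions. None of this
exists yet, so the kfunc names, the iterator state layout, and the record
format are all just my guesses:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

char _license[] SEC("license") = "GPL";

/* fixed-size binary record per buffer, instead of seq_file text */
struct dmabuf_rec {
	__u64 ino;
	__u64 size;
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} rb SEC(".maps");

/* Hypothetical iterator state and kfuncs; the real definitions would come
 * with the open-coded iterator patch. */
struct bpf_iter_dmabuf { __u64 __opaque[2]; } __attribute__((aligned(8)));
extern int bpf_iter_dmabuf_new(struct bpf_iter_dmabuf *it) __ksym;
extern struct dma_buf *bpf_iter_dmabuf_next(struct bpf_iter_dmabuf *it) __ksym;
extern void bpf_iter_dmabuf_destroy(struct bpf_iter_dmabuf *it) __ksym;

SEC("syscall")
int dump_dmabufs(void *ctx)
{
	struct bpf_iter_dmabuf it;
	struct dma_buf *d;

	bpf_iter_dmabuf_new(&it);
	while ((d = bpf_iter_dmabuf_next(&it))) {
		struct dmabuf_rec *r = bpf_ringbuf_reserve(&rb, sizeof(*r), 0);

		if (!r)
			break;
		r->ino = BPF_CORE_READ(d, file, f_inode, i_ino);
		r->size = BPF_CORE_READ(d, size);
		bpf_ringbuf_submit(r, 0);
	}
	bpf_iter_dmabuf_destroy(&it);
	return 0;
}

Userspace would then consume fixed-size dmabuf_rec entries from the ring
buffer instead of parsing text.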
> You can do binary with bpf_seq_write() too, but it's rare.
>
> Also please provide full bpf prog that you'll use in production
> in a selftest instead of trivial:
> +SEC("iter/dmabuf")
> +int dmabuf_collector(struct bpf_iter__dmabuf *ctx)
>
> just to make sure it's tested end to end and future changes
> won't break it.
The final bpf program should be something pretty close to that, but
I'll start working on the AOSP side as well so I can put up patches.
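Roughly, I expect the selftest prog to grow into something like the sketch
below; the struct bpf_iter__dmabuf context comes from DEFINE_BPF_ITER_FUNC()
in the patch, and the exact fields printed are only illustrative:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("iter/dmabuf")
int dmabuf_collector(struct bpf_iter__dmabuf *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct dma_buf *d = ctx->dmabuf;

	if (!d)
		return 0;

	/* one line per buffer: inode, size in bytes, exporter name */
	BPF_SEQ_PRINTF(seq, "%lu %lu %s\n",
		       d->file->f_inode->i_ino, d->size, d->exp_name);
	return 0;
}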
>
> pw-bot: cr
Hello Jared,
On Wed, 23 Apr 2025 at 00:49, Jared Kangas <jkangas(a)redhat.com> wrote:
>
> Hi all,
>
> This patch series is based on a previous discussion around CMA heap
> naming. [1] The heap's name depends on the device name, which is
> generally "reserved", "linux,cma", or "default-pool", but could be any
> arbitrary name given to the default CMA area in the devicetree. For a
> consistent userspace interface, the series introduces a constant name
> for the CMA heap, and for backwards compatibility, an additional Kconfig
> that controls the creation of a legacy-named heap with the same CMA
> backing.
>
> The ideas to handle backwards compatibility in [1] are to either use a
> symlink or add a heap node with a duplicate minor. However, I assume
> that we don't want to create symlinks in /dev from module initcalls, and
> attempting to duplicate minors would cause device_create() to fail.
> Because of these drawbacks, after brainstorming with Maxime Ripard, I
> went with creating a new node in devtmpfs with its own minor. This
> admittedly makes it a little unclear that the old and new nodes are
> backed by the same heap when both are present. The only approach that I
> think would provide total clarity on this in userspace is symlinking,
> which seemed like a fairly involved solution for devtmpfs, but if I'm
> wrong on this, please let me know.
Thanks indeed for this patch; just one minor nit: the link referred to
as [1] here seems to be missing. Could you please add it? This would
make it easier to follow the chain of discussion for posterity.
>
> Changelog:
> v2: Use tabs instead of spaces for large vertical alignment.
>
> Jared Kangas (2):
> dma-buf: heaps: Parameterize heap name in __add_cma_heap()
> dma-buf: heaps: Give default CMA heap a fixed name
>
> Documentation/userspace-api/dma-buf-heaps.rst | 11 ++++---
> drivers/dma-buf/heaps/Kconfig | 10 +++++++
> drivers/dma-buf/heaps/cma_heap.c | 30 ++++++++++++++-----
> 3 files changed, 40 insertions(+), 11 deletions(-)
>
> --
> 2.49.0
>
Best,
Sumit
On Tue, Apr 22, 2025 at 12:19 PM Jared Kangas <jkangas(a)redhat.com> wrote:
>
> The CMA heap's name in devtmpfs can vary depending on how the heap is
> defined. Its name defaults to "reserved", but if a CMA area is defined
> in the devicetree, the heap takes on the devicetree node's name, such as
> "default-pool" or "linux,cma". To simplify naming, just name it
> "default_cma", and keep a legacy node in place backed by the same
> underlying structure for backwards compatibility.
>
> Signed-off-by: Jared Kangas <jkangas(a)redhat.com>
Once again, thanks for working out how to improve the standard naming
while keeping compatibility.
I do still hope we can get to the point where other cma regions can be
registered as heaps with unique/purpose-specific names, but I can see
having a standard name for the default region is a nice improvement.
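If I'm reading the series right, the end state for the default area is
roughly the sketch below (DEFAULT_CMA_NAME and DMABUF_HEAPS_CMA_LEGACY are
my guesses at the identifiers the patches introduce):

#include <linux/cma.h>
#include <linux/dma-map-ops.h>
#include <linux/module.h>

#define DEFAULT_CMA_NAME	"default_cma"

static int add_default_cma_heap(void)
{
	struct cma *default_cma = dev_get_cma_area(NULL);
	int ret;

	if (!default_cma)
		return 0;

	/* always-present heap with the fixed, predictable name */
	ret = __add_cma_heap(default_cma, DEFAULT_CMA_NAME);
	if (ret)
		return ret;

	/* optional legacy node, named after the CMA area as today */
	if (IS_ENABLED(CONFIG_DMABUF_HEAPS_CMA_LEGACY))
		ret = __add_cma_heap(default_cma, cma_get_name(default_cma));

	return ret;
}
module_init(add_default_cma_heap);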
Acked-by: John Stultz <jstultz(a)google.com>
thanks
-john
On Tue, Apr 22, 2025 at 12:19 PM Jared Kangas <jkangas(a)redhat.com> wrote:
>
> Prepare for the introduction of a fixed-name CMA heap by replacing the
> unused void pointer parameter in __add_cma_heap() with the heap name.
>
> Signed-off-by: Jared Kangas <jkangas(a)redhat.com>
Thanks so much for taking this effort on. Looks good to me!
Acked-by: John Stultz <jstultz(a)google.com>
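For anyone skimming, the reworked helper presumably ends up close to the
following (a sketch based on the current upstream cma_heap.c; the posted
patch may differ in detail):

static int __add_cma_heap(struct cma *cma, const char *name)
{
	struct cma_heap *cma_heap;
	struct dma_heap_export_info exp_info;

	cma_heap = kzalloc(sizeof(*cma_heap), GFP_KERNEL);
	if (!cma_heap)
		return -ENOMEM;
	cma_heap->cma = cma;

	exp_info.name = name;	/* previously cma_get_name(cma); *data was unused */
	exp_info.ops = &cma_heap_ops;
	exp_info.priv = cma_heap;

	cma_heap->heap = dma_heap_add(&exp_info);
	if (IS_ERR(cma_heap->heap)) {
		int ret = PTR_ERR(cma_heap->heap);

		kfree(cma_heap);
		return ret;
	}

	return 0;
}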
On Mon, Apr 21, 2025 at 4:39 PM Alexei Starovoitov
<alexei.starovoitov(a)gmail.com> wrote:
>
> On Mon, Apr 21, 2025 at 1:40 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > > > new file mode 100644
> > > > index 000000000000..b4b8be1d6aa4
> > > > --- /dev/null
> > > > +++ b/kernel/bpf/dmabuf_iter.c
> > >
> > > Maybe we should add this file to drivers/dma-buf. I would like to
> > > hear other folks thoughts on this.
> >
> > This is fine with me, and would save us the extra
> > CONFIG_DMA_SHARED_BUFFER check that's currently needed in
> > kernel/bpf/Makefile but would require checking CONFIG_BPF instead.
> > Sumit / Christian any objections to moving the dmabuf bpf iterator
> > implementation into drivers/dma-buf?
>
> The driver directory would need to 'depends on BPF_SYSCALL'.
> Are you sure you want this?
> imo kernel/bpf/ is fine for this.
I don't have a strong preference so either way is fine with me. The
main difference I see is maintainership.
> You also probably want
> .feature = BPF_ITER_RESCHED
> in bpf_dmabuf_reg_info.
Thank you, this looks like a good idea.
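Presumably that just means extending the registration along these lines,
mirroring the existing iterators; only .feature is new, and the seq_info
symbol name below is a guess at what's in my v1:

static const struct bpf_iter_reg bpf_dmabuf_reg_info = {
	.target			= "dmabuf",
	.feature		= BPF_ITER_RESCHED,	/* allow rescheduling between buffers */
	.ctx_arg_info_size	= 1,
	.ctx_arg_info		= {
		{ offsetof(struct bpf_iter__dmabuf, dmabuf),
		  PTR_TO_BTF_ID_OR_NULL },
	},
	.seq_info		= &dmabuf_iter_seq_info,
};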
> Also have you considered open coded iterator for dmabufs?
> Would it help with the interface to user space?
I read through the open coded iterator patches, and it looks like they
would be slightly more efficient by avoiding seq_file overhead. As far
as the interface to userspace, for the purpose of replacing what's
currently exposed by CONFIG_DMABUF_SYSFS_STATS I don't think there is
a difference. However it looks like if I were to try to replace all of
our userspace analysis of dmabufs with a single bpf program then an
open coded iterator would make that much easier. I had not considered
attempting that.
One problem I see with open coded iterators is that support is much
more recent (2023 vs 2020). We support longterm stable kernels (back
to 5.4 currently but probably 5.10 by the time this would be used), so
it seems like it would be harder to backport the kernel support for an
open-coded iterator that far since it only goes back as far as 6.6
now. Actually it doesn't look like it is possible while also
maintaining the stable ABI we provide to device vendors. Which means
we couldn't get rid of the dmabuf sysfs stats userspace dependency
until 6.1 EOL in Dec. 2027. :\ So I'm in favor of a traditional bpf
iterator here for now.
On Mon, Apr 14, 2025 at 09:21:25PM +0200, Thomas Petazzoni wrote:
> > "UIO is a broken legacy mess, so let's add more broken things
> > to it as broken + broken => still broken, so no harm done", am I
> > getting that right?
>
> Who says UIO is a "broken legacy mess"? Only you say so. I don't see
> any indication anywhere in the kernel tree suggesting that UIO is
> considered a broken legacy mess.
Explain what the difference is between UIO and VFIO, especially VFIO
no-iommu mode?
I've always understood that UIO is for very simple devices that cannot
do DMA. So its very simple operating model and simple security work
fine.
IMHO, if the device can use DMA it should use VFIO. If you have no iommu then
you should use the VFIO unsafe no-iommu path. It still provides a
solid framework.
As to this series, I have seen a number of requests to improve the
VFIO no-iommu path to allow working with the existing IOAS scheme to
register memory but to allow the kernel to return the no-iommu
DMAable address of the IOAS pinned memory. This would replace the
hacky use of mlock and /proc/XX/pagemap that people use today.
If that were done, could you use VFIO no-iommu?
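For reference, the trick being referred to looks roughly like this from
userspace (illustrative only; reading PFNs from pagemap needs
CAP_SYS_ADMIN on modern kernels):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Translate a virtual address to a physical one via /proc/self/pagemap;
 * bits 0-54 of each 8-byte entry hold the PFN of a present page. */
static uint64_t virt_to_phys(void *va)
{
	long psz = sysconf(_SC_PAGESIZE);
	uint64_t ent = 0;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 0;
	pread(fd, &ent, sizeof(ent), ((uintptr_t)va / psz) * sizeof(ent));
	close(fd);
	return (ent & ((1ULL << 55) - 1)) * psz + (uintptr_t)va % psz;
}

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *buf = aligned_alloc(psz, psz);

	buf[0] = 1;		/* fault the page in */
	mlock(buf, psz);	/* keep it resident; does NOT prevent migration */
	printf("virt %p -> phys 0x%llx\n", buf,
	       (unsigned long long)virt_to_phys(buf));
	return 0;
}

mlock only keeps the page resident, it does not pin it against migration,
which is a big part of why this is fragile compared to a real IOAS-based
pinning interface.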
> Keep in mind that when you're running code as root, you can load a
> kernel module, which can do anything on the system security-wise. So
> letting UIO expose MMIO registers of devices to userspace applications
> running as root is not any worse than that.
That isn't fully true. UIO doesn't fit into the security model: it
allows DMA-capable devices to be exposed without checking for
CAP_SYS_RAW_IO first.
Jason
On Mon, Apr 21, 2025 at 10:58 AM Song Liu <song(a)kernel.org> wrote:
>
> On Mon, Apr 14, 2025 at 3:53 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > The dmabuf iterator traverses the list of all DMA buffers. The list is
> > maintained only when CONFIG_DEBUG_FS is enabled.
> >
> > DMA buffers are refcounted through their associated struct file. A
> > reference is taken on each buffer as the list is iterated to ensure each
> > buffer persists for the duration of the bpf program execution without
> > holding the list mutex.
> >
> > Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
> > ---
> > include/linux/btf_ids.h | 1 +
> > kernel/bpf/Makefile | 3 +
> > kernel/bpf/dmabuf_iter.c | 130 +++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 134 insertions(+)
> > create mode 100644 kernel/bpf/dmabuf_iter.c
> >
> > diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
> > index 139bdececdcf..899ead57d89d 100644
> > --- a/include/linux/btf_ids.h
> > +++ b/include/linux/btf_ids.h
> > @@ -284,5 +284,6 @@ extern u32 bpf_cgroup_btf_id[];
> > extern u32 bpf_local_storage_map_btf_id[];
> > extern u32 btf_bpf_map_id[];
> > extern u32 bpf_kmem_cache_btf_id[];
> > +extern u32 bpf_dmabuf_btf_id[];
>
> This is not necessary. See below.
>
> >
> > #endif
> > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > index 70502f038b92..5b30d37ef055 100644
> > --- a/kernel/bpf/Makefile
> > +++ b/kernel/bpf/Makefile
> > @@ -53,6 +53,9 @@ obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
> > obj-$(CONFIG_BPF_SYSCALL) += btf_iter.o
> > obj-$(CONFIG_BPF_SYSCALL) += btf_relocate.o
> > obj-$(CONFIG_BPF_SYSCALL) += kmem_cache_iter.o
> > +ifeq ($(CONFIG_DEBUG_FS),y)
> > +obj-$(CONFIG_BPF_SYSCALL) += dmabuf_iter.o
> > +endif
> >
> > CFLAGS_REMOVE_percpu_freelist.o = $(CC_FLAGS_FTRACE)
> > CFLAGS_REMOVE_bpf_lru_list.o = $(CC_FLAGS_FTRACE)
> > diff --git a/kernel/bpf/dmabuf_iter.c b/kernel/bpf/dmabuf_iter.c
> > new file mode 100644
> > index 000000000000..b4b8be1d6aa4
> > --- /dev/null
> > +++ b/kernel/bpf/dmabuf_iter.c
>
> Maybe we should add this file to drivers/dma-buf. I would like to
> hear other folks thoughts on this.
This is fine with me, and would save us the extra
CONFIG_DMA_SHARED_BUFFER check that's currently needed in
kernel/bpf/Makefile but would require checking CONFIG_BPF instead.
Sumit / Christian any objections to moving the dmabuf bpf iterator
implementation into drivers/dma-buf?
> > @@ -0,0 +1,130 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright (c) 2025 Google LLC */
> > +#include <linux/bpf.h>
> > +#include <linux/btf_ids.h>
> > +#include <linux/dma-buf.h>
> > +#include <linux/kernel.h>
> > +#include <linux/seq_file.h>
> > +
> > +BTF_ID_LIST_GLOBAL_SINGLE(bpf_dmabuf_btf_id, struct, dma_buf)
>
> Use BTF_ID_LIST_SINGLE(), then we don't need this in btf_ids.h
>
> > +DEFINE_BPF_ITER_FUNC(dmabuf, struct bpf_iter_meta *meta, struct dma_buf *dmabuf)
> > +
> > +static void *dmabuf_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > + struct dma_buf *dmabuf, *ret = NULL;
> > +
> > + if (*pos) {
> > + *pos = 0;
> > + return NULL;
> > + }
> > + /* Look for the first buffer we can obtain a reference to.
> > + * The list mutex does not protect a dmabuf's refcount, so it can be
> > + * zeroed while we are iterating. Therefore we cannot call get_dma_buf()
> > + * since the caller of this program may not already own a reference to
> > + * the buffer.
> > + */
>
> We should use kernel comment style for new code. IOW,
> /*
> * Look for ...
> */
>
>
> Thanks,
> Song
Thanks, I have incorporated all of your comments and retested. I will
give some time for more feedback before sending a v2.
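Concretely, I read the first comment as just dropping the btf_ids.h hunk
and keeping the id local to dmabuf_iter.c, i.e.:

	BTF_ID_LIST_SINGLE(bpf_dmabuf_btf_id, struct, dma_buf)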
>
> [...]
On Thu, Apr 17, 2025 at 1:26 PM Song Liu <song(a)kernel.org> wrote:
>
> On Thu, Apr 17, 2025 at 9:05 AM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 9:56 PM Song Liu <song(a)kernel.org> wrote:
> > >
> > > On Wed, Apr 16, 2025 at 7:09 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> > > >
> > > > On Wed, Apr 16, 2025 at 6:26 PM Song Liu <song(a)kernel.org> wrote:
> > > [...]
> > > > >
> > > > > Here is another rookie question, it appears to me there is a file descriptor
> > > > > associated with each DMA buffer, can we achieve the same goal with
> > > > > a task-file iterator?
> > > >
> > > > That would find almost all of them, but not the kernel-only
> > > > allocations. (kernel_rss in the dmabuf_dump output I attached earlier.
> > > > If there's a leak, it's likely to show up in kernel_rss because some
> > > > driver forgot to release its reference(s).) Also wouldn't that be a
> > > > ton more iterations since we'd have to visit every FD to find the
> > > > small portion that are dmabufs? I'm not actually sure if buffers that
> > > > have been mapped, and then have had their file descriptors closed
> > > > would show up in task_struct->files; if not I think that would mean
> > > > scanning both files and vmas for each task.
> > >
> > > I don't think scanning all FDs to find a small portion of specific FDs
> > > is a real issue. We have a tool that scans all FDs in the system and
> > > only dumps data for perf_event FDs. I think it should be easy to
> > > prototype a tool by scanning all files and all vmas. If that turns out
> > > to be very slow, which I highly doubt will be, we can try other
> > > approaches.
> >
> > But this will not find *all* the buffers, and that defeats the purpose
> > of having the iterator.
>
> Do you mean this approach cannot get kernel only allocations? If
> that's the case, we probably do need a separate iterator. I am
> interested in other folks thoughts on this.
Correct.
> > > OTOH, I am wondering whether we can build a more generic iterator
> > > for a list of objects. Adding an iterator for each important kernel list
> > > does not seem scalable in the long term.
> >
> > I think the wide variety of differences in locking for different
> > objects would make this difficult to do in a generic way.
>
> Agreed it is not easy to build a generic solution. But with the
> help from BTF, we can probably build something that covers
> a large number of use cases.
I'm curious what this would look like. I guess a good test would be to
see if anything would work for some subset of the already existing
iterators.
On Wed, Apr 16, 2025 at 9:56 PM Song Liu <song(a)kernel.org> wrote:
>
> On Wed, Apr 16, 2025 at 7:09 PM T.J. Mercier <tjmercier(a)google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 6:26 PM Song Liu <song(a)kernel.org> wrote:
> [...]
> > >
> > > Here is another rookie question, it appears to me there is a file descriptor
> > > associated with each DMA buffer, can we achieve the same goal with
> > > a task-file iterator?
> >
> > That would find almost all of them, but not the kernel-only
> > allocations. (kernel_rss in the dmabuf_dump output I attached earlier.
> > If there's a leak, it's likely to show up in kernel_rss because some
> > driver forgot to release its reference(s).) Also wouldn't that be a
> > ton more iterations since we'd have to visit every FD to find the
> > small portion that are dmabufs? I'm not actually sure if buffers that
> > have been mapped, and then have had their file descriptors closed
> > would show up in task_struct->files; if not I think that would mean
> > scanning both files and vmas for each task.
>
> I don't think scanning all FDs to find a small portion of specific FDs
> is a real issue. We have a tool that scans all FDs in the system and
> only dumps data for perf_event FDs. I think it should be easy to
> prototype a tool by scanning all files and all vmas. If that turns out
> to be very slow, which I highly doubt will be, we can try other
> approaches.
But this will not find *all* the buffers, and that defeats the purpose
of having the iterator.
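For what it's worth, I think the scan you're describing would look roughly
like the sketch below. Identifying dmabuf files by the dmabuf filesystem
magic is my assumption about how the filtering would work, and as noted it
can only ever see buffers that still have an fd installed in some task:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* value of DMA_BUF_MAGIC from include/uapi/linux/magic.h */
#define DMA_BUF_MAGIC	0x444d4142

char _license[] SEC("license") = "GPL";

SEC("iter/task_file")
int dump_task_dmabufs(struct bpf_iter__task_file *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct task_struct *task = ctx->task;
	struct file *file = ctx->file;

	if (!task || !file)
		return 0;

	/* skip everything that is not a dmabuf */
	if (file->f_inode->i_sb->s_magic != DMA_BUF_MAGIC)
		return 0;

	BPF_SEQ_PRINTF(seq, "%d %u %lu\n", task->tgid, ctx->fd,
		       file->f_inode->i_ino);
	return 0;
}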
> OTOH, I am wondering whether we can build a more generic iterator
> for a list of objects. Adding an iterator for each important kernel list
> does not seem scalable in the long term.
I think the wide variety of differences in locking for different
objects would make this difficult to do in a generic way.