The overhead of enforcing the DMA-buf rules for importers is now so low
that it is safe to enable it by default on debug kernels.
This will hopefully result in more issues in importers being found and
fixed.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
drivers/dma-buf/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
index b46eb8a552d7..fdd823e446cc 100644
--- a/drivers/dma-buf/Kconfig
+++ b/drivers/dma-buf/Kconfig
@@ -55,7 +55,7 @@ config DMABUF_MOVE_NOTIFY
config DMABUF_DEBUG
bool "DMA-BUF debug checks"
depends on DMA_SHARED_BUFFER
- default y if DMA_API_DEBUG
+ default y if DEBUG_KERNEL
help
This option enables additional checks for DMA-BUF importers and
exporters. Specifically it validates that importers do not peek at the
--
2.43.0
On Wednesday, 2025-11-26 at 16:44 +0100, Philipp Stanner wrote:
> On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
>
> > >
[...]
> > > My hope would be that in the mid-term future we'd get firmware
> > > rings
> > > that can be preempted through a firmware call for all major
> > > hardware.
> > > Then a huge share of our problems would disappear.
> >
> > At least on AMD HW, pre-emption is actually horribly unreliable as
> > well.
>
> Do you mean new GPUs with firmware scheduling, or what is "HW pre-
> emption"?
>
> With firmware interfaces, my hope would be that you could simply tell
>
> stop_running_ring(nr_of_ring)
> // time slice for someone else
> start_running_ring(nr_of_ring)
>
> Thereby getting real scheduling and all that. And eliminating many
> other problems we know well from drm/sched.
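(Purely for illustration, a driver-side version of that interface might
look like the hypothetical sketch below; the mailbox register and
command values are invented and don't correspond to any real firmware
ABI.)

#include <linux/io.h>
#include <linux/types.h>

/* Hypothetical firmware mailbox; nothing here matches real hardware. */
#define FW_MBOX_CMD		0x00
#define FW_CMD_STOP_RING	0x1
#define FW_CMD_START_RING	0x2

struct fw_sched {
	void __iomem *mmio;	/* hypothetical firmware mailbox window */
};

/* Ask the firmware to park a ring so its time slice can be handed to
 * someone else... */
static void fw_stop_ring(struct fw_sched *fw, u32 ring)
{
	writel(FW_CMD_STOP_RING | (ring << 8), fw->mmio + FW_MBOX_CMD);
}

/* ...and resume it afterwards.  Everything interesting, i.e. saving
 * and restoring the context, happens behind these calls inside the
 * firmware. */
static void fw_start_ring(struct fw_sched *fw, u32 ring)
{
	writel(FW_CMD_START_RING | (ring << 8), fw->mmio + FW_MBOX_CMD);
}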
Whether you have firmware scheduling or not doesn't really matter;
preemption is a hard problem on GPUs either way. CPUs have a limited
amount of software-visible state that needs to be saved/restored on a
context switch, and even there people are now complaining that they
have to context switch the AVX512 register set.
GPUs have megabytes of software-visible state, which needs to be
saved/restored on a context switch if you want fine-grained preemption
with low latency. There might be points in the command execution where
you can ignore most of that state, but reaching those points can have
essentially unbounded latency. So either you can reliably save/restore
large amounts of state, or you are limited to very coarse-grained
preemption with all the usual issues of timeouts and DoS vectors.
I'm not totally up to speed on the current state across all relevant
GPUs, but until recently NVidia was the only vendor with really
reliable fine-grained preemption.
Regards,
Lucas
If there is a large number (hundreds) of dmabufs allocated, the text
output generated by dmabuf_iter_seq_show can exceed common user buffer
sizes (e.g. PAGE_SIZE), necessitating multiple start/stop cycles to
iterate through all dmabufs. However, the dmabuf iterator currently
returns NULL in dmabuf_iter_seq_start for all non-zero pos values,
which truncates the output before all dmabufs are handled.
After dma_buf_iter_begin / dma_buf_iter_next, the refcount of the buffer
is elevated so that the BPF iterator program can run without holding any
locks. When a stop occurs, instead of immediately dropping the reference
on the buffer, stash a pointer to the buffer in seq->priv until
either start is called or the iterator is released. This also enables
the resumption of iteration without first walking through the list of
dmabufs based on the pos value.
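For reference, an iterator program that runs into this looks roughly
like the sketch below, modeled on the in-tree selftest; the section
name and the bpf_iter__dmabuf context layout are assumed from the
original iterator patch. Reading its pinned iterator fd with a small
userspace buffer (a few hundred bytes per read()) is what forces the
repeated stop/start cycles described above.

/* Illustrative sketch only. */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("iter/dmabuf")
int dmabuf_names(struct bpf_iter__dmabuf *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct dma_buf *dmabuf = ctx->dmabuf;

	if (!dmabuf)
		return 0;

	/* One line per buffer; once the user's read() buffer fills up,
	 * the seq_file core calls ->stop and resumes with ->start later. */
	BPF_SEQ_PRINTF(seq, "%s %llu\n", dmabuf->exp_name,
		       (u64)dmabuf->size);
	return 0;
}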
Fixes: 76ea95534995 ("bpf: Add dmabuf iterator")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
kernel/bpf/dmabuf_iter.c | 56 +++++++++++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 7 deletions(-)
diff --git a/kernel/bpf/dmabuf_iter.c b/kernel/bpf/dmabuf_iter.c
index 4dd7ef7c145c..cd500248abd9 100644
--- a/kernel/bpf/dmabuf_iter.c
+++ b/kernel/bpf/dmabuf_iter.c
@@ -6,10 +6,33 @@
#include <linux/kernel.h>
#include <linux/seq_file.h>
+struct dmabuf_iter_priv {
+ /*
+ * If this pointer is non-NULL, the buffer's refcount is elevated to
+ * prevent destruction between stop/start. If reading is not resumed and
+ * start is never called again, then dmabuf_iter_seq_fini drops the
+ * reference when the iterator is released.
+ */
+ struct dma_buf *dmabuf;
+};
+
static void *dmabuf_iter_seq_start(struct seq_file *seq, loff_t *pos)
{
- if (*pos)
- return NULL;
+ struct dmabuf_iter_priv *p = seq->private;
+
+ if (*pos) {
+ struct dma_buf *dmabuf = p->dmabuf;
+
+ if (!dmabuf)
+ return NULL;
+
+ /*
+ * Always resume from where we stopped, regardless of the value
+ * of pos.
+ */
+ p->dmabuf = NULL;
+ return dmabuf;
+ }
return dma_buf_iter_begin();
}
@@ -54,8 +77,11 @@ static void dmabuf_iter_seq_stop(struct seq_file *seq, void *v)
{
struct dma_buf *dmabuf = v;
- if (dmabuf)
- dma_buf_put(dmabuf);
+ if (dmabuf) {
+ struct dmabuf_iter_priv *p = seq->private;
+
+ p->dmabuf = dmabuf;
+ }
}
static const struct seq_operations dmabuf_iter_seq_ops = {
@@ -71,11 +97,27 @@ static void bpf_iter_dmabuf_show_fdinfo(const struct bpf_iter_aux_info *aux,
seq_puts(seq, "dmabuf iter\n");
}
+static int dmabuf_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
+{
+ struct dmabuf_iter_priv *p = (struct dmabuf_iter_priv *)priv;
+
+ p->dmabuf = NULL;
+ return 0;
+}
+
+static void dmabuf_iter_seq_fini(void *priv)
+{
+ struct dmabuf_iter_priv *p = (struct dmabuf_iter_priv *)priv;
+
+ if (p->dmabuf)
+ dma_buf_put(p->dmabuf);
+}
+
static const struct bpf_iter_seq_info dmabuf_iter_seq_info = {
.seq_ops = &dmabuf_iter_seq_ops,
- .init_seq_private = NULL,
- .fini_seq_private = NULL,
- .seq_priv_size = 0,
+ .init_seq_private = dmabuf_iter_seq_init,
+ .fini_seq_private = dmabuf_iter_seq_fini,
+ .seq_priv_size = sizeof(struct dmabuf_iter_priv),
};
static struct bpf_iter_reg bpf_dmabuf_reg_info = {
base-commit: 30f09200cc4aefbd8385b01e41bde2e4565a6f0e
--
2.52.0.177.g9f829587af-goog
On Sun, Nov 23, 2025 at 10:51:25PM +0000, Pavel Begunkov wrote:
> +struct dma_token *blkdev_dma_map(struct file *file,
> + struct dma_token_params *params)
Given that this is a direct file operation instance, it should be
in block/fops.c. If we do want a generic helper below it, it should
take a struct block_device instead. But we can probably defer that
until a user for it shows up.
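I.e. something along the lines of the (untested) sketch below, just to
illustrate the split; the dma_token types are the ones from this
series, the helper name is made up:

/* generic block layer helper taking the block_device directly; can be
 * added for real once a second user shows up */
static struct dma_token *blkdev_get_dma_map(struct block_device *bdev,
					    struct dma_token_params *params)
{
	/* would contain what blkdev_dma_map() does in this patch, minus
	 * the file handling; stubbed out here */
	return ERR_PTR(-EOPNOTSUPP);
}

/* the file_operations instance in block/fops.c stays a thin wrapper */
static struct dma_token *blkdev_dma_map(struct file *file,
					struct dma_token_params *params)
{
	return blkdev_get_dma_map(file_bdev(file), params);
}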
> +static void nvme_sync_dma(struct nvme_dev *nvme_dev, struct request *req,
> + enum dma_data_direction dir)
> +{
> + struct blk_mq_dma_map *map = req->dma_map;
> + int length = blk_rq_payload_bytes(req);
> + bool for_cpu = dir == DMA_FROM_DEVICE;
> + struct device *dev = nvme_dev->dev;
> + dma_addr_t *dma_list = map->private;
> + struct bio *bio = req->bio;
> + int offset, map_idx;
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + length += offset & (NVME_CTRL_PAGE_SIZE - 1);
> +
> + while (length > 0) {
> + u64 dma_addr = dma_list[map_idx++];
> +
> + if (for_cpu)
> + __dma_sync_single_for_cpu(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE, dir);
> + else
> + __dma_sync_single_for_device(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE, dir);
> + length -= NVME_CTRL_PAGE_SIZE;
> + }
This looks really inefficient. Usually the ranges in the dmabuf should
be much larger than a controller page.
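Something like the (untested) sketch below, reusing the locals from the
quoted function, would at least batch physically contiguous controller
pages into one sync call each; storing actual ranges instead of a
page-granular dma_list would obviously be better still:

	while (length > 0) {
		dma_addr_t start = dma_list[map_idx];
		int run = NVME_CTRL_PAGE_SIZE;

		/* extend the run while the next controller page is
		 * physically contiguous with the current one */
		while (length > run &&
		       dma_list[map_idx + 1] ==
		       dma_list[map_idx] + NVME_CTRL_PAGE_SIZE) {
			run += NVME_CTRL_PAGE_SIZE;
			map_idx++;
		}
		map_idx++;

		if (for_cpu)
			__dma_sync_single_for_cpu(dev, start, run, dir);
		else
			__dma_sync_single_for_device(dev, start, run, dir);

		length -= run;
	}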
> +static void nvme_unmap_premapped_data(struct nvme_dev *dev,
> + struct request *req)
> +{
> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +
> + if (rq_data_dir(req) == READ)
> + nvme_sync_dma(dev, req, DMA_FROM_DEVICE);
> + if (!(iod->flags & IOD_SINGLE_SEGMENT))
> + nvme_free_descriptors(req);
> +}
This doesn't really unmap anything :)
Also, the DMA ownership rules say that you always need to call the
sync_to_device helpers before I/O and the sync_to_cpu helpers after
I/O, no matter if it is a read or a write. The implementation then
makes them a no-op where possible.
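In code the expected pattern is roughly the following (the helper name
and signature are invented for illustration):

/*
 * Sketch of the hand-off the ownership rules require.  dma_dir is the
 * direction the ranges were mapped with; whether to sync for the
 * device or for the CPU depends only on whether we are before or
 * after the I/O, not on whether the request is a read or a write.
 */
static void nvme_premapped_sync_range(struct device *dev, dma_addr_t addr,
				      size_t len,
				      enum dma_data_direction dma_dir,
				      bool before_io)
{
	if (before_io)
		__dma_sync_single_for_device(dev, addr, len, dma_dir);
	else
		__dma_sync_single_for_cpu(dev, addr, len, dma_dir);
}

Called on every premapped range at submission (before_io = true) and at
completion (before_io = false); the DMA core already turns these into
no-ops when the device is cache coherent.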
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + offset &= (NVME_CTRL_PAGE_SIZE - 1);
> +
> + prp1_dma = dma_list[map_idx++] + offset;
> +
> + length -= (NVME_CTRL_PAGE_SIZE - offset);
> + if (length <= 0) {
> + prp2_dma = 0;
Urgg, why is this building PRPs instead of SGLs? Yes, SGLs are an
optional feature, but for devices where you want to micro-optimize
like this I think we should simply require them. This should cut
down on both the memory use and the amount of special mapping code.
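For reference, the per-range descriptor setup would be essentially what
nvme_pci_sgl_set_data() already does in drivers/nvme/host/pci.c, just
fed from the premapped ranges; the function name below is made up:

/* One SGL data block descriptor per premapped range, instead of one
 * PRP entry per controller page. */
static void nvme_token_set_sgl_data(struct nvme_sgl_desc *sge,
				    dma_addr_t addr, unsigned int len)
{
	sge->addr = cpu_to_le64(addr);
	sge->length = cpu_to_le32(len);
	sge->type = NVME_SGL_FMT_DATA_DESC << 4;
}

With ranges that are typically much larger than NVME_CTRL_PAGE_SIZE,
that's 16 bytes of descriptor per range rather than 8 bytes per
controller page, plus a lot less special-case mapping code.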