On Fri, Dec 05, 2025 at 04:18:38PM +0900, Byungchul Park wrote:
> Add documents describing the concept and APIs of dept.
>
> Signed-off-by: Byungchul Park <byungchul(a)sk.com>
> ---
> Documentation/dev-tools/dept.rst | 778 +++++++++++++++++++++++++++
> Documentation/dev-tools/dept_api.rst | 125 +++++
You forgot to add toctree entries:
---- >8 ----
diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
index 4b8425e348abd1..02c858f5ed1fa2 100644
--- a/Documentation/dev-tools/index.rst
+++ b/Documentation/dev-tools/index.rst
@@ -22,6 +22,8 @@ Documentation/process/debugging/index.rst
clang-format
coccinelle
sparse
+ dept
+ dept_api
kcov
gcov
kasan
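With these entries in place, "make htmldocs" should also stop warning
that dept.rst and dept_api.rst are not included in any toctree.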
> +Lockdep detects a deadlock by checking lock acquisition order. For
> +example, a graph to track acquisition order built by lockdep might look
> +like:
> +
> +.. literal::
> +
> + A -> B -
> + \
> + -> E
> + /
> + C -> D -
> +
> + where 'A -> B' means that acquisition A is prior to acquisition B
> + with A still held.
Use the code-block directive for literal code blocks:
---- >8 ----
diff --git a/Documentation/dev-tools/dept.rst b/Documentation/dev-tools/dept.rst
index 333166464543d7..8394c4ea81bc2a 100644
--- a/Documentation/dev-tools/dept.rst
+++ b/Documentation/dev-tools/dept.rst
@@ -10,7 +10,7 @@ Lockdep detects a deadlock by checking lock acquisition order. For
example, a graph to track acquisition order built by lockdep might look
like:
-.. literal::
+.. code-block::
A -> B -
\
@@ -25,7 +25,7 @@ Lockdep keeps adding each new acquisition order into the graph at
runtime. For example, 'E -> C' will be added when the two locks have
been acquired in the order, E and then C. The graph will look like:
-.. literal::
+.. code-block::
A -> B -
\
@@ -41,7 +41,7 @@ been acquired in the order, E and then C. The graph will look like:
This graph contains a subgraph that demonstrates a loop like:
-.. literal::
+.. code-block::
-> E -
/ \
@@ -76,7 +76,7 @@ e.g. irq context, normal process context, wq worker context, or so on.
Can lockdep detect the following deadlock?
-.. literal::
+.. code-block::
context X context Y context Z
@@ -91,7 +91,7 @@ Can lockdep detect the following deadlock?
No. What about the following?
-.. literal::
+.. code-block::
context X context Y
@@ -116,7 +116,7 @@ What leads a deadlock
A deadlock occurs when one or multi contexts are waiting for events that
will never happen. For example:
-.. literal::
+.. code-block::
context X context Y context Z
@@ -148,7 +148,7 @@ In terms of dependency:
Dependency graph reflecting this example will look like:
-.. literal::
+.. code-block::
-> C -> A -> B -
/ \
@@ -171,7 +171,7 @@ Introduce DEPT
DEPT(DEPendency Tracker) tracks wait and event instead of lock
acquisition order so as to recognize the following situation:
-.. literal::
+.. code-block::
context X context Y context Z
@@ -186,7 +186,7 @@ acquisition order so as to recognize the following situation:
and builds up a dependency graph at runtime that is similar to lockdep.
The graph might look like:
-.. literal::
+.. code-block::
-> C -> A -> B -
/ \
@@ -199,7 +199,7 @@ DEPT keeps adding each new dependency into the graph at runtime. For
example, 'B -> D' will be added when event D occurrence is a
prerequisite to reaching event B like:
-.. literal::
+.. code-block::
context W
@@ -211,7 +211,7 @@ prerequisite to reaching event B like:
After the addition, the graph will look like:
-.. literal::
+.. code-block::
-> D
/
@@ -236,7 +236,7 @@ How DEPT works
Let's take a look how DEPT works with the 1st example in the section
'Limitation of lockdep'.
-.. literal::
+.. code-block::
context X context Y context Z
@@ -256,7 +256,7 @@ event.
Adding comments to describe DEPT's view in detail:
-.. literal::
+.. code-block::
context X context Y context Z
@@ -293,7 +293,7 @@ Adding comments to describe DEPT's view in detail:
Let's build up dependency graph with this example. Firstly, context X:
-.. literal::
+.. code-block::
context X
@@ -304,7 +304,7 @@ Let's build up dependency graph with this example. Firstly, context X:
There are no events to create dependency. Next, context Y:
-.. literal::
+.. code-block::
context Y
@@ -332,7 +332,7 @@ event A cannot be triggered if wait B cannot be awakened by event B.
Therefore, we can say event A depends on event B, say, 'A -> B'. The
graph will look like after adding the dependency:
-.. literal::
+.. code-block::
A -> B
@@ -340,7 +340,7 @@ graph will look like after adding the dependency:
Lastly, context Z:
-.. literal::
+.. code-block::
context Z
@@ -362,7 +362,7 @@ triggered if wait A cannot be awakened by event A. Therefore, we can
say event B depends on event A, say, 'B -> A'. The graph will look like
after adding the dependency:
-.. literal::
+.. code-block::
-> A -> B -
/ \
@@ -386,7 +386,7 @@ Interpret DEPT report
The following is the same example in the section 'How DEPT works'.
-.. literal::
+.. code-block::
context X context Y context Z
@@ -425,7 +425,7 @@ We can simplify this by labeling each waiting point with [W], each
point where its event's context starts with [S] and each event with [E].
This example will look like after the labeling:
-.. literal::
+.. code-block::
context X context Y context Z
@@ -443,7 +443,7 @@ DEPT uses the symbols [W], [S] and [E] in its report as described above.
The following is an example reported by DEPT for a real problem in
practice.
-.. literal::
+.. code-block::
Link: https://lore.kernel.org/lkml/6383cde5-cf4b-facf-6e07-1378a485657d@I-love.SA…
Link: https://lore.kernel.org/lkml/1674268856-31807-1-git-send-email-byungchul.pa…
@@ -646,7 +646,7 @@ practice.
Let's take a look at the summary that is the most important part.
-.. literal::
+.. code-block::
---------------------------------------------------
summary
@@ -669,7 +669,7 @@ Let's take a look at the summary that is the most important part.
The summary shows the following scenario:
-.. literal::
+.. code-block::
context A context B context ?(unknown)
@@ -684,7 +684,7 @@ The summary shows the following scenario:
Adding comments to describe DEPT's view in detail:
-.. literal::
+.. code-block::
context A context B context ?(unknown)
@@ -711,7 +711,7 @@ Adding comments to describe DEPT's view in detail:
Let's build up dependency graph with this report. Firstly, context A:
-.. literal::
+.. code-block::
context A
@@ -735,7 +735,7 @@ unlock(&ni->ni_lock:0) depends on folio_unlock(&f1), say,
The graph will look like after adding the dependency:
-.. literal::
+.. code-block::
unlock(&ni->ni_lock:0) -> folio_unlock(&f1)
@@ -743,7 +743,7 @@ The graph will look like after adding the dependency:
Secondly, context B:
-.. literal::
+.. code-block::
context B
@@ -762,7 +762,7 @@ folio_unlock(&f1) depends on unlock(&ni->ni_lock:0), say,
The graph will look like after adding the dependency:
-.. literal::
+.. code-block::
-> unlock(&ni->ni_lock:0) -> folio_unlock(&f1) -
/ \
> +Limitation of lockdep
> +---------------------
> +
> +Lockdep deals with a deadlock by typical lock e.g. spinlock and mutex,
> +that are supposed to be released within the acquisition context.
> +However, when it comes to a deadlock by folio lock that is not supposed
> +to be released within the acquisition context or other general
> +synchronization mechanisms, lockdep doesn't work.
> +
> +NOTE: In this document, 'context' refers to any type of unique context
> +e.g. irq context, normal process context, wq worker context, or so on.
> +
> +Can lockdep detect the following deadlock?
> +
> +.. literal::
> +
> + context X context Y context Z
> +
> + mutex_lock A
> + folio_lock B
> + folio_lock B <- DEADLOCK
> + mutex_lock A <- DEADLOCK
> + folio_unlock B
> + folio_unlock B
> + mutex_unlock A
> + mutex_unlock A
> +
> +No. What about the following?
> +
> +.. literal::
> +
> + context X context Y
> +
> + mutex_lock A
> + mutex_lock A <- DEADLOCK
> + wait_for_complete B <- DEADLOCK
> + complete B
> + mutex_unlock A
> + mutex_unlock A
> +
> +No.
One unanswered question from my v17 review [1]: you explain in the "How
DEPT works" section how DEPT detects the deadlock in the first example
(the former, with three contexts). Can you do the same for the second
example (the latter, with two contexts)?
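To make the question concrete, here is my rough guess at DEPT's view of
it, using the [W]/[E] labels from the doc (just my own sketch, not from
the patch):

   context X			context Y

				   mutex_lock A
   mutex_lock A [W]
				   wait_for_complete B [W]
   complete B [E]
				   mutex_unlock A [E]
   mutex_unlock A

Context Y cannot reach mutex_unlock A until wait_for_complete B is
awakened ('unlock A -> complete B'), and context X cannot reach
complete B until its wait on mutex_lock A is awakened
('complete B -> unlock A'), so the two dependencies form a loop.
Spelling this out in the doc would answer the question.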
Thanks.
[1]: https://lore.kernel.org/linux-doc/aN84jKyrE1BumpLj@archie.me/
--
An old man doll... just what I always wanted! - Clara
On Wednesday, 2025-11-26 at 16:44 +0100, Philipp Stanner wrote:
> On Wed, 2025-11-26 at 16:03 +0100, Christian König wrote:
>
> > >
[...]
> > > My hope would be that in the mid-term future we'd get firmware
> > > rings
> > > that can be preempted through a firmware call for all major
> > > hardware.
> > > Then a huge share of our problems would disappear.
> >
> > At least on AMD HW, pre-emption is actually horribly unreliable as
> > well.
>
> Do you mean new GPUs with firmware scheduling, or what is "HW pre-
> emption"?
>
> With firmware interfaces, my hope would be that you could simply tell
>
> stop_running_ring(nr_of_ring)
> // time slice for someone else
> start_running_ring(nr_of_ring)
>
> Thereby getting real scheduling and all that. And eliminating many
> other problems we know well from drm/sched.
It doesn't really matter whether you have firmware scheduling or not;
preemption is a hard problem on GPUs either way. CPUs have limited
software-visible state that needs to be saved/restored on a context
switch, and even there people are starting to complain that they need
to context switch the AVX512 register set.
GPUs have megabytes of software-visible state, which needs to be
saved/restored on a context switch if you want fine-grained preemption
with low preemption latency. There might be points in the command
execution where you can ignore most of that state, but reaching those
points can have basically unbounded latency. So either you can reliably
save/restore lots of state, or you are limited to very coarse-grained
preemption with all the usual issues of timeouts and DoS vectors.
I'm not totally up to speed with the current state across all relevant
GPUs, but until recently NVidia was the only vendor with really
reliable fine-grained preemption.
Regards,
Lucas
If there is a large number (hundreds) of dmabufs allocated, the text
output generated by dmabuf_iter_seq_show can exceed common user buffer
sizes (e.g. PAGE_SIZE), necessitating multiple start/stop cycles to
iterate through all dmabufs. However, the dmabuf iterator currently
returns NULL in dmabuf_iter_seq_start for all non-zero pos values, which
results in truncation of the output before all dmabufs are handled.
After dma_buf_iter_begin / dma_buf_iter_next, the refcount of the buffer
is elevated so that the BPF iterator program can run without holding any
locks. When a stop occurs, instead of immediately dropping the reference
on the buffer, stash a pointer to the buffer in seq->priv until
either start is called or the iterator is released. This also enables
the resumption of iteration without first walking through the list of
dmabufs based on the pos value.
Fixes: 76ea95534995 ("bpf: Add dmabuf iterator")
Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
---
kernel/bpf/dmabuf_iter.c | 56 +++++++++++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 7 deletions(-)
diff --git a/kernel/bpf/dmabuf_iter.c b/kernel/bpf/dmabuf_iter.c
index 4dd7ef7c145c..cd500248abd9 100644
--- a/kernel/bpf/dmabuf_iter.c
+++ b/kernel/bpf/dmabuf_iter.c
@@ -6,10 +6,33 @@
#include <linux/kernel.h>
#include <linux/seq_file.h>
+struct dmabuf_iter_priv {
+ /*
+ * If this pointer is non-NULL, the buffer's refcount is elevated to
+ * prevent destruction between stop/start. If reading is not resumed and
+ * start is never called again, then dmabuf_iter_seq_fini drops the
+ * reference when the iterator is released.
+ */
+ struct dma_buf *dmabuf;
+};
+
static void *dmabuf_iter_seq_start(struct seq_file *seq, loff_t *pos)
{
- if (*pos)
- return NULL;
+ struct dmabuf_iter_priv *p = seq->private;
+
+ if (*pos) {
+ struct dma_buf *dmabuf = p->dmabuf;
+
+ if (!dmabuf)
+ return NULL;
+
+ /*
+ * Always resume from where we stopped, regardless of the value
+ * of pos.
+ */
+ p->dmabuf = NULL;
+ return dmabuf;
+ }
return dma_buf_iter_begin();
}
@@ -54,8 +77,11 @@ static void dmabuf_iter_seq_stop(struct seq_file *seq, void *v)
{
struct dma_buf *dmabuf = v;
- if (dmabuf)
- dma_buf_put(dmabuf);
+ if (dmabuf) {
+ struct dmabuf_iter_priv *p = seq->private;
+
+ p->dmabuf = dmabuf;
+ }
}
static const struct seq_operations dmabuf_iter_seq_ops = {
@@ -71,11 +97,27 @@ static void bpf_iter_dmabuf_show_fdinfo(const struct bpf_iter_aux_info *aux,
seq_puts(seq, "dmabuf iter\n");
}
+static int dmabuf_iter_seq_init(void *priv, struct bpf_iter_aux_info *aux)
+{
+ struct dmabuf_iter_priv *p = (struct dmabuf_iter_priv *)priv;
+
+ p->dmabuf = NULL;
+ return 0;
+}
+
+static void dmabuf_iter_seq_fini(void *priv)
+{
+ struct dmabuf_iter_priv *p = (struct dmabuf_iter_priv *)priv;
+
+ if (p->dmabuf)
+ dma_buf_put(p->dmabuf);
+}
+
static const struct bpf_iter_seq_info dmabuf_iter_seq_info = {
.seq_ops = &dmabuf_iter_seq_ops,
- .init_seq_private = NULL,
- .fini_seq_private = NULL,
- .seq_priv_size = 0,
+ .init_seq_private = dmabuf_iter_seq_init,
+ .fini_seq_private = dmabuf_iter_seq_fini,
+ .seq_priv_size = sizeof(struct dmabuf_iter_priv),
};
static struct bpf_iter_reg bpf_dmabuf_reg_info = {
base-commit: 30f09200cc4aefbd8385b01e41bde2e4565a6f0e
--
2.52.0.177.g9f829587af-goog
On Sun, Nov 23, 2025 at 10:51:25PM +0000, Pavel Begunkov wrote:
> +struct dma_token *blkdev_dma_map(struct file *file,
> + struct dma_token_params *params)
Given that this is a direct file operation instance, it should be
in block/fops.c. If we do want a generic helper below it, it
should take a struct block_device instead. But we can probably
defer that until a user for it shows up.
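Something like this is what I had in mind (sketch only; blk_dma_map()
is a made-up name for the deferred bdev-based helper):

	/* in block/fops.c, wired up as the file operation instance */
	struct dma_token *blkdev_dma_map(struct file *file,
					 struct dma_token_params *params)
	{
		/*
		 * Unwrap the block_device here; a generic helper taking
		 * the bdev directly can be added once a user shows up.
		 */
		return blk_dma_map(file_bdev(file), params);
	}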
> +static void nvme_sync_dma(struct nvme_dev *nvme_dev, struct request *req,
> + enum dma_data_direction dir)
> +{
> + struct blk_mq_dma_map *map = req->dma_map;
> + int length = blk_rq_payload_bytes(req);
> + bool for_cpu = dir == DMA_FROM_DEVICE;
> + struct device *dev = nvme_dev->dev;
> + dma_addr_t *dma_list = map->private;
> + struct bio *bio = req->bio;
> + int offset, map_idx;
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + length += offset & (NVME_CTRL_PAGE_SIZE - 1);
> +
> + while (length > 0) {
> + u64 dma_addr = dma_list[map_idx++];
> +
> + if (for_cpu)
> + __dma_sync_single_for_cpu(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE, dir);
> + else
> + __dma_sync_single_for_device(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE, dir);
> + length -= NVME_CTRL_PAGE_SIZE;
> + }
This looks really inefficient. Usually the ranges in the dmabuf should
be much larger than a controller page.
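Something along these lines would batch the syncs by coalescing
physically contiguous entries (rough sketch on top of the patch's
dma_list layout, untested):

	while (length > 0) {
		dma_addr_t start = dma_list[map_idx++];
		int run = NVME_CTRL_PAGE_SIZE;

		/* extend the run over adjacent, contiguous entries */
		while (length > run && dma_list[map_idx] == start + run) {
			run += NVME_CTRL_PAGE_SIZE;
			map_idx++;
		}

		if (for_cpu)
			__dma_sync_single_for_cpu(dev, start, run, dir);
		else
			__dma_sync_single_for_device(dev, start, run, dir);
		length -= run;
	}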
> +static void nvme_unmap_premapped_data(struct nvme_dev *dev,
> + struct request *req)
> +{
> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +
> + if (rq_data_dir(req) == READ)
> + nvme_sync_dma(dev, req, DMA_FROM_DEVICE);
> + if (!(iod->flags & IOD_SINGLE_SEGMENT))
> + nvme_free_descriptors(req);
> +}
This doesn't really unmap anything :)

Also, the DMA ownership rules say that you always need to call the
sync_to_device helpers before I/O and the sync_to_cpu helpers after I/O,
no matter if it is a read or write. The implementations then make
them a no-op where possible.
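I.e. regardless of the data direction, the pattern is always
(illustrative sketch using the same helpers the patch already calls):

	/* before handing the buffer to the device, reads and writes */
	__dma_sync_single_for_device(dev, dma_addr, len, dir);

	/* ... I/O runs ... */

	/* after the device is done, again for reads and writes */
	__dma_sync_single_for_cpu(dev, dma_addr, len, dir);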
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + offset &= (NVME_CTRL_PAGE_SIZE - 1);
> +
> + prp1_dma = dma_list[map_idx++] + offset;
> +
> + length -= (NVME_CTRL_PAGE_SIZE - offset);
> + if (length <= 0) {
> + prp2_dma = 0;
Urgg, why is this building PRPs instead of SGLs? Yes, SGLs are an
optional feature, but for devices where you want to micro-optimize
like this I think we should simply require them. This should cut
down on both the memory use and the amount of special mapping code.
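For comparison, a contiguous premapped range needs a single 16-byte SGL
data block descriptor, roughly what nvme_pci_sgl_set_data() already does
in drivers/nvme/host/pci.c (dma_addr/total_len are placeholders here):

	struct nvme_sgl_desc sge = {
		.addr	= cpu_to_le64(dma_addr),
		.length	= cpu_to_le32(total_len),
		.type	= NVME_SGL_FMT_DATA_DESC << 4,
	};

versus one 8-byte PRP entry per NVME_CTRL_PAGE_SIZE of the same range.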