Hi,
This is a patch series covering the support for protected mode execution in
Mali Panthor CSF kernel driver.
It builds on the initial RFC posted by Florent Tomasin back in January of 2025.
The initial RFC can be found here:
https://lore.kernel.org/lkml/cover.1738228114.git.florent.tomasin@arm.com/
Mali CSF GPUs support protected mode execution at the HW level. This
feature requires two main changes in the kernel driver:
1) Configure the GPU with a protected buffer. The system must provide a DMA
heap from which the driver can allocate a protected buffer.
It can be a carved-out memory region or a dynamically allocated protected
memory region. Some systems include trusted FW which is in charge of the
protected memory. Since this problem is integration specific, the Mali
Panthor CSF kernel driver must import the protected memory from a device
specific exporter.
2) Handle entry and exit of the GPU HW between normal and protected mode of
execution. The FW sends a request for protected mode entry to the kernel
driver. The acknowledgment of that request is a scheduling decision:
effectively, protected mode execution should not overrule normal mode
execution. A fair distribution of execution time guarantees that the
overall performance of the device, including the UI (usually executing in
normal mode), will not regress when a protected mode job is submitted by an
application.
Background
----------
The current Mali Panthor CSF driver does not allow a user space application
to execute protected jobs on the GPU. This use case is quite common on
end-user devices: a user may want to watch a video or render content that is
under "Digital Rights Management" protection, or launch an application that
handles private user data.
1) User-space:
In order to execute protected jobs on a Mali CSF GPU, the user space
application must submit jobs to the GPU within a "protected region"
(a range of commands to execute in protected mode).
Here is an example of a command buffer that contains protected commands:
```
<--- Normal mode ---><--- Protected mode ---><--- Normal mode --->
+-------------------------------------------------------------------------+
| ... | CMD_0 | ... | CMD_N | PROT_REGION | CMD_N+1 | ... | CMD_N+M | ... |
+-------------------------------------------------------------------------+
```
The PROT_REGION command acts as a barrier to notify the HW of upcoming
protected jobs. It also defines the number of commands to execute in protected
mode.
The Mesa definition of the opcode can be found here:
https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/panfrost/lib/genxm…
2) Kernel-space:
When loading the FW image, the kernel driver must also load the data section
of the CSF FW that comes from protected memory, in order to allow the FW to
execute in protected mode.
Important: this memory is not owned by any process. It is GPU device-level
protected memory.
In addition, when a CSG (group) is created, it must have a protected suspend buffer.
This memory is allocated within the kernel but bound to a specific CSG that belongs
to a process. The kernel owns this allocation and does not allow user space mapping.
The format of the data in this buffer is only known by the FW and does not need to
be shared with other entities. The purpose of this buffer is the same as the normal
suspend buffer but for protected mode. FW will use it to suspend the execution of
PROT_REGION before returning to normal mode of execution.
Design decisions
----------------
The Mali Panthor CSF kernel driver will allocate protected DMA buffers
using a global protected DMA heap. The name of the heap can vary from
system to system and is integration specific. Therefore, the kernel driver
will retrieve it using the DTB entry: "protected-heap-name".
The Mali Panthor CSF kernel driver will handle entering/exiting protected
mode with fair consideration of job scheduling.
If the system integrator does not provide a protected DMA heap, the driver
will not allow any protected mode execution.
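For illustration, a minimal sketch of that lookup-and-allocate path, assuming
the in-kernel heap accessors from the series referenced below
(dma_heap_find(), dma_heap_buffer_alloc(), dma_heap_put()); exact names and
signatures may differ in the final revision:
```
#include <linux/device.h>
#include <linux/dma-buf.h>
#include <linux/dma-heap.h>
#include <linux/err.h>
#include <linux/fcntl.h>
#include <linux/of.h>

/* Sketch only: allocate a protected buffer from the DT-named heap. */
static struct dma_buf *panthor_protected_alloc(struct device *dev, size_t size)
{
	const char *heap_name;
	struct dma_heap *heap;
	struct dma_buf *buf;

	if (of_property_read_string(dev->of_node, "protected-heap-name",
				    &heap_name))
		return ERR_PTR(-ENODEV); /* no heap: protected mode disabled */

	heap = dma_heap_find(heap_name);
	if (!heap)
		return ERR_PTR(-EPROBE_DEFER); /* heap driver not ready yet */

	buf = dma_heap_buffer_alloc(heap, size, O_RDWR, 0);
	dma_heap_put(heap);
	return buf;
}
```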
Patch series
------------
[PATCHES 1-2]:
These patches come from the following patch series:
https://lore.kernel.org/lkml/20240720071606.27930-1-yunfei.dong@mediatek.co…
These extend the DMA-buf heap API to allow other kernel drivers to find
and allocate memory from DMA heaps.
Note: This patch series does not include a protected DMA heap, as this is
platform specific.
* dma-heap: Add proper kref handling on dma-buf heaps
* dma-heap: Provide accessors so that in-kernel drivers can allocate dmabufs from specific heaps
[PATCHES 3, 5 and 6]:
These are refactoring to aid the implementation of the protected rendering
feature itself.
* drm/panthor: De-duplicate FW memory section sync
* drm/panthor: Minor scheduler refactoring
* drm/panthor: Explicit expansion of locked VM region
[PATCH 4]:
This introduces allocation of protected memory inside the Panthor driver.
It also ensures the protected FW sections are loaded.
* drm/panthor: Add support for protected memory allocation in panthor
[PATCH 7]:
This patch implements the logic to handle enter/exit of the GPU protected
mode in Panthor CSF driver.
Note: to prevent scheduler priority inversion, only a single CSG is allowed
to execute while in protected mode. It must be the top priority one.
* drm/panthor: Add support for entering and exiting protected mode
[PATCH 8]:
The final patch exposes this feature via the uAPI.
* drm/panthor: Expose protected rendering features
Testing
-------
1) Platform and development environment
Any platform containing a Mali CSF GPU and a protected memory allocator
based on a DMA heap can be used. For example, it can be a physical platform
or a simulator such as the Arm Total Compute FVPs. Reference for the latter:
https://developer.arm.com/Tools%20and%20Software/Fixed%20Virtual%20Platform…
2) Mesa:
PanVK support can be found here:
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/40044
This is still a work in progress.
Constraints
-----------
At the time of developing the feature, the Linux kernel does not have a
generic way of implementing protected DMA heaps. The patch series relies on
previous work to expose the DMA heap API to kernel drivers.
The Mali CSF GPU requires device-level allocated protected memory, which
does not belong to any process. The current Linux implementation of DMA
heaps only allows a user space program to allocate from such heaps. Having
the ability to allocate this memory at the kernel level via the DMA heap
API would allow support for protected mode on Mali CSF GPUs.
Florent Tomasin (3):
drm/panthor: Add support for protected memory allocation in panthor
drm/panthor: Minor scheduler refactoring
drm/panthor: Add support for entering and exiting protected mode
John Stultz (2):
dma-heap: Add proper kref handling on dma-buf heaps
dma-heap: Provide accessors so that in-kernel drivers can allocate dmabufs from specific heaps
Ketil Johnsen (3):
drm/panthor: De-duplicate FW memory section sync
drm/panthor: Explicit expansion of locked VM region
drm/panthor: Expose protected rendering features
Documentation/gpu/panthor.rst | 47 +++
drivers/dma-buf/dma-heap.c | 109 ++++++-
drivers/gpu/drm/panthor/Kconfig | 1 +
drivers/gpu/drm/panthor/panthor_device.c | 29 +-
drivers/gpu/drm/panthor/panthor_device.h | 15 +
drivers/gpu/drm/panthor/panthor_drv.c | 21 +-
drivers/gpu/drm/panthor/panthor_fw.c | 137 ++++++--
drivers/gpu/drm/panthor/panthor_fw.h | 7 +
drivers/gpu/drm/panthor/panthor_gem.c | 77 ++++-
drivers/gpu/drm/panthor/panthor_gem.h | 16 +-
drivers/gpu/drm/panthor/panthor_gpu.c | 14 +-
drivers/gpu/drm/panthor/panthor_gpu.h | 6 +
drivers/gpu/drm/panthor/panthor_heap.c | 2 +
drivers/gpu/drm/panthor/panthor_mmu.c | 79 +++--
drivers/gpu/drm/panthor/panthor_sched.c | 387 ++++++++++++++++++-----
include/linux/dma-heap.h | 8 +
include/uapi/drm/panthor_drm.h | 45 ++-
17 files changed, 864 insertions(+), 136 deletions(-)
--
2.43.0
RAM is not, in fact, cheap. Especially on embedded systems with a small
amount of memory but a known and well-defined userspace, more explicit
resource management can lead to better utilisation patterns. As an
example, a resource manager process on a purpose-built device may wish
to launch, and then explicitly swap out, memory of processes that are
kept "warm", to improve the perceived startup latency of individual
full-screen applications without making the kernel figure out the usage
pattern from observation alone in order to swap out the right pages.
To allow for this explicit control in the context of panthor's GPU
memory, add two new sysfs knobs. The first, mem_reclaim, runs an
explicit priv BO reclaim cycle on the TGID written to it.
The second, mem_claim, does the opposite: it swaps BOs back into active
memory.
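As a userspace illustration of the intended flow (the knob path below is a
placeholder; the ABI document added by this series defines the real
location):
```
#include <stdio.h>

/*
 * Illustrative only: write a TGID to the mem_reclaim knob to swap out a
 * "warm" process's priv BOs; mem_claim works the same way in reverse.
 * The knob path is an assumption made for the example.
 */
static int panthor_write_knob(const char *knob, int tgid)
{
	FILE *f = fopen(knob, "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", tgid);
	return fclose(f);
}
```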
Signed-off-by: Nicolas Frattaroli <nicolas.frattaroli(a)collabora.com>
---
Nicolas Frattaroli (4):
drm/panthor: Add freed_sz parameter to reclaim_priv_bos
MAINTAINERS: Add sysfs ABI docs to list of panthor files
drm/panthor: Add explicit memory reclaim sysfs knob
drm/panthor: Add explicit memory claim sysfs knob
Documentation/ABI/testing/sysfs-driver-panthor-mem | 34 ++++++++
MAINTAINERS | 1 +
drivers/gpu/drm/panthor/panthor_drv.c | 93 ++++++++++++++++++++++
drivers/gpu/drm/panthor/panthor_gem.c | 7 +-
drivers/gpu/drm/panthor/panthor_gem.h | 1 +
drivers/gpu/drm/panthor/panthor_mmu.c | 70 +++++++++++++++-
drivers/gpu/drm/panthor/panthor_mmu.h | 4 +
7 files changed, 205 insertions(+), 5 deletions(-)
---
base-commit: 2c4b906cd135bbb44855287d0d0eff0ee0b47afe
change-id: 20260506-panthor-explicit-reclaim-3dffed028d8c
Best regards,
--
Nicolas Frattaroli <nicolas.frattaroli(a)collabora.com>
From: Thierry Reding <treding(a)nvidia.com>
Hi,
This series adds support for the video protection region (VPR) used on
Tegra SoC devices. It's a special region of memory that is protected
from accesses by the CPU and used to store DRM protected content (both
decrypted stream data as well as decoded video frames).
Patches 1 and 2 add DT binding documentation for the VPR and add the VPR
to the list of memory-region items for display and host1x.
Patch 3 adds bitmap_allocate(), which is like bitmap_allocate_region()
but works on sizes that are not a power of two.
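To illustrate the semantics, here is a userspace model of the concept (not
the in-kernel implementation, which operates on unsigned long words): find
and mark a contiguous run of free bits whose length need not be a power of
two.
```
#include <stdbool.h>
#include <stddef.h>

/* Returns the start index of a newly marked run of @want free bits,
 * or -1 if no such run exists. */
static long bitmap_model_allocate(bool *used, size_t nbits, size_t want)
{
	size_t start = 0, run = 0;

	for (size_t i = 0; i < nbits; i++) {
		if (used[i]) {		/* run broken, restart after this bit */
			run = 0;
			start = i + 1;
			continue;
		}
		if (++run == want) {	/* found a run: mark it allocated */
			for (size_t j = start; j <= i; j++)
				used[j] = true;
			return (long)start;
		}
	}
	return -1;
}
```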
Patch 4 introduces new APIs needed by the Tegra VPR implementation that
allow CMA areas to be dynamically created at runtime rather than using
the fixed, system-wide list. This is used in this driver specifically
because it can use an arbitrary number of these areas (though they are
currently limited to 4).
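A hypothetical sketch of how the VPR heap might use this: cma_create()
below stands in for whatever constructor patch 4 actually introduces, so
its name and parameters are assumptions; cma_alloc() is the existing
allocator entry point.
```
#include <linux/cma.h>
#include <linux/err.h>
#include <linux/mm.h>
#include <linux/types.h>

/* Sketch only: create a dynamic CMA area over a carved-out region and
 * grab pages from it. */
static struct page *vpr_alloc_chunk(phys_addr_t base, size_t size)
{
	struct cma *cma;

	cma = cma_create(base, size, 0 /* order_per_bit */, "tegra-vpr");
	if (IS_ERR(cma))
		return NULL;

	return cma_alloc(cma, size >> PAGE_SHIFT, 0, false);
}
```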
Patch 5 adds some infrastructure for DMA heap implementations to provide
information through debugfs.
The Tegra VPR implementation is added in patch 6. See its commit message
for more details about the specifics of this implementation.
Finally, patches 7-10 add the VPR placeholder node on Tegra234 and hook
it up to the host1x and GPU nodes so that they can make use of this
region.
Changes in v2:
- Tegra VPR implementation is now more optimized to reduce the number of
(very slow) resize operations, and allows cross-chunk allocations
- dynamic CMA areas are now tracked separately from static ones, but the
global number of CMA pages accounts for all areas
Thierry
Thierry Reding (10):
dt-bindings: reserved-memory: Document Tegra VPR
dt-bindings: display: tegra: Document memory regions
bitmap: Add bitmap_allocate() function
mm/cma: Allow dynamically creating CMA areas
dma-buf: heaps: Add debugfs support
dma-buf: heaps: Add support for Tegra VPR
arm64: tegra: Add VPR placeholder node on Tegra234
arm64: tegra: Add GPU node on Tegra234
arm64: tegra: Hook up VPR to host1x
arm64: tegra: Hook up VPR to the GPU
.../display/tegra/nvidia,tegra186-dc.yaml | 10 +
.../display/tegra/nvidia,tegra20-dc.yaml | 10 +-
.../display/tegra/nvidia,tegra20-host1x.yaml | 7 +
.../nvidia,tegra-video-protection-region.yaml | 55 +
arch/arm/mm/dma-mapping.c | 2 +-
arch/arm64/boot/dts/nvidia/tegra234.dtsi | 60 +
arch/s390/mm/init.c | 2 +-
drivers/dma-buf/dma-heap.c | 56 +
drivers/dma-buf/heaps/Kconfig | 7 +
drivers/dma-buf/heaps/Makefile | 1 +
drivers/dma-buf/heaps/cma_heap.c | 2 +-
drivers/dma-buf/heaps/tegra-vpr.c | 1265 +++++++++++++++++
include/linux/bitmap.h | 25 +-
include/linux/cma.h | 7 +-
include/linux/dma-heap.h | 2 +
include/trace/events/tegra_vpr.h | 57 +
mm/cma.c | 187 ++-
mm/cma.h | 5 +-
18 files changed, 1713 insertions(+), 47 deletions(-)
create mode 100644 Documentation/devicetree/bindings/reserved-memory/nvidia,tegra-video-protection-region.yaml
create mode 100644 drivers/dma-buf/heaps/tegra-vpr.c
create mode 100644 include/trace/events/tegra_vpr.h
--
2.52.0
Removing the signal-on-any feature allows us to simplify the
dma_fence_array code a lot and saves us from the need to install a
callback on all fences at the same time.
This results in less memory and CPU overhead.
v2: fix potential double locking pointed out by Tvrtko
Signed-off-by: Christian König <christian.koenig(a)amd.com>
---
drivers/dma-buf/dma-fence-array.c | 134 +++++++++++++-----------------
drivers/gpu/drm/xe/xe_vm.c | 2 +-
include/linux/dma-fence-array.h | 22 ++---
3 files changed, 66 insertions(+), 92 deletions(-)
diff --git a/drivers/dma-buf/dma-fence-array.c b/drivers/dma-buf/dma-fence-array.c
index 5e10e8df372f..8b94c6287482 100644
--- a/drivers/dma-buf/dma-fence-array.c
+++ b/drivers/dma-buf/dma-fence-array.c
@@ -42,97 +42,88 @@ static void dma_fence_array_clear_pending_error(struct dma_fence_array *array)
cmpxchg(&array->base.error, PENDING_ERROR, 0);
}
-static void irq_dma_fence_array_work(struct irq_work *wrk)
+static void dma_fence_array_cb_func(struct dma_fence *f,
+ struct dma_fence_cb *cb)
{
- struct dma_fence_array *array = container_of(wrk, typeof(*array), work);
-
- dma_fence_array_clear_pending_error(array);
+ struct dma_fence_array *array =
+ container_of(cb, struct dma_fence_array, callback);
- dma_fence_signal(&array->base);
- dma_fence_put(&array->base);
+ irq_work_queue(&array->work);
}
-static void dma_fence_array_cb_func(struct dma_fence *f,
- struct dma_fence_cb *cb)
+static bool dma_fence_array_try_add_cb(struct dma_fence_array *array)
{
- struct dma_fence_array_cb *array_cb =
- container_of(cb, struct dma_fence_array_cb, cb);
- struct dma_fence_array *array = array_cb->array;
+ while (array->num_pending) {
+ struct dma_fence *f = array->fences[array->num_pending - 1];
- dma_fence_array_set_pending_error(array, f->error);
+ if (!dma_fence_add_callback(f, &array->callback,
+ dma_fence_array_cb_func))
+ return true;
- if (atomic_dec_and_test(&array->num_pending))
- irq_work_queue(&array->work);
- else
+ dma_fence_array_set_pending_error(array, f->error);
+ --array->num_pending;
+ }
+ return false;
+}
+
+static void dma_fence_array_irq_work(struct irq_work *wrk)
+{
+ struct dma_fence_array *array = container_of(wrk, typeof(*array), work);
+
+ --array->num_pending;
+ if (!dma_fence_array_try_add_cb(array)) {
+ dma_fence_signal(&array->base);
dma_fence_put(&array->base);
+ }
}
static bool dma_fence_array_enable_signaling(struct dma_fence *fence)
{
struct dma_fence_array *array = to_dma_fence_array(fence);
- struct dma_fence_array_cb *cb = array->callbacks;
- unsigned i;
- for (i = 0; i < array->num_fences; ++i) {
- cb[i].array = array;
+ /*
+ * As we may report that the fence is signaled before all
+ * callbacks are complete, we need to take an additional
+ * reference count on the array so that we do not free it too
+ * early. The core fence handling will only hold the reference
+ * until we signal the array as complete (but that is now
+ * insufficient).
+ */
+ dma_fence_get(&array->base);
+ if (!dma_fence_array_try_add_cb(array)) {
/*
- * As we may report that the fence is signaled before all
- * callbacks are complete, we need to take an additional
- * reference count on the array so that we do not free it too
- * early. The core fence handling will only hold the reference
- * until we signal the array as complete (but that is now
- * insufficient).
+ * When all fences are already signaled we can drop the reference again
+ * and report to the caller that the array can be signaled as well.
*/
- dma_fence_get(&array->base);
- if (dma_fence_add_callback(array->fences[i], &cb[i].cb,
- dma_fence_array_cb_func)) {
- int error = array->fences[i]->error;
-
- dma_fence_array_set_pending_error(array, error);
- dma_fence_put(&array->base);
- if (atomic_dec_and_test(&array->num_pending)) {
- dma_fence_array_clear_pending_error(array);
- return false;
- }
- }
+ dma_fence_put(&array->base);
+ return false;
}
-
return true;
}
static bool dma_fence_array_signaled(struct dma_fence *fence)
{
struct dma_fence_array *array = to_dma_fence_array(fence);
- int num_pending;
+ int num_pending, error = 0;
unsigned int i;
/*
- * We need to read num_pending before checking the enable_signal bit
- * to avoid racing with the enable_signaling() implementation, which
- * might decrement the counter, and cause a partial check.
- * atomic_read_acquire() pairs with atomic_dec_and_test() in
- * dma_fence_array_enable_signaling()
- *
- * The !--num_pending check is here to account for the any_signaled case
- * if we race with enable_signaling(), that means the !num_pending check
- * in the is_signalling_enabled branch might be outdated (num_pending
- * might have been decremented), but that's fine. The user will get the
- * right value when testing again later.
+ * Reading num_pending without a memory barrier here is correct since
+ * that is only for optimization, it is perfectly acceptable to have a
+ * stale value for it. In all other cases num_pending is accessed by a
+ * single call chain.
*/
- num_pending = atomic_read_acquire(&array->num_pending);
- if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &array->base.flags)) {
- if (num_pending <= 0)
- goto signal;
- return false;
- }
+ num_pending = READ_ONCE(array->num_pending);
+ for (i = 0; i < num_pending; ++i) {
+ struct dma_fence *f = array->fences[i];
- for (i = 0; i < array->num_fences; ++i) {
- if (dma_fence_is_signaled(array->fences[i]) && !--num_pending)
- goto signal;
- }
- return false;
+ if (!dma_fence_is_signaled(f))
+ return false;
-signal:
+ if (!error)
+ error = f->error;
+ }
+ dma_fence_array_set_pending_error(array, error);
dma_fence_array_clear_pending_error(array);
return true;
}
@@ -171,15 +162,12 @@ EXPORT_SYMBOL(dma_fence_array_ops);
/**
* dma_fence_array_alloc - Allocate a custom fence array
- * @num_fences: [in] number of fences to add in the array
*
* Return dma fence array on success, NULL on failure
*/
-struct dma_fence_array *dma_fence_array_alloc(int num_fences)
+struct dma_fence_array *dma_fence_array_alloc(void)
{
- struct dma_fence_array *array;
-
- return kzalloc_flex(*array, callbacks, num_fences);
+ return kzalloc_obj(struct dma_fence_array);
}
EXPORT_SYMBOL(dma_fence_array_alloc);
@@ -203,10 +191,13 @@ void dma_fence_array_init(struct dma_fence_array *array,
WARN_ON(!num_fences || !fences);
array->num_fences = num_fences;
+ array->num_pending = num_fences;
+ array->fences = fences;
+ array->base.error = PENDING_ERROR;
dma_fence_init(&array->base, &dma_fence_array_ops, NULL, context,
seqno);
- init_irq_work(&array->work, irq_dma_fence_array_work);
+ init_irq_work(&array->work, dma_fence_array_irq_work);
/*
* dma_fence_array_enable_signaling() is invoked while holding
@@ -220,11 +211,6 @@ void dma_fence_array_init(struct dma_fence_array *array,
*/
lockdep_set_class(&array->base.inline_lock, &dma_fence_array_lock_key);
- atomic_set(&array->num_pending, num_fences);
- array->fences = fences;
-
- array->base.error = PENDING_ERROR;
-
/*
* dma_fence_array objects should never contain any other fence
* containers or otherwise we run into recursion and potential kernel
@@ -265,7 +251,7 @@ struct dma_fence_array *dma_fence_array_create(int num_fences,
{
struct dma_fence_array *array;
- array = dma_fence_array_alloc(num_fences);
+ array = dma_fence_array_alloc();
if (!array)
return NULL;
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 62a87a051be7..8f472911469d 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -3370,7 +3370,7 @@ static struct dma_fence *ops_execute(struct xe_vm *vm,
goto err_trace;
}
- cf = dma_fence_array_alloc(n_fence);
+ cf = dma_fence_array_alloc();
if (!cf) {
fence = ERR_PTR(-ENOMEM);
goto err_out;
diff --git a/include/linux/dma-fence-array.h b/include/linux/dma-fence-array.h
index 1b1d87579c38..3ee55c0e2fa4 100644
--- a/include/linux/dma-fence-array.h
+++ b/include/linux/dma-fence-array.h
@@ -15,16 +15,6 @@
#include <linux/dma-fence.h>
#include <linux/irq_work.h>
-/**
- * struct dma_fence_array_cb - callback helper for fence array
- * @cb: fence callback structure for signaling
- * @array: reference to the parent fence array object
- */
-struct dma_fence_array_cb {
- struct dma_fence_cb cb;
- struct dma_fence_array *array;
-};
-
/**
* struct dma_fence_array - fence to represent an array of fences
* @base: fence base class
@@ -33,18 +23,17 @@ struct dma_fence_array_cb {
* @num_pending: fences in the array still pending
* @fences: array of the fences
* @work: internal irq_work function
- * @callbacks: array of callback helpers
+ * @callback: callback structure for signaling
*/
struct dma_fence_array {
struct dma_fence base;
- unsigned num_fences;
- atomic_t num_pending;
+ unsigned int num_fences;
+ unsigned int num_pending;
struct dma_fence **fences;
struct irq_work work;
-
- struct dma_fence_array_cb callbacks[] __counted_by(num_fences);
+ struct dma_fence_cb callback;
};
/**
@@ -78,11 +67,10 @@ to_dma_fence_array(struct dma_fence *fence)
for (index = 0, fence = dma_fence_array_first(head); fence; \
++(index), fence = dma_fence_array_next(head, index))
-struct dma_fence_array *dma_fence_array_alloc(int num_fences);
+struct dma_fence_array *dma_fence_array_alloc(void);
void dma_fence_array_init(struct dma_fence_array *array,
int num_fences, struct dma_fence **fences,
u64 context, unsigned seqno);
-
struct dma_fence_array *dma_fence_array_create(int num_fences,
struct dma_fence **fences,
u64 context, unsigned seqno);
--
2.43.0
I'm happy to see that DEPT reported real problems in practice:
https://lore.kernel.org/lkml/6383cde5-cf4b-facf-6e07-1378a485657d@I-love.SA…
https://lore.kernel.org/lkml/1674268856-31807-1-git-send-email-byungchul.pa…
https://lore.kernel.org/all/b6e00e77-4a8c-4e05-ab79-266bf05fcc2d@igalia.com/
I’ve added documentation describing DEPT — this should help you
understand what DEPT is and how it works. You can use DEPT simply by
enabling CONFIG_DEPT and checking dmesg at runtime.
---
Hi Linus and folks,
I’ve been developing a tool to detect deadlock possibilities by tracking
waits/events — rather than lock acquisition order — to cover all the
synchronization mechanisms. To summarize the design rationale, starting
from the problem statement, through analysis, to the solution:
CURRENT STATUS
--------------
Lockdep tracks lock acquisition order to identify deadlock conditions.
Additionally, it tracks IRQ state changes — via {en,dis}able — to
detect cases where locks are acquired unintentionally during
interrupt handling.
PROBLEM
-------
Waits and their associated events that are never reachable can
eventually lead to deadlocks. However, since Lockdep focuses solely
on lock acquisition order, it has inherent limitations when handling
waits and events.
Moreover, by tracking only lock acquisition order, Lockdep cannot
properly handle read locks or cross-event scenarios — such as
wait_for_completion() and complete() — making it increasingly
inadequate as a general-purpose deadlock detection tool.
SOLUTION
--------
Once again, waits and their associated events that are never
reachable can eventually lead to deadlocks. The new solution, DEPT,
focuses directly on waits and events. DEPT monitors waits and events,
and reports them when any become unreachable.
DEPT provides:
* Correct handling of read locks.
* Support for general waits and events.
* Continuous operation, even after multiple reports.
* Simple, intuitive annotation APIs.
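As a taste of those annotation APIs, here is a sketch of a hand-annotated
wait/event pair, based on the sdt_might_sleep_{start,end}() helpers named
in the patch list below (treat the exact signatures as assumptions):
```
#include <linux/dept_sdt.h>
#include <linux/wait.h>

static DEFINE_DEPT_SDT(my_event_sdt);	/* one class per wait/event pair */
static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static bool my_cond;

static void waiter_side(void)
{
	sdt_might_sleep_start(&my_event_sdt);	/* declare the upcoming wait */
	wait_event(my_wq, my_cond);		/* the wait DEPT tracks */
	sdt_might_sleep_end();
}

static void waker_side(void)
{
	my_cond = true;
	sdt_event(&my_event_sdt);	/* the event the waiter depends on */
	wake_up(&my_wq);
}
```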
There are still false positives, and some are already being worked on
for suppression. In particular, splitting the folio class into several
appropriate classes, e.g. a block device mapping class and a regular
file mapping class, is currently under active development by me and
Yeoreum Yun.
Anyway, these efforts will need to continue for a while, as we’ve seen
with lockdep over two decades. DEPT is tagged as EXPERIMENTAL in
Kconfig — meaning it’s not yet suitable for use as an automation tool.
However, for those who are interested in using DEPT to analyze complex
synchronization patterns and extract dependency insights, DEPT would be
a great tool for the purpose.
Thanks for the support and contributions from:
Harry Yoo <harry.yoo(a)oracle.com>
Gwan-gyeong Mun <gwan-gyeong.mun(a)intel.com>
Yunseong Kim <ysk(a)kzalloc.com>
Yeoreum Yun <yeoreum.yun(a)arm.com>
FAQ
---
Q. Is this the first attempt to solve this problem?
A. No. The cross-release feature (commit b09be676e0ff2) attempted to
address it — as a Lockdep extension. It was merged, but quickly
reverted, because:
While it uncovered valuable hidden issues, it also introduced false
positives. Since these false positives mask further real problems
with Lockdep — and developers strongly dislike them — the feature was
rolled back.
Q. Why wasn’t DEPT built as a Lockdep extension?
A. Lockdep is the result of years of work by kernel developers — and is
now very stable. But I chose to build DEPT separately, because:
While reusing BFS (Breadth First Search) and Lockdep's hashing is
beneficial, the rest of the system must be rebuilt from scratch to
align with DEPT’s wait-event model — since Lockdep was originally
designed for tracking lock acquisition orders, not wait-event
dependencies.
Q. Do you plan to replace Lockdep entirely?
A. Not at all — Lockdep still plays a vital role in validating correct
lock usage. While its dependency-checking logic should eventually be
superseded by DEPT, the rest of its functionality should stay.
Q. Should we replace the dependency check immediately?
A. Absolutely not. Lockdep’s stability is the result of years of hard
work by kernel developers. Lockdep and DEPT should run side by side
until DEPT matures.
Q. Stronger detection often leads to more false positives — which was a
major pain point when cross-release was added. Is DEPT designed to
handle this?
A. Yes. DEPT’s simple, generalized design enables flexible reporting —
so while false positives still need fixing, they’re far less
disruptive than they were under the Lockdep extension, cross-release.
Q. Why not fix all false positives out-of-tree before merging?
A. Since the affected subsystems span the entire kernel, DEPT, like
Lockdep, which has relied on annotations to avoid false positives over
the last two decades, will require similar annotation efforts.
Performing annotation work within the mainline will help us add
annotations more appropriately and will also make DEPT a useful tool
for a wider range of users more quickly.
CONFIG_DEPT is marked EXPERIMENTAL, so it’s opt-in. Some users are
already interested in using DEPT to analyze complex synchronization
patterns and extract dependency insights.
Byungchul
---
Changes from v17:
1. Rebase on the mainline as of 2025 Dec 5.
2. Convert the documents' format from txt to rst. (feedback from
Jonathan Corbet and Bagas Sanjaya)
3. Move the documents from 'Documentation/dependency' to
'Documentation/dev-tools'. (feedback from Jonathan Corbet)
4. Improve the documentation. (feedback from NeilBrown)
5. Use a common function, enter_from_user_mode(), instead of
arch-specific code, to notice context switches from user mode.
(feedback from Dave Hansen, Mark Rutland, and Mark Brown)
6. Resolve the header dependency issue by using dept's internal
header, instead of relocating 'struct llist_{head,node}' to
another header. (feedback from Greg KH)
7. Improve page (or folio) usage type APIs.
8. Add a rust helper for wait_for_completion(). (feedback from
Guangbo Cui, Boqun Feng, and Danilo Krummrich)
9. Refine some commit messages.
Changes from v16:
1. Rebase on v6.17.
2. Fix a false positive from rcu (by Yunseong Kim)
3. Introduce APIs to set page's usage, dept_set_page_usage() and
dept_reset_page_usage() to avoid false positives.
4. Consider lock_page() as a potential wait unconditionally.
5. Consider folio_lock_killable() as a potential wait
unconditionally.
6. Add support for tracking PG_writeback waits and events.
7. Fix two build errors due to the additional debug information
added by dept. (by Yunseong Kim)
Changes from v15:
1. Fix typos and improve comments and commit messages (feedback
from ALOK TIWARI, Waiman Long, and the kernel test robot).
2. Do not stop dept on detection of a circular dependency of a
recover event, allowing it to keep reporting.
3. Add SK hynix to copyright.
4. Consider folio_lock() as a potential wait unconditionally.
5. Fix a Kconfig dependency bug (feedback from the kernel test
robot).
6. Do not suppress reports that involve classes that have
already been involved in other reports, allowing dept to keep
reporting.
Changes from v14:
1. Rebase on the current latest, v6.15-rc6.
2. Refactor dept code.
3. With multiple event sites for a single wait, even if an event
forms a circular dependency, the event can be recovered by
other event (or wake-up) paths. Reporting the circular
dependency is still worthwhile, but it should be suppressed
after being reported once if it doesn't lead to an actual
deadlock. So introduce APIs to annotate the relationship
between an event site and a recover site, that is,
event_site() and dept_recover_event().
4. wait_for_completion() worked with a dept map embedded in
struct completion. However, it generated a few false positives
since all the waits using an instance of struct completion
shared the same map and key. To avoid these false positives,
stop sharing the map and key, and give each
wait_for_completion() caller its own key by default. Of
course, external maps can also be used if needed.
5. Fix a bug about hardirq on/off tracing.
6. Implement basic unit test for dept.
7. Add more supports for dma fence synchronization.
8. Add emergency stop of dept e.g. on panic().
9. Fix false positives by mmu_notifier_invalidate_*().
10. Fix recursive call bug by DEPT_WARN_*() and DEPT_STOP().
11. Fix trivial bugs in DEPT_WARN_*() and DEPT_STOP().
12. Fix a bug that a spin lock, dept_pool_spin, is used in
both contexts of irq disabled and enabled without irq
disabled.
13. Suppress reports with classes, any of that already have
been reported, even though they have different chains but
being barely meaningful.
14. Print stacktrace of the wait that an event is now waking up,
not only stacktrace of the event.
15. Make dept aware of lockdep_cmp_fn() that is used to avoid
false positives in lockdep so that dept can also avoid them.
16. Do do_event() only if no ecxts have been delimited.
17. Fix a bug where stage_m in struct dept_task was not properly
synchronized, by using a spin lock, dept_task()->stage_lock.
18. Fix a bug where dept didn't handle the case that multiple
ttwus for a single waiter can be called at the same time,
i.e. a race issue.
19. Distinguish each kernel context from others, not only by
system call but also by user-oriented fault, so that dept can
work with more accurate information about the kernel context.
That helps to avoid a few false positives.
20. Limit dept's working to x86_64 and arm64.
Changes from v13:
1. Rebase on the current latest version, v6.9-rc7.
2. Add 'dept' documentation describing dept APIs.
Changes from v12:
1. Refine the whole document for dept.
2. Add an 'Interpret dept report' section in the document, using a
deadlock report obtained in practice. Hope this version of the
document helps people understand dept better.
https://lore.kernel.org/lkml/6383cde5-cf4b-facf-6e07-1378a485657d@I-love.SA…
https://lore.kernel.org/lkml/1674268856-31807-1-git-send-email-byungchul.pa…
Changes from v11:
1. Add 'dept' documentation describing the concept of dept.
2. Rewrite the commit messages of the following commits, which
use weaker lockdep annotations, for better description.
fs/jbd2: Use a weaker annotation in journal handling
cpu/hotplug: Use a weaker annotation in AP thread
(feedback from Thomas Gleixner)
Changes from v10:
1. Fix noinstr warning when building kernel source.
2. dept has been reporting some false positives due to the folio
lock's unfairness. Reflect this and make dept work based on
dept annotations instead of just wait and wake up primitives.
3. Remove the support for PG_writeback while working on 2. I
will add the support later if needed.
4. dept didn't print a stacktrace for [S] if the participant of
a deadlock is not a lock mechanism but a general wait and
event. However, that made the report hard to interpret in
that case. So add support for printing the stacktrace of the
requestor who asked the event context to run - usually a
waiter of the event does it just before going to the wait
state.
5. Give up tracking raw_local_irq_{disable,enable}() since it
totally messed up dept's irq tracking. So make it work in the
same way as lockdep does. I will reconsider it once any false
positives from those are observed again.
6. Change the manual rwsem_acquire_read(->j_trans_commit_map)
annotation in fs/jbd2/transaction.c to the try version so
that it works as much as it exactly needs.
7. Remove unnecessary 'inline' keyword in dept.c and add
'__maybe_unused' to a needed place.
Changes from v9:
1. Fix a bug. SDT tracking didn't work well because of my big
mistake that I should've used the waiter's map to identify its
class but it had been working with the waker's one. FYI,
PG_locked and PG_writeback weren't affected. They still
worked well. (reported by YoungJun)
Changes from v8:
1. Fix a build error by adding EXPORT_SYMBOL(PG_locked_map) and
EXPORT_SYMBOL(PG_writeback_map) for kernel module builds -
apologies for that. (reported by kernel test robot)
2. Fix a build error by removing a header file circular
dependency that was caused by "atomic.h", "kernel.h" and
"irqflags.h", which I introduced - apologies for that.
(reported by kernel test robot)
Changes from v7:
1. Fix a bug that cannot track rwlock dependency properly,
introduced in v7. (reported by Boqun and lockdep selftest)
2. Track wait/event of PG_{locked,writeback} more aggressively
assuming that when a bit of PG_{locked,writeback} is cleared
there might be waits on the bit. (reported by Linus, Hillf
and syzbot)
3. Fix and clean up bad-style code, i.e. an unnecessarily
introduced random pattern and so on. (pointed out by Linus)
4. Clean code for applying dept to wait_for_completion().
Changes from v6:
1. Tie to task scheduler code to track sleep and try_to_wake_up()
assuming sleeps cause waits, try_to_wake_up()s would be the
events that those are waiting for, of course with proper dept
annotations, sdt_might_sleep_weak(), sdt_might_sleep_strong()
and so on. For these cases, class is classified at sleep
entrance rather than the synchronization initialization code.
Which would extremely reduce false alarms.
2. Remove the dept associated instance in each page struct for
tracking dependencies by PG_locked and PG_writeback thanks to
the 1. work above.
3. Introduce CONFIG_DEPT_AGGRESSIVE_TIMEOUT_WAIT to suppress
reports that involve waits with a timeout set, for those
who don't like verbose reporting.
4. Add a mechanism to refill the internal memory pools on
running out so that dept could keep working as long as free
memory is available in the system.
5. Re-enable tracking hashed-waitqueue waits. That's going to no
longer generate false positives because the class is classified
at sleep entrance rather than at waitqueue initialization.
6. Refactor to make it easier to port onto each new version of
the kernel.
7. Apply dept to dma fence.
8. Do trivial optimizations.
Changes from v5:
1. Use just pr_warn_once() rather than WARN_ONCE() on the lack
of internal resources because WARN_*() printing stacktrace is
too much for informing the lack. (feedback from Ted, Hyeonggon)
2. Fix trivial bugs like missing initializing a struct before
using it.
3. Assign a different class per task when handling onstack
variables for waitqueue or the like. Which makes dept
distinguish between onstack variables of different tasks so
as to prevent false positives. (reported by Hyeonggon)
4. Make dept aware of even raw_local_irq_*() to prevent false
positives. (reported by Hyeonggon)
5. Don't consider dependencies between the events that might be
triggered within __schedule() and the waits that requires
__schedule(), real ones. (reported by Hyeonggon)
6. Unstage the staged wait that has prepare_to_wait_event()'ed
*and* yet to get to __schedule(), if we encounter __schedule()
in-between for another sleep, which is possible if e.g. a
mutex_lock() exists in 'condition' of ___wait_event().
7. Turn on CONFIG_PROVE_LOCKING when CONFIG_DEPT is on, to rely
on the hardirq and softirq entrance tracing to make dept more
portable for now.
Changes from v4:
1. Fix some bugs that produce false alarms.
2. Distinguish each syscall context from another *for arm64*.
3. Just print a message rather than warn in case the dept ring
buffer gets exhausted. (feedback from Hyeonggon)
4. Explicitly describe "EXPERIMENTAL" and "dept might produce
false positive reports" in Kconfig. (feedback from Ted)
Changes from v3:
1. dept shouldn't create dependencies between different depths
of a class that were indicated by *_lock_nested(). dept
normally doesn't but it does once another lock class comes
in. So fixed it. (feedback from Hyeonggon)
2. dept considered a wait as a real wait once getting to
__schedule() even if it has been set to TASK_RUNNING by wake
up sources in advance. Fixed it so that dept doesn't consider
the case as a real wait. (feedback from Jan Kara)
3. Stop tracking dependencies with a map once the event
associated with the map has been handled. dept will start to
work with the map again, on the next sleep.
Changes from v2:
1. Disable dept on bit_wait_table[] in sched/wait_bit.c, which
was reporting a lot of false positives - my fault.
Wait/event for bit_wait_table[] should've been tagged in a
higher layer to work better, which is future work.
(feedback from Jan Kara)
2. Disable dept on crypto_larval's completion to prevent a false
positive.
Changes from v1:
1. Fix coding style and typo. (feedback from Steven)
2. Distinguish each work context from another in workqueue.
3. Skip checking lock acquisition with nest_lock, which is about
correct lock usage that should be checked by lockdep.
Changes from RFC(v0):
1. Prevent adding a wait tag at prepare_to_wait() but __schedule().
(feedback from Linus and Matthew)
2. Use try version at lockdep_acquire_cpus_lock() annotation.
3. Distinguish each syscall context from another.
Byungchul Park (41):
dept: implement DEPT(DEPendency Tracker)
dept: add single event dependency tracker APIs
dept: add lock dependency tracker APIs
dept: tie to lockdep and IRQ tracing
dept: add proc knobs to show stats and dependency graph
dept: distinguish each kernel context from another
dept: distinguish each work from another
dept: add a mechanism to refill the internal memory pools on running
out
dept: record the latest one out of consecutive waits of the same class
dept: apply sdt_might_sleep_{start,end}() to
wait_for_completion()/complete()
dept: apply sdt_might_sleep_{start,end}() to swait
dept: apply sdt_might_sleep_{start,end}() to waitqueue wait
dept: apply sdt_might_sleep_{start,end}() to hashed-waitqueue wait
dept: apply sdt_might_sleep_{start,end}() to dma fence
dept: track timeout waits separately with a new Kconfig
dept: apply timeout consideration to wait_for_completion()/complete()
dept: apply timeout consideration to swait
dept: apply timeout consideration to waitqueue wait
dept: apply timeout consideration to hashed-waitqueue wait
dept: apply timeout consideration to dma fence wait
dept: make dept able to work with an external wgen
dept: track PG_locked with dept
dept: print staged wait's stacktrace on report
locking/lockdep: prevent various lockdep assertions when
lockdep_off()'ed
dept: add documents for dept
cpu/hotplug: use a weaker annotation in AP thread
dept: assign dept map to mmu notifier invalidation synchronization
dept: assign unique dept_key to each distinct dma fence caller
dept: make dept aware of lockdep_set_lock_cmp_fn() annotation
dept: make dept stop from working on debug_locks_off()
dept: assign unique dept_key to each distinct wait_for_completion()
caller
completion, dept: introduce init_completion_dmap() API
dept: introduce a new type of dependency tracking between multi event
sites
dept: add module support for struct dept_event_site and
dept_event_site_dep
dept: introduce event_site() to disable event tracking if it's
recoverable
dept: implement a basic unit test for dept
dept: call dept_hardirqs_off() in local_irq_*() regardless of irq
state
dept: introduce APIs to set page usage and use subclasses_evt for the
usage
dept: track PG_writeback with dept
SUNRPC: relocate struct rcu_head to the first field of struct rpc_xprt
mm: percpu: increase PERCPU_DYNAMIC_SIZE_SHIFT on DEPT and large
PAGE_SIZE
Yunseong Kim (1):
rcu/update: fix same dept key collision between various types of RCU
Documentation/dev-tools/dept.rst | 778 ++++++
Documentation/dev-tools/dept_api.rst | 125 +
drivers/dma-buf/dma-fence.c | 23 +-
include/asm-generic/vmlinux.lds.h | 13 +-
include/linux/completion.h | 124 +-
include/linux/dept.h | 402 +++
include/linux/dept_ldt.h | 78 +
include/linux/dept_sdt.h | 68 +
include/linux/dept_unit_test.h | 67 +
include/linux/dma-fence.h | 74 +-
include/linux/hardirq.h | 3 +
include/linux/irq-entry-common.h | 4 +
include/linux/irqflags.h | 21 +-
include/linux/local_lock_internal.h | 1 +
include/linux/lockdep.h | 105 +-
include/linux/lockdep_types.h | 3 +
include/linux/mm_types.h | 4 +
include/linux/mmu_notifier.h | 26 +
include/linux/module.h | 5 +
include/linux/mutex.h | 1 +
include/linux/page-flags.h | 217 +-
include/linux/pagemap.h | 37 +-
include/linux/percpu-rwsem.h | 2 +-
include/linux/percpu.h | 4 +
include/linux/rcupdate_wait.h | 13 +-
include/linux/rtmutex.h | 1 +
include/linux/rwlock_types.h | 1 +
include/linux/rwsem.h | 1 +
include/linux/sched.h | 118 +
include/linux/seqlock.h | 2 +-
include/linux/spinlock_types_raw.h | 3 +
include/linux/srcu.h | 2 +-
include/linux/sunrpc/xprt.h | 9 +-
include/linux/swait.h | 3 +
include/linux/wait.h | 3 +
include/linux/wait_bit.h | 3 +
init/init_task.c | 2 +
init/main.c | 2 +
kernel/Makefile | 1 +
kernel/cpu.c | 2 +-
kernel/dependency/Makefile | 5 +
kernel/dependency/dept.c | 3499 ++++++++++++++++++++++++++
kernel/dependency/dept_hash.h | 10 +
kernel/dependency/dept_internal.h | 314 +++
kernel/dependency/dept_object.h | 13 +
kernel/dependency/dept_proc.c | 94 +
kernel/dependency/dept_unit_test.c | 173 ++
kernel/exit.c | 1 +
kernel/fork.c | 2 +
kernel/locking/lockdep.c | 33 +
kernel/module/main.c | 19 +
kernel/rcu/rcu.h | 1 +
kernel/rcu/update.c | 5 +-
kernel/sched/completion.c | 62 +-
kernel/sched/core.c | 9 +
kernel/workqueue.c | 3 +
lib/Kconfig.debug | 48 +
lib/debug_locks.c | 2 +
lib/locking-selftest.c | 2 +
mm/filemap.c | 38 +
mm/mm_init.c | 3 +
mm/mmu_notifier.c | 31 +-
rust/helpers/completion.c | 5 +
63 files changed, 6602 insertions(+), 121 deletions(-)
create mode 100644 Documentation/dev-tools/dept.rst
create mode 100644 Documentation/dev-tools/dept_api.rst
create mode 100644 include/linux/dept.h
create mode 100644 include/linux/dept_ldt.h
create mode 100644 include/linux/dept_sdt.h
create mode 100644 include/linux/dept_unit_test.h
create mode 100644 kernel/dependency/Makefile
create mode 100644 kernel/dependency/dept.c
create mode 100644 kernel/dependency/dept_hash.h
create mode 100644 kernel/dependency/dept_internal.h
create mode 100644 kernel/dependency/dept_object.h
create mode 100644 kernel/dependency/dept_proc.c
create mode 100644 kernel/dependency/dept_unit_test.c
base-commit: 43dfc13ca972988e620a6edb72956981b75ab6b0
--
2.17.1
From: Xueyuan Chen <Xueyuan.chen21(a)gmail.com>
Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
with a more efficient nested loop approach.
Instead of iterating page by page, we now iterate through the scatterlist
entries via for_each_sgtable_sg(). Because pages within a single sg entry
are physically contiguous, we can populate the page array in an inner
loop using simple pointer arithmetic. This saves a lot of time.
The WARN_ON check is also pulled out of the loop to save branch
instructions.
Performance results mapping a 2GB buffer on Radxa O6:
- Before: ~1440000 ns
- After: ~232000 ns
(~84% reduction in iteration time, or ~6.2x faster)
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: Benjamin Gaignard <benjamin.gaignard(a)collabora.com>
Cc: Brian Starkey <Brian.Starkey(a)arm.com>
Cc: John Stultz <jstultz(a)google.com>
Cc: T.J. Mercier <tjmercier(a)google.com>
Cc: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Xueyuan Chen <Xueyuan.chen21(a)gmail.com>
Signed-off-by: Barry Song (Xiaomi) <baohua(a)kernel.org>
---
drivers/dma-buf/heaps/system_heap.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index b3650d8fd651..769f01f0cc96 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -224,16 +224,21 @@ static void *system_heap_do_vmap(struct system_heap_buffer *buffer)
int npages = PAGE_ALIGN(buffer->len) / PAGE_SIZE;
struct page **pages = vmalloc(sizeof(struct page *) * npages);
struct page **tmp = pages;
- struct sg_page_iter piter;
void *vaddr;
+ u32 i, j, count;
+ struct page *base_page;
+ struct scatterlist *sg;
if (!pages)
return ERR_PTR(-ENOMEM);
- for_each_sgtable_page(table, &piter, 0) {
- WARN_ON(tmp - pages >= npages);
- *tmp++ = sg_page_iter_page(&piter);
+ for_each_sgtable_sg(table, sg, i) {
+ base_page = sg_page(sg);
+ count = sg->length >> PAGE_SHIFT;
+ for (j = 0; j < count; j++)
+ *tmp++ = base_page + j;
}
+ WARN_ON(tmp - pages != npages);
vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
vfree(pages);
--
2.39.3 (Apple Git-146)
On Tue, May 5, 2026 at 6:00 AM Julian Orth <ju.orth(a)gmail.com> wrote:
>
> On Tue, May 5, 2026 at 2:41 PM Christian König <christian.koenig(a)amd.com> wrote:
> >
> > Hi Julian,
> >
> > On 5/5/26 14:25, Julian Orth wrote:
> > > In ab4c3dcf9a71582503b4fb25aeab884c696cab25 ("dma-buf: Remove DMA-BUF
> > > sysfs stats") the /sys/kernel/dmabuf/buffer directory was removed.
> > >
> > > I've been using this interface, specifically the exporter_name file,
> > > to detect dmabufs created via udmabuf. Such dmabufs show "udmabuf" in
> > > exporter_name. I've been doing this for two reasons: 1) to detect that
> > > mmap on such buffers will be fast and 2) to detect that GPU access to
> > > such buffers will be slow.
> >
> > Crap, I really hoped that Android was the only user of that sysfs interface since that approach turned out to be quite broken.
> >
> > It's number one rule on Linux that we don't break userspace. So I hope that you don't insist on bringing that interface back, but if you do I will just revert the removal until we found a better solution.
>
> Bringing it back shouldn't be necessary.
>
> >
> > > With the removal of that file, that detection mechanism no longer works.
> > >
> > > I'm not particularly fond of that mechanism but it was the only one
> > > providing that functionality that I could find at the time. If there
> > > is another one, ideally an ioctl on the dmabuf, please let me know.
> >
> > The virtual fdinfo file you can find under /proc/$pid/fdinfo/$fd also contains the exporter name for the DMA-buf.
> >
> > You can find the full documentation here: https://docs.kernel.org/filesystems/proc.html#dma-buffer-files
> >
> > Is that sufficient?
>
> I think that is sufficient. I probably didn't use fdinfo initially
> because 1) it's a lot more work to parse and 2) I wasn't sure if it
> was intended to be machine-readable or if there could sometimes be
> newlines in the values and such.
>
> >
> > Additional to that the debugfs for DMA-buf also contains that information and I'm open to the suggestion with the IOCTL.
>
> My application runs as a regular user so it cannot access /sys/kernel/debug.
>
> Having an IOCTL would be ideal if it is not too much work. I'll fall
> back to fdinfo for now.
>
> Thanks, Julian
Phew, I'm glad fdinfo suits your needs.
Adding an ioctl would introduce new UAPI so I think we'd want to avoid
that unless absolutely necessary.
Thanks,
T.J.
> >
> > Regards,
> > Christian.
> >
> > >
> > > Shipping an entire BPF compiler in my application, which the original
> > > patch suggests as the replacement, is not an option when the removed
> > > alternative was simply reading a file.
> > >
> > > Thanks, Julian
> >
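For reference, a minimal userspace sketch of the fdinfo fallback discussed
above; the exp_name field is part of the documented dma-buf fdinfo output,
while the parsing details here are only illustrative:
```
#include <stdio.h>

/* Read the dma-buf exporter name (e.g. "udmabuf") for an open fd. */
static int dmabuf_exporter_name(int fd, char out[64])
{
	char path[64], line[256];
	FILE *f;
	int ret = -1;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "exp_name:%63s", out) == 1) {
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;	/* -1: not a dma-buf, or field absent */
}
```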