I'm happy to see that DEPT reported real problems in practice:
https://lore.kernel.org/lkml/6383cde5-cf4b-facf-6e07-1378a485657d@I-love.SA…
https://lore.kernel.org/lkml/1674268856-31807-1-git-send-email-byungchul.pa…
https://lore.kernel.org/all/b6e00e77-4a8c-4e05-ab79-266bf05fcc2d@igalia.com/
I’ve added documentation describing DEPT — this should help you
understand what DEPT is and how it works. You can use DEPT simply by
enabling CONFIG_DEPT and checking dmesg at runtime.
---
Hi Linus and folks,
I’ve been developing a tool to detect deadlock possibilities by tracking
waits/events — rather than lock acquisition order — to cover all the
synchronization mechanisms. To summarize the design rationale, starting
from the problem statement, through analysis, to the solution:
CURRENT STATUS
--------------
Lockdep tracks lock acquisition order to identify deadlock conditions.
Additionally, it tracks IRQ state changes — via {en,dis}able — to
detect cases where locks are acquired unintentionally during
interrupt handling.
PROBLEM
-------
Waits whose associated events can never be reached will eventually
lead to deadlocks. However, since Lockdep focuses solely on lock
acquisition order, it has inherent limitations when handling general
waits and events.
Moreover, by tracking only lock acquisition order, Lockdep cannot
properly handle read locks or cross-event scenarios — such as
wait_for_completion() and complete() — making it increasingly
inadequate as a general-purpose deadlock detection tool.
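For example, here is a classic wait/event deadlock that involves no
lock-order inversion at all, so tracking lock acquisition order alone
cannot flag it (a minimal sketch):

   /* thread A */
   mutex_lock(&m);
   wait_for_completion(&c);        /* waits for thread B's complete() */
   mutex_unlock(&m);

   /* thread B */
   mutex_lock(&m);                 /* blocks behind thread A forever */
   complete(&c);                   /* never reached: deadlock */
   mutex_unlock(&m);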
SOLUTION
--------
Once again: waits whose associated events can never be reached will
eventually lead to deadlocks. The new solution, DEPT, focuses directly
on waits and events. DEPT monitors waits and events, and reports
whenever an event becomes unreachable from its corresponding wait.
DEPT provides:
* Correct handling of read locks.
* Support for general waits and events.
* Continuous operation, even after multiple reports.
* Simple, intuitive annotation APIs (see the sketch below).
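As a rough illustration of the annotation APIs, annotating a custom
wait/event pair might look like this (a hedged sketch using the sdt_*
names from this series; exact signatures may differ from the patchset):

   /* a dept map/class for a custom wait/event pair (illustrative) */
   DEFINE_DEPT_SDT(my_evt);

   /* waiter side: tell DEPT a sleep for this event may happen */
   sdt_might_sleep_start(&my_evt);
   wait_event(my_wq, done);
   sdt_might_sleep_end();

   /* waker side: tell DEPT the event has been triggered */
   done = true;
   sdt_event(&my_evt);
   wake_up(&my_wq);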
There are still false positives, and work to suppress some of them is
already underway. In particular, splitting the folio class into several
more appropriate classes, e.g. a block device mapping class and a
regular file mapping class, is under active development by me and
Yeoreum Yun.
Anyway, these efforts will need to continue for a while, as we’ve seen
with lockdep over two decades. DEPT is tagged as EXPERIMENTAL in
Kconfig — meaning it’s not yet suitable for use as an automation tool.
However, for those interested in using DEPT to analyze complex
synchronization patterns and extract dependency insights, it is
already a useful tool for the purpose.
Thanks to the following for their support and contributions:
Harry Yoo <harry.yoo(a)oracle.com>
Gwan-gyeong Mun <gwan-gyeong.mun(a)intel.com>
Yunseong Kim <ysk(a)kzalloc.com>
Yeoreum Yun <yeoreum.yun(a)arm.com>
FAQ
---
Q. Is this the first attempt to solve this problem?
A. No. The cross-release feature (commit b09be676e0ff2) attempted to
address it — as a Lockdep extension. It was merged, but quickly
reverted, because:
While it uncovered valuable hidden issues, it also introduced false
positives. Since these false positives mask further real problems
with Lockdep — and developers strongly dislike them — the feature was
rolled back.
Q. Why wasn’t DEPT built as a Lockdep extension?
A. Lockdep is the result of years of work by kernel developers — and is
now very stable. But I chose to build DEPT separately, because:
While reusing BFS (Breadth First Search) and Lockdep's hashing is
beneficial, the rest of the system must be rebuilt from scratch to
align with DEPT’s wait-event model — since Lockdep was originally
designed for tracking lock acquisition orders, not wait-event
dependencies.
Q. Do you plan to replace Lockdep entirely?
A. Not at all — Lockdep still plays a vital role in validating correct
lock usage. While its dependency-checking logic should eventually be
superseded by DEPT, the rest of its functionality should stay.
Q. Should we replace the dependency check immediately?
A. Absolutely not. Lockdep’s stability is the result of years of hard
work by kernel developers. Lockdep and DEPT should run side by side
until DEPT matures.
Q. Stronger detection often leads to more false positives — which was a
major pain point when cross-release was added. Is DEPT designed to
handle this?
A. Yes. DEPT’s simple, generalized design enables flexible reporting —
so while false positives still need fixing, they’re far less
disruptive than they were under the Lockdep extension, cross-release.
Q. Why not fix all false positives out-of-tree before merging?
A. The affected subsystems span the entire kernel. Like Lockdep, which
has relied on annotations to avoid false positives over the last
two decades, DEPT will also require ongoing annotation effort.
Performing that annotation work in mainline will help us add
annotations more appropriately and will make DEPT a useful tool
for a wider range of users more quickly.
CONFIG_DEPT is marked EXPERIMENTAL, so it’s opt-in. Some users are
already interested in using DEPT to analyze complex synchronization
patterns and extract dependency insights.
Byungchul
---
Changes from v17:
1. Rebase on the mainline as of 2025 Dec 5.
2. Convert the documents' format from txt to rst. (feedback
from Jonathan Corbet and Bagas Sanjaya)
3. Move the documents from 'Documentation/dependency' to
'Documentation/dev-tools'. (feedback from Jonathan Corbet)
4. Improve the documentation. (feedback from NeilBrown)
5. Use a common function, enter_from_user_mode(), instead of
arch-specific code, to detect a context switch from user mode.
(feedback from Dave Hansen, Mark Rutland, and Mark Brown)
6. Resolve the header dependency issue by using dept's internal
header, instead of relocating 'struct llist_{head,node}' to
another header. (feedback from Greg KH)
7. Improve the page (or folio) usage type APIs.
8. Add a rust helper for wait_for_completion(). (feedback from
Guangbo Cui, Boqun Feng, and Danilo Krummrich)
9. Refine some commit messages.
Changes from v16:
1. Rebase on v6.17.
2. Fix a false positive from rcu (by Yunseong Kim)
3. Introduce APIs to set a page's usage, dept_set_page_usage() and
dept_reset_page_usage(), to avoid false positives.
4. Consider lock_page() as a potential wait unconditionally.
5. Consider folio_lock_killable() as a potential wait
unconditionally.
6. Add support for tracking PG_writeback waits and events.
7. Fix two build errors due to the additional debug information
added by dept. (by Yunseong Kim)
Changes from v15:
1. Fix typos and improve comments and commit messages. (feedback
from ALOK TIWARI, Waiman Long, and kernel test robot)
2. Do not stop dept on detection of a circular dependency on a
recover event, allowing it to keep reporting.
3. Add SK hynix to copyright.
4. Consider folio_lock() as a potential wait unconditionally.
5. Fix a Kconfig dependency bug. (feedback from kernel test robot)
6. Do not suppress reports involving classes that have already
been involved in other reports, allowing reporting to
continue.
Changes from v14:
1. Rebase on the current latest, v6.15-rc6.
2. Refactor dept code.
3. With multiple event sites for a single wait, even if one event
forms a circular dependency, the event can be recovered via
other event (or wake-up) paths. Reporting the circular
dependency is still worthwhile, but it should be suppressed
after being reported once if it cannot lead to an actual
deadlock. So introduce APIs to annotate the relationship
between an event site and its recover site, namely
event_site() and dept_recover_event().
4. wait_for_completion() used to work with a dept map embedded in
struct completion. However, that generated a few false
positives since all the waits on an instance of struct
completion shared the same map and key. To avoid those false
positives, stop sharing the map and key, and give each
wait_for_completion() caller its own key by default. Of
course, external maps can still be used if needed.
5. Fix a bug about hardirq on/off tracing.
6. Implement basic unit test for dept.
7. Add more supports for dma fence synchronization.
8. Add emergency stop of dept e.g. on panic().
9. Fix false positives by mmu_notifier_invalidate_*().
10. Fix recursive call bug by DEPT_WARN_*() and DEPT_STOP().
11. Fix trivial bugs in DEPT_WARN_*() and DEPT_STOP().
12. Fix a bug where a spin lock, dept_pool_spin, was used in
both irq-disabled and irq-enabled contexts without disabling
irqs.
13. Suppress reports involving classes any of which have already
been reported; even though such reports have different
chains, they are barely meaningful.
14. Print the stacktrace of the wait that an event is waking up,
not only the stacktrace of the event.
15. Make dept aware of lockdep_cmp_fn() that is used to avoid
false positives in lockdep so that dept can also avoid them.
16. Do do_event() only if no ecxts have been delimited.
17. Fix a missing synchronization bug for stage_m in struct
dept_task by using a spin lock, dept_task()->stage_lock.
18. Fix a bug where dept didn't handle the case where multiple
ttwus for a single waiter are called at the same time,
i.e. a race issue.
19. Distinguish each kernel context from the others, not only by
system call but also by user-originated fault, so that dept
can work with more accurate information about the kernel
context. This helps avoid a few false positives.
20. Limit dept's working to x86_64 and arm64.
Changes from v13:
1. Rebase on the current latest version, v6.9-rc7.
2. Add 'dept' documentation describing dept APIs.
Changes from v12:
1. Refine the whole document for dept.
2. Add an 'Interpret dept report' section to the document, using
a deadlock report obtained in practice. Hopefully this version
of the document helps people understand dept better.
https://lore.kernel.org/lkml/6383cde5-cf4b-facf-6e07-1378a485657d@I-love.SA…
https://lore.kernel.org/lkml/1674268856-31807-1-git-send-email-byungchul.pa…
Changes from v11:
1. Add 'dept' documentation describing the concept of dept.
2. Rewrite the commit messages of the following commits, which
use weaker lockdep annotations, for better description:
fs/jbd2: Use a weaker annotation in journal handling
cpu/hotplug: Use a weaker annotation in AP thread
(feedback from Thomas Gleixner)
Changes from v10:
1. Fix noinstr warning when building kernel source.
2. dept had been reporting some false positives due to the folio
lock's unfairness. Reflect this, and make dept work based on
dept annotations instead of just the wait and wake-up primitives.
3. Remove the support for PG_writeback while working on 2. I
will add the support later if needed.
4. dept didn't print the stacktrace for [S] if the participant in
a deadlock was not a lock mechanism but a general wait and
event. That made the report hard to interpret in such cases.
So add support for printing the stacktrace of the requestor
who asked the event context to run - usually a waiter of the
event does so just before going into the wait state.
5. Give up tracking raw_local_irq_{disable,enable}() since it
totally messed up dept's irq tracking. So make it work the
same way lockdep does. I will reconsider it if any false
positives caused by those are observed again.
6. Change the manual rwsem_acquire_read(->j_trans_commit_map)
annotation in fs/jbd2/transaction.c to the try version so
that it does only as much as it exactly needs to.
7. Remove unnecessary 'inline' keyword in dept.c and add
'__maybe_unused' to a needed place.
Changes from v9:
1. Fix a bug: SDT tracking didn't work well because of my big
mistake - I should've used the waiter's map to identify its
class, but it had been working with the waker's. FYI,
PG_locked and PG_writeback weren't affected; they still
worked well. (reported by YoungJun)
Changes from v8:
1. Fix a build error by adding EXPORT_SYMBOL(PG_locked_map) and
EXPORT_SYMBOL(PG_writeback_map) for kernel module builds -
apologies for that. (reported by kernel test robot)
2. Fix a build error by removing a circular header file
dependency among "atomic.h", "kernel.h" and "irqflags.h",
which I introduced - apologies for that. (reported by kernel
test robot)
Changes from v7:
1. Fix a bug, introduced in v7, where rwlock dependencies could
not be tracked properly. (reported by Boqun and lockdep
selftest)
2. Track wait/event of PG_{locked,writeback} more aggressively,
assuming that when a bit of PG_{locked,writeback} is cleared
there might be waits on the bit. (reported by Linus, Hillf
and syzbot)
3. Fix and clean up badly styled code, e.g. an unnecessarily
introduced random pattern and so on. (pointed out by Linus)
4. Clean code for applying dept to wait_for_completion().
Changes from v6:
1. Tie into the task scheduler code to track sleeps and
try_to_wake_up(), assuming sleeps are waits and
try_to_wake_up()s are the events being waited for, of course
with proper dept annotations such as sdt_might_sleep_weak(),
sdt_might_sleep_strong() and so on. For these cases, the
class is assigned at the sleep entrance rather than in the
synchronization initialization code, which greatly reduces
false alarms.
2. Remove the dept associated instance in each page struct for
tracking dependencies by PG_locked and PG_writeback thanks to
the 1. work above.
3. Introduce CONFIG_DEPT_AGGRESSIVE_TIMEOUT_WAIT to suppress
reports involving waits that have a timeout set, for those
who don't like verbose reporting.
4. Add a mechanism to refill the internal memory pools on
running out so that dept could keep working as long as free
memory is available in the system.
5. Re-enable tracking hashed-waitqueue waits. This should no
longer generate false positives because the class is assigned
at the sleep entrance rather than at waitqueue initialization.
6. Refactor to make it easier to port onto each new version of
the kernel.
7. Apply dept to dma fence.
8. Do trivial optimizations.
Changes from v5:
1. Use just pr_warn_once() rather than WARN_ONCE() on the lack
of internal resources, because WARN_*() printing a stacktrace
is too much just to report the shortage. (feedback from Ted,
Hyeonggon)
2. Fix trivial bugs like missing initializing a struct before
using it.
3. Assign a different class per task when handling onstack
variables for waitqueues or the like. This makes dept
distinguish between onstack variables of different tasks so
as to prevent false positives. (reported by Hyeonggon)
4. Make dept aware of even raw_local_irq_*() to prevent false
positives. (reported by Hyeonggon)
5. Don't consider dependencies between events that might be
triggered within __schedule() and waits that require
__schedule() as real ones. (reported by Hyeonggon)
6. Unstage a staged wait that has done prepare_to_wait_event()
*and* has yet to reach __schedule(), if we encounter
__schedule() in between for another sleep, which is possible
if e.g. a mutex_lock() exists in the 'condition' of
___wait_event().
7. Turn on CONFIG_PROVE_LOCKING when CONFIG_DEPT is on, to rely
on the hardirq and softirq entrance tracing to make dept more
portable for now.
Changes from v4:
1. Fix some bugs that produce false alarms.
2. Distinguish each syscall context from another *for arm64*.
3. Don't warn, just print a message, in case the dept ring
buffer gets exhausted. (feedback from Hyeonggon)
4. Explicitly describe "EXPERIMENTAL" and "dept might produce
false positive reports" in Kconfig. (feedback from Ted)
Changes from v3:
1. dept shouldn't create dependencies between different depths
of a class that were indicated by *_lock_nested(). dept
normally doesn't but it does once another lock class comes
in. So fixed it. (feedback from Hyeonggon)
2. dept considered a wait a real wait once it got to
__schedule(), even if the task had been set to TASK_RUNNING
by a wake-up source in advance. Fixed it so that dept doesn't
consider that case a real wait. (feedback from Jan Kara)
3. Stop tracking dependencies with a map once the event
associated with the map has been handled. dept will start to
work with the map again, on the next sleep.
Changes from v2:
1. Disable dept on bit_wait_table[] in sched/wait_bit.c, which
was reporting a lot of false positives - my fault. Wait/event
for bit_wait_table[] should've been tagged at a higher layer
to work better; that is future work.
(feedback from Jan Kara)
2. Disable dept on crypto_larval's completion to prevent a false
positive.
Changes from v1:
1. Fix coding style and typo. (feedback from Steven)
2. Distinguish each work context from another in workqueue.
3. Skip checking lock acquisition with nest_lock, which is about
correct lock usage that should be checked by lockdep.
Changes from RFC(v0):
1. Add the wait tag at __schedule() rather than at prepare_to_wait().
(feedback from Linus and Matthew)
2. Use try version at lockdep_acquire_cpus_lock() annotation.
3. Distinguish each syscall context from another.
Byungchul Park (41):
dept: implement DEPT(DEPendency Tracker)
dept: add single event dependency tracker APIs
dept: add lock dependency tracker APIs
dept: tie to lockdep and IRQ tracing
dept: add proc knobs to show stats and dependency graph
dept: distinguish each kernel context from another
dept: distinguish each work from another
dept: add a mechanism to refill the internal memory pools on running
out
dept: record the latest one out of consecutive waits of the same class
dept: apply sdt_might_sleep_{start,end}() to
wait_for_completion()/complete()
dept: apply sdt_might_sleep_{start,end}() to swait
dept: apply sdt_might_sleep_{start,end}() to waitqueue wait
dept: apply sdt_might_sleep_{start,end}() to hashed-waitqueue wait
dept: apply sdt_might_sleep_{start,end}() to dma fence
dept: track timeout waits separately with a new Kconfig
dept: apply timeout consideration to wait_for_completion()/complete()
dept: apply timeout consideration to swait
dept: apply timeout consideration to waitqueue wait
dept: apply timeout consideration to hashed-waitqueue wait
dept: apply timeout consideration to dma fence wait
dept: make dept able to work with an external wgen
dept: track PG_locked with dept
dept: print staged wait's stacktrace on report
locking/lockdep: prevent various lockdep assertions when
lockdep_off()'ed
dept: add documents for dept
cpu/hotplug: use a weaker annotation in AP thread
dept: assign dept map to mmu notifier invalidation synchronization
dept: assign unique dept_key to each distinct dma fence caller
dept: make dept aware of lockdep_set_lock_cmp_fn() annotation
dept: make dept stop from working on debug_locks_off()
dept: assign unique dept_key to each distinct wait_for_completion()
caller
completion, dept: introduce init_completion_dmap() API
dept: introduce a new type of dependency tracking between multi event
sites
dept: add module support for struct dept_event_site and
dept_event_site_dep
dept: introduce event_site() to disable event tracking if it's
recoverable
dept: implement a basic unit test for dept
dept: call dept_hardirqs_off() in local_irq_*() regardless of irq
state
dept: introduce APIs to set page usage and use subclasses_evt for the
usage
dept: track PG_writeback with dept
SUNRPC: relocate struct rcu_head to the first field of struct rpc_xprt
mm: percpu: increase PERCPU_DYNAMIC_SIZE_SHIFT on DEPT and large
PAGE_SIZE
Yunseong Kim (1):
rcu/update: fix same dept key collision between various types of RCU
Documentation/dev-tools/dept.rst | 778 ++++++
Documentation/dev-tools/dept_api.rst | 125 +
drivers/dma-buf/dma-fence.c | 23 +-
include/asm-generic/vmlinux.lds.h | 13 +-
include/linux/completion.h | 124 +-
include/linux/dept.h | 402 +++
include/linux/dept_ldt.h | 78 +
include/linux/dept_sdt.h | 68 +
include/linux/dept_unit_test.h | 67 +
include/linux/dma-fence.h | 74 +-
include/linux/hardirq.h | 3 +
include/linux/irq-entry-common.h | 4 +
include/linux/irqflags.h | 21 +-
include/linux/local_lock_internal.h | 1 +
include/linux/lockdep.h | 105 +-
include/linux/lockdep_types.h | 3 +
include/linux/mm_types.h | 4 +
include/linux/mmu_notifier.h | 26 +
include/linux/module.h | 5 +
include/linux/mutex.h | 1 +
include/linux/page-flags.h | 217 +-
include/linux/pagemap.h | 37 +-
include/linux/percpu-rwsem.h | 2 +-
include/linux/percpu.h | 4 +
include/linux/rcupdate_wait.h | 13 +-
include/linux/rtmutex.h | 1 +
include/linux/rwlock_types.h | 1 +
include/linux/rwsem.h | 1 +
include/linux/sched.h | 118 +
include/linux/seqlock.h | 2 +-
include/linux/spinlock_types_raw.h | 3 +
include/linux/srcu.h | 2 +-
include/linux/sunrpc/xprt.h | 9 +-
include/linux/swait.h | 3 +
include/linux/wait.h | 3 +
include/linux/wait_bit.h | 3 +
init/init_task.c | 2 +
init/main.c | 2 +
kernel/Makefile | 1 +
kernel/cpu.c | 2 +-
kernel/dependency/Makefile | 5 +
kernel/dependency/dept.c | 3499 ++++++++++++++++++++++++++
kernel/dependency/dept_hash.h | 10 +
kernel/dependency/dept_internal.h | 314 +++
kernel/dependency/dept_object.h | 13 +
kernel/dependency/dept_proc.c | 94 +
kernel/dependency/dept_unit_test.c | 173 ++
kernel/exit.c | 1 +
kernel/fork.c | 2 +
kernel/locking/lockdep.c | 33 +
kernel/module/main.c | 19 +
kernel/rcu/rcu.h | 1 +
kernel/rcu/update.c | 5 +-
kernel/sched/completion.c | 62 +-
kernel/sched/core.c | 9 +
kernel/workqueue.c | 3 +
lib/Kconfig.debug | 48 +
lib/debug_locks.c | 2 +
lib/locking-selftest.c | 2 +
mm/filemap.c | 38 +
mm/mm_init.c | 3 +
mm/mmu_notifier.c | 31 +-
rust/helpers/completion.c | 5 +
63 files changed, 6602 insertions(+), 121 deletions(-)
create mode 100644 Documentation/dev-tools/dept.rst
create mode 100644 Documentation/dev-tools/dept_api.rst
create mode 100644 include/linux/dept.h
create mode 100644 include/linux/dept_ldt.h
create mode 100644 include/linux/dept_sdt.h
create mode 100644 include/linux/dept_unit_test.h
create mode 100644 kernel/dependency/Makefile
create mode 100644 kernel/dependency/dept.c
create mode 100644 kernel/dependency/dept_hash.h
create mode 100644 kernel/dependency/dept_internal.h
create mode 100644 kernel/dependency/dept_object.h
create mode 100644 kernel/dependency/dept_proc.c
create mode 100644 kernel/dependency/dept_unit_test.c
base-commit: 43dfc13ca972988e620a6edb72956981b75ab6b0
--
2.17.1
From: Jiri Pirko <jiri(a)nvidia.com>
Confidential computing (CoCo) VMs/guests, such as AMD SEV and Intel TDX,
run with private/encrypted memory which creates a challenge
for devices that do not support DMA to it (no TDISP support).
For kernel-only DMA operations, swiotlb bounce buffering provides a
transparent solution by copying data through shared memory.
However, the only way to get this memory into userspace is via the DMA
API's dma_alloc_pages()/dma_mmap_pages() type interfaces, which limit
the use of the memory to a single DMA device and are incompatible with
pin_user_pages().
These limitations are particularly problematic for the RDMA subsystem
which makes heavy use of pin_user_pages() and expects flexible memory
usage between many different DMA devices.
This patch series enables userspace to explicitly request shared
(decrypted) memory allocations from a new dma-buf system_cc_shared heap.
Userspace can mmap this memory and pass the dma-buf fd to other
existing importers such as RDMA or DRM devices to access the
memory. The DMA API is improved to allow the dma heap exporter to DMA
map the shared memory to each importing device.
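For illustration, consuming the new heap from userspace would look
roughly like this (a sketch using the standard dma-heap uAPI; the heap
name comes from this series, error handling omitted):

   int heap_fd = open("/dev/dma_heap/system_cc_shared", O_RDWR);
   struct dma_heap_allocation_data alloc = {
           .len = 2 * 1024 * 1024,
           .fd_flags = O_RDWR | O_CLOEXEC,
   };

   ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);

   /* CPU access through a shared (decrypted) mapping */
   void *p = mmap(NULL, alloc.len, PROT_READ | PROT_WRITE,
                  MAP_SHARED, alloc.fd, 0);

   /* alloc.fd can now be handed to an RDMA or DRM importer */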
Based on dma-mapping-for-next e7442a68cd1ee797b585f045d348781e9c0dde0d
Jiri Pirko (2):
dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory
dma-buf: heaps: system: add system_cc_shared heap for explicitly
shared memory
drivers/dma-buf/heaps/system_heap.c | 103 ++++++++++++++++++++++++++--
include/linux/dma-mapping.h | 10 +++
include/trace/events/dma.h | 3 +-
kernel/dma/direct.h | 14 +++-
kernel/dma/mapping.c | 13 +++-
5 files changed, 132 insertions(+), 11 deletions(-)
--
2.51.1
When CONFIG_DMA_API_DEBUG_SG is enabled, importing a udmabuf into a DRM
driver (e.g. amdgpu for video playback in GNOME Videos / Showtime)
triggers a spurious warning:
DMA-API: amdgpu 0000:03:00.0: cacheline tracking EEXIST, \
overlapping mappings aren't supported
WARNING: kernel/dma/debug.c:619 at add_dma_entry+0x473/0x5f0
The call chain is:
amdgpu_cs_ioctl
-> amdgpu_ttm_backend_bind
-> dma_buf_map_attachment
-> [udmabuf] map_udmabuf -> get_sg_table
-> dma_map_sgtable(dev, sg, direction, 0) // attrs=0
-> debug_dma_map_sg -> add_dma_entry -> EEXIST
This happens because udmabuf builds a per-page scatter-gather list via
sg_set_folio(). When begin_cpu_udmabuf() has already created an sg
table mapped for the misc device, and an importer such as amdgpu maps
the same pages for its own device via map_udmabuf(), the DMA debug
infrastructure sees two active mappings whose physical addresses share
cacheline boundaries and warns about the overlap.
The DMA_ATTR_SKIP_CPU_SYNC flag suppresses this check in
add_dma_entry() because it signals that no CPU cache maintenance is
performed at map/unmap time, making the cacheline overlap harmless.
All other major dma-buf exporters already pass this flag:
- drm_gem_map_dma_buf() passes DMA_ATTR_SKIP_CPU_SYNC
- amdgpu_dma_buf_map() passes DMA_ATTR_SKIP_CPU_SYNC
The CPU sync at map/unmap time is also redundant for udmabuf:
begin_cpu_udmabuf() and end_cpu_udmabuf() already perform explicit
cache synchronization via dma_sync_sgtable_for_cpu/device() when CPU
access is requested through the dma-buf interface.
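For reference, that explicit path looks roughly like this (a simplified
sketch of udmabuf's begin_cpu_access hook; field names abbreviated):

   static int begin_cpu_udmabuf(struct dma_buf *buf,
                                enum dma_data_direction direction)
   {
           struct udmabuf *ubuf = buf->priv;

           /* caches are made coherent here, when CPU access starts */
           dma_sync_sgtable_for_cpu(ubuf->device, ubuf->sg, direction);
           return 0;
   }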
Pass DMA_ATTR_SKIP_CPU_SYNC to dma_map_sgtable() and
dma_unmap_sgtable() in udmabuf to suppress the spurious warning and
skip the redundant sync.
Fixes: 284562e1f348 ("udmabuf: implement begin_cpu_access/end_cpu_access hooks")
Cc: stable(a)vger.kernel.org
Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov(a)gmail.com>
---
v1 -> v2:
- Rebased on drm-tip to resolve conflict with folio conversion
patches. No code change, same two-line fix.
v1: https://lore.kernel.org/all/20260317053653.28888-1-mikhail.v.gavrilov@gmail…
drivers/dma-buf/udmabuf.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index 94b26ea706a3..bced421c0d65 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -145,7 +145,7 @@ static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf,
if (ret < 0)
goto err_alloc;
- ret = dma_map_sgtable(dev, sg, direction, 0);
+ ret = dma_map_sgtable(dev, sg, direction, DMA_ATTR_SKIP_CPU_SYNC);
if (ret < 0)
goto err_map;
return sg;
@@ -160,7 +160,7 @@ static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf,
static void put_sg_table(struct device *dev, struct sg_table *sg,
enum dma_data_direction direction)
{
- dma_unmap_sgtable(dev, sg, direction, 0);
+ dma_unmap_sgtable(dev, sg, direction, DMA_ATTR_SKIP_CPU_SYNC);
sg_free_table(sg);
kfree(sg);
}
--
2.53.0
Since its introduction, DMA-buf has only supported using scatterlist for
the exporter and importer to exchange address information. This is not
sufficient for all use cases as dma_addr_t is a very specific and limited
type that should not be abused for things unrelated to the DMA API.
There are several motivations for addressing this now:
1) VFIO to IOMMUFD and KVM requires a physical address, not a
dma_addr_t scatterlist; this cannot be represented in the
scatterlist structure
2) xe vGPU requires the host driver to accept a DMABUF from VFIO of its
own VF and convert it into an internal VRAM address on the PF
3) We are starting to look at replacement datastructures for
scatterlist
4) Ideas around UALink/etc are suggesting not using the DMA API
None of these can sanely be achieved using scatterlist.
Introduce a new mechanism called "mapping types" which allows DMA-buf to
work with more map/unmap options than scatterlist. Each mapping type
encompasses a full set of functions and data unique to itself. The core
code provides a match-making system to select the best type offered by the
exporter and importer to be the active mapping type for the attachment.
Everything related to scatterlist is moved into a DMA-buf SGT mapping
type, and into the "dma_buf_sgt_*" namespace for clarity. Existing
exporters are moved over to explicitly declare SGT mapping types and
importers are adjusted to use the dma_buf_sgt_* named importer helpers.
Mapping types are designed to be extendable: a driver can declare its
own mapping type for its internal private interconnect and use that
without having to adjust the core code.
The new attachment sequence starts with the importing driver declaring
what mapping types it can accept:
   struct dma_buf_mapping_match imp_match[] = {
           DMA_BUF_IMAPPING_MY_DRIVER(dev, ...),
           DMA_BUF_IMAPPING_SGT(dev, false),
   };

   attach = dma_buf_mapping_attach(dmabuf, imp_match, ...)
Most drivers will do this via a dma_buf_sgt_*attach() helper.
The exporting driver can then declare what mapping types it can supply:
   int exporter_match_mapping(struct dma_buf_match_args *args)
   {
           struct dma_buf_mapping_match exp_match[] = {
                   DMA_BUF_EMAPPING_MY_DRIVER(my_ops, dev, ...),
                   DMA_BUF_EMAPPING_SGT(sgt_ops, dev, false),
                   DMA_BUF_EMAPPING_PAL(PAL_ops),
           };

           return dma_buf_match_mapping(args, exp_match, ...);
   }
Most drivers will do this via a helper:
   static const struct dma_buf_ops ops = {
           DMA_BUF_SIMPLE_SGT_EXP_MATCH(map_func, unmap_func)
   };
During dma_buf_mapping_attach() the core code will select a mutual match
between the importer and exporter and record it as the active match in
the attach->map_type.
Each mapping type has its own types/function calls for
mapping/unmapping, and storage in the attach->map_type for its
information. As such each mapping type can offer function signatures
and data that exactly matches its needs.
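Conceptually, the match-making during dma_buf_mapping_attach() reduces
to something like the following (an illustrative sketch, not the actual
core code; the structures and fields are assumptions):

   /* importer entries are listed in preference order */
   static int match_mapping_type(struct dma_buf_attachment *attach,
                                 const struct dma_buf_mapping_match *imp,
                                 size_t n_imp,
                                 const struct dma_buf_mapping_match *exp,
                                 size_t n_exp)
   {
           size_t i, j;

           for (i = 0; i < n_imp; i++)
                   for (j = 0; j < n_exp; j++)
                           if (imp[i].type == exp[j].type) {
                                   /* record the active mapping type */
                                   attach->map_type = exp[j];
                                   return 0;
                           }
           return -EOPNOTSUPP;     /* no mutual mapping type */
   }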
This series goes through a sequence of:
1) Introduce the basic mapping type framework and the main components of
the SGT mapping type
2) Automatically make all existing exporters and importers use core
generated SGT mapping types so every attachment has a SGT mapping type
3) Convert all exporter drivers to natively create a SGT mapping type
4) Move all dma_buf_* functions and types that are related to SGT into
dma_buf_sgt_*
5) Remove all the now-unused items that have been moved into SGT specific
structures.
6) Demonstrate adding a new Physical Address List alongside SGT.
Due to the high number of files touched I would expect this to be broken
into phases, but this shows the entire picture.
This is on github: https://github.com/jgunthorpe/linux/commits/dmabuf_map_type
It is a followup to the discussion here:
https://lore.kernel.org/dri-devel/20251027044712.1676175-1-vivek.kasireddy@…
Jason Gunthorpe (26):
dma-buf: Introduce DMA-buf mapping types
dma-buf: Add the SGT DMA mapping type
dma-buf: Add dma_buf_mapping_attach()
dma-buf: Route SGT related actions through attach->map_type
dma-buf: Allow single exporter drivers to avoid the match_mapping
function
drm: Check the SGT ops for drm_gem_map_dma_buf()
dma-buf: Convert all the simple exporters to use SGT mapping type
drm/vmwgfx: Use match_mapping instead of dummy calls
accel/habanalabs: Use the SGT mapping type
drm/xe/dma-buf: Use the SGT mapping type
drm/amdgpu: Use the SGT mapping type
vfio/pci: Change the DMA-buf exporter to use mapping_type
dma-buf: Update dma_buf_phys_vec_to_sgt() to use the SGT mapping type
iio: buffer: convert to use the SGT mapping type
functionfs: convert to use the SGT mapping type
dma-buf: Remove unused SGT stuff from the common structures
treewide: Rename dma_buf_map_attachment(_unlocked) to dma_buf_sgt_
treewide: Rename dma_buf_unmap_attachment(_unlocked) to dma_buf_sgt_*
treewide: Rename dma_buf_attach() to dma_buf_sgt_attach()
treewide: Rename dma_buf_dynamic_attach() to
dma_buf_sgt_dynamic_attach()
dma-buf: Add the Physical Address List DMA mapping type
vfio/pci: Add physical address list support to DMABUF
iommufd: Use the PAL mapping type instead of a vfio function
iommufd: Support DMA-bufs with multiple physical ranges
iommufd/selftest: Check multi-phys DMA-buf scenarios
dma-buf: Add kunit tests for mapping type
Documentation/gpu/todo.rst | 2 +-
drivers/accel/amdxdna/amdxdna_gem.c | 14 +-
drivers/accel/amdxdna/amdxdna_ubuf.c | 10 +-
drivers/accel/habanalabs/common/memory.c | 54 ++-
drivers/accel/ivpu/ivpu_gem.c | 10 +-
drivers/accel/ivpu/ivpu_gem_userptr.c | 11 +-
drivers/accel/qaic/qaic_data.c | 8 +-
drivers/dma-buf/Makefile | 1 +
drivers/dma-buf/dma-buf-mapping.c | 186 ++++++++-
drivers/dma-buf/dma-buf.c | 180 ++++++---
drivers/dma-buf/heaps/cma_heap.c | 12 +-
drivers/dma-buf/heaps/system_heap.c | 13 +-
drivers/dma-buf/st-dma-mapping.c | 373 ++++++++++++++++++
drivers/dma-buf/udmabuf.c | 8 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 98 +++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-
drivers/gpu/drm/armada/armada_gem.c | 33 +-
drivers/gpu/drm/drm_gem_shmem_helper.c | 2 +-
drivers/gpu/drm/drm_prime.c | 31 +-
drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c | 18 +-
drivers/gpu/drm/i915/gem/i915_gem_object.c | 2 +-
.../drm/i915/gem/selftests/i915_gem_dmabuf.c | 8 +-
.../gpu/drm/i915/gem/selftests/mock_dmabuf.c | 8 +-
drivers/gpu/drm/msm/msm_gem_prime.c | 7 +-
drivers/gpu/drm/omapdrm/omap_gem_dmabuf.c | 11 +-
drivers/gpu/drm/tegra/gem.c | 33 +-
drivers/gpu/drm/virtio/virtgpu_prime.c | 23 +-
drivers/gpu/drm/vmwgfx/vmwgfx_prime.c | 32 +-
drivers/gpu/drm/xe/xe_bo.c | 18 +-
drivers/gpu/drm/xe/xe_dma_buf.c | 61 +--
drivers/iio/industrialio-buffer.c | 15 +-
drivers/infiniband/core/umem_dmabuf.c | 15 +-
drivers/iommu/iommufd/io_pagetable.h | 4 +-
drivers/iommu/iommufd/iommufd_private.h | 8 -
drivers/iommu/iommufd/iommufd_test.h | 7 +
drivers/iommu/iommufd/pages.c | 85 ++--
drivers/iommu/iommufd/selftest.c | 177 ++++++---
.../media/common/videobuf2/videobuf2-core.c | 2 +-
.../common/videobuf2/videobuf2-dma-contig.c | 26 +-
.../media/common/videobuf2/videobuf2-dma-sg.c | 21 +-
.../common/videobuf2/videobuf2-vmalloc.c | 13 +-
.../platform/nvidia/tegra-vde/dmabuf-cache.c | 9 +-
drivers/misc/fastrpc.c | 21 +-
drivers/tee/tee_heap.c | 13 +-
drivers/usb/gadget/function/f_fs.c | 11 +-
drivers/vfio/pci/vfio_pci_dmabuf.c | 79 ++--
drivers/xen/gntdev-dmabuf.c | 29 +-
include/linux/dma-buf-mapping.h | 297 ++++++++++++++
include/linux/dma-buf.h | 168 ++++----
io_uring/zcrx.c | 9 +-
net/core/devmem.c | 14 +-
samples/vfio-mdev/mbochs.c | 10 +-
sound/soc/fsl/fsl_asrc_m2m.c | 12 +-
tools/testing/selftests/iommu/iommufd.c | 43 ++
tools/testing/selftests/iommu/iommufd_utils.h | 17 +
55 files changed, 1764 insertions(+), 614 deletions(-)
create mode 100644 drivers/dma-buf/st-dma-mapping.c
base-commit: c63e5a50e1dd291cd95b04291b028fdcaba4c534
--
2.43.0
When a caller already guards a tracepoint with an explicit enabled check:
   if (trace_foo_enabled() && cond)
           trace_foo(args);
trace_foo() internally re-evaluates the static_branch_unlikely() key.
Since static branches are patched binary instructions the compiler cannot
fold the two evaluations, so every such site pays the cost twice.
This series introduces trace_call__##name() as a companion to
trace_##name(). It calls __do_trace_##name() directly, bypassing the
redundant static-branch re-check, while preserving all other correctness
properties of the normal path (RCU-watching assertion, might_fault() for
syscall tracepoints). The internal __do_trace_##name() symbol is not
leaked to call sites; trace_call__##name() is the only new public API.
   if (trace_foo_enabled() && cond)
           trace_call__foo(args);   /* calls __do_trace_foo() directly */
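Internally, the new entry point reduces to roughly the following (a
hedged simplification of what __DECLARE_TRACE emits; the real macro
also keeps the RCU-watching assertion, and the syscall variant keeps
might_fault()):

   /* emitted per tracepoint by __DECLARE_TRACE(name, proto, args, ...) */
   static inline void trace_call__##name(proto)
   {
           /* the caller has already checked trace_##name##_enabled() */
           __do_trace_##name(args);
   }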
The first patch adds the three-location change to
include/linux/tracepoint.h (__DECLARE_TRACE, __DECLARE_TRACE_SYSCALL,
and the !TRACEPOINTS_ENABLED stub). The remaining 18 patches
mechanically convert all guarded call sites found in the tree:
kernel/, io_uring/, net/, accel/habanalabs, cpufreq/, devfreq/,
dma-buf/, fsi/, drm/, HID, i2c/, spi/, scsi/ufs/, btrfs/,
net/devlink/, kernel/time/, kernel/trace/, mm/damon/, and arch/x86/.
This series is motivated by Peter Zijlstra's observation in the discussion
around Dmitry Ilvokhin's locking tracepoint instrumentation series, where
he noted that compilers cannot optimize static branches and that guarded
call sites end up evaluating the static branch twice for no reason, and
by Steven Rostedt's suggestion to add a proper API instead of exposing
internal implementation details like __do_trace_##name() directly to
call sites:
https://lore.kernel.org/linux-trace-kernel/8298e098d3418cb446ef396f119edac5…
Suggested-by: Steven Rostedt <rostedt(a)goodmis.org>
Suggested-by: Peter Zijlstra <peterz(a)infradead.org>
Changes in v2:
- Renamed trace_invoke_##name() to trace_call__##name() (double
underscore) per review comments.
- Added 4 new patches covering sites missed in v1, found using
coccinelle to scan the tree (Keith Busch):
* net/devlink: guarded tracepoint_enabled() block in trap.c
* kernel/time: early-return guard in tick-sched.c (tick_stop)
* kernel/trace: early-return guard in trace_benchmark.c
* mm/damon: early-return guard in core.c
* arch/x86: do_trace_*() wrapper functions in lib/msr.c, which
are called exclusively from tracepoint_enabled()-guarded sites
in asm/msr.h
v1: https://lore.kernel.org/linux-trace-kernel/abSqrJ1J59RQC47U@kbusch-mbp/
Vineeth Pillai (Google) (19):
tracepoint: Add trace_call__##name() API
kernel: Use trace_call__##name() at guarded tracepoint call sites
io_uring: Use trace_call__##name() at guarded tracepoint call sites
net: Use trace_call__##name() at guarded tracepoint call sites
accel/habanalabs: Use trace_call__##name() at guarded tracepoint call
sites
cpufreq: Use trace_call__##name() at guarded tracepoint call sites
devfreq: Use trace_call__##name() at guarded tracepoint call sites
dma-buf: Use trace_call__##name() at guarded tracepoint call sites
fsi: Use trace_call__##name() at guarded tracepoint call sites
drm: Use trace_call__##name() at guarded tracepoint call sites
HID: Use trace_call__##name() at guarded tracepoint call sites
i2c: Use trace_call__##name() at guarded tracepoint call sites
spi: Use trace_call__##name() at guarded tracepoint call sites
scsi: ufs: Use trace_call__##name() at guarded tracepoint call sites
btrfs: Use trace_call__##name() at guarded tracepoint call sites
net: devlink: Use trace_call__##name() at guarded tracepoint call
sites
kernel: time, trace: Use trace_call__##name() at guarded tracepoint
call sites
mm: damon: Use trace_call__##name() at guarded tracepoint call sites
x86: msr: Use trace_call__##name() at guarded tracepoint call sites
arch/x86/lib/msr.c | 6 +++---
drivers/accel/habanalabs/common/device.c | 12 ++++++------
drivers/accel/habanalabs/common/mmu/mmu.c | 3 ++-
drivers/accel/habanalabs/common/pci/pci.c | 4 ++--
drivers/cpufreq/amd-pstate.c | 10 +++++-----
drivers/cpufreq/cpufreq.c | 2 +-
drivers/cpufreq/intel_pstate.c | 2 +-
drivers/devfreq/devfreq.c | 2 +-
drivers/dma-buf/dma-fence.c | 4 ++--
drivers/fsi/fsi-master-aspeed.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++--
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +-
drivers/gpu/drm/scheduler/sched_entity.c | 4 ++--
drivers/hid/intel-ish-hid/ipc/pci-ish.c | 2 +-
drivers/i2c/i2c-core-slave.c | 2 +-
drivers/spi/spi-axi-spi-engine.c | 4 ++--
drivers/ufs/core/ufshcd.c | 12 ++++++------
fs/btrfs/extent_map.c | 4 ++--
fs/btrfs/raid56.c | 4 ++--
include/linux/tracepoint.h | 11 +++++++++++
io_uring/io_uring.h | 2 +-
kernel/irq_work.c | 2 +-
kernel/sched/ext.c | 2 +-
kernel/smp.c | 2 +-
kernel/time/tick-sched.c | 12 ++++++------
kernel/trace/trace_benchmark.c | 2 +-
mm/damon/core.c | 2 +-
net/core/dev.c | 2 +-
net/core/xdp.c | 2 +-
net/devlink/trap.c | 2 +-
net/openvswitch/actions.c | 2 +-
net/openvswitch/datapath.c | 2 +-
net/sctp/outqueue.c | 2 +-
net/tipc/node.c | 2 +-
35 files changed, 74 insertions(+), 62 deletions(-)
--
2.53.0
The first two commits fix rare bugs and should be backported to stable
branches.
The rest is an attempt to clean up and document the code to make it
a bit easier to understand.
Signed-off-by: Alessio Belle <alessio.belle(a)imgtec.com>
---
Alessio Belle (8):
drm/imagination: Count paired job fence as dependency in prepare_job()
drm/imagination: Fit paired fragment job in the correct CCCB
drm/imagination: Skip check on paired job fence during job submission
drm/imagination: Rename pvr_queue_fence_is_ufo_backed() to reflect usage
drm/imagination: Rename fence returned by pvr_queue_job_arm()
drm/imagination: Move repeated job fence check to its own function
drm/imagination: Update check to skip prepare_job() for fragment jobs
drm/imagination: Minor improvements to job submission code documentation
drivers/gpu/drm/imagination/pvr_job.c | 8 +-
drivers/gpu/drm/imagination/pvr_queue.c | 154 +++++++++++++--------
drivers/gpu/drm/imagination/pvr_queue.h | 2 +-
.../gpu/drm/imagination/pvr_rogue_fwif_shared.h | 10 +-
drivers/gpu/drm/imagination/pvr_sync.c | 8 +-
drivers/gpu/drm/imagination/pvr_sync.h | 2 +-
6 files changed, 110 insertions(+), 74 deletions(-)
---
base-commit: 3bce3fdd1ff2ba242f76ab66659fff27207299f1
change-id: 20260330-job-submission-fixes-cleanup-83e01196c3e9
Best regards,
--
Alessio Belle <alessio.belle(a)imgtec.com>
Using kunit to write tests for new work on dmabuf is coming up:
https://lore.kernel.org/all/26-v1-b5cab63049c0+191af-dmabuf_map_type_jgg@nv…
Replace the custom test framework with kunit to avoid maintaining two
concurrent test frameworks.
The conversion minimizes code changes and uses simple pattern-oriented
reworks to reduce the chance of breaking any tests. Aside from adding the
kunit_test_suite() boilerplate, the conversion follows a number of
patterns:
Test failures without cleanup. For example:

   if (!ptr)
           return -ENOMEM;

Becomes:

   KUNIT_ASSERT_NOT_NULL(test, ptr);

In kunit, an ASSERT longjumps out of the test.

Check for error, fail and cleanup:

   if (err) {
           pr_err("msg\n");
           goto cleanup;
   }

Becomes:

   if (err) {
           KUNIT_FAIL(test, "msg");
           goto cleanup;
   }

Preserve the existing failure messages and cleanup code.

Cases where the test returns err but prints no message:

   if (err)
           goto cleanup;

Becomes:

   if (err) {
           KUNIT_FAIL(test, "msg");
           goto cleanup;
   }

Use KUNIT_FAIL to retain the 'cleanup on err' behavior.
Overall, the conversion is straightforward.
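For reference, the kunit_test_suite() boilerplate amounts to roughly
this per file (a sketch; the case names are taken from the dma-buf-resv
output below):

   static struct kunit_case dma_resv_cases[] = {
           KUNIT_CASE(test_sanitycheck),
           KUNIT_CASE(test_signaling),
           {}
   };

   static struct kunit_suite dma_resv_suite = {
           .name = "dma-buf-resv",
           .test_cases = dma_resv_cases,
   };
   kunit_test_suite(dma_resv_suite);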
The result can be run with kunit.py:
$ tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/dma-buf/.kunitconfig
[20:37:23] Configuring KUnit Kernel ...
[20:37:23] Building KUnit Kernel ...
Populating config with:
$ make ARCH=x86_64 O=build_kunit_x86_64 olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=x86_64 O=build_kunit_x86_64 --jobs=20
[20:37:29] Starting KUnit Kernel (1/1)...
[20:37:29] ============================================================
Running tests with:
$ qemu-system-x86_64 -nodefaults -m 1024 -kernel build_kunit_x86_64/arch/x86/boot/bzImage -append 'kunit.enable=1 console=ttyS0 kunit_shutdown=reboot' -no-reboot -nographic -accel kvm -accel hvf -accel tcg -serial stdio -bios qboot.rom
[20:37:30] ================ dma-buf-resv (5 subtests) =================
[20:37:30] [PASSED] test_sanitycheck
[20:37:30] ===================== test_signaling ======================
[20:37:30] [PASSED] kernel
[20:37:30] [PASSED] write
[20:37:30] [PASSED] read
[20:37:30] [PASSED] bookkeep
[20:37:30] ================= [PASSED] test_signaling ==================
...
[20:37:35] Testing complete. Ran 50 tests: passed: 49, skipped: 1
[20:37:35] Elapsed time: 12.635s total, 0.001s configuring, 6.551s building, 6.017s running
One test that requires two CPUs is skipped since the default VM has a
single CPU and cannot run the test.
All other usual ways to run kunit work as well, and all tests are placed
in a module to provide more options for how they are run.
AI was used to do the large-scale semantic search and replaces
described above, then everything was hand checked. AI also deduced the
issue with test_race_signal_callback() in a couple of seconds from the
kunit crash (!!); that was hand checked as well, though I am not
familiar enough with this test to be fully certain it is the best
answer.
Jason Gunthorpe (5):
dma-buf: Change st-dma-resv.c to use kunit
dma-buf: Change st-dma-fence.c to use kunit
dma-buf: Change st-dma-fence-unwrap.c to use kunit
dma-buf: Change st-dma-fence-chain.c to use kunit
dma-buf: Remove the old selftest
drivers/dma-buf/.kunitconfig | 2 +
drivers/dma-buf/Kconfig | 11 +-
drivers/dma-buf/Makefile | 5 +-
drivers/dma-buf/selftest.c | 167 ---------------
drivers/dma-buf/selftest.h | 30 ---
drivers/dma-buf/selftests.h | 16 --
drivers/dma-buf/st-dma-fence-chain.c | 217 +++++++++----------
drivers/dma-buf/st-dma-fence-unwrap.c | 290 +++++++++++---------------
drivers/dma-buf/st-dma-fence.c | 200 ++++++++----------
drivers/dma-buf/st-dma-resv.c | 145 +++++++------
drivers/gpu/drm/i915/Kconfig.debug | 2 +-
11 files changed, 394 insertions(+), 691 deletions(-)
create mode 100644 drivers/dma-buf/.kunitconfig
delete mode 100644 drivers/dma-buf/selftest.c
delete mode 100644 drivers/dma-buf/selftest.h
delete mode 100644 drivers/dma-buf/selftests.h
base-commit: 41dae5ac5e157b0bb260f381eb3df2f4a4610205
--
2.43.0
From: Barry Song <v-songbaohua(a)oppo.com>
In many cases, the pages passed to vmap() may include high-order
pages allocated with the __GFP_COMP flag. For example, the system heap
often allocates pages in descending order: order 8, then 4, then 0.
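That descending-order pattern comes from allocation loops like the
following (a simplified sketch of the system heap's fallback strategy;
the orders match drivers/dma-buf/heaps/system_heap.c):

   static const unsigned int orders[] = { 8, 4, 0 };
   struct page *page;
   unsigned int i;

   /* try the largest order first, fall back on failure */
   for (i = 0; i < ARRAY_SIZE(orders); i++) {
           page = alloc_pages(GFP_KERNEL | __GFP_COMP | __GFP_NOWARN,
                              orders[i]);
           if (page)
                   break;
   }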
Currently, vmap() iterates over every page individually—even pages
inside a high-order block are handled one by one.
This patch detects high-order pages and maps them as a single
contiguous block whenever possible.
An alternative would be to implement a new API, vmap_sg(), but that
change seems to be large in scope.
When vmapping a 128MB dma-buf using the system heap, this patch
makes system_heap_do_vmap() roughly 17× faster.
W/ patch:
[ 10.404769] system_heap_do_vmap took 2494000 ns
[ 12.525921] system_heap_do_vmap took 2467008 ns
[ 14.517348] system_heap_do_vmap took 2471008 ns
[ 16.593406] system_heap_do_vmap took 2444000 ns
[ 19.501341] system_heap_do_vmap took 2489008 ns
W/o patch:
[ 7.413756] system_heap_do_vmap took 42626000 ns
[ 9.425610] system_heap_do_vmap took 42500992 ns
[ 11.810898] system_heap_do_vmap took 42215008 ns
[ 14.336790] system_heap_do_vmap took 42134992 ns
[ 16.373890] system_heap_do_vmap took 42750000 ns
Cc: David Hildenbrand <david(a)kernel.org>
Cc: Uladzislau Rezki <urezki(a)gmail.com>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: John Stultz <jstultz(a)google.com>
Cc: Maxime Ripard <mripard(a)kernel.org>
Tested-by: Tangquan Zheng <zhengtangquan(a)oppo.com>
Signed-off-by: Barry Song <v-songbaohua(a)oppo.com>
---
* diff with rfc:
Many code refinements based on David's suggestions, thanks!
Refine comment and changelog according to Uladzislau, thanks!
rfc link:
https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/
mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------
1 file changed, 39 insertions(+), 6 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 41dd01e8430c..8d577767a9e5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
return err;
}
+static inline int get_vmap_batch_order(struct page **pages,
+ unsigned int stride, unsigned int max_steps, unsigned int idx)
+{
+ int nr_pages = 1;
+
+ /*
+ * Currently, batching is only supported in vmap_pages_range
+ * when page_shift == PAGE_SHIFT.
+ */
+ if (stride != 1)
+ return 0;
+
+ nr_pages = compound_nr(pages[idx]);
+ if (nr_pages == 1)
+ return 0;
+ if (max_steps < nr_pages)
+ return 0;
+
+ if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages)
+ return compound_order(pages[idx]);
+ return 0;
+}
+
/*
* vmap_pages_range_noflush is similar to vmap_pages_range, but does not
* flush caches.
@@ -655,23 +678,33 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages, unsigned int page_shift)
{
unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
+ unsigned int stride;
WARN_ON(page_shift < PAGE_SHIFT);
+ /*
+ * For vmap(), users may allocate pages from high orders down to
+ * order 0, while always using PAGE_SHIFT as the page_shift.
+ * We first check whether the initial page is a compound page. If so,
+ * there may be an opportunity to batch multiple pages together.
+ */
if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
- page_shift == PAGE_SHIFT)
+ (page_shift == PAGE_SHIFT && !PageCompound(pages[0])))
return vmap_small_pages_range_noflush(addr, end, prot, pages);
- for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
- int err;
+ stride = 1U << (page_shift - PAGE_SHIFT);
+ for (i = 0; i < nr; ) {
+ int err, order;
- err = vmap_range_noflush(addr, addr + (1UL << page_shift),
+ order = get_vmap_batch_order(pages, stride, nr - i, i);
+ err = vmap_range_noflush(addr, addr + (1UL << (page_shift + order)),
page_to_phys(pages[i]), prot,
- page_shift);
+ page_shift + order);
if (err)
return err;
- addr += 1UL << page_shift;
+ addr += 1UL << (page_shift + order);
+ i += 1U << (order + page_shift - PAGE_SHIFT);
}
return 0;
--
2.39.3 (Apple Git-146)