Hi,
This is your friendly bug reporter.
The environment is a vanilla torvalds-tree kernel on Ubuntu 22.04 LTS, on a Ryzen 7950X box.
Please find attached the complete dmesg ring-buffer output and the lshw output.
NOTE: The kernel is reported as tainted, but to my knowledge there are no proprietary
modules loaded (the G flag indicates all loaded modules are GPL-licensed); the taint
was set by previously reported bugs (the L flag marks an earlier soft lockup).
dmesg excerpt:
[ 8791.864576] ==================================================================
[ 8791.864648] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]
[ 8791.864776] write (marked) to 0xffff9b74491b7c40 of 8 bytes by task 3807 on cpu 18:
[ 8791.864788] drm_sched_entity_push_job+0xf4/0x2a0 [gpu_sched]
[ 8791.864852] amdgpu_cs_ioctl+0x3888/0x3de0 [amdgpu]
[ 8791.868731] drm_ioctl_kernel+0x127/0x210 [drm]
[ 8791.869222] drm_ioctl+0x38f/0x6f0 [drm]
[ 8791.869711] amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu]
[ 8791.873660] __x64_sys_ioctl+0xd2/0x120
[ 8791.873676] do_syscall_64+0x58/0x90
[ 8791.873688] entry_SYSCALL_64_after_hwframe+0x73/0xdd
[ 8791.873710] read to 0xffff9b74491b7c40 of 8 bytes by task 1119 on cpu 27:
[ 8791.873722] drm_sched_entity_is_ready+0x16/0x50 [gpu_sched]
[ 8791.873786] drm_sched_select_entity+0x1c7/0x220 [gpu_sched]
[ 8791.873849] drm_sched_main+0xd2/0x500 [gpu_sched]
[ 8791.873912] kthread+0x18b/0x1d0
[ 8791.873924] ret_from_fork+0x43/0x70
[ 8791.873939] ret_from_fork_asm+0x1b/0x30
[ 8791.873955] value changed: 0x0000000000000000 -> 0xffff9b750ebcfc00
[ 8791.873971] Reported by Kernel Concurrency Sanitizer on:
[ 8791.873980] CPU: 27 PID: 1119 Comm: gfx_0.0.0 Tainted: G L 6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
[ 8791.873994] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[ 8791.874002] ==================================================================
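For context on the report above: KCSAN complains because an annotated ("marked")
write in drm_sched_entity_push_job() races with a plain, unannotated read in
drm_sched_entity_is_ready(). A minimal sketch of that pattern and of the
READ_ONCE()/WRITE_ONCE() annotations that document such a lockless access as
intentional (the struct and field names below are illustrative stand-ins, not
the actual gpu_sched code):

#include <linux/compiler.h>	/* READ_ONCE(), WRITE_ONCE() */
#include <linux/dma-fence.h>

/* Illustrative stand-in, not the real struct drm_sched_entity. */
struct example_entity {
	struct dma_fence *last_queued;
};

/* Writer side (cf. drm_sched_entity_push_job): an annotated store,
 * which KCSAN reports as a "marked" write. */
static void example_push(struct example_entity *e, struct dma_fence *f)
{
	WRITE_ONCE(e->last_queued, f);
}

/* Reader side (cf. drm_sched_entity_is_ready): annotating this
 * lockless load with READ_ONCE() documents the intentional race
 * and stops KCSAN from flagging it as a plain read. */
static bool example_is_ready(struct example_entity *e)
{
	return READ_ONCE(e->last_queued) != NULL;
}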
Best regards,
Mirsad Todorovac
The documentation for drm_crtc_init_with_planes() in
drivers/gpu/drm/drm_crtc.c states: «The crtc structure should not be
allocated with devm_kzalloc()».
However, in drivers/gpu/drm/stm/ltdc.c the 2nd argument of
drm_crtc_init_with_planes() is a structure allocated with
devm_kzalloc(), as in the sketch below.
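To make the pattern concrete, here is a hedged sketch of what such drivers do
(the names are made up for illustration, not copied from ltdc.c):

#include <linux/device.h>
#include <linux/slab.h>
#include <drm/drm_crtc.h>

/* Hypothetical driver-private structure embedding the CRTC. */
struct foo_crtc {
	struct drm_crtc crtc;
	/* ... driver-specific fields ... */
};

static int foo_crtc_create(struct drm_device *ddev,
			   struct drm_plane *primary,
			   const struct drm_crtc_funcs *funcs)
{
	struct foo_crtc *priv;

	/* devm memory is freed automatically on driver unbind ... */
	priv = devm_kzalloc(ddev->dev, sizeof(*priv), GFP_KERNEL);
	if (!priv)
		return -ENOMEM;

	/* ... but the DRM core can keep references to the CRTC (e.g.
	 * through open DRM file descriptors) past unbind, which is why
	 * the documentation warns against devm_kzalloc() here. */
	return drm_crtc_init_with_planes(ddev, &priv->crtc, primary,
					 NULL, funcs, NULL);
}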
Also, in
drivers/gpu/drm/mediatek/mtk_drm_crtc.c
drivers/gpu/drm/hisilicon/kirin/kirin_drm_drv.c
drivers/gpu/drm/logicvc/logicvc_crtc.c
drivers/gpu/drm/meson/meson_crtc.c
drivers/gpu/drm/mxsfb/lcdif_kms.c
drivers/gpu/drm/mxsfb/mxsfb_kms.c
drivers/gpu/drm/renesas/shmobile/shmob_drm_crtc.c
drivers/gpu/drm/rockchip/rockchip_drm_vop.c
drivers/gpu/drm/rockchip/rockchip_drm_vop2.c
drivers/gpu/drm/sun4i/sun4i_crtc.c
drivers/gpu/drm/tegra/dc.c
drivers/gpu/drm/tilcdc/tilcdc_crtc.c
the 2nd argument of drm_crtc_init_with_planes()
is a field of a structure allocated with devm_kzalloc().
Is this correct, or can it lead to problems?
--
Ekaterina Orlova
Linux Verification Center, ISPRAS
From: Rob Clark <robdclark(a)chromium.org>
If a signal callback releases the sw_sync fence, that will trigger a
deadlock, as timeline_fence_release() recurses onto fence->lock (which
is used both for signaling and for protecting the timeline tree).
To avoid that, temporarily hold an extra reference to the signalled
fences until after we drop the lock.
(This is an alternative implementation of https://patchwork.kernel.org/patch/11664717/
which avoids some potential UAF issues with the original patch.)
Reported-by: Bas Nieuwenhuizen <bas(a)basnieuwenhuizen.nl>
Fixes: d3c6dd1fb30d ("dma-buf/sw_sync: Synchronize signal vs syncpt free")
Signed-off-by: Rob Clark <robdclark(a)chromium.org>
---
drivers/dma-buf/sw_sync.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
index 63f0aeb66db6..ceb6a0408624 100644
--- a/drivers/dma-buf/sw_sync.c
+++ b/drivers/dma-buf/sw_sync.c
@@ -191,6 +191,7 @@ static const struct dma_fence_ops timeline_fence_ops = {
*/
static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
{
+ LIST_HEAD(signalled);
struct sync_pt *pt, *next;
trace_sync_timeline(obj);
@@ -203,9 +204,13 @@ static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
if (!timeline_fence_signaled(&pt->base))
break;
+ dma_fence_get(&pt->base);
+
list_del_init(&pt->link);
rb_erase(&pt->node, &obj->pt_tree);
+ list_add_tail(&pt->link, &signalled);
+
/*
* A signal callback may release the last reference to this
* fence, causing it to be freed. That operation has to be
@@ -218,6 +223,11 @@ static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
}
spin_unlock_irq(&obj->lock);
+
+ list_for_each_entry_safe(pt, next, &signalled, link) {
+ list_del(&pt->link);
+ dma_fence_put(&pt->base);
+ }
}
/**
--
2.41.0
Hi Pintu,
On Sat, Jul 29, 2023 at 08:05:15AM +0530, Pintu Kumar wrote:
> The current global cma region name is defined as "reserved", which
> is misleading, confusing, and too generic.
>
> Also, default cma allocations happen from the global cma region, so
> if one has to track down all allocations coming from the global cma
> region, a distinctive name makes that easier.
>
> Thus, change the name from "reserved" to "global-cma-region".
I agree that "reserved" is not a very useful name. Unfortunately, the
name of the region leaks to userspace through cma_heap.
So I think we need prep patches that first hardcode "reserved" in
add_default_cma_heap, and then remove cma_get_name.
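For reference, the name leak works roughly like this (a simplified sketch of
the wiring in drivers/dma-buf/heaps/cma_heap.c, reconstructed from memory
rather than quoted verbatim; cma_heap_ops is the file's existing ops table):

#include <linux/cma.h>
#include <linux/dma-heap.h>
#include <linux/dma-map-ops.h>	/* dev_get_cma_area() */
#include <linux/err.h>

/* The /dev/dma_heap/<name> node userspace opens takes its name from
 * the CMA region, so renaming the region renames the device node. */
static int __add_cma_heap(struct cma *cma, void *data)
{
	struct dma_heap_export_info exp_info = {
		.name = cma_get_name(cma),	/* currently "reserved" */
		.ops  = &cma_heap_ops,
	};

	return PTR_ERR_OR_ZERO(dma_heap_add(&exp_info));
}

static int __init add_default_cma_heap(void)
{
	struct cma *default_cma = dev_get_cma_area(NULL);

	return default_cma ? __add_cma_heap(default_cma, NULL) : 0;
}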
From: Boris Brezillon <boris.brezillon(a)collabora.com>
[ Upstream commit e30cb0599799aac099209e3b045379613c80730e ]
The drm_sched_entity_kill_jobs_cb() logic omits the last fence popped
from the dependency array, i.e. the one that was being waited upon
before drm_sched_entity_kill() was called (the
drm_sched_entity::dependency field), so we're basically waiting for all
dependencies except one.
In theory, this wait shouldn't be needed because resources should have
their users registered to the dma_resv object, thus guaranteeing that
future jobs wanting to access these resources wait on all the previous
users (depending on the access type, of course). But we want to keep
these explicit waits in the kill entity path just in case.
Let's make sure we keep all dependencies in the array in
drm_sched_job_dependency(), so we can iterate over the array and wait
in drm_sched_entity_kill_jobs_cb().
We also make sure we wait on drm_sched_fence::finished if we were
originally asked to wait on drm_sched_fence::scheduled. In that case,
we assume the intent was to delegate the wait to the firmware/GPU or
rely on the pipelining done at the entity/scheduler level, but when
killing jobs, we really want to wait for completion, not just scheduling.
v2:
- Don't evict deps in drm_sched_job_dependency()
v3:
- Always wait for drm_sched_fence::finished fences in
drm_sched_entity_kill_jobs_cb() when we see a sched_fence
v4:
- Fix commit message
- Fix a use-after-free bug
v5:
- Flag deps on which we should only wait for the scheduled event
at insertion time
v6:
- Back to v4 implementation
- Add Christian's R-b
Cc: Frank Binns <frank.binns(a)imgtec.com>
Cc: Sarah Walker <sarah.walker(a)imgtec.com>
Cc: Donald Robson <donald.robson(a)imgtec.com>
Cc: Luben Tuikov <luben.tuikov(a)amd.com>
Cc: David Airlie <airlied(a)gmail.com>
Cc: Daniel Vetter <daniel(a)ffwll.ch>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: "Christian König" <christian.koenig(a)amd.com>
Signed-off-by: Boris Brezillon <boris.brezillon(a)collabora.com>
Suggested-by: "Christian König" <christian.koenig(a)amd.com>
Reviewed-by: "Christian König" <christian.koenig(a)amd.com>
Acked-by: Luben Tuikov <luben.tuikov(a)amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230619071921.3465992-1-bori…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/gpu/drm/scheduler/sched_entity.c | 41 +++++++++++++++++++-----
1 file changed, 33 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index e0a8890a62e23..42021d1f7e016 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -155,16 +155,32 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
{
struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
finish_cb);
- int r;
+ unsigned long index;
dma_fence_put(f);
/* Wait for all dependencies to avoid data corruptions */
- while (!xa_empty(&job->dependencies)) {
- f = xa_erase(&job->dependencies, job->last_dependency++);
- r = dma_fence_add_callback(f, &job->finish_cb,
- drm_sched_entity_kill_jobs_cb);
- if (!r)
+ xa_for_each(&job->dependencies, index, f) {
+ struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
+
+ if (s_fence && f == &s_fence->scheduled) {
+ /* The dependencies array had a reference on the scheduled
+ * fence, and the finished fence refcount might have
+ * dropped to zero. Use dma_fence_get_rcu() so we get
+ * a NULL fence in that case.
+ */
+ f = dma_fence_get_rcu(&s_fence->finished);
+
+ /* Now that we have a reference on the finished fence,
+ * we can release the reference the dependencies array
+ * had on the scheduled fence.
+ */
+ dma_fence_put(&s_fence->scheduled);
+ }
+
+ xa_erase(&job->dependencies, index);
+ if (f && !dma_fence_add_callback(f, &job->finish_cb,
+ drm_sched_entity_kill_jobs_cb))
return;
dma_fence_put(f);
@@ -394,8 +410,17 @@ static struct dma_fence *
drm_sched_job_dependency(struct drm_sched_job *job,
struct drm_sched_entity *entity)
{
- if (!xa_empty(&job->dependencies))
- return xa_erase(&job->dependencies, job->last_dependency++);
+ struct dma_fence *f;
+
+ /* We keep the fence around, so we can iterate over all dependencies
+ * in drm_sched_entity_kill_jobs_cb() to ensure all deps are signaled
+ * before killing the job.
+ */
+ f = xa_load(&job->dependencies, job->last_dependency);
+ if (f) {
+ job->last_dependency++;
+ return dma_fence_get(f);
+ }
if (job->sched->ops->prepare_job)
return job->sched->ops->prepare_job(job, entity);
--
2.40.1