On Thu, Oct 16, 2025 at 11:25:34PM -0700, Matthew Brost wrote:
On Thu, Oct 16, 2025 at 04:06:05PM -0500, Rob Herring (Arm) wrote:
Add a driver for Arm Ethos-U65/U85 NPUs. The Ethos-U NPU has a relatively simple interface with a single command stream to describe buffers, operation settings, and network operations. It supports up to 8 memory regions (though there are no h/w bounds on a region). The Ethos NPUs are designed to use an SRAM for scratch memory. Region 2 is reserved for SRAM (matching the downstream driver stack and compiler). Userspace doesn't need access to the SRAM.
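As a rough illustration of that layout (these constant names are mine, not from the patch):

	/* The command stream can address up to 8 base regions. */
	#define ETHOSU_NUM_REGIONS	8
	/* Region 2 is reserved for scratch SRAM; never mapped to userspace. */
	#define ETHOSU_SRAM_REGION	2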
Thanks for the review.
[...]
+static struct dma_fence *ethosu_job_run(struct drm_sched_job *sched_job)
+{
+	struct ethosu_job *job = to_ethosu_job(sched_job);
+	struct ethosu_device *dev = job->dev;
+	struct dma_fence *fence = NULL;
+	int ret;
+
+	if (unlikely(job->base.s_fence->finished.error))
+		return NULL;
+
+	fence = ethosu_fence_create(dev);
Another reclaim issue: ethosu_fence_create allocates memory using GFP_KERNEL. Since we're already in the DMA fence signaling path (reclaim), this can lead to a deadlock.
Without too much thought, you likely want to move this allocation to ethosu_job_do_push, but before taking dev->sched_lock or calling drm_sched_job_arm.
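Roughly along these lines, as an untested sketch (I'm guessing at the surrounding code; whether sched_lock is a mutex and the exact push sequence are assumptions on my part):

	static int ethosu_job_do_push(struct ethosu_job *job)
	{
		struct ethosu_device *dev = job->dev;
		struct dma_fence *fence;

		/* Allocate in process context, before drm_sched_job_arm()
		 * puts us on the fence signalling path and before any lock
		 * that may be taken under reclaim. */
		fence = ethosu_fence_create(dev);
		if (IS_ERR(fence))
			return PTR_ERR(fence);
		job->done_fence = fence;

		mutex_lock(&dev->sched_lock);
		drm_sched_job_arm(&job->base);
		/* ... push job to the scheduler entity ... */
		mutex_unlock(&dev->sched_lock);

		return 0;
	}

ethosu_job_run() then just returns dma_fence_get(job->done_fence) instead of allocating.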
We really should fix the DRM scheduler work queue to be tainted with reclaim; if I recall correctly, we'd need to update the work queue layer. Let me look into that. I've seen this type of bug several times, and lockdep should be able to catch it.
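In the meantime the path can be annotated by hand; a minimal sketch using the existing helpers from <linux/dma-fence.h> (the function name here is illustrative, the real work item lives in the scheduler):

	static void run_job_work(struct work_struct *w)
	{
		/* lockdep treats everything between these two calls as a
		 * dma-fence signalling critical section, i.e. as if it ran
		 * under memory reclaim, so a GFP_KERNEL allocation in here
		 * becomes an immediate splat instead of a rare deadlock. */
		bool cookie = dma_fence_begin_signalling();

		/* ... run_job() callback, fence signalling ... */

		dma_fence_end_signalling(cookie);
	}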
Likely the rocket driver suffers from the same issues...
+	if (IS_ERR(fence))
+		return fence;
+
+	if (job->done_fence)
+		dma_fence_put(job->done_fence);
+	job->done_fence = dma_fence_get(fence);
+
+	ret = pm_runtime_get_sync(dev->base.dev);
I haven't looked at your PM design, but this generally looks quite dangerous with respect to reclaim. For example, if your PM resume paths allocate memory or take locks that allocate memory underneath, you're likely to run into issues.
A better approach would be to attach a PM reference to your job upon creation and release it upon job destruction. That would be safer and save you headaches in the long run.
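A sketch of what I mean (ethosu_job_create()/ethosu_job_cleanup() are made-up names; the point is only where the get/put live):

	static struct ethosu_job *ethosu_job_create(struct ethosu_device *dev)
	{
		struct ethosu_job *job;
		int ret;

		job = kzalloc(sizeof(*job), GFP_KERNEL);
		if (!job)
			return ERR_PTR(-ENOMEM);

		/* Hold a PM reference for the whole lifetime of the job,
		 * taken in process context where resume may allocate or
		 * take locks. */
		ret = pm_runtime_resume_and_get(dev->base.dev);
		if (ret) {
			kfree(job);
			return ERR_PTR(ret);
		}

		job->dev = dev;
		return job;
	}

	static void ethosu_job_cleanup(struct ethosu_job *job)
	{
		/* Dropped only here, never from the run_job() path. */
		pm_runtime_put(job->dev->base.dev);
		kfree(job);
	}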
Our PM is nothing more than clock enable/disable and register init.
If the runtime PM API doesn't work and needs special driver wrappers, then I'm inclined to just not use it and manage clocks directly (as that's all it is doing).
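For the record, everything our resume/suspend hooks do amounts to roughly this (a sketch; dev->clk and the init helper are stand-ins for the actual fields):

	static int ethosu_rpm_resume(struct device *d)
	{
		struct ethosu_device *dev = dev_get_drvdata(d);
		int ret;

		ret = clk_prepare_enable(dev->clk);
		if (ret)
			return ret;

		ethosu_hw_init(dev);	/* register init */
		return 0;
	}

	static int ethosu_rpm_suspend(struct device *d)
	{
		struct ethosu_device *dev = dev_get_drvdata(d);

		clk_disable_unprepare(dev->clk);
		return 0;
	}

so managing the clock directly instead would be a handful of lines.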
This is what we do in Xe [1] [2].
Also, in general, this driver has been reviewed (RB’d), but it's not great that I spotted numerous issues within just five minutes. I suggest taking a step back and thoroughly evaluating everything this driver is doing.
Well, if it is hard to get simple drivers right, then it's a problem with the subsystem APIs IMO.
Rob