As reported[1], in the current state of master (that is, *without* that[2] patch, yet unmerged), it is possible to trigger Oops/out-of-bounds errors/unbalanced runtime PM by simply compiling DRM_ACCEL_ROCKET built-in (=y) instead of as a module (=m).
This fixes points 1 and 2 reported here[1] by fixing the unwinding in two functions to properly undo everything done in the same function prior to the error.
Note that this doesn't mean the Rocket device is usable if one core is missing. In fact, it seems it doesn't as I'm hit with many rocket fdac0000.npu: NPU job timed out followed by one rocket fdad0000.npu: NPU job timed out (and that, five times) whenever core0 (fdab0000.npu) fails to probe and I'm running the example from https://docs.mesa3d.org/teflon.html#do-some-inference-with-mobilenetv1 so something else probably needs some additional love.
[1] https://lore.kernel.org/linux-rockchip/0b20d760-ad4f-41c0-b733-39db10d6cc41@... [2] https://lore.kernel.org/linux-rockchip/20251205064739.20270-1-rmxpzlb@gmail....
Signed-off-by: Quentin Schulz quentin.schulz@cherry.de --- Quentin Schulz (2): accel/rocket: fix unwinding in error path in rocket_core_init accel/rocket: fix unwinding in error path in rocket_probe
drivers/accel/rocket/rocket_core.c | 7 +++++-- drivers/accel/rocket/rocket_drv.c | 15 ++++++++++++++- 2 files changed, 19 insertions(+), 3 deletions(-) --- base-commit: a619746d25c8adafe294777cc98c47a09759b3ed change-id: 20251212-rocket-error-path-f9784c46a503
Best regards,
From: Quentin Schulz quentin.schulz@cherry.de
When rocket_job_init() is called, iommu_group_get() has already been called, therefore we should call iommu_group_put() and make the iommu_group pointer NULL. This aligns with what's done in rocket_core_fini().
If pm_runtime_resume_and_get() somehow fails, not only should rocket_job_fini() be called but we should also unwind everything done before that, that is, disable PM, put the iommu_group, NULLify it and then call rocket_job_fini(). This is exactly what's done in rocket_core_fini() so let's call that function instead of duplicating the code.
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL") Cc: stable@vger.kernel.org Signed-off-by: Quentin Schulz quentin.schulz@cherry.de --- drivers/accel/rocket/rocket_core.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c index abe7719c1db46..b3b2fa9ba645a 100644 --- a/drivers/accel/rocket/rocket_core.c +++ b/drivers/accel/rocket/rocket_core.c @@ -59,8 +59,11 @@ int rocket_core_init(struct rocket_core *core) core->iommu_group = iommu_group_get(dev);
err = rocket_job_init(core); - if (err) + if (err) { + iommu_group_put(core->iommu_group); + core->iommu_group = NULL; return err; + }
pm_runtime_use_autosuspend(dev);
@@ -76,7 +79,7 @@ int rocket_core_init(struct rocket_core *core)
err = pm_runtime_resume_and_get(dev); if (err) { - rocket_job_fini(core); + rocket_core_fini(core); return err; }
From: Quentin Schulz quentin.schulz@cherry.de
When rocket_core_init() fails (as could be the case with EPROBE_DEFER), we need to properly unwind by decrementing the counter we just incremented and if this is the first core we failed to probe, remove the rocket DRM device with rocket_device_fini() as well. This matches the logic in rocket_remove(). Failing to properly unwind results in out-of-bounds accesses.
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL") Cc: stable@vger.kernel.org Signed-off-by: Quentin Schulz quentin.schulz@cherry.de --- Note that this means that, technically, the first core (in HW) may not be the first core in the kernel (if EPROBE_DEFER is returned by the first core's probe for example). This may be a real issue if we rely on this. E.g. I see the iommu domain is set in in rocket_open() with rocket_iommu_domain_create(rdev->cores[0].dev) which could be expecting the main core (but it isn't clear from the commit logs, so maybe all cores may be totally independent). In any case, this patch is keeping the status quo. --- drivers/accel/rocket/rocket_drv.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/accel/rocket/rocket_drv.c b/drivers/accel/rocket/rocket_drv.c index 5c0b63f0a8f00..f6ef4c7aeef11 100644 --- a/drivers/accel/rocket/rocket_drv.c +++ b/drivers/accel/rocket/rocket_drv.c @@ -13,6 +13,7 @@ #include <linux/platform_device.h> #include <linux/pm_runtime.h>
+#include "rocket_device.h" #include "rocket_drv.h" #include "rocket_gem.h" #include "rocket_job.h" @@ -158,6 +159,8 @@ static const struct drm_driver rocket_drm_driver = {
static int rocket_probe(struct platform_device *pdev) { + int ret; + if (rdev == NULL) { /* First core probing, initialize DRM device. */ rdev = rocket_device_init(drm_dev, &rocket_drm_driver); @@ -177,7 +180,17 @@ static int rocket_probe(struct platform_device *pdev)
rdev->num_cores++;
- return rocket_core_init(&rdev->cores[core]); + ret = rocket_core_init(&rdev->cores[core]); + if (ret) { + rdev->num_cores--; + + if (rdev->num_cores == 0) { + rocket_device_fini(rdev); + rdev = NULL; + } + } + + return ret; }
static void rocket_remove(struct platform_device *pdev)
linux-stable-mirror@lists.linaro.org