From: Karol Wachowski karol.wachowski@intel.com
commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.
Trigger recovery of the NPU upon receiving HW context violation from the firmware. The context violation error is a fatal error that prevents any subsequent jobs from being executed. Without this fix it is necessary to reload the driver to restore the NPU operational state.
This is simplified version of upstream commit as the full implementation would require all engine reset/resume logic to be backported.
Signed-off-by: Karol Wachowski karol.wachowski@intel.com Signed-off-by: Maciej Falkowski maciej.falkowski@linux.intel.com Reviewed-by: Jacek Lawrynowicz jacek.lawrynowicz@linux.intel.com Signed-off-by: Jacek Lawrynowicz jacek.lawrynowicz@linux.intel.com Link: https://patchwork.freedesktop.org/patch/msgid/20250107173238.381120-13-macie... Fixes: 0adff3b0ef12 ("accel/ivpu: Share NPU busy time in sysfs") Cc: stable@vger.kernel.org # v6.11+ --- drivers/accel/ivpu/ivpu_job.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c index be2e2bf0f43f0..70b3676974407 100644 --- a/drivers/accel/ivpu/ivpu_job.c +++ b/drivers/accel/ivpu/ivpu_job.c @@ -482,6 +482,8 @@ static struct ivpu_job *ivpu_job_remove_from_submitted_jobs(struct ivpu_device * return job; }
+#define VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 0xEU + static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32 job_status) { struct ivpu_job *job; @@ -490,6 +492,9 @@ static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32 if (!job) return -ENOENT;
+ if (job_status == VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW) + ivpu_pm_trigger_recovery(vdev, "HW context violation"); + if (job->file_priv->has_mmu_faults) job_status = DRM_IVPU_JOB_STATUS_ABORTED;
Hi,
This is an important patch for the Intel NPU. Is there anything it is missing to be included in stable?
Regards, Jacek
On 4/8/2025 11:57 AM, Jacek Lawrynowicz wrote:
From: Karol Wachowski karol.wachowski@intel.com
commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.
Trigger recovery of the NPU upon receiving HW context violation from the firmware. The context violation error is a fatal error that prevents any subsequent jobs from being executed. Without this fix it is necessary to reload the driver to restore the NPU operational state.
This is simplified version of upstream commit as the full implementation would require all engine reset/resume logic to be backported.
Signed-off-by: Karol Wachowski karol.wachowski@intel.com Signed-off-by: Maciej Falkowski maciej.falkowski@linux.intel.com Reviewed-by: Jacek Lawrynowicz jacek.lawrynowicz@linux.intel.com Signed-off-by: Jacek Lawrynowicz jacek.lawrynowicz@linux.intel.com Link: https://patchwork.freedesktop.org/patch/msgid/20250107173238.381120-13-macie... Fixes: 0adff3b0ef12 ("accel/ivpu: Share NPU busy time in sysfs") Cc: stable@vger.kernel.org # v6.11+
drivers/accel/ivpu/ivpu_job.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c index be2e2bf0f43f0..70b3676974407 100644 --- a/drivers/accel/ivpu/ivpu_job.c +++ b/drivers/accel/ivpu/ivpu_job.c @@ -482,6 +482,8 @@ static struct ivpu_job *ivpu_job_remove_from_submitted_jobs(struct ivpu_device * return job; } +#define VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW 0xEU
static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32 job_status) { struct ivpu_job *job; @@ -490,6 +492,9 @@ static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32 if (!job) return -ENOENT;
- if (job_status == VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW)
ivpu_pm_trigger_recovery(vdev, "HW context violation");
- if (job->file_priv->has_mmu_faults) job_status = DRM_IVPU_JOB_STATUS_ABORTED;
On Thu, Apr 10, 2025 at 09:49:37AM +0200, Jacek Lawrynowicz wrote:
Hi,
This is an important patch for the Intel NPU. Is there anything it is missing to be included in stable?
Patience, you only sent this:
On 4/8/2025 11:57 AM, Jacek Lawrynowicz wrote:
2 days ago, AFTER the latest of -rc releases was sent out for review, and those kernels have NOT even been released yet!
[rant about how you all know this process works deleted as it was just snarky on my side, although quite cathartic, thanks for letting me vent...]
Relax, it will get handled when we can get to it. To help out, please take the time to review pending stable backported patches that have been submitted to the mailing list ahead of yours.
greg k-h
Hi,
On 4/10/2025 10:03 AM, Greg Kroah-Hartman wrote:
On Thu, Apr 10, 2025 at 09:49:37AM +0200, Jacek Lawrynowicz wrote:
Hi,
This is an important patch for the Intel NPU. Is there anything it is missing to be included in stable?
Patience, you only sent this:
On 4/8/2025 11:57 AM, Jacek Lawrynowicz wrote:
2 days ago, AFTER the latest of -rc releases was sent out for review, and those kernels have NOT even been released yet!
[rant about how you all know this process works deleted as it was just snarky on my side, although quite cathartic, thanks for letting me vent...]
Relax, it will get handled when we can get to it. To help out, please take the time to review pending stable backported patches that have been submitted to the mailing list ahead of yours.
Yeah, sorry about that. It seems I'm still learning the art of patience in the kernel processes. I didn't mean to rush anything. I was just worried my patch might have been overlooked due to a typo or a wrong commit message tag.
I'll make sure to review the backported patches in the meantime.
And thanks for holding back on the rant. I really think that the standard set by Linus is not healthy.
Regards, Jacek
[ Sasha's backport helper bot ]
Hi,
Summary of potential issues: ❌ Build failures detected
The upstream commit SHA1 provided is correct: dad945c27a42dfadddff1049cf5ae417209a8996
WARNING: Author mismatch between patch and upstream commit: Backport author: Jacek Lawrynowiczjacek.lawrynowicz@linux.intel.com Commit author: Karol Wachowskikarol.wachowski@intel.com
Note: The patch differs from the upstream commit: --- 1: dad945c27a42d < -: ------------- accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW -: ------------- > 1: 09299f7b8afa8 accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW ---
Results of testing on various branches:
| Branch | Patch Apply | Build Test | |---------------------------|-------------|------------| | stable/linux-6.14.y | Success | Success | | stable/linux-6.13.y | Success | Success | | stable/linux-6.12.y | Success | Success | | stable/linux-6.6.y | Success | Success | | stable/linux-6.1.y | Success | Success | | stable/linux-5.15.y | Success | Success | | stable/linux-5.10.y | Success | Success | | stable/linux-5.4.y | Success | Failed |
Build Errors: Build error for stable/linux-5.4.y: arch/arm64/boot/dts/exynos/exynos5433.dtsi:254.3-29: Warning (reg_format): /gpu@14ac0000:reg: property has invalid length (8 bytes) (#address-cells == 2, #size-cells == 2) arch/arm64/boot/dts/exynos/exynos5433-tm2.dtb: Warning (pci_device_bus_num): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos5433-tm2.dtb: Warning (i2c_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos5433-tm2.dtb: Warning (spi_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos7.dtsi:83.3-29: Warning (reg_format): /gpu@14ac0000:reg: property has invalid length (8 bytes) (#address-cells == 2, #size-cells == 2) arch/arm64/boot/dts/exynos/exynos7-espresso.dtb: Warning (pci_device_bus_num): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos7-espresso.dtb: Warning (i2c_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos7-espresso.dtb: Warning (spi_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos5433.dtsi:254.3-29: Warning (reg_format): /gpu@14ac0000:reg: property has invalid length (8 bytes) (#address-cells == 2, #size-cells == 2) arch/arm64/boot/dts/exynos/exynos5433-tm2e.dtb: Warning (pci_device_bus_num): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos5433-tm2e.dtb: Warning (i2c_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/exynos/exynos5433-tm2e.dtb: Warning (spi_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/qcom/ipq8074-hk01.dts:61.5-15: Warning (reg_format): /soc/nand@79b0000/nand@0:reg: property has invalid length (4 bytes) (#address-cells == 2, #size-cells == 1) arch/arm64/boot/dts/qcom/ipq8074-hk01.dtb: Warning (pci_device_bus_num): Failed prerequisite 'reg_format' arch/arm64/boot/dts/qcom/ipq8074-hk01.dtb: Warning (i2c_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/qcom/ipq8074-hk01.dtb: Warning (spi_bus_reg): Failed prerequisite 'reg_format' arch/arm64/boot/dts/qcom/ipq8074-hk01.dts:60.11-65.6: Warning (avoid_default_addr_size): /soc/nand@79b0000/nand@0: Relying on default #address-cells value arch/arm64/boot/dts/qcom/ipq8074-hk01.dts:60.11-65.6: Warning (avoid_default_addr_size): /soc/nand@79b0000/nand@0: Relying on default #size-cells value In file included from ./include/linux/list.h:9, from ./include/linux/kobject.h:19, from ./include/linux/of.h:17, from ./include/linux/clk-provider.h:9, from drivers/clk/qcom/clk-rpmh.c:6: drivers/clk/qcom/clk-rpmh.c: In function 'clk_rpmh_bcm_send_cmd': ./include/linux/kernel.h:843:43: warning: comparison of distinct pointer types lacks a cast [-Wcompare-distinct-pointer-types] 843 | (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1))) | ^~ ./include/linux/kernel.h:857:18: note: in expansion of macro '__typecheck' 857 | (__typecheck(x, y) && __no_side_effects(x, y)) | ^~~~~~~~~~~ ./include/linux/kernel.h:867:31: note: in expansion of macro '__safe_cmp' 867 | __builtin_choose_expr(__safe_cmp(x, y), \ | ^~~~~~~~~~ ./include/linux/kernel.h:876:25: note: in expansion of macro '__careful_cmp' 876 | #define min(x, y) __careful_cmp(x, y, <) | ^~~~~~~~~~~~~ drivers/clk/qcom/clk-rpmh.c:273:21: note: in expansion of macro 'min' 273 | cmd_state = min(cmd_state, BCM_TCS_CMD_VOTE_MASK); | ^~~ fs/xfs/libxfs/xfs_inode_fork.c: In function 'xfs_ifork_verify_attr': fs/xfs/libxfs/xfs_inode_fork.c:735:13: warning: the comparison will always evaluate as 'true' for the address of 'i_df' will never be NULL [-Waddress] 735 | if (!XFS_IFORK_PTR(ip, XFS_ATTR_FORK)) | ^ In file included from fs/xfs/libxfs/xfs_inode_fork.c:14: ./fs/xfs/xfs_inode.h:38:33: note: 'i_df' declared here 38 | struct xfs_ifork i_df; /* data fork */ | ^~~~ drivers/net/dsa/microchip/ksz9477.c: In function 'ksz9477_reset_switch': drivers/net/dsa/microchip/ksz9477.c:198:12: warning: unused variable 'data8' [-Wunused-variable] 198 | u8 data8; | ^~~~~ In file included from ./include/linux/bitops.h:5, from ./include/linux/kernel.h:12, from ./include/linux/list.h:9, from ./include/linux/module.h:9, from drivers/net/ethernet/qlogic/qed/qed_debug.c:6: drivers/net/ethernet/qlogic/qed/qed_debug.c: In function 'qed_grc_dump_addr_range': ./include/linux/bits.h:8:33: warning: overflow in conversion from 'long unsigned int' to 'u8' {aka 'unsigned char'} changes value from '(long unsigned int)((int)vf_id << 8 | 128)' to '128' [-Woverflow] 8 | #define BIT(nr) (UL(1) << (nr)) | ^ drivers/net/ethernet/qlogic/qed/qed_debug.c:2572:31: note: in expansion of macro 'BIT' 2572 | fid = BIT(PXP_PRETEND_CONCRETE_FID_VFVALID_SHIFT) | | ^~~ drivers/gpu/drm/nouveau/dispnv50/wndw.c:628:1: warning: conflicting types for 'nv50_wndw_new_' due to enum/integer mismatch; have 'int(const struct nv50_wndw_func *, struct drm_device *, enum drm_plane_type, const char *, int, const u32 *, u32, enum nv50_disp_interlock_type, u32, struct nv50_wndw **)' {aka 'int(const struct nv50_wndw_func *, struct drm_device *, enum drm_plane_type, const char *, int, const unsigned int *, unsigned int, enum nv50_disp_interlock_type, unsigned int, struct nv50_wndw **)'} [-Wenum-int-mismatch] 628 | nv50_wndw_new_(const struct nv50_wndw_func *func, struct drm_device *dev, | ^~~~~~~~~~~~~~ In file included from drivers/gpu/drm/nouveau/dispnv50/wndw.c:22: drivers/gpu/drm/nouveau/dispnv50/wndw.h:39:5: note: previous declaration of 'nv50_wndw_new_' with type 'int(const struct nv50_wndw_func *, struct drm_device *, enum drm_plane_type, const char *, int, const u32 *, enum nv50_disp_interlock_type, u32, u32, struct nv50_wndw **)' {aka 'int(const struct nv50_wndw_func *, struct drm_device *, enum drm_plane_type, const char *, int, const unsigned int *, enum nv50_disp_interlock_type, unsigned int, unsigned int, struct nv50_wndw **)'} 39 | int nv50_wndw_new_(const struct nv50_wndw_func *, struct drm_device *, | ^~~~~~~~~~~~~~ Terminated make: *** [Makefile:1121: vmlinux] Error 143 make: Target '_all' not remade because of errors.
On Tue, Apr 08, 2025 at 11:57:11AM +0200, Jacek Lawrynowicz wrote:
From: Karol Wachowski karol.wachowski@intel.com
commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.
Trigger recovery of the NPU upon receiving HW context violation from the firmware. The context violation error is a fatal error that prevents any subsequent jobs from being executed. Without this fix it is necessary to reload the driver to restore the NPU operational state.
This is simplified version of upstream commit as the full implementation would require all engine reset/resume logic to be backported.
We REALLY do not like taking patches that are not upstream. Why not backport all of the needed patches instead, how many would that be? Taking one-off patches like this just makes it harder/impossible to maintain the code over time as further fixes in this same area will NOT apply properly at all.
Think about what you want to be touching 5 years from now, a one-off change that doesn't match the rest of the kernel tree, or something that is the same?
thanks,
greg k-h
Hi,
On 4/22/2025 2:17 PM, Greg KH wrote:
On Tue, Apr 08, 2025 at 11:57:11AM +0200, Jacek Lawrynowicz wrote:
From: Karol Wachowski karol.wachowski@intel.com
commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.
Trigger recovery of the NPU upon receiving HW context violation from the firmware. The context violation error is a fatal error that prevents any subsequent jobs from being executed. Without this fix it is necessary to reload the driver to restore the NPU operational state.
This is simplified version of upstream commit as the full implementation would require all engine reset/resume logic to be backported.
We REALLY do not like taking patches that are not upstream. Why not backport all of the needed patches instead, how many would that be? Taking one-off patches like this just makes it harder/impossible to maintain the code over time as further fixes in this same area will NOT apply properly at all.
Think about what you want to be touching 5 years from now, a one-off change that doesn't match the rest of the kernel tree, or something that is the same?
Sure, I'm totally on board with backporting all required patches. I thought it was not possible due to 100 line limit.
This would be the minimum set of patches:
Patch 1: drivers/accel/ivpu/ivpu_drv.c | 32 +++----------- drivers/accel/ivpu/ivpu_drv.h | 2 + drivers/accel/ivpu/ivpu_job.c | 78 ++++++++++++++++++++++++++------- drivers/accel/ivpu/ivpu_job.h | 1 + drivers/accel/ivpu/ivpu_mmu.c | 3 +- drivers/accel/ivpu/ivpu_sysfs.c | 5 ++- 6 files changed, 75 insertions(+), 46 deletions(-)
Patch 2: drivers/accel/ivpu/ivpu_job.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-)
Patch 3: drivers/accel/ivpu/ivpu_job.c | 2 +- drivers/accel/ivpu/ivpu_jsm_msg.c | 3 +- drivers/accel/ivpu/vpu_boot_api.h | 45 +++-- drivers/accel/ivpu/vpu_jsm_api.h | 303 +++++++++++++++++++++++++----- 4 files changed, 293 insertions(+), 60 deletions(-)
Patch 4: drivers/accel/ivpu/ivpu_job.c | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-)
First patch needs some changes to apply correctly to 6.12 but the rest of them apply pretty cleanly. Is this acceptable?
Regards, Jacek
On Thu, Apr 24, 2025 at 12:22:31PM +0200, Jacek Lawrynowicz wrote:
Hi,
On 4/22/2025 2:17 PM, Greg KH wrote:
On Tue, Apr 08, 2025 at 11:57:11AM +0200, Jacek Lawrynowicz wrote:
From: Karol Wachowski karol.wachowski@intel.com
commit dad945c27a42dfadddff1049cf5ae417209a8996 upstream.
Trigger recovery of the NPU upon receiving HW context violation from the firmware. The context violation error is a fatal error that prevents any subsequent jobs from being executed. Without this fix it is necessary to reload the driver to restore the NPU operational state.
This is simplified version of upstream commit as the full implementation would require all engine reset/resume logic to be backported.
We REALLY do not like taking patches that are not upstream. Why not backport all of the needed patches instead, how many would that be? Taking one-off patches like this just makes it harder/impossible to maintain the code over time as further fixes in this same area will NOT apply properly at all.
Think about what you want to be touching 5 years from now, a one-off change that doesn't match the rest of the kernel tree, or something that is the same?
Sure, I'm totally on board with backporting all required patches. I thought it was not possible due to 100 line limit.
This would be the minimum set of patches:
Patch 1: drivers/accel/ivpu/ivpu_drv.c | 32 +++----------- drivers/accel/ivpu/ivpu_drv.h | 2 + drivers/accel/ivpu/ivpu_job.c | 78 ++++++++++++++++++++++++++------- drivers/accel/ivpu/ivpu_job.h | 1 + drivers/accel/ivpu/ivpu_mmu.c | 3 +- drivers/accel/ivpu/ivpu_sysfs.c | 5 ++- 6 files changed, 75 insertions(+), 46 deletions(-)
Patch 2: drivers/accel/ivpu/ivpu_job.c | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-)
Patch 3: drivers/accel/ivpu/ivpu_job.c | 2 +- drivers/accel/ivpu/ivpu_jsm_msg.c | 3 +- drivers/accel/ivpu/vpu_boot_api.h | 45 +++-- drivers/accel/ivpu/vpu_jsm_api.h | 303 +++++++++++++++++++++++++----- 4 files changed, 293 insertions(+), 60 deletions(-)
Patch 4: drivers/accel/ivpu/ivpu_job.c | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-)
First patch needs some changes to apply correctly to 6.12 but the rest of them apply pretty cleanly. Is this acceptable?
Totally acceptable, that's trivial compared to many of the larger backports we have taken over the years :)
thanks,
greg k-h
linux-stable-mirror@lists.linaro.org