Kernel Internal oops while booting stable-rc 4.14 kernel on qcom db410c device this problem happened only once on this specific platform. and rcu_preempt detected stalls on CPUs/tasks detected after this and board hung.
metadata: git branch: linux-4.14.y git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git git commit: 5b1e982af0f810358664827a6333affb4f5d8eb5 git describe: v4.14.188-126-g5b1e982af0f8 make_kernelversion: 4.14.189-rc1 kernel-config: https://builds.tuxbuild.com/12PM71zBW-5EAp5ztC_yxg/kernel.config
Crash dump:
[ 5.424958] Unable to handle kernel paging request at virtual address 00001008 [ 5.435485] Mem abort info: [ 5.442509] Exception class = DABT (current EL), IL = 32 bits [ 5.445203] SET = 0, FnV = 0 [[ 5.451101] EA = 0, S1PTW =[ 5.454226] Data abort info: [ 5.457264] ISV = 0, ISS = 0x00000044 [ 5.460390] CM = 0, WnR = 1 [ 5.463951] user pgtable: 4k pages, 48-bit VAs, pgd = ffff80003d66d000 [ 5.467078] [0000000000001008] *pgd=0000000000000000 [ 5.473503] Internal error: Oops: 96000044 [#1] PREEMPT SMP [ 5.479838] Modules linked in: adv7511 msm mdt_loader msm_rng drm_kms_helper rng_core drm fuse [ 5.485405] Process kworker/2:0 (pid: 21, stack limit = 0xffff000009450000) [ 5.494090] CPU: 2 PID: 21 Comm: kworker/2:0 Not tainted 4.14.189-rc1 #1 [ 5.501036] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT) [ 5.507996] Workqueue: events deferred_probe_work_func [ 5.514935] task: ffff80003d454380 task.stack: ffff000009450000 [ 5.520155] pc : clk_reparent+0x60/0xe8 [ 5.526058] lr : __clk_set_parent_before+0x40/0x88 [ 5.529882] sp : ffff000009453640 pstate : 800001c5 [ 5.534748] x29: ffff000009453640 x28: ffff0000090b7000 [ 5.539615] x27: ffff80003fe7c478 x26: ffff0000094537a8 [ 5.545175] x25: 0000000000000001 x24: ffff000009239038 [ 5.550736] x23: ffff80003b6be688 x22: 0000000000000000 [ 5.556297] x21: 0000000000000000 x20: ffff80003c9d8c00 [ 5.561858] x19: ffff80003d798900 x18: 00000000fffffffe [ 5.567419] x17: 0000ffff7fdbb6a0 x16: ffff00000821ad98 [ 5.572980] x15: 0000000000000001 x14: ffffffffffffffff [ 5.578540] x13: ffff0000094537c8 x12: 0000000000000010 [ 5.584102] x11: 0000000000000010 x10: 0101010101010101 [ 5.589663] x9 : 0000000000000000 x8 : 7f7f7f7f7f7f7f7f [ 5.595223] x7 : fefefefefeff6e77 x6 : 0000000000000140 [ 5.600784] x5 : 0000000000000001 x4 : ffff80003c9d8c00 [ 5.606344] x3 : ffff80003d798900 x2 : 0000000000000004 [ 5.611905] x1 : ffff80003d7989a8 x0 : 0000000000001000 [ 5.617467] Call trace: [ 5.623030] clk_reparent+0x60/0xe8 [ 5.625465] __clk_set_parent_before+0x40/0x88 [ 5.628943] clk_register+0x330/0x618 [ 5.633668] pll_28nm_register+0xa4/0x340 [msm] [ 5.637492] msm_dsi_pll_28nm_init+0xc8/0x1d8 [msm] [ 5.642007] msm_dsi_pll_init+0x34/0xe0 [msm] [ 5.646870] dsi_phy_driver_probe+0x1cc/0x310 [msm] [ 5.651196] platform_drv_probe+0x58/0xb8 [ 5.656060] driver_probe_device+0x228/0x2d0 [ 5.660231] __device_attach_driver+0xb8/0xe8 [ 5.664750] bus_for_each_drv+0x64/0xa0 [ 5.669269] __device_attach+0xcc/0x138 [ 5.673093] device_initial_probe+0x10/0x18 [ 5.676918] bus_probe_device+0x90/0x98 [ 5.681088] device_add+0x3c4/0x5a8 [ 5.684915] of_device_add+0x40/0x58 [ 5.688392] of_platform_device_create_pdata+0x80/0xe8 [ 5.692219] of_platform_bus_create+0xd4/0x308 [ 5.697432] of_platform_populate+0x48/0xb8 [ 5.702143] msm_pdev_probe+0x3c/0x328 [msm] [ 5.706125] platform_drv_probe+0x58/0xb8 [ 5.710642] driver_probe_device+0x228/0x2d0 [ 5.714814] __device_attach_driver+0xb8/0xe8 [ 5.719334] bus_for_each_drv+0x64/0xa0 [ 5.723852] __device_attach+0xcc/0x138 [ 5.727677] device_initial_probe+0x10/0x18 [ 5.731502] bus_probe_device+0x90/0x98 [ 5.735675] deferred_probe_work_func+0xa4/0x140 [ 5.739502] process_one_work+0x19c/0x300 [ 5.744366] worker_thread+0x4c/0x420 [ 5.748539] kthread+0x100/0x130 [ 5.752362] ret_from_fork+0x10/0x1c [ 5.755842] Code: 54000260 f9405080 f9005460 b4000040 (f9000401) [ 5.759669] ---[ end trace 6d70d7dd8a236384 ]--- [ 5.765922] note: kworker/2:0[21] exited with preempt_count 1 [ 26.777168] INFO: rcu_preempt detected stalls on CPUs/tasks: [ 26.777204] 0-...: (1 GPs behind) idle=2fa/140000000000001/0 softirq=1679/1723 fqs=2625 [ 26.783112] 1-...: (1 GPs behind) idle=53a/140000000000000/0 softirq=1946/1958 fqs=2625 [ 26.791444] (detected by 3, t=5252 jiffies, g=58, c=57, q=2362) [ 26.799781] Task dump for CPU 0: [ 26.806033] systemd-udevd R running task 0 2533 1 0x00000202 [ 26.809515] Call trace: [ 26.816814] __switch_to+0xe8/0x148 [ 26.819248] __wake_up_common+0x80/0x170 [ 26.822724] __wake_up_common_lock+0x7c/0xa8 [ 26.826897] __wake_up_sync_key+0x1c/0x28 [ 26.831416] sock_def_readable+0x40/0x88 [ 26.835584] Task dump for CPU 1: [ 26.839755] systemd R running task 0 1 0 0x0000000a [ 26.843237] Call trace: [ 26.850531] __switch_to+0xe8/0x148 [ 26.852968] 0xffff800009fc8000 [ 26.856467] INFO: rcu_sched detected stalls on CPUs/tasks: [ 26.859584] 0-...: (1 GPs behind) idle=2fa/140000000000001/0 softirq=1722/1723 fqs=2625 [ 26.865145] 1-...: (1 GPs behind) idle=53a/140000000000000/0 softirq=1957/1958 fqs=2625 [ 26.873477] (detected by 3, t=5270 jiffies, g=-150, c=-151, q=3) [ 26.881815] Task dump for CPU 0: [ 26.888067] systemd-udevd R running task 0 2533 1 0x00000202 [ 26.891549] Call trace: [ 26.898845] __switch_to+0xe8/0x148 [ 26.901281] __wake_up_common+0x80/0x170 [ 26.904758] __wake_up_common_lock+0x7c/0xa8 [ 26.908931] __wake_up_sync_key+0x1c/0x28 [ 26.913449] sock_def_readable+0x40/0x88 [ 26.917618] Task dump for CPU 1: [ 26.921790] systemd R running task 0 1 0 0x0000000a [ 26.925271] Call trace: [ 26.932565] __switch_to+0xe8/0x148 [ 26.935002] 0xffff800009fc8000 [ 41.449201] random: crng init done [ 41.449221] random: 7 urandom warning(s) missed due to ratelimiting [ 89.797164] INFO: rcu_preempt detected stalls on CPUs/tasks: [ 89.797195] 0-...: (1 GPs behind) idle=2fa/140000000000001/0 softirq=1679/1723 fqs=10482 [ 89.803103] 1-...: (1 GPs behind) idle=53a/140000000000000/0 softirq=1946/1958 fqs=10482 [ 89.811437] (detected by 3, t=21007 jiffies, g=58, c=57, q=2578) [ 89.819773] Task dump for CPU 0: [ 89.826027] systemd-udevd R running task 0 2533 1 0x00000202 [ 89.829508] Call trace: [ 89.836807] __switch_to+0xe8/0x148 [ 89.839241] __wake_up_common+0x80/0x170 [ 89.842717] __wake_up_common_lock+0x7c/0xa8 [ 89.846891] __wake_up_sync_key+0x1c/0x28 [ 89.851410] sock_def_readable+0x40/0x88 <System Hung>
Full test log, https://qa-reports.linaro.org/lkft/linux-stable-rc-4.14-oe/build/v4.14.188-1...
On Tue, Jul 21, 2020 at 10:32 AM Naresh Kamboju naresh.kamboju@linaro.org wrote:
Kernel Internal oops while booting stable-rc 4.14 kernel on qcom db410c device this problem happened only once on this specific platform. and rcu_preempt detected stalls on CPUs/tasks detected after this and board hung.
metadata: git branch: linux-4.14.y git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git git commit: 5b1e982af0f810358664827a6333affb4f5d8eb5 git describe: v4.14.188-126-g5b1e982af0f8 make_kernelversion: 4.14.189-rc1 kernel-config: https://builds.tuxbuild.com/12PM71zBW-5EAp5ztC_yxg/kernel.config
I looked at the object file at https://builds.tuxbuild.com/12PM71zBW-5EAp5ztC_yxg/vmlinux.xz
Crash dump:
[ 5.424958] Unable to handle kernel paging request at virtual address 00001008 [ 5.435485] Mem abort info: [ 5.442509] Exception class = DABT (current EL), IL = 32 bits [ 5.445203] SET = 0, FnV = 0 [[ 5.451101] EA = 0, S1PTW =[ 5.454226] Data abort info: [ 5.457264] ISV = 0, ISS = 0x00000044 [ 5.460390] CM = 0, WnR = 1 [ 5.463951] user pgtable: 4k pages, 48-bit VAs, pgd = ffff80003d66d000 [ 5.467078] [0000000000001008] *pgd=0000000000000000 [ 5.473503] Internal error: Oops: 96000044 [#1] PREEMPT SMP [ 5.479838] Modules linked in: adv7511 msm mdt_loader msm_rng drm_kms_helper rng_core drm fuse [ 5.485405] Process kworker/2:0 (pid: 21, stack limit = 0xffff000009450000) [ 5.494090] CPU: 2 PID: 21 Comm: kworker/2:0 Not tainted 4.14.189-rc1 #1 [ 5.501036] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT) [ 5.507996] Workqueue: events deferred_probe_work_func [ 5.514935] task: ffff80003d454380 task.stack: ffff000009450000 [ 5.520155] pc : clk_reparent+0x60/0xe8 [ 5.526058] lr : __clk_set_parent_before+0x40/0x88 [ 5.529882] sp : ffff000009453640 pstate : 800001c5 [ 5.534748] x29: ffff000009453640 x28: ffff0000090b7000 [ 5.539615] x27: ffff80003fe7c478 x26: ffff0000094537a8 [ 5.545175] x25: 0000000000000001 x24: ffff000009239038 [ 5.550736] x23: ffff80003b6be688 x22: 0000000000000000 [ 5.556297] x21: 0000000000000000 x20: ffff80003c9d8c00 [ 5.561858] x19: ffff80003d798900 x18: 00000000fffffffe [ 5.567419] x17: 0000ffff7fdbb6a0 x16: ffff00000821ad98 [ 5.572980] x15: 0000000000000001 x14: ffffffffffffffff [ 5.578540] x13: ffff0000094537c8 x12: 0000000000000010 [ 5.584102] x11: 0000000000000010 x10: 0101010101010101 [ 5.589663] x9 : 0000000000000000 x8 : 7f7f7f7f7f7f7f7f [ 5.595223] x7 : fefefefefeff6e77 x6 : 0000000000000140 [ 5.600784] x5 : 0000000000000001 x4 : ffff80003c9d8c00 [ 5.606344] x3 : ffff80003d798900 x2 : 0000000000000004 [ 5.611905] x1 : ffff80003d7989a8 x0 : 0000000000001000 [ 5.617467] Call trace: [ 5.623030] clk_reparent+0x60/0xe8
We are here:
static void clk_reparent(struct clk_core *core, struct clk_core *new_parent) { bool was_orphan = core->orphan;
hlist_del(&core->child_node);
if (new_parent) { bool becomes_orphan = new_parent->orphan;
/* avoid duplicate POST_RATE_CHANGE notifications */ if (new_parent->new_child == core) new_parent->new_child = NULL;
hlist_add_head(&core->child_node, &new_parent->children);
and "new_parent->children.first" points to 0x1000, which is not a valid pointer, so the "first->pprev = &n->next" in hlist_add_head() causes a crash.
[ 5.625465] __clk_set_parent_before+0x40/0x88 [ 5.628943] clk_register+0x330/0x618
(speculation) the most likely call chain here would be __clk_core_init(), implying that there is some random other orphan clock that has a broken parent:
/* * walk the list of orphan clocks and reparent any that newly finds a * parent. */ hlist_for_each_entry_safe(orphan, tmp2, &clk_orphan_list, child_node) { struct clk_core *parent = __clk_init_parent(orphan);
/* * We need to use __clk_set_parent_before() and _after() to * to properly migrate any prepare/enable count of the orphan * clock. This is important for CLK_IS_CRITICAL clocks, which * are enabled during init but might not have a parent yet. */ if (parent) { /* update the clk tree topology */ __clk_set_parent_before(orphan, parent);
None of the above have changed in stable kernels.
[ 5.633668] pll_28nm_register+0xa4/0x340 [msm] [ 5.637492] msm_dsi_pll_28nm_init+0xc8/0x1d8 [msm] [ 5.642007] msm_dsi_pll_init+0x34/0xe0 [msm] [ 5.646870] dsi_phy_driver_probe+0x1cc/0x310 [msm]
The only changes to the dsi driver in v4.14-stable were:
89e30bb46074 drm/msm/dsi: save pll state before dsi host is powered off 892afde0f4a1 drm: msm: Fix return type of dsi_mgr_connector_mode_valid for kCFI 35ff594b0da2 drm/msm/dsi: Implement reset correctly 5151a0c8d730 drm/msm/dsi: use correct enum in dsi_get_cmd_fmt e6bc3a4b0c23 clk: divider: fix incorrect usage of container_of
None of these look suspicious to me.
Arnd
[ 5.651196] platform_drv_probe+0x58/0xb8 [ 5.656060] driver_probe_device+0x228/0x2d0 [ 5.660231] __device_attach_driver+0xb8/0xe8 [ 5.664750] bus_for_each_drv+0x64/0xa0 [ 5.669269] __device_attach+0xcc/0x138 [ 5.673093] device_initial_probe+0x10/0x18 [ 5.676918] bus_probe_device+0x90/0x98 [ 5.681088] device_add+0x3c4/0x5a8 [ 5.684915] of_device_add+0x40/0x58 [ 5.688392] of_platform_device_create_pdata+0x80/0xe8 [ 5.692219] of_platform_bus_create+0xd4/0x308 [ 5.697432] of_platform_populate+0x48/0xb8 [ 5.702143] msm_pdev_probe+0x3c/0x328 [msm] [ 5.706125] platform_drv_probe+0x58/0xb8 [ 5.710642] driver_probe_device+0x228/0x2d0 [ 5.714814] __device_attach_driver+0xb8/0xe8 [ 5.719334] bus_for_each_drv+0x64/0xa0 [ 5.723852] __device_attach+0xcc/0x138 [ 5.727677] device_initial_probe+0x10/0x18 [ 5.731502] bus_probe_device+0x90/0x98 [ 5.735675] deferred_probe_work_func+0xa4/0x140
It's worth noting that this is a deferred probe, so maybe there is a race in the clk registration that gets hit if two drivers register a clock at the same time?
[ 5.739502] process_one_work+0x19c/0x300 [ 5.744366] worker_thread+0x4c/0x420 [ 5.748539] kthread+0x100/0x130 [ 5.752362] ret_from_fork+0x10/0x1c [ 5.755842] Code: 54000260 f9405080 f9005460 b4000040 (f9000401) [ 5.759669] ---[ end trace 6d70d7dd8a236384 ]--- [ 5.765922] note: kworker/2:0[21] exited with preempt_count 1 [ 26.777168] INFO: rcu_preempt detected stalls on CPUs/tasks: [ 26.777204] 0-...: (1 GPs behind) idle=2fa/140000000000001/0 softirq=1679/1723 fqs=2625 [ 26.783112] 1-...: (1 GPs behind) idle=53a/140000000000000/0 softirq=1946/1958 fqs=2625 [ 26.791444] (detected by 3, t=5252 jiffies, g=58, c=57, q=2362) [ 26.799781] Task dump for CPU 0: [ 26.806033] systemd-udevd R running task 0 2533 1 0x00000202 [ 26.809515] Call trace: [ 26.816814] __switch_to+0xe8/0x148 [ 26.819248] __wake_up_common+0x80/0x170 [ 26.822724] __wake_up_common_lock+0x7c/0xa8 [ 26.826897] __wake_up_sync_key+0x1c/0x28 [ 26.831416] sock_def_readable+0x40/0x88 [ 26.835584] Task dump for CPU 1: [ 26.839755] systemd R running task 0 1 0 0x0000000a [ 26.843237] Call trace: [ 26.850531] __switch_to+0xe8/0x148 [ 26.852968] 0xffff800009fc8000 [ 26.856467] INFO: rcu_sched detected stalls on CPUs/tasks: [ 26.859584] 0-...: (1 GPs behind) idle=2fa/140000000000001/0 softirq=1722/1723 fqs=2625 [ 26.865145] 1-...: (1 GPs behind) idle=53a/140000000000000/0 softirq=1957/1958 fqs=2625 [ 26.873477] (detected by 3, t=5270 jiffies, g=-150, c=-151, q=3) [ 26.881815] Task dump for CPU 0: [ 26.888067] systemd-udevd R running task 0 2533 1 0x00000202 [ 26.891549] Call trace: [ 26.898845] __switch_to+0xe8/0x148 [ 26.901281] __wake_up_common+0x80/0x170 [ 26.904758] __wake_up_common_lock+0x7c/0xa8 [ 26.908931] __wake_up_sync_key+0x1c/0x28 [ 26.913449] sock_def_readable+0x40/0x88 [ 26.917618] Task dump for CPU 1: [ 26.921790] systemd R running task 0 1 0 0x0000000a [ 26.925271] Call trace: [ 26.932565] __switch_to+0xe8/0x148 [ 26.935002] 0xffff800009fc8000 [ 41.449201] random: crng init done [ 41.449221] random: 7 urandom warning(s) missed due to ratelimiting [ 89.797164] INFO: rcu_preempt detected stalls on CPUs/tasks: [ 89.797195] 0-...: (1 GPs behind) idle=2fa/140000000000001/0 softirq=1679/1723 fqs=10482 [ 89.803103] 1-...: (1 GPs behind) idle=53a/140000000000000/0 softirq=1946/1958 fqs=10482 [ 89.811437] (detected by 3, t=21007 jiffies, g=58, c=57, q=2578) [ 89.819773] Task dump for CPU 0: [ 89.826027] systemd-udevd R running task 0 2533 1 0x00000202 [ 89.829508] Call trace: [ 89.836807] __switch_to+0xe8/0x148 [ 89.839241] __wake_up_common+0x80/0x170 [ 89.842717] __wake_up_common_lock+0x7c/0xa8 [ 89.846891] __wake_up_sync_key+0x1c/0x28 [ 89.851410] sock_def_readable+0x40/0x88
<System Hung>
Full test log, https://qa-reports.linaro.org/lkft/linux-stable-rc-4.14-oe/build/v4.14.188-1...
-- Linaro LKFT https://lkft.linaro.org
Quoting Arnd Bergmann (2020-07-21 02:51:32)
__clk_set_parent_before(orphan, parent);
None of the above have changed in stable kernels.
[ 5.633668] pll_28nm_register+0xa4/0x340 [msm] [ 5.637492] msm_dsi_pll_28nm_init+0xc8/0x1d8 [msm] [ 5.642007] msm_dsi_pll_init+0x34/0xe0 [msm] [ 5.646870] dsi_phy_driver_probe+0x1cc/0x310 [msm]
The only changes to the dsi driver in v4.14-stable were:
89e30bb46074 drm/msm/dsi: save pll state before dsi host is powered off 892afde0f4a1 drm: msm: Fix return type of dsi_mgr_connector_mode_valid for kCFI 35ff594b0da2 drm/msm/dsi: Implement reset correctly 5151a0c8d730 drm/msm/dsi: use correct enum in dsi_get_cmd_fmt e6bc3a4b0c23 clk: divider: fix incorrect usage of container_of
None of these look suspicious to me.
It sounds like maybe you need this patch?
bdcf1dc25324 ("clk: Evict unregistered clks from parent caches")
or
4368a1539c6b ("drm/msm: Depopulate platform on probe failure")
I vaguelly recall that the display driver wasn't removing clks becaues it wasn't removing devices when probe defer happened and then we had dangling clks in the parent cache confusing things.
On Thu, 23 Jul 2020 at 13:36, Stephen Boyd sboyd@kernel.org wrote:
Quoting Arnd Bergmann (2020-07-21 02:51:32)
__clk_set_parent_before(orphan, parent);
None of the above have changed in stable kernels.
[ 5.633668] pll_28nm_register+0xa4/0x340 [msm] [ 5.637492] msm_dsi_pll_28nm_init+0xc8/0x1d8 [msm] [ 5.642007] msm_dsi_pll_init+0x34/0xe0 [msm] [ 5.646870] dsi_phy_driver_probe+0x1cc/0x310 [msm]
The only changes to the dsi driver in v4.14-stable were:
89e30bb46074 drm/msm/dsi: save pll state before dsi host is powered off 892afde0f4a1 drm: msm: Fix return type of dsi_mgr_connector_mode_valid for kCFI 35ff594b0da2 drm/msm/dsi: Implement reset correctly 5151a0c8d730 drm/msm/dsi: use correct enum in dsi_get_cmd_fmt e6bc3a4b0c23 clk: divider: fix incorrect usage of container_of
None of these look suspicious to me.
It sounds like maybe you need this patch?
bdcf1dc25324 ("clk: Evict unregistered clks from parent caches")
Cherry-pick did not work on stable-rc 4.14 this patch might need backporting. I am not sure.
or
4368a1539c6b ("drm/msm: Depopulate platform on probe failure")
This commit already is in stable-rc 4.14 branch. drm/msm: Depopulate platform on probe failure
[ Upstream commit 4368a1539c6b41ac3cddc06f5a5117952998804c ]
I vaguelly recall that the display driver wasn't removing clks becaues it wasn't removing devices when probe defer happened and then we had dangling clks in the parent cache confusing things.
Thanks for your email.
- Naresh
Quoting Naresh Kamboju (2020-07-23 03:10:37)
On Thu, 23 Jul 2020 at 13:36, Stephen Boyd sboyd@kernel.org wrote:
Quoting Arnd Bergmann (2020-07-21 02:51:32)
__clk_set_parent_before(orphan, parent);
None of the above have changed in stable kernels.
[ 5.633668] pll_28nm_register+0xa4/0x340 [msm] [ 5.637492] msm_dsi_pll_28nm_init+0xc8/0x1d8 [msm] [ 5.642007] msm_dsi_pll_init+0x34/0xe0 [msm] [ 5.646870] dsi_phy_driver_probe+0x1cc/0x310 [msm]
The only changes to the dsi driver in v4.14-stable were:
89e30bb46074 drm/msm/dsi: save pll state before dsi host is powered off 892afde0f4a1 drm: msm: Fix return type of dsi_mgr_connector_mode_valid for kCFI 35ff594b0da2 drm/msm/dsi: Implement reset correctly 5151a0c8d730 drm/msm/dsi: use correct enum in dsi_get_cmd_fmt e6bc3a4b0c23 clk: divider: fix incorrect usage of container_of
None of these look suspicious to me.
It sounds like maybe you need this patch?
bdcf1dc25324 ("clk: Evict unregistered clks from parent caches")
Cherry-pick did not work on stable-rc 4.14 this patch might need backporting. I am not sure.
Ok. That commit fixes a regression in the 3.x series of the kernel so it should go back to any LTS kernels. It looks like at least on 4.14 it's a trivial conflict. Here's a backport to 4.14
---8<--- From: Stephen Boyd sboyd@kernel.org Date: Wed, 28 Aug 2019 11:19:59 -0700 Subject: [PATCH] clk: Evict unregistered clks from parent caches
We leave a dangling pointer in each clk_core::parents array that has an unregistered clk as a potential parent when that clk_core pointer is freed by clk{_hw}_unregister(). It is impossible for the true parent of a clk to be set with clk_set_parent() once the dangling pointer is left in the cache because we compare parent pointers in clk_fetch_parent_index() instead of checking for a matching clk name or clk_hw pointer.
Before commit ede77858473a ("clk: Remove global clk traversal on fetch parent index"), we would check clk_hw pointers, which has a higher chance of being the same between registration and unregistration, but it can still be allocated and freed by the clk provider. In fact, this has been a long standing problem since commit da0f0b2c3ad2 ("clk: Correct lookup logic in clk_fetch_parent_index()") where we stopped trying to compare clk names and skipped over entries in the cache that weren't NULL.
There are good (performance) reasons to not do the global tree lookup in cases where the cache holds dangling pointers to parents that have been unregistered. Let's take the performance hit on the uncommon registration path instead. Loop through all the clk_core::parents arrays when a clk is unregistered and set the entry to NULL when the parent cache entry and clk being unregistered are the same pointer. This will fix this problem and avoid the overhead for the "normal" case.
Based on a patch by Bjorn Andersson.
Fixes: da0f0b2c3ad2 ("clk: Correct lookup logic in clk_fetch_parent_index()") Reviewed-by: Bjorn Andersson bjorn.andersson@linaro.org Tested-by: Sai Prakash Ranjan saiprakash.ranjan@codeaurora.org Signed-off-by: Stephen Boyd sboyd@kernel.org Link: https://lkml.kernel.org/r/20190828181959.204401-1-sboyd@kernel.org --- drivers/clk/clk.c | 52 +++++++++++++++++++++++++++++++++++++---------- 1 file changed, 41 insertions(+), 11 deletions(-)
diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c index 44b6f23cc851..4289c519af1b 100644 --- a/drivers/clk/clk.c +++ b/drivers/clk/clk.c @@ -39,6 +39,17 @@ static HLIST_HEAD(clk_root_list); static HLIST_HEAD(clk_orphan_list); static LIST_HEAD(clk_notifier_list);
+static struct hlist_head *all_lists[] = { + &clk_root_list, + &clk_orphan_list, + NULL, +}; + +static struct hlist_head *orphan_list[] = { + &clk_orphan_list, + NULL, +}; + /*** private data structures ***/
struct clk_core { @@ -1993,17 +2004,6 @@ static int inited = 0; static DEFINE_MUTEX(clk_debug_lock); static HLIST_HEAD(clk_debug_list);
-static struct hlist_head *all_lists[] = { - &clk_root_list, - &clk_orphan_list, - NULL, -}; - -static struct hlist_head *orphan_list[] = { - &clk_orphan_list, - NULL, -}; - static void clk_summary_show_one(struct seq_file *s, struct clk_core *c, int level) { @@ -2735,6 +2735,34 @@ static const struct clk_ops clk_nodrv_ops = { .set_parent = clk_nodrv_set_parent, };
+static void clk_core_evict_parent_cache_subtree(struct clk_core *root, + struct clk_core *target) +{ + int i; + struct clk_core *child; + + for (i = 0; i < root->num_parents; i++) + if (root->parents[i] == target) + root->parents[i] = NULL; + + hlist_for_each_entry(child, &root->children, child_node) + clk_core_evict_parent_cache_subtree(child, target); +} + +/* Remove this clk from all parent caches */ +static void clk_core_evict_parent_cache(struct clk_core *core) +{ + struct hlist_head **lists; + struct clk_core *root; + + lockdep_assert_held(&prepare_lock); + + for (lists = all_lists; *lists; lists++) + hlist_for_each_entry(root, *lists, child_node) + clk_core_evict_parent_cache_subtree(root, core); + +} + /** * clk_unregister - unregister a currently registered clock * @clk: clock to unregister @@ -2773,6 +2801,8 @@ void clk_unregister(struct clk *clk) clk_core_set_parent(child, NULL); }
+ clk_core_evict_parent_cache(clk->core); + hlist_del_init(&clk->core->child_node);
if (clk->core->prepare_count)
Quoting Stephen Boyd (2020-07-23 03:26:02)
Quoting Naresh Kamboju (2020-07-23 03:10:37)
On Thu, 23 Jul 2020 at 13:36, Stephen Boyd sboyd@kernel.org wrote:
It sounds like maybe you need this patch?
bdcf1dc25324 ("clk: Evict unregistered clks from parent caches")
Cherry-pick did not work on stable-rc 4.14 this patch might need backporting. I am not sure.
Ok. That commit fixes a regression in the 3.x series of the kernel so it should go back to any LTS kernels. It looks like at least on 4.14 it's a trivial conflict. Here's a backport to 4.14
Did this help?
On Wed, 12 Aug 2020 at 23:14, Stephen Boyd sboyd@kernel.org wrote:
Quoting Stephen Boyd (2020-07-23 03:26:02)
Quoting Naresh Kamboju (2020-07-23 03:10:37)
On Thu, 23 Jul 2020 at 13:36, Stephen Boyd sboyd@kernel.org wrote:
It sounds like maybe you need this patch?
bdcf1dc25324 ("clk: Evict unregistered clks from parent caches")
Cherry-pick did not work on stable-rc 4.14 this patch might need backporting. I am not sure.
Ok. That commit fixes a regression in the 3.x series of the kernel so it should go back to any LTS kernels. It looks like at least on 4.14 it's a trivial conflict. Here's a backport to 4.14
Did this help?
Thanks for your patch. Sorry for the late reply. The reported issue is not always reproducible However, I have tested this for 100 cycles and did not notice the reported problem.
Tested-by: Naresh Kamboju naresh.kamboju@linaro.org
- Naresh
On Thu, Aug 13, 2020 at 07:30:40PM +0530, Naresh Kamboju wrote:
On Wed, 12 Aug 2020 at 23:14, Stephen Boyd sboyd@kernel.org wrote:
Quoting Stephen Boyd (2020-07-23 03:26:02)
Quoting Naresh Kamboju (2020-07-23 03:10:37)
On Thu, 23 Jul 2020 at 13:36, Stephen Boyd sboyd@kernel.org wrote:
It sounds like maybe you need this patch?
bdcf1dc25324 ("clk: Evict unregistered clks from parent caches")
Cherry-pick did not work on stable-rc 4.14 this patch might need backporting. I am not sure.
Ok. That commit fixes a regression in the 3.x series of the kernel so it should go back to any LTS kernels. It looks like at least on 4.14 it's a trivial conflict. Here's a backport to 4.14
Did this help?
Thanks for your patch. Sorry for the late reply. The reported issue is not always reproducible However, I have tested this for 100 cycles and did not notice the reported problem.
Tested-by: Naresh Kamboju naresh.kamboju@linaro.org
Great, now queued up to 4.14.y and 4.19.y.
greg k-h