On Tue, Jul 21, 2020 at 10:32 AM Naresh Kamboju naresh.kamboju@linaro.org wrote:
Kernel Internal oops while booting stable-rc 4.14 kernel on qcom db410c device this problem happened only once on this specific platform. and rcu_preempt detected stalls on CPUs/tasks detected after this and board hung.
metadata: git branch: linux-4.14.y git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git git commit: 5b1e982af0f810358664827a6333affb4f5d8eb5 git describe: v4.14.188-126-g5b1e982af0f8 make_kernelversion: 4.14.189-rc1 kernel-config: https://builds.tuxbuild.com/12PM71zBW-5EAp5ztC_yxg/kernel.config
I looked at the object file at https://builds.tuxbuild.com/12PM71zBW-5EAp5ztC_yxg/vmlinux.xz
Crash dump:
[ 5.424958] Unable to handle kernel paging request at virtual address 00001008 [ 5.435485] Mem abort info: [ 5.442509] Exception class = DABT (current EL), IL = 32 bits [ 5.445203] SET = 0, FnV = 0 [[ 5.451101] EA = 0, S1PTW =[ 5.454226] Data abort info: [ 5.457264] ISV = 0, ISS = 0x00000044 [ 5.460390] CM = 0, WnR = 1 [ 5.463951] user pgtable: 4k pages, 48-bit VAs, pgd = ffff80003d66d000 [ 5.467078] [0000000000001008] *pgd=0000000000000000 [ 5.473503] Internal error: Oops: 96000044 [#1] PREEMPT SMP [ 5.479838] Modules linked in: adv7511 msm mdt_loader msm_rng drm_kms_helper rng_core drm fuse [ 5.485405] Process kworker/2:0 (pid: 21, stack limit = 0xffff000009450000) [ 5.494090] CPU: 2 PID: 21 Comm: kworker/2:0 Not tainted 4.14.189-rc1 #1 [ 5.501036] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT) [ 5.507996] Workqueue: events deferred_probe_work_func [ 5.514935] task: ffff80003d454380 task.stack: ffff000009450000 [ 5.520155] pc : clk_reparent+0x60/0xe8 [ 5.526058] lr : __clk_set_parent_before+0x40/0x88 [ 5.529882] sp : ffff000009453640 pstate : 800001c5 [ 5.534748] x29: ffff000009453640 x28: ffff0000090b7000 [ 5.539615] x27: ffff80003fe7c478 x26: ffff0000094537a8 [ 5.545175] x25: 0000000000000001 x24: ffff000009239038 [ 5.550736] x23: ffff80003b6be688 x22: 0000000000000000 [ 5.556297] x21: 0000000000000000 x20: ffff80003c9d8c00 [ 5.561858] x19: ffff80003d798900 x18: 00000000fffffffe [ 5.567419] x17: 0000ffff7fdbb6a0 x16: ffff00000821ad98 [ 5.572980] x15: 0000000000000001 x14: ffffffffffffffff [ 5.578540] x13: ffff0000094537c8 x12: 0000000000000010 [ 5.584102] x11: 0000000000000010 x10: 0101010101010101 [ 5.589663] x9 : 0000000000000000 x8 : 7f7f7f7f7f7f7f7f [ 5.595223] x7 : fefefefefeff6e77 x6 : 0000000000000140 [ 5.600784] x5 : 0000000000000001 x4 : ffff80003c9d8c00 [ 5.606344] x3 : ffff80003d798900 x2 : 0000000000000004 [ 5.611905] x1 : ffff80003d7989a8 x0 : 0000000000001000 [ 5.617467] Call trace: [ 5.623030] clk_reparent+0x60/0xe8
We are here:
static void clk_reparent(struct clk_core *core, struct clk_core *new_parent) { bool was_orphan = core->orphan;
hlist_del(&core->child_node);
if (new_parent) { bool becomes_orphan = new_parent->orphan;
/* avoid duplicate POST_RATE_CHANGE notifications */ if (new_parent->new_child == core) new_parent->new_child = NULL;
hlist_add_head(&core->child_node, &new_parent->children);
and "new_parent->children.first" points to 0x1000, which is not a valid pointer, so the "first->pprev = &n->next" in hlist_add_head() causes a crash.
[ 5.625465] __clk_set_parent_before+0x40/0x88 [ 5.628943] clk_register+0x330/0x618
(speculation) the most likely call chain here would be __clk_core_init(), implying that there is some random other orphan clock that has a broken parent:
/* * walk the list of orphan clocks and reparent any that newly finds a * parent. */ hlist_for_each_entry_safe(orphan, tmp2, &clk_orphan_list, child_node) { struct clk_core *parent = __clk_init_parent(orphan);
/* * We need to use __clk_set_parent_before() and _after() to * to properly migrate any prepare/enable count of the orphan * clock. This is important for CLK_IS_CRITICAL clocks, which * are enabled during init but might not have a parent yet. */ if (parent) { /* update the clk tree topology */ __clk_set_parent_before(orphan, parent);
None of the above have changed in stable kernels.
[ 5.633668] pll_28nm_register+0xa4/0x340 [msm] [ 5.637492] msm_dsi_pll_28nm_init+0xc8/0x1d8 [msm] [ 5.642007] msm_dsi_pll_init+0x34/0xe0 [msm] [ 5.646870] dsi_phy_driver_probe+0x1cc/0x310 [msm]
The only changes to the dsi driver in v4.14-stable were:
89e30bb46074 drm/msm/dsi: save pll state before dsi host is powered off 892afde0f4a1 drm: msm: Fix return type of dsi_mgr_connector_mode_valid for kCFI 35ff594b0da2 drm/msm/dsi: Implement reset correctly 5151a0c8d730 drm/msm/dsi: use correct enum in dsi_get_cmd_fmt e6bc3a4b0c23 clk: divider: fix incorrect usage of container_of
None of these look suspicious to me.
Arnd
[ 5.651196] platform_drv_probe+0x58/0xb8 [ 5.656060] driver_probe_device+0x228/0x2d0 [ 5.660231] __device_attach_driver+0xb8/0xe8 [ 5.664750] bus_for_each_drv+0x64/0xa0 [ 5.669269] __device_attach+0xcc/0x138 [ 5.673093] device_initial_probe+0x10/0x18 [ 5.676918] bus_probe_device+0x90/0x98 [ 5.681088] device_add+0x3c4/0x5a8 [ 5.684915] of_device_add+0x40/0x58 [ 5.688392] of_platform_device_create_pdata+0x80/0xe8 [ 5.692219] of_platform_bus_create+0xd4/0x308 [ 5.697432] of_platform_populate+0x48/0xb8 [ 5.702143] msm_pdev_probe+0x3c/0x328 [msm] [ 5.706125] platform_drv_probe+0x58/0xb8 [ 5.710642] driver_probe_device+0x228/0x2d0 [ 5.714814] __device_attach_driver+0xb8/0xe8 [ 5.719334] bus_for_each_drv+0x64/0xa0 [ 5.723852] __device_attach+0xcc/0x138 [ 5.727677] device_initial_probe+0x10/0x18 [ 5.731502] bus_probe_device+0x90/0x98 [ 5.735675] deferred_probe_work_func+0xa4/0x140
It's worth noting that this is a deferred probe, so maybe there is a race in the clk registration that gets hit if two drivers register a clock at the same time?
[ 5.739502] process_one_work+0x19c/0x300 [ 5.744366] worker_thread+0x4c/0x420 [ 5.748539] kthread+0x100/0x130 [ 5.752362] ret_from_fork+0x10/0x1c [ 5.755842] Code: 54000260 f9405080 f9005460 b4000040 (f9000401) [ 5.759669] ---[ end trace 6d70d7dd8a236384 ]--- [ 5.765922] note: kworker/2:0[21] exited with preempt_count 1 [ 26.777168] INFO: rcu_preempt detected stalls on CPUs/tasks: [ 26.777204] 0-...: (1 GPs behind) idle=2fa/140000000000001/0 softirq=1679/1723 fqs=2625 [ 26.783112] 1-...: (1 GPs behind) idle=53a/140000000000000/0 softirq=1946/1958 fqs=2625 [ 26.791444] (detected by 3, t=5252 jiffies, g=58, c=57, q=2362) [ 26.799781] Task dump for CPU 0: [ 26.806033] systemd-udevd R running task 0 2533 1 0x00000202 [ 26.809515] Call trace: [ 26.816814] __switch_to+0xe8/0x148 [ 26.819248] __wake_up_common+0x80/0x170 [ 26.822724] __wake_up_common_lock+0x7c/0xa8 [ 26.826897] __wake_up_sync_key+0x1c/0x28 [ 26.831416] sock_def_readable+0x40/0x88 [ 26.835584] Task dump for CPU 1: [ 26.839755] systemd R running task 0 1 0 0x0000000a [ 26.843237] Call trace: [ 26.850531] __switch_to+0xe8/0x148 [ 26.852968] 0xffff800009fc8000 [ 26.856467] INFO: rcu_sched detected stalls on CPUs/tasks: [ 26.859584] 0-...: (1 GPs behind) idle=2fa/140000000000001/0 softirq=1722/1723 fqs=2625 [ 26.865145] 1-...: (1 GPs behind) idle=53a/140000000000000/0 softirq=1957/1958 fqs=2625 [ 26.873477] (detected by 3, t=5270 jiffies, g=-150, c=-151, q=3) [ 26.881815] Task dump for CPU 0: [ 26.888067] systemd-udevd R running task 0 2533 1 0x00000202 [ 26.891549] Call trace: [ 26.898845] __switch_to+0xe8/0x148 [ 26.901281] __wake_up_common+0x80/0x170 [ 26.904758] __wake_up_common_lock+0x7c/0xa8 [ 26.908931] __wake_up_sync_key+0x1c/0x28 [ 26.913449] sock_def_readable+0x40/0x88 [ 26.917618] Task dump for CPU 1: [ 26.921790] systemd R running task 0 1 0 0x0000000a [ 26.925271] Call trace: [ 26.932565] __switch_to+0xe8/0x148 [ 26.935002] 0xffff800009fc8000 [ 41.449201] random: crng init done [ 41.449221] random: 7 urandom warning(s) missed due to ratelimiting [ 89.797164] INFO: rcu_preempt detected stalls on CPUs/tasks: [ 89.797195] 0-...: (1 GPs behind) idle=2fa/140000000000001/0 softirq=1679/1723 fqs=10482 [ 89.803103] 1-...: (1 GPs behind) idle=53a/140000000000000/0 softirq=1946/1958 fqs=10482 [ 89.811437] (detected by 3, t=21007 jiffies, g=58, c=57, q=2578) [ 89.819773] Task dump for CPU 0: [ 89.826027] systemd-udevd R running task 0 2533 1 0x00000202 [ 89.829508] Call trace: [ 89.836807] __switch_to+0xe8/0x148 [ 89.839241] __wake_up_common+0x80/0x170 [ 89.842717] __wake_up_common_lock+0x7c/0xa8 [ 89.846891] __wake_up_sync_key+0x1c/0x28 [ 89.851410] sock_def_readable+0x40/0x88
<System Hung>
Full test log, https://qa-reports.linaro.org/lkft/linux-stable-rc-4.14-oe/build/v4.14.188-1...
-- Linaro LKFT https://lkft.linaro.org