Re: easy kernel crash on TC2 with mainline kernel

21 Nov 2013


      On Thu, Nov 21, 2013 at 06:48:32PM +0000, Lorenzo Pieralisi wrote:
...
On Thu, Nov 21, 2013 at 03:10:58PM +0000, Jon Medhurst (Tixy) wrote:
...
On Thu, 2013-11-21 at 00:09 -0500, Nicolas Pitre wrote:
...
I've been banging my head on this one for quite a while now.  The
problem is that there is very little debug output available.  Could
you see if you get the same?
I get the same, with the same kernel version. If I disable cpufreq
configs, then I get another (possibly different) crash, which has a
backtrace.
Same here, reverting this commit does the trick for me, but that's more
a symptom than a cure. I still can't pinpoint what the problem is,
but it's pretty easy to trigger:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/kerne...
This doesn't fix it for me, but my kernel tree has a few extra patches
so there might be some other inconsistency somewhere.  Mind you, except
for the TC2 mcpm power_down_finish() patch I don't _think_ I have
anything relevant to these symptoms...
What I'm seeing is a stall in
kernel/cpu.c:cpu_down()->synchronize_sched()->wait_rcu_gp() with
cpu_hotplug.lock held.
This doesn't kill the system, but any thread that tries to perform CPU
hotplug gets stuck in uninterruptible sleep trying to take
cpu_hotplug.lock:
kworker/1:2     D 80424dfc     0  2366      2 0x00000000
Workqueue: events cpuset_hotplug_workfn
[<80424dfc>] (__schedule+0x218/0x5e0) from [<80425588>] (schedule_preempt_disabled+0xc/0x10)
[<80425588>] (schedule_preempt_disabled+0xc/0x10) from [<804278b4>] (mutex_lock_nested+0x1a0/0x38c)
[<804278b4>] (mutex_lock_nested+0x1a0/0x38c) from [<80023a40>] (get_online_cpus+0x30/0x4c)
[<80023a40>] (get_online_cpus+0x30/0x4c) from [<8008b030>] (rebuild_sched_domains_locked+0x1c/0x458)
[<8008b030>] (rebuild_sched_domains_locked+0x1c/0x458) from [<8008cba4>] (rebuild_sched_domains+0x1c/0x28)
[<8008cba4>] (rebuild_sched_domains+0x1c/0x28) from [<8008cdbc>] (cpuset_hotplug_workfn+0x20c/0x534)
[<8008cdbc>] (cpuset_hotplug_workfn+0x20c/0x534) from [<8003b434>] (process_one_work+0x1b0/0x4d0)
[<8003b434>] (process_one_work+0x1b0/0x4d0) from [<8003bb50>] (worker_thread+0x138/0x3c0)
[<8003bb50>] (worker_thread+0x138/0x3c0) from [<80041dac>] (kthread+0xc4/0xe0)
[<80041dac>] (kthread+0xc4/0xe0) from [<8000e2e8>] (ret_from_fork+0x14/0x2c)
bash            D 80424dfc     0  2386   2385 0x00000000
[<80424dfc>] (__schedule+0x218/0x5e0) from [<804246b8>] (schedule_timeout+0x120/0x1bc)
[<804246b8>] (schedule_timeout+0x120/0x1bc) from [<80425a0c>] (wait_for_common+0xa8/0x14c)
[<80425a0c>] (wait_for_common+0xa8/0x14c) from [<8006bef8>] (wait_rcu_gp+0x44/0x4c)
[<8006bef8>] (wait_rcu_gp+0x44/0x4c) from [<8041f068>] (_cpu_down+0x88/0x230)
[<8041f068>] (_cpu_down+0x88/0x230) from [<8041f238>] (cpu_down+0x28/0x3c)
[<8041f238>] (cpu_down+0x28/0x3c) from [<80285de0>] (device_offline+0x8c/0xb4)
[<80285de0>] (device_offline+0x8c/0xb4) from [<80285ed8>] (online_store+0x44/0x6c)
[<80285ed8>] (online_store+0x44/0x6c) from [<80283ec0>] (dev_attr_store+0x18/0x24)
[<80283ec0>] (dev_attr_store+0x18/0x24) from [<80149d5c>] (sysfs_write_file+0x1a4/0x1d0)
[<80149d5c>] (sysfs_write_file+0x1a4/0x1d0) from [<800f02a0>] (vfs_write+0xb4/0x17c)
[<800f02a0>] (vfs_write+0xb4/0x17c) from [<800f0628>] (SyS_write+0x40/0x68)
[<800f0628>] (SyS_write+0x40/0x68) from [<8000e220>] (ret_fast_syscall+0x0/0x48)
This almost always happens when hotplugging CPUs off in the third inner
loop of Nico's test, on the first iteration of the outer loop.
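For anyone wanting to poke at this path by hand, here is a minimal sketch
of driving CPU hotplug through sysfs, which is the same
sysfs_write_file()->online_store()->cpu_down() route the bash task above
is stuck in.  This is not Nico's test: the helper names and the
offline-then-online loop are hypothetical, and the root parameter exists
only so the helpers can be pointed at a scratch directory instead of the
real /sys/devices/system/cpu.

```python
import os

# Standard sysfs location for per-CPU control files on Linux.
SYSFS_CPU_ROOT = "/sys/devices/system/cpu"


def cpu_online_path(cpu, root=SYSFS_CPU_ROOT):
    """Return the path of the 'online' control file for a CPU number."""
    return os.path.join(root, "cpu%d" % cpu, "online")


def set_cpu_online(cpu, online, root=SYSFS_CPU_ROOT):
    """Write "1" (online) or "0" (offline) to cpuN/online.

    On a real system this needs root, and the write only returns once
    the kernel has finished the hotplug operation.
    """
    with open(cpu_online_path(cpu, root), "w") as f:
        f.write("1" if online else "0")


def offline_then_online(cpus, root=SYSFS_CPU_ROOT):
    """Hypothetical test loop: take each CPU down, then bring each back."""
    for cpu in cpus:
        set_cpu_online(cpu, False, root)
    for cpu in cpus:
        set_cpu_online(cpu, True, root)
```

On an affected kernel, a call such as set_cpu_online(1, False) would be
expected to block in the same place as the bash trace above if the stall
hits.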
I'm not sure exactly what this code is trying to do, yet.  (RCU, RC-who?)
Cheers
---Dave
