On Thu, Nov 21, 2013 at 06:48:32PM +0000, Lorenzo Pieralisi wrote:
On Thu, Nov 21, 2013 at 03:10:58PM +0000, Jon Medhurst (Tixy) wrote:
On Thu, 2013-11-21 at 00:09 -0500, Nicolas Pitre wrote:
I've been banging my head on this one for quite a while now. The problem is that there is very little debug output available. Could you see if you get the same?
I get the same, with the same kernel version. If I disable the cpufreq configs, then I get another (possibly different) crash, which has a backtrace.
Same here. Reverting this commit does the trick for me, but that's more a symptom than a cure. I still can't pinpoint what the problem is, but it's pretty easy to trigger:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/kerne...
This doesn't fix it for me, but my kernel tree has a few extra patches so there might be some other inconsistency somewhere. Mind you, except for the TC2 mcpm power_down_finish() patch I don't _think_ I have anything relevant to these symptoms...
What I'm seeing is a stall in kernel/cpu.c:cpu_down()->synchronize_sched()->wait_rcu_gp() with cpu_hotplug.lock held.
This doesn't kill the system, but any thread that tries to perform CPU hotplug gets stuck in uninterruptible sleep trying to take cpu_hotplug.lock:
kworker/1:2     D 80424dfc     0  2366      2 0x00000000
Workqueue: events cpuset_hotplug_workfn
[<80424dfc>] (__schedule+0x218/0x5e0) from [<80425588>] (schedule_preempt_disabled+0xc/0x10)
[<80425588>] (schedule_preempt_disabled+0xc/0x10) from [<804278b4>] (mutex_lock_nested+0x1a0/0x38c)
[<804278b4>] (mutex_lock_nested+0x1a0/0x38c) from [<80023a40>] (get_online_cpus+0x30/0x4c)
[<80023a40>] (get_online_cpus+0x30/0x4c) from [<8008b030>] (rebuild_sched_domains_locked+0x1c/0x458)
[<8008b030>] (rebuild_sched_domains_locked+0x1c/0x458) from [<8008cba4>] (rebuild_sched_domains+0x1c/0x28)
[<8008cba4>] (rebuild_sched_domains+0x1c/0x28) from [<8008cdbc>] (cpuset_hotplug_workfn+0x20c/0x534)
[<8008cdbc>] (cpuset_hotplug_workfn+0x20c/0x534) from [<8003b434>] (process_one_work+0x1b0/0x4d0)
[<8003b434>] (process_one_work+0x1b0/0x4d0) from [<8003bb50>] (worker_thread+0x138/0x3c0)
[<8003bb50>] (worker_thread+0x138/0x3c0) from [<80041dac>] (kthread+0xc4/0xe0)
[<80041dac>] (kthread+0xc4/0xe0) from [<8000e2e8>] (ret_from_fork+0x14/0x2c)

bash            D 80424dfc     0  2386   2385 0x00000000
[<80424dfc>] (__schedule+0x218/0x5e0) from [<804246b8>] (schedule_timeout+0x120/0x1bc)
[<804246b8>] (schedule_timeout+0x120/0x1bc) from [<80425a0c>] (wait_for_common+0xa8/0x14c)
[<80425a0c>] (wait_for_common+0xa8/0x14c) from [<8006bef8>] (wait_rcu_gp+0x44/0x4c)
[<8006bef8>] (wait_rcu_gp+0x44/0x4c) from [<8041f068>] (_cpu_down+0x88/0x230)
[<8041f068>] (_cpu_down+0x88/0x230) from [<8041f238>] (cpu_down+0x28/0x3c)
[<8041f238>] (cpu_down+0x28/0x3c) from [<80285de0>] (device_offline+0x8c/0xb4)
[<80285de0>] (device_offline+0x8c/0xb4) from [<80285ed8>] (online_store+0x44/0x6c)
[<80285ed8>] (online_store+0x44/0x6c) from [<80283ec0>] (dev_attr_store+0x18/0x24)
[<80283ec0>] (dev_attr_store+0x18/0x24) from [<80149d5c>] (sysfs_write_file+0x1a4/0x1d0)
[<80149d5c>] (sysfs_write_file+0x1a4/0x1d0) from [<800f02a0>] (vfs_write+0xb4/0x17c)
[<800f02a0>] (vfs_write+0xb4/0x17c) from [<800f0628>] (SyS_write+0x40/0x68)
[<800f0628>] (SyS_write+0x40/0x68) from [<8000e220>] (ret_fast_syscall+0x0/0x48)
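
In case it helps anyone check for the same symptom, here is a minimal sketch (not part of anyone's test setup in this thread) that lists tasks sitting in uninterruptible sleep by reading the state field of /proc/<pid>/stat. The names task_state() and blocked_tasks() are made up for illustration, and it only looks at thread group leaders under /proc, not at every thread.

```python
import os


def task_state(stat_line):
    """Parse one /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx])
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return (pid, comm) for each task currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = task_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited mid-scan or the line was unparsable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If the stall has hit, I would expect both the shell doing the hotplug and the kworker to show up in that list and stay there across repeated runs.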
The stall almost always happens when hotplugging CPUs off in the third inner loop of Nico's test, on the first iteration of the outer loop.
I'm not sure exactly what this code is trying to do, yet. (RCU, RC-who?)
Cheers
---Dave