On Thu, 27 Sep 2018 12:46:01 -0700 Daniel Wang <wonderfly@google.com> wrote:
Prior to this change, the combination of `softlockup_panic=1` and `softlockup_all_cpu_backtrace=1` may result in a deadlock when the reboot path tries to grab the console lock that is held by the stack trace printing path. What seems to be happening is that while there are multiple CPUs, only one of them is tasked with printing the backtraces of all CPUs. On a machine with many CPUs and a slow serial console (on Google Compute Engine, for example), the stack trace printing routine hits a timeout and the reboot path kicks in. The latter then tries to print something else, but can't get the lock because it is still held by the earlier printing path. This is easily reproducible on a VM with 16+ vCPUs on Google Compute Engine, which is a very common configuration.
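To make the lock dependency concrete, here is a minimal userspace model of the contention. This is not kernel code; the thread names, sleeps, and messages are illustrative assumptions. One thread stands in for the CPU that holds the console lock while slowly emitting all-CPU backtraces over a slow serial console; the other stands in for the panic/reboot path that also needs the console lock and blocks on it forever.

```c
/* Hypothetical model of the reported deadlock -- not actual kernel code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t console_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models the CPU printing all-CPU backtraces while holding the console lock. */
static void *backtrace_printer(void *arg)
{
	pthread_mutex_lock(&console_lock);
	for (;;) {
		/* Slow serial console: output never finishes in time. */
		fprintf(stderr, "CPU backtrace line...\n");
		sleep(1);
	}
	/* The lock is never released before the panic path runs. */
	return NULL;
}

/* Models the panic/reboot path that also wants to print to the console. */
static void *reboot_path(void *arg)
{
	sleep(3); /* stand-in for the softlockup_panic timeout */
	fprintf(stderr, "panic: waiting for console lock\n");
	pthread_mutex_lock(&console_lock); /* blocks forever: deadlock */
	fprintf(stderr, "rebooting\n");    /* never reached */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, backtrace_printer, NULL);
	pthread_create(&b, NULL, reboot_path, NULL);
	pthread_join(b, NULL); /* hangs, mirroring the reported lockup */
	return 0;
}
```

Built with `gcc -pthread`, the second thread never acquires the lock and the program hangs, which is the shape of the hang described above.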
A quick repro is available at https://github.com/wonderfly/printk-deadlock-repro. The system hangs about 3 seconds into executing repro.sh. Credit for both the deadlock analysis and the repro goes to Peter Feiner.
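For reference, a soft lockup of the kind the repro depends on can be triggered with a trivial kernel module that spins with preemption disabled. This is only a sketch of the general technique under the assumption that `kernel.softlockup_panic` and `kernel.softlockup_all_cpu_backtrace` have already been set to 1 via sysctl; it is not the contents of the linked repro.sh.

```c
/*
 * Hypothetical minimal soft-lockup trigger -- a sketch, not repro.sh.
 * Loading this module pins the loading CPU in a busy loop with
 * preemption disabled, so the soft lockup watchdog eventually fires.
 */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/delay.h>
#include <linux/preempt.h>

static int __init lockup_init(void)
{
	preempt_disable();   /* prevent the scheduler from rescuing this CPU */
	for (;;)
		mdelay(100); /* busy-wait without ever scheduling */
	return 0;            /* never reached */
}

module_init(lockup_init);
MODULE_LICENSE("GPL");
```

Once the watchdog threshold elapses, the all-CPU backtrace printing and the panic path contend on the console lock as described above.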
Note that I have read previous discussions on backporting this to stable [1]. The argument against the backport was that this is a non-trivial fix and was meant to prevent hypothetical soft lockups. What we are hitting, however, is a real deadlock in production. Hence this request.
[1] https://lore.kernel.org/lkml/20180409081535.dq7p5bfnpvd3xk3t@pathway.suse.cz...
Serial console logs leading up to the deadlock follow. As can be seen, the stack trace is incomplete because the printing path hit a timeout.
I'm fine with having this backported.
-- Steve