On Thu, 27 Sep 2018 12:46:01 -0700 Daniel Wang <wonderfly@google.com> wrote:
Prior to this change, the combination of `softlockup_panic=1` and `softlockup_all_cpu_backtrace=1` may result in a deadlock when the reboot path tries to grab the console lock that is held by the stack trace printing path. What seems to be happening is that while there are multiple CPUs, only one of them is tasked with printing the backtraces of all CPUs. On a machine with many CPUs and a slow serial console (on Google Compute Engine, for example), the stack trace printing routine hits a timeout and the reboot path kicks in. The latter then tries to print something else, but can't get the lock because it is still held by the earlier printing path. This is easily reproducible on a VM with 16+ vCPUs on Google Compute Engine, which is a very common configuration.
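To make the lock dependency concrete, here is a minimal userspace model of the contention. This is not kernel code; the thread names, sleeps, and messages are illustrative assumptions. One thread stands in for the CPU that holds the console lock while slowly emitting all-CPU backtraces over a slow serial console; the other stands in for the panic/reboot path that also needs the console lock and blocks on it forever.

```c
/* Hypothetical model of the reported deadlock -- not actual kernel code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t console_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models the CPU printing all-CPU backtraces while holding the console lock. */
static void *backtrace_printer(void *arg)
{
	pthread_mutex_lock(&console_lock);
	for (;;) {
		/* Slow serial console: output never finishes in time. */
		fprintf(stderr, "CPU backtrace line...\n");
		sleep(1);
	}
	/* The lock is never released before the panic path runs. */
	return NULL;
}

/* Models the panic/reboot path that also wants to print to the console. */
static void *reboot_path(void *arg)
{
	sleep(3); /* stand-in for the softlockup_panic timeout */
	fprintf(stderr, "panic: waiting for console lock\n");
	pthread_mutex_lock(&console_lock); /* blocks forever: deadlock */
	fprintf(stderr, "rebooting\n");    /* never reached */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, backtrace_printer, NULL);
	pthread_create(&b, NULL, reboot_path, NULL);
	pthread_join(b, NULL); /* hangs, mirroring the reported lockup */
	return 0;
}
```

Built with `gcc -pthread`, the second thread never acquires the lock and the program hangs, which is the shape of the hang described above.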
A quick repro is available at https://github.com/wonderfly/printk-deadlock-repro. The system hangs about 3 seconds into executing repro.sh. Credit for both the deadlock analysis and the repro goes to Peter Feiner.
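For reference, a soft lockup of the kind the repro depends on can be triggered with a trivial kernel module that spins with preemption disabled. This is only a sketch of the general technique under the assumption that `kernel.softlockup_panic` and `kernel.softlockup_all_cpu_backtrace` have already been set to 1 via sysctl; it is not the contents of the linked repro.sh.

```c
/*
 * Hypothetical minimal soft-lockup trigger -- a sketch, not repro.sh.
 * Loading this module pins the loading CPU in a busy loop with
 * preemption disabled, so the soft lockup watchdog eventually fires.
 */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/delay.h>
#include <linux/preempt.h>

static int __init lockup_init(void)
{
	preempt_disable();   /* prevent the scheduler from rescuing this CPU */
	for (;;)
		mdelay(100); /* busy-wait without ever scheduling */
	return 0;            /* never reached */
}

module_init(lockup_init);
MODULE_LICENSE("GPL");
```

Once the watchdog threshold elapses, the all-CPU backtrace printing and the panic path contend on the console lock as described above.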
Note that I have read previous discussions on backporting this to stable [1]. The argument against the backport was that this is a non-trivial fix and was meant to prevent hypothetical soft lockups. What we are hitting, however, is a real deadlock in production. Hence this request.
[1] https://lore.kernel.org/lkml/20180409081535.dq7p5bfnpvd3xk3t@pathway.suse.cz...
Serial console logs leading up to the deadlock follow. As can be seen, the stack trace is incomplete because the printing path hit a timeout.
I'm fine with having this backported.
-- Steve