On Thu 2018-10-04 16:44:42, Sergey Senozhatsky wrote:
On (10/03/18 11:37), Daniel Wang wrote:
When `softlockup_panic` is set (which is what my original repro had and what we use in production), without the backport patch, the expected panic would hit a seemingly deadlock. So even when the machine is configured to reboot immediately after the panic (kernel.panic=-1), it just hangs there with an incomplete backtrace. With your patch, the deadlock doesn't happen and the machine reboots successfully.
This was and still is the issue this thread is trying to fix. The last log snippet was from an "experiment" that I did in order to understand what's really happening. So far the speculation has been that the panic path was trying to get a lock held by a backtrace dumping thread, but there is not enough evidence which thread is holding the lock and how it uses it. So I set `softlockup_panic` to 0, to get panic out of the equation. Then I saw that one CPU was indeed holding the console lock, trying to write something out. If the panic was to hit while it's doing that, we might get a deadlock.
Hmm, console_sem state is ignored when we flush logbuf, so it's OK to have it locked when we declare panic():
void console_flush_on_panic(void) { /* * If someone else is holding the console lock, trylock will fail * and may_schedule may be set. Ignore and proceed to unlock so * that messages are flushed out. As this can be called from any * context and we don't want to get preempted while flushing, * ensure may_schedule is cleared. */ console_trylock(); console_may_schedule = 0; console_unlock(); }
Things are not so simple with uart_port lock. Generally speaking we should deadlock when we NMI panic() kills the system while one of the CPUs holds uart_port lock.
This looks like a reasonable explanation of what is happening here. It also explains why the console owner logic helped.
8250 has sort of a workaround for this scenario:
serial8250_console_write() { if (port->sysrq) locked = 0; else if (oops_in_progress) locked = spin_trylock_irqsave(&port->lock, flags); else spin_lock_irqsave(&port->lock, flags);
... uart_console_write(port, s, count, serial8250_console_putchar); ...
if (locked) spin_unlock_irqrestore(&port->lock, flags); }
Now... the problem. A theory, in fact. panic() sets oops_in_progress back to zero - bust_spinlocks(0) - too soon.
I see your point. I am just a bit scared of this way. Ignoring locks is a dangerous and painful approach in general.
Best Regards, Petr