On Wed, 13 Jul 2022 18:20:09 +0800 Zheng Yejian <zhengyejian1@huawei.com> wrote:
This patch and problem analysis is based on v4.19 LTS, but v5.4 LTS and below seem to be involved.
Hulk Robot reports a softlockup problem, see following logs:
[ 41.463870] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksoftirqd/0:9]
This detects something that is spinning with preemption disabled but interrupts enabled.
Look into above call stack, there is a recursive call in 'ftrace_graph_call', and the direct cause of above recursion is that 'rcu_dynticks_curr_cpu_in_eqs' is traced, see following snippet:
  __read_once_size_nocheck.constprop.0
    ftrace_graph_call    <-- 1. first call
    ......
    rcu_dynticks_curr_cpu_in_eqs
      ftrace_graph_call  <-- 2. recursive call here!!!
This is not the bug. That code can handle a recursion:
ftrace_graph_call is an assembly stub that ends up calling:
void prepare_ftrace_return(unsigned long ip, unsigned long *parent,
			   unsigned long frame_pointer)
{
[..]

	bit = ftrace_test_recursion_trylock(ip, *parent);
	if (bit < 0)
		return;
This will stop the code as "bit" will be < 0 on the second call to ftrace_graph_call. If it was a real recursion issue, it would crash the machine when the recursion runs out of stack space.
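As a rough illustration of why such a guard stops the second entry, here is a minimal user-space sketch. It is not the kernel implementation: every name in it (model_recursion_trylock, model_trace_handler, traced_helper, and so on) is hypothetical, and it models a single context bit, whereas the real recursion protection tracks separate bits per context.

```c
/* Hypothetical user-space model of a trace-handler recursion guard. */

unsigned long recursion_bits;	/* stand-in for the per-task recursion state */
int handler_runs;		/* how many times the handler body actually ran */

#define MODEL_CTX_BIT 0		/* single context bit, an assumption of this sketch */

/* Returns the bit on success, or -1 if we are already inside the handler. */
int model_recursion_trylock(void)
{
	if (recursion_bits & (1UL << MODEL_CTX_BIT))
		return -1;
	recursion_bits |= 1UL << MODEL_CTX_BIT;
	return MODEL_CTX_BIT;
}

void model_recursion_unlock(int bit)
{
	recursion_bits &= ~(1UL << bit);
}

void traced_helper(void);

/* Plays the role of the handler that the trace stub calls. */
void model_trace_handler(void)
{
	int bit = model_recursion_trylock();

	if (bit < 0)
		return;		/* second entry bails out here */

	handler_runs++;
	traced_helper();	/* a helper that is itself traced */
	model_recursion_unlock(bit);
}

/* A traced function: entering it invokes the handler again. */
void traced_helper(void)
{
	model_trace_handler();
}
```

Calling model_trace_handler() once runs the handler body a single time: the nested call made from traced_helper() finds the bit already set and returns immediately, so the recursion is only one level deep and cannot spin or exhaust the stack.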
Comparing with the mainline kernel, commit ff5c4f5cad33 ("rcu/tree: Mark the idle relevant functions noinstr") marks the related functions as 'noinstr', which implies notrace and noinline and sticks them in the .noinstr.text section.
Link: https://lore.kernel.org/all/20200416114706.625340212@infradead.org/
But we cannot directly backport that commit, because there seem to be many prerequisite patches. Instead, marking the functions as 'notrace' where they are 'noinstr' in that commit, and marking 'rcu_dynticks_curr_cpu_in_eqs' as inline, looks like it resolves the problem.
That will not fix your problem.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Can you reproduce this consistently without this patch, and then not so with this patch?
Or are you just assuming that this fixes a bug because you observed a recursion?
Please explain to me why this would cause the hang?
-- Steve