Re: [PATCH v2 RESEND] tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem

26 Jan 2023


On Thu, Jan 26, 2023 at 10:01 AM Zhouyi Zhou zhouzhouyi@gmail.com wrote:
...
On Thu, Jan 26, 2023 at 12:16 PM Zhouyi Zhou zhouzhouyi@gmail.com wrote:
...
On Wed, Jan 25, 2023 at 8:13 AM Zhouyi Zhou zhouzhouyi@gmail.com wrote:
...
On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney paulmck@kernel.org wrote:
...
On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote:
...
For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined.
However, cpu_is_hotpluggable() still returns true for those CPUs. This causes
torture tests that do offlining to end up trying to offline this CPU causing
test failures. Such failure happens on all architectures.
Fix it by asking the opinion of the nohz subsystem on whether the CPU can
be hotplugged.
[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]
For drivers/base/ portion:
Acked-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Cc: Frederic Weisbecker frederic@kernel.org
Cc: "Paul E. McKenney" paulmck@kernel.org
Cc: Zhouyi Zhou zhouzhouyi@gmail.com
Cc: Will Deacon will@kernel.org
Cc: Marc Zyngier maz@kernel.org
Cc: rcu rcu@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel")
Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org
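The check described in the commit message above can be modeled in user space roughly as follows. This is only a sketch under stated assumptions, not the patch itself: the variables here are plain stand-ins for kernel state, and the names nohz_allows_hotplug(), cpu_is_hotpluggable_model(), arch_hotpluggable[] and NR_CPUS (as a fixed value of 4) are hypothetical, chosen for illustration.

```c
#include <stdbool.h>

#define NR_CPUS 4 /* assumption: a small fixed CPU count for the model */

/* Stand-ins for kernel state; in the kernel this belongs to the tick code. */
static bool tick_nohz_full_running = true; /* assumption: NO_HZ_FULL is active */
static int tick_do_timer_cpu = 0;          /* assumption: CPU 0 holds the timer duty */

/* Stand-in for the per-CPU "hotpluggable" flag reported by the driver core. */
static bool arch_hotpluggable[NR_CPUS] = { true, true, true, true };

/*
 * Hypothetical helper: the nohz subsystem's opinion on whether a CPU may
 * be offlined. The CPU that currently owns the timer duty must stay online
 * while full dynticks mode is running.
 */
static bool nohz_allows_hotplug(unsigned int cpu)
{
	if (tick_nohz_full_running && tick_do_timer_cpu == (int)cpu)
		return false;
	return true;
}

/*
 * Model of the fixed check: a CPU is reported as hotpluggable only if both
 * the driver-core flag and the nohz subsystem agree.
 */
static bool cpu_is_hotpluggable_model(unsigned int cpu)
{
	if (cpu >= NR_CPUS)
		return false;
	return arch_hotpluggable[cpu] && nohz_allows_hotplug(cpu);
}
```

With this model, a torture test that filters candidate CPUs through cpu_is_hotpluggable_model() would skip CPU 0 while full dynticks mode is running, instead of trying to offline it and failing.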
Queued for further review and testing, thank you both!
It might be a few hours until it becomes publicly visible, but it will
get there.
A new round of rcutorture testing on the fixed linux-5.15.y [3] has been
run in the PPC VM at the Open Source Lab of Oregon State University
[1]; a run takes about 29 hours. The test result on the original
linux-5.15.y is at [2].
From the result of [1], the HOTPLUG failure reports have been
eliminated, but a new kernel NULL pointer dereference bug related to the
SCSI module has been reported [4] ;-(
[    5.178733][    C1] BUG: Kernel NULL pointer dereference on read at 0x00000008
...
[    5.231013][    C1] [c00000001ff9fca0] [c0000000009ffbc8] scsi_end_request+0xd8/0x1f0 (unreliable)
[    5.234961][    C1] [c00000001ff9fcf0] [c000000000a00e68] scsi_io_completion+0x88/0x700
[    5.237863][    C1] [c00000001ff9fda0] [c0000000009f5028] scsi_finish_command+0xe8/0x150
[    5.240089][    C1] [c00000001ff9fdf0] [c000000000a00c70] scsi_complete+0x90/0x140
[    5.242481][    C1] [c00000001ff9fe20] [c0000000007e5170] blk_complete_reqs+0x80/0xa0
[    5.245187][    C1] [c00000001ff9fe50] [c000000000f0b5d0] __do_softirq+0x1e0/0x4e0
[    5.248479][    C1] [c00000001ff9ff90] [c0000000000170e8] do_softirq_own_stack+0x48/0x60
[    5.250919][    C1] [c00000000a5e7c40] [c00000000a5e7c80] 0xc00000000a5e7c80
[    5.253792][    C1] [c00000000a5e7c70] [c0000000001534c0] do_softirq+0xb0/0xc0
[    5.256824][    C1] [c00000000a5e7ca0] [c0000000001535ac] __local_bh_enable_ip+0xdc/0x110
[    5.259414][    C1] [c00000000a5e7cc0] [c0000000001d75e8] irq_forced_thread_fn+0xc8/0xf0
[    5.261921][    C1] [c00000000a5e7d00] [c0000000001d7ae4] irq_thread+0x1b4/0x2a0
[    5.265298][    C1] [c00000000a5e7da0] [c00000000017d8c8] kthread+0x1a8/0x1d0
[    5.269184][    C1] [c00000000a5e7e10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64
But when I invoked [5] by hand, the bug did not show up again [6].
[4] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re...
[5] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re...
[6] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re...
I think the bug is not caused by Joel's patch; there must be some
other reason. I am starting the 29-hour rcutorture test again, and I
can give you ssh access if you are interested.
Sorry for the inconvenience.
Thanks
Zhouyi
Hi, the above kernel NULL pointer dereference bug has nothing to do
with Joel's fix, because I can reproduce it on the original 5.15.y [7]
using a while loop [8] (after 36 iterations, the bug fires).
So Joel's patch tests good!
Tested-by: Zhouyi Zhou zhouzhouyi@gmail.com
Interesting. I have been running rcutorture's TREE03 on 5.15.y quite a
lot and I don't see such an issue.
However, your logs show that the crash is SCSI related. These are the
recent SCSI commits in 5.15.y, but I am not sure whether any of them
causes the issue:
13259b6 scsi: mpi3mr: Refer CONFIG_SCSI_MPI3MR in Makefile
513fdf0 scsi: ufs: Stop using the clock scaling lock in the error handler
7c26d21 scsi: ufs: core: WLUN suspend SSU/enter hibern
Perhaps report it to the scsi and/or stable lists, or do some web
searches to check whether someone else has seen it.
- Joel
