On Thu, Jan 26, 2023 at 10:01 AM Zhouyi Zhou zhouzhouyi@gmail.com wrote:
On Thu, Jan 26, 2023 at 12:16 PM Zhouyi Zhou zhouzhouyi@gmail.com wrote:
On Wed, Jan 25, 2023 at 8:13 AM Zhouyi Zhou zhouzhouyi@gmail.com wrote:
On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney paulmck@kernel.org wrote:
On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote:
For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. However, cpu_is_hotpluggable() still returns true for those CPUs. This causes torture tests that do offlining to end up trying to offline this CPU causing test failures. Such failure happens on all architectures.
Fix it by asking the opinion of the nohz subsystem on whether the CPU can be hotplugged.
[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]
For drivers/base/ portion: Acked-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Cc: Frederic Weisbecker frederic@kernel.org
Cc: "Paul E. McKenney" paulmck@kernel.org
Cc: Zhouyi Zhou zhouzhouyi@gmail.com
Cc: Will Deacon will@kernel.org
Cc: Marc Zyngier maz@kernel.org
Cc: rcu rcu@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel")
Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org
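The check described in the commit message can be sketched as a small userspace model in C. This is an illustration under stated assumptions, not the patch itself: the model_ names, the fixed CPU count, and the per-CPU flag array are hypothetical stand-ins. Only tick_do_timer_cpu and cpu_is_hotpluggable() are named in the message above.

```c
#include <stdbool.h>

/*
 * Userspace model only. Every model_* identifier is hypothetical and
 * exists purely for illustration; none of this is kernel code.
 */
#define MODEL_NR_CPUS 4

/* Stand-in for "nohz_full is active on this system". */
static bool model_nohz_full_running;

/* Stand-in for tick_do_timer_cpu: the CPU that owns timekeeping duty. */
static int model_tick_do_timer_cpu;

/* Stand-in for the per-CPU "hotpluggable" attribute set by the arch code. */
static bool model_hotpluggable[MODEL_NR_CPUS] = { true, true, true, true };

/* The nohz subsystem's opinion: the timekeeping CPU must stay online. */
static bool model_nohz_allows_hotplug(unsigned int cpu)
{
	if (model_nohz_full_running && model_tick_do_timer_cpu == (int)cpu)
		return false;
	return true;
}

/*
 * Model of the fixed query: a CPU counts as hotpluggable only if the
 * per-CPU attribute says so AND the nohz subsystem does not object.
 */
static bool model_cpu_is_hotpluggable(unsigned int cpu)
{
	if (cpu >= MODEL_NR_CPUS)
		return false;
	return model_hotpluggable[cpu] && model_nohz_allows_hotplug(cpu);
}
```

With model_nohz_full_running set to true and model_tick_do_timer_cpu equal to 0, model_cpu_is_hotpluggable(0) returns false while model_cpu_is_hotpluggable(1) still returns true, so a torture test that consults this query would no longer pick CPU 0 for offlining.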
Queued for further review and testing, thank you both!
It might be a few hours until it becomes publicly visible, but it will get there.
A new round of rcutorture tests on the fixed linux-5.15.y [3] has been performed in the PPC VM of the Open Source Lab of Oregon State University [1]; the full run lasts about 29 hours. The test result on the original linux-5.15.y is at [2].
From the result of [1], the HOTPLUG failure reports have been eliminated, but a new kernel NULL pointer bug related to the scsi module has been reported [4] ;-(
[ 5.178733][ C1] BUG: Kernel NULL pointer dereference on read at 0x00000008
...
[ 5.231013][ C1] [c00000001ff9fca0] [c0000000009ffbc8] scsi_end_request+0xd8/0x1f0 (unreliable)
[ 5.234961][ C1] [c00000001ff9fcf0] [c000000000a00e68] scsi_io_completion+0x88/0x700
[ 5.237863][ C1] [c00000001ff9fda0] [c0000000009f5028] scsi_finish_command+0xe8/0x150
[ 5.240089][ C1] [c00000001ff9fdf0] [c000000000a00c70] scsi_complete+0x90/0x140
[ 5.242481][ C1] [c00000001ff9fe20] [c0000000007e5170] blk_complete_reqs+0x80/0xa0
[ 5.245187][ C1] [c00000001ff9fe50] [c000000000f0b5d0] __do_softirq+0x1e0/0x4e0
[ 5.248479][ C1] [c00000001ff9ff90] [c0000000000170e8] do_softirq_own_stack+0x48/0x60
[ 5.250919][ C1] [c00000000a5e7c40] [c00000000a5e7c80] 0xc00000000a5e7c80
[ 5.253792][ C1] [c00000000a5e7c70] [c0000000001534c0] do_softirq+0xb0/0xc0
[ 5.256824][ C1] [c00000000a5e7ca0] [c0000000001535ac] __local_bh_enable_ip+0xdc/0x110
[ 5.259414][ C1] [c00000000a5e7cc0] [c0000000001d75e8] irq_forced_thread_fn+0xc8/0xf0
[ 5.261921][ C1] [c00000000a5e7d00] [c0000000001d7ae4] irq_thread+0x1b4/0x2a0
[ 5.265298][ C1] [c00000000a5e7da0] [c00000000017d8c8] kthread+0x1a8/0x1d0
[ 5.269184][ C1] [c00000000a5e7e10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64
But when I invoked [5] by hand, the bug did not show again [6].
[4] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re...
[5] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re...
[6] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re...
I do not think the bug is caused by Joel's patch; there must be some other reason. I am starting the 29-hour rcutorture test again, and I can give you ssh access if you are interested.
Sorry for the inconvenience I have caused.
Thanks, Zhouyi
Hi, the above kernel NULL pointer dereference bug has nothing to do with Joel's fix, because I can reproduce it on the original 5.15.y [7] using a while loop [8] (the bug fires after 36 iterations). So Joel's patch tests good!
Tested-by: Zhouyi Zhou zhouzhouyi@gmail.com
Interesting. I have been running rcutorture's TREE03 on 5.15.y quite a lot and I don't see such an issue.
However, your logs show that the crash is SCSI related. These are the recent SCSI commits in 5.15.y, but I am not sure whether any of them causes the issue:
13259b6 scsi: mpi3mr: Refer CONFIG_SCSI_MPI3MR in Makefile
513fdf0 scsi: ufs: Stop using the clock scaling lock in the error handler
7c26d21 scsi: ufs: core: WLUN suspend SSU/enter hibern
Perhaps report it to the scsi and/or stable lists, or search the web to see whether anyone else has seen it.
- Joel