[PATCH v2 RESEND] tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem

List overview All Threads
Download

newer

older

[PATCH 5.10 0/1] i40e: Add...

Chasing a 'use after free' bug of...

Joel Fernandes (Google)

24 Jan 2023 24 Jan '23

5:31 p.m.

For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. However, cpu_is_hotpluggable() still returns true for those CPUs. This causes torture tests that do offlining to end up trying to offline this CPU causing test failures. Such failure happens on all architectures.

Fix it by asking the opinion of the nohz subsystem on whether the CPU can be hotplugged.

[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]

For drivers/base/ portion: Acked-by: Greg Kroah-Hartman gregkh@linuxfoundation.org

Cc: Frederic Weisbecker frederic@kernel.org Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Zhouyi Zhou zhouzhouyi@gmail.com Cc: Will Deacon will@kernel.org Cc: Marc Zyngier maz@kernel.org Cc: rcu rcu@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org --- Sorry, resending with CC to stable.

drivers/base/cpu.c | 3 ++- include/linux/tick.h | 2 ++ kernel/time/tick-sched.c | 11 ++++++++--- 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 55405ebf23ab..450dca235a2f 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = { bool cpu_is_hotpluggable(unsigned int cpu) { struct device *dev = get_cpu_device(cpu); - return dev && container_of(dev, struct cpu, dev)->hotpluggable; + return dev && container_of(dev, struct cpu, dev)->hotpluggable + && tick_nohz_cpu_hotpluggable(cpu); } EXPORT_SYMBOL_GPL(cpu_is_hotpluggable);

diff --git a/include/linux/tick.h b/include/linux/tick.h index bfd571f18cfd..9459fef5b857 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk, enum tick_dep_bits bit); extern void tick_nohz_dep_clear_signal(struct signal_struct *signal, enum tick_dep_bits bit); +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu);

/* * The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases @@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }

static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; }

static inline void tick_dep_set(enum tick_dep_bits bit) { } static inline void tick_dep_clear(enum tick_dep_bits bit) { } diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9c6f661fb436..63e3e8ebcd64 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -510,7 +510,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) tick_nohz_full_running = true; }

-static int tick_nohz_cpu_down(unsigned int cpu) +bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { /* * The tick_do_timer_cpu CPU handles housekeeping duty (unbound @@ -518,8 +518,13 @@ static int tick_nohz_cpu_down(unsigned int cpu) * CPUs. It must remain online when nohz full is enabled. */ if (tick_nohz_full_running && tick_do_timer_cpu == cpu) - return -EBUSY; - return 0; + return false; + return true; +} + +static int tick_nohz_cpu_down(unsigned int cpu) +{ + return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY; }

void __init tick_nohz_init(void)

-- 2.39.1.405.gd4c25cc71f-goog

Show replies by date

Paul E. McKenney

24 Jan 24 Jan

10:56 p.m.

On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote:

...

For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. However, cpu_is_hotpluggable() still returns true for those CPUs. This causes torture tests that do offlining to end up trying to offline this CPU causing test failures. Such failure happens on all architectures.

Fix it by asking the opinion of the nohz subsystem on whether the CPU can be hotplugged.

[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]

For drivers/base/ portion: Acked-by: Greg Kroah-Hartman gregkh@linuxfoundation.org

Cc: Frederic Weisbecker frederic@kernel.org Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Zhouyi Zhou zhouzhouyi@gmail.com Cc: Will Deacon will@kernel.org Cc: Marc Zyngier maz@kernel.org Cc: rcu rcu@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org

Queued for further review and testing, thank you both!

It might be a few hours until it becomes publicly visible, but it will get there.

Thanx, Paul

...

Sorry, resending with CC to stable.

drivers/base/cpu.c | 3 ++- include/linux/tick.h | 2 ++ kernel/time/tick-sched.c | 11 ++++++++--- 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 55405ebf23ab..450dca235a2f 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = { bool cpu_is_hotpluggable(unsigned int cpu) { struct device *dev = get_cpu_device(cpu);

return dev && container_of(dev, struct cpu, dev)->hotpluggable;
return dev && container_of(dev, struct cpu, dev)->hotpluggable
&& tick_nohz_cpu_hotpluggable(cpu);
} EXPORT_SYMBOL_GPL(cpu_is_hotpluggable); diff --git a/include/linux/tick.h b/include/linux/tick.h index bfd571f18cfd..9459fef5b857 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk, enum tick_dep_bits bit); extern void tick_nohz_dep_clear_signal(struct signal_struct *signal, enum tick_dep_bits bit); +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu); /*

The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases

@@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { } static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; } static inline void tick_dep_set(enum tick_dep_bits bit) { } static inline void tick_dep_clear(enum tick_dep_bits bit) { } diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9c6f661fb436..63e3e8ebcd64 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -510,7 +510,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) tick_nohz_full_running = true; } -static int tick_nohz_cpu_down(unsigned int cpu) +bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { /* * The tick_do_timer_cpu CPU handles housekeeping duty (unbound @@ -518,8 +518,13 @@ static int tick_nohz_cpu_down(unsigned int cpu) * CPUs. It must remain online when nohz full is enabled. */ if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
return -EBUSY;
return 0;
return false;
return true;
+}

+static int tick_nohz_cpu_down(unsigned int cpu) +{

return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY;

} void __init tick_nohz_init(void) -- 2.39.1.405.gd4c25cc71f-goog

Zhouyi Zhou

25 Jan 25 Jan

12:13 a.m.

On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney paulmck@kernel.org wrote:

...

On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote:

...
For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. However, cpu_is_hotpluggable() still returns true for those CPUs. This causes torture tests that do offlining to end up trying to offline this CPU causing test failures. Such failure happens on all architectures.

Fix it by asking the opinion of the nohz subsystem on whether the CPU can be hotplugged.

[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]

For drivers/base/ portion: Acked-by: Greg Kroah-Hartman gregkh@linuxfoundation.org

Cc: Frederic Weisbecker frederic@kernel.org Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Zhouyi Zhou zhouzhouyi@gmail.com Cc: Will Deacon will@kernel.org Cc: Marc Zyngier maz@kernel.org Cc: rcu rcu@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org

Queued for further review and testing, thank you both!

It might be a few hours until it becomes publicly visible, but it will get there.

A new round of rcutorture test on fixed linux-5.15.y[3] has been performed in the PPC VM of Open Source Lab of Oregon State University [1], which will last about 29 hours. The test result on original linux-5.15.y is at [2].

[1] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [2] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [3] http://140.211.169.189/linux-stable-rc/

Happy to benefit the community and this is a fruitful learning journey ;-)

Cheers Zhouyi

...

                                                    Thanx, Paul
...
Sorry, resending with CC to stable.

drivers/base/cpu.c | 3 ++- include/linux/tick.h | 2 ++ kernel/time/tick-sched.c | 11 ++++++++--- 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 55405ebf23ab..450dca235a2f 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = { bool cpu_is_hotpluggable(unsigned int cpu) { struct device *dev = get_cpu_device(cpu);
return dev && container_of(dev, struct cpu, dev)->hotpluggable;
return dev && container_of(dev, struct cpu, dev)->hotpluggable
        && tick_nohz_cpu_hotpluggable(cpu);
} EXPORT_SYMBOL_GPL(cpu_is_hotpluggable);

diff --git a/include/linux/tick.h b/include/linux/tick.h index bfd571f18cfd..9459fef5b857 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk, enum tick_dep_bits bit); extern void tick_nohz_dep_clear_signal(struct signal_struct *signal, enum tick_dep_bits bit); +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu);

/*

The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases

@@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }

static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; }

static inline void tick_dep_set(enum tick_dep_bits bit) { } static inline void tick_dep_clear(enum tick_dep_bits bit) { } diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9c6f661fb436..63e3e8ebcd64 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -510,7 +510,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) tick_nohz_full_running = true; }

-static int tick_nohz_cpu_down(unsigned int cpu) +bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { /* * The tick_do_timer_cpu CPU handles housekeeping duty (unbound @@ -518,8 +518,13 @@ static int tick_nohz_cpu_down(unsigned int cpu) * CPUs. It must remain online when nohz full is enabled. */ if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
        return -EBUSY;
return 0;
        return false;
return true;
+}

+static int tick_nohz_cpu_down(unsigned int cpu) +{
return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY;
}

void __init tick_nohz_init(void)

2.39.1.405.gd4c25cc71f-goog

Joel Fernandes

6:57 a.m.

...

On Jan 24, 2023, at 7:13 PM, Zhouyi Zhou zhouzhouyi@gmail.com wrote:

On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney paulmck@kernel.org wrote:

...
...
On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote: For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. However, cpu_is_hotpluggable() still returns true for those CPUs. This causes torture tests that do offlining to end up trying to offline this CPU causing test failures. Such failure happens on all architectures.

Fix it by asking the opinion of the nohz subsystem on whether the CPU can be hotplugged.

[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]

For drivers/base/ portion: Acked-by: Greg Kroah-Hartman gregkh@linuxfoundation.org

Cc: Frederic Weisbecker frederic@kernel.org Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Zhouyi Zhou zhouzhouyi@gmail.com Cc: Will Deacon will@kernel.org Cc: Marc Zyngier maz@kernel.org Cc: rcu rcu@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org

Queued for further review and testing, thank you both!

It might be a few hours until it becomes publicly visible, but it will get there.

A new round of rcutorture test on fixed linux-5.15.y[3] has been performed in the PPC VM of Open Source Lab of Oregon State University [1], which will last about 29 hours. The test result on original linux-5.15.y is at [2].

[1] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [2] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [3] http://140.211.169.189/linux-stable-rc/

Happy to benefit the community and this is a fruitful learning journey ;-)

Thanks a lot Zhouyi! After testing something your reply can also optionally include:

Tested-by: Zhouyi Zhou zhouzhoui@gmail.com

Thanks.

Have fun at the university. I miss my university days.

- Joel

...

Cheers Zhouyi

...
                                                   Thanx, Paul
...
Sorry, resending with CC to stable.

drivers/base/cpu.c | 3 ++- include/linux/tick.h | 2 ++ kernel/time/tick-sched.c | 11 ++++++++--- 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 55405ebf23ab..450dca235a2f 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = { bool cpu_is_hotpluggable(unsigned int cpu) { struct device *dev = get_cpu_device(cpu);
return dev && container_of(dev, struct cpu, dev)->hotpluggable;
return dev && container_of(dev, struct cpu, dev)->hotpluggable
        && tick_nohz_cpu_hotpluggable(cpu);
} EXPORT_SYMBOL_GPL(cpu_is_hotpluggable);

diff --git a/include/linux/tick.h b/include/linux/tick.h index bfd571f18cfd..9459fef5b857 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk, enum tick_dep_bits bit); extern void tick_nohz_dep_clear_signal(struct signal_struct *signal, enum tick_dep_bits bit); +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu);

/*

The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases

@@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }

static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; }

static inline void tick_dep_set(enum tick_dep_bits bit) { } static inline void tick_dep_clear(enum tick_dep_bits bit) { } diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9c6f661fb436..63e3e8ebcd64 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -510,7 +510,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) tick_nohz_full_running = true; }

-static int tick_nohz_cpu_down(unsigned int cpu) +bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { /* * The tick_do_timer_cpu CPU handles housekeeping duty (unbound @@ -518,8 +518,13 @@ static int tick_nohz_cpu_down(unsigned int cpu) * CPUs. It must remain online when nohz full is enabled. */ if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
        return -EBUSY;
return 0;
        return false;
return true;
+}

+static int tick_nohz_cpu_down(unsigned int cpu) +{
return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY;
}

void __init tick_nohz_init(void)

2.39.1.405.gd4c25cc71f-goog

Zhouyi Zhou

26 Jan 26 Jan

4:16 a.m.

On Wed, Jan 25, 2023 at 8:13 AM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...

On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney paulmck@kernel.org wrote:

...
On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote:

...
For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. However, cpu_is_hotpluggable() still returns true for those CPUs. This causes torture tests that do offlining to end up trying to offline this CPU causing test failures. Such failure happens on all architectures.

Fix it by asking the opinion of the nohz subsystem on whether the CPU can be hotplugged.

[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]

For drivers/base/ portion: Acked-by: Greg Kroah-Hartman gregkh@linuxfoundation.org

Cc: Frederic Weisbecker frederic@kernel.org Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Zhouyi Zhou zhouzhouyi@gmail.com Cc: Will Deacon will@kernel.org Cc: Marc Zyngier maz@kernel.org Cc: rcu rcu@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org

Queued for further review and testing, thank you both!

It might be a few hours until it becomes publicly visible, but it will get there.

A new round of rcutorture test on fixed linux-5.15.y[3] has been performed in the PPC VM of Open Source Lab of Oregon State University [1], which will last about 29 hours. The test result on original linux-5.15.y is at [2].

From the result of [1], the HOTPLUG failure reports have been eliminated, but a new kernel null point bug related to scsi module has been reported [4] ;-( [ 5.178733][ C1] BUG: Kernel NULL pointer dereference on read at 0x00000008 ... [ 5.231013][ C1] [c00000001ff9fca0] [c0000000009ffbc8] scsi_end_request+0xd8/0x1f0 (unreliable)^M [ 5.234961][ C1] [c00000001ff9fcf0] [c000000000a00e68] scsi_io_completion+0x88/0x700^M [ 5.237863][ C1] [c00000001ff9fda0] [c0000000009f5028] scsi_finish_command+0xe8/0x150^M [ 5.240089][ C1] [c00000001ff9fdf0] [c000000000a00c70] scsi_complete+0x90/0x140^M [ 5.242481][ C1] [c00000001ff9fe20] [c0000000007e5170] blk_complete_reqs+0x80/0xa0^M [ 5.245187][ C1] [c00000001ff9fe50] [c000000000f0b5d0] __do_softirq+0x1e0/0x4e0^M [ 5.248479][ C1] [c00000001ff9ff90] [c0000000000170e8] do_softirq_own_stack+0x48/0x60^M [ 5.250919][ C1] [c00000000a5e7c40] [c00000000a5e7c80] 0xc00000000a5e7c80^M [ 5.253792][ C1] [c00000000a5e7c70] [c0000000001534c0] do_softirq+0xb0/0xc0^M [ 5.256824][ C1] [c00000000a5e7ca0] [c0000000001535ac] __local_bh_enable_ip+0xdc/0x110^M [ 5.259414][ C1] [c00000000a5e7cc0] [c0000000001d75e8] irq_forced_thread_fn+0xc8/0xf0^M [ 5.261921][ C1] [c00000000a5e7d00] [c0000000001d7ae4] irq_thread+0x1b4/0x2a0^M [ 5.265298][ C1] [c00000000a5e7da0] [c00000000017d8c8] kthread+0x1a8/0x1d0^M [ 5.269184][ C1] [c00000000a5e7e10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64^M

But when I invoked [5] by hand, the bug did not show again [6].

[4] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [5] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [6] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re...

I think the bug is not caused by Joel's patch, there must be some other reason. I am starting the 29 hours' rcutorture test again. And I can give ssh access to you if you are interested in it.

Sorry for the inconvenience that I bring

Thanks Zhouyi

...

[1] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [2] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [3] http://140.211.169.189/linux-stable-rc/

Happy to benefit the community and this is a fruitful learning journey ;-)

Cheers Zhouyi

...
                                                    Thanx, Paul
...
Sorry, resending with CC to stable.

drivers/base/cpu.c | 3 ++- include/linux/tick.h | 2 ++ kernel/time/tick-sched.c | 11 ++++++++--- 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 55405ebf23ab..450dca235a2f 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = { bool cpu_is_hotpluggable(unsigned int cpu) { struct device *dev = get_cpu_device(cpu);
return dev && container_of(dev, struct cpu, dev)->hotpluggable;
return dev && container_of(dev, struct cpu, dev)->hotpluggable
        && tick_nohz_cpu_hotpluggable(cpu);
} EXPORT_SYMBOL_GPL(cpu_is_hotpluggable);

diff --git a/include/linux/tick.h b/include/linux/tick.h index bfd571f18cfd..9459fef5b857 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk, enum tick_dep_bits bit); extern void tick_nohz_dep_clear_signal(struct signal_struct *signal, enum tick_dep_bits bit); +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu);

/*

The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases

@@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }

static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; }

static inline void tick_dep_set(enum tick_dep_bits bit) { } static inline void tick_dep_clear(enum tick_dep_bits bit) { } diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9c6f661fb436..63e3e8ebcd64 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -510,7 +510,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) tick_nohz_full_running = true; }

-static int tick_nohz_cpu_down(unsigned int cpu) +bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { /* * The tick_do_timer_cpu CPU handles housekeeping duty (unbound @@ -518,8 +518,13 @@ static int tick_nohz_cpu_down(unsigned int cpu) * CPUs. It must remain online when nohz full is enabled. */ if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
        return -EBUSY;
return 0;
        return false;
return true;
+}

+static int tick_nohz_cpu_down(unsigned int cpu) +{
return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY;
}

void __init tick_nohz_init(void)

2.39.1.405.gd4c25cc71f-goog

Zhouyi Zhou

3:01 p.m.

On Thu, Jan 26, 2023 at 12:16 PM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...

On Wed, Jan 25, 2023 at 8:13 AM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...
On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney paulmck@kernel.org wrote:

...
On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote:

...
For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. However, cpu_is_hotpluggable() still returns true for those CPUs. This causes torture tests that do offlining to end up trying to offline this CPU causing test failures. Such failure happens on all architectures.

Fix it by asking the opinion of the nohz subsystem on whether the CPU can be hotplugged.

[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]

For drivers/base/ portion: Acked-by: Greg Kroah-Hartman gregkh@linuxfoundation.org

Cc: Frederic Weisbecker frederic@kernel.org Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Zhouyi Zhou zhouzhouyi@gmail.com Cc: Will Deacon will@kernel.org Cc: Marc Zyngier maz@kernel.org Cc: rcu rcu@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org

Queued for further review and testing, thank you both!

It might be a few hours until it becomes publicly visible, but it will get there.

A new round of rcutorture test on fixed linux-5.15.y[3] has been performed in the PPC VM of Open Source Lab of Oregon State University [1], which will last about 29 hours. The test result on original linux-5.15.y is at [2].

From the result of [1], the HOTPLUG failure reports have been eliminated, but a new kernel null point bug related to scsi module has been reported [4] ;-( [ 5.178733][ C1] BUG: Kernel NULL pointer dereference on read at 0x00000008 ... [ 5.231013][ C1] [c00000001ff9fca0] [c0000000009ffbc8] scsi_end_request+0xd8/0x1f0 (unreliable)^M [ 5.234961][ C1] [c00000001ff9fcf0] [c000000000a00e68] scsi_io_completion+0x88/0x700^M [ 5.237863][ C1] [c00000001ff9fda0] [c0000000009f5028] scsi_finish_command+0xe8/0x150^M [ 5.240089][ C1] [c00000001ff9fdf0] [c000000000a00c70] scsi_complete+0x90/0x140^M [ 5.242481][ C1] [c00000001ff9fe20] [c0000000007e5170] blk_complete_reqs+0x80/0xa0^M [ 5.245187][ C1] [c00000001ff9fe50] [c000000000f0b5d0] __do_softirq+0x1e0/0x4e0^M [ 5.248479][ C1] [c00000001ff9ff90] [c0000000000170e8] do_softirq_own_stack+0x48/0x60^M [ 5.250919][ C1] [c00000000a5e7c40] [c00000000a5e7c80] 0xc00000000a5e7c80^M [ 5.253792][ C1] [c00000000a5e7c70] [c0000000001534c0] do_softirq+0xb0/0xc0^M [ 5.256824][ C1] [c00000000a5e7ca0] [c0000000001535ac] __local_bh_enable_ip+0xdc/0x110^M [ 5.259414][ C1] [c00000000a5e7cc0] [c0000000001d75e8] irq_forced_thread_fn+0xc8/0xf0^M [ 5.261921][ C1] [c00000000a5e7d00] [c0000000001d7ae4] irq_thread+0x1b4/0x2a0^M [ 5.265298][ C1] [c00000000a5e7da0] [c00000000017d8c8] kthread+0x1a8/0x1d0^M [ 5.269184][ C1] [c00000000a5e7e10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64^M

But when I invoked [5] by hand, the bug did not show again [6].

[4] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [5] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [6] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re...

I think the bug is not caused by Joel's patch, there must be some other reason. I am starting the 29 hours' rcutorture test again. And I can give ssh access to you if you are interested in it.

Sorry for the inconvenience that I bring

Thanks Zhouyi

Hi the above kernel null pointer dereference bug has nothing to do with Joel's fix because I can reproduce it on original 5.15.y [7] using as while loop [8] (after 36 iterations, the bug fires). So, Joel's patch is tested good! Tested-by: Zhouyi Zhou zhouzhouyi@gmail.com

In following days, I am like to dig out the root of the null pointer dereference bug. [7] http://140.211.169.189/stable-for-joel/linux-stable-rc/ [8] http://140.211.169.189/stable-for-joel/linux-stable-rc/qemu-cmd.sh

Thanks Zhouyi

...

...
[1] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [2] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [3] http://140.211.169.189/linux-stable-rc/

Happy to benefit the community and this is a fruitful learning journey ;-)

Cheers Zhouyi

...
                                                    Thanx, Paul
...
Sorry, resending with CC to stable.

drivers/base/cpu.c | 3 ++- include/linux/tick.h | 2 ++ kernel/time/tick-sched.c | 11 ++++++++--- 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 55405ebf23ab..450dca235a2f 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = { bool cpu_is_hotpluggable(unsigned int cpu) { struct device *dev = get_cpu_device(cpu);
return dev && container_of(dev, struct cpu, dev)->hotpluggable;
return dev && container_of(dev, struct cpu, dev)->hotpluggable
        && tick_nohz_cpu_hotpluggable(cpu);
} EXPORT_SYMBOL_GPL(cpu_is_hotpluggable);

diff --git a/include/linux/tick.h b/include/linux/tick.h index bfd571f18cfd..9459fef5b857 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk, enum tick_dep_bits bit); extern void tick_nohz_dep_clear_signal(struct signal_struct *signal, enum tick_dep_bits bit); +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu);

/*

The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases

@@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }

static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; }

static inline void tick_dep_set(enum tick_dep_bits bit) { } static inline void tick_dep_clear(enum tick_dep_bits bit) { } diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9c6f661fb436..63e3e8ebcd64 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -510,7 +510,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) tick_nohz_full_running = true; }

-static int tick_nohz_cpu_down(unsigned int cpu) +bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { /* * The tick_do_timer_cpu CPU handles housekeeping duty (unbound @@ -518,8 +518,13 @@ static int tick_nohz_cpu_down(unsigned int cpu) * CPUs. It must remain online when nohz full is enabled. */ if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
        return -EBUSY;
return 0;
        return false;
return true;
+}

+static int tick_nohz_cpu_down(unsigned int cpu) +{
return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY;
}

void __init tick_nohz_init(void)

2.39.1.405.gd4c25cc71f-goog

Joel Fernandes

3:13 p.m.

On Thu, Jan 26, 2023 at 10:01 AM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...

On Thu, Jan 26, 2023 at 12:16 PM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...
On Wed, Jan 25, 2023 at 8:13 AM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...
On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney paulmck@kernel.org wrote:

...
On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote:

...
For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. However, cpu_is_hotpluggable() still returns true for those CPUs. This causes torture tests that do offlining to end up trying to offline this CPU causing test failures. Such failure happens on all architectures.

Fix it by asking the opinion of the nohz subsystem on whether the CPU can be hotplugged.

[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]

For drivers/base/ portion: Acked-by: Greg Kroah-Hartman gregkh@linuxfoundation.org

Cc: Frederic Weisbecker frederic@kernel.org Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Zhouyi Zhou zhouzhouyi@gmail.com Cc: Will Deacon will@kernel.org Cc: Marc Zyngier maz@kernel.org Cc: rcu rcu@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org

Queued for further review and testing, thank you both!

It might be a few hours until it becomes publicly visible, but it will get there.

A new round of rcutorture test on fixed linux-5.15.y[3] has been performed in the PPC VM of Open Source Lab of Oregon State University [1], which will last about 29 hours. The test result on original linux-5.15.y is at [2].

From the result of [1], the HOTPLUG failure reports have been eliminated, but a new kernel null point bug related to scsi module has been reported [4] ;-( [ 5.178733][ C1] BUG: Kernel NULL pointer dereference on read at 0x00000008 ... [ 5.231013][ C1] [c00000001ff9fca0] [c0000000009ffbc8] scsi_end_request+0xd8/0x1f0 (unreliable)^M [ 5.234961][ C1] [c00000001ff9fcf0] [c000000000a00e68] scsi_io_completion+0x88/0x700^M [ 5.237863][ C1] [c00000001ff9fda0] [c0000000009f5028] scsi_finish_command+0xe8/0x150^M [ 5.240089][ C1] [c00000001ff9fdf0] [c000000000a00c70] scsi_complete+0x90/0x140^M [ 5.242481][ C1] [c00000001ff9fe20] [c0000000007e5170] blk_complete_reqs+0x80/0xa0^M [ 5.245187][ C1] [c00000001ff9fe50] [c000000000f0b5d0] __do_softirq+0x1e0/0x4e0^M [ 5.248479][ C1] [c00000001ff9ff90] [c0000000000170e8] do_softirq_own_stack+0x48/0x60^M [ 5.250919][ C1] [c00000000a5e7c40] [c00000000a5e7c80] 0xc00000000a5e7c80^M [ 5.253792][ C1] [c00000000a5e7c70] [c0000000001534c0] do_softirq+0xb0/0xc0^M [ 5.256824][ C1] [c00000000a5e7ca0] [c0000000001535ac] __local_bh_enable_ip+0xdc/0x110^M [ 5.259414][ C1] [c00000000a5e7cc0] [c0000000001d75e8] irq_forced_thread_fn+0xc8/0xf0^M [ 5.261921][ C1] [c00000000a5e7d00] [c0000000001d7ae4] irq_thread+0x1b4/0x2a0^M [ 5.265298][ C1] [c00000000a5e7da0] [c00000000017d8c8] kthread+0x1a8/0x1d0^M [ 5.269184][ C1] [c00000000a5e7e10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64^M

But when I invoked [5] by hand, the bug did not show again [6].

[4] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [5] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [6] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re...

I think the bug is not caused by Joel's patch, there must be some other reason. I am starting the 29 hours' rcutorture test again. And I can give ssh access to you if you are interested in it.

Sorry for the inconvenience that I bring

Thanks Zhouyi

Hi the above kernel null pointer dereference bug has nothing to do with Joel's fix because I can reproduce it on original 5.15.y [7] using as while loop [8] (after 36 iterations, the bug fires). So, Joel's patch is tested good! Tested-by: Zhouyi Zhou zhouzhouyi@gmail.com

Interesting. I have been running rcutorture's TREE03 on 5.15.y quite a lot and I don't see such an issue.

However, your logs showed the crash is SCSI related. These were the recent SCSI commits in 5.15.y but I am not sure if it causes the issue:

13259b6 scsi: mpi3mr: Refer CONFIG_SCSI_MPI3MR in Makefile 513fdf0 scsi: ufs: Stop using the clock scaling lock in the error handler 7c26d21 scsi: ufs: core: WLUN suspend SSU/enter hibern

Perhaps report it to the scsi and/or stable lists, or do some web searches for if someone else sees it.

- Joel

Zhouyi Zhou

3:41 p.m.

On Thu, Jan 26, 2023 at 11:14 PM Joel Fernandes joel@joelfernandes.org wrote:

...

On Thu, Jan 26, 2023 at 10:01 AM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...
On Thu, Jan 26, 2023 at 12:16 PM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...
On Wed, Jan 25, 2023 at 8:13 AM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...
On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney paulmck@kernel.org wrote:

...
On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote:

...
For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. However, cpu_is_hotpluggable() still returns true for those CPUs. This causes torture tests that do offlining to end up trying to offline this CPU causing test failures. Such failure happens on all architectures.

Fix it by asking the opinion of the nohz subsystem on whether the CPU can be hotplugged.

[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]

For drivers/base/ portion: Acked-by: Greg Kroah-Hartman gregkh@linuxfoundation.org

Cc: Frederic Weisbecker frederic@kernel.org Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Zhouyi Zhou zhouzhouyi@gmail.com Cc: Will Deacon will@kernel.org Cc: Marc Zyngier maz@kernel.org Cc: rcu rcu@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org

Queued for further review and testing, thank you both!

It might be a few hours until it becomes publicly visible, but it will get there.

A new round of rcutorture test on fixed linux-5.15.y[3] has been performed in the PPC VM of Open Source Lab of Oregon State University [1], which will last about 29 hours. The test result on original linux-5.15.y is at [2].

From the result of [1], the HOTPLUG failure reports have been eliminated, but a new kernel null point bug related to scsi module has been reported [4] ;-( [ 5.178733][ C1] BUG: Kernel NULL pointer dereference on read at 0x00000008 ... [ 5.231013][ C1] [c00000001ff9fca0] [c0000000009ffbc8] scsi_end_request+0xd8/0x1f0 (unreliable)^M [ 5.234961][ C1] [c00000001ff9fcf0] [c000000000a00e68] scsi_io_completion+0x88/0x700^M [ 5.237863][ C1] [c00000001ff9fda0] [c0000000009f5028] scsi_finish_command+0xe8/0x150^M [ 5.240089][ C1] [c00000001ff9fdf0] [c000000000a00c70] scsi_complete+0x90/0x140^M [ 5.242481][ C1] [c00000001ff9fe20] [c0000000007e5170] blk_complete_reqs+0x80/0xa0^M [ 5.245187][ C1] [c00000001ff9fe50] [c000000000f0b5d0] __do_softirq+0x1e0/0x4e0^M [ 5.248479][ C1] [c00000001ff9ff90] [c0000000000170e8] do_softirq_own_stack+0x48/0x60^M [ 5.250919][ C1] [c00000000a5e7c40] [c00000000a5e7c80] 0xc00000000a5e7c80^M [ 5.253792][ C1] [c00000000a5e7c70] [c0000000001534c0] do_softirq+0xb0/0xc0^M [ 5.256824][ C1] [c00000000a5e7ca0] [c0000000001535ac] __local_bh_enable_ip+0xdc/0x110^M [ 5.259414][ C1] [c00000000a5e7cc0] [c0000000001d75e8] irq_forced_thread_fn+0xc8/0xf0^M [ 5.261921][ C1] [c00000000a5e7d00] [c0000000001d7ae4] irq_thread+0x1b4/0x2a0^M [ 5.265298][ C1] [c00000000a5e7da0] [c00000000017d8c8] kthread+0x1a8/0x1d0^M [ 5.269184][ C1] [c00000000a5e7e10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64^M

But when I invoked [5] by hand, the bug did not show again [6].

[4] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [5] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [6] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re...

I think the bug is not caused by Joel's patch, there must be some other reason. I am starting the 29 hours' rcutorture test again. And I can give ssh access to you if you are interested in it.

Sorry for the inconvenience that I bring

Thanks Zhouyi

Hi the above kernel null pointer dereference bug has nothing to do with Joel's fix because I can reproduce it on original 5.15.y [7] using as while loop [8] (after 36 iterations, the bug fires). So, Joel's patch is tested good! Tested-by: Zhouyi Zhou zhouzhouyi@gmail.com

Interesting. I have been running rcutorture's TREE03 on 5.15.y quite a lot and I don't see such an issue.

However, your logs showed the crash is SCSI related. These were the recent SCSI commits in 5.15.y but I am not sure if it causes the issue:

13259b6 scsi: mpi3mr: Refer CONFIG_SCSI_MPI3MR in Makefile 513fdf0 scsi: ufs: Stop using the clock scaling lock in the error handler 7c26d21 scsi: ufs: core: WLUN suspend SSU/enter hibern

Perhaps report it to the scsi and/or stable lists, or do some web searches for if someone else sees it.

Thank Joel for your guidance! I will continue the research following your instructions! I will report to you and the community once I have any process.

Thank you all!

Best Regards Zhouyi

...

Joel

Zhouyi Zhou

28 Jan 28 Jan

1:42 a.m.

On Thu, Jan 26, 2023 at 11:41 PM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...

On Thu, Jan 26, 2023 at 11:14 PM Joel Fernandes joel@joelfernandes.org wrote:

...
On Thu, Jan 26, 2023 at 10:01 AM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...
On Thu, Jan 26, 2023 at 12:16 PM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...
On Wed, Jan 25, 2023 at 8:13 AM Zhouyi Zhou zhouzhouyi@gmail.com wrote:

...
On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney paulmck@kernel.org wrote:

...
On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote: > For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. > However, cpu_is_hotpluggable() still returns true for those CPUs. This causes > torture tests that do offlining to end up trying to offline this CPU causing > test failures. Such failure happens on all architectures. > > Fix it by asking the opinion of the nohz subsystem on whether the CPU can > be hotplugged. > > [ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ] > > For drivers/base/ portion: > Acked-by: Greg Kroah-Hartman gregkh@linuxfoundation.org > > Cc: Frederic Weisbecker frederic@kernel.org > Cc: "Paul E. McKenney" paulmck@kernel.org > Cc: Zhouyi Zhou zhouzhouyi@gmail.com > Cc: Will Deacon will@kernel.org > Cc: Marc Zyngier maz@kernel.org > Cc: rcu rcu@vger.kernel.org > Cc: stable@vger.kernel.org > Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") > Signed-off-by: Joel Fernandes (Google) joel@joelfernandes.org

Queued for further review and testing, thank you both!

It might be a few hours until it becomes publicly visible, but it will get there.

A new round of rcutorture test on fixed linux-5.15.y[3] has been performed in the PPC VM of Open Source Lab of Oregon State University [1], which will last about 29 hours. The test result on original linux-5.15.y is at [2].

From the result of [1], the HOTPLUG failure reports have been eliminated, but a new kernel null point bug related to scsi module has been reported [4] ;-( [ 5.178733][ C1] BUG: Kernel NULL pointer dereference on read at 0x00000008 ... [ 5.231013][ C1] [c00000001ff9fca0] [c0000000009ffbc8] scsi_end_request+0xd8/0x1f0 (unreliable)^M [ 5.234961][ C1] [c00000001ff9fcf0] [c000000000a00e68] scsi_io_completion+0x88/0x700^M [ 5.237863][ C1] [c00000001ff9fda0] [c0000000009f5028] scsi_finish_command+0xe8/0x150^M [ 5.240089][ C1] [c00000001ff9fdf0] [c000000000a00c70] scsi_complete+0x90/0x140^M [ 5.242481][ C1] [c00000001ff9fe20] [c0000000007e5170] blk_complete_reqs+0x80/0xa0^M [ 5.245187][ C1] [c00000001ff9fe50] [c000000000f0b5d0] __do_softirq+0x1e0/0x4e0^M [ 5.248479][ C1] [c00000001ff9ff90] [c0000000000170e8] do_softirq_own_stack+0x48/0x60^M [ 5.250919][ C1] [c00000000a5e7c40] [c00000000a5e7c80] 0xc00000000a5e7c80^M [ 5.253792][ C1] [c00000000a5e7c70] [c0000000001534c0] do_softirq+0xb0/0xc0^M [ 5.256824][ C1] [c00000000a5e7ca0] [c0000000001535ac] __local_bh_enable_ip+0xdc/0x110^M [ 5.259414][ C1] [c00000000a5e7cc0] [c0000000001d75e8] irq_forced_thread_fn+0xc8/0xf0^M [ 5.261921][ C1] [c00000000a5e7d00] [c0000000001d7ae4] irq_thread+0x1b4/0x2a0^M [ 5.265298][ C1] [c00000000a5e7da0] [c00000000017d8c8] kthread+0x1a8/0x1d0^M [ 5.269184][ C1] [c00000000a5e7e10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64^M

But when I invoked [5] by hand, the bug did not show again [6].

[4] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [5] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re... [6] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/re...

I think the bug is not caused by Joel's patch, there must be some other reason. I am starting the 29 hours' rcutorture test again. And I can give ssh access to you if you are interested in it.

Sorry for the inconvenience that I bring

Thanks Zhouyi

Hi the above kernel null pointer dereference bug has nothing to do with Joel's fix because I can reproduce it on original 5.15.y [7] using as while loop [8] (after 36 iterations, the bug fires). So, Joel's patch is tested good! Tested-by: Zhouyi Zhou zhouzhouyi@gmail.com

Interesting. I have been running rcutorture's TREE03 on 5.15.y quite a lot and I don't see such an issue.

However, your logs showed the crash is SCSI related. These were the recent SCSI commits in 5.15.y but I am not sure if it causes the issue:

13259b6 scsi: mpi3mr: Refer CONFIG_SCSI_MPI3MR in Makefile 513fdf0 scsi: ufs: Stop using the clock scaling lock in the error handler 7c26d21 scsi: ufs: core: WLUN suspend SSU/enter hibern

Perhaps report it to the scsi and/or stable lists, or do some web searches for if someone else sees it.

Thank Joel for your guidance! I will continue the research following your instructions! I will report to you and the community once I have any process.

Hi I think I have chased down the reason behind the bug reported by SCSI subsystem following your guidance, and have reported to SCSI mainteners [1] This is a fruitful journey, thank you all ;-)

So once again, Joel's patch is tested good ;-) Tested-by: Zhouyi Zhou zhouzhouyi@gmail.com

Thanks for your attention Thanks Zhouyi [1] https://lore.kernel.org/lkml/CAABZP2ywCbu4Po63BDBE7U1WEqx4DF7F2CZjTqFp0dSDw-...

...

Thank you all!

Best Regards Zhouyi

...

Joel

1071

days inactive

1075

days old

linux-stable-mirror@lists.linaro.org

8 comments

participants

tags (0)

participants (4)

Joel Fernandes
Joel Fernandes (Google)
Paul E. McKenney
Zhouyi Zhou