The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 443f2d5ba13d65ccfd879460f77941875159d154 Mon Sep 17 00:00:00 2001
From: Srikar Dronamraju <srikar(a)linux.vnet.ibm.com>
Date: Wed, 4 Sep 2019 15:17:38 +0530
Subject: [PATCH] perf stat: Fix a segmentation fault when using repeat forever
Observe a segmentation fault when 'perf stat' is asked to repeat forever
with the interval option.
Without fix:
# perf stat -r 0 -I 5000 -e cycles -a sleep 10
# time counts unit events
5.000211692 3,13,89,82,34,157 cycles
10.000380119 1,53,98,52,22,294 cycles
10.040467280 17,16,79,265 cycles
Segmentation fault
This problem was only observed when we use forever option aka -r 0 and
works with limited repeats. Calling print_counter with ts being set to
NULL, is not a correct option when interval is set. Hence avoid
print_counter(NULL,..) if interval is set.
With fix:
# perf stat -r 0 -I 5000 -e cycles -a sleep 10
# time counts unit events
5.019866622 3,15,14,43,08,697 cycles
10.039865756 3,15,16,31,95,261 cycles
10.059950628 1,26,05,47,158 cycles
5.009902655 3,14,52,62,33,932 cycles
10.019880228 3,14,52,22,89,154 cycles
10.030543876 66,90,18,333 cycles
5.009848281 3,14,51,98,25,437 cycles
10.029854402 3,15,14,93,04,918 cycles
5.009834177 3,14,51,95,92,316 cycles
Committer notes:
Did the 'git bisect' to find the cset introducing the problem to add the
Fixes tag below, and at that time the problem reproduced as:
(gdb) run stat -r0 -I500 sleep 1
<SNIP>
Program received signal SIGSEGV, Segmentation fault.
print_interval (prefix=prefix@entry=0x7fffffffc8d0 "", ts=ts@entry=0x0) at builtin-stat.c:866
866 sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, csv_sep);
(gdb) bt
#0 print_interval (prefix=prefix@entry=0x7fffffffc8d0 "", ts=ts@entry=0x0) at builtin-stat.c:866
#1 0x000000000041860a in print_counters (ts=ts@entry=0x0, argc=argc@entry=2, argv=argv@entry=0x7fffffffd640) at builtin-stat.c:938
#2 0x0000000000419a7f in cmd_stat (argc=2, argv=0x7fffffffd640, prefix=<optimized out>) at builtin-stat.c:1411
#3 0x000000000045c65a in run_builtin (p=p@entry=0x6291b8 <commands+216>, argc=argc@entry=5, argv=argv@entry=0x7fffffffd640) at perf.c:370
#4 0x000000000045c893 in handle_internal_command (argc=5, argv=0x7fffffffd640) at perf.c:429
#5 0x000000000045c8f1 in run_argv (argcp=argcp@entry=0x7fffffffd4ac, argv=argv@entry=0x7fffffffd4a0) at perf.c:473
#6 0x000000000045cac9 in main (argc=<optimized out>, argv=<optimized out>) at perf.c:588
(gdb)
Mostly the same as just before this patch:
Program received signal SIGSEGV, Segmentation fault.
0x00000000005874a7 in print_interval (config=0xa1f2a0 <stat_config>, evlist=0xbc9b90, prefix=0x7fffffffd1c0 "`", ts=0x0) at util/stat-display.c:964
964 sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, config->csv_sep);
(gdb) bt
#0 0x00000000005874a7 in print_interval (config=0xa1f2a0 <stat_config>, evlist=0xbc9b90, prefix=0x7fffffffd1c0 "`", ts=0x0) at util/stat-display.c:964
#1 0x0000000000588047 in perf_evlist__print_counters (evlist=0xbc9b90, config=0xa1f2a0 <stat_config>, _target=0xa1f0c0 <target>, ts=0x0, argc=2, argv=0x7fffffffd670)
at util/stat-display.c:1172
#2 0x000000000045390f in print_counters (ts=0x0, argc=2, argv=0x7fffffffd670) at builtin-stat.c:656
#3 0x0000000000456bb5 in cmd_stat (argc=2, argv=0x7fffffffd670) at builtin-stat.c:1960
#4 0x00000000004dd2e0 in run_builtin (p=0xa30e00 <commands+288>, argc=5, argv=0x7fffffffd670) at perf.c:310
#5 0x00000000004dd54d in handle_internal_command (argc=5, argv=0x7fffffffd670) at perf.c:362
#6 0x00000000004dd694 in run_argv (argcp=0x7fffffffd4cc, argv=0x7fffffffd4c0) at perf.c:406
#7 0x00000000004dda11 in main (argc=5, argv=0x7fffffffd670) at perf.c:531
(gdb)
Fixes: d4f63a4741a8 ("perf stat: Introduce print_counters function")
Signed-off-by: Srikar Dronamraju <srikar(a)linux.vnet.ibm.com>
Acked-by: Jiri Olsa <jolsa(a)kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme(a)redhat.com>
Tested-by: Ravi Bangoria <ravi.bangoria(a)linux.ibm.com>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: Naveen N. Rao <naveen.n.rao(a)linux.vnet.ibm.com>
Cc: stable(a)vger.kernel.org # v4.2+
Link: http://lore.kernel.org/lkml/20190904094738.9558-3-srikar@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme(a)redhat.com>
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index fa4b148ecfca..60cdd383af81 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1956,7 +1956,7 @@ int cmd_stat(int argc, const char **argv)
perf_evlist__reset_prev_raw_counts(evsel_list);
status = run_perf_stat(argc, argv, run_idx);
- if (forever && status != -1) {
+ if (forever && status != -1 && !interval) {
print_counters(NULL, argc, argv);
perf_stat__reset_stats();
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0216234c2eed1367a318daeb9f4a97d8217412a0 Mon Sep 17 00:00:00 2001
From: Jiri Olsa <jolsa(a)kernel.org>
Date: Thu, 12 Sep 2019 12:52:35 +0200
Subject: [PATCH] perf tools: Fix segfault in cpu_cache_level__read()
We release wrong pointer on error path in cpu_cache_level__read
function, leading to segfault:
(gdb) r record ls
Starting program: /root/perf/tools/perf/perf record ls
...
[ perf record: Woken up 1 times to write data ]
double free or corruption (out)
Thread 1 "perf" received signal SIGABRT, Aborted.
0x00007ffff7463798 in raise () from /lib64/power9/libc.so.6
(gdb) bt
#0 0x00007ffff7463798 in raise () from /lib64/power9/libc.so.6
#1 0x00007ffff7443bac in abort () from /lib64/power9/libc.so.6
#2 0x00007ffff74af8bc in __libc_message () from /lib64/power9/libc.so.6
#3 0x00007ffff74b92b8 in malloc_printerr () from /lib64/power9/libc.so.6
#4 0x00007ffff74bb874 in _int_free () from /lib64/power9/libc.so.6
#5 0x0000000010271260 in __zfree (ptr=0x7fffffffa0b0) at ../../lib/zalloc..
#6 0x0000000010139340 in cpu_cache_level__read (cache=0x7fffffffa090, cac..
#7 0x0000000010143c90 in build_caches (cntp=0x7fffffffa118, size=<optimiz..
...
Releasing the proper pointer.
Fixes: 720e98b5faf1 ("perf tools: Add perf data cache feature")
Signed-off-by: Jiri Olsa <jolsa(a)kernel.org>
Cc: Alexander Shishkin <alexander.shishkin(a)linux.intel.com>
Cc: Michael Petlan <mpetlan(a)redhat.com>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org: # v4.6+
Link: http://lore.kernel.org/lkml/20190912105235.10689-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme(a)redhat.com>
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 5722ff717777..0167f9697172 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1073,7 +1073,7 @@ static int cpu_cache_level__read(struct cpu_cache_level *cache, u32 cpu, u16 lev
scnprintf(file, PATH_MAX, "%s/shared_cpu_list", path);
if (sysfs__read_str(file, &cache->map, &len)) {
- zfree(&cache->map);
+ zfree(&cache->size);
zfree(&cache->type);
return -1;
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 533ca1feed98b0bf024779a14760694c7cb4d431 Mon Sep 17 00:00:00 2001
From: Dexuan Cui <decui(a)microsoft.com>
Date: Fri, 2 Aug 2019 22:50:20 +0000
Subject: [PATCH] PCI: hv: Avoid use of hv_pci_dev->pci_slot after freeing it
The slot must be removed before the pci_dev is removed, otherwise a panic
can happen due to use-after-free.
Fixes: 15becc2b56c6 ("PCI: hv: Add hv_pci_remove_slots() when we unload the driver")
Signed-off-by: Dexuan Cui <decui(a)microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi(a)arm.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 40b625458afa..2b53976cd9f9 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -2701,8 +2701,8 @@ static int hv_pci_remove(struct hv_device *hdev)
/* Remove the bus from PCI's point of view. */
pci_lock_rescan_remove();
pci_stop_root_bus(hbus->pci_bus);
- pci_remove_root_bus(hbus->pci_bus);
hv_pci_remove_slots(hbus);
+ pci_remove_root_bus(hbus->pci_bus);
pci_unlock_rescan_remove();
hbus->state = hv_pcibus_removed;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 533ca1feed98b0bf024779a14760694c7cb4d431 Mon Sep 17 00:00:00 2001
From: Dexuan Cui <decui(a)microsoft.com>
Date: Fri, 2 Aug 2019 22:50:20 +0000
Subject: [PATCH] PCI: hv: Avoid use of hv_pci_dev->pci_slot after freeing it
The slot must be removed before the pci_dev is removed, otherwise a panic
can happen due to use-after-free.
Fixes: 15becc2b56c6 ("PCI: hv: Add hv_pci_remove_slots() when we unload the driver")
Signed-off-by: Dexuan Cui <decui(a)microsoft.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi(a)arm.com>
Cc: stable(a)vger.kernel.org
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 40b625458afa..2b53976cd9f9 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -2701,8 +2701,8 @@ static int hv_pci_remove(struct hv_device *hdev)
/* Remove the bus from PCI's point of view. */
pci_lock_rescan_remove();
pci_stop_root_bus(hbus->pci_bus);
- pci_remove_root_bus(hbus->pci_bus);
hv_pci_remove_slots(hbus);
+ pci_remove_root_bus(hbus->pci_bus);
pci_unlock_rescan_remove();
hbus->state = hv_pcibus_removed;
}
The patch below does not apply to the 5.3-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b9023b91dd020ad7e093baa5122b6968c48cc9e0 Mon Sep 17 00:00:00 2001
From: Balasubramani Vivekanandan <balasubramani_vivekanandan(a)mentor.com>
Date: Thu, 26 Sep 2019 15:51:01 +0200
Subject: [PATCH] tick: broadcast-hrtimer: Fix a race in bc_set_next
When a cpu requests broadcasting, before starting the tick broadcast
hrtimer, bc_set_next() checks if the timer callback (bc_handler) is active
using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does not provide
the required synchronization when the callback is active on other core.
The callback could have already executed tick_handle_oneshot_broadcast()
and could have also returned. But still there is a small time window where
the hrtimer_try_to_cancel() returns -1. In that case bc_set_next() returns
without doing anything, but the next_event of the tick broadcast clock
device is already set to a timeout value.
In the race condition diagram below, CPU #1 is running the timer callback
and CPU #2 is entering idle state and so calls bc_set_next().
In the worst case, the next_event will contain an expiry time, but the
hrtimer will not be started which happens when the racing callback returns
HRTIMER_NORESTART. The hrtimer might never recover if all further requests
from the CPUs to subscribe to tick broadcast have timeout greater than the
next_event of tick broadcast clock device. This leads to cascading of
failures and finally noticed as rcu stall warnings
Here is a depiction of the race condition
CPU #1 (Running timer callback) CPU #2 (Enter idle
and subscribe to
tick broadcast)
--------------------- ---------------------
__run_hrtimer() tick_broadcast_enter()
bc_handler() __tick_broadcast_oneshot_control()
tick_handle_oneshot_broadcast()
raw_spin_lock(&tick_broadcast_lock);
dev->next_event = KTIME_MAX; //wait for tick_broadcast_lock
//next_event for tick broadcast clock
set to KTIME_MAX since no other cores
subscribed to tick broadcasting
raw_spin_unlock(&tick_broadcast_lock);
if (dev->next_event == KTIME_MAX)
return HRTIMER_NORESTART
// callback function exits without
restarting the hrtimer //tick_broadcast_lock acquired
raw_spin_lock(&tick_broadcast_lock);
tick_broadcast_set_event()
clockevents_program_event()
dev->next_event = expires;
bc_set_next()
hrtimer_try_to_cancel()
//returns -1 since the timer
callback is active. Exits without
restarting the timer
cpu_base->running = NULL;
The comment that hrtimer cannot be armed from within the callback is
wrong. It is fine to start the hrtimer from within the callback. Also it is
safe to start the hrtimer from the enter/exit idle code while the broadcast
handler is active. The enter/exit idle code and the broadcast handler are
synchronized using tick_broadcast_lock. So there is no need for the
existing try to cancel logic. All this can be removed which will eliminate
the race condition as well.
Fixes: 5d1638acb9f6 ("tick: Introduce hrtimer based broadcast")
Originally-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan(a)mentor.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lkml.kernel.org/r/20190926135101.12102-2-balasubramani_vivekanandan…
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index c1f5bb590b5e..b5a65e212df2 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -42,39 +42,39 @@ static int bc_shutdown(struct clock_event_device *evt)
*/
static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
{
- int bc_moved;
/*
- * We try to cancel the timer first. If the callback is on
- * flight on some other cpu then we let it handle it. If we
- * were able to cancel the timer nothing can rearm it as we
- * own broadcast_lock.
+ * This is called either from enter/exit idle code or from the
+ * broadcast handler. In all cases tick_broadcast_lock is held.
*
- * However we can also be called from the event handler of
- * ce_broadcast_hrtimer itself when it expires. We cannot
- * restart the timer because we are in the callback, but we
- * can set the expiry time and let the callback return
- * HRTIMER_RESTART.
+ * hrtimer_cancel() cannot be called here neither from the
+ * broadcast handler nor from the enter/exit idle code. The idle
+ * code can run into the problem described in bc_shutdown() and the
+ * broadcast handler cannot wait for itself to complete for obvious
+ * reasons.
*
- * Since we are in the idle loop at this point and because
- * hrtimer_{start/cancel} functions call into tracing,
- * calls to these functions must be bound within RCU_NONIDLE.
+ * Each caller tries to arm the hrtimer on its own CPU, but if the
+ * hrtimer callbback function is currently running, then
+ * hrtimer_start() cannot move it and the timer stays on the CPU on
+ * which it is assigned at the moment.
+ *
+ * As this can be called from idle code, the hrtimer_start()
+ * invocation has to be wrapped with RCU_NONIDLE() as
+ * hrtimer_start() can call into tracing.
*/
- RCU_NONIDLE(
- {
- bc_moved = hrtimer_try_to_cancel(&bctimer) >= 0;
- if (bc_moved) {
- hrtimer_start(&bctimer, expires,
- HRTIMER_MODE_ABS_PINNED_HARD);
- }
- }
- );
-
- if (bc_moved) {
- /* Bind the "device" to the cpu */
- bc->bound_on = smp_processor_id();
- } else if (bc->bound_on == smp_processor_id()) {
- hrtimer_set_expires(&bctimer, expires);
- }
+ RCU_NONIDLE( {
+ hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
+ /*
+ * The core tick broadcast mode expects bc->bound_on to be set
+ * correctly to prevent a CPU which has the broadcast hrtimer
+ * armed from going deep idle.
+ *
+ * As tick_broadcast_lock is held, nothing can change the cpu
+ * base which was just established in hrtimer_start() above. So
+ * the below access is safe even without holding the hrtimer
+ * base lock.
+ */
+ bc->bound_on = bctimer.base->cpu_base->cpu;
+ } );
return 0;
}
@@ -100,10 +100,6 @@ static enum hrtimer_restart bc_handler(struct hrtimer *t)
{
ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer);
- if (clockevent_state_oneshot(&ce_broadcast_hrtimer))
- if (ce_broadcast_hrtimer.next_event != KTIME_MAX)
- return HRTIMER_RESTART;
-
return HRTIMER_NORESTART;
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b0215e2d6a18d8331b2d4a8b38ccf3eff783edb1 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (VMware)" <rostedt(a)goodmis.org>
Date: Wed, 28 Aug 2019 15:05:28 -0400
Subject: [PATCH] tools lib traceevent: Do not free tep->cmdlines in
add_new_comm() on failure
If the re-allocation of tep->cmdlines succeeds, then the previous
allocation of tep->cmdlines will be freed. If we later fail in
add_new_comm(), we must not free cmdlines, and also should assign
tep->cmdlines to the new allocation. Otherwise when freeing tep, the
tep->cmdlines will be pointing to garbage.
Fixes: a6d2a61ac653a ("tools lib traceevent: Remove some die() calls")
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Jiri Olsa <jolsa(a)redhat.com>
Cc: Namhyung Kim <namhyung(a)kernel.org>
Cc: linux-trace-devel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Link: http://lkml.kernel.org/r/20190828191819.970121417@goodmis.org
Signed-off-by: Arnaldo Carvalho de Melo <acme(a)redhat.com>
diff --git a/tools/lib/traceevent/event-parse.c b/tools/lib/traceevent/event-parse.c
index b36b536a9fcb..13fd9fdf91e0 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -269,10 +269,10 @@ static int add_new_comm(struct tep_handle *tep,
errno = ENOMEM;
return -1;
}
+ tep->cmdlines = cmdlines;
cmdlines[tep->cmdline_count].comm = strdup(comm);
if (!cmdlines[tep->cmdline_count].comm) {
- free(cmdlines);
errno = ENOMEM;
return -1;
}
@@ -283,7 +283,6 @@ static int add_new_comm(struct tep_handle *tep,
tep->cmdline_count++;
qsort(cmdlines, tep->cmdline_count, sizeof(*cmdlines), cmdline_cmp);
- tep->cmdlines = cmdlines;
return 0;
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From c784be435d5dae28d3b03db31753dd7a18733f0c Mon Sep 17 00:00:00 2001
From: "Gautham R. Shenoy" <ego(a)linux.vnet.ibm.com>
Date: Wed, 15 May 2019 13:15:52 +0530
Subject: [PATCH] powerpc/pseries: Fix cpu_hotplug_lock acquisition in
resize_hpt()
The calls to arch_add_memory()/arch_remove_memory() are always made
with the read-side cpu_hotplug_lock acquired via memory_hotplug_begin().
On pSeries, arch_add_memory()/arch_remove_memory() eventually call
resize_hpt() which in turn calls stop_machine() which acquires the
read-side cpu_hotplug_lock again, thereby resulting in the recursive
acquisition of this lock.
In the absence of CONFIG_PROVE_LOCKING, we hadn't observed a system
lockup during a memory hotplug operation because cpus_read_lock() is a
per-cpu rwsem read, which, in the fast-path (in the absence of the
writer, which in our case is a CPU-hotplug operation) simply
increments the read_count on the semaphore. Thus a recursive read in
the fast-path doesn't cause any problems.
However, we can hit this problem in practice if there is a concurrent
CPU-Hotplug operation in progress which is waiting to acquire the
write-side of the lock. This will cause the second recursive read to
block until the writer finishes. While the writer is blocked since the
first read holds the lock. Thus both the reader as well as the writers
fail to make any progress thereby blocking both CPU-Hotplug as well as
Memory Hotplug operations.
Memory-Hotplug CPU-Hotplug
CPU 0 CPU 1
------ ------
1. down_read(cpu_hotplug_lock.rw_sem)
[memory_hotplug_begin]
2. down_write(cpu_hotplug_lock.rw_sem)
[cpu_up/cpu_down]
3. down_read(cpu_hotplug_lock.rw_sem)
[stop_machine()]
Lockdep complains as follows in these code-paths.
swapper/0/1 is trying to acquire lock:
(____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x2c/0x60
but task is already holding lock:
(____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(cpu_hotplug_lock.rw_sem);
lock(cpu_hotplug_lock.rw_sem);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by swapper/0/1:
#0: (____ptrval____) (&dev->mutex){....}, at: __driver_attach+0x12c/0x1b0
#1: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
#2: (____ptrval____) (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x54/0x1a0
stack backtrace:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166
Call Trace:
dump_stack+0xe8/0x164 (unreliable)
__lock_acquire+0x1110/0x1c70
lock_acquire+0x240/0x290
cpus_read_lock+0x64/0xf0
stop_machine+0x2c/0x60
pseries_lpar_resize_hpt+0x19c/0x2c0
resize_hpt_for_hotplug+0x70/0xd0
arch_add_memory+0x58/0xfc
devm_memremap_pages+0x5e8/0x8f0
pmem_attach_disk+0x764/0x830
nvdimm_bus_probe+0x118/0x240
really_probe+0x230/0x4b0
driver_probe_device+0x16c/0x1e0
__driver_attach+0x148/0x1b0
bus_for_each_dev+0x90/0x130
driver_attach+0x34/0x50
bus_add_driver+0x1a8/0x360
driver_register+0x108/0x170
__nd_driver_register+0xd0/0xf0
nd_pmem_driver_init+0x34/0x48
do_one_initcall+0x1e0/0x45c
kernel_init_freeable+0x540/0x64c
kernel_init+0x2c/0x160
ret_from_kernel_thread+0x5c/0x68
Fix this issue by
1) Requiring all the calls to pseries_lpar_resize_hpt() be made
with cpu_hotplug_lock held.
2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked()
as a consequence of 1)
3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt()
with cpu_hotplug_lock held.
Fixes: dbcf929c0062 ("powerpc/pseries: Add support for hash table resizing")
Cc: stable(a)vger.kernel.org # v4.11+
Reported-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Signed-off-by: Gautham R. Shenoy <ego(a)linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/1557906352-29048-1-git-send-email-ego@linux.vnet.…
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index e6d471058597..c363e850550e 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -34,6 +34,7 @@
#include <linux/libfdt.h>
#include <linux/pkeys.h>
#include <linux/hugetlb.h>
+#include <linux/cpu.h>
#include <asm/debugfs.h>
#include <asm/processor.h>
@@ -1931,10 +1932,16 @@ static int hpt_order_get(void *data, u64 *val)
static int hpt_order_set(void *data, u64 val)
{
+ int ret;
+
if (!mmu_hash_ops.resize_hpt)
return -ENODEV;
- return mmu_hash_ops.resize_hpt(val);
+ cpus_read_lock();
+ ret = mmu_hash_ops.resize_hpt(val);
+ cpus_read_unlock();
+
+ return ret;
}
DEFINE_DEBUGFS_ATTRIBUTE(fops_hpt_order, hpt_order_get, hpt_order_set, "%llu\n");
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 09bb878c21e0..4f76e5f30c97 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1413,7 +1413,10 @@ static int pseries_lpar_resize_hpt_commit(void *data)
return 0;
}
-/* Must be called in user context */
+/*
+ * Must be called in process context. The caller must hold the
+ * cpus_lock.
+ */
static int pseries_lpar_resize_hpt(unsigned long shift)
{
struct hpt_resize_state state = {
@@ -1467,7 +1470,8 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
t1 = ktime_get();
- rc = stop_machine(pseries_lpar_resize_hpt_commit, &state, NULL);
+ rc = stop_machine_cpuslocked(pseries_lpar_resize_hpt_commit,
+ &state, NULL);
t2 = ktime_get();