With the current implementation the following race can happen:
* blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
blk_mq_unfreeze_queue().
* blk_queue_enter() calls blk_queue_pm_only() and that function returns
true.
* blk_queue_enter() calls blk_pm_request_resume() and that function does
not call pm_request_resume() because the queue runtime status is
RPM_ACTIVE.
* blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.
Fix this race by changing the queue runtime status into RPM_SUSPENDING
before switching q_usage_counter to atomic mode.
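For clarity, the resulting ordering in blk_pre_runtime_suspend() looks roughly
like the condensed sketch below (based on the post-patch code in the diff that
follows; the percpu_ref_switch_to_atomic_sync() step, error paths and unrelated
details are omitted, so this is not verbatim kernel source):

    /* Condensed sketch of blk_pre_runtime_suspend() after this patch. */
    spin_lock_irq(&q->queue_lock);
    q->rpm_status = RPM_SUSPENDING;         /* publish the status first ... */
    spin_unlock_irq(&q->queue_lock);

    blk_set_pm_only(q);                     /* ... then make blk_queue_pm_only() return true */

    ret = -EBUSY;
    blk_freeze_queue_start(q);              /* switch q_usage_counter to atomic mode */
    if (percpu_ref_is_zero(&q->q_usage_counter))
            ret = 0;
    blk_mq_unfreeze_queue(q);               /* back to per-cpu mode */

    if (ret < 0) {                          /* requests still in flight: undo */
            spin_lock_irq(&q->queue_lock);
            q->rpm_status = RPM_ACTIVE;
            pm_runtime_mark_last_busy(q->dev);
            spin_unlock_irq(&q->queue_lock);
            blk_clear_pm_only(q);
    }

With this ordering a blk_queue_enter() call that races with the suspend either
sees pm_only together with a non-RPM_ACTIVE rpm_status and therefore requests a
resume, or it does not see pm_only at all and simply enters the queue (which in
turn makes the freeze check fail and aborts the suspend).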
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Stanley Chu <stanley.chu@mediatek.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable <stable@vger.kernel.org>
Fixes: 986d413b7c15 ("blk-mq: Enable support for runtime power management")
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-pm.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/block/blk-pm.c b/block/blk-pm.c
index b85234d758f7..17bd020268d4 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -67,6 +67,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)
WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
+ spin_lock_irq(&q->queue_lock);
+ q->rpm_status = RPM_SUSPENDING;
+ spin_unlock_irq(&q->queue_lock);
+
/*
* Increase the pm_only counter before checking whether any
* non-PM blk_queue_enter() calls are in progress to avoid that any
@@ -89,15 +93,14 @@ int blk_pre_runtime_suspend(struct request_queue *q)
/* Switch q_usage_counter back to per-cpu mode. */
blk_mq_unfreeze_queue(q);
- spin_lock_irq(&q->queue_lock);
- if (ret < 0)
+ if (ret < 0) {
+ spin_lock_irq(&q->queue_lock);
+ q->rpm_status = RPM_ACTIVE;
pm_runtime_mark_last_busy(q->dev);
- else
- q->rpm_status = RPM_SUSPENDING;
- spin_unlock_irq(&q->queue_lock);
+ spin_unlock_irq(&q->queue_lock);
- if (ret)
blk_clear_pm_only(q);
+ }
return ret;
}
membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented
as syncing the core on all sibling threads but not necessarily the
calling thread. This behavior is fundamentally buggy and cannot be used
safely. Suppose a user program has two threads. Thread A is on CPU 0
and thread B is on CPU 1. Thread A modifies some text and calls
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE). Then thread B
executes the modified code. At any point after membarrier() decides
which CPUs to target, thread A could be preempted and replaced by thread
B on CPU 0; this could even happen on exit from the membarrier()
syscall. If that happens, thread B ends up running on CPU 0 without its
core ever having been synced.
In principle, this could be fixed by arranging for the scheduler to
sync_core_before_usermode() whenever switching between two threads in
the same mm if there is any possibility of a concurrent membarrier()
call, but this would have considerable overhead. Instead, make
membarrier() sync the calling CPU as well.
As an optimization, this avoids an extra smp_mb() in the default
barrier-only mode.
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
kernel/sched/membarrier.c | 51 +++++++++++++++++++++++++--------------
1 file changed, 33 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 01538b31f27e..57266ab32ef9 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -333,7 +333,8 @@ static int membarrier_private_expedited(int flags, int cpu_id)
return -EPERM;
}
- if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1)
+ if (flags != MEMBARRIER_FLAG_SYNC_CORE &&
+ (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1))
return 0;
/*
@@ -352,8 +353,6 @@ static int membarrier_private_expedited(int flags, int cpu_id)
if (cpu_id >= nr_cpu_ids || !cpu_online(cpu_id))
goto out;
- if (cpu_id == raw_smp_processor_id())
- goto out;
rcu_read_lock();
p = rcu_dereference(cpu_rq(cpu_id)->curr);
if (!p || p->mm != mm) {
@@ -368,16 +367,6 @@ static int membarrier_private_expedited(int flags, int cpu_id)
for_each_online_cpu(cpu) {
struct task_struct *p;
- /*
- * Skipping the current CPU is OK even through we can be
- * migrated at any point. The current CPU, at the point
- * where we read raw_smp_processor_id(), is ensured to
- * be in program order with respect to the caller
- * thread. Therefore, we can skip this CPU from the
- * iteration.
- */
- if (cpu == raw_smp_processor_id())
- continue;
p = rcu_dereference(cpu_rq(cpu)->curr);
if (p && p->mm == mm)
__cpumask_set_cpu(cpu, tmpmask);
@@ -385,12 +374,38 @@ static int membarrier_private_expedited(int flags, int cpu_id)
rcu_read_unlock();
}
- preempt_disable();
- if (cpu_id >= 0)
+ if (cpu_id >= 0) {
+ /*
+ * smp_call_function_single() will call ipi_func() if cpu_id
+ * is the calling CPU.
+ */
smp_call_function_single(cpu_id, ipi_func, NULL, 1);
- else
- smp_call_function_many(tmpmask, ipi_func, NULL, 1);
- preempt_enable();
+ } else {
+ /*
+ * For regular membarrier, we can save a few cycles by
+ * skipping the current cpu -- we're about to do smp_mb()
+ * below, and if we migrate to a different cpu, this cpu
+ * and the new cpu will execute a full barrier in the
+ * scheduler.
+ *
+ * For CORE_SYNC, we do need a barrier on the current cpu --
+ * otherwise, if we are migrated and replaced by a different
+ * task in the same mm just before, during, or after
+ * membarrier, we will end up with some thread in the mm
+ * running without a core sync.
+ *
+ * For RSEQ, don't rseq_preempt() the caller. User code
+ * is not supposed to issue syscalls at all from inside an
+ * rseq critical section.
+ */
+ if (flags != MEMBARRIER_FLAG_SYNC_CORE) {
+ preempt_disable();
+ smp_call_function_many(tmpmask, ipi_func, NULL, true);
+ preempt_enable();
+ } else {
+ on_each_cpu_mask(tmpmask, ipi_func, NULL, true);
+ }
+ }
out:
if (cpu_id < 0)
--
2.28.0
The patch titled
Subject: mm/hugetlb: clear compound_nr before freeing gigantic pages
has been added to the -mm tree. Its filename is
mm-hugetlb-clear-compound_nr-before-freeing-gigantic-pages.patch
------------------------------------------------------
From: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Subject: mm/hugetlb: clear compound_nr before freeing gigantic pages
Commit 1378a5ee451a ("mm: store compound_nr as well as compound_order")
added a compound_nr counter to the first tail struct page, overlaying it
with page->mapping. The overlay itself is fine, but when gigantic
hugepages are freed via free_contig_range(), a "bad page" check triggers
for the non-NULL page->mapping on the first tail page:
[ 276.681603] BUG: Bad page state in process bash pfn:380001
[ 276.681614] page:00000000c35f0856 refcount:0 mapcount:0 mapping:00000000126b68aa index:0x0 pfn:0x380001
[ 276.681620] aops:0x0
[ 276.681622] flags: 0x3ffff00000000000()
[ 276.681626] raw: 3ffff00000000000 0000000000000100 0000000000000122 0000000100000000
[ 276.681628] raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
[ 276.681630] page dumped because: non-NULL mapping
[ 276.681632] Modules linked in:
[ 276.681637] CPU: 6 PID: 616 Comm: bash Not tainted 5.10.0-rc7-next-20201208 #1
[ 276.681639] Hardware name: IBM 3906 M03 703 (LPAR)
[ 276.681641] Call Trace:
[ 276.681648] [<0000000458c252b6>] show_stack+0x6e/0xe8
[ 276.681652] [<000000045971cf60>] dump_stack+0x90/0xc8
[ 276.681656] [<0000000458e8b186>] bad_page+0xd6/0x130
[ 276.681658] [<0000000458e8cdea>] free_pcppages_bulk+0x26a/0x800
[ 276.681661] [<0000000458e8e67e>] free_unref_page+0x6e/0x90
[ 276.681663] [<0000000458e8ea6c>] free_contig_range+0x94/0xe8
[ 276.681666] [<0000000458ea5e54>] update_and_free_page+0x1c4/0x2c8
[ 276.681669] [<0000000458ea784e>] free_pool_huge_page+0x11e/0x138
[ 276.681671] [<0000000458ea8530>] set_max_huge_pages+0x228/0x300
[ 276.681673] [<0000000458ea86c0>] nr_hugepages_store_common+0xb8/0x130
[ 276.681678] [<0000000458fd5b6a>] kernfs_fop_write+0xd2/0x218
[ 276.681681] [<0000000458ef9da0>] vfs_write+0xb0/0x2b8
[ 276.681684] [<0000000458efa15c>] ksys_write+0xac/0xe0
[ 276.681687] [<000000045972c5ca>] system_call+0xe6/0x288
[ 276.681730] Disabling lock debugging due to kernel taint
This happens because destroy_compound_gigantic_page() clears only the
compound_order: set_compound_order(page, 0) sets compound_nr to
1U << 0 == 1 rather than 0, so the overlaid page->mapping stays non-NULL.
Fix this by explicitly clearing compound_nr for the first tail page after
calling set_compound_order(page, 0).
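For reference, a simplified sketch (not verbatim kernel source) of what
set_compound_order() does after commit 1378a5ee451a, showing why order 0
leaves compound_nr at 1:

    /*
     * compound_order and compound_nr both live in the first tail page,
     * and compound_nr overlays page->mapping.
     */
    static inline void set_compound_order(struct page *page, unsigned int order)
    {
            page[1].compound_order = order;
            page[1].compound_nr = 1U << order;  /* order == 0 yields 1, not 0 */
    }

Hence the explicit page[1].compound_nr = 0 added below, which keeps the
overlaid ->mapping NULL when the pages are later handed to
free_contig_range().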
Link: https://lkml.kernel.org/r/20201208182813.66391-2-gerald.schaefer@linux.ibm.…
Fixes: 1378a5ee451a ("mm: store compound_nr as well as compound_order")
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: <stable@vger.kernel.org> [5.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/hugetlb.c | 1 +
1 file changed, 1 insertion(+)
--- a/mm/hugetlb.c~mm-hugetlb-clear-compound_nr-before-freeing-gigantic-pages
+++ a/mm/hugetlb.c
@@ -1216,6 +1216,7 @@ static void destroy_compound_gigantic_pa
}
set_compound_order(page, 0);
+ page[1].compound_nr = 0;
__ClearPageHead(page);
}
_