The patch titled
Subject: hfsplus: prevent corruption in shrinking truncate
has been added to the -mm tree. Its filename is
hfsplus-prevent-corruption-in-shrinking-truncate.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/hfsplus-prevent-corruption-in-shr…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/hfsplus-prevent-corruption-in-shr…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Jouni Roivas <jouni.roivas(a)tuxera.com>
Subject: hfsplus: prevent corruption in shrinking truncate
I believe there are some issues introduced by commit 31651c607151
("hfsplus: avoid deadlock on file truncation")
HFS+ has extent records which always contains 8 extents. In case the
first extent record in catalog file gets full, new ones are allocated from
extents overflow file.
In case shrinking truncate happens to middle of an extent record which
locates in extents overflow file, the logic in hfsplus_file_truncate() was
changed so that call to hfs_brec_remove() is not guarded any more.
Right action would be just freeing the extents that exceed the new size
inside extent record by calling hfsplus_free_extents(), and then check if
the whole extent record should be removed. However since the guard
(blk_cnt > start) is now after the call to hfs_brec_remove(), this has
unfortunate effect that the last matching extent record is removed
unconditionally.
To reproduce this issue, create a file which has at least 10 extents, and
then perform shrinking truncate into middle of the last extent record, so
that the number of remaining extents is not under or divisible by 8. This
causes the last extent record (8 extents) to be removed totally instead of
truncating into middle of it. Thus this causes corruption, and lost data.
Fix for this is simply checking if the new truncated end is below the
start of this extent record, making it safe to remove the full extent
record. However call to hfs_brec_remove() can't be moved to it's previous
place since we're dropping ->tree_lock and it can cause a race condition
and the cached info being invalidated possibly corrupting the node data.
Another issue is related to this one. When entering into the block
(blk_cnt > start) we are not holding the ->tree_lock. We break out from
the loop not holding the lock, but hfs_find_exit() does unlock it. Not
sure if it's possible for someone else to take the lock under our feet,
but it can cause hard to debug errors and premature unlocking. Even if
there's no real risk of it, the locking should still always be kept in
balance. Thus taking the lock now just before the check.
Link: https://lkml.kernel.org/r/20210429165139.3082828-1-jouni.roivas@tuxera.com
Fixes: 31651c607151f ("hfsplus: avoid deadlock on file truncation")
Signed-off-by: Jouni Roivas <jouni.roivas(a)tuxera.com>
Reviewed-by: Anton Altaparmakov <anton(a)tuxera.com>
Cc: Anatoly Trosinenko <anatoly.trosinenko(a)gmail.com>
Cc: Viacheslav Dubeyko <slava(a)dubeyko.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/hfsplus/extents.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
--- a/fs/hfsplus/extents.c~hfsplus-prevent-corruption-in-shrinking-truncate
+++ a/fs/hfsplus/extents.c
@@ -598,13 +598,15 @@ void hfsplus_file_truncate(struct inode
res = __hfsplus_ext_cache_extent(&fd, inode, alloc_cnt);
if (res)
break;
- hfs_brec_remove(&fd);
- mutex_unlock(&fd.tree->tree_lock);
start = hip->cached_start;
+ if (blk_cnt <= start)
+ hfs_brec_remove(&fd);
+ mutex_unlock(&fd.tree->tree_lock);
hfsplus_free_extents(sb, hip->cached_extents,
alloc_cnt - start, alloc_cnt - blk_cnt);
hfsplus_dump_extent(hip->cached_extents);
+ mutex_lock(&fd.tree->tree_lock);
if (blk_cnt > start) {
hip->extent_state |= HFSPLUS_EXT_DIRTY;
break;
@@ -612,7 +614,6 @@ void hfsplus_file_truncate(struct inode
alloc_cnt = start;
hip->cached_start = hip->cached_blocks = 0;
hip->extent_state &= ~(HFSPLUS_EXT_DIRTY | HFSPLUS_EXT_NEW);
- mutex_lock(&fd.tree->tree_lock);
}
hfs_find_exit(&fd);
_
Patches currently in -mm which might be from jouni.roivas(a)tuxera.com are
hfsplus-prevent-corruption-in-shrinking-truncate.patch
The patch titled
Subject: squashfs: fix divide error in calculate_skip()
has been added to the -mm tree. Its filename is
squashfs-fix-divide-error-in-calculate_skip.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/squashfs-fix-divide-error-in-calc…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/squashfs-fix-divide-error-in-calc…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Phillip Lougher <phillip(a)squashfs.org.uk>
Subject: squashfs: fix divide error in calculate_skip()
Sysbot has reported a "divide error" which has been identified as being
caused by a corrupted file_size value within the file inode. This value
has been corrupted to a much larger value than expected.
Calculate_skip() is passed i_size_read(inode) >> msblk->block_log. Due to
the file_size value corruption this overflows the int argument/variable in
that function, leading to the divide error.
This patch changes the function to use u64. This will accommodate any
unexpectedly large values due to corruption.
The value returned from calculate_skip() is clamped to be never more than
SQUASHFS_CACHED_BLKS - 1, or 7. So file_size corruption does not lead to
an unexpectedly large return result here.
Link: https://lkml.kernel.org/r/20210507152618.9447-1-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip(a)squashfs.org.uk>
Reported-by: <syzbot+e8f781243ce16ac2f962(a)syzkaller.appspotmail.com>
Reported-by: <syzbot+7b98870d4fec9447b951(a)syzkaller.appspotmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/squashfs/file.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/fs/squashfs/file.c~squashfs-fix-divide-error-in-calculate_skip
+++ a/fs/squashfs/file.c
@@ -211,11 +211,11 @@ failure:
* If the skip factor is limited in this way then the file will use multiple
* slots.
*/
-static inline int calculate_skip(int blocks)
+static inline int calculate_skip(u64 blocks)
{
- int skip = blocks / ((SQUASHFS_META_ENTRIES + 1)
+ u64 skip = blocks / ((SQUASHFS_META_ENTRIES + 1)
* SQUASHFS_META_INDEXES);
- return min(SQUASHFS_CACHED_BLKS - 1, skip + 1);
+ return min((u64) SQUASHFS_CACHED_BLKS - 1, skip + 1);
}
_
Patches currently in -mm which might be from phillip(a)squashfs.org.uk are
squashfs-fix-divide-error-in-calculate_skip.patch
The patch titled
Subject: fs/epoll: restore waking from ep_done_scan()
has been removed from the -mm tree. Its filename was
fs-epoll-restore-waking-from-ep_done_scan.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Davidlohr Bueso <dave(a)stgolabs.net>
Subject: fs/epoll: restore waking from ep_done_scan()
339ddb53d373 (fs/epoll: remove unnecessary wakeups of nested epoll)
changed the userspace visible behavior of exclusive waiters blocked on a
common epoll descriptor upon a single event becoming ready. Previously,
all tasks doing epoll_wait would awake, and now only one is awoken,
potentially causing missed wakeups on applications that rely on this
behavior, such as Apache Qpid.
While the aforementioned commit aims at having only a wakeup single path
in ep_poll_callback (with the exceptions of epoll_ctl cases), we need to
restore the wakeup in what was the old ep_scan_ready_list() such that the
next thread can be awoken, in a cascading style, after the waker's
corresponding ep_send_events().
Link: https://lkml.kernel.org/r/20210405231025.33829-3-dave@stgolabs.net
Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
Signed-off-by: Davidlohr Bueso <dbueso(a)suse.de>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Jason Baron <jbaron(a)akamai.com>
Cc: Roman Penyaev <rpenyaev(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/eventpoll.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/fs/eventpoll.c~fs-epoll-restore-waking-from-ep_done_scan
+++ a/fs/eventpoll.c
@@ -657,6 +657,12 @@ static void ep_done_scan(struct eventpol
*/
list_splice(txlist, &ep->rdllist);
__pm_relax(ep->ws);
+
+ if (!list_empty(&ep->rdllist)) {
+ if (waitqueue_active(&ep->wq))
+ wake_up(&ep->wq);
+ }
+
write_unlock_irq(&ep->lock);
}
_
Patches currently in -mm which might be from dave(a)stgolabs.net are
The patch titled
Subject: mm: page_alloc: ignore init_on_free=1 for debug_pagealloc=1
has been removed from the -mm tree. Its filename was
mm-page_alloc-ignore-init_on_free=1-for-debug_pagealloc=1.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Sergei Trofimovich <slyfox(a)gentoo.org>
Subject: mm: page_alloc: ignore init_on_free=1 for debug_pagealloc=1
On !ARCH_SUPPORTS_DEBUG_PAGEALLOC (like ia64) debug_pagealloc=1 implies
page_poison=on:
if (page_poisoning_enabled() ||
(!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
debug_pagealloc_enabled()))
static_branch_enable(&_page_poisoning_enabled);
page_poison=on needs to override init_on_free=1.
Before the change it did not work as expected for the following case:
- have PAGE_POISONING=y
- have page_poison unset
- have !ARCH_SUPPORTS_DEBUG_PAGEALLOC arch (like ia64)
- have init_on_free=1
- have debug_pagealloc=1
That way we get both keys enabled:
- static_branch_enable(&init_on_free);
- static_branch_enable(&_page_poisoning_enabled);
which leads to poisoned pages returned for __GFP_ZERO pages.
After the change we execute only:
- static_branch_enable(&_page_poisoning_enabled);
and ignore init_on_free=1.
Link: https://lkml.kernel.org/r/20210329222555.3077928-1-slyfox@gentoo.org
Link: https://lkml.org/lkml/2021/3/26/443
Fixes: 8db26a3d4735 ("mm, page_poison: use static key more efficiently")
Signed-off-by: Sergei Trofimovich <slyfox(a)gentoo.org>
Acked-by: Vlastimil Babka <vbabka(a)suse.cz>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 30 +++++++++++++++++-------------
1 file changed, 17 insertions(+), 13 deletions(-)
--- a/mm/page_alloc.c~mm-page_alloc-ignore-init_on_free=1-for-debug_pagealloc=1
+++ a/mm/page_alloc.c
@@ -786,32 +786,36 @@ static inline void clear_page_guard(stru
*/
void init_mem_debugging_and_hardening(void)
{
+ bool page_poisoning_requested = false;
+
+#ifdef CONFIG_PAGE_POISONING
+ /*
+ * Page poisoning is debug page alloc for some arches. If
+ * either of those options are enabled, enable poisoning.
+ */
+ if (page_poisoning_enabled() ||
+ (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
+ debug_pagealloc_enabled())) {
+ static_branch_enable(&_page_poisoning_enabled);
+ page_poisoning_requested = true;
+ }
+#endif
+
if (_init_on_alloc_enabled_early) {
- if (page_poisoning_enabled())
+ if (page_poisoning_requested)
pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, "
"will take precedence over init_on_alloc\n");
else
static_branch_enable(&init_on_alloc);
}
if (_init_on_free_enabled_early) {
- if (page_poisoning_enabled())
+ if (page_poisoning_requested)
pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, "
"will take precedence over init_on_free\n");
else
static_branch_enable(&init_on_free);
}
-#ifdef CONFIG_PAGE_POISONING
- /*
- * Page poisoning is debug page alloc for some arches. If
- * either of those options are enabled, enable poisoning.
- */
- if (page_poisoning_enabled() ||
- (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
- debug_pagealloc_enabled()))
- static_branch_enable(&_page_poisoning_enabled);
-#endif
-
#ifdef CONFIG_DEBUG_PAGEALLOC
if (!debug_pagealloc_enabled())
return;
_
Patches currently in -mm which might be from slyfox(a)gentoo.org are
The patch titled
Subject: ovl: fix reference counting in ovl_mmap error path
has been removed from the -mm tree. Its filename was
ovl-fix-reference-counting-in-ovl_mmap-error-path.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Christian König <christian.koenig(a)amd.com>
Subject: ovl: fix reference counting in ovl_mmap error path
mmap_region() now calls fput() on the vma->vm_file.
Fix this by using vma_set_file() so it doesn't need to be handled manually
here any more.
Link: https://lkml.kernel.org/r/20210421132012.82354-2-christian.koenig@amd.com
Fixes: 1527f926fd04 ("mm: mmap: fix fput in error path v2")
Signed-off-by: Christian König <christian.koenig(a)amd.com>
Reviewed-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: Jan Harkes <jaharkes(a)cs.cmu.edu>
Cc: Miklos Szeredi <miklos(a)szeredi.hu>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: <stable(a)vger.kernel.org> [5.11+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/overlayfs/file.c | 11 +----------
1 file changed, 1 insertion(+), 10 deletions(-)
--- a/fs/overlayfs/file.c~ovl-fix-reference-counting-in-ovl_mmap-error-path
+++ a/fs/overlayfs/file.c
@@ -430,20 +430,11 @@ static int ovl_mmap(struct file *file, s
if (WARN_ON(file != vma->vm_file))
return -EIO;
- vma->vm_file = get_file(realfile);
+ vma_set_file(vma, realfile);
old_cred = ovl_override_creds(file_inode(file)->i_sb);
ret = call_mmap(vma->vm_file, vma);
revert_creds(old_cred);
-
- if (ret) {
- /* Drop reference count from new vm_file value */
- fput(realfile);
- } else {
- /* Drop reference count from previous vm_file value */
- fput(file);
- }
-
ovl_file_accessed(file);
return ret;
_
Patches currently in -mm which might be from christian.koenig(a)amd.com are
The patch titled
Subject: coda: fix reference counting in coda_file_mmap error path
has been removed from the -mm tree. Its filename was
coda-fix-reference-counting-in-coda_file_mmap-error-path.patch
This patch was dropped because it was merged into mainline or a subsystem tree
------------------------------------------------------
From: Christian König <christian.koenig(a)amd.com>
Subject: coda: fix reference counting in coda_file_mmap error path
mmap_region() now calls fput() on the vma->vm_file.
So we need to drop the extra reference on the coda file instead of the
host file.
Link: https://lkml.kernel.org/r/20210421132012.82354-1-christian.koenig@amd.com
Fixes: 1527f926fd04 ("mm: mmap: fix fput in error path v2")
Signed-off-by: Christian König <christian.koenig(a)amd.com>
Reviewed-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Acked-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Cc: Miklos Szeredi <miklos(a)szeredi.hu>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: <stable(a)vger.kernel.org> [5.11+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/coda/file.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/fs/coda/file.c~coda-fix-reference-counting-in-coda_file_mmap-error-path
+++ a/fs/coda/file.c
@@ -175,10 +175,10 @@ coda_file_mmap(struct file *coda_file, s
ret = call_mmap(vma->vm_file, vma);
if (ret) {
- /* if call_mmap fails, our caller will put coda_file so we
- * should drop the reference to the host_file that we got.
+ /* if call_mmap fails, our caller will put host_file so we
+ * should drop the reference to the coda_file that we got.
*/
- fput(host_file);
+ fput(coda_file);
kfree(cvm_ops);
} else {
/* here we add redirects for the open/close vm_operations */
_
Patches currently in -mm which might be from christian.koenig(a)amd.com are
I'm announcing the release of the 4.19.190 kernel.
All users of the 4.19 kernel series must upgrade.
The updated 4.19.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.19.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/mips/vdso/gettimeofday.c | 14 +-
arch/x86/kernel/acpi/boot.c | 25 +---
arch/x86/kernel/setup.c | 7 -
drivers/acpi/tables.c | 42 ++++++
drivers/net/usb/ax88179_178a.c | 4
drivers/net/wireless/intel/iwlwifi/pcie/tx-gen2.c | 7 -
drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 7 -
drivers/platform/x86/thinkpad_acpi.c | 31 +++--
drivers/staging/erofs/inode.c | 135 ++++++++++++++--------
drivers/usb/core/quirks.c | 4
fs/overlayfs/super.c | 12 +
include/linux/acpi.h | 9 +
kernel/bpf/verifier.c | 12 -
sound/usb/quirks-table.h | 10 +
15 files changed, 221 insertions(+), 100 deletions(-)
Chris Chiu (1):
USB: Add reset-resume quirk for WD19's Realtek Hub
Daniel Borkmann (1):
bpf: Fix masking negation logic upon negative dst register
Gao Xiang (1):
erofs: fix extended inode could cross boundary
Greg Kroah-Hartman (1):
Linux 4.19.190
Jiri Kosina (2):
iwlwifi: Fix softirq/hardirq disabling in iwl_pcie_enqueue_hcmd()
iwlwifi: Fix softirq/hardirq disabling in iwl_pcie_gen2_enqueue_hcmd()
Kai-Heng Feng (1):
USB: Add LPM quirk for Lenovo ThinkPad USB-C Dock Gen2 Ethernet
Mark Pearson (1):
platform/x86: thinkpad_acpi: Correct thermal sensor allocation
Miklos Szeredi (1):
ovl: allow upperdir inside lowerdir
Phillip Potter (1):
net: usb: ax88179_178a: initialize local variables before use
Rafael J. Wysocki (2):
ACPI: tables: x86: Reserve memory occupied by ACPI tables
ACPI: x86: Call acpi_boot_table_init() after acpi_table_upgrade()
Romain Naour (1):
mips: Do not include hi and lo in clobber list for R6
Takashi Iwai (1):
ALSA: usb-audio: Add MIDI quirk for Vox ToneLab EX
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 8c9af478c06bb1ab1422f90d8ecbc53defd44bc3 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (VMware)" <rostedt(a)goodmis.org>
Date: Wed, 5 May 2021 10:38:24 -0400
Subject: [PATCH] ftrace: Handle commands when closing set_ftrace_filter file
# echo switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter
will cause switch_mm to stop tracing by the traceoff command.
# echo -n switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter
does nothing.
The reason is that the parsing in the write function only processes
commands if it finished parsing (there is white space written after the
command). That's to handle:
write(fd, "switch_mm:", 10);
write(fd, "traceoff", 8);
cases, where the command is broken over multiple writes.
The problem is if the file descriptor is closed, then the write call is
not processed, and the command needs to be processed in the release code.
The release code can handle matching of functions, but does not handle
commands.
Cc: stable(a)vger.kernel.org
Fixes: eda1e32855656 ("tracing: handle broken names in ftrace filter")
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 057e962ca5ce..c57508445faa 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5591,7 +5591,10 @@ int ftrace_regex_release(struct inode *inode, struct file *file)
parser = &iter->parser;
if (trace_parser_loaded(parser)) {
- ftrace_match_records(iter->hash, parser->buffer, parser->idx);
+ int enable = !(iter->flags & FTRACE_ITER_NOTRACE);
+
+ ftrace_process_regex(iter, parser->buffer,
+ parser->idx, enable);
}
trace_parser_put(parser);
A valid implementation choice for the ChooseRandomNonExcludedTag()
pseudocode function used by IRG is to behave in the same way as with
GCR_EL1.RRND=0. This would mean that RGSR_EL1.SEED is used as an LFSR
which must have a non-zero value in order for IRG to properly produce
pseudorandom numbers. However, RGSR_EL1 is reset to an UNKNOWN value
on soft reset and thus may reset to 0. Therefore we must initialize
RGSR_EL1.SEED to a non-zero value in order to ensure that IRG behaves
as expected.
Signed-off-by: Peter Collingbourne <pcc(a)google.com>
Cc: stable(a)vger.kernel.org
Link: https://linux-review.googlesource.com/id/I2b089b6c7d6f17ee37e2f0db7df5ad5bc…
---
arch/arm64/mm/proc.S | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 0a48191534ff..e8e1eaee4b3f 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -437,7 +437,7 @@ SYM_FUNC_START(__cpu_setup)
mrs x10, ID_AA64PFR1_EL1
ubfx x10, x10, #ID_AA64PFR1_MTE_SHIFT, #4
cmp x10, #ID_AA64PFR1_MTE
- b.lt 1f
+ b.lt 2f
/* Normal Tagged memory type at the corresponding MAIR index */
mov x10, #MAIR_ATTR_NORMAL_TAGGED
@@ -447,6 +447,19 @@ SYM_FUNC_START(__cpu_setup)
mov x10, #(SYS_GCR_EL1_RRND | SYS_GCR_EL1_EXCL_MASK)
msr_s SYS_GCR_EL1, x10
+ /*
+ * Initialize RGSR_EL1.SEED to a non-zero value. If the implementation
+ * chooses to implement GCR_EL1.RRND=1 in the same way as RRND=0 then
+ * the seed will be used as an LFSR, so it should be put onto the MLS.
+ */
+ mrs x10, CNTVCT_EL0
+ and x10, x10, #SYS_RGSR_EL1_SEED_MASK
+ cbnz x10, 1f
+ mov x10, #1
+1:
+ lsl x10, x10, #SYS_RGSR_EL1_SEED_SHIFT
+ msr_s SYS_RGSR_EL1, x10
+
/* clear any pending tag check faults in TFSR*_EL1 */
msr_s SYS_TFSR_EL1, xzr
msr_s SYS_TFSRE0_EL1, xzr
@@ -454,7 +467,7 @@ SYM_FUNC_START(__cpu_setup)
/* set the TCR_EL1 bits */
mov_q x10, TCR_KASAN_HW_FLAGS
orr tcr, tcr, x10
-1:
+2:
#endif
tcr_clear_errata_bits tcr, x9, x5
--
2.31.1.607.g51e8a6a459-goog
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 2d036dfa5f10df9782f5278fc591d79d283c1fad Mon Sep 17 00:00:00 2001
From: Chen Jun <chenjun102(a)huawei.com>
Date: Wed, 14 Apr 2021 03:04:49 +0000
Subject: [PATCH] posix-timers: Preserve return value in clock_adjtime32()
The return value on success (>= 0) is overwritten by the return value of
put_old_timex32(). That works correct in the fault case, but is wrong for
the success case where put_old_timex32() returns 0.
Just check the return value of put_old_timex32() and return -EFAULT in case
it is not zero.
[ tglx: Massage changelog ]
Fixes: 3a4d44b61625 ("ntp: Move adjtimex related compat syscalls to native counterparts")
Signed-off-by: Chen Jun <chenjun102(a)huawei.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Richard Cochran <richardcochran(a)gmail.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20210414030449.90692-1-chenjun102@huawei.com
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index bf540f5a4115..dd5697d7347b 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -1191,8 +1191,8 @@ SYSCALL_DEFINE2(clock_adjtime32, clockid_t, which_clock,
err = do_clock_adjtime(which_clock, &ktx);
- if (err >= 0)
- err = put_old_timex32(utp, &ktx);
+ if (err >= 0 && put_old_timex32(utp, &ktx))
+ return -EFAULT;
return err;
}
The patch below does not apply to the 5.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f9690f426b2134cc3e74bfc5d9dfd6a4b2ca5281 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Tue, 20 Apr 2021 10:55:44 +0100
Subject: [PATCH] btrfs: fix race when picking most recent mod log operation
for an old root
Commit dbcc7d57bffc0c ("btrfs: fix race when cloning extent buffer during
rewind of an old root"), fixed a race when we need to rewind the extent
buffer of an old root. It was caused by picking a new mod log operation
for the extent buffer while getting a cloned extent buffer with an outdated
number of items (off by -1), because we cloned the extent buffer without
locking it first.
However there is still another similar race, but in the opposite direction.
The cloned extent buffer has a number of items that does not match the
number of tree mod log operations that are going to be replayed. This is
because right after we got the last (most recent) tree mod log operation to
replay and before locking and cloning the extent buffer, another task adds
a new pointer to the extent buffer, which results in adding a new tree mod
log operation and incrementing the number of items in the extent buffer.
So after cloning we have mismatch between the number of items in the extent
buffer and the number of mod log operations we are going to apply to it.
This results in hitting a BUG_ON() that produces the following stack trace:
------------[ cut here ]------------
kernel BUG at fs/btrfs/tree-mod-log.c:675!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G W 5.12.0-7d1efdf501f8-misc-next+ #99
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
Code: 05 48 8d 74 10 (...)
RSP: 0018:ffffc90001027090 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
FS: 00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
Call Trace:
btrfs_get_old_root+0x16a/0x5c0
? lock_downgrade+0x400/0x400
btrfs_search_old_slot+0x192/0x520
? btrfs_search_slot+0x1090/0x1090
? free_extent_buffer.part.61+0xd7/0x140
? free_extent_buffer+0x13/0x20
resolve_indirect_refs+0x3e9/0xfc0
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? add_prelim_ref.part.11+0x150/0x150
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? rb_insert_color+0x340/0x360
? prelim_ref_insert+0x12d/0x430
find_parent_nodes+0x5c3/0x1830
? stack_trace_save+0x87/0xb0
? resolve_indirect_refs+0xfc0/0xfc0
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? lockdep_hardirqs_on_prepare+0x210/0x210
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? ___might_sleep+0x10f/0x1e0
? __kasan_kmalloc+0x9d/0xd0
? trace_hardirqs_on+0x55/0x120
btrfs_find_all_roots_safe+0x142/0x1e0
? find_parent_nodes+0x1830/0x1830
? trace_hardirqs_on+0x55/0x120
? ulist_free+0x1f/0x30
? btrfs_inode_flags_to_xflags+0x50/0x50
iterate_extent_inodes+0x20e/0x580
? tree_backref_for_extent+0x230/0x230
? release_extent_buffer+0x225/0x280
? read_extent_buffer+0xdd/0x110
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? _raw_spin_unlock+0x22/0x30
? release_extent_buffer+0x225/0x280
iterate_inodes_from_logical+0x129/0x170
? iterate_inodes_from_logical+0x129/0x170
? btrfs_inode_flags_to_xflags+0x50/0x50
? iterate_extent_inodes+0x580/0x580
? __vmalloc_node+0x92/0xb0
? init_data_container+0x34/0xb0
? init_data_container+0x34/0xb0
? kvmalloc_node+0x60/0x80
btrfs_ioctl_logical_to_ino+0x158/0x230
btrfs_ioctl+0x2038/0x4360
? __kasan_check_write+0x14/0x20
? mmput+0x3b/0x220
? btrfs_ioctl_get_supported_features+0x30/0x30
? __kasan_check_read+0x11/0x20
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __might_fault+0x64/0xd0
? __kasan_check_read+0x11/0x20
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? lockdep_hardirqs_on_prepare+0x13/0x210
? _raw_spin_unlock_irqrestore+0x51/0x63
? __kasan_check_read+0x11/0x20
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __task_pid_nr_ns+0xd3/0x250
? __kasan_check_read+0x11/0x20
? __fget_files+0x160/0x230
? __fget_light+0xf2/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f29ae85b427
Code: 00 00 90 48 8b (...)
RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
Modules linked in:
---[ end trace 85e5fce078dfbe04 ]---
(gdb) l *(tree_mod_log_rewind+0x3b1)
0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
670 * the modification. As we're going backwards, we do the
671 * opposite of each operation here.
672 */
673 switch (tm->op) {
674 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
675 BUG_ON(tm->slot < n);
676 fallthrough;
677 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
678 case BTRFS_MOD_LOG_KEY_REMOVE:
679 btrfs_set_node_key(eb, &tm->key, tm->slot);
(gdb) quit
The following steps explain in more detail how it happens:
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
This is task A;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
pointing to ebX->start. We only have one item in eb X, so we create
only one tree mod log operation, and store in the "tm_list" array;
4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
set to the level of eb X and ->generation set to the generation of eb X;
5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
"tm_list" as argument. After that, tree_mod_log_free_eb() calls
tree_mod_log_insert(). This inserts the mod log operation of type
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);
6) Then, after inserting the "tm_list" single element into the tree mod
log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);
7) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
8) Later some other task B allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
9) The tree mod log user task calls btrfs_search_old_slot(), which calls
btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
10) The first iteration of the while loop finds the tree mod log element
with sequence number 3, for the logical address of eb Y and of type
BTRFS_MOD_LOG_ROOT_REPLACE;
11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
break out of the loop, and set root_logical to point to
tm->old_root.logical, which corresponds to the logical address of
eb X;
12) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
corresponds to the old slot 0 of eb X (eb X had only 1 item in it
before being freed at step 7);
13) We then break out of the while loop and return the tree mod log
operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
for slot 0 of eb X, to btrfs_get_old_root();
14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
operation and set "logical" to the logical address of eb X, which was
the old root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
15) But before calling tree_mod_log_search(), task B locks eb X, adds a
key to eb X, which results in adding a tree mod log operation of type
BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
log, and increments the number of items in eb X from 0 to 1.
Now fs_info->tree_mod_seq has a value of 4;
16) Task A then calls tree_mod_log_search(), which returns the most recent
tree mod log operation for eb X, which is the one just added by task B
at the previous step, with a sequence number of 4, a type of
BTRFS_MOD_LOG_KEY_ADD and for slot 0;
17) Before task A locks and clones eb X, task A adds another key to eb X,
which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
with a sequence number of 5, for slot 1 of eb X, increments the
number of items in eb X from 1 to 2, and unlocks eb X.
Now fs_info->tree_mod_seq has a value of 5;
18) Task A then locks eb X and clones it. The clone has a value of 2 for
the number of items and the pointer "tm" points to the tree mod log
operation with sequence number 4, not the most recent one with a
sequence number of 5, so there is mismatch between the number of
mod log operations that are going to be applied to the cloned version
of eb X and the number of items in the clone;
19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
tree mod log operation with sequence number 4 and a type of
BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;
20) At tree_mod_log_rewind(), we set the local variable "n" with a value
of 2, which is the number of items in the clone of eb X.
Then in the first iteration of the while loop, we process the mod log
operation with sequence number 4, which is targeted at slot 0 and has
a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
2 to 1.
Then we pick the next tree mod log operation for eb X, which is the
tree mod log operation with a sequence number of 2, a type of
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one
added in step 5 to the tree mod log tree.
We go back to the top of the loop to process this mod log operation,
and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:
(...)
switch (tm->op) {
case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Fix this by checking for a more recent tree mod log operation after locking
and cloning the extent buffer of the old root node, and use it as the first
operation to apply to the cloned extent buffer when rewinding it.
Stable backport notes: due to moved code and renames, in =< 5.11 the
change should be applied to ctree.c:get_old_root.
Reported-by: Zygo Blaxell <ce3g8jdj(a)umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/
Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index 1e52584f3fa5..8a3a14686d3e 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -830,10 +830,30 @@ struct extent_buffer *btrfs_get_old_root(struct btrfs_root *root, u64 time_seq)
"failed to read tree block %llu from get_old_root",
logical);
} else {
+ struct tree_mod_elem *tm2;
+
btrfs_tree_read_lock(old);
eb = btrfs_clone_extent_buffer(old);
+ /*
+ * After the lookup for the most recent tree mod operation
+ * above and before we locked and cloned the extent buffer
+ * 'old', a new tree mod log operation may have been added.
+ * So lookup for a more recent one to make sure the number
+ * of mod log operations we replay is consistent with the
+ * number of items we have in the cloned extent buffer,
+ * otherwise we can hit a BUG_ON when rewinding the extent
+ * buffer.
+ */
+ tm2 = tree_mod_log_search(fs_info, logical, time_seq);
btrfs_tree_read_unlock(old);
free_extent_buffer(old);
+ ASSERT(tm2);
+ ASSERT(tm2 == tm || tm2->seq > tm->seq);
+ if (!tm2 || tm2->seq < tm->seq) {
+ free_extent_buffer(eb);
+ return NULL;
+ }
+ tm = tm2;
}
} else if (old_root) {
eb_root_owner = btrfs_header_owner(eb_root);
The patch below does not apply to the 5.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f9690f426b2134cc3e74bfc5d9dfd6a4b2ca5281 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Tue, 20 Apr 2021 10:55:44 +0100
Subject: [PATCH] btrfs: fix race when picking most recent mod log operation
for an old root
Commit dbcc7d57bffc0c ("btrfs: fix race when cloning extent buffer during
rewind of an old root"), fixed a race when we need to rewind the extent
buffer of an old root. It was caused by picking a new mod log operation
for the extent buffer while getting a cloned extent buffer with an outdated
number of items (off by -1), because we cloned the extent buffer without
locking it first.
However there is still another similar race, but in the opposite direction.
The cloned extent buffer has a number of items that does not match the
number of tree mod log operations that are going to be replayed. This is
because right after we got the last (most recent) tree mod log operation to
replay and before locking and cloning the extent buffer, another task adds
a new pointer to the extent buffer, which results in adding a new tree mod
log operation and incrementing the number of items in the extent buffer.
So after cloning we have mismatch between the number of items in the extent
buffer and the number of mod log operations we are going to apply to it.
This results in hitting a BUG_ON() that produces the following stack trace:
------------[ cut here ]------------
kernel BUG at fs/btrfs/tree-mod-log.c:675!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G W 5.12.0-7d1efdf501f8-misc-next+ #99
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
Code: 05 48 8d 74 10 (...)
RSP: 0018:ffffc90001027090 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
FS: 00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
Call Trace:
btrfs_get_old_root+0x16a/0x5c0
? lock_downgrade+0x400/0x400
btrfs_search_old_slot+0x192/0x520
? btrfs_search_slot+0x1090/0x1090
? free_extent_buffer.part.61+0xd7/0x140
? free_extent_buffer+0x13/0x20
resolve_indirect_refs+0x3e9/0xfc0
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? add_prelim_ref.part.11+0x150/0x150
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? rb_insert_color+0x340/0x360
? prelim_ref_insert+0x12d/0x430
find_parent_nodes+0x5c3/0x1830
? stack_trace_save+0x87/0xb0
? resolve_indirect_refs+0xfc0/0xfc0
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? lockdep_hardirqs_on_prepare+0x210/0x210
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? ___might_sleep+0x10f/0x1e0
? __kasan_kmalloc+0x9d/0xd0
? trace_hardirqs_on+0x55/0x120
btrfs_find_all_roots_safe+0x142/0x1e0
? find_parent_nodes+0x1830/0x1830
? trace_hardirqs_on+0x55/0x120
? ulist_free+0x1f/0x30
? btrfs_inode_flags_to_xflags+0x50/0x50
iterate_extent_inodes+0x20e/0x580
? tree_backref_for_extent+0x230/0x230
? release_extent_buffer+0x225/0x280
? read_extent_buffer+0xdd/0x110
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? _raw_spin_unlock+0x22/0x30
? release_extent_buffer+0x225/0x280
iterate_inodes_from_logical+0x129/0x170
? iterate_inodes_from_logical+0x129/0x170
? btrfs_inode_flags_to_xflags+0x50/0x50
? iterate_extent_inodes+0x580/0x580
? __vmalloc_node+0x92/0xb0
? init_data_container+0x34/0xb0
? init_data_container+0x34/0xb0
? kvmalloc_node+0x60/0x80
btrfs_ioctl_logical_to_ino+0x158/0x230
btrfs_ioctl+0x2038/0x4360
? __kasan_check_write+0x14/0x20
? mmput+0x3b/0x220
? btrfs_ioctl_get_supported_features+0x30/0x30
? __kasan_check_read+0x11/0x20
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __might_fault+0x64/0xd0
? __kasan_check_read+0x11/0x20
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? lockdep_hardirqs_on_prepare+0x13/0x210
? _raw_spin_unlock_irqrestore+0x51/0x63
? __kasan_check_read+0x11/0x20
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __task_pid_nr_ns+0xd3/0x250
? __kasan_check_read+0x11/0x20
? __fget_files+0x160/0x230
? __fget_light+0xf2/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f29ae85b427
Code: 00 00 90 48 8b (...)
RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
Modules linked in:
---[ end trace 85e5fce078dfbe04 ]---
(gdb) l *(tree_mod_log_rewind+0x3b1)
0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
670 * the modification. As we're going backwards, we do the
671 * opposite of each operation here.
672 */
673 switch (tm->op) {
674 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
675 BUG_ON(tm->slot < n);
676 fallthrough;
677 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
678 case BTRFS_MOD_LOG_KEY_REMOVE:
679 btrfs_set_node_key(eb, &tm->key, tm->slot);
(gdb) quit
The following steps explain in more detail how it happens:
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
This is task A;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
pointing to ebX->start. We only have one item in eb X, so we create
only one tree mod log operation, and store in the "tm_list" array;
4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
set to the level of eb X and ->generation set to the generation of eb X;
5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
"tm_list" as argument. After that, tree_mod_log_free_eb() calls
tree_mod_log_insert(). This inserts the mod log operation of type
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);
6) Then, after inserting the "tm_list" single element into the tree mod
log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);
7) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
8) Later some other task B allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
9) The tree mod log user task calls btrfs_search_old_slot(), which calls
btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
10) The first iteration of the while loop finds the tree mod log element
with sequence number 3, for the logical address of eb Y and of type
BTRFS_MOD_LOG_ROOT_REPLACE;
11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
break out of the loop, and set root_logical to point to
tm->old_root.logical, which corresponds to the logical address of
eb X;
12) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
corresponds to the old slot 0 of eb X (eb X had only 1 item in it
before being freed at step 7);
13) We then break out of the while loop and return the tree mod log
operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
for slot 0 of eb X, to btrfs_get_old_root();
14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
operation and set "logical" to the logical address of eb X, which was
the old root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
15) But before calling tree_mod_log_search(), task B locks eb X, adds a
key to eb X, which results in adding a tree mod log operation of type
BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
log, and increments the number of items in eb X from 0 to 1.
Now fs_info->tree_mod_seq has a value of 4;
16) Task A then calls tree_mod_log_search(), which returns the most recent
tree mod log operation for eb X, which is the one just added by task B
at the previous step, with a sequence number of 4, a type of
BTRFS_MOD_LOG_KEY_ADD and for slot 0;
17) Before task A locks and clones eb X, task A adds another key to eb X,
which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
with a sequence number of 5, for slot 1 of eb X, increments the
number of items in eb X from 1 to 2, and unlocks eb X.
Now fs_info->tree_mod_seq has a value of 5;
18) Task A then locks eb X and clones it. The clone has a value of 2 for
the number of items and the pointer "tm" points to the tree mod log
operation with sequence number 4, not the most recent one with a
sequence number of 5, so there is mismatch between the number of
mod log operations that are going to be applied to the cloned version
of eb X and the number of items in the clone;
19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
tree mod log operation with sequence number 4 and a type of
BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;
20) At tree_mod_log_rewind(), we set the local variable "n" with a value
of 2, which is the number of items in the clone of eb X.
Then in the first iteration of the while loop, we process the mod log
operation with sequence number 4, which is targeted at slot 0 and has
a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
2 to 1.
Then we pick the next tree mod log operation for eb X, which is the
tree mod log operation with a sequence number of 2, a type of
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one
added in step 5 to the tree mod log tree.
We go back to the top of the loop to process this mod log operation,
and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:
(...)
switch (tm->op) {
case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Fix this by checking for a more recent tree mod log operation after locking
and cloning the extent buffer of the old root node, and use it as the first
operation to apply to the cloned extent buffer when rewinding it.
Stable backport notes: due to moved code and renames, in =< 5.11 the
change should be applied to ctree.c:get_old_root.
Reported-by: Zygo Blaxell <ce3g8jdj(a)umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/
Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index 1e52584f3fa5..8a3a14686d3e 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -830,10 +830,30 @@ struct extent_buffer *btrfs_get_old_root(struct btrfs_root *root, u64 time_seq)
"failed to read tree block %llu from get_old_root",
logical);
} else {
+ struct tree_mod_elem *tm2;
+
btrfs_tree_read_lock(old);
eb = btrfs_clone_extent_buffer(old);
+ /*
+ * After the lookup for the most recent tree mod operation
+ * above and before we locked and cloned the extent buffer
+ * 'old', a new tree mod log operation may have been added.
+ * So lookup for a more recent one to make sure the number
+ * of mod log operations we replay is consistent with the
+ * number of items we have in the cloned extent buffer,
+ * otherwise we can hit a BUG_ON when rewinding the extent
+ * buffer.
+ */
+ tm2 = tree_mod_log_search(fs_info, logical, time_seq);
btrfs_tree_read_unlock(old);
free_extent_buffer(old);
+ ASSERT(tm2);
+ ASSERT(tm2 == tm || tm2->seq > tm->seq);
+ if (!tm2 || tm2->seq < tm->seq) {
+ free_extent_buffer(eb);
+ return NULL;
+ }
+ tm = tm2;
}
} else if (old_root) {
eb_root_owner = btrfs_header_owner(eb_root);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f9690f426b2134cc3e74bfc5d9dfd6a4b2ca5281 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Tue, 20 Apr 2021 10:55:44 +0100
Subject: [PATCH] btrfs: fix race when picking most recent mod log operation
for an old root
Commit dbcc7d57bffc0c ("btrfs: fix race when cloning extent buffer during
rewind of an old root"), fixed a race when we need to rewind the extent
buffer of an old root. It was caused by picking a new mod log operation
for the extent buffer while getting a cloned extent buffer with an outdated
number of items (off by -1), because we cloned the extent buffer without
locking it first.
However there is still another similar race, but in the opposite direction.
The cloned extent buffer has a number of items that does not match the
number of tree mod log operations that are going to be replayed. This is
because right after we got the last (most recent) tree mod log operation to
replay and before locking and cloning the extent buffer, another task adds
a new pointer to the extent buffer, which results in adding a new tree mod
log operation and incrementing the number of items in the extent buffer.
So after cloning we have mismatch between the number of items in the extent
buffer and the number of mod log operations we are going to apply to it.
This results in hitting a BUG_ON() that produces the following stack trace:
------------[ cut here ]------------
kernel BUG at fs/btrfs/tree-mod-log.c:675!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G W 5.12.0-7d1efdf501f8-misc-next+ #99
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
Code: 05 48 8d 74 10 (...)
RSP: 0018:ffffc90001027090 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
FS: 00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
Call Trace:
btrfs_get_old_root+0x16a/0x5c0
? lock_downgrade+0x400/0x400
btrfs_search_old_slot+0x192/0x520
? btrfs_search_slot+0x1090/0x1090
? free_extent_buffer.part.61+0xd7/0x140
? free_extent_buffer+0x13/0x20
resolve_indirect_refs+0x3e9/0xfc0
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? add_prelim_ref.part.11+0x150/0x150
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? rb_insert_color+0x340/0x360
? prelim_ref_insert+0x12d/0x430
find_parent_nodes+0x5c3/0x1830
? stack_trace_save+0x87/0xb0
? resolve_indirect_refs+0xfc0/0xfc0
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? lockdep_hardirqs_on_prepare+0x210/0x210
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? ___might_sleep+0x10f/0x1e0
? __kasan_kmalloc+0x9d/0xd0
? trace_hardirqs_on+0x55/0x120
btrfs_find_all_roots_safe+0x142/0x1e0
? find_parent_nodes+0x1830/0x1830
? trace_hardirqs_on+0x55/0x120
? ulist_free+0x1f/0x30
? btrfs_inode_flags_to_xflags+0x50/0x50
iterate_extent_inodes+0x20e/0x580
? tree_backref_for_extent+0x230/0x230
? release_extent_buffer+0x225/0x280
? read_extent_buffer+0xdd/0x110
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? _raw_spin_unlock+0x22/0x30
? release_extent_buffer+0x225/0x280
iterate_inodes_from_logical+0x129/0x170
? iterate_inodes_from_logical+0x129/0x170
? btrfs_inode_flags_to_xflags+0x50/0x50
? iterate_extent_inodes+0x580/0x580
? __vmalloc_node+0x92/0xb0
? init_data_container+0x34/0xb0
? init_data_container+0x34/0xb0
? kvmalloc_node+0x60/0x80
btrfs_ioctl_logical_to_ino+0x158/0x230
btrfs_ioctl+0x2038/0x4360
? __kasan_check_write+0x14/0x20
? mmput+0x3b/0x220
? btrfs_ioctl_get_supported_features+0x30/0x30
? __kasan_check_read+0x11/0x20
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __might_fault+0x64/0xd0
? __kasan_check_read+0x11/0x20
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? lockdep_hardirqs_on_prepare+0x13/0x210
? _raw_spin_unlock_irqrestore+0x51/0x63
? __kasan_check_read+0x11/0x20
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __task_pid_nr_ns+0xd3/0x250
? __kasan_check_read+0x11/0x20
? __fget_files+0x160/0x230
? __fget_light+0xf2/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f29ae85b427
Code: 00 00 90 48 8b (...)
RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
Modules linked in:
---[ end trace 85e5fce078dfbe04 ]---
(gdb) l *(tree_mod_log_rewind+0x3b1)
0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
670 * the modification. As we're going backwards, we do the
671 * opposite of each operation here.
672 */
673 switch (tm->op) {
674 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
675 BUG_ON(tm->slot < n);
676 fallthrough;
677 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
678 case BTRFS_MOD_LOG_KEY_REMOVE:
679 btrfs_set_node_key(eb, &tm->key, tm->slot);
(gdb) quit
The following steps explain in more detail how it happens:
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
This is task A;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
pointing to ebX->start. We only have one item in eb X, so we create
only one tree mod log operation, and store in the "tm_list" array;
4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
set to the level of eb X and ->generation set to the generation of eb X;
5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
"tm_list" as argument. After that, tree_mod_log_free_eb() calls
tree_mod_log_insert(). This inserts the mod log operation of type
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);
6) Then, after inserting the "tm_list" single element into the tree mod
log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);
7) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
8) Later some other task B allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
9) The tree mod log user task calls btrfs_search_old_slot(), which calls
btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
10) The first iteration of the while loop finds the tree mod log element
with sequence number 3, for the logical address of eb Y and of type
BTRFS_MOD_LOG_ROOT_REPLACE;
11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
break out of the loop, and set root_logical to point to
tm->old_root.logical, which corresponds to the logical address of
eb X;
12) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
corresponds to the old slot 0 of eb X (eb X had only 1 item in it
before being freed at step 7);
13) We then break out of the while loop and return the tree mod log
operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
for slot 0 of eb X, to btrfs_get_old_root();
14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
operation and set "logical" to the logical address of eb X, which was
the old root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
15) But before calling tree_mod_log_search(), task B locks eb X, adds a
key to eb X, which results in adding a tree mod log operation of type
BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
log, and increments the number of items in eb X from 0 to 1.
Now fs_info->tree_mod_seq has a value of 4;
16) Task A then calls tree_mod_log_search(), which returns the most recent
tree mod log operation for eb X, which is the one just added by task B
at the previous step, with a sequence number of 4, a type of
BTRFS_MOD_LOG_KEY_ADD and for slot 0;
17) Before task A locks and clones eb X, task A adds another key to eb X,
which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
with a sequence number of 5, for slot 1 of eb X, increments the
number of items in eb X from 1 to 2, and unlocks eb X.
Now fs_info->tree_mod_seq has a value of 5;
18) Task A then locks eb X and clones it. The clone has a value of 2 for
the number of items and the pointer "tm" points to the tree mod log
operation with sequence number 4, not the most recent one with a
sequence number of 5, so there is mismatch between the number of
mod log operations that are going to be applied to the cloned version
of eb X and the number of items in the clone;
19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
tree mod log operation with sequence number 4 and a type of
BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;
20) At tree_mod_log_rewind(), we set the local variable "n" with a value
of 2, which is the number of items in the clone of eb X.
Then in the first iteration of the while loop, we process the mod log
operation with sequence number 4, which is targeted at slot 0 and has
a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
2 to 1.
Then we pick the next tree mod log operation for eb X, which is the
tree mod log operation with a sequence number of 2, a type of
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one
added in step 5 to the tree mod log tree.
We go back to the top of the loop to process this mod log operation,
and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:
(...)
switch (tm->op) {
case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Fix this by checking for a more recent tree mod log operation after locking
and cloning the extent buffer of the old root node, and use it as the first
operation to apply to the cloned extent buffer when rewinding it.
Stable backport notes: due to moved code and renames, in =< 5.11 the
change should be applied to ctree.c:get_old_root.
Reported-by: Zygo Blaxell <ce3g8jdj(a)umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/
Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index 1e52584f3fa5..8a3a14686d3e 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -830,10 +830,30 @@ struct extent_buffer *btrfs_get_old_root(struct btrfs_root *root, u64 time_seq)
"failed to read tree block %llu from get_old_root",
logical);
} else {
+ struct tree_mod_elem *tm2;
+
btrfs_tree_read_lock(old);
eb = btrfs_clone_extent_buffer(old);
+ /*
+ * After the lookup for the most recent tree mod operation
+ * above and before we locked and cloned the extent buffer
+ * 'old', a new tree mod log operation may have been added.
+ * So lookup for a more recent one to make sure the number
+ * of mod log operations we replay is consistent with the
+ * number of items we have in the cloned extent buffer,
+ * otherwise we can hit a BUG_ON when rewinding the extent
+ * buffer.
+ */
+ tm2 = tree_mod_log_search(fs_info, logical, time_seq);
btrfs_tree_read_unlock(old);
free_extent_buffer(old);
+ ASSERT(tm2);
+ ASSERT(tm2 == tm || tm2->seq > tm->seq);
+ if (!tm2 || tm2->seq < tm->seq) {
+ free_extent_buffer(eb);
+ return NULL;
+ }
+ tm = tm2;
}
} else if (old_root) {
eb_root_owner = btrfs_header_owner(eb_root);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f9690f426b2134cc3e74bfc5d9dfd6a4b2ca5281 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Tue, 20 Apr 2021 10:55:44 +0100
Subject: [PATCH] btrfs: fix race when picking most recent mod log operation
for an old root
Commit dbcc7d57bffc0c ("btrfs: fix race when cloning extent buffer during
rewind of an old root"), fixed a race when we need to rewind the extent
buffer of an old root. It was caused by picking a new mod log operation
for the extent buffer while getting a cloned extent buffer with an outdated
number of items (off by -1), because we cloned the extent buffer without
locking it first.
However there is still another similar race, but in the opposite direction.
The cloned extent buffer has a number of items that does not match the
number of tree mod log operations that are going to be replayed. This is
because right after we got the last (most recent) tree mod log operation to
replay and before locking and cloning the extent buffer, another task adds
a new pointer to the extent buffer, which results in adding a new tree mod
log operation and incrementing the number of items in the extent buffer.
So after cloning we have mismatch between the number of items in the extent
buffer and the number of mod log operations we are going to apply to it.
This results in hitting a BUG_ON() that produces the following stack trace:
------------[ cut here ]------------
kernel BUG at fs/btrfs/tree-mod-log.c:675!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G W 5.12.0-7d1efdf501f8-misc-next+ #99
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
Code: 05 48 8d 74 10 (...)
RSP: 0018:ffffc90001027090 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
FS: 00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
Call Trace:
btrfs_get_old_root+0x16a/0x5c0
? lock_downgrade+0x400/0x400
btrfs_search_old_slot+0x192/0x520
? btrfs_search_slot+0x1090/0x1090
? free_extent_buffer.part.61+0xd7/0x140
? free_extent_buffer+0x13/0x20
resolve_indirect_refs+0x3e9/0xfc0
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? add_prelim_ref.part.11+0x150/0x150
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? rb_insert_color+0x340/0x360
? prelim_ref_insert+0x12d/0x430
find_parent_nodes+0x5c3/0x1830
? stack_trace_save+0x87/0xb0
? resolve_indirect_refs+0xfc0/0xfc0
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? lockdep_hardirqs_on_prepare+0x210/0x210
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? ___might_sleep+0x10f/0x1e0
? __kasan_kmalloc+0x9d/0xd0
? trace_hardirqs_on+0x55/0x120
btrfs_find_all_roots_safe+0x142/0x1e0
? find_parent_nodes+0x1830/0x1830
? trace_hardirqs_on+0x55/0x120
? ulist_free+0x1f/0x30
? btrfs_inode_flags_to_xflags+0x50/0x50
iterate_extent_inodes+0x20e/0x580
? tree_backref_for_extent+0x230/0x230
? release_extent_buffer+0x225/0x280
? read_extent_buffer+0xdd/0x110
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? _raw_spin_unlock+0x22/0x30
? release_extent_buffer+0x225/0x280
iterate_inodes_from_logical+0x129/0x170
? iterate_inodes_from_logical+0x129/0x170
? btrfs_inode_flags_to_xflags+0x50/0x50
? iterate_extent_inodes+0x580/0x580
? __vmalloc_node+0x92/0xb0
? init_data_container+0x34/0xb0
? init_data_container+0x34/0xb0
? kvmalloc_node+0x60/0x80
btrfs_ioctl_logical_to_ino+0x158/0x230
btrfs_ioctl+0x2038/0x4360
? __kasan_check_write+0x14/0x20
? mmput+0x3b/0x220
? btrfs_ioctl_get_supported_features+0x30/0x30
? __kasan_check_read+0x11/0x20
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __might_fault+0x64/0xd0
? __kasan_check_read+0x11/0x20
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? lockdep_hardirqs_on_prepare+0x13/0x210
? _raw_spin_unlock_irqrestore+0x51/0x63
? __kasan_check_read+0x11/0x20
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __task_pid_nr_ns+0xd3/0x250
? __kasan_check_read+0x11/0x20
? __fget_files+0x160/0x230
? __fget_light+0xf2/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f29ae85b427
Code: 00 00 90 48 8b (...)
RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
Modules linked in:
---[ end trace 85e5fce078dfbe04 ]---
(gdb) l *(tree_mod_log_rewind+0x3b1)
0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
670 * the modification. As we're going backwards, we do the
671 * opposite of each operation here.
672 */
673 switch (tm->op) {
674 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
675 BUG_ON(tm->slot < n);
676 fallthrough;
677 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
678 case BTRFS_MOD_LOG_KEY_REMOVE:
679 btrfs_set_node_key(eb, &tm->key, tm->slot);
(gdb) quit
The following steps explain in more detail how it happens:
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
This is task A;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
pointing to ebX->start. We only have one item in eb X, so we create
only one tree mod log operation, and store in the "tm_list" array;
4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
set to the level of eb X and ->generation set to the generation of eb X;
5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
"tm_list" as argument. After that, tree_mod_log_free_eb() calls
tree_mod_log_insert(). This inserts the mod log operation of type
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);
6) Then, after inserting the "tm_list" single element into the tree mod
log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);
7) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
8) Later some other task B allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
9) The tree mod log user task calls btrfs_search_old_slot(), which calls
btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
10) The first iteration of the while loop finds the tree mod log element
with sequence number 3, for the logical address of eb Y and of type
BTRFS_MOD_LOG_ROOT_REPLACE;
11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
break out of the loop, and set root_logical to point to
tm->old_root.logical, which corresponds to the logical address of
eb X;
12) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
corresponds to the old slot 0 of eb X (eb X had only 1 item in it
before being freed at step 7);
13) We then break out of the while loop and return the tree mod log
operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
for slot 0 of eb X, to btrfs_get_old_root();
14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
operation and set "logical" to the logical address of eb X, which was
the old root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
15) But before calling tree_mod_log_search(), task B locks eb X, adds a
key to eb X, which results in adding a tree mod log operation of type
BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
log, and increments the number of items in eb X from 0 to 1.
Now fs_info->tree_mod_seq has a value of 4;
16) Task A then calls tree_mod_log_search(), which returns the most recent
tree mod log operation for eb X, which is the one just added by task B
at the previous step, with a sequence number of 4, a type of
BTRFS_MOD_LOG_KEY_ADD and for slot 0;
17) Before task A locks and clones eb X, task A adds another key to eb X,
which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
with a sequence number of 5, for slot 1 of eb X, increments the
number of items in eb X from 1 to 2, and unlocks eb X.
Now fs_info->tree_mod_seq has a value of 5;
18) Task A then locks eb X and clones it. The clone has a value of 2 for
the number of items and the pointer "tm" points to the tree mod log
operation with sequence number 4, not the most recent one with a
sequence number of 5, so there is mismatch between the number of
mod log operations that are going to be applied to the cloned version
of eb X and the number of items in the clone;
19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
tree mod log operation with sequence number 4 and a type of
BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;
20) At tree_mod_log_rewind(), we set the local variable "n" with a value
of 2, which is the number of items in the clone of eb X.
Then in the first iteration of the while loop, we process the mod log
operation with sequence number 4, which is targeted at slot 0 and has
a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
2 to 1.
Then we pick the next tree mod log operation for eb X, which is the
tree mod log operation with a sequence number of 2, a type of
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one
added in step 5 to the tree mod log tree.
We go back to the top of the loop to process this mod log operation,
and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:
(...)
switch (tm->op) {
case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Fix this by checking for a more recent tree mod log operation after locking
and cloning the extent buffer of the old root node, and use it as the first
operation to apply to the cloned extent buffer when rewinding it.
Stable backport notes: due to moved code and renames, in =< 5.11 the
change should be applied to ctree.c:get_old_root.
Reported-by: Zygo Blaxell <ce3g8jdj(a)umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/
Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index 1e52584f3fa5..8a3a14686d3e 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -830,10 +830,30 @@ struct extent_buffer *btrfs_get_old_root(struct btrfs_root *root, u64 time_seq)
"failed to read tree block %llu from get_old_root",
logical);
} else {
+ struct tree_mod_elem *tm2;
+
btrfs_tree_read_lock(old);
eb = btrfs_clone_extent_buffer(old);
+ /*
+ * After the lookup for the most recent tree mod operation
+ * above and before we locked and cloned the extent buffer
+ * 'old', a new tree mod log operation may have been added.
+ * So lookup for a more recent one to make sure the number
+ * of mod log operations we replay is consistent with the
+ * number of items we have in the cloned extent buffer,
+ * otherwise we can hit a BUG_ON when rewinding the extent
+ * buffer.
+ */
+ tm2 = tree_mod_log_search(fs_info, logical, time_seq);
btrfs_tree_read_unlock(old);
free_extent_buffer(old);
+ ASSERT(tm2);
+ ASSERT(tm2 == tm || tm2->seq > tm->seq);
+ if (!tm2 || tm2->seq < tm->seq) {
+ free_extent_buffer(eb);
+ return NULL;
+ }
+ tm = tm2;
}
} else if (old_root) {
eb_root_owner = btrfs_header_owner(eb_root);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f9690f426b2134cc3e74bfc5d9dfd6a4b2ca5281 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Tue, 20 Apr 2021 10:55:44 +0100
Subject: [PATCH] btrfs: fix race when picking most recent mod log operation
for an old root
Commit dbcc7d57bffc0c ("btrfs: fix race when cloning extent buffer during
rewind of an old root"), fixed a race when we need to rewind the extent
buffer of an old root. It was caused by picking a new mod log operation
for the extent buffer while getting a cloned extent buffer with an outdated
number of items (off by -1), because we cloned the extent buffer without
locking it first.
However there is still another similar race, but in the opposite direction.
The cloned extent buffer has a number of items that does not match the
number of tree mod log operations that are going to be replayed. This is
because right after we got the last (most recent) tree mod log operation to
replay and before locking and cloning the extent buffer, another task adds
a new pointer to the extent buffer, which results in adding a new tree mod
log operation and incrementing the number of items in the extent buffer.
So after cloning we have mismatch between the number of items in the extent
buffer and the number of mod log operations we are going to apply to it.
This results in hitting a BUG_ON() that produces the following stack trace:
------------[ cut here ]------------
kernel BUG at fs/btrfs/tree-mod-log.c:675!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G W 5.12.0-7d1efdf501f8-misc-next+ #99
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
Code: 05 48 8d 74 10 (...)
RSP: 0018:ffffc90001027090 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
FS: 00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
Call Trace:
btrfs_get_old_root+0x16a/0x5c0
? lock_downgrade+0x400/0x400
btrfs_search_old_slot+0x192/0x520
? btrfs_search_slot+0x1090/0x1090
? free_extent_buffer.part.61+0xd7/0x140
? free_extent_buffer+0x13/0x20
resolve_indirect_refs+0x3e9/0xfc0
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? add_prelim_ref.part.11+0x150/0x150
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? rb_insert_color+0x340/0x360
? prelim_ref_insert+0x12d/0x430
find_parent_nodes+0x5c3/0x1830
? stack_trace_save+0x87/0xb0
? resolve_indirect_refs+0xfc0/0xfc0
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? lockdep_hardirqs_on_prepare+0x210/0x210
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? ___might_sleep+0x10f/0x1e0
? __kasan_kmalloc+0x9d/0xd0
? trace_hardirqs_on+0x55/0x120
btrfs_find_all_roots_safe+0x142/0x1e0
? find_parent_nodes+0x1830/0x1830
? trace_hardirqs_on+0x55/0x120
? ulist_free+0x1f/0x30
? btrfs_inode_flags_to_xflags+0x50/0x50
iterate_extent_inodes+0x20e/0x580
? tree_backref_for_extent+0x230/0x230
? release_extent_buffer+0x225/0x280
? read_extent_buffer+0xdd/0x110
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? _raw_spin_unlock+0x22/0x30
? release_extent_buffer+0x225/0x280
iterate_inodes_from_logical+0x129/0x170
? iterate_inodes_from_logical+0x129/0x170
? btrfs_inode_flags_to_xflags+0x50/0x50
? iterate_extent_inodes+0x580/0x580
? __vmalloc_node+0x92/0xb0
? init_data_container+0x34/0xb0
? init_data_container+0x34/0xb0
? kvmalloc_node+0x60/0x80
btrfs_ioctl_logical_to_ino+0x158/0x230
btrfs_ioctl+0x2038/0x4360
? __kasan_check_write+0x14/0x20
? mmput+0x3b/0x220
? btrfs_ioctl_get_supported_features+0x30/0x30
? __kasan_check_read+0x11/0x20
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __might_fault+0x64/0xd0
? __kasan_check_read+0x11/0x20
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? lockdep_hardirqs_on_prepare+0x13/0x210
? _raw_spin_unlock_irqrestore+0x51/0x63
? __kasan_check_read+0x11/0x20
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __task_pid_nr_ns+0xd3/0x250
? __kasan_check_read+0x11/0x20
? __fget_files+0x160/0x230
? __fget_light+0xf2/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f29ae85b427
Code: 00 00 90 48 8b (...)
RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
Modules linked in:
---[ end trace 85e5fce078dfbe04 ]---
(gdb) l *(tree_mod_log_rewind+0x3b1)
0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
670 * the modification. As we're going backwards, we do the
671 * opposite of each operation here.
672 */
673 switch (tm->op) {
674 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
675 BUG_ON(tm->slot < n);
676 fallthrough;
677 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
678 case BTRFS_MOD_LOG_KEY_REMOVE:
679 btrfs_set_node_key(eb, &tm->key, tm->slot);
(gdb) quit
The following steps explain in more detail how it happens:
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
This is task A;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
pointing to ebX->start. We only have one item in eb X, so we create
only one tree mod log operation, and store in the "tm_list" array;
4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
set to the level of eb X and ->generation set to the generation of eb X;
5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
"tm_list" as argument. After that, tree_mod_log_free_eb() calls
tree_mod_log_insert(). This inserts the mod log operation of type
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);
6) Then, after inserting the "tm_list" single element into the tree mod
log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);
7) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
8) Later some other task B allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
9) The tree mod log user task calls btrfs_search_old_slot(), which calls
btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
10) The first iteration of the while loop finds the tree mod log element
with sequence number 3, for the logical address of eb Y and of type
BTRFS_MOD_LOG_ROOT_REPLACE;
11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
break out of the loop, and set root_logical to point to
tm->old_root.logical, which corresponds to the logical address of
eb X;
12) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
corresponds to the old slot 0 of eb X (eb X had only 1 item in it
before being freed at step 7);
13) We then break out of the while loop and return the tree mod log
operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
for slot 0 of eb X, to btrfs_get_old_root();
14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
operation and set "logical" to the logical address of eb X, which was
the old root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
15) But before calling tree_mod_log_search(), task B locks eb X, adds a
key to eb X, which results in adding a tree mod log operation of type
BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
log, and increments the number of items in eb X from 0 to 1.
Now fs_info->tree_mod_seq has a value of 4;
16) Task A then calls tree_mod_log_search(), which returns the most recent
tree mod log operation for eb X, which is the one just added by task B
at the previous step, with a sequence number of 4, a type of
BTRFS_MOD_LOG_KEY_ADD and for slot 0;
17) Before task A locks and clones eb X, task A adds another key to eb X,
which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
with a sequence number of 5, for slot 1 of eb X, increments the
number of items in eb X from 1 to 2, and unlocks eb X.
Now fs_info->tree_mod_seq has a value of 5;
18) Task A then locks eb X and clones it. The clone has a value of 2 for
the number of items and the pointer "tm" points to the tree mod log
operation with sequence number 4, not the most recent one with a
sequence number of 5, so there is mismatch between the number of
mod log operations that are going to be applied to the cloned version
of eb X and the number of items in the clone;
19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
tree mod log operation with sequence number 4 and a type of
BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;
20) At tree_mod_log_rewind(), we set the local variable "n" with a value
of 2, which is the number of items in the clone of eb X.
Then in the first iteration of the while loop, we process the mod log
operation with sequence number 4, which is targeted at slot 0 and has
a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
2 to 1.
Then we pick the next tree mod log operation for eb X, which is the
tree mod log operation with a sequence number of 2, a type of
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one
added in step 5 to the tree mod log tree.
We go back to the top of the loop to process this mod log operation,
and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:
(...)
switch (tm->op) {
case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Fix this by checking for a more recent tree mod log operation after locking
and cloning the extent buffer of the old root node, and use it as the first
operation to apply to the cloned extent buffer when rewinding it.
Stable backport notes: due to moved code and renames, in =< 5.11 the
change should be applied to ctree.c:get_old_root.
Reported-by: Zygo Blaxell <ce3g8jdj(a)umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/
Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index 1e52584f3fa5..8a3a14686d3e 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -830,10 +830,30 @@ struct extent_buffer *btrfs_get_old_root(struct btrfs_root *root, u64 time_seq)
"failed to read tree block %llu from get_old_root",
logical);
} else {
+ struct tree_mod_elem *tm2;
+
btrfs_tree_read_lock(old);
eb = btrfs_clone_extent_buffer(old);
+ /*
+ * After the lookup for the most recent tree mod operation
+ * above and before we locked and cloned the extent buffer
+ * 'old', a new tree mod log operation may have been added.
+ * So lookup for a more recent one to make sure the number
+ * of mod log operations we replay is consistent with the
+ * number of items we have in the cloned extent buffer,
+ * otherwise we can hit a BUG_ON when rewinding the extent
+ * buffer.
+ */
+ tm2 = tree_mod_log_search(fs_info, logical, time_seq);
btrfs_tree_read_unlock(old);
free_extent_buffer(old);
+ ASSERT(tm2);
+ ASSERT(tm2 == tm || tm2->seq > tm->seq);
+ if (!tm2 || tm2->seq < tm->seq) {
+ free_extent_buffer(eb);
+ return NULL;
+ }
+ tm = tm2;
}
} else if (old_root) {
eb_root_owner = btrfs_header_owner(eb_root);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f9690f426b2134cc3e74bfc5d9dfd6a4b2ca5281 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Tue, 20 Apr 2021 10:55:44 +0100
Subject: [PATCH] btrfs: fix race when picking most recent mod log operation
for an old root
Commit dbcc7d57bffc0c ("btrfs: fix race when cloning extent buffer during
rewind of an old root"), fixed a race when we need to rewind the extent
buffer of an old root. It was caused by picking a new mod log operation
for the extent buffer while getting a cloned extent buffer with an outdated
number of items (off by -1), because we cloned the extent buffer without
locking it first.
However there is still another similar race, but in the opposite direction.
The cloned extent buffer has a number of items that does not match the
number of tree mod log operations that are going to be replayed. This is
because right after we got the last (most recent) tree mod log operation to
replay and before locking and cloning the extent buffer, another task adds
a new pointer to the extent buffer, which results in adding a new tree mod
log operation and incrementing the number of items in the extent buffer.
So after cloning we have mismatch between the number of items in the extent
buffer and the number of mod log operations we are going to apply to it.
This results in hitting a BUG_ON() that produces the following stack trace:
------------[ cut here ]------------
kernel BUG at fs/btrfs/tree-mod-log.c:675!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G W 5.12.0-7d1efdf501f8-misc-next+ #99
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
Code: 05 48 8d 74 10 (...)
RSP: 0018:ffffc90001027090 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
FS: 00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
Call Trace:
btrfs_get_old_root+0x16a/0x5c0
? lock_downgrade+0x400/0x400
btrfs_search_old_slot+0x192/0x520
? btrfs_search_slot+0x1090/0x1090
? free_extent_buffer.part.61+0xd7/0x140
? free_extent_buffer+0x13/0x20
resolve_indirect_refs+0x3e9/0xfc0
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? add_prelim_ref.part.11+0x150/0x150
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? rb_insert_color+0x340/0x360
? prelim_ref_insert+0x12d/0x430
find_parent_nodes+0x5c3/0x1830
? stack_trace_save+0x87/0xb0
? resolve_indirect_refs+0xfc0/0xfc0
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? lockdep_hardirqs_on_prepare+0x210/0x210
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? ___might_sleep+0x10f/0x1e0
? __kasan_kmalloc+0x9d/0xd0
? trace_hardirqs_on+0x55/0x120
btrfs_find_all_roots_safe+0x142/0x1e0
? find_parent_nodes+0x1830/0x1830
? trace_hardirqs_on+0x55/0x120
? ulist_free+0x1f/0x30
? btrfs_inode_flags_to_xflags+0x50/0x50
iterate_extent_inodes+0x20e/0x580
? tree_backref_for_extent+0x230/0x230
? release_extent_buffer+0x225/0x280
? read_extent_buffer+0xdd/0x110
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? _raw_spin_unlock+0x22/0x30
? release_extent_buffer+0x225/0x280
iterate_inodes_from_logical+0x129/0x170
? iterate_inodes_from_logical+0x129/0x170
? btrfs_inode_flags_to_xflags+0x50/0x50
? iterate_extent_inodes+0x580/0x580
? __vmalloc_node+0x92/0xb0
? init_data_container+0x34/0xb0
? init_data_container+0x34/0xb0
? kvmalloc_node+0x60/0x80
btrfs_ioctl_logical_to_ino+0x158/0x230
btrfs_ioctl+0x2038/0x4360
? __kasan_check_write+0x14/0x20
? mmput+0x3b/0x220
? btrfs_ioctl_get_supported_features+0x30/0x30
? __kasan_check_read+0x11/0x20
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __might_fault+0x64/0xd0
? __kasan_check_read+0x11/0x20
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? lockdep_hardirqs_on_prepare+0x13/0x210
? _raw_spin_unlock_irqrestore+0x51/0x63
? __kasan_check_read+0x11/0x20
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __task_pid_nr_ns+0xd3/0x250
? __kasan_check_read+0x11/0x20
? __fget_files+0x160/0x230
? __fget_light+0xf2/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f29ae85b427
Code: 00 00 90 48 8b (...)
RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
Modules linked in:
---[ end trace 85e5fce078dfbe04 ]---
(gdb) l *(tree_mod_log_rewind+0x3b1)
0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
670 * the modification. As we're going backwards, we do the
671 * opposite of each operation here.
672 */
673 switch (tm->op) {
674 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
675 BUG_ON(tm->slot < n);
676 fallthrough;
677 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
678 case BTRFS_MOD_LOG_KEY_REMOVE:
679 btrfs_set_node_key(eb, &tm->key, tm->slot);
(gdb) quit
The following steps explain in more detail how it happens:
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
This is task A;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
pointing to ebX->start. We only have one item in eb X, so we create
only one tree mod log operation, and store in the "tm_list" array;
4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
set to the level of eb X and ->generation set to the generation of eb X;
5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
"tm_list" as argument. After that, tree_mod_log_free_eb() calls
tree_mod_log_insert(). This inserts the mod log operation of type
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);
6) Then, after inserting the "tm_list" single element into the tree mod
log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);
7) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
8) Later some other task B allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
9) The tree mod log user task calls btrfs_search_old_slot(), which calls
btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
10) The first iteration of the while loop finds the tree mod log element
with sequence number 3, for the logical address of eb Y and of type
BTRFS_MOD_LOG_ROOT_REPLACE;
11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
break out of the loop, and set root_logical to point to
tm->old_root.logical, which corresponds to the logical address of
eb X;
12) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
corresponds to the old slot 0 of eb X (eb X had only 1 item in it
before being freed at step 7);
13) We then break out of the while loop and return the tree mod log
operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
for slot 0 of eb X, to btrfs_get_old_root();
14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
operation and set "logical" to the logical address of eb X, which was
the old root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
15) But before calling tree_mod_log_search(), task B locks eb X, adds a
key to eb X, which results in adding a tree mod log operation of type
BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
log, and increments the number of items in eb X from 0 to 1.
Now fs_info->tree_mod_seq has a value of 4;
16) Task A then calls tree_mod_log_search(), which returns the most recent
tree mod log operation for eb X, which is the one just added by task B
at the previous step, with a sequence number of 4, a type of
BTRFS_MOD_LOG_KEY_ADD and for slot 0;
17) Before task A locks and clones eb X, task A adds another key to eb X,
which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
with a sequence number of 5, for slot 1 of eb X, increments the
number of items in eb X from 1 to 2, and unlocks eb X.
Now fs_info->tree_mod_seq has a value of 5;
18) Task A then locks eb X and clones it. The clone has a value of 2 for
the number of items and the pointer "tm" points to the tree mod log
operation with sequence number 4, not the most recent one with a
sequence number of 5, so there is mismatch between the number of
mod log operations that are going to be applied to the cloned version
of eb X and the number of items in the clone;
19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
tree mod log operation with sequence number 4 and a type of
BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;
20) At tree_mod_log_rewind(), we set the local variable "n" with a value
of 2, which is the number of items in the clone of eb X.
Then in the first iteration of the while loop, we process the mod log
operation with sequence number 4, which is targeted at slot 0 and has
a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
2 to 1.
Then we pick the next tree mod log operation for eb X, which is the
tree mod log operation with a sequence number of 2, a type of
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one
added in step 5 to the tree mod log tree.
We go back to the top of the loop to process this mod log operation,
and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:
(...)
switch (tm->op) {
case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Fix this by checking for a more recent tree mod log operation after locking
and cloning the extent buffer of the old root node, and use it as the first
operation to apply to the cloned extent buffer when rewinding it.
Stable backport notes: due to moved code and renames, in =< 5.11 the
change should be applied to ctree.c:get_old_root.
Reported-by: Zygo Blaxell <ce3g8jdj(a)umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/
Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index 1e52584f3fa5..8a3a14686d3e 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -830,10 +830,30 @@ struct extent_buffer *btrfs_get_old_root(struct btrfs_root *root, u64 time_seq)
"failed to read tree block %llu from get_old_root",
logical);
} else {
+ struct tree_mod_elem *tm2;
+
btrfs_tree_read_lock(old);
eb = btrfs_clone_extent_buffer(old);
+ /*
+ * After the lookup for the most recent tree mod operation
+ * above and before we locked and cloned the extent buffer
+ * 'old', a new tree mod log operation may have been added.
+ * So lookup for a more recent one to make sure the number
+ * of mod log operations we replay is consistent with the
+ * number of items we have in the cloned extent buffer,
+ * otherwise we can hit a BUG_ON when rewinding the extent
+ * buffer.
+ */
+ tm2 = tree_mod_log_search(fs_info, logical, time_seq);
btrfs_tree_read_unlock(old);
free_extent_buffer(old);
+ ASSERT(tm2);
+ ASSERT(tm2 == tm || tm2->seq > tm->seq);
+ if (!tm2 || tm2->seq < tm->seq) {
+ free_extent_buffer(eb);
+ return NULL;
+ }
+ tm = tm2;
}
} else if (old_root) {
eb_root_owner = btrfs_header_owner(eb_root);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f9690f426b2134cc3e74bfc5d9dfd6a4b2ca5281 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Tue, 20 Apr 2021 10:55:44 +0100
Subject: [PATCH] btrfs: fix race when picking most recent mod log operation
for an old root
Commit dbcc7d57bffc0c ("btrfs: fix race when cloning extent buffer during
rewind of an old root"), fixed a race when we need to rewind the extent
buffer of an old root. It was caused by picking a new mod log operation
for the extent buffer while getting a cloned extent buffer with an outdated
number of items (off by -1), because we cloned the extent buffer without
locking it first.
However there is still another similar race, but in the opposite direction.
The cloned extent buffer has a number of items that does not match the
number of tree mod log operations that are going to be replayed. This is
because right after we got the last (most recent) tree mod log operation to
replay and before locking and cloning the extent buffer, another task adds
a new pointer to the extent buffer, which results in adding a new tree mod
log operation and incrementing the number of items in the extent buffer.
So after cloning we have mismatch between the number of items in the extent
buffer and the number of mod log operations we are going to apply to it.
This results in hitting a BUG_ON() that produces the following stack trace:
------------[ cut here ]------------
kernel BUG at fs/btrfs/tree-mod-log.c:675!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G W 5.12.0-7d1efdf501f8-misc-next+ #99
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
Code: 05 48 8d 74 10 (...)
RSP: 0018:ffffc90001027090 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
FS: 00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
Call Trace:
btrfs_get_old_root+0x16a/0x5c0
? lock_downgrade+0x400/0x400
btrfs_search_old_slot+0x192/0x520
? btrfs_search_slot+0x1090/0x1090
? free_extent_buffer.part.61+0xd7/0x140
? free_extent_buffer+0x13/0x20
resolve_indirect_refs+0x3e9/0xfc0
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? add_prelim_ref.part.11+0x150/0x150
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? rb_insert_color+0x340/0x360
? prelim_ref_insert+0x12d/0x430
find_parent_nodes+0x5c3/0x1830
? stack_trace_save+0x87/0xb0
? resolve_indirect_refs+0xfc0/0xfc0
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? lockdep_hardirqs_on_prepare+0x210/0x210
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? ___might_sleep+0x10f/0x1e0
? __kasan_kmalloc+0x9d/0xd0
? trace_hardirqs_on+0x55/0x120
btrfs_find_all_roots_safe+0x142/0x1e0
? find_parent_nodes+0x1830/0x1830
? trace_hardirqs_on+0x55/0x120
? ulist_free+0x1f/0x30
? btrfs_inode_flags_to_xflags+0x50/0x50
iterate_extent_inodes+0x20e/0x580
? tree_backref_for_extent+0x230/0x230
? release_extent_buffer+0x225/0x280
? read_extent_buffer+0xdd/0x110
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? _raw_spin_unlock+0x22/0x30
? release_extent_buffer+0x225/0x280
iterate_inodes_from_logical+0x129/0x170
? iterate_inodes_from_logical+0x129/0x170
? btrfs_inode_flags_to_xflags+0x50/0x50
? iterate_extent_inodes+0x580/0x580
? __vmalloc_node+0x92/0xb0
? init_data_container+0x34/0xb0
? init_data_container+0x34/0xb0
? kvmalloc_node+0x60/0x80
btrfs_ioctl_logical_to_ino+0x158/0x230
btrfs_ioctl+0x2038/0x4360
? __kasan_check_write+0x14/0x20
? mmput+0x3b/0x220
? btrfs_ioctl_get_supported_features+0x30/0x30
? __kasan_check_read+0x11/0x20
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __might_fault+0x64/0xd0
? __kasan_check_read+0x11/0x20
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? lockdep_hardirqs_on_prepare+0x13/0x210
? _raw_spin_unlock_irqrestore+0x51/0x63
? __kasan_check_read+0x11/0x20
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __task_pid_nr_ns+0xd3/0x250
? __kasan_check_read+0x11/0x20
? __fget_files+0x160/0x230
? __fget_light+0xf2/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f29ae85b427
Code: 00 00 90 48 8b (...)
RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
Modules linked in:
---[ end trace 85e5fce078dfbe04 ]---
(gdb) l *(tree_mod_log_rewind+0x3b1)
0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
670 * the modification. As we're going backwards, we do the
671 * opposite of each operation here.
672 */
673 switch (tm->op) {
674 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
675 BUG_ON(tm->slot < n);
676 fallthrough;
677 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
678 case BTRFS_MOD_LOG_KEY_REMOVE:
679 btrfs_set_node_key(eb, &tm->key, tm->slot);
(gdb) quit
The following steps explain in more detail how it happens:
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
This is task A;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
pointing to ebX->start. We only have one item in eb X, so we create
only one tree mod log operation, and store in the "tm_list" array;
4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
set to the level of eb X and ->generation set to the generation of eb X;
5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
"tm_list" as argument. After that, tree_mod_log_free_eb() calls
tree_mod_log_insert(). This inserts the mod log operation of type
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);
6) Then, after inserting the "tm_list" single element into the tree mod
log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);
7) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
8) Later some other task B allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
9) The tree mod log user task calls btrfs_search_old_slot(), which calls
btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
10) The first iteration of the while loop finds the tree mod log element
with sequence number 3, for the logical address of eb Y and of type
BTRFS_MOD_LOG_ROOT_REPLACE;
11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
break out of the loop, and set root_logical to point to
tm->old_root.logical, which corresponds to the logical address of
eb X;
12) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
corresponds to the old slot 0 of eb X (eb X had only 1 item in it
before being freed at step 7);
13) We then break out of the while loop and return the tree mod log
operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
for slot 0 of eb X, to btrfs_get_old_root();
14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
operation and set "logical" to the logical address of eb X, which was
the old root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
15) But before calling tree_mod_log_search(), task B locks eb X, adds a
key to eb X, which results in adding a tree mod log operation of type
BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
log, and increments the number of items in eb X from 0 to 1.
Now fs_info->tree_mod_seq has a value of 4;
16) Task A then calls tree_mod_log_search(), which returns the most recent
tree mod log operation for eb X, which is the one just added by task B
at the previous step, with a sequence number of 4, a type of
BTRFS_MOD_LOG_KEY_ADD and for slot 0;
17) Before task A locks and clones eb X, task A adds another key to eb X,
which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
with a sequence number of 5, for slot 1 of eb X, increments the
number of items in eb X from 1 to 2, and unlocks eb X.
Now fs_info->tree_mod_seq has a value of 5;
18) Task A then locks eb X and clones it. The clone has a value of 2 for
the number of items and the pointer "tm" points to the tree mod log
operation with sequence number 4, not the most recent one with a
sequence number of 5, so there is mismatch between the number of
mod log operations that are going to be applied to the cloned version
of eb X and the number of items in the clone;
19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
tree mod log operation with sequence number 4 and a type of
BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;
20) At tree_mod_log_rewind(), we set the local variable "n" with a value
of 2, which is the number of items in the clone of eb X.
Then in the first iteration of the while loop, we process the mod log
operation with sequence number 4, which is targeted at slot 0 and has
a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
2 to 1.
Then we pick the next tree mod log operation for eb X, which is the
tree mod log operation with a sequence number of 2, a type of
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one
added in step 5 to the tree mod log tree.
We go back to the top of the loop to process this mod log operation,
and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:
(...)
switch (tm->op) {
case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Fix this by checking for a more recent tree mod log operation after locking
and cloning the extent buffer of the old root node, and use it as the first
operation to apply to the cloned extent buffer when rewinding it.
Stable backport notes: due to moved code and renames, in =< 5.11 the
change should be applied to ctree.c:get_old_root.
Reported-by: Zygo Blaxell <ce3g8jdj(a)umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/
Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index 1e52584f3fa5..8a3a14686d3e 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -830,10 +830,30 @@ struct extent_buffer *btrfs_get_old_root(struct btrfs_root *root, u64 time_seq)
"failed to read tree block %llu from get_old_root",
logical);
} else {
+ struct tree_mod_elem *tm2;
+
btrfs_tree_read_lock(old);
eb = btrfs_clone_extent_buffer(old);
+ /*
+ * After the lookup for the most recent tree mod operation
+ * above and before we locked and cloned the extent buffer
+ * 'old', a new tree mod log operation may have been added.
+ * So lookup for a more recent one to make sure the number
+ * of mod log operations we replay is consistent with the
+ * number of items we have in the cloned extent buffer,
+ * otherwise we can hit a BUG_ON when rewinding the extent
+ * buffer.
+ */
+ tm2 = tree_mod_log_search(fs_info, logical, time_seq);
btrfs_tree_read_unlock(old);
free_extent_buffer(old);
+ ASSERT(tm2);
+ ASSERT(tm2 == tm || tm2->seq > tm->seq);
+ if (!tm2 || tm2->seq < tm->seq) {
+ free_extent_buffer(eb);
+ return NULL;
+ }
+ tm = tm2;
}
} else if (old_root) {
eb_root_owner = btrfs_header_owner(eb_root);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From f9690f426b2134cc3e74bfc5d9dfd6a4b2ca5281 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Tue, 20 Apr 2021 10:55:44 +0100
Subject: [PATCH] btrfs: fix race when picking most recent mod log operation
for an old root
Commit dbcc7d57bffc0c ("btrfs: fix race when cloning extent buffer during
rewind of an old root"), fixed a race when we need to rewind the extent
buffer of an old root. It was caused by picking a new mod log operation
for the extent buffer while getting a cloned extent buffer with an outdated
number of items (off by -1), because we cloned the extent buffer without
locking it first.
However there is still another similar race, but in the opposite direction.
The cloned extent buffer has a number of items that does not match the
number of tree mod log operations that are going to be replayed. This is
because right after we got the last (most recent) tree mod log operation to
replay and before locking and cloning the extent buffer, another task adds
a new pointer to the extent buffer, which results in adding a new tree mod
log operation and incrementing the number of items in the extent buffer.
So after cloning we have mismatch between the number of items in the extent
buffer and the number of mod log operations we are going to apply to it.
This results in hitting a BUG_ON() that produces the following stack trace:
------------[ cut here ]------------
kernel BUG at fs/btrfs/tree-mod-log.c:675!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G W 5.12.0-7d1efdf501f8-misc-next+ #99
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
Code: 05 48 8d 74 10 (...)
RSP: 0018:ffffc90001027090 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
FS: 00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
Call Trace:
btrfs_get_old_root+0x16a/0x5c0
? lock_downgrade+0x400/0x400
btrfs_search_old_slot+0x192/0x520
? btrfs_search_slot+0x1090/0x1090
? free_extent_buffer.part.61+0xd7/0x140
? free_extent_buffer+0x13/0x20
resolve_indirect_refs+0x3e9/0xfc0
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? add_prelim_ref.part.11+0x150/0x150
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? rb_insert_color+0x340/0x360
? prelim_ref_insert+0x12d/0x430
find_parent_nodes+0x5c3/0x1830
? stack_trace_save+0x87/0xb0
? resolve_indirect_refs+0xfc0/0xfc0
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? lockdep_hardirqs_on_prepare+0x210/0x210
? fs_reclaim_acquire+0x67/0xf0
? __kasan_check_read+0x11/0x20
? ___might_sleep+0x10f/0x1e0
? __kasan_kmalloc+0x9d/0xd0
? trace_hardirqs_on+0x55/0x120
btrfs_find_all_roots_safe+0x142/0x1e0
? find_parent_nodes+0x1830/0x1830
? trace_hardirqs_on+0x55/0x120
? ulist_free+0x1f/0x30
? btrfs_inode_flags_to_xflags+0x50/0x50
iterate_extent_inodes+0x20e/0x580
? tree_backref_for_extent+0x230/0x230
? release_extent_buffer+0x225/0x280
? read_extent_buffer+0xdd/0x110
? lock_downgrade+0x400/0x400
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x620
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? _raw_spin_unlock+0x22/0x30
? release_extent_buffer+0x225/0x280
iterate_inodes_from_logical+0x129/0x170
? iterate_inodes_from_logical+0x129/0x170
? btrfs_inode_flags_to_xflags+0x50/0x50
? iterate_extent_inodes+0x580/0x580
? __vmalloc_node+0x92/0xb0
? init_data_container+0x34/0xb0
? init_data_container+0x34/0xb0
? kvmalloc_node+0x60/0x80
btrfs_ioctl_logical_to_ino+0x158/0x230
btrfs_ioctl+0x2038/0x4360
? __kasan_check_write+0x14/0x20
? mmput+0x3b/0x220
? btrfs_ioctl_get_supported_features+0x30/0x30
? __kasan_check_read+0x11/0x20
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __might_fault+0x64/0xd0
? __kasan_check_read+0x11/0x20
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? lockdep_hardirqs_on_prepare+0x13/0x210
? _raw_spin_unlock_irqrestore+0x51/0x63
? __kasan_check_read+0x11/0x20
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? lock_downgrade+0x400/0x400
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x650
? __task_pid_nr_ns+0xd3/0x250
? __kasan_check_read+0x11/0x20
? __fget_files+0x160/0x230
? __fget_light+0xf2/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f29ae85b427
Code: 00 00 90 48 8b (...)
RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
Modules linked in:
---[ end trace 85e5fce078dfbe04 ]---
(gdb) l *(tree_mod_log_rewind+0x3b1)
0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
670 * the modification. As we're going backwards, we do the
671 * opposite of each operation here.
672 */
673 switch (tm->op) {
674 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
675 BUG_ON(tm->slot < n);
676 fallthrough;
677 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
678 case BTRFS_MOD_LOG_KEY_REMOVE:
679 btrfs_set_node_key(eb, &tm->key, tm->slot);
(gdb) quit
The following steps explain in more detail how it happens:
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
This is task A;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
pointing to ebX->start. We only have one item in eb X, so we create
only one tree mod log operation, and store in the "tm_list" array;
4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
set to the level of eb X and ->generation set to the generation of eb X;
5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
"tm_list" as argument. After that, tree_mod_log_free_eb() calls
tree_mod_log_insert(). This inserts the mod log operation of type
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);
6) Then, after inserting the "tm_list" single element into the tree mod
log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);
7) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
8) Later some other task B allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
9) The tree mod log user task calls btrfs_search_old_slot(), which calls
btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
10) The first iteration of the while loop finds the tree mod log element
with sequence number 3, for the logical address of eb Y and of type
BTRFS_MOD_LOG_ROOT_REPLACE;
11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
break out of the loop, and set root_logical to point to
tm->old_root.logical, which corresponds to the logical address of
eb X;
12) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
corresponds to the old slot 0 of eb X (eb X had only 1 item in it
before being freed at step 7);
13) We then break out of the while loop and return the tree mod log
operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
for slot 0 of eb X, to btrfs_get_old_root();
14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
operation and set "logical" to the logical address of eb X, which was
the old root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
15) But before calling tree_mod_log_search(), task B locks eb X, adds a
key to eb X, which results in adding a tree mod log operation of type
BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
log, and increments the number of items in eb X from 0 to 1.
Now fs_info->tree_mod_seq has a value of 4;
16) Task A then calls tree_mod_log_search(), which returns the most recent
tree mod log operation for eb X, which is the one just added by task B
at the previous step, with a sequence number of 4, a type of
BTRFS_MOD_LOG_KEY_ADD and for slot 0;
17) Before task A locks and clones eb X, task A adds another key to eb X,
which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
with a sequence number of 5, for slot 1 of eb X, increments the
number of items in eb X from 1 to 2, and unlocks eb X.
Now fs_info->tree_mod_seq has a value of 5;
18) Task A then locks eb X and clones it. The clone has a value of 2 for
the number of items and the pointer "tm" points to the tree mod log
operation with sequence number 4, not the most recent one with a
sequence number of 5, so there is mismatch between the number of
mod log operations that are going to be applied to the cloned version
of eb X and the number of items in the clone;
19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
tree mod log operation with sequence number 4 and a type of
BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;
20) At tree_mod_log_rewind(), we set the local variable "n" with a value
of 2, which is the number of items in the clone of eb X.
Then in the first iteration of the while loop, we process the mod log
operation with sequence number 4, which is targeted at slot 0 and has
a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
2 to 1.
Then we pick the next tree mod log operation for eb X, which is the
tree mod log operation with a sequence number of 2, a type of
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one
added in step 5 to the tree mod log tree.
We go back to the top of the loop to process this mod log operation,
and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:
(...)
switch (tm->op) {
case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Fix this by checking for a more recent tree mod log operation after locking
and cloning the extent buffer of the old root node, and use it as the first
operation to apply to the cloned extent buffer when rewinding it.
Stable backport notes: due to moved code and renames, in =< 5.11 the
change should be applied to ctree.c:get_old_root.
Reported-by: Zygo Blaxell <ce3g8jdj(a)umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/
Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index 1e52584f3fa5..8a3a14686d3e 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -830,10 +830,30 @@ struct extent_buffer *btrfs_get_old_root(struct btrfs_root *root, u64 time_seq)
"failed to read tree block %llu from get_old_root",
logical);
} else {
+ struct tree_mod_elem *tm2;
+
btrfs_tree_read_lock(old);
eb = btrfs_clone_extent_buffer(old);
+ /*
+ * After the lookup for the most recent tree mod operation
+ * above and before we locked and cloned the extent buffer
+ * 'old', a new tree mod log operation may have been added.
+ * So lookup for a more recent one to make sure the number
+ * of mod log operations we replay is consistent with the
+ * number of items we have in the cloned extent buffer,
+ * otherwise we can hit a BUG_ON when rewinding the extent
+ * buffer.
+ */
+ tm2 = tree_mod_log_search(fs_info, logical, time_seq);
btrfs_tree_read_unlock(old);
free_extent_buffer(old);
+ ASSERT(tm2);
+ ASSERT(tm2 == tm || tm2->seq > tm->seq);
+ if (!tm2 || tm2->seq < tm->seq) {
+ free_extent_buffer(eb);
+ return NULL;
+ }
+ tm = tm2;
}
} else if (old_root) {
eb_root_owner = btrfs_header_owner(eb_root);
The patch below does not apply to the 5.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3227788cd369d734d2d3cd94f8af7536b60fa552 Mon Sep 17 00:00:00 2001
From: BingJing Chang <bingjingc(a)synology.com>
Date: Thu, 25 Mar 2021 09:56:22 +0800
Subject: [PATCH] btrfs: fix a potential hole punching failure
In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.
In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.
We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
while (cur_offset < end) {
...
// update cur_offset & len
// advance cur_offset & len in hole-punching case if needed
}
Reported-by: Robbie Ko <robbieko(a)synology.com>
Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Robbie Ko <robbieko(a)synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng(a)synology.com>
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: BingJing Chang <bingjingc(a)synology.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 42634658815f..864c08d08a35 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2730,8 +2730,6 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
extent_info->file_offset += replace_len;
}
- cur_offset = drop_args.drop_end;
-
ret = btrfs_update_inode(trans, root, inode);
if (ret)
break;
@@ -2751,7 +2749,9 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
BUG_ON(ret); /* shouldn't happen */
trans->block_rsv = rsv;
- if (!extent_info) {
+ cur_offset = drop_args.drop_end;
+ len = end - cur_offset;
+ if (!extent_info && len) {
ret = find_first_non_hole(inode, &cur_offset, &len);
if (unlikely(ret < 0))
break;
The patch below does not apply to the 5.11-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3227788cd369d734d2d3cd94f8af7536b60fa552 Mon Sep 17 00:00:00 2001
From: BingJing Chang <bingjingc(a)synology.com>
Date: Thu, 25 Mar 2021 09:56:22 +0800
Subject: [PATCH] btrfs: fix a potential hole punching failure
In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.
In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.
We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
while (cur_offset < end) {
...
// update cur_offset & len
// advance cur_offset & len in hole-punching case if needed
}
Reported-by: Robbie Ko <robbieko(a)synology.com>
Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Robbie Ko <robbieko(a)synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng(a)synology.com>
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: BingJing Chang <bingjingc(a)synology.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 42634658815f..864c08d08a35 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2730,8 +2730,6 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
extent_info->file_offset += replace_len;
}
- cur_offset = drop_args.drop_end;
-
ret = btrfs_update_inode(trans, root, inode);
if (ret)
break;
@@ -2751,7 +2749,9 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
BUG_ON(ret); /* shouldn't happen */
trans->block_rsv = rsv;
- if (!extent_info) {
+ cur_offset = drop_args.drop_end;
+ len = end - cur_offset;
+ if (!extent_info && len) {
ret = find_first_non_hole(inode, &cur_offset, &len);
if (unlikely(ret < 0))
break;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3227788cd369d734d2d3cd94f8af7536b60fa552 Mon Sep 17 00:00:00 2001
From: BingJing Chang <bingjingc(a)synology.com>
Date: Thu, 25 Mar 2021 09:56:22 +0800
Subject: [PATCH] btrfs: fix a potential hole punching failure
In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.
In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.
We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
while (cur_offset < end) {
...
// update cur_offset & len
// advance cur_offset & len in hole-punching case if needed
}
Reported-by: Robbie Ko <robbieko(a)synology.com>
Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Robbie Ko <robbieko(a)synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng(a)synology.com>
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: BingJing Chang <bingjingc(a)synology.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 42634658815f..864c08d08a35 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2730,8 +2730,6 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
extent_info->file_offset += replace_len;
}
- cur_offset = drop_args.drop_end;
-
ret = btrfs_update_inode(trans, root, inode);
if (ret)
break;
@@ -2751,7 +2749,9 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
BUG_ON(ret); /* shouldn't happen */
trans->block_rsv = rsv;
- if (!extent_info) {
+ cur_offset = drop_args.drop_end;
+ len = end - cur_offset;
+ if (!extent_info && len) {
ret = find_first_non_hole(inode, &cur_offset, &len);
if (unlikely(ret < 0))
break;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3227788cd369d734d2d3cd94f8af7536b60fa552 Mon Sep 17 00:00:00 2001
From: BingJing Chang <bingjingc(a)synology.com>
Date: Thu, 25 Mar 2021 09:56:22 +0800
Subject: [PATCH] btrfs: fix a potential hole punching failure
In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.
In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.
We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
while (cur_offset < end) {
...
// update cur_offset & len
// advance cur_offset & len in hole-punching case if needed
}
Reported-by: Robbie Ko <robbieko(a)synology.com>
Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Robbie Ko <robbieko(a)synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng(a)synology.com>
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: BingJing Chang <bingjingc(a)synology.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 42634658815f..864c08d08a35 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2730,8 +2730,6 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
extent_info->file_offset += replace_len;
}
- cur_offset = drop_args.drop_end;
-
ret = btrfs_update_inode(trans, root, inode);
if (ret)
break;
@@ -2751,7 +2749,9 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
BUG_ON(ret); /* shouldn't happen */
trans->block_rsv = rsv;
- if (!extent_info) {
+ cur_offset = drop_args.drop_end;
+ len = end - cur_offset;
+ if (!extent_info && len) {
ret = find_first_non_hole(inode, &cur_offset, &len);
if (unlikely(ret < 0))
break;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3227788cd369d734d2d3cd94f8af7536b60fa552 Mon Sep 17 00:00:00 2001
From: BingJing Chang <bingjingc(a)synology.com>
Date: Thu, 25 Mar 2021 09:56:22 +0800
Subject: [PATCH] btrfs: fix a potential hole punching failure
In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.
In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.
We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
while (cur_offset < end) {
...
// update cur_offset & len
// advance cur_offset & len in hole-punching case if needed
}
Reported-by: Robbie Ko <robbieko(a)synology.com>
Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Robbie Ko <robbieko(a)synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng(a)synology.com>
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: BingJing Chang <bingjingc(a)synology.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 42634658815f..864c08d08a35 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2730,8 +2730,6 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
extent_info->file_offset += replace_len;
}
- cur_offset = drop_args.drop_end;
-
ret = btrfs_update_inode(trans, root, inode);
if (ret)
break;
@@ -2751,7 +2749,9 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
BUG_ON(ret); /* shouldn't happen */
trans->block_rsv = rsv;
- if (!extent_info) {
+ cur_offset = drop_args.drop_end;
+ len = end - cur_offset;
+ if (!extent_info && len) {
ret = find_first_non_hole(inode, &cur_offset, &len);
if (unlikely(ret < 0))
break;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3227788cd369d734d2d3cd94f8af7536b60fa552 Mon Sep 17 00:00:00 2001
From: BingJing Chang <bingjingc(a)synology.com>
Date: Thu, 25 Mar 2021 09:56:22 +0800
Subject: [PATCH] btrfs: fix a potential hole punching failure
In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.
In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.
We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
while (cur_offset < end) {
...
// update cur_offset & len
// advance cur_offset & len in hole-punching case if needed
}
Reported-by: Robbie Ko <robbieko(a)synology.com>
Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Robbie Ko <robbieko(a)synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng(a)synology.com>
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: BingJing Chang <bingjingc(a)synology.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 42634658815f..864c08d08a35 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2730,8 +2730,6 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
extent_info->file_offset += replace_len;
}
- cur_offset = drop_args.drop_end;
-
ret = btrfs_update_inode(trans, root, inode);
if (ret)
break;
@@ -2751,7 +2749,9 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
BUG_ON(ret); /* shouldn't happen */
trans->block_rsv = rsv;
- if (!extent_info) {
+ cur_offset = drop_args.drop_end;
+ len = end - cur_offset;
+ if (!extent_info && len) {
ret = find_first_non_hole(inode, &cur_offset, &len);
if (unlikely(ret < 0))
break;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3227788cd369d734d2d3cd94f8af7536b60fa552 Mon Sep 17 00:00:00 2001
From: BingJing Chang <bingjingc(a)synology.com>
Date: Thu, 25 Mar 2021 09:56:22 +0800
Subject: [PATCH] btrfs: fix a potential hole punching failure
In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.
In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.
We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
while (cur_offset < end) {
...
// update cur_offset & len
// advance cur_offset & len in hole-punching case if needed
}
Reported-by: Robbie Ko <robbieko(a)synology.com>
Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Robbie Ko <robbieko(a)synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng(a)synology.com>
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: BingJing Chang <bingjingc(a)synology.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 42634658815f..864c08d08a35 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2730,8 +2730,6 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
extent_info->file_offset += replace_len;
}
- cur_offset = drop_args.drop_end;
-
ret = btrfs_update_inode(trans, root, inode);
if (ret)
break;
@@ -2751,7 +2749,9 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
BUG_ON(ret); /* shouldn't happen */
trans->block_rsv = rsv;
- if (!extent_info) {
+ cur_offset = drop_args.drop_end;
+ len = end - cur_offset;
+ if (!extent_info && len) {
ret = find_first_non_hole(inode, &cur_offset, &len);
if (unlikely(ret < 0))
break;
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 3227788cd369d734d2d3cd94f8af7536b60fa552 Mon Sep 17 00:00:00 2001
From: BingJing Chang <bingjingc(a)synology.com>
Date: Thu, 25 Mar 2021 09:56:22 +0800
Subject: [PATCH] btrfs: fix a potential hole punching failure
In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.
In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.
We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
while (cur_offset < end) {
...
// update cur_offset & len
// advance cur_offset & len in hole-punching case if needed
}
Reported-by: Robbie Ko <robbieko(a)synology.com>
Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Robbie Ko <robbieko(a)synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng(a)synology.com>
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: BingJing Chang <bingjingc(a)synology.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 42634658815f..864c08d08a35 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2730,8 +2730,6 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
extent_info->file_offset += replace_len;
}
- cur_offset = drop_args.drop_end;
-
ret = btrfs_update_inode(trans, root, inode);
if (ret)
break;
@@ -2751,7 +2749,9 @@ int btrfs_replace_file_extents(struct btrfs_inode *inode,
BUG_ON(ret); /* shouldn't happen */
trans->block_rsv = rsv;
- if (!extent_info) {
+ cur_offset = drop_args.drop_end;
+ len = end - cur_offset;
+ if (!extent_info && len) {
ret = find_first_non_hole(inode, &cur_offset, &len);
if (unlikely(ret < 0))
break;
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 67addf29004c5be9fa0383c82a364bb59afc7f84 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Tue, 20 Apr 2021 10:55:12 +0100
Subject: [PATCH] btrfs: fix metadata extent leak after failure to create
subvolume
When creating a subvolume we allocate an extent buffer for its root node
after starting a transaction. We setup a root item for the subvolume that
points to that extent buffer and then attempt to insert the root item into
the root tree - however if that fails, due to ENOMEM for example, we do
not free the extent buffer previously allocated and we do not abort the
transaction (as at that point we did nothing that can not be undone).
This means that we effectively do not return the metadata extent back to
the free space cache/tree and we leave a delayed reference for it which
causes a metadata extent item to be added to the extent tree, in the next
transaction commit, without having backreferences. When this happens
'btrfs check' reports the following:
$ btrfs check /dev/sdi
Opening filesystem to check...
Checking filesystem on /dev/sdi
UUID: dce2cb9d-025f-4b05-a4bf-cee0ad3785eb
[1/7] checking root items
[2/7] checking extents
ref mismatch on [30425088 16384] extent item 1, found 0
backref 30425088 root 256 not referenced back 0x564a91c23d70
incorrect global backref count on 30425088 found 1 wanted 0
backpointer mismatch on [30425088 16384]
owner ref check failed [30425088 16384]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 212992 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 124669
file data blocks allocated: 65536
referenced 65536
So fix this by freeing the metadata extent if btrfs_insert_root() returns
an error.
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 37c92a9fa2e3..b1328f17607e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -697,8 +697,6 @@ static noinline int create_subvol(struct inode *dir,
btrfs_set_root_otransid(root_item, trans->transid);
btrfs_tree_unlock(leaf);
- free_extent_buffer(leaf);
- leaf = NULL;
btrfs_set_root_dirid(root_item, BTRFS_FIRST_FREE_OBJECTID);
@@ -707,8 +705,22 @@ static noinline int create_subvol(struct inode *dir,
key.type = BTRFS_ROOT_ITEM_KEY;
ret = btrfs_insert_root(trans, fs_info->tree_root, &key,
root_item);
- if (ret)
+ if (ret) {
+ /*
+ * Since we don't abort the transaction in this case, free the
+ * tree block so that we don't leak space and leave the
+ * filesystem in an inconsistent state (an extent item in the
+ * extent tree without backreferences). Also no need to have
+ * the tree block locked since it is not in any tree at this
+ * point, so no other task can find it and use it.
+ */
+ btrfs_free_tree_block(trans, root, leaf, 0, 1);
+ free_extent_buffer(leaf);
goto fail;
+ }
+
+ free_extent_buffer(leaf);
+ leaf = NULL;
key.offset = (u64)-1;
new_root = btrfs_get_new_fs_root(fs_info, objectid, anon_dev);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 24a806d849c0b0c1d0cd6a6b93ba4ae4c0ec9f08 Mon Sep 17 00:00:00 2001
From: Gao Xiang <hsiangkao(a)redhat.com>
Date: Mon, 29 Mar 2021 08:36:14 +0800
Subject: [PATCH] erofs: add unsupported inode i_format check
If any unknown i_format fields are set (may be of some new incompat
inode features), mark such inode as unsupported.
Just in case of any new incompat i_format fields added in the future.
Link: https://lore.kernel.org/r/20210329003614.6583-1-hsiangkao@aol.com
Fixes: 431339ba9042 ("staging: erofs: add inode operations")
Cc: <stable(a)vger.kernel.org> # 4.19+
Signed-off-by: Gao Xiang <hsiangkao(a)redhat.com>
diff --git a/fs/erofs/erofs_fs.h b/fs/erofs/erofs_fs.h
index 9ad1615f4474..e8d04d808fa6 100644
--- a/fs/erofs/erofs_fs.h
+++ b/fs/erofs/erofs_fs.h
@@ -75,6 +75,9 @@ static inline bool erofs_inode_is_data_compressed(unsigned int datamode)
#define EROFS_I_VERSION_BIT 0
#define EROFS_I_DATALAYOUT_BIT 1
+#define EROFS_I_ALL \
+ ((1 << (EROFS_I_DATALAYOUT_BIT + EROFS_I_DATALAYOUT_BITS)) - 1)
+
/* 32-byte reduced form of an ondisk inode */
struct erofs_inode_compact {
__le16 i_format; /* inode format hints */
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index 119fdce1b520..7ed2d7391692 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -44,6 +44,13 @@ static struct page *erofs_read_inode(struct inode *inode,
dic = page_address(page) + *ofs;
ifmt = le16_to_cpu(dic->i_format);
+ if (ifmt & ~EROFS_I_ALL) {
+ erofs_err(inode->i_sb, "unsupported i_format %u of nid %llu",
+ ifmt, vi->nid);
+ err = -EOPNOTSUPP;
+ goto err_out;
+ }
+
vi->datalayout = erofs_inode_datalayout(ifmt);
if (vi->datalayout >= EROFS_INODE_DATALAYOUT_MAX) {
erofs_err(inode->i_sb, "unsupported datalayout %u of nid %llu",
The vc4_set_crtc_possible_masks is meant to run over all the encoders
and then set their possible_crtcs mask to their associated pixelvalve.
However, since the commit 39fcb2808376 ("drm/vc4: txp: Turn the TXP into
a CRTC of its own"), the TXP has been turned to a CRTC and encoder of
its own, and while it does indeed register an encoder, it no longer has
an associated pixelvalve. The code will thus run over the TXP encoder
and set a bogus possible_crtcs mask, overriding the one set in the TXP
bind function.
In order to fix this, let's skip any virtual encoder.
Cc: <stable(a)vger.kernel.org> # v5.9+
Fixes: 39fcb2808376 ("drm/vc4: txp: Turn the TXP into a CRTC of its own")
Acked-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Signed-off-by: Maxime Ripard <maxime(a)cerno.tech>
---
drivers/gpu/drm/vc4/vc4_crtc.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/vc4/vc4_crtc.c b/drivers/gpu/drm/vc4/vc4_crtc.c
index 269390bc586e..f1f2e8cbce79 100644
--- a/drivers/gpu/drm/vc4/vc4_crtc.c
+++ b/drivers/gpu/drm/vc4/vc4_crtc.c
@@ -1018,6 +1018,9 @@ static void vc4_set_crtc_possible_masks(struct drm_device *drm,
struct vc4_encoder *vc4_encoder;
int i;
+ if (encoder->encoder_type == DRM_MODE_ENCODER_VIRTUAL)
+ continue;
+
vc4_encoder = to_vc4_encoder(encoder);
for (i = 0; i < ARRAY_SIZE(pv_data->encoder_types); i++) {
if (vc4_encoder->type == encoder_types[i]) {
--
2.31.1
The current code does a binary OR on the possible_crtcs variable of the
TXP encoder, while we want to set it to that value instead.
Cc: <stable(a)vger.kernel.org> # v5.9+
Fixes: 39fcb2808376 ("drm/vc4: txp: Turn the TXP into a CRTC of its own")
Acked-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Signed-off-by: Maxime Ripard <maxime(a)cerno.tech>
---
drivers/gpu/drm/vc4/vc4_txp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/vc4/vc4_txp.c b/drivers/gpu/drm/vc4/vc4_txp.c
index c0122d83b651..2fc7f4b5fa09 100644
--- a/drivers/gpu/drm/vc4/vc4_txp.c
+++ b/drivers/gpu/drm/vc4/vc4_txp.c
@@ -507,7 +507,7 @@ static int vc4_txp_bind(struct device *dev, struct device *master, void *data)
return ret;
encoder = &txp->connector.encoder;
- encoder->possible_crtcs |= drm_crtc_mask(crtc);
+ encoder->possible_crtcs = drm_crtc_mask(crtc);
ret = devm_request_irq(dev, irq, vc4_txp_interrupt, 0,
dev_name(dev), txp);
--
2.31.1
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From be94215be1ab19e5d38f50962f611c88d4bfc83a Mon Sep 17 00:00:00 2001
From: Xiang Chen <chenxiang66(a)hisilicon.com>
Date: Thu, 1 Apr 2021 15:34:46 +0800
Subject: [PATCH] mtd: spi-nor: core: Fix an issue of releasing resources
during read/write
If rmmod the driver during read or write, the driver will release the
resources which are used during read or write, so it is possible to
refer to NULL pointer.
Use the testcase "mtd_debug read /dev/mtd0 0xc00000 0x400000 dest_file &
sleep 0.5;rmmod spi_hisi_sfc_v3xx.ko", the issue can be reproduced in
hisi_sfc_v3xx driver.
To avoid the issue, fill the interface _get_device and _put_device of
mtd_info to grab the reference to the spi controller driver module, so
the request of rmmod the driver is rejected before read/write is finished.
Fixes: b199489d37b2 ("mtd: spi-nor: add the framework for SPI NOR")
Signed-off-by: Xiang Chen <chenxiang66(a)hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Tested-by: Michael Walle <michael(a)walle.cc>
Tested-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Reviewed-by: Michael Walle <michael(a)walle.cc>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/1617262486-4223-1-git-send-email-yangyicong@hisil…
diff --git a/drivers/mtd/spi-nor/core.c b/drivers/mtd/spi-nor/core.c
index 8cf3cf92129e..bd2c7717eb10 100644
--- a/drivers/mtd/spi-nor/core.c
+++ b/drivers/mtd/spi-nor/core.c
@@ -2905,6 +2905,37 @@ static void spi_nor_resume(struct mtd_info *mtd)
dev_err(dev, "resume() failed\n");
}
+static int spi_nor_get_device(struct mtd_info *mtd)
+{
+ struct mtd_info *master = mtd_get_master(mtd);
+ struct spi_nor *nor = mtd_to_spi_nor(master);
+ struct device *dev;
+
+ if (nor->spimem)
+ dev = nor->spimem->spi->controller->dev.parent;
+ else
+ dev = nor->dev;
+
+ if (!try_module_get(dev->driver->owner))
+ return -ENODEV;
+
+ return 0;
+}
+
+static void spi_nor_put_device(struct mtd_info *mtd)
+{
+ struct mtd_info *master = mtd_get_master(mtd);
+ struct spi_nor *nor = mtd_to_spi_nor(master);
+ struct device *dev;
+
+ if (nor->spimem)
+ dev = nor->spimem->spi->controller->dev.parent;
+ else
+ dev = nor->dev;
+
+ module_put(dev->driver->owner);
+}
+
void spi_nor_restore(struct spi_nor *nor)
{
/* restore the addressing mode */
@@ -3099,6 +3130,8 @@ int spi_nor_scan(struct spi_nor *nor, const char *name,
mtd->_read = spi_nor_read;
mtd->_suspend = spi_nor_suspend;
mtd->_resume = spi_nor_resume;
+ mtd->_get_device = spi_nor_get_device;
+ mtd->_put_device = spi_nor_put_device;
if (info->flags & USE_FSR)
nor->flags |= SNOR_F_USE_FSR;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From be94215be1ab19e5d38f50962f611c88d4bfc83a Mon Sep 17 00:00:00 2001
From: Xiang Chen <chenxiang66(a)hisilicon.com>
Date: Thu, 1 Apr 2021 15:34:46 +0800
Subject: [PATCH] mtd: spi-nor: core: Fix an issue of releasing resources
during read/write
If rmmod the driver during read or write, the driver will release the
resources which are used during read or write, so it is possible to
refer to NULL pointer.
Use the testcase "mtd_debug read /dev/mtd0 0xc00000 0x400000 dest_file &
sleep 0.5;rmmod spi_hisi_sfc_v3xx.ko", the issue can be reproduced in
hisi_sfc_v3xx driver.
To avoid the issue, fill the interface _get_device and _put_device of
mtd_info to grab the reference to the spi controller driver module, so
the request of rmmod the driver is rejected before read/write is finished.
Fixes: b199489d37b2 ("mtd: spi-nor: add the framework for SPI NOR")
Signed-off-by: Xiang Chen <chenxiang66(a)hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Tested-by: Michael Walle <michael(a)walle.cc>
Tested-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Reviewed-by: Michael Walle <michael(a)walle.cc>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/1617262486-4223-1-git-send-email-yangyicong@hisil…
diff --git a/drivers/mtd/spi-nor/core.c b/drivers/mtd/spi-nor/core.c
index 8cf3cf92129e..bd2c7717eb10 100644
--- a/drivers/mtd/spi-nor/core.c
+++ b/drivers/mtd/spi-nor/core.c
@@ -2905,6 +2905,37 @@ static void spi_nor_resume(struct mtd_info *mtd)
dev_err(dev, "resume() failed\n");
}
+static int spi_nor_get_device(struct mtd_info *mtd)
+{
+ struct mtd_info *master = mtd_get_master(mtd);
+ struct spi_nor *nor = mtd_to_spi_nor(master);
+ struct device *dev;
+
+ if (nor->spimem)
+ dev = nor->spimem->spi->controller->dev.parent;
+ else
+ dev = nor->dev;
+
+ if (!try_module_get(dev->driver->owner))
+ return -ENODEV;
+
+ return 0;
+}
+
+static void spi_nor_put_device(struct mtd_info *mtd)
+{
+ struct mtd_info *master = mtd_get_master(mtd);
+ struct spi_nor *nor = mtd_to_spi_nor(master);
+ struct device *dev;
+
+ if (nor->spimem)
+ dev = nor->spimem->spi->controller->dev.parent;
+ else
+ dev = nor->dev;
+
+ module_put(dev->driver->owner);
+}
+
void spi_nor_restore(struct spi_nor *nor)
{
/* restore the addressing mode */
@@ -3099,6 +3130,8 @@ int spi_nor_scan(struct spi_nor *nor, const char *name,
mtd->_read = spi_nor_read;
mtd->_suspend = spi_nor_suspend;
mtd->_resume = spi_nor_resume;
+ mtd->_get_device = spi_nor_get_device;
+ mtd->_put_device = spi_nor_put_device;
if (info->flags & USE_FSR)
nor->flags |= SNOR_F_USE_FSR;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 17a17bf50612e6048a9975450cf1bd30f93815b5 Mon Sep 17 00:00:00 2001
From: Ulf Hansson <ulf.hansson(a)linaro.org>
Date: Wed, 10 Mar 2021 16:29:00 +0100
Subject: [PATCH] mmc: core: Fix hanging on I/O during system suspend for
removable cards
The mmc core uses a PM notifier to temporarily during system suspend, turn
off the card detection mechanism for removal/insertion of (e)MMC/SD/SDIO
cards. Additionally, the notifier may be used to remove an SDIO card
entirely, if a corresponding SDIO functional driver don't have the system
suspend/resume callbacks assigned. This behaviour has been around for a
very long time.
However, a recent bug report tells us there are problems with this
approach. More precisely, when receiving the PM_SUSPEND_PREPARE
notification, we may end up hanging on I/O to be completed, thus also
preventing the system from getting suspended.
In the end what happens, is that the cancel_delayed_work_sync() in
mmc_pm_notify() ends up waiting for mmc_rescan() to complete - and since
mmc_rescan() wants to claim the host, it needs to wait for the I/O to be
completed first.
Typically, this problem is triggered in Android, if there is ongoing I/O
while the user decides to suspend, resume and then suspend the system
again. This due to that after the resume, an mmc_rescan() work gets punted
to the workqueue, which job is to verify that the card remains inserted
after the system has resumed.
To fix this problem, userspace needs to become frozen to suspend the I/O,
prior to turning off the card detection mechanism. Therefore, let's drop
the PM notifiers for mmc subsystem altogether and rely on the card
detection to be turned off/on as a part of the system_freezable_wq, that we
are already using.
Moreover, to allow and SDIO card to be removed during system suspend, let's
manage this from a ->prepare() callback, assigned at the mmc_host_class
level. In this way, we can use the parent device (the mmc_host_class
device), to remove the card device that is the child, in the
device_prepare() phase.
Reported-by: Kiwoong Kim <kwmad.kim(a)samsung.com>
Cc: stable(a)vger.kernel.org # v4.5+
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
Reviewed-by: Linus Walleij <linus.walleij(a)linaro.org>
Link: https://lore.kernel.org/r/20210310152900.149380-1-ulf.hansson@linaro.org
Reviewed-by: Kiwoong Kim <kwmad.kim(a)samsung.com>
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 9c13f7a52699..f194940c5974 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -2269,80 +2269,6 @@ void mmc_stop_host(struct mmc_host *host)
mmc_release_host(host);
}
-#ifdef CONFIG_PM_SLEEP
-/* Do the card removal on suspend if card is assumed removeable
- * Do that in pm notifier while userspace isn't yet frozen, so we will be able
- to sync the card.
-*/
-static int mmc_pm_notify(struct notifier_block *notify_block,
- unsigned long mode, void *unused)
-{
- struct mmc_host *host = container_of(
- notify_block, struct mmc_host, pm_notify);
- unsigned long flags;
- int err = 0;
-
- switch (mode) {
- case PM_HIBERNATION_PREPARE:
- case PM_SUSPEND_PREPARE:
- case PM_RESTORE_PREPARE:
- spin_lock_irqsave(&host->lock, flags);
- host->rescan_disable = 1;
- spin_unlock_irqrestore(&host->lock, flags);
- cancel_delayed_work_sync(&host->detect);
-
- if (!host->bus_ops)
- break;
-
- /* Validate prerequisites for suspend */
- if (host->bus_ops->pre_suspend)
- err = host->bus_ops->pre_suspend(host);
- if (!err)
- break;
-
- if (!mmc_card_is_removable(host)) {
- dev_warn(mmc_dev(host),
- "pre_suspend failed for non-removable host: "
- "%d\n", err);
- /* Avoid removing non-removable hosts */
- break;
- }
-
- /* Calling bus_ops->remove() with a claimed host can deadlock */
- host->bus_ops->remove(host);
- mmc_claim_host(host);
- mmc_detach_bus(host);
- mmc_power_off(host);
- mmc_release_host(host);
- host->pm_flags = 0;
- break;
-
- case PM_POST_SUSPEND:
- case PM_POST_HIBERNATION:
- case PM_POST_RESTORE:
-
- spin_lock_irqsave(&host->lock, flags);
- host->rescan_disable = 0;
- spin_unlock_irqrestore(&host->lock, flags);
- _mmc_detect_change(host, 0, false);
-
- }
-
- return 0;
-}
-
-void mmc_register_pm_notifier(struct mmc_host *host)
-{
- host->pm_notify.notifier_call = mmc_pm_notify;
- register_pm_notifier(&host->pm_notify);
-}
-
-void mmc_unregister_pm_notifier(struct mmc_host *host)
-{
- unregister_pm_notifier(&host->pm_notify);
-}
-#endif
-
static int __init mmc_init(void)
{
int ret;
diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
index 575ac0257af2..8032451abaea 100644
--- a/drivers/mmc/core/core.h
+++ b/drivers/mmc/core/core.h
@@ -93,14 +93,6 @@ int mmc_execute_tuning(struct mmc_card *card);
int mmc_hs200_to_hs400(struct mmc_card *card);
int mmc_hs400_to_hs200(struct mmc_card *card);
-#ifdef CONFIG_PM_SLEEP
-void mmc_register_pm_notifier(struct mmc_host *host);
-void mmc_unregister_pm_notifier(struct mmc_host *host);
-#else
-static inline void mmc_register_pm_notifier(struct mmc_host *host) { }
-static inline void mmc_unregister_pm_notifier(struct mmc_host *host) { }
-#endif
-
void mmc_wait_for_req_done(struct mmc_host *host, struct mmc_request *mrq);
bool mmc_is_req_done(struct mmc_host *host, struct mmc_request *mrq);
diff --git a/drivers/mmc/core/host.c b/drivers/mmc/core/host.c
index 9b89a91b6b47..fe05b3645fe9 100644
--- a/drivers/mmc/core/host.c
+++ b/drivers/mmc/core/host.c
@@ -35,6 +35,42 @@
static DEFINE_IDA(mmc_host_ida);
+#ifdef CONFIG_PM_SLEEP
+static int mmc_host_class_prepare(struct device *dev)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+
+ /*
+ * It's safe to access the bus_ops pointer, as both userspace and the
+ * workqueue for detecting cards are frozen at this point.
+ */
+ if (!host->bus_ops)
+ return 0;
+
+ /* Validate conditions for system suspend. */
+ if (host->bus_ops->pre_suspend)
+ return host->bus_ops->pre_suspend(host);
+
+ return 0;
+}
+
+static void mmc_host_class_complete(struct device *dev)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+
+ _mmc_detect_change(host, 0, false);
+}
+
+static const struct dev_pm_ops mmc_host_class_dev_pm_ops = {
+ .prepare = mmc_host_class_prepare,
+ .complete = mmc_host_class_complete,
+};
+
+#define MMC_HOST_CLASS_DEV_PM_OPS (&mmc_host_class_dev_pm_ops)
+#else
+#define MMC_HOST_CLASS_DEV_PM_OPS NULL
+#endif
+
static void mmc_host_classdev_release(struct device *dev)
{
struct mmc_host *host = cls_dev_to_mmc_host(dev);
@@ -46,6 +82,7 @@ static void mmc_host_classdev_release(struct device *dev)
static struct class mmc_host_class = {
.name = "mmc_host",
.dev_release = mmc_host_classdev_release,
+ .pm = MMC_HOST_CLASS_DEV_PM_OPS,
};
int mmc_register_host_class(void)
@@ -538,8 +575,6 @@ int mmc_add_host(struct mmc_host *host)
#endif
mmc_start_host(host);
- mmc_register_pm_notifier(host);
-
return 0;
}
@@ -555,7 +590,6 @@ EXPORT_SYMBOL(mmc_add_host);
*/
void mmc_remove_host(struct mmc_host *host)
{
- mmc_unregister_pm_notifier(host);
mmc_stop_host(host);
#ifdef CONFIG_DEBUG_FS
diff --git a/drivers/mmc/core/sdio.c b/drivers/mmc/core/sdio.c
index 0fda7784cab2..3eb94ac2712e 100644
--- a/drivers/mmc/core/sdio.c
+++ b/drivers/mmc/core/sdio.c
@@ -985,21 +985,37 @@ static void mmc_sdio_detect(struct mmc_host *host)
*/
static int mmc_sdio_pre_suspend(struct mmc_host *host)
{
- int i, err = 0;
+ int i;
for (i = 0; i < host->card->sdio_funcs; i++) {
struct sdio_func *func = host->card->sdio_func[i];
if (func && sdio_func_present(func) && func->dev.driver) {
const struct dev_pm_ops *pmops = func->dev.driver->pm;
- if (!pmops || !pmops->suspend || !pmops->resume) {
+ if (!pmops || !pmops->suspend || !pmops->resume)
/* force removal of entire card in that case */
- err = -ENOSYS;
- break;
- }
+ goto remove;
}
}
- return err;
+ return 0;
+
+remove:
+ if (!mmc_card_is_removable(host)) {
+ dev_warn(mmc_dev(host),
+ "missing suspend/resume ops for non-removable SDIO card\n");
+ /* Don't remove a non-removable card - we can't re-detect it. */
+ return 0;
+ }
+
+ /* Remove the SDIO card and let it be re-detected later on. */
+ mmc_sdio_remove(host);
+ mmc_claim_host(host);
+ mmc_detach_bus(host);
+ mmc_power_off(host);
+ mmc_release_host(host);
+ host->pm_flags = 0;
+
+ return 0;
}
/*
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index a001ad2f5f23..17d7b326af29 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -302,9 +302,6 @@ struct mmc_host {
u32 ocr_avail_sdio; /* SDIO-specific OCR */
u32 ocr_avail_sd; /* SD-specific OCR */
u32 ocr_avail_mmc; /* MMC-specific OCR */
-#ifdef CONFIG_PM_SLEEP
- struct notifier_block pm_notify;
-#endif
struct wakeup_source *ws; /* Enable consume of uevents */
u32 max_current_330;
u32 max_current_300;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 17a17bf50612e6048a9975450cf1bd30f93815b5 Mon Sep 17 00:00:00 2001
From: Ulf Hansson <ulf.hansson(a)linaro.org>
Date: Wed, 10 Mar 2021 16:29:00 +0100
Subject: [PATCH] mmc: core: Fix hanging on I/O during system suspend for
removable cards
The mmc core uses a PM notifier to temporarily during system suspend, turn
off the card detection mechanism for removal/insertion of (e)MMC/SD/SDIO
cards. Additionally, the notifier may be used to remove an SDIO card
entirely, if a corresponding SDIO functional driver don't have the system
suspend/resume callbacks assigned. This behaviour has been around for a
very long time.
However, a recent bug report tells us there are problems with this
approach. More precisely, when receiving the PM_SUSPEND_PREPARE
notification, we may end up hanging on I/O to be completed, thus also
preventing the system from getting suspended.
In the end what happens, is that the cancel_delayed_work_sync() in
mmc_pm_notify() ends up waiting for mmc_rescan() to complete - and since
mmc_rescan() wants to claim the host, it needs to wait for the I/O to be
completed first.
Typically, this problem is triggered in Android, if there is ongoing I/O
while the user decides to suspend, resume and then suspend the system
again. This due to that after the resume, an mmc_rescan() work gets punted
to the workqueue, which job is to verify that the card remains inserted
after the system has resumed.
To fix this problem, userspace needs to become frozen to suspend the I/O,
prior to turning off the card detection mechanism. Therefore, let's drop
the PM notifiers for mmc subsystem altogether and rely on the card
detection to be turned off/on as a part of the system_freezable_wq, that we
are already using.
Moreover, to allow and SDIO card to be removed during system suspend, let's
manage this from a ->prepare() callback, assigned at the mmc_host_class
level. In this way, we can use the parent device (the mmc_host_class
device), to remove the card device that is the child, in the
device_prepare() phase.
Reported-by: Kiwoong Kim <kwmad.kim(a)samsung.com>
Cc: stable(a)vger.kernel.org # v4.5+
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
Reviewed-by: Linus Walleij <linus.walleij(a)linaro.org>
Link: https://lore.kernel.org/r/20210310152900.149380-1-ulf.hansson@linaro.org
Reviewed-by: Kiwoong Kim <kwmad.kim(a)samsung.com>
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 9c13f7a52699..f194940c5974 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -2269,80 +2269,6 @@ void mmc_stop_host(struct mmc_host *host)
mmc_release_host(host);
}
-#ifdef CONFIG_PM_SLEEP
-/* Do the card removal on suspend if card is assumed removeable
- * Do that in pm notifier while userspace isn't yet frozen, so we will be able
- to sync the card.
-*/
-static int mmc_pm_notify(struct notifier_block *notify_block,
- unsigned long mode, void *unused)
-{
- struct mmc_host *host = container_of(
- notify_block, struct mmc_host, pm_notify);
- unsigned long flags;
- int err = 0;
-
- switch (mode) {
- case PM_HIBERNATION_PREPARE:
- case PM_SUSPEND_PREPARE:
- case PM_RESTORE_PREPARE:
- spin_lock_irqsave(&host->lock, flags);
- host->rescan_disable = 1;
- spin_unlock_irqrestore(&host->lock, flags);
- cancel_delayed_work_sync(&host->detect);
-
- if (!host->bus_ops)
- break;
-
- /* Validate prerequisites for suspend */
- if (host->bus_ops->pre_suspend)
- err = host->bus_ops->pre_suspend(host);
- if (!err)
- break;
-
- if (!mmc_card_is_removable(host)) {
- dev_warn(mmc_dev(host),
- "pre_suspend failed for non-removable host: "
- "%d\n", err);
- /* Avoid removing non-removable hosts */
- break;
- }
-
- /* Calling bus_ops->remove() with a claimed host can deadlock */
- host->bus_ops->remove(host);
- mmc_claim_host(host);
- mmc_detach_bus(host);
- mmc_power_off(host);
- mmc_release_host(host);
- host->pm_flags = 0;
- break;
-
- case PM_POST_SUSPEND:
- case PM_POST_HIBERNATION:
- case PM_POST_RESTORE:
-
- spin_lock_irqsave(&host->lock, flags);
- host->rescan_disable = 0;
- spin_unlock_irqrestore(&host->lock, flags);
- _mmc_detect_change(host, 0, false);
-
- }
-
- return 0;
-}
-
-void mmc_register_pm_notifier(struct mmc_host *host)
-{
- host->pm_notify.notifier_call = mmc_pm_notify;
- register_pm_notifier(&host->pm_notify);
-}
-
-void mmc_unregister_pm_notifier(struct mmc_host *host)
-{
- unregister_pm_notifier(&host->pm_notify);
-}
-#endif
-
static int __init mmc_init(void)
{
int ret;
diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
index 575ac0257af2..8032451abaea 100644
--- a/drivers/mmc/core/core.h
+++ b/drivers/mmc/core/core.h
@@ -93,14 +93,6 @@ int mmc_execute_tuning(struct mmc_card *card);
int mmc_hs200_to_hs400(struct mmc_card *card);
int mmc_hs400_to_hs200(struct mmc_card *card);
-#ifdef CONFIG_PM_SLEEP
-void mmc_register_pm_notifier(struct mmc_host *host);
-void mmc_unregister_pm_notifier(struct mmc_host *host);
-#else
-static inline void mmc_register_pm_notifier(struct mmc_host *host) { }
-static inline void mmc_unregister_pm_notifier(struct mmc_host *host) { }
-#endif
-
void mmc_wait_for_req_done(struct mmc_host *host, struct mmc_request *mrq);
bool mmc_is_req_done(struct mmc_host *host, struct mmc_request *mrq);
diff --git a/drivers/mmc/core/host.c b/drivers/mmc/core/host.c
index 9b89a91b6b47..fe05b3645fe9 100644
--- a/drivers/mmc/core/host.c
+++ b/drivers/mmc/core/host.c
@@ -35,6 +35,42 @@
static DEFINE_IDA(mmc_host_ida);
+#ifdef CONFIG_PM_SLEEP
+static int mmc_host_class_prepare(struct device *dev)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+
+ /*
+ * It's safe to access the bus_ops pointer, as both userspace and the
+ * workqueue for detecting cards are frozen at this point.
+ */
+ if (!host->bus_ops)
+ return 0;
+
+ /* Validate conditions for system suspend. */
+ if (host->bus_ops->pre_suspend)
+ return host->bus_ops->pre_suspend(host);
+
+ return 0;
+}
+
+static void mmc_host_class_complete(struct device *dev)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+
+ _mmc_detect_change(host, 0, false);
+}
+
+static const struct dev_pm_ops mmc_host_class_dev_pm_ops = {
+ .prepare = mmc_host_class_prepare,
+ .complete = mmc_host_class_complete,
+};
+
+#define MMC_HOST_CLASS_DEV_PM_OPS (&mmc_host_class_dev_pm_ops)
+#else
+#define MMC_HOST_CLASS_DEV_PM_OPS NULL
+#endif
+
static void mmc_host_classdev_release(struct device *dev)
{
struct mmc_host *host = cls_dev_to_mmc_host(dev);
@@ -46,6 +82,7 @@ static void mmc_host_classdev_release(struct device *dev)
static struct class mmc_host_class = {
.name = "mmc_host",
.dev_release = mmc_host_classdev_release,
+ .pm = MMC_HOST_CLASS_DEV_PM_OPS,
};
int mmc_register_host_class(void)
@@ -538,8 +575,6 @@ int mmc_add_host(struct mmc_host *host)
#endif
mmc_start_host(host);
- mmc_register_pm_notifier(host);
-
return 0;
}
@@ -555,7 +590,6 @@ EXPORT_SYMBOL(mmc_add_host);
*/
void mmc_remove_host(struct mmc_host *host)
{
- mmc_unregister_pm_notifier(host);
mmc_stop_host(host);
#ifdef CONFIG_DEBUG_FS
diff --git a/drivers/mmc/core/sdio.c b/drivers/mmc/core/sdio.c
index 0fda7784cab2..3eb94ac2712e 100644
--- a/drivers/mmc/core/sdio.c
+++ b/drivers/mmc/core/sdio.c
@@ -985,21 +985,37 @@ static void mmc_sdio_detect(struct mmc_host *host)
*/
static int mmc_sdio_pre_suspend(struct mmc_host *host)
{
- int i, err = 0;
+ int i;
for (i = 0; i < host->card->sdio_funcs; i++) {
struct sdio_func *func = host->card->sdio_func[i];
if (func && sdio_func_present(func) && func->dev.driver) {
const struct dev_pm_ops *pmops = func->dev.driver->pm;
- if (!pmops || !pmops->suspend || !pmops->resume) {
+ if (!pmops || !pmops->suspend || !pmops->resume)
/* force removal of entire card in that case */
- err = -ENOSYS;
- break;
- }
+ goto remove;
}
}
- return err;
+ return 0;
+
+remove:
+ if (!mmc_card_is_removable(host)) {
+ dev_warn(mmc_dev(host),
+ "missing suspend/resume ops for non-removable SDIO card\n");
+ /* Don't remove a non-removable card - we can't re-detect it. */
+ return 0;
+ }
+
+ /* Remove the SDIO card and let it be re-detected later on. */
+ mmc_sdio_remove(host);
+ mmc_claim_host(host);
+ mmc_detach_bus(host);
+ mmc_power_off(host);
+ mmc_release_host(host);
+ host->pm_flags = 0;
+
+ return 0;
}
/*
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index a001ad2f5f23..17d7b326af29 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -302,9 +302,6 @@ struct mmc_host {
u32 ocr_avail_sdio; /* SDIO-specific OCR */
u32 ocr_avail_sd; /* SD-specific OCR */
u32 ocr_avail_mmc; /* MMC-specific OCR */
-#ifdef CONFIG_PM_SLEEP
- struct notifier_block pm_notify;
-#endif
struct wakeup_source *ws; /* Enable consume of uevents */
u32 max_current_330;
u32 max_current_300;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 17a17bf50612e6048a9975450cf1bd30f93815b5 Mon Sep 17 00:00:00 2001
From: Ulf Hansson <ulf.hansson(a)linaro.org>
Date: Wed, 10 Mar 2021 16:29:00 +0100
Subject: [PATCH] mmc: core: Fix hanging on I/O during system suspend for
removable cards
The mmc core uses a PM notifier to temporarily during system suspend, turn
off the card detection mechanism for removal/insertion of (e)MMC/SD/SDIO
cards. Additionally, the notifier may be used to remove an SDIO card
entirely, if a corresponding SDIO functional driver don't have the system
suspend/resume callbacks assigned. This behaviour has been around for a
very long time.
However, a recent bug report tells us there are problems with this
approach. More precisely, when receiving the PM_SUSPEND_PREPARE
notification, we may end up hanging on I/O to be completed, thus also
preventing the system from getting suspended.
In the end what happens, is that the cancel_delayed_work_sync() in
mmc_pm_notify() ends up waiting for mmc_rescan() to complete - and since
mmc_rescan() wants to claim the host, it needs to wait for the I/O to be
completed first.
Typically, this problem is triggered in Android, if there is ongoing I/O
while the user decides to suspend, resume and then suspend the system
again. This due to that after the resume, an mmc_rescan() work gets punted
to the workqueue, which job is to verify that the card remains inserted
after the system has resumed.
To fix this problem, userspace needs to become frozen to suspend the I/O,
prior to turning off the card detection mechanism. Therefore, let's drop
the PM notifiers for mmc subsystem altogether and rely on the card
detection to be turned off/on as a part of the system_freezable_wq, that we
are already using.
Moreover, to allow and SDIO card to be removed during system suspend, let's
manage this from a ->prepare() callback, assigned at the mmc_host_class
level. In this way, we can use the parent device (the mmc_host_class
device), to remove the card device that is the child, in the
device_prepare() phase.
Reported-by: Kiwoong Kim <kwmad.kim(a)samsung.com>
Cc: stable(a)vger.kernel.org # v4.5+
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
Reviewed-by: Linus Walleij <linus.walleij(a)linaro.org>
Link: https://lore.kernel.org/r/20210310152900.149380-1-ulf.hansson@linaro.org
Reviewed-by: Kiwoong Kim <kwmad.kim(a)samsung.com>
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 9c13f7a52699..f194940c5974 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -2269,80 +2269,6 @@ void mmc_stop_host(struct mmc_host *host)
mmc_release_host(host);
}
-#ifdef CONFIG_PM_SLEEP
-/* Do the card removal on suspend if card is assumed removeable
- * Do that in pm notifier while userspace isn't yet frozen, so we will be able
- to sync the card.
-*/
-static int mmc_pm_notify(struct notifier_block *notify_block,
- unsigned long mode, void *unused)
-{
- struct mmc_host *host = container_of(
- notify_block, struct mmc_host, pm_notify);
- unsigned long flags;
- int err = 0;
-
- switch (mode) {
- case PM_HIBERNATION_PREPARE:
- case PM_SUSPEND_PREPARE:
- case PM_RESTORE_PREPARE:
- spin_lock_irqsave(&host->lock, flags);
- host->rescan_disable = 1;
- spin_unlock_irqrestore(&host->lock, flags);
- cancel_delayed_work_sync(&host->detect);
-
- if (!host->bus_ops)
- break;
-
- /* Validate prerequisites for suspend */
- if (host->bus_ops->pre_suspend)
- err = host->bus_ops->pre_suspend(host);
- if (!err)
- break;
-
- if (!mmc_card_is_removable(host)) {
- dev_warn(mmc_dev(host),
- "pre_suspend failed for non-removable host: "
- "%d\n", err);
- /* Avoid removing non-removable hosts */
- break;
- }
-
- /* Calling bus_ops->remove() with a claimed host can deadlock */
- host->bus_ops->remove(host);
- mmc_claim_host(host);
- mmc_detach_bus(host);
- mmc_power_off(host);
- mmc_release_host(host);
- host->pm_flags = 0;
- break;
-
- case PM_POST_SUSPEND:
- case PM_POST_HIBERNATION:
- case PM_POST_RESTORE:
-
- spin_lock_irqsave(&host->lock, flags);
- host->rescan_disable = 0;
- spin_unlock_irqrestore(&host->lock, flags);
- _mmc_detect_change(host, 0, false);
-
- }
-
- return 0;
-}
-
-void mmc_register_pm_notifier(struct mmc_host *host)
-{
- host->pm_notify.notifier_call = mmc_pm_notify;
- register_pm_notifier(&host->pm_notify);
-}
-
-void mmc_unregister_pm_notifier(struct mmc_host *host)
-{
- unregister_pm_notifier(&host->pm_notify);
-}
-#endif
-
static int __init mmc_init(void)
{
int ret;
diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
index 575ac0257af2..8032451abaea 100644
--- a/drivers/mmc/core/core.h
+++ b/drivers/mmc/core/core.h
@@ -93,14 +93,6 @@ int mmc_execute_tuning(struct mmc_card *card);
int mmc_hs200_to_hs400(struct mmc_card *card);
int mmc_hs400_to_hs200(struct mmc_card *card);
-#ifdef CONFIG_PM_SLEEP
-void mmc_register_pm_notifier(struct mmc_host *host);
-void mmc_unregister_pm_notifier(struct mmc_host *host);
-#else
-static inline void mmc_register_pm_notifier(struct mmc_host *host) { }
-static inline void mmc_unregister_pm_notifier(struct mmc_host *host) { }
-#endif
-
void mmc_wait_for_req_done(struct mmc_host *host, struct mmc_request *mrq);
bool mmc_is_req_done(struct mmc_host *host, struct mmc_request *mrq);
diff --git a/drivers/mmc/core/host.c b/drivers/mmc/core/host.c
index 9b89a91b6b47..fe05b3645fe9 100644
--- a/drivers/mmc/core/host.c
+++ b/drivers/mmc/core/host.c
@@ -35,6 +35,42 @@
static DEFINE_IDA(mmc_host_ida);
+#ifdef CONFIG_PM_SLEEP
+static int mmc_host_class_prepare(struct device *dev)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+
+ /*
+ * It's safe to access the bus_ops pointer, as both userspace and the
+ * workqueue for detecting cards are frozen at this point.
+ */
+ if (!host->bus_ops)
+ return 0;
+
+ /* Validate conditions for system suspend. */
+ if (host->bus_ops->pre_suspend)
+ return host->bus_ops->pre_suspend(host);
+
+ return 0;
+}
+
+static void mmc_host_class_complete(struct device *dev)
+{
+ struct mmc_host *host = cls_dev_to_mmc_host(dev);
+
+ _mmc_detect_change(host, 0, false);
+}
+
+static const struct dev_pm_ops mmc_host_class_dev_pm_ops = {
+ .prepare = mmc_host_class_prepare,
+ .complete = mmc_host_class_complete,
+};
+
+#define MMC_HOST_CLASS_DEV_PM_OPS (&mmc_host_class_dev_pm_ops)
+#else
+#define MMC_HOST_CLASS_DEV_PM_OPS NULL
+#endif
+
static void mmc_host_classdev_release(struct device *dev)
{
struct mmc_host *host = cls_dev_to_mmc_host(dev);
@@ -46,6 +82,7 @@ static void mmc_host_classdev_release(struct device *dev)
static struct class mmc_host_class = {
.name = "mmc_host",
.dev_release = mmc_host_classdev_release,
+ .pm = MMC_HOST_CLASS_DEV_PM_OPS,
};
int mmc_register_host_class(void)
@@ -538,8 +575,6 @@ int mmc_add_host(struct mmc_host *host)
#endif
mmc_start_host(host);
- mmc_register_pm_notifier(host);
-
return 0;
}
@@ -555,7 +590,6 @@ EXPORT_SYMBOL(mmc_add_host);
*/
void mmc_remove_host(struct mmc_host *host)
{
- mmc_unregister_pm_notifier(host);
mmc_stop_host(host);
#ifdef CONFIG_DEBUG_FS
diff --git a/drivers/mmc/core/sdio.c b/drivers/mmc/core/sdio.c
index 0fda7784cab2..3eb94ac2712e 100644
--- a/drivers/mmc/core/sdio.c
+++ b/drivers/mmc/core/sdio.c
@@ -985,21 +985,37 @@ static void mmc_sdio_detect(struct mmc_host *host)
*/
static int mmc_sdio_pre_suspend(struct mmc_host *host)
{
- int i, err = 0;
+ int i;
for (i = 0; i < host->card->sdio_funcs; i++) {
struct sdio_func *func = host->card->sdio_func[i];
if (func && sdio_func_present(func) && func->dev.driver) {
const struct dev_pm_ops *pmops = func->dev.driver->pm;
- if (!pmops || !pmops->suspend || !pmops->resume) {
+ if (!pmops || !pmops->suspend || !pmops->resume)
/* force removal of entire card in that case */
- err = -ENOSYS;
- break;
- }
+ goto remove;
}
}
- return err;
+ return 0;
+
+remove:
+ if (!mmc_card_is_removable(host)) {
+ dev_warn(mmc_dev(host),
+ "missing suspend/resume ops for non-removable SDIO card\n");
+ /* Don't remove a non-removable card - we can't re-detect it. */
+ return 0;
+ }
+
+ /* Remove the SDIO card and let it be re-detected later on. */
+ mmc_sdio_remove(host);
+ mmc_claim_host(host);
+ mmc_detach_bus(host);
+ mmc_power_off(host);
+ mmc_release_host(host);
+ host->pm_flags = 0;
+
+ return 0;
}
/*
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index a001ad2f5f23..17d7b326af29 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -302,9 +302,6 @@ struct mmc_host {
u32 ocr_avail_sdio; /* SDIO-specific OCR */
u32 ocr_avail_sd; /* SD-specific OCR */
u32 ocr_avail_mmc; /* MMC-specific OCR */
-#ifdef CONFIG_PM_SLEEP
- struct notifier_block pm_notify;
-#endif
struct wakeup_source *ws; /* Enable consume of uevents */
u32 max_current_330;
u32 max_current_300;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 61ca49a9105faefa003b37542cebad8722f8ae22 Mon Sep 17 00:00:00 2001
From: Ilya Dryomov <idryomov(a)gmail.com>
Date: Mon, 26 Apr 2021 19:11:37 +0200
Subject: [PATCH] libceph: don't set global_id until we get an auth ticket
With the introduction of enforcing mode, setting global_id as soon
as we get it in the first MAuth reply will result in EACCES if the
connection is reset before we get the second MAuth reply containing
an auth ticket -- because on retry we would attempt to reclaim that
global_id with no auth ticket at hand.
Neither ceph_auth_client nor ceph_mon_client depend on global_id
being set ealy, so just delay the setting until we get and process
the second MAuth reply. While at it, complain if the monitor sends
a zero global_id or changes our global_id as the session is likely
to fail after that.
Cc: stable(a)vger.kernel.org # needs backporting for < 5.11
Signed-off-by: Ilya Dryomov <idryomov(a)gmail.com>
Reviewed-by: Sage Weil <sage(a)redhat.com>
diff --git a/net/ceph/auth.c b/net/ceph/auth.c
index eb261aa5fe18..de407e8feb97 100644
--- a/net/ceph/auth.c
+++ b/net/ceph/auth.c
@@ -36,6 +36,20 @@ static int init_protocol(struct ceph_auth_client *ac, int proto)
}
}
+static void set_global_id(struct ceph_auth_client *ac, u64 global_id)
+{
+ dout("%s global_id %llu\n", __func__, global_id);
+
+ if (!global_id)
+ pr_err("got zero global_id\n");
+
+ if (ac->global_id && global_id != ac->global_id)
+ pr_err("global_id changed from %llu to %llu\n", ac->global_id,
+ global_id);
+
+ ac->global_id = global_id;
+}
+
/*
* setup, teardown.
*/
@@ -222,11 +236,6 @@ int ceph_handle_auth_reply(struct ceph_auth_client *ac,
payload_end = payload + payload_len;
- if (global_id && ac->global_id != global_id) {
- dout(" set global_id %lld -> %lld\n", ac->global_id, global_id);
- ac->global_id = global_id;
- }
-
if (ac->negotiating) {
/* server does not support our protocols? */
if (!protocol && result < 0) {
@@ -253,11 +262,16 @@ int ceph_handle_auth_reply(struct ceph_auth_client *ac,
ret = ac->ops->handle_reply(ac, result, payload, payload_end,
NULL, NULL, NULL, NULL);
- if (ret == -EAGAIN)
+ if (ret == -EAGAIN) {
ret = build_request(ac, true, reply_buf, reply_len);
- else if (ret)
+ goto out;
+ } else if (ret) {
pr_err("auth protocol '%s' mauth authentication failed: %d\n",
ceph_auth_proto_name(ac->protocol), result);
+ goto out;
+ }
+
+ set_global_id(ac, global_id);
out:
mutex_unlock(&ac->mutex);
@@ -484,15 +498,11 @@ int ceph_auth_handle_reply_done(struct ceph_auth_client *ac,
int ret;
mutex_lock(&ac->mutex);
- if (global_id && ac->global_id != global_id) {
- dout("%s global_id %llu -> %llu\n", __func__, ac->global_id,
- global_id);
- ac->global_id = global_id;
- }
-
ret = ac->ops->handle_reply(ac, 0, reply, reply + reply_len,
session_key, session_key_len,
con_secret, con_secret_len);
+ if (!ret)
+ set_global_id(ac, global_id);
mutex_unlock(&ac->mutex);
return ret;
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 61ca49a9105faefa003b37542cebad8722f8ae22 Mon Sep 17 00:00:00 2001
From: Ilya Dryomov <idryomov(a)gmail.com>
Date: Mon, 26 Apr 2021 19:11:37 +0200
Subject: [PATCH] libceph: don't set global_id until we get an auth ticket
With the introduction of enforcing mode, setting global_id as soon
as we get it in the first MAuth reply will result in EACCES if the
connection is reset before we get the second MAuth reply containing
an auth ticket -- because on retry we would attempt to reclaim that
global_id with no auth ticket at hand.
Neither ceph_auth_client nor ceph_mon_client depend on global_id
being set ealy, so just delay the setting until we get and process
the second MAuth reply. While at it, complain if the monitor sends
a zero global_id or changes our global_id as the session is likely
to fail after that.
Cc: stable(a)vger.kernel.org # needs backporting for < 5.11
Signed-off-by: Ilya Dryomov <idryomov(a)gmail.com>
Reviewed-by: Sage Weil <sage(a)redhat.com>
diff --git a/net/ceph/auth.c b/net/ceph/auth.c
index eb261aa5fe18..de407e8feb97 100644
--- a/net/ceph/auth.c
+++ b/net/ceph/auth.c
@@ -36,6 +36,20 @@ static int init_protocol(struct ceph_auth_client *ac, int proto)
}
}
+static void set_global_id(struct ceph_auth_client *ac, u64 global_id)
+{
+ dout("%s global_id %llu\n", __func__, global_id);
+
+ if (!global_id)
+ pr_err("got zero global_id\n");
+
+ if (ac->global_id && global_id != ac->global_id)
+ pr_err("global_id changed from %llu to %llu\n", ac->global_id,
+ global_id);
+
+ ac->global_id = global_id;
+}
+
/*
* setup, teardown.
*/
@@ -222,11 +236,6 @@ int ceph_handle_auth_reply(struct ceph_auth_client *ac,
payload_end = payload + payload_len;
- if (global_id && ac->global_id != global_id) {
- dout(" set global_id %lld -> %lld\n", ac->global_id, global_id);
- ac->global_id = global_id;
- }
-
if (ac->negotiating) {
/* server does not support our protocols? */
if (!protocol && result < 0) {
@@ -253,11 +262,16 @@ int ceph_handle_auth_reply(struct ceph_auth_client *ac,
ret = ac->ops->handle_reply(ac, result, payload, payload_end,
NULL, NULL, NULL, NULL);
- if (ret == -EAGAIN)
+ if (ret == -EAGAIN) {
ret = build_request(ac, true, reply_buf, reply_len);
- else if (ret)
+ goto out;
+ } else if (ret) {
pr_err("auth protocol '%s' mauth authentication failed: %d\n",
ceph_auth_proto_name(ac->protocol), result);
+ goto out;
+ }
+
+ set_global_id(ac, global_id);
out:
mutex_unlock(&ac->mutex);
@@ -484,15 +498,11 @@ int ceph_auth_handle_reply_done(struct ceph_auth_client *ac,
int ret;
mutex_lock(&ac->mutex);
- if (global_id && ac->global_id != global_id) {
- dout("%s global_id %llu -> %llu\n", __func__, ac->global_id,
- global_id);
- ac->global_id = global_id;
- }
-
ret = ac->ops->handle_reply(ac, 0, reply, reply + reply_len,
session_key, session_key_len,
con_secret, con_secret_len);
+ if (!ret)
+ set_global_id(ac, global_id);
mutex_unlock(&ac->mutex);
return ret;
}
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From be94215be1ab19e5d38f50962f611c88d4bfc83a Mon Sep 17 00:00:00 2001
From: Xiang Chen <chenxiang66(a)hisilicon.com>
Date: Thu, 1 Apr 2021 15:34:46 +0800
Subject: [PATCH] mtd: spi-nor: core: Fix an issue of releasing resources
during read/write
If rmmod the driver during read or write, the driver will release the
resources which are used during read or write, so it is possible to
refer to NULL pointer.
Use the testcase "mtd_debug read /dev/mtd0 0xc00000 0x400000 dest_file &
sleep 0.5;rmmod spi_hisi_sfc_v3xx.ko", the issue can be reproduced in
hisi_sfc_v3xx driver.
To avoid the issue, fill the interface _get_device and _put_device of
mtd_info to grab the reference to the spi controller driver module, so
the request of rmmod the driver is rejected before read/write is finished.
Fixes: b199489d37b2 ("mtd: spi-nor: add the framework for SPI NOR")
Signed-off-by: Xiang Chen <chenxiang66(a)hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Tested-by: Michael Walle <michael(a)walle.cc>
Tested-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Reviewed-by: Michael Walle <michael(a)walle.cc>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/1617262486-4223-1-git-send-email-yangyicong@hisil…
diff --git a/drivers/mtd/spi-nor/core.c b/drivers/mtd/spi-nor/core.c
index 8cf3cf92129e..bd2c7717eb10 100644
--- a/drivers/mtd/spi-nor/core.c
+++ b/drivers/mtd/spi-nor/core.c
@@ -2905,6 +2905,37 @@ static void spi_nor_resume(struct mtd_info *mtd)
dev_err(dev, "resume() failed\n");
}
+static int spi_nor_get_device(struct mtd_info *mtd)
+{
+ struct mtd_info *master = mtd_get_master(mtd);
+ struct spi_nor *nor = mtd_to_spi_nor(master);
+ struct device *dev;
+
+ if (nor->spimem)
+ dev = nor->spimem->spi->controller->dev.parent;
+ else
+ dev = nor->dev;
+
+ if (!try_module_get(dev->driver->owner))
+ return -ENODEV;
+
+ return 0;
+}
+
+static void spi_nor_put_device(struct mtd_info *mtd)
+{
+ struct mtd_info *master = mtd_get_master(mtd);
+ struct spi_nor *nor = mtd_to_spi_nor(master);
+ struct device *dev;
+
+ if (nor->spimem)
+ dev = nor->spimem->spi->controller->dev.parent;
+ else
+ dev = nor->dev;
+
+ module_put(dev->driver->owner);
+}
+
void spi_nor_restore(struct spi_nor *nor)
{
/* restore the addressing mode */
@@ -3099,6 +3130,8 @@ int spi_nor_scan(struct spi_nor *nor, const char *name,
mtd->_read = spi_nor_read;
mtd->_suspend = spi_nor_suspend;
mtd->_resume = spi_nor_resume;
+ mtd->_get_device = spi_nor_get_device;
+ mtd->_put_device = spi_nor_put_device;
if (info->flags & USE_FSR)
nor->flags |= SNOR_F_USE_FSR;
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From be94215be1ab19e5d38f50962f611c88d4bfc83a Mon Sep 17 00:00:00 2001
From: Xiang Chen <chenxiang66(a)hisilicon.com>
Date: Thu, 1 Apr 2021 15:34:46 +0800
Subject: [PATCH] mtd: spi-nor: core: Fix an issue of releasing resources
during read/write
If rmmod the driver during read or write, the driver will release the
resources which are used during read or write, so it is possible to
refer to NULL pointer.
Use the testcase "mtd_debug read /dev/mtd0 0xc00000 0x400000 dest_file &
sleep 0.5;rmmod spi_hisi_sfc_v3xx.ko", the issue can be reproduced in
hisi_sfc_v3xx driver.
To avoid the issue, fill the interface _get_device and _put_device of
mtd_info to grab the reference to the spi controller driver module, so
the request of rmmod the driver is rejected before read/write is finished.
Fixes: b199489d37b2 ("mtd: spi-nor: add the framework for SPI NOR")
Signed-off-by: Xiang Chen <chenxiang66(a)hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Tested-by: Michael Walle <michael(a)walle.cc>
Tested-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Reviewed-by: Michael Walle <michael(a)walle.cc>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/1617262486-4223-1-git-send-email-yangyicong@hisil…
diff --git a/drivers/mtd/spi-nor/core.c b/drivers/mtd/spi-nor/core.c
index 8cf3cf92129e..bd2c7717eb10 100644
--- a/drivers/mtd/spi-nor/core.c
+++ b/drivers/mtd/spi-nor/core.c
@@ -2905,6 +2905,37 @@ static void spi_nor_resume(struct mtd_info *mtd)
dev_err(dev, "resume() failed\n");
}
+static int spi_nor_get_device(struct mtd_info *mtd)
+{
+ struct mtd_info *master = mtd_get_master(mtd);
+ struct spi_nor *nor = mtd_to_spi_nor(master);
+ struct device *dev;
+
+ if (nor->spimem)
+ dev = nor->spimem->spi->controller->dev.parent;
+ else
+ dev = nor->dev;
+
+ if (!try_module_get(dev->driver->owner))
+ return -ENODEV;
+
+ return 0;
+}
+
+static void spi_nor_put_device(struct mtd_info *mtd)
+{
+ struct mtd_info *master = mtd_get_master(mtd);
+ struct spi_nor *nor = mtd_to_spi_nor(master);
+ struct device *dev;
+
+ if (nor->spimem)
+ dev = nor->spimem->spi->controller->dev.parent;
+ else
+ dev = nor->dev;
+
+ module_put(dev->driver->owner);
+}
+
void spi_nor_restore(struct spi_nor *nor)
{
/* restore the addressing mode */
@@ -3099,6 +3130,8 @@ int spi_nor_scan(struct spi_nor *nor, const char *name,
mtd->_read = spi_nor_read;
mtd->_suspend = spi_nor_suspend;
mtd->_resume = spi_nor_resume;
+ mtd->_get_device = spi_nor_get_device;
+ mtd->_put_device = spi_nor_put_device;
if (info->flags & USE_FSR)
nor->flags |= SNOR_F_USE_FSR;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From be94215be1ab19e5d38f50962f611c88d4bfc83a Mon Sep 17 00:00:00 2001
From: Xiang Chen <chenxiang66(a)hisilicon.com>
Date: Thu, 1 Apr 2021 15:34:46 +0800
Subject: [PATCH] mtd: spi-nor: core: Fix an issue of releasing resources
during read/write
If rmmod the driver during read or write, the driver will release the
resources which are used during read or write, so it is possible to
refer to NULL pointer.
Use the testcase "mtd_debug read /dev/mtd0 0xc00000 0x400000 dest_file &
sleep 0.5;rmmod spi_hisi_sfc_v3xx.ko", the issue can be reproduced in
hisi_sfc_v3xx driver.
To avoid the issue, fill the interface _get_device and _put_device of
mtd_info to grab the reference to the spi controller driver module, so
the request of rmmod the driver is rejected before read/write is finished.
Fixes: b199489d37b2 ("mtd: spi-nor: add the framework for SPI NOR")
Signed-off-by: Xiang Chen <chenxiang66(a)hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Tested-by: Michael Walle <michael(a)walle.cc>
Tested-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Reviewed-by: Michael Walle <michael(a)walle.cc>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/1617262486-4223-1-git-send-email-yangyicong@hisil…
diff --git a/drivers/mtd/spi-nor/core.c b/drivers/mtd/spi-nor/core.c
index 8cf3cf92129e..bd2c7717eb10 100644
--- a/drivers/mtd/spi-nor/core.c
+++ b/drivers/mtd/spi-nor/core.c
@@ -2905,6 +2905,37 @@ static void spi_nor_resume(struct mtd_info *mtd)
dev_err(dev, "resume() failed\n");
}
+static int spi_nor_get_device(struct mtd_info *mtd)
+{
+ struct mtd_info *master = mtd_get_master(mtd);
+ struct spi_nor *nor = mtd_to_spi_nor(master);
+ struct device *dev;
+
+ if (nor->spimem)
+ dev = nor->spimem->spi->controller->dev.parent;
+ else
+ dev = nor->dev;
+
+ if (!try_module_get(dev->driver->owner))
+ return -ENODEV;
+
+ return 0;
+}
+
+static void spi_nor_put_device(struct mtd_info *mtd)
+{
+ struct mtd_info *master = mtd_get_master(mtd);
+ struct spi_nor *nor = mtd_to_spi_nor(master);
+ struct device *dev;
+
+ if (nor->spimem)
+ dev = nor->spimem->spi->controller->dev.parent;
+ else
+ dev = nor->dev;
+
+ module_put(dev->driver->owner);
+}
+
void spi_nor_restore(struct spi_nor *nor)
{
/* restore the addressing mode */
@@ -3099,6 +3130,8 @@ int spi_nor_scan(struct spi_nor *nor, const char *name,
mtd->_read = spi_nor_read;
mtd->_suspend = spi_nor_suspend;
mtd->_resume = spi_nor_resume;
+ mtd->_get_device = spi_nor_get_device;
+ mtd->_put_device = spi_nor_put_device;
if (info->flags & USE_FSR)
nor->flags |= SNOR_F_USE_FSR;
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 8c9af478c06bb1ab1422f90d8ecbc53defd44bc3 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (VMware)" <rostedt(a)goodmis.org>
Date: Wed, 5 May 2021 10:38:24 -0400
Subject: [PATCH] ftrace: Handle commands when closing set_ftrace_filter file
# echo switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter
will cause switch_mm to stop tracing by the traceoff command.
# echo -n switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter
does nothing.
The reason is that the parsing in the write function only processes
commands if it finished parsing (there is white space written after the
command). That's to handle:
write(fd, "switch_mm:", 10);
write(fd, "traceoff", 8);
cases, where the command is broken over multiple writes.
The problem is if the file descriptor is closed, then the write call is
not processed, and the command needs to be processed in the release code.
The release code can handle matching of functions, but does not handle
commands.
Cc: stable(a)vger.kernel.org
Fixes: eda1e32855656 ("tracing: handle broken names in ftrace filter")
Signed-off-by: Steven Rostedt (VMware) <rostedt(a)goodmis.org>
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 057e962ca5ce..c57508445faa 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5591,7 +5591,10 @@ int ftrace_regex_release(struct inode *inode, struct file *file)
parser = &iter->parser;
if (trace_parser_loaded(parser)) {
- ftrace_match_records(iter->hash, parser->buffer, parser->idx);
+ int enable = !(iter->flags & FTRACE_ITER_NOTRACE);
+
+ ftrace_process_regex(iter, parser->buffer,
+ parser->idx, enable);
}
trace_parser_put(parser);
These tests deliberately access these arrays out of bounds,
which will cause the dynamic local bounds checks inserted by
CONFIG_UBSAN_LOCAL_BOUNDS to fail and panic the kernel. To avoid this
problem, access the arrays via volatile pointers, which will prevent
the compiler from being able to determine the array bounds.
These accesses use volatile pointers to char (char *volatile) rather
than the more conventional pointers to volatile char (volatile char *)
because we want to prevent the compiler from making inferences about
the pointer itself (i.e. its array bounds), not the data that it
refers to.
Signed-off-by: Peter Collingbourne <pcc(a)google.com>
Cc: stable(a)vger.kernel.org
Link: https://linux-review.googlesource.com/id/I90b1713fbfa1bf68ff895aef099ea77b9…
---
lib/test_kasan.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)
diff --git a/lib/test_kasan.c b/lib/test_kasan.c
index dc05cfc2d12f..cacbbbdef768 100644
--- a/lib/test_kasan.c
+++ b/lib/test_kasan.c
@@ -654,8 +654,20 @@ static char global_array[10];
static void kasan_global_oob(struct kunit *test)
{
- volatile int i = 3;
- char *p = &global_array[ARRAY_SIZE(global_array) + i];
+ /*
+ * Deliberate out-of-bounds access. To prevent CONFIG_UBSAN_LOCAL_BOUNDS
+ * from failing here and panicing the kernel, access the array via a
+ * volatile pointer, which will prevent the compiler from being able to
+ * determine the array bounds.
+ *
+ * This access uses a volatile pointer to char (char *volatile) rather
+ * than the more conventional pointer to volatile char (volatile char *)
+ * because we want to prevent the compiler from making inferences about
+ * the pointer itself (i.e. its array bounds), not the data that it
+ * refers to.
+ */
+ char *volatile array = global_array;
+ char *p = &array[ARRAY_SIZE(global_array) + 3];
/* Only generic mode instruments globals. */
KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_GENERIC);
@@ -703,8 +715,9 @@ static void ksize_uaf(struct kunit *test)
static void kasan_stack_oob(struct kunit *test)
{
char stack_array[10];
- volatile int i = OOB_TAG_OFF;
- char *p = &stack_array[ARRAY_SIZE(stack_array) + i];
+ /* See comment in kasan_global_oob. */
+ char *volatile array = stack_array;
+ char *p = &array[ARRAY_SIZE(stack_array) + OOB_TAG_OFF];
KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_STACK);
@@ -715,7 +728,9 @@ static void kasan_alloca_oob_left(struct kunit *test)
{
volatile int i = 10;
char alloca_array[i];
- char *p = alloca_array - 1;
+ /* See comment in kasan_global_oob. */
+ char *volatile array = alloca_array;
+ char *p = array - 1;
/* Only generic mode instruments dynamic allocas. */
KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_GENERIC);
@@ -728,7 +743,9 @@ static void kasan_alloca_oob_right(struct kunit *test)
{
volatile int i = 10;
char alloca_array[i];
- char *p = alloca_array + i;
+ /* See comment in kasan_global_oob. */
+ char *volatile array = alloca_array;
+ char *p = array + i;
/* Only generic mode instruments dynamic allocas. */
KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_GENERIC);
--
2.31.1.607.g51e8a6a459-goog
do_mq_timedreceive calls wq_sleep with a stack local address. The
sender (do_mq_timedsend) uses this address to later call
pipelined_send.
This leads to a very hard to trigger race where a do_mq_timedreceive call
might return and leave do_mq_timedsend to rely on an invalid address,
causing the following crash:
[ 240.739977] RIP: 0010:wake_q_add_safe+0x13/0x60
[ 240.739991] Call Trace:
[ 240.739999] __x64_sys_mq_timedsend+0x2a9/0x490
[ 240.740003] ? auditd_test_task+0x38/0x40
[ 240.740007] ? auditd_test_task+0x38/0x40
[ 240.740011] do_syscall_64+0x80/0x680
[ 240.740017] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 240.740019] RIP: 0033:0x7f5928e40343
The race occurs as:
1. do_mq_timedreceive calls wq_sleep with the address of
`struct ext_wait_queue` on function stack (aliased as `ewq_addr` here)
- it holds a valid `struct ext_wait_queue *` as long as the stack has
not been overwritten.
2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
__pipelined_op.
3. Sender calls __pipelined_op::smp_store_release(&this->state, STATE_READY).
Here is where the race window begins. (`this` is `ewq_addr`.)
4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it
will see `state == STATE_READY` and break. `ewq_addr` gets removed from
info->e_wait_q[RECV].list.
5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed
to be a `struct ext_wait_queue *` since it was on do_mq_timedreceive's
stack. (Although the address may not get overwritten until another
function happens to touch it, which means it can persist around for an
indefinite time.)
6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a
`struct ext_wait_queue *`, and uses it to find a task_struct to pass
to the wake_q_add_safe call. In the lucky case where nothing has
overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct.
In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a
bogus address as the receiver's task_struct causing the crash.
do_mq_timedsend::__pipelined_op() should not dereference `this` after
setting STATE_READY, as the receiver counterpart is now free to return.
Change __pipelined_op to call wake_q_add before setting STATE_READY
which ensures that the receiver's task_struct can still be found via
`this`.
Fixes: c5b2cbdbdac563 ("ipc/mqueue.c: update/document memory barriers")
Signed-off-by: Varad Gautam <varad.gautam(a)suse.com>
Reported-by: Matthias von Faber <matthias.vonfaber(a)aox-tech.de>
Cc: <stable(a)vger.kernel.org> # 5.6
Cc: Christian Brauner <christian.brauner(a)ubuntu.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Manfred Spraul <manfred(a)colorfullife.com>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Davidlohr Bueso <dbueso(a)suse.de>
---
v2: Call wake_q_add before smp_store_release, instead of using a
get_task_struct/wake_q_add_safe combination across
smp_store_release. (Davidlohr Bueso)
ipc/mqueue.c | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 8031464ed4ae..bfcb6f81a824 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -78,11 +78,13 @@ struct posix_msg_tree_node {
* MQ_BARRIER:
* To achieve proper release/acquire memory barrier pairing, the state is set to
* STATE_READY with smp_store_release(), and it is read with READ_ONCE followed
- * by smp_acquire__after_ctrl_dep(). In addition, wake_q_add_safe() is used.
+ * by smp_acquire__after_ctrl_dep(). The state change to STATE_READY must be
+ * the last write operation, after which the blocked task can immediately
+ * return and exit.
*
* This prevents the following races:
*
- * 1) With the simple wake_q_add(), the task could be gone already before
+ * 1) With wake_q_add(), the task could be gone already before
* the increase of the reference happens
* Thread A
* Thread B
@@ -97,10 +99,25 @@ struct posix_msg_tree_node {
* sys_exit()
* get_task_struct() // UaF
*
- * Solution: Use wake_q_add_safe() and perform the get_task_struct() before
- * the smp_store_release() that does ->state = STATE_READY.
+ * 2) With wake_q_add(), the receiver task could have returned from the
+ * syscall and had its stack-allocated waiter overwritten before the
+ * sender could add it to the wake_q
+ * Thread A
+ * Thread B
+ * WRITE_ONCE(wait.state, STATE_NONE);
+ * schedule_hrtimeout()
+ * ->state = STATE_READY
+ * <timeout returns>
+ * if (wait.state == STATE_READY) return;
+ * sysret to user space
+ * overwrite receiver's stack
+ * wake_q_add(A)
+ * if (cmpxchg()) // corrupted waiter
*
- * 2) Without proper _release/_acquire barriers, the woken up task
+ * Solution: Queue the task for wakeup before the smp_store_release() that
+ * does ->state = STATE_READY.
+ *
+ * 3) Without proper _release/_acquire barriers, the woken up task
* could read stale data
*
* Thread A
@@ -116,7 +133,7 @@ struct posix_msg_tree_node {
*
* Solution: use _release and _acquire barriers.
*
- * 3) There is intentionally no barrier when setting current->state
+ * 4) There is intentionally no barrier when setting current->state
* to TASK_INTERRUPTIBLE: spin_unlock(&info->lock) provides the
* release memory barrier, and the wakeup is triggered when holding
* info->lock, i.e. spin_lock(&info->lock) provided a pairing
@@ -1005,11 +1022,9 @@ static inline void __pipelined_op(struct wake_q_head *wake_q,
struct ext_wait_queue *this)
{
list_del(&this->list);
- get_task_struct(this->task);
-
+ wake_q_add(wake_q, this->task);
/* see MQ_BARRIER for purpose/pairing */
smp_store_release(&this->state, STATE_READY);
- wake_q_add_safe(wake_q, this->task);
}
/* pipelined_send() - send a message directly to the task waiting in
--
2.30.2
Hi Greg,
As suggested by you in MHI v5.13 PR [1], please find the below commits that
got merged for v5.13 and should be backported to stable kernels:
683e77cadc83 ("bus: mhi: core: Fix shadow declarations")
5630c1009bd9 ("bus: mhi: pci_generic: Constify mhi_controller_config struct definitions")
ec32332df764 ("bus: mhi: core: Sanity check values from remote device before use")
47705c084659 ("bus: mhi: core: Clear configuration from channel context during reset")
70f7025c854c ("bus: mhi: core: remove redundant initialization of variables state and ee")
68731852f6e5 ("bus: mhi: core: Return EAGAIN if MHI ring is full")
4547a749be99 ("bus: mhi: core: Fix MHI runtime_pm behavior")
6403298c58d4 ("bus: mhi: core: Fix check for syserr at power_up")
8de5ad994143 ("bus: mhi: core: Add missing checks for MMIO register entries")
0ecc1c70dcd3 ("bus: mhi: core: Fix invalid error returning in mhi_queue")
0fccbf0a3b69 ("bus: mhi: pci_generic: Remove WQ_MEM_RECLAIM flag from state workqueue")
I'll make sure to tag stable list for future potential patches.
Thanks,
Mani
[1] https://lore.kernel.org/patchwork/patch/1411161/#1609631
I'm announcing the release of the 5.12.2 kernel.
All users of the 5.12 kernel series must upgrade.
The updated 5.12.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.12.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2 -
arch/mips/include/asm/vdso/gettimeofday.h | 26 +++++++++++++++++++----
drivers/gpu/drm/i915/i915_drv.c | 10 +++++++++
drivers/net/usb/ax88179_178a.c | 6 +++--
drivers/platform/x86/thinkpad_acpi.c | 31 ++++++++++++++++++++--------
drivers/usb/core/quirks.c | 4 +++
fs/overlayfs/namei.c | 1
fs/overlayfs/super.c | 12 ++++++----
include/linux/bpf_verifier.h | 5 ++--
kernel/bpf/verifier.c | 33 ++++++++++++++++--------------
kernel/events/core.c | 12 +++++-----
net/netfilter/nf_conntrack_standalone.c | 10 +--------
net/qrtr/mhi.c | 8 ++++---
sound/usb/endpoint.c | 8 +++----
sound/usb/quirks-table.h | 10 +++++++++
15 files changed, 118 insertions(+), 60 deletions(-)
Bjorn Andersson (1):
net: qrtr: Avoid potential use after free in MHI send
Chris Chiu (1):
USB: Add reset-resume quirk for WD19's Realtek Hub
Daniel Borkmann (2):
bpf: Fix masking negation logic upon negative dst register
bpf: Fix leakage of uninitialized bpf stack under speculation
Greg Kroah-Hartman (1):
Linux 5.12.2
Imre Deak (1):
drm/i915: Disable runtime power management during shutdown
Jonathon Reinhart (1):
netfilter: conntrack: Make global sysctls readonly in non-init netns
Kai-Heng Feng (1):
USB: Add LPM quirk for Lenovo ThinkPad USB-C Dock Gen2 Ethernet
Mark Pearson (1):
platform/x86: thinkpad_acpi: Correct thermal sensor allocation
Mickaël Salaün (1):
ovl: fix leaked dentry
Miklos Szeredi (1):
ovl: allow upperdir inside lowerdir
Ondrej Mosnacek (1):
perf/core: Fix unconditional security_locked_down() call
Phillip Potter (1):
net: usb: ax88179_178a: initialize local variables before use
Romain Naour (1):
mips: Do not include hi and lo in clobber list for R6
Takashi Iwai (2):
ALSA: usb-audio: Add MIDI quirk for Vox ToneLab EX
ALSA: usb-audio: Fix implicit sync clearance at stopping stream
I'm announcing the release of the 5.10.35 kernel.
All users of the 5.10 kernel series must upgrade.
The updated 5.10.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.10.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/mips/include/asm/vdso/gettimeofday.h | 26 ++-
drivers/net/ethernet/intel/igb/igb_main.c | 3
drivers/net/usb/ax88179_178a.c | 6
drivers/nvme/host/pci.c | 1
drivers/platform/x86/thinkpad_acpi.c | 31 ++-
drivers/usb/core/quirks.c | 4
drivers/vfio/Kconfig | 2
fs/overlayfs/namei.c | 1
fs/overlayfs/super.c | 12 -
include/linux/bpf_verifier.h | 5
include/linux/device.h | 1
include/linux/dma-mapping.h | 16 +
include/linux/swiotlb.h | 1
include/linux/user_namespace.h | 3
include/uapi/linux/capability.h | 3
kernel/bpf/verifier.c | 33 ++-
kernel/dma/swiotlb.c | 259 ++++++++++++++++--------------
kernel/events/core.c | 12 -
kernel/user_namespace.c | 65 +++++++
net/netfilter/nf_conntrack_standalone.c | 10 -
net/qrtr/mhi.c | 8
sound/usb/quirks-table.h | 10 +
tools/cgroup/memcg_slabinfo.py | 8
tools/perf/builtin-ftrace.c | 2
tools/perf/util/data.c | 5
26 files changed, 345 insertions(+), 184 deletions(-)
Bjorn Andersson (1):
net: qrtr: Avoid potential use after free in MHI send
Chris Chiu (1):
USB: Add reset-resume quirk for WD19's Realtek Hub
Daniel Borkmann (2):
bpf: Fix masking negation logic upon negative dst register
bpf: Fix leakage of uninitialized bpf stack under speculation
Greg Kroah-Hartman (1):
Linux 5.10.35
Jason Gunthorpe (1):
vfio: Depend on MMU
Jianxiong Gao (9):
driver core: add a min_align_mask field to struct device_dma_parameters
swiotlb: add a IO_TLB_SIZE define
swiotlb: factor out an io_tlb_offset helper
swiotlb: factor out a nr_slots helper
swiotlb: clean up swiotlb_tbl_unmap_single
swiotlb: refactor swiotlb_tbl_map_single
swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single
swiotlb: respect min_align_mask
nvme-pci: set min_align_mask
Jonathon Reinhart (1):
netfilter: conntrack: Make global sysctls readonly in non-init netns
Kai-Heng Feng (1):
USB: Add LPM quirk for Lenovo ThinkPad USB-C Dock Gen2 Ethernet
Mark Pearson (1):
platform/x86: thinkpad_acpi: Correct thermal sensor allocation
Mickaël Salaün (1):
ovl: fix leaked dentry
Miklos Szeredi (1):
ovl: allow upperdir inside lowerdir
Nick Lowe (1):
igb: Enable RSS for Intel I211 Ethernet Controller
Ondrej Mosnacek (1):
perf/core: Fix unconditional security_locked_down() call
Phillip Potter (1):
net: usb: ax88179_178a: initialize local variables before use
Romain Naour (1):
mips: Do not include hi and lo in clobber list for R6
Serge E. Hallyn (1):
capabilities: require CAP_SETFCAP to map uid 0
Takashi Iwai (1):
ALSA: usb-audio: Add MIDI quirk for Vox ToneLab EX
Thomas Richter (1):
perf ftrace: Fix access to pid in array when setting a pid filter
Vasily Averin (1):
tools/cgroup/slabinfo.py: updated to work on current kernel
Zhen Lei (1):
perf data: Fix error return code in perf_data__create_dir()
Ubuntu users reported an audio bug on the Lenovo Yoga Slim 7 14IIL05,
he installed dual OS (Windows + Linux), if he booted to the Linux
from Windows, the Speaker can't work well, it has crackling noise,
if he poweroff the machine first after Windows, the Speaker worked
well.
Before rebooting or shutdown from Windows, the Windows changes the
codec eapd coeff value, but the BIOS doesn't re-initialize its value,
when booting into the Linux from Windows, the eapd coeff value is not
correct. To fix it, set the codec default value to that coeff register
in the alsa driver.
BugLink: http://bugs.launchpad.net/bugs/1925057
Suggested-by: Kailang Yang <kailang(a)realtek.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Hui Wang <hui.wang(a)canonical.com>
---
sound/pci/hda/patch_realtek.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c
index 6d58f24c9702..a5f3e78ec04e 100644
--- a/sound/pci/hda/patch_realtek.c
+++ b/sound/pci/hda/patch_realtek.c
@@ -395,7 +395,6 @@ static void alc_fill_eapd_coef(struct hda_codec *codec)
case 0x10ec0282:
case 0x10ec0283:
case 0x10ec0286:
- case 0x10ec0287:
case 0x10ec0288:
case 0x10ec0285:
case 0x10ec0298:
@@ -406,6 +405,10 @@ static void alc_fill_eapd_coef(struct hda_codec *codec)
case 0x10ec0275:
alc_update_coef_idx(codec, 0xe, 0, 1<<0);
break;
+ case 0x10ec0287:
+ alc_update_coef_idx(codec, 0x10, 1<<9, 0);
+ alc_write_coef_idx(codec, 0x8, 0x4ab7);
+ break;
case 0x10ec0293:
alc_update_coef_idx(codec, 0xa, 1<<13, 0);
break;
--
2.25.1
From: Jason Gunthorpe <jgg(a)nvidia.com>
commit b2b12db53507bc97d96f6b7cb279e831e5eafb00 upstream
VFIO_IOMMU_TYPE1 does not compile with !MMU:
../drivers/vfio/vfio_iommu_type1.c: In function 'follow_fault_pfn':
../drivers/vfio/vfio_iommu_type1.c:536:22: error: implicit declaration of function 'pte_write'; did you mean 'vfs_write'? [-Werror=implicit-function-declaration]
So require it.
Suggested-by: Cornelia Huck <cohuck(a)redhat.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
Message-Id: <0-v1-02cb5500df6e+78-vfio_no_mmu_jgg(a)nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson(a)redhat.com>
Cc: stable(a)vger.kernel.org # 5.11.y, 5.10.y, 5.4.y
Signed-off-by: Alex Williamson <alex.williamson(a)redhat.com>
---
The noted stable branches include upstream commit 179209fa1270
("vfio: IOMMU_API should be selected") without the follow-up commit
b2b12db53507 ("vfio: Depend on MMU"), which should have included a
Fixes: tag for the prior commit. Without this latter commit, we're
susceptible to randconfig failures with !MMU configs. Thanks!
drivers/vfio/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 90c0525b1e0c..67d0bf4efa16 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -22,7 +22,7 @@ config VFIO_VIRQFD
menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
select IOMMU_API
- select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
+ select VFIO_IOMMU_TYPE1 if MMU && (X86 || S390 || ARM || ARM64)
help
VFIO provides a framework for secure userspace device drivers.
See Documentation/driver-api/vfio.rst for more details.
This is the start of the stable review cycle for the 4.19.190 release.
There are 15 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Fri, 07 May 2021 12:04:54 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.19.190-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.19.190-rc1
Miklos Szeredi <mszeredi(a)redhat.com>
ovl: allow upperdir inside lowerdir
Mark Pearson <markpearson(a)lenovo.com>
platform/x86: thinkpad_acpi: Correct thermal sensor allocation
Shengjiu Wang <shengjiu.wang(a)nxp.com>
ASoC: ak5558: Add MODULE_DEVICE_TABLE
Shengjiu Wang <shengjiu.wang(a)nxp.com>
ASoC: ak4458: Add MODULE_DEVICE_TABLE
Chris Chiu <chris.chiu(a)canonical.com>
USB: Add reset-resume quirk for WD19's Realtek Hub
Kai-Heng Feng <kai.heng.feng(a)canonical.com>
USB: Add LPM quirk for Lenovo ThinkPad USB-C Dock Gen2 Ethernet
Takashi Iwai <tiwai(a)suse.de>
ALSA: usb-audio: Add MIDI quirk for Vox ToneLab EX
Jiri Kosina <jkosina(a)suse.cz>
iwlwifi: Fix softirq/hardirq disabling in iwl_pcie_gen2_enqueue_hcmd()
Daniel Borkmann <daniel(a)iogearbox.net>
bpf: Fix masking negation logic upon negative dst register
Romain Naour <romain.naour(a)gmail.com>
mips: Do not include hi and lo in clobber list for R6
Jiri Kosina <jkosina(a)suse.cz>
iwlwifi: Fix softirq/hardirq disabling in iwl_pcie_enqueue_hcmd()
Phillip Potter <phil(a)philpotter.co.uk>
net: usb: ax88179_178a: initialize local variables before use
Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
ACPI: x86: Call acpi_boot_table_init() after acpi_table_upgrade()
Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
ACPI: tables: x86: Reserve memory occupied by ACPI tables
Gao Xiang <hsiangkao(a)redhat.com>
erofs: fix extended inode could cross boundary
-------------
Diffstat:
Makefile | 4 +-
arch/mips/vdso/gettimeofday.c | 14 ++-
arch/x86/kernel/acpi/boot.c | 25 ++--
arch/x86/kernel/setup.c | 7 +-
drivers/acpi/tables.c | 42 ++++++-
drivers/net/usb/ax88179_178a.c | 4 +-
drivers/net/wireless/intel/iwlwifi/pcie/tx-gen2.c | 7 +-
drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 7 +-
drivers/platform/x86/thinkpad_acpi.c | 31 +++--
drivers/staging/erofs/inode.c | 135 ++++++++++++++--------
drivers/usb/core/quirks.c | 4 +
fs/overlayfs/super.c | 12 +-
include/linux/acpi.h | 9 +-
kernel/bpf/verifier.c | 12 +-
sound/soc/codecs/ak4458.c | 1 +
sound/soc/codecs/ak5558.c | 1 +
sound/usb/quirks-table.h | 10 ++
17 files changed, 224 insertions(+), 101 deletions(-)
From: Davidlohr Bueso <dave(a)stgolabs.net>
Subject: fs/epoll: restore waking from ep_done_scan()
339ddb53d373 (fs/epoll: remove unnecessary wakeups of nested epoll)
changed the userspace visible behavior of exclusive waiters blocked on a
common epoll descriptor upon a single event becoming ready. Previously,
all tasks doing epoll_wait would awake, and now only one is awoken,
potentially causing missed wakeups on applications that rely on this
behavior, such as Apache Qpid.
While the aforementioned commit aims at having only a wakeup single path
in ep_poll_callback (with the exceptions of epoll_ctl cases), we need to
restore the wakeup in what was the old ep_scan_ready_list() such that the
next thread can be awoken, in a cascading style, after the waker's
corresponding ep_send_events().
Link: https://lkml.kernel.org/r/20210405231025.33829-3-dave@stgolabs.net
Fixes: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll")
Signed-off-by: Davidlohr Bueso <dbueso(a)suse.de>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Jason Baron <jbaron(a)akamai.com>
Cc: Roman Penyaev <rpenyaev(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/eventpoll.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/fs/eventpoll.c~fs-epoll-restore-waking-from-ep_done_scan
+++ a/fs/eventpoll.c
@@ -657,6 +657,12 @@ static void ep_done_scan(struct eventpol
*/
list_splice(txlist, &ep->rdllist);
__pm_relax(ep->ws);
+
+ if (!list_empty(&ep->rdllist)) {
+ if (waitqueue_active(&ep->wq))
+ wake_up(&ep->wq);
+ }
+
write_unlock_irq(&ep->lock);
}
_
From: Ido Schimmel <idosch(a)nvidia.com>
Each multicast route that is forwarding packets (as opposed to trapping
them) points to a list of egress router interfaces (RIFs) through which
packets are replicated.
A route's action can transition from trap to forward when a RIF is
created for one of the route's egress virtual interfaces (eVIF). When
this happens, the route's action is first updated and only later the
list of egress RIFs is committed to the device.
This results in the route pointing to an invalid list. In case the list
pointer is out of range (due to uninitialized memory), the device will
complain:
mlxsw_spectrum2 0000:06:00.0: EMAD reg access failed (tid=5733bf490000905c,reg_id=300f(pefa),type=write,status=7(bad parameter))
Fix this by first committing the list of egress RIFs to the device and
only later update the route's action.
Note that a fix is not needed in the reverse function (i.e.,
mlxsw_sp_mr_route_evif_unresolve()), as there the route's action is
first updated and only later the RIF is removed from the list.
Cc: stable(a)vger.kernel.org
Fixes: c011ec1bbfd6 ("mlxsw: spectrum: Add the multicast routing offloading logic")
Signed-off-by: Ido Schimmel <idosch(a)nvidia.com>
Reviewed-by: Petr Machata <petrm(a)nvidia.com>
---
.../net/ethernet/mellanox/mlxsw/spectrum_mr.c | 30 +++++++++----------
1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
index 7846a21555ef..1f6bc0c7e91d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
@@ -535,6 +535,16 @@ mlxsw_sp_mr_route_evif_resolve(struct mlxsw_sp_mr_table *mr_table,
u16 erif_index = 0;
int err;
+ /* Add the eRIF */
+ if (mlxsw_sp_mr_vif_valid(rve->mr_vif)) {
+ erif_index = mlxsw_sp_rif_index(rve->mr_vif->rif);
+ err = mr->mr_ops->route_erif_add(mlxsw_sp,
+ rve->mr_route->route_priv,
+ erif_index);
+ if (err)
+ return err;
+ }
+
/* Update the route action, as the new eVIF can be a tunnel or a pimreg
* device which will require updating the action.
*/
@@ -544,17 +554,7 @@ mlxsw_sp_mr_route_evif_resolve(struct mlxsw_sp_mr_table *mr_table,
rve->mr_route->route_priv,
route_action);
if (err)
- return err;
- }
-
- /* Add the eRIF */
- if (mlxsw_sp_mr_vif_valid(rve->mr_vif)) {
- erif_index = mlxsw_sp_rif_index(rve->mr_vif->rif);
- err = mr->mr_ops->route_erif_add(mlxsw_sp,
- rve->mr_route->route_priv,
- erif_index);
- if (err)
- goto err_route_erif_add;
+ goto err_route_action_update;
}
/* Update the minimum MTU */
@@ -572,14 +572,14 @@ mlxsw_sp_mr_route_evif_resolve(struct mlxsw_sp_mr_table *mr_table,
return 0;
err_route_min_mtu_update:
- if (mlxsw_sp_mr_vif_valid(rve->mr_vif))
- mr->mr_ops->route_erif_del(mlxsw_sp, rve->mr_route->route_priv,
- erif_index);
-err_route_erif_add:
if (route_action != rve->mr_route->route_action)
mr->mr_ops->route_action_update(mlxsw_sp,
rve->mr_route->route_priv,
rve->mr_route->route_action);
+err_route_action_update:
+ if (mlxsw_sp_mr_vif_valid(rve->mr_vif))
+ mr->mr_ops->route_erif_del(mlxsw_sp, rve->mr_route->route_priv,
+ erif_index);
return err;
}
--
2.30.2
From: Joerg Roedel <jroedel(a)suse.de>
For now, kexec is not supported when running as an SEV-ES guest. Doing
so requires additional hypervisor support and special code to hand
over the CPUs to the new kernel in a safe way.
Until this is implemented, do not support kexec in SEV-ES guests.
Cc: stable(a)vger.kernel.org # v5.10+
Signed-off-by: Joerg Roedel <jroedel(a)suse.de>
---
arch/x86/kernel/machine_kexec_64.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index c078b0d3ab0e..f902cc9cc634 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -620,3 +620,11 @@ void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages)
*/
set_memory_encrypted((unsigned long)vaddr, pages);
}
+
+/*
+ * Kexec is not supported in SEV-ES guests yet
+ */
+bool arch_kexec_supported(void)
+{
+ return !sev_es_active();
+}
--
2.31.1
From: Joerg Roedel <jroedel(a)suse.de>
Allow a runtime opt-out of kexec support for architecture code in case
the kernel is running in an environment where kexec is not properly
supported yet.
This will be used on x86 when the kernel is running as an SEV-ES
guest. SEV-ES guests need special handling for kexec to hand over all
CPUs to the new kernel. This requires special hypervisor support and
handling code in the guest which is not yet implemented.
Cc: stable(a)vger.kernel.org # v5.10+
Signed-off-by: Joerg Roedel <jroedel(a)suse.de>
---
kernel/kexec.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/kernel/kexec.c b/kernel/kexec.c
index c82c6c06f051..d03134160458 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -195,11 +195,25 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
* that to happen you need to do that yourself.
*/
+bool __weak arch_kexec_supported(void)
+{
+ return true;
+}
+
static inline int kexec_load_check(unsigned long nr_segments,
unsigned long flags)
{
int result;
+ /*
+ * The architecture may support kexec in general, but the kernel could
+ * run in an environment where it is not (yet) possible to execute a new
+ * kernel. Allow the architecture code to opt-out of kexec support when
+ * it is running in such an environment.
+ */
+ if (!arch_kexec_supported())
+ return -ENOSYS;
+
/* We only trust the superuser with rebooting the system. */
if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
return -EPERM;
--
2.31.1
The FUTEX_WAIT operand has historically a relative timeout which means that
the clock id is irrelevant as relative timeouts on CLOCK_REALTIME are not
subject to wall clock changes and therefore are mapped by the kernel to
CLOCK_MONOTONIC for simplicity.
If a caller would set FUTEX_CLOCK_REALTIME for FUTEX_WAIT the timeout is
still treated relative vs. CLOCK_MONOTONIC and then the wait arms that
timeout based on CLOCK_REALTIME which is broken and obviously has never
been used or even tested.
Reject any attempt to use FUTEX_CLOCK_REALTIME with FUTEX_WAIT again.
The desired functionality can be achieved with FUTEX_WAIT_BITSET and a
FUTEX_BITSET_MATCH_ANY argument.
Fixes: 337f13046ff0 ("futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op")
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Cc: Darren Hart <dvhart(a)infradead.org>
---
kernel/futex.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3711,8 +3711,7 @@ long do_futex(u32 __user *uaddr, int op,
if (op & FUTEX_CLOCK_REALTIME) {
flags |= FLAGS_CLOCKRT;
- if (cmd != FUTEX_WAIT && cmd != FUTEX_WAIT_BITSET && \
- cmd != FUTEX_WAIT_REQUEUE_PI)
+ if (cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI)
return -ENOSYS;
}
FUTEX_LOCK_PI does not require to have the FUTEX_CLOCK_REALTIME bit set
because it has been using CLOCK_REALTIME based absolute timeouts
forever. Due to that, the time namespace adjustment which is applied when
FUTEX_CLOCK_REALTIME is not set, will wrongly take place for FUTEX_LOCK_PI
and wreckage the timeout.
Exclude it from that procedure.
Fixes: c2f7d08cccf4 ("futex: Adjust absolute futex timeouts with per time namespace offset")
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Andrei Vagin <avagin(a)gmail.com>
Cc: stable(a)vger.kernel.org
---
kernel/futex.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3781,7 +3781,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uad
t = timespec64_to_ktime(ts);
if (cmd == FUTEX_WAIT)
t = ktime_add_safe(ktime_get(), t);
- else if (!(op & FUTEX_CLOCK_REALTIME))
+ else if (cmd != FUTEX_LOCK_PI && !(op & FUTEX_CLOCK_REALTIME))
t = timens_ktime_to_host(CLOCK_MONOTONIC, t);
tp = &t;
}
@@ -3975,7 +3975,7 @@ SYSCALL_DEFINE6(futex_time32, u32 __user
t = timespec64_to_ktime(ts);
if (cmd == FUTEX_WAIT)
t = ktime_add_safe(ktime_get(), t);
- else if (!(op & FUTEX_CLOCK_REALTIME))
+ else if (cmd != FUTEX_LOCK_PI && !(op & FUTEX_CLOCK_REALTIME))
t = timens_ktime_to_host(CLOCK_MONOTONIC, t);
tp = &t;
}
commit 37486135d3a7b03acc7755b63627a130437f066a upstream.
In 5.4.y only, vcpu->arch.host_pkru is being set on every run thru
of vcpu_enter_guest, when it really only needs to be set on load. As
a result, we're doing a rdpkru on supported CPUs on every iteration
of vcpu_enter_guest even though the value never changes.
Mainline and 5.10.y already has host_pkru being initialized in
kvm_arch_vcpu_load. This change is 5.4.y specific and moves
host_pkru save to kvm_arch_vcpu_load.
Fixes: 99e392a4979b ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c")
Cc: stable(a)vger.kernel.org # 5.4.y
Cc: Babu Moger <babu.moger(a)amd.com>
Signed-off-by: Jon Kohler <jon(a)nutanix.com>
---
arch/x86/kvm/x86.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 153659e8f403..1f7521752a94 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3507,6 +3507,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_x86_ops->vcpu_load(vcpu, cpu);
+ /* Save host pkru register if supported */
+ vcpu->arch.host_pkru = read_pkru();
+
/* Apply any externally detected TSC adjustments (due to suspend) */
if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
@@ -8253,9 +8256,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
trace_kvm_entry(vcpu->vcpu_id);
guest_enter_irqoff();
- /* Save host pkru register if supported */
- vcpu->arch.host_pkru = read_pkru();
-
fpregs_assert_state_consistent();
if (test_thread_flag(TIF_NEED_FPU_LOAD))
switch_fpu_return();
--
2.30.1 (Apple Git-130)
From: Atul Gopinathan <atulgopinathan(a)gmail.com>
The fields, "toc" and "cd_info", of "struct gdrom_unit gd" are allocated
in "probe_gdrom()". Prevent a memory leak by making sure "gd.cd_info" is
deallocated in the "remove_gdrom()" function.
Also prevent double free of the field "gd.toc" by moving it from the
module's exit function to "remove_gdrom()". This is because, in
"probe_gdrom()", the function makes sure to deallocate "gd.toc" in case
of any errors, so the exit function invoked later would again free
"gd.toc".
The patch also maintains consistency by deallocating the above mentioned
fields in "remove_gdrom()" along with another memory allocated field
"gd.disk".
Suggested-by: Jens Axboe <axboe(a)kernel.dk>
Cc: Peter Rosin <peda(a)axentia.se>
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Atul Gopinathan <atulgopinathan(a)gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/cdrom/gdrom.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/cdrom/gdrom.c b/drivers/cdrom/gdrom.c
index 7f681320c7d3..6c4f6139f853 100644
--- a/drivers/cdrom/gdrom.c
+++ b/drivers/cdrom/gdrom.c
@@ -830,6 +830,8 @@ static int remove_gdrom(struct platform_device *devptr)
if (gdrom_major)
unregister_blkdev(gdrom_major, GDROM_DEV_NAME);
unregister_cdrom(gd.cd_info);
+ kfree(gd.cd_info);
+ kfree(gd.toc);
return 0;
}
@@ -861,7 +863,6 @@ static void __exit exit_gdrom(void)
{
platform_device_unregister(pd);
platform_driver_unregister(&gdrom_driver);
- kfree(gd.toc);
}
module_init(init_gdrom);
--
2.31.1
This is an automatic generated email to let you know that the following patch were queued:
Subject: media: i2c: ccs-core: fix pm_runtime_get_sync() usage count
Author: Mauro Carvalho Chehab <mchehab+huawei(a)kernel.org>
Date: Fri Apr 23 17:19:11 2021 +0200
The pm_runtime_get_sync() internally increments the
dev->power.usage_count without decrementing it, even on errors.
There is a bug at ccs_pm_get_init(): when this function returns
an error, the stream is not started, and RPM usage_count
should not be incremented. However, if the calls to
v4l2_ctrl_handler_setup() return errors, it will be kept
incremented.
At ccs_suspend() the best is to replace it by the new
pm_runtime_resume_and_get(), introduced by:
commit dd8088d5a896 ("PM: runtime: Add pm_runtime_resume_and_get to deal with usage counter")
in order to properly decrement the usage counter automatically,
in the case of errors.
Fixes: 96e3a6b92f23 ("media: smiapp: Avoid maintaining power state information")
Cc: stable(a)vger.kernel.org
Acked-by: Sakari Ailus <sakari.ailus(a)linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei(a)kernel.org>
drivers/media/i2c/ccs/ccs-core.c | 32 ++++++++++++++++++++++----------
1 file changed, 22 insertions(+), 10 deletions(-)
---
diff --git a/drivers/media/i2c/ccs/ccs-core.c b/drivers/media/i2c/ccs/ccs-core.c
index b05f409014b2..4a848ac2d2cd 100644
--- a/drivers/media/i2c/ccs/ccs-core.c
+++ b/drivers/media/i2c/ccs/ccs-core.c
@@ -1880,21 +1880,33 @@ static int ccs_pm_get_init(struct ccs_sensor *sensor)
struct i2c_client *client = v4l2_get_subdevdata(&sensor->src->sd);
int rval;
+ /*
+ * It can't use pm_runtime_resume_and_get() here, as the driver
+ * relies at the returned value to detect if the device was already
+ * active or not.
+ */
rval = pm_runtime_get_sync(&client->dev);
- if (rval < 0) {
- pm_runtime_put_noidle(&client->dev);
+ if (rval < 0)
+ goto error;
- return rval;
- } else if (!rval) {
- rval = v4l2_ctrl_handler_setup(&sensor->pixel_array->
- ctrl_handler);
- if (rval)
- return rval;
+ /* Device was already active, so don't set controls */
+ if (rval == 1)
+ return 0;
- return v4l2_ctrl_handler_setup(&sensor->src->ctrl_handler);
- }
+ /* Restore V4L2 controls to the previously suspended device */
+ rval = v4l2_ctrl_handler_setup(&sensor->pixel_array->ctrl_handler);
+ if (rval)
+ goto error;
+ rval = v4l2_ctrl_handler_setup(&sensor->src->ctrl_handler);
+ if (rval)
+ goto error;
+
+ /* Keep PM runtime usage_count incremented on success */
return 0;
+error:
+ pm_runtime_put(&client->dev);
+ return rval;
}
static int ccs_set_stream(struct v4l2_subdev *subdev, int enable)