The patch titled
Subject: mm/hugetlb: fix kernel NULL pointer dereference when replacing free hugetlb folios
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-hugetlb-fix-kernel-null-pointer-dereference-when-replacing-free-hugetlb-folios.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Ge Yang <yangge1116(a)126.com>
Subject: mm/hugetlb: fix kernel NULL pointer dereference when replacing free hugetlb folios
Date: Thu, 22 May 2025 11:22:17 +0800
A kernel crash was observed when replacing free hugetlb folios:
BUG: kernel NULL pointer dereference, address: 0000000000000028
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 28 UID: 0 PID: 29639 Comm: test_cma.sh Tainted 6.15.0-rc6-zp #41 PREEMPT(voluntary)
RIP: 0010:alloc_and_dissolve_hugetlb_folio+0x1d/0x1f0
RSP: 0018:ffffc9000b30fa90 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000342cca RCX: ffffea0043000000
RDX: ffffc9000b30fb08 RSI: ffffea0043000000 RDI: 0000000000000000
RBP: ffffc9000b30fb20 R08: 0000000000001000 R09: 0000000000000000
R10: ffff88886f92eb00 R11: 0000000000000000 R12: ffffea0043000000
R13: 0000000000000000 R14: 00000000010c0200 R15: 0000000000000004
FS: 00007fcda5f14740(0000) GS:ffff8888ec1d8000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 0000000391402000 CR4: 0000000000350ef0
Call Trace:
<TASK>
replace_free_hugepage_folios+0xb6/0x100
alloc_contig_range_noprof+0x18a/0x590
? srso_return_thunk+0x5/0x5f
? down_read+0x12/0xa0
? srso_return_thunk+0x5/0x5f
cma_range_alloc.constprop.0+0x131/0x290
__cma_alloc+0xcf/0x2c0
cma_alloc_write+0x43/0xb0
simple_attr_write_xsigned.constprop.0.isra.0+0xb2/0x110
debugfs_attr_write+0x46/0x70
full_proxy_write+0x62/0xa0
vfs_write+0xf8/0x420
? srso_return_thunk+0x5/0x5f
? filp_flush+0x86/0xa0
? srso_return_thunk+0x5/0x5f
? filp_close+0x1f/0x30
? srso_return_thunk+0x5/0x5f
? do_dup2+0xaf/0x160
? srso_return_thunk+0x5/0x5f
ksys_write+0x65/0xe0
do_syscall_64+0x64/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
There is a potential race between __update_and_free_hugetlb_folio() and
replace_free_hugepage_folios():
CPU1                                 CPU2
__update_and_free_hugetlb_folio      replace_free_hugepage_folios
                                       folio_test_hugetlb(folio)
                                       -- It's still a hugetlb folio.
__folio_clear_hugetlb(folio)
hugetlb_free_folio(folio)
                                       h = folio_hstate(folio)
                                       -- Here, h is a NULL pointer
When the above race condition occurs, folio_hstate(folio) returns NULL,
and subsequent access to this NULL pointer will cause the system to crash.
To resolve this issue, execute folio_hstate(folio) under the protection
of hugetlb_lock, ensuring that folio_hstate(folio) does not return
NULL, as sketched below.
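For clarity, here is a condensed sketch of the fixed check in
replace_free_hugepage_folios() (simplified from the diff below; the
alloc_and_dissolve_hugetlb_folio() call and error handling are elided):

	spin_lock_irq(&hugetlb_lock);
	if (!folio_test_hugetlb(folio)) {
		/* Dissolved from under us: no longer a hugetlb folio. */
		spin_unlock_irq(&hugetlb_lock);
		start_pfn++;
		continue;
	}
	/* The hugetlb flag is cleared under this same lock, so the
	 * folio cannot be dissolved here and the hstate is valid. */
	h = folio_hstate(folio);
	spin_unlock_irq(&hugetlb_lock);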
Link: https://lkml.kernel.org/r/1747884137-26685-1-git-send-email-yangge1116@126.…
Fixes: 04f13d241b8b ("mm: replace free hugepage folios after migration")
Signed-off-by: Ge Yang <yangge1116(a)126.com>
Reviewed-by: Muchun Song <muchun.song(a)linux.dev>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Barry Song <21cnbao(a)gmail.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/mm/hugetlb.c~mm-hugetlb-fix-kernel-null-pointer-dereference-when-replacing-free-hugetlb-folios
+++ a/mm/hugetlb.c
@@ -2949,12 +2949,20 @@ int replace_free_hugepage_folios(unsigne
while (start_pfn < end_pfn) {
folio = pfn_folio(start_pfn);
+
+ /*
+ * The folio might have been dissolved from under our feet, so make sure
+ * to carefully check the state under the lock.
+ */
+ spin_lock_irq(&hugetlb_lock);
if (folio_test_hugetlb(folio)) {
h = folio_hstate(folio);
} else {
+ spin_unlock_irq(&hugetlb_lock);
start_pfn++;
continue;
}
+ spin_unlock_irq(&hugetlb_lock);
if (!folio_ref_count(folio)) {
ret = alloc_and_dissolve_hugetlb_folio(h, folio,
_
Patches currently in -mm which might be from yangge1116(a)126.com are
mm-hugetlb-fix-kernel-null-pointer-dereference-when-replacing-free-hugetlb-folios.patch
The patch titled
Subject: mm: swap: fix potensial buffer overflow in setup_clusters()
has been added to the -mm mm-new branch. Its filename is
mm-swap-fix-potensial-buffer-overflow-in-setup_clusters.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews.  Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fix up patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Kemeng Shi <shikemeng(a)huaweicloud.com>
Subject: mm: swap: fix potensial buffer overflow in setup_clusters()
Date: Thu, 22 May 2025 20:25:53 +0800
In setup_swap_map(), we only ensure that badpages are in the range
(0, last_page].  As maxpages might be smaller than last_page,
setup_clusters() will run into a buffer overflow when a badpage is
>= maxpages.
Fix the issue by only calling inc_cluster_info_page() for badpages
that are < maxpages; see the sketch below.
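To see why this overflows, here is a rough model (illustrative only,
not the kernel code): cluster_info holds one entry per SWAPFILE_CLUSTER
pages and is sized from maxpages, while inc_cluster_info_page() indexes
it by page number:

	/* cluster_info was sized for maxpages pages: */
	nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);

	/* inc_cluster_info_page(si, cluster_info, page_nr) lands at: */
	ci = &cluster_info[page_nr / SWAPFILE_CLUSTER];

	/*
	 * A badpage with maxpages <= page_nr <= last_page passes the
	 * (0, last_page] check in setup_swap_map() but can index past
	 * the nr_clusters entries actually allocated.
	 */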
Link: https://lkml.kernel.org/r/20250522122554.12209-4-shikemeng@huaweicloud.com
Fixes: b843786b0bd01 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
Signed-off-by: Kemeng Shi <shikemeng(a)huaweicloud.com>
Cc: <stable(a)vger.kernel.org>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Kairui Song <kasong(a)tencent.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/swapfile.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
--- a/mm/swapfile.c~mm-swap-fix-potensial-buffer-overflow-in-setup_clusters
+++ a/mm/swapfile.c
@@ -3208,9 +3208,13 @@ static struct swap_cluster_info *setup_c
* and the EOF part of the last cluster.
*/
inc_cluster_info_page(si, cluster_info, 0);
- for (i = 0; i < swap_header->info.nr_badpages; i++)
- inc_cluster_info_page(si, cluster_info,
- swap_header->info.badpages[i]);
+ for (i = 0; i < swap_header->info.nr_badpages; i++) {
+ unsigned int page_nr = swap_header->info.badpages[i];
+
+ if (page_nr >= maxpages)
+ continue;
+ inc_cluster_info_page(si, cluster_info, page_nr);
+ }
for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
inc_cluster_info_page(si, cluster_info, i);
_
Patches currently in -mm which might be from shikemeng(a)huaweicloud.com are
mm-shmem-avoid-unpaired-folio_unlock-in-shmem_swapin_folio.patch
mm-shmem-add-missing-shmem_unacct_size-in-__shmem_file_setup.patch
mm-shmem-fix-potential-dead-loop-in-shmem_unuse.patch
mm-shmem-only-remove-inode-from-swaplist-when-its-swapped-page-count-is-0.patch
mm-shmem-remove-unneeded-xa_is_value-check-in-shmem_unuse_swap_entries.patch
mm-swap-move-nr_swap_pages-counter-decrement-from-folio_alloc_swap-to-swap_range_alloc.patch
mm-swap-correctly-use-maxpages-in-swapon-syscall-to-avoid-potensial-deadloop.patch
mm-swap-fix-potensial-buffer-overflow-in-setup_clusters.patch
mm-swap-remove-stale-comment-stale-comment-in-cluster_alloc_swap_entry.patch
The patch titled
Subject: mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
has been added to the -mm mm-new branch. Its filename is
mm-swap-correctly-use-maxpages-in-swapon-syscall-to-avoid-potensial-deadloop.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews.  Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fix up patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Kemeng Shi <shikemeng(a)huaweicloud.com>
Subject: mm: swap: correctly use maxpages in swapon syscall to avoid potensial deadloop
Date: Thu, 22 May 2025 20:25:52 +0800
We use maxpages from read_swap_header() to initialize swap_info_struct;
however, maxpages might be reduced in setup_swap_extents(), and si->max
is then assigned the reduced maxpages from setup_swap_extents().
Obviously, this could waste memory, as we allocated memory based on the
larger maxpages.  Besides, it could lead to a potential deadloop, as
follows:
1) When calling setup_clusters() with the larger maxpages, unavailable
   pages within the range [si->max, larger maxpages) are not accounted
   via inc_cluster_info_page().  As a result, these pages are assumed
   available but cannot be allocated.  The cluster containing these
   pages can be moved to the frag_clusters list after all of its
   available pages have been allocated.
2) When the cluster mentioned in 1) is the only cluster in the
   frag_clusters list, cluster_alloc_swap_entry() assumes an order-0
   allocation will never fail and enters a deadloop, repeatedly trying
   to allocate a page from the only cluster in frag_clusters, which
   contains no actually available pages.
Fix the issue by calling setup_swap_extents() to get the final maxpages
before the swap_info_struct initialization, as outlined below.
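In outline, swapon() after the change looks as follows (condensed from
the diff below; error handling abbreviated):

	si->max = maxpages;
	si->pages = maxpages - 1;
	nr_extents = setup_swap_extents(si, &span);  /* may shrink si->max */
	if (nr_extents < 0)
		goto bad_swap_unlock_inode;
	maxpages = si->max;      /* the final, possibly reduced, value */

	swap_map = vzalloc(maxpages);  /* sized from the final maxpages */
	...
	error = setup_swap_map(si, swap_header, swap_map, maxpages);

so both the swap map and setup_clusters() see the same, final maxpages.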
Link: https://lkml.kernel.org/r/20250522122554.12209-3-shikemeng@huaweicloud.com
Fixes: 661383c6111a3 ("mm: swap: relaim the cached parts that got scanned")
Signed-off-by: Kemeng Shi <shikemeng(a)huaweicloud.com>
Cc: <stable(a)vger.kernel.org>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Kairui Song <kasong(a)tencent.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/swapfile.c | 47 ++++++++++++++++++++---------------------------
1 file changed, 20 insertions(+), 27 deletions(-)
--- a/mm/swapfile.c~mm-swap-correctly-use-maxpages-in-swapon-syscall-to-avoid-potensial-deadloop
+++ a/mm/swapfile.c
@@ -3141,43 +3141,30 @@ static unsigned long read_swap_header(st
return maxpages;
}
-static int setup_swap_map_and_extents(struct swap_info_struct *si,
- union swap_header *swap_header,
- unsigned char *swap_map,
- unsigned long maxpages,
- sector_t *span)
+static int setup_swap_map(struct swap_info_struct *si,
+ union swap_header *swap_header,
+ unsigned char *swap_map,
+ unsigned long maxpages)
{
- unsigned int nr_good_pages;
unsigned long i;
- int nr_extents;
-
- nr_good_pages = maxpages - 1; /* omit header page */
+ swap_map[0] = SWAP_MAP_BAD; /* omit header page */
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
if (page_nr == 0 || page_nr > swap_header->info.last_page)
return -EINVAL;
if (page_nr < maxpages) {
swap_map[page_nr] = SWAP_MAP_BAD;
- nr_good_pages--;
+ si->pages--;
}
}
- if (nr_good_pages) {
- swap_map[0] = SWAP_MAP_BAD;
- si->max = maxpages;
- si->pages = nr_good_pages;
- nr_extents = setup_swap_extents(si, span);
- if (nr_extents < 0)
- return nr_extents;
- nr_good_pages = si->pages;
- }
- if (!nr_good_pages) {
+ if (!si->pages) {
pr_warn("Empty swap-file\n");
return -EINVAL;
}
- return nr_extents;
+ return 0;
}
#define SWAP_CLUSTER_INFO_COLS \
@@ -3217,7 +3204,7 @@ static struct swap_cluster_info *setup_c
* Mark unusable pages as unavailable. The clusters aren't
* marked free yet, so no list operations are involved yet.
*
- * See setup_swap_map_and_extents(): header page, bad pages,
+ * See setup_swap_map(): header page, bad pages,
* and the EOF part of the last cluster.
*/
inc_cluster_info_page(si, cluster_info, 0);
@@ -3354,6 +3341,15 @@ SYSCALL_DEFINE2(swapon, const char __use
goto bad_swap_unlock_inode;
}
+ si->max = maxpages;
+ si->pages = maxpages - 1;
+ nr_extents = setup_swap_extents(si, &span);
+ if (nr_extents < 0) {
+ error = nr_extents;
+ goto bad_swap_unlock_inode;
+ }
+ maxpages = si->max;
+
/* OK, set up the swap map and apply the bad block list */
swap_map = vzalloc(maxpages);
if (!swap_map) {
@@ -3365,12 +3361,9 @@ SYSCALL_DEFINE2(swapon, const char __use
if (error)
goto bad_swap_unlock_inode;
- nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
- maxpages, &span);
- if (unlikely(nr_extents < 0)) {
- error = nr_extents;
+ error = setup_swap_map(si, swap_header, swap_map, maxpages);
+ if (error)
goto bad_swap_unlock_inode;
- }
/*
* Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
_
Patches currently in -mm which might be from shikemeng(a)huaweicloud.com are
mm-shmem-avoid-unpaired-folio_unlock-in-shmem_swapin_folio.patch
mm-shmem-add-missing-shmem_unacct_size-in-__shmem_file_setup.patch
mm-shmem-fix-potential-dead-loop-in-shmem_unuse.patch
mm-shmem-only-remove-inode-from-swaplist-when-its-swapped-page-count-is-0.patch
mm-shmem-remove-unneeded-xa_is_value-check-in-shmem_unuse_swap_entries.patch
mm-swap-move-nr_swap_pages-counter-decrement-from-folio_alloc_swap-to-swap_range_alloc.patch
mm-swap-correctly-use-maxpages-in-swapon-syscall-to-avoid-potensial-deadloop.patch
mm-swap-fix-potensial-buffer-overflow-in-setup_clusters.patch
mm-swap-remove-stale-comment-stale-comment-in-cluster_alloc_swap_entry.patch
The patch titled
Subject: mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
has been added to the -mm mm-new branch. Its filename is
mm-swap-move-nr_swap_pages-counter-decrement-from-folio_alloc_swap-to-swap_range_alloc.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews.  Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fix up patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Kemeng Shi <shikemeng(a)huaweicloud.com>
Subject: mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
Date: Thu, 22 May 2025 20:25:51 +0800
Patch series "Some random fixes and cleanups to swapfile".
Patches 0-3 are some random fixes.  Patch 4 is a cleanup.  More details
can be found in the respective patches.
This patch (of 4):
When folio_alloc_swap() encounters a failure in either
mem_cgroup_try_charge_swap() or add_to_swap_cache(), the nr_swap_pages
counter is not decremented for the allocated entry.  However, the
following put_swap_folio() will increase the nr_swap_pages counter
without a matching decrement and lead to an imbalance.
Move the nr_swap_pages decrement from folio_alloc_swap() to
swap_range_alloc() to pair up the nr_swap_pages accounting, as sketched
below.
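Conceptually, the accounting after the change pairs up as follows
(simplified, not verbatim kernel code):

	swap_range_alloc(si, offset, nr_entries):
		...                                    /* mark entries in use */
		atomic_long_sub(nr_entries, &nr_swap_pages);  /* pairs with... */

	swap_range_free(si, offset, nr_entries):
		atomic_long_add(nr_entries, &nr_swap_pages);  /* ...this add */
		...

so every successful allocation decrements the counter exactly once, and
every free (including put_swap_folio() on the failure paths) increments
it back.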
Link: https://lkml.kernel.org/r/20250522122554.12209-1-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20250522122554.12209-2-shikemeng@huaweicloud.com
Fixes: 0ff67f990bd45 ("mm, swap: remove swap slot cache")
Signed-off-by: Kemeng Shi <shikemeng(a)huaweicloud.com>
Reviewed-by: Kairui Song <kasong(a)tencent.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/swapfile.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/swapfile.c~mm-swap-move-nr_swap_pages-counter-decrement-from-folio_alloc_swap-to-swap_range_alloc
+++ a/mm/swapfile.c
@@ -1115,6 +1115,7 @@ static void swap_range_alloc(struct swap
if (vm_swap_full())
schedule_work(&si->reclaim_work);
}
+ atomic_long_sub(nr_entries, &nr_swap_pages);
}
static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
@@ -1313,7 +1314,6 @@ int folio_alloc_swap(struct folio *folio
if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
goto out_free;
- atomic_long_sub(size, &nr_swap_pages);
return 0;
out_free:
_
Patches currently in -mm which might be from shikemeng(a)huaweicloud.com are
mm-shmem-avoid-unpaired-folio_unlock-in-shmem_swapin_folio.patch
mm-shmem-add-missing-shmem_unacct_size-in-__shmem_file_setup.patch
mm-shmem-fix-potential-dead-loop-in-shmem_unuse.patch
mm-shmem-only-remove-inode-from-swaplist-when-its-swapped-page-count-is-0.patch
mm-shmem-remove-unneeded-xa_is_value-check-in-shmem_unuse_swap_entries.patch
mm-swap-move-nr_swap_pages-counter-decrement-from-folio_alloc_swap-to-swap_range_alloc.patch
mm-swap-correctly-use-maxpages-in-swapon-syscall-to-avoid-potensial-deadloop.patch
mm-swap-fix-potensial-buffer-overflow-in-setup_clusters.patch
mm-swap-remove-stale-comment-stale-comment-in-cluster_alloc_swap_entry.patch
On 2025/5/23 06:35, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> btrfs: prevent inline data extents read from touching blocks beyond its range
>
> to the 6.14-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> btrfs-prevent-inline-data-extents-read-from-touching.patch
> and it can be found in the queue-6.14 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
Please drop this patch.
This is again a preparation patch for large folio support in btrfs,
and for the optimization to dirty a block without reading in the full
page.  This patch alone doesn't cause any difference on older kernels
and should not be backported.
Thanks,
Qu
>
>
>
> commit 6a2d904623a8d1711b6b5065845d52cb3f2be60a
> Author: Qu Wenruo <wqu(a)suse.com>
> Date: Fri Nov 15 19:15:34 2024 +1030
>
> btrfs: prevent inline data extents read from touching blocks beyond its range
>
> [ Upstream commit 1a5b5668d711d3d1ef447446beab920826decec3 ]
>
> Currently reading an inline data extent will zero out the remaining
> range in the page.
>
> This is not yet causing problems even for block size < page size
> (subpage) cases because:
>
> 1) An inline data extent always starts at file offset 0
> Meaning at page read, we always read the inline extent first, before
> any other blocks in the page. Then later blocks are properly read out
> and re-fill the zeroed out ranges.
>
> 2) Currently btrfs will read out the whole page if a buffered write is
> not page aligned
> So a page is either fully uptodate at buffered write time (covers the
> whole page), or we will read out the whole page first.
> Meaning there is nothing to lose for such an inline extent read.
>
> But it's still not ideal:
>
> - We're zeroing out the page twice
> Once done by read_inline_extent()/uncompress_inline(), once done by
> btrfs_do_readpage() for ranges beyond i_size.
>
> - We're touching blocks that don't belong to the inline extent
> In the incoming patches, we can have a partial uptodate folio, of
> which some dirty blocks can exist while the page is not fully uptodate:
>
> The page size is 16K and block size is 4K:
>
> 0 4K 8K 12K 16K
> | | |/////////| |
>
> And range [8K, 12K) is dirtied by a buffered write, the remaining
> blocks are not uptodate.
>
> If range [0, 4K) contains an inline data extent, and we try to read
> the whole page, the current behavior will overwrite range [8K, 12K)
> with zero and cause data loss.
>
> So to make the behavior more consistent and in preparation for future
> changes, limit the inline data extents read to only zero out the range
> inside the first block, not the whole page.
>
> Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
> Signed-off-by: Qu Wenruo <wqu(a)suse.com>
> Signed-off-by: David Sterba <dsterba(a)suse.com>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9a648fb130230..a7136311a13c6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6779,6 +6779,7 @@ static noinline int uncompress_inline(struct btrfs_path *path,
> {
> int ret;
> struct extent_buffer *leaf = path->nodes[0];
> + const u32 blocksize = leaf->fs_info->sectorsize;
> char *tmp;
> size_t max_size;
> unsigned long inline_size;
> @@ -6795,7 +6796,7 @@ static noinline int uncompress_inline(struct btrfs_path *path,
>
> read_extent_buffer(leaf, tmp, ptr, inline_size);
>
> - max_size = min_t(unsigned long, PAGE_SIZE, max_size);
> + max_size = min_t(unsigned long, blocksize, max_size);
> ret = btrfs_decompress(compress_type, tmp, folio, 0, inline_size,
> max_size);
>
> @@ -6807,14 +6808,15 @@ static noinline int uncompress_inline(struct btrfs_path *path,
> * cover that region here.
> */
>
> - if (max_size < PAGE_SIZE)
> - folio_zero_range(folio, max_size, PAGE_SIZE - max_size);
> + if (max_size < blocksize)
> + folio_zero_range(folio, max_size, blocksize - max_size);
> kfree(tmp);
> return ret;
> }
>
> static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
> {
> + const u32 blocksize = path->nodes[0]->fs_info->sectorsize;
> struct btrfs_file_extent_item *fi;
> void *kaddr;
> size_t copy_size;
> @@ -6829,14 +6831,14 @@ static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
> if (btrfs_file_extent_compression(path->nodes[0], fi) != BTRFS_COMPRESS_NONE)
> return uncompress_inline(path, folio, fi);
>
> - copy_size = min_t(u64, PAGE_SIZE,
> + copy_size = min_t(u64, blocksize,
> btrfs_file_extent_ram_bytes(path->nodes[0], fi));
> kaddr = kmap_local_folio(folio, 0);
> read_extent_buffer(path->nodes[0], kaddr,
> btrfs_file_extent_inline_start(fi), copy_size);
> kunmap_local(kaddr);
> - if (copy_size < PAGE_SIZE)
> - folio_zero_range(folio, copy_size, PAGE_SIZE - copy_size);
> + if (copy_size < blocksize)
> + folio_zero_range(folio, copy_size, blocksize - copy_size);
> return 0;
> }
>
On 2025/5/23 06:35, Sasha Levin wrote:
> This is a note to let you know that I've just added the patch titled
>
> btrfs: properly limit inline data extent according to block size
>
> to the 6.14-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> btrfs-properly-limit-inline-data-extent-according-to.patch
> and it can be found in the queue-6.14 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
Please drop this patch.
This is mostly for the incoming large folio support in btrfs.
For older kernels this patch will not cause any behavior change.
Thanks,
Qu
>
>
> commit ec02842137bdccb74ed331a1b0a335ee22eb179c
> Author: Qu Wenruo <wqu(a)suse.com>
> Date: Tue Feb 25 14:30:44 2025 +1030
>
> btrfs: properly limit inline data extent according to block size
>
> [ Upstream commit 23019d3e6617a8ec99a8d2f5947aa3dd8a74a1b8 ]
>
> Btrfs utilizes inline data extent for the following cases:
>
> - Regular small files
> - Symlinks
>
> And "btrfs check" detects any file extents that are too large as an
> error.
>
> It's not a problem for 4K block size, but for the incoming smaller
> block sizes (2K), it can cause problems due to bad limits:
>
> - Non-compressed inline data extents
> We do not allow a non-compressed inline data extent to be as large as
> block size.
>
> - Symlinks
> Currently the only real limit on symlinks are 4K, which can be larger
> than 2K block size.
>
> These will result btrfs-check to report too large file extents.
>
> Fix it by adding proper size checks for the above cases.
>
> Signed-off-by: Qu Wenruo <wqu(a)suse.com>
> Reviewed-by: David Sterba <dsterba(a)suse.com>
> Signed-off-by: David Sterba <dsterba(a)suse.com>
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a06fca7934d55..9a648fb130230 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -583,6 +583,10 @@ static bool can_cow_file_range_inline(struct btrfs_inode *inode,
> if (size > fs_info->sectorsize)
> return false;
>
> + /* We do not allow a non-compressed extent to be as large as block size. */
> + if (data_len >= fs_info->sectorsize)
> + return false;
> +
> /* We cannot exceed the maximum inline data size. */
> if (data_len > BTRFS_MAX_INLINE_DATA_SIZE(fs_info))
> return false;
> @@ -8671,7 +8675,12 @@ static int btrfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
> struct extent_buffer *leaf;
>
> name_len = strlen(symname);
> - if (name_len > BTRFS_MAX_INLINE_DATA_SIZE(fs_info))
> + /*
> + * Symlinks utilize uncompressed inline extent data, which should not
> + * reach block size.
> + */
> + if (name_len > BTRFS_MAX_INLINE_DATA_SIZE(fs_info) ||
> + name_len >= fs_info->sectorsize)
> return -ENAMETOOLONG;
>
> inode = new_inode(dir->i_sb);
This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c.
Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
helper") was introduced to workaround watchdog timeout issues on some
platforms, allowing xhci_reset() to bail out early without waiting
for the reset to complete.
Skipping the xhci handshake during a reset is a dangerous move. The
xhci specification explicitly states that certain registers cannot
be accessed during reset in section 5.4.1 USB Command Register (USBCMD),
Host Controller Reset (HCRST) field:
"This bit is cleared to '0' by the Host Controller when the reset
process is complete. Software cannot terminate the reset process
early by writinga '0' to this bit and shall not write any xHC
Operational or Runtime registers until while HCRST is '1'."
This behavior causes a regression on SNPS DWC3 USB controller with
dual-role capability. When the DWC3 controller exits host mode and
removes xhci while a reset is still in progress, and then tries to
configure its hardware for device mode, the ongoing reset leads to
register access issues; specifically, all register reads return 0.
These issues extend beyond the xhci register space (which is expected
during a reset) and affect the entire DWC3 IP block, causing the DWC3
device mode to malfunction.
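For reference, the plain xhci_handshake() that the revert falls back to
simply polls until the handshake completes, the controller disappears,
or the timeout expires (a simplified sketch of the existing helper in
drivers/usb/host/xhci.c, not a new implementation):

	ret = readl_poll_timeout_atomic(ptr, result,
					(result & mask) == done ||
					result == U32_MAX,
					1, timeout_us);
	if (result == U32_MAX)		/* card removed */
		return -ENODEV;
	return ret;

There is deliberately no xhc_state early exit here, so xhci_reset()
always waits for HCRST to clear (or times out) before any xHC
Operational or Runtime register is touched again.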
Cc: stable(a)vger.kernel.org
Fixes: 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() helper")
Signed-off-by: Roy Luo <royluo(a)google.com>
---
Changes in v1:
- Link to previous patchset: https://lore.kernel.org/r/20250515185227.1507363-1-royluo@google.com/
---
drivers/usb/host/xhci-ring.c | 5 ++---
drivers/usb/host/xhci.c | 26 +-------------------------
drivers/usb/host/xhci.h | 2 --
3 files changed, 3 insertions(+), 30 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 423bf3649570..b720e04ce7d8 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -518,9 +518,8 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci, unsigned long flags)
* In the future we should distinguish between -ENODEV and -ETIMEDOUT
* and try to recover a -ETIMEDOUT with a host controller reset.
*/
- ret = xhci_handshake_check_state(xhci, &xhci->op_regs->cmd_ring,
- CMD_RING_RUNNING, 0, 5 * 1000 * 1000,
- XHCI_STATE_REMOVING);
+ ret = xhci_handshake(&xhci->op_regs->cmd_ring,
+ CMD_RING_RUNNING, 0, 5 * 1000 * 1000);
if (ret < 0) {
xhci_err(xhci, "Abort failed to stop command ring: %d\n", ret);
xhci_halt(xhci);
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 90eb491267b5..472c4b6ae59e 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -83,29 +83,6 @@ int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, u64 timeout_us)
return ret;
}
-/*
- * xhci_handshake_check_state - same as xhci_handshake but takes an additional
- * exit_state parameter, and bails out with an error immediately when xhc_state
- * has exit_state flag set.
- */
-int xhci_handshake_check_state(struct xhci_hcd *xhci, void __iomem *ptr,
- u32 mask, u32 done, int usec, unsigned int exit_state)
-{
- u32 result;
- int ret;
-
- ret = readl_poll_timeout_atomic(ptr, result,
- (result & mask) == done ||
- result == U32_MAX ||
- xhci->xhc_state & exit_state,
- 1, usec);
-
- if (result == U32_MAX || xhci->xhc_state & exit_state)
- return -ENODEV;
-
- return ret;
-}
-
/*
* Disable interrupts and begin the xHCI halting process.
*/
@@ -226,8 +203,7 @@ int xhci_reset(struct xhci_hcd *xhci, u64 timeout_us)
if (xhci->quirks & XHCI_INTEL_HOST)
udelay(1000);
- ret = xhci_handshake_check_state(xhci, &xhci->op_regs->command,
- CMD_RESET, 0, timeout_us, XHCI_STATE_REMOVING);
+ ret = xhci_handshake(&xhci->op_regs->command, CMD_RESET, 0, timeout_us);
if (ret)
return ret;
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 242ab9fbc8ae..5e698561b96d 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1855,8 +1855,6 @@ void xhci_remove_secondary_interrupter(struct usb_hcd
/* xHCI host controller glue */
typedef void (*xhci_get_quirks_t)(struct device *, struct xhci_hcd *);
int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, u64 timeout_us);
-int xhci_handshake_check_state(struct xhci_hcd *xhci, void __iomem *ptr,
- u32 mask, u32 done, int usec, unsigned int exit_state);
void xhci_quiesce(struct xhci_hcd *xhci);
int xhci_halt(struct xhci_hcd *xhci);
int xhci_start(struct xhci_hcd *xhci);
base-commit: 172a9d94339cea832d89630b89d314e41d622bd8
--
2.49.0.1112.g889b7c5bd8-goog
This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c.
Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
helper") was introduced to workaround watchdog timeout issues on some
platforms, allowing xhci_reset() to bail out early without waiting
for the reset to complete.
Skipping the xhci handshake during a reset is a dangerous move. The
xhci specification explicitly states that certain registers cannot
be accessed during reset in section 5.4.1 USB Command Register (USBCMD),
Host Controller Reset (HCRST) field:
"This bit is cleared to '0' by the Host Controller when the reset
process is complete. Software cannot terminate the reset process
early by writinga '0' to this bit and shall not write any xHC
Operational or Runtime registers until while HCRST is '1'."
This behavior causes a regression on SNPS DWC3 USB controller with
dual-role capability. When the DWC3 controller exits host mode and
removes xhci while a reset is still in progress, and then tries to
configure its hardware for device mode, the ongoing reset leads to
register access issues; specifically, all register reads return 0.
These issues extend beyond the xhci register space (which is expected
during a reset) and affect the entire DWC3 IP block, causing the DWC3
device mode to malfunction.
Cc: stable(a)vger.kernel.org
Fixes: 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() helper")
Signed-off-by: Roy Luo <royluo(a)google.com>
---
drivers/usb/host/xhci-ring.c | 5 ++---
drivers/usb/host/xhci.c | 26 +-------------------------
drivers/usb/host/xhci.h | 2 --
3 files changed, 3 insertions(+), 30 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 423bf3649570..b720e04ce7d8 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -518,9 +518,8 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci, unsigned long flags)
* In the future we should distinguish between -ENODEV and -ETIMEDOUT
* and try to recover a -ETIMEDOUT with a host controller reset.
*/
- ret = xhci_handshake_check_state(xhci, &xhci->op_regs->cmd_ring,
- CMD_RING_RUNNING, 0, 5 * 1000 * 1000,
- XHCI_STATE_REMOVING);
+ ret = xhci_handshake(&xhci->op_regs->cmd_ring,
+ CMD_RING_RUNNING, 0, 5 * 1000 * 1000);
if (ret < 0) {
xhci_err(xhci, "Abort failed to stop command ring: %d\n", ret);
xhci_halt(xhci);
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 244b12eafd95..cb9f35acb1f9 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -83,29 +83,6 @@ int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, u64 timeout_us)
return ret;
}
-/*
- * xhci_handshake_check_state - same as xhci_handshake but takes an additional
- * exit_state parameter, and bails out with an error immediately when xhc_state
- * has exit_state flag set.
- */
-int xhci_handshake_check_state(struct xhci_hcd *xhci, void __iomem *ptr,
- u32 mask, u32 done, int usec, unsigned int exit_state)
-{
- u32 result;
- int ret;
-
- ret = readl_poll_timeout_atomic(ptr, result,
- (result & mask) == done ||
- result == U32_MAX ||
- xhci->xhc_state & exit_state,
- 1, usec);
-
- if (result == U32_MAX || xhci->xhc_state & exit_state)
- return -ENODEV;
-
- return ret;
-}
-
/*
* Disable interrupts and begin the xHCI halting process.
*/
@@ -226,8 +203,7 @@ int xhci_reset(struct xhci_hcd *xhci, u64 timeout_us)
if (xhci->quirks & XHCI_INTEL_HOST)
udelay(1000);
- ret = xhci_handshake_check_state(xhci, &xhci->op_regs->command,
- CMD_RESET, 0, timeout_us, XHCI_STATE_REMOVING);
+ ret = xhci_handshake(&xhci->op_regs->command, CMD_RESET, 0, timeout_us);
if (ret)
return ret;
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 242ab9fbc8ae..5e698561b96d 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1855,8 +1855,6 @@ void xhci_remove_secondary_interrupter(struct usb_hcd
/* xHCI host controller glue */
typedef void (*xhci_get_quirks_t)(struct device *, struct xhci_hcd *);
int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, u64 timeout_us);
-int xhci_handshake_check_state(struct xhci_hcd *xhci, void __iomem *ptr,
- u32 mask, u32 done, int usec, unsigned int exit_state);
void xhci_quiesce(struct xhci_hcd *xhci);
int xhci_halt(struct xhci_hcd *xhci);
int xhci_start(struct xhci_hcd *xhci);
--
2.49.0.1204.g71687c7c1d-goog