The patch titled
Subject: mm: userfaultfd: correct dirty flags set for both present and swap pte
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-userfaultfd-correct-dirty-flags-set-for-both-present-and-swap-pte.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Barry Song <v-songbaohua(a)oppo.com>
Subject: mm: userfaultfd: correct dirty flags set for both present and swap pte
Date: Fri, 9 May 2025 10:09:12 +1200
As David pointed out, what truly matters for mremap and userfaultfd move
operations is the soft dirty bit. The current comment and
implementation—which always sets the dirty bit for present PTEs and
fails to set the soft dirty bit for swap PTEs—are incorrect. This could
break features like Checkpoint-Restore in Userspace (CRIU).
This patch updates the behavior to correctly set the soft dirty bit for
both present and swap PTEs in accordance with mremap.
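For context, the bit in question is the one CRIU-style tools read from
/proc/<pid>/pagemap: bit 55 of each 64-bit entry, per
Documentation/admin-guide/mm/soft-dirty.rst. Below is a minimal userspace
sketch (not part of the patch) of how such a tool observes the flag for a
given address; with this fix it works for both present and swapped-out pages
after a UFFDIO_MOVE:

/*
 * Minimal sketch, assuming the documented pagemap layout: return the
 * soft-dirty flag (bit 55) of the pagemap entry covering vaddr.
 */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

static int page_soft_dirty(int pagemap_fd, unsigned long vaddr)
{
	uint64_t entry;
	long psize = sysconf(_SC_PAGESIZE);
	off_t off = (off_t)(vaddr / (unsigned long)psize) * sizeof(entry);

	if (pread(pagemap_fd, &entry, sizeof(entry), off) != sizeof(entry))
		return -1;
	return (int)((entry >> 55) & 1);
}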
Link: https://lkml.kernel.org/r/20250508220912.7275-1-21cnbao@gmail.com
Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Barry Song <v-songbaohua(a)oppo.com>
Reported-by: David Hildenbrand <david(a)redhat.com>
Closes: https://lore.kernel.org/linux-mm/02f14ee1-923f-47e3-a994-4950afb9afcc@redha…
Acked-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: Lokesh Gidra <lokeshgidra(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/userfaultfd.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
--- a/mm/userfaultfd.c~mm-userfaultfd-correct-dirty-flags-set-for-both-present-and-swap-pte
+++ a/mm/userfaultfd.c
@@ -1064,8 +1064,13 @@ static int move_present_pte(struct mm_st
src_folio->index = linear_page_index(dst_vma, dst_addr);
orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
- /* Follow mremap() behavior and treat the entry dirty after the move */
- orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);
+ /* Set soft dirty bit so userspace can notice the pte was moved */
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
+#endif
+ if (pte_dirty(orig_src_pte))
+ orig_dst_pte = pte_mkdirty(orig_dst_pte);
+ orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
out:
@@ -1100,6 +1105,9 @@ static int move_swap_pte(struct mm_struc
}
orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
+#endif
set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
double_pt_unlock(dst_ptl, src_ptl);
_
Patches currently in -mm which might be from v-songbaohua(a)oppo.com are
mm-userfaultfd-correct-dirty-flags-set-for-both-present-and-swap-pte.patch
The report zones buffer size is currently limited by the HBA's
maximum segment count to ensure the buffer can be mapped. However,
the block layer further limits the number of iovec entries to
1024 when allocating a bio.
To avoid allocating a buffer too large to be mapped, further
restrict the maximum buffer size to BIO_MAX_INLINE_VECS pages.
Replace the UIO_MAXIOV symbolic name with the more contextually
appropriate BIO_MAX_INLINE_VECS.
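As a back-of-the-envelope illustration (assuming 4 KiB pages and
UIO_MAXIOV == 1024, both typical but configuration dependent), the new cap
still leaves plenty of room for zone descriptors:

/*
 * Illustrative sketch, not kernel code: the effective cap implied by
 * limiting the report zones buffer to BIO_MAX_INLINE_VECS pages.
 */
#include <stdio.h>

int main(void)
{
	unsigned long max_inline_vecs = 1024;	/* assumed value of UIO_MAXIOV */
	unsigned long page_size = 4096;		/* assumed PAGE_SIZE */
	unsigned long bufsize = max_inline_vecs * page_size;

	/* REPORT ZONES: 64-byte header plus 64 bytes per zone descriptor,
	 * matching the (nr_zones + 1) * 64 sizing in the driver. */
	printf("cap: %lu bytes (%lu MiB), up to %lu zone descriptors\n",
	       bufsize, bufsize >> 20, bufsize / 64 - 1);
	return 0;
}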
Fixes: b091ac616846 ("sd_zbc: Fix report zones buffer allocation")
Cc: stable(a)vger.kernel.org
Signed-off-by: Steve Siwinski <ssiwinski(a)atto.com>
---
block/bio.c | 2 +-
drivers/scsi/sd_zbc.c | 6 +++++-
include/linux/bio.h | 1 +
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 4e6c85a33d74..4be592d37fb6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -611,7 +611,7 @@ struct bio *bio_kmalloc(unsigned short nr_vecs, gfp_t gfp_mask)
{
struct bio *bio;
- if (nr_vecs > UIO_MAXIOV)
+ if (nr_vecs > BIO_MAX_INLINE_VECS)
return NULL;
return kmalloc(struct_size(bio, bi_inline_vecs, nr_vecs), gfp_mask);
}
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 7a447ff600d2..a8db66428f80 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -169,6 +169,7 @@ static void *sd_zbc_alloc_report_buffer(struct scsi_disk *sdkp,
unsigned int nr_zones, size_t *buflen)
{
struct request_queue *q = sdkp->disk->queue;
+ unsigned int max_segments;
size_t bufsize;
void *buf;
@@ -180,12 +181,15 @@ static void *sd_zbc_alloc_report_buffer(struct scsi_disk *sdkp,
* Furthermore, since the report zone command cannot be split, make
* sure that the allocated buffer can always be mapped by limiting the
* number of pages allocated to the HBA max segments limit.
+ * Since max segments can be larger than the max inline bio vectors,
+ * further limit the allocated buffer to BIO_MAX_INLINE_VECS.
*/
nr_zones = min(nr_zones, sdkp->zone_info.nr_zones);
bufsize = roundup((nr_zones + 1) * 64, SECTOR_SIZE);
bufsize = min_t(size_t, bufsize,
queue_max_hw_sectors(q) << SECTOR_SHIFT);
- bufsize = min_t(size_t, bufsize, queue_max_segments(q) << PAGE_SHIFT);
+ max_segments = min(BIO_MAX_INLINE_VECS, queue_max_segments(q));
+ bufsize = min_t(size_t, bufsize, max_segments << PAGE_SHIFT);
while (bufsize >= SECTOR_SIZE) {
buf = kvzalloc(bufsize, GFP_KERNEL | __GFP_NORETRY);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index cafc7c215de8..b786ec5bcc81 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -11,6 +11,7 @@
#include <linux/uio.h>
#define BIO_MAX_VECS 256U
+#define BIO_MAX_INLINE_VECS UIO_MAXIOV
struct queue_limits;
--
2.43.5
From: Barry Song <v-songbaohua(a)oppo.com>
As David pointed out, what truly matters for mremap and userfaultfd
move operations is the soft dirty bit. The current comment and
implementation—which always sets the dirty bit for present PTEs
and fails to set the soft dirty bit for swap PTEs—are incorrect.
This could break features like Checkpoint-Restore in Userspace
(CRIU).
This patch updates the behavior to correctly set the soft dirty bit
for both present and swap PTEs in accordance with mremap.
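For reference, a minimal sketch of what "in accordance with mremap" means
here: mremap's PTE-move path marks both present and swap entries soft-dirty,
roughly along the following lines (helper name and exact shape assumed;
shown for illustration only, not a quote of mm/mremap.c):

#include <linux/mm.h>
#include <linux/swapops.h>

/* Illustrative sketch: mark a moved entry soft-dirty whether it is a
 * present PTE or a swap entry, which is what UFFDIO_MOVE now matches. */
static inline pte_t mark_moved_entry_soft_dirty(pte_t pte)
{
#ifdef CONFIG_MEM_SOFT_DIRTY
	if (pte_present(pte))
		pte = pte_mksoft_dirty(pte);
	else if (is_swap_pte(pte))
		pte = pte_swp_mksoft_dirty(pte);
#endif
	return pte;
}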
Reported-by: David Hildenbrand <david(a)redhat.com>
Closes: https://lore.kernel.org/linux-mm/02f14ee1-923f-47e3-a994-4950afb9afcc@redha…
Acked-by: Peter Xu <peterx(a)redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: Lokesh Gidra <lokeshgidra(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
Cc: stable(a)vger.kernel.org
Signed-off-by: Barry Song <v-songbaohua(a)oppo.com>
---
mm/userfaultfd.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e8ce92dc105f..bc473ad21202 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1064,8 +1064,13 @@ static int move_present_pte(struct mm_struct *mm,
src_folio->index = linear_page_index(dst_vma, dst_addr);
orig_dst_pte = folio_mk_pte(src_folio, dst_vma->vm_page_prot);
- /* Follow mremap() behavior and treat the entry dirty after the move */
- orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);
+ /* Set soft dirty bit so userspace can notice the pte was moved */
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ orig_dst_pte = pte_mksoft_dirty(orig_dst_pte);
+#endif
+ if (pte_dirty(orig_src_pte))
+ orig_dst_pte = pte_mkdirty(orig_dst_pte);
+ orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
out:
@@ -1100,6 +1105,9 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
}
orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
+#endif
set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
double_pt_unlock(dst_ptl, src_ptl);
--
2.39.3 (Apple Git-146)
With UBSAN enabled, we're getting the following trace:
UBSAN: array-index-out-of-bounds in .../drivers/clk/clk-s2mps11.c:186:3
index 0 is out of range for type 'struct clk_hw *[] __counted_by(num)' (aka 'struct clk_hw *[]')
This is because commit f316cdff8d67 ("clk: Annotate struct
clk_hw_onecell_data with __counted_by") annotated the hws member of
that struct with __counted_by, which informs the bounds sanitizer about
the number of elements in hws, so that it can warn when hws is accessed
out of bounds.
As noted in that change, the __counted_by member must be initialised
with the number of elements before the first array access happens,
otherwise there will be a warning from each access prior to the
initialisation because the number of elements is zero. This occurs in
s2mps11_clk_probe() due to ::num being assigned after ::hws access.
Move the assignment to satisfy the requirement of assign-before-access.
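A minimal generic sketch of the __counted_by() contract (illustrative only,
not the s2mps11 driver code): the sanitizer treats ->num as the length of
->hws, so ->num has to be assigned before any element of ->hws is touched.

#include <linux/clk-provider.h>
#include <linux/overflow.h>
#include <linux/slab.h>

struct onecell_example {
	unsigned int num;
	struct clk_hw *hws[] __counted_by(num);
};

static struct onecell_example *onecell_example_alloc(unsigned int n)
{
	struct onecell_example *d;

	d = kzalloc(struct_size(d, hws, n), GFP_KERNEL);
	if (!d)
		return NULL;

	d->num = n;			/* set the bound first ...            */
	for (unsigned int i = 0; i < n; i++)
		d->hws[i] = NULL;	/* ... then every access is in bounds */

	return d;
}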
Cc: stable(a)vger.kernel.org
Fixes: f316cdff8d67 ("clk: Annotate struct clk_hw_onecell_data with __counted_by")
Signed-off-by: André Draszik <andre.draszik(a)linaro.org>
---
drivers/clk/clk-s2mps11.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/clk/clk-s2mps11.c b/drivers/clk/clk-s2mps11.c
index 014db6386624071e173b5b940466301d2596400a..8ddf3a9a53dfd5bb52a05a3e02788a357ea77ad3 100644
--- a/drivers/clk/clk-s2mps11.c
+++ b/drivers/clk/clk-s2mps11.c
@@ -137,6 +137,8 @@ static int s2mps11_clk_probe(struct platform_device *pdev)
if (!clk_data)
return -ENOMEM;
+ clk_data->num = S2MPS11_CLKS_NUM;
+
switch (hwid) {
case S2MPS11X:
s2mps11_reg = S2MPS11_REG_RTC_CTRL;
@@ -186,7 +188,6 @@ static int s2mps11_clk_probe(struct platform_device *pdev)
clk_data->hws[i] = &s2mps11_clks[i].hw;
}
- clk_data->num = S2MPS11_CLKS_NUM;
of_clk_add_hw_provider(s2mps11_clks->clk_np, of_clk_hw_onecell_get,
clk_data);
---
base-commit: 9388ec571cb1adba59d1cded2300eeb11827679c
change-id: 20250326-s2mps11-ubsan-c90978e7bc04
Best regards,
--
André Draszik <andre.draszik(a)linaro.org>
Hi,
Pasi Kallinen reported a regression in Debian with the perf r5101c4
counter. It was initially found in
https://github.com/rr-debugger/rr/issues/3949 but is said to be a kernel
problem.
On Tue, May 06, 2025 at 07:18:39PM +0300, Pasi Kallinen wrote:
> Package: src:linux
> Version: 6.12.25-1
> Severity: normal
> X-Debbugs-Cc: debian-amd64(a)lists.debian.org, paxed(a)alt.org
> User: debian-amd64(a)lists.debian.org
> Usertags: amd64
>
> Dear Maintainer,
>
> perf stat -e r5101c4 true
>
> reports "not supported".
>
> The counters worked in kernel 6.11.10.
>
> I first noticed this not working when updating to 6.12.22.
> Booting back to 6.11.10, the counters work correctly.
Does this ring a bell?
Would you be able to bisect the changes to identify where the
behaviour changed?
Regards,
Salvatore
The quilt patch titled
Subject: x86/kexec: fix potential cmem->ranges out of bounds
has been removed from the -mm tree. Its filename was
x86-kexec-fix-potential-cmem-ranges-out-of-bounds.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: fuqiang wang <fuqiang.wang(a)easystack.cn>
Subject: x86/kexec: fix potential cmem->ranges out of bounds
Date: Mon, 8 Jan 2024 21:06:47 +0800
In memmap_exclude_ranges(), the elfheader is excluded from crashk_res.
In the current x86 architecture code, the elfheader is always allocated
at crashk_res.start, so it seems that no new split range will be created.
But that depends on where the elfheader is allocated within crashk_res. To
avoid potential out-of-bounds accesses in the future, add an extra slot.
A similar issue also exists in fill_up_crash_elf_data(). The range to
be excluded is [0, 1M]; its start (0) is special and will not appear in the
middle of existing cmem->ranges[]. But in case the low 1M is ever changed
in the future, add an extra slot there too.
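To make the slot accounting concrete, here is an illustrative sketch (not
the kernel's crash_exclude_mem_range()) of the one case that consumes an
extra slot, namely an exclusion that lies strictly inside an existing range:

struct range_example {
	unsigned long long start;
	unsigned long long end;
};

/*
 * Illustrative sketch: returns 1 and fills *second when excluding [s, e]
 * splits *first in two, i.e. the case that needs an extra cmem->ranges[]
 * slot; returns 0 when the exclusion only trims or removes the range.
 */
static int exclusion_splits_range(struct range_example *first,
				  unsigned long long s, unsigned long long e,
				  struct range_example *second)
{
	if (s > first->start && e < first->end) {	/* strictly inside */
		second->start = e + 1;
		second->end = first->end;
		first->end = s - 1;
		return 1;
	}
	return 0;
}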
Without this patch, the kdump kernel fails to be loaded by
kexec_file_load():
[ 139.736948] UBSAN: array-index-out-of-bounds in arch/x86/kernel/crash.c:350:25
[ 139.742360] index 0 is out of range for type 'range [*]'
[ 139.745695] CPU: 0 UID: 0 PID: 5778 Comm: kexec Not tainted 6.15.0-0.rc3.20250425git02ddfb981de8.32.fc43.x86_64 #1 PREEMPT(lazy)
[ 139.745698] Hardware name: Amazon EC2 c5.large/, BIOS 1.0 10/16/2017
[ 139.745699] Call Trace:
[ 139.745700] <TASK>
[ 139.745701] dump_stack_lvl+0x5d/0x80
[ 139.745706] ubsan_epilogue+0x5/0x2b
[ 139.745709] __ubsan_handle_out_of_bounds.cold+0x54/0x59
[ 139.745711] crash_setup_memmap_entries+0x2d9/0x330
[ 139.745716] setup_boot_parameters+0xf8/0x6a0
[ 139.745720] bzImage64_load+0x41b/0x4e0
[ 139.745722] ? find_next_iomem_res+0x109/0x140
[ 139.745727] ? locate_mem_hole_callback+0x109/0x170
[ 139.745737] kimage_file_alloc_init+0x1ef/0x3e0
[ 139.745740] __do_sys_kexec_file_load+0x180/0x2f0
[ 139.745742] do_syscall_64+0x7b/0x160
[ 139.745745] ? do_user_addr_fault+0x21a/0x690
[ 139.745747] ? exc_page_fault+0x7e/0x1a0
[ 139.745749] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 139.745751] RIP: 0033:0x7f7712c84e4d
Previously discussed link:
[1] https://lore.kernel.org/kexec/ZXk2oBf%2FT1Ul6o0c@MiWiFi-R3L-srv/
[2] https://lore.kernel.org/kexec/273284e8-7680-4f5f-8065-c5d780987e59@easystac…
[3] https://lore.kernel.org/kexec/ZYQ6O%2F57sHAPxTHm@MiWiFi-R3L-srv/
Link: https://lkml.kernel.org/r/20240108130720.228478-1-fuqiang.wang@easystack.cn
Signed-off-by: fuqiang wang <fuqiang.wang(a)easystack.cn>
Acked-by: Baoquan He <bhe(a)redhat.com>
Reported-by: Coiby Xu <coxu(a)redhat.com>
Closes: https://lkml.kernel.org/r/4de3c2onosr7negqnfhekm4cpbklzmsimgdfv33c52dktqpza…
Cc: Vivek Goyal <vgoyal(a)redhat.com>
Cc: Dave Young <dyoung(a)redhat.com>
Cc: <x86(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/x86/kernel/crash.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
--- a/arch/x86/kernel/crash.c~x86-kexec-fix-potential-cmem-ranges-out-of-bounds
+++ a/arch/x86/kernel/crash.c
@@ -165,8 +165,18 @@ static struct crash_mem *fill_up_crash_e
/*
* Exclusion of crash region and/or crashk_low_res may cause
* another range split. So add extra two slots here.
+ *
+ * Exclusion of low 1M may not cause another range split, because the
+ * range of exclude is [0, 1M] and the condition for splitting a new
+ * region is that the start, end parameters are both in a certain
+ * existing region in cmem and cannot be equal to existing region's
+ * start or end. Obviously, the start of [0, 1M] cannot meet this
+ * condition.
+ *
+ * But in case the low 1M is ever changed in the future
+ * (e.g. [start, 1M]), add an extra slot.
*/
- nr_ranges += 2;
+ nr_ranges += 3;
cmem = vzalloc(struct_size(cmem, ranges, nr_ranges));
if (!cmem)
return NULL;
@@ -298,9 +308,16 @@ int crash_setup_memmap_entries(struct ki
struct crash_memmap_data cmd;
struct crash_mem *cmem;
- cmem = vzalloc(struct_size(cmem, ranges, 1));
+ /*
+ * In the current x86 architecture code, the elfheader is always
+ * allocated at crashk_res.start. But it depends on the allocation
+ * position of elfheader in crashk_res. To avoid potential out of
+ * bounds in the future, add an extra slot.
+ */
+ cmem = vzalloc(struct_size(cmem, ranges, 2));
if (!cmem)
return -ENOMEM;
+ cmem->max_nr_ranges = 2;
memset(&cmd, 0, sizeof(struct crash_memmap_data));
cmd.params = params;
_
Patches currently in -mm which might be from fuqiang.wang(a)easystack.cn are
From: Fabio Estevam <festevam(a)denx.de>
Since commit 2718f15403fb ("iio: sanity check available_scan_masks array"),
verbose and misleading warnings are printed for devices like the MAX11601:
max1363 1-0064: available_scan_mask 8 subset of 0. Never used
max1363 1-0064: available_scan_mask 9 subset of 0. Never used
max1363 1-0064: available_scan_mask 10 subset of 0. Never used
max1363 1-0064: available_scan_mask 11 subset of 0. Never used
max1363 1-0064: available_scan_mask 12 subset of 0. Never used
max1363 1-0064: available_scan_mask 13 subset of 0. Never used
...
[warnings continue]
These warnings incorrectly report that later scan masks are subsets of
the first one, even when they are not; the issue lies in the logic that
checks for subset relationships between scan masks.
Fix the subset detection so that each mask is compared only against the
masks that precede it, and the warning is printed only when an element of
available_scan_masks really is a subset of a previous one.
With this fix, the warning output becomes both correct and more
informative:
max1363 1-0064: Mask 7 (0xc) is a subset of mask 6 (0xf) and will be ignored
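A small standalone sketch of the subset test the fix relies on, using the
masks from the example warning above (values assumed from the warning text):

#include <stdio.h>

/* a is a subset of b when a sets no bits outside b */
static int is_subset(unsigned long a, unsigned long b)
{
	return (a & ~b) == 0;
}

int main(void)
{
	/* 0xc (channels 2-3) is a subset of 0xf (channels 0-3): warn */
	printf("0xc within 0xf: %d\n", is_subset(0xc, 0xf));
	/* 0x9 (channels 0 and 3) is not a subset of 0xc: no warning */
	printf("0x9 within 0xc: %d\n", is_subset(0x9, 0xc));
	return 0;
}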
Cc: stable(a)vger.kernel.org
Fixes: 2718f15403fb ("iio: sanity check available_scan_masks array")
Signed-off-by: Fabio Estevam <festevam(a)denx.de>
---
drivers/iio/industrialio-core.c | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)
diff --git a/drivers/iio/industrialio-core.c b/drivers/iio/industrialio-core.c
index 6a6568d4a2cb..855d5fd3e6b2 100644
--- a/drivers/iio/industrialio-core.c
+++ b/drivers/iio/industrialio-core.c
@@ -1904,6 +1904,11 @@ static int iio_check_extended_name(const struct iio_dev *indio_dev)
static const struct iio_buffer_setup_ops noop_ring_setup_ops;
+static int is_subset(unsigned long a, unsigned long b)
+{
+ return (a & ~b) == 0;
+}
+
static void iio_sanity_check_avail_scan_masks(struct iio_dev *indio_dev)
{
unsigned int num_masks, masklength, longs_per_mask;
@@ -1947,21 +1952,13 @@ static void iio_sanity_check_avail_scan_masks(struct iio_dev *indio_dev)
* available masks in the order of preference (presumably the least
* costy to access masks first).
*/
- for (i = 0; i < num_masks - 1; i++) {
- const unsigned long *mask1;
- int j;
- mask1 = av_masks + i * longs_per_mask;
- for (j = i + 1; j < num_masks; j++) {
- const unsigned long *mask2;
-
- mask2 = av_masks + j * longs_per_mask;
- if (bitmap_subset(mask2, mask1, masklength))
+ for (i = 1; i < num_masks; ++i)
+ for (int j = 0; j < i; ++j)
+ if (is_subset(av_masks[i], av_masks[j]))
dev_warn(indio_dev->dev.parent,
- "available_scan_mask %d subset of %d. Never used\n",
- j, i);
- }
- }
+ "Mask %d (0x%lx) is a subset of mask %d (0x%lx) and will be ignored\n",
+ i, av_masks[i], j, av_masks[j]);
}
/**
--
2.34.1
Hi,
On 3/14/25 2:08 PM, Benjamin Berg wrote:
> From: Benjamin Berg <benjamin.berg(a)intel.com>
> um: work around sched_yield not yielding in time-travel mode
>
> sched_yield by a userspace may not actually cause scheduling in
> time-travel mode as no time has passed. In the case seen it appears to
> be a badly implemented userspace spinlock in ASAN. Unfortunately, with
> time-travel it causes an extreme slowdown or even deadlock depending on
> the kernel configuration (CONFIG_UML_MAX_USERSPACE_ITERATIONS).
>
> Work around it by accounting time to the process whenever it executes a
> sched_yield syscall.
>
> Signed-off-by: Benjamin Berg <benjamin.berg(a)intel.com>
From what I can tell the patch mentioned above was backported to 6.12.27 by:
<https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arc…>
but without the upstream
|Commit 0b8b2668f9981c1fefc2ef892bd915288ef01f33
|Author: Benjamin Berg <benjamin.berg(a)intel.com>
|Date: Thu Oct 10 16:25:37 2024 +0200
| um: insert scheduler ticks when userspace does not yield
|
| In time-travel mode userspace can do a lot of work without any time
| passing. Unfortunately, this can result in OOM situations as the RCU
| core code will never be run. [...]
the kernel build of 6.12.27 for the UM target fails:
| /usr/bin/ld: arch/um/kernel/skas/syscall.o: in function `handle_syscall': linux-6.12.27/arch/um/kernel/skas/syscall.c:43:(.text+0xa2): undefined reference to `tt_extra_sched_jiffies'
| collect2: error: ld returned 1 exit status
Is it possible to backport 0b8b2668f9981c1fefc2ef892bd915288ef01f33 too?
Or is it better to revert 887c5c12e80c8424bd471122d2e8b6b462e12874 again
in the stable releases?
Best Regards,
Christian Lamparter
>
> ---
>
> I suspect it is this code in ASAN that uses sched_yield
> https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/sanitizer_co…
> though there are also some other places that use sched_yield.
>
> I doubt that code is reasonable. At the same time, not sure that
> sched_yield is behaving as advertised either as it obviously is not
> necessarily relinquishing the CPU.
> ---
> arch/um/include/linux/time-internal.h | 2 ++
> arch/um/kernel/skas/syscall.c | 11 +++++++++++
> 2 files changed, 13 insertions(+)
>
> diff --git a/arch/um/include/linux/time-internal.h b/arch/um/include/linux/time-internal.h
> index b22226634ff6..138908b999d7 100644
> --- a/arch/um/include/linux/time-internal.h
> +++ b/arch/um/include/linux/time-internal.h
> @@ -83,6 +83,8 @@ extern void time_travel_not_configured(void);
> #define time_travel_del_event(...) time_travel_not_configured()
> #endif /* CONFIG_UML_TIME_TRAVEL_SUPPORT */
>
> +extern unsigned long tt_extra_sched_jiffies;
> +
> /*
> * Without CONFIG_UML_TIME_TRAVEL_SUPPORT this is a linker error if used,
> * which is intentional since we really shouldn't link it in that case.
> diff --git a/arch/um/kernel/skas/syscall.c b/arch/um/kernel/skas/syscall.c
> index b09e85279d2b..a5beaea2967e 100644
> --- a/arch/um/kernel/skas/syscall.c
> +++ b/arch/um/kernel/skas/syscall.c
> @@ -31,6 +31,17 @@ void handle_syscall(struct uml_pt_regs *r)
> goto out;
>
> syscall = UPT_SYSCALL_NR(r);
> +
> + /*
> + * If no time passes, then sched_yield may not actually yield, causing
> + * broken spinlock implementations in userspace (ASAN) to hang for long
> + * periods of time.
> + */
> + if ((time_travel_mode == TT_MODE_INFCPU ||
> + time_travel_mode == TT_MODE_EXTERNAL) &&
> + syscall == __NR_sched_yield)
> + tt_extra_sched_jiffies += 1;
> +
> if (syscall >= 0 && syscall < __NR_syscalls) {
> unsigned long ret = EXECUTE_SYSCALL(syscall, regs);
>
The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.12.y
git checkout FETCH_HEAD
git cherry-pick -x 90abee6d7895d5eef18c91d870d8168be4e76e9d
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025042150-hardiness-hunting-0780@gregkh' --subject-prefix 'PATCH 6.12.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 90abee6d7895d5eef18c91d870d8168be4e76e9d Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes(a)cmpxchg.org>
Date: Mon, 7 Apr 2025 14:01:53 -0400
Subject: [PATCH] mm: page_alloc: speed up fallbacks in rmqueue_bulk()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal
single pages from biggest buddy") as the root cause of a 56.4% regression
in vm-scalability::lru-file-mmap-read.
Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix
freelist movement during block conversion"), as the root cause for a
regression in worst-case zone->lock+irqoff hold times.
Both of these patches modify the page allocator's fallback path to be less
greedy in an effort to stave off fragmentation. The flip side of this is
that fallbacks are also less productive each time around, which means the
fallback search can run much more frequently.
Carlos' traces point to rmqueue_bulk() specifically, which tries to refill
the percpu cache by allocating a large batch of pages in a loop. It
highlights how once the native freelists are exhausted, the fallback code
first scans orders top-down for whole blocks to claim, then falls back to
a bottom-up search for the smallest buddy to steal. For the next batch
page, it goes through the same thing again.
This can be made more efficient. Since rmqueue_bulk() holds the
zone->lock over the entire batch, the freelists are not subject to outside
changes; when the search for a block to claim has already failed, there is
no point in trying again for the next page.
Modify __rmqueue() to remember the last successful fallback mode, and
restart directly from there on the next rmqueue_bulk() iteration.
Oliver confirms that this improves beyond the regression that the test
robot reported against c2f6ea38fc1b:
commit:
f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap")
c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy")
acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- your patch
f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5
---------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \
25525364 ± 3% -56.4% 11135467 -57.8% 10779336 +31.6% 33581409 vm-scalability.throughput
Carlos confirms that worst-case times are almost fully recovered
compared to before the earlier culprit patch:
2dd482ba627d (before freelist hygiene): 1ms
c0cd6f557b90 (after freelist hygiene): 90ms
next-20250319 (steal smallest buddy): 280ms
this patch : 8ms
[jackmanb(a)google.com: comment updates]
Link: https://lkml.kernel.org/r/D92AC0P9594X.3BML64MUKTF8Z@google.com
[hannes(a)cmpxchg.org: reset rmqueue_mode in rmqueue_buddy() error loop, per Yunsheng Lin]
Link: https://lkml.kernel.org/r/20250409140023.GA2313@cmpxchg.org
Link: https://lkml.kernel.org/r/20250407180154.63348-1-hannes@cmpxchg.org
Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion")
Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy")
Signed-off-by: Johannes Weiner <hannes(a)cmpxchg.org>
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
Reported-by: kernel test robot <oliver.sang(a)intel.com>
Reported-by: Carlos Song <carlos.song(a)nxp.com>
Tested-by: Carlos Song <carlos.song(a)nxp.com>
Tested-by: kernel test robot <oliver.sang(a)intel.com>
Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com
Reviewed-by: Brendan Jackman <jackmanb(a)google.com>
Tested-by: Shivank Garg <shivankg(a)amd.com>
Acked-by: Zi Yan <ziy(a)nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org> [6.10+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9a219fe8e130..1715e34b91af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2183,23 +2183,15 @@ try_to_claim_block(struct zone *zone, struct page *page,
}
/*
- * Try finding a free buddy page on the fallback list.
- *
- * This will attempt to claim a whole pageblock for the requested type
- * to ensure grouping of such requests in the future.
- *
- * If a whole block cannot be claimed, steal an individual page, regressing to
- * __rmqueue_smallest() logic to at least break up as little contiguity as
- * possible.
+ * Try to allocate from some fallback migratetype by claiming the entire block,
+ * i.e. converting it to the allocation's start migratetype.
*
* The use of signed ints for order and current_order is a deliberate
* deviation from the rest of this file, to make the for loop
* condition simpler.
- *
- * Return the stolen page, or NULL if none can be found.
*/
static __always_inline struct page *
-__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
+__rmqueue_claim(struct zone *zone, int order, int start_migratetype,
unsigned int alloc_flags)
{
struct free_area *area;
@@ -2237,14 +2229,29 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
page = try_to_claim_block(zone, page, current_order, order,
start_migratetype, fallback_mt,
alloc_flags);
- if (page)
- goto got_one;
+ if (page) {
+ trace_mm_page_alloc_extfrag(page, order, current_order,
+ start_migratetype, fallback_mt);
+ return page;
+ }
}
- if (alloc_flags & ALLOC_NOFRAGMENT)
- return NULL;
+ return NULL;
+}
+
+/*
+ * Try to steal a single page from some fallback migratetype. Leave the rest of
+ * the block as its current migratetype, potentially causing fragmentation.
+ */
+static __always_inline struct page *
+__rmqueue_steal(struct zone *zone, int order, int start_migratetype)
+{
+ struct free_area *area;
+ int current_order;
+ struct page *page;
+ int fallback_mt;
+ bool claim_block;
- /* No luck claiming pageblock. Find the smallest fallback page */
for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
area = &(zone->free_area[current_order]);
fallback_mt = find_suitable_fallback(area, current_order,
@@ -2254,25 +2261,28 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
page = get_page_from_free_area(area, fallback_mt);
page_del_and_expand(zone, page, order, current_order, fallback_mt);
- goto got_one;
+ trace_mm_page_alloc_extfrag(page, order, current_order,
+ start_migratetype, fallback_mt);
+ return page;
}
return NULL;
-
-got_one:
- trace_mm_page_alloc_extfrag(page, order, current_order,
- start_migratetype, fallback_mt);
-
- return page;
}
+enum rmqueue_mode {
+ RMQUEUE_NORMAL,
+ RMQUEUE_CMA,
+ RMQUEUE_CLAIM,
+ RMQUEUE_STEAL,
+};
+
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
- unsigned int alloc_flags)
+ unsigned int alloc_flags, enum rmqueue_mode *mode)
{
struct page *page;
@@ -2291,16 +2301,48 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
}
}
- page = __rmqueue_smallest(zone, order, migratetype);
- if (unlikely(!page)) {
- if (alloc_flags & ALLOC_CMA)
+ /*
+ * First try the freelists of the requested migratetype, then try
+ * fallbacks modes with increasing levels of fragmentation risk.
+ *
+ * The fallback logic is expensive and rmqueue_bulk() calls in
+ * a loop with the zone->lock held, meaning the freelists are
+ * not subject to any outside changes. Remember in *mode where
+ * we found pay dirt, to save us the search on the next call.
+ */
+ switch (*mode) {
+ case RMQUEUE_NORMAL:
+ page = __rmqueue_smallest(zone, order, migratetype);
+ if (page)
+ return page;
+ fallthrough;
+ case RMQUEUE_CMA:
+ if (alloc_flags & ALLOC_CMA) {
page = __rmqueue_cma_fallback(zone, order);
-
- if (!page)
- page = __rmqueue_fallback(zone, order, migratetype,
- alloc_flags);
+ if (page) {
+ *mode = RMQUEUE_CMA;
+ return page;
+ }
+ }
+ fallthrough;
+ case RMQUEUE_CLAIM:
+ page = __rmqueue_claim(zone, order, migratetype, alloc_flags);
+ if (page) {
+ /* Replenished preferred freelist, back to normal mode. */
+ *mode = RMQUEUE_NORMAL;
+ return page;
+ }
+ fallthrough;
+ case RMQUEUE_STEAL:
+ if (!(alloc_flags & ALLOC_NOFRAGMENT)) {
+ page = __rmqueue_steal(zone, order, migratetype);
+ if (page) {
+ *mode = RMQUEUE_STEAL;
+ return page;
+ }
+ }
}
- return page;
+ return NULL;
}
/*
@@ -2312,6 +2354,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
unsigned long count, struct list_head *list,
int migratetype, unsigned int alloc_flags)
{
+ enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
unsigned long flags;
int i;
@@ -2323,7 +2366,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
}
for (i = 0; i < count; ++i) {
struct page *page = __rmqueue(zone, order, migratetype,
- alloc_flags);
+ alloc_flags, &rmqm);
if (unlikely(page == NULL))
break;
@@ -2948,7 +2991,9 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
if (alloc_flags & ALLOC_HIGHATOMIC)
page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
if (!page) {
- page = __rmqueue(zone, order, migratetype, alloc_flags);
+ enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
+
+ page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm);
/*
* If the allocation fails, allow OOM handling and
This patchset adds support for the VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW
message added in recent VPU firmware. Without it, the driver is not able to
process any jobs after this message is received and needs to be reloaded.
Most patches are taken as-is from upstream, except for these two:
- Fix locking order in ivpu_job_submit
- Abort all jobs after command queue unregister
Both of these patches needed to be rebased because the new CMDQ UAPI changes
they depend on are not in stable and should not be backported.
Changes since v1:
- Documented deviations from the original upstream patches in commit messages
Andrew Kreimer (1):
accel/ivpu: Fix a typo
Andrzej Kacprowski (1):
accel/ivpu: Update VPU FW API headers
Karol Wachowski (4):
accel/ivpu: Use xa_alloc_cyclic() instead of custom function
accel/ivpu: Abort all jobs after command queue unregister
accel/ivpu: Fix locking order in ivpu_job_submit
accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW
Tomasz Rusinowicz (1):
accel/ivpu: Make DB_ID and JOB_ID allocations incremental
drivers/accel/ivpu/ivpu_drv.c | 38 ++--
drivers/accel/ivpu/ivpu_drv.h | 9 +
drivers/accel/ivpu/ivpu_job.c | 125 +++++++++---
drivers/accel/ivpu/ivpu_job.h | 1 +
drivers/accel/ivpu/ivpu_jsm_msg.c | 3 +-
drivers/accel/ivpu/ivpu_mmu.c | 3 +-
drivers/accel/ivpu/ivpu_sysfs.c | 5 +-
drivers/accel/ivpu/vpu_boot_api.h | 45 +++--
drivers/accel/ivpu/vpu_jsm_api.h | 303 +++++++++++++++++++++++++-----
9 files changed, 412 insertions(+), 120 deletions(-)
--
2.45.1
This patchset backports a series of ublk fixes from upstream to 6.14-stable.
Patch 7 fixes a race that can cause a kernel panic when the ublk server daemon is exiting.
It depends on patches 1-6, which simplify and improve IO canceling when the ublk server
daemon is exiting, as described here:
https://lore.kernel.org/linux-block/20250416035444.99569-1-ming.lei@redhat.…
Ming Lei (5):
ublk: add helper of ublk_need_map_io()
ublk: move device reset into ublk_ch_release()
ublk: remove __ublk_quiesce_dev()
ublk: simplify aborting ublk request
ublk: fix race between io_uring_cmd_complete_in_task and
ublk_cancel_cmd
Uday Shankar (2):
ublk: properly serialize all FETCH_REQs
ublk: improve detection and handling of ublk server exit
drivers/block/ublk_drv.c | 550 +++++++++++++++++++++------------------
1 file changed, 291 insertions(+), 259 deletions(-)
--
2.43.0
The patch below does not apply to the 6.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.14.y
git checkout FETCH_HEAD
git cherry-pick -x e775278cd75f24a2758c28558c4e41b36c935740
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025042247-mounting-playlist-f479@gregkh' --subject-prefix 'PATCH 6.14.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e775278cd75f24a2758c28558c4e41b36c935740 Mon Sep 17 00:00:00 2001
From: Kenneth Graunke <kenneth(a)whitecape.org>
Date: Sun, 30 Mar 2025 12:59:23 -0400
Subject: [PATCH] drm/xe: Invalidate L3 read-only cachelines for geometry
streams too
Historically, the Vertex Fetcher unit has not been an L3 client. That
meant that, when a buffer containing vertex data was written to, it was
necessary to issue a PIPE_CONTROL::VF Cache Invalidate to invalidate any
VF L2 cachelines associated with that buffer, so the new value would be
properly read from memory.
Since Tigerlake and later, VERTEX_BUFFER_STATE and 3DSTATE_INDEX_BUFFER
have included an "L3 Bypass Enable" bit which userspace drivers can set
to request that the vertex fetcher unit snoop L3. However, unlike most
true L3 clients, the "VF Cache Invalidate" bit continues to only
invalidate the VF L2 cache - and not any associated L3 lines.
To handle that, PIPE_CONTROL has a new "L3 Read Only Cache Invalidation
Bit", which according to the docs, "controls the invalidation of the
Geometry streams cached in L3 cache at the top of the pipe." In other
words, the vertex and index buffer data that gets cached in L3 when
"L3 Bypass Disable" is set.
Mesa always sets L3 Bypass Disable so that the VF unit snoops L3, and
whenever it issues a VF Cache Invalidate, it also issues a L3 Read Only
Cache Invalidate so that both L2 and L3 vertex data is invalidated.
xe is issuing VF cache invalidates too (which handles cases like CPU
writes to a buffer between GPU batches). Because userspace may enable
L3 snooping, it needs to issue an L3 Read Only Cache Invalidate as well.
Fixes significant flickering in Firefox on Meteorlake, which was writing
to vertex buffers via the CPU between batches; the missing L3 Read Only
invalidates were causing the vertex fetcher to read stale data from L3.
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4460
Fixes: 6ef3bb60557d ("drm/xe: enable lite restore")
Cc: stable(a)vger.kernel.org # v6.13+
Signed-off-by: Kenneth Graunke <kenneth(a)whitecape.org>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Link: https://lore.kernel.org/r/20250330165923.56410-1-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
(cherry picked from commit 61672806b579dd5a150a042ec9383be2bbc2ae7e)
Signed-off-by: Lucas De Marchi <lucas.demarchi(a)intel.com>
diff --git a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
index a255946b6f77..8cfcd3360896 100644
--- a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
+++ b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
@@ -41,6 +41,7 @@
#define GFX_OP_PIPE_CONTROL(len) ((0x3<<29)|(0x3<<27)|(0x2<<24)|((len)-2))
+#define PIPE_CONTROL0_L3_READ_ONLY_CACHE_INVALIDATE BIT(10) /* gen12 */
#define PIPE_CONTROL0_HDC_PIPELINE_FLUSH BIT(9) /* gen12 */
#define PIPE_CONTROL_COMMAND_CACHE_INVALIDATE (1<<29)
diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c b/drivers/gpu/drm/xe/xe_ring_ops.c
index 917fc16de866..a7582b097ae6 100644
--- a/drivers/gpu/drm/xe/xe_ring_ops.c
+++ b/drivers/gpu/drm/xe/xe_ring_ops.c
@@ -137,7 +137,8 @@ emit_pipe_control(u32 *dw, int i, u32 bit_group_0, u32 bit_group_1, u32 offset,
static int emit_pipe_invalidate(u32 mask_flags, bool invalidate_tlb, u32 *dw,
int i)
{
- u32 flags = PIPE_CONTROL_CS_STALL |
+ u32 flags0 = 0;
+ u32 flags1 = PIPE_CONTROL_CS_STALL |
PIPE_CONTROL_COMMAND_CACHE_INVALIDATE |
PIPE_CONTROL_INSTRUCTION_CACHE_INVALIDATE |
PIPE_CONTROL_TEXTURE_CACHE_INVALIDATE |
@@ -148,11 +149,15 @@ static int emit_pipe_invalidate(u32 mask_flags, bool invalidate_tlb, u32 *dw,
PIPE_CONTROL_STORE_DATA_INDEX;
if (invalidate_tlb)
- flags |= PIPE_CONTROL_TLB_INVALIDATE;
+ flags1 |= PIPE_CONTROL_TLB_INVALIDATE;
- flags &= ~mask_flags;
+ flags1 &= ~mask_flags;
- return emit_pipe_control(dw, i, 0, flags, LRC_PPHWSP_FLUSH_INVAL_SCRATCH_ADDR, 0);
+ if (flags1 & PIPE_CONTROL_VF_CACHE_INVALIDATE)
+ flags0 |= PIPE_CONTROL0_L3_READ_ONLY_CACHE_INVALIDATE;
+
+ return emit_pipe_control(dw, i, flags0, flags1,
+ LRC_PPHWSP_FLUSH_INVAL_SCRATCH_ADDR, 0);
}
static int emit_store_imm_ppgtt_posted(u64 addr, u64 value,
Hello,
This fixes an i.MX 7D SoC PCIe regression caused by a backport mistake.
The regression is broken PCIe initialization and for me a boot hang.
I don't know how to organize this. I think a revert and redo best captures
what's happening.
To complicate things, it looks like the redo patch could also be applied to
5.4, 5.10, 5.15, and 6.1. But those versions don't have the original
backport commit. Version 6.12 matches master and needs no change.
One conflict resolution is needed to apply the redo patch back to versions
6.1 -> 5.15 -> 5.10. One more resolution is needed to go back to 5.4. Patches
against those other versions aren't included here.
-- Ryan
Richard Zhu (1):
PCI: imx6: Skip controller_id generation logic for i.MX7D
Ryan Matthews (1):
Revert "PCI: imx6: Skip controller_id generation logic for i.MX7D"
drivers/pci/controller/dwc/pci-imx6.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
base-commit: 814637ca257f4faf57a73fd4e38888cce88b5911
--
2.47.2
The patch below does not apply to the 6.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.14.y
git checkout FETCH_HEAD
git cherry-pick -x 48c1d1bb525b1c44b8bdc8e7ec5629cb6c2b9fc4
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025050558-charger-crumpled-6ca4@gregkh' --subject-prefix 'PATCH 6.14.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 48c1d1bb525b1c44b8bdc8e7ec5629cb6c2b9fc4 Mon Sep 17 00:00:00 2001
From: Penglei Jiang <superman.xpt(a)gmail.com>
Date: Mon, 21 Apr 2025 08:40:29 -0700
Subject: [PATCH] btrfs: fix the inode leak in btrfs_iget()
[BUG]
There is a bug report that a syzbot reproducer can lead to the following
busy inode at unmount time:
BTRFS info (device loop1): last unmount of filesystem 1680000e-3c1e-4c46-84b6-56bd3909af50
VFS: Busy inodes after unmount of loop1 (btrfs)
------------[ cut here ]------------
kernel BUG at fs/super.c:650!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 48168 Comm: syz-executor Not tainted 6.15.0-rc2-00471-g119009db2674 #2 PREEMPT(full)
Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:generic_shutdown_super+0x2e9/0x390 fs/super.c:650
Call Trace:
<TASK>
kill_anon_super+0x3a/0x60 fs/super.c:1237
btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2099
deactivate_locked_super+0xbe/0x1a0 fs/super.c:473
deactivate_super fs/super.c:506 [inline]
deactivate_super+0xe2/0x100 fs/super.c:502
cleanup_mnt+0x21f/0x440 fs/namespace.c:1435
task_work_run+0x14d/0x240 kernel/task_work.c:227
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
syscall_exit_to_user_mode+0x269/0x290 kernel/entry/common.c:218
do_syscall_64+0xd4/0x250 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
[CAUSE]
When btrfs_alloc_path() fails, btrfs_iget() returns directly without
releasing the inode already allocated by btrfs_iget_locked().
This results in the above busy inode and triggers the kernel BUG.
[FIX]
Fix it by calling iget_failed() if btrfs_alloc_path() failed.
If we hit an error inside btrfs_read_locked_inode(), it properly calls
iget_failed(), so there is nothing to worry about in that path.
Although the iget_failed() cleanup inside btrfs_read_locked_inode() is a
departure from the normal error handling scheme, let's fix the obvious bug
and backport it first, then rework the error handling later.
Reported-by: Penglei Jiang <superman.xpt(a)gmail.com>
Link: https://lore.kernel.org/linux-btrfs/20250421102425.44431-1-superman.xpt@gma…
Fixes: 7c855e16ab72 ("btrfs: remove conditional path allocation in btrfs_read_locked_inode()")
CC: stable(a)vger.kernel.org # 6.13+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Penglei Jiang <superman.xpt(a)gmail.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 312fa996a987..d295a37fa049 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5682,8 +5682,10 @@ struct btrfs_inode *btrfs_iget(u64 ino, struct btrfs_root *root)
return inode;
path = btrfs_alloc_path();
- if (!path)
+ if (!path) {
+ iget_failed(&inode->vfs_inode);
return ERR_PTR(-ENOMEM);
+ }
ret = btrfs_read_locked_inode(inode, path);
btrfs_free_path(path);