From: Shigeru Yoshida <syoshida(a)redhat.com>
commit fc1092f51567277509563800a3c56732070b6aa4 upstream.
KMSAN reported uninit-value access in __ip_make_skb() [1]. __ip_make_skb()
tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a
race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL
while __ip_make_skb() is running, the function will access icmphdr in the
skb even if it is not included. This causes the issue reported by KMSAN.
Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL
on the socket.
Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not initialized. These
are union in struct flowi4 and are implicitly initialized by
flowi4_init_output(), but we should not rely on specific union layout.
Initialize these explicitly in raw_sendmsg().
[1]
BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
__ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481
ip_finish_skb include/net/ip.h:243 [inline]
ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508
raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654
inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x274/0x3c0 net/socket.c:745
__sys_sendto+0x62c/0x7b0 net/socket.c:2191
__do_sys_sendto net/socket.c:2203 [inline]
__se_sys_sendto net/socket.c:2199 [inline]
__x64_sys_sendto+0x130/0x200 net/socket.c:2199
do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x6d/0x75
Uninit was created at:
slab_post_alloc_hook mm/slub.c:3804 [inline]
slab_alloc_node mm/slub.c:3845 [inline]
kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888
kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577
__alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668
alloc_skb include/linux/skbuff.h:1318 [inline]
__ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128
ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365
raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648
inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x274/0x3c0 net/socket.c:745
__sys_sendto+0x62c/0x7b0 net/socket.c:2191
__do_sys_sendto net/socket.c:2203 [inline]
__se_sys_sendto net/socket.c:2199 [inline]
__x64_sys_sendto+0x130/0x200 net/socket.c:2199
do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x6d/0x75
CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb79b1 #25
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
Fixes: 99e5acae193e ("ipv4: Fix potential uninit variable access bug in __ip_make_skb()")
Reported-by: syzkaller <syzkaller(a)googlegroups.com>
Signed-off-by: Shigeru Yoshida <syoshida(a)redhat.com>
Link: https://lore.kernel.org/r/20240430123945.2057348-1-syoshida@redhat.com
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Signed-off-by: Shubham Kulkarni <skulkarni(a)mvista.com>
---
Referred stable v6.1.y version of the patch to generate this one
[ v6.1 link: https://github.com/gregkh/linux/commit/55bf541e018b76b3750cb6c6ea18c46e1ac5… ]
---
net/ipv4/ip_output.c | 3 ++-
net/ipv4/raw.c | 3 +++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 1e430e135aa6..70c316c0537f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1572,7 +1572,8 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
* so icmphdr does not in skb linear region and can not get icmp_type
* by icmp_hdr(skb)->type.
*/
- if (sk->sk_type == SOCK_RAW && !inet_sk(sk)->hdrincl)
+ if (sk->sk_type == SOCK_RAW &&
+ !(fl4->flowi4_flags & FLOWI_FLAG_KNOWN_NH))
icmp_type = fl4->fl4_icmp_type;
else
icmp_type = icmp_hdr(skb)->type;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 650da4d8f7ad..5a8d5ebe6590 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -634,6 +634,9 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
(hdrincl ? FLOWI_FLAG_KNOWN_NH : 0),
daddr, saddr, 0, 0, sk->sk_uid);
+ fl4.fl4_icmp_type = 0;
+ fl4.fl4_icmp_code = 0;
+
if (!hdrincl) {
rfv.msg = msg;
rfv.hlen = 0;
--
2.25.1
__io_openat_prep() allocates a struct filename using getname(). However,
for the condition of the file being installed in the fixed file table as
well as having O_CLOEXEC flag set, the function returns early. At that
point, the request doesn't have REQ_F_NEED_CLEANUP flag set. Due to this,
the memory for the newly allocated struct filename is not cleaned up,
causing a memory leak.
Fix this by setting the REQ_F_NEED_CLEANUP for the request just after the
successful getname() call, so that when the request is torn down, the
filename will be cleaned up, along with other resources needing cleanup.
Reported-by: syzbot+00e61c43eb5e4740438f(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=00e61c43eb5e4740438f
Tested-by: syzbot+00e61c43eb5e4740438f(a)syzkaller.appspotmail.com
Cc: stable(a)vger.kernel.org
Signed-off-by: Prithvi Tambewagh <activprithvi(a)gmail.com>
---
io_uring/openclose.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/openclose.c b/io_uring/openclose.c
index bfeb91b31bba..15dde9bd6ff6 100644
--- a/io_uring/openclose.c
+++ b/io_uring/openclose.c
@@ -73,13 +73,13 @@ static int __io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
open->filename = NULL;
return ret;
}
+ req->flags |= REQ_F_NEED_CLEANUP;
open->file_slot = READ_ONCE(sqe->file_index);
if (open->file_slot && (open->how.flags & O_CLOEXEC))
return -EINVAL;
open->nofile = rlimit(RLIMIT_NOFILE);
- req->flags |= REQ_F_NEED_CLEANUP;
if (io_openat_force_async(open))
req->flags |= REQ_F_FORCE_ASYNC;
return 0;
base-commit: b927546677c876e26eba308550207c2ddf812a43
--
2.34.1
In the pruss_clk_mux_setup(), the devm_add_action_or_reset() indirectly
calls pruss_of_free_clk_provider(), which calls of_node_put(clk_mux_np)
on the error path. However, after the devm_add_action_or_reset()
returns, the of_node_put(clk_mux_np) is called again, causing a double
free.
Fix by using a separate label to avoid the duplicate of_node_put().
Fixes: ba59c9b43c86 ("soc: ti: pruss: support CORECLK_MUX and IEPCLK_MUX")
Cc: stable(a)vger.kernel.org
Signed-off-by: Wentao Liang <vulab(a)iscas.ac.cn>
---
drivers/soc/ti/pruss.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/soc/ti/pruss.c b/drivers/soc/ti/pruss.c
index d7634bf5413a..c16d96bebe3f 100644
--- a/drivers/soc/ti/pruss.c
+++ b/drivers/soc/ti/pruss.c
@@ -368,13 +368,14 @@ static int pruss_clk_mux_setup(struct pruss *pruss, struct clk *clk_mux,
clk_mux_np);
if (ret) {
dev_err(dev, "failed to add clkmux free action %d", ret);
- goto put_clk_mux_np;
+ goto ret_error;
}
return 0;
put_clk_mux_np:
of_node_put(clk_mux_np);
+ret_error:
return ret;
}
--
2.34.1
Hello,
please backport
commit fbf5892df21a8ccfcb2fda0fd65bc3169c89ed28
Author: Martin Nybo Andersen <tweek(a)tweek.dk>
Date: Fri Sep 15 12:15:39 2023 +0200
kbuild: Use CRC32 and a 1MiB dictionary for XZ compressed modules
Kmod is now (since kmod commit 09c9f8c5df04 ("libkmod: Use kernel
decompression when available")) using the kernel decompressor, when
loading compressed modules.
However, the kernel XZ decompressor is XZ Embedded, which doesn't
handle CRC64 and dictionaries larger than 1MiB.
Use CRC32 and 1MiB dictionary when XZ compressing and installing
kernel modules.
to the 6.1 stable kernel, and possibly older ones as well.
The commit message actually has it all, so just my story: There's a
hardware that has or had issues with never kernels (no time to check),
my kernel for this board is usually static. But after building a kernel
with xz-compressed modules, they wouldn't load but trigger
"decompression failed with status 6". Investigation led to a CRC64 check
for these files, and eventually to the above commit.
The commit applies (with an offset), the resulting modules work as
expected.
Kernel 6.6 and newer already have that commit. Older kernels could
possibly benefit from this as well, I haven't checked.
Kind regards,
Christoph
Memory region assignment in ath11k_qmi_assign_target_mem_chunk()
assumes that:
1. firmware will make a HOST_DDR_REGION_TYPE request, and
2. this request is processed before CALDB_MEM_REGION_TYPE
In this case CALDB_MEM_REGION_TYPE, can safely be assigned immediately
after the host region.
However, if the HOST_DDR_REGION_TYPE request is not made, or the
reserved-memory node is not present, then res.start and res.end are 0,
and host_ddr_sz remains uninitialized. The physical address should
fall back to ATH11K_QMI_CALDB_ADDRESS. That doesn't happen:
resource_size(&res) returns 1 for an empty resource, and thus the if
clause never takes the fallback path. ab->qmi.target_mem[idx].paddr
is assigned the uninitialized value of host_ddr_sz + 0 (res.start).
Use "if (res.end > res.start)" for the predicate, which correctly
falls back to ATH11K_QMI_CALDB_ADDRESS.
Fixes: 900730dc4705 ("wifi: ath: Use of_reserved_mem_region_to_resource() for "memory-region"")
Cc: stable(a)vger.kernel.org # v6.18
Signed-off-by: Alexandru Gagniuc <mr.nuke.me(a)gmail.com>
---
drivers/net/wireless/ath/ath11k/qmi.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/wireless/ath/ath11k/qmi.c b/drivers/net/wireless/ath/ath11k/qmi.c
index aea56c38bf8f3..6cc26d1c1e2a4 100644
--- a/drivers/net/wireless/ath/ath11k/qmi.c
+++ b/drivers/net/wireless/ath/ath11k/qmi.c
@@ -2054,7 +2054,7 @@ static int ath11k_qmi_assign_target_mem_chunk(struct ath11k_base *ab)
return ret;
}
- if (res.end - res.start + 1 < ab->qmi.target_mem[i].size) {
+ if (resource_size(&res) < ab->qmi.target_mem[i].size) {
ath11k_dbg(ab, ATH11K_DBG_QMI,
"fail to assign memory of sz\n");
return -EINVAL;
@@ -2086,7 +2086,7 @@ static int ath11k_qmi_assign_target_mem_chunk(struct ath11k_base *ab)
}
if (ath11k_core_coldboot_cal_support(ab)) {
- if (resource_size(&res)) {
+ if (res.end > res.start) {
ab->qmi.target_mem[idx].paddr =
res.start + host_ddr_sz;
ab->qmi.target_mem[idx].iaddr =
--
2.45.1
As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix
huge_pmd_unshare() vs GUP-fast race") we can end up in some situations
where we perform so many IPI broadcasts when unsharing hugetlb PMD page
tables that it severely regresses some workloads.
In particular, when we fork()+exit(), or when we munmap() a large
area backed by many shared PMD tables, we perform one IPI broadcast per
unshared PMD table.
There are two optimizations to be had:
(1) When we process (unshare) multiple such PMD tables, such as during
exit(), it is sufficient to send a single IPI broadcast (as long as
we respect locking rules) instead of one per PMD table.
Locking prevents that any of these PMD tables could get reused before
we drop the lock.
(2) When we are not the last sharer (> 2 users including us), there is
no need to send the IPI broadcast. The shared PMD tables cannot
become exclusive (fully unshared) before an IPI will be broadcasted
by the last sharer.
Concurrent GUP-fast could walk into a PMD table just before we
unshared it. It could then succeed in grabbing a page from the
shared page table even after munmap() etc succeeded (and supressed
an IPI). But there is not difference compared to GUP-fast just
sleeping for a while after grabbing the page and re-enabling IRQs.
Most importantly, GUP-fast will never walk into page tables that are
no-longer shared, because the last sharer will issue an IPI
broadcast.
(if ever required, checking whether the PUD changed in GUP-fast
after grabbing the page like we do in the PTE case could handle
this)
So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather
infrastructure so we can implement these optimizations and demystify the
code at least a bit. Extend the mmu_gather infrastructure to be able to
deal with our special hugetlb PMD table sharing implementation.
To make initialization of the mmu_gather easier when working on a single
VMA (in particular, when dealing with hugetlb), provide
tlb_gather_mmu_vma().
We'll consolidate the handling for (full) unsharing of PMD tables in
tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track
in "struct mmu_gather" whether we had (full) unsharing of PMD tables.
Because locking is very special (concurrent unsharing+reuse must be
prevented), we disallow deferring flushing to tlb_finish_mmu() and instead
require an explicit earlier call to tlb_flush_unshared_tables().
From hugetlb code, we call huge_pmd_unshare_flush() where we make sure
that the expected lock protecting us from concurrent unsharing+reuse is
still held.
Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that
tlb_flush_unshared_tables() was properly called earlier.
Document it all properly.
Notes about tlb_remove_table_sync_one() interaction with unsharing:
There are two fairly tricky things:
(1) tlb_remove_table_sync_one() is a NOP on architectures without
CONFIG_MMU_GATHER_RCU_TABLE_FREE.
Here, the assumption is that the previous TLB flush would send an
IPI to all relevant CPUs. Careful: some architectures like x86 only
send IPIs to all relevant CPUs when tlb->freed_tables is set.
The relevant architectures should be selecting
MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable
kernels and it might have been problematic before this patch.
Also, the arch flushing behavior (independent of IPIs) is different
when tlb->freed_tables is set. Do we have to enlighten them to also
take care of tlb->unshared_tables? So far we didn't care, so
hopefully we are fine. Of course, we could be setting
tlb->freed_tables as well, but that might then unnecessarily flush
too much, because the semantics of tlb->freed_tables are a bit
fuzzy.
This patch changes nothing in this regard.
(2) tlb_remove_table_sync_one() is not a NOP on architectures with
CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.
Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB)
we still issue IPIs during TLB flushes and don't actually need the
second tlb_remove_table_sync_one().
This optimized can be implemented on top of this, by checking e.g., in
tlb_remove_table_sync_one() whether we really need IPIs. But as
described in (1), it really must honor tlb->freed_tables then to
send IPIs to all relevant CPUs.
Notes on TLB flushing changes:
(1) Flushing for non-shared PMD tables
We're converting from flush_hugetlb_tlb_range() to
tlb_remove_huge_tlb_entry(). Given that we properly initialize the
MMU gather in tlb_gather_mmu_vma() to be hugetlb aware, similar to
__unmap_hugepage_range(), that should be fine.
(2) Flushing for shared PMD tables
We're converting from various things (flush_hugetlb_tlb_range(),
tlb_flush_pmd_range(), flush_tlb_range()) to tlb_flush_pmd_range().
tlb_flush_pmd_range() achieves the same that
tlb_remove_huge_tlb_entry() would achieve in these scenarios.
Note that tlb_remove_huge_tlb_entry() also calls
__tlb_remove_tlb_entry(), however that is only implemented on
powerpc, which does not support PMD table sharing.
Similar to (1), tlb_gather_mmu_vma() should make sure that TLB
flushing keeps on working as expected.
Further, note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a
concern, as we are holding the i_mmap_lock the whole time, preventing
concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed
separately as a cleanup later.
There are plenty more cleanups to be had, but they have to wait until
this is fixed.
Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
Reported-by: Uschakow, Stanislav" <suschako(a)amazon.de>
Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
Tested-by: Laurence Oberman <loberman(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: David Hildenbrand (Red Hat) <david(a)kernel.org>
---
include/asm-generic/tlb.h | 77 +++++++++++++++++++++++-
include/linux/hugetlb.h | 15 +++--
include/linux/mm_types.h | 1 +
mm/hugetlb.c | 123 ++++++++++++++++++++++----------------
mm/mmu_gather.c | 33 ++++++++++
mm/rmap.c | 25 +++++---
6 files changed, 208 insertions(+), 66 deletions(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 1fff717cae510..4d679d2a206b4 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -46,7 +46,8 @@
*
* The mmu_gather API consists of:
*
- * - tlb_gather_mmu() / tlb_gather_mmu_fullmm() / tlb_finish_mmu()
+ * - tlb_gather_mmu() / tlb_gather_mmu_fullmm() / tlb_gather_mmu_vma() /
+ * tlb_finish_mmu()
*
* start and finish a mmu_gather
*
@@ -364,6 +365,20 @@ struct mmu_gather {
unsigned int vma_huge : 1;
unsigned int vma_pfn : 1;
+ /*
+ * Did we unshare (unmap) any shared page tables? For now only
+ * used for hugetlb PMD table sharing.
+ */
+ unsigned int unshared_tables : 1;
+
+ /*
+ * Did we unshare any page tables such that they are now exclusive
+ * and could get reused+modified by the new owner? When setting this
+ * flag, "unshared_tables" will be set as well. For now only used
+ * for hugetlb PMD table sharing.
+ */
+ unsigned int fully_unshared_tables : 1;
+
unsigned int batch_count;
#ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -400,6 +415,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->cleared_pmds = 0;
tlb->cleared_puds = 0;
tlb->cleared_p4ds = 0;
+ tlb->unshared_tables = 0;
/*
* Do not reset mmu_gather::vma_* fields here, we do not
* call into tlb_start_vma() again to set them if there is an
@@ -484,7 +500,7 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
* these bits.
*/
if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
- tlb->cleared_puds || tlb->cleared_p4ds))
+ tlb->cleared_puds || tlb->cleared_p4ds || tlb->unshared_tables))
return;
tlb_flush(tlb);
@@ -773,6 +789,63 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
}
#endif
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+static inline void tlb_unshare_pmd_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt,
+ unsigned long addr)
+{
+ /*
+ * The caller must make sure that concurrent unsharing + exclusive
+ * reuse is impossible until tlb_flush_unshared_tables() was called.
+ */
+ VM_WARN_ON_ONCE(!ptdesc_pmd_is_shared(pt));
+ ptdesc_pmd_pts_dec(pt);
+
+ /* Clearing a PUD pointing at a PMD table with PMD leaves. */
+ tlb_flush_pmd_range(tlb, addr & PUD_MASK, PUD_SIZE);
+
+ /*
+ * If the page table is now exclusively owned, we fully unshared
+ * a page table.
+ */
+ if (!ptdesc_pmd_is_shared(pt))
+ tlb->fully_unshared_tables = true;
+ tlb->unshared_tables = true;
+}
+
+static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
+{
+ /*
+ * As soon as the caller drops locks to allow for reuse of
+ * previously-shared tables, these tables could get modified and
+ * even reused outside of hugetlb context, so we have to make sure that
+ * any page table walkers (incl. TLB, GUP-fast) are aware of that
+ * change.
+ *
+ * Even if we are not fully unsharing a PMD table, we must
+ * flush the TLB for the unsharer now.
+ */
+ if (tlb->unshared_tables)
+ tlb_flush_mmu_tlbonly(tlb);
+
+ /*
+ * Similarly, we must make sure that concurrent GUP-fast will not
+ * walk previously-shared page tables that are getting modified+reused
+ * elsewhere. So broadcast an IPI to wait for any concurrent GUP-fast.
+ *
+ * We only perform this when we are the last sharer of a page table,
+ * as the IPI will reach all CPUs: any GUP-fast.
+ *
+ * Note that on configs where tlb_remove_table_sync_one() is a NOP,
+ * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
+ * required IPIs already for us.
+ */
+ if (tlb->fully_unshared_tables) {
+ tlb_remove_table_sync_one();
+ tlb->fully_unshared_tables = false;
+ }
+}
+#endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
+
#endif /* CONFIG_MMU */
#endif /* _ASM_GENERIC__TLB_H */
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 03c8725efa289..e51b8ef0cebd9 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -240,8 +240,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *huge_pte_offset(struct mm_struct *mm,
unsigned long addr, unsigned long sz);
unsigned long hugetlb_mask_last_page(struct hstate *h);
-int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep);
+int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep);
+void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma);
void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end);
@@ -300,13 +301,17 @@ static inline struct address_space *hugetlb_folio_mapping_lock_write(
return NULL;
}
-static inline int huge_pmd_unshare(struct mm_struct *mm,
- struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep)
+static inline int huge_pmd_unshare(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
{
return 0;
}
+static inline void huge_pmd_unshare_flush(struct mmu_gather *tlb,
+ struct vm_area_struct *vma)
+{
+}
+
static inline void adjust_range_if_pmd_sharing_possible(
struct vm_area_struct *vma,
unsigned long *start, unsigned long *end)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 42af2292951d4..d1053b2c1f800 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1522,6 +1522,7 @@ static inline unsigned int mm_cid_size(void)
struct mmu_gather;
extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
+void tlb_gather_mmu_vma(struct mmu_gather *tlb, struct vm_area_struct *vma);
extern void tlb_finish_mmu(struct mmu_gather *tlb);
struct vm_fault;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c77cdef12a32..2609b6d58f99e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5096,7 +5096,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
unsigned long last_addr_mask;
pte_t *src_pte, *dst_pte;
struct mmu_notifier_range range;
- bool shared_pmd = false;
+ struct mmu_gather tlb;
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, old_addr,
old_end);
@@ -5106,6 +5106,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
* range.
*/
flush_cache_range(vma, range.start, range.end);
+ tlb_gather_mmu_vma(&tlb, vma);
mmu_notifier_invalidate_range_start(&range);
last_addr_mask = hugetlb_mask_last_page(h);
@@ -5122,8 +5123,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
if (huge_pte_none(huge_ptep_get(mm, old_addr, src_pte)))
continue;
- if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
- shared_pmd = true;
+ if (huge_pmd_unshare(&tlb, vma, old_addr, src_pte)) {
old_addr |= last_addr_mask;
new_addr |= last_addr_mask;
continue;
@@ -5134,15 +5134,16 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
break;
move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte, sz);
+ tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);
}
- if (shared_pmd)
- flush_hugetlb_tlb_range(vma, range.start, range.end);
- else
- flush_hugetlb_tlb_range(vma, old_end - len, old_end);
+ tlb_flush_mmu_tlbonly(&tlb);
+ huge_pmd_unshare_flush(&tlb, vma);
+
mmu_notifier_invalidate_range_end(&range);
i_mmap_unlock_write(mapping);
hugetlb_vma_unlock_write(vma);
+ tlb_finish_mmu(&tlb);
return len + old_addr - old_end;
}
@@ -5161,7 +5162,6 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long sz = huge_page_size(h);
bool adjust_reservation;
unsigned long last_addr_mask;
- bool force_flush = false;
WARN_ON(!is_vm_hugetlb_page(vma));
BUG_ON(start & ~huge_page_mask(h));
@@ -5184,10 +5184,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
}
ptl = huge_pte_lock(h, mm, ptep);
- if (huge_pmd_unshare(mm, vma, address, ptep)) {
+ if (huge_pmd_unshare(tlb, vma, address, ptep)) {
spin_unlock(ptl);
- tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
- force_flush = true;
address |= last_addr_mask;
continue;
}
@@ -5303,14 +5301,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
}
tlb_end_vma(tlb, vma);
- /*
- * There is nothing protecting a previously-shared page table that we
- * unshared through huge_pmd_unshare() from getting freed after we
- * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
- * succeeded, flush the range corresponding to the pud.
- */
- if (force_flush)
- tlb_flush_mmu_tlbonly(tlb);
+ huge_pmd_unshare_flush(tlb, vma);
}
void __hugetlb_zap_begin(struct vm_area_struct *vma,
@@ -6409,11 +6400,11 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
pte_t pte;
struct hstate *h = hstate_vma(vma);
long pages = 0, psize = huge_page_size(h);
- bool shared_pmd = false;
struct mmu_notifier_range range;
unsigned long last_addr_mask;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ struct mmu_gather tlb;
/*
* In the case of shared PMDs, the area to flush could be beyond
@@ -6426,6 +6417,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
BUG_ON(address >= end);
flush_cache_range(vma, range.start, range.end);
+ tlb_gather_mmu_vma(&tlb, vma);
mmu_notifier_invalidate_range_start(&range);
hugetlb_vma_lock_write(vma);
@@ -6452,7 +6444,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
}
}
ptl = huge_pte_lock(h, mm, ptep);
- if (huge_pmd_unshare(mm, vma, address, ptep)) {
+ if (huge_pmd_unshare(&tlb, vma, address, ptep)) {
/*
* When uffd-wp is enabled on the vma, unshare
* shouldn't happen at all. Warn about it if it
@@ -6461,7 +6453,6 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
pages++;
spin_unlock(ptl);
- shared_pmd = true;
address |= last_addr_mask;
continue;
}
@@ -6522,22 +6513,16 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
pte = huge_pte_clear_uffd_wp(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
+ tlb_remove_huge_tlb_entry(h, &tlb, ptep, address);
}
next:
spin_unlock(ptl);
cond_resched();
}
- /*
- * There is nothing protecting a previously-shared page table that we
- * unshared through huge_pmd_unshare() from getting freed after we
- * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
- * succeeded, flush the range corresponding to the pud.
- */
- if (shared_pmd)
- flush_hugetlb_tlb_range(vma, range.start, range.end);
- else
- flush_hugetlb_tlb_range(vma, start, end);
+
+ tlb_flush_mmu_tlbonly(&tlb);
+ huge_pmd_unshare_flush(&tlb, vma);
/*
* No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are
* downgrading page table protection not changing it to point to a new
@@ -6548,6 +6533,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
i_mmap_unlock_write(vma->vm_file->f_mapping);
hugetlb_vma_unlock_write(vma);
mmu_notifier_invalidate_range_end(&range);
+ tlb_finish_mmu(&tlb);
return pages > 0 ? (pages << h->order) : pages;
}
@@ -6904,18 +6890,27 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
return pte;
}
-/*
- * unmap huge page backed by shared pte.
+/**
+ * huge_pmd_unshare - Unmap a pmd table if it is shared by multiple users
+ * @tlb: the current mmu_gather.
+ * @vma: the vma covering the pmd table.
+ * @addr: the address we are trying to unshare.
+ * @ptep: pointer into the (pmd) page table.
+ *
+ * Called with the page table lock held, the i_mmap_rwsem held in write mode
+ * and the hugetlb vma lock held in write mode.
*
- * Called with page table lock held.
+ * Note: The caller must call huge_pmd_unshare_flush() before dropping the
+ * i_mmap_rwsem.
*
- * returns: 1 successfully unmapped a shared pte page
- * 0 the underlying pte page is not shared, or it is the last user
+ * Returns: 1 if it was a shared PMD table and it got unmapped, or 0 if it
+ * was not a shared PMD table.
*/
-int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep)
+int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
{
unsigned long sz = huge_page_size(hstate_vma(vma));
+ struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd = pgd_offset(mm, addr);
p4d_t *p4d = p4d_offset(pgd, addr);
pud_t *pud = pud_offset(p4d, addr);
@@ -6927,18 +6922,36 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
i_mmap_assert_write_locked(vma->vm_file->f_mapping);
hugetlb_vma_assert_locked(vma);
pud_clear(pud);
- /*
- * Once our caller drops the rmap lock, some other process might be
- * using this page table as a normal, non-hugetlb page table.
- * Wait for pending gup_fast() in other threads to finish before letting
- * that happen.
- */
- tlb_remove_table_sync_one();
- ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
+
+ tlb_unshare_pmd_ptdesc(tlb, virt_to_ptdesc(ptep), addr);
+
mm_dec_nr_pmds(mm);
return 1;
}
+/*
+ * huge_pmd_unshare_flush - Complete a sequence of huge_pmd_unshare() calls
+ * @tlb: the current mmu_gather.
+ * @vma: the vma covering the pmd table.
+ *
+ * Perform necessary TLB flushes or IPI broadcasts to synchronize PMD table
+ * unsharing with concurrent page table walkers.
+ *
+ * This function must be called after a sequence of huge_pmd_unshare()
+ * calls while still holding the i_mmap_rwsem.
+ */
+void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+ /*
+ * We must synchronize page table unsharing such that nobody will
+ * try reusing a previously-shared page table while it might still
+ * be in use by previous sharers (TLB, GUP_fast).
+ */
+ i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+
+ tlb_flush_unshared_tables(tlb);
+}
+
#else /* !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -6947,12 +6960,16 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
return NULL;
}
-int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep)
+int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
{
return 0;
}
+void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+}
+
void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end)
{
@@ -7219,6 +7236,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
unsigned long sz = huge_page_size(h);
struct mm_struct *mm = vma->vm_mm;
struct mmu_notifier_range range;
+ struct mmu_gather tlb;
unsigned long address;
spinlock_t *ptl;
pte_t *ptep;
@@ -7230,6 +7248,8 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
return;
flush_cache_range(vma, start, end);
+ tlb_gather_mmu_vma(&tlb, vma);
+
/*
* No need to call adjust_range_if_pmd_sharing_possible(), because
* we have already done the PUD_SIZE alignment.
@@ -7248,10 +7268,10 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
if (!ptep)
continue;
ptl = huge_pte_lock(h, mm, ptep);
- huge_pmd_unshare(mm, vma, address, ptep);
+ huge_pmd_unshare(&tlb, vma, address, ptep);
spin_unlock(ptl);
}
- flush_hugetlb_tlb_range(vma, start, end);
+ huge_pmd_unshare_flush(&tlb, vma);
if (take_locks) {
i_mmap_unlock_write(vma->vm_file->f_mapping);
hugetlb_vma_unlock_write(vma);
@@ -7261,6 +7281,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
* Documentation/mm/mmu_notifier.rst.
*/
mmu_notifier_invalidate_range_end(&range);
+ tlb_finish_mmu(&tlb);
}
/*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 247e3f9db6c7a..cd32c2dbf501b 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -10,6 +10,7 @@
#include <linux/swap.h>
#include <linux/rmap.h>
#include <linux/pgalloc.h>
+#include <linux/hugetlb.h>
#include <asm/tlb.h>
@@ -426,6 +427,7 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
#endif
tlb->vma_pfn = 0;
+ tlb->fully_unshared_tables = 0;
__tlb_reset_range(tlb);
inc_tlb_flush_pending(tlb->mm);
}
@@ -459,6 +461,31 @@ void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm)
__tlb_gather_mmu(tlb, mm, true);
}
+/**
+ * tlb_gather_mmu - initialize an mmu_gather structure for operating on a single
+ * VMA
+ * @tlb: the mmu_gather structure to initialize
+ * @vma: the vm_area_struct
+ *
+ * Called to initialize an (on-stack) mmu_gather structure for operating on
+ * a single VMA. In contrast to tlb_gather_mmu(), calling this function will
+ * not require another call to tlb_start_vma(). In contrast to tlb_start_vma(),
+ * this function will *not* call flush_cache_range().
+ *
+ * For hugetlb VMAs, this function will also initialize the mmu_gather
+ * page_size accordingly, not requiring a separate call to
+ * tlb_change_page_size().
+ *
+ */
+void tlb_gather_mmu_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+ tlb_gather_mmu(tlb, vma->vm_mm);
+ tlb_update_vma_flags(tlb, vma);
+ if (is_vm_hugetlb_page(vma))
+ /* All entries have the same size. */
+ tlb_change_page_size(tlb, huge_page_size(hstate_vma(vma)));
+}
+
/**
* tlb_finish_mmu - finish an mmu_gather structure
* @tlb: the mmu_gather structure to finish
@@ -468,6 +495,12 @@ void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm)
*/
void tlb_finish_mmu(struct mmu_gather *tlb)
{
+ /*
+ * We expect an earlier huge_pmd_unshare_flush() call to sort this out,
+ * due to complicated locking requirements with page table unsharing.
+ */
+ VM_WARN_ON_ONCE(tlb->fully_unshared_tables);
+
/*
* If there are parallel threads are doing PTE changes on same range
* under non-exclusive lock (e.g., mmap_lock read-side) but defer TLB
diff --git a/mm/rmap.c b/mm/rmap.c
index 748f48727a162..7b9879ef442d9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -76,7 +76,7 @@
#include <linux/mm_inline.h>
#include <linux/oom.h>
-#include <asm/tlbflush.h>
+#include <asm/tlb.h>
#define CREATE_TRACE_POINTS
#include <trace/events/migrate.h>
@@ -2008,13 +2008,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* if unsuccessful.
*/
if (!anon) {
+ struct mmu_gather tlb;
+
VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
if (!hugetlb_vma_trylock_write(vma))
goto walk_abort;
- if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+
+ tlb_gather_mmu_vma(&tlb, vma);
+ if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
hugetlb_vma_unlock_write(vma);
- flush_tlb_range(vma,
- range.start, range.end);
+ huge_pmd_unshare_flush(&tlb, vma);
+ tlb_finish_mmu(&tlb);
/*
* The PMD table was unmapped,
* consequently unmapping the folio.
@@ -2022,6 +2026,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto walk_done;
}
hugetlb_vma_unlock_write(vma);
+ tlb_finish_mmu(&tlb);
}
pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
if (pte_dirty(pteval))
@@ -2398,17 +2403,20 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
* fail if unsuccessful.
*/
if (!anon) {
+ struct mmu_gather tlb;
+
VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
if (!hugetlb_vma_trylock_write(vma)) {
page_vma_mapped_walk_done(&pvmw);
ret = false;
break;
}
- if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
- hugetlb_vma_unlock_write(vma);
- flush_tlb_range(vma,
- range.start, range.end);
+ tlb_gather_mmu_vma(&tlb, vma);
+ if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
+ hugetlb_vma_unlock_write(vma);
+ huge_pmd_unshare_flush(&tlb, vma);
+ tlb_finish_mmu(&tlb);
/*
* The PMD table was unmapped,
* consequently unmapping the folio.
@@ -2417,6 +2425,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
break;
}
hugetlb_vma_unlock_write(vma);
+ tlb_finish_mmu(&tlb);
}
/* Nuke the hugetlb page table entry */
pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
--
2.52.0
__io_openat_prep() allocates a struct filename using getname(), but
it isn't freed in case the present file is installed in the fixed file
table and simultaneously, it has the flag O_CLOEXEC set in the
open->how.flags field.
This is an erroneous condition, since for a file installed in the fixed
file table, it won't be installed in the normal file table, due to which
the file cannot support close on exec. Earlier, the code just returned
-EINVAL error code for this condition, however, the memory allocated for
that struct filename wasn't freed, resulting in a memory leak.
Hence, the case of file being installed in the fixed file table as well
as having O_CLOEXEC flag in open->how.flags set, is adressed by using
putname() to release the memory allocated to the struct filename, then
setting the field open->filename to NULL, and after that, returning
-EINVAL.
Reported-by: syzbot+00e61c43eb5e4740438f(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=00e61c43eb5e4740438f
Tested-by: syzbot+00e61c43eb5e4740438f(a)syzkaller.appspotmail.com
Cc: stable(a)vger.kernel.org
Signed-off-by: Prithvi Tambewagh <activprithvi(a)gmail.com>
---
io_uring/openclose.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/io_uring/openclose.c b/io_uring/openclose.c
index bfeb91b31bba..fc190a3d8112 100644
--- a/io_uring/openclose.c
+++ b/io_uring/openclose.c
@@ -75,8 +75,11 @@ static int __io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
}
open->file_slot = READ_ONCE(sqe->file_index);
- if (open->file_slot && (open->how.flags & O_CLOEXEC))
+ if (open->file_slot && (open->how.flags & O_CLOEXEC)) {
+ putname(open->filename);
+ open->filename = NULL;
return -EINVAL;
+ }
open->nofile = rlimit(RLIMIT_NOFILE);
req->flags |= REQ_F_NEED_CLEANUP;
base-commit: b927546677c876e26eba308550207c2ddf812a43
--
2.34.1
Hi,
syzbot reported a warning and crash when mounting a corrupted HFS+ image where
the on-disk B-tree bitmap has node 0 (header node) marked free. In that case
hfs_bmap_alloc() can try to allocate node 0 and reach hfs_bnode_create() with
an already-hashed node number.
Patch 1 prevents allocating the reserved header node (node 0) even if the bitmap
is corrupted.
Patch 2 follows Slava's review suggestion and changes the "already hashed" path
in hfs_bnode_create() to return ERR_PTR(-EEXIST) instead of returning the existing
node pointer, so we don't continue in a non-"business as usual" situation.
v2 changes:
- Implement Slava's suggestion: return ERR_PTR(-EEXIST) for already-hashed nodes.
- Keep the node-0 allocation guard as a minimal, targeted hardening measure.
Reported-by: syzbot+1c8ff72d0cd8a50dfeaa(a)syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=1c8ff72d0cd8a50dfeaa
Link: https://lore.kernel.org/all/20251213233215.368558-1-shardul.b@mpiricsoftwar…
Shardul Bankar (2):
hfsplus: skip node 0 in hfs_bmap_alloc
hfsplus: return error when node already exists in hfs_bnode_create
fs/hfsplus/bnode.c | 2 +-
fs/hfsplus/btree.c | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)
--
2.34.1
Some DAMON sysfs directory setup functions generates its sub and sub-sub
directories. For example, 'monitoring_attrs/' directory setup creates
'intervals/' and 'intervals/intervals_goal/' directories under
'monitoring_attrs/' directory. When such sub-sub directories are
successfully made but followup setup is failed, the setup function
should recursively clean up the subdirectories.
However, such setup functions are only dereferencing sub directory
reference counters. As a result, under certain setup failures, the
sub-sub directories keep having non-zero reference counters. It means
the directories cannot be removed like zombies, and the memory for the
directories cannot be freed.
The user impact of this issue is limited due to the following reasons.
When the issue happens, the zombie directories are still taking the
path. Hence attempts to generate the directories again will fail,
without additional memory leak. This means the upper bound memory leak
is limited. Nonetheless this also implies controlling DAMON with a
feature that requires the setup-failed sysfs files will be impossible
until the system reboots.
Also, the setup operations are quite simple. The certain failures would
hence only rarely happen, and are difficult to artificially trigger.
SeongJae Park (4):
mm/damon/sysfs: cleanup intervals subdirs on attrs dir setup failure
mm/damon/sysfs: cleanup attrs subdirs on context dir setup failure
mm/damon/sysfs-scheme: cleanup quotas subdirs on scheme dir setup
failure
mm/damon/sysfs-scheme: cleanup access_pattern subdirs on scheme dir
setup failure
mm/damon/sysfs-schemes.c | 10 ++++++----
mm/damon/sysfs.c | 9 ++++++---
2 files changed, 12 insertions(+), 7 deletions(-)
base-commit: 6d039da6a260dd7919bebc70ebb65d250bb9c24e
--
2.47.3
Hi Greg, thanks for looking into this..
The full commit hash is 807221d3c5ff6e3c91ff57bc82a0b7a541462e20
Note: apologies if you received this multiple times, the previous one
got bounced due to html
Cheers,
JP
________________________________
From: Greg KH <gregkh(a)linuxfoundation.org>
Sent: Tuesday, December 23, 2025 6:39:22 PM
To: JP Dehollain <jpdehollain(a)gmail.com>
Cc: stable(a)vger.kernel.org <stable(a)vger.kernel.org>
Subject: Re: Request to add mainline merged patch to stable kernels
On Tue, Dec 23, 2025 at 04:05:24PM +1100, JP Dehollain wrote:
> Hello,
> I recently used the patch misc: rtsx_pci: Add separate CD/WP pin
> polarity reversal support with commit ID 807221d, to fix a bug causing
> the cardreader driver to always load sd cards in read-only mode.
> On the suggestion of the driver maintainer, I am requesting that this
> patch be applied to all stable kernel versions, as it is currently
> only applied to >=6.18.
> Thanks,
> JP
>
What is the git id of the commit you are looking to have backported?
thanks,
greg k-h
When compiling the device tree for the Integrator/AP with IM-PD1, the
following warning is observed regarding the display controller node:
arch/arm/boot/dts/arm/integratorap-im-pd1.dts:251.3-14: Warning
(dma_ranges_format):
/bus@c0000000/bus@c0000000/display@1000000:dma-ranges: empty
"dma-ranges" property but its #address-cells (2) differs from
/bus@c0000000/bus@c0000000 (1)
The display node specifies an empty "dma-ranges" property, intended to
describe a 1:1 identity mapping. However, the node lacks explicit
"#address-cells" and "#size-cells" properties. In this case, the device
tree compiler defaults the address cells to 2 (64-bit), which conflicts
with the parent bus configuration (32-bit, 1 cell).
Fix this by explicitly defining "#address-cells" and "#size-cells" as
1. This matches the 32-bit architecture of the Integrator platform and
ensures the address translation range is correctly parsed by the
compiler.
Fixes: 7bea67a99430 ("ARM: dts: integrator: Fix DMA ranges")
Cc: stable(a)vger.kernel.org
Signed-off-by: Kuan-Wei Chiu <visitorckw(a)gmail.com>
---
arch/arm/boot/dts/arm/integratorap-im-pd1.dts | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm/boot/dts/arm/integratorap-im-pd1.dts b/arch/arm/boot/dts/arm/integratorap-im-pd1.dts
index db13e09f2fab..6d90d9a42dc9 100644
--- a/arch/arm/boot/dts/arm/integratorap-im-pd1.dts
+++ b/arch/arm/boot/dts/arm/integratorap-im-pd1.dts
@@ -248,6 +248,8 @@ display@1000000 {
/* 640x480 16bpp @ 25.175MHz is 36827428 bytes/s */
max-memory-bandwidth = <40000000>;
memory-region = <&impd1_ram>;
+ #address-cells = <1>;
+ #size-cells = <1>;
dma-ranges;
port@0 {
--
2.52.0.177.g9f829587af-goog
Hi Andrew,
Here are 2 fixes for missing mm_cid fields for init_mm and efi_mm static
initialization. The renaming of cpu_bitmap to flexible_array (patch 2)
is needed for patch 3.
Those are relevant for mainline, with CC stable. They are based on
v6.19-rc2.
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Cc: linux-mm(a)kvack.org
Mathieu Desnoyers (3):
mm: Add missing static initializer for init_mm::mm_cid.lock
mm: Rename cpu_bitmap field to flexible_array
mm: Take into account mm_cid size for mm_struct static definitions
drivers/firmware/efi/efi.c | 2 +-
include/linux/mm_types.h | 18 +++++++++++++-----
mm/init-mm.c | 5 ++++-
3 files changed, 18 insertions(+), 7 deletions(-)
--
2.39.5
Hi,
I’m reporting a performance regression of up to 6% sequential I/O
vdbench regression observed on 6.12.y kernel.
While running performance benchmarks on v6.12.60 kernel the sequential
I/O vdbench metrics are showing a 5-6% performance regression when
compared to v6.12.48
Bisect root cause commit
========================
- commit b39b62075ab4 ("cpuidle: menu: Remove iowait influence")
Things work fine again when the previously removed
performance-multiplier code is added back.
Test details
============
The system is connected to a number of disks in disk array using
multipathing and directio configuration in the vdbench profile.
wd=wd1,sd=sd*,rdpct=0,seekpct=sequential,xfersize=128k
rd=128k64T,wd=wd1,iorate=max,elapsed=600,interval=1,warmup=300,threads=64
Thanks,
Alok
A process issuing blocking writes to a virtio console may get stuck
indefinitely if another thread polls the device. Here is how to trigger
the bug:
- Thread A writes to the port until the virtqueue is full.
- Thread A calls wait_port_writable() and goes to sleep, waiting on
port->waitqueue.
- The host processes some of the write, marks buffers as used and raises
an interrupt.
- Before the interrupt is serviced, thread B executes port_fops_poll().
This calls reclaim_consumed_buffers() via will_write_block() and
consumes all used buffers.
- The interrupt is serviced. vring_interrupt() finds no used buffers
via more_used() and returns without waking port->waitqueue.
- Thread A is still in wait_event(port->waitqueue), waiting for a
wakeup that never arrives.
The crux is that invoking reclaim_consumed_buffers() may cause
vring_interrupt() to omit wakeups.
Fix this by calling reclaim_consumed_buffers() in out_int() before
waking. This is similar to the call to discard_port_data() in
in_intr() which also frees buffer from a non-sleepable context.
This in turn guarantees that port->outvq_full is up to date when
handling polling. Since in_intr() already populates port->inbuf we
use that to avoid changing reader state.
Cc: stable(a)vger.kernel.org
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
As far as I can tell all currently maintained stable series kernels need
this commit. Applies and builds cleanly on 5.10.247, verified to fix
the issue.
---
Changes in v3:
- Use spin_lock_irq in port_fops_poll (Arnd)
- Use spin_lock in out_intr (Arnd)
- Link to v2: https://lore.kernel.org/r/20251222-virtio-console-lost-wakeup-v2-1-5de93cb3…
Changes in v2:
- Call reclaim_consumed_buffers() in out_intr instead of
issuing another wake.
- Link to v1: https://lore.kernel.org/r/20251215-virtio-console-lost-wakeup-v1-1-79a5c578…
---
drivers/char/virtio_console.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index 088182e54debd6029ea2c2a5542d7a28500e67b8..e6048e04c3b23d008caa2a1d31d4ac6b2841045f 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -971,10 +971,17 @@ static __poll_t port_fops_poll(struct file *filp, poll_table *wait)
return EPOLLHUP;
}
ret = 0;
- if (!will_read_block(port))
+
+ spin_lock_irq(&port->inbuf_lock);
+ if (port->inbuf)
ret |= EPOLLIN | EPOLLRDNORM;
- if (!will_write_block(port))
+ spin_unlock_irq(&port->inbuf_lock);
+
+ spin_lock_irq(&port->outvq_lock);
+ if (!port->outvq_full)
ret |= EPOLLOUT;
+ spin_unlock_irq(&port->outvq_lock);
+
if (!port->host_connected)
ret |= EPOLLHUP;
@@ -1705,6 +1712,10 @@ static void out_intr(struct virtqueue *vq)
return;
}
+ spin_lock(&port->outvq_lock);
+ reclaim_consumed_buffers(port);
+ spin_unlock(&port->outvq_lock);
+
wake_up_interruptible(&port->waitqueue);
}
---
base-commit: d358e5254674b70f34c847715ca509e46eb81e6f
change-id: 20251215-virtio-console-lost-wakeup-0f566c5cd35f
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
According to documentation, the DP PHY on x1e80100 has another clock
called ref.
The current X Elite devices supported upstream work fine without this
clock, because the boot firmware leaves this clock enabled. But we should
not rely on that. Also, when it comes to power management, this clock
needs to be also disabled on suspend. So even though this change breaks
the ABI, it is needed in order to make we disable this clock on runtime
PM, when that is going to be enabled in the driver.
So rework the driver to allow different number of clocks, fix the
dt-bindings schema and add the clock to the DT node as well.
Signed-off-by: Abel Vesa <abel.vesa(a)linaro.org>
---
Changes in v5:
- Picked-up Bjorn's R-b tags.
- Replaced "parse" with "get" on clocks acquiring failure.
- Link to v4: https://lore.kernel.org/r/20251029-phy-qcom-edp-add-missing-refclk-v4-0-adb…
Changes in v4:
- Picked Dmitry's R-b tag for the driver patch
- Added x1e80100 substring to subject of dts patch
- Link to v3 (resend): https://lore.kernel.org/r/20251014-phy-qcom-edp-add-missing-refclk-v3-0-078…
Changes in v3 (resend)
- picked-up Krzysztof's R-b tag for bindings patch
- Link to v3: https://lore.kernel.org/r/20250909-phy-qcom-edp-add-missing-refclk-v3-0-4ec…
Changes in v3:
- Use dev_err_probe() on clocks parsing failure.
- Explain why the ABI break is necessary.
- Drop the extra 'clk' suffix from the clock name. So ref instead of
refclk.
- Link to v2: https://lore.kernel.org/r/20250903-phy-qcom-edp-add-missing-refclk-v2-0-d88…
Changes in v2:
- Fix schema by adding the minItems, as suggested by Krzysztof.
- Use devm_clk_bulk_get_all, as suggested by Konrad.
- Rephrase the commit messages to reflect the flexible number of clocks.
- Link to v1: https://lore.kernel.org/r/20250730-phy-qcom-edp-add-missing-refclk-v1-0-6f7…
---
Abel Vesa (3):
dt-bindings: phy: qcom-edp: Add missing clock for X Elite
phy: qcom: edp: Make the number of clocks flexible
arm64: dts: qcom: x1e80100: Add missing TCSR ref clock to the DP PHYs
.../devicetree/bindings/phy/qcom,edp-phy.yaml | 28 +++++++++++++++++++++-
arch/arm64/boot/dts/qcom/hamoa.dtsi | 12 ++++++----
drivers/phy/qualcomm/phy-qcom-edp.c | 16 ++++++-------
3 files changed, 43 insertions(+), 13 deletions(-)
---
base-commit: 131f3d9446a6075192cdd91f197989d98302faa6
change-id: 20250730-phy-qcom-edp-add-missing-refclk-5ab82828f8e7
Best regards,
--
Abel Vesa <abel.vesa(a)oss.qualcomm.com>
It is checked almost always in dpu_encoder_phys_wb_setup_ctl(), but in a
single place the check is missing.
Also use convenient locals instead of phys_enc->* where available.
Cc: stable(a)vger.kernel.org
Fixes: d7d0e73f7de33 ("drm/msm/dpu: introduce the dpu_encoder_phys_* for writeback")
Signed-off-by: Nikolay Kuratov <kniv(a)yandex-team.ru>
---
drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_wb.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_wb.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_wb.c
index 46f348972a97..6d28f2281c76 100644
--- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_wb.c
+++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_wb.c
@@ -247,14 +247,12 @@ static void dpu_encoder_phys_wb_setup_ctl(struct dpu_encoder_phys *phys_enc)
if (hw_cdm)
intf_cfg.cdm = hw_cdm->idx;
- if (phys_enc->hw_pp->merge_3d && phys_enc->hw_pp->merge_3d->ops.setup_3d_mode)
- phys_enc->hw_pp->merge_3d->ops.setup_3d_mode(phys_enc->hw_pp->merge_3d,
- mode_3d);
+ if (hw_pp && hw_pp->merge_3d && hw_pp->merge_3d->ops.setup_3d_mode)
+ hw_pp->merge_3d->ops.setup_3d_mode(hw_pp->merge_3d, mode_3d);
/* setup which pp blk will connect to this wb */
- if (hw_pp && phys_enc->hw_wb->ops.bind_pingpong_blk)
- phys_enc->hw_wb->ops.bind_pingpong_blk(phys_enc->hw_wb,
- phys_enc->hw_pp->idx);
+ if (hw_pp && hw_wb->ops.bind_pingpong_blk)
+ hw_wb->ops.bind_pingpong_blk(hw_wb, hw_pp->idx);
phys_enc->hw_ctl->ops.setup_intf_cfg(phys_enc->hw_ctl, &intf_cfg);
} else if (phys_enc->hw_ctl && phys_enc->hw_ctl->ops.setup_intf_cfg) {
--
2.34.1
When starting multi-core loongarch virtualization on loongarch physical
machine, loading livepatch on the physical machine will cause an error
similar to the following:
[ 411.686289] livepatch: klp_try_switch_task: CPU 31/KVM:3116 has an
unreliable stack
The specific test steps are as follows:
1.Start a multi-core virtual machine on a physical machine
2.Enter the following command on the physical machine to turn on the debug
switch:
echo "file kernel/livepatch/transition.c +p" > /sys/kernel/debug/\
dynamic_debug/control
3.Load livepatch:
modprobe livepatch-sample
Through the above steps, similar prints can be viewed in dmesg.
The reason for this issue is that the code of the kvm_exc_entry function
was copied in the function kvm_loongarch_env_init. When the cpu needs to
execute kvm_exc_entry, it will switch to the copied address for execution.
The new address of the kvm_exc_entry function cannot be recognized in ORC,
which eventually leads to the arch_stack_walk_reliable function returning
an error and printing an exception message.
To solve the above problems, we directly compile the switch.S file into
the kernel instead of the module. In this way, the function kvm_exc_entry
will no longer need to be copied.
changlog:
V2<-V1:
1.Rollback the modification of function parameter types such as
kvm_save_fpu. In the asm-prototypes.h header file, only the parameter types
it depends on are included
Cc: Huacai Chen <chenhuacai(a)kernel.org>
Cc: WANG Xuerui <kernel(a)xen0n.name>
Cc: Tianrui Zhao <zhaotianrui(a)loongson.cn>
Cc: Bibo Mao <maobibo(a)loongson.cn>
Cc: Charlie Jenkins <charlie(a)rivosinc.com>
Cc: Xianglai Li <lixianglai(a)loongson.cn>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: stable(a)vger.kernel.org
Xianglai Li (2):
LoongArch: KVM: Compile the switch.S file directly into the kernel
LoongArch: KVM: fix "unreliable stack" issue
arch/loongarch/Kbuild | 2 +-
arch/loongarch/include/asm/asm-prototypes.h | 21 +++++++++++++
arch/loongarch/include/asm/kvm_host.h | 3 --
arch/loongarch/kvm/Makefile | 2 +-
arch/loongarch/kvm/main.c | 35 ++-------------------
arch/loongarch/kvm/switch.S | 24 +++++++++++---
6 files changed, 45 insertions(+), 42 deletions(-)
base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
--
2.39.1
In vmw_compat_shader_add(), the return value check of vmw_shader_alloc()
is not proper. Modify the check for the return pointer 'res'.
Found by code review and compiled on ubuntu 20.04.
Fixes: 18e4a4669c50 ("drm/vmwgfx: Fix compat shader namespace")
Cc: stable(a)vger.kernel.org
Signed-off-by: Haoxiang Li <lihaoxiang(a)isrc.iscas.ac.cn>
---
drivers/gpu/drm/vmwgfx/vmwgfx_shader.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_shader.c b/drivers/gpu/drm/vmwgfx/vmwgfx_shader.c
index 69dfe69ce0f8..7ed938710342 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_shader.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_shader.c
@@ -923,8 +923,10 @@ int vmw_compat_shader_add(struct vmw_private *dev_priv,
ttm_bo_unreserve(&buf->tbo);
res = vmw_shader_alloc(dev_priv, buf, size, 0, shader_type);
- if (unlikely(ret != 0))
+ if (IS_ERR(res)) {
+ ret = PTR_ERR(res);
goto no_reserve;
+ }
ret = vmw_cmdbuf_res_add(man, vmw_cmdbuf_res_shader,
vmw_shader_key(user_key, shader_type),
--
2.25.1
From: Zack Rusin <zack.rusin(a)broadcom.com>
[ Upstream commit 5ac2c0279053a2c5265d46903432fb26ae2d0da2 ]
Check that the resource which is converted to a surface exists before
trying to use the cursor snooper on it.
vmw_cmd_res_check allows explicit invalid (SVGA3D_INVALID_ID) identifiers
because some svga commands accept SVGA3D_INVALID_ID to mean "no surface",
unfortunately functions that accept the actual surfaces as objects might
(and in case of the cursor snooper, do not) be able to handle null
objects. Make sure that we validate not only the identifier (via the
vmw_cmd_res_check) but also check that the actual resource exists before
trying to do something with it.
Fixes unchecked null-ptr reference in the snooping code.
Signed-off-by: Zack Rusin <zack.rusin(a)broadcom.com>
Fixes: c0951b797e7d ("drm/vmwgfx: Refactor resource management")
Reported-by: Kuzey Arda Bulut <kuzeyardabulut(a)gmail.com>
Cc: Broadcom internal kernel review list <bcm-kernel-feedback-list(a)broadcom.com>
Cc: dri-devel(a)lists.freedesktop.org
Reviewed-by: Ian Forbes <ian.forbes(a)broadcom.com>
Link: https://lore.kernel.org/r/20250917153655.1968583-1-zack.rusin@broadcom.com
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
[Shivani: Modified to apply on v5.10.y-v6.1.y]
Signed-off-by: Shivani Agarwal <shivani.agarwal(a)broadcom.com>
---
drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c b/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
index 0d12d6af6..5d3827b5d 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c
@@ -1507,6 +1507,7 @@ static int vmw_cmd_dma(struct vmw_private *dev_priv,
SVGA3dCmdHeader *header)
{
struct vmw_buffer_object *vmw_bo = NULL;
+ struct vmw_resource *res;
struct vmw_surface *srf = NULL;
VMW_DECLARE_CMD_VAR(*cmd, SVGA3dCmdSurfaceDMA);
int ret;
@@ -1542,18 +1543,24 @@ static int vmw_cmd_dma(struct vmw_private *dev_priv,
dirty = (cmd->body.transfer == SVGA3D_WRITE_HOST_VRAM) ?
VMW_RES_DIRTY_SET : 0;
- ret = vmw_cmd_res_check(dev_priv, sw_context, vmw_res_surface,
- dirty, user_surface_converter,
- &cmd->body.host.sid, NULL);
+ ret = vmw_cmd_res_check(dev_priv, sw_context, vmw_res_surface, dirty,
+ user_surface_converter, &cmd->body.host.sid,
+ NULL);
if (unlikely(ret != 0)) {
if (unlikely(ret != -ERESTARTSYS))
VMW_DEBUG_USER("could not find surface for DMA.\n");
return ret;
}
- srf = vmw_res_to_srf(sw_context->res_cache[vmw_res_surface].res);
+ res = sw_context->res_cache[vmw_res_surface].res;
+ if (!res) {
+ VMW_DEBUG_USER("Invalid DMA surface.\n");
+ return -EINVAL;
+ }
- vmw_kms_cursor_snoop(srf, sw_context->fp->tfile, &vmw_bo->base, header);
+ srf = vmw_res_to_srf(res);
+ vmw_kms_cursor_snoop(srf, sw_context->fp->tfile, &vmw_bo->base,
+ header);
return 0;
}
--
2.40.4
The functions ocfs2_reserve_suballoc_bits(), ocfs2_block_group_alloc(),
ocfs2_block_group_alloc_contig() and ocfs2_find_smallest_chain() trust
the on-disk values related to the allocation chain. However, KASAN bug
was triggered in these functions, and the kernel panicked when accessing
redzoned memory. This occurred due to the corrupted value of `cl_count`
field of `struct ocfs2_chain_list`. Upon analysis, the value of `cl_count`
was observed to be overwhemingly large, due to which the code accessed
redzoned memory.
The fix introduces an if statement which validates value of `cl_count`
(both lower and upper bounds). Lower bound check ensures the value of
`cl_count` is not zero and upper bound check ensures that the value of
`cl_count` is in the range such that it has a value less than the total
size of struct ocfs2_chain_list and maximum number of chains that can be
present, so as to fill one block.
Reported-by: syzbot+af14efe17dfa46173239(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=af14efe17dfa46173239
Tested-by: syzbot+af14efe17dfa46173239(a)syzkaller.appspotmail.com
Cc: stable(a)vger.kernel.org
Signed-off-by: Prithvi Tambewagh <activprithvi(a)gmail.com>
---
fs/ocfs2/suballoc.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index f7b483f0de2a..7ea63e9cc4f8 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -671,6 +671,21 @@ static int ocfs2_block_group_alloc(struct ocfs2_super *osb,
BUG_ON(ocfs2_is_cluster_bitmap(alloc_inode));
cl = &fe->id2.i_chain;
+ unsigned int block_size = osb->sb->s_blocksize;
+ unsigned int max_cl_count =
+ (block_size - offsetof(struct ocfs2_chain_list, cl_recs)) /
+ sizeof(struct ocfs2_chain_rec);
+
+ if (!le16_to_cpu(cl->cl_count) ||
+ le16_to_cpu(cl->cl_count) > max_cl_count) {
+ ocfs2_error(osb->sb,
+ "Invalid chain list: cl_count %u "
+ "exceeds max %u",
+ le16_to_cpu(cl->cl_count), max_cl_count);
+ status = -EIO;
+ goto bail;
+ }
+
status = ocfs2_reserve_clusters_with_limit(osb,
le16_to_cpu(cl->cl_cpg),
max_block, flags, &ac);
base-commit: 36c254515dc6592c44db77b84908358979dd6b50
--
2.34.1