A potential NULL pointer dereference may occur when accessing
tmp_mqd->cp_hqd_pq_control without verifying that tmp_mqd is non-NULL.
This may happen if mqd_backup[mqd_idx] is unexpectedly NULL.
Although a NULL check for mqd_backup[mqd_idx] existed previously, it was
moved to a position after the dereference in a recent commit, which
renders it ineffective.
Add an explicit NULL check for tmp_mqd before dereferencing its members.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Cc: stable(a)vger.kernel.org # v5.13+
Fixes: a330b52a9e59 ("drm/amdgpu: Init the cp MQD if it's not be initialized before")
Signed-off-by: Alexey Nepomnyashih <sdl(a)nppct.ru>
---
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index d7db4cb907ae..134cab16a00d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3817,10 +3817,9 @@ static int gfx_v9_0_kiq_init_queue(struct amdgpu_ring *ring)
* check mqd->cp_hqd_pq_control since this value should not be 0
*/
tmp_mqd = (struct v9_mqd *)adev->gfx.kiq[0].mqd_backup;
- if (amdgpu_in_reset(adev) && tmp_mqd->cp_hqd_pq_control){
+ if (amdgpu_in_reset(adev) && tmp_mqd && tmp_mqd->cp_hqd_pq_control) {
/* for GPU_RESET case , reset MQD to a clean status */
- if (adev->gfx.kiq[0].mqd_backup)
- memcpy(mqd, adev->gfx.kiq[0].mqd_backup, sizeof(struct v9_mqd_allocation));
+ memcpy(mqd, adev->gfx.kiq[0].mqd_backup, sizeof(struct v9_mqd_allocation));
/* reset ring buffer */
ring->wptr = 0;
@@ -3863,7 +3862,7 @@ static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring *ring, bool restore)
*/
tmp_mqd = (struct v9_mqd *)adev->gfx.mec.mqd_backup[mqd_idx];
- if (!restore && (!tmp_mqd->cp_hqd_pq_control ||
+ if (!restore && tmp_mqd && (!tmp_mqd->cp_hqd_pq_control ||
(!amdgpu_in_reset(adev) && !adev->in_suspend))) {
memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
@@ -3874,8 +3873,7 @@ static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring *ring, bool restore)
soc15_grbm_select(adev, 0, 0, 0, 0, 0);
mutex_unlock(&adev->srbm_mutex);
- if (adev->gfx.mec.mqd_backup[mqd_idx])
- memcpy(adev->gfx.mec.mqd_backup[mqd_idx], mqd, sizeof(struct v9_mqd_allocation));
+ memcpy(adev->gfx.mec.mqd_backup[mqd_idx], mqd, sizeof(struct v9_mqd_allocation));
} else {
/* restore MQD to a clean status */
if (adev->gfx.mec.mqd_backup[mqd_idx])
--
2.43.0
The RK3588 GPU power domain cannot be activated unless the external
power regulator is already on. When GPU support was added to this DT,
we had no way to represent this requirement, so `regulator-always-on`
was added to the `vdd_gpu_s0` regulator in order to ensure stability.
A later patch series (see "Fixes:" commit) resolved this shortcoming,
but that commit left the workaround -- and rendered the comment above
it no longer correct.
Remove the workaround to allow the GPU power regulator to power off, now
that the DT includes the necessary information to power it back on
correctly.
Fixes: f94500eb7328b ("arm64: dts: rockchip: Add GPU power domain regulator dependency for RK3588")
Signed-off-by: Sam Edwards <CFSworks(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
---
arch/arm64/boot/dts/rockchip/rk3588-turing-rk1.dtsi | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/arch/arm64/boot/dts/rockchip/rk3588-turing-rk1.dtsi b/arch/arm64/boot/dts/rockchip/rk3588-turing-rk1.dtsi
index 60ad272982ad..6daea8961fdd 100644
--- a/arch/arm64/boot/dts/rockchip/rk3588-turing-rk1.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk3588-turing-rk1.dtsi
@@ -398,17 +398,6 @@ rk806_dvs3_null: dvs3-null-pins {
regulators {
vdd_gpu_s0: vdd_gpu_mem_s0: dcdc-reg1 {
- /*
- * RK3588's GPU power domain cannot be enabled
- * without this regulator active, but it
- * doesn't have to be on when the GPU PD is
- * disabled. Because the PD binding does not
- * currently allow us to express this
- * relationship, we have no choice but to do
- * this instead:
- */
- regulator-always-on;
-
regulator-boot-on;
regulator-min-microvolt = <550000>;
regulator-max-microvolt = <950000>;
--
2.48.1
On Fri, May 23, 2025 at 7:17 AM Sasha Levin <sashal(a)kernel.org> wrote:
>
> This is a note to let you know that I've just added the patch titled
>
> bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
>
> to the 6.1-stable tree which can be found at:
> http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
>
> The filename of the patch is:
> bpf-prevent-unsafe-access-to-the-sock-fields-in-the-.patch
> and it can be found in the queue-6.1 subdirectory.
>
> If you, or anyone else, feels it should not be added to the stable tree,
> please let <stable(a)vger.kernel.org> know about it.
Hi,
I'm notified that this patch has been added into many branches, which
is against my expectations. The BPF timestaping feature was
implemented in 6.14 and the patch you are handling is just one of them.
The function of this patch prevents unexpected bpf programs using this
feature from triggering
fatal problems. So, IMHO, we don't need this patch in all the
older/stable branches:)
Thanks,
Jason
>
>
>
> commit 00b709040e0fdf5949dfbf02f38521e0b10943ac
> Author: Jason Xing <kerneljasonxing(a)gmail.com>
> Date: Thu Feb 20 15:29:31 2025 +0800
>
> bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
>
> [ Upstream commit fd93eaffb3f977b23bc0a48d4c8616e654fcf133 ]
>
> The subsequent patch will implement BPF TX timestamping. It will
> call the sockops BPF program without holding the sock lock.
>
> This breaks the current assumption that all sock ops programs will
> hold the sock lock. The sock's fields of the uapi's bpf_sock_ops
> requires this assumption.
>
> To address this, a new "u8 is_locked_tcp_sock;" field is added. This
> patch sets it in the current sock_ops callbacks. The "is_fullsock"
> test is then replaced by the "is_locked_tcp_sock" test during
> sock_ops_convert_ctx_access().
>
> The new TX timestamping callbacks added in the subsequent patch will
> not have this set. This will prevent unsafe access from the new
> timestamping callbacks.
>
> Potentially, we could allow read-only access. However, this would
> require identifying which callback is read-safe-only and also requires
> additional BPF instruction rewrites in the covert_ctx. Since the BPF
> program can always read everything from a socket (e.g., by using
> bpf_core_cast), this patch keeps it simple and disables all read
> and write access to any socket fields through the bpf_sock_ops
> UAPI from the new TX timestamping callback.
>
> Moreover, note that some of the fields in bpf_sock_ops are specific
> to tcp_sock, and sock_ops currently only supports tcp_sock. In
> the future, UDP timestamping will be added, which will also break
> this assumption. The same idea used in this patch will be reused.
> Considering that the current sock_ops only supports tcp_sock, the
> variable is named is_locked_"tcp"_sock.
>
> Signed-off-by: Jason Xing <kerneljasonxing(a)gmail.com>
> Signed-off-by: Martin KaFai Lau <martin.lau(a)kernel.org>
> Link: https://patch.msgid.link/20250220072940.99994-4-kerneljasonxing@gmail.com
> Signed-off-by: Sasha Levin <sashal(a)kernel.org>
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index f3ef1a8965bb2..09cc8fb735f02 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1319,6 +1319,7 @@ struct bpf_sock_ops_kern {
> void *skb_data_end;
> u8 op;
> u8 is_fullsock;
> + u8 is_locked_tcp_sock;
> u8 remaining_opt_len;
> u64 temp; /* temp and everything after is not
> * initialized to 0 before calling
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 83e0362e3b721..63caa3181dfe6 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2409,6 +2409,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
> memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> if (sk_fullsock(sk)) {
> sock_ops.is_fullsock = 1;
> + sock_ops.is_locked_tcp_sock = 1;
> sock_owned_by_me(sk);
> }
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 497b41ac399da..5c9f3fcb957bb 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -10240,10 +10240,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
> } \
> *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
> struct bpf_sock_ops_kern, \
> - is_fullsock), \
> + is_locked_tcp_sock), \
> fullsock_reg, si->src_reg, \
> offsetof(struct bpf_sock_ops_kern, \
> - is_fullsock)); \
> + is_locked_tcp_sock)); \
> *insn++ = BPF_JMP_IMM(BPF_JEQ, fullsock_reg, 0, jmp); \
> if (si->dst_reg == si->src_reg) \
> *insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg, \
> @@ -10328,10 +10328,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
> temp)); \
> *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
> struct bpf_sock_ops_kern, \
> - is_fullsock), \
> + is_locked_tcp_sock), \
> reg, si->dst_reg, \
> offsetof(struct bpf_sock_ops_kern, \
> - is_fullsock)); \
> + is_locked_tcp_sock)); \
> *insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2); \
> *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
> struct bpf_sock_ops_kern, sk),\
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index db1a99df29d55..16f4a41a068e4 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -168,6 +168,7 @@ static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb)
> memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB;
> sock_ops.is_fullsock = 1;
> + sock_ops.is_locked_tcp_sock = 1;
> sock_ops.sk = sk;
> bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb));
>
> @@ -184,6 +185,7 @@ static void bpf_skops_established(struct sock *sk, int bpf_op,
> memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
> sock_ops.op = bpf_op;
> sock_ops.is_fullsock = 1;
> + sock_ops.is_locked_tcp_sock = 1;
> sock_ops.sk = sk;
> /* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */
> if (skb)
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 40568365cdb3b..2f109f1968253 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -509,6 +509,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
> sock_owned_by_me(sk);
>
> sock_ops.is_fullsock = 1;
> + sock_ops.is_locked_tcp_sock = 1;
> sock_ops.sk = sk;
> }
>
> @@ -554,6 +555,7 @@ static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb,
> sock_owned_by_me(sk);
>
> sock_ops.is_fullsock = 1;
> + sock_ops.is_locked_tcp_sock = 1;
> sock_ops.sk = sk;
> }
>
xhci_reset() currently returns -ENODEV if XHCI_STATE_REMOVING is
set, without completing the xhci handshake, unless the reset completes
exceptionally quickly. This behavior causes a regression on Synopsys
DWC3 USB controllers with dual-role capabilities.
Specifically, when a DWC3 controller exits host mode and removes xhci
while a reset is still in progress, and then attempts to configure its
hardware for device mode, the ongoing, incomplete reset leads to
critical register access issues. All register reads return zero, not
just within the xHCI register space (which might be expected during a
reset), but across the entire DWC3 IP block.
This patch addresses the issue by preventing xhci_reset() from being
called in xhci_resume() and bailing out early in the reinit flow when
XHCI_STATE_REMOVING is set.
Cc: stable(a)vger.kernel.org
Fixes: 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state() helper")
Suggested-by: Mathias Nyman <mathias.nyman(a)intel.com>
Signed-off-by: Roy Luo <royluo(a)google.com>
---
drivers/usb/host/xhci.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 90eb491267b5..244b12eafd95 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -1084,7 +1084,10 @@ int xhci_resume(struct xhci_hcd *xhci, bool power_lost, bool is_auto_resume)
xhci_dbg(xhci, "Stop HCD\n");
xhci_halt(xhci);
xhci_zero_64b_regs(xhci);
- retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
+ if (xhci->xhc_state & XHCI_STATE_REMOVING)
+ retval = -ENODEV;
+ else
+ retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
spin_unlock_irq(&xhci->lock);
if (retval)
return retval;
--
2.49.0.1204.g71687c7c1d-goog
The patch titled
Subject: mm: fix copy_vma() error handling for hugetlb mappings
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-fix-copy_vma-error-handling-for-hugetlb-mappings.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Ricardo Ca��uelo Navarro <rcn(a)igalia.com>
Subject: mm: fix copy_vma() error handling for hugetlb mappings
Date: Fri, 23 May 2025 14:19:10 +0200
If, during a mremap() operation for a hugetlb-backed memory mapping,
copy_vma() fails after the source vma has been duplicated and opened (ie.
vma_link() fails), the error is handled by closing the new vma. This
updates the hugetlbfs reservation counter of the reservation map which at
this point is referenced by both the source vma and the new copy. As a
result, once the new vma has been freed and copy_vma() returns, the
reservation counter for the source vma will be incorrect.
This patch addresses this corner case by clearing the hugetlb private page
reservation reference for the new vma and decrementing the reference
before closing the vma, so that vma_close() won't update the reservation
counter. This is also what copy_vma_and_data() does with the source vma
if copy_vma() succeeds, so a helper function has been added to do the
fixup in both functions.
The issue was reported by a private syzbot instance and can be reproduced
using the C reproducer in [1]. It's also a possible duplicate of public
syzbot report [2]. The WARNING report is:
============================================================
page_counter underflow: -1024 nr_pages=1024
WARNING: CPU: 0 PID: 3287 at mm/page_counter.c:61 page_counter_cancel+0xf6/0x120
Modules linked in:
CPU: 0 UID: 0 PID: 3287 Comm: repro__WARNING_ Not tainted 6.15.0-rc7+ #54 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
RIP: 0010:page_counter_cancel+0xf6/0x120
Code: ff 5b 41 5e 41 5f 5d c3 cc cc cc cc e8 f3 4f 8f ff c6 05 64 01 27 06 01 48 c7 c7 60 15 f8 85 48 89 de 4c 89 fa e8 2a a7 51 ff <0f> 0b e9 66 ff ff ff 44 89 f9 80 e1 07 38 c1 7c 9d 4c 81
RSP: 0018:ffffc900025df6a0 EFLAGS: 00010246
RAX: 2edfc409ebb44e00 RBX: fffffffffffffc00 RCX: ffff8880155f0000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: dffffc0000000000 R08: ffffffff81c4a23c R09: 1ffff1100330482a
R10: dffffc0000000000 R11: ffffed100330482b R12: 0000000000000000
R13: ffff888058a882c0 R14: ffff888058a882c0 R15: 0000000000000400
FS: 0000000000000000(0000) GS:ffff88808fc53000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004b33e0 CR3: 00000000076d6000 CR4: 00000000000006f0
Call Trace:
<TASK>
page_counter_uncharge+0x33/0x80
hugetlb_cgroup_uncharge_counter+0xcb/0x120
hugetlb_vm_op_close+0x579/0x960
? __pfx_hugetlb_vm_op_close+0x10/0x10
remove_vma+0x88/0x130
exit_mmap+0x71e/0xe00
? __pfx_exit_mmap+0x10/0x10
? __mutex_unlock_slowpath+0x22e/0x7f0
? __pfx_exit_aio+0x10/0x10
? __up_read+0x256/0x690
? uprobe_clear_state+0x274/0x290
? mm_update_next_owner+0xa9/0x810
__mmput+0xc9/0x370
exit_mm+0x203/0x2f0
? __pfx_exit_mm+0x10/0x10
? taskstats_exit+0x32b/0xa60
do_exit+0x921/0x2740
? do_raw_spin_lock+0x155/0x3b0
? __pfx_do_exit+0x10/0x10
? __pfx_do_raw_spin_lock+0x10/0x10
? _raw_spin_lock_irq+0xc5/0x100
do_group_exit+0x20c/0x2c0
get_signal+0x168c/0x1720
? __pfx_get_signal+0x10/0x10
? schedule+0x165/0x360
arch_do_signal_or_restart+0x8e/0x7d0
? __pfx_arch_do_signal_or_restart+0x10/0x10
? __pfx___se_sys_futex+0x10/0x10
syscall_exit_to_user_mode+0xb8/0x2c0
do_syscall_64+0x75/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x422dcd
Code: Unable to access opcode bytes at 0x422da3.
RSP: 002b:00007ff266cdb208 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: 0000000000000001 RBX: 00007ff266cdbcdc RCX: 0000000000422dcd
RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000004c7bec
RBP: 00007ff266cdb220 R08: 203a6362696c6720 R09: 203a6362696c6720
R10: 0000200000c00000 R11: 0000000000000246 R12: ffffffffffffffd0
R13: 0000000000000002 R14: 00007ffe1cb5f520 R15: 00007ff266cbb000
</TASK>
============================================================
Link: https://lkml.kernel.org/r/20250523-warning_in_page_counter_cancel-v2-1-b6df…
Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter… [1]
Link: https://lore.kernel.org/all/67000a50.050a0220.49194.048d.GAE@google.com/ [2]
Signed-off-by: Ricardo Ca��uelo Navarro <rcn(a)igalia.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Florent Revest <revest(a)google.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 16 +++++++++++++++-
mm/mremap.c | 3 +--
mm/vma.c | 1 +
4 files changed, 22 insertions(+), 3 deletions(-)
--- a/include/linux/hugetlb.h~mm-fix-copy_vma-error-handling-for-hugetlb-mappings
+++ a/include/linux/hugetlb.h
@@ -275,6 +275,7 @@ long hugetlb_change_protection(struct vm
bool is_hugetlb_entry_migration(pte_t pte);
bool is_hugetlb_entry_hwpoisoned(pte_t pte);
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
+void fixup_hugetlb_reservations(struct vm_area_struct *vma);
#else /* !CONFIG_HUGETLB_PAGE */
@@ -468,6 +469,10 @@ static inline vm_fault_t hugetlb_fault(s
static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }
+static inline void fixup_hugetlb_reservations(struct vm_area_struct *vma)
+{
+}
+
#endif /* !CONFIG_HUGETLB_PAGE */
#ifndef pgd_write
--- a/mm/hugetlb.c~mm-fix-copy_vma-error-handling-for-hugetlb-mappings
+++ a/mm/hugetlb.c
@@ -1250,7 +1250,7 @@ void hugetlb_dup_vma_private(struct vm_a
/*
* Reset and decrement one ref on hugepage private reservation.
* Called with mm->mmap_lock writer semaphore held.
- * This function should be only used by move_vma() and operate on
+ * This function should be only used by mremap and operate on
* same sized vma. It should never come here with last ref on the
* reservation.
*/
@@ -7939,3 +7939,17 @@ void hugetlb_unshare_all_pmds(struct vm_
hugetlb_unshare_pmds(vma, ALIGN(vma->vm_start, PUD_SIZE),
ALIGN_DOWN(vma->vm_end, PUD_SIZE));
}
+
+/*
+ * For hugetlb, mremap() is an odd edge case - while the VMA copying is
+ * performed, we permit both the old and new VMAs to reference the same
+ * reservation.
+ *
+ * We fix this up after the operation succeeds, or if a newly allocated VMA
+ * is closed as a result of a failure to allocate memory.
+ */
+void fixup_hugetlb_reservations(struct vm_area_struct *vma)
+{
+ if (is_vm_hugetlb_page(vma))
+ clear_vma_resv_huge_pages(vma);
+}
--- a/mm/mremap.c~mm-fix-copy_vma-error-handling-for-hugetlb-mappings
+++ a/mm/mremap.c
@@ -1188,8 +1188,7 @@ static int copy_vma_and_data(struct vma_
mremap_userfaultfd_prep(new_vma, vrm->uf);
}
- if (is_vm_hugetlb_page(vma))
- clear_vma_resv_huge_pages(vma);
+ fixup_hugetlb_reservations(vma);
/* Tell pfnmap has moved from this vma */
if (unlikely(vma->vm_flags & VM_PFNMAP))
--- a/mm/vma.c~mm-fix-copy_vma-error-handling-for-hugetlb-mappings
+++ a/mm/vma.c
@@ -1834,6 +1834,7 @@ struct vm_area_struct *copy_vma(struct v
return new_vma;
out_vma_link:
+ fixup_hugetlb_reservations(new_vma);
vma_close(new_vma);
if (new_vma->vm_file)
_
Patches currently in -mm which might be from rcn(a)igalia.com are
mm-fix-copy_vma-error-handling-for-hugetlb-mappings.patch
The patch titled
Subject: memcg: always call cond_resched() after fn()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
memcg-always-call-cond_resched-after-fn.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Breno Leitao <leitao(a)debian.org>
Subject: memcg: always call cond_resched() after fn()
Date: Fri, 23 May 2025 10:21:06 -0700
I am seeing soft lockup on certain machine types when a cgroup OOMs. This
is happening because killing the process in certain machine might be very
slow, which causes the soft lockup and RCU stalls. This happens usually
when the cgroup has MANY processes and memory.oom.group is set.
Example I am seeing in real production:
[462012.244552] Memory cgroup out of memory: Killed process 3370438 (crosvm) ....
....
[462037.318059] Memory cgroup out of memory: Killed process 4171372 (adb) ....
[462037.348314] watchdog: BUG: soft lockup - CPU#64 stuck for 26s! [stat_manager-ag:1618982]
....
Quick look at why this is so slow, it seems to be related to serial flush
for certain machine types. For all the crashes I saw, the target CPU was
at console_flush_all().
In the case above, there are thousands of processes in the cgroup, and it
is soft locking up before it reaches the 1024 limit in the code (which
would call the cond_resched()). So, cond_resched() in 1024 blocks is not
sufficient.
Remove the counter-based conditional rescheduling logic and call
cond_resched() unconditionally after each task iteration, after fn() is
called. This avoids the lockup independently of how slow fn() is.
Link: https://lkml.kernel.org/r/20250523-memcg_fix-v1-1-ad3eafb60477@debian.org
Fixes: ade81479c7dd ("memcg: fix soft lockup in the OOM process")
Signed-off-by: Breno Leitao <leitao(a)debian.org>
Suggested-by: Rik van Riel <riel(a)surriel.com>
Acked-by: Shakeel Butt <shakeel.butt(a)linux.dev>
Cc: Michael van der Westhuizen <rmikey(a)meta.com>
Cc: Usama Arif <usamaarif642(a)gmail.com>
Cc: Pavel Begunkov <asml.silence(a)gmail.com>
Cc: Chen Ridong <chenridong(a)huawei.com>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memcontrol.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
--- a/mm/memcontrol.c~memcg-always-call-cond_resched-after-fn
+++ a/mm/memcontrol.c
@@ -1168,7 +1168,6 @@ void mem_cgroup_scan_tasks(struct mem_cg
{
struct mem_cgroup *iter;
int ret = 0;
- int i = 0;
BUG_ON(mem_cgroup_is_root(memcg));
@@ -1178,10 +1177,9 @@ void mem_cgroup_scan_tasks(struct mem_cg
css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
while (!ret && (task = css_task_iter_next(&it))) {
- /* Avoid potential softlockup warning */
- if ((++i & 1023) == 0)
- cond_resched();
ret = fn(task, arg);
+ /* Avoid potential softlockup warning */
+ cond_resched();
}
css_task_iter_end(&it);
if (ret) {
_
Patches currently in -mm which might be from leitao(a)debian.org are
memcg-always-call-cond_resched-after-fn.patch
Hi Greg and Sasha,
Please find attached backports of commit d0afcfeb9e38 ("kbuild: Disable
-Wdefault-const-init-unsafe") for 6.6 and older, which is needed for tip
of tree versions of LLVM. Please let me know if there are any questions.
Cheers,
Nathan