Syzbot has found a deadlock (analyzed by Lance Yang):
1) Task (5749): Holds folio_lock, then tries to acquire i_mmap_rwsem(read lock).
2) Task (5754): Holds i_mmap_rwsem(write lock), then tries to acquire
folio_lock.
migrate_pages()
 -> migrate_hugetlbs()
  -> unmap_and_move_huge_page() <- Takes folio_lock!
   -> remove_migration_ptes()
    -> __rmap_walk_file()
     -> i_mmap_lock_read() <- Waits for i_mmap_rwsem(read lock)!

hugetlbfs_fallocate()
 -> hugetlbfs_punch_hole() <- Takes i_mmap_rwsem(write lock)!
  -> hugetlbfs_zero_partial_page()
   -> filemap_lock_hugetlb_folio()
    -> filemap_lock_folio()
     -> __filemap_get_folio() <- Waits for folio_lock!
The migration path is the one taking locks in the wrong order according
to the documentation at the top of mm/rmap.c. So expand the scope of the
existing i_mmap_lock to cover the calls to remove_migration_ptes() too.
This is (mostly) how it used to be after commit c0d0381ade79. That was
removed by 336bf30eb765 for both file & anon hugetlb pages when it should
only have been removed for anon hugetlb pages.
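In outline, the resulting flow for a mapped, file-backed hugetlb folio is
(a simplified sketch of unmap_and_move_huge_page() after this patch; the
write lock is taken via the existing TTU_RMAP_LOCKED path shown in the
diff below):

        /* i_mmap_rwsem already held in write mode, ttu = TTU_RMAP_LOCKED */
        try_to_migrate(src, ttu);
        ...
        if (!folio_mapped(src))
                rc = move_to_new_folio(dst, src, mode);
        ...
        remove_migration_ptes(src, !rc ? dst : src, ttu ? RMP_LOCKED : 0);

        /* only now drop i_mmap_rwsem, after the rmap walk is done */
        if (ttu & TTU_RMAP_LOCKED)
                i_mmap_unlock_write(mapping);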
Fixes: 336bf30eb765 ("hugetlbfs: fix anon huge page migration race")
Reported-by: syzbot+2d9c96466c978346b55f@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/68e9715a.050a0220.1186a4.000d.GAE@google.com
Debugged-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: stable@vger.kernel.org
---
mm/migrate.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 5169f9717f60..4688b9e38cd2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1458,6 +1458,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
int page_was_mapped = 0;
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
+ enum ttu_flags ttu = 0;
if (folio_ref_count(src) == 1) {
/* page was freed from under us. So we are done. */
@@ -1498,8 +1499,6 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
goto put_anon;
if (folio_mapped(src)) {
- enum ttu_flags ttu = 0;
-
if (!folio_test_anon(src)) {
/*
* In shared mappings, try_to_unmap could potentially
@@ -1516,16 +1515,17 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
try_to_migrate(src, ttu);
page_was_mapped = 1;
-
- if (ttu & TTU_RMAP_LOCKED)
- i_mmap_unlock_write(mapping);
}
if (!folio_mapped(src))
rc = move_to_new_folio(dst, src, mode);
if (page_was_mapped)
- remove_migration_ptes(src, !rc ? dst : src, 0);
+ remove_migration_ptes(src, !rc ? dst : src,
+ ttu ? RMP_LOCKED : 0);
+
+ if (ttu & TTU_RMAP_LOCKED)
+ i_mmap_unlock_write(mapping);
unlock_put_anon:
folio_unlock(dst);
--
2.47.3
xhci_alloc_command() allocates a command structure and, when the
second argument is true, also allocates a completion structure.
Currently, the error handling paths in xhci_disable_slot() only free
the command structure using kfree(), causing the completion structure
to leak.
Use xhci_free_command() instead of kfree(). xhci_free_command() correctly
frees both the command structure and the associated completion structure.
Since the command structure is allocated with zero-initialization,
command->in_ctx is NULL and will not be erroneously freed by
xhci_free_command().
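For reference, the intended alloc/free pairing is roughly the following
(a sketch of the calling pattern, not the driver code verbatim):

        /* second argument true: allocates the command and a completion */
        command = xhci_alloc_command(xhci, true, GFP_KERNEL);
        if (!command)
                return -ENOMEM;
        ...
        /* frees the completion (and in_ctx, if set) along with the command */
        xhci_free_command(xhci, command);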
This bug was found using an experimental static analysis tool we are
developing. The tool is based on the LLVM framework and is specifically
designed to detect memory management issues. It is currently under
active development and not yet publicly available, but we plan to
open-source it after our research is published.
The analysis was performed on Linux kernel v6.13-rc1.
We performed build testing on x86_64 with allyesconfig using GCC 11.4.0.
Since triggering these error paths in xhci_disable_slot() requires specific
hardware conditions or abnormal state, we were unable to construct a test
case to reliably trigger these specific error paths at runtime.
Fixes: 7faac1953ed1 ("xhci: avoid race between disable slot command and host runtime suspend")
Cc: stable@vger.kernel.org
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
---
Changes in v2:
- Add detailed information required by researcher guidelines.
- Clarify the safety of using xhci_free_command() in this context.
- Correct the Fixes tag to point to the commit that introduced this issue.
drivers/usb/host/xhci.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 02c9bfe21ae2..f0beed054954 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -4137,7 +4137,7 @@ int xhci_disable_slot(struct xhci_hcd *xhci, u32 slot_id)
if (state == 0xffffffff || (xhci->xhc_state & XHCI_STATE_DYING) ||
(xhci->xhc_state & XHCI_STATE_HALTED)) {
spin_unlock_irqrestore(&xhci->lock, flags);
- kfree(command);
+ xhci_free_command(xhci, command);
return -ENODEV;
}
@@ -4145,7 +4145,7 @@ int xhci_disable_slot(struct xhci_hcd *xhci, u32 slot_id)
slot_id);
if (ret) {
spin_unlock_irqrestore(&xhci->lock, flags);
- kfree(command);
+ xhci_free_command(xhci, command);
return ret;
}
xhci_ring_cmd_db(xhci);
--
2.34.1
When multiple registered buffers share the same compound page, only the
first buffer accounts for the memory via io_buffer_account_pin(). The
subsequent buffers skip accounting since headpage_already_acct() returns
true.
When the first buffer is unregistered, the accounting is decremented,
but the compound page remains pinned by the remaining buffers. This
creates a state where pinned memory is not properly accounted against
RLIMIT_MEMLOCK.
On systems with HugeTLB pages pre-allocated, an unprivileged user can
exploit this to pin memory beyond RLIMIT_MEMLOCK by cycling buffer
registrations. The bypass amount is proportional to the number of
available huge pages, potentially allowing gigabytes of memory to be
pinned while the kernel accounting shows near-zero.
Fix this by recalculating the actual pages to unaccount when unmapping
a buffer. For regular pages, always unaccount. For compound pages, only
unaccount if no other registered buffer references the same compound
page. This ensures the accounting persists until the last buffer
referencing the compound page is released.
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Fixes: 57bebf807e2a ("io_uring/rsrc: optimise registered huge pages")
Cc: stable@vger.kernel.org
Signed-off-by: Yuhao Jiang <danisjiang@gmail.com>
---
io_uring/rsrc.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 67 insertions(+), 2 deletions(-)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index a63474b331bf..dcf2340af5a2 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -139,15 +139,80 @@ static void io_free_imu(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
kvfree(imu);
}
+/*
+ * Calculate pages to unaccount when unmapping a buffer. Regular pages are
+ * always counted. Compound pages are only counted if no other registered
+ * buffer references them, ensuring accounting persists until the last user.
+ */
+static unsigned long io_buffer_calc_unaccount(struct io_ring_ctx *ctx,
+ struct io_mapped_ubuf *imu)
+{
+ struct page *last_hpage = NULL;
+ unsigned long acct = 0;
+ unsigned int i;
+
+ for (i = 0; i < imu->nr_bvecs; i++) {
+ struct page *page = imu->bvec[i].bv_page;
+ struct page *hpage;
+ unsigned int j;
+
+ if (!PageCompound(page)) {
+ acct++;
+ continue;
+ }
+
+ hpage = compound_head(page);
+ if (hpage == last_hpage)
+ continue;
+ last_hpage = hpage;
+
+ /* Check if we already processed this hpage earlier in this buffer */
+ for (j = 0; j < i; j++) {
+ if (PageCompound(imu->bvec[j].bv_page) &&
+ compound_head(imu->bvec[j].bv_page) == hpage)
+ goto next_hpage;
+ }
+
+ /* Only unaccount if no other buffer references this page */
+ for (j = 0; j < ctx->buf_table.nr; j++) {
+ struct io_rsrc_node *node = ctx->buf_table.nodes[j];
+ struct io_mapped_ubuf *other;
+ unsigned int k;
+
+ if (!node)
+ continue;
+ other = node->buf;
+ if (other == imu)
+ continue;
+
+ for (k = 0; k < other->nr_bvecs; k++) {
+ struct page *op = other->bvec[k].bv_page;
+
+ if (!PageCompound(op))
+ continue;
+ if (compound_head(op) == hpage)
+ goto next_hpage;
+ }
+ }
+ acct += page_size(hpage) >> PAGE_SHIFT;
+next_hpage:
+ ;
+ }
+ return acct;
+}
+
static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
{
+ unsigned long acct;
+
if (unlikely(refcount_read(&imu->refs) > 1)) {
if (!refcount_dec_and_test(&imu->refs))
return;
}
- if (imu->acct_pages)
- io_unaccount_mem(ctx->user, ctx->mm_account, imu->acct_pages);
+ acct = io_buffer_calc_unaccount(ctx, imu);
+ if (acct)
+ io_unaccount_mem(ctx->user, ctx->mm_account, acct);
imu->release(imu->priv);
io_free_imu(ctx, imu);
}
--
2.34.1
V1 -> V2:
- Because the `pmd_val` variable name broke ppc builds, renamed it to
  `_pmd`; see [1].
[1] https://lore.kernel.org/stable/aS7lPZPYuChOTdXU@hyeyoo
- Added David Hildenbrand's Acked-by [2], thanks a lot!
[2] https://lore.kernel.org/linux-mm/ac8d7137-3819-4a75-9dd3-fb3d2259ebe4@kerne…
# TL;DR
previous discussion: https://lore.kernel.org/linux-mm/20250921232709.1608699-1-harry.yoo@oracle.…
A "bad pmd" error occurs due to race condition between
change_prot_numa() and THP migration. The mainline kernel does not have
this bug as commit 670ddd8cdc fixes the race condition. 6.1.y, 5.15.y,
5.10.y, 5.4.y are affected by this bug.
Fixing this in -stable kernels is tricky because pte_offset_map_lock()
has different semantics in pre-6.5 and post-6.5 kernels. I am trying to
backport the same mechanism we have in the mainline kernel.
Since the code looks a bit different due to the different semantics of
pte_offset_map_lock(), it'd be best to get this reviewed by MM folks.
# Testing
I verified that the bug described below is not reproduced anymore
(on a downstream kernel) after applying this patch series. It used to
trigger within a few days of intensive NUMA balancing testing, but it
survived 2 weeks with this applied.
# Bug Description
It was reported that a bad pmd is seen when automatic NUMA
balancing is marking page table entries as prot_numa:
[2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02)
[2437548.235022] Call Trace:
[2437548.238234] <TASK>
[2437548.241060] dump_stack_lvl+0x46/0x61
[2437548.245689] panic+0x106/0x2e5
[2437548.249497] pmd_clear_bad+0x3c/0x3c
[2437548.253967] change_pmd_range.isra.0+0x34d/0x3a7
[2437548.259537] change_p4d_range+0x156/0x20e
[2437548.264392] change_protection_range+0x116/0x1a9
[2437548.269976] change_prot_numa+0x15/0x37
[2437548.274774] task_numa_work+0x1b8/0x302
[2437548.279512] task_work_run+0x62/0x95
[2437548.283882] exit_to_user_mode_loop+0x1a4/0x1a9
[2437548.289277] exit_to_user_mode_prepare+0xf4/0xfc
[2437548.294751] ? sysvec_apic_timer_interrupt+0x34/0x81
[2437548.300677] irqentry_exit_to_user_mode+0x5/0x25
[2437548.306153] asm_sysvec_apic_timer_interrupt+0x16/0x1b
This is due to a race condition between change_prot_numa() and
THP migration because the kernel doesn't check is_swap_pmd() and
pmd_trans_huge() atomically:
  change_prot_numa()                          THP migration
======================================================================
- change_pmd_range()
  -> is_swap_pmd() returns false,
     meaning it's not a PMD migration
     entry.
                                       - do_huge_pmd_numa_page()
                                         -> migrate_misplaced_page() sets
                                            migration entries for the THP.
- change_pmd_range()
  -> pmd_none_or_clear_bad_unless_trans_huge()
     -> pmd_none() and pmd_trans_huge() returns false
- pmd_none_or_clear_bad_unless_trans_huge()
  -> pmd_bad() returns true for the migration entry!
The upstream commit 670ddd8cdcbd ("mm/mprotect: delete
pmd_none_or_clear_bad_unless_trans_huge()") closes this race condition
by checking is_swap_pmd() and pmd_trans_huge() atomically.
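Roughly, the mainline code makes those checks against a single pmd
snapshot (a simplified sketch, not a literal quote of the upstream code):

        pmd_t pmdval = pmdp_get_lockless(pmd);

        /*
         * Both checks look at the same snapshot, so a concurrent THP
         * migration cannot slip in between is_swap_pmd() and
         * pmd_trans_huge().
         */
        if (is_swap_pmd(pmdval) || pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
                /* handle the huge/migration entry */
        } else {
                /* pte-mapped: take the pte lock and re-validate the pmd */
        }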
# Backporting note
Commit a79390f5d6a7 ("mm/mprotect: use long for page accountings and retval")
is backported so that change_pte_range() can return an error code
(a negative value).
Unlike in the mainline, pte_offset_map_lock() does not check whether the
pmd entry is a migration entry or a hugepage; it acquires the PTL
unconditionally instead of returning failure. Therefore, it is necessary
to keep the !is_swap_pmd() && !pmd_trans_huge() && !pmd_devmap() checks
in change_pmd_range() before acquiring the PTL.
After acquiring the lock, open-code the semantics of
pte_offset_map_lock() in the mainline kernel: change_pte_range() fails
if the pmd value has changed. This requires adding a pmd_old parameter
(the pmd_t value read before calling the function) to
change_pte_range().
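A minimal sketch of that open-coded re-check inside change_pte_range(),
assuming the pmd_old parameter described above (illustrative only, not
the literal backport):

        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
        if (!pmd_same(pmd_old, *pmd)) {
                /*
                 * The pmd changed under us (e.g. THP migration installed
                 * a migration entry); fail instead of touching a bad pmd.
                 * The caller treats a negative return as an error,
                 * courtesy of the a79390f5d6a7 backport.
                 */
                pte_unmap_unlock(pte, ptl);
                return -EAGAIN;
        }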
Hugh Dickins (1):
mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()
Peter Xu (1):
mm/mprotect: use long for page accountings and retval
include/linux/hugetlb.h | 4 +-
include/linux/mm.h | 2 +-
mm/hugetlb.c | 4 +-
mm/mempolicy.c | 2 +-
mm/mprotect.c | 124 +++++++++++++++++-----------------------
5 files changed, 60 insertions(+), 76 deletions(-)
--
2.43.0
The arm64 kernel doesn't boot when annotated branch profiling
(CONFIG_PROFILE_ANNOTATED_BRANCHES) and CONFIG_DEBUG_VIRTUAL are enabled
together.
Bisecting it, I found that disabling branch profiling in arch/arm64/mm
solved the problem. Narrowing down a bit further, I found that
physaddr.c is the file that needs to have branch profiling disabled to
get the machine to boot.
I suspect that it invokes some ftrace helper very early in the boot
process, while ftrace is not yet enabled.
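For context, with CONFIG_PROFILE_ANNOTATED_BRANCHES the annotated branch
macros are rewritten to record their outcome through ftrace, roughly as
follows (paraphrased from include/linux/compiler.h; defining
DISABLE_BRANCH_PROFILING suppresses this per object file):

        #if defined(CONFIG_TRACE_BRANCH_PROFILING) && !defined(DISABLE_BRANCH_PROFILING)
        /* each annotation expands to __branch_check__(), which calls
         * ftrace_likely_update() to record whether the hint was correct */
        # define likely(x)      (__branch_check__(x, 1, __builtin_constant_p(x)))
        # define unlikely(x)    (__branch_check__(x, 0, __builtin_constant_p(x)))
        #endif

Any annotated branch executed before ftrace is ready then pokes tracing
state too early, which would match the boot failure seen here.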
Rather than playing whack-a-mole with individual files, disable branch
profiling for the entire arch/arm64 tree, similar to what x86 already
does in arch/x86/Kbuild.
Cc: stable@vger.kernel.org
Fixes: ec6d06efb0bac ("arm64: Add support for CONFIG_DEBUG_VIRTUAL")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v2:
- Expand the scope to arch/arm64 instead of just physaddr.c
- Link to v1: https://lore.kernel.org/all/20251231-annotated-v1-1-9db1c0d03062@debian.org/
---
arch/arm64/Kbuild | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/arm64/Kbuild b/arch/arm64/Kbuild
index 5bfbf7d79c99..d876bc0e5421 100644
--- a/arch/arm64/Kbuild
+++ b/arch/arm64/Kbuild
@@ -1,4 +1,8 @@
# SPDX-License-Identifier: GPL-2.0-only
+
+# Branch profiling isn't noinstr-safe
+subdir-ccflags-$(CONFIG_TRACE_BRANCH_PROFILING) += -DDISABLE_BRANCH_PROFILING
+
obj-y += kernel/ mm/ net/
obj-$(CONFIG_KVM) += kvm/
obj-$(CONFIG_XEN) += xen/
---
base-commit: c8ebd433459bcbf068682b09544e830acd7ed222
change-id: 20251231-annotated-75de3f33cd7b
Best regards,
--
Breno Leitao <leitao@debian.org>
Commit 7f9ab862e05c ("leds: spi-byte: Call of_node_put() on error path")
was merged in 6.11 and then backported to stable trees through 5.10. It
relocates the line that initializes the variable 'child' to a later
point in spi_byte_probe().
Versions < 6.9 do not have commit ccc35ff2fd29 ("leds: spi-byte: Use
devm_led_classdev_register_ext()"), which removes a line that reads a
property from 'child' before its new initialization point. Consequently,
spi_byte_probe() reads from an uninitialized device node in stable
kernels 6.6-5.10.
Initialize 'child' before it is first accessed.
Fixes: 7f9ab862e05c ("leds: spi-byte: Call of_node_put() on error path")
Signed-off-by: Tiffany Yang <ynaffit@google.com>
---
As an alternative to moving the initialization of 'child' up,
Fedor Pchelkin proposed [1] backporting the change that removes the
intermediate access.
[1] https://lore.kernel.org/stable/20241029204128.527033-1-pchelkin@ispras.ru/
---
drivers/leds/leds-spi-byte.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/leds/leds-spi-byte.c b/drivers/leds/leds-spi-byte.c
index afe9bff7c7c1..4520df1e2341 100644
--- a/drivers/leds/leds-spi-byte.c
+++ b/drivers/leds/leds-spi-byte.c
@@ -96,6 +96,7 @@ static int spi_byte_probe(struct spi_device *spi)
if (!led)
return -ENOMEM;
+ child = of_get_next_available_child(dev_of_node(dev), NULL);
of_property_read_string(child, "label", &name);
strscpy(led->name, name, sizeof(led->name));
led->spi = spi;
@@ -106,7 +107,6 @@ static int spi_byte_probe(struct spi_device *spi)
led->ldev.max_brightness = led->cdef->max_value - led->cdef->off_value;
led->ldev.brightness_set_blocking = spi_byte_brightness_set_blocking;
- child = of_get_next_available_child(dev_of_node(dev), NULL);
state = of_get_property(child, "default-state", NULL);
if (state) {
if (!strcmp(state, "on")) {
--
2.52.0.351.gbe84eed79e-goog
From: Ankit Garg <nktgrg@google.com>
This series fixes a kernel panic in the GVE driver caused by
out-of-bounds array access when the network stack provides an invalid
TX queue index.
The issue impacts both GQI and DQO queue formats. For both cases, the
driver is updated to validate the queue index and drop the packet if
the index is out of range.
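The check itself is simple; a sketch of the GQI-side validation (function
and field names are assumptions for illustration, not the exact patch):

        netdev_tx_t gve_tx(struct sk_buff *skb, struct net_device *dev)
        {
                struct gve_priv *priv = netdev_priv(dev);
                u32 qid = skb_get_queue_mapping(skb);

                /* Drop rather than index past the end of priv->tx[]. */
                if (unlikely(qid >= priv->tx_cfg.num_queues)) {
                        dev_kfree_skb_any(skb);
                        return NETDEV_TX_OK;
                }
                ...
        }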
Ankit Garg (2):
gve: drop packets on invalid queue indices in GQI TX path
gve: drop packets on invalid queue indices in DQO TX path
drivers/net/ethernet/google/gve/gve_tx.c | 12 +++++++++---
drivers/net/ethernet/google/gve/gve_tx_dqo.c | 9 ++++++++-
2 files changed, 17 insertions(+), 4 deletions(-)
--
2.52.0.351.gbe84eed79e-goog
The patch titled
Subject: mm/vmscan: fix demotion targets checks in reclaim/demotion
has been added to the -mm mm-new branch. Its filename is
mm-vmscan-fix-demotion-targets-checks-in-reclaim-demotion.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
The mm-new branch of mm.git is not included in linux-next
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via various
branches at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there most days
------------------------------------------------------
From: Bing Jiao <bingjiao@google.com>
Subject: mm/vmscan: fix demotion targets checks in reclaim/demotion
Date: Thu, 8 Jan 2026 03:32:46 +0000
Fix two bugs in demote_folio_list() and can_demote() due to incorrect
demotion target checks in reclaim/demotion.
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to can_demote().
However:
1. It does not apply this check in demote_folio_list(), which leads
to situations where pages are demoted to nodes that are
explicitly excluded from the task's cpuset.mems.
2. It checks only the nodes in the immediate next demotion hierarchy
and does not check all allowed demotion targets in can_demote().
This can cause pages to never be demoted if the nodes in the next
demotion hierarchy are not set in mems_effective.
These bugs break resource isolation provided by cpuset.mems. This is
visible from userspace because pages can either fail to be demoted
entirely or are demoted to nodes that are not allowed in multi-tier memory
systems.
To address these bugs, update cpuset_node_allowed() to return
effective_mems and mem_cgroup_node_allowed() to filter out nodes that are
not set in effective_mems. Also update can_demote() and
demote_folio_list() accordingly.
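Condensed, the resulting check in can_demote() becomes (abridged from the
change below):

        node_get_allowed_targets(pgdat, &allowed_mask);
        if (nodes_empty(allowed_mask))
                return false;

        /* Filter out nodes that are not in the cgroup's mems_allowed. */
        mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
        return !nodes_empty(allowed_mask);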
Bug 1 reproduction:
Assume a system with 4 nodes, where nodes 0-1 are top-tier and
nodes 2-3 are far-tier memory. All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Should respect node 0-2 limit.
# Observation: Node 3 shows significant allocation (MemFree drops)
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
Bug 2 reproduction:
Assume a system with 6 nodes, where nodes 0-2 are top-tier,
node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Pages are demoted to Nodes 4-5
# Observation: No pages are demoted before oom.
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
Link: https://lkml.kernel.org/r/20260108033248.2791579-2-bingjiao@google.com
Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/cpuset.h | 6 +--
include/linux/memcontrol.h | 6 +--
kernel/cgroup/cpuset.c | 54 +++++++++++++++++++++++------------
mm/memcontrol.c | 16 +++++++++-
mm/vmscan.c | 28 ++++++++++++------
5 files changed, 75 insertions(+), 35 deletions(-)
--- a/include/linux/cpuset.h~mm-vmscan-fix-demotion-targets-checks-in-reclaim-demotion
+++ a/include/linux/cpuset.h
@@ -174,7 +174,7 @@ static inline void set_mems_allowed(node
task_unlock(current);
}
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask);
#else /* !CONFIG_CPUSETS */
static inline bool cpusets_enabled(void) { return false; }
@@ -301,9 +301,9 @@ static inline bool read_mems_allowed_ret
return false;
}
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+static inline void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
{
- return true;
+ nodes_copy(*mask, node_states[N_MEMORY]);
}
#endif /* !CONFIG_CPUSETS */
--- a/include/linux/memcontrol.h~mm-vmscan-fix-demotion-targets-checks-in-reclaim-demotion
+++ a/include/linux/memcontrol.h
@@ -1736,7 +1736,7 @@ static inline void count_objcg_events(st
rcu_read_unlock();
}
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
+void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
@@ -1807,9 +1807,9 @@ static inline ino_t page_cgroup_ino(stru
return 0;
}
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+static inline void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg,
+ nodemask_t *mask)
{
- return true;
}
static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
--- a/kernel/cgroup/cpuset.c~mm-vmscan-fix-demotion-targets-checks-in-reclaim-demotion
+++ a/kernel/cgroup/cpuset.c
@@ -4416,40 +4416,58 @@ bool cpuset_current_node_allowed(int nod
return allowed;
}
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+/**
+ * cpuset_nodes_allowed - return effective_mems mask from a cgroup cpuset.
+ * @cgroup: pointer to struct cgroup.
+ * @mask: pointer to struct nodemask_t to be returned.
+ *
+ * Returns effective_mems mask from a cgroup cpuset if it is cgroup v2 and
+ * has cpuset subsys. Otherwise, returns node_states[N_MEMORY].
+ *
+ * This function intentionally avoids taking the cpuset_mutex or callback_lock
+ * when accessing effective_mems. This is because the obtained effective_mems
+ * is stale immediately after the query anyway (e.g., effective_mems is updated
+ * immediately after releasing the lock but before returning).
+ *
+ * As a result, returned @mask may be empty because cs->effective_mems can be
+ * rebound during this call. Besides, nodes in @mask are not guaranteed to be
+ * online due to hotplug. Callers should check the mask for validity on
+ * return based on its subsequent use.
+ **/
+void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
{
struct cgroup_subsys_state *css;
struct cpuset *cs;
- bool allowed;
/*
* In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
* and mems_allowed is likely to be empty even if we could get to it,
- * so return true to avoid taking a global lock on the empty check.
+ * so return directly to avoid taking a global lock on the empty check.
*/
- if (!cpuset_v2())
- return true;
+ if (!cgroup || !cpuset_v2()) {
+ nodes_copy(*mask, node_states[N_MEMORY]);
+ return;
+ }
css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
- if (!css)
- return true;
+ if (!css) {
+ nodes_copy(*mask, node_states[N_MEMORY]);
+ return;
+ }
/*
- * Normally, accessing effective_mems would require the cpuset_mutex
- * or callback_lock - but node_isset is atomic and the reference
- * taken via cgroup_get_e_css is sufficient to protect css.
- *
- * Since this interface is intended for use by migration paths, we
- * relax locking here to avoid taking global locks - while accepting
- * there may be rare scenarios where the result may be innaccurate.
+ * The reference taken via cgroup_get_e_css is sufficient to
+ * protect css, but it does not imply safe accesses to effective_mems.
*
- * Reclaim and migration are subject to these same race conditions, and
- * cannot make strong isolation guarantees, so this is acceptable.
+ * Normally, accessing effective_mems would require the cpuset_mutex
+ * or callback_lock - but the correctness of this information is stale
+ * immediately after the query anyway. We do not acquire the lock
+ * during this process to save lock contention in exchange for racing
+ * against mems_allowed rebinds.
*/
cs = container_of(css, struct cpuset, css);
- allowed = node_isset(nid, cs->effective_mems);
+ nodes_copy(*mask, cs->effective_mems);
css_put(css);
- return allowed;
}
/**
--- a/mm/memcontrol.c~mm-vmscan-fix-demotion-targets-checks-in-reclaim-demotion
+++ a/mm/memcontrol.c
@@ -5593,9 +5593,21 @@ subsys_initcall(mem_cgroup_swap_init);
#endif /* CONFIG_SWAP */
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
{
- return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
+ nodemask_t allowed;
+
+ if (!memcg)
+ return;
+
+ /*
+ * Since this interface is intended for use by migration paths, and
+ * reclaim and migration are subject to race conditions such as changes
+ * in effective_mems and hot-unplugging of nodes, an inaccurate
+ * allowed mask is acceptable.
+ */
+ cpuset_nodes_allowed(memcg->css.cgroup, &allowed);
+ nodes_and(*mask, *mask, allowed);
}
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
--- a/mm/vmscan.c~mm-vmscan-fix-demotion-targets-checks-in-reclaim-demotion
+++ a/mm/vmscan.c
@@ -344,19 +344,21 @@ static void flush_reclaim_state(struct s
static bool can_demote(int nid, struct scan_control *sc,
struct mem_cgroup *memcg)
{
- int demotion_nid;
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ nodemask_t allowed_mask;
- if (!numa_demotion_enabled)
+ if (!pgdat || !numa_demotion_enabled)
return false;
if (sc && sc->no_demotion)
return false;
- demotion_nid = next_demotion_node(nid);
- if (demotion_nid == NUMA_NO_NODE)
+ node_get_allowed_targets(pgdat, &allowed_mask);
+ if (nodes_empty(allowed_mask))
return false;
- /* If demotion node isn't in the cgroup's mems_allowed, fall back */
- return mem_cgroup_node_allowed(memcg, demotion_nid);
+ /* Filter out nodes that are not in cgroup's mems_allowed. */
+ mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
+ return !nodes_empty(allowed_mask);
}
static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
@@ -1018,7 +1020,8 @@ static struct folio *alloc_demote_folio(
* Folios which are not demoted are left on @demote_folios.
*/
static unsigned int demote_folio_list(struct list_head *demote_folios,
- struct pglist_data *pgdat)
+ struct pglist_data *pgdat,
+ struct mem_cgroup *memcg)
{
int target_nid = next_demotion_node(pgdat->node_id);
unsigned int nr_succeeded;
@@ -1032,7 +1035,6 @@ static unsigned int demote_folio_list(st
*/
.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
__GFP_NOMEMALLOC | GFP_NOWAIT,
- .nid = target_nid,
.nmask = &allowed_mask,
.reason = MR_DEMOTION,
};
@@ -1041,9 +1043,17 @@ static unsigned int demote_folio_list(st
return 0;
if (target_nid == NUMA_NO_NODE)
+ /* No lower-tier nodes or nodes were hot-unplugged. */
return 0;
node_get_allowed_targets(pgdat, &allowed_mask);
+ mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
+ if (nodes_empty(allowed_mask))
+ return 0;
+
+ if (!node_isset(target_nid, allowed_mask))
+ target_nid = node_random(&allowed_mask);
+ mtc.nid = target_nid;
/* Demotion ignores all cpuset and mempolicy settings */
migrate_pages(demote_folios, alloc_demote_folio, NULL,
@@ -1565,7 +1575,7 @@ keep:
/* 'folio_list' is always empty here */
/* Migrate folios selected for demotion */
- nr_demoted = demote_folio_list(&demote_folios, pgdat);
+ nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
nr_reclaimed += nr_demoted;
stat->nr_demoted += nr_demoted;
/* Folios that could not be demoted are still in @demote_folios */
_
Patches currently in -mm which might be from bingjiao@google.com are
mm-vmscan-fix-demotion-targets-checks-in-reclaim-demotion.patch
mm-vmscan-select-the-closest-preferred-node-in-demote_folio_list.patch