The quilt patch titled
Subject: mm/shmem, swap: improve cached mTHP handling and fix potential hang
has been removed from the -mm tree. Its filename was
mm-shmem-swap-improve-cached-mthp-handling-and-fix-potential-hang.patch
This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Kairui Song <kasong(a)tencent.com>
Subject: mm/shmem, swap: improve cached mTHP handling and fix potential hang
Date: Mon, 28 Jul 2025 15:52:59 +0800
The current swap-in code assumes that, when a swap entry in the shmem
mapping is order 0, its cached folios (if present) must be order 0 too,
which turns out not to always be correct.
The problem is that shmem_split_large_entry is called before verifying that
the folio will eventually be swapped in. One possible race:
CPU1 CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
folio = swap_cache_get_folio
/* folio = NULL */
order = xa_get_order
/* order > 0 */
folio = shmem_swap_alloc_folio
/* mTHP alloc failure, folio = NULL */
<... Interrupted ...>
shmem_swapin_folio
/* S1 is swapped in */
shmem_writeout
/* S1 is swapped out, folio cached */
shmem_split_large_entry(..., S1)
/* S1 is split, but the folio covering it has order > 0 now */
Now any following swapin of S1 will hang: `xa_get_order` returns 0, while
folio lookup returns a folio with order > 0. The
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check will
always be true, so swap-in keeps returning -EEXIST.
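For reference, a condensed sketch of the pre-fix check in
shmem_swapin_folio() (reconstructed from the hunk below) shows why the
retry can never succeed once the orders diverge:

	folio_lock(folio);
	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
	    folio->swap.val != swap.val ||
	    !shmem_confirm_swap(mapping, index, swap) ||
	    xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
		/* after the racy split: xa_get_order() == 0 while
		 * folio_order(folio) > 0, so every retry hits -EEXIST */
		error = -EEXIST;
		goto unlock;
	}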
This is fragile, so fix it by allowing a larger folio to be seen in the
swap cache, and by checking that the whole shmem mapping range covered by
the swapin has the right swap value upon inserting the folio. Also drop
the redundant tree walks before the insertion.
This actually improves performance, as it avoids two redundant XArray
tree walks in the hot path. The only side effect is that, in the failure
path, shmem may redundantly reallocate a few folios, causing slight
temporary memory pressure.
Worth noting: it may seem that the order and value check before insertion
would help reduce lock contention, but that is not true. The swap cache
layer ensures that a raced swapin will either see a swap cache folio or
fail to swap in (the SWAP_HAS_CACHE bit is set even when the swap cache
is bypassed), so holding the folio lock and checking the folio flag is
already enough to avoid lock contention. The chance that a folio passes
the swap entry value check while the shmem mapping slot has changed
should be very low.
Link: https://lkml.kernel.org/r/20250728075306.12704-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20250728075306.12704-2-ryncsn@gmail.com
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <kasong(a)tencent.com>
Reviewed-by: Kemeng Shi <shikemeng(a)huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Nhat Pham <nphamcs(a)gmail.com>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shmem.c | 39 ++++++++++++++++++++++++++++++---------
1 file changed, 30 insertions(+), 9 deletions(-)
--- a/mm/shmem.c~mm-shmem-swap-improve-cached-mthp-handling-and-fix-potential-hang
+++ a/mm/shmem.c
@@ -891,7 +891,9 @@ static int shmem_add_to_page_cache(struc
pgoff_t index, void *expected, gfp_t gfp)
{
XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
- long nr = folio_nr_pages(folio);
+ unsigned long nr = folio_nr_pages(folio);
+ swp_entry_t iter, swap;
+ void *entry;
VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -903,14 +905,25 @@ static int shmem_add_to_page_cache(struc
gfp &= GFP_RECLAIM_MASK;
folio_throttle_swaprate(folio, gfp);
+ swap = radix_to_swp_entry(expected);
do {
+ iter = swap;
xas_lock_irq(&xas);
- if (expected != xas_find_conflict(&xas)) {
- xas_set_err(&xas, -EEXIST);
- goto unlock;
+ xas_for_each_conflict(&xas, entry) {
+ /*
+ * The range must either be empty, or filled with
+ * expected swap entries. Shmem swap entries are never
+ * partially freed without split of both entry and
+ * folio, so there shouldn't be any holes.
+ */
+ if (!expected || entry != swp_to_radix_entry(iter)) {
+ xas_set_err(&xas, -EEXIST);
+ goto unlock;
+ }
+ iter.val += 1 << xas_get_order(&xas);
}
- if (expected && xas_find_conflict(&xas)) {
+ if (expected && iter.val - nr != swap.val) {
xas_set_err(&xas, -EEXIST);
goto unlock;
}
@@ -2359,7 +2372,7 @@ static int shmem_swapin_folio(struct ino
error = -ENOMEM;
goto failed;
}
- } else if (order != folio_order(folio)) {
+ } else if (order > folio_order(folio)) {
/*
* Swap readahead may swap in order 0 folios into swapcache
* asynchronously, while the shmem mapping can still stores
@@ -2384,15 +2397,23 @@ static int shmem_swapin_folio(struct ino
swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
}
+ } else if (order < folio_order(folio)) {
+ swap.val = round_down(swap.val, 1 << folio_order(folio));
+ index = round_down(index, 1 << folio_order(folio));
}
alloced:
- /* We have to do this with folio locked to prevent races */
+ /*
+ * We have to do this with the folio locked to prevent races.
+ * The shmem_confirm_swap below only checks if the first swap
+ * entry matches the folio, that's enough to ensure the folio
+ * is not used outside of shmem, as shmem swap entries
+ * and swap cache folios are never partially freed.
+ */
folio_lock(folio);
if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
- folio->swap.val != swap.val ||
!shmem_confirm_swap(mapping, index, swap) ||
- xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+ folio->swap.val != swap.val) {
error = -EEXIST;
goto unlock;
}
_
Patches currently in -mm which might be from kasong(a)tencent.com are
The quilt patch titled
Subject: mm-fix-a-uaf-when-vma-mm-is-freed-after-vma-vm_refcnt-got-dropped-v3
has been removed from the -mm tree. Its filename was
mm-fix-a-uaf-when-vma-mm-is-freed-after-vma-vm_refcnt-got-dropped-v3.patch
This patch was dropped because it was folded into mm-fix-a-uaf-when-vma-mm-is-freed-after-vma-vm_refcnt-got-dropped.patch
------------------------------------------------------
From: Suren Baghdasaryan <surenb(a)google.com>
Subject: mm-fix-a-uaf-when-vma-mm-is-freed-after-vma-vm_refcnt-got-dropped-v3
Date: Tue, 29 Jul 2025 07:57:09 -0700
- Addressed Lorenzo's nits, per Lorenzo Stoakes
- Added a warning comment for vma_start_read()
- Added Reviewed-by and Acked-by, per Vlastimil Babka and Lorenzo Stoakes
Link: https://lkml.kernel.org/r/20250729145709.2731370-1-surenb@google.com
Fixes: 3104138517fc ("mm: make vma cache SLAB_TYPESAFE_BY_RCU")
Reported-by: Jann Horn <jannh(a)google.com>
Closes: https://lore.kernel.org/all/CAG48ez0-deFbVH=E3jbkWx=X3uVbd8nWeo6kbJPQ0KoUD+…
Signed-off-by: Suren Baghdasaryan <surenb(a)google.com>
Reviewed-by: Vlastimil Babka <vbabka(a)suse.cz>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mmap_lock.h | 7 +++++++
mm/mmap_lock.c | 2 +-
2 files changed, 8 insertions(+), 1 deletion(-)
--- a/include/linux/mmap_lock.h~mm-fix-a-uaf-when-vma-mm-is-freed-after-vma-vm_refcnt-got-dropped-v3
+++ a/include/linux/mmap_lock.h
@@ -155,6 +155,10 @@ static inline void vma_refcount_put(stru
* reused and attached to a different mm before we lock it.
* Returns the vma on success, NULL on failure to lock and EAGAIN if vma got
* detached.
+ *
+ * WARNING! The vma passed to this function cannot be used if the function
+ * fails to lock it because in certain cases RCU lock is dropped and then
* reacquired. Once RCU lock is dropped the vma can be concurrently freed.
*/
static inline struct vm_area_struct *vma_start_read(struct mm_struct *mm,
struct vm_area_struct *vma)
@@ -194,9 +198,12 @@ static inline struct vm_area_struct *vma
if (unlikely(vma->vm_mm != mm)) {
/* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
struct mm_struct *other_mm = vma->vm_mm;
+
/*
* __mmdrop() is a heavy operation and we don't need RCU
* protection here. Release RCU lock during these operations.
+ * We reinstate the RCU read lock as the caller expects it to
+ * be held when this function returns even on error.
*/
rcu_read_unlock();
mmgrab(other_mm);
--- a/mm/mmap_lock.c~mm-fix-a-uaf-when-vma-mm-is-freed-after-vma-vm_refcnt-got-dropped-v3
+++ a/mm/mmap_lock.c
@@ -235,7 +235,7 @@ retry:
goto fallback;
}
- /* Verify the vma is not behind of the last search position. */
+ /* Verify the vma is not behind the last search position. */
if (unlikely(from_addr >= vma->vm_end))
goto fallback_unlock;
_
Patches currently in -mm which might be from surenb(a)google.com are
mm-fix-a-uaf-when-vma-mm-is-freed-after-vma-vm_refcnt-got-dropped.patch
Hello maintainers,
This series addresses a defect observed on certain hardware platforms running Linux kernel 6.1.147 with the i915 driver. The issue concerns the hot plug
detection (HPD) logic and leads to unreliable or missed detection events on affected devices.
### Background
Issue:
On Simatic IPC227E, we observed unreliable or missing hot plug detection events, while on Simatic IPC227G (an otherwise similar platform) the expected hot plug behavior was maintained.
Affected kernel:
This patch series is intended for the Linux 6.1.y stable tree only (tested on 6.1.147).
Most of the tests were conducted on 6.1.147 (manual/standalone kernel build, CIP/Isar context).
Root cause analysis:
I do not have access to hardware signal traces or scope data to conclusively prove the root cause at the electrical level; my understanding is based on observed driver behavior and logs.
My assumption is therefore that on IPC227G, HPD IRQ storms do not occur, so the standard HPD IRQ-based detection works as expected. On IPC227E, frequent HPD interrupts trigger the
i915 driver's storm detection logic, causing it to switch to polling mode; polling then does not resume correctly, leading to the hotplug issue this series addresses.
IPC227E likely triggers this kernel edge case due to slight variations in signal integrity, electrical margins, or internal component timing.
IPC227G functions as expected, possibly due to cleaner electrical signaling or better timing characteristics, thus avoiding the triggering condition.
Conclusion:
This points to a hardware-software interaction where kernel code assumes nicer signaling or margins than IPC227E is able to provide, exposing logic gaps not visible on more robust hardware.
### Patches
Patches 1-4:
- Partial backports of upstream commits; only the relevant logic or fixes are applied, with other code omitted due to downstream divergence.
- Applied minimal merging without exhaustive backport of all intermediate upstream changes.
Patch 5:
- Contains cherry-picked logic plus context/compatibility amendments as needed. Ensures that the driver builds.
- Together these fixes greatly improve reliability of hotplug detection on both devices, with no regression detected in our setups.
Thank you for your review,
Nicusor Huhulea
This patch series contains the following changes:
Dmitry Baryshkov (2):
drm/probe_helper: extract two helper functions
drm/probe-helper: enable and disable HPD on connectors
Imre Deak (2):
drm/i915: Fix HPD polling, reenabling the output poll work as needed
drm: Add an HPD poll helper to reschedule the poll work
Nicusor Huhulea (1):
drm/i915: fixes for i915 Hot Plug Detection and build/runtime issues
drivers/gpu/drm/drm_probe_helper.c | 127 ++++++++++++++-----
drivers/gpu/drm/i915/display/intel_hotplug.c | 4 +-
include/drm/drm_modeset_helper_vtables.h | 22 ++++
include/drm/drm_probe_helper.h | 1 +
4 files changed, 122 insertions(+), 32 deletions(-)
--
2.39.2
The patch titled
Subject: mm: fix possible deadlock in console_trylock_spinning
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-fix-possible-deadlock-in-console_trylock_spinning.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Gu Bowen <gubowen5(a)huawei.com>
Subject: mm: fix possible deadlock in console_trylock_spinning
Date: Wed, 30 Jul 2025 17:49:14 +0800
kmemleak_scan_thread() invokes scan_block(), which may invoke a normal
printk() to print a warning message. This can cause a deadlock in the
scenario reported below:
CPU0 CPU1
---- ----
lock(kmemleak_lock);
lock(&port->lock);
lock(kmemleak_lock);
lock(console_owner);
To solve this problem, switch to printk_safe mode before printing the
warning message. This redirects all printk() calls to a special per-CPU
buffer, which is flushed later from a safe context (irq work), avoiding
the deadlock.
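A minimal sketch of the pattern (generic placeholder warning; the actual
fix is in the hunk below):

	raw_spin_lock_irqsave(&kmemleak_lock, flags);
	/* ... object lookup while holding kmemleak_lock ... */
	__printk_safe_enter();	/* printk() now fills a per-CPU safe buffer */
	pr_warn("...");		/* no console locks are taken in this context */
	__printk_safe_exit();	/* buffer is flushed later from irq work */
	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);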
Our syz tester reported the following lockdep error:
======================================================
WARNING: possible circular locking dependency detected
5.10.0-22221-gca646a51dd00 #16 Not tainted
------------------------------------------------------
kmemleak/182 is trying to acquire lock:
ffffffffaf9e9020 (console_owner){-...}-{0:0}, at: console_trylock_spinning+0xda/0x1d0 kernel/printk/printk.c:1900
but task is already holding lock:
ffffffffb007cf58 (kmemleak_lock){-.-.}-{2:2}, at: scan_block+0x3d/0x220 mm/kmemleak.c:1310
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (kmemleak_lock){-.-.}-{2:2}:
validate_chain+0x5df/0xac0 kernel/locking/lockdep.c:3729
__lock_acquire+0x514/0x940 kernel/locking/lockdep.c:4958
lock_acquire+0x15a/0x3a0 kernel/locking/lockdep.c:5569
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x3b/0x60 kernel/locking/spinlock.c:164
create_object.isra.0+0x36/0x80 mm/kmemleak.c:691
kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
slab_post_alloc_hook mm/slab.h:518 [inline]
slab_alloc_node mm/slub.c:2987 [inline]
slab_alloc mm/slub.c:2995 [inline]
__kmalloc+0x637/0xb60 mm/slub.c:4100
kmalloc include/linux/slab.h:620 [inline]
tty_buffer_alloc+0x127/0x140 drivers/tty/tty_buffer.c:176
__tty_buffer_request_room+0x9b/0x110 drivers/tty/tty_buffer.c:276
tty_insert_flip_string_fixed_flag+0x60/0x130 drivers/tty/tty_buffer.c:321
tty_insert_flip_string include/linux/tty_flip.h:36 [inline]
tty_insert_flip_string_and_push_buffer+0x3a/0xb0 drivers/tty/tty_buffer.c:578
process_output_block+0xc2/0x2e0 drivers/tty/n_tty.c:592
n_tty_write+0x298/0x540 drivers/tty/n_tty.c:2433
do_tty_write drivers/tty/tty_io.c:1041 [inline]
file_tty_write.constprop.0+0x29b/0x4b0 drivers/tty/tty_io.c:1147
redirected_tty_write+0x51/0x90 drivers/tty/tty_io.c:1176
call_write_iter include/linux/fs.h:2117 [inline]
do_iter_readv_writev+0x274/0x350 fs/read_write.c:741
do_iter_write+0xbb/0x1f0 fs/read_write.c:867
vfs_writev+0xfa/0x380 fs/read_write.c:940
do_writev+0xd6/0x1d0 fs/read_write.c:983
do_syscall_64+0x2b/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x6c/0xd6
-> #2 (&port->lock){-.-.}-{2:2}:
validate_chain+0x5df/0xac0 kernel/locking/lockdep.c:3729
__lock_acquire+0x514/0x940 kernel/locking/lockdep.c:4958
lock_acquire+0x15a/0x3a0 kernel/locking/lockdep.c:5569
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x3b/0x60 kernel/locking/spinlock.c:164
tty_port_tty_get+0x1f/0xa0 drivers/tty/tty_port.c:289
tty_port_default_wakeup+0xb/0x30 drivers/tty/tty_port.c:48
serial8250_tx_chars+0x259/0x430 drivers/tty/serial/8250/8250_port.c:1906
__start_tx drivers/tty/serial/8250/8250_port.c:1598 [inline]
serial8250_start_tx+0x304/0x320 drivers/tty/serial/8250/8250_port.c:1720
uart_write+0x1a1/0x2e0 drivers/tty/serial/serial_core.c:635
do_output_char+0x2c0/0x370 drivers/tty/n_tty.c:444
process_output drivers/tty/n_tty.c:511 [inline]
n_tty_write+0x269/0x540 drivers/tty/n_tty.c:2445
do_tty_write drivers/tty/tty_io.c:1041 [inline]
file_tty_write.constprop.0+0x29b/0x4b0 drivers/tty/tty_io.c:1147
call_write_iter include/linux/fs.h:2117 [inline]
do_iter_readv_writev+0x274/0x350 fs/read_write.c:741
do_iter_write+0xbb/0x1f0 fs/read_write.c:867
vfs_writev+0xfa/0x380 fs/read_write.c:940
do_writev+0xd6/0x1d0 fs/read_write.c:983
do_syscall_64+0x2b/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x6c/0xd6
-> #1 (&port_lock_key){-.-.}-{2:2}:
validate_chain+0x5df/0xac0 kernel/locking/lockdep.c:3729
__lock_acquire+0x514/0x940 kernel/locking/lockdep.c:4958
lock_acquire+0x15a/0x3a0 kernel/locking/lockdep.c:5569
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x3b/0x60 kernel/locking/spinlock.c:164
serial8250_console_write+0x292/0x320 drivers/tty/serial/8250/8250_port.c:3458
call_console_drivers.constprop.0+0x185/0x240 kernel/printk/printk.c:1988
console_unlock+0x2b4/0x640 kernel/printk/printk.c:2648
register_console.part.0+0x2a1/0x390 kernel/printk/printk.c:3024
univ8250_console_init+0x24/0x2b drivers/tty/serial/8250/8250_core.c:724
console_init+0x188/0x24b kernel/printk/printk.c:3134
start_kernel+0x2b0/0x41e init/main.c:1072
secondary_startup_64_no_verify+0xc3/0xcb
-> #0 (console_owner){-...}-{0:0}:
check_prev_add+0xfa/0x1380 kernel/locking/lockdep.c:2988
check_prevs_add+0x1d8/0x3c0 kernel/locking/lockdep.c:3113
validate_chain+0x5df/0xac0 kernel/locking/lockdep.c:3729
__lock_acquire+0x514/0x940 kernel/locking/lockdep.c:4958
lock_acquire+0x15a/0x3a0 kernel/locking/lockdep.c:5569
console_trylock_spinning+0x10d/0x1d0 kernel/printk/printk.c:1921
vprintk_emit+0x1a5/0x270 kernel/printk/printk.c:2134
printk+0xb2/0xe7 kernel/printk/printk.c:2183
lookup_object.cold+0xf/0x24 mm/kmemleak.c:405
scan_block+0x1fa/0x220 mm/kmemleak.c:1357
scan_object+0xdd/0x140 mm/kmemleak.c:1415
scan_gray_list+0x8f/0x1c0 mm/kmemleak.c:1453
kmemleak_scan+0x649/0xf30 mm/kmemleak.c:1608
kmemleak_scan_thread+0x94/0xb6 mm/kmemleak.c:1721
kthread+0x1c4/0x210 kernel/kthread.c:328
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:299
other info that might help us debug this:
Chain exists of:
console_owner --> &port->lock --> kmemleak_lock
Link: https://lkml.kernel.org/r/20250730094914.566582-1-gubowen5@huawei.com
Signed-off-by: Gu Bowen <gubowen5(a)huawei.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Lu Jialin <lujialin4(a)huawei.com>
Cc: Waiman Long <longman(a)redhat.com>
Cc: Breno Leitao <leitao(a)debian.org>
Cc: <stable(a)vger.kernel.org> [5.10+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kmemleak.c | 2 ++
1 file changed, 2 insertions(+)
--- a/mm/kmemleak.c~mm-fix-possible-deadlock-in-console_trylock_spinning
+++ a/mm/kmemleak.c
@@ -437,9 +437,11 @@ static struct kmemleak_object *__lookup_
else if (untagged_objp == untagged_ptr || alias)
return object;
else {
+ __printk_safe_enter();
kmemleak_warn("Found object by alias at 0x%08lx\n",
ptr);
dump_object_info(object);
+ __printk_safe_exit();
break;
}
}
_
Patches currently in -mm which might be from gubowen5(a)huawei.com are
mm-fix-possible-deadlock-in-console_trylock_spinning.patch
The quilt patch titled
Subject: mm: shmem: fix the shmem large folio allocation for the i915 driver
has been removed from the -mm tree. Its filename was
mm-shmem-fix-the-shmem-large-folio-allocation-for-the-i915-driver.patch
This patch was dropped because an alternative patch was or shall be merged
------------------------------------------------------
From: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Subject: mm: shmem: fix the shmem large folio allocation for the i915 driver
Date: Mon, 28 Jul 2025 16:03:53 +0800
After commit acd7ccb284b8 ("mm: shmem: add large folio support for
tmpfs"), the 'huge=' option allows any-sized large folios for tmpfs:
tmpfs derives a highest-order hint from the size of the write() and
fallocate() requests and then tries each allowable large order.
However, when the i915 driver allocates shmem memory, it doesn't provide
a hint about the size of the large folio to be allocated, so PMD-sized
shmem cannot be allocated, which in turn hurts GPU performance.
To fix this issue, add 'end' information to shmem_read_folio_gfp() to
help allocate PMD-sized large folios. Additionally, use the maximum
allocation chunk (via mapping_max_folio_size()) to determine the size of
the large folios to allocate in the i915 driver.
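A condensed sketch of the caller-side usage (mirroring the i915 hunk
below; 'remaining' stands in for the bytes still to be populated):

	size_t chunk = mapping_max_folio_size(mapping);
	loff_t pos = (loff_t)index << PAGE_SHIFT;
	loff_t bytes = min_t(loff_t, chunk, remaining);
	/* the 'end' hint (pos + bytes) lets shmem pick a large folio order */
	folio = shmem_read_folio_gfp(mapping, index, pos + bytes, gfp);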
Patryk added:
: In my tests, the performance drop ranges from a few percent up to 13%
: in Unigine Superposition under heavy memory usage on the CPU Core Ultra
: 155H with the Xe 128 EU GPU. Other users have reported performance
: impact up to 30% on certain workloads. Please find more in the
: regressions reports:
: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14645
: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13845
:
: I believe the change should be backported to all active kernel branches
: after version 6.12.
Link: https://lkml.kernel.org/r/0d734549d5ed073c80b11601da3abdd5223e1889.17536898…
Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs")
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Reported-by: Patryk Kowalczyk <patryk(a)kowalczyk.ws>
Reported-by: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Tested-by: Patryk Kowalczyk <patryk(a)kowalczyk.ws>
Cc: Christian König <christian.koenig(a)amd.com>
Cc: Dave Airlie <airlied(a)gmail.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Huang Ray <Ray.Huang(a)amd.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst(a)linux.intel.com>
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: Matthew Auld <matthew.auld(a)intel.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Maxime Ripard <mripard(a)kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Tvrtko Ursulin <tursulin(a)ursulin.net>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/gpu/drm/drm_gem.c | 2 +-
drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 7 ++++++-
drivers/gpu/drm/ttm/ttm_backup.c | 2 +-
include/linux/shmem_fs.h | 4 ++--
mm/shmem.c | 7 ++++---
5 files changed, 14 insertions(+), 8 deletions(-)
--- a/drivers/gpu/drm/drm_gem.c~mm-shmem-fix-the-shmem-large-folio-allocation-for-the-i915-driver
+++ a/drivers/gpu/drm/drm_gem.c
@@ -627,7 +627,7 @@ struct page **drm_gem_get_pages(struct d
i = 0;
while (i < npages) {
long nr;
- folio = shmem_read_folio_gfp(mapping, i,
+ folio = shmem_read_folio_gfp(mapping, i, 0,
mapping_gfp_mask(mapping));
if (IS_ERR(folio))
goto fail;
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c~mm-shmem-fix-the-shmem-large-folio-allocation-for-the-i915-driver
+++ a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -69,6 +69,7 @@ int shmem_sg_alloc_table(struct drm_i915
struct scatterlist *sg;
unsigned long next_pfn = 0; /* suppress gcc warning */
gfp_t noreclaim;
+ size_t chunk;
int ret;
if (overflows_type(size / PAGE_SIZE, page_count))
@@ -94,6 +95,7 @@ int shmem_sg_alloc_table(struct drm_i915
mapping_set_unevictable(mapping);
noreclaim = mapping_gfp_constraint(mapping, ~__GFP_RECLAIM);
noreclaim |= __GFP_NORETRY | __GFP_NOWARN;
+ chunk = mapping_max_folio_size(mapping);
sg = st->sgl;
st->nents = 0;
@@ -105,10 +107,13 @@ int shmem_sg_alloc_table(struct drm_i915
0,
}, *s = shrink;
gfp_t gfp = noreclaim;
+ loff_t bytes = (page_count - i) << PAGE_SHIFT;
+ loff_t pos = i << PAGE_SHIFT;
+ bytes = min_t(loff_t, chunk, bytes);
do {
cond_resched();
- folio = shmem_read_folio_gfp(mapping, i, gfp);
+ folio = shmem_read_folio_gfp(mapping, i, pos + bytes, gfp);
if (!IS_ERR(folio))
break;
--- a/drivers/gpu/drm/ttm/ttm_backup.c~mm-shmem-fix-the-shmem-large-folio-allocation-for-the-i915-driver
+++ a/drivers/gpu/drm/ttm/ttm_backup.c
@@ -100,7 +100,7 @@ ttm_backup_backup_page(struct file *back
struct folio *to_folio;
int ret;
- to_folio = shmem_read_folio_gfp(mapping, idx, alloc_gfp);
+ to_folio = shmem_read_folio_gfp(mapping, idx, 0, alloc_gfp);
if (IS_ERR(to_folio))
return PTR_ERR(to_folio);
--- a/include/linux/shmem_fs.h~mm-shmem-fix-the-shmem-large-folio-allocation-for-the-i915-driver
+++ a/include/linux/shmem_fs.h
@@ -153,12 +153,12 @@ enum sgp_type {
int shmem_get_folio(struct inode *inode, pgoff_t index, loff_t write_end,
struct folio **foliop, enum sgp_type sgp);
struct folio *shmem_read_folio_gfp(struct address_space *mapping,
- pgoff_t index, gfp_t gfp);
+ pgoff_t index, loff_t end, gfp_t gfp);
static inline struct folio *shmem_read_folio(struct address_space *mapping,
pgoff_t index)
{
- return shmem_read_folio_gfp(mapping, index, mapping_gfp_mask(mapping));
+ return shmem_read_folio_gfp(mapping, index, 0, mapping_gfp_mask(mapping));
}
static inline struct page *shmem_read_mapping_page(
--- a/mm/shmem.c~mm-shmem-fix-the-shmem-large-folio-allocation-for-the-i915-driver
+++ a/mm/shmem.c
@@ -5930,6 +5930,7 @@ int shmem_zero_setup(struct vm_area_stru
* shmem_read_folio_gfp - read into page cache, using specified page allocation flags.
* @mapping: the folio's address_space
* @index: the folio index
+ * @end: end of a read if allocating a new folio
* @gfp: the page allocator flags to use if allocating
*
* This behaves as a tmpfs "read_cache_page_gfp(mapping, index, gfp)",
@@ -5942,14 +5943,14 @@ int shmem_zero_setup(struct vm_area_stru
* with the mapping_gfp_mask(), to avoid OOMing the machine unnecessarily.
*/
struct folio *shmem_read_folio_gfp(struct address_space *mapping,
- pgoff_t index, gfp_t gfp)
+ pgoff_t index, loff_t end, gfp_t gfp)
{
#ifdef CONFIG_SHMEM
struct inode *inode = mapping->host;
struct folio *folio;
int error;
- error = shmem_get_folio_gfp(inode, index, 0, &folio, SGP_CACHE,
+ error = shmem_get_folio_gfp(inode, index, end, &folio, SGP_CACHE,
gfp, NULL, NULL);
if (error)
return ERR_PTR(error);
@@ -5968,7 +5969,7 @@ EXPORT_SYMBOL_GPL(shmem_read_folio_gfp);
struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
pgoff_t index, gfp_t gfp)
{
- struct folio *folio = shmem_read_folio_gfp(mapping, index, gfp);
+ struct folio *folio = shmem_read_folio_gfp(mapping, index, 0, gfp);
struct page *page;
if (IS_ERR(folio))
_
Patches currently in -mm which might be from baolin.wang(a)linux.alibaba.com are
The following changes since commit 347e9f5043c89695b01e66b3ed111755afcf1911:
Linux 6.16-rc6 (2025-07-13 14:25:58 -0700)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to 6693731487a8145a9b039bc983d77edc47693855:
vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers (2025-08-01 09:11:09 -0400)
Changes from v1:
drop commits that I put in there by mistake. Sorry!
----------------------------------------------------------------
virtio, vhost: features, fixes
vhost can now support legacy threading
if enabled in Kconfig
vsock memory allocation strategies for
large buffers have been improved,
reducing pressure on kmalloc
vhost now supports the in-order feature
guest bits missed the merge window
fixes, cleanups all over the place
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
----------------------------------------------------------------
Alok Tiwari (4):
virtio: Fix typo in register_virtio_device() doc comment
vhost-scsi: Fix typos and formatting in comments and logs
vhost: Fix typos
vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit
Anders Roxell (1):
vdpa: Fix IDR memory leak in VDUSE module exit
Cindy Lu (1):
vhost: Reintroduce kthread API and add mode selection
Dr. David Alan Gilbert (2):
vhost: vringh: Remove unused iotlb functions
vhost: vringh: Remove unused functions
Dragos Tatulea (2):
vdpa/mlx5: Fix needs_teardown flag calculation
vdpa/mlx5: Fix release of uninitialized resources on error path
Gerd Hoffmann (1):
drm/virtio: implement virtio_gpu_shutdown
Jason Wang (3):
vhost: fail early when __vhost_add_used() fails
vhost: basic in order support
vhost_net: basic in_order support
Michael S. Tsirkin (2):
virtio: fix comments, readability
virtio: document ENOSPC
Mike Christie (1):
vhost-scsi: Fix log flooding with target does not exist errors
Pei Xiao (1):
vhost: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...))
Viresh Kumar (2):
virtio-mmio: Remove virtqueue list from mmio device
virtio-vdpa: Remove virtqueue list
WangYuli (1):
virtio: virtio_dma_buf: fix missing parameter documentation
Will Deacon (9):
vhost/vsock: Avoid allocating arbitrarily-sized SKBs
vsock/virtio: Validate length in packet header before skb_put()
vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put()
vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page
vsock/virtio: Rename virtio_vsock_alloc_skb()
vsock/virtio: Move SKB allocation lower-bound check to callers
vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers
vsock/virtio: Rename virtio_vsock_skb_rx_put()
vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers
drivers/gpu/drm/virtio/virtgpu_drv.c | 8 +-
drivers/vdpa/mlx5/core/mr.c | 3 +
drivers/vdpa/mlx5/net/mlx5_vnet.c | 12 +-
drivers/vdpa/vdpa_user/vduse_dev.c | 1 +
drivers/vhost/Kconfig | 18 ++
drivers/vhost/net.c | 88 +++++---
drivers/vhost/scsi.c | 24 +-
drivers/vhost/vhost.c | 377 ++++++++++++++++++++++++++++----
drivers/vhost/vhost.h | 30 ++-
drivers/vhost/vringh.c | 118 ----------
drivers/vhost/vsock.c | 15 +-
drivers/virtio/virtio.c | 7 +-
drivers/virtio/virtio_dma_buf.c | 2 +
drivers/virtio/virtio_mmio.c | 52 +----
drivers/virtio/virtio_ring.c | 4 +
drivers/virtio/virtio_vdpa.c | 44 +---
include/linux/virtio.h | 2 +-
include/linux/virtio_vsock.h | 46 +++-
include/linux/vringh.h | 12 -
include/uapi/linux/vhost.h | 29 +++
kernel/vhost_task.c | 2 +-
net/vmw_vsock/virtio_transport.c | 20 +-
net/vmw_vsock/virtio_transport_common.c | 3 +-
23 files changed, 575 insertions(+), 342 deletions(-)
The patch titled
Subject: mm/kmemleak: avoid deadlock by moving pr_warn() outside kmemleak_lock
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-kmemleak-avoid-deadlock-by-moving-pr_warn-outside-kmemleak_lock.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Breno Leitao <leitao(a)debian.org>
Subject: mm/kmemleak: avoid deadlock by moving pr_warn() outside kmemleak_lock
Date: Thu, 31 Jul 2025 02:57:18 -0700
When netpoll is enabled, calling pr_warn_once() while holding
kmemleak_lock in mem_pool_alloc() can cause a deadlock due to lock
inversion with the netconsole subsystem. This occurs because
pr_warn_once() may trigger netpoll, which eventually leads to
__alloc_skb() and back into kmemleak code, attempting to reacquire
kmemleak_lock.
This is the deadlock path:
mem_pool_alloc()
-> raw_spin_lock_irqsave(&kmemleak_lock, flags);
-> pr_warn_once()
-> netconsole subsystem
-> netpoll
-> __alloc_skb
-> __create_object
-> raw_spin_lock_irqsave(&kmemleak_lock, flags);
Fix this by setting a flag and issuing the pr_warn_once() after
kmemleak_lock is released.
Link: https://lkml.kernel.org/r/20250731-kmemleak_lock-v1-1-728fd470198f@debian.o…
Fixes: c5665868183fec ("mm: kmemleak: use the memory pool for early allocations")
Signed-off-by: Breno Leitao <leitao(a)debian.org>
Reported-by: Jakub Kicinski <kuba(a)kernel.org>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kmemleak.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/mm/kmemleak.c~mm-kmemleak-avoid-deadlock-by-moving-pr_warn-outside-kmemleak_lock
+++ a/mm/kmemleak.c
@@ -470,6 +470,7 @@ static struct kmemleak_object *mem_pool_
{
unsigned long flags;
struct kmemleak_object *object;
+ bool warn = false;
/* try the slab allocator first */
if (object_cache) {
@@ -488,8 +489,10 @@ static struct kmemleak_object *mem_pool_
else if (mem_pool_free_count)
object = &mem_pool[--mem_pool_free_count];
else
- pr_warn_once("Memory pool empty, consider increasing CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE\n");
+ warn = true;
raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
+ if (warn)
+ pr_warn_once("Memory pool empty, consider increasing CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE\n");
return object;
}
_
Patches currently in -mm which might be from leitao(a)debian.org are
mm-kmemleak-avoid-deadlock-by-moving-pr_warn-outside-kmemleak_lock.patch
The patch titled
Subject: mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTE
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-userfaultfd-fix-kmap_local-lifo-ordering-for-config_highpte.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Sasha Levin <sashal(a)kernel.org>
Subject: mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTE
Date: Thu, 31 Jul 2025 10:44:31 -0400
With CONFIG_HIGHPTE on 32-bit ARM, move_pages_pte() maps PTE pages using
kmap_local_page(), which requires unmapping in Last-In-First-Out order.
The current code maps dst_pte first, then src_pte, but unmaps them in the
same order (dst_pte, src_pte), violating the LIFO requirement. This
causes the warning in kunmap_local_indexed():
WARNING: CPU: 0 PID: 604 at mm/highmem.c:622 kunmap_local_indexed+0x178/0x17c
addr != __fix_to_virt(FIX_KMAP_BEGIN + idx)
Fix this by reversing the unmap order to respect LIFO ordering.
This issue follows the same pattern as similar fixes:
- commit eca6828403b8 ("crypto: skcipher - fix mismatch between mapping and unmapping order")
- commit 8cf57c6df818 ("nilfs2: eliminate staggered calls to kunmap in nilfs_rename")
Both addressed the same fundamental requirement: kmap_local operations
must follow LIFO ordering.
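A minimal standalone sketch of the rule (hypothetical pages pa and pb;
pte_unmap() ends up in kunmap_local() when CONFIG_HIGHPTE is set):

	void *a = kmap_local_page(pa);	/* mapped first */
	void *b = kmap_local_page(pb);	/* mapped second */
	/* ... access both mappings ... */
	kunmap_local(b);		/* last mapped, unmapped first */
	kunmap_local(a);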
Link: https://lkml.kernel.org/r/20250731144431.773923-1-sashal@kernel.org
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/userfaultfd.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/mm/userfaultfd.c~mm-userfaultfd-fix-kmap_local-lifo-ordering-for-config_highpte
+++ a/mm/userfaultfd.c
@@ -1453,10 +1453,15 @@ out:
folio_unlock(src_folio);
folio_put(src_folio);
}
- if (dst_pte)
- pte_unmap(dst_pte);
+ /*
+ * Unmap in reverse order (LIFO) to maintain proper kmap_local
+ * index ordering when CONFIG_HIGHPTE is enabled. We mapped dst_pte
+ * first, then src_pte, so we must unmap src_pte first, then dst_pte.
+ */
if (src_pte)
pte_unmap(src_pte);
+ if (dst_pte)
+ pte_unmap(dst_pte);
mmu_notifier_invalidate_range_end(&range);
if (si)
put_swap_device(si);
_
Patches currently in -mm which might be from sashal(a)kernel.org are
mm-userfaultfd-fix-kmap_local-lifo-ordering-for-config_highpte.patch
The patch titled
Subject: mm/debug_vm_pgtable: clear page table entries at destroy_args()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-debug_vm_pgtable-clear-page-table-entries-at-destroy_args.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "Herton R. Krzesinski" <herton(a)redhat.com>
Subject: mm/debug_vm_pgtable: clear page table entries at destroy_args()
Date: Thu, 31 Jul 2025 18:40:51 -0300
The mm/debug_vm_pgtable test manually allocates page table entries for
the tests it runs, also using its own manually allocated mm_struct. That
in itself is ok, but when it exits, at destroy_args(), it fails to clear
those entries with the *_clear functions.
The problem is that this leaves stale entries. If another process
allocates an mm_struct with a pgd at the same address, it may end up
running into the stale entry. This is happening in practice on a debug
kernel with CONFIG_DEBUG_VM_PGTABLE=y; for example, this is the output
with some extra debugging I added (it prints a warning trace if
pgtables_bytes goes negative, in addition to the warning at the
check_mm() function):
[ 2.539353] debug_vm_pgtable: [get_random_vaddr ]: random_vaddr is 0x7ea247140000
[ 2.539366] kmem_cache info
[ 2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508
[ 2.539447] debug_vm_pgtable: [init_args ]: args->mm is 0x000000002267cc9e
(...)
[ 2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0
[ 2.552816] Modules linked in:
[ 2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY
[ 2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries
[ 2.552872] NIP: c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90
[ 2.552885] REGS: c0000000622e73b0 TRAP: 0700 Not tainted (6.12.0-105.debug_vm2.el10.ppc64le+debug)
[ 2.552899] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002822 XER: 0000000a
[ 2.552954] CFAR: c0000000008f03f0 IRQMASK: 0
[ 2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001
[ 2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff
[ 2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000
[ 2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb
[ 2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0
[ 2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000
[ 2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001
[ 2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760
[ 2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0
[ 2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0
[ 2.553199] Call Trace:
[ 2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable)
[ 2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0
[ 2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570
[ 2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650
[ 2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290
[ 2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0
[ 2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870
[ 2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150
[ 2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50
[ 2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0
[ 2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
(...)
[ 2.558892] ---[ end trace 0000000000000000 ]---
[ 2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1
[ 2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144
Here the modprobe process ended up with an mm_struct allocated from the
mm_struct slab that was previously used by the debug_vm_pgtable test.
That is not a problem, since the mm_struct is initialized again etc.;
however, if it ends up using the same pgd table, it bumps into the old
stale entry when clearing/freeing the page table entries, so it tries to
free an entry that is already gone (the one allocated by the
debug_vm_pgtable test). That also explains the negative pgtables_bytes,
since it accounts for entries not allocated by the current process.
As far as I looked, pgd_{alloc,free} etc. do not clear entries; clearing
of the entries is explicitly done in the free_pgtables->
free_pgd_range->free_p4d_range->free_pud_range->free_pmd_range->
free_pte_range path. However, the debug_vm_pgtable test does not call
free_pgtables, since it allocates the mm_struct and entries manually for
its tests and, e.g., does not go through page faults. So it should also
clear the entries manually before exiting, at destroy_args().
This problem was noticed in a repeated-reboot test on a powerpc host,
with a debug kernel with CONFIG_DEBUG_VM_PGTABLE enabled. It depends on
the system, but in a 100-reboot loop the problem could manifest once or
twice, if a process ends up getting the right mm->pgd entry with the
stale entries used by mm/debug_vm_pgtable. With this patch applied, I
could no longer reproduce the problem. I was also able to reproduce the
problem on the latest upstream kernel (6.16).
I also modified destroy_args() to use mmput() instead of mmdrop(); there
is no reason to hold an mm_users reference without releasing the
mm_struct entirely. In the output above with my debugging prints, I had
already patched it to use mmput(); that did not fix the problem, but it
helped with the debugging.
Link: https://lkml.kernel.org/r/20250731214051.4115182-1-herton@redhat.com
Fixes: 3c9b84f044a9e ("mm/debug_vm_pgtable: introduce struct pgtable_debug_args")
Signed-off-by: Herton R. Krzesinski <herton(a)redhat.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Cc: Gavin Shan <gshan(a)redhat.com>
Cc: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/debug_vm_pgtable.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/mm/debug_vm_pgtable.c~mm-debug_vm_pgtable-clear-page-table-entries-at-destroy_args
+++ a/mm/debug_vm_pgtable.c
@@ -1041,29 +1041,34 @@ static void __init destroy_args(struct p
/* Free page table entries */
if (args->start_ptep) {
+ pmd_clear(args->pmdp);
pte_free(args->mm, args->start_ptep);
mm_dec_nr_ptes(args->mm);
}
if (args->start_pmdp) {
+ pud_clear(args->pudp);
pmd_free(args->mm, args->start_pmdp);
mm_dec_nr_pmds(args->mm);
}
if (args->start_pudp) {
+ p4d_clear(args->p4dp);
pud_free(args->mm, args->start_pudp);
mm_dec_nr_puds(args->mm);
}
- if (args->start_p4dp)
+ if (args->start_p4dp) {
+ pgd_clear(args->pgdp);
p4d_free(args->mm, args->start_p4dp);
+ }
/* Free vma and mm struct */
if (args->vma)
vm_area_free(args->vma);
if (args->mm)
- mmdrop(args->mm);
+ mmput(args->mm);
}
static struct page * __init
_
Patches currently in -mm which might be from herton(a)redhat.com are
mm-debug_vm_pgtable-clear-page-table-entries-at-destroy_args.patch
To prevent timing attacks, HMAC value comparison needs to be constant
time. Replace the memcmp() with the correct function, crypto_memneq().
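As an illustration (hypothetical buffers; crypto_memneq() returns
non-zero when the inputs differ, in time independent of where they first
differ):

	/* timing-unsafe: may return as soon as the first byte differs */
	if (memcmp(rphash, received, SHA256_DIGEST_SIZE) != 0)
		return -EBADMSG;

	/* constant-time: examines every byte regardless of content */
	if (crypto_memneq(rphash, received, SHA256_DIGEST_SIZE))
		return -EBADMSG;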
Fixes: 1085b8276bb4 ("tpm: Add the rest of the session HMAC API")
Cc: stable(a)vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers(a)kernel.org>
---
drivers/char/tpm/Kconfig | 1 +
drivers/char/tpm/tpm2-sessions.c | 6 +++---
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index dddd702b2454a..f9d8a4e966867 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -31,10 +31,11 @@ config TCG_TPM2_HMAC
bool "Use HMAC and encrypted transactions on the TPM bus"
default X86_64
select CRYPTO_ECDH
select CRYPTO_LIB_AESCFB
select CRYPTO_LIB_SHA256
+ select CRYPTO_LIB_UTILS
help
Setting this causes us to deploy a scheme which uses request
and response HMACs in addition to encryption for
communicating with the TPM to prevent or detect bus snooping
and interposer attacks (see tpm-security.rst). Saying Y
diff --git a/drivers/char/tpm/tpm2-sessions.c b/drivers/char/tpm/tpm2-sessions.c
index bdb119453dfbe..5fbd62ee50903 100644
--- a/drivers/char/tpm/tpm2-sessions.c
+++ b/drivers/char/tpm/tpm2-sessions.c
@@ -69,10 +69,11 @@
#include <linux/unaligned.h>
#include <crypto/kpp.h>
#include <crypto/ecdh.h>
#include <crypto/hash.h>
#include <crypto/hmac.h>
+#include <crypto/utils.h>
/* maximum number of names the TPM must remember for authorization */
#define AUTH_MAX_NAMES 3
#define AES_KEY_BYTES AES_KEYSIZE_128
@@ -827,16 +828,15 @@ int tpm_buf_check_hmac_response(struct tpm_chip *chip, struct tpm_buf *buf,
sha256_update(&sctx, auth->our_nonce, sizeof(auth->our_nonce));
sha256_update(&sctx, &auth->attrs, 1);
/* we're done with the rphash, so put our idea of the hmac there */
tpm2_hmac_final(&sctx, auth->session_key, sizeof(auth->session_key)
+ auth->passphrase_len, rphash);
- if (memcmp(rphash, &buf->data[offset_s], SHA256_DIGEST_SIZE) == 0) {
- rc = 0;
- } else {
+ if (crypto_memneq(rphash, &buf->data[offset_s], SHA256_DIGEST_SIZE)) {
dev_err(&chip->dev, "TPM: HMAC check failed\n");
goto out;
}
+ rc = 0;
/* now do response decryption */
if (auth->attrs & TPM2_SA_ENCRYPT) {
/* need key and IV */
tpm2_KDFa(auth->session_key, SHA256_DIGEST_SIZE
--
2.50.1