In gs_can_open(), the URBs for the USB in-transfers are allocated, added
to the parent->rx_submitted anchor and submitted. In the completion
callback gs_usb_receive_bulk_callback(), the URB is processed and
resubmitted. In gs_can_close() the URBs are freed by calling
usb_kill_anchored_urbs(&parent->rx_submitted).
However, this does not take into account that the USB framework
unanchors the URB before the completion callback is called. This means
that once an in-URB has been completed, it is no longer anchored and is
ultimately not released in gs_can_close().
Fix the memory leak by re-anchoring the URB to the parent->rx_submitted
anchor in gs_usb_receive_bulk_callback() before resubmitting it.
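For illustration, a minimal sketch of the anchor lifecycle the fix
relies on (my_parent/my_complete/my_close are hypothetical names, not
the driver's code; the USB core calls are real):

#include <linux/usb.h>

struct my_parent {
	struct usb_anchor rx_submitted;	/* set up with init_usb_anchor() */
};

static void my_complete(struct urb *urb)
{
	struct my_parent *parent = urb->context;

	/* ... process the received data ... */

	/* The USB core has already unanchored the URB at this point,
	 * so re-anchor before resubmitting, or a later
	 * usb_kill_anchored_urbs() will never see this URB.
	 */
	usb_anchor_urb(urb, &parent->rx_submitted);
	if (usb_submit_urb(urb, GFP_ATOMIC))
		usb_unanchor_urb(urb);
}

static void my_close(struct my_parent *parent)
{
	/* kills all anchored URBs and drops their anchor references */
	usb_kill_anchored_urbs(&parent->rx_submitted);
}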
Fixes: d08e973a77d1 ("can: gs_usb: Added support for the GS_USB CAN devices")
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
---
Changes in v2:
- correct "Fixes" tag
- add stable@v.k.o on Cc
- Link to v1: https://patch.msgid.link/20251225-gs_usb-fix-memory-leak-v1-1-a7b09e70d01d@…
---
drivers/net/can/usb/gs_usb.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/can/usb/gs_usb.c b/drivers/net/can/usb/gs_usb.c
index a0233e550a5a..d093babbc320 100644
--- a/drivers/net/can/usb/gs_usb.c
+++ b/drivers/net/can/usb/gs_usb.c
@@ -751,6 +751,8 @@ static void gs_usb_receive_bulk_callback(struct urb *urb)
hf, parent->hf_size_rx,
gs_usb_receive_bulk_callback, parent);
+ usb_anchor_urb(urb, &parent->rx_submitted);
+
rc = usb_submit_urb(urb, GFP_ATOMIC);
/* USB failure take down all interfaces */
---
base-commit: 1806d210e5a8f431ad4711766ae4a333d407d972
change-id: 20251225-gs_usb-fix-memory-leak-062c24898cc9
Best regards,
--
Marc Kleine-Budde <mkl@pengutronix.de>
When both KASAN and SLAB_STORE_USER are enabled, accesses to
struct kasan_alloc_meta fields can be misaligned on 64-bit architectures.
This occurs because orig_size is currently defined as unsigned int,
which only guarantees 4-byte alignment. When struct kasan_alloc_meta is
placed after orig_size, it may end up at a 4-byte boundary rather than
the required 8-byte boundary on 64-bit systems.
Note that 64-bit architectures without HAVE_EFFICIENT_UNALIGNED_ACCESS
are assumed to require 64-bit accesses to be 64-bit aligned.
See HAVE_64BIT_ALIGNED_ACCESS and commit adab66b71abf ("Revert:
"ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"") for more details.
Change orig_size from unsigned int to unsigned long to ensure proper
alignment for any subsequent metadata. This should not waste additional
memory because kmalloc objects are already aligned to at least
ARCH_KMALLOC_MINALIGN.
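As a userspace illustration of the alignment arithmetic (the 48-byte
track area is an assumed value for the example, not the kernel's exact
layout):

#include <stdio.h>

int main(void)
{
	/* model: object end -> 2 * struct track -> orig_size -> KASAN
	 * metadata; assume 2 * sizeof(struct track) == 48 for the demo
	 */
	unsigned long off = 48;

	printf("unsigned int:  meta at %lu, mod 8 = %lu\n",
	       off + sizeof(unsigned int),
	       (off + sizeof(unsigned int)) % 8);
	printf("unsigned long: meta at %lu, mod 8 = %lu\n",
	       off + sizeof(unsigned long),
	       (off + sizeof(unsigned long)) % 8);
	return 0;
}

On a 64-bit system this prints offsets 52 (mod 8 = 4, misaligned) for
unsigned int and 56 (mod 8 = 0, aligned) for unsigned long.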
Suggested-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: stable@vger.kernel.org
Fixes: 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
mm/slub.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index ad71f01571f0..1c747435a6ab 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -857,7 +857,7 @@ static inline bool slab_update_freelist(struct kmem_cache *s, struct slab *slab,
* request size in the meta data area, for better debug and sanity check.
*/
static inline void set_orig_size(struct kmem_cache *s,
- void *object, unsigned int orig_size)
+ void *object, unsigned long orig_size)
{
void *p = kasan_reset_tag(object);
@@ -867,10 +867,10 @@ static inline void set_orig_size(struct kmem_cache *s,
p += get_info_end(s);
p += sizeof(struct track) * 2;
- *(unsigned int *)p = orig_size;
+ *(unsigned long *)p = orig_size;
}
-static inline unsigned int get_orig_size(struct kmem_cache *s, void *object)
+static inline unsigned long get_orig_size(struct kmem_cache *s, void *object)
{
void *p = kasan_reset_tag(object);
@@ -883,7 +883,7 @@ static inline unsigned int get_orig_size(struct kmem_cache *s, void *object)
p += get_info_end(s);
p += sizeof(struct track) * 2;
- return *(unsigned int *)p;
+ return *(unsigned long *)p;
}
#ifdef CONFIG_SLUB_DEBUG
@@ -1198,7 +1198,7 @@ static void print_trailer(struct kmem_cache *s, struct slab *slab, u8 *p)
off += 2 * sizeof(struct track);
if (slub_debug_orig_size(s))
- off += sizeof(unsigned int);
+ off += sizeof(unsigned long);
off += kasan_metadata_size(s, false);
@@ -1394,7 +1394,7 @@ static int check_pad_bytes(struct kmem_cache *s, struct slab *slab, u8 *p)
off += 2 * sizeof(struct track);
if (s->flags & SLAB_KMALLOC)
- off += sizeof(unsigned int);
+ off += sizeof(unsigned long);
}
off += kasan_metadata_size(s, false);
@@ -7949,7 +7949,7 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
/* Save the original kmalloc request size */
if (flags & SLAB_KMALLOC)
- size += sizeof(unsigned int);
+ size += sizeof(unsigned long);
}
#endif
--
2.43.0
The xe-ggtt-wq workqueue is meant to be allocated with WQ_MEM_RECLAIM,
but the flag was passed as the 3rd parameter (max_active) instead of
the 2nd (flags), creating an ordinary per-cpu workqueue with
max_active = 8 (the value of WQ_MEM_RECLAIM) and no flags set.
Fix this by passing WQ_MEM_RECLAIM as the 2nd parameter and keeping the
default max_active.
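For reference, alloc_workqueue() takes the flags before max_active
(include/linux/workqueue.h), and WQ_MEM_RECLAIM is (1 << 3) == 8, which
is how the mixed-up call silently became max_active = 8:

struct workqueue_struct *alloc_workqueue(const char *fmt,
					 unsigned int flags,
					 int max_active, ...);

wq = alloc_workqueue("xe-ggtt-wq", 0, WQ_MEM_RECLAIM); /* buggy: flags = 0, max_active = 8 */
wq = alloc_workqueue("xe-ggtt-wq", WQ_MEM_RECLAIM, 0); /* fixed: 0 selects the default max_active */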
Fixes: 60df57e496e4 ("drm/xe: Mark GGTT work queue with WQ_MEM_RECLAIM")
Cc: stable@vger.kernel.org
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
---
drivers/gpu/drm/xe/xe_ggtt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_ggtt.c b/drivers/gpu/drm/xe/xe_ggtt.c
index ef481b334af4..793d7324a395 100644
--- a/drivers/gpu/drm/xe/xe_ggtt.c
+++ b/drivers/gpu/drm/xe/xe_ggtt.c
@@ -322,7 +322,7 @@ int xe_ggtt_init_early(struct xe_ggtt *ggtt)
else
ggtt->pt_ops = &xelp_pt_ops;
- ggtt->wq = alloc_workqueue("xe-ggtt-wq", 0, WQ_MEM_RECLAIM);
+ ggtt->wq = alloc_workqueue("xe-ggtt-wq", WQ_MEM_RECLAIM, 0);
if (!ggtt->wq)
return -ENOMEM;
--
2.52.0
This series converts all DRM bridge drivers (*) from the now deprecated
of_drm_find_bridge() to its replacement of_drm_find_and_get_bridge() which
allows correct bridge refcounting. It also converts per-driver
"next_bridge" pointers to the unified drm_bridge::next_bridge which puts
the reference automatically on bridge deallocation.
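As a rough sketch of the conversion pattern described above (priv, np
and bridge are placeholder names; per-driver details vary):

/* before: the returned pointer is unrefcounted, so priv->next_bridge
 * can dangle if the bridge is removed
 */
priv->next_bridge = of_drm_find_bridge(np);

/* after: a reference is taken; storing it in the core-managed
 * drm_bridge::next_bridge means it is put automatically when the
 * owning bridge is deallocated, with no explicit put in teardown paths
 */
bridge->next_bridge = of_drm_find_and_get_bridge(np);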
This is part of the work to support hotplug of DRM bridges. The grand plan
was discussed in [0].
Here's the work breakdown (➜ marks the current series):
1. ➜ add refcounting to DRM bridges struct drm_bridge,
based on devm_drm_bridge_alloc()
A. ✔ add new alloc API and refcounting (v6.16)
B. ✔ convert all bridge drivers to new API (v6.17)
C. ✔ kunit tests (v6.17)
D. ✔ add get/put to drm_bridge_add/remove() + attach/detach()
and warn on old allocation pattern (v6.17)
E. ➜ add get/put on drm_bridge accessors
1. ✔ drm_bridge_chain_get_first_bridge(), add cleanup action (v6.18)
2. ✔ drm_bridge_get_prev_bridge() (v6.18)
3. ✔ drm_bridge_get_next_bridge() (v6.19)
4. ✔ drm_for_each_bridge_in_chain() (v6.19)
5. ✔ drm_bridge_connector_init (v6.19)
6. … protect encoder bridge chain with a mutex
7. ➜ of_drm_find_bridge
a. ✔… add of_drm_get_bridge(), convert basic direct users
(v6.20?, one driver still pending)
b. ➜ convert direct of_drm_get_bridge() users, part 2
c. … convert direct of_drm_get_bridge() users, part 3
d. convert direct of_drm_get_bridge() users, part 4
e. convert bridge-only drm_of_find_panel_or_bridge() users
8. drm_of_find_panel_or_bridge, *_of_get_bridge
9. ✔ enforce drm_bridge_add before drm_bridge_attach (v6.19)
F. ✔ debugfs improvements
1. ✔ add top-level 'bridges' file (v6.16)
2. ✔ show refcount and list lingering bridges (v6.19)
2. … handle gracefully atomic updates during bridge removal
A. ✔ Add drm_dev_enter/exit() to protect device resources (v6.20?)
B. … protect private_obj removal from list
3. … DSI host-device driver interaction
4. ✔ removing the need for the "always-disconnected" connector
5. finish the hotplug bridge work, moving code to the core and potentially
removing the hotplug-bridge itself (this needs to be clarified as
points 1-3 are developed)
[0] https://lore.kernel.org/lkml/20250206-hotplug-drm-bridge-v6-0-9d6f2c9c3058@…
This work is a continuation of the work to correctly handle bridge
refcounting for existing of_drm_find_bridge(). The ground work is in:
- commit 293a8fd7721a ("drm/bridge: add of_drm_find_and_get_bridge()")
- commit 9da0e06abda8 ("drm/bridge: deprecate of_drm_find_bridge()")
- commit 3fdeae134ba9 ("drm/bridge: add next_bridge pointer to struct drm_bridge")
The whole conversion is split in multiple series to make the review process
a bit smoother. Parts 3 and 4 are converting non-bridge drivers (mostly
encoders).
(*) One bridge driver (synopsys/dw-hdmi) is converted in another series,
together with its (non-bridge) users. Additionally this series converts
drm_of_panel_bridge_remove() which is a special case, and has a bugfix
for it too.
Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
---
Changes in v2:
- Fix bug in samsung-dsim code, patch 10;
update patch 12 on top of those changes
- Fix in imx8qxp-ldb code, patch 9
- Link to v1: https://lore.kernel.org/r/20260107-drm-bridge-alloc-getput-drm_of_find_brid…
---
Luca Ceresoli (12):
drm: of: drm_of_panel_bridge_remove(): fix device_node leak
drm: of: drm_of_panel_bridge_remove(): convert to of_drm_find_and_get_bridge()
drm/bridge: sii902x: convert to of_drm_find_and_get_bridge()
drm/bridge: thc63lvd1024: convert to of_drm_find_and_get_bridge()
drm/bridge: tfp410: convert to of_drm_find_and_get_bridge()
drm/bridge: tpd12s015: convert to of_drm_find_and_get_bridge()
drm/bridge: lt8912b: convert to of_drm_find_and_get_bridge()
drm/bridge: imx8mp-hdmi-pvi: convert to of_drm_find_and_get_bridge()
drm/bridge: imx8qxp-ldb: convert to of_drm_find_and_get_bridge()
drm/bridge: samsung-dsim: samsung_dsim_host_attach: use a temporary variable for the next bridge
drm/bridge: samsung-dsim: samsung_dsim_host_attach: don't use the bridge pointer as an error indicator
drm/bridge: samsung-dsim: samsung_dsim_host_attach: convert to of_drm_find_and_get_bridge()
drivers/gpu/drm/bridge/imx/imx8mp-hdmi-pvi.c | 15 ++++++-----
drivers/gpu/drm/bridge/imx/imx8qxp-ldb.c | 12 ++++++++-
drivers/gpu/drm/bridge/lontium-lt8912b.c | 31 +++++++++++------------
drivers/gpu/drm/bridge/samsung-dsim.c | 37 +++++++++++++++++++---------
drivers/gpu/drm/bridge/sii902x.c | 7 +++---
drivers/gpu/drm/bridge/thc63lvd1024.c | 7 +++---
drivers/gpu/drm/bridge/ti-tfp410.c | 27 ++++++++++----------
drivers/gpu/drm/bridge/ti-tpd12s015.c | 8 +++---
include/drm/bridge/samsung-dsim.h | 1 -
include/drm/drm_of.h | 6 ++++-
10 files changed, 86 insertions(+), 65 deletions(-)
---
base-commit: 814b88f91b121246b35be9c519b537041fe3d178
change-id: 20251223-drm-bridge-alloc-getput-drm_of_find_bridge-2-12c6bbcb6896
Best regards,
--
Luca Ceresoli <luca.ceresoli@bootlin.com>
xhci_alloc_command() allocates a command structure and, when the
second argument is true, also allocates a completion structure.
Currently, the error handling path in xhci_disable_slot() only frees
the command structure using kfree(), causing the completion structure
to leak.
Use xhci_free_command() instead of kfree(). xhci_free_command() correctly
frees both the command structure and the associated completion structure.
Since the command structure is allocated with zero-initialization,
command->in_ctx is NULL and will not be erroneously freed by
xhci_free_command().
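In other words (a simplified paraphrase of drivers/usb/host/xhci-mem.c,
not verbatim code):

/* roughly what xhci_alloc_command(xhci, true, mem_flags) sets up: */
command = kzalloc(sizeof(*command), mem_flags);
command->completion = kzalloc(sizeof(*command->completion), mem_flags);
init_completion(command->completion);

/* hence, on the error paths: */
kfree(command);			  /* frees command, leaks command->completion */
xhci_free_command(xhci, command); /* frees completion and command; the
				   * NULL command->in_ctx is handled safely
				   */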
This bug was found using an experimental static analysis tool we are
developing. The tool is based on the LLVM framework and is specifically
designed to detect memory management issues. It is currently under
active development and not yet publicly available, but we plan to
open-source it after our research is published.
The bug was originally detected on v6.13-rc1 using our static analysis
tool, and we have verified that the issue persists in the latest mainline
kernel.
We performed build testing on x86_64 with allyesconfig using GCC=11.4.0.
Since triggering these error paths in xhci_disable_slot() requires specific
hardware conditions or abnormal state, we were unable to construct a test
case to reliably trigger these specific error paths at runtime.
Fixes: 7faac1953ed1 ("xhci: avoid race between disable slot command and host runtime suspend")
CC: stable@vger.kernel.org
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
---
Changes in v3:
- Update the commit message to reflect that the analysis applies to the mainline kernel.
Changes in v2:
- Add detailed information required by researcher guidelines.
- Clarify the safety of using xhci_free_command() in this context.
- Correct the Fixes tag to point to the commit that introduced this issue.
drivers/usb/host/xhci.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 02c9bfe21ae2..f0beed054954 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -4137,7 +4137,7 @@ int xhci_disable_slot(struct xhci_hcd *xhci, u32 slot_id)
if (state == 0xffffffff || (xhci->xhc_state & XHCI_STATE_DYING) ||
(xhci->xhc_state & XHCI_STATE_HALTED)) {
spin_unlock_irqrestore(&xhci->lock, flags);
- kfree(command);
+ xhci_free_command(xhci, command);
return -ENODEV;
}
@@ -4145,7 +4145,7 @@ int xhci_disable_slot(struct xhci_hcd *xhci, u32 slot_id)
slot_id);
if (ret) {
spin_unlock_irqrestore(&xhci->lock, flags);
- kfree(command);
+ xhci_free_command(xhci, command);
return ret;
}
xhci_ring_cmd_db(xhci);
--
2.34.1
Syzbot has found a deadlock (analyzed by Lance Yang):
1) Task (5749): Holds folio_lock, then tries to acquire i_mmap_rwsem(read lock).
2) Task (5754): Holds i_mmap_rwsem(write lock), then tries to acquire
folio_lock.
migrate_pages()
-> migrate_hugetlbs()
-> unmap_and_move_huge_page() <- Takes folio_lock!
-> remove_migration_ptes()
-> __rmap_walk_file()
-> i_mmap_lock_read() <- Waits for i_mmap_rwsem(read lock)!
hugetlbfs_fallocate()
-> hugetlbfs_punch_hole() <- Takes i_mmap_rwsem(write lock)!
-> hugetlbfs_zero_partial_page()
-> filemap_lock_hugetlb_folio()
-> filemap_lock_folio()
-> __filemap_get_folio <- Waits for folio_lock!
The migration path is the one taking locks in the wrong order according
to the documentation at the top of mm/rmap.c. So expand the scope of the
existing i_mmap_lock to cover the calls to remove_migration_ptes() too.
This is (mostly) how it used to be after commit c0d0381ade79. That was
removed by 336bf30eb765 for both file & anon hugetlb pages when it should
only have been removed for anon hugetlb pages.
Fixes: 336bf30eb765 ("hugetlbfs: fix anon huge page migration race")
Reported-by: syzbot+2d9c96466c978346b55f@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/68e9715a.050a0220.1186a4.000d.GAE@google.com
Debugged-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: stable@vger.kernel.org
---
mm/migrate.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 5169f9717f60..4688b9e38cd2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1458,6 +1458,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
int page_was_mapped = 0;
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
+ enum ttu_flags ttu = 0;
if (folio_ref_count(src) == 1) {
/* page was freed from under us. So we are done. */
@@ -1498,8 +1499,6 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
goto put_anon;
if (folio_mapped(src)) {
- enum ttu_flags ttu = 0;
-
if (!folio_test_anon(src)) {
/*
* In shared mappings, try_to_unmap could potentially
@@ -1516,16 +1515,17 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
try_to_migrate(src, ttu);
page_was_mapped = 1;
-
- if (ttu & TTU_RMAP_LOCKED)
- i_mmap_unlock_write(mapping);
}
if (!folio_mapped(src))
rc = move_to_new_folio(dst, src, mode);
if (page_was_mapped)
- remove_migration_ptes(src, !rc ? dst : src, 0);
+ remove_migration_ptes(src, !rc ? dst : src,
+ ttu ? RMP_LOCKED : 0);
+
+ if (ttu & TTU_RMAP_LOCKED)
+ i_mmap_unlock_write(mapping);
unlock_put_anon:
folio_unlock(dst);
--
2.47.3
xhci_alloc_command() allocates a command structure and, when the
second argument is true, also allocates a completion structure.
Currently, the error handling path in xhci_disable_slot() only frees
the command structure using kfree(), causing the completion structure
to leak.
Use xhci_free_command() instead of kfree(). xhci_free_command() correctly
frees both the command structure and the associated completion structure.
Since the command structure is allocated with zero-initialization,
command->in_ctx is NULL and will not be erroneously freed by
xhci_free_command().
This bug was found using an experimental static analysis tool we are
developing. The tool is based on the LLVM framework and is specifically
designed to detect memory management issues. It is currently under
active development and not yet publicly available, but we plan to
open-source it after our research is published.
The analysis was performed on Linux kernel v6.13-rc1.
We performed build testing on x86_64 with allyesconfig using GCC=11.4.0.
Since triggering these error paths in xhci_disable_slot() requires specific
hardware conditions or abnormal state, we were unable to construct a test
case to reliably trigger these specific error paths at runtime.
Fixes: 7faac1953ed1 ("xhci: avoid race between disable slot command and host runtime suspend")
CC: stable@vger.kernel.org
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
---
Changes in v2:
- Add detailed information required by researcher guidelines.
- Clarify the safety of using xhci_free_command() in this context.
- Correct the Fixes tag to point to the commit that introduced this issue.
drivers/usb/host/xhci.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 02c9bfe21ae2..f0beed054954 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -4137,7 +4137,7 @@ int xhci_disable_slot(struct xhci_hcd *xhci, u32 slot_id)
if (state == 0xffffffff || (xhci->xhc_state & XHCI_STATE_DYING) ||
(xhci->xhc_state & XHCI_STATE_HALTED)) {
spin_unlock_irqrestore(&xhci->lock, flags);
- kfree(command);
+ xhci_free_command(xhci, command);
return -ENODEV;
}
@@ -4145,7 +4145,7 @@ int xhci_disable_slot(struct xhci_hcd *xhci, u32 slot_id)
slot_id);
if (ret) {
spin_unlock_irqrestore(&xhci->lock, flags);
- kfree(command);
+ xhci_free_command(xhci, command);
return ret;
}
xhci_ring_cmd_db(xhci);
--
2.34.1
When multiple registered buffers share the same compound page, only the
first buffer accounts for the memory via io_buffer_account_pin(). The
subsequent buffers skip accounting since headpage_already_acct() returns
true.
When the first buffer is unregistered, the accounting is decremented,
but the compound page remains pinned by the remaining buffers. This
creates a state where pinned memory is not properly accounted against
RLIMIT_MEMLOCK.
On systems with HugeTLB pages pre-allocated, an unprivileged user can
exploit this to pin memory beyond RLIMIT_MEMLOCK by cycling buffer
registrations. The bypass amount is proportional to the number of
available huge pages, potentially allowing gigabytes of memory to be
pinned while the kernel accounting shows near-zero.
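A hedged userspace sketch of the problematic setup (error handling
omitted; the per-slot unregister step is described only in comments,
since its exact mechanics depend on sparse tables /
IORING_REGISTER_BUFFERS_UPDATE and the liburing version in use):

#include <liburing.h>
#include <sys/mman.h>
#include <sys/uio.h>

#define HPAGE_SZ (2UL * 1024 * 1024)

static void register_two_buffers_on_one_hugepage(struct io_uring *ring)
{
	void *p = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	struct iovec iov[2] = {
		{ .iov_base = p,                .iov_len = 4096 },
		{ .iov_base = (char *)p + 4096, .iov_len = 4096 },
	};

	/* Both buffers pin the same compound page, but only iov[0] is
	 * charged against RLIMIT_MEMLOCK (headpage_already_acct()
	 * suppresses the second charge).
	 */
	io_uring_register_buffers(ring, iov, 2);

	/* If slot 0 is then dropped (e.g. replaced through
	 * IORING_REGISTER_BUFFERS_UPDATE), the buggy kernel removes the
	 * whole charge while slot 1 still pins the huge page; repeating
	 * this across many huge pages accumulates unaccounted pinned
	 * memory.
	 */
}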
Fix this by recalculating the actual pages to unaccount when unmapping
a buffer. For regular pages, always unaccount. For compound pages, only
unaccount if no other registered buffer references the same compound
page. This ensures the accounting persists until the last buffer
referencing the compound page is released.
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Fixes: 57bebf807e2a ("io_uring/rsrc: optimise registered huge pages")
Cc: stable@vger.kernel.org
Signed-off-by: Yuhao Jiang <danisjiang@gmail.com>
---
io_uring/rsrc.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 67 insertions(+), 2 deletions(-)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index a63474b331bf..dcf2340af5a2 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -139,15 +139,80 @@ static void io_free_imu(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
kvfree(imu);
}
+/*
+ * Calculate pages to unaccount when unmapping a buffer. Regular pages are
+ * always counted. Compound pages are only counted if no other registered
+ * buffer references them, ensuring accounting persists until the last user.
+ */
+static unsigned long io_buffer_calc_unaccount(struct io_ring_ctx *ctx,
+ struct io_mapped_ubuf *imu)
+{
+ struct page *last_hpage = NULL;
+ unsigned long acct = 0;
+ unsigned int i;
+
+ for (i = 0; i < imu->nr_bvecs; i++) {
+ struct page *page = imu->bvec[i].bv_page;
+ struct page *hpage;
+ unsigned int j;
+
+ if (!PageCompound(page)) {
+ acct++;
+ continue;
+ }
+
+ hpage = compound_head(page);
+ if (hpage == last_hpage)
+ continue;
+ last_hpage = hpage;
+
+ /* Check if we already processed this hpage earlier in this buffer */
+ for (j = 0; j < i; j++) {
+ if (PageCompound(imu->bvec[j].bv_page) &&
+ compound_head(imu->bvec[j].bv_page) == hpage)
+ goto next_hpage;
+ }
+
+ /* Only unaccount if no other buffer references this page */
+ for (j = 0; j < ctx->buf_table.nr; j++) {
+ struct io_rsrc_node *node = ctx->buf_table.nodes[j];
+ struct io_mapped_ubuf *other;
+ unsigned int k;
+
+ if (!node)
+ continue;
+ other = node->buf;
+ if (other == imu)
+ continue;
+
+ for (k = 0; k < other->nr_bvecs; k++) {
+ struct page *op = other->bvec[k].bv_page;
+
+ if (!PageCompound(op))
+ continue;
+ if (compound_head(op) == hpage)
+ goto next_hpage;
+ }
+ }
+ acct += page_size(hpage) >> PAGE_SHIFT;
+next_hpage:
+ ;
+ }
+ return acct;
+}
+
static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
{
+ unsigned long acct;
+
if (unlikely(refcount_read(&imu->refs) > 1)) {
if (!refcount_dec_and_test(&imu->refs))
return;
}
- if (imu->acct_pages)
- io_unaccount_mem(ctx->user, ctx->mm_account, imu->acct_pages);
+ acct = io_buffer_calc_unaccount(ctx, imu);
+ if (acct)
+ io_unaccount_mem(ctx->user, ctx->mm_account, acct);
imu->release(imu->priv);
io_free_imu(ctx, imu);
}
--
2.34.1
V1 -> V2:
- Renamed the `pmd_val` variable to `_pmd`, as the original name
collided with the pmd_val() macro and broke ppc builds; see [1].
[1] https://lore.kernel.org/stable/aS7lPZPYuChOTdXU@hyeyoo
- Added David Hildenbrand's Acked-by [2], thanks a lot!
[2] https://lore.kernel.org/linux-mm/ac8d7137-3819-4a75-9dd3-fb3d2259ebe4@kerne…
# TL;DR
previous discussion: https://lore.kernel.org/linux-mm/20250921232709.1608699-1-harry.yoo@oracle.…
A "bad pmd" error occurs due to race condition between
change_prot_numa() and THP migration. The mainline kernel does not have
this bug as commit 670ddd8cdc fixes the race condition. 6.1.y, 5.15.y,
5.10.y, 5.4.y are affected by this bug.
Fixing this in -stable kernels is tricky because pte_offset_map_lock()
has different semantics in pre-6.5 and post-6.5 kernels. I am trying to
backport the same mechanism we have in the mainline kernel.
Since the code looks a bit different due to the different semantics of
pte_offset_map_lock(), it'd be best to get this reviewed by MM folks.
# Testing
I verified that the bug described below is no longer reproduced
(on a downstream kernel) after applying this patch series. It used to
trigger within a few days of intensive NUMA balancing testing, but the
kernel survived 2 weeks with this series applied.
# Bug Description
It was reported that a bad pmd is seen when automatic NUMA
balancing is marking page table entries as prot_numa:
[2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02)
[2437548.235022] Call Trace:
[2437548.238234] <TASK>
[2437548.241060] dump_stack_lvl+0x46/0x61
[2437548.245689] panic+0x106/0x2e5
[2437548.249497] pmd_clear_bad+0x3c/0x3c
[2437548.253967] change_pmd_range.isra.0+0x34d/0x3a7
[2437548.259537] change_p4d_range+0x156/0x20e
[2437548.264392] change_protection_range+0x116/0x1a9
[2437548.269976] change_prot_numa+0x15/0x37
[2437548.274774] task_numa_work+0x1b8/0x302
[2437548.279512] task_work_run+0x62/0x95
[2437548.283882] exit_to_user_mode_loop+0x1a4/0x1a9
[2437548.289277] exit_to_user_mode_prepare+0xf4/0xfc
[2437548.294751] ? sysvec_apic_timer_interrupt+0x34/0x81
[2437548.300677] irqentry_exit_to_user_mode+0x5/0x25
[2437548.306153] asm_sysvec_apic_timer_interrupt+0x16/0x1b
This is due to a race condition between change_prot_numa() and
THP migration because the kernel doesn't check is_swap_pmd() and
pmd_trans_huge() atomically:
change_prot_numa()                         THP migration
======================================================================
- change_pmd_range()
  -> is_swap_pmd() returns false,
     meaning it's not a PMD migration
     entry.
                                           - do_huge_pmd_numa_page()
                                             -> migrate_misplaced_page() sets
                                                migration entries for the THP.
- change_pmd_range()
  -> pmd_none_or_clear_bad_unless_trans_huge()
     -> pmd_none() and pmd_trans_huge() return false
- pmd_none_or_clear_bad_unless_trans_huge()
  -> pmd_bad() returns true for the migration entry!
The upstream commit 670ddd8cdcbd ("mm/mprotect: delete
pmd_none_or_clear_bad_unless_trans_huge()") closes this race condition
by checking is_swap_pmd() and pmd_trans_huge() atomically.
# Backporting note
commit a79390f5d6a7 ("mm/mprotect: use long for page accountings and retval")
is backported to return an error code (negative value) in
change_pte_range().
Unlike in the mainline, pte_offset_map_lock() does not check whether the
pmd entry is a migration entry or a hugepage; it acquires the PTL
unconditionally instead of returning failure. Therefore, it is necessary
to keep the !is_swap_pmd() && !pmd_trans_huge() && !pmd_devmap() checks
in change_pmd_range() before acquiring the PTL.
After acquiring the lock, open-code the semantics of the mainline
pte_offset_map_lock(): change_pte_range() fails if the pmd value has
changed. This requires adding a pmd_old parameter (the pmd_t value read
before calling the function) to change_pte_range().
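Sketched, the open-coded re-check looks roughly like this (the exact
signature and names are illustrative of the approach, not the literal
backport):

static long change_pte_range(struct mmu_gather *tlb,
		struct vm_area_struct *vma, pmd_t *pmd, pmd_t pmd_old,
		unsigned long addr, unsigned long end, pgprot_t newprot,
		unsigned long cp_flags)
{
	long pages = 0;
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
	/*
	 * In pre-6.5 kernels pte_offset_map_lock() cannot fail, so
	 * re-check under the PTL: if the pmd changed under us (e.g.
	 * THP migration installed a migration entry), bail out and let
	 * the caller retry instead of hitting pmd_bad() later.
	 */
	if (!pmd_same(pmd_old, *pmd)) {
		pte_unmap_unlock(pte, ptl);
		return -EAGAIN;
	}

	/* ... walk and update the PTEs as before ... */
	pte_unmap_unlock(pte, ptl);
	return pages;
}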
Hugh Dickins (1):
mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()
Peter Xu (1):
mm/mprotect: use long for page accountings and retval
include/linux/hugetlb.h | 4 +-
include/linux/mm.h | 2 +-
mm/hugetlb.c | 4 +-
mm/mempolicy.c | 2 +-
mm/mprotect.c | 124 +++++++++++++++++-----------------------
5 files changed, 60 insertions(+), 76 deletions(-)
--
2.43.0