From: Youling Tang <tangyouling(a)kylinos.cn>
Automatically disable kaslr when the kernel loads from kexec_file.
kexec_file loads the secondary kernel image to a non-linked address,
inherently providing KASLR-like randomization.
However, on LoongArch where System RAM may be non-contiguous, enabling
KASLR for the second kernel could relocate it to an invalid memory
region and cause boot failure. Thus, we disable KASLR when
"kexec_file" is detected in the command line.
To ensure compatibility with older kernels loaded via kexec_file,
this patch need be backported to stable branches.
Cc: stable(a)vger.kernel.org
Signed-off-by: Youling Tang <tangyouling(a)kylinos.cn>
---
arch/loongarch/kernel/relocate.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/loongarch/kernel/relocate.c b/arch/loongarch/kernel/relocate.c
index 50c469067f3a..4c097532cb88 100644
--- a/arch/loongarch/kernel/relocate.c
+++ b/arch/loongarch/kernel/relocate.c
@@ -140,6 +140,10 @@ static inline __init bool kaslr_disabled(void)
if (str == boot_command_line || (str > boot_command_line && *(str - 1) == ' '))
return true;
+ str = strstr(boot_command_line, "kexec_file");
+ if (str == boot_command_line || (str > boot_command_line && *(str - 1) == ' '))
+ return true;
+
#ifdef CONFIG_HIBERNATION
str = strstr(builtin_cmdline, "nohibernate");
if (str == builtin_cmdline || (str > builtin_cmdline && *(str - 1) == ' '))
--
2.43.0
When SRIOV is enabled, RSS is also supported for the functions. There is
an incorrect flag judgement which causes RSS to fail to be enabled.
Fixes: c52d4b898901 ("net: libwx: Redesign flow when sriov is enabled")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jiawen Wu <jiawenwu(a)trustnetic.com>
---
drivers/net/ethernet/wangxun/libwx/wx_hw.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/net/ethernet/wangxun/libwx/wx_hw.c b/drivers/net/ethernet/wangxun/libwx/wx_hw.c
index bcd07a715752..5cb353a97d6d 100644
--- a/drivers/net/ethernet/wangxun/libwx/wx_hw.c
+++ b/drivers/net/ethernet/wangxun/libwx/wx_hw.c
@@ -2078,10 +2078,6 @@ static void wx_setup_mrqc(struct wx *wx)
{
u32 rss_field = 0;
- /* VT, and RSS do not coexist at the same time */
- if (test_bit(WX_FLAG_VMDQ_ENABLED, wx->flags))
- return;
-
/* Disable indicating checksum in descriptor, enables RSS hash */
wr32m(wx, WX_PSR_CTL, WX_PSR_CTL_PCSD, WX_PSR_CTL_PCSD);
--
2.48.1
Fix multiple fwnode reference leaks:
1. The function calls fwnode_get_named_child_node() to get the "leds" node,
but never calls fwnode_handle_put(leds) to release this reference.
2. Within the fwnode_for_each_child_node() loop, the early return
paths that don't properly release the "led" fwnode reference.
This fix follows the same pattern as commit d029edefed39
("net dsa: qca8k: fix usages of device_get_named_child_node()")
Fixes: 94a2a84f5e9e ("net: dsa: mv88e6xxx: Support LED control")
Cc: stable(a)vger.kernel.org
Signed-off-by: Miaoqian Lin <linmq006(a)gmail.com>
---
changes in v2:
- use goto for cleanup in error paths
- v1: https://lore.kernel.org/all/20250830085508.2107507-1-linmq006@gmail.com/
---
drivers/net/dsa/mv88e6xxx/leds.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/drivers/net/dsa/mv88e6xxx/leds.c b/drivers/net/dsa/mv88e6xxx/leds.c
index 1c88bfaea46b..ab3bc645da56 100644
--- a/drivers/net/dsa/mv88e6xxx/leds.c
+++ b/drivers/net/dsa/mv88e6xxx/leds.c
@@ -779,7 +779,8 @@ int mv88e6xxx_port_setup_leds(struct mv88e6xxx_chip *chip, int port)
continue;
if (led_num > 1) {
dev_err(dev, "invalid LED specified port %d\n", port);
- return -EINVAL;
+ ret = -EINVAL;
+ goto err_put_led;
}
if (led_num == 0)
@@ -823,17 +824,25 @@ int mv88e6xxx_port_setup_leds(struct mv88e6xxx_chip *chip, int port)
init_data.devname_mandatory = true;
init_data.devicename = kasprintf(GFP_KERNEL, "%s:0%d:0%d", chip->info->name,
port, led_num);
- if (!init_data.devicename)
- return -ENOMEM;
+ if (!init_data.devicename) {
+ ret = -ENOMEM;
+ goto err_put_led;
+ }
ret = devm_led_classdev_register_ext(dev, l, &init_data);
kfree(init_data.devicename);
if (ret) {
dev_err(dev, "Failed to init LED %d for port %d", led_num, port);
- return ret;
+ goto err_put_led;
}
}
+ fwnode_handle_put(leds);
return 0;
+
+err_put_led:
+ fwnode_handle_put(led);
+ fwnode_handle_put(leds);
+ return ret;
}
--
2.35.1
commit efa95b01da18 ("netpoll: fix use after free") incorrectly
ignored the refcount and prematurely set dev->npinfo to NULL during
netpoll cleanup, leading to improper behavior and memory leaks.
Scenario causing lack of proper cleanup:
1) A netpoll is associated with a NIC (e.g., eth0) and netdev->npinfo is
allocated, and refcnt = 1
- Keep in mind that npinfo is shared among all netpoll instances. In
this case, there is just one.
2) Another netpoll is also associated with the same NIC and
npinfo->refcnt += 1.
- Now dev->npinfo->refcnt = 2;
- There is just one npinfo associated to the netdev.
3) When the first netpolls goes to clean up:
- The first cleanup succeeds and clears np->dev->npinfo, ignoring
refcnt.
- It basically calls `RCU_INIT_POINTER(np->dev->npinfo, NULL);`
- Set dev->npinfo = NULL, without proper cleanup
- No ->ndo_netpoll_cleanup() is either called
4) Now the second target tries to clean up
- The second cleanup fails because np->dev->npinfo is already NULL.
* In this case, ops->ndo_netpoll_cleanup() was never called, and
the skb pool is not cleaned as well (for the second netpoll
instance)
- This leaks npinfo and skbpool skbs, which is clearly reported by
kmemleak.
Revert commit efa95b01da18 ("netpoll: fix use after free") and adds
clarifying comments emphasizing that npinfo cleanup should only happen
once the refcount reaches zero, ensuring stable and correct netpoll
behavior.
Cc: stable(a)vger.kernel.org
Cc: jv(a)jvosburgh.net
Fixes: efa95b01da18 ("netpoll: fix use after free")
Signed-off-by: Breno Leitao <leitao(a)debian.org>
---
I have a selftest that shows the memory leak when kmemleak is enabled
and I will be submitting to net-next.
Also, giving I am reverting commit efa95b01da18 ("netpoll: fix use
after free"), which was supposed to fix a problem on bonding, I am
copying Jay.
---
net/core/netpoll.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 5f65b62346d4e..19676cd379640 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -815,6 +815,10 @@ static void __netpoll_cleanup(struct netpoll *np)
if (!npinfo)
return;
+ /* At this point, there is a single npinfo instance per netdevice, and
+ * its refcnt tracks how many netpoll structures are linked to it. We
+ * only perform npinfo cleanup when the refcnt decrements to zero.
+ */
if (refcount_dec_and_test(&npinfo->refcnt)) {
const struct net_device_ops *ops;
@@ -824,8 +828,7 @@ static void __netpoll_cleanup(struct netpoll *np)
RCU_INIT_POINTER(np->dev->npinfo, NULL);
call_rcu(&npinfo->rcu, rcu_cleanup_netpoll_info);
- } else
- RCU_INIT_POINTER(np->dev->npinfo, NULL);
+ }
skb_pool_flush(np);
}
---
base-commit: 864ecc4a6dade82d3f70eab43dad0e277aa6fc78
change-id: 20250901-netpoll_memleak-90d0d4bc772c
Best regards,
--
Breno Leitao <leitao(a)debian.org>
The patch titled
Subject: compiler-clang.h: define __SANITIZE_*__ macros only when undefined
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
compiler-clangh-define-__sanitize___-macros-only-when-undefined.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Nathan Chancellor <nathan(a)kernel.org>
Subject: compiler-clang.h: define __SANITIZE_*__ macros only when undefined
Date: Tue, 02 Sep 2025 15:49:26 -0700
Clang 22 recently added support for defining __SANITIZE__ macros similar
to GCC [1], which causes warnings (or errors with CONFIG_WERROR=y or W=e)
with the existing defines that the kernel creates to emulate this behavior
with existing clang versions.
In file included from <built-in>:3:
In file included from include/linux/compiler_types.h:171:
include/linux/compiler-clang.h:37:9: error: '__SANITIZE_THREAD__' macro redefined [-Werror,-Wmacro-redefined]
37 | #define __SANITIZE_THREAD__
| ^
<built-in>:352:9: note: previous definition is here
352 | #define __SANITIZE_THREAD__ 1
| ^
Refactor compiler-clang.h to only define the sanitizer macros when they
are undefined and adjust the rest of the code to use these macros for
checking if the sanitizers are enabled, clearing up the warnings and
allowing the kernel to easily drop these defines when the minimum
supported version of LLVM for building the kernel becomes 22.0.0 or newer.
Link: https://lkml.kernel.org/r/20250902-clang-update-sanitize-defines-v1-1-cf370…
Link: https://github.com/llvm/llvm-project/commit/568c23bbd3303518c5056d7f03444da… [1]
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
Reviewed-by: Justin Stitt <justinstitt(a)google.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Bill Wendling <morbo(a)google.com>
Cc: Dmitriy Vyukov <dvyukov(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/compiler-clang.h | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
--- a/include/linux/compiler-clang.h~compiler-clangh-define-__sanitize___-macros-only-when-undefined
+++ a/include/linux/compiler-clang.h
@@ -18,23 +18,42 @@
#define KASAN_ABI_VERSION 5
/*
+ * Clang 22 added preprocessor macros to match GCC, in hopes of eventually
+ * dropping __has_feature support for sanitizers:
+ * https://github.com/llvm/llvm-project/commit/568c23bbd3303518c5056d7f03444da…
+ * Create these macros for older versions of clang so that it is easy to clean
+ * up once the minimum supported version of LLVM for building the kernel always
+ * creates these macros.
+ *
* Note: Checking __has_feature(*_sanitizer) is only true if the feature is
* enabled. Therefore it is not required to additionally check defined(CONFIG_*)
* to avoid adding redundant attributes in other configurations.
*/
+#if __has_feature(address_sanitizer) && !defined(__SANITIZE_ADDRESS__)
+#define __SANITIZE_ADDRESS__
+#endif
+#if __has_feature(hwaddress_sanitizer) && !defined(__SANITIZE_HWADDRESS__)
+#define __SANITIZE_HWADDRESS__
+#endif
+#if __has_feature(thread_sanitizer) && !defined(__SANITIZE_THREAD__)
+#define __SANITIZE_THREAD__
+#endif
-#if __has_feature(address_sanitizer) || __has_feature(hwaddress_sanitizer)
-/* Emulate GCC's __SANITIZE_ADDRESS__ flag */
+/*
+ * Treat __SANITIZE_HWADDRESS__ the same as __SANITIZE_ADDRESS__ in the kernel.
+ */
+#ifdef __SANITIZE_HWADDRESS__
#define __SANITIZE_ADDRESS__
+#endif
+
+#ifdef __SANITIZE_ADDRESS__
#define __no_sanitize_address \
__attribute__((no_sanitize("address", "hwaddress")))
#else
#define __no_sanitize_address
#endif
-#if __has_feature(thread_sanitizer)
-/* emulate gcc's __SANITIZE_THREAD__ flag */
-#define __SANITIZE_THREAD__
+#ifdef __SANITIZE_THREAD__
#define __no_sanitize_thread \
__attribute__((no_sanitize("thread")))
#else
_
Patches currently in -mm which might be from nathan(a)kernel.org are
compiler-clangh-define-__sanitize___-macros-only-when-undefined.patch
mm-rmap-convert-enum-rmap_level-to-enum-pgtable_level-fix.patch
Clang 22 recently added support for defining __SANITIZE__ macros similar
to GCC [1], which causes warnings (or errors with CONFIG_WERROR=y or
W=e) with the existing defines that the kernel creates to emulate this
behavior with existing clang versions.
In file included from <built-in>:3:
In file included from include/linux/compiler_types.h:171:
include/linux/compiler-clang.h:37:9: error: '__SANITIZE_THREAD__' macro redefined [-Werror,-Wmacro-redefined]
37 | #define __SANITIZE_THREAD__
| ^
<built-in>:352:9: note: previous definition is here
352 | #define __SANITIZE_THREAD__ 1
| ^
Refactor compiler-clang.h to only define the sanitizer macros when they
are undefined and adjust the rest of the code to use these macros for
checking if the sanitizers are enabled, clearing up the warnings and
allowing the kernel to easily drop these defines when the minimum
supported version of LLVM for building the kernel becomes 22.0.0 or
newer.
Cc: stable(a)vger.kernel.org
Link: https://github.com/llvm/llvm-project/commit/568c23bbd3303518c5056d7f03444da… [1]
Signed-off-by: Nathan Chancellor <nathan(a)kernel.org>
---
Andrew, would it be possible to take this via mm-hotfixes?
---
include/linux/compiler-clang.h | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
index fa4ffe037bc7..8720a0705900 100644
--- a/include/linux/compiler-clang.h
+++ b/include/linux/compiler-clang.h
@@ -18,23 +18,42 @@
#define KASAN_ABI_VERSION 5
/*
+ * Clang 22 added preprocessor macros to match GCC, in hopes of eventually
+ * dropping __has_feature support for sanitizers:
+ * https://github.com/llvm/llvm-project/commit/568c23bbd3303518c5056d7f03444da…
+ * Create these macros for older versions of clang so that it is easy to clean
+ * up once the minimum supported version of LLVM for building the kernel always
+ * creates these macros.
+ *
* Note: Checking __has_feature(*_sanitizer) is only true if the feature is
* enabled. Therefore it is not required to additionally check defined(CONFIG_*)
* to avoid adding redundant attributes in other configurations.
*/
+#if __has_feature(address_sanitizer) && !defined(__SANITIZE_ADDRESS__)
+#define __SANITIZE_ADDRESS__
+#endif
+#if __has_feature(hwaddress_sanitizer) && !defined(__SANITIZE_HWADDRESS__)
+#define __SANITIZE_HWADDRESS__
+#endif
+#if __has_feature(thread_sanitizer) && !defined(__SANITIZE_THREAD__)
+#define __SANITIZE_THREAD__
+#endif
-#if __has_feature(address_sanitizer) || __has_feature(hwaddress_sanitizer)
-/* Emulate GCC's __SANITIZE_ADDRESS__ flag */
+/*
+ * Treat __SANITIZE_HWADDRESS__ the same as __SANITIZE_ADDRESS__ in the kernel.
+ */
+#ifdef __SANITIZE_HWADDRESS__
#define __SANITIZE_ADDRESS__
+#endif
+
+#ifdef __SANITIZE_ADDRESS__
#define __no_sanitize_address \
__attribute__((no_sanitize("address", "hwaddress")))
#else
#define __no_sanitize_address
#endif
-#if __has_feature(thread_sanitizer)
-/* emulate gcc's __SANITIZE_THREAD__ flag */
-#define __SANITIZE_THREAD__
+#ifdef __SANITIZE_THREAD__
#define __no_sanitize_thread \
__attribute__((no_sanitize("thread")))
#else
---
base-commit: b320789d6883cc00ac78ce83bccbfe7ed58afcf0
change-id: 20250902-clang-update-sanitize-defines-845000c29d2c
Best regards,
--
Nathan Chancellor <nathan(a)kernel.org>
Hi all,
A few months ago, the multi-fsblock untorn writes patchset added a bunch
of log intent item helper functions to estimate the number of intent
items that could be added to a particular transaction. Those helpers
enabled us to compute a safe upper bound on the number of blocks that
could be written in an untorn fashion with filesystem-provided out of
place writes.
Currently, the online fsck code employs static limits on the number of
intent items that it's willing to accrue to a single transaction when
it's trying to reap what it thinks are the old blocks from a corrupt
structure. There have been no problems reported with this approach
after years of testing, but static limits are scary and gross because
overestimating the intent item limit could result in transaction
overflows and dead filesystems; and underestimating causes unnecessary
overhead.
This series uses the new log intent item size helpers to estimate the
limits dynamically based on worst-case per-block repair work vs. the
size of the scrub transaction. After several months of testing this,
there don't seem to be any problems here either.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fi…
---
Commits in this patchset:
* xfs: prepare reaping code for dynamic limits
* xfs: convert the ifork reap code to use xreap_state
* xfs: use deferred intent items for reaping crosslinked blocks
* xfs: compute per-AG extent reap limits dynamically
* xfs: compute data device CoW staging extent reap limits dynamically
* xfs: compute realtime device CoW staging extent reap limits dynamically
* xfs: compute file mapping reap limits dynamically
* xfs: remove static reap limits
* xfs: use deferred reaping for data device cow extents
---
fs/xfs/scrub/repair.h | 8 -
fs/xfs/scrub/trace.h | 45 ++++
fs/xfs/scrub/newbt.c | 7 +
fs/xfs/scrub/reap.c | 622 +++++++++++++++++++++++++++++++++++++++----------
fs/xfs/scrub/trace.c | 1
5 files changed, 552 insertions(+), 131 deletions(-)
This reverts commit 71598a5a7797f0052aaa7bcff0b8d4b8f20f1441.
This commit introduced a regression, however the fix for the
regression:
aa5fc4362fac ("drm/amdgpu: fix task hang from failed job submission during process kill")
depends on things not yet present in 6.12.y and older kernels. Since
this commit is more of an optimization, just revert it for
6.12.y and older stable kernels.
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org # 6.1.x - 6.12.x
---
Please apply this revert to 6.1.x to 6.12.x stable trees. The newer
stable trees and Linus' tree already have the regression fix.
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 0adb106e2c42..37d53578825b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2292,11 +2292,13 @@ void amdgpu_vm_adjust_size(struct amdgpu_device *adev, uint32_t min_vm_size,
*/
long amdgpu_vm_wait_idle(struct amdgpu_vm *vm, long timeout)
{
- timeout = drm_sched_entity_flush(&vm->immediate, timeout);
+ timeout = dma_resv_wait_timeout(vm->root.bo->tbo.base.resv,
+ DMA_RESV_USAGE_BOOKKEEP,
+ true, timeout);
if (timeout <= 0)
return timeout;
- return drm_sched_entity_flush(&vm->delayed, timeout);
+ return dma_fence_wait_timeout(vm->last_unlocked, true, timeout);
}
static void amdgpu_vm_destroy_task_info(struct kref *kref)
--
2.51.0
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 1f403699c40f0806a707a9a6eed3b8904224021a
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025090146-playback-kinsman-373c@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1f403699c40f0806a707a9a6eed3b8904224021a Mon Sep 17 00:00:00 2001
From: Ma Ke <make24(a)iscas.ac.cn>
Date: Tue, 12 Aug 2025 15:19:32 +0800
Subject: [PATCH] drm/mediatek: Fix device/node reference count leaks in
mtk_drm_get_all_drm_priv
Using device_find_child() and of_find_device_by_node() to locate
devices could cause an imbalance in the device's reference count.
device_find_child() and of_find_device_by_node() both call
get_device() to increment the reference count of the found device
before returning the pointer. In mtk_drm_get_all_drm_priv(), these
references are never released through put_device(), resulting in
permanent reference count increments. Additionally, the
for_each_child_of_node() iterator fails to release node references in
all code paths. This leaks device node references when loop
termination occurs before reaching MAX_CRTC. These reference count
leaks may prevent device/node resources from being properly released
during driver unbind operations.
As comment of device_find_child() says, 'NOTE: you will need to drop
the reference with put_device() after use'.
Cc: stable(a)vger.kernel.org
Fixes: 1ef7ed48356c ("drm/mediatek: Modify mediatek-drm for mt8195 multi mmsys support")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
Reviewed-by: CK Hu <ck.hu(a)mediatek.com>
Link: https://patchwork.kernel.org/project/dri-devel/patch/20250812071932.471730-…
Signed-off-by: Chun-Kuang Hu <chunkuang.hu(a)kernel.org>
diff --git a/drivers/gpu/drm/mediatek/mtk_drm_drv.c b/drivers/gpu/drm/mediatek/mtk_drm_drv.c
index d5e6bab36414..f8a817689e16 100644
--- a/drivers/gpu/drm/mediatek/mtk_drm_drv.c
+++ b/drivers/gpu/drm/mediatek/mtk_drm_drv.c
@@ -387,19 +387,19 @@ static bool mtk_drm_get_all_drm_priv(struct device *dev)
of_id = of_match_node(mtk_drm_of_ids, node);
if (!of_id)
- continue;
+ goto next_put_node;
pdev = of_find_device_by_node(node);
if (!pdev)
- continue;
+ goto next_put_node;
drm_dev = device_find_child(&pdev->dev, NULL, mtk_drm_match);
if (!drm_dev)
- continue;
+ goto next_put_device_pdev_dev;
temp_drm_priv = dev_get_drvdata(drm_dev);
if (!temp_drm_priv)
- continue;
+ goto next_put_device_drm_dev;
if (temp_drm_priv->data->main_len)
all_drm_priv[CRTC_MAIN] = temp_drm_priv;
@@ -411,10 +411,17 @@ static bool mtk_drm_get_all_drm_priv(struct device *dev)
if (temp_drm_priv->mtk_drm_bound)
cnt++;
- if (cnt == MAX_CRTC) {
- of_node_put(node);
+next_put_device_drm_dev:
+ put_device(drm_dev);
+
+next_put_device_pdev_dev:
+ put_device(&pdev->dev);
+
+next_put_node:
+ of_node_put(node);
+
+ if (cnt == MAX_CRTC)
break;
- }
}
if (drm_priv->data->mmsys_dev_num == cnt) {
The current implementation of CXL memory hotplug notifier gets called
before the HMAT memory hotplug notifier. The CXL driver calculates the
access coordinates (bandwidth and latency values) for the CXL end to
end path (i.e. CPU to endpoint). When the CXL region is onlined, the CXL
memory hotplug notifier writes the access coordinates to the HMAT target
structs. Then the HMAT memory hotplug notifier is called and it creates
the access coordinates for the node sysfs attributes.
During testing on an Intel platform, it was found that although the
newly calculated coordinates were pushed to sysfs, the sysfs attributes for
the access coordinates showed up with the wrong initiator. The system has
4 nodes (0, 1, 2, 3) where node 0 and 1 are CPU nodes and node 2 and 3 are
CXL nodes. The expectation is that node 2 would show up as a target to node
0:
/sys/devices/system/node/node2/access0/initiators/node0
However it was observed that node 2 showed up as a target under node 1:
/sys/devices/system/node/node2/access0/initiators/node1
The original intent of the 'ext_updated' flag in HMAT handling code was to
stop HMAT memory hotplug callback from clobbering the access coordinates
after CXL has injected its calculated coordinates and replaced the generic
target access coordinates provided by the HMAT table in the HMAT target
structs. However the flag is hacky at best and blocks the updates from
other CXL regions that are onlined in the same node later on. Remove the
'ext_updated' flag usage and just update the access coordinates for the
nodes directly without touching HMAT target data.
The hotplug memory callback ordering is changed. Instead of changing CXL,
move HMAT back so there's room for the levels rather than have CXL share
the same level as SLAB_CALLBACK_PRI. The change will resulting in the CXL
callback to be executed after the HMAT callback.
With the change, the CXL hotplug memory notifier runs after the HMAT
callback. The HMAT callback will create the node sysfs attributes for
access coordinates. The CXL callback will write the access coordinates to
the now created node sysfs attributes directly and will not pollute the
HMAT target values.
A nodemask is introduced to keep track if a node has been updated and
prevents further updates.
Fixes: 067353a46d8c ("cxl/region: Add memory hotplug notifier for cxl region")
Cc: stable(a)vger.kernel.org
Tested-by: Marc Herbert <marc.herbert(a)linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
v3:
- Use nodemask instead of xarray to keep track of node updates (Jonathan)
---
drivers/acpi/numa/hmat.c | 6 ------
drivers/cxl/core/cdat.c | 5 -----
drivers/cxl/core/core.h | 1 -
drivers/cxl/core/region.c | 20 ++++++++++++--------
include/linux/memory.h | 2 +-
5 files changed, 13 insertions(+), 21 deletions(-)
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 4958301f5417..5d32490dc4ab 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -74,7 +74,6 @@ struct memory_target {
struct node_cache_attrs cache_attrs;
u8 gen_port_device_handle[ACPI_SRAT_DEVICE_HANDLE_SIZE];
bool registered;
- bool ext_updated; /* externally updated */
};
struct memory_initiator {
@@ -391,7 +390,6 @@ int hmat_update_target_coordinates(int nid, struct access_coordinate *coord,
coord->read_bandwidth, access);
hmat_update_target_access(target, ACPI_HMAT_WRITE_BANDWIDTH,
coord->write_bandwidth, access);
- target->ext_updated = true;
return 0;
}
@@ -773,10 +771,6 @@ static void hmat_update_target_attrs(struct memory_target *target,
u32 best = 0;
int i;
- /* Don't update if an external agent has changed the data. */
- if (target->ext_updated)
- return;
-
/* Don't update for generic port if there's no device handle */
if ((access == NODE_ACCESS_CLASS_GENPORT_SINK_LOCAL ||
access == NODE_ACCESS_CLASS_GENPORT_SINK_CPU) &&
diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index c0af645425f4..c891fd618cfd 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -1081,8 +1081,3 @@ int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
{
return hmat_update_target_coordinates(nid, &cxlr->coord[access], access);
}
-
-bool cxl_need_node_perf_attrs_update(int nid)
-{
- return !acpi_node_backed_by_real_pxm(nid);
-}
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 2669f251d677..a253d308f3c9 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -139,7 +139,6 @@ long cxl_pci_get_latency(struct pci_dev *pdev);
int cxl_pci_get_bandwidth(struct pci_dev *pdev, struct access_coordinate *c);
int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
enum access_coordinate_class access);
-bool cxl_need_node_perf_attrs_update(int nid);
int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
struct access_coordinate *c);
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 71cc42d05248..0ed95cbc5d5b 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -30,6 +30,12 @@
* 3. Decoder targets
*/
+/*
+ * nodemask that sets per node when the access_coordinates for the node has
+ * been updated by the CXL memory hotplug notifier.
+ */
+static nodemask_t nodemask_region_seen = NODE_MASK_NONE;
+
static struct cxl_region *to_cxl_region(struct device *dev);
#define __ACCESS_ATTR_RO(_level, _name) { \
@@ -2442,14 +2448,8 @@ static bool cxl_region_update_coordinates(struct cxl_region *cxlr, int nid)
for (int i = 0; i < ACCESS_COORDINATE_MAX; i++) {
if (cxlr->coord[i].read_bandwidth) {
- rc = 0;
- if (cxl_need_node_perf_attrs_update(nid))
- node_set_perf_attrs(nid, &cxlr->coord[i], i);
- else
- rc = cxl_update_hmat_access_coordinates(nid, cxlr, i);
-
- if (rc == 0)
- cset++;
+ node_update_perf_attrs(nid, &cxlr->coord[i], i);
+ cset++;
}
}
@@ -2487,6 +2487,10 @@ static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
if (nid != region_nid)
return NOTIFY_DONE;
+ /* No action needed if node bit already set */
+ if (node_test_and_set(nid, nodemask_region_seen))
+ return NOTIFY_DONE;
+
if (!cxl_region_update_coordinates(cxlr, nid))
return NOTIFY_DONE;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 1305102688d0..0b755d1ef1ec 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -120,8 +120,8 @@ struct mem_section;
*/
#define DEFAULT_CALLBACK_PRI 0
#define SLAB_CALLBACK_PRI 1
-#define HMAT_CALLBACK_PRI 2
#define CXL_CALLBACK_PRI 5
+#define HMAT_CALLBACK_PRI 6
#define MM_COMPUTE_BATCH_PRI 10
#define CPUSET_CALLBACK_PRI 10
#define MEMTIER_HOTPLUG_PRI 100
--
2.50.1
Hi Dr. Sam Lavi Cosmetic And Implant Dentistry ,
AI adoption in healthcare isn’t coming — it’s already here.
Some implant clinics are quietly using AI to handle patient calls, consultations, and scheduling.
Their growth isn’t a coincidence.
Being first gives them an edge — being late means playing catch-up.
Should I share a 2-min demo video so you can see how it works?
Reply 'Video' and I’ll send it.
—
Best regards,
Mohanish Ved
AI Growth Specialist
EditRage Solutions
VIRQs come in 3 flavors, per-VPU, per-domain, and global, and the VIRQs
are tracked in per-cpu virq_to_irq arrays.
Per-domain and global VIRQs must be bound on CPU 0, and
bind_virq_to_irq() sets the per_cpu virq_to_irq at registration time
Later, the interrupt can migrate, and info->cpu is updated. When
calling __unbind_from_irq(), the per-cpu virq_to_irq is cleared for a
different cpu. If bind_virq_to_irq() is called again with CPU 0, the
stale irq is returned. There won't be any irq_info for the irq, so
things break.
Make xen_rebind_evtchn_to_cpu() update the per_cpu virq_to_irq mappings
to keep them update to date with the current cpu. This ensures the
correct virq_to_irq is cleared in __unbind_from_irq().
Fixes: e46cdb66c8fc ("xen: event channels")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jason Andryuk <jason.andryuk(a)amd.com>
---
v3:
Kernel style brace placement
Delay setting old_cpu and tighten scope of variable
v2:
Different approach changing virq_to_irq
---
drivers/xen/events/events_base.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index b060b5a95f45..9478fae014e5 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1797,9 +1797,20 @@ static int xen_rebind_evtchn_to_cpu(struct irq_info *info, unsigned int tcpu)
* virq or IPI channel, which don't actually need to be rebound. Ignore
* it, but don't do the xenlinux-level rebind in that case.
*/
- if (HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, &bind_vcpu) >= 0)
+ if (HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, &bind_vcpu) >= 0) {
+ int old_cpu = info->cpu;
+
bind_evtchn_to_cpu(info, tcpu, false);
+ if (info->type == IRQT_VIRQ) {
+ int virq = info->u.virq;
+ int irq = per_cpu(virq_to_irq, old_cpu)[virq];
+
+ per_cpu(virq_to_irq, old_cpu)[virq] = -1;
+ per_cpu(virq_to_irq, tcpu)[virq] = irq;
+ }
+ }
+
do_unmask(info, EVT_MASK_REASON_TEMPORARY);
return 0;
--
2.34.1
Change find_virq() to return -EEXIST when a VIRQ is bound to a
different CPU than the one passed in. With that, remove the BUG_ON()
from bind_virq_to_irq() to propogate the error upwards.
Some VIRQs are per-cpu, but others are per-domain or global. Those must
be bound to CPU0 and can then migrate elsewhere. The lookup for
per-domain and global will probably fail when migrated off CPU 0,
especially when the current CPU is tracked. This now returns -EEXIST
instead of BUG_ON().
A second call to bind a per-domain or global VIRQ is not expected, but
make it non-fatal to avoid trying to look up the irq, since we don't
know which per_cpu(virq_to_irq) it will be in.
Cc: stable(a)vger.kernel.org
Signed-off-by: Jason Andryuk <jason.andryuk(a)amd.com>
---
v3:
Cc: stable as a pre-req for the subsequent virg tracking change
Call __unbind_from_irq() on error ro avoid leaking info
v2:
New
---
drivers/xen/events/events_base.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 374231d84e4f..b060b5a95f45 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1314,10 +1314,12 @@ int bind_interdomain_evtchn_to_irq_lateeoi(struct xenbus_device *dev,
}
EXPORT_SYMBOL_GPL(bind_interdomain_evtchn_to_irq_lateeoi);
-static int find_virq(unsigned int virq, unsigned int cpu, evtchn_port_t *evtchn)
+static int find_virq(unsigned int virq, unsigned int cpu, evtchn_port_t *evtchn,
+ bool percpu)
{
struct evtchn_status status;
evtchn_port_t port;
+ bool exists = false;
memset(&status, 0, sizeof(status));
for (port = 0; port < xen_evtchn_max_channels(); port++) {
@@ -1330,12 +1332,16 @@ static int find_virq(unsigned int virq, unsigned int cpu, evtchn_port_t *evtchn)
continue;
if (status.status != EVTCHNSTAT_virq)
continue;
- if (status.u.virq == virq && status.vcpu == xen_vcpu_nr(cpu)) {
+ if (status.u.virq != virq)
+ continue;
+ if (status.vcpu == xen_vcpu_nr(cpu)) {
*evtchn = port;
return 0;
+ } else if (!percpu) {
+ exists = true;
}
}
- return -ENOENT;
+ return exists ? -EEXIST : -ENOENT;
}
/**
@@ -1382,8 +1388,11 @@ int bind_virq_to_irq(unsigned int virq, unsigned int cpu, bool percpu)
evtchn = bind_virq.port;
else {
if (ret == -EEXIST)
- ret = find_virq(virq, cpu, &evtchn);
- BUG_ON(ret < 0);
+ ret = find_virq(virq, cpu, &evtchn, percpu);
+ if (ret) {
+ __unbind_from_irq(info, info->irq);
+ goto out;
+ }
}
ret = xen_irq_info_virq_setup(info, cpu, evtchn, virq);
--
2.34.1
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x ae668cd567a6a7622bc813ee0bb61c42bed61ba7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025090112-skillet-muskiness-5948@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ae668cd567a6a7622bc813ee0bb61c42bed61ba7 Mon Sep 17 00:00:00 2001
From: Eric Sandeen <sandeen(a)redhat.com>
Date: Fri, 22 Aug 2025 12:55:56 -0500
Subject: [PATCH] xfs: do not propagate ENODATA disk errors into xattr code
ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code;
namely, that the requested attribute name could not be found.
However, a medium error from disk may also return ENODATA. At best,
this medium error may escape to userspace as "attribute not found"
when in fact it's an IO (disk) error.
At worst, we may oops in xfs_attr_leaf_get() when we do:
error = xfs_attr_leaf_hasname(args, &bp);
if (error == -ENOATTR) {
xfs_trans_brelse(args->trans, bp);
return error;
}
because an ENODATA/ENOATTR error from disk leaves us with a null bp,
and the xfs_trans_brelse will then null-deref it.
As discussed on the list, we really need to modify the lower level
IO functions to trap all disk errors and ensure that we don't let
unique errors like this leak up into higher xfs functions - many
like this should be remapped to EIO.
However, this patch directly addresses a reported bug in the xattr
code, and should be safe to backport to stable kernels. A larger-scope
patch to handle more unique errors at lower levels can follow later.
(Note, prior to 07120f1abdff we did not oops, but we did return the
wrong error code to userspace.)
Signed-off-by: Eric Sandeen <sandeen(a)redhat.com>
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Cc: stable(a)vger.kernel.org # v5.9+
Reviewed-by: Darrick J. Wong <djwong(a)kernel.org>
Signed-off-by: Carlos Maiolino <cem(a)kernel.org>
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 4c44ce1c8a64..bff3dc226f81 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -435,6 +435,13 @@ xfs_attr_rmtval_get(
0, &bp, &xfs_attr3_rmt_buf_ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(args->dp, XFS_ATTR_FORK);
+ /*
+ * ENODATA from disk implies a disk medium failure;
+ * ENODATA for xattrs means attribute not found, so
+ * disambiguate that here.
+ */
+ if (error == -ENODATA)
+ error = -EIO;
if (error)
return error;
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 17d9e6154f19..723a0643b838 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2833,6 +2833,12 @@ xfs_da_read_buf(
&bp, ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(dp, whichfork);
+ /*
+ * ENODATA from disk implies a disk medium failure; ENODATA for
+ * xattrs means attribute not found, so disambiguate that here.
+ */
+ if (error == -ENODATA && whichfork == XFS_ATTR_FORK)
+ error = -EIO;
if (error)
goto out_free;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x ae668cd567a6a7622bc813ee0bb61c42bed61ba7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025090111-acorn-ammonia-8c45@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ae668cd567a6a7622bc813ee0bb61c42bed61ba7 Mon Sep 17 00:00:00 2001
From: Eric Sandeen <sandeen(a)redhat.com>
Date: Fri, 22 Aug 2025 12:55:56 -0500
Subject: [PATCH] xfs: do not propagate ENODATA disk errors into xattr code
ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code;
namely, that the requested attribute name could not be found.
However, a medium error from disk may also return ENODATA. At best,
this medium error may escape to userspace as "attribute not found"
when in fact it's an IO (disk) error.
At worst, we may oops in xfs_attr_leaf_get() when we do:
error = xfs_attr_leaf_hasname(args, &bp);
if (error == -ENOATTR) {
xfs_trans_brelse(args->trans, bp);
return error;
}
because an ENODATA/ENOATTR error from disk leaves us with a null bp,
and the xfs_trans_brelse will then null-deref it.
As discussed on the list, we really need to modify the lower level
IO functions to trap all disk errors and ensure that we don't let
unique errors like this leak up into higher xfs functions - many
like this should be remapped to EIO.
However, this patch directly addresses a reported bug in the xattr
code, and should be safe to backport to stable kernels. A larger-scope
patch to handle more unique errors at lower levels can follow later.
(Note, prior to 07120f1abdff we did not oops, but we did return the
wrong error code to userspace.)
Signed-off-by: Eric Sandeen <sandeen(a)redhat.com>
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Cc: stable(a)vger.kernel.org # v5.9+
Reviewed-by: Darrick J. Wong <djwong(a)kernel.org>
Signed-off-by: Carlos Maiolino <cem(a)kernel.org>
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 4c44ce1c8a64..bff3dc226f81 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -435,6 +435,13 @@ xfs_attr_rmtval_get(
0, &bp, &xfs_attr3_rmt_buf_ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(args->dp, XFS_ATTR_FORK);
+ /*
+ * ENODATA from disk implies a disk medium failure;
+ * ENODATA for xattrs means attribute not found, so
+ * disambiguate that here.
+ */
+ if (error == -ENODATA)
+ error = -EIO;
if (error)
return error;
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 17d9e6154f19..723a0643b838 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2833,6 +2833,12 @@ xfs_da_read_buf(
&bp, ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(dp, whichfork);
+ /*
+ * ENODATA from disk implies a disk medium failure; ENODATA for
+ * xattrs means attribute not found, so disambiguate that here.
+ */
+ if (error == -ENODATA && whichfork == XFS_ATTR_FORK)
+ error = -EIO;
if (error)
goto out_free;
From: "Mario Limonciello (AMD)" <superm1(a)kernel.org>
[ Upstream commit 23800ad1265f10c2bc6f42154ce4d20e59f2900e ]
The ASUS ProArt PX13 has a spurious wakeup event from the touchpad
a few moments after entering hardware sleep. This can be avoided
by preventing the touchpad from being a wake source.
Add to the wakeup ignore list.
Reported-by: Amit Chaudhari <amitchaudhari(a)mac.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4482
Tested-by: Amit Chaudhari <amitchaudhari(a)mac.com>
Signed-off-by: Mario Limonciello (AMD) <superm1(a)kernel.org>
Reviewed-by: Mika Westerberg <mika.westerberg(a)linux.intel.com>
Link: https://lore.kernel.org/20250814183430.3887973-1-superm1@kernel.org
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
LLM Generated explanations, may be completely bogus:
Based on my comprehensive analysis of this commit and the context, here
is my assessment:
**Backport Status: YES**
## Extensive Analysis:
### 1. **Nature of the Fix**
This commit fixes a real hardware bug - spurious wakeup events from the
touchpad on the ASUS ProArt PX13 laptop. The code change adds a DMI-
based quirk entry to the `gpiolib_acpi_quirks` table in
`/home/sasha/linux/drivers/gpio/gpiolib-acpi-quirks.c:350-359`, which
instructs the GPIO subsystem to ignore wake events from the specific
touchpad GPIO pin (`ASCP1A00:00@8`).
### 2. **符合稳定内核规则 (Meets Stable Kernel Rules)**
According to `/home/sasha/linux/Documentation/process/stable-kernel-
rules.rst`:
- **Fixes a real bug**: Yes - spurious wakeups are a real hardware issue
that impacts users' ability to use sleep/suspend effectively (lines
18-19 of rules)
- **Obviously correct and tested**: Yes - the fix is straightforward
(adding a quirk entry), has been tested by the reporter (Amit
Chaudhari), and reviewed by Mika Westerberg
- **Size constraint**: The patch is only ~20 lines with context, well
under the 100-line limit
- **Already in mainline**: Yes - commit
23800ad1265f10c2bc6f42154ce4d20e59f2900e
### 3. **Historical Precedent**
Multiple similar commits for spurious wakeup quirks have been backported
to stable:
- Commit `805c74eac8cb3` (GPD G1619-04 touchpad wakeup) - explicitly
marked with `Cc: stable(a)vger.kernel.org`
- Commit `782eea0c89f7d` (Clevo NL5xNU) - marked with `Cc:
stable(a)vger.kernel.org`
- Commit `a69982c37cd05` (Clevo NH5xAx) - marked with `Cc:
<stable(a)vger.kernel.org> # v6.1+`
### 4. **Code Structure Analysis**
The change follows the exact same pattern as other quirk entries in the
file:
```c
.driver_data = &(struct acpi_gpiolib_dmi_quirk) {
.ignore_wake = "ASCP1A00:00@8",
},
```
This is a data-only addition to an existing quirk table - no logic
changes, no new code paths, minimal regression risk.
### 5. **User Impact**
The bug causes spurious wakeups "a few moments after entering hardware
sleep", which:
- Prevents proper system suspend/sleep functionality
- Drains battery life on laptops
- Disrupts user workflow
- Is a clear hardware-specific issue that cannot be worked around by
users
### 6. **Risk Assessment**
- **Extremely low risk**: The change only affects systems that match the
specific DMI strings (ASUSTeK COMPUTER INC. + ProArt PX13)
- **No impact on other systems**: DMI matching ensures this quirk only
applies to the affected hardware
- **Well-established mechanism**: The ignore_wake infrastructure is
mature and widely used
### 7. **Backporting Considerations**
While this specific commit wasn't explicitly marked with `Cc: stable`,
it follows the exact same pattern as commits that were. The commit has
already been picked up by Sasha Levin's stable tree (as shown in the `[
Upstream commit ]` tag in the local repository), suggesting the stable
maintainers recognize its importance.
The fix is self-contained, requires no prerequisites, and can be cleanly
applied to any kernel version that has the `gpiolib-acpi-quirks.c` file
structure (introduced in commit `92dc572852ddc`).
### Conclusion
This is a textbook example of a stable-appropriate fix: it addresses a
specific hardware bug affecting real users, uses a well-established
quirk mechanism, has zero impact on unaffected systems, and follows the
exact pattern of previously backported fixes for identical issues on
different hardware.
drivers/gpio/gpiolib-acpi-quirks.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/drivers/gpio/gpiolib-acpi-quirks.c b/drivers/gpio/gpiolib-acpi-quirks.c
index c13545dce3492..bfb04e67c4bc8 100644
--- a/drivers/gpio/gpiolib-acpi-quirks.c
+++ b/drivers/gpio/gpiolib-acpi-quirks.c
@@ -344,6 +344,20 @@ static const struct dmi_system_id gpiolib_acpi_quirks[] __initconst = {
.ignore_interrupt = "AMDI0030:00@8",
},
},
+ {
+ /*
+ * Spurious wakeups from TP_ATTN# pin
+ * Found in BIOS 5.35
+ * https://gitlab.freedesktop.org/drm/amd/-/issues/4482
+ */
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."),
+ DMI_MATCH(DMI_PRODUCT_FAMILY, "ProArt PX13"),
+ },
+ .driver_data = &(struct acpi_gpiolib_dmi_quirk) {
+ .ignore_wake = "ASCP1A00:00@8",
+ },
+ },
{} /* Terminating entry */
};
--
2.50.1
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x ae668cd567a6a7622bc813ee0bb61c42bed61ba7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025090111-poem-retold-4ac2@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ae668cd567a6a7622bc813ee0bb61c42bed61ba7 Mon Sep 17 00:00:00 2001
From: Eric Sandeen <sandeen(a)redhat.com>
Date: Fri, 22 Aug 2025 12:55:56 -0500
Subject: [PATCH] xfs: do not propagate ENODATA disk errors into xattr code
ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code;
namely, that the requested attribute name could not be found.
However, a medium error from disk may also return ENODATA. At best,
this medium error may escape to userspace as "attribute not found"
when in fact it's an IO (disk) error.
At worst, we may oops in xfs_attr_leaf_get() when we do:
error = xfs_attr_leaf_hasname(args, &bp);
if (error == -ENOATTR) {
xfs_trans_brelse(args->trans, bp);
return error;
}
because an ENODATA/ENOATTR error from disk leaves us with a null bp,
and the xfs_trans_brelse will then null-deref it.
As discussed on the list, we really need to modify the lower level
IO functions to trap all disk errors and ensure that we don't let
unique errors like this leak up into higher xfs functions - many
like this should be remapped to EIO.
However, this patch directly addresses a reported bug in the xattr
code, and should be safe to backport to stable kernels. A larger-scope
patch to handle more unique errors at lower levels can follow later.
(Note, prior to 07120f1abdff we did not oops, but we did return the
wrong error code to userspace.)
Signed-off-by: Eric Sandeen <sandeen(a)redhat.com>
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Cc: stable(a)vger.kernel.org # v5.9+
Reviewed-by: Darrick J. Wong <djwong(a)kernel.org>
Signed-off-by: Carlos Maiolino <cem(a)kernel.org>
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 4c44ce1c8a64..bff3dc226f81 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -435,6 +435,13 @@ xfs_attr_rmtval_get(
0, &bp, &xfs_attr3_rmt_buf_ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(args->dp, XFS_ATTR_FORK);
+ /*
+ * ENODATA from disk implies a disk medium failure;
+ * ENODATA for xattrs means attribute not found, so
+ * disambiguate that here.
+ */
+ if (error == -ENODATA)
+ error = -EIO;
if (error)
return error;
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 17d9e6154f19..723a0643b838 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2833,6 +2833,12 @@ xfs_da_read_buf(
&bp, ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(dp, whichfork);
+ /*
+ * ENODATA from disk implies a disk medium failure; ENODATA for
+ * xattrs means attribute not found, so disambiguate that here.
+ */
+ if (error == -ENODATA && whichfork == XFS_ATTR_FORK)
+ error = -EIO;
if (error)
goto out_free;
From: Han Guangjiang <hanguangjiang(a)lixiang.com>
On repeated cold boots we occasionally hit a NULL pointer crash in
blk_should_throtl() when throttling is consulted before the throttle
policy is fully enabled for the queue. Checking only q->td != NULL is
insufficient during early initialization, so blkg_to_pd() for the
throttle policy can still return NULL and blkg_to_tg() becomes NULL,
which later gets dereferenced.
Tighten blk_throtl_activated() to also require that the throttle policy
bit is set on the queue:
return q->td != NULL &&
test_bit(blkcg_policy_throtl.plid, q->blkcg_pols);
This prevents blk_should_throtl() from accessing throttle group state
until policy data has been attached to blkgs.
Fixes: a3166c51702b ("blk-throttle: delay initialization until configuration")
Cc: stable(a)vger.kernel.org
Co-developed-by: Liang Jie <liangjie(a)lixiang.com>
Signed-off-by: Liang Jie <liangjie(a)lixiang.com>
Signed-off-by: Han Guangjiang <hanguangjiang(a)lixiang.com>
---
block/blk-throttle.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-throttle.h b/block/blk-throttle.h
index 3b27755bfbff..9ca43dc56eda 100644
--- a/block/blk-throttle.h
+++ b/block/blk-throttle.h
@@ -156,7 +156,7 @@ void blk_throtl_cancel_bios(struct gendisk *disk);
static inline bool blk_throtl_activated(struct request_queue *q)
{
- return q->td != NULL;
+ return q->td != NULL && test_bit(blkcg_policy_throtl.plid, q->blkcg_pols);
}
static inline bool blk_should_throtl(struct bio *bio)
--
2.25.1
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x ae668cd567a6a7622bc813ee0bb61c42bed61ba7
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025090105-strainer-glider-fdcd@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ae668cd567a6a7622bc813ee0bb61c42bed61ba7 Mon Sep 17 00:00:00 2001
From: Eric Sandeen <sandeen(a)redhat.com>
Date: Fri, 22 Aug 2025 12:55:56 -0500
Subject: [PATCH] xfs: do not propagate ENODATA disk errors into xattr code
ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code;
namely, that the requested attribute name could not be found.
However, a medium error from disk may also return ENODATA. At best,
this medium error may escape to userspace as "attribute not found"
when in fact it's an IO (disk) error.
At worst, we may oops in xfs_attr_leaf_get() when we do:
error = xfs_attr_leaf_hasname(args, &bp);
if (error == -ENOATTR) {
xfs_trans_brelse(args->trans, bp);
return error;
}
because an ENODATA/ENOATTR error from disk leaves us with a null bp,
and the xfs_trans_brelse will then null-deref it.
As discussed on the list, we really need to modify the lower level
IO functions to trap all disk errors and ensure that we don't let
unique errors like this leak up into higher xfs functions - many
like this should be remapped to EIO.
However, this patch directly addresses a reported bug in the xattr
code, and should be safe to backport to stable kernels. A larger-scope
patch to handle more unique errors at lower levels can follow later.
(Note, prior to 07120f1abdff we did not oops, but we did return the
wrong error code to userspace.)
Signed-off-by: Eric Sandeen <sandeen(a)redhat.com>
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Cc: stable(a)vger.kernel.org # v5.9+
Reviewed-by: Darrick J. Wong <djwong(a)kernel.org>
Signed-off-by: Carlos Maiolino <cem(a)kernel.org>
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 4c44ce1c8a64..bff3dc226f81 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -435,6 +435,13 @@ xfs_attr_rmtval_get(
0, &bp, &xfs_attr3_rmt_buf_ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(args->dp, XFS_ATTR_FORK);
+ /*
+ * ENODATA from disk implies a disk medium failure;
+ * ENODATA for xattrs means attribute not found, so
+ * disambiguate that here.
+ */
+ if (error == -ENODATA)
+ error = -EIO;
if (error)
return error;
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 17d9e6154f19..723a0643b838 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2833,6 +2833,12 @@ xfs_da_read_buf(
&bp, ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(dp, whichfork);
+ /*
+ * ENODATA from disk implies a disk medium failure; ENODATA for
+ * xattrs means attribute not found, so disambiguate that here.
+ */
+ if (error == -ENODATA && whichfork == XFS_ATTR_FORK)
+ error = -EIO;
if (error)
goto out_free;
VRAM+TT bos that are evicted from VRAM to TT may remain in
TT also after a revalidation following eviction or suspend.
This manifests itself as applications becoming sluggish
after buffer objects get evicted or after a resume from
suspend or hibernation.
If the bo supports placement in both VRAM and TT, and
we are on DGFX, mark the TT placement as fallback. This means
that it is tried only after VRAM + eviction.
This flaw has probably been present since the xe module was
upstreamed but use a Fixes: commit below where backporting is
likely to be simple. For earlier versions we need to open-
code the fallback algorithm in the driver.
v2:
- Remove check for dgfx. (Matthew Auld)
- Update the xe_dma_buf kunit test for the new strategy (CI)
- Allow dma-buf to pin in current placement (CI)
- Make xe_bo_validate() for pinned bos a NOP.
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5995
Fixes: a78a8da51b36 ("drm/ttm: replace busy placement with flags v6")
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: Matthew Auld <matthew.auld(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.9+
Signed-off-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld(a)intel.com> #v1
---
drivers/gpu/drm/xe/tests/xe_bo.c | 2 +-
drivers/gpu/drm/xe/tests/xe_dma_buf.c | 10 +---------
drivers/gpu/drm/xe/xe_bo.c | 16 ++++++++++++----
drivers/gpu/drm/xe/xe_bo.h | 2 +-
drivers/gpu/drm/xe/xe_dma_buf.c | 2 +-
5 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/drivers/gpu/drm/xe/tests/xe_bo.c b/drivers/gpu/drm/xe/tests/xe_bo.c
index bb469096d072..7b40cc8be1c9 100644
--- a/drivers/gpu/drm/xe/tests/xe_bo.c
+++ b/drivers/gpu/drm/xe/tests/xe_bo.c
@@ -236,7 +236,7 @@ static int evict_test_run_tile(struct xe_device *xe, struct xe_tile *tile, struc
}
xe_bo_lock(external, false);
- err = xe_bo_pin_external(external);
+ err = xe_bo_pin_external(external, false);
xe_bo_unlock(external);
if (err) {
KUNIT_FAIL(test, "external bo pin err=%pe\n",
diff --git a/drivers/gpu/drm/xe/tests/xe_dma_buf.c b/drivers/gpu/drm/xe/tests/xe_dma_buf.c
index 3c5ad8cc65a0..5baeab6b6fb7 100644
--- a/drivers/gpu/drm/xe/tests/xe_dma_buf.c
+++ b/drivers/gpu/drm/xe/tests/xe_dma_buf.c
@@ -85,15 +85,7 @@ static void check_residency(struct kunit *test, struct xe_bo *exported,
return;
}
- /*
- * If on different devices, the exporter is kept in system if
- * possible, saving a migration step as the transfer is just
- * likely as fast from system memory.
- */
- if (params->mem_mask & XE_BO_FLAG_SYSTEM)
- KUNIT_EXPECT_TRUE(test, xe_bo_is_mem_type(exported, XE_PL_TT));
- else
- KUNIT_EXPECT_TRUE(test, xe_bo_is_mem_type(exported, mem_type));
+ KUNIT_EXPECT_TRUE(test, xe_bo_is_mem_type(exported, mem_type));
if (params->force_different_devices)
KUNIT_EXPECT_TRUE(test, xe_bo_is_mem_type(imported, XE_PL_TT));
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 4faf15d5fa6d..870f43347281 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -188,6 +188,8 @@ static void try_add_system(struct xe_device *xe, struct xe_bo *bo,
bo->placements[*c] = (struct ttm_place) {
.mem_type = XE_PL_TT,
+ .flags = (bo_flags & XE_BO_FLAG_VRAM_MASK) ?
+ TTM_PL_FLAG_FALLBACK : 0,
};
*c += 1;
}
@@ -2322,6 +2324,7 @@ uint64_t vram_region_gpu_offset(struct ttm_resource *res)
/**
* xe_bo_pin_external - pin an external BO
* @bo: buffer object to be pinned
+ * @in_place: Pin in current placement, don't attempt to migrate.
*
* Pin an external (not tied to a VM, can be exported via dma-buf / prime FD)
* BO. Unique call compared to xe_bo_pin as this function has it own set of
@@ -2329,7 +2332,7 @@ uint64_t vram_region_gpu_offset(struct ttm_resource *res)
*
* Returns 0 for success, negative error code otherwise.
*/
-int xe_bo_pin_external(struct xe_bo *bo)
+int xe_bo_pin_external(struct xe_bo *bo, bool in_place)
{
struct xe_device *xe = xe_bo_device(bo);
int err;
@@ -2338,9 +2341,11 @@ int xe_bo_pin_external(struct xe_bo *bo)
xe_assert(xe, xe_bo_is_user(bo));
if (!xe_bo_is_pinned(bo)) {
- err = xe_bo_validate(bo, NULL, false);
- if (err)
- return err;
+ if (!in_place) {
+ err = xe_bo_validate(bo, NULL, false);
+ if (err)
+ return err;
+ }
spin_lock(&xe->pinned.lock);
list_add_tail(&bo->pinned_link, &xe->pinned.late.external);
@@ -2493,6 +2498,9 @@ int xe_bo_validate(struct xe_bo *bo, struct xe_vm *vm, bool allow_res_evict)
};
int ret;
+ if (xe_bo_is_pinned(bo))
+ return 0;
+
if (vm) {
lockdep_assert_held(&vm->lock);
xe_vm_assert_held(vm);
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index 8cce413b5235..cfb1ec266a6d 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -200,7 +200,7 @@ static inline void xe_bo_unlock_vm_held(struct xe_bo *bo)
}
}
-int xe_bo_pin_external(struct xe_bo *bo);
+int xe_bo_pin_external(struct xe_bo *bo, bool in_place);
int xe_bo_pin(struct xe_bo *bo);
void xe_bo_unpin_external(struct xe_bo *bo);
void xe_bo_unpin(struct xe_bo *bo);
diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c
index 346f857f3837..af64baf872ef 100644
--- a/drivers/gpu/drm/xe/xe_dma_buf.c
+++ b/drivers/gpu/drm/xe/xe_dma_buf.c
@@ -72,7 +72,7 @@ static int xe_dma_buf_pin(struct dma_buf_attachment *attach)
return ret;
}
- ret = xe_bo_pin_external(bo);
+ ret = xe_bo_pin_external(bo, true);
xe_assert(xe, !ret);
return 0;
--
2.50.1
On Wed, Aug 27, 2025 at 7:38 AM Stanislav Fort <disclosure(a)aisle.com> wrote:
>
> NET/ROM nr_rx_frame() dereferences the 5-byte transport header
> unconditionally. nr_route_frame() currently accepts frames as short as
> NR_NETWORK_LEN (15 bytes), which can lead to small out-of-bounds reads
> on short frames.
>
> Fix by using pskb_may_pull() in nr_rx_frame() to ensure the full
> NET/ROM network + transport header is present before accessing it, and
> guard the extra fields used by NR_CONNREQ (window, user address, and the
> optional BPQ timeout extension) with additional pskb_may_pull() checks.
>
> This aligns with recent fixes using pskb_may_pull() to validate header
> availability.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: Stanislav Fort <disclosure(a)aisle.com>
> Cc: stable(a)vger.kernel.org
> Signed-off-by: Stanislav Fort <disclosure(a)aisle.com>
> ---
> net/netrom/af_netrom.c | 12 +++++++++++-
> net/netrom/nr_route.c | 2 +-
> 2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
> index 3331669d8e33..1fbaa161288a 100644
> --- a/net/netrom/af_netrom.c
> +++ b/net/netrom/af_netrom.c
> @@ -885,6 +885,10 @@ int nr_rx_frame(struct sk_buff *skb, struct net_device *dev)
> * skb->data points to the netrom frame start
> */
>
> + /* Ensure NET/ROM network + transport header are present */
> + if (!pskb_may_pull(skb, NR_NETWORK_LEN + NR_TRANSPORT_LEN))
> + return 0;
> +
> src = (ax25_address *)(skb->data + 0);
> dest = (ax25_address *)(skb->data + 7);
>
> @@ -961,6 +965,12 @@ int nr_rx_frame(struct sk_buff *skb, struct net_device *dev)
> return 0;
> }
>
> + /* Ensure NR_CONNREQ fields (window + user address) are present */
> + if (!pskb_may_pull(skb, 21 + AX25_ADDR_LEN)) {
If skb->head is reallocated by this pskb_may_pull(), dest variable
might point to a freed piece of memory
(old skb->head)
As far as netrom is concerned, I would force a full linearization of
the packet very early
It is also unclear if the bug even exists in the first place.
Can you show the stack trace leading to this function being called
from an arbitrary
provider (like a packet being fed by malicious user space)
For instance nr_rx_frame() can be called from net/netrom/nr_loopback.c
with non malicious packet.
For the remaining caller (nr_route_frame()), it is unclear to me.
This is a series of patches that address several memory bugs that occur
in the Exynos Virtual Display driver.
Jeongjun Park (3):
drm/exynos: vidi: use priv->vidi_dev for ctx lookup in vidi_connection_ioctl()
drm/exynos: vidi: fix to avoid directly dereferencing user pointer
drm/exynos: vidi: use ctx->lock to protect struct vidi_context member variables related to memory alloc/free
drivers/gpu/drm/exynos/exynos_drm_drv.h | 1 +
drivers/gpu/drm/exynos/exynos_drm_vidi.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 64 insertions(+), 11 deletions(-)
Fix reference leaks where PCI device references
obtained via pci_get_device() were not being released:
1. The while loop that iterates through 0xa00a devices was not
releasing the final device reference when the loop terminates.
2. Single device lookups for 0xa001 and 0xa009 devices were not
releasing their references after use.
Add missing pci_dev_put() calls to ensure all device references
are properly released.
Fixes: cd7834167ffb ("[POWERPC] pasemi: Print more information at machine check")
Cc: stable(a)vger.kernel.org
Signed-off-by: Miaoqian Lin <linmq006(a)gmail.com>
---
arch/powerpc/platforms/pasemi/setup.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/powerpc/platforms/pasemi/setup.c b/arch/powerpc/platforms/pasemi/setup.c
index d03b41336901..dafbee3afd86 100644
--- a/arch/powerpc/platforms/pasemi/setup.c
+++ b/arch/powerpc/platforms/pasemi/setup.c
@@ -169,6 +169,8 @@ static int __init pas_setup_mce_regs(void)
dev = pci_get_device(PCI_VENDOR_ID_PASEMI, 0xa00a, dev);
reg++;
}
+ /* Release the last device reference from the while loop */
+ pci_dev_put(dev);
dev = pci_get_device(PCI_VENDOR_ID_PASEMI, 0xa001, NULL);
if (dev && reg+4 < MAX_MCE_REGS) {
@@ -185,6 +187,7 @@ static int __init pas_setup_mce_regs(void)
mce_regs[reg].addr = pasemi_pci_getcfgaddr(dev, 0xc1c);
reg++;
}
+ pci_dev_put(dev);
dev = pci_get_device(PCI_VENDOR_ID_PASEMI, 0xa009, NULL);
if (dev && reg+2 < MAX_MCE_REGS) {
@@ -195,6 +198,7 @@ static int __init pas_setup_mce_regs(void)
mce_regs[reg].addr = pasemi_pci_getcfgaddr(dev, 0x214);
reg++;
}
+ pci_dev_put(dev);
num_mce_regs = reg;
--
2.35.1
Suspend-resume cycle test revealed a memory leak in 6.17-rc3
Turns out the slot_id race fix changes accidentally ends up calling
xhci_free_virt_device() with an incorrect vdev parameter.
The vdev variable was reused for temporary purposes right before calling
xhci_free_virt_device().
Fix this by passing the correct vdev parameter.
The slot_id race fix that caused this regression was targeted for stable,
so this needs to be applied there as well.
Fixes: 2eb03376151b ("usb: xhci: Fix slot_id resource race conflict")
Reported-by: David Wang <00107082(a)163.com>
Closes: https://lore.kernel.org/linux-usb/20250829181354.4450-1-00107082@163.com
Suggested-by: Michal Pecio <michal.pecio(a)gmail.com>
Suggested-by: David Wang <00107082(a)163.com>
Cc: stable(a)vger.kernel.org
Tested-by: David Wang <00107082(a)163.com>
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci-mem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
index 81eaad87a3d9..c4a6544aa107 100644
--- a/drivers/usb/host/xhci-mem.c
+++ b/drivers/usb/host/xhci-mem.c
@@ -962,7 +962,7 @@ static void xhci_free_virt_devices_depth_first(struct xhci_hcd *xhci, int slot_i
out:
/* we are now at a leaf device */
xhci_debugfs_remove_slot(xhci, slot_id);
- xhci_free_virt_device(xhci, vdev, slot_id);
+ xhci_free_virt_device(xhci, xhci->devs[slot_id], slot_id);
}
int xhci_alloc_virt_device(struct xhci_hcd *xhci, int slot_id,
--
2.43.0
Pending requests will be flushed on disconnect, and the corresponding
TRBs will be turned into No-op TRBs, which are ignored by the xHC
controller once it starts processing the ring.
If the USB debug cable repeatedly disconnects before ring is started
then the ring will eventually be filled with No-op TRBs.
No new transfers can be queued when the ring is full, and driver will
print the following error message:
"xhci_hcd 0000:00:14.0: failed to queue trbs"
This is a normal case for 'in' transfers where TRBs are always enqueued
in advance, ready to take on incoming data. If no data arrives, and
device is disconnected, then ring dequeue will remain at beginning of
the ring while enqueue points to first free TRB after last cancelled
No-op TRB.
s
Solve this by reinitializing the rings when the debug cable disconnects
and DbC is leaving the configured state.
Clear the whole ring buffer and set enqueue and dequeue to the beginning
of ring, and set cycle bit to its initial state.
Cc: stable(a)vger.kernel.org
Fixes: dfba2174dc42 ("usb: xhci: Add DbC support in xHCI driver")
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci-dbgcap.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/host/xhci-dbgcap.c b/drivers/usb/host/xhci-dbgcap.c
index d0faff233e3e..63edf2d8f245 100644
--- a/drivers/usb/host/xhci-dbgcap.c
+++ b/drivers/usb/host/xhci-dbgcap.c
@@ -462,6 +462,25 @@ static void xhci_dbc_ring_init(struct xhci_ring *ring)
xhci_initialize_ring_info(ring);
}
+static int xhci_dbc_reinit_ep_rings(struct xhci_dbc *dbc)
+{
+ struct xhci_ring *in_ring = dbc->eps[BULK_IN].ring;
+ struct xhci_ring *out_ring = dbc->eps[BULK_OUT].ring;
+
+ if (!in_ring || !out_ring || !dbc->ctx) {
+ dev_warn(dbc->dev, "Can't re-init unallocated endpoints\n");
+ return -ENODEV;
+ }
+
+ xhci_dbc_ring_init(in_ring);
+ xhci_dbc_ring_init(out_ring);
+
+ /* set ep context enqueue, dequeue, and cycle to initial values */
+ xhci_dbc_init_ep_contexts(dbc);
+
+ return 0;
+}
+
static struct xhci_ring *
xhci_dbc_ring_alloc(struct device *dev, enum xhci_ring_type type, gfp_t flags)
{
@@ -885,7 +904,7 @@ static enum evtreturn xhci_dbc_do_handle_events(struct xhci_dbc *dbc)
dev_info(dbc->dev, "DbC cable unplugged\n");
dbc->state = DS_ENABLED;
xhci_dbc_flush_requests(dbc);
-
+ xhci_dbc_reinit_ep_rings(dbc);
return EVT_DISC;
}
@@ -895,7 +914,7 @@ static enum evtreturn xhci_dbc_do_handle_events(struct xhci_dbc *dbc)
writel(portsc, &dbc->regs->portsc);
dbc->state = DS_ENABLED;
xhci_dbc_flush_requests(dbc);
-
+ xhci_dbc_reinit_ep_rings(dbc);
return EVT_DISC;
}
--
2.43.0
Hi,
I’d like to share a verified database of 502,364 attendees and 750 exhibitors from IAA Mobility 2025.
The file includes key information such as: Name, Job Title, Company, Location, Phone, and Email.
Delivery is within 48 hours and the data is GDPR compliant.
Labour Day Special: 20% discount available for a short time.
If you’d like the pricing, just reply with “Send me the cost” and I’ll provide the details.
Best regards,
Garnet Conwell
Sr. Marketing Manager
P.S. To opt out of future updates, reply with “Unfollow”.
Testing has shown that reading multiple registers at once (for 10-bit
ADC values) does not work. Set the use_single_read regmap_config flag
to make regmap split these for us.
This should fix temperature opregion accesses done by
drivers/acpi/pmic/intel_pmic_chtdc_ti.c and is also necessary for
the upcoming drivers for the ADC and battery MFD cells.
Fixes: 6bac0606fdba ("mfd: Add support for Cherry Trail Dollar Cove TI PMIC")
Cc: stable(a)vger.kernel.org
Reviewed-by: Andy Shevchenko <andy(a)kernel.org>
Signed-off-by: Hans de Goede <hansg(a)kernel.org>
---
Changes in v3:
- Fix a few typos in the commit message
Changes in v2:
- Update comment to: "The hardware does not support reading multiple
registers at once"
---
drivers/mfd/intel_soc_pmic_chtdc_ti.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/mfd/intel_soc_pmic_chtdc_ti.c b/drivers/mfd/intel_soc_pmic_chtdc_ti.c
index 4c1a68c9f575..6daf33e07ea0 100644
--- a/drivers/mfd/intel_soc_pmic_chtdc_ti.c
+++ b/drivers/mfd/intel_soc_pmic_chtdc_ti.c
@@ -82,6 +82,8 @@ static const struct regmap_config chtdc_ti_regmap_config = {
.reg_bits = 8,
.val_bits = 8,
.max_register = 0xff,
+ /* The hardware does not support reading multiple registers at once */
+ .use_single_read = true,
};
static const struct regmap_irq chtdc_ti_irqs[] = {
--
2.49.0
When the host actively triggers SSR and collects coredump data,
the Bluetooth stack sends a reset command to the controller. However, due
to the inability to clear the QCA_SSR_TRIGGERED and QCA_IBS_DISABLED bits,
the reset command times out.
To address this, this patch clears the QCA_SSR_TRIGGERED and
QCA_IBS_DISABLED flags and adds a 50ms delay after SSR, but only when
HCI_QUIRK_NON_PERSISTENT_SETUP is not set. This ensures the controller
completes the SSR process when BT_EN is always high due to hardware.
For the purpose of HCI_QUIRK_NON_PERSISTENT_SETUP, please refer to
the comment in `include/net/bluetooth/hci.h`.
The HCI_QUIRK_NON_PERSISTENT_SETUP quirk is associated with BT_EN,
and its presence can be used to determine whether BT_EN is defined in DTS.
After SSR, host will not download the firmware, causing
controller to remain in the IBS_WAKE state. Host needs
to synchronize with the controller to maintain proper operation.
Multiple triggers of SSR only first generate coredump file,
due to memcoredump_flag no clear.
add clear coredump flag when ssr completed.
When the SSR duration exceeds 2 seconds, it triggers
host tx_idle_timeout, which sets host TX state to sleep. due to the
hardware pulling up bt_en, the firmware is not downloaded after the SSR.
As a result, the controller does not enter sleep mode. Consequently,
when the host sends a command afterward, it sends 0xFD to the controller,
but the controller does not respond, leading to a command timeout.
So reset tx_idle_timer after SSR to prevent host enter TX IBS_Sleep mode.
Changes since v6-7:
- Merge the changes into a single patch.
- Update commit.
Changes since v1-5:
- Add an explanation for HCI_QUIRK_NON_PERSISTENT_SETUP.
- Add commments for msleep(50).
- Update format and commit.
Signed-off-by: Shuai Zhang <quic_shuaz(a)quicinc.com>
---
drivers/bluetooth/hci_qca.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
index 4e56782b0..9dc59b002 100644
--- a/drivers/bluetooth/hci_qca.c
+++ b/drivers/bluetooth/hci_qca.c
@@ -1653,6 +1653,39 @@ static void qca_hw_error(struct hci_dev *hdev, u8 code)
skb_queue_purge(&qca->rx_memdump_q);
}
+ /*
+ * If the BT chip's bt_en pin is connected to a 3.3V power supply via
+ * hardware and always stays high, driver cannot control the bt_en pin.
+ * As a result, during SSR (SubSystem Restart), QCA_SSR_TRIGGERED and
+ * QCA_IBS_DISABLED flags cannot be cleared, which leads to a reset
+ * command timeout.
+ * Add an msleep delay to ensure controller completes the SSR process.
+ *
+ * Host will not download the firmware after SSR, controller to remain
+ * in the IBS_WAKE state, and the host needs to synchronize with it
+ *
+ * Since the bluetooth chip has been reset, clear the memdump state.
+ */
+ if (!test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks)) {
+ /*
+ * When the SSR (SubSystem Restart) duration exceeds 2 seconds,
+ * it triggers host tx_idle_delay, which sets host TX state
+ * to sleep. Reset tx_idle_timer after SSR to prevent
+ * host enter TX IBS_Sleep mode.
+ */
+ mod_timer(&qca->tx_idle_timer, jiffies +
+ msecs_to_jiffies(qca->tx_idle_delay));
+
+ /* Controller reset completion time is 50ms */
+ msleep(50);
+
+ clear_bit(QCA_SSR_TRIGGERED, &qca->flags);
+ clear_bit(QCA_IBS_DISABLED, &qca->flags);
+
+ qca->tx_ibs_state = HCI_IBS_TX_AWAKE;
+ qca->memdump_state = QCA_MEMDUMP_IDLE;
+ }
+
clear_bit(QCA_HW_ERROR_EVENT, &qca->flags);
}
--
2.34.1
From: Conor Dooley <conor.dooley(a)microchip.com>
In commit 13529647743d9 ("spi: microchip-core-qspi: Support per spi-mem
operation frequency switches") the logic for checking the viability of
op->max_freq in mchp_coreqspi_setup_clock() was copied into
mchp_coreqspi_supports_op(). Unfortunately, op->max_freq is not valid
when this function is called during probe but is instead zero.
Accordingly, baud_rate_val is calculated to be INT_MAX due to division
by zero, causing probe of the attached memory device to fail.
Seemingly spi-microchip-core-qspi was the only driver that had such a
modification made to its supports_op callback when the per_op_freq
capability was added, so just remove it to restore prior functionality.
CC: stable(a)vger.kernel.org
Reported-by: Valentina Fernandez <valentina.fernandezalanis(a)microchip.com>
Fixes: 13529647743d9 ("spi: microchip-core-qspi: Support per spi-mem operation frequency switches")
Signed-off-by: Conor Dooley <conor.dooley(a)microchip.com>
---
CC: Conor Dooley <conor.dooley(a)microchip.com>
CC: Daire McNamara <daire.mcnamara(a)microchip.com>
CC: Mark Brown <broonie(a)kernel.org>
CC: Miquel Raynal <miquel.raynal(a)bootlin.com>
CC: linux-spi(a)vger.kernel.org
CC: linux-kernel(a)vger.kernel.org
---
drivers/spi/spi-microchip-core-qspi.c | 12 ------------
1 file changed, 12 deletions(-)
diff --git a/drivers/spi/spi-microchip-core-qspi.c b/drivers/spi/spi-microchip-core-qspi.c
index d13a9b755c7f8..8dc98b17f77b5 100644
--- a/drivers/spi/spi-microchip-core-qspi.c
+++ b/drivers/spi/spi-microchip-core-qspi.c
@@ -531,10 +531,6 @@ static int mchp_coreqspi_exec_op(struct spi_mem *mem, const struct spi_mem_op *o
static bool mchp_coreqspi_supports_op(struct spi_mem *mem, const struct spi_mem_op *op)
{
- struct mchp_coreqspi *qspi = spi_controller_get_devdata(mem->spi->controller);
- unsigned long clk_hz;
- u32 baud_rate_val;
-
if (!spi_mem_default_supports_op(mem, op))
return false;
@@ -557,14 +553,6 @@ static bool mchp_coreqspi_supports_op(struct spi_mem *mem, const struct spi_mem_
return false;
}
- clk_hz = clk_get_rate(qspi->clk);
- if (!clk_hz)
- return false;
-
- baud_rate_val = DIV_ROUND_UP(clk_hz, 2 * op->max_freq);
- if (baud_rate_val > MAX_DIVIDER || baud_rate_val < MIN_DIVIDER)
- return false;
-
return true;
}
--
2.47.2
commit e3f1164fc9ee ("PM: EM: Support late CPUs booting and capacity
adjustment") added a mechanism to handle CPUs that come up late by
retrying when any of the `cpufreq_cpu_get()` call fails.
However, if there are holes in the CPU topology (offline CPUs, e.g.
nosmt), the first missing CPU causes the loop to break, preventing
subsequent online CPUs from being updated.
Instead of aborting on the first missing CPU policy, loop through all
and retry if any were missing.
Fixes: e3f1164fc9ee ("PM: EM: Support late CPUs booting and capacity adjustment")
Suggested-by: Kenneth Crudup <kenneth.crudup(a)gmail.com>
Reported-by: Kenneth Crudup <kenneth.crudup(a)gmail.com>
Closes: https://lore.kernel.org/linux-pm/40212796-734c-4140-8a85-854f72b8144d@panix…
Cc: stable(a)vger.kernel.org
Signed-off-by: Christian Loehle <christian.loehle(a)arm.com>
---
kernel/power/energy_model.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index ea7995a25780..b63c2afc1379 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -778,7 +778,7 @@ void em_adjust_cpu_capacity(unsigned int cpu)
static void em_check_capacity_update(void)
{
cpumask_var_t cpu_done_mask;
- int cpu;
+ int cpu, failed_cpus = 0;
if (!zalloc_cpumask_var(&cpu_done_mask, GFP_KERNEL)) {
pr_warn("no free memory\n");
@@ -796,10 +796,8 @@ static void em_check_capacity_update(void)
policy = cpufreq_cpu_get(cpu);
if (!policy) {
- pr_debug("Accessing cpu%d policy failed\n", cpu);
- schedule_delayed_work(&em_update_work,
- msecs_to_jiffies(1000));
- break;
+ failed_cpus++;
+ continue;
}
cpufreq_cpu_put(policy);
@@ -814,6 +812,11 @@ static void em_check_capacity_update(void)
em_adjust_new_capacity(cpu, dev, pd);
}
+ if (failed_cpus) {
+ pr_debug("Accessing %d policies failed, retrying\n", failed_cpus);
+ schedule_delayed_work(&em_update_work, msecs_to_jiffies(1000));
+ }
+
free_cpumask_var(cpu_done_mask);
}
--
2.34.1
From: Stanislav Fort <stanislav.fort(a)aisle.com>
batadv_nc_skb_decode_packet() trusts coded_len and checks only against
skb->len. XOR starts at sizeof(struct batadv_unicast_packet), reducing
payload headroom, and the source skb length is not verified, allowing an
out-of-bounds read and a small out-of-bounds write.
Validate that coded_len fits within the payload area of both destination
and source sk_buffs before XORing.
Fixes: 2df5278b0267 ("batman-adv: network coding - receive coded packets and decode them")
Cc: stable(a)vger.kernel.org
Reported-by: Stanislav Fort <disclosure(a)aisle.com>
Signed-off-by: Stanislav Fort <stanislav.fort(a)aisle.com>
Signed-off-by: Sven Eckelmann <sven(a)narfation.org>
Signed-off-by: Simon Wunderlich <sw(a)simonwunderlich.de>
---
net/batman-adv/network-coding.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/batman-adv/network-coding.c b/net/batman-adv/network-coding.c
index 9f56308779cc3..af97d077369f9 100644
--- a/net/batman-adv/network-coding.c
+++ b/net/batman-adv/network-coding.c
@@ -1687,7 +1687,12 @@ batadv_nc_skb_decode_packet(struct batadv_priv *bat_priv, struct sk_buff *skb,
coding_len = ntohs(coded_packet_tmp.coded_len);
- if (coding_len > skb->len)
+ /* ensure dst buffer is large enough (payload only) */
+ if (coding_len + h_size > skb->len)
+ return NULL;
+
+ /* ensure src buffer is large enough (payload only) */
+ if (coding_len + h_size > nc_packet->skb->len)
return NULL;
/* Here the magic is reversed:
--
2.47.2
From: gyutrange <wlsrbwjd643(a)naver.com>
VNCR/TLBI VA reconstruction currently uses bit 48 as the sign bit,
but for 48-bit virtual addresses the correct sign bit is bit 47.
Using 48 can mis-canonicalize addresses in the negative half and may
cause missed invalidations.
Although VNCR_EL2 encodes other architectural fields (RESS, BADDR;
see Arm ARM D24.2.206), sign_extend64() interprets its second argument
as the index of the sign bit. Passing 48 prevents propagation of the
canonical sign bit for 48-bit VAs.
Impact:
- Incorrect canonicalization of VAs with bit47=1
- Potential stale VNCR pseudo-TLB entries after TLBI or MMU notifier
- Possible incorrect translation/permissions or DoS when combined
with other issues
Fixes: 667304740537 ("KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation")
Cc: stable(a)vger.kernel.org
Reported-by: DongHa Lee <gap-dev(a)example.com>
Reported-by: Gyujeong Jin <wlsrbwjd7232(a)gmail.com>
Reported-by: Daehyeon Ko <4ncient(a)example.com>
Reported-by: Geonha Lee <leegn4a(a)example.com>
Reported-by: Hyungyu Oh <dqpc_lover(a)example.com>
Reported-by: Jaewon Yang <r4mbb1(a)example.com>
Signed-off-by: Gyujeong Jin <wlsrbwjd7232(a)gmail.com>
---
arch/arm64/kvm/nested.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 77db81bae86f..eaa6dd9da086 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -1169,7 +1169,7 @@ int kvm_vcpu_allocate_vncr_tlb(struct kvm_vcpu *vcpu)
static u64 read_vncr_el2(struct kvm_vcpu *vcpu)
{
- return (u64)sign_extend64(__vcpu_sys_reg(vcpu, VNCR_EL2), 48);
+ return (u64)sign_extend64(__vcpu_sys_reg(vcpu, VNCR_EL2), 47);
}
static int kvm_translate_vncr(struct kvm_vcpu *vcpu)
--
2.43.0
Correct the address in usb controller node to fix the following warning:
Warning (simple_bus_reg): /soc@0/usb@a6f8800: simple-bus unit address
format error, expected "a600000"
Fixes: c5a87e3a6b3e ("arm64: dts: qcom: sm8450: Flatten usb controller node")
Cc: stable(a)vger.kernel.org
Reported-by: kernel test robot <lkp(a)intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202508121834.953Mvah2-lkp@intel.com/
Signed-off-by: Krishna Kurapati <krishna.kurapati(a)oss.qualcomm.com>
---
This change was tested with W=1 and the reported issue is not seen.
Also didn't add RB Tag received from Neil Armstrong since there is a
change in commit text. This change is based on top of latest linux next.
Changes in v2:
Fixed the fixes tag.
Link to v1:
https://lore.kernel.org/all/20250813063840.2158792-1-krishna.kurapati@oss.q…
arch/arm64/boot/dts/qcom/sm8450.dtsi | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/boot/dts/qcom/sm8450.dtsi b/arch/arm64/boot/dts/qcom/sm8450.dtsi
index 2baef6869ed7..38c91c3ec787 100644
--- a/arch/arm64/boot/dts/qcom/sm8450.dtsi
+++ b/arch/arm64/boot/dts/qcom/sm8450.dtsi
@@ -5417,7 +5417,7 @@ opp-202000000 {
};
};
- usb_1: usb@a6f8800 {
+ usb_1: usb@a600000 {
compatible = "qcom,sm8450-dwc3", "qcom,snps-dwc3";
reg = <0 0x0a600000 0 0xfc100>;
status = "disabled";
--
2.34.1
There is a long standing bug which causes I2C communication not to
work on the Armada 3700 based boards. The first patch in the series
fixes that regression. The second patch improves recovery to make it
more robust which helps to avoid communication problems with certain
SFP modules.
Signed-off-by: Gabor Juhos <j4g8y7(a)gmail.com>
---
Changes in v3:
- rebase on tip of i2c/for-current
- remove Imre's tag from the cover letter, and replace his SoB tag to
Reviewed-by in the individual patches
- rework the second patch so it does not need changes in the I2C core code,
and drop the first one as it is not needed now
- Link to v2: https://lore.kernel.org/r/20250811-i2c-pxa-fix-i2c-communication-v2-0-ca42e…
Changes in v2:
- collect offered tags
- rebase and retest on tip of i2c/for-current
- Link to v1: https://lore.kernel.org/r/20250511-i2c-pxa-fix-i2c-communication-v1-0-e9097…
---
Gabor Juhos (2):
i2c: pxa: defer reset on Armada 3700 when recovery is used
i2c: pxa: handle 'Early Bus Busy' condition on Armada 3700
drivers/i2c/busses/i2c-pxa.c | 35 ++++++++++++++++++++++++++++-------
1 file changed, 28 insertions(+), 7 deletions(-)
---
base-commit: 3dd22078026c7cad4d4a3f32c5dc5452c7180de8
change-id: 20250510-i2c-pxa-fix-i2c-communication-3e6de1e3d0c6
Best regards,
--
Gabor Juhos <j4g8y7(a)gmail.com>
Commit under Fixes added support to build the driver as a loadable
module. However, it did not add MODULE_DEVICE_TABLE() which is required
for autoloading the driver when it is built as a loadable module.
Fix it.
Fixes: a2790bf81f0f ("PCI: j721e: Add support to build as a loadable module")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Siddharth Vadapalli <s-vadapalli(a)ti.com>
---
Hello,
This patch is based on commit
b320789d6883 Linux 6.17-rc4
of Mainline Linux.
Patch has been tested on J7200-EVM with the driver configs set to 'm'.
The pci-j721e.c driver is autoloaded as Linux boots and it doesn't need
to be manually probed. Logs:
https://gist.github.com/Siddharth-Vadapalli-at-TI/f64eed4df6fd4f714747b3c7e…
Regards,
Siddharth.
drivers/pci/controller/cadence/pci-j721e.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/pci/controller/cadence/pci-j721e.c b/drivers/pci/controller/cadence/pci-j721e.c
index 6c93f39d0288..cfca13a4c840 100644
--- a/drivers/pci/controller/cadence/pci-j721e.c
+++ b/drivers/pci/controller/cadence/pci-j721e.c
@@ -440,6 +440,7 @@ static const struct of_device_id of_j721e_pcie_match[] = {
},
{},
};
+MODULE_DEVICE_TABLE(of, of_j721e_pcie_match);
static int j721e_pcie_probe(struct platform_device *pdev)
{
--
2.43.0
Hi maintainers,
Please consider backporting the following patches to the stable trees.
These patches fix a significant reading issue with mcp2221 on i2c eeprom.
I have confirmed that the patches applie cleanly and build successfully
against v6.6, v6.1, v5.15 and v5.10 stable branches.
I will later come back to you with other patches that might need larger
backport changes.
Thanks,
Romain
Hamish Martin (2):
HID: mcp2221: Don't set bus speed on every transfer
HID: mcp2221: Handle reads greater than 60 bytes
drivers/hid/hid-mcp2221.c | 71 +++++++++++++++++++++++++++------------
1 file changed, 49 insertions(+), 22 deletions(-)
--
2.48.1
Hello,
I would like to request backporting the following patch to 5.10 series:
590cfb082837cc6c0c595adf1711330197c86a58
ASoC: Intel: sof_rt5682: shrink platform_id names below 20 characters
The patch seems to be already present in 5.15 and newer branches, and
its lack seems to be causing out-of-bounds read. I've hit it in the
wild while trying to install 5.10.241 on i686:
sh /var/tmp/portage/sys-kernel/gentoo-kernel-5.10.241/work/linux-5.10/scripts/depmod.sh depmod 5.10.241-gentoo-dist
depmod: FATAL: Module index: bad character '�'=0x80 - only 7-bit ASCII is supported:
platform:jsl_rt5682_max98360ax�
TIA.
--
Best regards,
Michał Górny
Support for parsing the topology on AMD/Hygon processors using CPUID
leaf 0xb was added in commit 3986a0a805e6 ("x86/CPU/AMD: Derive CPU
topology from CPUID function 0xB when available"). In an effort to keep
all the topology parsing bits in one place, this commit also introduced
a pseudo dependency on the TOPOEXT feature to parse the CPUID leaf 0xb.
TOPOEXT feature (CPUID 0x80000001 ECX[22]) advertises the support for
Cache Properties leaf 0x8000001d and the CPUID leaf 0x8000001e EAX for
"Extended APIC ID" however support for 0xb was introduced alongside the
x2APIC support not only on AMD [1], but also historically on x86 [2].
Similar to 0xb, the support for extended CPU topology leaf 0x80000026
too does not depend on the TOPOEXT feature.
The support for these leaves is expected to be confirmed by ensuring
"leaf <= {extended_}cpuid_level" and then parsing the level 0 of the
respective leaf to confirm EBX[15:0] (LogProcAtThisLevel) is non-zero as
stated in the definition of "CPUID_Fn0000000B_EAX_x00 [Extended Topology
Enumeration] (Core::X86::Cpuid::ExtTopEnumEax0)" in Processor
Programming Reference (PPR) for AMD Family 19h Model 01h Rev B1 Vol1 [3]
Sec. 2.1.15.1 "CPUID Instruction Functions".
This has not been a problem on baremetal platforms since support for
TOPOEXT (Fam 0x15 and later) predates the support for CPUID leaf 0xb
(Fam 0x17[Zen2] and later), however, for AMD guests on QEMU, "x2apic"
feature can be enabled independent of the "topoext" feature where QEMU
expects topology and the initial APICID to be parsed using the CPUID
leaf 0xb (especially when number of cores > 255) which is populated
independent of the "topoext" feature flag.
Unconditionally call cpu_parse_topology_ext() on AMD and Hygon
processors to first parse the topology using the XTOPOLOGY leaves
(0x80000026 / 0xb) before using the TOPOEXT leaf (0x8000001e).
Cc: stable(a)vger.kernel.org # Only v6.9 and above
Link: https://lore.kernel.org/lkml/1529686927-7665-1-git-send-email-suravee.suthi… [1]
Link: https://lore.kernel.org/lkml/20080818181435.523309000@linux-os.sc.intel.com/ [2]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 [3]
Suggested-by: Naveen N Rao (AMD) <naveen(a)kernel.org>
Fixes: 3986a0a805e6 ("x86/CPU/AMD: Derive CPU topology from CPUID function 0xB when available")
Signed-off-by: K Prateek Nayak <kprateek.nayak(a)amd.com>
---
Changelog v3..v4:
o Quoted relevant section of the PPR justifying the changes.
o Moved this patch up ahead.
o Cc'd stable and made a note that backports should target v6.9 and
above since this depends on the x86 topology rewrite.
---
arch/x86/kernel/cpu/topology_amd.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/topology_amd.c b/arch/x86/kernel/cpu/topology_amd.c
index 827dd0dbb6e9..4e3134a5550c 100644
--- a/arch/x86/kernel/cpu/topology_amd.c
+++ b/arch/x86/kernel/cpu/topology_amd.c
@@ -175,18 +175,14 @@ static void topoext_fixup(struct topo_scan *tscan)
static void parse_topology_amd(struct topo_scan *tscan)
{
- bool has_topoext = false;
-
/*
- * If the extended topology leaf 0x8000_001e is available
- * try to get SMT, CORE, TILE, and DIE shifts from extended
+ * Try to get SMT, CORE, TILE, and DIE shifts from extended
* CPUID leaf 0x8000_0026 on supported processors first. If
* extended CPUID leaf 0x8000_0026 is not supported, try to
* get SMT and CORE shift from leaf 0xb first, then try to
* get the CORE shift from leaf 0x8000_0008.
*/
- if (cpu_feature_enabled(X86_FEATURE_TOPOEXT))
- has_topoext = cpu_parse_topology_ext(tscan);
+ bool has_topoext = cpu_parse_topology_ext(tscan);
if (cpu_feature_enabled(X86_FEATURE_AMD_HTR_CORES))
tscan->c->topo.cpu_type = cpuid_ebx(0x80000026);
--
2.34.1
This series adds support for the clone3 system call to the nios2
architecture. This addresses the build-time warning "warning: clone3()
entry point is missing, please fix" introduced in 505d66d1abfb9
("clone3: drop __ARCH_WANT_SYS_CLONE3 macro"). The implementation passes
the relevant clone3 tests of kselftest when applied on top of
next-20250815:
./run_kselftest.sh
TAP version 13
1..4
# selftests: clone3: clone3
ok 1 selftests: clone3: clone3
# selftests: clone3: clone3_clear_sighand
ok 2 selftests: clone3: clone3_clear_sighand
# selftests: clone3: clone3_set_tid
ok 3 selftests: clone3: clone3_set_tid
# selftests: clone3: clone3_cap_checkpoint_restore
ok 4 selftests: clone3: clone3_cap_checkpoint_restore
The series also includes a small patch to kernel/fork.c that ensures
that clone_flags are passed correctly on architectures where unsigned
long is insufficient to store the u64 clone_flags. It is marked as a fix
for stable backporting.
As requested, in v2, this series now further tries to correct this type
error throughout the whole code base. Thus, it now touches a larger
number of subsystems and all architectures.
Therefore, another test was performed for ARCH=x86_64 (as a
representative for 64-bit architectures). Here, the series builds cleanly
without warnings on defconfig with CONFIG_SECURITY_APPARMOR=y and
CONFIG_SECURITY_TOMOYO=y (to compile-check the LSM-related changes).
The build further successfully passes testing/selftests/clone3 (with the
patch from 20241105062948.1037011-1-zhouyuhang1010(a)163.com to prepare
clone3_cap_checkpoint_restore for compatibility with the newer libcap
version on my system).
Is there any option to further preflight check this patch series via
lkp/KernelCI/etc. for a broader test across architectures, or is this
degree of testing sufficient to eventually get the series merged?
N.B.: The series is not checkpatch clean right now:
- include/linux/cred.h, include/linux/mnt_namespace.h:
function definition arguments without identifier name
- include/trace/events/task.h:
space prohibited after that open parenthesis
I did not fix these warnings to keep my changes minimal and reviewable,
as the issues persist throughout the files and they were not introduced
by me; I only followed the existing code style and just replaced the
types. If desired, I'd be happy to make the changes in a potential v3,
though.
Signed-off-by: Simon Schuster <schuster.simon(a)siemens-energy.com>
---
Changes in v2:
- Introduce "Fixes:" and "Cc: stable(a)vger.kernel.org" where necessary
- Factor out "Fixes:" when adapting the datatype of clone_flags for
easier backports
- Fix additional instances where `unsigned long` clone_flags is used
- Reword commit message to make it clearer that any 32-bit arch is
affected by this bug
- Link to v1: https://lore.kernel.org/r/20250821-nios2-implement-clone3-v1-0-1bb24017376a…
---
Simon Schuster (4):
copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
copy_process: pass clone_flags as u64 across calltree
arch: copy_thread: pass clone_flags as u64
nios2: implement architecture-specific portion of sys_clone3
arch/alpha/kernel/process.c | 2 +-
arch/arc/kernel/process.c | 2 +-
arch/arm/kernel/process.c | 2 +-
arch/arm64/kernel/process.c | 2 +-
arch/csky/kernel/process.c | 2 +-
arch/hexagon/kernel/process.c | 2 +-
arch/loongarch/kernel/process.c | 2 +-
arch/m68k/kernel/process.c | 2 +-
arch/microblaze/kernel/process.c | 2 +-
arch/mips/kernel/process.c | 2 +-
arch/nios2/include/asm/syscalls.h | 1 +
arch/nios2/include/asm/unistd.h | 2 --
arch/nios2/kernel/entry.S | 6 ++++++
arch/nios2/kernel/process.c | 2 +-
arch/nios2/kernel/syscall_table.c | 1 +
arch/openrisc/kernel/process.c | 2 +-
arch/parisc/kernel/process.c | 2 +-
arch/powerpc/kernel/process.c | 2 +-
arch/riscv/kernel/process.c | 2 +-
arch/s390/kernel/process.c | 2 +-
arch/sh/kernel/process_32.c | 2 +-
arch/sparc/kernel/process_32.c | 2 +-
arch/sparc/kernel/process_64.c | 2 +-
arch/um/kernel/process.c | 2 +-
arch/x86/include/asm/fpu/sched.h | 2 +-
arch/x86/include/asm/shstk.h | 4 ++--
arch/x86/kernel/fpu/core.c | 2 +-
arch/x86/kernel/process.c | 2 +-
arch/x86/kernel/shstk.c | 2 +-
arch/xtensa/kernel/process.c | 2 +-
block/blk-ioc.c | 2 +-
fs/namespace.c | 2 +-
include/linux/cgroup.h | 4 ++--
include/linux/cred.h | 2 +-
include/linux/iocontext.h | 6 +++---
include/linux/ipc_namespace.h | 4 ++--
include/linux/lsm_hook_defs.h | 2 +-
include/linux/mnt_namespace.h | 2 +-
include/linux/nsproxy.h | 2 +-
include/linux/pid_namespace.h | 4 ++--
include/linux/rseq.h | 4 ++--
include/linux/sched/task.h | 2 +-
include/linux/security.h | 4 ++--
include/linux/sem.h | 4 ++--
include/linux/time_namespace.h | 4 ++--
include/linux/uprobes.h | 4 ++--
include/linux/user_events.h | 4 ++--
include/linux/utsname.h | 4 ++--
include/net/net_namespace.h | 4 ++--
include/trace/events/task.h | 6 +++---
ipc/namespace.c | 2 +-
ipc/sem.c | 2 +-
kernel/cgroup/namespace.c | 2 +-
kernel/cred.c | 2 +-
kernel/events/uprobes.c | 2 +-
kernel/fork.c | 10 +++++-----
kernel/nsproxy.c | 4 ++--
kernel/pid_namespace.c | 2 +-
kernel/sched/core.c | 4 ++--
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 4 ++--
kernel/time/namespace.c | 2 +-
kernel/utsname.c | 2 +-
net/core/net_namespace.c | 2 +-
security/apparmor/lsm.c | 2 +-
security/security.c | 2 +-
security/selinux/hooks.c | 2 +-
security/tomoyo/tomoyo.c | 2 +-
68 files changed, 95 insertions(+), 89 deletions(-)
---
base-commit: 1357b2649c026b51353c84ddd32bc963e8999603
change-id: 20250818-nios2-implement-clone3-7f252c20860b
Best regards,
--
Simon Schuster <schuster.simon(a)siemens-energy.com>
Hello,
Henrik notified me that UDF in 5.15 kernel is failing for him, likely
because commit 1ea1cd11c72d ("udf: Fix directory iteration for longer tail
extents") is missing 5.15 stable tree. Basically wherever d16076d9b684b got
backported, 1ea1cd11c72d needs to be backported as well (I'm sorry for
forgetting to add proper Fixes tag back then).
Honza
--
Jan Kara <jack(a)suse.com>
SUSE Labs, CR
When the xe pm_notifier evicts for suspend / hibernate, there might be
racing tasks trying to re-validate again. This can lead to suspend taking
excessive time or get stuck in a live-lock. This behaviour becomes
much worse with the fix that actually makes re-validation bring back
bos to VRAM rather than letting them remain in TT.
Prevent that by having exec and the rebind worker waiting for a completion
that is set to block by the pm_notifier before suspend and is signaled
by the pm_notifier after resume / wakeup.
It's probably still possible to craft malicious applications that block
suspending. More work is pending to fix that.
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4288
Fixes: c6a4d46ec1d7 ("drm/xe: evict user memory in PM notifier")
Cc: Matthew Auld <matthew.auld(a)intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.16+
Signed-off-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
---
drivers/gpu/drm/xe/xe_device_types.h | 2 ++
drivers/gpu/drm/xe/xe_exec.c | 9 +++++++++
drivers/gpu/drm/xe/xe_pm.c | 4 ++++
drivers/gpu/drm/xe/xe_vm.c | 14 ++++++++++++++
4 files changed, 29 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 092004d14db2..6602bd678cbc 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -507,6 +507,8 @@ struct xe_device {
/** @pm_notifier: Our PM notifier to perform actions in response to various PM events. */
struct notifier_block pm_notifier;
+ /** @pm_block: Completion to block validating tasks on suspend / hibernate prepare */
+ struct completion pm_block;
/** @pmt: Support the PMT driver callback interface */
struct {
diff --git a/drivers/gpu/drm/xe/xe_exec.c b/drivers/gpu/drm/xe/xe_exec.c
index 44364c042ad7..374c831e691b 100644
--- a/drivers/gpu/drm/xe/xe_exec.c
+++ b/drivers/gpu/drm/xe/xe_exec.c
@@ -237,6 +237,15 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
goto err_unlock_list;
}
+ /*
+ * It's OK to block interruptible here with the vm lock held, since
+ * on task freezing during suspend / hibernate, the call will
+ * return -ERESTARTSYS and the IOCTL will be rerun.
+ */
+ err = wait_for_completion_interruptible(&xe->pm_block);
+ if (err)
+ goto err_unlock_list;
+
vm_exec.vm = &vm->gpuvm;
vm_exec.flags = DRM_EXEC_INTERRUPTIBLE_WAIT;
if (xe_vm_in_lr_mode(vm)) {
diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
index bee9aacd82e7..f4c7a84270ea 100644
--- a/drivers/gpu/drm/xe/xe_pm.c
+++ b/drivers/gpu/drm/xe/xe_pm.c
@@ -306,6 +306,7 @@ static int xe_pm_notifier_callback(struct notifier_block *nb,
switch (action) {
case PM_HIBERNATION_PREPARE:
case PM_SUSPEND_PREPARE:
+ reinit_completion(&xe->pm_block);
xe_pm_runtime_get(xe);
err = xe_bo_evict_all_user(xe);
if (err)
@@ -322,6 +323,7 @@ static int xe_pm_notifier_callback(struct notifier_block *nb,
break;
case PM_POST_HIBERNATION:
case PM_POST_SUSPEND:
+ complete_all(&xe->pm_block);
xe_bo_notifier_unprepare_all_pinned(xe);
xe_pm_runtime_put(xe);
break;
@@ -348,6 +350,8 @@ int xe_pm_init(struct xe_device *xe)
if (err)
return err;
+ init_completion(&xe->pm_block);
+ complete_all(&xe->pm_block);
/* For now suspend/resume is only allowed with GuC */
if (!xe_device_uc_enabled(xe))
return 0;
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 3ff3c67aa79d..edcdb0528f2a 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -394,6 +394,9 @@ static int xe_gpuvm_validate(struct drm_gpuvm_bo *vm_bo, struct drm_exec *exec)
list_move_tail(&gpuva_to_vma(gpuva)->combined_links.rebind,
&vm->rebind_list);
+ if (!try_wait_for_completion(&vm->xe->pm_block))
+ return -EAGAIN;
+
ret = xe_bo_validate(gem_to_xe_bo(vm_bo->obj), vm, false);
if (ret)
return ret;
@@ -494,6 +497,12 @@ static void preempt_rebind_work_func(struct work_struct *w)
xe_assert(vm->xe, xe_vm_in_preempt_fence_mode(vm));
trace_xe_vm_rebind_worker_enter(vm);
+ /*
+ * This blocks the wq during suspend / hibernate.
+ * Don't hold any locks.
+ */
+retry_pm:
+ wait_for_completion(&vm->xe->pm_block);
down_write(&vm->lock);
if (xe_vm_is_closed_or_banned(vm)) {
@@ -503,6 +512,11 @@ static void preempt_rebind_work_func(struct work_struct *w)
}
retry:
+ if (!try_wait_for_completion(&vm->xe->pm_block)) {
+ up_write(&vm->lock);
+ goto retry_pm;
+ }
+
if (xe_vm_userptr_check_repin(vm)) {
err = xe_vm_userptr_pin(vm);
if (err)
--
2.50.1
switch_to_super_page() assumes the memory range it's working on is aligned
to the target large page level. Unfortunately, __domain_mapping() doesn't
take this into account when using it, and will pass unaligned ranges
ultimately freeing a PTE range larger than expected.
Take for example a mapping with the following iov_pfn range [0x3fe400,
0x4c0600], which should be backed by the following mappings:
iov_pfn [0x3fe400, 0x3fffff] covered by 2MiB pages
iov_pfn [0x400000, 0x4bffff] covered by 1GiB pages
iov_pfn [0x4c0000, 0x4c05ff] covered by 2MiB pages
Under this circumstance, __domain_mapping() will pass [0x400000, 0x4c05ff]
to switch_to_super_page() at a 1 GiB granularity, which will in turn
free PTEs all the way to iov_pfn 0x4fffff.
Mitigate this by rounding down the iov_pfn range passed to
switch_to_super_page() in __domain_mapping()
to the target large page level.
Additionally add range alignment checks to switch_to_super_page.
Fixes: 9906b9352a35 ("iommu/vt-d: Avoid duplicate removing in __domain_mapping()")
Signed-off-by: Eugene Koira <eugkoira(a)amazon.com>
Cc: stable(a)vger.kernel.org
---
drivers/iommu/intel/iommu.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 9c3ab9d9f69a..dff2d895b8ab 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1575,6 +1575,10 @@ static void switch_to_super_page(struct dmar_domain *domain,
unsigned long lvl_pages = lvl_to_nr_pages(level);
struct dma_pte *pte = NULL;
+ if (WARN_ON(!IS_ALIGNED(start_pfn, lvl_pages) ||
+ !IS_ALIGNED(end_pfn + 1, lvl_pages)))
+ return;
+
while (start_pfn <= end_pfn) {
if (!pte)
pte = pfn_to_dma_pte(domain, start_pfn, &level,
@@ -1650,7 +1654,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
unsigned long pages_to_remove;
pteval |= DMA_PTE_LARGE_PAGE;
- pages_to_remove = min_t(unsigned long, nr_pages,
+ pages_to_remove = min_t(unsigned long,
+ round_down(nr_pages, lvl_pages),
nr_pte_to_next_page(pte) * lvl_pages);
end_pfn = iov_pfn + pages_to_remove - 1;
switch_to_super_page(domain, iov_pfn, end_pfn, largepage_lvl);
--
2.47.3
Amazon Development Center (Netherlands) B.V., Johanna Westerdijkplein 1, NL-2521 EN The Hague, Registration No. Chamber of Commerce 56869649, VAT: NL 852339859B01
Fix incorrect use of IS_ERR() to check devm_kzalloc() return value.
devm_kzalloc() returns NULL on failure, not an error pointer.
This issue was introduced by commit 4277f035d227
("coresight: trbe: Add a representative coresight_platform_data for TRBE")
which replaced the original function but didn't update the error check.
Fixes: 4277f035d227 ("coresight: trbe: Add a representative coresight_platform_data for TRBE")
Cc: stable(a)vger.kernel.org
Signed-off-by: Miaoqian Lin <linmq006(a)gmail.com>
---
drivers/hwtracing/coresight/coresight-trbe.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hwtracing/coresight/coresight-trbe.c b/drivers/hwtracing/coresight/coresight-trbe.c
index 8267dd1a2130..caf873adfc3a 100644
--- a/drivers/hwtracing/coresight/coresight-trbe.c
+++ b/drivers/hwtracing/coresight/coresight-trbe.c
@@ -1279,7 +1279,7 @@ static void arm_trbe_register_coresight_cpu(struct trbe_drvdata *drvdata, int cp
* into the device for that purpose.
*/
desc.pdata = devm_kzalloc(dev, sizeof(*desc.pdata), GFP_KERNEL);
- if (IS_ERR(desc.pdata))
+ if (!desc.pdata)
goto cpu_clear;
desc.type = CORESIGHT_DEV_TYPE_SINK;
--
2.35.1
The patch titled
Subject: mm: fix folio_expected_ref_count() when PG_private_2
has been added to the -mm mm-new branch. Its filename is
mm-fix-folio_expected_ref_count-when-pg_private_2.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: fix folio_expected_ref_count() when PG_private_2
Date: Sun, 31 Aug 2025 02:01:16 -0700 (PDT)
6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
which (like PG_private, but not in addition to PG_private) counts for 1
more reference: it needs to be using folio_has_private() in place of
folio_test_private().
But this went wrong earlier: folio_expected_ref_count() was based on (and
replaced) mm/migrate.c's folio_expected_refs(), which has been using
folio_test_private() since 6.0 converted to folios(): before that,
expected_page_refs() was correctly using page_has_private().
Just a few filesystems are still using PG_private_2 a.k.a. PG_fscache.
Potentially, this fix re-enables page migration on their folios; but it
would not be surprising to learn that in practice those folios are not
migratable for other reasons.
Link: https://lkml.kernel.org/r/f91ee36e-a8cb-e3a4-c23b-524ff3848da7@google.com
Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation")
Fixes: 108ca8358139 ("mm/migrate: Convert expected_page_refs() to folio_expected_refs()")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)kernel.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Keir Fraser <keirf(a)google.com>
Cc: Konstantin Khlebnikov <koct9i(a)gmail.com>
Cc: Li Zhe <lizhe.67(a)bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Shivank Garg <shivankg(a)amd.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Wei Xu <weixugc(a)google.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: yangge <yangge1116(a)126.com>
Cc: Yuanchu Xie <yuanchu(a)google.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/include/linux/mm.h~mm-fix-folio_expected_ref_count-when-pg_private_2
+++ a/include/linux/mm.h
@@ -2234,8 +2234,8 @@ static inline int folio_expected_ref_cou
} else {
/* One reference per page from the pagecache. */
ref_count += !!folio->mapping << order;
- /* One reference from PG_private. */
- ref_count += folio_test_private(folio);
+ /* One reference from PG_private or PG_private_2. */
+ ref_count += folio_has_private(folio);
}
/* One reference per page table mapping. */
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-fix-folio_expected_ref_count-when-pg_private_2.patch
mm-gup-check-ref_count-instead-of-lru-before-migration.patch
mm-gup-local-lru_add_drain-to-avoid-lru_add_drain_all.patch
mm-revert-mm-gup-clear-the-lru-flag-of-a-page-before-adding-to-lru-batch.patch
mm-revert-mm-vmscanc-fix-oom-on-swap-stress-test.patch
mm-folio_may_be_cached-unless-folio_test_large.patch
mm-lru_add_drain_all-do-local-lru_add_drain-first.patch
Please backport the following commit into Linux v5.4 and newer stable
releases:
80dc18a0cba8dea42614f021b20a04354b213d86
The backport will likely depend on macro rename commit:
817f989700fddefa56e5e443e7d138018ca6709d
This part of commit description clarifies why this is a fix:
"
As per PCIe r6.0, sec 6.6.1, a Downstream Port that supports Link speeds
greater than 5.0 GT/s, software must wait a minimum of 100 ms after Link
training completes before sending a Configuration Request.
"
In practice, this makes detection of PCIe Gen3 and Gen4 SSDs reliable on
Renesas R-Car V4H SoC. Without this commit, the SSDs sporadically do not
get detected, or sometimes they link up in Gen1 mode.
This fixes commit
886bc5ceb5cc ("PCI: designware: Add generic dw_pcie_wait_for_link()")
which is in v4.5-rc1-4-g886bc5ceb5cc3 , so I think this fix should be
backported to all currently maintained stable releases, i.e. v5.4+ .
Thank you
--
Best regards,
Marek Vasut
The patch titled
Subject: mm: folio_may_be_cached() unless folio_test_large()
has been added to the -mm mm-new branch. Its filename is
mm-folio_may_be_cached-unless-folio_test_large.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: folio_may_be_cached() unless folio_test_large()
Date: Sun, 31 Aug 2025 02:16:25 -0700 (PDT)
mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as a
large folio is added: so collect_longterm_unpinnable_folios() just wastes
effort when calling lru_add_drain_all() on a large folio.
But although there is good reason not to batch up PMD-sized folios, we
might well benefit from batching a small number of low-order mTHPs (though
unclear how that "small number" limitation will be implemented).
So ask if folio_may_be_cached() rather than !folio_test_large(), to
insulate those particular checks from future change. Name preferred to
"folio_is_batchable" because large folios can well be put on a batch: it's
just the per-CPU LRU caches, drained much later, which need care.
Marked for stable, to counter the increase in lru_add_drain_all()s from
"mm/gup: check ref_count instead of lru before migration".
Link: https://lkml.kernel.org/r/861c061c-51cd-b940-49df-9f55e1fee2c8@google.com
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)kernel.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Keir Fraser <keirf(a)google.com>
Cc: Konstantin Khlebnikov <koct9i(a)gmail.com>
Cc: Li Zhe <lizhe.67(a)bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Shivank Garg <shivankg(a)amd.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Wei Xu <weixugc(a)google.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: yangge <yangge1116(a)126.com>
Cc: Yuanchu Xie <yuanchu(a)google.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/swap.h | 10 ++++++++++
mm/gup.c | 5 +++--
mm/mlock.c | 6 +++---
mm/swap.c | 2 +-
4 files changed, 17 insertions(+), 6 deletions(-)
--- a/include/linux/swap.h~mm-folio_may_be_cached-unless-folio_test_large
+++ a/include/linux/swap.h
@@ -381,6 +381,16 @@ void folio_add_lru_vma(struct folio *, s
void mark_page_accessed(struct page *);
void folio_mark_accessed(struct folio *);
+static inline bool folio_may_be_cached(struct folio *folio)
+{
+ /*
+ * Holding PMD-sized folios in per-CPU LRU cache unbalances accounting.
+ * Holding small numbers of low-order mTHP folios in per-CPU LRU cache
+ * will be sensible, but nobody has implemented and tested that yet.
+ */
+ return !folio_test_large(folio);
+}
+
extern atomic_t lru_disable_count;
static inline bool lru_cache_disabled(void)
--- a/mm/gup.c~mm-folio_may_be_cached-unless-folio_test_large
+++ a/mm/gup.c
@@ -2309,8 +2309,9 @@ static unsigned long collect_longterm_un
continue;
}
- if (drain_allow && folio_ref_count(folio) !=
- folio_expected_ref_count(folio) + 1) {
+ if (drain_allow && folio_may_be_cached(folio) &&
+ folio_ref_count(folio) !=
+ folio_expected_ref_count(folio) + 1) {
lru_add_drain_all();
drain_allow = false;
}
--- a/mm/mlock.c~mm-folio_may_be_cached-unless-folio_test_large
+++ a/mm/mlock.c
@@ -255,7 +255,7 @@ void mlock_folio(struct folio *folio)
folio_get(folio);
if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
- folio_test_large(folio) || lru_cache_disabled())
+ !folio_may_be_cached(folio) || lru_cache_disabled())
mlock_folio_batch(fbatch);
local_unlock(&mlock_fbatch.lock);
}
@@ -278,7 +278,7 @@ void mlock_new_folio(struct folio *folio
folio_get(folio);
if (!folio_batch_add(fbatch, mlock_new(folio)) ||
- folio_test_large(folio) || lru_cache_disabled())
+ !folio_may_be_cached(folio) || lru_cache_disabled())
mlock_folio_batch(fbatch);
local_unlock(&mlock_fbatch.lock);
}
@@ -299,7 +299,7 @@ void munlock_folio(struct folio *folio)
*/
folio_get(folio);
if (!folio_batch_add(fbatch, folio) ||
- folio_test_large(folio) || lru_cache_disabled())
+ !folio_may_be_cached(folio) || lru_cache_disabled())
mlock_folio_batch(fbatch);
local_unlock(&mlock_fbatch.lock);
}
--- a/mm/swap.c~mm-folio_may_be_cached-unless-folio_test_large
+++ a/mm/swap.c
@@ -192,7 +192,7 @@ static void __folio_batch_add_and_move(s
local_lock(&cpu_fbatches.lock);
if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
- folio_test_large(folio) || lru_cache_disabled())
+ !folio_may_be_cached(folio) || lru_cache_disabled())
folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
if (disable_irq)
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-fix-folio_expected_ref_count-when-pg_private_2.patch
mm-gup-check-ref_count-instead-of-lru-before-migration.patch
mm-gup-local-lru_add_drain-to-avoid-lru_add_drain_all.patch
mm-revert-mm-gup-clear-the-lru-flag-of-a-page-before-adding-to-lru-batch.patch
mm-revert-mm-vmscanc-fix-oom-on-swap-stress-test.patch
mm-folio_may_be_cached-unless-folio_test_large.patch
mm-lru_add_drain_all-do-local-lru_add_drain-first.patch
The patch titled
Subject: mm: Revert "mm: vmscan.c: fix OOM on swap stress test"
has been added to the -mm mm-new branch. Its filename is
mm-revert-mm-vmscanc-fix-oom-on-swap-stress-test.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: Revert "mm: vmscan.c: fix OOM on swap stress test"
Date: Sun, 31 Aug 2025 02:13:55 -0700 (PDT)
This reverts commit 0885ef4705607936fc36a38fd74356e1c465b023: that
was a fix to the reverted 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9.
Link: https://lkml.kernel.org/r/bd440614-3d2c-31d6-1b8f-9635c69cef7c@google.com
Fixes: 0885ef470560 ("mm: vmscan.c: fix OOM on swap stress test")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)kernel.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Keir Fraser <keirf(a)google.com>
Cc: Konstantin Khlebnikov <koct9i(a)gmail.com>
Cc: Li Zhe <lizhe.67(a)bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Shivank Garg <shivankg(a)amd.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Wei Xu <weixugc(a)google.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: yangge <yangge1116(a)126.com>
Cc: Yuanchu Xie <yuanchu(a)google.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/vmscan.c~mm-revert-mm-vmscanc-fix-oom-on-swap-stress-test
+++ a/mm/vmscan.c
@@ -4500,7 +4500,7 @@ static bool sort_folio(struct lruvec *lr
}
/* ineligible */
- if (!folio_test_lru(folio) || zone > sc->reclaim_idx) {
+ if (zone > sc->reclaim_idx) {
gen = folio_inc_gen(lruvec, folio, false);
list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-fix-folio_expected_ref_count-when-pg_private_2.patch
mm-gup-check-ref_count-instead-of-lru-before-migration.patch
mm-gup-local-lru_add_drain-to-avoid-lru_add_drain_all.patch
mm-revert-mm-gup-clear-the-lru-flag-of-a-page-before-adding-to-lru-batch.patch
mm-revert-mm-vmscanc-fix-oom-on-swap-stress-test.patch
mm-folio_may_be_cached-unless-folio_test_large.patch
mm-lru_add_drain_all-do-local-lru_add_drain-first.patch
The patch titled
Subject: mm: Revert "mm/gup: clear the LRU flag of a page before adding to LRU batch"
has been added to the -mm mm-new branch. Its filename is
mm-revert-mm-gup-clear-the-lru-flag-of-a-page-before-adding-to-lru-batch.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm: Revert "mm/gup: clear the LRU flag of a page before adding to LRU batch"
Date: Sun, 31 Aug 2025 02:11:33 -0700 (PDT)
This reverts commit 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9: now that
collect_longterm_unpinnable_folios() is checking ref_count instead of lru,
and mlock/munlock do not participate in the revised LRU flag clearing,
those changes are misleading, and enlarge the window during which
mlock/munlock may miss an mlock_count update.
It is possible (I'd hesitate to claim probable) that the greater
likelihood of missed mlock_count updates would explain the "Realtime
threads delayed due to kcompactd0" observed on 6.12 in the Link below. If
that is the case, this reversion will help; but a complete solution needs
also a further patch, beyond the scope of this series.
Included some 80-column cleanup around folio_batch_add_and_move().
The role of folio_test_clear_lru() (before taking per-memcg lru_lock) is
questionable since 6.13 removed mem_cgroup_move_account() etc; but perhaps
there are still some races which need it - not examined here.
Link: https://lore.kernel.org/linux-mm/DU0PR01MB10385345F7153F334100981888259A@DU…
Link: https://lkml.kernel.org/r/0215a42b-99cd-612a-95f7-56f8251d99ef@google.com
Fixes: 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)kernel.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Keir Fraser <keirf(a)google.com>
Cc: Konstantin Khlebnikov <koct9i(a)gmail.com>
Cc: Li Zhe <lizhe.67(a)bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Shivank Garg <shivankg(a)amd.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Wei Xu <weixugc(a)google.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: yangge <yangge1116(a)126.com>
Cc: Yuanchu Xie <yuanchu(a)google.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/swap.c | 50 ++++++++++++++++++++++++++------------------------
1 file changed, 26 insertions(+), 24 deletions(-)
--- a/mm/swap.c~mm-revert-mm-gup-clear-the-lru-flag-of-a-page-before-adding-to-lru-batch
+++ a/mm/swap.c
@@ -164,6 +164,10 @@ static void folio_batch_move_lru(struct
for (i = 0; i < folio_batch_count(fbatch); i++) {
struct folio *folio = fbatch->folios[i];
+ /* block memcg migration while the folio moves between lru */
+ if (move_fn != lru_add && !folio_test_clear_lru(folio))
+ continue;
+
folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
move_fn(lruvec, folio);
@@ -176,14 +180,10 @@ static void folio_batch_move_lru(struct
}
static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
- struct folio *folio, move_fn_t move_fn,
- bool on_lru, bool disable_irq)
+ struct folio *folio, move_fn_t move_fn, bool disable_irq)
{
unsigned long flags;
- if (on_lru && !folio_test_clear_lru(folio))
- return;
-
folio_get(folio);
if (disable_irq)
@@ -191,8 +191,8 @@ static void __folio_batch_add_and_move(s
else
local_lock(&cpu_fbatches.lock);
- if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) ||
- lru_cache_disabled())
+ if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
+ folio_test_large(folio) || lru_cache_disabled())
folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
if (disable_irq)
@@ -201,13 +201,13 @@ static void __folio_batch_add_and_move(s
local_unlock(&cpu_fbatches.lock);
}
-#define folio_batch_add_and_move(folio, op, on_lru) \
- __folio_batch_add_and_move( \
- &cpu_fbatches.op, \
- folio, \
- op, \
- on_lru, \
- offsetof(struct cpu_fbatches, op) >= offsetof(struct cpu_fbatches, lock_irq) \
+#define folio_batch_add_and_move(folio, op) \
+ __folio_batch_add_and_move( \
+ &cpu_fbatches.op, \
+ folio, \
+ op, \
+ offsetof(struct cpu_fbatches, op) >= \
+ offsetof(struct cpu_fbatches, lock_irq) \
)
static void lru_move_tail(struct lruvec *lruvec, struct folio *folio)
@@ -231,10 +231,10 @@ static void lru_move_tail(struct lruvec
void folio_rotate_reclaimable(struct folio *folio)
{
if (folio_test_locked(folio) || folio_test_dirty(folio) ||
- folio_test_unevictable(folio))
+ folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
- folio_batch_add_and_move(folio, lru_move_tail, true);
+ folio_batch_add_and_move(folio, lru_move_tail);
}
void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
@@ -328,10 +328,11 @@ static void folio_activate_drain(int cpu
void folio_activate(struct folio *folio)
{
- if (folio_test_active(folio) || folio_test_unevictable(folio))
+ if (folio_test_active(folio) || folio_test_unevictable(folio) ||
+ !folio_test_lru(folio))
return;
- folio_batch_add_and_move(folio, lru_activate, true);
+ folio_batch_add_and_move(folio, lru_activate);
}
#else
@@ -507,7 +508,7 @@ void folio_add_lru(struct folio *folio)
lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
folio_set_active(folio);
- folio_batch_add_and_move(folio, lru_add, false);
+ folio_batch_add_and_move(folio, lru_add);
}
EXPORT_SYMBOL(folio_add_lru);
@@ -685,13 +686,13 @@ void lru_add_drain_cpu(int cpu)
void deactivate_file_folio(struct folio *folio)
{
/* Deactivating an unevictable folio will not accelerate reclaim */
- if (folio_test_unevictable(folio))
+ if (folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
if (lru_gen_enabled() && lru_gen_clear_refs(folio))
return;
- folio_batch_add_and_move(folio, lru_deactivate_file, true);
+ folio_batch_add_and_move(folio, lru_deactivate_file);
}
/*
@@ -704,13 +705,13 @@ void deactivate_file_folio(struct folio
*/
void folio_deactivate(struct folio *folio)
{
- if (folio_test_unevictable(folio))
+ if (folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
if (lru_gen_enabled() ? lru_gen_clear_refs(folio) : !folio_test_active(folio))
return;
- folio_batch_add_and_move(folio, lru_deactivate, true);
+ folio_batch_add_and_move(folio, lru_deactivate);
}
/**
@@ -723,10 +724,11 @@ void folio_deactivate(struct folio *foli
void folio_mark_lazyfree(struct folio *folio)
{
if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) ||
+ !folio_test_lru(folio) ||
folio_test_swapcache(folio) || folio_test_unevictable(folio))
return;
- folio_batch_add_and_move(folio, lru_lazyfree, true);
+ folio_batch_add_and_move(folio, lru_lazyfree);
}
void lru_add_drain(void)
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-fix-folio_expected_ref_count-when-pg_private_2.patch
mm-gup-check-ref_count-instead-of-lru-before-migration.patch
mm-gup-local-lru_add_drain-to-avoid-lru_add_drain_all.patch
mm-revert-mm-gup-clear-the-lru-flag-of-a-page-before-adding-to-lru-batch.patch
mm-revert-mm-vmscanc-fix-oom-on-swap-stress-test.patch
mm-folio_may_be_cached-unless-folio_test_large.patch
mm-lru_add_drain_all-do-local-lru_add_drain-first.patch
The patch titled
Subject: mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
has been added to the -mm mm-new branch. Its filename is
mm-gup-local-lru_add_drain-to-avoid-lru_add_drain_all.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
Date: Sun, 31 Aug 2025 02:08:26 -0700 (PDT)
In many cases, if collect_longterm_unpinnable_folios() does need to drain
the LRU cache to release a reference, the cache in question is on this
same CPU, and much more efficiently drained by a preliminary local
lru_add_drain(), than the later cross-CPU lru_add_drain_all().
Marked for stable, to counter the increase in lru_add_drain_all()s from
"mm/gup: check ref_count instead of lru before migration". Note for clean
backports: can take 6.16 commit a03db236aebf ("gup: optimize longterm
pin_user_pages() for large folio") first.
Link: https://lkml.kernel.org/r/165ccfdb-16b5-ac11-0d66-2984293590c8@google.com
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)kernel.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Keir Fraser <keirf(a)google.com>
Cc: Konstantin Khlebnikov <koct9i(a)gmail.com>
Cc: Li Zhe <lizhe.67(a)bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Shivank Garg <shivankg(a)amd.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Wei Xu <weixugc(a)google.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: yangge <yangge1116(a)126.com>
Cc: Yuanchu Xie <yuanchu(a)google.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/gup.c | 2 ++
1 file changed, 2 insertions(+)
--- a/mm/gup.c~mm-gup-local-lru_add_drain-to-avoid-lru_add_drain_all
+++ a/mm/gup.c
@@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_un
struct folio *folio;
long i = 0;
+ lru_add_drain();
+
for (folio = pofs_get_folio(pofs, i); folio;
folio = pofs_next_folio(folio, pofs, &i)) {
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-fix-folio_expected_ref_count-when-pg_private_2.patch
mm-gup-check-ref_count-instead-of-lru-before-migration.patch
mm-gup-local-lru_add_drain-to-avoid-lru_add_drain_all.patch
mm-revert-mm-gup-clear-the-lru-flag-of-a-page-before-adding-to-lru-batch.patch
mm-revert-mm-vmscanc-fix-oom-on-swap-stress-test.patch
mm-folio_may_be_cached-unless-folio_test_large.patch
mm-lru_add_drain_all-do-local-lru_add_drain-first.patch
The patch titled
Subject: mm/gup: check ref_count instead of lru before migration
has been added to the -mm mm-new branch. Its filename is
mm-gup-check-ref_count-instead-of-lru-before-migration.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Hugh Dickins <hughd(a)google.com>
Subject: mm/gup: check ref_count instead of lru before migration
Date: Sun, 31 Aug 2025 02:05:56 -0700 (PDT)
Will Deacon reports:-
When taking a longterm GUP pin via pin_user_pages(),
__gup_longterm_locked() tries to migrate target folios that should not be
longterm pinned, for example because they reside in a CMA region or
movable zone. This is done by first pinning all of the target folios
anyway, collecting all of the longterm-unpinnable target folios into a
list, dropping the pins that were just taken and finally handing the list
off to migrate_pages() for the actual migration.
It is critically important that no unexpected references are held on the
folios being migrated, otherwise the migration will fail and
pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
relatively easy to observe migration failures when running pKVM (which
uses pin_user_pages() on crosvm's virtual address space to resolve stage-2
page faults from the guest) on a 6.15-based Pixel 6 device and this
results in the VM terminating prematurely.
In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
mapping of guest memory prior to the pinning. Subsequently, when
pin_user_pages() walks the page-table, the relevant 'pte' is not present
and so the faulting logic allocates a new folio, mlocks it with
mlock_folio() and maps it in the page-table.
Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() batch
by pagevec"), mlock/munlock operations on a folio (formerly page), are
deferred. For example, mlock_folio() takes an additional reference on the
target folio before placing it into a per-cpu 'folio_batch' for later
processing by mlock_folio_batch(), which drops the refcount once the
operation is complete. Processing of the batches is coupled with the LRU
batch logic and can be forcefully drained with lru_add_drain_all() but as
long as a folio remains unprocessed on the batch, its refcount will be
elevated.
This deferred batching therefore interacts poorly with the pKVM pinning
scenario as we can find ourselves in a situation where the migration code
fails to migrate a folio due to the elevated refcount from the pending
mlock operation.
Hugh Dickins adds:-
!folio_test_lru() has never been a very reliable way to tell if an
lru_add_drain_all() is worth calling, to remove LRU cache references to
make the folio migratable: the LRU flag may be set even while the folio is
held with an extra reference in a per-CPU LRU cache.
5.18 commit 2fbb0c10d1e8 may have made it more unreliable. Then 6.11
commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding
to LRU batch") tried to make it reliable, by moving LRU flag clearing; but
missed the mlock/munlock batches, so still unreliable as reported.
And it turns out to be difficult to extend 33dfe9204f29's LRU flag
clearing to the mlock/munlock batches: if they do benefit from batching,
mlock/munlock cannot be so effective when easily suppressed while !LRU.
Instead, switch to an expected ref_count check, which was more reliable
all along: some more false positives (unhelpful drains) than before, and
never a guarantee that the folio will prove migratable, but better.
Note for stable backports: requires 6.16 commit 86ebd50224c0 ("mm: add
folio_expected_ref_count() for reference count calculation") and 6.17
commit ("mm: fix folio_expected_ref_count() when PG_private_2").
Link: https://lkml.kernel.org/r/47c51c9a-140f-1ea1-b692-c4bae5d1fa58@google.com
Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Signed-off-by: Hugh Dickins <hughd(a)google.com>
Reported-by: Will Deacon <will(a)kernel.org>
Link: https://lore.kernel.org/linux-mm/20250815101858.24352-1-will@kernel.org/
Cc: <stable(a)vger.kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)kernel.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: Christoph Hellwig <hch(a)infradead.org>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Jason Gunthorpe <jgg(a)ziepe.ca>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Keir Fraser <keirf(a)google.com>
Cc: Konstantin Khlebnikov <koct9i(a)gmail.com>
Cc: Li Zhe <lizhe.67(a)bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Shivank Garg <shivankg(a)amd.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Wei Xu <weixugc(a)google.com>
Cc: yangge <yangge1116(a)126.com>
Cc: Yuanchu Xie <yuanchu(a)google.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/gup.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/gup.c~mm-gup-check-ref_count-instead-of-lru-before-migration
+++ a/mm/gup.c
@@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_un
continue;
}
- if (!folio_test_lru(folio) && drain_allow) {
+ if (drain_allow && folio_ref_count(folio) !=
+ folio_expected_ref_count(folio) + 1) {
lru_add_drain_all();
drain_allow = false;
}
_
Patches currently in -mm which might be from hughd(a)google.com are
mm-fix-folio_expected_ref_count-when-pg_private_2.patch
mm-gup-check-ref_count-instead-of-lru-before-migration.patch
mm-gup-local-lru_add_drain-to-avoid-lru_add_drain_all.patch
mm-revert-mm-gup-clear-the-lru-flag-of-a-page-before-adding-to-lru-batch.patch
mm-revert-mm-vmscanc-fix-oom-on-swap-stress-test.patch
mm-folio_may_be_cached-unless-folio_test_large.patch
mm-lru_add_drain_all-do-local-lru_add_drain-first.patch
The patch titled
Subject: mm/vmalloc, mm/kasan: respect gfp mask in kasan_populate_vmalloc()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-vmalloc-mm-kasan-respect-gfp-mask-in-kasan_populate_vmalloc.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "Uladzislau Rezki (Sony)" <urezki(a)gmail.com>
Subject: mm/vmalloc, mm/kasan: respect gfp mask in kasan_populate_vmalloc()
Date: Sun, 31 Aug 2025 14:10:58 +0200
kasan_populate_vmalloc() and its helpers ignore the caller's gfp_mask and
always allocate memory using the hardcoded GFP_KERNEL flag. This makes
them inconsistent with vmalloc(), which was recently extended to support
GFP_NOFS and GFP_NOIO allocations.
Page table allocations performed during shadow population also ignore the
external gfp_mask. To preserve the intended semantics of GFP_NOFS and
GFP_NOIO, wrap the apply_to_page_range() calls into the appropriate
memalloc scope.
This patch:
- Extends kasan_populate_vmalloc() and helpers to take gfp_mask;
- Passes gfp_mask down to alloc_pages_bulk() and __get_free_page();
- Enforces GFP_NOFS/NOIO semantics with memalloc_*_save()/restore()
around apply_to_page_range();
- Updates vmalloc.c and percpu allocator call sites accordingly.
Link: https://lkml.kernel.org/r/20250831121058.92971-1-urezki@gmail.com
Fixes: 451769ebb7e7 ("mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc")
Signed-off-by: Uladzislau Rezki (Sony) <urezki(a)gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/kasan.h | 6 +++---
mm/kasan/shadow.c | 31 ++++++++++++++++++++++++-------
mm/vmalloc.c | 8 ++++----
3 files changed, 31 insertions(+), 14 deletions(-)
--- a/include/linux/kasan.h~mm-vmalloc-mm-kasan-respect-gfp-mask-in-kasan_populate_vmalloc
+++ a/include/linux/kasan.h
@@ -562,7 +562,7 @@ static inline void kasan_init_hw_tags(vo
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
void kasan_populate_early_vm_area_shadow(void *start, unsigned long size);
-int kasan_populate_vmalloc(unsigned long addr, unsigned long size);
+int kasan_populate_vmalloc(unsigned long addr, unsigned long size, gfp_t gfp_mask);
void kasan_release_vmalloc(unsigned long start, unsigned long end,
unsigned long free_region_start,
unsigned long free_region_end,
@@ -574,7 +574,7 @@ static inline void kasan_populate_early_
unsigned long size)
{ }
static inline int kasan_populate_vmalloc(unsigned long start,
- unsigned long size)
+ unsigned long size, gfp_t gfp_mask)
{
return 0;
}
@@ -610,7 +610,7 @@ static __always_inline void kasan_poison
static inline void kasan_populate_early_vm_area_shadow(void *start,
unsigned long size) { }
static inline int kasan_populate_vmalloc(unsigned long start,
- unsigned long size)
+ unsigned long size, gfp_t gfp_mask)
{
return 0;
}
--- a/mm/kasan/shadow.c~mm-vmalloc-mm-kasan-respect-gfp-mask-in-kasan_populate_vmalloc
+++ a/mm/kasan/shadow.c
@@ -336,13 +336,13 @@ static void ___free_pages_bulk(struct pa
}
}
-static int ___alloc_pages_bulk(struct page **pages, int nr_pages)
+static int ___alloc_pages_bulk(struct page **pages, int nr_pages, gfp_t gfp_mask)
{
unsigned long nr_populated, nr_total = nr_pages;
struct page **page_array = pages;
while (nr_pages) {
- nr_populated = alloc_pages_bulk(GFP_KERNEL, nr_pages, pages);
+ nr_populated = alloc_pages_bulk(gfp_mask, nr_pages, pages);
if (!nr_populated) {
___free_pages_bulk(page_array, nr_total - nr_pages);
return -ENOMEM;
@@ -354,25 +354,42 @@ static int ___alloc_pages_bulk(struct pa
return 0;
}
-static int __kasan_populate_vmalloc(unsigned long start, unsigned long end)
+static int __kasan_populate_vmalloc(unsigned long start, unsigned long end, gfp_t gfp_mask)
{
unsigned long nr_pages, nr_total = PFN_UP(end - start);
struct vmalloc_populate_data data;
+ unsigned int flags;
int ret = 0;
- data.pages = (struct page **)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ data.pages = (struct page **)__get_free_page(gfp_mask | __GFP_ZERO);
if (!data.pages)
return -ENOMEM;
while (nr_total) {
nr_pages = min(nr_total, PAGE_SIZE / sizeof(data.pages[0]));
- ret = ___alloc_pages_bulk(data.pages, nr_pages);
+ ret = ___alloc_pages_bulk(data.pages, nr_pages, gfp_mask);
if (ret)
break;
data.start = start;
+
+ /*
+ * page tables allocations ignore external gfp mask, enforce it
+ * by the scope API
+ */
+ if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+ flags = memalloc_nofs_save();
+ else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
+ flags = memalloc_noio_save();
+
ret = apply_to_page_range(&init_mm, start, nr_pages * PAGE_SIZE,
kasan_populate_vmalloc_pte, &data);
+
+ if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+ memalloc_nofs_restore(flags);
+ else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
+ memalloc_noio_restore(flags);
+
___free_pages_bulk(data.pages, nr_pages);
if (ret)
break;
@@ -386,7 +403,7 @@ static int __kasan_populate_vmalloc(unsi
return ret;
}
-int kasan_populate_vmalloc(unsigned long addr, unsigned long size)
+int kasan_populate_vmalloc(unsigned long addr, unsigned long size, gfp_t gfp_mask)
{
unsigned long shadow_start, shadow_end;
int ret;
@@ -415,7 +432,7 @@ int kasan_populate_vmalloc(unsigned long
shadow_start = PAGE_ALIGN_DOWN(shadow_start);
shadow_end = PAGE_ALIGN(shadow_end);
- ret = __kasan_populate_vmalloc(shadow_start, shadow_end);
+ ret = __kasan_populate_vmalloc(shadow_start, shadow_end, gfp_mask);
if (ret)
return ret;
--- a/mm/vmalloc.c~mm-vmalloc-mm-kasan-respect-gfp-mask-in-kasan_populate_vmalloc
+++ a/mm/vmalloc.c
@@ -2026,6 +2026,8 @@ static struct vmap_area *alloc_vmap_area
if (unlikely(!vmap_initialized))
return ERR_PTR(-EBUSY);
+ /* Only reclaim behaviour flags are relevant. */
+ gfp_mask = gfp_mask & GFP_RECLAIM_MASK;
might_sleep();
/*
@@ -2038,8 +2040,6 @@ static struct vmap_area *alloc_vmap_area
*/
va = node_alloc(size, align, vstart, vend, &addr, &vn_id);
if (!va) {
- gfp_mask = gfp_mask & GFP_RECLAIM_MASK;
-
va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
if (unlikely(!va))
return ERR_PTR(-ENOMEM);
@@ -2089,7 +2089,7 @@ retry:
BUG_ON(va->va_start < vstart);
BUG_ON(va->va_end > vend);
- ret = kasan_populate_vmalloc(addr, size);
+ ret = kasan_populate_vmalloc(addr, size, gfp_mask);
if (ret) {
free_vmap_area(va);
return ERR_PTR(ret);
@@ -4826,7 +4826,7 @@ retry:
/* populate the kasan shadow space */
for (area = 0; area < nr_vms; area++) {
- if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area]))
+ if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area], GFP_KERNEL))
goto err_free_shadow;
}
_
Patches currently in -mm which might be from urezki(a)gmail.com are
mm-vmalloc-mm-kasan-respect-gfp-mask-in-kasan_populate_vmalloc.patch
The patch titled
Subject: ocfs2: fix recursive semaphore deadlock in fiemap call
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
ocfs2-fix-recursive-semaphore-deadlock-in-fiemap-call.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Mark Tinguely <mark.tinguely(a)oracle.com>
Subject: ocfs2: fix recursive semaphore deadlock in fiemap call
Date: Fri, 29 Aug 2025 10:18:15 -0500
syzbot detected a OCFS2 hang due to a recursive semaphore on a
FS_IOC_FIEMAP of the extent list on a specially crafted mmap file.
context_switch kernel/sched/core.c:5357 [inline]
__schedule+0x1798/0x4cc0 kernel/sched/core.c:6961
__schedule_loop kernel/sched/core.c:7043 [inline]
schedule+0x165/0x360 kernel/sched/core.c:7058
schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:7115
rwsem_down_write_slowpath+0x872/0xfe0 kernel/locking/rwsem.c:1185
__down_write_common kernel/locking/rwsem.c:1317 [inline]
__down_write kernel/locking/rwsem.c:1326 [inline]
down_write+0x1ab/0x1f0 kernel/locking/rwsem.c:1591
ocfs2_page_mkwrite+0x2ff/0xc40 fs/ocfs2/mmap.c:142
do_page_mkwrite+0x14d/0x310 mm/memory.c:3361
wp_page_shared mm/memory.c:3762 [inline]
do_wp_page+0x268d/0x5800 mm/memory.c:3981
handle_pte_fault mm/memory.c:6068 [inline]
__handle_mm_fault+0x1033/0x5440 mm/memory.c:6195
handle_mm_fault+0x40a/0x8e0 mm/memory.c:6364
do_user_addr_fault+0x764/0x1390 arch/x86/mm/fault.c:1387
handle_page_fault arch/x86/mm/fault.c:1476 [inline]
exc_page_fault+0x76/0xf0 arch/x86/mm/fault.c:1532
asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:623
RIP: 0010:copy_user_generic arch/x86/include/asm/uaccess_64.h:126 [inline]
RIP: 0010:raw_copy_to_user arch/x86/include/asm/uaccess_64.h:147 [inline]
RIP: 0010:_inline_copy_to_user include/linux/uaccess.h:197 [inline]
RIP: 0010:_copy_to_user+0x85/0xb0 lib/usercopy.c:26
Code: e8 00 bc f7 fc 4d 39 fc 72 3d 4d 39 ec 77 38 e8 91 b9 f7 fc 4c 89
f7 89 de e8 47 25 5b fd 0f 01 cb 4c 89 ff 48 89 d9 4c 89 f6 <f3> a4 0f
1f 00 48 89 cb 0f 01 ca 48 89 d8 5b 41 5c 41 5d 41 5e 41
RSP: 0018:ffffc9000403f950 EFLAGS: 00050256
RAX: ffffffff84c7f101 RBX: 0000000000000038 RCX: 0000000000000038
RDX: 0000000000000000 RSI: ffffc9000403f9e0 RDI: 0000200000000060
RBP: ffffc9000403fa90 R08: ffffc9000403fa17 R09: 1ffff92000807f42
R10: dffffc0000000000 R11: fffff52000807f43 R12: 0000200000000098
R13: 00007ffffffff000 R14: ffffc9000403f9e0 R15: 0000200000000060
copy_to_user include/linux/uaccess.h:225 [inline]
fiemap_fill_next_extent+0x1c0/0x390 fs/ioctl.c:145
ocfs2_fiemap+0x888/0xc90 fs/ocfs2/extent_map.c:806
ioctl_fiemap fs/ioctl.c:220 [inline]
do_vfs_ioctl+0x1173/0x1430 fs/ioctl.c:532
__do_sys_ioctl fs/ioctl.c:596 [inline]
__se_sys_ioctl+0x82/0x170 fs/ioctl.c:584
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5f13850fd9
RSP: 002b:00007ffe3b3518b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000200000000000 RCX: 00007f5f13850fd9
RDX: 0000200000000040 RSI: 00000000c020660b RDI: 0000000000000004
RBP: 6165627472616568 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe3b3518f0
R13: 00007ffe3b351b18 R14: 431bde82d7b634db R15: 00007f5f1389a03b
ocfs2_fiemap() takes a read lock of the ip_alloc_sem semaphore (since
v2.6.22-527-g7307de80510a) and calls fiemap_fill_next_extent() to read the
extent list of this running mmap executable. The user supplied buffer to
hold the fiemap information page faults calling ocfs2_page_mkwrite() which
will take a write lock (since v2.6.27-38-g00dc417fa3e7) of the same
semaphore. This recursive semaphore will hold filesystem locks and causes
a hang of the fileystem.
The ip_alloc_sem protects the inode extent list and size. Release the
read semphore before calling fiemap_fill_next_extent() in ocfs2_fiemap()
and ocfs2_fiemap_inline(). This does an unnecessary semaphore lock/unlock
on the last extent but simplifies the error path.
Link: https://lkml.kernel.org/r/61d1a62b-2631-4f12-81e2-cd689914360b@oracle.com
Fixes: 00dc417fa3e7 ("ocfs2: fiemap support")
Signed-off-by: Mark Tinguely <mark.tinguely(a)oracle.com>
Reported-by: syzbot+541dcc6ee768f77103e7(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=541dcc6ee768f77103e7
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/extent_map.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
--- a/fs/ocfs2/extent_map.c~ocfs2-fix-recursive-semaphore-deadlock-in-fiemap-call
+++ a/fs/ocfs2/extent_map.c
@@ -706,6 +706,8 @@ out:
* it not only handles the fiemap for inlined files, but also deals
* with the fast symlink, cause they have no difference for extent
* mapping per se.
+ *
+ * Must be called with ip_alloc_sem semaphore held.
*/
static int ocfs2_fiemap_inline(struct inode *inode, struct buffer_head *di_bh,
struct fiemap_extent_info *fieinfo,
@@ -717,6 +719,7 @@ static int ocfs2_fiemap_inline(struct in
u64 phys;
u32 flags = FIEMAP_EXTENT_DATA_INLINE|FIEMAP_EXTENT_LAST;
struct ocfs2_inode_info *oi = OCFS2_I(inode);
+ lockdep_assert_held_read(&oi->ip_alloc_sem);
di = (struct ocfs2_dinode *)di_bh->b_data;
if (ocfs2_inode_is_fast_symlink(inode))
@@ -732,8 +735,11 @@ static int ocfs2_fiemap_inline(struct in
phys += offsetof(struct ocfs2_dinode,
id2.i_data.id_data);
+ /* Release the ip_alloc_sem to prevent deadlock on page fault */
+ up_read(&OCFS2_I(inode)->ip_alloc_sem);
ret = fiemap_fill_next_extent(fieinfo, 0, phys, id_count,
flags);
+ down_read(&OCFS2_I(inode)->ip_alloc_sem);
if (ret < 0)
return ret;
}
@@ -802,9 +808,11 @@ int ocfs2_fiemap(struct inode *inode, st
len_bytes = (u64)le16_to_cpu(rec.e_leaf_clusters) << osb->s_clustersize_bits;
phys_bytes = le64_to_cpu(rec.e_blkno) << osb->sb->s_blocksize_bits;
virt_bytes = (u64)le32_to_cpu(rec.e_cpos) << osb->s_clustersize_bits;
-
+ /* Release the ip_alloc_sem to prevent deadlock on page fault */
+ up_read(&OCFS2_I(inode)->ip_alloc_sem);
ret = fiemap_fill_next_extent(fieinfo, virt_bytes, phys_bytes,
len_bytes, fe_flags);
+ down_read(&OCFS2_I(inode)->ip_alloc_sem);
if (ret)
break;
_
Patches currently in -mm which might be from mark.tinguely(a)oracle.com are
ocfs2-fix-recursive-semaphore-deadlock-in-fiemap-call.patch
Remove the logic to set interrupt mask by default in uio_hv_generic
driver as the interrupt mask value is supposed to be controlled
completely by the user space. If the mask bit gets changed
by the driver, concurrently with user mode operating on the ring,
the mask bit may be set when it is supposed to be clear, and the
user-mode driver will miss an interrupt which will cause a hang.
For eg- when the driver sets inbound ring buffer interrupt mask to 1,
the host does not interrupt the guest on the UIO VMBus channel.
However, setting the mask does not prevent the host from putting a
message in the inbound ring buffer. So let’s assume that happens,
the host puts a message into the ring buffer but does not interrupt.
Subsequently, the user space code in the guest sets the inbound ring
buffer interrupt mask to 0, saying “Hey, I’m ready for interrupts”.
User space code then calls pread() to wait for an interrupt.
Then one of two things happens:
* The host never sends another message. So the pread() waits forever.
* The host does send another message. But because there’s already a
message in the ring buffer, it doesn’t generate an interrupt.
This is the correct behavior, because the host should only send an
interrupt when the inbound ring buffer transitions from empty to
not-empty. Adding an additional message to a ring buffer that is not
empty is not supposed to generate an interrupt on the guest.
Since the guest is waiting in pread() and not removing messages from
the ring buffer, the pread() waits forever.
This could be easily reproduced in hv_fcopy_uio_daemon if we delay
setting interrupt mask to 0.
Similarly if hv_uio_channel_cb() sets the interrupt_mask to 1,
there’s a race condition. Once user space empties the inbound ring
buffer, but before user space sets interrupt_mask to 0, the host could
put another message in the ring buffer but it wouldn’t interrupt.
Then the next pread() would hang.
Fix these by removing all instances where interrupt_mask is changed,
while keeping the one in set_event() unchanged to enable userspace
control the interrupt mask by writing 0/1 to /dev/uioX.
Fixes: 95096f2fbd10 ("uio-hv-generic: new userspace i/o driver for VMBus")
Suggested-by: John Starks <jostarks(a)microsoft.com>
Signed-off-by: Naman Jain <namjain(a)linux.microsoft.com>
Cc: <stable(a)vger.kernel.org>
---
Changes since v1:
https://lore.kernel.org/all/20250818064846.271294-1-namjain@linux.microsoft…
* Added Fixes and Cc stable tags.
---
drivers/uio/uio_hv_generic.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/drivers/uio/uio_hv_generic.c b/drivers/uio/uio_hv_generic.c
index f19efad4d6f8..3f8e2e27697f 100644
--- a/drivers/uio/uio_hv_generic.c
+++ b/drivers/uio/uio_hv_generic.c
@@ -111,7 +111,6 @@ static void hv_uio_channel_cb(void *context)
struct hv_device *hv_dev;
struct hv_uio_private_data *pdata;
- chan->inbound.ring_buffer->interrupt_mask = 1;
virt_mb();
/*
@@ -183,8 +182,6 @@ hv_uio_new_channel(struct vmbus_channel *new_sc)
return;
}
- /* Disable interrupts on sub channel */
- new_sc->inbound.ring_buffer->interrupt_mask = 1;
set_channel_read_mode(new_sc, HV_CALL_ISR);
ret = hv_create_ring_sysfs(new_sc, hv_uio_ring_mmap);
if (ret) {
@@ -227,9 +224,7 @@ hv_uio_open(struct uio_info *info, struct inode *inode)
ret = vmbus_connect_ring(dev->channel,
hv_uio_channel_cb, dev->channel);
- if (ret == 0)
- dev->channel->inbound.ring_buffer->interrupt_mask = 1;
- else
+ if (ret)
atomic_dec(&pdata->refcnt);
return ret;
--
2.34.1
batadv_nc_skb_decode_packet() trusts coded_len and checks only against
skb->len. XOR starts at sizeof(struct batadv_unicast_packet), reducing
payload headroom, and the source skb length is not verified, allowing an
out-of-bounds read and a small out-of-bounds write.
Validate that coded_len fits within the payload area of both destination
and source sk_buffs before XORing.
Fixes: 2df5278b0267 ("batman-adv: network coding - receive coded packets and decode them")
Cc: stable(a)vger.kernel.org
Reported-by: Stanislav Fort <disclosure(a)aisle.com>
Signed-off-by: Stanislav Fort <disclosure(a)aisle.com>
---
net/batman-adv/network-coding.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/batman-adv/network-coding.c b/net/batman-adv/network-coding.c
index 9f56308779cc..af97d077369f 100644
--- a/net/batman-adv/network-coding.c
+++ b/net/batman-adv/network-coding.c
@@ -1687,7 +1687,12 @@ batadv_nc_skb_decode_packet(struct batadv_priv *bat_priv, struct sk_buff *skb,
coding_len = ntohs(coded_packet_tmp.coded_len);
- if (coding_len > skb->len)
+ /* ensure dst buffer is large enough (payload only) */
+ if (coding_len + h_size > skb->len)
+ return NULL;
+
+ /* ensure src buffer is large enough (payload only) */
+ if (coding_len + h_size > nc_packet->skb->len)
return NULL;
/* Here the magic is reversed:
--
2.39.3 (Apple Git-146)
Add an explicit check for the xfer length to 'rtl9300_i2c_config_xfer'
to ensure the data length isn't within the supported range. In
particular a data length of 0 is not supported by the hardware and
causes unintended or destructive behaviour.
This limitation becomes obvious when looking at the register
documentation [1]. 4 bits are reserved for DATA_WIDTH and the value
of these 4 bits is used as N + 1, allowing a data length range of
1 <= len <= 16.
Affected by this is the SMBus Quick Operation which works with a data
length of 0. Passing 0 as the length causes an underflow of the value
due to:
(len - 1) & 0xf
and effectively specifying a transfer length of 16 via the registers.
This causes a 16-byte write operation instead of a Quick Write. For
example, on SFP modules without write-protected EEPROM this soft-bricks
them by overwriting some initial bytes.
For completeness, also add a quirk for the zero length.
[1] https://svanheule.net/realtek/longan/register/i2c_mst1_ctrl2
Fixes: c366be720235 ("i2c: Add driver for the RTL9300 I2C controller")
Cc: <stable(a)vger.kernel.org> # v6.13+
Signed-off-by: Jonas Jelonek <jelonek.jonas(a)gmail.com>
Tested-by: Sven Eckelmann <sven(a)narfation.org>
Reviewed-by: Chris Packham <chris.packham(a)alliedtelesis.co.nz>
Tested-by: Chris Packham <chris.packham(a)alliedtelesis.co.nz> # On RTL9302C based board
Tested-by: Markus Stockhausen <markus.stockhausen(a)gmx.de>
---
drivers/i2c/busses/i2c-rtl9300.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/i2c/busses/i2c-rtl9300.c b/drivers/i2c/busses/i2c-rtl9300.c
index 19c367703eaf..ebd4a85e1bde 100644
--- a/drivers/i2c/busses/i2c-rtl9300.c
+++ b/drivers/i2c/busses/i2c-rtl9300.c
@@ -99,6 +99,9 @@ static int rtl9300_i2c_config_xfer(struct rtl9300_i2c *i2c, struct rtl9300_i2c_c
{
u32 val, mask;
+ if (len < 1 || len > 16)
+ return -EINVAL;
+
val = chan->bus_freq << RTL9300_I2C_MST_CTRL2_SCL_FREQ_OFS;
mask = RTL9300_I2C_MST_CTRL2_SCL_FREQ_MASK;
@@ -352,7 +355,7 @@ static const struct i2c_algorithm rtl9300_i2c_algo = {
};
static struct i2c_adapter_quirks rtl9300_i2c_quirks = {
- .flags = I2C_AQ_NO_CLK_STRETCH,
+ .flags = I2C_AQ_NO_CLK_STRETCH | I2C_AQ_NO_ZERO_LEN,
.max_read_len = 16,
.max_write_len = 16,
};
--
2.48.1
Fix the current check for number of channels (child nodes in the device
tree). Before, this was:
if (device_get_child_node_count(dev) >= RTL9300_I2C_MUX_NCHAN)
RTL9300_I2C_MUX_NCHAN gives the maximum number of channels so checking
with '>=' isn't correct because it doesn't allow the last channel
number. Thus, fix it to:
if (device_get_child_node_count(dev) > RTL9300_I2C_MUX_NCHAN)
Issue occured on a TP-Link TL-ST1008F v2.0 device (8 SFP+ ports) and fix
is tested there.
Fixes: c366be720235 ("i2c: Add driver for the RTL9300 I2C controller")
Cc: <stable(a)vger.kernel.org> # v6.13+
Signed-off-by: Jonas Jelonek <jelonek.jonas(a)gmail.com>
Tested-by: Sven Eckelmann <sven(a)narfation.org>
Reviewed-by: Chris Packham <chris.packham(a)alliedtelesis.co.nz>
Tested-by: Chris Packham <chris.packham(a)alliedtelesis.co.nz> # On RTL9302C based board
Tested-by: Markus Stockhausen <markus.stockhausen(a)gmx.de>
---
drivers/i2c/busses/i2c-rtl9300.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/i2c/busses/i2c-rtl9300.c b/drivers/i2c/busses/i2c-rtl9300.c
index 4b215f9a24e6..19c367703eaf 100644
--- a/drivers/i2c/busses/i2c-rtl9300.c
+++ b/drivers/i2c/busses/i2c-rtl9300.c
@@ -382,7 +382,7 @@ static int rtl9300_i2c_probe(struct platform_device *pdev)
platform_set_drvdata(pdev, i2c);
- if (device_get_child_node_count(dev) >= RTL9300_I2C_MUX_NCHAN)
+ if (device_get_child_node_count(dev) > RTL9300_I2C_MUX_NCHAN)
return dev_err_probe(dev, -EINVAL, "Too many channels\n");
device_for_each_child_node(dev, child) {
--
2.48.1
Hi Stable,
Please provide a quote for your products:
Include:
1.Pricing (per unit)
2.Delivery cost & timeline
3.Quote expiry date
Deadline: September
Thanks!
Kamal Prasad
Albinayah Trading
Our user report a bug that his cpu keep the min cpu freq,
bisect to commit ada8d7fa0ad49a2a078f97f7f6e02d24d3c357a3
("sched/cpufreq: Rework schedutil governor performance estimation")
and test the fix e37617c8e53a1f7fcba6d5e1041f4fd8a2425c27
("sched/fair: Fix frequency selection for non-invariant case")
works.
And backport the "stable-deps-of" series:
"consolidate and cleanup CPU capacity"
Link: https://lore.kernel.org/all/20231211104855.558096-1-vincent.guittot@linaro.…
PS:
commit 50b813b147e9eb6546a1fc49d4e703e6d23691f2 in series
("cpufreq/cppc: Move and rename cppc_cpufreq_{perf_to_khz|khz_to_perf}()")
merged in v6.6.59 as 33e89c16cea0882bb05c585fa13236b730dd0efa
Vincent Guittot (8):
sched/topology: Add a new arch_scale_freq_ref() method
cpufreq: Use the fixed and coherent frequency for scaling capacity
cpufreq/schedutil: Use a fixed reference frequency
energy_model: Use a fixed reference frequency
cpufreq/cppc: Set the frequency used for computing the capacity
arm64/amu: Use capacity_ref_freq() to set AMU ratio
topology: Set capacity_freq_ref in all cases
sched/fair: Fix frequency selection for non-invariant case
arch/arm/include/asm/topology.h | 1 +
arch/arm64/include/asm/topology.h | 1 +
arch/arm64/kernel/topology.c | 26 ++++++------
arch/riscv/include/asm/topology.h | 1 +
drivers/base/arch_topology.c | 69 ++++++++++++++++++++-----------
drivers/cpufreq/cpufreq.c | 4 +-
include/linux/arch_topology.h | 8 ++++
include/linux/cpufreq.h | 1 +
include/linux/energy_model.h | 6 +--
include/linux/sched/topology.h | 8 ++++
kernel/sched/cpufreq_schedutil.c | 30 +++++++++++++-
11 files changed, 111 insertions(+), 44 deletions(-)
--
2.20.1
Fix a use-after-free window by correcting the buffer release sequence in
the deferred receive path. The code freed the RQ buffer first and only
then cleared the context pointer under the lock. Concurrent paths
(e.g., ABTS and the repost path) also inspect and release the same
pointer under the lock, so the old order could lead to double-free/UAF.
Note that the repost path already uses the correct pattern: detach the
pointer under the lock, then free it after dropping the lock. The deferred
path should do the same.
Fixes: 472e146d1cf3 ("scsi: lpfc: Correct upcalling nvmet_fc transport during io done downcall")
Cc: stable(a)vger.kernel.org
Signed-off-by: John Evans <evans1210144(a)gmail.com>
---
v2:
* Rework locking to read and clear the buffer pointer atomically, as
suggested by Justin Tee.
---
drivers/scsi/lpfc/lpfc_nvmet.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/scsi/lpfc/lpfc_nvmet.c b/drivers/scsi/lpfc/lpfc_nvmet.c
index fba2e62027b7..4cfc928bcf2d 100644
--- a/drivers/scsi/lpfc/lpfc_nvmet.c
+++ b/drivers/scsi/lpfc/lpfc_nvmet.c
@@ -1243,7 +1243,7 @@ lpfc_nvmet_defer_rcv(struct nvmet_fc_target_port *tgtport,
struct lpfc_nvmet_tgtport *tgtp;
struct lpfc_async_xchg_ctx *ctxp =
container_of(rsp, struct lpfc_async_xchg_ctx, hdlrctx.fcp_req);
- struct rqb_dmabuf *nvmebuf = ctxp->rqb_buffer;
+ struct rqb_dmabuf *nvmebuf;
struct lpfc_hba *phba = ctxp->phba;
unsigned long iflag;
@@ -1251,13 +1251,18 @@ lpfc_nvmet_defer_rcv(struct nvmet_fc_target_port *tgtport,
lpfc_nvmeio_data(phba, "NVMET DEFERRCV: xri x%x sz %d CPU %02x\n",
ctxp->oxid, ctxp->size, raw_smp_processor_id());
+ spin_lock_irqsave(&ctxp->ctxlock, iflag);
+ nvmebuf = ctxp->rqb_buffer;
if (!nvmebuf) {
+ spin_unlock_irqrestore(&ctxp->ctxlock, iflag);
lpfc_printf_log(phba, KERN_INFO, LOG_NVME_IOERR,
"6425 Defer rcv: no buffer oxid x%x: "
"flg %x ste %x\n",
ctxp->oxid, ctxp->flag, ctxp->state);
return;
}
+ ctxp->rqb_buffer = NULL;
+ spin_unlock_irqrestore(&ctxp->ctxlock, iflag);
tgtp = phba->targetport->private;
if (tgtp)
@@ -1265,9 +1270,6 @@ lpfc_nvmet_defer_rcv(struct nvmet_fc_target_port *tgtport,
/* Free the nvmebuf since a new buffer already replaced it */
nvmebuf->hrq->rqbp->rqb_free_buffer(phba, nvmebuf);
- spin_lock_irqsave(&ctxp->ctxlock, iflag);
- ctxp->rqb_buffer = NULL;
- spin_unlock_irqrestore(&ctxp->ctxlock, iflag);
}
/**
--
2.43.0
Hi,
Charles Bordet reported the following issue (full context in
https://bugs.debian.org/1108860)
> Dear Maintainer,
>
> What led up to the situation?
> We run a production environment using Debian 12 VMs, with a network
> topology involving VXLAN tunnels encapsulated inside Wireguard
> interfaces. This setup has worked reliably for over a year, with MTU set
> to 1500 on all interfaces except the Wireguard interface (set to 1420).
> Wireguard kernel fragmentation allowed this configuration to function
> without issues, even though the effective path MTU is lower than 1500.
>
> What exactly did you do (or not do) that was effective (or ineffective)?
> We performed a routine system upgrade, updating all packages include the
> kernel. After the upgrade, we observed severe network issues (timeouts,
> very slow HTTP/HTTPS, and apt update failures) on all VMs behind the
> router. SSH and small-packet traffic continued to work.
>
> To diagnose, we:
>
> * Restored a backup (with the previous kernel): the problem disappeared.
> * Repeated the upgrade, confirming the issue reappeared.
> * Systematically tested each kernel version from 6.1.124-1 up to
> 6.1.140-1. The problem first appears with kernel 6.1.135-1; all earlier
> versions work as expected.
> * Kernel version from the backports (6.12.32-1) did not resolve the
> problem.
>
> What was the outcome of this action?
>
> * With kernel 6.1.135-1 or later, network timeouts occur for
> large-packet protocols (HTTP, apt, etc.), while SSH and small-packet
> protocols work.
> * With kernel 6.1.133-1 or earlier, everything works as expected.
>
> What outcome did you expect instead?
> We expected the network to function as before, with Wireguard handling
> fragmentation transparently and no application-level timeouts,
> regardless of the kernel version.
While triaging the issue we found that the commit 8930424777e4
("tunnels: Accept PACKET_HOST in skb_tunnel_check_pmtu()." introduces
the issue and Charles confirmed that the issue was present as well in
6.12.35 and 6.15.4 (other version up could potentially still be
affected, but we wanted to check it is not a 6.1.y specific
regression).
Reverthing the commit fixes Charles' issue.
Does that ring a bell?
Regards,
Salvatore
Fix multiple fwnode reference leaks:
1. The function calls fwnode_get_named_child_node() to get the "leds" node,
but never calls fwnode_handle_put(leds) to release this reference.
2. Within the fwnode_for_each_child_node() loop, the early return
paths that don't properly release the "led" fwnode reference.
This fix follows the same pattern as commit d029edefed39
("net dsa: qca8k: fix usages of device_get_named_child_node()")
Fixes: 94a2a84f5e9e ("net: dsa: mv88e6xxx: Support LED control")
Cc: stable(a)vger.kernel.org
Signed-off-by: Miaoqian Lin <linmq006(a)gmail.com>
---
drivers/net/dsa/mv88e6xxx/leds.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/net/dsa/mv88e6xxx/leds.c b/drivers/net/dsa/mv88e6xxx/leds.c
index 1c88bfaea46b..dcc765066f9c 100644
--- a/drivers/net/dsa/mv88e6xxx/leds.c
+++ b/drivers/net/dsa/mv88e6xxx/leds.c
@@ -779,6 +779,8 @@ int mv88e6xxx_port_setup_leds(struct mv88e6xxx_chip *chip, int port)
continue;
if (led_num > 1) {
dev_err(dev, "invalid LED specified port %d\n", port);
+ fwnode_handle_put(led);
+ fwnode_handle_put(leds);
return -EINVAL;
}
@@ -823,17 +825,23 @@ int mv88e6xxx_port_setup_leds(struct mv88e6xxx_chip *chip, int port)
init_data.devname_mandatory = true;
init_data.devicename = kasprintf(GFP_KERNEL, "%s:0%d:0%d", chip->info->name,
port, led_num);
- if (!init_data.devicename)
+ if (!init_data.devicename) {
+ fwnode_handle_put(led);
+ fwnode_handle_put(leds);
return -ENOMEM;
+ }
ret = devm_led_classdev_register_ext(dev, l, &init_data);
kfree(init_data.devicename);
if (ret) {
dev_err(dev, "Failed to init LED %d for port %d", led_num, port);
+ fwnode_handle_put(led);
+ fwnode_handle_put(leds);
return ret;
}
}
+ fwnode_handle_put(leds);
return 0;
}
--
2.35.1
Before conversion to unify the calibration data management, the
tas2563_apply_calib() function performed the big endian conversion and
wrote the calibration data to the device. The writing is now done by the
common tasdev_load_calibrated_data() function, but without conversion.
Put the values into the calibration data buffer with the expected
endianness.
Fixes: 4fe238513407 ("ALSA: hda/tas2781: Move and unified the calibrated-data getting function for SPI and I2C into the tas2781_hda lib")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Gergo Koteles <soyer(a)irl.hu>
---
sound/hda/codecs/side-codecs/tas2781_hda_i2c.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/sound/hda/codecs/side-codecs/tas2781_hda_i2c.c b/sound/hda/codecs/side-codecs/tas2781_hda_i2c.c
index e34b17f0c9b9..1eac751ab2a6 100644
--- a/sound/hda/codecs/side-codecs/tas2781_hda_i2c.c
+++ b/sound/hda/codecs/side-codecs/tas2781_hda_i2c.c
@@ -310,6 +310,7 @@ static int tas2563_save_calibration(struct tas2781_hda *h)
struct cali_reg *r = &cd->cali_reg_array;
unsigned int offset = 0;
unsigned char *data;
+ __be32 bedata;
efi_status_t status;
unsigned int attr;
int ret, i, j, k;
@@ -351,6 +352,8 @@ static int tas2563_save_calibration(struct tas2781_hda *h)
i, j, status);
return -EINVAL;
}
+ bedata = cpu_to_be32(*(uint32_t *)&data[offset]);
+ memcpy(&data[offset], &bedata, sizeof(bedata));
offset += TAS2563_CAL_DATA_SIZE;
}
}
--
2.51.0
From: Donet Tom <donettom(a)linux.ibm.com>
On systems using the hash MMU, there is a software SLB preload cache that
mirrors the entries loaded into the hardware SLB buffer. This preload
cache is subject to periodic eviction — typically after every 256 context
switches — to remove old entry.
To optimize performance, the kernel skips switch_mmu_context() in
switch_mm_irqs_off() when the prev and next mm_struct are the same.
However, on hash MMU systems, this can lead to inconsistencies between
the hardware SLB and the software preload cache.
If an SLB entry for a process is evicted from the software cache on one
CPU, and the same process later runs on another CPU without executing
switch_mmu_context(), the hardware SLB may retain stale entries. If the
kernel then attempts to reload that entry, it can trigger an SLB
multi-hit error.
The following timeline shows how stale SLB entries are created and can
cause a multi-hit error when a process moves between CPUs without a
MMU context switch.
CPU 0 CPU 1
----- -----
Process P
exec swapper/1
load_elf_binary
begin_new_exc
activate_mm
switch_mm_irqs_off
switch_mmu_context
switch_slb
/*
* This invalidates all
* the entries in the HW
* and setup the new HW
* SLB entries as per the
* preload cache.
*/
context_switch
sched_migrate_task migrates process P to cpu-1
Process swapper/0 context switch (to process P)
(uses mm_struct of Process P) switch_mm_irqs_off()
switch_slb
load_slb++
/*
* load_slb becomes 0 here
* and we evict an entry from
* the preload cache with
* preload_age(). We still
* keep HW SLB and preload
* cache in sync, that is
* because all HW SLB entries
* anyways gets evicted in
* switch_slb during SLBIA.
* We then only add those
* entries back in HW SLB,
* which are currently
* present in preload_cache
* (after eviction).
*/
load_elf_binary continues...
setup_new_exec()
slb_setup_new_exec()
sched_switch event
sched_migrate_task migrates
process P to cpu-0
context_switch from swapper/0 to Process P
switch_mm_irqs_off()
/*
* Since both prev and next mm struct are same we don't call
* switch_mmu_context(). This will cause the HW SLB and SW preload
* cache to go out of sync in preload_new_slb_context. Because there
* was an SLB entry which was evicted from both HW and preload cache
* on cpu-1. Now later in preload_new_slb_context(), when we will try
* to add the same preload entry again, we will add this to the SW
* preload cache and then will add it to the HW SLB. Since on cpu-0
* this entry was never invalidated, hence adding this entry to the HW
* SLB will cause a SLB multi-hit error.
*/
load_elf_binary continues...
START_THREAD
start_thread
preload_new_slb_context
/*
* This tries to add a new EA to preload cache which was earlier
* evicted from both cpu-1 HW SLB and preload cache. This caused the
* HW SLB of cpu-0 to go out of sync with the SW preload cache. The
* reason for this was, that when we context switched back on CPU-0,
* we should have ideally called switch_mmu_context() which will
* bring the HW SLB entries on CPU-0 in sync with SW preload cache
* entries by setting up the mmu context properly. But we didn't do
* that since the prev mm_struct running on cpu-0 was same as the
* next mm_struct (which is true for swapper / kernel threads). So
* now when we try to add this new entry into the HW SLB of cpu-0,
* we hit a SLB multi-hit error.
*/
WARNING: CPU: 0 PID: 1810970 at arch/powerpc/mm/book3s64/slb.c:62
assert_slb_presence+0x2c/0x50(48 results) 02:47:29 [20157/42149]
Modules linked in:
CPU: 0 UID: 0 PID: 1810970 Comm: dd Not tainted 6.16.0-rc3-dirty #12
VOLUNTARY
Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected)
0x4d0200 0xf000004 of:SLOF,HEAD hv:linux,kvm pSeries
NIP: c00000000015426c LR: c0000000001543b4 CTR: 0000000000000000
REGS: c0000000497c77e0 TRAP: 0700 Not tainted (6.16.0-rc3-dirty)
MSR: 8000000002823033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 28888482 XER: 00000000
CFAR: c0000000001543b0 IRQMASK: 3
<...>
NIP [c00000000015426c] assert_slb_presence+0x2c/0x50
LR [c0000000001543b4] slb_insert_entry+0x124/0x390
Call Trace:
0x7fffceb5ffff (unreliable)
preload_new_slb_context+0x100/0x1a0
start_thread+0x26c/0x420
load_elf_binary+0x1b04/0x1c40
bprm_execve+0x358/0x680
do_execveat_common+0x1f8/0x240
sys_execve+0x58/0x70
system_call_exception+0x114/0x300
system_call_common+0x160/0x2c4
>From the above analysis, during early exec the hardware SLB is cleared,
and entries from the software preload cache are reloaded into hardware
by switch_slb. However, preload_new_slb_context and slb_setup_new_exec
also attempt to load some of the same entries, which can trigger a
multi-hit. In most cases, these additional preloads simply hit existing
entries and add nothing new. Removing these functions avoids redundant
preloads and eliminates the multi-hit issue. This patch removes these
two functions.
We tested process switching performance using the context_switch
benchmark on POWER9/hash, and observed no regression.
Without this patch: 129041 ops/sec
With this patch: 129341 ops/sec
We also measured SLB faults during boot, and the counts are essentially
the same with and without this patch.
SLB faults without this patch: 19727
SLB faults with this patch: 19786
Cc: Madhavan Srinivasan <maddy(a)linux.ibm.com>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Cc: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Cc: Paul Mackerras <paulus(a)ozlabs.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)kernel.org>
Cc: Donet Tom <donettom(a)linux.ibm.com>
Cc: linuxppc-dev(a)lists.ozlabs.org
Fixes: 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache")
cc: <stable(a)vger.kernel.org>
Suggested-by: Nicholas Piggin <npiggin(a)gmail.com>
Signed-off-by: Donet Tom <donettom(a)linux.ibm.com>
---
arch/powerpc/include/asm/book3s/64/mmu-hash.h | 1 -
arch/powerpc/kernel/process.c | 5 --
arch/powerpc/mm/book3s64/internal.h | 2 -
arch/powerpc/mm/book3s64/mmu_context.c | 2 -
arch/powerpc/mm/book3s64/slb.c | 88 -------------------
5 files changed, 98 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 1c4eebbc69c9..e1f77e2eead4 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -524,7 +524,6 @@ void slb_save_contents(struct slb_entry *slb_ptr);
void slb_dump_contents(struct slb_entry *slb_ptr);
extern void slb_vmalloc_update(void);
-void preload_new_slb_context(unsigned long start, unsigned long sp);
#ifdef CONFIG_PPC_64S_HASH_MMU
void slb_set_size(u16 size);
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 855e09886503..2b9799157eb4 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1897,8 +1897,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
return 0;
}
-void preload_new_slb_context(unsigned long start, unsigned long sp);
-
/*
* Set up a thread for executing a new program
*/
@@ -1906,9 +1904,6 @@ void start_thread(struct pt_regs *regs, unsigned long start, unsigned long sp)
{
#ifdef CONFIG_PPC64
unsigned long load_addr = regs->gpr[2]; /* saved by ELF_PLAT_INIT */
-
- if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled())
- preload_new_slb_context(start, sp);
#endif
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
diff --git a/arch/powerpc/mm/book3s64/internal.h b/arch/powerpc/mm/book3s64/internal.h
index a57a25f06a21..c26a6f0c90fc 100644
--- a/arch/powerpc/mm/book3s64/internal.h
+++ b/arch/powerpc/mm/book3s64/internal.h
@@ -24,8 +24,6 @@ static inline bool stress_hpt(void)
void hpt_do_stress(unsigned long ea, unsigned long hpte_group);
-void slb_setup_new_exec(void);
-
void exit_lazy_flush_tlb(struct mm_struct *mm, bool always_flush);
#endif /* ARCH_POWERPC_MM_BOOK3S64_INTERNAL_H */
diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
index 4e1e45420bd4..fb9dcf9ca599 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -150,8 +150,6 @@ static int hash__init_new_context(struct mm_struct *mm)
void hash__setup_new_exec(void)
{
slice_setup_new_exec();
-
- slb_setup_new_exec();
}
#else
static inline int hash__init_new_context(struct mm_struct *mm)
diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index 6b783552403c..7e053c561a09 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -328,94 +328,6 @@ static void preload_age(struct thread_info *ti)
ti->slb_preload_tail = (ti->slb_preload_tail + 1) % SLB_PRELOAD_NR;
}
-void slb_setup_new_exec(void)
-{
- struct thread_info *ti = current_thread_info();
- struct mm_struct *mm = current->mm;
- unsigned long exec = 0x10000000;
-
- WARN_ON(irqs_disabled());
-
- /*
- * preload cache can only be used to determine whether a SLB
- * entry exists if it does not start to overflow.
- */
- if (ti->slb_preload_nr + 2 > SLB_PRELOAD_NR)
- return;
-
- hard_irq_disable();
-
- /*
- * We have no good place to clear the slb preload cache on exec,
- * flush_thread is about the earliest arch hook but that happens
- * after we switch to the mm and have already preloaded the SLBEs.
- *
- * For the most part that's probably okay to use entries from the
- * previous exec, they will age out if unused. It may turn out to
- * be an advantage to clear the cache before switching to it,
- * however.
- */
-
- /*
- * preload some userspace segments into the SLB.
- * Almost all 32 and 64bit PowerPC executables are linked at
- * 0x10000000 so it makes sense to preload this segment.
- */
- if (!is_kernel_addr(exec)) {
- if (preload_add(ti, exec))
- slb_allocate_user(mm, exec);
- }
-
- /* Libraries and mmaps. */
- if (!is_kernel_addr(mm->mmap_base)) {
- if (preload_add(ti, mm->mmap_base))
- slb_allocate_user(mm, mm->mmap_base);
- }
-
- /* see switch_slb */
- asm volatile("isync" : : : "memory");
-
- local_irq_enable();
-}
-
-void preload_new_slb_context(unsigned long start, unsigned long sp)
-{
- struct thread_info *ti = current_thread_info();
- struct mm_struct *mm = current->mm;
- unsigned long heap = mm->start_brk;
-
- WARN_ON(irqs_disabled());
-
- /* see above */
- if (ti->slb_preload_nr + 3 > SLB_PRELOAD_NR)
- return;
-
- hard_irq_disable();
-
- /* Userspace entry address. */
- if (!is_kernel_addr(start)) {
- if (preload_add(ti, start))
- slb_allocate_user(mm, start);
- }
-
- /* Top of stack, grows down. */
- if (!is_kernel_addr(sp)) {
- if (preload_add(ti, sp))
- slb_allocate_user(mm, sp);
- }
-
- /* Bottom of heap, grows up. */
- if (heap && !is_kernel_addr(heap)) {
- if (preload_add(ti, heap))
- slb_allocate_user(mm, heap);
- }
-
- /* see switch_slb */
- asm volatile("isync" : : : "memory");
-
- local_irq_enable();
-}
-
static void slb_cache_slbie_kernel(unsigned int index)
{
unsigned long slbie_data = get_paca()->slb_cache[index];
--
2.50.1
On 8/29/25 13:29, yangshiguang wrote:
> At 2025-08-27 16:40:21, "Vlastimil Babka" <vbabka(a)suse.cz> wrote:
>
>>>
>>>>
>>>
>>> How about this?
>>>
>>> /* Preemption is disabled in ___slab_alloc() */
>>> - gfp_flags &= ~(__GFP_DIRECT_RECLAIM);
>>> + gfp_flags = (gfp_flags & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL)) |
>>> + __GFP_NOWARN;
>>
>>I'd suggest using gfp_nested_flags() and & ~(__GFP_DIRECT_RECLAIM);
>>
>
> However, gfp has been processed by gfp_nested_mask() in
> stack_depot_save_flags().
Aha, didn't notice. Good to know!
> Still need to call here?
No then we can indeed just mask out __GFP_DIRECT_RECLAIM.
Maybe the comment could say something like:
/*
* Preemption is disabled in ___slab_alloc() so we need to disallow
* blocking. The flags are further adjusted by gfp_nested_mask() in
* stack_depot itself.
*/
> set_track_prepare()
> ->stack_depot_save_flags()
>
>>> >--
>>>>Cheers,
>>>>Harry / Hyeonggon
Its actions are opportunistic anyway and will be completed
on device suspend.
Also restrict the scope of the pm runtime reference to
the notifier callback itself to make it easier to
follow.
Marking as a fix to simplify backporting of the fix
that follows in the series.
Fixes: c6a4d46ec1d7 ("drm/xe: evict user memory in PM notifier")
Cc: Matthew Auld <matthew.auld(a)intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.16+
Signed-off-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
---
drivers/gpu/drm/xe/xe_pm.c | 14 ++++----------
1 file changed, 4 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
index a2e85030b7f4..b57b46ad9f7c 100644
--- a/drivers/gpu/drm/xe/xe_pm.c
+++ b/drivers/gpu/drm/xe/xe_pm.c
@@ -308,28 +308,22 @@ static int xe_pm_notifier_callback(struct notifier_block *nb,
case PM_SUSPEND_PREPARE:
xe_pm_runtime_get(xe);
err = xe_bo_evict_all_user(xe);
- if (err) {
+ if (err)
drm_dbg(&xe->drm, "Notifier evict user failed (%d)\n", err);
- xe_pm_runtime_put(xe);
- break;
- }
err = xe_bo_notifier_prepare_all_pinned(xe);
- if (err) {
+ if (err)
drm_dbg(&xe->drm, "Notifier prepare pin failed (%d)\n", err);
- xe_pm_runtime_put(xe);
- }
+ xe_pm_runtime_put(xe);
break;
case PM_POST_HIBERNATION:
case PM_POST_SUSPEND:
+ xe_pm_runtime_get(xe);
xe_bo_notifier_unprepare_all_pinned(xe);
xe_pm_runtime_put(xe);
break;
}
- if (err)
- return NOTIFY_BAD;
-
return NOTIFY_DONE;
}
--
2.50.1
The current implementation of CXL memory hotplug notifier gets called
before the HMAT memory hotplug notifier. The CXL driver calculates the
access coordinates (bandwidth and latency values) for the CXL end to
end path (i.e. CPU to endpoint). When the CXL region is onlined, the CXL
memory hotplug notifier writes the access coordinates to the HMAT target
structs. Then the HMAT memory hotplug notifier is called and it creates
the access coordinates for the node sysfs attributes.
During testing on an Intel platform, it was found that although the
newly calculated coordinates were pushed to sysfs, the sysfs attributes for
the access coordinates showed up with the wrong initiator. The system has
4 nodes (0, 1, 2, 3) where node 0 and 1 are CPU nodes and node 2 and 3 are
CXL nodes. The expectation is that node 2 would show up as a target to node
0:
/sys/devices/system/node/node2/access0/initiators/node0
However it was observed that node 2 showed up as a target under node 1:
/sys/devices/system/node/node2/access0/initiators/node1
The original intent of the 'ext_updated' flag in HMAT handling code was to
stop HMAT memory hotplug callback from clobbering the access coordinates
after CXL has injected its calculated coordinates and replaced the generic
target access coordinates provided by the HMAT table in the HMAT target
structs. However the flag is hacky at best and blocks the updates from
other CXL regions that are onlined in the same node later on. Remove the
'ext_updated' flag usage and just update the access coordinates for the
nodes directly without touching HMAT target data.
The hotplug memory callback ordering is changed. Instead of changing CXL,
move HMAT back so there's room for the levels rather than have CXL share
the same level as SLAB_CALLBACK_PRI. The change will resulting in the CXL
callback to be executed after the HMAT callback.
With the change, the CXL hotplug memory notifier runs after the HMAT
callback. The HMAT callback will create the node sysfs attributes for
access coordinates. The CXL callback will write the access coordinates to
the now created node sysfs attributes directly and will not pollute the
HMAT target values.
Fixes: 067353a46d8c ("cxl/region: Add memory hotplug notifier for cxl region")
Cc: stable(a)vger.kernel.org
Tested-by: Marc Herbert <marc.herbert(a)linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams(a)intel.com>
Signed-off-by: Dave Jiang <dave.jiang(a)intel.com>
---
v2:
- Add description to what was observed for the issue. (Dan)
- Use the correct Fixes tag. (Dan)
- Add Cc to stable. (Dan)
- Add support to only update on first region appearance. (Jonathan)
---
drivers/acpi/numa/hmat.c | 6 ------
drivers/cxl/core/cdat.c | 5 -----
drivers/cxl/core/core.h | 1 -
drivers/cxl/core/region.c | 21 +++++++++++++--------
include/linux/memory.h | 2 +-
5 files changed, 14 insertions(+), 21 deletions(-)
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 4958301f5417..5d32490dc4ab 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -74,7 +74,6 @@ struct memory_target {
struct node_cache_attrs cache_attrs;
u8 gen_port_device_handle[ACPI_SRAT_DEVICE_HANDLE_SIZE];
bool registered;
- bool ext_updated; /* externally updated */
};
struct memory_initiator {
@@ -391,7 +390,6 @@ int hmat_update_target_coordinates(int nid, struct access_coordinate *coord,
coord->read_bandwidth, access);
hmat_update_target_access(target, ACPI_HMAT_WRITE_BANDWIDTH,
coord->write_bandwidth, access);
- target->ext_updated = true;
return 0;
}
@@ -773,10 +771,6 @@ static void hmat_update_target_attrs(struct memory_target *target,
u32 best = 0;
int i;
- /* Don't update if an external agent has changed the data. */
- if (target->ext_updated)
- return;
-
/* Don't update for generic port if there's no device handle */
if ((access == NODE_ACCESS_CLASS_GENPORT_SINK_LOCAL ||
access == NODE_ACCESS_CLASS_GENPORT_SINK_CPU) &&
diff --git a/drivers/cxl/core/cdat.c b/drivers/cxl/core/cdat.c
index c0af645425f4..c891fd618cfd 100644
--- a/drivers/cxl/core/cdat.c
+++ b/drivers/cxl/core/cdat.c
@@ -1081,8 +1081,3 @@ int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
{
return hmat_update_target_coordinates(nid, &cxlr->coord[access], access);
}
-
-bool cxl_need_node_perf_attrs_update(int nid)
-{
- return !acpi_node_backed_by_real_pxm(nid);
-}
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 2669f251d677..a253d308f3c9 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -139,7 +139,6 @@ long cxl_pci_get_latency(struct pci_dev *pdev);
int cxl_pci_get_bandwidth(struct pci_dev *pdev, struct access_coordinate *c);
int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
enum access_coordinate_class access);
-bool cxl_need_node_perf_attrs_update(int nid);
int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
struct access_coordinate *c);
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 71cc42d05248..371873fc43eb 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -9,6 +9,7 @@
#include <linux/uuid.h>
#include <linux/sort.h>
#include <linux/idr.h>
+#include <linux/xarray.h>
#include <linux/memory-tiers.h>
#include <cxlmem.h>
#include <cxl.h>
@@ -30,6 +31,9 @@
* 3. Decoder targets
*/
+/* xarray that stores the reference count per node for regions */
+static DEFINE_XARRAY(node_regions_xa);
+
static struct cxl_region *to_cxl_region(struct device *dev);
#define __ACCESS_ATTR_RO(_level, _name) { \
@@ -2442,14 +2446,8 @@ static bool cxl_region_update_coordinates(struct cxl_region *cxlr, int nid)
for (int i = 0; i < ACCESS_COORDINATE_MAX; i++) {
if (cxlr->coord[i].read_bandwidth) {
- rc = 0;
- if (cxl_need_node_perf_attrs_update(nid))
- node_set_perf_attrs(nid, &cxlr->coord[i], i);
- else
- rc = cxl_update_hmat_access_coordinates(nid, cxlr, i);
-
- if (rc == 0)
- cset++;
+ node_update_perf_attrs(nid, &cxlr->coord[i], i);
+ cset++;
}
}
@@ -2475,6 +2473,7 @@ static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
struct node_notify *nn = arg;
int nid = nn->nid;
int region_nid;
+ int rc;
if (action != NODE_ADDED_FIRST_MEMORY)
return NOTIFY_DONE;
@@ -2487,6 +2486,11 @@ static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
if (nid != region_nid)
return NOTIFY_DONE;
+ /* No action needed if there's existing entry */
+ rc = xa_insert(&node_regions_xa, nid, NULL, GFP_KERNEL);
+ if (rc < 0)
+ return NOTIFY_DONE;
+
if (!cxl_region_update_coordinates(cxlr, nid))
return NOTIFY_DONE;
@@ -3638,6 +3642,7 @@ int cxl_region_init(void)
void cxl_region_exit(void)
{
+ xa_destroy(&node_regions_xa);
cxl_driver_unregister(&cxl_region_driver);
}
diff --git a/include/linux/memory.h b/include/linux/memory.h
index de5c0d8e8925..684fd641f2f0 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -120,8 +120,8 @@ struct mem_section;
*/
#define DEFAULT_CALLBACK_PRI 0
#define SLAB_CALLBACK_PRI 1
-#define HMAT_CALLBACK_PRI 2
#define CXL_CALLBACK_PRI 5
+#define HMAT_CALLBACK_PRI 6
#define MM_COMPUTE_BATCH_PRI 10
#define CPUSET_CALLBACK_PRI 10
#define MEMTIER_HOTPLUG_PRI 100
--
2.50.1
Hi Greg,
Could you pick up this commit for 6.12 and 6.6:
00a973e093e9 ("interconnect: qcom: icc-rpm: Set the count member before accessing the flex array")
It just silences a UBSan warning so it doesn't affect regular users, but
it helps in testing to silence those warnings. It is a clean cherry-pick.
regards,
dan carpenter
On Wed, Dec 04, 2024 at 12:33:34AM +0200, djakov(a)kernel.org wrote:
> From: Georgi Djakov <djakov(a)kernel.org>
>
> The following UBSAN error is reported during boot on the db410c board on
> a clang-19 build:
>
> Internal error: UBSAN: array index out of bounds: 00000000f2005512 [#1] PREEMPT SMP
> ...
> pc : qnoc_probe+0x5f8/0x5fc
> ...
>
> The cause of the error is that the counter member was not set before
> accessing the annotated flexible array member, but after that. Fix this
> by initializing it earlier.
>
> Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org>
> Closes: https://lore.kernel.org/r/CA+G9fYs+2mBz1y2dAzxkj9-oiBJ2Acm1Sf1h2YQ3VmBqj_VX…
> Fixes: dd4904f3b924 ("interconnect: qcom: Annotate struct icc_onecell_data with __counted_by")
> Signed-off-by: Georgi Djakov <djakov(a)kernel.org>
> ---
> drivers/interconnect/qcom/icc-rpm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/interconnect/qcom/icc-rpm.c b/drivers/interconnect/qcom/icc-rpm.c
> index a8ed435f696c..ea1042d38128 100644
> --- a/drivers/interconnect/qcom/icc-rpm.c
> +++ b/drivers/interconnect/qcom/icc-rpm.c
> @@ -503,6 +503,7 @@ int qnoc_probe(struct platform_device *pdev)
> GFP_KERNEL);
> if (!data)
> return -ENOMEM;
> + data->num_nodes = num_nodes;
>
> qp->num_intf_clks = cd_num;
> for (i = 0; i < cd_num; i++)
> @@ -597,7 +598,6 @@ int qnoc_probe(struct platform_device *pdev)
>
> data->nodes[i] = node;
> }
> - data->num_nodes = num_nodes;
>
> clk_bulk_disable_unprepare(qp->num_intf_clks, qp->intf_clks);
>
Commit ID: 9cd51eefae3c871440b93c03716c5398f41bdf78
Why it should be applied: This is a small addition to a quirk list for which I
forgot to set cc: stable when originally submitted it. Bringing it onto stable
will result in several downstream distributions automatically adopting the
patch, helping the affected device.
What kernel versions I wish it to be applied to: It should apply cleanly to
6.1.y and newer longterm releases.
I hope I correctly followed
https://docs.kernel.org/process/stable-kernel-rules.html#option-2 to bring this
into stable.
Hi Dr. Sam Lavi Cosmetic And Implant Dentistry ,
Quick question: what happens when a patient calls after hours about implants?
Studies show 35–40% of dental calls go unanswered.
Every unanswered call = one implant patient choosing another practice.
We built an AI voice agent that answers every call, educates patients, and books consultations
24/7.
Want me to send you a 30-sec demo recording so you can hear it in action?
Just reply 'Yes' and I’ll send it right away.
—
Best regards,
Mohanish Ved
AI Growth Specialist
EditRage Solutions
Note:
The issue looks like it's from tty directly but I don't see who is the maintainer, so I email the closest I can get
Description:
In command line, sticky keys reset only when typing ASCII and ISO-8859-1 characters.
Tested with the QWERTY Lafayette layout: https://codeberg.org/Alerymin/kbd-qwerty-lafayette
Observed Behaviour:
When the layout is loaded in ISO-8859-15, most characters typed don't reset the sticky key, unless it's basic ASCII characters or Unicode
When the layout is loaded in ISO-8859-1, the sticky key works fine.
Expected behaviour:
Sticky key working in ISO-8859-1 and ISO-8859-15
System used:
Arch Linux, kernel 6.16.3-arch1-1
When the xe pm_notifier evicts for suspend / hibernate, there might be
racing tasks trying to re-validate again. This can lead to suspend taking
excessive time or get stuck in a live-lock. This behaviour becomes
much worse with the fix that actually makes re-validation bring back
bos to VRAM rather than letting them remain in TT.
Prevent that by having exec and the rebind worker waiting for a completion
that is set to block by the pm_notifier before suspend and is signaled
by the pm_notifier after resume / wakeup.
It's probably still possible to craft malicious applications that block
suspending. More work is pending to fix that.
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4288
Fixes: c6a4d46ec1d7 ("drm/xe: evict user memory in PM notifier")
Cc: Matthew Auld <matthew.auld(a)intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.16+
Signed-off-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
---
drivers/gpu/drm/xe/xe_device_types.h | 2 ++
drivers/gpu/drm/xe/xe_exec.c | 9 +++++++++
drivers/gpu/drm/xe/xe_pm.c | 4 ++++
drivers/gpu/drm/xe/xe_vm.c | 14 ++++++++++++++
4 files changed, 29 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 092004d14db2..6602bd678cbc 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -507,6 +507,8 @@ struct xe_device {
/** @pm_notifier: Our PM notifier to perform actions in response to various PM events. */
struct notifier_block pm_notifier;
+ /** @pm_block: Completion to block validating tasks on suspend / hibernate prepare */
+ struct completion pm_block;
/** @pmt: Support the PMT driver callback interface */
struct {
diff --git a/drivers/gpu/drm/xe/xe_exec.c b/drivers/gpu/drm/xe/xe_exec.c
index 44364c042ad7..374c831e691b 100644
--- a/drivers/gpu/drm/xe/xe_exec.c
+++ b/drivers/gpu/drm/xe/xe_exec.c
@@ -237,6 +237,15 @@ int xe_exec_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
goto err_unlock_list;
}
+ /*
+ * It's OK to block interruptible here with the vm lock held, since
+ * on task freezing during suspend / hibernate, the call will
+ * return -ERESTARTSYS and the IOCTL will be rerun.
+ */
+ err = wait_for_completion_interruptible(&xe->pm_block);
+ if (err)
+ goto err_unlock_list;
+
vm_exec.vm = &vm->gpuvm;
vm_exec.flags = DRM_EXEC_INTERRUPTIBLE_WAIT;
if (xe_vm_in_lr_mode(vm)) {
diff --git a/drivers/gpu/drm/xe/xe_pm.c b/drivers/gpu/drm/xe/xe_pm.c
index b57b46ad9f7c..2d7b05d8a78b 100644
--- a/drivers/gpu/drm/xe/xe_pm.c
+++ b/drivers/gpu/drm/xe/xe_pm.c
@@ -306,6 +306,7 @@ static int xe_pm_notifier_callback(struct notifier_block *nb,
switch (action) {
case PM_HIBERNATION_PREPARE:
case PM_SUSPEND_PREPARE:
+ reinit_completion(&xe->pm_block);
xe_pm_runtime_get(xe);
err = xe_bo_evict_all_user(xe);
if (err)
@@ -318,6 +319,7 @@ static int xe_pm_notifier_callback(struct notifier_block *nb,
break;
case PM_POST_HIBERNATION:
case PM_POST_SUSPEND:
+ complete_all(&xe->pm_block);
xe_pm_runtime_get(xe);
xe_bo_notifier_unprepare_all_pinned(xe);
xe_pm_runtime_put(xe);
@@ -345,6 +347,8 @@ int xe_pm_init(struct xe_device *xe)
if (err)
return err;
+ init_completion(&xe->pm_block);
+ complete_all(&xe->pm_block);
/* For now suspend/resume is only allowed with GuC */
if (!xe_device_uc_enabled(xe))
return 0;
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 3ff3c67aa79d..edcdb0528f2a 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -394,6 +394,9 @@ static int xe_gpuvm_validate(struct drm_gpuvm_bo *vm_bo, struct drm_exec *exec)
list_move_tail(&gpuva_to_vma(gpuva)->combined_links.rebind,
&vm->rebind_list);
+ if (!try_wait_for_completion(&vm->xe->pm_block))
+ return -EAGAIN;
+
ret = xe_bo_validate(gem_to_xe_bo(vm_bo->obj), vm, false);
if (ret)
return ret;
@@ -494,6 +497,12 @@ static void preempt_rebind_work_func(struct work_struct *w)
xe_assert(vm->xe, xe_vm_in_preempt_fence_mode(vm));
trace_xe_vm_rebind_worker_enter(vm);
+ /*
+ * This blocks the wq during suspend / hibernate.
+ * Don't hold any locks.
+ */
+retry_pm:
+ wait_for_completion(&vm->xe->pm_block);
down_write(&vm->lock);
if (xe_vm_is_closed_or_banned(vm)) {
@@ -503,6 +512,11 @@ static void preempt_rebind_work_func(struct work_struct *w)
}
retry:
+ if (!try_wait_for_completion(&vm->xe->pm_block)) {
+ up_write(&vm->lock);
+ goto retry_pm;
+ }
+
if (xe_vm_userptr_check_repin(vm)) {
err = xe_vm_userptr_pin(vm);
if (err)
--
2.50.1
VRAM+TT bos that are evicted from VRAM to TT may remain in
TT also after a revalidation following eviction or suspend.
This manifests itself as applications becoming sluggish
after buffer objects get evicted or after a resume from
suspend or hibernation.
If the bo supports placement in both VRAM and TT, and
we are on DGFX, mark the TT placement as fallback. This means
that it is tried only after VRAM + eviction.
This flaw has probably been present since the xe module was
upstreamed but use a Fixes: commit below where backporting is
likely to be simple. For earlier versions we need to open-
code the fallback algorithm in the driver.
v2:
- Remove check for dgfx. (Matthew Auld)
- Update the xe_dma_buf kunit test for the new strategy (CI)
- Allow dma-buf to pin in current placement (CI)
- Make xe_bo_validate() for pinned bos a NOP.
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5995
Fixes: a78a8da51b36 ("drm/ttm: replace busy placement with flags v6")
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: Matthew Auld <matthew.auld(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.9+
Signed-off-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
Reviewed-by: Matthew Auld <matthew.auld(a)intel.com>
---
drivers/gpu/drm/xe/tests/xe_bo.c | 2 +-
drivers/gpu/drm/xe/tests/xe_dma_buf.c | 10 +---------
drivers/gpu/drm/xe/xe_bo.c | 16 ++++++++++++----
drivers/gpu/drm/xe/xe_bo.h | 2 +-
drivers/gpu/drm/xe/xe_dma_buf.c | 2 +-
5 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/drivers/gpu/drm/xe/tests/xe_bo.c b/drivers/gpu/drm/xe/tests/xe_bo.c
index bb469096d072..7b40cc8be1c9 100644
--- a/drivers/gpu/drm/xe/tests/xe_bo.c
+++ b/drivers/gpu/drm/xe/tests/xe_bo.c
@@ -236,7 +236,7 @@ static int evict_test_run_tile(struct xe_device *xe, struct xe_tile *tile, struc
}
xe_bo_lock(external, false);
- err = xe_bo_pin_external(external);
+ err = xe_bo_pin_external(external, false);
xe_bo_unlock(external);
if (err) {
KUNIT_FAIL(test, "external bo pin err=%pe\n",
diff --git a/drivers/gpu/drm/xe/tests/xe_dma_buf.c b/drivers/gpu/drm/xe/tests/xe_dma_buf.c
index cde9530bef8c..3f30d2cef917 100644
--- a/drivers/gpu/drm/xe/tests/xe_dma_buf.c
+++ b/drivers/gpu/drm/xe/tests/xe_dma_buf.c
@@ -85,15 +85,7 @@ static void check_residency(struct kunit *test, struct xe_bo *exported,
return;
}
- /*
- * If on different devices, the exporter is kept in system if
- * possible, saving a migration step as the transfer is just
- * likely as fast from system memory.
- */
- if (params->mem_mask & XE_BO_FLAG_SYSTEM)
- KUNIT_EXPECT_TRUE(test, xe_bo_is_mem_type(exported, XE_PL_TT));
- else
- KUNIT_EXPECT_TRUE(test, xe_bo_is_mem_type(exported, mem_type));
+ KUNIT_EXPECT_TRUE(test, xe_bo_is_mem_type(exported, mem_type));
if (params->force_different_devices)
KUNIT_EXPECT_TRUE(test, xe_bo_is_mem_type(imported, XE_PL_TT));
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 4faf15d5fa6d..870f43347281 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -188,6 +188,8 @@ static void try_add_system(struct xe_device *xe, struct xe_bo *bo,
bo->placements[*c] = (struct ttm_place) {
.mem_type = XE_PL_TT,
+ .flags = (bo_flags & XE_BO_FLAG_VRAM_MASK) ?
+ TTM_PL_FLAG_FALLBACK : 0,
};
*c += 1;
}
@@ -2322,6 +2324,7 @@ uint64_t vram_region_gpu_offset(struct ttm_resource *res)
/**
* xe_bo_pin_external - pin an external BO
* @bo: buffer object to be pinned
+ * @in_place: Pin in current placement, don't attempt to migrate.
*
* Pin an external (not tied to a VM, can be exported via dma-buf / prime FD)
* BO. Unique call compared to xe_bo_pin as this function has it own set of
@@ -2329,7 +2332,7 @@ uint64_t vram_region_gpu_offset(struct ttm_resource *res)
*
* Returns 0 for success, negative error code otherwise.
*/
-int xe_bo_pin_external(struct xe_bo *bo)
+int xe_bo_pin_external(struct xe_bo *bo, bool in_place)
{
struct xe_device *xe = xe_bo_device(bo);
int err;
@@ -2338,9 +2341,11 @@ int xe_bo_pin_external(struct xe_bo *bo)
xe_assert(xe, xe_bo_is_user(bo));
if (!xe_bo_is_pinned(bo)) {
- err = xe_bo_validate(bo, NULL, false);
- if (err)
- return err;
+ if (!in_place) {
+ err = xe_bo_validate(bo, NULL, false);
+ if (err)
+ return err;
+ }
spin_lock(&xe->pinned.lock);
list_add_tail(&bo->pinned_link, &xe->pinned.late.external);
@@ -2493,6 +2498,9 @@ int xe_bo_validate(struct xe_bo *bo, struct xe_vm *vm, bool allow_res_evict)
};
int ret;
+ if (xe_bo_is_pinned(bo))
+ return 0;
+
if (vm) {
lockdep_assert_held(&vm->lock);
xe_vm_assert_held(vm);
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index 8cce413b5235..cfb1ec266a6d 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -200,7 +200,7 @@ static inline void xe_bo_unlock_vm_held(struct xe_bo *bo)
}
}
-int xe_bo_pin_external(struct xe_bo *bo);
+int xe_bo_pin_external(struct xe_bo *bo, bool in_place);
int xe_bo_pin(struct xe_bo *bo);
void xe_bo_unpin_external(struct xe_bo *bo);
void xe_bo_unpin(struct xe_bo *bo);
diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c
index 346f857f3837..af64baf872ef 100644
--- a/drivers/gpu/drm/xe/xe_dma_buf.c
+++ b/drivers/gpu/drm/xe/xe_dma_buf.c
@@ -72,7 +72,7 @@ static int xe_dma_buf_pin(struct dma_buf_attachment *attach)
return ret;
}
- ret = xe_bo_pin_external(bo);
+ ret = xe_bo_pin_external(bo, true);
xe_assert(xe, !ret);
return 0;
--
2.50.1
VRAM+TT bos that are evicted from VRAM to TT may remain in
TT also after a revalidation following eviction or suspend.
This manifests itself as applications becoming sluggish
after buffer objects get evicted or after a resume from
suspend or hibernation.
If the bo supports placement in both VRAM and TT, and
we are on DGFX, mark the TT placement as fallback. This means
that it is tried only after VRAM + eviction.
This flaw has probably been present since the xe module was
upstreamed but use a Fixes: commit below where backporting is
likely to be simple. For earlier versions we need to open-
code the fallback algorithm in the driver.
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5995
Fixes: a78a8da51b36 ("drm/ttm: replace busy placement with flags v6")
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: Matthew Auld <matthew.auld(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.9+
Signed-off-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
---
drivers/gpu/drm/xe/xe_bo.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 4faf15d5fa6d..64dea4e478bd 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -188,6 +188,8 @@ static void try_add_system(struct xe_device *xe, struct xe_bo *bo,
bo->placements[*c] = (struct ttm_place) {
.mem_type = XE_PL_TT,
+ .flags = (IS_DGFX(xe) && (bo_flags & XE_BO_FLAG_VRAM_MASK)) ?
+ TTM_PL_FLAG_FALLBACK : 0,
};
*c += 1;
}
--
2.50.1
Make sure to drop the references to the sibling platform devices and
their child drm devices taken by of_find_device_by_node() and
device_find_child() when initialising the driver data during bind().
Fixes: 1ef7ed48356c ("drm/mediatek: Modify mediatek-drm for mt8195 multi mmsys support")
Cc: stable(a)vger.kernel.org # 6.4
Cc: Nancy.Lin <nancy.lin(a)mediatek.com>
Signed-off-by: Johan Hovold <johan(a)kernel.org>
---
drivers/gpu/drm/mediatek/mtk_drm_drv.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/mediatek/mtk_drm_drv.c b/drivers/gpu/drm/mediatek/mtk_drm_drv.c
index 7c0c12dde488..33b83576af7e 100644
--- a/drivers/gpu/drm/mediatek/mtk_drm_drv.c
+++ b/drivers/gpu/drm/mediatek/mtk_drm_drv.c
@@ -395,10 +395,12 @@ static bool mtk_drm_get_all_drm_priv(struct device *dev)
continue;
drm_dev = device_find_child(&pdev->dev, NULL, mtk_drm_match);
+ put_device(&pdev->dev);
if (!drm_dev)
continue;
temp_drm_priv = dev_get_drvdata(drm_dev);
+ put_device(drm_dev);
if (!temp_drm_priv)
continue;
--
2.49.1
There is a bug observed when rtw89_core_tx_kick_off_and_wait() tries to
access already freed skb_data:
BUG: KFENCE: use-after-free write in rtw89_core_tx_kick_off_and_wait drivers/net/wireless/realtek/rtw89/core.c:1110
CPU: 6 UID: 0 PID: 41377 Comm: kworker/u64:24 Not tainted 6.17.0-rc1+ #1 PREEMPT(lazy)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS edk2-20250523-14.fc42 05/23/2025
Workqueue: events_unbound cfg80211_wiphy_work [cfg80211]
Use-after-free write at 0x0000000020309d9d (in kfence-#251):
rtw89_core_tx_kick_off_and_wait drivers/net/wireless/realtek/rtw89/core.c:1110
rtw89_core_scan_complete drivers/net/wireless/realtek/rtw89/core.c:5338
rtw89_hw_scan_complete_cb drivers/net/wireless/realtek/rtw89/fw.c:7979
rtw89_chanctx_proceed_cb drivers/net/wireless/realtek/rtw89/chan.c:3165
rtw89_chanctx_proceed drivers/net/wireless/realtek/rtw89/chan.h:141
rtw89_hw_scan_complete drivers/net/wireless/realtek/rtw89/fw.c:8012
rtw89_mac_c2h_scanofld_rsp drivers/net/wireless/realtek/rtw89/mac.c:5059
rtw89_fw_c2h_work drivers/net/wireless/realtek/rtw89/fw.c:6758
process_one_work kernel/workqueue.c:3241
worker_thread kernel/workqueue.c:3400
kthread kernel/kthread.c:463
ret_from_fork arch/x86/kernel/process.c:154
ret_from_fork_asm arch/x86/entry/entry_64.S:258
kfence-#251: 0x0000000056e2393d-0x000000009943cb62, size=232, cache=skbuff_head_cache
allocated by task 41377 on cpu 6 at 77869.159548s (0.009551s ago):
__alloc_skb net/core/skbuff.c:659
__netdev_alloc_skb net/core/skbuff.c:734
ieee80211_nullfunc_get net/mac80211/tx.c:5844
rtw89_core_send_nullfunc drivers/net/wireless/realtek/rtw89/core.c:3431
rtw89_core_scan_complete drivers/net/wireless/realtek/rtw89/core.c:5338
rtw89_hw_scan_complete_cb drivers/net/wireless/realtek/rtw89/fw.c:7979
rtw89_chanctx_proceed_cb drivers/net/wireless/realtek/rtw89/chan.c:3165
rtw89_chanctx_proceed drivers/net/wireless/realtek/rtw89/chan.c:3194
rtw89_hw_scan_complete drivers/net/wireless/realtek/rtw89/fw.c:8012
rtw89_mac_c2h_scanofld_rsp drivers/net/wireless/realtek/rtw89/mac.c:5059
rtw89_fw_c2h_work drivers/net/wireless/realtek/rtw89/fw.c:6758
process_one_work kernel/workqueue.c:3241
worker_thread kernel/workqueue.c:3400
kthread kernel/kthread.c:463
ret_from_fork arch/x86/kernel/process.c:154
ret_from_fork_asm arch/x86/entry/entry_64.S:258
freed by task 1045 on cpu 9 at 77869.168393s (0.001557s ago):
ieee80211_tx_status_skb net/mac80211/status.c:1117
rtw89_pci_release_txwd_skb drivers/net/wireless/realtek/rtw89/pci.c:564
rtw89_pci_release_tx_skbs.isra.0 drivers/net/wireless/realtek/rtw89/pci.c:651
rtw89_pci_release_tx drivers/net/wireless/realtek/rtw89/pci.c:676
rtw89_pci_napi_poll drivers/net/wireless/realtek/rtw89/pci.c:4238
__napi_poll net/core/dev.c:7495
net_rx_action net/core/dev.c:7557 net/core/dev.c:7684
handle_softirqs kernel/softirq.c:580
do_softirq.part.0 kernel/softirq.c:480
__local_bh_enable_ip kernel/softirq.c:407
rtw89_pci_interrupt_threadfn drivers/net/wireless/realtek/rtw89/pci.c:927
irq_thread_fn kernel/irq/manage.c:1133
irq_thread kernel/irq/manage.c:1257
kthread kernel/kthread.c:463
ret_from_fork arch/x86/kernel/process.c:154
ret_from_fork_asm arch/x86/entry/entry_64.S:258
It is a consequence of a race between the waiting and the signaling side
of the completion:
Waiting thread Completing thread
rtw89_core_tx_kick_off_and_wait()
rcu_assign_pointer(skb_data->wait, wait)
/* start waiting */
wait_for_completion_timeout()
rtw89_pci_tx_status()
rtw89_core_tx_wait_complete()
rcu_read_lock()
/* signals completion and
* proceeds further
*/
complete(&wait->completion)
rcu_read_unlock()
...
/* frees skb_data */
ieee80211_tx_status_ni()
/* returns (exit status doesn't matter) */
wait_for_completion_timeout()
...
/* accesses the already freed skb_data */
rcu_assign_pointer(skb_data->wait, NULL)
The completing side might proceed and free the underlying skb even before
the waiting side is fully awoken and run to execution. Actually the race
happens regardless of wait_for_completion_timeout() exit status, e.g.
the waiting side may hit a timeout and the concurrent completing side is
still able to free the skb.
Skbs which are sent by rtw89_core_tx_kick_off_and_wait() are owned by the
driver. They don't come from core ieee80211 stack so no need to pass them
to ieee80211_tx_status_ni() on completing side.
Introduce a work function which will act as a garbage collector for
rtw89_tx_wait_info objects and the associated skbs. Thus no potentially
heavy locks are required on the completing side.
Found by Linux Verification Center (linuxtesting.org).
Fixes: 1ae5ca615285 ("wifi: rtw89: add function to wait for completion of TX skbs")
Cc: stable(a)vger.kernel.org
Suggested-by: Zong-Zhe Yang <kevin_yang(a)realtek.com>
Signed-off-by: Fedor Pchelkin <pchelkin(a)ispras.ru>
---
v2: use a work function to manage release of tx_waits and associated skbs (Zong-Zhe)
The interesting part is how rtw89_tx_wait_work() should wait for
completion - based on timeout or without it, or just check status with
a call to completion_done().
Simply waiting with wait_for_completion() may become a bottleneck if for
any reason the completion is delayed significantly, and we are holding a
wiphy lock here. I _suspect_ rtw89_pci_tx_status() should be called
either by napi polling handler or in other cases e.g. by rtw89_hci_reset()
but it's hard to deduce for any possible scenario that it will be called
in some time.
Anyway, the current and the next patch try to make sure that when
rtw89_core_tx_wait_complete() is called, skbdata->wait is properly
initialized so that there should be no buggy situations when tx_wait skb
is not recognized and invalidly passed to ieee80211 stack, also without
signaling a completion.
If rtw89_core_tx_wait_complete() is not called at all, this should
indicate a bug elsewhere.
drivers/net/wireless/realtek/rtw89/core.c | 42 +++++++++++++++++++----
drivers/net/wireless/realtek/rtw89/core.h | 22 +++++++-----
drivers/net/wireless/realtek/rtw89/pci.c | 9 ++---
3 files changed, 54 insertions(+), 19 deletions(-)
diff --git a/drivers/net/wireless/realtek/rtw89/core.c b/drivers/net/wireless/realtek/rtw89/core.c
index 57590f5577a3..48aa02d6abd4 100644
--- a/drivers/net/wireless/realtek/rtw89/core.c
+++ b/drivers/net/wireless/realtek/rtw89/core.c
@@ -1073,6 +1073,26 @@ rtw89_core_tx_update_desc_info(struct rtw89_dev *rtwdev,
}
}
+static void rtw89_tx_wait_work(struct wiphy *wiphy, struct wiphy_work *work)
+{
+ struct rtw89_dev *rtwdev = container_of(work, struct rtw89_dev,
+ tx_wait_work);
+ struct rtw89_tx_wait_info *wait, *tmp;
+
+ lockdep_assert_wiphy(wiphy);
+
+ list_for_each_entry_safe(wait, tmp, &rtwdev->tx_waits, list) {
+ if (!wait->finished) {
+ unsigned long tmo = msecs_to_jiffies(wait->timeout);
+ if (!wait_for_completion_timeout(&wait->completion, tmo))
+ continue;
+ }
+ list_del(&wait->list);
+ dev_kfree_skb_any(wait->skb);
+ kfree(wait);
+ }
+}
+
void rtw89_core_tx_kick_off(struct rtw89_dev *rtwdev, u8 qsel)
{
u8 ch_dma;
@@ -1090,6 +1110,8 @@ int rtw89_core_tx_kick_off_and_wait(struct rtw89_dev *rtwdev, struct sk_buff *sk
unsigned long time_left;
int ret = 0;
+ lockdep_assert_wiphy(rtwdev->hw->wiphy);
+
wait = kzalloc(sizeof(*wait), GFP_KERNEL);
if (!wait) {
rtw89_core_tx_kick_off(rtwdev, qsel);
@@ -1097,18 +1119,23 @@ int rtw89_core_tx_kick_off_and_wait(struct rtw89_dev *rtwdev, struct sk_buff *sk
}
init_completion(&wait->completion);
- rcu_assign_pointer(skb_data->wait, wait);
+ skb_data->wait = wait;
rtw89_core_tx_kick_off(rtwdev, qsel);
time_left = wait_for_completion_timeout(&wait->completion,
msecs_to_jiffies(timeout));
- if (time_left == 0)
+ if (time_left == 0) {
ret = -ETIMEDOUT;
- else if (!wait->tx_done)
- ret = -EAGAIN;
+ } else {
+ wait->finished = true;
+ if (!wait->tx_done)
+ ret = -EAGAIN;
+ }
- rcu_assign_pointer(skb_data->wait, NULL);
- kfree_rcu(wait, rcu_head);
+ wait->skb = skb;
+ wait->timeout = timeout;
+ list_add_tail(&wait->list, &rtwdev->tx_waits);
+ wiphy_work_queue(rtwdev->hw->wiphy, &rtwdev->tx_wait_work);
return ret;
}
@@ -4972,6 +4999,7 @@ void rtw89_core_stop(struct rtw89_dev *rtwdev)
clear_bit(RTW89_FLAG_RUNNING, rtwdev->flags);
wiphy_work_cancel(wiphy, &rtwdev->c2h_work);
+ wiphy_work_cancel(wiphy, &rtwdev->tx_wait_work);
wiphy_work_cancel(wiphy, &rtwdev->cancel_6ghz_probe_work);
wiphy_work_cancel(wiphy, &btc->eapol_notify_work);
wiphy_work_cancel(wiphy, &btc->arp_notify_work);
@@ -5203,6 +5231,7 @@ int rtw89_core_init(struct rtw89_dev *rtwdev)
INIT_LIST_HEAD(&rtwdev->scan_info.pkt_list[band]);
}
INIT_LIST_HEAD(&rtwdev->scan_info.chan_list);
+ INIT_LIST_HEAD(&rtwdev->tx_waits);
INIT_WORK(&rtwdev->ba_work, rtw89_core_ba_work);
INIT_WORK(&rtwdev->txq_work, rtw89_core_txq_work);
INIT_DELAYED_WORK(&rtwdev->txq_reinvoke_work, rtw89_core_txq_reinvoke_work);
@@ -5233,6 +5262,7 @@ int rtw89_core_init(struct rtw89_dev *rtwdev)
wiphy_work_init(&rtwdev->c2h_work, rtw89_fw_c2h_work);
wiphy_work_init(&rtwdev->ips_work, rtw89_ips_work);
wiphy_work_init(&rtwdev->cancel_6ghz_probe_work, rtw89_cancel_6ghz_probe_work);
+ wiphy_work_init(&rtwdev->tx_wait_work, rtw89_tx_wait_work);
INIT_WORK(&rtwdev->load_firmware_work, rtw89_load_firmware_work);
skb_queue_head_init(&rtwdev->c2h_queue);
diff --git a/drivers/net/wireless/realtek/rtw89/core.h b/drivers/net/wireless/realtek/rtw89/core.h
index 43e10278e14d..06f7d82a8d18 100644
--- a/drivers/net/wireless/realtek/rtw89/core.h
+++ b/drivers/net/wireless/realtek/rtw89/core.h
@@ -3508,12 +3508,16 @@ struct rtw89_phy_rate_pattern {
struct rtw89_tx_wait_info {
struct rcu_head rcu_head;
+ struct list_head list;
struct completion completion;
+ struct sk_buff *skb;
+ unsigned int timeout;
bool tx_done;
+ bool finished;
};
struct rtw89_tx_skb_data {
- struct rtw89_tx_wait_info __rcu *wait;
+ struct rtw89_tx_wait_info *wait;
u8 hci_priv[];
};
@@ -5925,6 +5929,9 @@ struct rtw89_dev {
/* used to protect rpwm */
spinlock_t rpwm_lock;
+ struct list_head tx_waits;
+ struct wiphy_work tx_wait_work;
+
struct rtw89_cam_info cam_info;
struct sk_buff_head c2h_queue;
@@ -7258,23 +7265,20 @@ static inline struct sk_buff *rtw89_alloc_skb_for_rx(struct rtw89_dev *rtwdev,
return dev_alloc_skb(length);
}
-static inline void rtw89_core_tx_wait_complete(struct rtw89_dev *rtwdev,
+static inline bool rtw89_core_tx_wait_complete(struct rtw89_dev *rtwdev,
struct rtw89_tx_skb_data *skb_data,
bool tx_done)
{
struct rtw89_tx_wait_info *wait;
- rcu_read_lock();
-
- wait = rcu_dereference(skb_data->wait);
+ wait = skb_data->wait;
if (!wait)
- goto out;
+ return false;
wait->tx_done = tx_done;
- complete(&wait->completion);
+ complete_all(&wait->completion);
-out:
- rcu_read_unlock();
+ return true;
}
static inline bool rtw89_is_mlo_1_1(struct rtw89_dev *rtwdev)
diff --git a/drivers/net/wireless/realtek/rtw89/pci.c b/drivers/net/wireless/realtek/rtw89/pci.c
index a669f2f843aa..6356c2c224c5 100644
--- a/drivers/net/wireless/realtek/rtw89/pci.c
+++ b/drivers/net/wireless/realtek/rtw89/pci.c
@@ -464,10 +464,7 @@ static void rtw89_pci_tx_status(struct rtw89_dev *rtwdev,
struct rtw89_tx_skb_data *skb_data = RTW89_TX_SKB_CB(skb);
struct ieee80211_tx_info *info;
- rtw89_core_tx_wait_complete(rtwdev, skb_data, tx_status == RTW89_TX_DONE);
-
info = IEEE80211_SKB_CB(skb);
- ieee80211_tx_info_clear_status(info);
if (info->flags & IEEE80211_TX_CTL_NO_ACK)
info->flags |= IEEE80211_TX_STAT_NOACK_TRANSMITTED;
@@ -494,6 +491,10 @@ static void rtw89_pci_tx_status(struct rtw89_dev *rtwdev,
}
}
+ if (rtw89_core_tx_wait_complete(rtwdev, skb_data, tx_status == RTW89_TX_DONE))
+ return;
+
+ ieee80211_tx_info_clear_status(info);
ieee80211_tx_status_ni(rtwdev->hw, skb);
}
@@ -1387,7 +1388,7 @@ static int rtw89_pci_txwd_submit(struct rtw89_dev *rtwdev,
}
tx_data->dma = dma;
- rcu_assign_pointer(skb_data->wait, NULL);
+ skb_data->wait = NULL;
txwp_len = sizeof(*txwp_info);
txwd_len = chip->txwd_body_size;
--
2.50.1
Fix the regression introduced in 9e30ecf23b1b whereby IPv4 broadcast
packets were having their ethernet destination field mangled. This
broke WOL magic packets and likely other IPv4 broadcast.
The regression can be observed by sending an IPv4 WOL packet using
the wakeonlan program to any ethernet address:
wakeonlan 46:3b:ad:61:e0:5d
and capturing the packet with tcpdump:
tcpdump -i eth0 -w /tmp/bad.cap dst port 9
The ethernet destination MUST be ff:ff:ff:ff:ff:ff for broadcast, but is
mangled with 9e30ecf23b1b applied.
Revert the change made in 9e30ecf23b1b and ensure the MTU value for
broadcast routes is retained by calling ip_dst_init_metrics() directly,
avoiding the need to enter the main code block in rt_set_nexthop().
Simplify the code path taken for broadcast packets back to the original
before the regression, adding only the call to ip_dst_init_metrics().
The broadcast_pmtu.sh selftest provided with the original patch still
passes with this patch applied.
Fixes: 9e30ecf23b1b ("net: ipv4: fix incorrect MTU in broadcast routes")
Signed-off-by: Brett A C Sheffield <bacs(a)librecast.net>
---
net/ipv4/route.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f639a2ae881a..eaf78e128aca 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2588,6 +2588,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
do_cache = true;
if (type == RTN_BROADCAST) {
flags |= RTCF_BROADCAST | RTCF_LOCAL;
+ fi = NULL;
} else if (type == RTN_MULTICAST) {
flags |= RTCF_MULTICAST | RTCF_LOCAL;
if (!ip_check_mc_rcu(in_dev, fl4->daddr, fl4->saddr,
@@ -2657,8 +2658,12 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
rth->dst.output = ip_mc_output;
RT_CACHE_STAT_INC(out_slow_mc);
}
+ if (type == RTN_BROADCAST) {
+ /* ensure MTU value for broadcast routes is retained */
+ ip_dst_init_metrics(&rth->dst, res->fi->fib_metrics);
+ }
#ifdef CONFIG_IP_MROUTE
- if (type == RTN_MULTICAST) {
+ else if (type == RTN_MULTICAST) {
if (IN_DEV_MFORWARD(in_dev) &&
!ipv4_is_local_multicast(fl4->daddr)) {
rth->dst.input = ip_mr_input;
base-commit: 01b9128c5db1b470575d07b05b67ffa3cb02ebf1
--
2.49.1
The comparison function cmp_loc_by_count() used for sorting stack trace
locations in debugfs currently returns -1 if a->count > b->count and 1
otherwise. This breaks the antisymmetry property required by sort(),
because when two counts are equal, both cmp(a, b) and cmp(b, a) return
1.
This can lead to undefined or incorrect ordering results. Fix it by
explicitly returning 0 when the counts are equal, ensuring that the
comparison function follows the expected mathematical properties.
Fixes: 553c0369b3e1 ("mm/slub: sort debugfs output by frequency of stack traces")
Cc: stable(a)vger.kernel.org
Signed-off-by: Kuan-Wei Chiu <visitorckw(a)gmail.com>
---
mm/slub.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index 30003763d224..c91b3744adbc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -7718,8 +7718,9 @@ static int cmp_loc_by_count(const void *a, const void *b, const void *data)
if (loc1->count > loc2->count)
return -1;
- else
+ if (loc1->count < loc2->count)
return 1;
+ return 0;
}
static void *slab_debugfs_start(struct seq_file *seq, loff_t *ppos)
--
2.34.1
The patch titled
Subject: mm/memory-failure: fix VM_BUG_ON_PAGE(PagePoisoned(page)) when unpoison memory
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-memory-failure-fix-vm_bug_on_pagepagepoisonedpage-when-unpoison-memory.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Miaohe Lin <linmiaohe(a)huawei.com>
Subject: mm/memory-failure: fix VM_BUG_ON_PAGE(PagePoisoned(page)) when unpoison memory
Date: Thu, 28 Aug 2025 10:46:18 +0800
When I did memory failure tests, below panic occurs:
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(page))
kernel BUG at include/linux/page-flags.h:616!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 3 PID: 720 Comm: bash Not tainted 6.10.0-rc1-00195-g148743902568 #40
RIP: 0010:unpoison_memory+0x2f3/0x590
RSP: 0018:ffffa57fc8787d60 EFLAGS: 00000246
RAX: 0000000000000037 RBX: 0000000000000009 RCX: ffff9be25fcdc9c8
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff9be25fcdc9c0
RBP: 0000000000300000 R08: ffffffffb4956f88 R09: 0000000000009ffb
R10: 0000000000000284 R11: ffffffffb4926fa0 R12: ffffe6b00c000000
R13: ffff9bdb453dfd00 R14: 0000000000000000 R15: fffffffffffffffe
FS: 00007f08f04e4740(0000) GS:ffff9be25fcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564787a30410 CR3: 000000010d4e2000 CR4: 00000000000006f0
Call Trace:
<TASK>
unpoison_memory+0x2f3/0x590
simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110
debugfs_attr_write+0x42/0x60
full_proxy_write+0x5b/0x80
vfs_write+0xd5/0x540
ksys_write+0x64/0xe0
do_syscall_64+0xb9/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f08f0314887
RSP: 002b:00007ffece710078 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f08f0314887
RDX: 0000000000000009 RSI: 0000564787a30410 RDI: 0000000000000001
RBP: 0000564787a30410 R08: 000000000000fefe R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009
R13: 00007f08f041b780 R14: 00007f08f0417600 R15: 00007f08f0416a00
</TASK>
Modules linked in: hwpoison_inject
---[ end trace 0000000000000000 ]---
RIP: 0010:unpoison_memory+0x2f3/0x590
RSP: 0018:ffffa57fc8787d60 EFLAGS: 00000246
RAX: 0000000000000037 RBX: 0000000000000009 RCX: ffff9be25fcdc9c8
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff9be25fcdc9c0
RBP: 0000000000300000 R08: ffffffffb4956f88 R09: 0000000000009ffb
R10: 0000000000000284 R11: ffffffffb4926fa0 R12: ffffe6b00c000000
R13: ffff9bdb453dfd00 R14: 0000000000000000 R15: fffffffffffffffe
FS: 00007f08f04e4740(0000) GS:ffff9be25fcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564787a30410 CR3: 000000010d4e2000 CR4: 00000000000006f0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x31c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---
The root cause is that unpoison_memory() tries to check the PG_HWPoison
flags of an uninitialized page. So VM_BUG_ON_PAGE(PagePoisoned(page)) is
triggered. This can be reproduced by below steps:
1.Offline memory block:
echo offline > /sys/devices/system/memory/memory12/state
2.Get offlined memory pfn:
page-types -b n -rlN
3.Write pfn to unpoison-pfn
echo <pfn> > /sys/kernel/debug/hwpoison/unpoison-pfn
This scenario can be identified by pfn_to_online_page() returning NULL.
And ZONE_DEVICE pages are never expected, so we can simply fail if
pfn_to_online_page() == NULL to fix the bug.
Link: https://lkml.kernel.org/r/20250828024618.1744895-1-linmiaohe@huawei.com
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
Signed-off-by: Miaohe Lin <linmiaohe(a)huawei.com>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Naoya Horiguchi <nao.horiguchi(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
--- a/mm/memory-failure.c~mm-memory-failure-fix-vm_bug_on_pagepagepoisonedpage-when-unpoison-memory
+++ a/mm/memory-failure.c
@@ -2568,10 +2568,9 @@ int unpoison_memory(unsigned long pfn)
static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
- if (!pfn_valid(pfn))
- return -ENXIO;
-
- p = pfn_to_page(pfn);
+ p = pfn_to_online_page(pfn);
+ if (!p)
+ return -EIO;
folio = page_folio(p);
mutex_lock(&mf_mutex);
_
Patches currently in -mm which might be from linmiaohe(a)huawei.com are
mm-memory-failure-fix-vm_bug_on_pagepagepoisonedpage-when-unpoison-memory.patch
revert-hugetlb-make-hugetlb-depends-on-sysfs-or-sysctl.patch
Hello,
Are you looking for a loan to either increase your activity or to
carry out a project. We offer Crypto Loans at 2-7% interest rate
with or without a credit check. We also pay 1% commission to
brokers, who introduce project owners for finance or other
opportunities.
Please get back to me if you are interested in more details.
Best Regards,
Lucy Seinfield
Assistant Secretary
The patch titled
Subject: mm/memory-failure: fix redundant updates for already poisoned pages
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-memory-failure-fix-redundant-updates-for-already-poisoned-pages.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Kyle Meyer <kyle.meyer(a)hpe.com>
Subject: mm/memory-failure: fix redundant updates for already poisoned pages
Date: Thu, 28 Aug 2025 13:38:20 -0500
Duplicate memory errors can be reported by multiple sources.
Passing an already poisoned page to action_result() causes issues:
* The amount of hardware corrupted memory is incorrectly updated.
* Per NUMA node MF stats are incorrectly updated.
* Redundant "already poisoned" messages are printed.
Avoid those issues by:
* Skipping hardware corrupted memory updates for already poisoned pages.
* Skipping per NUMA node MF stats updates for already poisoned pages.
* Dropping redundant "already poisoned" messages.
Make MF_MSG_ALREADY_POISONED consistent with other action_page_types and
make calls to action_result() consistent for already poisoned normal pages
and huge pages.
Link: https://lkml.kernel.org/r/aLCiHMy12Ck3ouwC@hpe.com
Fixes: b8b9488d50b7 ("mm/memory-failure: improve memory failure action_result messages")
Signed-off-by: Kyle Meyer <kyle.meyer(a)hpe.com>
Reviewed-by: Jiaqi Yan <jiaqiyan(a)google.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Borislav Betkov <bp(a)alien8.de>
Cc: Jane Chu <jane.chu(a)oracle.com>
Cc: Kyle Meyer <kyle.meyer(a)hpe.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: "Luck, Tony" <tony.luck(a)intel.com>
Cc: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Mike Rapoport <rppt(a)kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi(a)gmail.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Russ Anderson <russ.anderson(a)hpe.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
--- a/mm/memory-failure.c~mm-memory-failure-fix-redundant-updates-for-already-poisoned-pages
+++ a/mm/memory-failure.c
@@ -956,7 +956,7 @@ static const char * const action_page_ty
[MF_MSG_BUDDY] = "free buddy page",
[MF_MSG_DAX] = "dax page",
[MF_MSG_UNSPLIT_THP] = "unsplit thp",
- [MF_MSG_ALREADY_POISONED] = "already poisoned",
+ [MF_MSG_ALREADY_POISONED] = "already poisoned page",
[MF_MSG_UNKNOWN] = "unknown page",
};
@@ -1349,9 +1349,10 @@ static int action_result(unsigned long p
{
trace_memory_failure_event(pfn, type, result);
- num_poisoned_pages_inc(pfn);
-
- update_per_node_mf_stats(pfn, result);
+ if (type != MF_MSG_ALREADY_POISONED) {
+ num_poisoned_pages_inc(pfn);
+ update_per_node_mf_stats(pfn, result);
+ }
pr_err("%#lx: recovery action for %s: %s\n",
pfn, action_page_types[type], action_name[result]);
@@ -2094,12 +2095,11 @@ retry:
*hugetlb = 0;
return 0;
} else if (res == -EHWPOISON) {
- pr_err("%#lx: already hardware poisoned\n", pfn);
if (flags & MF_ACTION_REQUIRED) {
folio = page_folio(p);
res = kill_accessing_process(current, folio_pfn(folio), flags);
- action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
}
+ action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
return res;
} else if (res == -EBUSY) {
if (!(flags & MF_NO_RETRY)) {
@@ -2285,7 +2285,6 @@ try_again:
goto unlock_mutex;
if (TestSetPageHWPoison(p)) {
- pr_err("%#lx: already hardware poisoned\n", pfn);
res = -EHWPOISON;
if (flags & MF_ACTION_REQUIRED)
res = kill_accessing_process(current, pfn, flags);
_
Patches currently in -mm which might be from kyle.meyer(a)hpe.com are
mm-memory-failure-fix-redundant-updates-for-already-poisoned-pages.patch
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. The Linux x86 architecture maps
the kernel address space into the upper portion of every process’s page
table. Consequently, in an SVA context, the IOMMU hardware can walk and
cache kernel space mappings. However, the Linux kernel currently lacks
a notification mechanism for kernel space mapping changes. This means
the IOMMU driver is not aware of such changes, leading to a break in
IOMMU cache coherence.
Modern IOMMUs often cache page table entries of the intermediate-level
page table as long as the entry is valid, no matter the permissions, to
optimize walk performance. Currently the iommu driver is notified only
for changes of user VA mappings, so the IOMMU's internal caches may
retain stale entries for kernel VA. When kernel page table mappings are
changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
entries, Use-After-Free (UAF) vulnerability condition arises.
If these freed page table pages are reallocated for a different purpose,
potentially by an attacker, the IOMMU could misinterpret the new data as
valid page table entries. This allows the IOMMU to walk into attacker-
controlled memory, leading to arbitrary physical memory DMA access or
privilege escalation.
To mitigate this, introduce a new iommu interface to flush IOMMU caches.
This interface should be invoked from architecture-specific code that
manages combined user and kernel page tables, whenever a kernel page table
update is done and the CPU TLB needs to be flushed.
Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
Cc: stable(a)vger.kernel.org
Suggested-by: Jann Horn <jannh(a)google.com>
Co-developed-by: Jason Gunthorpe <jgg(a)nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu(a)linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg(a)nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde(a)amd.com>
Reviewed-by: Kevin Tian <kevin.tian(a)intel.com>
Tested-by: Yi Lai <yi1.lai(a)intel.com>
---
arch/x86/mm/tlb.c | 4 +++
drivers/iommu/iommu-sva.c | 60 ++++++++++++++++++++++++++++++++++++++-
include/linux/iommu.h | 4 +++
3 files changed, 67 insertions(+), 1 deletion(-)
Change log:
v3:
- iommu_sva_mms is an unbound list; iterating it in an atomic context
could introduce significant latency issues. Schedule it in a kernel
thread and replace the spinlock with a mutex.
- Replace the static key with a normal bool; it can be brought back if
data shows the benefit.
- Invalidate KVA range in the flush_tlb_all() paths.
- All previous reviewed-bys are preserved. Please let me know if there
are any objections.
v2:
- https://lore.kernel.org/linux-iommu/20250709062800.651521-1-baolu.lu@linux.…
- Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range);
- Replace the mutex with a spinlock to make the interface usable in the
critical regions.
v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux…
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..3b85e7d3ba44 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
#include <linux/task_work.h>
#include <linux/mmu_notifier.h>
#include <linux/mmu_context.h>
+#include <linux/iommu.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -1478,6 +1479,8 @@ void flush_tlb_all(void)
else
/* Fall back to the IPI-based invalidation. */
on_each_cpu(do_flush_tlb_all, NULL, 1);
+
+ iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
}
/* Flush an arbitrarily large range of memory with INVLPGB. */
@@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
kernel_tlb_flush_range(info);
put_flush_tlb_info();
+ iommu_sva_invalidate_kva_range(start, end);
}
/*
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..d0da2b3fd64b 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -10,6 +10,8 @@
#include "iommu-priv.h"
static DEFINE_MUTEX(iommu_sva_lock);
+static bool iommu_sva_present;
+static LIST_HEAD(iommu_sva_mms);
static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
struct mm_struct *mm);
@@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
return ERR_PTR(-ENOSPC);
}
iommu_mm->pasid = pasid;
+ iommu_mm->mm = mm;
INIT_LIST_HEAD(&iommu_mm->sva_domains);
/*
* Make sure the write to mm->iommu_mm is not reordered in front of
@@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
if (ret)
goto out_free_domain;
domain->users = 1;
- list_add(&domain->next, &mm->iommu_mm->sva_domains);
+ if (list_empty(&iommu_mm->sva_domains)) {
+ if (list_empty(&iommu_sva_mms))
+ WRITE_ONCE(iommu_sva_present, true);
+ list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
+ }
+ list_add(&domain->next, &iommu_mm->sva_domains);
out:
refcount_set(&handle->users, 1);
mutex_unlock(&iommu_sva_lock);
@@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
list_del(&domain->next);
iommu_domain_free(domain);
}
+
+ if (list_empty(&iommu_mm->sva_domains)) {
+ list_del(&iommu_mm->mm_list_elm);
+ if (list_empty(&iommu_sva_mms))
+ WRITE_ONCE(iommu_sva_present, false);
+ }
+
mutex_unlock(&iommu_sva_lock);
kfree(handle);
}
@@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
return domain;
}
+
+struct kva_invalidation_work_data {
+ struct work_struct work;
+ unsigned long start;
+ unsigned long end;
+};
+
+static void invalidate_kva_func(struct work_struct *work)
+{
+ struct kva_invalidation_work_data *data =
+ container_of(work, struct kva_invalidation_work_data, work);
+ struct iommu_mm_data *iommu_mm;
+
+ guard(mutex)(&iommu_sva_lock);
+ list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
+ mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm,
+ data->start, data->end);
+
+ kfree(data);
+}
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+ struct kva_invalidation_work_data *data;
+
+ if (likely(!READ_ONCE(iommu_sva_present)))
+ return;
+
+ /* will be freed in the task function */
+ data = kzalloc(sizeof(*data), GFP_ATOMIC);
+ if (!data)
+ return;
+
+ data->start = start;
+ data->end = end;
+ INIT_WORK(&data->work, invalidate_kva_func);
+
+ /*
+ * Since iommu_sva_mms is an unbound list, iterating it in an atomic
+ * context could introduce significant latency issues.
+ */
+ schedule_work(&data->work);
+}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c30d12e16473..66e4abb2df0d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1134,7 +1134,9 @@ struct iommu_sva {
struct iommu_mm_data {
u32 pasid;
+ struct mm_struct *mm;
struct list_head sva_domains;
+ struct list_head mm_list_elm;
};
int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode);
@@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
struct mm_struct *mm);
void iommu_sva_unbind_device(struct iommu_sva *handle);
u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
#else
static inline struct iommu_sva *
iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
}
static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
#endif /* CONFIG_IOMMU_SVA */
#ifdef CONFIG_IOMMU_IOPF
--
2.43.0
This reverts commit 2402adce8da4e7396b63b5ffa71e1fa16e5fe5c4.
The upstream commit a40c5d727b8111b5db424a1e43e14a1dcce1e77f ("drm/dp:
Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS") the
reverted commit backported causes a regression, on one eDP panel at
least resulting in display flickering, described in detail at the Link:
below. The issue fixed by the upstream commit will need a different
solution, revert the backport for now.
Cc: intel-gfx(a)lists.freedesktop.org
Cc: dri-devel(a)lists.freedesktop.org
Cc: Sasha Levin <sashal(a)kernel.org>
Link: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14558
Signed-off-by: Imre Deak <imre.deak(a)intel.com>
---
drivers/gpu/drm/drm_dp_helper.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_dp_helper.c b/drivers/gpu/drm/drm_dp_helper.c
index 4eabef5b86d0..ffc68d305afe 100644
--- a/drivers/gpu/drm/drm_dp_helper.c
+++ b/drivers/gpu/drm/drm_dp_helper.c
@@ -280,7 +280,7 @@ ssize_t drm_dp_dpcd_read(struct drm_dp_aux *aux, unsigned int offset,
* We just have to do it before any DPCD access and hope that the
* monitor doesn't power down exactly after the throw away read.
*/
- ret = drm_dp_dpcd_access(aux, DP_AUX_NATIVE_READ, DP_LANE0_1_STATUS, buffer,
+ ret = drm_dp_dpcd_access(aux, DP_AUX_NATIVE_READ, DP_DPCD_REV, buffer,
1);
if (ret != 1)
goto out;
--
2.49.1
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.10.y
git checkout FETCH_HEAD
git cherry-pick -x 76d2e3890fb169168c73f2e4f8375c7cc24a765e
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025082225-attribute-embark-823b@gregkh' --subject-prefix 'PATCH 5.10.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76d2e3890fb169168c73f2e4f8375c7cc24a765e Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Date: Sat, 16 Aug 2025 07:25:20 -0700
Subject: [PATCH] NFS: Fix a race when updating an existing write
After nfs_lock_and_join_requests() tests for whether the request is
still attached to the mapping, nothing prevents a call to
nfs_inode_remove_request() from succeeding until we actually lock the
page group.
The reason is that whoever called nfs_inode_remove_request() doesn't
necessarily have a lock on the page group head.
So in order to avoid races, let's take the page group lock earlier in
nfs_lock_and_join_requests(), and hold it across the removal of the
request in nfs_inode_remove_request().
Reported-by: Jeff Layton <jlayton(a)kernel.org>
Tested-by: Joe Quanaim <jdq(a)meta.com>
Tested-by: Andrew Steffen <aksteffen(a)meta.com>
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Fixes: bd37d6fce184 ("NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 11968dcb7243..6e69ce43a13f 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -253,13 +253,14 @@ nfs_page_group_unlock(struct nfs_page *req)
nfs_page_clear_headlock(req);
}
-/*
- * nfs_page_group_sync_on_bit_locked
+/**
+ * nfs_page_group_sync_on_bit_locked - Test if all requests have @bit set
+ * @req: request in page group
+ * @bit: PG_* bit that is used to sync page group
*
* must be called with page group lock held
*/
-static bool
-nfs_page_group_sync_on_bit_locked(struct nfs_page *req, unsigned int bit)
+bool nfs_page_group_sync_on_bit_locked(struct nfs_page *req, unsigned int bit)
{
struct nfs_page *head = req->wb_head;
struct nfs_page *tmp;
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index fa5c41d0989a..8b7c04737967 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -153,20 +153,10 @@ nfs_page_set_inode_ref(struct nfs_page *req, struct inode *inode)
}
}
-static int
-nfs_cancel_remove_inode(struct nfs_page *req, struct inode *inode)
+static void nfs_cancel_remove_inode(struct nfs_page *req, struct inode *inode)
{
- int ret;
-
- if (!test_bit(PG_REMOVE, &req->wb_flags))
- return 0;
- ret = nfs_page_group_lock(req);
- if (ret)
- return ret;
if (test_and_clear_bit(PG_REMOVE, &req->wb_flags))
nfs_page_set_inode_ref(req, inode);
- nfs_page_group_unlock(req);
- return 0;
}
/**
@@ -585,19 +575,18 @@ retry:
}
}
+ ret = nfs_page_group_lock(head);
+ if (ret < 0)
+ goto out_unlock;
+
/* Ensure that nobody removed the request before we locked it */
if (head != folio->private) {
+ nfs_page_group_unlock(head);
nfs_unlock_and_release_request(head);
goto retry;
}
- ret = nfs_cancel_remove_inode(head, inode);
- if (ret < 0)
- goto out_unlock;
-
- ret = nfs_page_group_lock(head);
- if (ret < 0)
- goto out_unlock;
+ nfs_cancel_remove_inode(head, inode);
/* lock each request in the page group */
for (subreq = head->wb_this_page;
@@ -786,7 +775,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
{
struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
- if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
+ nfs_page_group_lock(req);
+ if (nfs_page_group_sync_on_bit_locked(req, PG_REMOVE)) {
struct folio *folio = nfs_page_to_folio(req->wb_head);
struct address_space *mapping = folio->mapping;
@@ -798,6 +788,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
}
spin_unlock(&mapping->i_private_lock);
}
+ nfs_page_group_unlock(req);
if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
atomic_long_dec(&nfsi->nrequests);
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 169b4ae30ff4..9aed39abc94b 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -160,6 +160,7 @@ extern void nfs_join_page_group(struct nfs_page *head,
extern int nfs_page_group_lock(struct nfs_page *);
extern void nfs_page_group_unlock(struct nfs_page *);
extern bool nfs_page_group_sync_on_bit(struct nfs_page *, unsigned int);
+extern bool nfs_page_group_sync_on_bit_locked(struct nfs_page *, unsigned int);
extern int nfs_page_set_headlock(struct nfs_page *req);
extern void nfs_page_clear_headlock(struct nfs_page *req);
extern bool nfs_async_iocounter_wait(struct rpc_task *, struct nfs_lock_context *);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.15.y
git checkout FETCH_HEAD
git cherry-pick -x 76d2e3890fb169168c73f2e4f8375c7cc24a765e
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025082225-cling-drainer-d884@gregkh' --subject-prefix 'PATCH 5.15.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76d2e3890fb169168c73f2e4f8375c7cc24a765e Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Date: Sat, 16 Aug 2025 07:25:20 -0700
Subject: [PATCH] NFS: Fix a race when updating an existing write
After nfs_lock_and_join_requests() tests for whether the request is
still attached to the mapping, nothing prevents a call to
nfs_inode_remove_request() from succeeding until we actually lock the
page group.
The reason is that whoever called nfs_inode_remove_request() doesn't
necessarily have a lock on the page group head.
So in order to avoid races, let's take the page group lock earlier in
nfs_lock_and_join_requests(), and hold it across the removal of the
request in nfs_inode_remove_request().
Reported-by: Jeff Layton <jlayton(a)kernel.org>
Tested-by: Joe Quanaim <jdq(a)meta.com>
Tested-by: Andrew Steffen <aksteffen(a)meta.com>
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Fixes: bd37d6fce184 ("NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 11968dcb7243..6e69ce43a13f 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -253,13 +253,14 @@ nfs_page_group_unlock(struct nfs_page *req)
nfs_page_clear_headlock(req);
}
-/*
- * nfs_page_group_sync_on_bit_locked
+/**
+ * nfs_page_group_sync_on_bit_locked - Test if all requests have @bit set
+ * @req: request in page group
+ * @bit: PG_* bit that is used to sync page group
*
* must be called with page group lock held
*/
-static bool
-nfs_page_group_sync_on_bit_locked(struct nfs_page *req, unsigned int bit)
+bool nfs_page_group_sync_on_bit_locked(struct nfs_page *req, unsigned int bit)
{
struct nfs_page *head = req->wb_head;
struct nfs_page *tmp;
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index fa5c41d0989a..8b7c04737967 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -153,20 +153,10 @@ nfs_page_set_inode_ref(struct nfs_page *req, struct inode *inode)
}
}
-static int
-nfs_cancel_remove_inode(struct nfs_page *req, struct inode *inode)
+static void nfs_cancel_remove_inode(struct nfs_page *req, struct inode *inode)
{
- int ret;
-
- if (!test_bit(PG_REMOVE, &req->wb_flags))
- return 0;
- ret = nfs_page_group_lock(req);
- if (ret)
- return ret;
if (test_and_clear_bit(PG_REMOVE, &req->wb_flags))
nfs_page_set_inode_ref(req, inode);
- nfs_page_group_unlock(req);
- return 0;
}
/**
@@ -585,19 +575,18 @@ retry:
}
}
+ ret = nfs_page_group_lock(head);
+ if (ret < 0)
+ goto out_unlock;
+
/* Ensure that nobody removed the request before we locked it */
if (head != folio->private) {
+ nfs_page_group_unlock(head);
nfs_unlock_and_release_request(head);
goto retry;
}
- ret = nfs_cancel_remove_inode(head, inode);
- if (ret < 0)
- goto out_unlock;
-
- ret = nfs_page_group_lock(head);
- if (ret < 0)
- goto out_unlock;
+ nfs_cancel_remove_inode(head, inode);
/* lock each request in the page group */
for (subreq = head->wb_this_page;
@@ -786,7 +775,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
{
struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
- if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
+ nfs_page_group_lock(req);
+ if (nfs_page_group_sync_on_bit_locked(req, PG_REMOVE)) {
struct folio *folio = nfs_page_to_folio(req->wb_head);
struct address_space *mapping = folio->mapping;
@@ -798,6 +788,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
}
spin_unlock(&mapping->i_private_lock);
}
+ nfs_page_group_unlock(req);
if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
atomic_long_dec(&nfsi->nrequests);
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 169b4ae30ff4..9aed39abc94b 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -160,6 +160,7 @@ extern void nfs_join_page_group(struct nfs_page *head,
extern int nfs_page_group_lock(struct nfs_page *);
extern void nfs_page_group_unlock(struct nfs_page *);
extern bool nfs_page_group_sync_on_bit(struct nfs_page *, unsigned int);
+extern bool nfs_page_group_sync_on_bit_locked(struct nfs_page *, unsigned int);
extern int nfs_page_set_headlock(struct nfs_page *req);
extern void nfs_page_clear_headlock(struct nfs_page *req);
extern bool nfs_async_iocounter_wait(struct rpc_task *, struct nfs_lock_context *);
If an object is backed up to shmem it is incorrectly identified
as not having valid data by the move code. This means moving
to VRAM skips the -EMULTIHOP step and the bo is cleared. This
causes all sorts of weird behaviour on DGFX if an already evicted
object is targeted by the shrinker.
Fix this by using ttm_tt_is_swapped() to identify backed-up
objects.
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5996
Fixes: 00c8efc3180f ("drm/xe: Add a shrinker for xe bos")
Cc: Matthew Brost <matthew.brost(a)intel.com>
Cc: Matthew Auld <matthew.auld(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.15+
Signed-off-by: Thomas Hellström <thomas.hellstrom(a)linux.intel.com>
---
drivers/gpu/drm/xe/xe_bo.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 7d1ff642b02a..4faf15d5fa6d 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -823,8 +823,7 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
return ret;
}
- tt_has_data = ttm && (ttm_tt_is_populated(ttm) ||
- (ttm->page_flags & TTM_TT_FLAG_SWAPPED));
+ tt_has_data = ttm && (ttm_tt_is_populated(ttm) || ttm_tt_is_swapped(ttm));
move_lacks_source = !old_mem || (handle_system_ccs ? (!bo->ccs_cleared) :
(!mem_type_is_vram(old_mem_type) && !tt_has_data));
--
2.50.1
Commit 3215eaceca87 ("mm/mremap: refactor initial parameter sanity
checks") moved the sanity check for vrm->new_addr from mremap_to() to
check_mremap_params().
However, this caused a regression as vrm->new_addr is now checked even
when MREMAP_FIXED and MREMAP_DONTUNMAP flags are not specified. In this
case, vrm->new_addr can be garbage and create unexpected failures.
Fix this by moving the new_addr check after the vrm_implies_new_addr()
guard. This ensures that the new_addr is only checked when the user has
specified one explicitly.
Cc: stable(a)vger.kernel.org
Fixes: 3215eaceca87 ("mm/mremap: refactor initial parameter sanity checks")
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Signed-off-by: Carlos Llamas <cmllamas(a)google.com>
---
v2:
- split out vrm->new_len into individual checks
- cc stable, collect tags
v1:
https://lore.kernel.org/all/20250828032653.521314-1-cmllamas@google.com/
mm/mremap.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index e618a706aff5..35de0a7b910e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1774,15 +1774,18 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
if (!vrm->new_len)
return -EINVAL;
- /* Is the new length or address silly? */
- if (vrm->new_len > TASK_SIZE ||
- vrm->new_addr > TASK_SIZE - vrm->new_len)
+ /* Is the new length silly? */
+ if (vrm->new_len > TASK_SIZE)
return -EINVAL;
/* Remainder of checks are for cases with specific new_addr. */
if (!vrm_implies_new_addr(vrm))
return 0;
+ /* Is the new address silly? */
+ if (vrm->new_addr > TASK_SIZE - vrm->new_len)
+ return -EINVAL;
+
/* The new address must be page-aligned. */
if (offset_in_page(vrm->new_addr))
return -EINVAL;
--
2.51.0.268.g9569e192d0-goog
This reverts commit 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio pci device").
Virtio drivers and PCI devices have never fully supported true
surprise (aka hot unplug) removal. Drivers historically continued
processing and waiting for pending I/O and even continued synchronous
device reset during surprise removal. Devices have also continued
completing I/Os, doing DMA and allowing device reset after surprise
removal to support such drivers.
Supporting it correctly would require a new device capability and
driver negotiation in the virtio specification to safely stop
I/O and free queue memory. Failure to do so either breaks all the
existing drivers with call trace listed in the commit or crashes the
host on continuing the DMA. Hence, until such specification and devices
are invented, restore the previous behavior of treating surprise
removal as graceful removal to avoid regressions and maintain system
stability same as before the
commit 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio pci device").
As explained above, previous analysis of solving this only in driver
was incomplete and non-reliable at [1] and at [2]; Hence reverting commit
43bb40c5b926 ("virtio_pci: Support surprise removal of virtio pci device")
is still the best stand to restore failures of virtio net and
block devices.
[1] https://lore.kernel.org/virtualization/CY8PR12MB719506CC5613EB100BC6C638DCB…
[2] https://lore.kernel.org/virtualization/20250602024358.57114-1-parav@nvidia.…
Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio pci device")
Cc: stable(a)vger.kernel.org
Reported-by: lirongqing(a)baidu.com
Closes: https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9b4741@bai…
Signed-off-by: Parav Pandit <parav(a)nvidia.com>
---
drivers/virtio/virtio_pci_common.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
index d6d79af44569..dba5eb2eaff9 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -747,13 +747,6 @@ static void virtio_pci_remove(struct pci_dev *pci_dev)
struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
struct device *dev = get_device(&vp_dev->vdev.dev);
- /*
- * Device is marked broken on surprise removal so that virtio upper
- * layers can abort any ongoing operation.
- */
- if (!pci_device_is_present(pci_dev))
- virtio_break_device(&vp_dev->vdev);
-
pci_disable_sriov(pci_dev);
unregister_virtio_device(&vp_dev->vdev);
--
2.26.2
Dear Entrepreneur,
"Global Support for Project & Business Development"
Need funds for your project or business plan? Our investors are ready to help and support you. Our network of professional investors offers soft loans to support project owners worldwide. Contact us to explore funding opportunities."
"Businesses and entrepreneurs seeking strategic partnerships and investment opportunities can benefit from our expertise. We consider collaborations in various sectors, including:
1. Business development
2. Project financing
3. Growth and expansion
4. Real estate investments
5. Strategic contracts
Note: IF YOU BRING A PROJECT OWNER TO US, YOU WILL RECEIVE A REFERER COMMISION. KINDLY DISCUSS PARTNERSHIP WITH US TODAY. info(a)firstcapitalsinvestors.com or firstcapitalsinvestors(a)gmail.com
We finance all kinds of viable projects ranging from personal projects to multi mega capital project. Organizational projects, . Government projects and community development are part of our services to humanity. We have support project such as real estate projects, Oil and Gas Projects, Health and medications, Engineering, Construction, Agriculture, Aviation, Retail sales business. All viable projects and businesses are welcome. Our investors are ready to support you.
Best Regards
Mr. William Man Fu Hung
First Capital Loans and Finance
www.firstcapitalsinvestors.com
info(a)firstcapitalsinvestors.com
> This reverts commit 43bb40c5b926 ("virtio_pci: Support surprise removal of
> virtio pci device").
>
> Virtio drivers and PCI devices have never fully supported true surprise (aka hot
> unplug) removal. Drivers historically continued processing and waiting for
> pending I/O and even continued synchronous device reset during surprise
> removal. Devices have also continued completing I/Os, doing DMA and allowing
> device reset after surprise removal to support such drivers.
>
> Supporting it correctly would require a new device capability and driver
> negotiation in the virtio specification to safely stop I/O and free queue memory.
> Failure to do so either breaks all the existing drivers with call trace listed in the
> commit or crashes the host on continuing the DMA. Hence, until such
> specification and devices are invented, restore the previous behavior of treating
> surprise removal as graceful removal to avoid regressions and maintain system
> stability same as before the commit 43bb40c5b926 ("virtio_pci: Support surprise
> removal of virtio pci device").
>
> As explained above, previous analysis of solving this only in driver was
> incomplete and non-reliable at [1] and at [2]; Hence reverting commit
> 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio pci device") is still
> the best stand to restore failures of virtio net and block devices.
>
> [1]
> https://lore.kernel.org/virtualization/CY8PR12MB719506CC5613EB100BC6C638
> DCBD2(a)CY8PR12MB7195.namprd12.prod.outlook.com/#t
> [2]
> https://lore.kernel.org/virtualization/20250602024358.57114-1-parav@nvidia.c
> om/
>
> Fixes: 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio pci device")
> Cc: stable(a)vger.kernel.org
> Reported-by: lirongqing(a)baidu.com
> Closes:
> https://lore.kernel.org/virtualization/c45dd68698cd47238c55fb73ca9b4741@b
> aidu.com/
> Signed-off-by: Parav Pandit <parav(a)nvidia.com>
Tested-by: Li RongQing <lirongqing(a)baidu.com>
Thanks
-Li
> ---
> drivers/virtio/virtio_pci_common.c | 7 -------
> 1 file changed, 7 deletions(-)
>
> diff --git a/drivers/virtio/virtio_pci_common.c
> b/drivers/virtio/virtio_pci_common.c
> index d6d79af44569..dba5eb2eaff9 100644
> --- a/drivers/virtio/virtio_pci_common.c
> +++ b/drivers/virtio/virtio_pci_common.c
> @@ -747,13 +747,6 @@ static void virtio_pci_remove(struct pci_dev *pci_dev)
> struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
> struct device *dev = get_device(&vp_dev->vdev.dev);
>
> - /*
> - * Device is marked broken on surprise removal so that virtio upper
> - * layers can abort any ongoing operation.
> - */
> - if (!pci_device_is_present(pci_dev))
> - virtio_break_device(&vp_dev->vdev);
> -
> pci_disable_sriov(pci_dev);
>
> unregister_virtio_device(&vp_dev->vdev);
> --
> 2.26.2
Commit 9e30ecf23b1b ("net: ipv4: fix incorrect MTU in broadcast routes")
introduced a regression where local-broadcast packets would have their
gateway set in __mkroute_output, which was caused by fi = NULL being
removed.
Fix this by resetting the fib_info for local-broadcast packets. This
preserves the intended changes for directed-broadcast packets.
Cc: stable(a)vger.kernel.org
Fixes: 9e30ecf23b1b ("net: ipv4: fix incorrect MTU in broadcast routes")
Reported-by: Brett A C Sheffield <bacs(a)librecast.net>
Closes: https://lore.kernel.org/regressions/20250822165231.4353-4-bacs@librecast.net
Signed-off-by: Oscar Maes <oscmaes92(a)gmail.com>
---
Link to discussion:
https://lore.kernel.org/netdev/20250822165231.4353-4-bacs@librecast.net/
Thanks to Brett Sheffield for finding the regression and writing
the initial fix!
net/ipv4/route.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f639a2ae881a..baa43e5966b1 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2575,12 +2575,16 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
!netif_is_l3_master(dev_out))
return ERR_PTR(-EINVAL);
- if (ipv4_is_lbcast(fl4->daddr))
+ if (ipv4_is_lbcast(fl4->daddr)) {
type = RTN_BROADCAST;
- else if (ipv4_is_multicast(fl4->daddr))
+
+ /* reset fi to prevent gateway resolution */
+ fi = NULL;
+ } else if (ipv4_is_multicast(fl4->daddr)) {
type = RTN_MULTICAST;
- else if (ipv4_is_zeronet(fl4->daddr))
+ } else if (ipv4_is_zeronet(fl4->daddr)) {
return ERR_PTR(-EINVAL);
+ }
if (dev_out->flags & IFF_LOOPBACK)
flags |= RTCF_LOCAL;
--
2.39.5
tcpm_handle_vdm_request delivers messages to the partner altmode or the
cable altmode depending on the SVDM response type, which is incorrect.
The partner or cable should be chosen based on the received message type
instead.
Also add this filter to ADEV_NOTIFY_USB_AND_QUEUE_VDM, which is used when
the Enter Mode command is responded to by a NAK on SOP or SOP' and when
the Exit Mode command is responded to by an ACK on SOP.
Fixes: 7e7877c55eb1 ("usb: typec: tcpm: add alt mode enter/exit/vdm support for sop'")
Cc: stable(a)vger.kernel.org
Signed-off-by: RD Babiera <rdbabiera(a)google.com>
Reviewed-by: Badhri Jagan Sridharan <badhri(a)google.com>
---
drivers/usb/typec/tcpm/tcpm.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/usb/typec/tcpm/tcpm.c b/drivers/usb/typec/tcpm/tcpm.c
index 1f6fdfaa34bf..b2a568a5bc9b 100644
--- a/drivers/usb/typec/tcpm/tcpm.c
+++ b/drivers/usb/typec/tcpm/tcpm.c
@@ -2426,17 +2426,21 @@ static void tcpm_handle_vdm_request(struct tcpm_port *port,
case ADEV_NONE:
break;
case ADEV_NOTIFY_USB_AND_QUEUE_VDM:
- WARN_ON(typec_altmode_notify(adev, TYPEC_STATE_USB, NULL));
- typec_altmode_vdm(adev, p[0], &p[1], cnt);
+ if (rx_sop_type == TCPC_TX_SOP_PRIME) {
+ typec_cable_altmode_vdm(adev, TYPEC_PLUG_SOP_P, p[0], &p[1], cnt);
+ } else {
+ WARN_ON(typec_altmode_notify(adev, TYPEC_STATE_USB, NULL));
+ typec_altmode_vdm(adev, p[0], &p[1], cnt);
+ }
break;
case ADEV_QUEUE_VDM:
- if (response_tx_sop_type == TCPC_TX_SOP_PRIME)
+ if (rx_sop_type == TCPC_TX_SOP_PRIME)
typec_cable_altmode_vdm(adev, TYPEC_PLUG_SOP_P, p[0], &p[1], cnt);
else
typec_altmode_vdm(adev, p[0], &p[1], cnt);
break;
case ADEV_QUEUE_VDM_SEND_EXIT_MODE_ON_FAIL:
- if (response_tx_sop_type == TCPC_TX_SOP_PRIME) {
+ if (rx_sop_type == TCPC_TX_SOP_PRIME) {
if (typec_cable_altmode_vdm(adev, TYPEC_PLUG_SOP_P,
p[0], &p[1], cnt)) {
int svdm_version = typec_get_cable_svdm_version(
base-commit: 956606bafb5fc6e5968aadcda86fc0037e1d7548
--
2.51.0.261.g7ce5a0a67e-goog
The quilt patch titled
Subject: mm: gix possible deadlock in kmemleak
has been removed from the -mm tree. Its filename was
mm-fix-possible-deadlock-in-kmemleak.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Gu Bowen <gubowen5(a)huawei.com>
Subject: mm: gix possible deadlock in kmemleak
Date: Fri, 22 Aug 2025 15:35:41 +0800
There are some AA deadlock issues in kmemleak, similar to the situation
reported by Breno [1]. The deadlock path is as follows:
mem_pool_alloc()
-> raw_spin_lock_irqsave(&kmemleak_lock, flags);
-> pr_warn()
-> netconsole subsystem
-> netpoll
-> __alloc_skb
-> __create_object
-> raw_spin_lock_irqsave(&kmemleak_lock, flags);
To solve this problem, switch to printk_safe mode before printing warning
message, this will redirect all printk()-s to a special per-CPU buffer,
which will be flushed later from a safe context (irq work), and this
deadlock problem can be avoided. The proper API to use should be
printk_deferred_enter()/printk_deferred_exit() [2]. Another way is to
place the warn print after kmemleak is released.
Link: https://lkml.kernel.org/r/20250822073541.1886469-1-gubowen5@huawei.com
Link: https://lore.kernel.org/all/20250731-kmemleak_lock-v1-1-728fd470198f@debian… [1]
Link: https://lore.kernel.org/all/5ca375cd-4a20-4807-b897-68b289626550@redhat.com/ [2]
Signed-off-by: Gu Bowen <gubowen5(a)huawei.com>
Reviewed-by: Waiman Long <longman(a)redhat.com>
Reviewed-by: Catalin Marinas <catalin.marinas(a)arm.com>
Reviewed-by: Breno Leitao <leitao(a)debian.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: John Ogness <john.ogness(a)linutronix.de>
Cc: Lu Jialin <lujialin4(a)huawei.com>
Cc: Petr Mladek <pmladek(a)suse.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kmemleak.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)
--- a/mm/kmemleak.c~mm-fix-possible-deadlock-in-kmemleak
+++ a/mm/kmemleak.c
@@ -437,9 +437,15 @@ static struct kmemleak_object *__lookup_
else if (untagged_objp == untagged_ptr || alias)
return object;
else {
+ /*
+ * Printk deferring due to the kmemleak_lock held.
+ * This is done to avoid deadlock.
+ */
+ printk_deferred_enter();
kmemleak_warn("Found object by alias at 0x%08lx\n",
ptr);
dump_object_info(object);
+ printk_deferred_exit();
break;
}
}
@@ -736,6 +742,11 @@ static int __link_object(struct kmemleak
else if (untagged_objp + parent->size <= untagged_ptr)
link = &parent->rb_node.rb_right;
else {
+ /*
+ * Printk deferring due to the kmemleak_lock held.
+ * This is done to avoid deadlock.
+ */
+ printk_deferred_enter();
kmemleak_stop("Cannot insert 0x%lx into the object search tree (overlaps existing)\n",
ptr);
/*
@@ -743,6 +754,7 @@ static int __link_object(struct kmemleak
* be freed while the kmemleak_lock is held.
*/
dump_object_info(parent);
+ printk_deferred_exit();
return -EEXIST;
}
}
@@ -856,13 +868,8 @@ static void delete_object_part(unsigned
raw_spin_lock_irqsave(&kmemleak_lock, flags);
object = __find_and_remove_object(ptr, 1, objflags);
- if (!object) {
-#ifdef DEBUG
- kmemleak_warn("Partially freeing unknown object at 0x%08lx (size %zu)\n",
- ptr, size);
-#endif
+ if (!object)
goto unlock;
- }
/*
* Create one or two objects that may result from the memory block
@@ -882,8 +889,14 @@ static void delete_object_part(unsigned
unlock:
raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
- if (object)
+ if (object) {
__delete_object(object);
+ } else {
+#ifdef DEBUG
+ kmemleak_warn("Partially freeing unknown object at 0x%08lx (size %zu)\n",
+ ptr, size);
+#endif
+ }
out:
if (object_l)
_
Patches currently in -mm which might be from gubowen5(a)huawei.com are
The quilt patch titled
Subject: x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings()
has been removed from the -mm tree. Its filename was
x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Harry Yoo <harry.yoo(a)oracle.com>
Subject: x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings()
Date: Mon, 18 Aug 2025 11:02:06 +0900
Define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to ensure
page tables are properly synchronized when calling p*d_populate_kernel().
For 5-level paging, synchronization is performed via
pgd_populate_kernel(). In 4-level paging, pgd_populate() is a no-op, so
synchronization is instead performed at the P4D level via
p4d_populate_kernel().
This fixes intermittent boot failures on systems using 4-level paging and
a large amount of persistent memory:
BUG: unable to handle page fault for address: ffffe70000000034
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] SMP NOPTI
RIP: 0010:__init_single_page+0x9/0x6d
Call Trace:
<TASK>
__init_zone_device_page+0x17/0x5d
memmap_init_zone_device+0x154/0x1bb
pagemap_range+0x2e0/0x40f
memremap_pages+0x10b/0x2f0
devm_memremap_pages+0x1e/0x60
dev_dax_probe+0xce/0x2ec [device_dax]
dax_bus_probe+0x6d/0xc9
[... snip ...]
</TASK>
It also fixes a crash in vmemmap_set_pmd() caused by accessing vmemmap
before sync_global_pgds() [1]:
BUG: unable to handle page fault for address: ffffeb3ff1200000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
Tainted: [W]=WARN
RIP: 0010:vmemmap_set_pmd+0xff/0x230
<TASK>
vmemmap_populate_hugepages+0x176/0x180
vmemmap_populate+0x34/0x80
__populate_section_memmap+0x41/0x90
sparse_add_section+0x121/0x3e0
__add_pages+0xba/0x150
add_pages+0x1d/0x70
memremap_pages+0x3dc/0x810
devm_memremap_pages+0x1c/0x60
xe_devm_add+0x8b/0x100 [xe]
xe_tile_init_noalloc+0x6a/0x70 [xe]
xe_device_probe+0x48c/0x740 [xe]
[... snip ...]
Link: https://lkml.kernel.org/r/20250818020206.4517-4-harry.yoo@oracle.com
Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Harry Yoo <harry.yoo(a)oracle.com>
Closes: https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@in… [1]
Suggested-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Acked-by: Kiryl Shutsemau <kas(a)kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Ard Biesheuvel <ardb(a)kernel.org>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: bibo mao <maobibo(a)loongson.cn>
Cc: Borislav Betkov <bp(a)alien8.de>
Cc: Christoph Lameter (Ampere) <cl(a)gentwo.org>
Cc: Dennis Zhou <dennis(a)kernel.org>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: Dmitriy Vyukov <dvyukov(a)google.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jane Chu <jane.chu(a)oracle.com>
Cc: Joao Martins <joao.m.martins(a)oracle.com>
Cc: Joerg Roedel <joro(a)8bytes.org>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Kevin Brodsky <kevin.brodsky(a)arm.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Thomas Gleinxer <tglx(a)linutronix.de>
Cc: Thomas Huth <thuth(a)redhat.com>
Cc: "Uladzislau Rezki (Sony)" <urezki(a)gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/x86/include/asm/pgtable_64_types.h | 3 +++
arch/x86/mm/init_64.c | 18 ++++++++++++++++++
2 files changed, 21 insertions(+)
--- a/arch/x86/include/asm/pgtable_64_types.h~x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings
+++ a/arch/x86/include/asm/pgtable_64_types.h
@@ -36,6 +36,9 @@ static inline bool pgtable_l5_enabled(vo
#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
#endif /* USE_EARLY_PGTABLE_L5 */
+#define ARCH_PAGE_TABLE_SYNC_MASK \
+ (pgtable_l5_enabled() ? PGTBL_PGD_MODIFIED : PGTBL_P4D_MODIFIED)
+
extern unsigned int pgdir_shift;
extern unsigned int ptrs_per_p4d;
--- a/arch/x86/mm/init_64.c~x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings
+++ a/arch/x86/mm/init_64.c
@@ -224,6 +224,24 @@ static void sync_global_pgds(unsigned lo
}
/*
+ * Make kernel mappings visible in all page tables in the system.
+ * This is necessary except when the init task populates kernel mappings
+ * during the boot process. In that case, all processes originating from
+ * the init task copies the kernel mappings, so there is no issue.
+ * Otherwise, missing synchronization could lead to kernel crashes due
+ * to missing page table entries for certain kernel mappings.
+ *
+ * Synchronization is performed at the top level, which is the PGD in
+ * 5-level paging systems. But in 4-level paging systems, however,
+ * pgd_populate() is a no-op, so synchronization is done at the P4D level.
+ * sync_global_pgds() handles this difference between paging levels.
+ */
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
+{
+ sync_global_pgds(start, end);
+}
+
+/*
* NOTE: This function is marked __ref because it calls __init function
* (alloc_bootmem_pages). It's safe to do it ONLY when after_bootmem == 0.
*/
_
Patches currently in -mm which might be from harry.yoo(a)oracle.com are
The quilt patch titled
Subject: mm: introduce and use {pgd,p4d}_populate_kernel()
has been removed from the -mm tree. Its filename was
mm-introduce-and-use-pgdp4d_populate_kernel.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Harry Yoo <harry.yoo(a)oracle.com>
Subject: mm: introduce and use {pgd,p4d}_populate_kernel()
Date: Mon, 18 Aug 2025 11:02:05 +0900
Introduce and use {pgd,p4d}_populate_kernel() in core MM code when
populating PGD and P4D entries for the kernel address space. These
helpers ensure proper synchronization of page tables when updating the
kernel portion of top-level page tables.
Until now, the kernel has relied on each architecture to handle
synchronization of top-level page tables in an ad-hoc manner. For
example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for direct
mapping and vmemmap mapping changes").
However, this approach has proven fragile for following reasons:
1) It is easy to forget to perform the necessary page table
synchronization when introducing new changes.
For instance, commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory
savings for compound devmaps") overlooked the need to synchronize
page tables for the vmemmap area.
2) It is also easy to overlook that the vmemmap and direct mapping areas
must not be accessed before explicit page table synchronization.
For example, commit 8d400913c231 ("x86/vmemmap: handle unpopulated
sub-pmd ranges")) caused crashes by accessing the vmemmap area
before calling sync_global_pgds().
To address this, as suggested by Dave Hansen, introduce _kernel() variants
of the page table population helpers, which invoke architecture-specific
hooks to properly synchronize page tables. These are introduced in a new
header file, include/linux/pgalloc.h, so they can be called from common
code.
They reuse existing infrastructure for vmalloc and ioremap.
Synchronization requirements are determined by ARCH_PAGE_TABLE_SYNC_MASK,
and the actual synchronization is performed by
arch_sync_kernel_mappings().
This change currently targets only x86_64, so only PGD and P4D level
helpers are introduced. Currently, these helpers are no-ops since no
architecture sets PGTBL_{PGD,P4D}_MODIFIED in ARCH_PAGE_TABLE_SYNC_MASK.
In theory, PUD and PMD level helpers can be added later if needed by other
architectures. For now, 32-bit architectures (x86-32 and arm) only handle
PGTBL_PMD_MODIFIED, so p*d_populate_kernel() will never affect them unless
we introduce a PMD level helper.
[harry.yoo(a)oracle.com: fix KASAN build error due to p*d_populate_kernel()]
Link: https://lkml.kernel.org/r/20250822020727.202749-1-harry.yoo@oracle.com
Link: https://lkml.kernel.org/r/20250818020206.4517-3-harry.yoo@oracle.com
Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Harry Yoo <harry.yoo(a)oracle.com>
Suggested-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Acked-by: Kiryl Shutsemau <kas(a)kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Ard Biesheuvel <ardb(a)kernel.org>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: bibo mao <maobibo(a)loongson.cn>
Cc: Borislav Betkov <bp(a)alien8.de>
Cc: Christoph Lameter (Ampere) <cl(a)gentwo.org>
Cc: Dennis Zhou <dennis(a)kernel.org>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: Dmitriy Vyukov <dvyukov(a)google.com>
Cc: Gwan-gyeong Mun <gwan-gyeong.mun(a)intel.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jane Chu <jane.chu(a)oracle.com>
Cc: Joao Martins <joao.m.martins(a)oracle.com>
Cc: Joerg Roedel <joro(a)8bytes.org>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Kevin Brodsky <kevin.brodsky(a)arm.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Thomas Gleinxer <tglx(a)linutronix.de>
Cc: Thomas Huth <thuth(a)redhat.com>
Cc: "Uladzislau Rezki (Sony)" <urezki(a)gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/pgalloc.h | 29 +++++++++++++++++++++++++++++
include/linux/pgtable.h | 13 +++++++------
mm/kasan/init.c | 12 ++++++------
mm/percpu.c | 6 +++---
mm/sparse-vmemmap.c | 6 +++---
5 files changed, 48 insertions(+), 18 deletions(-)
diff --git a/include/linux/pgalloc.h a/include/linux/pgalloc.h
new file mode 100644
--- /dev/null
+++ a/include/linux/pgalloc.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PGALLOC_H
+#define _LINUX_PGALLOC_H
+
+#include <linux/pgtable.h>
+#include <asm/pgalloc.h>
+
+/*
+ * {pgd,p4d}_populate_kernel() are defined as macros to allow
+ * compile-time optimization based on the configured page table levels.
+ * Without this, linking may fail because callers (e.g., KASAN) may rely
+ * on calls to these functions being optimized away when passing symbols
+ * that exist only for certain page table levels.
+ */
+#define pgd_populate_kernel(addr, pgd, p4d) \
+ do { \
+ pgd_populate(&init_mm, pgd, p4d); \
+ if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED) \
+ arch_sync_kernel_mappings(addr, addr); \
+ } while (0)
+
+#define p4d_populate_kernel(addr, p4d, pud) \
+ do { \
+ p4d_populate(&init_mm, p4d, pud); \
+ if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_P4D_MODIFIED) \
+ arch_sync_kernel_mappings(addr, addr); \
+ } while (0)
+
+#endif /* _LINUX_PGALLOC_H */
--- a/include/linux/pgtable.h~mm-introduce-and-use-pgdp4d_populate_kernel
+++ a/include/linux/pgtable.h
@@ -1469,8 +1469,8 @@ static inline void modify_prot_commit_pt
/*
* Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
- * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
- * needs to be called.
+ * and let generic vmalloc, ioremap and page table update code know when
+ * arch_sync_kernel_mappings() needs to be called.
*/
#ifndef ARCH_PAGE_TABLE_SYNC_MASK
#define ARCH_PAGE_TABLE_SYNC_MASK 0
@@ -1954,10 +1954,11 @@ static inline bool arch_has_pfn_modify_c
/*
* Page Table Modification bits for pgtbl_mod_mask.
*
- * These are used by the p?d_alloc_track*() set of functions an in the generic
- * vmalloc/ioremap code to track at which page-table levels entries have been
- * modified. Based on that the code can better decide when vmalloc and ioremap
- * mapping changes need to be synchronized to other page-tables in the system.
+ * These are used by the p?d_alloc_track*() and p*d_populate_kernel()
+ * functions in the generic vmalloc, ioremap and page table update code
+ * to track at which page-table levels entries have been modified.
+ * Based on that the code can better decide when page table changes need
+ * to be synchronized to other page-tables in the system.
*/
#define __PGTBL_PGD_MODIFIED 0
#define __PGTBL_P4D_MODIFIED 1
--- a/mm/kasan/init.c~mm-introduce-and-use-pgdp4d_populate_kernel
+++ a/mm/kasan/init.c
@@ -13,9 +13,9 @@
#include <linux/mm.h>
#include <linux/pfn.h>
#include <linux/slab.h>
+#include <linux/pgalloc.h>
#include <asm/page.h>
-#include <asm/pgalloc.h>
#include "kasan.h"
@@ -191,7 +191,7 @@ static int __ref zero_p4d_populate(pgd_t
pud_t *pud;
pmd_t *pmd;
- p4d_populate(&init_mm, p4d,
+ p4d_populate_kernel(addr, p4d,
lm_alias(kasan_early_shadow_pud));
pud = pud_offset(p4d, addr);
pud_populate(&init_mm, pud,
@@ -212,7 +212,7 @@ static int __ref zero_p4d_populate(pgd_t
} else {
p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
pud_init(p);
- p4d_populate(&init_mm, p4d, p);
+ p4d_populate_kernel(addr, p4d, p);
}
}
zero_pud_populate(p4d, addr, next);
@@ -251,10 +251,10 @@ int __ref kasan_populate_early_shadow(co
* puds,pmds, so pgd_populate(), pud_populate()
* is noops.
*/
- pgd_populate(&init_mm, pgd,
+ pgd_populate_kernel(addr, pgd,
lm_alias(kasan_early_shadow_p4d));
p4d = p4d_offset(pgd, addr);
- p4d_populate(&init_mm, p4d,
+ p4d_populate_kernel(addr, p4d,
lm_alias(kasan_early_shadow_pud));
pud = pud_offset(p4d, addr);
pud_populate(&init_mm, pud,
@@ -273,7 +273,7 @@ int __ref kasan_populate_early_shadow(co
if (!p)
return -ENOMEM;
} else {
- pgd_populate(&init_mm, pgd,
+ pgd_populate_kernel(addr, pgd,
early_alloc(PAGE_SIZE, NUMA_NO_NODE));
}
}
--- a/mm/percpu.c~mm-introduce-and-use-pgdp4d_populate_kernel
+++ a/mm/percpu.c
@@ -3108,7 +3108,7 @@ out_free:
#endif /* BUILD_EMBED_FIRST_CHUNK */
#ifdef BUILD_PAGE_FIRST_CHUNK
-#include <asm/pgalloc.h>
+#include <linux/pgalloc.h>
#ifndef P4D_TABLE_SIZE
#define P4D_TABLE_SIZE PAGE_SIZE
@@ -3134,13 +3134,13 @@ void __init __weak pcpu_populate_pte(uns
if (pgd_none(*pgd)) {
p4d = memblock_alloc_or_panic(P4D_TABLE_SIZE, P4D_TABLE_SIZE);
- pgd_populate(&init_mm, pgd, p4d);
+ pgd_populate_kernel(addr, pgd, p4d);
}
p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d)) {
pud = memblock_alloc_or_panic(PUD_TABLE_SIZE, PUD_TABLE_SIZE);
- p4d_populate(&init_mm, p4d, pud);
+ p4d_populate_kernel(addr, p4d, pud);
}
pud = pud_offset(p4d, addr);
--- a/mm/sparse-vmemmap.c~mm-introduce-and-use-pgdp4d_populate_kernel
+++ a/mm/sparse-vmemmap.c
@@ -27,9 +27,9 @@
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
#include <linux/sched.h>
+#include <linux/pgalloc.h>
#include <asm/dma.h>
-#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include "hugetlb_vmemmap.h"
@@ -229,7 +229,7 @@ p4d_t * __meminit vmemmap_p4d_populate(p
if (!p)
return NULL;
pud_init(p);
- p4d_populate(&init_mm, p4d, p);
+ p4d_populate_kernel(addr, p4d, p);
}
return p4d;
}
@@ -241,7 +241,7 @@ pgd_t * __meminit vmemmap_pgd_populate(u
void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
- pgd_populate(&init_mm, pgd, p);
+ pgd_populate_kernel(addr, pgd, p);
}
return pgd;
}
_
Patches currently in -mm which might be from harry.yoo(a)oracle.com are
The quilt patch titled
Subject: mm: move page table sync declarations to linux/pgtable.h
has been removed from the -mm tree. Its filename was
mm-move-page-table-sync-declarations-to-linux-pgtableh.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Harry Yoo <harry.yoo(a)oracle.com>
Subject: mm: move page table sync declarations to linux/pgtable.h
Date: Mon, 18 Aug 2025 11:02:04 +0900
During our internal testing, we started observing intermittent boot
failures when the machine uses 4-level paging and has a large amount of
persistent memory:
BUG: unable to handle page fault for address: ffffe70000000034
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] SMP NOPTI
RIP: 0010:__init_single_page+0x9/0x6d
Call Trace:
<TASK>
__init_zone_device_page+0x17/0x5d
memmap_init_zone_device+0x154/0x1bb
pagemap_range+0x2e0/0x40f
memremap_pages+0x10b/0x2f0
devm_memremap_pages+0x1e/0x60
dev_dax_probe+0xce/0x2ec [device_dax]
dax_bus_probe+0x6d/0xc9
[... snip ...]
</TASK>
It turns out that the kernel panics while initializing vmemmap (struct
page array) when the vmemmap region spans two PGD entries, because the new
PGD entry is only installed in init_mm.pgd, but not in the page tables of
other tasks.
And looking at __populate_section_memmap():
if (vmemmap_can_optimize(altmap, pgmap))
// does not sync top level page tables
r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
else
// sync top level page tables in x86
r = vmemmap_populate(start, end, nid, altmap);
In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c
synchronizes the top level page table (See commit 9b861528a801 ("x86-64,
mem: Update all PGDs for direct mapping and vmemmap mapping changes")) so
that all tasks in the system can see the new vmemmap area.
However, when vmemmap_can_optimize() returns true, the optimized path
skips synchronization of top-level page tables. This is because
vmemmap_populate_compound_pages() is implemented in core MM code, which
does not handle synchronization of the top-level page tables. Instead,
the core MM has historically relied on each architecture to perform this
synchronization manually.
We're not the first party to encounter a crash caused by not-sync'd top
level page tables: earlier this year, Gwan-gyeong Mun attempted to address
the issue [1] [2] after hitting a kernel panic when x86 code accessed the
vmemmap area before the corresponding top-level entries were synced. At
that time, the issue was believed to be triggered only when struct page
was enlarged for debugging purposes, and the patch did not get further
updates.
It turns out that current approach of relying on each arch to handle the
page table sync manually is fragile because 1) it's easy to forget to sync
the top level page table, and 2) it's also easy to overlook that the
kernel should not access the vmemmap and direct mapping areas before the
sync.
# The solution: Make page table sync more code robust and harder to miss
To address this, Dave Hansen suggested [3] [4] introducing
{pgd,p4d}_populate_kernel() for updating kernel portion of the page tables
and allow each architecture to explicitly perform synchronization when
installing top-level entries. With this approach, we no longer need to
worry about missing the sync step, reducing the risk of future
regressions.
The new interface reuses existing ARCH_PAGE_TABLE_SYNC_MASK,
PGTBL_P*D_MODIFIED and arch_sync_kernel_mappings() facility used by
vmalloc and ioremap to synchronize page tables.
pgd_populate_kernel() looks like this:
static inline void pgd_populate_kernel(unsigned long addr, pgd_t *pgd,
p4d_t *p4d)
{
pgd_populate(&init_mm, pgd, p4d);
if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED)
arch_sync_kernel_mappings(addr, addr);
}
It is worth noting that vmalloc() and apply_to_range() carefully
synchronizes page tables by calling p*d_alloc_track() and
arch_sync_kernel_mappings(), and thus they are not affected by this patch
series.
This series was hugely inspired by Dave Hansen's suggestion and hence
added Suggested-by: Dave Hansen.
Cc stable because lack of this series opens the door to intermittent
boot failures.
This patch (of 3):
Move ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to
linux/pgtable.h so that they can be used outside of vmalloc and ioremap.
Link: https://lkml.kernel.org/r/20250818020206.4517-1-harry.yoo@oracle.com
Link: https://lkml.kernel.org/r/20250818020206.4517-2-harry.yoo@oracle.com
Link: https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@in… [1]
Link: https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@in… [2]
Link: https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel… [3]
Link: https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel… [4]
Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Harry Yoo <harry.yoo(a)oracle.com>
Acked-by: Kiryl Shutsemau <kas(a)kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Reviewed-by: "Uladzislau Rezki (Sony)" <urezki(a)gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Alistair Popple <apopple(a)nvidia.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Ard Biesheuvel <ardb(a)kernel.org>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: bibo mao <maobibo(a)loongson.cn>
Cc: Borislav Betkov <bp(a)alien8.de>
Cc: Christoph Lameter (Ampere) <cl(a)gentwo.org>
Cc: Dennis Zhou <dennis(a)kernel.org>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: Dmitriy Vyukov <dvyukov(a)google.com>
Cc: Gwan-gyeong Mun <gwan-gyeong.mun(a)intel.com>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Jane Chu <jane.chu(a)oracle.com>
Cc: Joao Martins <joao.m.martins(a)oracle.com>
Cc: Joerg Roedel <joro(a)8bytes.org>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Kevin Brodsky <kevin.brodsky(a)arm.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Oscar Salvador <osalvador(a)suse.de>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Thomas Gleinxer <tglx(a)linutronix.de>
Cc: Thomas Huth <thuth(a)redhat.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/pgtable.h | 16 ++++++++++++++++
include/linux/vmalloc.h | 16 ----------------
2 files changed, 16 insertions(+), 16 deletions(-)
--- a/include/linux/pgtable.h~mm-move-page-table-sync-declarations-to-linux-pgtableh
+++ a/include/linux/pgtable.h
@@ -1467,6 +1467,22 @@ static inline void modify_prot_commit_pt
}
#endif
+/*
+ * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
+ * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
+ * needs to be called.
+ */
+#ifndef ARCH_PAGE_TABLE_SYNC_MASK
+#define ARCH_PAGE_TABLE_SYNC_MASK 0
+#endif
+
+/*
+ * There is no default implementation for arch_sync_kernel_mappings(). It is
+ * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK
+ * is 0.
+ */
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
+
#endif /* CONFIG_MMU */
/*
--- a/include/linux/vmalloc.h~mm-move-page-table-sync-declarations-to-linux-pgtableh
+++ a/include/linux/vmalloc.h
@@ -220,22 +220,6 @@ int vmap_pages_range(unsigned long addr,
struct page **pages, unsigned int page_shift);
/*
- * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
- * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
- * needs to be called.
- */
-#ifndef ARCH_PAGE_TABLE_SYNC_MASK
-#define ARCH_PAGE_TABLE_SYNC_MASK 0
-#endif
-
-/*
- * There is no default implementation for arch_sync_kernel_mappings(). It is
- * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK
- * is 0.
- */
-void arch_sync_kernel_mappings(unsigned long start, unsigned long end);
-
-/*
* Lowlevel-APIs (not for driver use!)
*/
_
Patches currently in -mm which might be from harry.yoo(a)oracle.com are
The quilt patch titled
Subject: mm/damon/core: prevent unnecessary overflow in damos_set_effective_quota()
has been removed from the -mm tree. Its filename was
mm-damon-core-prevent-unnecessary-overflow-in-damos_set_effective_quota.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Quanmin Yan <yanquanmin1(a)huawei.com>
Subject: mm/damon/core: prevent unnecessary overflow in damos_set_effective_quota()
Date: Thu, 21 Aug 2025 20:55:55 +0800
On 32-bit systems, the throughput calculation in
damos_set_effective_quota() is prone to unnecessary multiplication
overflow. Using mult_frac() to fix it.
Andrew Paniakin also recently found and privately reported this issue, on
64 bit systems. This can also happen on 64-bit systems, once the charged
size exceeds ~17 TiB. On systems running for long time in production,
this issue can actually happen.
More specifically, when a DAMOS scheme having the time quota run for
longtime, throughput calculation can overflow and set esz too small. As a
result, speed of the scheme get unexpectedly slow.
Link: https://lkml.kernel.org/r/20250821125555.3020951-1-yanquanmin1@huawei.com
Fixes: 1cd243030059 ("mm/damon/schemes: implement time quota")
Signed-off-by: Quanmin Yan <yanquanmin1(a)huawei.com>
Reported-by: Andrew Paniakin <apanyaki(a)amazon.com>
Reviewed-by: SeongJae Park <sj(a)kernel.org>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: ze zuo <zuoze1(a)huawei.com>
Cc: <stable(a)vger.kernel.org> [5.16+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/damon/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/damon/core.c~mm-damon-core-prevent-unnecessary-overflow-in-damos_set_effective_quota
+++ a/mm/damon/core.c
@@ -2073,8 +2073,8 @@ static void damos_set_effective_quota(st
if (quota->ms) {
if (quota->total_charged_ns)
- throughput = quota->total_charged_sz * 1000000 /
- quota->total_charged_ns;
+ throughput = mult_frac(quota->total_charged_sz, 1000000,
+ quota->total_charged_ns);
else
throughput = PAGE_SIZE * 1024;
esz = min(throughput * quota->ms, esz);
_
Patches currently in -mm which might be from yanquanmin1(a)huawei.com are
mm-damon-lru_sort-avoid-divide-by-zero-in-damon_lru_sort_apply_parameters.patch
mm-damon-reclaim-avoid-divide-by-zero-in-damon_reclaim_apply_parameters.patch
The quilt patch titled
Subject: kasan: fix GCC mem-intrinsic prefix with sw tags
has been removed from the -mm tree. Its filename was
kasan-fix-gcc-mem-intrinsic-prefix-with-sw-tags.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Ada Couprie Diaz <ada.coupriediaz(a)arm.com>
Subject: kasan: fix GCC mem-intrinsic prefix with sw tags
Date: Thu, 21 Aug 2025 13:07:35 +0100
GCC doesn't support "hwasan-kernel-mem-intrinsic-prefix", only
"asan-kernel-mem-intrinsic-prefix"[0], while LLVM supports both. This is
already taken into account when checking
"CONFIG_CC_HAS_KASAN_MEMINTRINSIC_PREFIX", but not in the KASAN Makefile
adding those parameters when "CONFIG_KASAN_SW_TAGS" is enabled.
Replace the version check with "CONFIG_CC_HAS_KASAN_MEMINTRINSIC_PREFIX",
which already validates that mem-intrinsic prefix parameter can be used,
and choose the correct name depending on compiler.
GCC 13 and above trigger "CONFIG_CC_HAS_KASAN_MEMINTRINSIC_PREFIX" which
prevents `mem{cpy,move,set}()` being redefined in "mm/kasan/shadow.c"
since commit 36be5cba99f6 ("kasan: treat meminstrinsic as builtins in
uninstrumented files"), as we expect the compiler to prefix those calls
with `__(hw)asan_` instead. But as the option passed to GCC has been
incorrect, the compiler has not been emitting those prefixes, effectively
never calling the instrumented versions of `mem{cpy,move,set}()` with
"CONFIG_KASAN_SW_TAGS" enabled.
If "CONFIG_FORTIFY_SOURCES" is enabled, this issue would be mitigated as
it redefines `mem{cpy,move,set}()` and properly aliases the
`__underlying_mem*()` that will be called to the instrumented versions.
Link: https://lkml.kernel.org/r/20250821120735.156244-1-ada.coupriediaz@arm.com
Link: https://gcc.gnu.org/onlinedocs/gcc-13.4.0/gcc/Optimize-Options.html [0]
Signed-off-by: Ada Couprie Diaz <ada.coupriediaz(a)arm.com>
Fixes: 36be5cba99f6 ("kasan: treat meminstrinsic as builtins in uninstrumented files")
Reviewed-by: Yeoreum Yun <yeoreum.yun(a)arm.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Dmitriy Vyukov <dvyukov(a)google.com>
Cc: Marco Elver <elver(a)google.com>
Cc: Marc Rutland <mark.rutland(a)arm.com>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Nathan Chancellor <nathan(a)kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
scripts/Makefile.kasan | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
--- a/scripts/Makefile.kasan~kasan-fix-gcc-mem-intrinsic-prefix-with-sw-tags
+++ a/scripts/Makefile.kasan
@@ -86,10 +86,14 @@ kasan_params += hwasan-instrument-stack=
hwasan-use-short-granules=0 \
hwasan-inline-all-checks=0
-# Instrument memcpy/memset/memmove calls by using instrumented __hwasan_mem*().
-ifeq ($(call clang-min-version, 150000)$(call gcc-min-version, 130000),y)
- kasan_params += hwasan-kernel-mem-intrinsic-prefix=1
-endif
+# Instrument memcpy/memset/memmove calls by using instrumented __(hw)asan_mem*().
+ifdef CONFIG_CC_HAS_KASAN_MEMINTRINSIC_PREFIX
+ ifdef CONFIG_CC_IS_GCC
+ kasan_params += asan-kernel-mem-intrinsic-prefix=1
+ else
+ kasan_params += hwasan-kernel-mem-intrinsic-prefix=1
+ endif
+endif # CONFIG_CC_HAS_KASAN_MEMINTRINSIC_PREFIX
endif # CONFIG_KASAN_SW_TAGS
_
Patches currently in -mm which might be from ada.coupriediaz(a)arm.com are
The quilt patch titled
Subject: kunit: kasan_test: disable fortify string checker on kasan_strings() test
has been removed from the -mm tree. Its filename was
kunit-kasan_test-disable-fortify-string-checker-on-kasan_strings-test.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Yeoreum Yun <yeoreum.yun(a)arm.com>
Subject: kunit: kasan_test: disable fortify string checker on kasan_strings() test
Date: Fri, 1 Aug 2025 13:02:36 +0100
Similar to commit 09c6304e38e4 ("kasan: test: fix compatibility with
FORTIFY_SOURCE") the kernel is panicing in kasan_string().
This is due to the `src` and `ptr` not being hidden from the optimizer
which would disable the runtime fortify string checker.
Call trace:
__fortify_panic+0x10/0x20 (P)
kasan_strings+0x980/0x9b0
kunit_try_run_case+0x68/0x190
kunit_generic_run_threadfn_adapter+0x34/0x68
kthread+0x1c4/0x228
ret_from_fork+0x10/0x20
Code: d503233f a9bf7bfd 910003fd 9424b243 (d4210000)
---[ end trace 0000000000000000 ]---
note: kunit_try_catch[128] exited with irqs disabled
note: kunit_try_catch[128] exited with preempt_count 1
# kasan_strings: try faulted: last
** replaying previous printk message **
# kasan_strings: try faulted: last line seen mm/kasan/kasan_test_c.c:1600
# kasan_strings: internal error occurred preventing test case from running: -4
Link: https://lkml.kernel.org/r/20250801120236.2962642-1-yeoreum.yun@arm.com
Fixes: 73228c7ecc5e ("KASAN: port KASAN Tests to KUnit")
Signed-off-by: Yeoreum Yun <yeoreum.yun(a)arm.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a(a)gmail.com>
Cc: Dmitriy Vyukov <dvyukov(a)google.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/kasan/kasan_test_c.c | 2 ++
1 file changed, 2 insertions(+)
--- a/mm/kasan/kasan_test_c.c~kunit-kasan_test-disable-fortify-string-checker-on-kasan_strings-test
+++ a/mm/kasan/kasan_test_c.c
@@ -1578,9 +1578,11 @@ static void kasan_strings(struct kunit *
ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO);
KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
+ OPTIMIZER_HIDE_VAR(ptr);
src = kmalloc(KASAN_GRANULE_SIZE, GFP_KERNEL | __GFP_ZERO);
strscpy(src, "f0cacc1a0000000", KASAN_GRANULE_SIZE);
+ OPTIMIZER_HIDE_VAR(src);
/*
* Make sure that strscpy() does not trigger KASAN if it overreads into
_
Patches currently in -mm which might be from yeoreum.yun(a)arm.com are
kasan-hw-tags-introduce-kasanwrite_only-option.patch
kasan-apply-write-only-mode-in-kasan-kunit-testcases.patch
The quilt patch titled
Subject: mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTE
has been removed from the -mm tree. Its filename was
mm-userfaultfd-fix-kmap_local-lifo-ordering-for-config_highpte.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Sasha Levin <sashal(a)kernel.org>
Subject: mm/userfaultfd: fix kmap_local LIFO ordering for CONFIG_HIGHPTE
Date: Thu, 31 Jul 2025 10:44:31 -0400
With CONFIG_HIGHPTE on 32-bit ARM, move_pages_pte() maps PTE pages using
kmap_local_page(), which requires unmapping in Last-In-First-Out order.
The current code maps dst_pte first, then src_pte, but unmaps them in the
same order (dst_pte, src_pte), violating the LIFO requirement. This
causes the warning in kunmap_local_indexed():
WARNING: CPU: 0 PID: 604 at mm/highmem.c:622 kunmap_local_indexed+0x178/0x17c
addr \!= __fix_to_virt(FIX_KMAP_BEGIN + idx)
Fix this by reversing the unmap order to respect LIFO ordering.
This issue follows the same pattern as similar fixes:
- commit eca6828403b8 ("crypto: skcipher - fix mismatch between mapping and unmapping order")
- commit 8cf57c6df818 ("nilfs2: eliminate staggered calls to kunmap in nilfs_rename")
Both of which addressed the same fundamental requirement that kmap_local
operations must follow LIFO ordering.
Link: https://lkml.kernel.org/r/20250731144431.773923-1-sashal@kernel.org
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
Acked-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/userfaultfd.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/mm/userfaultfd.c~mm-userfaultfd-fix-kmap_local-lifo-ordering-for-config_highpte
+++ a/mm/userfaultfd.c
@@ -1453,10 +1453,15 @@ out:
folio_unlock(src_folio);
folio_put(src_folio);
}
- if (dst_pte)
- pte_unmap(dst_pte);
+ /*
+ * Unmap in reverse order (LIFO) to maintain proper kmap_local
+ * index ordering when CONFIG_HIGHPTE is enabled. We mapped dst_pte
+ * first, then src_pte, so we must unmap src_pte first, then dst_pte.
+ */
if (src_pte)
pte_unmap(src_pte);
+ if (dst_pte)
+ pte_unmap(dst_pte);
mmu_notifier_invalidate_range_end(&range);
if (si)
put_swap_device(si);
_
Patches currently in -mm which might be from sashal(a)kernel.org are
The quilt patch titled
Subject: ocfs2: prevent release journal inode after journal shutdown
has been removed from the -mm tree. Its filename was
ocfs2-prevent-release-journal-inode-after-journal-shutdown.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Edward Adam Davis <eadavis(a)qq.com>
Subject: ocfs2: prevent release journal inode after journal shutdown
Date: Tue, 19 Aug 2025 21:41:02 +0800
Before calling ocfs2_delete_osb(), ocfs2_journal_shutdown() has already
been executed in ocfs2_dismount_volume(), so osb->journal must be NULL.
Therefore, the following calltrace will inevitably fail when it reaches
jbd2_journal_release_jbd_inode().
ocfs2_dismount_volume()->
ocfs2_delete_osb()->
ocfs2_free_slot_info()->
__ocfs2_free_slot_info()->
evict()->
ocfs2_evict_inode()->
ocfs2_clear_inode()->
jbd2_journal_release_jbd_inode(osb->journal->j_journal,
Adding osb->journal checks will prevent null-ptr-deref during the above
execution path.
Link: https://lkml.kernel.org/r/tencent_357489BEAEE4AED74CBD67D246DBD2C4C606@qq.c…
Fixes: da5e7c87827e ("ocfs2: cleanup journal init and shutdown")
Signed-off-by: Edward Adam Davis <eadavis(a)qq.com>
Reported-by: syzbot+47d8cb2f2cc1517e515a(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=47d8cb2f2cc1517e515a
Tested-by: syzbot+47d8cb2f2cc1517e515a(a)syzkaller.appspotmail.com
Reviewed-by: Mark Tinguely <mark.tinguely(a)oracle.com>
Reviewed-by: Joseph Qi <joseph.qi(a)linux.alibaba.com>
Cc: Mark Fasheh <mark(a)fasheh.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Changwei Ge <gechangwei(a)live.cn>
Cc: Jun Piao <piaojun(a)huawei.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
fs/ocfs2/inode.c | 3 +++
1 file changed, 3 insertions(+)
--- a/fs/ocfs2/inode.c~ocfs2-prevent-release-journal-inode-after-journal-shutdown
+++ a/fs/ocfs2/inode.c
@@ -1281,6 +1281,9 @@ static void ocfs2_clear_inode(struct ino
* the journal is flushed before journal shutdown. Thus it is safe to
* have inodes get cleaned up after journal shutdown.
*/
+ if (!osb->journal)
+ return;
+
jbd2_journal_release_jbd_inode(osb->journal->j_journal,
&oi->ip_jinode);
}
_
Patches currently in -mm which might be from eadavis(a)qq.com are
The quilt patch titled
Subject: rust: mm: mark VmaNew as transparent
has been removed from the -mm tree. Its filename was
rust-mm-mark-vmanew-as-transparent.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Baptiste Lepers <baptiste.lepers(a)gmail.com>
Subject: rust: mm: mark VmaNew as transparent
Date: Tue, 12 Aug 2025 15:26:56 +0200
Unsafe code in VmaNew's methods assumes that the type has the same layout
as the inner `bindings::vm_area_struct`. This is not guaranteed by the
default struct representation in Rust, but requires specifying the
`transparent` representation.
Link: https://lkml.kernel.org/r/20250812132712.61007-1-baptiste.lepers@gmail.com
Fixes: dcb81aeab406 ("mm: rust: add VmaNew for f_ops->mmap()")
Signed-off-by: Baptiste Lepers <baptiste.lepers(a)gmail.com>
Reviewed-by: Alice Ryhl <aliceryhl(a)google.com>
Cc: Alex Gaynor <alex.gaynor(a)gmail.com>
Cc: Andreas Hindborg <a.hindborg(a)kernel.org>
Cc: Bj��rn Roy Baron <bjorn3_gh(a)protonmail.com>
Cc: Boqun Feng <boqun.feng(a)gmail.com>
Cc: Danilo Krummrich <dakr(a)kernel.org>
Cc: Gary Guo <gary(a)garyguo.net>
Cc: Jann Horn <jannh(a)google.com>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Miguel Ojeda <ojeda(a)kernel.org>
Cc: Trevor Gross <tmgross(a)umich.edu>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
rust/kernel/mm/virt.rs | 1 +
1 file changed, 1 insertion(+)
--- a/rust/kernel/mm/virt.rs~rust-mm-mark-vmanew-as-transparent
+++ a/rust/kernel/mm/virt.rs
@@ -209,6 +209,7 @@ impl VmaMixedMap {
///
/// For the duration of 'a, the referenced vma must be undergoing initialization in an
/// `f_ops->mmap()` hook.
+#[repr(transparent)]
pub struct VmaNew {
vma: VmaRef,
}
_
Patches currently in -mm which might be from baptiste.lepers(a)gmail.com are
Fix a use-after-free window by correcting the buffer release sequence in
the deferred receive path. The code freed the RQ buffer first and only
then cleared the context pointer under the lock. Concurrent paths
(e.g., ABTS and the repost path) also inspect and release the same
pointer under the lock, so the old order could lead to double-free/UAF.
Note that the repost path already uses the correct pattern: detach the
pointer under the lock, then free it after dropping the lock. The deferred
path should do the same.
Fixes: 472e146d1cf3 ("scsi: lpfc: Correct upcalling nvmet_fc transport during io done downcall")
Cc: stable(a)vger.kernel.org
Signed-off-by: John Evans <evans1210144(a)gmail.com>
---
drivers/scsi/lpfc/lpfc_nvmet.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/lpfc/lpfc_nvmet.c b/drivers/scsi/lpfc/lpfc_nvmet.c
index fba2e62027b7..766319543680 100644
--- a/drivers/scsi/lpfc/lpfc_nvmet.c
+++ b/drivers/scsi/lpfc/lpfc_nvmet.c
@@ -1264,10 +1264,15 @@ lpfc_nvmet_defer_rcv(struct nvmet_fc_target_port *tgtport,
atomic_inc(&tgtp->rcv_fcp_cmd_defer);
/* Free the nvmebuf since a new buffer already replaced it */
- nvmebuf->hrq->rqbp->rqb_free_buffer(phba, nvmebuf);
spin_lock_irqsave(&ctxp->ctxlock, iflag);
- ctxp->rqb_buffer = NULL;
- spin_unlock_irqrestore(&ctxp->ctxlock, iflag);
+ nvmebuf = ctxp->rqb_buffer;
+ if (nvmebuf) {
+ ctxp->rqb_buffer = NULL;
+ spin_unlock_irqrestore(&ctxp->ctxlock, iflag);
+ nvmebuf->hrq->rqbp->rqb_free_buffer(phba, nvmebuf);
+ } else {
+ spin_unlock_irqrestore(&ctxp->ctxlock, iflag);
+ }
}
/**
--
2.43.0
rc is overwritten by the evtchn_status hypercall in each iteration, so
the return value will be whatever the last iteration is. This could
incorrectly return success even if the event channel was not found.
Change to an explicit -ENOENT for an un-found virq and return 0 on a
successful match.
Fixes: 62cc5fc7b2e0 ("xen/pv-on-hvm kexec: rebind virqs to existing eventchannel ports")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jason Andryuk <jason.andryuk(a)amd.com>
Reviewed-by: Jan Beulich <jbeulich(a)suse.com>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
---
v3:
Reduce rc's scope
Add R-b: Jan
Mention false success in commit message
Add Fixes
Cc: stable
Add R-b: Juergen
v2:
New
---
drivers/xen/events/events_base.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 41309d38f78c..374231d84e4f 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1318,10 +1318,11 @@ static int find_virq(unsigned int virq, unsigned int cpu, evtchn_port_t *evtchn)
{
struct evtchn_status status;
evtchn_port_t port;
- int rc = -ENOENT;
memset(&status, 0, sizeof(status));
for (port = 0; port < xen_evtchn_max_channels(); port++) {
+ int rc;
+
status.dom = DOMID_SELF;
status.port = port;
rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
@@ -1331,10 +1332,10 @@ static int find_virq(unsigned int virq, unsigned int cpu, evtchn_port_t *evtchn)
continue;
if (status.u.virq == virq && status.vcpu == xen_vcpu_nr(cpu)) {
*evtchn = port;
- break;
+ return 0;
}
}
- return rc;
+ return -ENOENT;
}
/**
--
2.34.1
The patch titled
Subject: mm/damon/reclaim: avoid divide-by-zero in damon_reclaim_apply_parameters()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-damon-reclaim-avoid-divide-by-zero-in-damon_reclaim_apply_parameters.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Quanmin Yan <yanquanmin1(a)huawei.com>
Subject: mm/damon/reclaim: avoid divide-by-zero in damon_reclaim_apply_parameters()
Date: Wed, 27 Aug 2025 19:58:58 +0800
When creating a new scheme of DAMON_RECLAIM, the calculation of
'min_age_region' uses 'aggr_interval' as the divisor, which may lead to
division-by-zero errors. Fix it by directly returning -EINVAL when such a
case occurs.
Link: https://lkml.kernel.org/r/20250827115858.1186261-3-yanquanmin1@huawei.com
Fixes: f5a79d7c0c87 ("mm/damon: introduce struct damos_access_pattern")
Signed-off-by: Quanmin Yan <yanquanmin1(a)huawei.com>
Reviewed-by: SeongJae Park <sj(a)kernel.org>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: ze zuo <zuoze1(a)huawei.com>
Cc: <stable(a)vger.kernel.org> [6.1+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/damon/reclaim.c | 5 +++++
1 file changed, 5 insertions(+)
--- a/mm/damon/reclaim.c~mm-damon-reclaim-avoid-divide-by-zero-in-damon_reclaim_apply_parameters
+++ a/mm/damon/reclaim.c
@@ -194,6 +194,11 @@ static int damon_reclaim_apply_parameter
if (err)
return err;
+ if (!damon_reclaim_mon_attrs.aggr_interval) {
+ err = -EINVAL;
+ goto out;
+ }
+
err = damon_set_attrs(param_ctx, &damon_reclaim_mon_attrs);
if (err)
goto out;
_
Patches currently in -mm which might be from yanquanmin1(a)huawei.com are
mm-damon-core-prevent-unnecessary-overflow-in-damos_set_effective_quota.patch
mm-damon-lru_sort-avoid-divide-by-zero-in-damon_lru_sort_apply_parameters.patch
mm-damon-reclaim-avoid-divide-by-zero-in-damon_reclaim_apply_parameters.patch
The patch titled
Subject: mm/damon/lru_sort: avoid divide-by-zero in damon_lru_sort_apply_parameters()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-damon-lru_sort-avoid-divide-by-zero-in-damon_lru_sort_apply_parameters.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Quanmin Yan <yanquanmin1(a)huawei.com>
Subject: mm/damon/lru_sort: avoid divide-by-zero in damon_lru_sort_apply_parameters()
Date: Wed, 27 Aug 2025 19:58:57 +0800
Patch series "mm/damon: avoid divide-by-zero in DAMON module's parameters
application".
DAMON's RECLAIM and LRU_SORT modules perform no validation on
user-configured parameters during application, which may lead to
division-by-zero errors.
Avoid the divide-by-zero by adding validation checks when DAMON modules
attempt to apply the parameters.
This patch (of 2):
During the calculation of 'hot_thres' and 'cold_thres', either
'sample_interval' or 'aggr_interval' is used as the divisor, which may
lead to division-by-zero errors. Fix it by directly returning -EINVAL
when such a case occurs. Additionally, since 'aggr_interval' is already
required to be set no smaller than 'sample_interval' in damon_set_attrs(),
only the case where 'sample_interval' is zero needs to be checked.
Link: https://lkml.kernel.org/r/20250827115858.1186261-2-yanquanmin1@huawei.com
Fixes: 40e983cca927 ("mm/damon: introduce DAMON-based LRU-lists Sorting")
Signed-off-by: Quanmin Yan <yanquanmin1(a)huawei.com>
Reviewed-by: SeongJae Park <sj(a)kernel.org>
Cc: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Cc: ze zuo <zuoze1(a)huawei.com>
Cc: <stable(a)vger.kernel.org> [6.0+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/damon/lru_sort.c | 5 +++++
1 file changed, 5 insertions(+)
--- a/mm/damon/lru_sort.c~mm-damon-lru_sort-avoid-divide-by-zero-in-damon_lru_sort_apply_parameters
+++ a/mm/damon/lru_sort.c
@@ -198,6 +198,11 @@ static int damon_lru_sort_apply_paramete
if (err)
return err;
+ if (!damon_lru_sort_mon_attrs.sample_interval) {
+ err = -EINVAL;
+ goto out;
+ }
+
err = damon_set_attrs(ctx, &damon_lru_sort_mon_attrs);
if (err)
goto out;
_
Patches currently in -mm which might be from yanquanmin1(a)huawei.com are
mm-damon-core-prevent-unnecessary-overflow-in-damos_set_effective_quota.patch
mm-damon-lru_sort-avoid-divide-by-zero-in-damon_lru_sort_apply_parameters.patch
mm-damon-reclaim-avoid-divide-by-zero-in-damon_reclaim_apply_parameters.patch
Since the commit 96336ec70264 ("PCI: Perform reset_resource() and build
fail list in sync") the failed list is always built and returned to let
the caller decide what to do with the failures. The caller may want to
retry resource fitting and assignment and before that can happen, the
resources should be restored to their original state (a reset
effectively clears the struct resource), which requires returning them
on the failed list so that the original state remains stored in the
associated struct pci_dev_resource.
Resource resizing is different from the ordinary resource fitting and
assignment in that it only considers part of the resources. This means
failures for other resource types are not relevant at all and should be
ignored. As resize doesn't unassign such unrelated resources, those
resource ending up into the failed list implies assignment of that
resource must have failed before resize too. The check in
pci_reassign_bridge_resources() to decide if the whole assignment is
successful, however, is based on list emptiness which will cause false
negatives when the failed list has resources with an unrelated type.
If the failed list is not empty, call pci_required_resource_failed()
and extend it to be able to filter on specific resource types too (if
provided).
Calling pci_required_resource_failed() at this point is slightly
problematic because the resource itself is reset when the failed list
is constructed in __assign_resources_sorted(). As a result,
pci_resource_is_optional() does not have access to the original
resource flags. This could be worked around by restoring and
re-reseting the resource around the call to pci_resource_is_optional(),
however, it shouldn't cause issue as resource resizing is meant for
64-bit prefetchable resources according to Christian König (see the
Link which unfortunately doesn't point directly to Christian's reply
because lore didn't store that email at all).
Fixes: 96336ec70264 ("PCI: Perform reset_resource() and build fail list in sync")
Link: https://lore.kernel.org/all/c5d1b5d8-8669-5572-75a7-0b480f581ac1@linux.inte…
Reported-by: D Scott Phillips <scott(a)os.amperecomputing.com>
Closes: https://lore.kernel.org/all/86plf0lgit.fsf@scott-ph-mail.amperecomputing.co…
Tested-by: D Scott Phillips <scott(a)os.amperecomputing.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Reviewed-by: D Scott Phillips <scott(a)os.amperecomputing.com>
Cc: Christian König <christian.koenig(a)amd.com>
Cc: <stable(a)vger.kernel.org>
---
drivers/pci/setup-bus.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index df5aec46c29d..def29506700e 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -28,6 +28,10 @@
#include <linux/acpi.h>
#include "pci.h"
+#define PCI_RES_TYPE_MASK \
+ (IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH |\
+ IORESOURCE_MEM_64)
+
unsigned int pci_flags;
EXPORT_SYMBOL_GPL(pci_flags);
@@ -384,13 +388,19 @@ static bool pci_need_to_release(unsigned long mask, struct resource *res)
}
/* Return: @true if assignment of a required resource failed. */
-static bool pci_required_resource_failed(struct list_head *fail_head)
+static bool pci_required_resource_failed(struct list_head *fail_head,
+ unsigned long type)
{
struct pci_dev_resource *fail_res;
+ type &= PCI_RES_TYPE_MASK;
+
list_for_each_entry(fail_res, fail_head, list) {
int idx = pci_resource_num(fail_res->dev, fail_res->res);
+ if (type && (fail_res->flags & PCI_RES_TYPE_MASK) != type)
+ continue;
+
if (!pci_resource_is_optional(fail_res->dev, idx))
return true;
}
@@ -504,7 +514,7 @@ static void __assign_resources_sorted(struct list_head *head,
}
/* Without realloc_head and only optional fails, nothing more to do. */
- if (!pci_required_resource_failed(&local_fail_head) &&
+ if (!pci_required_resource_failed(&local_fail_head, 0) &&
list_empty(realloc_head)) {
list_for_each_entry(save_res, &save_head, list) {
struct resource *res = save_res->res;
@@ -1708,10 +1718,6 @@ static void __pci_bridge_assign_resources(const struct pci_dev *bridge,
}
}
-#define PCI_RES_TYPE_MASK \
- (IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH |\
- IORESOURCE_MEM_64)
-
static void pci_bridge_release_resources(struct pci_bus *bus,
unsigned long type)
{
@@ -2450,8 +2456,12 @@ int pci_reassign_bridge_resources(struct pci_dev *bridge, unsigned long type)
free_list(&added);
if (!list_empty(&failed)) {
- ret = -ENOSPC;
- goto cleanup;
+ if (pci_required_resource_failed(&failed, type)) {
+ ret = -ENOSPC;
+ goto cleanup;
+ }
+ /* Only resources with unrelated types failed (again) */
+ free_list(&failed);
}
list_for_each_entry(dev_res, &saved, list) {
--
2.39.5
The following changes since commit 1b237f190eb3d36f52dffe07a40b5eb210280e00:
Linux 6.17-rc3 (2025-08-24 12:04:12 -0400)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to 45d8ef6322b8a828d3b1e2cfb8893e2ff882cb23:
virtio_net: adjust the execution order of function `virtnet_close` during freeze (2025-08-26 03:38:20 -0400)
----------------------------------------------------------------
virtio,vhost: fixes
More small fixes. Most notably this fixes a messed up ioctl #,
and a regression in shmem affecting drm users.
Signed-off-by: Michael S. Tsirkin <mst(a)redhat.com>
----------------------------------------------------------------
Igor Torrente (1):
Revert "virtio: reject shm region if length is zero"
Junnan Wu (1):
virtio_net: adjust the execution order of function `virtnet_close` during freeze
Liming Wu (1):
virtio_pci: Fix misleading comment for queue vector
Namhyung Kim (1):
vhost: Fix ioctl # for VHOST_[GS]ET_FORK_FROM_OWNER
Nikolay Kuratov (1):
vhost/net: Protect ubufs with rcu read lock in vhost_net_ubuf_put()
Ying Gao (1):
virtio_input: Improve freeze handling
drivers/net/virtio_net.c | 7 ++++---
drivers/vhost/net.c | 9 +++++++--
drivers/virtio/virtio_input.c | 4 ++++
drivers/virtio/virtio_pci_legacy_dev.c | 4 ++--
drivers/virtio/virtio_pci_modern_dev.c | 4 ++--
include/linux/virtio_config.h | 2 --
include/uapi/linux/vhost.h | 4 ++--
7 files changed, 21 insertions(+), 13 deletions(-)
ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code;
namely, that the requested attribute name could not be found.
However, a medium error from disk may also return ENODATA. At best,
this medium error may escape to userspace as "attribute not found"
when in fact it's an IO (disk) error.
At worst, we may oops in xfs_attr_leaf_get() when we do:
error = xfs_attr_leaf_hasname(args, &bp);
if (error == -ENOATTR) {
xfs_trans_brelse(args->trans, bp);
return error;
}
because an ENODATA/ENOATTR error from disk leaves us with a null bp,
and the xfs_trans_brelse will then null-deref it.
As discussed on the list, we really need to modify the lower level
IO functions to trap all disk errors and ensure that we don't let
unique errors like this leak up into higher xfs functions - many
like this should be remapped to EIO.
However, this patch directly addresses a reported bug in the xattr
code, and should be safe to backport to stable kernels. A larger-scope
patch to handle more unique errors at lower levels can follow later.
(Note, prior to 07120f1abdff we did not oops, but we did return the
wrong error code to userspace.)
Signed-off-by: Eric Sandeen <sandeen(a)redhat.com>
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Cc: <stable(a)vger.kernel.org> # v5.9+
---
(I get it that sprinkling this around to 3 places might have an ick
factor but I think it's necessary to make a surgical strike on this bug
before we address the general problem.)
Thanks,
-Eric
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index fddb55605e0c..9b54cfb0e13d 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -478,6 +478,12 @@ xfs_attr3_leaf_read(
err = xfs_da_read_buf(tp, dp, bno, 0, bpp, XFS_ATTR_FORK,
&xfs_attr3_leaf_buf_ops);
+ /*
+ * ENODATA from disk implies a disk medium failure; ENODATA for
+ * xattrs means attribute not found, so disambiguate that here.
+ */
+ if (err == -ENODATA)
+ err = -EIO;
if (err || !(*bpp))
return err;
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 4c44ce1c8a64..bff3dc226f81 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -435,6 +435,13 @@ xfs_attr_rmtval_get(
0, &bp, &xfs_attr3_rmt_buf_ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(args->dp, XFS_ATTR_FORK);
+ /*
+ * ENODATA from disk implies a disk medium failure;
+ * ENODATA for xattrs means attribute not found, so
+ * disambiguate that here.
+ */
+ if (error == -ENODATA)
+ error = -EIO;
if (error)
return error;
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 17d9e6154f19..723a0643b838 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2833,6 +2833,12 @@ xfs_da_read_buf(
&bp, ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(dp, whichfork);
+ /*
+ * ENODATA from disk implies a disk medium failure; ENODATA for
+ * xattrs means attribute not found, so disambiguate that here.
+ */
+ if (error == -ENODATA && whichfork == XFS_ATTR_FORK)
+ error = -EIO;
if (error)
goto out_free;
VIRQs come in 3 flavors, per-VPU, per-domain, and global, and the VIRQs
are tracked in per-cpu virq_to_irq arrays.
Per-domain and global VIRQs must be bound on CPU 0, and
bind_virq_to_irq() sets the per_cpu virq_to_irq at registration time
Later, the interrupt can migrate, and info->cpu is updated. When
calling __unbind_from_irq(), the per-cpu virq_to_irq is cleared for a
different cpu. If bind_virq_to_irq() is called again with CPU 0, the
stale irq is returned. There won't be any irq_info for the irq, so
things break.
Make xen_rebind_evtchn_to_cpu() update the per_cpu virq_to_irq mappings
to keep them update to date with the current cpu. This ensures the
correct virq_to_irq is cleared in __unbind_from_irq().
Fixes: e46cdb66c8fc ("xen: event channels")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jason Andryuk <jason.andryuk(a)amd.com>
---
V2:
Different approach changing virq_to_irq
---
drivers/xen/events/events_base.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index a85bc43f4344..4e9db7b92dde 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1772,6 +1772,7 @@ static int xen_rebind_evtchn_to_cpu(struct irq_info *info, unsigned int tcpu)
{
struct evtchn_bind_vcpu bind_vcpu;
evtchn_port_t evtchn = info ? info->evtchn : 0;
+ int old_cpu = info ? info->cpu : tcpu;
if (!VALID_EVTCHN(evtchn))
return -1;
@@ -1795,8 +1796,18 @@ static int xen_rebind_evtchn_to_cpu(struct irq_info *info, unsigned int tcpu)
* it, but don't do the xenlinux-level rebind in that case.
*/
if (HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, &bind_vcpu) >= 0)
+ {
bind_evtchn_to_cpu(info, tcpu, false);
+ if (info->type == IRQT_VIRQ) {
+ int virq = info->u.virq;
+ int irq = per_cpu(virq_to_irq, old_cpu)[virq];
+
+ per_cpu(virq_to_irq, old_cpu)[virq] = -1;
+ per_cpu(virq_to_irq, tcpu)[virq] = irq;
+ }
+ }
+
do_unmask(info, EVT_MASK_REASON_TEMPORARY);
return 0;
--
2.50.1
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 76d2e3890fb169168c73f2e4f8375c7cc24a765e
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025082224-payment-unworthy-d165@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76d2e3890fb169168c73f2e4f8375c7cc24a765e Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Date: Sat, 16 Aug 2025 07:25:20 -0700
Subject: [PATCH] NFS: Fix a race when updating an existing write
After nfs_lock_and_join_requests() tests for whether the request is
still attached to the mapping, nothing prevents a call to
nfs_inode_remove_request() from succeeding until we actually lock the
page group.
The reason is that whoever called nfs_inode_remove_request() doesn't
necessarily have a lock on the page group head.
So in order to avoid races, let's take the page group lock earlier in
nfs_lock_and_join_requests(), and hold it across the removal of the
request in nfs_inode_remove_request().
Reported-by: Jeff Layton <jlayton(a)kernel.org>
Tested-by: Joe Quanaim <jdq(a)meta.com>
Tested-by: Andrew Steffen <aksteffen(a)meta.com>
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Fixes: bd37d6fce184 ("NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 11968dcb7243..6e69ce43a13f 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -253,13 +253,14 @@ nfs_page_group_unlock(struct nfs_page *req)
nfs_page_clear_headlock(req);
}
-/*
- * nfs_page_group_sync_on_bit_locked
+/**
+ * nfs_page_group_sync_on_bit_locked - Test if all requests have @bit set
+ * @req: request in page group
+ * @bit: PG_* bit that is used to sync page group
*
* must be called with page group lock held
*/
-static bool
-nfs_page_group_sync_on_bit_locked(struct nfs_page *req, unsigned int bit)
+bool nfs_page_group_sync_on_bit_locked(struct nfs_page *req, unsigned int bit)
{
struct nfs_page *head = req->wb_head;
struct nfs_page *tmp;
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index fa5c41d0989a..8b7c04737967 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -153,20 +153,10 @@ nfs_page_set_inode_ref(struct nfs_page *req, struct inode *inode)
}
}
-static int
-nfs_cancel_remove_inode(struct nfs_page *req, struct inode *inode)
+static void nfs_cancel_remove_inode(struct nfs_page *req, struct inode *inode)
{
- int ret;
-
- if (!test_bit(PG_REMOVE, &req->wb_flags))
- return 0;
- ret = nfs_page_group_lock(req);
- if (ret)
- return ret;
if (test_and_clear_bit(PG_REMOVE, &req->wb_flags))
nfs_page_set_inode_ref(req, inode);
- nfs_page_group_unlock(req);
- return 0;
}
/**
@@ -585,19 +575,18 @@ retry:
}
}
+ ret = nfs_page_group_lock(head);
+ if (ret < 0)
+ goto out_unlock;
+
/* Ensure that nobody removed the request before we locked it */
if (head != folio->private) {
+ nfs_page_group_unlock(head);
nfs_unlock_and_release_request(head);
goto retry;
}
- ret = nfs_cancel_remove_inode(head, inode);
- if (ret < 0)
- goto out_unlock;
-
- ret = nfs_page_group_lock(head);
- if (ret < 0)
- goto out_unlock;
+ nfs_cancel_remove_inode(head, inode);
/* lock each request in the page group */
for (subreq = head->wb_this_page;
@@ -786,7 +775,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
{
struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
- if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
+ nfs_page_group_lock(req);
+ if (nfs_page_group_sync_on_bit_locked(req, PG_REMOVE)) {
struct folio *folio = nfs_page_to_folio(req->wb_head);
struct address_space *mapping = folio->mapping;
@@ -798,6 +788,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
}
spin_unlock(&mapping->i_private_lock);
}
+ nfs_page_group_unlock(req);
if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
atomic_long_dec(&nfsi->nrequests);
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 169b4ae30ff4..9aed39abc94b 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -160,6 +160,7 @@ extern void nfs_join_page_group(struct nfs_page *head,
extern int nfs_page_group_lock(struct nfs_page *);
extern void nfs_page_group_unlock(struct nfs_page *);
extern bool nfs_page_group_sync_on_bit(struct nfs_page *, unsigned int);
+extern bool nfs_page_group_sync_on_bit_locked(struct nfs_page *, unsigned int);
extern int nfs_page_set_headlock(struct nfs_page *req);
extern void nfs_page_clear_headlock(struct nfs_page *req);
extern bool nfs_async_iocounter_wait(struct rpc_task *, struct nfs_lock_context *);
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 76d2e3890fb169168c73f2e4f8375c7cc24a765e
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025082224-submersed-commend-be4c@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76d2e3890fb169168c73f2e4f8375c7cc24a765e Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust(a)hammerspace.com>
Date: Sat, 16 Aug 2025 07:25:20 -0700
Subject: [PATCH] NFS: Fix a race when updating an existing write
After nfs_lock_and_join_requests() tests for whether the request is
still attached to the mapping, nothing prevents a call to
nfs_inode_remove_request() from succeeding until we actually lock the
page group.
The reason is that whoever called nfs_inode_remove_request() doesn't
necessarily have a lock on the page group head.
So in order to avoid races, let's take the page group lock earlier in
nfs_lock_and_join_requests(), and hold it across the removal of the
request in nfs_inode_remove_request().
Reported-by: Jeff Layton <jlayton(a)kernel.org>
Tested-by: Joe Quanaim <jdq(a)meta.com>
Tested-by: Andrew Steffen <aksteffen(a)meta.com>
Reviewed-by: Jeff Layton <jlayton(a)kernel.org>
Fixes: bd37d6fce184 ("NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust(a)hammerspace.com>
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 11968dcb7243..6e69ce43a13f 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -253,13 +253,14 @@ nfs_page_group_unlock(struct nfs_page *req)
nfs_page_clear_headlock(req);
}
-/*
- * nfs_page_group_sync_on_bit_locked
+/**
+ * nfs_page_group_sync_on_bit_locked - Test if all requests have @bit set
+ * @req: request in page group
+ * @bit: PG_* bit that is used to sync page group
*
* must be called with page group lock held
*/
-static bool
-nfs_page_group_sync_on_bit_locked(struct nfs_page *req, unsigned int bit)
+bool nfs_page_group_sync_on_bit_locked(struct nfs_page *req, unsigned int bit)
{
struct nfs_page *head = req->wb_head;
struct nfs_page *tmp;
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index fa5c41d0989a..8b7c04737967 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -153,20 +153,10 @@ nfs_page_set_inode_ref(struct nfs_page *req, struct inode *inode)
}
}
-static int
-nfs_cancel_remove_inode(struct nfs_page *req, struct inode *inode)
+static void nfs_cancel_remove_inode(struct nfs_page *req, struct inode *inode)
{
- int ret;
-
- if (!test_bit(PG_REMOVE, &req->wb_flags))
- return 0;
- ret = nfs_page_group_lock(req);
- if (ret)
- return ret;
if (test_and_clear_bit(PG_REMOVE, &req->wb_flags))
nfs_page_set_inode_ref(req, inode);
- nfs_page_group_unlock(req);
- return 0;
}
/**
@@ -585,19 +575,18 @@ retry:
}
}
+ ret = nfs_page_group_lock(head);
+ if (ret < 0)
+ goto out_unlock;
+
/* Ensure that nobody removed the request before we locked it */
if (head != folio->private) {
+ nfs_page_group_unlock(head);
nfs_unlock_and_release_request(head);
goto retry;
}
- ret = nfs_cancel_remove_inode(head, inode);
- if (ret < 0)
- goto out_unlock;
-
- ret = nfs_page_group_lock(head);
- if (ret < 0)
- goto out_unlock;
+ nfs_cancel_remove_inode(head, inode);
/* lock each request in the page group */
for (subreq = head->wb_this_page;
@@ -786,7 +775,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
{
struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
- if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
+ nfs_page_group_lock(req);
+ if (nfs_page_group_sync_on_bit_locked(req, PG_REMOVE)) {
struct folio *folio = nfs_page_to_folio(req->wb_head);
struct address_space *mapping = folio->mapping;
@@ -798,6 +788,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
}
spin_unlock(&mapping->i_private_lock);
}
+ nfs_page_group_unlock(req);
if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
atomic_long_dec(&nfsi->nrequests);
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 169b4ae30ff4..9aed39abc94b 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -160,6 +160,7 @@ extern void nfs_join_page_group(struct nfs_page *head,
extern int nfs_page_group_lock(struct nfs_page *);
extern void nfs_page_group_unlock(struct nfs_page *);
extern bool nfs_page_group_sync_on_bit(struct nfs_page *, unsigned int);
+extern bool nfs_page_group_sync_on_bit_locked(struct nfs_page *, unsigned int);
extern int nfs_page_set_headlock(struct nfs_page *req);
extern void nfs_page_clear_headlock(struct nfs_page *req);
extern bool nfs_async_iocounter_wait(struct rpc_task *, struct nfs_lock_context *);
Prior to the topology parsing rewrite and the switchover to the new
parsing logic for AMD processors in commit c749ce393b8f ("x86/cpu: Use
common topology code for AMD"), the "initial_apicid" on these platforms
was:
- First initialized to the LocalApicId from CPUID leaf 0x1 EBX[31:24].
- Then overwritten by the ExtendedLocalApicId in CPUID leaf 0xb
EDX[31:0] on processors that supported topoext.
With the new parsing flow introduced in commit f7fb3b2dd92c ("x86/cpu:
Provide an AMD/HYGON specific topology parser"), parse_8000_001e() now
unconditionally overwrites the "initial_apicid" already parsed during
cpu_parse_topology_ext().
Although this has not been a problem on baremetal platforms, on
virtualized AMD guests that feature more than 255 cores, QEMU 0's out
the CPUID leaf 0x8000001e on CPUs with "CoreID" > 255 to prevent
collision of these IDs in EBX[7:0] which can only represent a maximum of
255 cores [1].
This results in the following FW_BUG being logged when booting a guest
with more than 255 cores:
[Firmware Bug]: CPU 512: APIC ID mismatch. CPUID: 0x0000 APIC: 0x0200
AMD64 Architecture Programmer's Manual Volume 2: System Programming Pub.
24593 Rev. 3.42 [2] Section 16.12 "x2APIC_ID" mentions the Extended
Enumeration leaf 0x8000001e (which was later superseded by the extended
leaf 0x80000026) provides the full x2APIC ID under all circumstances
unlike the one reported by CPUID leaf 0x8000001e EAX which depends on
the mode in which APIC is configured.
Rely on the APIC ID parsed during cpu_parse_topology_ext() from CPUID
leaf 0x80000026 or 0xb and only use the APIC ID from leaf 0x8000001e if
cpu_parse_topology_ext() failed (has_topoext is false).
On platforms that support the 0xb leaf (Zen2 or later, AMD guests on
QEMU) or the extended leaf 0x80000026 (Zen4 or later), the
"initial_apicid" is now set to the value parsed from EDX[31:0].
On older AMD/Hygon platforms that does not support the 0xb leaf but
supports the TOPOEXT extension (Fam 0x15, 0x16, 0x17[Zen1], and Hygon),
the current behavior is retained where "initial_apicid" is set using
the 0x8000001e leaf.
Cc: stable(a)vger.kernel.org
Link: https://github.com/qemu/qemu/commit/35ac5dfbcaa4b [1]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 [2]
Debugged-by: Naveen N Rao (AMD) <naveen(a)kernel.org>
Debugged-by: Sairaj Kodilkar <sarunkod(a)amd.com>
Fixes: c749ce393b8f ("x86/cpu: Use common topology code for AMD")
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Tested-by: Naveen N Rao (AMD) <naveen(a)kernel.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak(a)amd.com>
---
Changelog v3..v4:
o Refreshed the diff based on Thomas' suggestion. The tags have been
retained since there are no functional changes - only comments around
the code has changed.
o Quoted relevant section of APM justifying the changes.
o Moved this patch up ahead.
o Cc'd stable.
---
arch/x86/kernel/cpu/topology_amd.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/cpu/topology_amd.c b/arch/x86/kernel/cpu/topology_amd.c
index 843b1655ab45..827dd0dbb6e9 100644
--- a/arch/x86/kernel/cpu/topology_amd.c
+++ b/arch/x86/kernel/cpu/topology_amd.c
@@ -81,20 +81,25 @@ static bool parse_8000_001e(struct topo_scan *tscan, bool has_topoext)
cpuid_leaf(0x8000001e, &leaf);
- tscan->c->topo.initial_apicid = leaf.ext_apic_id;
-
/*
- * If leaf 0xb is available, then the domain shifts are set
- * already and nothing to do here. Only valid for family >= 0x17.
+ * If leaf 0xb/0x26 is available, then the APIC ID and the domain
+ * shifts are set already.
*/
- if (!has_topoext && tscan->c->x86 >= 0x17) {
+ if (!has_topoext) {
+ tscan->c->topo.initial_apicid = leaf.ext_apic_id;
+
/*
- * Leaf 0x80000008 set the CORE domain shift already.
- * Update the SMT domain, but do not propagate it.
+ * Leaf 0x8000008 sets the CORE domain shift but not the
+ * SMT domain shift. On CPUs with family >= 0x17, there
+ * might be hyperthreads.
*/
- unsigned int nthreads = leaf.core_nthreads + 1;
+ if (tscan->c->x86 >= 0x17) {
+ /* Update the SMT domain, but do not propagate it. */
+ unsigned int nthreads = leaf.core_nthreads + 1;
- topology_update_dom(tscan, TOPO_SMT_DOMAIN, get_count_order(nthreads), nthreads);
+ topology_update_dom(tscan, TOPO_SMT_DOMAIN,
+ get_count_order(nthreads), nthreads);
+ }
}
store_node(tscan, leaf.nnodes_per_socket + 1, leaf.node_id);
--
2.34.1
When the host actively triggers SSR and collects coredump data,
the Bluetooth stack sends a reset command to the controller. However, due
to the inability to clear the QCA_SSR_TRIGGERED and QCA_IBS_DISABLED bits,
the reset command times out.
To address this, this patch clears the QCA_SSR_TRIGGERED and
QCA_IBS_DISABLED flags and adds a 50ms delay after SSR, but only when
HCI_QUIRK_NON_PERSISTENT_SETUP is not set. This ensures the controller
completes the SSR process when BT_EN is always high due to hardware.
For the purpose of HCI_QUIRK_NON_PERSISTENT_SETUP, please refer to
the comment in `include/net/bluetooth/hci.h`.
The HCI_QUIRK_NON_PERSISTENT_SETUP quirk is associated with BT_EN,
and its presence can be used to determine whether BT_EN is defined in DTS.
After SSR, host will not download the firmware, causing
controller to remain in the IBS_WAKE state. Host needs
to synchronize with the controller to maintain proper operation.
Multiple triggers of SSR only first generate coredump file,
due to memcoredump_flag no clear.
add clear coredump flag when ssr completed.
When the SSR duration exceeds 2 seconds, it triggers
host tx_idle_timeout, which sets host TX state to sleep. due to the
hardware pulling up bt_en, the firmware is not downloaded after the SSR.
As a result, the controller does not enter sleep mode. Consequently,
when the host sends a command afterward, it sends 0xFD to the controller,
but the controller does not respond, leading to a command timeout.
So reset tx_idle_timer after SSR to prevent host enter TX IBS_Sleep mode.
---
Changs since v8-v9:
-- Update base patch to latest patch.
-- add Cc stable(a)vger.kernel.org on signed-of.
Changes since v6-7:
- Merge the changes into a single patch.
- Update commit.
Changes since v1-5:
- Add an explanation for HCI_QUIRK_NON_PERSISTENT_SETUP.
- Add commments for msleep(50).
- Update format and commit.
Signed-off-by: Shuai Zhang <quic_shuaz(a)quicinc.com>
Cc: stable(a)vger.kernel.org
---
drivers/bluetooth/hci_qca.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
index 4cff4d9be..97aaf4985 100644
--- a/drivers/bluetooth/hci_qca.c
+++ b/drivers/bluetooth/hci_qca.c
@@ -1653,6 +1653,39 @@ static void qca_hw_error(struct hci_dev *hdev, u8 code)
skb_queue_purge(&qca->rx_memdump_q);
}
+ /*
+ * If the BT chip's bt_en pin is connected to a 3.3V power supply via
+ * hardware and always stays high, driver cannot control the bt_en pin.
+ * As a result, during SSR (SubSystem Restart), QCA_SSR_TRIGGERED and
+ * QCA_IBS_DISABLED flags cannot be cleared, which leads to a reset
+ * command timeout.
+ * Add an msleep delay to ensure controller completes the SSR process.
+ *
+ * Host will not download the firmware after SSR, controller to remain
+ * in the IBS_WAKE state, and the host needs to synchronize with it
+ *
+ * Since the bluetooth chip has been reset, clear the memdump state.
+ */
+ if (!test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks)) {
+ /*
+ * When the SSR (SubSystem Restart) duration exceeds 2 seconds,
+ * it triggers host tx_idle_delay, which sets host TX state
+ * to sleep. Reset tx_idle_timer after SSR to prevent
+ * host enter TX IBS_Sleep mode.
+ */
+ mod_timer(&qca->tx_idle_timer, jiffies +
+ msecs_to_jiffies(qca->tx_idle_delay));
+
+ /* Controller reset completion time is 50ms */
+ msleep(50);
+
+ clear_bit(QCA_SSR_TRIGGERED, &qca->flags);
+ clear_bit(QCA_IBS_DISABLED, &qca->flags);
+
+ qca->tx_ibs_state = HCI_IBS_TX_AWAKE;
+ qca->memdump_state = QCA_MEMDUMP_IDLE;
+ }
+
clear_bit(QCA_HW_ERROR_EVENT, &qca->flags);
}
--
2.34.1
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 24963ae1b0b6596dc36e352c18593800056251d8
Gitweb: https://git.kernel.org/tip/24963ae1b0b6596dc36e352c18593800056251d8
Author: Suchit Karunakaran <suchitkarunakaran(a)gmail.com>
AuthorDate: Sat, 16 Aug 2025 12:21:26 +05:30
Committer: Dave Hansen <dave.hansen(a)linux.intel.com>
CommitterDate: Mon, 25 Aug 2025 08:23:37 -07:00
x86/cpu/intel: Fix the constant_tsc model check for Pentium 4
Pentium 4's which are INTEL_P4_PRESCOTT (model 0x03) and later have
a constant TSC. This was correctly captured until commit fadb6f569b10
("x86/cpu/intel: Limit the non-architectural constant_tsc model checks").
In that commit, an error was introduced while selecting the last P4
model (0x06) as the upper bound. Model 0x06 was transposed to
INTEL_P4_WILLAMETTE, which is just plain wrong. That was presumably a
simple typo, probably just copying and pasting the wrong P4 model.
Fix the constant TSC logic to cover all later P4 models. End at
INTEL_P4_CEDARMILL which accurately corresponds to the last P4 model.
Fixes: fadb6f569b10 ("x86/cpu/intel: Limit the non-architectural constant_tsc model checks")
Signed-off-by: Suchit Karunakaran <suchitkarunakaran(a)gmail.com>
Signed-off-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Reviewed-by: Sohil Mehta <sohil.mehta(a)intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250816065126.5000-1-suchitkarunakaran%40gmail…
---
arch/x86/kernel/cpu/intel.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 076eaa4..98ae4c3 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -262,7 +262,7 @@ static void early_init_intel(struct cpuinfo_x86 *c)
if (c->x86_power & (1 << 8)) {
set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
- } else if ((c->x86_vfm >= INTEL_P4_PRESCOTT && c->x86_vfm <= INTEL_P4_WILLAMETTE) ||
+ } else if ((c->x86_vfm >= INTEL_P4_PRESCOTT && c->x86_vfm <= INTEL_P4_CEDARMILL) ||
(c->x86_vfm >= INTEL_CORE_YONAH && c->x86_vfm <= INTEL_IVYBRIDGE)) {
set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
}
Introduce and use {pgd,p4d}_populate_kernel() in core MM code when
populating PGD and P4D entries for the kernel address space.
These helpers ensure proper synchronization of page tables when
updating the kernel portion of top-level page tables.
Until now, the kernel has relied on each architecture to handle
synchronization of top-level page tables in an ad-hoc manner.
For example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for
direct mapping and vmemmap mapping changes").
However, this approach has proven fragile for following reasons:
1) It is easy to forget to perform the necessary page table
synchronization when introducing new changes.
For instance, commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory
savings for compound devmaps") overlooked the need to synchronize
page tables for the vmemmap area.
2) It is also easy to overlook that the vmemmap and direct mapping areas
must not be accessed before explicit page table synchronization.
For example, commit 8d400913c231 ("x86/vmemmap: handle unpopulated
sub-pmd ranges")) caused crashes by accessing the vmemmap area
before calling sync_global_pgds().
To address this, as suggested by Dave Hansen, introduce _kernel() variants
of the page table population helpers, which invoke architecture-specific
hooks to properly synchronize page tables. These are introduced in a new
header file, include/linux/pgalloc.h, so they can be called from common code.
They reuse existing infrastructure for vmalloc and ioremap.
Synchronization requirements are determined by ARCH_PAGE_TABLE_SYNC_MASK,
and the actual synchronization is performed by arch_sync_kernel_mappings().
This change currently targets only x86_64, so only PGD and P4D level
helpers are introduced. Currently, these helpers are no-ops since no
architecture sets PGTBL_{PGD,P4D}_MODIFIED in ARCH_PAGE_TABLE_SYNC_MASK.
In theory, PUD and PMD level helpers can be added later if needed by
other architectures. For now, 32-bit architectures (x86-32 and arm) only
handle PGTBL_PMD_MODIFIED, so p*d_populate_kernel() will never affect
them unless we introduce a PMD level helper.
Cc: <stable(a)vger.kernel.org>
Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Suggested-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Acked-by: Kiryl Shutsemau <kas(a)kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt(a)kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Signed-off-by: Harry Yoo <harry.yoo(a)oracle.com>
---
include/linux/pgalloc.h | 24 ++++++++++++++++++++++++
include/linux/pgtable.h | 13 +++++++------
mm/kasan/init.c | 12 ++++++------
mm/percpu.c | 6 +++---
mm/sparse-vmemmap.c | 6 +++---
5 files changed, 43 insertions(+), 18 deletions(-)
create mode 100644 include/linux/pgalloc.h
diff --git a/include/linux/pgalloc.h b/include/linux/pgalloc.h
new file mode 100644
index 000000000000..290ab864320f
--- /dev/null
+++ b/include/linux/pgalloc.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PGALLOC_H
+#define _LINUX_PGALLOC_H
+
+#include <linux/pgtable.h>
+#include <asm/pgalloc.h>
+
+static inline void pgd_populate_kernel(unsigned long addr, pgd_t *pgd,
+ p4d_t *p4d)
+{
+ pgd_populate(&init_mm, pgd, p4d);
+ if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED)
+ arch_sync_kernel_mappings(addr, addr);
+}
+
+static inline void p4d_populate_kernel(unsigned long addr, p4d_t *p4d,
+ pud_t *pud)
+{
+ p4d_populate(&init_mm, p4d, pud);
+ if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_P4D_MODIFIED)
+ arch_sync_kernel_mappings(addr, addr);
+}
+
+#endif /* _LINUX_PGALLOC_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index ba699df6ef69..2b80fd456c8b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1469,8 +1469,8 @@ static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned
/*
* Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values
- * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings()
- * needs to be called.
+ * and let generic vmalloc, ioremap and page table update code know when
+ * arch_sync_kernel_mappings() needs to be called.
*/
#ifndef ARCH_PAGE_TABLE_SYNC_MASK
#define ARCH_PAGE_TABLE_SYNC_MASK 0
@@ -1954,10 +1954,11 @@ static inline bool arch_has_pfn_modify_check(void)
/*
* Page Table Modification bits for pgtbl_mod_mask.
*
- * These are used by the p?d_alloc_track*() set of functions an in the generic
- * vmalloc/ioremap code to track at which page-table levels entries have been
- * modified. Based on that the code can better decide when vmalloc and ioremap
- * mapping changes need to be synchronized to other page-tables in the system.
+ * These are used by the p?d_alloc_track*() and p*d_populate_kernel()
+ * functions in the generic vmalloc, ioremap and page table update code
+ * to track at which page-table levels entries have been modified.
+ * Based on that the code can better decide when page table changes need
+ * to be synchronized to other page-tables in the system.
*/
#define __PGTBL_PGD_MODIFIED 0
#define __PGTBL_P4D_MODIFIED 1
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index ced6b29fcf76..8fce3370c84e 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -13,9 +13,9 @@
#include <linux/mm.h>
#include <linux/pfn.h>
#include <linux/slab.h>
+#include <linux/pgalloc.h>
#include <asm/page.h>
-#include <asm/pgalloc.h>
#include "kasan.h"
@@ -191,7 +191,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
pud_t *pud;
pmd_t *pmd;
- p4d_populate(&init_mm, p4d,
+ p4d_populate_kernel(addr, p4d,
lm_alias(kasan_early_shadow_pud));
pud = pud_offset(p4d, addr);
pud_populate(&init_mm, pud,
@@ -212,7 +212,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
} else {
p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
pud_init(p);
- p4d_populate(&init_mm, p4d, p);
+ p4d_populate_kernel(addr, p4d, p);
}
}
zero_pud_populate(p4d, addr, next);
@@ -251,10 +251,10 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
* puds,pmds, so pgd_populate(), pud_populate()
* is noops.
*/
- pgd_populate(&init_mm, pgd,
+ pgd_populate_kernel(addr, pgd,
lm_alias(kasan_early_shadow_p4d));
p4d = p4d_offset(pgd, addr);
- p4d_populate(&init_mm, p4d,
+ p4d_populate_kernel(addr, p4d,
lm_alias(kasan_early_shadow_pud));
pud = pud_offset(p4d, addr);
pud_populate(&init_mm, pud,
@@ -273,7 +273,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
if (!p)
return -ENOMEM;
} else {
- pgd_populate(&init_mm, pgd,
+ pgd_populate_kernel(addr, pgd,
early_alloc(PAGE_SIZE, NUMA_NO_NODE));
}
}
diff --git a/mm/percpu.c b/mm/percpu.c
index d9cbaee92b60..a56f35dcc417 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -3108,7 +3108,7 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
#endif /* BUILD_EMBED_FIRST_CHUNK */
#ifdef BUILD_PAGE_FIRST_CHUNK
-#include <asm/pgalloc.h>
+#include <linux/pgalloc.h>
#ifndef P4D_TABLE_SIZE
#define P4D_TABLE_SIZE PAGE_SIZE
@@ -3134,13 +3134,13 @@ void __init __weak pcpu_populate_pte(unsigned long addr)
if (pgd_none(*pgd)) {
p4d = memblock_alloc_or_panic(P4D_TABLE_SIZE, P4D_TABLE_SIZE);
- pgd_populate(&init_mm, pgd, p4d);
+ pgd_populate_kernel(addr, pgd, p4d);
}
p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d)) {
pud = memblock_alloc_or_panic(PUD_TABLE_SIZE, PUD_TABLE_SIZE);
- p4d_populate(&init_mm, p4d, pud);
+ p4d_populate_kernel(addr, p4d, pud);
}
pud = pud_offset(p4d, addr);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 41aa0493eb03..dbd8daccade2 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -27,9 +27,9 @@
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
#include <linux/sched.h>
+#include <linux/pgalloc.h>
#include <asm/dma.h>
-#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include "hugetlb_vmemmap.h"
@@ -229,7 +229,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
if (!p)
return NULL;
pud_init(p);
- p4d_populate(&init_mm, p4d, p);
+ p4d_populate_kernel(addr, p4d, p);
}
return p4d;
}
@@ -241,7 +241,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
- pgd_populate(&init_mm, pgd, p);
+ pgd_populate_kernel(addr, pgd, p);
}
return pgd;
}
--
2.43.0
The patch titled
Subject: arm64: kexec: Initialize kexec_buf struct in image_load()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
arm64-kexec-initialize-kexec_buf-struct-in-image_load.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Breno Leitao <leitao(a)debian.org>
Subject: arm64: kexec: Initialize kexec_buf struct in image_load()
Date: Tue, 26 Aug 2025 05:08:51 -0700
The kexec_buf structure was previously declared without initialization in
image_load(). This led to a UBSAN warning when the structure was expanded
and uninitialized fields were accessed [1].
Zero-initializing kexec_buf at declaration ensures all fields are cleanly
set, preventing future instances of uninitialized memory being used.
Fixes this UBSAN warning:
[ 32.362488] UBSAN: invalid-load in ./include/linux/kexec.h:210:10
[ 32.362649] load of value 252 is not a valid value for type '_Bool'
Andrew Morton suggested that this function is only called 3x a week[2],
thus, the memset() cost is inexpensive.
Link: https://lore.kernel.org/all/oninomspajhxp4omtdapxnckxydbk2nzmrix7rggmpukpnz… [1]
Link: https://lore.kernel.org/all/20250825180531.94bfb86a26a43127c0a1296f@linux-f… [2]
Link: https://lkml.kernel.org/r/20250826-akpm-v1-1-3c831f0e3799@debian.org
Fixes: bf454ec31add ("kexec_file: allow to place kexec_buf randomly")
Signed-off-by: Breno Leitao <leitao(a)debian.org>
Suggested-by: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Coiby Xu <coxu(a)redhat.com>
Cc: "Daniel P. Berrange" <berrange(a)redhat.com>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Dave Young <dyoung(a)redhat.com>
Cc: Kairui Song <ryncsn(a)gmail.com>
Cc: Liu Pingfan <kernelfans(a)gmail.com>
Cc: Milan Broz <gmazyland(a)gmail.com>
Cc: Ondrej Kozina <okozina(a)redhat.com>
Cc: Vitaly Kuznetsov <vkuznets(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/arm64/kernel/kexec_image.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/arm64/kernel/kexec_image.c~arm64-kexec-initialize-kexec_buf-struct-in-image_load
+++ a/arch/arm64/kernel/kexec_image.c
@@ -41,7 +41,7 @@ static void *image_load(struct kimage *i
struct arm64_image_header *h;
u64 flags, value;
bool be_image, be_kernel;
- struct kexec_buf kbuf;
+ struct kexec_buf kbuf = {};
unsigned long text_offset, kernel_segment_number;
struct kexec_segment *kernel_segment;
int ret;
_
Patches currently in -mm which might be from leitao(a)debian.org are
arm64-kexec-initialize-kexec_buf-struct-in-image_load.patch
The quilt patch titled
Subject: kexec/arm64: initialize the random field of kbuf to zero in the image loader
has been removed from the -mm tree. Its filename was
kexec-arm64-initialize-the-random-field-of-kbuf-to-zero-in-the-image-loader.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: Breno Leitao <leitao(a)debian.org>
Subject: kexec/arm64: initialize the random field of kbuf to zero in the image loader
Date: Thu Aug 21 04:11:21 2025 -0700
Add an explicit initialization for the random member of the kbuf structure
within the image_load function in arch/arm64/kernel/kexec_image.c.
Setting kbuf.random to zero ensures a deterministic and clean starting
state for the buffer used during kernel image loading, avoiding this UBSAN
issue later, when kbuf.random is read.
[ 32.362488] UBSAN: invalid-load in ./include/linux/kexec.h:210:10
[ 32.362649] load of value 252 is not a valid value for type '_Bool'
Link: https://lkml.kernel.org/r/oninomspajhxp4omtdapxnckxydbk2nzmrix7rggmpukpnzad…
Fixes: bf454ec31add ("kexec_file: allow to place kexec_buf randomly")
Signed-off-by: Breno Leitao <leitao(a)debian.org>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Coiby Xu <coxu(a)redhat.com>
Cc: "Daniel P. Berrange" <berrange(a)redhat.com>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Dave Young <dyoung(a)redhat.com>
Cc: Kairui Song <ryncsn(a)gmail.com>
Cc: Liu Pingfan <kernelfans(a)gmail.com>
Cc: Milan Broz <gmazyland(a)gmail.com>
Cc: Ondrej Kozina <okozina(a)redhat.com>
Cc: Vitaly Kuznetsov <vkuznets(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/arm64/kernel/kexec_image.c | 1 +
1 file changed, 1 insertion(+)
--- a/arch/arm64/kernel/kexec_image.c~kexec-arm64-initialize-the-random-field-of-kbuf-to-zero-in-the-image-loader
+++ a/arch/arm64/kernel/kexec_image.c
@@ -76,6 +76,7 @@ static void *image_load(struct kimage *i
kbuf.buf_min = 0;
kbuf.buf_max = ULONG_MAX;
kbuf.top_down = false;
+ kbuf.random = 0;
kbuf.buffer = kernel;
kbuf.bufsz = kernel_len;
_
Patches currently in -mm which might be from leitao(a)debian.org are
[BUG]
With 64K page size (aarch64 with 64K page size config) and 4K btrfs
block size, the following workload can easily lead to a corrupted read:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount -o compress $dev $mnt
xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null
echo "correct result:"
od -Ad -t x1 $mnt/base
xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
-c "reflink $mnt/base 0 32k 32k" \
-c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
echo "incorrect result:"
od -Ad -t x1 $mnt/new
umount $mnt
This shows the following result:
correct result:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0065536
incorrect result:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0065536
Notice the zero in the range [32K, 60K), which is incorrect.
[CAUSE]
With extra trace printk, it shows the following events during od:
(some unrelated info removed like CPU and context)
od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
The "r/i" is indicating the root and inode number. In our case the file
"new" is using ino 258 from fs tree (root 5).
Here notice the @prev_em_start pointer is NULL. This means the
btrfs_do_readpage() is called from btrfs_read_folio(), not from
btrfs_readahead().
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768
These above 32K blocks will be read from the first half of the
compressed data extent.
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768
Note here there is no btrfs_submit_compressed_read() call. Which is
incorrect now.
Although both extent maps at 0 and 32K are pointing to the same compressed
data, their offsets are different thus can not be merged into the same
read.
So this means the compressed data read merge check is doing something
wrong.
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
The function btrfs_submit_compressed_read() is only called at the end of
folio read. The compressed bio will only have an extent map of range [0,
32K), but the original bio passed in is for the whole 64K folio.
This will cause the decompression part to only fill the first 32K,
leaving the rest untouched (aka, filled with zero).
This incorrect compressed read merge leads to the above data corruption.
There were similar problems that happened in the past, commit 808f80b46790
("Btrfs: update fix for read corruption of compressed and shared
extents") is doing pretty much the same fix for readahead.
But that's back to 2015, where btrfs still only supports bs (block size)
== ps (page size) cases.
This means btrfs_do_readpage() only needs to handle a folio which
contains exactly one block.
Only btrfs_readahead() can lead to a read covering multiple blocks.
Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
With v5.15 kernel btrfs introduced bs < ps support. This breaks the above
assumption that a folio can only contain one block.
Now btrfs_read_folio() can also read multiple blocks in one go.
But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
existing bio force submission check will never be triggered.
In theory, this can also happen for btrfs with large folios, but since
large folio is still experimental, we don't need to bother it, thus only
bs < ps support is affected for now.
[FIX]
Instead of passing @prev_em_start to do the proper compressed extent
check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
the existing bio force submission logic will always be triggered.
CC: stable(a)vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
---
Changelog:
v4:
- Minor grammar fixes
- Update the subject to reflect the bug better
This is the merged version inside for-next branch.
However for people who are not following our development branch but
still want to verify the incoming fstests test case, we need this fix
to be publicly avaiable. Newer minor changes will only be done inside
the development branch.
v3:
- Use a single btrfs_bio_ctrl::last_em_start
Since the existing prev_em_start is using U64_MAX as the initial value
to indicate no em hit yet, we can use the same logic, which saves
another 8 bytes from btrfs_bio_ctrl.
- Update the reproducer
Previously I failed to reproduce using a minimal workload.
As regular read from od/md5sum always trigger readahead thus not
hitting the @prev_em_start == NULL path.
Will send out a fstest case for it using the minimal reproducer.
But it will still need a 16K/64K page sized system to reproduce.
Fix the problem by doing an block aligned write into the folio, so
that the folio will be partially dirty and not go through the
readahead path.
- Update the analyze
This includes the trace events of the minimal reproducer, and
mentioning of previous similar fixes and why they do not work for
subpage cases.
- Update the CC tag
Since it's only affecting bs < ps cases (for non-experimental builds),
only need to fix kernels with subpage btrfs supports.
v2:
- Only save extent_map::start/len to save memory for btrfs_bio_ctrl
It's using on-stack memory which is very limited inside the kernel.
- Remove the commit message mentioning of clearing last saved em
Since we're using em::start/len, there is no need to clear them.
Either we hit the same em::start/len, meaning hitting the same extent
map, or we hit a different em, which will have a different start/len.
---
fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++++++++----------
1 file changed, 30 insertions(+), 10 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 426a6791a0b2..ca7174fa0240 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -131,6 +131,24 @@ struct btrfs_bio_ctrl {
*/
unsigned long submit_bitmap;
struct readahead_control *ractl;
+
+ /*
+ * The start offset of the last used extent map by a read operation.
+ *
+ * This is for proper compressed read merge.
+ * U64_MAX means we are starting the read and have made no progress yet.
+ *
+ * The current btrfs_bio_is_contig() only uses disk_bytenr as
+ * the condition to check if the read can be merged with previous
+ * bio, which is not correct. E.g. two file extents pointing to the
+ * same extent but with different offset.
+ *
+ * So here we need to do extra checks to only merge reads that are
+ * covered by the same extent map.
+ * Just extent_map::start will be enough, as they are unique
+ * inside the same inode.
+ */
+ u64 last_em_start;
};
/*
@@ -965,7 +983,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
* return 0 on success, otherwise return error
*/
static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
- struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
+ struct btrfs_bio_ctrl *bio_ctrl)
{
struct inode *inode = folio->mapping->host;
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
@@ -1076,12 +1094,11 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
* non-optimal behavior (submitting 2 bios for the same extent).
*/
if (compress_type != BTRFS_COMPRESS_NONE &&
- prev_em_start && *prev_em_start != (u64)-1 &&
- *prev_em_start != em->start)
+ bio_ctrl->last_em_start != U64_MAX &&
+ bio_ctrl->last_em_start != em->start)
force_bio_submit = true;
- if (prev_em_start)
- *prev_em_start = em->start;
+ bio_ctrl->last_em_start = em->start;
em_gen = em->generation;
btrfs_free_extent_map(em);
@@ -1296,12 +1313,15 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
const u64 start = folio_pos(folio);
const u64 end = start + folio_size(folio) - 1;
struct extent_state *cached_state = NULL;
- struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
+ struct btrfs_bio_ctrl bio_ctrl = {
+ .opf = REQ_OP_READ,
+ .last_em_start = U64_MAX,
+ };
struct extent_map *em_cached = NULL;
int ret;
lock_extents_for_read(inode, start, end, &cached_state);
- ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
+ ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
btrfs_free_extent_map(em_cached);
@@ -2641,7 +2661,8 @@ void btrfs_readahead(struct readahead_control *rac)
{
struct btrfs_bio_ctrl bio_ctrl = {
.opf = REQ_OP_READ | REQ_RAHEAD,
- .ractl = rac
+ .ractl = rac,
+ .last_em_start = U64_MAX,
};
struct folio *folio;
struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
@@ -2649,12 +2670,11 @@ void btrfs_readahead(struct readahead_control *rac)
const u64 end = start + readahead_length(rac) - 1;
struct extent_state *cached_state = NULL;
struct extent_map *em_cached = NULL;
- u64 prev_em_start = (u64)-1;
lock_extents_for_read(inode, start, end, &cached_state);
while ((folio = readahead_folio(rac)) != NULL)
- btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
+ btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
--
2.50.1
In early boot, Linux creates identity virtual->physical address mappings
so that it can enable the MMU before full memory management is ready.
To ensure some available physical memory to back these structures,
vmlinux.lds reserves some space (and defines marker symbols) in the
middle of the kernel image. However, because they are defined outside of
PROGBITS sections, they aren't pre-initialized -- at least as far as ELF
is concerned.
In the typical case, this isn't actually a problem: the boot image is
prepared with objcopy, which zero-fills the gaps, so these structures
are incidentally zero-initialized (an all-zeroes entry is considered
absent, so zero-initialization is appropriate).
However, that is just a happy accident: the `vmlinux` ELF output
authoritatively represents the state of memory at entry. If the ELF
says a region of memory isn't initialized, we must treat it as
uninitialized. Indeed, certain bootloaders (e.g. Broadcom CFE) ingest
the ELF directly -- sidestepping the objcopy-produced image entirely --
and therefore do not initialize the gaps. This results in the early boot
code crashing when it attempts to create identity mappings.
Therefore, add boot-time zero-initialization for the following:
- __pi_init_idmap_pg_dir..__pi_init_idmap_pg_end
- idmap_pg_dir
- reserved_pg_dir
- tramp_pg_dir # Already done, but this patch corrects the size
Note, swapper_pg_dir is already initialized (by copy from idmap_pg_dir)
before use, so this patch does not need to address it.
Cc: stable(a)vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks(a)gmail.com>
---
arch/arm64/kernel/head.S | 12 ++++++++++++
arch/arm64/mm/mmu.c | 3 ++-
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index ca04b338cb0d..0c3be11d0006 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -86,6 +86,18 @@ SYM_CODE_START(primary_entry)
bl record_mmu_state
bl preserve_boot_args
+ adrp x0, reserved_pg_dir
+ add x1, x0, #PAGE_SIZE
+0: str xzr, [x0], 8
+ cmp x0, x1
+ b.lo 0b
+
+ adrp x0, __pi_init_idmap_pg_dir
+ adrp x1, __pi_init_idmap_pg_end
+1: str xzr, [x0], 8
+ cmp x0, x1
+ b.lo 1b
+
adrp x1, early_init_stack
mov sp, x1
mov x29, xzr
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 34e5d78af076..aaf823565a65 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -761,7 +761,7 @@ static int __init map_entry_trampoline(void)
pgprot_val(prot) &= ~PTE_NG;
/* Map only the text into the trampoline page table */
- memset(tramp_pg_dir, 0, PGD_SIZE);
+ memset(tramp_pg_dir, 0, PAGE_SIZE);
__create_pgd_mapping(tramp_pg_dir, pa_start, TRAMP_VALIAS,
entry_tramp_text_size(), prot,
pgd_pgtable_alloc_init_mm, NO_BLOCK_MAPPINGS);
@@ -806,6 +806,7 @@ static void __init create_idmap(void)
u64 end = __pa_symbol(__idmap_text_end);
u64 ptep = __pa_symbol(idmap_ptes);
+ memset(idmap_pg_dir, 0, PAGE_SIZE);
__pi_map_range(&ptep, start, end, start, PAGE_KERNEL_ROX,
IDMAP_ROOT_LEVEL, (pte_t *)idmap_pg_dir, false,
__phys_to_virt(ptep) - ptep);
--
2.49.1
The seg6_hmac_info structure stores information related to SRv6 HMAC
configurations, including the secret key, HMAC ID, and hashing algorithm
used to authenticate and secure SRv6 packets.
When a seg6_hmac_info object is no longer needed, it is destroyed via
seg6_hmac_info_del(), which eventually calls seg6_hinfo_release(). This
function uses kfree_rcu() to safely deallocate memory after an RCU grace
period has elapsed.
The kfree_rcu() releases memory without sanitization (e.g., zeroing out
the memory). Consequently, sensitive information such as the HMAC secret
and its length may remain in freed memory, potentially leading to data
leaks.
To address this risk, we replaced kfree_rcu() with a custom RCU
callback, seg6_hinfo_free_callback_rcu(). Within this callback, we
explicitly sanitize the seg6_hmac_info object before deallocating it
safely using kfree_sensitive(). This approach ensures the memory is
securely freed and prevents potential HMAC info leaks.
Additionally, in the control path, we ensure proper cleanup of
seg6_hmac_info objects when seg6_hmac_info_add() fails: such objects are
freed using kfree_sensitive() instead of kfree().
Fixes: 4f4853dc1c9c ("ipv6: sr: implement API to control SR HMAC structure")
Fixes: bf355b8d2c30 ("ipv6: sr: add core files for SR HMAC support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Andrea Mayer <andrea.mayer(a)uniroma2.it>
---
net/ipv6/seg6.c | 2 +-
net/ipv6/seg6_hmac.c | 10 +++++++++-
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/net/ipv6/seg6.c b/net/ipv6/seg6.c
index 180da19c148c..88782bdab843 100644
--- a/net/ipv6/seg6.c
+++ b/net/ipv6/seg6.c
@@ -215,7 +215,7 @@ static int seg6_genl_sethmac(struct sk_buff *skb, struct genl_info *info)
err = seg6_hmac_info_add(net, hmackeyid, hinfo);
if (err)
- kfree(hinfo);
+ kfree_sensitive(hinfo);
out_unlock:
mutex_unlock(&sdata->lock);
diff --git a/net/ipv6/seg6_hmac.c b/net/ipv6/seg6_hmac.c
index fd58426f222b..19cdf3791ebf 100644
--- a/net/ipv6/seg6_hmac.c
+++ b/net/ipv6/seg6_hmac.c
@@ -57,9 +57,17 @@ static int seg6_hmac_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
return (hinfo->hmackeyid != *(__u32 *)arg->key);
}
+static void seg6_hinfo_free_callback_rcu(struct rcu_head *head)
+{
+ struct seg6_hmac_info *hinfo;
+
+ hinfo = container_of(head, struct seg6_hmac_info, rcu);
+ kfree_sensitive(hinfo);
+}
+
static inline void seg6_hinfo_release(struct seg6_hmac_info *hinfo)
{
- kfree_rcu(hinfo, rcu);
+ call_rcu(&hinfo->rcu, seg6_hinfo_free_callback_rcu);
}
static void seg6_free_hi(void *ptr, void *arg)
--
2.20.1
Commit 8ee53820edfd ("thp: mmu_notifier_test_young") introduced
mmu_notifier_test_young(), but we should pass the address need to test.
In xxx_scan_pmd(), the actual iteration address is "_address" not
"address". We seem to misuse the variable on the very beginning.
Change it to the right one.
Fixes: 8ee53820edfd ("thp: mmu_notifier_test_young")
Signed-off-by: Wei Yang <richard.weiyang(a)gmail.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Zi Yan <ziy(a)nvidia.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Cc: Nico Pache <npache(a)redhat.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: Barry Song <baohua(a)kernel.org>
CC: <stable(a)vger.kernel.org>
---
The original commit 8ee53820edfd is at 2011.
Then the code is moved to khugepaged.c in commit b46e756f5e470 ("thp:
extract khugepaged from mm/huge_memory.c") in 2022.
---
mm/khugepaged.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 24e18a7f8a93..b000942250d1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1418,7 +1418,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
if (cc->is_khugepaged &&
(pte_young(pteval) || folio_test_young(folio) ||
folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
- address)))
+ _address)))
referenced++;
}
if (!writable) {
--
2.34.1
[BUG]
With 64K page size (aarch64 with 64K page size config) and 4K btrfs
block size, the following workload can easily lead to a corrupted read:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount -o compress=zstd $dev $mnt
xfs_io -f -c "pwrite -S 0xff 0 64k" -c sync $mnt/base > /dev/null
echo "correct result:"
od -Ax -x $mnt/base
xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
-c "reflink $mnt/base 0 32k 32k" \
-c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
echo "incorrect result:"
od -Ax -x $mnt/new
umount $mnt
This shows the following result:
correct result:
000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
010000
incorrect result:
000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
008000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f000 ffff ffff ffff ffff ffff ffff ffff ffff
*
010000
Notice the zero in the range [0x8000, 0xf000), which is incorrect.
[CAUSE]
With extra trace printk, it shows the following events during od:
(some unrelated info removed like CPU and context)
od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
The "r/i" is indicating the root and inode number. In our case the file
"new" is using ino 258 from fs tree (root 5).
Here notice the @prev_em_start pointer is NULL. This means the
btrfs_do_readpage() is called from btrfs_read_folio(), not from
btrfs_readahead().
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768
These above 32K blocks will be read from the the first half of the
compressed data extent.
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768
Note here there is no btrfs_submit_compressed_read() call. Which is
incorrect now.
As both extent maps at 0 and 32K are pointing to the same compressed
data, their offset are different thus can not be merged into the same
read.
So this means the compressed data read merge check is doing something
wrong.
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
The function btrfs_submit_compressed_read() is only called at the end of
folio read. The compressed bio will only have a extent map of range [0,
32K), but the original bio passed in are for the whole 64K folio.
This will cause the decompression part to only fill the first 32K,
leaving the rest untouched (aka, filled with zero).
This incorrect compressed read merge leads to the above data corruption.
There are similar problems happened in the past, commit 808f80b46790
("Btrfs: update fix for read corruption of compressed and shared
extents") is doing pretty much the same fix for readahead.
But that's back to 2015, where btrfs still only supports bs (block size)
== ps (page size) cases.
This means btrfs_do_readpage() only needs to handle a folio which
contains exactly one block.
Only btrfs_readahead() can lead to a read covering multiple blocks.
Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
With the v5.15 btrfs introduced bs < ps support. This breaks the above
assumption that a folio can only contain one block.
Now btrfs_read_folio() can also read multiple blocks in one go.
But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
existing bio force submission check will never be triggered.
In theory, this can also happen for btrfs with large folios, but since
large folio is still experimental, we don't need to bother it, thus only
bs < ps support is affected for now.
[FIX]
Instead of passing @prev_em_start to do the proper compressed extent
check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
the existing bio force submission logic will always be triggered.
CC: stable(a)vger.kernel.org #5.15+
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
Changelog:
v3:
- Use a single btrfs_bio_ctrl::last_em_start
Since the existing prev_em_start is using U64_MAX as the initial value
to indicate no em hit yet, we can use the same logic, which saves
another 8 bytes from btrfs_bio_ctrl.
- Update the reproducer
Previously I failed to reproduce using a minimal workload.
As regular read from od/md5sum always trigger readahead thus not
hitting the @prev_em_start == NULL path.
Will send out a fstest case for it using the minimal reproducer.
But it will still need a 16K/64K page sized system to reproduce.
Fix the problem by doing an block aligned write into the folio, so
that the folio will be partially dirty and not go through the
readahead path.
- Update the analyze
This includes the trace events of the minimal reproducer, and
mentioning of previous similar fixes and why they do not work for
subpage cases.
- Update the CC tag
Since it's only affecting bs < ps cases (for non-experimental builds),
only need to fix kernels with subpage btrfs supports.
v2:
- Only save extent_map::start/len to save memory for btrfs_bio_ctrl
It's using on-stack memory which is very limited inside the kernel.
- Remove the commit message mentioning of clearing last saved em
Since we're using em::start/len, there is no need to clear them.
Either we hit the same em::start/len, meaning hitting the same extent
map, or we hit a different em, which will have a different start/len.
---
fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++++++++----------
1 file changed, 30 insertions(+), 10 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0c12fd64a1f3..b219dbaaedc8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -131,6 +131,24 @@ struct btrfs_bio_ctrl {
*/
unsigned long submit_bitmap;
struct readahead_control *ractl;
+
+ /*
+ * The start file offset of the last hit extent map.
+ *
+ * This is for proper compressed read merge.
+ * U64_MAX means no extent map hit yet.
+ *
+ * The current btrfs_bio_is_contig() only uses disk_bytenr as
+ * the condition to check if the read can be merged with previous
+ * bio, which is not correct. E.g. two file extents pointing to the
+ * same extent but with different offset.
+ *
+ * So here we need to do extra check to only merge reads that are
+ * covered by the same extent map.
+ * Just extent_map::start will be enough, as they are unique
+ * inside the same inode.
+ */
+ u64 last_em_start;
};
/*
@@ -965,7 +983,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
* return 0 on success, otherwise return error
*/
static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
- struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
+ struct btrfs_bio_ctrl *bio_ctrl)
{
struct inode *inode = folio->mapping->host;
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
@@ -1076,12 +1094,11 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
* non-optimal behavior (submitting 2 bios for the same extent).
*/
if (compress_type != BTRFS_COMPRESS_NONE &&
- prev_em_start && *prev_em_start != (u64)-1 &&
- *prev_em_start != em->start)
+ bio_ctrl->last_em_start != U64_MAX &&
+ bio_ctrl->last_em_start != em->start)
force_bio_submit = true;
- if (prev_em_start)
- *prev_em_start = em->start;
+ bio_ctrl->last_em_start = em->start;
em_gen = em->generation;
btrfs_free_extent_map(em);
@@ -1296,12 +1313,15 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
const u64 start = folio_pos(folio);
const u64 end = start + folio_size(folio) - 1;
struct extent_state *cached_state = NULL;
- struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
+ struct btrfs_bio_ctrl bio_ctrl = {
+ .opf = REQ_OP_READ,
+ .last_em_start = U64_MAX,
+ };
struct extent_map *em_cached = NULL;
int ret;
lock_extents_for_read(inode, start, end, &cached_state);
- ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
+ ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
btrfs_free_extent_map(em_cached);
@@ -2641,7 +2661,8 @@ void btrfs_readahead(struct readahead_control *rac)
{
struct btrfs_bio_ctrl bio_ctrl = {
.opf = REQ_OP_READ | REQ_RAHEAD,
- .ractl = rac
+ .ractl = rac,
+ .last_em_start = U64_MAX,
};
struct folio *folio;
struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
@@ -2649,12 +2670,11 @@ void btrfs_readahead(struct readahead_control *rac)
const u64 end = start + readahead_length(rac) - 1;
struct extent_state *cached_state = NULL;
struct extent_map *em_cached = NULL;
- u64 prev_em_start = (u64)-1;
lock_extents_for_read(inode, start, end, &cached_state);
while ((folio = readahead_folio(rac)) != NULL)
- btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
+ btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
--
2.50.1
ENODATA (aka ENOATTR) has a very specific meaning in the xfs xattr code;
namely, that the requested attribute name could not be found.
However, a medium error from disk may also return ENODATA. At best,
this medium error may escape to userspace as "attribute not found"
when in fact it's an IO (disk) error.
At worst, we may oops in xfs_attr_leaf_get() when we do:
error = xfs_attr_leaf_hasname(args, &bp);
if (error == -ENOATTR) {
xfs_trans_brelse(args->trans, bp);
return error;
}
because an ENODATA/ENOATTR error from disk leaves us with a null bp,
and the xfs_trans_brelse will then null-deref it.
As discussed on the list, we really need to modify the lower level
IO functions to trap all disk errors and ensure that we don't let
unique errors like this leak up into higher xfs functions - many
like this should be remapped to EIO.
However, this patch directly addresses a reported bug in the xattr
code, and should be safe to backport to stable kernels. A larger-scope
patch to handle more unique errors at lower levels can follow later.
(Note, prior to 07120f1abdff we did not oops, but we did return the
wrong error code to userspace.)
Signed-off-by: Eric Sandeen <sandeen(a)redhat.com>
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Cc: <stable(a)vger.kernel.org> # v5.9+
---
V2: Remove the extraneous trap point as pointed out by djwong, oops.
(I get it that sprinkling this around in 2 places might have an ick
factor but I think it's necessary to make a surgical strike on this bug
before we address the general problem.)
Thanks,
-Eric
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 4c44ce1c8a64..bff3dc226f81 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -435,6 +435,13 @@ xfs_attr_rmtval_get(
0, &bp, &xfs_attr3_rmt_buf_ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(args->dp, XFS_ATTR_FORK);
+ /*
+ * ENODATA from disk implies a disk medium failure;
+ * ENODATA for xattrs means attribute not found, so
+ * disambiguate that here.
+ */
+ if (error == -ENODATA)
+ error = -EIO;
if (error)
return error;
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index 17d9e6154f19..723a0643b838 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2833,6 +2833,12 @@ xfs_da_read_buf(
&bp, ops);
if (xfs_metadata_is_sick(error))
xfs_dirattr_mark_sick(dp, whichfork);
+ /*
+ * ENODATA from disk implies a disk medium failure; ENODATA for
+ * xattrs means attribute not found, so disambiguate that here.
+ */
+ if (error == -ENODATA && whichfork == XFS_ATTR_FORK)
+ error = -EIO;
if (error)
goto out_free;
When restoring a reservation for an anonymous page, we need to check to
freeing a surplus. However, __unmap_hugepage_range() causes data race
because it reads h->surplus_huge_pages without the protection of
hugetlb_lock.
And adjust_reservation is a boolean variable that indicates whether
reservations for anonymous pages in each folio should be restored.
Therefore, it should be initialized to false for each round of the loop.
However, this variable is not initialized to false except when defining
the current adjust_reservation variable.
This means that once adjust_reservation is set to true even once within
the loop, reservations for anonymous pages will be restored
unconditionally in all subsequent rounds, regardless of the folio's state.
To fix this, we need to add the missing hugetlb_lock, unlock the
page_table_lock earlier so that we don't lock the hugetlb_lock inside the
page_table_lock lock, and initialize adjust_reservation to false on each
round within the loop.
Cc: <stable(a)vger.kernel.org>
Reported-by: syzbot+417aeb05fd190f3a6da9(a)syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=417aeb05fd190f3a6da9
Fixes: df7a6d1f6405 ("mm/hugetlb: restore the reservation if needed")
Signed-off-by: Jeongjun Park <aha310510(a)gmail.com>
---
v2: Fix issues with changing the page_table_lock unlock location and initializing adjust_reservation
- Link to v1: https://lore.kernel.org/all/20250822055857.1142454-1-aha310510@gmail.com/
---
mm/hugetlb.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 753f99b4c718..eed59cfb5d21 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5851,7 +5851,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
spinlock_t *ptl;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
- bool adjust_reservation = false;
+ bool adjust_reservation;
unsigned long last_addr_mask;
bool force_flush = false;
@@ -5944,6 +5944,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
sz);
hugetlb_count_sub(pages_per_huge_page(h), mm);
hugetlb_remove_rmap(folio);
+ spin_unlock(ptl);
/*
* Restore the reservation for anonymous page, otherwise the
@@ -5951,14 +5952,16 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
* If there we are freeing a surplus, do not set the restore
* reservation bit.
*/
+ adjust_reservation = false;
+
+ spin_lock_irq(&hugetlb_lock);
if (!h->surplus_huge_pages && __vma_private_lock(vma) &&
folio_test_anon(folio)) {
folio_set_hugetlb_restore_reserve(folio);
/* Reservation to be adjusted after the spin lock */
adjust_reservation = true;
}
-
- spin_unlock(ptl);
+ spin_unlock_irq(&hugetlb_lock);
/*
* Adjust the reservation for the region that will have the
--
From: Lance Yang <lance.yang(a)linux.dev>
The blocker tracking mechanism assumes that lock pointers are at least
4-byte aligned to use their lower bits for type encoding.
However, as reported by Geert Uytterhoeven, some architectures like m68k
only guarantee 2-byte alignment of 32-bit values. This breaks the
assumption and causes two related WARN_ON_ONCE checks to trigger.
To fix this, enforce a minimum of 4-byte alignment on the core lock
structures supported by the blocker tracking mechanism. This ensures the
algorithm's alignment assumption now holds true on all architectures.
This patch adds __aligned(4) to the definitions of "struct mutex",
"struct semaphore", and "struct rw_semaphore", resolving the warnings.
Thanks to Geert for bisecting!
Reported-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
Closes: https://lore.kernel.org/lkml/CAMuHMdW7Ab13DdGs2acMQcix5ObJK0O2dG_Fxzr8_g58R…
Fixes: e711faaafbe5 ("hung_task: replace blocker_mutex with encoded blocker")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Lance Yang <lance.yang(a)linux.dev>
---
include/linux/mutex_types.h | 2 +-
include/linux/rwsem.h | 2 +-
include/linux/semaphore.h | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/mutex_types.h b/include/linux/mutex_types.h
index fdf7f515fde8..de798bfbc4c7 100644
--- a/include/linux/mutex_types.h
+++ b/include/linux/mutex_types.h
@@ -51,7 +51,7 @@ struct mutex {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif
-};
+} __aligned(4); /* For hung_task blocker tracking, which encodes type in LSBs */
#else /* !CONFIG_PREEMPT_RT */
/*
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index f1aaf676a874..f6ecf4a4710d 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -64,7 +64,7 @@ struct rw_semaphore {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif
-};
+} __aligned(4); /* For hung_task blocker tracking, which encodes type in LSBs */
#define RWSEM_UNLOCKED_VALUE 0UL
#define RWSEM_WRITER_LOCKED (1UL << 0)
diff --git a/include/linux/semaphore.h b/include/linux/semaphore.h
index 89706157e622..ac9b9c87bfb7 100644
--- a/include/linux/semaphore.h
+++ b/include/linux/semaphore.h
@@ -20,7 +20,7 @@ struct semaphore {
#ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
unsigned long last_holder;
#endif
-};
+} __aligned(4); /* For hung_task blocker tracking, which encodes type in LSBs */
#ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
#define __LAST_HOLDER_SEMAPHORE_INITIALIZER \
--
2.49.0
From: Lance Yang <lance.yang(a)linux.dev>
The blocker tracking mechanism assumes that lock pointers are at least
4-byte aligned to use their lower bits for type encoding.
However, as reported by Geert Uytterhoeven, some architectures like m68k
only guarantee 2-byte alignment of 32-bit values. This breaks the
assumption and causes two related WARN_ON_ONCE checks to trigger.
To fix this, the runtime checks are adjusted. The first WARN_ON_ONCE in
hung_task_set_blocker() is changed to a simple 'if' that returns silently
for unaligned pointers. The second, now-invalid WARN_ON_ONCE in
hung_task_clear_blocker() is then removed.
Thanks to Geert for bisecting!
Reported-by: Geert Uytterhoeven <geert(a)linux-m68k.org>
Closes: https://lore.kernel.org/lkml/CAMuHMdW7Ab13DdGs2acMQcix5ObJK0O2dG_Fxzr8_g58R…
Fixes: e711faaafbe5 ("hung_task: replace blocker_mutex with encoded blocker")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Lance Yang <lance.yang(a)linux.dev>
---
include/linux/hung_task.h | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/include/linux/hung_task.h b/include/linux/hung_task.h
index 34e615c76ca5..69640f266a69 100644
--- a/include/linux/hung_task.h
+++ b/include/linux/hung_task.h
@@ -20,6 +20,10 @@
* always zero. So we can use these bits to encode the specific blocking
* type.
*
+ * Note that on architectures like m68k with only 2-byte alignment, the
+ * blocker tracking mechanism gracefully does nothing for any lock that is
+ * not 4-byte aligned.
+ *
* Type encoding:
* 00 - Blocked on mutex (BLOCKER_TYPE_MUTEX)
* 01 - Blocked on semaphore (BLOCKER_TYPE_SEM)
@@ -45,7 +49,7 @@ static inline void hung_task_set_blocker(void *lock, unsigned long type)
* If the lock pointer matches the BLOCKER_TYPE_MASK, return
* without writing anything.
*/
- if (WARN_ON_ONCE(lock_ptr & BLOCKER_TYPE_MASK))
+ if (lock_ptr & BLOCKER_TYPE_MASK)
return;
WRITE_ONCE(current->blocker, lock_ptr | type);
@@ -53,8 +57,6 @@ static inline void hung_task_set_blocker(void *lock, unsigned long type)
static inline void hung_task_clear_blocker(void)
{
- WARN_ON_ONCE(!READ_ONCE(current->blocker));
-
WRITE_ONCE(current->blocker, 0UL);
}
--
2.49.0
From: Mikhail Lobanov <m.lobanov(a)rosa.ru>
[ Upstream commit 16ee3ea8faef8ff042acc15867a6c458c573de61 ]
When userspace sets supported rates for a new station via
NL80211_CMD_NEW_STATION, it might send a list that's empty
or contains only invalid values. Currently, we process these
values in sta_link_apply_parameters() without checking the result of
ieee80211_parse_bitrates(), which can lead to an empty rates bitmap.
A similar issue was addressed for NL80211_CMD_SET_BSS in commit
ce04abc3fcc6 ("wifi: mac80211: check basic rates validity").
This patch applies the same approach in sta_link_apply_parameters()
for NL80211_CMD_NEW_STATION, ensuring there is at least one valid
rate by inspecting the result of ieee80211_parse_bitrates().
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: b95eb7f0eee4 ("wifi: cfg80211/mac80211: separate link params from station params")
Signed-off-by: Mikhail Lobanov <m.lobanov(a)rosa.ru>
Link: https://patch.msgid.link/20250317103139.17625-1-m.lobanov@rosa.ru
Signed-off-by: Johannes Berg <johannes.berg(a)intel.com>
[ Summary of conflict resolutions:
- Function ieee80211_parse_bitrates() takes channel width as its
first parameter in mainline kernel version. In v5.15 the function
takes the whole chandef struct as its first parameter.
- The same function takes link station parameters as its last
parameter, and in v5.15 they are in a struct called sta,
instead of a struct called link_sta. ]
Signed-off-by: Hanne-Lotta Mäenpää <hannelotta(a)gmail.com>
---
net/mac80211/cfg.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c
index 2b77cb290788..706ff67f4254 100644
--- a/net/mac80211/cfg.c
+++ b/net/mac80211/cfg.c
@@ -1658,12 +1658,13 @@ static int sta_apply_parameters(struct ieee80211_local *local,
return ret;
}
- if (params->supported_rates && params->supported_rates_len) {
- ieee80211_parse_bitrates(&sdata->vif.bss_conf.chandef,
- sband, params->supported_rates,
- params->supported_rates_len,
- &sta->sta.supp_rates[sband->band]);
- }
+ if (params->supported_rates &&
+ params->supported_rates_len &&
+ !ieee80211_parse_bitrates(&sdata->vif.bss_conf.chandef,
+ sband, params->supported_rates,
+ params->supported_rates_len,
+ &sta->sta.supp_rates[sband->band]))
+ return -EINVAL;
if (params->ht_capa)
ieee80211_ht_cap_ie_to_sta_ht_cap(sdata, sband,
--
2.50.0
The patch titled
Subject: mm/damon/core: set quota->charged_from to jiffies at first charge window
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-damon-core-set-quota-charged_from-to-jiffies-at-first-charge-window.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Sang-Heon Jeon <ekffu200098(a)gmail.com>
Subject: mm/damon/core: set quota->charged_from to jiffies at first charge window
Date: Fri, 22 Aug 2025 11:50:57 +0900
Kernel initializes the "jiffies" timer as 5 minutes below zero, as shown
in include/linux/jiffies.h
/*
* Have the 32 bit jiffies value wrap 5 minutes after boot
* so jiffies wrap bugs show up earlier.
*/
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
And jiffies comparison help functions cast unsigned value to signed to
cover wraparound
#define time_after_eq(a,b) \
(typecheck(unsigned long, a) && \
typecheck(unsigned long, b) && \
((long)((a) - (b)) >= 0))
When quota->charged_from is initialized to 0, time_after_eq() can
incorrectly return FALSE even after reset_interval has elapsed. This
occurs when (jiffies - reset_interval) produces a value with MSB=1, which
is interpreted as negative in signed arithmetic.
This issue primarily affects 32-bit systems because: On 64-bit systems:
MSB=1 values occur after ~292 million years from boot (assuming HZ=1000),
almost impossible.
On 32-bit systems: MSB=1 values occur during the first 5 minutes after
boot, and the second half of every jiffies wraparound cycle, starting from
day 25 (assuming HZ=1000)
When above unexpected FALSE return from time_after_eq() occurs, the
charging window will not reset. The user impact depends on esz value at
that time.
If esz is 0, scheme ignores configured quotas and runs without any limits.
If esz is not 0, scheme stops working once the quota is exhausted. It
remains until the charging window finally resets.
So, change quota->charged_from to jiffies at damos_adjust_quota() when it
is considered as the first charge window. By this change, we can avoid
unexpected FALSE return from time_after_eq()
Link: https://lkml.kernel.org/r/20250822025057.1740854-1-ekffu200098@gmail.com
Fixes: 2b8a248d5873 ("mm/damon/schemes: implement size quota for schemes application speed control") # 5.16
Signed-off-by: Sang-Heon Jeon <ekffu200098(a)gmail.com>
Reviewed-by: SeongJae Park <sj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/damon/core.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/mm/damon/core.c~mm-damon-core-set-quota-charged_from-to-jiffies-at-first-charge-window
+++ a/mm/damon/core.c
@@ -2111,6 +2111,10 @@ static void damos_adjust_quota(struct da
if (!quota->ms && !quota->sz && list_empty("a->goals))
return;
+ /* First charge window */
+ if (!quota->total_charged_sz && !quota->charged_from)
+ quota->charged_from = jiffies;
+
/* New charge window starts */
if (time_after_eq(jiffies, quota->charged_from +
msecs_to_jiffies(quota->reset_interval))) {
_
Patches currently in -mm which might be from ekffu200098(a)gmail.com are
mm-damon-core-set-quota-charged_from-to-jiffies-at-first-charge-window.patch
mm-damon-update-expired-description-of-damos_action.patch
docs-mm-damon-design-fix-typo-s-sz_trtied-sz_tried.patch
selftests-damon-test-no-op-commit-broke-damon-status.patch
selftests-damon-test-no-op-commit-broke-damon-status-fix.patch
mm-damon-tests-core-kunit-add-damos_commit_filter-test.patch