__arm_lpae_unmap() returns size_t but was returning -ENOENT (a negative
error code) when encountering an unmapped PTE. Since size_t is unsigned,
-ENOENT (typically -2) becomes a huge positive value (0xFFFFFFFFFFFFFFFE
on 64-bit systems).
This corrupted value propagates through the call chain:
__arm_lpae_unmap() returns -ENOENT as size_t
-> arm_lpae_unmap_pages() returns it
-> __iommu_unmap() adds it to iova address
-> iommu_pgsize() triggers BUG_ON due to corrupted iova
This can overflow the IOVA address in the __iommu_unmap() loop and
trigger the BUG_ON() in iommu_pgsize() due to the invalid address alignment.
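As a minimal userspace sketch (illustrative only, not kernel code) of
the implicit conversion described above:

  #include <stdio.h>
  #include <stddef.h>

  int main(void)
  {
          size_t ret = -2;        /* -ENOENT stored in an unsigned type */

          /* On LP64 this prints fffffffffffffffe: the caller sees a
           * huge "number of bytes unmapped" instead of an error. */
          printf("%zx\n", ret);
          return 0;
  }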
Fix by returning 0 instead of -ENOENT. The WARN_ON() already signals
the error condition, and returning 0 (meaning "nothing unmapped")
is the correct semantics for a size_t return type. This matches the
behavior of other io-pgtable implementations (io-pgtable-arm-v7s,
io-pgtable-dart) which return 0 on error conditions.
Fixes: 3318f7b5cefb ("iommu/io-pgtable-arm: Add quirk to quiet WARN_ON()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux(a)gmail.com>
---
drivers/iommu/io-pgtable-arm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index e6626004b323..05d63fe92e43 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -637,7 +637,7 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
pte = READ_ONCE(*ptep);
if (!pte) {
WARN_ON(!(data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NO_WARN));
- return -ENOENT;
+ return 0;
}
/* If the size matches this level, we're in the right place */
--
2.40.0
From: Alex Deucher <alexander.deucher(a)amd.com>
commit eb296c09805ee37dd4ea520a7fb3ec157c31090f upstream.
SI hardware doesn't support PASIDs, user mode queues, or
KIQ/MES, so there is no need for the TLB fence. Attaching it results
in a segfault as these callbacks are non-existent for SI.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4744
Fixes: b4a7f4e7ad2b ("drm/amdgpu: attach tlb fence to the PTs update")
Reviewed-by: Timur Kristóf <timur.kristof(a)gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 820b3d376e8a102c6aeab737ec6edebbbb710e04)
Signed-off-by: Hans de Goede <johannes.goede(a)oss.qualcomm.com>
---
Changes in v2 stable submission:
- Correct the Fixes: tag hash which is wrong in the original upstream
commit eb296c09805e ("drm/amdgpu: don't attach the tlb fence for SI")
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 676e24fb8864..cdcafde3c71a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1066,7 +1066,9 @@ amdgpu_vm_tlb_flush(struct amdgpu_vm_update_params *params,
}
/* Prepare a TLB flush fence to be attached to PTs */
- if (!params->unlocked) {
+ if (!params->unlocked &&
+ /* SI doesn't support pasid or KIQ/MES */
+ params->adev->family > AMDGPU_FAMILY_SI) {
amdgpu_vm_tlb_fence_create(params->adev, vm, fence);
/* Makes sure no PD/PT is freed before the flush */
--
2.52.0
From: Ionut Nechita <ionut.nechita(a)windriver.com>
Commit 679b1874eba7 ("block: fix ordering between checking
QUEUE_FLAG_QUIESCED request adding") introduced queue_lock acquisition
in blk_mq_run_hw_queue() to synchronize QUEUE_FLAG_QUIESCED checks.
On RT kernels (CONFIG_PREEMPT_RT), regular spinlocks are converted to
rt_mutex (sleeping locks). When multiple MSI-X IRQ threads process I/O
completions concurrently, they contend on queue_lock in the hot path,
causing all IRQ threads to enter D (uninterruptible sleep) state. This
serializes interrupt processing completely.
Test case (MegaRAID 12GSAS with 8 MSI-X vectors on RT kernel):
- Good (v6.6.52-rt): 640 MB/s sequential read
- Bad (v6.6.64-rt): 153 MB/s sequential read (-76% regression)
- 6-8 out of 8 MSI-X IRQ threads stuck in D-state waiting on queue_lock
The original commit message mentioned memory barriers as an alternative
approach. Use full memory barriers (smp_mb()) instead of queue_lock to
provide the same ordering guarantees without sleeping on RT kernels.
Memory barriers ensure proper synchronization:
- CPU0 either sees QUEUE_FLAG_QUIESCED cleared, OR
- CPU1 sees dispatch list/sw queue bitmap updates
This maintains correctness while avoiding lock contention that causes
RT kernel IRQ threads to sleep in the I/O completion path.
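As an illustrative userspace sketch (assumption: a simplified,
single-threaded model of the pairing, using C11 fences in place of
smp_mb()), the store-buffering pattern guarantees that at least one
side observes the other's update:

  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_int work_queued;
  static atomic_int quiesced = 1;

  /* Models blk_mq_run_hw_queue(): queue work, then re-check the flag. */
  static void run_hw_queue(void)
  {
          atomic_store_explicit(&work_queued, 1, memory_order_relaxed);
          atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
          if (!atomic_load_explicit(&quiesced, memory_order_relaxed))
                  printf("runner: unquiesced, dispatching\n");
  }

  /* Models blk_mq_unquiesce_queue(): clear flag, then check for work. */
  static void unquiesce_queue(void)
  {
          atomic_store_explicit(&quiesced, 0, memory_order_relaxed);
          atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
          if (atomic_load_explicit(&work_queued, memory_order_relaxed))
                  printf("unquiescer: pending work, rerunning queue\n");
  }

  int main(void)
  {
          /* Run sequentially for simplicity; exactly one side acts. */
          run_hw_queue();
          unquiesce_queue();
          return 0;
  }

With both full barriers in place, it cannot happen that the runner
still sees the queue quiesced while the unquiescer sees no pending
work, which is the lost-rerun scenario the original lock prevented.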
Fixes: 679b1874eba7 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding")
Cc: stable(a)vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita(a)windriver.com>
---
block/blk-mq.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5da948b07058..5fb8da4958d0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2292,22 +2292,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
+ /*
+ * First lockless check to avoid unnecessary overhead.
+ * Memory barrier below synchronizes with blk_mq_unquiesce_queue().
+ */
need_run = blk_mq_hw_queue_need_run(hctx);
if (!need_run) {
- unsigned long flags;
-
- /*
- * Synchronize with blk_mq_unquiesce_queue(), because we check
- * if hw queue is quiesced locklessly above, we need the use
- * ->queue_lock to make sure we see the up-to-date status to
- * not miss rerunning the hw queue.
- */
- spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+ /* Synchronize with blk_mq_unquiesce_queue() */
+ smp_mb();
need_run = blk_mq_hw_queue_need_run(hctx);
- spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
-
if (!need_run)
return;
+ /* Ensure dispatch list/sw queue updates visible before execution */
+ smp_mb();
}
if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
--
2.52.0
Add a new feature flag UBLK_F_NO_AUTO_PART_SCAN to allow users to suppress
automatic partition scanning when starting a ublk device.
This is useful for network-backed devices where partition scanning
can cause issues:
- Partition scan triggers synchronous I/O during device startup
- If userspace server crashes during scan, recovery is problematic
- For remotely-managed devices, partition probing may not be needed
Users can manually trigger partition scanning later when appropriate
using standard tools (e.g., partprobe, blockdev --rereadpt).
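For the manual rescan, a minimal sketch (assuming the device node is
/dev/ublkb0; BLKRRPART is the same ioctl that blockdev --rereadpt
issues):

  #include <fcntl.h>
  #include <linux/fs.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  int main(void)
  {
          /* assumed device node for illustration */
          int fd = open("/dev/ublkb0", O_RDONLY);

          if (fd < 0 || ioctl(fd, BLKRRPART) < 0) {
                  perror("partition rescan");
                  return 1;
          }
          close(fd);
          return 0;
  }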
Reported-by: Yoav Cohen <yoav(a)nvidia.com>
Link: https://lore.kernel.org/linux-block/DM4PR12MB63280C5637917C071C2F0D65A9A8A@…
Cc: stable(a)vger.kernel.org
Signed-off-by: Ming Lei <ming.lei(a)redhat.com>
---
- Suggest backporting to stable, which is useful for avoiding problematic
  recovery; the change is also simple enough.
drivers/block/ublk_drv.c | 16 +++++++++++++---
include/uapi/linux/ublk_cmd.h | 8 ++++++++
2 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 78f3e22151b9..ca6ec8ed443f 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -73,7 +73,8 @@
| UBLK_F_AUTO_BUF_REG \
| UBLK_F_QUIESCE \
| UBLK_F_PER_IO_DAEMON \
- | UBLK_F_BUF_REG_OFF_DAEMON)
+ | UBLK_F_BUF_REG_OFF_DAEMON \
+ | UBLK_F_NO_AUTO_PART_SCAN)
#define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
| UBLK_F_USER_RECOVERY_REISSUE \
@@ -2930,8 +2931,13 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub,
ublk_apply_params(ub);
- /* don't probe partitions if any daemon task is un-trusted */
- if (ub->unprivileged_daemons)
+ /*
+ * Don't probe partitions if:
+ * - any daemon task is un-trusted, or
+ * - user explicitly requested to suppress partition scan
+ */
+ if (ub->unprivileged_daemons ||
+ (ub->dev_info.flags & UBLK_F_NO_AUTO_PART_SCAN))
set_bit(GD_SUPPRESS_PART_SCAN, &disk->state);
ublk_get_device(ub);
@@ -2947,6 +2953,10 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub,
if (ret)
goto out_put_cdev;
+ /* allow user to probe partitions from userspace */
+ if (!ub->unprivileged_daemons &&
+ (ub->dev_info.flags & UBLK_F_NO_AUTO_PART_SCAN))
+ clear_bit(GD_SUPPRESS_PART_SCAN, &disk->state);
set_bit(UB_STATE_USED, &ub->state);
out_put_cdev:
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index ec77dabba45b..0827db14a215 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -311,6 +311,14 @@
*/
#define UBLK_F_BUF_REG_OFF_DAEMON (1ULL << 14)
+/*
+ * If this feature is set, the kernel will not automatically scan for partitions
+ * when the device is started. This is useful for network-backed devices where
+ * partition scanning can cause deadlocks if the userspace server crashes during
+ * the scan. Users can manually trigger partition scanning later when appropriate.
+ */
+#define UBLK_F_NO_AUTO_PART_SCAN (1ULL << 15)
+
/* device state */
#define UBLK_S_DEV_DEAD 0
#define UBLK_S_DEV_LIVE 1
--
2.47.0
From: Ilya Maximets <i.maximets(a)ovn.org>
[ Upstream commit 5ace7ef87f059d68b5f50837ef3e8a1a4870c36e ]
The push_nsh() action structure looks like this:
OVS_ACTION_ATTR_PUSH_NSH(OVS_KEY_ATTR_NSH(OVS_NSH_KEY_ATTR_BASE,...))
The outermost OVS_ACTION_ATTR_PUSH_NSH attribute is OK'ed by the
nla_for_each_nested() inside __ovs_nla_copy_actions(). The innermost
OVS_NSH_KEY_ATTR_BASE/MD1/MD2 are OK'ed by the nla_for_each_nested()
inside nsh_key_put_from_nlattr(). But nothing checks if the attribute
in the middle is OK. We don't even check that this attribute is the
OVS_KEY_ATTR_NSH. We just do a double unwrap with a pair of nla_data()
calls - first time directly while calling validate_push_nsh() and the
second time as part of the nla_for_each_nested() macro, which isn't
safe, potentially causing invalid memory access if the size of this
attribute is incorrect. The failure may not be noticed during
validation due to larger netlink buffer, but cause trouble later during
action execution where the buffer is allocated exactly to the size:
BUG: KASAN: slab-out-of-bounds in nsh_hdr_from_nlattr+0x1dd/0x6a0 [openvswitch]
Read of size 184 at addr ffff88816459a634 by task a.out/22624
CPU: 8 UID: 0 PID: 22624 6.18.0-rc7+ #115 PREEMPT(voluntary)
Call Trace:
<TASK>
dump_stack_lvl+0x51/0x70
print_address_description.constprop.0+0x2c/0x390
kasan_report+0xdd/0x110
kasan_check_range+0x35/0x1b0
__asan_memcpy+0x20/0x60
nsh_hdr_from_nlattr+0x1dd/0x6a0 [openvswitch]
push_nsh+0x82/0x120 [openvswitch]
do_execute_actions+0x1405/0x2840 [openvswitch]
ovs_execute_actions+0xd5/0x3b0 [openvswitch]
ovs_packet_cmd_execute+0x949/0xdb0 [openvswitch]
genl_family_rcv_msg_doit+0x1d6/0x2b0
genl_family_rcv_msg+0x336/0x580
genl_rcv_msg+0x9f/0x130
netlink_rcv_skb+0x11f/0x370
genl_rcv+0x24/0x40
netlink_unicast+0x73e/0xaa0
netlink_sendmsg+0x744/0xbf0
__sys_sendto+0x3d6/0x450
do_syscall_64+0x79/0x2c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
Let's add some checks that the attribute is properly sized and that
it's the only attribute inside the action. Technically, there is no
real reason for OVS_KEY_ATTR_NSH to be there, as we know that we're
pushing an NSH header already, it just creates extra nesting, but
that's how uAPI works today. So, keeping as it is.
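As a userspace sketch (assumption: simplified reimplementations of the
nla_*() helpers) of why the nla_total_size(nla_len(nsh_key)) ==
nla_len(a) check enforces exactly one well-formed inner attribute:

  #include <stdint.h>
  #include <stdio.h>

  struct nlattr { uint16_t nla_len; uint16_t nla_type; };

  #define NLA_ALIGN(len)  (((len) + 3) & ~3)
  #define NLA_HDRLEN      ((int)NLA_ALIGN(sizeof(struct nlattr)))

  static int nla_payload(const struct nlattr *nla)
  {
          return nla->nla_len - NLA_HDRLEN;
  }

  static int nla_total_size(int payload)
  {
          return NLA_ALIGN(NLA_HDRLEN + payload);
  }

  /* outer_len is nla_len(a); inner is the first nested attribute */
  static int exactly_one_inner(const struct nlattr *inner, int outer_len)
  {
          /* nla_ok(): header fits and claimed length is in bounds */
          if (outer_len < NLA_HDRLEN || inner->nla_len < NLA_HDRLEN ||
              inner->nla_len > outer_len)
                  return 0;
          /* the inner attribute, padded, must fill the outer payload:
           * a shorter one leaves trailing bytes, a longer one overflows */
          return nla_total_size(nla_payload(inner)) == outer_len;
  }

  int main(void)
  {
          struct nlattr inner = { .nla_len = NLA_HDRLEN + 8 };

          printf("%d %d\n",
                 exactly_one_inner(&inner, 12),   /* 1: exact fit */
                 exactly_one_inner(&inner, 16));  /* 0: trailing room */
          return 0;
  }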
Fixes: b2d0f5d5dc53 ("openvswitch: enable NSH support")
Reported-by: Junvy Yang <zhuque(a)tencent.com>
Signed-off-by: Ilya Maximets <i.maximets(a)ovn.org>
Acked-by: Eelco Chaudron <echaudro(a)redhat.com>
Reviewed-by: Aaron Conole <aconole(a)redhat.com>
Link: https://patch.msgid.link/20251204105334.900379-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
(cherry picked from commit 5ace7ef87f059d68b5f50837ef3e8a1a4870c36e)
Signed-off-by: Adrian Yip <adrian.ytw(a)gmail.com>
---
net/openvswitch/flow_netlink.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 1cb4f97335d8..2d536901309e 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2802,13 +2802,20 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
return err;
}
-static bool validate_push_nsh(const struct nlattr *attr, bool log)
+static bool validate_push_nsh(const struct nlattr *a, bool log)
{
+ struct nlattr *nsh_key = nla_data(a);
struct sw_flow_match match;
struct sw_flow_key key;
+ /* There must be one and only one NSH header. */
+ if (!nla_ok(nsh_key, nla_len(a)) ||
+ nla_total_size(nla_len(nsh_key)) != nla_len(a) ||
+ nla_type(nsh_key) != OVS_KEY_ATTR_NSH)
+ return false;
+
ovs_match_init(&match, &key, true, NULL);
- return !nsh_key_put_from_nlattr(attr, &match, false, true, log);
+ return !nsh_key_put_from_nlattr(nsh_key, &match, false, true, log);
}
/* Return false if there are any non-masked bits set.
@@ -3389,7 +3396,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
return -EINVAL;
}
mac_proto = MAC_PROTO_NONE;
- if (!validate_push_nsh(nla_data(a), log))
+ if (!validate_push_nsh(a, log))
return -EINVAL;
break;
--
2.52.0
From: Ilya Maximets <i.maximets(a)ovn.org>
[ Upstream commit 5ace7ef87f059d68b5f50837ef3e8a1a4870c36e ]
The push_nsh() action structure looks like this:
OVS_ACTION_ATTR_PUSH_NSH(OVS_KEY_ATTR_NSH(OVS_NSH_KEY_ATTR_BASE,...))
The outermost OVS_ACTION_ATTR_PUSH_NSH attribute is OK'ed by the
nla_for_each_nested() inside __ovs_nla_copy_actions(). The innermost
OVS_NSH_KEY_ATTR_BASE/MD1/MD2 are OK'ed by the nla_for_each_nested()
inside nsh_key_put_from_nlattr(). But nothing checks if the attribute
in the middle is OK. We don't even check that this attribute is the
OVS_KEY_ATTR_NSH. We just do a double unwrap with a pair of nla_data()
calls - first time directly while calling validate_push_nsh() and the
second time as part of the nla_for_each_nested() macro, which isn't
safe, potentially causing invalid memory access if the size of this
attribute is incorrect. The failure may not be noticed during
validation due to larger netlink buffer, but cause trouble later during
action execution where the buffer is allocated exactly to the size:
BUG: KASAN: slab-out-of-bounds in nsh_hdr_from_nlattr+0x1dd/0x6a0 [openvswitch]
Read of size 184 at addr ffff88816459a634 by task a.out/22624
CPU: 8 UID: 0 PID: 22624 6.18.0-rc7+ #115 PREEMPT(voluntary)
Call Trace:
<TASK>
dump_stack_lvl+0x51/0x70
print_address_description.constprop.0+0x2c/0x390
kasan_report+0xdd/0x110
kasan_check_range+0x35/0x1b0
__asan_memcpy+0x20/0x60
nsh_hdr_from_nlattr+0x1dd/0x6a0 [openvswitch]
push_nsh+0x82/0x120 [openvswitch]
do_execute_actions+0x1405/0x2840 [openvswitch]
ovs_execute_actions+0xd5/0x3b0 [openvswitch]
ovs_packet_cmd_execute+0x949/0xdb0 [openvswitch]
genl_family_rcv_msg_doit+0x1d6/0x2b0
genl_family_rcv_msg+0x336/0x580
genl_rcv_msg+0x9f/0x130
netlink_rcv_skb+0x11f/0x370
genl_rcv+0x24/0x40
netlink_unicast+0x73e/0xaa0
netlink_sendmsg+0x744/0xbf0
__sys_sendto+0x3d6/0x450
do_syscall_64+0x79/0x2c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
Let's add some checks that the attribute is properly sized and that
it's the only attribute inside the action. Technically, there is no
real reason for OVS_KEY_ATTR_NSH to be there, as we know that we're
pushing an NSH header already, it just creates extra nesting, but
that's how uAPI works today. So, keeping as it is.
Fixes: b2d0f5d5dc53 ("openvswitch: enable NSH support")
Reported-by: Junvy Yang <zhuque(a)tencent.com>
Signed-off-by: Ilya Maximets <i.maximets(a)ovn.org>
Acked-by: Eelco Chaudron <echaudro(a)redhat.com>
Reviewed-by: Aaron Conole <aconole(a)redhat.com>
Link: https://patch.msgid.link/20251204105334.900379-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
(cherry picked from commit 5ace7ef87f059d68b5f50837ef3e8a1a4870c36e)
Signed-off-by: Adrian Yip <adrian.ytw(a)gmail.com>
---
net/openvswitch/flow_netlink.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index e3359e15aa2e..7d5490ea23e1 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2802,13 +2802,20 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
return err;
}
-static bool validate_push_nsh(const struct nlattr *attr, bool log)
+static bool validate_push_nsh(const struct nlattr *a, bool log)
{
+ struct nlattr *nsh_key = nla_data(a);
struct sw_flow_match match;
struct sw_flow_key key;
+ /* There must be one and only one NSH header. */
+ if (!nla_ok(nsh_key, nla_len(a)) ||
+ nla_total_size(nla_len(nsh_key)) != nla_len(a) ||
+ nla_type(nsh_key) != OVS_KEY_ATTR_NSH)
+ return false;
+
ovs_match_init(&match, &key, true, NULL);
- return !nsh_key_put_from_nlattr(attr, &match, false, true, log);
+ return !nsh_key_put_from_nlattr(nsh_key, &match, false, true, log);
}
/* Return false if there are any non-masked bits set.
@@ -3388,7 +3395,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
return -EINVAL;
}
mac_proto = MAC_PROTO_NONE;
- if (!validate_push_nsh(nla_data(a), log))
+ if (!validate_push_nsh(a, log))
return -EINVAL;
break;
--
2.52.0
This reverts commit b3b274bc9d3d7307308aeaf75f70731765ac999a.
On the DragonBoard 820c (which uses APQ8096/MSM8996) this change causes
the CPUs to downclock to roughly half speed under sustained load. The
regression is visible both during boot and when running CPU stress
workloads such as stress-ng: the CPUs initially ramp up to the expected
frequency, then drop to a lower OPP even though the system is clearly
CPU-bound.
Bisecting points to this commit and reverting it restores the expected
behaviour on the DragonBoard 820c - the CPUs track the cpufreq policy
and run at full performance under load.
The exact interaction with the ACD is not yet fully understood and we
would like to keep ACD in use to avoid possible SoC reliability issues.
Until we have a better fix that preserves ACD while avoiding this
performance regression, revert the bisected patch to restore the
previous behaviour.
Fixes: b3b274bc9d3d ("clk: qcom: cpu-8996: simplify the cpu_clk_notifier_cb")
Cc: stable(a)vger.kernel.org # v6.3+
Link: https://lore.kernel.org/linux-arm-msm/20230113120544.59320-8-dmitry.baryshk…
Cc: Dmitry Baryshkov <dmitry.baryshkov(a)oss.qualcomm.com>
Signed-off-by: Christopher Obbard <christopher.obbard(a)linaro.org>
---
Hi all,
This series contains a single revert for a regression affecting the
APQ8096/MSM8996 (DragonBoard 820c).
The commit being reverted, b3b274bc9d3d ("clk: qcom: cpu-8996: simplify the cpu_clk_notifier_cb"),
introduces a significant performance issue where the CPUs downclock to
~50% of their expected frequency under sustained load. The problem is
reproducible both at boot and when running CPU-bound workloads such as
stress-ng.
Bisecting the issue pointed directly to this commit and reverting it
restores correct cpufreq behaviour.
The root cause appears to be related to the interaction between the
simplified notifier callback and ACD (Adaptive Clock Distribution).
Since we would prefer to keep ACD enabled for SoC reliability reasons,
a revert is the safest option until a proper fix is identified.
Full details are included in the commit message.
Feedback & suggestions welcome.
Cheers!
Christopher Obbard
---
drivers/clk/qcom/clk-cpu-8996.c | 30 +++++++++++-------------------
1 file changed, 11 insertions(+), 19 deletions(-)
diff --git a/drivers/clk/qcom/clk-cpu-8996.c b/drivers/clk/qcom/clk-cpu-8996.c
index 21d13c0841ed..028476931747 100644
--- a/drivers/clk/qcom/clk-cpu-8996.c
+++ b/drivers/clk/qcom/clk-cpu-8996.c
@@ -547,35 +547,27 @@ static int cpu_clk_notifier_cb(struct notifier_block *nb, unsigned long event,
{
struct clk_cpu_8996_pmux *cpuclk = to_clk_cpu_8996_pmux_nb(nb);
struct clk_notifier_data *cnd = data;
+ int ret;
switch (event) {
case PRE_RATE_CHANGE:
+ ret = clk_cpu_8996_pmux_set_parent(&cpuclk->clkr.hw, ALT_INDEX);
qcom_cpu_clk_msm8996_acd_init(cpuclk->clkr.regmap);
-
- /*
- * Avoid overvolting. clk_core_set_rate_nolock() walks from top
- * to bottom, so it will change the rate of the PLL before
- * chaging the parent of PMUX. This can result in pmux getting
- * clocked twice the expected rate.
- *
- * Manually switch to PLL/2 here.
- */
- if (cnd->new_rate < DIV_2_THRESHOLD &&
- cnd->old_rate > DIV_2_THRESHOLD)
- clk_cpu_8996_pmux_set_parent(&cpuclk->clkr.hw, SMUX_INDEX);
-
break;
- case ABORT_RATE_CHANGE:
- /* Revert manual change */
- if (cnd->new_rate < DIV_2_THRESHOLD &&
- cnd->old_rate > DIV_2_THRESHOLD)
- clk_cpu_8996_pmux_set_parent(&cpuclk->clkr.hw, ACD_INDEX);
+ case POST_RATE_CHANGE:
+ if (cnd->new_rate < DIV_2_THRESHOLD)
+ ret = clk_cpu_8996_pmux_set_parent(&cpuclk->clkr.hw,
+ SMUX_INDEX);
+ else
+ ret = clk_cpu_8996_pmux_set_parent(&cpuclk->clkr.hw,
+ ACD_INDEX);
break;
default:
+ ret = 0;
break;
}
- return NOTIFY_OK;
+ return notifier_from_errno(ret);
};
static int qcom_cpu_clk_msm8996_driver_probe(struct platform_device *pdev)
---
base-commit: c17e270dfb342a782d69c4a7c4c32980455afd9c
change-id: 20251202-wip-obbardc-qcom-msm8096-clk-cpu-fix-downclock-b7561da4cb95
Best regards,
--
Christopher Obbard <christopher.obbard(a)linaro.org>
When dma_iova_link() fails partway through mapping a request's
bvec list, the function breaks out of the loop without cleaning up the
already-mapped portions. Similarly, if dma_iova_sync() fails after all
segments are linked, no cleanup is performed.
This leaves the IOVA state partially mapped. The completion path
(via dma_iova_destroy() or nvme_unmap_data()) then attempts to unmap
the full expected size, but only a partial size was actually mapped.
Fix by adding an out_unlink error path that calls dma_iova_destroy()
to clean up any partial mapping before returning failure. The
dma_iova_destroy() function handles both partial unlink and IOVA space
freeing, and correctly handles the case where mapped_len is zero
(first dma_iova_link() failed) by just freeing the IOVA allocation.
This ensures that when an error occurs:
1. All partially-mapped IOVA ranges are properly unmapped
2. The IOVA address space is freed
3. The completion path won't attempt to unmap non-existent mappings
Fixes: 858299dc6160 ("block: add scatterlist-less DMA mapping helpers")
Cc: stable(a)vger.kernel.org
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux(a)gmail.com>
---
Hi Leon,
Your last email is not accessible to me.
Updated the patch description to explain dma_iova_destroy().
Please let me know for any issues you want me to fix before I send.
-ck
---
block/blk-mq-dma.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index fb018fffffdc..feead1934301 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -126,17 +126,20 @@ static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
error = dma_iova_link(dma_dev, state, vec->paddr, mapped,
vec->len, dir, attrs);
if (error)
- break;
+ goto out_unlink;
mapped += vec->len;
} while (blk_map_iter_next(req, &iter->iter, vec));
error = dma_iova_sync(dma_dev, state, 0, mapped);
- if (error) {
- iter->status = errno_to_blk_status(error);
- return false;
- }
+ if (error)
+ goto out_unlink;
return true;
+
+out_unlink:
+ dma_iova_destroy(dma_dev, state, mapped, dir, attrs);
+ iter->status = errno_to_blk_status(error);
+ return false;
}
static inline void blk_rq_map_iter_init(struct request *rq,
--
2.40.0
For a while, I've been seeing a strange issue where some (usually not all)
of the display DMA channels will suddenly hang, particularly when there is
a visible cursor on the screen that is being frequently updated, and
especially when said cursor happens to go between two screens. While this
brings back lovely memories of fixing Intel Skylake bugs, I would quite
like to fix it :).
It turns out the problem that's happening here is that we're managing to
reach nv50_head_flush_set() in our atomic commit path without actually
holding nv50_disp->mutex. This means that cursor updates happening in
parallel (along with any other atomic updates that need to use the core
channel) will race with eachother, which eventually causes us to corrupt
the pushbuffer - leading to a plethora of various GSP errors, usually:
nouveau 0000:c1:00.0: gsp: Xid:56 CMDre 00000000 00000218 00102680 00000004 00800003
nouveau 0000:c1:00.0: gsp: Xid:56 CMDre 00000000 0000021c 00040509 00000004 00000001
nouveau 0000:c1:00.0: gsp: Xid:56 CMDre 00000000 00000000 00000000 00000001 00000001
The reason this is happening is because generally we check whether we need
to set nv50_atom->lock_core at the end of nv50_head_atomic_check().
However, curs507a_prepare is called from the fb_prepare callback, which
happens after the atomic check phase. As a result, this can lead to
commits that touch the core channel but don't grab nv50_disp->mutex.
So, fix this by making sure that we set nv50_atom->lock_core in
curs507a_prepare().
Signed-off-by: Lyude Paul <lyude(a)redhat.com>
Fixes: 1590700d94ac ("drm/nouveau/kms/nv50-: split each resource type into their own source files")
Cc: <stable(a)vger.kernel.org> # v4.18+
---
drivers/gpu/drm/nouveau/dispnv50/curs507a.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/nouveau/dispnv50/curs507a.c b/drivers/gpu/drm/nouveau/dispnv50/curs507a.c
index a95ee5dcc2e39..1a889139cb053 100644
--- a/drivers/gpu/drm/nouveau/dispnv50/curs507a.c
+++ b/drivers/gpu/drm/nouveau/dispnv50/curs507a.c
@@ -84,6 +84,7 @@ curs507a_prepare(struct nv50_wndw *wndw, struct nv50_head_atom *asyh,
asyh->curs.handle = handle;
asyh->curs.offset = offset;
asyh->set.curs = asyh->curs.visible;
+ nv50_atom(asyh->state.state)->lock_core = true;
}
}
--
2.52.0
When a newly poisoned subpage ends up in an already poisoned hugetlb
folio, 'num_poisoned_pages' is incremented, but the per-node ->mf_stats
is not. Fix the inconsistency by designating action_result() to update
them both.
While at it, define __get_huge_page_for_hwpoison() return values in terms
of symbol names for better readability. Also rename
folio_set_hugetlb_hwpoison() to hugetlb_update_hwpoison(), since the
function does more than the conventional bit setting and three possible
return values are expected.
Fixes: 18f41fa616ee4 ("mm: memory-failure: bump memory failure stats to pglist_data")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Jane Chu <jane.chu(a)oracle.com>
---
v1 -> v2:
Adopted David's and Liam's comments: define __get_huge_page_for_hwpoison()
return values in terms of symbol names instead of naked integers for better
readability. #define instead of enum is used since the function has a
footprint outside MF; this keeps the MF specifics local.
Also renamed folio_set_hugetlb_hwpoison() to hugetlb_update_hwpoison()
since the function does more than the conventional bit setting and
three possible return values are expected.
---
mm/memory-failure.c | 56 ++++++++++++++++++++++++++-------------------
1 file changed, 33 insertions(+), 23 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3edebb0cda30..3eb9d23a4ad0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1873,12 +1873,18 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
return count;
}
-static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
+#define MF_HUGETLB_ALREADY_POISONED 3 /* already poisoned */
+#define MF_HUGETLB_ACC_EXISTING_POISON 4 /* accessed existing poisoned page */
+/*
+ * Set hugetlb folio as hwpoisoned, update folio private raw hwpoison list
+ * to keep track of the poisoned pages.
+ */
+static int hugetlb_update_hwpoison(struct folio *folio, struct page *page)
{
struct llist_head *head;
struct raw_hwp_page *raw_hwp;
struct raw_hwp_page *p;
- int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
+ int ret = folio_test_set_hwpoison(folio) ? MF_HUGETLB_ALREADY_POISONED : 0;
/*
* Once the hwpoison hugepage has lost reliable raw error info,
@@ -1886,20 +1892,18 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
* so skip to add additional raw error info.
*/
if (folio_test_hugetlb_raw_hwp_unreliable(folio))
- return -EHWPOISON;
+ return MF_HUGETLB_ALREADY_POISONED;
+
head = raw_hwp_list_head(folio);
llist_for_each_entry(p, head->first, node) {
if (p->page == page)
- return -EHWPOISON;
+ return MF_HUGETLB_ACC_EXISTING_POISON;
}
raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_ATOMIC);
if (raw_hwp) {
raw_hwp->page = page;
llist_add(&raw_hwp->node, head);
- /* the first error event will be counted in action_result(). */
- if (ret)
- num_poisoned_pages_inc(page_to_pfn(page));
} else {
/*
* Failed to save raw error info. We no longer trace all
@@ -1945,32 +1949,30 @@ void folio_clear_hugetlb_hwpoison(struct folio *folio)
folio_free_raw_hwp(folio, true);
}
+#define MF_HUGETLB_FREED 0 /* freed hugepage */
+#define MF_HUGETLB_IN_USED 1 /* in-use hugepage */
+#define MF_NOT_HUGETLB 2 /* not a hugepage */
+
/*
* Called from hugetlb code with hugetlb_lock held.
- *
- * Return values:
- * 0 - free hugepage
- * 1 - in-use hugepage
- * 2 - not a hugepage
- * -EBUSY - the hugepage is busy (try to retry)
- * -EHWPOISON - the hugepage is already hwpoisoned
*/
int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
bool *migratable_cleared)
{
struct page *page = pfn_to_page(pfn);
struct folio *folio = page_folio(page);
- int ret = 2; /* fallback to normal page handling */
+ int ret = MF_NOT_HUGETLB;
bool count_increased = false;
+ int rc;
if (!folio_test_hugetlb(folio))
goto out;
if (flags & MF_COUNT_INCREASED) {
- ret = 1;
+ ret = MF_HUGETLB_IN_USED;
count_increased = true;
} else if (folio_test_hugetlb_freed(folio)) {
- ret = 0;
+ ret = MF_HUGETLB_FREED;
} else if (folio_test_hugetlb_migratable(folio)) {
ret = folio_try_get(folio);
if (ret)
@@ -1981,8 +1983,9 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
goto out;
}
- if (folio_set_hugetlb_hwpoison(folio, page)) {
- ret = -EHWPOISON;
+ rc = hugetlb_update_hwpoison(folio, page);
+ if (rc >= MF_HUGETLB_ALREADY_POISONED) {
+ ret = rc;
goto out;
}
@@ -2019,22 +2022,29 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
*hugetlb = 1;
retry:
res = get_huge_page_for_hwpoison(pfn, flags, &migratable_cleared);
- if (res == 2) { /* fallback to normal page handling */
+ switch (res) {
+ case MF_NOT_HUGETLB: /* fallback to normal page handling */
*hugetlb = 0;
return 0;
- } else if (res == -EHWPOISON) {
+ case MF_HUGETLB_ALREADY_POISONED:
+ case MF_HUGETLB_ACC_EXISTING_POISON:
if (flags & MF_ACTION_REQUIRED) {
folio = page_folio(p);
res = kill_accessing_process(current, folio_pfn(folio), flags);
}
- action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
+ if (res == MF_HUGETLB_ALREADY_POISONED)
+ action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
+ else
+ action_result(pfn, MF_MSG_HUGE, MF_FAILED);
return res;
- } else if (res == -EBUSY) {
+ case -EBUSY:
if (!(flags & MF_NO_RETRY)) {
flags |= MF_NO_RETRY;
goto retry;
}
return action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
+ default:
+ break;
}
folio = page_folio(p);
--
2.43.5
When a hugetlb folio is being poisoned again, try_memory_failure_hugetlb()
passed the head pfn to kill_accessing_process(), which is not right.
The precise pfn of the poisoned page should be used in order to
determine the precise vaddr for the SIGBUS payload.
This issue has already been taken care of in the normal path, that is,
hwpoison_user_mappings(); see [1][2]. Furthermore, for [3] to work
correctly in the hugetlb repoisoning case, it's essential to inform the
VM of the precise poisoned page, not the head page.
[1] https://lkml.kernel.org/r/20231218135837.3310403-1-willy@infradead.org
[2] https://lkml.kernel.org/r/20250224211445.2663312-1-jane.chu@oracle.com
[3] https://lore.kernel.org/lkml/20251116013223.1557158-1-jiaqiyan@google.com/
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Jane Chu <jane.chu(a)oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
---
v1 -> v2:
pickup R-B, add stable to cc list.
---
mm/memory-failure.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3edebb0cda30..c9d87811b1ea 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -681,9 +681,11 @@ static void set_to_kill(struct to_kill *tk, unsigned long addr, short shift)
}
static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
- unsigned long poisoned_pfn, struct to_kill *tk)
+ unsigned long poisoned_pfn, struct to_kill *tk,
+ int pte_nr)
{
unsigned long pfn = 0;
+ unsigned long hwpoison_vaddr;
if (pte_present(pte)) {
pfn = pte_pfn(pte);
@@ -694,10 +696,11 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
pfn = swp_offset_pfn(swp);
}
- if (!pfn || pfn != poisoned_pfn)
+ if (!pfn || (pfn > poisoned_pfn || (pfn + pte_nr - 1) < poisoned_pfn))
return 0;
- set_to_kill(tk, addr, shift);
+ hwpoison_vaddr = addr + ((poisoned_pfn - pfn) << PAGE_SHIFT);
+ set_to_kill(tk, hwpoison_vaddr, shift);
return 1;
}
@@ -749,7 +752,7 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
for (; addr != end; ptep++, addr += PAGE_SIZE) {
ret = check_hwpoisoned_entry(ptep_get(ptep), addr, PAGE_SHIFT,
- hwp->pfn, &hwp->tk);
+ hwp->pfn, &hwp->tk, 1);
if (ret == 1)
break;
}
@@ -772,8 +775,8 @@ static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
ptl = huge_pte_lock(h, walk->mm, ptep);
pte = huge_ptep_get(walk->mm, addr, ptep);
- ret = check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
- hwp->pfn, &hwp->tk);
+ ret = check_hwpoisoned_entry(pte, addr, huge_page_shift(h), hwp->pfn,
+ &hwp->tk, pages_per_huge_page(h));
spin_unlock(ptl);
return ret;
}
@@ -2023,10 +2026,8 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
*hugetlb = 0;
return 0;
} else if (res == -EHWPOISON) {
- if (flags & MF_ACTION_REQUIRED) {
- folio = page_folio(p);
- res = kill_accessing_process(current, folio_pfn(folio), flags);
- }
+ if (flags & MF_ACTION_REQUIRED)
+ res = kill_accessing_process(current, pfn, flags);
action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
return res;
} else if (res == -EBUSY) {
@@ -2037,6 +2038,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
return action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
}
+
folio = page_folio(p);
folio_lock(folio);
--
2.43.5
The calculation of bridge window head alignment is done by
calculate_mem_align() [*]. With the default bridge window alignment, it
is used for both head and tail alignment.
The selected head alignment does not always result in tight-fitting
resources (gap at d4f00000-d4ffffff):
d4800000-dbffffff : PCI Bus 0000:06
d4800000-d48fffff : PCI Bus 0000:07
d4800000-d4803fff : 0000:07:00.0
d4800000-d4803fff : nvme
d4900000-d49fffff : PCI Bus 0000:0a
d4900000-d490ffff : 0000:0a:00.0
d4900000-d490ffff : r8169
d4910000-d4913fff : 0000:0a:00.0
d4a00000-d4cfffff : PCI Bus 0000:0b
d4a00000-d4bfffff : 0000:0b:00.0
d4a00000-d4bfffff : 0000:0b:00.0
d4c00000-d4c07fff : 0000:0b:00.0
d4d00000-d4dfffff : PCI Bus 0000:15
d4d00000-d4d07fff : 0000:15:00.0
d4d00000-d4d07fff : xhci-hcd
d4e00000-d4efffff : PCI Bus 0000:16
d4e00000-d4e7ffff : 0000:16:00.0
d4e80000-d4e803ff : 0000:16:00.0
d4e80000-d4e803ff : ahci
d5000000-dbffffff : PCI Bus 0000:0c
This has not caused problems (for years) with the default bridge
window tail alignment that grossly over-estimates the required tail
alignment, leaving more tail room than necessary. With the introduction
of relaxed tail alignment that leaves no extra tail room whatsoever,
any gaps will immediately turn into assignment failures.
Introduce head alignment calculation that ensures no gaps are left and
apply the new approach when using relaxed alignment. We may want to
consider using it for the normal alignment eventually, but as the first
step, solve only the problem with the relaxed tail alignment.
([*] I don't understand the algorithm in calculate_mem_align().)
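For experimentation, the new helper ports directly to userspace. A
sketch with illustrative input (assumption: the numbers are examples
only; __ffs(SZ_1M) is 20):

  #include <stdint.h>
  #include <stdio.h>

  typedef uint64_t resource_size_t;
  #define SZ_1M 0x100000ULL

  /* Same logic as the calculate_head_align() added below. */
  static resource_size_t calculate_head_align(resource_size_t *aligns,
                                              int max_order)
  {
          resource_size_t head_align = 1, remainder = 0;
          int order;

          head_align <<= max_order + 20;        /* __ffs(SZ_1M) */
          for (order = max_order - 1; order >= 0; order--) {
                  resource_size_t align1 = 1ULL << (order + 20);

                  remainder += aligns[order];
                  while (head_align > align1 &&
                         remainder >= head_align / 2) {
                          head_align /= 2;
                          remainder -= head_align;
                  }
          }
          return head_align;
  }

  int main(void)
  {
          resource_size_t aligns[28] = { 0 };

          aligns[0] = 4 * SZ_1M;  /* four 1 MiB windows... */
          /* ...plus one 8 MiB window (max_order 3): the small
           * resources can fill the head, so this prints 0x400000,
           * i.e. 4 MiB head alignment suffices without gaps. */
          printf("%#llx\n",
                 (unsigned long long)calculate_head_align(aligns, 3));
          return 0;
  }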
Fixes: 5d0a8965aea9 ("[PATCH] 2.5.14: New PCI allocation code (alpha, arm, parisc) [2/2]")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220775
Reported-by: Malte Schröder <malte+lkml(a)tnxip.de>
Tested-by: Malte Schröder <malte+lkml(a)tnxip.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
A little annoyingly, there is a difference in what the aligns array
contains between the legacy alignment approach (which I dare not touch
as I really don't understand what the algorithm tries to do) and this
new head alignment algorithm, both consuming stack space. After making
the new approach the only available approach in the follow-up patch,
only one array remains (however, that follow-up change is also somewhat
riskier when it comes to regressions).
That being said, the new head alignment could work with the same aligns
array as the legacy approach, it just won't necessarily produce an
optimal (the smallest possible) head alignment when the if (r_size <=
align) condition is used. Just let me know if that approach is
preferred (to save some stack space).
---
drivers/pci/setup-bus.c | 53 ++++++++++++++++++++++++++++++++++-------
1 file changed, 44 insertions(+), 9 deletions(-)
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 4b918ff4d2d8..80e5a8fc62e7 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1228,6 +1228,45 @@ static inline resource_size_t calculate_mem_align(resource_size_t *aligns,
return min_align;
}
+/*
+ * Calculate bridge window head alignment that leaves no gaps in between
+ * resources.
+ */
+static resource_size_t calculate_head_align(resource_size_t *aligns,
+ int max_order)
+{
+ resource_size_t head_align = 1;
+ resource_size_t remainder = 0;
+ int order;
+
+ /* Take the largest alignment as the starting point. */
+ head_align <<= max_order + __ffs(SZ_1M);
+
+ for (order = max_order - 1; order >= 0; order--) {
+ resource_size_t align1 = 1;
+
+ align1 <<= order + __ffs(SZ_1M);
+
+ /*
+ * Account smaller resources with alignment < max_order that
+ * could be used to fill head room if alignment less than
+ * max_order is used.
+ */
+ remainder += aligns[order];
+
+ /*
+ * Test if head fill is enough to satisfy the alignment of
+ * the larger resources after reducing the alignment.
+ */
+ while ((head_align > align1) && (remainder >= head_align / 2)) {
+ head_align /= 2;
+ remainder -= head_align;
+ }
+ }
+
+ return head_align;
+}
+
/**
* pbus_upstream_space_available - Check no upstream resource limits allocation
* @bus: The bus
@@ -1315,13 +1354,13 @@ static void pbus_size_mem(struct pci_bus *bus, unsigned long type,
{
struct pci_dev *dev;
resource_size_t min_align, win_align, align, size, size0, size1 = 0;
- resource_size_t aligns[28]; /* Alignments from 1MB to 128TB */
+ resource_size_t aligns[28] = {}; /* Alignments from 1MB to 128TB */
+ resource_size_t aligns2[28] = {};/* Alignments from 1MB to 128TB */
int order, max_order;
struct resource *b_res = pbus_select_window_for_type(bus, type);
resource_size_t children_add_size = 0;
resource_size_t children_add_align = 0;
resource_size_t add_align = 0;
- resource_size_t relaxed_align;
resource_size_t old_size;
if (!b_res)
@@ -1331,7 +1370,6 @@ static void pbus_size_mem(struct pci_bus *bus, unsigned long type,
if (b_res->parent)
return;
- memset(aligns, 0, sizeof(aligns));
max_order = 0;
size = 0;
@@ -1382,6 +1420,7 @@ static void pbus_size_mem(struct pci_bus *bus, unsigned long type,
*/
if (r_size <= align)
aligns[order] += align;
+ aligns2[order] += align;
if (order > max_order)
max_order = order;
@@ -1406,9 +1445,7 @@ static void pbus_size_mem(struct pci_bus *bus, unsigned long type,
if (bus->self && size0 &&
!pbus_upstream_space_available(bus, b_res, size0, min_align)) {
- relaxed_align = 1ULL << (max_order + __ffs(SZ_1M));
- relaxed_align = max(relaxed_align, win_align);
- min_align = min(min_align, relaxed_align);
+ min_align = calculate_head_align(aligns2, max_order);
size0 = calculate_memsize(size, min_size, 0, 0, old_size, win_align);
resource_set_range(b_res, min_align, size0);
pci_info(bus->self, "bridge window %pR to %pR requires relaxed alignment rules\n",
@@ -1422,9 +1459,7 @@ static void pbus_size_mem(struct pci_bus *bus, unsigned long type,
if (bus->self && size1 &&
!pbus_upstream_space_available(bus, b_res, size1, add_align)) {
- relaxed_align = 1ULL << (max_order + __ffs(SZ_1M));
- relaxed_align = max(relaxed_align, win_align);
- min_align = min(min_align, relaxed_align);
+ min_align = calculate_head_align(aligns2, max_order);
size1 = calculate_memsize(size, min_size, add_size, children_add_size,
old_size, win_align);
pci_info(bus->self,
--
2.39.5
pbus_size_mem() has two alignments, one for required resources in
min_align and another in add_align that takes into account optional
resources.
The add_align is applied to the bridge window through the realloc_head
list. It can happen, however, that add_align is larger than min_align
but calculated size1 and size0 are equal due to extra tailroom (e.g.,
hotplug reservation, tail alignment), and therefore no entry is created
in the realloc_head list. Without the bridge appearing in realloc_head,
add_align is lost when pbus_size_mem() returns.
The problem is visible in this log for 0000:05:00.0, which lacks the
"add_size ... add_align ..." line that would indicate it was added into
the realloc_head list:
pci 0000:05:00.0: PCI bridge to [bus 06-16]
...
pci 0000:06:00.0: bridge window [mem 0x00100000-0x001fffff] to [bus 07] requires relaxed alignment rules
pci 0000:06:06.0: bridge window [mem 0x00100000-0x001fffff] to [bus 0a] requires relaxed alignment rules
pci 0000:06:07.0: bridge window [mem 0x00100000-0x003fffff] to [bus 0b] requires relaxed alignment rules
pci 0000:06:08.0: bridge window [mem 0x00800000-0x00ffffff 64bit pref] to [bus 0c-14] requires relaxed alignment rules
pci 0000:06:08.0: bridge window [mem 0x01000000-0x057fffff] to [bus 0c-14] requires relaxed alignment rules
pci 0000:06:08.0: bridge window [mem 0x01000000-0x057fffff] to [bus 0c-14] requires relaxed alignment rules
pci 0000:06:08.0: bridge window [mem 0x01000000-0x057fffff] to [bus 0c-14] add_size 100000 add_align 1000000
pci 0000:06:0c.0: bridge window [mem 0x00100000-0x001fffff] to [bus 15] requires relaxed alignment rules
pci 0000:06:0d.0: bridge window [mem 0x00100000-0x001fffff] to [bus 16] requires relaxed alignment rules
pci 0000:06:0d.0: bridge window [mem 0x00100000-0x001fffff] to [bus 16] requires relaxed alignment rules
pci 0000:05:00.0: bridge window [mem 0xd4800000-0xd97fffff]: assigned
pci 0000:05:00.0: bridge window [mem 0x1060000000-0x10607fffff 64bit pref]: assigned
pci 0000:06:08.0: bridge window [mem size 0x04900000]: can't assign; no space
pci 0000:06:08.0: bridge window [mem size 0x04900000]: failed to assign
While this bug itself seems old, it has likely become more visible
after the relaxed tail alignment that does not grossly overestimate the
size needed for the bridge window.
Make sure that add_align > min_align also results in an entry being
added to the realloc_head list. In addition, handle the cases where
add_size is zero while only the alignment differs.
Fixes: d74b9027a4da ("PCI: Consider additional PF's IOV BAR alignment in sizing and assigning")
Reported-by: Malte Schröder <malte+lkml(a)tnxip.de>
Tested-by: Malte Schröder <malte+lkml(a)tnxip.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Cc: stable(a)vger.kernel.org
---
drivers/pci/setup-bus.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 6e90f46f52af..4b918ff4d2d8 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -14,6 +14,7 @@
* tighter packing. Prefetchable range support.
*/
+#include <linux/align.h>
#include <linux/bitops.h>
#include <linux/bug.h>
#include <linux/init.h>
@@ -456,7 +457,7 @@ static void reassign_resources_sorted(struct list_head *realloc_head,
"%s %pR: ignoring failure in optional allocation\n",
res_name, res);
}
- } else if (add_size > 0) {
+ } else if (add_size > 0 || !IS_ALIGNED(res->start, align)) {
res->flags |= add_res->flags &
(IORESOURCE_STARTALIGN|IORESOURCE_SIZEALIGN);
if (pci_reassign_resource(dev, idx, add_size, align))
@@ -1442,12 +1443,13 @@ static void pbus_size_mem(struct pci_bus *bus, unsigned long type,
resource_set_range(b_res, min_align, size0);
b_res->flags |= IORESOURCE_STARTALIGN;
- if (bus->self && size1 > size0 && realloc_head) {
+ if (bus->self && realloc_head && (size1 > size0 || add_align > min_align)) {
b_res->flags &= ~IORESOURCE_DISABLED;
- add_to_list(realloc_head, bus->self, b_res, size1-size0, add_align);
+ add_size = size1 > size0 ? size1 - size0 : 0;
+ add_to_list(realloc_head, bus->self, b_res, add_size, add_align);
pci_info(bus->self, "bridge window %pR to %pR add_size %llx add_align %llx\n",
b_res, &bus->busn_res,
- (unsigned long long) (size1 - size0),
+ (unsigned long long) add_size,
(unsigned long long) add_align);
}
}
--
2.39.5
Hi all,
Roland Schwarzkopf reported to the Debian mailing list a problem he
encountered after updating Debian from 5.10.244 to 5.10.247. The
report is quoted below and can be found at
https://lists.debian.org/debian-kernel/2025/12/msg00223.html
Roland bisected the changes between 5.10.244 and 5.10.247 and found
that the issue is introduced by 1550f3673972 ("net: rtnetlink: add
bulk delete support flag"), which is the backport to the 5.10.y series.
On Thu, Dec 18, 2025 at 02:59:55PM +0100, Roland Schwarzkopf wrote:
> Hi Salvatore,
>
> On 12/17/25 20:28, Salvatore Bonaccorso wrote:
> > Hi Roland,
> >
> > I'm CC'ing Ben Hutchings directly as well as he takes care of the
> > Debian LTS kernel updates. Ideally we make this a proper bug as well
> > for easier tracking.
> >
> > On Wed, Dec 17, 2025 at 01:35:54PM +0100, Roland Schwarzkopf wrote:
> > > Hi there,
> > >
> > > after upgrading to the latest kernel on Debian 11
> > > (linux-image-5.10.0-37-amd64) I have an issue using libvirt with qemu/kvm
> > > virtual machines and macvtap networking. When a machine is shut down,
> > > libvirt can not delete the corresponding macvtap device. Thus, starting the
> > > machine again is not possible. After manually removing the macvtap device
> > > using `ip link delete` the vm can be started again.
> > >
> > > In the journal the following message is shown:
> > >
> > > Dec 17 13:19:27 iblis libvirtd[535]: error destroying network device macvtap0: Operation not supported
> > >
> > > After downgrading the kernel to linux-image-5.10.0-36-amd64, the problem
> > > disappears. I tested this on a fresh minimal install of Debian 11 - to
> > > exclude that anything else on my production machines is causing this issue.
> > >
> > > Since the older kernel does not have this issue, I assume this is related to
> > > the kernel and not to libvirt?
> > >
> > > I tried to check for bug reports of the kernel package, but the bug tracker
> > > finds no reports and even states that the package does not exist (I used the
> > > "Bug reports" link on
> > > https://packages.debian.org/bullseye/linux-image-5.10.0-37-amd64). This left
> > > me a bit puzzled. Since I don't have experience with the debian bug
> > > reporting process, I had no other idea than writing to this list.
> > You would need to search in https://bugs.debian.org/src:linux,
> > but that said I'm not aware of any bug reports in that direction.
> >
> > Would you be in the position of bisecting the problem as you can say
> > that 5.10.244 is good and 5.10.247 is bad and regressed? If you can do
> > that that would involve compiling a couple of kernels to narrow down
> > where the problem is introduced:
> >
> > git clone --single-branch -b linux-5.10.y https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-st…
> > cd linux-stable
> > git checkout v5.10.244
> > cp /boot/config-$(uname -r) .config
> > yes '' | make localmodconfig
> > make savedefconfig
> > mv defconfig arch/x86/configs/my_defconfig
> >
> > # test 5.10.244 to ensure this is "good"
> > make my_defconfig
> > make -j $(nproc) bindeb-pkg
> > ... install the resulting .deb package and confirm it successfully boots / problem does not exist
> >
> > # test 5.10.247 to ensure this is "bad"
> > git checkout v5.10.247
> > make my_defconfig
> > make -j $(nproc) bindeb-pkg
> > ... install the resulting .deb package and confirm it fails to boot / problem exists
> >
> > With that confirmed, the bisection can start:
> >
> > git bisect start
> > git bisect good v5.10.244
> > git bisect bad v5.10.247
> >
> > In each bisection step git checks out a state between the oldest
> > known-bad and the newest known-good commit. In each step test using:
> >
> > make my_defconfig
> > make -j $(nproc) bindeb-pkg
> > ... install, try to boot / verify if problem exists
> >
> > and if the problem is hit run:
> >
> > git bisect bad
> >
> > and if the problem doesn't trigger run:
> >
> > git bisect good
> >
> > . Please pay attention to always select the just built kernel for
> > booting, it won't always be the default kernel picked up by grub.
> >
> > Iterate until git announces to have identified the first bad commit.
> >
> > Then provide the output of
> >
> > git bisect log
> >
> > In the course of the bisection you might have to uninstall previous
> > kernels again to not exhaust the disk space in /boot. Also in the end
> > uninstall all self-built kernels again.
>
> I just did my first bisection \o/ (sorry)
>
> Here are the results:
>
> git bisect start
> # bad: [f964b940099f9982d723d4c77988d4b0dda9c165] Linux 5.10.247
> git bisect bad f964b940099f9982d723d4c77988d4b0dda9c165
> # good: [863b76df7d1e327979946a2d3893479c3275bfa4] Linux 5.10.244
> git bisect good f52ee6ea810273e527a5d319e5f400be8c8424c1
> # good: [dc9fdb7586b90e33c766eac52b6f3d1c9ec365a1] net: usb: lan78xx: Add error handling to lan78xx_init_mac_address
> git bisect good dc9fdb7586b90e33c766eac52b6f3d1c9ec365a1
> # bad: [2272d5757ce5d3fb416d9f2497b015678eb85c0d] phy: cadence: cdns-dphy: Enable lower resolutions in dphy
> git bisect bad 2272d5757ce5d3fb416d9f2497b015678eb85c0d
> # bad: [547539f08b9e3629ce68479889813e58c8087e70] ALSA: usb-audio: fix control pipe direction
> git bisect bad 547539f08b9e3629ce68479889813e58c8087e70
> # bad: [3509c748e79435d09e730673c8c100b7f0ebc87c] most: usb: hdm_probe: Fix calling put_device() before device initialization
> git bisect bad 3509c748e79435d09e730673c8c100b7f0ebc87c
> # bad: [a6ebcafc2f5ff7f0d1ce0c6dc38ac09a16a56ec0] net: add ndo_fdb_del_bulk
> git bisect bad a6ebcafc2f5ff7f0d1ce0c6dc38ac09a16a56ec0
> # good: [b8a72692aa42b7dcd179a96b90bc2763ac74576a] hfsplus: fix KMSAN uninit-value issue in __hfsplus_ext_cache_extent()
> git bisect good b8a72692aa42b7dcd179a96b90bc2763ac74576a
> # good: [2b42a595863556b394bd702d46f4a9d0d2985aaa] m68k: bitops: Fix find_*_bit() signatures
> git bisect good 2b42a595863556b394bd702d46f4a9d0d2985aaa
> # good: [9d9f7d71d46cff3491a443a3cf452cecf87d51ef] net: rtnetlink: use BIT for flag values
> git bisect good 9d9f7d71d46cff3491a443a3cf452cecf87d51ef
> # bad: [1550f3673972c5cfba714135f8bf26784e6f2b0f] net: rtnetlink: add bulk delete support flag
> git bisect bad 1550f3673972c5cfba714135f8bf26784e6f2b0f
> # good: [c8879afa24169e504f78c9ca43a4d0d7397049eb] net: netlink: add NLM_F_BULK delete request modifier
> git bisect good c8879afa24169e504f78c9ca43a4d0d7397049eb
> # first bad commit: [1550f3673972c5cfba714135f8bf26784e6f2b0f] net: rtnetlink: add bulk delete support flag
>
> Is there anything else I can do to help?
Is there something missing?
Roland, I think it would be helpful if you could also test more recent
stable series versions to confirm whether the issue is present there as
well, which might indicate a 5.10.y-specific backporting problem.
#regzbot introduced: 1550f3673972c5cfba714135f8bf26784e6f2b0f
Regards,
Salvatore
The local variable 'sensitivity' was never clamped to the
[0, POWERSAVE_BIAS_MAX] range because the return value of clamp() was
discarded. Fix this by assigning the clamped value back to 'sensitivity'.
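A minimal userspace sketch (assumption: a simplified clamp() macro and
an example POWERSAVE_BIAS_MAX value) shows the difference:

  #include <stdio.h>

  #define POWERSAVE_BIAS_MAX 1000
  #define clamp(val, lo, hi) \
          ((val) < (lo) ? (lo) : ((val) > (hi) ? (hi) : (val)))

  int main(void)
  {
          int sensitivity = -42;

          clamp(sensitivity, 0, POWERSAVE_BIAS_MAX); /* result discarded */
          printf("%d\n", sensitivity);               /* still -42 */

          sensitivity = clamp(sensitivity, 0, POWERSAVE_BIAS_MAX);
          printf("%d\n", sensitivity);               /* now 0 */
          return 0;
  }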
Cc: stable(a)vger.kernel.org
Fixes: 9c5320c8ea8b ("cpufreq: AMD "frequency sensitivity feedback" powersave bias for ondemand governor")
Signed-off-by: Thorsten Blum <thorsten.blum(a)linux.dev>
---
drivers/cpufreq/amd_freq_sensitivity.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/cpufreq/amd_freq_sensitivity.c b/drivers/cpufreq/amd_freq_sensitivity.c
index 13fed4b9e02b..713ccf24c97d 100644
--- a/drivers/cpufreq/amd_freq_sensitivity.c
+++ b/drivers/cpufreq/amd_freq_sensitivity.c
@@ -76,7 +76,7 @@ static unsigned int amd_powersave_bias_target(struct cpufreq_policy *policy,
sensitivity = POWERSAVE_BIAS_MAX -
(POWERSAVE_BIAS_MAX * (d_reference - d_actual) / d_reference);
- clamp(sensitivity, 0, POWERSAVE_BIAS_MAX);
+ sensitivity = clamp(sensitivity, 0, POWERSAVE_BIAS_MAX);
/* this workload is not CPU bound, so choose a lower freq */
if (sensitivity < od_tuners->powersave_bias) {
--
Thorsten Blum <thorsten.blum(a)linux.dev>
GPG: 1D60 735E 8AEF 3BE4 73B6 9D84 7336 78FD 8DFE EAD4