From: Ionut Nechita <ionut.nechita(a)windriver.com>
This series addresses two critical performance regressions in the block
layer multiqueue (blk-mq) subsystem when running on PREEMPT_RT kernels.
On RT kernels, regular spinlocks are converted to sleeping rt_mutex locks,
which can cause severe performance degradation in the I/O hot path. This
series converts two problematic locking patterns to prevent IRQ threads
from sleeping during I/O operations.
Testing on MegaRAID 12GSAS controller with 8 MSI-X vectors shows:
- v6.6.52-rt (before regression): 640 MB/s sequential read
- v6.6.64-rt (regression introduced): 153 MB/s (-76% regression)
- v6.6.68-rt with queue_lock fix only: 640 MB/s (performance restored)
- v6.6.69-rt with both fixes: expected similar or better performance
The first patch replaces queue_lock with memory barriers in the I/O
completion hot path, eliminating the contention that caused IRQ threads
to sleep. The second patch converts the global blk_mq_cpuhp_lock from
mutex to raw_spinlock to prevent sleeping during CPU hotplug operations.
Both conversions are safe because the protected code paths only perform
fast, non-blocking operations (memory barriers, list/hlist manipulation,
flag checks).
Ionut Nechita (2):
block/blk-mq: fix RT kernel regression with queue_lock in hot path
block/blk-mq: convert blk_mq_cpuhp_lock to raw_spinlock for RT
block/blk-mq.c | 33 +++++++++++++++------------------
1 file changed, 15 insertions(+), 18 deletions(-)
--
2.52.0
When of_find_net_device_by_node() successfully acquires a reference to
a network device but the subsequent call to dsa_port_parse_cpu()
fails, dsa_port_parse_of() returns without releasing the reference
count on the network device.
of_find_net_device_by_node() increments the reference count of the
returned structure, which should be balanced with a corresponding
put_device() when the reference is no longer needed.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: deff710703d8 ("net: dsa: Allow default tag protocol to be overridden from DT")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
Changes in v2:
- simplified the patch as suggestions;
- modified the Fixes tag as suggestions.
---
net/dsa/dsa.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index a20efabe778f..31b409a47491 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -1247,6 +1247,7 @@ static int dsa_port_parse_of(struct dsa_port *dp, struct device_node *dn)
struct device_node *ethernet = of_parse_phandle(dn, "ethernet", 0);
const char *name = of_get_property(dn, "label", NULL);
bool link = of_property_read_bool(dn, "link");
+ int err = 0;
dp->dn = dn;
@@ -1260,7 +1261,11 @@ static int dsa_port_parse_of(struct dsa_port *dp, struct device_node *dn)
return -EPROBE_DEFER;
user_protocol = of_get_property(dn, "dsa-tag-protocol", NULL);
- return dsa_port_parse_cpu(dp, conduit, user_protocol);
+ err = dsa_port_parse_cpu(dp, conduit, user_protocol);
+ if (err)
+ put_device(conduit);
+
+ return err;
}
if (link)
--
2.17.1
__arm_lpae_unmap() returns size_t but was returning -ENOENT (negative
error code) when encountering an unmapped PTE. Since size_t is unsigned,
-ENOENT (typically -2) becomes a huge positive value (0xFFFFFFFFFFFFFFFE
on 64-bit systems).
This corrupted value propagates through the call chain:
__arm_lpae_unmap() returns -ENOENT as size_t
-> arm_lpae_unmap_pages() returns it
-> __iommu_unmap() adds it to iova address
-> iommu_pgsize() triggers BUG_ON due to corrupted iova
This can cause IOVA address overflow in __iommu_unmap() loop and
trigger BUG_ON in iommu_pgsize() from invalid address alignment.
Fix by returning 0 instead of -ENOENT. The WARN_ON already signals
the error condition, and returning 0 (meaning "nothing unmapped")
is the correct semantic for size_t return type. This matches the
behavior of other io-pgtable implementations (io-pgtable-arm-v7s,
io-pgtable-dart) which return 0 on error conditions.
Fixes: 3318f7b5cefb ("iommu/io-pgtable-arm: Add quirk to quiet WARN_ON()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux(a)gmail.com>
---
drivers/iommu/io-pgtable-arm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index e6626004b323..05d63fe92e43 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -637,7 +637,7 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
pte = READ_ONCE(*ptep);
if (!pte) {
WARN_ON(!(data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NO_WARN));
- return -ENOENT;
+ return 0;
}
/* If the size matches this level, we're in the right place */
--
2.40.0
From: Alex Deucher <alexander.deucher(a)amd.com>
commit eb296c09805ee37dd4ea520a7fb3ec157c31090f upstream.
SI hardware doesn't support pasids, user mode queues, or
KIQ/MES so there is no need for this. Doing so results in
a segfault as these callbacks are non-existent for SI.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4744
Fixes: b4a7f4e7ad2b ("drm/amdgpu: attach tlb fence to the PTs update")
Reviewed-by: Timur Kristóf <timur.kristof(a)gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 820b3d376e8a102c6aeab737ec6edebbbb710e04)
Signed-off-by: Hans de Goede <johannes.goede(a)oss.qualcomm.com>
---
Changes in v2 stable submission:
- Correct the Fixes: tag hash which is wrong in the original upstream
commit eb296c09805e ("drm/amdgpu: don't attach the tlb fence for SI")
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 676e24fb8864..cdcafde3c71a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1066,7 +1066,9 @@ amdgpu_vm_tlb_flush(struct amdgpu_vm_update_params *params,
}
/* Prepare a TLB flush fence to be attached to PTs */
- if (!params->unlocked) {
+ if (!params->unlocked &&
+ /* SI doesn't support pasid or KIQ/MES */
+ params->adev->family > AMDGPU_FAMILY_SI) {
amdgpu_vm_tlb_fence_create(params->adev, vm, fence);
/* Makes sure no PD/PT is freed before the flush */
--
2.52.0
From: Ionut Nechita <ionut.nechita(a)windriver.com>
Commit 679b1874eba7 ("block: fix ordering between checking
QUEUE_FLAG_QUIESCED request adding") introduced queue_lock acquisition
in blk_mq_run_hw_queue() to synchronize QUEUE_FLAG_QUIESCED checks.
On RT kernels (CONFIG_PREEMPT_RT), regular spinlocks are converted to
rt_mutex (sleeping locks). When multiple MSI-X IRQ threads process I/O
completions concurrently, they contend on queue_lock in the hot path,
causing all IRQ threads to enter D (uninterruptible sleep) state. This
serializes interrupt processing completely.
Test case (MegaRAID 12GSAS with 8 MSI-X vectors on RT kernel):
- Good (v6.6.52-rt): 640 MB/s sequential read
- Bad (v6.6.64-rt): 153 MB/s sequential read (-76% regression)
- 6-8 out of 8 MSI-X IRQ threads stuck in D-state waiting on queue_lock
The original commit message mentioned memory barriers as an alternative
approach. Use full memory barriers (smp_mb) instead of queue_lock to
provide the same ordering guarantees without sleeping in RT kernel.
Memory barriers ensure proper synchronization:
- CPU0 either sees QUEUE_FLAG_QUIESCED cleared, OR
- CPU1 sees dispatch list/sw queue bitmap updates
This maintains correctness while avoiding lock contention that causes
RT kernel IRQ threads to sleep in the I/O completion path.
Fixes: 679b1874eba7 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding")
Cc: stable(a)vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita(a)windriver.com>
---
block/blk-mq.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5da948b07058..5fb8da4958d0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2292,22 +2292,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
+ /*
+ * First lockless check to avoid unnecessary overhead.
+ * Memory barrier below synchronizes with blk_mq_unquiesce_queue().
+ */
need_run = blk_mq_hw_queue_need_run(hctx);
if (!need_run) {
- unsigned long flags;
-
- /*
- * Synchronize with blk_mq_unquiesce_queue(), because we check
- * if hw queue is quiesced locklessly above, we need the use
- * ->queue_lock to make sure we see the up-to-date status to
- * not miss rerunning the hw queue.
- */
- spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+ /* Synchronize with blk_mq_unquiesce_queue() */
+ smp_mb();
need_run = blk_mq_hw_queue_need_run(hctx);
- spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
-
if (!need_run)
return;
+ /* Ensure dispatch list/sw queue updates visible before execution */
+ smp_mb();
}
if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
--
2.52.0
From: Ilya Maximets <i.maximets(a)ovn.org>
[ Upstream commit 5ace7ef87f059d68b5f50837ef3e8a1a4870c36e ]
The push_nsh() action structure looks like this:
OVS_ACTION_ATTR_PUSH_NSH(OVS_KEY_ATTR_NSH(OVS_NSH_KEY_ATTR_BASE,...))
The outermost OVS_ACTION_ATTR_PUSH_NSH attribute is OK'ed by the
nla_for_each_nested() inside __ovs_nla_copy_actions(). The innermost
OVS_NSH_KEY_ATTR_BASE/MD1/MD2 are OK'ed by the nla_for_each_nested()
inside nsh_key_put_from_nlattr(). But nothing checks if the attribute
in the middle is OK. We don't even check that this attribute is the
OVS_KEY_ATTR_NSH. We just do a double unwrap with a pair of nla_data()
calls - first time directly while calling validate_push_nsh() and the
second time as part of the nla_for_each_nested() macro, which isn't
safe, potentially causing invalid memory access if the size of this
attribute is incorrect. The failure may not be noticed during
validation due to larger netlink buffer, but cause trouble later during
action execution where the buffer is allocated exactly to the size:
BUG: KASAN: slab-out-of-bounds in nsh_hdr_from_nlattr+0x1dd/0x6a0 [openvswitch]
Read of size 184 at addr ffff88816459a634 by task a.out/22624
CPU: 8 UID: 0 PID: 22624 6.18.0-rc7+ #115 PREEMPT(voluntary)
Call Trace:
<TASK>
dump_stack_lvl+0x51/0x70
print_address_description.constprop.0+0x2c/0x390
kasan_report+0xdd/0x110
kasan_check_range+0x35/0x1b0
__asan_memcpy+0x20/0x60
nsh_hdr_from_nlattr+0x1dd/0x6a0 [openvswitch]
push_nsh+0x82/0x120 [openvswitch]
do_execute_actions+0x1405/0x2840 [openvswitch]
ovs_execute_actions+0xd5/0x3b0 [openvswitch]
ovs_packet_cmd_execute+0x949/0xdb0 [openvswitch]
genl_family_rcv_msg_doit+0x1d6/0x2b0
genl_family_rcv_msg+0x336/0x580
genl_rcv_msg+0x9f/0x130
netlink_rcv_skb+0x11f/0x370
genl_rcv+0x24/0x40
netlink_unicast+0x73e/0xaa0
netlink_sendmsg+0x744/0xbf0
__sys_sendto+0x3d6/0x450
do_syscall_64+0x79/0x2c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
Let's add some checks that the attribute is properly sized and it's
the only one attribute inside the action. Technically, there is no
real reason for OVS_KEY_ATTR_NSH to be there, as we know that we're
pushing an NSH header already, it just creates extra nesting, but
that's how uAPI works today. So, keeping as it is.
Fixes: b2d0f5d5dc53 ("openvswitch: enable NSH support")
Reported-by: Junvy Yang <zhuque(a)tencent.com>
Signed-off-by: Ilya Maximets <i.maximets(a)ovn.org>
Acked-by: Eelco Chaudron <echaudro(a)redhat.com>
Reviewed-by: Aaron Conole <aconole(a)redhat.com>
Link: https://patch.msgid.link/20251204105334.900379-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
(cherry picked from commit 5ace7ef87f059d68b5f50837ef3e8a1a4870c36e)
Signed-off-by: Adrian Yip <adrian.ytw(a)gmail.com>
---
net/openvswitch/flow_netlink.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 1cb4f97335d8..2d536901309e 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2802,13 +2802,20 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
return err;
}
-static bool validate_push_nsh(const struct nlattr *attr, bool log)
+static bool validate_push_nsh(const struct nlattr *a, bool log)
{
+ struct nlattr *nsh_key = nla_data(a);
struct sw_flow_match match;
struct sw_flow_key key;
+ /* There must be one and only one NSH header. */
+ if (!nla_ok(nsh_key, nla_len(a)) ||
+ nla_total_size(nla_len(nsh_key)) != nla_len(a) ||
+ nla_type(nsh_key) != OVS_KEY_ATTR_NSH)
+ return false;
+
ovs_match_init(&match, &key, true, NULL);
- return !nsh_key_put_from_nlattr(attr, &match, false, true, log);
+ return !nsh_key_put_from_nlattr(nsh_key, &match, false, true, log);
}
/* Return false if there are any non-masked bits set.
@@ -3389,7 +3396,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
return -EINVAL;
}
mac_proto = MAC_PROTO_NONE;
- if (!validate_push_nsh(nla_data(a), log))
+ if (!validate_push_nsh(a, log))
return -EINVAL;
break;
--
2.52.0