From: Ionut Nechita <ionut.nechita(a)windriver.com>
This series addresses two critical performance regressions in the block
layer multiqueue (blk-mq) subsystem when running on PREEMPT_RT kernels.
On RT kernels, regular spinlocks are converted to sleeping rt_mutex locks,
which can cause severe performance degradation in the I/O hot path. This
series converts two problematic locking patterns to prevent IRQ threads
from sleeping during I/O operations.
Testing on a MegaRAID 12GSAS controller with 8 MSI-X vectors shows:
- v6.6.52-rt (before the regression): 640 MB/s sequential read
- v6.6.64-rt (regression introduced): 153 MB/s (-76%)
- v6.6.68-rt with the queue_lock fix only: 640 MB/s (performance restored)
- v6.6.69-rt with both fixes: similar or better performance expected
The first patch replaces queue_lock with memory barriers in the I/O
completion hot path, eliminating the contention that caused IRQ threads
to sleep. The second patch converts the global blk_mq_cpuhp_lock from
mutex to raw_spinlock to prevent sleeping during CPU hotplug operations.
Both conversions are safe because the protected code paths only perform
fast, non-blocking operations (memory barriers, list/hlist manipulation,
flag checks).
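For reference, the second conversion follows the usual mutex-to-raw_spinlock
pattern, sketched below in simplified form (illustrative only, not the exact
hunk from patch 2; the real code guards the per-hctx cpuhp hlist nodes):

    /* before: DEFINE_MUTEX(blk_mq_cpuhp_lock) - a sleeping lock on PREEMPT_RT */
    static DEFINE_RAW_SPINLOCK(blk_mq_cpuhp_lock);

    /* critical sections only do short, non-blocking hlist updates */
    raw_spin_lock(&blk_mq_cpuhp_lock);
    if (!hlist_unhashed(&hctx->cpuhp_dead))
            hlist_del_init(&hctx->cpuhp_dead);
    raw_spin_unlock(&blk_mq_cpuhp_lock);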
Ionut Nechita (2):
block/blk-mq: fix RT kernel regression with queue_lock in hot path
block/blk-mq: convert blk_mq_cpuhp_lock to raw_spinlock for RT
block/blk-mq.c | 33 +++++++++++++++------------------
1 file changed, 15 insertions(+), 18 deletions(-)
--
2.52.0
When of_find_net_device_by_node() successfully acquires a reference to
a network device but the subsequent call to dsa_port_parse_cpu()
fails, dsa_port_parse_of() returns without releasing the reference
count on the network device.
of_find_net_device_by_node() increments the reference count of the
returned structure, which should be balanced with a corresponding
put_device() when the reference is no longer needed.
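The required balance, in pattern form (a sketch of the pairing, not the
exact hunk below; the acquired reference is on the struct device embedded
in struct net_device):

    conduit = of_find_net_device_by_node(ethernet);   /* takes a reference */
    if (!conduit)
            return -EPROBE_DEFER;

    err = dsa_port_parse_cpu(dp, conduit, user_protocol);
    if (err)
            put_device(&conduit->dev);                 /* drop it on error */
    return err;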
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: deff710703d8 ("net: dsa: Allow default tag protocol to be overridden from DT")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
Changes in v2:
- simplified the patch as suggested;
- modified the Fixes tag as suggested.
---
net/dsa/dsa.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index a20efabe778f..31b409a47491 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -1247,6 +1247,7 @@ static int dsa_port_parse_of(struct dsa_port *dp, struct device_node *dn)
struct device_node *ethernet = of_parse_phandle(dn, "ethernet", 0);
const char *name = of_get_property(dn, "label", NULL);
bool link = of_property_read_bool(dn, "link");
+ int err = 0;
dp->dn = dn;
@@ -1260,7 +1261,11 @@ static int dsa_port_parse_of(struct dsa_port *dp, struct device_node *dn)
return -EPROBE_DEFER;
user_protocol = of_get_property(dn, "dsa-tag-protocol", NULL);
- return dsa_port_parse_cpu(dp, conduit, user_protocol);
+ err = dsa_port_parse_cpu(dp, conduit, user_protocol);
+ if (err)
+ put_device(&conduit->dev);
+
+ return err;
}
if (link)
--
2.17.1
__arm_lpae_unmap() returns size_t but was returning -ENOENT (a negative
error code) when encountering an unmapped PTE. Since size_t is unsigned,
-ENOENT (-2) becomes a huge positive value (0xFFFFFFFFFFFFFFFE on 64-bit
systems).
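The conversion is easy to demonstrate in isolation (a standalone userspace
illustration, not kernel code):

    #include <errno.h>
    #include <stdio.h>

    int main(void)
    {
            size_t ret = -ENOENT;   /* -2 implicitly converted to unsigned */
            printf("%zx\n", ret);   /* prints fffffffffffffffe on 64-bit */
            return 0;
    }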
This corrupted value propagates through the call chain:
__arm_lpae_unmap() returns -ENOENT as size_t
-> arm_lpae_unmap_pages() returns it
-> __iommu_unmap() adds it to iova address
-> iommu_pgsize() triggers BUG_ON due to corrupted iova
This can cause the IOVA address to overflow in the __iommu_unmap() loop and
trigger the BUG_ON() in iommu_pgsize() due to invalid address alignment.
Fix by returning 0 instead of -ENOENT. The WARN_ON() already signals
the error condition, and returning 0 (meaning "nothing unmapped")
is the correct semantic for a size_t return type. This matches the
behavior of other io-pgtable implementations (io-pgtable-arm-v7s,
io-pgtable-dart), which return 0 on error conditions.
Fixes: 3318f7b5cefb ("iommu/io-pgtable-arm: Add quirk to quiet WARN_ON()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux(a)gmail.com>
---
drivers/iommu/io-pgtable-arm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index e6626004b323..05d63fe92e43 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -637,7 +637,7 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
pte = READ_ONCE(*ptep);
if (!pte) {
WARN_ON(!(data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NO_WARN));
- return -ENOENT;
+ return 0;
}
/* If the size matches this level, we're in the right place */
--
2.40.0
From: Alex Deucher <alexander.deucher(a)amd.com>
commit eb296c09805ee37dd4ea520a7fb3ec157c31090f upstream.
SI hardware doesn't support pasids, user mode queues, or
KIQ/MES so there is no need for this. Doing so results in
a segfault as these callbacks are non-existent for SI.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4744
Fixes: b4a7f4e7ad2b ("drm/amdgpu: attach tlb fence to the PTs update")
Reviewed-by: Timur Kristóf <timur.kristof(a)gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 820b3d376e8a102c6aeab737ec6edebbbb710e04)
Signed-off-by: Hans de Goede <johannes.goede(a)oss.qualcomm.com>
---
Changes in v2 stable submission:
- Correct the Fixes: tag hash which is wrong in the original upstream
commit eb296c09805e ("drm/amdgpu: don't attach the tlb fence for SI")
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 676e24fb8864..cdcafde3c71a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1066,7 +1066,9 @@ amdgpu_vm_tlb_flush(struct amdgpu_vm_update_params *params,
}
/* Prepare a TLB flush fence to be attached to PTs */
- if (!params->unlocked) {
+ if (!params->unlocked &&
+ /* SI doesn't support pasid or KIQ/MES */
+ params->adev->family > AMDGPU_FAMILY_SI) {
amdgpu_vm_tlb_fence_create(params->adev, vm, fence);
/* Makes sure no PD/PT is freed before the flush */
--
2.52.0
From: Ionut Nechita <ionut.nechita(a)windriver.com>
Commit 679b1874eba7 ("block: fix ordering between checking
QUEUE_FLAG_QUIESCED request adding") introduced queue_lock acquisition
in blk_mq_run_hw_queue() to synchronize QUEUE_FLAG_QUIESCED checks.
On RT kernels (CONFIG_PREEMPT_RT), regular spinlocks are converted to
rt_mutex (sleeping locks). When multiple MSI-X IRQ threads process I/O
completions concurrently, they contend on queue_lock in the hot path,
causing all IRQ threads to enter D (uninterruptible sleep) state. This
serializes interrupt processing completely.
Test case (MegaRAID 12GSAS with 8 MSI-X vectors on RT kernel):
- Good (v6.6.52-rt): 640 MB/s sequential read
- Bad (v6.6.64-rt): 153 MB/s sequential read (-76% regression)
- 6-8 out of 8 MSI-X IRQ threads stuck in D-state waiting on queue_lock
The original commit message mentioned memory barriers as an alternative
approach. Use full memory barriers (smp_mb()) instead of queue_lock to
provide the same ordering guarantees without sleeping on RT kernels.
Memory barriers ensure proper synchronization:
- CPU0 either sees QUEUE_FLAG_QUIESCED cleared, OR
- CPU1 sees dispatch list/sw queue bitmap updates
This maintains correctness while avoiding lock contention that causes
RT kernel IRQ threads to sleep in the I/O completion path.
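Abstractly, this is the usual store/load barrier pairing (hypothetical
helper names below, for illustration only; the real checks live in
blk_mq_hw_queue_need_run() and the quiesce/unquiesce paths):

    /* CPU0: insert request + blk_mq_run_hw_queue() */
    add_request_to_dispatch_list();   /* store */
    smp_mb();
    if (!queue_is_quiesced())         /* load */
            run_hw_queue();

    /* CPU1: blk_mq_unquiesce_queue() */
    clear_quiesced_flag();            /* store */
    smp_mb();
    if (dispatch_work_pending())      /* load */
            run_hw_queue();

With both barriers in place, at least one CPU is guaranteed to observe the
other's store, so the rerun cannot be missed.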
Fixes: 679b1874eba7 ("block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding")
Cc: stable(a)vger.kernel.org
Signed-off-by: Ionut Nechita <ionut.nechita(a)windriver.com>
---
block/blk-mq.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5da948b07058..5fb8da4958d0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2292,22 +2292,19 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
might_sleep_if(!async && hctx->flags & BLK_MQ_F_BLOCKING);
+ /*
+ * First lockless check to avoid unnecessary overhead.
+ * Memory barrier below synchronizes with blk_mq_unquiesce_queue().
+ */
need_run = blk_mq_hw_queue_need_run(hctx);
if (!need_run) {
- unsigned long flags;
-
- /*
- * Synchronize with blk_mq_unquiesce_queue(), because we check
- * if hw queue is quiesced locklessly above, we need the use
- * ->queue_lock to make sure we see the up-to-date status to
- * not miss rerunning the hw queue.
- */
- spin_lock_irqsave(&hctx->queue->queue_lock, flags);
+ /* Synchronize with blk_mq_unquiesce_queue() */
+ smp_mb();
need_run = blk_mq_hw_queue_need_run(hctx);
- spin_unlock_irqrestore(&hctx->queue->queue_lock, flags);
-
if (!need_run)
return;
+ /* Ensure dispatch list/sw queue updates visible before execution */
+ smp_mb();
}
if (async || !cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
--
2.52.0