Hi,
I'd like to report a regression which seems related to the latest
ITS mitigations in Linux 6.1.x:
The server in question is a Supermicro SYS-120C-TN10R with
a "Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz" CPU, running
Debian Bookworm. The full output of /proc/cpuinfo is attached
as cpuinfo.txt
In addition to the kernel changes between 6.1.135 and 6.1.139
there is also some additional invariant, namely the Intel microcode
loaded at early boot:
On Linux 6.1.135 every works fine with both the 20250211 and
20250512 microcode releases (kern.log is attached as
6.1.135-feb-microcode.log and 6.1.135-may-microcode.log)
With 6.1.139 and the February microcode, oopses appear related
to clear_bhb_loop() (which may be related to "x86/its: Align
RETs in BHB clear sequence to avoid thunking"?). This is
captured in 6.1.139-feb-microcode.log.
With 6.1.139 and the May microcode, the system mostly
crashes on bootup (in my tests it crashed in three out of
four attempts). I've captured both the crash
(6.1.139-may-microcode-crash.log) and a working boot
(6.1.139-may-microcode-noncrash.log).
If you need any additional information, please let me know!
Cheers,
Moritz
Generally PASID support requires ACS settings that usually create
single device groups, but there are some niche cases where we can get
multi-device groups and still have working PASID support. The primary
issue is that PCI switches are not required to treat PASID tagged TLPs
specially so appropriate ACS settings are required to route all TLPs to
the host bridge if PASID is going to work properly.
pci_enable_pasid() does check that each device that will use PASID has
the proper ACS settings to achieve this routing.
However, no-PASID devices can be combined with PASID capable devices
within the same topology using non-uniform ACS settings. In this case
the no-PASID devices may not have strict route to host ACS flags and
end up being grouped with the PASID devices.
This configuration fails to allow use of the PASID within the iommu
core code which wrongly checks if the no-PASID device supports PASID.
Fix this by ignoring no-PASID devices during the PASID validation. They
will never issue a PASID TLP anyhow so they can be ignored.
Fixes: c404f55c26fc ("iommu: Validate the PASID in iommu_attach_device_pasid()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Tushar Dave <tdave(a)nvidia.com>
---
changes in v4:
- rebase to 6.15-rc7
drivers/iommu/iommu.c | 43 ++++++++++++++++++++++++++++---------------
1 file changed, 28 insertions(+), 15 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 4f91a740c15f..9d728800a862 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3366,10 +3366,12 @@ static int __iommu_set_group_pasid(struct iommu_domain *domain,
int ret;
for_each_group_device(group, device) {
- ret = domain->ops->set_dev_pasid(domain, device->dev,
- pasid, old);
- if (ret)
- goto err_revert;
+ if (device->dev->iommu->max_pasids > 0) {
+ ret = domain->ops->set_dev_pasid(domain, device->dev,
+ pasid, old);
+ if (ret)
+ goto err_revert;
+ }
}
return 0;
@@ -3379,15 +3381,18 @@ static int __iommu_set_group_pasid(struct iommu_domain *domain,
for_each_group_device(group, device) {
if (device == last_gdev)
break;
- /*
- * If no old domain, undo the succeeded devices/pasid.
- * Otherwise, rollback the succeeded devices/pasid to the old
- * domain. And it is a driver bug to fail attaching with a
- * previously good domain.
- */
- if (!old || WARN_ON(old->ops->set_dev_pasid(old, device->dev,
+ if (device->dev->iommu->max_pasids > 0) {
+ /*
+ * If no old domain, undo the succeeded devices/pasid.
+ * Otherwise, rollback the succeeded devices/pasid to
+ * the old domain. And it is a driver bug to fail
+ * attaching with a previously good domain.
+ */
+ if (!old ||
+ WARN_ON(old->ops->set_dev_pasid(old, device->dev,
pasid, domain)))
- iommu_remove_dev_pasid(device->dev, pasid, domain);
+ iommu_remove_dev_pasid(device->dev, pasid, domain);
+ }
}
return ret;
}
@@ -3398,8 +3403,10 @@ static void __iommu_remove_group_pasid(struct iommu_group *group,
{
struct group_device *device;
- for_each_group_device(group, device)
- iommu_remove_dev_pasid(device->dev, pasid, domain);
+ for_each_group_device(group, device) {
+ if (device->dev->iommu->max_pasids > 0)
+ iommu_remove_dev_pasid(device->dev, pasid, domain);
+ }
}
/*
@@ -3440,7 +3447,13 @@ int iommu_attach_device_pasid(struct iommu_domain *domain,
mutex_lock(&group->mutex);
for_each_group_device(group, device) {
- if (pasid >= device->dev->iommu->max_pasids) {
+ /*
+ * Skip PASID validation for devices without PASID support
+ * (max_pasids = 0). These devices cannot issue transactions
+ * with PASID, so they don't affect group's PASID usage.
+ */
+ if ((device->dev->iommu->max_pasids > 0) &&
+ (pasid >= device->dev->iommu->max_pasids)) {
ret = -EINVAL;
goto out_unlock;
}
--
2.34.1
Hi! After updating to linux-6.12.29, I see lots of "fail"-messages
during boot:
May 19 23:39:09 LUX kernel: [ 4.819552] amdgpu 0000:30:00.0: amdgpu:
[drm] amdgpu: DP AUX transfer fail:4
Bisecting for drivers/gpu/drm/amd had this result:
> git bisect bad
2d63e66f7ba7b88b87e72155a33b970c81cf4664 is the first bad commit
commit 2d63e66f7ba7b88b87e72155a33b970c81cf4664 (HEAD)
Author: Wayne Lin <Wayne.Lin(a)amd.com>
Date: Sun Apr 20 19:22:14 2025 +0800
drm/amd/display: Fix wrong handling for AUX_DEFER case
commit 65924ec69b29296845c7f628112353438e63ea56 upstream.
The system (Ryzen 3 5600G, latest BIOS) is stable so far but the
error-messages are not nice to see. Thanks.
Rainer Fiebig
--
The truth always turns out to be simpler than you thought.
Richard Feynman
The patch fixes a deadlock which can be triggered by an internal
syzkaller [1] reproducer and captured by bpftrace script [2] and its log
[3] in this scenario:
Process 1 Process 2
--- ---
hugetlb_fault
mutex_lock(B) // take B
filemap_lock_hugetlb_folio
filemap_lock_folio
__filemap_get_folio
folio_lock(A) // take A
hugetlb_wp
mutex_unlock(B) // release B
... hugetlb_fault
... mutex_lock(B) // take B
filemap_lock_hugetlb_folio
filemap_lock_folio
__filemap_get_folio
folio_lock(A) // blocked
unmap_ref_private
...
mutex_lock(B) // retake and blocked
This is a ABBA deadlock involving two locks:
- Lock A: pagecache_folio lock
- Lock B: hugetlb_fault_mutex_table lock
The deadlock occurs between two processes as follows:
1. The first process (let’s call it Process 1) is handling a
copy-on-write (COW) operation on a hugepage via hugetlb_wp. Due to
insufficient reserved hugetlb pages, Process 1, owner of the reserved
hugetlb page, attempts to unmap a hugepage owned by another process
(non-owner) to satisfy the reservation. Before unmapping, Process 1
acquires lock B (hugetlb_fault_mutex_table lock) and then lock A
(pagecache_folio lock). To proceed with the unmap, it releases Lock B
but retains Lock A. After the unmap, Process 1 tries to reacquire Lock
B. However, at this point, Lock B has already been acquired by another
process.
2. The second process (Process 2) enters the hugetlb_fault handler
during the unmap operation. It successfully acquires Lock B
(hugetlb_fault_mutex_table lock) that was just released by Process 1,
but then attempts to acquire Lock A (pagecache_folio lock), which is
still held by Process 1.
As a result, Process 1 (holding Lock A) is blocked waiting for Lock B
(held by Process 2), while Process 2 (holding Lock B) is blocked waiting
for Lock A (held by Process 1), constructing a ABBA deadlock scenario.
The error message:
INFO: task repro_20250402_:13229 blocked for more than 64 seconds.
Not tainted 6.15.0-rc3+ #24
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:repro_20250402_ state:D stack:25856 pid:13229 tgid:13228 ppid:3513 task_flags:0x400040 flags:0x00004006
Call Trace:
<TASK>
__schedule+0x1755/0x4f50
schedule+0x158/0x330
schedule_preempt_disabled+0x15/0x30
__mutex_lock+0x75f/0xeb0
hugetlb_wp+0xf88/0x3440
hugetlb_fault+0x14c8/0x2c30
trace_clock_x86_tsc+0x20/0x20
do_user_addr_fault+0x61d/0x1490
exc_page_fault+0x64/0x100
asm_exc_page_fault+0x26/0x30
RIP: 0010:__put_user_4+0xd/0x20
copy_process+0x1f4a/0x3d60
kernel_clone+0x210/0x8f0
__x64_sys_clone+0x18d/0x1f0
do_syscall_64+0x6a/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x41b26d
</TASK>
INFO: task repro_20250402_:13229 is blocked on a mutex likely owned by task repro_20250402_:13250.
task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513 task_flags:0x400040 flags:0x00000006
Call Trace:
<TASK>
__schedule+0x1755/0x4f50
schedule+0x158/0x330
io_schedule+0x92/0x110
folio_wait_bit_common+0x69a/0xba0
__filemap_get_folio+0x154/0xb70
hugetlb_fault+0xa50/0x2c30
trace_clock_x86_tsc+0x20/0x20
do_user_addr_fault+0xace/0x1490
exc_page_fault+0x64/0x100
asm_exc_page_fault+0x26/0x30
RIP: 0033:0x402619
</TASK>
INFO: task repro_20250402_:13250 blocked for more than 65 seconds.
Not tainted 6.15.0-rc3+ #24
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:repro_20250402_ state:D stack:28288 pid:13250 tgid:13228 ppid:3513 task_flags:0x400040 flags:0x00000006
Call Trace:
<TASK>
__schedule+0x1755/0x4f50
schedule+0x158/0x330
io_schedule+0x92/0x110
folio_wait_bit_common+0x69a/0xba0
__filemap_get_folio+0x154/0xb70
hugetlb_fault+0xa50/0x2c30
trace_clock_x86_tsc+0x20/0x20
do_user_addr_fault+0xace/0x1490
exc_page_fault+0x64/0x100
asm_exc_page_fault+0x26/0x30
RIP: 0033:0x402619
</TASK>
Showing all locks held in the system:
1 lock held by khungtaskd/35:
#0: ffffffff879a7440 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x30/0x180
2 locks held by repro_20250402_/13229:
#0: ffff888017d801e0 (&mm->mmap_lock){++++}-{4:4}, at: lock_mm_and_find_vma+0x37/0x300
#1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_wp+0xf88/0x3440
3 locks held by repro_20250402_/13250:
#0: ffff8880177f3d08 (vm_lock){++++}-{0:0}, at: do_user_addr_fault+0x41b/0x1490
#1: ffff888000fec848 (&hugetlb_fault_mutex_table[i]){+.+.}-{4:4}, at: hugetlb_fault+0x3b8/0x2c30
#2: ffff8880129500e8 (&resv_map->rw_sema){++++}-{4:4}, at: hugetlb_fault+0x494/0x2c30
Link: https://drive.google.com/file/d/1DVRnIW-vSayU5J1re9Ct_br3jJQU6Vpb/view?usp=… [1]
Link: https://github.com/bboymimi/bpftracer/blob/master/scripts/hugetlb_lock_debu… [2]
Link: https://drive.google.com/file/d/1bWq2-8o-BJAuhoHWX7zAhI6ggfhVzQUI/view?usp=… [3]
Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
Cc: stable(a)vger.kernel.org
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Florent Revest <revest(a)google.com>
Cc: Gavin Shan <gshan(a)redhat.com>
Suggested-by: Oscar Salvador <osalvador(a)suse.de>
Signed-off-by: Gavin Guo <gavinguo(a)igalia.com>
---
V1 -> V2
Suggested-by Oscar Salvador:
- Use folio_test_locked to replace the unnecessary parameter passing.
mm/hugetlb.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7ae38bfb9096..ed501f134eff 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6226,6 +6226,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
u32 hash;
folio_put(old_folio);
+ /*
+ * The pagecache_folio needs to be unlocked to avoid
+ * deadlock when the child unmaps the folio.
+ */
+ if (pagecache_folio)
+ folio_unlock(pagecache_folio);
/*
* Drop hugetlb_fault_mutex and vma_lock before
* unmapping. unmapping needs to hold vma_lock
@@ -6823,8 +6829,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
out_ptl:
spin_unlock(vmf.ptl);
+ /*
+ * hugetlb_wp() might have already unlocked pagecache_folio, so
+ * skip it if that is the case.
+ */
if (pagecache_folio) {
- folio_unlock(pagecache_folio);
+ if (folio_test_locked(pagecache_folio))
+ folio_unlock(pagecache_folio);
folio_put(pagecache_folio);
}
out_mutex:
base-commit: 4a95bc121ccdaee04c4d72f84dbfa6b880a514b6
--
2.43.0
Commit 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
introduced an error in the TX DMA handling for 8250_omap.
When the OMAP_DMA_TX_KICK flag is set, the "skip_byte" is pulled from
the kfifo and emitted directly in order to start the DMA. While the
kfifo is updated, dma->tx_size is not decreased. This leads to
uart_xmit_advance() called in omap_8250_dma_tx_complete() advancing the
kfifo by one too much.
In practice, transmitting N bytes has been seen to result in the last
N-1 bytes being sent repeatedly.
This change fixes the problem by moving all of the dma setup after the
OMAP_DMA_TX_KICK handling and using kfifo_len() instead of the DMA size
for the 4-byte cutoff check. This slightly changes the behaviour at
buffer wraparound, but it still transmits the correct bytes somehow.
Now, the "skip_byte" would no longer be accounted to the stats. As
previously, dma->tx_size included also this skip byte, up->icount.tx was
updated by aforementioned uart_xmit_advance() in
omap_8250_dma_tx_complete(). Fix this by using the uart_fifo_out()
helper instead of bare kfifo_get().
Based on patch by Mans Rullgard <mans(a)mansr.com>
Fixes: 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
Reported-by: Mans Rullgard <mans(a)mansr.com>
Cc: stable(a)vger.kernel.org
---
The same as for the original patch, I would appreaciate if someone
actually tests this one on a real HW too.
A patch to optimize the driver to use 2 sgls is still welcome. I will
not add it without actually having the HW.
---
drivers/tty/serial/8250/8250_omap.c | 25 ++++++++++---------------
1 file changed, 10 insertions(+), 15 deletions(-)
diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
index c9b1c689a045..bb23afdd63f2 100644
--- a/drivers/tty/serial/8250/8250_omap.c
+++ b/drivers/tty/serial/8250/8250_omap.c
@@ -1151,16 +1151,6 @@ static int omap_8250_tx_dma(struct uart_8250_port *p)
return 0;
}
- sg_init_table(&sg, 1);
- ret = kfifo_dma_out_prepare_mapped(&tport->xmit_fifo, &sg, 1,
- UART_XMIT_SIZE, dma->tx_addr);
- if (ret != 1) {
- serial8250_clear_THRI(p);
- return 0;
- }
-
- dma->tx_size = sg_dma_len(&sg);
-
if (priv->habit & OMAP_DMA_TX_KICK) {
unsigned char c;
u8 tx_lvl;
@@ -1185,18 +1175,22 @@ static int omap_8250_tx_dma(struct uart_8250_port *p)
ret = -EBUSY;
goto err;
}
- if (dma->tx_size < 4) {
+ if (kfifo_len(&tport->xmit_fifo) < 4) {
ret = -EINVAL;
goto err;
}
- if (!kfifo_get(&tport->xmit_fifo, &c)) {
+ if (!uart_fifo_out(&p->port, &c, 1)) {
ret = -EINVAL;
goto err;
}
skip_byte = c;
- /* now we need to recompute due to kfifo_get */
- kfifo_dma_out_prepare_mapped(&tport->xmit_fifo, &sg, 1,
- UART_XMIT_SIZE, dma->tx_addr);
+ }
+
+ sg_init_table(&sg, 1);
+ ret = kfifo_dma_out_prepare_mapped(&tport->xmit_fifo, &sg, 1, UART_XMIT_SIZE, dma->tx_addr);
+ if (ret != 1) {
+ ret = -EINVAL;
+ goto err;
}
desc = dmaengine_prep_slave_sg(dma->txchan, &sg, 1, DMA_MEM_TO_DEV,
@@ -1206,6 +1200,7 @@ static int omap_8250_tx_dma(struct uart_8250_port *p)
goto err;
}
+ dma->tx_size = sg_dma_len(&sg);
dma->tx_running = 1;
desc->callback = omap_8250_dma_tx_complete;
--
2.49.0
Commit 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
introduced an error in the TX DMA handling for 8250_omap.
When the OMAP_DMA_TX_KICK flag is set, the "skip_byte" is pulled from
the kfifo and emitted directly in order to start the DMA. While the
kfifo is updated, dma->tx_size is not decreased. This leads to
uart_xmit_advance() called in omap_8250_dma_tx_complete() advancing the
kfifo by one too much.
In practice, transmitting N bytes has been seen to result in the last
N-1 bytes being sent repeatedly.
This change fixes the problem by moving all of the dma setup after the
OMAP_DMA_TX_KICK handling and using kfifo_len() instead of the DMA size
for the 4-byte cutoff check. This slightly changes the behaviour at
buffer wraparound, but it still transmits the correct bytes somehow.
Now, the "skip_byte" would no longer be accounted to the stats. As
previously, dma->tx_size included also this skip byte, up->icount.tx was
updated by aforementioned uart_xmit_advance() in
omap_8250_dma_tx_complete(). Fix this by using the uart_fifo_out()
helper instead of bare kfifo_get().
Based on patch by Mans Rullgard <mans(a)mansr.com>
Signed-off-by: Jiri Slaby (SUSE) <jirislaby(a)kernel.org>
Fixes: 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
Link: https://lore.kernel.org/all/20250506150748.3162-1-mans@mansr.com/
Reported-by: Mans Rullgard <mans(a)mansr.com>
Cc: stable(a)vger.kernel.org
---
[v2] S-O-B added
---
drivers/tty/serial/8250/8250_omap.c | 25 ++++++++++---------------
1 file changed, 10 insertions(+), 15 deletions(-)
diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
index 2a0ce11f405d..72ae08d6204f 100644
--- a/drivers/tty/serial/8250/8250_omap.c
+++ b/drivers/tty/serial/8250/8250_omap.c
@@ -1173,16 +1173,6 @@ static int omap_8250_tx_dma(struct uart_8250_port *p)
return 0;
}
- sg_init_table(&sg, 1);
- ret = kfifo_dma_out_prepare_mapped(&tport->xmit_fifo, &sg, 1,
- UART_XMIT_SIZE, dma->tx_addr);
- if (ret != 1) {
- serial8250_clear_THRI(p);
- return 0;
- }
-
- dma->tx_size = sg_dma_len(&sg);
-
if (priv->habit & OMAP_DMA_TX_KICK) {
unsigned char c;
u8 tx_lvl;
@@ -1207,18 +1197,22 @@ static int omap_8250_tx_dma(struct uart_8250_port *p)
ret = -EBUSY;
goto err;
}
- if (dma->tx_size < 4) {
+ if (kfifo_len(&tport->xmit_fifo) < 4) {
ret = -EINVAL;
goto err;
}
- if (!kfifo_get(&tport->xmit_fifo, &c)) {
+ if (!uart_fifo_out(&p->port, &c, 1)) {
ret = -EINVAL;
goto err;
}
skip_byte = c;
- /* now we need to recompute due to kfifo_get */
- kfifo_dma_out_prepare_mapped(&tport->xmit_fifo, &sg, 1,
- UART_XMIT_SIZE, dma->tx_addr);
+ }
+
+ sg_init_table(&sg, 1);
+ ret = kfifo_dma_out_prepare_mapped(&tport->xmit_fifo, &sg, 1, UART_XMIT_SIZE, dma->tx_addr);
+ if (ret != 1) {
+ ret = -EINVAL;
+ goto err;
}
desc = dmaengine_prep_slave_sg(dma->txchan, &sg, 1, DMA_MEM_TO_DEV,
@@ -1228,6 +1222,7 @@ static int omap_8250_tx_dma(struct uart_8250_port *p)
goto err;
}
+ dma->tx_size = sg_dma_len(&sg);
dma->tx_running = 1;
desc->callback = omap_8250_dma_tx_complete;
--
2.49.0