From: Keith Busch <kbusch@kernel.org>
[ Upstream commit b1779e4f209c7ff7e32f3c79d69bca4e3a3a68b6 ]
A large DMA mapping request can loop through dma address pinning for many pages. In cases where THP cannot be used, the repeated vmf_insert_pfn can be costly, so let the task reschedule as needed to prevent CPU stalls. Failure to do so has potentially harmful side effects, like increased memory pressure as unrelated rcu tasks are unable to make their reclaim callbacks and result in OOM conditions.
 rcu: INFO: rcu_sched self-detected stall on CPU
 rcu:   36-....: (20999 ticks this GP) idle=b01c/1/0x4000000000000000 softirq=35839/35839 fqs=3538
 rcu:            hardirqs   softirqs   csw/system
 rcu:    number:        0        107            0
 rcu:   cputime:       50          0        10446   ==> 10556(ms)
 rcu:   (t=21075 jiffies g=377761 q=204059 ncpus=384)
 ...
  <TASK>
  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
  ? walk_system_ram_range+0x63/0x120
  ? walk_system_ram_range+0x46/0x120
  ? pgprot_writethrough+0x20/0x20
  lookup_memtype+0x67/0xf0
  track_pfn_insert+0x20/0x40
  vmf_insert_pfn_prot+0x88/0x140
  vfio_pci_mmap_huge_fault+0xf9/0x1b0 [vfio_pci_core]
  __do_fault+0x28/0x1b0
  handle_mm_fault+0xef1/0x2560
  fixup_user_fault+0xf5/0x270
  vaddr_get_pfns+0x169/0x2f0 [vfio_iommu_type1]
  vfio_pin_pages_remote+0x162/0x8e0 [vfio_iommu_type1]
  vfio_iommu_type1_ioctl+0x1121/0x1810 [vfio_iommu_type1]
  ? futex_wake+0x1c1/0x260
  x64_sys_call+0x234/0x17a0
  do_syscall_64+0x63/0x130
  ? exc_page_fault+0x63/0x130
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20250715184622.3561598-1-kbusch@meta.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
**Backport Status: YES**
This commit should be backported to stable kernel trees. Here's my extensive analysis:
## Bug Fix Nature
The commit fixes a critical CPU stall issue that can lead to system instability. The RCU stall trace in the commit message shows:

- A stall of 21075 jiffies (`t=21075 jiffies`), which is about 21 seconds if the kernel runs at HZ=1000
- Over 10 seconds of CPU time consumed (`cputime: ... ==> 10556(ms)`), almost all of it in system time
- The stall originates from `vfio_pin_pages_remote()` during large DMA mapping operations
This is a serious bug that affects system reliability and can cascade into OOM conditions as mentioned: "increased memory pressure as unrelated rcu tasks are unable to make their reclaim callbacks and result in OOM conditions."
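The duration read off the trace depends on the kernel tick rate, which the trace does not state. As a minimal sketch of that arithmetic, the helper below converts a jiffies count to milliseconds for an assumed `hz` value. The function name `jiffies_to_ms_assumed` and the choice of HZ values are illustrative assumptions, not taken from the patch or from any kernel API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a jiffies count to milliseconds for a given tick rate.
 * hz is an assumption supplied by the caller: the stall report
 * does not say which CONFIG_HZ the affected kernel was built with.
 */
uint64_t jiffies_to_ms_assumed(uint64_t jiffies, unsigned int hz)
{
	assert(hz > 0);
	return jiffies * 1000u / hz;
}
```

With `hz = 1000`, `jiffies_to_ms_assumed(21075, 1000)` gives 21075 ms, which is where the roughly 21-second figure comes from. A lower tick rate would make the same jiffies count correspond to a longer stall.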
## Code Change Analysis
The fix is minimal and well-contained:

1. **Single functional line**: The patch adds one `cond_resched()` call, plus its comment and a blank line (seven inserted lines in total), in the hunk at line 647 of `drivers/vfio/vfio_iommu_type1.c`
2. **Strategic placement**: Positioned at the beginning of the batch refill branch, before `vaddr_get_pfns()` is called
3. **Clear comment**: The 5-line comment explains exactly why the reschedule is needed
```c
+			/*
+			 * Large mappings may take a while to repeatedly refill
+			 * the batch, so conditionally relinquish the CPU when
+			 * needed to avoid stalls.
+			 */
+			cond_resched();
```
## Stability and Safety
1. **Low risk**: `cond_resched()` is a standard kernel primitive that only yields CPU if needed
2. **Already used pattern**: My search shows vfio_iommu_type1.c already uses `cond_resched()` in other places, confirming this is an established pattern
3. **No functional changes**: The fix doesn't alter the logic of DMA pinning, it just prevents monopolizing the CPU
4. **Reviewed by RCU maintainer**: Paul E. McKenney's review adds credibility to the fix
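To illustrate why a check placed at the batch refill point costs so little, here is a user-space model of the loop shape. It is only a sketch under stated assumptions: `struct sim_sched`, `sim_cond_resched()`, `sim_pin_pages()` and the `tick_every` parameter are hypothetical stand-ins invented for this example, and the real `cond_resched()` and `vfio_pin_pages_remote()` are kernel-internal and behave in more detail than this model captures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the scheduler's "reschedule wanted" state. */
struct sim_sched {
	bool need_resched;	/* set by a simulated timer tick */
	size_t checks;		/* how often the loop asked */
	size_t yields;		/* how often it actually gave up the CPU */
};

/* Model of a conditional reschedule: yield only when one is pending. */
void sim_cond_resched(struct sim_sched *s)
{
	s->checks++;
	if (s->need_resched) {
		s->need_resched = false;
		s->yields++;
	}
}

/*
 * Process npage pages in batches of at most batch_cap, checking for a
 * pending reschedule once per refill. tick_every simulates a timer that
 * requests a reschedule after that many pages of work.
 * Returns the number of pages processed.
 */
size_t sim_pin_pages(struct sim_sched *s, size_t npage,
		     size_t batch_cap, size_t tick_every)
{
	size_t pinned = 0, batch = 0;

	assert(batch_cap > 0 && tick_every > 0);
	while (npage) {
		if (!batch) {
			/* Same position as the call the patch adds. */
			sim_cond_resched(s);
			batch = npage < batch_cap ? npage : batch_cap;
		}
		batch--;
		npage--;
		pinned++;
		if (pinned % tick_every == 0)
			s->need_resched = true;
	}
	return pinned;
}
```

In this model the check runs once per batch rather than once per page, so 1000 pages with a batch capacity of 64 trigger only 16 checks, and the loop yields only on the checks where a reschedule is actually pending.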
## Stable Tree Criteria Compliance
Per stable-kernel-rules.rst, this meets the criteria:

- **Fixes a real bug**: CPU stalls and potential OOM are serious issues
- **Small change**: Single line addition with comment
- **Obviously correct**: Standard solution for long-running loops
- **Already tested**: Has been in mainline and reviewed by experts
- **No new features**: Pure bug fix, no functionality addition
## Impact Assessment
The bug affects systems performing large DMA mappings through VFIO, particularly when Transparent Huge Pages (THP) cannot be used. This is common in:

- Virtual machine device passthrough
- GPU/accelerator passthrough scenarios
- Large memory pinning operations

Without this fix, affected systems can experience:

- RCU stalls leading to system unresponsiveness
- Memory pressure and OOM kills
- Performance degradation for unrelated tasks
The fix prevents these issues with negligible overhead (conditional reschedule only when needed).
 drivers/vfio/vfio_iommu_type1.c | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1136d7ac6b59..f8d68fe77b41 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -647,6 +647,13 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 
 	while (npage) {
 		if (!batch->size) {
+			/*
+			 * Large mappings may take a while to repeatedly refill
+			 * the batch, so conditionally relinquish the CPU when
+			 * needed to avoid stalls.
+			 */
+			cond_resched();
+
 			/* Empty batch, so refill it. */
 			ret = vaddr_get_pfns(mm, vaddr, npage, dma->prot,
 					     &pfn, batch);