[PATCH AUTOSEL 6.16-5.15] vfio/type1: conditional rescheduling while pinning

10 Aug 2025

From: Keith Busch <kbusch@kernel.org>

[ Upstream commit b1779e4f209c7ff7e32f3c79d69bca4e3a3a68b6 ]

A large DMA mapping request can loop through DMA address pinning for
many pages. In cases where THP cannot be used, the repeated vmf_insert_pfn can
be costly, so let the task reschedule as needed to prevent CPU stalls. Failure to
do so has potentially harmful side effects, like increased memory pressure
as unrelated rcu tasks are unable to make their reclaim callbacks and
result in OOM conditions.

rcu: INFO: rcu_sched self-detected stall on CPU
 rcu:   36-....: (20999 ticks this GP) idle=b01c/1/0x4000000000000000 softirq=35839/35839 fqs=3538
 rcu:            hardirqs   softirqs   csw/system
 rcu:    number:        0        107            0
 rcu:   cputime:       50          0        10446   ==> 10556(ms)
 rcu:   (t=21075 jiffies g=377761 q=204059 ncpus=384)
...
  <TASK>
  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
  ? walk_system_ram_range+0x63/0x120
  ? walk_system_ram_range+0x46/0x120
  ? pgprot_writethrough+0x20/0x20
  lookup_memtype+0x67/0xf0
  track_pfn_insert+0x20/0x40
  vmf_insert_pfn_prot+0x88/0x140
  vfio_pci_mmap_huge_fault+0xf9/0x1b0 [vfio_pci_core]
  __do_fault+0x28/0x1b0
  handle_mm_fault+0xef1/0x2560
  fixup_user_fault+0xf5/0x270
  vaddr_get_pfns+0x169/0x2f0 [vfio_iommu_type1]
  vfio_pin_pages_remote+0x162/0x8e0 [vfio_iommu_type1]
  vfio_iommu_type1_ioctl+0x1121/0x1810 [vfio_iommu_type1]
  ? futex_wake+0x1c1/0x260
  x64_sys_call+0x234/0x17a0
  do_syscall_64+0x63/0x130
  ? exc_page_fault+0x63/0x130
  entry_SYSCALL_64_after_hwframe+0x4b/0x53

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20250715184622.3561598-1-kbusch@meta.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
**Backport Status: YES**
This commit should be backported to stable kernel trees. Here's my
extensive analysis:
## Bug Fix Nature
The commit fixes a critical CPU stall issue that can lead to system
instability. The RCU stall trace in the commit message shows:
- An RCU stall of roughly 21 seconds, assuming HZ=1000 (`t=21075 jiffies`)
- Over 10 seconds of CPU time consumed (`cputime: 10556ms`)
- The stall originates from `vfio_pin_pages_remote()` during large DMA
  mapping operations
This is a serious bug that affects system reliability and can cascade
into OOM conditions as mentioned: "increased memory pressure as
unrelated rcu tasks are unable to make their reclaim callbacks and
result in OOM conditions."
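
The trace reports the stall duration in jiffies, so converting it to wall-clock time depends on the kernel's tick rate, which the trace does not state. As a minimal sketch of that arithmetic, the hypothetical helper `ticks_to_ms()` below (not a kernel function) takes the tick rate as a parameter; HZ=1000 and HZ=250 are used purely as example values:

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only: convert a tick count
 * (jiffies) to milliseconds for a given tick rate in Hz.
 */
static unsigned long ticks_to_ms(unsigned long ticks, unsigned long hz)
{
	assert(hz > 0);
	return ticks * 1000UL / hz;
}
```

With `t=21075` from the trace, `ticks_to_ms(21075, 1000)` gives 21075 ms, while the same count at HZ=250 would correspond to 84300 ms, so the "21 seconds" reading only holds for a 1000 Hz tick.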
## Code Change Analysis
The fix is minimal and well-contained:
1. **Single line addition**: The patch adds only a `cond_resched()` call,
   in the hunk starting at line 647 (after the comment block)
2. **Strategic placement**: Positioned at the beginning of the batch
   refill loop, before `vaddr_get_pfns()` is called
3. **Clear comment**: The 5-line comment explains exactly why the
   reschedule is needed
```c
+                       /*
+                        * Large mappings may take a while to repeatedly refill
+                        * the batch, so conditionally relinquish the CPU when
+                        * needed to avoid stalls.
+                        */
+                       cond_resched();
```
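
The placement can be illustrated with a userspace analogue. This is only a sketch under stated assumptions: `struct batch`, `BATCH_CAPACITY`, `refill_batch()` and `pin_pages_sketch()` are hypothetical names that do not exist in the driver, and `sched_yield()` stands in for `cond_resched()` even though it yields unconditionally, whereas `cond_resched()` only reschedules when a reschedule is pending. The point it shows is that the yield happens once per batch refill rather than once per page:

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

#define BATCH_CAPACITY 512	/* hypothetical value, not taken from the driver */

struct batch {
	size_t size;	/* entries left in the current batch */
	size_t refills;	/* number of refills performed so far */
};

/* Hypothetical stand-in for vaddr_get_pfns(): refill up to BATCH_CAPACITY. */
static void refill_batch(struct batch *b, size_t remaining)
{
	b->size = remaining < BATCH_CAPACITY ? remaining : BATCH_CAPACITY;
	b->refills++;
}

/* Consume npage entries; returns how many times the loop yielded. */
static size_t pin_pages_sketch(size_t npage, struct batch *b)
{
	size_t yields = 0;

	assert(b != NULL);
	while (npage) {
		if (!b->size) {
			/* Userspace analogue of cond_resched(). */
			sched_yield();
			yields++;
			/* Empty batch, so refill it. */
			refill_batch(b, npage);
		}
		/* Consume one entry from the batch. */
		b->size--;
		npage--;
	}
	return yields;
}
```

For a request of 1025 entries starting from an empty batch, this sketch refills and yields three times (512 + 512 + 1), so the cost of the yield point is amortized over each batch.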
## Stability and Safety
1. **Low risk**: `cond_resched()` is a standard kernel primitive that
   only yields CPU if needed
2. **Already used pattern**: My search shows vfio_iommu_type1.c already
   uses `cond_resched()` in other places, confirming this is an
   established pattern
3. **No functional changes**: The fix doesn't alter the logic of DMA
   pinning, it just prevents monopolizing the CPU
4. **Reviewed by RCU maintainer**: Paul E. McKenney's review adds
   credibility to the fix
## Stable Tree Criteria Compliance
Per stable-kernel-rules.rst, this meets the criteria:
- **Fixes a real bug**: CPU stalls and potential OOM are serious issues
- **Small change**: Single line addition with comment
- **Obviously correct**: Standard solution for long-running loops
- **Already tested**: Has been in mainline and reviewed by experts
- **No new features**: Pure bug fix, no functionality addition
## Impact Assessment
The bug affects systems performing large DMA mappings through VFIO,
particularly when Transparent Huge Pages (THP) cannot be used. This is
common in:
- Virtual machine device passthrough
- GPU/accelerator passthrough scenarios
- Large memory pinning operations
Without this fix, affected systems can experience:
- RCU stalls leading to system unresponsiveness
- Memory pressure and OOM kills
- Performance degradation for unrelated tasks
The fix prevents these issues with negligible overhead (conditional
reschedule only when needed).
drivers/vfio/vfio_iommu_type1.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1136d7ac6b59..f8d68fe77b41 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -647,6 +647,13 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	while (npage) {
 		if (!batch->size) {
+			/*
+			 * Large mappings may take a while to repeatedly refill
+			 * the batch, so conditionally relinquish the CPU when
+			 * needed to avoid stalls.
+			 */
+			cond_resched();
+
 			/* Empty batch, so refill it. */
 			ret = vaddr_get_pfns(mm, vaddr, npage, dma->prot,
 					     &pfn, batch);
-- 
2.39.5



    
