On 8/5/25 22:25, Lu Baolu wrote:
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The Linux x86 architecture maps the kernel address space into the upper portion of every process’s page table. Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel space mappings. However, the Linux kernel currently lacks a notification mechanism for kernel space mapping changes. This means the IOMMU driver is not aware of such changes, leading to a break in IOMMU cache coherence.
FWIW, I wouldn't use the term "cache coherence" in this context. I'd probably just call them "stale IOTLB entries".
I also think this over states the problem. There is currently no problem with "kernel space mapping changes". The issue is solely when kernel page table pages are freed and reused.
Modern IOMMUs often cache intermediate-level page table entries, as long as the entry is valid, no matter the permissions, to optimize walk performance. Currently the IOMMU driver is notified only of changes to user VA mappings, so the IOMMU's internal caches may retain stale entries for kernel VA. When kernel page table mappings are changed (e.g., by vfree()) but the IOMMU's internal caches retain stale entries, a Use-After-Free (UAF) condition arises.
If these freed page table pages are reallocated for a different purpose, potentially by an attacker, the IOMMU could misinterpret the new data as valid page table entries. This allows the IOMMU to walk into attacker-controlled memory, leading to arbitrary physical memory DMA access or privilege escalation.
Note that it's not just use-after-free. It's literally that the IOMMU will keep writing Accessed and Dirty bits while it thinks the page is still a page table. The IOMMU will sit there happily setting bits. So, it's _write_ after free too.
To mitigate this, introduce a new iommu interface to flush IOMMU caches. This interface should be invoked from architecture-specific code that manages combined user and kernel page tables, whenever a kernel page table update is done and the CPU TLB needs to be flushed.
There's one tidbit missing from this:
Currently SVA contexts are all unprivileged. They can only access user mappings and never kernel mappings. However, IOMMUs still walk kernel-only page tables all the way down to the leaf, where they realize that the entry is a kernel mapping and error out.
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..3b85e7d3ba44 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include <linux/task_work.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mmu_context.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1478,6 +1479,8 @@ void flush_tlb_all(void)
 	else
 		/* Fall back to the IPI-based invalidation. */
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
+
+	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
 }
 
 /* Flush an arbitrarily large range of memory with INVLPGB. */
@@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		kernel_tlb_flush_range(info);
 
 	put_flush_tlb_info();
+	iommu_sva_invalidate_kva_range(start, end);
 }
These desperately need to be commented. They especially need comments making it clear that they are solely there to flush the IOMMU mid-level paging structures after a page table page has been freed.
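Something along these lines at the flush_tlb_kernel_range() call site, maybe (my wording, just to illustrate the kind of comment I'm asking for):

	/*
	 * A kernel page table page for this range may have just been
	 * freed.  SVA-capable IOMMUs share the kernel half of the page
	 * tables and may still hold paging-structure cache entries that
	 * point at that page, so have the IOMMU side invalidate the KVA
	 * range before the page can be reused.
	 */
	iommu_sva_invalidate_kva_range(start, end);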
 /*
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..d0da2b3fd64b 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -10,6 +10,8 @@
 #include "iommu-priv.h"
 
 static DEFINE_MUTEX(iommu_sva_lock);
+static bool iommu_sva_present;
+static LIST_HEAD(iommu_sva_mms);
 
 static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
 						   struct mm_struct *mm);
@@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
 		return ERR_PTR(-ENOSPC);
 	}
 	iommu_mm->pasid = pasid;
+	iommu_mm->mm = mm;
 	INIT_LIST_HEAD(&iommu_mm->sva_domains);
 	/*
 	 * Make sure the write to mm->iommu_mm is not reordered in front of
@@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
 	if (ret)
 		goto out_free_domain;
 	domain->users = 1;
-	list_add(&domain->next, &mm->iommu_mm->sva_domains);
 
+	if (list_empty(&iommu_mm->sva_domains)) {
+		if (list_empty(&iommu_sva_mms))
+			WRITE_ONCE(iommu_sva_present, true);
+		list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
+	}
+	list_add(&domain->next, &iommu_mm->sva_domains);
 out:
 	refcount_set(&handle->users, 1);
 	mutex_unlock(&iommu_sva_lock);
@@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
 		list_del(&domain->next);
 		iommu_domain_free(domain);
 	}
+
+	if (list_empty(&iommu_mm->sva_domains)) {
+		list_del(&iommu_mm->mm_list_elm);
+		if (list_empty(&iommu_sva_mms))
+			WRITE_ONCE(iommu_sva_present, false);
+	}
+
 	mutex_unlock(&iommu_sva_lock);
 	kfree(handle);
 }
@@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
 
 	return domain;
 }
+
+struct kva_invalidation_work_data {
+	struct work_struct work;
+	unsigned long start;
+	unsigned long end;
+};
+
+static void invalidate_kva_func(struct work_struct *work)
+{
+	struct kva_invalidation_work_data *data =
+		container_of(work, struct kva_invalidation_work_data, work);
+	struct iommu_mm_data *iommu_mm;
+
+	guard(mutex)(&iommu_sva_lock);
+	list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
+		mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm,
+				data->start, data->end);
+
+	kfree(data);
+}
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+	struct kva_invalidation_work_data *data;
+
+	if (likely(!READ_ONCE(iommu_sva_present)))
+		return;
It would be nice to hear a few more words about why this is safe without a lock.
+
+	/* will be freed in the task function */
+	data = kzalloc(sizeof(*data), GFP_ATOMIC);
+	if (!data)
+		return;
+
+	data->start = start;
+	data->end = end;
+	INIT_WORK(&data->work, invalidate_kva_func);
+
+	/*
+	 * Since iommu_sva_mms is an unbound list, iterating it in an atomic
+	 * context could introduce significant latency issues.
+	 */
+	schedule_work(&data->work);
+}
Hold on a sec, though. The problematic caller of this looks something like this (logically):
pmd_free_pte_page()
{
	pte = pmd_page_vaddr(*pmd);
	pmd_clear(pmd);
	flush_tlb_kernel_range(...);	// does schedule_work()
	pte_free_kernel(pte);
}
It _immediately_ frees the PTE page. The schedule_work() work will probably happen sometime after the page is freed.
Isn't that still a use-after-free? The window is only some arbitrary amount of time, which is better than before, but it's still a use-after-free.
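If the IOMMU flush has to be complete before the page table page goes back to the allocator, it seems like the invalidation needs to happen synchronously rather than being pushed off to a workqueue. Roughly something like this, reusing the list and mutex from the patch, and assuming every flush_tlb_kernel_range() caller on these paths can sleep (untested sketch, not a drop-in replacement):

void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
{
	struct iommu_mm_data *iommu_mm;

	if (likely(!READ_ONCE(iommu_sva_present)))
		return;

	/* Don't return until every SVA mm has dropped its stale entries. */
	guard(mutex)(&iommu_sva_lock);
	list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
		mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm,
							    start, end);
}

That obviously brings back the latency and atomic-context concerns from the comment in the patch, so the alternative would be to defer the page freeing itself until the work has run, instead of deferring the flush.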