In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The Linux x86 architecture maps the kernel address space into the upper portion of every process’s page table. Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel space mappings. However, the Linux kernel currently lacks a notification mechanism for kernel space mapping changes. This means the IOMMU driver is not aware of such changes, leading to a break in IOMMU cache coherence.
Modern IOMMUs often cache intermediate-level page table entries, as long as an entry is valid and regardless of its permissions, to optimize walk performance. Currently the IOMMU driver is notified only of changes to user VA mappings, so the IOMMU's internal caches may retain stale entries for kernel VA. When kernel page table mappings are changed (e.g., by vfree()) while the IOMMU's internal caches still hold such stale entries, a use-after-free (UAF) condition arises.
If these freed page table pages are reallocated for a different purpose, potentially by an attacker, the IOMMU could misinterpret the new data as valid page table entries. This allows the IOMMU to walk into attacker-controlled memory, leading to arbitrary physical memory DMA access or privilege escalation.
To mitigate this, introduce a new iommu interface to flush IOMMU caches. This interface should be invoked from architecture-specific code that manages combined user and kernel page tables, whenever a kernel page table update is done and the CPU TLB needs to be flushed.
Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
Cc: stable@vger.kernel.org
Suggested-by: Jann Horn <jannh@google.com>
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
---
 arch/x86/mm/tlb.c         |  4 +++
 drivers/iommu/iommu-sva.c | 60 ++++++++++++++++++++++++++++++++++++++-
 include/linux/iommu.h     |  4 +++
 3 files changed, 67 insertions(+), 1 deletion(-)
Change log:

v3:
 - iommu_sva_mms is an unbound list; iterating it in an atomic context
   could introduce significant latency issues. Schedule it in a kernel
   thread and replace the spinlock with a mutex.
 - Replace the static key with a normal bool; it can be brought back if
   data shows the benefit.
 - Invalidate KVA range in the flush_tlb_all() paths.
 - All previous reviewed-bys are preserved. Please let me know if there
   are any objections.
v2:
 - https://lore.kernel.org/linux-iommu/20250709062800.651521-1-baolu.lu@linux.i...
 - Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range);
 - Replace the mutex with a spinlock to make the interface usable in the
   critical regions.
v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux....
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..3b85e7d3ba44 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include <linux/task_work.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mmu_context.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1478,6 +1479,8 @@ void flush_tlb_all(void)
 	else
 		/* Fall back to the IPI-based invalidation. */
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
+
+	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
 }
 
 /* Flush an arbitrarily large range of memory with INVLPGB. */
@@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		kernel_tlb_flush_range(info);
 
 	put_flush_tlb_info();
+	iommu_sva_invalidate_kva_range(start, end);
 }
 
 /*
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..d0da2b3fd64b 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -10,6 +10,8 @@
 #include "iommu-priv.h"
 
 static DEFINE_MUTEX(iommu_sva_lock);
+static bool iommu_sva_present;
+static LIST_HEAD(iommu_sva_mms);
 static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
 						   struct mm_struct *mm);
 
@@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
 		return ERR_PTR(-ENOSPC);
 	}
 	iommu_mm->pasid = pasid;
+	iommu_mm->mm = mm;
 	INIT_LIST_HEAD(&iommu_mm->sva_domains);
 	/*
 	 * Make sure the write to mm->iommu_mm is not reordered in front of
@@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
 	if (ret)
 		goto out_free_domain;
 	domain->users = 1;
-	list_add(&domain->next, &mm->iommu_mm->sva_domains);
 
+	if (list_empty(&iommu_mm->sva_domains)) {
+		if (list_empty(&iommu_sva_mms))
+			WRITE_ONCE(iommu_sva_present, true);
+		list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
+	}
+	list_add(&domain->next, &iommu_mm->sva_domains);
 out:
 	refcount_set(&handle->users, 1);
 	mutex_unlock(&iommu_sva_lock);
@@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
 		list_del(&domain->next);
 		iommu_domain_free(domain);
 	}
+
+	if (list_empty(&iommu_mm->sva_domains)) {
+		list_del(&iommu_mm->mm_list_elm);
+		if (list_empty(&iommu_sva_mms))
+			WRITE_ONCE(iommu_sva_present, false);
+	}
+
 	mutex_unlock(&iommu_sva_lock);
 	kfree(handle);
 }
@@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
 
 	return domain;
 }
+
+struct kva_invalidation_work_data {
+	struct work_struct work;
+	unsigned long start;
+	unsigned long end;
+};
+
+static void invalidate_kva_func(struct work_struct *work)
+{
+	struct kva_invalidation_work_data *data =
+		container_of(work, struct kva_invalidation_work_data, work);
+	struct iommu_mm_data *iommu_mm;
+
+	guard(mutex)(&iommu_sva_lock);
+	list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
+		mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm,
+				data->start, data->end);
+
+	kfree(data);
+}
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+	struct kva_invalidation_work_data *data;
+
+	if (likely(!READ_ONCE(iommu_sva_present)))
+		return;
+
+	/* will be freed in the task function */
+	data = kzalloc(sizeof(*data), GFP_ATOMIC);
+	if (!data)
+		return;
+
+	data->start = start;
+	data->end = end;
+	INIT_WORK(&data->work, invalidate_kva_func);
+
+	/*
+	 * Since iommu_sva_mms is an unbound list, iterating it in an atomic
+	 * context could introduce significant latency issues.
+	 */
+	schedule_work(&data->work);
+}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c30d12e16473..66e4abb2df0d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1134,7 +1134,9 @@ struct iommu_sva {
 
 struct iommu_mm_data {
 	u32 pasid;
+	struct mm_struct *mm;
 	struct list_head sva_domains;
+	struct list_head mm_list_elm;
 };
 
 int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode);
@@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
 					struct mm_struct *mm);
 void iommu_sva_unbind_device(struct iommu_sva *handle);
 u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
 #else
 static inline struct iommu_sva *
 iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
 }
 
 static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
 #endif /* CONFIG_IOMMU_SVA */
 
 #ifdef CONFIG_IOMMU_IOPF
On 8/5/25 22:25, Lu Baolu wrote:
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The Linux x86 architecture maps the kernel address space into the upper portion of every process’s page table. Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel space mappings. However, the Linux kernel currently lacks a notification mechanism for kernel space mapping changes. This means the IOMMU driver is not aware of such changes, leading to a break in IOMMU cache coherence.
FWIW, I wouldn't use the term "cache coherence" in this context. I'd probably just call them "stale IOTLB entries".
I also think this over states the problem. There is currently no problem with "kernel space mapping changes". The issue is solely when kernel page table pages are freed and reused.
Modern IOMMUs often cache intermediate-level page table entries, as long as an entry is valid and regardless of its permissions, to optimize walk performance. Currently the IOMMU driver is notified only of changes to user VA mappings, so the IOMMU's internal caches may retain stale entries for kernel VA. When kernel page table mappings are changed (e.g., by vfree()) while the IOMMU's internal caches still hold such stale entries, a use-after-free (UAF) condition arises.
If these freed page table pages are reallocated for a different purpose, potentially by an attacker, the IOMMU could misinterpret the new data as valid page table entries. This allows the IOMMU to walk into attacker-controlled memory, leading to arbitrary physical memory DMA access or privilege escalation.
Note that it's not just use-after-free. It's literally that the IOMMU will keep writing Accessed and Dirty bits while it thinks the page is still a page table. The IOMMU will sit there happily setting bits. So, it's _write_ after free too.
To mitigate this, introduce a new iommu interface to flush IOMMU caches. This interface should be invoked from architecture-specific code that manages combined user and kernel page tables, whenever a kernel page table update is done and the CPU TLB needs to be flushed.
There's one tidbit missing from this:
Currently SVA contexts are all unprivileged. They can only access user mappings and never kernel mappings. However, IOMMUs still walk kernel-only page tables all the way down to the leaf, where they realize that the entry is a kernel mapping and error out.
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..3b85e7d3ba44 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include <linux/task_work.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mmu_context.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1478,6 +1479,8 @@ void flush_tlb_all(void)
 	else
 		/* Fall back to the IPI-based invalidation. */
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
+
+	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
 }
 
 /* Flush an arbitrarily large range of memory with INVLPGB. */
@@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		kernel_tlb_flush_range(info);
 
 	put_flush_tlb_info();
+	iommu_sva_invalidate_kva_range(start, end);
 }
These desperately need to be commented. They especially need comments that they are solely for flushing the IOMMU mid-level paging structures after freeing a page table page.
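For illustration only (not wording from the posted patch), such a comment at the flush_tlb_kernel_range() call site might look roughly like:

	put_flush_tlb_info();
	/*
	 * Illustrative wording: SVA-capable IOMMUs may cache intermediate
	 * (non-leaf) kernel page table entries. Invalidate the IOMMU
	 * paging-structure caches here so that a page table page freed by
	 * this path cannot be walked by the IOMMU after it is reused.
	 */
	iommu_sva_invalidate_kva_range(start, end);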
/* diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c index 1a51cfd82808..d0da2b3fd64b 100644 --- a/drivers/iommu/iommu-sva.c +++ b/drivers/iommu/iommu-sva.c @@ -10,6 +10,8 @@ #include "iommu-priv.h" static DEFINE_MUTEX(iommu_sva_lock); +static bool iommu_sva_present; +static LIST_HEAD(iommu_sva_mms); static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, struct mm_struct *mm); @@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de return ERR_PTR(-ENOSPC); } iommu_mm->pasid = pasid;
- iommu_mm->mm = mm; INIT_LIST_HEAD(&iommu_mm->sva_domains); /*
- Make sure the write to mm->iommu_mm is not reordered in front of
@@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm if (ret) goto out_free_domain; domain->users = 1;
- list_add(&domain->next, &mm->iommu_mm->sva_domains);
- if (list_empty(&iommu_mm->sva_domains)) {
if (list_empty(&iommu_sva_mms))
WRITE_ONCE(iommu_sva_present, true);
list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
- }
- list_add(&domain->next, &iommu_mm->sva_domains);
out: refcount_set(&handle->users, 1); mutex_unlock(&iommu_sva_lock); @@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle) list_del(&domain->next); iommu_domain_free(domain); }
- if (list_empty(&iommu_mm->sva_domains)) {
list_del(&iommu_mm->mm_list_elm);
if (list_empty(&iommu_sva_mms))
WRITE_ONCE(iommu_sva_present, false);
- }
- mutex_unlock(&iommu_sva_lock); kfree(handle);
} @@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, return domain; }
+struct kva_invalidation_work_data {
- struct work_struct work;
- unsigned long start;
- unsigned long end;
+};
+static void invalidate_kva_func(struct work_struct *work) +{
- struct kva_invalidation_work_data *data =
container_of(work, struct kva_invalidation_work_data, work);
- struct iommu_mm_data *iommu_mm;
- guard(mutex)(&iommu_sva_lock);
- list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm,
data->start, data->end);
- kfree(data);
+}
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) +{
- struct kva_invalidation_work_data *data;
- if (likely(!READ_ONCE(iommu_sva_present)))
return;
It would be nice to hear a few more words about why this is safe without a lock.
- /* will be freed in the task function */
- data = kzalloc(sizeof(*data), GFP_ATOMIC);
- if (!data)
return;
- data->start = start;
- data->end = end;
- INIT_WORK(&data->work, invalidate_kva_func);
- /*
* Since iommu_sva_mms is an unbound list, iterating it in an atomic
* context could introduce significant latency issues.
*/
- schedule_work(&data->work);
+}
Hold on a sec, though. The problematic caller of this looks something like this (logically):
pmd_free_pte_page() {
	pte = pmd_page_vaddr(*pmd);
	pmd_clear(pmd);
	flush_tlb_kernel_range(...); // does schedule_work()
	pte_free_kernel(pte);
}
It _immediately_ frees the PTE page. The schedule_work() work will probably happen sometime after the page is freed.
Isn't that still a use-after-free? It's for some arbitrary amount of time and better than before but it's still a use-after-free.
On Wed, Aug 06, 2025 at 08:03:42AM -0700, Dave Hansen wrote:
Hold on a sec, though. the problematic caller of this looks something like this (logically):
pmd_free_pte_page() { pte = pmd_page_vaddr(*pmd); pmd_clear(pmd); flush_tlb_kernel_range(...); // does schedule_work() pte_free_kernel(pte); }
It _immediately_ frees the PTE page. The schedule_work() work will probably happen sometime after the page is freed.
Isn't that still a use-after-free? It's for some arbitrary amount of time and better than before but it's still a use-after-free.
Yes it is.
You can't do this approach without also pushing the pages to be freed onto a list and deferring the free until the work runs. This is broadly what the normal MM user flow is doing.
Jason
On 8/6/25 08:52, Jason Gunthorpe wrote:
Isn't that still a use-after-free? It's for some arbitrary amount of time and better than before but it's still a use-after-free.
Yes it is.
You can't do this approach without also pushing the pages to freed on a list and defering the free till the work. This is broadly what the normal mm user flow is doing..
FWIW, I think the simplest way to do this is to plop an unconditional schedule_work() in pte_free_kernel(). The work function will invalidate the IOTLBs and then free the page.
Keep the schedule_work() unconditional to keep it simple. The schedule_work() is way cheaper than all the system-wide TLB invalidation IPIs that have to get sent as well. No need to add complexity to optimize out something that's in the noise already.
On Wed, Aug 06, 2025 at 09:04:29AM -0700, Dave Hansen wrote:
On 8/6/25 08:52, Jason Gunthorpe wrote:
Isn't that still a use-after-free? It's for some arbitrary amount of time and better than before but it's still a use-after-free.
Yes it is.
You can't do this approach without also pushing the pages to freed on a list and defering the free till the work. This is broadly what the normal mm user flow is doing..
FWIW, I think the simplest way to do this is to plop an unconditional schedule_work() in pte_free_kernel(). The work function will invalidate the IOTLBs and then free the page.
Keep the schedule_work() unconditional to keep it simple. The schedule_work() is way cheaper than all the system-wide TLB invalidation IPIs that have to get sent as well. No need to add complexity to optimize out something that's in the noise already.
That works also, but now you have to allocate memory or you are dead.. Is it OK these days, and safe in this code which seems a little bit linked to memory management?
The MM side avoided this by putting the list and the rcu_head in the struct page.
Jason
On 8/6/25 09:09, Jason Gunthorpe wrote:
You can't do this approach without also pushing the pages to freed on a list and defering the free till the work. This is broadly what the normal mm user flow is doing..
FWIW, I think the simplest way to do this is to plop an unconditional schedule_work() in pte_free_kernel(). The work function will invalidate the IOTLBs and then free the page.
Keep the schedule_work() unconditional to keep it simple. The schedule_work() is way cheaper than all the system-wide TLB invalidation IPIs that have to get sent as well. No need to add complexity to optimize out something that's in the noise already.
That works also, but now you have to allocate memory or you are dead.. Is it OK these days, and safe in this code which seems a little bit linked to memory management?
The MM side avoided this by putting the list and the rcu_head in the struct page.
I don't think you need to allocate memory. A little static structure that uses the page->list and has a lock should do. Logically something like this:
struct kernel_pgtable_work {
	struct list_head list;
	spinlock_t lock;
	struct work_struct work;
} kernel_pte_work;

pte_free_kernel()
{
	struct page *page = ptdesc_magic();

	guard(spinlock)(&kernel_pte_work.lock);
	list_add(&page->list, &kernel_pte_work.list);
	schedule_work(&kernel_pte_work.work);
}

work_func()
{
	iommu_sva_invalidate_kva();

	guard(spinlock)(&kernel_pte_work.lock);
	list_for_each_safe() {
		page = container_of(...);
		free_whatever(page);
	}
}
The only wrinkle is that pte_free_kernel() itself still has a pte and 'ptdesc', not a 'struct page'. But there is ptdesc->pt_list, which should be unused at this point, especially for non-pgd pages on x86.
So, either go over to the 'struct page' earlier (maybe by open-coding pagetable_dtor_free()?), or just use the ptdesc.
On Wed, Aug 06, 2025 at 09:34:12AM -0700, Dave Hansen wrote:
struct kernel_pgtable_work { struct list_head list; spinlock_t lock; struct work_struct work; } kernel_pte_work;
pte_free_kernel() { struct page *page = ptdesc_magic();
guard(spinlock)(&kernel_pte_work.lock); list_add(&page->list, &kernel_pte_work.list); schedule_work(&kernel_pte_work.work); }
Oh, OK, yeah this can work
The only wrinkle is that pte_free_kernel() itself still has a pte and 'ptdesc', not a 'struct page'. But there is ptdesc->pt_list, which should be unused at this point, especially for non-pgd pages on x86.
It should all be ptdesc, so lets avoid a struct page here and use the pt_list instead..
Jason
On 8/7/2025 12:34 AM, Dave Hansen wrote:
On 8/6/25 09:09, Jason Gunthorpe wrote:
You can't do this approach without also pushing the pages to freed on a list and defering the free till the work. This is broadly what the normal mm user flow is doing..
FWIW, I think the simplest way to do this is to plop an unconditional schedule_work() in pte_free_kernel(). The work function will invalidate the IOTLBs and then free the page.
Keep the schedule_work() unconditional to keep it simple. The schedule_work() is way cheaper than all the system-wide TLB invalidation IPIs that have to get sent as well. No need to add complexity to optimize out something that's in the noise already.
That works also, but now you have to allocate memory or you are dead.. Is it OK these days, and safe in this code which seems a little bit linked to memory management?
The MM side avoided this by putting the list and the rcu_head in the struct page.
I don't think you need to allocate memory. A little static structure that uses the page->list and has a lock should do. Logically something like this:
struct kernel_pgtable_work { struct list_head list; spinlock_t lock; struct work_struct work; } kernel_pte_work;
pte_free_kernel() { struct page *page = ptdesc_magic();
guard(spinlock)(&kernel_pte_work.lock); list_add(&page->list, &kernel_pte_work.list); schedule_work(&kernel_pte_work.work); }
work_func() { iommu_sva_invalidate_kva();
guard(spinlock)(&kernel_pte_work.lock);
list_for_each_safe() { page = container_of(...); free_whatever(page); } }
The only wrinkle is that pte_free_kernel() itself still has a pte and 'ptdesc', not a 'struct page'. But there is ptdesc->pt_list, which should be unused at this point, especially for non-pgd pages on x86.
So, either go over to the 'struct page' earlier (maybe by open-coding pagetable_dtor_free()?), or just use the ptdesc.
I refactored the code above as follows. It compiles but hasn't been tested yet. Does it look good to you?
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index c88691b15f3c..d9307dd09f67 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -10,9 +10,11 @@
 
 #define __HAVE_ARCH_PTE_ALLOC_ONE
 #define __HAVE_ARCH_PGD_FREE
+#define __HAVE_ARCH_PTE_FREE_KERNEL
 #include <asm-generic/pgalloc.h>
 
 static inline int __paravirt_pgd_alloc(struct mm_struct *mm) { return 0; }
+void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 
 #ifdef CONFIG_PARAVIRT_XXL
 #include <asm/paravirt.h>
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ddf248c3ee7d..f9f6738dd3cc 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -2,6 +2,7 @@
 #include <linux/mm.h>
 #include <linux/gfp.h>
 #include <linux/hugetlb.h>
+#include <linux/iommu.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/fixmap.h>
@@ -844,3 +845,42 @@ void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
 	/* See note in arch_check_zapped_pte() */
 	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && pud_shstk(pud));
 }
+
+static void kernel_pte_work_func(struct work_struct *work);
+
+static struct {
+	struct list_head list;
+	spinlock_t lock;
+	struct work_struct work;
+} kernel_pte_work = {
+	.list = LIST_HEAD_INIT(kernel_pte_work.list),
+	.lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock),
+	.work = __WORK_INITIALIZER(kernel_pte_work.work, kernel_pte_work_func),
+};
+
+static void kernel_pte_work_func(struct work_struct *work)
+{
+	struct page *page, *next;
+
+	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
+
+	guard(spinlock)(&kernel_pte_work.lock);
+	list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) {
+		list_del_init(&page->lru);
+		pagetable_dtor_free(page_ptdesc(page));
+	}
+}
+
+/**
+ * pte_free_kernel - free PTE-level kernel page table memory
+ * @mm: the mm_struct of the current context
+ * @pte: pointer to the memory containing the page table
+ */
+void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+{
+	struct page *page = virt_to_page(pte);
+
+	guard(spinlock)(&kernel_pte_work.lock);
+	list_add(&page->lru, &kernel_pte_work.list);
+	schedule_work(&kernel_pte_work.work);
+}
diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 3c8ec3bfea44..716ebab67636 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -46,6 +46,7 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm)
 #define pte_alloc_one_kernel(...)	alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__))
 #endif
 
+#ifndef __HAVE_ARCH_PTE_FREE_KERNEL
 /**
  * pte_free_kernel - free PTE-level kernel page table memory
  * @mm: the mm_struct of the current context
@@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
 	pagetable_dtor_free(virt_to_ptdesc(pte));
 }
+#endif
 
 /**
  * __pte_alloc_one - allocate memory for a PTE-level user page table
On 8/7/25 07:40, Baolu Lu wrote: ...
I refactored the code above as follows. It compiles but hasn't been tested yet. Does it look good to you?
As in, it takes the non-compiling gunk I spewed into my email client and makes it compile, yeah. Sure. ;)
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/ pgalloc.h index c88691b15f3c..d9307dd09f67 100644 --- a/arch/x86/include/asm/pgalloc.h +++ b/arch/x86/include/asm/pgalloc.h @@ -10,9 +10,11 @@
#define __HAVE_ARCH_PTE_ALLOC_ONE #define __HAVE_ARCH_PGD_FREE +#define __HAVE_ARCH_PTE_FREE_KERNEL
But I think it really muddies the waters down here.
This kinda reads like "x86 has its own per-arch pte_free_kernel() that it always needs". Which is far from accurate.
@@ -844,3 +845,42 @@ void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud) /* See note in arch_check_zapped_pte() */ VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && pud_shstk(pud)); }
+static void kernel_pte_work_func(struct work_struct *work);
+static struct { + struct list_head list; + spinlock_t lock; + struct work_struct work; +} kernel_pte_work = { + .list = LIST_HEAD_INIT(kernel_pte_work.list), + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), + .work = __WORK_INITIALIZER(kernel_pte_work.work, kernel_pte_work_func), +};
+static void kernel_pte_work_func(struct work_struct *work) +{ + struct page *page, *next;
+ iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
+ guard(spinlock)(&kernel_pte_work.lock); + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { + list_del_init(&page->lru); + pagetable_dtor_free(page_ptdesc(page)); + } +}
+/**
- pte_free_kernel - free PTE-level kernel page table memory
- @mm: the mm_struct of the current context
- @pte: pointer to the memory containing the page table
- */
The kerneldoc here is just wasted bytes, IMNHO. Why not use those bytes to actually explain what the heck is going on here?
+void pte_free_kernel(struct mm_struct *mm, pte_t *pte) +{ + struct page *page = virt_to_page(pte);
+ guard(spinlock)(&kernel_pte_work.lock); + list_add(&page->lru, &kernel_pte_work.list); + schedule_work(&kernel_pte_work.work); +} diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 3c8ec3bfea44..716ebab67636 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,7 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif
+#ifndef __HAVE_ARCH_PTE_FREE_KERNEL /** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { pagetable_dtor_free(virt_to_ptdesc(pte)); } +#endif
/** * __pte_alloc_one - allocate memory for a PTE-level user page table
I'd much rather the arch-generic code looked like this:
#ifdef CONFIG_ASYNC_PGTABLE_FREE
// code and struct here, or dump them over in some
// other file and do this in a header
#else
static void pte_free_kernel_async(struct page *page) {}
#endif

void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
{
	struct page *page = virt_to_page(pte);

	if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE))
		pte_free_kernel_async(page);
	else
		pagetable_dtor_free(page_ptdesc(page));
}
Then in Kconfig, you end up with something like:
config ASYNC_PGTABLE_FREE
	def_bool y
	depends on INTEL_IOMMU_WHATEVER
That very much tells much more of the whole story in code. It also gives the x86 folks that compile out the IOMMU the exact same code as the arch-generic folks. It _also_ makes it dirt simple and obvious for the x86 folks to optimize out the async behavior if they don't like it in the future by replacing the compile-time IOMMU check with a runtime one.
Also, if another crazy IOMMU implementation comes along that happens to do what the x86 IOMMUs do, then they have a single Kconfig switch to flip. If they follow what this patch tries to do, they'll start by copying and pasting the x86 implementation.
On 8/7/25 23:31, Dave Hansen wrote:
+void pte_free_kernel(struct mm_struct *mm, pte_t *pte) +{ + struct page *page = virt_to_page(pte);
+ guard(spinlock)(&kernel_pte_work.lock); + list_add(&page->lru, &kernel_pte_work.list); + schedule_work(&kernel_pte_work.work); +} diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 3c8ec3bfea44..716ebab67636 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,7 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif
+#ifndef __HAVE_ARCH_PTE_FREE_KERNEL /** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { pagetable_dtor_free(virt_to_ptdesc(pte)); } +#endif
/** * __pte_alloc_one - allocate memory for a PTE-level user page table
I'd much rather the arch-generic code looked like this:
#ifdef CONFIG_ASYNC_PGTABLE_FREE // code and struct here, or dump them over in some // other file and do this in a header #else static void pte_free_kernel_async(struct page *page) {} #endif
void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { struct page *page = virt_to_page(pte);
if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE)) {
pte_free_kernel_async(page); else pagetable_dtor_free(page_ptdesc(page)); }
Then in Kconfig, you end up with something like:
config ASYNC_PGTABLE_FREE def_bool y depends on INTEL_IOMMU_WHATEVER
That very much tells much more of the whole story in code. It also gives the x86 folks that compile out the IOMMU the exact same code as the arch-generic folks. It_also_ makes it dirt simple and obvious for the x86 folks to optimize out the async behavior if they don't like it in the future by replacing the compile-time IOMMU check with a runtime one.
Also, if another crazy IOMMU implementation comes along that happens to do what the x86 IOMMUs do, then they have a single Kconfig switch to flip. If they follow what this patch tries to do, they'll start by copying and pasting the x86 implementation.
I'll do it like this. Does that look good to you?
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 70d29b14d851..6f1113e024fa 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -160,6 +160,7 @@ config IOMMU_DMA
 # Shared Virtual Addressing
 config IOMMU_SVA
 	select IOMMU_MM_DATA
+	select ASYNC_PGTABLE_FREE if X86
 	bool
 
 config IOMMU_IOPF
diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 3c8ec3bfea44..dbddacdca2ce 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -46,6 +46,19 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm)
 #define pte_alloc_one_kernel(...)	alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__))
 #endif
 
+#ifdef CONFIG_ASYNC_PGTABLE_FREE
+struct pgtable_free_work {
+	struct list_head list;
+	spinlock_t lock;
+	struct work_struct work;
+};
+extern struct pgtable_free_work kernel_pte_work;
+
+void pte_free_kernel_async(struct ptdesc *ptdesc);
+#else
+static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {}
+#endif
+
 /**
  * pte_free_kernel - free PTE-level kernel page table memory
  * @mm: the mm_struct of the current context
@@ -53,7 +66,12 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm)
  */
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
-	pagetable_dtor_free(virt_to_ptdesc(pte));
+	struct ptdesc *ptdesc = virt_to_ptdesc(pte);
+
+	if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE))
+		pte_free_kernel_async(ptdesc);
+	else
+		pagetable_dtor_free(ptdesc);
 }
 
 /**
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf..528550cfa7fe 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1346,6 +1346,13 @@ config LOCK_MM_AND_FIND_VMA
 config IOMMU_MM_DATA
 	bool
 
+config ASYNC_PGTABLE_FREE
+	bool "Asynchronous kernel page table freeing"
+	help
+	  Perform kernel page table freeing asynchronously. This is required
+	  for systems with IOMMU Shared Virtual Address (SVA) to flush IOTLB
+	  paging structure caches.
+
 config EXECMEM
 	bool
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..6639ee6641d4 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/mm_inline.h>
+#include <linux/iommu.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
@@ -406,3 +407,32 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 	pte_unmap_unlock(pte, ptl);
 	goto again;
 }
+
+#ifdef CONFIG_ASYNC_PGTABLE_FREE
+static void kernel_pte_work_func(struct work_struct *work);
+struct pgtable_free_work kernel_pte_work = {
+	.list = LIST_HEAD_INIT(kernel_pte_work.list),
+	.lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock),
+	.work = __WORK_INITIALIZER(kernel_pte_work.work, kernel_pte_work_func),
+};
+
+static void kernel_pte_work_func(struct work_struct *work)
+{
+	struct ptdesc *ptdesc, *next;
+
+	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
+
+	guard(spinlock)(&kernel_pte_work.lock);
+	list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) {
+		list_del_init(&ptdesc->pt_list);
+		pagetable_dtor_free(ptdesc);
+	}
+}
+
+void pte_free_kernel_async(struct ptdesc *ptdesc)
+{
+	guard(spinlock)(&kernel_pte_work.lock);
+	list_add(&ptdesc->pt_list, &kernel_pte_work.list);
+	schedule_work(&kernel_pte_work.work);
+}
+#endif
On 8/8/2025 1:15 PM, Baolu Lu wrote:
On 8/7/25 23:31, Dave Hansen wrote:
+void pte_free_kernel(struct mm_struct *mm, pte_t *pte) +{ + struct page *page = virt_to_page(pte);
+ guard(spinlock)(&kernel_pte_work.lock); + list_add(&page->lru, &kernel_pte_work.list); + schedule_work(&kernel_pte_work.work); +} diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/ pgalloc.h index 3c8ec3bfea44..716ebab67636 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,7 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif
+#ifndef __HAVE_ARCH_PTE_FREE_KERNEL /** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { pagetable_dtor_free(virt_to_ptdesc(pte)); } +#endif
/** * __pte_alloc_one - allocate memory for a PTE-level user page table
I'd much rather the arch-generic code looked like this:
#ifdef CONFIG_ASYNC_PGTABLE_FREE // code and struct here, or dump them over in some // other file and do this in a header #else static void pte_free_kernel_async(struct page *page) {} #endif
void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { struct page *page = virt_to_page(pte);
if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE)) { pte_free_kernel_async(page); else pagetable_dtor_free(page_ptdesc(page)); }
Then in Kconfig, you end up with something like:
config ASYNC_PGTABLE_FREE def_bool y depends on INTEL_IOMMU_WHATEVER
That very much tells much more of the whole story in code. It also gives the x86 folks that compile out the IOMMU the exact same code as the arch-generic folks. It_also_ makes it dirt simple and obvious for the x86 folks to optimize out the async behavior if they don't like it in the future by replacing the compile-time IOMMU check with a runtime one.
Also, if another crazy IOMMU implementation comes along that happens to do what the x86 IOMMUs do, then they have a single Kconfig switch to flip. If they follow what this patch tries to do, they'll start by copying and pasting the x86 implementation.
I'll do it like this. Does that look good to you?
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 70d29b14d851..6f1113e024fa 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -160,6 +160,7 @@ config IOMMU_DMA # Shared Virtual Addressing config IOMMU_SVA select IOMMU_MM_DATA + select ASYNC_PGTABLE_FREE if X86 bool
config IOMMU_IOPF diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 3c8ec3bfea44..dbddacdca2ce 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,19 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif
+#ifdef CONFIG_ASYNC_PGTABLE_FREE +struct pgtable_free_work { + struct list_head list; + spinlock_t lock; + struct work_struct work; +}; +extern struct pgtable_free_work kernel_pte_work;
+void pte_free_kernel_async(struct ptdesc *ptdesc); +#else +static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} +#endif
/** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -53,7 +66,12 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) */ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { - pagetable_dtor_free(virt_to_ptdesc(pte)); + struct ptdesc *ptdesc = virt_to_ptdesc(pte);
+ if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + pte_free_kernel_async(ptdesc); + else + pagetable_dtor_free(ptdesc); }
/** diff --git a/mm/Kconfig b/mm/Kconfig index e443fe8cd6cf..528550cfa7fe 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1346,6 +1346,13 @@ config LOCK_MM_AND_FIND_VMA config IOMMU_MM_DATA bool
+config ASYNC_PGTABLE_FREE + bool "Asynchronous kernel page table freeing" + help + Perform kernel page table freeing asynchronously. This is required + for systems with IOMMU Shared Virtual Address (SVA) to flush IOTLB + paging structure caches.
config EXECMEM bool
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 567e2d084071..6639ee6641d4 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -13,6 +13,7 @@ #include <linux/swap.h> #include <linux/swapops.h> #include <linux/mm_inline.h> +#include <linux/iommu.h> #include <asm/pgalloc.h> #include <asm/tlb.h>
@@ -406,3 +407,32 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd, pte_unmap_unlock(pte, ptl); goto again; }
+#ifdef CONFIG_ASYNC_PGTABLE_FREE +static void kernel_pte_work_func(struct work_struct *work); +struct pgtable_free_work kernel_pte_work = { + .list = LIST_HEAD_INIT(kernel_pte_work.list), + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), + .work = __WORK_INITIALIZER(kernel_pte_work.work, kernel_pte_work_func), +};
+static void kernel_pte_work_func(struct work_struct *work) +{ + struct ptdesc *ptdesc, *next;
+ iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
+ guard(spinlock)(&kernel_pte_work.lock); + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) { + list_del_init(&ptdesc->pt_list); + pagetable_dtor_free(ptdesc); + } +}
+void pte_free_kernel_async(struct ptdesc *ptdesc) +{ + guard(spinlock)(&kernel_pte_work.lock); + list_add(&ptdesc->pt_list, &kernel_pte_work.list); + schedule_work(&kernel_pte_work.work); +}
kernel_pte_work.list is a global shared variable; it forces the producer pte_free_kernel() and the consumer kernel_pte_work_func() to operate in a serialized fashion. In a large system, I don't think you designed this deliberately :)
Thanks, Ethan
+#endif
On Sun, Aug 10, 2025 at 03:19:58PM +0800, Ethan Zhao wrote:
On 8/8/2025 1:15 PM, Baolu Lu wrote:
On 8/7/25 23:31, Dave Hansen wrote:
+void pte_free_kernel(struct mm_struct *mm, pte_t *pte) +{ + struct page *page = virt_to_page(pte);
+ guard(spinlock)(&kernel_pte_work.lock); + list_add(&page->lru, &kernel_pte_work.list); + schedule_work(&kernel_pte_work.work); +} diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/ pgalloc.h index 3c8ec3bfea44..716ebab67636 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,7 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif
+#ifndef __HAVE_ARCH_PTE_FREE_KERNEL /** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { pagetable_dtor_free(virt_to_ptdesc(pte)); } +#endif
/** * __pte_alloc_one - allocate memory for a PTE-level user page table
I'd much rather the arch-generic code looked like this:
#ifdef CONFIG_ASYNC_PGTABLE_FREE // code and struct here, or dump them over in some // other file and do this in a header #else static void pte_free_kernel_async(struct page *page) {} #endif
void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { struct page *page = virt_to_page(pte);
if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE)) { pte_free_kernel_async(page); else pagetable_dtor_free(page_ptdesc(page)); }
Then in Kconfig, you end up with something like:
config ASYNC_PGTABLE_FREE def_bool y depends on INTEL_IOMMU_WHATEVER
That very much tells much more of the whole story in code. It also gives the x86 folks that compile out the IOMMU the exact same code as the arch-generic folks. It_also_ makes it dirt simple and obvious for the x86 folks to optimize out the async behavior if they don't like it in the future by replacing the compile-time IOMMU check with a runtime one.
Also, if another crazy IOMMU implementation comes along that happens to do what the x86 IOMMUs do, then they have a single Kconfig switch to flip. If they follow what this patch tries to do, they'll start by copying and pasting the x86 implementation.
I'll do it like this. Does that look good to you?
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 70d29b14d851..6f1113e024fa 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -160,6 +160,7 @@ config IOMMU_DMA # Shared Virtual Addressing config IOMMU_SVA select IOMMU_MM_DATA + select ASYNC_PGTABLE_FREE if X86 bool
config IOMMU_IOPF diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 3c8ec3bfea44..dbddacdca2ce 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,19 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif
+#ifdef CONFIG_ASYNC_PGTABLE_FREE +struct pgtable_free_work { + struct list_head list; + spinlock_t lock; + struct work_struct work; +}; +extern struct pgtable_free_work kernel_pte_work;
+void pte_free_kernel_async(struct ptdesc *ptdesc); +#else +static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} +#endif
/** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -53,7 +66,12 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) */ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { - pagetable_dtor_free(virt_to_ptdesc(pte)); + struct ptdesc *ptdesc = virt_to_ptdesc(pte);
+ if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + pte_free_kernel_async(ptdesc); + else + pagetable_dtor_free(ptdesc); }
/** diff --git a/mm/Kconfig b/mm/Kconfig index e443fe8cd6cf..528550cfa7fe 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1346,6 +1346,13 @@ config LOCK_MM_AND_FIND_VMA config IOMMU_MM_DATA bool
+config ASYNC_PGTABLE_FREE + bool "Asynchronous kernel page table freeing" + help + Perform kernel page table freeing asynchronously. This is required + for systems with IOMMU Shared Virtual Address (SVA) to flush IOTLB + paging structure caches.
config EXECMEM bool
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 567e2d084071..6639ee6641d4 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -13,6 +13,7 @@ #include <linux/swap.h> #include <linux/swapops.h> #include <linux/mm_inline.h> +#include <linux/iommu.h> #include <asm/pgalloc.h> #include <asm/tlb.h>
@@ -406,3 +407,32 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd, pte_unmap_unlock(pte, ptl); goto again; }
+#ifdef CONFIG_ASYNC_PGTABLE_FREE +static void kernel_pte_work_func(struct work_struct *work); +struct pgtable_free_work kernel_pte_work = { + .list = LIST_HEAD_INIT(kernel_pte_work.list), + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), + .work = __WORK_INITIALIZER(kernel_pte_work.work, kernel_pte_work_func), +};
+static void kernel_pte_work_func(struct work_struct *work) +{ + struct ptdesc *ptdesc, *next;
+ iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
+ guard(spinlock)(&kernel_pte_work.lock); + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) { + list_del_init(&ptdesc->pt_list); + pagetable_dtor_free(ptdesc); + } +}
+void pte_free_kernel_async(struct ptdesc *ptdesc) +{ + guard(spinlock)(&kernel_pte_work.lock); + list_add(&ptdesc->pt_list, &kernel_pte_work.list); + schedule_work(&kernel_pte_work.work); +}
kernel_pte_work.list is global shared var, it would make the producer pte_free_kernel() and the consumer kernel_pte_work_func() to operate in serialized timing. In a large system, I don't think you design this deliberately :)
Sorry for jumping.
Agree, unless it is never considered a hot path or something that can be really contended. It looks like you can just use a per-CPU llist to drain things.
As for reference you can have a look at how vfree_atomic() handles deferred freeing.
Thanks!
-- Uladzislau Rezki
On Mon, Aug 11, 2025 at 11:15:58AM +0200, Uladzislau Rezki wrote:
Agree, unless it is never considered as a hot path or something that can be really contented. It looks like you can use just a per-cpu llist to drain thinks.
I think we already largely agreed this was not a true fast path, I don't think we need a per-cpu here.
But I would not make it extern unless there is a user?
Jason
On 8/11/2025 8:55 PM, Jason Gunthorpe wrote:
On Mon, Aug 11, 2025 at 11:15:58AM +0200, Uladzislau Rezki wrote:
Agree, unless it is never considered as a hot path or something that can be really contented. It looks like you can use just a per-cpu llist to drain thinks.
I think we already largely agreed this was not a true fast path, I don't think we need a per-cpu here.
But I would not make it extern unless there is a user?
Fixed. Thanks!
Thanks, baolu
On 8/11/25 02:15, Uladzislau Rezki wrote:
kernel_pte_work.list is global shared var, it would make the producer pte_free_kernel() and the consumer kernel_pte_work_func() to operate in serialized timing. In a large system, I don't think you design this deliberately 🙂
Sorry for jumping.
Agree, unless it is never considered as a hot path or something that can be really contented. It looks like you can use just a per-cpu llist to drain thinks.
Remember, the code that has to run just before all this sent an IPI to every single CPU on the system to have them do a (on x86 at least) pretty expensive TLB flush.
If this is a hot path, we have bigger problems on our hands: the full TLB flush on every CPU.
So, sure, there are a million ways to make this deferred freeing more scalable. But the code that's here is dirt simple and self contained. If someone has some ideas for something that's simpler and more scalable, then I'm totally open to it.
But this is _not_ the place to add complexity to get scalability.
On Mon, Aug 11, 2025 at 06:55:52AM -0700, Dave Hansen wrote:
On 8/11/25 02:15, Uladzislau Rezki wrote:
kernel_pte_work.list is global shared var, it would make the producer pte_free_kernel() and the consumer kernel_pte_work_func() to operate in serialized timing. In a large system, I don't think you design this deliberately 🙂
Sorry for jumping.
Agree, unless it is never considered as a hot path or something that can be really contented. It looks like you can use just a per-cpu llist to drain thinks.
Remember, the code that has to run just before all this sent an IPI to every single CPU on the system to have them do a (on x86 at least) pretty expensive TLB flush.
If this is a hot path, we have bigger problems on our hands: the full TLB flush on every CPU.
So, sure, there are a million ways to make this deferred freeing more scalable. But the code that's here is dirt simple and self contained. If someone has some ideas for something that's simpler and more scalable, then I'm totally open to it.
You could also have a look toward removing the &kernel_pte_work.lock. Replace it with llist_add() on the adding side and llist_for_each_safe(n, t, llist_del_all(&list)) on the removing side, so you do not need the guard(spinlock) stuff.
If I am not missing anything.
But this is _not_ the place to add complexity to get scalability.
OK.
-- Uladzislau Rezki
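For illustration, a minimal sketch of the llist approach Uladzislau suggests above, reusing the names from Baolu's draft earlier in the thread. It assumes, purely hypothetically, that a "struct llist_node pt_llist" member were available in struct ptdesc (it is not today); the point is only the lock-free add/del_all pattern:

	static LLIST_HEAD(kernel_pte_free_list);
	static void kernel_pte_work_func(struct work_struct *work);
	static DECLARE_WORK(kernel_pte_work, kernel_pte_work_func);

	void pte_free_kernel_async(struct ptdesc *ptdesc)
	{
		/* Lock-free producer side, no spinlock needed. */
		llist_add(&ptdesc->pt_llist, &kernel_pte_free_list);
		schedule_work(&kernel_pte_work);
	}

	static void kernel_pte_work_func(struct work_struct *work)
	{
		struct llist_node *first, *node, *next;

		/*
		 * Detach the whole list first, so every detached entry was
		 * queued (and unlinked from the page tables) before the
		 * flush below; then free without holding any lock.
		 */
		first = llist_del_all(&kernel_pte_free_list);

		iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);

		llist_for_each_safe(node, next, first)
			pagetable_dtor_free(container_of(node, struct ptdesc,
							 pt_llist));
	}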
On 8/11/2025 9:55 PM, Dave Hansen wrote:
On 8/11/25 02:15, Uladzislau Rezki wrote:
kernel_pte_work.list is global shared var, it would make the producer pte_free_kernel() and the consumer kernel_pte_work_func() to operate in serialized timing. In a large system, I don't think you design this deliberately 🙂
Sorry for jumping.
Agree, unless it is never considered as a hot path or something that can be really contented. It looks like you can use just a per-cpu llist to drain thinks.
Remember, the code that has to run just before all this sent an IPI to every single CPU on the system to have them do a (on x86 at least) pretty expensive TLB flush.
It can be easily identified as a bottleneck by multi-CPU stress testing programs involving frequent process creation and destruction, similar to the operation of a heavily loaded multi-process Apache web server. Hot/cold path ?
If this is a hot path, we have bigger problems on our hands: the full TLB flush on every CPU.
Perhaps not "WE"; an IPI-driven TLB flush does not seem to be a mechanism shared by all CPU architectures, at least not ARM as far as I know.
So, sure, there are a million ways to make this deferred freeing more scalable. But the code that's here is dirt simple and self contained. If someone has some ideas for something that's simpler and more scalable, then I'm totally open to it.
But this is _not_ the place to add complexity to get scalability.
At least, please don't add a bottleneck; how complex is it to avoid that?
Thanks, Ethan
On 8/11/25 18:17, Ethan Zhao wrote:
But this is _not_ the place to add complexity to get scalability.
At least, please dont add bottleneck, how complex to do that ?
Very good question! If you're really interested and concerned about this, I'd encourage you to demonstrate where the contention becomes a problem in practice, then post a patch to fix it.
If it's simple, I'll happily ack it!
On Fri, Aug 08, 2025 at 01:15:12PM +0800, Baolu Lu wrote:
+static void kernel_pte_work_func(struct work_struct *work) +{
- struct ptdesc *ptdesc, *next;
- iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
- guard(spinlock)(&kernel_pte_work.lock);
- list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) {
list_del_init(&ptdesc->pt_list);
pagetable_dtor_free(ptdesc);
- }
Do a list_move from kernel_pte_work.list to an on-stack list head and then immediately release the lock. No reason to hold the spinlock while doing frees; also no reason to do list_del_init, that memory probably gets zeroed in pagetable_dtor_free()
Jason
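For illustration, a minimal sketch of what Jason describes above, assuming the kernel_pte_work layout and pt_list usage from Baolu's draft; list_splice_init() stands in here for the suggested list_move, detaching the whole list onto an on-stack head:

	static void kernel_pte_work_func(struct work_struct *work)
	{
		struct ptdesc *ptdesc, *next;
		LIST_HEAD(free_list);

		/* Hold the lock only for the list manipulation. */
		spin_lock(&kernel_pte_work.lock);
		list_splice_init(&kernel_pte_work.list, &free_list);
		spin_unlock(&kernel_pte_work.lock);

		iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);

		/* Free outside the lock; no list_del_init() needed. */
		list_for_each_entry_safe(ptdesc, next, &free_list, pt_list)
			pagetable_dtor_free(ptdesc);
	}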
On 8/11/2025 8:57 PM, Jason Gunthorpe wrote:
On Fri, Aug 08, 2025 at 01:15:12PM +0800, Baolu Lu wrote:
+static void kernel_pte_work_func(struct work_struct *work) +{
- struct ptdesc *ptdesc, *next;
- iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
- guard(spinlock)(&kernel_pte_work.lock);
- list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) {
list_del_init(&ptdesc->pt_list);
pagetable_dtor_free(ptdesc);
- }
Do a list_move from kernel_pte_work.list to an on-stack list head and then immediately release the lock. No reason to hold the spinock while doing frees, also no reason to do list_del_init, that memory probably gets zerod in pagetable_dtor_free()
Yep, using guard(spinlock)() for scope-bound lock management sacrifices fine-grained control over the protected area. It offers convenience at the cost of precision.
Out of my bias, calling it sluggard(spinlock)() might be proper.
Thanks, Ethan
Jason
On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote:
+static void kernel_pte_work_func(struct work_struct *work) +{
- struct page *page, *next;
- iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
- guard(spinlock)(&kernel_pte_work.lock);
- list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) {
list_del_init(&page->lru);
Please don't add new usages of lru, we are trying to get rid of this. :(
I think the memory should be struct ptdesc, use that..
Jason
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, August 8, 2025 3:52 AM
On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote:
+static void kernel_pte_work_func(struct work_struct *work) +{
- struct page *page, *next;
- iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
- guard(spinlock)(&kernel_pte_work.lock);
- list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) {
list_del_init(&page->lru);
Please don't add new usages of lru, we are trying to get rid of this. :(
I think the memory should be struct ptdesc, use that..
btw with this change we should also defer free of the pmd page:
pud_free_pmd_page()
	...
	for (i = 0; i < PTRS_PER_PMD; i++) {
		if (!pmd_none(pmd_sv[i])) {
			pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
			pte_free_kernel(&init_mm, pte);
		}
	}

	free_page((unsigned long)pmd_sv);
Otherwise the risk still exists if the pmd page is repurposed before the pte work is scheduled.
another observation - pte_free_kernel() is not used in remove_pagetable() and __change_page_attr(). Is it straightforward to put it in those paths, or do we need to duplicate some deferring logic there?
On 8/8/2025 10:57 AM, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, August 8, 2025 3:52 AM
On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote:
+static void kernel_pte_work_func(struct work_struct *work) +{
- struct page *page, *next;
- iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
- guard(spinlock)(&kernel_pte_work.lock);
- list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) {
list_del_init(&page->lru);
Please don't add new usages of lru, we are trying to get rid of this. :(
I think the memory should be struct ptdesc, use that..
btw with this change we should also defer free of the pmd page:
pud_free_pmd_page() ... for (i = 0; i < PTRS_PER_PMD; i++) { if (!pmd_none(pmd_sv[i])) { pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); pte_free_kernel(&init_mm, pte); } }
free_page((unsigned long)pmd_sv);
Otherwise the risk still exists if the pmd page is repurposed before the pte work is scheduled.
You're right that freeing high-level page table pages also requires an IOTLB flush before the pages are freed. But I question the practical risk of the race given the extremely small time window. If this is a real concern, a potential mitigation would be to clear the U/S bits in all page table entries for kernel address space? But I am not confident in making that change at this time as I am unsure of the side effects it might cause.
another observation - pte_free_kernel is not used in remove_pagetable () and __change_page_attr(). Is it straightforward to put it in those paths or do we need duplicate some deferring logic there?
The remove_pagetable() function is called in the path where memory is hot-removed from the system, right? If so, there should be no issue, as the threat model here is a page table page being freed and repurposed while it's still cached in the IOTLB. In the hot-remove case, the memory is removed and will not be reused, so that's fine as far as I can see.
The same to __change_page_attr(), which only changes the attributes of a page table entry while the underlying page remains in use.
Thanks, baolu
From: Baolu Lu baolu.lu@linux.intel.com Sent: Friday, August 15, 2025 5:17 PM
On 8/8/2025 10:57 AM, Tian, Kevin wrote:
From: Jason Gunthorpe jgg@nvidia.com Sent: Friday, August 8, 2025 3:52 AM
On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote:
+static void kernel_pte_work_func(struct work_struct *work) +{
- struct page *page, *next;
- iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
- guard(spinlock)(&kernel_pte_work.lock);
- list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) {
list_del_init(&page->lru);
Please don't add new usages of lru, we are trying to get rid of this. :(
I think the memory should be struct ptdesc, use that..
btw with this change we should also defer free of the pmd page:
pud_free_pmd_page() ... for (i = 0; i < PTRS_PER_PMD; i++) { if (!pmd_none(pmd_sv[i])) { pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); pte_free_kernel(&init_mm, pte); } }
free_page((unsigned long)pmd_sv);
Otherwise the risk still exists if the pmd page is repurposed before the pte work is scheduled.
You're right that freeing high-level page table pages also requires an IOTLB flush before the pages are freed. But I question the practical risk of the race given the extremely small time window. If this is a
It's already extremely difficult to conduct a real attack even w/o this fix, and I'm not sure what criteria we'd use to decide how small a window counts as acceptable in this specific case. But leaving an incomplete fix in the code doesn't sound clean...
real concern, a potential mitigation would be to clear the U/S bits in all page table entries for kernel address space? But I am not confident in making that change at this time as I am unsure of the side effects it might cause.
I think there was already consensus that clearing U/S bits in all entries doesn't prevent the IOMMU caching them and setting A/D bits on the freed pagetable.
another observation - pte_free_kernel is not used in remove_pagetable () and __change_page_attr(). Is it straightforward to put it in those paths or do we need duplicate some deferring logic there?
The remove_pagetable() function is called in the path where memory is hot-removed from the system, right? If so, there should be no issue, as the threat model here is a page table page being freed and repurposed while it's still cached in the IOTLB. In the hot-remove case, the memory is removed and will not be reused, so that's fine as far as I can see.
What about the case where the page is hot-added back while a stale entry pointing to it is still valid in the IOMMU, theoretically? 😊
The same to __change_page_attr(), which only changes the attributes of a page table entry while the underlying page remains in use.
It may lead to cpa_collapse_large_pages() if the attribute change leaves all adjacent 4K pages in a 2M range with the same attribute. Then the page table page might be freed:
cpa_collapse_large_pages():
	list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) {
		list_del(&ptdesc->pt_list);
		__free_page(ptdesc_page(ptdesc));
	}
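If this path turns out to need the same treatment, one possible shape (sketch only, reusing the hypothetical kernel_pgtable_defer_free() helper mentioned earlier; nothing here is from the posted series) would be to route the collapsed page-table pages through the deferred free instead of __free_page():

	list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) {
		list_del(&ptdesc->pt_list);
		/* Defer the free until after the IOTLB flush has run. */
		kernel_pgtable_defer_free(ptdesc_page(ptdesc));
	}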
On 8/15/25 02:16, Baolu Lu wrote:
On 8/8/2025 10:57 AM, Tian, Kevin wrote:
pud_free_pmd_page()
	...
	for (i = 0; i < PTRS_PER_PMD; i++) {
		if (!pmd_none(pmd_sv[i])) {
			pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
			pte_free_kernel(&init_mm, pte);
		}
	}

	free_page((unsigned long)pmd_sv);
Otherwise the risk still exists if the pmd page is repurposed before the pte work is scheduled.
You're right that freeing high-level page table pages also requires an IOTLB flush before the pages are freed. But I question the practical risk of the race given the extremely small time window.
I hear that Linux is gaining popularity these days. There might even be dozens of users! Given that large scale of dozens (or even hundreds??) of users, I would suggest exercising some care. The race might be small but it only needs to happen once to cause chaos.
Seriously, though... A race is a race. Preemption or interrupts or SMIs or VMExits or a million other things can cause a "small time window" to become a big time window.
Even perceived small races need to be fixed.
If this is a real concern, a potential mitigation would be to clear the U/S bits in all page table entries for kernel address space? But I am not confident in making that change at this time as I am unsure of the side effects it might cause.
That doesn't do any good. I even went as far as double-checking months ago with the IOMMU hardware folks to confirm the actual implementation. I'm really surprised this is being brought up again.
another observation - pte_free_kernel is not used in remove_pagetable () and __change_page_attr(). Is it straightforward to put it in those paths or do we need duplicate some deferring logic there?
The remove_pagetable() function is called in the path where memory is hot-removed from the system, right?
No. Not right.
This is in the vmalloc() code: the side of things that _creates_ mappings for new allocations, not tears them down.
On 8/8/25 03:51, Jason Gunthorpe wrote:
On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote:
+static void kernel_pte_work_func(struct work_struct *work)
+{
+	struct page *page, *next;
+
+	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
+
+	guard(spinlock)(&kernel_pte_work.lock);
+	list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) {
+		list_del_init(&page->lru);
Please don't add new usages of lru, we are trying to get rid of this. 🙁
I think the memory should be struct ptdesc, use that..
Yes, sure.
Thanks, baolu
On 8/6/25 23:03, Dave Hansen wrote:
On 8/5/25 22:25, Lu Baolu wrote:
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The Linux x86 architecture maps the kernel address space into the upper portion of every process’s page table. Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel space mappings. However, the Linux kernel currently lacks a notification mechanism for kernel space mapping changes. This means the IOMMU driver is not aware of such changes, leading to a break in IOMMU cache coherence.
FWIW, I wouldn't use the term "cache coherence" in this context. I'd probably just call them "stale IOTLB entries".
I also think this over states the problem. There is currently no problem with "kernel space mapping changes". The issue is solely when kernel page table pages are freed and reused.
Modern IOMMUs often cache page table entries of the intermediate-level page table as long as the entry is valid, no matter the permissions, to optimize walk performance. Currently the iommu driver is notified only for changes of user VA mappings, so the IOMMU's internal caches may retain stale entries for kernel VA. When kernel page table mappings are changed (e.g., by vfree()), but the IOMMU's internal caches retain stale entries, Use-After-Free (UAF) vulnerability condition arises.
If these freed page table pages are reallocated for a different purpose, potentially by an attacker, the IOMMU could misinterpret the new data as valid page table entries. This allows the IOMMU to walk into attacker- controlled memory, leading to arbitrary physical memory DMA access or privilege escalation.
Note that it's not just use-after-free. It's literally that the IOMMU will keep writing Accessed and Dirty bits while it thinks the page is still a page table. The IOMMU will sit there happily setting bits. So, it's _write_ after free too.
To mitigate this, introduce a new iommu interface to flush IOMMU caches. This interface should be invoked from architecture-specific code that manages combined user and kernel page tables, whenever a kernel page table update is done and the CPU TLB needs to be flushed.
There's one tidbit missing from this:
Currently SVA contexts are all unprivileged. They can only access user mappings and never kernel mappings. However, the IOMMU still walks kernel-only page tables all the way down to the leaf, where it realizes that the entry is a kernel mapping and errors out.
Thank you for the guidance. I will improve the commit message accordingly, as follows.
iommu/sva: Invalidate stale IOTLB entries for kernel address space
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The x86 architecture maps the kernel's virtual address space into the upper portion of every process's page table. Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel page table entries.
The Linux kernel currently lacks a notification mechanism for kernel page table changes, specifically when page table pages are freed and reused. The IOMMU driver is only notified of changes to user virtual address mappings. This can cause the IOMMU's internal caches to retain stale entries for kernel VA.
A Use-After-Free (UAF) and Write-After-Free (WAF) condition arises when kernel page table pages are freed and later reallocated. The IOMMU could misinterpret the new data as valid page table entries and walk into attacker-controlled memory, leading to arbitrary physical memory DMA access or privilege escalation. It is a Write-After-Free as well because the IOMMU continues to write Accessed and Dirty bits to the freed memory while walking the stale page tables.
Currently, SVA contexts are unprivileged and cannot access kernel mappings. However, the IOMMU will still walk kernel-only page tables all the way down to the leaf entries, where it realizes the mapping is for the kernel and errors out. This means the IOMMU still caches these intermediate page table entries, making the described vulnerability a real concern.
To mitigate this, a new IOMMU interface is introduced to flush IOTLB entries for the kernel address space. This interface is invoked from the x86 architecture code that manages combined user and kernel page tables, specifically when a kernel page table update requires a CPU TLB flush.
Thanks, baolu
On 8/6/2025 1:25 PM, Lu Baolu wrote:
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The Linux x86 architecture maps the kernel address space into the upper portion of every process’s page table. Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel space mappings. However, the Linux kernel currently lacks a notification mechanism for kernel space mapping changes. This means the IOMMU driver is not aware of such changes, leading to a break in IOMMU cache coherence.
Modern IOMMUs often cache page table entries of the intermediate-level page table as long as the entry is valid, no matter the permissions, to optimize walk performance. Currently the iommu driver is notified only for changes of user VA mappings, so the IOMMU's internal caches may retain stale entries for kernel VA. When kernel page table mappings are changed (e.g., by vfree()), but the IOMMU's internal caches retain stale entries, Use-After-Free (UAF) vulnerability condition arises.
If these freed page table pages are reallocated for a different purpose, potentially by an attacker, the IOMMU could misinterpret the new data as valid page table entries. This allows the IOMMU to walk into attacker- controlled memory, leading to arbitrary physical memory DMA access or privilege escalation.
To mitigate this, introduce a new iommu interface to flush IOMMU caches. This interface should be invoked from architecture-specific code that manages combined user and kernel page tables, whenever a kernel page table update is done and the CPU TLB needs to be flushed.
Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
Cc: stable@vger.kernel.org
Suggested-by: Jann Horn jannh@google.com
Co-developed-by: Jason Gunthorpe jgg@nvidia.com
Signed-off-by: Jason Gunthorpe jgg@nvidia.com
Signed-off-by: Lu Baolu baolu.lu@linux.intel.com
Reviewed-by: Jason Gunthorpe jgg@nvidia.com
Reviewed-by: Vasant Hegde vasant.hegde@amd.com
Reviewed-by: Kevin Tian kevin.tian@intel.com
Tested-by: Yi Lai yi1.lai@intel.com
 arch/x86/mm/tlb.c         |  4 +++
 drivers/iommu/iommu-sva.c | 60 ++++++++++++++++++++++++++++++++++++++-
 include/linux/iommu.h     |  4 +++
 3 files changed, 67 insertions(+), 1 deletion(-)
Change log: v3:
- iommu_sva_mms is an unbound list; iterating it in an atomic context could introduce significant latency issues. Schedule it in a kernel thread and replace the spinlock with a mutex.
- Replace the static key with a normal bool; it can be brought back if data shows the benefit.
- Invalidate KVA range in the flush_tlb_all() paths.
- All previous reviewed-bys are preserved. Please let me know if there are any objections.
v2:
- https://lore.kernel.org/linux-iommu/20250709062800.651521-1-baolu.lu@linux.i...
- Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range);
- Replace the mutex with a spinlock to make the interface usable in the critical regions.
v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux....
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..3b85e7d3ba44 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include <linux/task_work.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mmu_context.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1478,6 +1479,8 @@ void flush_tlb_all(void)
 	else
 		/* Fall back to the IPI-based invalidation. */
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
+
+	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
Establishing such a simple one-to-one connection between the CPU TLB flush and the IOMMU TLB flush is debatable. At the very least, not every process is attached to an IOMMU SVA domain; currently, devices and IOMMUs operating in scalable mode are not commonly used by every process.
Thanks,
Ethan

 }
 /* Flush an arbitrarily large range of memory with INVLPGB. */
@@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		kernel_tlb_flush_range(info);
 
 	put_flush_tlb_info();
+	iommu_sva_invalidate_kva_range(start, end);
 }
 /*
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..d0da2b3fd64b 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -10,6 +10,8 @@
 #include "iommu-priv.h"
 
 static DEFINE_MUTEX(iommu_sva_lock);
+static bool iommu_sva_present;
+static LIST_HEAD(iommu_sva_mms);
 static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
 						   struct mm_struct *mm);
 
@@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
 		return ERR_PTR(-ENOSPC);
 	}
 	iommu_mm->pasid = pasid;
+	iommu_mm->mm = mm;
 	INIT_LIST_HEAD(&iommu_mm->sva_domains);
 	/*
 	 * Make sure the write to mm->iommu_mm is not reordered in front of
@@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
 	if (ret)
 		goto out_free_domain;
 	domain->users = 1;
-	list_add(&domain->next, &mm->iommu_mm->sva_domains);
 
+	if (list_empty(&iommu_mm->sva_domains)) {
+		if (list_empty(&iommu_sva_mms))
+			WRITE_ONCE(iommu_sva_present, true);
+		list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
+	}
+	list_add(&domain->next, &iommu_mm->sva_domains);
 out:
 	refcount_set(&handle->users, 1);
 	mutex_unlock(&iommu_sva_lock);
@@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
 		list_del(&domain->next);
 		iommu_domain_free(domain);
 	}
+
+	if (list_empty(&iommu_mm->sva_domains)) {
+		list_del(&iommu_mm->mm_list_elm);
+		if (list_empty(&iommu_sva_mms))
+			WRITE_ONCE(iommu_sva_present, false);
+	}
+
 	mutex_unlock(&iommu_sva_lock);
 	kfree(handle);
 }
@@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
 
 	return domain;
 }
+
+struct kva_invalidation_work_data {
+	struct work_struct work;
+	unsigned long start;
+	unsigned long end;
+};
+
+static void invalidate_kva_func(struct work_struct *work)
+{
+	struct kva_invalidation_work_data *data =
+		container_of(work, struct kva_invalidation_work_data, work);
+	struct iommu_mm_data *iommu_mm;
+
+	guard(mutex)(&iommu_sva_lock);
+	list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
+		mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm,
+					data->start, data->end);
+
+	kfree(data);
+}
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+	struct kva_invalidation_work_data *data;
+
+	if (likely(!READ_ONCE(iommu_sva_present)))
+		return;
+
+	/* will be freed in the task function */
+	data = kzalloc(sizeof(*data), GFP_ATOMIC);
+	if (!data)
+		return;
+
+	data->start = start;
+	data->end = end;
+	INIT_WORK(&data->work, invalidate_kva_func);
+
+	/*
+	 * Since iommu_sva_mms is an unbound list, iterating it in an atomic
+	 * context could introduce significant latency issues.
+	 */
+	schedule_work(&data->work);
+}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c30d12e16473..66e4abb2df0d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1134,7 +1134,9 @@ struct iommu_sva {
 
 struct iommu_mm_data {
 	u32 pasid;
+	struct mm_struct *mm;
 	struct list_head sva_domains;
+	struct list_head mm_list_elm;
 };
 
 int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode);
@@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
 					struct mm_struct *mm);
 void iommu_sva_unbind_device(struct iommu_sva *handle);
 u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
 #else
 static inline struct iommu_sva *
 iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
 }
 
 static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
 #endif /* CONFIG_IOMMU_SVA */
 
 #ifdef CONFIG_IOMMU_IOPF
On 8/14/2025 12:48 PM, Ethan Zhao wrote:
On 8/6/2025 1:25 PM, Lu Baolu wrote:
@@ -1478,6 +1479,8 @@ void flush_tlb_all(void)
 	else
 		/* Fall back to the IPI-based invalidation. */
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
+
+ iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
Establishing such a simple one-to-one connection between the CPU TLB flush and the IOMMU TLB flush is debatable. At the very least, not every process is attached to an IOMMU SVA domain; currently, devices and IOMMUs operating in scalable mode are not commonly used by every process.
You're right. As discussed, I'll defer the IOTLB invalidation to a kernel work item scheduled from pte_free_kernel(). The iommu_sva_invalidate_kva_range() function on the IOMMU side is effectively a no-op when no SVA domain is in use.
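A minimal sketch of that direction, reusing the kernel_pte_work structure from the hunk quoted earlier in the thread (the structure layout, the ptdesc conversion, and the exact call sites are assumptions, not the final patch):

	static void kernel_pte_work_func(struct work_struct *work);

	static struct {
		spinlock_t lock;
		struct list_head list;
		struct work_struct work;
	} kernel_pte_work = {
		.lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock),
		.list = LIST_HEAD_INIT(kernel_pte_work.list),
		.work = __WORK_INITIALIZER(kernel_pte_work.work, kernel_pte_work_func),
	};

	/* Queue the kernel PTE page; the work item flushes the IOTLB first. */
	void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
	{
		struct ptdesc *ptdesc = virt_to_ptdesc(pte);

		guard(spinlock)(&kernel_pte_work.lock);
		list_add(&ptdesc->pt_list, &kernel_pte_work.list);
		schedule_work(&kernel_pte_work.work);
	}

The work function itself would then call iommu_sva_invalidate_kva_range() and free the queued pages, as in the hunk discussed above.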
Thanks, baolu