On 8/27/25 03:58, Baolu Lu wrote:
> Following the insights above, I wrote the code as follows. Does it look good?
I'd really prefer an actual diff that compiles. Because:
> #ifdef CONFIG_ASYNC_PGTABLE_FREE
> /* a 'struct ptdesc' that needs its destructor run */
> #define ASYNC_PGTABLE_FREE_DTOR	BIT(NR_PAGEFLAGS)
This doesn't work, does it? Don't you need to _allocate_ a new bit? Wouldn't this die a hilariously horrible death if NR_PAGEFLAGS==64? ;)
Also, I'm pretty sure you can't just go setting random bits in this because it aliases with page flags.
...
> static void kernel_pgtable_work_func(struct work_struct *work)
> {
> 	struct ptdesc *ptdesc, *next;
> 	LIST_HEAD(page_list);
>
> 	spin_lock(&kernel_pgtable_work.lock);
> 	list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
> 	spin_unlock(&kernel_pgtable_work.lock);
>
> 	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
>
> 	list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) {
> 		list_del(&ptdesc->pt_list);
> 		if (ptdesc->__page_flags & ASYNC_PGTABLE_FREE_DTOR)
> 			pagetable_dtor_free(ptdesc);
> 		else
> 			__free_page(ptdesc_page(ptdesc));
> 	}
> }
What I had in mind was that kernel_pgtable_work_func() only does:
__free_pages(page, compound_order(page));
All of the things that queue gunk to the list do the legwork of making the work_func() simple. This also guides them toward providing a single, compound page because _they_ have to do the work if they don't want something simple.
> void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc)
> {
> 	spin_lock(&kernel_pgtable_work.lock);
> 	ptdesc->__page_flags |= ASYNC_PGTABLE_FREE_DTOR;
> 	list_add(&ptdesc->pt_list, &kernel_pgtable_work.list);
> 	spin_unlock(&kernel_pgtable_work.lock);
>
> 	schedule_work(&kernel_pgtable_work.work);
> }
>
> void kernel_pgtable_async_free_page_list(struct list_head *list)
> {
> 	spin_lock(&kernel_pgtable_work.lock);
> 	list_splice_tail(list, &kernel_pgtable_work.list);
> 	spin_unlock(&kernel_pgtable_work.lock);
>
> 	schedule_work(&kernel_pgtable_work.work);
> }
>
> void kernel_pgtable_async_free_page(struct page *page)
> {
> 	spin_lock(&kernel_pgtable_work.lock);
> 	list_add(&page_ptdesc(page)->pt_list, &kernel_pgtable_work.list);
> 	spin_unlock(&kernel_pgtable_work.lock);
>
> 	schedule_work(&kernel_pgtable_work.work);
> }
I wouldn't have three of these, I'd have _one_. If you want to free a bunch of pages, then just call it a bunch of times. This is not performance critical.
Oh, and the ptdesc flag shouldn't be for "I need to be asynchronously freed". All kernel PTE pages should ideally set it at allocation so you can do this:
static inline void pagetable_free(struct ptdesc *pt)
{
	struct page *page = ptdesc_page(pt);

	if (pt->some_field & PTDESC_KERNEL)
		kernel_pgtable_async_free_page(page);
	else
		__free_pages(page, compound_order(page));
}
The folks that, today, call pagetable_dtor_free() don't have to do _anything_ at free time. They just set the PTDESC_KERNEL bit at allocation time.
See how that code reads? "If you have a kernel page table page, you must asynchronously free it." That's actually meaningful.
I'm pretty sure the lower 24 bits of ptdesc->__page_type are free. So I'd just do this:
#define PTDESC_TYPE_KERNEL BIT(0)
Then, something needs to set:
ptdesc->__page_type |= PTDESC_TYPE_KERNEL;
That could happen as late as here:
static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
{
	struct ptdesc *ptdesc = virt_to_ptdesc(pte);

	// here
	pagetable_dtor_free(ptdesc);
}
Or as early as ptdesc allocation/constructor time. Then you check for PTDESC_TYPE_KERNEL in pagetable_free().