On 2/17/25 21:45, Andrew Morton wrote:
When a process triggers the addition of struct pages to the vmemmap (e.g. memory hot-plug), the page table update for the newly added vmemmap-based virtual address range is made first in init_mm's page table and only synchronized to other page tables later. If a vmemmap-based virtual address is accessed through a process's page table before this sync, a page fault will occur.
So, I think we're talking about the loop in vmemmap_populate_hugepages() (with a bunch of context chopped out):
	for (addr = start; addr < end; addr = next) {
		...
		pgd = vmemmap_pgd_populate(addr, node);
		if (!pgd)
			return -ENOMEM;
		...
		vmemmap_set_pmd(pmd, p, node, addr, next);
	}
This both creates a mapping under 'pgd' and uses the new mapping inside vmemmap_set_pmd(). This is a known problem in general, since vmemmap_populate() already does a sync_global_pgds(). The reason it manifests here is that the vmemmap_set_pmd() comes before the sync:
	vmemmap_populate()
	{
		vmemmap_populate_hugepages()
		{
			vmemmap_pgd_populate(addr, node);
			...
			// crash:
			vmemmap_set_pmd(pmd, p, node, addr, next);
		}
		// too late:
		sync_global_pgds();
	}
I really don't like the idea of having the x86 code just be super careful not to use the newly-populated PGD (this patch). That's fragile and further diverges the x86 implementation from the generic code.
The quick and dirty fix would be to just call sync_global_pgds() all the time, like:
	pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
	{
		pgd_t *pgd = pgd_offset_k(addr);

		if (pgd_none(*pgd)) {
			void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);

			if (!p)
				return NULL;
			pgd_populate(&init_mm, pgd, p);
	+		sync_global_pgds(...);
		}
		return pgd;
	}
That actually mirrors how __kernel_physical_mapping_init() does it: watch for an actual PGD write and sync there. It shouldn't be too slow because it only calls sync_global_pgds() during actual PGD population, which is horribly rare.
Could we do something like that, please? It might mean defining a new __weak symbol in mm/sparse-vmemmap.c and then calling out to an x86 implementation like vmemmap_set_pmd().
Is x86 just an oddball with how it populates kernel page tables? I'm a bit surprised nobody else has this problem too.