On 2/18/25 8:41 PM, Dave Hansen wrote:
On 2/17/25 21:45, Andrew Morton wrote:
When a process leads the addition of a struct page to vmemmap (e.g. hot-plug), the page table update for the newly added vmemmap-based virtual address is updated first in init_mm's page table and then synchronized later. If the vmemmap-based virtual address is accessed through the process's page table before this sync, a page fault will occur.
So, I think we're talking about the loop in vmemmap_populate_hugepages() (with a bunch of context chopped out:
for (addr = start; addr < end; addr = next) { ... pgd = vmemmap_pgd_populate(addr, node); if (!pgd) return -ENOMEM; ... vmemmap_set_pmd(pmd, p, node, addr, next); }
This both creates a mapping under 'pgd' and uses the new mapping inside vmemmap_set_pmd(). This is generally a known problem since vmemmap_populate() already does a sync_global_pgds(). The reason it manifests here is that the vmemmap_set_pmd() comes before the sync:
vmemmap_populate() { vmemmap_populate_hugepages() { vmemmap_pgd_populate(addr, node); ... // crash: vmemmap_set_pmd(pmd, p, node, addr, next); } // too late: sync_global_pgds(); }
I really don't like the idea of having the x86 code just be super careful not to use the newly-populated PGD (this patch). That's fragile and further diverges the x86 implementation from the generic code.
The quick and dirty fix would be to just to call sync_global_pgds() all the time, like:
pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node) { pgd_t *pgd = pgd_offset_k(addr); if (pgd_none(*pgd)) { void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node); if (!p) return NULL; pgd_populate(&init_mm, pgd, p);
sync_global_pgds(...); } return pgd;
}
That actually mirrors how __kernel_physical_mapping_init() does it: watch for an actual PGD write and sync there. It shouldn't be too slow because it only calls sync_global_pgds() during actual PGD population which is horribly rare.
Could we do something like that, please? It might mean defining a new __weak symbol in mm/sparse-vmemmap.c and then calling out to an x86 implementation like vmemmap_set_pmd().
Hi Dave Hansen,
Thanks for the feedback. I do agree with a generic code rather than a diverse code for x86 implementation. If everyone else agrees, I'll send a new patch in this style.
Is x86 just an oddball with how it populates kernel page tables? I'm a bit surprised nobody else has this problem too.
Presumably, this won't happen on other platforms where they don't have the scenario of accessing that address before synchronizing the pgd. Please correct me if I'm wrong.
Br, G.G.