Commit f77084d96355 "x86/mm/pat: Disable preemption around __flush_tlb_all()" addressed a case where __flush_tlb_all() is called without preemption being disabled. It also added a warning to catch other cases where preemption is not disabled. That warning triggers for the memory hotplug path, which is also used for persistent memory enabling:
 WARNING: CPU: 35 PID: 911 at ./arch/x86/include/asm/tlbflush.h:460
 RIP: 0010:__flush_tlb_all+0x1b/0x3a
 [..]
 Call Trace:
  phys_pud_init+0x29c/0x2bb
  kernel_physical_mapping_init+0xfc/0x219
  init_memory_mapping+0x1a5/0x3b0
  arch_add_memory+0x2c/0x50
  devm_memremap_pages+0x3aa/0x610
  pmem_attach_disk+0x585/0x700 [nd_pmem]
Andy wondered why a path that can sleep was using __flush_tlb_all() [1] and Dave confirmed the expectation for TLB flush is for modifying / invalidating existing pte entries, but not initial population [2]. Drop the usage of __flush_tlb_all() in phys_{p4d,pud,pmd}_init() on the expectation that this path is only ever populating empty entries for the linear map. Note, at linear map teardown time there is a call to the all-cpu flush_tlb_all() to invalidate the removed mappings.
[1]: https://lore.kernel.org/patchwork/patch/1009434/#1193941
[2]: https://lore.kernel.org/patchwork/patch/1009434/#1194540
Fixes: f77084d96355 ("x86/mm/pat: Disable preemption around __flush_tlb_all()")
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <stable@vger.kernel.org>
Reported-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/init_64.c |    6 ------
 1 file changed, 6 deletions(-)
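Note for reviewers: the teardown counterpart referenced in the changelog. In init_64.c of this vintage, remove_pagetable() ends with an all-cpu flush once the mappings are gone; abbreviated sketch from memory, body elided and exact signature approximate:

	static void __meminit
	remove_pagetable(unsigned long start, unsigned long end, bool direct,
			 struct vmem_altmap *altmap)
	{
		/* walk and free the pgd/p4d/pud/pmd/pte hierarchy ... */
		...

		/* invalidate any stale translations for the removed range */
		flush_tlb_all();
	}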
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5fab264948c2..de95db8ac52f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -584,7 +584,6 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 						   paddr_end,
 						   page_size_mask,
 						   prot);
-			__flush_tlb_all();
 			continue;
 		}
 		/*
@@ -627,7 +626,6 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 		pud_populate(&init_mm, pud, pmd);
 		spin_unlock(&init_mm.page_table_lock);
 	}
-	__flush_tlb_all();
 
 	update_page_count(PG_LEVEL_1G, pages);
 
@@ -668,7 +666,6 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
 			paddr_last = phys_pud_init(pud, paddr,
 						   paddr_end,
 						   page_size_mask);
-			__flush_tlb_all();
 			continue;
 		}
 
@@ -680,7 +677,6 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
 		p4d_populate(&init_mm, p4d, pud);
 		spin_unlock(&init_mm.page_table_lock);
 	}
-	__flush_tlb_all();
 
 	return paddr_last;
 }
@@ -733,8 +729,6 @@ kernel_physical_mapping_init(unsigned long paddr_start,
 	if (pgd_changed)
 		sync_global_pgds(vaddr_start, vaddr_end - 1);
 
-	__flush_tlb_all();
-
 	return paddr_last;
 }
On 11/19/18 3:19 PM, Dan Williams wrote:
> Andy wondered why a path that can sleep was using __flush_tlb_all() [1]
> and Dave confirmed the expectation for TLB flush is for modifying /
> invalidating existing pte entries, but not initial population [2].
I _think_ this is OK.
But, could we sprinkle a few WARN_ON_ONCE(p*_present()) calls in there to help us sleep at night?
On Mon, Nov 19, 2018 at 3:43 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 11/19/18 3:19 PM, Dan Williams wrote:
> > Andy wondered why a path that can sleep was using __flush_tlb_all()
> > [1] and Dave confirmed the expectation for TLB flush is for modifying
> > / invalidating existing pte entries, but not initial population [2].
>
> I _think_ this is OK.
>
> But, could we sprinkle a few WARN_ON_ONCE(p*_present()) calls in there
> to help us sleep at night?
Makes sense, I'll add those for v2.
On Mon, 2018-11-19 at 15:43 -0800, Dave Hansen wrote:
> On 11/19/18 3:19 PM, Dan Williams wrote:
> > Andy wondered why a path that can sleep was using __flush_tlb_all()
> > [1] and Dave confirmed the expectation for TLB flush is for modifying
> > / invalidating existing pte entries, but not initial population [2].
>
> I _think_ this is OK.
>
> But, could we sprinkle a few WARN_ON_ONCE(p*_present()) calls in there
> to help us sleep at night?
Well, I'm having nightmares now because my naive patch to sprinkle some WARN_ON_ONCE() calls is leading to my VM live locking at boot... no backtrace. If I revert the patch below and just go with the __flush_tlb_all() removal it seems fine.
I'm going to set this aside for a bit, but if anyone has any thoughts in the meantime I'd appreciate it.
---
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index de95db8ac52f..ecdf917def4c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -432,6 +432,7 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, unsigned long paddr_end,
 					      E820_TYPE_RAM) &&
 			    !e820__mapped_any(paddr & PAGE_MASK, paddr_next,
 					      E820_TYPE_RESERVED_KERN))
+				WARN_ON_ONCE(pte_present(*pte));
 				set_pte(pte, __pte(0));
 			continue;
 		}
@@ -452,6 +453,7 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, unsigned long paddr_end,
 			pr_info("   pte=%p addr=%lx pte=%016lx\n", pte, paddr,
 				pfn_pte(paddr >> PAGE_SHIFT, PAGE_KERNEL).pte);
 		pages++;
+		WARN_ON_ONCE(pte_present(*pte));
 		set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
 		paddr_last = (paddr & PAGE_MASK) + PAGE_SIZE;
 	}
@@ -487,6 +489,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
 					      E820_TYPE_RAM) &&
 			    !e820__mapped_any(paddr & PMD_MASK, paddr_next,
 					      E820_TYPE_RESERVED_KERN))
+				WARN_ON_ONCE(pmd_present(*pmd));
 				set_pmd(pmd, __pmd(0));
 			continue;
 		}
@@ -524,6 +527,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
 		if (page_size_mask & (1<<PG_LEVEL_2M)) {
 			pages++;
 			spin_lock(&init_mm.page_table_lock);
+			WARN_ON_ONCE(pmd_present(*pmd));
 			set_pte((pte_t *)pmd,
 				pfn_pte((paddr & PMD_MASK) >> PAGE_SHIFT,
 					__pgprot(pgprot_val(prot) | _PAGE_PSE)));
@@ -536,6 +540,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
 		paddr_last = phys_pte_init(pte, paddr, paddr_end, new_prot);
 
 		spin_lock(&init_mm.page_table_lock);
+		WARN_ON_ONCE(pmd_present(*pmd));
 		pmd_populate_kernel(&init_mm, pmd, pte);
 		spin_unlock(&init_mm.page_table_lock);
 	}
@@ -573,6 +578,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 				      E820_TYPE_RAM) &&
 		    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
 				      E820_TYPE_RESERVED_KERN))
+			WARN_ON_ONCE(pud_present(*pud));
 			set_pud(pud, __pud(0));
 			continue;
 		}
@@ -610,6 +616,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 		if (page_size_mask & (1<<PG_LEVEL_1G)) {
 			pages++;
 			spin_lock(&init_mm.page_table_lock);
+			WARN_ON_ONCE(pud_present(*pud));
 			set_pte((pte_t *)pud,
 				pfn_pte((paddr & PUD_MASK) >> PAGE_SHIFT,
 					PAGE_KERNEL_LARGE));
@@ -623,6 +630,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 					   page_size_mask, prot);
 
 		spin_lock(&init_mm.page_table_lock);
+		WARN_ON_ONCE(pud_present(*pud));
 		pud_populate(&init_mm, pud, pmd);
 		spin_unlock(&init_mm.page_table_lock);
 	}
@@ -657,6 +665,7 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
 				      E820_TYPE_RAM) &&
 		    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
 				      E820_TYPE_RESERVED_KERN))
+			WARN_ON_ONCE(p4d_present(*p4d));
 			set_p4d(p4d, __p4d(0));
 			continue;
 		}
@@ -674,6 +683,7 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
 					   page_size_mask);
 
 		spin_lock(&init_mm.page_table_lock);
+		WARN_ON_ONCE(p4d_present(*p4d));
 		p4d_populate(&init_mm, p4d, pud);
 		spin_unlock(&init_mm.page_table_lock);
 	}
On Tue, Nov 20, 2018 at 02:59:32AM +0000, Williams, Dan J wrote:
> On Mon, 2018-11-19 at 15:43 -0800, Dave Hansen wrote:
> > On 11/19/18 3:19 PM, Dan Williams wrote:
> > > Andy wondered why a path that can sleep was using __flush_tlb_all()
> > > [1] and Dave confirmed the expectation for TLB flush is for
> > > modifying / invalidating existing pte entries, but not initial
> > > population [2].
> >
> > I _think_ this is OK.
> >
> > But, could we sprinkle a few WARN_ON_ONCE(p*_present()) calls in
> > there to help us sleep at night?
>
> Well, I'm having nightmares now because my naive patch to sprinkle some
> WARN_ON_ONCE() calls is leading to my VM live locking at boot... no
> backtrace. If I revert the patch below and just go with the
> __flush_tlb_all() removal it seems fine.
>
> I'm going to set this aside for a bit, but if anyone has any thoughts
> in the meantime I'd appreciate it.
Have you tried using early_printk ?
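(i.e. boot with something like:

	earlyprintk=serial,ttyS0,115200 keep_bootcon

on the kernel command line; the exact console arguments depend on the
machine, so treat that as a sketch)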
So kernel_physical_mapping_init() has a comment that states the virtual and physical addresses we create mappings for should be PMD aligned, which implies pud/p4d could have overlap between the mappings.
But in that case, I would expect the new and old values to match.
So maybe you should be checking something like:
WARN_ON_ONCE(pud_present(*pud) && !pud_same(pud, new));
> @@ -573,6 +578,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
>  				      E820_TYPE_RAM) &&
>  		    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
>  				      E820_TYPE_RESERVED_KERN))
> +			WARN_ON_ONCE(pud_present(*pud));
>  			set_pud(pud, __pud(0));
>  			continue;
>  		}
>
> @@ -610,6 +616,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
>  		if (page_size_mask & (1<<PG_LEVEL_1G)) {
>  			pages++;
>  			spin_lock(&init_mm.page_table_lock);
> +			WARN_ON_ONCE(pud_present(*pud));
>  			set_pte((pte_t *)pud,
>  				pfn_pte((paddr & PUD_MASK) >> PAGE_SHIFT,
>  					PAGE_KERNEL_LARGE));
>
> @@ -623,6 +630,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
>  					   page_size_mask, prot);
>
>  		spin_lock(&init_mm.page_table_lock);
> +		WARN_ON_ONCE(pud_present(*pud));
>  		pud_populate(&init_mm, pud, pmd);
>  		spin_unlock(&init_mm.page_table_lock);
>  	}
>
> @@ -657,6 +665,7 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
>  				      E820_TYPE_RAM) &&
>  		    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
>  				      E820_TYPE_RESERVED_KERN))
> +			WARN_ON_ONCE(p4d_present(*p4d));
>  			set_p4d(p4d, __p4d(0));
>  			continue;
>  		}
>
> @@ -674,6 +683,7 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
>  					   page_size_mask);
>
>  		spin_lock(&init_mm.page_table_lock);
> +		WARN_ON_ONCE(p4d_present(*p4d));
>  		p4d_populate(&init_mm, p4d, pud);
>  		spin_unlock(&init_mm.page_table_lock);
>  	}
On Tue, Nov 20, 2018 at 1:03 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Nov 20, 2018 at 02:59:32AM +0000, Williams, Dan J wrote:
> > On Mon, 2018-11-19 at 15:43 -0800, Dave Hansen wrote:
> > > On 11/19/18 3:19 PM, Dan Williams wrote:
> > > > Andy wondered why a path that can sleep was using
> > > > __flush_tlb_all() [1] and Dave confirmed the expectation for TLB
> > > > flush is for modifying / invalidating existing pte entries, but
> > > > not initial population [2].
> > >
> > > I _think_ this is OK.
> > >
> > > But, could we sprinkle a few WARN_ON_ONCE(p*_present()) calls in
> > > there to help us sleep at night?
> >
> > Well, I'm having nightmares now because my naive patch to sprinkle
> > some WARN_ON_ONCE() calls is leading to my VM live locking at
> > boot... no backtrace. If I revert the patch below and just go with
> > the __flush_tlb_all() removal it seems fine.
> >
> > I'm going to set this aside for a bit, but if anyone has any
> > thoughts in the meantime I'd appreciate it.
>
> Have you tried using early_printk ?
No, it boots well past printk, and even gets past pivot root. Eventually live locks with all cores spinning. It appears to be correlated with the arrival of pmem, and independent of the tlb flushes... I'll dig deeper.
> So kernel_physical_mapping_init() has a comment that states the virtual
> and physical addresses we create mappings for should be PMD aligned,
> which implies pud/p4d could have overlap between the mappings.
>
> But in that case, I would expect the new and old values to match.
>
> So maybe you should be checking something like:
>
>	WARN_ON_ONCE(pud_present(*pud) && !pud_same(pud, new));
Yes, that looks better.
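Something like the below in the 1G-page arm of phys_pud_init(), as an
untested sketch, assuming pud_same() is usable in this context for
comparing pud_t values:

	pud_t new = pfn_pud((paddr & PUD_MASK) >> PAGE_SHIFT,
			    PAGE_KERNEL_LARGE);

	spin_lock(&init_mm.page_table_lock);
	/* only warn when overwriting a different, already-present entry */
	WARN_ON_ONCE(pud_present(*pud) && !pud_same(*pud, new));
	set_pud(pud, new);
	spin_unlock(&init_mm.page_table_lock);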
On Mon, Nov 19, 2018 at 03:19:04PM -0800, Dan Williams wrote:
FWIW, that is not the canonical form to refer to emails. Please use:
  https://lkml.kernel.org/r/$msgid
(also, patchwork is even worse crap than lore is for reading emails :/)