From: Hugh Dickins hughd@google.com
commit 670ddd8cdcbd1d07a4571266ae3517f821728c3a upstream.
change_pmd_range() had special pmd_none_or_clear_bad_unless_trans_huge(), required to avoid "bad" choices when setting automatic NUMA hinting under mmap_read_lock(); but most of that is already covered in pte_offset_map() now. change_pmd_range() just wants a pmd_none() check before wasting time on MMU notifiers, then checks on the read-once _pmd value to work out what's needed for huge cases. If change_pte_range() returns -EAGAIN to retry if pte_offset_map_lock() fails, nothing more special is needed.
Link: https://lkml.kernel.org/r/725a42a9-91e9-c868-925-e3a5fd40bb4f@google.com Signed-off-by: Hugh Dickins hughd@google.com Cc: Alistair Popple apopple@nvidia.com Cc: Anshuman Khandual anshuman.khandual@arm.com Cc: Axel Rasmussen axelrasmussen@google.com Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: Christoph Hellwig hch@infradead.org Cc: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: Ira Weiny ira.weiny@intel.com Cc: Jason Gunthorpe jgg@ziepe.ca Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Matthew Wilcox willy@infradead.org Cc: Mel Gorman mgorman@techsingularity.net Cc: Miaohe Lin linmiaohe@huawei.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Mike Rapoport (IBM) rppt@kernel.org Cc: Minchan Kim minchan@kernel.org Cc: Naoya Horiguchi naoya.horiguchi@nec.com Cc: Pavel Tatashin pasha.tatashin@soleen.com Cc: Peter Xu peterx@redhat.com Cc: Peter Zijlstra peterz@infradead.org Cc: Qi Zheng zhengqi.arch@bytedance.com Cc: Ralph Campbell rcampbell@nvidia.com Cc: Ryan Roberts ryan.roberts@arm.com Cc: SeongJae Park sj@kernel.org Cc: Song Liu song@kernel.org Cc: Steven Price steven.price@arm.com Cc: Suren Baghdasaryan surenb@google.com Cc: Thomas Hellström thomas.hellstrom@linux.intel.com Cc: Will Deacon will@kernel.org Cc: Yang Shi shy828301@gmail.com Cc: Yu Zhao yuzhao@google.com Cc: Zack Rusin zackr@vmware.com Signed-off-by: Andrew Morton akpm@linux-foundation.org [ Background: It was reported that a bad pmd is seen when automatic NUMA balancing is marking page table entries as prot_numa:
[2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02) [2437548.235022] Call Trace: [2437548.238234] <TASK> [2437548.241060] dump_stack_lvl+0x46/0x61 [2437548.245689] panic+0x106/0x2e5 [2437548.249497] pmd_clear_bad+0x3c/0x3c [2437548.253967] change_pmd_range.isra.0+0x34d/0x3a7 [2437548.259537] change_p4d_range+0x156/0x20e [2437548.264392] change_protection_range+0x116/0x1a9 [2437548.269976] change_prot_numa+0x15/0x37 [2437548.274774] task_numa_work+0x1b8/0x302 [2437548.279512] task_work_run+0x62/0x95 [2437548.283882] exit_to_user_mode_loop+0x1a4/0x1a9 [2437548.289277] exit_to_user_mode_prepare+0xf4/0xfc [2437548.294751] ? sysvec_apic_timer_interrupt+0x34/0x81 [2437548.300677] irqentry_exit_to_user_mode+0x5/0x25 [2437548.306153] asm_sysvec_apic_timer_interrupt+0x16/0x1b
This is due to a race condition between change_prot_numa() and THP migration because the kernel doesn't check is_swap_pmd() and pmd_trans_huge() atomically:
change_prot_numa() THP migration ====================================================================== - change_pmd_range() -> is_swap_pmd() returns false, meaning it's not a PMD migration entry. - do_huge_pmd_numa_page() -> migrate_misplaced_page() sets migration entries for the THP. - change_pmd_range() -> pmd_none_or_clear_bad_unless_trans_huge() -> pmd_none() and pmd_trans_huge() returns false - pmd_none_or_clear_bad_unless_trans_huge() -> pmd_bad() returns true for the migration entry!
The upstream commit 670ddd8cdcbd ("mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()") closes this race condition by checking is_swap_pmd() and pmd_trans_huge() atomically.
Backporting note: Unlike the mainline, pte_offset_map_lock() does not check if the pmd entry is a migration entry or a hugepage; acquires PTL unconditionally instead of returning failure. Therefore, it is necessary to keep the !is_swap_pmd() && !pmd_trans_huge() && !pmd_devmap() check before acquiring the PTL.
After acquiring the lock, open-code the semantics of pte_offset_map_lock() in the mainline kernel; change_pte_range() fails if the pmd value has changed. This requires adding one more parameter (to pass pmd value that is read before calling the function) to change_pte_range(). ] --- mm/mprotect.c | 75 ++++++++++++++++++++++++++++----------------------- 1 file changed, 42 insertions(+), 33 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c index 28e1a8fd9319f..4daaa75b3f12c 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -36,10 +36,11 @@ #include "internal.h"
static long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, unsigned long end, pgprot_t newprot, - unsigned long cp_flags) + pmd_t pmd_old, unsigned long addr, unsigned long end, + pgprot_t newprot, unsigned long cp_flags) { pte_t *pte, oldpte; + pmd_t pmd_val; spinlock_t *ptl; long pages = 0; int target_node = NUMA_NO_NODE; @@ -48,21 +49,15 @@ static long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, bool uffd_wp = cp_flags & MM_CP_UFFD_WP; bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
- /* - * Can be called with only the mmap_lock for reading by - * prot_numa so we must check the pmd isn't constantly - * changing from under us from pmd_none to pmd_trans_huge - * and/or the other way around. - */ - if (pmd_trans_unstable(pmd)) - return 0; - - /* - * The pmd points to a regular pte so the pmd can't change - * from under us even if the mmap_lock is only hold for - * reading. - */ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + /* Make sure pmd didn't change after acquiring ptl */ + pmd_val = pmd_read_atomic(pmd); + /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */ + barrier(); + if (!pmd_same(pmd_old, pmd_val)) { + pte_unmap_unlock(pte, ptl); + return -EAGAIN; + }
/* Get target node for single threaded private VMAs */ if (prot_numa && !(vma->vm_flags & VM_SHARED) && @@ -223,21 +218,33 @@ static inline long change_pmd_range(struct vm_area_struct *vma,
pmd = pmd_offset(pud, addr); do { - long this_pages; - + long ret; + pmd_t _pmd; +again: next = pmd_addr_end(addr, end); + _pmd = pmd_read_atomic(pmd); + /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + barrier(); +#endif
/* * Automatic NUMA balancing walks the tables with mmap_lock * held for read. It's possible a parallel update to occur - * between pmd_trans_huge() and a pmd_none_or_clear_bad() - * check leading to a false positive and clearing. - * Hence, it's necessary to atomically read the PMD value - * for all the checks. + * between pmd_trans_huge(), is_swap_pmd(), and + * a pmd_none_or_clear_bad() check leading to a false positive + * and clearing. Hence, it's necessary to atomically read + * the PMD value for all the checks. */ - if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) && - pmd_none_or_clear_bad_unless_trans_huge(pmd)) - goto next; + if (!is_swap_pmd(_pmd) && !pmd_devmap(_pmd) && !pmd_trans_huge(_pmd)) { + if (pmd_none(_pmd)) + goto next; + + if (pmd_bad(_pmd)) { + pmd_clear_bad(pmd); + goto next; + } + }
/* invoke the mmu notifier if the pmd is populated */ if (!range.start) { @@ -247,15 +254,15 @@ static inline long change_pmd_range(struct vm_area_struct *vma, mmu_notifier_invalidate_range_start(&range); }
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) { + if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) { if (next - addr != HPAGE_PMD_SIZE) { __split_huge_pmd(vma, pmd, addr, false, NULL); } else { - int nr_ptes = change_huge_pmd(vma, pmd, addr, - newprot, cp_flags); + ret = change_huge_pmd(vma, pmd, addr, newprot, + cp_flags);
- if (nr_ptes) { - if (nr_ptes == HPAGE_PMD_NR) { + if (ret) { + if (ret == HPAGE_PMD_NR) { pages += HPAGE_PMD_NR; nr_huge_updates++; } @@ -266,9 +273,11 @@ static inline long change_pmd_range(struct vm_area_struct *vma, } /* fall through, the trans huge pmd just split */ } - this_pages = change_pte_range(vma, pmd, addr, next, newprot, - cp_flags); - pages += this_pages; + ret = change_pte_range(vma, pmd, _pmd, addr, next, newprot, + cp_flags); + if (ret < 0) + goto again; + pages += ret; next: cond_resched(); } while (pmd++, addr = next, addr != end);