The patch titled
     Subject: mm/hugetlb: fix a addressing exception caused by huge_pte_offset
has been added to the -mm tree.  Its filename is
     mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlb-fix-a-addressing-excepti...
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlb-fix-a-addressing-excepti...

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated there every 3-4 working days
------------------------------------------------------
From: Longpeng <longpeng2@huawei.com>
Subject: mm/hugetlb: fix a addressing exception caused by huge_pte_offset
Our machine encountered a panic (addressing exception) after running for a long time. The call trace is:

RIP: 0010:[<ffffffff9dff0587>]  [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
 [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
 [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
 [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
 [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
 [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
 [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
 [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
 [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
 ...
 [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
 [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
 [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
 [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
 [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
 [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
 [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27
For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it may return a wrong 'pmdp' if there is a race. Please look at the following code snippet:

    ...
    pud = pud_offset(p4d, addr);
    if (sz != PUD_SIZE && pud_none(*pud))
        return NULL;
    /* hugepage or swap? */
    if (pud_huge(*pud) || !pud_present(*pud))
        return (pte_t *)pud;

    pmd = pmd_offset(pud, addr);
    if (sz != PMD_SIZE && pmd_none(*pmd))
        return NULL;
    /* hugepage or swap? */
    if (pmd_huge(*pmd) || !pmd_present(*pmd))
        return (pte_t *)pmd;
    ...
The following sequence would trigger this bug:

1. CPU0: sz = PUD_SIZE and *pud = 0, continue
2. CPU0: "pud_huge(*pud)" is false
3. CPU1: calls hugetlb_no_page() and sets *pud to xxxx8e7 (PRESENT)
4. CPU0: "!pud_present(*pud)" is false, continue
5. CPU0: pmd = pmd_offset(pud, addr), which may return a wrong pmdp

However, we want CPU0 to return NULL or pudp in this case.
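The interleaving above can be replayed deterministically in userspace. The sketch below is not kernel code. `fake_pud_t`, `load_pud()`, `walk_multi_read()`, `walk_single_read()` and `run_walk()` are hypothetical names, the bit values are simplified stand-ins, and the "other CPU" is simulated by a hook that rewrites the entry just before a chosen load. It only illustrates why testing the same entry through several separate loads can fall through to the PMD level, while a single snapshot cannot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real page-table bits. */
#define FAKE_PRESENT 0x01UL
#define FAKE_HUGE    0x80UL

typedef struct { unsigned long val; } fake_pud_t;

enum walk_result { WALK_NULL, WALK_PUD, WALK_PMD };

/* Simulated "CPU1": just before load number `flip_before` (0-based),
 * the entry is rewritten to a present huge mapping. */
static int loads_done;
static int flip_before;

static unsigned long load_pud(fake_pud_t *pud)
{
	assert(pud != NULL);
	if (loads_done++ == flip_before)
		pud->val = FAKE_PRESENT | FAKE_HUGE;
	return pud->val;
}

/* Mirrors the original logic: every test re-loads the entry. */
static enum walk_result walk_multi_read(fake_pud_t *pud, int sz_is_pud)
{
	if (!sz_is_pud && load_pud(pud) == 0)
		return WALK_NULL;
	if ((load_pud(pud) & FAKE_HUGE) || !(load_pud(pud) & FAKE_PRESENT))
		return WALK_PUD;
	return WALK_PMD;	/* would continue with pmd_offset() */
}

/* Same tests, but against one snapshot of the entry. */
static enum walk_result walk_single_read(fake_pud_t *pud, int sz_is_pud)
{
	unsigned long entry = load_pud(pud);

	if (!sz_is_pud && entry == 0)
		return WALK_NULL;
	if ((entry & FAKE_HUGE) || !(entry & FAKE_PRESENT))
		return WALK_PUD;
	return WALK_PMD;
}

/* Start from an empty entry and run one walk with the given flip point. */
static enum walk_result run_walk(enum walk_result (*walk)(fake_pud_t *, int),
				 int flip)
{
	fake_pud_t pud = { 0 };

	loads_done = 0;
	flip_before = flip;
	return walk(&pud, 1);	/* sz == PUD_SIZE */
}
```

Calling `run_walk(walk_multi_read, 1)` places the simulated write between the huge test and the present test, and the walk falls through to the PMD level. With `walk_single_read` the result is the PUD slot no matter where the write lands.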
Also, according to the section 'COMPILER BARRIER' of memory-barriers.txt:
'''
 (*) The compiler is within its rights to reorder loads and stores
     to the same variable, and in some cases, the CPU is within its
     rights to reorder loads to the same variable.  This means that
     the following code:

        a[0] = x;
        a[1] = x;

     Might result in an older value of x stored in a[1] than in a[0].
'''
there are several other data races in huge_pte_offset(), for example:
'''
p4d = p4d_offset(pgd, addr)
if (!p4d_present(*p4d))
    return NULL;
pud = pud_offset(p4d, addr) <-- will be expanded as:
      pud = (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
'''
which is free for the compiler/CPU to execute as:
'''
p4d = p4d_offset(pgd, addr)
p4d_for_vaddr = *p4d;
if (!p4d_present(*p4d))
    return NULL;
pud = (pud_t *)p4d_page_vaddr(p4d_for_vaddr) + pud_index(address);
'''
so in the case where *p4d goes from '!present' to 'present': p4d_present(*p4d) == true and p4d_for_vaddr == none, meaning the p4d_page_vaddr() will crash.
For these reasons, we must make sure there is exactly one dereference of p4d, pud and pmd.
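As a rough illustration of "exactly one dereference", the sketch below takes one snapshot of an entry and derives both the presence check and the next-level address from that same value. It is a userspace approximation under stated assumptions: `SNAPSHOT_ONCE()` is a hypothetical macro that imitates the effect of READ_ONCE() on a word-sized object with a volatile access (it relies on the GCC/Clang `__typeof__` extension), and `FAKE_PRESENT`, `FAKE_ADDR_MASK` and `next_table_addr()` are made-up names rather than kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Approximation of READ_ONCE() for word-sized objects: the volatile
 * access asks the compiler for a single load that it may not repeat. */
#define SNAPSHOT_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical entry layout: bit 0 = present, upper bits = table address. */
#define FAKE_PRESENT   0x1UL
#define FAKE_ADDR_MASK (~0xfffUL)

/* Return the next-level table address encoded in *slot, or 0 if the
 * entry is not present. Both decisions use the same snapshot, so a
 * concurrent update cannot make them disagree. */
static uintptr_t next_table_addr(unsigned long *slot)
{
	unsigned long entry;

	assert(slot != NULL);
	entry = SNAPSHOT_ONCE(*slot);
	if (!(entry & FAKE_PRESENT))
		return 0;
	return (uintptr_t)(entry & FAKE_ADDR_MASK);
}
```

The point of the design is that `entry` is a plain local value: once it has been loaded, nothing another CPU writes to `*slot` can change what the two tests see.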
Link: http://lkml.kernel.org/r/20200327235748.2048-1-longpeng2@huawei.com
Signed-off-by: Longpeng <longpeng2@huawei.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)
--- a/mm/hugetlb.c~mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset
+++ a/mm/hugetlb.c
@@ -4909,29 +4909,33 @@ pte_t *huge_pte_offset(struct mm_struct
 		       unsigned long addr, unsigned long sz)
 {
 	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
+	p4d_t *p4d, p4d_entry;
+	pud_t *pud, pud_entry;
+	pmd_t *pmd, pmd_entry;
 
 	pgd = pgd_offset(mm, addr);
 	if (!pgd_present(*pgd))
 		return NULL;
+
 	p4d = p4d_offset(pgd, addr);
-	if (!p4d_present(*p4d))
+	p4d_entry = READ_ONCE(*p4d);
+	if (!p4d_present(p4d_entry))
 		return NULL;
 
-	pud = pud_offset(p4d, addr);
-	if (sz != PUD_SIZE && pud_none(*pud))
+	pud = pud_offset(&p4d_entry, addr);
+	pud_entry = READ_ONCE(*pud);
+	if (sz != PUD_SIZE && pud_none(pud_entry))
 		return NULL;
 	/* hugepage or swap? */
-	if (pud_huge(*pud) || !pud_present(*pud))
+	if (pud_huge(pud_entry) || !pud_present(pud_entry))
 		return (pte_t *)pud;
 
-	pmd = pmd_offset(pud, addr);
-	if (sz != PMD_SIZE && pmd_none(*pmd))
+	pmd = pmd_offset(&pud_entry, addr);
+	pmd_entry = READ_ONCE(*pmd);
+	if (sz != PMD_SIZE && pmd_none(pmd_entry))
 		return NULL;
 	/* hugepage or swap? */
-	if (pmd_huge(*pmd) || !pmd_present(*pmd))
+	if (pmd_huge(pmd_entry) || !pmd_present(pmd_entry))
 		return (pte_t *)pmd;
 
 	return NULL;
_
Patches currently in -mm which might be from longpeng2@huawei.com are
mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch
On 3/28/20 3:10 PM, akpm@linux-foundation.org wrote:
> The patch titled
>      Subject: mm/hugetlb: fix a addressing exception caused by huge_pte_offset
> has been added to the -mm tree.  Its filename is
>      mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch
This patch is what caused the BUG reported on i386 non-PAE kernel here:
https://lore.kernel.org/linux-mm/CA+G9fYsJgZhhWLMzUxu_ZQ+THdCcJmFbHQ2ETA_YPP...
As a clue, when building in this environment I get:
  CC      mm/hugetlb.o
mm/hugetlb.c: In function ‘huge_pte_offset’:
cc1: warning: function may return address of local variable [-Wreturn-local-addr]
mm/hugetlb.c:5361:14: note: declared here
 pud_t *pud, pud_entry;
             ^~~~~~~~~
cc1: warning: function may return address of local variable [-Wreturn-local-addr]
mm/hugetlb.c:5361:14: note: declared here
cc1: warning: function may return address of local variable [-Wreturn-local-addr]
mm/hugetlb.c:5360:14: note: declared here
 p4d_t *p4d, p4d_entry;
             ^~~~~~~~~
I'm shutting down for the night and will look into it more tomorrow if someone else does not beat me to it.
On Mon, Mar 30, 2020 at 08:35:29PM -0700, Mike Kravetz wrote:
> This patch is what caused the BUG reported on i386 non-PAE kernel here:
>
> https://lore.kernel.org/linux-mm/CA+G9fYsJgZhhWLMzUxu_ZQ+THdCcJmFbHQ2ETA_YPP...
Non-PAE uses ModeB / PSE paging, which only has 2-level page tables. The non-existent levels get folded in and pmd_offset/pud_offset() return the passed in pointer instead of accessing a table, e.g.:
static inline pmd_t * pmd_offset(pud_t * pud, unsigned long address)
{
	return (pmd_t *)pud;
}
The bug probably only manifests with PSE paging because it can have huge pages in the top-level table, i.e. is the only mode that can get a false positive.
This is arguably a bug in pmd_huge()/pud_huge(); it seems like they should unconditionally return false if the relevant level doesn't exist.
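The interaction between a folded level and a stack copy of the entry can be shown with a small userspace sketch. All names here (`fake_pud_t`, `fake_pmd_t`, `fake_pmd_offset()`, `walk_yields_table_slot()`) are hypothetical, and the folded helper is modelled on the pmd_offset() quoted above rather than copied from any kernel header. The function reports whether the pointer produced by the walk refers to the real table slot, without ever returning the address of a local.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned long val; } fake_pud_t;
typedef struct { unsigned long val; } fake_pmd_t;

/* Folded level: no table lookup, the passed-in pointer is the result. */
static inline fake_pmd_t *fake_pmd_offset(fake_pud_t *pud, unsigned long addr)
{
	(void)addr;
	return (fake_pmd_t *)pud;
}

/* Returns 1 if the pmd pointer produced by the walk is the real table
 * slot, 0 if it points somewhere else (here: at the stack copy). */
static int walk_yields_table_slot(fake_pud_t *slot, int pass_stack_copy)
{
	fake_pud_t pud_entry;
	fake_pmd_t *pmd;

	assert(slot != NULL);
	pud_entry = *slot;	/* snapshot, as the patch does */
	pmd = fake_pmd_offset(pass_stack_copy ? &pud_entry : slot, 0);
	return (void *)pmd == (void *)slot;
}
```

With `pass_stack_copy` set, the folded helper hands back the address of `pud_entry`, which is the situation the -Wreturn-local-addr warnings point at: a caller that keeps that pointer would be holding a dead stack address instead of a page-table slot.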
On Mon, Mar 30, 2020 at 09:44:08PM -0700, Sean Christopherson wrote:
> On Mon, Mar 30, 2020 at 08:35:29PM -0700, Mike Kravetz wrote:
> > As a clue, when building in this environment I get:
> >
> >   CC      mm/hugetlb.o
> > mm/hugetlb.c: In function ‘huge_pte_offset’:
> > cc1: warning: function may return address of local variable [-Wreturn-local-addr]
> > mm/hugetlb.c:5361:14: note: declared here
> >  pud_t *pud, pud_entry;
> >              ^~~~~~~~~
> > cc1: warning: function may return address of local variable [-Wreturn-local-addr]
> > mm/hugetlb.c:5361:14: note: declared here
> > cc1: warning: function may return address of local variable [-Wreturn-local-addr]
> > mm/hugetlb.c:5360:14: note: declared here
> >  p4d_t *p4d, p4d_entry;
> >              ^~~~~~~~~
Yes, this is certainly very bad.
The issue is that, to get READ_ONCE semantics for a lockless flow, this hackily defeats the dereference inside pXX_offset() by passing in a pointer to a stack variable. This is fine unless you actually care about the *address* of the result of pXX_offset(), which huge_pte_offset() does.
I can't think of an easy fix here.
Andrew, I think this patch has to be dropped :(
Longpeng can fix the direct bug he saw by not changing the pXX_offset() calls, but the extra dereference will remain a theoretical/rare bug according to the memory model.

Maybe we need to change pXX_offset() to take in both the pointer and the dereferenced value?
Jason
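One way to picture the closing suggestion is a helper that receives both the slot pointer and the value already read from it. The sketch below is a hypothetical illustration only: `fake_pmd_offset_unfolded()` and `fake_pmd_offset_folded()` are made-up names, the shift and table size are assumed constants, and the upper-level entry is modelled as a struct holding a plain pointer to the next table. The unfolded variant indexes the table named by the snapshot value. The folded variant ignores the value and returns the real slot pointer, so the result is never the address of a stack copy.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_PMD_SHIFT    21	/* assumed for illustration */
#define FAKE_PTRS_PER_PMD 512	/* assumed for illustration */

typedef struct { unsigned long val; } fake_pmd_t;
/* Simplified upper-level entry: holds the next-level table pointer. */
typedef struct { fake_pmd_t *table; } fake_pud_t;

/* Unfolded level: use the snapshot value to find the table, then index
 * it by address. The slot pointer is not needed here. */
static fake_pmd_t *fake_pmd_offset_unfolded(fake_pud_t *pudp, fake_pud_t pud,
					    unsigned long addr)
{
	(void)pudp;
	assert(pud.table != NULL);
	return pud.table + ((addr >> FAKE_PMD_SHIFT) & (FAKE_PTRS_PER_PMD - 1));
}

/* Folded level: the "pmd" is the pud slot itself, so return the real
 * slot pointer and ignore the copied value. */
static fake_pmd_t *fake_pmd_offset_folded(fake_pud_t *pudp, fake_pud_t pud,
					  unsigned long addr)
{
	(void)pud;
	(void)addr;
	assert(pudp != NULL);
	return (fake_pmd_t *)pudp;
}
```

A caller would read the entry once, keep the snapshot for its checks, and pass both the slot and the snapshot down, which keeps the single-load property without losing the slot address.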
On 3/31/20 7:08 AM, Jason Gunthorpe wrote:
> I can't think of an easy fix here.
>
> Andrew, I think this patch has to be dropped :(
>
> Longpeng can fix the direct bug he saw by not changing the pXX_offset() calls, but the extra dereference will remain a theoretical/rare bug according to the memory model.
FWIW, I tested Longpeng's V2 patch without the READ_ONCE for *pgd and *p4d in this environment and it worked fine.