On 3/23/20 11:07 AM, Jason Gunthorpe wrote:
On Mon, Mar 23, 2020 at 10:27:48AM -0700, Mike Kravetz wrote:
pgd = pgd_offset(mm, addr);
- if (!pgd_present(*pgd))
- if (!pgd_present(READ_ONCE(*pgd))) return NULL; p4d = p4d_offset(pgd, addr);
- if (!p4d_present(*p4d))
- if (!p4d_present(READ_ONCE(*p4d))) return NULL;
pud = pud_offset(p4d, addr);
One would argue that pgd and p4d can not change from present to !present during the execution of this code. To me, that seems like the issue which would cause an issue. Of course, I could be missing something.
This I am not sure of, I think it must be true under the read side of the mmap_sem, but probably not guarenteed under RCU..
In any case, it doesn't matter, the fact that *p4d can change at all is problematic. Unwinding the above inlines we get:
p4d = p4d_offset(pgd, addr) if (!p4d_present(*p4d)) return NULL; pud = (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
According to our memory model the compiler/CPU is free to execute this as:
p4d = p4d_offset(pgd, addr) p4d_for_vaddr = *p4d; if (!p4d_present(*p4d)) return NULL; pud = (pud_t *)p4d_page_vaddr(p4d_for_vaddr) + pud_index(address);
Wow! How do you know this? You don't need to answer :)
In the case where p4 goes from !present -> present (ie handle_mm_fault()):
p4d_for_vaddr == p4d_none, and p4d_present(*p4d) == true, meaning the p4d_page_vaddr() will crash.
Basically the problem here is not just missing READ_ONCE, but that the p4d is read multiple times at all. It should be written like gup_fast does, to guarantee a single CPU read of the unstable data:
p4d = READ_ONCE(*p4d_offset(pgdp, addr)); if (!p4d_present(p4)) return NULL; pud = pud_offset(&p4d, addr);
At least this is what I've been able to figure out :\
In that case, I believe there are a bunch of similar routines with this issue.
For this patch, I was primarily interested in seeing the obvious multiple dereferences in C fixed up. This is above and beyond that! :)
Also, the remark about pmd_offset() seems accurate. The get_user_fast_pages() pattern seems like the correct one to emulate:
pud = READ_ONCE(*pudp); if (pud_none(pud)) .. if (!pud_'is a pmd pointer') .. pmdp = pmd_offset(&pud, address); pmd = READ_ONCE(*pmd); [...]
Passing &pud in avoids another de-reference of the pudp. Honestly all these APIs that take in page table pointers and internally de-reference them seem very hard to use correctly when the page table access isn't fully locked against write.
And the same protocol for the PUD, etc.
It looks like at least the p4d read from the pgd is also unlocked here as handle_mm_fault() writes to it??
Yes, there is no locking required to call huge_pte_offset().
None? Not RCU or read mmap_sem?
Yes, mmap_sem in read mode. Sorry, I was confusing this with additional locking requirements for hugetlb specific code.