Ackerley Tng <ackerleytng@google.com> writes:
<snip>
> I'll go over the rest of your patches and dig into the meaning of `avoid_reserve`.
Yes, after looking into this more deeply, I agree that avoid_reserve means avoiding the reservations in the resv_map rather than reservations in the subpool or hstate.
Here's more detail of what's going on in the reproducer that I wrote as I reviewed Peter's patch:
1. On fallocate(), allocate page A
2. On mmap(), set up a vma without VM_MAYSHARE since MAP_PRIVATE was requested
3. On faulting *buf = 1, allocate a new page B, copy A to B because the mmap request was MAP_PRIVATE
4. On fork(), prep for COW by marking the page read-only. Both parent and child share B.
5. On faulting *buf = 2 (write fault), allocate page C, copy B to C
   + B belongs to the child, C belongs to the parent
6. Child exits, B is freed
7. On munmap(), C is freed
8. On unlink(), A is freed
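For concreteness, here's a compilable sketch of that flow. The hugetlbfs mount point and the 2M huge page size are assumptions for illustration, and a pipe is used to order the parent's COW write before the child's exit; the actual reproducer may differ in its details.

#define _GNU_SOURCE             /* for fallocate() */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL << 20)  /* assumed 2M huge page size */

int main(void)
{
        int pipefd[2];
        pid_t pid;
        char *buf;
        /* Assumed hugetlbfs mount point. */
        int fd = open("/mnt/hugetlbfs/repro", O_CREAT | O_RDWR, 0600);

        if (fd < 0 || pipe(pipefd))
                return 1;

        /* 1. fallocate() allocates page A. */
        if (fallocate(fd, 0, 0, HPAGE_SIZE))
                return 1;

        /* 2. MAP_PRIVATE: the vma is set up without VM_MAYSHARE. */
        buf = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* 3. Write fault: allocate page B, copy A to B. */
        *buf = 1;

        /* 4. fork() write-protects B for COW; parent and child share B. */
        pid = fork();
        if (pid == 0) {
                char c;

                /* 6. Wait for the parent's COW write, then exit,
                 *    freeing B (which now belongs to the child). */
                read(pipefd[0], &c, 1);
                exit(0);
        }

        /* 5. Write fault in the parent: allocate page C, copy B to C. */
        *buf = 2;
        write(pipefd[1], "x", 1);
        wait(NULL);

        /* 7. munmap() frees C. */
        munmap(buf, HPAGE_SIZE);
        close(fd);

        /* 8. unlink() frees A. */
        unlink("/mnt/hugetlbfs/repro");
        return 0;
}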
When C was allocated in the parent (which owns the MAP_PRIVATE page and is doing the copy-on-write), spool->rsv_hpages was decremented but h->resv_huge_pages was not. This is the root of the bug.
We should decrement h->resv_huge_pages if a reserved page from the subpool was used, rather than deciding based on whether avoid_reserve or vma_has_reserves() is set. If avoid_reserve is set, the subpool shouldn't be checked for a reservation, so we won't be decrementing h->resv_huge_pages anyway.
I agree with Peter's fix as a whole (the entire patch series).
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
---
Some definitions which might be helpful:
+ h->resv_huge_pages indicates the number of reserved pages globally.
  + This number increases when pages are reserved.
  + This number decreases when reserved pages are allocated, or when pages are unreserved.
+ spool->rsv_hpages indicates the number of reserved pages in this subpool.
  + This number increases when pages are reserved.
  + This number decreases when reserved pages are allocated, or when pages are unreserved.
+ h->resv_huge_pages should be the sum of all subpools' spool->rsv_hpages.
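To make the lockstep bookkeeping concrete, here's a runnable toy model (plain C, not kernel code; reserve()/unreserve() are made-up names):

#include <stdio.h>

struct hstate { long resv_huge_pages; };
struct subpool { long rsv_hpages; };

static void reserve(struct hstate *h, struct subpool *sp, long n)
{
        sp->rsv_hpages += n;
        h->resv_huge_pages += n;
}

static void unreserve(struct hstate *h, struct subpool *sp, long n)
{
        sp->rsv_hpages -= n;
        h->resv_huge_pages -= n;
}

int main(void)
{
        struct hstate h = { 0 };
        struct subpool a = { 0 }, b = { 0 };

        reserve(&h, &a, 2);
        reserve(&h, &b, 3);
        unreserve(&h, &a, 1);

        /* Prints "4 == 1 + 3": the global count stays the sum of the
         * subpool counts. */
        printf("%ld == %ld + %ld\n", h.resv_huge_pages,
               a.rsv_hpages, b.rsv_hpages);
        return 0;
}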
More details on the flow in alloc_hugetlb_folio() which might be helpful:
hugepage_subpool_get_pages() returns "the number of pages by which the global pools must be adjusted (upward)". Other than on error, this return value is never negative, since hugepage_subpool_get_pages() is always called with a positive delta.
Specifically in alloc_hugetlb_folio(), the return value is either 0 or 1 (other than errors).
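Here's a simplified, runnable restatement (plain C, not the kernel code) of the minimum-size accounting branch that produces this return value; locking, the max-size accounting and the error paths are dropped:

#include <stdio.h>

static long rsv_hpages = 3;     /* toy spool->rsv_hpages */

/* Returns the portion of the request the subpool's reserves could not
 * cover, i.e. the amount by which the global pools must be adjusted
 * upward. */
static long subpool_get_pages(long delta)
{
        if (delta > rsv_hpages) {
                long uncovered = delta - rsv_hpages;

                rsv_hpages = 0;
                return uncovered;
        }
        rsv_hpages -= delta;
        return 0;               /* reserves already sufficient */
}

int main(void)
{
        printf("%ld\n", subpool_get_pages(2));  /* 0: fully covered */
        printf("%ld\n", subpool_get_pages(2));  /* 1: one page uncovered */
        return 0;
}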
If the return value is 0, the subpool had enough reservations and so we should decrement h->resv_huge_pages.
If the return value is 1, it means that this subpool did not have any more reserved hugepages, and we need to get a page from the global hstate. dequeue_hugetlb_folio_vma() will get us a page that was already allocated.
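Putting the two cases together, here's a runnable toy model of the caller side (plain C; all names are stand-ins, and the dequeue/fresh allocation of the page itself is elided). The point is that the subpool's 0-or-1 answer alone decides whether h->resv_huge_pages is decremented:

#include <stdio.h>

static long resv_huge_pages = 1;        /* toy h->resv_huge_pages */
static long rsv_hpages = 1;             /* toy spool->rsv_hpages */

/* Toy hugepage_subpool_get_pages() for delta == 1. */
static long subpool_get_one(void)
{
        if (rsv_hpages > 0) {
                rsv_hpages--;
                return 0;       /* a subpool reservation covers the page */
        }
        return 1;               /* the global pools must provide the page */
}

static void alloc_one(void)
{
        long gbl_chg = subpool_get_one();

        /* ... dequeue an existing page or allocate a fresh one ... */

        if (gbl_chg == 0)
                resv_huge_pages--;      /* a reserved page was consumed */
}

int main(void)
{
        alloc_one();    /* reserved: both counters drop to 0 */
        alloc_one();    /* unreserved: neither counter changes */
        printf("resv_huge_pages=%ld rsv_hpages=%ld\n",
               resv_huge_pages, rsv_hpages);
        return 0;
}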
In dequeue_hugetlb_folio_vma(), if the vma doesn't have enough reserves for 1 page and there are no available_huge_pages() left, we quit dequeueing, since we will need to allocate a new page. If avoid_reserve is set, meaning we don't want to use the vma's reserves in the resv_map, we also check available_huge_pages(); if there are available_huge_pages(), we go on to dequeue a page.
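Those two checks, restated as a plain-C predicate (the booleans stand in for vma_has_reserves(), avoid_reserve and available_huge_pages()):

#include <stdbool.h>

bool may_dequeue(bool vma_has_reserves, bool avoid_reserve,
                 bool available_huge_pages)
{
        /* No resv_map reservation and no free, unreserved pages left:
         * give up dequeueing; a new page must be allocated. */
        if (!vma_has_reserves && !available_huge_pages)
                return false;

        /* avoid_reserve: don't consume the vma's resv_map reservation,
         * so a free, unreserved page must be available. */
        if (avoid_reserve && !available_huge_pages)
                return false;

        return true;
}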
Then, we determine whether to decrement h->resv_huge_pages. As argued above, we should decrement if a reserved page from the subpool was used, not based on whether avoid_reserve or vma_has_reserves() is set.
In the case where a surplus page needs to be allocated, the surplus page isn't, and doesn't need to be, associated with a subpool, so no subpool page-count updates are required. h->resv_huge_pages still has to be updated... is this where h->resv_huge_pages can go negative?