On Fri, Dec 27, 2024 at 11:15:44PM +0000, Ackerley Tng wrote:
Ackerley Tng <ackerleytng@google.com> writes:
<snip>
I'll go over the rest of your patches and dig into the meaning of `avoid_reserve`.
Yes, after looking into this more deeply, I agree that avoid_reserve means avoiding the reservations in the resv_map rather than reservations in the subpool or hstate.
Here's more detail of what's going on in the reproducer that I wrote as I reviewed Peter's patch:
- On fallocate(), allocate page A
- On mmap(), set up a vma without VM_MAYSHARE since MAP_PRIVATE was requested
- On faulting *buf = 1, allocate a new page B, copy A to B because the mmap request was MAP_PRIVATE
- On fork, prep for COW by marking page as read only. Both parent and child share B.
- On faulting *buf = 2 (write fault), allocate page C, copy B to C
  - B belongs to the child, C belongs to the parent
  - C is owned by the parent
- Child exits, B is freed
- On munmap(), C is freed
- On unlink(), A is freed
When C was allocated in the parent (owns MAP_PRIVATE page, doing a copy on write), spool->rsv_hpages was decreased but h->resv_huge_pages was not. This is the root of the bug.
We should decrement h->resv_huge_pages if a reserved page from the subpool was used, instead of whether avoid_reserve or vma_has_reserves() is set. If avoid_reserve is set, the subpool shouldn't be checked for a reservation, so we won't be decrementing h->resv_huge_pages anyway.
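To make the mismatch above concrete, here is a minimal userspace sketch. It is a toy model, not the kernel code: the `toy_*` names are hypothetical, the structs keep only the two counters under discussion, and `toy_subpool_get_page()` only imitates the "0 or 1" return convention of hugepage_subpool_get_pages() for a delta of 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for struct hstate and struct hugepage_subpool. */
struct toy_hstate { long resv_huge_pages; };
struct toy_spool  { long rsv_hpages; };

/*
 * Toy subpool lookup for a single page: returns 0 if a subpool
 * reservation covered the page, 1 if the global pool must be adjusted.
 */
static long toy_subpool_get_page(struct toy_spool *spool)
{
	if (spool->rsv_hpages > 0) {
		spool->rsv_hpages--;
		return 0;
	}
	return 1;
}

/* Old policy: the global decrement depends on avoid_reserve and the vma. */
static void toy_alloc_buggy(struct toy_hstate *h, struct toy_spool *spool,
			    bool avoid_reserve, bool vma_has_reserves)
{
	(void)toy_subpool_get_page(spool);
	if (!avoid_reserve && vma_has_reserves)
		h->resv_huge_pages--;
}

/* Proposed policy: decrement whenever a subpool reservation was used. */
static void toy_alloc_fixed(struct toy_hstate *h, struct toy_spool *spool)
{
	if (toy_subpool_get_page(spool) == 0)
		h->resv_huge_pages--;
}
```

With one reservation in both counters, calling toy_alloc_buggy() with avoid_reserve set leaves the subpool at 0 and the global count at 1, which is the mismatch described above. toy_alloc_fixed() brings both to 0.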
I agree with Peter's fix as a whole (the entire patch series).
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Some definitions which might be helpful:
- h->resv_huge_pages indicates number of reserved pages globally.
  - This number increases when pages are reserved
  - This number decreases when reserved pages are allocated, or when pages are unreserved
- spool->rsv_hpages indicates number of reserved pages in this subpool.
  - This number increases when pages are reserved
  - This number decreases when reserved pages are allocated, or when pages are unreserved
- h->resv_huge_pages should be the sum of all subpools' spool->rsv_hpages.
I think you're correct. One add-on comment: I think when vma reservations are taken into account, the global reservation should be the sum of all spools' and all vmas' reservations.
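The relationship stated above can be written down as a small consistency check. This is only a sketch of the accounting as it is understood in this thread: the helper names are hypothetical, and the plain arrays stand in for walking the real subpools and resv_maps.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array of per-subpool or per-vma reservation counts. */
static long toy_sum(const long *vals, size_t n)
{
	long total = 0;

	for (size_t i = 0; i < n; i++)
		total += vals[i];
	return total;
}

/*
 * Returns nonzero if the global count equals the sum of all subpool
 * reservations plus all vma reservations.
 */
static int toy_resv_consistent(long global_resv,
			       const long *spool_rsv, size_t nr_spools,
			       const long *vma_resv, size_t nr_vmas)
{
	return global_resv ==
	       toy_sum(spool_rsv, nr_spools) + toy_sum(vma_resv, nr_vmas);
}
```

For example, two subpools holding 2 and 1 reservations plus one vma holding 3 would be consistent with a global count of 6, and with no other value.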
More details on the flow in alloc_hugetlb_folio() which might be helpful:
hugepage_subpool_get_pages() returns "the number of pages by which the global pools must be adjusted (upward)". This return value is never negative other than errors. (hugepage_subpool_get_pages() always gets called with a positive delta).
Specifically in alloc_hugetlb_folio(), the return value is either 0 or 1 (other than errors).
If the return value is 0, the subpool had enough reservations and so we should decrement h->resv_huge_pages.
If the return value is 1, it means that this subpool did not have any more reserved hugepages, and we need to get a page from the global hstate. dequeue_hugetlb_folio_vma() will get us a page that was already allocated.
In dequeue_hugetlb_folio_vma(), if the vma doesn't have enough reserves for 1 page and there are no available_huge_pages() left, we quit dequeueing since we will need to allocate a new page. If avoid_reserve is set, meaning we don't want to use the vma's reserves in resv_map, we also check available_huge_pages(). If there are available_huge_pages(), we go on to dequeue a page.
Then, we determine whether to decrement h->resv_huge_pages. We should decrement if a reserved page from the subpool was used, instead of whether avoid_reserve or vma_has_reserves() is set.
In the case where a surplus page needs to be allocated, the surplus page isn't and doesn't need to be associated with a subpool, so no subpool hugepage number tracking updates are required. h->resv_huge_pages still has to be updated... is this where h->resv_huge_pages can go negative?
This question doesn't sound relevant to the specific scenario that this patch (or the reproducer attached to it) is about. The reproducer for this patch doesn't need to involve any surplus pages.
Going back to the question you're asking - I don't think resv_huge_pages will go negative for the surplus case?
IIUC updating resv_huge_pages is the correct behavior even for surplus pages, as long as gbl_chg==0.
The initial change was done by Naoya in commit a88c76954804 ("mm: hugetlb: fix hugepage memory leak caused by wrong reserve count"). There is more information in the commit log. In general, gbl_chg==0 means we consumed a global reservation, either in the vma or in the spool, so it must be accounted globally after the folio is successfully allocated. Here "being accounted" should mean the global resv count will be properly decremented.
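A rough model of that rule, including the surplus case asked about earlier, might look like the following. It is a simplified sketch with hypothetical names: `struct toy_pool` keeps just three counters, and `can_alloc_surplus` stands in for whether a fresh page can be obtained from the buddy allocator.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy counters loosely mirroring the hstate fields of the same names. */
struct toy_pool {
	long free_huge_pages;
	long resv_huge_pages;
	long surplus_huge_pages;
};

/* Returns true if a page was obtained, false if the allocation fails. */
static bool toy_alloc_folio(struct toy_pool *p, long gbl_chg,
			    bool can_alloc_surplus)
{
	/* Without a reservation, only unreserved free pages may be taken. */
	long usable = gbl_chg ? p->free_huge_pages - p->resv_huge_pages
			      : p->free_huge_pages;

	if (usable > 0)
		p->free_huge_pages--;
	else if (can_alloc_surplus)
		p->surplus_huge_pages++;
	else
		return false;

	/* gbl_chg == 0: a global reservation was consumed, so account for it. */
	if (gbl_chg == 0) {
		assert(p->resv_huge_pages > 0);
		p->resv_huge_pages--;
	}
	return true;
}
```

In this model the decrement only happens when gbl_chg == 0, that is, when a reservation was counted earlier, so the surplus path takes resv_huge_pages from 1 to 0 rather than below zero.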
Thanks for taking a look, Ackerley!