(+CC Michal Koutný, cgroups@vger.kernel.org, Aneesh Kumar)
On 8/8/19 4:13 PM, Mina Almasry wrote:
Problem: Currently tasks attempting to allocate more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to allocate more hugetlb memory than its hugetlb_cgroup limit allows (but no more than is available), the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault the memory in.
We have developers interested in using hugetlb_cgroups, and they have expressed dissatisfaction regarding this behavior. We'd like to improve this behavior such that tasks violating the hugetlb_cgroup limits get an error at mmap/shmget time, rather than getting SIGBUS'd when they try to fault the excess memory in.
The underlying problem is that today's hugetlb_cgroup accounting happens at hugetlb memory *fault* time, rather than at *reservation* time. Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and the offending task gets SIGBUS'd.
Proposed Solution: A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This counter has slightly different semantics than hugetlb.xMB.[limit|usage]_in_bytes:
- While usage_in_bytes tracks all *faulted* hugetlb memory,
  reservation_usage_in_bytes tracks all *reserved* hugetlb memory.
- If a task attempts to reserve more memory than limit_in_bytes allows,
  the kernel will allow it to do so. But if a task attempts to reserve more memory than reservation_limit_in_bytes, the kernel will fail this reservation.
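
To make the difference between the two counters concrete, here is a toy model in Python. It is only a sketch of the semantics as described in the list above, not kernel code: the class and method names are hypothetical, and it does not model whether faulting memory in changes the reservation usage.

```python
# Toy model of the two counters described above. This is not kernel code;
# HugetlbCgroupModel and its methods are hypothetical names for illustration.

MB = 1 << 20


class HugetlbCgroupModel:
    def __init__(self, limit, reservation_limit):
        self.limit = limit                          # models limit_in_bytes
        self.reservation_limit = reservation_limit  # models reservation_limit_in_bytes
        self.usage = 0                              # models usage_in_bytes (faulted)
        self.reservation_usage = 0                  # models reservation_usage_in_bytes (reserved)

    def reserve(self, nbytes):
        """Model mmap/shmget: checked only against the reservation limit."""
        if self.reservation_usage + nbytes > self.reservation_limit:
            return False  # the reservation fails up front
        self.reservation_usage += nbytes
        return True

    def fault(self, nbytes):
        """Model faulting memory in: checked against the fault-time limit."""
        if self.usage + nbytes > self.limit:
            return False  # the task would get SIGBUS here
        self.usage += nbytes
        return True


if __name__ == "__main__":
    cg = HugetlbCgroupModel(limit=4 * MB, reservation_limit=8 * MB)
    print(cg.reserve(6 * MB))  # True: above limit, but within reservation_limit
    print(cg.reserve(4 * MB))  # False: 6 MB + 4 MB exceeds reservation_limit
```

In this model a reservation larger than limit_in_bytes succeeds as long as it fits under reservation_limit_in_bytes, which matches the second bullet above.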
This proposal is implemented in this patch, with tests to verify functionality and show the usage.
Thanks for taking on this effort, Mina.
Before looking at the details of the code, it might be helpful to discuss the expected semantics of the proposed reservation limits.
I see you took into account the differences between private and shared mappings. This is good, as the reservation behavior is different for each of these cases. First let's look at private mappings.
For private mappings, the reservation usage will be the size of the mapping. This should be fairly simple. As reservations are consumed in the hugetlbfs code, reservations in the resv_map are removed. I see you have a hook into region_del. So, the expectation is that as reservations are consumed, the reservation usage will drop for the cgroup. Correct? The only tricky thing about private mappings is COW because of fork. Current reservation semantics specify that all reservations stay with the parent. If the child faults and cannot get a page, it gets SIGBUS. I assume the new reservation limits will work the same way.
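
As a minimal sketch of the expectation I describe above for private mappings (reservation usage starts at the mapping size and drops as reservations are consumed), consider the following toy model. The names are hypothetical, pages are counted rather than bytes, and COW after fork is deliberately not modeled.

```python
# Toy model of the expected private-mapping behavior; not kernel code.
# PrivateMappingModel is a hypothetical name used only for illustration.

class PrivateMappingModel:
    def __init__(self, npages):
        self.npages = npages
        self.faulted = set()             # page indices already faulted in
        self.reservation_usage = npages  # starts at the size of the mapping

    def fault(self, index):
        """Fault in one page; the first touch consumes one reservation."""
        if not 0 <= index < self.npages:
            raise IndexError("page index outside the mapping")
        if index not in self.faulted:
            self.faulted.add(index)
            self.reservation_usage -= 1
        return self.reservation_usage


if __name__ == "__main__":
    m = PrivateMappingModel(3)
    print(m.reservation_usage)  # 3
    print(m.fault(0))           # 2
    print(m.fault(0))           # 2: the page was already faulted in
```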
I believe tracking reservations for shared mappings can get quite complicated. The hugetlbfs reservation code around shared mappings 'works' on the basis that shared mapping reservations are global. As a result, reservations are more associated with the inode than with the task making the reservation. For example, consider a file of size 4 hugetlb pages. Task A maps the first 2 pages, and 2 reservations are taken. Task B maps all 4 pages, and 2 additional reservations are taken. I am not really sure of the desired semantics here for reservation limits if A and B are in separate cgroups. Should B be charged for 4 or 2 reservations? Also, in the example above, suppose that after both tasks create their mappings, Task B faults in the first page. Does the reservation usage of Task A go down, since it originally had the reservation?
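
The 4-page example can be written down as a toy model to show where the ambiguity comes from. This is a sketch only: the real resv_map tracks regions rather than a set of page indices, and SharedInodeModel and map_range are hypothetical names.

```python
# Toy model of per-inode reservations for shared mappings; not kernel code.
# The real resv_map is a list of regions; a set of page indices is used here
# only to keep the example short.

class SharedInodeModel:
    def __init__(self):
        self.reserved = set()  # page indices that already have a reservation

    def map_range(self, start, npages):
        """Return how many new reservations this mapping takes on the inode."""
        wanted = set(range(start, start + npages))
        new = wanted - self.reserved
        self.reserved |= new
        return len(new)


if __name__ == "__main__":
    inode = SharedInodeModel()
    taken_by_a = inode.map_range(0, 2)  # Task A maps the first 2 pages
    taken_by_b = inode.map_range(0, 4)  # Task B maps all 4 pages
    print(taken_by_a, taken_by_b)       # 2 2
    # The open question: is B's cgroup charged for the reservations it newly
    # took (2) or for the size of its mapping (4)?
    print("B charged by new reservations:", taken_by_b)
    print("B charged by mapping size:", 4)
```

Because the reservations live on the inode, the model has no record of which task took which reservation, which is why the two charging choices differ.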
It should also be noted that when hugetlbfs reservations are 'consumed' for shared mappings there are no changes to the resv_map. Rather, the unmap code compares the contents of the page cache to the resv_map to determine how many reservations were actually consumed. I did not look closely enough to determine whether the code drops reservation usage counts as pages are added to shared mappings.