On Fri, Sep 27, 2019 at 2:59 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
On 9/26/19 5:55 PM, Mina Almasry wrote:
Provided we keep the existing controller untouched, should the new controller track:
- only reservations, or
- both reservations and allocations for which no reservations exist
(such as the MAP_NORESERVE case)?
I like the 'both' approach. Seems to me a counter like that would work automatically regardless of whether the application is allocating hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut into reserved hugetlb pages, correct?
Correct. One other easy way to allocate huge pages without reserves (that I know is used today) is via the fallocate system call.
If so, then applications that allocate with NORESERVE will get sigbused when they hit their limit, and applications that allocate without NORESERVE may get an error at mmap time but will always be within their limits while they access the mmap'd memory, correct?
Correct. At page allocation time we can easily check to see if a reservation exists and not charge. For any specific page within a hugetlbfs file, a charge would happen at mmap time or allocation time.
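To make the two charge points concrete, here is a toy user-space model of a 'both' counter. It is only a sketch of the accounting rule discussed above, not kernel code: the class HugetlbCounter, its methods, and the page-granular limit are all hypothetical names chosen for illustration.

```python
class HugetlbCounter:
    """Toy 'both' counter: tracks reservations plus unreserved allocations.

    Hypothetical model for illustration; units are huge pages, not bytes.
    """

    def __init__(self, limit_pages):
        self.limit = limit_pages
        self.charged = 0

    def _try_charge(self, pages):
        # Refuse any charge that would push usage past the limit.
        if self.charged + pages > self.limit:
            return False
        self.charged += pages
        return True

    def mmap(self, pages, noreserve=False):
        """Return a mapping record, or None if the reservation charge fails."""
        if not noreserve and not self._try_charge(pages):
            return None  # models mmap failing up front
        return {"pages": pages, "reserved": not noreserve, "faulted": set()}

    def fault(self, mapping, index):
        """Touch one page of a mapping; return 'ok' or 'SIGBUS'."""
        if index in mapping["faulted"]:
            return "ok"  # page already present, nothing to charge
        # A reserved page was charged at mmap time, so it is not charged again.
        if not mapping["reserved"] and not self._try_charge(1):
            return "SIGBUS"
        mapping["faulted"].add(index)
        return "ok"


counter = HugetlbCounter(limit_pages=4)
reserved = counter.mmap(3)               # charged 3 pages at mmap time
lazy = counter.mmap(3, noreserve=True)   # nothing charged yet
lazy_results = [counter.fault(lazy, i) for i in range(3)]
reserved_results = [counter.fault(reserved, i) for i in range(3)]
```

In this model the NORESERVE mapping gets one page before hitting the limit and fails on the next fault, while every fault on the reserved mapping succeeds because its charge was taken at mmap time.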
One exception (that I can think of) to this "mmap(RESERVE) will not cause a SIGBUS" rule is the case of hole punch. If someone punches a hole in a file, they remove not only the pages associated with the file but the reservation information as well. Therefore, a subsequent fault will be treated the same as an allocation without a reservation.
I don't think it causes a sigbus. This is the scenario, right:
1. Make cgroup with limit X bytes.
2. Task in cgroup mmaps a file with X bytes, causing the cgroup to get charged.
3. A hole of size Y is punched in the file, causing the cgroup to get uncharged Y bytes.
4. The task faults in memory from the hole, getting charged up to Y bytes again. But they will still be within their limits.
IIUC userspace only gets sigbus'd if the limit is lowered between steps 3 and 4, and it's ok if it gets sigbus'd there in my opinion.
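The four steps can be walked through numerically. The function below is a hypothetical sketch (run_scenario and lowered_limit are made-up names, and it counts pages rather than bytes); it assumes the refault in step 4 is charged one page at a time, like an allocation without a reservation, and that the limit can be lowered below current usage.

```python
def run_scenario(limit, mapped, hole, lowered_limit=None):
    """Replay the hole punch scenario; return (charge history, outcome).

    Hypothetical model: all quantities are counted in huge pages.
    """
    if mapped > limit or hole > mapped:
        raise ValueError("need hole <= mapped <= limit")

    charged = 0
    history = []

    # Step 2: mmap with reservation charges the whole mapping up front.
    charged += mapped
    history.append(charged)

    # Step 3: the hole punch drops the pages and their reservations.
    charged -= hole
    history.append(charged)

    # Optional: the limit is lowered between steps 3 and 4.
    if lowered_limit is not None:
        limit = lowered_limit

    # Step 4: refault the hole, charged page by page with no reservation.
    for _ in range(hole):
        if charged + 1 > limit:
            history.append(charged)
            return history, "SIGBUS"
        charged += 1
    history.append(charged)
    return history, "ok"


unchanged = run_scenario(limit=8, mapped=8, hole=3)
lowered = run_scenario(limit=8, mapped=8, hole=3, lowered_limit=6)
```

With the limit left alone, the charge goes 8, 5, 8 and the refault succeeds. With the limit lowered to 6 after the punch, only one page of the hole can be recharged before the model reports SIGBUS.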
I 'think' the code to remove/truncate a file will work correctly as it is today, but I need to think about this some more.
So the 'both' counter seems like a one-size-fits-all.
I think the only sticking point left is whether an added controller can support both cgroup-v2 and cgroup-v1. If I could get confirmation on that I'll provide a patchset.
Sorry, but I cannot provide cgroup expertise.
Mike Kravetz