Re: [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

13 Aug 2019

      On 8/10/19 3:01 PM, Mina Almasry wrote:
...
On Sat, Aug 10, 2019 at 11:58 AM Mike Kravetz mike.kravetz@oracle.com wrote:
...
On 8/9/19 12:42 PM, Mina Almasry wrote:
...
On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz mike.kravetz@oracle.com wrote:
...
On 8/8/19 4:13 PM, Mina Almasry wrote:
...
Problem:
Currently, tasks attempting to allocate more hugetlb memory than is available get
a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
However, if a task attempts to allocate more hugetlb memory than its
hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
but will SIGBUS the task when it attempts to fault the memory in.
<snip>
>> I believe tracking reservations for shared mappings can get quite complicated.
>> The hugetlbfs reservation code around shared mappings 'works' on the basis
>> that shared mapping reservations are global.  As a result, reservations are
>> more associated with the inode than with the task making the reservation.
>
> FWIW, I found it not too bad. And my tests at least don't detect an
> anomaly around shared mappings. The key I think is that I'm tracking
> cgroup to uncharge on the file_region entry inside the resv_map, so we
> know who allocated each file_region entry exactly and we can uncharge
> them when the entry is region_del'd.
>
>> For example, consider a file of size 4 hugetlb pages.
>> Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
>> all 4 pages, and 2 additional reservations are taken.  I am not really sure
>> of the desired semantics here for reservation limits if A and B are in separate
>> cgroups.  Should B be charged for 4 or 2 reservations?
>
> Task A's cgroup is charged 2 pages to its reservation usage.
> Task B's cgroup is charged 2 pages to its reservation usage.
OK,

Suppose Task B's cgroup allows a reservation of 2 huge pages and an
allocation of 2 huge pages.  The mmap would succeed, but Task B could
potentially need to allocate more than 2 huge pages.  So, when faulting in
more than 2 huge pages B would get a SIGBUS.  Correct?  Or, am I missing
something?

Perhaps the reservation charge should always be the same as the map
size/maximum allocation size?

I'm thinking this would work similarly to how other shared memory like
tmpfs is accounted for right now. I.e. if a task performs an operation
that causes memory to be allocated then that task is charged for that
memory, and if another task uses memory that has already been
allocated and charged to another task, then it can use the memory
without being charged.

So in the case of hugetlb memory, if a task is mmapping memory that causes
a new reservation to be made, and new entries to be created in the
resv_map for the shared mapping, then that task gets charged. If the
task is mmapping memory that is already reserved or faulted, then it
reserves or faults it without getting charged.

In the example above, in chronological order:

1. Task A mmaps 2 hugetlb pages, gets charged 2 hugetlb reservations.
2. Task B mmaps 4 hugetlb pages, gets charged only 2 hugetlb
reservations because the first 2 are charged already and can be used
without incurring a charge.
3. Task B accesses 4 hugetlb pages, gets charged *4* hugetlb faults,
since none of the 4 pages are faulted in yet. If the task is only
allowed 2 hugetlb page faults then it will actually get a SIGBUS.
4. Task A accesses 4 hugetlb pages, gets charged no faults, since all
the hugetlb faults are charged to Task B.

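The charging order above can be sketched as a small userspace model. This is
only an illustration of the proposed semantics, not kernel code: the Cgroup
and SharedFile classes, their fields, and the "first toucher pays" bookkeeping
are hypothetical names invented here, and the model does not check that a task
stays within the range it mapped.

```python
from dataclasses import dataclass


@dataclass
class Cgroup:
    """Hypothetical stand-in for a hugetlb_cgroup with two separate limits."""
    name: str
    resv_limit: int    # max pages this cgroup may be charged for reserving
    fault_limit: int   # max pages this cgroup may be charged for faulting
    resv_usage: int = 0
    fault_usage: int = 0


class SharedFile:
    """Toy shared hugetlbfs file: remembers who was charged for each page."""

    def __init__(self, npages):
        self.reserved_by = [None] * npages
        self.faulted_by = [None] * npages

    def mmap(self, cg, npages):
        # Only pages with no existing reservation incur a new charge.
        new = [i for i in range(npages) if self.reserved_by[i] is None]
        if cg.resv_usage + len(new) > cg.resv_limit:
            return False  # over the reservation limit: fail at mmap time
        for i in new:
            self.reserved_by[i] = cg.name
        cg.resv_usage += len(new)
        return True

    def touch(self, cg, npages):
        # The first cgroup to fault a page in is charged for it.
        for i in range(npages):
            if self.faulted_by[i] is not None:
                continue  # already faulted and charged to someone else
            if cg.fault_usage + 1 > cg.fault_limit:
                return "SIGBUS"  # over the fault limit at fault time
            self.faulted_by[i] = cg.name
            cg.fault_usage += 1
        return "ok"


# Both cgroups are limited to 2 reserved and 2 faulted pages.
a = Cgroup("A", resv_limit=2, fault_limit=2)
b = Cgroup("B", resv_limit=2, fault_limit=2)
f = SharedFile(4)

f.mmap(a, 2)            # A is charged 2 reservations
f.mmap(b, 4)            # B is charged only the 2 new reservations
result = f.touch(b, 4)  # B hits its fault limit on the third page
```

In this model the mmap by Task B succeeds because only 2 new reservations are
charged, while the later access fails on the fault limit, which is the
scenario asked about earlier in the thread.
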
So, yes, I can see a scenario where userspace still gets SIGBUS'd, but
I think that's fine because:

1. Notice that the SIGBUS is due to the faulting limit, and not the
reservation limit, so we're not regressing the status quo per se.
Folks using the fault limit today understand the SIGBUS risk.
2. The way I expect folks to use this is to partition the available
hugetlb memory on the machine with 'reservation limits' and forgo the
existing fault limits. Using both at the same time I think would be a
superuser feature for folks that really know what they are doing, and
understand the risk of SIGBUS that comes with using the existing fault
limits.
3. I expect userspace to in general handle this correctly because
there are similar challenges with all shared memory and accounting of
it, even in tmpfs, I think.

Ok, that helps explain your use case.  I agree that it would be difficult
to use both fault and reservation limits together.  Especially in the case
of shared mappings.
-- 
Mike Kravetz
