On 8/9/19 1:57 PM, Mina Almasry wrote:
On Fri, Aug 9, 2019 at 1:39 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
On 8/9/19 11:05 AM, Mina Almasry wrote:
On Fri, Aug 9, 2019 at 4:27 AM Michal Koutný <mkoutny@suse.com> wrote:
Alternatives considered: [...]
(I did not try that but) have you considered: 3) MAP_POPULATE while you're making the reservation,
I have tried this, and the behaviour is not great. Basically, if userspace mmaps more memory than its cgroup limit allows with MAP_POPULATE, the kernel will reserve the total amount requested by userspace, it will fault in pages up to the cgroup limit, and then it will SIGBUS the task when it tries to access the rest of its 'reserved' memory.
So for example:
- if /proc/sys/vm/nr_hugepages == 10, and
- your cgroup limit is 5 pages, and
- you mmap(MAP_POPULATE) 7 pages.
Then the kernel will reserve 7 pages, fault in 5 of those 7 pages, and SIGBUS you when you try to access the remaining 2 pages. So the problem persists. Folks would still like to know at mmap time that they are crossing the limits.
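To make the arithmetic in that example concrete, here is a rough sketch in C. It is only a model of the behaviour described above, not kernel code: the struct and function names are made up for illustration, and it assumes that every page in the global pool is free and that a request larger than the pool fails at mmap time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a MAP_POPULATE hugetlb mmap, per the example above. */
struct hugetlb_outcome {
    int mmap_succeeds;  /* 1 if the reservation fits in the global pool */
    size_t reserved;    /* pages reserved at mmap time */
    size_t faulted;     /* pages faulted in by MAP_POPULATE */
    size_t unbacked;    /* reserved pages whose first access would SIGBUS */
};

static struct hugetlb_outcome
model_populate(size_t pool_pages, size_t cgroup_limit, size_t requested)
{
    struct hugetlb_outcome out = {0, 0, 0, 0};

    /* Assumption: the whole pool is free, so only an oversized
     * request fails at reservation (mmap) time. */
    if (requested > pool_pages)
        return out;

    out.mmap_succeeds = 1;
    out.reserved = requested;  /* the full request is reserved */

    /* Populate stops once the cgroup limit is reached. */
    out.faulted = requested < cgroup_limit ? requested : cgroup_limit;
    out.unbacked = requested - out.faulted;

    assert(out.faulted + out.unbacked == out.reserved);
    return out;
}
```

With the numbers from the example, model_populate(10, 5, 7) reports 7 pages reserved, 5 faulted in, and 2 left to SIGBUS on first access, even though the mmap call itself succeeded.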
If you got the failure at mmap time in the MAP_POPULATE case, would this be useful?
Just thinking that would be a relatively simple change.
Not quite, unfortunately. A subset of the folks that want to use hugetlb memory don't want to use MAP_POPULATE (IIRC, something about mmapping a huge amount of hugetlb memory at their jobs' startup, and doing that with MAP_POPULATE adds so much to their startup time that it is prohibitively expensive - but that's just what I vaguely recall offhand. I can get you the details if you're interested).
Yes, MAP_POPULATE can get expensive as you will need to zero all those huge pages.
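For reference, the two mapping styles being compared might look roughly like the sketch below. The helper names are hypothetical and the page size is left to the caller (2 MiB is a common default on x86-64, but that is an assumption here); only mmap() and the MAP_* flags are real interfaces.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* exposes MAP_HUGETLB and MAP_POPULATE in glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Flags for an anonymous, private hugetlb mapping (hypothetical helper). */
static int hugetlb_flags(int populate)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB;

    if (populate)
        flags |= MAP_POPULATE;  /* fault in, and zero, every page up front */
    return flags;
}

/* Map npages huge pages of page_size bytes each (hypothetical helper).
 * Returns MAP_FAILED on error, with errno set by mmap(). */
static void *map_hugetlb(size_t npages, size_t page_size, int populate)
{
    /* Huge page sizes are powers of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    return mmap(NULL, npages * page_size, PROT_READ | PROT_WRITE,
                hugetlb_flags(populate), -1, 0);
}
```

With populate set to 0 the call returns once the reservation is made, and each huge page is zeroed only when it is first touched, so the startup cost is spread over the job's lifetime instead of being paid inside mmap().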