[PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

List overview All Threads
Download

newer

older

[PATCH v2 0/2] gup.c,...

[PATCH 0/2] gup.c, gup_benchmark.c...

Mina Almasry

19 Sep 2019 19 Sep '19

10:24 p.m.

Patch series implements hugetlb_cgroup reservation usage and limits, which track hugetlb reservations rather than hugetlb memory faulted in. Details of the approach is 1/7.

Changes in v5: - Moved the bulk of the description to the first patch in the series. - Clang formatted the entire series. - Split off 'hugetlb: remove duplicated code' and 'hugetlb: region_chg provides only cache entry' into their own patch series. - Added comments to HUGETLB_RES enum. - Fixed bug in 'hugetlb: disable region_add file_region coalescing' calculating the wrong number of regions_needed in some cases. - Changed sleeps in test to proper conditions. - Misc fixes in test based on shuah@ review.

Changes in v4: - Split up 'hugetlb_cgroup: add accounting for shared mappings' into 4 patches for better isolation and context on the indvidual changes: - hugetlb_cgroup: add accounting for shared mappings - hugetlb: disable region_add file_region coalescing - hugetlb: remove duplicated code - hugetlb: region_chg provides only cache entry - Fixed resv->adds_in_progress accounting. - Retained behavior that region_add never fails, in earlier patchsets region_add could return failure. - Fixed libhugetlbfs failure. - Minor fix to the added tests that was preventing them from running on some environments.

Changes in v3: - Addressed comments of Hillf Danton: - Added docs. - cgroup_files now uses enum. - Various readability improvements. - Addressed comments of Mike Kravetz. - region_* functions no longer coalesce file_region entries in the resv_map. - region_add() and region_chg() refactored to make them much easier to understand and remove duplicated code so this patch doesn't add too much complexity. - Refactored common functionality into helpers.

Changes in v2: - Split the patch into a 5 patch series. - Fixed patch subject.

Mina Almasry (7): hugetlb_cgroup: Add hugetlb_cgroup reservation counter hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations hugetlb_cgroup: add reservation accounting for private mappings hugetlb: disable region_add file_region coalescing hugetlb_cgroup: add accounting for shared mappings hugetlb_cgroup: Add hugetlb_cgroup reservation tests hugetlb_cgroup: Add hugetlb_cgroup reservation docs

.../admin-guide/cgroup-v1/hugetlb.rst | 85 +++- include/linux/hugetlb.h | 31 +- include/linux/hugetlb_cgroup.h | 33 +- mm/hugetlb.c | 423 +++++++++++----- mm/hugetlb_cgroup.c | 190 ++++++-- tools/testing/selftests/vm/.gitignore | 1 + tools/testing/selftests/vm/Makefile | 1 + .../selftests/vm/charge_reserved_hugetlb.sh | 461 ++++++++++++++++++ .../selftests/vm/write_hugetlb_memory.sh | 22 + .../testing/selftests/vm/write_to_hugetlbfs.c | 250 ++++++++++ 10 files changed, 1306 insertions(+), 191 deletions(-) create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

-- 2.23.0.351.gc4317032e6-goog

Show replies by date

Mina Almasry

19 Sep 19 Sep

10:24 p.m.

New subject: [PATCH v5 1/7] hugetlb_cgroup: Add hugetlb_cgroup reservation counter

These counters will track hugetlb reservations rather than hugetlb memory faulted in. This patch only adds the counter, following patches add the charging and uncharging of the counter.

Problem: Currently tasks attempting to allocate more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to allocate hugetlb memory only more than its hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault the memory in.

We have developers interested in using hugetlb_cgroups, and they have expressed dissatisfaction regarding this behavior. We'd like to improve this behavior such that tasks violating the hugetlb_cgroup limits get an error on mmap/shmget time, rather than getting SIGBUS'd when they try to fault the excess memory in.

The underlying problem is that today's hugetlb_cgroup accounting happens at hugetlb memory *fault* time, rather than at *reservation* time. Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and the offending task gets SIGBUS'd.

Proposed Solution: A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This counter has slightly different semantics than hugetlb.xMB.[limit|usage]_in_bytes:

- While usage_in_bytes tracks all *faulted* hugetlb memory, reservation_usage_in_bytes tracks all *reserved* hugetlb memory.

- If a task attempts to reserve more memory than limit_in_bytes allows, the kernel will allow it to do so. But if a task attempts to reserve more memory than reservation_limit_in_bytes, the kernel will fail this reservation.

This proposal is implemented in this patch series, with tests to verify functionality and show the usage.

Alternatives considered: 1. A new cgroup, instead of only a new page_counter attached to the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code duplication with hugetlb_cgroup. Keeping hugetlb related page counters under hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at reservation time rather than fault time. Adding a new page_counter seems better as userspace could, if it wants, choose to enforce different cgroups differently: one via limit_in_bytes, and another via reservation_limit_in_bytes. This could be very useful if you're transitioning how hugetlb memory is partitioned on your system one cgroup at a time, for example. Also, someone may find usage for both limit_in_bytes and reservation_limit_in_bytes concurrently, and this approach gives them the option to do so.

Caveats: 1. This support is implemented for cgroups-v1. I have not tried hugetlb_cgroups with cgroups v2, and AFAICT it's not supported yet. This is largely because we use cgroups-v1 for now. If required, I can add hugetlb_cgroup support to cgroups v2 in this patch or a follow up. 2. Most complicated bit of this patch I believe is: where to store the pointer to the hugetlb_cgroup to uncharge at unreservation time? Normally the cgroup pointers hang off the struct page. But, with hugetlb_cgroup reservations, one task can reserve a specific page and another task may fault it in (I believe), so storing the pointer in struct page is not appropriate. Proposed approach here is to store the pointer in the resv_map. See patch for details.

Testing: - Added tests passing. - libhugetlbfs tests mostly passing, but some tests have trouble with and without this patch series. Seems environment issue rather than code: - Overall results: ********** TEST SUMMARY * 2M * 32-bit 64-bit * Total testcases: 84 0 * Skipped: 0 0 * PASS: 66 0 * FAIL: 14 0 * Killed by signal: 0 0 * Bad configuration: 4 0 * Expected FAIL: 0 0 * Unexpected PASS: 0 0 * Test not present: 0 0 * Strange test result: 0 0 ********** - Failing tests: - elflink_rw_and_share_test("linkhuge_rw") segfaults with and without this patch series. - LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc (2M: 32): FAIL Address is not hugepage - LD_PRELOAD=libhugetlbfs.so HUGETLB_RESTRICT_EXE=unknown:malloc HUGETLB_MORECORE=yes malloc (2M: 32): FAIL Address is not hugepage - LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc_manysmall (2M: 32): FAIL Address is not hugepage - GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes heapshrink (2M: 32): FAIL Heap not on hugepages - GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so libheapshrink.so HUGETLB_MORECORE=yes heapshrink (2M: 32): FAIL Heap not on hugepages - HUGETLB_ELFMAP=RW linkhuge_rw (2M: 32): FAIL small_data is not hugepage - HUGETLB_ELFMAP=RW HUGETLB_MINIMAL_COPY=no linkhuge_rw (2M: 32): FAIL small_data is not hugepage - alloc-instantiate-race shared (2M: 32): Bad configuration: sched_setaffinity(cpu1): Invalid argument - FAIL Child 1 killed by signal Killed - shmoverride_linked (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument - HUGETLB_SHM=yes shmoverride_linked (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument - shmoverride_linked_static (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument - HUGETLB_SHM=yes shmoverride_linked_static (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument - LD_PRELOAD=libhugetlbfs.so shmoverride_unlinked (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument - LD_PRELOAD=libhugetlbfs.so HUGETLB_SHM=yes shmoverride_unlinked (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Signed-off-by: Mina Almasry almasrymina@google.com Acked-by: Hillf Danton hdanton@sina.com

--- include/linux/hugetlb.h | 23 ++++++++- mm/hugetlb_cgroup.c | 111 ++++++++++++++++++++++++++++++---------- 2 files changed, 107 insertions(+), 27 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index edfca4278319..3d70a17cc0c3 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -320,6 +320,27 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,

#ifdef CONFIG_HUGETLB_PAGE

+enum { + /* Tracks hugetlb memory faulted in. */ + HUGETLB_RES_USAGE, + /* Tracks hugetlb memory reserved. */ + HUGETLB_RES_RESERVATION_USAGE, + /* Limit for hugetlb memory faulted in. */ + HUGETLB_RES_LIMIT, + /* Limit for hugetlb memory reserved. */ + HUGETLB_RES_RESERVATION_LIMIT, + /* Max usage for hugetlb memory faulted in. */ + HUGETLB_RES_MAX_USAGE, + /* Max usage for hugetlb memory reserved. */ + HUGETLB_RES_RESERVATION_MAX_USAGE, + /* Faulted memory accounting fail count. */ + HUGETLB_RES_FAILCNT, + /* Reserved memory accounting fail count. */ + HUGETLB_RES_RESERVATION_FAILCNT, + HUGETLB_RES_NULL, + HUGETLB_RES_MAX, +}; + #define HSTATE_NAME_LEN 32 /* Defines one hugetlb page size */ struct hstate { @@ -340,7 +361,7 @@ struct hstate { unsigned int surplus_huge_pages_node[MAX_NUMNODES]; #ifdef CONFIG_CGROUP_HUGETLB /* cgroup control files */ - struct cftype cgroup_files[5]; + struct cftype cgroup_files[HUGETLB_RES_MAX]; #endif char name[HSTATE_NAME_LEN]; }; diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 68c2f2f3c05b..1386da79c9d7 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -25,6 +25,10 @@ struct hugetlb_cgroup { * the counter to account for hugepages from hugetlb. */ struct page_counter hugepage[HUGE_MAX_HSTATE]; + /* + * the counter to account for hugepage reservations from hugetlb. + */ + struct page_counter reserved_hugepage[HUGE_MAX_HSTATE]; };

#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) @@ -33,6 +37,14 @@ struct hugetlb_cgroup {

static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

+static inline struct page_counter * +hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx, bool reserved) +{ + if (reserved) + return &h_cg->reserved_hugepage[idx]; + return &h_cg->hugepage[idx]; +} + static inline struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s) { @@ -254,30 +266,33 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, return; }

-enum { - RES_USAGE, - RES_LIMIT, - RES_MAX_USAGE, - RES_FAILCNT, -}; - static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) { struct page_counter *counter; + struct page_counter *reserved_counter; struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)]; + reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];

switch (MEMFILE_ATTR(cft->private)) { - case RES_USAGE: + case HUGETLB_RES_USAGE: return (u64)page_counter_read(counter) * PAGE_SIZE; - case RES_LIMIT: + case HUGETLB_RES_RESERVATION_USAGE: + return (u64)page_counter_read(reserved_counter) * PAGE_SIZE; + case HUGETLB_RES_LIMIT: return (u64)counter->max * PAGE_SIZE; - case RES_MAX_USAGE: + case HUGETLB_RES_RESERVATION_LIMIT: + return (u64)reserved_counter->max * PAGE_SIZE; + case HUGETLB_RES_MAX_USAGE: return (u64)counter->watermark * PAGE_SIZE; - case RES_FAILCNT: + case HUGETLB_RES_RESERVATION_MAX_USAGE: + return (u64)reserved_counter->watermark * PAGE_SIZE; + case HUGETLB_RES_FAILCNT: return counter->failcnt; + case HUGETLB_RES_RESERVATION_FAILCNT: + return reserved_counter->failcnt; default: BUG(); } @@ -291,6 +306,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, int ret, idx; unsigned long nr_pages; struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); + bool reserved = false;

if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */ return -EINVAL; @@ -304,9 +320,14 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));

switch (MEMFILE_ATTR(of_cft(of)->private)) { - case RES_LIMIT: + case HUGETLB_RES_RESERVATION_LIMIT: + reserved = true; + /* Fall through. */ + case HUGETLB_RES_LIMIT: mutex_lock(&hugetlb_limit_mutex); - ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages); + ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx, + reserved), + nr_pages); mutex_unlock(&hugetlb_limit_mutex); break; default: @@ -320,18 +341,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { int ret = 0; - struct page_counter *counter; + struct page_counter *counter, *reserved_counter; struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));

counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)]; + reserved_counter = + &h_cg->reserved_hugepage[MEMFILE_IDX(of_cft(of)->private)];

switch (MEMFILE_ATTR(of_cft(of)->private)) { - case RES_MAX_USAGE: + case HUGETLB_RES_MAX_USAGE: page_counter_reset_watermark(counter); break; - case RES_FAILCNT: + case HUGETLB_RES_RESERVATION_MAX_USAGE: + page_counter_reset_watermark(reserved_counter); + break; + case HUGETLB_RES_FAILCNT: counter->failcnt = 0; break; + case HUGETLB_RES_RESERVATION_FAILCNT: + reserved_counter->failcnt = 0; + break; default: ret = -EINVAL; break; @@ -357,37 +386,67 @@ static void __init __hugetlb_cgroup_file_init(int idx) struct hstate *h = &hstates[idx];

/* format the size */ - mem_fmt(buf, 32, huge_page_size(h)); + mem_fmt(buf, sizeof(buf), huge_page_size(h));

/* Add the limit file */ - cft = &h->cgroup_files[0]; + cft = &h->cgroup_files[HUGETLB_RES_LIMIT]; snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_LIMIT); + cft->read_u64 = hugetlb_cgroup_read_u64; + cft->write = hugetlb_cgroup_write; + + /* Add the reservation limit file */ + cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_LIMIT]; + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes", + buf); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_LIMIT); cft->read_u64 = hugetlb_cgroup_read_u64; cft->write = hugetlb_cgroup_write;

/* Add the usage file */ - cft = &h->cgroup_files[1]; + cft = &h->cgroup_files[HUGETLB_RES_USAGE]; snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_USAGE); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_USAGE); + cft->read_u64 = hugetlb_cgroup_read_u64; + + /* Add the reservation usage file */ + cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_USAGE]; + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes", + buf); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_USAGE); cft->read_u64 = hugetlb_cgroup_read_u64;

/* Add the MAX usage file */ - cft = &h->cgroup_files[2]; + cft = &h->cgroup_files[HUGETLB_RES_MAX_USAGE]; snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_MAX_USAGE); + cft->write = hugetlb_cgroup_reset; + cft->read_u64 = hugetlb_cgroup_read_u64; + + /* Add the MAX reservation usage file */ + cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_MAX_USAGE]; + snprintf(cft->name, MAX_CFTYPE_NAME, + "%s.reservation_max_usage_in_bytes", buf); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_MAX_USAGE); cft->write = hugetlb_cgroup_reset; cft->read_u64 = hugetlb_cgroup_read_u64;

/* Add the failcntfile */ - cft = &h->cgroup_files[3]; + cft = &h->cgroup_files[HUGETLB_RES_FAILCNT]; snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_FAILCNT); + cft->write = hugetlb_cgroup_reset; + cft->read_u64 = hugetlb_cgroup_read_u64; + + /* Add the reservation failcntfile */ + cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_FAILCNT]; + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_FAILCNT); cft->write = hugetlb_cgroup_reset; cft->read_u64 = hugetlb_cgroup_read_u64;

/* NULL terminate the last cft */ - cft = &h->cgroup_files[4]; + cft = &h->cgroup_files[HUGETLB_RES_NULL]; memset(cft, 0, sizeof(*cft));

WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys, -- 2.23.0.351.gc4317032e6-goog

Mina Almasry

10:24 p.m.

New subject: [PATCH v5 2/7] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations

Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage or hugetlb reservation counter.

Adds a new interface to uncharge a hugetlb_cgroup counter via hugetlb_cgroup_uncharge_counter.

Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init, hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.

Signed-off-by: Mina Almasry almasrymina@google.com

--- include/linux/hugetlb_cgroup.h | 22 ++++++---- mm/hugetlb.c | 6 ++- mm/hugetlb_cgroup.c | 77 ++++++++++++++++++++++++++++------ 3 files changed, 83 insertions(+), 22 deletions(-)

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 063962f6dfc6..de35997bb5f9 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -52,14 +52,19 @@ static inline bool hugetlb_cgroup_disabled(void) }

extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr); + struct hugetlb_cgroup **ptr, + bool reserved); extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg, struct page *page); extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page); extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg); + struct hugetlb_cgroup *h_cg, + bool reserved); +extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p, + unsigned long nr_pages); + extern void hugetlb_cgroup_file_init(void) __init; extern void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage); @@ -81,9 +86,9 @@ static inline bool hugetlb_cgroup_disabled(void) return true; }

-static inline int -hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr) +static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, + struct hugetlb_cgroup **ptr, + bool reserved) { return 0; } @@ -100,9 +105,10 @@ hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page) { }

-static inline void -hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg) +static inline void hugetlb_cgroup_uncharge_cgroup(int idx, + unsigned long nr_pages, + struct hugetlb_cgroup *h_cg, + bool reserved) { }

diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 052a2532528a..a52efcb70d04 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2032,7 +2032,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, gbl_chg = 1; }

- ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg); + ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg, + false); if (ret) goto out_subpool_put;

@@ -2080,7 +2081,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, return page;

out_uncharge_cgroup: - hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg); + hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg, + false); out_subpool_put: if (map_chg || avoid_reserve) hugepage_subpool_put_pages(spool, 1); diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 1386da79c9d7..dc1ddc9b09c4 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -73,8 +73,12 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg) int idx;

for (idx = 0; idx < hugetlb_max_hstate; idx++) { - if (page_counter_read(&h_cg->hugepage[idx])) + if (page_counter_read( + hugetlb_cgroup_get_counter(h_cg, idx, true)) || + page_counter_read( + hugetlb_cgroup_get_counter(h_cg, idx, false))) { return true; + } } return false; } @@ -85,18 +89,32 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, int idx;

for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) { - struct page_counter *counter = &h_cgroup->hugepage[idx]; struct page_counter *parent = NULL; + struct page_counter *reserved_parent = NULL; unsigned long limit; int ret;

- if (parent_h_cgroup) - parent = &parent_h_cgroup->hugepage[idx]; - page_counter_init(counter, parent); + if (parent_h_cgroup) { + parent = hugetlb_cgroup_get_counter(parent_h_cgroup, + idx, false); + reserved_parent = hugetlb_cgroup_get_counter( + parent_h_cgroup, idx, true); + } + page_counter_init(hugetlb_cgroup_get_counter(h_cgroup, idx, + false), + parent); + page_counter_init(hugetlb_cgroup_get_counter(h_cgroup, idx, + true), + reserved_parent);

limit = round_down(PAGE_COUNTER_MAX, 1 << huge_page_order(&hstates[idx])); - ret = page_counter_set_max(counter, limit); + + ret = page_counter_set_max( + hugetlb_cgroup_get_counter(h_cgroup, idx, false), + limit); + ret = page_counter_set_max( + hugetlb_cgroup_get_counter(h_cgroup, idx, true), limit); VM_BUG_ON(ret); } } @@ -126,6 +144,26 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css) kfree(h_cgroup); }

+static void hugetlb_cgroup_move_parent_reservation(int idx, + struct hugetlb_cgroup *h_cg) +{ + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); + + /* Move the reservation counters. */ + if (!parent_hugetlb_cgroup(h_cg)) { + parent = root_h_cgroup; + /* root has no limit */ + page_counter_charge( + &root_h_cgroup->reserved_hugepage[idx], + page_counter_read( + hugetlb_cgroup_get_counter(h_cg, idx, true))); + } + + /* Take the pages off the local counter */ + page_counter_cancel( + hugetlb_cgroup_get_counter(h_cg, idx, true), + page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx, true))); +}

/* * Should be called with hugetlb_lock held. @@ -180,6 +218,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css) do { for_each_hstate(h) { spin_lock(&hugetlb_lock); + hugetlb_cgroup_move_parent_reservation(idx, h_cg); list_for_each_entry(page, &h->hugepage_activelist, lru) hugetlb_cgroup_move_parent(idx, h_cg, page);

@@ -191,7 +230,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css) }

int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr) + struct hugetlb_cgroup **ptr, bool reserved) { int ret = 0; struct page_counter *counter; @@ -214,8 +253,11 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, } rcu_read_unlock();

- if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter)) + if (!page_counter_try_charge(hugetlb_cgroup_get_counter(h_cg, idx, + reserved), + nr_pages, &counter)) { ret = -ENOMEM; + } css_put(&h_cg->css); done: *ptr = h_cg; @@ -249,12 +291,14 @@ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, if (unlikely(!h_cg)) return; set_hugetlb_cgroup(page, NULL); - page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); + page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, false), + nr_pages); + return; }

void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg) + struct hugetlb_cgroup *h_cg, bool reserved) { if (hugetlb_cgroup_disabled() || !h_cg) return; @@ -262,8 +306,17 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER) return;

- page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); - return; + page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, reserved), + nr_pages); +} + +void hugetlb_cgroup_uncharge_counter(struct page_counter *p, + unsigned long nr_pages) +{ + if (hugetlb_cgroup_disabled() || !p) + return; + + page_counter_uncharge(p, nr_pages); }

static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css, -- 2.23.0.351.gc4317032e6-goog

Mina Almasry

10:24 p.m.

New subject: [PATCH v5 3/7] hugetlb_cgroup: add reservation accounting for private mappings

Normally the pointer to the cgroup to uncharge hangs off the struct page, and gets queried when it's time to free the page. With hugetlb_cgroup reservations, this is not possible. Because it's possible for a page to be reserved by one task and actually faulted in by another task.

The best place to put the hugetlb_cgroup pointer to uncharge for reservations is in the resv_map. But, because the resv_map has different semantics for private and shared mappings, the code patch to charge/uncharge shared and private mappings is different. This patch implements charging and uncharging for private mappings.

For private mappings, the counter to uncharge is in resv_map->reservation_counter. On initializing the resv_map this is set to NULL. On reservation of a region in private mapping, the tasks hugetlb_cgroup is charged and the hugetlb_cgroup is placed is resv_map->reservation_counter.

On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.

Signed-off-by: Mina Almasry almasrymina@google.com Acked-by: Hillf Danton hdanton@sina.com

--- include/linux/hugetlb.h | 8 +++++++ include/linux/hugetlb_cgroup.h | 11 +++++++++ mm/hugetlb.c | 44 +++++++++++++++++++++++++++++++++- mm/hugetlb_cgroup.c | 12 ---------- 4 files changed, 62 insertions(+), 13 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 3d70a17cc0c3..230f44f730fa 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -46,6 +46,14 @@ struct resv_map { long adds_in_progress; struct list_head region_cache; long region_cache_count; +#ifdef CONFIG_CGROUP_HUGETLB + /* + * On private mappings, the counter to uncharge reservations is stored + * here. If these fields are 0, then the mapping is shared. + */ + struct page_counter *reservation_counter; + unsigned long pages_per_hpage; +#endif }; extern struct resv_map *resv_map_alloc(void); void resv_map_release(struct kref *ref); diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index de35997bb5f9..31c4a9e1cf91 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -25,6 +25,17 @@ struct hugetlb_cgroup; #define HUGETLB_CGROUP_MIN_ORDER 2

#ifdef CONFIG_CGROUP_HUGETLB +struct hugetlb_cgroup { + struct cgroup_subsys_state css; + /* + * the counter to account for hugepages from hugetlb. + */ + struct page_counter hugepage[HUGE_MAX_HSTATE]; + /* + * the counter to account for hugepage reservations from hugetlb. + */ + struct page_counter reserved_hugepage[HUGE_MAX_HSTATE]; +};

static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a52efcb70d04..bac1cbdd027c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -665,6 +665,16 @@ struct resv_map *resv_map_alloc(void) INIT_LIST_HEAD(&resv_map->regions);

resv_map->adds_in_progress = 0; +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Initialize these to 0. On shared mappings, 0's here indicate these + * fields don't do cgroup accounting. On private mappings, these will be + * re-initialized to the proper values, to indicate that hugetlb cgroup + * reservations are to be un-charged from here. + */ + resv_map->reservation_counter = NULL; + resv_map->pages_per_hpage = 0; +#endif

INIT_LIST_HEAD(&resv_map->region_cache); list_add(&rg->link, &resv_map->region_cache); @@ -3147,7 +3157,18 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)

reserve = (end - start) - region_count(resv, start, end);

- kref_put(&resv->refs, resv_map_release); +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Since we check for HPAGE_RESV_OWNER above, this must a private + * mapping, and these values should be none-zero, and should point to + * the hugetlb_cgroup counter to uncharge for this reservation. + */ + WARN_ON(!resv->reservation_counter); + WARN_ON(!resv->pages_per_hpage); + + hugetlb_cgroup_uncharge_counter(resv->reservation_counter, + (end - start) * resv->pages_per_hpage); +#endif

if (reserve) { /* @@ -3157,6 +3178,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma) gbl_reserve = hugepage_subpool_put_pages(spool, reserve); hugetlb_acct_memory(h, -gbl_reserve); } + + kref_put(&resv->refs, resv_map_release); }

static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr) @@ -4490,6 +4513,7 @@ int hugetlb_reserve_pages(struct inode *inode, struct hstate *h = hstate_inode(inode); struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; + struct hugetlb_cgroup *h_cg; long gbl_reserve;

/* This should never happen */ @@ -4523,12 +4547,30 @@ int hugetlb_reserve_pages(struct inode *inode, chg = region_chg(resv_map, from, to);

} else { + /* Private mapping. */ resv_map = resv_map_alloc(); if (!resv_map) return -ENOMEM;

chg = to - from;

+ if (hugetlb_cgroup_charge_cgroup(hstate_index(h), + chg * pages_per_huge_page(h), + &h_cg, true)) { + kref_put(&resv_map->refs, resv_map_release); + return -ENOMEM; + } + +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Since this branch handles private mappings, we attach the + * counter to uncharge for this reservation off resv_map. + */ + resv_map->reservation_counter = + &h_cg->reserved_hugepage[hstate_index(h)]; + resv_map->pages_per_hpage = pages_per_huge_page(h); +#endif + set_vma_resv_map(vma, resv_map); set_vma_resv_flags(vma, HPAGE_RESV_OWNER); } diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index dc1ddc9b09c4..ae359ae61cf2 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -19,18 +19,6 @@ #include <linux/hugetlb.h> #include <linux/hugetlb_cgroup.h>

-struct hugetlb_cgroup { - struct cgroup_subsys_state css; - /* - * the counter to account for hugepages from hugetlb. - */ - struct page_counter hugepage[HUGE_MAX_HSTATE]; - /* - * the counter to account for hugepage reservations from hugetlb. - */ - struct page_counter reserved_hugepage[HUGE_MAX_HSTATE]; -}; - #define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) #define MEMFILE_IDX(val) (((val) >> 16) & 0xffff) #define MEMFILE_ATTR(val) ((val) & 0xffff) -- 2.23.0.351.gc4317032e6-goog

Mina Almasry

10:24 p.m.

New subject: [PATCH v5 4/7] hugetlb: disable region_add file_region coalescing

A follow up patch in this series adds hugetlb cgroup uncharge info the file_region entries in resv->regions. The cgroup uncharge info may differ for different regions, so they can no longer be coalesced at region_add time. So, disable region coalescing in region_add in this patch.

Behavior change:

Say a resv_map exists like this [0->1], [2->3], and [5->6].

Then a region_chg/add call comes in region_chg/add(f=0, t=5).

Old code would generate resv->regions: [0->5], [5->6]. New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5], [5->6].

Special care needs to be taken to handle the resv->adds_in_progress variable correctly. In the past, only 1 region would be added for every region_chg and region_add call. But now, each call may add multiple regions, so we can no longer increment adds_in_progress by 1 in region_chg, or decrement adds_in_progress by 1 after region_add or region_abort. Instead, region_chg calls add_reservation_in_range() to count the number of regions needed and allocates those, and that info is passed to region_add and region_abort to decrement adds_in_progress correctly.

Signed-off-by: Mina Almasry almasrymina@google.com

--- mm/hugetlb.c | 273 +++++++++++++++++++++++++++++---------------------- 1 file changed, 158 insertions(+), 115 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c index bac1cbdd027c..d03b048084a3 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -244,6 +244,12 @@ struct file_region { long to; };

+/* Helper that removes a struct file_region from the resv_map cache and returns + * it for use. + */ +static struct file_region * +get_file_region_entry_from_cache(struct resv_map *resv, long from, long to); + /* Must be called with resv->lock held. Calling this with count_only == true * will count the number of pages to be added but will not modify the linked * list. @@ -251,51 +257,61 @@ struct file_region { static long add_reservation_in_range(struct resv_map *resv, long f, long t, bool count_only) { - long chg = 0; + long add = 0; struct list_head *head = &resv->regions; + long last_accounted_offset = f; struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;

- /* Locate the region we are before or in. */ - list_for_each_entry (rg, head, link) - if (f <= rg->to) - break; - - /* Round our left edge to the current segment if it encloses us. */ - if (f > rg->from) - f = rg->from; - - chg = t - f; + /* In this loop, we essentially handle an entry for the range + * last_accounted_offset -> rg->from, at every iteration, with some + * bounds checking. + */ + list_for_each_entry_safe(rg, trg, head, link) { + /* Skip irrelevant regions that start before our range. */ + if (rg->from < f) { + /* If this region ends after the last accounted offset, + * then we need to update last_accounted_offset. + */ + if (rg->to > last_accounted_offset) + last_accounted_offset = rg->to; + continue; + }

- /* Check for and consume any regions we now overlap with. */ - nrg = rg; - list_for_each_entry_safe (rg, trg, rg->link.prev, link) { - if (&rg->link == head) - break; + /* When we find a region that starts beyond our range, we've + * finished. + */ if (rg->from > t) break;

- /* We overlap with this area, if it extends further than - * us then we must extend ourselves. Account for its - * existing reservation. + /* Add an entry for last_accounted_offset -> rg->from, and + * update last_accounted_offset. */ - if (rg->to > t) { - chg += rg->to - t; - t = rg->to; + if (rg->from > last_accounted_offset) { + add += rg->from - last_accounted_offset; + if (!count_only) { + nrg = get_file_region_entry_from_cache( + resv, last_accounted_offset, rg->from); + list_add(&nrg->link, rg->link.prev); + } } - chg -= rg->to - rg->from;

- if (!count_only && rg != nrg) { - list_del(&rg->link); - kfree(rg); - } + last_accounted_offset = rg->to; }

- if (!count_only) { - nrg->from = f; - nrg->to = t; + /* Handle the case where our range extends beyond + * last_accounted_offset. + */ + if (last_accounted_offset < t) { + add += t - last_accounted_offset; + if (!count_only) { + nrg = get_file_region_entry_from_cache( + resv, last_accounted_offset, t); + list_add(&nrg->link, rg->link.prev); + } + last_accounted_offset = t; }

- return chg; + return add; }

/* @@ -305,46 +321,24 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, * must exist in the cache due to the previous call to region_chg with * the same range. * + * regions_needed is the out value provided by a previous + * call to region_chg. + * * Return the number of new huge pages added to the map. This * number is greater than or equal to zero. */ -static long region_add(struct resv_map *resv, long f, long t) +static long region_add(struct resv_map *resv, long f, long t, + long regions_needed) { - struct list_head *head = &resv->regions; - struct file_region *rg, *nrg; long add = 0;

spin_lock(&resv->lock); - /* Locate the region we are either in or before. */ - list_for_each_entry(rg, head, link) - if (f <= rg->to) - break;

- /* - * If no region exists which can be expanded to include the - * specified range, pull a region descriptor from the cache - * and use it for this range. - */ - if (&rg->link == head || t < rg->from) { - VM_BUG_ON(resv->region_cache_count <= 0); - - resv->region_cache_count--; - nrg = list_first_entry(&resv->region_cache, struct file_region, - link); - list_del(&nrg->link); - - nrg->from = f; - nrg->to = t; - list_add(&nrg->link, rg->link.prev); - - add += t - f; - goto out_locked; - } + VM_BUG_ON(resv->region_cache_count < regions_needed);

add = add_reservation_in_range(resv, f, t, false); + resv->adds_in_progress -= regions_needed;

-out_locked: - resv->adds_in_progress--; spin_unlock(&resv->lock); VM_BUG_ON(add < 0); return add; @@ -361,42 +355,55 @@ static long region_add(struct resv_map *resv, long f, long t) * as a placeholder, so that the subsequent region_add * call will have all the regions it needs and will not fail. * + * out_regions_needed is the number of regions added to the + * resv->region_cache_count. This value needs to be provided to a follow up + * call to region_add or region_abort for proper accounting. + * * Returns the number of huge pages that need to be added to the existing * reservation map for the range [f, t). This number is greater or equal to * zero. -ENOMEM is returned if a new file_region structure or cache entry * is needed and can not be allocated. */ -static long region_chg(struct resv_map *resv, long f, long t) +static long region_chg(struct resv_map *resv, long f, long t, + long *out_regions_needed) { + struct file_region *trg = NULL; long chg = 0;

+ /* Allocate the maximum number of regions we're going to need for this + * reservation. The maximum number of regions we're going to need is + * (t - f) / 2 + 1, which corresponds to a region with alternating + * reserved and unreserved pages. + */ + *out_regions_needed = (t - f) / 2 + 1; +retry: spin_lock(&resv->lock); -retry_locked: - resv->adds_in_progress++; + + /* Count how many hugepages in this range are NOT respresented. */ + chg = add_reservation_in_range(resv, f, t, true);

/* * Check for sufficient descriptors in the cache to accommodate * the number of in progress add operations. */ - if (resv->adds_in_progress > resv->region_cache_count) { - struct file_region *trg; - - VM_BUG_ON(resv->adds_in_progress - resv->region_cache_count > 1); + if (resv->region_cache_count < *out_regions_needed) { /* Must drop lock to allocate a new descriptor. */ - resv->adds_in_progress--; spin_unlock(&resv->lock);

- trg = kmalloc(sizeof(*trg), GFP_KERNEL); - if (!trg) - return -ENOMEM; + while (resv->region_cache_count < *out_regions_needed + 1) { + trg = kmalloc(sizeof(*trg), GFP_KERNEL); + if (!trg) + return -ENOMEM;

- spin_lock(&resv->lock); - list_add(&trg->link, &resv->region_cache); - resv->region_cache_count++; - goto retry_locked; + spin_lock(&resv->lock); + list_add(&trg->link, &resv->region_cache); + resv->region_cache_count++; + spin_unlock(&resv->lock); + } + goto retry; }

- chg = add_reservation_in_range(resv, f, t, true); + resv->adds_in_progress += *out_regions_needed;

spin_unlock(&resv->lock); return chg; @@ -407,17 +414,19 @@ static long region_chg(struct resv_map *resv, long f, long t) * of the resv_map keeps track of the operations in progress between * calls to region_chg and region_add. Operations are sometimes * aborted after the call to region_chg. In such cases, region_abort - * is called to decrement the adds_in_progress counter. + * is called to decrement the adds_in_progress counter. regions_needed + * is the value returned by the region_chg call, it is used to decrement + * the adds_in_progress counter. * * NOTE: The range arguments [f, t) are not needed or used in this * routine. They are kept to make reading the calling code easier as * arguments will match the associated region_chg call. */ -static void region_abort(struct resv_map *resv, long f, long t) +static void region_abort(struct resv_map *resv, long f, long t, + long regions_needed) { spin_lock(&resv->lock); - VM_BUG_ON(!resv->region_cache_count); - resv->adds_in_progress--; + resv->adds_in_progress -= regions_needed; spin_unlock(&resv->lock); }

@@ -1866,9 +1875,10 @@ enum vma_resv_mode { VMA_END_RESV, VMA_ADD_RESV, }; -static long __vma_reservation_common(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr, - enum vma_resv_mode mode) +static long +__vma_reservation_common(struct hstate *h, struct vm_area_struct *vma, + unsigned long addr, enum vma_resv_mode mode, + long *out_regions_needed, long in_regions_needed) { struct resv_map *resv; pgoff_t idx; @@ -1881,20 +1891,24 @@ static long __vma_reservation_common(struct hstate *h, idx = vma_hugecache_offset(h, vma, addr); switch (mode) { case VMA_NEEDS_RESV: - ret = region_chg(resv, idx, idx + 1); + VM_BUG_ON(!out_regions_needed); + ret = region_chg(resv, idx, idx + 1, out_regions_needed); break; case VMA_COMMIT_RESV: - ret = region_add(resv, idx, idx + 1); + VM_BUG_ON(in_regions_needed == -1); + ret = region_add(resv, idx, idx + 1, in_regions_needed); break; case VMA_END_RESV: - region_abort(resv, idx, idx + 1); + VM_BUG_ON(in_regions_needed == -1); + region_abort(resv, idx, idx + 1, in_regions_needed); ret = 0; break; case VMA_ADD_RESV: + VM_BUG_ON(in_regions_needed == -1); if (vma->vm_flags & VM_MAYSHARE) - ret = region_add(resv, idx, idx + 1); + ret = region_add(resv, idx, idx + 1, in_regions_needed); else { - region_abort(resv, idx, idx + 1); + region_abort(resv, idx, idx + 1, in_regions_needed); ret = region_del(resv, idx, idx + 1); } break; @@ -1927,28 +1941,32 @@ static long __vma_reservation_common(struct hstate *h, return ret < 0 ? ret : 0; }

-static long vma_needs_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static long vma_needs_reservation(struct hstate *h, struct vm_area_struct *vma, + unsigned long addr, long *out_regions_needed) { - return __vma_reservation_common(h, vma, addr, VMA_NEEDS_RESV); + return __vma_reservation_common(h, vma, addr, VMA_NEEDS_RESV, + out_regions_needed, -1); }

-static long vma_commit_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static long vma_commit_reservation(struct hstate *h, struct vm_area_struct *vma, + unsigned long addr, long regions_needed) { - return __vma_reservation_common(h, vma, addr, VMA_COMMIT_RESV); + return __vma_reservation_common(h, vma, addr, VMA_COMMIT_RESV, NULL, + regions_needed); }

-static void vma_end_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static void vma_end_reservation(struct hstate *h, struct vm_area_struct *vma, + unsigned long addr, long regions_needed) { - (void)__vma_reservation_common(h, vma, addr, VMA_END_RESV); + (void)__vma_reservation_common(h, vma, addr, VMA_END_RESV, NULL, + regions_needed); }

-static long vma_add_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static long vma_add_reservation(struct hstate *h, struct vm_area_struct *vma, + unsigned long addr, long regions_needed) { - return __vma_reservation_common(h, vma, addr, VMA_ADD_RESV); + return __vma_reservation_common(h, vma, addr, VMA_ADD_RESV, NULL, + regions_needed); }

/* @@ -1966,8 +1984,10 @@ static void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, unsigned long address, struct page *page) { + long regions_needed = 0; if (unlikely(PagePrivate(page))) { - long rc = vma_needs_reservation(h, vma, address); + long rc = + vma_needs_reservation(h, vma, address, &regions_needed);

if (unlikely(rc < 0)) { /* @@ -1983,7 +2003,8 @@ static void restore_reserve_on_error(struct hstate *h, */ ClearPagePrivate(page); } else if (rc) { - rc = vma_add_reservation(h, vma, address); + rc = vma_add_reservation(h, vma, address, + regions_needed); if (unlikely(rc < 0)) /* * See above comment about rare out of @@ -1991,7 +2012,7 @@ static void restore_reserve_on_error(struct hstate *h, */ ClearPagePrivate(page); } else - vma_end_reservation(h, vma, address); + vma_end_reservation(h, vma, address, regions_needed); } }

@@ -2005,6 +2026,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, long gbl_chg; int ret, idx; struct hugetlb_cgroup *h_cg; + long regions_needed = 0;

idx = hstate_index(h); /* @@ -2012,7 +2034,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, * has a reservation for the page to be allocated. A return * code of zero indicates a reservation exists (no change). */ - map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); + map_chg = gbl_chg = + vma_needs_reservation(h, vma, addr, &regions_needed); if (map_chg < 0) return ERR_PTR(-ENOMEM);

@@ -2026,7 +2049,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, if (map_chg || avoid_reserve) { gbl_chg = hugepage_subpool_get_pages(spool, 1); if (gbl_chg < 0) { - vma_end_reservation(h, vma, addr); + vma_end_reservation(h, vma, addr, regions_needed); return ERR_PTR(-ENOSPC); }

@@ -2072,7 +2095,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,

set_page_private(page, (unsigned long)spool);

- map_commit = vma_commit_reservation(h, vma, addr); + map_commit = vma_commit_reservation(h, vma, addr, regions_needed); if (unlikely(map_chg > map_commit)) { /* * The page was added to the reservation map between @@ -2096,7 +2119,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, out_subpool_put: if (map_chg || avoid_reserve) hugepage_subpool_put_pages(spool, 1); - vma_end_reservation(h, vma, addr); + vma_end_reservation(h, vma, addr, regions_needed); return ERR_PTR(-ENOSPC); }

@@ -3780,6 +3803,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, spinlock_t *ptl; unsigned long haddr = address & huge_page_mask(h); bool new_page = false; + long regions_needed = 0;

/* * Currently, we are forced to kill the process in the event the @@ -3897,12 +3921,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, * the spinlock. */ if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { - if (vma_needs_reservation(h, vma, haddr) < 0) { + if (vma_needs_reservation(h, vma, haddr, &regions_needed) < 0) { ret = VM_FAULT_OOM; goto backout_unlocked; } /* Just decrements count, does not deallocate */ - vma_end_reservation(h, vma, haddr); + vma_end_reservation(h, vma, haddr, regions_needed); }

ptl = huge_pte_lock(h, mm, ptep); @@ -3992,6 +4016,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, struct address_space *mapping; int need_wait_lock = 0; unsigned long haddr = address & huge_page_mask(h); + long regions_needed = 0;

ptep = huge_pte_offset(mm, haddr, huge_page_size(h)); if (ptep) { @@ -4046,12 +4071,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, * consumed. */ if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) { - if (vma_needs_reservation(h, vma, haddr) < 0) { + if (vma_needs_reservation(h, vma, haddr, &regions_needed) < 0) { ret = VM_FAULT_OOM; goto out_mutex; } /* Just decrements count, does not deallocate */ - vma_end_reservation(h, vma, haddr); + vma_end_reservation(h, vma, haddr, regions_needed);

if (!(vma->vm_flags & VM_MAYSHARE)) pagecache_page = hugetlbfs_pagecache_page(h, @@ -4514,7 +4539,7 @@ int hugetlb_reserve_pages(struct inode *inode, struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; struct hugetlb_cgroup *h_cg; - long gbl_reserve; + long gbl_reserve, regions_needed = 0;

/* This should never happen */ if (from > to) { @@ -4544,7 +4569,7 @@ int hugetlb_reserve_pages(struct inode *inode, */ resv_map = inode_resv_map(inode);

- chg = region_chg(resv_map, from, to); + chg = region_chg(resv_map, from, to, &regions_needed);

} else { /* Private mapping. */ @@ -4614,7 +4639,7 @@ int hugetlb_reserve_pages(struct inode *inode, * else has to be done for private mappings here */ if (!vma || vma->vm_flags & VM_MAYSHARE) { - long add = region_add(resv_map, from, to); + long add = region_add(resv_map, from, to, regions_needed);

if (unlikely(chg > add)) { /* @@ -4636,7 +4661,7 @@ int hugetlb_reserve_pages(struct inode *inode, if (!vma || vma->vm_flags & VM_MAYSHARE) /* Don't call region_abort if region_chg failed */ if (chg >= 0) - region_abort(resv_map, from, to); + region_abort(resv_map, from, to, regions_needed); if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) kref_put(&resv_map->refs, resv_map_release); return ret; @@ -5060,3 +5085,21 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason) spin_unlock(&hugetlb_lock); } } + +static struct file_region * +get_file_region_entry_from_cache(struct resv_map *resv, long from, long to) +{ + struct file_region *nrg = NULL; + + VM_BUG_ON(resv->region_cache_count <= 0); + + resv->region_cache_count--; + nrg = list_first_entry(&resv->region_cache, struct file_region, link); + VM_BUG_ON(!nrg); + list_del(&nrg->link); + + nrg->from = from; + nrg->to = to; + + return nrg; +} -- 2.23.0.351.gc4317032e6-goog

Mike Kravetz

27 Sep 27 Sep

9:44 p.m.

New subject: [PATCH v5 4/7] hugetlb: disable region_add file_region coalescing

On 9/19/19 3:24 PM, Mina Almasry wrote:

...

A follow up patch in this series adds hugetlb cgroup uncharge info the file_region entries in resv->regions. The cgroup uncharge info may differ for different regions, so they can no longer be coalesced at region_add time. So, disable region coalescing in region_add in this patch.

Behavior change:

Say a resv_map exists like this [0->1], [2->3], and [5->6].

Then a region_chg/add call comes in region_chg/add(f=0, t=5).

Old code would generate resv->regions: [0->5], [5->6]. New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5], [5->6].

Special care needs to be taken to handle the resv->adds_in_progress variable correctly. In the past, only 1 region would be added for every region_chg and region_add call. But now, each call may add multiple regions, so we can no longer increment adds_in_progress by 1 in region_chg, or decrement adds_in_progress by 1 after region_add or region_abort. Instead, region_chg calls add_reservation_in_range() to count the number of regions needed and allocates those, and that info is passed to region_add and region_abort to decrement adds_in_progress correctly.

Signed-off-by: Mina Almasry almasrymina@google.com

mm/hugetlb.c | 273 +++++++++++++++++++++++++++++---------------------- 1 file changed, 158 insertions(+), 115 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c index bac1cbdd027c..d03b048084a3 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -244,6 +244,12 @@ struct file_region { long to; };

+/* Helper that removes a struct file_region from the resv_map cache and returns

it for use.

*/

+static struct file_region * +get_file_region_entry_from_cache(struct resv_map *resv, long from, long to);

Instead of the forward declaration, just put the function here.

...

/* Must be called with resv->lock held. Calling this with count_only == true

will count the number of pages to be added but will not modify the linked

list.

@@ -251,51 +257,61 @@ struct file_region { static long add_reservation_in_range(struct resv_map *resv, long f, long t, bool count_only) {

long chg = 0;

long add = 0; struct list_head *head = &resv->regions;

long last_accounted_offset = f; struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
/* Locate the region we are before or in. */

list_for_each_entry (rg, head, link)
if (f <= rg->to)
	break;
/* Round our left edge to the current segment if it encloses us. */

if (f > rg->from)
f = rg->from;
chg = t - f;
/* In this loop, we essentially handle an entry for the range
* last_accounted_offset -> rg->from, at every iteration, with some
* bounds checking.
*/
list_for_each_entry_safe(rg, trg, head, link) {
/* Skip irrelevant regions that start before our range. */
if (rg->from < f) {
	/* If this region ends after the last accounted offset,
	 * then we need to update last_accounted_offset.
	 */
	if (rg->to > last_accounted_offset)
		last_accounted_offset = rg->to;
	continue;
}
/* Check for and consume any regions we now overlap with. */

nrg = rg;

list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
if (&rg->link == head)
	break;
/* When we find a region that starts beyond our range, we've
 * finished.
 */
if (rg->from > t) break;
/* We overlap with this area, if it extends further than
 * us then we must extend ourselves.  Account for its
 * existing reservation.
/* Add an entry for last_accounted_offset -> rg->from, and
 * update last_accounted_offset.
*/
if (rg->to > t) {
	chg += rg->to - t;
	t = rg->to;
if (rg->from > last_accounted_offset) {
	add += rg->from - last_accounted_offset;
	if (!count_only) {
		nrg = get_file_region_entry_from_cache(
			resv, last_accounted_offset, rg->from);
		list_add(&nrg->link, rg->link.prev);
	}
}
chg -= rg->to - rg->from;
if (!count_only && rg != nrg) {
	list_del(&rg->link);
	kfree(rg);
}
last_accounted_offset = rg->to;
}
if (!count_only) {
nrg->from = f;
nrg->to = t;
/* Handle the case where our range extends beyond
* last_accounted_offset.
*/
if (last_accounted_offset < t) {
add += t - last_accounted_offset;
if (!count_only) {
	nrg = get_file_region_entry_from_cache(
		resv, last_accounted_offset, t);
	list_add(&nrg->link, rg->link.prev);
}
last_accounted_offset = t;
}
return chg;

return add;

}

/* @@ -305,46 +321,24 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,

The start of this comment block says,

/* * Add the huge page range represented by [f, t) to the reserve * map. Existing regions will be expanded to accommodate the specified * range, or a region will be taken from the cache.

We are no longer expanding existing regions. Correct? As an optimization, I guess we could coalesce/combine reion entries as long as they are for the same cgroup. However, it may not be worth the effort.

...

must exist in the cache due to the previous call to region_chg with

the same range.

regions_needed is the out value provided by a previous

call to region_chg.

Return the number of new huge pages added to the map. This

number is greater than or equal to zero.

*/

-static long region_add(struct resv_map *resv, long f, long t) +static long region_add(struct resv_map *resv, long f, long t,
       long regions_needed)
{
struct list_head *head = &resv->regions;

struct file_region *rg, *nrg; long add = 0;

spin_lock(&resv->lock);

/* Locate the region we are either in or before. */

list_for_each_entry(rg, head, link)
if (f <= rg->to)
	break;
/*
* If no region exists which can be expanded to include the
* specified range, pull a region descriptor from the cache
* and use it for this range.
*/
if (&rg->link == head || t < rg->from) {
VM_BUG_ON(resv->region_cache_count <= 0);
resv->region_cache_count--;
nrg = list_first_entry(&resv->region_cache, struct file_region,
			link);
list_del(&nrg->link);
nrg->from = f;
nrg->to = t;
list_add(&nrg->link, rg->link.prev);
add += t - f;
goto out_locked;
}
VM_BUG_ON(resv->region_cache_count < regions_needed);

add = add_reservation_in_range(resv, f, t, false);

resv->adds_in_progress -= regions_needed;

Consider this example,

- region_chg(1,2) adds_in_progress = 1 cache entries 1 - region_chg(3,4) adds_in_progress = 2 cache entries 2 - region_chg(5,6) adds_in_progress = 3 cache entries 3

At this point, no region descriptors are in the map because only region_chg has been called.

- region_chg(0,6) adds_in_progress = 4 cache entries 4

Is that correct so far?

Then the following sequence happens,

- region_add(1,2) adds_in_progress = 3 cache entries 3 - region_add(3,4) adds_in_progress = 2 cache entries 2 - region_add(5,6) adds_in_progress = 1 cache entries 1

list of region descriptors is: [1->2] [3->4] [5->6]

- region_add(0,6) This is going to require 3 cache entries but only one is in the cache. I think we are going to BUG in get_file_region_entry_from_cache() the second time it is called from add_reservation_in_range().

I stopped looking at the code here as things will need to change if this is a real issue.

-- Mike Kravetz

Mina Almasry

10:33 p.m.

New subject: [PATCH v5 4/7] hugetlb: disable region_add file_region coalescing

On Fri, Sep 27, 2019 at 2:44 PM Mike Kravetz mike.kravetz@oracle.com wrote:

...

On 9/19/19 3:24 PM, Mina Almasry wrote:

...
A follow up patch in this series adds hugetlb cgroup uncharge info the file_region entries in resv->regions. The cgroup uncharge info may differ for different regions, so they can no longer be coalesced at region_add time. So, disable region coalescing in region_add in this patch.

Behavior change:

Say a resv_map exists like this [0->1], [2->3], and [5->6].

Then a region_chg/add call comes in region_chg/add(f=0, t=5).

Old code would generate resv->regions: [0->5], [5->6]. New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5], [5->6].

Special care needs to be taken to handle the resv->adds_in_progress variable correctly. In the past, only 1 region would be added for every region_chg and region_add call. But now, each call may add multiple regions, so we can no longer increment adds_in_progress by 1 in region_chg, or decrement adds_in_progress by 1 after region_add or region_abort. Instead, region_chg calls add_reservation_in_range() to count the number of regions needed and allocates those, and that info is passed to region_add and region_abort to decrement adds_in_progress correctly.

Signed-off-by: Mina Almasry almasrymina@google.com

mm/hugetlb.c | 273 +++++++++++++++++++++++++++++---------------------- 1 file changed, 158 insertions(+), 115 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c index bac1cbdd027c..d03b048084a3 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -244,6 +244,12 @@ struct file_region { long to; };

+/* Helper that removes a struct file_region from the resv_map cache and returns

it for use.

*/

+static struct file_region * +get_file_region_entry_from_cache(struct resv_map *resv, long from, long to);

Instead of the forward declaration, just put the function here.

...
/* Must be called with resv->lock held. Calling this with count_only == true

will count the number of pages to be added but will not modify the linked

list.

@@ -251,51 +257,61 @@ struct file_region { static long add_reservation_in_range(struct resv_map *resv, long f, long t, bool count_only) {
long chg = 0;
long add = 0;
struct list_head *head = &resv->regions;
long last_accounted_offset = f;
struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
/* Locate the region we are before or in. */
list_for_each_entry (rg, head, link)
        if (f <= rg->to)
                break;
/* Round our left edge to the current segment if it encloses us. */
if (f > rg->from)
        f = rg->from;
chg = t - f;
/* In this loop, we essentially handle an entry for the range
 * last_accounted_offset -> rg->from, at every iteration, with some
 * bounds checking.
 */
list_for_each_entry_safe(rg, trg, head, link) {
        /* Skip irrelevant regions that start before our range. */
        if (rg->from < f) {
                /* If this region ends after the last accounted offset,
                 * then we need to update last_accounted_offset.
                 */
                if (rg->to > last_accounted_offset)
                        last_accounted_offset = rg->to;
                continue;
        }
/* Check for and consume any regions we now overlap with. */
nrg = rg;
list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
        if (&rg->link == head)
                break;
        /* When we find a region that starts beyond our range, we've
         * finished.
         */
        if (rg->from > t)
                break;
        /* We overlap with this area, if it extends further than
         * us then we must extend ourselves.  Account for its
         * existing reservation.
        /* Add an entry for last_accounted_offset -> rg->from, and
         * update last_accounted_offset.
         */
        if (rg->to > t) {
                chg += rg->to - t;
                t = rg->to;
        if (rg->from > last_accounted_offset) {
                add += rg->from - last_accounted_offset;
                if (!count_only) {
                        nrg = get_file_region_entry_from_cache(
                                resv, last_accounted_offset, rg->from);
                        list_add(&nrg->link, rg->link.prev);
                }
        }
        chg -= rg->to - rg->from;
        if (!count_only && rg != nrg) {
                list_del(&rg->link);
                kfree(rg);
        }
        last_accounted_offset = rg->to;
}
if (!count_only) {
        nrg->from = f;
        nrg->to = t;
/* Handle the case where our range extends beyond
 * last_accounted_offset.
 */
if (last_accounted_offset < t) {
        add += t - last_accounted_offset;
        if (!count_only) {
                nrg = get_file_region_entry_from_cache(
                        resv, last_accounted_offset, t);
                list_add(&nrg->link, rg->link.prev);
        }
        last_accounted_offset = t;
}
return chg;
return add;
}

/* @@ -305,46 +321,24 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
The start of this comment block says,

/*

Add the huge page range represented by [f, t) to the reserve

map. Existing regions will be expanded to accommodate the specified

range, or a region will be taken from the cache.

We are no longer expanding existing regions. Correct? As an optimization, I guess we could coalesce/combine reion entries as long as they are for the same cgroup. However, it may not be worth the effort.

...
must exist in the cache due to the previous call to region_chg with

the same range.

regions_needed is the out value provided by a previous

call to region_chg.

Return the number of new huge pages added to the map. This

number is greater than or equal to zero.

*/

-static long region_add(struct resv_map *resv, long f, long t) +static long region_add(struct resv_map *resv, long f, long t,
               long regions_needed)
{
struct list_head *head = &resv->regions;
struct file_region *rg, *nrg;
long add = 0;

spin_lock(&resv->lock);
/* Locate the region we are either in or before. */
list_for_each_entry(rg, head, link)
        if (f <= rg->to)
                break;
/*
 * If no region exists which can be expanded to include the
 * specified range, pull a region descriptor from the cache
 * and use it for this range.
 */
if (&rg->link == head || t < rg->from) {
        VM_BUG_ON(resv->region_cache_count <= 0);
        resv->region_cache_count--;
        nrg = list_first_entry(&resv->region_cache, struct file_region,
                                link);
        list_del(&nrg->link);
        nrg->from = f;
        nrg->to = t;
        list_add(&nrg->link, rg->link.prev);
        add += t - f;
        goto out_locked;
}
VM_BUG_ON(resv->region_cache_count < regions_needed);

add = add_reservation_in_range(resv, f, t, false);
resv->adds_in_progress -= regions_needed;
Consider this example,

region_chg(1,2) adds_in_progress = 1 cache entries 1

region_chg(3,4) adds_in_progress = 2 cache entries 2

region_chg(5,6) adds_in_progress = 3 cache entries 3

At this point, no region descriptors are in the map because only region_chg has been called.

region_chg(0,6) adds_in_progress = 4 cache entries 4

Is that correct so far?

Then the following sequence happens,

region_add(1,2) adds_in_progress = 3 cache entries 3

region_add(3,4) adds_in_progress = 2 cache entries 2

region_add(5,6) adds_in_progress = 1 cache entries 1

list of region descriptors is: [1->2] [3->4] [5->6]

region_add(0,6)

This is going to require 3 cache entries but only one is in the cache. I think we are going to BUG in get_file_region_entry_from_cache() the second time it is called from add_reservation_in_range().

I stopped looking at the code here as things will need to change if this is a real issue.

You're right. I had been assuming that some higher level sync causes to ban a sequence of region_chg calls without a region_add or region_abort to resolve each region_chg call, but seems that is not the case. I'll fix and upload another version of the patch.

...

-- Mike Kravetz

Mina Almasry

19 Sep 19 Sep

10:24 p.m.

New subject: [PATCH v5 5/7] hugetlb_cgroup: add accounting for shared mappings

For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives in the resv_map entries, in file_region->reservation_counter.

After a call to region_chg, we charge the approprate hugetlb_cgroup, and if successful, we pass on the hugetlb_cgroup info to a follow up region_add call. When a file_region entry is added to the resv_map via region_add, we put the pointer to that cgroup in file_region->reservation_counter. If charging doesn't succeed, we report the error to the caller, so that the kernel fails the reservation.

On region_del, which is when the hugetlb memory is unreserved, we also uncharge the file_region->reservation_counter.

Signed-off-by: Mina Almasry almasrymina@google.com

--- mm/hugetlb.c | 126 ++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 105 insertions(+), 21 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d03b048084a3..ae573eff80bb 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -242,6 +242,15 @@ struct file_region { struct list_head link; long from; long to; +#ifdef CONFIG_CGROUP_HUGETLB + /* + * On shared mappings, each reserved region appears as a struct + * file_region in resv_map. These fields hold the info needed to + * uncharge each reservation. + */ + struct page_counter *reservation_counter; + unsigned long pages_per_hpage; +#endif };

/* Helper that removes a struct file_region from the resv_map cache and returns @@ -250,12 +259,30 @@ struct file_region { static struct file_region * get_file_region_entry_from_cache(struct resv_map *resv, long from, long to);

+/* Helper that records hugetlb_cgroup uncharge info. */ +static void record_hugetlb_cgroup_uncharge_info(struct hugetlb_cgroup *h_cg, + struct file_region *nrg, + struct hstate *h) +{ +#ifdef CONFIG_CGROUP_HUGETLB + if (h_cg) { + nrg->reservation_counter = + &h_cg->reserved_hugepage[hstate_index(h)]; + nrg->pages_per_hpage = pages_per_huge_page(h); + } else { + nrg->reservation_counter = NULL; + nrg->pages_per_hpage = 0; + } +#endif +} + /* Must be called with resv->lock held. Calling this with count_only == true * will count the number of pages to be added but will not modify the linked * list. */ static long add_reservation_in_range(struct resv_map *resv, long f, long t, - bool count_only) + struct hugetlb_cgroup *h_cg, + struct hstate *h, bool count_only) { long add = 0; struct list_head *head = &resv->regions; @@ -291,6 +318,8 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, if (!count_only) { nrg = get_file_region_entry_from_cache( resv, last_accounted_offset, rg->from); + record_hugetlb_cgroup_uncharge_info(h_cg, nrg, + h); list_add(&nrg->link, rg->link.prev); } } @@ -306,11 +335,13 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, if (!count_only) { nrg = get_file_region_entry_from_cache( resv, last_accounted_offset, t); + record_hugetlb_cgroup_uncharge_info(h_cg, nrg, h); list_add(&nrg->link, rg->link.prev); } last_accounted_offset = t; }

+ VM_BUG_ON(add < 0); return add; }

@@ -327,7 +358,8 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, * Return the number of new huge pages added to the map. This * number is greater than or equal to zero. */ -static long region_add(struct resv_map *resv, long f, long t, +static long region_add(struct hstate *h, struct hugetlb_cgroup *h_cg, + struct resv_map *resv, long f, long t, long regions_needed) { long add = 0; @@ -336,7 +368,7 @@ static long region_add(struct resv_map *resv, long f, long t,

VM_BUG_ON(resv->region_cache_count < regions_needed);

- add = add_reservation_in_range(resv, f, t, false); + add = add_reservation_in_range(resv, f, t, h_cg, h, false); resv->adds_in_progress -= regions_needed;

spin_unlock(&resv->lock); @@ -380,7 +412,7 @@ static long region_chg(struct resv_map *resv, long f, long t, spin_lock(&resv->lock);

/* Count how many hugepages in this range are NOT respresented. */ - chg = add_reservation_in_range(resv, f, t, true); + chg = add_reservation_in_range(resv, f, t, NULL, NULL, true);

/* * Check for sufficient descriptors in the cache to accommodate @@ -430,6 +462,24 @@ static void region_abort(struct resv_map *resv, long f, long t, spin_unlock(&resv->lock); }

+static void uncharge_cgroup_if_shared_mapping(struct resv_map *resv, + struct file_region *rg, + unsigned long nr_pages) +{ +#ifdef CONFIG_CGROUP_HUGETLB + /* + * If resv->reservation_counter is NULL, then this is shared + * reservation, and the reserved memory is tracked in the file_struct + * entries inside of resv_map. So we need to uncharge the memory here. + */ + if (rg->reservation_counter && rg->pages_per_hpage && nr_pages > 0 && + !resv->reservation_counter) { + hugetlb_cgroup_uncharge_counter(rg->reservation_counter, + nr_pages * rg->pages_per_hpage); + } +#endif +} + /* * Delete the specified range [f, t) from the reserve map. If the * t parameter is LONG_MAX, this indicates that ALL regions after f @@ -499,6 +549,9 @@ static long region_del(struct resv_map *resv, long f, long t) /* Original entry is trimmed */ rg->to = f;

+ uncharge_cgroup_if_shared_mapping(resv, rg, + nrg->to - nrg->from); + list_add(&nrg->link, &rg->link); nrg = NULL; break; @@ -506,6 +559,8 @@ static long region_del(struct resv_map *resv, long f, long t)

if (f <= rg->from && t >= rg->to) { /* Remove entire region */ del += rg->to - rg->from; + uncharge_cgroup_if_shared_mapping(resv, rg, + rg->to - rg->from); list_del(&rg->link); kfree(rg); continue; @@ -514,14 +569,20 @@ static long region_del(struct resv_map *resv, long f, long t) if (f <= rg->from) { /* Trim beginning of region */ del += t - rg->from; rg->from = t; + + uncharge_cgroup_if_shared_mapping(resv, rg, + t - rg->from); } else { /* Trim end of region */ del += rg->to - f; rg->to = f; + + uncharge_cgroup_if_shared_mapping(resv, rg, rg->to - f); } }

spin_unlock(&resv->lock); kfree(nrg); + return del; }

@@ -1896,7 +1957,8 @@ __vma_reservation_common(struct hstate *h, struct vm_area_struct *vma, break; case VMA_COMMIT_RESV: VM_BUG_ON(in_regions_needed == -1); - ret = region_add(resv, idx, idx + 1, in_regions_needed); + ret = region_add(NULL, NULL, resv, idx, idx + 1, + in_regions_needed); break; case VMA_END_RESV: VM_BUG_ON(in_regions_needed == -1); @@ -1906,7 +1968,8 @@ __vma_reservation_common(struct hstate *h, struct vm_area_struct *vma, case VMA_ADD_RESV: VM_BUG_ON(in_regions_needed == -1); if (vma->vm_flags & VM_MAYSHARE) - ret = region_add(resv, idx, idx + 1, in_regions_needed); + ret = region_add(NULL, NULL, resv, idx, idx + 1, + in_regions_needed); else { region_abort(resv, idx, idx + 1, in_regions_needed); ret = region_del(resv, idx, idx + 1); @@ -4538,7 +4601,7 @@ int hugetlb_reserve_pages(struct inode *inode, struct hstate *h = hstate_inode(inode); struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; - struct hugetlb_cgroup *h_cg; + struct hugetlb_cgroup *h_cg = NULL; long gbl_reserve, regions_needed = 0;

/* This should never happen */ @@ -4579,13 +4642,6 @@ int hugetlb_reserve_pages(struct inode *inode,

chg = to - from;

- if (hugetlb_cgroup_charge_cgroup(hstate_index(h), - chg * pages_per_huge_page(h), - &h_cg, true)) { - kref_put(&resv_map->refs, resv_map_release); - return -ENOMEM; - } - #ifdef CONFIG_CGROUP_HUGETLB /* * Since this branch handles private mappings, we attach the @@ -4605,6 +4661,14 @@ int hugetlb_reserve_pages(struct inode *inode, goto out_err; }

+ ret = hugetlb_cgroup_charge_cgroup( + hstate_index(h), chg * pages_per_huge_page(h), &h_cg, true); + + if (ret < 0) { + ret = -ENOMEM; + goto out_err; + } + /* * There must be enough pages in the subpool for the mapping. If * the subpool has a minimum size, there may be some global @@ -4613,7 +4677,7 @@ int hugetlb_reserve_pages(struct inode *inode, gbl_reserve = hugepage_subpool_get_pages(spool, chg); if (gbl_reserve < 0) { ret = -ENOSPC; - goto out_err; + goto out_uncharge_cgroup; }

/* @@ -4622,9 +4686,7 @@ int hugetlb_reserve_pages(struct inode *inode, */ ret = hugetlb_acct_memory(h, gbl_reserve); if (ret < 0) { - /* put back original number of pages, chg */ - (void)hugepage_subpool_put_pages(spool, chg); - goto out_err; + goto out_put_pages; }

/* @@ -4639,7 +4701,8 @@ int hugetlb_reserve_pages(struct inode *inode, * else has to be done for private mappings here */ if (!vma || vma->vm_flags & VM_MAYSHARE) { - long add = region_add(resv_map, from, to, regions_needed); + long add = + region_add(h, h_cg, resv_map, from, to, regions_needed);

if (unlikely(chg > add)) { /* @@ -4651,12 +4714,33 @@ int hugetlb_reserve_pages(struct inode *inode, */ long rsv_adjust;

- rsv_adjust = hugepage_subpool_put_pages(spool, - chg - add); + hugetlb_cgroup_uncharge_cgroup( + hstate_index(h), + (chg - add) * pages_per_huge_page(h), h_cg, + true); + + rsv_adjust = + hugepage_subpool_put_pages(spool, chg - add); hugetlb_acct_memory(h, -rsv_adjust); } + } else { +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Since this branch handles private mappings, we attach the + * counter to uncharge for this reservation off resv_map. + */ + resv_map->reservation_counter = + &h_cg->reserved_hugepage[hstate_index(h)]; + resv_map->pages_per_hpage = pages_per_huge_page(h); +#endif } return 0; +out_put_pages: + /* put back original number of pages, chg */ + (void)hugepage_subpool_put_pages(spool, chg); +out_uncharge_cgroup: + hugetlb_cgroup_uncharge_cgroup( + hstate_index(h), chg * pages_per_huge_page(h), h_cg, true); out_err: if (!vma || vma->vm_flags & VM_MAYSHARE) /* Don't call region_abort if region_chg failed */ -- 2.23.0.351.gc4317032e6-goog

Mina Almasry

10:24 p.m.

New subject: [PATCH v5 6/7] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

The tests use both shared and private mapped hugetlb memory, and monitors the hugetlb usage counter as well as the hugetlb reservation counter. They test different configurations such as hugetlb memory usage via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without MAP_POPULATE.

Signed-off-by: Mina Almasry almasrymina@google.com --- tools/testing/selftests/vm/.gitignore | 1 + tools/testing/selftests/vm/Makefile | 1 + .../selftests/vm/charge_reserved_hugetlb.sh | 461 ++++++++++++++++++ .../selftests/vm/write_hugetlb_memory.sh | 22 + .../testing/selftests/vm/write_to_hugetlbfs.c | 250 ++++++++++ 5 files changed, 735 insertions(+) create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore index 31b3c98b6d34..d3bed9407773 100644 --- a/tools/testing/selftests/vm/.gitignore +++ b/tools/testing/selftests/vm/.gitignore @@ -14,3 +14,4 @@ virtual_address_range gup_benchmark va_128TBswitch map_fixed_noreplace +write_to_hugetlbfs diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 9534dc2bc929..31c2cc5cf30b 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += userfaultfd TEST_GEN_FILES += va_128TBswitch TEST_GEN_FILES += virtual_address_range +TEST_GEN_FILES += write_to_hugetlbfs

TEST_PROGS := run_vmtests

diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh new file mode 100755 index 000000000000..17315db4111c --- /dev/null +++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh @@ -0,0 +1,461 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 + +set -e + +if [[ $(id -u) -ne 0 ]]; then + echo "This test must be run as root. Skipping..." + exit 0 +fi + +cgroup_path=/dev/cgroup/memory +if [[ ! -e $cgroup_path ]]; then + mkdir -p $cgroup_path + mount -t cgroup -o hugetlb,memory cgroup $cgroup_path +fi + +cleanup () { + echo $$ > $cgroup_path/tasks + + if [[ -e /mnt/huge ]]; then + rm -rf /mnt/huge/* + umount /mnt/huge || echo error + rmdir /mnt/huge + fi + if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then + rmdir $cgroup_path/hugetlb_cgroup_test + fi + if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then + rmdir $cgroup_path/hugetlb_cgroup_test1 + fi + if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then + rmdir $cgroup_path/hugetlb_cgroup_test2 + fi + echo 0 > /proc/sys/vm/nr_hugepages + echo CLEANUP DONE +} + +function expect_equal() { + local expected="$1" + local actual="$2" + local error="$3" + + if [[ "$expected" != "$actual" ]]; then + echo "expected ($expected) != actual ($actual): $3" + cleanup + exit 1 + fi +} + +function setup_cgroup() { + local name="$1" + local cgroup_limit="$2" + local reservation_limit="$3" + + mkdir $cgroup_path/$name + + echo writing cgroup limit: "$cgroup_limit" + echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes + + echo writing reseravation limit: "$reservation_limit" + echo "$reservation_limit" > \ + $cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes + if [ -e "$cgroup_path/$name/cpuset.cpus" ]; then + echo 0 > $cgroup_path/$name/cpuset.cpus + fi + if [ -e "$cgroup_path/$name/cpuset.mems" ]; then + echo 0 > $cgroup_path/$name/cpuset.mems + fi +} + +function wait_for_hugetlb_memory_to_get_depleted { + local cgroup="$1" + local path="/dev/cgroup/memory/$cgroup/hugetlb.2MB.reservation_usage_in_bytes" + # Wait for hugetlbfs memory to get depleted. + while [ $(cat $path) != 0 ]; do + echo Waiting for hugetlb memory to get depleted. + sleep 0.5 + done +} + +function wait_for_hugetlb_memory_to_get_written { + local cgroup="$1" + local size="$2" + + local path="/dev/cgroup/memory/$cgroup/hugetlb.2MB.reservation_usage_in_bytes" + # Wait for hugetlbfs memory to get written. + while [ $(cat $path) != $size ]; do + echo Waiting for hugetlb memory to reach size $size. + sleep 0.5 + done +} + +function write_hugetlbfs_and_get_usage() { + local cgroup="$1" + local size="$2" + local populate="$3" + local write="$4" + local path="$5" + local method="$6" + local private="$7" + local expect_failure="$8" + + # Function return values. + reservation_failed=0 + oom_killed=0 + hugetlb_difference=0 + reserved_difference=0 + + local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.2MB.usage_in_bytes + local reserved_usage=$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes + + local hugetlb_before=$(cat $hugetlb_usage) + local reserved_before=$(cat $reserved_usage) + + echo + echo Starting: + echo hugetlb_usage="$hugetlb_before" + echo reserved_usage="$reserved_before" + echo expect_failure is "$expect_failure" + + set +e + if [[ "$method" == "1" ]] || [[ "$method" == 2 ]] || \ + [[ "$private" == "-r" ]] && [[ "$expect_failure" != 1 ]]; then + bash write_hugetlb_memory.sh "$size" "$populate" "$write" \ + "$cgroup" "$path" "$method" "$private" "-l" & + + local write_result=$? + wait_for_hugetlb_memory_to_get_written "$cgroup" "$size" + echo write_result is $write_result + else + bash write_hugetlb_memory.sh "$size" "$populate" "$write" \ + "$cgroup" "$path" "$method" "$private" + local write_result=$? + fi + set -e + + if [[ "$write_result" == 1 ]]; then + reservation_failed=1 + fi + + # On linus/master, the above process gets SIGBUS'd on oomkill, with + # return code 135. On earlier kernels, it gets actual oomkill, with return + # code 137, so just check for both conditions in case we're testing + # against an earlier kernel. + if [[ "$write_result" == 135 ]] || [[ "$write_result" == 137 ]]; then + oom_killed=1 + fi + + local hugetlb_after=$(cat $hugetlb_usage) + local reserved_after=$(cat $reserved_usage) + + echo After write: + echo hugetlb_usage="$hugetlb_after" + echo reserved_usage="$reserved_after" + + hugetlb_difference=$(($hugetlb_after - $hugetlb_before)) + reserved_difference=$(($reserved_after - $reserved_before)) +} + +function cleanup_hugetlb_memory() { + set +e + local cgroup="$1" + if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then + echo kiling write_to_hugetlbfs + killall -2 write_to_hugetlbfs + wait_for_hugetlb_memory_to_get_depleted $cgroup + fi + set -e + + if [[ -e /mnt/huge ]]; then + rm -rf /mnt/huge/* + umount /mnt/huge + rmdir /mnt/huge + fi +} + +function run_test() { + local size="$1" + local populate="$2" + local write="$3" + local cgroup_limit="$4" + local reservation_limit="$5" + local nr_hugepages="$6" + local method="$7" + local private="$8" + local expect_failure="$9" + + # Function return values. + hugetlb_difference=0 + reserved_difference=0 + reservation_failed=0 + oom_killed=0 + + echo nr hugepages = "$nr_hugepages" + echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages + + setup_cgroup "hugetlb_cgroup_test" "$cgroup_limit" "$reservation_limit" + + mkdir -p /mnt/huge + mount -t hugetlbfs \ + -o pagesize=2M,size=256M none /mnt/huge + + write_hugetlbfs_and_get_usage "hugetlb_cgroup_test" "$size" "$populate" \ + "$write" "/mnt/huge/test" "$method" "$private" "$expect_failure" + + cleanup_hugetlb_memory "hugetlb_cgroup_test" + + local final_hugetlb=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.usage_in_bytes) + local final_reservation=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.reservation_usage_in_bytes) + + expect_equal "0" "$final_hugetlb" "final hugetlb is not zero" + expect_equal "0" "$final_reservation" "final reservation is not zero" +} + +function run_multiple_cgroup_test() { + local size1="$1" + local populate1="$2" + local write1="$3" + local cgroup_limit1="$4" + local reservation_limit1="$5" + + local size2="$6" + local populate2="$7" + local write2="$8" + local cgroup_limit2="$9" + local reservation_limit2="${10}" + + local nr_hugepages="${11}" + local method="${12}" + local private="${13}" + local expect_failure="${14}" + + # Function return values. + hugetlb_difference1=0 + reserved_difference1=0 + reservation_failed1=0 + oom_killed1=0 + + hugetlb_difference2=0 + reserved_difference2=0 + reservation_failed2=0 + oom_killed2=0 + + + echo nr hugepages = "$nr_hugepages" + echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages + + setup_cgroup "hugetlb_cgroup_test1" "$cgroup_limit1" "$reservation_limit1" + setup_cgroup "hugetlb_cgroup_test2" "$cgroup_limit2" "$reservation_limit2" + + mkdir -p /mnt/huge + mount -t hugetlbfs \ + -o pagesize=2M,size=256M none /mnt/huge + + write_hugetlbfs_and_get_usage "hugetlb_cgroup_test1" "$size1" \ + "$populate1" "$write1" "/mnt/huge/test1" "$method" "$private" \ + "$expect_failure" + + hugetlb_difference1=$hugetlb_difference + reserved_difference1=$reserved_difference + reservation_failed1=$reservation_failed + oom_killed1=$oom_killed + + local cgroup1_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.usage_in_bytes + local cgroup1_reservation_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.reservation_usage_in_bytes + local cgroup2_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.usage_in_bytes + local cgroup2_reservation_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.reservation_usage_in_bytes + + local usage_before_second_write=$(cat $cgroup1_hugetlb_usage) + local reservation_usage_before_second_write=$(cat \ + $cgroup1_reservation_usage) + + write_hugetlbfs_and_get_usage "hugetlb_cgroup_test2" "$size2" \ + "$populate2" "$write2" "/mnt/huge/test2" "$method" "$private" \ + "$expect_failure" + + hugetlb_difference2=$hugetlb_difference + reserved_difference2=$reserved_difference + reservation_failed2=$reservation_failed + oom_killed2=$oom_killed + + expect_equal "$usage_before_second_write" \ + "$(cat $cgroup1_hugetlb_usage)" "Usage changed." + expect_equal "$reservation_usage_before_second_write" \ + "$(cat $cgroup1_reservation_usage)" "Reservation usage changed." + + cleanup_hugetlb_memory + + local final_hugetlb=$(cat $cgroup1_hugetlb_usage) + local final_reservation=$(cat $cgroup1_reservation_usage) + + expect_equal "0" "$final_hugetlb" \ + "hugetlbt_cgroup_test1 final hugetlb is not zero" + expect_equal "0" "$final_reservation" \ + "hugetlbt_cgroup_test1 final reservation is not zero" + + local final_hugetlb=$(cat $cgroup2_hugetlb_usage) + local final_reservation=$(cat $cgroup2_reservation_usage) + + expect_equal "0" "$final_hugetlb" \ + "hugetlb_cgroup_test2 final hugetlb is not zero" + expect_equal "0" "$final_reservation" \ + "hugetlb_cgroup_test2 final reservation is not zero" +} + +cleanup + +for private in "" "-r" ; do +for populate in "" "-o"; do +for method in 0 1 2; do + +# Skip mmap(MAP_HUGETLB | MAP_SHARED). Doesn't seem to be supported. +if [[ "$method" == 1 ]] && [[ "$private" == "" ]]; then + continue +fi + +# Skip populated shmem tests. Doesn't seem to be supported. +if [[ "$method" == 2"" ]] && [[ "$populate" == "-o" ]]; then + continue +fi + +cleanup +echo +echo +echo +echo Test normal case. +echo private=$private, populate=$populate, method=$method +run_test $((10 * 1024 * 1024)) "$populate" "" $((20 * 1024 * 1024)) \ + $((20 * 1024 * 1024)) 10 "$method" "$private" "0" + +echo Memory charged to hugtlb=$hugetlb_difference +echo Memory charged to reservation=$reserved_difference + +if [[ "$populate" == "-o" ]]; then + expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \ + "Reserved memory charged to hugetlb cgroup." +else + expect_equal "0" "$hugetlb_difference" \ + "Reserved memory charged to hugetlb cgroup." +fi + +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \ + "Reserved memory not charged to reservation usage." +echo 'PASS' + +cleanup +echo +echo +echo +echo Test normal case with write. +echo private=$private, populate=$populate, method=$method +run_test $((10 * 1024 * 1024)) "$populate" '-w' $((20 * 1024 * 1024)) \ + $((20 * 1024 * 1024)) 10 "$method" "$private" "0" + +echo Memory charged to hugtlb=$hugetlb_difference +echo Memory charged to reservation=$reserved_difference + +expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \ + "Reserved memory charged to hugetlb cgroup." +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \ + "Reserved memory not charged to reservation usage." +echo 'PASS' + + +cleanup +echo +echo +echo +echo Test more than reservation case. +echo private=$private, populate=$populate, method=$method +run_test "$((10 * 1024 * 1024))" "$populate" '' "$((20 * 1024 * 1024))" \ + "$((5 * 1024 * 1024))" "10" "$method" "$private" "1" + +expect_equal "1" "$reservation_failed" "Reservation succeeded." +echo 'PASS' + +cleanup + +echo +echo +echo +echo Test more than cgroup limit case. +echo private=$private, populate=$populate, method=$method + +# Not sure if shm memory can be cleaned up when the process gets sigbus'd. +if [[ "$method" != 2 ]]; then + run_test $((10 * 1024 * 1024)) "$populate" "-w" $((5 * 1024 * 1024)) \ + $((20 * 1024 * 1024)) 10 "$method" "$private" "1" + + expect_equal "1" "$oom_killed" "Not oom killed." +fi +echo 'PASS' + +cleanup + +echo +echo +echo +echo Test normal case, multiple cgroups. +echo private=$private, populate=$populate, method=$method +run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "" \ + "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \ + "$populate" "" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \ + "$method" "$private" "0" + +echo Memory charged to hugtlb1=$hugetlb_difference1 +echo Memory charged to reservation1=$reserved_difference1 +echo Memory charged to hugtlb2=$hugetlb_difference2 +echo Memory charged to reservation2=$reserved_difference2 + +expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \ + "Incorrect reservations charged to cgroup 1." +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \ + "Incorrect reservation charged to cgroup 2." +if [[ "$populate" == "-o" ]]; then + expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \ + "Incorrect hugetlb charged to cgroup 1." + expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \ + "Incorrect hugetlb charged to cgroup 2." +else + expect_equal "0" "$hugetlb_difference1" \ + "Incorrect hugetlb charged to cgroup 1." + expect_equal "0" "$hugetlb_difference2" \ + "Incorrect hugetlb charged to cgroup 2." +fi +echo 'PASS' + +cleanup +echo +echo +echo +echo Test normal case with write, multiple cgroups. +echo private=$private, populate=$populate, method=$method +run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "-w" \ + "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \ + "$populate" "-w" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \ + "$method" "$private" "0" + +echo Memory charged to hugtlb1=$hugetlb_difference1 +echo Memory charged to reservation1=$reserved_difference1 +echo Memory charged to hugtlb2=$hugetlb_difference2 +echo Memory charged to reservation2=$reserved_difference2 + +expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \ + "Incorrect hugetlb charged to cgroup 1." +expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \ + "Incorrect reservation charged to cgroup 1." +expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \ + "Incorrect hugetlb charged to cgroup 2." +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \ + "Incorrected reservation charged to cgroup 2." +echo 'PASS' + +cleanup + +done # private +done # populate +done # method + +umount $cgroup_path +rmdir $cgroup_path diff --git a/tools/testing/selftests/vm/write_hugetlb_memory.sh b/tools/testing/selftests/vm/write_hugetlb_memory.sh new file mode 100644 index 000000000000..08f5fa5527cf --- /dev/null +++ b/tools/testing/selftests/vm/write_hugetlb_memory.sh @@ -0,0 +1,22 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 + +set -e + +size=$1 +populate=$2 +write=$3 +cgroup=$4 +path=$5 +method=$6 +private=$7 +want_sleep=$8 + +echo "Putting task in cgroup '$cgroup'" +echo $$ > /dev/cgroup/memory/"$cgroup"/tasks + +echo "Method is $method" + +set +e +./write_to_hugetlbfs -p "$path" -s "$size" "$write" "$populate" -m "$method" \ + "$private" "$want_sleep" diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c new file mode 100644 index 000000000000..03e71c5911be --- /dev/null +++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c @@ -0,0 +1,250 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * This program reserves and uses hugetlb memory, supporting a bunch of + * scenarios needed by the charged_reserved_hugetlb.sh test. + */ + +#include <err.h> +#include <errno.h> +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <fcntl.h> +#include <sys/types.h> +#include <sys/shm.h> +#include <sys/stat.h> +#include <sys/mman.h> + +/* Global definitions. */ +enum method { + HUGETLBFS, + MMAP_MAP_HUGETLB, + SHM, + MAX_METHOD +}; + + +/* Global variables. */ +static const char *self; +static char *shmaddr; +static int shmid; + +/* + * Show usage and exit. + */ +static void exit_usage(void) +{ + printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> " + "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] " + "[-o] [-w]\n", + self); + exit(EXIT_FAILURE); +} + +void sig_handler(int signo) +{ + printf("Received %d.\n", signo); + if (signo == SIGINT) { + printf("Deleting the memory\n"); + if (shmdt((const void *)shmaddr) != 0) { + perror("Detach failure"); + shmctl(shmid, IPC_RMID, NULL); + exit(4); + } + + shmctl(shmid, IPC_RMID, NULL); + printf("Done deleting the memory\n"); + } + exit(2); +} + +int main(int argc, char **argv) +{ + int fd = 0; + int key = 0; + int *ptr = NULL; + int c = 0; + int size = 0; + char path[256] = ""; + enum method method = MAX_METHOD; + int want_sleep = 0, private = 0; + int populate = 0; + int write = 0; + + unsigned long i; + + if (signal(SIGINT, sig_handler) == SIG_ERR) + err(1, "\ncan't catch SIGINT\n"); + + /* Parse command-line arguments. */ + setvbuf(stdout, NULL, _IONBF, 0); + self = argv[0]; + + while ((c = getopt(argc, argv, "s:p:m:owlr")) != -1) { + switch (c) { + case 's': + size = atoi(optarg); + break; + case 'p': + strncpy(path, optarg, sizeof(path)); + break; + case 'm': + if (atoi(optarg) >= MAX_METHOD) { + errno = EINVAL; + perror("Invalid -m."); + exit_usage(); + } + method = atoi(optarg); + break; + case 'o': + populate = 1; + break; + case 'w': + write = 1; + break; + case 'l': + want_sleep = 1; + break; + case 'r': + private + = 1; + break; + default: + errno = EINVAL; + perror("Invalid arg"); + exit_usage(); + } + } + + if (strncmp(path, "", sizeof(path)) != 0) { + printf("Writing to this path: %s\n", path); + } else { + errno = EINVAL; + perror("path not found"); + exit_usage(); + } + + if (size != 0) { + printf("Writing this size: %d\n", size); + } else { + errno = EINVAL; + perror("size not found"); + exit_usage(); + } + + if (!populate) + printf("Not populating.\n"); + else + printf("Populating.\n"); + + if (!write) + printf("Not writing to memory.\n"); + + if (method == MAX_METHOD) { + errno = EINVAL; + perror("-m Invalid"); + exit_usage(); + } else + printf("Using method=%d\n", method); + + if (!private) + printf("Shared mapping.\n"); + else + printf("Private mapping.\n"); + + switch (method) { + case HUGETLBFS: + printf("Allocating using HUGETLBFS.\n"); + fd = open(path, O_CREAT | O_RDWR, 0777); + if (fd == -1) + err(1, "Failed to open file."); + + ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, + (private ? MAP_PRIVATE : MAP_SHARED) | + (populate ? MAP_POPULATE : 0), + fd, 0); + + if (ptr == MAP_FAILED) { + close(fd); + err(1, "Error mapping the file"); + } + break; + case MMAP_MAP_HUGETLB: + printf("Allocating using MAP_HUGETLB.\n"); + ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, + (private ? (MAP_PRIVATE | MAP_ANONYMOUS) : + MAP_SHARED) | + MAP_HUGETLB | (populate ? MAP_POPULATE : 0), + -1, 0); + + if (ptr == MAP_FAILED) + err(1, "mmap"); + + printf("Returned address is %p\n", ptr); + break; + case SHM: + printf("Allocating using SHM.\n"); + shmid = shmget(key, size, + SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W); + if (shmid < 0) { + shmid = shmget(++key, size, + SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W); + if (shmid < 0) + err(1, "shmget"); + } + printf("shmid: 0x%x, shmget key:%d\n", shmid, key); + + shmaddr = shmat(shmid, NULL, 0); + if (shmaddr == (char *)-1) { + perror("Shared memory attach failure"); + shmctl(shmid, IPC_RMID, NULL); + exit(2); + } + printf("shmaddr: %p\n", shmaddr); + + break; + default: + errno = EINVAL; + err(1, "Invalid method."); + } + + if (write) { + printf("Writing to memory.\n"); + if (method != SHM) { + memset(ptr, 1, size); + } else { + printf("Starting the writes:\n"); + for (i = 0; i < size; i++) { + shmaddr[i] = (char)(i); + if (!(i % (1024 * 1024))) + printf("."); + } + printf("\n"); + + printf("Starting the Check..."); + for (i = 0; i < size; i++) + if (shmaddr[i] != (char)i) { + printf("\nIndex %lu mismatched\n", i); + exit(3); + } + printf("Done.\n"); + } + } + + if (want_sleep) { + /* Signal to caller that we're done. */ + printf("DONE\n"); + + /* Hold memory until external kill signal is delivered. */ + while (1) + sleep(100); + } + + switch (method == HUGETLBFS) { + close(fd); + } + + return 0; +} -- 2.23.0.351.gc4317032e6-goog

Mina Almasry

10:24 p.m.

New subject: [PATCH v5 7/7] hugetlb_cgroup: Add hugetlb_cgroup reservation docs

Add docs for how to use hugetlb_cgroup reservations, and their behavior.

Signed-off-by: Mina Almasry almasrymina@google.com Acked-by: Hillf Danton hdanton@sina.com

--- .../admin-guide/cgroup-v1/hugetlb.rst | 85 ++++++++++++++++--- 1 file changed, 74 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst index a3902aa253a9..70c10bd9a0b7 100644 --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst @@ -2,13 +2,6 @@ HugeTLB Controller ==================

-The HugeTLB controller allows to limit the HugeTLB usage per control group and -enforces the controller limit during page fault. Since HugeTLB doesn't -support page reclaim, enforcing the limit at page fault time implies that, -the application will get SIGBUS signal if it tries to access HugeTLB pages -beyond its limit. This requires the application to know beforehand how much -HugeTLB pages it would require for its use. - HugeTLB controller can be created by first mounting the cgroup filesystem.

# mount -t cgroup -o hugetlb none /sys/fs/cgroup @@ -28,10 +21,14 @@ process (bash) into it.

Brief summary of control files::

- hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage - hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded - hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb - hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit + hugetlb.<hugepagesize>.reservation_limit_in_bytes # set/show limit of "hugepagesize" hugetlb reservations + hugetlb.<hugepagesize>.reservation_max_usage_in_bytes # show max "hugepagesize" hugetlb reservations recorded + hugetlb.<hugepagesize>.reservation_usage_in_bytes # show current reservations for "hugepagesize" hugetlb + hugetlb.<hugepagesize>.reservation_failcnt # show the number of allocation failure due to HugeTLB reservation limit + hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb faults + hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded + hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb + hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB usage limit

For a system supporting three hugepage sizes (64k, 32M and 1G), the control files include:: @@ -40,11 +37,77 @@ files include:: hugetlb.1GB.max_usage_in_bytes hugetlb.1GB.usage_in_bytes hugetlb.1GB.failcnt + hugetlb.1GB.reservation_limit_in_bytes + hugetlb.1GB.reservation_max_usage_in_bytes + hugetlb.1GB.reservation_usage_in_bytes + hugetlb.1GB.reservation_failcnt hugetlb.64KB.limit_in_bytes hugetlb.64KB.max_usage_in_bytes hugetlb.64KB.usage_in_bytes hugetlb.64KB.failcnt + hugetlb.64KB.reservation_limit_in_bytes + hugetlb.64KB.reservation_max_usage_in_bytes + hugetlb.64KB.reservation_usage_in_bytes + hugetlb.64KB.reservation_failcnt hugetlb.32MB.limit_in_bytes hugetlb.32MB.max_usage_in_bytes hugetlb.32MB.usage_in_bytes hugetlb.32MB.failcnt + hugetlb.32MB.reservation_limit_in_bytes + hugetlb.32MB.reservation_max_usage_in_bytes + hugetlb.32MB.reservation_usage_in_bytes + hugetlb.32MB.reservation_failcnt + + +1. Reservation limits + +The HugeTLB controller allows to limit the HugeTLB reservations per control +group and enforces the controller limit at reservation time. Reservation limits +are superior to Page fault limits (see section 2), since Reservation limits are +enforced at reservation time, and never causes the application to get SIGBUS +signal. Instead, if the application is violating its limits, then it gets an +error on reservation time, i.e. the mmap or shmget return an error. + + +2. Page fault limits + +The HugeTLB controller allows to limit the HugeTLB usage (page fault) per +control group and enforces the controller limit during page fault. Since HugeTLB +doesn't support page reclaim, enforcing the limit at page fault time implies +that, the application will get SIGBUS signal if it tries to access HugeTLB +pages beyond its limit. This requires the application to know beforehand how +much HugeTLB pages it would require for its use. + + +3. Caveats with shared memory + +a. Charging and uncharging: + +For shared hugetlb memory, both hugetlb reservation and usage (page faults) are +charged to the first task that causes the memory to be reserved or faulted, +and all subsequent uses of this reserved or faulted memory is done without +charging. + +Shared hugetlb memory is only uncharged when it is unreserved or deallocated. +This is usually when the hugetlbfs file is deleted, and not when the task that +caused the reservation or fault has exited. + +b. Interaction between reservation limit and fault limit. + +Generally, it's not recommended to set both of the reservation limit and fault +limit in a cgroup. For private memory, the fault usage cannot exceed the +reservation usage, so if you set both, one of those limits will be useless. +Still, there is no kernel enforcement on setting both at the same time. + +For shared memory, a cgroup's fault usage may be greater than its reservation +usage, so some care needs to be taken. Consider this example: + +- Task A reserves 4 pages in a shared hugetlbfs file. Cgroup A will get + 4 reservations charged to it and no faults charged to it. +- Task B reserves and faults the same 4 pages as Task A. Cgroup B will get no + reservation charge, but will get charged 4 faulted pages. If Cgroup B's limit + is less than 4, then Task B will get a SIGBUS. + +For the above scenario, it's not recommended for the userspace to set both +reservation limits and fault limits, but it is still allowed to in case it sees +some use for it, such as safe gradual transitioning from one to the other. -- 2.23.0.351.gc4317032e6-goog

Mike Kravetz

23 Sep 23 Sep

5:47 p.m.

On 9/19/19 3:24 PM, Mina Almasry wrote:

...

Patch series implements hugetlb_cgroup reservation usage and limits, which track hugetlb reservations rather than hugetlb memory faulted in. Details of the approach is 1/7.

Thanks for your continued efforts Mina.

One thing that has bothered me with this approach from the beginning is that hugetlb reservations are related to, but somewhat distinct from hugetlb allocations. The original (existing) huegtlb cgroup implementation does not take reservations into account. This is an issue you are trying to address by adding a cgroup support for hugetlb reservations. However, this new reservation cgroup ignores hugetlb allocations at fault time.

I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation of hugetlb pages. Both the existing cgroup code and the reservation approach have what I think are some serious flaws. Consider a system with 100 hugetlb pages available. A sysadmin, has two groups A and B and wants to limit hugetlb usage to 50 pages each.

With the existing implementation, a task in group A could create a mmap of 100 pages in size and reserve all 100 pages. Since the pages are 'reserved', nobody in group B can allocate ANY huge pages. This is true even though no pages have been allocated in A (or B).

With the reservation implementation, a task in group A could use MAP_NORESERVE and allocate all 100 pages without taking any reservations.

As mentioned in your documentation, it would be possible to use both the existing (allocation) and new reservation cgroups together. Perhaps if both are setup for the 50/50 split things would work a little better.

However, instead of creating a new reservation crgoup how about adding reservation support to the existing allocation cgroup support. One could even argue that a reservation is an allocation as it sets aside huge pages that can only be used for a specific purpose. Here is something that may work.

Starting with the existing allocation cgroup. - When hugetlb pages are reserved, the cgroup of the task making the reservations is charged. Tracking for the charged cgroup is done in the reservation map in the same way proposed by this patch set. - At page fault time, - If a reservation already exists for that specific area do not charge the faulting task. No tracking in page, just the reservation map. - If no reservation exists, charge the group of the faulting task. Tracking of this information is in the page itself as implemented today. - When the hugetlb object is removed, compare the reservation map with any allocated pages. If cgroup tracking information exists in page, uncharge that group. Otherwise, unharge the group (if any) in the reservation map.

One of the advantages of a separate reservation cgroup is that the existing code is unmodified. Combining the two provides a more complete/accurate solution IMO. But, it has the potential to break existing users.

I really would like to get feedback from anyone that knows how the existing hugetlb cgroup controller may be used today. Comments from Aneesh would be very welcome to know if reservations were considered in development of the existing code.

-- Mike Kravetz

Mina Almasry

7:18 p.m.

On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz mike.kravetz@oracle.com wrote:

...

On 9/19/19 3:24 PM, Mina Almasry wrote:

...
Patch series implements hugetlb_cgroup reservation usage and limits, which track hugetlb reservations rather than hugetlb memory faulted in. Details of the approach is 1/7.

Thanks for your continued efforts Mina.

And thanks for your reviews so far.

...

One thing that has bothered me with this approach from the beginning is that hugetlb reservations are related to, but somewhat distinct from hugetlb allocations. The original (existing) huegtlb cgroup implementation does not take reservations into account. This is an issue you are trying to address by adding a cgroup support for hugetlb reservations. However, this new reservation cgroup ignores hugetlb allocations at fault time.

I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation of hugetlb pages. Both the existing cgroup code and the reservation approach have what I think are some serious flaws. Consider a system with 100 hugetlb pages available. A sysadmin, has two groups A and B and wants to limit hugetlb usage to 50 pages each.

With the existing implementation, a task in group A could create a mmap of 100 pages in size and reserve all 100 pages. Since the pages are 'reserved', nobody in group B can allocate ANY huge pages. This is true even though no pages have been allocated in A (or B).

With the reservation implementation, a task in group A could use MAP_NORESERVE and allocate all 100 pages without taking any reservations.

As mentioned in your documentation, it would be possible to use both the existing (allocation) and new reservation cgroups together. Perhaps if both are setup for the 50/50 split things would work a little better.

However, instead of creating a new reservation crgoup how about adding reservation support to the existing allocation cgroup support. One could even argue that a reservation is an allocation as it sets aside huge pages that can only be used for a specific purpose. Here is something that may work.

Starting with the existing allocation cgroup.

When hugetlb pages are reserved, the cgroup of the task making the reservations is charged. Tracking for the charged cgroup is done in the reservation map in the same way proposed by this patch set.

At page fault time,

If a reservation already exists for that specific area do not charge the faulting task. No tracking in page, just the reservation map.

If no reservation exists, charge the group of the faulting task. Tracking of this information is in the page itself as implemented today.

When the hugetlb object is removed, compare the reservation map with any allocated pages. If cgroup tracking information exists in page, uncharge that group. Otherwise, unharge the group (if any) in the reservation map.

One of the advantages of a separate reservation cgroup is that the existing code is unmodified. Combining the two provides a more complete/accurate solution IMO. But, it has the potential to break existing users.

I really would like to get feedback from anyone that knows how the existing hugetlb cgroup controller may be used today. Comments from Aneesh would be very welcome to know if reservations were considered in development of the existing code. --

FWIW, I'm aware of the interaction with NORESERVE and my thoughts are:

AFAICT, the 2 counter approach we have here is strictly superior to the 1 upgraded counter approach. Consider these points:

- From what I can tell so far, everything you can do with the 1 counter approach, you can do with the two counter approach by setting both limit_in_bytes and reservation_limit_in_bytes to the limit value. That will limit both reservations and at fault allocations.

- The 2 counter approach preserves existing usage of hugetlb cgroups, so no need to muck around with reverting the feature some time from now because of broken users. No existing users of hugetlb cgroups need to worry about the effect of this on their usage.

- Users that use hugetlb memory strictly through reservations can use only reservation_limit_in_bytes and enjoy cgroup limits that never SIGBUS the application. This is our usage for example.

- The 2 counter approach provides more info to the sysadmin. The sysadmin knows exactly how much reserved bytes there are via reservation_usage_in_bytes, and how much actually in use bytes there are via usage_in_bytes. They can even detect NORESERVE usage if usage_in_bytes > reservation_usage_in_bytes. failcnt shows failed reservations *and* failed allocations at fault, etc. All around better debuggability when things go wrong. I think this is particularly troubling for the 1 upgraded counter approach. That counter's usage_in_bytes doesn't tell you if the usage came from reservations or allocations at fault time.

- Honestly, I think the 2 counter approach is easier to document and understand by the userspace? 1 counter that vaguely tracks both the reservations and usage and decides whether or not to charge at fault time seems hard to understand what really happened after something goes wrong. 1 counter that tracks reservations and 1 counter that tracks actual usage seem much simpler to digest, and provide better visibility to what the cgroup is doing as I mentioned above.

I think it may be better if I keep the 2 counter approach but thoroughly document the interaction between the existing counters and NORESERVE. What do you think?

FWIW, it may be prudent to consider deprecating MAP_NORESERVE, if that's an option. I'm not sure what that benefit that provides applications, and on the other hand it makes it hard for the kernel to guarantee the hugetlb memory is available to the application that requested it, and makes it harder for the cgroups to police hugetlb usage without SIGBUSing something. But that may be a discussion for another proposal.

...

Mike Kravetz

Mike Kravetz

9:27 p.m.

On 9/23/19 12:18 PM, Mina Almasry wrote:

...

On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz mike.kravetz@oracle.com wrote:

...
On 9/19/19 3:24 PM, Mina Almasry wrote:

...
Patch series implements hugetlb_cgroup reservation usage and limits, which track hugetlb reservations rather than hugetlb memory faulted in. Details of the approach is 1/7.

Thanks for your continued efforts Mina.

And thanks for your reviews so far.

...
One thing that has bothered me with this approach from the beginning is that hugetlb reservations are related to, but somewhat distinct from hugetlb allocations. The original (existing) huegtlb cgroup implementation does not take reservations into account. This is an issue you are trying to address by adding a cgroup support for hugetlb reservations. However, this new reservation cgroup ignores hugetlb allocations at fault time.

I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation of hugetlb pages. Both the existing cgroup code and the reservation approach have what I think are some serious flaws. Consider a system with 100 hugetlb pages available. A sysadmin, has two groups A and B and wants to limit hugetlb usage to 50 pages each.

With the existing implementation, a task in group A could create a mmap of 100 pages in size and reserve all 100 pages. Since the pages are 'reserved', nobody in group B can allocate ANY huge pages. This is true even though no pages have been allocated in A (or B).

With the reservation implementation, a task in group A could use MAP_NORESERVE and allocate all 100 pages without taking any reservations.

As mentioned in your documentation, it would be possible to use both the existing (allocation) and new reservation cgroups together. Perhaps if both are setup for the 50/50 split things would work a little better.

However, instead of creating a new reservation crgoup how about adding reservation support to the existing allocation cgroup support. One could even argue that a reservation is an allocation as it sets aside huge pages that can only be used for a specific purpose. Here is something that may work.

Starting with the existing allocation cgroup.

When hugetlb pages are reserved, the cgroup of the task making the reservations is charged. Tracking for the charged cgroup is done in the reservation map in the same way proposed by this patch set.

At page fault time,

If a reservation already exists for that specific area do not charge the faulting task. No tracking in page, just the reservation map.

If no reservation exists, charge the group of the faulting task. Tracking of this information is in the page itself as implemented today.

When the hugetlb object is removed, compare the reservation map with any allocated pages. If cgroup tracking information exists in page, uncharge that group. Otherwise, unharge the group (if any) in the reservation map.

One of the advantages of a separate reservation cgroup is that the existing code is unmodified. Combining the two provides a more complete/accurate solution IMO. But, it has the potential to break existing users.

I really would like to get feedback from anyone that knows how the existing hugetlb cgroup controller may be used today. Comments from Aneesh would be very welcome to know if reservations were considered in development of the existing code. --

FWIW, I'm aware of the interaction with NORESERVE and my thoughts are:

AFAICT, the 2 counter approach we have here is strictly superior to the 1 upgraded counter approach. Consider these points:

From what I can tell so far, everything you can do with the 1

counter approach, you can do with the two counter approach by setting both limit_in_bytes and reservation_limit_in_bytes to the limit value. That will limit both reservations and at fault allocations.

The 2 counter approach preserves existing usage of hugetlb cgroups,

so no need to muck around with reverting the feature some time from now because of broken users. No existing users of hugetlb cgroups need to worry about the effect of this on their usage.

Users that use hugetlb memory strictly through reservations can use

only reservation_limit_in_bytes and enjoy cgroup limits that never SIGBUS the application. This is our usage for example.

The 2 counter approach provides more info to the sysadmin. The

sysadmin knows exactly how much reserved bytes there are via reservation_usage_in_bytes, and how much actually in use bytes there are via usage_in_bytes. They can even detect NORESERVE usage if usage_in_bytes > reservation_usage_in_bytes. failcnt shows failed reservations *and* failed allocations at fault, etc. All around better debuggability when things go wrong. I think this is particularly troubling for the 1 upgraded counter approach. That counter's usage_in_bytes doesn't tell you if the usage came from reservations or allocations at fault time.

Honestly, I think the 2 counter approach is easier to document and

understand by the userspace? 1 counter that vaguely tracks both the reservations and usage and decides whether or not to charge at fault time seems hard to understand what really happened after something goes wrong. 1 counter that tracks reservations and 1 counter that tracks actual usage seem much simpler to digest, and provide better visibility to what the cgroup is doing as I mentioned above.

I think it may be better if I keep the 2 counter approach but thoroughly document the interaction between the existing counters and NORESERVE. What do you think?

I personally prefer the one counter approach only for the reason that it exposes less information about hugetlb reservations. I was not around for the introduction of hugetlb reservations, but I have fixed several issues having to do with reservations. IMO, reservations should be hidden from users as much as possible. Others may disagree.

I really hope that Aneesh will comment. He added the existing hugetlb cgroup code. I was not involved in that effort, but it looks like there might have been some thought given to reservations in early versions of that code. It would be interesting to get his perspective.

Changes included in patch 4 (disable region_add file_region coalescing) would be needed in a one counter approach as well, so I do plan to review those changes.

-- Mike Kravetz

Mina Almasry

24 Sep 24 Sep

10:42 p.m.

On Mon, Sep 23, 2019 at 2:27 PM Mike Kravetz mike.kravetz@oracle.com wrote:

...

On 9/23/19 12:18 PM, Mina Almasry wrote:

...
On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz mike.kravetz@oracle.com wrote:

...
On 9/19/19 3:24 PM, Mina Almasry wrote:

...
Patch series implements hugetlb_cgroup reservation usage and limits, which track hugetlb reservations rather than hugetlb memory faulted in. Details of the approach is 1/7.

Thanks for your continued efforts Mina.

And thanks for your reviews so far.

...
One thing that has bothered me with this approach from the beginning is that hugetlb reservations are related to, but somewhat distinct from hugetlb allocations. The original (existing) huegtlb cgroup implementation does not take reservations into account. This is an issue you are trying to address by adding a cgroup support for hugetlb reservations. However, this new reservation cgroup ignores hugetlb allocations at fault time.

I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation of hugetlb pages. Both the existing cgroup code and the reservation approach have what I think are some serious flaws. Consider a system with 100 hugetlb pages available. A sysadmin, has two groups A and B and wants to limit hugetlb usage to 50 pages each.

With the existing implementation, a task in group A could create a mmap of 100 pages in size and reserve all 100 pages. Since the pages are 'reserved', nobody in group B can allocate ANY huge pages. This is true even though no pages have been allocated in A (or B).

With the reservation implementation, a task in group A could use MAP_NORESERVE and allocate all 100 pages without taking any reservations.

As mentioned in your documentation, it would be possible to use both the existing (allocation) and new reservation cgroups together. Perhaps if both are setup for the 50/50 split things would work a little better.

However, instead of creating a new reservation crgoup how about adding reservation support to the existing allocation cgroup support. One could even argue that a reservation is an allocation as it sets aside huge pages that can only be used for a specific purpose. Here is something that may work.

Starting with the existing allocation cgroup.

When hugetlb pages are reserved, the cgroup of the task making the reservations is charged. Tracking for the charged cgroup is done in the reservation map in the same way proposed by this patch set.

At page fault time,

If a reservation already exists for that specific area do not charge the faulting task. No tracking in page, just the reservation map.

If no reservation exists, charge the group of the faulting task. Tracking of this information is in the page itself as implemented today.

When the hugetlb object is removed, compare the reservation map with any allocated pages. If cgroup tracking information exists in page, uncharge that group. Otherwise, unharge the group (if any) in the reservation map.

One of the advantages of a separate reservation cgroup is that the existing code is unmodified. Combining the two provides a more complete/accurate solution IMO. But, it has the potential to break existing users.

I really would like to get feedback from anyone that knows how the existing hugetlb cgroup controller may be used today. Comments from Aneesh would be very welcome to know if reservations were considered in development of the existing code. --

FWIW, I'm aware of the interaction with NORESERVE and my thoughts are:

AFAICT, the 2 counter approach we have here is strictly superior to the 1 upgraded counter approach. Consider these points:

From what I can tell so far, everything you can do with the 1

counter approach, you can do with the two counter approach by setting both limit_in_bytes and reservation_limit_in_bytes to the limit value. That will limit both reservations and at fault allocations.

The 2 counter approach preserves existing usage of hugetlb cgroups,

so no need to muck around with reverting the feature some time from now because of broken users. No existing users of hugetlb cgroups need to worry about the effect of this on their usage.

Users that use hugetlb memory strictly through reservations can use

only reservation_limit_in_bytes and enjoy cgroup limits that never SIGBUS the application. This is our usage for example.

The 2 counter approach provides more info to the sysadmin. The

sysadmin knows exactly how much reserved bytes there are via reservation_usage_in_bytes, and how much actually in use bytes there are via usage_in_bytes. They can even detect NORESERVE usage if usage_in_bytes > reservation_usage_in_bytes. failcnt shows failed reservations *and* failed allocations at fault, etc. All around better debuggability when things go wrong. I think this is particularly troubling for the 1 upgraded counter approach. That counter's usage_in_bytes doesn't tell you if the usage came from reservations or allocations at fault time.

Honestly, I think the 2 counter approach is easier to document and

understand by the userspace? 1 counter that vaguely tracks both the reservations and usage and decides whether or not to charge at fault time seems hard to understand what really happened after something goes wrong. 1 counter that tracks reservations and 1 counter that tracks actual usage seem much simpler to digest, and provide better visibility to what the cgroup is doing as I mentioned above.

I think it may be better if I keep the 2 counter approach but thoroughly document the interaction between the existing counters and NORESERVE. What do you think?

I personally prefer the one counter approach only for the reason that it exposes less information about hugetlb reservations. I was not around for the introduction of hugetlb reservations, but I have fixed several issues having to do with reservations. IMO, reservations should be hidden from users as much as possible. Others may disagree.

I really hope that Aneesh will comment. He added the existing hugetlb cgroup code. I was not involved in that effort, but it looks like there might have been some thought given to reservations in early versions of that code. It would be interesting to get his perspective.

Changes included in patch 4 (disable region_add file_region coalescing) would be needed in a one counter approach as well, so I do plan to review those changes.

OK, FWIW, the 1 counter approach should be sufficient for us, so I'm not really opposed. David, maybe chime in if you see a problem here?

...

From the perspective of hiding reservations from the user as much as

possible, it is an improvement.

I'm only wary about changing the behavior of the current and having that regress applications. I'm hoping you and Aneesh can shed light on this.

...

-- Mike Kravetz

David Rientjes

26 Sep 26 Sep

7:28 p.m.

On Tue, 24 Sep 2019, Mina Almasry wrote:

...

...
I personally prefer the one counter approach only for the reason that it exposes less information about hugetlb reservations. I was not around for the introduction of hugetlb reservations, but I have fixed several issues having to do with reservations. IMO, reservations should be hidden from users as much as possible. Others may disagree.

I really hope that Aneesh will comment. He added the existing hugetlb cgroup code. I was not involved in that effort, but it looks like there might have been some thought given to reservations in early versions of that code. It would be interesting to get his perspective.

Changes included in patch 4 (disable region_add file_region coalescing) would be needed in a one counter approach as well, so I do plan to review those changes.

OK, FWIW, the 1 counter approach should be sufficient for us, so I'm not really opposed. David, maybe chime in if you see a problem here? From the perspective of hiding reservations from the user as much as possible, it is an improvement.

I'm only wary about changing the behavior of the current and having that regress applications. I'm hoping you and Aneesh can shed light on this.

I think neither Aneesh nor myself are going to be able to provide a complete answer on the use of hugetlb cgroup today, anybody could be using it without our knowledge and that opens up the possibility that combining the limits would adversely affect a real system configuration.

If that is a possibility, I think we need to do some due diligence and try to deprecate allocation limits if possible. One of the benefits to separate limits is that we can make reasonable steps to deprecating the actual allocation limits, if possible: we could add warnings about the deprecation of allocation limits and see if anybody complains.

That could take the form of two separate limits or a tunable in the root hugetlb cgroup that defines whether the limits are for allocation or reservation.

Combining them in the first pass seems to be very risky and could cause pain for users that will not detect this during an rc cycle and will report the issue only when their distro gets it. Then we are left with no alternative other than stable backports and the separation of the limits anyway.

Mike Kravetz

9:23 p.m.

On 9/26/19 12:28 PM, David Rientjes wrote:

...

On Tue, 24 Sep 2019, Mina Almasry wrote:

...
...
I personally prefer the one counter approach only for the reason that it exposes less information about hugetlb reservations. I was not around for the introduction of hugetlb reservations, but I have fixed several issues having to do with reservations. IMO, reservations should be hidden from users as much as possible. Others may disagree.

I really hope that Aneesh will comment. He added the existing hugetlb cgroup code. I was not involved in that effort, but it looks like there might have been some thought given to reservations in early versions of that code. It would be interesting to get his perspective.

Changes included in patch 4 (disable region_add file_region coalescing) would be needed in a one counter approach as well, so I do plan to review those changes.

OK, FWIW, the 1 counter approach should be sufficient for us, so I'm not really opposed. David, maybe chime in if you see a problem here? From the perspective of hiding reservations from the user as much as possible, it is an improvement.

I'm only wary about changing the behavior of the current and having that regress applications. I'm hoping you and Aneesh can shed light on this.

I think neither Aneesh nor myself are going to be able to provide a complete answer on the use of hugetlb cgroup today, anybody could be using it without our knowledge and that opens up the possibility that combining the limits would adversely affect a real system configuration.

I agree that nobody can provide complete information on hugetlb cgroup usage today. My interest was in anything Aneesh could remember about development of the current cgroup code. It 'appears' that the idea of including reservations or mmap ranges was considered or at least discussed. But, those discussions happened more than 7 years old and my searches are not providing a complete picture. My hope was that Aneesh may remember those discussions.

...

If that is a possibility, I think we need to do some due diligence and try to deprecate allocation limits if possible. One of the benefits to separate limits is that we can make reasonable steps to deprecating the actual allocation limits, if possible: we could add warnings about the deprecation of allocation limits and see if anybody complains.

That could take the form of two separate limits or a tunable in the root hugetlb cgroup that defines whether the limits are for allocation or reservation.

Combining them in the first pass seems to be very risky and could cause pain for users that will not detect this during an rc cycle and will report the issue only when their distro gets it. Then we are left with no alternative other than stable backports and the separation of the limits anyway.

I agree that changing behavior of the existing controller is too risky. Such a change is likely to break someone. The more I think about it, the best way forward will be to retain the existing controller and create a new controller that satisfies the new use cases. The question remains as to what that new controller will be. Does it control reservations only? Is it a combination of reservations and allocations? If a combined controller will work for new use cases, that would be my preference. Of course, I have not prototyped such a controller so there may be issues when we get into the details. For a reservation only or combined controller, the region_* changes proposed by Mina would be used.

-- Mike Kravetz

Mina Almasry

27 Sep 27 Sep

12:55 a.m.

On Thu, Sep 26, 2019 at 2:23 PM Mike Kravetz mike.kravetz@oracle.com wrote:

...

On 9/26/19 12:28 PM, David Rientjes wrote:

...
On Tue, 24 Sep 2019, Mina Almasry wrote:

...
...
I personally prefer the one counter approach only for the reason that it exposes less information about hugetlb reservations. I was not around for the introduction of hugetlb reservations, but I have fixed several issues having to do with reservations. IMO, reservations should be hidden from users as much as possible. Others may disagree.

I really hope that Aneesh will comment. He added the existing hugetlb cgroup code. I was not involved in that effort, but it looks like there might have been some thought given to reservations in early versions of that code. It would be interesting to get his perspective.

Changes included in patch 4 (disable region_add file_region coalescing) would be needed in a one counter approach as well, so I do plan to review those changes.

OK, FWIW, the 1 counter approach should be sufficient for us, so I'm not really opposed. David, maybe chime in if you see a problem here? From the perspective of hiding reservations from the user as much as possible, it is an improvement.

I'm only wary about changing the behavior of the current and having that regress applications. I'm hoping you and Aneesh can shed light on this.

I think neither Aneesh nor myself are going to be able to provide a complete answer on the use of hugetlb cgroup today, anybody could be using it without our knowledge and that opens up the possibility that combining the limits would adversely affect a real system configuration.

I agree that nobody can provide complete information on hugetlb cgroup usage today. My interest was in anything Aneesh could remember about development of the current cgroup code. It 'appears' that the idea of including reservations or mmap ranges was considered or at least discussed. But, those discussions happened more than 7 years old and my searches are not providing a complete picture. My hope was that Aneesh may remember those discussions.

...
If that is a possibility, I think we need to do some due diligence and try to deprecate allocation limits if possible. One of the benefits to separate limits is that we can make reasonable steps to deprecating the actual allocation limits, if possible: we could add warnings about the deprecation of allocation limits and see if anybody complains.

That could take the form of two separate limits or a tunable in the root hugetlb cgroup that defines whether the limits are for allocation or reservation.

Combining them in the first pass seems to be very risky and could cause pain for users that will not detect this during an rc cycle and will report the issue only when their distro gets it. Then we are left with no alternative other than stable backports and the separation of the limits anyway.

I agree that changing behavior of the existing controller is too risky. Such a change is likely to break someone.

I'm glad we're converging on keeping the existing behavior unchanged.

...

The more I think about it, the best way forward will be to retain the existing controller and create a new controller that satisfies the new use cases.

My guess is that a new controller needs to support cgroups-v2, which is fine. But can a new controller also support v1? Or is there a requirement that new controllers support *only* v2? I need whatever solution here to work on v1. Added Tejun to hopefully comment on this.

...

The question remains as to what that new controller will be. Does it control reservations only? Is it a combination of reservations and allocations? If a combined controller will work for new use cases, that would be my preference. Of course, I have not prototyped such a controller so there may be issues when we get into the details. For a reservation only or combined controller, the region_* changes proposed by Mina would be used.

Provided we keep the existing controller untouched, should the new controller track:

1. only reservations, or 2. both reservations and allocations for which no reservations exist (such as the MAP_NORESERVE case)?

I like the 'both' approach. Seems to me a counter like that would work automatically regardless of whether the application is allocating hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut into reserved hugetlb pages, correct? If so, then applications that allocate with NORESERVE will get sigbused when they hit their limit, and applications that allocate without NORESERVE may get an error at mmap time but will always be within their limits while they access the mmap'd memory, correct? So the 'both' counter seems like a one size fits all.

I think the only sticking point left is whether an added controller can support both cgroup-v2 and cgroup-v1. If I could get confirmation on that I'll provide a patchset.

...

-- Mike Kravetz

Mike Kravetz

9:59 p.m.

On 9/26/19 5:55 PM, Mina Almasry wrote:

...

Provided we keep the existing controller untouched, should the new controller track:

only reservations, or

both reservations and allocations for which no reservations exist

(such as the MAP_NORESERVE case)?

I like the 'both' approach. Seems to me a counter like that would work automatically regardless of whether the application is allocating hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut into reserved hugetlb pages, correct?

Correct. One other easy way to allocate huge pages without reserves (that I know is used today) is via the fallocate system call.

...

                                  If so, then applications that
allocate with NORESERVE will get sigbused when they hit their limit, and applications that allocate without NORESERVE may get an error at mmap time but will always be within their limits while they access the mmap'd memory, correct?

Correct. At page allocation time we can easily check to see if a reservation exists and not charge. For any specific page within a hugetlbfs file, a charge would happen at mmap time or allocation time.

One exception (that I can think of) to this mmap(RESERVE) will not cause a SIGBUS rule is in the case of hole punch. If someone punches a hole in a file, not only do they remove pages associated with the file but the reservation information as well. Therefore, a subsequent fault will be the same as an allocation without reservation.

I 'think' the code to remove/truncate a file will work corrctly as it is today, but I need to think about this some more.

...

mmap'd memory, correct? So the 'both' counter seems like a one size fits all.

I think the only sticking point left is whether an added controller can support both cgroup-v2 and cgroup-v1. If I could get confirmation on that I'll provide a patchset.

Sorry, but I can not provide cgroup expertise.

-- Mike Kravetz

Mina Almasry

10:51 p.m.

On Fri, Sep 27, 2019 at 2:59 PM Mike Kravetz mike.kravetz@oracle.com wrote:

...

On 9/26/19 5:55 PM, Mina Almasry wrote:

...
Provided we keep the existing controller untouched, should the new controller track:

only reservations, or

both reservations and allocations for which no reservations exist

(such as the MAP_NORESERVE case)?

I like the 'both' approach. Seems to me a counter like that would work automatically regardless of whether the application is allocating hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut into reserved hugetlb pages, correct?

Correct. One other easy way to allocate huge pages without reserves (that I know is used today) is via the fallocate system call.

...
                                  If so, then applications that
allocate with NORESERVE will get sigbused when they hit their limit, and applications that allocate without NORESERVE may get an error at mmap time but will always be within their limits while they access the mmap'd memory, correct?
Correct. At page allocation time we can easily check to see if a reservation exists and not charge. For any specific page within a hugetlbfs file, a charge would happen at mmap time or allocation time.

One exception (that I can think of) to this mmap(RESERVE) will not cause a SIGBUS rule is in the case of hole punch. If someone punches a hole in a file, not only do they remove pages associated with the file but the reservation information as well. Therefore, a subsequent fault will be the same as an allocation without reservation.

I don't think it causes a sigbus. This is the scenario, right:

1. Make cgroup with limit X bytes. 2. Task in cgroup mmaps a file with X bytes, causing the cgroup to get charged 3. A hole of size Y is punched in the file, causing the cgroup to get uncharged Y bytes. 4. The task faults in memory from the hole, getting charged up to Y bytes again. But they will be still within their limits.

IIUC userspace only gets sigbus'd if the limit is lowered between steps 3 and 4, and it's ok if it gets sigbus'd there in my opinion.

...

I 'think' the code to remove/truncate a file will work corrctly as it is today, but I need to think about this some more.

...
mmap'd memory, correct? So the 'both' counter seems like a one size fits all.

I think the only sticking point left is whether an added controller can support both cgroup-v2 and cgroup-v1. If I could get confirmation on that I'll provide a patchset.

Sorry, but I can not provide cgroup expertise.

Mike Kravetz

Mike Kravetz

10:56 p.m.

On 9/27/19 3:51 PM, Mina Almasry wrote:

...

On Fri, Sep 27, 2019 at 2:59 PM Mike Kravetz mike.kravetz@oracle.com wrote:

...
On 9/26/19 5:55 PM, Mina Almasry wrote:

...
Provided we keep the existing controller untouched, should the new controller track:

only reservations, or

both reservations and allocations for which no reservations exist

(such as the MAP_NORESERVE case)?

I like the 'both' approach. Seems to me a counter like that would work automatically regardless of whether the application is allocating hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut into reserved hugetlb pages, correct?

Correct. One other easy way to allocate huge pages without reserves (that I know is used today) is via the fallocate system call.

...
                                  If so, then applications that
allocate with NORESERVE will get sigbused when they hit their limit, and applications that allocate without NORESERVE may get an error at mmap time but will always be within their limits while they access the mmap'd memory, correct?
Correct. At page allocation time we can easily check to see if a reservation exists and not charge. For any specific page within a hugetlbfs file, a charge would happen at mmap time or allocation time.

One exception (that I can think of) to this mmap(RESERVE) will not cause a SIGBUS rule is in the case of hole punch. If someone punches a hole in a file, not only do they remove pages associated with the file but the reservation information as well. Therefore, a subsequent fault will be the same as an allocation without reservation.
I don't think it causes a sigbus. This is the scenario, right:

Make cgroup with limit X bytes.

Task in cgroup mmaps a file with X bytes, causing the cgroup to get charged

A hole of size Y is punched in the file, causing the cgroup to get

uncharged Y bytes. 4. The task faults in memory from the hole, getting charged up to Y bytes again. But they will be still within their limits.

IIUC userspace only gets sigbus'd if the limit is lowered between steps 3 and 4, and it's ok if it gets sigbus'd there in my opinion.

You are correct. That was my mistake. I was still thinking of behavior for a reservation only cgroup model. It would behave as you describe above (no SIGBUS) for a combined reservation/allocate model.

-- Mike Kravetz

Michal Koutný

30 Sep 30 Sep

3:12 p.m.

Hi.

On Thu, Sep 26, 2019 at 05:55:29PM -0700, Mina Almasry almasrymina@google.com wrote:

...

My guess is that a new controller needs to support cgroups-v2, which is fine. But can a new controller also support v1? Or is there a requirement that new controllers support *only* v2? I need whatever solution here to work on v1.

Here is my view of important criteria:

1) stability, v1 APIs and semantics should not be changed, 2) futureproofness, v1 uses should be convertible to v2 uses, 3) maintainability, the less (similar) code the better.

And here is my braindump of some approaches:

A) new controller, v2 only - 1) ok - 2) may be ok - 3) separate v1 and v2 implementations - exclusion must be ensured on hybrid hierarchies

B) new controller, version oblivious (see e.g. pid) - 1) sort of ok - 2) partially ok - 3) two v1 implementations - exclusion must be ensured even on pure v1 hierarchies

C) extending the existing controller, w/out v2 counterpart - 1) ok with workarounds (new option switching behavior) - 2) not ok - 3) likely OK

D) extending the existing controller, with v2 counterpart - 1) ok with workarounds (new option switching behavior, see cpuset) - 2) may be ok - 3) likely OK

AFAIU, the current patchset is variation of C). Personally, I think something like D) could work, I'm not convinced about A) and B) based on the breakdown above. But it may induce some other ideas.

My two cents, Michal

Mina Almasry

11 Oct 11 Oct

7:10 p.m.

On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz mike.kravetz@oracle.com wrote:

...

On 9/19/19 3:24 PM, Mina Almasry wrote:

...
Patch series implements hugetlb_cgroup reservation usage and limits, which track hugetlb reservations rather than hugetlb memory faulted in. Details of the approach is 1/7.

Thanks for your continued efforts Mina.

One thing that has bothered me with this approach from the beginning is that hugetlb reservations are related to, but somewhat distinct from hugetlb allocations. The original (existing) huegtlb cgroup implementation does not take reservations into account. This is an issue you are trying to address by adding a cgroup support for hugetlb reservations. However, this new reservation cgroup ignores hugetlb allocations at fault time.

I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation of hugetlb pages. Both the existing cgroup code and the reservation approach have what I think are some serious flaws. Consider a system with 100 hugetlb pages available. A sysadmin, has two groups A and B and wants to limit hugetlb usage to 50 pages each.

With the existing implementation, a task in group A could create a mmap of 100 pages in size and reserve all 100 pages. Since the pages are 'reserved', nobody in group B can allocate ANY huge pages. This is true even though no pages have been allocated in A (or B).

With the reservation implementation, a task in group A could use MAP_NORESERVE and allocate all 100 pages without taking any reservations.

As mentioned in your documentation, it would be possible to use both the existing (allocation) and new reservation cgroups together. Perhaps if both are setup for the 50/50 split things would work a little better.

However, instead of creating a new reservation crgoup how about adding reservation support to the existing allocation cgroup support. One could even argue that a reservation is an allocation as it sets aside huge pages that can only be used for a specific purpose. Here is something that may work.

Starting with the existing allocation cgroup.

When hugetlb pages are reserved, the cgroup of the task making the reservations is charged. Tracking for the charged cgroup is done in the reservation map in the same way proposed by this patch set.

At page fault time,

If a reservation already exists for that specific area do not charge the faulting task. No tracking in page, just the reservation map.

If no reservation exists, charge the group of the faulting task. Tracking of this information is in the page itself as implemented today.

When the hugetlb object is removed, compare the reservation map with any allocated pages. If cgroup tracking information exists in page, uncharge that group. Otherwise, unharge the group (if any) in the reservation map.

Sorry for the late response here. I've been prototyping the suggestions from this conversation:

1. Supporting cgroup-v2 on the current controller seems trivial. Basically just specifying the dfl files seems to do it, and my tests on top of cgroup-v2 don't see any problems so far at least. In light of this I'm not sure it's best to create a new controller per say. Seems like it would duplicate a lot of code with the current controller, so I've tentatively just stuck to the plan in my current patchset, a new counter on the existing controller.

2. I've been working on transitioning the new counter to the behavior Mike specified in the email I'm responding to. So far I have a flow that works for shared mappings but not private mappings:

- On reservation, charge the new counter and store the info in the resv_map. The counter gets uncharged when the resv_map entry gets removed (works fine). - On alloc_huge_page(), check if there is a reservation for the page being allocated. If not, charge the new counter and store the information in resv_map. The counter still gets uncharged when the resv_map entry gets removed.

The above works for all shared mappings and reserved private mappings, but I' having trouble supporting private NORESERVE mappings. Charging can work the same as for shared mappings: charge the new counter on reservation and on allocations that do not have a reservation. But the question still comes up: where to store the counter to uncharge this page? I thought of a couple of things that don't seem to work:

1. I thought of putting the counter in resv_map->reservation_counter, so that it gets uncharged on vm_op_close. But, private NORESERVE mappings don't even have a resv_map allocated for them.

2. I thought of detecting on free_huge_page that the page being freed belonged to a private NORESERVE mapping, and uncharging the hugetlb_cgroup in the page itself, but free_huge_page only gets a struct page* and I can't seem to find a way to detect that that the page comes from a private NORESERVE mapping from only the struct page*.

Mike, note your suggestion above to check if the page hugetlb_cgroup is null doesn't work if we want to keep the current counter working the same: the page will always have a hugetlb_cgroup that points that contains the old counter. Any ideas how to apply this new counter behavior to a private NORESERVE mappings? Is there maybe a flag I can set on the pages at allocation time that I can read on free time to know whether to uncharge the hugetlb_cgroup or not?

Mina Almasry

8:41 p.m.

On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry almasrymina@google.com wrote:

...

On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz mike.kravetz@oracle.com wrote:

...
On 9/19/19 3:24 PM, Mina Almasry wrote:

...
Patch series implements hugetlb_cgroup reservation usage and limits, which track hugetlb reservations rather than hugetlb memory faulted in. Details of the approach is 1/7.

Thanks for your continued efforts Mina.

One thing that has bothered me with this approach from the beginning is that hugetlb reservations are related to, but somewhat distinct from hugetlb allocations. The original (existing) huegtlb cgroup implementation does not take reservations into account. This is an issue you are trying to address by adding a cgroup support for hugetlb reservations. However, this new reservation cgroup ignores hugetlb allocations at fault time.

I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation of hugetlb pages. Both the existing cgroup code and the reservation approach have what I think are some serious flaws. Consider a system with 100 hugetlb pages available. A sysadmin, has two groups A and B and wants to limit hugetlb usage to 50 pages each.

With the existing implementation, a task in group A could create a mmap of 100 pages in size and reserve all 100 pages. Since the pages are 'reserved', nobody in group B can allocate ANY huge pages. This is true even though no pages have been allocated in A (or B).

With the reservation implementation, a task in group A could use MAP_NORESERVE and allocate all 100 pages without taking any reservations.

As mentioned in your documentation, it would be possible to use both the existing (allocation) and new reservation cgroups together. Perhaps if both are setup for the 50/50 split things would work a little better.

However, instead of creating a new reservation crgoup how about adding reservation support to the existing allocation cgroup support. One could even argue that a reservation is an allocation as it sets aside huge pages that can only be used for a specific purpose. Here is something that may work.

Starting with the existing allocation cgroup.

When hugetlb pages are reserved, the cgroup of the task making the reservations is charged. Tracking for the charged cgroup is done in the reservation map in the same way proposed by this patch set.

At page fault time,

If a reservation already exists for that specific area do not charge the faulting task. No tracking in page, just the reservation map.

If no reservation exists, charge the group of the faulting task. Tracking of this information is in the page itself as implemented today.

When the hugetlb object is removed, compare the reservation map with any allocated pages. If cgroup tracking information exists in page, uncharge that group. Otherwise, unharge the group (if any) in the reservation map.

Sorry for the late response here. I've been prototyping the suggestions from this conversation:

Supporting cgroup-v2 on the current controller seems trivial.

Basically just specifying the dfl files seems to do it, and my tests on top of cgroup-v2 don't see any problems so far at least. In light of this I'm not sure it's best to create a new controller per say. Seems like it would duplicate a lot of code with the current controller, so I've tentatively just stuck to the plan in my current patchset, a new counter on the existing controller.

I've been working on transitioning the new counter to the behavior

Mike specified in the email I'm responding to. So far I have a flow that works for shared mappings but not private mappings:

On reservation, charge the new counter and store the info in the

resv_map. The counter gets uncharged when the resv_map entry gets removed (works fine).

On alloc_huge_page(), check if there is a reservation for the page

being allocated. If not, charge the new counter and store the information in resv_map. The counter still gets uncharged when the resv_map entry gets removed.

The above works for all shared mappings and reserved private mappings, but I' having trouble supporting private NORESERVE mappings. Charging can work the same as for shared mappings: charge the new counter on reservation and on allocations that do not have a reservation. But the question still comes up: where to store the counter to uncharge this page? I thought of a couple of things that don't seem to work:

I thought of putting the counter in resv_map->reservation_counter,

so that it gets uncharged on vm_op_close. But, private NORESERVE mappings don't even have a resv_map allocated for them.

I thought of detecting on free_huge_page that the page being freed

belonged to a private NORESERVE mapping, and uncharging the hugetlb_cgroup in the page itself, but free_huge_page only gets a struct page* and I can't seem to find a way to detect that that the page comes from a private NORESERVE mapping from only the struct page*.

Mike, note your suggestion above to check if the page hugetlb_cgroup is null doesn't work if we want to keep the current counter working the same: the page will always have a hugetlb_cgroup that points that contains the old counter. Any ideas how to apply this new counter behavior to a private NORESERVE mappings? Is there maybe a flag I can set on the pages at allocation time that I can read on free time to know whether to uncharge the hugetlb_cgroup or not?

Reading the code and asking around a bit, it seems the pointer to the hugetlb_cgroup is in page[2].private. Is it reasonable to use page[3].private to store the hugetlb_cgroup to uncharge for the new counter and increment HUGETLB_CGROUP_MIN_ORDER to 3? I think that would solve my problem. When allocating a private NORESERVE page, set page[3].private to the hugetlb_cgroup to uncharge, then on free_huge_page, check page[3].private, if it is non-NULL, uncharge the new counter on it.

Mike Kravetz

14 Oct 14 Oct

5:33 p.m.

On 10/11/19 1:41 PM, Mina Almasry wrote:

...

On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry almasrymina@google.com wrote:

...
On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz mike.kravetz@oracle.com wrote:

...
On 9/19/19 3:24 PM, Mina Almasry wrote:

Mike, note your suggestion above to check if the page hugetlb_cgroup is null doesn't work if we want to keep the current counter working the same: the page will always have a hugetlb_cgroup that points that contains the old counter. Any ideas how to apply this new counter behavior to a private NORESERVE mappings? Is there maybe a flag I can set on the pages at allocation time that I can read on free time to know whether to uncharge the hugetlb_cgroup or not?

Reading the code and asking around a bit, it seems the pointer to the hugetlb_cgroup is in page[2].private. Is it reasonable to use page[3].private to store the hugetlb_cgroup to uncharge for the new counter and increment HUGETLB_CGROUP_MIN_ORDER to 3? I think that would solve my problem. When allocating a private NORESERVE page, set page[3].private to the hugetlb_cgroup to uncharge, then on free_huge_page, check page[3].private, if it is non-NULL, uncharge the new counter on it.

Sorry for not responding sooner. This approach should work, and it looks like you have a v6 of the series. I'll take a look.

-- Mike Kravetz

Mina Almasry

6:01 p.m.

On Mon, Oct 14, 2019 at 10:33 AM Mike Kravetz mike.kravetz@oracle.com wrote:

...

On 10/11/19 1:41 PM, Mina Almasry wrote:

...
On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry almasrymina@google.com wrote:

...
On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz mike.kravetz@oracle.com wrote:

...
On 9/19/19 3:24 PM, Mina Almasry wrote:

Mike, note your suggestion above to check if the page hugetlb_cgroup is null doesn't work if we want to keep the current counter working the same: the page will always have a hugetlb_cgroup that points that contains the old counter. Any ideas how to apply this new counter behavior to a private NORESERVE mappings? Is there maybe a flag I can set on the pages at allocation time that I can read on free time to know whether to uncharge the hugetlb_cgroup or not?

Reading the code and asking around a bit, it seems the pointer to the hugetlb_cgroup is in page[2].private. Is it reasonable to use page[3].private to store the hugetlb_cgroup to uncharge for the new counter and increment HUGETLB_CGROUP_MIN_ORDER to 3? I think that would solve my problem. When allocating a private NORESERVE page, set page[3].private to the hugetlb_cgroup to uncharge, then on free_huge_page, check page[3].private, if it is non-NULL, uncharge the new counter on it.

Sorry for not responding sooner. This approach should work, and it looks like you have a v6 of the series. I'll take a look.

Great! Thanks! That's the approach I went with in v6.

...

-- Mike Kravetz

2199

days inactive

2224

days old

linux-kselftest-mirror@lists.linaro.org

24 comments

participants

tags (0)

participants (4)

David Rientjes
Michal Koutný
Mike Kravetz
Mina Almasry