These counters will track hugetlb reservations rather than hugetlb memory faulted in. This patch only adds the counter, following patches add the charging and uncharging of the counter.
Problem: Currently tasks attempting to allocate more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to allocate hugetlb memory only more than its hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault the memory in.
We have developers interested in using hugetlb_cgroups, and they have expressed dissatisfaction regarding this behavior. We'd like to improve this behavior such that tasks violating the hugetlb_cgroup limits get an error on mmap/shmget time, rather than getting SIGBUS'd when they try to fault the excess memory in.
The underlying problem is that today's hugetlb_cgroup accounting happens at hugetlb memory *fault* time, rather than at *reservation* time. Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and the offending task gets SIGBUS'd.
Proposed Solution: A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This counter has slightly different semantics than hugetlb.xMB.[limit|usage]_in_bytes:
- While usage_in_bytes tracks all *faulted* hugetlb memory, reservation_usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb memory faulted in without a prior reservation.
- If a task attempts to reserve more memory than limit_in_bytes allows, the kernel will allow it to do so. But if a task attempts to reserve more memory than reservation_limit_in_bytes, the kernel will fail this reservation.
This proposal is implemented in this patch series, with tests to verify functionality and show the usage. We also added cgroup-v2 support to hugetlb_cgroup so that the new use cases can be extended to v2.
Alternatives considered: 1. A new cgroup, instead of only a new page_counter attached to the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code duplication with hugetlb_cgroup. Keeping hugetlb related page counters under hugetlb_cgroup seemed cleaner as well.
2. Instead of adding a new counter, we considered adding a sysctl that modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at reservation time rather than fault time. Adding a new page_counter seems better as userspace could, if it wants, choose to enforce different cgroups differently: one via limit_in_bytes, and another via reservation_limit_in_bytes. This could be very useful if you're transitioning how hugetlb memory is partitioned on your system one cgroup at a time, for example. Also, someone may find usage for both limit_in_bytes and reservation_limit_in_bytes concurrently, and this approach gives them the option to do so.
Testing: - Added tests passing. - libhugetlbfs tests mostly passing, but some tests have trouble with and without this patch series. Seems environment issue rather than code: - Overall results: ********** TEST SUMMARY * 2M * 32-bit 64-bit * Total testcases: 84 0 * Skipped: 0 0 * PASS: 66 0 * FAIL: 14 0 * Killed by signal: 0 0 * Bad configuration: 4 0 * Expected FAIL: 0 0 * Unexpected PASS: 0 0 * Test not present: 0 0 * Strange test result: 0 0 ********** - Failing tests: - elflink_rw_and_share_test("linkhuge_rw") segfaults with and without this patch series. - LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc (2M: 32): FAIL Address is not hugepage - LD_PRELOAD=libhugetlbfs.so HUGETLB_RESTRICT_EXE=unknown:malloc HUGETLB_MORECORE=yes malloc (2M: 32): FAIL Address is not hugepage - LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc_manysmall (2M: 32): FAIL Address is not hugepage - GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes heapshrink (2M: 32): FAIL Heap not on hugepages - GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so libheapshrink.so HUGETLB_MORECORE=yes heapshrink (2M: 32): FAIL Heap not on hugepages - HUGETLB_ELFMAP=RW linkhuge_rw (2M: 32): FAIL small_data is not hugepage - HUGETLB_ELFMAP=RW HUGETLB_MINIMAL_COPY=no linkhuge_rw (2M: 32): FAIL small_data is not hugepage - alloc-instantiate-race shared (2M: 32): Bad configuration: sched_setaffinity(cpu1): Invalid argument - FAIL Child 1 killed by signal Killed - shmoverride_linked (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument - HUGETLB_SHM=yes shmoverride_linked (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument - shmoverride_linked_static (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument - HUGETLB_SHM=yes shmoverride_linked_static (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument - LD_PRELOAD=libhugetlbfs.so shmoverride_unlinked (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument - LD_PRELOAD=libhugetlbfs.so HUGETLB_SHM=yes shmoverride_unlinked (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument
[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
Signed-off-by: Mina Almasry almasrymina@google.com Acked-by: Hillf Danton hdanton@sina.com
--- include/linux/hugetlb.h | 23 ++++++++- mm/hugetlb_cgroup.c | 111 ++++++++++++++++++++++++++++++---------- 2 files changed, 107 insertions(+), 27 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 53fc34f930d08..9c49a0ba894d3 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -320,6 +320,27 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
#ifdef CONFIG_HUGETLB_PAGE
+enum { + /* Tracks hugetlb memory faulted in. */ + HUGETLB_RES_USAGE, + /* Tracks hugetlb memory reserved. */ + HUGETLB_RES_RESERVATION_USAGE, + /* Limit for hugetlb memory faulted in. */ + HUGETLB_RES_LIMIT, + /* Limit for hugetlb memory reserved. */ + HUGETLB_RES_RESERVATION_LIMIT, + /* Max usage for hugetlb memory faulted in. */ + HUGETLB_RES_MAX_USAGE, + /* Max usage for hugetlb memory reserved. */ + HUGETLB_RES_RESERVATION_MAX_USAGE, + /* Faulted memory accounting fail count. */ + HUGETLB_RES_FAILCNT, + /* Reserved memory accounting fail count. */ + HUGETLB_RES_RESERVATION_FAILCNT, + HUGETLB_RES_NULL, + HUGETLB_RES_MAX, +}; + #define HSTATE_NAME_LEN 32 /* Defines one hugetlb page size */ struct hstate { @@ -340,7 +361,7 @@ struct hstate { unsigned int surplus_huge_pages_node[MAX_NUMNODES]; #ifdef CONFIG_CGROUP_HUGETLB /* cgroup control files */ - struct cftype cgroup_files[5]; + struct cftype cgroup_files[HUGETLB_RES_MAX]; #endif char name[HSTATE_NAME_LEN]; }; diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index f1930fa0b445d..1ed4448ca41d3 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -25,6 +25,10 @@ struct hugetlb_cgroup { * the counter to account for hugepages from hugetlb. */ struct page_counter hugepage[HUGE_MAX_HSTATE]; + /* + * the counter to account for hugepage reservations from hugetlb. + */ + struct page_counter reserved_hugepage[HUGE_MAX_HSTATE]; };
#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) @@ -33,6 +37,14 @@ struct hugetlb_cgroup {
static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
+static inline struct page_counter * +hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx, bool reserved) +{ + if (reserved) + return &h_cg->reserved_hugepage[idx]; + return &h_cg->hugepage[idx]; +} + static inline struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s) { @@ -254,30 +266,33 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, return; }
-enum { - RES_USAGE, - RES_LIMIT, - RES_MAX_USAGE, - RES_FAILCNT, -}; - static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css, struct cftype *cft) { struct page_counter *counter; + struct page_counter *reserved_counter; struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)]; + reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];
switch (MEMFILE_ATTR(cft->private)) { - case RES_USAGE: + case HUGETLB_RES_USAGE: return (u64)page_counter_read(counter) * PAGE_SIZE; - case RES_LIMIT: + case HUGETLB_RES_RESERVATION_USAGE: + return (u64)page_counter_read(reserved_counter) * PAGE_SIZE; + case HUGETLB_RES_LIMIT: return (u64)counter->max * PAGE_SIZE; - case RES_MAX_USAGE: + case HUGETLB_RES_RESERVATION_LIMIT: + return (u64)reserved_counter->max * PAGE_SIZE; + case HUGETLB_RES_MAX_USAGE: return (u64)counter->watermark * PAGE_SIZE; - case RES_FAILCNT: + case HUGETLB_RES_RESERVATION_MAX_USAGE: + return (u64)reserved_counter->watermark * PAGE_SIZE; + case HUGETLB_RES_FAILCNT: return counter->failcnt; + case HUGETLB_RES_RESERVATION_FAILCNT: + return reserved_counter->failcnt; default: BUG(); } @@ -291,6 +306,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, int ret, idx; unsigned long nr_pages; struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); + bool reserved = false;
if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */ return -EINVAL; @@ -304,9 +320,14 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));
switch (MEMFILE_ATTR(of_cft(of)->private)) { - case RES_LIMIT: + case HUGETLB_RES_RESERVATION_LIMIT: + reserved = true; + /* Fall through. */ + case HUGETLB_RES_LIMIT: mutex_lock(&hugetlb_limit_mutex); - ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages); + ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx, + reserved), + nr_pages); mutex_unlock(&hugetlb_limit_mutex); break; default: @@ -320,18 +341,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { int ret = 0; - struct page_counter *counter; + struct page_counter *counter, *reserved_counter; struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)]; + reserved_counter = + &h_cg->reserved_hugepage[MEMFILE_IDX(of_cft(of)->private)];
switch (MEMFILE_ATTR(of_cft(of)->private)) { - case RES_MAX_USAGE: + case HUGETLB_RES_MAX_USAGE: page_counter_reset_watermark(counter); break; - case RES_FAILCNT: + case HUGETLB_RES_RESERVATION_MAX_USAGE: + page_counter_reset_watermark(reserved_counter); + break; + case HUGETLB_RES_FAILCNT: counter->failcnt = 0; break; + case HUGETLB_RES_RESERVATION_FAILCNT: + reserved_counter->failcnt = 0; + break; default: ret = -EINVAL; break; @@ -357,37 +386,67 @@ static void __init __hugetlb_cgroup_file_init(int idx) struct hstate *h = &hstates[idx];
/* format the size */ - mem_fmt(buf, 32, huge_page_size(h)); + mem_fmt(buf, sizeof(buf), huge_page_size(h));
/* Add the limit file */ - cft = &h->cgroup_files[0]; + cft = &h->cgroup_files[HUGETLB_RES_LIMIT]; snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_LIMIT); + cft->read_u64 = hugetlb_cgroup_read_u64; + cft->write = hugetlb_cgroup_write; + + /* Add the reservation limit file */ + cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_LIMIT]; + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes", + buf); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_LIMIT); cft->read_u64 = hugetlb_cgroup_read_u64; cft->write = hugetlb_cgroup_write;
/* Add the usage file */ - cft = &h->cgroup_files[1]; + cft = &h->cgroup_files[HUGETLB_RES_USAGE]; snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_USAGE); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_USAGE); + cft->read_u64 = hugetlb_cgroup_read_u64; + + /* Add the reservation usage file */ + cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_USAGE]; + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes", + buf); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_USAGE); cft->read_u64 = hugetlb_cgroup_read_u64;
/* Add the MAX usage file */ - cft = &h->cgroup_files[2]; + cft = &h->cgroup_files[HUGETLB_RES_MAX_USAGE]; snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_MAX_USAGE); + cft->write = hugetlb_cgroup_reset; + cft->read_u64 = hugetlb_cgroup_read_u64; + + /* Add the MAX reservation usage file */ + cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_MAX_USAGE]; + snprintf(cft->name, MAX_CFTYPE_NAME, + "%s.reservation_max_usage_in_bytes", buf); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_MAX_USAGE); cft->write = hugetlb_cgroup_reset; cft->read_u64 = hugetlb_cgroup_read_u64;
/* Add the failcntfile */ - cft = &h->cgroup_files[3]; + cft = &h->cgroup_files[HUGETLB_RES_FAILCNT]; snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf); - cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_FAILCNT); + cft->write = hugetlb_cgroup_reset; + cft->read_u64 = hugetlb_cgroup_read_u64; + + /* Add the reservation failcntfile */ + cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_FAILCNT]; + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf); + cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_FAILCNT); cft->write = hugetlb_cgroup_reset; cft->read_u64 = hugetlb_cgroup_read_u64;
/* NULL terminate the last cft */ - cft = &h->cgroup_files[4]; + cft = &h->cgroup_files[HUGETLB_RES_NULL]; memset(cft, 0, sizeof(*cft));
WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys, -- 2.24.0.rc1.363.gb1bccd3e3d-goog
Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage or hugetlb reservation counter.
Adds a new interface to uncharge a hugetlb_cgroup counter via hugetlb_cgroup_uncharge_counter.
Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init, hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
Signed-off-by: Mina Almasry almasrymina@google.com
--- include/linux/hugetlb_cgroup.h | 67 +++++++++++++--------- mm/hugetlb.c | 17 +++--- mm/hugetlb_cgroup.c | 100 +++++++++++++++++++++++++-------- 3 files changed, 130 insertions(+), 54 deletions(-)
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 063962f6dfc6a..1bb58a63af586 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -22,27 +22,35 @@ struct hugetlb_cgroup; * Minimum page order trackable by hugetlb cgroup. * At least 3 pages are necessary for all the tracking information. */ -#define HUGETLB_CGROUP_MIN_ORDER 2 +#define HUGETLB_CGROUP_MIN_ORDER 3
#ifdef CONFIG_CGROUP_HUGETLB
-static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page, + bool reserved) { VM_BUG_ON_PAGE(!PageHuge(page), page);
if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) return NULL; - return (struct hugetlb_cgroup *)page[2].private; + if (reserved) + return (struct hugetlb_cgroup *)page[3].private; + else + return (struct hugetlb_cgroup *)page[2].private; }
-static inline -int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg) +static inline int set_hugetlb_cgroup(struct page *page, + struct hugetlb_cgroup *h_cg, + bool reservation) { VM_BUG_ON_PAGE(!PageHuge(page), page);
if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) return -1; - page[2].private = (unsigned long)h_cg; + if (reservation) + page[3].private = (unsigned long)h_cg; + else + page[2].private = (unsigned long)h_cg; return 0; }
@@ -52,26 +60,33 @@ static inline bool hugetlb_cgroup_disabled(void) }
extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr); + struct hugetlb_cgroup **ptr, + bool reserved); extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg, - struct page *page); + struct page *page, bool reserved); extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, - struct page *page); + struct page *page, bool reserved); + extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg); + struct hugetlb_cgroup *h_cg, + bool reserved); +extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p, + unsigned long nr_pages); + extern void hugetlb_cgroup_file_init(void) __init; extern void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage);
#else -static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page, + bool reserved) { return NULL; }
-static inline -int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg) +static inline int set_hugetlb_cgroup(struct page *page, + struct hugetlb_cgroup *h_cg, bool reserved) { return 0; } @@ -81,28 +96,30 @@ static inline bool hugetlb_cgroup_disabled(void) return true; }
-static inline int -hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr) +static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, + struct hugetlb_cgroup **ptr, + bool reserved) { return 0; }
-static inline void -hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg, - struct page *page) +static inline void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, + struct hugetlb_cgroup *h_cg, + struct page *page, + bool reserved) { }
-static inline void -hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page) +static inline void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, + struct page *page, + bool reserved) { }
-static inline void -hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg) +static inline void hugetlb_cgroup_uncharge_cgroup(int idx, + unsigned long nr_pages, + struct hugetlb_cgroup *h_cg, + bool reserved) { }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 37195cd60a345..325d5454bf168 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1140,7 +1140,7 @@ static void update_and_free_page(struct hstate *h, struct page *page) 1 << PG_active | 1 << PG_private | 1 << PG_writeback); } - VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page); + VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page, false), page); set_compound_page_dtor(page, NULL_COMPOUND_DTOR); set_page_refcounted(page); if (hstate_is_gigantic(h)) { @@ -1250,8 +1250,8 @@ void free_huge_page(struct page *page)
spin_lock(&hugetlb_lock); clear_page_huge_active(page); - hugetlb_cgroup_uncharge_page(hstate_index(h), - pages_per_huge_page(h), page); + hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h), + page, false); if (restore_reserve) h->resv_huge_pages++;
@@ -1277,7 +1277,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) INIT_LIST_HEAD(&page->lru); set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); spin_lock(&hugetlb_lock); - set_hugetlb_cgroup(page, NULL); + set_hugetlb_cgroup(page, NULL, false); h->nr_huge_pages++; h->nr_huge_pages_node[nid]++; spin_unlock(&hugetlb_lock); @@ -2063,7 +2063,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, gbl_chg = 1; }
- ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg); + ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg, + false); if (ret) goto out_subpool_put;
@@ -2087,7 +2088,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, list_move(&page->lru, &h->hugepage_activelist); /* Fall through */ } - hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page); + hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page, + false); spin_unlock(&hugetlb_lock);
set_page_private(page, (unsigned long)spool); @@ -2111,7 +2113,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, return page;
out_uncharge_cgroup: - hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg); + hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg, + false); out_subpool_put: if (map_chg || avoid_reserve) hugepage_subpool_put_pages(spool, 1); diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 1ed4448ca41d3..854117513979b 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -73,8 +73,12 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg) int idx;
for (idx = 0; idx < hugetlb_max_hstate; idx++) { - if (page_counter_read(&h_cg->hugepage[idx])) + if (page_counter_read( + hugetlb_cgroup_get_counter(h_cg, idx, true)) || + page_counter_read( + hugetlb_cgroup_get_counter(h_cg, idx, false))) { return true; + } } return false; } @@ -85,18 +89,32 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, int idx;
for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) { - struct page_counter *counter = &h_cgroup->hugepage[idx]; struct page_counter *parent = NULL; + struct page_counter *reserved_parent = NULL; unsigned long limit; int ret;
- if (parent_h_cgroup) - parent = &parent_h_cgroup->hugepage[idx]; - page_counter_init(counter, parent); + if (parent_h_cgroup) { + parent = hugetlb_cgroup_get_counter(parent_h_cgroup, + idx, false); + reserved_parent = hugetlb_cgroup_get_counter( + parent_h_cgroup, idx, true); + } + page_counter_init(hugetlb_cgroup_get_counter(h_cgroup, idx, + false), + parent); + page_counter_init(hugetlb_cgroup_get_counter(h_cgroup, idx, + true), + reserved_parent);
limit = round_down(PAGE_COUNTER_MAX, 1 << huge_page_order(&hstates[idx])); - ret = page_counter_set_max(counter, limit); + + ret = page_counter_set_max( + hugetlb_cgroup_get_counter(h_cgroup, idx, false), + limit); + ret = page_counter_set_max( + hugetlb_cgroup_get_counter(h_cgroup, idx, true), limit); VM_BUG_ON(ret); } } @@ -126,6 +144,26 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css) kfree(h_cgroup); }
+static void hugetlb_cgroup_move_parent_reservation(int idx, + struct hugetlb_cgroup *h_cg) +{ + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); + + /* Move the reservation counters. */ + if (!parent_hugetlb_cgroup(h_cg)) { + parent = root_h_cgroup; + /* root has no limit */ + page_counter_charge( + &root_h_cgroup->reserved_hugepage[idx], + page_counter_read( + hugetlb_cgroup_get_counter(h_cg, idx, true))); + } + + /* Take the pages off the local counter */ + page_counter_cancel( + hugetlb_cgroup_get_counter(h_cg, idx, true), + page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx, true))); +}
/* * Should be called with hugetlb_lock held. @@ -142,7 +180,7 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg, struct hugetlb_cgroup *page_hcg; struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
- page_hcg = hugetlb_cgroup_from_page(page); + page_hcg = hugetlb_cgroup_from_page(page, false); /* * We can have pages in active list without any cgroup * ie, hugepage with less than 3 pages. We can safely @@ -161,7 +199,7 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg, /* Take the pages off the local counter */ page_counter_cancel(counter, nr_pages);
- set_hugetlb_cgroup(page, parent); + set_hugetlb_cgroup(page, parent, false); out: return; } @@ -180,6 +218,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css) do { for_each_hstate(h) { spin_lock(&hugetlb_lock); + hugetlb_cgroup_move_parent_reservation(idx, h_cg); list_for_each_entry(page, &h->hugepage_activelist, lru) hugetlb_cgroup_move_parent(idx, h_cg, page);
@@ -191,7 +230,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css) }
int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr) + struct hugetlb_cgroup **ptr, bool reserved) { int ret = 0; struct page_counter *counter; @@ -214,8 +253,11 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, } rcu_read_unlock();
- if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter)) + if (!page_counter_try_charge(hugetlb_cgroup_get_counter(h_cg, idx, + reserved), + nr_pages, &counter)) { ret = -ENOMEM; + } css_put(&h_cg->css); done: *ptr = h_cg; @@ -225,12 +267,12 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, /* Should be called with hugetlb_lock held */ void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg, - struct page *page) + struct page *page, bool reserved) { if (hugetlb_cgroup_disabled() || !h_cg) return;
- set_hugetlb_cgroup(page, h_cg); + set_hugetlb_cgroup(page, h_cg, reserved); return; }
@@ -238,23 +280,26 @@ void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, * Should be called with hugetlb_lock held */ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, - struct page *page) + struct page *page, bool reserved) { struct hugetlb_cgroup *h_cg;
if (hugetlb_cgroup_disabled()) return; lockdep_assert_held(&hugetlb_lock); - h_cg = hugetlb_cgroup_from_page(page); + h_cg = hugetlb_cgroup_from_page(page, reserved); if (unlikely(!h_cg)) return; - set_hugetlb_cgroup(page, NULL); - page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); + set_hugetlb_cgroup(page, NULL, reserved); + + page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, reserved), + nr_pages); + return; }
void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg) + struct hugetlb_cgroup *h_cg, bool reserved) { if (hugetlb_cgroup_disabled() || !h_cg) return; @@ -262,8 +307,17 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER) return;
- page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); - return; + page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, reserved), + nr_pages); +} + +void hugetlb_cgroup_uncharge_counter(struct page_counter *p, + unsigned long nr_pages) +{ + if (hugetlb_cgroup_disabled() || !p) + return; + + page_counter_uncharge(p, nr_pages); }
static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css, @@ -475,6 +529,7 @@ void __init hugetlb_cgroup_file_init(void) void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage) { struct hugetlb_cgroup *h_cg; + struct hugetlb_cgroup *h_cg_reservation; struct hstate *h = page_hstate(oldhpage);
if (hugetlb_cgroup_disabled()) @@ -482,11 +537,12 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
VM_BUG_ON_PAGE(!PageHuge(oldhpage), oldhpage); spin_lock(&hugetlb_lock); - h_cg = hugetlb_cgroup_from_page(oldhpage); - set_hugetlb_cgroup(oldhpage, NULL); + h_cg = hugetlb_cgroup_from_page(oldhpage, false); + h_cg_reservation = hugetlb_cgroup_from_page(oldhpage, true); + set_hugetlb_cgroup(oldhpage, NULL, false);
/* move the h_cg details to new cgroup */ - set_hugetlb_cgroup(newhpage, h_cg); + set_hugetlb_cgroup(newhpage, h_cg_reservation, true); list_move(&newhpage->lru, &h->hugepage_activelist); spin_unlock(&hugetlb_lock); return; -- 2.24.0.rc1.363.gb1bccd3e3d-goog
Signed-off-by: Mina Almasry almasrymina@google.com
--- mm/hugetlb_cgroup.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 854117513979b..ac1500205faf7 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -503,8 +503,13 @@ static void __init __hugetlb_cgroup_file_init(int idx) cft = &h->cgroup_files[HUGETLB_RES_NULL]; memset(cft, 0, sizeof(*cft));
- WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys, - h->cgroup_files)); + if (cgroup_subsys_on_dfl(hugetlb_cgrp_subsys)) { + WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys, + h->cgroup_files)); + } else { + WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys, + h->cgroup_files)); + } }
void __init hugetlb_cgroup_file_init(void) @@ -548,8 +553,14 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage) return; }
+static struct cftype hugetlb_files[] = { + {} /* terminate */ +}; + struct cgroup_subsys hugetlb_cgrp_subsys = { .css_alloc = hugetlb_cgroup_css_alloc, .css_offline = hugetlb_cgroup_css_offline, .css_free = hugetlb_cgroup_css_free, + .dfl_cftypes = hugetlb_files, + .legacy_cftypes = hugetlb_files, }; -- 2.24.0.rc1.363.gb1bccd3e3d-goog
Normally the pointer to the cgroup to uncharge hangs off the struct page, and gets queried when it's time to free the page. With hugetlb_cgroup reservations, this is not possible. Because it's possible for a page to be reserved by one task and actually faulted in by another task.
The best place to put the hugetlb_cgroup pointer to uncharge for reservations is in the resv_map. But, because the resv_map has different semantics for private and shared mappings, the code patch to charge/uncharge shared and private mappings is different. This patch implements charging and uncharging for private mappings.
For private mappings, the counter to uncharge is in resv_map->reservation_counter. On initializing the resv_map this is set to NULL. On reservation of a region in private mapping, the tasks hugetlb_cgroup is charged and the hugetlb_cgroup is placed is resv_map->reservation_counter.
On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.
Signed-off-by: Mina Almasry almasrymina@google.com Acked-by: Hillf Danton hdanton@sina.com
--- include/linux/hugetlb.h | 8 +++++++ include/linux/hugetlb_cgroup.h | 11 +++++++++ mm/hugetlb.c | 44 +++++++++++++++++++++++++++++++++- mm/hugetlb_cgroup.c | 12 ---------- 4 files changed, 62 insertions(+), 13 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 9c49a0ba894d3..36dcda7be4b0e 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -46,6 +46,14 @@ struct resv_map { long adds_in_progress; struct list_head region_cache; long region_cache_count; +#ifdef CONFIG_CGROUP_HUGETLB + /* + * On private mappings, the counter to uncharge reservations is stored + * here. If these fields are 0, then the mapping is shared. + */ + struct page_counter *reservation_counter; + unsigned long pages_per_hpage; +#endif }; extern struct resv_map *resv_map_alloc(void); void resv_map_release(struct kref *ref); diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 1bb58a63af586..f6e3d74a02536 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -25,6 +25,17 @@ struct hugetlb_cgroup; #define HUGETLB_CGROUP_MIN_ORDER 3
#ifdef CONFIG_CGROUP_HUGETLB +struct hugetlb_cgroup { + struct cgroup_subsys_state css; + /* + * the counter to account for hugepages from hugetlb. + */ + struct page_counter hugepage[HUGE_MAX_HSTATE]; + /* + * the counter to account for hugepage reservations from hugetlb. + */ + struct page_counter reserved_hugepage[HUGE_MAX_HSTATE]; +};
static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page, bool reserved) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 325d5454bf168..8d8aa89a9928e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -665,6 +665,16 @@ struct resv_map *resv_map_alloc(void) INIT_LIST_HEAD(&resv_map->regions);
resv_map->adds_in_progress = 0; +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Initialize these to 0. On shared mappings, 0's here indicate these + * fields don't do cgroup accounting. On private mappings, these will be + * re-initialized to the proper values, to indicate that hugetlb cgroup + * reservations are to be un-charged from here. + */ + resv_map->reservation_counter = NULL; + resv_map->pages_per_hpage = 0; +#endif
INIT_LIST_HEAD(&resv_map->region_cache); list_add(&rg->link, &resv_map->region_cache); @@ -3216,7 +3226,18 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
reserve = (end - start) - region_count(resv, start, end);
- kref_put(&resv->refs, resv_map_release); +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Since we check for HPAGE_RESV_OWNER above, this must a private + * mapping, and these values should be none-zero, and should point to + * the hugetlb_cgroup counter to uncharge for this reservation. + */ + WARN_ON(!resv->reservation_counter); + WARN_ON(!resv->pages_per_hpage); + + hugetlb_cgroup_uncharge_counter(resv->reservation_counter, + (end - start) * resv->pages_per_hpage); +#endif
if (reserve) { /* @@ -3226,6 +3247,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma) gbl_reserve = hugepage_subpool_put_pages(spool, reserve); hugetlb_acct_memory(h, -gbl_reserve); } + + kref_put(&resv->refs, resv_map_release); }
static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr) @@ -4559,6 +4582,7 @@ int hugetlb_reserve_pages(struct inode *inode, struct hstate *h = hstate_inode(inode); struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; + struct hugetlb_cgroup *h_cg; long gbl_reserve;
/* This should never happen */ @@ -4592,12 +4616,30 @@ int hugetlb_reserve_pages(struct inode *inode, chg = region_chg(resv_map, from, to);
} else { + /* Private mapping. */ resv_map = resv_map_alloc(); if (!resv_map) return -ENOMEM;
chg = to - from;
+ if (hugetlb_cgroup_charge_cgroup(hstate_index(h), + chg * pages_per_huge_page(h), + &h_cg, true)) { + kref_put(&resv_map->refs, resv_map_release); + return -ENOMEM; + } + +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Since this branch handles private mappings, we attach the + * counter to uncharge for this reservation off resv_map. + */ + resv_map->reservation_counter = + &h_cg->reserved_hugepage[hstate_index(h)]; + resv_map->pages_per_hpage = pages_per_huge_page(h); +#endif + set_vma_resv_map(vma, resv_map); set_vma_resv_flags(vma, HPAGE_RESV_OWNER); } diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index ac1500205faf7..a0ca4024888e1 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -19,18 +19,6 @@ #include <linux/hugetlb.h> #include <linux/hugetlb_cgroup.h>
-struct hugetlb_cgroup { - struct cgroup_subsys_state css; - /* - * the counter to account for hugepages from hugetlb. - */ - struct page_counter hugepage[HUGE_MAX_HSTATE]; - /* - * the counter to account for hugepage reservations from hugetlb. - */ - struct page_counter reserved_hugepage[HUGE_MAX_HSTATE]; -}; - #define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) #define MEMFILE_IDX(val) (((val) >> 16) & 0xffff) #define MEMFILE_ATTR(val) ((val) & 0xffff) -- 2.24.0.rc1.363.gb1bccd3e3d-goog
A follow up patch in this series adds hugetlb cgroup uncharge info the file_region entries in resv->regions. The cgroup uncharge info may differ for different regions, so they can no longer be coalesced at region_add time. So, disable region coalescing in region_add in this patch.
Behavior change:
Say a resv_map exists like this [0->1], [2->3], and [5->6].
Then a region_chg/add call comes in region_chg/add(f=0, t=5).
Old code would generate resv->regions: [0->5], [5->6]. New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5], [5->6].
Special care needs to be taken to handle the resv->adds_in_progress variable correctly. In the past, only 1 region would be added for every region_chg and region_add call. But now, each call may add multiple regions, so we can no longer increment adds_in_progress by 1 in region_chg, or decrement adds_in_progress by 1 after region_add or region_abort. Instead, region_chg calls add_reservation_in_range() to count the number of regions needed and allocates those, and that info is passed to region_add and region_abort to decrement adds_in_progress correctly.
Signed-off-by: Mina Almasry almasrymina@google.com
--- Changes in v8: - Addressed comments from Mike, especially: - fixing region_add not allocating enough regions sometimes. - reverted vma_*_reservation API changes. Changes in v7: - region_chg no longer allocates (t-f) / 2 file_region entries. Changes in v6: - Fix bug in number of region_caches allocated by region_chg
--- mm/hugetlb.c | 315 ++++++++++++++++++++++++++++++++++----------------- 1 file changed, 213 insertions(+), 102 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 8d8aa89a9928e..1162eeaf8d160 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -244,110 +244,178 @@ struct file_region { long to; };
+/* Helper that removes a struct file_region from the resv_map cache and returns + * it for use. + */ +static struct file_region * +get_file_region_entry_from_cache(struct resv_map *resv, long from, long to) +{ + struct file_region *nrg = NULL; + + VM_BUG_ON(resv->region_cache_count <= 0); + + resv->region_cache_count--; + nrg = list_first_entry(&resv->region_cache, struct file_region, link); + VM_BUG_ON(!nrg); + list_del(&nrg->link); + + nrg->from = from; + nrg->to = to; + + return nrg; +} + /* Must be called with resv->lock held. Calling this with count_only == true * will count the number of pages to be added but will not modify the linked - * list. + * list. If regions_needed != NULL and count_only == true, then regions_needed + * will indicate the number of file_regions needed in the cache to carry out to + * add the regions for this range. */ static long add_reservation_in_range(struct resv_map *resv, long f, long t, - bool count_only) + long *regions_needed, bool count_only) { - long chg = 0; + long add = 0; struct list_head *head = &resv->regions; + long last_accounted_offset = f; struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
- /* Locate the region we are before or in. */ - list_for_each_entry (rg, head, link) - if (f <= rg->to) - break; + if (regions_needed) + *regions_needed = 0;
- /* Round our left edge to the current segment if it encloses us. */ - if (f > rg->from) - f = rg->from; - - chg = t - f; + /* In this loop, we essentially handle an entry for the range + * [last_accounted_offset, rg->from), at every iteration, with some + * bounds checking. + */ + list_for_each_entry_safe(rg, trg, head, link) { + /* Skip irrelevant regions that start before our range. */ + if (rg->from < f) { + /* If this region ends after the last accounted offset, + * then we need to update last_accounted_offset. + */ + if (rg->to > last_accounted_offset) + last_accounted_offset = rg->to; + continue; + }
- /* Check for and consume any regions we now overlap with. */ - nrg = rg; - list_for_each_entry_safe (rg, trg, rg->link.prev, link) { - if (&rg->link == head) - break; + /* When we find a region that starts beyond our range, we've + * finished. + */ if (rg->from > t) break;
- /* We overlap with this area, if it extends further than - * us then we must extend ourselves. Account for its - * existing reservation. + /* Add an entry for last_accounted_offset -> rg->from, and + * update last_accounted_offset. */ - if (rg->to > t) { - chg += rg->to - t; - t = rg->to; + if (rg->from > last_accounted_offset) { + add += rg->from - last_accounted_offset; + if (!count_only) { + nrg = get_file_region_entry_from_cache( + resv, last_accounted_offset, rg->from); + list_add(&nrg->link, rg->link.prev); + } else if (regions_needed) + *regions_needed += 1; } - chg -= rg->to - rg->from;
- if (!count_only && rg != nrg) { - list_del(&rg->link); - kfree(rg); - } + last_accounted_offset = rg->to; }
- if (!count_only) { - nrg->from = f; - nrg->to = t; + /* Handle the case where our range extends beyond + * last_accounted_offset. + */ + if (last_accounted_offset < t) { + add += t - last_accounted_offset; + if (!count_only) { + nrg = get_file_region_entry_from_cache( + resv, last_accounted_offset, t); + list_add(&nrg->link, rg->link.prev); + } else if (regions_needed) + *regions_needed += 1; + last_accounted_offset = t; }
- return chg; + return add; }
/* * Add the huge page range represented by [f, t) to the reserve - * map. Existing regions will be expanded to accommodate the specified - * range, or a region will be taken from the cache. Sufficient regions - * must exist in the cache due to the previous call to region_chg with - * the same range. + * map. Regions will be taken from the cache to fill in this range. + * Sufficient regions should exist in the cache due to the previous + * call to region_chg with the same range, but in some cases the cache will not + * have sufficient entries. The extra needed entries will be allocated. + * + * regions_needed is the out value provided by a previous call to region_chg. * - * Return the number of new huge pages added to the map. This - * number is greater than or equal to zero. + * Return the number of new huge pages added to the map. This number is greater + * than or equal to zero. If file_region entries needed to be allocated for + * this operation and we were not able to allocate, it ruturns -ENOMEM. + * region_add of regions of length 1 never allocate file_regions and cannot + * fail. */ -static long region_add(struct resv_map *resv, long f, long t) +static long region_add(struct resv_map *resv, long f, long t, + long in_regions_needed) { - struct list_head *head = &resv->regions; - struct file_region *rg, *nrg; - long add = 0; + long add = 0, actual_regions_needed = 0, i = 0; + struct file_region *trg = NULL, *rg = NULL; + struct list_head allocated_regions; + + INIT_LIST_HEAD(&allocated_regions);
spin_lock(&resv->lock); - /* Locate the region we are either in or before. */ - list_for_each_entry(rg, head, link) - if (f <= rg->to) - break; +retry: + + /* Count how many regions are actually needed to execute this add. */ + add_reservation_in_range(resv, f, t, &actual_regions_needed, true);
/* - * If no region exists which can be expanded to include the - * specified range, pull a region descriptor from the cache - * and use it for this range. + * Check for sufficient descriptors in the cache to accommodate + * this add operation. Note that actual_regions_needed may be greater + * than in_regions_needed. In this case, we need to make sure that we + * allocate extra entries, such that we have enough for all the + * existing adds_in_progress, plus the excess needed for this + * operation. */ - if (&rg->link == head || t < rg->from) { - VM_BUG_ON(resv->region_cache_count <= 0); + if (resv->region_cache_count < + resv->adds_in_progress + + (actual_regions_needed - in_regions_needed)) { + /* region_add operation of range 1 should never need to + * allocate file_region entries. + */ + VM_BUG_ON(t - f <= 1);
- resv->region_cache_count--; - nrg = list_first_entry(&resv->region_cache, struct file_region, - link); - list_del(&nrg->link); + /* Must drop lock to allocate a new descriptor. */ + spin_unlock(&resv->lock); + for (i = 0; i < (actual_regions_needed - in_regions_needed); + i++) { + trg = kmalloc(sizeof(*trg), GFP_KERNEL); + if (!trg) + goto out_of_memory; + list_add(&trg->link, &allocated_regions); + } + spin_lock(&resv->lock);
- nrg->from = f; - nrg->to = t; - list_add(&nrg->link, rg->link.prev); + list_for_each_entry_safe(rg, trg, &allocated_regions, link) { + list_del(&rg->link); + list_add(&rg->link, &resv->region_cache); + resv->region_cache_count++; + }
- add += t - f; - goto out_locked; + goto retry; }
- add = add_reservation_in_range(resv, f, t, false); + add = add_reservation_in_range(resv, f, t, NULL, false); + + resv->adds_in_progress -= in_regions_needed;
-out_locked: - resv->adds_in_progress--; spin_unlock(&resv->lock); VM_BUG_ON(add < 0); return add; + +out_of_memory: + list_for_each_entry_safe(rg, trg, &allocated_regions, link) { + list_del(&rg->link); + kfree(rg); + } + return -ENOMEM; }
/* @@ -357,49 +425,72 @@ static long region_add(struct resv_map *resv, long f, long t) * call to region_add that will actually modify the reserve * map to add the specified range [f, t). region_chg does * not change the number of huge pages represented by the - * map. A new file_region structure is added to the cache - * as a placeholder, so that the subsequent region_add - * call will have all the regions it needs and will not fail. + * map. A number of new file_region structures is added to the cache as a + * placeholder, for the subsequent region_add call to use. At least 1 + * file_region structure is added. + * + * out_regions_needed is the number of regions added to the + * resv->adds_in_progress. This value needs to be provided to a follow up call + * to region_add or region_abort for proper accounting. * * Returns the number of huge pages that need to be added to the existing * reservation map for the range [f, t). This number is greater or equal to * zero. -ENOMEM is returned if a new file_region structure or cache entry * is needed and can not be allocated. */ -static long region_chg(struct resv_map *resv, long f, long t) +static long region_chg(struct resv_map *resv, long f, long t, + long *out_regions_needed) { - long chg = 0; + struct file_region *trg = NULL, *rg = NULL; + long chg = 0, i = 0, to_allocate = 0; + struct list_head allocated_regions; + + INIT_LIST_HEAD(&allocated_regions);
spin_lock(&resv->lock); -retry_locked: - resv->adds_in_progress++; + + /* Count how many hugepages in this range are NOT respresented. */ + chg = add_reservation_in_range(resv, f, t, out_regions_needed, true); + + if (*out_regions_needed < 1) + *out_regions_needed = 1; + + resv->adds_in_progress += *out_regions_needed;
/* * Check for sufficient descriptors in the cache to accommodate * the number of in progress add operations. */ - if (resv->adds_in_progress > resv->region_cache_count) { - struct file_region *trg; + while (resv->region_cache_count < resv->adds_in_progress) { + to_allocate = resv->adds_in_progress - resv->region_cache_count;
- VM_BUG_ON(resv->adds_in_progress - resv->region_cache_count > 1); /* Must drop lock to allocate a new descriptor. */ - resv->adds_in_progress--; spin_unlock(&resv->lock); - - trg = kmalloc(sizeof(*trg), GFP_KERNEL); - if (!trg) - return -ENOMEM; + for (i = 0; i < to_allocate; i++) { + trg = kmalloc(sizeof(*trg), GFP_KERNEL); + if (!trg) + goto out_of_memory; + list_add(&trg->link, &allocated_regions); + }
spin_lock(&resv->lock); - list_add(&trg->link, &resv->region_cache); - resv->region_cache_count++; - goto retry_locked; - }
- chg = add_reservation_in_range(resv, f, t, true); + list_for_each_entry_safe(rg, trg, &allocated_regions, link) { + list_del(&rg->link); + list_add(&rg->link, &resv->region_cache); + resv->region_cache_count++; + } + }
spin_unlock(&resv->lock); return chg; + +out_of_memory: + list_for_each_entry_safe(rg, trg, &allocated_regions, link) { + list_del(&rg->link); + kfree(rg); + } + return -ENOMEM; }
/* @@ -407,17 +498,20 @@ static long region_chg(struct resv_map *resv, long f, long t) * of the resv_map keeps track of the operations in progress between * calls to region_chg and region_add. Operations are sometimes * aborted after the call to region_chg. In such cases, region_abort - * is called to decrement the adds_in_progress counter. + * is called to decrement the adds_in_progress counter. regions_needed + * is the value returned by the region_chg call, it is used to decrement + * the adds_in_progress counter. * * NOTE: The range arguments [f, t) are not needed or used in this * routine. They are kept to make reading the calling code easier as * arguments will match the associated region_chg call. */ -static void region_abort(struct resv_map *resv, long f, long t) +static void region_abort(struct resv_map *resv, long f, long t, + long regions_needed) { spin_lock(&resv->lock); VM_BUG_ON(!resv->region_cache_count); - resv->adds_in_progress--; + resv->adds_in_progress -= regions_needed; spin_unlock(&resv->lock); }
@@ -1904,6 +1998,7 @@ static long __vma_reservation_common(struct hstate *h, struct resv_map *resv; pgoff_t idx; long ret; + long dummy_out_regions_needed;
resv = vma_resv_map(vma); if (!resv) @@ -1912,20 +2007,29 @@ static long __vma_reservation_common(struct hstate *h, idx = vma_hugecache_offset(h, vma, addr); switch (mode) { case VMA_NEEDS_RESV: - ret = region_chg(resv, idx, idx + 1); + ret = region_chg(resv, idx, idx + 1, &dummy_out_regions_needed); + /* We assume that vma_reservation_* routines always operate on + * 1 page, and that adding to resv map a 1 page entry can only + * ever require 1 region. + */ + VM_BUG_ON(dummy_out_regions_needed != 1); break; case VMA_COMMIT_RESV: - ret = region_add(resv, idx, idx + 1); + ret = region_add(resv, idx, idx + 1, 1); + /* region_add calls of range 1 should never fail. */ + VM_BUG_ON(ret < 0); break; case VMA_END_RESV: - region_abort(resv, idx, idx + 1); + region_abort(resv, idx, idx + 1, 1); ret = 0; break; case VMA_ADD_RESV: - if (vma->vm_flags & VM_MAYSHARE) - ret = region_add(resv, idx, idx + 1); - else { - region_abort(resv, idx, idx + 1); + if (vma->vm_flags & VM_MAYSHARE) { + ret = region_add(resv, idx, idx + 1, 1); + /* region_add calls of range 1 should never fail. */ + VM_BUG_ON(ret < 0); + } else { + region_abort(resv, idx, idx + 1, 1); ret = region_del(resv, idx, idx + 1); } break; @@ -4578,12 +4682,12 @@ int hugetlb_reserve_pages(struct inode *inode, struct vm_area_struct *vma, vm_flags_t vm_flags) { - long ret, chg; + long ret, chg, add = -1; struct hstate *h = hstate_inode(inode); struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; struct hugetlb_cgroup *h_cg; - long gbl_reserve; + long gbl_reserve, regions_needed = 0;
/* This should never happen */ if (from > to) { @@ -4613,7 +4717,7 @@ int hugetlb_reserve_pages(struct inode *inode, */ resv_map = inode_resv_map(inode);
- chg = region_chg(resv_map, from, to); + chg = region_chg(resv_map, from, to, ®ions_needed);
} else { /* Private mapping. */ @@ -4683,9 +4787,14 @@ int hugetlb_reserve_pages(struct inode *inode, * else has to be done for private mappings here */ if (!vma || vma->vm_flags & VM_MAYSHARE) { - long add = region_add(resv_map, from, to); - - if (unlikely(chg > add)) { + add = region_add(resv_map, from, to, regions_needed); + + if (unlikely(add < 0)) { + hugetlb_acct_memory(h, -gbl_reserve); + /* put back original number of pages, chg */ + (void)hugepage_subpool_put_pages(spool, chg); + goto out_err; + } else if (unlikely(chg > add)) { /* * pages in this range were added to the reserve * map between region_chg and region_add. This @@ -4703,9 +4812,11 @@ int hugetlb_reserve_pages(struct inode *inode, return 0; out_err: if (!vma || vma->vm_flags & VM_MAYSHARE) - /* Don't call region_abort if region_chg failed */ - if (chg >= 0) - region_abort(resv_map, from, to); + /* Only call region_abort if the region_chg succeeded but the + * region_add failed or didn't run. + */ + if (chg >= 0 && add < 0) + region_abort(resv_map, from, to, regions_needed); if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) kref_put(&resv_map->refs, resv_map_release); return ret; -- 2.24.0.rc1.363.gb1bccd3e3d-goog
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives in the resv_map entries, in file_region->reservation_counter.
After a call to region_chg, we charge the approprate hugetlb_cgroup, and if successful, we pass on the hugetlb_cgroup info to a follow up region_add call. When a file_region entry is added to the resv_map via region_add, we put the pointer to that cgroup in file_region->reservation_counter. If charging doesn't succeed, we report the error to the caller, so that the kernel fails the reservation.
On region_del, which is when the hugetlb memory is unreserved, we also uncharge the file_region->reservation_counter.
Signed-off-by: Mina Almasry almasrymina@google.com
--- mm/hugetlb.c | 149 ++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 117 insertions(+), 32 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1162eeaf8d160..40f4b85de59c4 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -242,6 +242,15 @@ struct file_region { struct list_head link; long from; long to; +#ifdef CONFIG_CGROUP_HUGETLB + /* + * On shared mappings, each reserved region appears as a struct + * file_region in resv_map. These fields hold the info needed to + * uncharge each reservation. + */ + struct page_counter *reservation_counter; + unsigned long pages_per_hpage; +#endif };
/* Helper that removes a struct file_region from the resv_map cache and returns @@ -265,6 +274,23 @@ get_file_region_entry_from_cache(struct resv_map *resv, long from, long to) return nrg; }
+/* Helper that records hugetlb_cgroup uncharge info. */ +static void record_hugetlb_cgroup_uncharge_info(struct hugetlb_cgroup *h_cg, + struct file_region *nrg, + struct hstate *h) +{ +#ifdef CONFIG_CGROUP_HUGETLB + if (h_cg) { + nrg->reservation_counter = + &h_cg->reserved_hugepage[hstate_index(h)]; + nrg->pages_per_hpage = pages_per_huge_page(h); + } else { + nrg->reservation_counter = NULL; + nrg->pages_per_hpage = 0; + } +#endif +} + /* Must be called with resv->lock held. Calling this with count_only == true * will count the number of pages to be added but will not modify the linked * list. If regions_needed != NULL and count_only == true, then regions_needed @@ -272,7 +298,9 @@ get_file_region_entry_from_cache(struct resv_map *resv, long from, long to) * add the regions for this range. */ static long add_reservation_in_range(struct resv_map *resv, long f, long t, - long *regions_needed, bool count_only) + struct hugetlb_cgroup *h_cg, + struct hstate *h, long *regions_needed, + bool count_only) { long add = 0; struct list_head *head = &resv->regions; @@ -311,6 +339,8 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, if (!count_only) { nrg = get_file_region_entry_from_cache( resv, last_accounted_offset, rg->from); + record_hugetlb_cgroup_uncharge_info(h_cg, nrg, + h); list_add(&nrg->link, rg->link.prev); } else if (regions_needed) *regions_needed += 1; @@ -327,12 +357,14 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, if (!count_only) { nrg = get_file_region_entry_from_cache( resv, last_accounted_offset, t); + record_hugetlb_cgroup_uncharge_info(h_cg, nrg, h); list_add(&nrg->link, rg->link.prev); } else if (regions_needed) *regions_needed += 1; last_accounted_offset = t; }
+ VM_BUG_ON(add < 0); return add; }
@@ -351,7 +383,8 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, * region_add of regions of length 1 never allocate file_regions and cannot * fail. */ -static long region_add(struct resv_map *resv, long f, long t, +static long region_add(struct hstate *h, struct hugetlb_cgroup *h_cg, + struct resv_map *resv, long f, long t, long in_regions_needed) { long add = 0, actual_regions_needed = 0, i = 0; @@ -364,7 +397,8 @@ static long region_add(struct resv_map *resv, long f, long t, retry:
/* Count how many regions are actually needed to execute this add. */ - add_reservation_in_range(resv, f, t, &actual_regions_needed, true); + add_reservation_in_range(resv, f, t, NULL, NULL, &actual_regions_needed, + true);
/* * Check for sufficient descriptors in the cache to accommodate @@ -402,7 +436,7 @@ static long region_add(struct resv_map *resv, long f, long t, goto retry; }
- add = add_reservation_in_range(resv, f, t, NULL, false); + add = add_reservation_in_range(resv, f, t, h_cg, h, NULL, false);
resv->adds_in_progress -= in_regions_needed;
@@ -450,7 +484,8 @@ static long region_chg(struct resv_map *resv, long f, long t, spin_lock(&resv->lock);
/* Count how many hugepages in this range are NOT respresented. */ - chg = add_reservation_in_range(resv, f, t, out_regions_needed, true); + chg = add_reservation_in_range(resv, f, t, NULL, NULL, + out_regions_needed, true);
if (*out_regions_needed < 1) *out_regions_needed = 1; @@ -515,6 +550,24 @@ static void region_abort(struct resv_map *resv, long f, long t, spin_unlock(&resv->lock); }
+static void uncharge_cgroup_if_shared_mapping(struct resv_map *resv, + struct file_region *rg, + unsigned long nr_pages) +{ +#ifdef CONFIG_CGROUP_HUGETLB + /* + * If resv->reservation_counter is NULL, then this is shared + * reservation, and the reserved memory is tracked in the file_struct + * entries inside of resv_map. So we need to uncharge the memory here. + */ + if (rg->reservation_counter && rg->pages_per_hpage && nr_pages > 0 && + !resv->reservation_counter) { + hugetlb_cgroup_uncharge_counter(rg->reservation_counter, + nr_pages * rg->pages_per_hpage); + } +#endif +} + /* * Delete the specified range [f, t) from the reserve map. If the * t parameter is LONG_MAX, this indicates that ALL regions after f @@ -584,6 +637,9 @@ static long region_del(struct resv_map *resv, long f, long t) /* Original entry is trimmed */ rg->to = f;
+ uncharge_cgroup_if_shared_mapping(resv, rg, + nrg->to - nrg->from); + list_add(&nrg->link, &rg->link); nrg = NULL; break; @@ -591,6 +647,8 @@ static long region_del(struct resv_map *resv, long f, long t)
if (f <= rg->from && t >= rg->to) { /* Remove entire region */ del += rg->to - rg->from; + uncharge_cgroup_if_shared_mapping(resv, rg, + rg->to - rg->from); list_del(&rg->link); kfree(rg); continue; @@ -599,14 +657,20 @@ static long region_del(struct resv_map *resv, long f, long t) if (f <= rg->from) { /* Trim beginning of region */ del += t - rg->from; rg->from = t; + + uncharge_cgroup_if_shared_mapping(resv, rg, + t - rg->from); } else { /* Trim end of region */ del += rg->to - f; rg->to = f; + + uncharge_cgroup_if_shared_mapping(resv, rg, rg->to - f); } }
spin_unlock(&resv->lock); kfree(nrg); + return del; }
@@ -2015,7 +2079,7 @@ static long __vma_reservation_common(struct hstate *h, VM_BUG_ON(dummy_out_regions_needed != 1); break; case VMA_COMMIT_RESV: - ret = region_add(resv, idx, idx + 1, 1); + ret = region_add(NULL, NULL, resv, idx, idx + 1, 1); /* region_add calls of range 1 should never fail. */ VM_BUG_ON(ret < 0); break; @@ -2025,7 +2089,7 @@ static long __vma_reservation_common(struct hstate *h, break; case VMA_ADD_RESV: if (vma->vm_flags & VM_MAYSHARE) { - ret = region_add(resv, idx, idx + 1, 1); + ret = region_add(NULL, NULL, resv, idx, idx + 1, 1); /* region_add calls of range 1 should never fail. */ VM_BUG_ON(ret < 0); } else { @@ -4686,7 +4750,7 @@ int hugetlb_reserve_pages(struct inode *inode, struct hstate *h = hstate_inode(inode); struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; - struct hugetlb_cgroup *h_cg; + struct hugetlb_cgroup *h_cg = NULL; long gbl_reserve, regions_needed = 0;
/* This should never happen */ @@ -4727,23 +4791,6 @@ int hugetlb_reserve_pages(struct inode *inode,
chg = to - from;
- if (hugetlb_cgroup_charge_cgroup(hstate_index(h), - chg * pages_per_huge_page(h), - &h_cg, true)) { - kref_put(&resv_map->refs, resv_map_release); - return -ENOMEM; - } - -#ifdef CONFIG_CGROUP_HUGETLB - /* - * Since this branch handles private mappings, we attach the - * counter to uncharge for this reservation off resv_map. - */ - resv_map->reservation_counter = - &h_cg->reserved_hugepage[hstate_index(h)]; - resv_map->pages_per_hpage = pages_per_huge_page(h); -#endif - set_vma_resv_map(vma, resv_map); set_vma_resv_flags(vma, HPAGE_RESV_OWNER); } @@ -4753,6 +4800,25 @@ int hugetlb_reserve_pages(struct inode *inode, goto out_err; }
+ ret = hugetlb_cgroup_charge_cgroup( + hstate_index(h), chg * pages_per_huge_page(h), &h_cg, true); + + if (ret < 0) { + ret = -ENOMEM; + goto out_err; + } + +#ifdef CONFIG_CGROUP_HUGETLB + if (vma && !(vma->vm_flags & VM_MAYSHARE)) { + /* For private mappings, the hugetlb_cgroup uncharge info hangs + * of the resv_map. + */ + resv_map->reservation_counter = + &h_cg->reserved_hugepage[hstate_index(h)]; + resv_map->pages_per_hpage = pages_per_huge_page(h); + } +#endif + /* * There must be enough pages in the subpool for the mapping. If * the subpool has a minimum size, there may be some global @@ -4761,7 +4827,7 @@ int hugetlb_reserve_pages(struct inode *inode, gbl_reserve = hugepage_subpool_get_pages(spool, chg); if (gbl_reserve < 0) { ret = -ENOSPC; - goto out_err; + goto out_uncharge_cgroup; }
/* @@ -4770,9 +4836,7 @@ int hugetlb_reserve_pages(struct inode *inode, */ ret = hugetlb_acct_memory(h, gbl_reserve); if (ret < 0) { - /* put back original number of pages, chg */ - (void)hugepage_subpool_put_pages(spool, chg); - goto out_err; + goto out_put_pages; }
/* @@ -4787,7 +4851,7 @@ int hugetlb_reserve_pages(struct inode *inode, * else has to be done for private mappings here */ if (!vma || vma->vm_flags & VM_MAYSHARE) { - add = region_add(resv_map, from, to, regions_needed); + add = region_add(h, h_cg, resv_map, from, to, regions_needed);
if (unlikely(add < 0)) { hugetlb_acct_memory(h, -gbl_reserve); @@ -4804,12 +4868,33 @@ int hugetlb_reserve_pages(struct inode *inode, */ long rsv_adjust;
- rsv_adjust = hugepage_subpool_put_pages(spool, - chg - add); + hugetlb_cgroup_uncharge_cgroup( + hstate_index(h), + (chg - add) * pages_per_huge_page(h), h_cg, + true); + + rsv_adjust = + hugepage_subpool_put_pages(spool, chg - add); hugetlb_acct_memory(h, -rsv_adjust); } + } else { +#ifdef CONFIG_CGROUP_HUGETLB + /* + * Since this branch handles private mappings, we attach the + * counter to uncharge for this reservation off resv_map. + */ + resv_map->reservation_counter = + &h_cg->reserved_hugepage[hstate_index(h)]; + resv_map->pages_per_hpage = pages_per_huge_page(h); +#endif } return 0; +out_put_pages: + /* put back original number of pages, chg */ + (void)hugepage_subpool_put_pages(spool, chg); +out_uncharge_cgroup: + hugetlb_cgroup_uncharge_cgroup( + hstate_index(h), chg * pages_per_huge_page(h), h_cg, true); out_err: if (!vma || vma->vm_flags & VM_MAYSHARE) /* Only call region_abort if the region_chg succeeded but the -- 2.24.0.rc1.363.gb1bccd3e3d-goog
Support MAP_NORESERVE accounting as part of the new counter.
For each hugepage allocation, at allocation time we check if there is a reservation for this allocation or not. If there is a reservation for this allocation, then this allocation was charged at reservation time, and we don't re-account it. If there is no reserevation for this allocation, we charge the appropriate hugetlb_cgroup.
The hugetlb_cgroup to uncharge for this allocation is stored in page[3].private. We use new APIs added in an earlier patch to set this pointer.
Signed-off-by: Mina Almasry almasrymina@google.com
--- mm/hugetlb.c | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 40f4b85de59c4..cb06b3d5bd4d7 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1309,6 +1309,7 @@ static void update_and_free_page(struct hstate *h, struct page *page) 1 << PG_writeback); } VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page, false), page); + VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page, true), page); set_compound_page_dtor(page, NULL_COMPOUND_DTOR); set_page_refcounted(page); if (hstate_is_gigantic(h)) { @@ -1420,6 +1421,9 @@ void free_huge_page(struct page *page) clear_page_huge_active(page); hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h), page, false); + hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h), + page, true); + if (restore_reserve) h->resv_huge_pages++;
@@ -1446,6 +1450,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); spin_lock(&hugetlb_lock); set_hugetlb_cgroup(page, NULL, false); + set_hugetlb_cgroup(page, NULL, true); h->nr_huge_pages++; h->nr_huge_pages_node[nid]++; spin_unlock(&hugetlb_lock); @@ -2241,10 +2246,19 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, gbl_chg = 1; }
+ /* If this allocation is not consuming a reservation, charge it now. + */ + if (map_chg || avoid_reserve || !vma_resv_map(vma)) { + ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), + &h_cg, true); + if (ret) + goto out_subpool_put; + } + ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg, false); if (ret) - goto out_subpool_put; + goto out_uncharge_cgroup_reservation;
spin_lock(&hugetlb_lock); /* @@ -2268,6 +2282,11 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, } hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page, false); + if (!vma_resv_map(vma) || map_chg || avoid_reserve) { + hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, + page, true); + } + spin_unlock(&hugetlb_lock);
set_page_private(page, (unsigned long)spool); @@ -2293,6 +2312,10 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, out_uncharge_cgroup: hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg, false); +out_uncharge_cgroup_reservation: + if (map_chg || avoid_reserve || !vma_resv_map(vma)) + hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), + h_cg, true); out_subpool_put: if (map_chg || avoid_reserve) hugepage_subpool_put_pages(spool, 1); -- 2.24.0.rc1.363.gb1bccd3e3d-goog
The tests use both shared and private mapped hugetlb memory, and monitors the hugetlb usage counter as well as the hugetlb reservation counter. They test different configurations such as hugetlb memory usage via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without MAP_POPULATE.
Signed-off-by: Mina Almasry almasrymina@google.com
---
Changes in v6: - Updates tests for cgroups-v2 and NORESERVE allocations.
Change-Id: I70a592511ec6039791310d3ba7a96824cdf2e8e6 --- tools/testing/selftests/vm/.gitignore | 1 + tools/testing/selftests/vm/Makefile | 1 + .../selftests/vm/charge_reserved_hugetlb.sh | 527 ++++++++++++++++++ .../selftests/vm/write_hugetlb_memory.sh | 23 + .../testing/selftests/vm/write_to_hugetlbfs.c | 261 +++++++++ 5 files changed, 813 insertions(+) create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c
diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore index 31b3c98b6d34d..d3bed9407773c 100644 --- a/tools/testing/selftests/vm/.gitignore +++ b/tools/testing/selftests/vm/.gitignore @@ -14,3 +14,4 @@ virtual_address_range gup_benchmark va_128TBswitch map_fixed_noreplace +write_to_hugetlbfs diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 9534dc2bc9295..31c2cc5cf30b5 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += userfaultfd TEST_GEN_FILES += va_128TBswitch TEST_GEN_FILES += virtual_address_range +TEST_GEN_FILES += write_to_hugetlbfs
TEST_PROGS := run_vmtests
diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh new file mode 100755 index 0000000000000..278dd6475cd0f --- /dev/null +++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh @@ -0,0 +1,527 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 + +set -e + +if [[ $(id -u) -ne 0 ]]; then + echo "This test must be run as root. Skipping..." + exit 0 +fi + +cgroup_path=/dev/cgroup/memory +if [[ ! -e $cgroup_path ]]; then + mkdir -p $cgroup_path + mount -t cgroup2 none $cgroup_path +fi + +echo "+hugetlb" > /dev/cgroup/memory/cgroup.subtree_control + + +cleanup () { + echo $$ > $cgroup_path/cgroup.procs + + if [[ -e /mnt/huge ]]; then + rm -rf /mnt/huge/* + umount /mnt/huge || echo error + rmdir /mnt/huge + fi + if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then + rmdir $cgroup_path/hugetlb_cgroup_test + fi + if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then + rmdir $cgroup_path/hugetlb_cgroup_test1 + fi + if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then + rmdir $cgroup_path/hugetlb_cgroup_test2 + fi + echo 0 > /proc/sys/vm/nr_hugepages + echo CLEANUP DONE +} + +function expect_equal() { + local expected="$1" + local actual="$2" + local error="$3" + + if [[ "$expected" != "$actual" ]]; then + echo "expected ($expected) != actual ($actual): $3" + cleanup + exit 1 + fi +} + +function setup_cgroup() { + local name="$1" + local cgroup_limit="$2" + local reservation_limit="$3" + + mkdir $cgroup_path/$name + + echo writing cgroup limit: "$cgroup_limit" + echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes + + echo writing reseravation limit: "$reservation_limit" + echo "$reservation_limit" > \ + $cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes + + if [ -e "$cgroup_path/$name/cpuset.cpus" ]; then + echo 0 > $cgroup_path/$name/cpuset.cpus + fi + if [ -e "$cgroup_path/$name/cpuset.mems" ]; then + echo 0 > $cgroup_path/$name/cpuset.mems + fi +} + +function wait_for_hugetlb_memory_to_get_depleted { + local cgroup="$1" + local path="/dev/cgroup/memory/$cgroup/hugetlb.2MB.reservation_usage_in_bytes" + # Wait for hugetlbfs memory to get depleted. + while [ $(cat $path) != 0 ]; do + echo Waiting for hugetlb memory to get depleted. + cat $path + sleep 0.5 + done +} + +function wait_for_hugetlb_memory_to_get_reserved { + local cgroup="$1" + local size="$2" + + local path="/dev/cgroup/memory/$cgroup/hugetlb.2MB.reservation_usage_in_bytes" + # Wait for hugetlbfs memory to get written. + while [ $(cat $path) != $size ]; do + echo Waiting for hugetlb memory to reach size $size. + cat $path + sleep 0.5 + done +} + +function wait_for_hugetlb_memory_to_get_written { + local cgroup="$1" + local size="$2" + + local path="/dev/cgroup/memory/$cgroup/hugetlb.2MB.usage_in_bytes" + # Wait for hugetlbfs memory to get written. + while [ $(cat $path) != $size ]; do + echo Waiting for hugetlb memory to reach size $size. + cat $path + sleep 0.5 + done +} + +function write_hugetlbfs_and_get_usage() { + local cgroup="$1" + local size="$2" + local populate="$3" + local write="$4" + local path="$5" + local method="$6" + local private="$7" + local expect_failure="$8" + local reserve="$9" + + # Function return values. + reservation_failed=0 + oom_killed=0 + hugetlb_difference=0 + reserved_difference=0 + + local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.2MB.usage_in_bytes + local reserved_usage=$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes + + local hugetlb_before=$(cat $hugetlb_usage) + local reserved_before=$(cat $reserved_usage) + + echo + echo Starting: + echo hugetlb_usage="$hugetlb_before" + echo reserved_usage="$reserved_before" + echo expect_failure is "$expect_failure" + + set +e + if [[ "$method" == "1" ]] || [[ "$method" == 2 ]] || \ + [[ "$private" == "-r" ]] && [[ "$expect_failure" != 1 ]]; then + + bash write_hugetlb_memory.sh "$size" "$populate" "$write" \ + "$cgroup" "$path" "$method" "$private" "-l" "$reserve"& + local write_result=$? + + if [[ "$reserve" != "-n" ]]; then + wait_for_hugetlb_memory_to_get_reserved "$cgroup" "$size" + elif [[ "$populate" == "-o" ]] || \ + [[ "$write" == "-w" ]]; then + wait_for_hugetlb_memory_to_get_written "$cgroup" "$size" + else + # This case doesn't produce visible effects, but we still have + # to wait for the async process to start and execute... + sleep 0.5 + fi + + echo write_result is $write_result + else + bash write_hugetlb_memory.sh "$size" "$populate" "$write" \ + "$cgroup" "$path" "$method" "$private" "$reserve" + local write_result=$? + + if [[ "$reserve" != "-n" ]]; then + wait_for_hugetlb_memory_to_get_reserved "$cgroup" "$size" + fi + fi + set -e + + if [[ "$write_result" == 1 ]]; then + reservation_failed=1 + fi + + # On linus/master, the above process gets SIGBUS'd on oomkill, with + # return code 135. On earlier kernels, it gets actual oomkill, with return + # code 137, so just check for both conditions in case we're testing + # against an earlier kernel. + if [[ "$write_result" == 135 ]] || [[ "$write_result" == 137 ]]; then + oom_killed=1 + fi + + local hugetlb_after=$(cat $hugetlb_usage) + local reserved_after=$(cat $reserved_usage) + + echo After write: + echo hugetlb_usage="$hugetlb_after" + echo reserved_usage="$reserved_after" + + hugetlb_difference=$(($hugetlb_after - $hugetlb_before)) + reserved_difference=$(($reserved_after - $reserved_before)) +} + +function cleanup_hugetlb_memory() { + set +e + local cgroup="$1" + if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then + echo kiling write_to_hugetlbfs + killall -2 write_to_hugetlbfs + wait_for_hugetlb_memory_to_get_depleted $cgroup + fi + set -e + + if [[ -e /mnt/huge ]]; then + rm -rf /mnt/huge/* + umount /mnt/huge + rmdir /mnt/huge + fi +} + +function run_test() { + local size="$1" + local populate="$2" + local write="$3" + local cgroup_limit="$4" + local reservation_limit="$5" + local nr_hugepages="$6" + local method="$7" + local private="$8" + local expect_failure="$9" + local reserve="${10}" + + # Function return values. + hugetlb_difference=0 + reserved_difference=0 + reservation_failed=0 + oom_killed=0 + + echo nr hugepages = "$nr_hugepages" + echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages + + setup_cgroup "hugetlb_cgroup_test" "$cgroup_limit" "$reservation_limit" + + mkdir -p /mnt/huge + mount -t hugetlbfs \ + -o pagesize=2M,size=256M none /mnt/huge + + write_hugetlbfs_and_get_usage "hugetlb_cgroup_test" "$size" "$populate" \ + "$write" "/mnt/huge/test" "$method" "$private" "$expect_failure" \ + "$reserve" + + cleanup_hugetlb_memory "hugetlb_cgroup_test" + + local final_hugetlb=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.usage_in_bytes) + local final_reservation=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.reservation_usage_in_bytes) + + echo $hugetlb_difference + echo $reserved_difference + expect_equal "0" "$final_hugetlb" "final hugetlb is not zero" + expect_equal "0" "$final_reservation" "final reservation is not zero" +} + +function run_multiple_cgroup_test() { + local size1="$1" + local populate1="$2" + local write1="$3" + local cgroup_limit1="$4" + local reservation_limit1="$5" + + local size2="$6" + local populate2="$7" + local write2="$8" + local cgroup_limit2="$9" + local reservation_limit2="${10}" + + local nr_hugepages="${11}" + local method="${12}" + local private="${13}" + local expect_failure="${14}" + local reserve="${15}" + + # Function return values. + hugetlb_difference1=0 + reserved_difference1=0 + reservation_failed1=0 + oom_killed1=0 + + hugetlb_difference2=0 + reserved_difference2=0 + reservation_failed2=0 + oom_killed2=0 + + + echo nr hugepages = "$nr_hugepages" + echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages + + setup_cgroup "hugetlb_cgroup_test1" "$cgroup_limit1" "$reservation_limit1" + setup_cgroup "hugetlb_cgroup_test2" "$cgroup_limit2" "$reservation_limit2" + + mkdir -p /mnt/huge + mount -t hugetlbfs \ + -o pagesize=2M,size=256M none /mnt/huge + + write_hugetlbfs_and_get_usage "hugetlb_cgroup_test1" "$size1" \ + "$populate1" "$write1" "/mnt/huge/test1" "$method" "$private" \ + "$expect_failure" "$reserve" + + hugetlb_difference1=$hugetlb_difference + reserved_difference1=$reserved_difference + reservation_failed1=$reservation_failed + oom_killed1=$oom_killed + + local cgroup1_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.usage_in_bytes + local cgroup1_reservation_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.reservation_usage_in_bytes + local cgroup2_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.usage_in_bytes + local cgroup2_reservation_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.reservation_usage_in_bytes + + local usage_before_second_write=$(cat $cgroup1_hugetlb_usage) + local reservation_usage_before_second_write=$(cat \ + $cgroup1_reservation_usage) + + write_hugetlbfs_and_get_usage "hugetlb_cgroup_test2" "$size2" \ + "$populate2" "$write2" "/mnt/huge/test2" "$method" "$private" \ + "$expect_failure" "$reserve" + + hugetlb_difference2=$hugetlb_difference + reserved_difference2=$reserved_difference + reservation_failed2=$reservation_failed + oom_killed2=$oom_killed + + expect_equal "$usage_before_second_write" \ + "$(cat $cgroup1_hugetlb_usage)" "Usage changed." + expect_equal "$reservation_usage_before_second_write" \ + "$(cat $cgroup1_reservation_usage)" "Reservation usage changed." + + cleanup_hugetlb_memory + + local final_hugetlb=$(cat $cgroup1_hugetlb_usage) + local final_reservation=$(cat $cgroup1_reservation_usage) + + expect_equal "0" "$final_hugetlb" \ + "hugetlbt_cgroup_test1 final hugetlb is not zero" + expect_equal "0" "$final_reservation" \ + "hugetlbt_cgroup_test1 final reservation is not zero" + + local final_hugetlb=$(cat $cgroup2_hugetlb_usage) + local final_reservation=$(cat $cgroup2_reservation_usage) + + expect_equal "0" "$final_hugetlb" \ + "hugetlb_cgroup_test2 final hugetlb is not zero" + expect_equal "0" "$final_reservation" \ + "hugetlb_cgroup_test2 final reservation is not zero" +} + +cleanup + +for populate in "" "-o"; do +for method in 0 1 2; do +for private in "" "-r"; do +for reserve in "" "-n"; do + +# Skip mmap(MAP_HUGETLB | MAP_SHARED). Doesn't seem to be supported. +if [[ "$method" == 1 ]] && [[ "$private" == "" ]]; then + continue +fi + +# Skip populated shmem tests. Doesn't seem to be supported. +if [[ "$method" == 2"" ]] && [[ "$populate" == "-o" ]]; then + continue +fi + +if [[ "$method" == 2"" ]] && [[ "$reserve" == "-n" ]]; then + continue +fi + +cleanup +echo +echo +echo +echo Test normal case. +echo private=$private, populate=$populate, method=$method, reserve=$reserve +run_test $((10 * 1024 * 1024)) "$populate" "" $((20 * 1024 * 1024)) \ + $((20 * 1024 * 1024)) 10 "$method" "$private" "0" "$reserve" + +echo Memory charged to hugtlb=$hugetlb_difference +echo Memory charged to reservation=$reserved_difference + +if [[ "$populate" == "-o" ]]; then + expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \ + "Reserved memory charged to hugetlb cgroup." +else + expect_equal "0" "$hugetlb_difference" \ + "Reserved memory charged to hugetlb cgroup." +fi + +if [[ "$reserve" != "-n" ]] || [[ "$populate" == "-o" ]]; then + expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \ + "Reserved memory not charged to reservation usage." +else + expect_equal "0" "$reserved_difference" \ + "Reserved memory not charged to reservation usage." +fi + +echo 'PASS' + +cleanup +echo +echo +echo +echo Test normal case with write. +echo private=$private, populate=$populate, method=$method, reserve=$reserve +run_test $((10 * 1024 * 1024)) "$populate" '-w' $((20 * 1024 * 1024)) \ + $((20 * 1024 * 1024)) 10 "$method" "$private" "0" "$reserve" + +echo Memory charged to hugtlb=$hugetlb_difference +echo Memory charged to reservation=$reserved_difference + +expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \ + "Reserved memory charged to hugetlb cgroup." +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \ + "Reserved memory not charged to reservation usage." + +echo 'PASS' + + +cleanup +continue +echo +echo +echo +echo Test more than reservation case. +echo private=$private, populate=$populate, method=$method, reserve=$reserve + +if [ "$reserve" != "-n" ]; then + run_test "$((10 * 1024 * 1024))" "$populate" '' "$((20 * 1024 * 1024))" \ + "$((5 * 1024 * 1024))" "10" "$method" "$private" "1" "$reserve" + + expect_equal "1" "$reservation_failed" "Reservation succeeded." +fi + +echo 'PASS' + +cleanup + +echo +echo +echo +echo Test more than cgroup limit case. +echo private=$private, populate=$populate, method=$method, reserve=$reserve + +# Not sure if shm memory can be cleaned up when the process gets sigbus'd. +if [[ "$method" != 2 ]]; then + run_test $((10 * 1024 * 1024)) "$populate" "-w" $((5 * 1024 * 1024)) \ + $((20 * 1024 * 1024)) 10 "$method" "$private" "1" "$reserve" + + expect_equal "1" "$oom_killed" "Not oom killed." +fi +echo 'PASS' + +cleanup + +echo +echo +echo +echo Test normal case, multiple cgroups. +echo private=$private, populate=$populate, method=$method, reserve=$reserve +run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "" \ + "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \ + "$populate" "" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \ + "$method" "$private" "0" "$reserve" + +echo Memory charged to hugtlb1=$hugetlb_difference1 +echo Memory charged to reservation1=$reserved_difference1 +echo Memory charged to hugtlb2=$hugetlb_difference2 +echo Memory charged to reservation2=$reserved_difference2 + +if [[ "$reserve" != "-n" ]] || [[ "$populate" == "-o" ]]; then + expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \ + "Incorrect reservations charged to cgroup 1." + expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \ + "Incorrect reservation charged to cgroup 2." +else + expect_equal "0" "$reserved_difference1" \ + "Incorrect reservations charged to cgroup 1." + expect_equal "0" "$reserved_difference2" \ + "Incorrect reservation charged to cgroup 2." +fi + +if [[ "$populate" == "-o" ]]; then + expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \ + "Incorrect hugetlb charged to cgroup 1." + expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \ + "Incorrect hugetlb charged to cgroup 2." +else + expect_equal "0" "$hugetlb_difference1" \ + "Incorrect hugetlb charged to cgroup 1." + expect_equal "0" "$hugetlb_difference2" \ + "Incorrect hugetlb charged to cgroup 2." +fi +echo 'PASS' + +cleanup +echo +echo +echo +echo Test normal case with write, multiple cgroups. +echo private=$private, populate=$populate, method=$method, reserve=$reserve +run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "-w" \ + "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \ + "$populate" "-w" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \ + "$method" "$private" "0" "$reserve" + +echo Memory charged to hugtlb1=$hugetlb_difference1 +echo Memory charged to reservation1=$reserved_difference1 +echo Memory charged to hugtlb2=$hugetlb_difference2 +echo Memory charged to reservation2=$reserved_difference2 + +expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \ + "Incorrect hugetlb charged to cgroup 1." +expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \ + "Incorrect reservation charged to cgroup 1." +expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \ + "Incorrect hugetlb charged to cgroup 2." +expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \ + "Incorrected reservation charged to cgroup 2." +echo 'PASS' + +cleanup + +done # reserve +done # private +done # populate +done # method + +umount $cgroup_path +rmdir $cgroup_path diff --git a/tools/testing/selftests/vm/write_hugetlb_memory.sh b/tools/testing/selftests/vm/write_hugetlb_memory.sh new file mode 100644 index 0000000000000..d3d0d108924d4 --- /dev/null +++ b/tools/testing/selftests/vm/write_hugetlb_memory.sh @@ -0,0 +1,23 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 + +set -e + +size=$1 +populate=$2 +write=$3 +cgroup=$4 +path=$5 +method=$6 +private=$7 +want_sleep=$8 +reserve=$9 + +echo "Putting task in cgroup '$cgroup'" +echo $$ > /dev/cgroup/memory/"$cgroup"/cgroup.procs + +echo "Method is $method" + +set +e +./write_to_hugetlbfs -p "$path" -s "$size" "$write" "$populate" -m "$method" \ + "$private" "$want_sleep" "$reserve" diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c new file mode 100644 index 0000000000000..85811c3384a10 --- /dev/null +++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c @@ -0,0 +1,261 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * This program reserves and uses hugetlb memory, supporting a bunch of + * scenarios needed by the charged_reserved_hugetlb.sh test. + */ + +#include <err.h> +#include <errno.h> +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <fcntl.h> +#include <sys/types.h> +#include <sys/shm.h> +#include <sys/stat.h> +#include <sys/mman.h> + +/* Global definitions. */ +enum method { + HUGETLBFS, + MMAP_MAP_HUGETLB, + SHM, + MAX_METHOD +}; + + +/* Global variables. */ +static const char *self; +static char *shmaddr; +static int shmid; + +/* + * Show usage and exit. + */ +static void exit_usage(void) +{ + printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> " + "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] " + "[-o] [-w] [-n]\n", + self); + exit(EXIT_FAILURE); +} + +void sig_handler(int signo) +{ + printf("Received %d.\n", signo); + if (signo == SIGINT) { + printf("Deleting the memory\n"); + if (shmdt((const void *)shmaddr) != 0) { + perror("Detach failure"); + shmctl(shmid, IPC_RMID, NULL); + exit(4); + } + + shmctl(shmid, IPC_RMID, NULL); + printf("Done deleting the memory\n"); + } + exit(2); +} + +int main(int argc, char **argv) +{ + int fd = 0; + int key = 0; + int *ptr = NULL; + int c = 0; + int size = 0; + char path[256] = ""; + enum method method = MAX_METHOD; + int want_sleep = 0, private = 0; + int populate = 0; + int write = 0; + int reserve = 1; + + unsigned long i; + + if (signal(SIGINT, sig_handler) == SIG_ERR) + err(1, "\ncan't catch SIGINT\n"); + + /* Parse command-line arguments. */ + setvbuf(stdout, NULL, _IONBF, 0); + self = argv[0]; + + while ((c = getopt(argc, argv, "s:p:m:owlrn")) != -1) { + switch (c) { + case 's': + size = atoi(optarg); + break; + case 'p': + strncpy(path, optarg, sizeof(path)); + break; + case 'm': + if (atoi(optarg) >= MAX_METHOD) { + errno = EINVAL; + perror("Invalid -m."); + exit_usage(); + } + method = atoi(optarg); + break; + case 'o': + populate = 1; + break; + case 'w': + write = 1; + break; + case 'l': + want_sleep = 1; + break; + case 'r': + private + = 1; + break; + case 'n': + reserve = 0; + break; + default: + errno = EINVAL; + perror("Invalid arg"); + exit_usage(); + } + } + + if (strncmp(path, "", sizeof(path)) != 0) { + printf("Writing to this path: %s\n", path); + } else { + errno = EINVAL; + perror("path not found"); + exit_usage(); + } + + if (size != 0) { + printf("Writing this size: %d\n", size); + } else { + errno = EINVAL; + perror("size not found"); + exit_usage(); + } + + if (!populate) + printf("Not populating.\n"); + else + printf("Populating.\n"); + + if (!write) + printf("Not writing to memory.\n"); + + if (method == MAX_METHOD) { + errno = EINVAL; + perror("-m Invalid"); + exit_usage(); + } else + printf("Using method=%d\n", method); + + if (!private) + printf("Shared mapping.\n"); + else + printf("Private mapping.\n"); + + if (!reserve) + printf("NO_RESERVE mapping.\n"); + else + printf("RESERVE mapping.\n"); + + switch (method) { + case HUGETLBFS: + printf("Allocating using HUGETLBFS.\n"); + fd = open(path, O_CREAT | O_RDWR, 0777); + if (fd == -1) + err(1, "Failed to open file."); + + ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, + (private ? MAP_PRIVATE : MAP_SHARED) | + (populate ? MAP_POPULATE : 0) | + (reserve ? 0 : MAP_NORESERVE), + fd, 0); + + if (ptr == MAP_FAILED) { + close(fd); + err(1, "Error mapping the file"); + } + break; + case MMAP_MAP_HUGETLB: + printf("Allocating using MAP_HUGETLB.\n"); + ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, + (private ? (MAP_PRIVATE | MAP_ANONYMOUS) : + MAP_SHARED) | + MAP_HUGETLB | (populate ? MAP_POPULATE : 0) | + (reserve ? 0 : MAP_NORESERVE), + -1, 0); + + if (ptr == MAP_FAILED) + err(1, "mmap"); + + printf("Returned address is %p\n", ptr); + break; + case SHM: + printf("Allocating using SHM.\n"); + shmid = shmget(key, size, + SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W); + if (shmid < 0) { + shmid = shmget(++key, size, + SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W); + if (shmid < 0) + err(1, "shmget"); + } + printf("shmid: 0x%x, shmget key:%d\n", shmid, key); + + shmaddr = shmat(shmid, NULL, 0); + if (shmaddr == (char *)-1) { + perror("Shared memory attach failure"); + shmctl(shmid, IPC_RMID, NULL); + exit(2); + } + printf("shmaddr: %p\n", shmaddr); + + break; + default: + errno = EINVAL; + err(1, "Invalid method."); + } + + if (write) { + printf("Writing to memory.\n"); + if (method != SHM) { + memset(ptr, 1, size); + } else { + printf("Starting the writes:\n"); + for (i = 0; i < size; i++) { + shmaddr[i] = (char)(i); + if (!(i % (1024 * 1024))) + printf("."); + } + printf("\n"); + + printf("Starting the Check..."); + for (i = 0; i < size; i++) + if (shmaddr[i] != (char)i) { + printf("\nIndex %lu mismatched\n", i); + exit(3); + } + printf("Done.\n"); + } + } + + if (want_sleep) { + /* Signal to caller that we're done. */ + printf("DONE\n"); + + /* Hold memory until external kill signal is delivered. */ + while (1) + sleep(100); + } + + switch (method == HUGETLBFS) { + close(fd); + } + + return 0; +} -- 2.24.0.rc1.363.gb1bccd3e3d-goog
Add docs for how to use hugetlb_cgroup reservations, and their behavior.
Signed-off-by: Mina Almasry almasrymina@google.com Acked-by: Hillf Danton hdanton@sina.com
---
Changes in v6: - Updated docs to reflect the new design based on a new counter that tracks both reservations and faults.
--- .../admin-guide/cgroup-v1/hugetlb.rst | 64 +++++++++++++++---- 1 file changed, 53 insertions(+), 11 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst index a3902aa253a96..efb94e4db9d5a 100644 --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst @@ -2,13 +2,6 @@ HugeTLB Controller ==================
-The HugeTLB controller allows to limit the HugeTLB usage per control group and -enforces the controller limit during page fault. Since HugeTLB doesn't -support page reclaim, enforcing the limit at page fault time implies that, -the application will get SIGBUS signal if it tries to access HugeTLB pages -beyond its limit. This requires the application to know beforehand how much -HugeTLB pages it would require for its use. - HugeTLB controller can be created by first mounting the cgroup filesystem.
# mount -t cgroup -o hugetlb none /sys/fs/cgroup @@ -28,10 +21,14 @@ process (bash) into it.
Brief summary of control files::
- hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage - hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded - hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb - hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit + hugetlb.<hugepagesize>.reservation_limit_in_bytes # set/show limit of "hugepagesize" hugetlb reservations + hugetlb.<hugepagesize>.reservation_max_usage_in_bytes # show max "hugepagesize" hugetlb reservations and no-reserve faults. + hugetlb.<hugepagesize>.reservation_usage_in_bytes # show current reservations and no-reserve faults for "hugepagesize" hugetlb + hugetlb.<hugepagesize>.reservation_failcnt # show the number of allocation failure due to HugeTLB reservation limit + hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb faults + hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded + hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb + hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB usage limit
For a system supporting three hugepage sizes (64k, 32M and 1G), the control files include:: @@ -40,11 +37,56 @@ files include:: hugetlb.1GB.max_usage_in_bytes hugetlb.1GB.usage_in_bytes hugetlb.1GB.failcnt + hugetlb.1GB.reservation_limit_in_bytes + hugetlb.1GB.reservation_max_usage_in_bytes + hugetlb.1GB.reservation_usage_in_bytes + hugetlb.1GB.reservation_failcnt hugetlb.64KB.limit_in_bytes hugetlb.64KB.max_usage_in_bytes hugetlb.64KB.usage_in_bytes hugetlb.64KB.failcnt + hugetlb.64KB.reservation_limit_in_bytes + hugetlb.64KB.reservation_max_usage_in_bytes + hugetlb.64KB.reservation_usage_in_bytes + hugetlb.64KB.reservation_failcnt hugetlb.32MB.limit_in_bytes hugetlb.32MB.max_usage_in_bytes hugetlb.32MB.usage_in_bytes hugetlb.32MB.failcnt + hugetlb.32MB.reservation_limit_in_bytes + hugetlb.32MB.reservation_max_usage_in_bytes + hugetlb.32MB.reservation_usage_in_bytes + hugetlb.32MB.reservation_failcnt + + +1. Reservation limits + +The HugeTLB controller allows to limit the HugeTLB reservations per control +group and enforces the controller limit at reservation time and at the fault of +hugetlb memory for which no reservation exists. Reservation limits +are superior to Page fault limits (see section 2), since Reservation limits are +enforced at reservation time (on mmap or shget), and never causes the +application to get SIGBUS signal if the memory was reserved before hand. For +MAP_NORESERVE allocations, the reservation limit behaves the same as the fault +limit, enforcing memory usage at fault time and causing the application to +receive a SIGBUS if it's crossing its limit. + +2. Page fault limits + +The HugeTLB controller allows to limit the HugeTLB usage (page fault) per +control group and enforces the controller limit during page fault. Since HugeTLB +doesn't support page reclaim, enforcing the limit at page fault time implies +that, the application will get SIGBUS signal if it tries to access HugeTLB +pages beyond its limit. This requires the application to know beforehand how +much HugeTLB pages it would require for its use. + + +3. Caveats with shared memory + +For shared hugetlb memory, both hugetlb reservation and page faults are charged +to the first task that causes the memory to be reserved or faulted, and all +subsequent uses of this reserved or faulted memory is done without charging. + +Shared hugetlb memory is only uncharged when it is unreserved or deallocated. +This is usually when the hugetlbfs file is deleted, and not when the task that +caused the reservation or fault has exited. -- 2.24.0.rc1.363.gb1bccd3e3d-goog
On 10/29/19 6:36 PM, Mina Almasry wrote:
A follow up patch in this series adds hugetlb cgroup uncharge info the file_region entries in resv->regions. The cgroup uncharge info may differ for different regions, so they can no longer be coalesced at region_add time. So, disable region coalescing in region_add in this patch.
Behavior change:
Say a resv_map exists like this [0->1], [2->3], and [5->6].
Then a region_chg/add call comes in region_chg/add(f=0, t=5).
Old code would generate resv->regions: [0->5], [5->6]. New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5], [5->6].
Special care needs to be taken to handle the resv->adds_in_progress variable correctly. In the past, only 1 region would be added for every region_chg and region_add call. But now, each call may add multiple regions, so we can no longer increment adds_in_progress by 1 in region_chg, or decrement adds_in_progress by 1 after region_add or region_abort. Instead, region_chg calls add_reservation_in_range() to count the number of regions needed and allocates those, and that info is passed to region_add and region_abort to decrement adds_in_progress correctly.
I think this commit message should also mention that we have changed the assumption previously made in the code that region_add after region_chg could never fail. It is described in comments, but would be worth a mention here as well.
Signed-off-by: Mina Almasry almasrymina@google.com
Thanks for the continued updates. I can not spot any major issues with this version of the patch. There are some questions and suggestions embedded below. With those addressed, you can add:
Reviewed-by: Mike Kravetz mike.kravetz@oracle.com
As mentioned previously, this reservation code is quite fragile. Changing it does make me nervous. The good thing is that there is not a 'significant' change in behavior for the normal/expected code paths. It is the handling of the race conditions and unexpected behavior that mostly makes me nervous. I suspect you have not exercised these paths. One thing we could do is add some debug code to force execution of those paths. For example, make add_reservation_in_range(count_only) always return 1 no matter how many regions are needed. This will force region_add more frequently perform additional allocations.
This patch was of greatest concern to me as it impacts all hugetlbfs users, not just those using new cgroup functionality. With it in good shape, I will start to look at the others. However, my cgroup experience/understanding is limited. So, it would be great if others can also review the cgroup specific changes.
Changes in v8:
- Addressed comments from Mike, especially:
- fixing region_add not allocating enough regions sometimes.
- reverted vma_*_reservation API changes.
Changes in v7:
- region_chg no longer allocates (t-f) / 2 file_region entries.
Changes in v6:
- Fix bug in number of region_caches allocated by region_chg
mm/hugetlb.c | 315 ++++++++++++++++++++++++++++++++++----------------- 1 file changed, 213 insertions(+), 102 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 8d8aa89a9928e..1162eeaf8d160 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -244,110 +244,178 @@ struct file_region { long to; };
+/* Helper that removes a struct file_region from the resv_map cache and returns
- it for use.
- */
+static struct file_region * +get_file_region_entry_from_cache(struct resv_map *resv, long from, long to) +{
- struct file_region *nrg = NULL;
- VM_BUG_ON(resv->region_cache_count <= 0);
- resv->region_cache_count--;
- nrg = list_first_entry(&resv->region_cache, struct file_region, link);
- VM_BUG_ON(!nrg);
- list_del(&nrg->link);
- nrg->from = from;
- nrg->to = to;
- return nrg;
+}
/* Must be called with resv->lock held. Calling this with count_only == true
- will count the number of pages to be added but will not modify the linked
- list.
- list. If regions_needed != NULL and count_only == true, then regions_needed
- will indicate the number of file_regions needed in the cache to carry out to
*/
- add the regions for this range.
static long add_reservation_in_range(struct resv_map *resv, long f, long t,
bool count_only)
long *regions_needed, bool count_only)
{
- long chg = 0;
- long add = 0; struct list_head *head = &resv->regions;
- long last_accounted_offset = f; struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
- /* Locate the region we are before or in. */
- list_for_each_entry (rg, head, link)
if (f <= rg->to)
break;
- if (regions_needed)
*regions_needed = 0;
- /* Round our left edge to the current segment if it encloses us. */
- if (f > rg->from)
f = rg->from;
- chg = t - f;
- /* In this loop, we essentially handle an entry for the range
* [last_accounted_offset, rg->from), at every iteration, with some
* bounds checking.
*/
- list_for_each_entry_safe(rg, trg, head, link) {
/* Skip irrelevant regions that start before our range. */
if (rg->from < f) {
/* If this region ends after the last accounted offset,
* then we need to update last_accounted_offset.
*/
if (rg->to > last_accounted_offset)
last_accounted_offset = rg->to;
continue;
}
- /* Check for and consume any regions we now overlap with. */
- nrg = rg;
- list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
if (&rg->link == head)
break;
/* When we find a region that starts beyond our range, we've
* finished.
if (rg->from > t) break;*/
/* We overlap with this area, if it extends further than
* us then we must extend ourselves. Account for its
* existing reservation.
/* Add an entry for last_accounted_offset -> rg->from, and
*/* update last_accounted_offset.
if (rg->to > t) {
chg += rg->to - t;
t = rg->to;
if (rg->from > last_accounted_offset) {
add += rg->from - last_accounted_offset;
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, rg->from);
list_add(&nrg->link, rg->link.prev);
} else if (regions_needed)
}*regions_needed += 1;
chg -= rg->to - rg->from;
if (!count_only && rg != nrg) {
list_del(&rg->link);
kfree(rg);
}
last_accounted_offset = rg->to;
That last assignment is unneeded. Correct?
}
- if (!count_only) {
nrg->from = f;
nrg->to = t;
- /* Handle the case where our range extends beyond
* last_accounted_offset.
*/
- if (last_accounted_offset < t) {
add += t - last_accounted_offset;
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, t);
list_add(&nrg->link, rg->link.prev);
} else if (regions_needed)
*regions_needed += 1;
}last_accounted_offset = t;
- return chg;
- return add;
}
/*
- Add the huge page range represented by [f, t) to the reserve
- map. Existing regions will be expanded to accommodate the specified
- range, or a region will be taken from the cache. Sufficient regions
- must exist in the cache due to the previous call to region_chg with
- the same range.
- map. Regions will be taken from the cache to fill in this range.
- Sufficient regions should exist in the cache due to the previous
- call to region_chg with the same range, but in some cases the cache will not
- have sufficient entries. The extra needed entries will be allocated.
Let's mention that the reason sufficient entries may not exist is due to races with other code doing region_add or region_del.
- regions_needed is the out value provided by a previous call to region_chg.
- Return the number of new huge pages added to the map. This
- number is greater than or equal to zero.
- Return the number of new huge pages added to the map. This number is greater
- than or equal to zero. If file_region entries needed to be allocated for
- this operation and we were not able to allocate, it ruturns -ENOMEM.
- region_add of regions of length 1 never allocate file_regions and cannot
- fail.
Let's add the reason why region_add for regions of length 1 will never fail. Specifically, region_chg will always allocate at least 1 entry and a region_add for 1 page will only require at most 1 entry.
*/ -static long region_add(struct resv_map *resv, long f, long t) +static long region_add(struct resv_map *resv, long f, long t,
long in_regions_needed)
{
- struct list_head *head = &resv->regions;
- struct file_region *rg, *nrg;
- long add = 0;
long add = 0, actual_regions_needed = 0, i = 0;
struct file_region *trg = NULL, *rg = NULL;
struct list_head allocated_regions;
INIT_LIST_HEAD(&allocated_regions);
spin_lock(&resv->lock);
- /* Locate the region we are either in or before. */
- list_for_each_entry(rg, head, link)
if (f <= rg->to)
break;
+retry:
/* Count how many regions are actually needed to execute this add. */
add_reservation_in_range(resv, f, t, &actual_regions_needed, true);
/*
* If no region exists which can be expanded to include the
* specified range, pull a region descriptor from the cache
* and use it for this range.
* Check for sufficient descriptors in the cache to accommodate
* this add operation. Note that actual_regions_needed may be greater
* than in_regions_needed. In this case, we need to make sure that we
* allocate extra entries, such that we have enough for all the
* existing adds_in_progress, plus the excess needed for this
*/* operation.
- if (&rg->link == head || t < rg->from) {
VM_BUG_ON(resv->region_cache_count <= 0);
- if (resv->region_cache_count <
resv->adds_in_progress +
(actual_regions_needed - in_regions_needed)) {
/* region_add operation of range 1 should never need to
* allocate file_region entries.
*/
VM_BUG_ON(t - f <= 1);
resv->region_cache_count--;
nrg = list_first_entry(&resv->region_cache, struct file_region,
link);
list_del(&nrg->link);
/* Must drop lock to allocate a new descriptor. */
spin_unlock(&resv->lock);
for (i = 0; i < (actual_regions_needed - in_regions_needed);
i++) {
trg = kmalloc(sizeof(*trg), GFP_KERNEL);
if (!trg)
goto out_of_memory;
list_add(&trg->link, &allocated_regions);
}
spin_lock(&resv->lock);
nrg->from = f;
nrg->to = t;
list_add(&nrg->link, rg->link.prev);
list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
list_del(&rg->link);
list_add(&rg->link, &resv->region_cache);
resv->region_cache_count++;
}
add += t - f;
goto out_locked;
}goto retry;
- add = add_reservation_in_range(resv, f, t, false);
- add = add_reservation_in_range(resv, f, t, NULL, false);
- resv->adds_in_progress -= in_regions_needed;
-out_locked:
- resv->adds_in_progress--; spin_unlock(&resv->lock); VM_BUG_ON(add < 0);
Does this VM_BUG_ON make sense here? You are testing if add_reservation_in_range() returned a negative value. Right? Perhaps this was previously combined with the out_of_memory situation.
return add;
+out_of_memory:
- list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
list_del(&rg->link);
kfree(rg);
- }
- return -ENOMEM;
}
/* @@ -357,49 +425,72 @@ static long region_add(struct resv_map *resv, long f, long t)
- call to region_add that will actually modify the reserve
- map to add the specified range [f, t). region_chg does
- not change the number of huge pages represented by the
- map. A new file_region structure is added to the cache
- as a placeholder, so that the subsequent region_add
- call will have all the regions it needs and will not fail.
- map. A number of new file_region structures is added to the cache as a
- placeholder, for the subsequent region_add call to use. At least 1
- file_region structure is added.
- out_regions_needed is the number of regions added to the
- resv->adds_in_progress. This value needs to be provided to a follow up call
*/
- to region_add or region_abort for proper accounting.
- Returns the number of huge pages that need to be added to the existing
- reservation map for the range [f, t). This number is greater or equal to
- zero. -ENOMEM is returned if a new file_region structure or cache entry
- is needed and can not be allocated.
-static long region_chg(struct resv_map *resv, long f, long t) +static long region_chg(struct resv_map *resv, long f, long t,
long *out_regions_needed)
{
- long chg = 0;
struct file_region *trg = NULL, *rg = NULL;
long chg = 0, i = 0, to_allocate = 0;
struct list_head allocated_regions;
INIT_LIST_HEAD(&allocated_regions);
spin_lock(&resv->lock);
-retry_locked:
- resv->adds_in_progress++;
- /* Count how many hugepages in this range are NOT respresented. */
- chg = add_reservation_in_range(resv, f, t, out_regions_needed, true);
- if (*out_regions_needed < 1)
*out_regions_needed = 1;
Perhaps just me, but I would prefer an explicit check for zero here. That is the only possible value less than 1, correct? Otherwise, on a quick read of this code one might think out_regions_needed could be a negative number or error code.
resv->adds_in_progress += *out_regions_needed;
/*
- Check for sufficient descriptors in the cache to accommodate
- the number of in progress add operations.
*/
- if (resv->adds_in_progress > resv->region_cache_count) {
struct file_region *trg;
- while (resv->region_cache_count < resv->adds_in_progress) {
to_allocate = resv->adds_in_progress - resv->region_cache_count;
When I first looked at this while loop, I thought we would be calling add_reservation_in_range again after dropping and reacquiring the lock. It took me a minute to realize that that is not required. We only need to get the number of entries that were sugggested by the first call, and that number is incorporated in resv->adds_in_progress. If things have changed, the subsequent region_add can deal with it. It might be a good idea to note this in the commnent.
On Fri, Nov 1, 2019 at 4:23 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 10/29/19 6:36 PM, Mina Almasry wrote:
A follow up patch in this series adds hugetlb cgroup uncharge info the file_region entries in resv->regions. The cgroup uncharge info may differ for different regions, so they can no longer be coalesced at region_add time. So, disable region coalescing in region_add in this patch.
Behavior change:
Say a resv_map exists like this [0->1], [2->3], and [5->6].
Then a region_chg/add call comes in region_chg/add(f=0, t=5).
Old code would generate resv->regions: [0->5], [5->6]. New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5], [5->6].
Special care needs to be taken to handle the resv->adds_in_progress variable correctly. In the past, only 1 region would be added for every region_chg and region_add call. But now, each call may add multiple regions, so we can no longer increment adds_in_progress by 1 in region_chg, or decrement adds_in_progress by 1 after region_add or region_abort. Instead, region_chg calls add_reservation_in_range() to count the number of regions needed and allocates those, and that info is passed to region_add and region_abort to decrement adds_in_progress correctly.
I think this commit message should also mention that we have changed the assumption previously made in the code that region_add after region_chg could never fail. It is described in comments, but would be worth a mention here as well.
Will fix.
Signed-off-by: Mina Almasry almasrymina@google.com
Thanks for the continued updates. I can not spot any major issues with this version of the patch. There are some questions and suggestions embedded below. With those addressed, you can add:
Reviewed-by: Mike Kravetz mike.kravetz@oracle.com
As mentioned previously, this reservation code is quite fragile. Changing it does make me nervous. The good thing is that there is not a 'significant' change in behavior for the normal/expected code paths. It is the handling of the race conditions and unexpected behavior that mostly makes me nervous. I suspect you have not exercised these paths. One thing we could do is add some debug code to force execution of those paths. For example, make add_reservation_in_range(count_only) always return 1 no matter how many regions are needed. This will force region_add more frequently perform additional allocations.
Will test this and report back with fixes if I run into issues.
This patch was of greatest concern to me as it impacts all hugetlbfs users, not just those using new cgroup functionality. With it in good shape, I will start to look at the others. However, my cgroup experience/understanding is limited. So, it would be great if others can also review the cgroup specific changes.
Thanks in advance.
Changes in v8:
- Addressed comments from Mike, especially:
- fixing region_add not allocating enough regions sometimes.
- reverted vma_*_reservation API changes.
Changes in v7:
- region_chg no longer allocates (t-f) / 2 file_region entries.
Changes in v6:
- Fix bug in number of region_caches allocated by region_chg
mm/hugetlb.c | 315 ++++++++++++++++++++++++++++++++++----------------- 1 file changed, 213 insertions(+), 102 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 8d8aa89a9928e..1162eeaf8d160 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -244,110 +244,178 @@ struct file_region { long to; };
+/* Helper that removes a struct file_region from the resv_map cache and returns
- it for use.
- */
+static struct file_region * +get_file_region_entry_from_cache(struct resv_map *resv, long from, long to) +{
struct file_region *nrg = NULL;
VM_BUG_ON(resv->region_cache_count <= 0);
resv->region_cache_count--;
nrg = list_first_entry(&resv->region_cache, struct file_region, link);
VM_BUG_ON(!nrg);
list_del(&nrg->link);
nrg->from = from;
nrg->to = to;
return nrg;
+}
/* Must be called with resv->lock held. Calling this with count_only == true
- will count the number of pages to be added but will not modify the linked
- list.
- list. If regions_needed != NULL and count_only == true, then regions_needed
- will indicate the number of file_regions needed in the cache to carry out to
*/
- add the regions for this range.
static long add_reservation_in_range(struct resv_map *resv, long f, long t,
bool count_only)
long *regions_needed, bool count_only)
{
long chg = 0;
long add = 0; struct list_head *head = &resv->regions;
long last_accounted_offset = f; struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
/* Locate the region we are before or in. */
list_for_each_entry (rg, head, link)
if (f <= rg->to)
break;
if (regions_needed)
*regions_needed = 0;
/* Round our left edge to the current segment if it encloses us. */
if (f > rg->from)
f = rg->from;
chg = t - f;
/* In this loop, we essentially handle an entry for the range
* [last_accounted_offset, rg->from), at every iteration, with some
* bounds checking.
*/
list_for_each_entry_safe(rg, trg, head, link) {
/* Skip irrelevant regions that start before our range. */
if (rg->from < f) {
/* If this region ends after the last accounted offset,
* then we need to update last_accounted_offset.
*/
if (rg->to > last_accounted_offset)
last_accounted_offset = rg->to;
continue;
}
/* Check for and consume any regions we now overlap with. */
nrg = rg;
list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
if (&rg->link == head)
break;
/* When we find a region that starts beyond our range, we've
* finished.
*/ if (rg->from > t) break;
/* We overlap with this area, if it extends further than
* us then we must extend ourselves. Account for its
* existing reservation.
/* Add an entry for last_accounted_offset -> rg->from, and
* update last_accounted_offset. */
if (rg->to > t) {
chg += rg->to - t;
t = rg->to;
if (rg->from > last_accounted_offset) {
add += rg->from - last_accounted_offset;
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, rg->from);
list_add(&nrg->link, rg->link.prev);
} else if (regions_needed)
*regions_needed += 1; }
chg -= rg->to - rg->from;
if (!count_only && rg != nrg) {
list_del(&rg->link);
kfree(rg);
}
last_accounted_offset = rg->to;
That last assignment is unneeded. Correct?
Not to make you nervous, but this assignment is needed.
The basic idea is that there are 2 loop invariants here: 1. Everything before last_accounted_offset is filled in with file_regions. 2. rg points to the first region past last_account_offset.
Each loop iteration compares rg->from to last_accounted_offset, and if there is a gap, it creates a new region to fill this gap. Then this assignment restores loop invariant #2 by assigning last_accounted_offset to rg->to, since now everything before rg->to is filled in with file_regions.
}
if (!count_only) {
nrg->from = f;
nrg->to = t;
/* Handle the case where our range extends beyond
* last_accounted_offset.
*/
if (last_accounted_offset < t) {
add += t - last_accounted_offset;
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, t);
list_add(&nrg->link, rg->link.prev);
} else if (regions_needed)
*regions_needed += 1;
last_accounted_offset = t; }
return chg;
return add;
}
/*
- Add the huge page range represented by [f, t) to the reserve
- map. Existing regions will be expanded to accommodate the specified
- range, or a region will be taken from the cache. Sufficient regions
- must exist in the cache due to the previous call to region_chg with
- the same range.
- map. Regions will be taken from the cache to fill in this range.
- Sufficient regions should exist in the cache due to the previous
- call to region_chg with the same range, but in some cases the cache will not
- have sufficient entries. The extra needed entries will be allocated.
Let's mention that the reason sufficient entries may not exist is due to races with other code doing region_add or region_del.
Will fix.
- regions_needed is the out value provided by a previous call to region_chg.
- Return the number of new huge pages added to the map. This
- number is greater than or equal to zero.
- Return the number of new huge pages added to the map. This number is greater
- than or equal to zero. If file_region entries needed to be allocated for
- this operation and we were not able to allocate, it ruturns -ENOMEM.
- region_add of regions of length 1 never allocate file_regions and cannot
- fail.
Let's add the reason why region_add for regions of length 1 will never fail. Specifically, region_chg will always allocate at least 1 entry and a region_add for 1 page will only require at most 1 entry.
Will fix.
*/ -static long region_add(struct resv_map *resv, long f, long t) +static long region_add(struct resv_map *resv, long f, long t,
long in_regions_needed)
{
struct list_head *head = &resv->regions;
struct file_region *rg, *nrg;
long add = 0;
long add = 0, actual_regions_needed = 0, i = 0;
struct file_region *trg = NULL, *rg = NULL;
struct list_head allocated_regions;
INIT_LIST_HEAD(&allocated_regions); spin_lock(&resv->lock);
/* Locate the region we are either in or before. */
list_for_each_entry(rg, head, link)
if (f <= rg->to)
break;
+retry:
/* Count how many regions are actually needed to execute this add. */
add_reservation_in_range(resv, f, t, &actual_regions_needed, true); /*
* If no region exists which can be expanded to include the
* specified range, pull a region descriptor from the cache
* and use it for this range.
* Check for sufficient descriptors in the cache to accommodate
* this add operation. Note that actual_regions_needed may be greater
* than in_regions_needed. In this case, we need to make sure that we
* allocate extra entries, such that we have enough for all the
* existing adds_in_progress, plus the excess needed for this
* operation. */
if (&rg->link == head || t < rg->from) {
VM_BUG_ON(resv->region_cache_count <= 0);
if (resv->region_cache_count <
resv->adds_in_progress +
(actual_regions_needed - in_regions_needed)) {
/* region_add operation of range 1 should never need to
* allocate file_region entries.
*/
VM_BUG_ON(t - f <= 1);
resv->region_cache_count--;
nrg = list_first_entry(&resv->region_cache, struct file_region,
link);
list_del(&nrg->link);
/* Must drop lock to allocate a new descriptor. */
spin_unlock(&resv->lock);
for (i = 0; i < (actual_regions_needed - in_regions_needed);
i++) {
trg = kmalloc(sizeof(*trg), GFP_KERNEL);
if (!trg)
goto out_of_memory;
list_add(&trg->link, &allocated_regions);
}
spin_lock(&resv->lock);
nrg->from = f;
nrg->to = t;
list_add(&nrg->link, rg->link.prev);
list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
list_del(&rg->link);
list_add(&rg->link, &resv->region_cache);
resv->region_cache_count++;
}
add += t - f;
goto out_locked;
goto retry; }
add = add_reservation_in_range(resv, f, t, false);
add = add_reservation_in_range(resv, f, t, NULL, false);
resv->adds_in_progress -= in_regions_needed;
-out_locked:
resv->adds_in_progress--; spin_unlock(&resv->lock); VM_BUG_ON(add < 0);
Does this VM_BUG_ON make sense here? You are testing if add_reservation_in_range() returned a negative value. Right? Perhaps this was previously combined with the out_of_memory situation.
Yes it tests add_reservation_in_range() returned a negative value, which indicates something very wrong happened. Doesn't seem critical but doesn't hurt either.
return add;
+out_of_memory:
list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
list_del(&rg->link);
kfree(rg);
}
return -ENOMEM;
}
/* @@ -357,49 +425,72 @@ static long region_add(struct resv_map *resv, long f, long t)
- call to region_add that will actually modify the reserve
- map to add the specified range [f, t). region_chg does
- not change the number of huge pages represented by the
- map. A new file_region structure is added to the cache
- as a placeholder, so that the subsequent region_add
- call will have all the regions it needs and will not fail.
- map. A number of new file_region structures is added to the cache as a
- placeholder, for the subsequent region_add call to use. At least 1
- file_region structure is added.
- out_regions_needed is the number of regions added to the
- resv->adds_in_progress. This value needs to be provided to a follow up call
*/
- to region_add or region_abort for proper accounting.
- Returns the number of huge pages that need to be added to the existing
- reservation map for the range [f, t). This number is greater or equal to
- zero. -ENOMEM is returned if a new file_region structure or cache entry
- is needed and can not be allocated.
-static long region_chg(struct resv_map *resv, long f, long t) +static long region_chg(struct resv_map *resv, long f, long t,
long *out_regions_needed)
{
long chg = 0;
struct file_region *trg = NULL, *rg = NULL;
long chg = 0, i = 0, to_allocate = 0;
struct list_head allocated_regions;
INIT_LIST_HEAD(&allocated_regions); spin_lock(&resv->lock);
-retry_locked:
resv->adds_in_progress++;
/* Count how many hugepages in this range are NOT respresented. */
chg = add_reservation_in_range(resv, f, t, out_regions_needed, true);
if (*out_regions_needed < 1)
*out_regions_needed = 1;
Perhaps just me, but I would prefer an explicit check for zero here. That is the only possible value less than 1, correct? Otherwise, on a quick read of this code one might think out_regions_needed could be a negative number or error code.
Will fix.
resv->adds_in_progress += *out_regions_needed; /* * Check for sufficient descriptors in the cache to accommodate * the number of in progress add operations. */
if (resv->adds_in_progress > resv->region_cache_count) {
struct file_region *trg;
while (resv->region_cache_count < resv->adds_in_progress) {
to_allocate = resv->adds_in_progress - resv->region_cache_count;
When I first looked at this while loop, I thought we would be calling add_reservation_in_range again after dropping and reacquiring the lock. It took me a minute to realize that that is not required. We only need to get the number of entries that were sugggested by the first call, and that number is incorporated in resv->adds_in_progress. If things have changed, the subsequent region_add can deal with it. It might be a good idea to note this in the commnent.
Will add.
-- Mike Kravetz
VM_BUG_ON(resv->adds_in_progress - resv->region_cache_count > 1); /* Must drop lock to allocate a new descriptor. */
resv->adds_in_progress--; spin_unlock(&resv->lock);
trg = kmalloc(sizeof(*trg), GFP_KERNEL);
if (!trg)
return -ENOMEM;
for (i = 0; i < to_allocate; i++) {
trg = kmalloc(sizeof(*trg), GFP_KERNEL);
if (!trg)
goto out_of_memory;
list_add(&trg->link, &allocated_regions);
} spin_lock(&resv->lock);
list_add(&trg->link, &resv->region_cache);
resv->region_cache_count++;
goto retry_locked;
}
chg = add_reservation_in_range(resv, f, t, true);
list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
list_del(&rg->link);
list_add(&rg->link, &resv->region_cache);
resv->region_cache_count++;
}
} spin_unlock(&resv->lock); return chg;
+out_of_memory:
list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
list_del(&rg->link);
kfree(rg);
}
return -ENOMEM;
}
/* @@ -407,17 +498,20 @@ static long region_chg(struct resv_map *resv, long f, long t)
- of the resv_map keeps track of the operations in progress between
- calls to region_chg and region_add. Operations are sometimes
- aborted after the call to region_chg. In such cases, region_abort
- is called to decrement the adds_in_progress counter.
- is called to decrement the adds_in_progress counter. regions_needed
- is the value returned by the region_chg call, it is used to decrement
*/
- the adds_in_progress counter.
- NOTE: The range arguments [f, t) are not needed or used in this
- routine. They are kept to make reading the calling code easier as
- arguments will match the associated region_chg call.
-static void region_abort(struct resv_map *resv, long f, long t) +static void region_abort(struct resv_map *resv, long f, long t,
long regions_needed)
{ spin_lock(&resv->lock); VM_BUG_ON(!resv->region_cache_count);
resv->adds_in_progress--;
resv->adds_in_progress -= regions_needed; spin_unlock(&resv->lock);
}
@@ -1904,6 +1998,7 @@ static long __vma_reservation_common(struct hstate *h, struct resv_map *resv; pgoff_t idx; long ret;
long dummy_out_regions_needed; resv = vma_resv_map(vma); if (!resv)
@@ -1912,20 +2007,29 @@ static long __vma_reservation_common(struct hstate *h, idx = vma_hugecache_offset(h, vma, addr); switch (mode) { case VMA_NEEDS_RESV:
ret = region_chg(resv, idx, idx + 1);
ret = region_chg(resv, idx, idx + 1, &dummy_out_regions_needed);
/* We assume that vma_reservation_* routines always operate on
* 1 page, and that adding to resv map a 1 page entry can only
* ever require 1 region.
*/
VM_BUG_ON(dummy_out_regions_needed != 1); break; case VMA_COMMIT_RESV:
ret = region_add(resv, idx, idx + 1);
ret = region_add(resv, idx, idx + 1, 1);
/* region_add calls of range 1 should never fail. */
VM_BUG_ON(ret < 0); break; case VMA_END_RESV:
region_abort(resv, idx, idx + 1);
region_abort(resv, idx, idx + 1, 1); ret = 0; break; case VMA_ADD_RESV:
if (vma->vm_flags & VM_MAYSHARE)
ret = region_add(resv, idx, idx + 1);
else {
region_abort(resv, idx, idx + 1);
if (vma->vm_flags & VM_MAYSHARE) {
ret = region_add(resv, idx, idx + 1, 1);
/* region_add calls of range 1 should never fail. */
VM_BUG_ON(ret < 0);
} else {
region_abort(resv, idx, idx + 1, 1); ret = region_del(resv, idx, idx + 1); } break;
@@ -4578,12 +4682,12 @@ int hugetlb_reserve_pages(struct inode *inode, struct vm_area_struct *vma, vm_flags_t vm_flags) {
long ret, chg;
long ret, chg, add = -1; struct hstate *h = hstate_inode(inode); struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; struct hugetlb_cgroup *h_cg;
long gbl_reserve;
long gbl_reserve, regions_needed = 0; /* This should never happen */ if (from > to) {
@@ -4613,7 +4717,7 @@ int hugetlb_reserve_pages(struct inode *inode, */ resv_map = inode_resv_map(inode);
chg = region_chg(resv_map, from, to);
chg = region_chg(resv_map, from, to, ®ions_needed); } else { /* Private mapping. */
@@ -4683,9 +4787,14 @@ int hugetlb_reserve_pages(struct inode *inode, * else has to be done for private mappings here */ if (!vma || vma->vm_flags & VM_MAYSHARE) {
long add = region_add(resv_map, from, to);
if (unlikely(chg > add)) {
add = region_add(resv_map, from, to, regions_needed);
if (unlikely(add < 0)) {
hugetlb_acct_memory(h, -gbl_reserve);
/* put back original number of pages, chg */
(void)hugepage_subpool_put_pages(spool, chg);
goto out_err;
} else if (unlikely(chg > add)) { /* * pages in this range were added to the reserve * map between region_chg and region_add. This
@@ -4703,9 +4812,11 @@ int hugetlb_reserve_pages(struct inode *inode, return 0; out_err: if (!vma || vma->vm_flags & VM_MAYSHARE)
/* Don't call region_abort if region_chg failed */
if (chg >= 0)
region_abort(resv_map, from, to);
/* Only call region_abort if the region_chg succeeded but the
* region_add failed or didn't run.
*/
if (chg >= 0 && add < 0)
region_abort(resv_map, from, to, regions_needed); if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) kref_put(&resv_map->refs, resv_map_release); return ret;
-- 2.24.0.rc1.363.gb1bccd3e3d-goog
On 11/4/19 1:04 PM, Mina Almasry wrote:
On Fri, Nov 1, 2019 at 4:23 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 10/29/19 6:36 PM, Mina Almasry wrote:
static long add_reservation_in_range(struct resv_map *resv, long f, long t,
bool count_only)
long *regions_needed, bool count_only)
{
long chg = 0;
long add = 0; struct list_head *head = &resv->regions;
long last_accounted_offset = f; struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
/* Locate the region we are before or in. */
list_for_each_entry (rg, head, link)
if (f <= rg->to)
break;
if (regions_needed)
*regions_needed = 0;
/* Round our left edge to the current segment if it encloses us. */
if (f > rg->from)
f = rg->from;
chg = t - f;
/* In this loop, we essentially handle an entry for the range
* [last_accounted_offset, rg->from), at every iteration, with some
* bounds checking.
*/
list_for_each_entry_safe(rg, trg, head, link) {
/* Skip irrelevant regions that start before our range. */
if (rg->from < f) {
/* If this region ends after the last accounted offset,
* then we need to update last_accounted_offset.
*/
if (rg->to > last_accounted_offset)
last_accounted_offset = rg->to;
continue;
}
/* Check for and consume any regions we now overlap with. */
nrg = rg;
list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
if (&rg->link == head)
break;
/* When we find a region that starts beyond our range, we've
* finished.
*/ if (rg->from > t) break;
/* We overlap with this area, if it extends further than
* us then we must extend ourselves. Account for its
* existing reservation.
/* Add an entry for last_accounted_offset -> rg->from, and
* update last_accounted_offset. */
if (rg->to > t) {
chg += rg->to - t;
t = rg->to;
if (rg->from > last_accounted_offset) {
add += rg->from - last_accounted_offset;
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, rg->from);
list_add(&nrg->link, rg->link.prev);
} else if (regions_needed)
*regions_needed += 1; }
chg -= rg->to - rg->from;
if (!count_only && rg != nrg) {
list_del(&rg->link);
kfree(rg);
}
last_accounted_offset = rg->to;
That last assignment is unneeded. Correct?
Not to make you nervous, but this assignment is needed.
The basic idea is that there are 2 loop invariants here:
- Everything before last_accounted_offset is filled in with file_regions.
- rg points to the first region past last_account_offset.
Each loop iteration compares rg->from to last_accounted_offset, and if there is a gap, it creates a new region to fill this gap. Then this assignment restores loop invariant #2 by assigning last_accounted_offset to rg->to, since now everything before rg->to is filled in with file_regions.
My apologies!
}
if (!count_only) {
nrg->from = f;
nrg->to = t;
/* Handle the case where our range extends beyond
* last_accounted_offset.
*/
if (last_accounted_offset < t) {
add += t - last_accounted_offset;
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, t);
list_add(&nrg->link, rg->link.prev);
} else if (regions_needed)
*regions_needed += 1;
last_accounted_offset = t;
The question about an unnecessary assignment was supposed to be directed at the above line.
On Mon, Nov 4, 2019 at 1:15 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 11/4/19 1:04 PM, Mina Almasry wrote:
On Fri, Nov 1, 2019 at 4:23 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 10/29/19 6:36 PM, Mina Almasry wrote:
static long add_reservation_in_range(struct resv_map *resv, long f, long t,
bool count_only)
long *regions_needed, bool count_only)
{
long chg = 0;
long add = 0; struct list_head *head = &resv->regions;
long last_accounted_offset = f; struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
/* Locate the region we are before or in. */
list_for_each_entry (rg, head, link)
if (f <= rg->to)
break;
if (regions_needed)
*regions_needed = 0;
/* Round our left edge to the current segment if it encloses us. */
if (f > rg->from)
f = rg->from;
chg = t - f;
/* In this loop, we essentially handle an entry for the range
* [last_accounted_offset, rg->from), at every iteration, with some
* bounds checking.
*/
list_for_each_entry_safe(rg, trg, head, link) {
/* Skip irrelevant regions that start before our range. */
if (rg->from < f) {
/* If this region ends after the last accounted offset,
* then we need to update last_accounted_offset.
*/
if (rg->to > last_accounted_offset)
last_accounted_offset = rg->to;
continue;
}
/* Check for and consume any regions we now overlap with. */
nrg = rg;
list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
if (&rg->link == head)
break;
/* When we find a region that starts beyond our range, we've
* finished.
*/ if (rg->from > t) break;
/* We overlap with this area, if it extends further than
* us then we must extend ourselves. Account for its
* existing reservation.
/* Add an entry for last_accounted_offset -> rg->from, and
* update last_accounted_offset. */
if (rg->to > t) {
chg += rg->to - t;
t = rg->to;
if (rg->from > last_accounted_offset) {
add += rg->from - last_accounted_offset;
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, rg->from);
list_add(&nrg->link, rg->link.prev);
} else if (regions_needed)
*regions_needed += 1; }
chg -= rg->to - rg->from;
if (!count_only && rg != nrg) {
list_del(&rg->link);
kfree(rg);
}
last_accounted_offset = rg->to;
That last assignment is unneeded. Correct?
Not to make you nervous, but this assignment is needed.
The basic idea is that there are 2 loop invariants here:
- Everything before last_accounted_offset is filled in with file_regions.
- rg points to the first region past last_account_offset.
Each loop iteration compares rg->from to last_accounted_offset, and if there is a gap, it creates a new region to fill this gap. Then this assignment restores loop invariant #2 by assigning last_accounted_offset to rg->to, since now everything before rg->to is filled in with file_regions.
My apologies!
}
if (!count_only) {
nrg->from = f;
nrg->to = t;
/* Handle the case where our range extends beyond
* last_accounted_offset.
*/
if (last_accounted_offset < t) {
add += t - last_accounted_offset;
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, t);
list_add(&nrg->link, rg->link.prev);
} else if (regions_needed)
*regions_needed += 1;
last_accounted_offset = t;
The question about an unnecessary assignment was supposed to be directed at the above line.
Oh, yes. That assignment is completely unnecessary; the function just exits after pretty much. Will remove, thanks!
-- Mike Kravetz
}
return chg;
return add;
}
Cc: Andrew This series is getting closer to consideration for the mm tree. Mina, Be sure to cc Andrew with next version of series.
On 10/29/19 6:36 PM, Mina Almasry wrote:
These counters will track hugetlb reservations rather than hugetlb memory faulted in. This patch only adds the counter, following patches add the charging and uncharging of the counter.
I honestly am not sure the preferred method for including the overall design in a commit message. Certainly it should be in the first patch. Perhaps, say this is patch 1 of a 9 patch series here.
Problem: Currently tasks attempting to allocate more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to allocate hugetlb memory only more than its hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault the memory in.
We have developers interested in using hugetlb_cgroups, and they have expressed dissatisfaction regarding this behavior. We'd like to improve this behavior such that tasks violating the hugetlb_cgroup limits get an error on mmap/shmget time, rather than getting SIGBUS'd when they try to fault the excess memory in.
The underlying problem is that today's hugetlb_cgroup accounting happens at hugetlb memory *fault* time, rather than at *reservation* time. Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and the offending task gets SIGBUS'd.
Proposed Solution: A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This counter has slightly different semantics than hugetlb.xMB.[limit|usage]_in_bytes:
- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb memory faulted in without a prior reservation.
- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve more memory than reservation_limit_in_bytes, the kernel will fail this reservation.
This proposal is implemented in this patch series, with tests to verify functionality and show the usage. We also added cgroup-v2 support to hugetlb_cgroup so that the new use cases can be extended to v2.
Alternatives considered:
A new cgroup, instead of only a new page_counter attached to the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code duplication with hugetlb_cgroup. Keeping hugetlb related page counters under hugetlb_cgroup seemed cleaner as well.
Instead of adding a new counter, we considered adding a sysctl that modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at reservation time rather than fault time. Adding a new page_counter seems better as userspace could, if it wants, choose to enforce different cgroups differently: one via limit_in_bytes, and another via reservation_limit_in_bytes. This could be very useful if you're transitioning how hugetlb memory is partitioned on your system one cgroup at a time, for example. Also, someone may find usage for both limit_in_bytes and reservation_limit_in_bytes concurrently, and this approach gives them the option to do so.
Testing:
I think that simply mentioning the use of hugetlbfs for regression testing would be sufficient here.
- Added tests passing.
- libhugetlbfs tests mostly passing, but some tests have trouble with and without this patch series. Seems environment issue rather than code:
- Overall results: ********** TEST SUMMARY
2M
32-bit 64-bit
Total testcases: 84 0
Skipped: 0 0
PASS: 66 0
FAIL: 14 0
- Killed by signal: 0 0
- Bad configuration: 4 0
Expected FAIL: 0 0
Unexpected PASS: 0 0
- Test not present: 0 0
- Strange test result: 0 0
It is curious that you only ran the tests for 32 bit applications. Certainly the more common case today is 64 bit. I don't think there are any surprises for you as I also have been running hugetlbfs on this series.
On 10/29/19 6:36 PM, Mina Almasry wrote:
Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage or hugetlb reservation counter.
Adds a new interface to uncharge a hugetlb_cgroup counter via hugetlb_cgroup_uncharge_counter.
Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init, hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
Signed-off-by: Mina Almasry almasrymina@google.com
include/linux/hugetlb_cgroup.h | 67 +++++++++++++--------- mm/hugetlb.c | 17 +++--- mm/hugetlb_cgroup.c | 100 +++++++++++++++++++++++++-------- 3 files changed, 130 insertions(+), 54 deletions(-)
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 063962f6dfc6a..1bb58a63af586 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -22,27 +22,35 @@ struct hugetlb_cgroup;
- Minimum page order trackable by hugetlb cgroup.
- At least 3 pages are necessary for all the tracking information.
*/ -#define HUGETLB_CGROUP_MIN_ORDER 2 +#define HUGETLB_CGROUP_MIN_ORDER 3
Correct me if misremembering, but I think the reson you changed this was so that you could use page[3].private. Correct? In that case isn't page[3] the last page of an order 2 allocation? If my understanding is correct, then leave HUGETLB_CGROUP_MIN_ORDER as is and update the preceding comment to say that at least 4 pages are necessary.
#ifdef CONFIG_CGROUP_HUGETLB
-static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page,
bool reserved)
{ VM_BUG_ON_PAGE(!PageHuge(page), page);
if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) return NULL;
- return (struct hugetlb_cgroup *)page[2].private;
- if (reserved)
return (struct hugetlb_cgroup *)page[3].private;
- else
return (struct hugetlb_cgroup *)page[2].private;
}
-static inline -int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg) +static inline int set_hugetlb_cgroup(struct page *page,
struct hugetlb_cgroup *h_cg,
bool reservation)
{ VM_BUG_ON_PAGE(!PageHuge(page), page);
if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) return -1;
- page[2].private = (unsigned long)h_cg;
- if (reservation)
page[3].private = (unsigned long)h_cg;
- else
return 0;page[2].private = (unsigned long)h_cg;
}
@@ -52,26 +60,33 @@ static inline bool hugetlb_cgroup_disabled(void) }
extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr);
struct hugetlb_cgroup **ptr,
bool reserved);
extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg,
struct page *page);
struct page *page, bool reserved);
extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
struct page *page);
struct page *page, bool reserved);
extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg);
struct hugetlb_cgroup *h_cg,
bool reserved);
+extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
unsigned long nr_pages);
extern void hugetlb_cgroup_file_init(void) __init; extern void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage);
#else -static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page,
bool reserved)
{ return NULL; }
-static inline -int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg) +static inline int set_hugetlb_cgroup(struct page *page,
struct hugetlb_cgroup *h_cg, bool reserved)
{ return 0; } @@ -81,28 +96,30 @@ static inline bool hugetlb_cgroup_disabled(void) return true; }
-static inline int -hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr)
+static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr,
bool reserved)
{ return 0; }
-static inline void -hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg,
struct page *page)
+static inline void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg,
struct page *page,
bool reserved)
{ }
-static inline void -hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page) +static inline void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
struct page *page,
bool reserved)
{ }
-static inline void -hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg)
+static inline void hugetlb_cgroup_uncharge_cgroup(int idx,
unsigned long nr_pages,
struct hugetlb_cgroup *h_cg,
bool reserved)
{ }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 37195cd60a345..325d5454bf168 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1140,7 +1140,7 @@ static void update_and_free_page(struct hstate *h, struct page *page) 1 << PG_active | 1 << PG_private | 1 << PG_writeback); }
- VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
- VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page, false), page);
Shouldn't we add a VM_BUG_ON_PAGE for reserved as well? Oh, I see it is done in a later patch. I guess it is not here as there are no users of reserved yet? Same observation in other places in hugetlb.c below. Since you add the API changes for reserved here, as well as define page[3].private to store info, I don't think it would hurt to add the initialization and checking and cleanup here as well. Thoughts?
set_compound_page_dtor(page, NULL_COMPOUND_DTOR); set_page_refcounted(page); if (hstate_is_gigantic(h)) { @@ -1250,8 +1250,8 @@ void free_huge_page(struct page *page)
spin_lock(&hugetlb_lock); clear_page_huge_active(page);
- hugetlb_cgroup_uncharge_page(hstate_index(h),
pages_per_huge_page(h), page);
- hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h),
if (restore_reserve) h->resv_huge_pages++;page, false);
@@ -1277,7 +1277,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) INIT_LIST_HEAD(&page->lru); set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); spin_lock(&hugetlb_lock);
- set_hugetlb_cgroup(page, NULL);
- set_hugetlb_cgroup(page, NULL, false); h->nr_huge_pages++; h->nr_huge_pages_node[nid]++; spin_unlock(&hugetlb_lock);
@@ -2063,7 +2063,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, gbl_chg = 1; }
- ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
- ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg,
if (ret) goto out_subpool_put;false);
@@ -2087,7 +2088,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, list_move(&page->lru, &h->hugepage_activelist); /* Fall through */ }
- hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page,
false);
spin_unlock(&hugetlb_lock);
set_page_private(page, (unsigned long)spool);
@@ -2111,7 +2113,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, return page;
out_uncharge_cgroup:
- hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
- hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg,
false);
out_subpool_put: if (map_chg || avoid_reserve) hugepage_subpool_put_pages(spool, 1); diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 1ed4448ca41d3..854117513979b 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -73,8 +73,12 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg) int idx;
for (idx = 0; idx < hugetlb_max_hstate; idx++) {
if (page_counter_read(&h_cg->hugepage[idx]))
if (page_counter_read(
hugetlb_cgroup_get_counter(h_cg, idx, true)) ||
page_counter_read(
hugetlb_cgroup_get_counter(h_cg, idx, false))) { return true;
} return false;}
} @@ -85,18 +89,32 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, int idx;
for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
struct page_counter *parent = NULL;struct page_counter *counter = &h_cgroup->hugepage[idx];
Should we perhaps rename 'parent' to 'fault_parent' to be consistent? That makes me think if perhaps the naming in the previous patch should be more explicit. Make the existing names explicitly contin 'fault' as the new names contain 'reservation'. Just a thought.
unsigned long limit; int ret;struct page_counter *reserved_parent = NULL;
if (parent_h_cgroup)
parent = &parent_h_cgroup->hugepage[idx];
page_counter_init(counter, parent);
if (parent_h_cgroup) {
parent = hugetlb_cgroup_get_counter(parent_h_cgroup,
idx, false);
reserved_parent = hugetlb_cgroup_get_counter(
parent_h_cgroup, idx, true);
}
page_counter_init(hugetlb_cgroup_get_counter(h_cgroup, idx,
false),
parent);
page_counter_init(hugetlb_cgroup_get_counter(h_cgroup, idx,
true),
reserved_parent);
limit = round_down(PAGE_COUNTER_MAX, 1 << huge_page_order(&hstates[idx]));
ret = page_counter_set_max(counter, limit);
ret = page_counter_set_max(
hugetlb_cgroup_get_counter(h_cgroup, idx, false),
limit);
ret = page_counter_set_max(
VM_BUG_ON(ret); }hugetlb_cgroup_get_counter(h_cgroup, idx, true), limit);
} @@ -126,6 +144,26 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css) kfree(h_cgroup); }
+static void hugetlb_cgroup_move_parent_reservation(int idx,
struct hugetlb_cgroup *h_cg)
+{
- struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
- /* Move the reservation counters. */
- if (!parent_hugetlb_cgroup(h_cg)) {
parent = root_h_cgroup;
/* root has no limit */
page_counter_charge(
&root_h_cgroup->reserved_hugepage[idx],
page_counter_read(
hugetlb_cgroup_get_counter(h_cg, idx, true)));
- }
- /* Take the pages off the local counter */
- page_counter_cancel(
hugetlb_cgroup_get_counter(h_cg, idx, true),
page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx, true)));
+}
I know next to nothing about cgroups and am just comparing this to the existing hugetlb_cgroup_move_parent() routine. hugetlb_cgroup_move_parent updates the cgroup pointer in each page being moved. Do we need to do something similar for reservations being moved (move pointer in reservation)?
On Thu, Nov 7, 2019 at 3:42 PM Mike Kravetz mike.kravetz@oracle.com wrote:
Cc: Andrew This series is getting closer to consideration for the mm tree. Mina, Be sure to cc Andrew with next version of series.
Absolutely!
On 10/29/19 6:36 PM, Mina Almasry wrote:
These counters will track hugetlb reservations rather than hugetlb memory faulted in. This patch only adds the counter, following patches add the charging and uncharging of the counter.
I honestly am not sure the preferred method for including the overall design in a commit message. Certainly it should be in the first patch. Perhaps, say this is patch 1 of a 9 patch series here.
Will do. I read somewhere I can't find right now it's better this way so that the useful information becomes part of the git log. If anyone has strong opinions on this I'll just go back to putting it into a cover letter.
Problem: Currently tasks attempting to allocate more hugetlb memory than is available get a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1]. However, if a task attempts to allocate hugetlb memory only more than its hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call, but will SIGBUS the task when it attempts to fault the memory in.
We have developers interested in using hugetlb_cgroups, and they have expressed dissatisfaction regarding this behavior. We'd like to improve this behavior such that tasks violating the hugetlb_cgroup limits get an error on mmap/shmget time, rather than getting SIGBUS'd when they try to fault the excess memory in.
The underlying problem is that today's hugetlb_cgroup accounting happens at hugetlb memory *fault* time, rather than at *reservation* time. Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and the offending task gets SIGBUS'd.
Proposed Solution: A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This counter has slightly different semantics than hugetlb.xMB.[limit|usage]_in_bytes:
- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb memory faulted in without a prior reservation.
- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve more memory than reservation_limit_in_bytes, the kernel will fail this reservation.
This proposal is implemented in this patch series, with tests to verify functionality and show the usage. We also added cgroup-v2 support to hugetlb_cgroup so that the new use cases can be extended to v2.
Alternatives considered:
A new cgroup, instead of only a new page_counter attached to the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code duplication with hugetlb_cgroup. Keeping hugetlb related page counters under hugetlb_cgroup seemed cleaner as well.
Instead of adding a new counter, we considered adding a sysctl that modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at reservation time rather than fault time. Adding a new page_counter seems better as userspace could, if it wants, choose to enforce different cgroups differently: one via limit_in_bytes, and another via reservation_limit_in_bytes. This could be very useful if you're transitioning how hugetlb memory is partitioned on your system one cgroup at a time, for example. Also, someone may find usage for both limit_in_bytes and reservation_limit_in_bytes concurrently, and this approach gives them the option to do so.
Testing:
I think that simply mentioning the use of hugetlbfs for regression testing would be sufficient here.
Will do.
- Added tests passing.
- libhugetlbfs tests mostly passing, but some tests have trouble with and without this patch series. Seems environment issue rather than code:
- Overall results: ********** TEST SUMMARY
2M
32-bit 64-bit
Total testcases: 84 0
Skipped: 0 0
PASS: 66 0
FAIL: 14 0
- Killed by signal: 0 0
- Bad configuration: 4 0
Expected FAIL: 0 0
Unexpected PASS: 0 0
- Test not present: 0 0
- Strange test result: 0 0
It is curious that you only ran the tests for 32 bit applications. Certainly the more common case today is 64 bit. I don't think there are any surprises for you as I also have been running hugetlbfs on this series.
I did run them, with similar results. I'll add them.
-- Mike Kravetz
- Failing tests:
- elflink_rw_and_share_test("linkhuge_rw") segfaults with and without this patch series.
- LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc (2M: 32): FAIL Address is not hugepage
- LD_PRELOAD=libhugetlbfs.so HUGETLB_RESTRICT_EXE=unknown:malloc HUGETLB_MORECORE=yes malloc (2M: 32): FAIL Address is not hugepage
- LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc_manysmall (2M: 32): FAIL Address is not hugepage
- GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes heapshrink (2M: 32): FAIL Heap not on hugepages
- GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so libheapshrink.so HUGETLB_MORECORE=yes heapshrink (2M: 32): FAIL Heap not on hugepages
- HUGETLB_ELFMAP=RW linkhuge_rw (2M: 32): FAIL small_data is not hugepage
- HUGETLB_ELFMAP=RW HUGETLB_MINIMAL_COPY=no linkhuge_rw (2M: 32): FAIL small_data is not hugepage
- alloc-instantiate-race shared (2M: 32): Bad configuration: sched_setaffinity(cpu1): Invalid argument - FAIL Child 1 killed by signal Killed
- shmoverride_linked (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument
- shmoverride_linked_static (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked_static (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so shmoverride_unlinked (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so HUGETLB_SHM=yes shmoverride_unlinked (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid argument
Signed-off-by: Mina Almasry almasrymina@google.com Acked-by: Hillf Danton hdanton@sina.com
On Thu, Nov 7, 2019 at 4:57 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 10/29/19 6:36 PM, Mina Almasry wrote:
Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb usage or hugetlb reservation counter.
Adds a new interface to uncharge a hugetlb_cgroup counter via hugetlb_cgroup_uncharge_counter.
Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init, hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
Signed-off-by: Mina Almasry almasrymina@google.com
include/linux/hugetlb_cgroup.h | 67 +++++++++++++--------- mm/hugetlb.c | 17 +++--- mm/hugetlb_cgroup.c | 100 +++++++++++++++++++++++++-------- 3 files changed, 130 insertions(+), 54 deletions(-)
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 063962f6dfc6a..1bb58a63af586 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -22,27 +22,35 @@ struct hugetlb_cgroup;
- Minimum page order trackable by hugetlb cgroup.
- At least 3 pages are necessary for all the tracking information.
*/ -#define HUGETLB_CGROUP_MIN_ORDER 2 +#define HUGETLB_CGROUP_MIN_ORDER 3
Correct me if misremembering, but I think the reson you changed this was so that you could use page[3].private. Correct? In that case isn't page[3] the last page of an order 2 allocation? If my understanding is correct, then leave HUGETLB_CGROUP_MIN_ORDER as is and update the preceding comment to say that at least 4 pages are necessary.
Yes, I just misunderstood what MIN_ORDER means. I'll revert the code change.
#ifdef CONFIG_CGROUP_HUGETLB
-static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page,
bool reserved)
{ VM_BUG_ON_PAGE(!PageHuge(page), page);
if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) return NULL;
return (struct hugetlb_cgroup *)page[2].private;
if (reserved)
return (struct hugetlb_cgroup *)page[3].private;
else
return (struct hugetlb_cgroup *)page[2].private;
}
-static inline -int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg) +static inline int set_hugetlb_cgroup(struct page *page,
struct hugetlb_cgroup *h_cg,
bool reservation)
{ VM_BUG_ON_PAGE(!PageHuge(page), page);
if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) return -1;
page[2].private = (unsigned long)h_cg;
if (reservation)
page[3].private = (unsigned long)h_cg;
else
page[2].private = (unsigned long)h_cg; return 0;
}
@@ -52,26 +60,33 @@ static inline bool hugetlb_cgroup_disabled(void) }
extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr);
struct hugetlb_cgroup **ptr,
bool reserved);
extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg,
struct page *page);
struct page *page, bool reserved);
extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
struct page *page);
struct page *page, bool reserved);
extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg);
struct hugetlb_cgroup *h_cg,
bool reserved);
+extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
unsigned long nr_pages);
extern void hugetlb_cgroup_file_init(void) __init; extern void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage);
#else -static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page,
bool reserved)
{ return NULL; }
-static inline -int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg) +static inline int set_hugetlb_cgroup(struct page *page,
struct hugetlb_cgroup *h_cg, bool reserved)
{ return 0; } @@ -81,28 +96,30 @@ static inline bool hugetlb_cgroup_disabled(void) return true; }
-static inline int -hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr)
+static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr,
bool reserved)
{ return 0; }
-static inline void -hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg,
struct page *page)
+static inline void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg,
struct page *page,
bool reserved)
{ }
-static inline void -hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page) +static inline void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
struct page *page,
bool reserved)
{ }
-static inline void -hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg)
+static inline void hugetlb_cgroup_uncharge_cgroup(int idx,
unsigned long nr_pages,
struct hugetlb_cgroup *h_cg,
bool reserved)
{ }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 37195cd60a345..325d5454bf168 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1140,7 +1140,7 @@ static void update_and_free_page(struct hstate *h, struct page *page) 1 << PG_active | 1 << PG_private | 1 << PG_writeback); }
VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page, false), page);
Shouldn't we add a VM_BUG_ON_PAGE for reserved as well? Oh, I see it is done in a later patch. I guess it is not here as there are no users of reserved yet?
Yes precisely.
Same observation in other places in hugetlb.c below. Since you add the API changes for reserved here, as well as define page[3].private to store info, I don't think it would hurt to add the initialization and checking and cleanup here as well. Thoughts?
Yes that makes sense as well. I'll take a look.
set_compound_page_dtor(page, NULL_COMPOUND_DTOR); set_page_refcounted(page); if (hstate_is_gigantic(h)) {
@@ -1250,8 +1250,8 @@ void free_huge_page(struct page *page)
spin_lock(&hugetlb_lock); clear_page_huge_active(page);
hugetlb_cgroup_uncharge_page(hstate_index(h),
pages_per_huge_page(h), page);
hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h),
page, false); if (restore_reserve) h->resv_huge_pages++;
@@ -1277,7 +1277,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) INIT_LIST_HEAD(&page->lru); set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); spin_lock(&hugetlb_lock);
set_hugetlb_cgroup(page, NULL);
set_hugetlb_cgroup(page, NULL, false); h->nr_huge_pages++; h->nr_huge_pages_node[nid]++; spin_unlock(&hugetlb_lock);
@@ -2063,7 +2063,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, gbl_chg = 1; }
ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg,
false); if (ret) goto out_subpool_put;
@@ -2087,7 +2088,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, list_move(&page->lru, &h->hugepage_activelist); /* Fall through */ }
hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page,
false); spin_unlock(&hugetlb_lock); set_page_private(page, (unsigned long)spool);
@@ -2111,7 +2113,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, return page;
out_uncharge_cgroup:
hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg,
false);
out_subpool_put: if (map_chg || avoid_reserve) hugepage_subpool_put_pages(spool, 1); diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 1ed4448ca41d3..854117513979b 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -73,8 +73,12 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg) int idx;
for (idx = 0; idx < hugetlb_max_hstate; idx++) {
if (page_counter_read(&h_cg->hugepage[idx]))
if (page_counter_read(
hugetlb_cgroup_get_counter(h_cg, idx, true)) ||
page_counter_read(
hugetlb_cgroup_get_counter(h_cg, idx, false))) { return true;
} } return false;
} @@ -85,18 +89,32 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, int idx;
for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
struct page_counter *counter = &h_cgroup->hugepage[idx]; struct page_counter *parent = NULL;
Should we perhaps rename 'parent' to 'fault_parent' to be consistent?
Yes that makes sense; will do.
That makes me think if perhaps the naming in the previous patch should be more explicit. Make the existing names explicitly contin 'fault' as the new names contain 'reservation'. Just a thought.
You mean change the names of the actual user-facing files? I'm all for better names but that would break existing users that read/write the hugetlb_cgroup.2MB.usage_in_bytes/limit_in_bytes users, and so I would assume is a no-go.
struct page_counter *reserved_parent = NULL; unsigned long limit; int ret;
if (parent_h_cgroup)
parent = &parent_h_cgroup->hugepage[idx];
page_counter_init(counter, parent);
if (parent_h_cgroup) {
parent = hugetlb_cgroup_get_counter(parent_h_cgroup,
idx, false);
reserved_parent = hugetlb_cgroup_get_counter(
parent_h_cgroup, idx, true);
}
page_counter_init(hugetlb_cgroup_get_counter(h_cgroup, idx,
false),
parent);
page_counter_init(hugetlb_cgroup_get_counter(h_cgroup, idx,
true),
reserved_parent); limit = round_down(PAGE_COUNTER_MAX, 1 << huge_page_order(&hstates[idx]));
ret = page_counter_set_max(counter, limit);
ret = page_counter_set_max(
hugetlb_cgroup_get_counter(h_cgroup, idx, false),
limit);
ret = page_counter_set_max(
hugetlb_cgroup_get_counter(h_cgroup, idx, true), limit); VM_BUG_ON(ret); }
} @@ -126,6 +144,26 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css) kfree(h_cgroup); }
+static void hugetlb_cgroup_move_parent_reservation(int idx,
struct hugetlb_cgroup *h_cg)
+{
struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
/* Move the reservation counters. */
if (!parent_hugetlb_cgroup(h_cg)) {
parent = root_h_cgroup;
/* root has no limit */
page_counter_charge(
&root_h_cgroup->reserved_hugepage[idx],
page_counter_read(
hugetlb_cgroup_get_counter(h_cg, idx, true)));
}
/* Take the pages off the local counter */
page_counter_cancel(
hugetlb_cgroup_get_counter(h_cg, idx, true),
page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx, true)));
+}
I know next to nothing about cgroups and am just comparing this to the existing hugetlb_cgroup_move_parent() routine. hugetlb_cgroup_move_parent updates the cgroup pointer in each page being moved. Do we need to do something similar for reservations being moved (move pointer in reservation)?
Oh, good catch. Yes I need to be doing that. I should probably consolidate those routines so the code doesn't miss things like this.
Thanks again for reviewing!
-- Mike Kravetz
/*
- Should be called with hugetlb_lock held.
@@ -142,7 +180,7 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg, struct hugetlb_cgroup *page_hcg; struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
page_hcg = hugetlb_cgroup_from_page(page);
page_hcg = hugetlb_cgroup_from_page(page, false); /* * We can have pages in active list without any cgroup * ie, hugepage with less than 3 pages. We can safely
@@ -161,7 +199,7 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg, /* Take the pages off the local counter */ page_counter_cancel(counter, nr_pages);
set_hugetlb_cgroup(page, parent);
set_hugetlb_cgroup(page, parent, false);
out: return; } @@ -180,6 +218,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css) do { for_each_hstate(h) { spin_lock(&hugetlb_lock);
hugetlb_cgroup_move_parent_reservation(idx, h_cg); list_for_each_entry(page, &h->hugepage_activelist, lru) hugetlb_cgroup_move_parent(idx, h_cg, page);
@@ -191,7 +230,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css) }
int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup **ptr)
struct hugetlb_cgroup **ptr, bool reserved)
{ int ret = 0; struct page_counter *counter; @@ -214,8 +253,11 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, } rcu_read_unlock();
if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter))
if (!page_counter_try_charge(hugetlb_cgroup_get_counter(h_cg, idx,
reserved),
nr_pages, &counter)) { ret = -ENOMEM;
} css_put(&h_cg->css);
done: *ptr = h_cg; @@ -225,12 +267,12 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, /* Should be called with hugetlb_lock held */ void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg,
struct page *page)
struct page *page, bool reserved)
{ if (hugetlb_cgroup_disabled() || !h_cg) return;
set_hugetlb_cgroup(page, h_cg);
set_hugetlb_cgroup(page, h_cg, reserved); return;
}
@@ -238,23 +280,26 @@ void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
- Should be called with hugetlb_lock held
*/ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
struct page *page)
struct page *page, bool reserved)
{ struct hugetlb_cgroup *h_cg;
if (hugetlb_cgroup_disabled()) return; lockdep_assert_held(&hugetlb_lock);
h_cg = hugetlb_cgroup_from_page(page);
h_cg = hugetlb_cgroup_from_page(page, reserved); if (unlikely(!h_cg)) return;
set_hugetlb_cgroup(page, NULL);
page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
set_hugetlb_cgroup(page, NULL, reserved);
page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, reserved),
nr_pages);
return;
}
void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
struct hugetlb_cgroup *h_cg)
struct hugetlb_cgroup *h_cg, bool reserved)
{ if (hugetlb_cgroup_disabled() || !h_cg) return; @@ -262,8 +307,17 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER) return;
page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
return;
page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, reserved),
nr_pages);
+}
+void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
unsigned long nr_pages)
+{
if (hugetlb_cgroup_disabled() || !p)
return;
page_counter_uncharge(p, nr_pages);
}
static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css, @@ -475,6 +529,7 @@ void __init hugetlb_cgroup_file_init(void) void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage) { struct hugetlb_cgroup *h_cg;
struct hugetlb_cgroup *h_cg_reservation; struct hstate *h = page_hstate(oldhpage); if (hugetlb_cgroup_disabled())
@@ -482,11 +537,12 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
VM_BUG_ON_PAGE(!PageHuge(oldhpage), oldhpage); spin_lock(&hugetlb_lock);
h_cg = hugetlb_cgroup_from_page(oldhpage);
set_hugetlb_cgroup(oldhpage, NULL);
h_cg = hugetlb_cgroup_from_page(oldhpage, false);
h_cg_reservation = hugetlb_cgroup_from_page(oldhpage, true);
set_hugetlb_cgroup(oldhpage, NULL, false); /* move the h_cg details to new cgroup */
set_hugetlb_cgroup(newhpage, h_cg);
set_hugetlb_cgroup(newhpage, h_cg_reservation, true); list_move(&newhpage->lru, &h->hugepage_activelist); spin_unlock(&hugetlb_lock); return;
-- 2.24.0.rc1.363.gb1bccd3e3d-goog
On 11/8/19 3:48 PM, Mina Almasry wrote:
On Thu, Nov 7, 2019 at 4:57 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 10/29/19 6:36 PM, Mina Almasry wrote:
@@ -22,27 +22,35 @@ struct hugetlb_cgroup;
- Minimum page order trackable by hugetlb cgroup.
- At least 3 pages are necessary for all the tracking information.
*/ -#define HUGETLB_CGROUP_MIN_ORDER 2 +#define HUGETLB_CGROUP_MIN_ORDER 3
Correct me if misremembering, but I think the reson you changed this was so that you could use page[3].private. Correct? In that case isn't page[3] the last page of an order 2 allocation? If my understanding is correct, then leave HUGETLB_CGROUP_MIN_ORDER as is and update the preceding comment to say that at least 4 pages are necessary.
Yes, I just misunderstood what MIN_ORDER means. I'll revert the code change.
But, do update the comment please.
<snip>
@@ -85,18 +89,32 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, int idx;
for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
struct page_counter *counter = &h_cgroup->hugepage[idx]; struct page_counter *parent = NULL;
Should we perhaps rename 'parent' to 'fault_parent' to be consistent?
Yes that makes sense; will do.
That makes me think if perhaps the naming in the previous patch should be more explicit. Make the existing names explicitly contin 'fault' as the new names contain 'reservation'. Just a thought.
You mean change the names of the actual user-facing files? I'm all for better names but that would break existing users that read/write the hugetlb_cgroup.2MB.usage_in_bytes/limit_in_bytes users, and so I would assume is a no-go.
I was thinking about internal variables/definitions such as:
+enum { + /* Tracks hugetlb memory faulted in. */ + HUGETLB_RES_USAGE, + /* Tracks hugetlb memory reserved. */ + HUGETLB_RES_RESERVATION_USAGE, + /* Limit for hugetlb memory faulted in. */ + HUGETLB_RES_LIMIT, + /* Limit for hugetlb memory reserved. */ + HUGETLB_RES_RESERVATION_LIMIT, + /* Max usage for hugetlb memory faulted in. */ + HUGETLB_RES_MAX_USAGE, + /* Max usage for hugetlb memory reserved. */ + HUGETLB_RES_RESERVATION_MAX_USAGE, + /* Faulted memory accounting fail count. */ + HUGETLB_RES_FAILCNT, + /* Reserved memory accounting fail count. */ + HUGETLB_RES_RESERVATION_FAILCNT, + HUGETLB_RES_NULL, + HUGETLB_RES_MAX, +};
But, I guess the existing definitions (such as HUGETLB_RES_LIMIT) correspond closely to the externally visible name. In that case, you should leave them as is and ignore my comment.
<ship>
@@ -126,6 +144,26 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css) kfree(h_cgroup); }
+static void hugetlb_cgroup_move_parent_reservation(int idx,
struct hugetlb_cgroup *h_cg)
+{
struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
/* Move the reservation counters. */
if (!parent_hugetlb_cgroup(h_cg)) {
parent = root_h_cgroup;
/* root has no limit */
page_counter_charge(
&root_h_cgroup->reserved_hugepage[idx],
page_counter_read(
hugetlb_cgroup_get_counter(h_cg, idx, true)));
}
/* Take the pages off the local counter */
page_counter_cancel(
hugetlb_cgroup_get_counter(h_cg, idx, true),
page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx, true)));
+}
I know next to nothing about cgroups and am just comparing this to the existing hugetlb_cgroup_move_parent() routine. hugetlb_cgroup_move_parent updates the cgroup pointer in each page being moved. Do we need to do something similar for reservations being moved (move pointer in reservation)?
Oh, good catch. Yes I need to be doing that. I should probably consolidate those routines so the code doesn't miss things like this.
This might get a bit ugly/complicated? Seems like you will need to examine all hugetlbfs inodes and vma's mapping those inodes.
On Fri, Nov 8, 2019 at 4:01 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 11/8/19 3:48 PM, Mina Almasry wrote:
On Thu, Nov 7, 2019 at 4:57 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 10/29/19 6:36 PM, Mina Almasry wrote:
@@ -22,27 +22,35 @@ struct hugetlb_cgroup;
- Minimum page order trackable by hugetlb cgroup.
- At least 3 pages are necessary for all the tracking information.
*/ -#define HUGETLB_CGROUP_MIN_ORDER 2 +#define HUGETLB_CGROUP_MIN_ORDER 3
Correct me if misremembering, but I think the reson you changed this was so that you could use page[3].private. Correct? In that case isn't page[3] the last page of an order 2 allocation? If my understanding is correct, then leave HUGETLB_CGROUP_MIN_ORDER as is and update the preceding comment to say that at least 4 pages are necessary.
Yes, I just misunderstood what MIN_ORDER means. I'll revert the code change.
But, do update the comment please.
Will do.
<snip> >>> @@ -85,18 +89,32 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, >>> int idx; >>> >>> for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) { >>> - struct page_counter *counter = &h_cgroup->hugepage[idx]; >>> struct page_counter *parent = NULL; >> >> Should we perhaps rename 'parent' to 'fault_parent' to be consistent? > > Yes that makes sense; will do. > >> That makes me think if perhaps the naming in the previous patch should >> be more explicit. Make the existing names explicitly contin 'fault' as >> the new names contain 'reservation'. >> Just a thought. >> > > You mean change the names of the actual user-facing files? I'm all for > better names but that would break existing users that read/write the > hugetlb_cgroup.2MB.usage_in_bytes/limit_in_bytes users, and so I would > assume is a no-go. >
I was thinking about internal variables/definitions such as:
+enum {
- /* Tracks hugetlb memory faulted in. */
- HUGETLB_RES_USAGE,
- /* Tracks hugetlb memory reserved. */
- HUGETLB_RES_RESERVATION_USAGE,
- /* Limit for hugetlb memory faulted in. */
- HUGETLB_RES_LIMIT,
- /* Limit for hugetlb memory reserved. */
- HUGETLB_RES_RESERVATION_LIMIT,
- /* Max usage for hugetlb memory faulted in. */
- HUGETLB_RES_MAX_USAGE,
- /* Max usage for hugetlb memory reserved. */
- HUGETLB_RES_RESERVATION_MAX_USAGE,
- /* Faulted memory accounting fail count. */
- HUGETLB_RES_FAILCNT,
- /* Reserved memory accounting fail count. */
- HUGETLB_RES_RESERVATION_FAILCNT,
- HUGETLB_RES_NULL,
- HUGETLB_RES_MAX,
+};
But, I guess the existing definitions (such as HUGETLB_RES_LIMIT) correspond closely to the externally visible name. In that case, you should leave them as is and ignore my comment.
<ship> >>> @@ -126,6 +144,26 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css) >>> kfree(h_cgroup); >>> } >>> >>> +static void hugetlb_cgroup_move_parent_reservation(int idx, >>> + struct hugetlb_cgroup *h_cg) >>> +{ >>> + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); >>> + >>> + /* Move the reservation counters. */ >>> + if (!parent_hugetlb_cgroup(h_cg)) { >>> + parent = root_h_cgroup; >>> + /* root has no limit */ >>> + page_counter_charge( >>> + &root_h_cgroup->reserved_hugepage[idx], >>> + page_counter_read( >>> + hugetlb_cgroup_get_counter(h_cg, idx, true))); >>> + } >>> + >>> + /* Take the pages off the local counter */ >>> + page_counter_cancel( >>> + hugetlb_cgroup_get_counter(h_cg, idx, true), >>> + page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx, true))); >>> +} >> >> I know next to nothing about cgroups and am just comparing this to the >> existing hugetlb_cgroup_move_parent() routine. hugetlb_cgroup_move_parent >> updates the cgroup pointer in each page being moved. Do we need to do >> something similar for reservations being moved (move pointer in reservation)? >> > > Oh, good catch. Yes I need to be doing that. I should probably > consolidate those routines so the code doesn't miss things like this.
This might get a bit ugly/complicated? Seems like you will need to examine all hugetlbfs inodes and vma's mapping those inodes.
Hmm yes on closer look it does seem like this is not straightforward. I'll write a test that does this reparenting so I can start running into the issue and poke for solutions. Off the top of my head, I think maybe we can just not reparent the hugetlb reservations - the hugetlb_cgroup stays alive until all its memory is uncharged. That shouldn't be too bad. Today, I think memcg doesn't reparent memory when it gets offlined.
I'll poke at this a bit and come back with suggestions, you may want to hold off reviewing the rest of the patches until then.
-- Mike Kravetz
On 11/8/19 4:40 PM, Mina Almasry wrote:
On Fri, Nov 8, 2019 at 4:01 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 11/8/19 3:48 PM, Mina Almasry wrote:
On Thu, Nov 7, 2019 at 4:57 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 10/29/19 6:36 PM, Mina Almasry wrote:
@@ -22,27 +22,35 @@ struct hugetlb_cgroup;
- Minimum page order trackable by hugetlb cgroup.
- At least 3 pages are necessary for all the tracking information.
*/ -#define HUGETLB_CGROUP_MIN_ORDER 2 +#define HUGETLB_CGROUP_MIN_ORDER 3
Correct me if misremembering, but I think the reson you changed this was so that you could use page[3].private. Correct? In that case isn't page[3] the last page of an order 2 allocation? If my understanding is correct, then leave HUGETLB_CGROUP_MIN_ORDER as is and update the preceding comment to say that at least 4 pages are necessary.
Yes, I just misunderstood what MIN_ORDER means. I'll revert the code change.
But, do update the comment please.
Will do.
<snip> >>> @@ -85,18 +89,32 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, >>> int idx; >>> >>> for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) { >>> - struct page_counter *counter = &h_cgroup->hugepage[idx]; >>> struct page_counter *parent = NULL; >> >> Should we perhaps rename 'parent' to 'fault_parent' to be consistent? > > Yes that makes sense; will do. > >> That makes me think if perhaps the naming in the previous patch should >> be more explicit. Make the existing names explicitly contin 'fault' as >> the new names contain 'reservation'. >> Just a thought. >> > > You mean change the names of the actual user-facing files? I'm all for > better names but that would break existing users that read/write the > hugetlb_cgroup.2MB.usage_in_bytes/limit_in_bytes users, and so I would > assume is a no-go. >
I was thinking about internal variables/definitions such as:
+enum {
- /* Tracks hugetlb memory faulted in. */
- HUGETLB_RES_USAGE,
- /* Tracks hugetlb memory reserved. */
- HUGETLB_RES_RESERVATION_USAGE,
- /* Limit for hugetlb memory faulted in. */
- HUGETLB_RES_LIMIT,
- /* Limit for hugetlb memory reserved. */
- HUGETLB_RES_RESERVATION_LIMIT,
- /* Max usage for hugetlb memory faulted in. */
- HUGETLB_RES_MAX_USAGE,
- /* Max usage for hugetlb memory reserved. */
- HUGETLB_RES_RESERVATION_MAX_USAGE,
- /* Faulted memory accounting fail count. */
- HUGETLB_RES_FAILCNT,
- /* Reserved memory accounting fail count. */
- HUGETLB_RES_RESERVATION_FAILCNT,
- HUGETLB_RES_NULL,
- HUGETLB_RES_MAX,
+};
But, I guess the existing definitions (such as HUGETLB_RES_LIMIT) correspond closely to the externally visible name. In that case, you should leave them as is and ignore my comment.
<ship> >>> @@ -126,6 +144,26 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css) >>> kfree(h_cgroup); >>> } >>> >>> +static void hugetlb_cgroup_move_parent_reservation(int idx, >>> + struct hugetlb_cgroup *h_cg) >>> +{ >>> + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); >>> + >>> + /* Move the reservation counters. */ >>> + if (!parent_hugetlb_cgroup(h_cg)) { >>> + parent = root_h_cgroup; >>> + /* root has no limit */ >>> + page_counter_charge( >>> + &root_h_cgroup->reserved_hugepage[idx], >>> + page_counter_read( >>> + hugetlb_cgroup_get_counter(h_cg, idx, true))); >>> + } >>> + >>> + /* Take the pages off the local counter */ >>> + page_counter_cancel( >>> + hugetlb_cgroup_get_counter(h_cg, idx, true), >>> + page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx, true))); >>> +} >> >> I know next to nothing about cgroups and am just comparing this to the >> existing hugetlb_cgroup_move_parent() routine. hugetlb_cgroup_move_parent >> updates the cgroup pointer in each page being moved. Do we need to do >> something similar for reservations being moved (move pointer in reservation)? >> > > Oh, good catch. Yes I need to be doing that. I should probably > consolidate those routines so the code doesn't miss things like this.
This might get a bit ugly/complicated? Seems like you will need to examine all hugetlbfs inodes and vma's mapping those inodes.
Hmm yes on closer look it does seem like this is not straightforward. I'll write a test that does this reparenting so I can start running into the issue and poke for solutions. Off the top of my head, I think maybe we can just not reparent the hugetlb reservations - the hugetlb_cgroup stays alive until all its memory is uncharged. That shouldn't be too bad. Today, I think memcg doesn't reparent memory when it gets offlined.
I'll poke at this a bit and come back with suggestions, you may want to hold off reviewing the rest of the patches until then.
Ok, if we start considering what the correct cgroup reparenting semantics should be it would be good to get input from others with more cgroup experience.
On 10/30/19 9:36 AM, Mina Almasry wrote:
/* Must be called with resv->lock held. Calling this with count_only == true
- will count the number of pages to be added but will not modify the linked
- list.
- list. If regions_needed != NULL and count_only == true, then regions_needed
- will indicate the number of file_regions needed in the cache to carry out to
- add the regions for this range.
*/ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
Hi Mina,
Would you please share which tree this patch set used? this patch 5/9 can't be applied with Linus's tree and add_reservation_in_range can't be found.
Thanks Wenkuan
- bool count_only)
- long *regions_needed, bool count_only)
{
- long chg = 0;
- long add = 0;
struct list_head *head = &resv->regions;
- long last_accounted_offset = f;
struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
- /* Locate the region we are before or
On 10/30/19 9:36 AM, Mina Almasry wrote:
/* Must be called with resv->lock held. Calling this with count_only == true
- will count the number of pages to be added but will not modify the linked
- list.
- list. If regions_needed != NULL and count_only == true, then regions_needed
- will indicate the number of file_regions needed in the cache to carry out to
- add the regions for this range.
*/ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
Hi Mina,
Would you please share which tree this patch set used? this patch 5/9 can't be applied with Linus's tree and add_reservation_in_range can't be found.
Thanks Wenkuan
Sorry for the late reply. Locally I have this patchset on top of linus/master and a patchset that added add_reservation_in_range.
But, this patchset can be rebased on top of this commit with 'minimal' merge conflicts:
commit c1ca56bab12f3 (tag: v5.4-rc7-mmots-2019-11-15-18-40, github-akpm/master) Author: Linus Torvalds torvalds@linux-foundation.org
pci: test for unexpectedly disabled bridges
It's the latest mmotm I find on https://github.com/hnaz/linux-mm.git. My next patchset will be rebased on top mmotm.
- bool count_only)
- long *regions_needed, bool count_only)
{
- long chg = 0;
- long add = 0;
struct list_head *head = &resv->regions;
- long last_accounted_offset = f;
struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
- /* Locate the region we are before or
On Fri, Nov 8, 2019 at 4:46 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 11/8/19 4:40 PM, Mina Almasry wrote:
On Fri, Nov 8, 2019 at 4:01 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 11/8/19 3:48 PM, Mina Almasry wrote:
On Thu, Nov 7, 2019 at 4:57 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 10/29/19 6:36 PM, Mina Almasry wrote:
@@ -22,27 +22,35 @@ struct hugetlb_cgroup;
- Minimum page order trackable by hugetlb cgroup.
- At least 3 pages are necessary for all the tracking information.
*/ -#define HUGETLB_CGROUP_MIN_ORDER 2 +#define HUGETLB_CGROUP_MIN_ORDER 3
Correct me if misremembering, but I think the reson you changed this was so that you could use page[3].private. Correct? In that case isn't page[3] the last page of an order 2 allocation? If my understanding is correct, then leave HUGETLB_CGROUP_MIN_ORDER as is and update the preceding comment to say that at least 4 pages are necessary.
Yes, I just misunderstood what MIN_ORDER means. I'll revert the code change.
But, do update the comment please.
Will do.
<snip> >>> @@ -85,18 +89,32 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, >>> int idx; >>> >>> for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) { >>> - struct page_counter *counter = &h_cgroup->hugepage[idx]; >>> struct page_counter *parent = NULL; >> >> Should we perhaps rename 'parent' to 'fault_parent' to be consistent? > > Yes that makes sense; will do. > >> That makes me think if perhaps the naming in the previous patch should >> be more explicit. Make the existing names explicitly contin 'fault' as >> the new names contain 'reservation'. >> Just a thought. >> > > You mean change the names of the actual user-facing files? I'm all for > better names but that would break existing users that read/write the > hugetlb_cgroup.2MB.usage_in_bytes/limit_in_bytes users, and so I would > assume is a no-go. >
I was thinking about internal variables/definitions such as:
+enum {
- /* Tracks hugetlb memory faulted in. */
- HUGETLB_RES_USAGE,
- /* Tracks hugetlb memory reserved. */
- HUGETLB_RES_RESERVATION_USAGE,
- /* Limit for hugetlb memory faulted in. */
- HUGETLB_RES_LIMIT,
- /* Limit for hugetlb memory reserved. */
- HUGETLB_RES_RESERVATION_LIMIT,
- /* Max usage for hugetlb memory faulted in. */
- HUGETLB_RES_MAX_USAGE,
- /* Max usage for hugetlb memory reserved. */
- HUGETLB_RES_RESERVATION_MAX_USAGE,
- /* Faulted memory accounting fail count. */
- HUGETLB_RES_FAILCNT,
- /* Reserved memory accounting fail count. */
- HUGETLB_RES_RESERVATION_FAILCNT,
- HUGETLB_RES_NULL,
- HUGETLB_RES_MAX,
+};
But, I guess the existing definitions (such as HUGETLB_RES_LIMIT) correspond closely to the externally visible name. In that case, you should leave them as is and ignore my comment.
<ship> >>> @@ -126,6 +144,26 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css) >>> kfree(h_cgroup); >>> } >>> >>> +static void hugetlb_cgroup_move_parent_reservation(int idx, >>> + struct hugetlb_cgroup *h_cg) >>> +{ >>> + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); >>> + >>> + /* Move the reservation counters. */ >>> + if (!parent_hugetlb_cgroup(h_cg)) { >>> + parent = root_h_cgroup; >>> + /* root has no limit */ >>> + page_counter_charge( >>> + &root_h_cgroup->reserved_hugepage[idx], >>> + page_counter_read( >>> + hugetlb_cgroup_get_counter(h_cg, idx, true))); >>> + } >>> + >>> + /* Take the pages off the local counter */ >>> + page_counter_cancel( >>> + hugetlb_cgroup_get_counter(h_cg, idx, true), >>> + page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx, true))); >>> +} >> >> I know next to nothing about cgroups and am just comparing this to the >> existing hugetlb_cgroup_move_parent() routine. hugetlb_cgroup_move_parent >> updates the cgroup pointer in each page being moved. Do we need to do >> something similar for reservations being moved (move pointer in reservation)? >> > > Oh, good catch. Yes I need to be doing that. I should probably > consolidate those routines so the code doesn't miss things like this.
This might get a bit ugly/complicated? Seems like you will need to examine all hugetlbfs inodes and vma's mapping those inodes.
Hmm yes on closer look it does seem like this is not straightforward. I'll write a test that does this reparenting so I can start running into the issue and poke for solutions. Off the top of my head, I think maybe we can just not reparent the hugetlb reservations - the hugetlb_cgroup stays alive until all its memory is uncharged. That shouldn't be too bad. Today, I think memcg doesn't reparent memory when it gets offlined.
I'll poke at this a bit and come back with suggestions, you may want to hold off reviewing the rest of the patches until then.
Ok, if we start considering what the correct cgroup reparenting semantics should be it would be good to get input from others with more cgroup experience.
So I looked into this and prototyped a couple of solutions:
1. We could repartent the hugetlb reservation using the same approach that today we repartent hugetlb faults. Basically today faulted hugetlb pages live on hstate->hugepage_activelist. When a hugetlb cgroup gets offlined, this list is transversed and any pages on it that point to the cgroup being offlined and reparented. hugetlb_lock is used to make sure cgroup offlining doesn't race with a page being freed. I can add another list, but one that has pointers to the reservations made. When the cgroup is being offlined, it transverses this list, and reparents any reservations (which will need to acquire the proper resv_map->lock to do the parenting). hugetlb_lock needs also to be acquired here to make sure that resv_map release doesn't race with another thread reparenting the memory in that resv map.
Pros: Maintains current parenting behavior, and makes sure that reparenting of reservations works exactly the same way as reparenting of hugetlb faults. Cons: Code is a bit complex. There may be subtle object lifetime bugs, since I'm not 100% sure acquiring hugetlb_lock removes all races.
2. We could just not reparent hugetlb reservations. I.e. on hugetlb cgroup offlining, the hugetlb faults get reparented (which maintains current user facing behavior), but hugetlb reservation charges remain charged to the hugetlb cgroup. The cgroup lives as a zombie until all the reservations are uncharged.
Pros: Much easier implementation. Converges behavior of memcg and hugetlb cgroup, since memcg also doesn't reparent memory charged to it. Cons: Behavior change as hugetlb cgroups will become zombies if there are reservations charged to them. I've discussed offlist with Shakeel, and AFAICT there are absolutely no user facing behavior change to zombie cgroups. Only if the user is specifically detecting for zombies.
I'm torn between these 2 options right now, but leaning towards #2. I think I will propose #2 in a patch for review, and if anyone is broken by that (again, my understanding is that is very unlikely), then I propose a patch that reverts the changes in #2 and implements the changes in #1.
Any feedback from Shakeel or other people with cgroup expertise (especially for hugetlb cgroup or memcg) is very useful here.
-- Mike Kravetz
On 11/25/19 12:26 PM, Mina Almasry wrote:
On Fri, Nov 8, 2019 at 4:46 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 11/8/19 4:40 PM, Mina Almasry wrote:
On Fri, Nov 8, 2019 at 4:01 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 11/8/19 3:48 PM, Mina Almasry wrote:
On Thu, Nov 7, 2019 at 4:57 PM Mike Kravetz mike.kravetz@oracle.com wrote:
On 10/29/19 6:36 PM, Mina Almasry wrote: > > +static void hugetlb_cgroup_move_parent_reservation(int idx, > + struct hugetlb_cgroup *h_cg) > +{ > + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); > + > + /* Move the reservation counters. */ > + if (!parent_hugetlb_cgroup(h_cg)) { > + parent = root_h_cgroup; > + /* root has no limit */ > + page_counter_charge( > + &root_h_cgroup->reserved_hugepage[idx], > + page_counter_read( > + hugetlb_cgroup_get_counter(h_cg, idx, true))); > + } > + > + /* Take the pages off the local counter */ > + page_counter_cancel( > + hugetlb_cgroup_get_counter(h_cg, idx, true), > + page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx, true))); > +}
I know next to nothing about cgroups and am just comparing this to the existing hugetlb_cgroup_move_parent() routine. hugetlb_cgroup_move_parent updates the cgroup pointer in each page being moved. Do we need to do something similar for reservations being moved (move pointer in reservation)?
Oh, good catch. Yes I need to be doing that. I should probably consolidate those routines so the code doesn't miss things like this.
This might get a bit ugly/complicated? Seems like you will need to examine all hugetlbfs inodes and vma's mapping those inodes.
Hmm yes on closer look it does seem like this is not straightforward. I'll write a test that does this reparenting so I can start running into the issue and poke for solutions. Off the top of my head, I think maybe we can just not reparent the hugetlb reservations - the hugetlb_cgroup stays alive until all its memory is uncharged. That shouldn't be too bad. Today, I think memcg doesn't reparent memory when it gets offlined.
I'll poke at this a bit and come back with suggestions, you may want to hold off reviewing the rest of the patches until then.
Ok, if we start considering what the correct cgroup reparenting semantics should be it would be good to get input from others with more cgroup experience.
So I looked into this and prototyped a couple of solutions:
- We could repartent the hugetlb reservation using the same approach
that today we repartent hugetlb faults. Basically today faulted hugetlb pages live on hstate->hugepage_activelist. When a hugetlb cgroup gets offlined, this list is transversed and any pages on it that point to the cgroup being offlined and reparented. hugetlb_lock is used to make sure cgroup offlining doesn't race with a page being freed. I can add another list, but one that has pointers to the reservations made. When the cgroup is being offlined, it transverses this list, and reparents any reservations (which will need to acquire the proper resv_map->lock to do the parenting). hugetlb_lock needs also to be acquired here to make sure that resv_map release doesn't race with another thread reparenting the memory in that resv map.
Pros: Maintains current parenting behavior, and makes sure that reparenting of reservations works exactly the same way as reparenting of hugetlb faults. Cons: Code is a bit complex. There may be subtle object lifetime bugs, since I'm not 100% sure acquiring hugetlb_lock removes all races.
- We could just not reparent hugetlb reservations. I.e. on hugetlb
cgroup offlining, the hugetlb faults get reparented (which maintains current user facing behavior), but hugetlb reservation charges remain charged to the hugetlb cgroup. The cgroup lives as a zombie until all the reservations are uncharged.
Pros: Much easier implementation. Converges behavior of memcg and hugetlb cgroup, since memcg also doesn't reparent memory charged to it. Cons: Behavior change as hugetlb cgroups will become zombies if there are reservations charged to them. I've discussed offlist with Shakeel, and AFAICT there are absolutely no user facing behavior change to zombie cgroups. Only if the user is specifically detecting for zombies.
I'm torn between these 2 options right now, but leaning towards #2. I think I will propose #2 in a patch for review, and if anyone is broken by that (again, my understanding is that is very unlikely), then I propose a patch that reverts the changes in #2 and implements the changes in #1.
I of course like option #2 because it introduces fewer (if any) additional changes to the hugetlb reservation code for non-cgroup users. :)
Any feedback from Shakeel or other people with cgroup expertise (especially for hugetlb cgroup or memcg) is very useful here.
Yes, that would be very helpful.
linux-kselftest-mirror@lists.linaro.org